Whisper2024-03-25 - NISHIO Hirokazu's Scrapbox (Auto-translated from Japanese)

Whisper2024-03-25

prev Reflection on the use of Whisper

Whisper

I passed 3h44 recordings to Whisper API.

Speech to text - OpenAI API

python whisper.py 0.37s user 0.18s system 0% cpu 10:00.79 total

The second half was just noise, so it was actually two hours.

https://gyazo.com/93f3a6c6c1ab03e26ee499dee3a285a7

Try it with a one-hour study session audio

python whisper.py 0.34s user 0.14s system 0% cpu 3:27.50 total

https://gyazo.com/0237823730c2d31c4c1eda0f821e2430

Unlike the previous example, this one talks for almost the entire hour.

One hour of audio, with a processing time of 3.5 minutes and a cost of 40 cents.

I had Claude summarize the resulting transcriptions.

Transcription and AI summary of the audio from the study group

Nothing hard about the code.

code:python

audio_file = open(audio_path, "rb")

transcription = client.audio.transcriptions.create(

model="whisper-1", file=audio_file

)

print(transcription.text)

with open(f"whisper_out/{indir}_{audio}.txt", "w") as f:

f.write(transcription.text)

The most time-consuming part was installing ffmpeg and starting to put in gcc and stuff.

$ brew install ffmpeg

Note that this is for audio file splitting, so it is not necessary if the audio file is less than 25MB.

The audio of the above hour-long talking session is 15MB, so if you're using a recording app that chops it up in an hour, you don't need it.

---

This page is auto-translated from /nishio/Whisper2024-03-25 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thought to non-Japanese readers.