Generating an SRT subtitle file from a video with Whisper

Whisper

#script

Uses uv

uv init

uv add openai-whisper

I wrote the code through back-and-forth with ChatGPT, then hit the free-tier limit, so I improved it further with Claude and tweaked it by hand

code:main.py

import os
import subprocess
import sys

import whisper


def main():
    # Check the command-line arguments
    if len(sys.argv) != 2:
        print("Usage: uv run main.py <input_video_file>")
        print("Example: uv run main.py input.mp4")
        sys.exit(1)

    # Get the input file path
    video_file = sys.argv[1]

    # Check that the input file exists
    if not os.path.exists(video_file):
        print(f"Error: input file '{video_file}' not found.")
        sys.exit(1)

    # Build the output file name from the input file name
    output_file = os.path.splitext(video_file)[0] + ".srt"

    # Where to store the model
    custom_cache_dir = os.path.join(os.path.dirname(__file__), "models")

    try:
        # Load the model
        model = whisper.load_model("turbo", download_root=custom_cache_dir)

        # Function that generates the subtitles
        def transcribe_video(video_path, output_srt):
            # Extract the audio from the video (16 kHz, mono)
            audio_path = "temp_audio.wav"
            try:
                # Passing a list avoids shell quoting problems; check=True raises if ffmpeg fails
                subprocess.run(
                    ["ffmpeg", "-i", video_path, "-ar", "16000", "-ac", "1", "-y", audio_path],
                    check=True,
                )

                # Transcribe the audio with Whisper
                result = model.transcribe(audio_path, language="ja")

                # Save the subtitle file (SRT format)
                with open(output_srt, "w", encoding="utf-8") as srt_file:
                    for i, segment in enumerate(result["segments"]):
                        start = segment["start"]
                        end = segment["end"]
                        text = segment["text"]
                        srt_file.write(f"{i + 1}\n")
                        srt_file.write(f"{format_time(start)} --> {format_time(end)}\n")
                        srt_file.write(f"{text}\n\n")
            finally:
                # Delete the temporary file, even when something above failed
                if os.path.exists(audio_path):
                    os.remove(audio_path)

        # Function that converts seconds to the SRT time format (HH:MM:SS,mmm)
        def format_time(seconds):
            hours = int(seconds // 3600)
            minutes = int((seconds % 3600) // 60)
            seconds = seconds % 60
            milliseconds = int((seconds % 1) * 1000)
            return f"{hours:02}:{minutes:02}:{int(seconds):02},{milliseconds:03}"

        # Run the subtitle generation
        print(f"Starting: {video_file}")
        transcribe_video(video_file, output_file)
        print(f"Subtitle file generated: {output_file}")

    except Exception as e:
        print(f"An error occurred: {str(e)}")
        sys.exit(1)


if __name__ == "__main__":
    main()

custom_cache_dir = os.path.join(os.path.dirname(__file__), "models")

model = whisper.load_model("turbo", download_root=custom_cache_dir)

This part specifies where the model is stored: it creates a models folder in the same location as main.py and saves the model there

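If download_root isn't specified, the model goes to Whisper's default cache: ~/.cache/whisper, or $XDG_CACHE_HOME/whisper when that variable is set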
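To check where the model would land without download_root, here is a small sketch that mirrors the default-path logic in openai-whisper's load_model as I understand it. The function name default_whisper_cache_dir is made up for this note, and the logic is worth re-checking against the installed version.

```python
import os


def default_whisper_cache_dir() -> str:
    """Directory used for models when download_root is not given (assumed logic)."""
    # Falls back to ~/.cache unless XDG_CACHE_HOME is set
    default_cache = os.path.join(os.path.expanduser("~"), ".cache")
    return os.path.join(os.getenv("XDG_CACHE_HOME", default_cache), "whisper")


print(default_whisper_cache_dir())
```

Running it prints the folder to look in for the downloaded model files.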

uv run main.py input.mp4

or similar produces the srt
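For reference, each SRT entry is an index line, a time range line in HH:MM:SS,mmm format, the text, and a blank line. Below is a standalone sketch of that formatting. It is a variant, not what main.py does: it rounds to whole milliseconds first instead of truncating, and srt_entry is a helper name invented here.

```python
def format_time(seconds: float) -> str:
    """Convert seconds to an SRT timestamp (HH:MM:SS,mmm), rounding to milliseconds."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def srt_entry(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT entry: index, time range, text, blank line."""
    return f"{index}\n{format_time(start)} --> {format_time(end)}\n{text}\n\n"


print(srt_entry(1, 0.0, 2.5, "hello"), end="")
```

Rounding first means a value like 59.9996 seconds carries over into the next minute instead of being cut down to 59,999.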

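The subtitles seem to come out covering the whole timeline, so a line's start position is sometimes quite a bit earlier than when the person actually starts speaking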
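One thing that might help with the early start times is Whisper's word-level timestamps (the word_timestamps=True option of transcribe). I haven't verified that it fixes this case, so treat the following as a sketch. It assumes each segment then carries a "words" list with "start" and "end" keys, and tighten_segment is a helper name made up here.

```python
def tighten_segment(segment: dict) -> tuple[float, float]:
    """Return (start, end) taken from the first and last word when word timings exist."""
    words = segment.get("words") or []
    if not words:
        # No word-level data: fall back to the segment's own times
        return segment["start"], segment["end"]
    return words[0]["start"], words[-1]["end"]


# Assumed usage inside main.py:
#   result = model.transcribe(audio_path, language="ja", word_timestamps=True)
#   start, end = tighten_segment(segment)

# Hand-made sample data shaped like a segment, for illustration only
sample = {
    "start": 0.0,
    "end": 5.0,
    "text": "hello world",
    "words": [
        {"word": "hello", "start": 1.2, "end": 1.6},
        {"word": " world", "start": 1.7, "end": 2.3},
    ],
}
print(tighten_segment(sample))
```

With the sample above, the range shrinks from 0.0–5.0 to the span actually covered by the words.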

At run time

UserWarning: FP16 is not supported on CPU; using FP32 instead

warnings.warn("FP16 is not supported on CPU; using FP32 instead")

In my environment this warning shows up, but processing keeps going; if you just wait, it finishes and the file is generated

Does this mean it's using the CPU and not the GPU? I'll look into how to speed things up with the GPU another time
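As far as I can tell, the warning appears when the model runs on the CPU while fp16 is left at its default. A minimal sketch for choosing the device follows, assuming a CUDA-enabled PyTorch build is installed for the GPU case. pick_device is a name made up here; load_model's device argument and transcribe's fp16 option are existing Whisper parameters.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query; assume CPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
use_fp16 = device == "cuda"  # FP16 only makes sense on the GPU
print(f"device={device}, fp16={use_fp16}")

# Assumed usage inside main.py:
#   model = whisper.load_model("turbo", device=device, download_root=custom_cache_dir)
#   result = model.transcribe(audio_path, language="ja", fp16=use_fp16)
```

If this prints device=cpu on a machine that has a GPU, the installed PyTorch is probably a CPU-only build. Passing fp16=False on the CPU should also make the warning go away.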

https://scrapbox.io/files/676c3ad8e9048fc6800fb778.webp

Example output

#Python

#self-made