Trying out speech transcription on an M1 Mac

Last updated at 2024-01-15 / Posted at 2024-01-14

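At the end of last year (December 2023), Apple released mlx, a library for running machine learning on Apple silicon, so I tried it out with Whisper for speech transcription. These are my notes.

Installation

Following the Whisper README in ml-explore/mlx-examples, install the example itself and the libraries used in this article.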

$ git clone https://github.com/ml-explore/mlx-examples.git
$ cd mlx-examples/whisper
$ pip install -r requirements.txt
$ brew install ImageMagick
$ pip install yt_dlp pysrt moviepy ImageMagick

ffmpeg is also required.
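
Before running the later scripts, it may help to confirm that the command-line tools are actually on PATH. The following is a minimal sketch and not part of the original setup: the helper name `missing_tools` is hypothetical, and the executable names checked (`ffmpeg`, plus `magick` or `convert` for ImageMagick) are assumptions about a typical Homebrew install.

```python
import shutil


def missing_tools(names):
    """Return the names from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(["ffmpeg"])
    # Assumption: ImageMagick 7 installs `magick`; older versions install `convert`.
    if not (shutil.which("magick") or shutil.which("convert")):
        missing.append("ImageMagick (magick/convert)")
    print("missing:", missing or "nothing")
```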

Downloading a sample video

download.py

from yt_dlp import YoutubeDL
ydl = YoutubeDL()
result = ydl.download(['https://www.youtube.com/watch?v=xxx'])
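
The later scripts read `sample.webm`, so the download needs a predictable file name. As a sketch, yt-dlp's `outtmpl` option can fix the base name. The helper names below are hypothetical, and the container actually downloaded (webm, mp4, and so on) depends on the formats the site offers, so the extension is left to yt-dlp.

```python
def build_ydl_options(basename="sample"):
    """Build yt-dlp options that name the output `<basename>.<ext>`."""
    return {
        # %(ext)s is filled in by yt-dlp with the downloaded container's extension.
        "outtmpl": f"{basename}.%(ext)s",
    }


def download(url, basename="sample"):
    """Download `url`; returns yt-dlp's exit code (0 on success)."""
    # Imported here so the option builder can be used without yt-dlp installed.
    from yt_dlp import YoutubeDL

    with YoutubeDL(build_ydl_options(basename)) as ydl:
        return ydl.download([url])
```

If the file that arrives is not a `.webm`, adjust `speech_file` in the transcription script to match.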

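Transcribing the video

The default is the tiny model, so look through mlx-community for a model you prefer.
This time I tried large-v3 on an M1 Mac with 8 GB of memory, and it does appear to run (although very slowly).
For now I write the subtitle text out in SRT format and burn it into the video later with separate code.
Also, passing fp16=True through transcribe's decode_options runs the model in half precision (slightly faster).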

speech2text.py

import whisper
import pysrt
 
speech_file = "sample.webm"
result = whisper.transcribe(speech_file, verbose=True, language='ja', 
                            path_or_hf_repo="mlx-community/whisper-large-v3-mlx", fp16=True)

# References: https://note.com/9256/n/nce2ddc5e006a
subs = pysrt.SubRipFile()

# Convert each Whisper segment into one SRT entry (indices start at 1).
for sub_idx, segment in enumerate(result["segments"], start=1):
    sub = pysrt.SubRipItem(index=sub_idx,
                           start=pysrt.SubRipTime(seconds=segment["start"]),
                           end=pysrt.SubRipTime(seconds=segment["end"]),
                           text=segment["text"])
    subs.append(sub)
    
subs.save("sample.srt")
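
For reference, the SRT format that pysrt writes is simple enough to produce by hand: a counter, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the text, then a blank line. The sketch below is an alternative illustration using only the standard library, not what the script above does. `srt_timestamp` and `segments_to_srt` are hypothetical helper names, and the segments are assumed to be dicts with `start`, `end` (in seconds), and `text` keys, as in the transcription result.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render Whisper-style segments as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    # Entries are separated by a blank line.
    return "\n".join(blocks)
```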

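Burning the text into the video

Specifying size on TextClip also seems to take care of line wrapping.
Rendering takes quite a while, so depending on the use case it may be better to write the result out as mkv instead.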

composite.py

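# Beforehand, convert the download to a 720p MP4
# (-2 keeps the width even, which libx264 requires):
# ffmpeg -i sample.webm -vf scale=-2:720 sample.mp4

# Note: moviepy.editor is the moviepy 1.x API; it was removed in moviepy 2.0.
from moviepy.editor import CompositeVideoClip, TextClip, VideoFileClip
from moviepy.video.tools.subtitles import SubtitlesClip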

video = VideoFileClip("sample.mp4")

generator = lambda txt: TextClip(
    txt, font='YuGothic-Medium', fontsize=48, color='white',
    stroke_width=10, method='caption', align='south', size=video.size)

subtitles = SubtitlesClip("sample.srt", generator)
result = CompositeVideoClip([video, subtitles.set_pos(('center','bottom'))])

result.write_videofile("out.mp4", fps=video.fps, temp_audiofile="temp-audio.m4a", remove_temp=True, codec="libx264", audio_codec="aac")
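
On the mkv remark above: one reading (an assumption about what was meant) is to skip re-rendering and attach the SRT file as a selectable subtitle track in a Matroska container, which only copies the existing streams. The following is a sketch that drives ffmpeg through `subprocess`; `build_mux_command` and `mux` are hypothetical helpers, and the file names follow the ones used in this article.

```python
import subprocess


def build_mux_command(video, subtitles, output):
    """Build an ffmpeg command that adds `subtitles` to `output` as a soft subtitle track."""
    return [
        "ffmpeg",
        "-i", video,        # input 0: video and audio
        "-i", subtitles,    # input 1: SRT subtitles
        "-map", "0", "-map", "1",
        "-c", "copy",       # copy the audio/video streams without re-encoding
        "-c:s", "srt",      # store the subtitles as an SRT track
        output,
    ]


def mux(video="sample.mp4", subtitles="sample.srt", output="out.mkv"):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_mux_command(video, subtitles, output), check=True)
```

Unlike the burned-in version, the subtitles can then be toggled in the player, at the cost of relying on the player's own font and styling.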
