Python Advent Calendar 2024

KDDI Corporation

Trying mlx-whisper on an M4 Mac mini (including environment setup with pyenv and venv) [Python-4]

Last updated at 2025-01-10, posted at 2024-12-29

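I wrote this article on 12/30, and afterwards also registered it as the Day 6 entry of "Python Advent Calendar 2024".

Incidentally, this is the fourth article I registered in this year's Python Advent Calendar; the first through third were articles I had written and registered earlier.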

Introduction

In this article, I try out mlx-whisper on an M4 Mac mini.

The specs of the Mac mini I own are as follows.

Trying it out

Preparing the environment with pyenv and related tools

For this trial, I also set up an environment using pyenv and related tools.

pyenv

I had already been using pyenv on my older Intel MacBook Pro, so I simply went with pyenv here as well (note: I did not pick it for any particular reason).

●pyenv/pyenv: Simple Python version management
　https://github.com/pyenv/pyenv

Homebrew was already set up on my M4 Mac mini, so I added pyenv with the following command.

brew install pyenv

Next, install the 3.12 series with the following commands and check the result afterwards.

pyenv install 3.12
pyenv versions

After confirming that the 3.12 series was installed without problems, specify the Python version to use with the following command.

pyenv global 3.12

Instead of using global, it is also fine to set the version so that it applies only inside the working folder for this task. In that case, replace the pyenv global 3.12 above with the following.

pyenv local 3.12
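To confirm that the version selected through pyenv is the one actually being run, a small check from Python itself can help. The following is a minimal sketch; the helper name matches_series is made up for this illustration and is not part of pyenv.

```python
import sys


def matches_series(version_info, series):
    """Return True if version_info (e.g. sys.version_info) belongs to a series such as '3.12'."""
    major, minor = (int(part) for part in series.split("."))
    return (version_info[0], version_info[1]) == (major, minor)


# Report which interpreter version is active in the current shell.
active = sys.version.split()[0]
status = "3.12 series" if matches_series(sys.version_info, "3.12") else "not the 3.12 series"
print(f"Python {active}: {status}")
```

If this reports a different series, the shell is probably not picking up the pyenv setting (for example, because pyenv's shell initialization has not been added yet).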

venv

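Next, prepare a virtual environment with venv.

The article I wrote earlier used Windows, but this time I am on a Mac. As a result, compared with the steps in the article mentioned at the beginning, the part that activates the virtual environment differs slightly.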

python -m venv myenv
source myenv/bin/activate

Note that the second line above can also be written as follows.

. myenv/bin/activate
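Whether the activation took effect can also be checked from inside Python. The sketch below relies on the standard behavior that sys.prefix differs from sys.base_prefix inside a venv; the function name in_virtualenv is a hypothetical helper for this illustration.

```python
import sys


def in_virtualenv(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Return True when the interpreter prefix differs from the base installation prefix."""
    return prefix != base_prefix


# Inside an activated venv, sys.executable should point into myenv/bin.
print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```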

Preparing ffmpeg

As noted on the official mlx-whisper page, ffmpeg also needs to be available. I use Homebrew to install it.

brew install ffmpeg
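Before moving on, it may be worth confirming that ffmpeg is visible on PATH from the environment where Python runs. This is a minimal sketch using only the standard library; find_tool is a name chosen for this example.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


# Look up ffmpeg the same way a subprocess call would resolve it.
path = find_tool("ffmpeg")
print("ffmpeg:", path if path else "not found on PATH")
```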

Trying mlx-whisper in the virtual environment

Now I install the following mlx-whisper package in the virtual environment and try it out.

●mlx-whisper · PyPI
https://pypi.org/project/mlx-whisper/

pip install mlx-whisper

At install time the package is specified as mlx-whisper (with a hyphen), but when run as a command it appears to be mlx_whisper (with an underscore).
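The hyphen versus underscore difference also shows up on the Python side: the distribution name given to pip and the importable module name are looked up separately. The following sketch checks both using the standard library; describe_install is a hypothetical helper, and it only reports what is installed in the current environment.

```python
from importlib import metadata, util


def describe_install(dist_name, module_name):
    """Return (installed version or None, whether the module can be imported)."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    importable = util.find_spec(module_name) is not None
    return version, importable


# Distribution name uses a hyphen, module name uses an underscore.
print(describe_install("mlx-whisper", "mlx_whisper"))
```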

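Processing with the command

I try running the command, using the following article as a reference.

The command takes the following form.

mlx_whisper --model mlx-community/whisper-large-v3-turbo [audio file]

The audio file is the same one used in the Whisper article among those mentioned at the beginning. The command actually run is as follows.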

mlx_whisper --model mlx-community/whisper-large-v3-turbo jfk.flac

Because I ran this without having fetched the model beforehand, the model was downloaded when the command ran.

The output I got was as follows.

Detected language: English
[00:00.000 --> 00:10.380]  And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
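If the printed lines need to be reused in a script, the timestamp prefix can be split off with a regular expression. The sketch below assumes the line layout shown in the output above; the names LINE_RE, to_seconds, and parse_line are made up for this example and are not part of mlx-whisper.

```python
import re

# Matches lines such as "[00:00.000 --> 00:10.380]  some text".
LINE_RE = re.compile(r"^\[([\d:.]+) --> ([\d:.]+)\]\s*(.*)$")


def to_seconds(stamp):
    """Convert 'MM:SS.mmm' or 'HH:MM:SS.mmm' into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_line(line):
    """Return (start, end, text) for a timestamped line, or None for any other line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    start, end, text = match.groups()
    return to_seconds(start), to_seconds(end), text


sample = "[00:00.000 --> 00:10.380]  And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country."
print(parse_line(sample))
```

Lines without a timestamp, such as "Detected language: English", return None and can be skipped.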

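Processing with Python

Next, I also try the processing from Python.

I decided on the Python code and the model to specify while referring to the following article.
(The models that can be specified can apparently be checked on the "Whisper - a mlx-community Collection" page.)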

●Pythonで音声認識モデルWhisperを使って文字起こし | gihyo.jp
　https://gihyo.jp/article/2024/12/monthly-python-2412

The Python code is as follows.

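import mlx_whisper

# Audio file to transcribe (the same jfk.flac used with the command earlier)
audio_data = "jfk.flac"

# The model is downloaded from the given Hugging Face repo on first use
result = mlx_whisper.transcribe(
    audio_data, path_or_hf_repo="mlx-community/whisper-large-v3-mlx"
)
# The result is a dict; the "text" key holds the full transcription
print(result["text"])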

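Note that the model used with the command was mlx-community/whisper-large-v3-turbo, whereas the Python code uses mlx-community/whisper-large-v3-mlx. Because this is a different model from before, the model was downloaded again when the code ran.

Running the code above produced the following result.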

And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

Incidentally, print(result) outputs the following.

{'text': ' And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 10.38, 'text': ' And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.', 'tokens': [50365, 400, 370, 11, 452, 7177, 6280, 11, 1029, 406, 437, 428, 1941, 393, 360, 337, 291, 11, 1029, 437, 291, 393, 360, 337, 428, 1941, 13, 50884], 'temperature': 0.0, 'avg_logprob': -0.11908375805821912, 'compression_ratio': 1.35, 'no_speech_prob': 0.2100057750940323}], 'language': 'en'}
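The dict above carries per-segment start and end times, so it can be turned into subtitle-style text by hand. The following is a rough sketch that formats the segments as SRT entries, using a trimmed copy of the output above as sample data; format_srt_time and segments_to_srt are hypothetical helper names. The command-line tool also lists srt as an output format, so this is only meant to illustrate the structure of the result.

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp, e.g. 10.38 -> '00:00:10,380'."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(result):
    """Build SRT text from the 'segments' list of a transcription result dict."""
    blocks = []
    for index, segment in enumerate(result["segments"], start=1):
        time_line = f"{format_srt_time(segment['start'])} --> {format_srt_time(segment['end'])}"
        blocks.append(f"{index}\n{time_line}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Trimmed copy of the result shown above (only the keys used here).
sample_result = {
    "text": " And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.",
    "segments": [
        {
            "id": 0,
            "start": 0.0,
            "end": 10.38,
            "text": " And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.",
        }
    ],
    "language": "en",
}

print(segments_to_srt(sample_result))
```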

Printing the help

I was curious about the details of how to use mlx_whisper, so I displayed the help with mlx_whisper -h.

usage: mlx_whisper [-h] [--model MODEL] [--output-dir OUTPUT_DIR]
                   [--output-format {txt,vtt,srt,tsv,json,all}] [--verbose VERBOSE]
                   [--task {transcribe,translate}]
                   [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
                   [--temperature TEMPERATURE] [--best-of BEST_OF] [--patience PATIENCE]
                   [--length-penalty LENGTH_PENALTY] [--suppress-tokens SUPPRESS_TOKENS]
                   [--initial-prompt INITIAL_PROMPT]
                   [--condition-on-previous-text CONDITION_ON_PREVIOUS_TEXT] [--fp16 FP16]
                   [--compression-ratio-threshold COMPRESSION_RATIO_THRESHOLD]
                   [--logprob-threshold LOGPROB_THRESHOLD] [--no-speech-threshold NO_SPEECH_THRESHOLD]
                   [--word-timestamps WORD_TIMESTAMPS] [--prepend-punctuations PREPEND_PUNCTUATIONS]
                   [--append-punctuations APPEND_PUNCTUATIONS] [--highlight-words HIGHLIGHT_WORDS]
                   [--max-line-width MAX_LINE_WIDTH] [--max-line-count MAX_LINE_COUNT]
                   [--max-words-per-line MAX_WORDS_PER_LINE]
                   [--hallucination-silence-threshold HALLUCINATION_SILENCE_THRESHOLD]
                   [--clip-timestamps CLIP_TIMESTAMPS]
                   audio [audio ...]

positional arguments:
  audio                 Audio file(s) to transcribe

options:
  -h, --help            show this help message and exit
  --model MODEL         The model directory or hugging face repo (default: mlx-community/whisper-tiny)
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Directory to save the outputs (default: .)
  --output-format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
                        Format of the output file (default: txt)
  --verbose VERBOSE     Whether to print out progress and debug messages (default: True)
  --task {transcribe,translate}
                        Perform speech recognition ('transcribe') or speech translation ('translate')
                        (default: transcribe)
  --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
                        Language spoken in the audio, specify None to auto-detect (default: None)
  --temperature TEMPERATURE
                        Temperature for sampling (default: 0)
  --best-of BEST_OF     Number of candidates when sampling with non-zero temperature (default: 5)
  --patience PATIENCE   Optional patience value to use in beam decoding, as in
                        https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional
                        beam search (default: None)
  --length-penalty LENGTH_PENALTY
                        Optional token length penalty coefficient (alpha) as in
                        https://arxiv.org/abs/1609.08144, uses simple length normalization by default.
                        (default: None)
  --suppress-tokens SUPPRESS_TOKENS
                        Comma-separated list of token ids to suppress during sampling; '-1' will suppress
                        most special characters except common punctuations (default: -1)
  --initial-prompt INITIAL_PROMPT
                        Optional text to provide as a prompt for the first window. (default: None)
  --condition-on-previous-text CONDITION_ON_PREVIOUS_TEXT
                        If True, provide the previous output of the model as a prompt for the next window;
                        disabling may make the text inconsistent across windows, but the model becomes
                        less prone to getting stuck in a failure loop (default: True)
  --fp16 FP16           Whether to perform inference in fp16 (default: True)
  --compression-ratio-threshold COMPRESSION_RATIO_THRESHOLD
                        if the gzip compression ratio is higher than this value, treat the decoding as
                        failed (default: 2.4)
  --logprob-threshold LOGPROB_THRESHOLD
                        If the average log probability is lower than this value, treat the decoding as
                        failed (default: -1.0)
  --no-speech-threshold NO_SPEECH_THRESHOLD
                        If the probability of the token is higher than this value the decoding has failed
                        due to `logprob_threshold`, consider the segment as silence (default: 0.6)
  --word-timestamps WORD_TIMESTAMPS
                        Extract word-level timestamps and refine the results based on them (default:
                        False)
  --prepend-punctuations PREPEND_PUNCTUATIONS
                        If word-timestamps is True, merge these punctuation symbols with the next word
                        (default: "'“¿([{-)
  --append-punctuations APPEND_PUNCTUATIONS
                        If word_timestamps is True, merge these punctuation symbols with the previous word
                        (default: "'.。,，!！?？:：”)]}、)
  --highlight-words HIGHLIGHT_WORDS
                        (requires --word_timestamps True) underline each word as it is spoken in srt and
                        vtt (default: False)
  --max-line-width MAX_LINE_WIDTH
                        (requires --word_timestamps True) the maximum number of characters in a line
                        before breaking the line (default: None)
  --max-line-count MAX_LINE_COUNT
                        (requires --word_timestamps True) the maximum number of lines in a segment
                        (default: None)
  --max-words-per-line MAX_WORDS_PER_LINE
                        (requires --word_timestamps True, no effect with --max_line_width) the maximum
                        number of words in a segment (default: None)
  --hallucination-silence-threshold HALLUCINATION_SILENCE_THRESHOLD
                        (requires --word_timestamps True) skip silent periods longer than this threshold
                        (in seconds) when a possible hallucination is detected (default: None)
  --clip-timestamps CLIP_TIMESTAMPS
                        Comma-separated list start,end,start,end,... timestamps (in seconds) of clips to
                        process, where the last end timestamp defaults to the end of the file (default: 0)
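Using only options that appear in the help text above (--model, --output-format, --output-dir, --language), a command line can be assembled from Python before handing it to subprocess. This is a sketch under that assumption; build_command is a made-up helper, and the snippet only builds and prints the argument list without running anything.

```python
def build_command(audio, model, output_format="txt", output_dir=".", language=None):
    """Assemble an mlx_whisper argument list from options listed in the help output."""
    command = [
        "mlx_whisper",
        "--model", model,
        "--output-format", output_format,
        "--output-dir", output_dir,
    ]
    if language is not None:
        # Omitting --language leaves the default behavior (auto-detect).
        command += ["--language", language]
    command.append(audio)
    return command


cmd = build_command(
    "jfk.flac",
    model="mlx-community/whisper-large-v3-turbo",
    output_format="srt",
    language="en",
)
print(" ".join(cmd))
```

To actually run it, the list could be passed to subprocess.run(cmd, check=True) from within the activated virtual environment.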

Other

Personal notes and things to try next

Below are links that serve as personal notes or relate to things I would like to try in the future.

●Real-Time Speech-to-Text on MacOS with MLX Whisper With Copy-To-Pasteboard Capabilities | John Maeda’s Blog
　https://maeda.pm/2024/11/10/real-time-speech-to-text-on-macos-with-mlx-whisper-with-copy-to-pasteboard-capabilities/

●mlxのwhisperでリアルタイム文字起こしを試してみる - Qiita
　https://qiita.com/mbotsu/items/c661bfa738fa2c05d9c5

●mlxがつけられた記事一覧 - Qiita
　https://qiita.com/tags/mlx

●"mlx" "whisper" - Google 検索
　https://www.google.com/search?q=%22mlx%22+%22whisper%22
