fukuoka.ex: Fukuoka Elixir Community

Speech to Text with Bumblebee's Whisper

Last updated at 2023-05-23, posted at 2023-03-28

Introduction

This article explains how to build a Whisper network with Bumblebee on Livebook and use it to transcribe audio.

What is Livebook

Livebook is an interactive notebook application implemented in Elixir.
It aims to automate code and data workflows.

Automate code & data workflows with interactive notebooks

Get rid of scripts, manual steps, and outdated docs. Start using Elixir and Livebook to share knowledge, explore code, visualize data, run machine learning models, and much more!

The latest version, 0.9.0, apparently lets you deploy a notebook as a web application.

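What is Bumblebee

Bumblebee is a library that makes it easy to run pre-trained Hugging Face Transformer models on Axon, Elixir's deep learning framework.

What is Whisper

Whisper is a free speech recognition model that OpenAI released as a transcription service. It was trained with supervision on 680,000 hours of multilingual audio collected from the web, which lets it transcribe input audio with high accuracy.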
by https://aismiley.co.jp/ai_news/what-is-whisper/

Creating the transcription notebook

Let's write a program on Livebook that performs transcription.

Installing the libraries

Add the following to the setup cell.

Mix.install(
  [
    {:kino_bumblebee, "~> 0.3.0"},
    {:exla, "~> 0.5.3"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

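Speech to text with a Smart Cell

Livebook has a feature called Smart Cells, which provide a variety of handy widgets.

Select the Neural Network task Smart Cell, then choose
Speech to text as the task and
Whisper (small multilingual) as the model,
and upload an English audio file.
Click Run, and the text is output after about 10 seconds.

The transcription worked!

What about Japanese?

The model seems to recognize that the audio is not English, but it does not output a transcription.

Transcribing in Japanese

According to this issue, a small trick with the tokenizer makes the model transcribe languages other than English.
The example there is for Spanish, so changing the language token to ja produces Japanese output.

* Updated 2023/05/23
The token settings used to be passed directly to speech_to_text, but that is no longer possible as of 0.3.
They are now set with Bumblebee.configure.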

{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-small"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-small"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-small"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-small"})

generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 100,
    forced_token_ids: [
      {1, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|ja|>")},
      {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
      {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
    ]
  )

serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer, generation_config,
    compile: [batch_size: 1],
    defn_options: [compiler: EXLA]
  )
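
The three forced tokens differ only in the language code, so the list can be built from one parameter. The following is a minimal sketch, not part of Bumblebee: `ForcedTokens.build/2` is a hypothetical helper, and the stub lookup exists only so the snippet runs without downloading a model.

```elixir
defmodule ForcedTokens do
  # Hypothetical helper (not a Bumblebee API): builds the forced_token_ids
  # list for a language code. `to_id` is any function that maps a special
  # token string to its integer id.
  def build(language, to_id) when is_binary(language) and is_function(to_id, 1) do
    ["<|#{language}|>", "<|transcribe|>", "<|notimestamps|>"]
    |> Enum.with_index(1)
    |> Enum.map(fn {token, position} -> {position, to_id.(token)} end)
  end
end

# Stub lookup for illustration only: it returns the token length, not a real id.
stub_to_id = fn token -> String.length(token) end
IO.inspect(ForcedTokens.build("ja", stub_to_id))
```

With the real tokenizer loaded above, the lookup would presumably be `&Bumblebee.Tokenizer.token_to_id(tokenizer, &1)`, and the result would be passed as `forced_token_ids`.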

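Customizing the Smart Cell

Clicking toggle source on a Smart Cell switches between its UI and the actual code behind it.

The code can of course be copied, so paste it into a new cell and modify it to enable Japanese transcription.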

{:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})
- generation_config = Bumblebee.configure(generation_config, max_new_tokens: 100)
+ generation_config =
+   Bumblebee.configure(generation_config,
+     max_new_tokens: 100,
+     forced_token_ids: [
+       {1, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|ja|>")},
+       {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
+       {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
+     ]
+   )


serving =
  Bumblebee.Audio.speech_to_text(model_info, featurizer, tokenizer, generation_config,
    compile: [batch_size: 1],
    defn_options: [compiler: EXLA]
  )

audio_input = Kino.Input.audio("Audio", sampling_rate: featurizer.sampling_rate)
form = Kino.Control.form([audio: audio_input], submit: "Run")
frame = Kino.Frame.new()

Kino.listen(form, fn %{data: %{audio: audio}} ->
  if audio do
    Kino.Frame.render(frame, Kino.Text.new("Running..."))

    audio =
      audio.data
      |> Nx.from_binary(:f32)
      |> Nx.reshape({:auto, audio.num_channels})
      |> Nx.mean(axes: [1])

    %{results: [%{text: generated_text}]} = Nx.Serving.run(serving, audio)
    Kino.Frame.render(frame, Kino.Text.new(generated_text))
  end
end)

Kino.Layout.grid([form, frame], boxed: true, gap: 16)

Japanese transcription now works!
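
The listener above reshapes the samples into rows of one frame each and averages across channels, which mixes the audio down to mono. The same idea can be sketched in plain Elixir without Nx. This assumes the binary holds interleaved native-endian f32 samples; `MonoMix` is a hypothetical module used only for illustration.

```elixir
defmodule MonoMix do
  # Averages interleaved f32 samples frame by frame,
  # where one frame holds `channels` consecutive samples.
  def mixdown(binary, channels) when is_binary(binary) and channels > 0 do
    samples = for <<s::float-32-native <- binary>>, do: s

    samples
    |> Enum.chunk_every(channels, channels, :discard)
    |> Enum.map(fn frame -> Enum.sum(frame) / channels end)
  end
end

# Two stereo frames: (0.5, 1.5) and (-1.0, 1.0).
stereo =
  <<0.5::float-32-native, 1.5::float-32-native, -1.0::float-32-native, 1.0::float-32-native>>

IO.inspect(MonoMix.mixdown(stereo, 2))
```

For real audio the Nx version is the better choice, since it avoids building an intermediate list of samples.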

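Transcribing the entire file sequentially

Whisper only transcribes a single passage from the first 30 seconds of its input, so we cut the file into short pieces with ffmpeg and transcribe each piece. This lets us transcribe the whole file.

We will follow this as a reference.

Building the model

Create a serving with small, ja, and transcribe.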
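
The splitting comes down to a list of start offsets, one per chunk. The following is a minimal sketch using a stepped range, which needs Elixir 1.12 or later. `Chunks.offsets/2` is a hypothetical helper, and the 10-second chunk length is only an example value.

```elixir
defmodule Chunks do
  # Start offsets, in seconds, of consecutive chunks covering `duration` seconds.
  def offsets(duration, chunk_len)
      when is_integer(duration) and duration >= 0 and is_integer(chunk_len) and chunk_len > 0 do
    Enum.to_list(0..duration//chunk_len)
  end
end

IO.inspect(Chunks.offsets(35, 10))
```

Note that when the duration is an exact multiple of the chunk length, the last offset equals the duration, so that final chunk contains no audio.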

{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-small"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-small"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-small"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-small"})

generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 100,
    forced_token_ids: [
      {1, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|ja|>")},
      {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
      {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
    ]
  )

serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer, generation_config,
    compile: [batch_size: 1],
    defn_options: [compiler: EXLA]
  )

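Getting the audio file's duration

First, get the length of the file.
It is displayed when you upload the file to Kino.Input.audio, so you can simply read it there,
but it can also be computed programmatically as shown below.
The video uses a module from LiveBeats for this, but that module is not packaged as a library and its code is too long to copy.

Create an upload form for the audio file with Kino.Input.audio.
It can go in a cell of its own or be added to the model cell above.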

audio_input = Kino.Input.audio("audio", sampling_rate: featurizer.sampling_rate)

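value = Kino.Input.read(audio_input)

# Total number of f32 samples across all channels, divided by
# samples per second, gives the duration in whole seconds.
sec =
  value.data
  |> Nx.from_binary(:f32)
  |> Nx.size()
  |> div(featurizer.sampling_rate * value.num_channels)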
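
The same arithmetic can be checked without Nx, because each f32 sample occupies 4 bytes. The following is a minimal sketch that assumes raw interleaved f32 PCM data; `AudioDuration` is a hypothetical module used only for illustration.

```elixir
defmodule AudioDuration do
  # One f32 sample occupies 4 bytes.
  @bytes_per_sample 4

  # Whole seconds of audio contained in a raw f32 PCM binary.
  def seconds(binary, sampling_rate, channels)
      when is_binary(binary) and sampling_rate > 0 and channels > 0 do
    binary
    |> byte_size()
    |> div(@bytes_per_sample * sampling_rate * channels)
  end
end

# Two seconds of silent stereo audio at 16 kHz: 2 s * 16_000 samples/s * 2 channels.
silence = :binary.copy(<<0.0::float-32-native>>, 2 * 16_000 * 2)
IO.inspect(AudioDuration.seconds(silence, 16_000, 2))
```

Integer division drops any partial second at the end of the file. The stepped range used for chunking still produces a final offset that covers that remainder.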

Getting the file path

Get the file path so that ffmpeg can process the file.

file_input = Kino.Input.file("audio")

value = Kino.Input.read(file_input)
path = Kino.Input.file_path(value.file_ref)
# If the file name is not displayed, paste the absolute path instead
# path = "/Users/shou/phoenix/livebook/piyopiyo_nako.mp3"

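Depending on the file, the upload may not respond. In that case, specify the absolute path instead.

A file uploaded through Kino.Input.audio appears to be deleted after the upload,
so we upload it again with Kino.Input.file.

Splitting with ffmpeg and transcribing

Task.async_stream runs up to four transcriptions concurrently.
Each task runs ffmpeg through System.cmd
and passes the returned binary data to Bumblebee's Whisper with Nx.Serving.run.
As a bonus, IO.inspect inside tap prints each chunk as it is transcribed.
Finally, the results are sorted and Enum.map extracts only the text.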
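
Because the next step shells out to ffmpeg, it can help to confirm beforehand that the executable and the file are both available. The following is a minimal sketch under that assumption; `Preflight.check/2` is a hypothetical helper, not part of Kino or Bumblebee.

```elixir
defmodule Preflight do
  # Returns :ok, or {:error, reason} describing the first failed check.
  def check(path, executable \\ "ffmpeg") do
    cond do
      System.find_executable(executable) == nil ->
        {:error, "#{executable} not found on PATH"}

      not File.regular?(path) ->
        {:error, "no such file: #{path}"}

      true ->
        :ok
    end
  end
end

# The path here is a placeholder; in the notebook you would pass `path`.
IO.inspect(Preflight.check("/path/to/audio.mp3"))
```

Running the check before the stream is started surfaces a missing ffmpeg as a readable message instead of a failed task.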

0..sec//10
|> Task.async_stream(
  fn ss ->
    args = ~w(-ac 1 -ar 16k -f f32le -ss #{ss} -t 10 -v quiet -)
    {data, 0} = System.cmd("ffmpeg", ["-i", path] ++ args)

    {ss, Nx.Serving.run(serving, Nx.from_binary(data, :f32))}
    |> tap(fn {ss, res} ->
      res
      |> Map.get(:results)
      |> List.first()
      |> Map.get(:text)
      |> then(&IO.inspect({ss, &1}))
    end)
  end,
  max_concurrency: 4,
  timeout: :infinity
)
|> Enum.map(fn {:ok, {ss, %{results: [%{text: text}]}}} -> {ss, text} end)
|> Enum.sort_by(&elem(&1, 0))
|> Enum.map(&elem(&1,1))
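
The concurrency pattern can be tried without ffmpeg or a model. In the following minimal sketch, `fake_transcribe` is a hypothetical stand-in for the ffmpeg and Nx.Serving.run step, written so that later offsets finish first.

```elixir
# Stand-in for ffmpeg + Nx.Serving.run: later offsets finish sooner here.
fake_transcribe = fn ss ->
  Process.sleep(40 - ss)
  {ss, "text at #{ss}s"}
end

results =
  0..30//10
  |> Task.async_stream(fake_transcribe, max_concurrency: 4, timeout: :infinity)
  |> Enum.map(fn {:ok, {ss, text}} -> {ss, text} end)

IO.inspect(results)
```

Task.async_stream defaults to `ordered: true`, so results come back in input order even when later tasks finish earlier. The sort in the code above is therefore likely redundant, though harmless.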

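Demo

Now let's actually run it.

Running on the CPU with the small model, 9 minutes of audio was transcribed in about 3 minutes.

Conclusion

We created a program that transcribes Japanese audio with Bumblebee's Whisper.

Because this used the small multilingual model, the accuracy is not that high, but with tuning, the large model, and a GPU it looks like it could work quite well.

This time we used Kino.Input to obtain the audio file data.

This project, on the other hand, uses Membrane, a library specialized for multimedia.
It is a large library that would take time to catch up on, so I did not use it this time, but I would like to try it when I get the chance.

That is all for this article. Thank you for reading.

References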

https://livebook.dev/
https://phoenixframework.org/blog/whisper-speech-to-text-phoenix
https://qiita.com/RyoWakabayashi/items/9cd0032d1bd4ddad5d6c
https://github.com/elixir-nx/bumblebee
https://openai.com/research/whisper
https://aismiley.co.jp/ai_news/what-is-whisper/
https://membrane.stream/
https://github.com/lawik/membrane_transcription
