Elixir Advent Calendar 2023

@RyoWakabayashi (若林諒) in 株式会社オーイーシー

Running Whisper Speech Recognition at High Speed in Livebook

Last updated at 2023-11-13, posted at 2023-11-07

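Introduction

This is a series in which I try out Livebook Launch Week 2 for myself.

Day1: Remote connection from a Smart Cell
Day2: New features for Whisper speech recognition <- this article
Day3: Drag and drop a file to automatically generate the code for handling it
Day4:
- Connecting to the Snowflake data cloud
- Connecting to Microsoft SQL Server
Day5: Vim and Emacs key bindings

Day 2 covers the new features of Whisper speech recognition in Livebook.

Machine learning models such as Whisper are very slow when run on a CPU, so you will want a GPU.

In this article, I start Livebook on Google Colaboratory and use its GPU.

Launching Livebook from Google Colaboratory

Write and run the following code in a Colaboratory notebook to launch Livebook.

The finished notebook is available here.

Installing asdf

I want to use the latest versions of Erlang and Elixir, so I use asdf.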

!git clone https://github.com/asdf-vm/asdf.git ~/.asdf

import os
os.environ['PATH'] = "/root/.asdf/shims:/root/.asdf/bin:" + os.environ['PATH']

Installing Erlang

Install the latest version of Erlang with asdf.

!asdf plugin add erlang

!asdf install erlang 26.1.2

!asdf global erlang 26.1.2

Installing Elixir

Install the latest version of Elixir with asdf.

!asdf plugin add elixir

!asdf install elixir 1.15.7-otp-26

!asdf global elixir 1.15.7-otp-26

Confirm that it was installed.

!elixir -v

Output

Erlang/OTP 26 [erts-14.1.1] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit:ns]

Elixir 1.15.7 (compiled with Erlang/OTP 26)

Installing Livebook

Install Livebook as an escript.

!mix local.hex --force
!mix local.rebar --force
!mix escript.install hex livebook --force

!asdf reshim elixir

Confirm that it was installed.

!livebook -v

Output

Erlang/OTP 26 [erts-14.1.1] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit:ns]

Elixir 1.15.7 (compiled with Erlang/OTP 26)

Livebook 0.11.3

Installing ngrok and setting up credentials

I use ngrok to access the Livebook running on Google Colaboratory from outside.

If you do not have an ngrok account yet, sign up first.

Install the ngrok CLI.

!curl -O https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip

The authentication token is a secret, so create a password-style input form for it.

from getpass import getpass

token = getpass()

Get your ngrok authentication token from the following URL and enter it in the form.

Copy and paste the value, then press Enter.

Set the ngrok authentication token in the CLI.

!./ngrok authtoken "$token"

Starting Livebook

Use ngrok to make port 8888 on Google Colaboratory accessible from outside.

get_ipython().system_raw('./ngrok http 8888 &')
!sleep 5s

Retrieve the externally accessible public URL that ngrok created.

!curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

Output

http://xxxx-11-22-33-44.ngrok-free.app
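The one-liner above takes the first tunnel in the response. As a minimal sketch, assuming the response has the shape that one-liner relies on (a `tunnels` list whose entries have a `public_url`), the same lookup can be written as a small function that prefers an `https` URL when one is listed. The function name `pick_public_url` and the sample data are hypothetical.

```python
import json


def pick_public_url(tunnels_json: str, prefer_scheme: str = "https") -> str:
    """Return a tunnel's public URL, preferring the given scheme if present."""
    tunnels = json.loads(tunnels_json).get("tunnels", [])
    if not tunnels:
        raise ValueError("no tunnels reported; ngrok may not be running yet")
    urls = [tunnel["public_url"] for tunnel in tunnels]
    for url in urls:
        if url.startswith(prefer_scheme + "://"):
            return url
    # Fall back to the first tunnel, as the one-liner does
    return urls[0]


# Hypothetical sample shaped like the response used above
sample = json.dumps({
    "tunnels": [
        {"public_url": "http://xxxx-11-22-33-44.ngrok-free.app"},
        {"public_url": "https://xxxx-11-22-33-44.ngrok-free.app"},
    ]
})
print(pick_public_url(sample))
```

In a Colaboratory cell, the JSON text could come from the same `http://localhost:4040/api/tunnels` endpoint that the `curl` command queries.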

Start Livebook.

!livebook server --port 8888

Output

[Livebook] Application running at http://localhost:8888/?token=xxxxx

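When you open the ngrok public URL in a browser, you are asked for a token. Copy the value of the token parameter at the end of the URL printed when Livebook started, and paste it in.

Once the Livebook home screen appears, Livebook is ready.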
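If you would rather not copy the token by hand, it can be pulled out of the startup URL with the standard library. This is a small sketch, assuming the URL has the `?token=...` form shown in the output above; `extract_token` is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs


def extract_token(startup_url: str) -> str:
    """Return the value of the token query parameter from a Livebook URL."""
    query = parse_qs(urlparse(startup_url).query)
    values = query.get("token")
    if not values:
        raise ValueError("no token parameter found in the URL")
    return values[0]


print(extract_token("http://localhost:8888/?token=xxxxx"))  # prints xxxxx
```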

Setup

Open a new notebook in Livebook.

Enter the following code in the setup cell and run it.

Mix.install(
  [
    {:kino_bumblebee, "~> 0.4"},
    {:exla, "~> 0.6"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)

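With EXLA, inference with machine learning models runs fast.

Adding the Smart Cell

Add a Neural Network task Smart Cell.

An input form like the following is displayed.

Select Speech-to-text for TASK and Whisper medium multilingual for USING, then click Evaluate. The input form changes as follows.

Running speech recognition

Let's recognize audio of a reading of "Run, Melos!" (走れメロス) from Aozora Bunko.

Instead of my own voice, I have VOICEVOX's WhiteCUL (びえーん) read it.

Upload the audio file exported from VOICEVOX to the input form in Livebook and click the Run button.

After a while, the following result was displayed.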

(Start of transcription)
00:00:00-00:00:07: メロスは激怒した。必ず、彼女、 地防脚のを除かなければならぬと決意した。
00:00:07-00:00:28: メロスには政治がわからぬ。メロスは、村の奥人である、笛を吹、従理はなれたこの子のシラクスの死にやってきた。
00:00:28-00:00:45: メロスには父も母もない、妞婦もない、十六の不敷きな妹と二人暮らしだ。この妹は村のある律儀な一国人を近々、花向子として迎えることになっていた結婚式も間近なのである
00:00:45-00:00:47: メロスはそれゆえ
00:00:47-00:00:51: 花嫁の衣装やら宿苑のごちそうやらを買いに
00:00:51-00:00:53: 春晴一にやってきたのだ
00:00:53-00:01:06: まずそのしなじなを買い集めそれから都の王子をぶらぶら歩いたメロスにはチクバの友があったセリヌンティウスである今はこのシラクスの位置で
00:01:06-00:01:09: 意識をしているその友を
00:01:09-00:01:11: これから尋ねてみるつもりなのだ
00:01:11-00:01:14: 久しぶりに会わなかったのだから
00:01:14-00:01:16: 尋ねていくのが楽しみである
(End of transcription)

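Setting aside the unavoidable misrecognition of difficult words such as 「邪智暴虐」, the output shows a time range for each segment of the audio.

This is a new feature of Whisper in Livebook.

You can now tell what was said at what point in time.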
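The `HH:MM:SS-HH:MM:SS: text` lines in the output above can be built from a start time and an end time in seconds for each segment. The following is a minimal sketch of that formatting, assuming each segment carries its times as non-negative floats; the function names and the sample values are hypothetical.

```python
def to_clock(seconds: float) -> str:
    """Format non-negative seconds as HH:MM:SS, rounded to the nearest second."""
    total = int(seconds + 0.5)  # round half up; valid for non-negative input
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segment(start: float, end: float, text: str) -> str:
    """Build one transcript line in the same layout as the output above."""
    return f"{to_clock(start)}-{to_clock(end)}: {text}"


# Hypothetical segment times in seconds
print(format_segment(45.0, 47.4, "メロスはそれゆえ"))
# prints 00:00:45-00:00:47: メロスはそれゆえ
```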

Looking at the code

Next, click the pencil icon at the top right of the Smart Cell.

The Smart Cell is converted into the following two cells.

{:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-medium"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-medium"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-medium"})

{:ok, generation_config} =
  Bumblebee.load_generation_config({:hf, "openai/whisper-medium"})

generation_config = Bumblebee.configure(generation_config, max_new_tokens: 100)

serving =
  Bumblebee.Audio.speech_to_text_whisper(
    model_info,
    featurizer,
    tokenizer,
    generation_config,
    compile: [batch_size: 4],
    chunk_num_seconds: 30,
    timestamps: :segments,
    stream: true,
    defn_options: [compiler: EXLA]
  )

audio_input = Kino.Input.audio("Audio", sampling_rate: featurizer.sampling_rate)
form = Kino.Control.form([audio: audio_input], submit: "Run")
frame = Kino.Frame.new()

Kino.listen(form, fn %{data: %{audio: audio}} ->
  if audio do
    audio =
      audio.file_ref
      |> Kino.Input.file_path()
      |> File.read!()
      |> Nx.from_binary(:f32)
      |> Nx.reshape({:auto, audio.num_channels})
      |> Nx.mean(axes: [1])

    Kino.Frame.render(frame, Kino.Text.new("(Start of transcription)", chunk: true))

    for chunk <- Nx.Serving.run(serving, audio) do
      [start_mark, end_mark] =
        for seconds <- [chunk.start_timestamp_seconds, chunk.end_timestamp_seconds] do
          seconds |> round() |> Time.from_seconds_after_midnight() |> Time.to_string()
        end

      text = "\n#{start_mark}-#{end_mark}: #{chunk.text}"
      Kino.Frame.append(frame, Kino.Text.new(text, chunk: true))
    end

    Kino.Frame.append(frame, Kino.Text.new("\n(End of transcription)", chunk: true))
  end
end)

Kino.Layout.grid([form, frame], boxed: true, gap: 16)
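In the listener above, the uploaded file is read as raw 32-bit floats, reshaped so that each row holds one sample per channel, and averaged across channels to get mono audio. The same idea can be sketched in plain Python. This assumes interleaved samples in native byte order, which is what the reshape to `{:auto, audio.num_channels}` implies; `to_mono` is a hypothetical helper name.

```python
import struct


def to_mono(raw: bytes, num_channels: int) -> list[float]:
    """Decode interleaved 32-bit float samples and average the channels."""
    count = len(raw) // 4  # each f32 sample occupies 4 bytes
    if count % num_channels != 0:
        raise ValueError("sample count is not a multiple of the channel count")
    samples = struct.unpack(f"={count}f", raw[: count * 4])
    # Group the flat sample list into frames of one sample per channel
    frames = [samples[i:i + num_channels] for i in range(0, count, num_channels)]
    return [sum(frame) / num_channels for frame in frames]


# Two stereo frames: (0.5, -0.5) and (1.0, 0.0)
raw = struct.pack("=4f", 0.5, -0.5, 1.0, 0.0)
print(to_mono(raw, 2))  # prints [0.0, 0.5]
```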

The part to focus on is the following code.

...
    compile: [batch_size: 4],
    chunk_num_seconds: 30,
...

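This is the other new feature.

In fact, previously only audio up to 30 seconds long could be recognized.

As a new feature, the audio is split into 30-second chunks for processing, which removes the limit on length.

What's more, the chunks are processed internally in parallel, batch by batch.

If the machine has multiple GPUs, this gets even faster.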
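To make the splitting and batching concrete, here is a rough sketch of the arithmetic behind `chunk_num_seconds: 30` and `batch_size: 4`. It is not Bumblebee's implementation, which may handle chunk boundaries differently (for example, by overlapping chunks). The 16 kHz sampling rate is an assumption, and the function names are hypothetical.

```python
def split_into_chunks(num_samples, sampling_rate, chunk_num_seconds=30):
    """Return (start, end) sample index pairs for fixed-length chunks."""
    chunk_len = sampling_rate * chunk_num_seconds
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


def group_into_batches(items, batch_size=4):
    """Group items into lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


sampling_rate = 16_000             # assumption: 16 kHz audio
num_samples = 76 * sampling_rate   # a 76-second recording, as in the example
chunks = split_into_chunks(num_samples, sampling_rate)
batches = group_into_batches(chunks)
print(len(chunks), len(batches))   # prints 3 1
```

With these settings, a 76-second recording yields three chunks that fit in a single batch. A longer file would produce several batches, which is where the batch-level parallelism described above can help.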

The Livebook Launch Week 2 article puts it as follows.

we expect our models to perform inference an order of magnitude faster compared to Open AI’s implementation when transcribing larger files on the GPU

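This implementation puts Elixir's strengths in parallel and distributed processing to good use.

Summary

Whisper in Livebook gained the following new features.

Time ranges displayed for each audio segment
Support for audio longer than 30 seconds
Faster execution through parallel processing

Speeding up AI through parallel and distributed processing plays right to Elixir's strengths.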
