Transcribing meeting audio into minutes and auto-generating a summary with Whisper + GPT-3! #gpt-3

Hello! I'm Sakasegawa (https://twitter.com/gyakuse)!
Today we'll transcribe meeting audio into minutes and automatically generate a summary from them.

Overview

We transcribe meeting audio (a wav or mp3 file, for example) with Whisper, then auto-generate a summary with GPT-3.5.
I call it meeting audio, but any kind of audio works.

Colab

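whisper.cpp version (processing takes roughly 10 times the length of the recording, but no GPU is needed)

whisper.fp16 version (processing takes only about 1/4 of the length of the recording, but a GPU is required)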
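
As a rough illustration of the two timing figures above, the sketch below estimates the processing time for a recording of a given length. The function name is made up for this example, and the factors 10 and 1/4 are simply taken from the notes above as assumptions; actual speed depends on the hardware Colab assigns.

```python
def estimate_processing_minutes(audio_minutes: float) -> dict:
    """Rough processing-time estimate based on the rules of thumb quoted above."""
    return {
        "whisper.cpp (CPU)": audio_minutes * 10,  # about 10x the recording length
        "whisper fp16 (GPU)": audio_minutes / 4,  # about 1/4 of the recording length
    }


for name, minutes in estimate_processing_minutes(60).items():
    print(f"{name}: about {minutes:.0f} min for a 60-minute recording")
```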

Usage

Paste in your OpenAI API key
Run everything via Runtime > Run all, then pick the recording in the file chooser that appears below the first cell
Wait patiently

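Implementation

Making Whisper lighter

For a lighter-weight Whisper, there is whisper.cpp, a C++ implementation. I'll follow the reference below for the basics.

It looks like the audio needs to be split into 15-minute chunks and converted to 16 kHz WAV, so let's do that first.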

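# Convert the input audio
import soundfile as sf
import librosa

# sr=None keeps the file's native sampling rate (librosa otherwise resamples to 22,050 Hz on load)
y, sr = librosa.load(meeting_file_path, sr=None)
# Resample to 16 kHz (recent librosa versions require orig_sr/target_sr as keyword arguments)
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)
n_samples = int(15 * 60 * 16000)

# Split the audio into 15-minute segments
segments = [y_16k[i:i+n_samples] for i in range(0, len(y_16k), n_samples)]

# Save each segment as its own WAV file
for i, segment in enumerate(segments):
    sf.write(f"/content/meeting_{i}.wav", segment, 16000, format="WAV")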

Next, install whisper.cpp: fetch it with git, then run make and so on. We use the large model, since smaller models struggle with Japanese speech recognition.

!git clone https://github.com/ggerganov/whisper.cpp
%cd whisper.cpp
!make
!bash ./models/download-ggml-model.sh large

For the transcription itself, we generate a shell script and run it.

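import os

# seq includes both endpoints, so the last index must be len(segments) - 1.
# The shebang has to be the very first line of the file, so the string starts with it.
shellscript = f"""#!/bin/bash

last={len(segments) - 1}

for i in $(seq 0 $last)
do
  ./main -m /content/whisper.cpp/models/ggml-large.bin -l ja -f /content/meeting_$i.wav -otxt
done
"""

filename = "run.sh"

with open(filename, "w") as f:
    f.write(shellscript)

os.chmod(filename, 0o755)

!./run.sh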
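
Once the script has finished, the per-segment text files still have to be joined into one transcript. The sketch below is one way to do that, assuming that `-otxt` writes each result next to its input as `meeting_<i>.wav.txt`; the helper name `load_transcript` and its parameters are made up for illustration.

```python
import re
from pathlib import Path


def load_transcript(directory: str = "/content", stem: str = "meeting") -> str:
    """Join <stem>_<i>.wav.txt files in numeric order into one transcript."""
    pattern = re.compile(rf"^{re.escape(stem)}_(\d+)\.wav\.txt$")
    indexed = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            indexed.append((int(match.group(1)), path))
    # Sort by the numeric index so that meeting_10 comes after meeting_9.
    indexed.sort(key=lambda pair: pair[0])
    return "\n".join(path.read_text(encoding="utf-8").strip() for _, path in indexed)
```

Calling `load_transcript()` with its defaults would read the files from `/content`. Sorting on the parsed number rather than the file name matters once there are ten or more segments.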

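Speeding up Whisper

For speeding up Whisper, there is the article 音声認識モデル Whisper の推論をほぼ倍速に高速化した話 ("How we nearly doubled the inference speed of the Whisper speech recognition model"), which applies quantization and TorchScript conversion. It digs into the internal implementation and does interesting things such as keeping only the layernorm weights in fp32, so I recommend it (reading it, you can see why naively switching to half precision makes things fall over).
We adopt the half-precision processing used in that article.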
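
A minimal sketch of half-precision inference, assuming the `openai-whisper` Python package and a CUDA GPU. Note that it uses that package's built-in `fp16` decoding option rather than the manual conversion described in the article, and the helper names `decode_settings` and `transcribe_half_precision` are made up for illustration.

```python
def decode_settings(has_cuda: bool) -> dict:
    """Pick device and precision. Whisper falls back to FP32 on CPU,
    so fp16 is only requested when a GPU is available."""
    return {"device": "cuda" if has_cuda else "cpu", "fp16": has_cuda}


def transcribe_half_precision(audio_path: str,
                              model_name: str = "large",
                              language: str = "ja") -> str:
    # Imported lazily so this file can be loaded without torch/whisper installed.
    import torch
    import whisper  # the openai-whisper package (assumption)

    settings = decode_settings(torch.cuda.is_available())
    model = whisper.load_model(model_name, device=settings["device"])
    result = model.transcribe(audio_path, language=language, fp16=settings["fp16"])
    return result["text"]
```

With this route there is no need to split the recording into 15-minute files by hand, since `transcribe` processes long audio in windows internally.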

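Auto-generating the summary and action items

This is done with GPT-3.5. For now we use the prompts below. If you use a smaller language model, it is better to translate into English first and use few-shot prompting, or to use a dedicated summarization model.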

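def create_prompt(transcript):
    # The prompt is kept in Japanese because the transcript is Japanese. In English:
    # "Below is a transcript of a meeting. ... Write a summary of this meeting in the
    # following format: - Purpose of the meeting - Content of the meeting
    # - Outcome of the meeting. Summary:"
    return f"""以下は、ある会議の書き起こしです。

{transcript}

この会議のサマリーを作成してください。サマリーは、以下のような形式で書いてください。

- 会議の目的
- 会議の内容
- 会議の結果

サマリー:
"""

def create_prompt_act(transcript):
    # In English: "Below is a transcript of a meeting. ... Write the actions to take
    # after this meeting, following these rules: output as a list (each item starts
    # with -); keep it concise. Actions:"
    return f"""以下は、ある会議の書き起こしです。

{transcript}

この会議の次にするアクションを作成してください。アクションの記述は以下のルールに従ってください。

・リスト形式で出力する (先頭は - を使う)
・簡潔に表現する

アクション:
"""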

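To show how prompt builders like the two above could be wired to a model, here is a small sketch. The function `run_prompts` and the `complete` callable are hypothetical: `complete` stands in for whatever wraps the actual completion request, and a fake one is used here so the example runs offline.

```python
from typing import Callable


def run_prompts(transcript: str,
                builders: dict[str, Callable[[str], str]],
                complete: Callable[[str], str]) -> dict[str, str]:
    """Build each prompt from the transcript and send it to the completion function."""
    return {name: complete(build(transcript)).strip() for name, build in builders.items()}


# Stand-in for a real API call so the sketch runs offline: it only reports the prompt size.
def fake_complete(prompt: str) -> str:
    return f" ({len(prompt)} characters received) "


results = run_prompts(
    "A: Let's ship on Friday.\nB: Agreed.",
    {"summary": lambda t: f"Transcript:\n{t}\n\nSummary:",
     "actions": lambda t: f"Transcript:\n{t}\n\nActions:"},
    fake_complete,
)
print(results)
```

In the notebook, `builders` would be `{"summary": create_prompt, "actions": create_prompt_act}`, and `complete` would wrap the OpenAI completion request for the chosen model.
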
In the end, the result looked like this.

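Open issues and things not done yet

The audio processing handles long recordings, but the prompt step has no handling for long meetings
- Summarizing every 2,000 to 3,000 characters first, then joining those partial summaries and producing the final summary from them, seems like a good approach
text-davinci-003 was down at the time of writing, so I used 002; switching to 003 improves the results
For the summary prompts, I asked BingGPT for recommended prompts and used them as-is. They probably deserve a bit more thought.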
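
The chunk-then-summarize idea from the first item above could look roughly like this. It is only a sketch: `split_text` and `summarize_long` are made-up names, the 2,500-character default is just a value inside the suggested 2,000 to 3,000 range, and `summarize` stands in for a call to the language model (a toy function is used so the example runs offline).

```python
from typing import Callable


def split_text(text: str, chunk_chars: int = 2500) -> list[str]:
    """Cut the text into consecutive chunks of at most chunk_chars characters."""
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def summarize_long(text: str,
                   summarize: Callable[[str], str],
                   chunk_chars: int = 2500) -> str:
    """Summarize each chunk, join the partial summaries, then summarize the result."""
    chunks = split_text(text, chunk_chars)
    if len(chunks) <= 1:
        return summarize(text)
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partials))


# Toy summarizer so the sketch runs offline: it keeps the first 10 characters.
def head(text: str) -> str:
    return text[:10]


print(summarize_long("a" * 6000, head))
```

Splitting on a fixed character count can cut a sentence in half; splitting on line or sentence boundaries would be a natural refinement.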

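Postscript

There are more and more excellent automatic meeting-minutes services
- 議事録取れる君 offers features such as Zoom integration and speaker identification
  - The UI/UX is excellent too, and at 980 yen per month (for individuals) it is a great deal
Whisper is also easy to fine-tune, so let's quickly build all sorts of things with it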

On passing content safely as a prompt (added February 16)

According to OpenAI's Terms of Use, inputs and outputs sent through the API may be used to train models. To opt out, you need to contact support@openai.com, as stated below.

(c) Use of Content to Improve Services. One of the main benefits of machine learning models is that they can be improved over time. To help OpenAI provide and maintain the Services, you agree and instruct that we may use Content to develop and improve the Services. You can read more here about how Content may be used to improve model performance. We understand that in some cases you may not want your Content used to improve Services. You can opt out of having Content used for improvement by contacting support@openai.com with your organization ID. Please note that in some cases this may limit the ability of our Services to better address your specific use case.

In addition, the OpenAI service provided through Microsoft Azure states the following limits on how the data is used.

Text prompts, queries and responses. The requests & response data may be temporarily stored by the Azure OpenAI Service for up to 30 days. This data is encrypted and is only accessible to authorized engineers for (1) debugging purposes in the event of a failure, (2) investigating patterns of abuse and misuse or (3) improving the content filtering system through using the prompts and completions flagged for abuse or misuse.