
Trying out the GCP Speech-to-Text API

Posted at 2020-09-16

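#Motivation
Now that meetings and the like have moved almost entirely online, I have increasingly been logging them with tools such as Zoom's built-in recording feature. It is reassuring to know that rewatching a video recovers everything that was said that day, but videos are hard to skim, so I wanted to turn the accumulated recordings into text and gave this a try.

I based my implementation on articles with many LGTMs and on the official sample code, but none of them worked on the first attempt, so I hope this helps anyone who is struggling to try the API out quickly.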

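#GCP setup
I set things up by following the article below.
My audio file had a different sampling rate from the one used there, but the step of uploading the file to GCS is the same.
[Google Cloud Speech API を使った音声の文字起こし手順](https://qiita.com/knyrc/items/7aab521edfc9bfb06625) (steps for transcribing audio with the Google Cloud Speech API)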
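
Because the sampling rate and channel count of your file may differ from the ones in the referenced article, it can help to read them from the file itself before uploading. The following is a minimal sketch, assuming the file begins directly with the `fLaC` marker followed by a STREAMINFO block (no ID3 tag prepended). The function names `parse_flac_streaminfo` and `inspect_flac` are hypothetical helpers for illustration and are not part of any Google library.

```python
def parse_flac_streaminfo(header):
    """Read sample rate, channel count and bit depth from the first 42 bytes of a FLAC file."""
    if len(header) < 42 or header[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing 'fLaC' marker)")
    block_type = header[4] & 0x7F  # low 7 bits of the block header: metadata block type
    if block_type != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    info = header[8:42]  # the 34-byte STREAMINFO body
    # Bytes 10..17 pack: 20 bits sample rate, 3 bits (channels - 1),
    # 5 bits (bits per sample - 1), 36 bits total samples.
    packed = int.from_bytes(info[10:18], "big")
    return {
        "sample_rate_hertz": packed >> 44,
        "channels": ((packed >> 41) & 0x07) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
    }


def inspect_flac(path):
    """Open a local FLAC file and return its basic audio parameters."""
    with open(path, "rb") as f:
        return parse_flac_streaminfo(f.read(42))
```

The `channels` value is the number you would pass as `audio_channel_count` in the recognition config.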

#Implementation

```python:transcribe_gcp.py
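# Written against the 1.x releases of google-cloud-speech; the `enums` module
# imported below was removed in 2.0.
from google.cloud import speech_v1
from google.cloud.speech_v1 import enums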
import datetime
import codecs
import argparse
import io,sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def sample_long_running_recognize(storage_uri):
    """
    Transcribe long audio file from Cloud Storage using asynchronous speech
    recognition

    Args:
      storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
    """

    client = speech_v1.SpeechClient()
    # The language of the supplied audio
    language_code = "ja-JP"
    # Encoding of audio data sent. This sample sets this explicitly.
    # This field is optional for FLAC and WAV audio formats.
    encoding = enums.RecognitionConfig.AudioEncoding.FLAC
    config = {
        "audio_channel_count": 2,  # Check the file header and set the channel count (the default is 1 channel, so this can be omitted for mono audio)
        "language_code": language_code,
        "encoding": encoding,
    }
    audio = {"uri": storage_uri}

    operation = client.long_running_recognize(config, audio)

    print(u"Waiting for operation to complete...")
    response = operation.result()
    d = datetime.datetime.today()
    today = d.strftime("%Y%m%d-%H%M%S")
    fout = codecs.open('output{}.txt'.format(today), 'a', 'shift_jis')

    for result in response.results:
        # First alternative is the most probable result
        alternative = result.alternatives[0]
        print(u"Transcript: {}".format(alternative.transcript))  # print to the console
        for alternative in result.alternatives:  # append to the output file
            fout.write(u'{}\n'.format(alternative.transcript))
    fout.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument(
        'path', help='GCS path for audio file to be recognized')
    args = parser.parse_args()
    sample_long_running_recognize(args.path)
```
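
The listing above imports `enums` and passes `config` and `audio` positionally, which matches the 1.x releases of `google-cloud-speech`. As far as I know, 2.0 and later dropped the `enums` module and expect keyword arguments, so the sketch below shows roughly how the same call might look there. The names `collect_transcripts` and `transcribe_gcs_v2` and the `timeout` default are my own assumptions for illustration; the library import is done lazily so the pure helper can be tried without the package or credentials.

```python
def collect_transcripts(results):
    """Return the most probable transcript of each result (pure helper, no API call)."""
    return [r.alternatives[0].transcript for r in results if r.alternatives]


def transcribe_gcs_v2(storage_uri, channels=2, timeout=3600):
    """Sketch of the same long-running request for google-cloud-speech 2.x."""
    # Imported here so that collect_transcripts works without the library installed.
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code="ja-JP",
        audio_channel_count=channels,  # must match the file header
    )
    audio = speech.RecognitionAudio(uri=storage_uri)
    # In 2.x the request fields are passed as keyword arguments.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=timeout)  # assumed timeout in seconds
    return collect_transcripts(response.results)
```

Calling `transcribe_gcs_v2("gs://[BUCKET]/[FILE]")` would return a list of strings that can be written to a file in the same way as in the original script.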

# Stumbling points
###When you get an error saying google.cloud is missing
You need to install the package with pip. (I had to reinstall it several times, each time the shell reconnected.)

```terminal
pip install google-cloud-speech
```
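
Since the package can silently go missing after a shell reconnect, a small check at the start of a session can tell you whether a reinstall is needed. This is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper name.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package such as `google` or `google.cloud` is missing.
        return False


if __name__ == "__main__":
    if not is_importable("google.cloud.speech"):
        print("google.cloud.speech not found; run: pip install google-cloud-speech")
```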

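###When you get a 403 error
[Google Cloud Speech API を使った音声の文字起こし手順](https://qiita.com/knyrc/items/7aab521edfc9bfb06625)
I created the key pair, downloaded it, uploaded it to the shell editor, and set the path by following the steps in this article, yet several times I still hit this error and wondered why. As with the missing google.cloud case, the path setting I had made sometimes disappeared, and setting it again made everything work.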

```terminal
export GOOGLE_APPLICATION_CREDENTIALS=key_pair.json
```
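
To notice a lost setting before the API answers with a 403, the script can check the environment variable first. The sketch below is an assumption-laden illustration: `credentials_problem` is a hypothetical helper, and it only verifies that the variable is set and points to an existing file, not that the key itself is valid. Note that a relative value such as `key_pair.json` is resolved against the current working directory.

```python
import os


def credentials_problem(environ=None):
    """Return a description of what is wrong with the credential setting, or None if it looks fine."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        # Show the absolute path so a wrong working directory is easy to spot.
        return "key file not found: {}".format(os.path.abspath(path))
    return None


if __name__ == "__main__":
    problem = credentials_problem()
    if problem:
        print(problem)
```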

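### When you get a Japanese encoding error
Printing Japanese to the console can raise an error. Adding the lines below to the top of the file fixes it, but out of the box the Python running in the shell is 2.7, so you need to update to Python 3.7 for these lines to work.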

```python:transcribe_gcp.py
import io,sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
```
I updated to Python 3.7 by following the link below.
[GCPのCloud Shellのpythonバージョンの更新方法](https://qiita.com/greenteabiscuit/items/cbecdf4f84f0b73ff96e) (how to update the Python version in GCP Cloud Shell)
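
The two lines above depend on `sys.stdout.buffer`, which exists only in Python 3, and that is why they fail on 2.7. The following is a small sketch of the same idea with an explicit version check; `utf8_text_stream` and `force_utf8_stdout` are hypothetical names, and using `reconfigure` where available (Python 3.7+) is my own variation rather than what the script above does.

```python
import io
import sys


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


def force_utf8_stdout():
    """Make print() emit UTF-8 regardless of the shell's locale settings."""
    if sys.version_info < (3, 0):
        raise RuntimeError("Python 3 is required: sys.stdout.buffer does not exist in 2.7")
    if hasattr(sys.stdout, "reconfigure"):
        # Available from Python 3.7; changes the encoding in place.
        sys.stdout.reconfigure(encoding="utf-8")
    else:
        sys.stdout = utf8_text_stream(sys.stdout.buffer)
```

Calling `force_utf8_stdout()` once at the top of the script should have the same effect as the two-line version.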

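#On accuracy
As a test, I transcribed an online meeting with three speakers who rarely talked over one another, but the result was something that a person who had not attended could hardly understand. It does pick up keywords and context to some extent, so with manual correction it could be turned into proper minutes.
I got the impression that transcription that makes sense regardless of the recording environment is still a little way off.

This was my first time using both GCP and the Speech-to-Text API, and the fact that I could get it working at all shows how impressive Google is.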
