
Automatic speech transcription with the Google Cloud Speech API

Last updated at 2018-05-07, posted at 2017-09-03

The dream of automatically transcribing conference and interview recordings

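The API was updated in August 2017 to accept audio up to 3 hours long, so I tried converting audio data into a txt file.
So that I can transcribe an interview right after recording it, I use the GCP cloud console (Cloud Shell) as the environment, since it is available even when I am out of the office.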

Reference:
http://jp.techcrunch.com/2017/08/15/20170814google-updates-its-cloud-speech-api-with-support-for-more-languages-word-level-timestamps/

Environment and language

Google Cloud Speech API
Google Cloud Storage
Python

Enable the Speech API

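Enable the Speech API by following the URL below.
The first 60 minutes of audio are free; after that you are charged 0.6 cents per 15 seconds. If you are using Google Cloud Platform for the first time, you receive a $300 credit valid for one year (as of August 2017).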
https://cloud.google.com/speech/docs/getting-started
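
To get a feel for the pricing above, here is a small sketch that estimates the charge for a given amount of audio. It uses only the figures quoted in this article (60 free minutes, 0.6 cents per 15 seconds, as of August 2017). The function name and the rule that a partial 15-second unit is rounded up are my own assumptions, so check the current price list before relying on the numbers.

```python
import math

FREE_SECONDS = 60 * 60            # first 60 minutes are free (figure from this article)
INCREMENT_SECONDS = 15            # billing unit quoted in this article
TENTHS_OF_CENT_PER_INCREMENT = 6  # 0.6 cents per 15 seconds


def estimate_cost_usd(total_seconds):
    """Rough charge in US dollars for total_seconds of audio.

    Assumption (not from the article): a partial 15-second unit is rounded up.
    """
    billable = max(0, total_seconds - FREE_SECONDS)
    increments = math.ceil(billable / INCREMENT_SECONDS)
    # 1 dollar = 1000 tenths of a cent
    return increments * TENTHS_OF_CENT_PER_INCREMENT / 1000


if __name__ == "__main__":
    for hours in (1, 2, 3):
        print(hours, "h of audio:", estimate_cost_usd(hours * 3600), "USD")
```

Working in tenths of a cent keeps the arithmetic in integers until the final division.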

Create the credentials as a service account key file (JSON format).

Authenticate to the API in Google Cloud Shell

Launch Google Cloud Shell and upload the JSON credentials file from the menu near the top right.

Once it is uploaded, authenticate with the JSON file.

$ export GOOGLE_APPLICATION_CREDENTIALS=hogehoge.json
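
If authentication fails later, it can help to confirm that the variable points at a real service account key. The sketch below is only a sanity check under my own assumptions: the helper names are hypothetical, and the field list is what a service account key file normally contains, not something the API requires you to verify.

```python
import json
import os

# Fields normally present in a service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def validate_key_data(data):
    """Return a list of problems found in the parsed key file (empty if none)."""
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = ["missing field: {}".format(k) for k in REQUIRED_FIELDS if k not in data]
    if data.get("type") != "service_account":
        problems.append('"type" is not "service_account"')
    return problems


def check_credentials(path=None):
    """Check the file named by path, or by GOOGLE_APPLICATION_CREDENTIALS."""
    path = path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    if not os.path.isfile(path):
        return ["file not found: {}".format(path)]
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError
        return ["not valid JSON: {}".format(exc)]
    return validate_key_data(data)


if __name__ == "__main__":
    print(check_credentials() or "credentials file looks OK")
```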

Create the audio file

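Formats such as mp3 and AAC cannot be used as-is; the audio has to be converted to a supported format. After trying various options, I recommend the following settings.

FLAC
Mono
16000 Hz
16 bit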

(Reference: an online conversion service)
https://audio.online-convert.com/convert-to-flac
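
As an alternative to the online service, the same settings can be produced locally. This is a sketch that assumes the ffmpeg command-line tool is installed, which this article does not use. The function names and file names are hypothetical; the flags are standard ffmpeg audio options.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command for mono, 16-bit FLAC at sample_rate Hz."""
    return [
        "ffmpeg",
        "-i", src,                 # input file (mp3, AAC, ...)
        "-ac", "1",                # one audio channel = mono
        "-ar", str(sample_rate),   # sampling rate in Hz
        "-sample_fmt", "s16",      # 16-bit samples
        dst,                       # output; a .flac extension selects FLAC
    ]


def convert_to_flac(src, dst, sample_rate=16000):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe without ffmpeg.
    print(" ".join(build_ffmpeg_command("interview.mp3", "interview.flac")))
```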

Conversion

Upload the FLAC file to Google Cloud Storage.
How to set up Google Cloud Storage is described here:
https://cloud.google.com/storage/docs/quickstart-console?hl=ja
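
The upload can also be done from Python instead of the console. The sketch below assumes the google-cloud-storage package is installed and that the bucket already exists. The function names, bucket name, and file names are placeholders of mine, not values from this article.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI that the transcription script expects."""
    return "gs://{}/{}".format(bucket_name, blob_name)


def upload_flac(local_path, bucket_name, blob_name):
    """Upload local_path to the bucket and return its gs:// URI."""
    # Imported here so the helper above works without the package installed.
    from google.cloud import storage  # assumption: pip install google-cloud-storage

    client = storage.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return gcs_uri(bucket_name, blob_name)


if __name__ == "__main__":
    # Placeholder names; only the URI helper is exercised here.
    print(gcs_uri("my-bucket", "testmusic.flac"))
```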

I uploaded the Python file directly to the shell. I am not an engineer by trade, so I fumbled my way through while following the tutorial...

transcribe.py

#!/usr/bin/env python
# coding: utf-8
"""Transcribe a FLAC file stored in Google Cloud Storage into a text file."""
import argparse
import codecs
import datetime


def transcribe_gcs(gcs_uri):
    # Written against the google-cloud-speech 1.x client, which provides
    # the enums and types submodules.
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types
    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        language_code='ja-JP')

    # Start the asynchronous (long-running) recognition job.
    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    # Blocks until the job has finished.
    operation_result = operation.result()

    # Timestamped output file name, e.g. output20170903-120000.txt
    today = datetime.datetime.today().strftime("%Y%m%d-%H%M%S")
    with codecs.open('output{}.txt'.format(today), 'a', 'shift_jis') as fout:
        for result in operation_result.results:
            for alternative in result.alternatives:
                fout.write(u'{}\n'.format(alternative.transcript))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument(
        'path', help='GCS path for audio file to be recognized')
    args = parser.parse_args()
    transcribe_gcs(args.path)
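
The script above imports `enums` and `types`, which as far as I know exist only in the 1.x releases of google-cloud-speech. For reference, here is a sketch of the same request written for a 2.x or later client, where the message classes are reached through `speech` directly and the call takes keyword arguments. The names `transcribe_gcs_v2` and `collect_transcripts` are mine. Unlike the script above, it keeps only the top alternative of each result.

```python
def collect_transcripts(results):
    """Return the top transcript of each recognition result, in order."""
    return [r.alternatives[0].transcript for r in results if r.alternatives]


def transcribe_gcs_v2(gcs_uri, timeout=None):
    """Transcribe a FLAC file in Cloud Storage and return a list of transcripts."""
    # Assumption: google-cloud-speech 2.x or later is installed.
    from google.cloud import speech

    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code="ja-JP",
    )
    # Keyword arguments are used instead of positional ones.
    operation = client.long_running_recognize(config=config, audio=audio)
    print("Waiting for operation to complete...")
    response = operation.result(timeout=timeout)  # timeout=None waits indefinitely
    return collect_transcripts(response.results)
```

Returning a list rather than writing the file inside the function leaves the choice of output encoding to the caller.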

Finally, run the following and wait a while; the conversion will then finish.

$ python transcribe.py gs://YOUR_BUCKET_NAME/testmusic.flac

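Notes

Files can be at most 3 hours long
Transcribing 1 hour of audio takes about 15 minutes
Because it is spoken language, the output has no punctuation at all (the English version can reportedly add punctuation automatically, so I am looking forward to the same for Japanese)
Occasionally an error complains that the sample rate (hertz) setting is wrong; when that happens, set the sampling rate explicitly in the Python file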

config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,  # add this line
    language_code='ja-JP')
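
To find out which rate a file really has (and whether it is mono, 16 bit, and under the 3-hour limit), the FLAC header can be read directly with the standard library. This is a sketch based on the STREAMINFO layout in the FLAC format. It assumes the file starts with the `fLaC` marker (no ID3 tag in front), and the function names are mine.

```python
FLAC_MAGIC = b"fLaC"
MAX_SECONDS = 3 * 60 * 60  # 3-hour limit mentioned in this article


def parse_flac_streaminfo(header):
    """Parse the first 26 bytes of a FLAC file.

    Returns (sample_rate, channels, bits_per_sample, total_samples).
    """
    if len(header) < 26 or header[:4] != FLAC_MAGIC:
        raise ValueError("not a FLAC stream (missing 'fLaC' marker)")
    if header[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 18..25 pack: 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), 36-bit total sample count.
    packed = int.from_bytes(header[18:26], "big")
    sample_rate = packed >> 44
    channels = ((packed >> 41) & 0x7) + 1
    bits_per_sample = ((packed >> 36) & 0x1F) + 1
    total_samples = packed & ((1 << 36) - 1)
    return sample_rate, channels, bits_per_sample, total_samples


def check_recommended(info):
    """Compare parsed header values with the settings recommended in this article."""
    sample_rate, channels, bits, total_samples = info
    problems = []
    if channels != 1:
        problems.append("not mono ({} channels)".format(channels))
    if bits != 16:
        problems.append("not 16 bit ({} bit)".format(bits))
    # total_samples == 0 means the encoder did not record the length.
    if sample_rate and total_samples / sample_rate > MAX_SECONDS:
        problems.append("longer than 3 hours")
    return problems


def read_flac_info(path):
    """Read the header of the file at path and parse it."""
    with open(path, "rb") as f:
        return parse_flac_streaminfo(f.read(26))
```

The first value returned by `read_flac_info` is the number to put in `sample_rate_hertz`.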

When I tried to use it again after a long break, I started getting ImportError: cannot import name speech, so I updated the library

sudo pip install --upgrade google-cloud-speech
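
Before or after upgrading, it may be worth checking which version of the client is actually installed, since the import style differs between major versions (to my knowledge, the `enums` and `types` submodules used in transcribe.py were dropped in 2.0). A small sketch using the standard library on Python 3.8 or later; the helper names are mine.

```python
from importlib import metadata  # available from Python 3.8


def installed_version(dist_name="google-cloud-speech"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def major_version(version_string):
    """Extract the major version number from a string such as '1.3.2'."""
    return int(version_string.split(".")[0])


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("google-cloud-speech is not installed")
    else:
        print("google-cloud-speech", version, "(major version {})".format(major_version(version)))
```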

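Accuracy (impressions)

Things that do not affect accuracy

Microphone sensitivity
Speaking speed
Noise

Things that affect accuracy

How the speaker talks (clearly or not)
Room reverberation

I was surprised by how much room reverberation affects accuracy. Noise such as air conditioning did not affect accuracy even when it was quite loud. Perhaps it is easy to separate from speech.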
