Trying Out Speech-to-Text

Last updated at 2021-09-05, posted at 2021-09-05

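Background

Information about Speech-to-Text is scattered across many parts of the official site and is hard to follow,
so I wrote this post to gather it in one place.

Each explanation notes where on the official site the information comes from.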

Quickstart

I want to load the audio data directly in Python this time, so I follow "Quickstart: Using client libraries".

First, install the library.

pip install --upgrade google-cloud-speech

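Before you begin

Prepare the following items, which are listed in this part of the quickstart.
The details are on a separate page:
https://cloud.google.com/speech-to-text/docs/before-you-begin?hl=ja

1. Prepare a GCP project
2. Enable Speech-to-Text
3. Download the authentication key (JSON file)
4. Set the environment variable

To set the environment variable from Python, write the following.
Place the file downloaded in step 3 in the folder where you run the code, named credentials.json.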

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'credentials.json'
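A relative path such as 'credentials.json' only resolves when the script runs from the folder that holds the file. As a minimal sketch, assuming a hypothetical helper named set_credentials (not part of the official library), the path can be made absolute and checked first, so a missing key file fails early instead of when the client is created.

```python
import os


def set_credentials(path="credentials.json"):
    """Point GOOGLE_APPLICATION_CREDENTIALS at a key file (hypothetical helper)."""
    abs_path = os.path.abspath(path)
    # Fail early with a clear message if the key file is not where we expect it.
    if not os.path.isfile(abs_path):
        raise FileNotFoundError(f"Credential file not found: {abs_path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = abs_path
    return abs_path
```

Calling set_credentials() before creating the client keeps the rest of the code in this post unchanged.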

Sample application

Next, choose what to run from the sample applications.
This time I ran asynchronous transcription of a local audio file.

Here is the code I copied.

def transcribe_file(speech_file):
    """Transcribe the given audio file asynchronously."""
    from google.cloud import speech

    client = speech.SpeechClient()

    with open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    """
     Note that transcription is limited to a 60 seconds audio file.
     Use a GCS file for audio longer than 1 minute.
    """
    audio = speech.RecognitionAudio(content=content)

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    operation = client.long_running_recognize(config=config, audio=audio)

    print("Waiting for operation to complete...")
    response = operation.result(timeout=90)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))
        print("Confidence: {}".format(result.alternatives[0].confidence))

Adjust the config settings in this function to match your audio file,
then run transcribe_file(speech_file) to convert the audio data to text.

Code walkthrough

speech.SpeechClient()

from google.cloud import speech
client = speech.SpeechClient()

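This part creates the client, which authenticates with the credentials.
Note that it raises an error if the environment variable has not been set beforehand.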

speech.RecognitionAudio()

audio = speech.RecognitionAudio(content=content)

RecognitionAudio takes the audio data to be transcribed.

There are two ways to pass it: the contents of a file (content) or a GCS URI (uri).
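The two patterns can be sketched as one small function. build_audio_kwargs is a hypothetical helper introduced here for illustration only; it assumes that any source beginning with gs:// is a GCS URI and that anything else is a local file path. It returns the keyword arguments to hand to speech.RecognitionAudio.

```python
def build_audio_kwargs(source):
    """Return keyword arguments for speech.RecognitionAudio (hypothetical helper)."""
    if source.startswith("gs://"):
        # The audio is already in Cloud Storage: pass only its URI.
        return {"uri": source}
    # Local file: read the raw bytes and pass them inline.
    with open(source, "rb") as audio_file:
        return {"content": audio_file.read()}


# Usage (requires google-cloud-speech; the file name is a placeholder):
# audio = speech.RecognitionAudio(**build_audio_kwargs("sample.wav"))
```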

speech.RecognitionConfig()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

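RecognitionConfig specifies the encoding and related settings of the audio.

The available encoding types are listed in the official documentation.

I do not understand all the details yet, but
"Optimizing audio files for Speech-to-Text" looked useful
for understanding encodings.

Partly because I lack background knowledge about audio encoding, I struggled to work out how to set these values.
In many cases the error message indicates how a setting should be changed, and by adjusting the settings accordingly I got the transcription to run.
Try varying the settings yourself.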
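For WAV files, some of the values can be read from the file header instead of being found by trial and error. The following is a minimal sketch using Python's standard wave module, which handles uncompressed PCM WAV files only. describe_wav is a hypothetical helper; its dictionary keys are chosen to match the RecognitionConfig parameters used in this post, except sample_width_bytes, which is only informational.

```python
import wave


def describe_wav(path):
    """Read header values relevant to RecognitionConfig from a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return {
            "sample_rate_hertz": wav_file.getframerate(),
            "audio_channel_count": wav_file.getnchannels(),
            # 2 bytes per sample means 16-bit audio, which is what LINEAR16 expects.
            "sample_width_bytes": wav_file.getsampwidth(),
        }
```

If sample_rate_hertz or audio_channel_count differs from the values in config, that mismatch is a likely cause of the configuration errors mentioned above.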

client.long_running_recognize()

operation = client.long_running_recognize(config=config, audio=audio)

print("Waiting for operation to complete...")
response = operation.result(timeout=90)

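long_running_recognize takes the config and audio prepared above and starts the recognition.

It returns a long-running operation rather than the response itself, so call result() to receive the outcome.
If the operation succeeds, result() returns the response.

Printing the results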

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))
        print("Confidence: {}".format(result.alternatives[0].confidence))

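The response holds a list of results, each containing its alternatives, so a for loop prints them.

Writing the results to a file

To save the text to a file, change the output part as follows.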
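The same loop can also collect the transcripts instead of printing them. This is a minimal sketch: collect_transcripts is a hypothetical helper, and fake_response is a stand-in built with SimpleNamespace that mimics only the attributes used in this post (results, alternatives, transcript, confidence). It is not a real API response.

```python
from types import SimpleNamespace


def collect_transcripts(response):
    """Return the top transcript of each result, in order (hypothetical helper)."""
    return [
        result.alternatives[0].transcript
        for result in response.results
        if result.alternatives  # skip results that carry no alternatives
    ]


# Stand-in object for illustration only; a real response comes from operation.result().
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(alternatives=[SimpleNamespace(transcript="hello", confidence=0.9)]),
        SimpleNamespace(alternatives=[SimpleNamespace(transcript="world", confidence=0.8)]),
    ]
)
print(collect_transcripts(fake_response))
```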

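# The built-in open() handles UTF-8 directly in Python 3.
with open('output.txt', 'w', encoding='utf-8') as f:
    for result in response.results:
        # Write the most likely transcript of each portion of the audio.
        text = result.alternatives[0].transcript
        f.write(text)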

When the file is large

The local-file approach above is limited to 10485760 bytes (10 MiB),
so for larger audio, upload the file to GCS and run the transcription from there.
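The size check can be sketched as follows. needs_gcs and MAX_INLINE_BYTES are hypothetical names, and the limit is simply the figure quoted above. The check covers file size only, not the 60-second length limit mentioned in the comment of the sample code.

```python
import os

# Limit quoted in the text above for audio sent inline from a local file.
MAX_INLINE_BYTES = 10485760


def needs_gcs(path, limit=MAX_INLINE_BYTES):
    """Return True when the file is too large to send as inline content."""
    return os.path.getsize(path) > limit
```

When needs_gcs returns True, use the GCS-based approach instead of reading the file locally.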

Using GCS is not difficult: just create a bucket and upload the file.
Note that leaving the file in the bucket incurs storage charges.

Reference code

def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech

    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # .wav
        sample_rate_hertz=16000,
        language_code="ja-JP",
        audio_channel_count=2,
    )

    operation = client.long_running_recognize(config=config, audio=audio)

    print("Waiting for operation to complete...")
    response = operation.result(timeout=1000)

    return response


# GCS settings: fill in your bucket name and the name of the uploaded file
bucketname = ''
filename = ''

gcs_uri = 'gs://' + bucketname + '/' + filename
response = transcribe_gcs(gcs_uri)

# Write the transcript to a file
with open('output.txt', 'w', encoding='utf-8') as f:
    for result in response.results:
        text = result.alternatives[0].transcript
        f.write(text)
