
Transcribing large mp3 files with Google Speech-to-Text and the Chirp model (Python)

Posted at 2024-03-19

Assumptions

We transcribe an mp3 file with Google Speech-to-Text.
The target is a long mp3 file, an hour or more in length.

Use the Chirp model

Chirp is apparently Google's next-generation Speech-to-Text model, so we use it here.
Pricing does not seem to differ between models, so it makes sense to pick the better model.

Use asynchronous speech recognition (batch recognition)

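Use asynchronous speech recognition to transcribe audio longer than 60 seconds. For shorter audio, synchronous speech recognition is faster and simpler. The upper limit for asynchronous speech recognition is 480 minutes (8 hours).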

Given that, we use batch recognition (BatchRecognition).
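
The duration limits quoted above can be turned into a small check that runs before any API call. The following is a minimal sketch, not part of the Speech-to-Text client library: `choose_recognition_method` and the two constants are hypothetical names introduced here, and the thresholds are simply the 60-second and 480-minute figures from the quoted documentation.

```python
# Thresholds taken from the documentation quoted above.
SYNC_LIMIT_SECONDS = 60           # above this, asynchronous (batch) recognition is needed
BATCH_LIMIT_SECONDS = 480 * 60    # 480 minutes (8 hours) is the asynchronous upper limit


def choose_recognition_method(duration_seconds: float) -> str:
    """Return "sync" or "batch" for a given audio duration (hypothetical helper)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds > BATCH_LIMIT_SECONDS:
        raise ValueError("audio exceeds the 480-minute limit; split it first")
    if duration_seconds > SYNC_LIMIT_SECONDS:
        return "batch"
    return "sync"


print(choose_recognition_method(45))        # a short clip
print(choose_recognition_method(90 * 60))   # a 90-minute recording
```

For the hour-plus files this article targets, the helper always lands on batch recognition, which is why the implementation only covers that path.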

Batch speech recognition can only transcribe audio stored in Cloud Storage. The transcription output can be sent inline in the response (for single-file batch recognition requests) or written to Cloud Storage.

Here we have the output returned inline.
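
Because batch recognition only reads from Cloud Storage, the mp3 has to be uploaded first. The article does not show that step, so the following is only a sketch of one way to do it with the `google-cloud-storage` package. The function name `upload_mp3_to_gcs` and its `client` parameter are assumptions made for illustration; the bucket and object names are placeholders.

```python
def upload_mp3_to_gcs(local_path, bucket_name, blob_name, client=None):
    """Upload a local file to Cloud Storage and return its gs:// URI.

    Hypothetical helper: `client` is optional so that a stand-in object
    can be passed in place of a real storage client.
    """
    if client is None:
        # Imported lazily; requires the google-cloud-storage package
        # and application default credentials.
        from google.cloud import storage
        client = storage.Client()

    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path, content_type="audio/mpeg")
    return f"gs://{bucket_name}/{blob_name}"


# Example call (placeholders, not executed here):
# gcs_uri = upload_mp3_to_gcs("target.mp3", "YOUR_BUCKET_NAME", "mp3s/target.mp3")
```

The returned URI has the same shape as the `gcs_uri` used in the implementation, so it can be passed straight to `BatchRecognizeFileMetadata`.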

Implementation

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech
from google.api_core import client_options

def execute_speech_to_text():
    # Assume the target mp3 is stored in GCS at gs://YOUR_BUCKET_NAME/mp3s/target.mp3
    bucket_name = "YOUR_BUCKET_NAME"
    gcs_uri = f"gs://{bucket_name}/mp3s/target.mp3"

    # Initialize the client with client_options so that the us-central1 API endpoint is called
    # https://stackoverflow.com/questions/76393336/calling-google-cloud-speech-to-text-api-regional-recognizers-using-python-clien
    client_options_var = client_options.ClientOptions(
        api_endpoint="us-central1-speech.googleapis.com"
    )
    client = SpeechClient(client_options=client_options_var)

    # Select the chirp model
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["ja-JP"],
        model="chirp",
    )

    file_metadata = cloud_speech.BatchRecognizeFileMetadata(uri=gcs_uri)

    # When using chirp, specify us-central1 as the location
    request = cloud_speech.BatchRecognizeRequest(
        recognizer="projects/YOUR_PROJECT/locations/us-central1/recognizers/_",
        config=config,
        files=[file_metadata],
        recognition_output_config=cloud_speech.RecognitionOutputConfig(
            inline_response_config=cloud_speech.InlineOutputConfig(),
        ),
        processing_strategy=cloud_speech.BatchRecognizeRequest.ProcessingStrategy.DYNAMIC_BATCHING,
    )

    operation = client.batch_recognize(request=request)
    print("Waiting for operation to complete...")
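    # Long audio can take far more than 2 minutes, especially with DYNAMIC_BATCHING,
    # which trades latency for cost. result() raises a timeout error if the limit
    # is reached, so adjust this value to the length of the audio.
    response = operation.result(timeout=3600)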

    for result in response.results[gcs_uri].transcript.results:
        if len(result.alternatives) > 0:
            print(f"Transcript: {result.alternatives[0].transcript}")

    return response.results[gcs_uri].transcript
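
The function above prints each segment and returns the transcript object. For an hour-long recording it is often handier to have the whole text as one string, for example to write it to a file. The sketch below assumes only what the loop above already relies on: each result has an `alternatives` list whose first entry has a `transcript` string. `join_transcripts` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real results so the block runs without calling the API.

```python
from types import SimpleNamespace


def join_transcripts(results) -> str:
    """Concatenate the top alternative of every result that has one."""
    pieces = []
    for result in results:
        # Some results can come back without alternatives; skip those.
        if result.alternatives:
            pieces.append(result.alternatives[0].transcript.strip())
    # Drop empty pieces and put one segment per line.
    return "\n".join(piece for piece in pieces if piece)


# Stand-in results shaped like the objects iterated over above.
fake_results = [
    SimpleNamespace(alternatives=[SimpleNamespace(transcript=" first segment ")]),
    SimpleNamespace(alternatives=[]),
    SimpleNamespace(alternatives=[SimpleNamespace(transcript="second segment")]),
]
print(join_transcripts(fake_results))
```

With the real API, the call would look like `join_transcripts(execute_speech_to_text().results)`, since the function returns the transcript whose `results` field is iterated in the loop above.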

References
