
Using Google's Cloud Text-to-Speech from Python

Posted at 2021-04-29

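Introduction

This is not really because of the news about Coe Font STUDIO, but for various reasons I tried out Google's TTS, checking each step of Google's speech synthesis one at a time.

First, ask Google for permission to use the API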

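Basically, you register a Google account (registering a credit card number is mandatory), log in to Google Cloud Platform, create a project, and enable the Cloud Text-to-Speech API for that project.
These days most people probably already have a Google account or two. In that case you need to add a credit card number separately (if you have not registered one yet). Register it in the Google payments center. From your Google account, you should be able to get there via "Payments & subscriptions" and then "Manage payment methods" (as of 2021/4/28).
See the linked page for details.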

Credentials

Preparation

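As the link above also notes, the Google Cloud SDK appears to be needed to obtain the credentials for using the API. Download the package that matches your environment from here. I worked on a Mac, so I confirmed an x86 environment with uname -a and downloaded the matching package, "macOS 64-bit (x86_64)".
Extract it with tar xvzf filename.tgz and confirm that a google-cloud-sdk directory has been created.
Run ./google-cloud-sdk/install.sh as instructed. Confirm with tail ~/.zshrc that the path has been added.
Start a new shell (so that the added settings take effect) and run gcloud init. (If the settings have not taken effect, the shell will tell you it knows no such command as gcloud.)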
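
The "no such command" situation can also be checked from Python before running anything else. This is a minimal sketch using only the standard library; command_available is a hypothetical helper name for illustration, not part of the Cloud SDK.

```python
import shutil


def command_available(name: str) -> bool:
    """Return True if `name` resolves to an executable on the current PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if command_available("gcloud"):
        print("gcloud found at", shutil.which("gcloud"))
    else:
        print("gcloud is not on PATH; open a new shell or re-check ~/.zshrc")
```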

Creating credentials

Log in to Google Cloud Platform, open the hamburger menu at the top left (shown as "Navigation menu"), select "APIs & Services", and choose "Credentials" from the menu that appears.

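Click the "Create credentials" link at the top of the Credentials screen; a pull-down menu appears, so select "Service account".
Enter a "Service account name" and "Service account description" that you will recognize, then click the Create button. Set "Role" to "Owner"; the rest is optional, so click Continue to finish creation. When done you return to the Credentials screen; confirm that the account you just created appears in the service account list at the bottom.
Click the created account to go to the service account details screen and select "Keys" at the top. Click "Add key" and select "Create new key" from the menu that appears. Leave the key type at the default "JSON" and click the Create button. Once the key has downloaded, you are done.
After that, run gcloud auth activate-service-account --key-file=<path to key file>. To be safe, also run export GOOGLE_APPLICATION_CREDENTIALS=<path to key file>. The client library refers to this variable when it is used at the end.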
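
Before moving on, it can help to confirm that the environment variable really points at a readable key file. The following is a minimal sketch, assuming the key is the JSON file downloaded in the previous step; load_key_summary is a hypothetical helper name, and it deliberately returns only non-secret fields.

```python
import json
import os


def load_key_summary(path: str) -> dict:
    """Read a service-account key file and return a few non-secret fields."""
    with open(path, encoding="utf-8") as f:
        key = json.load(f)
    # "type", "project_id" and "client_email" are fields found in downloaded
    # service-account keys; the private key itself is not returned.
    return {name: key.get(name) for name in ("type", "project_id", "client_email")}


if __name__ == "__main__":
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        print("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not os.path.isfile(path):
        print("key file not found:", path)
    else:
        print(load_key_summary(path))
```

Note that export only affects the current shell, so the variable has to be set again in every new terminal unless it is added to ~/.zshrc.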

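If gcloud auth print-access-token returns a result normally, things should be fine, but to be safe, confirm with gcloud config list that the settings are what you expect.

Trying it out

Before that
You can try it on the demo site. You may have already done so; here, the goal is to check the JSON that is actually sent in the request.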

{
  "audioConfig": {
    "audioEncoding": "LINEAR16",
    "pitch": 0,
    "speakingRate": 1
  },
  "input": {
    "text": "こにゃにゃにゃちわ、元気ですかー"
  },
  "voice": {
    "languageCode": "ja-JP",
    "name": "ja-JP-Standard-C"
  }
}

You get JSON like this.
Using this JSON, let's run TTS with curl.

Running with curl

Save the JSON above as request.json and run the following command. (Mac environment)

curl -H "Authorization: Bearer "$(gcloud auth print-access-token) -H "Content-Type: application/json; charset=utf-8" -d @request.json https://texttospeech.googleapis.com/v1beta1/text:synthesize > result.json

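Once result.json has been retrieved successfully, the audio data needs to be extracted from it.
The Google guide says to base64-decode the relevant part (see below).

1. Copy only the base64-encoded content into a text file.
2. Decode the source text file using the base64 command-line tool.

Copying the audio part in step 1 above is tedious, so let's use a handy command (jq). The command looks like this.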

cat result.json | jq -r .audioContent |  base64 --decode - > result.wav

Extracting the audio from the JSON in Python

Create the source below and run it with python toWav.py. It replaces the command above.

# toWav.py
import base64
import json

inputf = 'result.json'
outputf = 'result.wav'

# Load the API response saved by curl
with open(inputf) as f:
    df = json.load(f)

# audioContent holds the audio as a base64 string
b64str = df["audioContent"]
binary = base64.b64decode(b64str)

# b64decode already returns bytes, so write them as-is (no numpy needed)
with open(outputf, "wb") as f:
    f.write(binary)
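
To check that the decoded file is a usable WAV, its header can be read with the standard wave module. As far as I understand the API documentation, LINEAR16 audio content includes a WAV header, which is why the decoded bytes can be saved directly as .wav. This is a minimal sketch; describe_wav is a hypothetical helper name.

```python
import os
import wave


def describe_wav(path: str) -> dict:
    """Return basic format information read from a WAV file header."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_sec": frames / rate if rate else 0.0,
        }


if __name__ == "__main__":
    # Only inspect the file if the earlier steps have produced it.
    if os.path.exists("result.wav"):
        print(describe_wav("result.wav"))
```

If wave.open raises wave.Error here, the file is probably not LINEAR16 output (for example, an MP3 saved with a .wav name).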

Generating request.json in Python

Create the source below and run it with python txt2rjson.py.

# txt2rjson.py
import json

def makeRequestDict(txt: str):
    dat = {"audioConfig": {
        "audioEncoding": "LINEAR16",
        "pitch": 0,
        "speakingRate": 1
      },
      "voice": {
        "languageCode": "ja-JP",
        "name": "ja-JP-Standard-B"
      }
    }

    dat["input"] = {"text": txt}
    return dat

dat = makeRequestDict("こにゃにゃちわ、元気ですか〜")

outjson = "request.json"
with open(outjson, 'w', encoding='utf-8') as f:
    json.dump(dat, f, indent=2, ensure_ascii=False)

Change the string passed to makeRequestDict and run the script to write a new request.json, then run the curl command shown earlier.

Doing the curl part in Python too

#req.py
import urllib.error
import urllib.request
import json
import subprocess as sp

def get_token():
    res = sp.run('gcloud auth print-access-token',
            shell=True, stdout=sp.PIPE, stderr=sp.PIPE,
            encoding='utf-8')
    print(res.stderr)
    return res.stdout.strip()


token = get_token()
url = 'https://texttospeech.googleapis.com/v1beta1/text:synthesize'
req_header = {
        'Authorization': f"Bearer {token}",
        'Content-Type': 'application/json; charset=utf-8',
}

out_json = 'result.json'
req_json_file = 'request.json'

with open(req_json_file, encoding='utf-8') as f:
    req_data = f.read()

req = urllib.request.Request(url, data=req_data.encode(), method='POST', headers=req_header)

try:
    with urllib.request.urlopen(req) as response:
        dat = response.read()
        body = json.loads(dat)

        with open(out_json, 'w') as f:
            json.dump(body, f, indent=2)

except urllib.error.URLError as e:
    print("error happen...")
    print(e.reason)
    print(e)

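This obtains a token, sets request.json and the other items that were curl parameters, accesses the given URL, and receives the result as result.json.

As for the overall flow: txt2rjson.py creates request.json, req.py uses it to send the request to Google and obtain result.json, and toWav.py extracts result.wav from result.json.

Running the whole sequence together

Let's combine the scripts created so far into one.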
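
When the request fails with an HTTP status (for example an expired token or an API that is not enabled), e.reason alone is often not very informative, while the response body usually says more. The following is a minimal sketch of reading that body; describe_http_error is a hypothetical helper name, and the {"error": {"status": ..., "message": ...}} body shape is an assumption based on how Google APIs commonly report errors.

```python
import io
import json
import urllib.error


def describe_http_error(e: urllib.error.HTTPError) -> str:
    """Build a one-line message from an HTTPError, using the JSON body if present."""
    raw = e.read().decode("utf-8", errors="replace")
    try:
        # Assumed body shape: {"error": {"status": ..., "message": ...}}
        err = json.loads(raw).get("error", {})
        detail = f'{err.get("status", "")} {err.get("message", "")}'.strip()
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or not the assumed shape: fall back to the raw text.
        detail = raw.strip()
    return f"HTTP {e.code}: {detail or e.reason}"


if __name__ == "__main__":
    # Simulated error response, so the sketch runs without network access.
    body = json.dumps({"error": {"status": "UNAUTHENTICATED", "message": "token expired"}})
    fake = urllib.error.HTTPError("https://example.invalid", 401, "Unauthorized",
                                  {}, io.BytesIO(body.encode()))
    print(describe_http_error(fake))
```

HTTPError is a subclass of URLError, so in a real script an except urllib.error.HTTPError clause has to come before except urllib.error.URLError to be reached.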

import base64
import numpy as np

import urllib.error
import urllib.request
import json
import subprocess as sp

def get_token() -> str:
    """
    Return the output of `gcloud auth print-access-token`.

    Assumes gcloud has been set up and authenticated for
    Google Text-to-Speech.
    """
    res = sp.run('gcloud auth print-access-token',
            shell=True, stdout=sp.PIPE, stderr=sp.PIPE,
            encoding='utf-8')
    print(res.stderr)
    return res.stdout.strip()

def makeRequestDict(txt: str) -> dict:
    """
    Build the request payload for Google Text-to-Speech.
    SSML is not supported.

    Args:
        txt(in): text to synthesize

    Returns:
        dict with the information needed for speech synthesis
    """
    dat = {"audioConfig": {
        "audioEncoding": "LINEAR16",
        "pitch": 0,
        "speakingRate": 1
      },
      "voice": {
        "languageCode": "ja-JP",
        "name": "ja-JP-Standard-B"
      }
    }

    dat["input"] = {"text": txt}
    return dat

def output_wav(dat: dict, ofile: str) -> None:
    """
    Convert the result of a Google Text-to-Speech request into audio data
    and write it to a file.

    Args:
        dat(in):   dict parsed from the JSON string returned by the request
        ofile(in): name of the file to write the audio data to
    """
    b64str = dat["audioContent"]
    binary = base64.b64decode(b64str)
    dat = np.frombuffer(binary,dtype=np.uint8)
    with open(ofile,"wb") as f:
        f.write(dat)

def gtts(txt: str, ofile: str) -> None:

    dat = makeRequestDict(txt)
    req_data = json.dumps(dat).encode()

    url = 'https://texttospeech.googleapis.com/v1beta1/text:synthesize'
    token = get_token()
    req_header = {
            'Authorization': f"Bearer {token}",
            'Content-Type': 'application/json; charset=utf-8',
    }
    req = urllib.request.Request(url, data=req_data, method='POST', headers=req_header)

    try:
        with urllib.request.urlopen(req) as response:
            dat = response.read()
            body = json.loads(dat)
            output_wav(body, ofile)
            print("done..")
    except urllib.error.URLError as e:
        print("error happen...")
        print(e.reason)
        print(e)


if __name__ == "__main__":
    gtts("こにゃにゃちわ、元気ですか〜", "result2.wav")

It would be better to pass the output file and the text in as parameters.

In the end, using the library is easier

Trying it with pipenv
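
One way to take the text and output file from the command line is the standard argparse module. This is a minimal sketch; parse_args, the positional text argument, and the -o/--output option are names chosen here for illustration, not something the original script defines.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the text to synthesize and the output file name."""
    parser = argparse.ArgumentParser(description="Synthesize text to a WAV file.")
    parser.add_argument("text", help="text to synthesize")
    parser.add_argument("-o", "--output", default="result.wav",
                        help="output WAV file name (default: result.wav)")
    return parser.parse_args(argv)


# Example with an explicit argument list; passing None reads sys.argv instead.
args = parse_args(["hello", "-o", "hello.wav"])
print(args.text, args.output)
```

In the combined script, the last line of the main block would then become something like gtts(args.text, args.output) with args = parse_args().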

$ pip install pipenv
$ pipenv install
$ pipenv shell
$ pip install google-cloud-texttospeech
$ vi gct_cli.py

Almost straight from the sample

# gct_cli.py
"""Synthesizes speech from the input string of text or ssml.

Note: ssml must be well-formed according to:
    https://www.w3.org/TR/speech-synthesis/
"""
from google.cloud import texttospeech

# Instantiates a client
client = texttospeech.TextToSpeechClient()

# Set the text input to be synthesized
synthesis_input = texttospeech.SynthesisInput(text="こにゃにゃちわ、元気ですか〜")

# Build the voice request: select the language code ("ja-JP"), the voice
# name, and the ssml voice gender ("neutral")
voice = texttospeech.VoiceSelectionParams(
        name="ja-JP-Standard-B",
        language_code="ja-JP",
        ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL
)

# Select the type of audio file you want returned
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16
)

# Perform the text-to-speech request on the text input with the selected
# voice parameters and audio file type
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

# The response's audio_content is binary.
with open("output.wav", "wb") as out:
    # Write the response to the output file.
    out.write(response.audio_content)
    print('Audio content written to file "output.wav"')

The encoding is set to uncompressed here, but something like MP3 would use less bandwidth (the sample uses MP3, after all).
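
For the REST-based scripts earlier in this article, switching to MP3 comes down to changing audioEncoding in the request and the extension of the output file. This is a minimal sketch under that assumption; make_request_dict, output_name, and the EXTENSIONS mapping are hypothetical names for illustration, and only the two encodings mentioned in this article are handled.

```python
import json

# Encoding name -> file extension. "LINEAR16" and "MP3" are audioEncoding
# values accepted by the API; the mapping itself exists only for this sketch.
EXTENSIONS = {"LINEAR16": ".wav", "MP3": ".mp3"}


def make_request_dict(txt: str, encoding: str = "MP3") -> dict:
    """Build a synthesize request body with the chosen audio encoding."""
    if encoding not in EXTENSIONS:
        raise ValueError(f"unsupported encoding in this sketch: {encoding}")
    return {
        "audioConfig": {"audioEncoding": encoding, "pitch": 0, "speakingRate": 1},
        "input": {"text": txt},
        "voice": {"languageCode": "ja-JP", "name": "ja-JP-Standard-B"},
    }


def output_name(stem: str, encoding: str = "MP3") -> str:
    """Return a file name whose extension matches the encoding."""
    return stem + EXTENSIONS[encoding]


print(json.dumps(make_request_dict("こんにちは"), ensure_ascii=False, indent=2))
print(output_name("result"))
```

With the client library, the corresponding change would be audio_encoding=texttospeech.AudioEncoding.MP3 and an output file name ending in .mp3.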
