
株式会社プライム・ブレインズ

Trying Google Cloud Speech-to-Text from a client application

Posted at 2020-07-25

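While developing a web app, I had a requirement to transcribe the user's speech, that is, to do speech recognition in the browser. These are my notes from looking into the options.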

TL;DR

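- Building speech recognition that works in every environment and every browser is fairly hard.
- SpeechRecognition from the Web Speech API, a speech recognition API that runs entirely in the browser, works only in a very limited set of browsers such as Chrome, but it is very easy to use.
- I am not sure whether the Google Speech-to-Text library can be used from the web (JavaScript), although Google's own demo makes it look possible. → I have confirmed that the REST interface can be called.
- In the end, I confirmed that the following works:
  - Record audio with AudioContext from the Web Audio API, which worked across a variety of browsers
  - Convert the audio to WAV format
  - Pass the WAV audio to **the Google Speech-to-Text REST interface** and get the recognition result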
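The "convert the audio to WAV format" step can be sketched roughly as follows. This is a minimal sketch, not the code from the sample repository: it assumes mono Float32 samples in the range -1 to 1, as the Web Audio API typically delivers them, and the function name `encodeWav` is hypothetical. It writes a 44-byte RIFF/WAVE header followed by 16-bit little-endian PCM, which corresponds to the LINEAR16 encoding used in this post.

```typescript
// Hypothetical helper (not from the sample repository): wraps mono Float32 PCM
// samples in a 16-bit little-endian WAV container.
function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const bytesPerSample = 2
  const numChannels = 1
  const dataSize = samples.length * bytesPerSample
  const buffer = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buffer)

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i))
    }
  }

  writeAscii(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true) // file size minus the first 8 bytes
  writeAscii(8, 'WAVE')
  writeAscii(12, 'fmt ')
  view.setUint32(16, 16, true) // size of the fmt chunk for PCM
  view.setUint16(20, 1, true) // audio format 1 = linear PCM
  view.setUint16(22, numChannels, true)
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * numChannels * bytesPerSample, true) // byte rate
  view.setUint16(32, numChannels * bytesPerSample, true) // block align
  view.setUint16(34, 8 * bytesPerSample, true) // bits per sample
  writeAscii(36, 'data')
  view.setUint32(40, dataSize, true)

  // Clamp each sample to [-1, 1] and scale it to a signed 16-bit integer.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]))
    view.setInt16(44 + i * bytesPerSample, s < 0 ? s * 0x8000 : s * 0x7fff, true)
  }
  return buffer
}
```

The resulting ArrayBuffer can then be Base64-encoded and sent as `audio.content`, in the same way as the file-based examples in this post.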

In this post

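This post covers only the first step: using Google Speech-to-Text from a client application.
Google Speech-to-Text is a service (a demo is available here) that can recognize an uploaded audio file or recognize speech from the microphone in the browser¹.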

Prerequisites and environment

% sw_vers
ProductName:    Mac OS X
ProductVersion: 10.15.6
BuildVersion:   19G73

% node --version
v10.19.0

% firebase --version
8.6.0
%

I did the audio data conversion on a Mac this time, but the programs themselves should also run on Windows.

Preparation

Start using Firebase and Google Cloud Platform (GCP). For details, see the "Signing up for Firebase" and "Signing up for GCP" sections of the article on signing up for Firebase and Google Cloud Platform.
Clone the source:

% git clone https://github.com/masatomix/speech_node_samples.git
Cloning into 'speech_node_samples'...
Resolving deltas: 100% (75/75), done.

% cd speech_node_samples/sample_ts 
% ls -lrt
total 1944
-rw-r--r--    1 masatomix  staff    5919  7 17 21:26 tsconfig.json
-rw-r--r--    1 masatomix  staff     890  7 24 01:25 package.json
-rw-r--r--    1 masatomix  staff  109687  7 24 01:25 package-lock.json
drwxr-xr-x    6 masatomix  staff     192  7 24 11:41 src
-rw-r--r--@   1 masatomix  staff    7090  7 24 20:56 README.md
% npm install
...
%

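Following the instructions here, obtain the service account file firebase-adminsdk.json and place it in the directory shown above.
Then enable the Cloud Speech-to-Text API, following the beginning of the Google Cloud Platform quickstart.²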

Running Google Speech-to-Text from a Node.js client application

Using the library

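Let's try **"Using client libraries"** from the list of all quickstarts.
First, prepare an audio file. I recorded one with the iPhone Voice Memos app (sample.m4a) and converted it to a WAV file with afconvert, a converter available on the Mac.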

Reference: https://tsukada.sumito.jp/2019/06/11/google-speech-api-japanese/

If you are on Windows, please prepare an audio file by whatever means suits you :-)

% ls -lrt
... (omitted)
-rw-r--r--@   1 masatomix  staff   39898  7 24 13:31 sample.m4a  ← file created with iPhone Voice Memos
% 
% afconvert -f WAVE -d LEI16 sample.m4a sample.wav
% ls -lrt
... (omitted)
-rw-r--r--@   1 masatomix  staff   39898  7 24 13:31 sample.m4a
-rw-r--r--    1 masatomix  staff  420274  7 24 14:02 sample.wav  ← converted
%
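To check that a converted file really is 16-bit linear PCM, and to see its sample rate, the WAV header can be read directly. The following is a minimal sketch under the assumption that the file is a little-endian RIFF/WAVE file; `readWavFormat` and `WavFormat` are hypothetical names, not part of any library. It walks the chunk list instead of assuming a fixed header size, since a WAV file may contain other chunks before the audio data.

```typescript
interface WavFormat {
  audioFormat: number // 1 means linear PCM
  numChannels: number
  sampleRate: number
  bitsPerSample: number
}

// Hypothetical helper: walks the RIFF chunks of a WAV file and returns the
// contents of the "fmt " chunk. Chunks other than "fmt " are skipped.
function readWavFormat(bytes: Uint8Array): WavFormat {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const tag = (offset: number): string =>
    String.fromCharCode(bytes[offset], bytes[offset + 1], bytes[offset + 2], bytes[offset + 3])

  if (bytes.length < 12 || tag(0) !== 'RIFF' || tag(8) !== 'WAVE') {
    throw new Error('Not a RIFF/WAVE file')
  }

  let offset = 12
  while (offset + 8 <= bytes.length) {
    const chunkId = tag(offset)
    const chunkSize = view.getUint32(offset + 4, true)
    if (chunkId === 'fmt ') {
      if (offset + 24 > bytes.length) {
        throw new Error('Truncated fmt chunk')
      }
      return {
        audioFormat: view.getUint16(offset + 8, true),
        numChannels: view.getUint16(offset + 10, true),
        sampleRate: view.getUint32(offset + 12, true),
        bitsPerSample: view.getUint16(offset + 22, true),
      }
    }
    // Chunk bodies are padded to an even number of bytes.
    offset += 8 + chunkSize + (chunkSize % 2)
  }
  throw new Error('fmt chunk not found')
}
```

In Node.js, the Buffer returned by `fs.readFileSync('./sample.wav')` can be passed as-is, because a Buffer is a Uint8Array. For the LINEAR16 encoding, the expected values are `audioFormat` 1 and `bitsPerSample` 16.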

The audio file has been converted. The final layout now looks like this:

% ls -lrt
total 1944
-rw-r--r--    1 masatomix  staff    5919  7 17 21:26 tsconfig.json
-rw-r--r--    1 masatomix  staff     890  7 24 01:25 package.json
-rw-r--r--    1 masatomix  staff  109687  7 24 01:25 package-lock.json
-rw-r--r--@   1 masatomix  staff    7090  7 24 20:56 README.md
drwxr-xr-x    6 masatomix  staff     192  7 24 11:41 src
-rw-r--r--@   1 masatomix  staff   39898  7 24 13:31 sample.m4a
-rw-r--r--    1 masatomix  staff  420274  7 24 14:02 sample.wav
-rw-r--r--@   1 masatomix  staff    2335  7 17 15:21 firebase-adminsdk.json

Run the commands below.
They define the environment variable that points to the firebase-adminsdk.json downloaded during preparation, and then run the code in src/index.ts.

% pwd
/xxx/speech_node_samples/sample_ts
% export GOOGLE_APPLICATION_CREDENTIALS="`pwd`/firebase-adminsdk.json"
%
% npx ts-node src/index.ts
{ results: [ { alternatives: [Array], channelTag: 0 } ] }
Transcription: ボイスメモのテストです。
%

Speech recognition works!
Of course, the result depends on the audio file :-)

Inside the code

Let's look at the code. It is almost identical to the quickstart mentioned above.

src/index.ts

import speech from '@google-cloud/speech'
import fs from 'fs'

async function main() {
  // Creates a client
  const client = new speech.SpeechClient()

  // The name of the audio file to transcribe
  const fileName = './sample.wav'

  // Reads a local audio file and converts it to base64
  const file: Buffer = fs.readFileSync(fileName)
  const audioBytes: string = file.toString('base64')

  // The audio file's encoding, sample rate in hertz, and BCP-47 language code
  // https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig
  const config = {
    enableAutomaticPunctuation: true,
    encoding: 'LINEAR16',
    languageCode: 'ja-JP',
    model: 'default',
  }
  const request: any = {
    audio: {
      content: audioBytes,
    },
    config: config,
  }

  // Detects speech in the audio file
  const [response]: Array<any> = await client.recognize(request)
  console.log(response)
  const transcription = response.results.map((result: any) => result.alternatives[0].transcript).join('\n')
  console.log(`Transcription: ${transcription}`)
}

if (!module.parent) {
  main().catch(console.error)
}

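The flow is simple:

- Read the file sample.wav
- Base64-encode it into a string
- Call the library's recognize method with the Base64 string and the config parameters
- Get the result

With this, we were able to run speech recognition on an audio file from a client application using the library.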
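The "get the result" step in the code indexes straight into `response.results` and `alternatives[0]`, with everything typed as `any`. As a hedged illustration, the sketch below narrows the response to just the fields this post uses and tolerates a response without results, which may happen for instance when no speech is detected. The interface names and `toTranscription` are hypothetical; the real response type from the library has many more fields.

```typescript
// Minimal shapes for the fields this post uses; assumptions for illustration only.
interface SpeechAlternative {
  transcript?: string
  confidence?: number
}
interface SpeechResult {
  alternatives?: SpeechAlternative[]
}
interface RecognizeResponse {
  results?: SpeechResult[]
}

// Hypothetical helper: joins the top alternative of each result with newlines,
// and returns an empty string when there is nothing to join.
function toTranscription(response: RecognizeResponse): string {
  return (response.results ?? [])
    .map((result) => result.alternatives?.[0]?.transcript ?? '')
    .filter((text) => text.length > 0)
    .join('\n')
}
```

With a helper like this, the two lines that build `transcription` in src/index.ts could be replaced by a single call such as `toTranscription(response)`.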

Calling the REST interface

Next, instead of going through the @google-cloud/speech library, let's call the REST interface directly.
The REST interface specification is here.

As additional preparation, follow "Creating an app in the Firebase project" in the article on signing up for Firebase and Google Cloud Platform, and save the firebaseConfig values as shown below.

% cat ./src/firebaseConfig.ts 
export default {
  apiKey: 'xx',
  authDomain: 'xx',
  databaseURL: 'xx',
  projectId: 'xx',
  storageBucket: 'xx',
  messagingSenderId: 'xx',
  appId: '1:xx',
}
%  ↑ create a file like this by hand

Now let's run it.

% export GOOGLE_APPLICATION_CREDENTIALS= 
//  The environment variable is no longer needed, so reset it
% npx ts-node ./src/index2.ts
Transcription: ボイスメモのテストです。
%

Again, speech recognition appears to work!

Inside the code

The structure is almost the same as before, but this time the code calls the REST interface directly instead of using the library.

src/index2.ts

import fs from 'fs'
import request from 'request'
import firebaseConfig from './firebaseConfig'

const createRequestPromise = (option: any): Promise<Array<any>> => {
  const promise: Promise<any> = new Promise((resolve, reject) => {
    request(option, function (err: any, response: any, body: string) {
      if (err) {
        reject(err)
        return
      }
      if (response.statusCode >= 400) {
        // Stop here so that resolve() is not also called for an error response
        reject(new Error(JSON.stringify(body)))
        return
      }
      resolve(body)
    })
  })
  return promise
}

function main() {
  const API_KEY = firebaseConfig.apiKey

  // The name of the audio file to transcribe
  const fileName = './sample.wav'

  // Reads a local audio file and converts it to base64
  const file: Buffer = fs.readFileSync(fileName)
  const audioBytes: string = file.toString('base64')

  // The audio file's encoding, sample rate in hertz, and BCP-47 language code
  const config = {
    enableAutomaticPunctuation: true,
    encoding: 'LINEAR16',
    languageCode: 'ja-JP',
    model: 'default',
  }
  const request: any = {
    audio: {
      content: audioBytes,
    },
    config: config,
  }

  const option = {
    uri: `https://speech.googleapis.com/v1p1beta1/speech:recognize?key=${API_KEY}`,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Accept: 'application/json',
    },
    json: request,
  }

  createRequestPromise(option)
    .then((response: any) => {
      const transcription = response.results.map((result: any) => result.alternatives[0].transcript).join('\n')
      console.log(`Transcription: ${transcription}`)
    })
    .catch((error) => console.log(error))
}

if (!module.parent) {
  main()
}

With this, we were able to run speech recognition on an audio file from a client application using REST.
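The part of src/index2.ts that assembles the REST call can also be pulled out into a small pure function, which makes it easier to check the URL and JSON body without sending anything. The following is only a sketch: `buildRecognizeOption` and the two interfaces are hypothetical names, and the config type lists only the fields used in this post plus `sampleRateHertz`. Unlike the original code, it passes the API key through `encodeURIComponent` before placing it in the query string.

```typescript
// Only the RecognitionConfig fields used in this post, plus sampleRateHertz.
interface RecognitionConfig {
  encoding: string
  languageCode: string
  enableAutomaticPunctuation?: boolean
  model?: string
  sampleRateHertz?: number
}

// Shape of the option object that index2.ts hands to the `request` package.
interface RecognizeOption {
  uri: string
  method: 'POST'
  headers: { [name: string]: string }
  json: { audio: { content: string }; config: RecognitionConfig }
}

// Hypothetical helper: builds the option object for the speech:recognize endpoint.
function buildRecognizeOption(
  apiKey: string,
  audioBase64: string,
  config: RecognitionConfig
): RecognizeOption {
  return {
    uri: `https://speech.googleapis.com/v1p1beta1/speech:recognize?key=${encodeURIComponent(apiKey)}`,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Accept: 'application/json',
    },
    json: { audio: { content: audioBase64 }, config },
  }
}
```

The returned object has the same shape as `option` in src/index2.ts, so it could be passed to `createRequestPromise` unchanged.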

An aside

By the way, on a Mac you can inspect audio data with commands such as afinfo.

$ afinfo sample.m4a
File:           sample.m4a
File type ID:   m4af
Num Tracks:     1
----
Data format:     1 ch,  48000 Hz, 'aac ' (0x00000000) 0 bits/channel, 0 bytes/packet, 1024 frames/packet, 0 bytes/frame
                no channel layout.
estimated duration: 4.335187 sec
audio bytes: 36206
audio packets: 206
bit rate: 65908 bits per second
packet size upper bound: 275
maximum packet size: 275
audio data file offset: 44
not optimized
audio 208089 valid frames + 2112 priming + 743 remainder = 210944
format list:
[ 0] format:      1 ch,  48000 Hz, 'aac ' (0x00000000) 0 bits/channel, 0 bytes/packet, 1024 frames/packet, 0 bytes/frame
Channel layout: Mono
----

The afplay command can play audio.

% afplay sample.wav
% (the audio is playing)

Reference: https://qiita.com/fromage-blanc/items/32e2ba83b79151e5ecb9

Handy.

Summary

We ran speech recognition on an audio file from a client application using the library.
We ran speech recognition on an audio file from a client application using REST.

Next time, let's try speech recognition using the microphone.

That's it for this time.
