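Introduction
As part of my preparation for the MLS exam, I tried out Amazon Transcribe.
Amazon Transcribe is a managed speech recognition service from AWS. Even without machine learning expertise, you can run high-accuracy speech recognition simply by uploading an audio file.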
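1. What is Amazon Transcribe?
Overview
Amazon Transcribe is a managed speech recognition service provided by AWS that uses automatic speech recognition (ASR) to convert speech into text.
Key features
- Managed service: AWS handles infrastructure management and scaling
- Multi-language support: supports many languages, including English, Japanese, and Chinese
- Custom vocabularies: improve recognition accuracy for technical terms and proper nouns
- Real-time processing: real-time recognition of streaming audio
- Batch processing: transcription of stored audio files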
2. Preparing the audio files
2.1 Creating the sample audio files
First, prepare sample MP3 files for speech recognition.
Audio script: Japanese version
こんにちは、今日はAmazon Transcribeのハンズオンを行います。
音声認識の精度を確認するために、様々な単語や文章を読み上げます。
機械学習、人工知能、クラウドコンピューティングなどの専門用語も含まれています。
この音声ファイルを使って、Amazon Transcribeの性能をテストしてみましょう。
Audio script: English version
Hello, today we'll be doing a hands-on with Amazon Transcribe.
To check the accuracy of speech recognition, we'll be reading out various words and sentences.
It also includes technical terms from machine learning, artificial intelligence, cloud computing, and more.
Let's use this audio file to test the performance of Amazon Transcribe.
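File formats supported by Amazon Transcribe
- MP3: the most common format
- MP4: audio is extracted from the video file
- WAV: uncompressed audio format
- FLAC: lossless compressed audio format
- WebM: audio format for the web
- AMR: format used mainly in mobile communications
- OGG: format used in PC games and elsewhere
I used MP3 this time.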
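When a job is submitted through the API rather than the console, the format is passed as a MediaFormat string. The following is a minimal sketch that derives that string from the file extension, assuming the extension reliably reflects the actual encoding. The helper name media_format_for and the sample file name are hypothetical and do not appear in the walkthrough.

```python
from pathlib import Path

# MediaFormat strings for the formats listed above.
SUPPORTED_FORMATS = {"mp3", "mp4", "wav", "flac", "webm", "amr", "ogg"}


def media_format_for(path: str) -> str:
    """Return the MediaFormat string for an audio file, based on its extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {ext!r}")
    return ext


if __name__ == "__main__":
    # Hypothetical file name used only for illustration.
    print(media_format_for("sample-japanese.mp3"))
```

Checking the extension up front catches an unsupported file before anything is uploaded or a job is started.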
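3. Uploading the audio files
3.1 Preparing the S3 bucket
- Log in to the AWS Management Console
- Go to the Amazon S3 service
- Create a new bucket (e.g., qiita-demo-transcribe)
3.2 Uploading the audio files
- Select the S3 bucket you created
- Click the Upload button
- Upload the MP3 files you prepared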
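The same upload can be done from code. This is a rough sketch using boto3, assuming the bucket already exists and AWS credentials are configured locally. The object key and local file name are hypothetical; only the bucket name comes from the steps above.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Build the S3 URI that a transcription job takes as its input location."""
    return f"s3://{bucket}/{key}"


def upload_audio(local_path: str, bucket: str, key: str) -> str:
    """Upload a local audio file to S3 and return its S3 URI."""
    # Imported lazily so that s3_uri() works even where boto3 is not installed.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


if __name__ == "__main__":
    # Hypothetical key; prints the URI without contacting AWS.
    print(s3_uri("qiita-demo-transcribe", "sample-japanese.mp3"))
```

Calling upload_audio("sample-japanese.mp3", "qiita-demo-transcribe", "sample-japanese.mp3") would perform the actual upload and return the URI to enter when creating the job.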
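4. Running Amazon Transcribe
4.1 Processing the Japanese audio file
1. Log in to the AWS Management Console
2. Go to the Amazon Transcribe service
3. Click Transcription jobs
4. Click Create job
5. Enter a job name (e.g., qiita-demo-japanese)
6. Select Japanese as the language
7. Enter the S3 URI of the uploaded file
8. Select "Customer specified S3 bucket" so that the results are saved to the S3 bucket you created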
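The console steps above correspond roughly to a single start_transcription_job call in boto3. This is a sketch under some assumptions: the object key sample-japanese.mp3 is hypothetical, and the helper names build_job_request and start_job are made up for illustration.

```python
def build_job_request(job_name: str, media_uri: str, language_code: str,
                      output_bucket: str, media_format: str = "mp3") -> dict:
    """Assemble the keyword arguments for start_transcription_job."""
    return {
        "TranscriptionJobName": job_name,        # step 5: job name
        "LanguageCode": language_code,           # step 6: language
        "MediaFormat": media_format,
        "Media": {"MediaFileUri": media_uri},    # step 7: S3 URI of the audio
        "OutputBucketName": output_bucket,       # step 8: your own output bucket
    }


def start_job(request: dict) -> dict:
    """Submit the job. Requires boto3 and configured AWS credentials."""
    import boto3

    transcribe = boto3.client("transcribe")
    return transcribe.start_transcription_job(**request)


if __name__ == "__main__":
    request = build_job_request(
        job_name="qiita-demo-japanese",
        media_uri="s3://qiita-demo-transcribe/sample-japanese.mp3",  # hypothetical key
        language_code="ja-JP",
        output_bucket="qiita-demo-transcribe",
    )
    print(request)
```

Keeping the request as a plain dictionary makes it easy to inspect before any call is made to AWS.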
Output
{"jobName":"qiita-demo-japanese","accountId":"XXXXXXXXXXXX","status":"COMPLETED","results":{"transcripts":[{"transcript":"こんにちは。今日はアマゾントランスクライブのハンズオンを行います。音声認識の精度を確認するために、さまざまな単語や文章を読み上げます。機械学習、人工知能、クラウドコンピューティングなどの専門用語も含まれています。この音声ファイルを使って、アマゾントランスクライブの性能をテストしてみましょう。"}],"items":[{"id":0,"type":"pronunciation","alternatives":[{"confidence":"0.644","content":"こんにちは。"}],"start_time":"0.039","end_time":"0.829"},{"id":1,"type":"pronunciation","alternatives":[{"confidence":"0.986","content":"今日"}],"start_time":"1.08","end_time":"1.32"},{"id":2,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"は"}],"start_time":"1.32","end_time":"1.45"},{"id":3,"type":"pronunciation","alternatives":[{"confidence":"0.67","content":"アマゾン"}],"start_time":"1.45","end_time":"1.99"},{"id":4,"type":"pronunciation","alternatives":[{"confidence":"0.838","content":"トランスクライブ"}],"start_time":"1.99","end_time":"2.839"},{"id":5,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"の"}],"start_time":"2.839","end_time":"3.0"},{"id":6,"type":"pronunciation","alternatives":[{"confidence":"0.923","content":"ハンズ"}],"start_time":"3.0","end_time":"3.38"},{"id":7,"type":"pronunciation","alternatives":[{"confidence":"0.743","content":"オン"}],"start_time":"3.38","end_time":"3.48"},{"id":8,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"を"}],"start_time":"3.48","end_time":"3.72"},{"id":9,"type":"pronunciation","alternatives":[{"confidence":"0.991","content":"行い"}],"start_time":"3.72","end_time":"4.07"},{"id":10,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"ます。"}],"start_time":"4.07","end_time":"4.53"},{"id":11,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"音声"}],"start_time":"5.219","end_time":"5.739"},{"id":12,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"認識"}],"start_time":"5.739","end_time":"6.199"},{"id":13,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"
の"}],"start_time":"6.199","end_time":"6.38"},{"id":14,"type":"pronunciation","alternatives":[{"confidence":"0.981","content":"精度"}],"start_time":"6.38","end_time":"6.659"},{"id":15,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"を"}],"start_time":"6.659","end_time":"6.82"},{"id":16,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"確認"}],"start_time":"6.82","end_time":"7.289"},{"id":17,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"する"}],"start_time":"7.289","end_time":"7.5"},{"id":18,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"ため"}],"start_time":"7.5","end_time":"7.739"},{"id":19,"type":"pronunciation","alternatives":[{"confidence":"0.934","content":"に、"}],"start_time":"7.739","end_time":"8.3"},{"id":20,"type":"pronunciation","alternatives":[{"confidence":"0.645","content":"さまざま"}],"start_time":"8.3","end_time":"8.789"},{"id":21,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"な"}],"start_time":"8.789","end_time":"8.899"},{"id":22,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"単語"}],"start_time":"8.899","end_time":"9.289"},{"id":23,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"や"}],"start_time":"9.289","end_time":"9.689"},{"id":24,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"文章"}],"start_time":"9.689","end_time":"9.89"},{"id":25,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"を"}],"start_time":"9.89","end_time":"9.979"},{"id":26,"type":"pronunciation","alternatives":[{"confidence":"0.996","content":"読み上げ"}],"start_time":"9.979","end_time":"10.5"},{"id":27,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"ます。"}],"start_time":"10.5","end_time":"10.979"},{"id":28,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"機械"}],"start_time":"11.729","end_time":"12.119"},{"id":29,"type":"pronunciation","alternatives":[{"confid
ence":"0.992","content":"学習、"}],"start_time":"12.119","end_time":"12.989"},{"id":30,"type":"pronunciation","alternatives":[{"confidence":"0.985","content":"人工"}],"start_time":"13.0","end_time":"13.52"},{"id":31,"type":"pronunciation","alternatives":[{"confidence":"0.981","content":"知能、"}],"start_time":"13.52","end_time":"14.119"},{"id":32,"type":"pronunciation","alternatives":[{"confidence":"0.997","content":"クラウド"}],"start_time":"14.199","end_time":"14.72"},{"id":33,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"コンピューティング"}],"start_time":"14.72","end_time":"15.43"},{"id":34,"type":"pronunciation","alternatives":[{"confidence":"0.997","content":"など"}],"start_time":"15.43","end_time":"15.64"},{"id":35,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"の"}],"start_time":"15.64","end_time":"15.92"},{"id":36,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"専門"}],"start_time":"15.92","end_time":"16.389"},{"id":37,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"用語"}],"start_time":"16.389","end_time":"16.719"},{"id":38,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"も"}],"start_time":"16.719","end_time":"16.84"},{"id":39,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"含ま"}],"start_time":"16.84","end_time":"17.2"},{"id":40,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"れ"}],"start_time":"17.2","end_time":"17.35"},{"id":41,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"て"}],"start_time":"17.35","end_time":"17.43"},{"id":42,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"い"}],"start_time":"17.43","end_time":"17.549"},{"id":43,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"ます。"}],"start_time":"17.549","end_time":"18.04"},{"id":44,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"この"}],"start_time":"18.79","end_time":"19.11"},{"id
":45,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"音声"}],"start_time":"19.11","end_time":"19.54"},{"id":46,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"ファイル"}],"start_time":"19.54","end_time":"19.909"},{"id":47,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"を"}],"start_time":"19.909","end_time":"20.149"},{"id":48,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"使っ"}],"start_time":"20.149","end_time":"20.389"},{"id":49,"type":"pronunciation","alternatives":[{"confidence":"0.851","content":"て、"}],"start_time":"20.389","end_time":"20.909"},{"id":50,"type":"pronunciation","alternatives":[{"confidence":"0.716","content":"アマゾン"}],"start_time":"20.909","end_time":"21.389"},{"id":51,"type":"pronunciation","alternatives":[{"confidence":"0.8","content":"トランスクライブ"}],"start_time":"21.389","end_time":"22.26"},{"id":52,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"の"}],"start_time":"22.26","end_time":"22.469"},{"id":53,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"性能"}],"start_time":"22.469","end_time":"22.79"},{"id":54,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"を"}],"start_time":"22.79","end_time":"22.909"},{"id":55,"type":"pronunciation","alternatives":[{"confidence":"0.997","content":"テスト"}],"start_time":"22.909","end_time":"23.34"},{"id":56,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"し"}],"start_time":"23.34","end_time":"23.43"},{"id":57,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"て"}],"start_time":"23.43","end_time":"23.54"},{"id":58,"type":"pronunciation","alternatives":[{"confidence":"0.995","content":"み"}],"start_time":"23.54","end_time":"23.829"},{"id":59,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"ましょう。"}],"start_time":"23.829","end_time":"24.239"}],"audio_segments":[{"id":0,"transcript":"こんにちは。今日はアマゾントランスクライブのハンズオ
ンを行います。","start_time":"0.0","end_time":"4.559","items":[0,1,2,3,4,5,6,7,8,9,10]},{"id":1,"transcript":"音声認識の精度を確認するために、さまざまな単語や文章を読み上げます。","start_time":"5.099","end_time":"11.01","items":[11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]},{"id":2,"transcript":"機械学習、人工知能、クラウドコンピューティングなどの専門用語も含まれています。","start_time":"11.56","end_time":"18.069","items":[28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43]},{"id":3,"transcript":"この音声ファイルを使って、アマゾントランスクライブの性能をテストしてみましょう。","start_time":"18.629","end_time":"24.27","items":[44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59]}]}}
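Because every item in the output carries a confidence value, the words the service was least sure about can be picked out programmatically. The sketch below runs on a shortened excerpt of the output above (first three items only). The function name low_confidence_items and the 0.8 threshold are arbitrary choices for illustration.

```python
import json

# Shortened excerpt in the same shape as the job output (first three items only).
SAMPLE_OUTPUT = """
{"results": {
  "transcripts": [{"transcript": "こんにちは。今日は"}],
  "items": [
    {"id": 0, "type": "pronunciation",
     "alternatives": [{"confidence": "0.644", "content": "こんにちは。"}],
     "start_time": "0.039", "end_time": "0.829"},
    {"id": 1, "type": "pronunciation",
     "alternatives": [{"confidence": "0.986", "content": "今日"}],
     "start_time": "1.08", "end_time": "1.32"},
    {"id": 2, "type": "pronunciation",
     "alternatives": [{"confidence": "0.999", "content": "は"}],
     "start_time": "1.32", "end_time": "1.45"}
  ]
}}
"""


def low_confidence_items(output: dict, threshold: float = 0.8) -> list:
    """Return (content, confidence) pairs for spoken items below the threshold."""
    flagged = []
    for item in output["results"]["items"]:
        if item["type"] != "pronunciation":
            continue  # punctuation items have no meaningful confidence
        best = item["alternatives"][0]
        if "confidence" not in best:
            continue  # e.g. redacted items keep confidence elsewhere
        confidence = float(best["confidence"])  # values are stored as strings
        if confidence < threshold:
            flagged.append((best["content"], confidence))
    return flagged


if __name__ == "__main__":
    print(low_confidence_items(json.loads(SAMPLE_OUTPUT)))
```

For the real result, load the JSON file saved in the output bucket with json.load instead of the inline excerpt.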
4.2 Processing the English audio file
For the English audio file, I ran a job the same way with English selected as the language and checked the result.
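When jobs are started from code, the script has to wait until the job leaves the in-progress state before reading the result. One possible polling sketch is shown below. The client argument is expected to be a boto3 Transcribe client; the function name wait_for_job and the polling interval and limit are assumptions, not values from the walkthrough.

```python
import time


def wait_for_job(client, job_name: str, poll_seconds: float = 5.0,
                 max_polls: int = 120) -> str:
    """Poll get_transcription_job until the job is COMPLETED or FAILED."""
    for _ in range(max_polls):
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        status = response["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)  # still QUEUED or IN_PROGRESS
    raise TimeoutError(f"Job {job_name!r} did not finish after {max_polls} polls")
```

With real credentials this would be called as wait_for_job(boto3.client("transcribe"), "qiita-demo-english"). Passing the client in as an argument also allows the function to be exercised with a stand-in object.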
Output
{"jobName":"qiita-demo-english","accountId":"XXXXXXXXXXXX","status":"COMPLETED","results":{"transcripts":[{"transcript":"Hello, today we'll be doing a hands-on with Amazon Transcribe. To check the accuracy of speech recognition, we'll be reading out various words and sentences. It also includes technical terms from machine learning, artificial intelligence, cloud computing, and more."}],"items":[{"id":0,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"Hello"}],"start_time":"0.009","end_time":"0.589"},{"id":1,"type":"punctuation","alternatives":[{"confidence":"0.0","content":","}]},{"id":2,"type":"pronunciation","alternatives":[{"confidence":"0.961","content":"today"}],"start_time":"0.92","end_time":"1.159"},{"id":3,"type":"pronunciation","alternatives":[{"confidence":"0.99","content":"we'll"}],"start_time":"1.159","end_time":"1.36"},{"id":4,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"be"}],"start_time":"1.36","end_time":"1.6"},{"id":5,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"doing"}],"start_time":"1.6","end_time":"1.84"},{"id":6,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"a"}],"start_time":"1.84","end_time":"2.039"},{"id":7,"type":"pronunciation","alternatives":[{"confidence":"0.867","content":"hands-on"}],"start_time":"2.039","end_time":"2.66"},{"id":8,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"with"}],"start_time":"2.66","end_time":"2.96"},{"id":9,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"Amazon"}],"start_time":"2.96","end_time":"3.599"},{"id":10,"type":"pronunciation","alternatives":[{"confidence":"0.988","content":"Transcribe"}],"start_time":"3.599","end_time":"4.289"},{"id":11,"type":"punctuation","alternatives":[{"confidence":"0.0","content":"."}]},{"id":12,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"To"}],"start_time":"5.13","end_time":"5.369"},{"id":13,"type":"pron
unciation","alternatives":[{"confidence":"0.999","content":"check"}],"start_time":"5.369","end_time":"5.61"},{"id":14,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"the"}],"start_time":"5.61","end_time":"5.73"},{"id":15,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"accuracy"}],"start_time":"5.73","end_time":"6.409"},{"id":16,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"of"}],"start_time":"6.409","end_time":"6.73"},{"id":17,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"speech"}],"start_time":"6.73","end_time":"7.01"},{"id":18,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"recognition"}],"start_time":"7.01","end_time":"7.8"},{"id":19,"type":"punctuation","alternatives":[{"confidence":"0.0","content":","}]},{"id":20,"type":"pronunciation","alternatives":[{"confidence":"0.991","content":"we'll"}],"start_time":"8.01","end_time":"8.25"},{"id":21,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"be"}],"start_time":"8.25","end_time":"8.449"},{"id":22,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"reading"}],"start_time":"8.449","end_time":"8.85"},{"id":23,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"out"}],"start_time":"8.85","end_time":"9.17"},{"id":24,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"various"}],"start_time":"9.17","end_time":"9.689"},{"id":25,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"words"}],"start_time":"9.689","end_time":"10.13"},{"id":26,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"and"}],"start_time":"10.13","end_time":"10.489"},{"id":27,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"sentences"}],"start_time":"10.489","end_time":"11.329"},{"id":28,"type":"punctuation","alternatives":[{"confidence":"0.0","content":"."}]},{"id":29,"type":"pronunciatio
n","alternatives":[{"confidence":"0.999","content":"It"}],"start_time":"12.13","end_time":"12.409"},{"id":30,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"also"}],"start_time":"12.409","end_time":"13.06"},{"id":31,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"includes"}],"start_time":"13.09","end_time":"13.409"},{"id":32,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"technical"}],"start_time":"13.409","end_time":"14.05"},{"id":33,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"terms"}],"start_time":"14.05","end_time":"14.489"},{"id":34,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"from"}],"start_time":"14.489","end_time":"14.93"},{"id":35,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"machine"}],"start_time":"14.93","end_time":"15.25"},{"id":36,"type":"pronunciation","alternatives":[{"confidence":"0.997","content":"learning"}],"start_time":"15.25","end_time":"15.81"},{"id":37,"type":"punctuation","alternatives":[{"confidence":"0.0","content":","}]},{"id":38,"type":"pronunciation","alternatives":[{"confidence":"0.997","content":"artificial"}],"start_time":"15.97","end_time":"16.649"},{"id":39,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"intelligence"}],"start_time":"16.649","end_time":"17.52"},{"id":40,"type":"punctuation","alternatives":[{"confidence":"0.0","content":","}]},{"id":41,"type":"pronunciation","alternatives":[{"confidence":"0.996","content":"cloud"}],"start_time":"17.649","end_time":"18.209"},{"id":42,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"computing"}],"start_time":"18.209","end_time":"18.889"},{"id":43,"type":"punctuation","alternatives":[{"confidence":"0.0","content":","}]},{"id":44,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"and"}],"start_time":"19.01","end_time":"19.329"},{"id":45,"type":"pronunciation","alternativ
es":[{"confidence":"0.999","content":"more"}],"start_time":"19.329","end_time":"19.76"},{"id":46,"type":"punctuation","alternatives":[{"confidence":"0.0","content":"."}]}],"audio_segments":[{"id":0,"transcript":"Hello, today we'll be doing a hands-on with Amazon Transcribe.","start_time":"0.0","end_time":"4.409","items":[0,1,2,3,4,5,6,7,8,9,10,11]},{"id":1,"transcript":"To check the accuracy of speech recognition, we'll be reading out various words and sentences.","start_time":"5.01","end_time":"11.39","items":[12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28]},{"id":2,"transcript":"It also includes technical terms from machine learning, artificial intelligence, cloud computing, and more.","start_time":"12.09","end_time":"19.87","items":[29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46]}]}}
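4.3 Automatic language identification
I was curious whether the audio would still be recognized correctly with automatic language identification instead of a specified language, so I tried it with the Japanese audio.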
{"jobName":"qiita-demo-auto","accountId":"XXXXXXXXXXXX","status":"COMPLETED","results":{"language_code":"ja-JP","language_identification":[{"code":"ja-JP","score":"1"},{"code":"ko-KR","score":"0.0001"},{"code":"pl-PL","score":"0.0001"},{"code":"uz-UZ","score":"0.0001"},{"code":"mn-MN","score":"0.0001"}],"transcripts":[{"transcript":"こんにちは。今日はアマゾントランスクライブのハンズオンを行います。音声認識の精度を確認するために、さまざまな単語や文章を読み上げます。機械学習、人工知能、クラウドコンピューティングなどの専門用語も含まれています。この音声ファイルを使って、アマゾントランスクライブの性能をテストしてみましょう。"}],"items":[{"id":0,"type":"pronunciation","alternatives":[{"confidence":"0.644","content":"こんにちは。"}],"start_time":"0.039","end_time":"0.829"},{"id":1,"type":"pronunciation","alternatives":[{"confidence":"0.986","content":"今日"}],"start_time":"1.08","end_time":"1.32"},{"id":2,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"は"}],"start_time":"1.32","end_time":"1.45"},{"id":3,"type":"pronunciation","alternatives":[{"confidence":"0.67","content":"アマゾン"}],"start_time":"1.45","end_time":"1.99"},{"id":4,"type":"pronunciation","alternatives":[{"confidence":"0.838","content":"トランスクライブ"}],"start_time":"1.99","end_time":"2.839"},{"id":5,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"の"}],"start_time":"2.839","end_time":"3.0"},{"id":6,"type":"pronunciation","alternatives":[{"confidence":"0.923","content":"ハンズ"}],"start_time":"3.0","end_time":"3.38"},{"id":7,"type":"pronunciation","alternatives":[{"confidence":"0.743","content":"オン"}],"start_time":"3.38","end_time":"3.48"},{"id":8,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"を"}],"start_time":"3.48","end_time":"3.72"},{"id":9,"type":"pronunciation","alternatives":[{"confidence":"0.991","content":"行い"}],"start_time":"3.72","end_time":"4.07"},{"id":10,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"ます。"}],"start_time":"4.07","end_time":"4.53"},{"id":11,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"音声"}],"start_time":"5.219","end_time":"
5.739"},{"id":12,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"認識"}],"start_time":"5.739","end_time":"6.199"},{"id":13,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"の"}],"start_time":"6.199","end_time":"6.38"},{"id":14,"type":"pronunciation","alternatives":[{"confidence":"0.981","content":"精度"}],"start_time":"6.38","end_time":"6.659"},{"id":15,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"を"}],"start_time":"6.659","end_time":"6.82"},{"id":16,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"確認"}],"start_time":"6.82","end_time":"7.289"},{"id":17,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"する"}],"start_time":"7.289","end_time":"7.5"},{"id":18,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"ため"}],"start_time":"7.5","end_time":"7.739"},{"id":19,"type":"pronunciation","alternatives":[{"confidence":"0.934","content":"に、"}],"start_time":"7.739","end_time":"8.3"},{"id":20,"type":"pronunciation","alternatives":[{"confidence":"0.645","content":"さまざま"}],"start_time":"8.3","end_time":"8.789"},{"id":21,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"な"}],"start_time":"8.789","end_time":"8.899"},{"id":22,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"単語"}],"start_time":"8.899","end_time":"9.289"},{"id":23,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"や"}],"start_time":"9.289","end_time":"9.689"},{"id":24,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"文章"}],"start_time":"9.689","end_time":"9.89"},{"id":25,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"を"}],"start_time":"9.89","end_time":"9.979"},{"id":26,"type":"pronunciation","alternatives":[{"confidence":"0.996","content":"読み上げ"}],"start_time":"9.979","end_time":"10.5"},{"id":27,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"ます。"}],"start_time":"
10.5","end_time":"10.979"},{"id":28,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"機械"}],"start_time":"11.729","end_time":"12.119"},{"id":29,"type":"pronunciation","alternatives":[{"confidence":"0.992","content":"学習、"}],"start_time":"12.119","end_time":"12.989"},{"id":30,"type":"pronunciation","alternatives":[{"confidence":"0.985","content":"人工"}],"start_time":"13.0","end_time":"13.52"},{"id":31,"type":"pronunciation","alternatives":[{"confidence":"0.981","content":"知能、"}],"start_time":"13.52","end_time":"14.119"},{"id":32,"type":"pronunciation","alternatives":[{"confidence":"0.997","content":"クラウド"}],"start_time":"14.199","end_time":"14.72"},{"id":33,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"コンピューティング"}],"start_time":"14.72","end_time":"15.43"},{"id":34,"type":"pronunciation","alternatives":[{"confidence":"0.997","content":"など"}],"start_time":"15.43","end_time":"15.64"},{"id":35,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"の"}],"start_time":"15.64","end_time":"15.92"},{"id":36,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"専門"}],"start_time":"15.92","end_time":"16.389"},{"id":37,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"用語"}],"start_time":"16.389","end_time":"16.719"},{"id":38,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"も"}],"start_time":"16.719","end_time":"16.84"},{"id":39,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"含ま"}],"start_time":"16.84","end_time":"17.2"},{"id":40,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"れ"}],"start_time":"17.2","end_time":"17.35"},{"id":41,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"て"}],"start_time":"17.35","end_time":"17.43"},{"id":42,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"い"}],"start_time":"17.43","end_time":"17.549"},{"id":43,"type":"pronunciation","alternatives":
[{"confidence":"0.999","content":"ます。"}],"start_time":"17.549","end_time":"18.04"},{"id":44,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"この"}],"start_time":"18.79","end_time":"19.11"},{"id":45,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"音声"}],"start_time":"19.11","end_time":"19.54"},{"id":46,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"ファイル"}],"start_time":"19.54","end_time":"19.909"},{"id":47,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"を"}],"start_time":"19.909","end_time":"20.149"},{"id":48,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"使っ"}],"start_time":"20.149","end_time":"20.389"},{"id":49,"type":"pronunciation","alternatives":[{"confidence":"0.851","content":"て、"}],"start_time":"20.389","end_time":"20.909"},{"id":50,"type":"pronunciation","alternatives":[{"confidence":"0.716","content":"アマゾン"}],"start_time":"20.909","end_time":"21.389"},{"id":51,"type":"pronunciation","alternatives":[{"confidence":"0.8","content":"トランスクライブ"}],"start_time":"21.389","end_time":"22.26"},{"id":52,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"の"}],"start_time":"22.26","end_time":"22.469"},{"id":53,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"性能"}],"start_time":"22.469","end_time":"22.79"},{"id":54,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"を"}],"start_time":"22.79","end_time":"22.909"},{"id":55,"type":"pronunciation","alternatives":[{"confidence":"0.997","content":"テスト"}],"start_time":"22.909","end_time":"23.34"},{"id":56,"type":"pronunciation","alternatives":[{"confidence":"1.0","content":"し"}],"start_time":"23.34","end_time":"23.43"},{"id":57,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"て"}],"start_time":"23.43","end_time":"23.54"},{"id":58,"type":"pronunciation","alternatives":[{"confidence":"0.995","content":"み"}],"start_time":"23.54","end_tim
e":"23.829"},{"id":59,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"ましょう。"}],"start_time":"23.829","end_time":"24.239"}],"audio_segments":[{"id":0,"transcript":"こんにちは。今日はアマゾントランスクライブのハンズオンを行います。","start_time":"0.0","end_time":"4.559","items":[0,1,2,3,4,5,6,7,8,9,10]},{"id":1,"transcript":"音声認識の精度を確認するために、さまざまな単語や文章を読み上げます。","start_time":"5.099","end_time":"11.01","items":[11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]},{"id":2,"transcript":"機械学習、人工知能、クラウドコンピューティングなどの専門用語も含まれています。","start_time":"11.56","end_time":"18.069","items":[28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43]},{"id":3,"transcript":"この音声ファイルを使って、アマゾントランスクライブの性能をテストしてみましょう。","start_time":"18.629","end_time":"24.27","items":[44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59]}]}}
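In the API, automatic identification is requested by setting IdentifyLanguage to True in place of LanguageCode, and the candidates appear under language_identification in the result. The sketch below shows both sides. The helper names and the object key are hypothetical, and the sample dictionary copies only the language fields from the output above.

```python
def build_auto_language_request(job_name: str, media_uri: str,
                                output_bucket: str) -> dict:
    """Arguments for start_transcription_job with language identification on."""
    return {
        "TranscriptionJobName": job_name,
        "IdentifyLanguage": True,  # used instead of LanguageCode
        "Media": {"MediaFileUri": media_uri},
        "OutputBucketName": output_bucket,
    }


def top_language(results: dict) -> tuple:
    """Return (language code, score) of the highest-scoring candidate."""
    candidates = results["language_identification"]
    best = max(candidates, key=lambda c: float(c["score"]))  # scores are strings
    return best["code"], float(best["score"])


# Language-related fields copied from the "results" object above.
SAMPLE_RESULTS = {
    "language_code": "ja-JP",
    "language_identification": [
        {"code": "ja-JP", "score": "1"},
        {"code": "ko-KR", "score": "0.0001"},
        {"code": "pl-PL", "score": "0.0001"},
        {"code": "uz-UZ", "score": "0.0001"},
        {"code": "mn-MN", "score": "0.0001"},
    ],
}

if __name__ == "__main__":
    print(top_language(SAMPLE_RESULTS))
```

Comparing the top candidate with language_code is a quick sanity check that the transcript was produced in the language with the highest score.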
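4.4 Verifying the PII (personally identifiable information) redaction feature
Amazon Transcribe also has a PII redaction feature, so I read the following sentences aloud to check whether redaction is applied.
(Done in English because Japanese is not supported.)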
I'm Taro Yamada.
I'm 26 years old and live in Tokyo.
My hobby is gaming.
I also output the unredacted file to confirm that the speech was recognized correctly.
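In API terms, producing both the redacted and the unredacted transcript corresponds to the ContentRedaction setting of start_transcription_job. This is a minimal sketch, assuming en-US as the English variant (the walkthrough only says "English") and a hypothetical object key. The helper name build_redaction_request is made up for illustration.

```python
def build_redaction_request(job_name: str, media_uri: str,
                            output_bucket: str) -> dict:
    """Arguments for start_transcription_job with PII redaction enabled."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",  # assumption: US English
        "Media": {"MediaFileUri": media_uri},
        "OutputBucketName": output_bucket,
        "ContentRedaction": {
            "RedactionType": "PII",
            # Write both the redacted and the original transcript.
            "RedactionOutput": "redacted_and_unredacted",
        },
    }


if __name__ == "__main__":
    print(build_redaction_request(
        "qiita-demo-tarou",
        "s3://qiita-demo-transcribe/sample-pii.mp3",  # hypothetical key
        "qiita-demo-transcribe",
    ))
```

Setting RedactionOutput to "redacted" instead would produce only the redacted transcript.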
Output (before redaction)
{"jobName":"qiita-demo-tarou","accountId":"XXXXXXXXXXXX","isRedacted":false,"status":"COMPLETED","results":{"transcripts":[{"transcript":"I'm Taro Yamada. I'm 26 years old and live in Tokyo. My hobby is gaming."}],"items":[{"id":0,"type":"pronunciation","alternatives":[{"confidence":"0.994","content":"I'm"}],"start_time":"0.009","end_time":"0.4"},{"id":1,"type":"pronunciation","alternatives":[{"confidence":"0.976","content":"Taro"}],"start_time":"0.4","end_time":"0.689"},{"id":2,"type":"pronunciation","alternatives":[{"confidence":"0.993","content":"Yamada"}],"start_time":"0.689","end_time":"1.279"},{"id":3,"type":"punctuation","alternatives":[{"confidence":"0.0","content":"."}]},{"id":4,"type":"pronunciation","alternatives":[{"confidence":"0.994","content":"I'm"}],"start_time":"2.15","end_time":"2.39"},{"id":5,"type":"pronunciation","alternatives":[{"confidence":"0.993","content":"26"}],"start_time":"2.39","end_time":"3.109"},{"id":6,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"years"}],"start_time":"3.109","end_time":"3.39"},{"id":7,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"old"}],"start_time":"3.39","end_time":"3.619"},{"id":8,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"and"}],"start_time":"3.619","end_time":"3.94"},{"id":9,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"live"}],"start_time":"3.94","end_time":"4.07"},{"id":10,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"in"}],"start_time":"4.07","end_time":"4.349"},{"id":11,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"Tokyo"}],"start_time":"4.349","end_time":"4.94"},{"id":12,"type":"punctuation","alternatives":[{"confidence":"0.0","content":"."}]},{"id":13,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"My"}],"start_time":"5.619","end_time":"5.63"},{"id":14,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"h
obby"}],"start_time":"5.889","end_time":"6.329"},{"id":15,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"is"}],"start_time":"6.329","end_time":"6.489"},{"id":16,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"gaming"}],"start_time":"6.489","end_time":"6.969"},{"id":17,"type":"punctuation","alternatives":[{"confidence":"0.0","content":"."}]}],"audio_segments":[{"id":0,"transcript":"I'm Taro Yamada.","start_time":"0.0","end_time":"1.409","items":[0,1,2,3]},{"id":1,"transcript":"I'm 26 years old and live in Tokyo.","start_time":"1.99","end_time":"5.05","items":[4,5,6,7,8,9,10,11,12]},{"id":2,"transcript":"My hobby is gaming.","start_time":"5.61","end_time":"7.119","items":[13,14,15,16,17]}]}}
Output (after redaction)
{"jobName":"qiita-demo-tarou","accountId":"XXXXXXXXXXXX","isRedacted":true,"status":"COMPLETED","results":{"transcripts":[{"transcript":"I'm [PII]. I'm 26 years old and live in [PII]. My hobby is gaming."}],"items":[{"id":0,"type":"pronunciation","alternatives":[{"confidence":"0.994","content":"I'm"}],"start_time":"0.009","end_time":"0.4"},{"id":1,"type":"pronunciation","alternatives":[{"content":"[PII]","redactions":[{"confidence":"1.0","type":"NAME","category":"PII"}]}],"start_time":"0.4","end_time":"1.279"},{"id":2,"type":"punctuation","alternatives":[{"confidence":"0.0","content":"."}]},{"id":3,"type":"pronunciation","alternatives":[{"confidence":"0.994","content":"I'm"}],"start_time":"2.15","end_time":"2.39"},{"id":4,"type":"pronunciation","alternatives":[{"confidence":"0.993","content":"26"}],"start_time":"2.39","end_time":"3.109"},{"id":5,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"years"}],"start_time":"3.109","end_time":"3.39"},{"id":6,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"old"}],"start_time":"3.39","end_time":"3.619"},{"id":7,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"and"}],"start_time":"3.619","end_time":"3.94"},{"id":8,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"live"}],"start_time":"3.94","end_time":"4.07"},{"id":9,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"in"}],"start_time":"4.07","end_time":"4.349"},{"id":10,"type":"pronunciation","alternatives":[{"content":"[PII]","redactions":[{"confidence":"1.0","type":"ADDRESS","category":"PII"}]}],"start_time":"4.349","end_time":"4.94"},{"id":11,"type":"punctuation","alternatives":[{"confidence":"0.0","content":"."}]},{"id":12,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"My"}],"start_time":"5.619","end_time":"5.63"},{"id":13,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"hobby"}],"start_time":"5.889","end_time":"6.
329"},{"id":14,"type":"pronunciation","alternatives":[{"confidence":"0.999","content":"is"}],"start_time":"6.329","end_time":"6.489"},{"id":15,"type":"pronunciation","alternatives":[{"confidence":"0.998","content":"gaming"}],"start_time":"6.489","end_time":"6.969"},{"id":16,"type":"punctuation","alternatives":[{"confidence":"0.0","content":"."}]}],"audio_segments":[{"id":0,"transcript":"I'm [PII].","start_time":"0.0","end_time":"1.409","items":[0,1,2]},{"id":1,"transcript":"I'm 26 years old and live in [PII].","start_time":"1.99","end_time":"5.05","items":[3,4,5,6,7,8,9,10,11]},{"id":2,"transcript":"My hobby is gaming.","start_time":"5.61","end_time":"7.119","items":[12,13,14,15,16]}]}}
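In the redacted output, each masked item carries a redactions list in place of its content, so the type and time range of every redaction can be listed. The sketch below runs on three items copied from the output above. The function name redacted_spans is hypothetical.

```python
def redacted_spans(output: dict) -> list:
    """Return (PII type, start_time, end_time) for every redacted item."""
    spans = []
    for item in output["results"]["items"]:
        for alternative in item.get("alternatives", []):
            for redaction in alternative.get("redactions", []):
                spans.append(
                    (redaction["type"], item["start_time"], item["end_time"])
                )
    return spans


# Three items copied from the redacted output (ids 0, 1 and 10).
SAMPLE_REDACTED = {
    "isRedacted": True,
    "results": {
        "items": [
            {"id": 0, "type": "pronunciation",
             "alternatives": [{"confidence": "0.994", "content": "I'm"}],
             "start_time": "0.009", "end_time": "0.4"},
            {"id": 1, "type": "pronunciation",
             "alternatives": [{"content": "[PII]", "redactions": [
                 {"confidence": "1.0", "type": "NAME", "category": "PII"}]}],
             "start_time": "0.4", "end_time": "1.279"},
            {"id": 10, "type": "pronunciation",
             "alternatives": [{"content": "[PII]", "redactions": [
                 {"confidence": "1.0", "type": "ADDRESS", "category": "PII"}]}],
             "start_time": "4.349", "end_time": "4.94"},
        ]
    },
}

if __name__ == "__main__":
    for pii_type, start, end in redacted_spans(SAMPLE_REDACTED):
        print(pii_type, start, end)
```

The time range of the NAME redaction (0.4 to 1.279) covers both "Taro" and "Yamada" from the unredacted output, which shows that adjacent PII words are merged into a single item.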
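5. Discussion
In this hands-on, I used Amazon Transcribe batch processing to transcribe Japanese and English audio files, and also verified the behavior of the PII (personally identifiable information) redaction feature.
Verification results
4.1 Processing the Japanese audio file
I uploaded the Japanese audio file and ran the job with the language set to Japanese, and the audio content was recognized accurately. As shown below, the text that was read aloud was transcribed correctly. The only differences from the script are notational: Amazon Transcribe is written in katakana (アマゾントランスクライブ), 様々 appears as さまざま, and the comma after こんにちは became a full stop.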
{
"jobName": "qiita-demo-japanese",
"accountId": "XXXXXXXXXXXX",
"status": "COMPLETED",
"results": {
"transcripts": [
{
"transcript": "こんにちは。今日はアマゾントランスクライブのハンズオンを行います。音声認識の精度を確認するために、さまざまな単語や文章を読み上げます。機械学習、人工知能、クラウドコンピューティングなどの専門用語も含まれています。この音声ファイルを使って、アマゾントランスクライブの性能をテストしてみましょう。"
}
]
}
}
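4.2 Processing the English audio file
I processed the English audio file the same way. The sentences in the output match the script, including the technical terms and the long sentences. Note that the output shown ends at "and more." and does not contain the script's final sentence ("Let's use this audio file to test the performance of Amazon Transcribe.").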
{
"jobName": "qiita-demo-english",
"accountId": "XXXXXXXXXXXX",
"status": "COMPLETED",
"results": {
"transcripts": [
{
"transcript": "Hello, today we'll be doing a hands-on with Amazon Transcribe. To check the accuracy of speech recognition, we'll be reading out various words and sentences. It also includes technical terms from machine learning, artificial intelligence, cloud computing, and more."
}
]
}
}
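4.3 Automatic language identification
With automatic language identification and no language specified, the audio was again transcribed accurately, and the language was correctly identified as Japanese (ja-JP, score 1).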
{
"jobName": "qiita-demo-auto",
"accountId": "XXXXXXXXXXXX",
"status": "COMPLETED",
"results": {
"language_code": "ja-JP",
"language_identification": [
{
"code": "ja-JP",
"score": "1"
},
{
"code": "ko-KR",
"score": "0.0001"
}
],
"transcripts": [
{
"transcript": "こんにちは。今日はアマゾントランスクライブのハンズオンを行います。音声認識の精度を確認するために、さまざまな単語や文章を読み上げます。機械学習、人工知能、クラウドコンピューティングなどの専門用語も含まれています。この音声ファイルを使って、アマゾントランスクライブの性能をテストしてみましょう。"
}
]
}
}
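4.4 PII (personally identifiable information) redaction feature
With PII redaction enabled, personal information such as names and addresses is replaced with [PII]. In the JSON, isRedacted is true, and the redactions field shows which parts were redacted. In this test, Taro Yamada was redacted as type NAME and Tokyo as type ADDRESS, while the age (26) was left as-is. Note that Japanese does not currently support PII redaction, so I verified the behavior in English only.
Output (before redaction)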
{
"jobName": "qiita-demo-tarou",
"accountId": "XXXXXXXXXXXX",
"isRedacted": false,
"status": "COMPLETED",
"results": {
"transcripts": [
{
"transcript": "I'm Taro Yamada. I'm 26 years old and live in Tokyo. My hobby is gaming."
}
]
}
}
Output (after redaction)
{
"jobName": "qiita-demo-tarou",
"accountId": "XXXXXXXXXXXX",
"isRedacted": true,
"status": "COMPLETED",
"results": {
"transcripts": [
{
"transcript": "I'm [PII]. I'm 26 years old and live in [PII]. My hobby is gaming."
}
]
}
}
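6. Summary
With Amazon Transcribe, simply uploading audio files to S3 and running a job produced highly accurate transcriptions. PII redaction is not supported for Japanese, but I confirmed that it works well for English. When I have time, I would also like to try streaming recognition.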