Trying to analyze audio containing multiple languages with AmiVoice

Posted at 2024-05-11

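Background

As a foreigner, my biggest frustration when using Siri and similar assistants is that speech recognition cannot handle more than one language at a time.
So when I first came across AmiVoice, I also tried to see whether it could recognize audio containing multiple languages. It could not.
This article records my journey of trying to analyze audio containing multiple languages with the AmiVoice API.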

Source of the audio data

First, the mixed Japanese and English audio used for testing was downloaded from the following site.

For English-only audio, I used the example from the TOEIC Listening section.

Trying the AmiVoice API

Note
Running the code below to verify the results may incur charges.

Repository

The code I used for this experiment is in this repository.

First, a quick try with English-only audio

curl -X POST https://acp-api.amivoice.com/v1/nolog/recognize \
     -F d=-a-general-en \
     -F u=APIKEY_Here \
     -F a=@src/Part2Direction.mp3 > result_Part2Direction.json

Excerpt of the result

{
    "results": [
        {
            "tokens": [
                {
                    "written": "Part",
                    "confidence": 0.98,
                    "starttime": 870,
                    "endtime": 1290,
                    "spoken": "Part"
                },
                {
                    "written": "two",
                    "confidence": 0.92,
                    "starttime": 1290,
                    "endtime": 1810,
                    "spoken": "two"
                },
                {
                    "written": "."
                },
                {
                    "written": "directions",
                    "confidence": 0.56,
                    "starttime": 2360,
                    "endtime": 3260,
                    "spoken": "directions"
                },
                {
                    "written": "."
                },
                {
                    "written": "you",
                    "confidence": 0.70,
                    "starttime": 3860,
                    "endtime": 4040,
                    "spoken": "you"
                },
                // omitted
                {
                    "written": "on",
                    "confidence": 0.91,
                    "starttime": 21960,
                    "endtime": 22200,
                    "spoken": "on"
                },
                {
                    "written": "your",
                    "confidence": 1.00,
                    "starttime": 22200,
                    "endtime": 22380,
                    "spoken": "your"
                },
                {
                    "written": "answer",
                    "confidence": 1.00,
                    "starttime": 22380,
                    "endtime": 22700,
                    "spoken": "answer"
                },
                {
                    "written": "sheet",
                    "confidence": 1.00,
                    "starttime": 22700,
                    "endtime": 23180,
                    "spoken": "sheet"
                },
                {
                    "written": "."
                }
            ],
            "confidence": 1.0,
            "starttime": 650,
            "endtime": 23300,
            "tags": [],
            "rulename": "",
            "text": "Part two.directions.you will hear a question or statement and three responses spoken in English.they will not be printed in your test book and will be spoken only one time.select the best response to the question or statement and mark the letter A B or C on your answer sheet."
        }
    ],
    "utteranceid": "20240421/22/018f00cca37b0a30105894c5_20240421_221602",
    "text": "Part two.directions.you will hear a question or statement and three responses spoken in English.they will not be printed in your test book and will be spoken only one time.select the best response to the question or statement and mark the letter A B or C on your answer sheet.",
    "code": "",
    "message": ""
}
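
Tokens such as "directions" (0.56) and "you" (0.70) stand out in the excerpt above. As a minimal sketch, assuming the response has been parsed into a Python dict with the same shape as the excerpt, low-confidence tokens could be listed like this. The function name `low_confidence_tokens` and the 0.8 threshold are my own illustrative choices, not part of the AmiVoice API.

```python
def low_confidence_tokens(response, threshold=0.8):
    """Return (written, confidence) pairs for tokens below the threshold.

    Punctuation tokens in the excerpt carry no "confidence" field, so they are skipped.
    """
    flagged = []
    for segment in response.get("results", []):
        for token in segment.get("tokens", []):
            confidence = token.get("confidence")
            if confidence is not None and confidence < threshold:
                flagged.append((token["written"], confidence))
    return flagged


# A few tokens copied from the excerpt above.
sample = {
    "results": [
        {
            "tokens": [
                {"written": "Part", "confidence": 0.98, "starttime": 870, "endtime": 1290},
                {"written": "two", "confidence": 0.92, "starttime": 1290, "endtime": 1810},
                {"written": "."},
                {"written": "directions", "confidence": 0.56, "starttime": 2360, "endtime": 3260},
                {"written": "."},
                {"written": "you", "confidence": 0.70, "starttime": 3860, "endtime": 4040},
            ]
        }
    ]
}

print(low_confidence_tokens(sample))
```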

Transcript of the audio

Directions : You will hear a question or statement and three responses spoken in English. 
They will not be printed in your test book and will be spoken only one time. 
Select the best response to the question or statement and mark the letter (A), (B), or (C) on your answer sheet.

Recognized text

Part two.directions.you will hear a question or statement and three responses spoken in English.
they will not be printed in your test book and will be spoken only one time.
select the best response to the question or statement and mark the letter A B or C on your answer sheet.

Small Summary

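For English-only audio, apart from formatting differences (capitalization, spacing, and punctuation), the recognized text does not differ meaningfully from the actual transcript.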
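
To make that comparison concrete, here is a small sketch that strips the formatting differences before comparing the two texts. The `normalize` helper is a hypothetical name of my own; it simply lowercases the text and drops everything that is not a letter or digit.

```python
import re


def normalize(text):
    """Lowercase, turn every run of non-alphanumeric characters into a space, and split into words."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


# Transcript and recognized text copied from the sections above.
reference = (
    "Directions : You will hear a question or statement and three responses spoken in English. "
    "They will not be printed in your test book and will be spoken only one time. "
    "Select the best response to the question or statement and mark the letter "
    "(A), (B), or (C) on your answer sheet."
)
recognized = (
    "Part two.directions.you will hear a question or statement and three responses spoken in English."
    "they will not be printed in your test book and will be spoken only one time."
    "select the best response to the question or statement and mark the letter "
    "A B or C on your answer sheet."
)

ref_words = normalize(reference)
rec_words = normalize(recognized)

# The recognized text starts with the "Part two" heading, which the transcript above does not include.
print(rec_words[:2])
print(rec_words[2:] == ref_words)
```

Once capitalization, spacing, and punctuation are ignored, everything after the leading "Part two" matches the transcript word for word.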

Mixed Japanese and English audio

File used -> src\1-FLT4-1.mp3

Result from the English engine

"text": "It's.Mitsuno to unanimous.Taiwanese zero Color un ready to go Dakota  
aside.one.How many dogs do you have I have two dogs?two.When is your birthday?It's 
June 12th.three.What time is it now?It's three 30.",

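Looking at the result, the Japanese in the first half of the audio was barely recognized, whereas the English in the second half was mostly correct in meaning, although digits and spelled-out numbers were mixed in the written form and the punctuation was somewhat off.

Plotting the confidence of each token in results gave the following figure, which shows that confidence in the first half is poor.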
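
As a rough sketch of how such a series could be prepared for plotting, the snippet below pulls (starttime, confidence) pairs out of a parsed response and compares the mean confidence before and after a chosen split point. The function names and the 20000 ms split are assumptions for illustration, and the sample holds only the two tokens visible in the excerpt, not the full result.

```python
from statistics import mean


def confidence_series(response):
    """Return (starttime_ms, confidence) pairs for every token that has both fields."""
    series = []
    for segment in response.get("results", []):
        for token in segment.get("tokens", []):
            if "confidence" in token and "starttime" in token:
                series.append((token["starttime"], token["confidence"]))
    return series


def mean_before_after(series, split_ms):
    """Mean confidence of tokens starting before and at/after split_ms (None if a side is empty)."""
    before = [c for t, c in series if t < split_ms]
    after = [c for t, c in series if t >= split_ms]
    return (mean(before) if before else None, mean(after) if after else None)


# Only the tokens shown in the English engine excerpt; the real response has many more.
english_sample = {
    "results": [
        {
            "tokens": [
                {"written": "It's", "confidence": 0.49, "starttime": 460, "endtime": 1080},
                {"written": "."},
                {"written": "30", "confidence": 1.00, "starttime": 38300, "endtime": 39080},
            ]
        }
    ]
}

series = confidence_series(english_sample)
print(series)
print(mean_before_after(series, split_ms=20000))
```

The resulting pairs can be handed to any plotting library, with starttime on the x-axis and confidence on the y-axis.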

Excerpt of the English engine result

{
    "results": [
        {
            "tokens": [
                {
                    "written": "It's",
                    "confidence": 0.49,
                    "starttime": 460,
                    "endtime": 1080,
                    "spoken": "__It_is"
                },
                {
                    "written": "."
                },
                // omitted
                {
                    "written": "30",
                    "confidence": 1.00,
                    "starttime": 38300,
                    "endtime": 39080,
                    "spoken": "thirty"
                },
                {
                    "written": "."
                }
            ],
            "confidence": 0.97400004,
            "starttime": 200,
            "endtime": 39300,
            "tags": [],
            "rulename": "",
            "text": "It's.Mitsuno to unanimous.Taiwanese zero Color un ready to go Dakota aside.one.How many dogs do you have I have two dogs?two.When is your birthday?It's June 12th.three.What time is it now?It's three 30."
        }
    ],
    "utteranceid": "20240421/22/018f00df9d5d0a30112294c5_20240421_223645",
    "text": "It's.Mitsuno to unanimous.Taiwanese zero Color un ready to go Dakota aside.one.How many dogs do you have I have two dogs?two.When is your birthday?It's June 12th.three.What time is it now?It's three 30.",
    "code": "",
    "message": ""
}

Result from the Japanese engine

"text": "1三つの対話が読まれます。対話の内容に合うからの中から選び、記号で答えなさい。
メリットアークスというはい中でアークス中央を毎日アップして、月中旬はオフ古いチャイムしねよ。
HTVストーリー",

With the Japanese engine, confidence was high in the first half, whereas it seemed to drop in the second half.

Excerpt of the Japanese engine result

{
    "results": [
        {
            "tokens": [
                {
                    "written": "1",
                    "confidence": 0.87,
                    "starttime": 408,
                    "endtime": 1096,
                    "spoken": "いち"
                },
                {
                    "written": "三つ",
                    "confidence": 0.98,
                    "starttime": 2190,
                    "endtime": 2718,
                    "spoken": "みっつ"
                },
                // omitted
                {
                    "written": "HTV",
                    "confidence": 0.83,
                    "starttime": 37504,
                    "endtime": 38320,
                    "spoken": "えいちてぃーぶい"
                },
                {
                    "written": "ストーリー",
                    "confidence": 0.95,
                    "starttime": 38320,
                    "endtime": 39136,
                    "spoken": "すとーりー"
                }
            ],
            "confidence": 0.9506363,
            "starttime": 200,
            "endtime": 39296,
            "tags": [],
            "rulename": "",
            "text": "1三つの対話が読まれます。対話の内容に合うからの中から選び、記号で答えなさい。メリットアークスというはい中でアークス中央を毎日アップして、月中旬はオフ古いチャイムしねよ。HTVストーリー"
        }
    ],
    "utteranceid": "20240421/22/018f00e6d4860a30368b94c2_20240421_224438",
    "text": "1三つの対話が読まれます。対話の内容に合うからの中から選び、記号で答えなさい。メリットアークスというはい中でアークス中央を毎日アップして、月中旬はオフ古いチャイムしねよ。HTVストーリー",
    "code": "",
    "message": ""
}

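Stepping into multiple languages

If I dug through papers on multilingual models, I would surely find complicated mechanisms and algorithms, but here I tried the simplest and most understandable approach.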
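
To illustrate what a simple approach might look like, here is one possible sketch: run the same audio through both engines, then for each time window keep the engine whose tokens have the higher mean confidence. This is my own assumption for illustration and may differ from what the repository actually does; the function names, the window size, and the toy token lists are all hypothetical.

```python
from statistics import mean


def mean_confidence(tokens, start_ms, end_ms):
    """Mean confidence of tokens whose starttime falls in [start_ms, end_ms); 0.0 if there are none."""
    values = [
        t["confidence"]
        for t in tokens
        if "confidence" in t and start_ms <= t["starttime"] < end_ms
    ]
    return mean(values) if values else 0.0


def pick_engine_per_window(tokens_by_engine, total_ms, window_ms=5000):
    """For each window, return (window_start_ms, engine_name) for the highest mean confidence.

    On a tie, the engine listed first in tokens_by_engine wins.
    """
    choices = []
    for start in range(0, total_ms, window_ms):
        scores = {
            name: mean_confidence(tokens, start, start + window_ms)
            for name, tokens in tokens_by_engine.items()
        }
        choices.append((start, max(scores, key=scores.get)))
    return choices


# Toy input built from the few tokens shown in the two excerpts above.
tokens_by_engine = {
    "en": [
        {"written": "It's", "confidence": 0.49, "starttime": 460},
        {"written": "30", "confidence": 1.00, "starttime": 38300},
    ],
    "ja": [
        {"written": "1", "confidence": 0.87, "starttime": 408},
        {"written": "三つ", "confidence": 0.98, "starttime": 2190},
        {"written": "HTV", "confidence": 0.83, "starttime": 37504},
        {"written": "ストーリー", "confidence": 0.95, "starttime": 38320},
    ],
}

print(pick_engine_per_window(tokens_by_engine, total_ms=40000, window_ms=20000))
```

A caveat with this kind of selection is visible in the excerpts themselves: the Japanese engine still reports fairly high confidence for tokens such as "HTV" and "ストーリー" in the English part, so confidence alone may not separate the languages reliably.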

What I actually built is the demo below.

DEMO

Quick local try

git clone https://github.com/Benzenoil/amivoice-multi-lang-recognize.git
cd amivoice-multi-lang-recognize

# update .env.example to set APIKEY
vim .env.example
mv .env.example .env

docker-compose up -d

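Opening http://localhost:3031/ shows the following screen.

Result for English-only audio

Result for Japanese-English audio

Thoughts

Using the AmiVoice API made me realize how easy speech recognition has become to use. On the other hand, when it comes to recognizing audio that contains multiple languages, I think there is still a long way to go before it works well.
I am also curious about how the confidence value in the AmiVoice API results is calculated. It is puzzling that confidence comes out above 90% even when the audio contains multiple languages.

That concludes my attempt at multilingual speech recognition with the AmiVoice API.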
