Concept

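Apply natural language classification as a post-processing step to speech transcription.
Speech transcription is rarely perfect, but a natural language classifier can compensate for recognition errors so that downstream processing is more accurate.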

IBM Cloud (Bluemix)

Account

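At the time of writing, Bluemix offers the following two kinds of free account.

Lite account
- IBM's free account
- Explanatory article
- Speech to Text is available; Natural Language Classifier is not
Trial account
- Almost all features can be used free of charge for one month
- Manual

Because this article uses Natural Language Classifier, a trial account is used.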

API

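Watson Developer Cloud provides the Watson APIs and SDKs.
SDKs are available for Node.js, Java, Python, C#, and Swift.
This article uses Python (python-sdk git). The Python versions used were 3.5 and 3.6.
The linked article explains in detail how to send requests to the API, and this article basically follows the same approach.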

Speech to Text

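Preparing the audio

Have a text read aloud by speech synthesis.

Windows
- SofTalk
Mac
- Text can be read aloud with the say command in Terminal (ref)
- Watson Speech to Text does not support aif, so the file must be converted with iTunes or similar (ref)
say -v kyoko "これはテストです。" -o test.aif
(The quoted text means "This is a test.")
For the audio formats Watson Speech to Text supports, see IBM's documentation page; searching for content-type there finds the list.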
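
Since recognize needs a content_type that matches the audio file, a small lookup can keep the two in sync. The following is only a sketch: the helper name guess_content_type and the table are made up for illustration, and the table lists just a few formats assumed to be accepted by Speech to Text, so check IBM's page before relying on it.

```python
import os

# Hypothetical lookup table: a few extensions and the content types assumed
# to be accepted by Speech to Text. Verify against IBM's documentation.
CONTENT_TYPES = {
    ".wav": "audio/wav",
    ".flac": "audio/flac",
    ".ogg": "audio/ogg",
}


def guess_content_type(fname):
    """Return the content type for fname, or raise ValueError if unknown."""
    ext = os.path.splitext(fname)[1].lower()
    if ext not in CONTENT_TYPES:
        raise ValueError("unsupported or unknown audio extension: " + ext)
    return CONTENT_TYPES[ext]


print(guess_content_type("test.wav"))
```

Raising an error for .aif makes the missing conversion step visible before the request is sent.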

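Transcription

There are only two steps.
1. Create an API key for Speech to Text in Bluemix.
2. Write the Python code locally.

Creating the Bluemix API key
- On the Bluemix catalog page (top right), select Speech to Text and create the service
- Choose Dashboard from the menu at the top left, then create an API key under "Service credentials" for the service you just created
- Save it as JSON
- Explanatory article
Python code
- Import SpeechToTextV1 from watson_developer_cloud and pass it the username and password from the API key JSON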


import json

from watson_developer_cloud import SpeechToTextV1

class s2t_setting:
    fname = "test.wav"                # audio file to transcribe
    model = "ja-JP_NarrowbandModel"   # Japanese narrowband model
    ctype = "audio/wav"               # content type of the audio file

# Load the service credentials saved from Bluemix.
with open('s2t_key.json', 'r') as f:
    id_dict = json.load(f)
class s2t_account:
    uname = id_dict["username"]
    passwd = id_dict["password"]

wdc_s2t = SpeechToTextV1( username=s2t_account.uname, password=s2t_account.passwd, x_watson_learning_opt_out=False)
# Send the audio and request per-word confidence scores.
with open(s2t_setting.fname, 'rb') as audio:
    s2t_json = wdc_s2t.recognize( audio, model=s2t_setting.model, content_type=s2t_setting.ctype, word_confidence=True)
# ensure_ascii=False keeps the Japanese text readable in the output.
print(json.dumps(s2t_json, ensure_ascii=False, indent=2) )
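
The key-loading lines above are repeated for every service, so they could be factored into a helper. This is a minimal sketch, assuming the credentials JSON has "username" and "password" keys as in the code above; the function name load_credentials is hypothetical.

```python
import json


def load_credentials(path):
    """Return (username, password) from a service-credentials JSON file."""
    with open(path, "r") as f:
        creds = json.load(f)
    # Fail early with a readable message if the file is not the expected one.
    missing = [key for key in ("username", "password") if key not in creds]
    if missing:
        raise KeyError("missing keys in " + path + ": " + ", ".join(missing))
    return creds["username"], creds["password"]
```

With this, the account setup becomes uname, passwd = load_credentials('s2t_key.json').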

As speech_to_text_v1.py shows, recognize accepts other options as well, such as timestamps.

The result looks like this.

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "これ は ですと です ",
          "confidence": 0.529,
          "word_confidence": [
            [
              "これ",
              0.792
            ],
            [
              "は",
              0.702
            ],
            [
              "ですと",
              0.29
            ],
            [
              "です",
              0.522
            ]
          ]
        }
      ],
      "final": true
    }
  ],
  "result_index": 0,
  "warnings": [
    "Unknown arguments: continuous."
  ]
}

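The response reports a confidence for each word.
transcript: the transcription result
- Here "テスト" (test) was transcribed as "ですと"
confidence: the confidence of the transcript
word_confidence: each word of the most confident transcription, with its confidence
- "ですと" has a confidence of 0.29, lower than the other words
warnings: a warning raised because SpeechToTextV1.recognize sends an option that Speech to Text does not accept; apparently this is to be addressed in the next release (issue)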
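
To pick out the words that deserve a second look, the word_confidence pairs can be filtered by a threshold. The sketch below works on the response dict shown above; the function name low_confidence_words and the 0.5 cutoff are assumptions for illustration, not values recommended by IBM.

```python
def low_confidence_words(s2t_json, threshold=0.5):
    """Return (word, confidence) pairs whose confidence is below threshold."""
    flagged = []
    for result in s2t_json.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # Only the first alternative is inspected; it carries word_confidence.
        for word, conf in alternatives[0].get("word_confidence", []):
            if conf < threshold:
                flagged.append((word, conf))
    return flagged


# The response from the example above, reduced to the fields used here.
sample = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "これ は ですと です ",
                    "confidence": 0.529,
                    "word_confidence": [
                        ["これ", 0.792],
                        ["は", 0.702],
                        ["ですと", 0.29],
                        ["です", 0.522],
                    ],
                }
            ],
            "final": True,
        }
    ]
}

print(low_confidence_words(sample))
```

With the 0.5 cutoff, only the misrecognized "ですと" is flagged for this response.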

Natural language classifier

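Create the API key in the same way as for Speech to Text.

Create a csv in the same format as the sample data in watson-developer-cloud/python-sdk and train the classifier on it.

The training data here are titles of new papers on arXiv astro-ph. Each of EP, GA, CO, HE, IM, and SR was trained with five titles.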
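
Paper titles often contain commas, so it is safer to write the training csv with the csv module than by hand. This sketch assumes the two-column "text,class" layout of the SDK sample data; the function name write_training_csv is hypothetical and the titles are placeholders, not the ones actually used for training.

```python
import csv
import io


def write_training_csv(rows, fileobj):
    """Write (title, label) pairs as one 'text,class' row per line."""
    writer = csv.writer(fileobj, lineterminator="\n")
    for title, label in rows:
        # csv.writer quotes any title that contains a comma.
        writer.writerow([title, label])


# Placeholder rows for illustration only.
rows = [
    ("Placeholder title about exoplanets", "EP"),
    ("Placeholder title, with a comma, about galaxies", "GA"),
]

buf = io.StringIO()
write_training_csv(rows, buf)
print(buf.getvalue())
```

To produce a file instead, pass a handle opened with open('astro-ph.csv', 'w', newline='').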

import json
from watson_developer_cloud import NaturalLanguageClassifierV1

# Load the service credentials saved from Bluemix.
f_nlc = open('nlc_key.json', 'r')
id_dict = json.load(f_nlc)
class nlc_account:
    uname = id_dict["username"]
    passwd = id_dict["password"]
f_nlc.close()

nlc = NaturalLanguageClassifierV1( username=nlc_account.uname, password=nlc_account.passwd)
# Start training; the returned dict contains the new classifier's ID.
with open('astro-ph.csv', 'rb') as training_data:
    d = nlc.create(training_data = training_data, name='astro-ph')
# Save the classifier ID for the later status check and classification.
with open('c_id.txt', 'w') as f:
    f.write(d["classifier_id"])

Check the status with the code below; training is complete when it changes from Training to Available. (The lines from import json through f_nlc.close() are omitted.)

# Read back the classifier ID saved at training time.
with open('c_id.txt', 'r') as f_cid:
    c_id = f_cid.readline().strip()

nlc = NaturalLanguageClassifierV1( username=nlc_account.uname, password=nlc_account.passwd)
status = nlc.status(c_id)["status"]
print(status)
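
Instead of rerunning the status check by hand, the check can be wrapped in a polling loop. This is a rough sketch: wait_until_available is a hypothetical helper, the interval and retry count are arbitrary, and the status source is passed in as a function so the loop itself does not depend on the SDK.

```python
import time


def wait_until_available(get_status, interval_sec=30, max_checks=40,
                         sleep=time.sleep):
    """Poll get_status() until it returns "Available".

    Returns True once the classifier is available, False if max_checks
    is reached first. Any status other than "Training" or "Available"
    is treated as an error.
    """
    for _ in range(max_checks):
        status = get_status()
        if status == "Available":
            return True
        if status != "Training":
            raise RuntimeError("unexpected classifier status: " + status)
        sleep(interval_sec)
    return False
```

With the objects from the code above, the call would be wait_until_available(lambda: nlc.status(c_id)["status"]).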

Classifying a test title as follows,

nlc = NaturalLanguageClassifierV1( username=nlc_account.uname, password=nlc_account.passwd)
category = nlc.classify(c_id, 'Formation of Close-in Super-Earths by Giant Impacts: Effects of Initial Eccentricities and Inclinations of Protoplanets')
print(json.dumps(category, indent=2))

gives the result below. The title is classified correctly.

{
  "classifier_id": "[your c_id]",
  "url": "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/[your c_id]",
  "text": "Formation of Close-in Super-Earths by Giant Impacts: Effects of Initial Eccentricities and Inclinations of Protoplanets",
  "top_class": "EP",
  "classes": [
    {
      "class_name": "EP",
      "confidence": 0.41011031439231577
    },
    {
      "class_name": "HE",
      "confidence": 0.27868247955725023
    },
    {
      "class_name": "IM",
      "confidence": 0.21792990040141166
    },
    {
      "class_name": "SR",
      "confidence": 0.03598143466169043
    },
    {
      "class_name": "GA",
      "confidence": 0.03249864877529136
    },
    {
      "class_name": "CO",
      "confidence": 0.024797222212040608
    }
  ]
}
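
With so little training data, the top class alone says little about how decisive the classification was; the gap to the runner-up is a useful extra number. The sketch below computes it from the response dict above. The function name top_two_margin is hypothetical, and the sample is reduced to the fields it uses.

```python
def top_two_margin(category):
    """Return (top class name, confidence gap to the second class)."""
    ranked = sorted(category["classes"],
                    key=lambda c: c["confidence"], reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two classes to compute a margin")
    return ranked[0]["class_name"], ranked[0]["confidence"] - ranked[1]["confidence"]


# The classes from the response above.
sample_category = {
    "top_class": "EP",
    "classes": [
        {"class_name": "EP", "confidence": 0.41011031439231577},
        {"class_name": "HE", "confidence": 0.27868247955725023},
        {"class_name": "IM", "confidence": 0.21792990040141166},
        {"class_name": "SR", "confidence": 0.03598143466169043},
        {"class_name": "GA", "confidence": 0.03249864877529136},
        {"class_name": "CO", "confidence": 0.024797222212040608},
    ],
}

name, margin = top_two_margin(sample_category)
print(name, round(margin, 3))
```

Here EP leads HE by about 0.13.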

Natural Language Classifier is not included in the Lite plan, so delete any classifier you are no longer using right away.
A trained classifier is deleted with nlc.remove(c_id).

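Speech recognition + natural language classification

This simply joins the two pieces above.
Because the title is in English, have Agnes read it with
say -v Agnes "Formation of Close-in Super-Earths by Giant Impacts: Effects of Initial Eccentricities and Inclinations of Protoplanets" -o title1.aif
and then read title1.wav (converted from the aif) with code that just glues the two scripts together: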

import json
from watson_developer_cloud import NaturalLanguageClassifierV1
from watson_developer_cloud import SpeechToTextV1

def s2t():
    class s2t_setting:
        fname = "title1.wav"
        model = "en-US_NarrowbandModel"
        ctype = "audio/wav"

    f_s2t = open('../s2t/s2t_key.json', 'r')
    id_dict = json.load(f_s2t)
    f_s2t.close()
    class s2t_account:
        uname = id_dict["username"]
        passwd = id_dict["password"]

    wdc_s2t = SpeechToTextV1( username=s2t_account.uname, password=s2t_account.passwd, x_watson_learning_opt_out=False)
    with open(s2t_setting.fname, 'rb') as audio:
        s2t_json = wdc_s2t.recognize( audio, model=s2t_setting.model, content_type=s2t_setting.ctype, word_confidence=True)
    # Return only the top transcript of the first result.
    return s2t_json["results"][0]["alternatives"][0]["transcript"]

def nlc(text):
    f_nlc = open('../nlc/nlc_key.json', 'r')
    id_dict = json.load(f_nlc)
    class nlc_account:
        uname = id_dict["username"]
        passwd = id_dict["password"]
    f_nlc.close()

    with open('../nlc/c_id.txt', 'r') as f_cid:
        c_id = f_cid.readline().strip()

    # Named classifier rather than nlc to avoid shadowing this function.
    classifier = NaturalLanguageClassifierV1( username=nlc_account.uname, password=nlc_account.passwd)
    category = classifier.classify(c_id, text)
    return category


if __name__ == '__main__':
    s2t_trans = s2t()
    print(s2t_trans)
    nlc_cat = nlc(s2t_trans)
    print(nlc_cat["top_class"])
    print(nlc_cat["classes"])

The result is:

see formation of clothes in super earths by giant impact effects of initial eccentricities and inclinations are brutal planet
EP
[{'class_name': 'EP', 'confidence': 0.9611551555905992}, {'class_name': 'HE', 'confidence': 0.014304030934184617}, {'class_name': 'GA', 'confidence': 0.011652070470274159}, {'class_name': 'IM', 'confidence': 0.004748696438249261}, {'class_name': 'SR', 'confidence': 0.004435893207788495}, {'class_name': 'CO', 'confidence': 0.0037041533589043437}]

close and Protoplanets were not transcribed well, but the result is still correctly classified as EP.
(For some reason the confidence is higher than for the original text...)
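
To see at a glance which words of the title did not survive transcription, the original text and the transcript can be compared word by word. This is a simple bag-of-words check, not a proper alignment, and the helper names words and missing_words are made up for illustration.

```python
import re


def words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def missing_words(reference, transcript):
    """Return the words of reference that never appear in transcript."""
    heard = set(words(transcript))
    return [w for w in words(reference) if w not in heard]


title = ("Formation of Close-in Super-Earths by Giant Impacts: Effects of "
         "Initial Eccentricities and Inclinations of Protoplanets")
transcript = ("see formation of clothes in super earths by giant impact "
              "effects of initial eccentricities and inclinations are "
              "brutal planet")

print(missing_words(title, transcript))
```

Because the comparison is exact, it also reports impacts, which was transcribed in the singular, alongside close and protoplanets.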

Speech recognition + natural language classification with Watson