
Trying Watson speech recognition (STT: Speech to Text) in a Swift app

Last updated at 2018-03-26, posted at 2016-09-10

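I tried out a Swift sample app that recognizes speech when you talk to an iPhone, using the Speech to Text (STT) service offered by Watson on IBM Bluemix. iOS 10 introduced the Speech Recognition API, so on iOS alone that is the primary choice. Still, the Watson approach needs little code and stays simple, so it may be a good fit when you are restricted to iOS 9 or need to throw together a "voice recognition app" quickly, for example at a hackathon. It uses the Watson Developer Cloud iOS SDK. (Incidentally, an Apple-based sample that does not use Watson is here.)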

A little explanation

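The samples cover Recorded Audio, Streaming Audio, and Custom Capture Sessions.

Recorded Audio - uses a pre-recorded audio file.
Streaming Audio - recognizes speech in real time as you talk into the microphone.
Custom Capture Session - uses AVCaptureAudioDataOutput with your own AVCaptureSession. This has many uses, such as drawing an equalizer of the audio data on screen or saving the audio data.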

The steps are as follows.

1. Check out the sample project

Clone Speech to Text Demo (Swift) from GitHub.

$ git clone https://github.com/watson-developer-cloud/speech-to-text-swift.git

Build the project with Carthage.

$ carthage update --platform iOS

2. Fix up the sample project

Fix the broken link to SpeechToTextV1.framework. Once the Carthage build from the previous step has finished, the framework is in the /Carthage/Build/iOS folder, so point the project at that copy.
/ios-sdk/Examples/SpeechToText/Carthage/Build/iOS/SpeechToTextV1.framework
Create a Credentials.plist file. It holds the authentication information for Bluemix, so set the SpeechToTextUsername and SpeechToTextPassword keys to the Service Credentials shown in the management console of the Speech To Text service tied to your own Bluemix account.

Credentials.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>SpeechToTextUsername</key>
	<string>70c5833dc-a53a-ca94-6c9ee-f89a23dd3ggg</string>
	<key>SpeechToTextPassword</key>
	<string>0bUzR0Hmkv1g</string>
</dict>
</plist>
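
For reference, a plist like the one above can be read with Foundation alone. The following is a minimal sketch, not the sample project's own loading code: the type `STTCredentials` and the functions `parseCredentials(from:)` and `loadCredentials(bundle:)` are hypothetical names introduced for illustration, and only the two key names come from this article.

```swift
import Foundation

/// Hypothetical container for the two values stored in Credentials.plist.
struct STTCredentials {
    let username: String
    let password: String
}

/// Parses property-list data and extracts the two Speech to Text keys.
/// Returns nil if the data is not a dictionary plist or a key is missing.
func parseCredentials(from data: Data) -> STTCredentials? {
    guard
        let plist = try? PropertyListSerialization.propertyList(from: data, options: [], format: nil),
        let dict = plist as? [String: Any],
        let username = dict["SpeechToTextUsername"] as? String,
        let password = dict["SpeechToTextPassword"] as? String
    else { return nil }
    return STTCredentials(username: username, password: password)
}

/// Hypothetical convenience: looks up Credentials.plist in the given bundle.
func loadCredentials(bundle: Bundle = .main) -> STTCredentials? {
    guard
        let url = bundle.url(forResource: "Credentials", withExtension: "plist"),
        let data = try? Data(contentsOf: url)
    else { return nil }
    return parseCredentials(from: data)
}
```

Keeping the parsing separate from the bundle lookup makes it easy to check the file format without running the app. Since the file contains secrets, it is probably best left out of version control.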

3. Add Japanese support

Setting the model on TranscriptionSettings makes the service recognize Japanese.

ViewController.swift

// configure settings for streaming
var settings = TranscriptionSettings(contentType: .L16(rate: 44100, channels: 1))
settings.model = "ja-JP_BroadbandModel"
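
The model string follows a `<language>_<band>Model` pattern. As a rough sketch, a small helper can build it from a language tag and the capture sample rate. `modelName(language:sampleRate:)` is a hypothetical function, not part of the SDK, and it assumes the usual split where broadband models are meant for audio sampled at 16 kHz or higher and narrowband models for 8 kHz audio. Not every language offers both bands, so check the service's model list before relying on the result.

```swift
import Foundation

/// Hypothetical helper: builds a model identifier such as "ja-JP_BroadbandModel".
/// Assumes broadband for sample rates of 16 kHz and above, narrowband otherwise.
func modelName(language: String, sampleRate: Int) -> String {
    let band = sampleRate >= 16_000 ? "Broadband" : "Narrowband"
    return "\(language)_\(band)Model"
}

// The 44.1 kHz capture used in the snippet above maps to the broadband model.
let japaneseModel = modelName(language: "ja-JP", sampleRate: 44_100)
```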

4. Results

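Press the Start Streaming button and talk to the iPhone, and the recognized text appears. Recognition of Japanese is a bit hit-or-miss, and I wonder whether it can be tuned. Also, although the sample source code drops it, the service actually returns a confidence level as well, so that may be the thing to use. Speaking in longer sentences rather than single words gives the service more context, so accuracy might improve.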
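
One way the confidence value could be put to use is to drop low-confidence results before showing them. The sketch below works on the service's raw JSON response rather than on the SDK's result types, whose exact shape in this SDK version is not shown here. The struct names, `confidentTranscripts(in:threshold:)`, and the threshold are assumptions for illustration; the field names `results`, `alternatives`, `transcript`, `confidence`, and `final` follow the JSON returned by the Speech to Text REST API.

```swift
import Foundation

// Reduced model of the recognition response; only the fields used below.
struct RecognitionResponse: Decodable {
    struct Result: Decodable {
        struct Alternative: Decodable {
            let transcript: String
            let confidence: Double?   // may be absent, e.g. on interim results
        }
        let alternatives: [Alternative]
        let isFinal: Bool

        enum CodingKeys: String, CodingKey {
            case alternatives
            case isFinal = "final"    // "final" is a Swift keyword, so rename it
        }
    }
    let results: [Result]
}

/// Returns the best transcript of each final result whose confidence
/// is at least `threshold`. A missing confidence is treated as 0.
func confidentTranscripts(in json: Data, threshold: Double) throws -> [String] {
    let response = try JSONDecoder().decode(RecognitionResponse.self, from: json)
    return response.results
        .filter { $0.isFinal }
        .compactMap { $0.alternatives.first }
        .filter { ($0.confidence ?? 0) >= threshold }
        .map { $0.transcript }
}
```

A suitable threshold depends on the language and the audio quality, so it would need to be chosen by experiment.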

For more details

IBM Speech to Text service:

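For Android, the following article is a useful reference.
IBM WatsonのSpeechToTextで「Androidのマイクから音声を拾って文字を起こす」アプリを作った (I built an app that picks up audio from the Android microphone and transcribes it with IBM Watson's SpeechToText)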
