In my earlier post "iOSアプリでの音声認識機能実装方法まとめ" (a roundup of ways to implement speech recognition in iOS apps), I wrote the following:

Addendum (2016/06/16)
The official speech recognition API has been opened up in iOS 10!
I think this is the one that will stick.

In practice, however, it still had many restrictions and its usable scenarios were limited (as of December 2017).
Fortunately, Google Cloud Speech API, the commercial version of the Google Speech API, has been released.
With it, speech recognition that holds up in real use looks achievable.
I tried building an app with it at a hackathon:

At the recent Yahoo! HACK DAY, a colleague and I built this: spoken words become 3D objects. A native iOS app in Swift, using ARKit, SceneKit, Google Speech API, Natural Language API, and Poly API. pic.twitter.com/8VPrG3BGRi
— satoshi0212 (@shmdevelop) December 11, 2017

It displays 3D objects for the recognized speech in AR space.
I've extracted just the speech recognition part of that app.
Getting it running took a bit of work, so I hope the notes below are helpful.
What is Google Cloud Speech API?
It is an online speech-to-text service backed by machine learning.
It can either return results incrementally over a streaming connection or return them in one shot via REST.
The official Swift sample for the streaming mode is here:
Cloud Speech Streaming gRPC Swift Sample
Step 1. Enable the API in the Console
Enable the Cloud Speech API from the Google API Console page.


Step 2. pod install and a few manual fixes
Create a new project in Xcode.
I named the project GoogleSpeechRecognizerSample.
Run pod init in the project root and change the Podfile contents to the following:
target 'GoogleSpeechRecognizerSample' do
  use_frameworks!
  pod 'googleapis', :path => '.'
end
Place this podspec file (googleapis.podspec) in the same directory:
Pod::Spec.new do |s|
  s.name = 'googleapis'
  s.version = '0.0.1'
  s.license = 'Apache 2.0'
  s.authors = { 'Google Inc.' => 'timburks@google.com'}
  s.homepage = 'http://github.com/GoogleCloudPlatform/ios-docs-samples'
  s.source = { :git => 'https://github.com/GoogleCloudPlatform/ios-docs-samples.git',
               :tag => '0.0.1' }
  s.summary = 'Service definitions for Google Cloud Platform APIs'
  s.ios.deployment_target = '7.1'
  s.osx.deployment_target = '10.9'

  # Run protoc with the Objective-C and gRPC plugins to generate protocol messages and gRPC clients.
  s.dependency "!ProtoCompiler-gRPCPlugin", "~> 1.0"

  # Pods directory corresponding to this app's Podfile, relative to the location of this podspec.
  pods_root = 'Pods'

  # Path where Cocoapods downloads protoc and the gRPC plugin.
  protoc_dir = "#{pods_root}/!ProtoCompiler"
  protoc = "#{protoc_dir}/protoc"
  plugin = "#{pods_root}/!ProtoCompiler-gRPCPlugin/grpc_objective_c_plugin"

  # Run protoc with the Objective-C and gRPC plugins to generate protocol messages and gRPC clients.
  # You can run this command manually if you later change your protos and need to regenerate.
  s.prepare_command = <<-CMD
    #{protoc} \
        --objc_opt=named_framework_to_proto_path_mappings_path=./protomap \
        --plugin=protoc-gen-grpc=#{plugin} \
        --objc_out=. \
        --grpc_out=. \
        -I . \
        -I #{protoc_dir} \
        google/*/*.proto google/*/*/*/*.proto
  CMD

  # The --objc_out plugin generates a pair of .pbobjc.h/.pbobjc.m files for each .proto file.
  s.subspec "Messages" do |ms|
    ms.source_files = "google/**/*.pbobjc.{h,m}"
    ms.header_mappings_dir = "."
    ms.requires_arc = false
    ms.dependency "Protobuf"
  end

  # The --objcgrpc_out plugin generates a pair of .pbrpc.h/.pbrpc.m files for each .proto file with
  # a service defined.
  s.subspec "Services" do |ss|
    ss.source_files = "google/**/*.pbrpc.{h,m}"
    ss.header_mappings_dir = "."
    ss.requires_arc = true
    ss.dependency "gRPC-ProtoRPC"
    ss.dependency "#{s.name}/Messages"
  end

  s.pod_target_xcconfig = {
    'GCC_PREPROCESSOR_DEFINITIONS' => '$(inherited) GPB_USE_PROTOBUF_FRAMEWORK_IMPORTS=1',
    'USER_HEADER_SEARCH_PATHS' => '$SRCROOT/..'
  }
end
Run pod install.
Once it succeeds, make the following fixes inside GoogleSpeechRecognizerSample.xcworkspace.
A bit tedious, I know.
1) Remove the following line from gRPC-RxLibrary-umbrella.h:
#import "transformations/GRXMappingWriter.h"
2) Replace the following import statements:
Before:
#import "google/cloud/speech/v1/CloudSpeech.pbobjc.h"
#import "google/api/Annotations.pbobjc.h"
#import "google/longrunning/Operations.pbobjc.h"
#import "google/rpc/Status.pbobjc.h"
#import "google/protobuf/Duration.pbobjc.h"
After:
#import <googleapis/CloudSpeech.pbobjc.h>
#import <googleapis/Annotations.pbobjc.h>
#import <googleapis/Operations.pbobjc.h>
#import <googleapis/Status.pbobjc.h>
#import <googleapis/Duration.pbobjc.h>
3) Configure the Objective-C Bridging Header with:
#import <googleapis/CloudSpeech.pbobjc.h>
#import <googleapis/CloudSpeech.pbrpc.h>

Step 3. Implementing the gRPC stream
Now that the project builds, let's implement the streaming.
That said, it's almost exactly the official sample.
Initialize the audio plumbing:
func prepare(specifiedSampleRate: Int) -> OSStatus {
    var status = noErr

    let session = AVAudioSession.sharedInstance()
    do {
        try session.setCategory(AVAudioSessionCategoryRecord)
        try session.setPreferredIOBufferDuration(10)
    } catch {
        return -1
    }

    var sampleRate = session.sampleRate
    print("hardware sample rate = \(sampleRate), using specified rate = \(specifiedSampleRate)")
    sampleRate = Double(specifiedSampleRate)

    // Describe the RemoteIO unit
    var audioComponentDescription = AudioComponentDescription()
    audioComponentDescription.componentType = kAudioUnitType_Output;
    audioComponentDescription.componentSubType = kAudioUnitSubType_RemoteIO;
    audioComponentDescription.componentManufacturer = kAudioUnitManufacturer_Apple;
    audioComponentDescription.componentFlags = 0;
    audioComponentDescription.componentFlagsMask = 0;

    // Get the RemoteIO unit
    let remoteIOComponent = AudioComponentFindNext(nil, &audioComponentDescription)
    status = AudioComponentInstanceNew(remoteIOComponent!, &remoteIOUnit)
    if (status != noErr) {
        return status
    }

    let bus1: AudioUnitElement = 1
    var oneFlag: UInt32 = 1

    // Configure the RemoteIO unit for input
    status = AudioUnitSetProperty(remoteIOUnit!,
                                  kAudioOutputUnitProperty_EnableIO,
                                  kAudioUnitScope_Input,
                                  bus1,
                                  &oneFlag,
                                  UInt32(MemoryLayout<UInt32>.size));
    if (status != noErr) {
        return status
    }

    // Set format for mic input (bus 1) on RemoteIO's output scope
    var asbd = AudioStreamBasicDescription()
    asbd.mSampleRate = sampleRate
    asbd.mFormatID = kAudioFormatLinearPCM
    asbd.mFormatFlags = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked
    asbd.mBytesPerPacket = 2
    asbd.mFramesPerPacket = 1
    asbd.mBytesPerFrame = 2
    asbd.mChannelsPerFrame = 1
    asbd.mBitsPerChannel = 16
    status = AudioUnitSetProperty(remoteIOUnit!,
                                  kAudioUnitProperty_StreamFormat,
                                  kAudioUnitScope_Output,
                                  bus1,
                                  &asbd,
                                  UInt32(MemoryLayout<AudioStreamBasicDescription>.size))
    if (status != noErr) {
        return status
    }

    // Set the recording callback
    var callbackStruct = AURenderCallbackStruct()
    callbackStruct.inputProc = recordingCallback
    callbackStruct.inputProcRefCon = nil
    status = AudioUnitSetProperty(remoteIOUnit!,
                                  kAudioOutputUnitProperty_SetInputCallback,
                                  kAudioUnitScope_Global,
                                  bus1,
                                  &callbackStruct,
                                  UInt32(MemoryLayout<AURenderCallbackStruct>.size));
    if (status != noErr) {
        return status
    }

    // Initialize the RemoteIO unit
    return AudioUnitInitialize(remoteIOUnit!)
}
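As in the official sample, prepare(specifiedSampleRate:) and the recording callback below are built around an AudioController singleton: the callback reaches it through AudioController.sharedInstance and forwards the captured buffers to a delegate. Here is a minimal skeleton of that class, adapted from the sample; treat it as a sketch rather than the exact implementation.

import Foundation
import AudioToolbox

protocol AudioControllerDelegate {
    func processSampleData(_ data: Data) -> Void
}

class AudioController {
    static var sharedInstance = AudioController()

    // The RemoteIO unit configured by prepare(specifiedSampleRate:), which is also a method of this class.
    var remoteIOUnit: AudioUnit?
    // Receives microphone buffers from recordingCallback (the ViewController in this sample).
    var delegate: AudioControllerDelegate!

    // Start pulling microphone buffers; recordingCallback fires for each one.
    func start() -> OSStatus {
        return AudioOutputUnitStart(remoteIOUnit!)
    }

    // Stop the RemoteIO unit.
    func stop() -> OSStatus {
        return AudioOutputUnitStop(remoteIOUnit!)
    }
}

In the ViewController, recording then starts with something like _ = AudioController.sharedInstance.prepare(specifiedSampleRate: SAMPLE_RATE) followed by _ = AudioController.sharedInstance.start().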
The recording callback:
func recordingCallback(
    inRefCon: UnsafeMutableRawPointer,
    ioActionFlags: UnsafeMutablePointer<AudioUnitRenderActionFlags>,
    inTimeStamp: UnsafePointer<AudioTimeStamp>,
    inBusNumber: UInt32,
    inNumberFrames: UInt32,
    ioData: UnsafeMutablePointer<AudioBufferList>?) -> OSStatus {

    var status = noErr
    let channelCount: UInt32 = 1

    var bufferList = AudioBufferList()
    bufferList.mNumberBuffers = channelCount
    let buffers = UnsafeMutableBufferPointer<AudioBuffer>(start: &bufferList.mBuffers,
                                                          count: Int(bufferList.mNumberBuffers))
    buffers[0].mNumberChannels = 1
    buffers[0].mDataByteSize = inNumberFrames * 2
    buffers[0].mData = nil

    // get the recorded samples
    status = AudioUnitRender(AudioController.sharedInstance.remoteIOUnit!,
                             ioActionFlags,
                             inTimeStamp,
                             inBusNumber,
                             inNumberFrames,
                             UnsafeMutablePointer<AudioBufferList>(&bufferList))
    if (status != noErr) {
        return status;
    }

    let data = Data(bytes: buffers[0].mData!, count: Int(buffers[0].mDataByteSize))
    DispatchQueue.main.async {
        AudioController.sharedInstance.delegate.processSampleData(data)
    }

    return noErr
}
Handling the callback on the ViewController side:
func processSampleData(_ data: Data) -> Void {
    audioData.append(data)

    // We recommend sending samples in 100ms chunks
    let chunkSize: Int /* bytes/chunk */ = Int(0.1 /* seconds/chunk */
        * Double(SAMPLE_RATE) /* samples/second */
        * 2 /* bytes/sample */);

    if (audioData.length > chunkSize) {
        SpeechRecognitionService.sharedInstance.streamAudioData(audioData, completion: { [weak self] (response, error) in
            guard let strongSelf = self else {
                return
            }
            if let error = error {
                strongSelf.textView.text = error.localizedDescription
            } else if let response = response {
                var finished = false
                print(response)
                for result in response.resultsArray! {
                    if let result = result as? StreamingRecognitionResult {
                        if result.isFinal {
                            finished = true
                        }
                    }
                }
                strongSelf.textView.text = response.description
                if finished {
                    strongSelf.stopStreaming()
                }
            }
        })
        self.audioData = NSMutableData()
    }
}
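The SpeechRecognitionService used above also comes almost straight from the official sample: it opens a single streaming gRPC call, sends a StreamingRecognitionConfig as the first message, and then writes each audio chunk into a GRXBufferedPipe. Below is a condensed sketch; API_KEY and HOST are values you supply yourself, the generated types (Speech, RecognitionConfig, StreamingRecognizeRequest, ...) come from the googleapis pod, and the exact method names should be verified against the sample code.

import Foundation
import googleapis

// Supplied by you; the API key comes from the Google API Console.
let API_KEY = "YOUR_API_KEY"
let HOST = "speech.googleapis.com"

typealias SpeechRecognitionCompletionHandler = (StreamingRecognizeResponse?, NSError?) -> (Void)

class SpeechRecognitionService {
    static let sharedInstance = SpeechRecognitionService()

    var sampleRate: Int = 16000
    private var streaming = false

    private var client: Speech!          // generated gRPC service client
    private var writer: GRXBufferedPipe! // request stream we keep pushing audio into
    private var call: GRPCProtoCall!

    func streamAudioData(_ audioData: NSData, completion: @escaping SpeechRecognitionCompletionHandler) {
        if !streaming {
            // Open the streaming RPC once, authenticated with the API key.
            client = Speech(host: HOST)
            writer = GRXBufferedPipe()
            call = client.rpcToStreamingRecognize(withRequestsWriter: writer,
                                                  eventHandler: { (done, response, error) in
                completion(response, error as? NSError)
            })
            call.requestHeaders.setObject(NSString(string: API_KEY),
                                          forKey: NSString(string: "X-Goog-Api-Key"))
            call.requestHeaders.setObject(NSString(string: Bundle.main.bundleIdentifier!),
                                          forKey: NSString(string: "X-Ios-Bundle-Identifier"))
            call.start()
            streaming = true

            // The first message on the stream carries the recognition config.
            let recognitionConfig = RecognitionConfig()
            recognitionConfig.encoding = .linear16
            recognitionConfig.sampleRateHertz = Int32(sampleRate)
            recognitionConfig.languageCode = "ja-JP"
            recognitionConfig.maxAlternatives = 1

            let streamingRecognitionConfig = StreamingRecognitionConfig()
            streamingRecognitionConfig.config = recognitionConfig
            streamingRecognitionConfig.singleUtterance = false
            streamingRecognitionConfig.interimResults = true

            let configRequest = StreamingRecognizeRequest()
            configRequest.streamingConfig = streamingRecognitionConfig
            writer.writeValue(configRequest)
        }

        // Every subsequent message carries a chunk of raw audio.
        let audioRequest = StreamingRecognizeRequest()
        audioRequest.audioContent = audioData as Data
        writer.writeValue(audioRequest)
    }
}

The service in the sample additionally exposes stopStreaming() / isStreaming(), and the ViewController's own stopStreaming() roughly combines those with AudioController.sharedInstance.stop().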
Step 4. Implementing the REST version
With REST, the audio is first written to a file on the device and then sent.
The destination file path:
var soundFilePath: String {
    let paths = NSSearchPathForDirectoriesInDomains(.documentDirectory, .userDomainMask, true)
    guard let path = paths.first else { return "" }
    return path.appending("/sound.caf")
}
Initialize the AVAudioSession and the recorder:
func prepare(delegate: AudioRESTControllerDelegate) {
    self.delegate = delegate

    AVAudioSession.sharedInstance().requestRecordPermission() { [unowned self] allowed in
        if allowed {
            print(self.soundFilePath)
        } else {
        }
    }

    let soundFileURL = URL(fileURLWithPath: soundFilePath)
    let session = AVAudioSession.sharedInstance()
    do {
        try session.setCategory(AVAudioSessionCategoryPlayAndRecord, with: .defaultToSpeaker)
        try session.setMode(AVAudioSessionModeMeasurement)
        try session.setActive(true)
        let settings = [
            AVSampleRateKey: SAMPLE_RATE,
            AVNumberOfChannelsKey: 1,
            AVEncoderBitRateKey: 16,
            AVEncoderAudioQualityKey: AVAudioQuality.max.rawValue
        ]
        audioRecorder = try AVAudioRecorder(url: soundFileURL, settings: settings)
        audioRecorder?.delegate = self
        audioRecorder?.isMeteringEnabled = true
    } catch let error {
        print(error)
    }
}
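To drive this, you start and stop the AVAudioRecorder prepared above and, once the file has been written, hand it to soundFileToText() shown next. A minimal sketch; startRecording/stopRecording are illustrative names rather than part of the sample, while the delegate method is the standard AVAudioRecorderDelegate callback.

// Illustrative helpers on the same controller.
func startRecording() {
    audioRecorder?.record()
}

func stopRecording() {
    audioRecorder?.stop()
}

// AVAudioRecorderDelegate: fires once the .caf file has been written to soundFilePath.
func audioRecorderDidFinishRecording(_ recorder: AVAudioRecorder, successfully flag: Bool) {
    if flag {
        soundFileToText()
    }
}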
Send the audio file and receive JSON containing the candidate transcriptions.
For now it simply returns the first candidate.
func soundFileToText() {
    let service = "https://speech.googleapis.com/v1/speech:recognize?key=\(API_KEY)"
    let data = try! Data(contentsOf: URL(fileURLWithPath: soundFilePath))

    let config: [String: Any] = [
        "encoding": "LINEAR16",
        "sampleRateHertz": "\(SAMPLE_RATE)",
        "languageCode": "ja-JP",
        "maxAlternatives": 1]
    let audioRequest = ["content": data.base64EncodedString()]
    let requestDictionary = ["config": config, "audio": audioRequest]
    let requestData = try! JSONSerialization.data(withJSONObject: requestDictionary, options: [])

    let request = NSMutableURLRequest(url: URL(string: service)!)
    request.addValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = requestData
    request.httpMethod = "POST"

    let task = URLSession.shared.dataTask(with: request as URLRequest, completionHandler: { (data, resp, err) in
        if let data = data, let json = try! JSONSerialization.jsonObject(with: data, options: []) as? [String: Any] {
            print(json)
            let results = (json["results"] as? [[String: Any]])
            if let first = results?.first, let alternatives = first["alternatives"] as? [[String: Any]] {
                if let alternativesFirst = alternatives.first, let str = alternativesFirst["transcript"] as? String {
                    DispatchQueue.main.async {
                        AudioRESTController.sharedInstance.delegate.doneAnalyze([str])
                    }
                }
            }
        }
    })
    task.resume()
}
Display the result in the ViewController:
func doneAnalyze(_ items: [String]) -> Void {
    if items.isEmpty { return }
    textView.text = items.joined(separator: ", ")
}
Closing thoughts
I've put a sample on GitHub that works once you swap in your own API key:
GoogleSpeechRecognizerSample
Next, I'd like to try things like continuous listening.