
Speech recognition in an iOS app with the Google Cloud Speech API


In iOSアプリでの音声認識機能実装方法まとめ (my earlier roundup of ways to implement speech recognition in iOS apps), I wrote the following:


Addendum (2016/06/16)
Apple's official speech recognition API was finally opened up in iOS 10!
I expected this to be the real answer.


In practice, however, it still had many restrictions and its use cases were limited (as of December 2017).

Fortunately, the Google Cloud Speech API, the commercial version of the Google Speech API, had become available.
With it, speech recognition that holds up in real use looks achievable.

I built an app like this at a hackathon.

It displays a 3D object for each recognized word in AR space.

I extracted just the speech-recognition part of that app.
Getting it running took some extra work, so I hope the notes below are useful.

What is the Google Cloud Speech API?

It is an online speech-to-text service backed by machine learning.
It offers a streaming mode that returns results incrementally over gRPC, and a REST mode that returns the result in a single response.

The official streaming sample for Swift is here:
Cloud Speech Streaming gRPC Swift Sample

Step 1. Enable the API in the Console

Enable the Cloud Speech API from the Google API Console page.


Click "Enable APIs and Services".


Select Google Cloud Speech API and click "Enable".

Step 2. pod install and a few manual fixes

Create a new project in Xcode.
I named the project GoogleSpeechRecognizerSample.

Run pod init in the project root and change the Podfile to the following:

Podfile
target 'GoogleSpeechRecognizerSample' do
  use_frameworks!
  pod 'googleapis', :path => '.'
end

Place the following podspec file in the same directory:

googleapis.podspec
Pod::Spec.new do |s|
  s.name     = 'googleapis'
  s.version  = '0.0.1'
  s.license  = 'Apache 2.0'
  s.authors  = { 'Google Inc.' => 'timburks@google.com'}
  s.homepage = 'http://github.com/GoogleCloudPlatform/ios-docs-samples'
  s.source   = { :git => 'https://github.com/GoogleCloudPlatform/ios-docs-samples.git',
                 :tag => '0.0.1' }
  s.summary  = 'Service definitions for Google Cloud Platform APIs'   

  s.ios.deployment_target = '7.1'
  s.osx.deployment_target = '10.9'

  s.dependency "!ProtoCompiler-gRPCPlugin", "~> 1.0"

  # Pods directory corresponding to this app's Podfile, relative to the location of this podspec.
  pods_root = 'Pods'

  # Path where Cocoapods downloads protoc and the gRPC plugin.
  protoc_dir = "#{pods_root}/!ProtoCompiler"
  protoc = "#{protoc_dir}/protoc"
  plugin = "#{pods_root}/!ProtoCompiler-gRPCPlugin/grpc_objective_c_plugin"

  # Run protoc with the Objective-C and gRPC plugins to generate protocol messages and gRPC clients.
  # You can run this command manually if you later change your protos and need to regenerate.  
  s.prepare_command = <<-CMD
    #{protoc} \
        --objc_opt=named_framework_to_proto_path_mappings_path=./protomap \
        --plugin=protoc-gen-grpc=#{plugin} \
        --objc_out=. \
        --grpc_out=. \
        -I . \
        -I #{protoc_dir} \
        google/*/*.proto google/*/*/*/*.proto
  CMD

  # The --objc_out plugin generates a pair of .pbobjc.h/.pbobjc.m files for each .proto file.
  s.subspec "Messages" do |ms|
    ms.source_files = "google/**/*.pbobjc.{h,m}"
    ms.header_mappings_dir = "."
    ms.requires_arc = false
    ms.dependency "Protobuf"
  end

  # The --objcgrpc_out plugin generates a pair of .pbrpc.h/.pbrpc.m files for each .proto file with
  # a service defined.
  s.subspec "Services" do |ss|
    ss.source_files = "google/**/*.pbrpc.{h,m}"
    ss.header_mappings_dir = "."
    ss.requires_arc = true
    ss.dependency "gRPC-ProtoRPC"
    ss.dependency "#{s.name}/Messages"
  end

  s.pod_target_xcconfig = {
    'GCC_PREPROCESSOR_DEFINITIONS' => '$(inherited) GPB_USE_PROTOBUF_FRAMEWORK_IMPORTS=1',
      'USER_HEADER_SEARCH_PATHS' => '$SRCROOT/..'
  }

end

Run pod install.

Once it succeeds, make the following fixes inside GoogleSpeechRecognizerSample.xcworkspace.
A bit tedious, but necessary.

1) Delete the following line from gRPC-RxLibrary-umbrella.h:

#import "transformations/GRXMappingWriter.h"

2) Replace the following import statements:

Before:
#import "google/cloud/speech/v1/CloudSpeech.pbobjc.h"
#import "google/api/Annotations.pbobjc.h"
#import "google/longrunning/Operations.pbobjc.h"
#import "google/rpc/Status.pbobjc.h"
#import "google/protobuf/Duration.pbobjc.h"

After:
#import <googleapis/CloudSpeech.pbobjc.h>
#import <googleapis/Annotations.pbobjc.h>
#import <googleapis/Operations.pbobjc.h>
#import <googleapis/Status.pbobjc.h>
#import <googleapis/Duration.pbobjc.h>

Configure the Objective-C Bridging Header

Speech-Bridging-Header.h
#import <googleapis/CloudSpeech.pbobjc.h>
#import <googleapis/CloudSpeech.pbrpc.h>

(Screenshot: the bridging header set in the target's build settings)
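In the target's Build Settings, this corresponds to the Objective-C Bridging Header setting. Assuming the header sits next to the app's source files (the exact path depends on where you created it), the setting looks roughly like:

```
SWIFT_OBJC_BRIDGING_HEADER = $(SRCROOT)/GoogleSpeechRecognizerSample/Speech-Bridging-Header.h
```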

Step 3. Implementing the gRPC stream

Now that the project builds, let's implement the stream handling.
That said, it is almost exactly the official sample.

Initialize the audio plumbing.

    func prepare(specifiedSampleRate: Int) -> OSStatus {

        var status = noErr

        let session = AVAudioSession.sharedInstance()
        do {
            try session.setCategory(AVAudioSessionCategoryRecord)
            try session.setPreferredIOBufferDuration(10)
        } catch {
            return -1
        }

        var sampleRate = session.sampleRate
        print("hardware sample rate = \(sampleRate), using specified rate = \(specifiedSampleRate)")
        sampleRate = Double(specifiedSampleRate)

        // Describe the RemoteIO unit
        var audioComponentDescription = AudioComponentDescription()
        audioComponentDescription.componentType = kAudioUnitType_Output;
        audioComponentDescription.componentSubType = kAudioUnitSubType_RemoteIO;
        audioComponentDescription.componentManufacturer = kAudioUnitManufacturer_Apple;
        audioComponentDescription.componentFlags = 0;
        audioComponentDescription.componentFlagsMask = 0;

        // Get the RemoteIO unit
        let remoteIOComponent = AudioComponentFindNext(nil, &audioComponentDescription)
        status = AudioComponentInstanceNew(remoteIOComponent!, &remoteIOUnit)
        if (status != noErr) {
            return status
        }

        let bus1 : AudioUnitElement = 1
        var oneFlag : UInt32 = 1

        // Configure the RemoteIO unit for input
        status = AudioUnitSetProperty(remoteIOUnit!,
                                      kAudioOutputUnitProperty_EnableIO,
                                      kAudioUnitScope_Input,
                                      bus1,
                                      &oneFlag,
                                      UInt32(MemoryLayout<UInt32>.size));
        if (status != noErr) {
            return status
        }

        // Set format for mic input (bus 1) on RemoteIO's output scope
        var asbd = AudioStreamBasicDescription()
        asbd.mSampleRate = sampleRate
        asbd.mFormatID = kAudioFormatLinearPCM
        asbd.mFormatFlags = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked
        asbd.mBytesPerPacket = 2
        asbd.mFramesPerPacket = 1
        asbd.mBytesPerFrame = 2
        asbd.mChannelsPerFrame = 1
        asbd.mBitsPerChannel = 16
        status = AudioUnitSetProperty(remoteIOUnit!,
                                      kAudioUnitProperty_StreamFormat,
                                      kAudioUnitScope_Output,
                                      bus1,
                                      &asbd,
                                      UInt32(MemoryLayout<AudioStreamBasicDescription>.size))
        if (status != noErr) {
            return status
        }

        // Set the recording callback
        var callbackStruct = AURenderCallbackStruct()
        callbackStruct.inputProc = recordingCallback
        callbackStruct.inputProcRefCon = nil
        status = AudioUnitSetProperty(remoteIOUnit!,
                                      kAudioOutputUnitProperty_SetInputCallback,
                                      kAudioUnitScope_Global,
                                      bus1,
                                      &callbackStruct,
                                      UInt32(MemoryLayout<AURenderCallbackStruct>.size));
        if (status != noErr) {
            return status
        }

        // Initialize the RemoteIO unit
        return AudioUnitInitialize(remoteIOUnit!)
    }
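prepare(specifiedSampleRate:) only configures the RemoteIO unit; recording starts and stops with AudioOutputUnitStart/Stop. A minimal sketch of the companion methods (the method names here are illustrative; the official sample has equivalents):

```swift
    func start() -> OSStatus {
        guard let unit = remoteIOUnit else { return -1 }
        // Begin pulling mic input; recordingCallback fires for each captured buffer.
        return AudioOutputUnitStart(unit)
    }

    func stop() -> OSStatus {
        guard let unit = remoteIOUnit else { return -1 }
        return AudioOutputUnitStop(unit)
    }
```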

The recording callback:

func recordingCallback (
    inRefCon: UnsafeMutableRawPointer,
    ioActionFlags: UnsafeMutablePointer<AudioUnitRenderActionFlags>,
    inTimeStamp: UnsafePointer<AudioTimeStamp>,
    inBusNumber: UInt32,
    inNumberFrames: UInt32,
    ioData: UnsafeMutablePointer<AudioBufferList>?) -> OSStatus {

    var status = noErr

    let channelCount: UInt32 = 1

    var bufferList = AudioBufferList()
    bufferList.mNumberBuffers = channelCount
    let buffers = UnsafeMutableBufferPointer<AudioBuffer>(start: &bufferList.mBuffers,
                                                          count: Int(bufferList.mNumberBuffers))
    buffers[0].mNumberChannels = 1
    buffers[0].mDataByteSize = inNumberFrames * 2
    buffers[0].mData = nil

    // get the recorded samples
    status = AudioUnitRender(AudioController.sharedInstance.remoteIOUnit!,
                             ioActionFlags,
                             inTimeStamp,
                             inBusNumber,
                             inNumberFrames,
                             &bufferList)
    if (status != noErr) {
        return status;
    }

    let data = Data(bytes:  buffers[0].mData!, count: Int(buffers[0].mDataByteSize))
    DispatchQueue.main.async {
        AudioController.sharedInstance.delegate.processSampleData(data)
    }

    return noErr
}

Handling the callback on the ViewController side:

ViewController.swift
    func processSampleData(_ data: Data) -> Void {
        audioData.append(data)

        // We recommend sending samples in 100ms chunks
        let chunkSize : Int /* bytes/chunk */ = Int(0.1 /* seconds/chunk */
            * Double(SAMPLE_RATE) /* samples/second */
            * 2 /* bytes/sample */);

        if (audioData.length > chunkSize) {
            SpeechRecognitionService.sharedInstance.streamAudioData(audioData, completion: { [weak self] (response, error) in
                guard let strongSelf = self else {
                    return
                }

                if let error = error {
                    strongSelf.textView.text = error.localizedDescription
                } else if let response = response {
                    var finished = false
                    print(response)
                    for result in response.resultsArray! {
                        if let result = result as? StreamingRecognitionResult {
                            if result.isFinal {
                                finished = true
                            }
                        }
                    }
                    strongSelf.textView.text = response.description
                    if finished {
                        strongSelf.stopStreaming()
                    }
                }
            })
            self.audioData = NSMutableData()
        }
    }
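processSampleData(_:) hands each chunk to SpeechRecognitionService.streamAudioData, which this project takes almost verbatim from the official sample. For reference, here is a condensed sketch of what that method does with the gRPC stubs generated from CloudSpeech.proto; the class and method names come from the generated code and the sample, and may differ slightly between versions, so treat this as an outline rather than the exact implementation:

```swift
// Condensed sketch of SpeechRecognitionService.streamAudioData (see the official
// sample for the full version). `streaming`, `client`, `writer`, `call`, API_KEY,
// and SAMPLE_RATE are properties/constants assumed to exist on the service.
func streamAudioData(_ audioData: NSData, completion: @escaping SpeechRecognitionCompletionHandler) {
    if !streaming {
        // First chunk: open the gRPC stream and send the configuration request.
        client = Speech(host: "speech.googleapis.com")
        writer = GRXBufferedPipe()
        call = client.rpcToStreamingRecognize(withRequestsWriter: writer) { done, response, error in
            completion(response, error as NSError?)
        }
        // Authenticate with the API key from the console.
        call.requestHeaders.setObject(NSString(string: API_KEY),
                                      forKey: NSString(string: "X-Goog-Api-Key"))
        call.start()
        streaming = true

        let recognitionConfig = RecognitionConfig()
        recognitionConfig.encoding = .linear16
        recognitionConfig.sampleRateHertz = Int32(SAMPLE_RATE)
        recognitionConfig.languageCode = "ja-JP"
        let streamingConfig = StreamingRecognitionConfig()
        streamingConfig.config = recognitionConfig
        streamingConfig.interimResults = true
        let configRequest = StreamingRecognizeRequest()
        configRequest.streamingConfig = streamingConfig
        writer.writeValue(configRequest)
    }
    // Every chunk, including the first, is then sent as audio content.
    let audioRequest = StreamingRecognizeRequest()
    audioRequest.audioContent = audioData as Data
    writer.writeValue(audioRequest)
}
```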

Step 4. Implementing the REST version

With REST, the audio is first written to a file on the device and then sent in one request.

The destination file path:

    var soundFilePath: String {
        let paths = NSSearchPathForDirectoriesInDomains(.documentDirectory, .userDomainMask , true)
        guard let path = paths.first else { return "" }
        return path.appending("/sound.caf")
    }

Initialize AVAudioSession and the recorder.

    func prepare(delegate: AudioRESTControllerDelegate) {
        self.delegate = delegate
        AVAudioSession.sharedInstance().requestRecordPermission() { [unowned self] allowed in
            if allowed {
                print(self.soundFilePath)
            }
            // If the user denies permission, recording will produce nothing; handle it as needed.
        }

        let soundFileURL = URL(fileURLWithPath: soundFilePath)
        let session = AVAudioSession.sharedInstance()

        do {
            try session.setCategory(AVAudioSessionCategoryPlayAndRecord, with: .defaultToSpeaker)
            try session.setMode(AVAudioSessionModeMeasurement)
            try session.setActive(true)
            // Record 16-bit linear PCM so the file matches the LINEAR16 encoding declared to the API.
            let settings: [String: Any] = [
                AVFormatIDKey: Int(kAudioFormatLinearPCM),
                AVSampleRateKey: SAMPLE_RATE,
                AVNumberOfChannelsKey: 1,
                AVLinearPCMBitDepthKey: 16,
                AVLinearPCMIsFloatKey: false,
                AVLinearPCMIsBigEndianKey: false
            ]
            audioRecorder = try AVAudioRecorder(url: soundFileURL, settings: settings)
            audioRecorder?.delegate = self
            audioRecorder?.isMeteringEnabled = true
        }
        catch let error {
            print(error)
        }
    }
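prepare(delegate:) only sets the recorder up; recording itself is a plain AVAudioRecorder start/stop. A minimal usage sketch (the method names here are mine, not from the sample):

```swift
    func startRecording() {
        audioRecorder?.record()   // writes to sound.caf at soundFilePath
    }

    func stopRecording() {
        audioRecorder?.stop()
        soundFileToText()         // send the finished file to the API
    }
```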

Send the audio file and receive JSON containing the candidate transcriptions.
For now, only the first candidate is passed along.

    func soundFileToText() {
        let service = "https://speech.googleapis.com/v1/speech:recognize?key=\(API_KEY)"
        let data = try! Data(contentsOf: URL(fileURLWithPath: soundFilePath))
        let config: [String: Any] = [
            "encoding": "LINEAR16",
            "sampleRateHertz": "\(SAMPLE_RATE)",
            "languageCode": "ja-JP",
            "maxAlternatives": 1]

        let audioRequest = ["content": data.base64EncodedString()]
        let requestDictionary = ["config": config, "audio": audioRequest]
        let requestData = try! JSONSerialization.data(withJSONObject: requestDictionary, options: [])
        let request = NSMutableURLRequest(url: URL(string: service)!)
        request.addValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = requestData
        request.httpMethod = "POST"
        let task = URLSession.shared.dataTask(with: request as URLRequest, completionHandler: { (data, resp, err) in
            if let data = data, let obj = try? JSONSerialization.jsonObject(with: data, options: []), let json = obj as? [String: Any] {
                print(json)
                let results = (json["results"] as? [[String: Any]])
                if let first = results?.first, let alternatives = first["alternatives"] as? [[String: Any]] {
                    if let alternativesFirst = alternatives.first, let str = alternativesFirst["transcript"] as? String {
                        DispatchQueue.main.async {
                            AudioRESTController.sharedInstance.delegate.doneAnalyze([str])
                        }
                    }
                }
            }
        })
        task.resume()
    }
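The completion handler above digs through the v1 recognize response (results → alternatives → transcript). As a self-contained illustration of that parsing, here is the same extraction against a hand-written JSON body in the documented response shape (the transcript and confidence values are made up):

```swift
import Foundation

// Hand-written example of the v1 `speech:recognize` response shape; values are made up.
let sampleResponse = """
{
  "results": [
    { "alternatives": [ { "transcript": "こんにちは", "confidence": 0.92 } ] }
  ]
}
"""

// Extract every transcript, in order, from a recognize response body.
func transcripts(from data: Data) -> [String] {
    guard let obj = try? JSONSerialization.jsonObject(with: data, options: []),
          let json = obj as? [String: Any],
          let results = json["results"] as? [[String: Any]] else { return [] }
    return results.flatMap { result -> [String] in
        let alternatives = result["alternatives"] as? [[String: Any]] ?? []
        return alternatives.compactMap { $0["transcript"] as? String }
    }
}

print(transcripts(from: Data(sampleResponse.utf8)))  // ["こんにちは"]
```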

Display it in the ViewController:

ViewController.swift
    func doneAnalyze(_ items: [String]) -> Void {
        if items.isEmpty { return }
        textView.text = items.joined(separator: ", ")
    }

Closing thoughts

I've pushed a sample to GitHub that works once you swap in your own API key:
GoogleSpeechRecognizerSample

Next, I'd like to try things like continuous listening.