Doing "Barusu" (バルス) with Amazon Echo

Last updated at 2016-06-04, posted at 2015-12-19

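This post is the day-18 entry of the Qiita Advent Calendar "今年もやるよ！AWS Lambda縛り Advent Calendar 2015" ("We're doing it again this year! AWS Lambda-only Advent Calendar 2015").
The calendar is Lambda-only, but this article contains a heavy dose of Amazon Echo and the Alexa Skills Kit...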

What is "Barusu"?

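Needless to say, it is the dreaded spell of destruction.
Someone has already built and published a command that wreaks total destruction:
https://github.com/qphoney/balus
What I'm presenting here is not nearly that merciless.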

What is Amazon Echo?

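A device that uses voice recognition to answer questions and integrate with other services.
You can add functionality to it with the Alexa Skills Kit.
Third-party Skills go through certification and can then be installed from a dedicated site.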
https://youtu.be/7Jc82wIL7m4

What is the Alexa Skills Kit?

Alexa is the cloud-based voice recognition service used by Amazon Echo.
The Alexa Skills Kit provides the environment needed to easily build the capabilities (Skills) that Alexa can use.
See also the article "Looking into the Amazon Alexa Skills Kit".

Creating the Lambda function for the Alexa Skills Kit (ASK)

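At the recent re:Invent, AWS announced Python support for Lambda, and ASK functions can now be written in Python on Lambda as well.
However, ASK supports only the us-east-1 region, so in any other region the ASK templates do not appear in the list of Lambda templates.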

For this post, I modified that template slightly to implement Barusu.

# -*- coding: utf-8 -*-
from __future__ import print_function
import boto3
from time import gmtime, strftime

client = boto3.client('ec2')

# Entry point
def lambda_handler(event, context):
    print(strftime('%a, %d %b %Y %H:%M:%S +0000', gmtime()))
    print(event)

    if event['request']['type'] == "LaunchRequest":
        # Request that starts the skill
        return on_launch(event['request'], event['session'])
    elif event['request']['type'] == "IntentRequest":
        # Intent invocation
        return on_intent(event['request'], event['session'])

    print("nothing and finish")
    return get_finish_response()

def on_launch(launch_request, session):
    # Respond with the charm
    return get_charm_response()

def on_intent(intent_request, session):

    intent = intent_request['intent']
    intent_name = intent['name']
    # Finish unless this is RunHorobi and the spell of destruction was spoken
    if intent_name != "RunHorobi" or \
       'Barusu' not in intent.get('slots', {}) or \
       'value' not in intent['slots']['Barusu']:
        return get_finish_response()

    print(intent['slots']['Barusu']['value'])

    stop_instance()
    return get_horobi_response()

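# Build the response that plays the charm
def get_charm_response():

    session_attributes = {}
    card_title = "Charm"
    # Text shown on the card; what Echo actually plays is the MP3 at audio_url
    speech_output = "Rite Ratobarita Urusu Ariarosu Baru Netoriiru"
    audio_url = "https://url/to/your/audio.mp3"
    # Keep the session open so the user can say the spell next
    should_end_session = False
    return build_response(session_attributes, build_audio_response(
        card_title, speech_output, audio_url, should_end_session))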

# Build the response returned when the spell of destruction is spoken
def get_horobi_response():

    session_attributes = {}
    card_title = "Horobi"
    speech_output = "Megaaaaa!"
    reprompt_text = speech_output
    should_end_session = True
    return build_response(session_attributes, build_speechlet_response(
        card_title, speech_output, reprompt_text, should_end_session))

# Build the response for when nothing is done
def get_finish_response():

    session_attributes = {}
    card_title = "Words that should not be used"
    speech_output = "Don't say the word of horobee"
    reprompt_text = speech_output
    should_end_session = True
    return build_response(session_attributes, build_speechlet_response(
        card_title, speech_output, reprompt_text, should_end_session))

# Get the instances that Barusu stops (any instance with a tag value of 'laputa')
def get_instances():
    response = client.describe_instances(
        Filters=[
            {
                'Name': 'tag-value','Values': [
                    'laputa',
                ]
            }
        ]
    )
    instance_ids = []
    for res in response['Reservations']:
        for item in res['Instances']:
            instance_ids.append(item['InstanceId'])
    return instance_ids

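# Start the instances
def start_instance():
    print('start_instance')
    instance_ids = get_instances()
    if not instance_ids:
        # The EC2 API rejects an empty InstanceIds list, so skip the call
        return
    client.start_instances(InstanceIds=instance_ids)

# Stop the instances
def stop_instance():
    print('stop_instance')
    instance_ids = get_instances()
    if not instance_ids:
        # The EC2 API rejects an empty InstanceIds list, so skip the call
        return
    client.stop_instances(InstanceIds=instance_ids)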

# Build the plain-text response JSON
def build_speechlet_response(title, output, reprompt_text, should_end_session):
    return {
        'outputSpeech': {
            'type': 'PlainText',
            'text': output
        },
        'card': {
            'type': 'Simple',
            'title': 'SessionSpeechlet - ' + title,
            'content': 'SessionSpeechlet - ' + output
        },
        'reprompt': {
            'outputSpeech': {
                'type': 'PlainText',
                'text': reprompt_text
            }
        },
        'shouldEndSession': should_end_session
    }

# Build the response JSON in SSML format (plays an audio file)
def build_audio_response(title, output, audio_url, should_end_session):
    return {
        'outputSpeech': {
            'type': 'SSML',
            'ssml': '<speak><audio src="{0}" /></speak>'.format(audio_url)
        },
        'card': {
            'type': 'Simple',
            'title': 'SessionSpeechlet - ' + title,
            'content': 'SessionSpeechlet - ' + output
        },
        'reprompt': {
            'outputSpeech': {
                'type': 'SSML',
                'ssml': '<speak><audio src="{0}" /></speak>'.format(audio_url)
            }
        },
        'shouldEndSession': should_end_session
    }

# Build the complete response
def build_response(session_attributes, speechlet_response):
    return {
        'version': '1.0',
        'sessionAttributes': session_attributes,
        'response': speechlet_response
    }
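get_instances above collects every instance that carries a tag value of "laputa", whatever its current state. If you want Barusu to act only on instances that are actually running, one option is to filter the describe_instances result. The sketch below does that in plain Python on a hand-written dict shaped like a boto3 describe_instances response; the helper name and instance IDs are made up for illustration. Adding the EC2 filter {'Name': 'instance-state-name', 'Values': ['running']} to the describe_instances call would achieve the same thing server-side.

```python
def running_instance_ids(describe_response):
    """Hypothetical helper: IDs of instances in the 'running' state."""
    return [
        inst["InstanceId"]
        for res in describe_response.get("Reservations", [])
        for inst in res.get("Instances", [])
        if inst.get("State", {}).get("Name") == "running"
    ]


# Hand-written dict in the shape boto3's describe_instances returns (IDs are fake)
fake = {"Reservations": [{"Instances": [
    {"InstanceId": "i-0aaa", "State": {"Name": "running"}},
    {"InstanceId": "i-0bbb", "State": {"Name": "stopped"}},
]}]}

print(running_instance_ids(fake))  # ['i-0aaa']
```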

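With this script, saying "Alexa, run laputa" to Amazon Echo makes it answer with the charm for times of trouble, "Rite Ratobarita Urusu Ariarosu Baru Netoriiru" (リテ・ラトバリタ・ウルス・アリアロス・バル・ネトリール). Because that response keeps the session open, saying "barusu" next invokes the RunHorobi intent: the instances tagged laputa are stopped and Alexa replies "Megaaaaa!".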
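To exercise the intent-dispatch logic without talking to an Echo, you can feed the code a hand-built request. The sketch below assumes an abbreviated IntentRequest (real ASK requests carry more fields, such as version and requestId) and a hypothetical helper, is_horobi_request, that mirrors the checks in on_intent. Passing the same event to lambda_handler would also work, but it would really call EC2.

```python
# Abbreviated, hand-built ASK IntentRequest (real requests have more fields)
sample_event = {
    "session": {"new": False, "sessionId": "hypothetical-session-id"},
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "RunHorobi",
            "slots": {"Barusu": {"name": "Barusu", "value": "barusu"}},
        },
    },
}


def is_horobi_request(event):
    """Hypothetical helper mirroring on_intent: right intent and a filled Barusu slot."""
    request = event.get("request", {})
    if request.get("type") != "IntentRequest":
        return False
    intent = request.get("intent", {})
    slot = intent.get("slots", {}).get("Barusu", {})
    # An unfilled slot is present but has no "value" key
    return intent.get("name") == "RunHorobi" and "value" in slot


print(is_horobi_request(sample_event))  # True
```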

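An Alexa Skill can make Amazon Echo speak either by returning plain text or by using SSML (Speech Synthesis Markup Language), and a recent announcement made it possible to play arbitrary audio data as well.
Alexa's pronunciation is English-based, so I synthesized the speech into an MP3 file and reference it via SSML (in build_audio_response).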
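build_audio_response inserts the URL into the SSML with a plain format call, so a URL containing "&" (for example a query string) would produce invalid XML. A minimal sketch of a safer variant, assuming you only serve audio over HTTPS: build_audio_ssml is a hypothetical helper, and xml.sax.saxutils.quoteattr from the standard library adds the quotes and escapes the attribute value.

```python
from xml.sax.saxutils import quoteattr


def build_audio_ssml(audio_url):
    """Hypothetical helper: wrap an HTTPS MP3 URL in an SSML <audio> tag."""
    if not audio_url.startswith("https://"):
        raise ValueError("Alexa only fetches audio over HTTPS")
    # quoteattr adds the surrounding quotes and escapes &, < and >
    return "<speak><audio src={0} /></speak>".format(quoteattr(audio_url))


print(build_audio_ssml("https://example.com/charm.mp3?v=1&lang=ja"))
# <speak><audio src="https://example.com/charm.mp3?v=1&amp;lang=ja" /></speak>
```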

Audio file restrictions

The audio files Amazon Echo can play are subject to detailed restrictions and must meet the following requirements:

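A valid MP3 file (MPEG version 2)
No longer than 90 seconds
Bit rate of 48 kbps
Sample rate of 16000 Hz
Served over HTTPS with a trusted SSL certificate (self-signed certificates are not accepted)
Must not contain personal or confidential information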
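The numeric limits in the list above are easy to check before uploading. The sketch below is a hypothetical pre-flight check in Python 3: you supply the duration, bit rate and sample rate yourself (for example from a tool like ffprobe), and it reports which of this article's limits are violated. It cannot verify the MPEG version, certificate trust, or the file's content, and the limits are as listed in this 2015 article, so check the current ASK documentation before relying on them.

```python
from urllib.parse import urlparse

# Limits as listed in this article (late 2015); they may have changed since
MAX_SECONDS = 90
BITRATE_KBPS = 48
SAMPLE_RATE_HZ = 16000


def audio_problems(duration_s, bitrate_kbps, sample_rate_hz, url):
    """Hypothetical helper: list the reasons an MP3 would be rejected (empty = OK)."""
    problems = []
    if duration_s > MAX_SECONDS:
        problems.append("longer than %d seconds" % MAX_SECONDS)
    if bitrate_kbps != BITRATE_KBPS:
        problems.append("bit rate is not %d kbps" % BITRATE_KBPS)
    if sample_rate_hz != SAMPLE_RATE_HZ:
        problems.append("sample rate is not %d Hz" % SAMPLE_RATE_HZ)
    if urlparse(url).scheme != "https":
        problems.append("not served over HTTPS")
    return problems


print(audio_problems(5.2, 48, 16000, "https://example.com/charm.mp3"))  # []
print(audio_problems(120, 128, 44100, "http://example.com/charm.mp3"))
```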

Using ffmpeg on a Mac, the command looks like this:

ffmpeg -i input.mp3 -ac 2 -codec:a libmp3lame -b:a 48k -ar 16000 output.mp3
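If you convert several files, you might script the command above. A minimal sketch in Python 3, assuming ffmpeg built with libmp3lame is on your PATH: ffmpeg_args and convert are hypothetical helpers that pass exactly the same options as the command above, as an argument list so file names with spaces need no shell quoting.

```python
import subprocess


def ffmpeg_args(src, dst):
    """Hypothetical helper: the ffmpeg command above as an argument list."""
    return [
        "ffmpeg", "-i", src,
        "-ac", "2",                # two audio channels, as in the original command
        "-codec:a", "libmp3lame",  # encode with the LAME MP3 encoder
        "-b:a", "48k",             # 48 kbps bit rate
        "-ar", "16000",            # 16000 Hz sample rate
        dst,
    ]


def convert(src, dst):
    # Raises CalledProcessError if ffmpeg exits with a non-zero status
    subprocess.run(ffmpeg_args(src, dst), check=True)


print(" ".join(ffmpeg_args("input.mp3", "output.mp3")))
# ffmpeg -i input.mp3 -ac 2 -codec:a libmp3lame -b:a 48k -ar 16000 output.mp3
```

Note that ffmpeg asks before overwriting an existing output file, so convert will wait for input in that case.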

Adding the Alexa Skill

Skill information

https://developer.amazon.com/edw/home.html#/
From the dashboard, select Alexa Skills Kit and register a new skill with Add a New Skill.

Under Endpoint, select Lambda and specify the ARN of the Lambda function created earlier.
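Since, as noted earlier, ASK only supported us-east-1 at the time of writing, a quick sanity check is to read the region out of the ARN you are about to paste. A minimal sketch: lambda_arn_region is a hypothetical helper that assumes the standard "aws" partition and the arn:aws:lambda:region:account-id:function:name layout, and the ARN below is a placeholder.

```python
def lambda_arn_region(arn):
    """Hypothetical helper: region field of a Lambda function ARN.

    Expected layout: arn:aws:lambda:<region>:<account-id>:function:<name>[:<qualifier>]
    """
    parts = arn.split(":")
    if len(parts) < 7 or parts[:3] != ["arn", "aws", "lambda"] or parts[5] != "function":
        raise ValueError("not a Lambda function ARN: %r" % arn)
    return parts[3]


# Placeholder ARN: the account ID and function name are made up
arn = "arn:aws:lambda:us-east-1:123456789012:function:laputa"
print(lambda_arn_region(arn))  # us-east-1
```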

Intent Schema

{
  "intents": [
    {
      "intent": "RunHorobi",
      "slots": [
        {
          "name": "Barusu",
          "type": "LIST_OF_BARUSU"
        }
      ]
    }
  ]
}

This defines the slot (keyword) attached to the intent, that is, to the functionality implemented on the Lambda function side.
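The intent and slot names in the schema must match the strings the Lambda function looks up ("RunHorobi" and "Barusu" in on_intent); a typo on either side silently sends every request to get_finish_response. A minimal sketch of a local consistency check, assuming the schema is pasted in as a string; slots_by_intent is a hypothetical helper.

```python
import json

# The Intent Schema from this article, as a string
INTENT_SCHEMA = """
{
  "intents": [
    {"intent": "RunHorobi",
     "slots": [{"name": "Barusu", "type": "LIST_OF_BARUSU"}]}
  ]
}
"""


def slots_by_intent(schema_text):
    """Hypothetical helper: map each intent name to {slot name: slot type}."""
    schema = json.loads(schema_text)
    return {
        item["intent"]: {s["name"]: s["type"] for s in item.get("slots", [])}
        for item in schema["intents"]
    }


mapping = slots_by_intent(INTENT_SCHEMA)
# These are the names on_intent expects in the Lambda function
assert "Barusu" in mapping["RunHorobi"]
print(mapping)  # {'RunHorobi': {'Barusu': 'LIST_OF_BARUSU'}}
```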

Custom Slot Types

barusu
bars
barus
barsu

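Register the words the slot should accept. (However, the Lambda function is still invoked for words that are not registered here; from what I heard at a workshop, this cannot be helped for now.)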
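Because unregistered words can still reach the Lambda function, one defensive option is to compare the recognized slot value against the registered list before calling stop_instance. This guard is not part of the original script; is_registered_barusu is a hypothetical helper, and the lower-casing assumes the recognized value may differ only in case.

```python
# Values registered for LIST_OF_BARUSU in this article
BARUSU_VALUES = {"barusu", "bars", "barus", "barsu"}


def is_registered_barusu(slot_value):
    """Hypothetical guard: True only for one of the registered words."""
    if not slot_value:
        return False
    return slot_value.strip().lower() in BARUSU_VALUES


print(is_registered_barusu("Barusu"))  # True
print(is_registered_barusu("bagels"))  # False
```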

Sample Utterances

RunHorobi {Barusu}

This specifies which intent is invoked for a given utterance.
Users can phrase a request in several ways, so mapping multiple anticipated sentences to the intent makes it easier for ASK to recognize it.
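Each line in Sample Utterances is the intent name followed by one phrasing, so a longer list can be generated rather than typed. A minimal sketch: sample_utterances is a hypothetical helper, and only "{Barusu}" comes from this article; the other two phrasings are illustrative guesses, not tested utterances.

```python
def sample_utterances(intent_name, phrases):
    """Hypothetical helper: render 'IntentName phrase' lines for Sample Utterances."""
    return "\n".join("{0} {1}".format(intent_name, p) for p in phrases)


# Only "{Barusu}" is from this article; the other phrasings are hypothetical
print(sample_utterances("RunHorobi", ["{Barusu}", "say {Barusu}", "shout {Barusu}"]))
# RunHorobi {Barusu}
# RunHorobi say {Barusu}
# RunHorobi shout {Barusu}
```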

Testing

ASK provides a way to test a Skill without an Amazon Echo.

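If you test your Lambda function this way, you don't have to keep talking to the Echo while debugging.
Clicking the play button at the bottom right reads the returned text aloud.
(However, SSML is not supported at the moment.)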
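Since the test screen could not play SSML responses, a local check that the returned JSON has the expected shape can catch mistakes before you reach for a real device. A minimal sketch: check_alexa_response is a hypothetical helper that checks only the fields this article's build_response and builders set; it is not a full validator of the ASK response format.

```python
def check_alexa_response(resp):
    """Hypothetical helper: raise ValueError if resp doesn't look like build_response() output."""
    if resp.get("version") != "1.0":
        raise ValueError("version must be '1.0'")
    body = resp.get("response", {})
    speech = body.get("outputSpeech", {})
    kind = speech.get("type")
    if kind == "PlainText":
        if not speech.get("text"):
            raise ValueError("PlainText outputSpeech needs non-empty 'text'")
    elif kind == "SSML":
        ssml = speech.get("ssml", "")
        if not (ssml.startswith("<speak>") and ssml.endswith("</speak>")):
            raise ValueError("SSML must be wrapped in <speak>...</speak>")
    else:
        raise ValueError("unknown outputSpeech type: %r" % kind)
    if not isinstance(body.get("shouldEndSession"), bool):
        raise ValueError("shouldEndSession must be a bool")
    return True


# Same shape build_response() produces for get_horobi_response (card and reprompt omitted)
horobi = {
    "version": "1.0",
    "sessionAttributes": {},
    "response": {
        "outputSpeech": {"type": "PlainText", "text": "Megaaaaa!"},
        "shouldEndSession": True,
    },
}
print(check_alexa_response(horobi))  # True
```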

Testing on the real device

With an Echo registered as your own device, you can see the words you actually spoke and the results on the dashboard.

https://youtu.be/7Jc82wIL7m4

Summary

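Amazon Echo makes a far more realistic "Barusu" possible.
(Having a voice-activated levitation stone (飛行石) makes it even more exciting.)
However, Amazon Echo doesn't support Japanese yet, so the "Me gaaa!" ("My eyes!") line sounds painfully off. I hope Japanese support arrives soon.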

On a serious note

Now that Lambda functions can be written in Python, coding has become much easier for me personally.
With Skill testing and the simulator now available, you can get quite far without owning an Amazon Echo.
