Customizing Azure AI Search VoiceRAG

Last updated at 2024-11-15 / Posted at 2024-10-04

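Introduction

An excellent sample has been published that lets you talk with your data through the Realtime API while Azure AI Search runs queries behind the scenes.

This post collects notes on the customizations I wanted to make to that sample. I plan to add to it as I go. So far it covers the following customizations: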

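Getting the transcript of the output audio (10/4)
Getting the transcript of the input audio (10/4)
Plugging in Bing search (10/11)
Taking orders (10/16)
Having the AI speak first (10/17)
Having the AI conduct an interview (10/23)
Having the AI interpret simultaneously (11/7)
Moving a ball (11/15)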

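The upstream code keeps changing, so please understand that the snippets here may have drifted from the upstream repository.

The model is still in preview, so its accuracy fluctuates constantly, and long conversations fairly often break down. I expect this to be resolved eventually, but for now the sample is best used for checking behavior and exploring ideas.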

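★Getting the transcript of the output audio

In app/frontend/src/App.tsx, add the following to the useRealTime declaration.
The incremental text of the output audio arrives as message.delta, so handle it in App.tsx.
One answer spans everything received up to the point where onReceivedResponseDone is called.
Note that the audio data and the text data do not arrive in sync (the text comes first), so if the user interrupts an answer you will probably also need some handling in onReceivedInputAudioBufferSpeechStarted.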

App.tsx

    const { startSession, addUserAudio, inputAudioBufferClear } = useRealTime({
        // (omitted)
        onReceivedResponseAudioTranscriptDelta: message => {
            // For now, just log to the console
            console.log("response_transcript_delta", message.delta);
        },
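
The delta handling described above boils down to a small accumulator. The following is a minimal sketch of that logic, not code from the sample: it is written in Python for brevity, and the class name TranscriptBuffer and its method names are hypothetical. In the sample, the same steps would live in the onReceivedResponseAudioTranscriptDelta, onReceivedResponseDone, and onReceivedInputAudioBufferSpeechStarted callbacks.

```python
class TranscriptBuffer:
    """Collects transcript deltas for one answer (hypothetical helper)."""

    def __init__(self) -> None:
        self._parts: list[str] = []
        self.completed: list[str] = []

    def on_delta(self, delta: str) -> None:
        # Each delta is a fragment of the answer text.
        self._parts.append(delta)

    def on_done(self) -> str:
        # The answer is complete: join the fragments and start a new answer.
        text = "".join(self._parts)
        self._parts.clear()
        if text:
            self.completed.append(text)
        return text

    def on_interrupted(self) -> str:
        # The user started speaking mid-answer. Text runs ahead of audio, so
        # part of the buffered text was probably never heard; mark it as cut off.
        text = "".join(self._parts)
        self._parts.clear()
        if text:
            self.completed.append(text + " [interrupted]")
        return text


buffer = TranscriptBuffer()
for fragment in ["Hel", "lo, ", "world."]:
    buffer.on_delta(fragment)
first_answer = buffer.on_done()

buffer.on_delta("This answer gets cut")
buffer.on_interrupted()
```

Marking interrupted text instead of discarding it is only one possible choice; a UI could equally drop the partial answer.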

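★Getting the transcript of the input audio

Turn on input audio transcription and add a callback. As above, this is a change to the useRealTime declaration.
The input audio transcript arrives slightly later than the output transcript.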

App.tsx

    const { startSession, addUserAudio, inputAudioBufferClear } = useRealTime({
        enableInputAudioTranscription: true,
        // (omitted)
        onReceivedInputAudioTranscriptionCompleted: message => {
            // For now, just log to the console
            console.log("request_transcription", message.transcript);
        },

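★Plugging in Bing search

The upstream sample feeds Azure AI Search results into the conversation; here we switch that to web search (Bing search). The procedure is fairly long, so it is split into several steps.

Preparing the functions for Bing search and result display

First, look at the existing app/backend/ragtools.py. It defines the following two tools, which are invoked through function calling.

_search_tool_schema : extracts the search query
_grounding_tool_schema : extracts the link information for the search results

In the same way, define the web search and result display functions in app/backend/webtools.py. Two schemas are defined:

_web_search_tool_schema : extracts the web search query
_grounding_webtool_schema : extracts the titles and URLs of the web search results as a list

The details of each are below. (I have confirmed that they work, but there is still room for polish.)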

webtools.py

_web_search_tool_schema = {
    "type": "function",
    "name": "websearch",
    "description": "Performs a web search using the specified query. " + \
                   "The results are returned as a list of highly relevant web pages.",
    "parameters": {  
        "type": "object",  
        "properties": {  
            "query": {  
                "type": "string",  
                "description": "Search query for web search"  
            }  
        },  
        "required": ["query"],  
        "additionalProperties": False  
    }  
}

_grounding_webtool_schema = {
    "type": "function",
    "name": "webreport_grounding",
    "description": "Report the use of web sources from the knowledge base as part of the response (essentially citing the web sources). " + \
                   "Web sources are delineated within each knowledge base by <Reference> tags, and " + \
                   "each <Reference> consists of <ReferenceURL>, <ReferenceName>, and <ReferenceContent>. " + \
                   "When using information from the knowledge base to respond, always use this tool to cite the web sources.",
    "parameters": {
        "type": "object",
        "properties": {
            "references": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "reference_url": {
                            "type": "string",
                            "description": "URL of the web source"
                        },
                        "reference_name": {
                            "type": "string",
                            "description": "Title name of the web source"
                        },
                    },
                    # These names must match the property keys above exactly.
                    "required": ["reference_url", "reference_name"],
                    "additionalProperties": False
                },
                "description": "List of references used in the response, each containing a URL and a title(name)"
            }
        },
        "required": ["references"],
        "additionalProperties": False
    }
}
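
A mismatch between the names listed under "required" and the keys under "properties" is easy to introduce in hand-written schemas like these, and Python will not flag it. As a rough sketch (the helper name undeclared_required is hypothetical and not part of the sample), a small recursive check can catch it before the schema is sent to the model:

```python
from typing import Any


def undeclared_required(schema: dict[str, Any]) -> list[str]:
    """Return names listed in 'required' that are missing from 'properties'.

    Recurses into nested object properties and array items.
    """
    problems: list[str] = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in properties:
            problems.append(name)
    for child in properties.values():
        problems.extend(undeclared_required(child))
    if "items" in schema:
        problems.extend(undeclared_required(schema["items"]))
    return problems


# Cut-down example in the shape of a "parameters" block, with a typo on purpose.
example_parameters = {
    "type": "object",
    "properties": {
        "references": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "reference_url": {"type": "string"},
                    "reference_name": {"type": "string"},
                },
                "required": ["reference_url", "referencename"],
            },
        }
    },
    "required": ["references"],
}

print(undeclared_required(example_parameters))
```

Against the real schemas it would be called as undeclared_required(_grounding_webtool_schema["parameters"]); an empty list means every required name is declared.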

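Next, prepare the implementation that runs a search from the query obtained by websearch.
The Bing search itself is performed by the search_bing_and_read_contents function. It takes the API key and the search query and returns a list of results; the details of the function are long, so they are omitted here.
The Bing results are aggregated into a single text, delimited by tags, and passed back to the GPT model as the function result.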

webtools.py

async def _web_search_tool(bing_api_key, args: Any) -> ToolResult:
    print(f"Searching for '{args['query']}' using Bing API v7")
    search_results = search_bing_and_read_contents(bing_api_key, args['query'])
    result = ''
    for r in search_results:
        result += "<Reference>\n"
        result += f"  <ReferenceURL>{r['url']}</ReferenceURL>\n"
        result += f"  <ReferenceName>{r['name']}</ReferenceName>\n"
        result += f"  <ReferenceContent>{r['content']}</ReferenceContent>\n"
        result += "</Reference>\n\n"
    return ToolResult(result, ToolResultDirection.TO_SERVER)
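
The post leaves out search_bing_and_read_contents. For readers who want a starting point, here is a minimal sketch of what such a function could look like against the Bing Web Search API v7, using only the standard library. It is an assumption, not the author's code: the names search_bing_sketch and parse_bing_web_pages are hypothetical, and it uses only the result snippet as "content" rather than reading each page as the original function name suggests.

```python
import json
import urllib.parse
import urllib.request
from typing import Any

BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"


def parse_bing_web_pages(payload: dict[str, Any]) -> list[dict[str, str]]:
    """Map a Bing Web Search v7 JSON response to url/name/content records."""
    pages = payload.get("webPages", {}).get("value", [])
    return [
        {"url": p["url"], "name": p["name"], "content": p.get("snippet", "")}
        for p in pages
    ]


def search_bing_sketch(api_key: str, query: str, count: int = 3) -> list[dict[str, str]]:
    """Hypothetical stand-in for search_bing_and_read_contents (snippets only)."""
    url = BING_ENDPOINT + "?" + urllib.parse.urlencode({"q": query, "count": count})
    request = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": api_key})
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_bing_web_pages(json.load(response))


# Offline check of the parsing step with a hand-written payload (no network call).
sample_payload = {
    "webPages": {
        "value": [
            {"name": "Example", "url": "https://example.com", "snippet": "An example page."}
        ]
    }
}
parsed = parse_bing_web_pages(sample_payload)
```

The records use the same url, name, and content keys that _web_search_tool reads, so the sketch could be dropped in for experiments. Because it blocks on the network, calling it from an async tool would stall the event loop for the duration of the request.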

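Also prepare the implementation that controls what is shown on screen once an answer has been built from the search results. Having GPT output the HTML body itself would inflate the token count, so the body is fetched from the URL by a read_content function; as with the Bing search, its details are long and are omitted here. I used lxml to extract the body text from the HTML.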

webtools.py

async def _report_grounding_webtool(args: Any) -> ToolResult:
    # Avoid shadowing the built-in `list`.
    references = args["references"]
    print(f"Grounding source: {references}")

    docs = []
    for reference in references:
        # Fetch the page body ourselves instead of having the model emit it.
        content = read_content(reference["reference_url"])
        docs.append({"url": reference["reference_url"], "title": reference["reference_name"], "content": content})
    return ToolResult({"sources": docs}, ToolResultDirection.TO_CLIENT)
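
The read_content function is also omitted from the post. The author used lxml; the sketch below swaps in the standard library's html.parser so that it runs without extra dependencies, and it covers only the text-extraction step (fetching the URL is left out). The names extract_text and _TextExtractor and the max_chars limit are assumptions made for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style elements."""

    _SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str, max_chars: int = 2000) -> str:
    """Return the visible text of an HTML document, truncated to max_chars."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)[:max_chars]


sample_html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Body text.</p><script>var x = 1;</script></body></html>"
)
extracted = extract_text(sample_html)
```

Truncating the text keeps the payload sent to the front end small; the limit is arbitrary and can be tuned.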

With web search and result display implemented, define a function that registers both tools.

webtools.py

def attach_rag_webtools(rtmt: RTMiddleTier, bing_api_key: str) -> None:

    rtmt.tools["websearch"] = Tool(schema=_web_search_tool_schema, target=lambda args: _web_search_tool(bing_api_key, args))
    rtmt.tools["webreport_grounding"] = Tool(schema=_grounding_webtool_schema, target=lambda args: _report_grounding_webtool(args))
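
To make the registration step more concrete, here is a minimal, self-contained sketch of the idea behind it: a dictionary maps each tool name to a schema and an async target, and each result carries a direction. The classes below are simplified stand-ins written for this illustration (the real Tool, ToolResult, and ToolResultDirection come from the sample's rtmt module and may behave differently), and dispatch is a hypothetical helper.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import Any, Awaitable, Callable


class Direction(Enum):  # stand-in for ToolResultDirection
    TO_SERVER = 1  # result goes back to the model
    TO_CLIENT = 2  # result is forwarded to the front end


@dataclass
class Result:  # stand-in for ToolResult
    payload: Any
    destination: Direction


@dataclass
class RegisteredTool:  # stand-in for Tool
    schema: dict[str, Any]
    target: Callable[[Any], Awaitable[Result]]


async def _search(args: Any) -> Result:
    return Result(f"results for {args['query']}", Direction.TO_SERVER)


async def _report(args: Any) -> Result:
    return Result({"sources": args["references"]}, Direction.TO_CLIENT)


tools = {
    "websearch": RegisteredTool({"name": "websearch"}, _search),
    "webreport_grounding": RegisteredTool({"name": "webreport_grounding"}, _report),
}


async def dispatch(name: str, args: Any) -> Result:
    # Look the tool up by the name the model asked for and await its target.
    return await tools[name].target(args)


search_result = asyncio.run(dispatch("websearch", {"query": "ramen"}))
report_result = asyncio.run(dispatch("webreport_grounding", {"references": []}))
```

The point to take away is that the dictionary key and the "name" in the schema should agree, since the model only knows the tool by the name in its schema.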

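That completes the implementation for web search and result display.

Server-side implementation

Now edit app/backend/app.py.
First, change the system prompt so that it calls the tools defined above and relies on web search.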

app.py

    rtmt.system_message = "You are a helpful assistant. Only answer questions based on information you searched in the knowledge base, accessible with the 'websearch' tool. " + \
                          "The user is listening to answers with audio, so it's *super* important that answers are as short as possible, a single sentence if at all possible. " + \
                          "Never read url names or source names or keys out loud. " + \
                          "Please conduct searches and provide responses in the language input by the user. " + \
                          "Always use the following step-by-step instructions to respond: \n" + \
                          "1. Always use the 'websearch' tool to check the knowledge base before answering a question. \n" + \
                          "2. Always use the 'webreport_grounding' tool to report the source of information from the knowledge base. \n" + \
                          "3. Produce an answer that's as short as possible. If the answer isn't in the knowledge base, say you don't know."

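Then replace the existing attach_rag_tools call with the attach function implemented above, as shown below. Also set BING_API_KEY as an environment variable.
The Bing search itself uses v7, but again the details are omitted here.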

app.py

    bing_api_key = os.environ.get("BING_API_KEY")
    # (omitted)
    attach_rag_webtools(rtmt, bing_api_key)
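
Note that os.environ.get returns None when the variable is missing, so a forgotten BING_API_KEY only surfaces later, when the first search fails. A small guard, sketched below, fails at startup instead. The helper name require_env and the demo variable name are hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a dummy variable; a real key would come from the environment.
os.environ["VOICERAG_DEMO_KEY"] = "dummy-value"
demo_value = require_env("VOICERAG_DEMO_KEY")
```

In app.py this would read bing_api_key = require_env("BING_API_KEY").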

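Front-end changes

The server side is now complete. When you speak, it runs a Bing search, answers based on the results, and sends the sources of those results to the front end.
So next, change the front end to receive that source information.

Add a type for the Bing search results to app/frontend/src/types.ts.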

types.ts

export type WebToolResult = {
    sources: { url: string; title: string; content: string }[];
}

Finally, change app/frontend/src/App.tsx to receive the WebToolResult defined above.

App.tsx

// Change the import
import { GroundingFile, WebToolResult } from "./types";

        // Change the handler so that it receives the sources of the Bing search results
        onReceivedExtensionMiddleTierToolResponse: message => {
            const result: WebToolResult = JSON.parse(message.tool_result);

            // Reuse the existing display component as-is
            const files: GroundingFile[] = result.sources.map(x => {
                return { id: x.title, name: x.url, content: x.content };
            });

            setGroundingFiles(prev => [...prev, ...files]);
        }

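With these changes, the assistant runs a Bing search and builds its answer from the results whenever you talk to it.

★Taking orders

If the AI can hold a conversation, it can also summarize that conversation and extract information from it. So let's build a 来々軒 (Rairaiken) AI.
Here it takes delivery orders, but the same approach could be extended in many directions, for example to registering tickets.

Preparing the function for displaying order information

As with Bing search, first define the function for displaying orders. Create ordertools.py, using ragtools.py as a reference. An order should hold the following information, so start by defining a schema for it.

Order list
- Dish name
- Quantity
- Option
- Price
Customer name
Delivery address
Phone number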
ordertools.py

_order_display_tool_schema = {
    "type": "function",
    "name": "order_display",
    "description": "Displays the order information in a restaurant",
    "parameters": {
        "type": "object",
        "properties": {
            "order": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "dish_name": {
                            "type": "string",
                            "description": "Name of the dish"
                        },
                        "quantity": {
                            "type": "string",
                            "description": "Quantity of the dish ordered"
                        },
                        "option" : {
                            "type": "string",
                            "description": "Option for the dish, such as large size or extra toppings"
                        },
                        "price" : {
                            "type": "string",
                            "description": "Price of the dish"
                        },

                    },
                    "required": ["dish_name", "quantity", "price"],
                    "additionalProperties": False
                },
                "description": "List of ordered items"
            },
            "customer_name": {
                "type": "string",
                "description": "Name of the customer"
            },
            "delivery_address": {
                "type": "string",
                "description": "Delivery address for the order"
            },
            "phone_number": {
                "type": "string",
                "description": "Phone number of the customer"
            },
        },
        "required": ["order"],
        "additionalProperties": False
    }
}

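Next, in ordertools.py, write the function that returns data to the client, along with the function that attaches it. The order information is simply returned as JSON, so the arguments received from the model are passed straight back to the client.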

ordertools.py

async def _display_order_tool(args: Any) -> ToolResult:
    return ToolResult({"sources": args}, ToolResultDirection.TO_CLIENT)

def attach_rag_ordertools(rtmt: RTMiddleTier) -> None:
    rtmt.tools["order_display"] = Tool(schema=_order_display_tool_schema, target=lambda args: _display_order_tool(args))
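
Because the schema only requires "order", the model may call order_display before the customer name, address, or phone number are known. As a sketch, a helper like the following (the name missing_delivery_fields is hypothetical, and the sample argument values are assumptions about what the model might send) could tell the UI which details are still outstanding:

```python
from typing import Any

DELIVERY_FIELDS = ("customer_name", "delivery_address", "phone_number")


def missing_delivery_fields(args: dict[str, Any]) -> list[str]:
    """List the optional delivery fields that are absent or blank in the tool arguments."""
    return [f for f in DELIVERY_FIELDS if not str(args.get(f) or "").strip()]


# Hypothetical arguments from an order_display call made midway through a conversation.
partial_order = {
    "order": [{"dish_name": "醤油ラーメン", "quantity": "1", "price": "800"}],
    "customer_name": "Taro",
}
still_missing = missing_delivery_fields(partial_order)
```

An empty list would mean the order is ready to be confirmed.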

Server-side implementation

Now edit app.py. The changes are the system prompt and the attach function.
First, define the Rairaiken menu.

app.py

    menu = """\
# 来々軒メニュー

## ラーメン

### 醤油ラーメン
- **価格**: 800円
- **説明**: 醤油ベースのスープに、チャーシュー、メンマ、ネギがトッピングされた定番のラーメン。

### 味噌ラーメン
- **価格**: 850円
- **説明**: 濃厚な味噌スープに、野菜たっぷりの具材が特徴のラーメン。

### 塩ラーメン
- **価格**: 800円
- **説明**: さっぱりとした塩味のスープに、シンプルな具材が合わさったラーメン。

※各大盛 +100円

## サイドメニュー

### 餃子(5個)
- **価格**: 300円
- **説明**: 外はカリッと、中はジューシーな餃子。

### チャーハン
- **価格**: 500円
- **説明**: パラパラのご飯に具材がたっぷりのチャーハン。

### 白ご飯
- **価格**: 150円
- **説明**: ラーメンと一緒にどうぞ。

## ドリンク

### 緑茶
- **価格**: 200円
- **説明**: さっぱりとした味わいの緑茶。

### ウーロン茶
- **価格**: 200円
- **説明**: 口の中をさっぱりさせるウーロン茶。

### ビール
- **価格**: 500円
- **説明**: ラーメンと相性抜群の冷たいビール。
"""

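Next, define a system prompt that uses this menu and calls the order display tool defined earlier. (It is written in English for accuracy reasons.)
It instructs the model to take the order while calling the order display tool so that the user can see the current state of the order.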

app.py

rtmt.system_message = "You are the AI reception assistant for Rairaiken. Start with a greeting and take the user's order. \n" + \
                      "You need to converse with the user in the same language.\n\n" + \
                      "Since the user is listening to the responses via audio, it is important to keep the responses as short as possible and in one sentence. " + \
                      "Proceed with the conversation in the following order: \n" + \
                      "1. Cheerful greeting \n" + \
                      "2. Ask for the user's order (single item or multiple items) \n" + \
                      "3. Finally, Confirm customer name, delivery address, and phone number \n" + \
                      "If there is any order or delivery information in the conversation, always use the 'order_display' tool to show the order and delivery information to the user as text. \n" + \
                      "Keep the conversation as short as possible. If an order is not on the menu, say 'We cannot provide that dish.'\n\n" + \
                      "Here is the menu for Rairaiken: \n" + menu

Finally, attach the order display tool.

app.py

from ordertools import attach_rag_ordertools
...
    attach_rag_ordertools(rtmt)
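
Since the menu in app.py is plain markdown text, the same string can also be parsed on the server, for example to double-check the dish names and prices that the model passes to order_display. The following is a sketch under that assumption; parse_menu_prices is a hypothetical helper, and it relies on the "### dish name" and "- **価格**: N円" layout used in the menu.

```python
import re


def parse_menu_prices(menu_text: str) -> dict[str, int]:
    """Extract {dish name: price in yen} from the markdown menu."""
    prices: dict[str, int] = {}
    current = None
    for line in menu_text.splitlines():
        heading = re.match(r"^###\s+(.+?)\s*$", line)
        if heading:
            # A level-3 heading starts a new dish.
            current = heading.group(1)
            continue
        price = re.match(r"^- \*\*価格\*\*:\s*(\d+)円", line)
        if price and current is not None:
            prices[current] = int(price.group(1))
    return prices


# Shortened sample in the same layout as the menu string.
sample_menu = """\
## ラーメン

### 醤油ラーメン
- **価格**: 800円
- **説明**: 定番のラーメン。

### 餃子(5個)
- **価格**: 300円
"""
prices = parse_menu_prices(sample_menu)
```

Calling parse_menu_prices(menu) on the full string would give a lookup table for all dishes. Surcharges written as free text, such as the large-size note, are not captured by this sketch.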

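Front-end changes

The server side is now complete. It takes the order through conversation and sends the order and customer details it has collected to the front end.
So next, change the front end to receive the order information.

As with Bing search, start by defining the type. The arguments are returned as-is, so the type has the same shape as the function schema.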

types.ts

export type OrderToolResult = {
    sources: {
        order: {
            dish_name: string;
            quantity: string;
            option?: string;
            price: string;
        }[];
        customer_name?: string;
        delivery_address?: string;
        phone_number?: string;
    };
};

Next, display the received information on screen in App.tsx. For now, the received JSON is shown as-is.

App.tsx

import { OrderToolResult } from "./types";
...
function App_Web() {
    ...
    
    // State that holds the order information
    const [orderInfo, setOrderInfo] = useState<OrderToolResult | null>(null);

    ...

        // Store the received order information
        onReceivedExtensionMiddleTierToolResponse: message => {
            const result: OrderToolResult = JSON.parse(message.tool_result);
            setOrderInfo(result);
        }

...
            <main className="flex flex-grow flex-col items-center justify-center">

            ...

                {/* Show the JSON on screen */}
                <div>
                    {orderInfo && (
                        <div>
                            <h2>Order Tool Result</h2>
                            <pre>{JSON.stringify(orderInfo, null, 2)}</pre>
                        </div>
                    )}
                </div>
            </main>

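The front end is plain, and I blacked out the customer details in my screenshot because I could not think of a clever test user, but the screen ends up showing the order as it is taken. The order was taken successfully!

★Having the AI speak first

Normally, a conversation over the Realtime API starts when the user says something. In scenarios such as order taking, however, you may want the AI to speak first, so here is how to handle that.

There are two things to do: (1) define an event for the opening line, and (2) send it when the conversation starts.

1. Define the event for the opening line. Add a function to useRealtime.tsx and include it in the returned object as well. (For now it sends 「こんにちわ」, that is, "hello".)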

useRealtime.tsx

// New function
    const startConversation = () => {
        const command = {
            type: "conversation.item.create",
            item: {
                type: "message",
                role: "user",
                content: [
                    {
                        type: "input_text",
                        text: "こんにちわ!!"
                    }
                ]
            }
        };

        sendJsonMessage(command);
        sendJsonMessage({ type: "response.create" });
    };

...
    
// Add startConversation to the returned object
    return { startSession, addUserAudio, inputAudioBufferClear, startConversation };

2. Change the UI so that the event is sent when the conversation starts. Call startConversation inside onToggleListening, which runs when the microphone button is pressed.

App.tsx


// Add startConversation
    const { startSession, addUserAudio, inputAudioBufferClear, startConversation } = useRealTime({

...

    const onToggleListening = async () => {
        if (!isRecording) {
            startSession();

            await startAudioRecording();
            resetAudioPlayer();

            setIsRecording(true);
            // Added here for now
            startConversation();
        } else {
            await stopAudioRecording();
            stopAudioPlayer();
            inputAudioBufferClear();

            setIsRecording(false);
        }
    };

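Now, pressing the microphone button sends 「こんにちわ」 as text, and you get a spoken answer in response.

★Having the AI conduct an interview

Now that the AI can speak first, I wondered whether it could also run an interview that proactively collects information. These are my notes on making that work.

Preparing the function for displaying interview information

Just being asked questions leads nowhere, so define a function that displays the collected information at the end. This time I prepared a schema with the following items.

Mobile carrier in use
Plan (if the user remembers it)
Monthly fee
Requests (lower the cost, increase the data allowance, make the connection more stable, and so on)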

interviewtools.py

_interview_display_tool_schema = {
    "type": "function",
    "name": "interview_display",
    "description": "Displays the interview information for a mobile carrier service",
    "parameters": {
        "type": "object",
        "properties": {
            "carrier": {
                "type": "string",
                "description": "The mobile carrier being used"
            },
            "plan": {
                "type": "string",
                "description": "The current mobile plan (If the user does not clearly remember, this is an information such as the approximate data usage.)"
            },
            "monthly_fee": {
                "type": "string",
                "description": "The monthly fee for the mobile plan"
            },
            "requirements": {
                "type": "string",
                "description": "Customer's requirements such as lowering the cost, increasing data, or improving stability"
            }
        },
        "required": ["carrier", "monthly_fee", "requirements"],
        "additionalProperties": False
    }
}

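Next, in interviewtools.py, write the function that returns data to the client, along with the function that attaches it. The interview information is simply returned as JSON, so the arguments received from the model are passed straight back to the client.
Also prepare a get_mobile_menu function that returns each mobile carrier's plans. The idea is to use it when generating a recommendation.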

interviewtools.py

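async def _display_interview_tool(args: Any) -> ToolResult:
    return ToolResult({"sources": args}, ToolResultDirection.TO_CLIENT)

async def _get_mobile_menu_tool(args: Any) -> ToolResult:
    # The plan section of this text is a placeholder: list carriers, plans, and
    # prices there so that the model can build a recommendation.
    menu = f"""\
#インタビュー結果
{args}

# 携帯キャリアプラン情報
省略。ここにキャリアとプランと値段などの情報を羅列するとリコメンドが作れます。
"""
    return ToolResult(menu, ToolResultDirection.TO_SERVER)

# get_mobile_menu takes the same arguments as interview_display but needs its own
# tool name; reusing the schema unchanged would register two tools that are both
# named "interview_display".
_get_mobile_menu_tool_schema = {
    **_interview_display_tool_schema,
    "name": "get_mobile_menu",
    "description": "Gets the latest mobile carrier plan information for the interview result",
}

def attach_rag_interviewtools(rtmt: RTMiddleTier) -> None:
    rtmt.tools["interview_display"] = Tool(schema=_interview_display_tool_schema, target=lambda args: _display_interview_tool(args))
    rtmt.tools["get_mobile_menu"] = Tool(schema=_get_mobile_menu_tool_schema, target=lambda args: _get_mobile_menu_tool(args))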

Server-side implementation

Now edit app.py. The changes are the system prompt and the attach function. This time, let's try defining the system prompt in Japanese.

app.py

rtmt.system_message = "あなたはインタビューAIです。以下の順序でユーザの携帯キャリアに関する情報をヒアリングしていきます。\n" + \
                    "\n" + \
                    "1. 挨拶とともに、インタビュー参加へのお礼を伝えます。\n" + \
                    "2. 携帯電話のキャリア\n" + \
                    "3. 利用プラン（ユーザが覚えていない可能性もあります）\n" + \
                    "4. 月額の利用料金 (大体の金額でOKです) \n" + \
                    "5. ご要望（利用料金を安くしたい、通信量を増やしたい、早くしたい、など） \n" + \
                    "6. 全ての内容がそろったらお礼とともに'interview_display'ツールを実行し、インタビュー結果をユーザに表示する \n" + \
                    "\n" + \
                    "※各項目について明確にユーザから回答がない場合は再度聞き直してください。\n" + \
                    "また最後にもしユーザからアドバイスを求められたら、'get_mobile_menu'ツールを利用して最新のプラン情報を取得し、インタビュー情報をもとに、推奨のプランについておススメしてください。"

Finally, attach the tools.

app.py

from interviewtools import attach_rag_interviewtools
...
    attach_rag_interviewtools(rtmt)

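Front-end changes

The server side is now complete. It conducts the interview through conversation and sends the collected information to the front end.
So next, change the front end to receive the interview information.

Start by defining the type. The arguments are returned as-is, so the type has the same shape as the function schema.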

types.ts

export type InterviewToolResult = {
    sources: {
        carrier: string;
        plan?: string;
        monthly_fee: string;
        requirements: string;
    }
};

Next, display the received information on screen in App.tsx. For now, the received JSON is shown as-is.

App.tsx

import { InterviewToolResult } from "./types";
...
function App() {
    ...
    
    // State that holds the interview information
    const [interviewInfo, setInterviewInfo] = useState<InterviewToolResult | null>(null);

    ...

        // Store the received interview information
        onReceivedExtensionMiddleTierToolResponse: message => {
            const result: InterviewToolResult = JSON.parse(message.tool_result);
            setInterviewInfo(result);
        }

...
            <main className="flex flex-grow flex-col items-center justify-center">

            ...

                {/* Show the JSON on screen */}
                <div>
                    {interviewInfo && (
                        <div>
                            <h2>Interview Result</h2>
                            <pre>{JSON.stringify(interviewInfo, null, 2)}</pre>
                        </div>
                    )}
                </div>
            </main>

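As with order taking, the front end is plain, but the interview was completed successfully and the AI also answered with a recommendation! (Because of accuracy issues, it sometimes fails to answer, though.)

★Having the AI interpret simultaneously

Technically this is the simplest item so far, but it is such a fun idea that I built it anyway.
First, no function call definitions or attachments are needed. Then use a system prompt like the following. (The prompt is in Japanese: it names the AI ニック (Nick), has it translate Japanese speech into English and English speech into Japanese without adding anything, and lets it answer normally, without translating, when it is addressed as ニック.)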

app.py

    rtmt.system_message = """\
あなたは同時通訳AIのニックです。以下のルールに従いユーザの音声を翻訳・回答します。

#ルール
1. ユーザが日本語を話した場合は、話した内容をそのままの形で、英語に翻訳し回答します。
2. ユーザが英語を話した場合は、話した内容をそのままの形で、日本語に翻訳し回答します。
3. ユーザの話した内容を翻訳する以外の情報は付け加えないでください。翻訳のみ行います。
4. "ニック"として話しかけられた場合は、ユーザの話した言語で、翻訳せずに通常の会話として回答してください。
"""

# Comment out the place where the tools are attached.
#attach_rag_tools(rtmt)

On the front end, comment out the related code as well so that it no longer handles function call results.

App.tsx

// import { GroundingFile, ToolResult } from "./types";

function App() {
    const [isRecording, setIsRecording] = useState(false);
    //const [groundingFiles, setGroundingFiles] = useState<GroundingFile[]>([]);
    //const [selectedFile, setSelectedFile] = useState<GroundingFile | null>(null);
...

        onReceivedExtensionMiddleTierToolResponse: message => {
            //const result: ToolResult = JSON.parse(message.tool_result);

            //const files: GroundingFile[] = result.sources.map(x => {
            //   return { id: x.chunk_id, name: x.title, content: x.chunk };
            //});

            //setGroundingFiles(prev => [...prev, ...files]);
        }

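That completes a simultaneous-interpretation AI: speak Japanese and it answers in English, speak English and it answers in Japanese.
If you call it by name (ニックさん), it will also do things like summarize the conversation so far.

Note, however, that accuracy with the current preview API is mediocre. It sometimes stops translating partway through or translates numbers only partially. Hopefully it stabilizes soon.

★Moving a ball

Technically there is nothing new here compared with the previous items, but as a scenario, let's implement moving a ball.

Preparing the ball-movement function

Prepare a tool that receives the ball's color, the direction of movement, and the amount of movement as function arguments and returns them to the client.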

balltools.py

from typing import Any
from rtmt import RTMiddleTier, Tool, ToolResult, ToolResultDirection

# Schema for the ball-movement tool
_ball_tool_schema = {  
    "type": "function",  
    "name": "balltool",  
    "description": "指定した色のボールを移動します。",
    "parameters": {  
        "type": "object",  
        "properties": {  
            "color": {  
                "type": "string",  
                "description": "移動するボールの色: red, blue"  
            },
            "direction": {  
                "type": "string",  
                "description": "ボールの移動先の向き: up, down, left, right"
            },
            "amount": {  
                "type": "number",  
                "description": "ボールの移動量(1以上の整数)"
            }  
        },  
        "required": ["color", "direction", "amount"],  
        "additionalProperties": False  
    }  
}

async def _ball_tool(args: Any) -> ToolResult:
    print("**************************************")
    print(args)

    return ToolResult({"sources": args}, ToolResultDirection.TO_CLIENT)

def attach_ball_tools(rtmt: RTMiddleTier) -> None:
    rtmt.tools["balltool"] = Tool(schema=_ball_tool_schema, target=lambda args: _ball_tool(args))
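
The tool only forwards color, direction, and amount; what happens to the ball is up to the client. The post does not show that part, so the following is a rough sketch of the state update a client could apply, written in Python for brevity. The grid size, the coordinate convention (y grows downward, as on screen), and the name move_ball are all assumptions.

```python
GRID_SIZE = 10  # assumed grid width and height
DIRECTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

Position = tuple[int, int]


def move_ball(
    positions: dict[str, Position], color: str, direction: str, amount: float
) -> dict[str, Position]:
    """Return new positions after moving one ball, clamped to the grid."""
    if color not in positions:
        raise ValueError(f"unknown ball color: {color}")
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    dx, dy = DIRECTIONS[direction]
    steps = max(0, int(amount))  # the schema types amount as "number", so coerce to int
    x, y = positions[color]
    new_x = min(GRID_SIZE - 1, max(0, x + dx * steps))
    new_y = min(GRID_SIZE - 1, max(0, y + dy * steps))
    # Return a new dict instead of mutating, which suits React-style state updates.
    return {**positions, color: (new_x, new_y)}


positions = {"red": (2, 2), "blue": (7, 7)}
positions = move_ball(positions, "red", "right", 3)
positions = move_ball(positions, "blue", "down", 5)  # clamped at the bottom edge
```

Validating color and direction on the client is worthwhile because the schema describes the allowed values only in free text, so the model is not strictly constrained to them.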

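The rest is the same as before. In app.py, change the system prompt so that it calls this tool, then attach the tool. (The prompt, in Japanese, tells the model that it is a ball-control AI, that it should report when an element is missing or after the tool has run, and that answers should be a single short sentence because the user is speaking.)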

app.py

    rtmt.system_message = """\
あなたはボール制御AIです。ユーザからの指示に従い、'balltool'ツールを利用してボールを移動させます。
ユーザはボールの色、移動先の向き、移動量を指定します。足りない要素がある場合や、ツールを実行後はその旨報告してください。
なお、ユーザは音声で指示を出しているため、回答はできるだけ簡潔に、一文で答えることが重要です。
"""

    attach_ball_tools(rtmt)

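For the front end, I had o1-preview generate a React component with a grid that reflects the arguments. (Details omitted.)
That gave me a working screen.

The balls on screen move according to the spoken instructions.
That is all it does, but I implemented it because it makes it easy to imagine extensions, such as using the tool call as a switch for something.

Conclusion

It is such a great sample that it makes you want to customize it in all sorts of ways. I intend to keep adding notes here whenever I tinker with something new.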
