
Trying out Watson as an embeddable AI (The Watson Speech to Text Library for Embed) - 2/2

Last updated at 2022-12-27. Posted at 2022-12-27.

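Article overview

IBM's Watson Speech to Text (STT below), which achieved a word error rate of 4.3% (the human recognition error rate is 5.1%), is now offered as an embeddable AI that can be used without an internet connection.
I therefore built a web app to try out how high the recognition confidence of a customized language model is, and how speaker labels and background-audio suppression based on frequency analysis change the recognition confidence.
In this article, I build the web app that runs as the front end.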
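
To make the "word error rate" figure above concrete, here is a minimal sketch of how the metric is usually computed: the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The class name `WordErrorRate` and the whitespace tokenizer are my own assumptions for illustration (whitespace splitting suits English, not Japanese), and this is not how IBM measured the figure quoted above.

```java
final class WordErrorRate {

    /** WER = (substitutions + deletions + insertions) / number of reference words. */
    static double compute(String reference, String hypothesis) {
        String[] ref = tokenize(reference);
        String[] hyp = tokenize(hypothesis);
        if (ref.length == 0) {
            throw new IllegalArgumentException("reference must contain at least one word");
        }
        // Word-level Levenshtein distance, keeping only the previous row of the table.
        int[] prev = new int[hyp.length + 1];
        for (int j = 0; j <= hyp.length; j++) {
            prev[j] = j; // turning an empty reference into j words costs j insertions
        }
        for (int i = 1; i <= ref.length; i++) {
            int[] curr = new int[hyp.length + 1];
            curr[0] = i; // dropping i reference words costs i deletions
            for (int j = 1; j <= hyp.length; j++) {
                int substitution = prev[j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            prev = curr;
        }
        return (double) prev[hyp.length] / ref.length;
    }

    // Assumption: lower-cased, whitespace-separated words.
    private static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // One substitution (has -> had) and one deletion (a) over 5 reference words.
        System.out.println(compute("the patient has a fever", "the patient had fever"));
    }
}
```

With this definition, a lower value means fewer word-level mistakes, and a perfect transcript scores 0.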

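Background

On October 25, 2022, IBM announced an expansion of its embeddable AI software portfolio.
IBM Watson Natural Language Processing Library: a library of functions that process language by interpreting meaning and context from intent and sentiment
IBM Watson Speech to Text Library: a library that enables fast and accurate speech transcription
IBM Watson Text to Speech Library: a library of functions that convert text into accurate, natural-sounding speech in a variety of languages and voices

Because these products are embeddable AI, they can analyze even highly confidential data without connecting to the internet.
For example, a company could run an opinion survey on a planned project using data collected from social media, or drive product development from complaint and review text combined with purchase data.

In addition, as of December 2022 a trial license valid for 180 days is available, so I try out what can be done with it.
(Personally, since speech transcription and natural-language analysis are important in sports psychology and ergonomics research, I am in the middle of recommending it to the junior members of 板研究室, the laboratory I belonged to.)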

Environment

For various reasons I could not install it on my own PC, so I set up a Linux virtual server. Its specifications are as follows.
cpu cores: 4
MemTotal: 8129000 kB
podman version 4.2.0
openjdk version "17.0.5" 2022-10-18 LTS
OpenJDK Runtime Environment (Red_Hat-17.0.5.0.8-2.el8_6) (build 17.0.5+8-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-17.0.5.0.8-2.el8_6) (build 17.0.5+8-LTS, mixed mode, sharing)
Red Hat Enterprise Linux release 8.7 (Ootpa)

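Procedure

I follow the steps in the linked GitHub repository.
They cover setting up the front-end web app, called the STT Client Application, and setting up the back-end Docker container in which STT runs.

The GitHub repository uses English and French language models, but this Qiita article uses general-domain Japanese and English models plus an English model customized for medical use.
The front-end app does the following:

Compare a domain-specific language model with the general-domain English model
Noise cancelling
Speaker labels
Upload recorded audio from the front-end web app and transcribe it

The finished app looks like this.

[Figure: speech recognition by model]

[Figure: features by parameter]

[Figure: uploading data from the front-end app]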

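Main part

Setting up the back-end Docker container in which STT runs

Because I use the podman command in place of the docker command here, I define an alias.

alias docker=podman

Move to the directory that contains the sample code.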

cd Watson-Speech/STTApplication
ls
# Dockerfile  architecture.png  images  mvnw.cmd  readme.md  target STTArchitectureLocal.png  deployment    mvnw    pom.xml   src

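HTML changes

This section lists the changes made to the existing code.
Edit src/main/resources/templates/index.html (for example with vi).
Create an audio file demo1.wav that contains medical terms and place it in src/main/resources/static/audio/.
In the HTML, build a two-column layout that shows the result of the uncustomized language model in the left column and the result of the medically customized model in the right column.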

index.html

# Before (index.html, lines 90-105)
# <tr>
# 	<td width="30%">
# 		Sample1
# 	</td>
# 	<td width=50%>

# 		<audio controls th:src="@{/audio/CallCenterSample1.wav}" Your browser does not
# 			support the <code>audio</code>
# 			element.
# 		</audio>
# 	</td>
# 	<td width="20%">
# 		<bx-link href="/transcript/CallCenterSample1.wav"> Convert </bx-link> |
# 		<bx-link href="/transcript/CallCenterSample1.wav/download"> Download </bx-link>
# 	</td>
# </tr>

# After
<tr>
	<td width="30%">
		Sample1
	</td>
	<td width="50%">

		<audio controls th:src="@{/audio/demo1.wav}"> Your browser does not
			support the <code>audio</code>
			element.
		</audio>
	</td>
	<td width="20%">
		<bx-link href="/transcript/ww/demo1.wav"> テキスト化 </bx-link> |
		<bx-link href="/transcript/ww/demo1.wav/download"> ダウンロード </bx-link>
	</td>
</tr>

# Before (index.html, lines 143-154)
# <bx-table th:if="${resultNP ne null}" class="table table-striped">
# 	<bx-table-body>
# 		<th:block th:each="rs : ${resultNP}">
# 			<bx-table-row>
# 				<bx-table-cell th:text="'Confidence: ' + ${rs.confidence}"></bx-table-cell>
# 			</bx-table-row>
# 			<bx-table-row>
# 				<bx-table-cell th:text="'Transcript: ' + ${rs.transcript}"></bx-table-cell>
# 			</bx-table-row>
# 		</th:block>
# 	</bx-table-body>
# </bx-table>

# After
<div class="row" th:if="${resultUS ne null}">
        <div class="col">
               <bx-table class="table table-striped">
                       <caption style="text-align: center;">一般言語学習モデル</caption>
                       <bx-table-body>
                               <th:block th:each="rs : ${resultUS}">
                                       <bx-table-row>
                                               <bx-table-cell th:text="'認識信頼度: ' + ${rs.confidence}">
                                               </bx-table-cell>
                                       </bx-table-row>
                                       <bx-table-row>
                                               <bx-table-cell th:text="'テキスト化: ' + ${rs.transcript}">
                                               </bx-table-cell>
                                       </bx-table-row>
                               </th:block>
                        </bx-table-body>
               </bx-table>
        </div>
        <div class="col">
               <bx-table th:if="${resultWW ne null}" class="table table-striped">
                       <caption style="text-align: center;">医療言語学習済みモデル</caption>
                       <bx-table-body>
                               <th:block th:each="rs : ${resultWW}">
                                       <bx-table-row>
                                               <bx-table-cell th:text="'認識信頼度: ' + ${rs.confidence}">
                                               </bx-table-cell>
                                       </bx-table-row>
                                       <bx-table-row>
                                               <bx-table-cell th:text="'テキスト化: ' + ${rs.transcript}">
                                               </bx-table-cell>
                                       </bx-table-row>
                               </th:block>
                       </bx-table-body>
               </bx-table>
        </div>
</div>

Controller changes

This section lists the changes made to the existing code.
Edit src/main/java/com/build/labs/controller/STTController.java.
Define a new GetMapping.

STTController.java

# Before (line 57): nothing
# After
        @GetMapping("/transcript/ww/{filename}")
        public String transcriptAudioWW(@PathVariable("filename") String filename, Model model)
                        throws IOException, URISyntaxException {

                // A stream can be consumed only once, so the same audio file is opened once per model.
                InputStream input = new ClassPathResource("static/audio/" + filename).getInputStream();
                InputStream input1 = new ClassPathResource("static/audio/" + filename).getInputStream();
                String transcript = sttService.transcriptAudioUS(input);    // general en-US model
                String transcript1 = sttService.transcriptAudioWW(input1);  // medical en-WW model

                // Result of the medical model (right-hand table in index.html)
                List<Output> outputList1 = formatOutput(transcript1);
                outputList1.forEach(o -> {
                        System.out.println("confidence: " + o.getConfidence());
                        System.out.println("transcript: " + o.getTranscript());
                });
                model.addAttribute("resultWW", outputList1);

                // Result of the general model (left-hand table in index.html)
                List<Output> outputList = formatOutput(transcript);
                outputList.forEach(o -> {
                        System.out.println("confidence: " + o.getConfidence());
                        System.out.println("transcript: " + o.getTranscript());
                });
                model.addAttribute("resultUS", outputList);

                return "index";
        }
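
The controller opens the same file twice because each stream is used up by a single read. As an alternative, the bytes could be read once and handed to every model. The following is only a sketch of that idea in plain Java: `TranscribeOnceSketch`, `transcribeWithAll`, and the stub recognizers are hypothetical names, and the lambdas stand in for calls such as `postFeignClient.transcriptUS(bytes)`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class TranscribeOnceSketch {

    /**
     * Reads the audio once and sends the same byte array to every recognizer.
     * Keys are display names; values are functions from audio bytes to a raw result string.
     */
    static Map<String, String> transcribeWithAll(
            InputStream audio, Map<String, Function<byte[], String>> recognizers) {
        final byte[] bytes;
        try {
            bytes = audio.readAllBytes(); // the stream is consumed here, exactly once
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> results = new LinkedHashMap<>();
        recognizers.forEach((name, recognizer) -> results.put(name, recognizer.apply(bytes)));
        return results;
    }

    public static void main(String[] args) {
        // Stub recognizers; in the real app these would call the Feign client methods.
        Map<String, Function<byte[], String>> recognizers = new LinkedHashMap<>();
        recognizers.put("general", bytes -> "general model received " + bytes.length + " bytes");
        recognizers.put("medical", bytes -> "medical model received " + bytes.length + " bytes");

        InputStream audio = new ByteArrayInputStream(new byte[] {1, 2, 3});
        transcribeWithAll(audio, recognizers).forEach((name, result) ->
                System.out.println(name + ": " + result));
    }
}
```

Holding the whole file in memory is fine for short sample clips like the ones used here; for long recordings, the streaming endpoint would be the better fit.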

Service changes

This section lists the changes made to the existing code.
Edit src/main/java/com/build/labs/feignclient/SSTServingClient.java.
Change the models in use and define new PostMapping methods.

SSTServingClient.java

# Before (lines 13-17)
#     public final String STT_REST_MAPPING = "/speech-to-text/api/v1/recognize?model=en-US_Multimedia";
#     public final String STT_REST_MAPPING1 = "/speech-to-text/api/v1/recognize?model=en-US_Telephony";

#     @PostMapping(STT_REST_MAPPING)
#     String transcript(@RequestBody byte[] body);

# After
    public final String STT_REST_MAPPING = "/speech-to-text/api/v1/recognize?model=ja-JP_Multimedia";
    public final String STT_REST_MAPPING1 = "/speech-to-text/api/v1/recognize?model=en-WW_Medical_Telephony";
    public final String STT_REST_MAPPING2 = "/speech-to-text/api/v1/recognize?model=en-US_Multimedia";

    @PostMapping(STT_REST_MAPPING)
    String transcript(@RequestBody byte[] body);

    @PostMapping(STT_REST_MAPPING1)
    String transcriptWW(@RequestBody byte[] body);

    @PostMapping(STT_REST_MAPPING2)
    String transcriptUS(@RequestBody byte[] body);
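
To see what each mapping amounts to on the wire, the request can be sketched with the JDK's own HTTP client instead of Feign. This is a minimal sketch under assumptions: the base URL `http://localhost:1080` follows the port used later in this article, the `audio/wav` content type is my assumption for the sample files, and `SttRequestSketch` and `buildRecognizeRequest` are hypothetical names that are not part of the sample repository.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class SttRequestSketch {

    /** Builds a POST request for the recognize endpoint with the model given as a query parameter. */
    static HttpRequest buildRecognizeRequest(String baseUrl, String model, byte[] audio) {
        URI uri = URI.create(baseUrl + "/speech-to-text/api/v1/recognize?model=" + model);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "audio/wav") // assumption: the audio is a WAV file
                .POST(HttpRequest.BodyPublishers.ofByteArray(audio))
                .build();
    }

    public static void main(String[] args) {
        byte[] audio = new byte[0]; // placeholder; load the bytes of a real wav file here
        HttpRequest request =
                buildRecognizeRequest("http://localhost:1080", "en-WW_Medical_Telephony", audio);
        System.out.println(request.method() + " " + request.uri());
        // To actually send it (requires the STT container to be running):
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Switching models is then only a matter of changing the `model` query parameter, which is what the three `STT_REST_MAPPING` constants encode.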

Edit src/main/java/com/build/labs/services/STTService.java.

STTService.java

# Before (line 38): nothing
# After
public String transcriptAudioWW(InputStream inputStream) throws URISyntaxException, IOException {
        String result = postFeignClient.transcriptWW(inputStream.readAllBytes());
        return result;
}

public String transcriptAudioUS(InputStream inputStream) throws URISyntaxException, IOException {
        String result = postFeignClient.transcriptUS(inputStream.readAllBytes());
        return result;
}

Also add the wav file used as the audio sample to src/main/resources/static/audio/.

Run the build command

./mvnw clean package
# /usr/bin/which: no javac in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
# Warning: JAVA_HOME environment variable is not set.
# [INFO] Scanning for projects...
# [INFO] 
# [INFO] -------------------< com.build.labs:STTApplication >--------------------
# [INFO] Building STTApplication 0.0.1-SNAPSHOT
# [INFO] --------------------------------[ jar ]---------------------------------
# [INFO] 
# [INFO] --- maven-clean-plugin:3.2.0:clean (default-clean) @ STTApplication ---
# [INFO] Deleting /root/Watson-Speech/STTApplication/target
# [INFO] 
# [INFO] --- maven-resources-plugin:3.2.0:resources (default-resources) @ STTApplication ---
# [INFO] Using 'UTF-8' encoding to copy filtered resources.
# [INFO] Using 'UTF-8' encoding to copy filtered properties files.
# [INFO] Copying 1 resource
# [INFO] Copying 8 resources
# [INFO] 
# [INFO] --- maven-compiler-plugin:3.10.1:compile (default-compile) @ STTApplication ---
# [INFO] Changes detected - recompiling the module!
# [INFO] Compiling 14 source files to /root/Watson-Speech/STTApplication/target/classes
# [INFO] 
# [INFO] --- maven-resources-plugin:3.2.0:testResources (default-testResources) @ STTApplication ---
# [INFO] Using 'UTF-8' encoding to copy filtered resources.
# [INFO] Using 'UTF-8' encoding to copy filtered properties files.
# [INFO] skip non existing resourceDirectory /root/Watson-Speech/STTApplication/src/test/resources
# [INFO] 
# [INFO] --- maven-compiler-plugin:3.10.1:testCompile (default-testCompile) @ STTApplication ---
# [INFO] Changes detected - recompiling the module!
# [INFO] Compiling 1 source file to /root/Watson-Speech/STTApplication/target/test-classes
# [INFO] 
# [INFO] --- maven-surefire-plugin:2.22.2:test (default-test) @ STTApplication ---
# [INFO] 
# [INFO] -------------------------------------------------------
# [INFO]  T E S T S
# [INFO] -------------------------------------------------------
# [INFO] 
# [INFO] Results:
# [INFO] 
# [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
# [INFO] 
# [INFO] 
# [INFO] --- maven-jar-plugin:3.2.2:jar (default-jar) @ STTApplication ---
# [INFO] Building jar: /root/Watson-Speech/STTApplication/target/STTApplication-0.0.1-SNAPSHOT.jar
# [INFO] 
# [INFO] --- spring-boot-maven-plugin:2.7.3:repackage (repackage) @ STTApplication ---
# [INFO] Replacing main artifact with repackaged archive
# [INFO] ------------------------------------------------------------------------
# [INFO] BUILD SUCCESS
# [INFO] ------------------------------------------------------------------------
# [INFO] Total time:  6.220 s
# [INFO] Finished at: 2022-12-27T02:33:15-08:00
# [INFO] ------------------------------------------------------------------------

The application is packaged as target/STTApplication-0.0.1-SNAPSHOT.jar.

Run

Set environment variables so that the Java application can reach the STT service and the WebSocket streaming service. This assumes the STT service is running on port 1080.

export STT_WSS_SERVICE_ENDPOINT=ws://localhost:1080

Run the application

java -jar target/STTApplication-0.0.1-SNAPSHOT.jar

#  .   ____          _            __ _ _
# /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
#( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
# \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
#  '  |____| .__|_| |_|_| |_\__, | / / / /
# =========|_|==============|___/=/_/_/_/
# :: Spring Boot ::                (v2.7.3)

#2022-12-27 06:42:11.315  INFO 1028712 --- [           main] com.build.labs.SttApplication            : Starting SttApplication v0.0.1-SNAPSHOT using Java 17.0.5 on itzvsi-mhdzrtur.dte.demo.ibmcloud.com with PID 1028712 (/root/Watson-Speech/STTApplication/target/STTApplication-0.0.1-SNAPSHOT.jar started by root in /root/Watson-Speech/STTApplication)
#2022-12-27 06:42:11.319  INFO 1028712 --- [           main] com.build.labs.SttApplication            : No active profile set, falling back to 1 default profile: "default"
#2022-12-27 06:42:12.362  INFO 1028712 --- [           main] o.s.cloud.context.scope.GenericScope     : BeanFactory id=2315b251-04c6-38b0-a08f-d8d0c9e8d972
#2022-12-27 06:42:12.780  INFO 1028712 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat initialized with port(s): 8080 (http)
#2022-12-27 06:42:12.791  INFO 1028712 --- [           main] o.apache.catalina.core.StandardService   : Starting service [Tomcat]
#2022-12-27 06:42:12.792  INFO 1028712 --- [           main] org.apache.catalina.core.StandardEngine  : Starting Servlet engine: [Apache Tomcat/9.0.65]
#2022-12-27 06:42:12.899  INFO 1028712 --- [           main] o.a.c.c.C.[Tomcat].[localhost].[/]       : Initializing Spring embedded WebApplicationContext
#2022-12-27 06:42:12.899  INFO 1028712 --- [           main] w.s.c.ServletWebServerApplicationContext : Root WebApplicationContext: initialization completed in 1485 ms
#2022-12-27 06:42:13.534  INFO 1028712 --- [           main] o.s.b.a.w.s.WelcomePageHandlerMapping    : Adding welcome page template: index
#2022-12-27 06:42:13.770  INFO 1028712 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
#2022-12-27 06:42:13.783  INFO 1028712 --- [           main] com.build.labs.SttApplication            : Started SttApplication in 3.141 seconds (JVM running for 3.747)

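If you can open the front-end application at http://localhost:8080, the setup succeeded.
Try out various inputs.

Notes

On a site served over an unencrypted connection such as http, the browser cannot enable the microphone. The best fix is to encrypt the connection, but if you just want to try it quickly, open the following Chrome flag, enter the URL you are accessing under "Insecure origins treated as secure", and set the flag to Enabled.
chrome://flags/#unsafely-treat-insecure-origin-as-secure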

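References

・IBM、AIポートフォリオの拡張により、エコシステム・パートナーのAIの導入を加速 (in English: "IBM accelerates AI adoption by ecosystem partners by expanding its AI portfolio")
・IBM Helps Ecosystem Partners Accelerate AI Adoption by Making it Easier to Embed and Scale AI Across Their Business
・Announcement Letters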
