
[Flutter/Dart] Building an audio file transcription app with the AmiVoice API

Last updated at 2024-05-14, posted at 2024-05-13

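Introduction

Hello. I have recently been studying Flutter with the aim of switching over from Swift, and I am struggling with things such as how to work with widgets.
While doing so, I found that Advanced Media, Inc. was issuing free speech recognition API coupons as part of its "Let's try the speech recognition API!" campaign, so I built a speech recognition app, partly as Flutter practice.
If you spot anything odd or improvable in my Flutter usage or coding style, please let me know.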

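What I built

The speech recognition app I built transcribes audio files in two steps:

Send an audio file and wait until speech recognition completes
Check the speech recognition result

1. Send an audio file and wait until speech recognition completes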

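Note
For readability, the wait for speech recognition to complete has been fast-forwarded. In practice, recognition takes roughly 0.5 to 1.5 times the length of the submitted audio, so please keep that in mind.
* For details, see the "Notes" section of "Asynchronous HTTP Interface".

2. Check the speech recognition result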

GitHub

The speech recognition app is available on GitHub. Please give it a try.

AmiVoice API

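AmiVoice API is one of the speech recognition services provided by Advanced Media, Inc.
Its main features are:

It receives audio data, converts the spoken content to text, and returns it.

It supports both files and streaming, which returns results incrementally. See "Choosing an interface".

It uses text-based protocols over HTTP and WebSocket, so the client environment only needs TCP/IP and no special library has to be embedded.

Traffic is encrypted with HTTPS and WSS, so the communication path is secure.

Recognition results are returned in JSON format.

It estimates which parts of the submitted audio contain human speech and recognizes only those; you are charged only for the speech duration that was actually recognized. See the AmiVoice API pricing.

Besides the text estimated from the speech, the result includes the start and end time of each utterance, per-token timing information, confidence scores, and more.

Multiple speech recognition engines (combinations of a language model and an acoustic model) are provided, so you can choose the engine best suited to the language, domain, and use case.

Users can add words that are not recognized by registering them.

Filler words such as "えーっと" and "あのー" are removed automatically. You can also deliberately keep them, for example to analyze how call center staff speak. See "Removing filler words".

Punctuation is inserted automatically.

When speaker diarization is enabled, audio with multiple speakers yields an estimate of who spoke from where to where.

When sentiment analysis is enabled, sentiment analysis runs at the same time.

In short, features such as speaker separation and sentiment analysis are available in addition to speech recognition itself (see the official documentation).

The API offers three interfaces:

Synchronous HTTP interface
Asynchronous HTTP interface
WebSocket interface

You can choose among them depending on the use case.

I wanted to run speech recognition on multiple audio files, so the app in this article uses the asynchronous HTTP interface.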

Development environment

macOS 12.6
Flutter 3.19.6
Xcode 14.2
Test environment: macOS

pubspec.yaml

name: speech_recognition_flutter_app
description: "A new Flutter project."
publish_to: 'none'
version: 1.0.0+1

environment:
  sdk: '>=3.3.4 <4.0.0'

dependencies:
  flutter:
    sdk: flutter

  cupertino_icons: ^1.0.6
  file_picker: ^8.0.3
  http: ^1.2.1
  go_router: ^14.0.2
  hooks_riverpod: ^2.5.1
  intl: ^0.19.0
  just_audio: ^0.9.37
  audio_video_progress_bar: ^2.0.2

dev_dependencies:
  flutter_test:
    sdk: flutter

  flutter_lints: ^3.0.0

flutter:
  uses-material-design: true

Implementing the asynchronous HTTP interface

I implemented the AmiVoice API asynchronous HTTP interface with reference to the following:
https://docs.amivoice.com/amivoice-api/manual/user-guide/request/async-http-interface
https://acp.amivoice.com/blog/2022-09-30-134154/

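The implementation proceeds in the following order:

Register for the AmiVoice API
Select the audio files
Send the audio file
Poll the speech recognition job
Get the recognition result

1. Register for the AmiVoice API

As preparation, register for the AmiVoice API by following the article below.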
https://acp.amivoice.com/blog/qiita_api240424/

2. Select the audio files

First, select in the app the audio files you want to transcribe.
File selection uses file_picker.

// Requires: import 'package:file_picker/file_picker.dart';
Future<void> pick() async {
  state = const AsyncLoading();
  try {
    final paths = (await FilePicker.platform.pickFiles(
      compressionQuality: 30,
      type: FileType.audio,
      allowMultiple: true,
    ))
        ?.files;
    // If paths was obtained successfully, send the audio data

  } catch (e, stackTrace) {
    // Also covers PlatformException; keeps the stack trace of the throw site
    state = AsyncValue.error(e, stackTrace);
  }
}

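3. Send the audio file

Once the audio is selected, send the audio data to the server with a multipart POST.
Following the official documentation, build a request like the one below.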

POST https://acp-api-async.amivoice.com/v1/recognitions
Content-Type: multipart/form-data;boundary=some-boundary-string

--some-boundary-string
Content-Disposition: form-data; name="u"

{APPKEY}
--some-boundary-string
Content-Disposition: form-data; name="d"

-a-general
--some-boundary-string
Content-Disposition: form-data; name="a"
Content-Type: application/octet-stream

{audio data binary}
--some-boundary-string--

The parameters in the request above are set as follows:

u: credentials (the APPKEY available on the AmiVoice API My Page)
d: connection engine name
a: audio data binary
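
To make the wire format concrete, here is a minimal sketch that assembles the same three parts by hand in pure Dart. The helper name `buildMultipartBody`, the placeholder APPKEY, and the stand-in audio bytes are all hypothetical; in the app itself the http package builds this body, so treat this only as an illustration of what ends up on the wire.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Builds a multipart/form-data body with the u, d and a parts.
/// `buildMultipartBody` is an illustrative helper, not a library function.
Uint8List buildMultipartBody({
  required String boundary,
  required String appKey,
  required String engine,
  required List<int> audio,
}) {
  final body = BytesBuilder();

  // Text parts: part headers, a blank line, then the value.
  void addField(String name, String value) {
    body.add(utf8.encode('--$boundary\r\n'
        'Content-Disposition: form-data; name="$name"\r\n'
        '\r\n'
        '$value\r\n'));
  }

  addField('u', appKey);
  addField('d', engine);

  // Binary part: the audio bytes follow the blank line unmodified.
  body.add(utf8.encode('--$boundary\r\n'
      'Content-Disposition: form-data; name="a"\r\n'
      'Content-Type: application/octet-stream\r\n'
      '\r\n'));
  body.add(audio);

  // The closing boundary has two extra trailing hyphens.
  body.add(utf8.encode('\r\n--$boundary--\r\n'));
  return body.toBytes();
}

void main() {
  final bytes = buildMultipartBody(
    boundary: 'some-boundary-string',
    appKey: 'APPKEY-PLACEHOLDER',
    engine: '-a-general',
    audio: [0x52, 0x49, 0x46, 0x46], // "RIFF", a stand-in for real audio
  );
  // Printable only because the stand-in audio bytes happen to be ASCII.
  print(latin1.decode(bytes));
}
```

The blank line between each part's headers and its value is what separates metadata from data, which is why the binary part needs one as well.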

Based on the above, implement the POST request.

// Requires:
//   import 'dart:convert';
//   import 'dart:typed_data';
//   import 'package:http/http.dart' as http;
//   import 'package:http_parser/http_parser.dart'; // MediaType
Future<String> upload(Uint8List data) async {
  final url = Uri.parse("https://acp-api-async.amivoice.com/v1/recognitions");
  // MultipartRequest sets the Content-Type header (with boundary) itself
  final request = http.MultipartRequest("POST", url);
  request.fields["u"] = apiKey;
  request.fields["d"] = "grammarFileNames=-a-general";
  request.files.add(http.MultipartFile.fromBytes(
    "a",
    data, // Uint8List is already a List<int>
    contentType: MediaType.parse("application/octet-stream"),
  ));

  final stream = await request.send();
  final response = await http.Response.fromStream(stream);
  // Decode explicitly as UTF-8 so Japanese text is not garbled
  final jsonData = json.decode(utf8.decode(response.bodyBytes));
  final errorCode = jsonData["code"];
  final errorMessage = jsonData["message"];
  // An error response carries a non-empty code and a message
  if (errorCode != null && errorCode != "" && errorMessage != null) {
    throw Exception("$errorMessage[$errorCode]");
  }
  return jsonData["sessionid"] ?? "";
}

When the multipart POST succeeds, the response is the JSON below. The sessionid in this JSON is used to track the state of the speech recognition job for the submitted audio.

{ "sessionid": "017ac8786c5b0a0504399999", "text": "..." }

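On failure, the response code and error message are returned in JSON like the following.
Details of the response codes and error messages are listed in the official documentation.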

{
  "results": [{ "tokens": [], "tags": [], "rulename": "", "text": "" }],
  "text": "",
  "code": "-",
  "message": "received illegal service authorization"
}
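
The two response shapes above can be told apart with a small check on `code`. The sketch below is one way to do it in pure Dart; `parseUploadResponse` is a hypothetical helper name, and treating a non-empty `code` as the failure signal is an assumption based only on the two samples shown here.

```dart
import 'dart:convert';

/// Returns the session ID from an upload response body, or throws when the
/// body carries a non-empty error code. Hypothetical helper for illustration.
String parseUploadResponse(String body) {
  final decoded = json.decode(body);
  if (decoded is! Map<String, dynamic>) {
    throw const FormatException('Expected a JSON object');
  }

  // Assumption: failures carry a non-empty "code", successes do not.
  final code = decoded['code'];
  if (code is String && code.isNotEmpty) {
    throw Exception('${decoded['message']} [$code]');
  }

  final sessionId = decoded['sessionid'];
  if (sessionId is! String || sessionId.isEmpty) {
    throw const FormatException('Response has no sessionid');
  }
  return sessionId;
}

void main() {
  const ok = '{"sessionid": "017ac8786c5b0a0504399999", "text": "..."}';
  const ng = '{"results": [], "text": "", "code": "-", '
      '"message": "received illegal service authorization"}';

  print(parseUploadResponse(ok)); // 017ac8786c5b0a0504399999
  try {
    parseUploadResponse(ng);
  } catch (e) {
    print(e); // Exception: received illegal service authorization [-]
  }
}
```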

4. Poll the speech recognition job

After the audio data has been sent successfully, use the sessionid to fetch the state of the speech recognition job with GET.

GET https://acp-api-async.amivoice.com/v1/recognitions/{sessionid}
Authorization: Bearer {APPKEY}

A speech recognition job has five states: queued, started, processing, completed, and error. You therefore need to keep polling until the job state becomes completed or error.
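
One way to keep these states out of string comparisons is a small enum. This is only a sketch: `JobStatus`, `isTerminal`, and `parse` are illustrative names that are not part of the AmiVoice API or of the app's code, and mapping unknown values to `error` is a design choice made here so that a polling loop always terminates.

```dart
/// The five job states of the asynchronous interface.
/// The enum and its helpers are illustrative, not part of any library.
enum JobStatus {
  queued,
  started,
  processing,
  completed,
  error;

  /// True once the job can no longer change state.
  bool get isTerminal => this == completed || this == error;

  /// Maps the status string from the API to an enum value. Unknown or
  /// missing values are treated as errors so that polling cannot loop forever.
  static JobStatus parse(String? value) => JobStatus.values.firstWhere(
        (s) => s.name == value,
        orElse: () => JobStatus.error,
      );
}

void main() {
  for (final raw in ['queued', 'processing', 'completed', 'unknown']) {
    final status = JobStatus.parse(raw);
    print('$raw -> ${status.name} (terminal: ${status.isTerminal})');
  }
}
```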

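// Poll every 10 seconds
// (assumes _timer and jobState are mutable fields of the enclosing class)
Future<void> _polling(String sessionId) async {
  if (_timer != null && _timer!.isActive) return;
  Future<void> periodicFetch(Timer timer) async {
    try {
      // Update the job state from the latest response
      jobState = (await fetch(sessionId)).status;
    } catch (e) {
      jobState = "error";
    }
    // Stop polling once the job has moved to completed or error
    if (jobState == "completed" || jobState == "error") {
      timer.cancel();
    }
  }

  _timer = Timer.periodic(const Duration(seconds: 10), periodicFetch);
}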

// Fetch the job state, or the recognition result, with GET
Future<AsyncRecognition> fetch(String sessionId) async {
  final url = Uri.parse("https://acp-api-async.amivoice.com/v1/recognitions/$sessionId");
  final response = await http.get(url, headers: {"Authorization": "Bearer $apiKey"});
  // Decode explicitly as UTF-8 so Japanese text is not garbled
  final jsonData = json.decode(utf8.decode(response.bodyBytes));
  return AsyncRecognition.fromJson(jsonData);
}

The job moves through the states queued -> started -> processing and ends in either error or completed.
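
As an alternative to a periodic timer, the same wait can be written as a plain async loop, which never has two requests in flight at once. The sketch below is not the app's implementation: `pollUntilDone`, `fetchStatus`, and `maxAttempts` are hypothetical names, and `fetchStatus` stands in for a real GET against the job URL so that the example runs without network access.

```dart
import 'dart:async';

/// Calls `fetchStatus` repeatedly until it reports "completed" or "error".
/// All names are hypothetical; `fetchStatus` stands in for the real GET.
Future<String> pollUntilDone(
  Future<String> Function() fetchStatus, {
  Duration interval = const Duration(seconds: 10),
  int maxAttempts = 360,
}) async {
  for (var attempt = 0; attempt < maxAttempts; attempt++) {
    final status = await fetchStatus();
    if (status == 'completed' || status == 'error') return status;
    // Wait before asking again; the next request starts only after this.
    await Future<void>.delayed(interval);
  }
  throw TimeoutException('Job did not finish after $maxAttempts attempts');
}

Future<void> main() async {
  // A fake job that walks through the states without touching the network.
  final states = ['queued', 'started', 'processing', 'completed'];
  var index = 0;

  final result = await pollUntilDone(
    () async => states[index++],
    interval: const Duration(milliseconds: 10),
  );
  print(result); // completed
}
```

Capping the number of attempts is a design choice in this sketch so that a job stuck in a non-terminal state eventually surfaces as an error instead of polling indefinitely.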

5. Get the recognition result

When the job state moves to completed, the speech recognition result is returned. The response after completion is JSON like the following.

{
  "status": "completed",
  "session_id": "018f7027a3270a305aca9ccc",
  "service_id": "serviceId",
  "audio_size": 306980,
  "audio_md5": "40f59fe5fc7745c33b33af44be43f6ad",
  "segments": [
    {
      "results": [
        {
          "tokens": [
            {"written": "アドバンスト・メディア", "confidence": 1, "starttime": 458, "endtime": 1578, "spoken": "あどばんすとめでぃあ"},
            {"written": "は", "confidence": 1, "starttime": 1578, "endtime": 1946, "spoken": "は"}, 
            {"written": "、", "confidence": 0.44, "starttime": 1946, "endtime": 1978, "spoken": "_"}, 
            {"written": "人", "confidence": 1, "starttime": 1978, "endtime": 2314, "spoken": "ひと"}, 
            {"written": "と", "confidence": 1, "starttime": 2314, "endtime": 2426, "spoken": "と"}, 
            {"written": "機械", "confidence": 1, "starttime": 2426, "endtime": 2826, "spoken": "きかい"}, 
            {"written": "と", "confidence": 1, "starttime": 2826, "endtime": 2938, "spoken": "と"}, 
            {"written": "の", "confidence": 0.96, "starttime": 2938, "endtime": 3082, "spoken": "の"}, 
            {"written": "自然", "confidence": 1, "starttime": 3082, "endtime": 3434, "spoken": "しぜん"}, 
            {"written": "な", "confidence": 1, "starttime": 3434, "endtime": 3530, "spoken": "な"}, 
            {"written": "コミュニケーション", "confidence": 1, "starttime": 3530, "endtime": 4362, "spoken": "こみゅにけーしょん"}, 
            {"written": "を", "confidence": 1, "starttime": 4362, "endtime": 4442, "spoken": "を"}, 
            {"written": "実現", "confidence": 1, "starttime": 4442, "endtime": 4922, "spoken": "じつげん"}, 
            {"written": "し", "confidence": 1, "starttime": 4922, "endtime": 5402, "spoken": "し"}, 
            {"written": "、", "confidence": 0.41, "starttime": 5402, "endtime": 5434, "spoken": "_"}, 
            {"written": "豊か", "confidence": 1, "starttime": 5562, "endtime": 5994, "spoken": "ゆたか"}, 
            {"written": "な", "confidence": 1, "starttime": 5994, "endtime": 6090, "spoken": "な"}, 
            {"written": "未来", "confidence": 1, "starttime": 6090, "endtime": 6490, "spoken": "みらい"}, 
            {"written": "を", "confidence": 1, "starttime": 6490, "endtime": 6570, "spoken": "を"}, 
            {"written": "創造", "confidence": 0.95, "starttime": 6570, "endtime": 7034, "spoken": "そうぞう"}, 
            {"written": "して", "confidence": 1, "starttime": 7034, "endtime": 7226, "spoken": "して"}, 
            {"written": "いく", "confidence": 1, "starttime": 7226, "endtime": 7418, "spoken": "いく"}, 
            {"written": "こと", "confidence": 0.96, "starttime": 7418, "endtime": 7674, "spoken": "こと"}, 
            {"written": "を", "confidence": 1, "starttime": 7674, "endtime": 7722, "spoken": "を"}, 
            {"written": "目指し", "confidence": 0.79, "starttime": 7722, "endtime": 8090, "spoken": "めざし"}, 
            {"written": "ます", "confidence": 0.79, "starttime": 8090, "endtime": 8538, "spoken": "ます"}, 
            {"written": "。", "confidence": 0.94, "starttime": 8538, "endtime": 8794, "spoken": "_"}
          ], 
          "confidence": 0.997, 
          "starttime": 250, 
          "endtime": 8794, 
          "tags": [], 
          "rulename": "", 
          "text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。"
        }
      ], 
      "text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。"
    }
  ], 
  "utteranceid": "20240513/13/018f70289db00a30619a39d0_20240513_131421", 
  "text": "アドバンスト・メディアは、人と機械との自然なコミュニケーションを実現し、豊かな未来を創造していくことを目指します。", 
  "code": "", 
  "message": ""
}
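
The nesting in this response (segments, then results, then tokens) can be walked directly on the decoded JSON. The following sketch collects tokens whose confidence falls below a threshold; `lowConfidenceTokens` and the 0.8 threshold are arbitrary choices for illustration, and the sample body is a cut-down version of the response above.

```dart
import 'dart:convert';

/// Collects the `written` form of every token whose confidence is below
/// `threshold`. Hypothetical helper that works on the decoded JSON directly.
List<String> lowConfidenceTokens(
  Map<String, dynamic> response, {
  double threshold = 0.8,
}) {
  final found = <String>[];
  for (final segment in ((response['segments'] as List?) ?? const [])) {
    for (final result in ((segment['results'] as List?) ?? const [])) {
      for (final token in ((result['tokens'] as List?) ?? const [])) {
        // confidence may decode as an int (1) or a double (0.44).
        final confidence = (token['confidence'] as num?)?.toDouble() ?? 0.0;
        if (confidence < threshold) found.add(token['written'] as String);
      }
    }
  }
  return found;
}

void main() {
  const body = '''
{"segments": [{"results": [{"tokens": [
  {"written": "は", "confidence": 1},
  {"written": "、", "confidence": 0.44},
  {"written": "目指し", "confidence": 0.79}
]}]}]}''';
  final response = json.decode(body) as Map<String, dynamic>;
  print(lowConfidenceTokens(response)); // [、, 目指し]
}
```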

To make the recognition result easier to use in the app, the JSON above is converted into an AsyncRecognition data class.

class AsyncRecognition {
  final String status; // State of the job
  final String audioMd5; // MD5 checksum of the received audio file
  final int? audioSize;
  final String contentId; // contentId value set by the user in the request
  final String serviceId; // User name
  final List<Segment> segments; // Results of the speech recognition process
  final String utteranceId;
  final String text;
  final String code;
  final String message;
  final String errorMessage;

  const AsyncRecognition({
    required this.status,
    required this.audioMd5,
    this.audioSize,
    required this.contentId,
    required this.serviceId,
    required this.segments,
    required this.utteranceId,
    required this.text,
    required this.code,
    required this.message,
    required this.errorMessage,
  });

  factory AsyncRecognition.fromJson(dynamic json) {
    List<Segment> toSegments(dynamic segments) {
      if (segments == null) return [];
      return (segments as List<dynamic>)
          .map((e) => Segment.fromJson(e))
          .toList();
    }

    return AsyncRecognition(
      status: json["status"] ?? "error",
      audioMd5: json["audio_md5"] ?? "",
      audioSize: json["audio_size"],
      contentId: json["content_id"] ?? "",
      serviceId: json["service_id"] ?? "",
      segments: toSegments(json["segments"]),
      utteranceId: json["utteranceid"] ?? "",
      text: json["text"] ?? "",
      code: json["code"] ?? "",
      message: json["message"] ?? "",
      errorMessage: json["error_message"] ?? "",
    );
  }

  factory AsyncRecognition.error(String errorMessage) {
    return AsyncRecognition(
      status: "",
      audioMd5: "",
      contentId: "",
      serviceId: "",
      segments: [],
      utteranceId: "",
      text: "",
      code: "",
      message: "",
      errorMessage: errorMessage,
    );
  }
}

class Segment {
  final List<SRResult> results; // Results of the speech recognition process
  final String text;
  const Segment({
    required this.results,
    required this.text,
  });

  factory Segment.fromJson(dynamic json) {
    List<SRResult> toSRResults(dynamic results) {
      if (results == null) return [];
      return (results as List<dynamic>)
          .map((e) => SRResult.fromJson(e))
          .toList();
    }

    return Segment(
      results: toSRResults(json["results"]),
      text: json["text"] ?? "",
    );
  }
}

class SRResult {
  // Speech Recognition Result
  final List<Token> tokens;
  final double? confidence;
  final int? startTime;
  final int? endTime;
  final List<String> tags;
  final String ruleName;
  final String text;

  const SRResult({
    required this.tokens,
    this.confidence,
    this.startTime,
    this.endTime,
    required this.tags,
    required this.ruleName,
    required this.text,
  });

  factory SRResult.fromJson(dynamic json) {
    List<Token> toTokens(dynamic tokens) {
      if (tokens == null) return [];
      return (tokens as List<dynamic>).map((e) => Token.fromJson(e)).toList();
    }

    List<String> toTags(dynamic tags) {
      if (tags == null) return [];
      return (tags as List<dynamic>).map((e) => e as String).toList();
    }

    double toConfidence(dynamic data) {
      if (data is int) {
        return data.toDouble();
      } else if (data is double) {
        return data;
      }
      return 0.0;
    }

    return SRResult(
      tokens: toTokens(json["tokens"]),
      confidence: toConfidence(json["confidence"]),
      startTime: json["starttime"],
      endTime: json["endtime"],
      tags: toTags(json["tags"]),
      ruleName: json["rulename"] ?? "",
      text: json["text"] ?? "",
    );
  }
}

class Token {
  final String written;
  final double? confidence;
  final int? startTime;
  final int? endTime;
  final String spoken;
  final String label;

  const Token({
    required this.written,
    this.confidence,
    this.startTime,
    this.endTime,
    required this.spoken,
    required this.label,
  });

  factory Token.fromJson(dynamic json) {
    double toConfidence(dynamic data) {
      if (data is int) {
        return data.toDouble();
      } else if (data is double) {
        return data;
      }
      return 0.0;
    }

    return Token(
      written: json["written"] ?? "",
      confidence: toConfidence(json["confidence"]),
      startTime: json["starttime"],
      endTime: json["endtime"],
      spoken: json["spoken"] ?? "",
      label: json["label"] ?? "",
    );
  }
}

When actually using the recognition result, you can work with, for example, the per-utterance results as shown below.

AsyncRecognition recognition = await fetch("sessionid");

Column(
  mainAxisSize: MainAxisSize.min,
  children: [
    ...recognition.segments.map((Segment segment) {
      final result = segment.results.first;
      return Padding(
        padding: const EdgeInsets.symmetric(horizontal: 10, vertical: 0),
        child: Column(
          children: [
            Text(result.text),
            if (recognition.segments.last != segment)
              const Padding(
                padding: EdgeInsets.only(top: 5, left: 25, bottom: 5, right: 15),
                child: Divider(color: Colors.black),
              ),
          ],
        ),
      );
    }),
  ],
);
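
The timing fields can be shown next to each utterance as well. The sketch below formats an offset for display, assuming that `starttime` and `endtime` are millisecond offsets from the start of the audio, as the sample values suggest; `formatOffset` is a hypothetical helper name.

```dart
/// Formats a millisecond offset such as `starttime` as mm:ss.mmm.
/// `formatOffset` is an illustrative helper name.
String formatOffset(int milliseconds) {
  final d = Duration(milliseconds: milliseconds);
  final minutes = d.inMinutes.toString().padLeft(2, '0');
  final seconds = (d.inSeconds % 60).toString().padLeft(2, '0');
  final millis = (d.inMilliseconds % 1000).toString().padLeft(3, '0');
  return '$minutes:$seconds.$millis';
}

void main() {
  // starttime and endtime of the utterance in the sample response.
  print('${formatOffset(250)} - ${formatOffset(8794)}');
  // 00:00.250 - 00:08.794
}
```

In the widget above, a string like this could be passed to a second `Text` beside `Text(result.text)`, using `result.startTime` and `result.endTime` when they are not null.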

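Conclusion

In this article I built an audio file transcription app with Flutter and the AmiVoice API. On the client side the work amounts to sending an audio file and waiting for the recognition result, so implementing speech recognition was relatively easy. Features beyond speech recognition, such as speaker diarization and sentiment analysis, also appear to be available, and I would like to try them at some point.
Thank you for reading this far. If you spot anything odd or improvable in my Flutter usage or coding style, please let me know.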
