Trying Azure AI Speech to Text with an Azure Virtual Machine and Streamlit

Last updated at 2024-11-04 / Posted at 2024-11-03

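Introduction

At work there is a lot of demand for offline transcription. Plenty of transcription services already exist, but I built my own because (1) we need to keep control over how the information is handled, and (2) I wanted to evaluate Azure AI Speech, which is said to be highly accurate.

I also have in mind using generative AI to format and otherwise clean up the transcribed text, as in the article below.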

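Azure AI Speech or Azure OpenAI Whisper?

The page below covers the price and performance comparisons, so I will not go into detail here. I chose AI Speech because Whisper's file size limit (25 MB) looked likely to become a bottleneck later on.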
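
To get a feel for how quickly a recording runs into that limit, here is a minimal sketch. The helper names are hypothetical, the limit is assumed to be 25 MiB (the article only says "25MB"), and the WAV estimate assumes uncompressed 16 kHz, mono, 16-bit PCM with a 44-byte header.

```python
WHISPER_MAX_BYTES = 25 * 1024 * 1024  # assumption: "25MB" interpreted as 25 MiB


def estimated_wav_bytes(duration_sec, sample_rate=16000, channels=1, sample_width=2):
    """Rough size of an uncompressed PCM WAV file (44-byte header assumed)."""
    return int(duration_sec * sample_rate * channels * sample_width) + 44


def exceeds_whisper_limit(size_bytes, limit_bytes=WHISPER_MAX_BYTES):
    """True when a file of this size would be over the assumed upload limit."""
    return size_bytes > limit_bytes


if __name__ == "__main__":
    for minutes in (1, 10, 60):
        size = estimated_wav_bytes(minutes * 60)
        print(minutes, "min:", size, "bytes, over limit:", exceeds_whisper_limit(size))
```

Under these assumptions a one-hour WAV recording is well over the limit, so long meetings would need to be split or compressed first.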

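Environment setup

1. Create an Azure virtual machine

Create a virtual machine from the Azure Portal. To keep costs down, I picked a minimal size at roughly 6,000 yen per month. For the OS I chose Windows Server 2022.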

2.1 Install Python

Install 3.13 from the page below.

Note: add the Python root folder and its \Scripts folder to PATH.

2.2 Install packages

Install the following packages with pip.

pip install azure-cognitiveservices-speech
pip install streamlit
pip install pydub
pip install audioop-lts
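
After installing, a quick check that the modules can be found may save a failed app start. This is a minimal sketch with a hypothetical helper name. The module names are the import names I would expect for the packages above (`audioop` for audioop-lts, which is presumably needed because Python 3.13 removed the standard-library `audioop` module that pydub imports).

```python
import importlib.util


def missing_modules(module_names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    required = ["azure.cognitiveservices.speech", "streamlit", "pydub", "audioop"]
    print("Missing modules:", missing_modules(required))
```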

2.3 Install the Microsoft Visual C++ Redistributable for Visual Studio 2015, 2017, 2019, and 2022

Without this, an error occurs when the DLLs used by azure-cognitiveservices-speech are loaded. Install it from the page below.

2.4 Install FFmpeg

This step is needed when MP3 is used as the input. Without it, I got the following error.

FileNotFoundError: [WinError 2] The system cannot find the file specified

2.5 Install GStreamer

This step is also needed when processing MP3.
Without it, I got the following error.

SPXERR_GSTREAMER_NOT_FOUND_ERROR

Install it from the page below.
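
Both errors above come down to an external tool not being found, so checking PATH before starting the app can help. The sketch below is only a rough check under assumptions: the helper name is hypothetical, and looking for the `gst-launch-1.0` executable is just a proxy for "the GStreamer bin folder is on PATH", not something the Speech SDK itself is confirmed to require.

```python
import shutil


def missing_tools(tool_names):
    """Return the executables that cannot be found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


if __name__ == "__main__":
    # ffmpeg/ffprobe are used by pydub; gst-launch-1.0 ships with GStreamer.
    print("Not on PATH:", missing_tools(["ffmpeg", "ffprobe", "gst-launch-1.0"]))
```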

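2.6 Allow access to the web app

Add an inbound rule for port 80 to the Azure network security group.
Create an inbound rule for port 80 in Windows Defender Firewall on the virtual machine.

Note: to start Streamlit on port 80, run "streamlit run app.py --server.port 80".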
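
To confirm from a client machine that both rules took effect, a plain TCP connection test is usually enough. This is a minimal sketch; the function name is hypothetical and the host below is a placeholder to be replaced with the VM's public IP address or DNS name.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder host: replace with the address of your virtual machine.
    print(port_is_open("127.0.0.1", 80))
```

A `False` result does not say which layer blocked the connection, so the network security group rule and the firewall rule still need to be checked one at a time.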

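Building the web app

I based the code on the following Streamlit + Azure AI sample.

The following was also very helpful.

For speaker diarization, I referred to the following.

In the Streamlit UI you upload a file, choose the language and whether to use speaker diarization, and press Start; the app then transcribes the audio.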
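
The way the UI choices map onto the functions in app.py can be summarized as a small lookup table. This is a sketch for orientation only, not code from the app: `to_locale` and `pick_pipeline` are hypothetical helpers, while the four function names and the option labels are the ones used in app.py.

```python
# Language labels shown in the UI mapped to Speech service locale codes.
LANGUAGE_CODES = {"Japanese": "ja-JP", "English": "en-US"}


def to_locale(label, default="en-US"):
    """Translate a UI language label into a locale code."""
    return LANGUAGE_CODES.get(label, default)


def pick_pipeline(file_type, diarize):
    """Name of the app.py function that handles this combination of options."""
    table = {
        ("Wave", True): "recognize_from_file",
        ("Wave", False): "recognize_audio",
        ("Mp3", True): "compressed_stream_helper_transcribe",
        ("Mp3", False): "compressed_stream_helper",
    }
    return table[(file_type, diarize)]


if __name__ == "__main__":
    # In app.py the diarization option is the string "True" or "False".
    wb = "True"
    print(to_locale("Japanese"), pick_pipeline("Wave", wb == "True"))
```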

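The code is very messy and is just something that works for now. Please bear with it.

You can also choose whether the audio is converted to WAV or to MP3 behind the scenes before it is passed to AI Speech. The transcription result is the same either way; I only built this along the way and it serves little purpose. Also, the cleanup of intermediate files is incomplete.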
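
One way the intermediate-file cleanup could be made more reliable is a context manager that always deletes the file, even when transcription raises. This is a minimal sketch and not part of the app; `temporary_audio_path` is a hypothetical name.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_audio_path(suffix):
    """Yield a unique temporary file path and delete the file on exit."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # Only the path is needed; the caller writes the file itself.
    try:
        yield path
    finally:
        # The file may already be gone; ignore that case.
        with contextlib.suppress(FileNotFoundError):
            os.remove(path)


if __name__ == "__main__":
    with temporary_audio_path(".wav") as path:
        with open(path, "wb") as f:
            f.write(b"placeholder")  # stand-in for the converted audio
        print("exists inside the block:", os.path.exists(path))
    print("exists after the block:", os.path.exists(path))
```

In app.py, the conversion and the recognition call would sit inside the `with` block. On Windows the delete can still fail if something keeps the file open, so the recognizer objects would need to be released first.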

app.py

import time
import os
import datetime
import azure.cognitiveservices.speech as speechsdk
import streamlit as st
from pydub import AudioSegment

# Set up Azure Speech Service credentials
speech_key = "REPLACETOYOURAZUREAPIKEY"
service_region = "REPLACETOYOURAZUREREGION"
done = False
overWrite = ""

#####################################################################################
################### Class definition for MP3 (speaker diarization) ##################
#####################################################################################

class BinaryFileReaderCallback(speechsdk.audio.PullAudioInputStreamCallback):
    def __init__(self, filename: str):
        super().__init__()
        self._file_h = open(filename, "rb")

    def read(self, buffer: memoryview) -> int:
        print('trying to read {} frames'.format(buffer.nbytes))
        try:
            size = buffer.nbytes
            frames = self._file_h.read(size)

            buffer[:len(frames)] = frames
            print('read {} frames'.format(len(frames)))

            return len(frames)
        except Exception as ex:
            print('Exception in `read`: {}'.format(ex))
            raise

    def close(self) -> None:
        print('closing file')
        try:
            self._file_h.close()
        except Exception as ex:
            print('Exception in `close`: {}'.format(ex))
            raise

#####################################################################################
###################### MP3 transcription (speaker diarization) ######################
#####################################################################################
def compressed_stream_helper_transcribe(output, speech_key, service_region, compressed_format,
        mp3_file_path
        ):
    callback = BinaryFileReaderCallback(mp3_file_path)
    stream = speechsdk.audio.PullAudioInputStream(stream_format=compressed_format, pull_stream_callback=callback)

    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
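    # NOTE: `lng` is not a parameter here; it is the module-level value set by the language selectbox in the UI section.
    if lng == "Japanese":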
        speech_config.speech_recognition_language = "ja-JP"
    else:
        speech_config.speech_recognition_language="en-US"
 
    speech_config.set_property(property_id=speechsdk.PropertyId.SpeechServiceResponse_DiarizeIntermediateResults, value='true')

    audio_config = speechsdk.audio.AudioConfig(stream = stream)
    conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)

    transcribing_stop = False
    Ofst = 0.0
    block=""
    def conversation_transcriber_recognition_canceled_cb(evt: speechsdk.SessionEventArgs):
        print('Canceled event')

    def conversation_transcriber_session_stopped_cb(evt: speechsdk.SessionEventArgs):
        print('SessionStopped event')

    def conversation_transcriber_transcribed_cb(evt: speechsdk.SpeechRecognitionEventArgs):
        print('\nTRANSCRIBED:')
        if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
            print('\tText={}'.format(evt.result.text))
            print('\tSpeaker ID={}\n'.format(evt.result.speaker_id))
        elif evt.result.reason == speechsdk.ResultReason.NoMatch:
            print('\tNOMATCH: Speech could not be TRANSCRIBED: {}'.format(evt.result.no_match_details))
        nonlocal block
        block += '\r\nSpeaker ID={}'.format(evt.result.speaker_id)+ ' [Offset={}sec]'.format(evt.result.offset/10000000)+'\r\n'
        block += evt.result.text+'\r\n'
        nonlocal output
        output+='\r\nSpeaker ID={}'.format(evt.result.speaker_id)+ ' [Offset={}sec]'.format(evt.result.offset/10000000)+'\r\n'
        output+= evt.result.text+'\r\n'

    def conversation_transcriber_transcribing_cb(evt: speechsdk.SpeechRecognitionEventArgs):
        print('TRANSCRIBING:')
        print('\tText={}'.format(evt.result.text))
        print('\tSpeaker ID={}'.format(evt.result.speaker_id))
        nonlocal Ofst
        Ofst = evt.result.offset/10000000

    def conversation_transcriber_session_started_cb(evt: speechsdk.SessionEventArgs):
        print('SessionStarted event')
    def stop_cb(evt: speechsdk.SessionEventArgs):
        #"""callback that signals to stop continuous recognition upon receiving an event `evt`"""
        print('CLOSING on {}'.format(evt))
        nonlocal transcribing_stop
        transcribing_stop = True

    # Connect callbacks to the events fired by the conversation transcriber
    conversation_transcriber.transcribed.connect(conversation_transcriber_transcribed_cb)
    conversation_transcriber.transcribing.connect(conversation_transcriber_transcribing_cb)
    conversation_transcriber.session_started.connect(conversation_transcriber_session_started_cb)
    conversation_transcriber.session_stopped.connect(conversation_transcriber_session_stopped_cb)
    conversation_transcriber.canceled.connect(conversation_transcriber_recognition_canceled_cb)
    # stop transcribing on either session stopped or canceled events
    conversation_transcriber.session_stopped.connect(stop_cb)
    conversation_transcriber.canceled.connect(stop_cb)

    conversation_transcriber.start_transcribing_async()

    # Wait for completion, polling the flags set by the SDK callbacks.
    overWrite = st.empty()
    while not transcribing_stop:
        time.sleep(.5)
        with overWrite.container():
            st.write(Ofst, "sec processed")
        if block != "":
            st.write(block)
            block=""

    conversation_transcriber.stop_transcribing_async()
    # Drop the references so the SDK objects and the stream they hold can be released.
    audio_config = None
    conversation_transcriber = None
    return output


#####################################################################################
###############################  MP3 transcription  #################################
#####################################################################################
def compressed_stream_helper(output, speech_key, service_region, compressed_format,
        mp3_file_path
        ):
    callback = BinaryFileReaderCallback(mp3_file_path)
    stream = speechsdk.audio.PullAudioInputStream(stream_format=compressed_format, pull_stream_callback=callback)

    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
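    # NOTE: `lng` is not a parameter here; it is the module-level value set by the language selectbox in the UI section.
    if lng == "Japanese":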
        speech_config.speech_recognition_language = "ja-JP"
    else:
        speech_config.speech_recognition_language="en-US"
    audio_config = speechsdk.audio.AudioConfig(stream=stream)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    done = False
    Ofst = 0.0
    block=""
    def recognized(evt):
        nonlocal block
        block += '\r\n[Offset={}sec]'.format(evt.result.offset/10000000)+'\r\n'
        block += evt.result.text+'\r\n'
        nonlocal output
        output+= '\r\n[Offset={}sec]'.format(evt.result.offset/10000000)+'\r\n'
        output+= evt.result.text+'\r\n'
        
    def recognizing(evt):
        #print('RECOGNIZING on {}'.format(evt))
        nonlocal Ofst
        Ofst = evt.result.offset/10000000
    def stop_cb(evt):
        """callback that signals to stop continuous recognition upon receiving an event `evt`"""
        print('STOPPED on {}'.format(evt))
        nonlocal done
        done = True

        

    # Connect callbacks to the events fired by the speech recognizer
    speech_recognizer.recognizing.connect(recognizing)
    speech_recognizer.recognized.connect(recognized)
    speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    #speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    #speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))

    # Stop continuous recognition on either session stopped or canceled events.
    # Connect these before starting so that an early cancel is not missed.
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)
    # Start continuous speech recognition
    speech_recognizer.start_continuous_recognition()
    overWrite = st.empty()
    while not done:
        time.sleep(.5)

        with overWrite.container():
            st.write(Ofst, "sec processed")
        if block != "":
            st.write(block)
            block=""
    speech_recognizer.stop_continuous_recognition()
    return output
    
#####################################################################################
##################  Transcription from the microphone (not used)  ###################
#####################################################################################
def speech_recognize_once_from_mic():
    # Set up the speech config and audio config
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    audio_config = speechsdk.AudioConfig(use_default_microphone=True)

    # Create a speech recognizer with the given settings
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    st.write("Speak into your microphone.")
    result = speech_recognizer.recognize_once_async().get()

    # Check the result
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return f"Recognized: {result.text}"
    elif result.reason == speechsdk.ResultReason.NoMatch:
        return "No speech could be recognized"
    elif result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        return f"Speech Recognition canceled: {cancellation_details.reason}"
    else:
        return "Unknown error"

#####################################################################################
################################# WAV transcription #################################
#####################################################################################
def recognize_audio(output, speech_key, service_region, filename, lng, recognize_time=100):
    # Speech to Text configuration
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    if lng == "Japanese":
        speech_config.speech_recognition_language = "ja-JP"
    else:
        speech_config.speech_recognition_language = "en-US"
    # Input configuration
    audio_input = speechsdk.AudioConfig(filename=filename)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)

    done = False
    Ofst = 0.0
    block = ""
    def recognizing(evt):
        #print('RECOGNIZING on {}'.format(evt))
        nonlocal Ofst
        Ofst = evt.result.offset/10000000
    def recognized(evt):
        nonlocal block
        block += '\r\n[Offset={}sec]'.format(evt.result.offset/10000000)+'\r\n'
        block += evt.result.text+'\r\n'
        nonlocal output
        output+= '\r\n[Offset={}sec]'.format(evt.result.offset/10000000)+'\r\n'
        output+= evt.result.text+'\r\n'
        
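    def start(evt):
        # SDK callbacks run on a background thread, where Streamlit calls such as
        # st.write have no script context, so log to the console instead.
        print('SESSION STARTED: {}'.format(evt))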
    def stop_cb(evt):
        """callback that stops continuous recognition upon receiving an event `evt`"""
        print('CLOSING on {}'.format(evt))
        speech_recognizer.stop_continuous_recognition()
        nonlocal done
        done = True
        
    # Run speech recognition
    speech_recognizer.recognizing.connect(recognizing)
    speech_recognizer.recognized.connect(recognized)
    speech_recognizer.session_started.connect(start)
    # Stop continuous recognition on either session stopped or canceled events.
    # Connect these before starting so that an early cancel is not missed.
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)
    speech_recognizer.start_continuous_recognition()

    overWrite = st.empty()
    while not done:
        time.sleep(.5)

        with overWrite.container():
            st.write(Ofst, "sec processed")
        if block != "":
            st.write(block)
            block=""
        
    audio_input = None
    speech_recognizer = None
    return output

#####################################################################################
###################### WAV transcription (speaker diarization) ######################
#####################################################################################
def recognize_from_file(output, speech_key, service_region, filename, lng, recognize_time=100):
    # The key and region are passed in as arguments (they are set near the top of this file).
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    if lng == "Japanese":
        speech_config.speech_recognition_language = "ja-JP"
    else:
        speech_config.speech_recognition_language="en-US"
    speech_config.set_property(property_id=speechsdk.PropertyId.SpeechServiceResponse_DiarizeIntermediateResults, value='true')

    audio_config = speechsdk.audio.AudioConfig(filename=filename)
    conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)

    transcribing_stop = False
    Ofst=0.0
    block =""
    def conversation_transcriber_recognition_canceled_cb(evt: speechsdk.SessionEventArgs):
        print('Canceled event')

    def conversation_transcriber_session_stopped_cb(evt: speechsdk.SessionEventArgs):
        print('SessionStopped event')

    def conversation_transcriber_transcribed_cb(evt: speechsdk.SpeechRecognitionEventArgs):
        print('\nTRANSCRIBED:')
        if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
            print('\tText={}'.format(evt.result.text))
            print('\tSpeaker ID={}\n'.format(evt.result.speaker_id))
        elif evt.result.reason == speechsdk.ResultReason.NoMatch:
            print('\tNOMATCH: Speech could not be TRANSCRIBED: {}'.format(evt.result.no_match_details))
        nonlocal block
        block += '\r\nSpeaker ID={}'.format(evt.result.speaker_id)+ ' [Offset={}sec]'.format(evt.result.offset/10000000)+'\r\n'
        block += evt.result.text+'\r\n'
        nonlocal output
        output+= '\r\nSpeaker ID={}'.format(evt.result.speaker_id)+ ' [Offset={}sec]'.format(evt.result.offset/10000000)+'\r\n'
        output+= evt.result.text+'\r\n'
        
    def conversation_transcriber_transcribing_cb(evt: speechsdk.SpeechRecognitionEventArgs):
        print('TRANSCRIBING:')
        print('\tText={}'.format(evt.result.text))
        print('\tSpeaker ID={}'.format(evt.result.speaker_id))
        nonlocal Ofst
        Ofst = evt.result.offset/10000000

    def conversation_transcriber_session_started_cb(evt: speechsdk.SessionEventArgs):
        print('SessionStarted event')
    def stop_cb(evt: speechsdk.SessionEventArgs):
        #"""callback that signals to stop continuous recognition upon receiving an event `evt`"""
        print('CLOSING on {}'.format(evt))
        nonlocal transcribing_stop
        transcribing_stop = True

    # Connect callbacks to the events fired by the conversation transcriber
    conversation_transcriber.transcribed.connect(conversation_transcriber_transcribed_cb)
    conversation_transcriber.transcribing.connect(conversation_transcriber_transcribing_cb)
    conversation_transcriber.session_started.connect(conversation_transcriber_session_started_cb)
    conversation_transcriber.session_stopped.connect(conversation_transcriber_session_stopped_cb)
    conversation_transcriber.canceled.connect(conversation_transcriber_recognition_canceled_cb)
    # stop transcribing on either session stopped or canceled events
    conversation_transcriber.session_stopped.connect(stop_cb)
    conversation_transcriber.canceled.connect(stop_cb)

    conversation_transcriber.start_transcribing_async()

    # Wait for completion, polling the flags set by the SDK callbacks.
    overWrite = st.empty()
    while not transcribing_stop:
        time.sleep(.5)

        with overWrite.container():
            st.write(Ofst, "sec processed")
        if block != "":
            st.write(block)
            block=""
    conversation_transcriber.stop_transcribing_async()
    # Drop the references so the SDK can release the input file before it is deleted.
    audio_config = None
    conversation_transcriber = None
    return output
#####################################################################################
#################################### UI screen ######################################
#####################################################################################

st.title("Azure Speech + Streamlit")

byte_file = st.file_uploader('Upload a file here')

lng = st.selectbox(
    'Language',
    ["Japanese","English"]
)
wb = st.selectbox(
    'Speaker diarization',
    ["True","False"]
)
ft = st.selectbox(
    'File type for processing (developer option; Wave is faster but produces larger files)',
    ["Wave","Mp3"]
)

st.divider()
if st.button('Start speech recognition'):
    if byte_file is None:
        st.error("Please upload a file")
    else:
        ## For microphone input (commented out)
        # recognition_result = speech_recognize_once_from_mic()
        # st.write(recognition_result)
        start_time = time.time()
        output_path=""
        if ft == "Wave":
            output_path = "./"+str(datetime.datetime.now().strftime('%Y%m%d%H%M%S%f'))+".wav"
        else:
            output_path = "./"+str(datetime.datetime.now().strftime('%Y%m%d%H%M%S%f'))+".mp3"
        audio = AudioSegment.from_file(byte_file)

        # Show file information
        st.subheader("File details")
        file_content = {"File name": byte_file.name, "File type": byte_file.type, "Duration (sec)": audio.duration_seconds, "Size (bytes)": byte_file.size}
        st.write(file_content)

        st.write("Format conversion started"+"["+'{:.3f}'.format(time.time()-start_time)+"]")
        convertedaudio=""
        if ft == "Wave":
            convertedaudio = audio.export(output_path, format='wav')
        else:
            convertedaudio = audio.export(output_path, format='mp3')

        st.write("Format conversion finished; transcription started"+"["+'{:.3f}'.format(time.time()-start_time)+"]")
        st.write("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
        filename = output_path
        output = ""
        if ft == "Wave":
            if wb == "True":
                output = recognize_from_file(output, speech_key, service_region, filename, lng, recognize_time=0)
            else:
                output = recognize_audio(output, speech_key, service_region, filename, lng, recognize_time=0)
        else:
            if wb == "True":
                # Create a compressed format
                compressed_format = speechsdk.audio.AudioStreamFormat(compressed_stream_format=speechsdk.AudioStreamContainerFormat.MP3)
                output = compressed_stream_helper_transcribe(output, speech_key, service_region, compressed_format, output_path)   
            else:
                # Create a compressed format
                compressed_format = speechsdk.audio.AudioStreamFormat(compressed_stream_format=speechsdk.AudioStreamContainerFormat.MP3)
                output = compressed_stream_helper(output, speech_key, service_region, compressed_format, output_path)    
        

        st.write("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

        st.write("Transcription finished"+"["+'{:.3f}'.format(time.time()-start_time)+"]")
        st.success("Completed")
        ##st.write(output)
                
        convertedaudio.close()
        os.remove(output_path)
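
The callbacks above divide `evt.result.offset` by 10000000 because the Speech SDK reports offsets in ticks of 100 nanoseconds. For long recordings an hh:mm:ss label may be easier to read than raw seconds. This is a small sketch with a hypothetical helper name, not part of app.py.

```python
TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def format_offset(ticks):
    """Convert an offset in 100-ns ticks to an hh:mm:ss string (fractions dropped)."""
    total_seconds = ticks // TICKS_PER_SECOND
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    print(format_offset(36_650_000_000))
```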

App screen

This is what the app looks like.

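Postscript

When I accessed the web app from my company's network, the page stayed stuck in what looked like a loading state. On investigation, the WebSocket communication that Streamlit uses was failing. It seems WebSocket does not work in an environment that goes through the company proxy.
I found information saying that with HTTPS the proxy cannot see the contents and therefore does not interfere, so I quickly created an Azure Application Gateway to support HTTPS. After that I was able to connect without problems.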
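
When a page hangs like this, it can help to confirm that the server itself accepts a WebSocket upgrade, which narrows the problem down to something in between. The sketch below sends a bare HTTP upgrade request and returns the status code. It rests on several assumptions: the function name is hypothetical, the path `/_stcore/stream` is what I believe recent Streamlit versions use for their WebSocket endpoint, the host is a placeholder, and the connection is made directly without TLS, so it does not reproduce the route through a proxy.

```python
import base64
import os
import socket


def websocket_upgrade_status(host, port, path="/_stcore/stream", timeout=5.0):
    """Send a WebSocket upgrade request and return the HTTP status code, or None."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        "\r\n"
    )
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request.encode("ascii"))
        reply = sock.recv(1024).decode("latin-1")
    parts = reply.split("\r\n", 1)[0].split()
    # A well-formed status line looks like "HTTP/1.1 101 Switching Protocols".
    if len(parts) < 2 or not parts[1].isdigit():
        return None
    return int(parts[1])


if __name__ == "__main__":
    # Placeholder host: replace with the address of your virtual machine.
    print(websocket_upgrade_status("127.0.0.1", 80))
```

A status of 101 means the server agreed to switch protocols. If that works from outside the company network but the browser still hangs inside it, the proxy is the likely cause.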
