
Trying Out Google Speech-to-Text with Django

Last updated at 2021-07-24, posted at 2021-01-12

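On AWS you can transcribe audio from the web console, but on GCP (at the time of writing) transcription could only be done through the API. So, partly as a way of learning Django, I built a simple web front end for it. I chose Google because its speech recognition accuracy is quite high. The flow is: upload the audio file to Google Cloud Storage, then transcribe it from there. Google Cloud Storage is used because sending a local file directly comes with conditions, such as the file having to be under 10 MB.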
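The size condition mentioned above can be expressed as a small check. This is only a sketch: the helper name `needs_cloud_storage` and the constant `LOCAL_LIMIT_BYTES` are hypothetical, and the threshold is simply the "under 10 MB" figure quoted in this article rather than a value read from the API.

```python
import os
import tempfile

# Assumed threshold, taken from the "under 10 MB" condition quoted above.
LOCAL_LIMIT_BYTES = 10 * 1024 * 1024


def needs_cloud_storage(path, limit=LOCAL_LIMIT_BYTES):
    """Return True when the file is too large to send as local content."""
    return os.path.getsize(path) >= limit


if __name__ == "__main__":
    # Create a 1 KiB dummy file and check it against the limit.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(b"\0" * 1024)
    print(needs_cloud_storage(tmp.name))
    os.remove(tmp.name)
```

This article always goes through the bucket, so a check like this would only matter if a direct-upload path were added later.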

Finished result

Development environment

MacBook
Python(3.7.7)
Django(3.1.5)
google-cloud-storage(1.35.0)
google-cloud-speech(2.0.1)
pydub(0.24.1)

Getting the JSON key for Google authentication

When creating the service account, grant it the administrator role for "Google Storage".
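Before wiring the key into Django, it can help to confirm that the downloaded file looks like a service account key. The sketch below is an illustration under assumptions: `check_service_account_key` and `use_key` are hypothetical helper names, and the check only looks at a few fields that service account key files normally contain.

```python
import json
import os

# Fields normally present in a downloaded service account key file.
REQUIRED_KEYS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path):
    """Return a list of problems found in the key file (empty if none)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    problems = ["missing field: " + k for k in REQUIRED_KEYS if k not in data]
    if "type" in data and data["type"] != "service_account":
        problems.append("type is not service_account")
    return problems


def use_key(path):
    """Point the Google client libraries at the key via the environment."""
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path
```

Setting `GOOGLE_APPLICATION_CREDENTIALS` is the same mechanism views.py uses later in this article.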

Enabling the Speech-to-Text API

Enable the API from the API Library in GCP.

Environment setup

# Django
pip3 install django==3.1.5

# google-cloud-storage
pip3 install google-cloud-storage==1.35.0

# google-cloud-speech
pip3 install google-cloud-speech==2.0.1

# pydub
pip3 install pydub==0.24.1
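After installing, the pinned versions can be compared with what is present in the environment. This is a minimal sketch, assuming Python 3.8 or later for `importlib.metadata` (the article itself uses Python 3.7.7, where this module is not in the standard library); `installed_version` and `report` are hypothetical helper names.

```python
from importlib import metadata  # standard library from Python 3.8

# Versions pinned in this article.
PINNED = {
    "django": "3.1.5",
    "google-cloud-storage": "1.35.0",
    "google-cloud-speech": "2.0.1",
    "pydub": "0.24.1",
}


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(pinned=PINNED):
    """Map each package name to (expected version, installed version)."""
    return {name: (want, installed_version(name)) for name, want in pinned.items()}


if __name__ == "__main__":
    for name, (want, have) in report().items():
        print(f"{name}: expected {want}, installed {have}")
```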

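Django setup

Creating the project

This creates a folder named project.

# Project name (project)
django-admin startproject project

Creating the application

Move into the project folder and create the application.
Note: this article creates an application named "mozi".

# Create the application
python3 manage.py startapp mozi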

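Basic Django (web server) settings

Edit the files in the project folder that sits inside the project folder (project/project).

settings.py

# Allow access from any host
ALLOWED_HOSTS = ['*']

# So that the application's HTML files can be used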
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'mozi',  # add the application (templates inside mozi are then searched)
]
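`ALLOWED_HOSTS = ['*']` is convenient for a local experiment, but it accepts any Host header. One possible alternative, shown here only as a sketch, is to read the host list from an environment variable. The variable name `DJANGO_ALLOWED_HOSTS` and the helper `parse_allowed_hosts` are hypothetical and are not part of Django or of the original setup.

```python
import os


def parse_allowed_hosts(raw, default=("localhost", "127.0.0.1")):
    """Split a comma-separated host list, falling back to local-only hosts."""
    hosts = [h.strip() for h in (raw or "").split(",") if h.strip()]
    return hosts or list(default)


# In settings.py this could replace the wildcard.
# DJANGO_ALLOWED_HOSTS is an assumed variable name.
ALLOWED_HOSTS = parse_allowed_hosts(os.environ.get("DJANGO_ALLOWED_HOSTS"))
```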

urls.py

from django.contrib import admin
from django.urls import path,include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('mozi/', include('mozi.urls')),  # lets the mozi app define its own urls.py
]

Basic settings for the application (mozi)

Edit the files in the mozi folder inside the project folder.
Create a new urls.py so that the application can define its own routing.

urls.py

from django.urls import path
from . import views

urlpatterns = [
    path('', views.index, name='index'),
]

Building the main feature

Add the upload destination to settings.py in the project folder.
BASE_DIR is the directory that contains manage.py, so the upload folder is created there.

settings.py

# FILE_UPLOAD
import os
MEDIA_ROOT = os.path.join(BASE_DIR, 'upload')
MEDIA_URL = '/upload/'

Append the following boilerplate to urls.py in the project folder. It serves the uploaded files while DEBUG is on.

urls.py

from django.conf.urls.static import static
from django.conf import settings

if settings.DEBUG:
    urlpatterns += static(settings.MEDIA_URL, document_root=settings.MEDIA_ROOT)

Add the information the form needs to models.py under the mozi folder.
The "media" given here is created under the upload folder, and the files are saved inside it.

models.py

from django.db import models

class Upload(models.Model):
    document = models.FileField(upload_to='media')
    uploaded_at = models.DateTimeField(auto_now_add=True)
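Putting MEDIA_ROOT and upload_to together, an uploaded file ends up under BASE_DIR/upload/media/. The sketch below only illustrates that path composition; `upload_destination` is a hypothetical helper, not a Django function, and Django may adjust the stored file name (for example, when a file with the same name already exists).

```python
import os


def upload_destination(base_dir, filename, media_dir="upload", upload_to="media"):
    """Compose BASE_DIR/<MEDIA_ROOT folder>/<upload_to>/<filename>."""
    media_root = os.path.join(base_dir, media_dir)  # mirrors MEDIA_ROOT
    return os.path.join(media_root, upload_to, filename)


if __name__ == "__main__":
    # "/srv/project" is a made-up BASE_DIR used only for illustration.
    print(upload_destination("/srv/project", "sample.wav"))
```

This directory is what the `source` variable in views.py needs to point at.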

Create a new forms.py under the mozi folder so that files can be uploaded from a form.

forms.py

from django import forms
from .models import Upload
 
class UploadForm(forms.ModelForm):
    class Meta:
        model = Upload
        fields = ('document',)

Create the HTML for the web page.
Note: mozi/templates/mozi/index.html --> create everything from templates down
Note: anything wrapped in {{ }} is treated as a variable

index.html

<!DOCTYPE html>
<html lang="ja-JP">
<head>
    <meta charset="UTF-8">
    <title>文字起こし君</title>
</head>
<body>
 
    <h1>Google Speech To Text</h1>
 
    <form method="post" enctype="multipart/form-data">
        {% csrf_token %}
        {{ form.as_p }}
        <button type="submit">開始</button>
    </form>

   <h2>文字起こしの結果</h2>
   <p>{{ transcribe_result }}</p>
 
</body>
</html>

Set up views.py, which is the core of the page rendering.

views.py

from django.http import HttpResponse
from django.shortcuts import render,redirect
from .forms import UploadForm
from .models import Upload

def index(request):
    import os
    import subprocess

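    # Path where uploaded files are saved
    # (must end with "/" because the file name is appended to it directly)
    source = "<path where files are uploaded>/"

    # GCS URL
    GCS_BASE = "gs://<bucket name>/"

    # Holds the transcription result
    speech_result = ""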

    if request.method == 'POST':
        # Set up Google Cloud Storage
        from google.cloud import storage
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='<path to the JSON key>'
        client = storage.Client()
        bucket = client.get_bucket('<Google Storage bucket name>')

        # Save the uploaded file
        form = UploadForm(request.POST,request.FILES)
        form.save()

        # Get the name of the uploaded file
        # Split it into name and extension (ext is the extension, e.g. ".py")
        transcribe_file = request.FILES['document'].name
        name, ext = os.path.splitext(transcribe_file)

        if ext==".wav": 
            # Upload to Google Cloud Storage
            blob = bucket.blob( transcribe_file )
            blob.upload_from_filename(filename= source + transcribe_file )

            # Get the playback duration (used as the timeout below)
            from pydub import AudioSegment
            sound = AudioSegment.from_file( source + transcribe_file )
            length = sound.duration_seconds
            length += 1

            # Delete the working file
            # (argument list, so the uploaded file name is never parsed by a shell)
            cmd = ['rm', '-f', source + transcribe_file]
            subprocess.call(cmd)

            # Transcribe
            from google.cloud import speech

            client = speech.SpeechClient()

            gcs_uri = GCS_BASE + transcribe_file

            audio = speech.RecognitionAudio(uri=gcs_uri)
            config = speech.RecognitionConfig(
                encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
                #sample_rate_hertz=16000,
                language_code="ja-JP",
                enable_automatic_punctuation=True,
            )

            operation = client.long_running_recognize(config=config, audio=audio)

            # Wait for the result, using the audio duration as the timeout
            response = operation.result(timeout=round(length))

            for result in response.results:
                speech_result += result.alternatives[0].transcript

            # Delete the file from Google Cloud Storage
            blob.delete()

        else:
            # Convert the file to WAV (16 kHz, mono)
            f_input = source + transcribe_file
            f_output = source + name + ".wav"
            upload_file_name = name + ".wav"
            # Argument list, so the uploaded file name is never parsed by a shell
            cmd = ['ffmpeg', '-i', f_input, '-ar', '16000', '-ac', '1', f_output]
            subprocess.call(cmd)

            # Upload to Google Cloud Storage
            blob = bucket.blob( upload_file_name )
            blob.upload_from_filename(filename= f_output )

            # Get the playback duration (used as the timeout below)
            from pydub import AudioSegment
            sound = AudioSegment.from_file( source + transcribe_file )
            length = sound.duration_seconds
            length += 1

            # Delete the working files
            cmd = ['rm', '-f', f_input, f_output]
            subprocess.call(cmd)
            
            # Transcribe
            from google.cloud import speech

            client = speech.SpeechClient()

            gcs_uri = GCS_BASE + upload_file_name

            audio = speech.RecognitionAudio(uri=gcs_uri)
            config = speech.RecognitionConfig(
                encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
                #sample_rate_hertz=16000,
                language_code="ja-JP",
            )

            operation = client.long_running_recognize(config=config, audio=audio)

            # Wait for the result, using the audio duration as the timeout
            response = operation.result(timeout=round(length))

            for result in response.results:
                speech_result += result.alternatives[0].transcript

            # Delete the file from Google Cloud Storage
            blob.delete()
    else:
        form = UploadForm()
    return render(request, 'mozi/index.html', {
        'form': form,
        'transcribe_result':speech_result
    })
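The two branches of the view differ mainly in whether the file is converted first, so the shared steps could be pulled out into small helpers. The following is a sketch under assumptions, not the code used in this article: all four function names are hypothetical, the extension check is made case-insensitive (the view compares against ".wav" exactly), and the duration is read with the standard library `wave` module in place of pydub, which only works for WAV input.

```python
import os
import wave
from types import SimpleNamespace


def plan_upload(filename):
    """Return (needs_conversion, name_to_upload) for an uploaded file."""
    name, ext = os.path.splitext(filename)
    if ext.lower() == ".wav":
        return False, filename
    return True, name + ".wav"


def ffmpeg_command(src, dst):
    """Build the 16 kHz mono conversion command as an argument list."""
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", dst]


def wav_duration_seconds(path):
    """Duration of a WAV file: number of frames divided by frame rate."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def join_transcripts(results):
    """Concatenate the top alternative of each recognition result."""
    return "".join(r.alternatives[0].transcript for r in results)


if __name__ == "__main__":
    print(plan_upload("meeting.mp3"))
    # Stand-in objects shaped like response.results, for illustration only.
    fake = [SimpleNamespace(alternatives=[SimpleNamespace(transcript="hello")])]
    print(join_transcripts(fake))
```

With helpers like these, the view could decide once whether to run ffmpeg and then share a single upload, transcribe, and delete path for both cases.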

Finally, apply the database migrations for the application.

python3 manage.py makemigrations mozi
python3 manage.py migrate

Everything is now ready, so start the web server.

python3 manage.py runserver

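Being able to write everything in Python, from setting up the web server to the internal processing, made this easy to build.
This is only a light first look, recorded as a memo.

Addendum
The following command creates an administrator user, who can then log in at /admin.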

python3 manage.py createsuperuser

Reference sites

https://noumenon-th.net/programming/2019/10/28/django-forms/
https://qiita.com/peijipe/items/009fc487505dfdb03a8d
https://cloud.google.com/speech-to-text/docs/async-recognize?hl=ja
