
Transcribing audio data to text with faster-whisper

Posted at 2024-04-29

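1. Introduction

Let me state the conclusions up front.

① faster-whisper is excellent at turning audio data into text
② It needs a GPU, so you have to arrange one
③ For translation and summarization, ChatGPT looks like the better choice

What follows is just the key points, written as notes for myself.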

2. Preparation

① Hardware

You need a GPU. If you just want to try this out casually on your own, Google Colaboratory is a good choice.
Do not forget to set the runtime type to GPU.
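
As a quick sanity check on the runtime setting, something like the following can confirm that a GPU is actually visible before loading the model. This is only a sketch: gpu_available and pick_device are hypothetical helper names, the check relies on the nvidia-smi command being present (as it usually is on a GPU runtime), and the "cpu" / "int8" fallback is an assumption that is not used anywhere else in this article.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


def pick_device() -> tuple[str, str]:
    """Choose (device, compute_type) values to pass to WhisperModel."""
    # "cuda"/"float16" matches this article; "cpu"/"int8" is an assumed fallback.
    return ("cuda", "float16") if gpu_available() else ("cpu", "int8")


print(gpu_available(), pick_device())
```

If this prints False on Colab, the runtime type is most likely still set to CPU.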

② Software

Install the packages below. On Google Colaboratory, note that the pip command needs a leading ! as shown.
Adding %%capture also hides the noisy installation output, which I personally like (laughs).

%%capture
!pip install faster-whisper yt-dlp
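
Because %%capture hides the pip output, a failed install is hidden too. A small check like the one below can confirm that both packages are really there. It is a sketch using only the standard library; installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("faster-whisper", "yt-dlp"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```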

3. Transcribing audio data (Japanese)

Now, at last, let's transcribe some Japanese audio.
I used a YouTube video as the material.

First, here is the code.

import os
import subprocess

from faster_whisper import WhisperModel

YOUTUBE_ID = "mv_zMjjfX5k"  # YouTube video ID
AUDIO_FILE_NAME = f"{YOUTUBE_ID}.mp3"


# Download the audio track from YouTube as MP3
def dl_yt(yt_url):
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "mp3", "-o", AUDIO_FILE_NAME, yt_url],
        check=True,  # raise an error if the download fails
    )


dl_yt(f"https://youtu.be/{YOUTUBE_ID}")

# Load the large-v3 model on the GPU in half precision
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    AUDIO_FILE_NAME,
    beam_size=5,              # beam search width
    vad_filter=True,          # filter out non-speech with voice activity detection
    without_timestamps=True,  # sample text tokens only
)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
# segments is a generator; the transcription runs while iterating over it
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

# Delete the downloaded audio file
os.remove(AUDIO_FILE_NAME)

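Here is a brief explanation.

This code downloads the audio from YouTube and converts that audio to text.

First, it defines dl_yt, a function that downloads the audio for the specified YouTube video ID. The function uses a tool called yt-dlp to download the audio file (in MP3 format) from the given YouTube URL.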
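
The same download can also be done through yt-dlp's Python interface instead of a shell command. The sketch below is an alternative, not what the article's code does: build_ydl_opts and download_audio are hypothetical helper names, and the options are meant to be roughly equivalent to yt-dlp -x --audio-format mp3 (ffmpeg is needed for the conversion either way).

```python
def build_ydl_opts(audio_stem: str) -> dict:
    """Options roughly equivalent to `yt-dlp -x --audio-format mp3`."""
    return {
        "format": "bestaudio/best",
        # yt-dlp fills in the extension; the postprocessor then produces .mp3
        "outtmpl": f"{audio_stem}.%(ext)s",
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}
        ],
    }


def download_audio(youtube_id: str) -> str:
    """Download the audio track and return the expected MP3 file name."""
    from yt_dlp import YoutubeDL  # imported lazily; needs `pip install yt-dlp`

    with YoutubeDL(build_ydl_opts(youtube_id)) as ydl:
        ydl.download([f"https://youtu.be/{youtube_id}"])
    return f"{youtube_id}.mp3"
```

Calling download_audio("mv_zMjjfX5k") would then take the place of dl_yt(...) in the code above.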

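Next, it prepares WhisperModel, the model that does the speech-to-text conversion. This model uses a large neural network to convert speech into text.

After that, the downloaded audio file is transcribed with WhisperModel. The call uses beam search to look for the best text, applies a voice activity detection (VAD) filter, and outputs text without timestamp tokens. Probably as a result, the segment times in the output below fall on coarse, roughly 30-second boundaries.

Finally, once the conversion is finished, the audio file is deleted.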
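
One detail worth knowing is that segments is a generator, so it can be consumed only once. If you want both the timestamped lines and the plain text (for example to paste into ChatGPT later), it helps to collect them in a single pass. The sketch below uses stand-in objects instead of a real model so that it runs anywhere; format_segment and collect_transcript are hypothetical helper names, and unlike the article's print loop they strip the leading space from each segment's text.

```python
from types import SimpleNamespace


def format_segment(segment) -> str:
    """Format one segment as '[start -> end] text' with the text stripped."""
    return "[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text.strip())


def collect_transcript(segments):
    """Consume the segment generator once; return (lines, full_text)."""
    lines, texts = [], []
    for segment in segments:  # with a real model, transcription runs in this loop
        lines.append(format_segment(segment))
        texts.append(segment.text.strip())
    return lines, "\n".join(texts)


# Stand-in objects with the same attributes as faster-whisper segments.
fake_segments = (
    SimpleNamespace(start=s, end=e, text=t)
    for s, e, t in [(0.5, 32.0, " Hello."), (32.0, 66.67, " World.")]
)
lines, full_text = collect_transcript(fake_segments)
print(lines[0])
```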

The output was as follows.

[0.50s -> 32.00s] 都心から車で2時間半ほどの場所にある群馬県南木村。2014年に消滅可能性自治体の中でも全国ワースト1位。30年間で若年女性が89.9%減少すると推計された自治体です。
[32.00s -> 66.67s] その後、村を消滅の危機から救うため、役場が取り組み始めたのが、南木村にある道の駅。運営は移住者らがメンバーになっているNPO法人に委託され、雇用の場となっています。NPOスタッフの給料は村役場の職員と同じ水準だそうです。
[66.67s -> 96.67s] 2人が所属するNPOは、他にも老人ホームやケーブルテレビ、
[96.67s -> 126.67s] ケアハウスの業務も担当しています。また、南木村では保育料や学校給食費が無料に。さらに、今年度から高校までの定期券代が全額補助され、
[126.67s -> 156.67s] ガソリン代が一部補助されるようになりました。先月、埼玉県から引っ越してきた一家は、
[156.67s -> 188.11s] 村から転出した人は減りつつあり、おととしは41人。直近10年間で最も少ない数になりました。そして迎えたきょう、人口戦略会議の発表では、消滅可能性自治体に変わりはないものの、若年女性人口減少率が改善した。
[188.11s -> 218.11s] ただ、若年女性の減少率はわずかな改善にとどまっていて、1556人の村の人口は、2050年に4分の1、406人に減ると推計されています。村は住民や移住者に住み続けてもらうため、引き続き対策を進めるとしています。
[218.11s -> 248.11s] 2050年までの30年間で、20代から30代の女性の人口が50%以上、半分以上減少する自治体ということで、一気に消滅するということではないんですね。
[248.11s -> 278.11s] 九州、沖縄、少ないということになっていますね。
[278.11s -> 308.11s] もう少し異様はあるし、
[308.11s -> 338.11s] あるところまでそこを打ってから、そこから楽しいことが起きるということを、なぜポジティブにそういうことも提案しないのかというのが、10年前の時も相当噛みついていました。そうですか。でも、こうした結果を受けて、改めて少子化対策が十分なのかですとか、若者にアピールするような対策がちゃんと打てているのかということを考え出すきっかけになるのかなと。
[338.11s -> 368.11s] 整った時に、やっぱりこんな都会で暮らすの嫌だという人が出てきた時に、今までだと、どこに行くか分からないというのがあって、これだけ人が減っているところがあるけど、これだけ色んなものがあるということが、より自分たちの自発的な、誰かに頼る問題ではなくて、例えば、自治体によっては、うちは減っていくのでいいんだという選択肢なんですよね。地元の人がそれでいいという。けど、やっぱり子どもの声を聞きたいよねと言ったら、じゃあどうすればいいのかと考えなきゃいけないのに、
[368.11s -> 398.11s] なぜそれをプレッシャーをかけて何とかしろというのかというのは、それは民主主義が大好きな日本の人たちからすると、それは大きなお世話で、私は地元の人が反旗を翻してほしいですよね。全国一律じゃなくて、それぞれの自治体で考えて話し合ってですよね。そういった中で課題が見つかったら、効果の高い事業をやっていけばいいという点だと思いますが、例えば、こちらをご覧ください。千葉県の流山市、北西部にあります埼玉と隣接している市がありまして、
[398.11s -> 428.11s] そのうちの一つ、例えばこの送迎保育ステーションサービスがありまして、朝駅前の集合場所に子どもを預けます。親御さんはそのまま出勤をする。
[428.11s -> 458.11s] 子育て世帯がどんどん流入することで、駅前には大型商業施設が次々とオープンして、買い物も便利になって好循環が生まれる。これ、少子化対策で数を増やそうじゃなくて、子育てしやすい環境を作るために、徹底的に自治体がリーダーシップを取って聞き込みを行ったんですよね。それで、こうやってやろうじゃないかと決めて貫いている結果だそうです。質を高めようという意思も感じますし、非常に努力されたんだと思いますけれども、ただ、筑波エクスプレス開通ということで、
[458.11s -> 488.11s] 東京に近いところでこういう例で出されても、問題は東北なわけですよね、現実はね。この例を出してくること自体が、いや、それはここの例出されたら、申し訳ないですけど、誰でもちょっと違う。もちろんです。同じ通りにやってくれとは言いませんよ。言いませんけど、一つのきっかけをそれぞれの自治体で見出すのはいいんじゃないかということです。自分の地域性が、どこが売りで、どういうところから変えていくかというのが大事だ。この例を出されちゃうと、日本って真似しちゃうんですよ。
[488.11s -> 518.11s] だから、しかもですよ、これ東北で例を挙げてくれないと、こんな東京近郊のところで出されたら、もう見たら、そりゃここはいいよねって思っちゃうと私は思いますけどね。
[518.11s -> 548.11s] そうしたら、自分のお子さんたちが手を離れた方々の地方のお家に高校生とかが留学して、高校とかも自分で選べるようにしたらいいんじゃないかとか、
[548.11s -> 578.11s] できない経験をそこですればいいんじゃないかということを提案されていて、それはすごく素敵なアイデアだなと思って、高校生の方がそれを思い浮かべるっていうのも本当に素敵なことだと思いますけど、やっぱりどうしても人口って動いていくと思いますし、動物たちの分布が変わるように人間の分布も変わるだろうっていうのは思うので、
[578.11s -> 608.11s] 資源がそこにあるのかっていうことを考えればいいって思うので、そこまで悲観する必要もないんじゃないかな。ただその移動の過程で不便をこむる方がいらっしゃるなら、そこはケアしないといけないところですけど。それぞれの自治体をどういう自治体にしていくのか、そしてもっと引いたみたいに言うと、国をどういう国にしていくのか、そういう国に考えていかなければならない。留学は所詮仮なんで、やっぱりこれね、もうちょっと縁があった方がいいんですよ。
[608.11s -> 638.11s] 例えば、親戚が住んでるとか、実家があるとか、やっぱりね、両親が東京生まれの人が地方に暮らすっていうのは、そういう取材も昔したことがあるんですけど、最初のうちにすごい温かく迎えてもらってるっていうふうに感じるんですけど、地方には地方のコミュニティの中にどう入っていくかっていうと、都会みたいにお隣は何してるか知りませんっていう言い方をしてる人からすると、結構大変なんですよね。
[638.11s -> 644.88s] あくまで一例。またこのお話、長い問題なんで。

Well, overall it looks pretty good (laughs).
Incidentally, a video a little over 10 minutes long was transcribed in about 1 minute. faster-whisper really is fast.
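
To put a rough number on that, the speed can be expressed as a multiple of real time. The sketch below takes the audio length from the end time of the last segment in the output above (644.88 s) and treats "about 1 minute" as 60 s, which is only an approximation; speedup is a hypothetical helper name.

```python
def speedup(audio_seconds: float, wall_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


audio_seconds = 644.88  # end time of the last segment in the output above
wall_seconds = 60.0     # "about 1 minute"; an approximation, not a measurement
print(f"{speedup(audio_seconds, wall_seconds):.1f}x real time")
```

For a real measurement, the loop over segments could be timed with time.perf_counter, since that loop is where the transcription work happens.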

4. Transcribing audio data (English)

Next, let's transcribe some English audio.

The code is the same as above. Only the YouTube video is different.

[11.63s -> 41.63s]  What's up, friends? I'm Coy here in Arizona where UConn was crowned NCAA champions in men's basketball. So congratulations to the Huskies. And we have to give a huge shout out to South Carolina who rocked it out in the women's tournament. Rise up. Today is hashtag Your Word Wednesday. So listen to see if your vocab word made it into this show. First up, we start in our nation's capital, Washington, D.C., where President Joe Biden is prepping for an important meeting with the leader of Japan, Prime Minister Fumio Kishida.
[41.63s -> 71.63s]  The main issue on the agenda, strengthening the military alliance between the U.S. and Japan. Japan is widely considered the most important ally for the U.S. in Asia. The U.S. has military bases all over Japan and coming out of this summit, the U.S. and Japan will change some rules that Biden and Kishida say will help the two countries work together better. Japan and the U.S. are also discussing how to better design and possibly produce military and defense hardware together.
[71.63s -> 101.63s]  Why do the countries want to strengthen their bond? Well, they face many of the same threats. Chief amongst them, China. The U.S. and Japan say they're concerned about China and Russia recently strengthening their relationship and that China has expanded its military presence in the Asia-Pacific region, which could potentially endanger Taiwan. You may remember last week we talked about how China says it owns Taiwan, but Taiwan sees itself as independent.
[101.63s -> 131.63s]  Taiwan is only 62 miles from Japan's southern coast. North Korea's nuclear arsenal, the war in Ukraine and the war in Gaza are also major concerns for both countries. The world at a historic turning point, Japanese Prime Minister Fumio Kishida tells me, ahead of a summit with U.S. President Joe Biden this week. The longtime partners will upgrade their defense relationship to the next level.
[131.63s -> 161.63s]  In the backdrop of mounting international security challenges.
[161.63s -> 191.63s]  During geopolitical tensions forced Japan, long a pacifist country, to change its defense posture, moves not seen since World War II. Under his leadership, Japan plans to boost its defense spending to 2% of its GDP by 2027 and purchase weapons including U.S.-made Tomahawk cruise missiles, acquiring counter-strike capabilities for the first time in decades.
[191.63s -> 221.63s]  Japan has a security pact with the United States. Why does it need counter-strike capabilities? Missile-related technology is evolving year by year. As missiles become more sophisticated, Japan must constantly consider what kind of technology is needed to protect the lives and livelihood of its citizens. Following the U.S.-Japan summit, the two nations will convene with the Philippines, the first trilateral meeting of its kind.
[221.63s -> 251.63s]  They will address rising security threats from North Korea weapons testing and aggression in the South China Sea. The U.S. and Japan, an enduring bond, confronts its pivotal moment as a volatile world seeks to unravel their global sway. Hanako Montgomery, CNN, Tokyo. Ten-second trivia. Which one of these categories is considered a worst-case scenario when it comes to hurricanes?
[251.63s -> 285.12s]  Five or category four? Correct answer is category five. Meteorologists use a scale called the Saffir-Simpson Hurricane Wind Scale to classify hurricanes into categories one through five. Categories three to five are considered major hurricanes, but a category five is catastrophic. Next up, this coming hurricane season may be less than ideal. That's according to Colorado State University.
[285.12s -> 315.12s]  This year, they're predicting more hurricanes and major storms than ever coming out of the Atlantic Ocean. The university says that from June through November, there could be 23 big storms, including 11 hurricanes and five category three or higher major hurricanes. These storms are likely to hit the eastern part of the U.S. and the Caribbean. In the typical season, there are 14 big storms, including seven hurricanes.
[315.12s -> 345.12s]  Basically means that the temperature of the surface water is cooler than normal in a particular area, and certain wind patterns are stronger in that same area. Unfortunately, that combination creates the ideal conditions for some major storms. Well, indeed, our water temperatures here in the Atlantic are three to five degrees warmer than they should be at this time of year, indicative of really late April, not early April. And so our hurricane season will likely start earlier than usually.
[345.12s -> 375.12s]  I'll tell you what that means in a second, and obviously the warm ocean temperatures from the record-breaking year of 2023 globally. So Colorado State, 23, 11, and five. Those are the numbers for storms with names, hurricanes, and majors. Now, last year, they had 13, 6, and 2, and it turned out to be 27 and 3. So they're not overestimating in April just in case.
[375.12s -> 405.12s]  Now, last year, they had 13, 6, and 2, and it turned out to be 27 and 3. So they're not overestimating in April just in case. Now, last year, they had 13, 6, and 2, and it turned out to be 27 and 3. So they're not overestimating in April just in case.
[405.12s -> 435.12s]  Now, last year, they had 13, 6, and 2, and it turned out to be 27 and 3. So they're not overestimating in April just in case. Now, last year, they had 13, 6, and 2, and it turned out to be 27 and 3. So they're not overestimating in April just in case. Now, last year, they had 13, 6, and 2, and it turned out to be 27 and 3. So they're not overestimating in April just in case. Now, last year, they had 13, 6, and 2, and it turned out to be 27 and 3. So they're not overestimating in April just in case. Now, last year, they had 13, 6, and 2, and it turned out to be 27 and 3. So they're not overestimating in April just in case. Now, last year, they had 13, 6, and 2, and it turned out to be 27 and 3. So they're not overestim
[435.12s -> 465.12s]  For our next story, we're talking cameras. They're everywhere. They're on our phones, on our tablets, in our cars, on street corners. It's like there's some Machiavellian plan out there to capture our every single move. And now we have the camera the size of a car. That's right, the biggest digital camera ever. It's called the LSST camera. It has 3 billion pixels, a 5-foot wide optical lens, and it took scientists at the SLAC National Accelerator Lab nearly two decades to build it.
[465.12s -> 495.12s]  It will be installed at an observatory in Chile where it will take a series of photos of the southern skies. Each photo will take 15 seconds to snap. That may sound like a long time to take one photo, but it's the best way to make sure the camera is capturing as much detailed information in the deep, dark solar system as possible.
[495.12s -> 529.70s]  1,000 of these large, detailed photos of the sky. Scientists will then analyze and compare these images to understand how our galaxy was formed, how the universe is expanding, and what exactly is dark matter, which is a substance that scientists believe exists in outer space but don't know for sure. Today's story, getting a 10 out of 10, we go from photos of the sky to the animals that occupy it. Birds, well, in this case, parrots. Researchers have been studying how parrots interact with touchscreens.
[529.70s -> 559.70s]  The reason? They may want to develop apps to entertain them. You see, parrots are very smart creatures, and the concern is that if you keep them cooped up as pets, they could get bored. So what do people like to do if they get bored? Games, apps. This is all coming from a recent paper written by researchers at Northeastern University's Interact Animal Lab and the University of Glasgow's Animal Computer Interaction Lab. So if you get a Snapchat from Polly, who wants a cracker? You'll know why.
[559.70s -> 589.70s]  Congrats to Mr. Ross Valley's class at Weston Middle in Weston, Connecticut, for submitting our vocab word of the day, Machiavellian, an adjective meaning to be cunning, scheming, and unscrupulous. Great job, explorers, and thanks for watching our show every day. For today's shout-out, we are showing some love to Forest Park Middle School in Franklin, Wisconsin. Wildcats, rise up! And how about a shout-out to Ida B. Wells Middle School in Washington, D.C.? We see you, wolves.
[589.70s -> 593.62s]  Right back here tomorrow on The Ten.

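As expected, English transcribes well too, although the same two sentences are repeated in a loop from about 375 s to 435 s in the output above.
The video is a little under 10 minutes long, and it was transcribed in about 1 minute.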
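
The repeated stretch in the output above can be cleaned up after the fact. The sketch below drops any sentence that already appeared among the last few sentences kept. It is a naive post-processing step under stated assumptions: collapse_repeats is a hypothetical helper name, the sentence splitting is a simple regular expression, and a small window may also remove an intentional repetition. faster-whisper's transcribe also accepts a condition_on_previous_text argument that is sometimes switched off to reduce such loops, but that was not tried here.

```python
import re


def collapse_repeats(text: str, window: int = 4) -> str:
    """Drop sentences that already appear among the last `window` kept ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        if sentence and sentence not in kept[-window:]:
            kept.append(sentence)
    return " ".join(kept)


looped = (
    "Now, last year, they had 13, 6, and 2, and it turned out to be 27 and 3. "
    "So they're not overestimating in April just in case. "
) * 3 + "For our next story, we're talking cameras."
print(collapse_repeats(looped))
```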

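5. A failed example

The examples above went well, but here I also want to show one that failed.
This is a case of transcribing English audio and translating it into Japanese.
Since it is a failure, I will only go over it briefly.

First, the code.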

import os
import subprocess

from faster_whisper import WhisperModel
from transformers import pipeline

YOUTUBE_ID = "ZWYzDndeTjQ"  # YouTube video ID
AUDIO_FILE_NAME = f"{YOUTUBE_ID}.mp3"


# Download the audio track from YouTube as MP3
def dl_yt(yt_url):
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "mp3", "-o", AUDIO_FILE_NAME, yt_url],
        check=True,  # raise an error if the download fails
    )


dl_yt(f"https://youtu.be/{YOUTUBE_ID}")

# Set up the Whisper model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe the audio
segments, info = model.transcribe(
    AUDIO_FILE_NAME,
    beam_size=5,
    vad_filter=True,
    without_timestamps=True,
)

# Extract the text of each segment, one segment per line
text = ''
for segment in segments:
    text += segment.text + '\n'


# Join consecutive lines until one ends with a period
def join_sentences_until_period(text):
    sentences = text.strip().split('\n')
    result = []
    i = 0
    while i < len(sentences):
        current_result = []
        # Collect lines that do not end with a period
        while i < len(sentences) and not sentences[i].strip().endswith('.'):
            current_result.append(sentences[i].strip())
            i += 1
        # Add the line that closes the sentence, if there is one
        if i < len(sentences):
            current_result.append(sentences[i].strip())
            i += 1
        result.append(' '.join(current_result))
    return result


# Join the lines up to each sentence end
output = join_sentences_until_period(text)

# Instantiate the translation model
fugu_translator = pipeline('translation', model='staka/fugumt-en-ja')

# Translate each line and print the result
for line in output:
    translation = fugu_translator(line)[0]['translation_text']
    print(translation)

# Delete the downloaded audio file
os.remove(AUDIO_FILE_NAME)

So that you can see how badly it went, here is the result as well.

何か起きて いらしきが お目付かれる ウィグスワッティンゲリブの アメリカ合衆国大統領になるアメドニのアロバの話を 聞いたこと ワケデビデの言葉を 聞いたことも 初めて聞くことにした この国の 大統領のジャスリバの 重要な会に 立ち向かうことになる ジョハゲの 橋頭前 ジュピの ジュピの家 ヒメの家の人格の ヒレの家の主 ジュニの家の ジュプチの家の主 ジュピのところに 行く先を、その核となるアグスチュ-アの議事事、米軍の日本における軍備の強固な結社の力の増し方に日本の主たる論点でご理解いただける方も大変申し上げるように,アメリカの軍備を日本中にももつように,今後は,日米は,日米が,より一層,より良く軍需のハードを一緒に作り,より良くする上で,より一層,（以下、省略）

This is not really usable.
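
I did not track down the cause. One plausible factor (an assumption, not verified here) is that each transcribed segment is a long block of several sentences, so the strings passed to the translation model are long paragraphs rather than single sentences. A simple thing to try would be to split the text into sentences first and translate them one at a time. The sketch below shows only the splitting step, with a naive regular expression that will also split after abbreviations such as "U.S."; split_sentences is a hypothetical helper name.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naively split English text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = " What's up, friends? I'm Coy here in Arizona. Rise up."
for sentence in split_sentences(sample):
    # In the code above, each sentence would be passed to fugu_translator(...)
    print(sentence)
```

Printing the English sentences like this before translating also makes it easier to tell whether the transcription step or the translation step is the one going wrong.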

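6. An alternative

So, for translation and summarization, ChatGPT does seem to be the better choice.

When I had ChatGPT summarize the first Japanese transcript, the result was as follows.
I think it is a perfectly sensible answer.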

群馬県南木村は都心から車で2時間半ほどの場所にあり、2014年に消滅可能性自治体の中でも全国ワースト1位に挙げられました。若年女性が89.9%減少すると推計されたことから、村は消滅の危機に瀕していました。しかし、役場は道の駅を運営するNPO法人に委託し、雇用の場を提供しました。村では保育料や学校給食費が無料であり、高校までの定期券代やガソリン代が補助されています。これらの取り組みにより、村からの転出者が減少し、若年女性人口減少率が改善されました。しかし、2050年までに人口は4分の1に減少すると推計され、対策が引き続き必要です。このような自治体の課題は、全国の自治体がそれぞれの特性に合わせて取り組むべきものであり、地域の魅力を活かす取り組みが必要です。

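And here is what I got when I gave ChatGPT the English transcript and asked it to "translate it and then summarize it".
It has the roundabout style typical of English news, but I suppose this is about what you would expect.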

アリゾナのコイです。UConnが男子バスケットボールのNCAAチャンピオンに輝き、南カロライナが女子トーナメントで活躍しました。今日は“Your Word Wednesday”です。米国の首都ワシントンDCでは、バイデン大統領が日本の岸田文雄首相との重要な会議に臨んでいます。主要な議題は、米日間の軍事同盟の強化です。両国は中国などの脅威に対処するために結束し、台湾の安全にも懸念を抱いています。日本は防衛力を向上させ、米国製のトマホーク巡航ミサイルを購入し、数十年ぶりに反撃能力を獲得します。
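
For longer videos the transcript may not fit into a single ChatGPT message, so it can help to cut it into chunks and attach the same instruction to each one. The sketch below only prepares the text and does not call any API. chunk_text and build_prompt are hypothetical helper names, the 2000-character limit is an arbitrary assumption, and a single line longer than the limit is kept as one oversized chunk.

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Group non-empty lines into chunks of at most about max_chars characters."""
    chunks, current = [], ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if current and len(current) + 1 + len(line) > max_chars:
            chunks.append(current)
            current = line
        else:
            current = f"{current}\n{line}" if current else line
    if current:
        chunks.append(current)
    return chunks


def build_prompt(chunk: str, target_language: str = "Japanese") -> str:
    """Build a translate-then-summarize instruction for a chat model."""
    return (
        f"Translate the following transcript into {target_language}, "
        "then summarize it in a few sentences.\n\n" + chunk
    )


transcript = "First segment of text.\nSecond segment of text.\nThird segment."
for chunk in chunk_text(transcript, max_chars=50):
    print(build_prompt(chunk))
    print("-----")
```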

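7. Conclusion

I think faster-whisper is good.
For example, even when a good lecture is available as a YouTube video, sitting through two hours of it is honestly exhausting. If you can quickly transcribe it, translate and summarize it, and then listen only to the parts you really need with some background knowledge already in hand, you can use your time well.

For this article I referred to the sites below. Thank you.
Both are very good articles, so please take a look.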
https://zenn.dev/tsuzukia/articles/1381e6c9a88577
https://tt-tsukumochi.com/archives/7753
