Having an LLM Create a Video with FFmpeg, mermaid-cli, and gTTS

Posted at 2024-08-13, last updated at 2024-08-13

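Overview

I already knew that FFmpeg can handle simple video editing, so this article examines a way to create a video in which most of the work is delegated to an LLM.

The overall flow is roughly as follows: the LLM writes a Mermaid diagram and a narration script, mermaid-cli renders the diagram to a PNG, gTTS turns the narration into an audio file, and FFmpeg combines the image and the audio into a video.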

Environment setup

FFmpeg

Install FFmpeg and set the system environment variable (PATH) so that the ffmpeg command can be run from the Command Prompt.

mermaid-cli

Install Node.js.

Run the following command in the Command Prompt.

npm install -g @mermaid-js/mermaid-cli

gTTS

Install gTTS with the following command.
(If Python is not installed, install it first.)

pip install gtts
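
Before moving on, it can help to confirm that all three tools are actually reachable. The following is a minimal sketch, not part of the original workflow; the function name check_tools is made up for illustration, and it only checks that the ffmpeg and mmdc commands are on PATH and that the gtts package can be imported.

```python
import importlib.util
import shutil


def check_tools(commands=("ffmpeg", "mmdc"), modules=("gtts",)):
    """Return a dict mapping each tool name to True if it appears to be available."""
    status = {}
    for command in commands:
        # shutil.which returns None when the executable is not found on PATH
        status[command] = shutil.which(command) is not None
    for module in modules:
        # find_spec returns None when the package is not installed
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, available in check_tools().items():
        print(f"{name}: {'OK' if available else 'NOT FOUND'}")
```

If any entry is reported as NOT FOUND, revisit the corresponding installation step above.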

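Having the LLM create the commands and programs

Prompt

Use ChatGPT or a similar LLM to output a diagram in Mermaid syntax.

Please explain the Transformer architecture on which LLMs are based, using a diagram in Mermaid syntax.

Check that the Mermaid diagram is what you expected.

Note: the Mermaid diagram can be previewed with the following steps.

If corrections are needed, either edit the diagram yourself in VSCode or a similar editor, or ask the LLM to fix it. (A harshly worded, so-called "power-harassment" prompt tends to be effective.)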

```mermaid
graph TD
    subgraph Encoder
        A1[Input Embedding]
        A2[Positional Encoding]
        A3[Multi-Head Attention]
        A4[Add & Norm]
        A5[Feed Forward]
        A6[Add & Norm]
        A7[Output to Decoder]
        A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7
    end

    subgraph Decoder
        B1[Target Embedding]
        B2[Positional Encoding]
        B3[Masked Multi-Head Attention]
        B4[Add & Norm]
        B5[Multi-Head Attention w/ Encoder Output]
        B6[Add & Norm]
        B7[Feed Forward]
        B8[Add & Norm]
        B9[Output]
        B1 --> B2 --> B3 --> B4 --> B5 --> B6 --> B7 --> B8 --> B9
    end

    subgraph Output
        C1[Linear]
        C2[Softmax]
        C3[Predicted Output]
        B9 --> C1 --> C2 --> C3
    end

    subgraph Attention
        A3 -.->|Keys, Values, Queries| B5
    end
```

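Run the following prompt.

Create the diagram with mermaid-cli and the audio file with gTTS, then create a video with FFmpeg.

Creating the Mermaid diagram

Save the finished Mermaid text in a file with the ".md" extension (output.md in the command below).

Run the following command to render the diagram written in Mermaid syntax to a PNG file.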

mmdc -i output.md -o output.png

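The image file is output. Because the input is a Markdown file, mermaid-cli numbers each diagram it finds, so the file is named output-1.png rather than output.png.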
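
The save-and-render step can also be scripted. The sketch below is one possible approach rather than the method used in this article: extract_mermaid and render_diagram are hypothetical helper names, and it assumes the LLM response contains a single mermaid code fence. It runs the same mmdc command as above.

```python
import re
import shutil
import subprocess
from pathlib import Path

FENCE = "`" * 3  # three backticks, built at runtime so this listing embeds cleanly


def extract_mermaid(response_text):
    """Return the body of the first mermaid fenced block, or None if there is none."""
    pattern = re.compile(FENCE + r"mermaid\s*\n(.*?)\n\s*" + FENCE, re.DOTALL)
    match = pattern.search(response_text)
    return match.group(1) if match else None


def render_diagram(response_text, md_path="output.md", png_path="output.png"):
    """Write the diagram to a Markdown file and render it with mmdc."""
    diagram = extract_mermaid(response_text)
    if diagram is None:
        raise ValueError("no mermaid block found in the LLM response")
    mmdc = shutil.which("mmdc")  # resolves mmdc.cmd on Windows as well
    if mmdc is None:
        raise FileNotFoundError("mmdc was not found on PATH")
    Path(md_path).write_text(f"{FENCE}mermaid\n{diagram}\n{FENCE}\n", encoding="utf-8")
    # Same command as in the article: mmdc -i output.md -o output.png
    subprocess.run([mmdc, "-i", md_path, "-o", png_path], check=True)
```

Calling render_diagram(llm_response) with the text copied from the chat should leave the PNG next to output.md, assuming the diagram itself is valid Mermaid.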

Creating the narration

The narration is created from the LLM's output.

The LLM is expected to output a program like the following, so run it.

generate_narration.py

from gtts import gTTS

# Narration text
text = "This diagram represents the architecture of the Transformer model, which is fundamental in modern NLP. The model consists of an encoder and a decoder, each made up of multiple layers of self-attention and feed-forward networks."

# Output file name
filename = "narration.mp3"

# Generate the MP3 file (lang="en" is the gTTS default, stated explicitly here)
tts = gTTS(text, lang="en")
tts.save(filename)

python generate_narration.py

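The narration file (narration.mp3) is output.

Creating the video

The LLM has probably already output an FFmpeg command, but run something like the following.
(The "scale=1080:1920" value needs to be changed to match the dimensions of the PNG.)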

ffmpeg -loop 1 -i output-1.png -i narration.mp3 -c:v libx264 -tune stillimage -c:a aac -b:a 192k -pix_fmt yuv420p -t 30 -vf "scale=1080:1920" output.mp4

The video file is created.
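
Since the scale value has to follow the PNG, it can be derived from the image instead of being typed by hand. The following is a rough sketch under a few assumptions: the helper names are invented for illustration, the size is read straight from the PNG header, and each dimension is rounded down to an even number because libx264 with yuv420p rejects odd widths and heights. The remaining options are copied from the command above.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def parse_png_size(header):
    """Return (width, height) from the first 24 bytes of a PNG file."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("data does not look like a PNG file")
    # Width and height are two big-endian 32-bit integers right after "IHDR"
    return struct.unpack(">II", header[16:24])


def png_size(path):
    """Read (width, height) from a PNG file on disk."""
    with open(path, "rb") as f:
        return parse_png_size(f.read(24))


def even(value):
    """Round down to the nearest even number."""
    return value - (value % 2)


def build_ffmpeg_command(image, audio, output, size, duration=30):
    """Build the argument list for the still-image-plus-audio command."""
    width, height = size
    return [
        "ffmpeg", "-loop", "1", "-i", image, "-i", audio,
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac", "-b:a", "192k", "-pix_fmt", "yuv420p",
        "-t", str(duration),
        "-vf", f"scale={even(width)}:{even(height)}",
        output,
    ]
```

A call such as build_ffmpeg_command("output-1.png", "narration.mp3", "output.mp4", png_size("output-1.png")) returns a list that can be passed to subprocess.run. The duration of 30 seconds is kept from the original command and may be longer than the narration.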

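Result

The output looks something like this.
Note: the accuracy of the content has not been checked.

Addendum

Let's also try adding subtitles.

Steps for adding subtitles

Create a file named "subtitles.srt" with the following content.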

subtitles.srt

1
00:00:00,000 --> 00:00:10,000
This diagram represents the architecture of the Transformer model, which is fundamental in modern NLP.

2
00:00:10,000 --> 00:00:20,000
The model consists of an encoder and a decoder, each made up of multiple layers of self-attention and feed-forward networks.
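
A subtitle file in this format could also be generated from the narration sentences instead of being written by hand. The sketch below is an illustration only: format_timestamp and build_srt are hypothetical helpers, and the start and end times are supplied by the caller, since nothing here measures how long each sentence actually takes in the audio.

```python
def format_timestamp(seconds):
    """Format a time in seconds as the HH:MM:SS,mmm form used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def build_srt(cues):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Entries are separated by a blank line
    return "\n".join(blocks)


if __name__ == "__main__":
    cues = [
        (0, 10, "This diagram represents the architecture of the Transformer model, which is fundamental in modern NLP."),
        (10, 20, "The model consists of an encoder and a decoder, each made up of multiple layers of self-attention and feed-forward networks."),
    ]
    with open("subtitles.srt", "w", encoding="utf-8") as f:
        f.write(build_srt(cues))
```

Run as a script, this writes a subtitles.srt with the same two entries and the same 10-second windows as the file above.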

Run the following command.

ffmpeg -i output.mp4 -vf "subtitles=subtitles.srt:force_style='FontSize=24,PrimaryColour=&HFFFFFF&'" -c:a copy output_with_subtitles.mp4

It looks rough, but the subtitles have been added.
Note: the accuracy of the content has not been checked.
