Feeling My Way Through CV / ML / NN Advent Calendar 2025

Feeling My Way Through CV / ML / NN: Day 22, Trying Out Video Generation Models, Part 6

Posted at 2025-12-23

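Thinking Through a Realistic Workflow and Its Automation with VACE

This is a side episode. It is not directly about VACE itself; instead, I tried to work out a workflow around VACE that is reasonably achievable in practice, and how to automate it.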

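Overview

To remove an object from a video with VACE inpainting, you need at least the following two things:

・ Mask: specifies the region of the object to remove
・ Prompt: describes the content of the video to generate

Creating these by hand takes time, so I wanted to automate as much as possible.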

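Challenges

Mask creation: a specific object in the video has to be tracked accurately to produce the mask
Prompt creation:
- The scene content has to be described
- The camera movement has to be described
- Many existing VLMs can only describe individual frames and are weak at expressing motion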

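Proposed solution

For each challenge, the following tools are combined into an automated pipeline:

Task	Tool	Role
Mask generation	SAM3	Tracks video objects from a natural-language prompt
Scene description	FastVLM	Per-frame description of the masked video
Camera motion extraction	Depth Anything v3	Extracts camera extrinsic parameters
Prompt integration	Qwen 2.5 LLM	Generates a continuous narrative from the VLM descriptions and camera data
Inpainting	VACE	Removes the object using the mask and prompt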

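Overall workflow

input.mp4 → SAM3 (bbox mask) → FastVLM (frame descriptions); input.mp4 → Depth Anything v3 (camera data); both → Qwen 2.5 (inpainting prompt) → VACE (inpainted video)

Component details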

Step 1: Bbox generation with SAM3

Purpose: track objects in the video from a natural-language prompt (e.g. "person") and generate a bbox mask video

Tool: SAM3 (Segment Anything with Concepts)

Input:

Input video (input.mp4)
Natural-language prompt ("person")

Output:

One bbox video per object ID (bbox_person_id1.mp4, bbox_person_id2.mp4, ...)
A bbox video merging all IDs (bbox_person_all.mp4)

Implementation example: bbox_video.py

# Entry-point example
python bbox_video.py \
    --input input.mp4 \
    --prompt "person" \
    --score-thresh 0.3 \
    --checkpoint sam3.pt

# Example output:
# - bbox_person_id1.mp4  (bbox for person 1)
# - bbox_person_id2.mp4  (bbox for person 2)
# - bbox_person_all.mp4  (all person bboxes merged)

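Features:

Natural-language targeting: no coordinates needed; the target is given as text such as "person", "car", or "dog"
Multiple objects: several objects of the same class are tracked individually (each gets an ID)
Bbox-based mask: the mask is generated as rectangular regions

Bbox video format:

An MP4 video with white rectangles (255,255,255) on a black background (0,0,0)
Can be used as-is as the VACE inpainting mask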
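
To make the mask format concrete, here is a minimal sketch of rendering one mask frame from tracked boxes. It is not taken from bbox_video.py: the function name `boxes_to_mask` and the (x0, y0, x1, y1) pixel convention with an exclusive lower-right corner are assumptions made for illustration.

```python
import numpy as np

def boxes_to_mask(boxes, height, width):
    """Return a uint8 HxWx3 frame: black background, white filled rectangles."""
    mask = np.zeros((height, width, 3), dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        # Clamp to the frame so partially off-screen boxes do not wrap around
        x0, y0 = max(0, int(x0)), max(0, int(y0))
        x1, y1 = min(width, int(x1)), min(height, int(y1))
        if x1 > x0 and y1 > y0:
            mask[y0:y1, x0:x1] = 255
    return mask

# Two tracked objects in one 480x640 frame (hypothetical coordinates),
# merged into a single "all" mask like bbox_person_all.mp4
per_id_boxes = {1: (100, 50, 200, 300), 2: (400, 80, 480, 320)}
mask_all = boxes_to_mask(per_id_boxes.values(), height=480, width=640)
print(mask_all.shape, mask_all.dtype)
```

Rendering a per-ID video would use the same function with a single box per frame; the merged video is simply the union of all boxes.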

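Step 2: Scene description with FastVLM

Purpose: describe each frame of the video with the bbox applied, to extract information about the visible regions

Tool: FastVLM (Vision-Language Model)

Input:

Input video (input.mp4)
Bbox video (bbox_person_all.mp4)

Output:

Frame description JSON (frame_descriptions.json)

Implementation: uses the FastVLM inference API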

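# Entry-point example (pseudocode)
# NOTE: `fastvlm.FastVLM`, `vlm.describe`, `extract_masked_frames`, and MODEL_NAME
# are placeholders for whatever FastVLM inference wrapper you use.
import json

from fastvlm import FastVLM

FPS = 24.0                   # frame rate of input.mp4 (read it from the video in practice)
FRAME_INDICES = [0, 40, 80]  # sample three frames

vlm = FastVLM(model=MODEL_NAME)  # specify the VLM model to use

# Extract frames from the video with the bbox mask applied
masked_frames = extract_masked_frames(
    video_path="input.mp4",
    bbox_path="bbox_person_all.mp4",
    frame_indices=FRAME_INDICES,
)

# Describe each frame
descriptions = {}
for idx, frame in zip(FRAME_INDICES, masked_frames):
    desc = vlm.describe(
        image=frame,
        prompt="Describe this scene in detail, including visible objects, environment, and actions."
    )
    descriptions[f"frame_{idx}"] = {
        "timestamp": f"{idx / FPS:.2f}s",
        "description": desc
    }

# Save
with open("frame_descriptions.json", "w") as f:
    json.dump(descriptions, f, indent=2)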

Example output (frame_descriptions.json):

{
  "frame_0": {
    "timestamp": "0.00s",
    "description": "A man in black jacket and blue jeans in front of a brick building with ivy. **GRAY RECTANGULAR BOX** blocking part of the view. A window with white frame is visible."
  },
  "frame_40": {
    "timestamp": "1.67s",
    "description": "A man running in dark jacket and jeans. Behind him is a large old brick building with white and red facade, chimney, and ivy. **GRAY BOX** partially obscuring the view."
  },
  "frame_80": {
    "timestamp": "3.33s",
    "description": "A man in dark jacket and jeans holding a gun, standing in front of a dilapidated brick building with peeling paint and overgrown ivy."
  }
}

Notes:

The VLM perceives the masked region as a "gray box"
The LLM filters these descriptions in a later step and ignores the gray-box parts
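
The masking step that produces those gray boxes can be sketched as follows. This is an illustration, not the author's `extract_masked_frames`: the function names, the fill value (128, 128, 128), and the threshold of 127 are assumptions, since the article only says the region shows up as a gray box.

```python
import numpy as np

GRAY = (128, 128, 128)  # assumed fill value; the article only says "gray box"

def apply_gray_mask(frame, mask, threshold=127):
    """Paint GRAY wherever the mask is white; leave all other pixels untouched."""
    if frame.shape[:2] != mask.shape[:2]:
        raise ValueError("frame and mask must have the same height and width")
    # Video compression leaves mask values slightly off 0/255, so threshold them
    mask_gray = mask if mask.ndim == 2 else mask[..., 0]
    covered = mask_gray > threshold
    out = frame.copy()
    out[covered] = GRAY
    return out

def timestamp_label(frame_index, fps):
    """Format a frame index as the timestamp string used in frame_descriptions.json."""
    return f"{frame_index / fps:.2f}s"

# Tiny synthetic example: a 4x6 frame with a 2x3 masked rectangle
frame = np.full((4, 6, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 6, 3), dtype=np.uint8)
mask[1:3, 2:5] = 255
masked = apply_gray_mask(frame, mask)
print(masked[1, 2], timestamp_label(40, 24.0))
```

A neutral fill keeps the VLM from describing the object being removed, at the cost of the box itself appearing in the text, which is why the later LLM step has to filter it out.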

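Step 3: Camera motion extraction with Depth Anything v3

Purpose: extract the camera extrinsic parameters (position and orientation) from the video

Tool: Depth Anything v3

Input:

Input video (input.mp4)
Frame indices (optional)

Output:

Camera data JSON (cameras.json)
Depth visualization video (depth.mp4, optional)

Implementation example: extract_camera_for_inpainting.py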

# Entry-point example
python extract_camera_for_inpainting.py \
    --video input.mp4 \
    --frame-indices "0,40,80" \
    --output output/ \
    --ensure-yup

# Output:
# - output/cameras.json
# - output/depth.mp4 (optional)

Example output (cameras.json):

{
  "fps": 24.0,
  "total_frames": 81,
  "processed_frames": 3,
  "model": "da3-large",
  "resolution": 504,
  "frames": [
    {
      "index": 0,
      "timestamp": 0.0,
      "camera_pose": {
        "position": {"x": -5.3e-06, "y": -8.8e-05, "z": -0.00012},
        "rotation_degrees": {"yaw": 0.006, "pitch": -0.009, "roll": 0.005}
      },
      "intrinsics": {
        "fx": 518.8579,
        "fy": 518.8579,
        "cx": 320.0,
        "cy": 240.0
      }
    },
    {
      "index": 40,
      "timestamp": 1.67,
      "camera_pose": {
        "position": {"x": 0.031, "y": -0.145, "z": -0.503},
        "rotation_degrees": {"yaw": 0.061, "pitch": 0.176, "roll": 0.474}
      },
      "intrinsics": {...}
    },
    {
      "index": 80,
      "timestamp": 3.33,
      "camera_pose": {
        "position": {"x": 0.055, "y": -0.359, "z": -1.276},
        "rotation_degrees": {"yaw": 0.518, "pitch": 0.199, "roll": 0.734}
      },
      "intrinsics": {...}
    }
  ],
  "camera_motion": {
    "total_distance": 1.321,
    "start_position": {"x": -5.3e-06, "y": -8.8e-05, "z": -0.00012},
    "end_position": {"x": 0.055, "y": -0.359, "z": -1.276},
    "height_range": {"min": -0.359, "max": -8.8e-05, "delta": 0.359}
  }
}

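Meaning of the camera parameters:

position: camera position in world coordinates (x, y, z)
- x: left/right movement (positive is right)
- y: up/down movement (positive is up; Y-up coordinate system)
- z: forward/backward movement (positive is forward, negative is backward)
rotation_degrees: camera rotation (Euler angles)
- yaw: horizontal rotation (pan)
- pitch: vertical rotation (tilt)
- roll: rotation around the viewing axis (roll)

Interpreting the example:

The camera moves 1.276 m backward (z: 0 → -1.276)
The camera moves 0.359 m downward (y: 0 → -0.359)
It moves slightly to the right (x: 0 → 0.055)
→ A dolly back combined with a pedestal down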
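
The same reading can be done programmatically. The sketch below is one possible rule of thumb, not what extract_camera_for_inpainting.py does: `describe_camera_motion`, the 0.1 cutoff for ignoring small shifts, and the "truck" label for sideways movement are all assumptions.

```python
import math

def describe_camera_motion(start, end, min_shift=0.1):
    """Turn a start/end camera position (Y-up, +z forward) into coarse labels."""
    dx, dy, dz = (end[k] - start[k] for k in ("x", "y", "z"))
    labels = []
    if abs(dz) >= min_shift:
        labels.append("dolly in" if dz > 0 else "dolly back")
    if abs(dy) >= min_shift:
        labels.append("pedestal up" if dy > 0 else "pedestal down")
    if abs(dx) >= min_shift:
        labels.append("truck right" if dx > 0 else "truck left")
    return {
        "delta": {"x": dx, "y": dy, "z": dz},
        "net_distance": math.sqrt(dx * dx + dy * dy + dz * dz),
        "labels": labels or ["static"],
    }

# start_position / end_position from the cameras.json example
start = {"x": -5.3e-06, "y": -8.8e-05, "z": -0.00012}
end = {"x": 0.055, "y": -0.359, "z": -1.276}
motion = describe_camera_motion(start, end)
print(motion["labels"])  # ['dolly back', 'pedestal down']
```

With this cutoff the 0.055 shift to the right is treated as negligible, matching the interpretation above. In the pipeline itself this translation is left to the LLM, so a rule-based version like this would mainly be useful as a sanity check on the generated text.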

Step 4: Prompt generation with the Qwen 2.5 LLM

Purpose: generate a continuous inpainting prompt for VACE from the VLM frame descriptions and the camera data

Tool: Qwen 2.5 (Large Language Model)

Input:

Frame description JSON (frame_descriptions.json)
Camera data JSON (cameras.json)

Output:

Inpainting prompt (inpainting_prompt.json)

Implementation example: generate_inpainting_prompt.py

# Entry-point example
python generate_inpainting_prompt.py \
    --masked-descriptions frame_descriptions.json \
    --camera-data cameras.json \
    --output inpainting_prompt.json

# Output:
# - inpainting_prompt.json

LLM prompt design:

System prompt:

You are an expert at creating scene descriptions for video inpainting tasks.

INSTRUCTIONS:
1. You will receive frame-by-frame descriptions of a video with some regions masked (marked as gray boxes).
2. Create a single continuous narrative describing the visible scene throughout the video.
3. IGNORE and DO NOT MENTION any gray boxes or masked regions.
4. Include camera movement information in your description.
5. Focus on: people, objects, environment, actions that are VISIBLE.

Output ONLY valid JSON:
{
  "camera_motion": "How the camera moves (dolly, pan, tilt, etc.)",
  "scene_context": "Brief description of the environment",
  "inpainting_prompt": "A single continuous narrative of the visible scene throughout the video, including temporal flow and camera movement."
}

User prompt:

I have a video with some regions masked (indicated by gray boxes).

VIDEO FRAME DESCRIPTIONS:
{contents of frame_descriptions.json}

CAMERA MOVEMENT DATA:
{contents of cameras.json}

YOUR TASK:
Create a single continuous narrative describing what is VISIBLE in the video.
- Ignore all mentions of gray boxes/gray rectangles
- Describe the temporal flow: how the scene progresses from start to end
- Include camera movement
- Focus only on visible elements: people, objects, environment, actions

Example output (inpainting_prompt.json):

{
  "camera_motion": "The camera performs a dolly back movement combined with a pedestal down, moving backward 1.3 meters while descending 0.36 meters, with minimal horizontal shift.",
  "scene_context": "An old brick building with ivy growth, featuring a white-framed window and weathered facade.",
  "inpainting_prompt": "A scene in front of a dilapidated brick building with ivy overgrowth and peeling paint. The building has a white-framed window and a red and white facade with a chimney. The environment appears overcast. The camera slowly moves backward and descends, revealing more of the building's weathered structure and the metal framework above. The scene maintains a consistent atmosphere of urban decay throughout the shot."
}

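Role of the LLM:

Filtering: removes the VLM's mentions of the "gray box"
Temporal integration: turns per-frame descriptions into a continuous narrative
Camera motion integration: converts numeric data (position, rotation) into a natural-language description of the movement
Prompt optimization: shapes the text into a form that VACE can follow easily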
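
Since the filtering is delegated to the LLM, it may be worth checking the reply before passing it to VACE. Below is a minimal sketch of such a check; `parse_llm_response`, the required-key list (taken from the output format in the system prompt), and the gray-box regular expression are illustrative assumptions rather than part of the original scripts.

```python
import json
import re

REQUIRED_KEYS = ("camera_motion", "scene_context", "inpainting_prompt")
GRAY_BOX = re.compile(r"gr[ae]y\s+(rectangular\s+)?(box|rectangle)", re.IGNORECASE)

def parse_llm_response(text):
    """Extract the JSON object from an LLM reply and sanity-check its content."""
    # Models sometimes wrap the JSON in extra prose, so cut from the first { to the last }
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM response")
    data = json.loads(text[start:end + 1])
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if GRAY_BOX.search(data["inpainting_prompt"]):
        raise ValueError("prompt still mentions the masked region")
    return data

reply = (
    "Here is the result:\n"
    '{"camera_motion": "dolly back", "scene_context": "brick building", '
    '"inpainting_prompt": "A brick building with ivy."}'
)
parsed = parse_llm_response(reply)
print(parsed["inpainting_prompt"])
```

On a `ValueError` the simplest recovery is to call the LLM again, since a prompt that still mentions a gray box could plausibly lead VACE to paint one.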

Step 5: Inpainting with VACE

Purpose: remove the target object from the video using the bbox mask and the prompt

Tool: VACE (video creation and editing model, used here for inpainting)

Input:

Input video (input.mp4)
Bbox mask video (bbox_person_all.mp4)
Inpainting prompt (inpainting_prompt.json)

Output:

Inpainted video (output_inpainted.mp4)

Implementation: vace_wan_inference.py

# Entry-point example
python vace/vace_wan_inference.py \
    --src_video input.mp4 \
    --src_mask bbox_person_all.mp4 \
    --prompt "A scene in front of a dilapidated brick building with ivy overgrowth and peeling paint. The building has a white-framed window and a red and white facade with a chimney. The environment appears overcast. The camera slowly moves backward and descends, revealing more of the building's weathered structure and the metal framework above. The scene maintains a consistent atmosphere of urban decay throughout the shot." \
    --n_prompt "person, human, man, woman" \
    --output_dir output/ \
    --model_path "$VACE_MODEL" \
    --num_inference_steps 8 \
    --guidance_scale 3.5

# Output:
# - output/output_inpainted.mp4

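Parameters:

--src_video: input video
--src_mask: bbox mask video (white = region to remove, black = region to keep)
--prompt: inpainting prompt (generated by the LLM)
--n_prompt: negative prompt (class names of the object to remove)
--num_inference_steps: number of steps (8 = fast, 40 = high quality)
--guidance_scale: CFG scale (3.5 recommended)

Key points for VACE settings:

Dilating the bbox mask is recommended: expand the mask where needed to account for the VAE's spatial compression (1/8)
Use the negative prompt: explicitly suppress the class being removed ("person", etc.)
Minimal prompt: keep the prompt simple and rely on the model's inpainting ability
The 14B FP8 model is recommended: a balance of high quality and memory efficiency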
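
Because the masks here are rectangles, dilation can be done on the box coordinates before rendering. The sketch below grows a box by a margin and then snaps it outward to an 8-pixel grid to match the 1/8 spatial compression mentioned above. The function name `expand_box`, the 8-pixel margin, and the idea of snapping to the grid are assumptions for illustration, not settings taken from VACE.

```python
def expand_box(box, width, height, margin=8, grid=8):
    """Grow (x0, y0, x1, y1) by `margin` pixels, then snap outward to `grid`."""
    x0, y0, x1, y1 = box
    x0 = max(0, (x0 - margin) // grid * grid)          # round down to the grid
    y0 = max(0, (y0 - margin) // grid * grid)
    x1 = min(width, -(-(x1 + margin) // grid) * grid)  # round up to the grid
    y1 = min(height, -(-(y1 + margin) // grid) * grid)
    return x0, y0, x1, y1

print(expand_box((103, 50, 197, 301), width=640, height=480))  # (88, 40, 208, 312)
```

Snapping outward means every latent cell that touches the object falls entirely inside the mask, so no cell mixes object pixels with background pixels. For non-rectangular masks a morphological dilation on the mask image would play the same role.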

Implementation example

Complete pipeline script

Below is an example shell script that combines all of the steps:

#!/bin/bash
# complete_inpainting_pipeline.sh
# Automated inpainting pipeline for VACE

set -e  # stop on error

# === Settings ===
INPUT_VIDEO="input.mp4"
TARGET_OBJECT="person"
OUTPUT_DIR="output"
TEMP_DIR="temp"

# Script paths (adjust for your environment)
# e.g. SAM3_SCRIPT="/path/to/sam3/bbox_video.py"
SAM3_SCRIPT="bbox_video.py"
DEPTH_SCRIPT="extract_camera_for_inpainting.py"
PROMPT_SCRIPT="generate_inpainting_prompt.py"
VACE_SCRIPT="vace/vace_wan_inference.py"

# Model settings
DEPTH_MODEL="da3-large"  # da3-small, da3-base, da3-large, da3-giant
VACE_MODEL="VACE-14B-FP8"

# Frame sampling (first, middle, last)
TOTAL_FRAMES=$(ffprobe -v error -select_streams v:0 -count_packets \
    -show_entries stream=nb_read_packets -of csv=p=0 "$INPUT_VIDEO")
FRAME_INDICES="0,$((TOTAL_FRAMES / 2)),$((TOTAL_FRAMES - 1))"

mkdir -p "$OUTPUT_DIR" "$TEMP_DIR"

echo "=========================================="
echo "VACE Automated Inpainting Pipeline"
echo "=========================================="
echo "Input: $INPUT_VIDEO"
echo "Target: $TARGET_OBJECT"
echo "Frames: $FRAME_INDICES"
echo ""

# === Step 1: Generate the bbox mask with SAM3 ===
echo "[Step 1/5] Generating bbox mask with SAM3..."
python "$SAM3_SCRIPT" \
    --input "$INPUT_VIDEO" \
    --prompt "$TARGET_OBJECT" \
    --score-thresh 0.3

BBOX_VIDEO="bbox_${TARGET_OBJECT}_all.mp4"
mv "$BBOX_VIDEO" "$TEMP_DIR/"
echo "  Bbox mask saved: $TEMP_DIR/$BBOX_VIDEO"
echo ""

# === Step 2: Describe the scene with FastVLM ===
echo "[Step 2/5] Describing scene with FastVLM..."
# (pseudocode: replace with your actual FastVLM invocation)
python scripts/describe_masked_video.py \
    --video "$INPUT_VIDEO" \
    --bbox "$TEMP_DIR/$BBOX_VIDEO" \
    --frame-indices "$FRAME_INDICES" \
    --output "$TEMP_DIR/frame_descriptions.json"
echo "  Descriptions saved: $TEMP_DIR/frame_descriptions.json"
echo ""

# === Step 3: Extract camera motion with Depth Anything v3 ===
echo "[Step 3/5] Extracting camera motion with Depth Anything v3..."
python "$DEPTH_SCRIPT" \
    --video "$INPUT_VIDEO" \
    --model "$DEPTH_MODEL" \
    --frame-indices "$FRAME_INDICES" \
    --output "$TEMP_DIR" \
    --ensure-yup \
    --no-depth-video
echo "  Camera data saved: $TEMP_DIR/cameras.json"
echo ""

# === Step 4: Generate the prompt with the Qwen LLM ===
echo "[Step 4/5] Generating inpainting prompt with Qwen 2.5..."
python "$PROMPT_SCRIPT" \
    --masked-descriptions "$TEMP_DIR/frame_descriptions.json" \
    --camera-data "$TEMP_DIR/cameras.json" \
    --output "$TEMP_DIR/inpainting_prompt.json"
echo "  Prompt saved: $TEMP_DIR/inpainting_prompt.json"
echo ""

# Extract the prompt
PROMPT=$(python -c "import json; print(json.load(open('$TEMP_DIR/inpainting_prompt.json'))['inpainting_prompt'])")
echo "  Generated prompt: $PROMPT"
echo ""

# === Step 5: Inpaint with VACE ===
echo "[Step 5/5] Running VACE inpainting..."
python "$VACE_SCRIPT" \
    --src_video "$INPUT_VIDEO" \
    --src_mask "$TEMP_DIR/$BBOX_VIDEO" \
    --prompt "$PROMPT" \
    --n_prompt "$TARGET_OBJECT, human, man, woman" \
    --output_dir "$OUTPUT_DIR" \
    --model_path "$VACE_MODEL" \
    --num_inference_steps 8 \
    --guidance_scale 3.5
echo "  Output saved: $OUTPUT_DIR/output_inpainted.mp4"
echo ""

echo "=========================================="
echo "Pipeline completed successfully!"
echo "Output: $OUTPUT_DIR/output_inpainted.mp4"
echo "=========================================="
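
Two small pieces of the script, the first/middle/last frame sampling and the prompt extraction, can also be written as Python helpers if you prefer to drive the pipeline from Python. This is a sketch under that assumption; `sample_frame_indices` and `extract_prompt` are hypothetical names that do not appear in the original scripts.

```python
import json

def sample_frame_indices(total_frames):
    """First, middle, and last frame index, without duplicates on very short clips."""
    if total_frames < 1:
        raise ValueError("video has no frames")
    return sorted({0, total_frames // 2, total_frames - 1})

def extract_prompt(json_text):
    """Return the non-empty 'inpainting_prompt' field of inpainting_prompt.json."""
    prompt = json.loads(json_text).get("inpainting_prompt", "").strip()
    if not prompt:
        raise ValueError("no 'inpainting_prompt' in the LLM output")
    return prompt

print(sample_frame_indices(81))  # [0, 40, 80]
print(sample_frame_indices(2))   # [0, 1]
print(extract_prompt('{"inpainting_prompt": "A brick building with ivy."}'))
```

Failing early on an empty prompt avoids starting a long VACE run with nothing to condition on.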

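Usage example

# Run the pipeline
./complete_inpainting_pipeline.sh

# Or run each step individually
# (assumes SAM3_SCRIPT, DEPTH_SCRIPT, and PROMPT_SCRIPT are set as in the pipeline script)
# Step 1
python "$SAM3_SCRIPT" --input video.mp4 --prompt "person"

# Step 2
python scripts/describe_masked_video.py --video video.mp4 --bbox bbox_person_all.mp4 \
    --frame-indices "0,40,80" --output descriptions.json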

# Step 3
python "$DEPTH_SCRIPT" \
    --video video.mp4 --frame-indices "0,40,80" --output cameras/

# Step 4
python "$PROMPT_SCRIPT" \
    --masked-descriptions descriptions.json \
    --camera-data cameras/cameras.json \
    --output prompt.json

# Step 5
python vace/vace_wan_inference.py \
    --src_video video.mp4 \
    --src_mask bbox_person_all.mp4 \
    --prompt "$(jq -r .inpainting_prompt prompt.json)" \
    --output_dir output/

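Summary

This is only a proposal, but the approach has the following characteristics.

Semi-automated pipeline:

Input: a video plus the target object named in natural language (e.g. "person")
Output: the video with the target object removed

Independent steps:

Each component runs on its own
Data is passed between steps as JSON
Modules are easy to swap

Intelligent processing by the LLM:

Camera extrinsics (numbers) → natural-language description
Frame descriptions → continuous narrative
Automatic filtering of the masked region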

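Possible next steps and what to expect

Better VLMs:

Now: per-frame descriptions
Ideal: a video-native VLM that can also describe motion

Prompt optimization:

Now: the LLM-generated prompt is used as-is
Ideal: fine-tune prompt generation so that the prompts are optimized for VACE

Faster processing:

Now: the steps run sequentially
Ideal: pipeline parallelization plus lighter models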
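
As a starting point for parallelization, note that camera extraction (Step 3) only needs the input video, so it could run alongside mask generation and scene description (Steps 1 and 2). The sketch below shows that structure with stand-in functions; the function names and return values are placeholders, and on a single GPU the models may compete for memory, so the actual gain is uncertain.

```python
from concurrent.futures import ThreadPoolExecutor

def run_mask_then_describe(video):
    # Stand-in for Step 1 (SAM3) followed by Step 2 (FastVLM)
    return f"{video}:frame_descriptions.json"

def run_camera_extraction(video):
    # Stand-in for Step 3 (Depth Anything v3), which only needs the input video
    return f"{video}:cameras.json"

def run_independent_branches(video):
    """Run the two independent branches concurrently and wait for both."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        descriptions = pool.submit(run_mask_then_describe, video)
        cameras = pool.submit(run_camera_extraction, video)
        # result() blocks until the branch finishes and re-raises its exceptions
        return descriptions.result(), cameras.result()

print(run_independent_branches("input.mp4"))
```

Threads are sufficient if each branch just launches an external script with `subprocess.run`, because the waiting happens outside the Python interpreter.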

Tools used

Implementation and verification were done with the following tools:

SAM3: bbox_video.py - generates bbox masks from the video
FastVLM: frame description generation with a VLM
Depth Anything v3: extract_camera_for_inpainting.py - camera extrinsic extraction
Qwen 2.5: generate_inpainting_prompt.py - prompt generation
VACE: vace_wan_inference.py - video inpainting

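Conclusion

I tried building an automated pipeline for video inpainting with VACE. Starting from nothing more than a natural-language instruction such as "person", it automates everything from mask generation and prompt writing through to running the inpainting.

Because each component is independent, any of them can be swapped out easily when a better model appears. As VLMs and LLMs in particular improve, the results should get better as well.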
