Trying Cosmos-Reason1 on a MacBook Pro M4: Even More Advanced Video Analysis

Last updated at 2025-09-15, posted at 2025-09-15


Introduction

Hi, this is Syun.

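In the previous article, I covered the results of running cosmos-reason1 on a MacBook Pro M4 (32GB). This follow-up takes the video analysis further: the model was able to estimate a car's speed and direction of travel, and this post walks through the details.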

Besides the analysis results obtained with MPS (Metal Performance Shaders), I also cover the errors I ran into, how I worked around them, and what changed in the code as a result.

Code

main.py


# Usage:
# python main.py --prompt ../prompts/question.yaml --question "How fast is the car going?" --videos "/path/to/video with spaces.mp4" -v
#
# Note: if the video path contains spaces, always wrap it in quotes.

import sys
import pathlib

# Add the cosmos_reason1_utils source directory to the import path
sys.path.append(str(pathlib.Path(__file__).parents[1] / "cosmos_reason1_utils" / "src"))

import argparse
import collections
import textwrap
import yaml
import torch
import os
import qwen_vl_utils
import transformers

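# Disable Triton kernels (works around Triton incompatibilities on macOS/MPS).
# These must be set before vLLM is imported. This version of the script no
# longer imports vLLM, so they remain only as a safeguard.
os.environ["VLLM_USE_TRITON"] = "0"
os.environ["VLLM_USE_MROPE_TRITON"] = "0"
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"
# vLLM does not support MPS, so force its CPU backend
os.environ["VLLM_TARGET_DEVICE"] = "cpu"
# Additional setting to disable Triton
os.environ["TRITON_DISABLE"] = "1"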

from rich import print
from rich.pretty import pprint

from cosmos_reason1_utils.script import init_script
from cosmos_reason1_utils.text import (
    PromptConfig,
    create_conversation,
    extract_tagged_text,
)
from cosmos_reason1_utils.vision import (
    VisionConfig,
    overlay_text_on_tensor,
    save_tensor,
)

# Initialize
init_script()

ROOT = pathlib.Path(__file__).parents[1].resolve()
SEPARATOR = "-" * 20

def pprint_dict(d: dict, name: str):
    """Helper that pretty-prints a dict as a named tuple."""
    pprint(collections.namedtuple(name, d.keys())(**d), expand_all=True)

def main():
    # Parse command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--images", type=str, nargs="*", help="Image paths")
    parser.add_argument("--videos", type=str, nargs="*", help="Video paths")
    parser.add_argument(
        "--timestamp", action="store_true", help="Overlay timestamp on video frames"
    )
    parser.add_argument(
        "--prompt", type=str, required=True, help="Path to prompt yaml file"
    )
    parser.add_argument(
        "--question", type=str, help="Question to ask the model (user prompt)"
    )
    parser.add_argument(
        "--reasoning", action="store_true", help="Enable reasoning trace"
    )
    parser.add_argument(
        "--vision-config",
        type=str,
        default=f"{ROOT}/configs/vision_config.yaml",
        help="Path to vision config yaml file",
    )
    parser.add_argument(
        "--sampling-params",
        type=str,
        default=f"{ROOT}/configs/sampling_params.yaml",
        help="Path to sampling parameters yaml file",
    )
    parser.add_argument(
        "--model",
        type=str,
        default="nvidia/Cosmos-Reason1-7B",
        help="Model name or path",
    )
    parser.add_argument(
        "--revision", type=str, help="Model revision (branch name, tag name, or commit id)"
    )
    parser.add_argument("-v", "--verbose", action="store_true", help="Verbose output")
    parser.add_argument(
        "-o", "--output", type=str, help="Output directory for debugging"
    )
    args = parser.parse_args()

    images: list[str] = args.images or []
    videos: list[str] = args.videos or []
    
    if args.verbose:
        print(f"Images: {images}")
        print(f"Videos: {videos}")

    # Load the config files
    prompt_kwargs = yaml.safe_load(open(args.prompt, "rb"))
    prompt_config = PromptConfig.model_validate(prompt_kwargs)
    vision_kwargs = yaml.safe_load(open(args.vision_config, "rb"))
    _vision_config = VisionConfig.model_validate(vision_kwargs)
    sampling_kwargs = yaml.safe_load(open(args.sampling_params, "rb"))
    # sampling_params = vllm.SamplingParams(**sampling_kwargs)

    if args.verbose:
        pprint_dict(vision_kwargs, "VisionConfig")
        pprint_dict(sampling_kwargs, "SamplingParams")

    # Build the conversation
    system_prompts = [open(f"{ROOT}/prompts/addons/english.txt").read()]
    if prompt_config.system_prompt:
        system_prompts.append(prompt_config.system_prompt)
    if args.reasoning and "<think>" not in prompt_config.system_prompt:
        if extract_tagged_text(prompt_config.system_prompt)[0]:
            raise ValueError(
                "Prompt already contains output format. Cannot add reasoning."
            )
        system_prompts.append(open(f"{ROOT}/prompts/addons/reasoning.txt").read())
    system_prompt = "\n\n".join(map(str.rstrip, system_prompts))

    if args.question:
        user_prompt = args.question
    else:
        user_prompt = prompt_config.user_prompt
    if not user_prompt:
        raise ValueError("No user prompt provided.")
    user_prompt = user_prompt.rstrip()
    conversation = create_conversation(
        system_prompt=system_prompt,
        user_prompt=user_prompt,
        images=images,
        videos=videos,
        vision_kwargs=vision_kwargs,
    )

    if args.verbose:
        pprint(conversation, expand_all=True)

    print(SEPARATOR)
    print("System:")
    print(textwrap.indent(system_prompt.rstrip(), "  "))
    print("User:")
    print(textwrap.indent(user_prompt.rstrip(), "  "))
    print(SEPARATOR)

    # torch device for the model and its inputs (MPS if available, otherwise CPU)
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    print(f"Using device: {device}")

    # Load the model (using Transformers)
    model = transformers.AutoModelForVision2Seq.from_pretrained(
        args.model,
        revision=args.revision,
        torch_dtype=torch.float16 if device.type == "mps" else torch.float32,
        device_map="auto" if device.type == "mps" else None,
        trust_remote_code=True,
    )
    if device.type != "mps":
        model = model.to(device)
    
    # Process the input data
    try:
        processor: transformers.Qwen2_5_VLProcessor = (
            transformers.AutoProcessor.from_pretrained(args.model)
        )
    except Exception as e:
        print(f"Error loading processor: {e}")
        return
    prompt = processor.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs, video_kwargs = qwen_vl_utils.process_vision_info(
        conversation, return_video_kwargs=True
    )

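    # video_inputs is None when no videos are passed, so guard before iterating
    if args.timestamp and video_inputs is not None:
        for i, video in enumerate(video_inputs):
            video_inputs[i] = overlay_text_on_tensor(video, fps=video_kwargs["fps"][i])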

    if args.output:
        if image_inputs is not None:
            for i, image in enumerate(image_inputs):
                save_tensor(image, f"{args.output}/image_{i}")
        if video_inputs is not None:
            for i, video in enumerate(video_inputs):
                save_tensor(video, f"{args.output}/video_{i}")

    # Run inference
    try:
        # Inference via Transformers
        inputs = processor(
            text=[prompt],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt"
        )
        
        # Move tensors to the device
        inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
        
        # Generation parameters
        generation_config = {
            "max_new_tokens": sampling_kwargs.get("max_tokens", 512),
            "temperature": sampling_kwargs.get("temperature", 0.7),
            "do_sample": sampling_kwargs.get("temperature", 0.7) > 0,
            "pad_token_id": processor.tokenizer.eos_token_id,
        }
        
        with torch.no_grad():
            outputs = model.generate(**inputs, **generation_config)
        
        # Decode only the newly generated tokens (drop the prompt)
        generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
        output_text = processor.tokenizer.decode(generated_ids, skip_special_tokens=True)
        
    except Exception as e:
        print(f"Error during inference: {e}")
        import traceback
        traceback.print_exc()
        return

    print(SEPARATOR)
    print("Assistant:")
    print(textwrap.indent(output_text.rstrip(), "  "))
    print(SEPARATOR)

    result, _ = extract_tagged_text(output_text)
    if args.verbose and result:
        pprint_dict(result, "Result")


if __name__ == "__main__":
    main()
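
The script reads vLLM-style sampling settings (max_tokens, temperature) and maps them onto arguments for generate(). Below is a minimal sketch of that mapping on its own. The helper name to_generation_kwargs is hypothetical, and unlike main.py it passes temperature only when sampling is enabled; that is my own variation, not something the original script does.

```python
def to_generation_kwargs(sampling_kwargs: dict, pad_token_id: int) -> dict:
    """Translate vLLM-style sampling settings into generate() keyword arguments."""
    temperature = sampling_kwargs.get("temperature", 0.7)
    kwargs = {
        "max_new_tokens": sampling_kwargs.get("max_tokens", 512),
        "do_sample": temperature > 0,  # temperature 0 means greedy decoding
        "pad_token_id": pad_token_id,
    }
    # Assumption: temperature is only meaningful when sampling, so skip it otherwise.
    if kwargs["do_sample"]:
        kwargs["temperature"] = temperature
    return kwargs


print(to_generation_kwargs({"max_tokens": 256, "temperature": 0.0}, pad_token_id=0))
```

Keeping the mapping in a small function makes it easy to check the greedy and sampling cases without loading the model.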

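Video analysis results

Command 1: Estimating the car's speed

The first analysis used a video from a racing simulation game and asked the model to estimate the car's speed. I ran it with the following command.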

python main.py --prompt ../prompts/question.yaml --question "How fast is the car going in the race?" --videos /Users/syun/Downloads/gtr_test.mp4 -v

Result

The model answered as follows.

The car is going 202 km/h, then 210 km/h, and later 228 km/h.

The answer was highly accurate and tracked the car's in-game speed correctly. The biggest win was that the model ran smoothly, with adequate performance, even on MPS.
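
If you want to compare the answer with the speedometer programmatically, the speeds can be pulled out of the text. This is a minimal sketch that assumes the model keeps answering in the "NNN km/h" form shown above; extract_speeds_kmh is a hypothetical helper, not part of cosmos_reason1_utils.

```python
import re


def extract_speeds_kmh(answer: str) -> list[int]:
    """Return every integer that is directly followed by 'km/h' in the answer."""
    return [int(value) for value in re.findall(r"(\d+)\s*km/h", answer)]


answer = "The car is going 202 km/h, then 210 km/h, and later 228 km/h."
speeds = extract_speeds_kmh(answer)
print(speeds)
```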

That alone felt a little underwhelming, so I also asked the model where the car should go next.

Command 2: Predicting the car's direction

Next, I ran an experiment to predict which way the car should go. The question was "What direction should the car go next?"

python main.py --prompt ../prompts/question.yaml --question "What direction should the car go next in the race?" --videos /Users/syun/Downloads/gtr_test.mp4 -v

Result

The model answered as follows.

The car should turn left, as indicated by the green arrows on the track.

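The answer is based on the green arrows shown in the game, and it correctly predicted that the car should turn left next. This matched the in-game guidance, so the prediction was accurate here as well.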

This is the final frame of the video. Pretty impressive.

Handling the error

At first, a Triton error occurred on macOS because of vLLM. These are the steps I took to resolve it.
(In other words, a plain pip install vllm apparently does not work. I also learned that Kiro IDE is more impressive than Cursor.)

Cause of the error

Original error:

AttributeError: module 'triton' has no attribute 'next_power_of_2'

vLLM depends on the Triton library, and Triton does not work properly on macOS. That was the cause.
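
A script could detect this situation up front instead of failing inside vLLM. The sketch below checks only for the attribute named in the error message; the function name triton_is_usable and the backend variable are hypothetical, and passing this check does not guarantee that vLLM itself will run.

```python
import importlib
import importlib.util


def triton_is_usable() -> bool:
    """Return True only if triton imports and exposes next_power_of_2."""
    if importlib.util.find_spec("triton") is None:
        return False  # triton is not installed at all
    try:
        triton = importlib.import_module("triton")
    except Exception:
        return False  # installed, but importing fails on this platform
    return hasattr(triton, "next_power_of_2")


# Hypothetical switch: fall back to Transformers when Triton looks unusable.
backend = "vllm" if triton_is_usable() else "transformers"
print(f"Selected backend: {backend}")
```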

Steps to resolve

Remove vLLM
I ran pip uninstall vllm ray -y to remove vLLM, together with Ray, which had been installed alongside it.
```
pip uninstall vllm ray -y
```

Code changes
In main.py, I replaced the parts that used vLLM with the Transformers library.

# Before (vLLM)
llm = vllm.LLM(**llm_kwargs)

# After (Transformers)
model = transformers.AutoModelForVision2Seq.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

Inference changes
Instead of vLLM, I used the standard Transformers generation flow.

inputs = processor(text=[prompt], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
outputs = model.generate(**inputs, **generation_config)
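
The two lines above stop at generate(). In main.py, the tensors are also moved to the device first, and the prompt tokens are sliced off the output before decoding. That slicing step is easy to miss, so here is a minimal sketch using plain Python lists as stand-ins for the tensors. strip_prompt_tokens and the token IDs are made up for illustration.

```python
def strip_prompt_tokens(output_ids: list[int], prompt_length: int) -> list[int]:
    """Drop the echoed prompt so only the newly generated token IDs remain."""
    return output_ids[prompt_length:]


# Stand-in values; real IDs come from the processor and the model.
prompt_ids = [101, 7592, 2088]                # plays the role of inputs["input_ids"][0]
output_ids = prompt_ids + [2023, 2003, 102]   # plays the role of outputs[0]

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)
```

Without this step, the decoded text would start with the system and user prompts instead of the assistant's answer.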

What I tried first (failed)

pip install "vllm==0.6.3.post1" --no-deps
pip install ray transformers accelerate

I tried an older version of vLLM, but it caused dependency conflicts, so I ended up removing it.

What I ended up using

pip install qwen-vl-utils

This was only a confirmation install of qwen-vl-utils; it was already installed.
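
After installing and uninstalling packages like this, it helps to confirm what is actually present in the environment. A minimal sketch using the standard library follows; installed_version is a hypothetical helper, and the package list simply mirrors the names mentioned in this article.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("vllm", "ray", "transformers", "accelerate", "qwen-vl-utils"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```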

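Result

The Triton error went away, and inference succeeded using MPS (Metal Performance Shaders).
The video analysis was able to predict the car's speed and direction, and the model correctly recognized the action in the game.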

The car is going 202 km/h, then 210 km/h, and later 228 km/h.

The car should turn left, as indicated by the green arrows on the track.

With this, Cosmos-Reason1 can be used for video analysis on macOS without problems, and moving to the Transformers library avoided the compatibility issues between vLLM, macOS, and MPS.

8x speed: https://t.co/nirrBUoUH0 pic.twitter.com/YbYztTVmTn
(SYUN@Decoder, @syun88AI, September 15, 2025)

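Summary

Running video analysis with Cosmos-Reason1 on a MacBook Pro M4 (32GB), I was able to predict a car's speed and direction with high accuracy. I also worked around the vLLM problem and confirmed that, with the Transformers library, the model runs stably on macOS.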

I'm really looking forward to where Cosmos goes from here.

Final notes:

With the resolution and frame rate set appropriately, everything ran smoothly on MPS.
MPS differs from CUDA in memory efficiency, but it still delivered adequate performance.
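
To see why resolution and frame rate matter so much, a rough estimate of how many visual tokens a clip produces helps. The sketch below assumes a Qwen2.5-VL-style processor in which one token covers a 28x28 pixel area and two frames share one temporal patch. Those constants, the function name, and the 30-second 448x784 clip are all assumptions for illustration, not values taken from this article's configs.

```python
import math

PIXELS_PER_TOKEN_SIDE = 28      # assumption: 14 px patches merged 2x2
FRAMES_PER_TEMPORAL_PATCH = 2   # assumption: two frames per temporal patch


def estimate_video_tokens(duration_s: float, fps: float, height: int, width: int) -> int:
    """Roughly estimate the number of visual tokens for one sampled video."""
    frames = max(FRAMES_PER_TEMPORAL_PATCH, math.floor(duration_s * fps))
    temporal_patches = math.ceil(frames / FRAMES_PER_TEMPORAL_PATCH)
    tokens_per_patch = (height // PIXELS_PER_TOKEN_SIDE) * (width // PIXELS_PER_TOKEN_SIDE)
    return temporal_patches * tokens_per_patch


# Hypothetical 30-second clip resized to 448x784, sampled at different rates.
for fps in (1, 2, 4):
    tokens = estimate_video_tokens(duration_s=30, fps=fps, height=448, width=784)
    print(f"fps={fps}: about {tokens} visual tokens")
```

Under these assumptions the token count grows linearly with both the sampling rate and the pixel area, so halving either one roughly halves the sequence the model has to process.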

Thank you for reading to the end, as always.
