Trying cosmos-reason1 on a MacBook Pro M4

Posted at 2025-09-15

Introduction

Hello, this is Shun.

In this article I install cosmos-reason1 on a MacBook Pro M4 (32GB) and share the results of actually running it.

At first I wanted to run it with CUDA on my own NVIDIA RTX 3080 Laptop GPU (8GB). However, an 8GB GPU runs into the dreaded OOM (Out of Memory) error. So I tried whether it would run on macOS instead, and, surprisingly, it ran smoothly using MPS (Metal Performance Shaders).
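To see why 8GB is not enough, a rough estimate of the weight size helps. The sketch below is only back-of-the-envelope arithmetic: the parameter count of about 7 billion is an assumption taken from the "7B" in the model name, and activations, the vision encoder's working memory, and the KV cache are ignored.

```python
# Back-of-the-envelope estimate of the memory needed for model weights alone.
# ASSUMPTION: ~7 billion parameters, inferred only from the "7B" in the model name.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the size of the weights, in GiB, for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    params = 7e9  # assumed, see the note above
    for dtype in ("float32", "float16"):
        print(f"{dtype}: {weight_memory_gib(params, dtype):.1f} GiB")
```

Under that assumption the float16 weights alone come to about 13 GiB, which exceeds 8GB of VRAM but fits in 32GB of memory shared between CPU and GPU.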

Environment

  • MacBook Pro M4 (32GB RAM, Apple Silicon)
  • Python 3.12.2

Step 1: Clone & environment setup

mkdir cosmos_pj
cd cosmos_pj
git clone https://github.com/nvidia-cosmos/cosmos-reason1.git  
python3 -m venv .cosmos_reason1
source .cosmos_reason1/bin/activate
pip install --upgrade pip wheel
pip install torch torchvision torchaudio
pip install "transformers>=4.56.1" qwen-vl-utils pillow opencv-python av imageio imageio-ffmpeg
cd cosmos-reason1
mkdir mps_pj
cd mps_pj
touch run_reason_mps.py
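After the installation, it can be useful to confirm that every package is importable from the virtual environment before downloading the model. The following is a small sketch using only the standard library; the pip-name to import-name mapping is my own and is not part of the original article.

```python
import importlib.util

# pip package name -> import name (mapping assumed for illustration)
REQUIRED = {
    "torch": "torch",
    "torchvision": "torchvision",
    "torchaudio": "torchaudio",
    "transformers": "transformers",
    "qwen-vl-utils": "qwen_vl_utils",
    "pillow": "PIL",
    "opencv-python": "cv2",
    "av": "av",
    "imageio": "imageio",
    "imageio-ffmpeg": "imageio_ffmpeg",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it with the virtual environment activated; anything it lists can be installed with pip before moving on.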

Step 2: Check MPS

Next, check whether MPS can be used correctly.

python - << 'PY'                                                                               
import torch
print("torch:", torch.__version__)
print("MPS available:", torch.backends.mps.is_available())
PY

If MPS available: True is printed, MPS is enabled and working.
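The same check can be turned into a device choice with a fallback, so that a script still runs on a machine without MPS. This is a minimal sketch, not part of the original script; the helper name `pick_device` is hypothetical, and the order of preference (MPS, then CUDA, then CPU) is my own choice.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Prefer MPS, then CUDA, then fall back to CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    try:
        import torch

        device = pick_device(
            torch.backends.mps.is_available(),
            torch.cuda.is_available(),
        )
    except ImportError:
        device = "cpu"  # torch is not installed in this environment
    print("device:", device)
```

The returned string can be passed to `torch.device(...)` or to `.to(...)` on a tensor or model.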

Code to run

Next, I adapted the official code for MPS: torch_dtype is set to float16, and the device is selected automatically so that the model is placed on MPS.

run_reason_mps.py
# run_reason_mps.py  

from pathlib import Path
import qwen_vl_utils
import transformers
import torch

# ROOT = Path(__file__).parents[1]
SEPARATOR = "-" * 20
# video_path = "/Users/syun/python_project/cosmos_pj/cosmos-reason1/assets/sample.mp4"
video_path = '/Users/syun/Downloads/SSYouTube.online_2022初めてのAC2 GTR【車内視線】【しゅん】【レーシング練習】_1080p.mp4'


def main():
    # Load the model
    model_name = "nvidia/Cosmos-Reason1-7B"

    # Adjust `torch_dtype` and `device_map` for MPS (Metal)
    model = transformers.Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # fp16 is efficient on MPS
        device_map="auto"           # let the device be chosen automatically (MPS here)
    )

    # Load the processor that handles input preprocessing
    processor: transformers.Qwen2_5_VLProcessor = transformers.AutoProcessor.from_pretrained(model_name)

    # Build the input (video and text)
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                    "fps": 4,  # keep the sampling FPS low
                    "total_pixels": 6422528,  # pixel budget that bounds the token count; one 1080p frame is 2073600 pixels
                },
                {"type": "text", "text": "Describe this video."},
            ],
        }
    ]

    # Preprocess the text, images, and video
    text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = qwen_vl_utils.process_vision_info(conversation)

    # Convert text, images, and video into tensors the model accepts
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )

    # Move the inputs to the model's device (MPS)
    inputs = inputs.to(model.device)

    # Run inference
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=4096)

    # Strip the prompt tokens from each generated sequence before decoding
    generated_ids_trimmed = [
        out_ids[len(in_ids):]  # drop the input tokens
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids, strict=False)
    ]

    # Decode the output
    output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)

    # Print the result
    print(SEPARATOR)
    print(output_text[0])
    print(SEPARATOR)


if __name__ == "__main__":
    main()
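The numbers in the `total_pixels` comment can be checked with a little arithmetic. The sketch below makes two assumptions that the article does not state: that one visual token covers a 28x28 pixel patch (my reading of the Qwen2.5-VL preprocessing), and that the pixel budget is split evenly across the sampled frames. The actual rule inside qwen-vl-utils may differ, so treat the output as an order-of-magnitude guide.

```python
# Arithmetic behind the "total_pixels" setting in the script.
# ASSUMPTIONS (not from the article): 28x28 pixels per visual token,
# and an even split of the pixel budget across sampled frames.

TOTAL_PIXELS = 6_422_528        # value passed as "total_pixels"
PIXELS_1080P = 1920 * 1080      # 2,073,600, as noted in the script comment
PATCH_SIDE = 28                 # assumed pixels per token side


def token_budget(total_pixels: int, patch_side: int = PATCH_SIDE) -> int:
    """Number of patch-sized tokens that fit in the pixel budget."""
    return total_pixels // (patch_side * patch_side)


def full_hd_frames(total_pixels: int) -> float:
    """How many full-resolution 1080p frames the budget would cover."""
    return total_pixels / PIXELS_1080P


def per_frame_pixels(total_pixels: int, num_frames: int) -> int:
    """Pixels available per frame under an even split (assumed)."""
    return total_pixels // num_frames


if __name__ == "__main__":
    print("token budget:", token_budget(TOTAL_PIXELS))
    print("1080p frames covered:", round(full_hd_frames(TOTAL_PIXELS), 2))
    # Example: a 4-second clip sampled at fps=4 gives 16 frames.
    print("pixels per frame (16 frames):", per_frame_pixels(TOTAL_PIXELS, 16))
```

The budget covers only about three full 1080p frames, so with more sampled frames each frame has to be downscaled, which is why lowering `fps` or `total_pixels` reduces memory use.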

Results

Execution

(.cosmos_reason1) syun@syunnoMacBook-Pro mps_pj % python run_reason_mps.py

Example output:

--------------------
The video captures a suburban street scene viewed from inside a vehicle, likely equipped with a dashcam. The perspective is from the driver's seat, looking out through the windshield onto a quiet residential area under clear blue skies. 

In the foreground, part of the hood and windshield wipers of the recording vehicle are visible. To the left side of the frame, another car is parked along the curb, partially obstructing the view. This car appears to be a silver sedan.

The middle ground features a row of houses with well-maintained lawns and driveways. Each house has a driveway where cars are parked. On the far left, there is a house with a white fence and some greenery around it. Next to it, another house with a similar design but with more visible landscaping. Further down the street, a house with a beige exterior and a small porch is prominent. In front of this house, two vehicles are parked—one black SUV and one dark-colored sedan—while another dark-colored SUV is parked further down the driveway on the right side.

On the right side of the street, several other cars are parked along the curb. A gray SUV is closest to the camera, followed by a silver sedan and then another dark-colored SUV. The street itself is paved and marked with pedestrian crosswalk lines near the center of the frame. Overhead, power lines run parallel to the street, supported by utility poles.

The environment suggests a peaceful neighborhood with no visible pedestrians or moving traffic. The sunlight casts shadows from the trees onto the road, indicating that it is daytime, possibly late morning or early afternoon. The overall atmosphere is calm and serene, typical of a residential area.
--------------------

Result on a 4-second excerpt taken from an arbitrary part of my own video

This is the video I used.

--------------------
The video depicts a first-person view from inside a high-performance race car, likely part of a racing simulation game. 
The driver is navigating a racetrack under clear daylight conditions. The dashboard displays various gauges and indicators, including speed, engine temperature, and lap times, which suggest the car's performance metrics.

The track is surrounded by grandstands filled with spectators, indicating an organized racing event. 
The road surface appears smooth and well-maintained, with visible lane markings guiding the path. 
As the car progresses, it maintains a steady speed, indicated by the increasing distance traveled on the speedometer.

In the background, there are green barriers and fencing along the track edges, ensuring safety for both drivers and spectators. 
The environment includes trees and open skies, contributing to a bright and vibrant setting.
The car maneuvers through the track, passing other vehicles and approaching a turn marked by a red flag, signaling caution or a potential hazard ahead. 
The overall scene captures the intensity and precision required in motorsport racing.
--------------------

Impressive...
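The article does not say how the 4-second excerpt was cut. One possible way is the ffmpeg command line, sketched below under the assumption that `ffmpeg` is on the PATH; the file names and the start offset are hypothetical placeholders.

```python
import shutil
import subprocess
from pathlib import Path


def build_clip_command(src: str, dst: str, start_s: float, duration_s: float = 4.0) -> list[str]:
    """Build an ffmpeg command that copies `duration_s` seconds starting at `start_s`."""
    return [
        "ffmpeg",
        "-ss", str(start_s),     # seek to the start offset (seconds)
        "-i", src,               # input file
        "-t", str(duration_s),   # keep this many seconds
        "-c", "copy",            # copy streams without re-encoding
        dst,
    ]


if __name__ == "__main__":
    # Hypothetical file names and offset, for illustration only.
    cmd = build_clip_command("input.mp4", "clip_4s.mp4", start_s=60)
    print(" ".join(cmd))
    # Only run when ffmpeg and the input file are actually present.
    if shutil.which("ffmpeg") and Path("input.mp4").exists():
        subprocess.run(cmd, check=True)
```

Because the streams are copied rather than re-encoded, the cut snaps to nearby keyframes, so the clip may not be exactly 4 seconds long. The resulting file can then be used as `video_path` in the script.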

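Summary

I ran cosmos-reason1 in the MPS environment on a MacBook Pro M4 (32GB) and confirmed that it works. Being able to process video and text with MPS, without CUDA, was the biggest takeaway.

  • When processing video or images, setting the resolution and frame rate appropriately was enough for it to run smoothly on MPS as well.
  • MPS differs from CUDA in memory efficiency, but it delivered sufficient performance.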

Open issue

  • Loading the checkpoint was slow. The cause is unknown, but the external HDD may be a factor.
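To narrow down where the time goes, the load step can be timed separately from inference. The following is a minimal sketch of a timing helper; the name `timed` is hypothetical, and the `time.sleep` call is only a stand-in for the real `from_pretrained(...)` call in the script.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict[str, float]):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


if __name__ == "__main__":
    timings: dict[str, float] = {}
    with timed("load checkpoint", timings):
        time.sleep(0.1)  # stand-in for from_pretrained(...)
    for label, seconds in timings.items():
        print(f"{label}: {seconds:.2f} s")
```

Comparing the measured load time with the Hugging Face cache on the external HDD and on the internal SSD (the cache location can be changed with the `HF_HOME` environment variable) would show whether the drive is the bottleneck.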

Thank you for reading to the end.
