Trying out Apple's ml-fastvlm

Posted at 2025-05-15

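Introduction

Hi, I'm Shun. This time I ran ml-fastvlm, the multimodal model released by Apple, inside a Python venv environment.
It makes it easy to try tasks that combine images and language, so it is a great fit for robots and demos.
Apparently there is an app as well.

Below I've laid out everything from setup and the steps through to the results and my observations. I aimed for readability and flow while still covering the key points. I hope it's useful.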

Environment setup

The official documentation uses conda, but I prefer the lighter-weight venv, so I set things up with the following steps.

# Create a working directory and clone the repository
mkdir apple && cd apple
git clone https://github.com/apple/ml-fastvlm.git

# Create and activate the venv
python -m venv .venv
source .venv/bin/activate

# Upgrade pip and install the package
pip install -U pip
cd ml-fastvlm
pip install -e .
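
Before going further, it can help to confirm that the interpreter you are about to use really is the one inside `.venv`. The snippet below is a minimal sketch for that check, not part of the ml-fastvlm repository; the helper name `in_virtualenv` is my own.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when this interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("python:", sys.executable)
    print("inside venv:", in_virtualenv())
```

If this prints `inside venv: False`, the `source .venv/bin/activate` step probably did not take effect in the current shell.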

Downloading the model checkpoints

Download all the pretrained checkpoints in one go. Depending on your connection this can take a while, so grab a ☕️ while you wait.

bash get_models.sh
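
Once the download finishes, a quick listing shows which checkpoints are available. This is a minimal sketch that assumes the script unpacks each model into its own sub-directory of `./checkpoints` (the location used by the sample command in this post); `list_checkpoints` is a hypothetical helper name.

```python
from pathlib import Path


def list_checkpoints(root: str = "checkpoints") -> list[str]:
    """Return the sorted names of the sub-directories under root."""
    base = Path(root)
    if not base.is_dir():
        # Nothing downloaded yet, or we are in the wrong directory.
        return []
    return sorted(p.name for p in base.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Run this from the ml-fastvlm directory.
    for name in list_checkpoints():
        print(name)
```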

Running a sample

The following command generates a caption for an image of your choice. Replace the --image-file path with your own image.

python predict.py \
  --model-path ./checkpoints/llava-fastvithd_0.5b_stage2 \
  --image-file /Users/syun/Downloads/PXL_20250311_052458229.jpg \
  --prompt "Describe the image."
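
To caption several images in a row, the same command can be driven from a small Python wrapper. The sketch below assumes it is run from the ml-fastvlm directory and that `predict.py` prints the caption to standard output; `build_command` and `caption` are hypothetical helper names, and only the flags shown above are used.

```python
import subprocess
import sys

# Checkpoint path taken from the sample command above.
MODEL_PATH = "./checkpoints/llava-fastvithd_0.5b_stage2"


def build_command(model_path: str, image_file: str,
                  prompt: str = "Describe the image.") -> list[str]:
    """Build the predict.py invocation as an argument list."""
    return [
        sys.executable, "predict.py",
        "--model-path", model_path,
        "--image-file", image_file,
        "--prompt", prompt,
    ]


def caption(model_path: str, image_file: str,
            prompt: str = "Describe the image.") -> str:
    """Run predict.py once and return whatever it printed to stdout."""
    result = subprocess.run(
        build_command(model_path, image_file, prompt),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Usage: python batch_caption.py image1.jpg image2.jpg ...
    for image in sys.argv[1:]:
        print(f"== {image} ==")
        print(caption(MODEL_PATH, image))
```

Each call starts a new process and reloads the model, so this is convenient for a handful of images rather than for large batches.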

What the results look like

Comparing the results

● 0.5B Stage2 model

The image depicts a white shelving unit with three shelves. The shelves are empty except for a few items placed on the top and bottom shelves.

Top Shelf:

There is a red can of what appears to be a canned food product...
A small red ball is placed on the top shelf...

Middle Shelf:

On the middle shelf, there are a banana, a bell pepper, a lemon, and a small orange fruit...

Bottom Shelf:

On the bottom shelf, there is a single yellow tennis ball...

Background:

The background consists of a light-colored floor and a grayish wall...

● 7B Stage3 model

The image depicts a white, four-shelf storage unit placed against a gray wall. The shelves are evenly spaced...

Top Shelf:
- A can labeled “Puppy Chow” with a white lid and red label...
- A small red apple.
Second Shelf:
- A yellow banana.
- A green apple.
- An orange.
Third Shelf:
- A can of Campbell’s Tomato soup with the classic red-and-white design.
Bottom Shelf:
- A single yellow tennis ball.

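The 0.5B model only gives a rough description of the objects and their colors, while the 7B model also reads the labels on the cans, but I think the 0.5B output is already impressive enough.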

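Future plans

Japanese caption support: fine-tuning on Japanese data with LoRA (I'd like to try fine-tuning).
Real-time input: building webcam capture into predict.py. I have already implemented this and opened an issue on Apple's GitHub repository. I don't know whether they will pick it up, but I have modified the code so that it works, so if you're interested, leave a comment or DM me on X.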
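
For readers who want a rough idea of the webcam item, here is a minimal sketch. It is not the modified predict.py mentioned above: it grabs a single frame with OpenCV (assuming `opencv-python` is installed), saves it to a temporary file, and passes that file to the unmodified `predict.py`. The names `grab_frame`, `predict_command`, and `main` are my own.

```python
import os
import subprocess
import sys
import tempfile


def grab_frame(out_path: str, camera_index: int = 0) -> bool:
    """Capture one frame from the webcam and save it to out_path."""
    import cv2  # assumption: opencv-python is installed in the venv

    cap = cv2.VideoCapture(camera_index)
    try:
        ok, frame = cap.read()
        if not ok:
            return False
        return bool(cv2.imwrite(out_path, frame))
    finally:
        cap.release()


def predict_command(model_path: str, image_file: str, prompt: str) -> list[str]:
    """Build the predict.py invocation for the captured frame."""
    return [
        sys.executable, "predict.py",
        "--model-path", model_path,
        "--image-file", image_file,
        "--prompt", prompt,
    ]


def main(model_path: str) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        frame_path = os.path.join(tmp, "frame.jpg")
        if not grab_frame(frame_path):
            print("could not read a frame from the camera")
            return
        subprocess.run(
            predict_command(model_path, frame_path, "Describe the image."),
            check=True,
        )


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python webcam_caption.py MODEL_PATH")
    else:
        main(sys.argv[1])
```

Because the model is reloaded for every frame, this is far from real time. A genuinely real-time version would need to keep the model in memory between frames, which is what integrating capture into predict.py itself would allow.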
