A Comparison of Five Major Approaches to Robotics Foundation Models

Last updated at 2025-03-01 / Posted at 2025-03-01

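In recent years, robotics foundation models have advanced rapidly, and robots' perception and their ability to adapt to complex tasks are improving dramatically. Tasks that used to require a very large amount of hand-written programming under conventional robot control, or that were technically infeasible in the first place, are now the subject of striking public demos. These build on advances in generative AI, through approaches such as Vision-Language-Action (VLA) models and action generation with diffusion models.

This article compares five frameworks that incorporate these techniques from different angles.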

π0 (PI0)
OpenVLA
Octo
RT-2
Diffusion Policy (DP)

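All of these are machine-learning-based approaches to robot control that have offered important insights, but they differ in the base model they use, in how they output actions at inference time, and in other respects. This article aims to serve as a quick reference: by getting a rough grasp of each approach's characteristics and the differences between them, readers can think about where robotics foundation models are headed.

Characteristics of each approach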

π0 (PI0)
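- Overview: Builds on the pre-trained vision-language model (VLM) "PaliGemma" and uses flow matching (a variant of diffusion models) to learn the flow of the data distribution and generate continuous actions
- Strengths: Noted for its strong zero-shot performance, and has drawn attention for long-horizon tasks such as laundry folding and coffee making
- Source: π0 Blog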
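To make the flow-matching idea a little more concrete, here is a minimal sketch of how sampling works in principle: start from Gaussian noise and integrate a velocity field with Euler steps until you arrive at an action chunk. This is not π0's actual implementation. `toy_velocity`, `sample_action_chunk`, the step count, and the fixed `target` are all hypothetical names and values; in a real model the velocity would come from a trained network conditioned on images and language.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network v(x, t | observation).

    For a data distribution concentrated on a single `target`, the velocity of
    the straight-line flow x_t = (1 - t) * noise + t * target is
    (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action_chunk(target, num_steps=10, seed=0):
    """Integrate the velocity field from t=0 (noise) to t=1 (action) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Assumed example: a 3-dimensional "action" the toy flow should recover.
    print(sample_action_chunk([0.1, -0.2, 0.3]))
```

The point of the sketch is that the whole action vector comes out of a small, fixed number of integration steps, rather than being decoded one token at a time.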
OpenVLA
- Overview: Uses LLaMA 2 to process language and images, and outputs actions as discrete tokens
- Strengths: Can be fine-tuned quickly with few resources using methods such as LoRA
- Source: OpenVLA
Octo
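- Overview: Generates continuous actions with a Transformer plus diffusion. Trained on roughly 800,000 multi-robot demonstrations
- Strengths: Supports real-time operation at 10–15 Hz. Handles multi-task scenarios and accepts instructions as text or goal images
- Source: Octo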
RT-2
- Overview: Adapts Google's large VLMs (PaLM-E and PaLI-X) to generate discrete actions from text and images
- Strengths: Strong at semantic reasoning, e.g. given "pick up the extinct animal" it picks up the toy dinosaur
- Source: RT-2
Diffusion Policy (DP)
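- Overview: Uses a diffusion process to generate continuous action trajectories. Generally trained per task
- Strengths: Good at precise motions. Performs well in settings with limited data (e.g. flipping a cup upside down, pouring sauce)
- Source: Diffusion Policy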
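As a rough illustration of what "generating an action trajectory with a diffusion process" means, the sketch below runs a deterministic DDIM-style reverse loop in plain Python. It is a toy under several assumptions and not the Diffusion Policy code: the DDIM-style update is chosen here only for brevity, the linear beta schedule and its values are made up, and `toy_noise_predictor` is a hypothetical stand-in for the trained noise-prediction network, which would normally be conditioned on camera observations.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def toy_noise_predictor(x, alpha_bar, target):
    """Hypothetical stand-in for eps_theta(x_t, t, observation).

    Exact for a data distribution concentrated on a single `target`, where
    x_t = sqrt(alpha_bar) * target + sqrt(1 - alpha_bar) * eps.
    """
    scale = math.sqrt(1.0 - alpha_bar)
    return [(xi - math.sqrt(alpha_bar) * g) / scale for xi, g in zip(x, target)]


def denoise_trajectory(target, num_steps=50, seed=0):
    """Iteratively denoise Gaussian noise into a flattened action trajectory."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in reversed(range(num_steps)):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0  # no noise left after the last step
        eps = toy_noise_predictor(x, ab_t, target)
        # Estimate the clean trajectory, then step to the previous noise level.
        x0_pred = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                   for xi, e in zip(x, eps)]
        x = [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
             for x0, e in zip(x0_pred, eps)]
    return x


if __name__ == "__main__":
    # Assumed example: four waypoints of a one-dimensional trajectory.
    print(denoise_trajectory([0.0, 0.25, 0.5, 0.75]))
```

Each call to `denoise_trajectory` runs the predictor once per denoising step, which hints at why iterative diffusion sampling is often a bottleneck for control frequency.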

Comparison table

The table below summarizes how the models differ in action output format, command input, architecture, and more.

Model	π0	OpenVLA	RT-2	Octo	DP
Action Output Format	Continuous trajectory (flow-matching outputs smooth joint commands)	Discrete tokens (transformer-decoded tokens, converted to actions)	Discrete tokens (token-based actions, then converted)	Continuous trajectory (diffusion-based generation)	Continuous trajectory (diffusion-based generation)
Generation Method	Flow matching (diffusion variant, trained via velocity/flow matching)	Token-based (autoregressive LLM-style decoding)	Token-based (co-fine-tuned VLM outputting action tokens)	Diffusion Model (Transformer backbone + diffusion head)	Diffusion Model (conditional denoising diffusion)
Multimodal LLM Integration	Yes (built on pre-trained VLM “PaliGemma”)	Yes (LLaMA 2-based LLM with vision encoders)	Yes (PaLM-E / PaLI-X large VLM)	Partial (accepts language + vision, but no large pre-trained LLM)	No (no language/LLM; purely vision/action-based)
Command / Instruction Input	Text instructions + vision (can parse high-level text to plan tasks)	Text instructions + vision (natural language prompts, images)	Text instructions + vision (web-scale semantic understanding)	Text or goal images (can specify tasks via short text or desired final state)	No explicit text commands (trained per-task; no user “intent” input)
Action Chunking	Yes (multi-step “chunks” in single inference; up to 50 Hz)	No (step-by-step token decoding)	Yes (limited) (outputs short token sequences for multi-step actions)	Yes (diffusion predicts multiple consecutive actions per inference)	Yes (short-horizon sequences with receding horizon)
Real-time Execution Frequency	~50 Hz (fast inference via flow-matching + chunking)	~5–15 Hz (can be slow; token-based LLM requires autoregressive decoding)	~1 Hz (massive model size limits speed)	~10–15 Hz (relatively small model; single pass + diffusion steps)	~5–10 Hz (iterative diffusion per action sequence)
Model Architecture	Transformer + MoE (VLM backbone + separate action “expert”)	Transformer (LLM) + CNN/ViT (LLaMA 2 core + visual encoders)	Transformer (PaLM-E / PaLI-X backbone)	Transformer + Diffusion (own mid-sized model, CNN for vision)	CNN + Transformer (Diffusion) (time-series diffusion)
Pre-trained Foundation Model	Yes (“PaliGemma” VLM)	Yes (LLaMA 2 & pre-trained image encoders)	Yes (PaLM-E, proprietary Google VLM)	No (trained from scratch on multi-task data)	No (no large pre-trained model)
General Task Performance	High / SOTA (excels at dexterous, long-horizon tasks)	High (leads many multi-task benchmarks)	Medium-High (strong semantic grasp, but slower control)	Medium-High (robust multi-task performance at good speed)	High (per task) (strong on single tasks, not generalist)
Long-duration Task Capability	Yes (laundry folding, coffee-making; multi-phase demos)	Limited (tested mostly on short single tasks)	Limited (one-step instructions, no extended chaining)	Yes (demonstrated multi-step tasks like coffee prep)	Moderate (multi-step control within single domain)
Practicality of Tasks Achieved	Very high (household chores, real-world complex tasks)	High (broad real manipulations: stacking, wiping, placing objects)	High (pick-place with semantic queries on real robots)	High (peg insertion, coffee-making, bimanual tasks, etc.)	Moderate (sauce pouring, flipping tasks, often lab-focused)
Fine-tuning Efficiency	High (data-efficient “post-training” approach)	High (LoRA, i.e. low-rank adaptation, on consumer GPUs)	Low (55B+ model, not released for easy adaptation)	High (fast adaptation with small data)	Moderate (usually trained anew per task; data-efficient)
Date Published	Oct 2024	Jun 2024	Jul 2023	May 2024	Mar 2023
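
The "Action Chunking" row in the table above is easier to picture with a small example. The sketch below is a toy receding-horizon loop: the policy predicts a chunk of several actions per inference, the robot executes only the first few, and then the policy replans. Everything here (`toy_policy`, `run_receding_horizon`, the horizon of 8, executing 4 steps, the one-dimensional state) is a hypothetical setup for illustration and does not reflect any of the five models' actual parameters.

```python
def toy_policy(state, goal, horizon, max_step=0.1):
    """Hypothetical stand-in policy: plans `horizon` bounded steps toward the goal."""
    chunk, s = [], state
    for _ in range(horizon):
        delta = max(-max_step, min(max_step, goal - s))
        chunk.append(delta)
        s += delta
    return chunk


def run_receding_horizon(state, goal, horizon=8, execute=4,
                         max_inferences=50, tol=1e-9):
    """Predict a chunk, execute its first `execute` actions, then replan.

    Returns (final_state, executed_steps, inference_calls).
    """
    inferences = 0
    steps = 0
    while abs(goal - state) > tol and inferences < max_inferences:
        chunk = toy_policy(state, goal, horizon)  # one (expensive) model call
        inferences += 1
        for delta in chunk[:execute]:             # cheap low-level execution
            state += delta
            steps += 1
    return state, steps, inferences


if __name__ == "__main__":
    # Compare replanning every 4 steps with replanning every single step.
    print(run_receding_horizon(0.0, 1.0, execute=4))
    print(run_receding_horizon(0.0, 1.0, execute=1))
```

With chunking, the number of model calls drops roughly by the number of actions executed per chunk, which is one reason chunked policies can reach a higher control frequency with the same inference latency.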

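Key points

Continuous vs. discrete output
- π0, Octo, DP: continuous actions (allowing smooth control)
- OpenVLA, RT-2: generate actions as (discrete) tokens. Because each token has to be mapped to a predefined motion, tokenization can limit how finely motions are represented compared with continuous actions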
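To see where the limitation of tokenized actions comes from, here is a minimal sketch of uniform action binning. It assumes one action dimension normalized to [-1, 1] and split into 256 bins; these values, along with the `encode_action` and `decode_token` helpers, are assumptions made for this illustration and not the exact scheme of OpenVLA or RT-2.

```python
def encode_action(a, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous action value to a discrete token id in [0, n_bins - 1]."""
    a = max(low, min(high, a))                        # clip to the supported range
    idx = int((a - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)                       # a == high falls in the last bin


def decode_token(idx, low=-1.0, high=1.0, n_bins=256):
    """Map a token id back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width


if __name__ == "__main__":
    action = 0.1234
    token = encode_action(action)
    recovered = decode_token(token)
    # Inside the range, the round-trip error is at most half a bin width.
    print(token, recovered, abs(recovered - action))
```

For values inside the range, the round trip can be off by up to half a bin width, `(high - low) / (2 * n_bins)`, and anything outside the range is clipped. That is the kind of representational limit that continuous outputs avoid.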
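Command input and language support
- π0, OpenVLA, RT-2: use a VLM to handle instructions given as text (prompts) plus images
- Octo: accepts instructions as text or goal images, but does not use an LLM
- DP: trained on a specific task, so text instructions are out of scope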
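Long-horizon tasks
- π0, Octo: demonstrated tasks made up of multiple sub-steps (e.g. laundry, coffee making)
- OpenVLA, RT-2: mainly single-step instructions (block stacking, object grasping, etc.)
- DP: demonstrated multiple actions within a single task (scooping sauce and spreading it on a pizza)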

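Summary

Machine-learning-based robot control is evolving rapidly. The five models covered here fuse "Vision + Language + Action" in different ways, and new methods will likely keep being proposed. Beyond research within robotics alone, I also expect more crossover with advanced techniques from the LLM and VLM fields, such as text and image generation.

That said, what is ultimately required is to "carry out complex tasks in the real world, on real hardware, with stable reproducibility." For that reason, I believe an end-to-end understanding and implementation that spans hardware through software, not just machine learning, will become increasingly important.

At Telexistence Inc., where I serve as CTO, we deploy robot systems that combine teleoperation and autonomous control into real environments such as convenience stores and logistics warehouses as a business. We are now also focusing on robotics foundation models, so if you are interested, or are doing R&D in this area, please get in touch via LinkedIn or similar!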
