はじめに
生成AIを用いて、自動運転の論文「EMMA: End-to-End Multimodal Model for Autonomous Driving (2024)」の内容を(なるべく)把握してみました。生成AIに論文の記載内容を広く読んでもらって、実際にどのような記載があるのかを把握してみます。
車両の過去の位置情報は、「(xt, yt), ...」のようなテキスト形式で、大規模言語モデル(Gemini 1.0 Nano-1)に入力し、出力も自然言語で生成され、予測軌道「(xt, yt), ...」のようなテキスト形式で出力して、これを進行方向に利用するものである事が分かりました。
(その他、末尾の「分かったこと」章を参照)
以降で、ChatGPTに聞いてみた例を記載します。
対象の論文
[2410.23262] EMMA: End-to-End Multimodal Model for Autonomous Driving
https://arxiv.org/abs/2410.23262
(PDF: https://arxiv.org/pdf/2410.23262)
生成AIへの質問方法
生成AIを活用して、知りたい記事・論文の長い範囲のテキストを一気にコピー&ペーストして、その下に質問内容を「①~ ②~ …」と番号付きで書いて、生成AIに質問します。
長い範囲を一気に貼り付けて質問することで、質問の回数を減らし、生成AIに広い範囲の内容を上手にまとめてもらうことで、もっと楽して論文を把握したいと思っています。
下記のように分けて質問します。
Introductionを把握する
始めに「Abstract」「Introduction」に当たる章の部分を全て貼り付けて、「何についての記載か? + 要旨は何?」と質問し、まずおおよその内容を把握します。
(論文・記事を貼り付け)
上記の内容に関して下記の質問に回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)
※各質問の前には、元文の記載内容に基づいて忠実に回答してもらうために、毎回「上記の内容に関して下記の質問に回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)」と付けています。(以降では記載を省略)
これで生成された回答を読んでみて、分からなかった部分・もっと知りたい部分があれば、追加で質問します。
複数の質問がある場合は、「①~ ②~ …」と番号付きで一度に複数質問します。
①上記の「●」について詳細を教えて下さい。
②上記の「●」部分の元文を抜き出して下さい。続けて、その日本語訳を記載下さい。
学習に関する部分を把握する
学習に関わる章のテキストを広く選択してコピーし、長い範囲を貼り付けて、下記の質問をします。
①ここで使用された機械学習の、モデルへの入力形式、モデルからの出力形式、データセット、モデルの構造、計算量、パラメータ数、についての記載があれば教えて下さい。
①ここで使用された機械学習では、どんな学習が行われたのか、記載があれば教えて下さい。
結果に関する部分を把握する
実験結果に関わる章のテキストを広く選択してコピーし、長い範囲を貼り付けて、下記の質問をします。
①ここで行われた複数の実験の条件と、実験から分かったそれぞれの結論を説明下さい。
図・表を把握する
知りたい図・表の前後にあるテキストを広く選択してコピーし、長い範囲を貼り付けて、下記の質問をします。
①図1,2、表3,4は、何を示し、どのように解釈すればよいのか、教えて下さい。
キーワードから探す
上記の方法で、長い範囲について一度質問するだけでは、内容の取りこぼしがあるかもしれないので、生成AIに各章節でポイントとなるキーワード・フレーズを列挙してもらい、キーワードの羅列をざっと見て、他に取りこぼした内容がないかを確認することにします。
(生成AIが受付可能な限りの)長い範囲を貼り付けて、下記の質問をします。
①各章節毎に、ポイントとなるキーワード・フレーズを列挙下さい。(長めのフレーズとする事、元文+かっこで日本語書き)
より詳しく質問する時は、
①各章節毎に、ポイントとなるキーワード・フレーズを列挙下さい。(元文+かっこで日本語書き、他とは違う特徴的な内容・他と異なる特別な所・他と比べて優位な点・ここでの主張ポイント等、長めのフレーズとする事)
これで生成された回答を読んでみて、もっと知りたい部分があれば、追加で質問します。
複数の質問がある場合は、「①~ ②~ …」と番号付きで一度に複数質問します。
①上記の「●」について詳細を教えて下さい。
②上記の「●」部分の元文を抜き出して下さい。続けて、その日本語訳を記載下さい。
その他: 関連研究に関する部分を把握する
関連研究に関わる章のテキストを広く選択してコピーし、長い範囲を貼り付けて、下記の質問をします。
①関連研究をうまくまとめて下さい。(流れが分かる事、種類でうまくまとめる事、重要なものを選別、説明を追加)
①関連研究を再度うまくまとめて下さい。(多角的な視点で種類毎にまとめる事、重要なものを選別)
事前知識
その分野についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないことが多くあります。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。
対象論文について(または関連する)分かりやすい解説記事等を事前にチェックして、中心的な部分のイメージがあらかじめ分かっていると、理解しやすいと思います。生成AIは実際の細かい処理方法自体を分かりやすく説明してはくれない傾向があります。
注意点
ここでは、論文のテキスト内容だけを貼り付けて、生成AIに質問しています。論文の中の図・表の部分は貼り付けていません。図・表の内容は生成AIの回答には含まれず、別途論文を見る必要があります。
生成AIの回答内容が読みにくい・分かりづらい場合は、論文の本文でも同じように書かれてあり、論文の本文を直接読んでも同じように分かりづらいことが多くあります。論文では、既存研究等があるため、多くの説明を省略・一言だけサラッと書かれてある等、同種の研究に取り組む人でなければ、なかなか全容の理解に辿りつくのは難しい側面があります。この点は、生成AIの回答性能が悪いのではなく、論文という性質上、生じるものと考えています。
生成AIに質問
以降で、ChatGPTに実際に聞いてみた例を記載します。
Introductionを把握する
Abstract
1 Introduction
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
EMMA: End-to-End Multimodal Model for Autonomous Driving Jyh-Jing Hwang∗†, Runsheng Xu∗ , Hubert Lin‡ , Wei-Chih Hung‡ , Jingwei Ji, Kristy Choi Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo Dragomir Anguelov, Mingxing Tan† Waymo LLC Abstract We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA’s effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA’s potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small amount of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures. 1 Introduction Autonomous driving technology has made significant progress in recent years. To make autonomous vehicles a ubiquitous form of transportation, they must navigate increasingly complex real-world scenarios that require understanding rich scene context as well as sophisticated reasoning and decision-making. Historically, autonomous driving systems employed a modular approach, consisting of specialized components for perception [Yurtsever et al., 2020, Li et al., 2022b, Lang et al., 2019, Sun et al., 2022, Hwang et al., 2022], mapping [Li et al., 2022a, Tancik et al., 2022], prediction [Nayakanti et al., 2023, Shi et al., 2024], and planning [Teng et al., 2023]. While this design lends itself to easier debugging and optimization of individual modules, it poses scalability challenges due to the accumulated errors among modules and limited inter-module communication. In particular, the expert-designed ∗Equal contributions; ‡ Equal contributions. †Contact emails: Mingxing Tan <tanmingxing@waymo.com>, Jyh-Jing Hwang <jyhh@waymo.com>. Technical report. November 5, 2024. arXiv:2410.23262v2 [cs.CV] 4 Nov 2024 interfaces between modules, such as the perception and behavior modules, may struggle to adapt to novel environments because they are often pre-defined based on targeted scenarios [Bansal et al., 2019, Jiang et al., 2023, Nayakanti et al., 2023, Seff et al., 2023]. End-to-end autonomous driving systems [Hu et al., 2023, Zhai et al., 2023, Li et al., 2024] have recently emerged as a potential solution, directly learning to generate driving actions from sensor data. This approach eliminates the need for symbolic interfaces between modules and allows for joint optimization of driving objectives from raw sensor inputs. However, these systems are often specialized for specific driving tasks and trained on limited datasets, hindering their ability to generalize to rare or novel scenarios. Multimodal Large Language Models (MLLMs) [Gemini Team Google, 2023, Achiam et al., 2023] offer a promising new paradigm for AI in autonomous driving that may help to address such challenges. This is because MLLMs, as generalist foundation models, excel in two key areas: (1) they are trained on vast, internet-scale datasets that provide rich "world knowledge" beyond what is contained in common driving logs, and (2) they demonstrate superior reasoning capabilities through techniques such as chain-of-thought reasoning [Wei et al., 2022] that are not available in specialized driving systems. While recent efforts [Chen et al., 2024b, Tian et al., 2024] have explored integrating and augmenting the capabilities of existing driving systems with MLLMs, we propose to develop an autonomous driving system in which the MLLM is a first class citizen. We introduce the End-to-End Multimodal Model for Autonomous Driving (EMMA), built on top of Gemini [Gemini Team Google, 2023] without additional specialized components. Figure 1 shows the overview of the EMMA framework. EMMA accepts camera images and plain text for other nonvision inputs such as high-level driving commands and historical context. By recasting driving tasks as visual question answering (VQA) problems, EMMA leverages Gemini’s pre-trained capabilities while preserving its extensive world knowledge. After EMMA is fine-tuned with driving logs from all tasks using task-specific prompts (see Figure 2 for more examples), its shared Gemini model generates various driving outputs such as future trajectories for motion planning, perception objects, road graph elements, and scene semantics. Our experiments showcase EMMA’s strong performance on several planning and perception benchmarks despite this simple design. Additionally, we find that EMMA can produce interpretable, human-readable outputs for many perception tasks such as road graph estimation, and is able to function as a generalist model that is both scalable and robust for autonomous driving. Notably, as used here and throughout the paper, the EMMA generalist model refers to a machine learning model that has been trained and fine-tuned on a large volume of driving data to perform a wide range of specific driving tasks in the autonomous driving domain. We summarize our key findings below: 1. EMMA exhibits strong performance in end-to-end motion planning, achieving state-ofthe-art performance on public benchmarks nuScenes [Caesar et al., 2020] and competitive results on the Waymo Open Motion Dataset (WOMD) [Chen et al., 2024a]. We also show that we can further improve motion planning quality with more internal training data and chain-of-thought reasoning. 2. EMMA demonstrates competitive results for various perception tasks including 3D object detection, road graph estimation, and scene understanding. On the camera-primary Waymo Open Dataset (WOD) [Hung et al., 2024], EMMA achieves better precision and recall for 3D object detection than state-of-the-art methods. 3. We demonstrate that EMMA can function as a generalist model in the autonomous driving domain, which jointly generates the outputs for multiple driving related tasks. In particular, EMMA matches or even surpasses the performance of individually trained models when it is co-trained with motion planning, object detection, and road graph tasks. 4. Finally, we show EMMA’s capacity to reason and make decisions in complex, long-tail driving scenarios. Despite these promising results, EMMA is not without its limitations. In particular, it faces challenges for real-world deployment due to: (1) limitations in 3D spatial reasoning due to its inability to fuse camera inputs with LiDAR or radar, (2) the need for realistic and computationally expensive sensor simulation to power its closed-loop evaluation, and (3) the increased computational requirements relative to conventional models. We plan to better understand and address such challenges in future work. 2 Figure 1: EMMA overview diagram. It takes 3 inputs (left): 1) a high-level command from the router, 2) historical status of the ego vehicle, and 3) surround-view camera videos. The model then predicts ego future trajectories (right) for motion planning that will be transformed into vehicle driving control signals. Further, we can ask the model to explain its rationale (top right) before predicting trajectories, which enhances both the performance and explainability of the model through chain-of-thought reasoning. Notably, we incorporate visual grounding into the rationale so that the model also predicts the accurate 3D/BEV location for critical objects. For completeness, we highlight three additional capabilities of our model (bottom). In the remainder of this paper, Section 2 describes the detailed method of EMMA for end-to-end motion planning and generalist tasks in autonomous driving. In Section 3, we present experimental results of EMMA on public and internal datasets. Finally, we discuss related works and limitations of our work in Sections 4 and 5, respectively.▼生成AIの回答(ChatGPT 4o mini)
上記の内容は、EMMA(End-to-End Multimodal Model for Autonomous Driving)という、自動運転技術に関する新しいモデルの提案について記載されています。EMMAは、カメラの生データを入力として、運転に必要なさまざまな出力を直接生成するエンドツーエンドのマルチモーダルモデルです。特に、運転に関連するタスク(例えば、経路計画、物体認識、道路グラフ推定など)を自然言語形式で表現し、モデルにそのタスクを一貫して処理させるアプローチが特徴的です。
要旨:
-
EMMAモデルの紹介:
- EMMAは、Geminiを基盤にしたマルチモーダル大規模言語モデルを活用した自動運転向けのエンドツーエンドシステム。
- 入力はカメラ画像や、運転指令、車両の状態などのテキストデータ。
- 出力としては、運転経路(モーションプランニング)、物体認識、道路グラフ推定、シーン理解などを生成。
- 自然言語テキストとして運転タスクを表現し、タスク固有のプロンプトを用いて学習。
-
パフォーマンスの実証:
- EMMAは、nuScenesやWaymo Open Motion Dataset(WOMD)で最先端のモーションプランニングパフォーマンスを達成。
- Waymo Open Dataset(WOD)でのカメラを使った3D物体検出でも競争力のある結果を示す。
-
ジェネラリストモデルの特徴:
- モーションプランニング、物体検出、道路グラフ推定など複数のタスクを共に処理可能なモデル。
- これらのタスクを共同訓練することで、個別のモデルを超えるパフォーマンスを達成。
-
チェーン・オブ・ソート推論:
- EMMAは、タスクに対する推論を視覚的に補強することによって、より解釈可能で性能の高い予測を実現。
-
制限と課題:
- 3Dセンサー(LiDARやレーダー)との融合ができず、3D空間推論の能力に限界。
- 実世界での適用には、センサーシミュレーションが計算リソース的に高価。
- 計算資源が通常のモデルよりも多く必要となるため、実運用に向けた改善が必要。
-
将来の展望:
- EMMAの研究成果が自動運転技術のさらなる進展を促進することを期待。
既存手法の引用:
- 伝統的な自動運転システムでは、モジュール式アプローチが採用されており、専門化されたコンポーネント(認識、マッピング、予測、計画)を使用していた(Yurtsever et al., 2020, Li et al., 2022b, Lang et al., 2019, Sun et al., 2022, Hwang et al., 2022)。
- 最近では、エンドツーエンドの自動運転システム(Hu et al., 2023, Zhai et al., 2023, Li et al., 2024)が登場しており、これにより生データから直接運転行動を学習できるようになった。
主要な部分は書かれており良好。
ただ、「チェーン・オブ・ソート推論」の項目に書かれた内容「EMMAは、タスクに対する推論を視覚的に補強することによって、より解釈可能で性能の高い予測を実現。」は無関係で、不適切。
▼質問(人間)
続けて下記の質問に追加で回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)
①「ジェネラリストモデル」とは何?
②「共同訓練」とは何?
③「チェーン・オブ・ソート推論」とは何?
▼生成AIの回答(ChatGPT 4o mini)
①「ジェネラリストモデル」とは何?
「ジェネラリストモデル」とは、特定のタスクに特化したモデルではなく、複数の異なるタスクを同時に処理できる汎用的なモデルを指す。この場合、EMMAは自動運転に関する多様なタスク(モーションプランニング、物体検出、道路グラフ推定など)を単一のモデルで処理することができる。ジェネラリストモデルは、タスク間での情報の共有と一貫した処理を可能にするため、モデルの効率性とスケーラビリティに優れ、タスク間の相互作用を最適化することができる。
具体的には、EMMAは運転指令や車両の状態、周囲の視覚データを入力として、各タスクに適切な出力を同時に生成することができ、これによりタスク固有のモデルを個別に訓練するよりも、パフォーマンスの向上が期待される。これが「ジェネラリストモデル」の特徴であり、複数のタスクを一つの大規模言語モデルで処理できる点が革新的である。
②「共同訓練」とは何?
「共同訓練」とは、複数のタスクを同時に学習させる訓練方法を指す。EMMAにおける共同訓練は、モーションプランニング、物体検出、道路グラフ推定など、複数の運転関連タスクを一つのモデルに対して同時に訓練することを意味する。このアプローチは、異なるタスク間で学習した知識や特徴を共有し合うことができ、各タスクのパフォーマンスが相乗的に向上する可能性がある。
具体的には、EMMAはそれぞれのタスクに対応するデータを用いて学習を行い、その学習結果を相互に影響させることで、単独のタスクに特化したモデルに比べて全体的なパフォーマンスが向上する。共同訓練により、各タスクがモデル内で共通の基盤を持つため、システム全体の効果的な最適化が可能となる。
③「チェーン・オブ・ソート推論」とは何?
「チェーン・オブ・ソート推論」とは、モデルが推論を行う際に、逐次的に論理的なステップを踏んで考えるプロセスを指す。EMMAでは、運転タスクに対する推論の過程を、段階的に人間が理解できる形で説明しながら進めることができる。このような推論の過程は「チェーン・オブ・ソート(chain-of-thought)」と呼ばれ、モデルが最終的な予測をする前に、その過程を段階的に示し、思考の流れを明確にする。
EMMAはこの推論手法を採用することで、モデルがどのようにして出力に至ったのかを解釈可能にし、予測がなぜそのような結果になったのかを説明することができる。これにより、結果の透明性や信頼性が向上し、モデルの動作が理解しやすくなるとともに、実運用における安全性や信頼性の向上にも寄与する。
学習に関する部分を把握する
1 Introduction
2 Method
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
We introduce the End-to-End Multimodal Model for Autonomous Driving (EMMA), built on top of Gemini [Gemini Team Google, 2023] without additional specialized components. Figure 1 shows the overview of the EMMA framework. EMMA accepts camera images and plain text for other nonvision inputs such as high-level driving commands and historical context. By recasting driving tasks as visual question answering (VQA) problems, EMMA leverages Gemini’s pre-trained capabilities while preserving its extensive world knowledge. After EMMA is fine-tuned with driving logs from all tasks using task-specific prompts (see Figure 2 for more examples), its shared Gemini model generates various driving outputs such as future trajectories for motion planning, perception objects, road graph elements, and scene semantics. Our experiments showcase EMMA’s strong performance on several planning and perception benchmarks despite this simple design. Additionally, we find that EMMA can produce interpretable, human-readable outputs for many perception tasks such as road graph estimation, and is able to function as a generalist model that is both scalable and robust for autonomous driving. Notably, as used here and throughout the paper, the EMMA generalist model refers to a machine learning model that has been trained and fine-tuned on a large volume of driving data to perform a wide range of specific driving tasks in the autonomous driving domain. We summarize our key findings below: 1. EMMA exhibits strong performance in end-to-end motion planning, achieving state-ofthe-art performance on public benchmarks nuScenes [Caesar et al., 2020] and competitive results on the Waymo Open Motion Dataset (WOMD) [Chen et al., 2024a]. We also show that we can further improve motion planning quality with more internal training data and chain-of-thought reasoning. 2. EMMA demonstrates competitive results for various perception tasks including 3D object detection, road graph estimation, and scene understanding. On the camera-primary Waymo Open Dataset (WOD) [Hung et al., 2024], EMMA achieves better precision and recall for 3D object detection than state-of-the-art methods. 3. We demonstrate that EMMA can function as a generalist model in the autonomous driving domain, which jointly generates the outputs for multiple driving related tasks. In particular, EMMA matches or even surpasses the performance of individually trained models when it is co-trained with motion planning, object detection, and road graph tasks. 4. Finally, we show EMMA’s capacity to reason and make decisions in complex, long-tail driving scenarios. Despite these promising results, EMMA is not without its limitations. In particular, it faces challenges for real-world deployment due to: (1) limitations in 3D spatial reasoning due to its inability to fuse camera inputs with LiDAR or radar, (2) the need for realistic and computationally expensive sensor simulation to power its closed-loop evaluation, and (3) the increased computational requirements relative to conventional models. We plan to better understand and address such challenges in future work. 2 Figure 1: EMMA overview diagram. It takes 3 inputs (left): 1) a high-level command from the router, 2) historical status of the ego vehicle, and 3) surround-view camera videos. The model then predicts ego future trajectories (right) for motion planning that will be transformed into vehicle driving control signals. Further, we can ask the model to explain its rationale (top right) before predicting trajectories, which enhances both the performance and explainability of the model through chain-of-thought reasoning. Notably, we incorporate visual grounding into the rationale so that the model also predicts the accurate 3D/BEV location for critical objects. For completeness, we highlight three additional capabilities of our model (bottom). In the remainder of this paper, Section 2 describes the detailed method of EMMA for end-to-end motion planning and generalist tasks in autonomous driving. In Section 3, we present experimental results of EMMA on public and internal datasets. Finally, we discuss related works and limitations of our work in Sections 4 and 5, respectively. 2 Method EMMA is built on top of Gemini [Gemini Team Google, 2023], a family of MLLMs developed at Google. We leverage the auto-regressive Gemini models that are trained to process interleaved textual and visual inputs to produce text outputs: O = G(T, V) (1) where G is the Gemini model, O represents natural language outputs, T represents natural language prompts, and V denotes images or videos. The language output O = (o1, o2, ..., on) is generated via next-token prediction, i.e., the output probability can be represented as Q P(O|T, V) = n i=1 P(oi |o<i, T, V) for n output tokens. Our goal is to adapt G for autonomous driving applications, thereby harnessing the world knowledge obtained during its pre-training phase. As shown in Figure 1, we map autonomous driving tasks into Gemini-based EMMA formulation. All sensor data are represented as stitched images or videos as V; all router commands, driving context, and task-specific prompts are represented as T; all output tasks are presented as language outputs O. A challenge is that many of the inputs and outputs need to capture 3D world coordinates, such as waypoint BEV (Bird’s Eye View) locations (x, y) for motion planning and the location and size of 3D boxes. We consider two representations: The first is direct text conversion to floating-point numbers, expressed as Tcoordinates = {(xi , yi)} ≈ text({(xi , yi)}), where the specified decimal precision depends on the distance unit and decimal points. RT-2 [Brohan et al., 2023] exemplifies this approach in robotic control. The second approach uses special tokens to represent each location or action, formulated as Tcoordinates = {(xi , yi)} ≈ tokenize({(xi , yi)}), with resolution determined by the learned or manually defined discretization scheme. MotionLM [Seff et al., 2023] leverages this method for motion forecasting. We note that the two approaches have their respective strengths and weaknesses. We opt for the text representation such that all tasks can share the same unified language representation space and they can maximally reuse the knowledge from the pre-trained weights, even though the text presentation may produce more tokens than specialized tokenization. 3 2.1 End-to-End Motion Planning EMMA employs a unified, end-to-end trained model to generate future trajectories for autonomous vehicles directly from sensor data. These generated trajectories are then transformed into vehiclespecific control actions such as acceleration and turning for autonomous vehicles. EMMA’s end-to-end approach aims to emulate human driving behavior, focusing on two critical aspects: (1) first, the use of navigation systems (e.g. Google Maps) for route planning and intent determination, and (2) second, the utilization of past actions to ensure smooth, consistent driving over time. Our model incorporates three key inputs to align with these human driving behaviors: 1. Surround-view camera videos (V): Provides comprehensive environment information. 2. High-level intent command (Tintent): Derived from the router, includes directives such as “go straight”, “turn left”, “turn right”, etc. 3. Set of historical ego status (Tego): Represented as a set of waypoint coordinates in Bird’s Eye View (BEV) space, Tego = {(xt, yt)} −Th t=−1 for Th timestamps. All waypoint coordinates are represented as plain text without specialized tokens. This can also be extended to include higher-order ego status such as velocity and acceleration. The model generates future trajectories for motion planning, represented as a set of future trajectory waypoints for the ego vehicle in the same BEV space: Otrajectory = {(xt, yt)} Tf t=1 for future Tf timestamps, where all output waypoints are also represnted as plain text. Putting everything together, the complete formulation is expressed as: Otrajectory = G(Tintent, Tego, V). (2) We then fine-tune Gemini with this formulation for end-to-end planner trajectory generation, as illustrated in Figure 1. We highlight 3 characteristics of this formulation: 1. Self-supervised: the only required supervision is the future locations of the ego vehicle. No dedicated human labels are needed. 2. Camera-only: the only sensor input required is surround-view cameras. 3. HD map free: no HD map is needed beyond the high-level routing information from a navigation system such as Google Maps. While we are not the first to adopt this general formulation—[Li et al., 2024] conducted a thorough investigation, particularly examining the impact of including the historical ego status—our contribution lies in adapting this formulation specifically for MLLMs for autonomous driving. 2.2 Planning with Chain-of-Thought Reasoning Chain-of-thought Prompting [Wei et al., 2022] is a powerful tool in MLLMs that enhances reasoning capabilities and improves explainability. In EMMA, we incorporate chain-of-thought reasoning into end-to-end planner trajectory generation by asking the model to articulate its decision rationale Orationale while predicting the final future trajectory waypoints Otrajectory. We structure the driving rationale hierarchically, progressing from 4 types of coarse-to-fine-grained information: R1 - Scene description broadly describes the driving scenarios, including weather, day of time, traffic situations, and road conditions. For example: The weather is clear and sunny, and it is daytime. The road is four-lane undivided street with a crosswalk in the middle. There are cars parked on both sides of the street. R2 - Critical objects are the on-road agents that can potentially influence the driving behavior of the ego vehicle, and we require the model to identify their precise 3D/BEV coordinates. For instance: pedestrian at [9.01, 3.22], vehicle at [11.58, 0.35]. R3 - Behavior description of critical objects describes the current status and intent for the identified critical objects. A concrete example is as follows: The pedestrian is currently standing on the sidewalk, looking toward the road, and maybe preparing to cross the street. The vehicle is currently ahead of me, moving in the same direction, and its future trajectory suggests it will continue straight. R4 - Meta driving decision includes 12 categories of high-level driving decisions, summarizing the driving plan given the previous observations. An example would be I should keep my current low speed. 4 We highlight that the driving rationale captions are generated using an automated tool without any additional human labels, ensuring scalability of the data generation pipeline. Specifically, we leverage off-the-shelf perception and prediction expert models to identify critical agents, and then use Gemini models with carefully designed visual and text prompts to generate comprehensive scene and agent behavior descriptions. Meta driving decisions are computed using a heuristic algorithm that analyzes the ego vehicle’s ground-truth trajectory. During both training and inference, the model predicts all four components of the driving rationale before predicting the future waypoints, i.e., (Orationale, Otrajectory) = G(Tintent, Tego, V). (3) Where Orationale denotes an ordered text output of (R1, R2, R3, R4) for driving rationale. Empirically, we observe that the prediction order of Orationale and Otrajectory does not result in a significant difference in quality after model convergence. This suggests that we can predict Otrajectory first and apply early stopping during inference for time-critical applications. 2.3 EMMA Generalist While end-to-end motion planning is the ultimate core task, a comprehensive autonomous driving system requires additional capabilities. Specifically, it must perceive the 3D world and recognize surrounding objects, the road graph and the traffic conditions. To achieve this goal, we formulate EMMA as a generalist model capable of handling multiple driving tasks through training mixtures. Our vision-language framework represents all non-sensor inputs and outputs as plain text, providing the flexibility necessary to incorporate many other driving tasks. We employ instruction-tuning, a well-established approach in LLMs, to jointly train all tasks together with task-specific prompts included in the inputs T of Eq. 1. We organize these tasks into three primary categories: spatial reasoning, road graph estimation, and scene understanding. Fig. 2 illustrates the overall EMMA Generalist diagram. Spatial reasoning is the ability to understand, reason, and draw conclusions about objects and their relationships in space. This enables an autonomous driving system to interpret and interact with its surrounding environment for safe navigation. Our primary focus for spatial reasoning is 3D object detection. We follow Pix2Seq [Chen et al., 2022a] and formulate the output 3D bounding boxes as Oboxes = set{text(x, y, z, l, w, h, θ, cls)} where (x, y, z) are the center location in the vehicle frame, l, w, h are the length, width, and height of the box, θ is the heading angle, and cls is the class label in text. We convert a 7D box to text by writing floating-point numbers with two decimal places, separated by spaces between each dimension. We then represent the detection tasks using a fixed prompt Tdetect_3D, such as “detect every object in 3D”, as in the following: Oboxes = G(Tdetect_3D, V). (4) While Oboxes is an unordered set of boxes, the predictions from an auto-regressive language model are always ordered. We find that sorting the 3D bounding boxes by depth improves detection quality, in contrast with the findings in Pix2Seq [Chen et al., 2022a]. Road graph estimation focuses on identifying critical road elements for safe driving, including semantic elements (e.g., lane markings, signs) and physical properties (e.g., lane curvature). The collection of these road elements forms a road graph. For example, lane segments are represented by (a) nodes, where the lanes encounter an intersection, merge, or split and (b) edges between these nodes following the direction of traffic. The full road-graph is composed of many such polyline segments. While edges within each polyline are directional, each polyline does not necessarily have a unique order relative to the other elements. This bears similarity to object detection (e.g., [Carion et al., 2020, Chen et al., 2022a]), where each box is defined by ordered attributes (top-left corner, bottom-right corner), but a relative ordering between boxes does not necessarily exist. There are several existing works that model polyline graphs with Transformers [Yuan et al., 2024, Liao et al., 2024a,b, 2023, Ding et al., 2023, Qiao et al., 2023, Liu et al., 2023, Li et al., 2022a], sharing similarities with language models. Our general modeling formulation in EMMA is as follows: Oroadgraph = G(Testimate_roadgraph, V). (5) 5 Figure 2: Illustration of EMMA Generalist. Starting with a task prompt (left), EMMA generates a corresponding textual prediction (middle right), which can then be decoded into a target output format, visualized and overlaid with the input image (right). EMMA Generalist is highly versatile, capable of performing a wide range of driving-related tasks, such as end-to-end motion planning, object detection, road graph estimation, and scene understanding Q&A. In the answer text, italicized words represent placeholders that will be dynamically substituted with actual values during task execution. where Oroadgraph is a text-encoded road graph represented as waypoints, Testimate_roadgraph is a prompt asking the model to predict the roadgrah, and V denotes the surrounding images. We focus specifically on predicting drivable lanes, i.e., the lanes that the ego vehicle can drive towards in the scene. These are neighboring lanes in the same traffic direction and lanes branching out from the current ego lane. To construct Oroadgraph, we (a) convert lanes into sets of ordered waypoints and (b) transform these sets of waypoints into text. It is beneficial to use sample-ordered waypoints to represent both traffic direction and curvature. Just like detection, we also find that ordering lanes by approximate distance improves the prediction quality. An example of our polyline text encoding is: "(x1,y1 and... and xn,yn);..." where "x,y" are floating point waypoints with precision to 2 decimal places, ";" separates polyline instances. Scene understanding tasks test the model’s understanding of the whole scene context, which can be relevant for driving. For example, roads can be temporarily obstructed due to construction, emergency situations, or other events. Detecting these blockages in a timely manner and safely navigating around them is essential for ensuring the smooth and safe operation of autonomous vehicles; however, multiple cues in the scene are required to determine if there is a blockage or not. We focus on how our model performs on this temporary blockage detection task, using the following formulation: Otemporary_blockage = G(Ttemporary_blockage, Troad_user, V), (6) where Otemporary_blockage is the model output signaling potential obstruction, V denotes the surrounding images, Troad_users denotes the all the objects on the road ahead, Ttemporary_blockage is the text prompt "is the road ahead temporarily blocked?". 2.4 Generalist Training Our unified vision-language formulation enables the simultaneous training of multiple tasks with a single model, allowing for task-specific predictions at inference time through simple variations of the task prompt Ttask. The training procedure is both straightforward and flexible. For each task, we construct a dataset Dtask containing |Dtask| training examples. During each training iteration, we randomly sample a batch from the available datasets, with the probability of selecting an example from a specific dataset proportional to the dataset size: i.e., |Dtask|/ P t |Dt |. To train for e epochs, we set the total number of training iterations to e × P t |Dt |, ensuring that the training 6 Method ADE 1s ADE 3s ADE 5s ADE 8s MotionLM [Seff et al., 2023] 0.045 0.251 0.694 1.766 Wayformer* [Nayakanti et al., 2023] 0.044 0.250 0.640 1.517 EMMA 0.032 0.248 0.681 1.718 EMMA+ 0.030 0.225 0.610 1.553 Table 1: End-to-end motion planning experiments on WOMD planning benchmark [Chen et al., 2024a]. EMMA+ denotes the models are pre-trained on our internal data. EMMA achieves promising performance on ADE 1s, 3s, and 5s. EMMA is slightly worse than Wayformer at ADE 8s, largely because the camera-only EMMA has higher depth error for longer distance than Wayformer, which uses all sensors including LiDAR, Radar, and Cameras. ∗Note that the Wayformer baseline is based on an enhanced internal version after the publication of [Nayakanti et al., 2023]. Figure 3: Ablation study on the number of sampled trajectories. As more trajectories are sampled, the quality measured by ADE@5s also improves, but the benefits diminish after 12+ samples. ratio among tasks is governed by the relative dataset sizes. The optimal training ratio is influenced by several factors, including task complexity, inter-task correlations, and the degree of transferability across tasks. Our experimental results demonstrate that the generalist model, trained across multiple tasks, consistently outperforms each specialist model that is trained on a single task. This highlights the advantage of the generalist approach: enhanced knowledge transfer, improved generalization, and increased efficiency.▼生成AIの回答(ChatGPT 4o mini)
EMMA(End-to-End Multimodal Model for Autonomous Driving)は、GoogleのGeminiモデルをベースにして構築された自律運転のための多目的モデルです。以下、質問に対する詳細な回答を元の文に基づいて提供します。
- モデルへの入力形式
EMMAの入力は、主に3つの要素から構成されています: - カメラ画像(V):環境の包括的な情報を提供する周囲のカメラビデオ。これが主要なセンサー入力であり、周囲の状況を視覚的に認識します。
- 高レベルの指令(Tintent):ルーターから送られる指令で、例えば「まっすぐ進む」「左折」「右折」などの運転指示が含まれます。
- 車両の履歴状態(Tego):車両の過去の位置情報を示すもので、例えば過去のウェイポイント(BEV空間での座標)です。また、必要に応じて速度や加速度など、車両の動的状態を含めることができます。
これらの入力がテキスト形式で表現され、モデルはこれらを基に動作します。
- モデルからの出力形式
EMMAの出力は、特定のタスクに応じて異なりますが、基本的には自然言語で生成されるテキスト形式です。具体的には: - 未来の軌道(Otrajectory):自律車両の未来の進行方向を示すウェイポイントのセットです。これらのウェイポイントもテキスト形式で表現されます。例えば、「(xt, yt)」のような形式で、時間軸に沿った位置情報が示されます。
- 運転の合理的な根拠(Orationale):モデルが決定した運転に対する理由付け。これは、シーンの説明、重要な物体の識別、物体の挙動の理解、最終的な運転の判断に分かれています。これもテキスト形式で出力され、例えば「歩行者が[9.01, 3.22]に立っている」などの情報を含みます。
また、出力は特定のタスクに応じて異なる形式になります(例:3D物体検出の場合、3Dバウンディングボックスの位置やサイズが出力される)。
- データセット
EMMAは複数のデータセットを使用して訓練されています。主に以下のデータセットが言及されています:
- nuScenes:自動運転車のための公開ベンチマークデータセット(特にモーションプランニングに使用)。
- Waymo Open Motion Dataset (WOMD):自律運転における運転軌道予測のベンチマーク(EMMAのモーションプランニング実験で使用)。
- Waymo Open Dataset (WOD):カメラデータが主体の自律運転タスクに用いられるデータセット。特に3D物体検出タスクで使用され、EMMAはここで最先端の性能を示しています。
- 内部データ:EMMA+というバージョンでは、内部データに基づく事前訓練が行われています。
- モデルの構造
EMMAは、GoogleのGeminiを基盤としており、Geminiは視覚(画像)とテキスト(自然言語)の入力を処理できる自己回帰型の多言語モデル(MLLM)です。Geminiは、複数のドメインにわたる知識を事前に学習しており、この知識を自動運転のタスクに転用します。
モデルの入力形式はテキストと視覚データを組み合わせたものであり、出力も自然言語で生成されます。具体的には、画像やビデオデータ(V)と、運転指令や車両の履歴情報(T)を入力として、次のトークン予測に基づいてテキスト形式で結果(O)を生成します。
- 計算量
計算量に関する直接的な記載はありませんが、EMMAは「計算要求が高い」と述べられています。具体的には、以下の点が計算において挑戦となっています:
- 計算リソースの増加:従来のモデルに比べて計算要求が高い。
- リアルタイム評価のための高精度なシミュレーション:EMMAは、カメラ画像のみを使用した運転タスクを対象としており、実世界での実行にはシミュレーションと現実のギャップを埋めるための計算リソースが必要です。
また、チェイン・オブ・ソート推論(Chain-of-thought reasoning)を用いることで、計算量が増加することが考えられます。
- パラメータ数
パラメータ数に関して具体的な記載はありませんが、EMMAはGeminiモデルを基にしているため、Geminiのパラメータ数を推測することはできます。Geminiは大規模なモデルであり、Googleの他の大規模言語モデルと同様に数十億から数百億のパラメータを持つ可能性があります。ただし、EMMA独自のパラメータ数に関する詳細は提供されていません。
結論
EMMAは、視覚とテキストの入力を処理し、自然言語での出力を生成する多目的な自律運転モデルです。入力形式はカメラ画像、運転指令、車両の履歴状態を含み、出力は運転の未来軌道や合理的な理由付けを含むテキストで提供されます。使用されるデータセットには公開データセット(nuScenes、WOMD、WOD)と内部データがあり、モデルの基盤はGoogleのGeminiモデルです。計算要求が高く、パラメータ数についての具体的な記載はありませんが、大規模なモデルであると推測されます。
▼質問(人間)
続けて下記の質問に追加で回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)
①ここで使用された機械学習では、どんな学習が行われたのか、記載があれば教えて下さい。
▼生成AIの回答(ChatGPT 4o mini)
上記の文において、使用された機械学習の学習プロセスについては、以下のように記載されている。
-
ファインチューニング (Fine-tuning):
- EMMAは、GoogleのGeminiモデルを基盤としており、事前学習済みのGeminiモデルを自動運転のタスクに適用するためにファインチューニングされている。具体的には、自動運転に関連する運転データを用いてタスク固有のプロンプトを使って調整される。これにより、EMMAは走行ログに基づいて、未来の軌跡予測や物体認識、道路グラフ要素、シーン理解などの複数の出力を生成できる。
-
エンドツーエンド学習 (End-to-end learning):
- EMMAはエンドツーエンドで学習されるモデルであり、センサー入力(カメラ映像など)から直接未来の軌跡を予測する。モデルは、高レベルな運転指令と過去のエゴ車両の状態を入力として、未来の軌跡を生成する。これには専用の教師なし学習(Self-supervised)アプローチが使用され、教師データとしてはエゴ車両の未来の位置情報が利用される。加えて、特別なラベルは必要なく、学習データのみでトレーニングが可能である。
-
チェーン・オブ・ソート学習 (Chain-of-thought reasoning):
- EMMAは、チェーン・オブ・ソート(Chain-of-thought)推論を統合しており、これにより推論の過程や決定の理由を人間が理解しやすい形で出力することができる。具体的には、モデルが最終的な未来の軌跡を予測する前に、シーンの説明、重要な物体の位置や状態、物体の動作に関する説明、そして最終的な運転方針など、段階的に理由を述べる。このアプローチにより、モデルはより直感的な理解を提供し、運転計画をより高精度で生成することができる。
-
複数タスク同時学習 (Multi-task learning):
- EMMAは複数の自動運転タスクを同時に学習できる汎用モデルとして設計されている。これには、運転経路計画、物体検出、道路グラフ推定、シーン理解といったタスクが含まれており、これらのタスクは単一のモデルで学習される。具体的には、各タスクごとに異なるプロンプトが入力され、モデルはそのタスクに対応した出力を生成する。この汎用的な学習方法により、EMMAは複数の異なるタスク間で知識を再利用できるため、各タスク単独で学習されたモデルに比べて効率的で強力なパフォーマンスを発揮する。
-
指示調整 (Instruction-tuning):
- EMMAは指示調整(Instruction-tuning)を使用しており、これにより異なるタスクに対応するための特定のプロンプトを学習データに組み込む。タスク固有のプロンプトが入力され、モデルはそのプロンプトに従って予測を行う。これにより、汎用的な知識がタスクごとに特化されたものに調整され、特定のドメインやタスクにおいて優れたパフォーマンスを発揮する。
-
データセットとタスクの組み合わせ:
- 学習プロセスは、異なるタスクのデータセットを組み合わせて行われ、各タスクの学習データがサンプリングされる。タスクごとのデータセットサイズに応じてサンプリングされる確率が調整され、複数のタスクが同時に学習される。これにより、モデルはタスク間の相関を学び、より汎用的で適応力のある学習が進む。
これらの学習方法により、EMMAは複数の自動運転タスクを同時にこなす汎用的なモデルとして設計されており、エンドツーエンドで動作するため、シンプルでありながら高いパフォーマンスを実現している。
3 Experiments
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
3 Experiments We highlight the experiments used to validate the efficacy of EMMA model. We leverage the smallest size of Gemini, i.e., Gemini 1.0 Nano-1 [Gemini Team Google, 2023] for all the experiments. In the following section, we first present the results of end-to-end planner trajectory generation on two public datasets in Section 3.1. Next, we conduct additional experiments on our internal datasets, studying the impact of chain-of-thought and data scaling in Section 3.2. Section 3.3 focuses on experiments involving three core perception tasks: 3D object detection, road graph estimation, and scene understanding. Our co-training results for the generalist model are summarized in Section 3.4. Finally, we showcase visual results that highlight EMMA capabilities in challenging, long-tail scenarios in Section 3.5. 3.1 End-to-End Motion Planning on Public Datasets We conduct the end-to-end planner trajectory generation experiments on two public datasets, the Waymo Open Motion Dataset (WOMD) [Chen et al., 2024a] and the nuScenes dataset [Caesar et al., 2020]. EMMA is trained with the simplest end-to-end planner trajectory generation formulation as in Equation 2. That is, given camera images, ego vehicle history, and driving intent, the model is asked to predict the future ego waypoints for a certain time horizon. 7 3.1.1 Driving on the Waymo Open Motion Dataset (WOMD) WOMD comprises 103k real-world urban and suburban driving scenarios, each lasting 20 seconds. These scenarios are further segmented into 1.1M examples, each representing a 9-second window: 1 second is used as input context, and the remaining 8 seconds serve as the prediction target. The dataset includes detailed map features such as traffic signal states and lane characteristics, along with agent states such as position, velocity, acceleration, and bounding boxes. WOMD is typically used for the agent motion prediction task, i.e., given the agent tracks for the past 1 second on a corresponding map, predict the positions of up to 8 agents for 8 seconds into the future. We adopt the exact same setting but only train and evaluate the future motion prediction of the ego vehicle, i.e. the autonomous driving vehicle, as the planning benchmark. In addition, we leverage the ‘high-level intent command’ field, used for evaluating motion prediction, as the planning router instruction. We adapt some of the best motion prediction models, MotionLM [Seff et al., 2023] and enhanced Wayformer [Nayakanti et al., 2023], as the baselines of this planning benchmark. During inference, sampling a final trajectory from multiple candidate trajectories plays a critical role in the final performance. Both MotionLM and Wayformer generate 192 candidate trajectories, which are subsequently aggregated into 6 clusters using k-means clustering, resulting in 6 representative trajectories. The final trajectory is selected from these 6 representative trajectories based on their associated probabilities. For fairness, we also sample multiple trajectories using a Top-K decoding strategy, up to K = 24. We then compute the pairwise L2 distance between all trajectories and select the one with the lowest average L2 distance as the final predicted trajectory, which can be viewed as the “median” trajectory among all the predictions. In the ADE calculation, we treat this median trajectory with the highest probability. All WOMD eval results are based on the validation set. As shown in Table 1, our model achieved similar performance as the MotionLM baseline when we train only on the WOMD dataset, with Gemini pre-trained weights. When pre-trained with our internal dataset (denoted as EMMA+), our model surpasses both MotionLM and Wayformer up to ADE 5s in the future. EMMA is only slightly worse than Wayformer at ADE 8s, where camerainduced depth error becomes a prominent error resource. We note the differences in inputs between MotionLM and EMMA: MotionLM takes inputs of agent location history, agent interactions, the road graph, and traffic light states. These agent boxes are produced by specialized off-board perception models that look at both past and future observations and are trained with a large amount of carefully curated human labels, the road graph is manually generated using full run segments, and all inputs heavily use LiDAR data with superior depth estimation. In stark contrast, EMMA only takes camera images and ego vehicle history as input, without the need of any labels or additional models (besides leveraging the Gemini pre-trained weights). Notably, for this work, we added internally available image inputs, which are not currently in the released WOMD. We further investigate the impact of the number of sampled trajectories on ADE, as illustrated in Figure 3. The results highlight that sampling from multiple trajectories, rather than a single one, leads to a notable improvement in ADE. However, the advantage of increasing the number of candidate trajectories diminishes beyond a certain point. 3.1.2 Driving on the nuScenes Dataset The nuScenes dataset [Caesar et al., 2020] offers a comprehensive autonomous vehicle sensor suite for evaluation. It consists of 1,000 scenes, each spanning 20 seconds, and includes information from 6 cameras, 5 radars, and 1 LiDAR that collectively provide 360-degree coverage in the field of view. The dataset is fully annotated with 3D bounding boxes across 23 classes and 8 attributes. In our experiments, we follow the standard protocol for planning evaluation: predict the next 3 seconds of future driving actions based on 2 seconds of historical data. We measure the planning quality with L2 errors at 1-, 2- and 3-second time horizons, aligning with established baseline methods. As shown in Table 2, our self-supervised EMMA achieves state-of-the-art results in planning on nuScenes, outperforming all previous supervised (with intermediate perception labels and/or human labels) and self-supervised (no extra labels) methods. Under the same self-supervised setup, EMMA outperforms previous BEV-Planner [Li et al., 2024] by 17.1% in average L2 metric; even compared to DriveVLM-Dual [Tian et al., 2024] that heavily uses intermediate perception and human labels, our self-supervised EMMA still improves the average L2 metric by 6.4%. 8 Method self-supervised? L2 (m) 1s L2 (m) 2s L2 (m) 3s Avg L2 (m) UniAD [Hu et al., 2023] ✗ 0.42 0.64 0.91 0.66 DriveVLM [Tian et al., 2024] ✗ 0.18 0.34 0.68 0.40 VAD [Jiang et al., 2023] ✗ 0.17 0.34 0.60 0.37 OmniDrive [Wang et al., 2024a] ✗ 0.14 0.29 0.55 0.33 DriveVLM-Dual [Tian et al., 2024] ✗ 0.15 0.29 0.48 0.31 Ego-MLP∗ [Zhai et al., 2023] ✓ 0.15 0.32 0.59 0.35 BEV-Planner [Li et al., 2024] ✓ 0.16 0.32 0.57 0.35 EMMA (random init) ✓ 0.15 0.33 0.63 0.37 EMMA ✓ 0.14 0.29 0.54 0.32 EMMA+ ✓ 0.13 0.27 0.48 0.29 Table 2: End-to-end motion planning experiments on nuScenes [Caesar et al., 2020]. EMMA (random init) denotes models are randomly initialized; EMMA denotes models are initialized from Gemini; EMMA+ denotes models are pre-trained on our internal data. EMMA achieves state-of-the-art performance on the nuScenes planning benchmark, outperforming the supervised (with perception and/or human labels) prior art by 6.4% and self-supervised (no extra labels) prior art by 17.1%. ∗Ego-MLP results are taken from a reproduced version in BEV-Planner. Scene description Critical object Meta decision Behavior description Relative improvements over baseline e2e planning ✓ ✗ ✗ ✗ + 0.0% ✗ ✓ ✗ ✗ + 1.5% ✗ ✗ ✓ ✗ + 3.0% ✗ ✓ ✓ ✗ + 5.7% ✗ ✓ ✓ ✓ + 6.7% Table 3: Ablation study on chain-of-thought reasoning components. It improves end-to-end planning quality by up to 6.7% by combining all elements. In particular, driving meta-decision and critical objects contribute the improvements of 3.0% and 1.5%, respectively. The details of each component is described in Section 2.2. Unlike in WOMD, we note that sampling multiple trajectories did not yield significant improvements. We hypothesize that this is due to nuScenes’ shorter prediction time horizon (3s) in simpler driving scenarios. Thus, we report only top-1 predictions for our results. 3.2 End-to-End Motion Planning with Chain-of-Thought Reasoning on Internal Dataset In this section, we present our studies of end-to-end planning with chain-of-thought on our internal dataset. This dataset contains millions of scenarios, orders of magnitude larger than any publicly available autonomous driving dataset. The model takes in 2 seconds of history to predict the driving actions for 5 seconds into the future. Table 3 presents the results of our experiments on chain-of-thought reasoning applied to end-to-end planning. By adopting the chain-of-thought formulation introduced in Equation 3, we achieve a notable 6.7% improvement over the standard end-to-end planning approach detailed in Equation 2. We also conduct an ablation study to analyze the contributions of different rationale components. Our findings reveal that both driving meta-decision and critical object identification significantly enhance performance, contributing improvements of 3.0% and 1.5%, respectively. When these components are combined, the gains are even more substantial. Conversely, while scene description has a neutral impact on driving performance, it enhances the model’s explainability. These results demonstrate that chain-of-thought reasoning can meaningfully improve driving performance, particularly when its components are carefully selected and integrated. We also perform a series of data scaling experiments for end-to-end planning, the results of which are illustrated in Figure 4. As we train the model on a larger training set, we observe lower eval perplexities before overfitting. Our findings indicate that the driving quality of EMMA has not yet plateaued, even with the current large-scale dataset. 9 Figure 4: EMMA data scaling experiments on our internal dataset. Each curve represents the eval perplexity for end-to-end motion planning as training proceeds with more steps. The x-axis is training compute, measured by floating-point operations (FLOPs) in log scale. The same EMMA model is trained on four sizes of datasets that are sampled with different percentages from 3% to 100% (denoted by different colors). In general, EMMA tends to achieve better quality until overfitting when given more training compute, but it also overfits quickly on smaller datasets. We observe the driving quality has not saturated when using the full large-scale dataset. 3.3 Perception Tasks We summarize our studies on three main perception tasks: 3D object detection, road graph estimation, and scene understanding. 3.3.1 3D Object Detection We validate our 3D object detection performance on the 3D camera-primary detection benchmark from the Waymo Open Dataset [Sun et al., 2020] using the d Longitudinal Error Tolerant (LET) matching [Hung et al., 2024]. We evaluate two versions: EMMA and EMMA+, similar to earlier sections, where EMMA+ is pre-trained on the 3D detection task using our internal dataset. The quantitative results are reported on the official test set and summarized in Figure 5. Our findings show that after pre-training, EMMA+ achieves competitive performance on the benchmark. Since our model output set of detected boxes without individual confidence scores, we directly compare the precision/recall instead of LET-3D-AP, which is calculated based on the precision/recall curve. We also compare the commonly used F1-score, where EMMA’s F1-score is computed using the single precision/recall and other models’ F1-scores are calculated by picking the maximal F1-score on the curve (often called F1-max). Figure 5 shows the performance comparison. In generally, EMMA+ demonstrates substantial improvements over state-of-the-art methods such as BEVFormer [Li et al., 2022b], achieving a 16.3% relative increase in vehicle precision at the same recall, and a 5.5% recall improvement at the same precision. EMMA+ also achieve better F1-score than prior arts. Performance on the pedestrian class is also comparable to that of MV-FCOS3D++ [Wang et al., 2021]. Additionally, we provide a performance breakdown across different ranges, highlighting that our model performs especially well in the near range. Our results underscore that with sufficient data and a large enough model, a multimodal approach can surpass specialized expert models in 3D object detection quality. 3.3.2 Road Graph Estimation Road graph estimation is a complex task that predicts a group of unordered polylines, each of which is represented as a sequence of waypoints. We measure the quality of road graph prediction with two metrics: (1) lane-level precision and recall, where we define a true positive match between a predicted lane polyline and a groundtruth lane polyline if and only if their Chamfer distance is within 1 meter; and (2) pixel-level precision and recall, where polylines are rasterized into a BEV grid with 1 meter resolution – we then treat the BEV grid as a image and compute precision and recall based on per-pixel matching. As discussed in Section 2.3, this task involves several design choices. One is about the representation of road graph polylines, where our choice is to define the start and end points of each lane, with 10 Figure 5: Camera-primary 3D object detection experiments on WOD [Sun et al., 2020] using the standard LET matching [Hung et al., 2024]. EMMA+ denotes model is pre-trained on our internal data. EMMA+ achieves competitive performance on the detection benchmark in both precision/recall and F1-score metrics. Compared to state-of-the-art methods, it achieves 16.3% relative improvements in vehicle precision at the same recall or 5.5% recalls improvement at the same precision. intermediate points added as needed to accurately capture the road’s curvature. Another critical design choice is the construction of target label sequences used for model training. Drawing inspiration from Pix2Seq [Chen et al., 2022a] in the context of object detection, one effective design choice is to pad the targets and apply random shuffling. This technique helps the model handle unordered outputs and prevents premature termination during training. Figure 6 presents our ablation studies on various design choices. Starting from our best designs, we systematically ablate each of the following configurations and assess the resulting quality degradation. We then summarize the key insights from our analysis. Polyline representation: dynamic sampling is better than fixed sampling. A simple polyline representation is to sample a fixed number of sparse control points per lane, e.g. two end points plus a fixed number of intermediate points to capture curvature. However, we find a better approach is to dynamically adjust the number of points per polyline according to the curvature and length of the lane. By keeping a consistent waypoint density rather than a consistent number of waypoints, we achieve a representation that more accurately captures the lane structure intricacies, yielding around a 40% to 90% difference in the metrics as shown in Figure 6. 11 Figure 6: Ablation study on road graph estimation. To evaluate the influence of different components in our road graph estimation model, we ablate each configuration and measure the corresponding impact on quality. Dynamic sampling (leftmost) of road graph polylines based on lane curvature and length proves to be the most significant factor, leading to a substantial 70% to 90% change in lane-level precision and recall. In contrast, aligning the model with a language-like representation, i.e., semantic punctuation (rightmost), has a more modest effect, contributing to only <10% change in precision and recall of any metric. Polyline representation: ego-origin aligned sample intervals are better than naively aligned sample intervals. The road graph is typically stored and accessed in global coordinate frame, meaning lane origins and extensions are independent of the ego vehicle position. To improve accuracy, it is essential to adjust lane point samples to start from the ego vehicle coordinate frame origin. Specifically, sampling polyline points relative to the AV position (ego-origin) avoids arbitrary offsets that can arise from directly transforming points sampled in the global coordinate frame into the ego coordinate frame. This prevents a 25% to 60% drop in prediction quality. Target sequence construction: shuffled ordering is better than arbitrary ordering. We organize polyline targets into bins based on their endpoint distance from the ego vehicle, providing a rough global ordering. For instance, we categorize lanes into nearby lanes and those further away that serve as connecting lanes. During training, we dynamically shuffle the polylines within each distance bin to enhance the model robustness and coverage. This dynamic shuffling within each bin improves the model’s ability to generalize across different lane configurations, leading to more accurate predictions. Target sequence construction: padding is better than non-padding. Similar to Pix2Seq [Chen et al., 2022a], we find that padding targets to prevent early termination is highly effective. In addition to padding the total number of polyline targets, we also pad the number of points within each polyline. We use “invalid” tokens to represent padded points within polylines. Each polyline is also explicitly tagged with a final “valid” or “invalid” token to denote whether it contains any nonpadded points. This approach ensures consistent input sizes, which helps maintain the integrity of the model during training and reduces the risk of premature truncation, leading to more reliable and accurate predictions. Target sequence construction: adding punctuation and other semantically redundant token improves quality. In the target sequence construction, we notice that it is beneficial to use language-like structures and punctuation to group targets (e.g., "(x,y and x,y);..." instead of "xy xy;..."). Additionally, explicitly including semantically redundant tokens – such as marking padded targets as “invalid” instead of relying on implicit omissions of “valid” markers – improves performance. This approach, incorporating punctuation and redundancy, results in a boost of up to 10% in lane-level metrics. We attribute this improvement to the language-related pre-training of Gemini. By leveraging similar structured expressions, Gemini can be more easily adapted to other tasks. 3.3.3 Scene Understanding Figure 7 summarizes our studies on the scene understanding task for temporary blockage detection. Our study is based on our internal datasets specifically curated for these scenarios. For this study, we establish our baselines by showing a picture to human and asking them to judge whether a lane is temporarily blocked. They can answer ‘yes’, ‘no’, or ‘unsure’. Our baseline will treat 12 Figure 7: Scene understanding experiments. direct fine-tuning denotes solely using the temporal blockage data during fine-tuning; naive mixture denotes co-training this scene task with road graph estimation; mix + short pretraining denotes pre-training on road graph esitmation first, and then fine-tune on the mixture of both tasks; mix + long pretraining denotes a longer pre-training before fine-tuning. The naive fine-tuning is already close to strong human baseline, but long-pretraining with training mixture can further boost the quality. Relative improvement over single task e2e planning 3D detection road graph estimation e2e planning detection road graph ✓ ✓ - +1.6% (±1.0%) +2.4% ( ±0.8%) ✓ ✓ +1.4% (±2.8%) +5.6% (±1.1%) - ✓ ✓ −1.4% (±2.9%) - +3.5% (±0.9%) ✓ ✓ ✓ +1.4% (±2.8%) +5.5% (±1.1%) +2.4% (±0.8%) Table 4: Generalist co-training experiments. (±∗) indicates standard deviation. By co-training on multiple tasks, EMMA gains a broader understanding of driving scenes, enabling it to handle various tasks at inference time, while enhancing individual task performance. Notably, certain task pairings yield greater benefits than others, suggesting these tasks are complementary. Co-training all three tasks together yields the best quality. all ‘unsure’ answers as incorrect, baseline+filtering will filter out all examples with ‘unsure’ answers. In contrast, our model is fine-tuned to predict ‘yes’ or ‘no’ for all examples. As shown in the figure, our naive model that is directly fine-tuned for only this task achieves better performance than the baseline comparison, but underperforms on the baseline+filtering comparison. To boost the model performance, our first attempt is to co-train this task with road graph estimation, but the naive mixture doesn’t improve performance. Our second attempt is to first pre-train the model on road graph estimation, and then fine-tune on these two tasks. Results show when the pre-training is long enough, the quality is boosted, showcasing the model’s ability to integrate multiple tasks for enhanced performance.▼生成AIの回答(ChatGPT 4o mini)
① ここで使用された機械学習のモデルについて、次の内容が記載されています。
モデルへの入力形式:
- Waymo Open Motion Dataset (WOMD) の実験において、EMMAモデルはカメラ画像、自己車両の履歴、運転意図(ドライビングインテント)を入力として使用します。これに基づいて、未来の自己車両の位置(ウェイポイント)を予測します。
- nuScenes Dataset の実験においても、EMMAはカメラ画像と自己車両の履歴を入力として使用します。これに基づいて、次の3秒間の運転アクション(位置予測)を行います。
モデルからの出力形式:
- Waymo Open Motion Dataset (WOMD) と nuScenes Dataset の実験において、モデルの出力は将来の自己車両の運転経路(ウェイポイント)です。特にWOMDの実験では、モデルは未来の8秒間の自己車両の位置を予測します。nuScenesの実験では、1秒、2秒、3秒先の位置予測が求められます。
- 出力は、複数の候補軌道(トラジェクトリ)を生成し、その中から最適なものを選ぶ形式となっています。
データセット:
- Waymo Open Motion Dataset (WOMD): 103,000の都市部および郊外の実世界での運転シナリオを含むデータセットで、1秒間の履歴情報と8秒間の予測情報が与えられます。このデータセットは、高精度な地図情報、交通信号状態、車両状態(位置、速度、加速度)などの詳細情報を含んでいます。
- nuScenes Dataset: 1,000のシーンから構成され、各シーンは20秒間のデータを提供します。6台のカメラ、5台のレーダー、1台のLiDARを使ったセンサーデータが含まれ、すべてのシーンには23のクラスと8つの属性を持つ3Dバウンディングボックスがアノテーションされています。
モデルの構造:
- EMMAは、カメラ画像と自己車両の履歴を用いて未来の運転アクション(位置予測)を行う自己教師ありのモデルです。特に、Gemini 1.0 Nano-1という事前学習された重みを持つモデルが使用されており、この重みを使ってEMMAがトレーニングされます。
- モデルは、カメラ画像を入力として受け取り、その画像から車両の運転経路を予測する能力を持っています。モデルはエンドツーエンドで動作し、追加のラベルや外部モデルなしで動作することが強調されています。
計算量およびパラメータ数:
- 計算量: 記事内には直接的な計算量(FLOPs)に関する具体的な数値は記載されていませんが、データスケーリング実験において、モデルは訓練データ量に比例して計算リソース(FLOPs)が増加することが言及されています。訓練中の計算量が増加すると、モデルは最適化され、過剰適合する前に評価のパープレキシティが低下します。これは、トレーニングステップが増えることによって計算負荷が増加することを意味します。
- パラメータ数: EMMAのパラメータ数に関する直接的な記載はありませんが、Geminiの重みを使って事前学習されたモデルとして、モデルはかなり大規模であると推測されます。具体的なパラメータ数については記載がなく、推測には限界がありますが、Gemini 1.0 Nano-1がベースになっているため、このモデルは数億から数十億のパラメータを有する可能性があります。
▼質問(人間)
続けて下記の質問に追加で回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)
①ここで使用された機械学習では、どんな学習が行われたのか、記載があれば教えて下さい。
▼生成AIの回答(ChatGPT 4o mini)
本研究における機械学習では、主にエンドツーエンドのモーションプランニングと自己教師あり学習が行われている。具体的には、以下の学習方法が記載されている。
-
エンドツーエンドのモーションプランニング:
- モデルは、車両のカメラ画像、エゴ車両の履歴、および運転意図を入力として、未来のエゴ車両の位置(ウェイポイント)を予測するタスクに対して学習されている。
- 予測は、未来の8秒間の動きを予測することが求められ、実際には9秒間のウィンドウを入力として1秒をコンテキストとして使用し、残りの8秒を予測ターゲットとして用いる。
-
自己教師あり学習:
- EMMAモデルは、教師なし(自己教師あり)で学習を行い、ラベル付きのデータや外部の人間によるラベルを使用せずに学習することが特徴である。
- モデルは、自己教師あり学習の一環として、例えばnuScenesデータセットにおいて、過去の2秒間のデータを入力として、次の3秒間の車両の動き(計画行動)を予測するタスクを学習している。
- 他の従来の手法(例えば、BEV-PlannerやDriveVLM-Dualなど)が使用する人間によるラベルや中間的な感覚情報を用いず、EMMAは自己教師あり学習を通じて優れた性能を達成している。
-
チェイン・オブ・ソート(Chain-of-Thought)推論:
- 本研究では、チェイン・オブ・ソート推論を用いた学習が行われ、複数の推論要素を組み合わせることによって、エンドツーエンドのモーションプランニングのパフォーマンスを向上させている。具体的には、運転におけるメタ判断や重要な物体の認識、シーンの説明といった要素が、モデルの予測精度を向上させる役割を果たしている。
- これらの要素は、モデルの解釈可能性を高めるだけでなく、運転性能をも向上させることが示されている。特に、メタ判断や重要物体の識別は、パフォーマンス向上において大きな貢献をしている。
-
データスケーリング:
- より大規模なデータセットを使用した学習も行われており、EMMAは、内部の大規模データセットで学習を行うことで、計算量を増やすことなく、モデルのパフォーマンスを向上させることができることが確認されている。
- トレーニングデータのスケールが大きくなることで、モデルの予測精度(特に評価困難度が低くなる)は改善され、過学習の兆候が見られる前に安定したパフォーマンスを発揮する。
-
共学習(Co-training):
- EMMAは、異なるタスクを共学習することで、異なるドライビングシーンを理解し、複数のタスク(例:モーションプランニング、3D物体検出、道路グラフ推定など)を同時に処理できるようになる。これにより、個々のタスクにおけるパフォーマンスが向上し、さらに一般的な理解力を高めることができる。
- 特に、全てのタスクを共学習させることで最も大きな改善が見られるが、タスクの組み合わせによって相乗効果が得られることが示唆されている。
これらの学習方法を通じて、EMMAは、他の先行研究に比べて高い性能を発揮しており、特に自己教師あり学習によって、ラベルなしデータでの強力な学習能力を示している。
結果に関する部分を把握する
3 Experiments
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
3 Experiments We highlight the experiments used to validate the efficacy of EMMA model. We leverage the smallest size of Gemini, i.e., Gemini 1.0 Nano-1 [Gemini Team Google, 2023] for all the experiments. In the following section, we first present the results of end-to-end planner trajectory generation on two public datasets in Section 3.1. Next, we conduct additional experiments on our internal datasets, studying the impact of chain-of-thought and data scaling in Section 3.2. Section 3.3 focuses on experiments involving three core perception tasks: 3D object detection, road graph estimation, and scene understanding. Our co-training results for the generalist model are summarized in Section 3.4. Finally, we showcase visual results that highlight EMMA capabilities in challenging, long-tail scenarios in Section 3.5. 3.1 End-to-End Motion Planning on Public Datasets We conduct the end-to-end planner trajectory generation experiments on two public datasets, the Waymo Open Motion Dataset (WOMD) [Chen et al., 2024a] and the nuScenes dataset [Caesar et al., 2020]. EMMA is trained with the simplest end-to-end planner trajectory generation formulation as in Equation 2. That is, given camera images, ego vehicle history, and driving intent, the model is asked to predict the future ego waypoints for a certain time horizon. 7 3.1.1 Driving on the Waymo Open Motion Dataset (WOMD) WOMD comprises 103k real-world urban and suburban driving scenarios, each lasting 20 seconds. These scenarios are further segmented into 1.1M examples, each representing a 9-second window: 1 second is used as input context, and the remaining 8 seconds serve as the prediction target. The dataset includes detailed map features such as traffic signal states and lane characteristics, along with agent states such as position, velocity, acceleration, and bounding boxes. WOMD is typically used for the agent motion prediction task, i.e., given the agent tracks for the past 1 second on a corresponding map, predict the positions of up to 8 agents for 8 seconds into the future. We adopt the exact same setting but only train and evaluate the future motion prediction of the ego vehicle, i.e. the autonomous driving vehicle, as the planning benchmark. In addition, we leverage the ‘high-level intent command’ field, used for evaluating motion prediction, as the planning router instruction. We adapt some of the best motion prediction models, MotionLM [Seff et al., 2023] and enhanced Wayformer [Nayakanti et al., 2023], as the baselines of this planning benchmark. During inference, sampling a final trajectory from multiple candidate trajectories plays a critical role in the final performance. Both MotionLM and Wayformer generate 192 candidate trajectories, which are subsequently aggregated into 6 clusters using k-means clustering, resulting in 6 representative trajectories. The final trajectory is selected from these 6 representative trajectories based on their associated probabilities. For fairness, we also sample multiple trajectories using a Top-K decoding strategy, up to K = 24. We then compute the pairwise L2 distance between all trajectories and select the one with the lowest average L2 distance as the final predicted trajectory, which can be viewed as the “median” trajectory among all the predictions. In the ADE calculation, we treat this median trajectory with the highest probability. All WOMD eval results are based on the validation set. As shown in Table 1, our model achieved similar performance as the MotionLM baseline when we train only on the WOMD dataset, with Gemini pre-trained weights. When pre-trained with our internal dataset (denoted as EMMA+), our model surpasses both MotionLM and Wayformer up to ADE 5s in the future. EMMA is only slightly worse than Wayformer at ADE 8s, where camerainduced depth error becomes a prominent error resource. We note the differences in inputs between MotionLM and EMMA: MotionLM takes inputs of agent location history, agent interactions, the road graph, and traffic light states. These agent boxes are produced by specialized off-board perception models that look at both past and future observations and are trained with a large amount of carefully curated human labels, the road graph is manually generated using full run segments, and all inputs heavily use LiDAR data with superior depth estimation. In stark contrast, EMMA only takes camera images and ego vehicle history as input, without the need of any labels or additional models (besides leveraging the Gemini pre-trained weights). Notably, for this work, we added internally available image inputs, which are not currently in the released WOMD. We further investigate the impact of the number of sampled trajectories on ADE, as illustrated in Figure 3. The results highlight that sampling from multiple trajectories, rather than a single one, leads to a notable improvement in ADE. However, the advantage of increasing the number of candidate trajectories diminishes beyond a certain point. 3.1.2 Driving on the nuScenes Dataset The nuScenes dataset [Caesar et al., 2020] offers a comprehensive autonomous vehicle sensor suite for evaluation. It consists of 1,000 scenes, each spanning 20 seconds, and includes information from 6 cameras, 5 radars, and 1 LiDAR that collectively provide 360-degree coverage in the field of view. The dataset is fully annotated with 3D bounding boxes across 23 classes and 8 attributes. In our experiments, we follow the standard protocol for planning evaluation: predict the next 3 seconds of future driving actions based on 2 seconds of historical data. We measure the planning quality with L2 errors at 1-, 2- and 3-second time horizons, aligning with established baseline methods. As shown in Table 2, our self-supervised EMMA achieves state-of-the-art results in planning on nuScenes, outperforming all previous supervised (with intermediate perception labels and/or human labels) and self-supervised (no extra labels) methods. Under the same self-supervised setup, EMMA outperforms previous BEV-Planner [Li et al., 2024] by 17.1% in average L2 metric; even compared to DriveVLM-Dual [Tian et al., 2024] that heavily uses intermediate perception and human labels, our self-supervised EMMA still improves the average L2 metric by 6.4%. 8 Method self-supervised? L2 (m) 1s L2 (m) 2s L2 (m) 3s Avg L2 (m) UniAD [Hu et al., 2023] ✗ 0.42 0.64 0.91 0.66 DriveVLM [Tian et al., 2024] ✗ 0.18 0.34 0.68 0.40 VAD [Jiang et al., 2023] ✗ 0.17 0.34 0.60 0.37 OmniDrive [Wang et al., 2024a] ✗ 0.14 0.29 0.55 0.33 DriveVLM-Dual [Tian et al., 2024] ✗ 0.15 0.29 0.48 0.31 Ego-MLP∗ [Zhai et al., 2023] ✓ 0.15 0.32 0.59 0.35 BEV-Planner [Li et al., 2024] ✓ 0.16 0.32 0.57 0.35 EMMA (random init) ✓ 0.15 0.33 0.63 0.37 EMMA ✓ 0.14 0.29 0.54 0.32 EMMA+ ✓ 0.13 0.27 0.48 0.29 Table 2: End-to-end motion planning experiments on nuScenes [Caesar et al., 2020]. EMMA (random init) denotes models are randomly initialized; EMMA denotes models are initialized from Gemini; EMMA+ denotes models are pre-trained on our internal data. EMMA achieves state-of-the-art performance on the nuScenes planning benchmark, outperforming the supervised (with perception and/or human labels) prior art by 6.4% and self-supervised (no extra labels) prior art by 17.1%. ∗Ego-MLP results are taken from a reproduced version in BEV-Planner. Scene description Critical object Meta decision Behavior description Relative improvements over baseline e2e planning ✓ ✗ ✗ ✗ + 0.0% ✗ ✓ ✗ ✗ + 1.5% ✗ ✗ ✓ ✗ + 3.0% ✗ ✓ ✓ ✗ + 5.7% ✗ ✓ ✓ ✓ + 6.7% Table 3: Ablation study on chain-of-thought reasoning components. It improves end-to-end planning quality by up to 6.7% by combining all elements. In particular, driving meta-decision and critical objects contribute the improvements of 3.0% and 1.5%, respectively. The details of each component is described in Section 2.2. Unlike in WOMD, we note that sampling multiple trajectories did not yield significant improvements. We hypothesize that this is due to nuScenes’ shorter prediction time horizon (3s) in simpler driving scenarios. Thus, we report only top-1 predictions for our results. 3.2 End-to-End Motion Planning with Chain-of-Thought Reasoning on Internal Dataset In this section, we present our studies of end-to-end planning with chain-of-thought on our internal dataset. This dataset contains millions of scenarios, orders of magnitude larger than any publicly available autonomous driving dataset. The model takes in 2 seconds of history to predict the driving actions for 5 seconds into the future. Table 3 presents the results of our experiments on chain-of-thought reasoning applied to end-to-end planning. By adopting the chain-of-thought formulation introduced in Equation 3, we achieve a notable 6.7% improvement over the standard end-to-end planning approach detailed in Equation 2. We also conduct an ablation study to analyze the contributions of different rationale components. Our findings reveal that both driving meta-decision and critical object identification significantly enhance performance, contributing improvements of 3.0% and 1.5%, respectively. When these components are combined, the gains are even more substantial. Conversely, while scene description has a neutral impact on driving performance, it enhances the model’s explainability. These results demonstrate that chain-of-thought reasoning can meaningfully improve driving performance, particularly when its components are carefully selected and integrated. We also perform a series of data scaling experiments for end-to-end planning, the results of which are illustrated in Figure 4. As we train the model on a larger training set, we observe lower eval perplexities before overfitting. Our findings indicate that the driving quality of EMMA has not yet plateaued, even with the current large-scale dataset. 9 Figure 4: EMMA data scaling experiments on our internal dataset. Each curve represents the eval perplexity for end-to-end motion planning as training proceeds with more steps. The x-axis is training compute, measured by floating-point operations (FLOPs) in log scale. The same EMMA model is trained on four sizes of datasets that are sampled with different percentages from 3% to 100% (denoted by different colors). In general, EMMA tends to achieve better quality until overfitting when given more training compute, but it also overfits quickly on smaller datasets. We observe the driving quality has not saturated when using the full large-scale dataset. 3.3 Perception Tasks We summarize our studies on three main perception tasks: 3D object detection, road graph estimation, and scene understanding. 3.3.1 3D Object Detection We validate our 3D object detection performance on the 3D camera-primary detection benchmark from the Waymo Open Dataset [Sun et al., 2020] using the d Longitudinal Error Tolerant (LET) matching [Hung et al., 2024]. We evaluate two versions: EMMA and EMMA+, similar to earlier sections, where EMMA+ is pre-trained on the 3D detection task using our internal dataset. The quantitative results are reported on the official test set and summarized in Figure 5. Our findings show that after pre-training, EMMA+ achieves competitive performance on the benchmark. Since our model output set of detected boxes without individual confidence scores, we directly compare the precision/recall instead of LET-3D-AP, which is calculated based on the precision/recall curve. We also compare the commonly used F1-score, where EMMA’s F1-score is computed using the single precision/recall and other models’ F1-scores are calculated by picking the maximal F1-score on the curve (often called F1-max). Figure 5 shows the performance comparison. In generally, EMMA+ demonstrates substantial improvements over state-of-the-art methods such as BEVFormer [Li et al., 2022b], achieving a 16.3% relative increase in vehicle precision at the same recall, and a 5.5% recall improvement at the same precision. EMMA+ also achieve better F1-score than prior arts. Performance on the pedestrian class is also comparable to that of MV-FCOS3D++ [Wang et al., 2021]. Additionally, we provide a performance breakdown across different ranges, highlighting that our model performs especially well in the near range. Our results underscore that with sufficient data and a large enough model, a multimodal approach can surpass specialized expert models in 3D object detection quality. 3.3.2 Road Graph Estimation Road graph estimation is a complex task that predicts a group of unordered polylines, each of which is represented as a sequence of waypoints. We measure the quality of road graph prediction with two metrics: (1) lane-level precision and recall, where we define a true positive match between a predicted lane polyline and a groundtruth lane polyline if and only if their Chamfer distance is within 1 meter; and (2) pixel-level precision and recall, where polylines are rasterized into a BEV grid with 1 meter resolution – we then treat the BEV grid as a image and compute precision and recall based on per-pixel matching. As discussed in Section 2.3, this task involves several design choices. One is about the representation of road graph polylines, where our choice is to define the start and end points of each lane, with 10 Figure 5: Camera-primary 3D object detection experiments on WOD [Sun et al., 2020] using the standard LET matching [Hung et al., 2024]. EMMA+ denotes model is pre-trained on our internal data. EMMA+ achieves competitive performance on the detection benchmark in both precision/recall and F1-score metrics. Compared to state-of-the-art methods, it achieves 16.3% relative improvements in vehicle precision at the same recall or 5.5% recalls improvement at the same precision. intermediate points added as needed to accurately capture the road’s curvature. Another critical design choice is the construction of target label sequences used for model training. Drawing inspiration from Pix2Seq [Chen et al., 2022a] in the context of object detection, one effective design choice is to pad the targets and apply random shuffling. This technique helps the model handle unordered outputs and prevents premature termination during training. Figure 6 presents our ablation studies on various design choices. Starting from our best designs, we systematically ablate each of the following configurations and assess the resulting quality degradation. We then summarize the key insights from our analysis. Polyline representation: dynamic sampling is better than fixed sampling. A simple polyline representation is to sample a fixed number of sparse control points per lane, e.g. two end points plus a fixed number of intermediate points to capture curvature. However, we find a better approach is to dynamically adjust the number of points per polyline according to the curvature and length of the lane. By keeping a consistent waypoint density rather than a consistent number of waypoints, we achieve a representation that more accurately captures the lane structure intricacies, yielding around a 40% to 90% difference in the metrics as shown in Figure 6. 11 Figure 6: Ablation study on road graph estimation. To evaluate the influence of different components in our road graph estimation model, we ablate each configuration and measure the corresponding impact on quality. Dynamic sampling (leftmost) of road graph polylines based on lane curvature and length proves to be the most significant factor, leading to a substantial 70% to 90% change in lane-level precision and recall. In contrast, aligning the model with a language-like representation, i.e., semantic punctuation (rightmost), has a more modest effect, contributing to only <10% change in precision and recall of any metric. Polyline representation: ego-origin aligned sample intervals are better than naively aligned sample intervals. The road graph is typically stored and accessed in global coordinate frame, meaning lane origins and extensions are independent of the ego vehicle position. To improve accuracy, it is essential to adjust lane point samples to start from the ego vehicle coordinate frame origin. Specifically, sampling polyline points relative to the AV position (ego-origin) avoids arbitrary offsets that can arise from directly transforming points sampled in the global coordinate frame into the ego coordinate frame. This prevents a 25% to 60% drop in prediction quality. Target sequence construction: shuffled ordering is better than arbitrary ordering. We organize polyline targets into bins based on their endpoint distance from the ego vehicle, providing a rough global ordering. For instance, we categorize lanes into nearby lanes and those further away that serve as connecting lanes. During training, we dynamically shuffle the polylines within each distance bin to enhance the model robustness and coverage. This dynamic shuffling within each bin improves the model’s ability to generalize across different lane configurations, leading to more accurate predictions. Target sequence construction: padding is better than non-padding. Similar to Pix2Seq [Chen et al., 2022a], we find that padding targets to prevent early termination is highly effective. In addition to padding the total number of polyline targets, we also pad the number of points within each polyline. We use “invalid” tokens to represent padded points within polylines. Each polyline is also explicitly tagged with a final “valid” or “invalid” token to denote whether it contains any nonpadded points. This approach ensures consistent input sizes, which helps maintain the integrity of the model during training and reduces the risk of premature truncation, leading to more reliable and accurate predictions. Target sequence construction: adding punctuation and other semantically redundant token improves quality. In the target sequence construction, we notice that it is beneficial to use language-like structures and punctuation to group targets (e.g., "(x,y and x,y);..." instead of "xy xy;..."). Additionally, explicitly including semantically redundant tokens – such as marking padded targets as “invalid” instead of relying on implicit omissions of “valid” markers – improves performance. This approach, incorporating punctuation and redundancy, results in a boost of up to 10% in lane-level metrics. We attribute this improvement to the language-related pre-training of Gemini. By leveraging similar structured expressions, Gemini can be more easily adapted to other tasks. 3.3.3 Scene Understanding Figure 7 summarizes our studies on the scene understanding task for temporary blockage detection. Our study is based on our internal datasets specifically curated for these scenarios. For this study, we establish our baselines by showing a picture to human and asking them to judge whether a lane is temporarily blocked. They can answer ‘yes’, ‘no’, or ‘unsure’. Our baseline will treat 12 Figure 7: Scene understanding experiments. direct fine-tuning denotes solely using the temporal blockage data during fine-tuning; naive mixture denotes co-training this scene task with road graph estimation; mix + short pretraining denotes pre-training on road graph esitmation first, and then fine-tune on the mixture of both tasks; mix + long pretraining denotes a longer pre-training before fine-tuning. The naive fine-tuning is already close to strong human baseline, but long-pretraining with training mixture can further boost the quality. Relative improvement over single task e2e planning 3D detection road graph estimation e2e planning detection road graph ✓ ✓ - +1.6% (±1.0%) +2.4% ( ±0.8%) ✓ ✓ +1.4% (±2.8%) +5.6% (±1.1%) - ✓ ✓ −1.4% (±2.9%) - +3.5% (±0.9%) ✓ ✓ ✓ +1.4% (±2.8%) +5.5% (±1.1%) +2.4% (±0.8%) Table 4: Generalist co-training experiments. (±∗) indicates standard deviation. By co-training on multiple tasks, EMMA gains a broader understanding of driving scenes, enabling it to handle various tasks at inference time, while enhancing individual task performance. Notably, certain task pairings yield greater benefits than others, suggesting these tasks are complementary. Co-training all three tasks together yields the best quality. all ‘unsure’ answers as incorrect, baseline+filtering will filter out all examples with ‘unsure’ answers. In contrast, our model is fine-tuned to predict ‘yes’ or ‘no’ for all examples. As shown in the figure, our naive model that is directly fine-tuned for only this task achieves better performance than the baseline comparison, but underperforms on the baseline+filtering comparison. To boost the model performance, our first attempt is to co-train this task with road graph estimation, but the naive mixture doesn’t improve performance. Our second attempt is to first pre-train the model on road graph estimation, and then fine-tune on these two tasks. Results show when the pre-training is long enough, the quality is boosted, showcasing the model’s ability to integrate multiple tasks for enhanced performance.▼生成AIの回答(ChatGPT 4o mini)
この論文で行われた複数の実験の条件と、それぞれの実験から得られた結論を以下に整理します。
3.1 エンドツーエンドのモーションプランニング実験(公開データセット使用)
-
Waymo Open Motion Dataset (WOMD) での実験
- 条件: 自車(エゴ車)の未来の位置を予測するタスクを行い、MotionLMとWayformerをベースラインとして比較。入力はカメラ画像、エゴ車の履歴、ドライビングインテント。
- 結果: EMMAはMotionLMと同等のパフォーマンスを達成し、EMMA+(内部データで事前学習されたモデル)はMotionLMやWayformerを上回った。特に、EMMAは深度誤差の影響を受けやすいADE 8秒でわずかに劣る。
- 結論: EMMAはカメラ画像とエゴ車の履歴のみを入力として、高精度な予測を行えることを示しており、事前学習を行うことでさらに性能が向上する。
-
nuScenes での実験
- 条件: 2秒間の履歴データを基に3秒後の運転行動を予測。L2誤差を1、2、3秒の予測時間で評価。
- 結果: EMMAは従来の自己教師あり法や監視型法を超えるパフォーマンスを達成。EMMA+(内部データで事前学習されたモデル)は、BEV-Plannerと比較して17.1%のL2誤差改善を示し、DriveVLM-Dualをも上回る。
- 結論: EMMAは自己教師あり学習の設定でも最先端のパフォーマンスを発揮し、他の方法を大きく上回る結果を得た。
3.2 チェイン・オブ・ソート(思考の連鎖)によるモーションプランニング(内部データ使用)
-
条件: 大規模な内部データセットを使用し、2秒間の履歴を基に5秒後の運転行動を予測。チェイン・オブ・ソートによる推論を取り入れた実験。
- 結果: チェイン・オブ・ソートによる推論を使用することで、標準的なエンドツーエンドの計画手法に対して6.7%の改善を達成。特に「運転のメタ意思決定」と「重要なオブジェクトの識別」が重要な改善要因であり、それぞれ3.0%と1.5%の向上を示した。
- 結論: 思考の連鎖(チェイン・オブ・ソート)は、運転行動予測の精度を向上させる有力な手法である。
-
データスケーリング実験
- 条件: トレーニングデータの規模を増やすことでモデルのパフォーマンスがどのように変化するかを評価。
- 結果: より多くのデータを使ってトレーニングすると、過学習する前に評価困難度(perplexity)が低下し、モデルの運転性能が向上することが確認された。
- 結論: EMMAは、大規模なデータセットを用いることで、性能向上が持続的に得られ、まだ性能が飽和していないことが分かった。
3.3 知覚タスク(3D物体検出、道路グラフ推定、シーン理解)
-
3D物体検出
- 条件: Waymo Open Datasetを使用し、カメラを基にした3D物体検出の精度を評価。
- 結果: EMMA+はBEVFormerなどの最先端手法を上回る精度を示し、車両クラスでは16.3%の精度向上、歩行者クラスでも良好なパフォーマンスを達成した。
- 結論: 十分なデータと大規模なモデルを使用することで、専門的な3D物体検出モデルを超えるパフォーマンスを達成できる。
-
道路グラフ推定
- 条件: 道路のポリライン(道の形状)を予測するタスクで、精度はレーンごとの適合度(Chamfer距離)を基に評価。
- 結果: 動的サンプリングによるポリライン表現が、固定サンプリングよりもはるかに優れていることが確認された。また、目標シーケンスの構築においてパディングとシャッフルが有効であることが示された。
- 結論: 道路グラフ推定には、カーブに応じた動的サンプリングや、目標シーケンスの構築における戦略が性能向上に寄与する。
-
シーン理解
- 条件: 一時的な障害物検出を目的としたシーン理解タスクを、道路グラフ推定と共同で学習。
- 結果: 道路グラフ推定と共同で学習した場合、モデルのパフォーマンスが向上することが示された。長期的な事前学習がモデルの性能向上に貢献した。
- 結論: 複数のタスクを共同で学習することで、モデルはシーンの理解を深め、各タスクの性能が向上する。
まとめ
- EMMAモデルは、公開データセット(WOMD、nuScenes)で優れたエンドツーエンドのモーションプランニングを実現し、特に事前学習によって精度が向上する。
- チェイン・オブ・ソート(思考の連鎖)を取り入れることで、運転行動予測の精度が向上することが示された。
- 3D物体検出、道路グラフ推定、シーン理解のタスクにおいて、EMMAは他の最先端手法と比較して優れたパフォーマンスを示し、特に複数のタスクを共同で学習することが有益であることがわかった。
▼質問(人間)
①上記の「EMMA+(内部データで事前学習~」について詳細を教えて下さい。
②上記の「特に「運転のメタ意思決定」と「重要なオブジェクトの識別」が重要な改善要因であり」について詳細を教えて下さい。
③上記の「レーンごとの適合度(Chamfer距離)」について詳細を教えて下さい。
▼生成AIの回答(ChatGPT 4o mini)
① EMMA+(内部データで事前学習)の詳細
EMMA+は、EMMAモデルの強化版で、内部データセットで事前学習(プリトレーニング)を行ったバージョンです。この事前学習の主な利点は、EMMAがより広範な運転シナリオに適応できるようになることです。具体的には、内部データセットには、公開されているデータセットに比べて数桁以上多くのシナリオが含まれており、EMMAはそれに基づいて事前にトレーニングされることで、より強力な予測能力を持つことになります。その結果、公開データセットでの評価(例えば、nuScenesデータセットにおける性能)でEMMA+は従来の手法を上回り、特に予測の精度が向上しています。
EMMA+がどのように性能を向上させるかというと、内部データでの事前学習によって、EMMAモデルは運転中の状況をより深く理解できるようになり、複雑な運転シナリオにも対応できるようになります。これにより、より精度の高い軌道予測が可能となり、物体検出や道路ネットワーク推定など他の知覚タスクでも優れたパフォーマンスを発揮しています。
② 「運転のメタ意思決定」と「重要なオブジェクトの識別」が重要な改善要因であること
「運転のメタ意思決定」とは、運転の過程での大きな判断や意思決定、つまり例えば「次に何をすべきか」といった戦略的な判断を指します。このメタ意思決定がモデルに組み込まれることで、運転中の複雑なシナリオにおいて適切な選択を行えるようになり、全体の運転パフォーマンスが向上します。例えば、前方の車両の動きに対して、どのタイミングで車線変更を行うか、または速度を調整するかを決めることです。このような大きな判断が精度よく行えることが重要です。
また、「重要なオブジェクトの識別」とは、周囲の環境において、特に注意すべき物体(例えば障害物、他の車両、歩行者、信号など)を正確に特定する能力です。この識別能力が強化されることで、運転シナリオにおいて重要な要素を見逃すことなく、それに基づいて適切に行動できるようになります。
これらの要素(メタ意思決定と重要オブジェクトの識別)が組み合わさることで、運転の精度と安全性が向上し、実際の道路状況でも非常に有用なモデルとなります。特に、これらの要素は「連想的思考(chain-of-thought)」的なアプローチを通じて、シナリオに基づいて自ら推論し、適切な行動を導くために重要です。
③ レーンごとの適合度(Chamfer距離)についての詳細
Chamfer距離は、2つの点集合(ここでは道路のレーンポリラインと予測されたレーンポリライン)の間の類似度を測定するために使用される距離の一つです。レーン予測においては、レーンのポリライン(道路の各車線を表す線の集合)の予測が、実際のレーンポリラインとどれだけ一致しているかを評価するためにChamfer距離が使われます。
具体的には、予測されたレーンの点(ウェイポイント)と実際のレーンの点との間で、どれだけ近いかを測定します。予測と実際のレーンポリラインの間にずれがある場合、そのずれがChamfer距離として計算されます。Chamfer距離が小さいほど、予測が実際のレーンに適合しているといえます。
また、レーンポリラインのマッチングが1メートル以内の距離で「真の陽性」とみなされるという基準もあります。これは、予測されたレーンポリラインと実際のレーンポリラインが1メートル以内の距離であれば、正確に一致していると判断されるという意味です。このように、Chamfer距離はレーンごとの精度を定量的に示すための指標として非常に重要であり、道路ネットワーク推定タスクでのパフォーマンス向上に寄与しています。
この方法を使用することで、運転モデルが予測する道路の構造(例えば車線や交差点)を、実際の道路とどれだけ正確に一致させることができるかを測定することができ、予測の精度を向上させることが可能になります。
▼質問(人間)
①上記の「EMMA+(内部データで事前学習)」部分の元文を抜き出して下さい。続けて、その日本語訳を記載下さい。
②上記の「内部データセットには、公開されているデータセットに比べて数桁以上多くのシナリオが含まれており」部分の元文を抜き出して下さい。続けて、その日本語訳を記載下さい。
③上記の「特に「運転のメタ意思決定」と「重要なオブジェクトの識別」が重要な改善要因であり」部分の元文を抜き出して下さい。続けて、その日本語訳を記載下さい。
▼生成AIの回答(ChatGPT 4o mini)
こちらのリクエストに基づいて、該当部分を抜き出し、その翻訳を行います。
① EMMA+(内部データで事前学習)部分の元文:
"When pre-trained with our internal dataset (denoted as EMMA+), our model surpasses both MotionLM and Wayformer up to ADE 5s in the future. EMMA is only slightly worse than Wayformer at ADE 8s, where camera-induced depth error becomes a prominent error resource."
日本語訳:
「私たちの内部データセット(EMMA+と呼ばれる)で事前学習を行った場合、モデルはMotionLMおよびWayformerを超えて、将来のADE 5秒までの性能を達成します。EMMAは、カメラによる深度誤差が顕著な誤差源となるADE 8秒ではWayformerより若干劣ります。」
② 内部データセットには、公開されているデータセットに比べて数桁以上多くのシナリオが含まれており部分の元文:
"This dataset contains millions of scenarios, orders of magnitude larger than any publicly available autonomous driving dataset."
日本語訳:
「この内部データセットには数百万のシナリオが含まれており、公開されているどの自動運転データセットよりも数桁以上大きい規模です。」
③ 特に「運転のメタ意思決定」と「重要なオブジェクトの識別」が重要な改善要因であり部分の元文:
"Our findings reveal that both driving meta-decision and critical object identification significantly enhance performance, contributing improvements of 3.0% and 1.5%, respectively."
日本語訳:
「私たちの発見によれば、運転のメタ意思決定と重要なオブジェクトの識別は、性能を大きく向上させ、各々3.0%および1.5%の改善に寄与しています。」
元文を抜き出した結果は、全て元文に含まれており(検索して確認)、正しく抜き出しOK。
図・表を把握する
3 Experiments
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
2.4 Generalist Training Our unified vision-language formulation enables the simultaneous training of multiple tasks with a single model, allowing for task-specific predictions at inference time through simple variations of the task prompt Ttask. The training procedure is both straightforward and flexible. For each task, we construct a dataset Dtask containing |Dtask| training examples. During each training iteration, we randomly sample a batch from the available datasets, with the probability of selecting an example from a specific dataset proportional to the dataset size: i.e., |Dtask|/ P t |Dt |. To train for e epochs, we set the total number of training iterations to e × P t |Dt |, ensuring that the training 6 Method ADE 1s ADE 3s ADE 5s ADE 8s MotionLM [Seff et al., 2023] 0.045 0.251 0.694 1.766 Wayformer* [Nayakanti et al., 2023] 0.044 0.250 0.640 1.517 EMMA 0.032 0.248 0.681 1.718 EMMA+ 0.030 0.225 0.610 1.553 Table 1: End-to-end motion planning experiments on WOMD planning benchmark [Chen et al., 2024a]. EMMA+ denotes the models are pre-trained on our internal data. EMMA achieves promising performance on ADE 1s, 3s, and 5s. EMMA is slightly worse than Wayformer at ADE 8s, largely because the camera-only EMMA has higher depth error for longer distance than Wayformer, which uses all sensors including LiDAR, Radar, and Cameras. ∗Note that the Wayformer baseline is based on an enhanced internal version after the publication of [Nayakanti et al., 2023]. Figure 3: Ablation study on the number of sampled trajectories. As more trajectories are sampled, the quality measured by ADE@5s also improves, but the benefits diminish after 12+ samples. ratio among tasks is governed by the relative dataset sizes. The optimal training ratio is influenced by several factors, including task complexity, inter-task correlations, and the degree of transferability across tasks. Our experimental results demonstrate that the generalist model, trained across multiple tasks, consistently outperforms each specialist model that is trained on a single task. This highlights the advantage of the generalist approach: enhanced knowledge transfer, improved generalization, and increased efficiency. 3.1.1 Driving on the Waymo Open Motion Dataset (WOMD) WOMD comprises 103k real-world urban and suburban driving scenarios, each lasting 20 seconds. These scenarios are further segmented into 1.1M examples, each representing a 9-second window: 1 second is used as input context, and the remaining 8 seconds serve as the prediction target. The dataset includes detailed map features such as traffic signal states and lane characteristics, along with agent states such as position, velocity, acceleration, and bounding boxes. WOMD is typically used for the agent motion prediction task, i.e., given the agent tracks for the past 1 second on a corresponding map, predict the positions of up to 8 agents for 8 seconds into the future. We adopt the exact same setting but only train and evaluate the future motion prediction of the ego vehicle, i.e. the autonomous driving vehicle, as the planning benchmark. In addition, we leverage the ‘high-level intent command’ field, used for evaluating motion prediction, as the planning router instruction. We adapt some of the best motion prediction models, MotionLM [Seff et al., 2023] and enhanced Wayformer [Nayakanti et al., 2023], as the baselines of this planning benchmark. During inference, sampling a final trajectory from multiple candidate trajectories plays a critical role in the final performance. Both MotionLM and Wayformer generate 192 candidate trajectories, which are subsequently aggregated into 6 clusters using k-means clustering, resulting in 6 representative trajectories. The final trajectory is selected from these 6 representative trajectories based on their associated probabilities. For fairness, we also sample multiple trajectories using a Top-K decoding strategy, up to K = 24. We then compute the pairwise L2 distance between all trajectories and select the one with the lowest average L2 distance as the final predicted trajectory, which can be viewed as the “median” trajectory among all the predictions. In the ADE calculation, we treat this median trajectory with the highest probability. All WOMD eval results are based on the validation set. As shown in Table 1, our model achieved similar performance as the MotionLM baseline when we train only on the WOMD dataset, with Gemini pre-trained weights. When pre-trained with our internal dataset (denoted as EMMA+), our model surpasses both MotionLM and Wayformer up to ADE 5s in the future. EMMA is only slightly worse than Wayformer at ADE 8s, where camerainduced depth error becomes a prominent error resource. We note the differences in inputs between MotionLM and EMMA: MotionLM takes inputs of agent location history, agent interactions, the road graph, and traffic light states. These agent boxes are produced by specialized off-board perception models that look at both past and future observations and are trained with a large amount of carefully curated human labels, the road graph is manually generated using full run segments, and all inputs heavily use LiDAR data with superior depth estimation. In stark contrast, EMMA only takes camera images and ego vehicle history as input, without the need of any labels or additional models (besides leveraging the Gemini pre-trained weights). Notably, for this work, we added internally available image inputs, which are not currently in the released WOMD. We further investigate the impact of the number of sampled trajectories on ADE, as illustrated in Figure 3. The results highlight that sampling from multiple trajectories, rather than a single one, leads to a notable improvement in ADE. However, the advantage of increasing the number of candidate trajectories diminishes beyond a certain point. 3.1.2 Driving on the nuScenes Dataset The nuScenes dataset [Caesar et al., 2020] offers a comprehensive autonomous vehicle sensor suite for evaluation. It consists of 1,000 scenes, each spanning 20 seconds, and includes information from 6 cameras, 5 radars, and 1 LiDAR that collectively provide 360-degree coverage in the field of view. The dataset is fully annotated with 3D bounding boxes across 23 classes and 8 attributes. In our experiments, we follow the standard protocol for planning evaluation: predict the next 3 seconds of future driving actions based on 2 seconds of historical data. We measure the planning quality with L2 errors at 1-, 2- and 3-second time horizons, aligning with established baseline methods. As shown in Table 2, our self-supervised EMMA achieves state-of-the-art results in planning on nuScenes, outperforming all previous supervised (with intermediate perception labels and/or human labels) and self-supervised (no extra labels) methods. Under the same self-supervised setup, EMMA outperforms previous BEV-Planner [Li et al., 2024] by 17.1% in average L2 metric; even compared to DriveVLM-Dual [Tian et al., 2024] that heavily uses intermediate perception and human labels, our self-supervised EMMA still improves the average L2 metric by 6.4%. 8 Method self-supervised? L2 (m) 1s L2 (m) 2s L2 (m) 3s Avg L2 (m) UniAD [Hu et al., 2023] ✗ 0.42 0.64 0.91 0.66 DriveVLM [Tian et al., 2024] ✗ 0.18 0.34 0.68 0.40 VAD [Jiang et al., 2023] ✗ 0.17 0.34 0.60 0.37 OmniDrive [Wang et al., 2024a] ✗ 0.14 0.29 0.55 0.33 DriveVLM-Dual [Tian et al., 2024] ✗ 0.15 0.29 0.48 0.31 Ego-MLP∗ [Zhai et al., 2023] ✓ 0.15 0.32 0.59 0.35 BEV-Planner [Li et al., 2024] ✓ 0.16 0.32 0.57 0.35 EMMA (random init) ✓ 0.15 0.33 0.63 0.37 EMMA ✓ 0.14 0.29 0.54 0.32 EMMA+ ✓ 0.13 0.27 0.48 0.29 Table 2: End-to-end motion planning experiments on nuScenes [Caesar et al., 2020]. EMMA (random init) denotes models are randomly initialized; EMMA denotes models are initialized from Gemini; EMMA+ denotes models are pre-trained on our internal data. EMMA achieves state-of-the-art performance on the nuScenes planning benchmark, outperforming the supervised (with perception and/or human labels) prior art by 6.4% and self-supervised (no extra labels) prior art by 17.1%. ∗Ego-MLP results are taken from a reproduced version in BEV-Planner. Scene description Critical object Meta decision Behavior description Relative improvements over baseline e2e planning ✓ ✗ ✗ ✗ + 0.0% ✗ ✓ ✗ ✗ + 1.5% ✗ ✗ ✓ ✗ + 3.0% ✗ ✓ ✓ ✗ + 5.7% ✗ ✓ ✓ ✓ + 6.7% Table 3: Ablation study on chain-of-thought reasoning components. It improves end-to-end planning quality by up to 6.7% by combining all elements. In particular, driving meta-decision and critical objects contribute the improvements of 3.0% and 1.5%, respectively. The details of each component is described in Section 2.2. Unlike in WOMD, we note that sampling multiple trajectories did not yield significant improvements. We hypothesize that this is due to nuScenes’ shorter prediction time horizon (3s) in simpler driving scenarios. Thus, we report only top-1 predictions for our results. 3.2 End-to-End Motion Planning with Chain-of-Thought Reasoning on Internal Dataset In this section, we present our studies of end-to-end planning with chain-of-thought on our internal dataset. This dataset contains millions of scenarios, orders of magnitude larger than any publicly available autonomous driving dataset. The model takes in 2 seconds of history to predict the driving actions for 5 seconds into the future. Table 3 presents the results of our experiments on chain-of-thought reasoning applied to end-to-end planning. By adopting the chain-of-thought formulation introduced in Equation 3, we achieve a notable 6.7% improvement over the standard end-to-end planning approach detailed in Equation 2. We also conduct an ablation study to analyze the contributions of different rationale components. Our findings reveal that both driving meta-decision and critical object identification significantly enhance performance, contributing improvements of 3.0% and 1.5%, respectively. When these components are combined, the gains are even more substantial. Conversely, while scene description has a neutral impact on driving performance, it enhances the model’s explainability. These results demonstrate that chain-of-thought reasoning can meaningfully improve driving performance, particularly when its components are carefully selected and integrated. We also perform a series of data scaling experiments for end-to-end planning, the results of which are illustrated in Figure 4. As we train the model on a larger training set, we observe lower eval perplexities before overfitting. Our findings indicate that the driving quality of EMMA has not yet plateaued, even with the current large-scale dataset. 9 Figure 4: EMMA data scaling experiments on our internal dataset. Each curve represents the eval perplexity for end-to-end motion planning as training proceeds with more steps. The x-axis is training compute, measured by floating-point operations (FLOPs) in log scale. The same EMMA model is trained on four sizes of datasets that are sampled with different percentages from 3% to 100% (denoted by different colors). In general, EMMA tends to achieve better quality until overfitting when given more training compute, but it also overfits quickly on smaller datasets. We observe the driving quality has not saturated when using the full large-scale dataset.▼生成AIの回答(ChatGPT 4o mini)
こちらの図や表は、モデルの性能やアプローチに関する実験結果を示しています。それぞれの内容について簡単に説明し、どのように解釈すればよいかをお伝えします。
図3: サンプリングされた軌跡の数とADE@5sの関係
- 説明: この図は、異なる数の軌跡をサンプリングした際に、モデルのパフォーマンス(ADE@5s)がどのように変化するかを示しています。軌跡の数が増えると、予測の精度(ADE)が向上しますが、その改善は12軌跡以上では限界に達することがわかります。
- 解釈: ここでは、軌跡を複数サンプリングすることで、最適な予測を選ぶことが重要であることが示されていますが、過度に多くの軌跡をサンプリングすることには効率的な利益がないことも分かります。軌跡数を12で固定することが最も効率的であると言えます。
図4: 内部データセットでのデータスケーリング実験
- 説明: 図4では、EMMAモデルを異なるサイズのデータセットでトレーニングした結果を示しています。x軸はトレーニングにかかった計算量(FLOPs)をログスケールで示し、y軸は評価時のパープレキシティ(予測の難易度)を示しています。トレーニングデータが増えると、パフォーマンスが向上しますが、小規模なデータセットでは早期にオーバーフィットしてしまう傾向があることがわかります。
- 解釈: データセットが大きいほど、モデルはより良い性能を発揮する傾向がありますが、データ量が少ないとオーバーフィットが早く進行するため、データ量の調整が重要であることを示唆しています。現在の大規模なデータセットでトレーニングしたモデルは、まだ性能が向上する余地があるということです。
表1: WOMDデータセットにおける計画結果
- 説明: 表1は、WOMDデータセットでの異なるモデル(MotionLM、Wayformer、EMMA、EMMA+)による予測結果を示しています。これらのモデルは、未来の動作予測精度(ADE@1s, 3s, 5s, 8s)に基づいて比較されています。
- 解釈: EMMAはMotionLMとWayformerに匹敵する性能を発揮し、特にEMMA+(内部データで事前学習済み)は、最初の5秒間の予測で優れたパフォーマンスを示しています。特に、EMMAはカメラを用いた予測を行い、他のモデルと比較してセンサーが異なることに注意が必要です。ADE@8sでは、Wayformerが優れた結果を示しており、深度推定の精度が影響を与えていることが示唆されています。
表2: nuScenesデータセットにおける計画結果
- 説明: 表2は、nuScenesデータセットでの自己教師あり学習を利用したEMMAの性能を、他のモデルと比較したものです。特に、EMMAは従来の教師あり学習と自己教師あり学習を用いたモデルを超えて、L2誤差が小さい結果を示しています。
- 解釈: EMMAは、他の最先端モデル(DriveVLMやOmniDriveなど)と比較して、自己教師あり学習で優れた結果を出しており、特にL2誤差が低いことが示されています。EMMA+(内部データで事前学習済み)はさらにパフォーマンスを向上させており、自己教師あり学習における新たな最良の手法として位置づけられます。
表3: チェーン・オブ・ソート推論の効果
- 説明: 表3では、チェーン・オブ・ソート推論を用いた場合の、シーンに関する異なる要素(例:メタ決定、重要オブジェクトなど)の影響を評価しています。これにより、予測精度が最大6.7%改善されることが示されています。
- 解釈: チェーン・オブ・ソート推論の各要素が予測の質を向上させることがわかります。特に、メタ決定と重要オブジェクトの識別が予測精度の向上に寄与していますが、シーン説明は予測精度には影響しないものの、モデルの説明性を高めることに貢献しています。
これらの結果から、EMMAは複数のシナリオで他のモデルを上回るパフォーマンスを発揮しており、特にデータ量が多い場合やチェーン・オブ・ソート推論を活用することで、さらなる性能向上が期待できることが示されています。
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
3.3.1 3D Object Detection We validate our 3D object detection performance on the 3D camera-primary detection benchmark from the Waymo Open Dataset [Sun et al., 2020] using the d Longitudinal Error Tolerant (LET) matching [Hung et al., 2024]. We evaluate two versions: EMMA and EMMA+, similar to earlier sections, where EMMA+ is pre-trained on the 3D detection task using our internal dataset. The quantitative results are reported on the official test set and summarized in Figure 5. Our findings show that after pre-training, EMMA+ achieves competitive performance on the benchmark. Since our model output set of detected boxes without individual confidence scores, we directly compare the precision/recall instead of LET-3D-AP, which is calculated based on the precision/recall curve. We also compare the commonly used F1-score, where EMMA’s F1-score is computed using the single precision/recall and other models’ F1-scores are calculated by picking the maximal F1-score on the curve (often called F1-max). Figure 5 shows the performance comparison. In generally, EMMA+ demonstrates substantial improvements over state-of-the-art methods such as BEVFormer [Li et al., 2022b], achieving a 16.3% relative increase in vehicle precision at the same recall, and a 5.5% recall improvement at the same precision. EMMA+ also achieve better F1-score than prior arts. Performance on the pedestrian class is also comparable to that of MV-FCOS3D++ [Wang et al., 2021]. Additionally, we provide a performance breakdown across different ranges, highlighting that our model performs especially well in the near range. Our results underscore that with sufficient data and a large enough model, a multimodal approach can surpass specialized expert models in 3D object detection quality. 3.3.2 Road Graph Estimation Road graph estimation is a complex task that predicts a group of unordered polylines, each of which is represented as a sequence of waypoints. We measure the quality of road graph prediction with two metrics: (1) lane-level precision and recall, where we define a true positive match between a predicted lane polyline and a groundtruth lane polyline if and only if their Chamfer distance is within 1 meter; and (2) pixel-level precision and recall, where polylines are rasterized into a BEV grid with 1 meter resolution – we then treat the BEV grid as a image and compute precision and recall based on per-pixel matching. As discussed in Section 2.3, this task involves several design choices. One is about the representation of road graph polylines, where our choice is to define the start and end points of each lane, with 10 Figure 5: Camera-primary 3D object detection experiments on WOD [Sun et al., 2020] using the standard LET matching [Hung et al., 2024]. EMMA+ denotes model is pre-trained on our internal data. EMMA+ achieves competitive performance on the detection benchmark in both precision/recall and F1-score metrics. Compared to state-of-the-art methods, it achieves 16.3% relative improvements in vehicle precision at the same recall or 5.5% recalls improvement at the same precision. intermediate points added as needed to accurately capture the road’s curvature. Another critical design choice is the construction of target label sequences used for model training. Drawing inspiration from Pix2Seq [Chen et al., 2022a] in the context of object detection, one effective design choice is to pad the targets and apply random shuffling. This technique helps the model handle unordered outputs and prevents premature termination during training. Figure 6 presents our ablation studies on various design choices. Starting from our best designs, we systematically ablate each of the following configurations and assess the resulting quality degradation. We then summarize the key insights from our analysis. Polyline representation: dynamic sampling is better than fixed sampling. A simple polyline representation is to sample a fixed number of sparse control points per lane, e.g. two end points plus a fixed number of intermediate points to capture curvature. However, we find a better approach is to dynamically adjust the number of points per polyline according to the curvature and length of the lane. By keeping a consistent waypoint density rather than a consistent number of waypoints, we achieve a representation that more accurately captures the lane structure intricacies, yielding around a 40% to 90% difference in the metrics as shown in Figure 6. 11 Figure 6: Ablation study on road graph estimation. To evaluate the influence of different components in our road graph estimation model, we ablate each configuration and measure the corresponding impact on quality. Dynamic sampling (leftmost) of road graph polylines based on lane curvature and length proves to be the most significant factor, leading to a substantial 70% to 90% change in lane-level precision and recall. In contrast, aligning the model with a language-like representation, i.e., semantic punctuation (rightmost), has a more modest effect, contributing to only <10% change in precision and recall of any metric. Polyline representation: ego-origin aligned sample intervals are better than naively aligned sample intervals. The road graph is typically stored and accessed in global coordinate frame, meaning lane origins and extensions are independent of the ego vehicle position. To improve accuracy, it is essential to adjust lane point samples to start from the ego vehicle coordinate frame origin. Specifically, sampling polyline points relative to the AV position (ego-origin) avoids arbitrary offsets that can arise from directly transforming points sampled in the global coordinate frame into the ego coordinate frame. This prevents a 25% to 60% drop in prediction quality. Target sequence construction: shuffled ordering is better than arbitrary ordering. We organize polyline targets into bins based on their endpoint distance from the ego vehicle, providing a rough global ordering. For instance, we categorize lanes into nearby lanes and those further away that serve as connecting lanes. During training, we dynamically shuffle the polylines within each distance bin to enhance the model robustness and coverage. This dynamic shuffling within each bin improves the model’s ability to generalize across different lane configurations, leading to more accurate predictions. Target sequence construction: padding is better than non-padding. Similar to Pix2Seq [Chen et al., 2022a], we find that padding targets to prevent early termination is highly effective. In addition to padding the total number of polyline targets, we also pad the number of points within each polyline. We use “invalid” tokens to represent padded points within polylines. Each polyline is also explicitly tagged with a final “valid” or “invalid” token to denote whether it contains any nonpadded points. This approach ensures consistent input sizes, which helps maintain the integrity of the model during training and reduces the risk of premature truncation, leading to more reliable and accurate predictions. Target sequence construction: adding punctuation and other semantically redundant token improves quality. In the target sequence construction, we notice that it is beneficial to use language-like structures and punctuation to group targets (e.g., "(x,y and x,y);..." instead of "xy xy;..."). Additionally, explicitly including semantically redundant tokens – such as marking padded targets as “invalid” instead of relying on implicit omissions of “valid” markers – improves performance. This approach, incorporating punctuation and redundancy, results in a boost of up to 10% in lane-level metrics. We attribute this improvement to the language-related pre-training of Gemini. By leveraging similar structured expressions, Gemini can be more easily adapted to other tasks. 3.3.3 Scene Understanding Figure 7 summarizes our studies on the scene understanding task for temporary blockage detection. Our study is based on our internal datasets specifically curated for these scenarios. For this study, we establish our baselines by showing a picture to human and asking them to judge whether a lane is temporarily blocked. They can answer ‘yes’, ‘no’, or ‘unsure’. Our baseline will treat 12 Figure 7: Scene understanding experiments. direct fine-tuning denotes solely using the temporal blockage data during fine-tuning; naive mixture denotes co-training this scene task with road graph estimation; mix + short pretraining denotes pre-training on road graph esitmation first, and then fine-tune on the mixture of both tasks; mix + long pretraining denotes a longer pre-training before fine-tuning. The naive fine-tuning is already close to strong human baseline, but long-pretraining with training mixture can further boost the quality. Relative improvement over single task e2e planning 3D detection road graph estimation e2e planning detection road graph ✓ ✓ - +1.6% (±1.0%) +2.4% ( ±0.8%) ✓ ✓ +1.4% (±2.8%) +5.6% (±1.1%) - ✓ ✓ −1.4% (±2.9%) - +3.5% (±0.9%) ✓ ✓ ✓ +1.4% (±2.8%) +5.5% (±1.1%) +2.4% (±0.8%) Table 4: Generalist co-training experiments. (±∗) indicates standard deviation. By co-training on multiple tasks, EMMA gains a broader understanding of driving scenes, enabling it to handle various tasks at inference time, while enhancing individual task performance. Notably, certain task pairings yield greater benefits than others, suggesting these tasks are complementary. Co-training all three tasks together yields the best quality. all ‘unsure’ answers as incorrect, baseline+filtering will filter out all examples with ‘unsure’ answers. In contrast, our model is fine-tuned to predict ‘yes’ or ‘no’ for all examples. As shown in the figure, our naive model that is directly fine-tuned for only this task achieves better performance than the baseline comparison, but underperforms on the baseline+filtering comparison. To boost the model performance, our first attempt is to co-train this task with road graph estimation, but the naive mixture doesn’t improve performance. Our second attempt is to first pre-train the model on road graph estimation, and then fine-tune on these two tasks. Results show when the pre-training is long enough, the quality is boosted, showcasing the model’s ability to integrate multiple tasks for enhanced performance. 3.4 Generalist We explore the development of a generalist model by co-training on multiple tasks and analyzing their synergies, as summarized in Table 4. For this study, we focus on three core tasks: end-to-end planning, 3D object detection, and road graph estimation. Co-training on all three tasks yields significant improvements, with the generalist model outperforming the single-task models by up to 5.5%. Notably, when co-training two tasks, certain combinations lead to greater gains than others. For instance, detection performance improves most when co-trained with driving, and road graph estimation similarly benefits most when paired with driving. This suggests that the driving task plays a particularly prominent and influential role, serving as a key contributor to overall performance improvements. We attribute these results to the complementary nature of the tasks. For example, road graph estimation becomes easier when the model can accurately identify the locations of vehicles. Similarly, driving quality is closely tied to understanding agent interactions, a skill enhanced by 3D object detection. These findings suggest that pursuing a generalist model is a promising direction for future research, with the potential for deeper insights into task synergies and performance optimization.▼生成AIの回答(ChatGPT 4o mini)
図5: 3D物体検出の実験結果(Waymo Open Dataset)
- 内容: 図5では、EMMAとEMMA+の3D物体検出パフォーマンスを比較しています。EMMA+は、内部データセットで事前学習されており、通常のEMMAと比較して大きな改善が見られます。
-
解釈:
- F1スコア: EMMA+は、BEVFormerなどの最先端の手法と比較して、車両の精度が16.3%向上し、リコールも5.5%向上しています。
- 性能の優位性: これらの結果は、事前学習により、特に車両検出において、EMMA+が非常に優れたパフォーマンスを示すことを強調しています。特に、近距離における精度が高いことがわかります。
図6: 道路グラフ推定のアブレーション研究
- 内容: 図6では、道路グラフ推定タスクにおける異なる設計選択肢の影響を示しています。特に、動的サンプリング、エゴ車両に合わせたサンプル間隔、ターゲットシーケンスのシャッフル処理などの要素が評価されています。
-
解釈:
- 動的サンプリングの効果: 車線のカーブや長さに応じた動的なサンプリングが、固定されたサンプル数でサンプリングするよりも精度が大幅に向上し、40%〜90%の精度向上を達成しています。
- ターゲットシーケンスのシャッフル: 動的なシャッフルがモデルの汎用性を高め、より正確な推定結果を得るために重要であることがわかります。
図7: シーン理解(臨時閉塞検出)の実験結果
- 内容: 図7では、臨時閉塞検出タスクのためのシーン理解に関する実験結果が示されています。特に、人間の基準と比較して、EMMAモデルがどれだけ良いかを評価しています。
-
解釈:
- プレトレーニングとファインチューニング: 直接的なファインチューニング(特定のタスクに対する学習)での性能は、人間の基準に非常に近いが、長期間のプレトレーニングを経た場合に性能がさらに向上することが示されています。
- 混合学習: 複数のタスクを同時に学習することで、モデルが異なるシーンに対する理解を高め、より良い予測を行うことができます。
表4: ジェネラリストモデルの共同学習実験
- 内容: 表4では、EMMAモデルが複数のタスク(エンドツーエンドプランニング、3D物体検出、道路グラフ推定)を共同学習した結果が示されています。
-
解釈:
- タスク間のシナジー: 複数のタスクを共同学習することで、モデルは異なるタスクに対する理解を深め、個々のタスクのパフォーマンスも向上します。特に、運転タスク(エンドツーエンドプランニング)が他のタスクに対して大きな影響を与え、最も重要な役割を果たしていることがわかります。
- タスクの組み合わせ: 共同学習を通じて、タスク間の相互作用によってパフォーマンスが最適化され、3つのタスクを同時に学習することで最大の品質向上が達成されることが示されています。
結論
- 図5では、EMMA+が最先端のモデルと比較して優れた3D物体検出能力を示しており、図6では、道路グラフ推定における設計選択が精度に与える影響を強調しています。
- 図7では、臨時閉塞検出タスクにおいて、長期的なプレトレーニングと共同学習の効果が確認され、表4では、複数タスクの共同学習が全体的なパフォーマンスを向上させることが示されています。
キーワードから探す
Abstract
1 Introduction
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
EMMA: End-to-End Multimodal Model for Autonomous Driving Jyh-Jing Hwang∗†, Runsheng Xu∗ , Hubert Lin‡ , Wei-Chih Hung‡ , Jingwei Ji, Kristy Choi Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo Dragomir Anguelov, Mingxing Tan† Waymo LLC Abstract We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA’s effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA’s potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small amount of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures. 1 Introduction Autonomous driving technology has made significant progress in recent years. To make autonomous vehicles a ubiquitous form of transportation, they must navigate increasingly complex real-world scenarios that require understanding rich scene context as well as sophisticated reasoning and decision-making. Historically, autonomous driving systems employed a modular approach, consisting of specialized components for perception [Yurtsever et al., 2020, Li et al., 2022b, Lang et al., 2019, Sun et al., 2022, Hwang et al., 2022], mapping [Li et al., 2022a, Tancik et al., 2022], prediction [Nayakanti et al., 2023, Shi et al., 2024], and planning [Teng et al., 2023]. While this design lends itself to easier debugging and optimization of individual modules, it poses scalability challenges due to the accumulated errors among modules and limited inter-module communication. In particular, the expert-designed ∗Equal contributions; ‡ Equal contributions. †Contact emails: Mingxing Tan <tanmingxing@waymo.com>, Jyh-Jing Hwang <jyhh@waymo.com>. Technical report. November 5, 2024. arXiv:2410.23262v2 [cs.CV] 4 Nov 2024 interfaces between modules, such as the perception and behavior modules, may struggle to adapt to novel environments because they are often pre-defined based on targeted scenarios [Bansal et al., 2019, Jiang et al., 2023, Nayakanti et al., 2023, Seff et al., 2023]. End-to-end autonomous driving systems [Hu et al., 2023, Zhai et al., 2023, Li et al., 2024] have recently emerged as a potential solution, directly learning to generate driving actions from sensor data. This approach eliminates the need for symbolic interfaces between modules and allows for joint optimization of driving objectives from raw sensor inputs. However, these systems are often specialized for specific driving tasks and trained on limited datasets, hindering their ability to generalize to rare or novel scenarios. Multimodal Large Language Models (MLLMs) [Gemini Team Google, 2023, Achiam et al., 2023] offer a promising new paradigm for AI in autonomous driving that may help to address such challenges. This is because MLLMs, as generalist foundation models, excel in two key areas: (1) they are trained on vast, internet-scale datasets that provide rich "world knowledge" beyond what is contained in common driving logs, and (2) they demonstrate superior reasoning capabilities through techniques such as chain-of-thought reasoning [Wei et al., 2022] that are not available in specialized driving systems. While recent efforts [Chen et al., 2024b, Tian et al., 2024] have explored integrating and augmenting the capabilities of existing driving systems with MLLMs, we propose to develop an autonomous driving system in which the MLLM is a first class citizen. We introduce the End-to-End Multimodal Model for Autonomous Driving (EMMA), built on top of Gemini [Gemini Team Google, 2023] without additional specialized components. Figure 1 shows the overview of the EMMA framework. EMMA accepts camera images and plain text for other nonvision inputs such as high-level driving commands and historical context. By recasting driving tasks as visual question answering (VQA) problems, EMMA leverages Gemini’s pre-trained capabilities while preserving its extensive world knowledge. After EMMA is fine-tuned with driving logs from all tasks using task-specific prompts (see Figure 2 for more examples), its shared Gemini model generates various driving outputs such as future trajectories for motion planning, perception objects, road graph elements, and scene semantics. Our experiments showcase EMMA’s strong performance on several planning and perception benchmarks despite this simple design. Additionally, we find that EMMA can produce interpretable, human-readable outputs for many perception tasks such as road graph estimation, and is able to function as a generalist model that is both scalable and robust for autonomous driving. Notably, as used here and throughout the paper, the EMMA generalist model refers to a machine learning model that has been trained and fine-tuned on a large volume of driving data to perform a wide range of specific driving tasks in the autonomous driving domain. We summarize our key findings below: 1. EMMA exhibits strong performance in end-to-end motion planning, achieving state-ofthe-art performance on public benchmarks nuScenes [Caesar et al., 2020] and competitive results on the Waymo Open Motion Dataset (WOMD) [Chen et al., 2024a]. We also show that we can further improve motion planning quality with more internal training data and chain-of-thought reasoning. 2. EMMA demonstrates competitive results for various perception tasks including 3D object detection, road graph estimation, and scene understanding. On the camera-primary Waymo Open Dataset (WOD) [Hung et al., 2024], EMMA achieves better precision and recall for 3D object detection than state-of-the-art methods. 3. We demonstrate that EMMA can function as a generalist model in the autonomous driving domain, which jointly generates the outputs for multiple driving related tasks. In particular, EMMA matches or even surpasses the performance of individually trained models when it is co-trained with motion planning, object detection, and road graph tasks. 4. Finally, we show EMMA’s capacity to reason and make decisions in complex, long-tail driving scenarios. Despite these promising results, EMMA is not without its limitations. In particular, it faces challenges for real-world deployment due to: (1) limitations in 3D spatial reasoning due to its inability to fuse camera inputs with LiDAR or radar, (2) the need for realistic and computationally expensive sensor simulation to power its closed-loop evaluation, and (3) the increased computational requirements relative to conventional models. We plan to better understand and address such challenges in future work. 2 Figure 1: EMMA overview diagram. It takes 3 inputs (left): 1) a high-level command from the router, 2) historical status of the ego vehicle, and 3) surround-view camera videos. The model then predicts ego future trajectories (right) for motion planning that will be transformed into vehicle driving control signals. Further, we can ask the model to explain its rationale (top right) before predicting trajectories, which enhances both the performance and explainability of the model through chain-of-thought reasoning. Notably, we incorporate visual grounding into the rationale so that the model also predicts the accurate 3D/BEV location for critical objects. For completeness, we highlight three additional capabilities of our model (bottom). In the remainder of this paper, Section 2 describes the detailed method of EMMA for end-to-end motion planning and generalist tasks in autonomous driving. In Section 3, we present experimental results of EMMA on public and internal datasets. Finally, we discuss related works and limitations of our work in Sections 4 and 5, respectively.▼生成AIの回答(ChatGPT 4o mini)
以下に、各章節でポイントとなるキーワードやフレーズを列挙します。日本語訳も追加しました。
Abstract
- End-to-end Multimodal Model for Autonomous driving (自動運転のためのエンドツーエンド多モーダルモデル)
- Multi-modal large language model foundation (多モーダル大規模言語モデル基盤)
- Task-specific prompts (タスク固有のプロンプト)
- State-of-the-art performance in motion planning (運転計画の最先端の性能)
- Competitive results for camera-primary 3D object detection (カメラ主体の3D物体検出における競争力のある結果)
- Co-training with planner trajectories, object detection, and road graph tasks (プランナー軌道、物体検出、道路グラフタスクとの共同トレーニング)
1. Introduction
- Autonomous driving technology (自動運転技術)
- Modular approach (モジュールアプローチ)
- Scalability challenges (スケーラビリティの課題)
- Joint optimization of driving objectives (運転目的の共同最適化)
- Multimodal Large Language Models (MLLMs) (多モーダル大規模言語モデル)
- Superior reasoning capabilities through chain-of-thought reasoning (思考連鎖による優れた推論能力)
- Generalist foundation models for autonomous driving (自動運転のための汎用基盤モデル)
- Gemini’s pre-trained capabilities (Geminiの事前学習された能力)
2. EMMA Overview
- EMMA framework (EMMAフレームワーク)
- Recasting driving tasks as visual question answering (運転タスクの視覚的質問応答への再構成)
- Shared Gemini model (共有されたGeminiモデル)
- Various driving outputs (様々な運転出力)
- Human-readable outputs for perception tasks (知覚タスクのための人間可読な出力)
- Generalist model (汎用モデル)
3. Key Findings
- Strong performance in end-to-end motion planning (エンドツーエンド運転計画での強力な性能)
- Competitive results on the Waymo Open Motion Dataset (WOMD) (Waymo Open Motion Dataset (WOMD)での競争力のある結果)
- Better precision and recall for 3D object detection (3D物体検出での優れた精度と再現率)
- Reasoning and decision-making in complex scenarios (複雑なシナリオでの推論と意思決定)
4. Limitations and Future Work
- Limitations in 3D spatial reasoning (3D空間推論における限界)
- Inability to fuse camera inputs with LiDAR or radar (カメラ入力とLiDARやレーダーの統合の不可能)
- Computationally expensive sensor simulation (計算コストの高いセンサーシミュレーション)
- Increased computational requirements (増加した計算要件)
これで論文の主要なポイントを押さえることができました。
特に、追加で質問したいキーワードは見当たらない。
2 Method
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
2 Method EMMA is built on top of Gemini [Gemini Team Google, 2023], a family of MLLMs developed at Google. We leverage the auto-regressive Gemini models that are trained to process interleaved textual and visual inputs to produce text outputs: O = G(T, V) (1) where G is the Gemini model, O represents natural language outputs, T represents natural language prompts, and V denotes images or videos. The language output O = (o1, o2, ..., on) is generated via next-token prediction, i.e., the output probability can be represented as Q P(O|T, V) = n i=1 P(oi |o<i, T, V) for n output tokens. Our goal is to adapt G for autonomous driving applications, thereby harnessing the world knowledge obtained during its pre-training phase. As shown in Figure 1, we map autonomous driving tasks into Gemini-based EMMA formulation. All sensor data are represented as stitched images or videos as V; all router commands, driving context, and task-specific prompts are represented as T; all output tasks are presented as language outputs O. A challenge is that many of the inputs and outputs need to capture 3D world coordinates, such as waypoint BEV (Bird’s Eye View) locations (x, y) for motion planning and the location and size of 3D boxes. We consider two representations: The first is direct text conversion to floating-point numbers, expressed as Tcoordinates = {(xi , yi)} ≈ text({(xi , yi)}), where the specified decimal precision depends on the distance unit and decimal points. RT-2 [Brohan et al., 2023] exemplifies this approach in robotic control. The second approach uses special tokens to represent each location or action, formulated as Tcoordinates = {(xi , yi)} ≈ tokenize({(xi , yi)}), with resolution determined by the learned or manually defined discretization scheme. MotionLM [Seff et al., 2023] leverages this method for motion forecasting. We note that the two approaches have their respective strengths and weaknesses. We opt for the text representation such that all tasks can share the same unified language representation space and they can maximally reuse the knowledge from the pre-trained weights, even though the text presentation may produce more tokens than specialized tokenization. 3 2.1 End-to-End Motion Planning EMMA employs a unified, end-to-end trained model to generate future trajectories for autonomous vehicles directly from sensor data. These generated trajectories are then transformed into vehiclespecific control actions such as acceleration and turning for autonomous vehicles. EMMA’s end-to-end approach aims to emulate human driving behavior, focusing on two critical aspects: (1) first, the use of navigation systems (e.g. Google Maps) for route planning and intent determination, and (2) second, the utilization of past actions to ensure smooth, consistent driving over time. Our model incorporates three key inputs to align with these human driving behaviors: 1. Surround-view camera videos (V): Provides comprehensive environment information. 2. High-level intent command (Tintent): Derived from the router, includes directives such as “go straight”, “turn left”, “turn right”, etc. 3. Set of historical ego status (Tego): Represented as a set of waypoint coordinates in Bird’s Eye View (BEV) space, Tego = {(xt, yt)} −Th t=−1 for Th timestamps. All waypoint coordinates are represented as plain text without specialized tokens. This can also be extended to include higher-order ego status such as velocity and acceleration. The model generates future trajectories for motion planning, represented as a set of future trajectory waypoints for the ego vehicle in the same BEV space: Otrajectory = {(xt, yt)} Tf t=1 for future Tf timestamps, where all output waypoints are also represnted as plain text. Putting everything together, the complete formulation is expressed as: Otrajectory = G(Tintent, Tego, V). (2) We then fine-tune Gemini with this formulation for end-to-end planner trajectory generation, as illustrated in Figure 1. We highlight 3 characteristics of this formulation: 1. Self-supervised: the only required supervision is the future locations of the ego vehicle. No dedicated human labels are needed. 2. Camera-only: the only sensor input required is surround-view cameras. 3. HD map free: no HD map is needed beyond the high-level routing information from a navigation system such as Google Maps. While we are not the first to adopt this general formulation—[Li et al., 2024] conducted a thorough investigation, particularly examining the impact of including the historical ego status—our contribution lies in adapting this formulation specifically for MLLMs for autonomous driving. 2.2 Planning with Chain-of-Thought Reasoning Chain-of-thought Prompting [Wei et al., 2022] is a powerful tool in MLLMs that enhances reasoning capabilities and improves explainability. In EMMA, we incorporate chain-of-thought reasoning into end-to-end planner trajectory generation by asking the model to articulate its decision rationale Orationale while predicting the final future trajectory waypoints Otrajectory. We structure the driving rationale hierarchically, progressing from 4 types of coarse-to-fine-grained information: R1 - Scene description broadly describes the driving scenarios, including weather, day of time, traffic situations, and road conditions. For example: The weather is clear and sunny, and it is daytime. The road is four-lane undivided street with a crosswalk in the middle. There are cars parked on both sides of the street. R2 - Critical objects are the on-road agents that can potentially influence the driving behavior of the ego vehicle, and we require the model to identify their precise 3D/BEV coordinates. For instance: pedestrian at [9.01, 3.22], vehicle at [11.58, 0.35]. R3 - Behavior description of critical objects describes the current status and intent for the identified critical objects. A concrete example is as follows: The pedestrian is currently standing on the sidewalk, looking toward the road, and maybe preparing to cross the street. The vehicle is currently ahead of me, moving in the same direction, and its future trajectory suggests it will continue straight. R4 - Meta driving decision includes 12 categories of high-level driving decisions, summarizing the driving plan given the previous observations. An example would be I should keep my current low speed. 4 We highlight that the driving rationale captions are generated using an automated tool without any additional human labels, ensuring scalability of the data generation pipeline. Specifically, we leverage off-the-shelf perception and prediction expert models to identify critical agents, and then use Gemini models with carefully designed visual and text prompts to generate comprehensive scene and agent behavior descriptions. Meta driving decisions are computed using a heuristic algorithm that analyzes the ego vehicle’s ground-truth trajectory. During both training and inference, the model predicts all four components of the driving rationale before predicting the future waypoints, i.e., (Orationale, Otrajectory) = G(Tintent, Tego, V). (3) Where Orationale denotes an ordered text output of (R1, R2, R3, R4) for driving rationale. Empirically, we observe that the prediction order of Orationale and Otrajectory does not result in a significant difference in quality after model convergence. This suggests that we can predict Otrajectory first and apply early stopping during inference for time-critical applications. 2.3 EMMA Generalist While end-to-end motion planning is the ultimate core task, a comprehensive autonomous driving system requires additional capabilities. Specifically, it must perceive the 3D world and recognize surrounding objects, the road graph and the traffic conditions. To achieve this goal, we formulate EMMA as a generalist model capable of handling multiple driving tasks through training mixtures. Our vision-language framework represents all non-sensor inputs and outputs as plain text, providing the flexibility necessary to incorporate many other driving tasks. We employ instruction-tuning, a well-established approach in LLMs, to jointly train all tasks together with task-specific prompts included in the inputs T of Eq. 1. We organize these tasks into three primary categories: spatial reasoning, road graph estimation, and scene understanding. Fig. 2 illustrates the overall EMMA Generalist diagram. Spatial reasoning is the ability to understand, reason, and draw conclusions about objects and their relationships in space. This enables an autonomous driving system to interpret and interact with its surrounding environment for safe navigation. Our primary focus for spatial reasoning is 3D object detection. We follow Pix2Seq [Chen et al., 2022a] and formulate the output 3D bounding boxes as Oboxes = set{text(x, y, z, l, w, h, θ, cls)} where (x, y, z) are the center location in the vehicle frame, l, w, h are the length, width, and height of the box, θ is the heading angle, and cls is the class label in text. We convert a 7D box to text by writing floating-point numbers with two decimal places, separated by spaces between each dimension. We then represent the detection tasks using a fixed prompt Tdetect_3D, such as “detect every object in 3D”, as in the following: Oboxes = G(Tdetect_3D, V). (4) While Oboxes is an unordered set of boxes, the predictions from an auto-regressive language model are always ordered. We find that sorting the 3D bounding boxes by depth improves detection quality, in contrast with the findings in Pix2Seq [Chen et al., 2022a]. Road graph estimation focuses on identifying critical road elements for safe driving, including semantic elements (e.g., lane markings, signs) and physical properties (e.g., lane curvature). The collection of these road elements forms a road graph. For example, lane segments are represented by (a) nodes, where the lanes encounter an intersection, merge, or split and (b) edges between these nodes following the direction of traffic. The full road-graph is composed of many such polyline segments. While edges within each polyline are directional, each polyline does not necessarily have a unique order relative to the other elements. This bears similarity to object detection (e.g., [Carion et al., 2020, Chen et al., 2022a]), where each box is defined by ordered attributes (top-left corner, bottom-right corner), but a relative ordering between boxes does not necessarily exist. There are several existing works that model polyline graphs with Transformers [Yuan et al., 2024, Liao et al., 2024a,b, 2023, Ding et al., 2023, Qiao et al., 2023, Liu et al., 2023, Li et al., 2022a], sharing similarities with language models. Our general modeling formulation in EMMA is as follows: Oroadgraph = G(Testimate_roadgraph, V). (5) 5 Figure 2: Illustration of EMMA Generalist. Starting with a task prompt (left), EMMA generates a corresponding textual prediction (middle right), which can then be decoded into a target output format, visualized and overlaid with the input image (right). EMMA Generalist is highly versatile, capable of performing a wide range of driving-related tasks, such as end-to-end motion planning, object detection, road graph estimation, and scene understanding Q&A. In the answer text, italicized words represent placeholders that will be dynamically substituted with actual values during task execution. where Oroadgraph is a text-encoded road graph represented as waypoints, Testimate_roadgraph is a prompt asking the model to predict the roadgrah, and V denotes the surrounding images. We focus specifically on predicting drivable lanes, i.e., the lanes that the ego vehicle can drive towards in the scene. These are neighboring lanes in the same traffic direction and lanes branching out from the current ego lane. To construct Oroadgraph, we (a) convert lanes into sets of ordered waypoints and (b) transform these sets of waypoints into text. It is beneficial to use sample-ordered waypoints to represent both traffic direction and curvature. Just like detection, we also find that ordering lanes by approximate distance improves the prediction quality. An example of our polyline text encoding is: "(x1,y1 and... and xn,yn);..." where "x,y" are floating point waypoints with precision to 2 decimal places, ";" separates polyline instances. Scene understanding tasks test the model’s understanding of the whole scene context, which can be relevant for driving. For example, roads can be temporarily obstructed due to construction, emergency situations, or other events. Detecting these blockages in a timely manner and safely navigating around them is essential for ensuring the smooth and safe operation of autonomous vehicles; however, multiple cues in the scene are required to determine if there is a blockage or not. We focus on how our model performs on this temporary blockage detection task, using the following formulation: Otemporary_blockage = G(Ttemporary_blockage, Troad_user, V), (6) where Otemporary_blockage is the model output signaling potential obstruction, V denotes the surrounding images, Troad_users denotes the all the objects on the road ahead, Ttemporary_blockage is the text prompt "is the road ahead temporarily blocked?". 2.4 Generalist Training Our unified vision-language formulation enables the simultaneous training of multiple tasks with a single model, allowing for task-specific predictions at inference time through simple variations of the task prompt Ttask. The training procedure is both straightforward and flexible. For each task, we construct a dataset Dtask containing |Dtask| training examples. During each training iteration, we randomly sample a batch from the available datasets, with the probability of selecting an example from a specific dataset proportional to the dataset size: i.e., |Dtask|/ P t |Dt |. To train for e epochs, we set the total number of training iterations to e × P t |Dt |, ensuring that the training 6 Method ADE 1s ADE 3s ADE 5s ADE 8s MotionLM [Seff et al., 2023] 0.045 0.251 0.694 1.766 Wayformer* [Nayakanti et al., 2023] 0.044 0.250 0.640 1.517 EMMA 0.032 0.248 0.681 1.718 EMMA+ 0.030 0.225 0.610 1.553 Table 1: End-to-end motion planning experiments on WOMD planning benchmark [Chen et al., 2024a]. EMMA+ denotes the models are pre-trained on our internal data. EMMA achieves promising performance on ADE 1s, 3s, and 5s. EMMA is slightly worse than Wayformer at ADE 8s, largely because the camera-only EMMA has higher depth error for longer distance than Wayformer, which uses all sensors including LiDAR, Radar, and Cameras. ∗Note that the Wayformer baseline is based on an enhanced internal version after the publication of [Nayakanti et al., 2023]. Figure 3: Ablation study on the number of sampled trajectories. As more trajectories are sampled, the quality measured by ADE@5s also improves, but the benefits diminish after 12+ samples. ratio among tasks is governed by the relative dataset sizes. The optimal training ratio is influenced by several factors, including task complexity, inter-task correlations, and the degree of transferability across tasks. Our experimental results demonstrate that the generalist model, trained across multiple tasks, consistently outperforms each specialist model that is trained on a single task. This highlights the advantage of the generalist approach: enhanced knowledge transfer, improved generalization, and increased efficiency.▼生成AIの回答(ChatGPT 4o mini)
以下は、論文の各章節から抽出したポイントとなるキーワード・フレーズです。元文とともに日本語の説明も付け加えました。
- Method
- EMMA (EMMA): GoogleのGeminiモデルを基にした、画像とテキストを処理する自律運転向けのモデル
- Gemini (Gemini): Googleが開発したマルチモーダル言語モデル(MLLM)のファミリー
- Auto-regressive model (自己回帰モデル): 次のトークンを予測して出力を生成するモデル
- Next-token prediction (次トークン予測): 出力トークンを予測する方法
- Textual and visual inputs (テキストと視覚的入力): 自律運転に必要なデータの表現方法
- 3D world coordinates (3D座標): モーションプランニングや3Dボックスの位置とサイズに必要
- Waypoint BEV (ウェイポイントBEV): モーションプランニングで使用される位置情報
- Text conversion to floating-point numbers (浮動小数点数へのテキスト変換): 座標データの表現方法
2.1 End-to-End Motion Planning
- Unified, end-to-end model (統一されたエンドツーエンドモデル): センサーデータから未来の軌道を生成するモデル
- Vehicle control actions (車両制御アクション): 生成された軌道をもとにした加速や方向転換
- Navigation systems (ナビゲーションシステム): ルート計画や意図決定に使用
- Surround-view camera videos (周囲のカメラ映像): 環境情報を提供するセンサー
- Historical ego status (履歴的エゴ状態): 過去の状態(位置情報など)を使用
- Self-supervised (自己教師あり): 未来の位置だけを監視データとして使用
- Camera-only (カメラのみ): 周囲視覚情報のみで運転を計画
- HD map free (HDマップ不要): 高精度な地図情報は使用しない
- Gemini fine-tuning (Gemini微調整): 自律運転のためにモデルを調整
2.2 Planning with Chain-of-Thought Reasoning
- Chain-of-thought prompting (チェーン・オブ・ソート・プロンプティング): 思考の過程をモデルに示す手法
- Reasoning and explainability (推論と説明能力): モデルの推論過程の明示化
- Coarse-to-fine-grained information (粗から細かい情報): 思考過程を段階的に進める
- Scene description (シーンの説明): 天候や交通状況などの運転シナリオ
- Critical objects (重要な物体): 運転行動に影響を与える可能性のある物体
- Behavior description (物体の行動説明): 物体の現在の状態と意図
- Meta driving decision (メタ運転判断): 運転の最終的な決定
- Rationale and trajectory (推論と軌道): 推論の説明と未来の軌道予測
2.3 EMMA Generalist
- Generalist model (汎用モデル): 複数の運転タスクを扱う能力
- Task-specific prompts (タスク固有のプロンプト): 各タスクに特化した入力プロンプト
- Spatial reasoning (空間的推論): 環境内の物体とその関係性を理解
- 3D object detection (3D物体検出): 物体の位置やサイズを検出
- Road graph estimation (道路グラフ推定): 道路上の要素を識別し、道路のグラフを構築
- Scene understanding (シーン理解): 一時的な障害物の検出や運転に必要な理解
2.4 Generalist Training
- Training mixture (トレーニングミクスチャー): 複数のタスクを同時に訓練する方法
- Task-specific datasets (タスク固有のデータセット): 各タスクに応じたデータセットを使用
- Knowledge transfer (知識転送): 異なるタスク間での知識の再利用
- Generalization (一般化): モデルが他の状況でも良い結果を出す能力
- Efficiency (効率性): トレーニングの効率と時間節約
- Outperforming specialist models (専門モデルを上回る): 複数のタスクを同時に学習することで専門モデルより優れた結果を出す
各章節は自律運転に関連する技術やアプローチを示しており、特にエンドツーエンドのモーションプランニングやタスク特化型のアプローチに焦点を当てています。
▼質問(人間)
①上記の「Outperforming specialist models (専門モデルを上回る)」について詳細を教えて下さい。
▼生成AIの回答(ChatGPT 4o mini)
「Outperforming specialist models (専門モデルを上回る)」について、論文で言及されている内容を詳しく説明します。
このフレーズは、EMMAのような汎用的なモデルが、専門のタスクに特化したモデルよりも優れたパフォーマンスを発揮することを示しています。具体的な理由や背景については以下のように説明されています。
-
知識転送と一般化
汎用モデルは、複数のタスクを同時に学習することができ、各タスク間での知識転送(knowledge transfer)が可能です。つまり、一度学習したタスクで得た知識を、他のタスクに活用できるため、異なるタスクを効果的に処理できます。この知識転送が、専門モデルがそれぞれのタスクに特化して学習する場合に比べて、汎用モデルに優位性を与えます。
例えば、自律運転において、汎用モデルはモーションプランニング、物体検出、道路グラフ推定など複数のタスクを同時に学習でき、これらのタスク間で得た知識を相互に活用することができます。そのため、特定のタスクに対して個別に訓練された専門モデルと比べて、汎用モデルはタスク間での相互作用や複雑なシナリオにも柔軟に対応できるのです。
-
汎用性と柔軟性
汎用モデルはタスクの違いを反映させたプロンプトや入力形式を調整するだけで、異なるタスクをこなせるという特長があります。この柔軟性により、専門モデルでは対応しきれないような新たなタスクにも対応できるため、多様な状況に適応する力が高まります。たとえば、同じモデルを用いて、物体検出や道路グラフ推定、さらには一時的な障害物の検出など、さまざまなタスクに対応できる点が汎用モデルの強みです。 -
効率的なトレーニング
汎用モデルでは、複数のタスクを同時に訓練するため、タスクごとに個別にモデルを訓練する必要がありません。この訓練方式は、トレーニングデータの多様性を活かしつつ、効率的な学習を可能にします。さらに、訓練中に得た知識を全てのタスクに活かせるため、一般化性能が向上し、未知のシナリオにも強くなります。 -
性能の向上
実際の実験結果では、汎用モデルは専門モデルよりも安定した高い性能を発揮しています。例えば、EMMAは、専門的な自律運転タスクを処理するモデルと比較しても、いくつかの指標(例えば、ADE(Average Displacement Error))で優れた結果を出していることが確認されています。汎用的に学習されているため、さまざまなタスクで高い汎用性と精度を示すことができます。 -
学習プロセスの簡素化
複数の専門モデルを使い分ける場合、各モデルに対して個別にデータを収集し、トレーニングを行う必要がありますが、汎用モデルではタスクごとに異なるデータセットを学習させつつ、単一のモデルで全てのタスクをカバーできます。このアプローチは、トレーニングの管理を簡素化し、さらにスケーラブルな運用を可能にします。
結論
汎用モデルは、タスク間の知識転送、柔軟性、効率的な学習プロセスなどの利点を活かし、専門的なモデルを上回る性能を発揮することができます。特に、異なるタスクをこなすために専門モデルを個別に作成する必要がなく、汎用モデルは多様な運転タスクを単一のフレームワーク内で処理できるため、より優れたパフォーマンスを実現します。
質問の内容をより詳しく書いて再度質問し直す↓
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
2 Method EMMA is built on top of Gemini [Gemini Team Google, 2023], a family of MLLMs developed at Google. We leverage the auto-regressive Gemini models that are trained to process interleaved textual and visual inputs to produce text outputs: O = G(T, V) (1) where G is the Gemini model, O represents natural language outputs, T represents natural language prompts, and V denotes images or videos. The language output O = (o1, o2, ..., on) is generated via next-token prediction, i.e., the output probability can be represented as Q P(O|T, V) = n i=1 P(oi |o<i, T, V) for n output tokens. Our goal is to adapt G for autonomous driving applications, thereby harnessing the world knowledge obtained during its pre-training phase. As shown in Figure 1, we map autonomous driving tasks into Gemini-based EMMA formulation. All sensor data are represented as stitched images or videos as V; all router commands, driving context, and task-specific prompts are represented as T; all output tasks are presented as language outputs O. A challenge is that many of the inputs and outputs need to capture 3D world coordinates, such as waypoint BEV (Bird’s Eye View) locations (x, y) for motion planning and the location and size of 3D boxes. We consider two representations: The first is direct text conversion to floating-point numbers, expressed as Tcoordinates = {(xi , yi)} ≈ text({(xi , yi)}), where the specified decimal precision depends on the distance unit and decimal points. RT-2 [Brohan et al., 2023] exemplifies this approach in robotic control. The second approach uses special tokens to represent each location or action, formulated as Tcoordinates = {(xi , yi)} ≈ tokenize({(xi , yi)}), with resolution determined by the learned or manually defined discretization scheme. MotionLM [Seff et al., 2023] leverages this method for motion forecasting. We note that the two approaches have their respective strengths and weaknesses. We opt for the text representation such that all tasks can share the same unified language representation space and they can maximally reuse the knowledge from the pre-trained weights, even though the text presentation may produce more tokens than specialized tokenization. 3 2.1 End-to-End Motion Planning EMMA employs a unified, end-to-end trained model to generate future trajectories for autonomous vehicles directly from sensor data. These generated trajectories are then transformed into vehiclespecific control actions such as acceleration and turning for autonomous vehicles. EMMA’s end-to-end approach aims to emulate human driving behavior, focusing on two critical aspects: (1) first, the use of navigation systems (e.g. Google Maps) for route planning and intent determination, and (2) second, the utilization of past actions to ensure smooth, consistent driving over time. Our model incorporates three key inputs to align with these human driving behaviors: 1. Surround-view camera videos (V): Provides comprehensive environment information. 2. High-level intent command (Tintent): Derived from the router, includes directives such as “go straight”, “turn left”, “turn right”, etc. 3. Set of historical ego status (Tego): Represented as a set of waypoint coordinates in Bird’s Eye View (BEV) space, Tego = {(xt, yt)} −Th t=−1 for Th timestamps. All waypoint coordinates are represented as plain text without specialized tokens. This can also be extended to include higher-order ego status such as velocity and acceleration. The model generates future trajectories for motion planning, represented as a set of future trajectory waypoints for the ego vehicle in the same BEV space: Otrajectory = {(xt, yt)} Tf t=1 for future Tf timestamps, where all output waypoints are also represnted as plain text. Putting everything together, the complete formulation is expressed as: Otrajectory = G(Tintent, Tego, V). (2) We then fine-tune Gemini with this formulation for end-to-end planner trajectory generation, as illustrated in Figure 1. We highlight 3 characteristics of this formulation: 1. Self-supervised: the only required supervision is the future locations of the ego vehicle. No dedicated human labels are needed. 2. Camera-only: the only sensor input required is surround-view cameras. 3. HD map free: no HD map is needed beyond the high-level routing information from a navigation system such as Google Maps. While we are not the first to adopt this general formulation—[Li et al., 2024] conducted a thorough investigation, particularly examining the impact of including the historical ego status—our contribution lies in adapting this formulation specifically for MLLMs for autonomous driving. 2.2 Planning with Chain-of-Thought Reasoning Chain-of-thought Prompting [Wei et al., 2022] is a powerful tool in MLLMs that enhances reasoning capabilities and improves explainability. In EMMA, we incorporate chain-of-thought reasoning into end-to-end planner trajectory generation by asking the model to articulate its decision rationale Orationale while predicting the final future trajectory waypoints Otrajectory. We structure the driving rationale hierarchically, progressing from 4 types of coarse-to-fine-grained information: R1 - Scene description broadly describes the driving scenarios, including weather, day of time, traffic situations, and road conditions. For example: The weather is clear and sunny, and it is daytime. The road is four-lane undivided street with a crosswalk in the middle. There are cars parked on both sides of the street. R2 - Critical objects are the on-road agents that can potentially influence the driving behavior of the ego vehicle, and we require the model to identify their precise 3D/BEV coordinates. For instance: pedestrian at [9.01, 3.22], vehicle at [11.58, 0.35]. R3 - Behavior description of critical objects describes the current status and intent for the identified critical objects. A concrete example is as follows: The pedestrian is currently standing on the sidewalk, looking toward the road, and maybe preparing to cross the street. The vehicle is currently ahead of me, moving in the same direction, and its future trajectory suggests it will continue straight. R4 - Meta driving decision includes 12 categories of high-level driving decisions, summarizing the driving plan given the previous observations. An example would be I should keep my current low speed. 4 We highlight that the driving rationale captions are generated using an automated tool without any additional human labels, ensuring scalability of the data generation pipeline. Specifically, we leverage off-the-shelf perception and prediction expert models to identify critical agents, and then use Gemini models with carefully designed visual and text prompts to generate comprehensive scene and agent behavior descriptions. Meta driving decisions are computed using a heuristic algorithm that analyzes the ego vehicle’s ground-truth trajectory. During both training and inference, the model predicts all four components of the driving rationale before predicting the future waypoints, i.e., (Orationale, Otrajectory) = G(Tintent, Tego, V). (3) Where Orationale denotes an ordered text output of (R1, R2, R3, R4) for driving rationale. Empirically, we observe that the prediction order of Orationale and Otrajectory does not result in a significant difference in quality after model convergence. This suggests that we can predict Otrajectory first and apply early stopping during inference for time-critical applications. 2.3 EMMA Generalist While end-to-end motion planning is the ultimate core task, a comprehensive autonomous driving system requires additional capabilities. Specifically, it must perceive the 3D world and recognize surrounding objects, the road graph and the traffic conditions. To achieve this goal, we formulate EMMA as a generalist model capable of handling multiple driving tasks through training mixtures. Our vision-language framework represents all non-sensor inputs and outputs as plain text, providing the flexibility necessary to incorporate many other driving tasks. We employ instruction-tuning, a well-established approach in LLMs, to jointly train all tasks together with task-specific prompts included in the inputs T of Eq. 1. We organize these tasks into three primary categories: spatial reasoning, road graph estimation, and scene understanding. Fig. 2 illustrates the overall EMMA Generalist diagram. Spatial reasoning is the ability to understand, reason, and draw conclusions about objects and their relationships in space. This enables an autonomous driving system to interpret and interact with its surrounding environment for safe navigation. Our primary focus for spatial reasoning is 3D object detection. We follow Pix2Seq [Chen et al., 2022a] and formulate the output 3D bounding boxes as Oboxes = set{text(x, y, z, l, w, h, θ, cls)} where (x, y, z) are the center location in the vehicle frame, l, w, h are the length, width, and height of the box, θ is the heading angle, and cls is the class label in text. We convert a 7D box to text by writing floating-point numbers with two decimal places, separated by spaces between each dimension. We then represent the detection tasks using a fixed prompt Tdetect_3D, such as “detect every object in 3D”, as in the following: Oboxes = G(Tdetect_3D, V). (4) While Oboxes is an unordered set of boxes, the predictions from an auto-regressive language model are always ordered. We find that sorting the 3D bounding boxes by depth improves detection quality, in contrast with the findings in Pix2Seq [Chen et al., 2022a]. Road graph estimation focuses on identifying critical road elements for safe driving, including semantic elements (e.g., lane markings, signs) and physical properties (e.g., lane curvature). The collection of these road elements forms a road graph. For example, lane segments are represented by (a) nodes, where the lanes encounter an intersection, merge, or split and (b) edges between these nodes following the direction of traffic. The full road-graph is composed of many such polyline segments. While edges within each polyline are directional, each polyline does not necessarily have a unique order relative to the other elements. This bears similarity to object detection (e.g., [Carion et al., 2020, Chen et al., 2022a]), where each box is defined by ordered attributes (top-left corner, bottom-right corner), but a relative ordering between boxes does not necessarily exist. There are several existing works that model polyline graphs with Transformers [Yuan et al., 2024, Liao et al., 2024a,b, 2023, Ding et al., 2023, Qiao et al., 2023, Liu et al., 2023, Li et al., 2022a], sharing similarities with language models. Our general modeling formulation in EMMA is as follows: Oroadgraph = G(Testimate_roadgraph, V). (5) 5 Figure 2: Illustration of EMMA Generalist. Starting with a task prompt (left), EMMA generates a corresponding textual prediction (middle right), which can then be decoded into a target output format, visualized and overlaid with the input image (right). EMMA Generalist is highly versatile, capable of performing a wide range of driving-related tasks, such as end-to-end motion planning, object detection, road graph estimation, and scene understanding Q&A. In the answer text, italicized words represent placeholders that will be dynamically substituted with actual values during task execution. where Oroadgraph is a text-encoded road graph represented as waypoints, Testimate_roadgraph is a prompt asking the model to predict the roadgrah, and V denotes the surrounding images. We focus specifically on predicting drivable lanes, i.e., the lanes that the ego vehicle can drive towards in the scene. These are neighboring lanes in the same traffic direction and lanes branching out from the current ego lane. To construct Oroadgraph, we (a) convert lanes into sets of ordered waypoints and (b) transform these sets of waypoints into text. It is beneficial to use sample-ordered waypoints to represent both traffic direction and curvature. Just like detection, we also find that ordering lanes by approximate distance improves the prediction quality. An example of our polyline text encoding is: "(x1,y1 and... and xn,yn);..." where "x,y" are floating point waypoints with precision to 2 decimal places, ";" separates polyline instances. Scene understanding tasks test the model’s understanding of the whole scene context, which can be relevant for driving. For example, roads can be temporarily obstructed due to construction, emergency situations, or other events. Detecting these blockages in a timely manner and safely navigating around them is essential for ensuring the smooth and safe operation of autonomous vehicles; however, multiple cues in the scene are required to determine if there is a blockage or not. We focus on how our model performs on this temporary blockage detection task, using the following formulation: Otemporary_blockage = G(Ttemporary_blockage, Troad_user, V), (6) where Otemporary_blockage is the model output signaling potential obstruction, V denotes the surrounding images, Troad_users denotes the all the objects on the road ahead, Ttemporary_blockage is the text prompt "is the road ahead temporarily blocked?". 2.4 Generalist Training Our unified vision-language formulation enables the simultaneous training of multiple tasks with a single model, allowing for task-specific predictions at inference time through simple variations of the task prompt Ttask. The training procedure is both straightforward and flexible. For each task, we construct a dataset Dtask containing |Dtask| training examples. During each training iteration, we randomly sample a batch from the available datasets, with the probability of selecting an example from a specific dataset proportional to the dataset size: i.e., |Dtask|/ P t |Dt |. To train for e epochs, we set the total number of training iterations to e × P t |Dt |, ensuring that the training 6 Method ADE 1s ADE 3s ADE 5s ADE 8s MotionLM [Seff et al., 2023] 0.045 0.251 0.694 1.766 Wayformer* [Nayakanti et al., 2023] 0.044 0.250 0.640 1.517 EMMA 0.032 0.248 0.681 1.718 EMMA+ 0.030 0.225 0.610 1.553 Table 1: End-to-end motion planning experiments on WOMD planning benchmark [Chen et al., 2024a]. EMMA+ denotes the models are pre-trained on our internal data. EMMA achieves promising performance on ADE 1s, 3s, and 5s. EMMA is slightly worse than Wayformer at ADE 8s, largely because the camera-only EMMA has higher depth error for longer distance than Wayformer, which uses all sensors including LiDAR, Radar, and Cameras. ∗Note that the Wayformer baseline is based on an enhanced internal version after the publication of [Nayakanti et al., 2023]. Figure 3: Ablation study on the number of sampled trajectories. As more trajectories are sampled, the quality measured by ADE@5s also improves, but the benefits diminish after 12+ samples. ratio among tasks is governed by the relative dataset sizes. The optimal training ratio is influenced by several factors, including task complexity, inter-task correlations, and the degree of transferability across tasks. Our experimental results demonstrate that the generalist model, trained across multiple tasks, consistently outperforms each specialist model that is trained on a single task. This highlights the advantage of the generalist approach: enhanced knowledge transfer, improved generalization, and increased efficiency.▼生成AIの回答(ChatGPT 4o mini)
2 Method
- EMMA (Gemini): "EMMA is built on top of Gemini, a family of MLLMs developed at Google." (Geminiを基盤としたEMMAの開発)
- Auto-regressive Gemini models: "We leverage the auto-regressive Gemini models that are trained to process interleaved textual and visual inputs to produce text outputs." (テキストとビジュアル入力を処理する自己回帰型Geminiモデル)
- 3D world coordinates: "Many of the inputs and outputs need to capture 3D world coordinates, such as waypoint BEV (Bird’s Eye View) locations (x, y) for motion planning." (運転計画に必要な3D座標のキャプチャ)
- Direct text conversion vs. Tokenization for coordinates: "We opt for the text representation such that all tasks can share the same unified language representation space." (座標のテキスト表現 vs 特殊トークンによる表現)
2.1 End-to-End Motion Planning
- Unified end-to-end model for trajectory generation: "EMMA employs a unified, end-to-end trained model to generate future trajectories for autonomous vehicles directly from sensor data." (センサーデータから直接未来の軌道を生成する統一型エンドツーエンドモデル)
- Key inputs for human-like driving behavior: "Our model incorporates three key inputs to align with these human driving behaviors." (人間の運転行動に基づく主要な入力)
- No need for HD maps: "No HD map is needed beyond the high-level routing information from a navigation system such as Google Maps." (Google Mapsによる高レベルなルーティング情報のみでHDマップ不要)
- Self-supervised and camera-only: "The only required supervision is the future locations of the ego vehicle." (自己監視型、カメラ入力のみ)
2.2 Planning with Chain-of-Thought Reasoning
- Chain-of-thought reasoning: "Chain-of-thought Prompting is a powerful tool in MLLMs that enhances reasoning capabilities and improves explainability." (推論能力を強化し、説明可能性を向上させるChain-of-thought)
- Hierarchical decision structure: "We structure the driving rationale hierarchically, progressing from 4 types of coarse-to-fine-grained information." (粗→細粒度の4段階で運転意思決定を構造化)
- Automated rationale generation: "The driving rationale captions are generated using an automated tool without any additional human labels." (自動生成された運転意思決定キャプション、追加の人手不要)
2.3 EMMA Generalist
- Versatility in handling multiple driving tasks: "EMMA is a generalist model capable of handling multiple driving tasks through training mixtures." (複数の運転タスクを同時に処理可能な汎用モデル)
- Instruction-tuning for task-specific prompts: "We employ instruction-tuning to jointly train all tasks together with task-specific prompts." (タスク特化型プロンプトによるインストラクションチューニング)
- Spatial reasoning and 3D object detection: "Spatial reasoning is the ability to understand, reason, and draw conclusions about objects and their relationships in space." (空間推論能力、3D物体検出)
- Road graph estimation: "Road graph estimation focuses on identifying critical road elements for safe driving." (安全運転のための道路要素推定)
2.4 Generalist Training
- Unified vision-language framework: "Our unified vision-language formulation enables the simultaneous training of multiple tasks with a single model." (統一されたビジョン・ランゲージフレームワークによる複数タスク同時トレーニング)
- Training procedure and task-specific dataset sampling: "For each task, we construct a dataset Dtask containing |Dtask| training examples." (タスクごとのデータセットDtask作成)
- Generalist model performance: "The generalist model, trained across multiple tasks, consistently outperforms each specialist model." (複数タスクで訓練された汎用モデルの優れた性能)
▼質問(人間)
①上記の「複数タスクで訓練された汎用モデルの優れた性能」について詳細を教えて下さい。
▼生成AIの回答(ChatGPT 4o mini)
「複数タスクで訓練された汎用モデルの優れた性能」について、以下の点が挙げられます:
-
知識の転送 (Knowledge Transfer)
EMMAの汎用モデルは、複数の異なるタスクを同時に訓練することによって、タスク間で得られた知識を相互に転送することができます。このアプローチにより、モデルは一つのタスクで学んだ情報を他のタスクに応用でき、個別に専門化したモデルと比べて全体的に強力な推論能力を持つことが可能になります。例えば、運転計画や物体検出、道路グラフ推定など、複数のタスクで共通する要素や関係性を理解することができるため、汎用性が高まります。 -
一般化性能の向上 (Improved Generalization)
汎用モデルは、個別のタスクに特化したモデルに比べてより広範囲なシナリオに対応できる能力を持っています。単一タスクのモデルが特定の状況やデータに過剰に適応してしまうことに対して、汎用モデルは異なるデータセットや新たな環境でも効果的に動作し、一般化の能力が向上します。これにより、未知の状況や予期しないシナリオにも適応する能力が高まります。 -
効率性の向上 (Increased Efficiency)
複数のタスクを1つの汎用モデルで訓練することで、タスクごとに異なる専門モデルを個別に開発する必要がなくなります。これにより、計算リソースの節約や開発の効率化が進みます。モデルが1つのフレームワークで多くのタスクを学習するため、全体的なコストが削減され、訓練・推論の効率性が向上します。 -
タスク間の相互関係 (Inter-task Relationships)
複数のタスクを同時に訓練することで、タスク間の関係性を理解することができます。例えば、物体検出や道路グラフ推定など、異なるタスクでの情報が相互に補完し合い、より精度の高い予測を可能にします。例えば、道路グラフ推定で得られた道路の情報は、物体検出や運転計画における意思決定に役立つことがあります。このように、複数のタスク間で得られた情報を相互に活用できる点が汎用モデルの強みです。 -
タスク固有のプロンプトとインストラクションチューニング (Task-specific Prompts and Instruction-tuning)
EMMAは、各タスクの特定の要求に応じたプロンプトを使って訓練されています。このプロンプト設計により、各タスクを効率的に学習でき、同時に他のタスクも学習することで相乗効果が生まれます。例えば、運転計画のタスクと物体検出タスクを同時に学ぶことで、物体を避ける動きと進行方向を予測する能力が高まるなどの効果が期待できます。 -
専門モデルとの比較 (Comparison with Specialist Models)
実験結果では、汎用モデル(EMMA)は各専門モデル(単一タスク専用のモデル)と比較して、タスクごとに均等に良い性能を発揮しており、総合的に優れた性能を示しています。例えば、運転計画、物体検出、道路グラフ推定など、個別の専門モデルに劣らず、高い精度でタスクを処理できることが示されています。これにより、汎用モデルの方がタスク間での知識の相互作用を最大限に活かし、専門化したモデルに比べて強力なパフォーマンスを発揮できることが確認されています。
総じて、EMMAの汎用モデルは複数タスクを訓練することで、知識の共有や相互補完が可能となり、特に一般化能力と効率性、タスク間の相互関係において専門モデルを上回るパフォーマンスを実現しています。
3 Experiments
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
3 Experiments We highlight the experiments used to validate the efficacy of EMMA model. We leverage the smallest size of Gemini, i.e., Gemini 1.0 Nano-1 [Gemini Team Google, 2023] for all the experiments. In the following section, we first present the results of end-to-end planner trajectory generation on two public datasets in Section 3.1. Next, we conduct additional experiments on our internal datasets, studying the impact of chain-of-thought and data scaling in Section 3.2. Section 3.3 focuses on experiments involving three core perception tasks: 3D object detection, road graph estimation, and scene understanding. Our co-training results for the generalist model are summarized in Section 3.4. Finally, we showcase visual results that highlight EMMA capabilities in challenging, long-tail scenarios in Section 3.5. 3.1 End-to-End Motion Planning on Public Datasets We conduct the end-to-end planner trajectory generation experiments on two public datasets, the Waymo Open Motion Dataset (WOMD) [Chen et al., 2024a] and the nuScenes dataset [Caesar et al., 2020]. EMMA is trained with the simplest end-to-end planner trajectory generation formulation as in Equation 2. That is, given camera images, ego vehicle history, and driving intent, the model is asked to predict the future ego waypoints for a certain time horizon. 7 3.1.1 Driving on the Waymo Open Motion Dataset (WOMD) WOMD comprises 103k real-world urban and suburban driving scenarios, each lasting 20 seconds. These scenarios are further segmented into 1.1M examples, each representing a 9-second window: 1 second is used as input context, and the remaining 8 seconds serve as the prediction target. The dataset includes detailed map features such as traffic signal states and lane characteristics, along with agent states such as position, velocity, acceleration, and bounding boxes. WOMD is typically used for the agent motion prediction task, i.e., given the agent tracks for the past 1 second on a corresponding map, predict the positions of up to 8 agents for 8 seconds into the future. We adopt the exact same setting but only train and evaluate the future motion prediction of the ego vehicle, i.e. the autonomous driving vehicle, as the planning benchmark. In addition, we leverage the ‘high-level intent command’ field, used for evaluating motion prediction, as the planning router instruction. We adapt some of the best motion prediction models, MotionLM [Seff et al., 2023] and enhanced Wayformer [Nayakanti et al., 2023], as the baselines of this planning benchmark. During inference, sampling a final trajectory from multiple candidate trajectories plays a critical role in the final performance. Both MotionLM and Wayformer generate 192 candidate trajectories, which are subsequently aggregated into 6 clusters using k-means clustering, resulting in 6 representative trajectories. The final trajectory is selected from these 6 representative trajectories based on their associated probabilities. For fairness, we also sample multiple trajectories using a Top-K decoding strategy, up to K = 24. We then compute the pairwise L2 distance between all trajectories and select the one with the lowest average L2 distance as the final predicted trajectory, which can be viewed as the “median” trajectory among all the predictions. In the ADE calculation, we treat this median trajectory with the highest probability. All WOMD eval results are based on the validation set. As shown in Table 1, our model achieved similar performance as the MotionLM baseline when we train only on the WOMD dataset, with Gemini pre-trained weights. When pre-trained with our internal dataset (denoted as EMMA+), our model surpasses both MotionLM and Wayformer up to ADE 5s in the future. EMMA is only slightly worse than Wayformer at ADE 8s, where camerainduced depth error becomes a prominent error resource. We note the differences in inputs between MotionLM and EMMA: MotionLM takes inputs of agent location history, agent interactions, the road graph, and traffic light states. These agent boxes are produced by specialized off-board perception models that look at both past and future observations and are trained with a large amount of carefully curated human labels, the road graph is manually generated using full run segments, and all inputs heavily use LiDAR data with superior depth estimation. In stark contrast, EMMA only takes camera images and ego vehicle history as input, without the need of any labels or additional models (besides leveraging the Gemini pre-trained weights). Notably, for this work, we added internally available image inputs, which are not currently in the released WOMD. We further investigate the impact of the number of sampled trajectories on ADE, as illustrated in Figure 3. The results highlight that sampling from multiple trajectories, rather than a single one, leads to a notable improvement in ADE. However, the advantage of increasing the number of candidate trajectories diminishes beyond a certain point. 3.1.2 Driving on the nuScenes Dataset The nuScenes dataset [Caesar et al., 2020] offers a comprehensive autonomous vehicle sensor suite for evaluation. It consists of 1,000 scenes, each spanning 20 seconds, and includes information from 6 cameras, 5 radars, and 1 LiDAR that collectively provide 360-degree coverage in the field of view. The dataset is fully annotated with 3D bounding boxes across 23 classes and 8 attributes. In our experiments, we follow the standard protocol for planning evaluation: predict the next 3 seconds of future driving actions based on 2 seconds of historical data. We measure the planning quality with L2 errors at 1-, 2- and 3-second time horizons, aligning with established baseline methods. As shown in Table 2, our self-supervised EMMA achieves state-of-the-art results in planning on nuScenes, outperforming all previous supervised (with intermediate perception labels and/or human labels) and self-supervised (no extra labels) methods. Under the same self-supervised setup, EMMA outperforms previous BEV-Planner [Li et al., 2024] by 17.1% in average L2 metric; even compared to DriveVLM-Dual [Tian et al., 2024] that heavily uses intermediate perception and human labels, our self-supervised EMMA still improves the average L2 metric by 6.4%. 8 Method self-supervised? L2 (m) 1s L2 (m) 2s L2 (m) 3s Avg L2 (m) UniAD [Hu et al., 2023] ✗ 0.42 0.64 0.91 0.66 DriveVLM [Tian et al., 2024] ✗ 0.18 0.34 0.68 0.40 VAD [Jiang et al., 2023] ✗ 0.17 0.34 0.60 0.37 OmniDrive [Wang et al., 2024a] ✗ 0.14 0.29 0.55 0.33 DriveVLM-Dual [Tian et al., 2024] ✗ 0.15 0.29 0.48 0.31 Ego-MLP∗ [Zhai et al., 2023] ✓ 0.15 0.32 0.59 0.35 BEV-Planner [Li et al., 2024] ✓ 0.16 0.32 0.57 0.35 EMMA (random init) ✓ 0.15 0.33 0.63 0.37 EMMA ✓ 0.14 0.29 0.54 0.32 EMMA+ ✓ 0.13 0.27 0.48 0.29 Table 2: End-to-end motion planning experiments on nuScenes [Caesar et al., 2020]. EMMA (random init) denotes models are randomly initialized; EMMA denotes models are initialized from Gemini; EMMA+ denotes models are pre-trained on our internal data. EMMA achieves state-of-the-art performance on the nuScenes planning benchmark, outperforming the supervised (with perception and/or human labels) prior art by 6.4% and self-supervised (no extra labels) prior art by 17.1%. ∗Ego-MLP results are taken from a reproduced version in BEV-Planner. Scene description Critical object Meta decision Behavior description Relative improvements over baseline e2e planning ✓ ✗ ✗ ✗ + 0.0% ✗ ✓ ✗ ✗ + 1.5% ✗ ✗ ✓ ✗ + 3.0% ✗ ✓ ✓ ✗ + 5.7% ✗ ✓ ✓ ✓ + 6.7% Table 3: Ablation study on chain-of-thought reasoning components. It improves end-to-end planning quality by up to 6.7% by combining all elements. In particular, driving meta-decision and critical objects contribute the improvements of 3.0% and 1.5%, respectively. The details of each component is described in Section 2.2. Unlike in WOMD, we note that sampling multiple trajectories did not yield significant improvements. We hypothesize that this is due to nuScenes’ shorter prediction time horizon (3s) in simpler driving scenarios. Thus, we report only top-1 predictions for our results. 3.2 End-to-End Motion Planning with Chain-of-Thought Reasoning on Internal Dataset In this section, we present our studies of end-to-end planning with chain-of-thought on our internal dataset. This dataset contains millions of scenarios, orders of magnitude larger than any publicly available autonomous driving dataset. The model takes in 2 seconds of history to predict the driving actions for 5 seconds into the future. Table 3 presents the results of our experiments on chain-of-thought reasoning applied to end-to-end planning. By adopting the chain-of-thought formulation introduced in Equation 3, we achieve a notable 6.7% improvement over the standard end-to-end planning approach detailed in Equation 2. We also conduct an ablation study to analyze the contributions of different rationale components. Our findings reveal that both driving meta-decision and critical object identification significantly enhance performance, contributing improvements of 3.0% and 1.5%, respectively. When these components are combined, the gains are even more substantial. Conversely, while scene description has a neutral impact on driving performance, it enhances the model’s explainability. These results demonstrate that chain-of-thought reasoning can meaningfully improve driving performance, particularly when its components are carefully selected and integrated. We also perform a series of data scaling experiments for end-to-end planning, the results of which are illustrated in Figure 4. As we train the model on a larger training set, we observe lower eval perplexities before overfitting. Our findings indicate that the driving quality of EMMA has not yet plateaued, even with the current large-scale dataset. 9 Figure 4: EMMA data scaling experiments on our internal dataset. Each curve represents the eval perplexity for end-to-end motion planning as training proceeds with more steps. The x-axis is training compute, measured by floating-point operations (FLOPs) in log scale. The same EMMA model is trained on four sizes of datasets that are sampled with different percentages from 3% to 100% (denoted by different colors). In general, EMMA tends to achieve better quality until overfitting when given more training compute, but it also overfits quickly on smaller datasets. We observe the driving quality has not saturated when using the full large-scale dataset. 3.3 Perception Tasks We summarize our studies on three main perception tasks: 3D object detection, road graph estimation, and scene understanding. 3.3.1 3D Object Detection We validate our 3D object detection performance on the 3D camera-primary detection benchmark from the Waymo Open Dataset [Sun et al., 2020] using the d Longitudinal Error Tolerant (LET) matching [Hung et al., 2024]. We evaluate two versions: EMMA and EMMA+, similar to earlier sections, where EMMA+ is pre-trained on the 3D detection task using our internal dataset. The quantitative results are reported on the official test set and summarized in Figure 5. Our findings show that after pre-training, EMMA+ achieves competitive performance on the benchmark. Since our model output set of detected boxes without individual confidence scores, we directly compare the precision/recall instead of LET-3D-AP, which is calculated based on the precision/recall curve. We also compare the commonly used F1-score, where EMMA’s F1-score is computed using the single precision/recall and other models’ F1-scores are calculated by picking the maximal F1-score on the curve (often called F1-max). Figure 5 shows the performance comparison. In generally, EMMA+ demonstrates substantial improvements over state-of-the-art methods such as BEVFormer [Li et al., 2022b], achieving a 16.3% relative increase in vehicle precision at the same recall, and a 5.5% recall improvement at the same precision. EMMA+ also achieve better F1-score than prior arts. Performance on the pedestrian class is also comparable to that of MV-FCOS3D++ [Wang et al., 2021]. Additionally, we provide a performance breakdown across different ranges, highlighting that our model performs especially well in the near range. Our results underscore that with sufficient data and a large enough model, a multimodal approach can surpass specialized expert models in 3D object detection quality. 3.3.2 Road Graph Estimation Road graph estimation is a complex task that predicts a group of unordered polylines, each of which is represented as a sequence of waypoints. We measure the quality of road graph prediction with two metrics: (1) lane-level precision and recall, where we define a true positive match between a predicted lane polyline and a groundtruth lane polyline if and only if their Chamfer distance is within 1 meter; and (2) pixel-level precision and recall, where polylines are rasterized into a BEV grid with 1 meter resolution – we then treat the BEV grid as a image and compute precision and recall based on per-pixel matching. As discussed in Section 2.3, this task involves several design choices. One is about the representation of road graph polylines, where our choice is to define the start and end points of each lane, with 10 Figure 5: Camera-primary 3D object detection experiments on WOD [Sun et al., 2020] using the standard LET matching [Hung et al., 2024]. EMMA+ denotes model is pre-trained on our internal data. EMMA+ achieves competitive performance on the detection benchmark in both precision/recall and F1-score metrics. Compared to state-of-the-art methods, it achieves 16.3% relative improvements in vehicle precision at the same recall or 5.5% recalls improvement at the same precision. intermediate points added as needed to accurately capture the road’s curvature. Another critical design choice is the construction of target label sequences used for model training. Drawing inspiration from Pix2Seq [Chen et al., 2022a] in the context of object detection, one effective design choice is to pad the targets and apply random shuffling. This technique helps the model handle unordered outputs and prevents premature termination during training. Figure 6 presents our ablation studies on various design choices. Starting from our best designs, we systematically ablate each of the following configurations and assess the resulting quality degradation. We then summarize the key insights from our analysis. Polyline representation: dynamic sampling is better than fixed sampling. A simple polyline representation is to sample a fixed number of sparse control points per lane, e.g. two end points plus a fixed number of intermediate points to capture curvature. However, we find a better approach is to dynamically adjust the number of points per polyline according to the curvature and length of the lane. By keeping a consistent waypoint density rather than a consistent number of waypoints, we achieve a representation that more accurately captures the lane structure intricacies, yielding around a 40% to 90% difference in the metrics as shown in Figure 6. 11 Figure 6: Ablation study on road graph estimation. To evaluate the influence of different components in our road graph estimation model, we ablate each configuration and measure the corresponding impact on quality. Dynamic sampling (leftmost) of road graph polylines based on lane curvature and length proves to be the most significant factor, leading to a substantial 70% to 90% change in lane-level precision and recall. In contrast, aligning the model with a language-like representation, i.e., semantic punctuation (rightmost), has a more modest effect, contributing to only <10% change in precision and recall of any metric. Polyline representation: ego-origin aligned sample intervals are better than naively aligned sample intervals. The road graph is typically stored and accessed in global coordinate frame, meaning lane origins and extensions are independent of the ego vehicle position. To improve accuracy, it is essential to adjust lane point samples to start from the ego vehicle coordinate frame origin. Specifically, sampling polyline points relative to the AV position (ego-origin) avoids arbitrary offsets that can arise from directly transforming points sampled in the global coordinate frame into the ego coordinate frame. This prevents a 25% to 60% drop in prediction quality. Target sequence construction: shuffled ordering is better than arbitrary ordering. We organize polyline targets into bins based on their endpoint distance from the ego vehicle, providing a rough global ordering. For instance, we categorize lanes into nearby lanes and those further away that serve as connecting lanes. During training, we dynamically shuffle the polylines within each distance bin to enhance the model robustness and coverage. This dynamic shuffling within each bin improves the model’s ability to generalize across different lane configurations, leading to more accurate predictions. Target sequence construction: padding is better than non-padding. Similar to Pix2Seq [Chen et al., 2022a], we find that padding targets to prevent early termination is highly effective. In addition to padding the total number of polyline targets, we also pad the number of points within each polyline. We use “invalid” tokens to represent padded points within polylines. Each polyline is also explicitly tagged with a final “valid” or “invalid” token to denote whether it contains any nonpadded points. This approach ensures consistent input sizes, which helps maintain the integrity of the model during training and reduces the risk of premature truncation, leading to more reliable and accurate predictions. Target sequence construction: adding punctuation and other semantically redundant token improves quality. In the target sequence construction, we notice that it is beneficial to use language-like structures and punctuation to group targets (e.g., "(x,y and x,y);..." instead of "xy xy;..."). Additionally, explicitly including semantically redundant tokens – such as marking padded targets as “invalid” instead of relying on implicit omissions of “valid” markers – improves performance. This approach, incorporating punctuation and redundancy, results in a boost of up to 10% in lane-level metrics. We attribute this improvement to the language-related pre-training of Gemini. By leveraging similar structured expressions, Gemini can be more easily adapted to other tasks. 3.3.3 Scene Understanding Figure 7 summarizes our studies on the scene understanding task for temporary blockage detection. Our study is based on our internal datasets specifically curated for these scenarios. For this study, we establish our baselines by showing a picture to human and asking them to judge whether a lane is temporarily blocked. They can answer ‘yes’, ‘no’, or ‘unsure’. Our baseline will treat 12 Figure 7: Scene understanding experiments. direct fine-tuning denotes solely using the temporal blockage data during fine-tuning; naive mixture denotes co-training this scene task with road graph estimation; mix + short pretraining denotes pre-training on road graph esitmation first, and then fine-tune on the mixture of both tasks; mix + long pretraining denotes a longer pre-training before fine-tuning. The naive fine-tuning is already close to strong human baseline, but long-pretraining with training mixture can further boost the quality. Relative improvement over single task e2e planning 3D detection road graph estimation e2e planning detection road graph ✓ ✓ - +1.6% (±1.0%) +2.4% ( ±0.8%) ✓ ✓ +1.4% (±2.8%) +5.6% (±1.1%) - ✓ ✓ −1.4% (±2.9%) - +3.5% (±0.9%) ✓ ✓ ✓ +1.4% (±2.8%) +5.5% (±1.1%) +2.4% (±0.8%) Table 4: Generalist co-training experiments. (±∗) indicates standard deviation. By co-training on multiple tasks, EMMA gains a broader understanding of driving scenes, enabling it to handle various tasks at inference time, while enhancing individual task performance. Notably, certain task pairings yield greater benefits than others, suggesting these tasks are complementary. Co-training all three tasks together yields the best quality. all ‘unsure’ answers as incorrect, baseline+filtering will filter out all examples with ‘unsure’ answers. In contrast, our model is fine-tuned to predict ‘yes’ or ‘no’ for all examples. As shown in the figure, our naive model that is directly fine-tuned for only this task achieves better performance than the baseline comparison, but underperforms on the baseline+filtering comparison. To boost the model performance, our first attempt is to co-train this task with road graph estimation, but the naive mixture doesn’t improve performance. Our second attempt is to first pre-train the model on road graph estimation, and then fine-tune on these two tasks. Results show when the pre-training is long enough, the quality is boosted, showcasing the model’s ability to integrate multiple tasks for enhanced performance.▼生成AIの回答(ChatGPT 4o mini)
以下に各章節のポイントとなるキーワード・フレーズを列挙しました(日本語訳も記載しています):
3.1 End-to-End Motion Planning on Public Datasets
- End-to-end planner trajectory generation experiments (エンドツーエンドのプランナー軌道生成実験)
- Waymo Open Motion Dataset (WOMD) (Waymoオープンモーションデータセット)
- nuScenes dataset (nuScenesデータセット)
- Future ego waypoints prediction (未来の自車位置予測)
- MotionLM and Wayformer baselines (MotionLMおよびWayformerのベースライン)
- Sampling multiple trajectories for ADE improvement (ADEの改善のための複数軌道のサンプリング)
- Gemini pre-trained weights (Gemini事前学習済み重み)
3.1.1 Driving on the Waymo Open Motion Dataset (WOMD)
- 103k real-world urban and suburban driving scenarios (103kの実世界の都市および郊外の運転シナリオ)
- High-level intent command field (高レベルの意図コマンドフィールド)
- L2 distance between trajectories (軌道間のL2距離)
- Median trajectory selection (中央値軌道の選択)
- EMMA surpasses MotionLM and Wayformer (EMMAがMotionLMおよびWayformerを上回る)
3.1.2 Driving on the nuScenes Dataset
- Self-supervised EMMA outperforms previous methods (自己教師ありEMMAが既存の手法を上回る)
- L2 errors at 1-, 2-, and 3-second time horizons (1秒、2秒、3秒の時間範囲におけるL2誤差)
- Performance comparison with DriveVLM-Dual and BEV-Planner (DriveVLM-DualおよびBEV-Plannerとの性能比較)
- State-of-the-art performance on nuScenes planning benchmark (nuScenes計画ベンチマークにおける最先端の性能)
3.2 End-to-End Motion Planning with Chain-of-Thought Reasoning on Internal Dataset
- Chain-of-thought reasoning in planning (計画におけるチェーン・オブ・ソート推論)
- Improvement by up to 6.7% with reasoning components (推論コンポーネントによる最大6.7%の改善)
- Driving meta-decision and critical object identification (運転のメタ意思決定と重要な物体の識別)
- Data scaling experiments (データスケーリング実験)
3.3 Perception Tasks
- 3D object detection, road graph estimation, and scene understanding (3D物体検出、道路グラフ推定、シーン理解)
- 3D object detection performance on Waymo Open Dataset (Waymoオープンデータセットにおける3D物体検出の性能)
- Dynamic polyline sampling for road graph estimation (道路グラフ推定のための動的なポリラインサンプリング)
- Target sequence construction for road graph prediction (道路グラフ予測のためのターゲットシーケンス構築)
3.3.1 3D Object Detection
- Precision/recall improvement on vehicle class (車両クラスの精度/再現率向上)
- F1-score comparison with prior methods (従来手法とのF1スコア比較)
- Vehicle precision increase by 16.3% (車両の精度が16.3%向上)
3.3.2 Road Graph Estimation
- Lane-level and pixel-level precision and recall (車線レベルおよびピクセルレベルの精度と再現率)
- Dynamic sampling vs. fixed sampling for lane representation (車線表現のための動的サンプリングと固定サンプリング)
- Padding and shuffling for training stability (トレーニングの安定性のためのパディングとシャッフル)
3.3.3 Scene Understanding
- Temporary blockage detection for scene understanding (シーン理解のための一時的障害検出)
- Co-training for multiple tasks (road graph estimation and blockage detection) (複数のタスク(道路グラフ推定と障害検出)の共同訓練)
- Long pre-training to boost scene understanding performance (シーン理解性能向上のための長期的な事前学習)
3.4 Co-training Results for Generalist Model
- Co-training of multiple tasks (複数タスクの共同訓練)
- Improvement in performance by co-training tasks (タスクの共同訓練による性能改善)
▼質問(人間)
①上記の「Dynamic polyline sampling for road graph estimation (道路グラフ推定のための動的なポリラインサンプリング)」について詳細を教えて下さい。
②上記の「Padding and shuffling for training stability (トレーニングの安定性のためのパディングとシャッフル)」について詳細を教えて下さい。
③上記の「Temporary blockage detection for scene understanding (シーン理解のための一時的障害検出)」について詳細を教えて下さい。
④上記の「Long pre-training to boost scene understanding performance (シーン理解性能向上のための長期的な事前学習)」について詳細を教えて下さい。
▼生成AIの回答(ChatGPT 4o mini)
① Dynamic polyline sampling for road graph estimation (道路グラフ推定のための動的なポリラインサンプリング)
この技術は、道路グラフを構成する際に、道路のカーブや長さに応じてサンプリングする点の数を動的に調整する方法です。具体的には、固定された数のポイント(例えば、道路の両端点や数個の中間点)を用いる代わりに、道路の形状に合わせてサンプルを動的に調整することで、より精度の高い道路構造を表現できます。これにより、車線の曲線や長さに対する表現力が向上し、精度が40%~90%向上することが示されています。この方法は、固定数のポイントサンプリングよりも車線の複雑なカーブや長さをより適切に捉えることができるため、道路グラフ推定の精度を大きく改善します。
② Padding and shuffling for training stability (トレーニングの安定性のためのパディングとシャッフル)
このアプローチは、ネットワークのトレーニング過程で安定性を確保し、早期終了(Early Stopping)を防ぐために用いられます。具体的には、ポリライン(道路の線)ターゲットを固定長にパディングして、モデルが一貫したサイズの入力を受け取るようにします。この方法により、モデルは途中で予測を終えず、安定したトレーニングが可能になります。また、ターゲットポリラインの順番をランダムにシャッフルすることで、訓練中に早期に適切なターゲットを学ぶのを防ぎ、異なる車線構成に対する一般化能力を向上させます。これらの手法は、モデルがより多様なシナリオに適応できるようにし、過剰適合(オーバーフィッティング)を防ぐために重要です。
③ Temporary blockage detection for scene understanding (シーン理解のための一時的障害検出)
このタスクは、運転中に道路が一時的にブロックされているかどうかを識別する問題です。障害物や交通の妨げが一時的なものであるか、長期的なものであるかを特定することは、安全で効率的な運転計画において重要です。この研究では、モデルに対して「はい」「いいえ」「わからない」の3つの選択肢で回答を求め、人間の基準と比較します。最初に単独で一時的な障害検出タスクをトレーニングし、次に他のタスク(例えば道路グラフ推定)と一緒にトレーニングを行いました。結果として、シンプルなファインチューニングでは効果が限られ、他のタスクと共同でトレーニングすることで、性能が向上することが示されています。
④ Long pre-training to boost scene understanding performance (シーン理解性能向上のための長期的な事前学習)
シーン理解のパフォーマンスを向上させるために、長期間にわたる事前学習が重要であるという結果が示されています。特に、一時的な障害検出のようなタスクを改善するために、初めに長期間にわたる事前学習を行い、その後、他のタスク(例えば、道路グラフ推定)と組み合わせたファインチューニングを実施することで、最終的により高いパフォーマンスを発揮します。この手法により、複数のタスクを同時に学習することができ、個別のタスクの性能も向上することが確認されました。
質問の内容をより詳しく書いて再度質問し直す↓
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
3 Experiments We highlight the experiments used to validate the efficacy of EMMA model. We leverage the smallest size of Gemini, i.e., Gemini 1.0 Nano-1 [Gemini Team Google, 2023] for all the experiments. In the following section, we first present the results of end-to-end planner trajectory generation on two public datasets in Section 3.1. Next, we conduct additional experiments on our internal datasets, studying the impact of chain-of-thought and data scaling in Section 3.2. Section 3.3 focuses on experiments involving three core perception tasks: 3D object detection, road graph estimation, and scene understanding. Our co-training results for the generalist model are summarized in Section 3.4. Finally, we showcase visual results that highlight EMMA capabilities in challenging, long-tail scenarios in Section 3.5. 3.1 End-to-End Motion Planning on Public Datasets We conduct the end-to-end planner trajectory generation experiments on two public datasets, the Waymo Open Motion Dataset (WOMD) [Chen et al., 2024a] and the nuScenes dataset [Caesar et al., 2020]. EMMA is trained with the simplest end-to-end planner trajectory generation formulation as in Equation 2. That is, given camera images, ego vehicle history, and driving intent, the model is asked to predict the future ego waypoints for a certain time horizon. 7 3.1.1 Driving on the Waymo Open Motion Dataset (WOMD) WOMD comprises 103k real-world urban and suburban driving scenarios, each lasting 20 seconds. These scenarios are further segmented into 1.1M examples, each representing a 9-second window: 1 second is used as input context, and the remaining 8 seconds serve as the prediction target. The dataset includes detailed map features such as traffic signal states and lane characteristics, along with agent states such as position, velocity, acceleration, and bounding boxes. WOMD is typically used for the agent motion prediction task, i.e., given the agent tracks for the past 1 second on a corresponding map, predict the positions of up to 8 agents for 8 seconds into the future. We adopt the exact same setting but only train and evaluate the future motion prediction of the ego vehicle, i.e. the autonomous driving vehicle, as the planning benchmark. In addition, we leverage the ‘high-level intent command’ field, used for evaluating motion prediction, as the planning router instruction. We adapt some of the best motion prediction models, MotionLM [Seff et al., 2023] and enhanced Wayformer [Nayakanti et al., 2023], as the baselines of this planning benchmark. During inference, sampling a final trajectory from multiple candidate trajectories plays a critical role in the final performance. Both MotionLM and Wayformer generate 192 candidate trajectories, which are subsequently aggregated into 6 clusters using k-means clustering, resulting in 6 representative trajectories. The final trajectory is selected from these 6 representative trajectories based on their associated probabilities. For fairness, we also sample multiple trajectories using a Top-K decoding strategy, up to K = 24. We then compute the pairwise L2 distance between all trajectories and select the one with the lowest average L2 distance as the final predicted trajectory, which can be viewed as the “median” trajectory among all the predictions. In the ADE calculation, we treat this median trajectory with the highest probability. All WOMD eval results are based on the validation set. As shown in Table 1, our model achieved similar performance as the MotionLM baseline when we train only on the WOMD dataset, with Gemini pre-trained weights. When pre-trained with our internal dataset (denoted as EMMA+), our model surpasses both MotionLM and Wayformer up to ADE 5s in the future. EMMA is only slightly worse than Wayformer at ADE 8s, where camerainduced depth error becomes a prominent error resource. We note the differences in inputs between MotionLM and EMMA: MotionLM takes inputs of agent location history, agent interactions, the road graph, and traffic light states. These agent boxes are produced by specialized off-board perception models that look at both past and future observations and are trained with a large amount of carefully curated human labels, the road graph is manually generated using full run segments, and all inputs heavily use LiDAR data with superior depth estimation. In stark contrast, EMMA only takes camera images and ego vehicle history as input, without the need of any labels or additional models (besides leveraging the Gemini pre-trained weights). Notably, for this work, we added internally available image inputs, which are not currently in the released WOMD. We further investigate the impact of the number of sampled trajectories on ADE, as illustrated in Figure 3. The results highlight that sampling from multiple trajectories, rather than a single one, leads to a notable improvement in ADE. However, the advantage of increasing the number of candidate trajectories diminishes beyond a certain point. 3.1.2 Driving on the nuScenes Dataset The nuScenes dataset [Caesar et al., 2020] offers a comprehensive autonomous vehicle sensor suite for evaluation. It consists of 1,000 scenes, each spanning 20 seconds, and includes information from 6 cameras, 5 radars, and 1 LiDAR that collectively provide 360-degree coverage in the field of view. The dataset is fully annotated with 3D bounding boxes across 23 classes and 8 attributes. In our experiments, we follow the standard protocol for planning evaluation: predict the next 3 seconds of future driving actions based on 2 seconds of historical data. We measure the planning quality with L2 errors at 1-, 2- and 3-second time horizons, aligning with established baseline methods. As shown in Table 2, our self-supervised EMMA achieves state-of-the-art results in planning on nuScenes, outperforming all previous supervised (with intermediate perception labels and/or human labels) and self-supervised (no extra labels) methods. Under the same self-supervised setup, EMMA outperforms previous BEV-Planner [Li et al., 2024] by 17.1% in average L2 metric; even compared to DriveVLM-Dual [Tian et al., 2024] that heavily uses intermediate perception and human labels, our self-supervised EMMA still improves the average L2 metric by 6.4%. 8 Method self-supervised? L2 (m) 1s L2 (m) 2s L2 (m) 3s Avg L2 (m) UniAD [Hu et al., 2023] ✗ 0.42 0.64 0.91 0.66 DriveVLM [Tian et al., 2024] ✗ 0.18 0.34 0.68 0.40 VAD [Jiang et al., 2023] ✗ 0.17 0.34 0.60 0.37 OmniDrive [Wang et al., 2024a] ✗ 0.14 0.29 0.55 0.33 DriveVLM-Dual [Tian et al., 2024] ✗ 0.15 0.29 0.48 0.31 Ego-MLP∗ [Zhai et al., 2023] ✓ 0.15 0.32 0.59 0.35 BEV-Planner [Li et al., 2024] ✓ 0.16 0.32 0.57 0.35 EMMA (random init) ✓ 0.15 0.33 0.63 0.37 EMMA ✓ 0.14 0.29 0.54 0.32 EMMA+ ✓ 0.13 0.27 0.48 0.29 Table 2: End-to-end motion planning experiments on nuScenes [Caesar et al., 2020]. EMMA (random init) denotes models are randomly initialized; EMMA denotes models are initialized from Gemini; EMMA+ denotes models are pre-trained on our internal data. EMMA achieves state-of-the-art performance on the nuScenes planning benchmark, outperforming the supervised (with perception and/or human labels) prior art by 6.4% and self-supervised (no extra labels) prior art by 17.1%. ∗Ego-MLP results are taken from a reproduced version in BEV-Planner. Scene description Critical object Meta decision Behavior description Relative improvements over baseline e2e planning ✓ ✗ ✗ ✗ + 0.0% ✗ ✓ ✗ ✗ + 1.5% ✗ ✗ ✓ ✗ + 3.0% ✗ ✓ ✓ ✗ + 5.7% ✗ ✓ ✓ ✓ + 6.7% Table 3: Ablation study on chain-of-thought reasoning components. It improves end-to-end planning quality by up to 6.7% by combining all elements. In particular, driving meta-decision and critical objects contribute the improvements of 3.0% and 1.5%, respectively. The details of each component is described in Section 2.2. Unlike in WOMD, we note that sampling multiple trajectories did not yield significant improvements. We hypothesize that this is due to nuScenes’ shorter prediction time horizon (3s) in simpler driving scenarios. Thus, we report only top-1 predictions for our results. 3.2 End-to-End Motion Planning with Chain-of-Thought Reasoning on Internal Dataset In this section, we present our studies of end-to-end planning with chain-of-thought on our internal dataset. This dataset contains millions of scenarios, orders of magnitude larger than any publicly available autonomous driving dataset. The model takes in 2 seconds of history to predict the driving actions for 5 seconds into the future. Table 3 presents the results of our experiments on chain-of-thought reasoning applied to end-to-end planning. By adopting the chain-of-thought formulation introduced in Equation 3, we achieve a notable 6.7% improvement over the standard end-to-end planning approach detailed in Equation 2. We also conduct an ablation study to analyze the contributions of different rationale components. Our findings reveal that both driving meta-decision and critical object identification significantly enhance performance, contributing improvements of 3.0% and 1.5%, respectively. When these components are combined, the gains are even more substantial. Conversely, while scene description has a neutral impact on driving performance, it enhances the model’s explainability. These results demonstrate that chain-of-thought reasoning can meaningfully improve driving performance, particularly when its components are carefully selected and integrated. We also perform a series of data scaling experiments for end-to-end planning, the results of which are illustrated in Figure 4. As we train the model on a larger training set, we observe lower eval perplexities before overfitting. Our findings indicate that the driving quality of EMMA has not yet plateaued, even with the current large-scale dataset. 9 Figure 4: EMMA data scaling experiments on our internal dataset. Each curve represents the eval perplexity for end-to-end motion planning as training proceeds with more steps. The x-axis is training compute, measured by floating-point operations (FLOPs) in log scale. The same EMMA model is trained on four sizes of datasets that are sampled with different percentages from 3% to 100% (denoted by different colors). In general, EMMA tends to achieve better quality until overfitting when given more training compute, but it also overfits quickly on smaller datasets. We observe the driving quality has not saturated when using the full large-scale dataset. 3.3 Perception Tasks We summarize our studies on three main perception tasks: 3D object detection, road graph estimation, and scene understanding. 3.3.1 3D Object Detection We validate our 3D object detection performance on the 3D camera-primary detection benchmark from the Waymo Open Dataset [Sun et al., 2020] using the d Longitudinal Error Tolerant (LET) matching [Hung et al., 2024]. We evaluate two versions: EMMA and EMMA+, similar to earlier sections, where EMMA+ is pre-trained on the 3D detection task using our internal dataset. The quantitative results are reported on the official test set and summarized in Figure 5. Our findings show that after pre-training, EMMA+ achieves competitive performance on the benchmark. Since our model output set of detected boxes without individual confidence scores, we directly compare the precision/recall instead of LET-3D-AP, which is calculated based on the precision/recall curve. We also compare the commonly used F1-score, where EMMA’s F1-score is computed using the single precision/recall and other models’ F1-scores are calculated by picking the maximal F1-score on the curve (often called F1-max). Figure 5 shows the performance comparison. In generally, EMMA+ demonstrates substantial improvements over state-of-the-art methods such as BEVFormer [Li et al., 2022b], achieving a 16.3% relative increase in vehicle precision at the same recall, and a 5.5% recall improvement at the same precision. EMMA+ also achieve better F1-score than prior arts. Performance on the pedestrian class is also comparable to that of MV-FCOS3D++ [Wang et al., 2021]. Additionally, we provide a performance breakdown across different ranges, highlighting that our model performs especially well in the near range. Our results underscore that with sufficient data and a large enough model, a multimodal approach can surpass specialized expert models in 3D object detection quality. 3.3.2 Road Graph Estimation Road graph estimation is a complex task that predicts a group of unordered polylines, each of which is represented as a sequence of waypoints. We measure the quality of road graph prediction with two metrics: (1) lane-level precision and recall, where we define a true positive match between a predicted lane polyline and a groundtruth lane polyline if and only if their Chamfer distance is within 1 meter; and (2) pixel-level precision and recall, where polylines are rasterized into a BEV grid with 1 meter resolution – we then treat the BEV grid as a image and compute precision and recall based on per-pixel matching. As discussed in Section 2.3, this task involves several design choices. One is about the representation of road graph polylines, where our choice is to define the start and end points of each lane, with 10 Figure 5: Camera-primary 3D object detection experiments on WOD [Sun et al., 2020] using the standard LET matching [Hung et al., 2024]. EMMA+ denotes model is pre-trained on our internal data. EMMA+ achieves competitive performance on the detection benchmark in both precision/recall and F1-score metrics. Compared to state-of-the-art methods, it achieves 16.3% relative improvements in vehicle precision at the same recall or 5.5% recalls improvement at the same precision. intermediate points added as needed to accurately capture the road’s curvature. Another critical design choice is the construction of target label sequences used for model training. Drawing inspiration from Pix2Seq [Chen et al., 2022a] in the context of object detection, one effective design choice is to pad the targets and apply random shuffling. This technique helps the model handle unordered outputs and prevents premature termination during training. Figure 6 presents our ablation studies on various design choices. Starting from our best designs, we systematically ablate each of the following configurations and assess the resulting quality degradation. We then summarize the key insights from our analysis. Polyline representation: dynamic sampling is better than fixed sampling. A simple polyline representation is to sample a fixed number of sparse control points per lane, e.g. two end points plus a fixed number of intermediate points to capture curvature. However, we find a better approach is to dynamically adjust the number of points per polyline according to the curvature and length of the lane. By keeping a consistent waypoint density rather than a consistent number of waypoints, we achieve a representation that more accurately captures the lane structure intricacies, yielding around a 40% to 90% difference in the metrics as shown in Figure 6. 11 Figure 6: Ablation study on road graph estimation. To evaluate the influence of different components in our road graph estimation model, we ablate each configuration and measure the corresponding impact on quality. Dynamic sampling (leftmost) of road graph polylines based on lane curvature and length proves to be the most significant factor, leading to a substantial 70% to 90% change in lane-level precision and recall. In contrast, aligning the model with a language-like representation, i.e., semantic punctuation (rightmost), has a more modest effect, contributing to only <10% change in precision and recall of any metric. Polyline representation: ego-origin aligned sample intervals are better than naively aligned sample intervals. The road graph is typically stored and accessed in global coordinate frame, meaning lane origins and extensions are independent of the ego vehicle position. To improve accuracy, it is essential to adjust lane point samples to start from the ego vehicle coordinate frame origin. Specifically, sampling polyline points relative to the AV position (ego-origin) avoids arbitrary offsets that can arise from directly transforming points sampled in the global coordinate frame into the ego coordinate frame. This prevents a 25% to 60% drop in prediction quality. Target sequence construction: shuffled ordering is better than arbitrary ordering. We organize polyline targets into bins based on their endpoint distance from the ego vehicle, providing a rough global ordering. For instance, we categorize lanes into nearby lanes and those further away that serve as connecting lanes. During training, we dynamically shuffle the polylines within each distance bin to enhance the model robustness and coverage. This dynamic shuffling within each bin improves the model’s ability to generalize across different lane configurations, leading to more accurate predictions. Target sequence construction: padding is better than non-padding. Similar to Pix2Seq [Chen et al., 2022a], we find that padding targets to prevent early termination is highly effective. In addition to padding the total number of polyline targets, we also pad the number of points within each polyline. We use “invalid” tokens to represent padded points within polylines. Each polyline is also explicitly tagged with a final “valid” or “invalid” token to denote whether it contains any nonpadded points. This approach ensures consistent input sizes, which helps maintain the integrity of the model during training and reduces the risk of premature truncation, leading to more reliable and accurate predictions. Target sequence construction: adding punctuation and other semantically redundant token improves quality. In the target sequence construction, we notice that it is beneficial to use language-like structures and punctuation to group targets (e.g., "(x,y and x,y);..." instead of "xy xy;..."). Additionally, explicitly including semantically redundant tokens – such as marking padded targets as “invalid” instead of relying on implicit omissions of “valid” markers – improves performance. This approach, incorporating punctuation and redundancy, results in a boost of up to 10% in lane-level metrics. We attribute this improvement to the language-related pre-training of Gemini. By leveraging similar structured expressions, Gemini can be more easily adapted to other tasks. 3.3.3 Scene Understanding Figure 7 summarizes our studies on the scene understanding task for temporary blockage detection. Our study is based on our internal datasets specifically curated for these scenarios. For this study, we establish our baselines by showing a picture to human and asking them to judge whether a lane is temporarily blocked. They can answer ‘yes’, ‘no’, or ‘unsure’. Our baseline will treat 12 Figure 7: Scene understanding experiments. direct fine-tuning denotes solely using the temporal blockage data during fine-tuning; naive mixture denotes co-training this scene task with road graph estimation; mix + short pretraining denotes pre-training on road graph esitmation first, and then fine-tune on the mixture of both tasks; mix + long pretraining denotes a longer pre-training before fine-tuning. The naive fine-tuning is already close to strong human baseline, but long-pretraining with training mixture can further boost the quality. Relative improvement over single task e2e planning 3D detection road graph estimation e2e planning detection road graph ✓ ✓ - +1.6% (±1.0%) +2.4% ( ±0.8%) ✓ ✓ +1.4% (±2.8%) +5.6% (±1.1%) - ✓ ✓ −1.4% (±2.9%) - +3.5% (±0.9%) ✓ ✓ ✓ +1.4% (±2.8%) +5.5% (±1.1%) +2.4% (±0.8%) Table 4: Generalist co-training experiments. (±∗) indicates standard deviation. By co-training on multiple tasks, EMMA gains a broader understanding of driving scenes, enabling it to handle various tasks at inference time, while enhancing individual task performance. Notably, certain task pairings yield greater benefits than others, suggesting these tasks are complementary. Co-training all three tasks together yields the best quality. all ‘unsure’ answers as incorrect, baseline+filtering will filter out all examples with ‘unsure’ answers. In contrast, our model is fine-tuned to predict ‘yes’ or ‘no’ for all examples. As shown in the figure, our naive model that is directly fine-tuned for only this task achieves better performance than the baseline comparison, but underperforms on the baseline+filtering comparison. To boost the model performance, our first attempt is to co-train this task with road graph estimation, but the naive mixture doesn’t improve performance. Our second attempt is to first pre-train the model on road graph estimation, and then fine-tune on these two tasks. Results show when the pre-training is long enough, the quality is boosted, showcasing the model’s ability to integrate multiple tasks for enhanced performance.▼生成AIの回答(ChatGPT 4o mini)
以下は各章節のポイントとなるキーワード・フレーズのリストです。元文とその日本語訳も記載しています。
3.1 End-to-End Motion Planning on Public Datasets (公開データセットでのエンドツーエンドモーションプランニング)
-
End-to-end planner trajectory generation
(エンドツーエンドプランナー軌道生成)
モデルがカメラ画像、自己車両の履歴、および運転意図を入力として、未来の自車の位置を予測するというシンプルな設定。 -
Waymo Open Motion Dataset (WOMD) and nuScenes dataset
(Waymoオープンモーションデータセット(WOMD) とnuScenesデータセット)
WOMDは103k件の都市および郊外の運転シナリオを含み、nuScenesは車両センサーを活用したデータセットで、両者ともに自動運転システムの評価に利用される。 -
Performance comparison with MotionLM and Wayformer
(MotionLMおよびWayformerとの性能比較)
EMMAはMotionLMと同等の結果を得るが、内部データで学習したEMMA+はMotionLMおよびWayformerをADE 5秒先で上回る性能を発揮。 -
Impact of number of sampled trajectories on ADE
(サンプルされた軌道数がADEに与える影響)
複数の軌道をサンプリングすることでADEが向上し、候補軌道数が増えるとその利点が薄れるが、一定の効果は続く。
3.2 End-to-End Motion Planning with Chain-of-Thought Reasoning on Internal Dataset (内部データセットにおける連鎖的思考推論を用いたエンドツーエンドプランニング)
-
Chain-of-thought reasoning in end-to-end planning
(エンドツーエンドプランニングにおける連鎖的思考推論)
連鎖的思考推論を採用することで、標準的なプランニング手法に比べて6.7%の性能向上を達成。 -
Driving meta-decision and critical object identification
(運転メタ決定と重要な物体の識別)
運転のメタ決定や重要物体の識別がそれぞれ3.0%、1.5%の改善に貢献。 -
Ablation study on rationale components
(推論コンポーネントに対するアブレーションスタディ)
連鎖的思考のコンポーネントがプランニングの質を大きく改善し、シーンの説明はモデルの説明能力を向上させる。
3.3 Perception Tasks (知覚タスク)
3.3.1 3D Object Detection (3D物体検出)
-
3D camera-primary detection benchmark
(3Dカメラ主導の検出ベンチマーク)
EMMA+はBEVFormerを大きく上回る精度を達成し、特に車両クラスで優れたパフォーマンスを示す。 -
F1-score comparison with prior methods
(以前の手法とのF1スコア比較)
EMMA+はF1スコアで先行研究を超えており、特に車両クラスで16.3%の精度向上を示す。
3.3.2 Road Graph Estimation (道路グラフ推定)
-
Polyline representation based on lane curvature and length
(車線の曲率および長さに基づくポリライン表現)
車線の曲がり具合と長さに基づいて動的にサンプリングを行い、モデルの精度を大幅に改善(40%~90%の改善)。 -
Target sequence construction with padding and shuffling
(パディングおよびシャッフリングを用いたターゲットシーケンス構築)
ポリラインターゲットをシャッフルして動的に構成し、予測精度を向上させる。
3.3.3 Scene Understanding (シーン理解)
-
Temporary blockage detection using internal datasets
(内部データセットを用いた一時的な障害物検出)
モデルを道路グラフ推定と共同訓練することで、シーン理解の精度が向上。 -
Co-training for multiple tasks improves performance
(複数タスクの共同訓練によるパフォーマンス向上)
道路グラフ推定や3D物体検出といった複数タスクを共同訓練することで、個別タスクのパフォーマンスも向上。
特徴的な内容
-
Self-supervised learning approach (自己教師あり学習アプローチ)
EMMAは自己教師あり学習であり、追加のラベルや手動生成された地図情報を必要とせず、カメラ画像と自己車両の履歴を入力として運転計画を行う点が特徴的。 -
Pre-training with internal datasets (内部データセットによる事前学習)
EMMAはGeminiの事前学習された重みを使用し、内部データセットで学習することにより、他のベースラインモデルを超えるパフォーマンスを発揮。 -
Ablation studies for performance improvement (パフォーマンス向上のためのアブレーション研究)
さまざまな設計選択(例えば、ポリライン表現やターゲットシーケンス構築方法)についてのアブレーションスタディを行い、最適な設定を導き出している点が特に注目されます。
「ただし追加説明は不要」を追加して、再度質問し直す↓
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
3 Experiments We highlight the experiments used to validate the efficacy of EMMA model. We leverage the smallest size of Gemini, i.e., Gemini 1.0 Nano-1 [Gemini Team Google, 2023] for all the experiments. In the following section, we first present the results of end-to-end planner trajectory generation on two public datasets in Section 3.1. Next, we conduct additional experiments on our internal datasets, studying the impact of chain-of-thought and data scaling in Section 3.2. Section 3.3 focuses on experiments involving three core perception tasks: 3D object detection, road graph estimation, and scene understanding. Our co-training results for the generalist model are summarized in Section 3.4. Finally, we showcase visual results that highlight EMMA capabilities in challenging, long-tail scenarios in Section 3.5. 3.1 End-to-End Motion Planning on Public Datasets We conduct the end-to-end planner trajectory generation experiments on two public datasets, the Waymo Open Motion Dataset (WOMD) [Chen et al., 2024a] and the nuScenes dataset [Caesar et al., 2020]. EMMA is trained with the simplest end-to-end planner trajectory generation formulation as in Equation 2. That is, given camera images, ego vehicle history, and driving intent, the model is asked to predict the future ego waypoints for a certain time horizon. 7 3.1.1 Driving on the Waymo Open Motion Dataset (WOMD) WOMD comprises 103k real-world urban and suburban driving scenarios, each lasting 20 seconds. These scenarios are further segmented into 1.1M examples, each representing a 9-second window: 1 second is used as input context, and the remaining 8 seconds serve as the prediction target. The dataset includes detailed map features such as traffic signal states and lane characteristics, along with agent states such as position, velocity, acceleration, and bounding boxes. WOMD is typically used for the agent motion prediction task, i.e., given the agent tracks for the past 1 second on a corresponding map, predict the positions of up to 8 agents for 8 seconds into the future. We adopt the exact same setting but only train and evaluate the future motion prediction of the ego vehicle, i.e. the autonomous driving vehicle, as the planning benchmark. In addition, we leverage the ‘high-level intent command’ field, used for evaluating motion prediction, as the planning router instruction. We adapt some of the best motion prediction models, MotionLM [Seff et al., 2023] and enhanced Wayformer [Nayakanti et al., 2023], as the baselines of this planning benchmark. During inference, sampling a final trajectory from multiple candidate trajectories plays a critical role in the final performance. Both MotionLM and Wayformer generate 192 candidate trajectories, which are subsequently aggregated into 6 clusters using k-means clustering, resulting in 6 representative trajectories. The final trajectory is selected from these 6 representative trajectories based on their associated probabilities. For fairness, we also sample multiple trajectories using a Top-K decoding strategy, up to K = 24. We then compute the pairwise L2 distance between all trajectories and select the one with the lowest average L2 distance as the final predicted trajectory, which can be viewed as the “median” trajectory among all the predictions. In the ADE calculation, we treat this median trajectory with the highest probability. All WOMD eval results are based on the validation set. As shown in Table 1, our model achieved similar performance as the MotionLM baseline when we train only on the WOMD dataset, with Gemini pre-trained weights. When pre-trained with our internal dataset (denoted as EMMA+), our model surpasses both MotionLM and Wayformer up to ADE 5s in the future. EMMA is only slightly worse than Wayformer at ADE 8s, where camerainduced depth error becomes a prominent error resource. We note the differences in inputs between MotionLM and EMMA: MotionLM takes inputs of agent location history, agent interactions, the road graph, and traffic light states. These agent boxes are produced by specialized off-board perception models that look at both past and future observations and are trained with a large amount of carefully curated human labels, the road graph is manually generated using full run segments, and all inputs heavily use LiDAR data with superior depth estimation. In stark contrast, EMMA only takes camera images and ego vehicle history as input, without the need of any labels or additional models (besides leveraging the Gemini pre-trained weights). Notably, for this work, we added internally available image inputs, which are not currently in the released WOMD. We further investigate the impact of the number of sampled trajectories on ADE, as illustrated in Figure 3. The results highlight that sampling from multiple trajectories, rather than a single one, leads to a notable improvement in ADE. However, the advantage of increasing the number of candidate trajectories diminishes beyond a certain point. 3.1.2 Driving on the nuScenes Dataset The nuScenes dataset [Caesar et al., 2020] offers a comprehensive autonomous vehicle sensor suite for evaluation. It consists of 1,000 scenes, each spanning 20 seconds, and includes information from 6 cameras, 5 radars, and 1 LiDAR that collectively provide 360-degree coverage in the field of view. The dataset is fully annotated with 3D bounding boxes across 23 classes and 8 attributes. In our experiments, we follow the standard protocol for planning evaluation: predict the next 3 seconds of future driving actions based on 2 seconds of historical data. We measure the planning quality with L2 errors at 1-, 2- and 3-second time horizons, aligning with established baseline methods. As shown in Table 2, our self-supervised EMMA achieves state-of-the-art results in planning on nuScenes, outperforming all previous supervised (with intermediate perception labels and/or human labels) and self-supervised (no extra labels) methods. Under the same self-supervised setup, EMMA outperforms previous BEV-Planner [Li et al., 2024] by 17.1% in average L2 metric; even compared to DriveVLM-Dual [Tian et al., 2024] that heavily uses intermediate perception and human labels, our self-supervised EMMA still improves the average L2 metric by 6.4%. 8 Method self-supervised? L2 (m) 1s L2 (m) 2s L2 (m) 3s Avg L2 (m) UniAD [Hu et al., 2023] ✗ 0.42 0.64 0.91 0.66 DriveVLM [Tian et al., 2024] ✗ 0.18 0.34 0.68 0.40 VAD [Jiang et al., 2023] ✗ 0.17 0.34 0.60 0.37 OmniDrive [Wang et al., 2024a] ✗ 0.14 0.29 0.55 0.33 DriveVLM-Dual [Tian et al., 2024] ✗ 0.15 0.29 0.48 0.31 Ego-MLP∗ [Zhai et al., 2023] ✓ 0.15 0.32 0.59 0.35 BEV-Planner [Li et al., 2024] ✓ 0.16 0.32 0.57 0.35 EMMA (random init) ✓ 0.15 0.33 0.63 0.37 EMMA ✓ 0.14 0.29 0.54 0.32 EMMA+ ✓ 0.13 0.27 0.48 0.29 Table 2: End-to-end motion planning experiments on nuScenes [Caesar et al., 2020]. EMMA (random init) denotes models are randomly initialized; EMMA denotes models are initialized from Gemini; EMMA+ denotes models are pre-trained on our internal data. EMMA achieves state-of-the-art performance on the nuScenes planning benchmark, outperforming the supervised (with perception and/or human labels) prior art by 6.4% and self-supervised (no extra labels) prior art by 17.1%. ∗Ego-MLP results are taken from a reproduced version in BEV-Planner. Scene description Critical object Meta decision Behavior description Relative improvements over baseline e2e planning ✓ ✗ ✗ ✗ + 0.0% ✗ ✓ ✗ ✗ + 1.5% ✗ ✗ ✓ ✗ + 3.0% ✗ ✓ ✓ ✗ + 5.7% ✗ ✓ ✓ ✓ + 6.7% Table 3: Ablation study on chain-of-thought reasoning components. It improves end-to-end planning quality by up to 6.7% by combining all elements. In particular, driving meta-decision and critical objects contribute the improvements of 3.0% and 1.5%, respectively. The details of each component is described in Section 2.2. Unlike in WOMD, we note that sampling multiple trajectories did not yield significant improvements. We hypothesize that this is due to nuScenes’ shorter prediction time horizon (3s) in simpler driving scenarios. Thus, we report only top-1 predictions for our results. 3.2 End-to-End Motion Planning with Chain-of-Thought Reasoning on Internal Dataset In this section, we present our studies of end-to-end planning with chain-of-thought on our internal dataset. This dataset contains millions of scenarios, orders of magnitude larger than any publicly available autonomous driving dataset. The model takes in 2 seconds of history to predict the driving actions for 5 seconds into the future. Table 3 presents the results of our experiments on chain-of-thought reasoning applied to end-to-end planning. By adopting the chain-of-thought formulation introduced in Equation 3, we achieve a notable 6.7% improvement over the standard end-to-end planning approach detailed in Equation 2. We also conduct an ablation study to analyze the contributions of different rationale components. Our findings reveal that both driving meta-decision and critical object identification significantly enhance performance, contributing improvements of 3.0% and 1.5%, respectively. When these components are combined, the gains are even more substantial. Conversely, while scene description has a neutral impact on driving performance, it enhances the model’s explainability. These results demonstrate that chain-of-thought reasoning can meaningfully improve driving performance, particularly when its components are carefully selected and integrated. We also perform a series of data scaling experiments for end-to-end planning, the results of which are illustrated in Figure 4. As we train the model on a larger training set, we observe lower eval perplexities before overfitting. Our findings indicate that the driving quality of EMMA has not yet plateaued, even with the current large-scale dataset. 9 Figure 4: EMMA data scaling experiments on our internal dataset. Each curve represents the eval perplexity for end-to-end motion planning as training proceeds with more steps. The x-axis is training compute, measured by floating-point operations (FLOPs) in log scale. The same EMMA model is trained on four sizes of datasets that are sampled with different percentages from 3% to 100% (denoted by different colors). In general, EMMA tends to achieve better quality until overfitting when given more training compute, but it also overfits quickly on smaller datasets. We observe the driving quality has not saturated when using the full large-scale dataset. 3.3 Perception Tasks We summarize our studies on three main perception tasks: 3D object detection, road graph estimation, and scene understanding. 3.3.1 3D Object Detection We validate our 3D object detection performance on the 3D camera-primary detection benchmark from the Waymo Open Dataset [Sun et al., 2020] using the d Longitudinal Error Tolerant (LET) matching [Hung et al., 2024]. We evaluate two versions: EMMA and EMMA+, similar to earlier sections, where EMMA+ is pre-trained on the 3D detection task using our internal dataset. The quantitative results are reported on the official test set and summarized in Figure 5. Our findings show that after pre-training, EMMA+ achieves competitive performance on the benchmark. Since our model output set of detected boxes without individual confidence scores, we directly compare the precision/recall instead of LET-3D-AP, which is calculated based on the precision/recall curve. We also compare the commonly used F1-score, where EMMA’s F1-score is computed using the single precision/recall and other models’ F1-scores are calculated by picking the maximal F1-score on the curve (often called F1-max). Figure 5 shows the performance comparison. In generally, EMMA+ demonstrates substantial improvements over state-of-the-art methods such as BEVFormer [Li et al., 2022b], achieving a 16.3% relative increase in vehicle precision at the same recall, and a 5.5% recall improvement at the same precision. EMMA+ also achieve better F1-score than prior arts. Performance on the pedestrian class is also comparable to that of MV-FCOS3D++ [Wang et al., 2021]. Additionally, we provide a performance breakdown across different ranges, highlighting that our model performs especially well in the near range. Our results underscore that with sufficient data and a large enough model, a multimodal approach can surpass specialized expert models in 3D object detection quality. 3.3.2 Road Graph Estimation Road graph estimation is a complex task that predicts a group of unordered polylines, each of which is represented as a sequence of waypoints. We measure the quality of road graph prediction with two metrics: (1) lane-level precision and recall, where we define a true positive match between a predicted lane polyline and a groundtruth lane polyline if and only if their Chamfer distance is within 1 meter; and (2) pixel-level precision and recall, where polylines are rasterized into a BEV grid with 1 meter resolution – we then treat the BEV grid as a image and compute precision and recall based on per-pixel matching. As discussed in Section 2.3, this task involves several design choices. One is about the representation of road graph polylines, where our choice is to define the start and end points of each lane, with 10 Figure 5: Camera-primary 3D object detection experiments on WOD [Sun et al., 2020] using the standard LET matching [Hung et al., 2024]. EMMA+ denotes model is pre-trained on our internal data. EMMA+ achieves competitive performance on the detection benchmark in both precision/recall and F1-score metrics. Compared to state-of-the-art methods, it achieves 16.3% relative improvements in vehicle precision at the same recall or 5.5% recalls improvement at the same precision. intermediate points added as needed to accurately capture the road’s curvature. Another critical design choice is the construction of target label sequences used for model training. Drawing inspiration from Pix2Seq [Chen et al., 2022a] in the context of object detection, one effective design choice is to pad the targets and apply random shuffling. This technique helps the model handle unordered outputs and prevents premature termination during training. Figure 6 presents our ablation studies on various design choices. Starting from our best designs, we systematically ablate each of the following configurations and assess the resulting quality degradation. We then summarize the key insights from our analysis. Polyline representation: dynamic sampling is better than fixed sampling. A simple polyline representation is to sample a fixed number of sparse control points per lane, e.g. two end points plus a fixed number of intermediate points to capture curvature. However, we find a better approach is to dynamically adjust the number of points per polyline according to the curvature and length of the lane. By keeping a consistent waypoint density rather than a consistent number of waypoints, we achieve a representation that more accurately captures the lane structure intricacies, yielding around a 40% to 90% difference in the metrics as shown in Figure 6. 11 Figure 6: Ablation study on road graph estimation. To evaluate the influence of different components in our road graph estimation model, we ablate each configuration and measure the corresponding impact on quality. Dynamic sampling (leftmost) of road graph polylines based on lane curvature and length proves to be the most significant factor, leading to a substantial 70% to 90% change in lane-level precision and recall. In contrast, aligning the model with a language-like representation, i.e., semantic punctuation (rightmost), has a more modest effect, contributing to only <10% change in precision and recall of any metric. Polyline representation: ego-origin aligned sample intervals are better than naively aligned sample intervals. The road graph is typically stored and accessed in global coordinate frame, meaning lane origins and extensions are independent of the ego vehicle position. To improve accuracy, it is essential to adjust lane point samples to start from the ego vehicle coordinate frame origin. Specifically, sampling polyline points relative to the AV position (ego-origin) avoids arbitrary offsets that can arise from directly transforming points sampled in the global coordinate frame into the ego coordinate frame. This prevents a 25% to 60% drop in prediction quality. Target sequence construction: shuffled ordering is better than arbitrary ordering. We organize polyline targets into bins based on their endpoint distance from the ego vehicle, providing a rough global ordering. For instance, we categorize lanes into nearby lanes and those further away that serve as connecting lanes. During training, we dynamically shuffle the polylines within each distance bin to enhance the model robustness and coverage. This dynamic shuffling within each bin improves the model’s ability to generalize across different lane configurations, leading to more accurate predictions. Target sequence construction: padding is better than non-padding. Similar to Pix2Seq [Chen et al., 2022a], we find that padding targets to prevent early termination is highly effective. In addition to padding the total number of polyline targets, we also pad the number of points within each polyline. We use “invalid” tokens to represent padded points within polylines. Each polyline is also explicitly tagged with a final “valid” or “invalid” token to denote whether it contains any nonpadded points. This approach ensures consistent input sizes, which helps maintain the integrity of the model during training and reduces the risk of premature truncation, leading to more reliable and accurate predictions. Target sequence construction: adding punctuation and other semantically redundant token improves quality. In the target sequence construction, we notice that it is beneficial to use language-like structures and punctuation to group targets (e.g., "(x,y and x,y);..." instead of "xy xy;..."). Additionally, explicitly including semantically redundant tokens – such as marking padded targets as “invalid” instead of relying on implicit omissions of “valid” markers – improves performance. This approach, incorporating punctuation and redundancy, results in a boost of up to 10% in lane-level metrics. We attribute this improvement to the language-related pre-training of Gemini. By leveraging similar structured expressions, Gemini can be more easily adapted to other tasks. 3.3.3 Scene Understanding Figure 7 summarizes our studies on the scene understanding task for temporary blockage detection. Our study is based on our internal datasets specifically curated for these scenarios. For this study, we establish our baselines by showing a picture to human and asking them to judge whether a lane is temporarily blocked. They can answer ‘yes’, ‘no’, or ‘unsure’. Our baseline will treat 12 Figure 7: Scene understanding experiments. direct fine-tuning denotes solely using the temporal blockage data during fine-tuning; naive mixture denotes co-training this scene task with road graph estimation; mix + short pretraining denotes pre-training on road graph esitmation first, and then fine-tune on the mixture of both tasks; mix + long pretraining denotes a longer pre-training before fine-tuning. The naive fine-tuning is already close to strong human baseline, but long-pretraining with training mixture can further boost the quality. Relative improvement over single task e2e planning 3D detection road graph estimation e2e planning detection road graph ✓ ✓ - +1.6% (±1.0%) +2.4% ( ±0.8%) ✓ ✓ +1.4% (±2.8%) +5.6% (±1.1%) - ✓ ✓ −1.4% (±2.9%) - +3.5% (±0.9%) ✓ ✓ ✓ +1.4% (±2.8%) +5.5% (±1.1%) +2.4% (±0.8%) Table 4: Generalist co-training experiments. (±∗) indicates standard deviation. By co-training on multiple tasks, EMMA gains a broader understanding of driving scenes, enabling it to handle various tasks at inference time, while enhancing individual task performance. Notably, certain task pairings yield greater benefits than others, suggesting these tasks are complementary. Co-training all three tasks together yields the best quality. all ‘unsure’ answers as incorrect, baseline+filtering will filter out all examples with ‘unsure’ answers. In contrast, our model is fine-tuned to predict ‘yes’ or ‘no’ for all examples. As shown in the figure, our naive model that is directly fine-tuned for only this task achieves better performance than the baseline comparison, but underperforms on the baseline+filtering comparison. To boost the model performance, our first attempt is to co-train this task with road graph estimation, but the naive mixture doesn’t improve performance. Our second attempt is to first pre-train the model on road graph estimation, and then fine-tune on these two tasks. Results show when the pre-training is long enough, the quality is boosted, showcasing the model’s ability to integrate multiple tasks for enhanced performance.▼生成AIの回答(ChatGPT 4o mini)
3.1 End-to-End Motion Planning on Public Datasets
- Waymo Open Motion Dataset (WOMD) [Chen et al., 2024a] (ウェイモオープンモーションデータセット): 103kの都市および郊外の運転シナリオ
- NuScenes dataset [Caesar et al., 2020] (ヌーシーンズデータセット): 1,000のシーン、20秒間の運転データ
- Motion prediction benchmark (モーション予測ベンチマーク): 自律運転車の未来位置予測
- Sampling multiple candidate trajectories (複数候補軌道のサンプリング): 最終性能に重要な役割を果たす
- Performance comparison with MotionLM and Wayformer (MotionLMおよびWayformerとの性能比較): EMMAがWOMDで類似のパフォーマンスを実現
3.1.1 Driving on the Waymo Open Motion Dataset (WOMD)
- Camera images and ego vehicle history as input (カメラ画像とエゴ車両履歴を入力): ラベルなしで運転計画を行う
- Difference in inputs between MotionLM and EMMA (MotionLMとEMMAの入力の違い): EMMAはカメラ画像とエゴ車両履歴のみを使用
- Effect of sampling multiple trajectories on ADE (複数軌道のサンプリングがADEに与える影響): 候補軌道数の増加がADEを改善
3.1.2 Driving on the nuScenes Dataset
- State-of-the-art results in planning (プランニングにおける最先端の結果): EMMAが他の自己教師ありおよび監視ありモデルを超越
- Self-supervised setup outperforming supervised methods (自己教師あり設定が監視あり手法を上回る): 6.4%の改善を達成
- L2 errors at different time horizons (異なる時間のL2誤差): 1秒、2秒、3秒の計画品質を評価
3.2 End-to-End Motion Planning with Chain-of-Thought Reasoning on Internal Dataset
- Chain-of-thought reasoning improves performance (チェーン・オブ・ソート推論が性能を改善): 6.7%の改善
- Ablation study on rationale components (推論コンポーネントのアブレーション研究): メタ意思決定と重要オブジェクトが性能向上に寄与
- Impact of data scaling (データスケーリングの影響): トレーニングセットのサイズ拡大でモデルの性能向上
3.3 Perception Tasks
- 3D Object Detection (3Dオブジェクト検出): Waymo Open Datasetを用いた検証、EMMA+はBEVFormerより優れた結果
- Road Graph Estimation (道路グラフ推定): ポリライン表現とサンプリング方法が精度に与える影響
- Scene Understanding (シーン理解): 一時的な遮断の検出タスクでのモデル改善
3.3.1 3D Object Detection
- Performance on camera-primary detection benchmark (カメラ優先検出ベンチマークでのパフォーマンス): EMMA+はBEVFormerより16.3%の精度向上
- Precision/Recall comparison (精度/再現率の比較): EMMAのF1スコアが他の先端手法を上回る
3.3.2 Road Graph Estimation
- Dynamic sampling of polyline points (ポリライン点の動的サンプリング): 曲線と長さに基づいてポリラインの精度が向上
- Ego-origin aligned sample intervals (エゴ原点に合わせたサンプリング): 精度向上に重要な要素
- Shuffling and padding target sequences (ターゲットシーケンスのシャッフルとパディング): モデルのロバスト性と精度の向上
3.3.3 Scene Understanding
- Temporary blockage detection task (一時的な遮断の検出タスク): ロードグラフ推定との共訓練による精度向上
- Co-training on multiple tasks (複数タスクの共訓練): 複数タスクを同時に訓練することで全体のパフォーマンス向上
回答の内容・形が結構変化する。
▼質問(人間)
①上記の「Shuffling and padding target sequences (ターゲットシーケンスのシャッフルとパディング): モデルのロバスト性と精度の向上」について詳細を教えて下さい。
②上記の「Co-training on multiple tasks (複数タスクの共訓練): 複数タスクを同時に訓練することで全体のパフォーマンス向上」について詳細を教えて下さい。
▼生成AIの回答(ChatGPT 4o mini)
① Shuffling and padding target sequences (ターゲットシーケンスのシャッフルとパディング: モデルのロバスト性と精度の向上)
ターゲットシーケンスのシャッフルとパディングは、モデルの学習過程においていくつかの重要な改善をもたらします。具体的には、次の2つの方法が強調されています:
-
ターゲットシーケンスのシャッフル
シャッフルは、ターゲットポリライン(道路グラフのポリライン)を距離や位置に関係なく無作為に並べ替えることです。この方法により、モデルはターゲットの出力順序に依存せず、異なる配置や構造に対しても柔軟に対応できるようになります。シャッフルを行うことで、異なる道路配置や道路の種類に対して汎用性が高まり、モデルのロバスト性が向上します。 -
ターゲットシーケンスのパディング
パディングは、ターゲットのシーケンスの長さを一定に保つために、指定された長さに合わせて追加の無効なトークン(無意味なデータ)を加えることです。この方法により、入力のサイズが常に一定となり、学習過程での早期終了を防ぐ効果があります。パディングを使用することで、ターゲットシーケンスの異なる長さを均等化し、モデルの学習における安定性が向上します。
この2つの方法(シャッフルとパディング)は、特に複雑な構造の道路グラフやターゲットシーケンスを扱う場合に重要であり、最終的には予測精度の向上に貢献します。
② Co-training on multiple tasks (複数タスクの共訓練: 複数タスクを同時に訓練することで全体のパフォーマンス向上)
複数タスクの共訓練は、異なるタスクを同時に訓練することで、各タスクにおける性能が向上する方法です。ここでは、以下の内容が強調されています:
-
共訓練の目的
EMMAモデルは、複数のタスク(例えば、道路グラフ推定、3D物体検出、シーン理解など)を同時に訓練することで、1つのタスクに対する深い理解を得るだけでなく、他のタスクにも役立つ知識を得ることができます。これにより、各タスクのパフォーマンスが向上し、最終的には全体的なパフォーマンスも向上します。 -
タスク間の補完性
複数タスクを共に訓練することで、タスク間で補完的な知識が得られます。例えば、道路グラフ推定とシーン理解を共訓練することで、シーンの理解がより豊かになり、道路グラフ推定の精度が向上することがあります。これは、あるタスクで得られた知識が他のタスクにも適用されるためです。 -
共訓練の効果
複数のタスクを共訓練することは、個々のタスクを単独で訓練するよりも大きな利点をもたらします。例えば、シーン理解のタスクを道路グラフ推定のタスクと一緒に訓練すると、シーン理解におけるパフォーマンスが向上します。また、タスク間で学習が重複し合うことで、モデルはより多角的にシーンを理解し、より高い精度で予測を行えるようになります。
このように、複数タスクを共訓練することによって、各タスクに特化したモデルよりも高い総合的なパフォーマンスが得られることが示されています。
その他: 関連研究に関する部分を把握する
4 Related Works
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
4 Related Works End-to-end autonomous driving research enjoys a rich history and has evolved significantly since ALVINN [Pomerleau, 1988] employed shallow neural networks to predict control signals. The field benefited from further deep learning advancements: e.g. DAVE-2 [Bojarski et al., 2016] and ChauffeurNet [Bansal et al., 2019], leveraged deeper neural architectures and incorporated sophisticated perception and motion planning modules respectively. Recent research has expanded to include multimodal inputs [Codevilla et al., 2018, Prakash et al., 2021], multi-task learning [Chitta et al., 2022, Wu et al., 2022], reinforcement learning [Chekroun et al., 2023, Chen et al., 2021, Kendall et al., 2019, Liang et al., 2018, Toromanoff et al., 2020], and distillation [Chen et al., 2020, Zhang and Ohn-Bar, 2021, Zhang et al., 2021]. Unified planning frameworks such as VAD [Jiang et al., 2023] and UniAD [Hu et al., 2023] integrated planning with conventional modules in open-loop environments. More studies have been proposed to examine the robustness, safety, and transferability from synthetic environments to real-world domains. However, recent findings from Ego-MLP [Zhai et al., 2023] and BEV-Planner [Li et al., 2024] revealed that these methods could potentially overfit to ego status despite their good performance on benchmark tasks. Our work revisits the simplicity of earlier end-to-end models such as ALVINN and DAVE-2, enhancing them with powerful MLLMs. Vision language models for autonomous driving have gained increasing interest, focusing on achieving explainable driving behavior and generalizability through end-to-end learning frameworks. DriveGPT4 [Xu et al., 2024] utilizes LLMs to both explain vehicle actions and predict control signals in an iterative Q&A format. Drive Anywhere [Wang et al., 2024b] introduces patch-aligned feature extraction from MLLMs for text-based driving decision queries, while OmniDrive [Wang et al., 2024a] features a 3D vision-language model design for reasoning and planning. Other approaches use MLLMs in graph-based VQA contexts (DriveLM [Sima et al., 2024]), or apply chain-of-thought reasoning (DriveVLM [Tian et al., 2024]) to tackle multiple driving-related tasks. For non-end-to-end models, LLM-Drive [Chen et al., 2024b] leverages LLMs with object-level vector inputs for control planning. In contrast, our work studies the end-to-end fine-tuning of a state-of-the art MLLM for driving tasks, employing a generalist approach that emphasizes open-world driving capabilities. Multimodal large language models (MLLM) extend LLMs [Vaswani et al., 2017, Devlin, 2019, Raffel et al., 2020, Gemini Team Google, 2023, Reid et al., 2024, Chowdhery et al., 2023, Anil et al., 14 (a) A garbage bag appears on the freeway, so our predicted trajectory suggests to nudge slightly to the right to avoid it. (b) A ladder appears on the freeway, and our predicted trajectory suggests to switch to the left lane to bypass it appropriately. (c) we encounter a small squirrel on the road and our predicted trajectory instinctively slows down to avoid the animal. Note EMMA wasn’t explicitly trained to detect squirrels. (d) A white dog appears in our lane, and our model predicts to slow down and yield. Our model also accurately detects surrounding vehicles, including those in adjacent lanes and the parking lot. Figure 8: EMMA prediction visualization. Each row contains a scenario with our model’s predictions: end-to-end planning (left), 3D object detection (middle), and road graph estimation (right). 15 (e) As a construction zone blocks the left lanes, our predicted trajectory suggests passing through on the right, while the road graph estimation correctly identifies the blocked area. (f) Our lane is blocked by construction cones, so our predicted trajectory suggests to move into the left lane, even though it’s in the opposite direction. EMMA captured the blockage and performed a safe detour. (g) A traffic controller signals to proceed through the intersection, and our predicted trajectory aligns with the instruction. (h) Our predicted trajectory suggests to stop as we approach an intersection with a yellow light, demonstrating cautious and safe behavior. Figure 9: EMMA prediction visualization. Each row contains a scenario with our model’s predictions: end-to-end planning trajectory (left), 3D object detection (middle), and road graph estimation (right). 16 (i) While crossing an intersection, our predicted trajectory nudges slightly to the left due to nearby cars and a bicyclist partially occupying our lane. (j) Our model predicts a driving trajectory to patiently wait at a red light (left). The model also accurately predicts surrounding 3D objects (middle) and road graph with lane centers (right). (k) A fleet of fast-moving motorcyclists passes by. The predicted trajectory suggests pausing to allow them to pass safely. Notably, motorcyclists are accurately identified by our model (middle). (l) A motorbike is moving on a narrow lane at night, and yields to the right. Our predicted trajectory adjusts, guiding us to pass safely by nudging slightly to the left. Figure 10: EMMA prediction visualization. Each row contains a scenario with our model’s predictions: end-to-end planning trajectory (left), 3D object detection (middle), and road graph estimation (right). 17 2023, Radford et al., 2018, 2019, Brown et al., 2020, Achiam et al., 2023, Touvron et al., 2023a,b, Dubey et al., 2024] to multiple modalities, leveraging their generalizability, reasoning capabilities, and contextual understanding. Early explorations [Donahue et al., 2015, Vinyals et al., 2015, Chen et al., 2022a] focused on specific vision-language problems or open-set object detection Liu et al. [2024b], Zareian et al. [2021], Gu et al. [2022], while recent research has scaled up both trask diversity and model sizes for improved generalizability and few-shot capabilities [Cho et al., 2021, Chen et al., 2022b, Wang et al., 2022, Lu et al., 2022, Alayrac et al., 2022, Yu et al., 2022, Chen et al., 2023, 2024c, Wang et al., 2024c, Peng et al., 2024, Huang et al., 2023, Lu et al., 2024]. Notable examples include Flamingo [Alayrac et al., 2022], a 70B model [Hoffmann et al., 2022] which achieved state-of-the-art quality for multiple few-shot vision benchmarks, and CoCa [Yu et al., 2022] a 2.1B parameter model which demonstrated state-of-the-art performance on zero-shot transfer and various downstream tasks including ImageNet classification. PaLI [Chen et al., 2023, 2024c], at 55B parameters, achieves better performance across multiple vision and language tasks by scaling both the vision and language model components jointly. These early works demonstrate the strong performance and generalizability of MLLMs. Recent trends have seen the integration of native multi-modal inputs in LLMs, such as Gemini [Gemini Team Google, 2023, Reid et al., 2024], GPT-4o, and Llama3-v [Dubey et al., 2024, Liu et al., 2024a]. Our work explores the application of these promising new models for generalist end-to-end autonomous driving.▼生成AIの回答(ChatGPT 4o mini)
関連研究に関しては、以下のように分類して整理することができます:
-
初期の自動運転システム
- ALVINN (Pomerleau, 1988):自動運転の初期の試みとして、浅いニューラルネットワークを用いて制御信号を予測したモデルであり、これが自動運転技術の礎となった。
- DAVE-2 (Bojarski et al., 2016):深層ニューラルネットワークを用いて運転行動を学習する方法で、ALVINNの限界を越えてより複雑な環境に対応可能なシステムとして発展した。
-
深層学習と進化する技術
- ChauffeurNet (Bansal et al., 2019):運転行動の予測により複雑なモジュールを取り入れたモデルで、特にモーションプランニングと知覚処理が統合されている。
- DAVE-2 や ChauffeurNet 以降の進展:深層学習の技術がさらに進化し、マルチモーダル入力やマルチタスク学習、強化学習、知識蒸留などが採用された。これらのアプローチは、より高度な状況に対応するための研究が進んだ。
-
マルチモーダル入力と強化学習の導入
- マルチモーダル入力:視覚、音声、テキストなど複数の異なる入力モダリティを同時に扱うアプローチが登場し、これにより自動運転の精度や汎用性が向上した(例:Codevilla et al., 2018, Prakash et al., 2021)。
- 強化学習:自動運転車両の制御に強化学習を適用する研究も増加した。Chekroun et al., 2023やLiang et al., 2018のような研究では、強化学習を活用し、シミュレーション環境での自律的学習を試みている。
- 知識蒸留:大規模な教師なし学習を行ったモデルから小型のモデルへ知識を移す手法が提案され、学習効率の向上が図られた。
-
統合的計画フレームワーク
- VAD (Jiang et al., 2023) と UniAD (Hu et al., 2023):これらは従来のプランニングと制御モジュールを統合し、開放環境における自律的運転を目指す統合的なアプローチを採用している。
-
実世界への適用とロバスト性の向上
- 実際の走行環境におけるロバスト性や安全性、シミュレーション環境から現実世界への移行に関する研究も進んでおり、Ego-MLP (Zhai et al., 2023) や BEV-Planner (Li et al., 2024) といった研究では、シミュレーションと実際の車両が異なる動作を示す可能性があることを指摘し、その解決策を模索している。
-
新たなアプローチとしての大規模言語モデル(LLM)の利用
- DriveGPT4 (Xu et al., 2024):大規模言語モデルを活用し、車両の行動を説明したり、運転制御信号を予測したりする手法を提案している。このアプローチは、運転行動の説明可能性を重視している。
- Drive Anywhere (Wang et al., 2024b) と OmniDrive (Wang et al., 2024a):これらの研究では、MLLM(マルチモーダル大規模言語モデル)を利用して、テキストベースでの運転意思決定や3Dビジョンと言語を融合した推論を行う新しいフレームワークを紹介している。
- DriveLM (Sima et al., 2024) や DriveVLM (Tian et al., 2024):これらは、グラフベースの視覚質問応答(VQA)やチェーン・オブ・ソート推論を活用し、複数の運転関連タスクを解決しようとするアプローチである。
-
最近の発展としてのEMMAの登場
- 本研究では、EMMAという新しいモデルを紹介しており、このモデルは従来の単純なエンドツーエンド方式に強力なMLLMを組み合わせて、運転行動を予測する方法を改良している。EMMAは、特定の物体検出や予測された走行軌道に基づいて、特に複雑な環境でも安全な運転が可能なシステムを目指している。例えば、障害物や動物、信号を認識し、それに応じた運転行動を決定する。
総括
自動運転技術は、初期の単純なモデルから始まり、深層学習の進化やマルチモーダル入力の導入、強化学習などによって大きく進展してきた。最近では、LLMを活用したアプローチが注目され、運転行動の説明や制御予測、さらに安全性の向上が期待されている。EMMAのような新しいモデルは、これらの技術を組み合わせることで、より柔軟で汎用性の高い自動運転を実現しようとする試みである。
▼質問(人間)
続けて下記の質問に追加で回答下さい:
①関連研究を再度うまくまとめて下さい。(多角的な視点で種類毎にまとめる事、重要なものを選別)
▼生成AIの回答(ChatGPT 4o mini)
関連研究を再度多角的に視点を分けて整理し、重要な研究を選別してまとめます。以下は主に技術的なアプローチと進展を中心に分類した内容です。
-
エンドツーエンドアプローチ
初期の自律運転におけるエンドツーエンドアプローチは、深層学習を活用して車両の制御信号を予測する方法です。
-
ALVINN(1988年): 自律運転分野のパイオニアとして、シンプルな浅層ニューラルネットワークを使い、車両制御信号を予測した。これは自律運転の初期段階であり、モデルのシンプルさと効果的な制御の可能性を示した。
-
DAVE-2(2016年): より深いニューラルネットワークアーキテクチャを使用し、カメラ映像から運転行動を学習するモデル。ディープラーニングの進展により、より高精度な制御が可能になった。
-
ChauffeurNet(2019年): より高度な運転行動予測を実現し、車両の動きとその周囲の状況を包括的に学習することで、実世界のシナリオに対応した自律運転技術を向上させた。
これらは、エンドツーエンドのアプローチにおける重要な基礎となり、その後の技術発展に大きな影響を与えています。
-
強化学習 (Reinforcement Learning)
強化学習を用いたアプローチは、環境との相互作用を通じて最適な運転行動を学習する方法です。
-
Chekroun et al.(2023年)やLiang et al.(2018年): 強化学習を自律運転に応用し、実世界の運転環境での最適な制御戦略を学習しました。強化学習は、試行錯誤を通じて最適な行動を導き出すため、動的な状況に対応する能力が向上します。
-
Kendall et al.(2019年): 深層強化学習を用いた運転行動の最適化を行い、特に長期的な運転戦略において有効性を発揮しました。
強化学習は、自己学習を通じて未知の運転環境でも適応できる強力な手法として広く注目されています。
-
マルチモーダル入力
マルチモーダル入力を活用することで、視覚情報だけでなく、異なるセンサーデータを統合して、より高度な運転行動の予測を行うアプローチです。
-
Codevilla et al.(2018年)やPrakash et al.(2021年): カメラやLiDARなど複数のセンサーからのデータを統合し、より精度の高い運転行動予測を行う手法を提案。これにより、車両の認識精度と行動の選択肢が広がり、より複雑なシナリオに対応できるようになった。
-
Jiang et al.(2023年)のVADやHu et al.(2023年)のUniAD: 複数のセンサー情報を活用した統合的な運転プランニングを行い、特にオープンループ環境での運転を有効にする技術を提供しました。これにより、複雑な環境における柔軟な運転が可能になります。
マルチモーダル入力の活用は、運転における安全性や信頼性を大きく向上させるため、今後の自律運転技術には欠かせない要素となっています。
-
視覚と言語の統合 (Vision-Language Models)
視覚と言語の統合により、運転における理解力を高め、より説明可能な運転行動の予測を実現する研究が進んでいます。
-
DriveGPT4(2024年): 大規模言語モデル(LLM)を活用して、車両の行動を説明し、制御信号を予測するという革新的なアプローチを取っています。これにより、運転行動の「説明可能性」を高め、運転の透明性が向上しました。
-
Drive Anywhere(2024年): 画像とテキストのマッチングを行い、テキストベースの運転決定に基づく意思決定を可能にしています。この手法は、運転中の意思決定過程を言語で説明できる点で特徴的です。
-
OmniDrive(2024年): 3D視覚と言語モデルを用いて、より高次元の推論と計画を行う方法を提案。視覚的な情報を言語モデルで処理することにより、より複雑な運転タスクに対応できるようになります。
これらの技術は、自律運転の説明可能性を提供し、人間と自律車両のインタラクションを改善する上で重要な進展を示しています。
-
モデル蒸留 (Knowledge Distillation)
モデル蒸留は、複雑なモデルから学習した知識をより簡潔なモデルに移行させ、効率的な推論を可能にする技術です。
- Chen et al.(2020年)やZhang et al.(2021年): 知識蒸留を用いて、大規模で計算資源を消費するモデルの性能を、より小型のモデルに移すことで、推論速度を向上させる方法を提案。これにより、リアルタイムで動作する自律運転システムにおいて、計算リソースを効率的に利用できるようになります。
-
一般化能力と少数ショット学習
大規模なモデルが、特に少数ショット学習や未知の環境においても良好なパフォーマンスを発揮する能力が高まっています。
-
Flamingo(2022年)やCoCa(2022年): 少数ショット学習やゼロショット転送を達成し、ビジョンと言語の統合を強化しました。これにより、特定の環境における学習が少ない状況でも、自律運転システムの柔軟性が高まりました。
-
PaLI(2023年): 視覚と言語の両方のコンポーネントを共同でスケーリングすることで、異なるビジョンタスクにおけるパフォーマンスを向上させています。このアプローチは、運転環境が変化する場合でも柔軟に対応できる強力なシステムを実現します。
-
現実世界への適用と安全性
多くの研究が、合成環境から現実世界への移行における問題に注目しており、安全性やロバストネスの向上が求められています。
- Ego-MLP(2023年)やBEV-Planner(2024年): これらの研究では、合成環境での良好な性能にもかかわらず、現実世界の複雑な状況において過剰適合が問題となる可能性が指摘されています。現実世界での適用に向けて、モデルのロバストネスと一般化能力を向上させる研究が進められています。
結論
自律運転技術の関連研究は、初期のシンプルなニューラルネットワークに基づくアプローチから、複雑なマルチモーダル入力や強化学習、さらに視覚と言語の統合を通じた高度な推論に至るまで、多岐にわたります。特に、エンドツーエンドの学習における進展や、説明可能性、強化学習、マルチモーダル技術などが今後の研究の重要な焦点となっており、安全性や現実世界での適用に向けた研究も活発に行われています。
その他: 制限・軽減策に関する部分を把握する
5 Limitations, Risks, and Mitigations
▼質問(人間)
(ここに論文・記事を貼り付け、内容全体はここをクリック)
5 Limitations, Risks, and Mitigations In the previous sections, we demonstrate state-of-the-art end-to-end motion planning on the nuScenes planning benchmark. We also achieve competitive performance for end-to-end motion planning on the WOMD planning benchmark and camera-primary 3D detection on WOD. Furthermore, our generalist setup improved the quality across multiple tasks through joint training. Despite these promising results, we acknowledge the limitations of our work and propose directions for building on this foundation and addressing such challenges in future research. Memory and video capability: Currently, our model processes only a limited number of frames (up to 4), restricting its ability to capture the long-term dependencies essential for driving tasks. Effective driving requires not just real-time decision-making but also reasoning over extended time horizons, relying on long-term memory to anticipate and respond to evolving scenarios. Enhancing the model’s ability to perform long-term reasoning is a promising area for future research. This could potentially be achieved by integrating memory modules or extending its capability to process longer video sequences efficiently, enabling more comprehensive temporal understanding. Extension to LiDAR and radar input: Our approach heavily relies on pre-trained MLLMs, which typically do not incorporate LiDAR or radar inputs. Expanding our model to integrate these 3D sensing modalities presents two key challenges: 1) There is a significant imbalance between the volume of available camera and 3D sensing data, resulting in less generalizable 3D sensing encoders as compared to their camera-based counterparts. 2) The development of 3D sensing encoders has not yet reached the scale and sophistication of camera-based encoders. A potential solution to address these challenges is to pre-train a large-scale 3D sensing encoder using data carefully aligned with camera inputs. This approach may foster better cross-modality synergy and substantially improve the generalization capabilities of the 3D sensing encoder. Verification of the predicted driving signals: Our model can directly predict driving signals without relying on intermediate outputs, such as object detection or road graph estimation. This approach introduces challenges for both real-time and post-hoc verification. We have demonstrated that our generalist model can jointly predict additional human readable outputs such as objects and road graph elements, and the driving decision can be further explained with chain-of-thought driving rationale. However, there is no guarantee that these outputs will be always consistent despite the empirical observations that they are often indeed consistent. Furthermore, additional outputs incur significant runtime latency overhead for deployment. Sensor simulation for closed-loop evaluation: It is widely recognized that open-loop evaluation may not be strongly correlated with closed-loop performance. To accurately assess an end-to-end autonomous driving system in a closed-loop environment, a comprehensive sensor simulation solution is necessary. However, the computational cost of sensor simulation is often several times higher than 18 that of behavior simulators. This significant cost burden can hinder thorough testing and verification of an end-to-end models unless substantial optimizations are made. Challenges of onboard deployment: Autonomous driving demands real-time decision-making, which poses a significant challenge when deploying large models due to their increased inference latency. This creates a need for optimizing the model or distilling it into a more compact form suitable for deployment, all while maintaining performance and safety standards. Achieving this delicate balance between model size, efficiency, and quality is crucial for the successful real-world deployment of autonomous driving systems, and represents a key area for future research.▼生成AIの回答(ChatGPT 4o mini)
①何についての記載か?
本記載は、自己運転技術に関する研究の限界、リスク、そしてそれに対する対策を述べたものです。特に、エンドツーエンドのモーションプランニングシステムの性能に関する評価と、その改善点や今後の研究に向けた課題について説明しています。この研究は、nuScenes、WOMD、およびWODの計画ベンチマークにおける競争力のある成果を報告し、その結果を基に限界点を指摘しています。
要旨
-
メモリと動画処理能力:
- 現在のモデルは最大4フレームまでしか処理できず、長期的な依存関係を捉える能力が制限されている。
- 自動運転にはリアルタイムの意思決定に加えて、長期間にわたるシナリオの予測と対応が求められるため、モデルが長期的な推論を行えるようにする必要がある。
- 長期的な推論能力を高めるために、メモリモジュールを統合したり、より長いビデオシーケンスを効率的に処理できるようにすることが今後の研究課題である。
-
LiDARおよびレーダー入力への拡張:
- 現在のアプローチは、カメラデータを基にした事前学習されたMLLM(大規模言語モデル)に依存しており、LiDARやレーダー入力を活用していない。
- LiDARやレーダーといった3Dセンサー入力の統合には、以下の2つの大きな課題がある:
- カメラと3Dセンサーのデータの量に偏りがあり、3Dセンサー用のエンコーダがカメラエンコーダほど一般化できていない。
- 3Dセンサーエンコーダの開発はカメラエンコーダほど進んでいない。
- 解決策として、カメラ入力と慎重に整合させた大規模な3Dセンサーエンコーダの事前学習が提案されており、これによりモダリティ間の相乗効果を高め、3Dセンサーエンコーダの汎化能力を向上させることができる。
-
予測された運転信号の検証:
- 本モデルは、オブジェクト検出や道路グラフ推定などの中間出力に依存せずに、運転信号を直接予測する。
- これにより、リアルタイムおよび事後の検証に課題が生じる。
- 一般的に予測された運転信号は一致していることが多いが、常に一貫性が保たれるわけではない。
- 追加の出力は実行時間の遅延を引き起こす可能性があり、展開時に重要な問題となる。
-
センサーシミュレーションによる閉ループ評価:
- オープンループ評価と閉ループ評価の相関は弱いと認識されており、実際の自動運転システムを正確に評価するには、包括的なセンサーシミュレーションソリューションが必要となる。
- しかし、センサーシミュレーションの計算コストは行動シミュレータの数倍高いため、エンドツーエンドのモデルの徹底的なテストと検証には多大なコスト負担がかかる。
-
車載システムへの展開の課題:
- 自動運転にはリアルタイムでの意思決定が必要であり、大規模なモデルを展開する際には推論の遅延が増大するという問題が発生する。
- そのため、モデルの最適化や、展開に適したコンパクトな形式への蒸留が求められ、性能と安全基準を維持しながら、このバランスを取ることが成功の鍵となる。
生成AIの回答は、よくまとまっており、良好。
あとがき
所感
生成AIの回答は、要旨をまとめる・元文の抜き出し等のタスクでは、かなり良好な場合が多いのですが、一方で、その分野についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないことが多くあります。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。(良い意味で論文の要点をそのまま抜き出した形)
分かったこと
分かったこと・注目した点を抜き出して列挙:
Introductionを把握する
運転に関連するタスク(例えば、経路計画、物体認識、道路グラフ推定など)を自然言語形式で表現し、モデルにそのタスクを一貫して処理させるアプローチが特徴的です。
「ジェネラリストモデル」とは、特定のタスクに特化したモデルではなく、複数の異なるタスクを同時に処理できる汎用的なモデルを指す。
EMMAは運転指令や車両の状態、周囲の視覚データを入力として、各タスクに適切な出力を同時に生成することができ、これによりタスク固有のモデルを個別に訓練するよりも、パフォーマンスの向上が期待される。これが「ジェネラリストモデル」の特徴であり、複数のタスクを一つの大規模言語モデルで処理できる点が革新的である。
「共同訓練」とは、複数のタスクを同時に学習させる訓練方法を指す。EMMAにおける共同訓練は、モーションプランニング、物体検出、道路グラフ推定など、複数の運転関連タスクを一つのモデルに対して同時に訓練することを意味する。
単独のタスクに特化したモデルに比べて全体的なパフォーマンスが向上する。共同訓練により、各タスクがモデル内で共通の基盤を持つため、システム全体の効果的な最適化が可能となる。
「チェーン・オブ・ソート推論」とは、モデルが推論を行う際に、逐次的に論理的なステップを踏んで考えるプロセスを指す。EMMAでは、運転タスクに対する推論の過程を、段階的に人間が理解できる形で説明しながら進める。
学習に関する部分を把握する
EMMAの入力は、主に3つの要素から構成されています:
カメラ画像(V):環境の包括的な情報を提供する周囲のカメラビデオ。これが主要なセンサー入力であり、周囲の状況を視覚的に認識します。
高レベルの指令(Tintent):ルーターから送られる指令で、例えば「まっすぐ進む」「左折」「右折」などの運転指示が含まれます。
車両の履歴状態(Tego):車両の過去の位置情報を示すもので、例えば過去のウェイポイント(BEV空間での座標)です。また、必要に応じて速度や加速度など、車両の動的状態を含めることができます。
EMMAの出力は、基本的には自然言語で生成されるテキスト形式です:
未来の軌道(Otrajectory):自律車両の未来の進行方向を示すウェイポイントのセットです。これらのウェイポイントもテキスト形式で表現されます。例えば、「(xt, yt)」のような形式で、時間軸に沿った位置情報が示されます。
運転の合理的な根拠(Orationale):モデルが決定した運転に対する理由付け。これは、シーンの説明、重要な物体の識別、物体の挙動の理解、最終的な運転の判断に分かれています。これもテキスト形式で出力され、例えば「歩行者が[9.01, 3.22]に立っている」などの情報を含みます。
計算量に関する直接的な記載はありませんが、EMMAは「計算要求が高い」と述べられています。
パラメータ数に関して具体的な記載はありませんが、EMMAはGeminiモデルを基にしているため、数十億から数百億のパラメータを持つ可能性があります。
モデルは、高レベルな運転指令と過去のエゴ車両の状態を入力として、未来の軌跡を生成する。これには専用の教師なし学習(Self-supervised)アプローチが使用され、教師データとしてはエゴ車両の未来の位置情報が利用される。加えて、特別なラベルは必要なく、学習データのみでトレーニングが可能である。
複数タスク同時学習 (Multi-task learning):
EMMAは複数の自動運転タスクを同時に学習できる汎用モデルとして設計されている。これには、運転経路計画、物体検出、道路グラフ推定、シーン理解といったタスクが含まれており、これらのタスクは単一のモデルで学習される。具体的には、各タスクごとに異なるプロンプトが入力され、モデルはそのタスクに対応した出力を生成する。この汎用的な学習方法により、EMMAは複数の異なるタスク間で知識を再利用できるため、各タスク単独で学習されたモデルに比べて効率的で強力なパフォーマンスを発揮する。
学習プロセスは、異なるタスクのデータセットを組み合わせて行われ、各タスクの学習データがサンプリングされる。タスクごとのデータセットサイズに応じてサンプリングされる確率が調整され、複数のタスクが同時に学習される。これにより、モデルはタスク間の相関を学び、より汎用的で適応力のある学習が進む。
出力は、複数の候補軌道(トラジェクトリ)を生成し、その中から最適なものを選ぶ形式となっています。
EMMAは、Gemini 1.0 Nano-1という事前学習された重みを持つモデルが使用されており、この重みを使ってEMMAがトレーニングされます。
EMMAモデルは、教師なし(自己教師あり)で学習を行い、ラベル付きのデータや外部の人間によるラベルを使用せずに学習することが特徴である。
メタ判断や重要物体の識別は、パフォーマンス向上において大きな貢献をしている。
トレーニングデータのスケールが大きくなることで、モデルの予測精度(特に評価困難度が低くなる)は改善される。
結果に関する部分を把握する
EMMA+(内部データで事前学習):
EMMA+は、EMMAモデルの強化版で、内部データセットで事前学習(プリトレーニング)を行ったバージョンです。この事前学習の主な利点は、EMMAがより広範な運転シナリオに適応できるようになることです。具体的には、内部データセットには、公開されているデータセットに比べて数桁以上多くのシナリオが含まれており、EMMAはそれに基づいて事前にトレーニングされることで、より強力な予測能力を持つことになります。その結果、公開データセットでの評価(例えば、nuScenesデータセットにおける性能)でEMMA+は従来の手法を上回り、特に予測の精度が向上しています。
私たちの発見によれば、運転のメタ意思決定と重要なオブジェクトの識別は、性能を大きく向上させ、各々3.0%および1.5%の改善に寄与しています。
図・表を把握する
図7、臨時閉塞検出タスクのためのシーン理解に関する実験結果
プレトレーニングとファインチューニング: 直接的なファインチューニング(特定のタスクに対する学習)での性能は、人間の基準に非常に近いが、長期間のプレトレーニングを経た場合に性能がさらに向上することが示されています。
混合学習: 複数のタスクを同時に学習することで、モデルが異なるシーンに対する理解を高め、より良い予測を行うことができます。
表4、EMMAモデルが複数のタスク(エンドツーエンドプランニング、3D物体検出、道路グラフ推定)を共同学習した結果
タスク間のシナジー: 複数のタスクを共同学習することで、モデルは異なるタスクに対する理解を深め、個々のタスクのパフォーマンスも向上します。
タスクの組み合わせ: 共同学習を通じて、タスク間の相互作用によってパフォーマンスが最適化され、3つのタスクを同時に学習することで最大の品質向上が達成されることが示されています。
キーワードから探す
No need for HD maps: "No HD map is needed beyond the high-level routing information from a navigation system such as Google Maps." (Google Mapsによる高レベルなルーティング情報のみでHDマップ不要)
Hierarchical decision structure: "We structure the driving rationale hierarchically, progressing from 4 types of coarse-to-fine-grained information." (粗→細粒度の4段階で運転意思決定を構造化)
Automated rationale generation: "The driving rationale captions are generated using an automated tool without any additional human labels." (自動生成された運転意思決定キャプション、追加の人手不要)
Generalist model performance: "The generalist model, trained across multiple tasks, consistently outperforms each specialist model." (複数タスクで訓練された汎用モデルの優れた性能)
一般化性能の向上 (Improved Generalization)
汎用モデルは、個別のタスクに特化したモデルに比べてより広範囲なシナリオに対応できる能力を持っています。単一タスクのモデルが特定の状況やデータに過剰に適応してしまうことに対して、汎用モデルは異なるデータセットや新たな環境でも効果的に動作し、一般化の能力が向上します。これにより、未知の状況や予期しないシナリオにも適応する能力が高まります。
専門モデルとの比較 (Comparison with Specialist Models)
実験結果では、汎用モデル(EMMA)は各専門モデル(単一タスク専用のモデル)と比較して、タスクごとに均等に良い性能を発揮しており、総合的に優れた性能を示しています。例えば、運転計画、物体検出、道路グラフ推定など、個別の専門モデルに劣らず、高い精度でタスクを処理できることが示されています。これにより、汎用モデルの方がタスク間での知識の相互作用を最大限に活かし、専門化したモデルに比べて強力なパフォーマンスを発揮できることが確認されています。
複数の軌道をサンプリングすることでADEが向上。
連鎖的思考推論を採用することで、標準的なプランニング手法に比べて6.7%の性能向上を達成。
運転のメタ決定や重要物体の識別がそれぞれ3.0%、1.5%の改善に貢献。
連鎖的思考のコンポーネントがプランニングの質を大きく改善し、シーンの説明はモデルの説明能力を向上させる。