生成AIを用いて自動運転の論文「Hidden Biases of End-to-End Driving Models (2023)」を読んでみた (続き)

Last updated at 2025-01-19Posted at 2025-01-19

はじめに

前回、生成AIを用いて自動運転の論文「Hidden Biases of End-to-End Driving Models (2023)」の内容(本文)を(なるべく)把握してみました。
同論文の末尾にあるAppendixについても、前回と同様の方法で、把握してみます。

以降で、ChatGPTに聞いてみた例を記載します。

前回の記事: (本文の内容)

対象の論文

論文: (自動運転に関する論文)

[2306.07957] Hidden Biases of End-to-End Driving Models
https://arxiv.org/abs/2306.07957
(PDF: https://arxiv.org/pdf/2306.07957)

質問時の各章節の区切り部分

論文の中にある各章節を、下記のように区切って、部分毎に生成AIに内容を質問していきます。

Appendix
A. Changes to TransFuser
A.1. Expert
---
A.2. Dataset
---
A.3. Training and Architecture
---
A.4. Localization
---
A.5. PID Controller
---
B. TransFuser++ Implementation Details
B.1. Attention Pooling Implementation
---
B.2. Data Augmentation
---
C. Additional Results
C.1. Longest6 Ablations
---
C.2. Brake Threshold
---
C.3. Additional Baselines
---
C.4. Additional Examples
---
C.5. Additional Experiments
---
C.6. CARLA Leaderboard

生成AIへの質問方法

生成AIを活用して、知りたい記事・論文の1節分(適度な長さ)のテキストをコピー＆ペーストして、その下に質問内容を「①～ ②～ …」と番号付きで書いて、生成AIに渡せば、全質問に一発で的確に回答してくれるので、非常に良好でした。記事全体を読む必要なく、知りたい点の情報だけを収集できます。

生成AIへの質問例:

(論文・記事の各章節を貼り付け)

上記の内容に関して下記の質問に回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、回答量は長くなってもOK)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載)

※各章節に応じて、適宜下記の質問を追加。

続けて下記の質問に追加で回答下さい:
②具体的な処理方法の記載があれば説明下さい。(簡略化せずに全て記載、既存手法があれば引用元を記載)
③改良点・工夫点・テクニック等の記載があれば説明下さい。
④メカニズムの解明・なぜそうなるのか等の記載があれば説明下さい。(記載がなければ回答不要)
⑤性能が向上した記載があれば説明下さい。(具体値があれば記載、対応する図/表番号があれば各文末に記載)
⑥比較の記載があれば違いを表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば明記)
⑦上記⑥以外で表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)
⑧具体的な数値の記載を全て列挙して、表にまとめて下さい。(必ず正しく数値を抜き取る事、|数値|説明|の表へ)
⑨具体的な変数名(数式用の記号)の記載を全て列挙して、表にまとめて下さい。(|変数名|説明|次元・型|の表へ)
⑩図/表があれば、各図/表は何を主張するためのものかを説明下さい。(掲載理由・注目ポイント等)
⑪関連研究の参照番号を全て列挙して、表にまとめて下さい。(元文にある内容のみ記載・類推して付け足しは不要、|参照番号|概要説明|の表へ、関連するもの同士でまとめて並べ替え)
⑫難解用語を全て列挙して、表にまとめて下さい。(必ず正しく抜き取る事、|用語|説明|の表へ)

※回答が長くなりそうな場合は、適宜、分けて質問: ②③④⑤、⑥⑦⑧⑨、⑩⑪⑫
※各章節に応じて、その章節内で明らかに不要な質問は、適宜除外。

※その他、不明点があれば、適宜、ピンポイントで質問。

質問内容は、始めに「①要旨は何か」に絞って1つだけを質問し、得られた回答内容に応じて、適宜質問を追加するようにしました。また、表で表した方が素早く把握できるので、必要であれば、記事を表に変換するような質問を追加しています。

論文・記事を貼り付けるテキストの長さは、1節分程度の量にとどめた方が、良い回答が得られました。生成AIの回答の文量が多くなってくると、回答が長くなり過ぎないように、生成AIが勝手に(適度に)端折り始めてしまい、重要な点が回答から抜けてしまう可能性が高くなります。

事前知識

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。

自動運転の論文についての分かりやすい解説記事(下記)等を事前にチェックして、中核部分の内容をあらかじめ分かっていると、理解しやすいと思います。生成AIは実際の細かい処理方法自体を分かりやすく説明してはくれない傾向があります。

注意点

論文のテキスト内容だけを貼り付けて、生成AIに質問しています。論文の中の図・表の部分は貼り付けていません。図・表の内容は生成AIの回答には含まれず、別途論文を見る必要があります。

以降で、生成AIの回答内容が読みにくい・分かりづらい場合は、論文の本文でも同じように書かれてあり、論文の本文を読んでも同じように分かりづらいことが多くあります。論文では、既存研究等があるため、多くの説明を省略・一言だけサラッと書かれてある等、同種の研究に取り組む人でなければ、なかなか全容の理解に辿りつくのは難しい側面があります。この点は、生成AIの回答性能が悪いのではなく、論文という性質上、生じるものと考えています。

生成AIに質問

以降で、ChatGPTに実際に聞いてみた例を記載します。

生成AIへの質問＆回答の全容

生成AIへの質問＆回答の全容・詳細:

Appendix

A. Changes to TransFuser

A.1. Expert

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) Appendix A. Changes to TransFuser Our reproduced TransFuser largely follows [12] but has some minor differences in implementation details described in this section. A.1. Expert For data collection, we follow the common practice of using an automatic labeling algorithm (expert driver) to generate the imitation labels. The auxiliary perception la- bels are provided directly by the CARLA simulator except for the bird’s eye view (BEV) segmentation, for which we follow [43] and render the relevant objects into an HD map. To generate the imitation labels, the expert driver needs to solve planning and control, but can bypass perception using privileged access to the simulator. We build our expert upon the model predictive control (MPC) approach of [12]. Lat- eral control is done by following the next point (at least 3.5 meters away) in a path, created by an A⋆ algorithm, with a PID controller. Longitudinal control is done via MPC that differentiates between 4 target speeds. For regular driving, we use 8 m/s (double the target speed compared to [12]). Inside intersections, we slow down to 5 m/s. Collision avoidance: The target speed is set to 0 m/s when the MPC algorithm predicts a collision. We predict colli- sions similarly to [12]. They approximate all agents’ fu- ture positions by iteratively unrolling a kinematic bicycle model [8]. Actions for other cars are set to be the same as the current time step while the ego agent’s own action is approximated by using a PID controller to follow the A⋆ path. In case the ego agent’s bounding box overlaps with another agent’s bounding box at future time step t, a col- lision is predicted. To keep a safety distance to the lead- ing vehicle, the MPC expert in [8] additionally has a static bounding box in front of the ego agent that predicts a colli- sion if it intersects with any other agent. Because our agent has double the maximum driving speed, we need to keep a larger safety distance. Using a static area would not suffice here, as the static bounding box will become so large that it sticks into the opposing lane during turns, causing unnec- essary braking. Instead, we approximate the area where the expert would end up if it performed a full brake (after 1 me- ter of driving) at his current speed s. The distance to stop d is approximated with: d = 0.5 ∗ ( s ∗ 3.6 10.0 )2 + 2.5 (1) We again use the kinematic bicycle model to compute where the expert will be after this distance. If the bounding box of the expert at that time step intersects with any bound- ing boxes for the current time step, we set the target speed to 0. This approach has the advantage that the safety area follows the road and increases with increasing agent speed. A second problem coming from the higher driving speed is scenario 3 where a pedestrian runs in front of the vehi- cle. The expert will stop as soon as the pedestrian attempts to move forward, but when driving at 8 m/s this is already too late to prevent the collision. To solve this problem, we preemptively slow down to 2 m/s whenever a pedestrian is within a 30-meter radius in front of the vehicle. Traffic rules: Stop signs infractions are addressed by slowing down to 2 m/s when the safety area intersects with the stop sign trigger area (provided by the simulator) and stopping once the car is on the stop sign trigger area. Red lights are resolved by setting the target speed to 0 when the safety area or vehicle intersects with the entrance of an in- tersection and the corresponding traffic light is active.

Method DS ↑ RC ↑ Veh ↓ Block ↓
MPC Expert [12] 77 ± 2 89 ± 1 0.28 0.13
MPC Expert (ours) 81 ± 3 90 ± 1 0.21 0.09
Table 9: Expert comparison on Longest6

Performance: We show in Table 9 that our expert outper-
forms the baseline by 4 DS. This represents a higher upper
bound for our imitation learning models. However, since
no method can currently reach that upper bound, we see no
direct improvement from the better expert (see Table 10).

上記の内容に関して下記の質問に回答下さい: (である調で記載、一般的な推測を回答に混入しない事、元文の記載内容に基づいて忠実に回答、元文量に合わせて回答量は十分長くする事)
①何についての記載か? + 要旨は何? (要旨は箇条書きで記載、既存手法があれば引用元を記載、図・表の参照があれば記載)

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

本節は、TransFuserにおけるエキスパートドライバー（模倣学習に必要な教師ラベルを生成するアルゴリズム）の変更点や性能評価について記載している。具体的には、データ収集方法、経路追従と制御手法の詳細、および衝突回避や交通ルール遵守に関する改良点が述べられている。さらに、表9で既存手法との性能比較が行われており、提案されたエキスパートが改良された点についても述べられている。

本節のエキスパートドライバーは、模倣ラベルの生成を支援する自動ラベリングアルゴリズムとしてCARLAシミュレーターを用い、BEVセグメンテーションラベル生成などで工夫がなされている。
経路追従にはA⋆アルゴリズムに基づく経路生成と、PIDコントローラーによる横制御、MPC（モデル予測制御）による縦制御を採用しており、最大速度を8 m/sに設定することで速度向上を実現している。
衝突回避のため、MPCアルゴリズムを用いて他車の位置予測を行い、エゴエージェントの行動を制御している。また、静的領域を利用せず、速度に応じて動的に変化する停止距離を設定して安全性を高めている。
歩行者や信号に対するルール遵守を強化し、例えば、歩行者が30メートル以内にいる際には速度を減速し、信号やストップサインにも対応する。
表9に示されるように、提案手法はMPCエキスパートの従来手法[12]に比べ、運転成功率（DS）が4ポイント向上し、車両侵害率や進路妨害率が低減している

▼質問(人間)
回答が短い。端折り過ぎで、文脈が分かりにくい。
本件を記憶下さい。前回の質問に再度回答下さい。

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

本節「Appendix A: Changes to TransFuser」では、模倣学習のためのラベル生成を行う自動ラベリングアルゴリズム（エキスパートドライバー）について、既存手法からの具体的な変更点とその実装に関する詳細が述べられている。特に、エキスパートドライバーの実装における制御手法の改良点、速度の向上に伴う衝突回避メカニズムの強化、および交通ルール遵守の改善について詳述されている。また、表9において提案手法のMPCエキスパートと従来手法の比較を行い、運転成功率（DS）や車両侵害率（Veh）、進路妨害率（Block）など、具体的な性能向上についても評価している。

要旨は以下の通りである。

データ収集方法：CARLAシミュレーターを用いて模倣ラベルを生成する際、BEV（鳥瞰視点）セグメンテーションラベルを得るため、[43]に基づき対象物をHDマップにレンダリングしている。
エキスパートドライバーの構築：経路追従と制御手法において、経路はA⋆アルゴリズムで生成し、横制御にはPIDコントローラーを採用。縦制御にはMPC（モデル予測制御）を利用しており、従来手法[12]の目標速度（4 m/s）を倍の8 m/sに設定し、速度向上を実現。
衝突回避の改良：他車両の未来位置予測には、運動モデルを利用しており、速度が上がったエージェントの安全距離確保のために、固定された静的領域ではなく、速度に応じた停止距離を動的に調整している。
歩行者対策：従来手法と比較し、歩行者が前方30メートル以内にいる場合は、事前に2 m/sまで減速することで衝突を回避。
交通ルールの強化：ストップサインや信号に対応し、進入時には適切に減速や停止を行う処理を追加。
性能評価：表9に示される通り、提案手法のMPCエキスパートは、従来手法のMPCエキスパートに対し、運転成功率（DS）が4ポイント向上し、車両侵害率と進路妨害率が低減している

多少は長めの回答になった。

A.2. Dataset

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) A.2. Dataset We generate our training dataset by executing the expert described in Section A.1 on the training routes from [12] and storing every frame (20 FPS). Weathers are randomized per route instead of per frame to avoid exposure problems. During training, we train on every fifth frame, leading to an effective FPS of 4. For the scaling experiment, we re- run data collection on each route 3 times. This randomizes weathers and traffic (the traffic manager in CARLA 0.9.10 is not deterministic) but the environment is the same. For the final model in the main paper we recollect the data at 4 FPS and train on every frame. The model sees equivalent data, but this makes the second dataset is 5x smaller and hence easier to release. To reduce storage re- quirements further, we compress the camera images with JPG (we add the compression during inference as well to avoid distribution shifts), perception labels with PNG (depth maps are stored at 8 bit resolution), text files with zip, and LiDAR point clouds with LASzip [20] (which com- presses point clouds much better than standard zip). We store full 360° LiDAR sweeps and realign all points into the coordinate system of the current frame. Since the expert is not a perfect driver, we log the driving score of each train- ing route and only train on routes with 100 DS. The labels for the disentangled path are generated by storing the next 10 2D points (spaced 1 meter apart) from the A⋆ path the expert is following.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

本節A.2「Dataset」では、前節A.1で説明されたエキスパートドライバーを用いたトレーニングデータセットの生成方法や、データ処理と圧縮の詳細について記載されている。データ収集、処理フレームレートの調整、気象や交通のランダム化、圧縮手法の工夫、データセットの公開を容易にするための最適化、そしてモデルのトレーニングにおけるルートスコアの活用についても触れている。

要旨は以下の通りである。

データ収集とフレームレート：トレーニングデータセットは、[12]で用いられたトレーニングルート上でエキスパートドライバーを実行することで生成される。データ収集時には毎秒20フレーム（20 FPS）で記録し、気象条件をルートごとにランダム化している。トレーニング時には5フレームごとに学習する（実質的に4 FPS）。
スケーリング実験：スケーリング実験用のデータ収集では、各ルートでデータ収集を3回行い、気象や交通状況をランダム化することで、環境は同じでも多様なデータを生成している。CARLAシミュレーターのバージョン0.9.10におけるトラフィックマネージャは非決定的であるため、ランダム性が増している。
最終モデル用データセットの効率化：論文本文の最終モデルでは、収集データを4 FPSに調整し、全てのフレームでトレーニングする。これにより、収集データは元の5分の1のサイズとなり、公開しやすくなっている。
データの圧縮方法：ストレージ容量の削減のため、カメラ画像はJPGで圧縮し、圧縮による分布シフトを避けるために推論時にも同様の圧縮を追加している。認識ラベルはPNG形式で、深度マップは8ビット解像度で保存している。テキストデータはzip形式で圧縮し、LiDARのポイントクラウドデータはLASzip [20]を使用して圧縮（標準のzipよりも効率的）している。これにより、360度のLiDARスイープデータ全体を保存し、全てのポイントを現在のフレームの座標系に再整列している。
エキスパートドライバーの運転スコアに基づくデータフィルタリング：エキスパートドライバーが完全な運転精度を持つわけではないため、各トレーニングルートの運転スコア（Driving Score, DS）を記録し、100点のルートのみをトレーニングに使用している。
分離された経路ラベルの生成：エキスパートが追従するA⋆経路から、次の10個の2Dポイント（1メートル間隔）を保存し、分離された経路ラベルとして使用している。

続けて、長めの回答が得られるようになった。良好。

A.3. Training and Architecture

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) A.3. Training and Architecture Training: We use the same loss functions as in [12] for all 5 outputs. For the new target speed classification task we use a class frequency weighted cross-entropy loss with label smoothing 0.1. As classes we use the 4 different tar- get speeds of the expert (see Section A.1). We observe some failure cases where the target speed predictions are overcon- fident, and the car starts braking a few frames too late when stopping behind another vehicle, leading to close rear col- lisions. Label smoothing addresses this problem because it makes the car drive a bit slower (due to the probability weighting). This gives the car a few more frames time to brake and avoids these collisions. To make this more ex- plicit, we do not use label smoothing in the final model in the main paper, and instead reduce the target speeds by 2 m/s during inference, achieving the same effect. During training, each individual loss is added together with the same weight. The loss weights are normalized to sum to 1. We use AdamW [23] with amsgrad [30] and a slightly higher learning rate of 0.0003 which is reduced by a factor of 10 after epoch 30. We train for 31 epochs and always use the last epoch as our model. All models in Sec- tion 3 of the main paper are trained on 8 2080ti GPUs with distributed data parallel and total batch size of 48. For the dense dataset, we subsample by a factor of 5 but shift the first frame by the GPU index, so that all the frames are seen during training (a subtle form of data augmentation, close by frames are mostly redundant). The final model is trained with 4 A100 GPUs and batch size 128 to speed up training. We also add standard color augmentations to the camera and a discrete conditioning concatenated with the velocity input, so that the target speed branch also has a conditioning signal. This is conceptually more sound, but we observed no significant impact on driv- ing performance. Our path labels have a subtle ambiguity due to conversion from a path in global coordinates to the local ego coordinate system (points can be shifted by 1 me- ter depending where the car is on the path). To resolve this, we linearly interpolate between the stored points at 1 me- ter distances in ego coordinates during training. This makes the predictions look qualitatively more consistent, but again has no significant impact on driving score. Architecture: For the BEV semantic segmentation we pre- dict 11 classes (unlabeled, road, sidewalk, non cross-able lane marker, cross-able lane marker, stop signs, traffic light stop line green / yellow / red, vehicle, pedestrians) [43]. We only predict pixels that are visible in the camera, since some classes need RGB features. For the perspective segmenta- tion we predict 7 classes (unlabeled, road, sidewalk, lane markings, traffic light, vehicle, pedestrian). For the bound- ing boxes we predict 4 classes (red and yellow traffic light stop line, stop sign, vehicles and pedestrians). Traffic lights and stop sign boxes are only predicted if they affect the ego vehicle. Vehicles and pedestrians only when they are in the LiDAR range. Architecture wise we add 1x1 convolution layers in the LiDAR branch before each transformer that match the channel dimension of the LiDAR to the one in the image branch (allows using different backbones for the two branches, though we keep using RegNetY-3.2GF [29] for both). Our reproduced TransFuser predicts 8 waypoints, each placed 250 ms apart (up to 2 seconds into the future). Sensors: We follow [41] and use a single high resolution camera. It has a 110° horizontal field of view (FOV), is mounted at (-1.5, 0.0, 2.0) and has a resolution of 256 x 1024. Our LiDAR is mounted at (0.0,0.0, 2.5) and has a full 360° FOV. LiDAR points are realigned into the current ve- hicle coordinate system by using the filter described in Sec- tion A.4. The LiDAR points are voxelized into a 256x256 grid representing a 64 x 64 meter area with the vehicle at its center. Each pixel covers a 0.25 x 0.25 meter area. We re- move the LiDAR ground plane by removing all points with a height of less than 0.2 meters. Performance: The reproduced TransFuser has a slightly lower DS to the original on the Longest6 benchmark as shown in Table 10. The difference in driving score likely is a result of the driving score weighting the dissimilar fail- ure modes differently. Note, that our reproduced TransFuser is a single model, whereas the original result was from an ensemble of 3 models.

Method DS ↑ RC ↑ Veh ↓ Stat ↓
TransFuser (ours) 40 82 1.17 0.57
TransFuser [12] 47 93 2.45 0.07
Table 10: Reproduced TF vs original on Longest6

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

本節A.3「Training and Architecture」では、訓練方法、損失関数、アーキテクチャ設計、使用センサー、及び性能比較について説明されている。特に、速度予測タスクに対するラベルスムージングの利用、トレーニングの設定とデータ並列化による効率化、そして最終モデルのトレーニング環境とパラメータについて記述されている。また、アーキテクチャとしてバックボーン構成、セマンティックセグメンテーションおよび3Dポイントクラウド処理を活用したアプローチを採用しており、再現実験の結果として運転スコアやエラーの測定を行っている。

要旨は以下の通りである：

損失関数とラベルスムージング：本研究では[12]に基づき、5つの出力に対して同様の損失関数を用いている。新しい速度分類タスクにはクラス頻度重み付きのクロスエントロピー損失を適用し、ラベルスムージング（平滑化係数0.1）を追加している。これは、車両の停止遅延を抑えるために行われ、ブレーキのタイミングが遅れる問題に対処する意図がある。また、最終モデルではラベルスムージングを使用せず、推論時に目標速度を2m/s減少させて同様の効果を達成している。
トレーニング設定：各損失は同一の重みで加算され、合計が1になるよう正規化される。最適化にはAdamW[23]（amsgrad[30]付き）を使用し、初期学習率0.0003からエポック30で1/10に減少させる。31エポックでトレーニングを行い、最終エポックのモデルを使用する。8台の2080ti GPUによる分散データ並列で訓練し、バッチサイズ48を採用している。密なデータセットでは、5フレームに1フレームを用いてトレーニングし、各GPUに異なるフレームシフトを適用してデータの冗長性を減少させている。
最終モデルのトレーニング環境：最終モデルのトレーニングでは、4台のA100 GPUとバッチサイズ128を使用して速度を向上させている。また、カメラ画像に標準的なカラー増強を追加し、速度分類支配において速度信号と結合した条件付けを行うが、運転性能には顕著な影響は見られなかった。
アーキテクチャの特徴：アーキテクチャには11クラスのBEVセマンティックセグメンテーションを用い、カメラで可視化できる範囲に限定して予測している。LiDARと画像の2つの分岐において、各トランスフォーマーの前に1x1畳み込み層を挿入し、チャネル次元を一致させ、異なるバックボーン（本研究では両分岐にRegNetY-3.2GF[29]を採用）を使用できるようにしている。また、再現実験ではTransFuserモデルが8つのウェイポイント（250ms間隔）を予測し、2秒先までの経路を推定している。
使用センサー：[41]に準拠し、高解像度カメラとLiDARを使用している。カメラは110°の水平視野（FOV）を持ち、解像度は256×1024である。LiDARは360°のFOVで、点群データを256×256グリッド（64m x 64m範囲）でボクセル化し、車両座標系に再整列している。LiDARの地面平面は0.2m未満の点を除去することで取り除かれている。
性能比較：再現実験として、TransFuserの運転スコア（Driving Score, DS）と各種エラーをLongest6ベンチマーク上で評価し、結果を表10に示している。再現されたTransFuserモデルは単一モデルであり、元の結果は3モデルのアンサンブルに基づくものであるため、運転スコアがやや低い。

▼質問(人間)
続けて下記の質問に追加で回答下さい:
⑦表に出来そうな部分があれば表でまとめて下さい。(下に解説を記載、対応する図/表番号があれば記載)

▼生成AIの回答(ChatGPT-4o)
⑦ 表にできそうな部分

以下に、表形式でまとめた項目とその解説を記載する。対応する図・表番号としては、性能比較については表10が該当する。

項目	内容	補足・解説
損失関数	5つの出力に[12]の損失関数を使用	速度分類タスクにはクロスエントロピー損失、ラベルスムージング0.1を追加している
ラベルスムージング	ブレーキの遅延回避を目的にラベルスムージング0.1を使用	最終モデルではラベルスムージングを使用せず、代わりに目標速度を2 m/s減少
トレーニングエポック数	31エポック	最終エポックのモデルを採用
最適化アルゴリズム	AdamW[23] (amsgrad[30]付き)	初期学習率0.0003で開始し、エポック30で1/10に減少
GPUおよびバッチサイズ	2080ti 8台で分散データ並列、バッチサイズ48	各GPUに異なるフレームシフトを適用することで密なデータセットの冗長性を抑制
最終モデル	A100 GPU 4台、バッチサイズ128	トレーニング速度を向上させるための設定
カメラの視野 (FOV)	110°水平方向	解像度256 x 1024
LiDARの視野 (FOV)	360°	車両座標系に再整列し、地面平面は0.2 m未満の点を除去
セグメンテーションクラス	BEVでは11クラス、パースペクティブでは7クラス、バウンディングボックスは4クラスを予測	クラスの詳細は上記要旨に記載
モデルの運転スコア (DS)	TransFuser（再現実験）: 40、元モデル: 47	元の結果はアンサンブル（3モデル）での結果であり、再現モデルは単一モデル
表10の指標	DS（運転スコア）、RC（正解率）、Veh（車両誤差）、Stat（静止物誤差）	TransFuser（再現モデル）と元の結果をLongest6ベンチマーク上で比較

解説
上記の表では、訓練方法、アーキテクチャ設定、損失関数、ラベルスムージングの調整とその効果、使用するGPU構成やバッチサイズといったトレーニング環境について整理した。また、表10に示される指標の比較も含め、性能評価についても表形式で要約している。特に、表10の性能指標は、運転スコア(DS)、車両誤差(Veh)、静止物誤差(Stat)など、各指標の違いを視覚的に捉えやすくしている。

A.4. Localization

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) A.4. Localization We localize our vehicle with a GNSS sensor. The GNSS signal has Gaussian noise applied to it, leading to average localization errors of ∼0.7 meters. Like prior work [9, 12] we use a filtering algorithm to reduce the noise. In particu- lar, we use an Unscented Kalman Filter (UKF) [32,39] with Van der Merwe’s scaled sigma point algorithm [24]. As its model of the car, our UKF uses the same kinematic bicycle model as the expert [8]. The filter tracks the position, orien- tation and speed of the ego vehicle. The parameters of the filter are tuned manually by reducing localization error on a small dataset consisting of ground truth localization paired with GNSS signals (similar to tuning hyperparameters on a validation set in supervised learning). The filter reduces the average localization error to below ∼0.1 meters.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

この節は、自動運転車両のローカリゼーション方法について述べている。GNSS（Global Navigation Satellite System）センサーを用いたローカリゼーションの精度向上のための手法とその設定が詳細に記載されている。また、GNSS信号のノイズ低減のためにUnscented Kalman Filter（UKF）を使用し、GNSSから得られる位置情報の誤差を低減している。主な内容は以下の通りである。

GNSSセンサーによるローカリゼーション
自動運転車の位置特定にGNSSセンサーを利用しており、このセンサーにはガウスノイズが加わっているため、平均して約0.7メートルのローカリゼーション誤差が生じる。
ノイズ低減のためのフィルタリングアルゴリズム
ローカリゼーション精度を向上させるため、従来の手法[9, 12]と同様にフィルタリングアルゴリズムを適用している。具体的には、Unscented Kalman Filter（UKF）[32,39]を採用し、Van der Merweのスケールドシグマポイントアルゴリズム[24]を用いる。
UKFの車両モデル
フィルタには、エキスパートドライバーモデルと同じ運動学的バイシクルモデル[8]を車両モデルとして使用しており、車両の位置、方向、速度を追跡している。
フィルタのパラメータチューニング
フィルタのパラメータは手動で調整されており、これにより小規模なデータセットにおいて、GNSS信号と基準位置情報（真値）のペアを使用してローカリゼーション誤差を低減している。この手動調整は、教師あり学習における検証セットでのハイパーパラメータ調整に類似した方法である。
ローカリゼーション精度の向上
調整されたUKFにより、GNSSのローカリゼーション誤差は平均約0.1メートル以下まで低減される。

A.5. PID Controller

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) A.5. PID Controller In [12] the PID controller converts the waypoints into steering by computing the car’s angle towards the average between the first two waypoints. The angle is then input to a PID controller to minimize. Computing the angle based on entangled waypoints makes the steering angle towards the path depend on the predicted speed of the vehicle. This works well at the low driving speeds of [12] but becomes hard to tune when driving at higher (and more diverse) speeds like our expert. Instead, we follow the first way- point that is at least a certain aim distance a away from the center of the car (or the last one). We keep a similar to [12] and use a = 2.25 when driving slower than 5.5 m/s (inside intersections) and a = 3.0 when driving faster. Longitudi- nal control is kept similar, we use the velocity between the waypoint 0.5 second into the future and 1 second into the future as target speed. The models using the disentangled waypoint representation do not require tuning an additional controller. Since we directly predict the input to the expert’s PID controller, we can reuse the same PID controllers from the expert (see Section A.1), which we know work well. The task of yielding to stop signs is defined in CARLA to stop the vehicle on a STOP paining printed on the road. This stop painting is largely occluded by the motor hut once the car is on the sign because our camera is mounted at the back roof of the vehicle. We observe many cases where TF++ de- tects the stop sign initially and slows down but starts driving again once the painting becomes occluded. We can address this problem in the controller because TF++ detects stop sign bounding boxes in BEV as auxiliary task. For our final model, we keep the last detected stop sign in a buffer and transform it to the current ego coordinate system using the UKF motion estimation. If the car is on the stop sign bound- ing box we set the action to brake in the controller until the stop sign is cleared. Table 11 shows the impact of the stop sign controller change on the validation routes. TF++ detects most stop signs and the controller is able to reduce Stop infractions by 5x, leading to a +10 DS improvement.

Stop Contr. DS ↑ RC ↑ Stop ↓

60 ± 5 99 ± 1 1.35
✓ 70 ± 6 99 ± 0 0.26
Table 11: Stop sign controller.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

この節は、自動運転車両の操縦に使用されるPID（比例・積分・微分）制御器の設定方法について述べている。特に、従来の手法[12]での低速運転での制御方法と異なり、より高速度域での操作性を確保するための調整方法が紹介されている。また、STOPサイン認識と停止制御の改善についても詳細に説明されている。主な内容は以下の通りである。

従来手法のPID制御器
従来手法[12]では、最初の2つのウェイポイントの平均を車両の角度計算に使用し、この角度を最小化するようにPID制御器を用いてステアリングを制御していた。この方法は低速運転には有効であるが、高速域ではチューニングが難しい問題が生じていた。
改善されたステアリング制御
高速運転時の安定性向上のため、本研究では特定の距離 $ a $ 離れた最初のウェイポイント（または最後のウェイポイント）を参照して進行方向を設定している。低速運転（5.5 m/s以下）では $ a = 2.25 $、高速運転時には $ a = 3.0 $ として設定している。この設定により、速度に応じた適切なウェイポイントが選択され、ステアリング制御の精度向上が期待される。
長軸制御
進行速度についても、0.5秒先と1秒先のウェイポイント間の速度を基準として使用し、速度制御を行っている。このため、予測されたウェイポイントを直接利用できるため追加のコントローラーチューニングは不要である。
STOPサインへの対応
CARLAシミュレーターにおいて、STOPサイン上での停止が要求されるが、カメラ位置の関係で車両がサインを通過した場合、サインが隠れてしまう問題が発生していた。TF++はBEVでのSTOPサインバウンディングボックスを検出できるため、最終モデルでは最後に検出したSTOPサインの位置情報をUKFによる推定移動情報を使って追跡し、サイン上にいる場合はブレーキを作動させる設定としている。
性能向上
STOPサインの変更により、STOP違反が5倍減少し、ドライビングスコア（DS）が+10向上している（表11参照）。

▼生成AIの回答(ChatGPT-4o)
⑦ 表に出来そうな部分

以下の要素を表にまとめる。

制御方法	説明	条件/設定
ステアリング制御	最初の2つのウェイポイント間の角度を最小化するようにステアリングを制御	従来手法[12]
改良ステアリング制御	最初のウェイポイントで一定距離 $ a $（または最後のウェイポイント）を参照して進行方向を設定	運転速度に応じて $ a = 2.25 $（低速）、$ a = 3.0 $（高速）
長軸制御	0.5秒先と1秒先のウェイポイント間の速度を基準に制御	速度制御に追加のチューニング不要
STOPサインへの対応	BEVで検出された最後のSTOPサインをバッファに保持し、現在の座標系に変換して追跡	サイン上ではブレーキを作動させる
性能向上	STOPサイン認識の変更で違反が減少し、ドライビングスコア（DS）が向上	STOP違反が5倍減少し、DSが+10向上

解説

この表は、ステアリング制御とSTOPサイン対応の手法について、その概要と設定条件、及び従来手法との違いを明確に比較するためにまとめたものである。表11にある通り、STOPサインの認識と追跡に関する改良がドライビングスコアに大きな改善をもたらしている。

B. TransFuser++ Implementation Details

B.1. Attention Pooling Implementation

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) B. TransFuser++ Implementation Details B.1. Attention Pooling Implementation The core implementation of our attention pooling with a transformer decoder is similar to [35]. We use the 8x8 features coming from the BEV branch and reduce the num- ber of channels (1512 with our RegNetY-3.2GF backbone) to 256 with a 1x1 convolution. We add a sinusoidal posi- tional embedding to the features and flatten them afterward. The velocity input, we normalize with a 1D Batchnorm and embed it with a 2 layer MLP to 256 features. Afterward, we add a learnable positional encoding. The velocity token then gets concatenated to the BEV tokens. These tokens are then processed by a standard Transformer decoder [38] that has 8 heads, gelu activation function [18], 6 layers and uses layernorm [3]. We use 1 learned queries for every predicted waypoint and 1 additional one if we predict target speeds. The tokens are then fed as inputs into a GRU, whose hidden state is initialized with an (MLP embedded) target point. A linear layer converts the output to 2 dimensions, and a cumulative sum is applied to the final vector (forcing the network to predict offsets from the first point).

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

記載内容
本節では、TransFuser++におけるAttention Poolingの実装手法について詳細に説明している。この手法は、BEV（鳥瞰図）特徴をTransformerデコーダーに入力する過程を構成し、ウェイポイント予測および目標速度の予測を行うためのAttention Poolingを実現している。

要旨

Attention Poolingの実装：[35]を参考に、8×8のBEV特徴を用いたAttention PoolingをTransformerデコーダーを通じて実装している。
特徴の圧縮と位置エンコーディングの追加：RegNetY-3.2GFバックボーンを用いて得られた1512チャンネルのBEV特徴を1×1の畳み込み層で256チャンネルに圧縮した後、サイン波状の位置埋め込みを追加し、フラット化する。
速度入力の処理：速度入力には1次元のバッチ正規化と2層のMLPで埋め込み処理を行い、256特徴次元とする。その後、学習可能な位置エンコーディングを付加し、BEVトークンと連結する。
Transformerデコーダー：8ヘッドの標準的なTransformerデコーダー[38]（gelu活性化関数[18]、6層構造、layernorm[3]）を使用し、BEVと速度トークンを処理する。
クエリとトークンの出力処理：各予測ウェイポイントに1つのクエリを、目標速度予測時には追加のクエリを設定する。トークンはGRUに入力され、その隠れ状態はMLPで埋め込まれたターゲットポイントで初期化される。
出力の次元変換：線形層で2次元に変換し、累積和を適用して最終ベクトルをオフセット予測として出力する。

これにより、ウェイポイントの精度と目標速度の予測が向上するためのAttention Poolingが実現される。

B.2. Data Augmentation

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) B.2. Data Augmentation When collecting data with the expert driver, we mount an additional camera on the vehicle that collects augmented frames and labels at every time step. This means for every frame in the dataset, we have an augmented counterpart. We shift the camera by ±1 meter to the left or right (of the vehi- cle) and rotate it by ±5◦ around the yaw axis. The particular values are drawn from a uniform distribution. During train- ing, we load the perturbed camera with 50% probability. We transform the 2D way point labels and other data such that they have the perturbed camera as center. The shifted and rotated waypoint labels do not describe an actual re- covery trajectory that an expert would take, and are instead the original waypoints in the center of the lane. However, the PID controller will steer the car back to the center of the lane, when the network predicts these augmented way- points, achieving a similar effect. The range of errors the network is trained to recover from equals the range of the perturbations. To generate the perturbed camera, we need access to a rendering engine (CARLA in our case). On real data, a similar effect can be achieved using novel view syn- thesis [2].

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

記載内容
本節では、データ拡張手法として用いたカメラシフトと回転による画像拡張の詳細について説明している。特に、車両に取り付けた追加カメラで、毎フレームに対応する拡張された画像とラベルを収集する方法に関する具体的な設定や、学習時のデータ処理方法について述べている。

要旨

追加カメラを用いた拡張データ収集：専門のドライバーデータ収集時に車両に追加カメラを取り付け、データセット内の各フレームに対し、拡張された対応フレームを収集。
カメラのシフトと回転：カメラを±1メートル左右にシフトさせ、ヨー軸周りに±5°回転。これらの値は一様分布に基づいてランダムに抽出。
訓練時の処理：学習時に50%の確率で補正されたカメラのフレームをロードする。補正されたカメラを中心とするように、2Dウェイポイントラベルと他のデータを変換。
拡張ウェイポイントの効果：シフトや回転により変換されたウェイポイントラベルは実際の回復経路ではなく、車線中央にある元のウェイポイントを記述。PIDコントローラが車を車線中央に戻すことで、ネットワークが拡張ウェイポイントを予測しても、似たような効果が達成される。
エラー範囲のトレーニング：ネットワークが回復するエラー範囲は、カメラシフトや回転の範囲に対応する。
シミュレーション環境の利用：拡張されたカメラを生成するためにレンダリングエンジン（CARLA）を使用。本データ上では、類似の効果を新しいビュー合成[2]を用いて達成可能。

このデータ拡張手法により、カメラ位置のシフトや回転に対するネットワークの耐性が向上し、よりロバストな予測が可能となる。

C. Additional Results

C.1. Longest6 Ablations

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) C. Additional Results C.1. Longest6 Ablations As a sanity check, we test our additions to TransFuser by repeating the experiments on the training towns (Longest6) while training with all data. The results are presented in Ta- ble 13. We start with our reproduced version of TransFuser and iteratively add 1 change in every row. All results are the average of 3 training seeds, each evaluated 3 times. The reported standard deviation is across the training seeds. The main problems of the reproduced TransFuser are high colli- sions with other vehicles (Veh) and collisions with the static environments (Stat). Replacing the global average pooling and MLP with a Transformer decoder improves the driv- ing score by 17 points, improving both vehicle collisions and collisions with the environment (indicating improved steering). Adding shift and rotation augmentations has a similar effect, improving the driving score by +9 and reduc- ing collisions. The disentangled representation has a similar effect than on the validation routes, increasing vehicle col- lisions and reducing collisions with the environment. Since Longest6 has denser traffic, the increased vehicle collisions have a larger impact here, leading to an overall decrease in 2 DS. Two stage training improves DS by +3 and scal- ing the dataset by 3x leads to a result of 69 DS. Since the waypoint representation is slightly better on longest6, we additionally report the result of a variant called TF++ WP in Table 13. It was trained with the released dataset and uses waypoints as output representation, but is otherwise identi- cal to TF++. TF++ WP again has a slightly higher DS of +1. We also report the result of an ensemble of the 3 train- ing seeds following prior work [12]. The ensemble reduces vehicle collisions by 0.10 and leads to a SotA result of 73 DS on longest6.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

記載内容
本節では、TransFuserの強化による影響を評価するために、「Longest6」データセットで訓練時の追加実験を行い、その結果を分析している。各改良を逐次適用し、それぞれの効果を定量的に検証した結果が、表13に示されている。

要旨

実験の目的：訓練タウン（Longest6）での実験において、TransFuserに適用した各種改良の効果を確認し、性能向上の妥当性を検証するため。
手法と評価方法：実験は、再現したTransFuserバージョンに対し、改良を1つずつ適用して実施。各結果は3つの異なる訓練シードから平均を算出し、標準偏差も併記している。
主な問題：再現されたTransFuserでは、車両（Veh）および静的環境（Stat）との衝突が多発し、これが主な問題点とされた。
Transformerデコーダの導入効果：グローバル平均プーリングおよびMLPの代わりにTransformerデコーダを使用することで、ドライビングスコアが17ポイント向上し、車両および環境との衝突が改善されている（操舵性能の向上を示唆）。
シフトおよび回転拡張の導入効果：ドライビングスコアが+9ポイント改善され、衝突も減少。
Disentangled表現の効果：車両との衝突が増加し、環境との衝突が減少する傾向が見られた。Longest6データセットはトラフィックが密集しているため、この車両衝突の増加が大きく影響し、総合的なDSは2ポイント低下した。
二段階訓練法の効果：ドライビングスコアが+3ポイント改善。
データセットの拡張（3倍）：ドライビングスコアが69に達し、性能向上が見られた。
ウェイポイント表現（TF++ WP）の導入：ウェイポイント表現の使用により、ドライビングスコアがさらに+1ポイント向上（表13に示す）。
3シードのアンサンブル効果：過去の研究[12]に従い3つの訓練シードを組み合わせたアンサンブルにより、車両衝突が0.10減少し、Longest6での最高のドライビングスコア73を達成。

C.2. Brake Threshold

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) C.2. Brake Threshold The confidence weighted PID controller from the mod- els using target speed prediction has a hyperparameter tha determines at which confidence threshold the action is set to full brake. In Table 14 we investigate the effects of different thresholds on validation towns. Lower thresholds reduce vehicle collisions at the cost of lower route completion due to false positive braking. Over- all, the choice of threshold is robust, only changing by 3 DS between the best and worst tested threshold. We use the de- fault threshold of 50% for all validation town experiments. On Longest6, the brake threshold has a slightly larger im- pact because the dense traffic puts a higher focus on col- lision avoidance. We therefore use the threshold 33% for Longest6 experiments.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

記載内容
本節では、ターゲット速度予測を行うモデルにおいて、信頼度重み付けPIDコントローラのブレーキの閾値設定がパフォーマンスに与える影響を検証している。特に、異なる閾値が検証用タウンとLongest6における車両衝突回避や走行完了率にどう影響を及ぼすかを分析し、実験結果を表14に示している。

要旨

ブレーキ閾値の影響：低い閾値を設定することで車両衝突が減少する一方で、誤検知によるブレーキングが増加するため走行完了率が低下する。
閾値の全体的な影響：最良値と最悪値の間でドライビングスコア（DS）が最大3ポイントの変動しか見られないため、閾値設定には一定の堅牢性があると評価されている。
デフォルト設定：検証タウンにおいてはデフォルトの50%を使用。
Longest6における影響：Longest6では交通量が密集しているため衝突回避の優先度が高く、33%の閾値がより有効であると判断された。

C.3. Additional Baselines

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) C.3. Additional Baselines We report 3 additional results on the validation towns in Table 12. TF++ (all towns) is TF++, but its training includes the validation towns. This is not a fair comparison to other methods, and rather serves to illustrate what part of the re- maining problems are due to generalization issues. Includ- ing the validation towns during training increases the driv- ing score by 20 points, improving both vehicle collisions and stop sign infractions. TF++ is close to expert level per- formance in this setting, “underfitting” by 4 DS. The expert still has slightly lower vehicle collisions (Veh). We retrain Perception PlanT [31] on its released dataset. It achieves a DS of 37. Perception Plant uses a handcrafted intermediate representation (bounding boxes) as visual ab- straction. This makes the method more interpretable, but the downside of human designed representations are that they might miss important things. In this case, the bound- ing boxes do not include stop signs, leading to the method ignoring all stop signs (Stop) and incurring a large penalty for that infraction (1.87). Stop signs were not considered in the benchmark this method was developed on (Longest6). The reproduced LAV v2 [9] also achieves a surprisingly low driving score of 27. The reason for this are its high stop sign and red light infractions. The red light infractions oc- cur almost exclusively in Town 02 which has European style traffic lights. We evaluate on the LAV routes with additional scenarios (7,8,9,10) compared to the original benchmark. Due to the added scenarios, many traffic lights will turn yel- low (and then red) just as the agent approaches an intersec- tion. LAV v2 ignores those situations, incurring many red light infractions, bringing the overall score down. Besides the failure to adhere to traffic rules, LAV v2 achieves SotA results in route completion and vehicle collision avoidance.

▼生成AIの回答(ChatGPT-4o)
① 何についての記載か? + 要旨は何?

記載内容
本節では、検証タウンにおける追加のベースライン実験結果を示しており、表12に結果を記載している。主にTF++のバリエーションおよび他の手法であるPerception PlanTとLAV v2の評価を行い、それぞれの手法における課題と性能の特徴を分析している。

要旨

TF++ (all towns)：TF++のトレーニングに検証用タウンを含めた設定で実験し、一般化の課題を示している。他手法とフェアな比較はできないが、20ポイントのドライビングスコア（DS）向上を示し、車両衝突や一時停止標識違反が減少した。しかし、エキスパートレベルの性能には僅かに及ばず、特に車両衝突の面で4ポイントの差が見られる。
Perception PlanT：公開されているデータセットを用いて再学習を行った結果、DSは37を記録。視覚的抽象化に手作りの中間表現（境界ボックス）を使用しており、手法の解釈性は高いが、境界ボックスが一時停止標識を含まないため、すべての一時停止標識を無視し、大きなペナルティ（1.87）を受けている。この手法はLongest6でのベンチマークで開発されたため、一時停止標識が考慮されていない。
LAV v2：DSは27と低く、主に一時停止標識や赤信号違反が多いことが原因。特にタウン02ではヨーロッパ風の信号機が配置され、追加シナリオのルート（7,8,9,10）において交差点に近づくと黄色から赤に変わる信号を無視するため、赤信号違反が頻発した。ただし、LAV v2はルート完了率や車両衝突回避では最先端（SotA）性能を示している。

▼質問(人間)
Method DS ↑ RC ↑ IS ↑ Ped ↓ Veh ↓ Stat ↓ Red ↓ Stop ↓ Dev ↓ TO ↓ Block ↓
Table 12: Performance on validation towns (LAV). Reproduced models, std over 3 trainings and 3 evaluations.

上記に関して、各項目(DS等)を、表にまとめて下さい。(|用語|説明|の表へ)

▼生成AIの回答(ChatGPT-4o)
以下に、表12の各項目についての説明を**|用語|説明|**の形式でまとめる。

用語	説明
DS	Driving Scoreの略称であり、総合的な運転性能の指標を示す。スコアが高いほど運転性能が優れているとされる。
RC	Route Completionの略称で、走行ルートの完了率を表す。スコアが高いほどルートの達成率が高い。
IS	Infraction Scoreの略称で、交通違反に関連するペナルティを集約してスコア化した指標。値が高いほど違反が少ないことを示す。
Ped	Pedestrian Collisionの略称で、歩行者との衝突回数を示す。値が低いほど歩行者との安全が確保されている。
Veh	Vehicle Collisionの略称で、他の車両との衝突回数を示す。低い値ほど安全な運転が行われていることを示す。
Stat	Static Collisionの略称で、固定物との衝突回数を表す。値が低いほど周囲の物体との接触が少ない。
Red	Red Light Infractionの略称で、赤信号無視による違反の発生数を示す。低い値が望ましい。
Stop	Stop Sign Infractionの略称で、一時停止標識違反の発生数を示す。低い値が望ましい。
Dev	Deviationの略称で、走行ルートからの逸脱度合いを示す。低い値が望ましい。
TO	Traffic Obstructionの略称で、交通の妨害発生数を示す。低い値が好ましい。
Block	Blockageの略称で、走行ルートのブロックまたは塞がれた状況の回数を示す。低い値が好ましい。

これにより、表12の各指標が示す意味や評価基準について理解が容易となる。

C.4. Additional Examples

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) C.4. Additional Examples Fig. 8 shows another example of the importance of the target point for recovery. This time we forcefully steer the ego car onto the sidewalk, which is also an out-of- distribution situation. The target point conditioned methods (here TransFuser and TCP) extrapolate their waypoints to- wards the nearby target point and drive back to the center of the lane. The discrete conditioned TransFuser that does not have access to the geometric information of the target point gets stuck on the sidewalk instead. Fig. 9 shows another example of harmful target point ex- trapolation. TCP and TransFuser both predict waypoints leaving the road in a right turn where the target point is far behind the turn. Changing the global average pooling + MLP approach in TransFuser to a transformer decoder mitigates the problem.

▼生成AIの回答(ChatGPT-4o mini)
①何についての記載か? + 要旨は何?

記載内容:

本節では、ターゲットポイントの重要性とその影響を示す実験例について記載されている。ターゲットポイントを用いた回復の重要性と、ターゲットポイントに基づいて予測されたウェイポイントの動作に関する事例が紹介されている。これにより、ターゲットポイントが運転に与える影響や、異なる手法がどのように挙動するかを検証している。

要旨:

ターゲットポイントによる回復の重要性:
- 図8では、エゴ車を意図的に歩道に進行させ、ターゲットポイントを条件とする手法（TransFuserおよびTCP）が近くのターゲットポイントに向かってウェイポイントを外挿し、車両が車線の中心に戻ることを示している。一方、ターゲットポイントの幾何学情報を利用しないディスクリート条件付きのTransFuserは、歩道に取り残される結果となる。
- ターゲットポイントの外挿の問題:
  - 図9では、ターゲットポイントがターンの手前にあり、TCPおよびTransFuserが右折時に道路を外れるウェイポイントを予測する様子が示されている。この問題に対処するため、TransFuserのグローバル平均プーリング+MLPアプローチをトランスフォーマーデコーダに変更することで、問題が軽減されることが確認された。

図・表参照:

図8: ターゲットポイントを条件にした場合の回復能力の比較。ターゲットポイントを条件にしたTransFuserとTCPが車両を車線の中心に戻す動作を示す。
図9: ターゲットポイントの外挿に関する問題の例。右折時にターゲットポイントがターンの手前にあるときに予測される不適切なウェイポイント。TransFuserの改善方法（トランスフォーマーデコーダの使用）を示す。

▼質問(人間)
上記の「エゴ車」とは何か?
この時の「エゴ」の意味は何か?

▼生成AIの回答(ChatGPT-4o mini)
「エゴ車」（英: ego vehicle）は、自律走行や運転支援システムにおいて、実験やシミュレーションの対象となる車両を指す用語です。これは、運転を行う「自分自身の車」または「自分の車両」といった意味合いを持ちます。

「エゴ」の意味:

エゴ（ego）は、ラテン語で「自分」「自己」を意味します。この文脈では、システムや実験において「主語となる車両」を指します。すなわち、運転者や自律システムが操作する車両であり、その車両を中心に周囲の状況が描かれます。
自律運転技術の研究やシミュレーションでは、エゴ車は、他の車両や障害物などと比較する際の「基準」となる存在です。この車両の挙動が、周囲の環境や他の車両の影響を受けながらどう変化するかを観察・評価します。

要するに、エゴ車とは実験対象となる車両で、その車両がどのように挙動するかを測定・評価するために使用される言葉です。

C.5. Additional Experiments

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) C.5. Additional Experiments Multimodal waypoints: Instead of disentangling the ve- locity from the waypoints one could also allow the network to predict multiple sets of waypoints as is sometimes done in trajectory forecasting. To test this approach we train a model with two waypoint GRUs and a selection head to classify the better mode. During training we compute the loss as the minimum L1-loss from both predicted way- points. The classification head is supervised with a binary cross entropy loss to classify which of the two waypoint losses has the lower L1 loss. Table 15 shows that the

Output DS ↑ RC ↑ Veh ↓ Stat ↓
Multimodal WP 47 ± 8 96 ± 1 1.13 0.20
Unimodal WP 49 ± 8 90 ± 4 0.70 0.10
Table 15: Multimodal waypoint representation.

representation performed 2 DS worse than using standard
unimodal waypoints. This is not a big difference but the
multimodal waypoints are more complex so there is not re-
ally a reason to use them.
NC conditioned AIM: To test whether recovery from the
target point bias depends on the architecture, we reproduce
the AIM [28] approach. Compared to TransFuser it uses no
LiDAR, no auxiliary losses and no transformers, but has the
same target point conditioned waypoint GRU as decoder.
Like with TransFuser we run two variants, one with TP con-
ditioning and one with NC. The result presented in Table 16

Cond. DS ↑ RC ↑ Dev ↓
NC 27 ± 2 50 ± 3 0.96
TP 26 ± 2 86 ± 8 0.00
Table 16: Conditioning effect on AIM [28]

also show a strong impact of TP conditioning on RC and
route deviations (Dev) for the AIM architecture, suggesting
that the target point bias does not depend on the particular
architecture used. Consistent with [28] we also observe that
AIM performs worse than TransFuser overall (-13 DS).
Target Point statistics: To show the importance of the
TP for the network predictions we train a model where we
input the TP as a TF decoder token so that we can analyse
the attention weights. We average the attention across all
layers, heads and waypoint queries. We run the model on
the validation routes and observe an avg. of 25% attention
on the TP token which is 16.5x higher than uniform. We
also test how steering correlates with the target point. The
sign of the steering angle (when larger than 1°), of our final
model, is the same as that of the target point (TP) 92% of
the time (93% for the expert) on the validation routes.

▼生成AIの回答(ChatGPT-4o mini)
①何についての記載か? + 要旨は何?

記載内容:

本節では、いくつかの追加実験結果が報告されている。特に、マルチモーダルウェイポイント（Multimodal waypoints）、ターゲットポイントバイアスに関するAIM（AIM [28]）の再現実験、ターゲットポイント（TP）の重要性に関する統計的解析が紹介されている。実験は、モデルの性能を向上させるための異なるアプローチを比較している。

要旨:

マルチモーダルウェイポイント: 複数のウェイポイントセットを予測する手法をテストした結果、従来の単一のウェイポイント（Unimodal WP）に比べて2 DS（ドライビングスコア）が低かった。マルチモーダルウェイポイントはより複雑であり、その使用に特別な理由は見つからなかった。
- 結果: マルチモーダルWP（47 ± 8 DS, 96 ± 1 RC, 1.13 Veh, 0.20 Stat）と、従来型単一ウェイポイント（49 ± 8 DS, 90 ± 4 RC, 0.70 Veh, 0.10 Stat）を比較。
- 表参照: Table 15（Multimodal waypoint representation）
AIM（AIM [28]）によるターゲットポイント（TP）バイアスの検証: TransFuserと比較し、LiDARや補助的な損失、トランスフォーマーを使用しないAIMアーキテクチャにおけるターゲットポイント（TP）条件付きのウェイポイントGRU（ゲート付きリカレントユニット）の影響を調査した結果、TP条件付きの方がルート完成度（RC）やルート偏差（Dev）に強い影響を与えることが示された。AIMは全体的にTransFuserよりも13 DS低い結果を示した。
- 結果: NC条件（27 ± 2 DS, 50 ± 3 RC, 0.96 Dev）とTP条件（26 ± 2 DS, 86 ± 8 RC, 0.00 Dev）を比較。
- 表参照: Table 16（Conditioning effect on AIM [28]）
ターゲットポイント（TP）の重要性: TPがモデルの予測に与える影響を調査するため、TPをTFデコーダートークンとして入力し、アテンションウェイトを分析。モデルがTPに対して25%の平均アテンションを示し、この値は均等な分布に比べて16.5倍高いことが確認された。また、最終モデルのステアリング角度の符号は、92%の確率でTPの符号と一致していた。

▼生成AIの回答(ChatGPT-4o mini)
⑦表に出来そうな部分があれば表でまとめて下さい

以下の内容は、表にまとめることができます。元文に記載された実験結果に基づいて、特に「マルチモーダルウェイポイント」「AIMによるターゲットポイントバイアスの影響」「ターゲットポイントの重要性」に関する数値が、表形式で整理可能です。

1 マルチモーダルウェイポイントの比較結果 (Table 15)

Method	DS (Driving Score)	RC (Route Completion)	Veh (Vehicle Collisions)	Stat (Stop Sign Infractions)
Multimodal WP	47 ± 8	96 ± 1	1.13	0.20
Unimodal WP	49 ± 8	90 ± 4	0.70	0.10

解説:

マルチモーダルウェイポイントと単一ウェイポイントの性能を比較した結果、DS（ドライビングスコア）ではマルチモーダルが2ポイント低いことが確認されたが、他のメトリクス（RC, Veh, Stat）においては顕著な差は見られなかった。表に示された通り、マルチモーダルウェイポイントは計算量が増加するにもかかわらず、性能面で特に優位性はない。

2 AIMによるターゲットポイントバイアスの影響 (Table 16)

Conditioning	DS (Driving Score)	RC (Route Completion)	Dev (Route Deviations)
NC (No Conditioning)	27 ± 2	50 ± 3	0.96
TP (Target Point Conditioning)	26 ± 2	86 ± 8	0.00

解説:

AIMにおいて、ターゲットポイント（TP）を条件にすることで、**ルート完成度（RC）とルート偏差（Dev）**が顕著に改善されることが確認された。しかし、全体のドライビングスコア（DS）では、TP条件をつけてもわずかな差（1 DS）しかなく、NC条件とTP条件の間には大きな違いはなかった。これはTP条件がルート偏差に強く影響することを示唆している。

3 ターゲットポイント（TP）の重要性の統計

Metric	Value	Explanation
Average Attention on TP	25%	TPに対して平均25%のアテンションが集中しており、均等分布に比べて16.5倍高い。
Steering Angle Correlation with TP	92%	最終モデルのステアリング角度が、ターゲットポイントの符号と92%の一致率を示す。

解説:

ターゲットポイントの重要性に関して、TPがモデルの予測に与える影響を解析した結果、TPに対して25%のアテンションが集まり、これは均等分布と比較して大きな差となっていることが分かる。また、最終モデルのステアリング角度がターゲットポイントの符号と92%一致していることから、TPが運転の意思決定に重要な役割を果たしていることが示唆されている。

▼質問(人間)
続けて下記の質問に追加で回答下さい:
これらの実験から分かったそれぞれの結論を説明下さい。

▼生成AIの回答(ChatGPT-4o)
各実験から分かった結論

マルチモーダルウェイポイントの実験結果
この実験では、マルチモーダルウェイポイントの導入が単一のウェイポイントに対してどのような性能差をもたらすかを比較した。以下の結論が得られる。

複雑さの増加に対する性能の変化: マルチモーダルウェイポイントを使用しても、単一ウェイポイントと比較して大きな性能向上は見られず、DS（ドライビングスコア）において2ポイントの減少が観察された。RC（ルート完成度）には若干の向上があったものの、複雑さが増した割に全体的な利点が少ないことが確認された。
選択の指針: より複雑な設計にもかかわらず、単一のウェイポイントと大差ない性能であるため、効率と性能の観点から単一のウェイポイント方式を推奨する結論が導かれる。

AIMにおけるターゲットポイント条件の影響
AIMモデルにおいて、ターゲットポイント（TP）条件がある場合とない場合でのパフォーマンスを比較した結果、以下の結論が得られる。

TP条件のルート偏差への影響: TP条件を加えた場合、ルート偏差（Dev）がゼロとなり、RC（ルート完成度）が大幅に向上した。これにより、TP条件がAIMモデルのルート維持能力に強く寄与していることが示された。
ドライビングスコアへの影響の限界: DS（ドライビングスコア）ではNC条件とTP条件の差は1ポイントのみであり、TP条件がスコア全体に与える影響は限定的である。これにより、TP条件は主にルート偏差と完成度に対する影響が強いことが確認される。
アーキテクチャに依存しないバイアス: 本実験はAIMモデルにおけるTPバイアスの影響がTransFuserと同様に見られたことから、TPバイアスが特定のアーキテクチャには依存しない特性であると示唆される。

ターゲットポイント（TP）の重要性の統計結果
TPに対するモデルの依存度およびその重要性を定量的に評価した結果、以下の結論が得られる。

アテンション集中度: TPに対するアテンションが他の要素よりも顕著に高く、モデルがTP情報を特に重要視していることが確認された。このことはTPが運転方針の決定において中心的な役割を果たしていることを示している。
ステアリング角度との一致率: ステアリング角度の方向がTPに対して92%の一致率を示すため、TPが車両の方向決定に大きな影響を及ぼしていることが明確になった。この結果から、TPを取り入れることでモデルの方向制御精度が向上することが示唆される。

これらの結論により、ターゲットポイント（TP）の設定が、運転方針の決定において重要な指標であると理解される。

C.6. CARLA Leaderboard

▼質問(人間)

質問内容全体はここをクリック

(論文・記事の各章節を貼り付け) C.6. CARLA Leaderboard The CARLA leaderboard [1] (we are considering version 1 in this discussion) is a test server where groups can sub- mit agents enclosed within a docker container. These agents are then evaluated on a set of 10 routes across 2 secret towns. Each route is traversed 2× with different weather conditions. Additionally, these 20 routes are then repeated 5 times with different random seeds and average metrics across all routes are reported to the user. The CARLA leaderboard serves as a standardized evaluation platform, comparable to a test set, and some works solely rely on it to compare to other work [9, 35, 41]. Significant progress has been achieved on this benchmark. Driving scores have in- creased from approximately 10 DS to 70-80 DS over a span of three years. The top performing methods on the leader- board have released code and models alongside the release of their research papers. In this study, we try to reproduce the top 4 reported results by submitting the released code to the leaderboard, either with the released model (∗) file or a retrained one (†). Method Reproduced Reported

DS ↑ RC ↑ IS ↑ DS ↑ RC ↑ IS ↑
LAV∗ [9] 25 46 0.74 62 94 0.64
Interfuser∗ [35] 34 75 0.45 76 88 0.84
TCP † [41] 48 66 0.77 70 83 0.85
TF++ (ours) 53 71 0.76 - - -
TransFuser † [12] 55 90 0.63 61 87 0.71
TF++ WP (ours) 62 78 0.81 - - -
TF++ WP Ens. (ours) 66 79 0.84 - - -
Table 17: CARLA leaderboard reproduced results.

Table 17 shows significant disparities between reported
results and the outcomes obtained when utilizing the offi-
cial codebases. Notably, the released model of the SotA
method Interfuser achieves less than half of the reported
score. With the exception of TransFuser, these differences
are larger than what we would expect from training or eval-
uation variance. The CARLA leaderboard ensures repeata-
bility (within the bounds of evaluation variance) because
one can rerun the submitted docker container. It does how-
ever not guarantee reproducibility, since released code and
models can differ from what was used to achieve the re-
ported score. Currently, it is publicly not known how to
achieve scores above 60 DS, even though some groups have
achieved such scores in the past. Given that the official
codebases incorporate the main concepts discussed in the
respective papers, these results indicate that factors not ex-
plicitly emphasized in the papers significantly influence the
outcomes observed on the CARLA leaderboard.
Note, that it has been documented in [12] (Table 5) that
some changes can have a significant impact on performance
on the leaderboard (+19 DS), even though the same changes
do not generalize to the public towns (-2 DS on Longest6).
This discrepancy can be attributed, in part, to substantial
fluctuations in RC, which we don’t observe in the publicly
available towns. The evaluation is secret, so the exact rea-
sons for these fluctuations are unclear. The LAV results
are crossed out because the low score can be attributed to
CUDA out of memory errors stemming from the released
software, that appear after route 53. The CARLA leader-
board fails silently and gives 0 DS on routes where the soft-
ware crashes. We contacted the organizers in this case be-
cause some of the auxiliary metrics looked unusual.
Furthermore, the auxiliary metrics on the CARLA
leaderboard are typically not reported because they are
known to be incorrectly computed1. The results presented
in Table 17 were obtained with released code and models
around April 2023. It should be noted that these results are
subject to change if the authors decide to update their repos-
itories. We submit TF++ and TF++ WP (see Section C.1) to
the CARLA leaderboard. Surprisingly, we observe a large
difference of 9 DS between the representations. The reason
for that difference is unclear because the effect is not re-
producible on the publicly available towns. Similar to prior
work [12], we submit an ensemble with 3 training seeds of
our best model to the leaderboard. The ensemble increases
the driving score by 4 points and is the best publicly avail-
able model, at the time of writing.
Discussion: The CARLA leaderboard aimed at repro-
ducing success in past benchmarks like ImageNet for au-
tonomous driving. It has led to similar fast progress, al-
though the resulting methods are not always reproducible
and not very well understood. There are structural changes
the community could make that we think should be consid-
ered when designing future benchmarks.
The CARLA leaderboard does not have an official vali-
dation benchmark. Since validation is fundamental to ma-
chine learning development, authors have proposed vari-
ous validation benchmarks [9, 11, 12, 28, 43] using the pub-
licly available towns. They differ along various axes: the
towns, route length, scenarios and wheathers used for vali-
dation. Based on our experience, none of the available vali-
dation benchmarks can reliably predict performance on the
CARLA leaderboard. It is hard for authors to make valida-
tion routes that are well aligned because the test routes are
secret. As a consequence, results on the CARLA leader-
board are often unexpected, requiring authors to run ad-
ditional investigations [12]. Other authors choose not to
use the leaderboard in their publications [19, 22, 31, 42, 43].
Submissions on the CARLA leaderboard may encounter a
pending status if all servers are in use. During this project,
we have encountered pending times of up to 3 weeks. Sub-
missions themselves then take more than four days to eval-
uate, leading to total evaluation times of up to 4 weeks. For
comparison, we can run a similar amount of simulation on
our cluster in 3–6 hours, by evaluating all routes in paral-
lel. Achieving this efficiency requires a scalable infrastruc-
ture but uses the same amount of computational resources
as when evaluating routes sequentially. We encourage or-
ganizers of future testing benchmarks to release an aligned
public validation benchmark that authors can use to do rapid
experiments on.
The CARLA leaderboard is a benchmark that is not asso-
ciated with any dataset. Consequently, authors are required
to create their own datasets for training their methods. In
the early stages, the quality of these datasets, including sen-
sor configurations, label accuracy, and scale, was often sub-
par. Substantial progress has been made by enhancing the
quality of the data itself [21]. Most papers, including this
one, introduce a new dataset alongside their method. While
this practice allows for innovation in dataset curation tech-
niques, it also complicates fair comparisons between dif-
ferent methods. Additionally, it can obscure the source of
performance improvements if the dataset is not studied as
well. Although there have been some efforts in dataset cu-
ration [8,43], such efforts could be more fruitful if there was
a standardized dataset to compare to.

▼生成AIの回答(ChatGPT-4o mini)
①何についての記載か? + 要旨は何?

内容の概要
本節「C.6 CARLA Leaderboard」は、CARLA leaderboardバージョン1の評価基盤を対象とし、このリーダーボードが提供する評価環境や制約、また再現性と結果のばらつきに関する問題点を論じている。特に、リーダーボードに提出されたエージェントの再現性の確保やパフォーマンスの測定方法の不備が指摘されている。本研究では、CARLA leaderboard上でのトップ4手法の再現実験を通じて、実際の再現結果と報告された結果との間に大きな差異があることを示している（表17）。さらに、リーダーボードの検証方法の限界や評価効率の問題点、そして訓練データセットの質のばらつきが性能の向上や手法間の公平な比較に与える影響が強調されている。

要旨

CARLA Leaderboardの評価環境: CARLA leaderboardは、評価用サーバーであり、参加者はエージェントをdockerコンテナで提出し、2つのシークレットタウンにおける10ルートで評価される。このルートは異なる天候で2回ずつ、さらに異なるシードで5回再現され、すべてのルートでの平均が評価結果として報告される。
リーダーボード上のパフォーマンス進展: 過去3年間でドライビングスコア（DS）は約10から70-80まで向上し、トップの手法は論文と共にコードやモデルを公開している。
再現性の問題とスコアのばらつき: 表17に示されるように、公式コードを使用した再現実験と報告されたスコアの間には大きな差が見られる。特に、Interfuser手法の公開モデルは報告されたスコアの半分以下のDSを達成するにとどまり、再現性の課題が明らかである。このような差異は、トレーニングや評価のばらつき範囲を超えていることが示唆される。
CARLAリーダーボードにおける検証と実験の難しさ: リーダーボードには公式の検証ベンチマークが存在しないため、公開されているタウンを使って異なる検証ベンチマークが複数提案されているが、いずれもリーダーボードのテスト結果を予測する信頼性が不足している。リーダーボードでの予期しない結果や時間のかかる評価のため、他の著者の中にはリーダーボードの使用を避ける例も見られる。
評価インフラの効率問題: 提出が多くなると、最大3週間の保留時間や4日以上の評価時間が生じることがあり、リーダーボード評価の効率性が課題とされている。
データセットの標準化の必要性: CARLAリーダーボードは特定のデータセットに依存せず、各著者が自らデータセットを作成する必要があるため、データセットの品質（センサー設定、ラベル精度、データ規模）にバラつきが生じる。データの質の向上によって性能が向上することが確認されているものの、異なる手法間での公平な比較が困難であるため、標準化されたデータセットの必要性が指摘されている。

あとがき

所感

機械学習についてある程度知っていないと、生成AIに質問しても、その回答内容だけでは理解できないと思います。生成AIの回答は、論文の内容をあまり変えずに、要点をそのまま掻い摘んで回答するような形になっています。(良い意味で論文の要点をそのまま抜き出した形)

分かったこと

分かったこと・注目した点を列挙すると:

Appendix
A. Changes to TransFuser
A.1. Expert

本節は、TransFuserにおけるエキスパートドライバー（模倣学習に必要な教師ラベルを生成するアルゴリズム）の変更点や性能評価について記載している。具体的には、データ収集方法、経路追従と制御手法の詳細、および衝突回避や交通ルール遵守に関する改良点が述べられている。

データ収集方法：CARLAシミュレーターを用いて模倣ラベルを生成する際、BEV（鳥瞰視点）セグメンテーションラベルを得るため、[43]に基づき対象物をHDマップにレンダリングしている。
エキスパートドライバーの構築：経路追従と制御手法において、経路はA⋆アルゴリズムで生成し、横制御にはPIDコントローラーを採用。縦制御にはMPC（モデル予測制御）を利用しており、従来手法[12]の目標速度（4 m/s）を倍の8 m/sに設定し、速度向上を実現。
衝突回避の改良：他車両の未来位置予測には、運動モデルを利用しており、速度が上がったエージェントの安全距離確保のために、固定された静的領域ではなく、速度に応じた停止距離を動的に調整している。
歩行者対策：従来手法と比較し、歩行者が前方30メートル以内にいる場合は、事前に2 m/sまで減速することで衝突を回避。
交通ルールの強化：ストップサインや信号に対応し、進入時には適切に減速や停止を行う処理を追加。

A.2. Dataset

前節A.1で説明されたエキスパートドライバーを用いたトレーニングデータセットの生成方法や、データ処理と圧縮の詳細について記載されている。

データ収集とフレームレート：トレーニングデータセットは、[12]で用いられたトレーニングルート上でエキスパートドライバーを実行することで生成される。データ収集時には毎秒20フレーム（20 FPS）で記録し、気象条件をルートごとにランダム化している。トレーニング時には5フレームごとに学習する（実質的に4 FPS）。
スケーリング実験：スケーリング実験用のデータ収集では、各ルートでデータ収集を3回行い、気象や交通状況をランダム化することで、環境は同じでも多様なデータを生成している。CARLAシミュレーターのバージョン0.9.10におけるトラフィックマネージャは非決定的であるため、ランダム性が増している。
最終モデル用データセットの効率化：論文本文の最終モデルでは、収集データを4 FPSに調整し、全てのフレームでトレーニングする。これにより、収集データは元の5分の1のサイズとなり、公開しやすくなっている。
データの圧縮方法：ストレージ容量の削減のため、カメラ画像はJPGで圧縮し、圧縮による分布シフトを避けるために推論時にも同様の圧縮を追加している。認識ラベルはPNG形式で、深度マップは8ビット解像度で保存している。テキストデータはzip形式で圧縮し、LiDARのポイントクラウドデータはLASzip [20]を使用して圧縮（標準のzipよりも効率的）している。これにより、360度のLiDARスイープデータ全体を保存し、全てのポイントを現在のフレームの座標系に再整列している。
エキスパートドライバーの運転スコアに基づくデータフィルタリング：エキスパートドライバーが完全な運転精度を持つわけではないため、各トレーニングルートの運転スコア（Driving Score, DS）を記録し、100点のルートのみをトレーニングに使用している。
分離された経路ラベルの生成：エキスパートが追従するA⋆経路から、次の10個の2Dポイント（1メートル間隔）を保存し、分離された経路ラベルとして使用している。

A.3. Training and Architecture

新しい速度分類タスクにはクラス頻度重み付きのクロスエントロピー損失を適用し、ラベルスムージング（平滑化係数0.1）を追加している。これは、車両の停止遅延を抑えるために行われ、ブレーキのタイミングが遅れる問題に対処する意図がある。また、最終モデルではラベルスムージングを使用せず、推論時に目標速度を2m/s減少させて同様の効果を達成している。

トレーニング設定：各損失は同一の重みで加算され、合計が1になるよう正規化される。最適化にはAdamW[23]（amsgrad[30]付き）を使用し、初期学習率0.0003からエポック30で1/10に減少させる。31エポックでトレーニングを行い、最終エポックのモデルを使用する。8台の2080ti GPUによる分散データ並列で訓練し、バッチサイズ48を採用している。

アーキテクチャの特徴：アーキテクチャには11クラスのBEVセマンティックセグメンテーションを用い、カメラで可視化できる範囲に限定して予測している。LiDARと画像の2つの分岐において、各トランスフォーマーの前に1x1畳み込み層を挿入し、チャネル次元を一致させ、異なるバックボーン（本研究では両分岐にRegNetY-3.2GF[29]を採用）を使用できるようにしている。また、再現実験ではTransFuserモデルが8つのウェイポイント（250ms間隔）を予測し、2秒先までの経路を推定している。

使用センサー：[41]に準拠し、高解像度カメラとLiDARを使用している。カメラは110°の水平視野（FOV）を持ち、解像度は256×1024である。

A.4. Localization

GNSSセンサーによるローカリゼーション
自動運転車の位置特定にGNSSセンサーを利用しており、このセンサーにはガウスノイズが加わっているため、平均して約0.7メートルのローカリゼーション誤差が生じる。

従来の手法[9, 12]と同様にフィルタリングアルゴリズムを適用している。具体的には、Unscented Kalman Filter（UKF）[32,39]を採用し、Van der Merweのスケールドシグマポイントアルゴリズム[24]を用いる。
フィルタには、エキスパートドライバーモデルと同じ運動学的バイシクルモデル[8]を車両モデルとして使用しており、車両の位置、方向、速度を追跡している。

ローカリゼーション精度の向上
調整されたUKFにより、GNSSのローカリゼーション誤差は平均約0.1メートル以下まで低減される。

A.5. PID Controller

改善されたステアリング制御
高速運転時の安定性向上のため、本研究では特定の距離 aa 離れた最初のウェイポイント（または最後のウェイポイント）を参照して進行方向を設定している。低速運転（5.5 m/s以下）では a=2.25a=2.25、高速運転時には a=3.0a=3.0 として設定している。この設定により、速度に応じた適切なウェイポイントが選択され、ステアリング制御の精度向上が期待される。

進行速度についても、0.5秒先と1秒先のウェイポイント間の速度を基準として使用し、速度制御を行っている。このため、予測されたウェイポイントを直接利用する。

CARLAシミュレーターにおいて、STOPサイン上での停止が要求されるが、カメラ位置の関係で車両がサインを通過した場合、サインが隠れてしまう問題が発生していた。TF++はBEVでのSTOPサインバウンディングボックスを検出できるため、最終モデルでは最後に検出したSTOPサインの位置情報をUKFによる推定移動情報を使って追跡し、サイン上にいる場合はブレーキを作動させる設定としている。

B. TransFuser++ Implementation Details
B.1. Attention Pooling Implementation

本節では、TransFuser++におけるAttention Poolingの実装手法について詳細に説明している。

Attention Poolingの実装：[35]を参考に、8×8のBEV特徴を用いたAttention PoolingをTransformerデコーダーを通じて実装している。
特徴の圧縮と位置エンコーディングの追加：RegNetY-3.2GFバックボーンを用いて得られた1512チャンネルのBEV特徴を1×1の畳み込み層で256チャンネルに圧縮した後、サイン波状の位置埋め込みを追加し、フラット化する。
速度入力の処理：速度入力には1次元のバッチ正規化と2層のMLPで埋め込み処理を行い、256特徴次元とする。その後、学習可能な位置エンコーディングを付加し、BEVトークンと連結する。
Transformerデコーダー：8ヘッドの標準的なTransformerデコーダー[38]（gelu活性化関数[18]、6層構造、layernorm[3]）を使用し、BEVと速度トークンを処理する。

クエリとトークンの出力処理：各予測ウェイポイントに1つのクエリを、目標速度予測時には追加のクエリを設定する。トークンはGRUに入力され、その隠れ状態はMLPで埋め込まれたターゲットポイントで初期化される。

B.2. Data Augmentation

カメラのシフトと回転：カメラを±1メートル左右にシフトさせ、ヨー軸周りに±5°回転。これらの値は一様分布に基づいてランダムに抽出。
訓練時の処理：学習時に50%の確率で補正されたカメラのフレームをロードする。補正されたカメラを中心とするように、2Dウェイポイントラベルと他のデータを変換。

エラー範囲のトレーニング：ネットワークが回復するエラー範囲は、カメラシフトや回転の範囲に対応する。

C. Additional Results
C.1. Longest6 Ablations

本節では、TransFuserの強化による影響を評価するために、「Longest6」データセットで訓練時の追加実験を行い、その結果を分析している。

Transformerデコーダの導入効果：グローバル平均プーリングおよびMLPの代わりにTransformerデコーダを使用することで、ドライビングスコアが17ポイント向上し、車両および環境との衝突が改善されている（操舵性能の向上を示唆）。
シフトおよび回転拡張の導入効果：ドライビングスコアが+9ポイント改善され、衝突も減少。
二段階訓練法の効果：ドライビングスコアが+3ポイント改善。
データセットの拡張（3倍）：ドライビングスコアが69に達し、性能向上が見られた。
ウェイポイント表現（TF++ WP）の導入：ウェイポイント表現の使用により、ドライビングスコアがさらに+1ポイント向上（表13に示す）。
3シードのアンサンブル効果：過去の研究[12]に従い3つの訓練シードを組み合わせたアンサンブルにより、車両衝突が0.10減少し、Longest6での最高のドライビングスコア73を達成。

C.2. Brake Threshold

本節では、ターゲット速度予測を行うモデルにおいて、信頼度重み付けPIDコントローラのブレーキの閾値設定がパフォーマンスに与える影響を検証している。

ブレーキ閾値の影響：低い閾値を設定することで車両衝突が減少する一方で、誤検知によるブレーキングが増加するため走行完了率が低下する。

C.3. Additional Baselines

本節では、検証タウンにおける追加のベースライン実験結果を示している。

TF++ (all towns)：TF++のトレーニングに検証用タウンを含めた設定で実験。他手法とフェアな比較はできないが、20ポイントのドライビングスコア（DS）向上を示し、車両衝突や一時停止標識違反が減少した。しかし、エキスパートレベルの性能には僅かに及ばず、特に車両衝突の面で4ポイントの差が見られる。

Perception PlanT：DSは37を記録。境界ボックスが一時停止標識を含まないため、すべての一時停止標識を無視し、大きなペナルティ（1.87）を受けている。この手法はLongest6でのベンチマークで開発されたため、一時停止標識が考慮されていない。

LAV v2：DSは27と低く、主に一時停止標識や赤信号違反が多いことが原因。特にタウン02ではヨーロッパ風の信号機が配置され、追加シナリオのルート（7,8,9,10）において交差点に近づくと黄色から赤に変わる信号を無視するため、赤信号違反が頻発した。ただし、LAV v2はルート完了率や車両衝突回避では最先端（SotA）性能を示している。

C.4. Additional Examples

本節では、ターゲットポイントの重要性とその影響を示す実験例について記載されている。

図8では、意図的に歩道に進行させ、ターゲットポイントを条件とする手法（TransFuserおよびTCP）が近くのターゲットポイントに向かってウェイポイントを外挿し、車両が車線の中心に戻ることを示している。

図9では、ターゲットポイントがターンの手前にあり、TCPおよびTransFuserが右折時に道路を外れるウェイポイントを予測する様子が示されている。この問題に対処するため、TransFuserのグローバル平均プーリング+MLPアプローチをトランスフォーマーデコーダに変更することで、問題が軽減されることが確認された。

C.5. Additional Experiments

本節では、いくつかの追加実験結果が報告されている。

マルチモーダルウェイポイントは計算量が増加するにもかかわらず、性能面で特に優位性はない。

TP条件を加えた場合、ルート偏差（Dev）がゼロとなり、RC（ルート完成度）が大幅に向上した。これにより、TP条件がAIMモデルのルート維持能力に強く寄与していることが示された。

TPに対するアテンションが他の要素よりも顕著に高く、モデルがTP情報を特に重要視していることが確認された。このことはTPが運転方針の決定において中心的な役割を果たしていることを示している。

C.6. CARLA Leaderboard

本節は、CARLA leaderboardバージョン1の評価基盤を対象とし、このリーダーボードが提供する評価環境や制約、また再現性と結果のばらつきに関する問題点を論じている。

訓練データセットの質のばらつきが性能の向上や手法間の公平な比較に与える影響が強調されている。

CARLA Leaderboardの評価環境: CARLA leaderboardは、評価用サーバーであり、参加者はエージェントをdockerコンテナで提出し、2つのシークレットタウンにおける10ルートで評価される。このルートは異なる天候で2回ずつ、さらに異なるシードで5回再現され、すべてのルートでの平均が評価結果として報告される。

過去3年間でドライビングスコア（DS）は約10から70-80まで向上し、トップの手法は論文と共にコードやモデルを公開している。

CARLAリーダーボードは特定のデータセットに依存せず、各著者が自らデータセットを作成する必要があるため、データセットの品質（センサー設定、ラベル精度、データ規模）にバラつきが生じる。データの質の向上によって性能が向上することが確認されているものの、異なる手法間での公平な比較が困難であるため、標準化されたデータセットの必要性が指摘されている。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up