Paper summary: VGGT: Visual Geometry Grounded Transformer

Last updated at 2025-07-14, posted at 2025-07-14

Introduction

A rough summary of the paper that received the Best Paper Award at CVPR 2025, held last month.

Several parts are still work in progress.

paper

project page

github

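Overview

VGGT removes the need for conventional optimization such as BA (bundle adjustment) and estimates 3D information (camera parameters, depth maps, point clouds, tracking) directly from images.
As a result, inference is very fast while accuracy remains high.
As for the neural network architecture, it alternates between Global Attention and Frame Attention.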
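
The alternation of Frame Attention and Global Attention can be illustrated with a minimal NumPy sketch. It assumes tokens of shape (N frames, P tokens per frame, D channels), a single head with no learned projections, and frame attention applied before global attention. These choices, and the names `self_attention` and `alternating_attention`, are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention without learned weights (illustration only).

    tokens: array of shape (batch, T, D); attention runs over the T axis.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def alternating_attention(tokens, num_blocks=2):
    """tokens: array of shape (N, P, D) = (frames, tokens per frame, channels)."""
    n, p, d = tokens.shape
    for _ in range(num_blocks):
        # Frame attention: each frame attends only to its own P tokens.
        tokens = tokens + self_attention(tokens)
        # Global attention: all N * P tokens attend to each other.
        flat = tokens.reshape(1, n * p, d)
        tokens = tokens + self_attention(flat).reshape(n, p, d)
    return tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 8))   # 3 frames, 4 tokens each, 8 channels
y = alternating_attention(x)
print(y.shape)
```

The only difference between the two attention types in this sketch is the reshape: frame attention keeps the frame axis as a batch axis, while global attention flattens all frames into one sequence.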

Background

Conventional SfM (Structure from Motion)

Conventional SfM (Structure from Motion) required an optimization phase such as bundle adjustment.

\begin{equation}
\min_{\{X_j, R_i, t_i, K_i\}} \sum_{i=1}^{N} \sum_{j=1}^{M} v_{ij} \cdot \left\| \mathbf{x}_{ij} - \pi\left(K_i, R_i, t_i, X_j \right) \right\|^2
\end{equation}

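The symbols are:
$X_j \in \mathbb{R}^3$: a point in 3D space (the $j$-th point)
$R_i \in SO(3)$, $t_i \in \mathbb{R}^3$: rotation matrix and translation vector of camera $i$ (extrinsic parameters)
$K_i$: intrinsic parameters of camera $i$ (a matrix containing the focal length, principal point, etc.)
$\mathbf{x}_{ij} \in \mathbb{R}^2$: observed position of point $X_j$ in the image of camera $i$
$\pi(\cdot)$: perspective projection function of the camera model (projects a 3D point onto the 2D image plane)
$v_{ij} \in \{0,1\}$: visibility mask indicating whether point $X_j$ is visible from camera $i$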
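
As a concrete reading of the objective above, the following sketch evaluates the masked reprojection error for given cameras and points. It assumes a pinhole model with the world-to-camera convention $R_i X_j + t_i$; the function names `project` and `ba_objective` and the toy values are hypothetical. Bundle adjustment would minimize this value over all parameters, which is the step VGGT avoids.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection pi(K, R, t, X) of a 3D point to 2D pixel coordinates."""
    x_cam = R @ X + t            # world -> camera coordinates (assumed convention)
    x_img = K @ x_cam            # camera -> homogeneous image coordinates
    return x_img[:2] / x_img[2]  # perspective division

def ba_objective(Ks, Rs, ts, Xs, obs, vis):
    """Sum of squared reprojection errors over visible (camera, point) pairs.

    obs: (N, M, 2) observed 2D positions, vis: (N, M) visibility mask in {0, 1}.
    """
    total = 0.0
    for i in range(len(Rs)):
        for j in range(len(Xs)):
            if vis[i, j]:
                r = obs[i, j] - project(Ks[i], Rs[i], ts[i], Xs[j])
                total += float(r @ r)
    return total

# Toy example: one camera at the origin looking along +z, two points.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
Ks, Rs, ts = [K], [np.eye(3)], [np.zeros(3)]
Xs = [np.array([0.0, 0.0, 2.0]), np.array([1.0, 0.0, 2.0])]
obs = np.array([[[50.0, 50.0], [100.0, 50.0]]])  # exact projections
vis = np.ones((1, 2), dtype=int)
print(ba_objective(Ks, Rs, ts, Xs, obs, vis))
```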

Overview of VGGSfM

Overview of Fast3R

This is also a CVPR 2025 paper.
https://arxiv.org/abs/2501.13928

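Main similarities

Item	Fast3R	VGGT
🎯 Goal	3D reconstruction from multiple views in a single forward pass	Same
🔧 Model structure	Transformer-based	Transformer-based
📥 Input	Hundreds to a thousand RGB images (up to 2,000)	Up to 200 RGB images
🧠 Inference-time optimization	None (feed-forward)	Same
⛓ Camera parameters	Estimated by the model (relative poses for Fast3R)	Estimated by the model (rotation, translation, FOV)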

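Main differences

Item	Fast3R	VGGT
3D representation	Triangle mesh (via TSDF)	Point cloud or depth map (per camera)
Output format	Estimates a TSDF representation, then meshes it with Marching Cubes	Outputs per-camera depth, 3D points, tracks, etc.
Reconstruction quality	Can generate high-quality meshes	Compact geometric reconstruction based on point clouds and depth
Camera token	None (relative pose is learned through attention)	Yes (pose and FOV are predicted directly from the camera token)
Camera output	Relative rotation / translation matrices	Rotation (quaternion), position, field of view
Input token structure	Patch Embedding + Position Embedding + Pose Attention	DINO tokens + Camera Token + Register Token
Transformer scheme	Pose-aware Spatio-Temporal Attention	Alternating Attention (alternating between the image axis and the camera axis)
Training data	BlendedMVS, ScanNet, etc.	ScanNet, MegaDepth, BlendedMVS, etc.
Model size	2.1B (Base) or smaller variants	About 1.2B
Training time	Over 1,000 hours (largest model)	About 9 days on 64 GPUs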

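Conventional MVS (Multi-View Stereo)

Conventional MVS also has an optimization phase.

Tracking-Any-Point

A task in which, given multiple frames and a point in one of them, the model predicts where that point is located in the other frames.
The existing models are all specific to Tracking-Any-Point.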

Neural network architecture

backbone

AA (Alternating Attention)

head

Training

Overall loss

\begin{equation}
\mathcal{L} = \lambda_{\text{depth}} \cdot \mathcal{L}_{\text{depth}} + 
              \lambda_{\text{point}} \cdot \mathcal{L}_{\text{point}} +
              \lambda_{\text{camera}} \cdot \mathcal{L}_{\text{camera}} +
              \lambda_{\text{track}} \cdot \mathcal{L}_{\text{track}}
\end{equation}

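$\mathcal{L}_{\text{depth}}$: L1 loss on the depth map (or relative error)
$\mathcal{L}_{\text{point}}$: Chamfer distance on the point cloud (3D coordinates)
$\mathcal{L}_{\text{camera}}$: loss on camera pose and field of view (quaternion distance for rotation, L2 error for position)
$\mathcal{L}_{\text{track}}$: loss on 3D point tracking (e.g. distance between features)

Chamfer distance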

\begin{equation}
\mathcal{L}_{\text{point}} = \frac{1}{|\hat{P}|} \sum_{x \in \hat{P}} \min_{y \in P} \| x - y \|_2^2 +
                             \frac{1}{|P|} \sum_{y \in P} \min_{x \in \hat{P}} \| y - x \|_2^2
\end{equation}
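
The Chamfer distance above can be computed directly with a brute-force pairwise distance matrix. The sketch below assumes small point clouds that fit in memory as an (A, B) matrix; the name `chamfer_distance` is illustrative, and larger inputs would normally need a nearest-neighbour structure instead.

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance with squared L2 norm.

    pred: (A, 3) predicted points, gt: (B, 3) ground-truth points.
    """
    diff = pred[:, None, :] - gt[None, :, :]   # (A, B, 3) pairwise differences
    d2 = np.sum(diff ** 2, axis=-1)            # (A, B) squared distances
    pred_to_gt = d2.min(axis=1).mean()         # first term: average over pred
    gt_to_pred = d2.min(axis=0).mean()         # second term: average over gt
    return float(pred_to_gt + gt_to_pred)

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0]])
print(chamfer_distance(pred, gt))
```

In this toy case the second predicted point is at squared distance 1 from the only ground-truth point, so the first term is 0.5 and the second term is 0.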

Rotation error

\begin{equation}
\mathcal{L}_{\text{rot}} = \| \hat{q} - q \|_2
\end{equation}
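
A minimal sketch of this rotation error follows. One caveat worth keeping in mind is that $q$ and $-q$ represent the same rotation, so the plain L2 distance can be large even when the rotations agree. The sign-aligned variant below is an assumption added for illustration and is not taken from the article; both function names are hypothetical.

```python
import numpy as np

def quat_l2(q_pred, q_gt):
    """Plain L2 distance between two quaternions, as in the formula above."""
    return float(np.linalg.norm(q_pred - q_gt))

def quat_l2_sign_aligned(q_pred, q_gt):
    """Assumed variant: flip q_pred when it lies in the opposite hemisphere,
    because q and -q encode the same rotation."""
    if np.dot(q_pred, q_gt) < 0:
        q_pred = -q_pred
    return quat_l2(q_pred, q_gt)

q = np.array([1.0, 0.0, 0.0, 0.0])       # identity rotation
print(quat_l2(-q, q))                    # same rotation, but nonzero distance
print(quat_l2_sign_aligned(-q, q))
```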

Depth map error

\begin{equation}
\mathcal{L}_{\text{depth}} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{d}_i - d_i \right|
\end{equation}

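Experiments and results

Camera parameters

AUC@30 was computed from the errors between the estimated translation and rotation and their ground-truth values, and compared with existing models.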
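
One common way to compute an AUC@30 score from angular errors is sketched below. The article does not spell out the exact procedure, so the details here are assumptions: per-pair rotation and translation errors are given in degrees, the larger of the two is used for each pair, and accuracy is averaged over integer thresholds from 1 to 30 degrees. The name `auc_at` is hypothetical.

```python
import numpy as np

def auc_at(rot_err_deg, trans_err_deg, max_threshold=30):
    """Area under the accuracy-vs-threshold curve up to max_threshold degrees."""
    # Assumption: a pair counts as correct only if both errors are small.
    err = np.maximum(rot_err_deg, trans_err_deg)
    thresholds = np.arange(1, max_threshold + 1)
    # Accuracy at each threshold: fraction of pairs with error below it.
    acc = [(err < t).mean() for t in thresholds]
    return float(np.mean(acc))

rot_err = np.array([0.0, 100.0])    # degrees, two image pairs
trans_err = np.array([0.0, 5.0])
print(auc_at(rot_err, trans_err))
```

Here the first pair is correct at every threshold and the second pair at none, so the score is 0.5.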

multi-view depth estimation

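Depth accuracy was compared with other models on the DTU dataset.
The metrics were the smallest Euclidean distance between the predicted depth and the ground truth (defined below) and the Chamfer distance (defined above).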

\begin{equation}
\mathcal{L}_{\text{one-sided}} = \frac{1}{|\hat{P}|} \sum_{x \in \hat{P}} \min_{y \in P} \| x - y \|_2
\end{equation}

\begin{equation}
\text{Accuracy} = \frac{1}{|\hat{P}|} \sum_{x \in \hat{P}} \mathbb{I} \left[ \min_{y \in P} \|x - y\|_2 < \tau \right]
\end{equation}

\begin{equation}
\text{Completeness} = \frac{1}{|P|} \sum_{y \in P} \mathbb{I} \left[ \min_{x \in \hat{P}} \|y - x\|_2 < \tau \right]
\end{equation}

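$\hat{P}$: predicted point cloud
$P$: ground-truth (GT) point cloud
$x \in \hat{P}$: an arbitrary point in the predicted point cloud
$y \in P$: an arbitrary point in the GT point cloud
$\| x - y \|_2$: Euclidean distance between predicted point $x$ and GT point $y$
$\min_{y \in P} \| x - y \|_2$: distance from point $x$ to its nearest GT point
$\frac{1}{|\hat{P}|} \sum$: average over the entire predicted point cloud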
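
The three metrics above share the same nearest-neighbour distance, so they can be sketched with one helper. This is a brute-force illustration for small point clouds; the function names and the toy points are assumptions, and $\tau$ is passed as `tau`.

```python
import numpy as np

def nearest_distances(src, dst):
    """For each point in src (A, 3), the Euclidean distance to its nearest point in dst (B, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1)).min(axis=1)

def one_sided_distance(pred, gt):
    # Mean distance from each predicted point to its nearest GT point.
    return float(nearest_distances(pred, gt).mean())

def accuracy(pred, gt, tau):
    # Fraction of predicted points within tau of some GT point.
    return float((nearest_distances(pred, gt) < tau).mean())

def completeness(pred, gt, tau):
    # Fraction of GT points within tau of some predicted point.
    return float((nearest_distances(gt, pred) < tau).mean())

pred_pts = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
gt_pts = np.array([[0.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
print(one_sided_distance(pred_pts, gt_pts))
print(accuracy(pred_pts, gt_pts, tau=1.0), completeness(pred_pts, gt_pts, tau=1.0))
```

Accuracy penalizes predicted points that are far from the surface, while completeness penalizes parts of the ground truth that the prediction fails to cover.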

Point Map Estimation

Omitted.

Image Matching

Omitted.

ablation study

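Thoughts

There are similar approaches that perform SfM and related tasks quickly, such as Fast3R, but VGGT is considerably more accurate even compared with them. If so, the main factor may be that AA provides high representational power per unit of compute.
As for whether it could become a WORLD MODEL, challenges remain: it does no prediction, and although 3D reconstruction is fast, it still relies on an H100.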
