[Paper Summary] STN (Spatial Transformer Network)

Last updated at 2025-02-24, posted at 2025-02-24

Prerequisites

Overview of STN

STN (Spatial Transformer Network) paper, Figure 1

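STN (Spatial Transformer Network) is a study that introduces learnable transformations, such as affine transformations, into a CNN so that it can handle inputs distorted by random warping, rescaling, rotation, and similar effects, as in (a) of the figure above. An STN consists of the following three main steps.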

STN (Spatial Transformer Network) paper, Figure 2

1. Localisation Network
2. Parameterised Sampling Grid
3. Differentiable Image Sampling

Steps 1 to 3 are examined in detail in the next section.

Affine Transformation and Its Parameter Matrix

A 2D affine transformation is expressed by the following equation.

\begin{align}
\left(\begin{array}{c} x' \\ y' \end{array} \right) &= A_{\theta} \left(\begin{array}{c} x \\ y \\ 1 \end{array} \right) \\
  &= \left(\begin{array}{ccc} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{array} \right) \left(\begin{array}{c} x \\ y \\ 1 \end{array} \right)
\end{align}
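As a concrete reading of the equation above, here is a minimal Python sketch that applies a $2 \times 3$ matrix $A_{\theta}$ to a single point. The function name `apply_affine` and the example matrices are illustrative choices, not taken from the paper.

```python
def apply_affine(theta, x, y):
    """Apply a 2x3 affine matrix theta to the point (x, y).

    theta is [[t11, t12, t13], [t21, t22, t23]]; the point is treated
    as the homogeneous vector (x, y, 1).
    """
    (t11, t12, t13), (t21, t22, t23) = theta
    x_new = t11 * x + t12 * y + t13
    y_new = t21 * x + t22 * y + t23
    return x_new, y_new


# Illustrative matrices (assumptions made for this example).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scale_and_shift = [[2.0, 0.0, 1.0], [0.0, 2.0, -1.0]]

print(apply_affine(identity, 3.0, 4.0))
print(apply_affine(scale_and_shift, 1.0, 1.0))
```

The third column ($\theta_{13}, \theta_{23}$) acts as a translation, while the left $2 \times 2$ block carries scaling, rotation, and shear.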

Details of STN

Localisation Network

The Localisation Network is a neural network that computes, from a feature map, the parameters $\theta$ used for the affine transformation. Given an input feature map $U \in \mathbb{R}^{H \times W \times C}$, it is defined as follows.

\begin{align}
\theta = f_{\mathrm{loc}}(U) \in \mathbb{R}^{d}
\end{align}

For example, a 2D affine transformation is described by six parameters, as shown below, so typically $d=6$.

\begin{align}
\left(\begin{array}{c} x' \\ y' \end{array} \right) &= A_{\theta} \left(\begin{array}{c} x \\ y \\ 1 \end{array} \right) \\
  &= \left(\begin{array}{ccc} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{array} \right) \left(\begin{array}{c} x \\ y \\ 1 \end{array} \right)
\end{align}
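To make the mapping $\theta = f_{\mathrm{loc}}(U)$ with $d=6$ more tangible, the following is a toy sketch in plain Python. It is only an assumption-laden stand-in: a real localisation network would normally be a learned convolutional or fully connected network, whereas here `localisation_net` is a hypothetical global average pooling followed by one linear layer. Starting from zero weights and an identity bias is a common initialisation choice, used here so that the output is the identity transformation.

```python
def global_average_pool(U):
    """Average an H x W x C feature map (nested lists) over H and W."""
    H, W, C = len(U), len(U[0]), len(U[0][0])
    return [
        sum(U[h][w][c] for h in range(H) for w in range(W)) / (H * W)
        for c in range(C)
    ]


def localisation_net(U, weights, bias):
    """Hypothetical f_loc: pooled features -> linear layer -> theta in R^6."""
    features = global_average_pool(U)
    theta = [
        sum(w_k * f_k for w_k, f_k in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    # Reshape the 6 values into the 2x3 matrix A_theta.
    return [theta[0:3], theta[3:6]]


# Toy feature map with H = 2, W = 2, C = 2 (values are made up).
U = [[[0.5, 1.0], [1.5, 2.0]],
     [[2.5, 3.0], [3.5, 4.0]]]

# Assumed initialisation: zero weights, bias equal to the identity transform.
zero_weights = [[0.0, 0.0] for _ in range(6)]
identity_bias = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

theta = localisation_net(U, zero_weights, identity_bias)
print(theta)
```

With this initialisation the network ignores its input, so any dependence of $\theta$ on $U$ has to come from training the weights.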

Some transformations, however, constrain the parameters (for example, pure scaling has $\theta_{11}=\theta_{22}=s$ and $\theta_{12}=\theta_{13}=\theta_{21}=\theta_{23}=0$), so the dimension of $\theta$ must be chosen to suit the problem.
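The constrained case can be sketched as a function that builds the full $2 \times 3$ matrix from fewer free parameters. `scale_only_theta` follows the scaling example in the text ($d=1$); `scale_translate_theta` is a hypothetical $d=3$ variant added here purely for comparison.

```python
def scale_only_theta(s):
    """Build A_theta from a single parameter s (d = 1): isotropic scaling."""
    return [[s, 0.0, 0.0], [0.0, s, 0.0]]


def scale_translate_theta(s, tx, ty):
    """Hypothetical d = 3 variant: isotropic scaling plus translation."""
    return [[s, 0.0, tx], [0.0, s, ty]]


print(scale_only_theta(0.5))
print(scale_translate_theta(0.5, 0.1, -0.2))
```

In such a setup the localisation network would output only the free parameters, and the remaining matrix entries stay fixed.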

Parameterised Sampling Grid

STN (Spatial Transformer Network) paper, Figure 3

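The figure above is a good way to grasp the idea of the Parameterised Sampling Grid. For the regular grid $G=\{ G_{i} \}$ over the output (target) $V$, the affine transformation below, based on $\mathcal{T}_{\theta}$, is applied. It can be read as "obtaining the position on the input (source) that corresponds to each point of the output (target) grid."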

\begin{align}
\left(\begin{array}{c} x_{i}^{s} \\ y_{i}^{s} \end{array} \right) &= \mathcal{T}_{\theta}(G_{i}) \\
  &= A_{\theta} \left(\begin{array}{c} x_{i}^{t} \\ y_{i}^{t} \\ 1 \end{array} \right) \\
  &= \left(\begin{array}{ccc} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{array} \right) \left(\begin{array}{c} x_{i}^{t} \\ y_{i}^{t} \\ 1 \end{array} \right)
\end{align}
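The grid computation above can be sketched in plain Python as follows. This sketch assumes that target coordinates are normalised to $[-1, 1]$ in both axes; the helper names `linspace` and `affine_grid` are illustrative and not part of any library.

```python
def linspace(start, stop, n):
    """Return n evenly spaced values from start to stop inclusive (n >= 2)."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def affine_grid(theta, out_h, out_w):
    """Return the source coordinates (x_s, y_s) for every target grid point.

    Assumption of this sketch: target coordinates (x_t, y_t) are
    normalised to [-1, 1] along both axes.
    """
    (t11, t12, t13), (t21, t22, t23) = theta
    grid = []
    for y_t in linspace(-1.0, 1.0, out_h):
        row = []
        for x_t in linspace(-1.0, 1.0, out_w):
            x_s = t11 * x_t + t12 * y_t + t13
            y_s = t21 * x_t + t22 * y_t + t23
            row.append((x_s, y_s))
        grid.append(row)
    return grid


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
zoom_in = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]

print(affine_grid(identity, 3, 3)[0][0])
print(affine_grid(zoom_in, 3, 3)[0][0])
```

Note the direction of the mapping: with `zoom_in`, the target corners point to source positions closer to the centre, so the output shows a magnified central region of the input.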

Differentiable Image Sampling

Sampling is then performed from the input (source) positions, obtained in the previous subsection, that correspond to the output (target) grid.
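One way to carry out this sampling is bilinear interpolation, which is what the sketch below assumes. It handles a single-channel image, takes source coordinates normalised to $[-1, 1]$, and treats pixels outside the image as zero. All of these are assumptions of the example, and the names `to_pixel` and `bilinear_sample` are illustrative.

```python
import math


def to_pixel(coord, size):
    """Map a normalised coordinate in [-1, 1] to a pixel position in [0, size - 1]."""
    return (coord + 1.0) * (size - 1) / 2.0


def bilinear_sample(U, x_s, y_s):
    """Sample a single-channel H x W image U at the source point (x_s, y_s).

    Assumptions of this sketch: coordinates are normalised to [-1, 1],
    and pixels outside the image contribute zero.
    """
    H, W = len(U), len(U[0])
    px, py = to_pixel(x_s, W), to_pixel(y_s, H)
    x0, y0 = math.floor(px), math.floor(py)
    value = 0.0
    # Only the four neighbouring pixels can have a non-zero weight.
    for n in (y0, y0 + 1):
        for m in (x0, x0 + 1):
            if 0 <= n < H and 0 <= m < W:
                weight = max(0.0, 1.0 - abs(px - m)) * max(0.0, 1.0 - abs(py - n))
                value += weight * U[n][m]
    return value


image = [[0.0, 1.0],
         [2.0, 3.0]]

print(bilinear_sample(image, -1.0, -1.0))  # top-left corner
print(bilinear_sample(image, 0.0, 0.0))    # centre of the image
```

Because each weight is piecewise linear in the sampling position, the output has (sub-)gradients with respect to both the image values and the source coordinates, which is what lets gradients flow back to $\theta$.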
