ベイズで画像を読む：顔画像の次元数を推定する

Last updated at 2025-06-18Posted at 2025-06-18

はじめに

こんにちは、事業会社で働いているデータサイエンティストです。

「データの次元数」と聞いて、あなたは何を思い浮かべますか？

たとえば、word2vecの次元数を36や72に設定するのはよくある話ですが、これらは多くの場合、データに基づく判断ではなく、かなり恣意的な選択です。極端に言えば、「6月18日が誕生日だから次元数を618にした」といった行為と本質的には変わりません。

本記事では、こうした恣意性から脱却するために、ベイズ機械学習の棒折り過程を活用し、画像データの復元に必要な潜在次元数を自動的に推定する方法をご紹介します。

「R言語やベイズ統計では画像データを扱えない」といった誤解を耳にすることがありますが、実際には十分に可能です。この記事を通して、その実例をぜひ体感してみてください！

モデル定式化

次元数の自動推定

本モデルでは、画像データの構造を効率よく表現するために、必要な次元数をデータから自動で推定します。その鍵となるのが、**棒折り過程（Stick-Breaking Process）**です。

まず、次元の重要度を決定するためのハイパーパラメータ $\gamma$ を、以下のようにガンマ分布からサンプリングします：

$$
\gamma \sim Gamma(1, 1)
$$

続いて、棒折り過程に基づき、各次元 $d$ に対応する重み $\omega_d$ を構築します：

$$
\delta_d \sim Beta(1, \gamma)
$$

$$
\omega_d = \delta_d \prod\limits_{l=1}^{d - 1} (1 - \delta_l)
$$

この処理は $d = 1$ から $d = \infty$ まで理論的に繰り返されます。

詳細は後述しますが、本記事で紹介するモデルでは、ピクセルの行ベクトルおよび列ベクトルが無限次元の潜在空間を持ちます。しかし、実際には多くの $\omega_d$ が0に非常に近いため、モデルが実際に活用する次元数は有限であり、これにより必要な次元数を自動的に推定することが可能になります。

ピクセル行ベクトルと列ベクトル

画像データは、本質的にはピクセルの行列です。したがって、行IDおよび列IDごとに、それぞれ無限次元の潜在ベクトルを以下のように定義します。

すべての行$i$について：

$$
\alpha_{i} \sim Normal_{\infty}(0, 1)
$$

すべての列$j$について：

$$
\beta_{j} \sim Normal_{\infty}(0, 1)
$$

画像生成モデル

最後に、画像$f$の$i$行目、$j$列目のピクセル強度は離散値（整数）なので、ポワソン分布に従ってサンプリングします：

$$
Pixel_{f, i, j} \sim Poisson\left( \exp\left( \gamma + \sum_{d=1}^{\infty} \omega_d \cdot \alpha_{i,d} \cdot \beta_{j,d} \right) \right)
$$

このようにして、画像の各ピクセルを、次元ごとの寄与の重み付き積として再構成することが可能になります。

Stanコード

Stanでの実装コードはこちらです。

vision_embedding.stan

functions {
  vector stick_breaking(vector breaks){
    int length = size(breaks) + 1;
    vector[length] result;
    
    result[1] = breaks[1];
    real summed = result[1];
    for (d in 2:(length - 1)) {
      result[d] = (1 - summed) * breaks[d];
      summed += result[d];
    }
    result[length] = 1 - summed;
    
    return result;
  }
  
  real partial_sum_lpmf(
    array[] int pixel_id,
    int start, int end,
    
    array[] int y,
    array[] int pixel_row, array[] int pixel_col,
    array[] int pixel_begin, array[] int pixel_end,
    
    vector dimension, real intercept, 
    array[] vector row_embedding,
    array[] vector col_embedding
  ){
    vector[end - start + 1] log_likelihood;
    int count = 1;
    for (i in start:end){
      log_likelihood[count] = poisson_log_lpmf(y[pixel_begin[i]:pixel_end[i]] | intercept + (dimension .* row_embedding[pixel_row[i]]) '* col_embedding[pixel_col[i]]);
      count += 1;
    }
    return sum(log_likelihood);
  }
}
data {
  int row_type;
  int col_type;
  
  int dimension_type;
  
  int pixel_type;
  array[pixel_type] int pixel_id;
  array[pixel_type] int pixel_row;
  array[pixel_type] int pixel_col;
  array[pixel_type] int pixel_begin;
  array[pixel_type] int pixel_end;
  
  int N;
  array[N] int y;
}
parameters {
  real<lower=0> dimension_alpha;
  vector<lower=0, upper=1>[dimension_type - 1] dimension_breaks;
  
  array[row_type] vector[dimension_type] row_embedding;
  array[col_type] vector[dimension_type] col_embedding;
  
  real intercept;
}
transformed parameters {
  simplex[dimension_type] dimension;
  
  dimension = stick_breaking(dimension_breaks);
}
model {
  dimension_alpha ~ gamma(1, 1);
  dimension_breaks ~ beta(1, dimension_alpha);
  
  for (i in 1:row_type){
    row_embedding[i] ~ normal(0, 1);
  }
  
  for (i in 1:col_type){
    col_embedding[i] ~ normal(0, 1);
  }
  
  intercept ~ normal(0, 10);
  
  target += reduce_sum(
    partial_sum_lpmf, pixel_id, 1,
    
    y,
    pixel_row, pixel_col,
    pixel_begin, pixel_end,
    
    dimension, intercept, 
    row_embedding,
    col_embedding
  );
}
generated quantities {
  array[dimension_type] vector[pixel_type] extracted_patterns;
  array[dimension_type] vector[pixel_type] convolution;
  vector[pixel_type] avg_image;
  for (i in 1:dimension_type){
    for (j in 1:pixel_type){
      extracted_patterns[i, j] = dimension[i] * row_embedding[pixel_row[j], i] * col_embedding[pixel_col[j], i];
    }
  }
  
  for (j in 1:pixel_type){
    convolution[1, j] = dimension[1] * row_embedding[pixel_row[j], 1] * col_embedding[pixel_col[j], 1];
  }
  
  for (i in 2:dimension_type){
    for (j in 1:pixel_type){
      convolution[i, j] = convolution[i - 1, j] + dimension[i] * row_embedding[pixel_row[j], i] * col_embedding[pixel_col[j], i];
    }
  }
  
  for (i in 1:pixel_type){
    avg_image[i] = poisson_log_rng(intercept + (dimension .* row_embedding[pixel_row[i]]) '* col_embedding[pixel_col[i]]);
  }
}

実装のポイントは、どの画像でも、$i$行目、$j$列目のピクセルの値が$Poisson\left( \exp\left( \gamma + \sum_{d=1}^{\infty} \omega_d \cdot \alpha_{i,d} \cdot \beta_{j,d} \right) \right)$に従うので、Stan言語のベクトル表記で一気にサンプリングすることで、（一般化された）内積の回数を減らすことができて、高速な計算を実現することが可能になります。

最後のgenerated quantitiesブロックの中のconvolutionでは、各ピクセルに対して、次元ごとの寄与を累積的に加算していく処理を行っています。具体的には、1次元目から順に、各次元が画像再構成にどれだけ貢献しているかを加算しながら記録します。

まず、1次元目（$d=1$）の寄与を以下のように計算します：

for (j in 1:pixel_type){
  convolution[1, j] = dimension[1] * row_embedding[pixel_row[j], 1] * col_embedding[pixel_col[j], 1];
}

これは、各ピクセル$j$に対して、1次元目の重み$\omega_{1}$と、そのピクセルの行・列のベクトル$\alpha_{i,1}$および$\beta_{j,1}$の積を記録しています。

続いて、2次元目以降では、前の次元までの累積に新たな寄与を加えていきます：

for (i in 2:dimension_type){
  for (j in 1:pixel_type){
    convolution[i, j] = convolution[i - 1, j] + dimension[i] * row_embedding[pixel_row[j], i] * col_embedding[pixel_col[j], i];
  }
}

この処理により、convolution[i, j]には「1次元目から$i$次元目までの寄与の合計」が格納されます。これを可視化することで、「どの次元まで加えれば十分に画像が再構成できるか」「各次元がどれだけ画像の再構成に貢献しているか」といった情報を分析することが可能になります。

要するに、これは次元ごとの「画像復元の寄与」を生成するための準備処理です。

前処理とモデル推定

まず、今回の分析で利用するFrey Faceデータの中身を見てみましょう。まずデータを読み込みます：

library(RnavGraphImageData)
face_df <- snedata::frey_faces()

データ量を調べてみましょう：

> nrow(face_df)
[1] 1965

1965枚です！

次に、9件ピックアップして可視化しましょう；

par(mfrow = c(3, 3))

set.seed(1)
(1:nrow(face_df)) |>
  sample(9) |>
  purrr::walk(
    \(i){
      snedata::show_frey_face(face_df, i)
    }
  )

こんな感じの顔画像データです。

次に、行idと列idを付与する形で縦持ちにします：

face_long_df <- face_df |>
  nrow() |>
  seq_len() |>
  purrr::map(
    \(i){
      matrix(t(face_df[i, 560:1]), ncol = 28, nrow = 20) |>
        as.data.frame() |>
        tibble::rowid_to_column(var = "row_id") |>
        tibble::tibble() |>
        tidyr::pivot_longer(!row_id, names_to = "col_id", values_to = "value") |>
        dplyr::mutate(
          col_id = col_id |>
            stringr::str_remove_all("V") |>
            as.integer(),
          pixel_id = dplyr::row_number()
        )
    },
    .progress = TRUE
  ) |>
  dplyr::bind_rows() |>
  dplyr::mutate(
    value_std = (value - mean(value))/sd(value)
  ) |>
  dplyr::arrange(pixel_id)

そして、一気にサンプリングするため、i行j列のピクセルの位置をまとめるマスター表も用意します：

begin_end_df <- face_long_df |>
  dplyr::mutate(
    rid = dplyr::row_number()
  ) |>
  dplyr::summarise(
    begin = min(rid),
    end = max(rid),
    .by = c(pixel_id, row_id, col_id)
  )

次に、モデルのコンパイル

m_emb_init <- cmdstanr::cmdstan_model("vision_embedding.stan",
                                      cpp_options = list(
                                        stan_threads = TRUE
                                      )
                                      )

と推定を実施します：

> m_emb_estimate <- m_emb_init$variational(
     threads = 10,
     seed = 1,
     data = list(
         row_type = max(face_long_df$row_id),
         col_type = max(face_long_df$col_id),
         
         dimension_type = 30,
         
         pixel_type = nrow(begin_end_df),
         pixel_id = 1:nrow(begin_end_df),
         pixel_row = begin_end_df$row_id,
         pixel_col = begin_end_df$col_id,
         pixel_begin = begin_end_df$begin,
         pixel_end = begin_end_df$end,
         
         N = nrow(face_long_df),
         y = face_long_df$value
     )
 )
------------------------------------------------------------ 
EXPERIMENTAL ALGORITHM: 
  This procedure has not been thoroughly tested and may be unstable 
  or buggy. The interface is subject to change. 
------------------------------------------------------------ 
Gradient evaluation took 0.003795 seconds 
1000 transitions using 10 leapfrog steps per transition would take 37.95 seconds. 
Adjust your expectations accordingly! 
Begin eta adaptation. 
Iteration:   1 / 250 [  0%]  (Adaptation) 
Iteration:  50 / 250 [ 20%]  (Adaptation) 
Iteration: 100 / 250 [ 40%]  (Adaptation) 
Iteration: 150 / 250 [ 60%]  (Adaptation) 
Iteration: 200 / 250 [ 80%]  (Adaptation) 
Iteration: 250 / 250 [100%]  (Adaptation) 
Success! Found best value [eta = 0.1]. 
Begin stochastic gradient ascent. 
  iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes  
   100   -211705702.793             1.000            1.000 
   200   -114203944.728             0.927            1.000 
   300    -63545538.491             0.884            0.854 
   400    -44157769.089             0.773            0.854 
   500    -33853907.591             0.679            0.797 
   600    -26385780.428             0.613            0.797 
   700    -21533707.808             0.558            0.439 
   800    -18139343.896             0.511            0.439 
   900    -16066299.787             0.469            0.304 
  1000    -13922824.108             0.437            0.304 
  1100    -13082594.581             0.344            0.283 
  1200    -12022377.453             0.267            0.225 
  1300    -11479373.868             0.192            0.187 
  1400    -11022133.570             0.152            0.154 
  1500    -10360369.929             0.128            0.129 
  1600     -9926446.930             0.104            0.088 
  1700     -9441953.477             0.087            0.064 
  1800     -9017412.185             0.073            0.064 
  1900     -8610567.808             0.065            0.051 
  2000     -8415797.302             0.052            0.047 
  2100     -8042753.274             0.050            0.047 
  2200     -7858972.290             0.043            0.047 
  2300     -7714675.407             0.041            0.046 
  2400     -7583842.746             0.038            0.046 
  2500     -7498711.300             0.033            0.044 
  2600     -7417643.180             0.030            0.023 
  2700     -7346284.244             0.026            0.023 
  2800     -7311877.405             0.021            0.019 
  2900     -7266400.320             0.017            0.017 
  3000     -7203395.453             0.016            0.011 
  3100     -7174978.468             0.012            0.011 
  3200     -7134567.740             0.010            0.010   MEAN ELBO CONVERGED   MEDIAN ELBO CONVERGED 
Drawing a sample of size 1000 from the approximate posterior...  
COMPLETED. 
Finished in  38.4 seconds.

推定結果をデータフレイムに保存します：

m_emb_summary <- m_emb_estimate$summary()

推定結果可視化

まず、次元のウエイト$\omega$を可視化します：

m_emb_summary |>
  dplyr::filter(stringr::str_detect(variable, "^dimension\\[")) |>
  ggplot2::ggplot() + 
  ggplot2::geom_bar(ggplot2::aes(x = as.factor(1:30), y = mean), 
                    stat = "identity", fill = ggplot2::alpha("blue", 0.3)) + 
  ggplot2::geom_errorbar(
    ggplot2::aes(x = as.factor(1:30), ymin = q5, ymax = q95),
    width = 0.2
  ) + 
  ggplot2::labs(
    x = "次元",
    y = "次元重要度"
  ) +
  ggplot2::theme_gray(base_family = "HiraKakuPro-W3")

画像データなのに、そんなに大量の次元がなくてもいけそうですね！

では次に、それぞれの次元が顔画像のどんな特性を学習しているかを見てみましょう：

par(mfrow = c(3, 3))

(1:9) |>
  purrr::walk(
    \(i){
      im <- begin_end_df |>
        dplyr::bind_cols(
          e = m_emb_summary |>
            dplyr::filter(stringr::str_detect(variable, "^extracted_patterns\\[")) |>
            dplyr::mutate(
              id = variable |>
                purrr::map(\(x) {
                  as.integer(stringr::str_split(x, "\\[|\\]|,")[[1]][2:3])
                })
            ) |>
            tidyr::unnest_wider(id, names_sep = "_") |>
            dplyr::filter(id_1 == i) |>  # Change here: iterate over id_1 = 1 to 9
            dplyr::mutate(
              mean = exp(mean)
            ) |>
            dplyr::pull(mean)
        ) |>
        dplyr::select(row_id, col_id, e) |>
        tidyr::pivot_wider(names_from = col_id, values_from = e) |>
        dplyr::select(!row_id) |>
        as.matrix()
      
      image(1:nrow(im), 1:ncol(im), im,
            col = grDevices::gray(1/12:1), xlab = "", ylab = "",
            main = paste("dimension ", i))  # optional title
    }
  )

一次元目は目と鼻の輪郭、二次元目がおでこの輪郭？のように見えます。

最後に、その一つ一つの次元を畳み込んで、推定された「平均顔画像」が復元される様子を可視化しましょう：

par(mfrow = c(3, 3))

(1:9) |>
  purrr::walk(
    \(i){
      im <- begin_end_df |>
        dplyr::bind_cols(
          e = m_emb_summary |>
            dplyr::filter(stringr::str_detect(variable, "^convolution\\[")) |>
            dplyr::mutate(
              id = variable |>
                purrr::map(\(x) {
                  as.integer(stringr::str_split(x, "\\[|\\]|,")[[1]][2:3])
                })
            ) |>
            tidyr::unnest_wider(id, names_sep = "_") |>
            dplyr::filter(id_1 == i) |>  # Change here: iterate over id_1 = 1 to 9
            dplyr::mutate(
              mean = exp(mean)
            ) |>
            dplyr::pull(mean)
        ) |>
        dplyr::select(row_id, col_id, e) |>
        tidyr::pivot_wider(names_from = col_id, values_from = e) |>
        dplyr::select(!row_id) |>
        as.matrix()
      
      image(1:nrow(im), 1:ncol(im), im,
            col = grDevices::gray(1/12:1), xlab = "", ylab = "",
            main = paste("convolution to dimension ", i))  # optional title
    }
  )

6次元目でかなり顔の雰囲気になりましたね！

結論

いかがでしたか？

このように、ベイズ機械学習の手法を使えば、普段の機械学習ではブラックボックス化されがちな次元数をモデルの一部として推定することが可能になります。

最後に、私たちと一緒に働きたい方はぜひ下記のリンクもご確認ください：

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up