半分以上消えたデータ？ベイズ時系列モデルで復元せよ

Posted at 2025-02-25

はじめに

こんにちは、事業会社で働いているデータサイエンティストです。

ビジネスでもアカデミアでも、皆さんは大量の欠損を含む時系列データを扱ったことがありますか？比較的わかりやすい例として、集計が一時的に停止した重要な指標が挙げられます。実は、因果推論の観点では、処置を受けた時系列において「処置を受けなかった場合の時系列」も欠損データの一種とみなされます。

詳しくはこちらの記事をぜひご確認ください：

今回の記事では、このような欠損データに対して、ベイズ時系列モデルを用いた補完手法を紹介します。もちろん、データサイエンスは魔法ではありません。十分な情報がなければ補完は困難ですが、相関のありそうな他の時系列データを活用することで、より精度の高い補完が可能になります。本記事で紹介する手法は、因果推論におけるDID・Causal Impact・合成統制法といった推定法の発想にも通じるものです。

モデル説明

本記事で説明するモデルは、決して複雑なものではありません。シンプルに説明すると、欠損の有無に関わらず、手元にあるデータから複数の時系列の背後に隠されたトレンドが存在し、そのトレンドによってデータが生成されたと考えるのが本モデルの根幹です。

まず、トレンドの生成から説明しましょう。

T期のトレンドは、T-1期のトレンドを平均とする正規分布から生成されるとします：

$$
trend_{t} \sim Normal(trend_{t - 1}, \sigma_{trend})
$$

分散などの事前分布は省略します。実装のコードはご自身で確認してください。

次に、観測されたデータ $series$ は次のように生成されるとします：

$$
series_{i,t} \sim Normal(\sum_{l = 1}^{ラグ}trend_{t - l} * \beta_{i, l}, \sigma_{i})
$$

ここで、どのデータがどのように欠損しているかを特別に考慮する必要はありません。

モデルの推定が終わったら、欠損している$series$を上の$series_{i,t} \sim Normal(\sum_{l = 1}^{ラグ}trend_{t - l} * \beta_{i, l}, \sigma_{i})$にしたがってサンプリングすれば復元できます。

では、早速実際のデータ分析に入りましょう！

データ

今回利用するデータは、Google トレンドにおける「アルバイト」と「求人」の検索ボリュームです。2020年2月から2025年2月までのデータを対象としますが、2023年1月1日以前の「求人」のデータを意図的に欠損させます。これにより、モデルが「アルバイト」の全データと一部しか残存していない「求人」のデータから、どの程度正確に「求人」データを補完できるかを検証できます。

では、まずダウンロードしたデータを整理して、マスターテーブルも作りましょう：

# GoogleトレンドのデフォルトだとCSVに入らないデータが入るのでskipに2を指定する
df <- readr::read_csv("アルバイト.csv", skip = 2) |>
  `colnames<-`(c("date", "volumn")) |>
  dplyr::mutate(
    search_word = "アルバイト"
  ) |>
  dplyr::bind_rows(
    readr::read_csv("求人.csv", skip = 2) |>
      `colnames<-`(c("date", "volumn")) |>
      dplyr::mutate(
        search_word = "求人"
      )
  ) |>
  dplyr::mutate(
    volumn = volumn/100,
    train_status = dplyr::case_when(
      search_word == "アルバイト" ~ 1,
      search_word == "求人" & date >= as.Date("2023-01-01") ~ 1,
      # 2023-01-01より前の「求人」のデータを学習データから外す
      TRUE ~ 0
    )
  )


search_word_master <- df |>
  dplyr::select(search_word) |>
  dplyr::distinct() |>
  dplyr::arrange(search_word) |>
  dplyr::mutate(
    search_word_id = dplyr::row_number()
  )


date_master <- df |>
  dplyr::select(date) |>
  dplyr::distinct() |>
  dplyr::arrange(date) |>
  dplyr::mutate(
    date_id = dplyr::row_number()
  )

df_with_id <- df |>
  dplyr::left_join(
    date_master, by = "date"
  ) |>
  dplyr::left_join(
    search_word_master, by = "search_word"
  )

モデル実装コード

モデルのコードはこちらになります。

Googleトレンドのデータに合わせた変数名になっていますが、適宜変えていただければと思います！

impute_stan

data {
  int time_type;
  int search_word_type;
  int lag_type;
  
  int N;
  array[N] int time;
  array[N] int search_word;
  array[N] real volumn;
}
parameters {
  real<lower=0> sigma_begin_trend;
  real<lower=0> sigma_trend;
  vector[time_type] trend;
  
  matrix<lower=0,upper=1>[search_word_type, lag_type] sigma_beta;
  matrix[search_word_type, lag_type] beta;
  vector<lower=0>[search_word_type] sigma_volumn;
}
model {
  sigma_begin_trend ~ gamma(0.01, 0.01);
  sigma_trend ~ gamma(0.01, 0.01);
  
  trend[1] ~ normal(0, sigma_begin_trend);
  for (i in 2:time_type){
    trend[i] ~ normal(trend[i - 1], sigma_trend);
  }
  
  to_vector(sigma_beta) ~ gamma(0.01, 0.01);
  to_vector(beta) ~ normal(0, to_vector(sigma_beta));
  
  for (i in 1:N){
    real m = 0.0;
    for (l in 1:lag_type){
      if (time[i] - l < 1) {
        m = m;
      } else {
        m += beta[search_word[i], l] * trend[time[i] - l];
      }
    }
    volumn[i] ~ normal(m, sigma_volumn[search_word]);
  }
}
generated quantities {
  array[search_word_type] vector[time_type] predict;
  for (i in 1:search_word_type){
    for (j in 1:time_type){
      real m = 0.0;
      for (l in 1:lag_type){
        if (j - l < 1){
          m = m;
        } else {
          m += beta[i, l] * trend[j - l];
        }
      }
      predict[i, j] = normal_rng(m, sigma_volumn[i]);
    }
  }
}

モデル推定

ここで、データをlistに整形して、Stanのモデルをコンパイルします：

data_list <- list(
  time_type = nrow(date_master),
  search_word_type = nrow(search_word_master),
  lag_type = 3,
  
  N = nrow(df_with_id[which(df_with_id$train_status == 1),]),
  time = df_with_id$date_id[which(df_with_id$train_status == 1)],
  search_word = df_with_id$search_word_id[which(df_with_id$train_status == 1)],
  volumn = df_with_id$volumn[which(df_with_id$train_status == 1)]
)

m_impute_init <- cmdstanr::cmdstan_model("impute.stan")

では、ここで実際に推定します：

> m_impute_estimate <- m_impute_init$variational(
     data = data_list
 )
------------------------------------------------------------ 
EXPERIMENTAL ALGORITHM: 
  This procedure has not been thoroughly tested and may be unstable 
  or buggy. The interface is subject to change. 
------------------------------------------------------------ 
Gradient evaluation took 0.001993 seconds 
1000 transitions using 10 leapfrog steps per transition would take 19.93 seconds. 
Adjust your expectations accordingly! 
Begin eta adaptation. 
Iteration:   1 / 250 [  0%]  (Adaptation) 
Iteration:  50 / 250 [ 20%]  (Adaptation) 
Iteration: 100 / 250 [ 40%]  (Adaptation) 
Iteration: 150 / 250 [ 60%]  (Adaptation) 
Iteration: 200 / 250 [ 80%]  (Adaptation) 
Success! Found best value [eta = 1] earlier than expected. 
Begin stochastic gradient ascent. 
  iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes  
   100      -362161.995             1.000            1.000 
   200        67430.270             3.685            6.371 
   300       -19030.075             3.971            4.543 
   400        66801.591             3.300            4.543 
   500       168158.952             2.760            1.285 
   600       149391.016             2.321            1.285 
   700       -24000.042             3.022            1.285 
   800       186125.578             2.785            1.285 
   900       204797.144             2.486            1.129 
  1000       172978.329             2.256            1.129 
  1100       165409.917             2.160            1.129   MAY BE DIVERGING... INSPECT ELBO 
  1200       219351.507             1.548            0.603   MAY BE DIVERGING... INSPECT ELBO 
  1300       219398.493             1.093            0.246   MAY BE DIVERGING... INSPECT ELBO 
  1400       223578.956             0.967            0.184   MAY BE DIVERGING... INSPECT ELBO 
  1500       222163.667             0.907            0.126   MAY BE DIVERGING... INSPECT ELBO 
  1600       153057.020             0.940            0.184   MAY BE DIVERGING... INSPECT ELBO 
  1700       217591.157             0.247            0.184 
  1800       208635.495             0.138            0.091 
  1900       186339.196             0.141            0.120 
  2000       222801.604             0.139            0.120 
  2100       220690.562             0.136            0.120 
  2200       214941.011             0.114            0.043 
  2300       222558.094             0.117            0.043 
  2400       205890.330             0.123            0.081 
  2500       226311.134             0.132            0.090 
  2600       211470.682             0.093            0.081 
  2700       224441.388             0.070            0.070 
  2800       223101.494             0.066            0.070 
  2900       178675.866             0.079            0.070 
  3000       184041.057             0.065            0.058 
  3100       216623.372             0.079            0.070 
  3200       200619.321             0.085            0.080 
  3300       219890.168             0.090            0.081 
  3400       223310.615             0.084            0.080 
  3500       213703.819             0.079            0.070 
  3600       218541.446             0.074            0.058 
  3700       210543.925             0.072            0.045 
  3800       214769.663             0.074            0.045 
  3900       213825.695             0.049            0.038 
  4000       222794.546             0.050            0.040 
  4100       214793.941             0.039            0.038 
  4200       221871.108             0.034            0.037 
  4300       215968.264             0.028            0.032 
  4400       216378.540             0.027            0.032 
  4500       220822.748             0.024            0.027 
  4600       219218.552             0.023            0.027 
  4700       219732.754             0.019            0.020 
  4800       218793.745             0.018            0.020 
  4900       217565.969             0.018            0.020 
  5000       218753.417             0.014            0.007   MEDIAN ELBO CONVERGED 
Drawing a sample of size 1000 from the approximate posterior...  
COMPLETED. 
Finished in  12.0 seconds.

12秒で終わりました！

推定結果を保存します：

m_impute_summary <- m_impute_estimate$summary()

モデル推定結果

ここでは、まず可視化用のデータフレイムを作成します：

predict_result_df <- m_impute_summary |> 
  dplyr::filter(stringr::str_detect(variable, "^predict\\[")) |>
  dplyr::mutate(
    id = variable |>
      purrr::map(
        \(x){
          as.integer(stringr::str_split(x, "\\[|\\]|,")[[1]][2:3])
        }
      )
  ) |>
  tidyr::unnest_wider(id, names_sep = "_") |>
  dplyr::filter(id_1 == 2) |>
  dplyr::bind_cols(
    date = date_master$date,
    volumn = df_with_id |>
      dplyr::filter(search_word == "求人") |>
      dplyr::pull(volumn)
    ) |>
  dplyr::mutate(
    train_status = dplyr::case_when(
      date >= as.Date("2023-01-01") ~ 1,
      TRUE ~ 0
    ) |>
      as.factor()
  )

では、実際に可視化しましょう！

predict_result_df |>
  ggplot2::ggplot() +
  ggplot2::geom_point(ggplot2::aes(x = date, y = volumn, color = train_status)) + 
  ggplot2::geom_line(ggplot2::aes(x = date, y = mean)) + 
  ggplot2::geom_ribbon(ggplot2::aes(x = date, ymin = q5, ymax = q95), fill = "blue", alpha = 0.3)

train_statusが1のデータは学習に使用した実データ、train_statusが0のデータはモデルに与えていない欠損データを示します。線は予測の平均値を表し、青い区間は予測の信用区間を示しています。

完璧ではないものの、「アルバイト」のデータと一部の「求人」データを活用することで、「求人」の欠損データをかなり高い精度で復元できたのではないかと思います。

次に、X軸に実データ、Y軸にモデルの予測値をプロットします：

predict_result_df |>
  ggplot2::ggplot() + 
  ggplot2::geom_point(ggplot2::aes(x = volumn, y = mean, color = train_status)) + 
  ggplot2::geom_abline(intercept = 0, slope = 1, color = "red") +
  ggplot2::xlim(c(0, 1)) + 
  ggplot2::ylim(c(0, 1))

この図を見るとさらに分かりやすいですが、実データと予測値はほぼ 45 度線付近に分布しており、数値データの予測モデルとして理想的な結果になっています。

結論

いかがでしたか？

本記事で紹介したモデルを活用すれば、集計組織の都合で生じた大量の欠損データはもちろん、因果推論において必然的に欠損する「処置群に処置がなかった場合」のデータも復元可能です。これにより、高い精度と透明性を持った因果推論が実現でき、意思決定の強力な武器となります。

最後に、私たちと一緒に、データサイエンスの力で社会を改善したい方はこちらをご確認ください：

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up