自己回帰モデルのラグ選択を自動選択する夢のモデルを提案してみた

Last updated at 2024-12-09Posted at 2024-12-09

はじめに

こんにちは、事業会社で働いているデータサイエンティストです。

今回の記事では、まだ未完成状態ですが、私が考えたラグ数自動選定モデルを紹介し、シミュレーションで性能を実際にお見せします。

なぜ重要なのか

詳細はお伝えできませんが、現在、私たちのチームではさまざまな予測問題に取り組んでおります。その中でも、自己回帰モデルにおけるラグの選択は、ビジネス上の重要な意味を持つことが多く、非常に重要な要素となっています。

現在の独立変数の値がどのように未来の従属変数の値に影響するかはよくビジネスの世界で議論されます。例えば、

期末まであと3ヶ月。売上が足りない。今から営業をかけても間に合うか？
KPIになっている月間ユーザー数が足りない。今から広告をかけたら半年後に理想の数字を達成できるか？

のような会話は業種問わず、よく会社の経営会議で聞きます。

ここでまさにデータサイエンティストの出番ですが、ラグ数をモデルのパラメータとして推定する手法はあまりなく、結局モデルをたくさん推定して、AICなどの情報基準で正しいラグ数を判定しないといけません。

モデルをたくさん作るやり方は、ベイズ統計的に美しくない（？）だけでなく、実務上の問題にもなります。

例えば、TVCMとOOH広告で紅茶の売上を予測するモデルを考えてください。私はマーケティングの専門ではないですが、おそらくTVCMのラグ数とOOH広告のラグ数は違うというのはなんとなく感じます。そこで、情報基準で計算すると、

TVCMのラグ数 = 1, OOH広告のラグ数 = 1
TVCMのラグ数 = 1, OOH広告のラグ数 = 2
TVCMのラグ数 = 2, OOH広告のラグ数 = 1
TVCMのラグ数 = 2, OOH広告のラグ数 = 2

のように、変数の数 x 最大ラグ数乗回のモデル推定が必要です。ちなみに、5つの変数で、最大ラグ数が10だけでも、9,765,625回のモデル推定が必要です。これは残業確定です。自動化できても多分マシーンがとこかでメモリ周りの問題で壊れるので退勤できないですよ。

この指数関数的に増えるモデル推定回数の問題を解決するために、ベイズノンパラメトリックの棒折り過程の構造を元としたモデルを考えてみました。

棒折り過程については下記の記事を参照してください。

モデル定式化

モデルの構想としては簡単です。要するにまずはラグ数が無限大だと過程した上で、余分なラグを消していく考えです。

会社で検討しているものはもっと複雑ですが、ここではまず最も単純な自己回帰モデルのラグで性能をお見せします。

具体的にいうと、棒折り過程でラグの重みをサンプリングします：

$$
\alpha \sim Gamma(1, 1)
$$

lを1から無限大までループ：

$$
\pi_{l}\sim Beta(1, \alpha)
$$

$$
w_{l} = \pi_{l} \prod\limits_{l=1}^{l - 1} (1 - \pi_{l})
$$

次に、T期のデータはこのようにサンプリングされると仮定します。

$$
Y_{T} \sim Normal(切片 + \sum\limits_{l=1}^{\infty}Y_{T-l}\beta_{l}w_{l}, \sigma)
$$

残りのパラメータの事前分布の説明は省略します。詳しくは下記のStanのコードからご確認ください：

stick_breaking_lag.stan

functions {
  vector stick_breaking(vector breaks){
    int length = size(breaks) + 1;
    vector[length] result;
    
    result[1] = breaks[1];
    real summed = result[1];
    for (d in 2:(length - 1)) {
      result[d] = (1 - summed) * breaks[d];
      summed += result[d];
    }
    result[length] = 1 - summed;
    
    return result;
  }
}
data {
  int time_type;
  int max_lag;
  
  array[time_type] real y;
}
parameters {
  real<lower=0> lag_alpha;
  vector<lower=0, upper=1>[max_lag - 1] lag_breaks;
  
  real intercept;
  real<lower=0> rho_global;
  vector<lower=0>[max_lag] rho_beta;
  vector[max_lag] beta;
  real<lower=0> sigma;
}
transformed parameters {
  simplex[max_lag] lag;
  
  lag = stick_breaking(lag_breaks);
}
model{
  lag_alpha ~ gamma(0.001, 0.001);
  
  lag_breaks ~ beta(1, lag_alpha);
  
  intercept ~ normal(0, 10);
  rho_global ~ gamma(0.01, 0.01);
  rho_beta ~ exponential(0.5 * (rho_global)^2);
  beta ~ normal(0, rho_beta^2);
  sigma ~ inv_gamma(0.01, 0.01);
  
  for (i in (max_lag + 1):time_type){
    vector[max_lag] weigted_lags;
    
    for (j in 1:max_lag){
      weigted_lags[j] = y[i - j] * beta[j] * lag[j];
    }
    
    y[i] ~ normal(intercept + sum(weigted_lags), sigma);
  }
}
generated quantities {
  vector[max_lag] beta_times_lag;
  for (i in 1:max_lag){
      beta_times_lag[i] = beta[i] * lag[i];
  }
}

モデル推定

まずはR言語の関数で自己回帰過程に従うデータを生成します：

set.seed(123)
n <- 500

ar1_coef <- 0.8 
ar1_sim <- arima.sim(model = list(ar = ar1_coef), n = n) |>
  as.numeric()

ar2_coef <- c(0.8, -0.3)
ar2_sim <- arima.sim(model = list(ar = ar2_coef), n = n) |>
  as.numeric()

ar3_coef <- c(0.8, -0.3, 0.2)
ar3_sim <- arima.sim(model = list(ar = ar3_coef), n = n) |>
  as.numeric()

ar4_coef <- c(0.8, -0.3, 0.2, -0.1)
ar4_sim <- arima.sim(model = list(ar = ar4_coef), n = n)

次は、モデルをコンパイルします：

m_sbl_init <- cmdstanr::cmdstan_model("stick_breaking_lag.stan")

では、モデルたちを実際に推定して、ベイズモデルが推定した係数の値と係数の正解を比較しましょう：

set.seed(12345)
g_1 <- m_sbl_init$variational(
  data = list(
    time_type = n,
    max_lag = 20,
    y = ar1_sim
  )
)$
  summary() |>
  dplyr::filter(stringr::str_detect(variable, "beta_times_lag")) |>
  dplyr::bind_cols(beta = c(ar1_coef, rep(0, 19))) |>
  ggplot2::ggplot() +
  ggplot2::geom_point(
    ggplot2::aes(x = 1:20, y = mean, color = "係数推定値")
  ) +
  ggplot2::geom_point(
    ggplot2::aes(x = 1:20, y = beta, color = "係数正解")
  ) +
  ggplot2::geom_errorbar(
    ggplot2::aes(x = 1:20, ymin = q5, ymax = q95, color = "係数推定値")
  ) +
  ggplot2::scale_color_manual(
    values = c("係数推定値" = "blue", "係数正解" = "black")
  ) +
  ggplot2::scale_x_continuous(breaks = 1:20) +
  ggplot2::labs(
    title = "棒折り過程によるラグ選択\n正解はAR(1)",
    color = "",
    x = "ラグ",
    y = "値"
  ) +
  ggplot2::theme_gray(base_family = "HiraKakuPro-W3")

g_2 <- m_sbl_init$variational(
  data = list(
    time_type = n,
    max_lag = 20,
    y = ar2_sim
  )
)$
  summary() |>
  dplyr::filter(stringr::str_detect(variable, "beta_times_lag")) |>
  dplyr::bind_cols(beta = c(ar2_coef, rep(0, 18))) |>
  ggplot2::ggplot() +
  ggplot2::geom_point(
    ggplot2::aes(x = 1:20, y = mean, color = "係数推定値")
  ) +
  ggplot2::geom_point(
    ggplot2::aes(x = 1:20, y = beta, color = "係数正解")
  ) +
  ggplot2::geom_errorbar(
    ggplot2::aes(x = 1:20, ymin = q5, ymax = q95, color = "係数推定値")
  ) +
  ggplot2::scale_color_manual(
    values = c("係数推定値" = "blue", "係数正解" = "black")
  ) +
  ggplot2::scale_x_continuous(breaks = 1:20) +
  ggplot2::labs(
    title = "棒折り過程によるラグ選択\n正解はAR(2)",
    color = "",
    x = "ラグ",
    y = "値"
  ) +
  ggplot2::theme_gray(base_family = "HiraKakuPro-W3")

g_3 <- m_sbl_init$variational(
  data = list(
    time_type = n,
    max_lag = 20,
    y = ar3_sim
  )
)$
  summary() |>
  dplyr::filter(stringr::str_detect(variable, "beta_times_lag")) |>
  dplyr::bind_cols(beta = c(ar3_coef, rep(0, 17))) |>
  ggplot2::ggplot() +
  ggplot2::geom_point(
    ggplot2::aes(x = 1:20, y = mean, color = "係数推定値")
  ) +
  ggplot2::geom_point(
    ggplot2::aes(x = 1:20, y = beta, color = "係数正解")
  ) +
  ggplot2::geom_errorbar(
    ggplot2::aes(x = 1:20, ymin = q5, ymax = q95, color = "係数推定値")
  ) +
  ggplot2::scale_color_manual(
    values = c("係数推定値" = "blue", "係数正解" = "black")
  ) +
  ggplot2::scale_x_continuous(breaks = 1:20) +
  ggplot2::labs(
    title = "棒折り過程によるラグ選択\n正解はAR(3)",
    color = "",
    x = "ラグ",
    y = "値"
  ) +
  ggplot2::theme_gray(base_family = "HiraKakuPro-W3")

g_4 <- m_sbl_init$variational(
  data = list(
    time_type = n,
    max_lag = 20,
    y = ar4_sim
  )
)$
  summary() |>
  dplyr::filter(stringr::str_detect(variable, "beta_times_lag")) |>
  dplyr::bind_cols(beta = c(ar4_coef, rep(0, 16))) |>
  ggplot2::ggplot() +
  ggplot2::geom_point(
    ggplot2::aes(x = 1:20, y = mean, color = "係数推定値")
  ) +
  ggplot2::geom_point(
    ggplot2::aes(x = 1:20, y = beta, color = "係数正解")
  ) +
  ggplot2::geom_errorbar(
    ggplot2::aes(x = 1:20, ymin = q5, ymax = q95, color = "係数推定値")
  ) +
  ggplot2::scale_color_manual(
    values = c("係数推定値" = "blue", "係数正解" = "black")
  ) +
  ggplot2::scale_x_continuous(breaks = 1:20) +
  ggplot2::labs(
    title = "棒折り過程によるラグ選択\n正解はAR(4)",
    color = "",
    x = "ラグ",
    y = "値"
  ) +
  ggplot2::theme_gray(base_family = "HiraKakuPro-W3")

結果の可視化

早速可視化しましょう：

> gridExtra::grid.arrange(g_1, g_2, g_3, g_4)

この図からわかることをまとめてみますと

ラグが過剰に推定される傾向は見られる
係数が過小に推定されている可能性がある
推定された係数の符号は概ね正しい
ベイズ信用区間が真の値をカバーしていないケースが確認される

になるのではないかと思います。

このモデルには改善の余地があるものの、決して全く信頼できない結果を示しているわけではありません。そのため、状況によってはビジネスやアカデミアの応用研究において活用可能であると考えられます。

結論

いかがでしょうか？

今回紹介した棒折り過程を利用した自己回帰モデルのラグ数自動選定手法は、指数関数的に増加するモデル推定の計算負荷を大幅に削減しつつ、実際のラグ構造を精度良く復元することができることを示しました。特に、ラグ数が高次元でも余分なラグを自動で無視する性質があり、複数の変数が絡む複雑なビジネス課題にも適用可能なポテンシャルを持っています。

また、今回のシミュレーションでは単純な自己回帰モデルを例に取りましたが、このアプローチは変数間のラグが異なる場合にも拡張が可能です。

これにより、ビジネス上の意思決定において、より効率的かつ信頼性の高いデータ駆動型の支援が実現できると考えています。今後の記事では、さらに複雑な応用例についてもご紹介する予定です。引き続きご期待ください。

最後に、私たちと一緒に、データサイエンスの力で社会を改善したい方はこちらをご確認ください：

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up