More than 3 years have passed since last update.

累積和を使った周辺化消去を用いたベイズ推定による変化点検出の試み（その３）

Last updated at 2021-10-31Posted at 2020-11-23

1.はじめに

KFASパッケージを使ったガウス状態空間モデルによる変化点検出の試み
 状態空間モデルを使った変化点検出の試み（その２）
今回は状態空間モデルではなく、変化点=cp(単位:day)を用いてRstanによる直接的に変化点検出を試みます。
変化点cpは離散値のため、パラメータの推定に離散値が直接扱える、例えばPyMCだと比較的簡単ですが、stanは連続値しか扱えないので工夫が必要になります。

2.データの準備

ナイル川の流量のデータを準備します。
データは最大値で割ることで正規化しておきます。

library(rstan)
library(tidyverse)
library(bayesplot)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

df = data.frame(Time=1:length(Nile), Y=as.vector(Nile)/max(Nile))
data = list(T=length(Nile), Y=as.vector(Nile)/max(Nile))
ggplot(df,aes(x=Time,y=Y)) + geom_line()

3.stan codeの準備

以下、stan_codeです。
ファイル名は、model.stanで保存します。

data {
  int T;
  vector[T] Y;
}

parameters {
  real<lower=0,upper=1> mu_l;
  real<lower=0,upper=1> mu_r;
  real<lower=0> sigma;
}

transformed parameters {
  vector[T] lp;
  vector[T] lp_l;
  vector[T] lp_r;
    for (cp in 1:T) {　　　　
      lp_l[cp] = normal_lpdf(Y[cp] | mu_l, sigma);
      lp_r[cp] = normal_lpdf(Y[cp] | mu_r, sigma);
    }
    lp_l = cumulative_sum(lp_l);
    lp_r = cumulative_sum(lp_r);
    lp = lp_l + (lp_r[T] - lp_r);
  
}

model {
  target += log_sum_exp(lp);
}

3.1対数尤度の計算

平均パラメータは左側と右側の２つ（mu_l,mu_r）を設定しています。
正規化して0から1までしか範囲をとらないため、パラメータも、0〜1に制限しています。

lpの計算は、変化点はcpでmu_lからmu_rに切り替わるため、直接的には次のコード１のとおりです。

コード１
 vector[T] lp;
  lp = rep_vector(0, T);
  for (cp in 1:T)
    for (t in 1:T)
      lp[cp] = lp[cp] + normal_lpdf(Y[t] | if_else(t <= cp, mu_l, mu_r), sigma);

変化点cpはlpの中に周辺化消去されてなくなります。
しかし、Tについて２重ループするため、計算速度が遅くなります。

コード２は、cumulative_sum（累積和）をとることで、２重ループを解消しています。こうすることで、計算スピードが早くなります。

コード２
 for (cp in 1:T) {　　　　
      lp_l[cp] = normal_lpdf(Y[cp] | mu_l, sigma);
      lp_r[cp] = normal_lpdf(Y[cp] | mu_r, sigma);
    }
lp_l = cumulative_sum(lp_l);
lp_r = cumulative_sum(lp_r);
lp = lp_l + (lp_r[T] - lp_r);

lp_lとlp_rの計算方法を図で示しています。

3.2stanの実行

fit = stan(file="model.stan",data=data)

$> 省略
$>SAMPLING FOR MODEL 'model' NOW (CHAIN 4).
$>Chain 4: 
$>Chain 4: Gradient evaluation took 4.3e-05 seconds
$>Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.43 seconds.
$>Chain 4: Adjust your expectations accordingly!
$>Chain 4: 
$>Chain 4: 
$>Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
$>Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
$>Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
$>Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
$>Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
$>Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
$>Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
$>Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
$>Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
$>Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
$>Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
$>Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
$>Chain 4: 
$>Chain 4:  Elapsed Time: 0.125205 seconds (Warm-up)
$>Chain 4:                0.131506 seconds (Sampling)
$>Chain 4:                0.256711 seconds (Total)
$>Chain 4:

収束を見てみます。

mcmc_rhat(rhat(fit))
mcmc_combo(fit, pars = c("mu_l", "mu_r"))

Rhatは1と綺麗に収束しています。
mu_lとmu_rの事後分布のトレースも４chainがきれいに混ざっています。

4.結果の解釈

fit

$>Inference for Stan model: model.
$>4 chains, each with iter=2000; warmup=1000; thin=1; 
$>post-warmup draws per chain=1000, total post-warmup draws=4000.
$>
$>            mean se_mean    sd    2.5%    25%    50%    75% 97.5%
$>mu_l        0.80    0.00  0.02    0.77   0.79   0.80   0.81  0.84
$>mu_r        0.62    0.00  0.01    0.60   0.61   0.62   0.63  0.64
$>sigma       0.10    0.00  0.01    0.08   0.09   0.09   0.10  0.11
$>lp[1]      45.96    0.15  9.49   23.91  40.41  46.99  52.97 61.04
$>lp[2]      48.70    0.14  8.99   27.74  43.47  49.70  55.38 62.90
$>  省略

lp[i]はi時点における２つの平均（"mu_l", "mu_r"）の混合対数尤度の事後分布を示します。lpが最も高い時点(i)が変化点としての確率が最も高いところになります。

lpの中央値を推定値として、確率を計算します。

fit2 = rstan::extract(fit)
d2 = exp(apply(fit2,2,median))
d3 = data.frame(prob = d2/sum(d2))
d3$time = 1:nrow(d3)
which.max(d3$prob)
ggplot(d3,aes(x = time,y = prob)) + geom_bar(stat="identity")

$>[1] 28

28時点で確率が最も高くなりました。
変化する前の最大時点を推定しているため、次の時点（+1）が変化点になります。
前回の解析の29時点（1899年）と同じになりました。

流量の変化の50%中央値と95%信用区間を計算します。

quantile(ms$mu_l,prob = c(0.025,0.5,0.975))*max(Nile)
quantile(ms$mu_r,prob = c(0.025,0.5,0.975))*max(Nile)

$>    2.5%      50%    97.5% 
$>1049.729 1097.622 1146.958 
$>    2.5%      50%    97.5% 
$>820.3091 850.3095 881.5139

流量は、変化点において1097.6から850.3に変化しました。

5.参考

全面的に参考にしました。
StatModeling Memorandum
累積和を使って計算の無駄を省く（変化点検出の例）

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up