【R入門】ベイズ統計を描画してみる♪

やったこと

・Rのインストール

・ggplot2のインストール

・χ二乗検定をやってみる

・台風の上陸回数の事後分布

事前分布が指数分布とした場合の事後密度関数はガンマ分布の確率密度関数になる

まとめ

More than 3 years have passed since last update.

Posted at 2020-10-27

比較的読みやすいというので、以下の書籍を読み始めたが、R言語ベースということで、そこからのお話です。
「モンテカルロ統計計算」

・Rのインストール
・ggplot2のインストール
・χ二乗検定をやってみる
・台風の上陸回数の事後分布

Raspi４に以前やった手順でRをインストールする。
今回は、途中でインターネット回線障害で止まってしまったが、以下のインストールは実施した。

$ gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
$ gpg -a --export E084DAB9 | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install r-base

R のパッケージの更新
# !/bin/bash
echo 'options(repos="http://cran.rstudio.com/"); update.packages(checkBuilt=TRUE, ask=FALSE)' | sudo R --vanilla

インストールは以下を参考にしました。
【参考】
・ggplotの使い方

install.packages("ggplot2", dependencies=TRUE)

これでめでたく以下のdiamonds datasetsが出力できました。
本書には、基本データセットとあるが、ggplot2でインストールされるデータセットのようでした。

以下はRで直接実行した結果です.

> data(diamonds)
> diamonds[1:4]
# A tibble: 53,940 x 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1 0.23  Ideal     E     SI2    
 2 0.21  Premium   E     SI1    
 3 0.23  Good      E     VS1    
 4 0.290 Premium   I     VS2    
 5 0.31  Good      J     SI2    
 6 0.24  Very Good J     VVS2   
 7 0.24  Very Good I     VVS1   
 8 0.26  Very Good H     SI1    
 9 0.22  Fair      E     VS2    
10 0.23  Very Good H     VS1    
# … with 53,930 more rows
> table(diamonds[,c(2,3)])
           color
cut            D    E    F    G    H    I    J
  Fair       163  224  312  314  303  175  119
  Good       662  933  909  871  702  522  307
  Very Good 1513 2400 2164 2299 1824 1204  678
  Premium   1603 2337 2331 2924 2360 1428  808
  Ideal     2834 3903 3826 4884 3115 2093  896

注意）今日の状態はdiamonds[1:4]以下は実行できるが、以下のエラー

data(diamonds)
警告メッセージ:
data(diamonds) で: データセット ‘diamonds’ がありません

上記のエラーは以下で解消しました。

giamonds.R

library(ggplot2)
data(diamonds)
diamonds[0:10]
table(diamonds[,c(2,3)])

上記のプログラムをgiamonds.Rで保存して、以下のコマンドで実行できます。

$ Rscript --vanilla --slave giamonds.R

diamonds[0:10]、table(diamonds[,c(2,3)])の実行結果

# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# … with 53,930 more rows

           color
cut            D    E    F    G    H    I    J
  Fair       163  224  312  314  303  175  119
  Good       662  933  909  871  702  522  307
  Very Good 1513 2400 2164 2299 1824 1204  678
  Premium   1603 2337 2331 2924 2360 1428  808
  Ideal     2834 3903 3826 4884 3115 2093  896
# A tibble: 53,940 x 10

まず、以下の解説の最後のχ二乗検定をやってみました.
・「R」--- Rの基礎

検定は成績が以下のとおりの場合（以下の表をクロス表と呼ぶ）、
ーー引用
クロス表から、その「行」と「列」の間の独立性（相関性）をχ二乗統計量から推定することができます。独立性の検定は以下のようです。
・帰無仮説H0: 行と列は独立である
・対立仮説H1: 行と列には相関がある
ーー

		優	良	可	不可	合計
A	男	O_{1,1}	O_{1,2}	O_{1,3}	O_{1,4}	O_{1,*}
B	女	O_{2,1}	O_{2,2}	O_{2,3}	2,4	O_{2,*}
合計		O_{*,1}	O_{*,2}	O_{*,2}	O_{*,3}	O
このようなクロス表が与えられたとき、χ二乗統計量は以下の式で定義されます。
R=行数、C＝列数

E_{i,j}=\frac{O_{i,*}O_{*,j}}{O}\\
\chi ^2=\Sigma _{i=1}^{R}\Sigma_{j=1}^{C}\frac{O_{i,j}-E_{i,j}}{O_{i,j}}

このとき、以下で推定できる。

\chi^2（クロス表） \geqq \chi^2 (0.05, df)\\
df = (R-1)(C-1) ;自由度

を満たすとき、有意水準5%で有意であると判定し、H0を棄却し、「行と列には関連がある」と結論します。

# Rプログラム χ二乗統計量を求める関数
myChi2 <- function(table){ # Pearson's chi square
	row <- nrow(table) # 行
	col <- ncol(table) # 列
	rowmat <- matrix(0, col, 1)
	colmat <- matrix(0, row, 1)
	for (i in 1:row) for (j in 1:col) colmat[i] <- colmat[i]+table[i,j] # n*,j
	for (j in 1:col) for (i in 1:row) rowmat[j] <- rowmat[j]+table[i,j] # ni,*
	total <- 0
	for (i in 1:row) for (j in 1:col) total <- total +table[i,j] 
	Expect <- matrix(0, row, col) # 期待度数
	for (i in 1:row) for (j in 1:col) 
		Expect[i,j] <- (colmat[i]*rowmat[j])/total
	chi2 <- 0
	for (i in 1:row) for (j in 1:col) 
		chi2 <- chi2 + ((table[i,j]-Expect[i,j])^2)/Expect[i,j]
	return(chi2)
}

# Rプログラム χ検定
prefix <- "" #"http://www.tutarc.org/Seminar/R/data/"
url <- paste(prefix,"mf-seiseki.txt",sep="")
# url2 <- paste(prefix,"myChi2.r",sep="") # 前に定義されている関数が見えていれば不要
table1 <- read.csv(url) #, header=TRUE, row.names=1)
table1
# source(url) # 前に定義した関数が見えていれば不要
p0 = myChi2(table1)
p1 = qchisq(0.05, 3, lower.tail=FALSE)
chisq.test(table1, correct=FALSE)

ここで、読み込むデータは以下をテキストで保存

mf-seiseki.txt

y,r,k,h
27,34,18,3
48,15,8,1

実行結果は以下のとおり

$ Rscript --vanilla --slave chi2_test.R
   y  r  k h
1 27 34 18 3
2 48 15  8 1

	Pearson's Chi-squared test

data:  table1
X-squared = 17.518, df = 3, p-value = 0.0005529

グラフを画く方法は以下を参考にしました.
【参考】
・ggplot2 — きれいなグラフを簡単に合理的に

theta <- seq(0, 20, 0.01)
post = dchisq(theta, 3)
# plot(theta, dchisq(theta, 3), type="n")

# curve( dchisq( x, df = 3 ), xlim = c( 0, 20 ), ylim = c( 0, 0.3 ), lwd = 3, col = "red", add = TRUE ) #自由度3

library(ggplot2)
df=data.frame(post,theta)

g = ggplot(data=df,aes(x=theta))+geom_vline(xintercept=p0, size=1) +
  geom_vline(xintercept=p1, linetype="dashed", size=1) +
  stat_function(fun=dchisq, args=list(df=3, ncp=0)) 
print(g) 
ggsave("g_chi.png", g, width = 6, height = 4, dpi = 300)

【参考】
・Chisquare
・Top 50 ggplot2 Visualizations - The Master List (With Full R Code)

# install.packages("ggplot2")
library(ggplot2)
data <- c(3,2,2,4,4,6,4,5) #,3,7,0,2,1,8)
p1 = ggplot(data.frame(x=c(2,5)), aes(x))
p2 = p1 + stat_function(fun = dgamma, args = list(shape = 1+ sum(data), rate = 1 + length(data)))
p3 = p2 + theme_bw()
# print(p2)
# ggsave("thaifoon0.png", p2, width = 6, height = 4, dpi = 300)
print(p3)
ggsave("thaifoon1.png", p3, width = 6, height = 4, dpi = 300)

上記の結果つまりstat_function(fun = dgamma, args = list(shape = 1+ sum(data), rate = 1 + length(data)))を理解するために、以下を実行しました。
【参考】
・Gamma
・How to plot a Gamma distribution in ggplot2

theta = seq(0,1,length=500)
post <- dgamma(theta,0.5, 1)
g0 = plot(theta, post, type = "l", xlab = expression(theta), ylab="density", lty=1, lwd=3)

df=data.frame(post,theta)
g = ggplot(data=df,aes(x=theta))+ scale_y_continuous(limits=c(0,12)) +
  stat_function(fun=dgamma, args=list(shape=0.5, rate=1))
print(g) 
ggsave("g.png", g, width = 6, height = 4, dpi = 300)

つまり、上記の台風上陸の事後分布の図は、単にガンマ関数を計算して描画していることが分かります。
そして、ちょっと理論的なことを以下引用（本書とは順番変更しています）します。

まず、パラメーターθと観測xの確率分布が定まれば、ベイズの公式にしたがって、xの条件のもとでのθの確率分布は以下のように計算できる。

p(\theta|x)=\frac{p(x|\theta)p(\theta)}{\int _{\theta}p(x|\theta)p(\theta)d\theta}

これを事後密度関数とよぶ。なお、この$p(x|\theta)$は尤度関数（Likelihood function）とよぶ。
ここで、分母は変数xの周辺密度関数とよばれ、単に正規化定数とよばれ、$p(x)$と書く。
また、パラメータθの推定値（代表値）として、事後平均は上記の事後密度関数を用いて、以下で計算できる。

\int _{\theta}\theta p(\theta|x)d\theta

ーーここまでがベイズ統計の基本らしい

そして、事前分布$p(\theta)$が以下の指数分布$\epsilon (\lambda)$

p(\theta;\lambda)=\lambda e^{-\lambda \theta}

$\epsilon (1)=e^{-\theta}$で与えられると仮定すると、事後密度関数は、観測値がN個ある場合$x^N=(x_n)_{n=1...N}$、以下のように求められる.

\begin{align}
p(\theta|x^N)&\propto p(x^N|\theta)p(\theta)\
\end{align}

なお、この$p(x^N|\theta)$は尤度関数（Likelihood function）であり、個別の観測がポアソン分布に従う場合、尤度はそれらの積となり、

p(x^N|\theta)=\frac{\theta ^{x_1}}{x_1!}\ldots\frac{\theta ^{x_N}}{x_N!}
\propto \theta^{\Sigma_{n=1}^{N}x_n}e^{-N\theta}

となる。二つの関数を代入すると、以下が得られる。

\begin{align}
p(\theta|x^N)&\propto p(x^N|\theta)p(\theta)\\
             &\propto \theta^{\Sigma_{n=1}^{N}x_n}e^{-(N+1)\theta}
\end{align}

そして、この関数は、ガンマ分布$G(\nu, \alpha)$の確率密度関数

p(x;\nu \alpha)=\frac{\alpha ^{\nu}x^{\nu -1}exp(-\alpha x)}{\Gamma (\nu)}

ここで、Γ（ν）は

\begin{align}
\Gamma (\nu)&=\int _0^{\infty}x^{\nu -1}exp(-x)dx\\
\Gamma (\nu +1)&=\nu\Gamma(\nu)
\end{align}

を利用すると、

\begin{align}
p(\theta|x^N)&\propto \theta^{\Sigma_{n=1}^{N}x_n}e^{-(N+1)\theta}\\
&\propto G(\Sigma_{n=1}^{N}x_n +1,N+1)
\end{align}

とガンマ分布で表せることが分かる.
そして、ガンマ分布の事後平均が$\frac{\nu}{\alpha}$であることから、事後平均は以下のとおりとなる。

\frac{\Sigma_{n=1}^{N}x_n +1}{N+1}=\frac{30}{8+1}=3.33...

これで、stat_function(fun = dgamma, args = list(shape = 1+ sum(data), rate = 1 + length(data)))の意味がはっきりしました。

テキストデータは以下としました。

・ベイズ統計のイロハである事後密度関数を計算してみた
・R言語及びggplot2をインストールして台風上陸の事後分布を計算・描画してみた
・現象がポアソン分布に従う現象は、同様に事後分布を計算できることが分かる
・行列が示された場合の行と列のχ二乗検定をやってみた

・本書は、ここからさらにおもしろくなりそう

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up