More than 1 year has passed since last update.

glmにおける多重比較（カウントデータ編）

Last updated at 2023-12-17Posted at 2023-12-16

今回はglmでの多重比較を実行する方法をメモしておきます。
すでにいろいろ記事はありますが、自分の学習の記録として残しておきます。

今回はR組み込みデータの中からまず殺虫スプレーの効果のデータを使って試します。

data.InsectSprays

IS<-InsectSprays
head(IS,10)
summary(IS)

> head(IS,10)
   count spray
1     10     A
2      7     A
3     20     A
4     14     A
5     14     A
6     12     A
7     10     A
8     23     A
9     17     A
10    20     A

> summary(IS)
     count       spray 
 Min.   : 0.00   A:12  
 1st Qu.: 3.00   B:12  
 Median : 7.00   C:12  
 Mean   : 9.50   D:12  
 3rd Qu.:14.25   E:12  
 Max.   :26.00   F:12

虫のカウントデータでスプレーは6種類の効果を見ているようです。
早速多重比較をやっていきます。
multcompパッケージをダウンロード。

multcompパッケージ

install.packages("multcomp")
library(multcomp)

カウントデータということで、まずは誤差構造にポアソン分布を仮定してモデル作成し多重比較を実行します。

多重比較

fit<-glm(count~spray,family=poisson,IS)
multicomp<-glht(fit,mcp(spray="Tukey"))
summary(multicomp)

> summary(multicomp)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm(formula = count ~ spray, family = poisson, data = IS)

Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)    
B - A == 0  0.05588    0.10574   0.528  0.99446    
C - A == 0 -1.94018    0.21389  -9.071  < 0.001 ***
D - A == 0 -1.08152    0.15065  -7.179  < 0.001 ***
E - A == 0 -1.42139    0.17192  -8.268  < 0.001 ***
F - A == 0  0.13926    0.10367   1.343  0.74319    
C - B == 0 -1.99606    0.21315  -9.364  < 0.001 ***
D - B == 0 -1.13740    0.14961  -7.602  < 0.001 ***
E - B == 0 -1.47727    0.17101  -8.638  < 0.001 ***
F - B == 0  0.08338    0.10215   0.816  0.96090    
D - C == 0  0.85866    0.23864   3.598  0.00376 ** 
E - C == 0  0.51879    0.25261   2.054  0.29105    
F - C == 0  2.07944    0.21213   9.803  < 0.001 ***
E - D == 0 -0.33987    0.20189  -1.683  0.51948    
F - D == 0  1.22078    0.14815   8.240  < 0.001 ***
F - E == 0  1.56065    0.16973   9.195  < 0.001 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

とりあえず結果を出すことはできました。
が、ここで注意点としては、ポアソン分布を仮定したものの、カウントデータのモデルでは過分散になっている可能性が高いことです。
過分散になっていると信頼区間が狭くなり、本当は有意でないパラメータが有意であると判断されてしまう（第1種の過誤）場合が多くなります。

モデルの中身をチェックします。

> summary(fit)

Call:
glm(formula = count ~ spray, family = poisson, data = IS)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.67415    0.07581  35.274  < 2e-16 ***
sprayB       0.05588    0.10574   0.528    0.597    
sprayC      -1.94018    0.21389  -9.071  < 2e-16 ***
sprayD      -1.08152    0.15065  -7.179 7.03e-13 ***
sprayE      -1.42139    0.17192  -8.268  < 2e-16 ***
sprayF       0.13926    0.10367   1.343    0.179    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 409.041  on 71  degrees of freedom
Residual deviance:  98.329  on 66  degrees of freedom
AIC: 376.59

Number of Fisher Scoring iterations: 5

仮定した誤差構造が、データの母集団が持つ誤差を正確に表現できているかを検定するためにはDispersion Parameterが規定の値と一致しているかを調べる必要があります。

データが十分に大きい場合は、残差逸脱度（Residual deviance）を残差自由度（degrees of freedom）で割った値でDispersion Parameterを推定できます。
早速計算してみます。

> 98.329/66
[1] 1.489833

大東（2010）によればDispersion Parameterが実際にいくつであれば過分散なのか明確な基準はないそうですが「1.5程度であれば過分散ではないと判断してもかまわないかもしれない」とのことです。微妙ですね。

Dispersion Parameterですが、Pearsonのカイ二乗を残差自由度で割った値の方が正確という説が多いそうです。Pearsonのカイ二乗とは、観測値と推定値の差の二乗を推定値で割ったものの総和です。
計算してみます。

> sum(residuals(fit,type ="pearson")^2)/fit$df.residual
[1] 1.507713

若干値が大きくなりました。

調べてみるとこのPearsonのカイ二乗を使って過分散をチェックできるperformanceパッケージというのがありました。

ダウンロードして実行します。

install.packages("performance")
library(performance)
check_overdispersion(fit)

> check_overdispersion(fit)
# Overdispersion test

       dispersion ratio =  1.508
  Pearson's Chi-Squared = 99.509
                p-value =  0.005

やはり過分散のようです。
負の二項分布でもう一度多重比較をやってみます。

library(MASS)
fit_nb<-glm.nb(count~spray,data=IS)
multicomp2<-glht(fit_nb,mcp(spray="Tukey"))
summary(multicomp2)

> summary(multicomp2)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm.nb(formula = count ~ spray, data = IS, init.theta = 28.09950714, 
    link = log)

Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)    
B - A == 0  0.05588    0.13082   0.427  0.99805    
C - A == 0 -1.94018    0.22733  -8.535  < 0.001 ***
D - A == 0 -1.08152    0.16920  -6.392  < 0.001 ***
E - A == 0 -1.42139    0.18838  -7.545  < 0.001 ***
F - A == 0  0.13926    0.12914   1.078  0.88325    
C - B == 0 -1.99606    0.22664  -8.807  < 0.001 ***
D - B == 0 -1.13740    0.16827  -6.759  < 0.001 ***
E - B == 0 -1.47727    0.18755  -7.877  < 0.001 ***
F - B == 0  0.08338    0.12793   0.652  0.98592    
D - C == 0  0.85866    0.25076   3.424  0.00744 ** 
E - C == 0  0.51879    0.26408   1.964  0.34747    
F - C == 0  2.07944    0.22568   9.214  < 0.001 ***
E - D == 0 -0.33987    0.21608  -1.573  0.60098    
F - D == 0  1.22078    0.16697   7.311  < 0.001 ***
F - E == 0  1.56065    0.18639   8.373  < 0.001 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

ほとんど変わりませんが、Ｐ値は若干大きくなっており、より適切な検定になったのではないでしょうか。
念のためもう一度過分散をチェックしてみます。

> check_overdispersion(fit_nb)
# Overdispersion test

       dispersion ratio =  1.138
  Pearson's Chi-Squared = 75.098
                p-value =  0.207

No overdispersion detected.

まあでも元データをみれば最初からだいたいのことはわかりそうなんですよね。

ヒストグラムを描いてみます。

>hist(IS$count)

箱ひげ図も描いてみます。

> boxplot(count~spray,IS)

平均と分散をチェックします。

> mean(IS$count)
[1] 9.5
> var(IS$count)
[1] 51.88732

やっぱりだいぶ分散が大きいですね。ポアソン分布を選択する基準が平均＝分散なのでこれだけ分散が大きいと最初から負の二項分布を仮定してもよさそうです。

説明変数は1個なので一元配置の分散分析（one-way ANOVA）じゃだめなの？という意見もありそうなので、とりあえず誤差の正規性と等分散性もチェックしました。
performanceパッケージでこれらをチェックできます。

誤差の正規性と分散の均一性の検定

fit<-lm(count~spray,data=IS)
check_normality(fit)
check_homogeneity(fit)

> check_normality(fit)
Warning: Non-normality of residuals detected (p = 0.022).
> check_homogeneity(fit)
Warning: Variances differ between groups (Bartlett Test, p = 0.000).

はやりglmで正規分布以外の誤差構造を選択した方がよさそうです。

データが正規分布に従わない場合の、等分散性の検証法としてLevene testというのもあるようです。

> leveneTest(count ~ spray,data=IS)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value   Pr(>F)   
group  5  3.8214 0.004223 **
      66                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

なんとperformanceパッケージにどの確率分布にすればよいか、ランダムフォレストのアルゴリズムで計算してくれる関数がありました。
実行しようとするとrandomForestパッケージのダウンロードが求められるのでイエス（y）を入力します。

> check_distribution(fit)
# Distribution of Model Family

Predicted Distribution of Residuals

 Distribution Probability
       normal         44%
      tweedie         28%
       cauchy         25%

Predicted Distribution of Response

               Distribution Probability
 neg. binomial (zero-infl.)         69%
              beta-binomial         25%
       poisson (zero-infl.)          3%

やはり負の二項分布がよいだろうという結果になってますね。

せっかくなのでグラフに描画してみます。
以下の記事が参考になりました。

多重比較の結果をグラフに描画

#結果をデータフレームに格納
alphabet = cld(multicomp2,decreasing=F)

library(tidyverse)
abc = alphabet[["mcletters"]][["Letters"]]
plot = 
  ggplot(data = IS, aes(x = spray, y = count))+
  geom_boxplot()+
  stat_summary(geom = 'text', label =abc, fun = max, vjust = -1)+
  theme_classic()
print(plot)

～参考文献～
大東健太郎. (2010). 線形モデルから一般化線形モデル (GLM) へ. 雑草研究, 55(4), 268-274.

他にも以下の記事が参考になりました。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up