More than 1 year has passed since last update.

glmの多重比較（ロジスティック回帰編）

Last updated at 2023-12-17Posted at 2023-12-17

今回はロジスティック回帰編です。
今回もRは組み込みのタイタニックのデータを使ってみました。

data.Titanic

titanic<- data.frame(Titanic)
head(titanic)

> head(titanic)
  Class    Sex   Age Survived Freq
1   1st   Male Child       No    0
2   2nd   Male Child       No    0
3   3rd   Male Child       No   35
4  Crew   Male Child       No    0
5   1st Female Child       No    0
6   2nd Female Child       No    0

生存したかどうかの2値データの解析なのでロジスティック回帰をやっていきます。
分析によさげなカテゴリカルデータがあるものの頻度データになっているのでどうしようか思っていたらいい記事がありました。

やっぱこういう記事あるんですね。
glmにweightsという引数があるのを初めて知りました。
集計データでは、weightsに頻度を入れればそのまま使えるんですね。

Classが4つのカテゴリに分けられた変数であるため、これについて多重比較します。
まずはモデルをチェック。

glm

fit<-glm(Survived~Class+Sex+Age,weights=Freq,family=binomial,data=titanic)
summary(fit)

> summary(fit)

Call:
glm(formula = Survived ~ Class + Sex + Age, family = binomial, 
    data = titanic, weights = Freq)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.6853     0.2730   2.510   0.0121 *  
Class2nd     -1.0181     0.1960  -5.194 2.05e-07 ***
Class3rd     -1.7778     0.1716 -10.362  < 2e-16 ***
ClassCrew    -0.8577     0.1573  -5.451 5.00e-08 ***
SexFemale     2.4201     0.1404  17.236  < 2e-16 ***
AgeAdult     -1.0615     0.2440  -4.350 1.36e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2769.5  on 23  degrees of freedom
Residual deviance: 2210.1  on 18  degrees of freedom
AIC: 2222.1

Number of Fisher Scoring iterations: 5

ただこれだとサンプル数が少ない分Dispersion parameterがちゃんと計算できないですね。
までも一応多重比較をやってみます。

多重比較

library(multcomp)
multicomp<-glht(fit,mcp(Class="Tukey"))
summary(multicomp)

> summary(multicomp)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm(formula = Survived ~ Class + Sex + Age, family = binomial, 
    data = titanic, weights = Freq)

Linear Hypotheses:
                Estimate Std. Error z value Pr(>|z|)    
2nd - 1st == 0   -1.0181     0.1960  -5.194  < 1e-04 ***
3rd - 1st == 0   -1.7778     0.1716 -10.362  < 1e-04 ***
Crew - 1st == 0  -0.8577     0.1573  -5.451  < 1e-04 ***
3rd - 2nd == 0   -0.7597     0.1764  -4.308 0.000115 ***
Crew - 2nd == 0   0.1604     0.1738   0.923 0.790536    
Crew - 3rd == 0   0.9201     0.1486   6.192  < 1e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

とりあえず実行できました。
調べてみると頻度に応じてデータを増やすやり方もわかりました。

同じようにやってみます。

頻度に応じてデータを複製

titanic2 <- data.frame(lapply(titanic,function(i){rep(i,titanic[,5])}))
#Freqの列を削除
titanic2 <- titanic2[,-5]
head(titanic2)

> head(titanic2)
  Class  Sex   Age Survived
1   3rd Male Child       No
2   3rd Male Child       No
3   3rd Male Child       No
4   3rd Male Child       No
5   3rd Male Child       No
6   3rd Male Child       No

いい感じです。
もう一度glmをやってみます。

fit2<-glm(Survived~Class+Sex+Age,family=binomial,data=titanic2)
summary(fit2)

> summary(fit2)

Call:
glm(formula = Survived ~ Class + Sex + Age, family = binomial, 
    data = titanic2)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.6853     0.2730   2.510   0.0121 *  
Class2nd     -1.0181     0.1960  -5.194 2.05e-07 ***
Class3rd     -1.7778     0.1716 -10.362  < 2e-16 ***
ClassCrew    -0.8577     0.1573  -5.451 5.00e-08 ***
SexFemale     2.4201     0.1404  17.236  < 2e-16 ***
AgeAdult     -1.0615     0.2440  -4.350 1.36e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2769.5  on 2200  degrees of freedom
Residual deviance: 2210.1  on 2195  degrees of freedom
AIC: 2222.1

Number of Fisher Scoring iterations: 4

Dispersion parameterもほぼ1になりそうですね。
一応計算します。

> sum(residuals(fit2,type ="pearson")^2)/fit2$df.residual
[1] 1.023531

performanceパッケージで検定してみます。

library(performance)
check_overdispersion(fit2)

> check_overdispersion(fit2)
 エラー: Overdispersion checks cannot be used for Bernoulli models.

ゼロイチデータでは使えないっぽいです。
前回と同じくどの分布が適切なのかランダムフォレストのアルゴリズムで調べてくれる関数を利用してみます。

> check_distribution(fit2)
# Distribution of Model Family

Predicted Distribution of Residuals

         Distribution Probability
 poisson (zero-infl.)         22%
               cauchy         19%
               normal         19%

Predicted Distribution of Response

  Distribution Probability
     bernoulli         88%
 beta-binomial          9%
      binomial          3%

問題なさそうですね。
最初の重み付けしたglmと違いがないか、念のためこちらでも多重比較を実行します。

multicomp2<-glht(fit2,mcp(Class="Tukey"))
summary(multicomp2)

> summary(multicomp2)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm(formula = Survived ~ Class + Sex + Age, family = binomial, 
    data = titanic2)

Linear Hypotheses:
                Estimate Std. Error z value Pr(>|z|)    
2nd - 1st == 0   -1.0181     0.1960  -5.194   <1e-04 ***
3rd - 1st == 0   -1.7778     0.1716 -10.362   <1e-04 ***
Crew - 1st == 0  -0.8577     0.1573  -5.451   <1e-04 ***
3rd - 2nd == 0   -0.7597     0.1764  -4.308   <1e-04 ***
Crew - 2nd == 0   0.1604     0.1738   0.923    0.791    
Crew - 3rd == 0   0.9201     0.1486   6.192   <1e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

特に変わらなくて安心しました。

前回と同じようにグラフに結果を描画してみます。
生存しているかどうかがYesとNoの文字列なので1と0に変換します。
ゼロイチデータなので今回はバブルチャートにしてみました。

グラフに描画

alphabet = cld(multicomp2,decreasing=F)
alphabet

library(tidyverse)
titanic2 <- dplyr::mutate(titanic2, Survived = str_replace_all(Survived, pattern = c("No" = "0", "Yes" = "1")))
titanic2$Survived<-as.numeric(titanic2$Survived)
abc = alphabet[["mcletters"]][["Letters"]]
plot = 
  ggplot(data = titanic2, aes(x = Class, y = Survived))+
  geom_count()+
  stat_summary(geom = 'text', label =abc, fun = max, vjust = -1)+
  theme_classic()
print(plot)

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up