More than 5 years have passed since last update.

broom+ggplot2で回帰係数を可視化する

Last updated at 2018-12-23Posted at 2018-12-23

やったこと

回帰分析の結果を表ではなく図で表現してみました。

環境

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /opt/microsoft/ropen/3.5.1/lib64/R/lib/libRblas.so
LAPACK: /opt/microsoft/ropen/3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=ja_JP.UTF-8       LC_NUMERIC=C               LC_TIME=ja_JP.UTF-8        LC_COLLATE=ja_JP.UTF-8    
 [5] LC_MONETARY=ja_JP.UTF-8    LC_MESSAGES=ja_JP.UTF-8    LC_PAPER=ja_JP.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ja_JP.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2       broom_0.5.0          texreg_1.36.23       forcats_0.3.0        stringr_1.3.1       
 [6] dplyr_0.7.6          purrr_0.2.5          readr_1.1.1          tidyr_0.8.1          tibble_1.4.2        
[11] ggplot2_3.0.0        tidyverse_1.2.1      RevoUtils_11.0.1     RevoUtilsMath_11.0.0

だいぶ前に入れたWSL上でRStudioServerを動かしています。

データ

Rに付属しているAirqualityデータを使用します。
1973年の5月から9月までにニューヨークで観測されたものです。
- 詳細：https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html
変数は以下の通り

[,1]	Ozone	numeric	Ozone (ppb)
[,2]	Solar.R	numeric	Solar R (lang)
[,3]	Wind	numeric	Wind (mph)
[,4]	Temp	numeric	Temperature (degrees F)
[,5]	Month	numeric	Month (1--12)
[,6]	Day	numeric	Day of month (1--31)

今回は、Ozoneを他の変数で説明する簡単な回帰モデルを作ります。

回帰モデルを作成

lm() で階層的にOLS回帰モデルを作っていきます。
airqualityデータは欠損値が多いデータですが、今回はリストワイズで除去します。

library(tidyverse)

aq_data <- airquality %>%
  na.omit

model_list <- list()
model_list[["model_1"]] <- lm(data = aq_data,Ozone ~ Solar.R)
model_list[["model_2"]] <- lm(data = aq_data,Ozone ~ Solar.R + Wind)
model_list[["model_3"]] <- lm(data = aq_data,Ozone ~ Solar.R + Wind + Temp)
model_list[["model_4"]] <- lm(data = aq_data,Ozone ~ Solar.R + Wind + Temp + Month)

ふつうに表形式で出力

texreg::screenreg()を使うと、こんな感じの見慣れた表になります。

> texreg::screenreg(model_list)
===========================================================
             model_1     model_2     model_3     model_4   
-----------------------------------------------------------
(Intercept)   18.60 **    77.25 ***  -64.34 **   -58.05 *  
              (6.75)      (9.07)     (23.05)     (22.97)   
Solar.R        0.13 ***    0.10 ***    0.06 *      0.05 *  
              (0.03)      (0.03)      (0.02)      (0.02)   
Wind                      -5.40 ***   -3.33 ***   -3.32 ***
                          (0.67)      (0.65)      (0.65)   
Temp                                   1.65 ***    1.87 ***
                                      (0.25)      (0.27)   
Month                                             -2.99    
                                                  (1.52)   
-----------------------------------------------------------
R^2            0.12        0.45        0.61        0.62    
Adj. R^2       0.11        0.44        0.59        0.61    
Num. obs.    111         111         111         111       
RMSE          31.33       24.92       21.18       20.90    
===========================================================
*** p < 0.001, ** p < 0.01, * p < 0.05

broomで推定結果をtidy dataに変換

可視化する下準備として、broomパッケージでlmオブジェクトをtidy dataに変換します。
broomが対応している回帰モデルのオブジェクトをtidy()に与えると、以下のようなtibbleを得ることができます。

> reg <- lm(data = aq_data,Ozone ~ Solar.R + Wind + Temp + Month)
> tidy(reg)
# A tibble: 5 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) -58.1      23.0        -2.53 1.30e- 2
2 Solar.R       0.0496    0.0235      2.11 3.68e- 2
3 Wind         -3.32      0.646      -5.14 1.29e- 6
4 Temp          1.87      0.274       6.84 5.34e-10
5 Month        -2.99      1.52       -1.97 5.10e- 2

先程のモデル群をtidy dataに変換して、可視化用にデータを加工していきます。
- term_orderとmodel_orderはggplot2での凡例を整えるのに使います。

library(broom)

term_order <- rev(c("(Intercept)","Solar.R","Wind","Temp","Month"))
model_order <- rev(c("model_1","model_2","model_3","model_4"))

result_data <- model_list %>%
  map(tidy,conf.level = 0.95,conf.int=TRUE) %>%
  bind_rows(.id = "model") %>%
  mutate(term = factor(term,levels = term_order), #項とモデルの順序付け
         model = factor(model,levels = model_order)) %>%
  filter(term != "(Intercept)") %>% #今回は簡単のため切片を除外します
  mutate(stars = case_when( #有意水準を設定する
    p.value <= 0.001 ~ "***",
    (p.value > 0.001 & p.value <= 0.01) ~ "**",
    (p.value > 0.01 & p.value <= 0.05) ~ "*",
    (p.value > 0.05 & p.value <= 0.1) ~ ".",
    TRUE ~ "")) %>%
  mutate(coef_label = paste0(round(estimate,3)," (",round(std.error,3),")",stars))

加工後のtibbleは以下のようになります。

> result_data
# A tibble: 10 x 10
   model   term    estimate std.error statistic  p.value conf.low conf.high stars coef_label       
   <fct>   <fct>      <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr> <chr>            
 1 model_1 Solar.R   0.127     0.0328      3.88 1.79e- 4  0.0622     0.192  ***   0.127 (0.033)*** 
 2 model_2 Solar.R   0.100     0.0263      3.82 2.24e- 4  0.0483     0.152  ***   0.1 (0.026)***   
 3 model_2 Wind     -5.40      0.673      -8.02 1.34e-12 -6.74      -4.07   ***   -5.402 (0.673)***
 4 model_3 Solar.R   0.0598    0.0232      2.58 1.12e- 2  0.0139     0.106  *     0.06 (0.023)*    
 5 model_3 Wind     -3.33      0.654      -5.09 1.52e- 6 -4.63      -2.04   ***   -3.334 (0.654)***
 6 model_3 Temp      1.65      0.254       6.52 2.42e- 9  1.15       2.15   ***   1.652 (0.254)*** 
 7 model_4 Solar.R   0.0496    0.0235      2.11 3.68e- 2  0.00309    0.0961 *     0.05 (0.023)*    
 8 model_4 Wind     -3.32      0.646      -5.14 1.29e- 6 -4.60      -2.04   ***   -3.317 (0.646)***
 9 model_4 Temp      1.87      0.274       6.84 5.34e-10  1.33       2.41   ***   1.871 (0.274)*** 
10 model_4 Month    -2.99      1.52       -1.97 5.10e- 2 -6.00       0.0138 .     -2.992 (1.516).

ggplot2で可視化

先程作成したtibbleをもとに、ggplot2でレイヤリングしていきます。
- モデル間で点や線が重ならないようにposition_dodge()を逐次入れています。

g <- ggplot(result_data, aes(x = term,
                             y = estimate,
                             col = model))
g <- g + geom_hline(yintercept = 0) #係数=0の直線
g <- g + geom_point(position = position_dodge(1)) #係数の点プロット
g <- g + geom_text(position = position_dodge(1), #係数の数値を文字列でプロット
                   vjust = -1,　#点と重ならないようにちょっと上に出します
                   aes(label = coef_label),
                   show.legend = FALSE) #これを設定しないと凡例に文字列も反映されてしまいます
g <- g + geom_pointrange(position = position_dodge(1),
                        aes(ymin = conf.low, ymax = conf.high), #エラーバー用
                        show.legend = FALSE)
g <- g + guides(col = guide_legend(reverse=T))
g <- g + coord_flip()
g <- g + ggtitle("Estimated Coefficients of Regression Model (95%CI)")
g

実際にできたものがこちらです。

感想

それぞれの係数のスケールが違うので、Solar.Rなどは限りなく推定値が0に近いため逆に違いがよくわからなくなりますね。
標準化回帰係数ならこの方法でいいかもしれませんが、そもそも回帰分析の結果を可視化する旨味はそんなにないかもしれません。
細かいですが、項によって行幅が揃ってないのが気になりますね…。

ちなみに

上の方法では、エラーバーを表すためにgeom_pointrange()で水平方向に線を引いてから最後にcoord_flip()するという、やや回りくどい方法を使っていますが、そうなった理由は以下の2つです。
- geom_pointrange()が水平方向の描画に対応していなかった。
  - ggstance というggplot2のエクステンションを使うと、geom_pointrangeh()で水平方向にエラーバーを引けるようです。
- 当初は、geom_errorbarh()を使おうとしたのですが、position_dodge()がうまく効かなかった。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up