More than 5 years have passed since last update.

Rで動かす順序ロジットモデル

Last updated at 2018-05-01Posted at 2018-04-30

注意

初投稿です．この記事についての誤りや改善点やお褒めの言葉などがあればコメントいただけると嬉しいです．コード全文は一番下にあります．

この記事を読むと何がわかる？

・順序ロジットモデルとは何か
・順序ロジットモデルの使いどころ
・Rでの順序ロジットモデルの推定方法
・説明変数が被説明変数の所属確率へあたえる効果や，潜在変数へ与える効果の可視化のやり方

状況設定

　I氏は「不倫は文化だ」という信条を持っています．O氏はI氏のこの信条を面白く思っていません．O氏は「不倫は夫婦仲を悪化させるので，認められるべきではない」と考えています．しかし，I氏の考えはこうです．「夫婦仲は他の要因（結婚年数，子供の有無，学歴）によって悪化し，不倫自体によっては悪化しない」．
　以上の二人の意見はどちらが正しいのでしょうか．これを確かめるために，アンケートを実施しました. ¹結婚生活満足度を5段階で評価し（5=とても満足,1=まったく満足していない),不倫回数や結婚年数，子供の有無などの情報も記入してもらいます．詳細なアンケート内容は脚注のリンクを参照ください．

使用するデータ

　AERライブラリ内のAffairsデータセットを使用します．最初の五行を表示します。

	gender	age	yearsmarried	children	religiousness	education	occupation	rating
4	male	37	10.00	no	3	18	7	4
5	female	27	4.00	no	4	14	6	4
11	female	32	15.00	yes	1	12	1	4
16	male	57	15.00	yes	5	18	6	5
23	male	22	0.75	no	2	17	6	3

全部で9列あり，最後列のratingが結婚満足度です．他の8列がratingにどのような影響を与えるのかを調べます．各列の意味はこちらの4ページ目にあります．
　以下にsummarytoolsライブラリによる要約を載せます．各変数の分布がざっくりつかめて便利です．例えば，ratingは1が一番少なく，だんだん増加していくことがわかります．

Data Frame Summary   
Affairs     
N: 601   
---------------------------------------------------------------------------------------------------------------
No   Variable         Stats / Values              Freqs (% of Valid)     Text Graph          Valid    Missing  
---- ---------------- --------------------------- ---------------------- ------------------- -------- ---------
1    affairs          mean (sd) : 1.46 (3.3)      0.00 : 451 (75.0%)     IIIIIIIIIIIIIIII    601      0        
     [numeric]        min < med < max :           1.00 :  34 ( 5.7%)     I                   (100%)   (0%)     
                      0 < 0 < 12                  2.00 :  17 ( 2.8%)     I                                     
                      IQR (CV) : 0 (2.27)         3.00 :  19 ( 3.2%)     I                                     
                                                  7.00 :  42 ( 7.0%)                                           
                                                  12.00 :  38 ( 6.3%)                                          

2    gender           1. female                   315 (52.4%)            IIIIIIIIIIIIIIII    601      0        
     [factor]         2. male                     286 (47.6%)            IIIIIIIIIIIIII      (100%)   (0%)     

3    age              mean (sd) : 32.49 (9.29)    17.50 :   6 ( 1.0%)    IIIIIIIIIIII        601      0        
     [numeric]        min < med < max :           22.00 : 117 (19.5%)    IIIIIIIIIIIIIIII    (100%)   (0%)     
                      17.5 < 32 < 57              27.00 : 153 (25.5%)    IIIIIIIIIIII                          
                      IQR (CV) : 10 (0.29)        32.00 : 115 (19.1%)    IIIIIIIII                             
                                                  37.00 :  88 (14.6%)    IIIII                                 
                                                  42.00 :  56 ( 9.3%)    II                                    
                                                  47.00 :  23 ( 3.8%)    II                                    
                                                  52.00 :  21 ( 3.5%)    II                                    
                                                  57.00 :  22 ( 3.7%)                                          

4    yearsmarried     mean (sd) : 8.18 (5.57)     0.12!:  11 ( 1.8%)     II                  601      0        
     [numeric]        min < med < max :           0.42!:  10 ( 1.7%)     IIIIII              (100%)   (0%)     
                      0.12 < 7 < 15               0.75 :  31 ( 5.2%)     IIIIIIII                              
                      IQR (CV) : 11 (0.68)        1.50 :  88 (14.6%)     IIIIII                                
                                                  4.00 : 105 (17.5%)     IIIII                                 
                                                  7.00 :  82 (13.6%)     IIIIIIIIIIIIIIII                      
                                                  10.00 :  70 (11.7%)                                          
                                                  15.00 : 204 (33.9%)                                          
                                                  ! rounded                                                    

5    children         1. no                       171 (28.4%)            IIIIII              601      0        
     [factor]         2. yes                      430 (71.5%)            IIIIIIIIIIIIIIII    (100%)   (0%)     

6    religiousness    mean (sd) : 3.12 (1.17)     1.00 :  48 ( 8.0%)     IIII                601      0        
     [integer]        min < med < max :           2.00 : 164 (27.3%)     IIIIIIIIIIIII       (100%)   (0%)     
                      1 < 3 < 5                   3.00 : 129 (21.5%)     IIIIIIIIII                            
                      IQR (CV) : 2 (0.37)         4.00 : 190 (31.6%)     IIIIIIIIIIIIIIII                      
                                                  5.00 :  70 (11.7%)     IIIII                                 

7    education        mean (sd) : 16.17 (2.4)     9.00 :   7 ( 1.2%)     IIII                601      0        
     [numeric]        min < med < max :           12.00 :  44 ( 7.3%)    IIIIIIIIIIIIIIII    (100%)   (0%)     
                      9 < 16 < 20                 14.00 : 154 (25.6%)    IIIIIIIIIII                           
                      IQR (CV) : 4 (0.15)         16.00 : 115 (19.1%)    IIIIIIIII                             
                                                  17.00 :  89 (14.8%)    IIIIIIIIIII                           
                                                  18.00 : 112 (18.6%)    IIIIIIII                              
                                                  20.00 :  80 (13.3%)                                          

8    occupation       mean (sd) : 4.19 (1.82)     1.00 : 113 (18.8%)     IIIIIIII            601      0        
     [integer]        min < med < max :           2.00 :  13 ( 2.2%)     I                   (100%)   (0%)     
                      1 < 5 < 7                   3.00 :  47 ( 7.8%)     III                                   
                      IQR (CV) : 3 (0.43)         4.00 :  68 (11.3%)     IIIII                                 
                                                  5.00 : 204 (33.9%)     IIIIIIIIIIIIIIII                      
                                                  6.00 : 143 (23.8%)     IIIIIIIIIII                           
                                                  7.00 :  13 ( 2.2%)     I                                     

9    rating           1. 1                         16 ( 2.7%)            I                   601      0        
     [factor]         2. 2                         66 (11.0%)            IIII                (100%)   (0%)     
                      3. 3                         93 (15.5%)            IIIIII                                
                      4. 4                        194 (32.3%)            IIIIIIIIIIIII                         
                      5. 5                        232 (38.6%)            IIIIIIIIIIIIIIII                      
---------------------------------------------------------------------------------------------------------------

　不倫回数が結婚満足度に及ぼす影響に興味があるので，箱ひげ図も描いておきましょう．

結婚満足度が低い回答数が少ないので分散が大きいですが，なんとなく不倫回数が多くなると結婚満足度が低くなる傾向があるように見えます．

順序ロジットモデル²

　このように，順序がある離散値の被説明変数に対して有効なモデルが順序ロジットモデルです．上記の例では次のように表されます．
$$ logit(P(結婚生活満足度 ≦ j))= \alpha_j - \boldsymbol {\beta} \boldsymbol{x} , j = 1,2,3,4,5　\tag{1}$$
ここで,
$$ logit(z) = log(z/(1-z)) $$
です．
　$\alpha_j$が結婚満足度を区切る潜在変数の閾値，$\boldsymbol{\beta}$が説明変数の係数ベクトル,$\boldsymbol{x}$が説明変数ベクトルです．順序ロジットモデルを推定するとは，$\alpha$と$\boldsymbol{\beta}$を与えられたデータから決めることです．このモデルにより，ある説明変数の値が与えられた時，各結婚満足度への所属確率が得られます．例えば，年齢が30才，子供なし，不倫回数1回の男性の結婚満足度はどのくらいかが予測できます．

Rでの推定³

　MASSライブラリのpolr でモデルの推定をします．

m <- polr(rating ~ gender + affairs + age +yearsmarried + children + religiousness + education + occupation , data = Affairs,  method = c("logistic"))
summary(m)

出力は以下のようになります．

Call:
polr(formula = rating ~ gender + affairs + age + yearsmarried + 
    children + religiousness + education + occupation, data = Affairs, 
    method = c("logistic"))

Coefficients:
                 Value Std. Error t value
gendermale    -0.16670    0.18176 -0.9171
affairs       -0.13428    0.02454 -5.4729
age           -0.01436    0.01394 -1.0297
yearsmarried  -0.04134    0.02550 -1.6215
childrenyes   -0.44325    0.21737 -2.0391
religiousness  0.05717    0.06780  0.8432
education      0.12093    0.03852  3.1397
occupation    -0.03761    0.05335 -0.7050

Intercepts:
    Value   Std. Error t value
1|2 -3.3259  0.6956    -4.7817
2|3 -1.4418  0.6615    -2.1796
3|4 -0.3862  0.6593    -0.5857
4|5  1.1086  0.6605     1.6783

Residual Deviance: 1547.651 
AIC: 1571.651

t値しか得られてないので,p値も計算して出力します.出力結果を載せます．

                    Value Std. Error    t value      p value
gendermale    -0.16669715 0.18176478 -0.9171037 3.590883e-01
affairs       -0.13428331 0.02453597 -5.4729162 4.426895e-08
age           -0.01435728 0.01394269 -1.0297356 3.031341e-01
yearsmarried  -0.04133972 0.02549509 -1.6214775 1.049153e-01
childrenyes   -0.44324874 0.21737398 -2.0391067 4.143938e-02
religiousness  0.05717281 0.06780417  0.8432050 3.991138e-01
education      0.12092608 0.03851563  3.1396626 1.691425e-03
occupation    -0.03761273 0.05335027 -0.7050149 4.808010e-01
1|2           -3.32594358 0.69555412 -4.7817179 1.738035e-06
2|3           -1.44180982 0.66150104 -2.1796033 2.928688e-02
3|4           -0.38615865 0.65927100 -0.5857358 5.580530e-01
4|5            1.10858324 0.66054404  1.6782881 9.329086e-02

　Coefficientが各説明変数に対応する係数です．上記した$\boldsymbol{\beta}$に対応します．Interceptが$\alpha$に対応する閾値です．単純に5%有意水準でなら，affairs, childrenyes, educationの係数が有意という結果になりました．affairsの係数が負なので，不倫回数が結婚満足度に悪影響を与えることがわかりました．O氏の主張がもっともらしいと考えられます．

モデルの可視化³

　有意であった係数に関して，結婚満足度に与える影響を見ていきましょう．effectsライブラリを使うと簡単に計算してくれます．

g.eff <- Effect(focal.predictors = c("affairs"), m)
g.eff
plot(g.eff, rug = FALSE)

g.eff <- Effect(focal.predictors = c("cildren"), m)
g.eff
plot(g.eff, rug = FALSE)

g.eff <- Effect(focal.predictors = c("education"), m)
g.eff
plot(g.eff, rug = FALSE)

以上の図ではある要因が変化した場合の，ratingの確率に与える影響が確認できます．不倫回数が多くなるほど低い結婚満足度になる確率が大きくなることがわかります．ちなみに，変化する要因以外は平均値に固定されています．値は以下のコードで確認できます．

g.eff$model.matrix

gendermale affairs      age yearsmarried childrenyes religiousness education occupation
1  0.4758735       0 32.48752     8.177696   0.7154742      3.116473  16.16639   4.194676
2  0.4758735       3 32.48752     8.177696   0.7154742      3.116473  16.16639   4.194676
3  0.4758735       6 32.48752     8.177696   0.7154742      3.116473  16.16639   4.194676
4  0.4758735       9 32.48752     8.177696   0.7154742      3.116473  16.16639   4.194676
5  0.4758735      10 32.48752     8.177696   0.7154742      3.116473  16.16639   4.194676

　
　次に，ある要因が1増加した場合に結婚満足度の確率にどのような影響があるかを確認します．ererライブラリを使います．

mea <-  ocME(w = m)
mea$out


$ME.1
              effect error t.value p.value
gendermale     0.003 0.004   0.892   0.373
affairs        0.003 0.001   3.492   0.001
age            0.000 0.000   0.999   0.318
yearsmarried   0.001 0.001   1.520   0.129
childrenyes    0.008 0.004   1.946   0.052
religiousness -0.001 0.001  -0.824   0.410
education     -0.002 0.001  -2.562   0.011
occupation     0.001 0.001   0.696   0.487

$ME.2
              effect error t.value p.value
gendermale     0.014 0.015   0.912   0.362
affairs        0.011 0.002   4.851   0.000
age            0.001 0.001   1.023   0.307
yearsmarried   0.003 0.002   1.608   0.108
childrenyes    0.035 0.016   2.147   0.032
religiousness -0.005 0.006  -0.841   0.401
education     -0.010 0.003  -3.017   0.003
occupation     0.003 0.004   0.704   0.482

$ME.3
              effect error t.value p.value
gendermale     0.016 0.018   0.915   0.361
affairs        0.013 0.003   4.700   0.000
age            0.001 0.001   1.024   0.306
yearsmarried   0.004 0.002   1.603   0.109
childrenyes    0.042 0.020   2.067   0.039
religiousness -0.006 0.007  -0.841   0.401
education     -0.012 0.004  -2.983   0.003
occupation     0.004 0.005   0.703   0.482

$ME.4
              effect error t.value p.value
gendermale     0.005 0.006   0.882   0.378
affairs        0.004 0.002   2.521   0.012
age            0.000 0.000   0.975   0.330
yearsmarried   0.001 0.001   1.381   0.168
childrenyes    0.021 0.014   1.473   0.141
religiousness -0.002 0.002  -0.806   0.421
education     -0.004 0.002  -2.088   0.037
occupation     0.001 0.002   0.682   0.495

$ME.5
              effect error t.value p.value
gendermale    -0.039 0.042  -0.919   0.359
affairs       -0.031 0.006  -5.491   0.000
age           -0.003 0.003  -1.030   0.303
yearsmarried  -0.010 0.006  -1.620   0.106
childrenyes   -0.105 0.052  -2.010   0.045
religiousness  0.013 0.016   0.843   0.399
education      0.028 0.009   3.141   0.002
occupation    -0.009 0.012  -0.705   0.481

$ME.all
              effect.1 effect.2 effect.3 effect.4 effect.5
gendermale       0.003    0.014    0.016    0.005   -0.039
affairs          0.003    0.011    0.013    0.004   -0.031
age              0.000    0.001    0.001    0.000   -0.003
yearsmarried     0.001    0.003    0.004    0.001   -0.010
childrenyes      0.008    0.035    0.042    0.021   -0.105
religiousness   -0.001   -0.005   -0.006   -0.002    0.013
education       -0.002   -0.010   -0.012   -0.004    0.028
occupation       0.001    0.003    0.004    0.001   -0.009

affairsについて確認すると，affairsが1増加すると，結婚満足度か1である確率が0.003増加し，2である確率が0.011増加し，...，5である確率が0.031減少することがわかります．

まとめ

　アンケートのデータを分析することにより，不倫が結婚満足度に悪影響を与えることが，客観的に証明できました．さらに，その影響も定量化することができました．結婚満足度5に対する影響を見ると，不倫を1回するよりも，子供を持つことのほうが大きな悪影響を及ぼすようです．

感想

　この本を読んで計量経済学の手法が面白いと感じ，自分でやってみたくなったのが動機です．なんで不倫について調べたかと言うと，すぐに見つかった順序つきデータがこれしかなかったからです．なにか面白そうなデータが有れば教えてください．マーケティングに向けての顧客満足度のアンケートとか見てみたいです．ここまで読んでいただいてありがとうございました．

コード

library(AER)
library(summarytools)
library(ggplot2)
library(MASS)
library(reshape2)
library(erer)
library(effects)



data <- data("Affairs")

dfSummary(Affairs)
Affairs$rating <- as.factor(Affairs$rating)
kable(head(Affairs,5), format = "markdown")

ggplot(Affairs, aes(x = rating, y = affairs)) +
  geom_boxplot(size = .75) +
  geom_jitter(alpha = .5) 



# 順序付きロジットモデル推定
## https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/
m <- polr(rating ~ gender + affairs + age +yearsmarried + children + religiousness + education + occupation , data = Affairs,  method = c("logistic"))
m
summary(m)
(ctable <- coef(summary(m)))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))


### http://data.library.virginia.edu/visualizing-the-effects-of-proportional-odds-logistic-regression/

g.eff <- Effect(focal.predictors = c("affairs"), m)
g.eff
plot(g.eff, rug = FALSE)

g.eff$model.matrix


g.eff <- Effect(focal.predictors = c("children"), m)
g.eff
plot(g.eff, rug = FALSE)

g.eff <- Effect(focal.predictors = c("education"), m)
g.eff
plot(g.eff, rug = FALSE)



# ererで限界効果計算

mea <-  ocME(w = m)
mea$out

Fair, R.C. (1978). A Theory of Extramarital Affairs. Journal of Political Economy, 86, 45–61　https://pdfs.semanticscholar.org/a8b0/4f90d9964c563917a66c50be2e2974fc4d78.pdf ↩
参考　http://data.library.virginia.edu/fitting-and-interpreting-a-proportional-odds-model/ ↩
参考　http://data.library.virginia.edu/visualizing-the-effects-of-proportional-odds-logistic-regression/ ↩ ↩²

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up