
Solving 『統計検定準1級対応 統計学実践ワークブック』 in R and Python: Chapter 18, Qualitative Regression

日本統計学会公式認定 統計検定準1級対応 統計学実践ワークブック (the Japan Statistical Society's official workbook for Grade Pre-1 of the Statistics Certification Exam)

is useful not only as exam preparation but also for organizing one's knowledge of statistics, so I have also tried implementing its problems in R and Python.

Problem 18.1

Data

  • Transcribed by hand as below
    • 18.1.csv
L1,患者,寛解
8,2,0
10,2,0
12,3,0
14,3,0
16,3,0
18,1,1
20,3,2
22,2,1
24,1,0
26,1,1
28,1,1
32,1,0
34,1,1
38,3,2

R

df<-read.csv('18.1.csv')
head(df)
A data.frame: 6 × 3
L1  患者  寛解
<int>   <int>   <int>
1   8   2   0
2   10  2   0
3   12  3   0
4   14  3   0
5   16  3   0
6   18  1   1
  • Expand to one row per patient
df2<-data.frame()
for (i in (1:nrow(df))) {
  rem<-df[i, 3]  # remission count remaining for this row
  for (j in (1:df[i, 2])) {
    rem = rem - 1
    if (rem >= 0) {
      remission = 1
    } else {
      remission = 0
    }
    tmp.df<-data.frame(L1 = df[i, 1], REM = remission)
    df2<-rbind(df2, tmp.df)
  }
}
df2$REM = as.factor(df2$REM)
head(df2)
A data.frame: 6 × 2
L1  REM
<int>   <fct>
1   8   0
2   8   0
3   10  0
4   10  0
5   12  0
6   12  0
  • Logistic regression
model<-glm(REM ~ L1, df2, family = binomial("logit"))
summary(model)
Call:
glm(formula = REM ~ L1, family = binomial("logit"), data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9448  -0.6465  -0.4947   0.6571   1.6971  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -3.77714    1.37862  -2.740  0.00615 **
L1           0.14486    0.05934   2.441  0.01464 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.372  on 26  degrees of freedom
Residual deviance: 26.073  on 25  degrees of freedom
AIC: 30.073

Number of Fisher Scoring iterations: 4

Python

import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('18.1.csv')
df.head()
    L1  患者  寛解
0   8   2   0
1   10  2   0
2   12  3   0
3   14  3   0
4   16  3   0
# Collect rows in a list, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for i in range(len(df)):
  rem = df.iloc[i, 2]
  for j in range(df.iloc[i, 1]):
    rem -= 1
    remission = 1 if rem >= 0 else 0
    rows.append({'L1': df.iloc[i, 0], 'REM': remission})
df2 = pd.DataFrame(rows, columns=['L1', 'REM'])
df2.head()
L1  REM
0   8   0
1   8   0
2   10  0
3   10  0
4   12  0
df_x = pd.DataFrame(df2['L1'])
df_x = sm.add_constant(df_x)
logit_mod = sm.Logit(np.asarray(df2['REM'].astype('float')), np.asarray(df_x.astype('float')))
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   27
Model:                          Logit   Df Residuals:                       25
Method:                           MLE   Df Model:                            1
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                  0.2414
Time:                        06:58:18   Log-Likelihood:                -13.036
converged:                       True   LL-Null:                       -17.186
Covariance Type:            nonrobust   LLR p-value:                  0.003967
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.7771      1.379     -2.740      0.006      -6.479      -1.075
x1             0.1449      0.059      2.441      0.015       0.029       0.261
==============================================================================
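With the fitted coefficients above, the predicted remission probability at a given value of the `L1` column follows directly from the logistic function. A minimal sketch using the rounded estimates from the summary:

```python
import math

# Rounded estimates from the Logit summary above: intercept and slope on L1
b0, b1 = -3.7771, 0.1449

def remission_prob(l1):
    """Predicted P(REM = 1 | L1 = l1) under the fitted logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * l1)))

# The probability crosses 0.5 near L1 = -b0/b1, about 26.1
print(round(remission_prob(26), 3))
```

So the model predicts roughly even odds of remission once `L1` reaches about 26.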

Problems 18.2, 18.3

Data

  • Transcribed by hand
    • 18.2.csv
smoking,obesity,snoring,cases,hypertension
0,0,0,60,5
1,0,0,17,2
0,1,0,8,1
1,1,0,2,0
0,0,1,187,35
1,0,1,85,13
0,1,1,51,15
1,1,1,23,8

R

df<-read.csv('18.2.csv')
head(df)
A data.frame: 6 × 5
smoking obesity snoring cases   hypertension
<int>   <int>   <int>   <int>   <int>
1   0   0   0   60  5
2   1   0   0   17  2
3   0   1   0   8   1
4   1   1   0   2   0
5   0   0   1   187 35
6   1   0   1   85  13
  • Expand to one row per patient
df2<-data.frame()
for (i in (1:nrow(df))) {
  ht<-df[i, 5]
  for (j in (1:df[i, 4])) {
    ht = ht - 1
    if (ht >= 0) {
      hypertension = 1
    } else {
      hypertension = 0
    }
    tmp.df<-data.frame(df[i, 1:3], HT = hypertension)
    df2<-rbind(df2, tmp.df)
  }
}
df2$HT = as.factor(df2$HT)
head(df2)
A data.frame: 6 × 4
smoking obesity snoring HT
<int>   <int>   <int>   <fct>
1   0   0   0   1
2   0   0   0   1
3   0   0   0   1
4   0   0   0   1
5   0   0   0   1
6   0   0   0   0
  • 18.2 Logistic regression
model.logit<-glm(HT ~ smoking + obesity + snoring, df2, family = binomial("logit"))
summary(model.logit)
Call:
glm(formula = HT ~ smoking + obesity + snoring, family = binomial("logit"), 
    data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8578  -0.6330  -0.6138  -0.4212   2.2488  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.37766    0.38010  -6.255 3.97e-10 ***
smoking     -0.06777    0.27812  -0.244   0.8075    
obesity      0.69531    0.28508   2.439   0.0147 *  
snoring      0.87194    0.39749   2.194   0.0283 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 411.42  on 432  degrees of freedom
Residual deviance: 398.92  on 429  degrees of freedom
AIC: 406.92

Number of Fisher Scoring iterations: 4
  • The Pr(>|z|) for the Intercept differs slightly from the textbook, which is a bit puzzling, but everything else matches the text

  • 18.3 Probit model

model.probit<-glm(HT ~ smoking + obesity + snoring, df2, family = binomial("probit"))
summary(model.probit)
Call:
glm(formula = HT ~ smoking + obesity + snoring, family = binomial("probit"), 
    data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8542  -0.6337  -0.6142  -0.4211   2.2531  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.37312    0.19308  -7.112 1.15e-12 ***
smoking     -0.03865    0.15682  -0.246   0.8053    
obesity      0.39996    0.16748   2.388   0.0169 *  
snoring      0.46507    0.20464   2.273   0.0230 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 411.42  on 432  degrees of freedom
Residual deviance: 399.00  on 429  degrees of freedom
AIC: 407

Number of Fisher Scoring iterations: 4

Python

  • 18.2
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('18.2.csv')
df.head()
smoking obesity snoring cases   hypertension
0   0   0   0   60  5
1   1   0   0   17  2
2   0   1   0   8   1
3   1   1   0   2   0
4   0   0   1   187 35
# Collect rows in a list, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for i in range(len(df)):
  ht = df.iloc[i, 4]
  for j in range(df.iloc[i, 3]):
    ht -= 1
    hypertension = 1 if ht >= 0 else 0
    rows.append({'smoking': df.iloc[i, 0], 'obesity': df.iloc[i, 1],
                 'snoring': df.iloc[i, 2], 'hypertension': hypertension})
df2 = pd.DataFrame(rows, columns=['smoking', 'obesity', 'snoring', 'hypertension'])
df2.head()
smoking obesity snoring hypertension
0   0   0   0   1
1   0   0   0   1
2   0   0   0   1
3   0   0   0   1
4   0   0   0   1
  • 18.2 Logistic regression
df_x = pd.DataFrame(df2.iloc[:, 0:3])
df_x = sm.add_constant(df_x)
logit_mod = sm.Logit(np.asarray(df2['hypertension'].astype('float')), np.asarray(df_x.astype('float')))
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  433
Model:                          Logit   Df Residuals:                      429
Method:                           MLE   Df Model:                            3
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                 0.03040
Time:                        07:05:53   Log-Likelihood:                -199.46
converged:                       True   LL-Null:                       -205.71
Covariance Type:            nonrobust   LLR p-value:                  0.005832
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.3777      0.380     -6.254      0.000      -3.123      -1.633
x1            -0.0678      0.278     -0.244      0.807      -0.613       0.477
x2             0.6953      0.285      2.439      0.015       0.137       1.254
x3             0.8719      0.398      2.193      0.028       0.093       1.651
==============================================================================
  • 18.3 Probit model
df_x = pd.DataFrame(df2.iloc[:, 0:3])
df_x = sm.add_constant(df_x)
probit_mod = sm.Probit(np.asarray(df2['hypertension'].astype('float')), np.asarray(df_x.astype('float')))
probit_res = probit_mod.fit(disp=0)
print(probit_res.summary())
                          Probit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  433
Model:                         Probit   Df Residuals:                      429
Method:                           MLE   Df Model:                            3
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                 0.03019
Time:                        07:07:14   Log-Likelihood:                -199.50
converged:                       True   LL-Null:                       -205.71
Covariance Type:            nonrobust   LLR p-value:                  0.006077
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.3731      0.193     -7.120      0.000      -1.751      -0.995
x1            -0.0387      0.157     -0.246      0.805      -0.346       0.269
x2             0.4000      0.168      2.384      0.017       0.071       0.729
x3             0.4651      0.204      2.274      0.023       0.064       0.866
==============================================================================
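As a quick consistency check between the two fits: logit coefficients are typically about π/√3 ≈ 1.8 times the corresponding probit coefficients, because the standard logistic distribution has standard deviation π/√3 while the standard normal has 1. Checking against the rounded estimates printed above:

```python
# Rounded slope estimates from the two summaries above
logit  = {'smoking': -0.0678, 'obesity': 0.6953, 'snoring': 0.8719}
probit = {'smoking': -0.0387, 'obesity': 0.4000, 'snoring': 0.4651}

for name in logit:
    ratio = logit[name] / probit[name]
    print(name, round(ratio, 2))  # each ratio lies roughly between 1.7 and 1.9
```

The two models therefore tell essentially the same story here; only the scale of the coefficients differs.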
aokikenichi