Solving 『統計検定準1級対応 統計学実践ワークブック』 in R and Python: Chapter 18, Qualitative Regression

Posted at 2020-08-02

日本統計学会公式認定 統計検定準1級対応 統計学実践ワークブック

is useful not only for preparing for the 統計検定 (Japan Statistical Society certification exam) but also for organizing one's knowledge of statistics, so I have also tried implementing the problems in R and Python.

Problem 18.1

Data

  • Transcribed by hand, as below
    • 18.1.csv
L1,患者,寛解
8,2,0
10,2,0
12,3,0
14,3,0
16,3,0
18,1,1
20,3,2
22,2,1
24,1,0
26,1,1
28,1,1
32,1,0
34,1,1
38,3,2

R

df<-read.csv('18.1.csv')
head(df)
A data.frame: 6 × 3
L1  患者  寛解
<int>   <int>   <int>
1   8   2   0
2   10  2   0
3   12  3   0
4   14  3   0
5   16  3   0
6   18  1   1
  • Expand into one row per patient
df2<-data.frame()
for (i in (1:nrow(df))) {
  rem<-df[i, 3]                     # number of remissions in this group
  for (j in (1:df[i, 2])) {         # one row per patient
    rem = rem - 1
    if (rem >= 0) {
      remission = 1
    } else {
      remission = 0
    }
    tmp.df<-data.frame(L1 = df[i, 1], REM = remission)
    df2<-rbind(df2, tmp.df)
  }
}
df2$REM = as.factor(df2$REM)
head(df2)
A data.frame: 6 × 2
L1  REM
<int>   <fct>
1   8   0
2   8   0
3   10  0
4   10  0
5   12  0
6   12  0
  • Logistic regression
model<-glm(REM ~ L1, df2, family = binomial("logit"))
summary(model)
Call:
glm(formula = REM ~ L1, family = binomial("logit"), data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9448  -0.6465  -0.4947   0.6571   1.6971  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -3.77714    1.37862  -2.740  0.00615 **
L1           0.14486    0.05934   2.441  0.01464 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.372  on 26  degrees of freedom
Residual deviance: 26.073  on 25  degrees of freedom
AIC: 30.073

Number of Fisher Scoring iterations: 4

Python

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('18.1.csv')
df.head()
    L1  患者  寛解
0   8   2   0
1   10  2   0
2   12  3   0
3   14  3   0
4   16  3   0
# Build one row per patient: the first 寛解 patients in each L1 group are flagged as remission
rows = []
for i in range(len(df)):
  rem = df.iloc[i, 2]
  for j in range(df.iloc[i, 1]):
    rem -= 1
    if rem >= 0:
      remission = 1
    else:
      remission = 0
    rows.append({'L1': df.iloc[i, 0], 'REM': remission})
df2 = pd.DataFrame(rows, columns=['L1', 'REM'])  # collect rows first; DataFrame.append was removed in pandas 2.x
df2.head()
L1  REM
0   8   0
1   8   0
2   10  0
3   10  0
4   12  0
df_x = pd.DataFrame(df2['L1'])
df_x = sm.add_constant(df_x)
logit_mod = sm.Logit(np.asarray(df2['REM'].astype('float')), np.asarray(df_x.astype('float')))
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   27
Model:                          Logit   Df Residuals:                       25
Method:                           MLE   Df Model:                            1
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                  0.2414
Time:                        06:58:18   Log-Likelihood:                -13.036
converged:                       True   LL-Null:                       -17.186
Covariance Type:            nonrobust   LLR p-value:                  0.003967
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.7771      1.379     -2.740      0.006      -6.479      -1.075
x1             0.1449      0.059      2.441      0.015       0.029       0.261
==============================================================================
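Incidentally, the per-patient expansion is not strictly required for grouped data like 18.1.csv: statsmodels' GLM accepts a binomial response given as two columns of (successes, failures), so the model can be fit directly from the grouped table. A minimal sketch of this alternative, assuming the same 18.1.csv; the coefficient estimates should agree with the Logit fit above.

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('18.1.csv')
# Response as [successes, failures] per group: [寛解, 患者 - 寛解]
endog = np.column_stack([df['寛解'], df['患者'] - df['寛解']])
exog = sm.add_constant(df[['L1']].astype(float))
glm_res = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(glm_res.params)   # should come out close to (-3.777, 0.145), matching the fit above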

Problems 18.2 and 18.3

Data

  • Transcribed by hand
    • 18.2.csv
smoking,obesity,snoring,cases,hypertension
0,0,0,60,5
1,0,0,17,2
0,1,0,8,1
1,1,0,2,0
0,0,1,187,35
1,0,1,85,13
0,1,1,51,15
1,1,1,23,8

R

df<-read.csv('18.2.csv')
head(df)
A data.frame: 6 × 5
smoking obesity snoring cases   hypertension
<int>   <int>   <int>   <int>   <int>
1   0   0   0   60  5
2   1   0   0   17  2
3   0   1   0   8   1
4   1   1   0   2   0
5   0   0   1   187 35
6   1   0   1   85  13
  • Expand into one row per patient
df2<-data.frame()
for (i in (1:nrow(df))) {
  ht<-df[i, 5]                      # number of hypertensive patients in this group
  for (j in (1:df[i, 4])) {         # one row per patient
    ht = ht - 1
    if (ht >= 0) {
      hypertension = 1
    } else {
      hypertension = 0
    }
    tmp.df<-data.frame(df[i, 1:3], HT = hypertension)
    df2<-rbind(df2, tmp.df)
  }
}
df2$HT = as.factor(df2$HT)
head(df2)
A data.frame: 6 × 4
smoking obesity snoring HT
<int>   <int>   <int>   <fct>
1   0   0   0   1
2   0   0   0   1
3   0   0   0   1
4   0   0   0   1
5   0   0   0   1
6   0   0   0   0
  • 18.2 Logistic regression
model.logit<-glm(HT ~ smoking + obesity + snoring, df2, family = binomial("logit"))
summary(model.logit)
Call:
glm(formula = HT ~ smoking + obesity + snoring, family = binomial("logit"), 
    data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8578  -0.6330  -0.6138  -0.4212   2.2488  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.37766    0.38010  -6.255 3.97e-10 ***
smoking     -0.06777    0.27812  -0.244   0.8075    
obesity      0.69531    0.28508   2.439   0.0147 *  
snoring      0.87194    0.39749   2.194   0.0283 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 411.42  on 432  degrees of freedom
Residual deviance: 398.92  on 429  degrees of freedom
AIC: 406.92

Number of Fisher Scoring iterations: 4
  • The Pr(>|z|) for the Intercept differs slightly from the textbook, which is a bit puzzling, but everything else matches the text.

  • 18.3 Probit model

model.probit<-glm(HT ~ smoking + obesity + snoring, df2, family = binomial("probit"))
summary(model.probit)

Call:
glm(formula = HT ~ smoking + obesity + snoring, family = binomial("probit"), 
    data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8542  -0.6337  -0.6142  -0.4211   2.2531  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.37312    0.19308  -7.112 1.15e-12 ***
smoking     -0.03865    0.15682  -0.246   0.8053    
obesity      0.39996    0.16748   2.388   0.0169 *  
snoring      0.46507    0.20464   2.273   0.0230 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 411.42  on 432  degrees of freedom
Residual deviance: 399.00  on 429  degrees of freedom
AIC: 407

Number of Fisher Scoring iterations: 4

Python

  • 18.2
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('18.2.csv')
df.head()
smoking obesity snoring cases   hypertension
0   0   0   0   60  5
1   1   0   0   17  2
2   0   1   0   8   1
3   1   1   0   2   0
4   0   0   1   187 35
# Build one row per patient: the first `hypertension` patients in each group are flagged as 1
rows = []
for i in range(len(df)):
  ht = df.iloc[i, 4]
  for j in range(df.iloc[i, 3]):
    ht -= 1
    if ht >= 0:
      hypertension = 1
    else:
      hypertension = 0
    rows.append({'smoking': df.iloc[i, 0], 'obesity': df.iloc[i, 1], 'snoring': df.iloc[i, 2], 'hypertension': hypertension})
df2 = pd.DataFrame(rows, columns=['smoking', 'obesity', 'snoring', 'hypertension'])  # collect rows first; DataFrame.append was removed in pandas 2.x
df2.head()
smoking obesity snoring hypertension
0   0   0   0   1
1   0   0   0   1
2   0   0   0   1
3   0   0   0   1
4   0   0   0   1
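For reference, the same expansion can also be written without the explicit double loop, using index.repeat and a cumulative count within each group. A sketch, assuming df still holds the grouped 18.2.csv table:

expanded = df.loc[df.index.repeat(df['cases'])].copy()               # repeat each group 'cases' times
flag = expanded.groupby(level=0).cumcount() < expanded['hypertension']
expanded['hypertension'] = flag.astype(int)                          # first 'hypertension' rows of each group become 1
expanded = expanded.drop(columns='cases').reset_index(drop=True)
expanded.head()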
  • 18.2 Logistic regression
df_x = pd.DataFrame(df2.iloc[:, 0:3])
df_x = sm.add_constant(df_x)
logit_mod = sm.Logit(np.asarray(df2['hypertension'].astype('float')), np.asarray(df_x.astype('float')))
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  433
Model:                          Logit   Df Residuals:                      429
Method:                           MLE   Df Model:                            3
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                 0.03040
Time:                        07:05:53   Log-Likelihood:                -199.46
converged:                       True   LL-Null:                       -205.71
Covariance Type:            nonrobust   LLR p-value:                  0.005832
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.3777      0.380     -6.254      0.000      -3.123      -1.633
x1            -0.0678      0.278     -0.244      0.807      -0.613       0.477
x2             0.6953      0.285      2.439      0.015       0.137       1.254
x3             0.8719      0.398      2.193      0.028       0.093       1.651
==============================================================================
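Because logistic-regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. A quick sketch, reusing logit_res from above:

# Odds ratios (order: const, smoking, obesity, snoring)
print(np.exp(logit_res.params))        # obesity: exp(0.6953) ≈ 2.00, snoring: exp(0.8719) ≈ 2.39
print(np.exp(logit_res.conf_int()))    # 95% confidence intervals on the odds-ratio scale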
  • 18.3 Probit model
df_x = pd.DataFrame(df2.iloc[:, 0:3])
df_x = sm.add_constant(df_x)
probit_mod = sm.Probit(np.asarray(df2['hypertension'].astype('float')), np.asarray(df_x.astype('float')))
probit_res = probit_mod.fit(disp=0)
print(probit_res.summary())
                          Probit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  433
Model:                         Probit   Df Residuals:                      429
Method:                           MLE   Df Model:                            3
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                 0.03019
Time:                        07:07:14   Log-Likelihood:                -199.50
converged:                       True   LL-Null:                       -205.71
Covariance Type:            nonrobust   LLR p-value:                  0.006077
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.3731      0.193     -7.120      0.000      -1.751      -0.995
x1            -0.0387      0.157     -0.246      0.805      -0.346       0.269
x2             0.4000      0.168      2.384      0.017       0.071       0.729
x3             0.4651      0.204      2.274      0.023       0.064       0.866
==============================================================================
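As a final comparison, the fitted probabilities of the two models can be evaluated on the eight covariate patterns of the grouped table; the probit coefficients are roughly the logit coefficients shrunk by a factor of about 1.6 to 1.8, so the predicted probabilities come out nearly identical. A sketch, assuming df, logit_res, and probit_res from the snippets above are still in scope:

patterns = sm.add_constant(df.iloc[:, 0:3].astype(float))   # const, smoking, obesity, snoring
compare = pd.DataFrame({
    'logit':    logit_res.predict(np.asarray(patterns)),
    'probit':   probit_res.predict(np.asarray(patterns)),
    'observed': df['hypertension'] / df['cases'],
})
print(compare)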