
Solving 『統計検定準1級対応 統計学実践ワークブック』 in R and Python: Chapter 18, Qualitative Regression

日本統計学会公式認定 統計検定準1級対応 統計学実践ワークブック (the Japan Statistical Society's official workbook for Grade Pre-1 of the Statistics Certification Exam)

is useful not only as exam preparation but also for organizing one's knowledge of statistics, so I have also tried implementing its problems in R and Python.

Problem 18.1

Data

  • Transcribed by hand as below
    • 18.1.csv
L1,患者,寛解
8,2,0
10,2,0
12,3,0
14,3,0
16,3,0
18,1,1
20,3,2
22,2,1
24,1,0
26,1,1
28,1,1
32,1,0
34,1,1
38,3,2

R

df<-read.csv('18.1.csv')
head(df)
A data.frame: 6 × 3
L1  患者  寛解
<int>   <int>   <int>
1   8   2   0
2   10  2   0
3   12  3   0
4   14  3   0
5   16  3   0
6   18  1   1
  • Expand to one row per patient
df2<-data.frame()
for (i in (1:nrow(df))) {
  rem<-df[i, 3]  # remission count remaining for this row
  for (j in (1:df[i, 2])) {
    rem = rem - 1
    if (rem >= 0) {
      remission = 1
    } else {
      remission = 0
    }
    tmp.df<-data.frame(L1 = df[i, 1], REM = remission)
    df2<-rbind(df2, tmp.df)
  }
}
df2$REM = as.factor(df2$REM)
head(df2)
A data.frame: 6 × 2
L1  REM
<int>   <fct>
1   8   0
2   8   0
3   10  0
4   10  0
5   12  0
6   12  0
  • Logistic regression
model<-glm(REM ~ L1, df2, family = binomial("logit"))
summary(model)
Call:
glm(formula = REM ~ L1, family = binomial("logit"), data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9448  -0.6465  -0.4947   0.6571   1.6971  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -3.77714    1.37862  -2.740  0.00615 **
L1           0.14486    0.05934   2.441  0.01464 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.372  on 26  degrees of freedom
Residual deviance: 26.073  on 25  degrees of freedom
AIC: 30.073

Number of Fisher Scoring iterations: 4

Python

import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('18.1.csv')
df.head()
    L1  患者  寛解
0   8   2   0
1   10  2   0
2   12  3   0
3   14  3   0
4   16  3   0
# Collect rows in a list, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for i in range(len(df)):
  rem = df.iloc[i, 2]
  for j in range(df.iloc[i, 1]):
    rem -= 1
    remission = 1 if rem >= 0 else 0
    rows.append({'L1': df.iloc[i, 0], 'REM': remission})
df2 = pd.DataFrame(rows, columns=['L1', 'REM'])
df2.head()
L1  REM
0   8   0
1   8   0
2   10  0
3   10  0
4   12  0
df_x = pd.DataFrame(df2['L1'])
df_x = sm.add_constant(df_x)
logit_mod = sm.Logit(np.asarray(df2['REM'].astype('float')), np.asarray(df_x.astype('float')))
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   27
Model:                          Logit   Df Residuals:                       25
Method:                           MLE   Df Model:                            1
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                  0.2414
Time:                        06:58:18   Log-Likelihood:                -13.036
converged:                       True   LL-Null:                       -17.186
Covariance Type:            nonrobust   LLR p-value:                  0.003967
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.7771      1.379     -2.740      0.006      -6.479      -1.075
x1             0.1449      0.059      2.441      0.015       0.029       0.261
==============================================================================
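With the fitted coefficients above, the predicted remission probability at a given value of the `L1` column follows directly from the logistic function. A minimal sketch using the rounded estimates from the summary:

```python
import math

# Rounded estimates from the Logit summary above: intercept and slope on L1
b0, b1 = -3.7771, 0.1449

def remission_prob(l1):
    """Predicted P(REM = 1 | L1 = l1) under the fitted logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * l1)))

# The probability crosses 0.5 near L1 = -b0/b1, about 26.1
print(round(remission_prob(26), 3))
```

So the model predicts roughly even odds of remission once `L1` reaches about 26.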

Problems 18.2, 18.3

Data

  • Transcribed by hand
    • 18.2.csv
smoking,obesity,snoring,cases,hypertension
0,0,0,60,5
1,0,0,17,2
0,1,0,8,1
1,1,0,2,0
0,0,1,187,35
1,0,1,85,13
0,1,1,51,15
1,1,1,23,8

R

df<-read.csv('18.2.csv')
head(df)
A data.frame: 6 × 5
smoking obesity snoring cases   hypertension
<int>   <int>   <int>   <int>   <int>
1   0   0   0   60  5
2   1   0   0   17  2
3   0   1   0   8   1
4   1   1   0   2   0
5   0   0   1   187 35
6   1   0   1   85  13
  • Expand to one row per patient
df2<-data.frame()
for (i in (1:nrow(df))) {
  ht<-df[i, 5]
  for (j in (1:df[i, 4])) {
    ht = ht - 1
    if (ht >= 0) {
      hypertension = 1
    } else {
      hypertension = 0
    }
    tmp.df<-data.frame(df[i, 1:3], HT = hypertension)
    df2<-rbind(df2, tmp.df)
  }
}
df2$HT = as.factor(df2$HT)
head(df2)
A data.frame: 6 × 4
smoking obesity snoring HT
<int>   <int>   <int>   <fct>
1   0   0   0   1
2   0   0   0   1
3   0   0   0   1
4   0   0   0   1
5   0   0   0   1
6   0   0   0   0
  • 18.2 Logistic regression
model.logit<-glm(HT ~ smoking + obesity + snoring, df2, family = binomial("logit"))
summary(model.logit)
Call:
glm(formula = HT ~ smoking + obesity + snoring, family = binomial("logit"), 
    data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8578  -0.6330  -0.6138  -0.4212   2.2488  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.37766    0.38010  -6.255 3.97e-10 ***
smoking     -0.06777    0.27812  -0.244   0.8075    
obesity      0.69531    0.28508   2.439   0.0147 *  
snoring      0.87194    0.39749   2.194   0.0283 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 411.42  on 432  degrees of freedom
Residual deviance: 398.92  on 429  degrees of freedom
AIC: 406.92

Number of Fisher Scoring iterations: 4
  • The Pr(>|z|) for the Intercept differs slightly from the textbook, which is a bit puzzling, but everything else matches the text

  • 18.3 Probit model

model.probit<-glm(HT ~ smoking + obesity + snoring, df2, family = binomial("probit"))
summary(model.probit)
Call:
glm(formula = HT ~ smoking + obesity + snoring, family = binomial("probit"), 
    data = df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8542  -0.6337  -0.6142  -0.4211   2.2531  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.37312    0.19308  -7.112 1.15e-12 ***
smoking     -0.03865    0.15682  -0.246   0.8053    
obesity      0.39996    0.16748   2.388   0.0169 *  
snoring      0.46507    0.20464   2.273   0.0230 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 411.42  on 432  degrees of freedom
Residual deviance: 399.00  on 429  degrees of freedom
AIC: 407

Number of Fisher Scoring iterations: 4

Python

  • 18.2
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('18.2.csv')
df.head()
smoking obesity snoring cases   hypertension
0   0   0   0   60  5
1   1   0   0   17  2
2   0   1   0   8   1
3   1   1   0   2   0
4   0   0   1   187 35
# Collect rows in a list, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for i in range(len(df)):
  ht = df.iloc[i, 4]
  for j in range(df.iloc[i, 3]):
    ht -= 1
    hypertension = 1 if ht >= 0 else 0
    rows.append({'smoking': df.iloc[i, 0], 'obesity': df.iloc[i, 1],
                 'snoring': df.iloc[i, 2], 'hypertension': hypertension})
df2 = pd.DataFrame(rows, columns=['smoking', 'obesity', 'snoring', 'hypertension'])
df2.head()
smoking obesity snoring hypertension
0   0   0   0   1
1   0   0   0   1
2   0   0   0   1
3   0   0   0   1
4   0   0   0   1
  • 18.2 Logistic regression
df_x = pd.DataFrame(df2.iloc[:, 0:3])
df_x = sm.add_constant(df_x)
logit_mod = sm.Logit(np.asarray(df2['hypertension'].astype('float')), np.asarray(df_x.astype('float')))
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  433
Model:                          Logit   Df Residuals:                      429
Method:                           MLE   Df Model:                            3
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                 0.03040
Time:                        07:05:53   Log-Likelihood:                -199.46
converged:                       True   LL-Null:                       -205.71
Covariance Type:            nonrobust   LLR p-value:                  0.005832
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.3777      0.380     -6.254      0.000      -3.123      -1.633
x1            -0.0678      0.278     -0.244      0.807      -0.613       0.477
x2             0.6953      0.285      2.439      0.015       0.137       1.254
x3             0.8719      0.398      2.193      0.028       0.093       1.651
==============================================================================
  • 18.3 Probit model
df_x = pd.DataFrame(df2.iloc[:, 0:3])
df_x = sm.add_constant(df_x)
probit_mod = sm.Probit(np.asarray(df2['hypertension'].astype('float')), np.asarray(df_x.astype('float')))
probit_res = probit_mod.fit(disp=0)
print(probit_res.summary())
                          Probit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  433
Model:                         Probit   Df Residuals:                      429
Method:                           MLE   Df Model:                            3
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                 0.03019
Time:                        07:07:14   Log-Likelihood:                -199.50
converged:                       True   LL-Null:                       -205.71
Covariance Type:            nonrobust   LLR p-value:                  0.006077
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.3731      0.193     -7.120      0.000      -1.751      -0.995
x1            -0.0387      0.157     -0.246      0.805      -0.346       0.269
x2             0.4000      0.168      2.384      0.017       0.071       0.729
x3             0.4651      0.204      2.274      0.023       0.064       0.866
==============================================================================
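As a quick consistency check between the two fits: logit coefficients are typically about π/√3 ≈ 1.8 times the corresponding probit coefficients, because the standard logistic distribution has standard deviation π/√3 while the standard normal has 1. Checking against the rounded estimates printed above:

```python
# Rounded slope estimates from the two summaries above
logit  = {'smoking': -0.0678, 'obesity': 0.6953, 'snoring': 0.8719}
probit = {'smoking': -0.0387, 'obesity': 0.4000, 'snoring': 0.4651}

for name in logit:
    ratio = logit[name] / probit[name]
    print(name, round(ratio, 2))  # each ratio lies roughly between 1.7 and 1.9
```

The two models therefore tell essentially the same story here; only the scale of the coefficients differs.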
aokikenichi