Help us understand the problem. What is going on with this article?

# 『統計検定準1級対応統計学実践ワークブック』をR, Pythonで解く～第18章質的回帰～

が統計検定の準備だけでなく統計学の整理に役立つので、R, Pythonでの実装も試みた。

# 問18.1

## データ

• 手打ちで転記した下記
• 18.1.csv
```L1,患者,寛解
8,2,0
10,2,0
12,3,0
14,3,0
16,3,0
18,1,1
20,3,2
22,2,1
24,1,0
26,1,1
28,1,1
32,1,0
34,1,1
38,3,2
```

## R

```df<-read.csv('19.1.csv')
```
```A data.frame: 6 × 3
L1  患者  寛解
<int>   <int>   <int>
1   8   2   0
2   10  2   0
3   12  3   0
4   14  3   0
5   16  3   0
6   18  1   1
```
• 1行1患者のデータに展開
```df2<-data.frame()
for (i in (1:nrow(df))) {
yo<-df[i, 3]
for (j in (1:df[i, 2])) {
rem = rem - 1
if (rem >= 0) {
remission = 1
} else {
remission = 0
}
tmp.df<-data.frame(L1 = df[i, 1], REM = remission)
df2<-rbind(df2, tmp.df)
}
}
df2\$REM = as.factor(df2\$REM)
```
```A data.frame: 6 × 2
L1  REM
<int>   <fct>
1   8   0
2   8   0
3   10  0
4   10  0
5   12  0
6   12  0
```
• ロジスティック回帰
```model<-glm(REM ~ L1, df2, family = binomial("logit"))
summary(model)
```
```Call:
glm(formula = REM ~ L1, family = binomial("logit"), data = df2)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.9448  -0.6465  -0.4947   0.6571   1.6971

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.77714    1.37862  -2.740  0.00615 **
L1           0.14486    0.05934   2.441  0.01464 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 34.372  on 26  degrees of freedom
Residual deviance: 26.073  on 25  degrees of freedom
AIC: 30.073

Number of Fisher Scoring iterations: 4
```

## Python

```import pandas as pd
import statsmodels.api as sm

```
```    L1  患者  寛解
0   8   2   0
1   10  2   0
2   12  3   0
3   14  3   0
4   16  3   0
```
```df2 = pd.DataFrame(columns = ['L1', 'REM'])
for i in range(len(df)):
rem = df.iloc[i, 2]
for j in range(df.iloc[i, 1]):
rem -= 1
if rem >= 0:
remission = 1
else:
remission = 0
df2 = df2.append({'L1': df.iloc[i, 0], 'REM': remission}, ignore_index=True)
```
```L1  REM
0   8   0
1   8   0
2   10  0
3   10  0
4   12  0
```
```df_x = pd.DataFrame(df2['L1'])
logit_mod = sm.Logit(np.asarray(df2['REM'].astype('float')), np.asarray(df_x.astype('float')))
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())
```
```                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   27
Model:                          Logit   Df Residuals:                       25
Method:                           MLE   Df Model:                            1
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                  0.2414
Time:                        06:58:18   Log-Likelihood:                -13.036
converged:                       True   LL-Null:                       -17.186
Covariance Type:            nonrobust   LLR p-value:                  0.003967
==============================================================================
coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.7771      1.379     -2.740      0.006      -6.479      -1.075
x1             0.1449      0.059      2.441      0.015       0.029       0.261
==============================================================================
```

# 問18.2, 18.3

## データ

• 手打ちで転記
• 18.2.csv
```smoking,obesity,snoring,cases,hypertension
0,0,0,60,5
1,0,0,17,2
0,1,0,8,1
1,1,0,2,0
0,0,1,187,35
1,0,1,85,13
0,1,1,51,15
1,1,1,23,8
```

## R

```df<-read.csv('18.2.csv')
```
```A data.frame: 6 × 5
smoking obesity snoring cases   hypertension
<int>   <int>   <int>   <int>   <int>
1   0   0   0   60  5
2   1   0   0   17  2
3   0   1   0   8   1
4   1   1   0   2   0
5   0   0   1   187 35
6   1   0   1   85  13
```
• 1行1患者に展開
```df2<-data.frame()
for (i in (1:nrow(df))) {
ht<-df[i, 5]
for (j in (1:df[i, 4])) {
ht = ht - 1
if (ht >= 0) {
hypertension = 1
} else {
hypertension = 0
}
tmp.df<-data.frame(df[i, 1:3], HT = hypertension)
df2<-rbind(df2, tmp.df)
}
}
df2\$HT = as.factor(df2\$HT)
```
```A data.frame: 6 × 4
smoking obesity snoring HT
<int>   <int>   <int>   <fct>
1   0   0   0   1
2   0   0   0   1
3   0   0   0   1
4   0   0   0   1
5   0   0   0   1
6   0   0   0   0
```
• 18.2 ロジスティック回帰
```model.logit<-glm(HT ~ smoking + obesity + snoring, df2, family = binomial("logit"))
summary(model.logit)
```
```Call:
glm(formula = HT ~ smoking + obesity + snoring, family = binomial("logit"),
data = df2)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-0.8578  -0.6330  -0.6138  -0.4212   2.2488

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766    0.38010  -6.255 3.97e-10 ***
smoking     -0.06777    0.27812  -0.244   0.8075
obesity      0.69531    0.28508   2.439   0.0147 *
snoring      0.87194    0.39749   2.194   0.0283 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 411.42  on 432  degrees of freedom
Residual deviance: 398.92  on 429  degrees of freedom
AIC: 406.92

Number of Fisher Scoring iterations: 4
```
• InterceptのPrが少し違うのが?だがほかはテキスト通り

• 18.3 プロビットモデル

```model.probit<-glm(HT ~ smoking + obesity + snoring, df2, family = binomial("probit"))
summary(model.probit)
```
```Call:
glm(formula = HT ~ smoking + obesity + snoring, family = binomial("probit"),
data = df2)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-0.8542  -0.6337  -0.6142  -0.4211   2.2531

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.37312    0.19308  -7.112 1.15e-12 ***
smoking     -0.03865    0.15682  -0.246   0.8053
obesity      0.39996    0.16748   2.388   0.0169 *
snoring      0.46507    0.20464   2.273   0.0230 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 411.42  on 432  degrees of freedom
Residual deviance: 399.00  on 429  degrees of freedom
AIC: 407

Number of Fisher Scoring iterations: 4
```

## Python

• 18.2
```import pandas as pd
import statsmodels.api as sm

```
```smoking obesity snoring cases   hypertension
0   0   0   0   60  5
1   1   0   0   17  2
2   0   1   0   8   1
3   1   1   0   2   0
4   0   0   1   187 35
```
```df2 = pd.DataFrame(columns = ['smoking', 'obesity', 'snoring', 'hypertension'])
for i in range(len(df)):
ht = df.iloc[i, 4]
for j in range(df.iloc[i, 3]):
ht -= 1
if ht >= 0:
hypertension = 1
else:
hypertension = 0
df2 = df2.append({'smoking': df.iloc[i, 0], 'obesity': df.iloc[i, 1], 'snoring': df.iloc[i, 2], 'hypertension': hypertension}, ignore_index=True)
```
```smoking obesity snoring hypertension
0   0   0   0   1
1   0   0   0   1
2   0   0   0   1
3   0   0   0   1
4   0   0   0   1
```
• 18.2 ロジスティック回帰
```df_x = pd.DataFrame(df2.iloc[:, 0:3])
logit_mod = sm.Logit(np.asarray(df2['hypertension'].astype('float')), np.asarray(df_x.astype('float')))
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())
```
```                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  433
Model:                          Logit   Df Residuals:                      429
Method:                           MLE   Df Model:                            3
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                 0.03040
Time:                        07:05:53   Log-Likelihood:                -199.46
converged:                       True   LL-Null:                       -205.71
Covariance Type:            nonrobust   LLR p-value:                  0.005832
==============================================================================
coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.3777      0.380     -6.254      0.000      -3.123      -1.633
x1            -0.0678      0.278     -0.244      0.807      -0.613       0.477
x2             0.6953      0.285      2.439      0.015       0.137       1.254
x3             0.8719      0.398      2.193      0.028       0.093       1.651
==============================================================================
```
• 18.3 プロビットモデル
```df_x = pd.DataFrame(df2.iloc[:, 0:3])
probit_mod = sm.Probit(np.asarray(df2['hypertension'].astype('float')), np.asarray(df_x.astype('float')))
probit_res = probit_mod.fit(disp=0)
print(probit_res.summary())
```
```                          Probit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  433
Model:                         Probit   Df Residuals:                      429
Method:                           MLE   Df Model:                            3
Date:                Sun, 02 Aug 2020   Pseudo R-squ.:                 0.03019
Time:                        07:07:14   Log-Likelihood:                -199.50
converged:                       True   LL-Null:                       -205.71
Covariance Type:            nonrobust   LLR p-value:                  0.006077
==============================================================================
coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.3731      0.193     -7.120      0.000      -1.751      -0.995
x1            -0.0387      0.157     -0.246      0.805      -0.346       0.269
x2             0.4000      0.168      2.384      0.017       0.071       0.729
x3             0.4651      0.204      2.274      0.023       0.064       0.866
==============================================================================
```