日本統計学会公式認定 統計検定準1級対応 統計学実践ワークブック
This workbook (officially certified by the Japan Statistical Society for the Semi-Grade 1 Statistics Certification) is useful not only for preparing for the exam but also for organizing one's knowledge of statistics, so I have also tried implementing its problems in R and Python.
Problem 18.1
Data
- Transcribed by hand, as shown below
- 18.1.csv
L1,患者,寛解
8,2,0
10,2,0
12,3,0
14,3,0
16,3,0
18,1,1
20,3,2
22,2,1
24,1,0
26,1,1
28,1,1
32,1,0
34,1,1
38,3,2
R
df<-read.csv('18.1.csv')
head(df)
A data.frame: 6 × 3
L1 患者 寛解
<int> <int> <int>
1 8 2 0
2 10 2 0
3 12 3 0
4 14 3 0
5 16 3 0
6 18 1 1
- Expand to one row per patient
df2 <- data.frame()
for (i in 1:nrow(df)) {
  rem <- df[i, 3]  # number of remissions (寛解) in this L1 group
  for (j in 1:df[i, 2]) {
    # the first `rem` patients of the group are coded as remission = 1
    rem <- rem - 1
    if (rem >= 0) {
      remission <- 1
    } else {
      remission <- 0
    }
    tmp.df <- data.frame(L1 = df[i, 1], REM = remission)
    df2 <- rbind(df2, tmp.df)
  }
}
df2$REM = as.factor(df2$REM)
head(df2)
A data.frame: 6 × 2
L1 REM
<int> <fct>
1 8 0
2 8 0
3 10 0
4 10 0
5 12 0
6 12 0
- Logistic regression
model<-glm(REM ~ L1, df2, family = binomial("logit"))
summary(model)
Call:
glm(formula = REM ~ L1, family = binomial("logit"), data = df2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9448 -0.6465 -0.4947 0.6571 1.6971
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.77714 1.37862 -2.740 0.00615 **
L1 0.14486 0.05934 2.441 0.01464 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.372 on 26 degrees of freedom
Residual deviance: 26.073 on 25 degrees of freedom
AIC: 30.073
Number of Fisher Scoring iterations: 4
Python
import numpy as np
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('18.1.csv')
df.head()
L1 患者 寛解
0 8 2 0
1 10 2 0
2 12 3 0
3 14 3 0
4 16 3 0
rows = []
for i in range(len(df)):
    rem = df.iloc[i, 2]            # number of remissions (寛解) in this L1 group
    for j in range(df.iloc[i, 1]): # one iteration per patient in the group
        # the first `rem` patients of the group are coded as remission = 1
        rem -= 1
        remission = 1 if rem >= 0 else 0
        rows.append({'L1': df.iloc[i, 0], 'REM': remission})
df2 = pd.DataFrame(rows, columns=['L1', 'REM'])
df2.head()
L1 REM
0 8 0
1 8 0
2 10 0
3 10 0
4 12 0
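- As an aside, the double loop above can also be written without explicit loops; a minimal sketch, assuming the same column layout (L1, 患者, 寛解) and using the illustrative name df2_alt:
import numpy as np
# Loop-free expansion (sketch): repeat each L1 value 患者 times and emit
# 寛解 ones followed by (患者 - 寛解) zeros for REM.
df2_alt = pd.DataFrame({
    'L1': np.repeat(df['L1'].to_numpy(), df['患者'].to_numpy()),
    'REM': np.concatenate([
        np.r_[np.ones(k, dtype=int), np.zeros(n - k, dtype=int)]
        for n, k in zip(df['患者'], df['寛解'])
    ]),
})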
df_x = pd.DataFrame(df2['L1'])
df_x = sm.add_constant(df_x)
logit_mod = sm.Logit(np.asarray(df2['REM'].astype('float')), np.asarray(df_x.astype('float')))
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 27
Model: Logit Df Residuals: 25
Method: MLE Df Model: 1
Date: Sun, 02 Aug 2020 Pseudo R-squ.: 0.2414
Time: 06:58:18 Log-Likelihood: -13.036
converged: True LL-Null: -17.186
Covariance Type: nonrobust LLR p-value: 0.003967
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -3.7771 1.379 -2.740 0.006 -6.479 -1.075
x1 0.1449 0.059 2.441 0.015 0.029 0.261
==============================================================================
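- To interpret the 18.1 fit, the slope can be turned into an odds ratio and the fitted model used for prediction; a minimal sketch (L1 = 20 is just an illustrative value, not taken from the workbook):
import numpy as np
# Odds ratio per one-unit increase in L1, and the predicted remission
# probability at an illustrative L1 value.
const, beta_l1 = logit_res.params
print(np.exp(beta_l1))                    # odds ratio for L1
print(logit_res.predict([[1.0, 20.0]]))   # exog = [const, L1]; returns P(REM = 1)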
Problems 18.2, 18.3
Data
- Transcribed by hand
- 18.2.csv
smoking,obesity,snoring,cases,hypertension
0,0,0,60,5
1,0,0,17,2
0,1,0,8,1
1,1,0,2,0
0,0,1,187,35
1,0,1,85,13
0,1,1,51,15
1,1,1,23,8
R
df<-read.csv('18.2.csv')
head(df)
A data.frame: 6 × 5
smoking obesity snoring cases hypertension
<int> <int> <int> <int> <int>
1 0 0 0 60 5
2 1 0 0 17 2
3 0 1 0 8 1
4 1 1 0 2 0
5 0 0 1 187 35
6 1 0 1 85 13
- Expand to one row per patient
df2 <- data.frame()
for (i in 1:nrow(df)) {
  ht <- df[i, 5]  # number of hypertension cases in this covariate pattern
  for (j in 1:df[i, 4]) {
    # the first `ht` subjects of the pattern are coded as hypertension = 1
    ht <- ht - 1
    if (ht >= 0) {
      hypertension <- 1
    } else {
      hypertension <- 0
    }
    tmp.df <- data.frame(df[i, 1:3], HT = hypertension)
    df2 <- rbind(df2, tmp.df)
  }
}
df2$HT = as.factor(df2$HT)
head(df2)
A data.frame: 6 × 4
smoking obesity snoring HT
<int> <int> <int> <fct>
1 0 0 0 1
2 0 0 0 1
3 0 0 0 1
4 0 0 0 1
5 0 0 0 1
6 0 0 0 0
- 18.2 Logistic regression
model.logit<-glm(HT ~ smoking + obesity + snoring, df2, family = binomial("logit"))
summary(model.logit)
Call:
glm(formula = HT ~ smoking + obesity + snoring, family = binomial("logit"),
data = df2)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8578 -0.6330 -0.6138 -0.4212 2.2488
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766 0.38010 -6.255 3.97e-10 ***
smoking -0.06777 0.27812 -0.244 0.8075
obesity 0.69531 0.28508 2.439 0.0147 *
snoring 0.87194 0.39749 2.194 0.0283 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 411.42 on 432 degrees of freedom
Residual deviance: 398.92 on 429 degrees of freedom
AIC: 406.92
Number of Fisher Scoring iterations: 4
- The intercept's Pr(>|z|) differs slightly from the textbook, which is a bit puzzling, but everything else matches the text.
- 18.3 Probit model
model.probit<-glm(HT ~ smoking + obesity + snoring, df2, family = binomial("probit"))
summary(model.probit)
Call:
glm(formula = HT ~ smoking + obesity + snoring, family = binomial("probit"),
data = df2)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8542 -0.6337 -0.6142 -0.4211 2.2531
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.37312 0.19308 -7.112 1.15e-12 ***
smoking -0.03865 0.15682 -0.246 0.8053
obesity 0.39996 0.16748 2.388 0.0169 *
snoring 0.46507 0.20464 2.273 0.0230 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 411.42 on 432 degrees of freedom
Residual deviance: 399.00 on 429 degrees of freedom
AIC: 407
Number of Fisher Scoring iterations: 4
Python
- 18.2
import numpy as np
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('18.2.csv')
df.head()
smoking obesity snoring cases hypertension
0 0 0 0 60 5
1 1 0 0 17 2
2 0 1 0 8 1
3 1 1 0 2 0
4 0 0 1 187 35
rows = []
for i in range(len(df)):
    ht = df.iloc[i, 4]             # number of hypertension cases in this pattern
    for j in range(df.iloc[i, 3]): # one iteration per subject in the pattern
        # the first `ht` subjects of the pattern are coded as hypertension = 1
        ht -= 1
        hypertension = 1 if ht >= 0 else 0
        rows.append({'smoking': df.iloc[i, 0], 'obesity': df.iloc[i, 1],
                     'snoring': df.iloc[i, 2], 'hypertension': hypertension})
df2 = pd.DataFrame(rows, columns=['smoking', 'obesity', 'snoring', 'hypertension'])
df2.head()
smoking obesity snoring hypertension
0 0 0 0 1
1 0 0 0 1
2 0 0 0 1
3 0 0 0 1
4 0 0 0 1
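- A quick consistency check of the expansion; a small sketch comparing counts against the original table:
# One row per case, and as many hypertension = 1 rows as the original
# hypertension column sums to (433 rows and 79 ones here).
assert len(df2) == df['cases'].sum()
assert (df2['hypertension'] == 1).sum() == df['hypertension'].sum()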
- 18.2 Logistic regression
df_x = pd.DataFrame(df2.iloc[:, 0:3])
df_x = sm.add_constant(df_x)
logit_mod = sm.Logit(np.asarray(df2['hypertension'].astype('float')), np.asarray(df_x.astype('float')))
logit_res = logit_mod.fit(disp=0)
print(logit_res.summary())
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 433
Model: Logit Df Residuals: 429
Method: MLE Df Model: 3
Date: Sun, 02 Aug 2020 Pseudo R-squ.: 0.03040
Time: 07:05:53 Log-Likelihood: -199.46
converged: True LL-Null: -205.71
Covariance Type: nonrobust LLR p-value: 0.005832
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -2.3777 0.380 -6.254 0.000 -3.123 -1.633
x1 -0.0678 0.278 -0.244 0.807 -0.613 0.477
x2 0.6953 0.285 2.439 0.015 0.137 1.254
x3 0.8719 0.398 2.193 0.028 0.093 1.651
==============================================================================
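- Because 18.2.csv already holds group counts, the same logistic model can be fitted without expanding to one row per subject; a minimal sketch with statsmodels' GLM and a (successes, failures) response (the coefficients should match the expanded fit, while the deviance is evaluated on the grouped data):
import numpy as np
# Binomial GLM on the grouped counts: endog = (hypertension, cases - hypertension).
endog = np.column_stack([df['hypertension'], df['cases'] - df['hypertension']])
exog = sm.add_constant(df[['smoking', 'obesity', 'snoring']].astype(float))
glm_res = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(glm_res.params)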
- 18.3 Probit model
df_x = pd.DataFrame(df2.iloc[:, 0:3])
df_x = sm.add_constant(df_x)
probit_mod = sm.Probit(np.asarray(df2['hypertension'].astype('float')), np.asarray(df_x.astype('float')))
probit_res = probit_mod.fit(disp=0)
print(probit_res.summary())
Probit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 433
Model: Probit Df Residuals: 429
Method: MLE Df Model: 3
Date: Sun, 02 Aug 2020 Pseudo R-squ.: 0.03019
Time: 07:07:14 Log-Likelihood: -199.50
converged: True LL-Null: -205.71
Covariance Type: nonrobust LLR p-value: 0.006077
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -1.3731 0.193 -7.120 0.000 -1.751 -0.995
x1 -0.0387 0.157 -0.246 0.805 -0.346 0.269
x2 0.4000 0.168 2.384 0.017 0.071 0.729
x3 0.4651 0.204 2.274 0.023 0.064 0.866
==============================================================================
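- As a final comparison of the two link functions, a small sketch that prints the predicted hypertension probabilities for the eight covariate patterns under both the logit and the probit fits (they should be very close):
import numpy as np
# Columns: smoking, obesity, snoring, P(HT) under logit, P(HT) under probit.
X = np.asarray(sm.add_constant(df[['smoking', 'obesity', 'snoring']].astype(float)))
print(np.column_stack([X[:, 1:], logit_res.predict(X), probit_res.predict(X)]))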