日本統計学会公式認定 統計検定準1級対応 統計学実践ワークブック
が統計検定の準備だけでなく統計学の整理に役立つので、R, Pythonでの実装も試みた。
問16.1
表でデータが示されているが、ありものデータセットがなく、問16.2と内容が被るので略。
問16.2
R
- c.f. 71. 回帰分析と重回帰分析
data(airquality)
head(airquality)
A data.frame: 6 × 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
- モデル0
model0 = lm(Ozone ~ 1, data = airquality)
summary(model0)
sprintf("AIC=%.4f", AIC(model0))
Call:
lm(formula = Ozone ~ 1, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-41.13 -24.13 -10.63 21.12 125.87
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.129 3.063 13.76 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 32.99 on 115 degrees of freedom
(37 observations deleted due to missingness)
'AIC=1143.2940'
定数項のみのモデル。これでよいはずだがテキストと数字が違う。
- モデル1
model1 = lm(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)
summary(model1)
sprintf("AIC=%.4f", AIC(model1))
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-35.870 -13.968 -2.671 9.553 97.918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.05384 22.97114 -2.527 0.0130 *
Solar.R 0.04960 0.02346 2.114 0.0368 *
Wind -3.31651 0.64579 -5.136 1.29e-06 ***
Temp 1.87087 0.27363 6.837 5.34e-10 ***
Month -2.99163 1.51592 -1.973 0.0510 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.9 on 106 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055
F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16
'AIC=996.7119'
テキストと結果が一致。
- モデル2
model2 = lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality)
summary(model2)
sprintf("AIC=%.4f", AIC(model2))
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-37.014 -12.284 -3.302 8.454 95.348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -64.11632 23.48249 -2.730 0.00742 **
Solar.R 0.05027 0.02342 2.147 0.03411 *
Wind -3.31844 0.64451 -5.149 1.23e-06 ***
Temp 1.89579 0.27389 6.922 3.66e-10 ***
Month -3.03996 1.51346 -2.009 0.04714 *
Day 0.27388 0.22967 1.192 0.23576
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.86 on 105 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6249, Adjusted R-squared: 0.6071
F-statistic: 34.99 on 5 and 105 DF, p-value: < 2.2e-16
'AIC=997.2188'
テキストと結果が一致。
Python
- RのデータセットをCSV出力したファイルを用いる。
- sklearn.linear_model.LinearRegressionでも重回帰分析が出来るが、各種統計量を直接は出力しないので、各種統計量も出力する statmodels を用いる
- 欠損値を自動処理しないので、事前に取り除く
- c.f. StatsModelsによる重回帰解析
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("airquality.csv")
df.head()
Ozone Solar.R Wind Temp Month Day
0 41.0 190.0 7.4 67 5 1
1 36.0 118.0 8.0 72 5 2
2 12.0 149.0 12.6 74 5 3
3 18.0 313.0 11.5 62 5 4
4 NaN NaN 14.3 56 5 5
- モデル0
- 定数項のみとするため説明変数にダミーでOzoneを入れて定数項を付与後Ozone列を削除している。
df0 = df.drop("Day", axis = 1)
df0 = df1[df1.isnull().any(axis = 1) == False]
x0 = df0.Ozone
x0 = sm.add_constant(x0)
x0 = x0.drop("Ozone", axis = 1)
y0 = df0.Ozone
model0 = sm.OLS(y0, x0)
result0 = model0.fit()
result0.summary()
OLS Regression Results
Dep. Variable: Ozone R-squared: -0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: -inf
Date: Thu, 23 Jul 2020 Prob (F-statistic): nan
Time: 05:00:56 Log-Likelihood: -546.04
No. Observations: 111 AIC: 1094.
Df Residuals: 110 BIC: 1097.
Df Model: 0
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 42.0991 3.158 13.329 0.000 35.840 48.358
Omnibus: 26.260 Durbin-Watson: 1.108
Prob(Omnibus): 0.000 Jarque-Bera (JB): 35.528
Skew: 1.248 Prob(JB): 1.93e-08
Kurtosis: 4.204 Cond. No. 1.00
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
テキストと結果が一致。AICが2ずれているがツール(ライブラリ)によりAICの定義が異なるため。ツール内で閉じて比較する分には問題なし。
- モデル1
df1 = df.drop("Day", axis = 1)
df1 = df1[df1.isnull().any(axis = 1) == False]
x1 = df1.drop("Ozone", 1)
x1 = sm.add_constant(x1)
y1 = df1.Ozone
model1 = sm.OLS(y1, x1)
result1 = model1.fit()
result1.summary()
OLS Regression Results
Dep. Variable: Ozone R-squared: 0.620
Model: OLS Adj. R-squared: 0.606
Method: Least Squares F-statistic: 43.21
Date: Thu, 23 Jul 2020 Prob (F-statistic): 1.85e-21
Time: 05:00:29 Log-Likelihood: -492.36
No. Observations: 111 AIC: 994.7
Df Residuals: 106 BIC: 1008.
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -58.0538 22.971 -2.527 0.013 -103.596 -12.511
Solar.R 0.0496 0.023 2.114 0.037 0.003 0.096
Wind -3.3165 0.646 -5.136 0.000 -4.597 -2.036
Temp 1.8709 0.274 6.837 0.000 1.328 2.413
Month -2.9916 1.516 -1.973 0.051 -5.997 0.014
Omnibus: 41.274 Durbin-Watson: 2.032
Prob(Omnibus): 0.000 Jarque-Bera (JB): 105.574
Skew: 1.394 Prob(JB): 1.19e-23
Kurtosis: 6.880 Cond. No. 2.53e+03
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.53e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
テキストと結果が一致。AICが2ずれているがツール(ライブラリ)によりAICの定義が異なるため。ツール内で閉じて比較する分には問題なし。
- モデル2
df2 = df.copy()
df2 = df2[df2.isnull().any(axis = 1) == False]
x2 = df2.drop("Ozone", 1)
x2 = sm.add_constant(x2)
y2 = df2.Ozone
model2 = sm.OLS(y2, x2)
result2 = model2.fit()
result2.summary()
OLS Regression Results
Dep. Variable: Ozone R-squared: 0.625
Model: OLS Adj. R-squared: 0.607
Method: Least Squares F-statistic: 34.99
Date: Thu, 23 Jul 2020 Prob (F-statistic): 6.50e-21
Time: 05:00:24 Log-Likelihood: -491.61
No. Observations: 111 AIC: 995.2
Df Residuals: 105 BIC: 1011.
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -64.1163 23.482 -2.730 0.007 -110.678 -17.555
Solar.R 0.0503 0.023 2.147 0.034 0.004 0.097
Wind -3.3184 0.645 -5.149 0.000 -4.596 -2.041
Temp 1.8958 0.274 6.922 0.000 1.353 2.439
Month -3.0400 1.513 -2.009 0.047 -6.041 -0.039
Day 0.2739 0.230 1.192 0.236 -0.182 0.729
Omnibus: 40.425 Durbin-Watson: 2.088
Prob(Omnibus): 0.000 Jarque-Bera (JB): 98.522
Skew: 1.388 Prob(JB): 4.04e-22
Kurtosis: 6.688 Cond. No. 2.60e+03
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.6e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
テキストと結果が一致。AICが2ずれているがツール(ライブラリ)によりAICの定義が異なるため。ツール内で閉じて比較する分には問題なし。
問16.3
データはこれAquaticToxっすね。
R
スパース推定法による統計モデリング (統計学One Point)
も参考としました。
(1)
#install.packages("QSARdata")
library(QSARdata)
#install.packages("glmnet")
library(glmnet)
data(AquaticTox)
head(AquaticTox_Outcome)
A data.frame: 6 × 2
Molecule Activity
<fct> <dbl>
1 (d)-limonene 5.29
2 111-trichloro-2-methyl-2-propanolol(chlorobytanol) 3.12
3 111-trichloroethane 3.40
4 1122-tetrachloroethane 3.92
5 112-trichloroethane 3.21
6 11-dichloroethylene(vinylidene 2.84
- 列が長いので5列目までで略
head(AquaticTox_AtomPair[, 1:5])
A data.frame: 6 × 5
Molecule AP_FP_522 AP_FP_41 AP_FP_521 AP_FP_402
<fct> <int> <int> <int> <int>
1 (d)-limonene 0 1 1 0
2 111-trichloro-2-methyl-2-propanolol(chlorobytanol) 0 0 0 0
3 111-trichloroethane 0 0 0 0
4 1122-tetrachloroethane 0 0 0 0
5 112-trichloroethane 0 0 0 0
6 11-dichloroethylene(vinylidene 0 0 0 0
- 不要な化合物名列を削除し、値がIntegerなのでNumericにする。
df_x = AquaticTox_AtomPair[, -1]
df_x = apply(df_x, 2, as.numeric)
- テキストのResidualの図に該当するはず。
- defaultだとlog(λ)>0しか計算しないのでテキスト通りの範囲とした
model <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0, lambda = exp(seq(log(exp(-6)), log(exp(6)), length.out = 100)))
rmse = apply((predict(model, newx = df_x) - AquaticTox_Outcome$Activity)^2, 2, sum) / nrow(df_x)
plot(log(model$lambda), rmse, main = "Residual)
- テキストのcvの図に該当するはず。
- defaultだとlog(λ)>0しか計算しないのでテキスト通りの範囲とした
model.cv <- cv.glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0, lambda = exp(seq(log(exp(-6)), log(exp(6)), length.out = 100)))
plot(model.cv)
- λの最適値は
model.cv$lambda.min
1.19939610203539
(2)
- 下記の通り
- model1: λ=0
- model2: λ=e^-2, α=0
- model3: λ=e^-2, α=0.5
- model4: λ=e^-2, α=1
par(mfrow = c(2, 2))
plot(c(model.1$a0, model.1$beta[, 1]), main = "model1", xlab = "variabls", ylab = "coef.")
plot(c(model.2$a0, model.2$beta[, 1]), main = "model2", xlab = "variabls", ylab = "coef.")
plot(c(model.3$a0, model.3$beta[, 1]), main = "model3", xlab = "variabls", ylab = "coef.")
plot(c(model.4$a0, model.4$beta[, 1]), main = "model4", xlab = "variabls", ylab = "coef.")
par(mfrow = c(1, 1))
- よって
- (a): model2
- (b): model1
- (c): model4
- (d): model3
(3)
- 下記の通り
- model2.1: α=0
- model2.2: α=0.5
- model2.3: α=1
model2.1 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0)
model2.2 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0.5)
model2.3 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 1)
par(mfrow = c(2, 2))
plot(model2.1, xvar = "lambda", main = "model2.1")
plot(model2.2, xvar = "lambda", main = "model2.2")
plot(model2.3, xvar = "lambda", main = "model2.3")
par(mfrow = c(1, 1))
model2.1 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0, lambda = exp(seq(log(exp(-6)), log(exp(6)), length.out = 100)))
par(mfrow = c(2, 2))
plot(log(model2.1$lambda), apply(model2.1$beta, 2, function(x) {return(sum(x != 0))}), main = "model2.1", xlab = "log(λ)", ylab = "# of nonzeros")
plot(log(model2.2$lambda), apply(model2.2$beta, 2, function(x) {return(sum(x != 0))}), main = "model2.2", xlab = "log(λ)", ylab = "# of nonzeros")
plot(log(model2.2$lambda), apply(model2.3$beta, 2, function(x) {return(sum(x != 0))}), main = "model2.3", xlab = "log(λ)", ylab = "# of nonzeros")
par(mfrow = c(1, 1))
- つまり
- (a): model2.1
- (b): model2.3
- (c): model2.2
Python
-
scikit-learnとstatsmodelsにElastic Netのライブラリがある
-
下記記事などを参照させていただき実行はできるのだがテキスト、Rの結果と全く合わない、、、すみません勉強します
-
RのデータセットをCSV出力したファイルを用いる。
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet, enet_path
import statsmodels.api as sm
y = pd.read_csv("AquaticTox_Outcome.csv")
y = y.drop("Molecule", axis = 1)
x = pd.read_csv("AquaticTox_AtomPair.csv")
x = x.drop("Molecule", axis = 1)
(1)
# scikit-learn のみ
alpha_range = np.logspace(-6, 6, 30, base=np.e)
elout = np.empty((len(alpha_range), 2))
for i, a in enumerate(alpha_range):
#l1 = np.log(l1)
enet = ElasticNet(l1_ratio = 0, alpha = a)
result = enet.fit(x, y)
rmse = ((sum((result.predict(x) - y['Activity'])**2))/len(y))**0.5
elout[i] = [a, rmse]
plt.plot(np.log(elout[:, 0]), elout[:, 1], color='lightgreen', linewidth=2, label='Elastic net coefficients')
(2)
#scikit-learn
enet1 = ElasticNet(alpha = 0)
result1 = enet1.fit(x, y)
enet2 = ElasticNet(alpha = np.exp(-2), l1_ratio = 0)
result2 = enet2.fit(x, y)
enet3 = ElasticNet(alpha = np.exp(-2), l1_ratio = 0.5)
result3 = enet3.fit(x, y)
enet4 = ElasticNet(alpha = np.exp(-2), l1_ratio = 1)
result4 = enet4.fit(x, y)
fig = plt.figure()
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=False)
axes[0, 0].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result1.intercept_), result1.coef_))
axes[0, 1].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result2.intercept_), result2.coef_))
axes[1, 0].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result3.intercept_), result3.coef_))
axes[1, 1].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result4.intercept_), result4.coef_))
fig.tight_layout() #レイアウトの設定
plt.show()
# statsmodels
model1 = sm.OLS(y, sm.add_constant(x))
result1 = model1.fit_regularized(method='elastic_net', alpha = 0) # lambda = 0result2 = model.fit_regularized(alpha = np.exp(-2), L1_wt=0) # lambda = exp(-2), alpha = 0
model2 = sm.OLS(y, sm.add_constant(x))
result2 = model2.fit_regularized(method='elastic_net', alpha = np.exp(-2), L1_wt=0) # lambda = exp(-2), alpha = 0
model3 = sm.OLS(y, sm.add_constant(x))
result3 = model3.fit_regularized(method='elastic_net', alpha = np.exp(-2), L1_wt=0.5) # lambda = exp(-2), alpha = 0.5
model4 = sm.OLS(y, sm.add_constant(x))
result4 = model4.fit_regularized(method='elastic_net', alpha = np.exp(-2), L1_wt=1) # lambda = exp(-2), alpha = 1
- 図は略