More than 3 years have passed since last update.

『統計検定準1級対応統計学実践ワークブック』をR, Pythonで解く～第16章重回帰分析～

Posted at 2020-07-26

日本統計学会公式認定統計検定準1級対応統計学実践ワークブック

が統計検定の準備だけでなく統計学の整理に役立つので、R, Pythonでの実装も試みた。

問16.1

表でデータが示されているが、ありものデータセットがなく、問16.2と内容が被るので略。

問16.2

R

c.f. 71. 回帰分析と重回帰分析

data(airquality)
head(airquality)

A data.frame: 6 × 6
Ozone	Solar.R	Wind	Temp	Month	Day
<int>	<int>	<dbl>	<int>	<int>	<int>
1	41	190	7.4	67	5	1
2	36	118	8.0	72	5	2
3	12	149	12.6	74	5	3
4	18	313	11.5	62	5	4
5	NA	NA	14.3	56	5	5
6	28	NA	14.9	66	5	6

モデル0

model0 = lm(Ozone ~ 1, data = airquality)
summary(model0)

sprintf("AIC=%.4f", AIC(model0))

Call:
lm(formula = Ozone ~ 1, data = airquality)

Residuals:
   Min     1Q Median     3Q    Max 
-41.13 -24.13 -10.63  21.12 125.87 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   42.129      3.063   13.76   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.99 on 115 degrees of freedom
  (37 observations deleted due to missingness)
'AIC=1143.2940'

定数項のみのモデル。これでよいはずだがテキストと数字が違う。

モデル1

model1 = lm(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)
summary(model1)

sprintf("AIC=%.4f", AIC(model1))

Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.870 -13.968  -2.671   9.553  97.918 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -58.05384   22.97114  -2.527   0.0130 *  
Solar.R       0.04960    0.02346   2.114   0.0368 *  
Wind         -3.31651    0.64579  -5.136 1.29e-06 ***
Temp          1.87087    0.27363   6.837 5.34e-10 ***
Month        -2.99163    1.51592  -1.973   0.0510 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.9 on 106 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.6199,	Adjusted R-squared:  0.6055 
F-statistic: 43.21 on 4 and 106 DF,  p-value: < 2.2e-16
'AIC=996.7119'

テキストと結果が一致。

モデル2

model2 = lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality)
summary(model2)

sprintf("AIC=%.4f", AIC(model2))

Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.014 -12.284  -3.302   8.454  95.348 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -64.11632   23.48249  -2.730  0.00742 ** 
Solar.R       0.05027    0.02342   2.147  0.03411 *  
Wind         -3.31844    0.64451  -5.149 1.23e-06 ***
Temp          1.89579    0.27389   6.922 3.66e-10 ***
Month        -3.03996    1.51346  -2.009  0.04714 *  
Day           0.27388    0.22967   1.192  0.23576    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.86 on 105 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.6249,	Adjusted R-squared:  0.6071 
F-statistic: 34.99 on 5 and 105 DF,  p-value: < 2.2e-16
'AIC=997.2188'

テキストと結果が一致。

Python

RのデータセットをCSV出力したファイルを用いる。
sklearn.linear_model.LinearRegressionでも重回帰分析が出来るが、各種統計量を直接は出力しないので、各種統計量も出力する statmodels を用いる
- 欠損値を自動処理しないので、事前に取り除く
c.f. StatsModelsによる重回帰解析

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("airquality.csv")
df.head()

	Ozone	Solar.R	Wind	Temp	Month	Day
0	41.0	190.0	7.4	67	5	1
1	36.0	118.0	8.0	72	5	2
2	12.0	149.0	12.6	74	5	3
3	18.0	313.0	11.5	62	5	4
4	NaN	NaN	14.3	56	5	5

モデル0
- 定数項のみとするため説明変数にダミーでOzoneを入れて定数項を付与後Ozone列を削除している。

df0 = df.drop("Day", axis = 1)
df0 = df1[df1.isnull().any(axis = 1) == False]
x0 = df0.Ozone
x0 = sm.add_constant(x0)
x0 = x0.drop("Ozone", axis = 1)
y0 = df0.Ozone

model0 = sm.OLS(y0, x0)
result0 = model0.fit()
result0.summary()

                  OLS Regression Results
Dep. Variable:    Ozone	           R-squared:          -0.000
Model:            OLS              Adj. R-squared:     -0.000
Method:	          Least Squares    F-statistic:        -inf
Date:	          Thu, 23 Jul 2020 Prob (F-statistic): nan
Time:             05:00:56         Log-Likelihood:     -546.04
No. Observations: 111	           AIC:     	       1094.
Df Residuals:	  110	           BIC:	               1097.
Df Model:         0		
Covariance Type:  nonrobust		
               coef     std err	  t       P>|t|	  [0.025 0.975]
const          42.0991  3.158     13.329  0.000	  35.840 48.358
Omnibus:       26.260	Durbin-Watson:	1.108
Prob(Omnibus): 0.000	Jarque-Bera (JB):	35.528
Skew:	1.248	Prob(JB):	1.93e-08
Kurtosis:	4.204	Cond. No.	1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

テキストと結果が一致。AICが2ずれているがツール（ライブラリ）によりAICの定義が異なるため。ツール内で閉じて比較する分には問題なし。

モデル1

df1 = df.drop("Day", axis = 1)
df1 = df1[df1.isnull().any(axis = 1) == False]
x1 = df1.drop("Ozone", 1)
x1 = sm.add_constant(x1)
y1 = df1.Ozone

model1 = sm.OLS(y1, x1)
result1 = model1.fit()
result1.summary()

                OLS Regression Results
Dep. Variable:    Ozone	        R-squared:          0.620
Model:            OLS           Adj. R-squared:     0.606
Method:	          Least Squares F-statistic:        43.21
Date:	Thu, 23 Jul 2020        Prob (F-statistic): 1.85e-21
Time:             05:00:29      Log-Likelihood:     -492.36
No. Observations: 111           AIC:                994.7
Df Residuals:     106           BIC:                1008.
Df Model:         4		
Covariance Type:	nonrobust		
         coef    std err t	P>|t|	[0.025   0.975]
const	-58.0538 22.971  -2.527	0.013	-103.596 -12.511
Solar.R	0.0496   0.023	 2.114	0.037	0.003    0.096
Wind	-3.3165  0.646	 -5.136	0.000	-4.597   -2.036
Temp	1.8709   0.274	 6.837	0.000	1.328    2.413
Month	-2.9916  1.516	 -1.973	0.051	-5.997   0.014
Omnibus:	41.274	Durbin-Watson:	2.032
Prob(Omnibus):	0.000	Jarque-Bera (JB):	105.574
Skew:	1.394	Prob(JB):	1.19e-23
Kurtosis:	6.880	Cond. No.	2.53e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.53e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

テキストと結果が一致。AICが2ずれているがツール（ライブラリ）によりAICの定義が異なるため。ツール内で閉じて比較する分には問題なし。

モデル2

df2 = df.copy()
df2 = df2[df2.isnull().any(axis = 1) == False]
x2 = df2.drop("Ozone", 1)
x2 = sm.add_constant(x2)
y2 = df2.Ozone

model2 = sm.OLS(y2, x2)
result2 = model2.fit()
result2.summary()

                  OLS Regression Results
Dep. Variable:    Ozone         R-squared:          0.625
Model:            OLS           Adj. R-squared:     0.607
Method:           Least Squares F-statistic:        34.99
Date:	Thu, 23 Jul 2020        Prob (F-statistic): 6.50e-21
Time:             05:00:24      Log-Likelihood:     -491.61
No. Observations: 111           AIC:                995.2
Df Residuals:     105           BIC:                1011.
Df Model:         5		
Covariance Type:  nonrobust		
         coef    std err t	P>|t|	[0.025   0.975]
const	-64.1163 23.482  -2.730	0.007	-110.678 -17.555
Solar.R	0.0503   0.023   2.147	0.034	0.004    0.097
Wind	-3.3184  0.645  -5.149	0.000	-4.596   -2.041
Temp	1.8958   0.274  6.922	0.000	1.353    2.439
Month	-3.0400  1.513  -2.009	0.047	-6.041   -0.039
Day     0.2739   0.230  1.192	0.236	-0.182   0.729
Omnibus:	40.425	Durbin-Watson:	2.088
Prob(Omnibus):	0.000	Jarque-Bera (JB):	98.522
Skew:	1.388	Prob(JB):	4.04e-22
Kurtosis:	6.688	Cond. No.	2.60e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.6e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

テキストと結果が一致。AICが2ずれているがツール（ライブラリ）によりAICの定義が異なるため。ツール内で閉じて比較する分には問題なし。

問16.3

データはこれAquaticToxっすね。

R

スパース推定法による統計モデリング (統計学One Point)

も参考としました。

(1)

# install.packages("QSARdata")
library(QSARdata)
# install.packages("glmnet")
library(glmnet)

data(AquaticTox)

head(AquaticTox_Outcome)

A data.frame: 6 × 2
Molecule	Activity
<fct>	<dbl>
1	(d)-limonene	5.29
2	111-trichloro-2-methyl-2-propanolol(chlorobytanol)	3.12
3	111-trichloroethane	3.40
4	1122-tetrachloroethane	3.92
5	112-trichloroethane	3.21
6	11-dichloroethylene(vinylidene	2.84

列が長いので5列目までで略

head(AquaticTox_AtomPair[, 1:5])

A data.frame: 6 × 5
Molecule	AP_FP_522	AP_FP_41	AP_FP_521	AP_FP_402
<fct>	<int>	<int>	<int>	<int>
1	(d)-limonene	0	1	1	0
2	111-trichloro-2-methyl-2-propanolol(chlorobytanol)	0	0	0	0
3	111-trichloroethane	0	0	0	0
4	1122-tetrachloroethane	0	0	0	0
5	112-trichloroethane	0	0	0	0
6	11-dichloroethylene(vinylidene	0	0	0	0

不要な化合物名列を削除し、値がIntegerなのでNumericにする。

df_x = AquaticTox_AtomPair[, -1]
df_x = apply(df_x, 2, as.numeric)

テキストのResidualの図に該当するはず。
- defaultだとlog(λ)>0しか計算しないのでテキスト通りの範囲とした

model <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0, lambda = exp(seq(log(exp(-6)), log(exp(6)), length.out = 100)))
rmse = apply((predict(model, newx = df_x) -  AquaticTox_Outcome$Activity)^2, 2, sum) / nrow(df_x)
plot(log(model$lambda), rmse, main = "Residual)

テキストのcvの図に該当するはず。
- defaultだとlog(λ)>0しか計算しないのでテキスト通りの範囲とした

model.cv <- cv.glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0, lambda = exp(seq(log(exp(-6)), log(exp(6)), length.out = 100)))
plot(model.cv)

λの最適値は

model.cv$lambda.min

1.19939610203539

(2)

下記の通り
- model1: λ=0
- model2: λ=e^-2, α=0
- model3: λ=e^-2, α=0.5
- model4: λ=e^-2, α=1

par(mfrow = c(2, 2))
plot(c(model.1$a0, model.1$beta[, 1]), main = "model1", xlab = "variabls", ylab = "coef.")
plot(c(model.2$a0, model.2$beta[, 1]), main = "model2", xlab = "variabls", ylab = "coef.")
plot(c(model.3$a0, model.3$beta[, 1]), main = "model3", xlab = "variabls", ylab = "coef.")
plot(c(model.4$a0, model.4$beta[, 1]), main = "model4", xlab = "variabls", ylab = "coef.")
par(mfrow = c(1, 1))

よって
- (a): model2
- (b): model1
- (c): model4
- (d): model3

(3)

下記の通り
- model2.1: α=0
- model2.2: α=0.5
- model2.3: α=1

model2.1 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0)
model2.2 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0.5)
model2.3 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 1)

par(mfrow = c(2, 2))
plot(model2.1, xvar = "lambda", main = "model2.1")
plot(model2.2, xvar = "lambda", main = "model2.2")
plot(model2.3, xvar = "lambda", main = "model2.3")
par(mfrow = c(1, 1))

of nonzerosの図は(a)だけlog(λ)の範囲が異なっているので再計算して

model2.1 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0, lambda = exp(seq(log(exp(-6)), log(exp(6)), length.out = 100)))

par(mfrow = c(2, 2))
plot(log(model2.1$lambda), apply(model2.1$beta, 2, function(x) {return(sum(x != 0))}), main = "model2.1", xlab = "log(λ)", ylab = "# of nonzeros")
plot(log(model2.2$lambda), apply(model2.2$beta, 2, function(x) {return(sum(x != 0))}), main = "model2.2", xlab = "log(λ)", ylab = "# of nonzeros")
plot(log(model2.2$lambda), apply(model2.3$beta, 2, function(x) {return(sum(x != 0))}), main = "model2.3", xlab = "log(λ)", ylab = "# of nonzeros")
par(mfrow = c(1, 1))

つまり
- (a): model2.1
- (b): model2.3
- (c): model2.2

Python

scikit-learnとstatsmodelsにElastic Netのライブラリがある
- sklearn.linear_model.ElasticNet
- statsmodels.regression.linear_model.OLS.fit_regularized
下記記事などを参照させていただき実行はできるのだがテキスト、Rの結果と全く合わない、、、すみません勉強します
- Python StatsModels メモ
- StatsModelsとscikit-learnのElastic Net(Lasso回帰)を比較してみた
RのデータセットをCSV出力したファイルを用いる。

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet, enet_path
import statsmodels.api as sm

y = pd.read_csv("AquaticTox_Outcome.csv")
y = y.drop("Molecule", axis = 1)
x = pd.read_csv("AquaticTox_AtomPair.csv")
x = x.drop("Molecule", axis = 1)

(1)

# scikit-learn のみ

alpha_range = np.logspace(-6, 6, 30, base=np.e)
elout = np.empty((len(alpha_range), 2))
for i, a in enumerate(alpha_range):
  #l1 = np.log(l1)
  enet = ElasticNet(l1_ratio = 0, alpha = a)
  result = enet.fit(x, y)
  rmse = ((sum((result.predict(x) - y['Activity'])**2))/len(y))**0.5
  elout[i] = [a, rmse]

plt.plot(np.log(elout[:, 0]), elout[:, 1], color='lightgreen', linewidth=2, label='Elastic net coefficients')

(2)

# scikit-learn
enet1 = ElasticNet(alpha = 0)
result1 = enet1.fit(x, y)
enet2 = ElasticNet(alpha = np.exp(-2), l1_ratio = 0)
result2 = enet2.fit(x, y)
enet3 = ElasticNet(alpha = np.exp(-2), l1_ratio = 0.5)
result3 = enet3.fit(x, y)
enet4 = ElasticNet(alpha = np.exp(-2), l1_ratio = 1)
result4 = enet4.fit(x, y)

fig = plt.figure()
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=False)
axes[0, 0].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result1.intercept_), result1.coef_))
axes[0, 1].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result2.intercept_), result2.coef_))
axes[1, 0].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result3.intercept_), result3.coef_))
axes[1, 1].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result4.intercept_), result4.coef_))
fig.tight_layout()              #レイアウトの設定
plt.show()

# statsmodels
model1 = sm.OLS(y, sm.add_constant(x))
result1 = model1.fit_regularized(method='elastic_net', alpha = 0) # lambda = 0result2 = model.fit_regularized(alpha = np.exp(-2), L1_wt=0) # lambda = exp(-2), alpha = 0
model2 = sm.OLS(y, sm.add_constant(x))
result2 = model2.fit_regularized(method='elastic_net', alpha = np.exp(-2), L1_wt=0) # lambda = exp(-2), alpha = 0
model3 = sm.OLS(y, sm.add_constant(x))
result3 = model3.fit_regularized(method='elastic_net', alpha = np.exp(-2), L1_wt=0.5) # lambda = exp(-2), alpha = 0.5
model4 = sm.OLS(y, sm.add_constant(x))
result4 = model4.fit_regularized(method='elastic_net', alpha = np.exp(-2), L1_wt=1) # lambda = exp(-2), alpha = 1

図は略

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

『統計検定準1級対応統計学実践ワークブック』をR, Pythonで解く～第16章重回帰分析～

問16.1

問16.2

R

Python

問16.3

R

(1)

(2)

(3)

of nonzerosの図は(a)だけlog(λ)の範囲が異なっているので再計算して

Python

(1)

(2)