
Solving the 『統計検定準1級対応統計学実践ワークブック』 in R and Python: Chapter 16, Multiple Regression

日本統計学会公式認定 統計検定準1級対応 統計学実践ワークブック

is useful not only for preparing for the 統計検定 exam itself but also for organizing one's knowledge of statistics, so I have also tried implementing its problems in R and Python.

Problem 16.1

The data is given as a table in the text, but there is no ready-made dataset for it, and the content overlaps with Problem 16.2, so it is omitted here.

Problem 16.2

R

data(airquality)
head(airquality)
A data.frame: 6 × 6
Ozone	Solar.R	Wind	Temp	Month	Day
<int>	<int>	<dbl>	<int>	<int>	<int>
1	41	190	7.4	67	5	1
2	36	118	8.0	72	5	2
3	12	149	12.6	74	5	3
4	18	313	11.5	62	5	4
5	NA	NA	14.3	56	5	5
6	28	NA	14.9	66	5	6
  • Model 0
model0 = lm(Ozone ~ 1, data = airquality)
summary(model0)

sprintf("AIC=%.4f", AIC(model0))
Call:
lm(formula = Ozone ~ 1, data = airquality)

Residuals:
   Min     1Q Median     3Q    Max 
-41.13 -24.13 -10.63  21.12 125.87 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   42.129      3.063   13.76   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.99 on 115 degrees of freedom
  (37 observations deleted due to missingness)
'AIC=1143.2940'

This is the intercept-only model. The specification should be right, but the numbers differ from the textbook.
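A likely cause (my reading, not stated in the text): lm(Ozone ~ 1) drops only the 37 rows where Ozone is NA and fits on 116 observations, whereas model1/model2 below (and the Python version of model 0) use the 111 complete cases. A minimal sketch of refitting model 0 on those same complete cases, assuming that is what the textbook does:

# Assumption: use the same complete cases as model1/model2 before fitting the intercept-only model.
aq_complete = na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")])
model0c = lm(Ozone ~ 1, data = aq_complete)
sprintf("AIC=%.4f", AIC(model0c))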

  • Model 1
model1 = lm(Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)
summary(model1)

sprintf("AIC=%.4f", AIC(model1))

Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.870 -13.968  -2.671   9.553  97.918 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -58.05384   22.97114  -2.527   0.0130 *  
Solar.R       0.04960    0.02346   2.114   0.0368 *  
Wind         -3.31651    0.64579  -5.136 1.29e-06 ***
Temp          1.87087    0.27363   6.837 5.34e-10 ***
Month        -2.99163    1.51592  -1.973   0.0510 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.9 on 106 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.6199,	Adjusted R-squared:  0.6055 
F-statistic: 43.21 on 4 and 106 DF,  p-value: < 2.2e-16
'AIC=996.7119'

The results match the textbook.

  • Model 2
model2 = lm(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality)
summary(model2)

sprintf("AIC=%.4f", AIC(model2))
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day, data = airquality)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.014 -12.284  -3.302   8.454  95.348 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -64.11632   23.48249  -2.730  0.00742 ** 
Solar.R       0.05027    0.02342   2.147  0.03411 *  
Wind         -3.31844    0.64451  -5.149 1.23e-06 ***
Temp          1.89579    0.27389   6.922 3.66e-10 ***
Month        -3.03996    1.51346  -2.009  0.04714 *  
Day           0.27388    0.22967   1.192  0.23576    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.86 on 105 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.6249,	Adjusted R-squared:  0.6071 
F-statistic: 34.99 on 5 and 105 DF,  p-value: < 2.2e-16
'AIC=997.2188'

The results match the textbook.

Python

  • A CSV file exported from the R dataset is used (a possible export command is sketched after this list).
  • sklearn.linear_model.LinearRegression can also fit a multiple regression, but it does not directly report the various summary statistics, so statsmodels, which does, is used instead.
    • statsmodels does not handle missing values automatically, so they are removed beforehand.
  • cf. StatsModelsによる重回帰解析
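The airquality.csv file read below is assumed to have been produced from R with something along these lines (write.csv is standard; the exact command is my guess):

# Assumed export step, run in R: keep column names, drop row names.
write.csv(airquality, "airquality.csv", row.names = FALSE)
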
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("airquality.csv")
df.head()
	Ozone	Solar.R	Wind	Temp	Month	Day
0	41.0	190.0	7.4	67	5	1
1	36.0	118.0	8.0	72	5	2
2	12.0	149.0	12.6	74	5	3
3	18.0	313.0	11.5	62	5	4
4	NaN	NaN	14.3	56	5	5
  • Model 0
    • To make the design intercept-only, Ozone is temporarily added as a dummy explanatory variable so that the constant term can be attached, and the Ozone column is then dropped.
df0 = df.drop("Day", axis = 1)
df0 = df0[df0.isnull().any(axis = 1) == False]
x0 = df0.Ozone
x0 = sm.add_constant(x0)
x0 = x0.drop("Ozone", axis = 1)
y0 = df0.Ozone

model0 = sm.OLS(y0, x0)
result0 = model0.fit()
result0.summary()
                  OLS Regression Results
Dep. Variable:    Ozone	           R-squared:          -0.000
Model:            OLS              Adj. R-squared:     -0.000
Method:	          Least Squares    F-statistic:        -inf
Date:	          Thu, 23 Jul 2020 Prob (F-statistic): nan
Time:             05:00:56         Log-Likelihood:     -546.04
No. Observations: 111	           AIC:     	       1094.
Df Residuals:	  110	           BIC:	               1097.
Df Model:         0		
Covariance Type:  nonrobust		
               coef     std err	  t       P>|t|	  [0.025 0.975]
const          42.0991  3.158     13.329  0.000	  35.840 48.358
Omnibus:       26.260	Durbin-Watson:	1.108
Prob(Omnibus): 0.000	Jarque-Bera (JB):	35.528
Skew:	1.248	Prob(JB):	1.93e-08
Kurtosis:	4.204	Cond. No.	1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The results match the textbook. The AIC differs by 2, but this is because the definition of AIC varies between tools (libraries); comparisons made within a single tool are unaffected.

  • Model 1
df1 = df.drop("Day", axis = 1)
df1 = df1[df1.isnull().any(axis = 1) == False]
x1 = df1.drop("Ozone", 1)
x1 = sm.add_constant(x1)
y1 = df1.Ozone

model1 = sm.OLS(y1, x1)
result1 = model1.fit()
result1.summary()

                OLS Regression Results
Dep. Variable:    Ozone	        R-squared:          0.620
Model:            OLS           Adj. R-squared:     0.606
Method:	          Least Squares F-statistic:        43.21
Date:	Thu, 23 Jul 2020        Prob (F-statistic): 1.85e-21
Time:             05:00:29      Log-Likelihood:     -492.36
No. Observations: 111           AIC:                994.7
Df Residuals:     106           BIC:                1008.
Df Model:         4		
Covariance Type:	nonrobust		
         coef    std err t	P>|t|	[0.025   0.975]
const	-58.0538 22.971  -2.527	0.013	-103.596 -12.511
Solar.R	0.0496   0.023	 2.114	0.037	0.003    0.096
Wind	-3.3165  0.646	 -5.136	0.000	-4.597   -2.036
Temp	1.8709   0.274	 6.837	0.000	1.328    2.413
Month	-2.9916  1.516	 -1.973	0.051	-5.997   0.014
Omnibus:	41.274	Durbin-Watson:	2.032
Prob(Omnibus):	0.000	Jarque-Bera (JB):	105.574
Skew:	1.394	Prob(JB):	1.19e-23
Kurtosis:	6.880	Cond. No.	2.53e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.53e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The results match the textbook. The AIC differs by 2, but this is because the definition of AIC varies between tools (libraries); comparisons made within a single tool are unaffected.
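As a quick check of where the 2 comes from (my note, using the model 1 numbers above): both tools compute AIC = -2 log L + 2k from the same log-likelihood, but R's AIC() counts the error variance as an extra parameter while statsmodels does not.

# logL ≈ -492.36 for model 1 in both tools.
-2 * (-492.36) + 2 * 6   # R: 5 coefficients + error variance -> 996.72 (cf. AIC=996.7119)
-2 * (-492.36) + 2 * 5   # statsmodels: 5 coefficients only   -> 994.72 (cf. AIC=994.7)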

  • Model 2
df2 = df.copy()
df2 = df2[df2.isnull().any(axis = 1) == False]
x2 = df2.drop("Ozone", 1)
x2 = sm.add_constant(x2)
y2 = df2.Ozone

model2 = sm.OLS(y2, x2)
result2 = model2.fit()
result2.summary()

                  OLS Regression Results
Dep. Variable:    Ozone         R-squared:          0.625
Model:            OLS           Adj. R-squared:     0.607
Method:           Least Squares F-statistic:        34.99
Date:	Thu, 23 Jul 2020        Prob (F-statistic): 6.50e-21
Time:             05:00:24      Log-Likelihood:     -491.61
No. Observations: 111           AIC:                995.2
Df Residuals:     105           BIC:                1011.
Df Model:         5		
Covariance Type:  nonrobust		
         coef    std err t	P>|t|	[0.025   0.975]
const	-64.1163 23.482  -2.730	0.007	-110.678 -17.555
Solar.R	0.0503   0.023   2.147	0.034	0.004    0.097
Wind	-3.3184  0.645  -5.149	0.000	-4.596   -2.041
Temp	1.8958   0.274  6.922	0.000	1.353    2.439
Month	-3.0400  1.513  -2.009	0.047	-6.041   -0.039
Day     0.2739   0.230  1.192	0.236	-0.182   0.729
Omnibus:	40.425	Durbin-Watson:	2.088
Prob(Omnibus):	0.000	Jarque-Bera (JB):	98.522
Skew:	1.388	Prob(JB):	4.04e-22
Kurtosis:	6.688	Cond. No.	2.60e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.6e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The results match the textbook. The AIC differs by 2, but this is because the definition of AIC varies between tools (libraries); comparisons made within a single tool are unaffected.

Problem 16.3

The data is the AquaticTox dataset (from the QSARdata package).

R

スパース推定法による統計モデリング (統計学One Point)

was also consulted as a reference.

(1)

#install.packages("QSARdata")
library(QSARdata)
#install.packages("glmnet")
library(glmnet)

data(AquaticTox)
head(AquaticTox_Outcome)
A data.frame: 6 × 2
Molecule	Activity
<fct>	<dbl>
1	(d)-limonene	5.29
2	111-trichloro-2-methyl-2-propanolol(chlorobytanol)	3.12
3	111-trichloroethane	3.40
4	1122-tetrachloroethane	3.92
5	112-trichloroethane	3.21
6	11-dichloroethylene(vinylidene	2.84
  • The table has many columns, so only the first five are shown.
head(AquaticTox_AtomPair[, 1:5])
A data.frame: 6 × 5
Molecule	AP_FP_522	AP_FP_41	AP_FP_521	AP_FP_402
<fct>	<int>	<int>	<int>	<int>
1	(d)-limonene	0	1	1	0
2	111-trichloro-2-methyl-2-propanolol(chlorobytanol)	0	0	0	0
3	111-trichloroethane	0	0	0	0
4	1122-tetrachloroethane	0	0	0	0
5	112-trichloroethane	0	0	0	0
6	11-dichloroethylene(vinylidene	0	0	0	0
  • Drop the unneeded molecule-name column, and since the values are integer, convert them to numeric.
df_x = AquaticTox_AtomPair[, -1]
df_x = apply(df_x, 2, as.numeric)
  • This should correspond to the Residual figure in the textbook.
    • With the default settings only log(λ) > 0 is covered, so the λ range was set explicitly to match the textbook.
model <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0, lambda = exp(seq(log(exp(-6)), log(exp(6)), length.out = 100)))
rmse = apply((predict(model, newx = df_x) -  AquaticTox_Outcome$Activity)^2, 2, sum) / nrow(df_x)
plot(log(model$lambda), rmse, main = "Residual")

[Figure: Residual vs log(λ)]

  • This should correspond to the CV figure in the textbook.
    • With the default settings only log(λ) > 0 is covered, so the λ range was set explicitly to match the textbook.
model.cv <- cv.glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0, lambda = exp(seq(log(exp(-6)), log(exp(6)), length.out = 100)))
plot(model.cv)

[Figure: cv.glmnet cross-validation error vs log(λ)]

  • The optimal value of λ is:
model.cv$lambda.min
1.19939610203539
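Not shown in the original, but the model selected by cross-validation can be inspected with the usual glmnet accessors:

# Coefficients at the CV-optimal λ (returned as a sparse matrix; first entries shown).
coef(model.cv, s = "lambda.min")[1:5, ]
# Mean cross-validated error at that λ.
model.cv$cvm[model.cv$lambda == model.cv$lambda.min]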

(2)

  • The four models are as follows (their glmnet fits are sketched right after this list):
    • model1: λ=0
    • model2: λ=e^-2, α=0
    • model3: λ=e^-2, α=0.5
    • model4: λ=e^-2, α=1
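The fits behind model.1 through model.4 are not shown in the original; based on the parameter list above and the earlier glmnet calls, they were presumably something like:

# Assumed fits: λ and α taken from the list above (λ = 0 reduces to ordinary least squares).
model.1 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, lambda = 0)
model.2 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0, lambda = exp(-2))
model.3 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0.5, lambda = exp(-2))
model.4 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 1, lambda = exp(-2))
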
par(mfrow = c(2, 2))
plot(c(model.1$a0, model.1$beta[, 1]), main = "model1", xlab = "variables", ylab = "coef.")
plot(c(model.2$a0, model.2$beta[, 1]), main = "model2", xlab = "variables", ylab = "coef.")
plot(c(model.3$a0, model.3$beta[, 1]), main = "model3", xlab = "variables", ylab = "coef.")
plot(c(model.4$a0, model.4$beta[, 1]), main = "model4", xlab = "variables", ylab = "coef.")
par(mfrow = c(1, 1))

[Figure: coefficients of model1–model4 (2×2 panels)]

  • Therefore:
    • (a): model2
    • (b): model1
    • (c): model4
    • (d): model3

(3)

  • The models are as follows:
    • model2.1: α=0
    • model2.2: α=0.5
    • model2.3: α=1
model2.1 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0)
model2.2 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0.5)
model2.3 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 1)

par(mfrow = c(2, 2))
plot(model2.1, xvar = "lambda", main = "model2.1")
plot(model2.2, xvar = "lambda", main = "model2.2")
plot(model2.3, xvar = "lambda", main = "model2.3")
par(mfrow = c(1, 1))

[Figure: coefficient paths vs log(λ) for model2.1–model2.3]

  • In the "# of nonzeros" figures, only (a) covers a different log(λ) range, so model2.1 is recomputed over that range and then plotted:

model2.1 <- glmnet(x = df_x, y = AquaticTox_Outcome$Activity, alpha = 0, lambda = exp(seq(log(exp(-6)), log(exp(6)), length.out = 100)))

par(mfrow = c(2, 2))
plot(log(model2.1$lambda), apply(model2.1$beta, 2, function(x) {return(sum(x != 0))}), main = "model2.1", xlab = "log(λ)", ylab = "# of nonzeros")
plot(log(model2.2$lambda), apply(model2.2$beta, 2, function(x) {return(sum(x != 0))}), main = "model2.2", xlab = "log(λ)", ylab = "# of nonzeros")
plot(log(model2.3$lambda), apply(model2.3$beta, 2, function(x) {return(sum(x != 0))}), main = "model2.3", xlab = "log(λ)", ylab = "# of nonzeros")
par(mfrow = c(1, 1))

[Figure: # of nonzeros vs log(λ) for model2.1–model2.3]

  • That is:
    • (a): model2.1
    • (b): model2.3
    • (c): model2.2

Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet, enet_path
import statsmodels.api as sm

y = pd.read_csv("AquaticTox_Outcome.csv")
y = y.drop("Molecule", axis = 1)
x = pd.read_csv("AquaticTox_AtomPair.csv")
x = x.drop("Molecule", axis = 1)

(1)

# scikit-learn only

alpha_range = np.logspace(-6, 6, 30, base=np.e)
elout = np.empty((len(alpha_range), 2))
for i, a in enumerate(alpha_range):
  enet = ElasticNet(l1_ratio = 0, alpha = a)
  result = enet.fit(x, y['Activity'])  # fit with a 1-D target so predict() returns a 1-D array
  rmse = ((sum((result.predict(x) - y['Activity'])**2))/len(y))**0.5
  elout[i] = [a, rmse]

plt.plot(np.log(elout[:, 0]), elout[:, 1], color='lightgreen', linewidth=2, label='RMSE')

[Figure: RMSE vs log(alpha)]

(2)

#scikit-learn
enet1 = ElasticNet(alpha = 0)
result1 = enet1.fit(x, y)
enet2 = ElasticNet(alpha = np.exp(-2), l1_ratio = 0)
result2 = enet2.fit(x, y)
enet3 = ElasticNet(alpha = np.exp(-2), l1_ratio = 0.5)
result3 = enet3.fit(x, y)
enet4 = ElasticNet(alpha = np.exp(-2), l1_ratio = 1)
result4 = enet4.fit(x, y)

fig, axes = plt.subplots(nrows=2, ncols=2, sharex=False)
axes[0, 0].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result1.intercept_), result1.coef_))
axes[0, 1].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result2.intercept_), result2.coef_))
axes[1, 0].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result3.intercept_), result3.coef_))
axes[1, 1].scatter(np.arange(0, x.shape[1] + 1), np.append(np.array(result4.intercept_), result4.coef_))
fig.tight_layout()              # adjust the layout
plt.show()

[Figure: coefficients of result1–result4 (2×2 panels)]

# statsmodels
model1 = sm.OLS(y, sm.add_constant(x))
result1 = model1.fit_regularized(method='elastic_net', alpha = 0) # lambda = 0
model2 = sm.OLS(y, sm.add_constant(x))
result2 = model2.fit_regularized(method='elastic_net', alpha = np.exp(-2), L1_wt=0) # lambda = exp(-2), alpha = 0
model3 = sm.OLS(y, sm.add_constant(x))
result3 = model3.fit_regularized(method='elastic_net', alpha = np.exp(-2), L1_wt=0.5) # lambda = exp(-2), alpha = 0.5
model4 = sm.OLS(y, sm.add_constant(x))
result4 = model4.fit_regularized(method='elastic_net', alpha = np.exp(-2), L1_wt=1) # lambda = exp(-2), alpha = 1
  • Figures omitted.