0. Introduction
In the previous post I tried multiple regression on the boston dataset; this time I have tried a variety of other things on top of that. The textbook I am working from is O'Reilly Japan's 「Pythonではじめる機械学習 scikit-learnで学ぶ特徴量エンジニアリングと機械学習の基礎」 (the Japanese edition of "Introduction to Machine Learning with Python"). Simply posting the textbook code as-is would be rather pointless, so the library usage, plotting routines and so on are arranged in my own way.
If anything is unclear, I recommend looking it up in the official scikit-learn documentation.
This article covers the following.
1. Import the standard Python libraries
2. Inspect the contents of the dataset
3. Get a rough look at the data with a pair plot and a correlation-coefficient heatmap
4. Look at the distribution range of the features with a box plot
5. Try multiple regression with a variety of methods
6. Try automatic feature selection
7. Measure SVR performance with a grid search
My environment is as follows.
- iMac (Retina 4K, 21.5-inch, 2017)
- Processor 3 GHz Intel Core i5
- Memory 8 GB 2400 MHz DDR4
- macOS Mojave
- Python 3.7.3
- scikit-learn version 0.20.3
- Jupyter lab version 0.35.4
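If you want to confirm the versions on your own machine, a quick check like the following works (a small sketch; it only prints version strings).
import sys
import sklearn
print(sys.version)
print('scikit-learn', sklearn.__version__)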
1. Import the standard Python libraries
Since I use Jupyter Lab (Jupyter Notebook should work just as well), I start by importing the standard libraries. The seaborn heatmap is easy to use, so seaborn is imported as well.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
If you do not want warnings to be displayed, add the following.
import warnings
warnings.filterwarnings('ignore')
For reference, everything this article imports from scikit-learn is listed below. Since the programming is done in Jupyter, anything imported once remains available until the kernel is restarted, so note that the individual code blocks do not necessarily import everything they use.
Import | Purpose |
---|---|
from sklearn.datasets import load_boston | boston dataset |
from sklearn.model_selection import train_test_split | splitting into training and test data |
from sklearn.linear_model import LinearRegression | linear regression (ordinary least squares) |
from sklearn.linear_model import Ridge | ridge regression |
from sklearn.linear_model import Lasso | lasso regression |
from sklearn.ensemble import RandomForestRegressor | random forest regression |
from sklearn.svm import SVR | support vector machine (regression) |
from sklearn.neural_network import MLPRegressor | neural network |
from sklearn.preprocessing import MinMaxScaler | data scaling |
from sklearn.preprocessing import PolynomialFeatures | adding polynomial features |
from sklearn.feature_selection import SelectPercentile | univariate feature selection |
from sklearn.feature_selection import SelectFromModel | model-based feature selection |
from sklearn.feature_selection import RFE | iterative feature selection (recursive feature elimination) |
from sklearn.model_selection import GridSearchCV | grid search with cross-validation |
from sklearn.model_selection import cross_val_score | cross-validation |
from sklearn.pipeline import Pipeline | simplifying the processing code |
2. Inspect the contents of the dataset
Get the keys of the dataset and display the contents stored under each key.
from sklearn.datasets import load_boston
dsn=load_boston()
print(dsn.keys())
for ss in dsn.keys():
print()
print('* '+ ss)
print(dsn[ss])
print()
print('Data shape')
print('Data.shape: {}'.format(dsn.data.shape))
print('Target.shape: {}'.format(dsn.target.shape))
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
* data
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
[2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
[2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
...
[6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
[1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
[4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
* target
[24. 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15. 18.9 21.7 20.4
18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
... (middle rows omitted) ...
16.7 12. 14.6 21.4 23. 23.7 25. 21.8 20.6 21.2 19.1 20.6 15.2 7.
8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
22. 11.9]
* feature_names
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
* DESCR
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
... (rest omitted) ...
* filename
/usr/local/lib/python3.7/site-packages/sklearn/datasets/data/boston_house_prices.csv
Data shape
Data.shape: (506, 13)
Target.shape: (506,)
3. Get a rough look at the data with a pair plot and a correlation-coefficient heatmap
Create and inspect a 14 x 14 pair plot of the 13 features plus the target. Also compute and inspect the correlation matrix between the features and the target. At first I displayed the correlation matrix itself, but with positive and negative values mixed together the coloring was hard to read, so here I display the square of each element of the correlation matrix (the coefficient of determination when each pair is treated as a simple correlation).
df=pd.DataFrame(dsn.data, columns=dsn.feature_names)
df['target']=dsn.target
plt.figure(facecolor='w')
pd.plotting.scatter_matrix(df,figsize=(20,20))
fnameF='fig_boston_1.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
df_corr = df.corr()
print(df_corr**2)
plt.figure(figsize=(15, 7), facecolor='w')
sns.heatmap(df_corr**2, annot=True, fmt='.2f', vmax=1, vmin=0, center=0.5, cmap='Blues')
plt.yticks(rotation=0)
fnameF='fig_boston_1-1.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
4. Look at the distribution range of the features with a box plot
Look at the range over which each feature's values are distributed.
plt.figure(figsize=(10,6),facecolor='w')
plt.xlabel('Feature')
plt.ylabel('Feature magnitude')
plt.yscale('log')
plt.boxplot(dsn.data)
plt.xticks(range(1,dsn.data.shape[1]+1), dsn.feature_names, rotation=90)
fnameF='fig_boston_2.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
5. Try multiple regression with a variety of methods
Try several regression methods while varying how the features are preprocessed.
Strictly speaking, the coefficient of determination and the square of the correlation coefficient on a scatter plot are different quantities, so the value annotated in the figures is the coefficient of determination computed with reg.score(X_test, y_test).
As comes up again later, the squared correlation coefficient shown on a scatter plot (computed with numpy.corrcoef from y_test and y_pred=predict(X_test)) differs from the score(X_test, y_test) value printed to the console, the plotted value being the larger of the two. Note that scikit-learn's score is not a degrees-of-freedom-adjusted R²; it returns R² = 1 - SS_res/SS_tot, which penalizes any bias or scale error in the predictions, whereas corrcoef² only measures how linearly the predictions track the targets. In fact corrcoef² equals the R² of the best affine rescaling of the predictions, so it can never be smaller than score.
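For reference, here is a minimal sketch of computing the two quantities side by side (the helper name compare_r2 and the variable names y_test, y_pred are just the ones used in the functions below; sklearn.metrics.r2_score returns the same value as reg.score(X_test, y_test)).
import numpy as np
from sklearn.metrics import r2_score

def compare_r2(y_test, y_pred):
    # what reg.score(X_test, y_test) reports: R^2 = 1 - SS_res / SS_tot
    r2 = r2_score(y_test, y_pred)
    # squared Pearson correlation between targets and predictions
    r_sq = np.corrcoef(y_test, y_pred)[0, 1]**2
    print('R^2 (score)     : {0:.3f}'.format(r2))
    print('corrcoef squared: {0:.3f}'.format(r_sq))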
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neural_network import MLPRegressor
# global variables
fsz=12
xmin=0; ymin=0
xmax=60; ymax=60
def drawfig(y_test, y_pred, ss, nplt):
plt.subplot(3,7,nplt)
plt.gca().set_aspect('equal',adjustable='box')
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)
if 15<=nplt:
plt.xlabel('y_test')
else:
plt.xticks([])
if nplt==1 or nplt==8 or nplt==15:
plt.ylabel('y_predicted')
else:
plt.yticks([])
plt.plot(y_test, y_pred, 'o')
plt.plot([xmin,xmax],[ymin,ymax],'-')
xs=xmin+0.05*(xmax-xmin)
ys=ymin+0.95*(ymax-ymin)
plt.text(xs, ys, ss, va='top', ha='left', fontsize=fsz)
def linreg(X_train, X_test, y_train, y_test):
reg=LinearRegression().fit(X_train, y_train)
sc_tra=reg.score(X_train, y_train)
sc_tes=reg.score(X_test, y_test)
y_pred=reg.predict(X_test)
ss='LinearReg\n'+'R$^2$={0:.3f}'.format(sc_tes)
nplt=1
drawfig(y_test, y_pred, ss, nplt)
print('* LinearRegression')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
def rfreg(X_train, X_test, y_train, y_test):
reg=RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
sc_tra=reg.score(X_train, y_train)
sc_tes=reg.score(X_test, y_test)
y_pred=reg.predict(X_test)
ss='RandomForest\n'+'R$^2$={0:.3f}'.format(sc_tes)
nplt=2
drawfig(y_test, y_pred, ss, nplt)
print('* RandomForest')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
def svr1(X_train, X_test, y_train, y_test):
reg = SVR(kernel='linear').fit(X_train, y_train)
sc_tra=reg.score(X_train, y_train)
sc_tes=reg.score(X_test, y_test)
y_pred=reg.predict(X_test)
ss='SVR(linear)\n'+'R$^2$={0:.3f}'.format(sc_tes)
nplt=3
drawfig(y_test, y_pred, ss, nplt)
print('* SVR')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
def svr2(X_train, X_test, y_train, y_test):
reg = SVR(kernel='rbf').fit(X_train, y_train)
sc_tra=reg.score(X_train, y_train)
sc_tes=reg.score(X_test, y_test)
y_pred=reg.predict(X_test)
ss='SVR(rbf)\n'+'R$^2$={0:.3f}'.format(sc_tes)
nplt=4
drawfig(y_test, y_pred, ss, nplt)
print('* SVR')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
def mlp(X_train, X_test, y_train, y_test):
reg = MLPRegressor(solver='lbfgs', activation='relu', random_state=0, hidden_layer_sizes=[10, 10]).fit(X_train, y_train)
sc_tra=reg.score(X_train, y_train)
sc_tes=reg.score(X_test, y_test)
y_pred=reg.predict(X_test)
ss='MLP\n'+'R$^2$={0:.3f}'.format(sc_tes)
nplt=5
drawfig(y_test, y_pred, ss, nplt)
print('* MLP')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
def rid(X_train, X_test, y_train, y_test):
alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
sc_tra = np.zeros(len(alpha), dtype=np.float64)
sc_tes = np.zeros(len(alpha), dtype=np.float64)
for i, alp in enumerate(alpha):
reg = Ridge(alpha=alp).fit(X_train, y_train)
sc_tra[i] = reg.score(X_train, y_train)
sc_tes[i] = reg.score(X_test, y_test)
y_pred = reg.predict(X_test)
ss = 'Ridge\n'
ss = ss + '(alpha={})\n'.format(alp)
ss = ss +'R$^2$={0:.3f}'.format(sc_tes[i])
nplt=8+i
drawfig(y_test, y_pred, ss, nplt)
print('* Ridge')
sr1='{:25s}'.format('alpha')
sr2='{:25s}'.format('Training set score')
sr3='{:25s}'.format('Test set score')
for i in range(len(alpha)):
sr1=sr1+' {0:>6s}'.format(str(alpha[i]))
sr2=sr2+' {0:6.3f}'.format(sc_tra[i])
sr3=sr3+' {0:6.3f}'.format(sc_tes[i])
print(sr1)
print(sr2)
print(sr3)
def las(X_train, X_test, y_train, y_test):
alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
sc_tra = np.zeros(len(alpha), dtype=np.float64)
sc_tes = np.zeros(len(alpha), dtype=np.float64)
num = np.zeros(len(alpha), dtype=np.int)
for i, alp in enumerate(alpha):
# default max_iter = 1000
reg=Lasso(alpha=alp, max_iter=100000).fit(X_train, y_train)
sc_tra[i] = reg.score(X_train, y_train)
sc_tes[i] = reg.score(X_test, y_test)
num[i] = np.sum(reg.coef_ !=0)
y_pred=reg.predict(X_test)
ss = 'Lasso\n'
ss = ss + '(alpha={})\n'.format(alp)
ss = ss +'R$^2$={0:.3f}'.format(sc_tes[i])
nplt=15+i
drawfig(y_test, y_pred, ss, nplt)
print('* Lasso')
sl1='{:25s}'.format('alpha')
sl2='{:25s}'.format('Training set score')
sl3='{:25s}'.format('Test set score')
sl4='{:25s}'.format('Number of features used')
for i in range(len(alpha)):
sl1=sl1+' {0:>6s}'.format(str(alpha[i]))
sl2=sl2+' {0:6.3f}'.format(sc_tra[i])
sl3=sl3+' {0:6.3f}'.format(sc_tes[i])
sl4=sl4+' {0:6d}'.format(num[i])
print(sl1)
print(sl2)
print(sl3)
print(sl4)
def ctl(X_train, X_test, y_train, y_test, iii):
bw=3; icol=7; irow=3
plt.figure(figsize=(bw*icol,bw*irow), facecolor='w')
plt.subplots_adjust(wspace=0.1, hspace=0.1)
linreg(X_train, X_test, y_train, y_test)
rfreg(X_train, X_test, y_train, y_test)
svr1(X_train, X_test, y_train, y_test)
svr2(X_train, X_test, y_train, y_test)
mlp(X_train, X_test, y_train, y_test)
rid(X_train, X_test, y_train, y_test)
las(X_train, X_test, y_train, y_test)
#plt.tight_layout()
fnameF='fig_boston_3-'+str(iii)+'.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
def main():
for iii in [1,2,3]:
X_train, X_test, y_train, y_test = train_test_split(
dsn['data'], dsn['target'], random_state=0)
if iii==1:
pass
if iii==2:
scaler=MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train=X_train_scaled
X_test=X_test_scaled
if iii==3:
scaler=MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
poly=PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly=poly.transform(X_train_scaled)
X_test_poly=poly.transform(X_test_scaled)
X_train=X_train_poly
X_test=X_test_poly
print('X_train.shape: {}'.format(X_train.shape))
print('y_train.shape: {}'.format(y_train.shape))
print('X_test.shape: {}'.format(X_test.shape))
print('y_test.shape: {}'.format(y_test.shape))
ctl(X_train, X_test, y_train, y_test, iii)
#==============
# Execution
#==============
if __name__ == '__main__': main()
(1) Regression on the raw data
X_train.shape: (379, 13)
y_train.shape: (379,)
X_test.shape: (127, 13)
y_test.shape: (127,)
* LinearRegression
Training set score : 0.770
Test set score : 0.635
* RandomForest
Training set score : 0.983
Test set score : 0.799
* SVR
Training set score : 0.744
Test set score : 0.564
* SVR
Training set score : 0.153
Test set score : 0.005
* MLP
Training set score : 0.473
Test set score : 0.146
* Ridge
alpha 0.0001 0.001 0.01 0.1 1 10 100
Training set score 0.770 0.770 0.770 0.770 0.768 0.762 0.747
Test set score 0.635 0.635 0.635 0.634 0.627 0.613 0.593
* Lasso
alpha 0.0001 0.001 0.01 0.1 1 10 100
Training set score 0.770 0.770 0.769 0.759 0.715 0.559 0.264
Test set score 0.635 0.635 0.632 0.607 0.552 0.399 0.119
Number of features used 13 13 13 12 11 4 2
(2) With feature scaling applied
X_train.shape: (379, 13)
y_train.shape: (379,)
X_test.shape: (127, 13)
y_test.shape: (127,)
* LinearRegression
Training set score : 0.770
Test set score : 0.635
* RandomForest
Training set score : 0.982
Test set score : 0.798
* SVR
Training set score : 0.679
Test set score : 0.497
* SVR
Training set score : 0.423
Test set score : 0.274
* MLP
Training set score : 0.940
Test set score : 0.730
* Ridge
alpha 0.0001 0.001 0.01 0.1 1 10 100
Training set score 0.770 0.770 0.770 0.770 0.766 0.688 0.405
Test set score 0.635 0.635 0.635 0.634 0.621 0.509 0.271
* Lasso
alpha 0.0001 0.001 0.01 0.1 1 10 100
Training set score 0.770 0.770 0.769 0.726 0.301 0.000 0.000
Test set score 0.635 0.635 0.631 0.562 0.243 -0.001 -0.001
Number of features used 13 13 13 7 2 0 0
(3) Scaled features plus polynomial features
X_train.shape: (379, 105)
y_train.shape: (379,)
X_test.shape: (127, 105)
y_test.shape: (127,)
* LinearRegression
Training set score : 0.952
Test set score : 0.607
* RandomForest
Training set score : 0.986
Test set score : 0.768
* SVR
Training set score : 0.798
Test set score : 0.624
* SVR
Training set score : 0.336
Test set score : 0.214
* MLP
Training set score : 0.951
Test set score : 0.790
* Ridge
alpha 0.0001 0.001 0.01 0.1 1 10 100
Training set score 0.951 0.951 0.944 0.928 0.887 0.789 0.598
Test set score 0.618 0.630 0.702 0.772 0.753 0.637 0.418
* Lasso
alpha 0.0001 0.001 0.01 0.1 1 10 100
Training set score 0.951 0.939 0.896 0.771 0.308 0.000 0.000
Test set score 0.644 0.738 0.766 0.630 0.229 -0.001 -0.001
Number of features used 96 68 33 8 4 0 0
6. Try automatic feature selection
(1) Try three kinds of automatic feature selection
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
import time
def drawfig(select, iii):
if iii==1: ss='* SelectPercentile(percentile=50)'
if iii==2: ss='* SelectFromModel(RandomForestRegressor(),threshold=\'median\')'
if iii==3: ss='* RFE(RandomForestRegressor(),n_features_to_select=50)'
mask=select.get_support()
print(mask)
nplt=310+iii
plt.subplot(nplt)
plt.matshow(mask.reshape(1,-1),cmap='gray_r', fignum=False)
plt.xlabel('Sample index')
plt.yticks([])
plt.title(ss,loc='left', pad=30)
def main():
X_train, X_test, y_train, y_test = train_test_split(
dsn['data'], dsn['target'], random_state=0)
scaler=MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
poly=PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly=poly.transform(X_train_scaled)
X_test_poly=poly.transform(X_test_scaled)
X_train=X_train_poly
X_test=X_test_poly
print('Polynomial feature names:\n{}'.format(poly.get_feature_names()))
plt.figure(figsize=(10, 4),facecolor='w')
for iii in [1, 2, 3]:
start=time.time()
if iii==1:
select = SelectPercentile(percentile=50)
if iii==2:
select=SelectFromModel(
RandomForestRegressor(n_estimators=100,random_state=42),
threshold='median')
if iii==3:
select=RFE(
RandomForestRegressor(n_estimators=100,random_state=42),
n_features_to_select=50)
select.fit(X_train,y_train)
X_train_selected=select.transform(X_train)
X_test_selected=select.transform(X_test)
print('X_train.shape: {}'.format(X_train.shape))
print('X_train_selected.shape: {}'.format(X_train_selected.shape))
reg=LinearRegression().fit(X_train_selected, y_train)
sc_tra=reg.score(X_train_selected, y_train)
sc_tes=reg.score(X_test_selected, y_test)
dtime=time.time()-start
print('* LinearRegression')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
print('{0:25s}: {1:.3f}'.format('Calculation time',dtime))
drawfig(select, iii)
plt.tight_layout()
fnameF='fig_boston_4.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
#==============
# Execution
#==============
if __name__ == '__main__': main()
Polynomial feature names:
['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11', 'x0 x12', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9', 'x1 x10', 'x1 x11', 'x1 x12', 'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12', 'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12', 'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x4 x9', 'x4 x10', 'x4 x11', 'x4 x12', 'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x5 x9', 'x5 x10', 'x5 x11', 'x5 x12', 'x6^2', 'x6 x7', 'x6 x8', 'x6 x9', 'x6 x10', 'x6 x11', 'x6 x12', 'x7^2', 'x7 x8', 'x7 x9', 'x7 x10', 'x7 x11', 'x7 x12', 'x8^2', 'x8 x9', 'x8 x10', 'x8 x11', 'x8 x12', 'x9^2', 'x9 x10', 'x9 x11', 'x9 x12', 'x10^2', 'x10 x11', 'x10 x12', 'x11^2', 'x11 x12', 'x12^2']
X_train.shape: (379, 105)
X_train_selected.shape: (379, 52)
* LinearRegression
Training set score : 0.899
Test set score : 0.747
Calculation time : 0.018
[False True False True False True True True False True True False
True True True False True False True True True True True True
True True True False False True False False False False False False
False False False False False True False True False True True True
False True False False False False True False False False False False
False True True False True True True False True True False False
False False False True True True False True True True False True
False False False False False False True True True False True True
True False True False False True False True True]
X_train.shape: (379, 105)
X_train_selected.shape: (379, 53)
* LinearRegression
Training set score : 0.924
Test set score : 0.770
Calculation time : 0.932
[False True False False False False True True True False False False
False True False False True False True True False True True True
True True True False False False False False False False False False
False False False False False True True False True True True True
True True False False False False False False False False False False
False False True True False True True False True True False True
True True True True False False True True True True False True
True True True True True True False False False True True False
True True True False True True False True True]
X_train.shape: (379, 105)
X_train_selected.shape: (379, 50)
* LinearRegression
Training set score : 0.923
Test set score : 0.772
Calculation time : 42.644
[False False False False False False True False False False False False
True True False True True False True True False True True True
True False True False False False False False False False False False
False False False False False True False False True False True True
True True False False False False False False False False False False
False False True True False True True False True True True False
True True True True True True True True True True False True
False False True True True True False False False True True False
True True True True True True False True True]
In the figure below, the indices of the selected features are shown filled in black. Note that for a plot drawn with matshow, subplot has no effect unless fignum=False is passed.
The processing times and the scores obtained with LinearRegression are as follows.
Feature selection method | Time | Training set score | Test set score |
---|---|---|---|
SelectPercentile | 0.018 sec | 0.899 | 0.747 |
SelectFromModel | 0.932 sec | 0.924 | 0.770 |
RFE | 42.644 sec | 0.923 | 0.772 |
(2) RFE is slow, so try SelectFromModel
Since RFE takes so long, SelectFromModel is the natural alternative, but it is not obvious what to pass as threshold, so I tried several threshold values.
RandomForestRegressor is used as the base model, and its feature importances are plotted.
The 'mean' in threshold apparently refers to the mean of the feature importances; if you want to select more features, specify a smaller threshold (e.g. '0.1*mean').
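A small sketch of that relationship (not measured here; model, X_train and y_train are assumed to be the objects defined in the code below). With prefit=True, SelectFromModel reuses an already-fitted estimator, and the number of features it keeps should match a hand count against the importance mean.
model.fit(X_train, y_train)
select = SelectFromModel(model, threshold='0.5*mean', prefit=True)
imp = model.feature_importances_
print(select.get_support().sum())         # features kept by SelectFromModel
print(np.sum(imp >= 0.5 * imp.mean()))    # same count, computed by hand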
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
import time
def plot_bar(model, X_train, y_train):
reg=model.fit(X_train, y_train)
fsz=14
plt.figure(figsize=(10,6), facecolor='w')
plt.rcParams['font.size']=fsz
plt.rcParams['font.family']='sans-serif'
n_features = X_train.shape[1]
plt.barh(range(n_features), reg.feature_importances_, align='center')
ave=np.mean(reg.feature_importances_)
ymin=-0.5
ymax=n_features
plt.ylim(ymin,ymax)
ys=2
for r in [0.2, 0.5, 1.0]:
plt.plot([r*ave, r*ave], [ymin,ymax],'-',color='#ff0000')
ss='{}*mean'.format(r)
plt.text(r*ave, ys, ss, va='center', ha='left', rotation=90, color='#ff0000')
plt.xlabel('Feature importance')
plt.ylabel('Feature index')
fnameF='fig_boston_5-1.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
def drawfig(select, ssth, iii):
ss='* threshold=\''+ssth+'\''
mask=select.get_support()
nplt=511+iii
plt.subplot(nplt)
plt.matshow(mask.reshape(1,-1),cmap='gray_r', fignum=False)
plt.xlabel('Sample index')
plt.yticks([])
plt.title(ss,loc='left', pad=30)
def main():
X_train, X_test, y_train, y_test = train_test_split(
dsn['data'], dsn['target'], random_state=0)
scaler=MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train=X_train_scaled
X_test=X_test_scaled
model=RandomForestRegressor(n_estimators=100,random_state=42)
plot_bar(model, X_train, y_train)
plt.figure(figsize=(10, 8),facecolor='w')
thlist=['median', '0.1*mean', '0.2*mean', '0.5*mean', 'mean']
for iii, ssth in enumerate(thlist):
start=time.time()
select=SelectFromModel(model, threshold=ssth)
select.fit(X_train,y_train)
X_train_selected=select.transform(X_train)
X_test_selected=select.transform(X_test)
print('X_train.shape: {}'.format(X_train.shape))
print('X_train_selected.shape: {}'.format(X_train_selected.shape))
reg=LinearRegression().fit(X_train_selected, y_train)
sc_tra=reg.score(X_train_selected, y_train)
sc_tes=reg.score(X_test_selected, y_test)
dtime=time.time()-start
print('* LinearRegression')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
print('{0:25s}: {1:.3f}'.format('Calculation time',dtime))
drawfig(select, ssth, iii)
plt.tight_layout()
fnameF='fig_boston_5-2.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
#==============
# Execution
#==============
if __name__ == '__main__': main()
Compare the upper figure with the lower one.
In the upper figure, the features exceeding '1.0*mean' are indices 5 and 12, and in the lower figure threshold='mean' likewise selects the features at indices 5 and 12 (index 5 is RM and index 12 is LSTAT).
Furthermore, the features exceeding '0.2*mean' in the upper figure are indices 0, 4, 5, 7, 9, 10 and 12, and threshold='0.2*mean' in the lower figure selects the features with the same indices.
(3) The case with polynomial features included
def plot_bar(model, X_train, y_train):
reg=model.fit(X_train, y_train)
fsz=14
plt.figure(figsize=(10,12), facecolor='w')
plt.rcParams['font.size']=fsz
plt.rcParams['font.family']='sans-serif'
n_features = X_train.shape[1]
plt.barh(range(n_features), reg.feature_importances_, align='center')
ave=np.mean(reg.feature_importances_)
ymin=-0.5
ymax=n_features
plt.ylim(ymin,ymax)
ys=30
for r in [0.5, 1.0]:
plt.plot([r*ave, r*ave], [ymin,ymax],'-',color='#ff0000')
ss='{}*mean'.format(r)
plt.text(r*ave, ys, ss, va='center', ha='left', rotation=90, color='#ff0000')
plt.xlabel('Feature importance')
plt.ylabel('Feature index')
fnameF='fig_boston_6-1.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
def drawfig(select, ssth, iii):
ss='* threshold=\''+ssth+'\''
mask=select.get_support()
nplt=511+iii
plt.subplot(nplt)
plt.matshow(mask.reshape(1,-1),cmap='gray_r', fignum=False)
plt.xlabel('Sample index')
plt.yticks([])
plt.title(ss,loc='left', pad=30)
def main():
X_train, X_test, y_train, y_test = train_test_split(
dsn['data'], dsn['target'], random_state=0)
scaler=MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
poly=PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly=poly.transform(X_train_scaled)
X_test_poly=poly.transform(X_test_scaled)
X_train=X_train_poly
X_test=X_test_poly
#print('Polynomial feature names:\n{}'.format(poly.get_feature_names()))
model=RandomForestRegressor(n_estimators=100,random_state=42)
plot_bar(model, X_train, y_train)
plt.figure(figsize=(10,6), facecolor='w')
thlist=['median', '0.1*mean', '0.2*mean', '0.5*mean', 'mean']
for iii, ssth in enumerate(thlist):
select=SelectFromModel(model, threshold=ssth)
select.fit(X_train,y_train)
X_train_selected=select.transform(X_train)
X_test_selected=select.transform(X_test)
print('X_train.shape: {}'.format(X_train.shape))
print('X_train_selected.shape: {}'.format(X_train_selected.shape))
reg=LinearRegression().fit(X_train_selected, y_train)
sc_tra=reg.score(X_train_selected, y_train)
sc_tes=reg.score(X_test_selected, y_test)
print('* LinearRegression')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
drawfig(select, ssth, iii)
plt.tight_layout()
fnameF='fig_boston_6-2.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
#==============
# Execution
#==============
if __name__ == '__main__': main()
As shown in the printout in 6.(1) above, the feature set including the polynomial features is ordered as follows (repeated here).
['1',
'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12',
'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11', 'x0 x12',
'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9', 'x1 x10', 'x1 x11', 'x1 x12',
'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12',
'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12',
'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x4 x9', 'x4 x10', 'x4 x11', 'x4 x12',
'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x5 x9', 'x5 x10', 'x5 x11', 'x5 x12',
'x6^2', 'x6 x7', 'x6 x8', 'x6 x9', 'x6 x10', 'x6 x11', 'x6 x12',
'x7^2', 'x7 x8', 'x7 x9', 'x7 x10', 'x7 x11', 'x7 x12',
'x8^2', 'x8 x9', 'x8 x10', 'x8 x11', 'x8 x12',
'x9^2', 'x9 x10', 'x9 x11', 'x9 x12',
'x10^2', 'x10 x11', 'x10 x12',
'x11^2', 'x11 x12',
'x12^2']
That is, after the constant term and the 13 original features, everything from index 14 onward is the polynomial part. In the feature-importance plot shown below, the only original features exceeding the mean are index 6 (x5) and index 13 (x12); the other features above the mean all belong to the added polynomial part. In other words, for this model, adding polynomial features is meaningful for improving prediction accuracy.
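To see which terms those indices correspond to, the selection mask can be mapped back to readable names (a sketch, assuming poly is the fitted PolynomialFeatures and select is one of the fitted selectors from the code above; get_feature_names accepts the original column names).
names = np.array(poly.get_feature_names(dsn.feature_names))
print(names[select.get_support()])   # prints the names of the selected features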
The table below summarizes, for each threshold value, the number of selected features and the training-set and test-set scores. Linear regression (LinearRegression) is used to compute the scores.
threshold | Original number of features | Number of selected features | Training set score | Test set score |
---|---|---|---|---|
median | 105 | 53 | 0.924 | 0.770 |
0.1*mean | 105 | 62 | 0.929 | 0.775 |
0.2*mean | 105 | 36 | 0.904 | 0.764 |
0.5*mean | 105 | 14 | 0.840 | 0.710 |
mean | 105 | 12 | 0.826 | 0.694 |
Narrowing the features down too far makes the regression accuracy drop, so using threshold='median', which keeps roughly half of the features, seems like a safe choice.
7. Measure SVR performance with a grid search
Among the results shown in 5., the SVR with the rbf kernel (default parameters) looked rather pitiful, so I measured SVR's performance with a grid search.
The processing flow is as follows.
a. Split into training and test data (using train_test_split)
b. Scale the data (using MinMaxScaler)
c. Add polynomial features (using PolynomialFeatures)
d. Model-based feature selection (using SelectFromModel)
e. Search for the best parameters by grid search with cross-validation (using GridSearchCV)
f. Draw a heatmap of the mean cross-validation scores over C and gamma (function: def hmap(df))
g. Draw a correlation plot of the predictions with the best parameters against the test data (function: def drawfig(y_test, y_pred, ss))
Step e. above can be written concisely thanks to GridSearchCV. The preprocessing a.-d., or even the main processing a.-f., can be simplified further by putting it into a Pipeline (a condensed sketch follows). In the code presented here, every library used is imported explicitly in each listing (purely for peace of mind).
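As a rough preview of the flow, here is a condensed sketch (essentially Case (4) below, compressed; it assumes the section 1 imports and the loaded dsn).
# a. split, b.-d. preprocessing wrapped in a Pipeline, e. grid search over C and gamma
X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
prep = Pipeline([
    ('scaler', MinMaxScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('select', SelectFromModel(
        RandomForestRegressor(n_estimators=100, random_state=42), threshold='median')),
])
X_train_sel = prep.fit(X_train, y_train).transform(X_train)
X_test_sel = prep.transform(X_test)
param_grid = {'kernel': ['rbf'],
              'C': [1e0, 1e1, 1e2, 1e3, 1e4],
              'gamma': [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]}
grid = GridSearchCV(SVR(), param_grid, cv=5).fit(X_train_sel, y_train)
print(grid.best_params_, grid.score(X_test_sel, y_test))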
I ran the code for cases (1) through (4) below and measured the processing time. The results are summarized in the table; for Case (1), the plotting time is not included.
Case | Preprocessing | Grid search | Time |
---|---|---|---|
(1) | hand-written | GridSearchCV | 6.455 sec |
(2) | hand-written | hand-written for loops | 5.610 sec |
(3) | Pipeline (everything, including SVR) | GridSearchCV | 109.254 sec |
(4) | Pipeline | GridSearchCV | 6.454 sec |
What the table above tells us:
- Case (2) has the longest code but is the fastest.
- Case (3) stuffs everything into the Pipeline. The code is the most compact, but it is very slow.
- Among those compared, Case (4), which puts only the messy preprocessing into a Pipeline, is probably the most practical.
(1) Using GridSearchCV (with plotting routines)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
import time
def hmap(df):
sns.heatmap(df, annot=True, fmt='.3f', cmap='viridis')
plt.xlabel('gamma')
plt.ylabel('C')
plt.yticks(rotation=0)
fnameF='fig_boston_7-1.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
def drawfig(y_test, y_pred, ss):
fsz=12
xmin=0; ymin=0
xmax=60; ymax=60
plt.figure(figsize=(3,3), facecolor='w')
plt.gca().set_aspect('equal',adjustable='box')
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)
plt.xlabel('y_test')
plt.ylabel('y_predicted')
plt.plot(y_test, y_pred, 'o')
plt.plot([xmin,xmax],[ymin,ymax],'-')
xs=xmin+0.05*(xmax-xmin)
ys=ymin+0.95*(ymax-ymin)
plt.text(xs, ys, ss, va='top', ha='left', fontsize=fsz)
fnameF='fig_boston_7-2.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
def calc(X_train, y_train, X_test, y_test):
cc=[1e0, 1e1, 1e2, 1e3, 1e4]
gg=[1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
param_grid={'kernel': ['rbf'], 'C': cc, 'gamma': gg}
grid=GridSearchCV(SVR(), param_grid, cv=5)
grid.fit(X_train, y_train)
test_score=grid.score(X_test, y_test)
print('Size of training set: {}'.format(X_train.shape[0]))
print('Size of test set: {}'.format(X_test.shape[0]))
print('Best parameters: {}'.format(grid.best_params_))
print('Best cross-validation score: {0:.3f}'.format(grid.best_score_))
print('Test set score with best parameters: {0:.3f}'.format(test_score))
results=pd.DataFrame(grid.cv_results_)
#print(results)
rr=np.array(results.mean_test_score).reshape(len(cc), len(gg))
df = pd.DataFrame(data=rr, index=cc, columns=gg, dtype='float')
df=df.sort_index(ascending=False)
print('* DataFrame for score heatmap')
print(df)
y_pred=grid.predict(X_test)
return df, y_pred, test_score
def preproc(dsn):
X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
scaler=MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
poly=PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly=poly.transform(X_train_scaled)
X_test_poly=poly.transform(X_test_scaled)
X_train=X_train_poly
X_test=X_test_poly
#print('Polynomial feature names:\n{}'.format(poly.get_feature_names()))
model=RandomForestRegressor(n_estimators=100,random_state=42)
ssth='median'
select=SelectFromModel(model, threshold=ssth)
select.fit(X_train,y_train)
X_train_selected=select.transform(X_train)
X_test_selected=select.transform(X_test)
print('data set data.shape: {}'.format(dsn.data.shape))
print('X_train.shape: {}'.format(X_train.shape))
print('X_train_selected.shape: {}'.format(X_train_selected.shape))
X_train=X_train_selected
X_test=X_test_selected
return X_train, X_test, y_train, y_test
def main():
start=time.time()
dsn=load_boston()
X_train, X_test, y_train, y_test = preproc(dsn)
df, y_pred, test_score = calc(X_train, y_train, X_test, y_test)
dtime=time.time()-start
print('Calculation time={0:.3f}'.format(dtime))
hmap(df)
ss='SVR(rbf)\n'+'R$^2$={0:.3f}'.format(test_score)
drawfig(y_test, y_pred, ss)
#==============
# Execution
#==============
if __name__ == '__main__': main()
data set data.shape: (506, 13)
X_train.shape: (379, 105)
X_train_selected.shape: (379, 53)
Size of training set: 379
Size of test set: 127
Best parameters: {'C': 100.0, 'gamma': 1.0, 'kernel': 'rbf'}
Best cross-validation score: 0.885
Test set score with best parameters: 0.827
* DataFrame for score heatmap
0.0001 0.0010 0.0100 0.1000 1.0000 10.0000
10000.0 0.783758 0.843492 0.852278 0.818788 0.775954 0.659843
1000.0 0.582699 0.785104 0.848694 0.867736 0.832283 0.659843
100.0 0.269348 0.582686 0.795799 0.858090 0.885195 0.659843
10.0 0.022539 0.268821 0.581901 0.826387 0.863065 0.596508
1.0 -0.027188 0.022278 0.262279 0.561103 0.687804 0.209851
Calculation time=6.455
The heatmap below shows the scores obtained while searching for the best parameters. Its maximum is 0.885, which matches the "Best cross-validation score: 0.885" in the printout above. The test-set score with the best parameters is 0.827, and that is the value shown on the correlation plot. It roughly agrees with the squared correlation coefficient of 0.832 for the correlation plot shown below; as discussed in 5., the small difference is because corrcoef² ignores any bias in the predictions while the score R² does not.
(2) Without GridSearchCV
Next, I wrote a program that performs the grid search with plain for loops instead of GridSearchCV. cross_val_score is imported to do the cross-validation. The plotting routines are omitted. The results are essentially identical to (1) (the heatmap values differ only around the fourth decimal place).
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import time
def calc(X_train, y_train, X_test, y_test):
cc=[1e0, 1e1, 1e2, 1e3, 1e4]
gg=[1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
rr=np.zeros((len(cc), len(gg)), dtype=np.float64)
best_score=0
for i, c in enumerate(cc):
for j, g in enumerate(gg):
reg=SVR(kernel='rbf', C=c, gamma=g)
scores=cross_val_score(reg, X_train, y_train, cv=5)
score=np.mean(scores)
rr[i, j]=score
if best_score < score:
best_score=score
best_parameters={'C': c, 'gamma': g}
reg=SVR(**best_parameters)
reg.fit(X_train, y_train)
test_score=reg.score(X_test, y_test)
print('Size of training set: {}'.format(X_train.shape[0]))
print('Size of test set: {}'.format(X_test.shape[0]))
print('Best parameters: {}'.format(best_parameters))
print('Best cross-validation score: {0:.3f}'.format(best_score))
print('Test set score with best parameters: {0:.3f}'.format(test_score))
df = pd.DataFrame(data=rr, index=cc, columns=gg, dtype='float')
df=df.sort_index(ascending=False)
print('* DataFrame for score heatmap')
print(df)
y_pred=reg.predict(X_test)
return df, y_pred
def preproc(dsn):
X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
scaler=MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
poly=PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly=poly.transform(X_train_scaled)
X_test_poly=poly.transform(X_test_scaled)
X_train=X_train_poly
X_test=X_test_poly
#print('Polynomial feature names:\n{}'.format(poly.get_feature_names()))
model=RandomForestRegressor(n_estimators=100,random_state=42)
ssth='median'
select=SelectFromModel(model, threshold=ssth)
select.fit(X_train,y_train)
X_train_selected=select.transform(X_train)
X_test_selected=select.transform(X_test)
print('data set data.shape: {}'.format(dsn.data.shape))
print('X_train.shape: {}'.format(X_train.shape))
print('X_train_selected.shape: {}'.format(X_train_selected.shape))
X_train=X_train_selected
X_test=X_test_selected
return X_train, X_test, y_train, y_test
def main():
start=time.time()
dsn=load_boston()
X_train, X_test, y_train, y_test = preproc(dsn)
df, y_pred = calc(X_train, y_train, X_test, y_test)
dtime=time.time()-start
print('Calculation time={0:.3f}'.format(dtime))
#hmap(df)
#ss='SVR(rbf)'
#drawfig(y_test, y_pred, ss)
#==============
# Execution
#==============
if __name__ == '__main__': main()
data set data.shape: (506, 13)
X_train.shape: (379, 105)
X_train_selected.shape: (379, 53)
Size of training set: 379
Size of test set: 127
Best parameters: {'C': 100.0, 'gamma': 1.0}
Best cross-validation score: 0.885
Test set score with best parameters: 0.827
* DataFrame for score heatmap
0.0001 0.0010 0.0100 0.1000 1.0000 10.0000
10000.0 0.783802 0.843498 0.852295 0.819009 0.776205 0.659981
1000.0 0.582766 0.785148 0.848699 0.867849 0.832527 0.659981
100.0 0.269420 0.582753 0.795827 0.858097 0.885315 0.659981
10.0 0.022615 0.268892 0.581960 0.826356 0.863014 0.596576
1.0 -0.027111 0.022354 0.262348 0.561128 0.687635 0.209916
Calculation time=5.610
(3) Trying out Pipeline
scikit-learn has a handy Pipeline mechanism, so I stuffed every step, from preprocessing through the regression, into one and ran the grid search over it. The code gets simpler, but it is really slow! The results also differ "slightly" from the other cases. Both effects are presumably because GridSearchCV refits the whole pipeline, including the RandomForest-based feature selection, for every CV fold and every (C, gamma) combination, whereas in (1) and (4) the selection is fit just once on the full training set (which, strictly speaking, leaks a little information into the CV folds and pushes those CV scores up).
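One way to mitigate the slowdown, not measured here, might be Pipeline's memory argument, which caches fitted transformers so the preprocessing is not refit for every (C, gamma) combination on the same fold. A sketch (the cache directory is arbitrary; the step objects are the same as in the code below):
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline

cachedir = mkdtemp()   # throwaway directory for the transformer cache
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('select', SelectFromModel(model, threshold='median')),
    ('reg', SVR()),
], memory=cachedir)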
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import time
def main():
start=time.time()
dsn=load_boston()
X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
print('data set data.shape: {}'.format(dsn.data.shape))
print('X_train.shape: {}'.format(X_train.shape))
model=RandomForestRegressor(n_estimators=100,random_state=42)
pipe=Pipeline([
('scaler', MinMaxScaler()),
('poly', PolynomialFeatures(degree=2)),
('select', SelectFromModel(model, threshold='median')),
('reg', SVR())
])
cc=[1e0, 1e1, 1e2, 1e3, 1e4]
gg=[1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
param_grid={'reg__kernel': ['rbf'], 'reg__C': cc, 'reg__gamma': gg}
grid=GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
test_score=grid.score(X_test, y_test)
print('Best parameters: {}'.format(grid.best_params_))
print('Best cross-validation score: {0:.3f}'.format(grid.best_score_))
print('Test set score with best parameters: {0:.3f}'.format(test_score))
results=pd.DataFrame(grid.cv_results_)
#print(results)
rr=np.array(results.mean_test_score).reshape(len(cc), len(gg))
df = pd.DataFrame(data=rr, index=cc, columns=gg, dtype='float')
df=df.sort_index(ascending=False)
print('* DataFrame for score heatmap')
print(df)
y_pred=grid.predict(X_test)
dtime=time.time()-start
print('Calculation time={0:.3f}'.format(dtime))
#print(grid.best_estimator_.named_steps['scaler'])
#print(grid.best_estimator_.named_steps['poly'])
#print(grid.best_estimator_.named_steps['select'])
#print(grid.best_estimator_.named_steps['reg'])
#==============
# Execution
#==============
if __name__ == '__main__': main()
data set data.shape: (506, 13)
X_train.shape: (379, 13)
Best parameters: {'reg__C': 100.0, 'reg__gamma': 1.0, 'reg__kernel': 'rbf'}
Best cross-validation score: 0.869
Test set score with best parameters: 0.827
* DataFrame for score heatmap
0.0001 0.0010 0.0100 0.1000 1.0000 10.0000
10000.0 0.789522 0.836473 0.839120 0.643391 0.765627 0.648995
1000.0 0.598104 0.790297 0.845290 0.854108 0.818415 0.648995
100.0 0.284321 0.598198 0.801653 0.853009 0.868597 0.648995
10.0 0.025757 0.283716 0.597943 0.821156 0.851981 0.598107
1.0 -0.026828 0.025447 0.276398 0.573044 0.684257 0.210990
Calculation time=109.254
(4) Preprocessing Pipeline + GridSearchCV
Since putting all of the processing into the Pipeline is very slow, here only the messy preprocessing part goes into a Pipeline, and the grid search is done with GridSearchCV. The speed is about the same as (1), and the results are identical to (1).
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import time
def main():
start=time.time()
dsn=load_boston()
X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
print('data set data.shape: {}'.format(dsn.data.shape))
print('X_train.shape: {}'.format(X_train.shape))
model=RandomForestRegressor(n_estimators=100,random_state=42)
pipe=Pipeline([
('scaler', MinMaxScaler()),
('poly', PolynomialFeatures(degree=2)),
('select', SelectFromModel(model, threshold='median'))
])
pipe.fit(X_train, y_train)
X_train=pipe.transform(X_train)
X_test=pipe.transform(X_test)
cc=[1e0, 1e1, 1e2, 1e3, 1e4]
gg=[1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
param_grid={'kernel': ['rbf'], 'C': cc, 'gamma': gg}
grid=GridSearchCV(SVR(), param_grid, cv=5)
grid.fit(X_train, y_train)
test_score=grid.score(X_test, y_test)
print('Size of training set: {}'.format(X_train.shape))
print('Size of test set: {}'.format(X_test.shape))
print('Best parameters: {}'.format(grid.best_params_))
print('Best cross-validation score: {0:.3f}'.format(grid.best_score_))
print('Test set score with best parameters: {0:.3f}'.format(test_score))
results=pd.DataFrame(grid.cv_results_)
#print(results)
rr=np.array(results.mean_test_score).reshape(len(cc), len(gg))
df = pd.DataFrame(data=rr, index=cc, columns=gg, dtype='float')
df=df.sort_index(ascending=False)
print('* DataFrame for score heatmap')
print(df)
y_pred=grid.predict(X_test)
dtime=time.time()-start
print('Calculation time={0:.3f}'.format(dtime))
#==============
# Execution
#==============
if __name__ == '__main__': main()
data set data.shape: (506, 13)
X_train.shape: (379, 13)
Size of training set: (379, 53)
Size of test set: (127, 53)
Best parameters: {'C': 100.0, 'gamma': 1.0, 'kernel': 'rbf'}
Best cross-validation score: 0.885
Test set score with best parameters: 0.827
* DataFrame for score heatmap
0.0001 0.0010 0.0100 0.1000 1.0000 10.0000
10000.0 0.783758 0.843492 0.852278 0.818788 0.775954 0.659843
1000.0 0.582699 0.785104 0.848694 0.867736 0.832283 0.659843
100.0 0.269348 0.582686 0.795799 0.858090 0.885195 0.659843
10.0 0.022539 0.268821 0.581901 0.826387 0.863065 0.596508
1.0 -0.027188 0.022278 0.262279 0.561103 0.687804 0.209851
Calculation time=6.454
That is all.