
Python: Playing with scikit-learn's boston dataset (multiple regression)

Posted at 2019-07-22

0. Introduction

In a previous post I tried multiple regression on the boston dataset; this time I tried a wider variety of things. The textbook I am working from is O'Reilly Japan's "Pythonではじめる機械学習" (Introduction to Machine Learning with Python: the basics of feature engineering and machine learning with scikit-learn). Reproducing the textbook verbatim would be rather pointless, so I have arranged the library usage, plotting routines, and so on in my own way.
If something is unclear, I recommend looking it up in the official scikit-learn documentation.

This article covers the following:

    1. Import the standard Python libraries
    2. Inspect the contents of the dataset
    3. Take a first look at the data with a pair plot and a correlation heatmap
    4. Look at the distribution range of each feature with a box plot
    5. Try multiple regression with various methods
    6. Try automatic feature selection
    7. Measure SVR performance with a grid search

My environment is as follows.

  • iMac (Retina 4K, 21.5-inch, 2017)
  • Processor 3 GHz Intel Core i5
  • Memory 8 GB 2400 MHz DDR4
  • macOS Mojave
  • Python 3.7.3
  • scikit-learn version 0.20.3
  • Jupyter lab version 0.35.4

1. Import the standard Python libraries

Since I work in JupyterLab (Jupyter Notebook should be fine too), I start by importing the standard libraries. seaborn's heatmap is convenient, so I import seaborn as well.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

If you do not want warnings displayed, add the following.

import warnings
warnings.filterwarnings('ignore')

For reference, here is a list of everything imported from scikit-learn in this article. Because the programming is done in Jupyter, an import stays in effect until the kernel is restarted, so note that the individual listings below do not necessarily import everything they use.

from sklearn.datasets import load_boston                  # boston dataset
from sklearn.model_selection import train_test_split      # split into training and test data
from sklearn.linear_model import LinearRegression         # linear regression (ordinary least squares)
from sklearn.linear_model import Ridge                    # ridge regression
from sklearn.linear_model import Lasso                    # lasso regression
from sklearn.ensemble import RandomForestRegressor        # random forest regression
from sklearn.svm import SVR                               # support vector machine (regression)
from sklearn.neural_network import MLPRegressor           # neural network
from sklearn.preprocessing import MinMaxScaler            # data scaling
from sklearn.preprocessing import PolynomialFeatures      # adding polynomial features
from sklearn.feature_selection import SelectPercentile    # univariate feature selection
from sklearn.feature_selection import SelectFromModel     # model-based feature selection
from sklearn.feature_selection import RFE                 # iterative feature selection (recursive feature elimination)
from sklearn.model_selection import GridSearchCV          # grid search with cross-validation
from sklearn.model_selection import cross_val_score       # cross-validation
from sklearn.pipeline import Pipeline                     # simplifying the description of processing steps

2. Inspect the contents of the dataset

Get the dataset's keys and display the contents stored under each key.

from sklearn.datasets import load_boston

dsn=load_boston()

print(dsn.keys())
for ss in dsn.keys():
    print()
    print('* '+ ss)
    print(dsn[ss])

print()
print('Data shape')
print('Data.shape: {}'.format(dsn.data.shape))
print('Target.shape: {}'.format(dsn.target.shape))
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

* data
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]

* target
[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
... (middle omitted) ...
 16.7 12.  14.6 21.4 23.  23.7 25.  21.8 20.6 21.2 19.1 20.6 15.2  7.
  8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
 22.  11.9]

* feature_names
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

* DESCR
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

... (rest omitted) ...

* filename
/usr/local/lib/python3.7/site-packages/sklearn/datasets/data/boston_house_prices.csv

Data shape
Data.shape: (506, 13)
Target.shape: (506,)

3. Take a first look at the data with a pair plot and a correlation-coefficient heatmap

Create and inspect a 14 x 14 pair plot of the 13 features plus the target. Also compute and inspect the correlation matrix of the features and the target. I initially displayed the correlation matrix itself, but the mixture of positive and negative entries made the plot a colorful mess that was hard to read, so here I display the square of each element of the correlation matrix (the coefficient of determination of a simple one-variable regression).

df=pd.DataFrame(dsn.data, columns=dsn.feature_names)
df['target']=dsn.target
plt.figure(facecolor='w')
pd.plotting.scatter_matrix(df,figsize=(20,20))
fnameF='fig_boston_1.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()

df_corr = df.corr()
print(df_corr**2)
plt.figure(figsize=(15, 7), facecolor='w')
sns.heatmap(df_corr**2, annot=True, fmt='.2f', vmax=1, vmin=0, center=0.5, cmap='Blues')
plt.yticks(rotation=0)
fnameF='fig_boston_1-1.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()

fig_boston_1.jpg

fig_boston_1-1.jpg

4. Look at the distribution range of each feature with a box plot

Look at the range over which each feature's values are distributed.

plt.figure(figsize=(10,6),facecolor='w')
plt.xlabel('Feature')
plt.ylabel('Feature magnitude')
plt.yscale('log')
plt.boxplot(dsn.data)
plt.xticks(range(1,dsn.data.shape[1]+1), dsn.feature_names, rotation=90)
fnameF='fig_boston_2.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()

fig_boston_2.jpg

5. Try multiple regression with various methods

Try various kinds of regression while varying how the features are preprocessed.
The coefficient of determination and the squared correlation coefficient on a scatter plot are, strictly speaking, different quantities, so the value annotated in each panel is the coefficient of determination computed with reg.score(X_test, y_test).

As will come up later, the squared correlation coefficient on the plots (computed with numpy.corrcoef from y_test and y_pred = predict(X_test)) differs from the score(X_test, y_test) value printed to the screen, and the plotted value is the larger one. This is not because score is an adjusted R²: score returns the plain coefficient of determination R² = 1 - SS_res/SS_tot. The squared correlation implicitly refits a slope and an intercept between y_pred and y_test, so it can never be smaller than R², which also penalizes any systematic offset of the predictions from the y = x line.
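The following minimal sketch (my own check, reusing the same split and a plain linear regression; it is not part of the original listings) makes the relationship concrete: score() reproduces 1 - SS_res/SS_tot exactly, while the squared correlation comes out larger.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

dsn = load_boston()
X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# coefficient of determination: 1 - SS_res / SS_tot (what score() returns)
ss_res = np.sum((y_test - y_pred)**2)
ss_tot = np.sum((y_test - np.mean(y_test))**2)
print(reg.score(X_test, y_test), 1.0 - ss_res/ss_tot)   # identical values

# squared Pearson correlation: slope and intercept are implicitly refit,
# so this is always >= the coefficient of determination
print(np.corrcoef(y_test, y_pred)[0, 1]**2)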

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neural_network import MLPRegressor


# global variables
fsz=12
xmin=0; ymin=0
xmax=60; ymax=60


def drawfig(y_test, y_pred, ss, nplt):
    # scatter of predicted vs. observed values in panel nplt of a 3x7 grid
    plt.subplot(3,7,nplt)
    plt.gca().set_aspect('equal',adjustable='box')
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)
    if 15<=nplt:
        plt.xlabel('y_test')
    else:
        plt.xticks([])           
    if nplt==1 or nplt==8 or nplt==15:
        plt.ylabel('y_predicted')
    else:
        plt.yticks([])
    plt.plot(y_test, y_pred, 'o')
    plt.plot([xmin,xmax],[ymin,ymax],'-')
    xs=xmin+0.05*(xmax-xmin)
    ys=ymin+0.95*(ymax-ymin)
    plt.text(xs, ys, ss, va='top', ha='left', fontsize=fsz)

    
def linreg(X_train, X_test, y_train, y_test):
    reg=LinearRegression().fit(X_train, y_train)
    sc_tra=reg.score(X_train, y_train)
    sc_tes=reg.score(X_test, y_test)
    y_pred=reg.predict(X_test)
    ss='LinearReg\n'+'R$^2$={0:.3f}'.format(sc_tes)
    nplt=1
    drawfig(y_test, y_pred, ss, nplt)
    print('* LinearRegression')
    print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
    print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))


def rfreg(X_train, X_test, y_train, y_test):
    reg=RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
    sc_tra=reg.score(X_train, y_train)
    sc_tes=reg.score(X_test, y_test)
    y_pred=reg.predict(X_test)
    ss='RandomForest\n'+'R$^2$={0:.3f}'.format(sc_tes)
    nplt=2
    drawfig(y_test, y_pred, ss, nplt)
    print('* RandomForest')
    print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
    print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))

    
def svr1(X_train, X_test, y_train, y_test):
    reg = SVR(kernel='linear').fit(X_train, y_train)
    sc_tra=reg.score(X_train, y_train)
    sc_tes=reg.score(X_test, y_test)
    y_pred=reg.predict(X_test)
    ss='SVR(linear)\n'+'R$^2$={0:.3f}'.format(sc_tes)
    nplt=3
    drawfig(y_test, y_pred, ss, nplt)
    print('* SVR')
    print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
    print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))


def svr2(X_train, X_test, y_train, y_test):
    reg = SVR(kernel='rbf').fit(X_train, y_train)
    sc_tra=reg.score(X_train, y_train)
    sc_tes=reg.score(X_test, y_test)
    y_pred=reg.predict(X_test)
    ss='SVR(rbf)\n'+'R$^2$={0:.3f}'.format(sc_tes)
    nplt=4
    drawfig(y_test, y_pred, ss, nplt)
    print('* SVR')
    print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
    print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))

    
def mlp(X_train, X_test, y_train, y_test):
    reg = MLPRegressor(solver='lbfgs', activation='relu', random_state=0, hidden_layer_sizes=[10, 10]).fit(X_train, y_train)
    sc_tra=reg.score(X_train, y_train)
    sc_tes=reg.score(X_test, y_test)
    y_pred=reg.predict(X_test)
    ss='MLP\n'+'R$^2$={0:.3f}'.format(sc_tes)
    nplt=5
    drawfig(y_test, y_pred, ss, nplt)
    print('* MLP')
    print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
    print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))

    
def rid(X_train, X_test, y_train, y_test):
    alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
    sc_tra = np.zeros(len(alpha), dtype=np.float64)
    sc_tes = np.zeros(len(alpha), dtype=np.float64)
    for i, alp in enumerate(alpha):
        reg = Ridge(alpha=alp).fit(X_train, y_train)
        sc_tra[i] = reg.score(X_train, y_train)
        sc_tes[i] = reg.score(X_test, y_test)
        y_pred = reg.predict(X_test)
        ss = 'Ridge\n'
        ss = ss + '(alpha={})\n'.format(alp)
        ss = ss +'R$^2$={0:.3f}'.format(sc_tes[i])
        nplt=8+i
        drawfig(y_test, y_pred, ss, nplt)
    print('* Ridge')
    sr1='{:25s}'.format('alpha')
    sr2='{:25s}'.format('Training set score')
    sr3='{:25s}'.format('Test set score')
    for i in range(len(alpha)):
        sr1=sr1+' {0:>6s}'.format(str(alpha[i]))
        sr2=sr2+' {0:6.3f}'.format(sc_tra[i])
        sr3=sr3+' {0:6.3f}'.format(sc_tes[i])
    print(sr1)
    print(sr2)
    print(sr3)

        
def las(X_train, X_test, y_train, y_test):
    alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
    sc_tra = np.zeros(len(alpha), dtype=np.float64)
    sc_tes = np.zeros(len(alpha), dtype=np.float64)
    num = np.zeros(len(alpha), dtype=int)
    for i, alp in enumerate(alpha):
        # default max_iter = 1000
        reg=Lasso(alpha=alp, max_iter=100000).fit(X_train, y_train)
        sc_tra[i] = reg.score(X_train, y_train)
        sc_tes[i] = reg.score(X_test, y_test)
        num[i] = np.sum(reg.coef_ !=0)
        y_pred=reg.predict(X_test)
        ss = 'Lasso\n'
        ss = ss + '(alpha={})\n'.format(alp)
        ss = ss +'R$^2$={0:.3f}'.format(sc_tes[i])
        nplt=15+i
        drawfig(y_test, y_pred, ss, nplt)
    print('* Lasso')
    sl1='{:25s}'.format('alpha')
    sl2='{:25s}'.format('Training set score')
    sl3='{:25s}'.format('Test set score')
    sl4='{:25s}'.format('Number of features used')
    for i in range(len(alpha)):
        sl1=sl1+' {0:>6s}'.format(str(alpha[i]))
        sl2=sl2+' {0:6.3f}'.format(sc_tra[i])
        sl3=sl3+' {0:6.3f}'.format(sc_tes[i])
        sl4=sl4+' {0:6d}'.format(num[i])
    print(sl1)
    print(sl2)
    print(sl3)
    print(sl4)

    
def ctl(X_train, X_test, y_train, y_test, iii):
    bw=3; icol=7; irow=3
    plt.figure(figsize=(bw*icol,bw*irow), facecolor='w')
    plt.subplots_adjust(wspace=0.1, hspace=0.1)
    linreg(X_train, X_test, y_train, y_test)
    rfreg(X_train, X_test, y_train, y_test)
    svr1(X_train, X_test, y_train, y_test)
    svr2(X_train, X_test, y_train, y_test)
    mlp(X_train, X_test, y_train, y_test)
    rid(X_train, X_test, y_train, y_test)
    las(X_train, X_test, y_train, y_test)
    #plt.tight_layout()
    fnameF='fig_boston_3-'+str(iii)+'.jpg'
    plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
    plt.show()
    

def main():
    for iii in [1,2,3]:
        X_train, X_test, y_train, y_test = train_test_split(
            dsn['data'], dsn['target'], random_state=0)
        if iii==1:
            # case (1): raw features as-is
            pass
        if iii==2:
            # case (2): min-max scaling of the features
            scaler=MinMaxScaler()
            scaler.fit(X_train)
            X_train_scaled = scaler.transform(X_train)
            X_test_scaled = scaler.transform(X_test)
            X_train=X_train_scaled
            X_test=X_test_scaled
        if iii==3:
            # case (3): min-max scaling plus degree-2 polynomial features
            scaler=MinMaxScaler()
            scaler.fit(X_train)
            X_train_scaled = scaler.transform(X_train)
            X_test_scaled = scaler.transform(X_test)
            poly=PolynomialFeatures(degree=2).fit(X_train_scaled)
            X_train_poly=poly.transform(X_train_scaled)
            X_test_poly=poly.transform(X_test_scaled)
            X_train=X_train_poly
            X_test=X_test_poly
        print('X_train.shape: {}'.format(X_train.shape))
        print('y_train.shape: {}'.format(y_train.shape))
        print('X_test.shape: {}'.format(X_test.shape))
        print('y_test.shape: {}'.format(y_test.shape))
        ctl(X_train, X_test, y_train, y_test, iii)
    

#==============
# Execution
#==============
if __name__ == '__main__': main()

(1) Regression on the raw data as-is

X_train.shape: (379, 13)
y_train.shape: (379,)
X_test.shape: (127, 13)
y_test.shape: (127,)
* LinearRegression
Training set score       : 0.770
Test set score           : 0.635
* RandomForest
Training set score       : 0.983
Test set score           : 0.799
* SVR
Training set score       : 0.744
Test set score           : 0.564
* SVR
Training set score       : 0.153
Test set score           : 0.005
* MLP
Training set score       : 0.473
Test set score           : 0.146
* Ridge
alpha                     0.0001  0.001   0.01    0.1      1     10    100
Training set score         0.770  0.770  0.770  0.770  0.768  0.762  0.747
Test set score             0.635  0.635  0.635  0.634  0.627  0.613  0.593
* Lasso
alpha                     0.0001  0.001   0.01    0.1      1     10    100
Training set score         0.770  0.770  0.769  0.759  0.715  0.559  0.264
Test set score             0.635  0.635  0.632  0.607  0.552  0.399  0.119
Number of features used       13     13     13     12     11      4      2

fig_boston_3-1.jpg

(2) With feature scaling applied

X_train.shape: (379, 13)
y_train.shape: (379,)
X_test.shape: (127, 13)
y_test.shape: (127,)
* LinearRegression
Training set score       : 0.770
Test set score           : 0.635
* RandomForest
Training set score       : 0.982
Test set score           : 0.798
* SVR
Training set score       : 0.679
Test set score           : 0.497
* SVR
Training set score       : 0.423
Test set score           : 0.274
* MLP
Training set score       : 0.940
Test set score           : 0.730
* Ridge
alpha                     0.0001  0.001   0.01    0.1      1     10    100
Training set score         0.770  0.770  0.770  0.770  0.766  0.688  0.405
Test set score             0.635  0.635  0.635  0.634  0.621  0.509  0.271
* Lasso
alpha                     0.0001  0.001   0.01    0.1      1     10    100
Training set score         0.770  0.770  0.769  0.726  0.301  0.000  0.000
Test set score             0.635  0.635  0.631  0.562  0.243 -0.001 -0.001
Number of features used       13     13     13      7      2      0      0

fig_boston_3-2.jpg

(3) Scaled features with polynomial features added

X_train.shape: (379, 105)
y_train.shape: (379,)
X_test.shape: (127, 105)
y_test.shape: (127,)
* LinearRegression
Training set score       : 0.952
Test set score           : 0.607
* RandomForest
Training set score       : 0.986
Test set score           : 0.768
* SVR
Training set score       : 0.798
Test set score           : 0.624
* SVR
Training set score       : 0.336
Test set score           : 0.214
* MLP
Training set score       : 0.951
Test set score           : 0.790
* Ridge
alpha                     0.0001  0.001   0.01    0.1      1     10    100
Training set score         0.951  0.951  0.944  0.928  0.887  0.789  0.598
Test set score             0.618  0.630  0.702  0.772  0.753  0.637  0.418
* Lasso
alpha                     0.0001  0.001   0.01    0.1      1     10    100
Training set score         0.951  0.939  0.896  0.771  0.308  0.000  0.000
Test set score             0.644  0.738  0.766  0.630  0.229 -0.001 -0.001
Number of features used       96     68     33      8      4      0      0

fig_boston_3-3.jpg

6. Try automatic feature selection

(1) Try three kinds of automatic feature selection

from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
import time


def drawfig(select, iii):
    if iii==1: ss='* SelectPercentile(percentile=50)'
    if iii==2: ss='* SelectFromModel(RandomForestRegressor(),threshold=\'median\')'
    if iii==3: ss='* RFE(RandomForestRegressor(),n_features_to_select=50)'
    mask=select.get_support()
    print(mask)
    nplt=310+iii
    plt.subplot(nplt)
    plt.matshow(mask.reshape(1,-1),cmap='gray_r', fignum=False)
    plt.xlabel('Feature index')
    plt.yticks([])
    plt.title(ss,loc='left', pad=30)
 

def main():
    X_train, X_test, y_train, y_test = train_test_split(
        dsn['data'], dsn['target'], random_state=0)
    scaler=MinMaxScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    poly=PolynomialFeatures(degree=2).fit(X_train_scaled)
    X_train_poly=poly.transform(X_train_scaled)
    X_test_poly=poly.transform(X_test_scaled)
    X_train=X_train_poly
    X_test=X_test_poly
    print('Polynomial feature names:\n{}'.format(poly.get_feature_names()))

    plt.figure(figsize=(10, 4),facecolor='w')
    for iii in [1, 2, 3]:
        start=time.time()
        if iii==1:
            select = SelectPercentile(percentile=50) 
        if iii==2:
            select=SelectFromModel(
                RandomForestRegressor(n_estimators=100,random_state=42),
                threshold='median')
        if iii==3:
            select=RFE(
                RandomForestRegressor(n_estimators=100,random_state=42),
                n_features_to_select=50)
        select.fit(X_train,y_train)
        X_train_selected=select.transform(X_train)
        X_test_selected=select.transform(X_test)
        print('X_train.shape: {}'.format(X_train.shape))
        print('X_train_selected.shape: {}'.format(X_train_selected.shape))
        reg=LinearRegression().fit(X_train_selected, y_train)
        sc_tra=reg.score(X_train_selected, y_train)
        sc_tes=reg.score(X_test_selected, y_test)
        dtime=time.time()-start
        print('* LinearRegression')
        print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
        print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
        print('{0:25s}: {1:.3f}'.format('Calculation time',dtime))
        drawfig(select, iii)

    plt.tight_layout()
    fnameF='fig_boston_4.jpg'
    plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
    plt.show()
        
        
#==============
# Execution
#==============
if __name__ == '__main__': main()
Polynomial feature names:
['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11', 'x0 x12', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9', 'x1 x10', 'x1 x11', 'x1 x12', 'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12', 'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12', 'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x4 x9', 'x4 x10', 'x4 x11', 'x4 x12', 'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x5 x9', 'x5 x10', 'x5 x11', 'x5 x12', 'x6^2', 'x6 x7', 'x6 x8', 'x6 x9', 'x6 x10', 'x6 x11', 'x6 x12', 'x7^2', 'x7 x8', 'x7 x9', 'x7 x10', 'x7 x11', 'x7 x12', 'x8^2', 'x8 x9', 'x8 x10', 'x8 x11', 'x8 x12', 'x9^2', 'x9 x10', 'x9 x11', 'x9 x12', 'x10^2', 'x10 x11', 'x10 x12', 'x11^2', 'x11 x12', 'x12^2']
X_train.shape: (379, 105)
X_train_selected.shape: (379, 52)
* LinearRegression
Training set score       : 0.899
Test set score           : 0.747
Calculation time         : 0.018
[False  True False  True False  True  True  True False  True  True False
  True  True  True False  True False  True  True  True  True  True  True
  True  True  True False False  True False False False False False False
 False False False False False  True False  True False  True  True  True
 False  True False False False False  True False False False False False
 False  True  True False  True  True  True False  True  True False False
 False False False  True  True  True False  True  True  True False  True
 False False False False False False  True  True  True False  True  True
  True False  True False False  True False  True  True]

X_train.shape: (379, 105)
X_train_selected.shape: (379, 53)
* LinearRegression
Training set score       : 0.924
Test set score           : 0.770
Calculation time         : 0.932
[False  True False False False False  True  True  True False False False
 False  True False False  True False  True  True False  True  True  True
  True  True  True False False False False False False False False False
 False False False False False  True  True False  True  True  True  True
  True  True False False False False False False False False False False
 False False  True  True False  True  True False  True  True False  True
  True  True  True  True False False  True  True  True  True False  True
  True  True  True  True  True  True False False False  True  True False
  True  True  True False  True  True False  True  True]
X_train.shape: (379, 105)
X_train_selected.shape: (379, 50)
* LinearRegression
Training set score       : 0.923
Test set score           : 0.772
Calculation time         : 42.644
[False False False False False False  True False False False False False
  True  True False  True  True False  True  True False  True  True  True
  True False  True False False False False False False False False False
 False False False False False  True False False  True False  True  True
  True  True False False False False False False False False False False
 False False  True  True False  True  True False  True  True  True False
  True  True  True  True  True  True  True  True  True  True False  True
 False False  True  True  True  True False False False  True  True False
  True  True  True  True  True  True False  True  True]

In the figure below, the indices of the selected features are shown in black. Note that with these matshow-based plots, subplot has no effect unless fignum=False is passed.
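As a minimal standalone illustration of that gotcha (not part of the original code; the mask here is a dummy): with fignum=False, matshow draws into the current subplot axes, whereas without it, matshow would open a brand-new figure of its own.

import numpy as np
import matplotlib.pyplot as plt

mask = np.arange(20) % 3 == 0          # dummy boolean mask, for illustration only
plt.figure(figsize=(8, 2))
plt.subplot(211)
plt.matshow(mask.reshape(1, -1), cmap='gray_r', fignum=False)   # stays in this subplot
plt.yticks([])
plt.subplot(212)
plt.matshow((~mask).reshape(1, -1), cmap='gray_r', fignum=False)
plt.yticks([])
plt.show()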

fig_boston_4.jpg

The processing times and the scores using LinearRegression are as follows.

Feature selection method   Processing time   Training set score   Test set score
SelectPercentile           0.018 sec         0.899                0.747
SelectFromModel            0.932 sec         0.924                0.770
RFE                        42.644 sec        0.923                0.772

(2) Since RFE takes a long time, try SelectFromModel

RFE takes a long time to run, so I try SelectFromModel instead; the trouble is that it is not obvious what to pass as threshold. So I experimented with several threshold values.

For the base model I use RandomForestRegressor, and I plot the feature importances.
The threshold value 'mean' refers to the mean of the importances; if you want to select more features, pass a smaller threshold such as '0.1*mean'.
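A minimal sketch of what those threshold strings mean (my own check, using the same split and seed as the code below but skipping the scaling for brevity): a feature is kept when its importance is at least the given multiple of the mean importance.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

dsn = load_boston()
X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
select = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=42),
    threshold='0.5*mean')
select.fit(X_train, y_train)

# recompute the mask by hand from the fitted forest's importances
imp = select.estimator_.feature_importances_
manual_mask = imp >= 0.5 * imp.mean()
print(np.array_equal(manual_mask, select.get_support()))   # expected: True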

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
import time


def plot_bar(model, X_train, y_train):
    reg=model.fit(X_train, y_train)
    fsz=14
    plt.figure(figsize=(10,6), facecolor='w')
    plt.rcParams['font.size']=fsz
    plt.rcParams['font.family']='sans-serif'
    n_features = X_train.shape[1]
    plt.barh(range(n_features), reg.feature_importances_, align='center')
    ave=np.mean(reg.feature_importances_)
    ymin=-0.5
    ymax=n_features
    plt.ylim(ymin,ymax)
    ys=2
    for r in [0.2, 0.5, 1.0]:
        plt.plot([r*ave, r*ave], [ymin,ymax],'-',color='#ff0000')
        ss='{}*mean'.format(r)
        plt.text(r*ave, ys, ss, va='center', ha='left', rotation=90, color='#ff0000')
    plt.xlabel('Feature importance')
    plt.ylabel('Feature index')
    fnameF='fig_boston_5-1.jpg'
    plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
    plt.show()


def drawfig(select, ssth, iii):
    ss='* threshold=\''+ssth+'\''
    mask=select.get_support()
    nplt=511+iii
    plt.subplot(nplt)
    plt.matshow(mask.reshape(1,-1),cmap='gray_r', fignum=False)
    plt.xlabel('Feature index')
    plt.yticks([])
    plt.title(ss,loc='left', pad=30)


def main():
    X_train, X_test, y_train, y_test = train_test_split(
        dsn['data'], dsn['target'], random_state=0)
    scaler=MinMaxScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    X_train=X_train_scaled
    X_test=X_test_scaled

    model=RandomForestRegressor(n_estimators=100,random_state=42)
    plot_bar(model, X_train, y_train)

    plt.figure(figsize=(10, 8),facecolor='w')
    thlist=['median', '0.1*mean', '0.2*mean', '0.5*mean', 'mean']
    for iii, ssth in enumerate(thlist):
        start=time.time()
        select=SelectFromModel(model, threshold=ssth)
        select.fit(X_train,y_train)
        X_train_selected=select.transform(X_train)
        X_test_selected=select.transform(X_test)
        print('X_train.shape: {}'.format(X_train.shape))
        print('X_train_selected.shape: {}'.format(X_train_selected.shape))
        reg=LinearRegression().fit(X_train_selected, y_train)
        sc_tra=reg.score(X_train_selected, y_train)
        sc_tes=reg.score(X_test_selected, y_test)
        dtime=time.time()-start
        print('* LinearRegression')
        print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
        print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
        print('{0:25s}: {1:.3f}'.format('Calculation time',dtime))
        drawfig(select, ssth, iii)

    plt.tight_layout()
    fnameF='fig_boston_5-2.jpg'
    plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
    plt.show()

#==============
# Execution
#==============
if __name__ == '__main__': main()

fig_boston_5-1.jpg

Compare the upper figure with the lower one.
In the upper figure, the features exceeding '1.0*mean' are those at indices 5 and 12, and in the lower figure threshold='mean' likewise selects the features at indices 5 and 12.
Furthermore, the features exceeding '0.2*mean' in the upper figure are those at indices 0, 4, 5, 7, 9, 10, and 12, and you can see that threshold='0.2*mean' in the lower figure selects the features at the same indices.

fig_boston_5-2.jpg

(3) The case including polynomial features

def plot_bar(model, X_train, y_train):
    reg=model.fit(X_train, y_train)
    fsz=14
    plt.figure(figsize=(10,12), facecolor='w')
    plt.rcParams['font.size']=fsz
    plt.rcParams['font.family']='sans-serif'
    n_features = X_train.shape[1]
    plt.barh(range(n_features), reg.feature_importances_, align='center')
    ave=np.mean(reg.feature_importances_)
    ymin=-0.5
    ymax=n_features
    plt.ylim(ymin,ymax)
    ys=30
    for r in [0.5, 1.0]:
        plt.plot([r*ave, r*ave], [ymin,ymax],'-',color='#ff0000')
        ss='{}*mean'.format(r)
        plt.text(r*ave, ys, ss, va='center', ha='left', rotation=90, color='#ff0000')
    plt.xlabel('Feature importance')
    plt.ylabel('Feature index')
    fnameF='fig_boston_6-1.jpg'
    plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
    plt.show()

    
def drawfig(select, ssth, iii):
    ss='* threshold=\''+ssth+'\''
    mask=select.get_support()
    nplt=511+iii
    plt.subplot(nplt)
    plt.matshow(mask.reshape(1,-1),cmap='gray_r', fignum=False)
    plt.xlabel('Feature index')
    plt.yticks([])
    plt.title(ss,loc='left', pad=30)


def main():
    X_train, X_test, y_train, y_test = train_test_split(
        dsn['data'], dsn['target'], random_state=0)
    scaler=MinMaxScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    poly=PolynomialFeatures(degree=2).fit(X_train_scaled)
    X_train_poly=poly.transform(X_train_scaled)
    X_test_poly=poly.transform(X_test_scaled)
    X_train=X_train_poly
    X_test=X_test_poly
    #print('Polynomial feature names:\n{}'.format(poly.get_feature_names()))

    model=RandomForestRegressor(n_estimators=100,random_state=42)
    plot_bar(model, X_train, y_train)
    
    plt.figure(figsize=(10,6), facecolor='w')
    thlist=['median', '0.1*mean', '0.2*mean', '0.5*mean', 'mean']
    for iii, ssth in enumerate(thlist):
        select=SelectFromModel(model, threshold=ssth)
        select.fit(X_train,y_train)    
        X_train_selected=select.transform(X_train)
        X_test_selected=select.transform(X_test)
        print('X_train.shape: {}'.format(X_train.shape))
        print('X_train_selected.shape: {}'.format(X_train_selected.shape))
        reg=LinearRegression().fit(X_train_selected, y_train)
        sc_tra=reg.score(X_train_selected, y_train)
        sc_tes=reg.score(X_test_selected, y_test)
        print('* LinearRegression')
        print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
        print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
        drawfig(select, ssth, iii)

    plt.tight_layout()
    fnameF='fig_boston_6-2.jpg'
    plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
    plt.show()
    
    
#==============
# Execution
#==============
if __name__ == '__main__': main()

As shown in the printout in section 6 above, the features including the polynomial terms are ordered as follows (repeated here).

['1', 
'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 
'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11', 'x0 x12', 
'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9', 'x1 x10', 'x1 x11', 'x1 x12', 
'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12', 
'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12', 
'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x4 x9', 'x4 x10', 'x4 x11', 'x4 x12', 
'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x5 x9', 'x5 x10', 'x5 x11', 'x5 x12', 
'x6^2', 'x6 x7', 'x6 x8', 'x6 x9', 'x6 x10', 'x6 x11', 'x6 x12', 
'x7^2', 'x7 x8', 'x7 x9', 'x7 x10', 'x7 x11', 'x7 x12', 
'x8^2', 'x8 x9', 'x8 x10', 'x8 x11', 'x8 x12', 
'x9^2', 'x9 x10', 'x9 x11', 'x9 x12', 
'x10^2', 'x10 x11', 'x10 x12', 
'x11^2', 'x11 x12', 
'x12^2']

That is, after the constant term and the 13 original features, everything from index 14 onward is the polynomial part. In the feature-importance figure shown below, the only original features exceeding 'mean' are index 6 (x5) and index 13 (x12); all the others above that line are among the added polynomial features. In other words, for this model the added polynomial features contribute meaningfully to prediction accuracy.
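To see which terms those are by name rather than by index, the boolean mask can be applied to the polynomial feature names. A short self-contained sketch (my own, rebuilding the same preprocessing with threshold='mean'):

import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

dsn = load_boston()
X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
X_train = MinMaxScaler().fit(X_train).transform(X_train)
poly = PolynomialFeatures(degree=2).fit(X_train)
X_train_poly = poly.transform(X_train)

select = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=42),
    threshold='mean').fit(X_train_poly, y_train)

# map the boolean mask back to the polynomial feature names
names = np.array(poly.get_feature_names())   # ['1', 'x0', ..., 'x12^2']
print(names[select.get_support()])           # should include 'x5' and 'x12'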

fig_boston_6-1.jpg

fig_boston_6-2.jpg

The table below summarizes, for each threshold value, the number of selected features together with the training and test set scores. The scores are computed with linear regression (LinearRegression).

threshold   Original features   Selected features   Training set score   Test set score
median      105                 53                  0.924                0.770
0.1*mean    105                 62                  0.929                0.775
0.2*mean    105                 36                  0.904                0.764
0.5*mean    105                 14                  0.840                0.710
mean        105                 12                  0.826                0.694

Narrowing the features down too aggressively degrades the regression accuracy, so using threshold='median', which keeps about half of the features, seems like a safe choice.
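If one settles on threshold='median', the whole preprocessing-plus-regression chain can be written compactly with the Pipeline that appears in section 7 below (a minimal sketch under the same split and seeds as above; the step names are my own):

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

dsn = load_boston()
X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)

# scale -> polynomial features -> keep the top half by importance -> linear regression
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('select', SelectFromModel(
        RandomForestRegressor(n_estimators=100, random_state=42),
        threshold='median')),
    ('reg', LinearRegression()),
])
pipe.fit(X_train, y_train)
print('Test set score: {:.3f}'.format(pipe.score(X_test, y_test)))  # should match the 'median' row (~0.770)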

7. Measure SVR performance with a grid search

Among the results shown in section 5, the SVR with the rbf kernel (default parameters) fared rather poorly, so I measured SVR's performance with a grid search.
The processing flow is as follows.

a. Split into training and test data (using train_test_split)
b. Scale the data (using MinMaxScaler)
c. Add polynomial features (using PolynomialFeatures)
d. Model-based feature selection (using SelectFromModel)
e. Search for the best parameters with a cross-validated grid search (using GridSearchCV)
f. Create a heatmap of the mean cross-validation scores over C and gamma (function: def hmap(df))
g. Create a correlation plot of the best-parameter predictions against the test data (function: def drawfig(y_test, y_pred, ss))

Step e. above can be written concisely thanks to GridSearchCV. The preprocessing steps a.-d., or the main processing a.-f., can be simplified further by putting them into a Pipeline. In the code shown here, every library used is imported explicitly (purely for peace of mind).

I ran the code for cases (1) through (4) below and measured the processing times. The results are as follows. Case (1) does not include the plotting time.

Case   Preprocessing          Grid search              Processing time
(1)    hand-written           GridSearchCV             6.455 sec
(2)    hand-written           hand-written for loops   5.610 sec
(3)    Pipeline (all steps)   GridSearchCV             109.254 sec
(4)    Pipeline               GridSearchCV             6.454 sec

What the table above says:

  • Case (2) has the longest code but is the fastest.
  • Case (3) stuffs everything into the Pipeline. The code is the most concise, but it is very slow.
  • Among the cases compared, Case (4), which puts only the cluttered preprocessing into a Pipeline, may be the most practical.

(1) Using GridSearchCV (with plotting routines)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
import time


def hmap(df):
    sns.heatmap(df, annot=True, fmt='.3f', cmap='viridis')
    plt.xlabel('gamma')
    plt.ylabel('C')
    plt.yticks(rotation=0)
    fnameF='fig_boston_7-1.jpg'
    plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
    plt.show()


def drawfig(y_test, y_pred, ss):
    fsz=12
    xmin=0; ymin=0
    xmax=60; ymax=60
    plt.figure(figsize=(3,3), facecolor='w')
    plt.gca().set_aspect('equal',adjustable='box')
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)
    plt.xlabel('y_test')
    plt.ylabel('y_predicted')
    plt.plot(y_test, y_pred, 'o')
    plt.plot([xmin,xmax],[ymin,ymax],'-')
    xs=xmin+0.05*(xmax-xmin)
    ys=ymin+0.95*(ymax-ymin)
    plt.text(xs, ys, ss, va='top', ha='left', fontsize=fsz)
    fnameF='fig_boston_7-2.jpg'
    plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
    plt.show()
    
    
def calc(X_train, y_train, X_test, y_test):
    cc=[1e0, 1e1, 1e2, 1e3, 1e4]
    gg=[1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
    param_grid={'kernel': ['rbf'], 'C': cc, 'gamma': gg}
    grid=GridSearchCV(SVR(), param_grid, cv=5)
    grid.fit(X_train, y_train)
    test_score=grid.score(X_test, y_test)
    print('Size of training set:   {}'.format(X_train.shape[0]))
    print('Size of test set:       {}'.format(X_test.shape[0]))
    print('Best parameters: {}'.format(grid.best_params_))
    print('Best cross-validation score:        {0:.3f}'.format(grid.best_score_))  
    print('Test set score with best parameters: {0:.3f}'.format(test_score))  
    results=pd.DataFrame(grid.cv_results_)
    #print(results)
    rr=np.array(results.mean_test_score).reshape(len(cc), len(gg))
    df = pd.DataFrame(data=rr, index=cc, columns=gg, dtype='float')
    df=df.sort_index(ascending=False)
    print('* DataFrame for score heatmap')
    print(df)
    y_pred=grid.predict(X_test)
    return df, y_pred, test_score


def preproc(dsn):
    X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
    scaler=MinMaxScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    poly=PolynomialFeatures(degree=2).fit(X_train_scaled)
    X_train_poly=poly.transform(X_train_scaled)
    X_test_poly=poly.transform(X_test_scaled)
    X_train=X_train_poly
    X_test=X_test_poly
    #print('Polynomial feature names:\n{}'.format(poly.get_feature_names()))

    model=RandomForestRegressor(n_estimators=100,random_state=42)
    ssth='median'
    select=SelectFromModel(model, threshold=ssth)
    select.fit(X_train,y_train)
    X_train_selected=select.transform(X_train)
    X_test_selected=select.transform(X_test)
    print('data set data.shape:    {}'.format(dsn.data.shape))
    print('X_train.shape:          {}'.format(X_train.shape))
    print('X_train_selected.shape: {}'.format(X_train_selected.shape))
    X_train=X_train_selected
    X_test=X_test_selected
    return X_train, X_test, y_train, y_test
    

def main():
    start=time.time()
    dsn=load_boston()
    X_train, X_test, y_train, y_test = preproc(dsn)
    df, y_pred, test_score = calc(X_train, y_train, X_test, y_test)
    dtime=time.time()-start
    print('Calculation time={0:.3f}'.format(dtime))
    hmap(df)
    ss='SVR(rbf)\n'+'R$^2$={0:.3f}'.format(test_score)
    drawfig(y_test, y_pred, ss)

    
#==============
# Execution
#==============
if __name__ == '__main__': main()
data set data.shape:    (506, 13)
X_train.shape:          (379, 105)
X_train_selected.shape: (379, 53)
Size of training set:   379
Size of test set:       127
Best parameters: {'C': 100.0, 'gamma': 1.0, 'kernel': 'rbf'}
Best cross-validation score:        0.885
Test set score with best parameters: 0.827
* DataFrame for score heatmap
          0.0001    0.0010    0.0100    0.1000    1.0000    10.0000
10000.0  0.783758  0.843492  0.852278  0.818788  0.775954  0.659843
1000.0   0.582699  0.785104  0.848694  0.867736  0.832283  0.659843
100.0    0.269348  0.582686  0.795799  0.858090  0.885195  0.659843
10.0     0.022539  0.268821  0.581901  0.826387  0.863065  0.596508
1.0     -0.027188  0.022278  0.262279  0.561103  0.687804  0.209851
Calculation time=6.455

The heatmap below shows the scores obtained while searching for the best parameters. The maximum value is 0.885, which matches the "Best cross-validation score: 0.885" in the printout above. The test set score with the best parameters is 0.827, and that is the value shown in the correlation plot. It roughly agrees with the squared correlation coefficient of 0.832 in the plot below; as noted in section 5, the squared correlation can only be at least as large as the coefficient of determination, which accounts for the small gap.
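As a quick consistency check (a follow-up cell in the same Jupyter kernel, reusing the grid and df objects from the listing above):

# the heatmap's maximum is, by construction, the best mean CV score
print(df.values.max())      # 0.885...
print(grid.best_score_)     # same value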

fig_boston_7-1.jpg

fig_boston_7-2.jpg

(2) Without GridSearchCV

Next I wrote a program that performs the grid search with plain for loops instead of GridSearchCV. cross_val_score is imported for the cross-validation, and time for the timing. The plotting routines are omitted. The results are the same as in (1).

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import time


def calc(X_train, y_train, X_test, y_test):
    cc=[1e0, 1e1, 1e2, 1e3, 1e4]
    gg=[1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
    rr=np.zeros((len(cc), len(gg)), dtype=np.float64)
    best_score=0
    for i, c in enumerate(cc):
        for j, g in enumerate(gg):
            reg=SVR(kernel='rbf', C=c, gamma=g)
            scores=cross_val_score(reg, X_train, y_train, cv=5)
            score=np.mean(scores)
            rr[i, j]=score
            if best_score < score:
                best_score=score
                best_parameters={'C': c, 'gamma': g}
    reg=SVR(**best_parameters)
    reg.fit(X_train, y_train)
    test_score=reg.score(X_test, y_test) 
    print('Size of training set:        {}'.format(X_train.shape[0]))
    print('Size of test set:            {}'.format(X_test.shape[0]))
    print('Best parameters: {}'.format(best_parameters))
    print('Best cross-validation score:        {0:.3f}'.format(best_score))  
    print('Test set score with best parameters: {0:.3f}'.format(test_score))  
    df = pd.DataFrame(data=rr, index=cc, columns=gg, dtype='float')
    df=df.sort_index(ascending=False)
    print('* DataFrame for score heatmap')
    print(df)
    y_pred=reg.predict(X_test)
    return df, y_pred
  

def preproc(dsn):
    X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
    scaler=MinMaxScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    poly=PolynomialFeatures(degree=2).fit(X_train_scaled)
    X_train_poly=poly.transform(X_train_scaled)
    X_test_poly=poly.transform(X_test_scaled)
    X_train=X_train_poly
    X_test=X_test_poly
    #print('Polynomial feature names:\n{}'.format(poly.get_feature_names()))

    model=RandomForestRegressor(n_estimators=100,random_state=42)
    ssth='median'
    select=SelectFromModel(model, threshold=ssth)
    select.fit(X_train,y_train)
    X_train_selected=select.transform(X_train)
    X_test_selected=select.transform(X_test)
    print('data set data.shape:    {}'.format(dsn.data.shape))
    print('X_train.shape:          {}'.format(X_train.shape))
    print('X_train_selected.shape: {}'.format(X_train_selected.shape))
    X_train=X_train_selected
    X_test=X_test_selected
    return X_train, X_test, y_train, y_test
    

def main():
    start=time.time()
    dsn=load_boston()
    X_train, X_test, y_train, y_test = preproc(dsn)
    df, y_pred = calc(X_train, y_train, X_test, y_test)
    dtime=time.time()-start
    print('Calculation time={0:.3f}'.format(dtime))
    #hmap(df)
    #ss='SVR(rbf)'
    #drawfig(y_test, y_pred, ss)

    
#==============
# Execution
#==============
if __name__ == '__main__': main()
data set data.shape:    (506, 13)
X_train.shape:          (379, 105)
X_train_selected.shape: (379, 53)
Size of training set:        379
Size of test set:            127
Best parameters: {'C': 100.0, 'gamma': 1.0}
Best cross-validation score:        0.885
Test set score with best parameters: 0.827
* DataFrame for score heatmap
          0.0001    0.0010    0.0100    0.1000    1.0000    10.0000
10000.0  0.783802  0.843498  0.852295  0.819009  0.776205  0.659981
1000.0   0.582766  0.785148  0.848699  0.867849  0.832527  0.659981
100.0    0.269420  0.582753  0.795827  0.858097  0.885315  0.659981
10.0     0.022615  0.268892  0.581960  0.826356  0.863014  0.596576
1.0     -0.027111  0.022354  0.262348  0.561128  0.687635  0.209916
Calculation time=5.610

(3) Using a Pipeline

scikit-learn has a convenient Pipeline facility, so I stuffed everything from the preprocessing through the regression into one and grid-searched over it. The code is simpler, but it is really slow! The results also differ subtly from the other cases. Both effects have the same cause: GridSearchCV refits the whole pipeline for every parameter combination and every fold, so the scaler, the polynomial expansion, and the random-forest-based selector are recomputed on each training fold (hence the run time), and each fold can end up with a slightly different feature subset (hence the slightly different scores).
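A back-of-envelope count (my own, not in the original article) shows where the time goes:

# Why Case (3) is slow: GridSearchCV refits the entire pipeline (scaler,
# polynomial expansion, random-forest-based selector, SVR) for every
# parameter combination and every fold, plus one final refit.
n_candidates = 5 * 6    # len(cc) * len(gg)
n_folds = 5
print(n_candidates * n_folds + 1)   # 151 full pipeline fits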

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import time

    
def main():
    start=time.time()
    dsn=load_boston()
    X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
    print('data set data.shape:    {}'.format(dsn.data.shape))
    print('X_train.shape:          {}'.format(X_train.shape))
    model=RandomForestRegressor(n_estimators=100,random_state=42)
    pipe=Pipeline([
        ('scaler', MinMaxScaler()),
        ('poly', PolynomialFeatures(degree=2)),
        ('select', SelectFromModel(model, threshold='median')),
        ('reg', SVR())
    ])    
    cc=[1e0, 1e1, 1e2, 1e3, 1e4]
    gg=[1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
    param_grid={'reg__kernel': ['rbf'], 'reg__C': cc, 'reg__gamma': gg}
    grid=GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(X_train, y_train)
    test_score=grid.score(X_test, y_test)
    print('Best parameters: {}'.format(grid.best_params_))
    print('Best cross-validation score:        {0:.3f}'.format(grid.best_score_))  
    print('Test set score with best parameters: {0:.3f}'.format(test_score))  
    results=pd.DataFrame(grid.cv_results_)
    #print(results)
    rr=np.array(results.mean_test_score).reshape(len(cc), len(gg))
    df = pd.DataFrame(data=rr, index=cc, columns=gg, dtype='float')
    df=df.sort_index(ascending=False)
    print('* DataFrame for score heatmap')
    print(df)
    y_pred=grid.predict(X_test)
    dtime=time.time()-start
    print('Calculation time={0:.3f}'.format(dtime))
    #print(grid.best_estimator_.named_steps['scaler'])
    #print(grid.best_estimator_.named_steps['poly'])
    #print(grid.best_estimator_.named_steps['select'])
    #print(grid.best_estimator_.named_steps['reg'])
    
    
#==============
# Execution
#==============
if __name__ == '__main__': main()
data set data.shape:    (506, 13)
X_train.shape:          (379, 13)
Best parameters: {'reg__C': 100.0, 'reg__gamma': 1.0, 'reg__kernel': 'rbf'}
Best cross-validation score:        0.869
Test set score with best parameters: 0.827
* DataFrame for score heatmap
          0.0001    0.0010    0.0100    0.1000    1.0000    10.0000
10000.0  0.789522  0.836473  0.839120  0.643391  0.765627  0.648995
1000.0   0.598104  0.790297  0.845290  0.854108  0.818415  0.648995
100.0    0.284321  0.598198  0.801653  0.853009  0.868597  0.648995
10.0     0.025757  0.283716  0.597943  0.821156  0.851981  0.598107
1.0     -0.026828  0.025447  0.276398  0.573044  0.684257  0.210990
Calculation time=109.254

(4) Preprocessing Pipeline + GridSearchCV

Since putting everything into the Pipeline is very slow, here only the cluttered preprocessing goes into a Pipeline and the grid search is done with GridSearchCV. The speed is about the same as (1), and the results are identical to (1).

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import time


def main():
    start=time.time()
    dsn=load_boston()
    X_train, X_test, y_train, y_test = train_test_split(dsn.data, dsn.target, random_state=0)
    print('data set data.shape:    {}'.format(dsn.data.shape))
    print('X_train.shape:          {}'.format(X_train.shape))
    model=RandomForestRegressor(n_estimators=100,random_state=42)
    pipe=Pipeline([
        ('scaler', MinMaxScaler()),
        ('poly', PolynomialFeatures(degree=2)),
        ('select', SelectFromModel(model, threshold='median'))
    ])  
    pipe.fit(X_train, y_train)
    X_train=pipe.transform(X_train)
    X_test=pipe.transform(X_test)
 
    cc=[1e0, 1e1, 1e2, 1e3, 1e4]
    gg=[1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
    param_grid={'kernel': ['rbf'], 'C': cc, 'gamma': gg}
    grid=GridSearchCV(SVR(), param_grid, cv=5)
    grid.fit(X_train, y_train)
    test_score=grid.score(X_test, y_test)
    print('Size of training set:   {}'.format(X_train.shape))
    print('Size of test set:       {}'.format(X_test.shape))
    print('Best parameters: {}'.format(grid.best_params_))
    print('Best cross-validation score:        {0:.3f}'.format(grid.best_score_))  
    print('Test set score with best parameters: {0:.3f}'.format(test_score))  
    results=pd.DataFrame(grid.cv_results_)
    #print(results)
    rr=np.array(results.mean_test_score).reshape(len(cc), len(gg))
    df = pd.DataFrame(data=rr, index=cc, columns=gg, dtype='float')
    df=df.sort_index(ascending=False)
    print('* DataFrame for score heatmap')
    print(df)
    y_pred=grid.predict(X_test)
    dtime=time.time()-start
    print('Calculation time={0:.3f}'.format(dtime))

    
#==============
# Execution
#==============
if __name__ == '__main__': main()
data set data.shape:    (506, 13)
X_train.shape:          (379, 13)
Size of training set:   (379, 53)
Size of test set:       (127, 53)
Best parameters: {'C': 100.0, 'gamma': 1.0, 'kernel': 'rbf'}
Best cross-validation score:        0.885
Test set score with best parameters: 0.827
* DataFrame for score heatmap
          0.0001    0.0010    0.0100    0.1000    1.0000    10.0000
10000.0  0.783758  0.843492  0.852278  0.818788  0.775954  0.659843
1000.0   0.582699  0.785104  0.848694  0.867736  0.832283  0.659843
100.0    0.269348  0.582686  0.795799  0.858090  0.885195  0.659843
10.0     0.022539  0.268821  0.581901  0.826387  0.863065  0.596508
1.0     -0.027188  0.022278  0.262279  0.561103  0.687804  0.209851
Calculation time=6.454

That's all.
