
Tuning model hyperparameters with scikit-learn!

Last updated at 2017-02-26, posted at 2017-02-25

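What is hyperparameter tuning?

Some parameters of a model have to be fixed in advance, before training.
(For example, the number of clusters in k-means, the regularization strength of SVC, or the depth of a decision tree.)

These are called "hyperparameters", and the trouble is that even for the same model, accuracy can change considerably depending on their values.

Hyperparameter tuning is the process of choosing good values for them by using the training data!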
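As a rough illustration of how much a single hyperparameter can matter, the sketch below scores a decision tree at a few depths with 3-fold cross-validation. The dataset (`load_breast_cancer`, bundled with scikit-learn) and the candidate depths are choices made for this example only, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Example data bundled with scikit-learn (an assumption for this sketch)
X, y = load_breast_cancer(return_X_y=True)

# max_depth is a hyperparameter: it is fixed before fit() is called
scores = {}
for depth in [1, 3, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Mean accuracy over 3 cross-validation folds
    scores[depth] = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()

for depth, score in scores.items():
    print("max_depth=%s: mean CV accuracy=%.3f" % (depth, score))
```

The printed scores typically differ between depths, which is the variation that tuning tries to exploit.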

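Grid search and random search

Of the available tuning methods, this article covers two: grid search and random search.
Roughly speaking, given a hyperparameter α, they work as follows.

・Grid search: specify a set of candidate values for α in advance (e.g. 0, 1, 2, 3, 4, 5), measure the model's accuracy with each value, and adopt the best one.

・Random search: specify a distribution for α in advance (e.g. a normal distribution with mean 0 and standard deviation 1), draw values from it at random, measure the model's accuracy with each drawn value, and adopt the best one.

As you can see, neither method picks α by guesswork.
Instead, both first fix a range or a distribution and then use the actual training data to choose the value. (See the references for more details!)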
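The two procedures can be sketched without any library. In the minimal example below, `score` is a hypothetical stand-in for "train the model with this α and measure its accuracy"; the toy objective and the sample count of 20 are assumptions made purely for illustration.

```python
import random

def score(alpha):
    """Hypothetical stand-in for 'train the model with alpha and measure accuracy'."""
    return -(alpha - 2.3) ** 2  # toy objective whose peak is at alpha = 2.3

# Grid search: evaluate every candidate in a fixed list
grid = [0, 1, 2, 3, 4, 5]
best_grid = max(grid, key=score)

# Random search: draw candidates from a distribution (normal, mean 0, std 1)
rng = random.Random(0)
samples = [rng.gauss(0, 1) for _ in range(20)]
best_random = max(samples, key=score)

print("grid search best alpha:", best_grid)
print("random search best alpha: %.3f" % best_random)
```

Grid search can only return one of the listed candidates (here 2), while random search can return any value the distribution produces, so the choice of distribution plays the role that the choice of candidate list plays for the grid.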

Python code

scikit-learn ships with both of these methods, so let's use them!
The code was written for Python 3.5.1 and scikit-learn 0.18.1.

This time we take a dataset from the UCI Machine Learning Repository and tune the parameters of a RandomForestClassifier.
The full code is uploaded to GitHub.

STEP1 Download the data from the UCI repository

Grid_and_Random_Search.ipynb

 import pandas as pd
 # The WDBC file has no header row, so the columns are numbered 0, 1, 2, ...
 df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases'
                  '/breast-cancer-wisconsin/wdbc.data', header=None)
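If the download is not available, scikit-learn bundles a copy of the Wisconsin Diagnostic Breast Cancer data as `load_breast_cancer`. The sketch below is an alternative to the original code, not what the notebook does: the bundled copy has no ID column and encodes the target as 0 (malignant) / 1 (benign) instead of the M/B labels, so the later steps would need adjusting. The names `X_alt` and `y_alt` are made up for this example.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_alt = data.data      # feature matrix, one row per sample
y_alt = data.target    # 0 = malignant, 1 = benign

print(X_alt.shape)
print(list(data.target_names))
```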

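To make things easier to follow, we rename the column to predict to Target and name every other column a followed by its column index (a0, a2, a3, ...).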

Grid_and_Random_Search.ipynb

 columns_list = [] 
 for i in range(df.shape[1]):
     columns_list.append("a%d"%i) 
 columns_list[1] = "Target" 
 df.columns = columns_list
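The same renaming can be written more compactly with a list comprehension. The sketch below applies it to a tiny made-up frame (`toy`, with invented values) so that the resulting column names are easy to see.

```python
import pandas as pd

# Tiny stand-in for the WDBC frame: ID, label, then two features (made-up values)
toy = pd.DataFrame([[101, "M", 0.5, 0.7],
                    [102, "B", 0.1, 0.2]])

# Name every column a<index>, then override the label column
columns_list = ["a%d" % i for i in range(toy.shape[1])]
columns_list[1] = "Target"
toy.columns = columns_list

print(list(toy.columns))  # ['a0', 'Target', 'a2', 'a3']
```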

STEP2 Split the data

Grid_and_Random_Search.ipynb

 y = df["Target"].values
 X = df.drop(["a0","Target"],axis=1)

Split into train data and test data

Grid_and_Random_Search.ipynb

 # Split X, y into train and test sets (0.5 : 0.5)
 from sklearn.model_selection import train_test_split

 X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.5,random_state=2017)
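One optional refinement, not used in the original notebook, is to pass `stratify` so that both halves keep the same class ratio. The sketch below shows the effect on made-up labels; `features` and `labels` are hypothetical names for this example.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels: 12 benign and 8 malignant samples (made-up data)
labels = ["B"] * 12 + ["M"] * 8
features = [[i] for i in range(len(labels))]

# stratify=labels keeps the B:M ratio the same in both halves
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.5, random_state=2017, stratify=labels)

print(Counter(l_train))
print(Counter(l_test))
```

Each half ends up with 6 B and 4 M samples, matching the 12:8 ratio of the full set.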

STEP3 Check the accuracy of the model with its default settings

Grid_and_Random_Search.ipynb

 from sklearn.ensemble import RandomForestClassifier
 from sklearn.metrics import classification_report

 def model_check(model):
     """Fit the model on the train data and print reports for train and test."""
     model.fit(X_train,y_train)
     train_report = classification_report(y_train,model.predict(X_train))
     test_report  = classification_report(y_test,model.predict(X_test))

     print("""【{model_name}】\n Train Accuracy: \n{train}
           \n Test Accuracy:  \n{test}""".format(model_name=model.__class__.__name__, train=train_report, test=test_report))

 # model_check prints the reports itself and returns None, so call it directly
 model_check(RandomForestClassifier())

Output 1 (default)

    [RandomForestClassifier]
     Train Accuracy: 
                 precision    recall  f1-score   support

              B       1.00      1.00      1.00        67
              M       1.00      1.00      1.00        75

    avg / total       1.00      1.00      1.00       142


     Test Accuracy:  
                 precision    recall  f1-score   support

              B       0.89      0.93      0.91        72
              M       0.93      0.89      0.91        70

    avg / total       0.91      0.91      0.91       142

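The accuracy is 1.00 on the train data and 0.91 on the test data (the support-weighted average recall in the report equals the accuracy).
From here on we implement grid search and random search.
The following is based on reference 3.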
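The accuracy quoted above can also be computed directly with `accuracy_score` instead of being read off the report. A minimal sketch on made-up labels (the two lists are invented for illustration):

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only
y_true = ["B", "B", "M", "M", "B"]
y_pred = ["B", "M", "M", "M", "B"]

# Fraction of predictions that match the true label: 4 of 5 here
acc = accuracy_score(y_true, y_pred)
print("accuracy: %.2f" % acc)  # accuracy: 0.80
```

In the notebook, the equivalent call would be `accuracy_score(y_test, model.predict(X_test))` on a fitted model.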

STEP4 Grid search

Grid_and_Random_Search.ipynb

 # Grid search

 from sklearn.model_selection import GridSearchCV

 # Use a full grid over all parameters
 param_grid = {"max_depth": [2,3, None],
               "n_estimators":[50,100,200,300,400,500],
               "max_features": [1, 3, 10],
               "min_samples_split": [2, 3, 10],
               "min_samples_leaf": [1, 3, 10],
               "bootstrap": [True, False],
               "criterion": ["gini", "entropy"]}

 forest_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                  param_grid = param_grid,
                  scoring="accuracy",  # evaluation metric
                  cv = 3,              # number of cross-validation folds
                  n_jobs = 1)          # number of CPU cores to use

 forest_grid.fit(X_train,y_train) # evaluate every combination in the grid

 forest_grid_best = forest_grid.best_estimator_ # best estimator
 print("Best Model Parameter: ",forest_grid.best_params_)

 # Print the train/test reports for the best estimator
 model_check(forest_grid_best)

Output 2 (grid search)

    [RandomForestClassifier]
     Train Accuracy: 
                 precision    recall  f1-score   support

              B       0.99      0.99      0.99        67
              M       0.99      0.99      0.99        75

    avg / total       0.99      0.99      0.99       142


     Test Accuracy:  
                 precision    recall  f1-score   support

              B       0.96      0.89      0.92        72
              M       0.89      0.96      0.92        70

    avg / total       0.92      0.92      0.92       142

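The averaged precision, recall, and f1-score on the test data all rose from 0.91 to 0.92, although some per-class values (such as recall for B) went down.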
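Besides the best parameters, a fitted search object also keeps the score of every combination it tried. The sketch below shows how to inspect them; it uses a deliberately tiny grid and the bundled `load_breast_cancer` data so that it runs quickly, both of which are assumptions for this example and not the settings of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Deliberately tiny grid (2 x 2 = 4 combinations) so the sketch runs quickly
small_grid = {"max_depth": [2, None], "n_estimators": [10, 20]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=small_grid, scoring="accuracy", cv=3)
search.fit(X_demo, y_demo)

# cv_results_ holds one entry per parameter combination
for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    print(params, "%.3f" % mean_score)

print("best:", search.best_params_, "%.3f" % search.best_score_)
```

Looking at the full list of scores helps judge whether the best combination is clearly better than the rest or only marginally so.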

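STEP5 Random search

In random search, we use scipy to express the distributions the parameters follow.
This time the number of iterations is set to 1944, the same as the number of parameter combinations in the grid search.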
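The iteration count can be checked by enumerating the grid from the grid-search step. The sketch below repeats those candidate lists and counts their combinations with `itertools.product`; `n_combinations` and `n_fits` are names introduced for this example.

```python
from itertools import product

# Same candidate lists as the grid-search step
param_grid = {"max_depth": [2, 3, None],
              "n_estimators": [50, 100, 200, 300, 400, 500],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Every combination of one value per parameter
n_combinations = len(list(product(*param_grid.values())))
n_fits = n_combinations * 3  # one fit per combination per CV fold (cv=3)

print(n_combinations)  # 1944
print(n_fits)          # 5832
```

With 3-fold cross-validation this comes to 5832 model fits, not counting the final refit of the best estimator, which is why both searches take a while to run.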

Grid_and_Random_Search.ipynb

 # Random search
 from sklearn.model_selection import RandomizedSearchCV
 from scipy.stats import randint as sp_randint

 param_dist = {"max_depth": [3, None],                  # lists are sampled uniformly
               "n_estimators":[50,100,200,300,400,500],
               "max_features": sp_randint(1, 11),       # integers from 1 to 10
               "min_samples_split": sp_randint(2, 11),  # integers from 2 to 10
               "min_samples_leaf": sp_randint(1, 11),   # integers from 1 to 10
               "bootstrap": [True, False],
               "criterion": ["gini", "entropy"]}

 forest_random = RandomizedSearchCV( estimator=RandomForestClassifier( random_state=0 ),
                                     param_distributions=param_dist,
                                     cv=3,               # number of cross-validation folds
                                     n_iter=1944,        # number of sampled parameter settings
                                     scoring="accuracy", # evaluation metric
                                     n_jobs=1,           # number of CPU cores to use
                                     verbose=0,
                                     random_state=1)

 # Fit on the train data only, so the test data stays unseen during tuning
 forest_random.fit(X_train,y_train)
 forest_random_best = forest_random.best_estimator_ # best estimator
 print("Best Model Parameter: ",forest_random.best_params_)

 # Print the train/test reports for the best estimator
 model_check(forest_random_best)

Output 3 (random search)

    [RandomForestClassifier]
     Train Accuracy: 
                 precision    recall  f1-score   support

              B       1.00      1.00      1.00        67
              M       1.00      1.00      1.00        75

    avg / total       1.00      1.00      1.00       142


     Test Accuracy:  
                 precision    recall  f1-score   support

              B       0.94      0.92      0.93        72
              M       0.92      0.94      0.93        70

    avg / total       0.93      0.93      0.93       142

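Compared with the default settings, the averaged precision, recall, and f1-score on the test data each rose by 2 points, from 0.91 to 0.93!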
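A practical advantage of random search is that `n_iter` sets the budget directly, independent of how many values each parameter can take. The sketch below runs only 5 sampled settings; the small distributions and the bundled `load_breast_cancer` data are assumptions chosen to keep the example fast, not the notebook's settings.

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# sp_randint(low, high) draws integers with low <= value < high
small_dist = {"max_depth": [3, None],
              "n_estimators": [10, 20],
              "min_samples_leaf": sp_randint(1, 11)}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=small_dist,
                            n_iter=5,          # only 5 sampled settings
                            cv=3, scoring="accuracy", random_state=1)
search.fit(X_demo, y_demo)

sampled = search.cv_results_["params"]
print(len(sampled))  # 5
print("best:", search.best_params_, "%.3f" % search.best_score_)
```

Raising `n_iter` trades running time for a better chance of sampling a good setting.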

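Summary

Both grid search and random search improved the accuracy!
However, the dataset chosen this time probably gives high accuracy to begin with, which made the effect hard to see.
Trying this on data where the accuracy is poor may show the effect of tuning more clearly.

The full code is uploaded to GitHub.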

References
