
Automatic hyperparameter optimization with catboost and Optuna

Posted at 2019-02-04

Automatic hyperparameter optimization

Last time, I ran a regression analysis with catboost.
https://qiita.com/shin_mura/items/3d9ce25a60bdd25a3333

This time, to push the accuracy higher, I try automatic hyperparameter optimization with Optuna.

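What is Optuna?

Optuna is a framework for hyperparameter optimization.
It is developed by Preferred Networks, a Japanese company.
It is Python-based and open source, so anyone can use it.

It is also published on GitHub, where sample code for xgboost and lightGBM is provided.
https://github.com/pfnet/optuna

No catboost sample had been posted there yet, so I try one here.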

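Implementation

Dataset

catboost ships sample datasets such as Titanic and amazon.
This time I use the Telco Customer Churn dataset published on Kaggle, which summarizes the attributes of customers, including those who cancelled their contracts.
https://www.kaggle.com/blastchar/telco-customer-churn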

Loading the data

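import pandas as pd

data = pd.read_csv("../inputs/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# TotalCharges is read as str, so convert it to float.
# convert_objects() no longer exists in current pandas; pd.to_numeric with
# errors="coerce" does the same job (unparseable values become NaN).
data.TotalCharges = pd.to_numeric(data.TotalCharges, errors="coerce")

# Convert the string labels to integers: "No" -> 0, anything else -> 1
data.Churn = data.Churn.apply(lambda x: 0 if x == "No" else 1)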
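
The string-to-float conversion above can be checked on a toy column. This is a minimal sketch: the sample values and the name `raw` are made up for illustration and are not taken from the dataset, and `pd.to_numeric` with `errors="coerce"` is just one way to turn unparseable strings into NaN.

```python
import pandas as pd

# Hypothetical column: numeric strings plus one blank entry
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" turns anything that cannot be parsed into NaN
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float dtype after conversion
print(converted.isna().sum())  # number of entries that became NaN
```

Counting the NaN values after the conversion is a quick way to see how many rows could not be parsed.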

Splitting the target and explanatory variables

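import numpy as np

# Explanatory variables: everything except the ID and the target
X = data.drop(['customerID', 'Churn'], axis=1)
# Target variable
y = data.Churn
# Treat every non-float column as categorical.
# np.float was removed from NumPy; it was an alias for the builtin float.
categorical_features_indices = np.where(X.dtypes != float)[0]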
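
To see what the index selection above produces, here is a small sketch on a made-up frame. The frame `X_demo` and its values are assumptions for illustration only; the point is that string and integer columns both end up in the categorical list, while float columns do not.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame: string, integer, and float columns
X_demo = pd.DataFrame({
    "gender": ["Female", "Male"],
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
})

# Positions of every column whose dtype is not float
cat_idx = np.where(X_demo.dtypes != float)[0]
print(list(X_demo.columns[cat_idx]))
```

Note that an integer column such as `tenure` is handled as a category under this rule, which may or may not be what you want for a given dataset.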

Writing the objective function for Optuna

from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool
import sklearn.metrics

def objective(trial):
    # Split the data into training and test sets
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2)
    train_pool = Pool(train_x, train_y, cat_features=categorical_features_indices)
    test_pool = Pool(test_x, test_y, cat_features=categorical_features_indices)

    # Search space for the hyperparameters.
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
    params = {
        'iterations': trial.suggest_int('iterations', 50, 300),
        'depth': trial.suggest_int('depth', 4, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'random_strength': trial.suggest_int('random_strength', 0, 100),
        'bagging_temperature': trial.suggest_float('bagging_temperature', 0.01, 100.00, log=True),
        'od_type': trial.suggest_categorical('od_type', ['IncToDec', 'Iter']),
        'od_wait': trial.suggest_int('od_wait', 10, 50)
    }

    # Train
    model = CatBoostClassifier(**params)
    model.fit(train_pool)
    # Predict
    preds = model.predict(test_pool)
    pred_labels = np.rint(preds)
    # Compute accuracy and return the error rate (lower is better)
    accuracy = sklearn.metrics.accuracy_score(test_y, pred_labels)
    return 1.0 - accuracy

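* By default Optuna minimizes the objective value, as with MSE or RMSE, where lower is better. Accuracy therefore needs a small adjustment: the objective returns 1.0 - accuracy.
* Parameters that can only be set when validation data is supplied, such as use_best_model, are left out.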

Running Optuna

import optuna
if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(objective, n_trials=100)
    print(study.best_trial)

Checking the results

Best score

study.best_value
 0.17743080198722494

Best parameters

study.best_params
 {'iterations': 282,
  'depth': 4,
  'learning_rate': 0.17741698174722237,
  'random_strength': 92,
  'bagging_temperature': 15,
  'od_type': 'Iter',
  'od_wait': 18
}

Summary

Accuracy improved considerably compared with manual parameter tuning and grid search. When using catboost, raising accuracy the easy way with Optuna is a must.
