
Tuning XGBoost hyperparameters with Ax, Facebook's open-source optimization library

Last updated at 2021-07-10. Posted at 2021-07-10.

This article is brought to you by a data scientist with a manufacturing background.
This time, I tuned XGBoost hyperparameters with Ax, Facebook's open-source optimization library.

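Introduction

I have written about gradient boosted trees before, so please refer to that article for background.

Implementing gradient boosting decision trees (XGBoost, LightGBM, CatBoost)

I have also written about hyperparameter tuning with Optuna, so please refer to that article as well.

Using Optuna with gradient boosting decision trees (XGBoost, LightGBM, CatBoost)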

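What is Ax?

Ax is an open-source optimization library from Facebook (official GitHub page).
Ax can tune hyperparameters with Bayesian optimization (GP-EI).
For comparison, Optuna tunes hyperparameters with TPE, a different Bayesian optimization method.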

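Overview of Bayesian optimization (GP-EI)

GP-EI models the objective function with a Gaussian process (GP) and selects the point that maximizes the expected improvement (EI) of the evaluation value.
The algorithm is outlined below.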

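1. Predict the objective function with a Gaussian process (GP)
2. Compute the expected improvement (EI) and find the point that maximizes it
3. Evaluate the point obtained in step 2
4. Repeat steps 1 to 3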
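
To make the EI step more concrete, here is a minimal sketch of the textbook closed-form expected improvement for a minimization problem, given a Gaussian prediction with mean `mu` and standard deviation `sigma`. The function name and the candidate values are made up for illustration, and Ax's own acquisition function may differ (for example, by accounting for observation noise).

```python
import math


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EI for minimization under a Gaussian prediction N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (best - mu) * cdf + sigma * pdf


# Hypothetical GP predictions (mean, std) at three candidate points.
candidates = {"a": (3.0, 0.1), "b": (3.2, 1.0), "c": (2.9, 0.0)}
best_so_far = 3.0  # best (lowest) objective value observed so far

ei_scores = {
    name: expected_improvement(mu, sigma, best_so_far)
    for name, (mu, sigma) in candidates.items()
}
# The next point to evaluate is the candidate with the largest EI.
next_point = max(ei_scores, key=ei_scores.get)
print(next_point)
```

In this toy example candidate "b" wins even though its predicted mean is worse than the current best, because its large uncertainty leaves room for a big improvement. This is how EI balances exploration against exploitation.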

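A detailed explanation of each method is omitted here.
The material below covers the details and was a helpful reference.

Reference: 機械学習におけるハイパーパラメータ最適化の理論と実績 (Theory and Practice of Hyperparameter Optimization in Machine Learning)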

Hyperparameter tuning with Ax

This time, I tune XGBoost with Ax.
As in the previous articles, the prediction model is built on the Boston housing price data published in the UCI Machine Learning Repository.

# Import libraries
import os

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
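from ax.utils.notebook.plotting import render, init_notebook_plotting

%matplotlib inline
# Enable Plotly rendering in the notebook so that render() can display Ax plots
init_notebook_plotting()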

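# Boston housing price data
# (load_boston was removed in scikit-learn 1.2, so this import needs an older version)
from sklearn.datasets import load_boston

# Preprocessing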
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# XGBoost
import xgboost as xgb

# Ax
import ax
from ax import (
    ParameterType,
    RangeParameter,
    SearchSpace,
    SimpleExperiment,
    modelbridge,
    models,
)

# Evaluation metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=pd.core.common.SettingWithCopyWarning)

# Load the dataset
boston = load_boston()

# Store the explanatory variables
df = pd.DataFrame(boston.data, columns = boston.feature_names)
# Add the target variable
df['MEDV'] = boston.target

# Inspect the data
df.head()

The version of the Ax library used here is shown below.

print(ax.__version__)

# 0.2.0

Next, fix the random seed and split the dataset.

# Random seed
RANDOM_STATE = 10

# Fraction of the data held out for evaluation
TEST_SIZE = 0.2

# Create the training and test data
x_train, x_test, y_train, y_test = train_test_split(
    df.iloc[:, 0 : df.shape[1] - 1],
    df.iloc[:, df.shape[1] - 1],
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
)

# Use 20% of the training set as validation data during model fitting
x_train, x_valid, y_train, y_valid = train_test_split(
    x_train, y_train, test_size=TEST_SIZE, random_state=RANDOM_STATE
)
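
As a quick sanity check on the two-stage split, the sketch below works out how many rows end up in each subset. It assumes the Boston dataset's 506 rows and assumes that the held-out share is rounded up, which is my understanding of how `train_test_split` treats a fractional `test_size`. The helper `split_sizes` is hypothetical and not part of scikit-learn.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test), rounding the held-out share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


N_ROWS = 506  # number of rows in the Boston housing dataset

# First split: hold out 20% of all rows as test data.
n_trainval, n_test = split_sizes(N_ROWS, 0.2)
# Second split: hold out 20% of the remaining rows as validation data.
n_train, n_valid = split_sizes(n_trainval, 0.2)

print(n_train, n_valid, n_test)  # 323 81 102
```

Under these assumptions, roughly 64% of the data is used for training, 16% for validation, and 20% for the final test.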

Set the ranges of the hyperparameters to search.
See the official documentation for details on each parameter.

search_space = SearchSpace(
    parameters=[
        RangeParameter(
            name="eta",
            parameter_type=ParameterType.FLOAT,
            lower=1e-8,
            upper=1.0,
            log_scale=True,
        ),
        RangeParameter(
            name="gamma",
            parameter_type=ParameterType.FLOAT,
            lower=1e-8,
            upper=1.0,
            log_scale=True,
        ),
        RangeParameter(
            name="max_depth",
            parameter_type=ParameterType.INT,
            lower=3,
            upper=8,
            log_scale=False,
        ),
        RangeParameter(
            name="min_child_weight",
            parameter_type=ParameterType.FLOAT,
            lower=1,
            upper=60,
            log_scale=True,
        ),
        RangeParameter(
            name="max_delta_step",
            parameter_type=ParameterType.FLOAT,
            lower=1e-8,
            upper=1.0,
            log_scale=True,
        ),
        RangeParameter(
            name="subsample",
            parameter_type=ParameterType.FLOAT,
            # Note: XGBoost documents the valid range of subsample as (0, 1]
            lower=0.0,
            upper=1.0,
            log_scale=False,
        ),
        RangeParameter(
            name="reg_lambda",
            parameter_type=ParameterType.FLOAT,
            lower=0.0,
            upper=1000.0,
            log_scale=False,
        ),
        RangeParameter(
            name="reg_alpha",
            parameter_type=ParameterType.FLOAT,
            lower=0.0,
            upper=1000.0,
            log_scale=False,
        ),
    ]
)
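
The `log_scale=True` flag matters for parameters such as `eta` whose range spans many orders of magnitude. The sketch below illustrates the general idea of log-uniform sampling in plain Python. It is not Ax's internal implementation, and `sample_log_uniform` is a hypothetical helper written only for this example.

```python
import math
import random


def sample_log_uniform(lower: float, upper: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(lower), log(upper)]."""
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


rng = random.Random(10)
draws = [sample_log_uniform(1e-8, 1.0, rng) for _ in range(10_000)]

# 1e-4 is the midpoint of [1e-8, 1] on a log scale, so about half of the
# draws should fall below it. A plain uniform draw would almost never do so.
frac_small = sum(d < 1e-4 for d in draws) / len(draws)
print(f"share of draws below 1e-4: {frac_small:.2f}")
```

Because the logarithm of zero is undefined, a log scale cannot be used for a range whose lower bound is 0.0, which is consistent with `reg_lambda` and `reg_alpha` being searched on a linear scale above.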

Next, define a function that returns the XGBoost score.

def evaluation_function(parameterization, weight=None):
    # Build an XGBoost regressor from the parameter values proposed by Ax
    model = xgb.XGBRegressor(
        eta=parameterization["eta"],
        gamma=parameterization["gamma"],
        max_depth=parameterization["max_depth"],
        min_child_weight=parameterization["min_child_weight"],
        max_delta_step=parameterization["max_delta_step"],
        subsample=parameterization["subsample"],
        reg_lambda=parameterization["reg_lambda"],
        reg_alpha=parameterization["reg_alpha"],
    )

    # Fit with early stopping monitored on the validation set
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        early_stopping_rounds=50,
        verbose=False,
    )

    # Score on the validation set; this MAE is the value Ax minimizes
    preds = model.predict(x_valid)
    mae = mean_absolute_error(y_valid, preds)

    return mae

Create a SimpleExperiment object.

%%time
# Find the optimum with Ax
exp = SimpleExperiment(name="test_Xgboost",
                       search_space=search_space,
                       evaluation_function=evaluation_function,
                       minimize=True,
                       objective_name="mae"
                      )

# Bayesian optimization: Sobol initialization, then GP+EI
sobol = modelbridge.get_sobol(search_space=exp.search_space, seed=RANDOM_STATE)
print("Running Sobol initialization trials...")

for _ in range(5):
    exp.new_trial(generator_run=sobol.gen(1))
for i in range(50):
    print(f"Running GP+EI optimization trial {i+1}...")
    # Refit the GP on all results so far, then generate the next candidate
    gpei = modelbridge.get_GPEI(experiment=exp, data=exp.eval())
    exp.new_trial(generator_run=gpei.gen(1))

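Here, the first 5 points are chosen with a Sobol (quasi-random) sequence, followed by 50 GP+EI search trials. The parameters with the best score among all trials are taken as the best parameters.

You can inspect the search history with 'exp.eval()'.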

dat = exp.eval()
dat.df

Next, pick the parameters that achieved the best score during the search.

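# Mean objective value (MAE) of each evaluated trial
best_objectives = exp.eval().df["mean"]
bestParameter_score = np.min(best_objectives)
# Row position of the best score (assumes one row per trial, in trial order)
bestParameter_trial = np.argmin(best_objectives)
bestParameter = exp.trials[bestParameter_trial].arm.parameters
bestParameter["random_state"] = RANDOM_STATE
bestParameter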

# {'eta': 0.6504156283754524,
#  'gamma': 5.2927570057348026e-08,
#  'max_depth': 6,
#  'min_child_weight': 33.468526469131646,
#  'max_delta_step': 1.0,
#  'subsample': 0.8434249499376619,
#  'reg_lambda': 29.866681136050957,
#  'reg_alpha': 0.0,
#  'random_state': 10}
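
The lookup above treats the row position of the minimum as the trial index, which works when the results table has exactly one row per trial in trial order. As a hedged alternative, the sketch below selects the best trial by an explicitly recorded trial index instead. The rows here are made-up values in a hypothetical layout, not actual Ax output.

```python
# Hypothetical results: one row per trial, deliberately not in trial order.
rows = [
    {"trial_index": 2, "mean": 3.7},
    {"trial_index": 3, "mean": 3.1},
    {"trial_index": 0, "mean": 3.9},
    {"trial_index": 1, "mean": 3.4},
]

# Pick the row with the lowest MAE, then read its recorded trial index
# rather than relying on the row's position in the table.
best_row = min(rows, key=lambda row: row["mean"])
best_trial_index = best_row["trial_index"]
best_mae = best_row["mean"]

print(best_trial_index, best_mae)  # 3 3.1
```

Here the best row sits at position 1 but belongs to trial 3, so a position-based lookup would return the wrong trial if the table were ever reordered or filtered.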

Finally, train a model with the tuned hyperparameters and make predictions.

# Build a model with the tuned hyperparameters
optimised_model = xgb.XGBRegressor(**(bestParameter))

optimised_model.fit(x_train, y_train)

# Predict with XGBoost
y_pred = optimised_model.predict(x_test)

# Evaluation
def calculate_scores(true, pred):
    """Compute all evaluation metrics.

    Parameters
    ----------
    true (np.array)       : observed values
    pred (np.array)       : predicted values

    Returns
    -------
    scores (pd.DataFrame) : summary of each evaluation metric

    """
    scores = pd.DataFrame(
        {
            "R2": r2_score(true, pred),
            "MAE": mean_absolute_error(true, pred),
            "MSE": mean_squared_error(true, pred),
            "RMSE": np.sqrt(mean_squared_error(true, pred)),
        },
        index=["scores"],
    )
    return scores

scores = calculate_scores(y_test, y_pred)
print(scores)

The output is as follows.

              R2       MAE        MSE      RMSE
scores  0.826878  3.116676  18.105243  4.255026
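
For readers who want to see what these four numbers mean, here is a small sketch that computes them from their standard definitions in plain Python. The function name and the input values are made up for illustration and are unrelated to the Boston data.

```python
import math


def regression_scores(true, pred):
    """Compute R2, MAE, MSE and RMSE from their standard definitions."""
    n = len(true)
    mean_true = sum(true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in true)        # total sum of squares
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(t - p) for t, p in zip(true, pred)) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


# Made-up values for illustration only.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]
scores_demo = regression_scores(y_true_demo, y_pred_demo)
print(scores_demo)
```

RMSE is simply the square root of MSE, which can be checked against the table above: the square root of 18.105243 is approximately 4.255026.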

Ax can also visualize the search process.
First, plot the best score so far (vertical axis) against the number of trials (horizontal axis).

from ax.plot.trace import optimization_trace_single_method

best_objectives = np.array([[trial.objective_mean for trial in exp.trials.values()]])
best_objective_plot = optimization_trace_single_method(
    y=np.minimum.accumulate(best_objectives, axis=1), ylabel="mae",
)
render(best_objective_plot)
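
The `np.minimum.accumulate` call is what turns the per-trial scores into a "best so far" curve. A minimal sketch of the same running minimum in plain Python, using made-up MAE values:

```python
from itertools import accumulate

# Hypothetical per-trial MAE values, in the order the trials were run.
trial_mae = [4.1, 3.6, 3.9, 3.2, 3.5, 3.2, 3.0]

# Running minimum: the best MAE seen up to and including each trial.
best_so_far = list(accumulate(trial_mae, min))
print(best_so_far)  # [4.1, 3.6, 3.6, 3.2, 3.2, 3.2, 3.0]
```

The resulting curve can only stay flat or go down, so a long flat stretch indicates that the search has stopped finding better parameters.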

Next, draw a contour plot of the predicted mean and standard deviation over two parameter axes.

from ax.plot.contour import interact_contour

# Predicted mean and standard deviation plotted over two parameter axes
render(interact_contour(model=gpei, metric_name="mae"))

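The plot shows that the search concentrates on regions where the evaluation metric is good.

Finally, select a single parameter and plot the predicted mean and standard deviation of the score along it.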

from ax.plot.slice import plot_slice

# Predicted mean and standard deviation of the score along one selected parameter
render(plot_slice(gpei, "reg_lambda", "mae"))

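Conclusion

Thank you for reading to the end.
This time, I tuned XGBoost hyperparameters with Ax, Facebook's open-source optimization library.
Optuna was good, but Ax was also easy to use. Because the two use different Bayesian optimization methods, I think Ax is useful when you want to compare approaches.

If you have any corrections, I would appreciate it if you let me know.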
