
Trying Out AutoML (h2o)

Posted at 2021-06-26

An article from a data scientist with a manufacturing background.
This time I tried out h2o, an AutoML library.

Introduction

I have covered other AutoML libraries and tools in separate articles, so please refer to those as well.

Trying out h2o

The required library can be installed as follows.

pip install h2o

As in my previous articles, I use the Boston housing price data published in the UCI Machine Learning Repository.

import h2o
from h2o.automl import H2OAutoML

import sys, os, os.path
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# Boston housing price data
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Evaluation metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

h2o.init(
    nthreads=-1,     # number of threads when launching a new H2O server
    max_mem_size=12  # in gigabytes
)

# Load the dataset
boston = load_boston()

# Store the explanatory variables
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# Add the target variable
df["MEDV"] = boston.target

# Random seed
RANDOM_STATE = 10

# Fraction of the data held out for evaluation
TEST_SIZE = 0.2

# Create the training and evaluation data
x_train, x_test, y_train, y_test = train_test_split(
    df.iloc[:, 0 : df.shape[1] - 1],
    df.iloc[:, df.shape[1] - 1],
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
)

train = pd.merge(x_train, y_train, left_index=True, right_index=True)
test = pd.merge(x_test, y_test, left_index=True, right_index=True)

h2o_train = h2o.H2OFrame(train)
h2o_test = h2o.H2OFrame(test)

# Specify the explanatory and target columns
# (MEDV is the last column, so every column before it is a feature)
feature_cols = h2o_train.columns[:-1]
target_cols = "MEDV"
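As a quick check on the split above, the resulting row counts can be worked out by hand. This is a minimal sketch that assumes the Boston dataset has 506 rows and that `train_test_split` rounds a fractional test size up to the next whole row, which is my understanding of scikit-learn's behavior. `split_sizes` is a hypothetical helper written for this illustration, not a library function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a holdout split with a fractional test size.

    Assumption: the test set size is rounded up and the training set
    receives whatever rows remain.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 506 is the assumed number of rows in the Boston housing data.
n_train, n_test = split_sizes(506, 0.2)
print(f"train rows: {n_train}, test rows: {n_test}")
```

If the shapes of `train` and `test` in the notebook differ from these numbers, that is a hint that the data or the split settings are not what this sketch assumes.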

The model is trained as follows.

# max_runtime_secs sets the maximum run time
aml = H2OAutoML(
    max_runtime_secs=30,  # 30 sec
    max_models=None,  # no limit
    stopping_metric="rmse",
    sort_metric="rmse",
    seed=RANDOM_STATE,
)

aml.train(
    y=target_cols,
    training_frame=h2o_train,
    )

Next, list the trained models in order of accuracy, best first.

lb = aml.leaderboard
lb.head(rows=lb.nrows)

The result looks like this.

model_id	rmse	mean_residual_deviance	mse	mae	rmsle
XGBoost_grid__1_AutoML_20210626_013754_model_2	3.19587	10.2136	10.2136	2.20228	0.151833
StackedEnsemble_AllModels_AutoML_20210626_013754	3.20798	10.2912	10.2912	2.15506	0.147123
StackedEnsemble_BestOfFamily_AutoML_20210626_013754	3.21329	10.3252	10.3252	2.17433	0.150384
XGBoost_grid__1_AutoML_20210626_013754_model_1	3.34327	11.1775	11.1775	2.21535	0.157374
GBM_grid__1_AutoML_20210626_013754_model_2	3.36674	11.3349	11.3349	2.25457	0.158063
XGBoost_grid__1_AutoML_20210626_013754_model_4	3.40357	11.5843	11.5843	2.37371	0.166969
XGBoost_grid__1_AutoML_20210626_013754_model_6	3.42693	11.7438	11.7438	2.27779	0.148145
XGBoost_grid__1_AutoML_20210626_013754_model_5	3.469	12.034	12.034	2.42595	0.165229
XGBoost_3_AutoML_20210626_013754	3.48431	12.1404	12.1404	2.3979	0.166476
XGBoost_2_AutoML_20210626_013754	3.53963	12.529	12.529	2.37003	0.163193
XGBoost_grid__1_AutoML_20210626_013754_model_3	3.73328	13.9373	13.9373	2.45021	0.161169
GLM_1_AutoML_20210626_013754	4.71116	22.195	22.195	3.16006	0.238895
XRT_1_AutoML_20210626_013754	4.88533	23.8665	23.8665	3.1366	0.220186
DeepLearning_grid__1_AutoML_20210626_013754_model_1	4.90804	24.0888	24.0888	2.92665	0.234542
DeepLearning_grid__2_AutoML_20210626_013754_model_1	5.00784	25.0785	25.0785	3.1436	0.203084
DRF_1_AutoML_20210626_013754	5.09842	25.9939	25.9939	3.29748	0.221645
DeepLearning_grid__3_AutoML_20210626_013754_model_1	5.14636	26.4851	26.4851	3.20357	0.208465
GBM_1_AutoML_20210626_013754	5.35071	28.6301	28.6301	3.75348	0.251125
GBM_2_AutoML_20210626_013754	5.40162	29.1774	29.1774	3.64102	0.242478
GBM_5_AutoML_20210626_013754	5.70855	32.5875	32.5875	3.91314	0.249996
GBM_grid__1_AutoML_20210626_013754_model_1	5.9224	35.0748	35.0748	3.90731	0.251146
DeepLearning_1_AutoML_20210626_013754	5.98339	35.801	35.801	3.9921	0.275222
GBM_3_AutoML_20210626_013754	6.51416	42.4342	42.4342	4.56239	0.295999
GBM_grid__1_AutoML_20210626_013754_model_3	6.81432	46.4349	46.4349	4.7091	0.304199
GBM_4_AutoML_20210626_013754	6.97555	48.6583	48.6583	4.90997	0.313206
XGBoost_1_AutoML_20210626_013754	10.5989	112.336	112.336	7.36687	0.68645
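The leaderboard can also be read programmatically. The sketch below copies a few rows from the table above and does two things: it checks that the `rmse` column is the square root of the `mse` column up to rounding, and it picks the best model per algorithm family. Treating the text before the first underscore of `model_id` as the family is my own assumption about the naming convention, and `family`, `is_consistent`, and `best_per_family` are hypothetical helpers, not part of h2o.

```python
import math

# A few (model_id, rmse, mse) rows copied from the leaderboard above.
ROWS = [
    ("XGBoost_grid__1_AutoML_20210626_013754_model_2", 3.19587, 10.2136),
    ("StackedEnsemble_AllModels_AutoML_20210626_013754", 3.20798, 10.2912),
    ("GBM_grid__1_AutoML_20210626_013754_model_2", 3.36674, 11.3349),
    ("GLM_1_AutoML_20210626_013754", 4.71116, 22.195),
    ("XRT_1_AutoML_20210626_013754", 4.88533, 23.8665),
    ("DeepLearning_grid__1_AutoML_20210626_013754_model_1", 4.90804, 24.0888),
    ("DRF_1_AutoML_20210626_013754", 5.09842, 25.9939),
    ("GBM_1_AutoML_20210626_013754", 5.35071, 28.6301),
    ("XGBoost_1_AutoML_20210626_013754", 10.5989, 112.336),
]


def family(model_id):
    """Return the algorithm family, assumed to be the prefix before the first '_'."""
    return model_id.split("_", 1)[0]


def is_consistent(rows, rel_tol=1e-3):
    """Check that rmse matches sqrt(mse) on every row, allowing for rounding."""
    return all(
        math.isclose(rmse, math.sqrt(mse), rel_tol=rel_tol)
        for _model_id, rmse, mse in rows
    )


def best_per_family(rows):
    """Map each family to the (model_id, rmse) pair with the lowest rmse."""
    best = {}
    for model_id, rmse, _mse in rows:
        fam = family(model_id)
        if fam not in best or rmse < best[fam][1]:
            best[fam] = (model_id, rmse)
    return best


print("rmse == sqrt(mse):", is_consistent(ROWS))
for fam, (model_id, rmse) in sorted(best_per_family(ROWS).items(), key=lambda kv: kv[1][1]):
    print(f"{fam:16s} {rmse:8.5f}  {model_id}")
```

Grouping by family makes it easier to see that the gap between the best XGBoost and GBM models is small, while the single GLM, XRT, and DRF models trail by a wider margin in this run.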

The best model can be printed as follows.

# Get the top model of leaderboard
aml.leader

Predictions for the test data can be obtained as follows.

# Predict
y_pred = aml.leader.predict(h2o_test)
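The article imports `r2_score`, `mean_absolute_error`, and `mean_squared_error` but stops at the prediction step. The sketch below shows one way the predictions could be scored on the held-out data. It uses plain Python so that the formulas are visible, and the numbers are made up for illustration. `regression_metrics` is a hypothetical helper. In the notebook the scikit-learn functions would do the same job once the predictions are converted back to pandas, for example with `y_pred.as_data_frame()["predict"]`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, MAE and RMSE for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "rmse": rmse}


# Made-up values for illustration only, not results from the article.
y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]
print(regression_metrics(y_true_demo, y_pred_demo))
```

Scoring on `h2o_test` matters because the leaderboard above was produced without a separate leaderboard frame, so its metrics come from the training run rather than from the held-out 20%.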

Conclusion

Thank you for reading to the end.
h2o was also easy to implement.

If you have any corrections to request, I would appreciate it if you let me know.
