
Trying Out AutoML (TPOT)

Posted at 2021-04-03

An article from a data scientist with a manufacturing background.
This time I tried out the AutoML library TPOT.

Introduction

I previously used PyCaret as my AutoML library; this time I tried a different one, TPOT.

Trying TPOT

As in my earlier articles, I use the Boston housing price data published in the UCI Machine Learning Repository.

# Import libraries
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Boston housing price data
# (load_boston was removed in scikit-learn 1.2, so this requires an older version)
from sklearn.datasets import load_boston

# Evaluation metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error


# Load the dataset
boston = load_boston()

# Store the explanatory variables
X = pd.DataFrame(boston.data, columns = boston.feature_names)
# Store the target variable
y = pd.DataFrame(boston.target)

# Split into training and evaluation data
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size=0.8,
                                                    test_size=0.2,
                                                    random_state=10
                                                   )

Next, I configure TPOTRegressor.
TPOT apparently uses genetic programming. I think it is fine to treat this, roughly, as an extension of genetic algorithms. See the TPOT documentation for details.
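
To give a feel for what "generations" and "population size" mean in an evolutionary search, here is a minimal toy sketch. It is not TPOT's implementation: TPOT evolves tree-structured pipelines, whereas this example evolves a single float. The function names (`evolve`, `toy_fitness`) and the selection, crossover, and mutation rules are all assumptions made for illustration.

```python
import random


def evolve(fitness, generations=5, population_size=25, seed=42):
    """Toy evolutionary search over a single float (illustration only)."""
    rng = random.Random(seed)
    # Start from a random population of candidate solutions
    population = [rng.uniform(-10.0, 10.0) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Selection: rank by fitness and keep the fitter half as parents
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]
        # Elitism: the current best individual survives unchanged
        children = [ranked[0]]
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2.0         # crossover: average two parents
            child += rng.gauss(0.0, 0.5)  # mutation: small random change
            children.append(child)
        population = children
        history.append(max(fitness(x) for x in population))
    return max(population, key=fitness), history


def toy_fitness(x):
    # Hypothetical objective: the closer x is to 3.0, the higher the score
    return -abs(x - 3.0)


best, history = evolve(toy_fitness)
print("best candidate:", best)
print("best fitness per generation:", history)
```

Because the best individual is always carried over, the best fitness per generation never gets worse. In TPOT the "individuals" are whole preprocessing and model pipelines, and the fitness is the cross-validated score.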

# Configure TPOTRegressor
tpot = TPOTRegressor(scoring='neg_mean_absolute_error',
                     generations=5,
                     population_size=25,
                     random_state=42,
                     verbosity=2,
                     n_jobs=-1
                    )

tpot.fit(X_train, y_train)
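
As a rough guide to how much work these settings imply: the classic TPOT documentation states that the search evaluates population_size + generations × offspring_size pipelines in total, with offspring_size defaulting to population_size. The sketch below just does that arithmetic. The helper name `search_budget` is hypothetical, and the 5-fold cross-validation is an assumption based on the classic default; the real count can be lower when duplicate or failing pipelines are skipped.

```python
def search_budget(generations, population_size, offspring_size=None, cv_folds=5):
    """Estimate pipelines evaluated and model fits (hypothetical helper)."""
    if offspring_size is None:
        # Assumption: offspring_size defaults to population_size
        offspring_size = population_size
    pipelines = population_size + generations * offspring_size
    # Each pipeline is fitted once per cross-validation fold
    return pipelines, pipelines * cv_folds


pipelines, fits = search_budget(generations=5, population_size=25)
print(f"pipelines evaluated: {pipelines}, model fits: {fits}")
```

With generations=5 and population_size=25, this gives 150 pipelines and, under the 5-fold assumption, about 750 model fits.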

Let's check the final result.

tpot.fitted_pipeline_

The resulting pipeline is shown below.

Pipeline(steps=[('robustscaler', RobustScaler()),
                ('randomforestregressor',
                 RandomForestRegressor(max_features=0.7500000000000001,
                                       min_samples_split=9, random_state=42))])

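The automatically selected model turned out to be a random forest, preceded by a RobustScaler step.
Compared with PyCaret, which also shows the results of the other models, this is a little inconvenient.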
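
The other candidates can be recovered to some extent. As a sketch, and assuming the classic TPOT (0.11.x-style) API in which a fitted estimator exposes `evaluated_individuals_` as a dict mapping a pipeline string to a dict containing an `internal_cv_score` entry, the helper below ranks candidates by that score. The helper name `top_pipelines` and the sample data (shortened pipeline strings and scores) are hypothetical, not output from the run above.

```python
def top_pipelines(evaluated, n=5):
    """Rank evaluated pipelines by internal CV score, best first (hypothetical helper)."""
    rows = [(name, info["internal_cv_score"]) for name, info in evaluated.items()]
    # Scores such as neg_mean_absolute_error are "higher is better"
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


# Hypothetical stand-in for tpot.evaluated_individuals_ (made-up values)
sample = {
    "RandomForestRegressor(RobustScaler(input_matrix))": {"generation": 3, "internal_cv_score": -2.4},
    "ElasticNetCV(input_matrix)": {"generation": 0, "internal_cv_score": -3.5},
    "KNeighborsRegressor(input_matrix)": {"generation": 1, "internal_cv_score": -4.1},
}

ranking = top_pipelines(sample, n=2)
for name, score in ranking:
    print(f"{score:8.3f}  {name}")
```

With a fitted model, the same call would be `top_pipelines(tpot.evaluated_individuals_)`, provided the installed TPOT version uses the dict layout assumed here.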

Next, I make predictions on the evaluation data.

y_pred = tpot.predict(X_test)


plt.figure(figsize=(5, 5))
plt.scatter(y_pred,y_test,alpha=0.5)
plt.xlabel('y_pred')
plt.ylabel('y_test')

Finally, I compute the evaluation metrics.

# Evaluation
def calculate_scores(true, pred):
    """Compute all evaluation metrics.

    Parameters
    ----------
    true (np.array)       : actual values
    pred (np.array)       : predicted values

    Returns
    -------
    scores (pd.DataFrame) : summary of each evaluation metric

    """
    scores = pd.DataFrame({'R2': r2_score(true, pred),
                          'MAE': mean_absolute_error(true, pred),
                          'MSE': mean_squared_error(true, pred),
                          'RMSE': np.sqrt(mean_squared_error(true, pred))},
                           index = ['scores'])
    return scores

scores = calculate_scores(y_test, y_pred)
print(scores)

The evaluation metrics are listed below.

              R2       MAE        MSE      RMSE
scores  0.848905  2.728567  15.801663  3.975131
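
To make clear what the four columns mean, here is a small dependency-free sketch of the same metrics using their standard definitions. The function name `regression_scores` and the four-point example data are made up for illustration; the scikit-learn functions used above remain the actual source of the numbers in the table.

```python
import math


def regression_scores(true, pred):
    """Compute R2, MAE, MSE and RMSE from plain Python lists."""
    n = len(true)
    errors = [t - p for t, p in zip(true, pred)]
    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    mean_true = sum(true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in true)  # total sum of squares
    r2 = 1.0 - (mse * n) / ss_tot           # 1 - SS_res / SS_tot
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Made-up example values, not the Boston data
toy_scores = regression_scores([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(toy_scores)
```

RMSE is simply the square root of MSE, which matches the table: the square root of 15.801663 is approximately 3.975131.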

Conclusion

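Thank you for reading to the end.
Personally, I preferred PyCaret because it offers richer functionality.
Even so, it is nice that there are so many AutoML libraries to choose from. They are convenient.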

If you have any corrections to request, I would appreciate it if you let me know.
