
LightGBM (Implementation and Automatic Parameter Tuning with Optuna)

Posted at 2021-01-17

Introduction

This article summarizes how to implement LightGBM and how to tune its parameters automatically with Optuna.

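What is LightGBM?

LightGBM is a gradient boosting machine learning framework: it combines decision trees with boosting, an ensemble learning technique.
(It is often described as a framework that improves on the approach of XGBoost.)

XGBoost release: 2014
LightGBM release: 2016

Note: for ensemble learning, see the following article.
https://qiita.com/hara_tatsu/items/336f9fff08b9743dc1d2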
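
To make "decision trees plus boosting" concrete, here is a minimal sketch of gradient boosting for squared error, using one-split trees (stumps) on a single feature. It is not how LightGBM is implemented internally; the function names, the synthetic sine data, and the learning rate are all assumptions chosen for illustration. Each round fits a stump to the current residuals and adds a damped copy of it to the prediction.

```python
import numpy as np


def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) of the best least-squares split."""
    best = None
    # Exclude the largest value so the right-hand side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit stumps sequentially, each one to the residuals of the current model."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of the squared error
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

base, stumps, fitted = gradient_boost(x, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - fitted) ** 2)
print(f"MSE with the mean only: {mse_start:.4f}, after boosting: {mse_end:.4f}")
```

The training error shrinks every round because each stump is fitted to what the previous rounds got wrong, which is also why an unconstrained boosted model can overfit.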

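Features of LightGBM

① High predictive accuracy
Among machine learning methods other than deep learning, it is generally regarded as one of the most accurate, alongside XGBoost.

② Relatively short training time
It has a lower computational cost than XGBoost, which offers comparable predictive accuracy.
(This is why LightGBM is called "Light".)

③ Prone to overfitting
Because the decision trees can become complex, the model is likely to overfit unless the parameters are tuned appropriately.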

Implementation

This article uses the car evaluation task from SIGNATE as its example.
Link below.
https://signate.jp/competitions/122

Data preprocessing

Load the data and convert the string values to numbers.

python.py

import pandas as pd
import numpy as np

# Load the data (tab-separated) and drop the id column
df = pd.read_csv('train.tsv', delimiter = '\t')
df = df.drop('id', axis = 1)

# Explanatory variables: map each ordered category to an integer
df = df.replace({'buying': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'maint': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'doors': {'2': 2, '3': 3, '4': 4, '5': 5, '5more': 6}})
df = df.replace({'persons': {'2': 2, '4': 4, 'more': 6}})
df = df.replace({'lug_boot': {'small': 1, 'med': 2, 'big': 3}})
df = df.replace({'safety': {'low': 1, 'med': 2, 'high': 3}})

# Target variable
df = df.replace({'class': {'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3}})
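
Since train.tsv is not included here, the same ordinal encoding can be tried on a small in-memory table. The sketch below is an illustration only: the three rows are made up, and it uses `Series.map` instead of `DataFrame.replace`, because `map` turns any category missing from the dictionary into NaN, which makes typos in the mapping easy to spot.

```python
import pandas as pd

# Hypothetical rows using three of the competition's column names.
toy = pd.DataFrame({
    "buying": ["low", "vhigh", "med"],
    "safety": ["high", "low", "med"],
    "class": ["acc", "unacc", "vgood"],
})

# Same category-to-integer mappings as in the article.
ordinal_maps = {
    "buying": {"low": 1, "med": 2, "high": 3, "vhigh": 4},
    "safety": {"low": 1, "med": 2, "high": 3},
    "class": {"unacc": 0, "acc": 1, "good": 2, "vgood": 3},
}

encoded = toy.copy()
for column, mapping in ordinal_maps.items():
    encoded[column] = encoded[column].map(mapping)

# Any NaN here would mean a category was missing from its mapping.
unmapped_cells = int(encoded.isna().sum().sum())
print(encoded)
print("unmapped cells:", unmapped_cells)
```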

Splitting into training and evaluation data

python.py

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, random_state = 0)

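# Split the training data into explanatory variables (X_train) and the target variable (y_train)
X_train = train_set.drop('class', axis=1)
y_train = train_set['class']

# Split the evaluation data into explanatory variables (X_test) and the target variable (y_test)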
X_test = test_set.drop('class', axis=1)
y_test = test_set['class']

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(691, 6)
(173, 6)
(691,)
(173,)
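
The classes in this task are imbalanced, so a purely random split can leave a rare class with very few evaluation rows. As an optional variation (not what the article does), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The labels below are made up for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 / 12 / 4 / 4 samples of classes 0 to 3.
y = np.repeat([0, 1, 2, 3], [80, 12, 4, 4])
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the class proportions in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

test_counts = np.bincount(y_te, minlength=4)
print("test rows per class:", test_counts)
```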

Implementing LightGBM

Converting to LightGBM datasets

python.py

import lightgbm as lgb

# Training data
lgb_train = lgb.Dataset(X_train, y_train)
# Evaluation data
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

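Training the model

Note: for binary classification
'objective': 'binary',
'metric': 'binary_error' # evaluation metric: error rate (1 - accuracy)

Note: for regression
'objective': 'regression',
'metric': 'rmse'

python.py

# Parameter settings
params = {
    'task': 'train', # training task
    'boosting': 'gbdt', # gradient boosting decision trees
    'objective': 'multiclass', # objective: multiclass classification
    'num_class': 4, # number of classes
    'metric': 'multi_error', # evaluation metric: error rate (1 - accuracy)
    'num_iterations': 1000, # up to 1000 boosting rounds
    'verbose': -1 # suppress training logs
}

# Train the model
# Note: LightGBM 4.0 and later removed the early_stopping_rounds argument;
# there, pass callbacks=[lgb.early_stopping(100)] instead.
model = lgb.train(params,
                  # training data
                  train_set=lgb_train,
                  # evaluation data
                  valid_sets=lgb_eval,
                  early_stopping_rounds=100)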

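Checking the results

python.py

# Predict on the evaluation data (one probability per class for each row)
y_pred = model.predict(X_test)
# Convert the predicted probabilities to class labels
y_pred = np.argmax(y_pred, axis=1)

from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))


# Result

              precision    recall  f1-score   support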

           0       1.00      0.99      1.00       114
           1       0.93      0.98      0.95        42
           2       0.75      0.67      0.71         9
           3       1.00      1.00      1.00         8

    accuracy                           0.97       173
   macro avg       0.92      0.91      0.91       173
weighted avg       0.97      0.97      0.97       173

Accuracy: 97%
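
For a multiclass objective, `predict` returns one row of class probabilities per sample, and `np.argmax(..., axis=1)` picks the most likely class. The small sketch below uses made-up probabilities to show that step and how accuracy relates to the `multi_error` metric (error rate = 1 - accuracy).

```python
import numpy as np

# Hypothetical predicted probabilities for 3 samples and 4 classes.
proba = np.array([
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.70, 0.15, 0.05],
    [0.05, 0.40, 0.50, 0.05],
])
y_true = np.array([0, 1, 1])

# Index of the largest probability in each row = predicted class label.
y_hat = np.argmax(proba, axis=1)

accuracy = np.mean(y_hat == y_true)
error_rate = 1.0 - accuracy  # what 'multi_error' measures
print("predicted labels:", y_hat)
print(f"accuracy: {accuracy:.3f}, error rate: {error_rate:.3f}")
```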

Tuning parameters with Optuna

Next, we use Optuna to optimize the parameters.

What is Optuna?

Optuna is a software framework for automating parameter optimization. It automatically runs trial and error over parameter values and finds values that give good performance.
(It uses the Tree-structured Parzen Estimator, a type of Bayesian optimization algorithm.)

Note: installation
pip install optuna

For details, see:
① Home page
https://preferred.jp/ja/projects/optuna/
② Documentation
https://optuna.readthedocs.io/en/stable/index.html

Implementing Optuna

The following seven parameters are tuned automatically.
lambda_l1
lambda_l2
num_leaves
feature_fraction
bagging_fraction
bagging_freq
min_child_samples
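
For orientation, the seven parameters listed above can be written as a plain dictionary with a short note on what each one controls. The values shown are LightGBM's documented defaults, not tuned values; the dictionary name is made up for this sketch.

```python
# LightGBM defaults for the seven tuned parameters (illustrative dictionary name).
tuned_parameter_defaults = {
    "lambda_l1": 0.0,          # L1 regularization strength
    "lambda_l2": 0.0,          # L2 regularization strength
    "num_leaves": 31,          # maximum number of leaves per tree
    "feature_fraction": 1.0,   # fraction of features sampled for each tree
    "bagging_fraction": 1.0,   # fraction of rows sampled when bagging is enabled
    "bagging_freq": 0,         # run bagging every k iterations (0 disables it)
    "min_child_samples": 20,   # minimum number of samples required in a leaf
}

for name, default in tuned_parameter_defaults.items():
    print(f"{name}: {default}")
```

Stronger regularization, fewer leaves, smaller sampling fractions, and a larger `min_child_samples` all make the model simpler, which is how these parameters help against overfitting.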

Now let's implement it.

python.py

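# Import Optuna's LightGBM integration; its train() tunes the parameters listed above
# Note: recent Optuna releases may need the separate optuna-integration package for this import.
from optuna.integration import lightgbm as gbm

# Fixed parameters
params = {
    "boosting_type": "gbdt",
    'objective': 'multiclass',
    'num_class': 4,
    'metric': 'multi_error',
    "verbosity": -1,
}

# Parameter search with Optuna (call gbm.train, not lgb.train, so that the tuner actually runs)
# Note: with LightGBM 4.0 and later, replace verbose_eval / early_stopping_rounds with
# callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)].
model = gbm.train(params, lgb_train,
                  valid_sets=[lgb_train, lgb_eval],
                  verbose_eval=100,
                  early_stopping_rounds=100,
                 )

# Show the best parameters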
best_params = model.params
print("Best params:", best_params)


Best params: {
'objective': 'multiclass','num_class': 4, 'metric': 'multi_error', 
'verbosity': -1, 'boosting_type': 'gbdt', 'feature_pre_filter': False,
'lambda_l1': 0.0, 'lambda_l2': 0.0, 'num_leaves': 31, 'feature_fraction': 
0.8999999999999999, 'bagging_fraction': 1.0, 'bagging_freq': 0, 
'min_child_samples': 20, 'num_iterations': 1000, 'early_stopping_round': 100
}

Checking the results

python.py

Y_pred = model.predict(X_test, num_iteration=model.best_iteration)
y_pred = np.argmax(Y_pred, axis=1)

from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      0.99      1.00       114
           1       0.95      0.98      0.96        42
           2       0.78      0.78      0.78         9
           3       1.00      1.00      1.00         8

    accuracy                           0.98       173
   macro avg       0.93      0.94      0.93       173
weighted avg       0.98      0.98      0.98       173

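Conclusion

Accuracy improved from 97% to 98%!

That is also a higher predictive accuracy than the random forest from the previous article!

Note: Random forest (implementation and parameter summary)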
https://qiita.com/hara_tatsu/items/581db994ec8866afe8f8
