I tried building a gradient boosting random forest

Posted at 2024-05-13

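As the title says: Scikit-Learn provides BaggingClassifier, and applying it to decision trees gives what is essentially a random forest (RandomForestClassifier additionally randomizes the features considered at each split).
This post tries the same idea with a gradient boosting model as the base estimator, to see what a "gradient boosting random forest" looks like.
The n_estimators of the bagging ensemble is set to 10.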
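To make the idea concrete, here is a minimal sketch of what bagging does, in plain Python: draw bootstrap samples, fit one base model per sample, and combine the predictions by majority vote. The `NearestMean` learner and the `bagging_fit` / `bagging_predict` helpers are hypothetical names made up for illustration. This is not Scikit-Learn's implementation; BaggingClassifier, for instance, averages predicted probabilities when the base estimator provides them.

```python
import random
from collections import Counter


class NearestMean:
    """Hypothetical toy base learner: predicts the class whose 1-D mean is closest."""

    def fit(self, xs, ys):
        counts = Counter(ys)
        sums = {}
        for x, y in zip(xs, ys):
            sums[y] = sums.get(y, 0.0) + x
        self.means_ = {label: sums[label] / counts[label] for label in counts}
        return self

    def predict_one(self, x):
        return min(self.means_, key=lambda label: abs(x - self.means_[label]))


def bagging_fit(xs, ys, n_estimators=10, seed=0):
    """Fit n_estimators base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        # Bootstrap: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(NearestMean().fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Majority vote over the base learners."""
    votes = Counter(m.predict_one(x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: class 0 lies in [0, 0.19], class 1 lies in [1, 1.19].
xs = [i / 100 for i in range(20)] + [1 + i / 100 for i in range(20)]
ys = [0] * 20 + [1] * 20

models = bagging_fit(xs, ys, n_estimators=10, seed=0)
print(bagging_predict(models, 0.05), bagging_predict(models, 1.15))
```

In the article, the role of `NearestMean` is played by `LGBMClassifier`, so every one of the 10 bagged members is itself a full boosted ensemble.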

Decision boundary

Plotting function

I wrote this function for an earlier post, so I reuse it here.

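import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing  # provides minmax_scale, used below

def showline_clf(x, y, model, modelname, x0="x0", x1="x1"):
    """Fit model on min-max scaled 2-D data and plot its decision regions."""
    fig, ax = plt.subplots(figsize=(8, 6))
    # Scale both features to [0, 1] so the grid below covers the data.
    x = preprocessing.minmax_scale(x)
    # 1000 x 1000 grid over the unit square, flattened to (n_points, 2) for predict().
    X, Y = np.meshgrid(np.linspace(0, 1, 1000), np.linspace(0, 1, 1000))
    XY = np.column_stack([X.ravel(), Y.ravel()])
    model.fit(x, y)
    Z = model.predict(XY).reshape(X.shape)
    # Shaded regions show the predicted class; points show the training data.
    plt.contourf(X, Y, Z, alpha=0.1, cmap="brg")
    plt.scatter(x[:,0], x[:,1], c=y, cmap="brg")
    plt.xlim(min(x[:,0]), max(x[:,0]))
    plt.ylim(min(x[:,1]), max(x[:,1]))
    plt.title(modelname)
    plt.colorbar()
    plt.xlabel(x0)
    plt.ylabel(x1)
    plt.show()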

Code

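from sklearn.datasets import make_blobs
from sklearn.ensemble import BaggingClassifier as BC
from sklearn.model_selection import train_test_split as tts  # used as tts() in the benchmarks below
from lightgbm import LGBMClassifier
import pandas as pd
import time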

x, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=0.60)

model = BC(estimator=LGBMClassifier(), n_estimators=10)
showline_clf(x, y, model, "BaggingLightGBM", x0="x0", x1="x1")

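I think the decision boundary came out somewhat more complex.

Performance

In an earlier post I compared the performance of LightGBM and GradientBoostingClassifier, so I run the same experiment here.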

df = pd.read_csv("iris.csv")
x = df.drop("category", axis=1)
y = df["category"]
score = []
start = time.time()
for i in range(300):
    x_train, x_test, y_train, y_test = tts(x, y, test_size=0.3, random_state=i)
    model = BC(estimator=LGBMClassifier(), n_estimators=10)
    model.fit(x_train, y_train)
    score.append(model.score(x_test, y_test))
end = time.time()
print(end-start)
df_scr = pd.DataFrame(score).describe()
df_scr.columns = ["lightgbm"]
df_scr

206.3249695301056

Accuracy is unchanged and the run is slower, so it may be better to use LightGBM as is, without bagging.
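The measurement loop above (new split, fresh model, fit, score, repeated over many seeds) can be wrapped in a small helper so that only the model and the split change between experiments. The following is a sketch under assumptions: `benchmark`, `MajorityClassifier`, and `toy_split` are hypothetical names introduced here, and the stand-in model exists only so the block runs without Scikit-Learn or LightGBM installed.

```python
import time


class MajorityClassifier:
    """Hypothetical stand-in exposing fit/score like a Scikit-Learn estimator."""

    def fit(self, x, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)  # most frequent label
        return self

    def score(self, x, y):
        labels = list(y)
        return sum(label == self.label_ for label in labels) / len(labels)


def benchmark(make_model, make_split, n_runs):
    """Fit a fresh model on n_runs splits; return (scores, elapsed seconds)."""
    scores = []
    start = time.perf_counter()
    for seed in range(n_runs):
        x_train, x_test, y_train, y_test = make_split(seed)
        model = make_model()
        model.fit(x_train, y_train)
        scores.append(model.score(x_test, y_test))
    return scores, time.perf_counter() - start


def toy_split(seed):
    """Hypothetical fixed split; a real one would depend on the seed."""
    return [[0], [1], [2]], [[3], [4]], ["a", "a", "b"], ["a", "b"]


scores, elapsed = benchmark(MajorityClassifier, toy_split, n_runs=3)
print(scores, f"{elapsed:.6f} s")
```

With the objects from this article, the call would look like `benchmark(lambda: BC(estimator=LGBMClassifier(), n_estimators=10), lambda i: tts(x, y, test_size=0.3, random_state=i), 300)`.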

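(Supplement) Lowering n_estimators

Last time I found that Scikit-Learn's gradient boosted decision trees and LightGBM differ in runtime by a factor of four, so I choose n_estimators to match that.

With n_estimators=4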

df = pd.read_csv("iris.csv")
x = df.drop("category", axis=1)
y = df["category"]
score = []
start = time.time()
for i in range(300):
    x_train, x_test, y_train, y_test = tts(x, y, test_size=0.3, random_state=i)
    model = BC(estimator=LGBMClassifier(), n_estimators=4)
    model.fit(x_train, y_train)
    score.append(model.score(x_test, y_test))
end = time.time()
print(end-start)
df_scr = pd.DataFrame(score).describe()
df_scr.columns = ["lightgbm"]
df_scr

116.45270299911499

This is much closer to the roughly 84 seconds from last time, but there is still a gap of about 30 seconds.

With n_estimators=3

df = pd.read_csv("iris.csv")
x = df.drop("category", axis=1)
y = df["category"]
score = []
start = time.time()
for i in range(300):
    x_train, x_test, y_train, y_test = tts(x, y, test_size=0.3, random_state=i)
    model = BC(estimator=LGBMClassifier(), n_estimators=3)
    model.fit(x_train, y_train)
    score.append(model.score(x_test, y_test))
end = time.time()
print(end-start)
df_scr = pd.DataFrame(score).describe()
df_scr.columns = ["lightgbm"]
df_scr

66.07129549980164

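This came in under 84 seconds. However, the minimum accuracy and the top-25% accuracy value differ, so with default settings it still could not beat Scikit-Learn's gradient boosted decision trees on accuracy.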
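As a rough check on how the runtime scales, the three elapsed times printed above can be converted into a cost per run and per bagged LightGBM model. This is only back-of-the-envelope arithmetic: each figure comes from a single measurement and also includes the time for splitting and scoring, so the per-model numbers are indicative at best.

```python
N_RUNS = 300  # number of train/test splits in each loop above

# n_estimators -> total elapsed seconds printed in the article
elapsed = {10: 206.3249695301056, 4: 116.45270299911499, 3: 66.07129549980164}

# Seconds per run divided by the number of bagged models in that run.
per_fit = {n: total / N_RUNS / n for n, total in elapsed.items()}

for n, sec in per_fit.items():
    print(f"n_estimators={n}: {elapsed[n] / N_RUNS:.3f} s per run, "
          f"{sec:.3f} s per base model")
```

The per-model cost comes out between roughly 0.07 and 0.10 seconds and is not constant across the three settings, which suggests either fixed overhead or noticeable measurement noise in these timings.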

Summary

Bagging is not automatically an improvement.
LightGBM is better used as is: it is both accurate and fast.
