Using LightGBM the Scikit-Learn way

Posted at 2024-05-11 (last updated at 2024-05-11)

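Until now I have basically only written articles about Scikit-Learn and statsmodels, so this time I'll try LightGBM.
I think the annoying thing about machine learning and mathematical modeling libraries is that their usage is not standardized. For example, statsmodels takes x and y in the opposite order from Scikit-Learn. That is about the only difference, so statsmodels is still on the easy side, but other libraries are not always so accommodating. So this time I'll try using LightGBM as if it were Scikit-Learn.
Specifically, I'll keep the way the functions are called and the number of lines of code almost unchanged.
Note that everything runs with default values, so no parameter tuning is done.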

Scikit-Learn

from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.model_selection import cross_val_score as cvs
import pandas as pd

df = pd.read_csv("iris.csv")
x = df.drop("category", axis=1)
y = df["category"]
model = GBC()
score = cvs(model, x, y, cv=20, scoring="accuracy")
df_scr = pd.DataFrame(score).describe()
df_scr.columns = ["Scikit-Learn"]
df_scr

LightGBM

from sklearn.model_selection import cross_val_score as cvs
import lightgbm as lgb
import pandas as pd

df = pd.read_csv("iris.csv")
x = df.drop("category", axis=1)
y = df["category"]
model = lgb.LGBMClassifier()
score = cvs(model, x, y, cv=20, scoring="accuracy")
df_scr = pd.DataFrame(score).describe()
df_scr.columns = ["lightgbm"]
df_scr

Execution time

Using a different train/test split each time, I train and predict 300 times, record the scores, and measure the execution time.

Scikit-Learn

from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import GradientBoostingClassifier as GBC
import pandas as pd
import time

df = pd.read_csv("iris.csv")
x = df.drop("category", axis=1)
y = df["category"]
score = []
start = time.time()
for i in range(300):
    x_train, x_test, y_train, y_test = tts(x, y, test_size=0.3, random_state=i)
    model = GBC()
    model.fit(x_train, y_train)
    score.append(model.score(x_test, y_test))
end = time.time()
print(end-start)
df_scr = pd.DataFrame(score).describe()
df_scr.columns = ["scikit-learn"]
df_scr

84.88612914085388

LightGBM

from sklearn.model_selection import train_test_split as tts
import lightgbm as lgb
import pandas as pd
import time

df = pd.read_csv("iris.csv")
x = df.drop("category", axis=1)
y = df["category"]
score = []
start = time.time()
for i in range(300):
    x_train, x_test, y_train, y_test = tts(x, y, test_size=0.3, random_state=i)
    model = lgb.LGBMClassifier()
    model.fit(x_train, y_train)
    score.append(model.score(x_test, y_test))
end = time.time()
print(end-start)
df_scr = pd.DataFrame(score).describe()
df_scr.columns = ["lightgbm"]
df_scr

21.910428762435913

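Even with this many trials it is hard to call a winner on accuracy, but at least in execution time LightGBM appears to be far faster: about 21.9 seconds versus 84.9 seconds for the 300 runs here, roughly 3.9 times faster.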

Supplement (with early stopping)

from sklearn.model_selection import train_test_split as tts
import lightgbm as lgb
import pandas as pd
import time

df = pd.read_csv("iris.csv")
x = df.drop("category", axis=1)
y = df["category"]
score = []
start = time.time()
for i in range(300):
    x_train, x_test, y_train, y_test = tts(x, y, test_size=0.3, random_state=i)
    x_train, x_val, y_train, y_val = tts(x_train, y_train, test_size=0.2, random_state=i)  # seed the validation split too so runs are reproducible
    model = lgb.LGBMClassifier()
    model.fit(x_train, y_train, 
              callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=True)],
              eval_set=[(x_val, y_val)], 
              eval_metric="multi_logloss")
    score.append(model.score(x_test, y_test))
end = time.time()
print(end-start)
df_scr = pd.DataFrame(score).describe()
df_scr.columns = ["lightgbm"]
df_scr

So accuracy did not change that much after all...

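Summary

So the amount of code is the same, but comparing accuracy in detail with cross-validation, it looks like Scikit-Learn does better on the iris data when everything is left at default values.
That said, accuracy would probably change once the parameters are tuned and validated.
Also, since this is boosting, there are other parameters such as the learning rate, so it might be interesting to experiment with what happens when the parameters are set to match Scikit-Learn's.
Still, LightGBM is far faster, so it is likely to be the easier choice when building some kind of service, especially for MLOps or web services.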
