There is a saying that three heads are better than one (三人寄れば文殊の知恵), so I tested how much accuracy improves when going from a single LightGBM model to an ensemble of the three best models trained with overfitting suppression (early stopping).
The datasets used are the Iris, Wine, and breast cancer datasets.
Importing the libraries
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split as tts
from lightgbm import LGBMClassifier
from scipy import stats
import lightgbm as lgb
import pandas as pd
Iris
df = pd.read_csv("iris.csv")
y = df["category"]
x = df.drop("category", axis=1)
x_train, x_test, y_train, y_test = tts(x, y, random_state=100, test_size=0.3)

# Single-model baseline
model = LGBMClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
# Train 100 models on different train/validation splits, with early stopping
models = []
for i in range(100):
    x_train2, x_val, y_train2, y_val = tts(x_train, y_train, random_state=i, test_size=0.2)
    model = LGBMClassifier()
    model.fit(x_train2, y_train2,
              eval_set=[(x_val, y_val)],
              eval_metric="multi_logloss",
              callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=True),
                         lgb.log_evaluation(0)])
    models.append([model, model.score(x_val, y_val)])
# Keep the three models with the best validation scores
models = sorted(models, key=lambda x: x[1], reverse=True)
model1 = models[0][0]
model2 = models[1][0]
model3 = models[2][0]
pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)

# Majority vote across the three predictions, per sample
y_pred = []
for i in range(len(pred1)):
    y_pred.append(stats.mode([pred1[i], pred2[i], pred3[i]])[0])
print(classification_report(y_test, y_pred))
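A note on the voting step: `scipy.stats.mode` has changed behavior across SciPy releases (recent versions return a scalar mode and may reject non-numeric labels), so if the class labels are strings a dependency-free majority vote with `collections.Counter` is safer. The name `majority_vote` below is my own, not from the original code:

```python
from collections import Counter

def majority_vote(*prediction_rows):
    """Column-wise majority vote over equal-length prediction
    sequences; ties go to the label seen first in that column."""
    return [Counter(col).most_common(1)[0][0]
            for col in zip(*prediction_rows)]

# Toy check: three models partially disagree
print(majority_vote([0, 1, 2], [0, 2, 2], [1, 2, 2]))  # [0, 2, 2]
```

This also works unchanged for string labels, which `stats.mode` may not accept.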
Single model
precision recall f1-score support
0 1.00 1.00 1.00 16
1 0.91 0.91 0.91 11
2 0.94 0.94 0.94 18
accuracy 0.96 45
macro avg 0.95 0.95 0.95 45
weighted avg 0.96 0.96 0.96 45
Best 3 models
precision recall f1-score support
0 1.00 1.00 1.00 16
1 0.91 0.91 0.91 11
2 0.94 0.94 0.94 18
accuracy 0.96 45
macro avg 0.95 0.95 0.95 45
weighted avg 0.96 0.96 0.96 45
Wine data
df = pd.read_csv("wine.csv")
y = df["Wine"]
x = df.drop("Wine", axis=1)
x_train, x_test, y_train, y_test = tts(x, y, random_state=100, test_size=0.3)
model = LGBMClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
models = []
for i in range(100):
    x_train2, x_val, y_train2, y_val = tts(x_train, y_train, random_state=i, test_size=0.2)
    model = LGBMClassifier()
    model.fit(x_train2, y_train2,
              eval_set=[(x_val, y_val)],
              eval_metric="multi_logloss",
              callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=True),
                         lgb.log_evaluation(0)])
    models.append([model, model.score(x_val, y_val)])
models = sorted(models, key=lambda x: x[1], reverse=True)
model1 = models[0][0]
model2 = models[1][0]
model3 = models[2][0]
pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)
y_pred = []
for i in range(len(pred1)):
    y_pred.append(stats.mode([pred1[i], pred2[i], pred3[i]])[0])
print(classification_report(y_test, y_pred))
Single model
precision recall f1-score support
1 0.88 1.00 0.93 14
2 1.00 0.84 0.91 19
3 0.95 1.00 0.98 21
accuracy 0.94 54
macro avg 0.94 0.95 0.94 54
weighted avg 0.95 0.94 0.94 54
Best 3 models
precision recall f1-score support
1 1.00 1.00 1.00 14
2 1.00 0.95 0.97 19
3 0.95 1.00 0.98 21
accuracy 0.98 54
macro avg 0.98 0.98 0.98 54
weighted avg 0.98 0.98 0.98 54
Breast cancer data
df = pd.read_csv("breast_cancer.csv")
y = df["y"]
x = df.drop("y", axis=1)
x_train, x_test, y_train, y_test = tts(x, y, random_state=100, test_size=0.3)
model = LGBMClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
models = []
for i in range(100):
    x_train2, x_val, y_train2, y_val = tts(x_train, y_train, random_state=i, test_size=0.2)
    model = LGBMClassifier()
    model.fit(x_train2, y_train2,
              eval_set=[(x_val, y_val)],
              eval_metric="binary_logloss",  # binary task, so not multi_logloss
              callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=True),
                         lgb.log_evaluation(0)])
    models.append([model, model.score(x_val, y_val)])
models = sorted(models, key=lambda x: x[1], reverse=True)
model1 = models[0][0]
model2 = models[1][0]
model3 = models[2][0]
pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)
y_pred = []
for i in range(len(pred1)):
    y_pred.append(stats.mode([pred1[i], pred2[i], pred3[i]])[0])
print(classification_report(y_test, y_pred))
Single model
precision recall f1-score support
0.0 0.98 0.90 0.94 69
1.0 0.94 0.99 0.96 102
accuracy 0.95 171
macro avg 0.96 0.94 0.95 171
weighted avg 0.95 0.95 0.95 171
Best 3 models
precision recall f1-score support
0.0 0.98 0.90 0.94 69
1.0 0.94 0.99 0.96 102
accuracy 0.95 171
macro avg 0.96 0.94 0.95 171
weighted avg 0.95 0.95 0.95 171
Conclusion
Accuracy improved only on the Wine data.
Summary
Still, LightGBM trains fast, so building 100 models and keeping the best three costs little. In these experiments the ensemble never did worse than the single model and improved on the Wine data, so I think it is worth trying.
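Since the same pipeline is repeated three times above, it can be factored into a reusable helper. This is only a sketch under my own naming (`train_best_k` and `vote_predict` are not from the original code), and it omits the early-stopping callbacks for brevity; any estimator with `fit`/`score`/`predict` works:

```python
from collections import Counter

def train_best_k(make_model, split, n_models=100, k=3):
    """Train n_models on different random splits and return the k
    models with the highest validation score.
    make_model() -> fresh estimator with fit/score/predict;
    split(seed)  -> (x_tr, y_tr, x_val, y_val)."""
    scored = []
    for seed in range(n_models):
        x_tr, y_tr, x_val, y_val = split(seed)
        model = make_model()
        model.fit(x_tr, y_tr)
        scored.append((model.score(x_val, y_val), model))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best first
    return [model for _, model in scored[:k]]

def vote_predict(models, x_test):
    """Per-sample majority vote over each model's predictions."""
    rows = [m.predict(x_test) for m in models]
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]
```

With LightGBM, `make_model` would be `LGBMClassifier` and `split` a closure over `tts(x_train, y_train, random_state=seed, test_size=0.2)`.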