There is a saying that three heads are better than one (三人寄れば文殊の知恵), so I tested how much accuracy improves when going from a single LightGBM model to an ensemble of the three best models trained with overfitting suppression (early stopping).
The datasets used are the Iris, Wine, and breast cancer datasets.
Importing the libraries
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split as tts
from lightgbm import LGBMClassifier
from scipy import stats
import lightgbm as lgb
import pandas as pd
Iris
df = pd.read_csv("iris.csv")
y = df["category"]
x = df.drop("category", axis=1)
x_train, x_test, y_train, y_test = tts(x, y, random_state=100, test_size=0.3)

# Single-model baseline
model = LGBMClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
# Train 100 models on different train/validation splits, with early stopping
models = []
for i in range(100):
    x_train2, x_val, y_train2, y_val = tts(x_train, y_train, random_state=i, test_size=0.2)
    model = LGBMClassifier()
    model.fit(x_train2, y_train2,
              eval_set=[(x_val, y_val)],
              eval_metric="multi_logloss",
              callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=True),
                         lgb.log_evaluation(0)])
    models.append([model, model.score(x_val, y_val)])
# Keep the three models with the best validation scores
models = sorted(models, key=lambda x: x[1], reverse=True)
model1 = models[0][0]
model2 = models[1][0]
model3 = models[2][0]
pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)

# Majority vote across the three predictions, per sample
y_pred = []
for i in range(len(pred1)):
    y_pred.append(stats.mode([pred1[i], pred2[i], pred3[i]])[0])
print(classification_report(y_test, y_pred))
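A note on the voting step: `scipy.stats.mode` has changed behavior across SciPy releases (recent versions return a scalar mode and may reject non-numeric labels), so if the class labels are strings a dependency-free majority vote with `collections.Counter` is safer. The name `majority_vote` below is my own, not from the original code:

```python
from collections import Counter

def majority_vote(*prediction_rows):
    """Column-wise majority vote over equal-length prediction
    sequences; ties go to the label seen first in that column."""
    return [Counter(col).most_common(1)[0][0]
            for col in zip(*prediction_rows)]

# Toy check: three models partially disagree
print(majority_vote([0, 1, 2], [0, 2, 2], [1, 2, 2]))  # [0, 2, 2]
```

This also works unchanged for string labels, which `stats.mode` may not accept.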
Single model
precision recall f1-score support
0 1.00 1.00 1.00 16
1 0.91 0.91 0.91 11
2 0.94 0.94 0.94 18
accuracy 0.96 45
macro avg 0.95 0.95 0.95 45
weighted avg 0.96 0.96 0.96 45
Best 3 models
precision recall f1-score support
0 1.00 1.00 1.00 16
1 0.91 0.91 0.91 11
2 0.94 0.94 0.94 18
accuracy 0.96 45
macro avg 0.95 0.95 0.95 45
weighted avg 0.96 0.96 0.96 45
Wine data
df = pd.read_csv("wine.csv")
y = df["Wine"]
x = df.drop("Wine", axis=1)
x_train, x_test, y_train, y_test = tts(x, y, random_state=100, test_size=0.3)
model = LGBMClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
models = []
for i in range(100):
    x_train2, x_val, y_train2, y_val = tts(x_train, y_train, random_state=i, test_size=0.2)
    model = LGBMClassifier()
    model.fit(x_train2, y_train2,
              eval_set=[(x_val, y_val)],
              eval_metric="multi_logloss",
              callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=True),
                         lgb.log_evaluation(0)])
    models.append([model, model.score(x_val, y_val)])
models = sorted(models, key=lambda x: x[1], reverse=True)
model1 = models[0][0]
model2 = models[1][0]
model3 = models[2][0]
pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)
y_pred = []
for i in range(len(pred1)):
    y_pred.append(stats.mode([pred1[i], pred2[i], pred3[i]])[0])
print(classification_report(y_test, y_pred))
Single model
precision recall f1-score support
1 0.88 1.00 0.93 14
2 1.00 0.84 0.91 19
3 0.95 1.00 0.98 21
accuracy 0.94 54
macro avg 0.94 0.95 0.94 54
weighted avg 0.95 0.94 0.94 54
Best 3 models
precision recall f1-score support
1 1.00 1.00 1.00 14
2 1.00 0.95 0.97 19
3 0.95 1.00 0.98 21
accuracy 0.98 54
macro avg 0.98 0.98 0.98 54
weighted avg 0.98 0.98 0.98 54
Breast cancer data
df = pd.read_csv("breast_cancer.csv")
y = df["y"]
x = df.drop("y", axis=1)
x_train, x_test, y_train, y_test = tts(x, y, random_state=100, test_size=0.3)
model = LGBMClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
models = []
for i in range(100):
    x_train2, x_val, y_train2, y_val = tts(x_train, y_train, random_state=i, test_size=0.2)
    model = LGBMClassifier()
    model.fit(x_train2, y_train2,
              eval_set=[(x_val, y_val)],
              eval_metric="binary_logloss",  # binary task, so not multi_logloss
              callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=True),
                         lgb.log_evaluation(0)])
    models.append([model, model.score(x_val, y_val)])
models = sorted(models, key=lambda x: x[1], reverse=True)
model1 = models[0][0]
model2 = models[1][0]
model3 = models[2][0]
pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)
y_pred = []
for i in range(len(pred1)):
    y_pred.append(stats.mode([pred1[i], pred2[i], pred3[i]])[0])
print(classification_report(y_test, y_pred))
Single model
precision recall f1-score support
0.0 0.98 0.90 0.94 69
1.0 0.94 0.99 0.96 102
accuracy 0.95 171
macro avg 0.96 0.94 0.95 171
weighted avg 0.95 0.95 0.95 171
Best 3 models
precision recall f1-score support
0.0 0.98 0.90 0.94 69
1.0 0.94 0.99 0.96 102
accuracy 0.95 171
macro avg 0.96 0.94 0.95 171
weighted avg 0.95 0.95 0.95 171
Conclusion
Accuracy improved only on the Wine data.
Summary
Still, LightGBM trains fast, so building 100 models and keeping the best three costs little. In these experiments the ensemble never did worse than the single model and improved on the Wine data, so I think it is worth trying.
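Since the same pipeline is repeated three times above, it can be factored into a reusable helper. This is only a sketch under my own naming (`train_best_k` and `vote_predict` are not from the original code), and it omits the early-stopping callbacks for brevity; any estimator with `fit`/`score`/`predict` works:

```python
from collections import Counter

def train_best_k(make_model, split, n_models=100, k=3):
    """Train n_models on different random splits and return the k
    models with the highest validation score.
    make_model() -> fresh estimator with fit/score/predict;
    split(seed)  -> (x_tr, y_tr, x_val, y_val)."""
    scored = []
    for seed in range(n_models):
        x_tr, y_tr, x_val, y_val = split(seed)
        model = make_model()
        model.fit(x_tr, y_tr)
        scored.append((model.score(x_val, y_val), model))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best first
    return [model for _, model in scored[:k]]

def vote_predict(models, x_test):
    """Per-sample majority vote over each model's predictions."""
    rows = [m.predict(x_test) for m in models]
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]
```

With LightGBM, `make_model` would be `LGBMClassifier` and `split` a closure over `tts(x_train, y_train, random_state=seed, test_size=0.2)`.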