A brute-force methodology for improving accuracy

Last updated at 2024-05-02, posted at 2024-05-02

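This is an idea I came up with just before finishing graduate school, when I happened to find a site called SIGNATE, discovered its beginner-only competition, and started wondering how to push my accuracy up. (I thought of it on day five, implemented it, and was promoted from Beginner to Intermediate.)
As the title says, it is brute force: in a loop, I change the train/test split on every iteration, train several machine learning algorithms that look likely to score well, store each model together with its accuracy in a list, and then predict with the few models that scored highest.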

Coding

Importing the libraries

Here I use gradient-boosted decision trees, an SVM (RBF kernel), and a random forest.
For the final evaluation I look at recall, precision, and F1 as well as accuracy.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import classification_report
import pandas as pd
import scipy.stats as stats

Loading the data

Here I use the wine dataset.

df = pd.read_csv("wine.csv")
df.head()

Separating features from the target, and "real" data from training data

The "real" data is a held-out set that is used only for the final prediction.

y = df["Wine"]
x = df.drop("Wine", axis=1)
x_val, x_real, y_val, y_real = tts(x, y, test_size=0.3, random_state=100)

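random_state is set to 100 so that it does not coincide with the values used in the for loop later.
The training portion, named with the val suffix, is split again inside the loop, and the models are built and evaluated on those splits.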
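
As a rough check of what these two splits leave to work with, the sizes can be worked out by hand. The sketch below assumes wine.csv is the 178-row UCI Wine dataset (the article does not state the row count) and that train_test_split rounds the test share up, which is how scikit-learn handles a float test_size. split_sizes is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a float test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# Assumption: wine.csv has 178 rows (the UCI Wine dataset).
n_val, n_real = split_sizes(178, 0.3)       # outer split: x_val / x_real
n_train, n_test = split_sizes(n_val, 0.3)   # inner split inside the loop

print(n_val, n_real)     # 124 54
print(n_train, n_test)   # 86 38
```

Under that assumption, the 54 held-out rows match the support total in the final classification report, and each model in the loop is scored on only about 38 rows, so a perfect score on a single split is not rare.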

Training in a loop

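Create a list called models to hold each model and its accuracy, then train gradient-boosted decision trees, an SVM, and a random forest. The loop variable i is passed as random_state so that the train/test split changes on every iteration.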

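models = []  # each entry is [fitted model, accuracy on that iteration's test split]
for i in range(100):
    # A different random_state gives a different train/test split each time.
    x_train, x_test, y_train, y_test = tts(x_val, y_val, test_size=0.3, random_state=i)
    model1 = GradientBoostingClassifier()
    model1.fit(x_train, y_train)
    model2 = SVC()  # RBF kernel by default
    model2.fit(x_train, y_train)
    model3 = RandomForestClassifier()
    model3.fit(x_train, y_train)
    # Store every fitted model with its score; 3 models x 100 iterations = 300 entries.
    models.append([model1, model1.score(x_test, y_test)])
    models.append([model2, model2.score(x_test, y_test)])
    models.append([model3, model3.score(x_test, y_test)])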

The top three models

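# Sort by accuracy, highest first (the lambda argument no longer shadows the feature table x).
models = sorted(models, key=lambda entry: entry[1], reverse=True)
print(models[0][1], models[1][1], models[2][1])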

1.0 1.0 1.0

All three of the top models appear to have scored 100% on their own test splits (reassuring).
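
Since each score comes from a small test split, many of the 300 entries can tie at 1.0. Python's sorted is stable even with reverse=True, so tied entries keep their insertion order, and the top three are then simply the earliest perfect scorers. A toy illustration with made-up labels and scores (not results from this article):

```python
# Toy [label, score] pairs in insertion order; the values are invented for illustration.
entries = [
    ["gbdt_seed0", 0.97],
    ["svm_seed0", 0.71],
    ["rf_seed0", 1.0],
    ["gbdt_seed1", 1.0],
    ["svm_seed1", 0.68],
    ["rf_seed1", 1.0],
    ["gbdt_seed2", 1.0],
]

# Same sort as in the article: by score, highest first.
ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
top3 = [label for label, _ in ranked[:3]]

# How many entries share the best possible score?
n_perfect = sum(1 for _, score in entries if score == 1.0)

print(top3)       # ['rf_seed0', 'gbdt_seed1', 'rf_seed1']
print(n_perfect)  # 4
```

Counting how many entries share the best score, as n_perfect does here, shows how much of the ranking is decided by ties rather than by real differences between models.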

Predicting with the top three models

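Because this is a multiclass problem, the three predictions are combined by taking the mode. For a binary problem the median would also work, but if the positive and negative votes happen to be equal in number it comes out as 0.5, so use an odd number of models.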

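# Predict the held-out data with each of the top three models.
y_pred1 = models[0][0].predict(x_real)
y_pred2 = models[1][0].predict(x_real)
y_pred3 = models[2][0].predict(x_real)
y_pred = []
for i in range(len(y_pred1)):
    # Majority vote: take the most common of the three predictions for sample i.
    y_pred.append(stats.mode([y_pred1[i], y_pred2[i], y_pred3[i]])[0])
print(classification_report(y_real, y_pred))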

              precision    recall  f1-score   support

           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00        19
           3       1.00      1.00      1.00        21

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54

Every single prediction was correct.
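
The voting rule described above can be tried in isolation. This sketch uses only the standard library (collections.Counter and statistics.median) rather than scipy.stats.mode; majority_vote is a hypothetical helper name and the predictions are made up.

```python
from collections import Counter
from statistics import median

def majority_vote(*predictions):
    """Combine equal-length prediction lists by taking the most common label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Made-up multiclass predictions from three models for four samples.
pred_a = [1, 2, 3, 3]
pred_b = [1, 2, 2, 3]
pred_c = [1, 3, 3, 1]
print(majority_vote(pred_a, pred_b, pred_c))  # [1, 2, 3, 3]

# Binary case: an even number of models can tie, and the median lands on 0.5.
print(median([0, 0, 1, 1]))  # 0.5
print(median([0, 1, 1]))     # 1
```

With three models and three classes, all three votes can still disagree. In this sketch Counter.most_common then returns the label it saw first, so the prediction of the first model passed in wins.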

Summary

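Sheer numbers are powerful!
That said, the features are not standardized here, so the top three are probably all gradient-boosted decision trees or random forests.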
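
One way to check that guess, and to give the SVM a fairer chance, is to put a scaler in front of it and to look at the class names of the selected models. This is a minimal sketch, not the setup used in this article: it assumes scikit-learn's bundled load_wine data as a stand-in for wine.csv, and wrapping SVC in a StandardScaler pipeline is a suggested variation rather than something the original code does.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: load_wine stands in for the article's wine.csv.
x, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# SVC on raw features, as in the article.
raw_svm = SVC().fit(x_train, y_train)

# SVC with standardization; the scaler is fitted on the training split only.
scaled_svm = make_pipeline(StandardScaler(), SVC()).fit(x_train, y_train)

raw_score = raw_svm.score(x_test, y_test)
scaled_score = scaled_svm.score(x_test, y_test)
print(f"raw SVC: {raw_score:.3f}, scaled SVC: {scaled_score:.3f}")

# The class name tells you which algorithm an entry holds,
# e.g. for the article's list: [type(m).__name__ for m, _ in models[:3]]
print(type(raw_svm).__name__, type(scaled_svm).__name__)
```

Because a pipeline exposes the same fit, score, and predict methods as a bare estimator, it could be appended to the models list in the loop without changing the rest of the code.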
