
2nd Financial Data Utilization Challenge (第2回金融データ活用チャレンジ): 14th-place solution on the all-submissions final LB


Last updated at 2024-03-02, posted at 2024-03-02

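About this article

This article shares my solution, which placed 14th on the all-submissions final LB of the 2nd Financial Data Utilization Challenge.
The score turned out far better than my intuition expected, and I am posting this to share that strange experience with anyone who is interested.

What does "better than intuition expected" mean?

Score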

public	private
0.6889483	0.6884709

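The mysterious model

The mysterious model had a CV score of 0.66 (with a limit of 5 submissions per day, I have no memory of why I submitted it).
It is embarrassing, but seeing is believing.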

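Cross validation (RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=42)) repeats training and evaluation 10 times.
The trained model is saved at each iteration.
The model from the 5th iteration gave the worst evaluation result.
Y axis: MacroF1. X axis: threshold on the model's predicted value (predict_proba); values at or above the threshold are classified as MIS_Status=1.
The result is extremely close to 0.66.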
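
The threshold curve described above can be reproduced roughly as follows. This is only a minimal sketch: the helper name `macro_f1_by_threshold` and the toy arrays are hypothetical and are not taken from the competition data or the author's notebook. It uses scikit-learn's `f1_score` with `average="macro"`.

```python
import numpy as np
from sklearn.metrics import f1_score

def macro_f1_by_threshold(y_true, y_score, thresholds):
    """Return the macro F1 at each threshold (score >= threshold -> class 1)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    return np.array([
        f1_score(y_true, (y_score >= t).astype(int), average="macro")
        for t in thresholds
    ])

# Hypothetical toy data, only to show the call shape (not competition data)
y_true = np.array([0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.60, 0.40, 0.70, 0.80, 0.90])
thresholds = np.array([0.3, 0.5, 0.65])

scores = macro_f1_by_threshold(y_true, y_score, thresholds)
best = thresholds[int(np.argmax(scores))]  # threshold with the highest macro F1
```

Plotting `scores` against `thresholds` for a saved model gives the kind of curve described here, with MacroF1 on the Y axis and the threshold on the X axis.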

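This model ended up with a final score of 0.6884709.

Why did this model exceed 0.688 on both public and private?

I don't know.
From the thermodynamics I dabbled in more than 40 years ago, I recall that order breaks down completely when a phase transition occurs (e.g. the high-temperature expansion of the Ising model). I have also heard a rumor that, with the public score on the X axis and the private score on the Y axis, the regression R2 is 0.99 but drops to around 0.3 when only scores above a certain value are included. So I daydream that insights from thermodynamics might offer some hint.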
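
One way to check a rumor of this kind, given a table of public and private scores, is to compute R2 on the full range and again on the top slice only. The sketch below uses purely synthetic numbers (the score range, the noise level, and the 0.68 cutoff are all assumptions, not leaderboard data), so it only illustrates the calculation. It says nothing about what actually happened in the competition.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson correlation)."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)

rng = np.random.default_rng(0)

# Synthetic stand-in for a leaderboard: NOT real competition scores
public = rng.uniform(0.50, 0.69, size=2000)
private = public + rng.normal(0.0, 0.003, size=public.size)

r2_all = r_squared(public, private)

# Restrict to entries whose public score is at or above an assumed cutoff
top = public >= 0.68
r2_top = r_squared(public[top], private[top])
```

In this synthetic setup `r2_top` comes out lower than `r2_all` simply because the restricted slice has much less spread in the public score while the noise stays the same.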

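Overview of the mysterious model

Feature engineering and model overview

See "databricksでコンペにチャレンジ" (Taking on a competition with Databricks).

cross validation

The structure has two levels: an outer loop (n_splits=10) and an inner loop (n_splits=5).

Outer loop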

import numpy as np
import pandas as pd
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler

TARGET_COLUMN='MIS_Status'
SEED=42
def train_outer(estimators,train):
  # Outer loop: 10 stratified folds (n_repeats=1, so a single pass)
  skf = RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=SEED)
  X = train.copy()
  y = X[TARGET_COLUMN]
  # ::: (omitted)
  for i, (train_index, test_index) in enumerate(skf.split(X, y)):
      X_train = X.iloc[train_index].copy()
      y_train = y.iloc[train_index].copy()
      X_test = X.iloc[test_index].copy()
      y_test = y.iloc[test_index].copy()
      # Get the 5 trained models produced by the inner loop (n_splits=5)
      score, raw, models = train_model(estimators, X_train, i)
      # ::: (omitted)
      for j,[inner_iter, name, model] in enumerate(models):
          pred[j] = model.predict_proba(X_test.drop(columns=[TARGET_COLUMN]))[:,1]
      # Rescale each model's predictions to [0, 1], then average across models
      pred = pd.DataFrame(MinMaxScaler().fit_transform(pred))
      y_pred = pred[pred.columns].apply(np.mean,axis=1)
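
The last two lines of the fragment above rescale each model's predictions and then average them per row. A small sketch with made-up numbers (the three columns are hypothetical model outputs, not real predictions) shows what that does: `MinMaxScaler` maps each column independently onto [0, 1], so the averaged score is no longer a raw probability.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical predict_proba outputs of three models on four rows
pred = pd.DataFrame({
    0: [0.2, 0.4, 0.6, 0.8],
    1: [0.5, 0.6, 0.7, 0.9],
    2: [0.1, 0.1, 0.3, 0.5],
})

# Each column (model) is rescaled so that its own min -> 0 and max -> 1
scaled = pd.DataFrame(MinMaxScaler().fit_transform(pred))

# Row-wise mean across models, equivalent to apply(np.mean, axis=1)
y_pred = scaled.mean(axis=1)
```

Because the scaler is fitted on whatever batch is being scored, any threshold applied to `y_pred` is relative to the minimum and maximum prediction in that batch.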

Inner loop

import copy

def train_model(estimators,train_prepared,iter):
    # Inner loop: 5 stratified folds over the outer training split
    skf = RepeatedStratifiedKFold(n_repeats=1, n_splits=5, random_state=SEED)
    X = train_prepared.copy()
    y = X[TARGET_COLUMN]
    # ::: (omitted)
    for i, (train_index, test_index) in enumerate(skf.split(X, y)):
        X_train = X.iloc[train_index].copy()
        y_train = y.iloc[train_index].copy()
        X_test = X.iloc[test_index].copy()
        y_test = y.iloc[test_index].copy()
        for name, mod in estimators:
            # ::: (omitted)
            mod = mod.fit(X_train, y_train)
            y_pred = mod.predict_proba(X_test)[:,1]
        # ::: (omitted)
        # Keep a copy of the model fitted on this fold (the last entry of estimators)
        models.append([i,name,copy.deepcopy(mod)])
    # ::: (omitted)
    return df_score,df_raw,models
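
The two fragments above omit several steps, so here is a self-contained sketch of the same two-level idea that can be run as is. It is not the author's pipeline: the synthetic data from `make_classification`, the `LogisticRegression` estimator, and the names `train_inner`, `all_models`, and `outer_probas` are all assumptions made for illustration, and the min-max rescaling step is left out for brevity.

```python
import copy
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

SEED = 42
# Synthetic stand-in for the prepared training data (assumption)
X, y = make_classification(n_samples=500, n_features=8, random_state=SEED)

def train_inner(estimator, X_tr, y_tr, n_splits=5):
    """Fit one copy of `estimator` per inner fold and return the fitted copies."""
    skf = RepeatedStratifiedKFold(n_repeats=1, n_splits=n_splits, random_state=SEED)
    fitted = []
    for fit_idx, _ in skf.split(X_tr, y_tr):
        model = copy.deepcopy(estimator)
        model.fit(X_tr[fit_idx], y_tr[fit_idx])
        fitted.append(model)
    return fitted

outer = RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=SEED)
all_models = []    # flat list: 5 inner models per outer iteration
outer_probas = []  # averaged class-1 probability on each held-out outer fold

for train_idx, test_idx in outer.split(X, y):
    fold_models = train_inner(
        LogisticRegression(max_iter=1000), X[train_idx], y[train_idx]
    )
    all_models.extend(fold_models)
    # Average the inner models' probabilities on the outer test fold
    proba = np.mean(
        [m.predict_proba(X[test_idx])[:, 1] for m in fold_models], axis=0
    )
    outer_probas.append(proba)
```

With 10 outer folds and 5 inner folds, `all_models` ends up holding 50 fitted models, 5 for each outer iteration.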

submission.csv

The predict_proba threshold is 0.835; models[20:25] holds the models from the 5th outer iteration.

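# Score the test set with the 5 models from the 5th outer iteration
pred = pd.DataFrame()
for i,[iter, name, model] in enumerate(models[20:25]):
    pred[i] = model.predict_proba(test_prepared)[:,1]
# Rescale each model's predictions to [0, 1], average them, then apply the 0.835 threshold
pred = pd.DataFrame(MinMaxScaler().fit_transform(pred))
pred_final = (pred[pred.columns].apply(np.mean,axis=1) > 0.835).astype(int)
# Share of each predicted class
pred_final.value_counts()/len(pred_final)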
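
The slice `models[20:25]` follows from the fold counts if `models` is a flat list that gains 5 entries per outer iteration, which is what the description above implies. The helper below is a hypothetical illustration of that index arithmetic and is not part of the original source.

```python
MODELS_PER_OUTER = 5  # one model per inner fold (inner n_splits=5)

def models_for_outer(models, outer_iter):
    """Return the models of a 1-based outer iteration from a flat list."""
    start = (outer_iter - 1) * MODELS_PER_OUTER
    return models[start:start + MODELS_PER_OUTER]

# Stand-in for 10 outer iterations x 5 inner models, using indices as placeholders
flat = list(range(50))
fifth = models_for_outer(flat, 5)  # same elements as flat[20:25]
```

Under that assumption the 5th outer iteration starts at index (5 - 1) * 5 = 20 and ends just before index 25.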

Source

Apologies for the incorrect comments and hard-to-read source.
Source
