[Kaggle] Titanic with LightGBM and Optuna [acc: 0.768]

Last updated at 2025-02-24 / Posted at 2024-11-20

Introduction

Two posts ago, I submitted predictions to the Titanic competition using LightGBM. This time I introduce Optuna to tune the hyperparameters.

What I did

Machine learning model: LightGBM
Cross-validation: Stratified K-Fold (5-Fold)
Evaluation metric: Accuracy
Hyperparameter tuning: Optuna
- max_bin: searched 255 to 500 → 302
- num_leaves: searched 32 to 128 → 92
- Number of trials: 40
CV score (logloss): 0.406
Public score: 0.768

Libraries

import lightgbm as lgb
import numpy as np
import optuna
import pandas as pd

# scikit-learn
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

Loading the data

Download the CSV files beforehand, and set path to wherever you saved them.

path = '../input/titanic/'

train = pd.read_csv(f"{path}train.csv")
test = pd.read_csv(f"{path}test.csv")
sub = pd.read_csv(f"{path}gender_submission.csv")

Preprocessing & feature engineering

This is Titanic, so I do not go overboard here.

# Concatenate train and test
data = pd.concat([train, test], sort=False)

# Convert male to 0 and female to 1
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
# Fill missing Embarked values with the mode, S,
# then convert S to 0, C to 1, Q to 2
data['Embarked'] = data['Embarked'].fillna('S')
data['Embarked'] = data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
# Fill missing Fare values with the mean
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
# Fill missing Age values with the median
data['Age'] = data['Age'].fillna(data['Age'].median())
# Family size
data['FamilySize'] = data['Parch'] + data['SibSp'] + 1
# Whether the passenger is travelling alone
data['IsAlone'] = 0
data.loc[data['FamilySize'] == 1, 'IsAlone'] = 1

# Drop unused columns
delete_columns = ['Name', 'PassengerId', 'Ticket', 'Cabin']
data.drop(delete_columns, axis=1, inplace=True)

# Split back into training and test data
train = data[:len(train)]
test = data[len(train):]

# Split into target and features
y_train = train['Survived']
X_train = train.drop('Survived', axis=1)
X_test = test.drop('Survived', axis=1)

# Display
print(f"X_train: {X_train.shape}")
display(X_train.head())
print(f"y_train: {y_train.shape}")
display(y_train.head())
print(f"X_test: {X_test.shape}")
display(X_test.head())

SEED = 1234

# For reference: fix the random seeds
def seed_everything(seed=1234):
    #random.seed(seed)
    #os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    #torch.manual_seed(seed)
    #torch.cuda.manual_seed(seed)
    #torch.backends.cudnn.deterministic = True
seed_everything(SEED)

Preparing cross-validation

As in the previous post, I use Stratified K-Fold with 5 folds.

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
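
As a small illustration of what stratification does, the sketch below splits a made-up label vector (not the Titanic data) and prints the positive rate in each validation fold. The names y_toy, X_toy, and cv_toy are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 100 samples, 38% positive.
y_toy = np.array([1] * 38 + [0] * 62)
# Dummy features: stratification is driven by y, X only provides the sample count.
X_toy = np.zeros((len(y_toy), 1))

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)

fold_rates = []
for fold_id, (tr_idx, va_idx) in enumerate(cv_toy.split(X_toy, y_toy)):
    # Share of positive labels in this validation fold
    rate = y_toy[va_idx].mean()
    fold_rates.append(rate)
    print(f"fold {fold_id}: n_valid={len(va_idx)}, positive rate={rate:.2f}")
```

Each fold's positive rate should stay close to the overall 0.38, which is why the per-fold scores are comparable with each other.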

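Tuning

Tuning comes first. Before that, specify the categorical variables.

categorical_features = ['Embarked', 'Pclass', 'Sex'] # categorical variables

This is the main part of this post: using Optuna to optimize the hyperparameters. Two hyperparameters are tuned this time, max_bin and num_leaves. The score is logloss, although using accuracy to match the competition metric would also be reasonable.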

def objective(trial):
    params = {
        'objective': 'binary',
        'max_bin': trial.suggest_int('max_bin', 255, 500),
        'learning_rate': 0.05,
        'num_leaves': trial.suggest_int('num_leaves', 32, 128),
    }

    scores = []

    # Run cross-validation
    for fold_id, (train_index, valid_index) in enumerate(cv.split(X_train, y_train)):
        # Split into training and validation data (cv.split yields positional indices)
        X_tr = X_train.iloc[train_index, :]
        X_val = X_train.iloc[valid_index, :]
        y_tr = y_train.iloc[train_index]
        y_val = y_train.iloc[valid_index]

        # Build the datasets
        lgb_train = lgb.Dataset(X_tr, y_tr, categorical_feature=categorical_features)
        lgb_eval = lgb.Dataset(X_val, y_val, reference=lgb_train, categorical_feature=categorical_features)

        # Train the model
        model = lgb.train(
            params,
            lgb_train,
            valid_sets=[lgb_train, lgb_eval],
            num_boost_round=1000,
            callbacks=[
                lgb.early_stopping(
                    stopping_rounds=10,
                    verbose=False
                    ),
                lgb.log_evaluation(10)
            ]
        )

        ##################################
        # logloss
        ##################################
        # Predict on the validation data
        y_pred_val = model.predict(X_val, num_iteration=model.best_iteration)

        # Compute logloss
        score = log_loss(y_val, y_pred_val)
        scores.append(score)

    return np.mean(scores)
    
    ##################################
    # Accuracy (alternative to the logloss block above)
    ##################################
    #     # Predict on the validation data
    #     y_pred_val = model.predict(X_val, num_iteration=model.best_iteration)
    #     y_pred_val_binary = np.round(y_pred_val) # binarize

    #     # Compute accuracy
    #     accuracy = accuracy_score(y_val, y_pred_val_binary)
    #     scores.append(accuracy)

    # return np.mean(scores)
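
To make the choice between the two scores more concrete, here is a minimal sketch with made-up probabilities (not output from the model in this post). The helper functions binary_logloss and accuracy_at_half are hypothetical names written out in NumPy for illustration. Two sets of predictions can have the same accuracy while logloss still separates them, which gives the tuner a finer-grained signal.

```python
import numpy as np

def binary_logloss(y_true, p, eps=1e-15):
    """Mean negative log-likelihood of the true labels under probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def accuracy_at_half(y_true, p):
    """Share of correct labels after thresholding the probabilities at 0.5."""
    return float(np.mean((p > 0.5).astype(int) == y_true))

y_true = np.array([1, 0, 1, 0])
p_confident = np.array([0.9, 0.1, 0.8, 0.2])  # hypothetical model A
p_hesitant = np.array([0.6, 0.4, 0.6, 0.4])   # hypothetical model B

for name, p in [("confident", p_confident), ("hesitant", p_hesitant)]:
    acc = accuracy_at_half(y_true, p)
    ll = binary_logloss(y_true, p)
    print(f"{name}: accuracy={acc:.2f}, logloss={ll:.3f}")
```

Both hypothetical models classify every sample correctly, but the confident one gets the lower logloss.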

Run the optimization with 40 trials.

# Run hyperparameter optimization with Optuna

###############################
# logloss: smaller is better (Optuna minimizes by default)
###############################
study = optuna.create_study(
    sampler=optuna.samplers.RandomSampler(seed=SEED)
)
study.optimize(objective, n_trials=40)


###############################
# Accuracy: larger is better
###############################
# study = optuna.create_study(
# direction='maximize',
# sampler=optuna.samplers.RandomSampler(seed=SEED)
# )
# study.optimize(objective, n_trials=40)

Display the optimized hyperparameters. max_bin was 302 and num_leaves was 92.

print(study.best_params)

{'max_bin': 302, 'num_leaves': 92}

Retraining with the optimized hyperparameters

Train the model again using the optimized hyperparameters.

params = {
    'objective': 'binary',
    'max_bin': study.best_params['max_bin'],
    'learning_rate': 0.05,
    'num_leaves': study.best_params['num_leaves']
}

y_preds_test = []
oof_train = np.zeros((len(X_train),))
models = []

for fold_id, (train_index, valid_index) in enumerate(cv.split(X_train, y_train)):
    # Split into training and validation data (cv.split yields positional indices)
    X_tr = X_train.iloc[train_index, :]
    X_val = X_train.iloc[valid_index, :]
    y_tr = y_train.iloc[train_index]
    y_val = y_train.iloc[valid_index]

    # Build the datasets
    lgb_train = lgb.Dataset(X_tr, y_tr, categorical_feature=categorical_features)
    lgb_eval = lgb.Dataset(X_val, y_val, reference=lgb_train, categorical_feature=categorical_features)

    # Train the model
    model = lgb.train(
        params,
        lgb_train,
        valid_sets=[lgb_train, lgb_eval],
        num_boost_round=1000,
        callbacks=[
            lgb.early_stopping(
                stopping_rounds=10,
                verbose=False
                ),
            lgb.log_evaluation(10)
        ]
    )

    # Predict on the validation data (out-of-fold) and on the test data
    oof_train[valid_index] = model.predict(X_val, num_iteration=model.best_iteration)
    y_pred_test = model.predict(X_test, num_iteration=model.best_iteration)

    # Store the predictions and the trained model
    y_preds_test.append(y_pred_test)
    models.append(model)

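Evaluation

Store the validation score (logloss) of each fold in the variable scores and print it. The values ranged from 0.367 to 0.438. Last time they ranged from 0.38 to 0.48, so the spread is smaller.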

scores = [m.best_score['valid_1']['binary_logloss'] for m in models]
print(scores)

[0.37689105112604115, 0.43821552479915354, 0.36678411566241065, 0.4372929103068846, 0.41157843891030305]

Average the per-fold validation scores to get the final CV score. The result is 0.406. Last time it was 0.413, so it improved slightly.

# Average the per-fold validation scores to get the final CV score
score = sum(scores) / len(scores)
print('===CV scores===')
print(score)

===CV scores===
0.4061524081609586
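
The mean and the spread of the fold scores can be summarized together. This is a minimal sketch that reuses the five logloss values printed above; the variable names fold_scores, cv_mean, cv_std, and cv_range are made up for this example.

```python
import numpy as np

# Per-fold validation logloss values copied from the printed scores above.
fold_scores = np.array([
    0.37689105112604115,
    0.43821552479915354,
    0.36678411566241065,
    0.4372929103068846,
    0.41157843891030305,
])

cv_mean = fold_scores.mean()                      # the CV score
cv_std = fold_scores.std()                        # population standard deviation
cv_range = fold_scores.max() - fold_scores.min()  # gap between best and worst fold

print(f"mean={cv_mean:.3f}, std={cv_std:.3f}, range={cv_range:.3f}")
```

Reporting the standard deviation or range next to the mean makes comparisons such as "the spread is smaller than last time" easier to check.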

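Binarize oof_train (the predictions for each fold that was held out of training) at 0.5 and compute the accuracy. The result is 0.8305. Last time it was 0.829, so it increased slightly.

y_pred_oof = (oof_train > 0.5).astype(int)
print(accuracy_score(y_train, y_pred_oof)) # compute accuracy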

0.8305274971941639

Submission

Average the test-set predictions from each fold and binarize the result at 0.5.

y_sub = sum(y_preds_test) / len(y_preds_test)
y_sub = (y_sub > 0.5).astype(int)
print(y_sub[:10])

[0 0 0 0 0 0 1 0 1 0]
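
As a side note, averaging probabilities and then thresholding is not the same as thresholding each fold and taking a majority vote. The sketch below uses made-up probabilities for four passengers (not the actual fold predictions) to show a case where the two disagree; all names in it are hypothetical.

```python
import numpy as np

# Hypothetical test-set probabilities from 5 fold models for 4 passengers.
fold_preds = [
    np.array([0.10, 0.55, 0.90, 0.45]),
    np.array([0.20, 0.60, 0.80, 0.40]),
    np.array([0.15, 0.40, 0.95, 0.99]),
    np.array([0.05, 0.52, 0.85, 0.45]),
    np.array([0.10, 0.58, 0.90, 0.45]),
]

stacked = np.vstack(fold_preds)    # shape: (n_folds, n_samples)
mean_prob = stacked.mean(axis=0)   # same as sum(preds) / len(preds)
label_soft = (mean_prob > 0.5).astype(int)

# Alternative: threshold each fold first, then take a majority vote.
votes = (stacked > 0.5).astype(int).sum(axis=0)
label_vote = (votes > len(fold_preds) / 2).astype(int)

print(label_soft)
print(label_vote)
```

For the last passenger, one very confident fold pulls the averaged probability above 0.5 even though four of the five folds predict 0. Averaging, as done in this post, keeps that confidence information, while voting discards it.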

Create the submission file.

# Create the submission file
sub['Survived'] = y_sub
sub.to_csv('submission.csv', index=False)
display(sub.head())

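Submit the output CSV file to Kaggle. The result was about 0.768. Last time it was about 0.765, so the score went up slightly.

Conclusion

Introducing hyperparameter tuning with Optuna raised the score slightly, which is a good outcome.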
