[Kaggle-style] Predicting California Housing Prices with LightGBM [RMSE: 0.46]

Posted at 2024-11-18

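Introduction

Last time, to brush up on Kaggle, I predicted Titanic survivors. This time I do regression, using the California housing prices dataset. (The title says "Kaggle-style," but I do not submit anything to Kaggle. The dataset is loaded through scikit-learn.)

What I did

Dataset: california_housing
Problem: regression
Machine learning model: LightGBM
Preprocessing, feature engineering: none
Cross-validation: KFold (3-fold)
Evaluation metric: RMSE

The result was an RMSE of 0.462 on the test data.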

=== RMSE for test data ===
0.4622482797858079

Libraries

# Import the main libraries
import lightgbm as lgb
import numpy as np
#import os
import pandas as pd
import random
#import torch

# Import the scikit-learn pieces
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.metrics import mean_squared_error

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

Fix the random seeds.

SEED = 42

# For reference: fixing the random seeds
def seed_everything(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    #os.environ['PYTHONHASHSEED'] = str(seed)
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed(seed)
    # torch.backends.cudnn.deterministic = True
seed_everything(SEED)
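As a quick sanity check of what seeding gives us, here is a minimal sketch that reseeds and draws twice. The helper `draw` is hypothetical (not part of the article), and the sketch only covers Python's `random` and NumPy's global generator, the two that `seed_everything` actually touches. In the article, the scikit-learn splitters and LightGBM are seeded separately through `random_state=SEED` and `'seed': SEED`.

```python
import random
import numpy as np

def seed_everything(seed=42):
    # Same two active calls as the article's helper
    random.seed(seed)
    np.random.seed(seed)

def draw():
    # Hypothetical helper: one draw from each global generator
    return random.random(), float(np.random.rand())

seed_everything(42)
first = draw()
seed_everything(42)
second = draw()

print(first == second)  # True: reseeding reproduces the same draws
```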

Loading the data

Load the data. There are a little over 20,000 rows.

data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

print(X.shape)
print(y.shape)

Hold out 20% of the data as test data, then reset the indices.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

print(X_train.shape) 
display(X_train.head(3))
print(y_train.shape) 
display(y_train.head(3))
print(X_test.shape) 
display(X_test.head(3))
print(y_test.shape) 
display(y_test.head(3))

(16512, 8)
(16512,)
(4128, 8)
(4128,)

Preparing cross-validation

This time I use 3-fold.

kf = KFold(n_splits=3, shuffle=True, random_state=SEED)
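To make the split concrete, here is a rough sketch of what a shuffled 3-fold split does to the 16,512 training rows. It is a hand-rolled NumPy illustration, not scikit-learn's implementation, so the exact row assignment will differ from `KFold(..., random_state=SEED)`; only the idea and the fold sizes carry over.

```python
import numpy as np

n_samples, n_splits = 16512, 3      # training rows and fold count from the article
rng = np.random.default_rng(42)     # assumption: NumPy RNG, not scikit-learn's

# Shuffle the row positions once, then cut them into n_splits chunks
folds = np.array_split(rng.permutation(n_samples), n_splits)

for fold_id, valid_index in enumerate(folds):
    # Every row outside the validation chunk is used for training
    train_index = np.setdiff1d(np.arange(n_samples), valid_index)
    print(fold_id, len(train_index), len(valid_index))
```

Each fold validates on 5,504 rows and trains on the other 11,008, and every row lands in exactly one validation chunk.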

Main loop

Implement the main loop. It mostly follows the previous Titanic article, but params is set up for regression.

y_preds = []
models = []
y_pred_oof = np.zeros((len(X_train),)) # predictions for the oof rows (the fold not used for training) of each split
rmse_scores = []
categorical_features = [] # categorical variables (none in this dataset)

# Hyperparameters
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'learning_rate': 0.1,
    'num_leaves': 31,
    'verbosity': -1,
    'seed': SEED
}

Start training.

# Main loop
for fold_id, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    # Show progress
    print('-------------------')
    print(f'Fold: {fold_id}')

    # Split into training and validation data
    # (label-based indexing works here because the indices were reset above)
    X_tr = X_train.loc[train_index, :]
    X_val = X_train.loc[valid_index, :]
    y_tr = y_train[train_index]
    y_val = y_train[valid_index]

    # Build the LightGBM datasets
    lgb_train = lgb.Dataset(X_tr, y_tr, categorical_feature=categorical_features)
    lgb_eval = lgb.Dataset(X_val, y_val, reference=lgb_train, categorical_feature=categorical_features)

    # Train (num_boost_round is left commented out, so the default of 100 rounds applies)
    model = lgb.train(params, lgb_train,
                      valid_sets=[lgb_train, lgb_eval],
                      #num_boost_round=1000,
                      callbacks=[lgb.early_stopping(stopping_rounds=10,
                                                    verbose=True),
                                 lgb.log_evaluation(period=10)])

    # Predict on the validation fold and on the test data
    y_pred_val = model.predict(X_val, num_iteration=model.best_iteration)
    y_pred_oof[valid_index] = y_pred_val
    y_pred = model.predict(X_test, num_iteration=model.best_iteration)

    # Compute RMSE on the validation fold
    rmse = np.sqrt(mean_squared_error(y_val, y_pred_val))

    # Store the results
    y_preds.append(y_pred)
    models.append(model)
    rmse_scores.append(rmse)
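The loop relies on early stopping with `stopping_rounds=10` and then predicts with `model.best_iteration`. The sketch below illustrates the general patience idea in plain Python. It is a simplified stand-in, not LightGBM's actual callback, and the function name `best_iteration` and the score list are hypothetical.

```python
def best_iteration(valid_scores, stopping_rounds=10):
    """Return the 1-based round with the lowest validation score, stopping
    once `stopping_rounds` rounds in a row fail to improve on it."""
    best_score = float("inf")
    best_round = 0
    for round_no, score in enumerate(valid_scores, start=1):
        if score < best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= stopping_rounds:
            break  # patience exhausted
    return best_round

# Hypothetical validation RMSE per boosting round (not from the article)
scores = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58]
print(best_iteration(scores, stopping_rounds=3))  # 4
```

If the score keeps improving until the last round, the last round is simply returned, which is the situation to watch for when the round limit is low.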

Evaluation

Check the predictions for the oof (out-of-fold) data, i.e. the rows that were not used for training in each fold.

print(y_pred_oof[:10])

[1.33857965 3.44364903 2.34178004 0.93449189 1.41572654 3.14414378
 1.54324824 4.68059735 1.9535667  2.90658437]
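For readers new to out-of-fold predictions, here is a toy sketch of how such an array gets filled. The data are random and the "model" is just the mean of the training-fold targets, both assumptions made purely for illustration. The point is that each row receives exactly one prediction, from a model that never saw it.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=12)                       # toy targets (hypothetical data)
folds = np.array_split(rng.permutation(len(y)), 3)

oof = np.full(len(y), np.nan)                 # NaN marks "not predicted yet"
for valid_idx in folds:
    train_mask = np.ones(len(y), dtype=bool)
    train_mask[valid_idx] = False             # hide the validation rows
    # Stand-in "model": predict the mean of the training-fold targets
    oof[valid_idx] = y[train_mask].mean()

filled = not np.isnan(oof).any()
print(filled)  # True: every row got a prediction
```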

Check the evaluation metric (RMSE) for each fold. Each is around 0.47.

print(rmse_scores)

[0.4762966660167601, 0.46730236799673963, 0.4783137514325842]

Take the mean and use it as the CV score.

cv_score = sum(rmse_scores) / len(rmse_scores)
print('=== CV score ===')
print(cv_score)

=== CV score ===
0.47397092848202793
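Averaging the fold RMSEs is one common way to get a CV score. Another is to compute a single RMSE over all out-of-fold rows. As a rough comparison, the sketch below derives that pooled value from the reported fold scores, assuming the three folds are the same size (16,512 / 3 = 5,504 rows each). The variable names `mean_of_rmse` and `pooled_rmse` are mine, not the article's.

```python
import math

# Per-fold RMSEs reported in the article
rmse_scores = [0.4762966660167601, 0.46730236799673963, 0.4783137514325842]

# The article's CV score: plain average of the fold RMSEs
mean_of_rmse = sum(rmse_scores) / len(rmse_scores)

# Alternative: RMSE over all out-of-fold rows at once. With equal-sized
# folds this is the root of the mean of the squared fold RMSEs.
pooled_rmse = math.sqrt(sum(s ** 2 for s in rmse_scores) / len(rmse_scores))

print(mean_of_rmse, pooled_rmse)
```

The pooled value can never be smaller than the plain average. Here the fold scores are so close to each other that the gap is tiny, so either number works as the CV score.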

Test

The predictions for the test data were already made inside the main loop, so check them.

print(y_preds)

[array([0.53551911, 0.94839274, 5.12449949, ..., 5.05884548, 0.74876133,
        1.68696057]),
 array([0.57497302, 0.79578338, 5.03911956, ..., 4.93604868, 0.6230162 ,
        1.69786877]),
 array([0.58674261, 0.90330143, 4.73546777, ..., 4.90567227, 0.65916445,
        1.77708261])]

Since this is 3-fold, there are three sets of predictions. Take their average.

y_sub = sum(y_preds) / len(y_preds)
print(y_sub)

array([0.56574492, 0.88249252, 4.96636227, ..., 4.96685548, 0.67698066,
       1.72063732])
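The same element-wise average can be written with `np.mean` over a stacked array. The sketch below checks this on the first three test predictions of each fold model, copied from the printed output above. The names `y_preds_head` and `y_sub_head` are hypothetical.

```python
import numpy as np

# First three test predictions from each fold model (from the printed output)
y_preds_head = [
    np.array([0.53551911, 0.94839274, 5.12449949]),
    np.array([0.57497302, 0.79578338, 5.03911956]),
    np.array([0.58674261, 0.90330143, 4.73546777]),
]

# Stack to shape (n_folds, n_rows) and average over the fold axis
y_sub_head = np.mean(np.stack(y_preds_head), axis=0)
print(y_sub_head)
```

Up to rounding of the printed digits, the result matches the first three values of `y_sub`.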

Compute the evaluation metric. The result is 0.462.

rmse_sub = np.sqrt(mean_squared_error(y_test, y_sub))
print('=== RMSE for test data ===')
print(rmse_sub)

=== RMSE for test data ===
0.4622482797858079
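For reference, RMSE is the square root of the mean squared residual. A minimal NumPy version, equivalent in spirit to `np.sqrt(mean_squared_error(...))`, might look like this. The function name `rmse` and the toy values are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical toy values (not from the article)
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Since the target in this dataset is the median house value in units of $100,000, an RMSE of 0.462 corresponds to a typical error of roughly $46,000.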

Conclusion

I was able to implement a regression problem as well. I would like to try hyperparameter tuning and ensemble learning separately.
