A performance comparison of three machine learning models: XGBoost (level-wise tree growth), LightGBM (leaf-wise tree growth), and CatBoost.

Last updated at 2024-08-19. Posted at 2024-08-19.

Title: "The Data Explorer"

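In a corner of the high-rise district in central Tokyo, programmer Shohei Sato sat at a quiet café table, immersed in a sea of data analysis. The laptop screen in front of him showed thousands of lines of code and complex graphs. Cut off from the bustle of Tokyo in this small space, Shohei worked through the night to unravel the mysteries of the data.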

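Shohei's goal was to compare the performance of different machine learning models. Specifically, he decided to evaluate what accuracy and processing time three models, XGBoost, LightGBM, and CatBoost, would show at different data sizes. His mission was to find the key to staying one step ahead in the world of data science.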

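First, to prepare a training dataset, Shohei obtained a large amount of random data from a data center in Tokyo. The data imitated user behavior and market trends. Tens of thousands of rows of sample data flowed into his laptop, becoming the touchstone for his algorithms.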

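His code begins by splitting the data into a training set and a test set. To see how XGBoost, LightGBM, and CatBoost each handle this data, Shohei trains and evaluates each model in turn. He measures each model's training time and accuracy (as mean squared error, MSE) and analyzes how data size affects them.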
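The timing step can be sketched as a small helper. This is a minimal sketch, not the code used in this article: the name `time_call` and the `repeats` parameter are hypothetical, and the choice to keep the fastest of several runs (to reduce one-off startup noise) is an assumption about what a fair measurement might look like.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times.

    Returns (result of the last call, fastest wall-clock time in seconds).
    `time_call` and `repeats` are hypothetical names for this sketch.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return result, best


# Usage with a stand-in workload; a real run would pass a training call instead.
total, seconds = time_call(sum, range(1_000_000))
print(f"result={total}, fastest run: {seconds:.4f} seconds")
```

Taking the minimum over a few repeats is only one possible design; a single measurement, as in the scripts in this article, also includes any first-call initialization cost.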

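As time went on, Shohei grew excited about the results. XGBoost boasted fast training and high accuracy, but LightGBM pushed the speed even further. However, it turned out that LightGBM's GPU support was not configured correctly, so he was forced to train it on the CPU. CatBoost was also highly accurate, and its performance on large datasets was especially impressive.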
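Falling back to the CPU when GPU training is unavailable can be expressed as a generic pattern. The sketch below is an illustration under assumptions, not what the article's scripts do: `train_with_fallback`, `train_fn`, and `fake_train` are hypothetical names, and the stand-in trainer only simulates a library build without GPU support.

```python
def train_with_fallback(train_fn, devices=("gpu", "cpu")):
    """Try train_fn(device) for each device in order.

    Returns (model, device_used). `train_fn` is a hypothetical callable that
    takes a device name and raises an exception if that device is unusable.
    """
    last_error = None
    for device in devices:
        try:
            return train_fn(device), device
        except Exception as exc:  # broad on purpose: library errors vary
            print(f"Training on {device!r} failed: {exc}")
            last_error = exc
    raise RuntimeError("no usable device") from last_error


# Stand-in trainer that simulates a build without GPU support.
def fake_train(device):
    if device == "gpu":
        raise RuntimeError("GPU support is not enabled in this build")
    return f"model trained on {device}"


model, used = train_with_fallback(fake_train)
print(used, "->", model)
```

With LightGBM, `train_fn` could wrap `lgb.train` and pass `'device': device` in its params; whether `'gpu'` works depends on how the installed LightGBM package was built.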

CPU run results.

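Shohei's code gradually took shape, and graphs of accuracy and processing time were beautifully drawn on his laptop. These graphs told the story of how he had grasped the characteristics of each model and found the best approach.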

GPU run results.

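As the night wore on, Shohei reviewed the results and smiled with satisfaction. Through this journey into data, he had pushed the limits of the technology and made new discoveries even amid the bustle of Tokyo. His adventure was about believing in the power of data and taking a step toward the future.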

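Shohei Sato, the data explorer, looked up at the Tokyo night sky with one more achievement in hand. Infinite possibilities stretched out before him. And his adventure would surely continue.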

CPU run results.

Data Size: 10000, Model: XGBoost, MSE: 0.0887, Time: 26.3228 seconds
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.020074 seconds.
You can set force_row_wise=true to remove the overhead.
And if memory is not enough, you can set force_col_wise=true.
[LightGBM] [Info] Total Bins 25500
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 100
[LightGBM] [Info] Start training from score 0.497693
Data Size: 10000, Model: LightGBM, MSE: 0.0881, Time: 23.4449 seconds
Data Size: 10000, Model: CatBoost, MSE: 0.0852, Time: 3.9915 seconds
Data Size: 50000, Model: XGBoost, MSE: 0.0838, Time: 7.0756 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.051625 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 25500
[LightGBM] [Info] Number of data points in the train set: 40000, number of used features: 100
[LightGBM] [Info] Start training from score 0.501954
Data Size: 50000, Model: LightGBM, MSE: 0.0835, Time: 2.8933 seconds
Data Size: 50000, Model: CatBoost, MSE: 0.0827, Time: 8.3140 seconds
Data Size: 100000, Model: XGBoost, MSE: 0.0838, Time: 10.5519 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.098018 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 25500
[LightGBM] [Info] Number of data points in the train set: 80000, number of used features: 100
[LightGBM] [Info] Start training from score 0.499684
Data Size: 100000, Model: LightGBM, MSE: 0.0833, Time: 14.6246 seconds
Data Size: 100000, Model: CatBoost, MSE: 0.0829, Time: 28.1169 seconds
Data Size: 200000, Model: XGBoost, MSE: 0.0827, Time: 39.6605 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.329360 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 25500
[LightGBM] [Info] Number of data points in the train set: 160000, number of used features: 100
[LightGBM] [Info] Start training from score 0.499769
Data Size: 200000, Model: LightGBM, MSE: 0.0825, Time: 44.3427 seconds
Data Size: 200000, Model: CatBoost, MSE: 0.0823, Time: 19.6659 seconds
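One way to read the MSE values in this log: the benchmark script draws the target `y` uniformly at random from [0, 1), independently of the features, so in expectation no model can do better than predicting the mean. The expected MSE of that prediction is the variance of the distribution, 1/12 ≈ 0.0833, and the values above all sit near that figure. The following is a minimal sketch of that baseline using only the standard library rather than NumPy; the function name `baseline_mse` and the seed are illustrative assumptions.

```python
import random
import statistics


def baseline_mse(n_train, n_test, seed=0):
    """MSE of predicting the training mean for a uniform random target.

    `baseline_mse` is a hypothetical helper; it mimics a target drawn
    uniformly from [0, 1) that is unrelated to any features.
    """
    rng = random.Random(seed)
    y_train = [rng.random() for _ in range(n_train)]
    y_test = [rng.random() for _ in range(n_test)]
    mean_pred = statistics.fmean(y_train)  # constant prediction
    return statistics.fmean((y - mean_pred) ** 2 for y in y_test)


# Same 80/20 split sizes as the 10000-row case in the benchmark.
mse = baseline_mse(8000, 2000)
print(f"baseline MSE: {mse:.4f}, theoretical variance: {1 / 12:.4f}")
```

Read this way, the log is mainly informative about training time; differences in MSE between the models are small relative to this noise floor.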

!pip install catboost

import numpy as np
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import time

# Data sizes to benchmark
data_sizes = [10000, 50000, 100000, 200000]
models = ['XGBoost', 'LightGBM', 'CatBoost']
results = {model: {'mse': [], 'time': []} for model in models}

for size in data_sizes:
    # Generate data
    X = np.random.rand(size, 100)  # random data with 100 features
    y = np.random.rand(size)      # random target variable
    
    # Split the data into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train and evaluate each model
    for model_name in models:
        if model_name == 'XGBoost':
            dtrain = xgb.DMatrix(X_train, label=y_train)
            dtest = xgb.DMatrix(X_test, label=y_test)
            
            params = {
                'objective': 'reg:squarederror',
                'tree_method': 'hist',  # use the CPU, not the GPU
                'eval_metric': 'rmse',
                'max_depth': 6,
                'eta': 0.1
            }
            
            start_time = time.time()
            model = xgb.train(params, dtrain, num_boost_round=100)
            end_time = time.time()
            
            y_pred = model.predict(dtest)
        
        elif model_name == 'LightGBM':
            train_data = lgb.Dataset(X_train, label=y_train)
            test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)
            
            params = {
                'objective': 'regression',
                'metric': 'l2',
                'device': 'cpu'  # use the CPU, not the GPU
            }
            
            start_time = time.time()
            model = lgb.train(params, train_data, num_boost_round=100)
            end_time = time.time()
            
            y_pred = model.predict(X_test)
        
        elif model_name == 'CatBoost':
            model = CatBoostRegressor(
                iterations=100,
                depth=6,
                learning_rate=0.1,
                loss_function='RMSE',
                task_type='CPU'  # use the CPU, not the GPU
            )
            
            start_time = time.time()
            model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=0)
            end_time = time.time()
            
            y_pred = model.predict(X_test)
        
        # Record accuracy and processing time
        mse = mean_squared_error(y_test, y_pred)
        time_taken = end_time - start_time
        
        results[model_name]['mse'].append(mse)
        results[model_name]['time'].append(time_taken)
        
        print(f"Data Size: {size}, Model: {model_name}, MSE: {mse:.4f}, Time: {time_taken:.4f} seconds")

# Plot the graphs
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))

# Accuracy graph
for model_name in models:
    ax1.plot(data_sizes, results[model_name]['mse'], marker='o', label=model_name)
ax1.set_xlabel('Data Size')
ax1.set_ylabel('MSE')
ax1.set_title('MSE vs Data Size')
ax1.set_yscale('log')  # MSE is often shown on a log scale
ax1.legend()
ax1.grid(True)

# Processing time graph
for model_name in models:
    ax2.plot(data_sizes, results[model_name]['time'], marker='o', label=model_name)
ax2.set_xlabel('Data Size')
ax2.set_ylabel('Time (seconds)')
ax2.set_title('Processing Time vs Data Size')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.show()

GPU run results.

Data Size: 10000, Model: XGBoost, MSE: 0.0867, Time: 1.0918 seconds
Data Size: 10000, Model: CatBoost, MSE: 0.0845, Time: 6.1613 seconds
Data Size: 50000, Model: XGBoost, MSE: 0.0856, Time: 1.5715 seconds
Data Size: 50000, Model: CatBoost, MSE: 0.0843, Time: 7.9125 seconds
Data Size: 100000, Model: XGBoost, MSE: 0.0826, Time: 1.8226 seconds
Data Size: 100000, Model: CatBoost, MSE: 0.0820, Time: 9.0797 seconds
Data Size: 200000, Model: XGBoost, MSE: 0.0837, Time: 1.3365 seconds
Data Size: 200000, Model: CatBoost, MSE: 0.0832, Time: 18.0088 seconds
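The CPU and GPU logs can be compared directly by dividing each CPU time by the matching GPU time. The sketch below does that arithmetic with the times copied from the two logs in this article; the variable names are illustrative, and because each configuration was run only once, the ratios are a rough indication rather than a stable measurement.

```python
# Training times in seconds, copied from the CPU and GPU logs in this article.
cpu_times = {
    "XGBoost": {10000: 26.3228, 50000: 7.0756, 100000: 10.5519, 200000: 39.6605},
    "CatBoost": {10000: 3.9915, 50000: 8.3140, 100000: 28.1169, 200000: 19.6659},
}
gpu_times = {
    "XGBoost": {10000: 1.0918, 50000: 1.5715, 100000: 1.8226, 200000: 1.3365},
    "CatBoost": {10000: 6.1613, 50000: 7.9125, 100000: 9.0797, 200000: 18.0088},
}

# Speedup > 1 means the GPU run was faster than the CPU run.
speedup = {
    model: {size: cpu_times[model][size] / gpu_times[model][size]
            for size in cpu_times[model]}
    for model in cpu_times
}

for model, by_size in speedup.items():
    for size, ratio in by_size.items():
        print(f"{model:8s} size={size:6d} CPU/GPU time ratio: {ratio:.2f}")
```

For XGBoost at 200,000 rows the ratio comes out close to 30, while for CatBoost at 10,000 rows it is below 1, meaning the GPU run was slower than the CPU run at that size.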

!pip install catboost

import numpy as np
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import time

# Data sizes to benchmark
data_sizes = [10000, 50000, 100000, 200000]
models = ['XGBoost', 'CatBoost']
results = {model: {'mse': [], 'time': []} for model in models}

for size in data_sizes:
    # Generate data
    X = np.random.rand(size, 100)  # random data with 100 features
    y = np.random.rand(size)      # random target variable
    
    # Split the data into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train and evaluate each model
    for model_name in models:
        if model_name == 'XGBoost':
            dtrain = xgb.DMatrix(X_train, label=y_train)
            dtest = xgb.DMatrix(X_test, label=y_test)
            
            params = {
                'objective': 'reg:squarederror',
                'tree_method': 'hist',  # histogram-based tree construction
                'device': 'cuda',  # use the GPU
                'eval_metric': 'rmse',
                'max_depth': 6,
                'eta': 0.1
            }
            
            start_time = time.time()
            model = xgb.train(params, dtrain, num_boost_round=100)
            end_time = time.time()
            
            y_pred = model.predict(dtest)
        
        elif model_name == 'CatBoost':
            model = CatBoostRegressor(
                iterations=100,
                depth=6,
                learning_rate=0.1,
                loss_function='RMSE',
                task_type='GPU'  # use the GPU
            )
            
            start_time = time.time()
            model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=0)
            end_time = time.time()
            
            y_pred = model.predict(X_test)
        
        # Record accuracy and processing time
        mse = mean_squared_error(y_test, y_pred)
        time_taken = end_time - start_time
        
        results[model_name]['mse'].append(mse)
        results[model_name]['time'].append(time_taken)
        
        print(f"Data Size: {size}, Model: {model_name}, MSE: {mse:.4f}, Time: {time_taken:.4f} seconds")

# Plot the graphs
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))

# Accuracy graph
for model_name in models:
    ax1.plot(data_sizes, results[model_name]['mse'], marker='o', label=model_name)
ax1.set_xlabel('Data Size')
ax1.set_ylabel('MSE')
ax1.set_title('MSE vs Data Size')
ax1.set_yscale('log')  # MSE is often shown on a log scale
ax1.legend()
ax1.grid(True)

# Processing time graph
for model_name in models:
    ax2.plot(data_sizes, results[model_name]['time'], marker='o', label=model_name)
ax2.set_xlabel('Data Size')
ax2.set_ylabel('Time (seconds)')
ax2.set_title('Processing Time vs Data Size')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.show()
