Introduction to Data Analysis in Python with CatBoost

Posted at 2024-09-03

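Introduction

CatBoost is a powerful gradient boosting library that has drawn considerable attention in machine learning. This article covers CatBoost in Python across 15 chapters, from the basics to more advanced usage. Each chapter pairs a short explanation with sample code so that beginners can follow along.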

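Chapter 1: What Is CatBoost?

CatBoost is a gradient boosting library developed by Yandex. It handles categorical variables natively and can build fast, accurate predictive models.

First, install CatBoost. The leading ! runs the command from a Jupyter notebook cell; drop it when installing from a terminal.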

!pip install catboost

Chapter 2: Preparing the Data

Before using CatBoost, we need some data. Here we use the well-known Iris dataset bundled with scikit-learn.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
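
Because Iris has only 150 rows, a purely random split can leave the classes slightly unbalanced between the two sets. The following is a minimal sketch, assuming you want each class represented proportionally: it passes stratify=y to train_test_split. The variable names with the _s suffix are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many samples of each class landed in each subset
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print("train class counts:", train_counts)
print("test class counts:", test_counts)
```

With 50 samples per class and a 20% test set, each class contributes the same number of rows to the test set.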

Chapter 3: CatBoostClassifier Basics

We build a basic classification model with CatBoostClassifier.

from catboost import CatBoostClassifier

# Initialize the model
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=5, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

Chapter 4: Evaluating the Model

To evaluate the model, we check its accuracy and its confusion matrix.

from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Compute accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")

# Plot the confusion matrix
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
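
The confusion matrix also gives per-class precision and recall, which accuracy alone hides. The sketch below derives them with NumPy from a small set of toy labels; the arrays y_true_demo and y_pred_demo are made up for illustration and are not results from the Iris model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (not results from the Iris model)
y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 1, 2, 2, 2, 2, 1, 2])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Rows are actual classes, columns are predicted classes
true_positives = np.diag(cm_demo)
recall = true_positives / cm_demo.sum(axis=1)     # correct / actual count per class
precision = true_positives / cm_demo.sum(axis=0)  # correct / predicted count per class
accuracy_demo = true_positives.sum() / cm_demo.sum()

print(cm_demo)
print("recall per class:", recall)
print("precision per class:", precision)
print("accuracy:", accuracy_demo)
```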

Chapter 5: Feature Importance

CatBoost exposes feature importances on the trained model, so they are easy to visualize.

# Plot the feature importances
feature_importance = model.feature_importances_
feature_names = iris.feature_names

plt.bar(feature_names, feature_importance)
plt.title('Feature Importance')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.show()

Chapter 6: Hyperparameter Tuning

We use GridSearchCV to search for the best hyperparameters. Passing verbose=0 keeps CatBoost from printing a training log for every fit.

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'iterations': [100, 200],
    'learning_rate': [0.01, 0.1],
    'depth': [4, 6, 8]
}

# Run GridSearchCV
grid_search = GridSearchCV(CatBoostClassifier(random_state=42, verbose=0), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.2f}")
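
Grid search cost grows multiplicatively with every value added to the grid, so it helps to count the fits before starting. The sketch below uses scikit-learn's ParameterGrid to enumerate the same grid; the names cv_folds, n_combinations, and n_fits are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'iterations': [100, 200],
    'learning_rate': [0.01, 0.1],
    'depth': [4, 6, 8]
}
cv_folds = 5

# Every combination of the listed values is tried once per fold
combinations = list(ParameterGrid(param_grid))
n_combinations = len(combinations)   # 2 * 2 * 3
n_fits = n_combinations * cv_folds   # one fit per combination per fold

print(f"combinations: {n_combinations}, cross-validation fits: {n_fits}")
print("first combination:", combinations[0])
```

With its default refit=True, GridSearchCV then trains one more model on the full training set using the best combination.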

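Chapter 7: Handling Categorical Variables

Next we look at one of CatBoost's main strengths: it accepts categorical variables directly, without manual encoding. The tiny DataFrame below is a toy example.

import pandas as pd
from catboost import Pool

# Create a DataFrame that contains a categorical column
df = pd.DataFrame({
    'numeric': [1, 2, 3, 4, 5],
    'category': ['A', 'B', 'A', 'C', 'B'],
    'target': [0, 1, 0, 1, 1]
})

# Indices of the categorical columns (after dropping 'target')
cat_features = [1]  # index of the 'category' column

# Create the Pool object
train_pool = Pool(df.drop('target', axis=1), df['target'], cat_features=cat_features)

# Train the model
model = CatBoostClassifier(iterations=100, verbose=0)
model.fit(train_pool)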
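
To give a feel for what happens to a categorical column internally, here is a simplified sketch of ordered target statistics, the idea behind CatBoost's categorical encoding. It is not CatBoost's implementation: it processes rows in their given order rather than over random permutations, and the prior value of 0.5 is a hypothetical choice for this illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B'],
    'target': [0, 1, 0, 1, 1]
})

prior = 0.5  # hypothetical prior value chosen for this illustration

encoded = []
seen_count = {}   # number of earlier rows per category
seen_target = {}  # sum of targets over earlier rows per category
for category, target in zip(df_demo['category'], df_demo['target']):
    count = seen_count.get(category, 0)
    target_sum = seen_target.get(category, 0)
    # Use only earlier rows, so a row's own target never leaks into its encoding
    encoded.append((target_sum + prior) / (count + 1))
    seen_count[category] = count + 1
    seen_target[category] = target_sum + target

df_demo['category_encoded'] = encoded
print(df_demo)
```

The first occurrence of each category falls back to the prior, and later occurrences move toward the mean target of the rows seen so far.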

Chapter 8: Early Stopping

Early stopping halts training once the validation metric has stopped improving for a given number of rounds, which helps prevent overfitting. Here X and y are the Iris arrays from Chapter 2.

from sklearn.model_selection import train_test_split

# Split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train with early stopping
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=5, random_state=42)
model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=20, verbose=100)

print(f"Best iteration: {model.best_iteration_}")
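
The stopping rule itself is simple enough to write out. The sketch below is an illustrative re-implementation of a patience-based rule, not CatBoost's overfitting detector; the function find_stopping_point and the loss values are made up for this example.

```python
def find_stopping_point(validation_losses, patience):
    """Return (best_iteration, stopped_at) for a sequence of validation losses."""
    best_loss = float('inf')
    best_iteration = -1
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` iterations in a row: stop here
            return best_iteration, iteration
    # Ran through every iteration without triggering the rule
    return best_iteration, len(validation_losses) - 1


# Made-up loss curve: improves until iteration 3, then stops improving
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.51, 0.56, 0.60]
best_iteration, stopped_at = find_stopping_point(losses, patience=3)
print(f"best iteration: {best_iteration}, stopped at: {stopped_at}")
```

Training runs a few iterations past the best point before stopping, which is why the best iteration is reported separately from the last one.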

Chapter 9: Cross-Validation

We use cross-validation to check how stable the model's performance is across different splits of the data.

from sklearn.model_selection import cross_val_score

# Run cross-validation
cv_scores = cross_val_score(CatBoostClassifier(iterations=100, random_state=42, verbose=0), X, y, cv=5)

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean score: {cv_scores.mean():.2f} (+/- {cv_scores.std() * 2:.2f})")
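
When cv is an integer, scikit-learn uses stratified folds for estimators it recognizes as classifiers, so each fold keeps the class proportions of the full dataset. The sketch below builds such folds explicitly with StratifiedKFold and counts the classes in each validation fold; the variable fold_class_counts is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5)
fold_class_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Count how many samples of each class ended up in this validation fold
    counts = np.bincount(y[val_idx], minlength=3)
    fold_class_counts.append(counts.tolist())
    print(f"fold {fold}: validation class counts = {counts.tolist()}")
```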

Chapter 10: Multiclass Classification

Iris has three classes, so we can use it to look at multiclass classification and per-class probabilities.

# Train a multiclass model
multiclass_model = CatBoostClassifier(iterations=100, random_state=42, verbose=0)
multiclass_model.fit(X_train, y_train)

# Get the predicted probabilities
probabilities = multiclass_model.predict_proba(X_test)

# Print the predicted probability of each class
for i, (true_label, probs) in enumerate(zip(y_test, probabilities)):
    print(f"Sample {i+1}: true label = {true_label}")
    for class_idx, prob in enumerate(probs):
        print(f"  Probability of class {class_idx}: {prob:.2f}")
    print()
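
In multiclass boosting, class probabilities are commonly obtained by applying softmax to one raw score per class, and the predicted label is the class with the highest probability. The sketch below illustrates that relationship with NumPy; the raw scores are made up and the softmax helper is written here for illustration rather than taken from CatBoost.

```python
import numpy as np


def softmax(raw_scores):
    """Convert rows of raw scores into rows of probabilities."""
    # Subtract the row maximum first for numerical stability
    shifted = raw_scores - raw_scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


# Made-up raw scores: one row per sample, one column per class
raw = np.array([
    [2.0, 0.5, -1.0],
    [0.0, 0.0, 0.0],
    [-0.5, 1.5, 1.0],
])

probs = softmax(raw)
predicted = probs.argmax(axis=1)  # class with the highest probability

print(np.round(probs, 3))
print("predicted classes:", predicted)
```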

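Chapter 11: Regression

Next we tackle a regression problem with CatBoostRegressor. The Boston housing dataset (load_boston) was removed in scikit-learn 1.2, so this example uses the bundled diabetes dataset instead. The regression variables get their own names so that the Iris split from Chapter 8 stays available for the later chapters.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Load the diabetes dataset (used in place of the removed load_boston)
diabetes = load_diabetes()
X_reg, y_reg = diabetes.data, diabetes.target

# Split the data
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Train the model
regressor = CatBoostRegressor(iterations=100, random_state=42, verbose=0)
regressor.fit(X_reg_train, y_reg_train)

# Predict and evaluate
y_pred = regressor.predict(X_reg_test)
mse = mean_squared_error(y_reg_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean squared error (MSE): {mse:.2f}")
print(f"Root mean squared error (RMSE): {rmse:.2f}")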
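
An RMSE value is easier to judge next to a baseline. The sketch below computes MSE and RMSE by hand on toy numbers and compares them with a baseline that always predicts the mean; the arrays are made up for illustration and are not output from the regressor above.

```python
import numpy as np

# Toy values for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 10.0])

# MSE is the mean of the squared errors; RMSE is its square root
mse_demo = np.mean((y_true_demo - y_pred_demo) ** 2)
rmse_demo = np.sqrt(mse_demo)

# Baseline: always predict the mean of the targets
baseline_pred = np.full_like(y_true_demo, y_true_demo.mean())
baseline_mse = np.mean((y_true_demo - baseline_pred) ** 2)
baseline_rmse = np.sqrt(baseline_mse)

# R^2 expresses the improvement over that baseline
r2_demo = 1 - mse_demo / baseline_mse

print(f"MSE: {mse_demo:.2f}, RMSE: {rmse_demo:.3f}")
print(f"baseline RMSE: {baseline_rmse:.3f}, R^2: {r2_demo:.2f}")
```

RMSE is in the same units as the target, which is why it is usually the number reported alongside MSE.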

Chapter 12: Feature Quantization

CatBoost buckets each numeric feature into discrete bins before building trees, and the feature_border_type parameter selects how the bin borders are chosen. This setting does not create new features: the model still sees the same four Iris features, but different borders can change which splits are available and therefore the resulting importances.

# Train a model with a non-default border selection mode
model_with_features = CatBoostClassifier(iterations=100, random_state=42, feature_border_type='UniformAndQuantiles')
model_with_features.fit(X_train, y_train)

# Confirm the number of features the model uses (unchanged from the input)
print(f"Number of features: {len(model_with_features.feature_names_)}")

# Plot the feature importances
feature_importance = model_with_features.feature_importances_
feature_names = iris.feature_names

plt.figure(figsize=(10, 6))
plt.bar(feature_names, feature_importance)
plt.title('Feature Importance (UniformAndQuantiles Borders)')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
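
The feature_border_type option used above controls how numeric values are bucketed before tree building. The sketch below contrasts two simple strategies, uniform and quantile borders, on a skewed toy array. It illustrates the general idea only and is not CatBoost's border selection algorithm; the values and the bin count are made up.

```python
import numpy as np

# Skewed toy values: most are small, two are large
values = np.array([1.0, 1.2, 1.5, 2.0, 2.2, 3.0, 8.0, 20.0])
n_bins = 4

# Uniform: equally spaced borders between the minimum and maximum
uniform_borders = np.linspace(values.min(), values.max(), n_bins + 1)[1:-1]

# Quantile: borders placed so each bin holds roughly the same number of values
quantile_borders = np.quantile(values, [0.25, 0.5, 0.75])

# Assign each value to a bin index and count the values per bin
uniform_counts = np.bincount(np.digitize(values, uniform_borders), minlength=n_bins)
quantile_counts = np.bincount(np.digitize(values, quantile_borders), minlength=n_bins)

print("uniform borders:", uniform_borders, "counts:", uniform_counts)
print("quantile borders:", quantile_borders, "counts:", quantile_counts)
```

On skewed data, uniform borders crowd most values into one bin, while quantile borders spread them evenly. Which behavior helps depends on the feature.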

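Chapter 13: Interpreting the Model

We use the SHAP library to interpret the model's predictions. Here model is the Iris classifier trained in Chapter 8 and X_test is its test split.

import shap

# Compute SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Visualize the SHAP values
shap.summary_plot(shap_values, X_test, feature_names=iris.feature_names)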

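Chapter 14: Saving and Loading Models

This chapter shows how to save a trained model and load it again later. To check the round trip, both models predict on the same data.

import numpy as np

# Save the model
model.save_model('catboost_model.cbm')

# Load the model
loaded_model = CatBoostClassifier()
loaded_model.load_model('catboost_model.cbm')

# Predict with the original and the loaded model on the same data
original_predictions = model.predict(X_test)
loaded_predictions = loaded_model.predict(X_test)

# Confirm that the two sets of predictions match
print(f"Predictions match: {np.array_equal(original_predictions, loaded_predictions)}")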

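Chapter 15: Advanced CatBoost Features

As an advanced feature, CatBoost accepts a user-defined objective. It is passed to loss_function as an object with a calc_ders_range method that returns, for each sample, the first and second derivatives of the objective with respect to the raw prediction. The derivatives below follow the sign convention of the example in the CatBoost documentation: they are derivatives of the log-likelihood, the negative of log loss. Because calc_ders_range handles single-output objectives, this example reduces Iris to a binary problem (virginica versus the rest).

import numpy as np
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Custom objective: log-likelihood for binary classification
class LoglossObjective:
    def calc_ders_range(self, approxes, targets, weights):
        # approxes are raw scores; return (first, second) derivatives per sample
        result = []
        for i in range(len(targets)):
            p = 1.0 / (1.0 + np.exp(-approxes[i]))
            der1 = targets[i] - p
            der2 = -p * (1.0 - p)
            if weights is not None:
                der1 *= weights[i]
                der2 *= weights[i]
            result.append((der1, der2))
        return result

# Binary version of Iris: is the flower virginica (class 2)?
X_bin, y_bin = load_iris(return_X_y=True)
y_bin = (y_bin == 2).astype(int)
X_bin_train, X_bin_test, y_bin_train, y_bin_test = train_test_split(X_bin, y_bin, test_size=0.2, random_state=42)

# Specify eval_metric explicitly when using a custom objective
custom_model = CatBoostClassifier(
    iterations=100,
    random_state=42,
    loss_function=LoglossObjective(),
    eval_metric='Logloss',
    verbose=0
)

# Train the model
custom_model.fit(X_bin_train, y_bin_train)

# Threshold the raw scores at 0 (probability 0.5) to get class labels
raw_scores = custom_model.predict(X_bin_test, prediction_type='RawFormulaVal')
custom_predictions = (raw_scores > 0).astype(int)

# Evaluate accuracy
custom_accuracy = accuracy_score(y_bin_test, custom_predictions)
print(f"Accuracy of the model trained with the custom objective: {custom_accuracy:.2f}")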
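
Hand-written derivatives are a common source of silent bugs in custom objectives, so it is worth checking them against finite differences before training. The sketch below does this for the binary log-likelihood, whose first derivative with respect to the raw score is target - p and whose second derivative is -p * (1 - p). All function names here are introduced for this illustration.

```python
import numpy as np


def sigmoid(approx):
    return 1.0 / (1.0 + np.exp(-approx))


def log_likelihood(approx, target):
    p = sigmoid(approx)
    return target * np.log(p) + (1 - target) * np.log(1 - p)


def analytic_der1(approx, target):
    return target - sigmoid(approx)


def analytic_der2(approx, target):
    p = sigmoid(approx)
    return -p * (1 - p)


def numeric_der1(approx, target, eps=1e-5):
    # Central difference approximation of the first derivative
    return (log_likelihood(approx + eps, target)
            - log_likelihood(approx - eps, target)) / (2 * eps)


def numeric_der2(approx, target, eps=1e-4):
    # Central difference approximation of the second derivative
    return (log_likelihood(approx + eps, target)
            - 2 * log_likelihood(approx, target)
            + log_likelihood(approx - eps, target)) / eps ** 2


# Compare analytic and numeric derivatives at a few raw scores and targets
der1_errors = []
der2_errors = []
for approx in (-2.0, 0.0, 1.5):
    for target in (0, 1):
        der1_errors.append(abs(analytic_der1(approx, target) - numeric_der1(approx, target)))
        der2_errors.append(abs(analytic_der2(approx, target) - numeric_der2(approx, target)))

max_der1_error = max(der1_errors)
max_der2_error = max(der2_errors)
print(f"max first-derivative error: {max_der1_error:.2e}")
print(f"max second-derivative error: {max_der2_error:.2e}")
```

If either error is large, the objective and its derivatives disagree, and training would optimize something other than what was intended.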

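This concludes the introduction to data analysis in Python with CatBoost. The article has covered a broad range, from basic usage to more advanced features. CatBoost is a powerful tool that applies to many kinds of data analysis tasks, so try it in your own projects and see how it performs.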
