An Intuitive Look at the Difference Between K-Fold Splits and Stratified K-Fold Splits

Posted at 2024-12-29

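When to use each one for cross-validation

Splitting the training data and repeating the hold-out method (dividing the data into a training part and a validation part) has the following advantages:

Each round still trains the model on most of the data (k-1 of the k folds).
Across all rounds, every sample is used for validation exactly once, so the evaluation covers the whole training set and is more reliable.

In other words, repeated splitting lets you evaluate on the entire training set without giving up much training data in any single round.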
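The two advantages above can be checked on a toy example. This is a minimal sketch, assuming 10 samples and 5 folds (both sizes are illustrative choices, not from the article): it counts how often each sample lands in a validation fold and how large each training set is.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # 10 toy samples (illustrative size)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

validation_counts = np.zeros(len(X), dtype=int)  # times each sample is validated
train_sizes = []                                 # training-set size per round

for train_idx, val_idx in kf.split(X):
    validation_counts[val_idx] += 1
    train_sizes.append(len(train_idx))

print("validation count per sample:", validation_counts)
print("training size per round:", train_sizes)
```

Each sample is validated once, and each round trains on 8 of the 10 samples.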

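When is Stratified K-Fold needed?

Stratified sampling is needed when the class distribution is skewed and you want each fold or validation set to reflect the distribution of the original data. It matters most for imbalanced data and rare classes, and it is commonly used in classification tasks.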
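The same idea applies to a single hold-out validation set. As a minimal sketch, assuming an 80/20 class imbalance and a 20% validation split (both are illustrative choices), scikit-learn's `train_test_split` accepts a `stratify` argument that keeps the class ratio in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # 80 samples of class 0, 20 of class 1

# stratify=y asks for the 80:20 class ratio to be preserved in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_counts = np.bincount(y_tr, minlength=2)
val_counts = np.bincount(y_val, minlength=2)
print("train class counts:", train_counts)
print("validation class counts:", val_counts)
```

Without `stratify`, the number of class 1 samples in the validation set would depend on the random draw.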

The sample code below gives an intuitive picture of the difference between K-Fold splits and Stratified K-Fold splits.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, StratifiedKFold

# Create sample data (imbalanced)
np.random.seed(42)
X = np.arange(100)
y = np.concatenate([np.zeros(80), np.ones(20)])  # imbalanced: 80 samples of class 0, 20 of class 1

# Set up KFold and StratifiedKFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Function to visualize the splits
def plot_comparison(kf_folds, skf_folds, y, title):
    plt.figure(figsize=(14, 6))
    
    # Visualize K-Fold
    plt.subplot(1, 2, 1)
    for i, (train_idx, test_idx) in enumerate(kf_folds):
        plt.scatter(train_idx, [i + 1] * len(train_idx), c='blue', label="Train" if i == 0 else "", alpha=0.6)
        plt.scatter(test_idx, [i + 1] * len(test_idx), c='red', label="Test" if i == 0 else "", alpha=0.6)
    plt.title("K-Fold Splits")
    plt.xlabel("Sample Index")
    plt.ylabel("Fold")
    plt.legend()
    
    # Visualize Stratified K-Fold
    plt.subplot(1, 2, 2)
    for i, (train_idx, test_idx) in enumerate(skf_folds):
        plt.scatter(train_idx, [i + 1] * len(train_idx), c='blue', label="Train" if i == 0 else "", alpha=0.6)
        plt.scatter(test_idx, [i + 1] * len(test_idx), c='red', label="Test" if i == 0 else "", alpha=0.6)
    plt.title("Stratified K-Fold Splits")
    plt.xlabel("Sample Index")
    plt.ylabel("Fold")
    plt.legend()
    
    plt.suptitle(title)
    plt.tight_layout()
    plt.show()

# Generate the KFold and StratifiedKFold splits
kf_folds = list(kf.split(X, y))
skf_folds = list(skf.split(X, y))

# Comparison plot
plot_comparison(kf_folds, skf_folds, y, "Comparison: K-Fold vs Stratified K-Fold")

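Looking at the plot (the samples are ordered by class, so indices 80 to 99 are class 1, and the red test points in that range show how many class 1 samples each test fold received):

With K-Fold, the ratio of class 0 to class 1 can differ from fold to fold.
With Stratified K-Fold, the ratio of class 0 to class 1 in each fold is kept close to that of the imbalanced dataset as a whole (80:20 here).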
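The plot shows sample indices rather than class counts, so the difference can be easier to see as numbers. The following is a minimal sketch that reuses the same 80/20 data and splitter settings as above; the helper name `count_classes_per_test_fold` is made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(100)
y = np.array([0] * 80 + [1] * 20)  # 80 samples of class 0, 20 of class 1

def count_classes_per_test_fold(splitter):
    """Return an array of shape (n_folds, 2): class counts in each test fold."""
    return np.array([
        np.bincount(y[test_idx], minlength=2)
        for _, test_idx in splitter.split(X, y)
    ])

kf_counts = count_classes_per_test_fold(
    KFold(n_splits=5, shuffle=True, random_state=42)
)
skf_counts = count_classes_per_test_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("K-Fold [class 0, class 1] per test fold:")
print(kf_counts)
print("Stratified K-Fold [class 0, class 1] per test fold:")
print(skf_counts)
```

With Stratified K-Fold every test fold holds 16 samples of class 0 and 4 of class 1, while the K-Fold counts depend on the shuffle.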

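Applying Stratified K-Fold in a business context

For example, it is useful when optimizing customer segmentation, risk management, or resource allocation and the minority class needs to be evaluated properly.
The sample code below labels customer segments as "Regular Customers", "High-Value Customers", and "At-Risk Customers", and trains a model with Stratified K-Fold so that every fold keeps roughly the same class distribution.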

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt
import numpy as np

# Create sample data (multi-class classification, imbalanced)
X, y = make_classification(
    n_classes=3, class_sep=2, weights=[0.7, 0.2, 0.1], n_informative=5,
    n_redundant=0, n_features=10, n_clusters_per_class=1, n_samples=1000, random_state=42
)

# Attach labels to the dataset (business context: customer segment classification)
# Class 0: Regular Customers, Class 1: High-Value Customers, Class 2: At-Risk Customers
class_labels = ["Regular Customers", "High-Value Customers", "At-Risk Customers"]

# Apply Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate the model and visualize the results
fold_results = []
fold_cm = []

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    # Random forest model
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Store the evaluation results for each fold
    fold_results.append(classification_report(y_test, y_pred, output_dict=True))
    
    # Visualize the confusion matrix
    cm = ConfusionMatrixDisplay.from_predictions(
        y_test, y_pred, display_labels=class_labels, cmap="Blues"
    )
    fold_cm.append(cm)
    plt.title(f"Fold {fold + 1} Confusion Matrix")
    plt.show()

# Visualize the F1 score of each fold
f1_scores = [result['weighted avg']['f1-score'] for result in fold_results]

plt.figure(figsize=(8, 6))
plt.plot(range(1, 6), f1_scores, marker='o', label="F1 Score")
plt.axhline(np.mean(f1_scores), color='red', linestyle='--', label="Mean F1 Score")
plt.title("F1 Scores Across Folds (Customer Segmentation)")
plt.xlabel("Fold")
plt.ylabel("Weighted F1 Score")
plt.legend()
plt.show()
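The weighted F1 score plotted above is dominated by the majority class, because each class contributes in proportion to its size. When the minority classes are the point of the evaluation, a macro-averaged F1, which weights every class equally, may be a useful complement. The following is a sketch under that assumption, using `cross_val_score` with the same data settings and splitter as above; it is an alternative way to run the loop, not the article's own method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same imbalanced three-class data as in the example above
X, y = make_classification(
    n_classes=3, class_sep=2, weights=[0.7, 0.2, 0.1], n_informative=5,
    n_redundant=0, n_features=10, n_clusters_per_class=1, n_samples=1000,
    random_state=42,
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(random_state=42)

# One score per fold; the splitter is passed through the cv argument
macro_f1 = cross_val_score(model, X, y, cv=skf, scoring="f1_macro")
weighted_f1 = cross_val_score(model, X, y, cv=skf, scoring="f1_weighted")

print("macro F1 per fold:   ", macro_f1.round(3))
print("weighted F1 per fold:", weighted_f1.round(3))
```

A noticeable gap between the two averages would suggest that the smaller classes are predicted less well than the majority class.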

Running the code below confirms that each fold keeps nearly the same class distribution.

# Check and visualize the class distribution in each fold
fold_class_distribution = []

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Collect the class distribution of each fold's test data
    y_test = y[test_idx]
    class_counts = np.bincount(y_test, minlength=3)  # three classes (Class 0, 1, 2)
    fold_class_distribution.append(class_counts)

fold_class_distribution = np.array(fold_class_distribution)

# Visualize the class distribution as a bar chart
plt.figure(figsize=(10, 6))
x_labels = [f"Fold {i+1}" for i in range(len(fold_class_distribution))]
width = 0.25  # bar width

for i, class_label in enumerate(class_labels):
    plt.bar(
        np.arange(len(fold_class_distribution)) + i * width,
        fold_class_distribution[:, i],
        width=width,
        label=class_label,
    )

plt.xticks(ticks=np.arange(len(fold_class_distribution)) + width, labels=x_labels)
plt.title("Class Distribution Across Folds (Stratified K-Fold)")
plt.xlabel("Fold")
plt.ylabel("Count")
plt.legend()
plt.show()

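Summary

Stratified K-Fold is an extension of K-Fold Cross Validation that keeps the class distribution in each fold close to that of the original dataset. It is especially useful for imbalanced data and multi-class classification problems.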

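Pros and cons of Stratified K-Fold

Pros	Cons
Preserves the class distribution, which makes the evaluation fairer	Splitting the data takes slightly more computation
Effective for tasks with imbalanced data or minority classes	With very little data, the folds can still end up unbalanced
Also applicable to multi-class classification tasks	Slightly more setup than plain K-Fold (the split needs the class labels)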
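The small-data limitation in the table can be illustrated with a sketch. It assumes a toy dataset of 20 samples in which the minority class has only 3 members, fewer than the 5 folds (the sizes are illustrative). In that situation scikit-learn still produces the splits but is expected to emit a warning, which the code records rather than relying on its exact wording.

```python
import warnings

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20)
y = np.array([0] * 17 + [1] * 3)  # only 3 minority samples for 5 folds

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record warnings instead of printing them
    minority_per_fold = [
        int((y[test_idx] == 1).sum()) for _, test_idx in skf.split(X, y)
    ]

print("class 1 samples per test fold:", minority_per_fold)
for w in caught:
    print("warning:", w.message)
```

Three samples cannot cover five folds, so at least two test folds contain no class 1 sample at all, and any per-class metric for those folds is undefined or misleading.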
