I Tried Pseudo-labeling for Semi-supervised Learning

Posted at 2025-07-14

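Introduction

When studying machine learning, you often run into the problem of having too little labeled data. In real business settings in particular, there is often plenty of raw data, but only a small part of it is labeled because labeling costs time and money.

Semi-supervised learning helps in exactly this situation. In this post I try out Pseudo-labeling, one of its representative techniques.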

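What is Pseudo-labeling?

Pseudo-labeling is a technique that trains a model using a small amount of labeled data together with a large amount of unlabeled data.

How it works

1. Train a model on the small labeled set
2. Use the trained model to predict the unlabeled data
3. Adopt the predictions the model is confident about (high predicted probability) as "pseudo-labels"
4. Retrain with the pseudo-labeled data included

The basic idea of Pseudo-labeling is to repeat this process so that the model improves step by step.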
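Step 3 is the heart of the method, so here is a minimal sketch of it in isolation. The probability values and the 0.9 threshold are made up for illustration; in practice the matrix would come from a model's `predict_proba`.

```python
import numpy as np

# Hypothetical class probabilities for five unlabeled samples (two classes).
proba = np.array([
    [0.95, 0.05],
    [0.60, 0.40],
    [0.08, 0.92],
    [0.50, 0.50],
    [0.30, 0.70],
])

threshold = 0.9  # assumed cut-off for a "confident" prediction

confidence = proba.max(axis=1)                    # highest class probability per sample
selected = np.where(confidence >= threshold)[0]   # samples that pass the cut-off
# The column index equals the class label only when classes are 0..K-1 in order.
pseudo_labels = proba[selected].argmax(axis=1)

print(selected)       # indices of the samples that receive a pseudo-label
print(pseudo_labels)  # the class assigned to each of them
```

Only the first and third samples clear the threshold here; the other three stay unlabeled and can be reconsidered in a later iteration.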

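How it differs from other semi-supervised methods

Semi-supervised learning includes other approaches as well:

Self-training: the broader family in which a model is retrained on its own predictions; Pseudo-labeling is commonly treated as one form of it
Co-training: trains multiple models on different feature sets
Graph-based methods: propagate labels based on similarity between data points

Pseudo-labeling is relatively simple to implement and easy to understand, which makes it a good fit as an introduction to semi-supervised learning.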
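For comparison, scikit-learn ships a ready-made self-training wrapper, `SelfTrainingClassifier`, which follows the same confidence-threshold loop. The sketch below is not what I use in the rest of this post; the dataset size, the 80% hidden-label ratio, and the hyperparameters are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                           n_redundant=2, random_state=0)

# scikit-learn marks unlabeled samples with the label -1.
rng = np.random.RandomState(0)
y_partial = y.copy()
unlabeled_mask = rng.rand(len(y)) > 0.2   # hide roughly 80% of the labels
y_partial[unlabeled_mask] = -1

# The base estimator is passed positionally because, as far as I know,
# its keyword name differs between scikit-learn versions.
self_training = SelfTrainingClassifier(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold=0.9,
    max_iter=5,
)
self_training.fit(X, y_partial)

n_initial = int((y_partial != -1).sum())
n_final = int((self_training.transduction_ != -1).sum())
print(f"labeled before: {n_initial}, after self-training: {n_final}")
```

Writing the loop by hand, as in the rest of this post, makes it easier to log what happens at each iteration.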

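Hands-on

Problem setting

For this experiment I assumed a business scenario: predicting customer purchasing behavior.

Goal: predict whether a customer will buy a product next month
Data: features such as customer age, income, number of past purchases, and time spent on the site
Challenge: there is a lot of customer data, but the actual purchase outcome (the label) is known for only part of it

Preparing the data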

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Generate fictional customer data
np.random.seed(42)

# Generate the full dataset (features for purchase prediction)
X, y = make_classification(
    n_samples=1000,
    n_features=8,
    n_redundant=2,
    n_informative=6,
    n_clusters_per_class=1,
    random_state=42
)

# Give the features readable names (illustrative only; the values are synthetic)
feature_names = [
    'age', 'income', 'past_purchases', 'site_time',
    'email_opens', 'cart_adds', 'page_views', 'reviews_written'
]

df = pd.DataFrame(X, columns=feature_names)
df['will_purchase'] = y

print("Data overview:")
print(df.head())
print(f"\nData shape: {df.shape}")
print(f"Share of customers who will purchase: {df['will_purchase'].mean():.2%}")

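Splitting into labeled and unlabeled data

# Treat only 10% of the data as labeled (a common situation in practice).
# y_unlabeled exists only because the data is synthetic; it is never used for training.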
X_labeled, X_unlabeled, y_labeled, y_unlabeled = train_test_split(
    X, y, test_size=0.9, random_state=42, stratify=y
)

# Hold out 30% of the labeled data as a test set (70 samples to train on, 30 to test)
X_labeled, X_test, y_labeled, y_test = train_test_split(
    X_labeled, y_labeled, test_size=0.3, random_state=42, stratify=y_labeled
)

print(f"Labeled samples: {len(X_labeled)}")
print(f"Unlabeled samples: {len(X_unlabeled)}")
print(f"Test samples: {len(X_test)}")

# Standardize the features (the scaler is fitted on the labeled training data only)
scaler = StandardScaler()
X_labeled_scaled = scaler.fit_transform(X_labeled)
X_unlabeled_scaled = scaler.transform(X_unlabeled)
X_test_scaled = scaler.transform(X_test)

Building a baseline model

First, let's check the result of ordinary supervised training on the labeled data alone.

# Train a model on the labeled data only
baseline_model = RandomForestClassifier(n_estimators=100, random_state=42)
baseline_model.fit(X_labeled_scaled, y_labeled)

# Evaluate the baseline
baseline_pred = baseline_model.predict(X_test_scaled)
baseline_accuracy = accuracy_score(y_test, baseline_pred)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print("\nBaseline details:")
print(classification_report(y_test, baseline_pred))

Implementing Pseudo-labeling

Now for the Pseudo-labeling implementation itself.

def pseudo_labeling(X_labeled, y_labeled, X_unlabeled, X_test, y_test,
                   confidence_threshold=0.9, max_iterations=5):
    """
    Run Pseudo-labeling.

    Returns the test accuracy and the training-set size recorded at each iteration.
    """

    # Lists for storing the results
    accuracies = []
    labeled_sizes = []

    # Work on copies of the input data
    current_X_labeled = X_labeled.copy()
    current_y_labeled = y_labeled.copy()
    current_X_unlabeled = X_unlabeled.copy()

    print(f"Initial labeled samples: {len(current_X_labeled)}")
    print(f"Initial unlabeled samples: {len(current_X_unlabeled)}")
    print(f"Confidence threshold: {confidence_threshold}")
    print("-" * 50)
    
    for iteration in range(max_iterations):
        print(f"\n=== Iteration {iteration + 1} ===")
        
        # Train a model on the current labeled set
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(current_X_labeled, current_y_labeled)

        # Evaluate on the test data
        test_pred = model.predict(X_test)
        test_accuracy = accuracy_score(y_test, test_pred)
        accuracies.append(test_accuracy)
        labeled_sizes.append(len(current_X_labeled))

        print(f"Current test accuracy: {test_accuracy:.4f}")
        print(f"Current labeled samples: {len(current_X_labeled)}")

        # Stop when no unlabeled data is left
        if len(current_X_unlabeled) == 0:
            print("No unlabeled data left")
            break

        # Get predicted class probabilities for the unlabeled data
        pred_proba = model.predict_proba(current_X_unlabeled)
        max_proba = np.max(pred_proba, axis=1)

        # Select samples whose confidence is at or above the threshold
        confident_indices = np.where(max_proba >= confidence_threshold)[0]

        if len(confident_indices) == 0:
            print(f"No samples with confidence of at least {confidence_threshold}")
            break

        # Generate pseudo-labels
        pseudo_labels = model.predict(current_X_unlabeled[confident_indices])

        print(f"Pseudo-labels assigned: {len(pseudo_labels)}")
        print(f"Pseudo-label breakdown: class 0={np.sum(pseudo_labels == 0)}, class 1={np.sum(pseudo_labels == 1)}")

        # Add the pseudo-labeled samples to the labeled set
        current_X_labeled = np.vstack([current_X_labeled, current_X_unlabeled[confident_indices]])
        current_y_labeled = np.hstack([current_y_labeled, pseudo_labels])

        # Remove the used samples from the unlabeled set
        current_X_unlabeled = np.delete(current_X_unlabeled, confident_indices, axis=0)

        print(f"Remaining unlabeled samples: {len(current_X_unlabeled)}")

    return accuracies, labeled_sizes

# Run Pseudo-labeling
accuracies, labeled_sizes = pseudo_labeling(
    X_labeled_scaled, y_labeled, X_unlabeled_scaled, X_test_scaled, y_test,
    confidence_threshold=0.9, max_iterations=5
)

Checking the results

print("\n" + "="*50)
print("Final results")
print("="*50)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Pseudo-labeling final accuracy: {accuracies[-1]:.4f}")
print(f"Accuracy gain: {accuracies[-1] - baseline_accuracy:.4f}")
print(f"Initial labeled samples: {labeled_sizes[0]}")
print(f"Final labeled samples: {labeled_sizes[-1]}")

# Show the result of each iteration
print("\nResults per iteration:")
for i, (acc, size) in enumerate(zip(accuracies, labeled_sizes)):
    print(f"Iteration {i+1}: accuracy={acc:.4f}, samples={size}")

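Results and discussion

Running the code above gave the following results:

Baseline accuracy: 0.8333
Pseudo-labeling final accuracy: 0.8667
Accuracy gain: +0.0333 (one more correct prediction out of the 30 test samples)
Training set size: 70 → 156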
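Because the test set has only 30 samples, it is worth checking how much noise to expect in these accuracy figures. The sketch below uses the binomial standard error; the counts 25/30 and 26/30 are inferred from the reported accuracies and the test-set size, not printed by the code above.

```python
import math

n_test = 30                    # 30% of the 100 initially labeled samples
acc_baseline = 25 / n_test     # matches the reported 0.8333
acc_pseudo = 26 / n_test       # matches the reported 0.8667

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n)
se_baseline = math.sqrt(acc_baseline * (1 - acc_baseline) / n_test)

print(f"difference: {acc_pseudo - acc_baseline:.4f}")
print(f"standard error of the baseline accuracy: {se_baseline:.4f}")
```

The standard error comes out at roughly 0.07, about twice the observed gain, so this result on its own is suggestive rather than conclusive. A larger test set or repeated runs with different seeds would give a firmer picture.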

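What went well

Improvement with few labels: starting from only 70 labeled samples, test accuracy rose from 0.8333 to 0.8667
Ease of implementation: the basic idea is simple, and the implementation was relatively easy
Interpretability: it is clear which samples received pseudo-labels, so the results were easy to interpret

Issues and limitations

Choosing the confidence threshold: I picked 0.9 as a rule of thumb, and finding the best value is hard
Bias amplification: if the model's predictions are biased, that bias can be amplified through the pseudo-labels
Convergence: in some cases few new pseudo-labels are added and the procedure stalls early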

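Things that went wrong

Mistake 1: Setting the confidence threshold too low

At first I wanted to use as much data as possible, so I set the confidence threshold to 0.7.

# Failed attempt: running with a confidence threshold of 0.7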
accuracies_low, _ = pseudo_labeling(
    X_labeled_scaled, y_labeled, X_unlabeled_scaled, X_test_scaled, y_test,
    confidence_threshold=0.7, max_iterations=5
)

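print(f"Final accuracy at confidence 0.7: {accuracies_low[-1]:.4f}")
print(f"Difference from baseline: {accuracies_low[-1] - baseline_accuracy:.4f}")

As a result, a large number of noisy pseudo-labels were added and performance got worse instead of better. The lesson was that the confidence threshold needs to be chosen carefully.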
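Since this dataset is synthetic, the true labels of the "unlabeled" samples are actually known, which makes it possible to measure how noisy the pseudo-labels are at each threshold. The sketch below rebuilds the same split on unscaled features; the helper name `pseudo_label_quality` and the list of thresholds are my own choices for illustration. With real data this check is not available and a labeled validation set would be needed instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           n_redundant=2, n_clusters_per_class=1, random_state=42)
X_lab, X_unlab, y_lab, y_unlab_true = train_test_split(
    X, y, test_size=0.9, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_lab, y_lab)

proba = model.predict_proba(X_unlab)
confidence = proba.max(axis=1)
predicted = model.classes_[proba.argmax(axis=1)]

def pseudo_label_quality(threshold):
    """Return (number selected, accuracy of the selected pseudo-labels)."""
    mask = confidence >= threshold
    n_selected = int(mask.sum())
    if n_selected == 0:
        return n_selected, float("nan")
    accuracy = float((predicted[mask] == y_unlab_true[mask]).mean())
    return n_selected, accuracy

for t in (0.6, 0.7, 0.8, 0.9):
    n, acc = pseudo_label_quality(t)
    print(f"threshold={t:.1f}: selected={n}, pseudo-label accuracy={acc:.4f}")
```

Lowering the threshold can only add samples, so the trade-off to look for is how quickly pseudo-label accuracy drops as the selected count grows.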

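Mistake 2: Forgetting to standardize the data

My first implementation skipped standardization. RandomForest is a tree-based model whose splits depend on the ordering of feature values rather than their scale, so it is largely insensitive to this, but scale-sensitive algorithms (for example SVM or neural networks) can behave very differently.

# Running without standardization (failed attempt)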
model_no_scaling = RandomForestClassifier(n_estimators=100, random_state=42)
model_no_scaling.fit(X_labeled, y_labeled)  # no standardization

no_scaling_pred = model_no_scaling.predict(X_test)
no_scaling_accuracy = accuracy_score(y_test, no_scaling_pred)

print(f"Accuracy without standardization: {no_scaling_accuracy:.4f}")
print(f"Accuracy with standardization: {baseline_accuracy:.4f}")
print(f"Difference: {baseline_accuracy - no_scaling_accuracy:.4f}")

Fortunately there was no large difference with RandomForest, but it was a good reminder of how much preprocessing matters.
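One way to make this mistake harder to repeat is to bundle the scaler and the model into a scikit-learn `Pipeline`. The sketch below swaps in `LogisticRegression` as an example of a scale-sensitive model; that choice, the dataset size, and the simple 150/50 split are assumptions for illustration and are not part of the experiment above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, n_informative=6,
                           n_redundant=2, random_state=0)

# The scaling statistics are learned inside fit(), and predict()/score()
# apply the same transformation, so the step cannot be skipped by accident.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X[:150], y[:150])

score = pipeline.score(X[150:], y[150:])
print(f"held-out accuracy: {score:.4f}")
```

A pipeline also exposes `predict_proba` when its final step does, so it could stand in for the bare classifier inside a pseudo-labeling loop.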

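Mistake 3: Ignoring class imbalance

In real data, customers who purchase (class 1) are often far fewer than those who do not. At first I implemented the method without taking this into account, and the pseudo-labels ended up skewed toward the majority class.

# Check the class distribution of the pseudo-labels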
def analyze_pseudo_labels(X_labeled, y_labeled, X_unlabeled, confidence_threshold=0.9):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_labeled, y_labeled)
    
    pred_proba = model.predict_proba(X_unlabeled)
    max_proba = np.max(pred_proba, axis=1)
    confident_indices = np.where(max_proba >= confidence_threshold)[0]
    
    if len(confident_indices) > 0:
        pseudo_labels = model.predict(X_unlabeled[confident_indices])
        print(f"Pseudo-label distribution: class 0={np.sum(pseudo_labels == 0)}, class 1={np.sum(pseudo_labels == 1)}")
        print(f"Labeled data distribution: class 0={np.sum(y_labeled == 0)}, class 1={np.sum(y_labeled == 1)}")

analyze_pseudo_labels(X_labeled_scaled, y_labeled, X_unlabeled_scaled)

This experience taught me that handling class imbalance matters as well.
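One possible countermeasure is to cap how many pseudo-labels each class may contribute per iteration. The sketch below is an idea I have not validated on this dataset; the function name `select_balanced`, the `max_per_class` parameter, and the toy probabilities are all hypothetical. It assumes class labels are 0..K-1 so that column indices match labels.

```python
import numpy as np

def select_balanced(proba, threshold=0.9, max_per_class=None):
    """Return (indices, labels) of confident samples, capped per class."""
    confidence = proba.max(axis=1)
    predicted = proba.argmax(axis=1)
    confident = np.where(confidence >= threshold)[0]

    counts = np.bincount(predicted[confident], minlength=proba.shape[1])
    nonzero = counts[counts > 0]
    if len(nonzero) == 0:
        return np.array([], dtype=int), np.array([], dtype=int)
    # Default cap: the size of the smallest class that has confident samples.
    cap = int(nonzero.min()) if max_per_class is None else max_per_class

    chosen = []
    for cls in range(proba.shape[1]):
        cls_idx = confident[predicted[confident] == cls]
        # Keep the most confident samples of this class first.
        order = np.argsort(-confidence[cls_idx])
        chosen.extend(cls_idx[order][:cap].tolist())
    chosen = np.array(sorted(chosen), dtype=int)
    return chosen, predicted[chosen]

# Toy example: four confident class-0 samples, one confident class-1 sample.
proba = np.array([
    [0.99, 0.01], [0.97, 0.03], [0.95, 0.05], [0.93, 0.07],
    [0.04, 0.96], [0.55, 0.45],
])
idx, labels = select_balanced(proba, threshold=0.9)
print(idx, labels)
```

In the toy example the cap works out to one sample per class, so only the most confident class-0 sample is kept alongside the single class-1 sample. The downside is that a rare class can throttle how fast the labeled set grows, so a fixed `max_per_class` may be more practical.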

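Summary

Strengths

Performance gains can be expected even with few labeled samples
Relatively simple to implement
Results are easy to interpret

Points to watch

The confidence threshold matters
Do not forget data preprocessing
Class imbalance needs to be handled
The model's bias can be amplified

What I want to try next

Comparing other algorithms (SVM, neural networks, and so on)
Adjusting the confidence threshold dynamically
Combining with ensemble learning
Validating on a larger dataset

I think semi-supervised learning can be very useful in real business settings. Especially where labeling is expensive, knowing a technique like Pseudo-labeling should help get more out of limited resources.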
