Imbalanced Data

Posted at 2023-09-03

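Undersampling

This code uses RandomUnderSampler from the imblearn library to perform random undersampling. It generates a class-imbalanced dataset, applies random undersampling to the training split, trains a model on the resampled data, and evaluates its accuracy on the untouched test split.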
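
To make the idea concrete, here is a minimal NumPy-only sketch of what random undersampling does conceptually: every class is cut down to the size of the smallest class by picking rows at random without replacement. The helper random_undersample and the toy arrays are hypothetical names introduced for illustration, and this is not imblearn's actual implementation.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Hypothetical helper: shrink every class to the size of the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the minority class
    keep_idx = []
    for cls in classes:
        cls_idx = np.flatnonzero(y == cls)
        # Sample without replacement so that no row is duplicated
        keep_idx.append(rng.choice(cls_idx, size=n_keep, replace=False))
    keep_idx = np.sort(np.concatenate(keep_idx))
    return X[keep_idx], y[keep_idx]


# Toy data: 9 majority-class rows (label 0) and 3 minority-class rows (label 1)
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([0] * 9 + [1] * 3)

X_small, y_small = random_undersample(X_toy, y_toy)
print(np.bincount(y_small))  # [3 3]
```

All three minority rows survive, while six of the nine majority rows are thrown away. That discarded data is the cost of this method.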

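Notes

When applying this to a real dataset, load your own data instead of calling make_classification.
Tune the undersampling method and the sampling ratio to the actual dataset.
Undersampling discards data and therefore loses information, so use it with care. Depending on the case, it is also worth considering other approaches (e.g. oversampling or anomaly detection).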

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Generate sample data (a synthetic class-imbalanced dataset: about 90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)

# Split the imbalanced data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random undersampling (the key step): applied to the training data only,
# so the test set keeps the original class distribution
undersampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X_train, y_train)

# Train a random forest model on the resampled data
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_resampled, y_resampled)

# Predict on the test data
y_pred = clf.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

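Oversampling

The code below follows the same workflow, but uses RandomOverSampler from imblearn, which balances the classes by randomly duplicating minority-class samples instead of discarding majority-class ones.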

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Generate sample data (a synthetic class-imbalanced dataset: about 90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)

# Split the imbalanced data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random oversampling: applied to the training data only,
# so the test set keeps the original class distribution
oversampler = RandomOverSampler(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_train, y_train)

# Train a random forest model on the resampled data
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_resampled, y_resampled)

# Predict on the test data
y_pred = clf.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Comparison: the same classifier trained with and without oversampling
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Generate sample data (a synthetic class-imbalanced dataset: about 90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)

# Split the imbalanced data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model 1: no oversampling (trained on the original training data)
clf_no_oversampling = RandomForestClassifier(n_estimators=100, random_state=42)
clf_no_oversampling.fit(X_train, y_train)
y_pred_no_oversampling = clf_no_oversampling.predict(X_test)
accuracy_no_oversampling = accuracy_score(y_test, y_pred_no_oversampling)

# Model 2: with oversampling (trained on the resampled training data)
oversampler = RandomOverSampler(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_train, y_train)
clf_with_oversampling = RandomForestClassifier(n_estimators=100, random_state=42)
clf_with_oversampling.fit(X_resampled, y_resampled)
y_pred_with_oversampling = clf_with_oversampling.predict(X_test)
accuracy_with_oversampling = accuracy_score(y_test, y_pred_with_oversampling)

# Compare the accuracies
print(f"Accuracy without Oversampling: {accuracy_no_oversampling:.2f}")
print(f"Accuracy with Oversampling: {accuracy_with_oversampling:.2f}")
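
When comparing models on data that is about 90% one class, accuracy alone can look high even if the minority class is never detected. The following sketch uses hypothetical labels (not results from the experiment above) and a trivial always-predict-the-majority rule. It shows how scikit-learn's balanced_accuracy_score and recall_score expose this.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical labels: 90 majority-class (0) and 10 minority-class (1) samples
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_pred_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_majority)
bal_acc = balanced_accuracy_score(y_true, y_pred_majority)
minority_recall = recall_score(y_true, y_pred_majority, pos_label=1)

print(f"Accuracy: {acc:.2f}")                    # 0.90
print(f"Balanced accuracy: {bal_acc:.2f}")       # 0.50
print(f"Minority recall: {minority_recall:.2f}") # 0.00
```

The same functions could be applied to y_test and each model's predictions to see whether resampling actually helps the minority class.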