I wrote a down-sampling method for imbalanced data (for binary classification)

Python

Last updated at 2020-02-19, posted at 2020-02-15

I adapted the code from the under-sampling section of this page so that it actually runs.

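# Method that balances the number of positive and negative rows in the train data
# Reference: https://qiita.com/ryouta0506/items/619d9ac0d80f8c0aed92
# Cluster the majority rows with kmeans, then sample from each cluster in proportion to its size

# X : pandas DataFrame
# target_column_name : name of the class column, e.g. 「発症フラグ」 (an onset flag).
# minority_label : value of the minority label, e.g. 1.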

import pandas as pd
from sklearn.cluster import KMeans


def under_sampling(X, target_column_name, minority_label):

    # Split into majority and minority rows.
    # A boolean mask also works when the label is a string, and .copy() lets us
    # add a column below without triggering SettingWithCopyWarning.
    is_minority = X[target_column_name] == minority_label
    X_majority = X[~is_minority].copy()
    X_minority = X[is_minority]

    # Cluster the majority rows with KMeans (default number of clusters)
    km = KMeans(random_state=43)
    km.fit(X_majority)
    X_majority['Cluster'] = km.predict(X_majority)

    # Work out how many rows to draw from each cluster.
    # Truncating to int means the total can fall slightly short of the minority count.
    ratio = X_majority['Cluster'].value_counts() / X_majority.shape[0]
    n_sample_ary = (ratio * X_minority.shape[0]).astype('int64').sort_index()

    # Draw the rows cluster by cluster, keyed by the actual cluster label
    dfs = []
    for i, n_sample in n_sample_ary.items():
        dfs.append(X_majority.query(f'Cluster == {i}').sample(n_sample))

    # Append the minority rows as well
    dfs.append(X_minority)

    # Build the under-sampled data
    X_new = pd.concat(dfs, sort=True)

    # The cluster column is no longer needed
    X_new = X_new.drop('Cluster', axis=1)

    return X_new
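
The per-cluster quota calculation inside the function can be tried in isolation. The following is a minimal sketch on made-up numbers: the cluster labels and the minority count of 5 are assumptions for illustration, not data from the article. It shows that truncating each quota to an integer can leave the majority sample slightly smaller than the minority class.

```python
import pandas as pd

# Hypothetical cluster labels for 10 majority rows
# (cluster 0: 5 rows, cluster 1: 3 rows, cluster 2: 2 rows)
clusters = pd.Series([0, 0, 0, 0, 0, 1, 1, 1, 2, 2], name="Cluster")
n_minority = 5  # assumed number of minority rows

# Share of the majority rows that falls in each cluster
ratio = clusters.value_counts() / len(clusters)

# Per-cluster quota, truncated to an integer as in under_sampling
n_sample_ary = (ratio * n_minority).astype("int64").sort_index()

print(n_sample_ary.to_dict())   # {0: 2, 1: 1, 2: 1}
print(int(n_sample_ary.sum()))  # 4
```

Here the quotas add up to 4 rather than 5, so the result is close to balanced but not exactly balanced. If an exact match matters, one option would be to round the quotas and hand out the remainder to the largest clusters, but that is a design choice outside the original code.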
