@carushi@github(L Carushi)

On Learning with Imbalanced Classes

Last updated at 2018-02-09, posted at 2018-02-07

I'm collecting articles that look useful, as notes for myself.

Dealing with Imbalanced Classes in Machine Learning

The key question: what kind of classification are you aiming for?

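The problem with learning from imbalanced classes is that predictions for the minority class tend to be poor
"the goal is to identify the minority class"
If you measure performance with accuracy = (TP+TN)/(all samples), the score looks high even when there are few TPs
You should aim to maximize other metrics instead
"consider using metrics beyond accuracy such as recall, precision, and AUROC"
recall = TP/(TP+FN)
precision = TP/(TP+FP)
There is also a training approach called AUC optimization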
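
To make the accuracy pitfall concrete, here is a minimal sketch in plain Python. The 5/95 class split and the "always predict the majority class" classifier are made-up assumptions for illustration, and the helper names are hypothetical. Recall and precision are defined as 0.0 here when their denominator is zero, which is a convention of this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fp, tn, fn):
    # Share of actual positives that were found.
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp, tn, fn):
    # Share of predicted positives that were correct.
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A useless classifier that always predicts the majority class.
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
print("accuracy :", accuracy(*counts))   # 0.95
print("recall   :", recall(*counts))     # 0.0
print("precision:", precision(*counts))  # 0.0
```

Accuracy is 0.95 even though not a single minority sample is detected, which is why recall and precision are more informative here.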

Possible countermeasures

Change the weight given to each kind of prediction
Express the gain or penalty for each of TP, FP, TN, and FN as a cost matrix
Resample so that the class sizes match

oversampling (problem: overfitting)
undersampling (problem: may miss important samples (features); less training data)
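
A rough sketch of the two resampling strategies, using only the standard library. The function names, the toy data, and the choice to balance the classes exactly are assumptions of this example, not something the article prescribes.

```python
import random


def split_by_class(X, y, minority=1):
    """Separate rows into minority and majority groups."""
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    majority_rows = [x for x, label in zip(X, y) if label != minority]
    return minority_rows, majority_rows


def random_oversample(minority_rows, target_size, rng):
    """Duplicate minority rows (with replacement) up to target_size."""
    n_extra = target_size - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]
    return minority_rows + extra


def random_undersample(majority_rows, target_size, rng):
    """Keep a random subset of majority rows (without replacement)."""
    return rng.sample(majority_rows, target_size)


rng = random.Random(0)
# Hypothetical data: 5 minority rows and 95 majority rows.
X = [[float(i)] for i in range(100)]
y = [1 if i < 5 else 0 for i in range(100)]

minority_rows, majority_rows = split_by_class(X, y)
oversampled = random_oversample(minority_rows, len(majority_rows), rng)
undersampled = random_undersample(majority_rows, len(minority_rows), rng)

print(len(oversampled), len(majority_rows))   # 95 95
print(len(undersampled), len(minority_rows))  # 5 5
```

The oversampled set still contains only five distinct minority rows (hence the overfitting risk), and the undersampled set throws away 90 majority rows (hence the loss of training data).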

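SMOTE
I'm reading the paper now
A method that generates data for oversampling by interpolating the difference between a sample and its k-nearest neighbors
"creates new instances of minority class by forming convex combinations of neighboring instances"
The goal seems to be to create data that lies in the decision region between samples of class 0 and class 1
Because new points are generated from the existing data, overfitting is unavoidable, just as with oversampling
imblearn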
Recast the problem as anomaly detection
Fit a model to the majority and detect samples that differ from it
"Clustering methods, One-class SVMs, and Isolation Forests"
Isolation forest in sklearn
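
The SMOTE idea noted above (convex combinations of neighboring minority instances) can be sketched in a few lines of plain Python. This is a simplified illustration and not the imblearn implementation: `smote_like` is a hypothetical helper, the neighbor search is brute force, and the toy points are made up. It assumes at least two minority points.

```python
import random


def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


rng = random.Random(0)
# Hypothetical minority samples: the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=6, k=2, rng=rng)

for point in new_points:
    print(point)
```

Every synthetic point lies on a segment between two existing minority points, so nothing is created outside the region the minority class already covers. For real work, the `SMOTE` class in `imblearn.over_sampling` is the usual choice.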

Aside

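Positive-unlabeled learning works with a different kind of dataset (only positive examples are labeled), but it may be the same problem as far as imbalance is concerned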

Class Imbalance, Redux

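In the end, the only approach known to work well is bagging (ensemble learning over bootstrap samples)
The paper investigates why that is, using both simulated and real data

That is the gist. The English felt somewhat stylish, with many unfamiliar words (just my impression).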
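This paper groups the existing methods into three categories:

over/undersampling
synthetic data (SMOTE)
cost-sensitive algorithms (weighted SVM, etc.)

For high-dimensional data, undersampling is chosen for its simplicity, but it often outperforms the alternatives in other settings as well.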
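
As a rough sketch of how bagging and undersampling can be combined, the code below trains each ensemble member on a balanced bootstrap sample and takes a majority vote. Combining the two in this way is an assumption of this example. The nearest-centroid base learner, the 0/1 label convention (1 = minority), and the toy 1-D data are also made up for illustration.

```python
import random


def fit_centroids(X, y):
    """Nearest-centroid base learner: store the mean vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Return the label whose centroid is closest to x."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def fit_undersampled_bagging(X, y, n_models, rng):
    """Train n_models learners, each on a balanced bootstrap sample."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    models = []
    for _ in range(n_models):
        pos_boot = [rng.choice(pos) for _ in pos]  # bootstrap the minority
        neg_boot = [rng.choice(neg) for _ in pos]  # same number from the majority
        X_boot = pos_boot + neg_boot
        y_boot = [1] * len(pos_boot) + [0] * len(neg_boot)
        models.append(fit_centroids(X_boot, y_boot))
    return models


def predict_bagging(models, x):
    """Majority vote over the ensemble (labels are 0 or 1)."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if votes * 2 > len(models) else 0


rng = random.Random(0)
# Hypothetical 1-D data: 200 negatives in [0, 2), 10 positives in [5, 6).
X = [[i * 0.01] for i in range(200)] + [[5.0 + 0.1 * i] for i in range(10)]
y = [0] * 200 + [1] * 10

models = fit_undersampled_bagging(X, y, n_models=11, rng=rng)
print(predict_bagging(models, [5.5]))  # 1
print(predict_bagging(models, [0.5]))  # 0
```

Each member sees only 20 rows, but across the ensemble most of the majority class gets used, which softens the main drawback of plain undersampling.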

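The strategy of changing the cost of FP relative to FN only works when the training dataset is not separable. (I take this to mean that the positives and negatives are not separable.)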
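
A minimal sketch of what changing the FP/FN costs does at prediction time, assuming a model that outputs a probability for the positive class. The cost values (a false negative ten times as expensive as a false positive) and the function names are made up for illustration.

```python
# COST[(true_label, predicted_label)]; hypothetical values.
COST = {
    (1, 1): 0.0,   # TP
    (1, 0): 10.0,  # FN: missing a minority sample is expensive
    (0, 1): 1.0,   # FP
    (0, 0): 0.0,   # TN
}


def expected_cost(p_pos, predicted, cost):
    """Expected cost of predicting `predicted` when P(y=1) = p_pos."""
    return p_pos * cost[(1, predicted)] + (1.0 - p_pos) * cost[(0, predicted)]


def decide(p_pos, cost):
    """Pick the label with the lower expected cost."""
    return min((0, 1), key=lambda label: expected_cost(p_pos, label, cost))


def threshold(cost):
    """Probability above which predicting 1 has the lower expected cost."""
    fp_excess = cost[(0, 1)] - cost[(0, 0)]
    fn_excess = cost[(1, 0)] - cost[(1, 1)]
    return fp_excess / (fp_excess + fn_excess)


print(threshold(COST))      # 1/11, roughly 0.09
print(decide(0.20, COST))   # 1
print(decide(0.05, COST))   # 0
```

With these costs the decision threshold moves from 0.5 down to about 0.09, so a sample with only a 20% chance of being positive is still flagged. If the classes are perfectly separated, the predicted probabilities sit near 0 or 1 and moving the threshold changes little, which fits the remark above.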
