
Random Forest (Classification) and Hyperparameter Tuning

Posted at 2019-11-15, last updated at 2019-11-15

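Introduction

This article builds a classifier for the Breast Cancer Wisconsin dataset, which is used to decide whether a breast tumor is benign or malignant, using a random forest with hyperparameter tuning. The data ships with sklearn: it has 569 samples, of which 212 are malignant and 357 are benign, and 30 features.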
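
The dataset figures above can be checked directly. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()

# Count samples per class; target 0 is "malignant" and target 1 is "benign".
counts = np.bincount(cancer_data.target)
class_counts = {str(name): int(n) for name, n in zip(cancer_data.target_names, counts)}

print(cancer_data.data.shape)  # (number of samples, number of features)
print(class_counts)
```

Note that the positive label (1) is the benign class in this dataset, which matters when reading binary metrics such as precision or recall.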


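What is a random forest?

A random forest is a machine learning algorithm proposed by Leo Breiman in 2001 [1] and used for classification, regression, and clustering. It is an ensemble learning algorithm that uses decision trees as weak learners; the name comes from its use of a large number of decision trees, each trained on randomly sampled training data.
(From Wikipedia)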
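
The idea of "many decision trees trained on randomly sampled data" can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: it uses hard majority voting, whereas `RandomForestClassifier` averages the trees' predicted class probabilities. The tree count of 25 and the seed are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
n_trees = 25  # illustrative value (odd, so a 0/1 vote cannot tie)
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw len(train_X) row indices with replacement.
    idx = rng.randint(0, len(train_X), size=len(train_X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=rng.randint(0, 2**31 - 1))
    tree.fit(train_X[idx], train_y[idx])
    trees.append(tree)

# Majority vote over the trees' 0/1 predictions.
votes = np.stack([tree.predict(test_X) for tree in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
forest_accuracy = float((forest_pred == test_y).mean())
print(forest_accuracy)
```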

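Random forest hyperparameters

For details, see the scikit-learn documentation for RandomForestClassifier. The defaults listed below are those of the scikit-learn release that was current at the time of writing; since version 0.22 the default of n_estimators is 100.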

Hyperparameter	Options	Default
n_estimators	int	10
criterion	gini, entropy	gini
max_depth	int or None	None
min_samples_split	int, float	2
min_samples_leaf	int, float	1
min_weight_fraction_leaf	float	0
max_features	int, float, None, auto, sqrt, log2	auto
max_leaf_nodes	int or None	None
min_impurity_decrease	float	0
min_impurity_split	float	1e-7
bootstrap	bool	True
oob_score	bool	False
n_jobs	int or None	None
random_state	int, RandomState instance or None	None
verbose	int	0
warm_start	bool	False
class_weight	dict, balanced, balanced_subsample or None	None
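
Because defaults change between scikit-learn releases, it may help to print the values used by the installed version rather than rely on a static table. A minimal sketch using the estimator's `get_params()` method:

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns the constructor arguments of the estimator as a dict;
# on a freshly constructed instance these are the installed version's defaults.
defaults = RandomForestClassifier().get_params()
for name in sorted(defaults):
    print(name, "=", repr(defaults[name]))
```

Parameters that a given release has removed or renamed simply do not appear in the output.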

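Procedure

Load the breast cancer data
Split into training data and test data
Set up the search conditions
Run the random forest (grid search)
Compare with the case where hyperparameters are not tuned

Implementation in Python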

%%time
from tqdm import tqdm
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load the breast cancer data
cancer_data = load_breast_cancer()

# Split into training data and test data
train_X, test_X, train_y, test_y = train_test_split(cancer_data.data, cancer_data.target, random_state=0)

# Search conditions: one model mapped to its parameter grid
max_score = 0
RFC_grid = {RandomForestClassifier(): {"n_estimators": list(range(1, 21)),
                                       "criterion": ["gini", "entropy"],
                                       "max_depth": list(range(1, 5)),
                                       "random_state": list(range(0, 101))
                                      }}

# Run the random forest (grid search)
for model, param in tqdm(RFC_grid.items()):
    clf = GridSearchCV(model, param)
    clf.fit(train_X, train_y)
    pred_y = clf.predict(test_X)
    # Micro-averaged F1 equals accuracy for single-label classification
    score = f1_score(test_y, pred_y, average="micro")

    if max_score < score:
        max_score = score
        best_param = clf.best_params_
        best_model = model.__class__.__name__

print("Best score: {}".format(max_score))
print("Model: {}".format(best_model))
print("Parameters: {}".format(best_param))

# Comparison with the case where hyperparameters are not tuned
model = RandomForestClassifier()
model.fit(train_X, train_y)
score = model.score(test_X, test_y)
print("")
print("Default score:", score)
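
The tuned model is scored with micro-averaged F1 while the default model is scored with `model.score`, which is accuracy. For single-label classification the two coincide, so the scores are comparable. The sketch below checks this on the same split; `n_estimators=10` and `random_state=0` are assumptions made only to keep the example reproducible.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

# Fixed settings are assumptions for reproducibility, not the article's run.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)

micro_f1 = f1_score(test_y, pred_y, average="micro")
accuracy = accuracy_score(test_y, pred_y)
print(micro_f1, accuracy)  # the two values agree up to floating-point rounding
```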

Results

100%|███████████████████████████████████████████| 1/1 [10:39<00:00, 639.64s/it]
Best score: 0.965034965034965
Model: RandomForestClassifier
Parameters: {'criterion': 'entropy', 'max_depth': 4, 'n_estimators': 14, 'random_state': 62}

Default score: 0.951048951049
Wall time: 10min 39s
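
The run time of more than ten minutes follows from the size of the grid. As a rough sketch, `ParameterGrid` can count the candidate combinations before any model is fitted. The fold count used here is an assumption: `GridSearchCV` defaults to 5-fold cross-validation in scikit-learn 0.22 and later, and to 3-fold in earlier releases.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": list(range(1, 21)),
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(1, 5)),
    "random_state": list(range(0, 101)),
}

# Number of distinct hyperparameter combinations: 20 * 2 * 4 * 101.
n_candidates = len(ParameterGrid(param_grid))

cv_folds = 5  # assumption: depends on the installed scikit-learn version
# One fit per candidate per fold, plus one final refit on all training data.
n_fits = n_candidates * cv_folds + 1
print(n_candidates, n_fits)
```

Most of the combinations come from searching over 101 values of `random_state`, which multiplies the work of the other three parameters.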

Conclusion

Hyperparameter tuning gave a higher accuracy on the test data than the default settings (0.965 versus 0.951).
