
Random Forest (Implementation and Parameter Summary)

Last updated at 2021-01-10, posted at 2021-01-09

Introduction

This article summarizes how to implement a random forest and which parameters to tune.

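What is a random forest?

A model that combines multiple decision trees to improve predictive performance.

※ Decision tree: a machine learning method that reaches an answer by repeatedly splitting the data with yes/no questions.

Training proceeds as follows:

① Prepare multiple decision tree models.
② For each decision tree, build its training data by randomly drawing, with replacement, as many samples as the original training data contains (giving each tree slightly different training data adds variety to what the trees learn).
③ Derive the final answer from the predictions of the individual decision trees.
Classification model → majority vote
Regression model → average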
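
To make steps ② and ③ concrete, here is a minimal sketch using only the Python standard library. The helper names (`bootstrap_sample`, `majority_vote`, `average`) and the per-tree predictions are made up for illustration; no real trees are trained here.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Step 2: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(labels):
    """Step 3 (classification): the most common predicted label wins."""
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Step 3 (regression): the mean of the individual predictions."""
    return mean(values)


rng = random.Random(0)
train_rows = list(range(10))  # stand-in for 10 training rows

# Steps 1 and 2: one bootstrap sample per (hypothetical) tree
samples = [bootstrap_sample(train_rows, rng) for _ in range(3)]

# Step 3: combine hypothetical per-tree predictions for a single input
class_votes = ["acc", "unacc", "acc"]
reg_outputs = [2.0, 3.0, 4.0]
final_class = majority_vote(class_votes)
final_value = average(reg_outputs)

print(samples)
print(final_class, final_value)
```

Each bootstrap sample has the same length as the original data but usually contains some rows more than once, which is what makes the trees differ from each other.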

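Characteristics of random forests

Random forest is an ensemble learning method of the bagging type. On top of bagging, it also considers only a random subset of the features at each split, which makes the individual trees differ even more.

※ Ensemble learning: see the following
https://qiita.com/hara_tatsu/items/336f9fff08b9743dc1d2

Bagging

Draw different datasets with the bootstrap method and train multiple different models (weak learners) on them. The final model then combines their predictions: the average for regression, a majority vote for classification.

※ Bootstrap method: randomly draw, with replacement, a sample of the same size as the full dataset, and repeat this multiple times (the data is not partitioned).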
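
As a rough illustration of what sampling with replacement does, the sketch below draws one bootstrap sample of row indices and checks how many distinct rows it contains. The dataset size `n` is an arbitrary value chosen for this example; for large `n` the share of distinct rows approaches 1 - 1/e, roughly 63%.

```python
import random

rng = random.Random(0)
n = 10_000  # arbitrary dataset size for this illustration

# Draw n row indices with replacement (one bootstrap sample)
indices = [rng.randrange(n) for _ in range(n)]

unique_rows = len(set(indices))
unique_fraction = unique_rows / n  # share of distinct rows in the sample
out_of_bag = n - unique_rows       # rows that were never drawn

print(unique_fraction, out_of_bag)
```

The rows that are never drawn are often called out-of-bag samples, since the model trained on this sample has not seen them.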

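Implementation

This time we use the car evaluation competition on SIGNATE as the example.
Link below.
https://signate.jp/competitions/122

Data preprocessing

Load the data and convert the string values to numbers.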

python.py

import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('train.tsv', delimiter = '\t')
df = df.drop('id', axis = 1)

# Explanatory variables (ordered categories mapped to integers)
df = df.replace({'buying': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'maint': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'doors': {'2': 2, '3': 3, '4': 4, '5': 5, '5more': 6}})
df = df.replace({'persons': {'2': 2, '4': 4, 'more': 6}})
df = df.replace({'lug_boot': {'small': 1, 'med': 2, 'big': 3}})
df = df.replace({'safety': {'low': 1, 'med': 2, 'high': 3}})

# Target variable
df = df.replace({'class': {'unacc': 1, 'acc': 2, 'good': 3, 'vgood': 4}})

Split the data into training data and evaluation data.

python.py

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, random_state = 0)

# Split the training data into explanatory variables (X_train) and the target variable (y_train)
X_train = train_set.drop('class', axis=1)
y_train = train_set['class']
 
# Split the evaluation data into explanatory variables (X_test) and the target variable (y_test)
X_test = test_set.drop('class', axis=1)
y_test = test_set['class']

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(691, 6)
(173, 6)
(691,)
(173,)
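
The row counts above follow from simple arithmetic. The sketch below reproduces them, assuming the total of 864 rows (inferred here from 691 + 173, not stated in the article) and relying on scikit-learn rounding a fractional test_size up to the next whole row.

```python
import math

n_samples = 691 + 173  # total rows, inferred from the shapes printed above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # a fractional test size is rounded up
n_train = n_samples - n_test               # the rest becomes training data

print(n_train, n_test)
```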

Implementing the random forest

python.py

# Random forest
from sklearn.ensemble import RandomForestClassifier
# Evaluation
from sklearn import metrics

# random_state is not fixed, so the scores vary slightly between runs
model = RandomForestClassifier()
model.fit(X_train, y_train)
pred = model.predict(X_test)

print(metrics.classification_report(y_test, pred))


              precision    recall  f1-score   support

           1       0.97      0.96      0.97       114
           2       0.84      0.88      0.86        42
           3       0.71      0.56      0.63         9
           4       0.89      1.00      0.94         8

    accuracy                           0.92       173
   macro avg       0.85      0.85      0.85       173
weighted avg       0.92      0.92      0.92       173

The accuracy is 92%.
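
For reading the per-class rows of the report, it may help to recompute one of them by hand. The sketch below does this for class 3. The counts (5 true positives, 2 false positives, 4 false negatives) are my own inference from the rounded values and the support of 9; they are not taken from the actual confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts for one class."""
    precision = tp / (tp + fp)  # of everything predicted as this class, how much was right
    recall = tp / (tp + fn)     # of everything truly in this class, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts for class 3 (support = tp + fn = 9)
p, r, f1 = precision_recall_f1(tp=5, fp=2, fn=4)

print(round(p, 2), round(r, 2), f1)
```

With these counts, precision is 5/7 and recall is 5/9, matching the 0.71 and 0.56 in the report, and F1 comes to about 0.625, which the report shows as 0.63. Classes 3 and 4 have fewer than 10 samples each, so their scores move a lot with a single misclassification.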
Next, let's tune the parameters.

Parameter overview

Here are the parameters that generally matter most when tuning.

※ See here for details.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforestclassifier#sklearn.ensemble.RandomForestClassifier

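① n_estimators

Number of decision trees
Specify an integer (default: 100)

② criterion

Criterion used to measure the quality of a split in each decision tree
'gini': Gini impurity (default)
'entropy': entropy (information gain)

③ max_depth

Maximum depth of each decision tree
Specify an integer or None (default: None, meaning the depth is not limited)
An important parameter for limiting overfitting
In general,
Small value: the trees may underfit, so accuracy tends to be lower
Large value: fits the training data more closely, but overfits more easily

④ min_samples_split

Minimum number of samples a node must contain in order to be split
(a node holding fewer samples than this value is not split any further)
Specify an integer or a float (default: 2); a float is interpreted as a fraction of the number of samples
In general, a value that is too small makes the model overfit more easily

⑤ max_leaf_nodes

Maximum number of leaves in each decision tree
Specify an integer or None (default: None, meaning no limit)

⑥ min_samples_leaf

Minimum number of samples that must remain in each leaf after a split
Specify an integer or a float (default: 1); a float is interpreted as a fraction of the number of samples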
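
To give a feel for the two criterion options in ②, here is a small standard-library sketch of Gini impurity and entropy for the labels inside one node. The function names and the toy label lists are made up for this example; scikit-learn computes these internally.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Share of each class among the labels in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [c / total for c in counts.values()]


def gini_impurity(labels):
    """1 - sum(p^2): 0 for a pure node, larger for mixed nodes."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """-sum(p * log2(p)): 0 for a pure node, 1 for a 50/50 two-class node."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


pure = [1, 1, 1, 1]   # every sample has the same class
mixed = [1, 1, 2, 2]  # two classes, evenly mixed

print(gini_impurity(pure), entropy(pure))
print(gini_impurity(mixed), entropy(mixed))
```

Both measures are 0 for a pure node and grow as the classes mix, so a good split is one whose child nodes score lower than the parent. In practice the two criteria often pick similar splits.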

Tuning the parameters with grid search

python.py

# Grid search
from sklearn.model_selection import GridSearchCV

# Parameters to search over
search_gs = {
    "max_depth": [None, 5, 25],
    "n_estimators": [150, 180],
    "min_samples_split": [4, 8, 12],
    "max_leaf_nodes": [None, 10, 30],
}

model_gs = RandomForestClassifier()
# Set up the grid search
# (the iid argument was removed in scikit-learn 0.24, so it is not passed)
gs = GridSearchCV(model_gs,
                  search_gs,
                  cv = 5)
# Train
gs.fit(X_train, y_train)
# Show the best parameters
print(gs.best_params_)

{'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 4, 'n_estimators': 180}
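
Grid search tries every combination, so the cost grows quickly. As a rough check, the sketch below counts how many candidates and model fits the grid above implies. It only enumerates the combinations with `itertools.product` and does not train anything.

```python
from itertools import product

# Same grid and fold count as in the article
search_gs = {
    "max_depth": [None, 5, 25],
    "n_estimators": [150, 180],
    "min_samples_split": [4, 8, 12],
    "max_leaf_nodes": [None, 10, 30],
}
cv = 5

# Every combination of one value per parameter
combinations = list(product(*search_gs.values()))
n_candidates = len(combinations)  # 3 * 2 * 3 * 3
n_fits = n_candidates * cv        # each candidate is trained once per fold

print(n_candidates, n_fits)
```

That is 54 candidates and 270 fits, plus one final refit on the whole training set with the default settings. Adding one more parameter with three values would triple this, so it is worth keeping the grid small.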

Checking the results

python.py

clf_rand = RandomForestClassifier(max_depth = None, 
                                  max_leaf_nodes = None, 
                                  min_samples_split = 4, 
                                  n_estimators =180)
model_rand = clf_rand.fit(X_train, y_train)
pred_rand = model_rand.predict(X_test)

print(metrics.classification_report(y_test, pred_rand))



              precision    recall  f1-score   support

           1       1.00      0.97      0.99       114
           2       0.87      0.95      0.91        42
           3       0.71      0.56      0.63         9
           4       0.89      1.00      0.94         8

    accuracy                           0.95       173
   macro avg       0.87      0.87      0.87       173
weighted avg       0.95      0.95      0.95       173

Conclusion

Accuracy improved from 92% to 95%!

Parameter tuning matters, but if you want to push accuracy further, I think data preprocessing (feature extraction) becomes important!
