Random Forest

Posted at 2020-03-19

First, we need to understand the Decision Tree, because Decision Trees are the building blocks of a Random Forest. A Decision Tree is a non-linear model.

CART algorithm

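A Decision Tree is built with the CART algorithm. In brief, CART picks for the root node the explanatory variable that best separates the target variable, using the Gini index as the criterion. For example, suppose the target variable takes two values, X and Y, and the explanatory variables A, B, and C each take one of two values, 1 or 2. For each variable, we compute the Gini impurity of the data at each of its values, and the variable whose average impurity (weighted by the number of samples at each value) is lowest becomes the root node. Gini impurity is the probability that a data point picked at random from a group would be mislabeled if it were labeled at random according to the label distribution in that group; a Gini impurity of 0 means the group is perfectly classified. The same procedure is repeated to choose each subsequent node.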
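The root-node selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than a full CART implementation: the toy rows, the column name `label`, and the helper names `gini_impurity` and `weighted_gini` are all assumptions made for this example.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a group of labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(rows, feature, target):
    """Average Gini impurity after splitting on one feature, weighted by group size."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini_impurity(g) for g in groups.values())

# Hypothetical toy data: A, B, C take the values 1 or 2; the target is "X" or "Y".
rows = [
    {"A": 1, "B": 1, "C": 1, "label": "X"},
    {"A": 1, "B": 2, "C": 2, "label": "X"},
    {"A": 1, "B": 1, "C": 2, "label": "X"},
    {"A": 2, "B": 2, "C": 1, "label": "Y"},
    {"A": 2, "B": 1, "C": 1, "label": "Y"},
    {"A": 2, "B": 2, "C": 2, "label": "Y"},
]

# Score every candidate feature and keep the one with the lowest impurity.
scores = {f: weighted_gini(rows, f, "label") for f in ("A", "B", "C")}
root_feature = min(scores, key=scores.get)
print(scores)
print("root split:", root_feature)
```

In this toy data, A separates X from Y perfectly, so its weighted impurity is 0 and it is chosen as the root. A real tree would then repeat the same scoring inside each child group.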

Random Forest

Introduction

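A Random Forest is a collection of Decision Trees. The reason for combining them is that a model built from a single Decision Tree tends to overfit the training data and also reacts to noise. Combining many trees into an ensemble model, the Random Forest, is therefore a better choice than using a single Decision Tree.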
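As a rough sketch of what combining trees means for classification, the forest can take a majority vote over the predictions of its trees. The three "trees" below are hypothetical hand-written rules standing in for trained Decision Trees, and `majority_vote` and `forest_predict` are names invented for this example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical stand-ins for trained trees: each maps a sample to "X" or "Y".
trees = [
    lambda s: "X" if s["A"] == 1 else "Y",
    lambda s: "X" if s["B"] == 1 else "Y",
    lambda s: "X" if s["C"] == 1 else "Y",
]

def forest_predict(trees, sample):
    """Ask every tree for a prediction and combine the answers by voting."""
    return majority_vote([tree(sample) for tree in trees])

sample = {"A": 1, "B": 2, "C": 1}
print(forest_predict(trees, sample))  # two of the three trees vote "X"
```

A single tree that keys on a noisy feature (here, the one using B) is outvoted by the others, which is the intuition behind the ensemble being less sensitive to noise.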

The random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees, this model uses two key concepts that give it the name random:

Random sampling of training data points when building trees.
Random subsets of features considered when splitting nodes.

1. Random sampling of training data points when building trees

When training, each tree in a random forest learns from its own random sample of the data points. The samples are drawn by bootstrapping, that is, sampling with replacement, so the same data point can appear more than once in one tree's sample.
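Bootstrapping can be illustrated with the standard library alone. In this sketch the training set is just a list of ten indices, and `bootstrap_sample`, the seed, and the number of trees are assumptions chosen for the example.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
data = list(range(10))   # hypothetical training points, identified by index

n_trees = 3
samples = [bootstrap_sample(data, rng) for _ in range(n_trees)]
for i, s in enumerate(samples):
    # Duplicates are expected, so the number of distinct points is usually below len(data).
    print(f"tree {i}: sample={s} distinct={len(set(s))}")
```

Each tree sees a sample of the same size as the original data, but with some points repeated and others left out, which is what makes the trees differ from one another.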

2. Random subsets of features considered when splitting nodes

Only a subset of all the features is considered for splitting each node. For example, if there are 16 features, only 4 of them are considered when splitting a given node.
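The 16-feature example can be sketched as follows. The feature names `f0` to `f15` and the helper `candidate_features` are hypothetical; the point is only that a fresh subset of 4 is drawn each time a node is split, and the best split is then searched among those 4 alone.

```python
import random

def candidate_features(all_features, k, rng):
    """Pick k distinct features to consider at one node split."""
    return rng.sample(all_features, k)

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
features = [f"f{i}" for i in range(16)]     # hypothetical feature names
k = 4                                       # features considered per split

for node in range(3):
    print(f"node {node}: consider {candidate_features(features, k, rng)}")
```

Four is also the square root of 16, which is a commonly used subset size for classification, although the right value depends on the data.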

References:
On the CART algorithm:
https://www.gixo.jp/blog/3980/

On Gini impurity:
https://victorzhou.com/blog/gini-impurity/

On Random Forest:
https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76
