
## ■ Introduction

[Intended readers]
- Those who want to learn or review the basics of these three modeling methods
- Those who don't know the theory in detail but want to build intuition from implementations

[Outline]
- Importing modules
- Preparing the data

1. Decision tree
2. Random forest
3. Gradient boosting

## ■ Importing modules

``````
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphviz
import mglearn

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
``````

## ■ Preparing the data

We use scikit-learn's breast cancer dataset.

``````
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(X.shape)
print(y.shape)

# (569, 30)
# (569,)
``````
``````
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

# (398, 30)
# (398,)
# (171, 30)
# (171,)
``````
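As a quick sanity check (not in the original article), we can also look at the class balance of the target before modeling:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Samples per class: index 0 = malignant, index 1 = benign
counts = np.bincount(cancer.target).tolist()
print({str(n): c for n, c in zip(cancer.target_names, counts)})
# {'malignant': 212, 'benign': 357}
```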

## 1. Decision tree

``````
tree = DecisionTreeClassifier(random_state=0)  # fully grown tree, no depth limit
tree.fit(X_train, y_train)

print('Accuracy on training set:{:.3f}'.format(tree.score(X_train, y_train)))
print('Accuracy on test set:{:.3f}'.format(tree.score(X_test, y_test)))

# Accuracy on training set:1.000
# Accuracy on test set:0.947
``````
``````
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

print('Accuracy on training set:{:.3f}'.format(tree.score(X_train, y_train)))
print('Accuracy on test set:{:.3f}'.format(tree.score(X_test, y_test)))

# Accuracy on training set:0.937
# Accuracy on test set:0.947
``````
``````
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print('Accuracy on training set:{:.3f}'.format(tree.score(X_train, y_train)))
print('Accuracy on test set:{:.3f}'.format(tree.score(X_test, y_test)))

# Accuracy on training set:0.987
# Accuracy on test set:0.965
``````
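Rather than rerunning the cell by hand, the depth experiments above can be collected into one loop (a small sketch, assuming the same train/test split as above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=123)

# Sweep the maximum depth; None means the tree grows until all leaves are pure
for depth in [1, 2, 4, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print('max_depth={}: train {:.3f}, test {:.3f}'.format(
        depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))
```

Deeper trees fit the training data better, but the test accuracy stops improving once the tree starts to overfit.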

``````
print('Feature importances:\n{}'.format(tree.feature_importances_))

'''
Feature importances:
[0.         0.         0.         0.         0.01265269 0.
0.         0.03105661 0.         0.         0.01105151 0.
0.         0.         0.00854057 0.         0.         0.
0.         0.         0.74107768 0.08863705 0.         0.
0.         0.         0.00275035 0.10423353 0.         0.        ]

'''
``````
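The raw array is hard to read; mapping it onto the feature names (a convenience sketch, not from the original) shows which features the tree actually uses:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=123)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Sort features by importance, largest first, and show only the nonzero ones
order = np.argsort(tree.feature_importances_)[::-1]
for i in order:
    if tree.feature_importances_[i] > 0:
        print('{}: {:.3f}'.format(cancer.feature_names[i],
                                  tree.feature_importances_[i]))
```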

Let's also plot them.

``````
def plot_feature_importances_cancer(model):
    n_features = cancer.data.shape[1]  # number of features
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')

plot_feature_importances_cancer(tree)
``````

## 2. Random forest

A random forest performs classification with an ensemble of decision trees.

It builds multiple different trees, for example by bootstrap-sampling the training data and restricting each node to a random subset of the features.
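In scikit-learn, this per-node randomness is controlled by the `max_features` parameter (how many features each split may consider); a minimal sketch of its effect, assuming the same split as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=123)

# Each split considers only a random subset of max_features features,
# which decorrelates the individual trees; None means all 30 features
for mf in ['sqrt', 0.5, None]:
    forest = RandomForestClassifier(n_estimators=10, max_features=mf,
                                    random_state=0).fit(X_train, y_train)
    print('max_features={}: test {:.3f}'.format(mf,
                                                forest.score(X_test, y_test)))
```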

``````
forest = RandomForestClassifier(n_estimators=5, random_state=0)
forest.fit(X_train, y_train)

print('Accuracy on training set: {:.3f}'.format(forest.score(X_train, y_train)))
print('Accuracy on test set: {:.3f}'.format(forest.score(X_test, y_test)))

# Accuracy on training set: 0.992
# Accuracy on test set: 0.959
``````
``````
forest = RandomForestClassifier(n_estimators=7, random_state=0)
forest.fit(X_train, y_train)

print('Accuracy on training set: {:.3f}'.format(forest.score(X_train, y_train)))
print('Accuracy on test set: {:.3f}'.format(forest.score(X_test, y_test)))

# Accuracy on training set: 0.997
# Accuracy on test set: 0.982
``````
``````
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)

print('Accuracy on training set: {:.3f}'.format(forest.score(X_train, y_train)))
print('Accuracy on test set: {:.3f}'.format(forest.score(X_test, y_test)))

# Accuracy on training set: 0.997
# Accuracy on test set: 0.982
``````
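A fitted forest exposes its member trees via `estimators_`, and its predicted class probability is the average of the individual trees' probabilities; a quick illustration (not in the original):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=123)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)

print(len(forest.estimators_))  # one DecisionTreeClassifier per n_estimators: 10

# Averaging the member trees' class probabilities reproduces the forest's output
mean_probs = np.mean([t.predict_proba(X_test) for t in forest.estimators_], axis=0)
print(np.allclose(mean_probs, forest.predict_proba(X_test)))
```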

n_estimators: the number of decision trees

``````
plot_feature_importances_cancer(forest)
``````

## 3. Gradient boosting

Whereas a random forest averages the predictions of independently built decision trees, gradient boosting builds shallow trees one after another, each new tree correcting the mistakes of the trees built so far.

``````
gbrt = GradientBoostingClassifier(random_state=0)  # assumed: default parameters
gbrt.fit(X_train, y_train)

print('Accuracy Training set score: {:.3f}'.format(gbrt.score(X_train, y_train)))
print('Accuracy Test set score: {:.3f}'.format(gbrt.score(X_test, y_test)))

# Accuracy Training set score: 1.000
# Accuracy Test set score: 0.965
``````
``````
gbrt = GradientBoostingClassifier(max_depth=1, random_state=0)  # assumed: depth-1 trees
gbrt.fit(X_train, y_train)

print('Accuracy Training set score: {:.3f}'.format(gbrt.score(X_train, y_train)))
print('Accuracy Test set score: {:.3f}'.format(gbrt.score(X_test, y_test)))

# Accuracy Training set score: 1.000
# Accuracy Test set score: 0.977
``````
``````
gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0)  # assumed: lowered learning rate
gbrt.fit(X_train, y_train)

print('Accuracy Training set score: {:.3f}'.format(gbrt.score(X_train, y_train)))
print('Accuracy Test set score: {:.3f}'.format(gbrt.score(X_test, y_test)))

# Accuracy Training set score: 0.992
# Accuracy Test set score: 0.982
``````

learning_rate: the learning rate (how strongly each decision tree corrects the mistakes of the previous trees)
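The effect of this parameter can be seen by sweeping it in one loop (a sketch assuming the same split as above; the exact numbers depend on the split):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=123)

# A smaller learning rate makes each tree's correction weaker,
# which reduces overfitting on the training set
for lr in [1.0, 0.1, 0.01]:
    gbrt = GradientBoostingClassifier(learning_rate=lr, random_state=0)
    gbrt.fit(X_train, y_train)
    print('learning_rate={}: train {:.3f}, test {:.3f}'.format(
        lr, gbrt.score(X_train, y_train), gbrt.score(X_test, y_test)))
```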

``````
gbrt = GradientBoostingClassifier(random_state=0)  # assumed: default parameters
gbrt.fit(X_train, y_train)

plot_feature_importances_cancer(gbrt)
``````

The plot looks similar to the random forest's, but you can see that some features are ignored entirely.
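We can count the ignored features directly, since a feature never used in any split has zero importance (a small check, assuming the same split and models as above):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=123)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Number of features with zero importance, i.e. never used in any split
print('forest ignores:', int(np.sum(forest.feature_importances_ == 0)))
print('gbrt ignores:  ', int(np.sum(gbrt.feature_importances_ == 0)))
```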
