[Python] A Personal Decision Tree Tutorial #Beginner

I plan to use this as a cheat sheet.

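What Is a Decision Tree?

　A decision tree is a machine learning model widely used for both classification and regression. It has a hierarchical structure made up of questions that can be answered Yes or No. The tree branches by repeatedly splitting the data on one explanatory variable at a time, which lets you see how strongly each explanatory variable affects the target variable. Variables that are split on earlier (closer to the root) can be regarded as more influential.

　A classifier of this kind can be described, for example, as a classification model that separates four classes of data using three features. A machine learning algorithm learns such a model from training data, which makes it possible to actually draw the resulting tree.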
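How a single Yes/No question might be scored can be sketched as follows. The text above does not name a splitting criterion, so this assumes Gini impurity (the `criterion='gini'` default in scikit-learn). The helper names `gini_impurity` and `split_impurity` and the toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after asking 'is value <= threshold?'."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


# Hypothetical toy data: one feature and a binary label
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]

parent = gini_impurity(labels)                 # impurity before splitting
perfect = split_impurity(values, labels, 3.5)  # separates the two classes cleanly
poor = split_impurity(values, labels, 1.5)     # leaves the right-hand side mixed
print(parent, perfect, poor)
```

A lower weighted impurity means a better question, so under this assumption the threshold 3.5 would be preferred over 1.5.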

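Characteristics of Decision Trees

Advantages

The fitted tree can be visualized, so the model is relatively easy to interpret

It is unaffected by differences in feature scale, so preprocessing such as standardization is unnecessary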
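The scale-invariance claim can be checked with a small experiment. This is only a sketch under assumptions: it uses the iris dataset and `StandardScaler` (neither appears elsewhere in this article) and compares predictions on the training data of two trees fitted with the same `random_state`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris is used only because it is small and built in
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and seed; only the feature scale differs
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Standardization keeps the ordering of values within each feature,
# so both trees should partition the training samples identically
same_predictions = bool(
    np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
)
print(same_predictions)
```

Splits depend only on the ordering of values within each feature, which a monotonic rescaling does not change.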

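Disadvantages

It depends heavily on the training data, and no amount of parameter tuning may produce a satisfactory tree structure

It overfits easily, so its generalization performance tends to be low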
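One way to observe the overfitting tendency is to compare training and test accuracy for a tree with no depth limit against a depth-limited one. The sketch below assumes the breast cancer dataset used later in this article; the split settings and the `results` dictionary are illustrative choices, not part of the original walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for name, depth in [("unrestricted", None), ("max_depth=4", 4)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each setting
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={results[name][0]:.3f}, test={results[name][1]:.3f}")
```

A large gap between the training and test scores suggests the tree has memorized the training data; limiting `max_depth` is one common way to narrow that gap.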

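Evaluation Methods

Confusion Matrix

　The confusion matrix is the basic table for evaluating a classification model. It shows the relationship between the model's predictions and the observed values. Specifically, as in the figure below, it has four cells: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).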

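Accuracy

　The proportion of all predictions that were correct. It is calculated as follows.
$$
\text{Accuracy}=\frac{TP+TN}{TP+FP+FN+TN}
$$

Precision

　The proportion of samples predicted as positive that are actually positive. It is calculated as follows.
$$
\text{Precision}=\frac{TP}{TP+FP}
$$

Recall

　The proportion of actually positive samples that were predicted as positive. It is calculated as follows.
$$
\text{Recall}=\frac{TP}{TP+FN}
$$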
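The three formulas can be worked through on a tiny example. The labels below are made up for illustration; the counts come from scikit-learn's `confusion_matrix`, whose flattened order for binary 0/1 labels is TN, FP, FN, TP.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)               # 4 2 1 1
print(accuracy, precision, recall)  # 0.75 0.8 0.8

# The hand-computed values agree with scikit-learn's helpers
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
```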

Implementation in Python

* Assumes the code is run in Jupyter Notebook

Dataset

　The breast cancer diagnostic dataset bundled with scikit-learn. The target is coded as benign (1) and malignant (0).

Libraries Used

[In]

# Libraries for data processing
import pandas as pd
import numpy as np

# Libraries for data visualization
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
plt.style.use('ggplot')
%matplotlib inline

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import metrics

Constants

[In]

# Constants
RESPONSE_VARIABLE = 'cancer' # Target variable
TEST_SIZE = 0.2              # Fraction of the data held out for testing
RANDOM_STATE = 42            # Seed for reproducibility

Loading the Data

[In]

# Load the data (scikit-learn's breast cancer dataset)
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
cancer = pd.DataFrame(data=data.data, columns=data.feature_names)
cancer[RESPONSE_VARIABLE] = data.target

# Show the first five rows
cancer.head()

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...
0	17.99	10.38	122.8	1001	0.1184	0.2776	0.3001	0.1471	0.2419	0.07871	...
1	20.57	17.77	132.9	1326	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...
2	19.69	21.25	130	1203	0.1096	0.1599	0.1974	0.1279	0.2069	0.05999	...
3	11.42	20.38	77.58	386.1	0.1425	0.2839	0.2414	0.1052	0.2597	0.09744	...
4	20.29	14.34	135.1	1297	0.1003	0.1328	0.198	0.1043	0.1809	0.05883	...

Basic Summary Statistics

[In]

# Check the summary statistics
cancer.describe()

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	cancer
count	569	569	569	569	569	569	569	569	569	569	...	569
mean	14.12729	19.28965	91.96903	654.8891	0.09636	0.104341	0.088799	0.048919	0.181162	0.062798	...	0.627417
std	3.524049	4.301036	24.29898	351.9141	0.014064	0.052813	0.07972	0.038803	0.027414	0.00706	...	0.483918
min	6.981	9.71	43.79	143.5	0.05263	0.01938	0	0	0.106	0.04996	...	0
25%	11.7	16.17	75.17	420.3	0.08637	0.06492	0.02956	0.02031	0.1619	0.0577	...	0
50%	13.37	18.84	86.24	551.1	0.09587	0.09263	0.06154	0.0335	0.1792	0.06154	...	1
75%	15.78	21.8	104.1	782.7	0.1053	0.1304	0.1307	0.074	0.1957	0.06612	...	1
max	28.11	39.28	188.5	2501	0.1634	0.3454	0.4268	0.2012	0.304	0.09744	...	1

[In]

# Count the values of the target variable
cancer[RESPONSE_VARIABLE].value_counts()

[Out]

1    357
0    212
Name: cancer, dtype: int64

[In]

# Check for missing values
cancer.isnull().sum()

[Out]

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
cancer                     0
dtype: int64

Splitting the Data

[In]

# Split into training data and test data
train, test = train_test_split(cancer, test_size=TEST_SIZE, random_state=RANDOM_STATE)

# Separate the explanatory variables from the target variable
X_train = train.drop(RESPONSE_VARIABLE, axis=1)
y_train = train[RESPONSE_VARIABLE].copy()

X_test = test.drop(RESPONSE_VARIABLE, axis=1)
y_test = test[RESPONSE_VARIABLE].copy()
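Because the two classes are not equally frequent (357 benign versus 212 malignant), it can be worth checking that the split keeps a similar class ratio. The sketch below is a variation, not the split used in this article: it adds the `stratify` argument of `train_test_split`, which is an assumption introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
cancer = pd.DataFrame(data=data.data, columns=data.feature_names)
cancer['cancer'] = data.target

# stratify keeps the benign/malignant ratio roughly equal in both parts
train, test = train_test_split(
    cancer, test_size=0.2, random_state=42, stratify=cancer['cancer']
)

# The mean of a 0/1 column is the fraction of benign (1) samples
overall_ratio = cancer['cancer'].mean()
train_ratio = train['cancer'].mean()
test_ratio = test['cancer'].mean()

print(len(train), len(test))
print(round(overall_ratio, 3), round(train_ratio, 3), round(test_ratio, 3))
```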

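Visualizing the Data

[In]

# Visualize the distribution of each feature, split by the target variable
features = X_train.columns
# Legend order follows plotting order: cancer == 0 (malignant) is drawn first
legend = ['Malignant', 'Benign']
plt.figure(figsize=(20, 4 * len(features)))
gs = gridspec.GridSpec(len(features), 1)
for i, col in enumerate(features):
    ax = plt.subplot(gs[i])
    # Note: sns.distplot is deprecated since seaborn 0.11 (sns.histplot is its replacement)
    sns.distplot(train[col][train.cancer == 0], bins=50, color='crimson')
    sns.distplot(train[col][train.cancer == 1], bins=50, color='royalblue')
    plt.legend(legend)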

Feature Selection

　After fitting scikit-learn's RandomForestClassifier(), the "importance" of each feature can be checked through its feature_importances_ attribute.

[In]

# Feature selection
RF = RandomForestClassifier(n_estimators=250, random_state=RANDOM_STATE)
RF.fit(X_train, y_train)

[Out]
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

[In]

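# Print the features in descending order of importance
features = X_train.columns
importances = RF.feature_importances_

# Pair each importance (rounded to two decimals) with its feature name, then sort descending
importances_features = sorted(zip(map(lambda x: round(x, 2), importances), features), reverse=True)

for i in importances_features:
    print(i)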

[Out]

(0.13, 'worst perimeter')
(0.13, 'worst concave points')
(0.13, 'worst area')
(0.11, 'mean concave points')
(0.07, 'worst radius')
(0.05, 'mean radius')
(0.05, 'mean concavity')
(0.04, 'worst concavity')
(0.04, 'mean perimeter')
(0.04, 'mean area')
(0.02, 'worst texture')
(0.02, 'worst compactness')
(0.02, 'radius error')
(0.02, 'mean compactness')
(0.02, 'area error')
(0.01, 'worst symmetry')
(0.01, 'worst smoothness')
(0.01, 'worst fractal dimension')
(0.01, 'perimeter error')
(0.01, 'mean texture')
(0.01, 'mean smoothness')
(0.01, 'fractal dimension error')
(0.01, 'concavity error')
(0.0, 'texture error')
(0.0, 'symmetry error')
(0.0, 'smoothness error')
(0.0, 'mean symmetry')
(0.0, 'mean fractal dimension')
(0.0, 'concave points error')
(0.0, 'compactness error')

Top five features from the random forest feature selection:

worst perimeter
worst concave points
worst area
mean concave points
worst radius

[In]

# Get the top five as a list (their rounded importances are all >= 0.06)
feature_list = [value for key, value in importances_features if key >= 0.06]
feature_list

[Out]

['worst perimeter',
 'worst concave points',
 'worst area',
 'mean concave points',
 'worst radius']
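Selecting by a fixed threshold depends on the rounded importance values, so an alternative is to ask for the k largest importances directly. The sketch below assumes a pandas Series and its `nlargest` method; the smaller forest (`n_estimators=100`) and the variable names are illustrative choices, so the selected features may differ from the list above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: a smaller forest than in the article, to keep the sketch fast
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Index the importances by feature name, then take the five largest
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
top_features = importances.nlargest(5).index.tolist()
print(top_features)
```

With this approach the number of selected features is explicit and does not change if the importance values shift slightly between runs.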

[In]

# Restrict the training and test data to the most important features
X_train = X_train[feature_list]
X_test = X_test[feature_list]

[In]

# Check the distributions again for the selected features
# Legend order follows plotting order: cancer == 0 (malignant) is drawn first
legend = ['Malignant', 'Benign']
plt.figure(figsize=(20, 4 * len(feature_list)))
gs = gridspec.GridSpec(len(feature_list), 1)
for i, col in enumerate(feature_list):
    ax = plt.subplot(gs[i])
    sns.distplot(train[col][train.cancer == 0], bins=50, color='crimson')
    sns.distplot(train[col][train.cancer == 1], bins=50, color='royalblue')
    plt.legend(legend)

Training, Prediction, and Evaluation

[In]

# Train the model
clf = DecisionTreeClassifier(max_depth=4)
clf = clf.fit(X_train, y_train)
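Since interpretability is one of the advantages of decision trees, it can help to print the learned questions. The sketch below uses scikit-learn's `export_text`; to stay self-contained it fits a separate, shallower tree (`clf_demo`, a hypothetical name) on all features rather than reusing `clf` from above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# Assumption: max_depth=2 keeps the printed rules short enough to read
clf_demo = DecisionTreeClassifier(max_depth=2, random_state=42)
clf_demo.fit(data.data, data.target)

# Each line is either a threshold question on a feature or a leaf with its class
rules = export_text(clf_demo, feature_names=list(data.feature_names))
print(rules)
```

The same call should work for the article's `clf` by passing `feature_names=list(X_train.columns)`.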

[In]

# Predict using the features of the training data
y_pred = clf.predict(X_train)

[In]

def drawing_confusion_matrix(y: pd.Series, pre: np.ndarray) -> None:
    """
    Draw the confusion matrix.

    @param y: target variable (observed values)
    @param pre: predicted values
    """
    # Set the font size before drawing so that it applies to this figure
    plt.rcParams["font.size"] = 15
    # Rows are observed classes, columns are predicted classes
    confmat = confusion_matrix(y, pre)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
    for i in range(confmat.shape[0]):
        for j in range(confmat.shape[1]):
            ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
    plt.title('Predicted value')
    plt.ylabel('Measured value')
    plt.tight_layout()
    plt.show()

[In]

def calculation_evaluations(y: pd.Series, pre: np.ndarray) -> None:
    """
    Compute and print accuracy, precision, and recall.

    @param y: target variable (observed values)
    @param pre: predicted values
    """
    print('Accuracy: {:.3f}'.format(metrics.accuracy_score(y, pre)))
    print('Precision: {:.3f}'.format(metrics.precision_score(y, pre)))
    print('Recall: {:.3f}'.format(metrics.recall_score(y, pre)))

[In]

drawing_confusion_matrix(y_train, y_pred)
calculation_evaluations(y_train, y_pred)

[Out]

Accuracy: 0.969
Precision: 0.979
Recall: 0.972

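When reading the matrix here, malignant (label 0) is treated as the positive class.
TP (top-left), 163, is the number of cases the model predicted as malignant that are actually malignant.
FP (bottom-left), 9, is the number of cases predicted as malignant that are not actually malignant.
FN (top-right), 6, is the number of cases that are actually malignant but were predicted as benign.
Note that metrics.precision_score and metrics.recall_score use pos_label=1 by default, so the precision and recall printed above treat benign (label 1) as the positive class.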
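Which class counts as "positive" changes the precision and recall values. The sketch below uses made-up labels (0 = malignant, 1 = benign, as in the dataset) to compare the scikit-learn default `pos_label=1` with `pos_label=0`; the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = malignant, 1 = benign
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are observed classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Default: benign (1) is the positive class
precision_benign = precision_score(y_true, y_pred)
recall_benign = recall_score(y_true, y_pred)

# Treat malignant (0) as the positive class instead
precision_malignant = precision_score(y_true, y_pred, pos_label=0)
recall_malignant = recall_score(y_true, y_pred, pos_label=0)

print(precision_benign, recall_benign)
print(precision_malignant, recall_malignant)
```

For a screening task where missing a malignant case is costly, the recall with `pos_label=0` is arguably the more relevant number.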

[In]

# Predict on the test data with the trained model
y_pred_test = clf.predict(X_test)

[In]

drawing_confusion_matrix(y_test, y_pred_test)
calculation_evaluations(y_test, y_pred_test)

[Out]

Accuracy: 0.939
Precision: 0.944
Recall: 0.958
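A single train/test split gives only one estimate of performance, and decision trees are sensitive to the particular training data. As a complementary check, the sketch below assumes 5-fold cross-validation with scikit-learn's `cross_val_score` on all features; this step is not part of the original walkthrough, and the settings are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same depth limit as in the article; random_state fixed for repeatability
model = DecisionTreeClassifier(max_depth=4, random_state=42)

# Accuracy on each of the 5 held-out folds
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores)
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough sense of how much the accuracy depends on which samples end up in the training set.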