More than 5 years have passed since last update.

Balanced Random Forestをpythonで

Last updated at 2016-11-23Posted at 2016-11-19

Random Forestについて

Random Forestは、決定木を弱学習機とする集団学習アルゴリズムで、代表的な機械学習手法のひとつです。
特徴量と、観測データをサンプリングし、決定木を構築していく。
調べればいっぱい出てきますので、ここではアルゴリズム等は割愛

不均衡データに適用する問題点

例えば、正例1000件と負例50件のデータをRandom Forestで分類しようとすると、何らかの対策をしないと分類器としては使えない。
不均衡データに対する重みづけで対応することもできる（scikit-learnのRandomForestには、class_weightパラメータとして調整可能）
しかし、一つのデータの重みが強すぎて過学習を起こしたりすることも考えられる。

Balanced Random Forestとは

そこで、各決定木ごとにサンプル数を調整することで不均衡データに対応する手法がある。
（先ほどの例だと、例えば、各決定木ごと、正例５０、負例５０ずつの木をつくる。重みづけをせずにこのようなサンプリングで不均衡データに対応する）
これをBalanced Random Forestと呼ぶらしい。
Rでは簡単にパラメータで実行可能だが、scikit-learnだとこれに対応するものがなかったので、自分で簡単ですがやってみました。
（間違い等あったらコメントしてくれると喜びます）

実装

balanced_random_forest.py

# !/usr/bin/env python
# -*- coding: utf-8 -*-

from collections import Counter
from random import sample
from math import sqrt
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import resample


def _build_tree(train: np.ndarray, label: np.ndarray):
    tree = DecisionTreeClassifier()
    tree.fit(train, label)
    return tree


def _sampling_equal(y_values: np.ndarray, n_samples: int, bootstrap: bool = True):
    """
    :param y_values: label data
    :param n_samples: number of samples
    :param bootstrap: whether bootstrap or not
    :return: sampling index
    """
    if bootstrap:
        return [i for v in [resample(pd.Series(y_values)[pd.Series(y_values) == uq].index,
                                     n_samples=n_samples) for uq in np.unique(y_values)] for i in v]
    else:
        return [i for v in [sample(list(pd.Series(y_values)[pd.Series(y_values) == uq].index),
                                   n_samples) for uq in np.unique(y_values)] for i in v]


class BalancedRandomForestClassifier():
    """クラスごとのサンプル数を一定にして木を作ることで不均衡データに対応"""

    def __init__(self, n_estimator: int, n_samples: int, bootstrap: bool = True, max_features: int = 0):
        """
        :param n_estimator: number of tree
        :param n_samples: number of sampling data
        :param bootstrap: how to resample
        :param max_features: number of feature
        """
        self.n_estimator = n_estimator
        self.n_samples = n_samples
        self.bootstrap = bootstrap
        self.max_features = max_features

    def fit(self, x_values: np.ndarray, y_values: np.ndarray):
        """
        :param x_values: train data
        :param y_values: label data
        """
        if self.max_features == 0:
            self.max_features = round(sqrt(x_values.shape[1]))
        # bootstrap等で木に学習させるデータを選択
        index_list = [_sampling_equal(y_values, self.n_samples, self.bootstrap) for i in range(0, self.n_estimator)]
        # 木ごとの特徴量を選択
        self.feature_list = [sample(range(0, x_values.shape[1]), self.max_features) for i in range(0, self.n_estimator)]
        # 上記に基づいて木を構築
        self.forest = Parallel(n_jobs=-1, backend="threading")(
            delayed(_build_tree)(x_values[np.ix_(index, feature)], y_values[index])
            for index, feature in zip(index_list, self.feature_list))
        # 各木の予測の多数決で決定
        self.predict = lambda x: [Counter(item).most_common(1)[0][0]
                                  for item in np.array([tree.predict(x[:, feature])
                                                        for tree, feature in zip(self.forest, self.feature_list)]).T]
        # ここからはimportanceの計算
        count = np.zeros(x_values.shape[1])
        feature_importances = np.zeros(x_values.shape[1])
        for tree, feature in zip(self.forest, self.feature_list):
            count[feature] += 1
            feature_importances[feature] += tree.feature_importances_
        self.feature_importances_ = feature_importances / count

ベースとなる決定木の実装

def _build_tree(train: np.ndarray, label: np.ndarray):
    tree = DecisionTreeClassifier()
    tree.fit(train, label)
    return tree

こちらは、scikit-learnの決定木そのままです。
Balanced Random Forestはサンプリングの仕方を変更するだけなので、
あるものは使ったほうが間違いない（私の実装力は５３です）

Bootstrapによるサンプリング

def _sampling_equal(y_values: np.ndarray, n_samples: int, bootstrap: bool = True):
    """
    :param y_values: label data
    :param n_samples: number of samples
    :param bootstrap: whether bootstrap or not
    :return: sampling index
    """
    if bootstrap:
        return [i for v in [resample(pd.Series(y_values)[pd.Series(y_values) == uq].index,
                                     n_samples=n_samples) for uq in np.unique(y_values)] for i in v]
    else:
        return [i for v in [sample(list(pd.Series(y_values)[pd.Series(y_values) == uq].index),
                                   n_samples) for uq in np.unique(y_values)] for i in v]

これは、各ラベルごとにブートストラップ（オプションで、重複なしのサンプリングも可能だがふつうは使わない）で同じ数ずつラベルを取得する。こんなめんどくさい内包になっているのは、各ラベルごとサンプリングすると多次元配列になるので、それを一次元配列に直すため

学習

def fit(self, x_values: np.ndarray, y_values: np.ndarray):
        """
        :param x_values: train data
        :param y_values: label data
        """
        if self.max_features == 0:
            self.max_features = round(sqrt(x_values.shape[1]))
        # bootstrap等で木に学習させるデータを選択
        index_list = [_sampling_equal(y_values, self.n_samples, self.bootstrap) for i in range(0, self.n_estimator)]
        # 木ごとの特徴量を選択
        self.feature_list = [sample(range(0, x_values.shape[1]), self.max_features) for i in range(0, self.n_estimator)]
        # 上記に基づいて木を構築
        self.forest = Parallel(n_jobs=-1, backend="threading")(
            delayed(_build_tree)(x_values[np.ix_(index, feature)], y_values[index])
            for index, feature in zip(index_list, self.feature_list))
        # 各木の予測の多数決で決定
        self.predict = lambda x: [Counter(item).most_common(1)[0][0]
                                  for item in np.array([tree.predict(x[:, feature])
                                                        for tree, feature in zip(self.forest, self.feature_list)]).T]
        # ここからはimportanceの計算
        count = np.zeros(x_values.shape[1])
        feature_importances = np.zeros(x_values.shape[1])
        for tree, feature in zip(self.forest, self.feature_list):
            count[feature] += 1
            feature_importances[feature] += tree.feature_importances_
        self.feature_importances_ = feature_importances / count

まず最初に、各決定木に何個の特徴量を使うかを決定
- （特徴量が$M$次元とすると、分類だと、$\sqrt{M}$、回帰だと$\frac{M}{3}$が推奨値らしい）
次に、ブートストラップ法で、各ラベルごと均等にサブサンプルを決定
木ごとに使用する特徴量も、先ほどの数で決定
あとは、並列に木を構築し、forestという変数に格納する

予測値は、つくった木の多数決で決定するため、各木にそれぞれ使用した特徴量（self.feature_list）をもとに予測させ、多数決をとる（Counter(item).most_common(1)[0][0]で一番数の多かったラベルを取得している。）

importanceは、各決定木ごとの特徴量の寄与率の平均値を算出するため、
特徴量ごとのimportanceの合計値を、出現回数で割っている

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up