ゼロから作る決定木入門

Posted at 2025-07-07

背景

新卒1年目の研修にてUdemyで機械学習の基礎について学んでいるのですが、インプットだけではやはり身についている気があまりしないですよね。
私が好きな本の一つに「ゼロから作るDeep Learning」という本があります。Kerasなどのライブラリを使用せずゼロから関数を実装することで、深いところまで理解することができます。
そこで今回はClaudeを利用してライブラリを使用せずゼロから決定木のアルゴリズムを解説付きで作らせ、それを模写してみました。それにより理解を深めることができたので共有します。

ゼロから作って見えてきたこと

今回ゼロから実装することで具体的に決定木の分割方法と構築方法を深いレベルで理解することができました

1. 分割方法

ゼロから作る前は、「決定木は最もクラスが分かれる閾値を見つけ、その閾値以上と未満で分割する」という程度の理解でした。今回ゼロから作ることで、実際に最もクラスが分かれる状態とは数値的にどのような形で表現されるのか、その閾値はどのようにして見つけるのかを学ぶことができました。

解説

ジニ係数は以下のように実装される。

def gini_impurity(self,y):
    if len(y) == 0:
        return 0
    proportions = np.bincount(y) / len(y)
    return 1 - np.sum(proportions**2)

この実装からわかるとおり、ジニ係数は np.bincount(y) / len(y) の二乗の和によって決まり、この値が大きいほど小さくなる。すなわち一つのクラスにまとまっている時の方がジニ係数は小さくなる。
親ノードのジニ係数と子ノードのジニ係数の差分（情報ゲインという）をとり、それが最大となる箇所が閾値である。

def information_gain(self,X_column,y,threshold):
    parent_impurity = self.calculate_impurity(y)
    left_mask = X_column <= threshold
    right_mask = X_column > threshold

    if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
        return 0
    
    n = len(y)

    n_left,n_right = np.sum(left_mask),np.sum(right_mask)

    left_impurity = self.calculate_impurity(y[left_mask])
    right_impurity = self.calculate_impurity(y[right_mask])

    weighed_impurity = (n_left/n) * left_impurity + (n_right/n) * right_impurity

    return parent_impurity - weighed_impurity

2. 構築方法

実際に木を構築する方法を学ぶことができました。
終了条件（深さが指定した値より深くなる、ノード内にクラスが一つしかない、情報ゲインが0の場合）に達するまで再帰的に木を構築し続けていることが理解できました。

def _build_tree(self,X,y,depth):
    n_samples,n_features = X.shape
    n_classes = len(np.unique(y))

    if (self.max_depth is not None and depth >= self.max_depth) or \
        n_classes == 1 or \
        n_samples < self.min_samples_split:
        most_common_class = Counter(y).most_common(1)[0][0]
        return {'class':most_common_class,'samples':n_samples}
    
    best_feature,best_threshold,best_gain = self.find_best_split(X,y)

    if best_gain == 0:
        most_common_class = Counter(y).most_common(1)[0][0]
        return {'class':most_common_class,'samples':n_samples}
    
    left_mask = X[:,best_feature] <= best_threshold
    right_mask = X[:,best_feature] > best_threshold

    left_subtree = self._build_tree(X[left_mask],y[left_mask],depth+1)
    right_subtree = self._build_tree(X[right_mask],y[right_mask],depth+1)
    return {
        'feature':best_feature,
        'threshold': best_threshold,
        'left': left_subtree,
        'right': right_subtree,
        'samples': n_samples
    }

実装

class DecisionTree:
    def __init__(self,criterion='gini',max_depth=None,min_samples_split=2):
        self.criterion = criterion
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None

    def gini_impurity(self,y):
        if len(y) == 0:
            return 0
        proportions = np.bincount(y) / len(y)
        return 1 - np.sum(proportions**2)
    
    def entropy(self,y):
        if len(y) == 0:
            return 0
        proportions = np.bincount(y) / len(y)
        return -np.sum([p*math.log2(p) for p in proportions if p>0])
    
    def calculate_impurity(self,y):
        if self.criterion == 'gini':
            return self.gini_impurity(y)
        elif self.criterion == 'entropy':
            return self.entropy(y)
        else:
            raise ValueError("Criterion must be 'gini' or 'entropy'")
        
    
    def information_gain(self,X_column,y,threshold):
        parent_impurity = self.calculate_impurity(y)
        left_mask = X_column <= threshold
        right_mask = X_column > threshold

        if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
            return 0
        
        n = len(y)

        n_left,n_right = np.sum(left_mask),np.sum(right_mask)

        left_impurity = self.calculate_impurity(y[left_mask])
        right_impurity = self.calculate_impurity(y[right_mask])

        weighed_impurity = (n_left/n) * left_impurity + (n_right/n) * right_impurity

        return parent_impurity - weighed_impurity
        
    def find_best_split(self,X,y):
        best_gain = -1
        best_feature = None
        best_threshold = None 

        n_features = X.shape[1]

        for feature_idx in range(n_features):
            X_column = X[:,feature_idx]
            thresholds = np.unique(X_column)
            for threshold in thresholds:
                gain = self.information_gain(X_column,y,threshold)

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold
        return best_feature,best_threshold,best_gain

    def fit(self,X,y):
        X = np.array(X)
        y = np.array(y)

        self.root = self._build_tree(X,y,depth=0)
        return self
    
    def _build_tree(self,X,y,depth):
        n_samples,n_features = X.shape
        n_classes = len(np.unique(y))

        if (self.max_depth is not None and depth >= self.max_depth) or \
            n_classes == 1 or \
            n_samples < self.min_samples_split:
            most_common_class = Counter(y).most_common(1)[0][0]
            return {'class':most_common_class,'samples':n_samples}
        
        best_feature,best_threshold,best_gain = self.find_best_split(X,y)

        if best_gain == 0:
            most_common_class = Counter(y).most_common(1)[0][0]
            return {'class':most_common_class,'samples':n_samples}
        
        left_mask = X[:,best_feature] <= best_threshold
        right_mask = X[:,best_feature] > best_threshold

        left_subtree = self._build_tree(X[left_mask],y[left_mask],depth+1)
        right_subtree = self._build_tree(X[right_mask],y[right_mask],depth+1)
        return {
            'feature':best_feature,
            'threshold': best_threshold,
            'left': left_subtree,
            'right': right_subtree,
            'samples': n_samples
        }

    def _predict_sample(self,sample, node):
        if 'class' in node:
            return node['class']
        
        if sample[node['feature']] <= node['threshold']:
            return self._predict_sample(sample,node['left'])
        else:
            return self._predict_sample(sample,node['right'])
    
    def predict(self,X):
        X = np.array(X)
        return np.array([self._predict_sample(sample,self.root) for sample in X])
    
    def print_tree(self,node=None,depth=0):
        if node is None:
            node = self.root
        indent = " " * depth

        if 'class' in node:  # 葉ノード
            print(f"{indent}予測クラス: {node['class']} (サンプル数: {node['samples']})")
        else:
            print(f"{indent}特徴量{node['feature']} <= {node['threshold']:.2f}? (サンプル数: {node['samples']})")
            print(f"{indent}├─ Yes:")
            self.print_tree(node['left'], depth + 1)
            print(f"{indent}└─ No:")
            self.print_tree(node['right'], depth + 1)

まとめ

「単にゼロから〇〇を実装したい」と生成AIに投げれば「ゼロから作るDeep Learning」風の学習を行うことができるのは便利ですね。元のソースコードを見にいく必要なく、初学者にも手をつけやすいと思います。
このゼロから作る体験を通してでてきた新しい疑問、たとえば特徴量の増加に当たって計算量はどのように増えるか、それに対処するためにどのような工夫が行われるのか、などをさらに追加で実装することによって学びたいと思います。
最後までご覧いただきありがとうございました。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up