[AI] Summary of Decision Tree

Posted at 2015-06-29

What is a Decision Tree?

A decision tree is one of the simplest methods for supervised learning. It can be applied to both regression and classification.

At each node of a tree, a test is applied which sends the query sample down one of the branches of the node.

This continues until the query sample arrives at a terminal or leaf node. Each leaf node is associated with a value: a class label in classification, or a numeric value in regression.

The value of the leaf node reached by the query sample is returned as the output of the tree.

An attribute or feature is a characteristic of the situation that we can measure.
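
To make this routing concrete, here is a minimal Python sketch of prediction with an already-built tree. The tree structure, the attribute names (outlook, humidity), and the query sample are invented purely for illustration.

```python
# Minimal sketch: routing a query sample through a hand-built decision tree.
# The attributes, values, and labels below are made up for illustration.

tree = {
    "attribute": "outlook",              # test applied at this (root) node
    "branches": {
        "sunny": {
            "attribute": "humidity",     # another internal node
            "branches": {
                "high": {"label": "no"},      # leaf: class label
                "normal": {"label": "yes"},   # leaf: class label
            },
        },
        "overcast": {"label": "yes"},
        "rain": {"label": "no"},
    },
}

def predict(node, sample):
    """Send the query sample down the branches until a leaf node is reached."""
    while "label" not in node:
        value = sample[node["attribute"]]
        node = node["branches"][value]
    return node["label"]

print(predict(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> yes
```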

Learning Decision Trees from Observations

We should aim to build the smallest tree that is consistent with the observations. One way to do this is to recursively test the most important attributes first.

Decision Tree Learning Algorithm

Aim: Find the smallest tree consistent with the training samples
Idea: Recursively choose the "most significant" attribute as the root of each tree/subtree.

Four basic underlying Ideas

  • 1. If there are some positive and negative samples, then choose the best attribute to split them.
  • 2. If the remaining samples are all positive or all negative, we have reached a leaf node. Assign the label positive/negative accordingly.
  • 3. If there are no samples left, it means that no such sample has been observed. Return a default value calculated from the majority classification at the node's parent.
  • 4. If there are no attributes left, but both positive and negative samples, it means that these samples have exactly the same feature values but different classifications. This may happen because
    • some of the data could be incorrect, or
    • the attributes do not give enough information to describe the situation fully, or
    • the problem is truly non-deterministic => Call it a leaf node and assign the majority vote as the label.
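
A rough Python sketch of this recursive procedure is shown below. It assumes samples are given as (feature-dict, label) pairs with discrete attribute values, and that a helper choose_best_attribute exists; an information-gain version of that helper is sketched in the next section.

```python
from collections import Counter

def majority_label(samples):
    """Majority vote over the labels of a list of (features, label) pairs."""
    return Counter(label for _, label in samples).most_common(1)[0][0]

def dtl(samples, attributes, parent_samples=()):
    """Recursive decision-tree learning, following the four ideas above."""
    if not samples:                                    # idea 3: no samples left
        return majority_label(parent_samples)
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:                          # idea 2: homogeneous node
        return labels[0]
    if not attributes:                                 # idea 4: same features, mixed labels
        return majority_label(samples)
    best = choose_best_attribute(samples, attributes)  # idea 1: best split (see next section)
    tree = {best: {}}
    for value in {features[best] for features, _ in samples}:
        subset = [(f, l) for f, l in samples if f[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = dtl(subset, remaining, samples)
    return tree
```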

How to choose the best attribute

We need a measure of "good" or "bad" for attributes.
One way to do this is to compute the information content at a node.

i.e., at node R

I(R) = -\sum_{i=1}^{L} P(c_i) \log_2 P(c_i)

where $\{c_1, \dots, c_L\}$ are the $L$ class labels present at the node, and $P(c_i)$ is the probability of getting class $c_i$ at the node.

In general, the amount of information will be maximal when all classes are equally likely, and minimal (zero) when the node is homogeneous, i.e., all samples have the same label.
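
As a small check, the information content can be computed directly from the class labels present at a node. The sketch below assumes the labels are given as a plain Python list.

```python
import math
from collections import Counter

def information(labels):
    """I(R) = -sum_i P(c_i) log2 P(c_i), computed from the labels at a node."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

print(information(["yes", "no", "yes", "no"]))    # equally likely classes -> 1.0 bit (maximum)
print(information(["yes", "yes", "yes", "yes"]))  # homogeneous node -> 0.0 (minimum)
```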

The information gain from testing attribute A is the difference between the original information content and the expected information content after the split

i.e.

$Gain(A) = I(R) - Remainder(A)$

where... $Remainder(A) = \sum_{i=1}^{M} P(A = v_i)\, I(E_{A=v_i})$

where... $\{E_{A=v_1}, \dots, E_{A=v_M}\}$ are the child nodes of R after testing attribute A, and $M$ is the number of values attribute A can take.

The key idea for choosing the best attribute in the DTL algorithm is to pick the attribute that gives the maximum information gain.
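
Continuing the earlier sketches (the (feature-dict, label) sample representation and the information function are assumptions carried over from them), Gain(A) and the maximum-gain attribute choice could look like this:

```python
def remainder(samples, attribute):
    """Remainder(A): expected information content of the child nodes after splitting on A."""
    total = len(samples)
    rem = 0.0
    for value in {features[attribute] for features, _ in samples}:
        child = [label for features, label in samples if features[attribute] == value]
        rem += (len(child) / total) * information(child)   # P(A = v_i) * I(E_{A=v_i})
    return rem

def gain(samples, attribute):
    """Gain(A) = I(R) - Remainder(A)."""
    return information([label for _, label in samples]) - remainder(samples, attribute)

def choose_best_attribute(samples, attributes):
    """DTL's key idea: pick the attribute with the maximum information gain."""
    return max(attributes, key=lambda a: gain(samples, a))
```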

Remarks

A different point of view is to regard I(node) as a measure of impurity. The more "mixed" the samples are at a node, the higher the impurity value.
On the other hand, a homogeneous node will have zero impurity.

The Gain(A) value can thus be viewed as the amount of reduction in impurity if we split according to A.

This affords the intuitive idea that we grow trees by recursively trying to obtain leaf nodes that are as pure as possible.
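
For reference, library implementations follow the same idea. The short scikit-learn snippet below is only an illustrative aside (the iris dataset is just a convenient stand-in): with criterion="entropy", the tree is grown by choosing splits that reduce this impurity.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="entropy" grows the tree by reducing information-content impurity at each split
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```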

=> To be continued: how to prevent overfitting.
