## Decision Trees – C4.5

C4.5 is an algorithm developed by Ross Quinlan that generates Decision Trees (DT), which can be used for classification problems. It improves (extends) the ID3 algorithm by dealing with both continuous and discrete attributes, missing values and pruning trees after construction. Its commercial successor is C5.0/See5, a lot faster that C4.5, more memory efficient and used for building smaller decision trees.

Being a supervised learning algorithm, it requires a set of training examples and each example can be seen as a pair: input object and a desired output value (class). The algorithm analyzes the training set and builds a classifier that must be able to correctly classify both training and test examples. A test example is an input object and the algorithm must predict an output value (the example must be assigned to a class).

The classifier used by C4.5 is a decision tree and this tree is built from root to leaves by respecting the Occam’s Razor. This razor says that given two correct solution for a certain problem, we should choose the simpler solution.

Here is an example of a decision tree built using C4.5:

The input and output requirements and a pseudo-code for the C4.5 algorithm is presented below:

The input for the algorithm consists of a set S of examples described by continuous or discrete attributes, each example belonging to one class.

The output is a decision tree or/and a set of rules that assigns a class to a new case (example).

Algorithm:

Check for base case.
Find the attribute with the highest informational gain (A_best)
Partition S into S1,S2,S3... according to the values of A_best
Repeat the steps for S1,S2,S3...
<em>

The base cases are the following:

•  All the examples from the training set belong to the same class ( a tree leaf labeled with that class is returned ).

•  The training set is empty ( returns a tree leaf called failure ).

•  The attribute list is empty ( returns a leaf labeled with the most frequent class or the disjuction of all the classes).

The attribute with the highest informational gain is computed using the following formulas:

Entropy:  $E(S ) = \sum_{i=1}^{n}-Pr(C_i)*log_2Pr(C_i)$
Gain:  $G(S,A) = E(S)-\sum_{i=1}^mPr(A_i)E(S_{Ai})$

E(S) – information entropy of S

G(S,A) – gain of S after a split on attribute A

n – nr of classes in S

Pr(Ci) – frequency of class Ci in S

m – nr of values of attribute A in S

Pr(Ai) – frequency of cases that have Ai value in S

E(SAi) – subset of S with items that have Ai value

The C4.5 algorithm improves the ID3 algorithm by allowing numerical attributes, permitting missing values and performing tree pruning.

Numeric attributes:

Temperature attribute:

In order to deal with attributes like temperature (may have a lot of numerical values) binary split is performed. The values of the attribute are sorted. Then, the gain is computed for every split point of an attribute. It is chosen the best split point (h) and  the Ai attribute is split like this: Ai ≤ h, Ai > h.

An optimal binary split is done when it is evaluated the gain only between points of different classes. For example, we don’t have to evaluate the gain between 72 and 75 because they are assigned to the same class ‘Yes’ and this point will surely not be chosen for split. In addition to this, we should not repeat the order of the values at each node, because the order for a node’s children can be determined from the order of the values of the parent.

Missing values:

Missing values are usually marked with “?”. Dealing with missing values involves imputation. Imputation means that if an important feature is missing, it can be estimated from the available data.

Distribution-based imputation is done when we split the example into multiple instances, each with a different value for the missing feature. It is assigned a weight corresponding to the estimated probability for the particular missing value and the weights sums up to 1.

For example: Split the 18 instances by Cost: 10 go down the <5Ł branch, 8 down the >5Ł branch. So when  we get an instance without Cost, the classification for <5Ł happened 10/18 times and the classification for >5Ł, 8/18 times. We can say that ‘Buy 5’ has 5/9 ‘votes’ and ‘Don’t buy’ 4/9 ‘votes’. So, instead of having a single class assigned to an example, we have more classes and each class has a certain probability. Most likely, we will choose as correct the class with the highest probability.

Tree pruning:

Tree pruning is done in order to obtain smaller trees and avoid over-fitting (the algorithm tries to classify the training data so well and it becomes too specific to correctly classify the test data).

Pruning can be done in two ways: pre-pruning and post-pruning. With pre-pruning we stop building the tree when the attributes are irrelevant. It is harder to perform, but faster. When post-pruning we build the entire tree and remove certain branches. We can do this by either using sub-tree raising or sub-tree replacement.

C4.5 algorithm performs sub-tree replacement. This means that a sub=tree is replaced by a leaf if it reduces the classification error.

We estimate error rate as follows: suppose a node classifies N instances with E errors. Then, the actual error rate on new cases usually higher than E/N.

One way of estimating the error rate is by splitting the training set and estimate errors according to the instances not used for training. Another way is by performing pessimistic pruning (we increases the number of errors observed at each node). We base our increase on the error rate on training set, number of instances convered by node and confidence level. If the combined error for a set of branches is higher than the parent, prune them.

Example of tree pruning using pessimistic pruning:

The advantages of the C4.5 are:

•  Builds models that can be easily interpreted

•  Easy to implement

•  Can use both categorical and continuous values

•  Deals with noise

•   Small variation in data can lead to different decision trees (especially when the variables are close to each other in value)

•  Does not work very well on a small training set

Conclusion

C4.5 is used in classification problems and it is the most used algorithm for builing DT.

It is suitable for real world problems as it deals with numeric attributes and missing values. The algorithm can be used for building smaller or larger, more accurate decision trees and the algorithm is quite time efficient.

Compared to ID3, C4.5 performs by default a tree pruning process, which leads to smaller trees, more simple rules and more intutive interpretations.