Understanding Decision Trees with a Spreadsheet

Introduction

Decision trees are widely used in Machine Learning. Even though simple decision trees often have poor predictive results, trees are the main ingredients in more advanced algorithms like Random Forests, AdaBoost and Gradient Boosting. Understanding how to grow a tree from scratch is a necessary step towards mastering more complex algorithms.
In this article, I propose to build a tree by hand. We will use a completely made-up dataset about churn risk, which I created for illustration purposes. The features describe whether a client is active on the website, whether they read e-mails, and whether they are a new client (less than two years).
In parallel, we will compare our results with the tree obtained from Scikit-Learn.
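For reference, here is roughly how such a tree could be trained with Scikit-Learn. This is a minimal sketch: the file name churn.csv and the column names churn, isActive_website, isReading_mail and is_new_client are assumptions based on the description above, not necessarily the exact names used in the original dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical file and column names, based on the features described above.
df = pd.read_csv("churn.csv")
X = df[["isActive_website", "isReading_mail", "is_new_client"]]
y = df["churn"]

# Gini impurity is the default criterion in Scikit-Learn.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# Text representation of the fitted tree, to compare with the spreadsheet.
print(export_text(clf, feature_names=list(X.columns)))
```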

The Gini Impurity

The root node is the starting point of the tree. It shows how many samples are in the dataset and how many samples belong to each class:
[Figure: root node as displayed by Scikit-Learn]

As you can see in the dataset, there are 999 samples. The target variable contains 524 ones and 475 zeros.
The training process requires computing a measure of purity for each feature. This purity measure tells us how well a given feature discriminates the data (information gain). It is possible to use entropy or the Gini impurity (not to be confused with the Gini coefficient used in economics to describe wealth inequalities).
The Gini impurity is used here (the default in Scikit-Learn). For each node, the Gini impurity is:
$$ G = 1 - \left(\frac{y_1}{n}\right)^2 - \left(\frac{y_2}{n}\right)^2 $$ where $n$ is the number of samples in the node, $y_1$ is the number of samples belonging to the first class and $y_2$ is the number of samples belonging to the second class.
Hence the Gini impurity displayed by Scikit-Learn for the root node is
$$ 1 - (\frac{524}{999})^2 - (\frac{475}{999})^2 = 0.498797 $$
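This value can be checked with a few lines of Python. A minimal sketch, using the class counts read off the root node above:

```python
def gini(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([524, 475]))  # ~0.498797, the value displayed on the root node
```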

The first split

Scikit-Learn has decided that isActive_website is the first feature used to split the samples. The first question is basically: is isActive_website equal to 1 or 0? The Gini impurity is what determines the order of the features. However, to make the decision, the algorithm computes a weighted average of the Gini impurities of the two resulting subsets:

[Figure: root node calculations in the Calc spreadsheet]

As you can see in the screenshot above, two impurities are computed for each feature. For example, for the feature isReading_mail, the first impurity describes how pure the resulting subset is for one of the two possible answers. All features are categorical here; the question is basically: is the feature equal to 1 or 0? (Actually, the displayed tree has been trained on continuous variables, but the results are the same.)
Once the two Gini impurities are computed for a given feature, the weighted average of these measures can easily be obtained:
$$ \frac{556}{999} \times 0.470 + \frac{443}{999} \times 0.481 \approx 0.475 $$
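The same arithmetic can be reproduced from the node sizes and the two child impurities (values read off the spreadsheet above):

```python
# Sizes and Gini impurities of the two child nodes of the isActive_website split.
n_right, n_left = 556, 443
g_right, g_left = 0.470, 0.481

n = n_right + n_left
weighted = n_right / n * g_right + n_left / n * g_left
print(round(weighted, 3))  # ~0.475
```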

Once this calculation has been done for each available feature, the most useful feature is the one with the smallest weighted impurity (isActive_website in this particular case).
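Putting the pieces together, the split search at a node can be sketched as follows. This is a simplified sketch for binary 0/1 features, reusing the gini helper and the hypothetical df DataFrame from the earlier snippets:

```python
def weighted_gini(data, feature, target="churn"):
    """Weighted average of the Gini impurities obtained by splitting on a 0/1 feature."""
    n = len(data)
    total = 0.0
    for value in (0, 1):
        subset = data[data[feature] == value]
        counts = subset[target].value_counts().tolist()
        total += len(subset) / n * gini(counts)
    return total

features = ["isActive_website", "isReading_mail", "is_new_client"]
best_feature = min(features, key=lambda f: weighted_gini(df, f))
print(best_feature)  # expected: isActive_website at the root node
```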

Next splits

[Figure: second node as displayed by Scikit-Learn]

Let’s study the node on the right of the root node. It corresponds to the condition isActive_website > 0.5. As this feature is binary and the only values are one and zero, the dataset is filtered and only the rows with isActive_website = 1 are kept (556 samples are now under review). It is worth noting that the Gini impurity displayed on this node is equal to 0.47, which is one of the two measures used in the weighted average computed above (the other one, 0.481, is displayed on the left node). The values 210 and 346 can also be found in the first Calc screenshot, in cells E8 and E9; they are simply the distribution of the target among the remaining samples, given isActive_website = 1.
As there are two features left, their weighted averages of Gini impurity are computed and the feature with the smallest one is chosen for the next split (isReading_mail here):

[Figure: second node calculations in the Calc spreadsheet]
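The same reasoning can be reproduced on the filtered subset, reusing the hypothetical helpers defined above:

```python
# Keep only the samples that reach the right child of the root node.
right_node = df[df["isActive_website"] == 1]   # 556 samples in this dataset

for feature in ["isReading_mail", "is_new_client"]:
    print(feature, round(weighted_gini(right_node, feature), 3))
# The feature with the smallest weighted Gini impurity (isReading_mail here)
# becomes the next split.
```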

Continuing these steps would allow us to get the complete tree.

For continuous variables, the process is very similar. The only difference is determining the best cutoff value for the feature. Once the variable is sorted, a mean is computed for each pair of adjacent values. For each of these means, the weighted average of the Gini impurity associated with the cutoff is calculated. The best split is the one with the smallest weighted Gini impurity.

[Figure: Gini impurity calculation for a continuous variable]
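A minimal sketch of this cutoff search, reusing the gini helper above (the feature and target names are placeholders):

```python
import numpy as np

def best_cutoff(data, feature, target="churn"):
    """Return the cutoff with the smallest weighted Gini impurity for a continuous feature."""
    values = np.sort(data[feature].unique())
    cutoffs = (values[:-1] + values[1:]) / 2          # midpoints of adjacent values
    best = None
    for c in cutoffs:
        left = data[data[feature] <= c][target].value_counts().tolist()
        right = data[data[feature] > c][target].value_counts().tolist()
        score = (sum(left) * gini(left) + sum(right) * gini(right)) / len(data)
        if best is None or score < best[1]:
            best = (c, score)
    return best  # (cutoff value, weighted Gini impurity)
```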

Conclusion

The different steps to grow a decision tree are pretty straightforward, and the key metrics are easily computed in a spreadsheet. The complexity arises from the tedious data wrangling needed at each step.
Decision trees are appreciated for being simple to interpret, white box (the algorithm's decisions can be understood) and non-parametric (no assumptions about distributions). However, overfitting is almost always a problem when using this classifier. More robust tree-based algorithms exist to deal with this issue (Random Forests, AdaBoost and Gradient Boosting).
