Introduction to Decision Tree Classifiers

Decision trees are a powerful tool capable of mapping non-linear relationships and complex relationships between many variables. They are also computationally inexpensive compared to other algorithms. There are many forms of decision trees, but I am going to just cover the basics of the algorithm in this post.

Decision trees work like a flow chart. The algorithm constructs a series of yes/no questions that lead to further questions until the best predictions can be produced. This is similar to the game 20 questions or the famous Akinator game. The tree starts with one question regarding the feature data. This is called the “root” or “root node”. Then the data points are put into different sets based on the answer to the question. These sets can either be left as a final solution, “leaf”, or they can be split into even more sets with more questions, “internal nodes”.

The real work is done when the algorithm learns what questions to ask. There are several ways to do this, but the most common way is to use something called Gini impurity. To create nodes the machine first uses all possible questions to split the data into two groups. Then it measures how “impure” each group is. The question that leaves the groups with the smallest impurity values is chosen to create a node.

To demonstrate how Gini impurity works, allow me to create an example: If we were trying to build a tree that predicts whether a passenger on the titanic survived, we may have a split asking if the passenger’s age is equal to or less than 30. Say out of all people younger than thirty, 80 survived and 20 did not, and out of all people older than 30, 25 survived and 90 did not. We could generalize and use this alone to predict that someone died if they were over 30 and they survived if they were under 30. The split would help make predictions, but it is not perfect. We would say it is “impure”. To see if this is the optimal split to start our tree we can calculate Gini impurity.

To calculate Gini impurity we use the following formula for each group (under 30 and over 30 in this case). Let “P” stand for probability.

Plugging this into the group under 30 years of age we would get this:

Working this out the Gini impurity for the group under 30 is 0.312 and the Gini impurity for the group above thirty it is 0.340. These are recorded and then the weighted average is taken as the impurity for the question. This gives us an impurity value of 0.327 for the question “is the passenger under 30?”.

Next we would try splitting on the age 40, or another feature such as passenger class. We would exhaust all possible questions and choose the one yielding the lowest impurity value. This becomes the question for the root node.

After the optimal split is made we can choose whether to add more nodes or to end it there. For each subset made by the first split (root node) we can repeat the process just as we did before. If all of the possible splits have impurity values larger than the subset (people under 30) then the process is ended and the answer is assumed to be whatever class the majority of the group belongs to. This is referred to as a “leaf”.

The algorithm can keep producing nodes until the tree gets overly large. This can overfit the model and produce bad results on unseen data. To prevent this a maximum depth can be set that will not allow any path to involve more than say, 4 splits.

In conclusion decision trees are powerful and transparent machine learning models. However they can be made even better using things like bagging and boosting. State of the art algorithms like XGBoost and Random Forests are based upon the basic decision tree.