Original article was published on Artificial Intelligence on Medium
Introduction to Decision Trees
A decision tree is a simple tree-like structure constituting nodes and branches. At each node, data is split based on any of the input features, generating two or more branches as output. This iterative process increases the numbers of generated branches and partitions the original data. This continues until a node is generated where all or almost all of the data belong to the same class and further splits — or branched — are no longer possible.
This whole process generates a tree-like structure. The first splitting node is called the root node. The end nodes are called leaves and are associated with a class label. The paths from the root to the leaf produce the classification rules.
Suppose, you’re an employee and you want to eat processed food.
Your course of action will depend on several circumstances.
If you aren’t hungry, you won’t spend money on junkies. But if you are hungry then the choices are changed, your next move depends on your next circumstance i.e. have you bought lunch or not?
Now if you don’t have lunch, your action will solely depend on your next pick i.e. is it month-end or not? If it will be the last few days of the month, you will consider skipping the meal otherwise you won’t take it as a preference.
Decision Trees come into play when there are several choices involved to arrive at any decision. Now you must choose accordingly to get a favorable outcome.
Tree-based learning algorithms are considered to be one of the best and mostly used supervised learning methods. Tree-based methods legitimize predictive models with better accuracy, stability, and ease of interpretation. Unlike contemporaries, they work well on non-linear relationships as well. Decision Tree algorithms are referred to as CART (Classification and Regression Trees).
How do Decision Trees work?
There are two components of Decision Trees:
- Entropy — It is regarded as the randomness of the system. It is a measure of node purity or impurity.
Entropy is maximum when p = 0.5 i.e. both outcome has the same favor.
- Information gain — It is a reduction in entropy. It is the difference between the uncertainty of the starting node and weighted impurity of the two child nodes.
It helps us to find the root node for our decision tree, the node with maximum Information Gain is regarded as the root node as it has maximum uncertainty.
We will first split the feature with the highest information gain. This is a recursive process until all child nodes are pure or until the information gain is zero.
Goal of Decision Tree: Maximize Information Gain and Minimize Entropy
Let’s say we have a sample of 60 students with three variables Gender (Boy/ Girl), Class (XI/ XII), and Weight (50 to 100 kg). 15 out of these 60 play football in leisure time. Now, we want to create a model to predict who will play football during free time? In this problem, we need to divide students who play football in their leisure time based on a highly significant input variable among all three.
This is where the decision tree comes into play, it will classify the students based on all values of three variables and identify the variable, which creates the best homogeneous sets of students.
Using Decision Tree, we are easily able to solve our problem and classify students based on traits that whether they will prefer playing football in their leisure time or not?
Decision Trees Implementation from scratch
Sci-kit Learn implementation
Visualizing your Decision Tree