Almost Everything You Need To Know About Decision Trees (With Code)


Gini impurity and entropy

Two friends, Joe and Sam, were getting bored, so they decided to play a game: whoever finds the line that best splits the following points into blue and orange wins. The only constraint is that the line must be parallel to one of the axes.

Figure 1

Both tried hard and came up with the following lines.

Figure 2

It’s easy to say that Joe’s attempt is better than Sam’s just by looking at the points. However, things get messy when we have more than two colors, so we need a quantitative value to judge the splits. That’s where Gini impurity comes into the picture: it measures the probability of classifying a point incorrectly.

Let’s take the situation before splitting, i.e. Figure 1. What is the probability of classifying a point incorrectly?

There could be two ways in which we could classify a point incorrectly. First, we choose a blue point and classify it as orange. Second, we choose an orange point and classify it as blue.

Figure 3
Figure 4

Since the points are split evenly between the two colors, each of these mistakes has probability 0.5 × 0.5 = 0.25. Total probability = 0.25 + 0.25 = 0.5 (the Gini impurity at the start).

Any split divides the region into two or more parts. The overall Gini impurity of the split is the weighted sum of the impurities of the parts, where each weight is the fraction of points falling in that part. The lower the Gini impurity, the better the split.
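To make this concrete, here is a minimal Python sketch of both calculations; the function names `gini_impurity` and `weighted_gini` are mine, not from the original article, and the point counts are only illustrative:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(parts):
    """Gini impurity of a split: impurity of each part, weighted by its size."""
    total = sum(len(part) for part in parts)
    return sum(len(part) / total * gini_impurity(part) for part in parts)

# Before any split (Figure 1): equal numbers of blue and orange points.
points = ["blue"] * 4 + ["orange"] * 4
print(gini_impurity(points))  # 0.5, matching the value derived above
```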

Let’s analyze Joe’s attempt-

Figure 5

Similarly, we can analyze Sam’s attempt as well-

Figure 6

Both attempts significantly reduce the original Gini impurity; however, Joe has made the better split because the Gini gain, G(before) − G(after), is greater for his split.
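Since the actual coordinates live only in the figures, here is a purely hypothetical comparison of two splits in the same spirit; the region counts below are made up for illustration, and the helpers work directly on per-class counts:

```python
def gini(counts):
    """Gini impurity from a list of per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_gain(before, parts):
    """G(before) minus the size-weighted Gini impurity of the split's parts."""
    n = sum(before)
    return gini(before) - sum(sum(part) / n * gini(part) for part in parts)

# Made-up region contents: [blue count, orange count] per region.
print("Joe:", round(gini_gain([4, 4], [[3, 0], [1, 4]]), 3))  # 0.3
print("Sam:", round(gini_gain([4, 4], [[3, 1], [1, 3]]), 3))  # 0.125
# The split with the larger Gini gain wins.
```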

Let Y be a random variable that takes the values {y₁, y₂, y₃, …, yₖ}. Then an easy way of calculating Gini impurity is:

Figure 7
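Written out explicitly (assuming Figure 7 shows the standard definition), with P(yᵢ) denoting the fraction of points in class yᵢ:

$$G(Y) \;=\; \sum_{i=1}^{k} P(y_i)\bigl(1 - P(y_i)\bigr) \;=\; 1 - \sum_{i=1}^{k} P(y_i)^2$$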

Properties of Gini impurity

Let Y take the values y₊ and y₋ (two classes).

When 100% of the data points belong to y₊, the Gini impurity of the system would be:

Figure 8

When 50% of the data points belong to y₊, the Gini impurity of the system would be:

Figure 9

When 0% of the data points belong to y₊, the Gini impurity of the system would be:

Figure 10

The graph of Gini impurity with respect to the fraction of points belonging to y₊ comes out to be:

Figure 11
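These three cases are easy to verify numerically. For two classes the formula reduces to G = 2·p₊·(1 − p₊), which is the inverted-parabola shape Figure 11 reflects; a small sketch (the function name is mine):

```python
def gini_binary(p_pos):
    """Gini impurity of a two-class problem with P(y+) = p_pos."""
    return 1.0 - p_pos ** 2 - (1.0 - p_pos) ** 2  # equals 2 * p_pos * (1 - p_pos)

for p in (1.0, 0.5, 0.0):
    print(f"P(y+) = {p:.2f} -> Gini = {gini_binary(p):.2f}")
# P(y+) = 1.00 -> Gini = 0.00
# P(y+) = 0.50 -> Gini = 0.50   (the maximum)
# P(y+) = 0.00 -> Gini = 0.00
```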

Enough of this Gini impurity. Let’s talk about entropy for a while.

Entropy, in simple words, is nothing but randomness in data. Imagine you have two boxes in front of you. Box 1 is mostly filled with blue balls while box 2 is filled with balls of different colors.

Figure 12

Now draw a ball from each of the boxes. What will most likely be the color of the ball drawn from box 1? Blue, right? Can you predict the same for the ball drawn from box 2? Probably not, because box 2 has a lot of randomness, unlike box 1. Congrats, you have just learned what entropy means! If you choose entropy as the metric, a decision tree splits the data in such a way that the randomness keeps decreasing at each split.

Let Y be a random variable that takes the values {y₁, y₂, y₃, …, yₖ}. Then the entropy of the system is calculated as:

Figure 13
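Written out explicitly (assuming Figure 13 uses the usual base-2 logarithm, which is consistent with the maximum of 1 mentioned later):

$$H(Y) \;=\; -\sum_{i=1}^{k} P(y_i)\,\log_2 P(y_i)$$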

Properties of entropy

Let Y take the values y₊ and y₋ (two classes).

When 99% of the data points belong to y₊, the entropy of the system would be:

Figure 14

When 50% of the data points belong to y₊, the entropy of the system would be:

Figure 15

When 1% of the data points belong to y₊, the entropy of the system would be:

Figure 16

The graph of H(Y) with respect to the fraction of points belonging to y₊ comes out to be:

Figure 17

Notice that, for two classes, the entropy reaches its maximum value of 1 when both values are equally probable.
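As with Gini impurity, these cases can be checked with a few lines of Python; `entropy_binary` is my own name for the two-class case:

```python
import math

def entropy_binary(p_pos):
    """Entropy (base 2) of a two-class problem with P(y+) = p_pos."""
    return -sum(p * math.log2(p) for p in (p_pos, 1.0 - p_pos) if p > 0)

for p in (0.99, 0.5, 0.01):
    print(f"P(y+) = {p:.2f} -> entropy = {entropy_binary(p):.3f}")
# P(y+) = 0.99 -> entropy = 0.081
# P(y+) = 0.50 -> entropy = 1.000   (the maximum)
# P(y+) = 0.01 -> entropy = 0.081
```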

So far so good. But how do we split the data using entropy?

Similar to Gini gain, we use information gain (IG) to decide the best feature to split on.

Figure 18
Figure 19: a graph showing how Gini impurity and entropy vary with y₊
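Putting the pieces together, here is a hedged sketch of how information gain could be used to score a split, assuming the standard definition IG = H(parent) − Σᵢ (nᵢ/n)·H(childᵢ); the function names and the example data are mine, not from the article:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """H(parent) minus the size-weighted entropy of the split's parts."""
    n = len(parent)
    return entropy(parent) - sum(len(child) / n * entropy(child) for child in children)

# Illustrative example: a perfect two-class split has the maximum gain of 1 bit.
parent = ["blue"] * 4 + ["orange"] * 4
children = [["blue"] * 4, ["orange"] * 4]
print(information_gain(parent, children))  # 1.0
# At each node, the tree picks the feature/threshold with the highest gain.
```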