# The Illustrated Guide To Classification Metrics: The Basics

Original article was published on Artificial Intelligence on Medium

# Will it rain tomorrow?

There are two possible answers: yes, it will rain and no, it will not. Then, we can wait for the next day and see what happens: rain or sun. This is a classic binary classification problem: we have to predict between two outcomes.

Naturally, there are two ways to be right: you said rain, and it rained, or you said sunny, and it was sunny. Likewise, there are two ways to be wrong: you said rain, and it was sunny, or you said sunny, and it rained all over your head.

If our task is to predict rain, we shall call rainy days “positives” and sunny days “negatives”. Saying rain and getting rain is a “true positive”, while saying sun and getting sun is a “true negative”. Conversely, saying rain and getting sun is a “false positive” (a false alarm), and saying sun but getting rain is a “false negative” (a miss).

These four possible outcomes are laid out as a matrix, like this:

This representation is known as a confusion matrix and summarizes the four possibilities: being right about the rain, issuing a false alarm, missing the rain, and being right about the sun. It is called a confusion matrix because it shows how often the sun is confused with rain (false positives) and how often rain is confused with the sun (false negatives).
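As a minimal sketch of how the four cells could be counted in code, the snippet below tallies them from two lists of labels. The function name `confusion_counts` and the five-day forecast are made up for illustration.

```python
from collections import Counter

def confusion_counts(predicted, actual, positive="rain"):
    """Count the four confusion-matrix cells for a binary problem."""
    counts = Counter()
    for said, got in zip(predicted, actual):
        if said == positive and got == positive:
            counts["TP"] += 1  # said rain, got rain (hit)
        elif said == positive:
            counts["FP"] += 1  # said rain, got sun (false alarm)
        elif got == positive:
            counts["FN"] += 1  # said sun, got rain (miss)
        else:
            counts["TN"] += 1  # said sun, got sun (rejection)
    return counts

# Hypothetical five-day forecast, invented for this example.
predicted = ["rain", "rain", "sun", "sun", "rain"]
actual = ["rain", "sun", "rain", "sun", "rain"]

counts = confusion_counts(predicted, actual)
print(counts["TP"], counts["FP"], counts["FN"], counts["TN"])  # 2 1 1 1
```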

You might be thinking, “I don’t like rain! Why should rain be a positive thing?” I agree it is not very intuitive at first. I find it easier to remember that “negative” is the “natural/expected state”. For instance, when detecting a disease, being healthy is negative, and being ill is positive.

These four outcomes go by many other names. For instance, you can call them “hit” (true positive), “false alarm” (false positive), “miss” (false negative), and “rejection” (true negative), which are a bit more intuitive. However, standard practice is to use the positives/negatives.

Here is the same matrix but using the more intuitive terms:

Using the sick/healthy example, a “hit” is when you detected the disease, and a “rejection” is when you correctly ruled out that the person is ill. Likewise, if you said the person is sick, but he/she isn’t, you gave a false alarm, and if you wrongly said he/she was healthy, you missed the disease.

In the Netherlands, there are approximately 217 days of rain per year. If you ask a Dutch person if it will rain tomorrow, they will tell you it rains every day, and that they can’t take it anymore 😢. Taking this answer literally, the confusion matrix of the average Dutch person regarding the rain is:

Our pessimistic Dutch friend got all 217 rainy days right at the cost of giving 148 false alarms for rain. Since he never said it would be sunny, he never missed any rain, but he never got a sunny day right either.

If we asked an entirely optimistic person instead, one that always hoped for sun, we would get the opposite: 148 days of sun correctly predicted as such and 217 rainy days mispredicted as sunny.
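The two forecasters can be reproduced with a short sketch. It assumes a year made of the 217 rainy and 148 sunny days quoted above; the helper name `tally` is hypothetical.

```python
RAINY_DAYS, SUNNY_DAYS = 217, 148  # figures quoted in the text

# One year of weather; the order of the days does not matter here.
actual = ["rain"] * RAINY_DAYS + ["sun"] * SUNNY_DAYS

def tally(forecast, actual, positive="rain"):
    """Return (TP, FP, FN, TN) for one forecaster."""
    pairs = list(zip(forecast, actual))
    tp = sum(f == positive and a == positive for f, a in pairs)
    fp = sum(f == positive and a != positive for f, a in pairs)
    fn = sum(f != positive and a == positive for f, a in pairs)
    tn = sum(f != positive and a != positive for f, a in pairs)
    return tp, fp, fn, tn

pessimist = tally(["rain"] * len(actual), actual)  # always says rain
optimist = tally(["sun"] * len(actual), actual)    # always says sun

print(pessimist)  # (217, 148, 0, 0)
print(optimist)   # (0, 0, 217, 148)
```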

The problem we face now is: how much better or worse are the predictions of the pessimistic Dutchman than those of the overly optimistic one? In other words, how can we quantify how good these predictions are?

# Goodness Metrics

A straightforward way of measuring how good these predictions are is to consider how many predictions were right out of the total. This is called accuracy.

The pessimistic Dutchman made 217 correct predictions over 365 days, yielding an accuracy of 59%, while the overly optimistic one was right on only 148 of 365 days, making him only 41% accurate. This makes the pessimistic Dutchman a better model than the overly optimistic one.

Formally, the correct predictions are TP + TN, while TP + TN + FP + FN is the total number of predictions. Thus, the accuracy can be calculated as:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Or in simple terms: greens over greens and reds.
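As a quick check of the 59% and 41% figures, here is a small sketch of the accuracy computation. The function name `accuracy` is just a label chosen for this example.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions (TP + TN) over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Confusion-matrix cells of the two forecasters described in the text.
pessimist = accuracy(tp=217, tn=0, fp=148, fn=0)
optimist = accuracy(tp=0, tn=148, fp=0, fn=217)

print(f"pessimist: {pessimist:.0%}, optimist: {optimist:.0%}")
# pessimist: 59%, optimist: 41%
```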

# The problem with Accuracy

In our rain example, there are clearly more rainy days than sunny days in the Netherlands. Thus, predicting rainy days correctly pushes the accuracy up more than predicting sunny days correctly.

In extreme cases, what you want to detect might occur only in 1% or less of the cases. For instance, it only rains a couple of days per year in the Sahara, so if you always say “sun”, you will be right about 99% of the time.

In medicine, diseases are often rare, and you need to detect them despite their rarity. It doesn’t cut it to say “healthy” to everyone and claim you are 94% accurate. You need a better measure. Consider the example:

In these cases, it pays off to compute the accuracy for positives and negatives separately, so you can measure how well the model performs for one and the other. In simple terms, we need a “positive accuracy” and a “negative accuracy”. These are known in the business as sensitivity and specificity:

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad \text{specificity} = \frac{TN}{TN + FP}$$

Notice how we just split the accuracy formula into two parts: one measures how well we detect positives (hits over hits plus misses) and the other how well we detect negatives (rejections over rejections plus false alarms). Therefore, if you naively say, “everyone is healthy!” you will be 0% sensitive and 100% specific, instead of just 94% accurate.
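To see the split in action, the sketch below scores the “everyone is healthy” model. The population of 1000 people with 60 ill is an assumption, made up only to match the 94% figure; it does not come from the original example.

```python
def sensitivity(tp, fn):
    """Share of actual positives that were detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    """Share of actual negatives that were rejected: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0

# Hypothetical screening: 1000 people, 60 of them ill (6%).
# The naive model says "healthy" to everyone, so it never predicts positive.
tp, fn = 0, 60   # every ill person is missed
tn, fp = 940, 0  # every healthy person is correctly rejected

acc = (tp + tn) / (tp + tn + fp + fn)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)

print(f"accuracy={acc:.0%} sensitivity={sens:.0%} specificity={spec:.0%}")
# accuracy=94% sensitivity=0% specificity=100%
```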

Accuracy is a summary. It tries to capture everything into a single value. That’s why it is useful: one value is easier to work with than two, but it is also a weakness: it cannot capture every nuance of your problem. On the other hand, the specificity and sensitivity are indicators. They tell you how well your model behaves under specific circumstances.

You might ask if it is possible to combine both indicators into a single value so that we can have a summary. One cool formula to use is the geometric mean, which is defined as the square root of the product:

$$G = \sqrt{\text{sensitivity} \times \text{specificity}}$$

This formula has the beneficial property of averaging out both scores while penalizing unbalanced pairs. For instance, 90% with 90% scores slightly higher (G = 0.90) than 80% with 100% (G ≈ 0.89).

Going back to the Dutch rain, the pessimistic Dutchman is 100% sensitive and 0% specific, while the optimistic one is 0% sensitive and 100% specific. Using our G score, both models have G = 0. Thus, they are equally bad under the sensitivity/specificity analysis. Neat.
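The G score is easy to try out. This is a minimal sketch, with `g_score` as an illustrative name, that checks the balanced versus unbalanced pair and the two Dutch forecasters.

```python
import math

def g_score(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity * specificity)

balanced = g_score(0.90, 0.90)    # about 0.900
unbalanced = g_score(0.80, 1.00)  # about 0.894, despite the same arithmetic mean

# Always-rain and always-sun forecasters from the text.
pessimist = g_score(1.0, 0.0)
optimist = g_score(0.0, 1.0)

print(f"{balanced:.3f} {unbalanced:.3f} {pessimist:.1f} {optimist:.1f}")
```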

# The Limitation of using Sensitivity and Specificity

In some scenarios, the true negatives are so numerous, or so hard to define, that counting them is not meaningful. In those cases, formulas that rely on true negatives stop being informative.

For instance, the object detection task is defined as finding objects and enclosing them with bounding boxes. A true positive is a correctly found object, a false positive is a false detection, a false negative is missing an object, and a true negative is detecting nothing where there is nothing to be detected.

Pay close attention to the last: a true negative is saying nothing when nothing is expected. In the above image, how many nothings did we correctly manage not to give a bounding box to?

Another example is Google search. When you look for dogs, all dog sites returned are true positives, non-dog sites are false positives, missed dog sites are false negatives, and the rest of the internet counts as true negatives.

In both examples, if we actually tried to count how many true negatives there are, we would always have 99%+ specificity, as the number of true negatives would far exceed everything else. In such scenarios, we have to replace specificity with another metric: precision.

$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$$

Notice that the “sensitivity” changed its name to “recall”. They are the same metric, with the same formula. This is just a naming convention: when using precision, you call sensitivity “recall”.

Using the object detection example, recall measures how many objects you detected among all objects, while precision measures how many of your detections were actually correct. Using the Google example, recall is how many (of all) dog sites you returned, while precision is the proportion of dog sites among all the results you returned.
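Here is a small sketch of both metrics on the search example. The counts (10 results returned, 8 of them dog sites, 40 dog sites in existence) are hypothetical, and the function names are chosen for this example. Note that the true negatives never appear.

```python
def precision(tp, fp):
    """Share of returned results that are relevant: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of relevant items that were returned: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical search for "dogs": made-up counts for illustration.
tp = 8   # dog sites returned
fp = 2   # non-dog sites returned
fn = 32  # dog sites that exist but were not returned

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.0%} recall={r:.0%}")  # precision=80% recall=20%
```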

Beware: precision/recall are not substitutes for sensitivity/specificity; they address different issues. The former is only relevant when you cannot count the true negatives efficiently, or when there are far too many of them. Whenever you can count them, sensitivity/specificity is more appropriate.

Again, these are indicators. To make a summary out of them, we use the harmonic mean, which is computed by doubling the product over the sum:

$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The harmonic mean of precision and recall is known as the F-Score. Like the G-Score, it also penalizes unbalanced pairs. However, it does so more strongly. Thus, to obtain a high F-Score, a model has to have both high precision and high recall.
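To compare how strongly the two means penalize imbalance, the sketch below feeds the same pairs of values to both. Applying both formulas to the same numbers is only for illustration, since the G score was defined on sensitivity/specificity and the F-Score on precision/recall; the function names are illustrative.

```python
import math

def f_score(precision, recall):
    """Harmonic mean: twice the product over the sum."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def g_score(a, b):
    """Geometric mean: square root of the product."""
    return math.sqrt(a * b)

# Each pair has the same or similar arithmetic mean but growing imbalance.
for a, b in [(0.9, 0.9), (0.8, 1.0), (0.5, 1.0)]:
    print(f"{a:.1f}/{b:.1f}: F={f_score(a, b):.3f} G={g_score(a, b):.3f}")
```

For a balanced pair the two scores agree; as the pair becomes more unbalanced, the harmonic mean drops below the geometric mean.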

# The Multi-class problem

At this point, you might be asking: what if we have several classes to predict? Let’s pick a problem:

First of all, if we have N classes, there are N ⋅ N possibilities: saying sun and getting sun, saying sun and getting cloudy, saying sun and getting rain, etc. Out of these, there are N ways to be right (one for each class) and N(N-1) ways to be wrong (all the other possibilities). If we plot it all as a confusion matrix, we get the following:

As before, the accuracy can be defined as the number of predictions we got right (greens) divided by all predictions (greens plus reds).

To compute the sensitivity and specificity of each class, we have to recast our problem as class vs. not-class. For instance, cloudy vs. not-cloudy. This way, we can extract one binary problem for each class and compute our binary metrics as we did before. Here is an example of sensitivity and specificity:

One thing that might strike you is that our true negatives (depicted in gray) will grow very large if we add more classes. There are just way more gray cubes than greens and reds. For this reason, sensitivity and specificity make little sense when dealing with multi-class problems.
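The class vs. not-class recasting can be sketched directly on a confusion matrix. The 3×3 counts below are hypothetical, and so is the convention that rows are what actually happened and columns are what was predicted.

```python
LABELS = ["sun", "cloudy", "rain"]

# Hypothetical counts: rows = actual weather, columns = predicted weather.
MATRIX = [
    [50, 10, 5],   # actually sun
    [8, 40, 12],   # actually cloudy
    [2, 15, 58],   # actually rain
]

def one_vs_rest(matrix, k):
    """Collapse an N x N matrix into (TP, FP, FN, TN) for class k vs. not-k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # actual k, predicted something else
    fp = sum(row[k] for row in matrix) - tp  # predicted k, actually something else
    tn = total - tp - fn - fp                # everything not involving class k
    return tp, fp, fn, tn

for k, label in enumerate(LABELS):
    tp, fp, fn, tn = one_vs_rest(MATRIX, k)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    print(f"{label}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Even with only three classes, the true negatives are the largest cell for every class in this example, and that share grows as more classes are added.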

For the same reasons above, the precision and recall scores are more suitable for this case, as they do not rely on the number of true negatives (gray boxes). Here is how it is formulated:
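As a rough sketch, per-class precision and recall can be read straight off the confusion matrix: the diagonal cell divided by its column sum and by its row sum. The counts and the rows-are-actual, columns-are-predicted convention below are assumptions made for this example.

```python
LABELS = ["sun", "cloudy", "rain"]

# Hypothetical counts: rows = actual weather, columns = predicted weather.
MATRIX = [
    [50, 10, 5],   # actually sun
    [8, 40, 12],   # actually cloudy
    [2, 15, 58],   # actually rain
]

def precision_recall(matrix, k):
    """Per-class precision and recall; true negatives are never needed."""
    tp = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum = TP + FP
    actual_k = sum(matrix[k])                    # row sum = TP + FN
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    return precision, recall

def multiclass_accuracy(matrix):
    """Diagonal (correct predictions) over all predictions."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

for k, label in enumerate(LABELS):
    p, r = precision_recall(MATRIX, k)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")

print(f"accuracy={multiclass_accuracy(MATRIX):.2f}")
```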

With this, we have a complete set of tools to evaluate our models on binary and multi-class problems, be the classes balanced (accuracy) or unbalanced (specificity/sensitivity or precision/recall).