Classification Evaluation Metrics Explained Clearly



Even experienced practitioners get confused by the terminology used in classification evaluation metrics such as Precision, Recall, and F1 score. A good understanding of these metrics is crucial for data scientists to make informed decisions. In this article, I will try to explain them intuitively so that you don’t have to revisit them again. So, let’s get started.

Confusion Matrix

We will start our discussion by understanding the confusion matrix, as this forms the basis for the rest of the article. It is very important that you understand the terminology below.

We will consider a binary classification example to understand these concepts. Let’s say we are predicting whether or not a patient is suffering from cancer. The positive label (1) means the patient is suffering from cancer, and the negative label (0) means the patient is not.

Note: You might be wondering why I chose the positive label for a patient suffering from cancer and not the other way around. In statistics and data science lingo, the presence of something is considered positive and the absence of something is considered negative. So in this case, the presence of cancer is marked as the positive label and the absence of cancer as the negative label.

[Confusion matrix for the binary cancer-prediction example. Image by Author]

TP (True Positives)

It is the number of correctly predicted positive labels by the model.

TN (True Negatives)

It is the number of correctly predicted negative labels by the model.

FP (False Positives)

It is the number of incorrectly predicted positive labels, i.e. labels which are actually negative but predicted as positive by the model. This is also called a Type I error. If your model predicts a patient as cancerous when the patient is actually not cancerous, you are making a Type I error.

FN (False Negatives)

It is the number of incorrectly predicted negative labels, i.e. labels which are actually positive but predicted as negative by the model. This is also called a Type II error. If your model predicts a patient as not cancerous when the patient is actually cancerous, you are making a Type II error.

It is crucial that you keep Type II errors as low as possible. Imagine the consequences of failing to identify actual cancer patients.
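To make these four counts concrete, here is a minimal sketch using scikit-learn’s confusion_matrix on a small set of made-up labels (1 = cancer, 0 = no cancer); the label arrays are purely illustrative.

```python
# Minimal sketch: reading TP, TN, FP, FN from a confusion matrix with
# scikit-learn. The label arrays are made-up examples where
# 1 = cancer (positive) and 0 = no cancer (negative).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]  # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]  # model predictions

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=4, FP=1, FN=2
```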

True Positive Rate (TPR)

TPR is the proportion of actual positive labels that the model correctly predicted as positive.

True Negative Rate (TNR)

TNR is the proportion of actual negative labels that the model correctly predicted as negative.

False Positive Rate (FPR)

FPR is the proportion of actual negative labels that the model incorrectly predicted as positive.

False Negative Rate (FNR)

FNR is the proportion of actual positive labels that the model incorrectly predicted as negative.
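Continuing the hypothetical sketch above, all four rates follow directly from the confusion-matrix counts:

```python
# The four rates computed from the counts of the sketch above.
tpr = tp / (tp + fn)  # True Positive Rate: correct positives / all actual positives
tnr = tn / (tn + fp)  # True Negative Rate: correct negatives / all actual negatives
fpr = fp / (fp + tn)  # False Positive Rate, equal to 1 - TNR
fnr = fn / (fn + tp)  # False Negative Rate, equal to 1 - TPR
```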

Accuracy

Accuracy is the most commonly used classification evaluation metric. It answers the question: out of all the labels the model has predicted, how many were predicted correctly? It is the ratio of correctly predicted labels to the total number of labels.
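In terms of the confusion-matrix counts from the earlier sketch, that works out to:

```python
# Accuracy from the confusion-matrix counts of the earlier sketch.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Equivalently, with scikit-learn:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)
```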

When is accuracy not a good evaluation metric?

Accuracy is not a good metric when the dataset is imbalanced. Let’s understand this with the help of an example. Assume that you are building a classification model for credit card fraud detection. There will be hardly a handful, say 10–15, fraud transactions among millions of card transactions. In such a case, your positive labels (fraud) and negative labels (not fraud) are highly imbalanced. This means you don’t even have to build a model: simply predicting every label as not fraud already achieves over 99% accuracy. Now you see why accuracy is not a good metric when the dataset is imbalanced.
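A quick sketch with made-up numbers illustrates the point: 10 fraud transactions out of 1,000, and a “model” that always predicts not fraud.

```python
# Made-up illustration of the imbalance problem: 10 frauds among 1,000
# transactions, and a "model" that blindly predicts "not fraud" every time.
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1] * 10 + [0] * 990)  # 1 = fraud, 0 = not fraud
y_pred = np.zeros(1000, dtype=int)       # always predict "not fraud"

print(accuracy_score(y_true, y_pred))    # 0.99, yet it catches zero frauds
```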

Another disadvantage of Accuracy is that it can only use class labels such as 1 or 0; it cannot take predicted probabilities into account.

Precision

Precision answers the question: out of all the labels predicted as positive, what percentage were actually positive?
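In terms of the confusion-matrix counts, that is the true positives out of everything the model predicted as positive; continuing the earlier sketch:

```python
# Precision: of all predicted positives, how many are truly positive.
precision = tp / (tp + fp)
```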

When is Precision more important than Recall?

Consider an example of detecting spam (positive label) vs. not spam (negative label) e-mails. In this case, you should be okay if the model occasionally fails to catch a spam email and it appears in your inbox instead of the spam folder. What you don’t want is a good email (not spam / negative label) going into the spam folder. So if the model predicts something as positive (i.e. spam), it had better be spam; otherwise you may miss an important mail. As you have noticed by now, Precision is more important here.

Recall (Sensitivity)

Recall is also called Sensitivity and True Positive Rate (TPR). It answers the question: out of all the actual positive labels, what percentage did the model predict as positive?
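In terms of the confusion-matrix counts, that is the true positives out of all actual positives; continuing the earlier sketch:

```python
# Recall (TPR): of all actual positives, how many the model caught.
recall = tp / (tp + fn)
```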

When is Recall more important than Precision?

Consider the example of predicting cancer patients. In this case, you don’t want to mistakenly predict a cancer patient (positive label) as non-cancerous (negative label). So the goal here is to predict all the positive labels correctly, i.e. catch every cancerous patient. You are okay if the model predicts a few non-cancerous patients as cancerous (False Positives), but not the other way around (False Negatives). As you have noticed, Recall is more important here.

F1-Score

F1-Score is the harmonic mean of Precision and Recall. The result ranges between 0 and 1, with 0 being the worst and 1 being the best. It combines Precision and Recall into a single measure of the model’s performance. It is calculated as F1 = 2 × (Precision × Recall) / (Precision + Recall).

Why F1-Score when we have Precision and Recall?

For some problems, getting high Recall is more important than getting high Precision, as in the cancer prediction example. For other problems, getting high Precision is more important than getting high Recall, as in the spam detection example. But there are a lot of situations where we need a balance between Precision and Recall, and that’s where the F1-Score becomes important, as it takes both into account. If either Precision or Recall is very low, the F1-Score also goes very low, signalling that something is wrong with your model so that you can investigate further.
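For completeness, here is a small sketch computing all three with scikit-learn, reusing the hypothetical cancer labels from the confusion-matrix example above:

```python
# Precision, Recall, and F1-Score with scikit-learn on the hypothetical
# cancer labels from the confusion-matrix sketch.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 5 = 0.60
print(f1_score(y_true, y_pred))         # 2 * 0.75 * 0.60 / (0.75 + 0.60) ≈ 0.667
```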

Misclassification Rate

It is the percentage of incorrectly predicted labels out of all predictions, i.e. simply 1 minus Accuracy.
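Continuing the count-based sketch, it is the complement of accuracy:

```python
# Misclassification rate: incorrect predictions over all predictions.
misclassification_rate = (fp + fn) / (tp + tn + fp + fn)  # equals 1 - accuracy
```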

Conclusion

I hope you now have a good understanding of these classification evaluation metrics. There is one more important metric, AUC-ROC, which I plan to cover separately in another article.