Original article was published on Artificial Intelligence on Medium
This is the third and final article in a series to help you understand, use, and remember the seven most popular classification metrics. In the first article in the series I explained the confusion matrix and the most common evaluation term: accuracy. In the second article I shined a light on the three most common basic metrics: recall (sensitivity), precision, and specificity. If you don’t have those terms down cold, I suggest you spend some more time with them before proceeding. 👍
Each of the composite metrics in this article is built from basic metrics. Let’s look at some beautiful composite metrics!
As you saw in the first article in the series, when outcome classes are imbalanced, accuracy can mislead.
Balanced accuracy is a better metric to use with imbalanced data. It gives the positive and negative outcome classes equal weight, no matter how many observations each class has, so a large majority class can’t hide poor performance on the minority class.
Here’s the formula:
Balanced Accuracy = ((TP / (TP + FN)) + (TN / (TN + FP))) / 2
Thinking back to the last article, which metric is TP/(TP+FN) the formula for? That’s right, recall — also known as sensitivity and the true positive rate!
And which metric is TN/(TN+FP) the formula for? That’s right, specificity, also known as the true negative rate!
So here’s a shorter way to write the balanced accuracy formula:
Balanced Accuracy = (Sensitivity + Specificity) / 2
Balanced accuracy is just the average of sensitivity and specificity. It’s great to use when they are equally important. ☝️
Let’s continue with an example from the previous articles in this series. Here are the results from our model’s predictions of whether a website visitor would purchase a shirt at Jeff’s Awesome Hawaiian Shirt store. 🌺👕
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 80 (TP) | 20 (FN) |
| Actual Negative | 50 (FP) | 50 (TN) |
Our sensitivity is .8 and our specificity is .5. Average those scores to get our balanced accuracy:
(.8 + .5) / 2 = .65
In this case our accuracy is 65%, too: (80+50) / 200.
When the outcome classes are the same size, accuracy and balanced accuracy are the same! 😀
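As a quick check of the arithmetic above, here is a minimal Python sketch that computes both metrics straight from the four confusion matrix counts. The helper names accuracy and balanced_accuracy are hypothetical, written for this illustration, and are not part of any library.

```python
def accuracy(tp, fn, fp, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fn + fp + tn)


def balanced_accuracy(tp, fn, fp, tn):
    """Average of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hawaiian shirt example: 100 actual positives, 100 actual negatives
shirt = dict(tp=80, fn=20, fp=50, tn=50)
shirt_accuracy = accuracy(**shirt)
shirt_balanced_accuracy = balanced_accuracy(**shirt)
print(f"Accuracy: {shirt_accuracy:.2f}")
print(f"Balanced accuracy: {shirt_balanced_accuracy:.2f}")
```

Both lines print 0.65, matching the hand calculation for the balanced classes.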
Now let’s see what happens with imbalanced data. Let’s look at our previous example of disease detection with more negative cases than positive cases.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 1 (TP) | 8 (FN) |
| Actual Negative | 2 (FP) | 989 (TN) |
Our accuracy is 99%: (990/1,000).
But our balanced accuracy is 55.5%!
((1 / (1 + 8)) + (989 / (2 + 989))) / 2 = 55.5%
Do you think balanced accuracy of 55.5% better captures the model’s performance than 99.0% accuracy?
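One way to think about that question is to compare the model with a do-nothing baseline. The sketch below reruns the disease detection numbers and adds a hypothetical "model" that predicts negative for every patient. That baseline is an assumption added for illustration, not something from the example itself.

```python
# Disease detection example: 9 actual positives, 991 actual negatives
tp, fn, fp, tn = 1, 8, 2, 989

accuracy = (tp + tn) / (tp + fn + fp + tn)
sensitivity = tp / (tp + fn)   # share of actual positives the model catches
specificity = tn / (tn + fp)   # share of actual negatives correctly cleared
balanced_accuracy = (sensitivity + specificity) / 2

# Hypothetical baseline that predicts negative for everyone:
# TP=0, FN=9, FP=0, TN=991
baseline_accuracy = (0 + 991) / 1000
baseline_balanced_accuracy = (0 / 9 + 991 / 991) / 2

print(f"Model:    accuracy={accuracy:.3f}, balanced accuracy={balanced_accuracy:.3f}")
print(f"Baseline: accuracy={baseline_accuracy:.3f}, balanced accuracy={baseline_balanced_accuracy:.3f}")
```

The always-negative baseline edges out the model on accuracy (99.1% vs. 99.0%) even though it never finds a single case. Balanced accuracy puts it at 50%, just below the model’s 55.5%.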
Balanced accuracy bottom line
Balanced accuracy is a good measure when you have imbalanced data and you are indifferent between correctly predicting the negative and positive classes. 😀
The scikit-learn function name is balanced_accuracy_score.
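Here is a minimal sketch of calling it, assuming scikit-learn and NumPy are installed. The label arrays are rebuilt from the disease detection counts purely for illustration. In practice y_true and y_pred would come from your test set and your model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Rebuild labels from the disease detection counts: TP=1, FN=8, FP=2, TN=989
y_true = np.array([1] * 9 + [0] * 991)
y_pred = np.array([1] * 1 + [0] * 8 + [1] * 2 + [0] * 989)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
```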
Another, even more common composite metric is the F1 score.
The F1 score is the harmonic mean of precision and recall. If you care about precision and recall roughly the same amount, F1 score is a great metric to use. Note that even though any of the metrics you’ve seen can be followed by the word score, F1 always is. ☝️
Remember that recall is also known as sensitivity or the true positive rate.
Here’s the formula for the F1 score, using P and R for precision and recall, respectively:
F1 = 2 * (P * R) / (P + R)
Let’s see how the two examples we’ve looked at compare in terms of F1 score. In our Hawaiian shirt example, our model’s recall is 80% and the precision is 61.5%.
The model’s F1 score is:
2 * (.615 * .80) / (.615 + .80) = .695
That doesn’t sound so bad. 🙂
Let’s calculate the F1 for our disease detection example. There the model’s recall is 11.1% and the precision is 33.3%.
The model’s F1 is:
2 * (.111 * .333) / (.111 + .333) = .167
That is not so hot. ☹
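If you’d like to verify both scores, here is a small sketch that works from the raw counts instead of the rounded percentages. The helper name f1_from_counts is hypothetical, made up for this illustration.

```python
def f1_from_counts(tp, fn, fp):
    """F1 score from confusion matrix counts (true negatives are not used)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)


shirt_f1 = f1_from_counts(tp=80, fn=20, fp=50)
disease_f1 = f1_from_counts(tp=1, fn=8, fp=2)
print(f"Hawaiian shirt F1: {shirt_f1:.3f}")
print(f"Disease detection F1: {disease_f1:.3f}")
```

Because this version uses unrounded precision, the shirt score prints as 0.696 rather than the .695 you get from the rounded 61.5%. The disease score prints as 0.167. Notice that the true negatives never enter the calculation.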
The F1 score is popular because it combines two metrics that are often very important — recall and precision — into a single metric. If either is low, the F1 score will also be quite low.
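To see why one low score drags F1 down, it can help to put the harmonic mean next to a plain average. This is a minimal sketch with hypothetical precision and recall pairs chosen only for illustration.

```python
def harmonic_mean(p, r):
    """F1-style harmonic mean of precision p and recall r."""
    return 2 * (p * r) / (p + r) if (p + r) > 0 else 0.0


def arithmetic_mean(p, r):
    """Plain average, shown only for comparison."""
    return (p + r) / 2


# Hypothetical precision/recall pairs, not taken from the article's examples
pairs = [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1)]
for p, r in pairs:
    print(f"P={p:.1f} R={r:.1f}  average={arithmetic_mean(p, r):.2f}  F1={harmonic_mean(p, r):.2f}")
```

With precision at .9 and recall at .1, the plain average is still .50 while F1 falls to .18.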
The scikit-learn function name is f1_score. Let’s look at a final popular composite metric, ROC AUC.
ROC AUC stands for Receiver Operating Characteristic Area Under the Curve. It is the area under the curve of the true positive rate vs. the false positive rate. Remember that the true positive rate also goes by the names recall and sensitivity.
The false positive rate isn’t a metric we’ve discussed in this series.
False Positive Rate
The false positive rate (FPR) is a bonus metric. 👍 It’s calculated by dividing the false positives by all the actual negatives.
FPR = FP / (FP + TN)
The false positive rate is the only metric we’ve seen where a lower score is better. ⬇️=😀
The FPR is rarely used alone. It’s important because it’s one of the two metrics that go into the ROC AUC.
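Here is a short sketch of the FPR for the two examples in this article, taking the actual negatives to be FP + TN. The helper name false_positive_rate is hypothetical.

```python
def false_positive_rate(fp, tn):
    """Share of actual negatives that were wrongly predicted positive."""
    return fp / (fp + tn)


shirt_fpr = false_positive_rate(fp=50, tn=50)
disease_fpr = false_positive_rate(fp=2, tn=989)
print(f"Hawaiian shirt FPR: {shirt_fpr:.3f}")
print(f"Disease detection FPR: {disease_fpr:.3f}")

# FPR and specificity always sum to 1
shirt_specificity = 50 / (50 + 50)
assert abs(shirt_fpr + shirt_specificity - 1) < 1e-12
```

Put another way, the FPR is just 1 minus the specificity.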
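In scikit-learn versions before 1.2 you can plot a ROC curve with plot_roc_curve(estimator, X_test, y_test). That function has since been removed. Newer versions use RocCurveDisplay.from_estimator(estimator, X_test, y_test) instead.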
Here’s an example of a ROC curve:
The ROC curve is a popular plot that can help you decide where to set a decision threshold so that you can optimize other metrics.
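Each point on the curve comes from one decision threshold. The sketch below is a simplified illustration of that idea in plain Python, not scikit-learn’s implementation. The labels, scores, and the helper name roc_points are all hypothetical.

```python
def roc_points(y_true, y_score, thresholds):
    """Return (threshold, FPR, TPR) for each candidate decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in thresholds:
        # Predict positive when the score meets or exceeds the threshold
        y_pred = [1 if s >= t else 0 for s in y_score]
        tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
        points.append((t, fp / negatives, tp / positives))
    return points


# Hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]

for t, fpr, tpr in roc_points(y_true, y_score, thresholds=[0.25, 0.50, 0.75]):
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Raising the threshold lowers both the FPR and the TPR. Picking a threshold means picking the trade-off between the two that suits your problem.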
The AUC (area under the curve) can range from 0 to 1, but random guessing earns a score of .5, so useful models land between .5 and 1. A higher score is better. A score of .5 is no bueno and is represented by the orange line in the plot above. ☹️
You want your model’s curve to be as close to the top left corner as possible. You want a high TPR with a low FPR. 🙂
Our model does okay, but there’s room for improvement. 😐
The ROC AUC is not a metric you want to compute by hand. ✍ Fortunately, the scikit-learn function roc_auc_score can do the job for you. Note that you need to pass the predicted probabilities as the second argument, not the predictions. ☝️
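Here is a minimal sketch of that call, assuming scikit-learn is installed. The labels and probabilities are hypothetical values made up for illustration. The second call passes hard 0/1 predictions only to show how the score changes when you do that.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
# With a fitted scikit-learn classifier these scores would typically come
# from something like model.predict_proba(X_test)[:, 1].
y_true = [0, 0, 0, 1, 1, 1]
y_proba = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# Hard 0/1 predictions at a 0.5 threshold, shown only for contrast
y_pred = [1 if p >= 0.5 else 0 for p in y_proba]

auc_from_probabilities = roc_auc_score(y_true, y_proba)
auc_from_predictions = roc_auc_score(y_true, y_pred)
print(f"ROC AUC from probabilities: {auc_from_probabilities:.3f}")
print(f"ROC AUC from hard predictions: {auc_from_predictions:.3f}")
```

On this toy data the probabilities give 0.889 and the hard predictions give 0.667. Hard predictions collapse the curve to a single threshold, which throws away the ranking information the metric is meant to summarize.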
ROC AUC is a good summary statistic when classes are relatively balanced. However, with imbalanced data it can mislead. For a good discussion see this Machine Learning Mastery post.
In this article you learned about balanced accuracy, F1 score, and ROC AUC.
Here are the formulas for all the evaluation metrics you’ve seen in this series:
- Accuracy = (TP + TN) / All
- Recall (Sensitivity, TPR) = TP / (TP + FN)
- Precision = TP / (TP + FP)
- Specificity (TNR)= TN / (TN + FP)
- Balanced Accuracy = (Sensitivity + Specificity) / 2
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
- ROC AUC = Area under TPR vs. FPR
ROC AUC stands for Receiver Operating Characteristic Area Under the Curve. It does NOT stand for Receiver Operating Curve. 👍
Here are the results from the Hawaiian shirt example:
- Accuracy = 65%
- Recall (Sensitivity, TPR) = 80%
- Precision = 61.5%
- Specificity (TNR) = 50%
- Balanced Accuracy = 65%
- F1 Score = .695
Here are the results from the disease detection example:
- Accuracy = 99%
- Recall (Sensitivity, TPR) = 11.1%
- Precision = 33.3%
- Specificity (TNR) = 99.8%
- Balanced Accuracy = 55.5%
- F1 Score = .167
As the results of our two examples show, with imbalanced data, different metrics can paint very different pictures of the same model.
There are many, many other classification metrics, but mastering these seven should make you a pro! 😀
The seven metrics you’ve seen are your tools to help you choose classification models and decision thresholds for those models. Your job is to use these metrics sensibly when selecting your final models and setting your decision thresholds.
I should mention one other common approach to evaluating classification models. You can attach a dollar value or utility score for the cost of each false negative and false positive. You can use those expected costs in your determination of which model to use and where to set your decision threshold.
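Here is a rough sketch of that idea. The dollar costs and the alternative model’s error counts are hypothetical, invented only to show the mechanics, and the helper name total_cost is made up for this example.

```python
def total_cost(fn, fp, cost_fn, cost_fp):
    """Total cost of a model's errors given a cost per error type."""
    return fn * cost_fn + fp * cost_fp


# Hypothetical dollar costs, chosen only for illustration:
# a missed disease case is assumed far more costly than a false alarm.
COST_FN = 10_000
COST_FP = 500

# Error counts from the disease detection example in this article
model_cost = total_cost(fn=8, fp=2, cost_fn=COST_FN, cost_fp=COST_FP)

# Hypothetical alternative model: more false alarms, fewer missed cases
alternative_cost = total_cost(fn=3, fp=40, cost_fn=COST_FN, cost_fp=COST_FP)

print(f"Current model cost: ${model_cost:,}")
print(f"Alternative model cost: ${alternative_cost:,}")
```

Under these assumed costs the alternative model is cheaper ($50,000 vs. $81,000) even though it makes far more errors in total. Change the costs and the preferred model can flip.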
I hope you found this introduction to classification metrics to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. 😀
Happy choosing! 😀