Source: Deep Learning on Medium
Machine Learning Metrics for Classification:
Metrics measure specific characteristics of a system's or program's performance and efficiency. In the machine learning life cycle we perform a number of activities, such as exploratory data analysis, feature selection, feature engineering and model implementation. There must also be some parameters or metrics with which to validate and evaluate the correctness of the resulting model.
Metrics are grouped by the type of problem they apply to: classification, regression, neural networks, and so on.
This article discusses the metrics used for classification problems. The next two articles cover regression and neural networks.
Accuracy: This is the ratio of the number of correct predictions to the total number of input samples.
When to use: When the classes of the target variable are balanced, accuracy is a good measure.
For example, on day 1 a fisherman finds that 92% of the fish in the catch are Catla (Labeo catla) and the remaining 8% are Rohu (Labeo rohita). After training, the model's accuracy was found to be 92%, matching the share of Catla.
On day 2, the fisherman puts the net in the water and finds that 72% of the fish are Catla (Labeo catla) and the remaining 28% are Rohu (Labeo rohita). After training, the test accuracy was found to have dropped to 72%.
On day 3, the fisherman puts the net in the water and finds that 97% of the fish are Catla (Labeo catla) and the remaining 3% are Rohu (Labeo rohita). After training, the test accuracy was found to have risen to 97%.
Finding: The catch varies from day to day and the dataset is not balanced, and the accuracy simply follows the share of the majority class. Therefore, accuracy is not a good measure when the target variable's classes are unbalanced or skewed.
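import numpy as np
from sklearn.metrics import accuracy_score
# Multilabel input: a row counts as correct only if every label in it matches
accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))  # 0.5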
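To make the finding about unbalanced classes concrete, here is a minimal sketch in plain Python. The 92/8 split mirrors the day 1 catch, but the "model" that always predicts the majority class and the helper names `accuracy` and `recall_for_class` are assumptions made up for illustration, not part of the original example.

```python
# Hypothetical catch: 92 Catla (label 1) and 8 Rohu (label 0)
actual = [1] * 92 + [0] * 8

# A "model" that ignores its input and always predicts the majority class
predicted = [1] * len(actual)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_for_class(y_true, y_pred, label):
    """Fraction of samples of `label` that were predicted as `label`."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    total = sum(t == label for t in y_true)
    return hits / total


print(accuracy(actual, predicted))             # 0.92
print(recall_for_class(actual, predicted, 0))  # 0.0
```

The accuracy looks high even though not a single Rohu is identified, which is why accuracy alone can mislead on skewed data.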
Let's see how the confusion matrix addresses this problem. It is an output matrix that gives a complete picture of the model's performance. The possible outcomes recorded in a confusion matrix are:
- True Positives: The cases in which we predicted YES and the actual output was also YES. Ex: the patient was diagnosed with cancer and does have cancer. In the fish example, treating the problem as binary rather than multi-class, these are the 99% of fish detected as Catla (Labeo catla).
- True Negatives: The cases in which we predicted NO and the actual output was NO. Ex: the patient was not diagnosed with cancer and does not have cancer. In the fish example, treating the problem as binary rather than multi-class, these are the 1% of fish detected as type Other.
- False Positives: The cases in which we predicted YES and the actual output was NO. Ex: the patient was diagnosed with cancer but does not have cancer. This is also known as a type 1 error. In the fish example, this is a fish assigned the wrong type, or some other fish in the catch.
- False Negatives: The cases in which we predicted NO and the actual output was YES. Ex: the patient was not diagnosed with cancer but does have cancer. This is also known as a type 2 error.
The accuracy metric can be derived from the confusion matrix:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
from sklearn.metrics import confusion_matrix
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
results = confusion_matrix(actual, predicted)
print(results)
# [[4 2]
#  [1 3]]
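As a rough cross-check of the matrix above, the four outcomes can also be counted by hand. The sketch below uses plain Python on the same two lists; the helper name `confusion_counts` is made up for illustration. For binary labels 0 and 1, scikit-learn puts actual classes in rows and predicted classes in columns, so the printed matrix reads [[TN FP] [FN TP]].

```python
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary problem."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1   # predicted YES, actually YES
        elif p != positive and t != positive:
            tn += 1   # predicted NO, actually NO
        elif p == positive and t != positive:
            fp += 1   # predicted YES, actually NO (type 1 error)
        else:
            fn += 1   # predicted NO, actually YES (type 2 error)
    return tp, tn, fp, fn


tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 3 4 2 1

# Accuracy is the share of correct outcomes among all outcomes
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.7
```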
Precision is the number of items correctly identified as positive out of the total number of items identified as positive:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Here TP is the number of people who actually have cancer and were predicted to have it, whereas FP is the number of people who were predicted to have cancer but do not. In the fish example, TP refers to fish that were classified correctly, whereas FP refers to wrong predictions for fish caught on any day (a purely hypothetical scenario; this cannot be a single number).
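A minimal sketch of precision in plain Python, reusing the `actual` and `predicted` lists from the confusion matrix example. The function name and the choice to return 0.0 when nothing is predicted positive are assumptions for illustration.

```python
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]


def precision(y_true, y_pred, positive=1):
    """TP / (TP + FP): how many predicted positives are really positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    if tp + fp == 0:
        return 0.0  # assumption: no positive predictions is scored as 0
    return tp / (tp + fp)


print(precision(actual, predicted))  # 0.6 (3 true positives out of 5 predicted positives)
```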
Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC):
Depending on the scenario, AUC-ROC, log loss and F1 can each be a good measure.
AUC-ROC is a performance metric for classification problems. The ROC is a probability curve, whereas AUC is a measure of separability (distinguishing 1s as 1 and 0s as 0). It tells how capable the model is of distinguishing between classes. Therefore, the higher the AUC, the better the model is at predicting the correct class.
Before exploring AUC-ROC further, let's discuss the related metrics:
TPR/Recall/Sensitivity: Recall tells us what proportion of the patients who actually had cancer were diagnosed by the algorithm as having cancer.
The actual positives (people who have cancer) are TP and FN, and the people correctly diagnosed by the model as having cancer are TP:
$$\text{Recall} = \frac{TP}{TP + FN}$$
In the fish example, this corresponds to correctly classifying Rohu and Catla out of the whole catch.
Specificity: Specificity tells us what percentage of the people without cancer were correctly identified as such. In the fish example, these are the other fish that were correctly classified out of the whole catch.
$$\text{Specificity} = \frac{TN}{TN + FP}$$
False Positive Rate: 1 - Specificity
$$\text{FPR} = 1 - \text{Specificity} = \frac{FP}{FP + TN}$$
The above metrics are the ones used to build the ROC curve.
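The three rates can be computed directly from the four confusion-matrix counts. The sketch below is plain Python and reuses the `actual` and `predicted` lists from the confusion matrix example; the variable names are chosen for illustration only.

```python
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # 3
tn = sum(t == 0 and p == 0 for t, p in pairs)  # 4
fp = sum(t == 0 and p == 1 for t, p in pairs)  # 2
fn = sum(t == 1 and p == 0 for t, p in pairs)  # 1

recall = tp / (tp + fn)        # 3 / 4: share of actual positives that were found
specificity = tn / (tn + fp)   # 4 / 6: share of actual negatives that were found
fpr = fp / (fp + tn)           # 2 / 6: equal to 1 - specificity

print(recall, specificity, fpr)
```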
AUC-ROC is explained in the YouTube video linked below. The video also explains how to set the threshold value and how to draw conclusions from the AUC-ROC curve.
Finding: The ROC (Receiver Operating Characteristic) curve can help in deciding the best threshold value. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) at different thresholds.
AUC summarizes how well the model (a logistic model in this example) separates the classes across all thresholds: it equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The AUC makes it easy to compare the ROC curve of one model to another.
In the fish example, AUC-ROC gives a better basis for evaluating the predictions and for setting the threshold.
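The construction of the ROC curve can be sketched in plain Python: sweep the threshold over the predicted scores, record (FPR, TPR) at each step, and integrate with the trapezoidal rule. The labels and scores below are made-up illustrative values, and `roc_points` and `auc` are hypothetical helper names; in practice scikit-learn's `roc_curve` and `roc_auc_score` do this job.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Candidate thresholds: every distinct score, highest first
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear curve, by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [0, 0, 1, 1]            # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]   # made-up predicted probabilities of class 1

points = roc_points(y_true, scores)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```

Each point on the curve corresponds to one threshold, so choosing a threshold amounts to picking the point with an acceptable trade-off between true positives and false positives.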
The F1 score is the harmonic mean of precision and recall, and its value ranges from 0 to 1. It measures a test's accuracy by balancing how precise the model is (how many of its positive predictions are correct) against how robust it is (how few actual positives it misses).
Finding: High precision with low recall gives a model that is extremely accurate on the instances it flags, but that misses a large number of instances that are difficult to classify. The greater the F1 score, the better the performance of the model. Mathematically, it can be expressed as:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
>>> from sklearn.metrics import f1_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> f1_score(y_true, y_pred, average='macro')
0.26...
>>> f1_score(y_true, y_pred, average='micro')
0.33...
>>> f1_score(y_true, y_pred, average='weighted')
0.26...
>>> f1_score(y_true, y_pred, average=None)
array([0.8, 0. , 0. ])
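To show how the harmonic mean behaves, here is a small plain-Python sketch. The helper name `f1_from` and the sample precision/recall pairs are illustrative assumptions; the first pair corresponds to class 0 in the scikit-learn example above (2 of 3 predictions correct, both actual samples found).

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class 0 above: precision 2/3, recall 1.0 -> about 0.8
print(f1_from(2 / 3, 1.0))

# Balanced precision and recall -> about 0.667
print(f1_from(0.6, 0.75))

# Very high precision but poor recall is pulled down hard -> about 0.18
print(f1_from(1.0, 0.1))
```

Unlike an arithmetic mean, the harmonic mean stays close to the smaller of the two values, so a model cannot score well on F1 by excelling at only one of them.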
Log loss measures the performance of a classification model whose prediction is a probability value between 0 and 1. Log loss increases as the predicted probability diverges from the actual label. The goal of machine learning is to minimize the loss, so a smaller log loss is better and the ideal value is 0.
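Log loss quantifies the accuracy of a classifier by penalizing false classifications, and it penalizes confident wrong predictions most heavily. Minimizing the log loss generally goes hand in hand with maximizing the accuracy of the classifier, although the two are not strictly equivalent.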
In order to calculate log loss, the classifier must assign a probability to each class rather than simply yielding the most likely class. Mathematically, log loss is defined as
$$\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$$
where N is the number of samples or instances, M is the number of possible labels,
y_ij is a binary indicator of whether or not label j is the correct classification for instance i, and
p_ij is the model's probability of assigning label j to instance i.
If there are only two classes, the expression becomes
$$\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
>>> from sklearn.metrics import log_loss
>>> log_loss(["spam", "ham", "ham", "spam"],
...          [[.1, .9], [.9, .1], [.8, .2], [.35, .65]])
0.21616...
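The same value can be reproduced by hand, which may help in reading the definition. The sketch below is plain Python with a made-up helper name, `log_loss_manual`. It assumes the probability columns are in sorted label order ("ham", then "spam"), which is how scikit-learn orders string labels, and it skips the probability clipping a library would apply to avoid log(0).

```python
import math


def log_loss_manual(y_true, probs, labels):
    """Mean negative log-probability assigned to the correct label."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p_correct = row[labels.index(label)]  # probability given to the true class
        total += -math.log(p_correct)
    return total / len(y_true)


labels = ["ham", "spam"]  # assumed column order of each probability row
y_true = ["spam", "ham", "ham", "spam"]
probs = [[.1, .9], [.9, .1], [.8, .2], [.35, .65]]

print(round(log_loss_manual(y_true, probs, labels), 5))  # 0.21616
```

Only the probability given to the correct class enters the sum, so a confident wrong prediction (a small probability for the true label) contributes a large penalty.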
Comparison of Log-loss with ROC & F1:
When making a decision, all of these metrics should be considered, and a proper evaluation should be done based on the requirements, the data, and so on.
The article linked below explains which metric to choose in which circumstances.