Model Calibration — the things that experts may miss

Since the rise of deep learning, a flood of deep neural network models has come to the attention of machine learning engineers, data scientists, and research scientists. But not many people seem aware of a risk these deep, large models bring: the reliability concern.

What is reliability about?

Let's say we have a supervised computer vision model, f(x), that we use for a self-driving car. The task we define for this model is to recognise pedestrians on the street and their locations relative to the camera.

Because we do so well on designing the architecture, gathering the data, and optimising during training, the model achieves a very high accuracy, 99%, on the majority of scenes in a city. So it seems safe to use it in a self-driving car.

However, after using it in a real self-driving car for some time, we find that its performance is not as good as the model's confidence suggests, even though that confidence had aligned well with the accuracy we measured before deployment.

The above example is a typical problem related to the reliability of a supervised model. Reliability measures the alignment between the model's confidence and its real performance. If the model's confidence exceeds its accuracy, we say it is overconfident, and vice versa.

Reliability diagram

Before we quantify the measure, we first need to construct a reliability diagram. To construct it, we do the following:

  1. Split the confidence space into several bins of constant width.
  2. Run the model on the test data and allocate the samples to bins according to their confidence.
  3. Compute the expected accuracy of each bin and plot it as accuracy vs confidence.
The reliability diagram of a supervised probabilistic model. The left one is well calibrated and the right one is not.

Taking a look at the diagram above: in the left one the accuracy aligns with the model's confidence across all confidence levels, while in the right one it does not. Some samples fall in a high-confidence bin between 0.8 and 0.9, yet the accuracy there is only about 0.5. This means the model's confidence tells us nothing about how it actually performs.
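A minimal sketch of the three construction steps above, assuming the predictions are summarised as two arrays, `confidences` (the probability of the predicted class) and `correct` (1 if the prediction was right, else 0), and using 10 equal-width bins; the names and bin count are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(confidences, correct, n_bins=10):
    """Plot per-bin accuracy against per-bin mean confidence."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)        # constant-width bin edges
    bin_ids = np.digitize(confidences, bins[1:-1])  # assign each sample to a bin

    accs, confs = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            accs.append(correct[mask].mean())        # expected accuracy of the bin
            confs.append(confidences[mask].mean())   # mean confidence of the bin

    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(confs, accs, "o-", label="model")
    plt.xlabel("confidence")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()

# Hypothetical usage with softmax outputs `probs` of shape (N, C) and labels `y`:
# confidences = probs.max(axis=1)
# correct = (probs.argmax(axis=1) == y).astype(float)
# reliability_diagram(confidences, correct)
```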

A quantitative measure of reliability

To measure reliability, one intuitive approach is to convert the message that a reliability diagram conveys into a single real number. Since an ideally calibrated model lies on the diagonal line, we can measure the distance between the model's reliability curve and the identity function and use it as a metric of the model's reliability.

Expected Calibration Error (ECE)

ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|

The formula of ECE; B_m denotes the m-th confidence bin of the reliability diagram, |B_m| the number of samples in it, and n the total number of samples.

This metric takes the absolute difference between the model's accuracy and confidence within each bin, weights it by the fraction of samples falling in that bin, and sums over the bins.
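As a rough sketch, assuming the same `confidences` and `correct` arrays as in the reliability-diagram snippet above, ECE can be computed like this:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of per-bin |accuracy - confidence| gaps."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, bins[1:-1])

    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            acc = correct[mask].mean()          # acc(B_m)
            conf = confidences[mask].mean()     # conf(B_m)
            ece += (mask.sum() / n) * abs(acc - conf)  # weighted by |B_m| / n
    return ece
```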

Maximum Calibration Error (MCE)

MCE = \max_{m \in \{1, \dots, M\}} \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|

The formula of MCE.

MCE shares the same core idea as ECE, but the calibration requirement it imposes is stronger: instead of averaging, it takes the worst-case gap over the bins, so a low MCE means even the worst bin has to be well calibrated.
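Under the same assumptions as the ECE sketch, the only change is swapping the weighted sum for a maximum over the bins:

```python
import numpy as np

def maximum_calibration_error(confidences, correct, n_bins=10):
    """Worst-case per-bin |accuracy - confidence| gap."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, bins[1:-1])

    gaps = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gaps.append(abs(correct[mask].mean() - confidences[mask].mean()))
    return max(gaps)   # the worst bin, rather than the weighted average
```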

So far we have only stepped through the gate of model calibration, and the story does not end here. There are still many key points for a longer discussion; they will appear in the next articles. Please stay tuned.

Cheers!

Papers for reference:

On Calibration of Modern Neural Networks (Guo et al., 2017): https://arxiv.org/abs/1706.04599