After an incident at one of the companies I previously worked for, I became a little obsessive about the continuity of my models’ performance. In short, a live model directly responsible for the output of a critical engine started to predict absurd results, extremely high or extremely low. The worst part was that we did not discover something was wrong until we got angry phone calls from the client, and we then spent long days debugging the entire pipeline before finding the root cause. In retrospect, there were no faulty components in our stack. No bugs, no errors. The problem was caused by the upstream data: the real-world readings we consumed had changed.
Models Are Built on a Snapshot of the World
If you are not employing some kind of online learning, where models update themselves at regular intervals, you work on historical data. You consume some data from your data store and train your models. You do this constantly: fetch the historical data, move it through the pipeline, get your model. The keyword here is historical. You work on data that are already old. You are building your solution for the future while looking at the past.
With your data, you actually develop your solution on a snapshot of the world: the raw data, the real-world readings, are that snapshot. It can span weeks, months, or years, and the longer it is, the better. Still, it gives you no guarantee that the future will follow the same behavior captured in your snapshot.
Models Fail When the World Changes
When you work with historical data, you sign a contract with your model. You promise that the properties of the data will not change; in exchange, the model gives you an accurate description of history. You rely on your KPIs, celebrate when you exceed a threshold, and push your solution into production. As long as that same model is still up there, your contract is still binding.
The real world changes constantly, and so do our upstream data. If you do not take additional steps, your models will be ignorant of these changes, but your contract still stands. Whether abruptly, when a major event happens, or slowly over time, your upstream data start to change. At some point, they may change so drastically that they no longer resemble the historical data your model was trained on. At that point, you violate your contract, and your model starts to fail silently.
Continuity of Performance
As the world changes, so does our data, and we need to be sure that our model is still doing a good job. Even if we cannot immediately put the model back on the right path, we need to know how it is performing. To make sure my models’ performance is still satisfactory, I tend to put the following checks in place. Mind that these checks are in service of the model or solution you are serving.
1. Upstream Data Checks
Which stage counts as upstream depends on your project; the idea is to observe the data at some stage before it enters the model to make a prediction. In practice, the last stage the data pass through before being fed into the model is usually a good choice.
The idea of this check is to make sure the upstream data still resemble the distribution, or follow the same patterns, of the historical data used to train the model. A simple but powerful approach is to calculate the distribution of the upstream data at a fixed time interval and compare it with the distribution of the historical data. If the two distributions are similar enough, you are green. Otherwise, you raise an alert and let the owner of the solution know.
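As a minimal sketch of such a check in Python, the snippet below compares a live upstream feed against the training snapshot with a two-sample Kolmogorov–Smirnov test. The function name `upstream_drift_check`, the data, and the `alpha` threshold are all hypothetical; in practice you would run something like this per numeric feature on your real feeds:

```python
import numpy as np
from scipy.stats import ks_2samp

def upstream_drift_check(training_sample, live_sample, alpha=0.05):
    """Flag drift between a live upstream feature and the training snapshot.

    A two-sample Kolmogorov-Smirnov test compares the two empirical
    distributions: a p-value below `alpha` means the samples are unlikely
    to come from the same distribution, so we should raise an alert.
    """
    statistic, p_value = ks_2samp(training_sample, live_sample)
    return p_value < alpha, p_value

# Hypothetical data: the training snapshot, and a live feed whose mean
# has drifted upwards by two standard deviations.
rng = np.random.default_rng(42)
training = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_drifted = rng.normal(loc=2.0, scale=1.0, size=1_000)

drifted, p = upstream_drift_check(training, live_drifted)
print(drifted)  # True: the live feed no longer resembles the snapshot
```

A scheduled job could run this at every fixed interval and page the solution owner whenever the check comes back red.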
2. Downstream Output Checks
The idea behind the downstream check is very similar to the upstream data check. Instead of checking the input, you check the output of your model or service. As before, you monitor the similarity between the distribution of your model’s outputs in production and its outputs when fed with historical data, and you raise an alert if the two start to drift apart.
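One concrete way to score that similarity is the Population Stability Index (PSI), a common drift metric for model outputs. The sketch below is an illustration under assumptions, not the author's implementation: the score distributions are synthetic, and the 0.1/0.25 thresholds are a widely used rule of thumb that you should tune for your own service:

```python
import numpy as np

def population_stability_index(reference, live, bins=10):
    """PSI between a reference output distribution and live outputs.

    Common rule of thumb (an assumption, tune for your use case):
    PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    # Bin edges come from quantiles of the reference (historical) outputs.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live values
    reference_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    # Clip to avoid division by zero and log(0) in empty bins.
    reference_pct = np.clip(reference_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - reference_pct)
                        * np.log(live_pct / reference_pct)))

# Synthetic model scores: a stable live feed, and one whose output
# distribution has shifted.
rng = np.random.default_rng(0)
historic_scores = rng.beta(2, 5, size=10_000)
live_scores = rng.beta(2, 5, size=2_000)     # same behaviour as history
shifted_scores = rng.beta(5, 2, size=2_000)  # distribution has moved

print(population_stability_index(historic_scores, live_scores))    # small
print(population_stability_index(historic_scores, shifted_scores)) # large
```

Deriving the bin edges from the historical outputs, rather than from the live batch, keeps every live batch comparable against the same fixed reference.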