Monitor! Stop Being A Blind Data-Scientist.

Source: Deep Learning on Medium

The Shoemakers’ Children Go Barefoot, Alexander Mark Rossi.

How monitoring & alerts can help you achieve far better control of your data and how little it is being practiced today.

The following article explains the many use cases and the monumental importance of monitoring & alerts, specifically from a data-science-researcher point-of-view. We’ll look at use-cases and review several companies that try to provide solutions for this huge, but strangely undiscussed problem that overruns our industry.


Modern machine-learning relies on huge amounts of data and we have become quite successful in scraping and saving that data, but we are far from solving one of the biggest pains of data management.

I’m talking about knowing your data on a daily basis, something that I believe very few data scientists in our industry practice today. In short, data-science professionals have no idea what is really happening to their data: its annotation, the distribution of every segment, missing and new values, the label distribution, the version history (including package history), and so on. We blindly trust our data. This is common practice; I have seen and experienced it first hand.

Due to circumstances, many startups start by running and proceed to increase speed over time. Understandably, creating models to achieve growth and give value to your clients is important, but in the long run we are losing sight of something very important: knowing our data is more important than updating our models again and again.

Ask yourself: how do you know that you can trust your data? Without trust, there is no point in updating or creating new models; you could be introducing noise and sabotaging your own work.

On the other hand, creating and maintaining multiple dashboard views for every model that we put in production is cumbersome. These are views that observe statistical metrics such as the mean Wasserstein distance, null-hypothesis tests, F1, precision and recall, or product-related metrics. We don’t want to create additional metrics to observe our data and models, we don’t want to create models on top of our metrics that will find anomalies and trends, and we really don’t want to babysit our dashboards on a daily basis.

We want a system that knows how to monitor our data and models and alert us when things are changing, seamlessly and automatically, with predictive alerts based on machine-learning algorithms: alerts that can detect failure or try to predict future failure ahead of time. When looking at these alerts we want some explanation, using model-interpretability tools such as SHAP or LIME.

Let’s look at some notable use cases.

Use Cases

I have encountered several use cases for data monitoring & alerts. I am sure there are others, but these are the ones that impact you the most as a data scientist. The list is written in chronological order.

  1. Annotators
  2. Annotation Distribution
  3. Data Distributions
  4. Dependency Versions
  5. Model Versions
  6. Model Metrics
  7. Business-Product Metrics
  8. Model Software & Hardware Performance


Annotators

Annotators are usually the people who label your data; you can read about annotator metrics here. We can also use algorithms to do the annotation for us, whether using some form of heuristic algorithm, an ensemble, stacking, or libraries such as Snorkel.

Figure 1: measuring inter-annotator agreement and Fleiss’ kappa, showing a signal correlation and a fixed threshold.

Monitoring metrics such as inter-rater agreement, self-consistency and ground-truth agreement is essential (Figures 1 & 2), because without them you don’t know the quality of your annotations.
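As an illustration of one such metric, here is a minimal sketch of Fleiss’ kappa, a common inter-rater agreement score for a fixed number of annotators per item. The function name and the input layout (one list of labels per item) are assumptions made for this example, not something prescribed by a particular tool.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labelled by the same number of annotators.

    ratings: list of items, each item being a list of labels (one per annotator).
    categories: every label value that may appear.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of annotations")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement, from the overall proportion of each category.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    expected = sum(p ** 2 for p in proportions)

    if expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (observed - expected) / (1 - expected)


# Three annotators, two items, full agreement on both items.
print(fleiss_kappa([["a", "a", "a"], ["b", "b", "b"]], ["a", "b"]))  # 1.0
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a falling kappa over time is a natural signal to alert on.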

Figure 2: measuring average annotator accuracy over time.

Imagine the following: you have a majority-vote ensemble of annotators that decides on a label; however, several annotators start to stray from the initial instructions, for whatever reason. Without you knowing, your annotations end up as label noise or, if we are honest, annotation mistakes.

“If you can’t trust your annotators, then don’t use the annotations”

We need to monitor annotator metrics, get alerts when there is a problem, and consequently throw away data that was labelled badly, i.e., we won’t put it in our final data set.
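A minimal sketch of such a check, assuming each item is stored as a mapping from annotator id to label and that agreement with the majority vote is the metric being tracked. The annotator ids, the helper names and the 0.8 threshold are hypothetical.

```python
from collections import Counter, defaultdict


def majority_label(labels):
    # On a tie, Counter.most_common keeps the label that was seen first.
    return Counter(labels).most_common(1)[0][0]


def annotator_agreement(annotations):
    """annotations: list of dicts {annotator_id: label}, one dict per item."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for item in annotations:
        majority = majority_label(list(item.values()))
        for annotator, label in item.items():
            totals[annotator] += 1
            hits[annotator] += int(label == majority)
    return {annotator: hits[annotator] / totals[annotator] for annotator in totals}


def flag_annotators(annotations, threshold=0.8):
    """Return the annotators whose agreement with the majority is below the threshold."""
    rates = annotator_agreement(annotations)
    return sorted(a for a, rate in rates.items() if rate < threshold)


items = [
    {"ann1": "cat", "ann2": "cat", "ann3": "dog"},
    {"ann1": "dog", "ann2": "dog", "ann3": "cat"},
    {"ann1": "cat", "ann2": "cat", "ann3": "cat"},
]
print(flag_annotators(items))  # ['ann3']
```

Labels from flagged annotators can then be held back from the final data set until the problem is understood.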

Without proper monitoring and predictive alerts, you will not catch that label noise in time. You will waste resources such as time and money, and you will introduce noise into your training set, directly hurting your models and the clients you are trying to give value to.

This can also happen if you don’t monitor annotations that are the result of heuristics or algorithms. For this type of algorithm-based annotation, please read below about data distributions, dependency versions, model versions and model metrics.

Monitoring your source of truth is the most important thing in the data chain. For example, if you have a supervised problem and your labels are bad, you will struggle to create a good model. It is as important as, and sometimes more important than, monitoring your data or training a model.

Annotation Distribution

Now, we believe that we can trust our annotators because we monitored their performance and we discarded annotations that were below our fixed-threshold. Although annotations are not always class labels (e.g., named entities), let’s consider a situation where our annotations are class labels and that we are using them for a supervised problem.

Figure 3: a comparison of distributions (Mona labs)

When training a model, the model is trained on a certain distribution of class labels. Monitoring concept drift for the label distribution (Figure 3) in production can be done by using 1. a null-hypothesis test or 2. the average earth mover’s distance (EMD), also known as the Wasserstein metric, which measures the distance between two probability distributions. Depending on the number of classes, the null-hypothesis test should be used for binary classes, or for ternary classes where you have [-1,0,1], and the mean Wasserstein distance should be used when you have multiple classes.
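For the binary case, one possible null-hypothesis test is a two-proportion z-test on the share of positive labels in training versus production. This is a minimal sketch under that assumption; the article does not name a specific test, and the counts below are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(pos_train, n_train, pos_prod, n_prod):
    """Two-sided z-test for H0: the positive-label rate is the same in both sets.

    Returns (z, p_value).
    """
    p_train = pos_train / n_train
    p_prod = pos_prod / n_prod
    pooled = (pos_train + pos_prod) / (n_train + n_prod)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_train + 1 / n_prod))
    if std_err == 0:
        return 0.0, 1.0  # both sets are all-positive or all-negative
    z = (p_prod - p_train) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 10% positives in training, 25% in production.
z, p_value = two_proportion_z_test(100, 1000, 250, 1000)
drift_detected = p_value < 0.01  # the significance level is a choice, not a rule
print(round(z, 2), drift_detected)
```

With large production volumes even tiny shifts become statistically significant, so in practice the significance level is usually paired with a minimum effect size.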

“If your real-time label distribution is different from that of your training data, you have a problem. Take immediate action!”

Typically, you would compare the distribution of the labels of your training set with the distribution of the labels of your production data in real-time. Receiving predictive alerts due to a label concept drift can be a life-saver. (Please note that if your retrain strategy is based on active learning, you may find irregularities between the training distribution and live data, due to the nature of selecting samples from your model’s weakest predicted distribution.)
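The comparison can be sketched with a one-dimensional earth mover’s distance between the two label distributions. This example assumes the classes are placed at integer positions 0..K-1, which imposes an ordering on them; for unordered classes a different ground distance may be more appropriate. The class names and the 0.1 threshold are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def label_distribution(labels, classes):
    """Relative frequency of each class, in the order given by `classes`."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def emd_1d(p, q):
    """Earth mover's distance between two distributions on positions 0..K-1.

    For one-dimensional distributions with unit spacing this equals the sum of
    absolute differences between the two cumulative distributions.
    """
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


classes = ["negative", "neutral", "positive"]
train = label_distribution(["negative"] * 2 + ["neutral"] * 6 + ["positive"] * 2, classes)
live = label_distribution(["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1, classes)

distance = emd_1d(train, live)
print(round(distance, 3), distance > 0.1)  # 0.5 True
```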

For example, imagine that you trained two closely related models, each on its own distribution of labels, and that by mistake a single model was deployed to serve all of the data streams instead of each model predicting its own stream. A label-distribution monitor for each model will tell you that the first model is behaving as expected, while for the second one the training distribution is significantly different from the real-time distribution, which is a critical problem.

Monitoring label distribution can also be used to trigger an event in order to retrain the model. When distributions change naturally or when new clients are added, the label distribution is bound to change. Having that information can help you be more in control of when you retrain your model or debug various problems, which can lead to using fewer resources as well.

Data Distributions

You now know that your annotators and labels are perfect (hopefully). Alternatively, you may be working on an unsupervised problem and you don’t need annotators or annotation monitoring. It’s time to make sure your data distribution is not changing drastically and that your trained model can still serve unseen data.

Figure 4: a comparison of distributions over features, by parallemIM.

Imagine a case where your data has the following dimensions: Time, Client, Categories and Color. Let’s count how many segments we have: Time is based on a one-minute resolution for the last year, and you have 2000 clients, 30 categories and 3 color classes. Okay, so we have way too many segments to observe. Optimally, we would like to use a service that is built to automatically track all of these dimensions without us defining them, and that will alert us when a change in distribution is about to happen in one of those segments (Figure 4). It frees us from building dedicated dashboards using Datadog or Grafana-like services and allows us to focus on research.
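To make “way too many” concrete, here is the count as a short calculation, assuming a 365-day year and treating every combination of minute, client, category and color as its own segment.

```python
minutes_per_year = 365 * 24 * 60  # one-minute resolution over one year
n_clients = 2000
n_categories = 30
n_colors = 3

n_segments = minutes_per_year * n_clients * n_categories * n_colors
print(f"{n_segments:,}")  # 94,608,000,000
```

Even after aggregating time to coarser buckets, the number of client, category and color combinations alone is 180,000, which is still far more than anyone can watch on hand-built dashboards.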

“If you can’t trust your data distribution, then don’t use it to train a model”

Similarly to annotation-distribution, for each segment, we need to make sure there isn’t a concept-drift using the same methods and we must investigate the reason for a certain change, or retrain the model in order to support new data-distributions. Keep in mind that we want those alerts to be defined automatically for us, without us fine-tuning each and every one of them and without too many false positives.
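A minimal sketch of a per-segment scan for a single numeric feature, assuming each segment is keyed by a tuple such as (client, color) and that the baseline and current samples have the same size. With equal-sized samples, the earth mover’s distance reduces to the mean absolute difference of the sorted values. The segment keys, the values and the single global threshold are hypothetical; a real service would learn thresholds per segment rather than fix one by hand.

```python
def empirical_emd(a, b):
    """Earth mover's distance between two equal-sized numeric samples."""
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def drifting_segments(baseline, current, threshold):
    """Return {segment: distance} for every segment whose distance exceeds the threshold.

    baseline, current: dicts mapping a segment key to a list of feature values.
    Segments that are missing from `current` are skipped.
    """
    alerts = {}
    for segment, base_values in baseline.items():
        if segment not in current:
            continue
        distance = empirical_emd(base_values, current[segment])
        if distance > threshold:
            alerts[segment] = distance
    return alerts


baseline = {("client_a", "red"): [1, 2, 3, 4], ("client_b", "blue"): [1, 2, 3, 4]}
current = {("client_a", "red"): [4, 3, 2, 1], ("client_b", "blue"): [5, 6, 7, 8]}
print(drifting_segments(baseline, current, threshold=1.0))
# {('client_b', 'blue'): 4.0}
```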

Dependency Versions

Most algorithms are based on several dependencies and each one has its own version. Versions change throughout the model’s life, and if you don’t keep track of your versions, i.e., the right dependency version for the right algorithm, your model will perform differently without you knowing.

Figure 5: black dots represent a dependency or model version checkpoint, this illustration is composed on top of an actual view by

For example, imagine an emoji-to-text package that is being updated on a daily basis, and your NLP algorithm is based on that conversion. A newer version may support new emojis, and words may end up mapped to different emojis. This is a critical change that may lead to a model predicting the wrong thing or predicting the right thing with unexpected probabilities.

“If you can’t track your dependency versions, you can’t trust that your model performance will be deterministic”

In order to preserve a deterministic behaviour, we must record all the dependency-versions for every model that we put in production, as seen in Figure 5. The monitor needs to alert us when a dependency changes. Please note that unit tests will not always catch this behaviour. Dependency version history should be mapped to your prediction, so you can figure out what went wrong when it does go wrong, quickly and easily.
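One way to record and compare dependency versions is a snapshot taken at deployment time and diffed against the running environment. This sketch uses Python’s standard `importlib.metadata`; the package names and the stored snapshot are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def snapshot_versions(packages):
    """Map each package name to its installed version (None if it is not installed)."""
    snapshot = {}
    for name in packages:
        try:
            snapshot[name] = version(name)
        except PackageNotFoundError:
            snapshot[name] = None
    return snapshot


def diff_versions(recorded, current):
    """Return {package: (recorded_version, current_version)} for every change."""
    changes = {}
    for name in sorted(set(recorded) | set(current)):
        old, new = recorded.get(name), current.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Hypothetical snapshot stored with the model at deployment time,
# compared with a hypothetical snapshot of what is running now.
recorded = {"emoji-to-text": "1.4.0", "tokenizer-lib": "2.1.0"}
running = {"emoji-to-text": "1.5.0", "tokenizer-lib": "2.1.0"}
print(diff_versions(recorded, running))  # {'emoji-to-text': ('1.4.0', '1.5.0')}
```

In production, `running` would come from `snapshot_versions(recorded)`, and a non-empty diff is what the monitor would alert on.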

Model Versions

Similarly to dependency versions, knowing that the right model is plugged into the right data is very important. You don’t want to wake up one morning to client complaints and figure out that engineering did not deploy the models correctly.

“If you can’t track your model version, you can’t trust that it was deployed correctly”

This behaviour should be automatically reported when your CI/CD is training and deploying new models. Version history should be connected and displayed on your model prediction view at all times, so you can figure out what went wrong when it does go wrong, as seen in Figure 5.
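A minimal sketch of attaching a model version to every prediction, assuming the version is a fingerprint of the serialized model and that the expected fingerprint is published by the CI/CD pipeline. The record fields and stream ids are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def model_fingerprint(model_bytes):
    """Short, stable identifier derived from the serialized model."""
    return hashlib.sha256(model_bytes).hexdigest()[:12]


def prediction_record(stream_id, prediction, model_bytes, expected_fingerprint):
    """Build a log record that ties a prediction to the model that produced it."""
    fingerprint = model_fingerprint(model_bytes)
    return {
        "stream_id": stream_id,
        "prediction": prediction,
        "model_version": fingerprint,
        # True means the serving model is not the one that was meant to be deployed.
        "version_mismatch": fingerprint != expected_fingerprint,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


expected = model_fingerprint(b"weights-for-stream-a")
ok = prediction_record("stream-a", "cat", b"weights-for-stream-a", expected)
bad = prediction_record("stream-a", "cat", b"weights-for-stream-b", expected)
print(ok["version_mismatch"], bad["version_mismatch"])  # False True
```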

Model Metrics

We love to look at model metrics. They tell us how good our model is, and we can communicate these numbers externally. However, these numbers rely on upstream events such as labels, data, successful deployment and others.

“If you can’t observe your model performance, you can’t trust it.”

Figure 6: Model performance view by, showing accuracy, probability and F3.

Metrics such as precision, recall, F1 and PRAUC describe our success or failure; therefore, we are highly motivated to monitor them in real-time, as seen in Figure 6. For unsupervised algorithms, such as clustering, we can monitor metrics like cluster homogeneity, etc. Monitoring model performance helps us understand when things start to misbehave or when there are changes upstream.
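As a small illustration, precision, recall and F1 can be computed over a window of recent predictions once their true labels arrive. This sketch covers the binary case only; the window contents and the convention of returning 0.0 for an undefined ratio are assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one window of labelled predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Undefined ratios (no predicted or no actual positives) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


window_true = [1, 1, 0, 0, 1]
window_pred = [1, 0, 0, 1, 1]
metrics = binary_metrics(window_true, window_pred)
print({name: round(value, 3) for name, value in metrics.items()})
# {'precision': 0.667, 'recall': 0.667, 'f1': 0.667}
```

Running this per time window gives the kind of time series that a monitor can alert on.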

Business-Product Metrics

Similarly to model performance, there are metrics that were defined as part of the project and are extremely important for Business and Product. These metrics do not have to be the classical metrics that I mentioned above. For example, suppose your model is used as a suggestive model and you are showing your client the top-3 choices based on the top-3 probabilities, similar to what is happening in Figure 6. Your product KPI could be an accuracy that scores a prediction as correct when the true class is among the top-3 most probable classes, and your alert fires when this number is lower than 95%. Observing these metrics is crucial for a healthy product or feature that you are maintaining, and monitoring them is part of your ownership, although I would argue that both business and product need to be able to view them.
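The top-3 KPI from this example can be sketched as follows, assuming each prediction is available as a mapping from class name to probability. The function names and the sample data are hypothetical; the 95% floor is the one from the text.

```python
def top_k_accuracy(y_true, class_probs, k=3):
    """Share of items whose true label is among the k most probable classes.

    class_probs: list of dicts {class_name: probability}, one dict per item.
    """
    hits = 0
    for label, probs in zip(y_true, class_probs):
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += int(label in top_k)
    return hits / len(y_true)


def kpi_alert(value, floor=0.95):
    """True when the KPI has dropped below its floor."""
    return value < floor


labels = ["a", "d"]
predictions = [
    {"a": 0.50, "b": 0.30, "c": 0.10, "d": 0.10},
    {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05},
]
kpi = top_k_accuracy(labels, predictions)
print(kpi, kpi_alert(kpi))  # 0.5 True
```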

“If you can’t observe your product performance, how do you know that your KPI is keeping its goal?”

Theoretically, it’s possible that we are monitoring everything I mentioned so far and everything is perfect. However, these product metrics are still plummeting. It could mean many things but failing to find the reasons that are causing these metrics to drop is a serious problem. Therefore, getting predictive alerts when things are about to go bad is crucial for any product in production.

Model Software & Hardware Performance

This is usually the domain of external, non-data-science services such as New Relic. It’s important to monitor model processes in terms of CPU, GPU, RAM, network and storage performance, and to set alerts in order to discover high or low loads, as seen in Figure 7. This will ensure that your model can keep on serving on a healthy system without interrupting production.

Figure 7: A typical dashboard by Grafana, showing Memory, CPU, Client Load, Network traffic.
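A minimal sketch of a high-or-low load check, assuming resource readings arrive as percentages from whatever agent collects them. Only the disk reading is measured here, with the standard library; the other readings and all bounds are hypothetical.

```python
import shutil


def disk_used_percent(path="."):
    """Percentage of the disk holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_alerts(readings, bounds):
    """Return {metric: "high" | "low"} for every reading outside its bounds.

    readings: dict metric -> value in percent.
    bounds: dict metric -> (low, high) in percent.
    """
    alerts = {}
    for metric, value in readings.items():
        low, high = bounds[metric]
        if value > high:
            alerts[metric] = "high"
        elif value < low:
            alerts[metric] = "low"
    return alerts


bounds = {"cpu": (5, 90), "ram": (5, 90), "gpu": (5, 90)}
readings = {"cpu": 97.0, "ram": 40.0, "gpu": 1.0}  # hypothetical agent output
print(load_alerts(readings, bounds))  # {'cpu': 'high', 'gpu': 'low'}
```

An unusually low load is worth an alert too, since it can mean that traffic has stopped reaching the model.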