Understanding Dataset Shift

Source: Deep Learning on Medium

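Dataset shift occurs when the distribution of the data a model is trained on differs from the distribution of the data it is tested on. Whilst you may scoff at the triviality of such a statement, this is possibly the most common problem I see when viewing solutions to Kaggle challenges. In some ways, a deep understanding of dataset shift is key to winning Kaggle competitions.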

The terminology is not standardized: dataset shift is also referred to as concept shift or concept drift, changes of classification, changing environments, contrast mining in classification learning, fracture points, and fractures between data.

Dataset shift occurs predominantly within the machine learning paradigm of supervised learning and the hybrid paradigm of semi-supervised learning.

The problem of dataset shift can stem from the way input features are utilized, the way training and test sets are selected, data sparsity, shifts in the data distribution due to non-stationary environments, and also from changes in the activation patterns within layers of deep neural networks.

Why is dataset shift important?

Dataset shift is application-dependent, so identifying and resolving it relies largely on the skill of the data scientist. For example, how does one determine when the dataset has shifted sufficiently to pose a problem to our algorithms? If only certain features begin to diverge, how do we weigh the accuracy lost by removing those features against the accuracy lost by training on a misrepresented data distribution?
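One simple way to approach the first question is to compare each feature's training and test distributions directly. The sketch below is an illustration added for this purpose, not a method prescribed by the article: it computes the two-sample Kolmogorov-Smirnov statistic per feature using only the Python standard library. The function names, the synthetic features, and the 0.1 threshold are all assumptions chosen for the example; a sensible threshold depends on the application.

```python
import bisect
import random


def ks_statistic(train_values, test_values):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    train_sorted = sorted(train_values)
    test_sorted = sorted(test_values)
    n_train, n_test = len(train_sorted), len(test_sorted)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for value in train_sorted + test_sorted:
        cdf_train = bisect.bisect_right(train_sorted, value) / n_train
        cdf_test = bisect.bisect_right(test_sorted, value) / n_test
        largest_gap = max(largest_gap, abs(cdf_train - cdf_test))
    return largest_gap


def flag_shifted_features(train, test, threshold=0.1):
    """Return the names of features whose KS statistic exceeds `threshold`.

    `train` and `test` map feature name -> list of numeric values.
    The default threshold of 0.1 is an arbitrary illustrative choice.
    """
    return [
        name
        for name in train
        if ks_statistic(train[name], test[name]) > threshold
    ]


# Synthetic data (hypothetical): "age" keeps its distribution, while
# "income" has its mean moved by one standard deviation in the test set.
rng = random.Random(0)
train = {
    "age": [rng.gauss(40, 10) for _ in range(2000)],
    "income": [rng.gauss(50, 15) for _ in range(2000)],
}
test = {
    "age": [rng.gauss(40, 10) for _ in range(2000)],
    "income": [rng.gauss(65, 15) for _ in range(2000)],
}

print(flag_shifted_features(train, test))  # prints ['income']
```

A per-feature test like this only sees marginal distributions, so it can miss shifts that appear solely in the relationships between features. For real work, a library routine such as `scipy.stats.ks_2samp` also returns a p-value alongside the statistic.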

In this article, I will discuss the different types of dataset shift, the problems that can arise from their presence, and current best practices for avoiding them. The discussion is largely conceptual, and classification examples are used for ease of demonstration.

There are multiple manifestations of dataset shift that we will examine:

  • Covariate shift
  • Prior probability shift
  • Concept shift
  • Internal covariate shift (an important subtype of covariate shift)
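As a compact reference, the first three types in the list above are commonly distinguished by which part of the joint distribution changes between training and test. The notation here is an assumption for illustration: x denotes the input features, y the class label, and the subscripts mark the training and test distributions.

```latex
\begin{align*}
&\text{Covariate shift:} & P_{\text{train}}(x) &\neq P_{\text{test}}(x), & P_{\text{train}}(y \mid x) &= P_{\text{test}}(y \mid x) \\
&\text{Prior probability shift:} & P_{\text{train}}(y) &\neq P_{\text{test}}(y), & P_{\text{train}}(x \mid y) &= P_{\text{test}}(x \mid y) \\
&\text{Concept shift:} & P_{\text{train}}(y \mid x) &\neq P_{\text{test}}(y \mid x), & P_{\text{train}}(x) &= P_{\text{test}}(x)
\end{align*}
```

Internal covariate shift concerns the distribution of inputs to the hidden layers of a network as it trains, so it is not captured by this train/test notation.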

This is a huge and important topic in machine learning, so do not expect a comprehensive overview of the area. If the reader is interested in this subject, there is a plethora of research articles on the topic, the vast majority of which focus on covariate shift.