Statistical pitfalls in data science

Original article was published on Artificial Intelligence on Medium

How stereotypical results can alter data distributions in people’s minds

There are plenty of ways to draw a wide variety of valid results from a given dataset, but there are countless ways to reason incorrectly from it as well. Fallacies are the products of inaccurate or faulty reasoning, which usually lead to incorrect conclusions being drawn from the data.

Photo by Tayla Jeffs on Unsplash

The good thing is that since numerous people have made these mistakes for so long and the results have been documented throughout history in a variety of fields, it is easier to identify and explain many of these statistical fallacies. Here are some statistical traps that data scientists should avoid falling into.

Cherry Picking

This is probably the most obvious and simplistic fallacy that there can be, and is something that most of us have definitely done before. The intuition of cherry picking is as simple as it gets: intentionally selecting data points to help support a particular hypothesis, at the expense of other data points which reject the proposition.

Cherry picking reduces the credibility of experimental findings because it shows only one side of the picture

Cherry picking is not only dishonest and misleading to the public, but it also reduces the credibility of experimental findings, because it shows only one side of the picture and hides all the negative results. This can make an experiment seem entirely successful when in reality it isn’t.

Figure: cherry-picked data vs. all data

The easiest way to avoid cherry picking is not to do it! Cherry picking is, by nature, a deliberate choice made by the practitioner and therefore not accidental. To further reduce the possibility of cherry picking while collating data, one should use data from a large and varied set of backgrounds (wherever possible) to limit the bias that usually comes with a limited perspective.
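To see how much a selective subset can distort a conclusion, here is a minimal sketch. Everything in it is an assumption made for illustration: the series is made-up noise with no real trend, and the window length, seed, and helper name `ols_slope` are arbitrary. The code searches for the six consecutive readings that rise most steeply and compares their slope with the slope over all the data.

```python
import random

def ols_slope(ys):
    """Least-squares slope of ys against their positions 0, 1, 2, ..."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical series: 120 readings that are pure noise around 50 (no real trend).
readings = [50 + rng.gauss(0, 1) for _ in range(120)]

window = 6
# Try every run of 6 consecutive readings and keep the one with the steepest rise.
best_start = max(
    range(len(readings) - window + 1),
    key=lambda s: ols_slope(readings[s:s + window]),
)
cherry_picked = readings[best_start:best_start + window]

full_slope = ols_slope(readings)
picked_slope = ols_slope(cherry_picked)

print(f"slope over all {len(readings)} readings: {full_slope:+.3f} per step")
print(f"slope over readings {best_start} to {best_start + window - 1}: "
      f"{picked_slope:+.3f} per step")
```

Reporting only the selected window would suggest a clear upward trend, even though the full series was generated without one.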

Data Dredging

Most people (especially those unfamiliar with the nuances of data science) assume that data analysis means picking out obvious correlations from a variety of data. That is not entirely correct, as data analysis often requires logical reasoning to explain why a certain correlation exists. Without a proper explanation, the correlation may simply be due to chance. The traditional way to run an experiment is to define a hypothesis first and then examine the data to test it. Data dredging, in contrast, is the practice of searching the data for correlations and presenting whichever ones turn up as findings, without offering any logical insight into why the correlation exists. The more comparisons one tries, the more likely it is that some will look meaningful purely by chance.

Data dredging is sometimes described as seeking more information from a dataset than it actually contains
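The effect of trying many comparisons can be sketched with a small simulation. The setup below is entirely hypothetical: a random target and 200 random "features" of 20 observations each, all independent noise, with an arbitrary seed. The `pearson` helper is written out by hand so the example needs only the standard library.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n_obs, n_features = 20, 200

# Target and features are independent noise: no feature truly relates to the target.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
features = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_features)]

correlations = [pearson(f, target) for f in features]
strongest = max(correlations, key=abs)                             # the "discovery"
typical = sorted(abs(r) for r in correlations)[n_features // 2]   # median |r|

print(f"typical |correlation| with the target: {typical:.2f}")
print(f"strongest correlation found by dredging: {strongest:+.2f}")
```

The strongest correlation is far larger than the typical one, yet it carries no meaning, because every feature was generated independently of the target.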

An offshoot of data dredging is false causality, where a wrong assumption about a correlation can lead research astray. A correlation between two things often tempts us to believe that one caused the other. However, the correlation is often a coincidence, or an external factor causes one or both of the effects. A data scientist must always dig deeper than what is apparent on the surface and go beyond simple correlations to gather evidence that backs a research hypothesis.

Correlation does not imply causation


Overfitting

Overfitting is a term that most machine learning and data science practitioners are well versed in. It refers to creating an overly complex model that is tailored too closely to the dataset at hand and does not perform well on new, unseen data.

In machine learning terms, overfitting occurs when a model performs exceedingly well on the training set but fails to give similar results on the test set. John Langford gives a comprehensive description of the most common types of overfitting in practice, along with techniques to help avoid them.

Figure: overfitting of data

Most data scientists build mathematical models to understand the underlying relations and correlations between data points. A sufficiently complex model tends to fit the provided data perfectly, giving high accuracy and minimal loss on that data. However, such models are often brittle and break down when given other data, because they have fit the noise as well as the signal. Simpler models tend to be more robust and better at making predictions on data they have not seen.
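The contrast between a complex and a simple model can be illustrated with a small sketch. It rests on stated assumptions: the data come from a made-up relation y = 2x plus noise, and a "memoriser" that returns the label of the nearest training point stands in for an overly complex model (a deliberately extreme choice, not a method from the article). The simple model is a straight line fitted by least squares.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

def make_data(n):
    """Noisy samples from the hypothetical relation y = 2x + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Simple model: ordinary least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memoriser(xs, ys):
    """Overly flexible model: return the label of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(30)
test_x, test_y = make_data(500)

line = fit_line(train_x, train_y)
memo = fit_memoriser(train_x, train_y)

for name, model in [("straight line", line), ("memoriser", memo)]:
    print(f"{name:>13}: train MSE = {mse(model, train_x, train_y):.2f}, "
          f"test MSE = {mse(model, test_x, test_y):.2f}")
```

The memoriser reproduces every training point exactly, so its training error is zero, but it also reproduces the noise in those points and does worse than the straight line on fresh data.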

Simpson’s Paradox

Simpson’s Paradox is a perfect example that highlights the need for good intuition regarding the real world when collating and experimenting on data. Data scientists need to recognize and accept the fact that most data is a finite representation of a much larger and much more complex domain. Simpson’s Paradox showcases the dangers of oversimplifying a complex situation by trying to see it from a single point-of-view.

Simpson’s Paradox is named after Edward Hugh Simpson, a statistician who described the phenomenon in a technical paper in 1951. It is simple to state, yet it often confuses audiences without statistical training: a trend that appears within each group of the data reverses or disappears when the groups are combined.

Figure: the overall trend reverses when data is grouped by particular categories

Simpson’s Paradox is best explained by a simple example. Let’s say we compare two batsmen in cricket, A and B. In home matches, A scores more boundaries per innings than B, and in away matches A again scores more. Yet when all matches are pooled, B can come out ahead, for instance if B played most of his innings at home, where scoring is easier, while A played most of his away.

Simpson’s Paradox can, in some ways, be thought of as unintentional cherry picking. It is usually caused by a variable within the distribution, appropriately named the lurking variable, which splits the data into multiple separate distributions. Lurking variables can often be difficult to identify.
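A tiny worked example shows how a lurking variable can flip a comparison. All numbers below are made up for illustration, and the venue (home or away) plays the role of the lurking variable; the helper names are likewise hypothetical.

```python
# Hypothetical (made-up) records: player -> venue -> (innings, boundaries).
records = {
    "A": {"home": (4, 20), "away": (16, 32)},
    "B": {"home": (16, 72), "away": (4, 6)},
}

def venue_rate(player, venue):
    """Boundaries per innings for one player at one venue."""
    innings, boundaries = records[player][venue]
    return boundaries / innings

def overall_rate(player):
    """Boundaries per innings for one player across all venues combined."""
    innings = sum(i for i, _ in records[player].values())
    boundaries = sum(b for _, b in records[player].values())
    return boundaries / innings

for venue in ("home", "away"):
    print(f"{venue}: A = {venue_rate('A', venue):.1f}, "
          f"B = {venue_rate('B', venue):.1f}")
print(f"overall: A = {overall_rate('A'):.1f}, B = {overall_rate('B'):.1f}")
```

With these numbers, A leads at home (5.0 vs 4.5) and away (2.0 vs 1.5), yet B leads overall (3.9 vs 2.6). The reversal happens because B played most innings at home, where both players score more freely, while A played most innings away.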

We need to know what we are looking for, and to appropriately choose the best data-viewpoint that gives the audience a fair and complete representation of the truth

To avoid falling for Simpson’s Paradox, a data scientist must know their data and have a basic idea of the general factors that surround and affect it. With that context in mind, the data should be collected and viewed in such a way that the results neither glorify only the hypothesis (cherry picking) nor change when viewed from a single standalone viewpoint.

Survivorship Bias

Algorithmic bias has recently gained a lot of attention and has become a hot topic. Statistical bias, however, is as old as statistics itself. Survivorship bias can best be described as drawing conclusions from incomplete data, namely only the cases that made it through some selection process. Such biases play a crucial role in making data analysis inaccurate.

Survivorship bias occurs when the data in the dataset has previously been subjected to a filtering process. This can result in faulty deductions and affect a great deal of analysis. Being aware of such biases is crucial in data science, because it is a human tendency to study successful outcomes and draw inferences from them while ignoring the accompanying failures.
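The effect of such a filtering step can be sketched in a few lines. The scenario is hypothetical: 10,000 made-up funds whose yearly returns all come from the same zero-mean distribution, with funds that lose more than 5% assumed to close and vanish from the dataset. The threshold, spread, and seed are arbitrary.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical yearly returns: every fund draws from the same zero-mean distribution.
returns = [rng.gauss(0.0, 0.10) for _ in range(10_000)]

# Filtering step: funds that lost more than 5% are closed and drop out of the data.
survivors = [r for r in returns if r > -0.05]

mean_all = sum(returns) / len(returns)
mean_survivors = sum(survivors) / len(survivors)

print(f"funds in the full population: {len(returns)}")
print(f"funds left after filtering:   {len(survivors)}")
print(f"mean return, all funds:       {mean_all:+.3f}")
print(f"mean return, survivors only:  {mean_survivors:+.3f}")
```

An analyst who only sees the survivors would conclude that the average fund makes money, even though the full population was built to average roughly zero.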

By looking at just the successful CEOs, we don’t see the full data set, including unsuccessful CEOs and everyone else on the planet that may happen to eat oatmeal for breakfast

Since survivorship bias stems from incomplete datasets and research inputs, there are techniques a data scientist can apply to avoid it while drawing deductions from data. These include, but are not limited to, using multiple data inputs, imagining alternative scenarios, building a contextual understanding of the data, and testing on more data.

Gambler’s Fallacy

The gambler’s fallacy is another example of how the human mind tends to draw inferences from stereotypical correlations in the data. It is the belief that because something has recently occurred more frequently than expected, it is now less likely to occur (and vice versa). For independent events, this does not hold. For example, if a coin lands on heads 3 times in a row, one might think there is no way it lands on heads a fourth time. This is wrong: for a fair coin, each toss is independent of the previous ones, so the probability of heads on the next toss is still 1/2.
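A quick simulation can make this concrete. The sketch below assumes a fair coin modelled with Python’s `random` module (the seed and the number of flips are arbitrary choices) and checks how often heads follows three heads in a row.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulate fair coin flips: True means heads, False means tails.
flips = [rng.random() < 0.5 for _ in range(200_000)]

# Collect the outcome of every flip that comes right after three heads in a row.
after_streak = [
    flips[i]
    for i in range(3, len(flips))
    if flips[i - 3] and flips[i - 2] and flips[i - 1]
]

p_heads_overall = sum(flips) / len(flips)
p_heads_after_streak = sum(after_streak) / len(after_streak)

print(f"P(heads) over all flips:          {p_heads_overall:.3f}")
print(f"P(heads) right after three heads: {p_heads_after_streak:.3f}")
print(f"number of three-head streaks:     {len(after_streak)}")
```

Both proportions come out close to one half: a streak of heads does not make tails any more likely on the next flip.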

The same thing happens with data. When multiple data points begin to show similar unexplained correlations or contradictions, data scientists tend to go by a gut feeling rather than logical explanations and formulas, which often leads to disastrous consequences when drawing inferences.

People tend to go with a gut feeling based on previous experiences rather than logical explanations when drawing out inferences from data

Understanding the gambler’s fallacy requires two key ideas: the law of large numbers and its relation to regression towards the mean. The law of large numbers states that the mean of the results of performing the same experiment a large number of times should be close to the expected value, and that it tends to get closer to the expected value as the number of trials increases. It makes no promise about any individual trial, which is exactly what the gambler’s fallacy overlooks.
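The law of large numbers can be checked with a short simulation. This sketch assumes a fair six-sided die, whose expected value is 3.5, and looks at the gap between the sample mean and the expected value at a few arbitrary checkpoints.

```python
import random

rng = random.Random(5)  # fixed seed so the run is repeatable

expected_value = 3.5  # mean of a fair six-sided die: (1 + 2 + ... + 6) / 6
rolls = [rng.randint(1, 6) for _ in range(100_000)]

checkpoints = [10, 100, 1_000, 10_000, 100_000]
gaps = {}
for n in checkpoints:
    sample_mean = sum(rolls[:n]) / n
    gaps[n] = abs(sample_mean - expected_value)
    print(f"n = {n:>7}: sample mean = {sample_mean:.3f}, "
          f"gap from 3.5 = {gaps[n]:.3f}")
```

The gap does not shrink at every single step, since each prefix is still random, but over many rolls it becomes small. Nothing forces later rolls to compensate for earlier ones; the early results are simply outweighed by the volume of later ones.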

The concept of regression towards the mean, the tendency of an unusually good or bad result to be followed by one closer to the average, also gives rise to the regression fallacy. This fallacy consists of attributing that natural return to the average to some specific cause, such as an intervention made after the extreme result. It often appears when people look for an explanation for the outliers generated in the predictions of a study or model.
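Regression towards the mean can also be sketched with a simulation. The model is an assumption made for illustration: each of 10,000 hypothetical participants has a fixed skill, and each round’s score is that skill plus independent luck. Nothing is done to anyone between the two rounds.

```python
import random

rng = random.Random(11)  # fixed seed so the run is repeatable
n = 10_000

# Hypothetical model: score = stable skill + luck that is redrawn every round.
skill = [rng.gauss(0, 1) for _ in range(n)]
round1 = [s + rng.gauss(0, 1) for s in skill]
round2 = [s + rng.gauss(0, 1) for s in skill]

def mean(values):
    return sum(values) / len(values)

# Select the top 10% of performers based on round 1 alone.
ranked = sorted(range(n), key=lambda i: round1[i], reverse=True)
top = ranked[: n // 10]

top_round1 = mean([round1[i] for i in top])
top_round2 = mean([round2[i] for i in top])
overall_round2 = mean(round2)

print(f"top group, round 1 mean: {top_round1:.2f}")
print(f"top group, round 2 mean: {top_round2:.2f}")
print(f"everyone,  round 2 mean: {overall_round2:.2f}")
```

The top group’s second-round average falls back towards the overall average while staying above it. Crediting that drop to anything that happened between the rounds would be an instance of the regression fallacy.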