Missing Data, its types, and statistical methods to deal with it

Source: Deep Learning on Medium


While learning, most data scientists and enthusiasts work with famous datasets such as MNIST and ImageNet, which are complete, clean, and well formatted. Unfortunately, real-world problems and datasets are far from this academic utopia: they are noisy, they contain a lot of missing data, and sometimes they are not well structured or formatted either.

In this post, we are going to talk about one of these tedious problems, one that pops up often. Without further ado, and as the title says, we will tackle the “missing data” problem from a broad, statistical perspective.

First of all, what do we mean concretely by “missing data”?

Missing data means that the values of one or more variables (features) are missing, generally encoded as -999, nan, null, and so on. It often occurs when data is collected incorrectly, when data is simply lacking (e.g. user ratings), or when errors are made while entering the data (mistyping). It can lead to seriously flawed findings and conclusions, which could negatively affect decisions!
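Since missing values can hide behind several encodings, a first practical step is usually to map them all to a single marker. Here is a minimal sketch in plain Python, assuming the sentinel codes listed in `MISSING_CODES` (the names `MISSING_CODES`, `is_missing`, `normalize`, and `missing_counts` are made up for this illustration, and a real dataset will document its own codes):

```python
import math

# Assumed sentinel codes; real datasets document their own conventions.
MISSING_CODES = {-999, "", "nan", "NaN", "null", "NULL"}

def is_missing(value):
    """Return True if a raw cell value should be treated as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value in MISSING_CODES

def normalize(rows):
    """Replace every missing marker with None so later code checks one thing."""
    return [{key: (None if is_missing(val) else val) for key, val in row.items()}
            for row in rows]

def missing_counts(rows):
    """Count None entries per column."""
    counts = {}
    for row in rows:
        for key, val in row.items():
            counts[key] = counts.get(key, 0) + (val is None)
    return counts

raw_rows = [
    {"age": 25, "iq": -999},
    {"age": 31, "iq": 110},
    {"age": float("nan"), "iq": "null"},
]
clean_rows = normalize(raw_rows)
print(missing_counts(clean_rows))  # {'age': 1, 'iq': 2}
```

Counting missing entries per column like this is a cheap way to see how big the problem is before choosing a strategy.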

The following figure illustrates the striking example of “recommender systems”, where the “missing data” problem occurs frequently because part of the data depends on user feedback.

Credits : X. Amatriain

Missing data also makes it much harder for researchers to analyze and interpret the outcomes of their research and to draw conclusions.

There are three types of missing data:
1) Missing Completely at Random (MCAR).
2) Missing at Random (MAR).
3) Missing Not at Random (MNAR).

Type I: Missing Completely at Random (MCAR)
There is no relationship between whether a data point is missing and any values in the data set (missing or observed). The missing data are just a random subset of the data; the missingness has nothing to do with any other variable. In practice, data are rarely MCAR.

The following example depicts this kind of problem:

Credits : Iris Eekhout

It is relatively easy to check the assumption that the data in our example is missing completely at random. If you can find any reason for the missingness (e.g. using common sense, regression, or some other method), whether based on the complete variable Age or the incomplete variable IQ score, then the data is not MCAR!

TLDR : affected by neither the observed nor the missing data => Completely At Random
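One simple way to look for such a reason is to compare the complete variable between rows where the other variable is missing and rows where it is observed. The sketch below does this with Welch's t statistic on Age. The data are made up (the IQ value 100 is only a placeholder), and `welch_t` and `age_gap_t` are hypothetical helper names. A statistic near zero does not prove MCAR; it only fails to find a link with Age.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a = statistics.variance(group_a) / len(group_a)
    var_b = statistics.variance(group_b) / len(group_b)
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    return diff / math.sqrt(var_a + var_b)

def age_gap_t(ages, iqs):
    """Compare Age between rows with missing IQ (None) and rows with observed IQ."""
    missing = [age for age, iq in zip(ages, iqs) if iq is None]
    observed = [age for age, iq in zip(ages, iqs) if iq is not None]
    return welch_t(missing, observed)

ages = list(range(20, 66))

# Pattern A: every third IQ is dropped, regardless of age.
iq_pattern_a = [None if i % 3 == 0 else 100 for i in range(len(ages))]
# Pattern B: IQ is dropped for everyone younger than 44.
iq_pattern_b = [None if age < 44 else 100 for age in ages]

t_a = age_gap_t(ages, iq_pattern_a)
t_b = age_gap_t(ages, iq_pattern_b)
print(round(t_a, 2), round(t_b, 2))
```

In pattern A the two groups have the same average age, so the statistic sits at zero; in pattern B it is far from zero, which rules out MCAR.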

Type II: Missing at Random (MAR)
Here the missingness is affected only by the complete (observed) variables and not by the characteristics of the missing data itself. In other words, whether a data point is missing is not related to the missing values themselves, but it is related to some (or all) of the observed data. The following example depicts the situation and makes it clearer:

Credits : Iris Eekhout

We can easily see that IQ score is missing for the younger participants (age < 44), so the missingness depends on the observed data; however, it does not depend on the values of the missing column itself.

TLDR : not caused by the missing data itself but affected by observed data => At Random

Type III: Missing Not at Random (MNAR)
This is neither Type I nor Type II: the data are missing because of the values of the missing column itself. For instance, the following example shows IQ scores missing only for people with a low score.

Credits : Iris Eekhout

As you can see, it is impossible to detect MNAR cases without knowing the missing values!

TLDR : caused by the missing data itself => Not At Random
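To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is an assumption chosen for illustration: the synthetic Age and IQ data, the mild dependence of IQ on Age, the 30% drop rate, and the cutoffs 44 and 95. The helper names `simulate`, `apply_mechanism`, and `observed_mean` are hypothetical.

```python
import random
import statistics

def simulate(n=1000, seed=0):
    """Create synthetic (age, iq) data; returns the generator for reuse."""
    rng = random.Random(seed)
    ages = [rng.randint(20, 65) for _ in range(n)]
    # Assumption: IQ drifts slightly upward with age, plus Gaussian noise.
    iqs = [90 + 0.25 * age + rng.gauss(0, 10) for age in ages]
    return rng, ages, iqs

def apply_mechanism(ages, iqs, mechanism, rng):
    """Return a copy of iqs with None wherever the mechanism deletes a value."""
    masked = []
    for age, iq in zip(ages, iqs):
        if mechanism == "MCAR":
            drop = rng.random() < 0.3   # independent of every variable
        elif mechanism == "MAR":
            drop = age < 44             # depends only on the observed Age
        elif mechanism == "MNAR":
            drop = iq < 95              # depends on the IQ value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if drop else iq)
    return masked

def observed_mean(values):
    """Mean of the values that are still present."""
    return statistics.fmean([v for v in values if v is not None])

rng, ages, iqs = simulate()
masked = {name: apply_mechanism(ages, iqs, name, rng)
          for name in ("MCAR", "MAR", "MNAR")}

print("full data mean:", round(statistics.fmean(iqs), 2))
for name, values in masked.items():
    print(name, "observed mean:", round(observed_mean(values), 2))
```

Under MNAR the observed mean is pushed upward because only low scores are removed, and nothing in the remaining data reveals that this happened.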

Here is a non-exhaustive list of ways to cope with missing data problems:

Method 1: Deletion

It covers two different techniques:

  • Listwise Deletion: an entire record is excluded from the analysis if any single value is missing, so we have the same N (number of records) for every analysis.
  • Pairwise Deletion: the number of records taken into consideration, denoted “N”, varies with the variable (column) being studied. For instance, if we compute the mean of two features (one complete, one with missing values), we end up dividing by a different N for each: the total number of rows for the complete feature, and the number of non-missing values for the other.
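The difference between the two is easiest to see on a tiny table. The sketch below uses four made-up records and two hypothetical helpers, `listwise` and `column_mean`; missing values are represented as `None`.

```python
import statistics

rows = [
    {"age": 25, "iq": None},
    {"age": 35, "iq": 100},
    {"age": 45, "iq": 110},
    {"age": 65, "iq": None},
]

def listwise(rows):
    """Keep only the records where no value is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]

def column_mean(rows, column):
    """Return (mean, N) over the rows where this column is present."""
    values = [row[column] for row in rows if row[column] is not None]
    return statistics.fmean(values), len(values)

complete = listwise(rows)
listwise_stats = {c: column_mean(complete, c) for c in ("age", "iq")}
pairwise_stats = {c: column_mean(rows, c) for c in ("age", "iq")}

print(listwise_stats)  # {'age': (40.0, 2), 'iq': (105.0, 2)}
print(pairwise_stats)  # {'age': (42.5, 4), 'iq': (105.0, 2)}
```

With listwise deletion both means use N = 2; with pairwise deletion the Age mean uses all four rows while the IQ mean still uses two.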

Method 2: Single Imputation Methods

  • Single value imputation: replacing the missing value with a single value chosen by one strategy, such as the mean, median, most frequent value, person mean, … of the corresponding feature.
  • Similarity: finding the closest row (or the top-N closest rows) to the row containing the missing value, and applying a chosen strategy over them to assign a value to the missing entry.
  • Regression Imputation: in single regression imputation, the imputed value is predicted from a regression equation. We assume that the missing values lie on a regression line with a nonzero slope against one of the complete features (predictors).
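The three strategies above can be sketched side by side on one small example. This is a simplified illustration, not a reference implementation: the data are made up, similarity is measured only as distance in Age, the regression uses a single predictor, and the function names `mean_impute`, `nearest_impute`, and `regression_impute` are hypothetical.

```python
import statistics

def mean_impute(y):
    """Fill every missing value with the mean of the observed values."""
    fill = statistics.fmean([v for v in y if v is not None])
    return [fill if v is None else v for v in y]

def nearest_impute(x, y, k=1):
    """Fill each missing y with the mean y of the k complete rows closest in x."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            nearest = sorted(complete, key=lambda pair: abs(pair[0] - xi))[:k]
            yi = statistics.fmean([pair[1] for pair in nearest])
        filled.append(yi)
    return filled

def regression_impute(x, y):
    """Fill each missing y with the prediction of a least-squares line on x."""
    cx = [xi for xi, yi in zip(x, y) if yi is not None]
    cy = [yi for yi in y if yi is not None]
    mx, my = statistics.fmean(cx), statistics.fmean(cy)
    slope = (sum((a - mx) * (b - my) for a, b in zip(cx, cy))
             / sum((a - mx) ** 2 for a in cx))
    intercept = my - slope * mx
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]

age = [20, 32, 40, 50, 60]
iq = [90, None, 110, 120, None]

by_mean = mean_impute(iq)
by_neighbor = nearest_impute(age, iq, k=1)
by_regression = regression_impute(age, iq)
print(by_mean)
print(by_neighbor)
print(by_regression)
```

Mean imputation gives both gaps the same value, the neighbor approach copies the IQ of the closest age, and regression extends the fitted line, so the three strategies can disagree noticeably on the same row.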

Method 3: Multiple Imputation Methods

  • Expectation-Maximization Algorithm: an algorithm that can be used both for missing data imputation and for machine learning clustering tasks (treating the target as a missing feature). It is based on two steps:

— First : Expectation of missing value
— Second : Maximizing the likelihood

I highly recommend Andrew Ng's Stanford notes to understand it very well. It is not a hard algorithm, so don't fear its formulas! http://cs229.stanford.edu/notes/cs229-notes8.pdf
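To show the two steps in action, here is a sketch of EM for one narrow case: a linear-Gaussian model, IQ = a + b * Age + noise, where Age is complete and some IQ values are missing. This is an assumption made for illustration, not the general algorithm from the notes, and the function name `em_linear_gaussian` and the synthetic data are hypothetical. The E-step replaces each missing IQ (and its square) by its expected value under the current parameters; the M-step re-estimates the parameters from those expected values.

```python
import random

def em_linear_gaussian(x, y, n_iter=2000):
    """EM for y = a + b*x + Gaussian noise, where some entries of y are None."""
    n = len(x)
    observed = [v for v in y if v is not None]
    # Crude starting point: no slope, observed mean and variance of y.
    a = sum(observed) / len(observed)
    b = 0.0
    s2 = sum((v - a) ** 2 for v in observed) / len(observed)
    mx = sum(x) / n
    var_x = sum((xi - mx) ** 2 for xi in x) / n
    for _ in range(n_iter):
        # E-step: expected y and y^2 for each row under the current parameters.
        ey = [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]
        ey2 = [yi * yi if yi is not None else (a + b * xi) ** 2 + s2
               for xi, yi in zip(x, y)]
        # M-step: maximum-likelihood update from the expected statistics.
        my = sum(ey) / n
        cov_xy = sum(xi * yi for xi, yi in zip(x, ey)) / n - mx * my
        var_y = sum(ey2) / n - my * my
        b = cov_xy / var_x
        a = my - b * mx
        s2 = var_y - b * b * var_x
    return a, b, s2

rng = random.Random(0)
ages = [rng.uniform(20, 65) for _ in range(200)]
iqs = [90 + 0.25 * age + rng.gauss(0, 10) for age in ages]
iq_with_gaps = [None if age < 44 else iq for age, iq in zip(ages, iqs)]

a_hat, b_hat, s2_hat = em_linear_gaussian(ages, iq_with_gaps)
imputed = [a_hat + b_hat * age if iq is None else iq
           for age, iq in zip(ages, iq_with_gaps)]
print(round(a_hat, 2), round(b_hat, 3), round(s2_hat, 2))
```

In this simple setting EM ends up at the same line as an ordinary regression fitted on the complete rows, which makes it a handy sanity check; EM becomes more useful when several variables have gaps at once.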

  • MI Methods: multiple imputation is an attractive method for handling missing data in multivariate analysis. The idea was first proposed by Rubin; it consists of averaging the outcomes across multiple imputed data sets to account for the uncertainty in the imputed values. All multiple imputation methods follow three steps:
  1. Imputation — Similar to single imputation, missing values are imputed. However, the imputed values are drawn m times from a distribution rather than just once. At the end of this step, there should be m completed datasets.
  2. Analysis — Each of the m datasets is analyzed. At the end of this step there should be m analyses.
  3. Pooling — The m results are consolidated into one result by calculating the mean, variance, and confidence interval of the variable of concern .
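The three steps can be sketched as follows for one quantity of interest, the mean IQ. This is a deliberately simplified illustration under stated assumptions: the data are synthetic, imputations are drawn from a regression of IQ on Age with the fitted parameters held fixed (a full multiple imputation would also draw the regression parameters), and the interval uses a normal approximation rather than Rubin's t reference distribution. The pooling follows Rubin's rules: total variance = within variance + (1 + 1/m) * between variance. The names `fit_line` and `multiply_impute_mean` are hypothetical.

```python
import math
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual standard deviation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    residuals = [y - intercept - slope * x for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return intercept, slope, sigma

def multiply_impute_mean(x, y, m=20, seed=0):
    """Estimate the mean of y with m imputed datasets pooled by Rubin's rules."""
    rng = random.Random(seed)
    cx = [xi for xi, yi in zip(x, y) if yi is not None]
    cy = [yi for yi in y if yi is not None]
    intercept, slope, sigma = fit_line(cx, cy)
    estimates, variances = [], []
    for _ in range(m):
        # 1. Imputation: draw each missing value from the predictive distribution.
        completed = [yi if yi is not None
                     else intercept + slope * xi + rng.gauss(0, sigma)
                     for xi, yi in zip(x, y)]
        # 2. Analysis: the mean and its squared standard error on this dataset.
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    # 3. Pooling: combine the m analyses.
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    half_width = 1.96 * math.sqrt(total)  # normal approximation (assumption)
    return {"estimate": pooled, "within": within, "between": between,
            "total": total, "ci": (pooled - half_width, pooled + half_width)}

data_rng = random.Random(1)
ages = [data_rng.uniform(20, 65) for _ in range(100)]
iqs = [90 + 0.25 * age + data_rng.gauss(0, 10) for age in ages]
iq_with_gaps = [None if age < 44 else iq for age, iq in zip(ages, iqs)]

result = multiply_impute_mean(ages, iq_with_gaps, m=20)
print(round(result["estimate"], 2), [round(v, 2) for v in result["ci"]])
```

The between-imputation term is what single imputation leaves out: it widens the interval to reflect that the filled-in values are guesses rather than observations.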

Thanks for your time and attention. Keep learning!

If you would like to reach out to me on LinkedIn, I would be very grateful.