Who stole my Data?

Source: Deep Learning on Medium

My not-so-fancy memoirs of a not-so-fancy attempt to understand a really fancy problem: Missing Data Completion.

GitRepo: Python Module for Missing Data Imputation


Missing data is an important issue: even small proportions of missing data can adversely impact the quality of the learning process, leading to biased inference. Many methods have been proposed over the past decades to minimize missing-data bias, and they fall into two categories: those that model the missing-data process and use all available partial data to estimate model parameters directly, and those that fill in (impute) missing values with plausible predicted values. Imputation methods are preferred for their obvious advantage: they provide users with a complete dataset that can be analyzed using user-specified models (Gondara and Wang, 2018).

Different Approaches to Missing Data Imputation

I: SRMI (Sequential Regression Multiple Imputation)

II: CART (Classification and Regression Trees)

III: Soft-Impute


Motive — From a Bayesian perspective, missing values follow a distribution given the observed values, and we are trying to find this predictive distribution (and not a single imputation value)

Idea — Imputing several plausible sets of missing values in the incomplete data set resulting in several completed data sets


  • The population is essentially infinite
  • The sample is a simple random sample
  • The missing data mechanism is ignorable

Multiple Imputations so far:

  • Used a Bayesian framework, where an explicit model is specified for all variables with missing values, conditional on the fully observed variables, unknown parameters, and a prior distribution for these parameters
  • The model generates a posterior predictive distribution of the missing values given the fully observed values
  • Imputations are draws from this predictive distribution (Rubin, 1976)

Multiple Imputations — Challenges

Specifying a Bayesian model is often difficult, for three reasons:

  1. Survey data usually has many variables (hundreds) of varying forms and distributions (counts, categorical, continuous, and within continuous — normal, lognormal and others).
  2. Restrictions that arise from the nature of certain variables (e.g. for “Income from the second job”, imputations must be restricted to people who indicated they had a second job)
  3. Logical bounds that must be maintained while imputing (e.g. “Years since quit smoking” should be bounded by [0, Age − 18] if there is no evidence of teenage smoking)
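For example, a logical bound like [0, Age − 18] can be respected by drawing imputations from a truncated normal predictive distribution. A minimal sketch using scipy's `truncnorm` (the function name and parameters here are illustrative, not from the paper):

```python
from scipy.stats import truncnorm

def draw_years_since_quit(pred_mean, pred_sd, age, rng=None):
    """Draw 'years since quit smoking' respecting the bound [0, age - 18].

    pred_mean, pred_sd come from the fitted conditional regression
    (hypothetical names for illustration)."""
    low, high = 0.0, age - 18.0
    # truncnorm expresses the bounds in standard-deviation units
    a = (low - pred_mean) / pred_sd
    b = (high - pred_mean) / pred_sd
    return truncnorm.rvs(a, b, loc=pred_mean, scale=pred_sd, random_state=rng)
```

Every draw is guaranteed to fall inside the logical bound, no matter how extreme the regression prediction is.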

All the above are addressed by SRMI

SRMI — Imputation Method

  • X: design matrix (n × p) for all n observations and the p variables with no missing values
  • Y1, Y2, …, Yk: the variables with missing values, ordered from lowest to highest number of missing values
  • The joint conditional density is factored into conditional density functions with parameters θ1, θ2, …, θk:

f(Y1, …, Yk | X, θ1, …, θk) = f1(Y1 | X, θ1) · f2(Y2 | X, Y1, θ2) ⋯ fk(Yk | X, Y1, …, Yk−1, θk)

  • Each fj is a conditional density function, modelled according to the type of the variable; values are drawn from the corresponding predictive distribution given X (the observed values)

Each imputation consists of c rounds

  • Round 1 — Impute sequentially, starting with the Y variable that has the fewest missing values: impute Y1 given X, append the completed Y1 to X, impute Y2 given X and Y1, and so on, updating X after each variable
  • Rounds 2 to c — Re-impute each Y variable in turn, conditioning on X and all other Y variables (everything except the variable currently being imputed)
  • Continue until round c, or until convergence
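The round structure above can be sketched in a toy all-continuous setting (normal linear models only; the paper dispatches on variable type and handles restrictions and bounds, all of which this sketch omits):

```python
import numpy as np

def srmi_impute(X, n_rounds=5, seed=0):
    """Toy SRMI round structure for all-continuous data.

    Round 1 imputes columns in order of increasing missingness; later
    rounds re-impute each column given all the others, drawing from a
    normal predictive distribution (a stand-in for the paper's
    per-type conditional models)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = X.copy()
    for j in range(X.shape[1]):               # placeholder start: column means
        filled[miss[:, j], j] = np.nanmean(X[:, j])
    order = np.argsort(miss.sum(axis=0))      # fewest missing values first
    for _ in range(n_rounds):
        for j in order:
            rows = miss[:, j]
            if not rows.any():
                continue
            A = np.column_stack([np.ones(len(filled)),
                                 np.delete(filled, j, axis=1)])
            # fit on rows where Y_j was actually observed
            beta, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            sigma = (X[~rows, j] - A[~rows] @ beta).std()
            # draw from the predictive distribution rather than the mean
            filled[rows, j] = A[rows] @ beta + rng.normal(0, sigma, rows.sum())
    return filled
```

Observed values are never altered; only the missing entries are replaced by draws.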

Modifications needed to include restrictions and bounds

  • Example of restrictions — models must be fitted to an appropriate subset of individuals (e.g. number of pregnancies, number of years smoking cigarettes), with dummy variables carried into the subsequent regressions
  • Bounds are handled via truncated normal linear regression or sampling importance resampling (SIR)
  • The approach can also be extended to polytomous variables

Conditional regression fj(Yj | X, θj) for each variable type:

  • Normal linear regression model if Yj is continuous
  • Logistic regression model if Yj is binary
  • Generalized logit regression model if Yj is categorical
  • Poisson log-linear model if Yj is a count
  • Two-stage model (logistic regression + normal linear regression) if Yj is mixed
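In a Python implementation, this per-type dispatch might look like the following sketch. The scikit-learn estimators and the `conditional_model` helper are illustrative stand-ins, not the paper's code (the two-stage model for mixed variables is omitted):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression, PoissonRegressor

# Hypothetical mapping from variable type to the SRMI conditional model.
SRMI_MODELS = {
    "continuous":  LinearRegression,    # normal linear regression model
    "binary":      LogisticRegression,  # logistic regression model
    "categorical": LogisticRegression,  # multinomial (generalized) logit
    "count":       PoissonRegressor,    # Poisson log-linear model
}

def conditional_model(var_type):
    """Return a fresh model f_j for a Y_j of the given type."""
    return SRMI_MODELS[var_type]()
```

Each variable with missing values then gets its own freshly constructed model, fitted only on the rows where that variable is observed.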

Simulation Study


  • Generate a complete dataset from a hypothetical population
  • Estimate selected regression parameters
  • Delete certain values (Ignorable missing data mechanism)
  • Use SRMI to impute deleted values
  • Obtain new multiply imputed estimates of the same parameters from the second step
  • Examine the differences

A total of 2,500 complete data sets with three variables (U, 𝑌1, 𝑌2 ) and sample size 100 were generated using the following models:

1. U ~ Normal(0, 1)

2. Y1 ~ Gamma with mean μ1 = exp(U − 1) and variance μ1²/5

3. Y2 ~ Gamma with mean μ2 = exp(−1 + 0.5U + 0.5Y1) and variance μ2/2

Model for Y2 is the primary interest


  • For each of the 2,500 simulated data sets with missing values, a total of 250 rounds with M = 5 different random starts were created using SRMI
  • For each replicate, the resulting imputed data sets and the full data set were analyzed by fitting the Gamma model for Y2 using maximum likelihood
  • To assess the differences in the point estimates, the standardized difference between the SRMI and full-data estimates was computed
  • SRMI data resulted in well-calibrated intervals estimates


  • Describes and evaluates SRMI, a technique useful for imputing values in a wide variety of complex data structures
  • Ideal when the joint distribution of the variables with missing values is too hard to formulate
  • Advantage — deals with missing variables on a case-by-case basis, fully conditional on all observed values
  • Imputations from SRMI and from full Bayesian models are comparable, as demonstrated by the case study
  • Key assumption — the dataset arises from a simple random sample design
  • Disadvantage — large datasets can be computationally intense
  • Open issue — how do you decide which data are suitable for SRMI?


Motive — The primary mission of most national statistical agencies is to disseminate data to the public. However, government agencies are under increasing pressure to limit access to data because of growing threats to data confidentiality. Even stripping obvious identifiers like names, addresses, and identification numbers may not be enough to protect confidentiality. To protect confidentiality in public use datasets, many statistical agencies release data that have been altered to protect confidentiality.

Synthesis of Data — So far

  • Rubin (1993) and Little (1993) suggested that agencies release partially synthetic data, which comprise the original units surveyed with some collected values replaced with multiple imputations.
  • The key to the success of synthetic data approaches, especially when replacing many values, is the data generation model.
  • Current practice for generating synthetic data typically employs sequential modeling strategies based on parametric or semiparametric models similar to those for the imputation of missing data in Raghunathan et al. (2001).
  • The basic idea is to impute Y1 from a regression of Y1 on (Y2, Y3, etc.), impute Y2 from a regression of Y2 on (Y1, Y3, etc.), impute Y3 from a regression of Y3 on (Y1, Y2, etc.), and so on.

Synthesis of Data- Challenges

  • Specifying these conditional imputation models can be daunting in surveys with many variables. Often datasets include numerical, categorical, and mixed variables, some of which may not be easy to model with standard tools.
  • The relationships among these variables may be non-linear and interactive.
  • Specifying models for many variables is a resource-intensive task, and statistical agencies simply may not have the time to invest in the careful specification of these conditional models for many variables.

Identification of disclosure risk measures

  • An intruder has a vector of information, t, on target units in the population.

t0: unique identifier of the target

dj0: (not released) unique identifier for record j in D, where j = 1, …, n

  • The intruder seeks to match unit j in D to the target when dj0 = t0, and not to match when dj0 ≠ t0 for any j ∈ D.
  • Let J be a random variable that equals j when dj0 = t0 for j ∈ D, and equals n + 1 when dj0 = t0 for some j ∉ D.
  • The intruder seeks to calculate Pr(J = j | t, D) for j = 1, …, n.
  • If an identification probability is large enough (relative to a chosen threshold), the match is declared an identification.

But the intruder does not know Yrep, so the identification probability must be approximated for each record in D.

When comparing the risks of the nonparametric synthesizers, the authors assume that the intruder approximates this probability by treating the simulated values in the released datasets as plausible draws of Yrep.

The matching probability for any record j is then approximately:

Pr(J = j | t, D) ≈ (1/m) Σ_{l=1..m} (D(l)j = t) / Ft

  • Ft is the number of records in the population that satisfy the matching criteria
  • Logical expression (D(l)j = t) equals 1 when the values of variables used for matching by intruders for record j are in accord with the corresponding values in t; else 0
  • Assume Intruder doesn’t know if target is included in D. If the intruder knows who is in D, Ft is replaced with N(l)(the number of records in D(l) that satisfy the matching criteria)
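Under this approximation, the intruder's matching probability is just an average over the m released datasets of an agreement indicator divided by Ft. A hypothetical sketch (`match_probability` and its arguments are illustrative names):

```python
import numpy as np

def match_probability(synthetic_sets, target, F_t):
    """Approximate Pr(J = j | t, D) for every record j.

    synthetic_sets : list of m released arrays, each (n, q) on the
                     matching variables
    target         : length-q vector t of the intruder's information
    F_t            : number of population records matching t"""
    m = len(synthetic_sets)
    n = synthetic_sets[0].shape[0]
    probs = np.zeros(n)
    for D_l in synthetic_sets:
        agree = (D_l == target).all(axis=1)   # indicator (D(l)j = t)
        probs += agree / F_t
    return probs / m
```

If the intruder knows who is in D, `F_t` would instead be the per-dataset count of matching released records, as the text notes.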

CART (Classification and Regression Trees)

CART seeks to approximate the conditional distribution of a univariate outcome from multiple predictors. The algorithm partitions the predictor space so that subsets of units formed by the partitions have relatively homogeneous outcomes. The partitions are found by recursive binary splits of the predictors. The series of splits can be effectively represented by a tree structure, with leaves corresponding to the subsets of units. The values in each leaf represent the conditional distribution of the outcome for units in the data with predictors that satisfy the partitioning criteria that define the leaf.

Example of a simple Classification and Regression Tree using just two rules

CART for synthesis

Using only records with Zj = 1, fit the tree of Yi on Y−i so that each leaf contains at least k records — call this tree Y(i):

  • For categorical variables, we grow Y(i) by finding the splits that successively minimize the Gini index;
  • For numerical variables, we successively minimize the deviance of Yi in the leaves.

Stop splitting any leaf when:

  • The deviance in that leaf is less than some agency-specified threshold d, Or
  • When we cannot ensure at least k records in each child leaf.

Only records with Zj = 1 are used to ensure that the tree is tailored to the data that will be replaced:

  • For any record with Zj = 1, trace down the branches of Y(i) to that record’s terminal leaf. Let Lw be the w-th terminal leaf in Y(i), and let Y(i)Lw be the nLw values of Yi in leaf Lw. For all records whose terminal leaf is Lw, generate replacement values of Yij by drawing from Y(i)Lw using the Bayesian bootstrap (Rubin, 1981)
  • Repeating the Bayesian bootstrap for each leaf of Y(i) results in the l-th set of synthetic values. Repeat this process m times to generate m datasets with synthetic values of Yi
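The leaf-resampling step above can be sketched with scikit-learn's `DecisionTreeRegressor` as a stand-in for the authors' tree-growing procedure (which controls deviance and minimum leaf size directly; this is an illustration, not their implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_synthesize(y, X, min_leaf=5, seed=0):
    """One CART synthesis step: fit a tree of y on the other variables,
    then replace each value with a Bayesian-bootstrap draw from its
    terminal leaf (Rubin, 1981)."""
    rng = np.random.default_rng(seed)
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf,
                                 random_state=seed).fit(X, y)
    leaves = tree.apply(X)                 # terminal leaf of each record
    synth = np.empty_like(y, dtype=float)
    for leaf in np.unique(leaves):
        vals = y[leaves == leaf]           # Y(i)Lw: the values in leaf Lw
        # Bayesian bootstrap: Dirichlet(1,...,1) weights over leaf values
        w = rng.dirichlet(np.ones(len(vals)))
        synth[leaves == leaf] = rng.choice(vals, size=len(vals), p=w)
    return synth
```

Every synthetic value is an actual observed value from the same leaf, which is what keeps the conditional distributions in the synthetic data close to those in the original.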

Empirical Evaluation: Simulation Study


1. Take a subset of a public dataset (population), repeatedly draw samples from it.

2. For each sample, construct partially synthetic datasets using different synthesizer techniques.

3. Use inferential methods to determine point estimates and 95% confidence intervals for 162 estimands of multiple analyses (linear regressions, logistic regressions etc.) and evaluate disclosure risk, for ten replications.

Data — public use microdata sample from the 2002 Uganda Population and Housing Census

  • 10% systematic sample of Uganda’s population
  • ~2.5 million questionnaire records
  • >100 variables


1. From population repeatedly draw 1% simple random samples — this is the original data from which synthetic datasets should be generated

2. Variables chosen: Five — Number of persons in the household, age, marital status, literacy, employment status. Synthetic data generation by replacing all records’ values for these variables; all other variables are left unchanged

3. For each drawn observed dataset, synthesize the five variables using the steps for multivariate synthesis in the order: persons, age, marital status, literacy, and employment status

4. For each synthesizer, generate m = 5 partially synthetic datasets


CART shows the best results among all the ML models, based on a comparison of the true population quantities with the average point estimates, across the 1,000 simulation runs, for the original and synthetic datasets.


  • Empirical evaluations show that it is possible to obtain synthetic datasets that provide reliable estimates paired with low disclosure risk by using nonparametric synthesizers.
  • Authors achieved results without any tuning of the different synthesizers (Better results are obtainable if the methods are tailored to the data at hand)
  • SVM and CART synthesizers outperform the bagging and random-forest synthesizers in terms of analytical validity, albeit at the price of an increased risk of identification disclosure
  • SVMs are promising, but implementation is difficult (they are sensitive to tuning, and it is not obvious which variant should be used for synthesis)
  • CART balances analytical validity and disclosure risks. With appropriate tuning, it can provide a high level of data utility for potentially acceptable disclosure risks.


Motive: In many applications, measured data can be represented in an m × n matrix X for which only a relatively small number of entries are observed. The task: “complete” the matrix based on the observed entries.

A news recommender system

Idea: The problem is rephrased as learning an unknown parameter — an m × n matrix Z of very high dimensionality — from very few observations (the observed entries of X). For inference, assume that Z lies on a much lower-dimensional manifold, i.e. that Z can be well represented by a matrix of low rank:

Z ≈ V G, where V is m × k, G is k × n, and k ≪ min(m, n)


For an m × n matrix X, let Ω ⊂ {1, …, m} × {1, …, n} denote the indices of the observed entries.

Now the optimization problem:

minimize rank(Z) subject to Σ_{(i,j)∈Ω} (X_ij − Z_ij)² ≤ δ

where δ ≥ 0 is a regularization parameter controlling the tolerance in training error. The rank constraint makes this problem combinatorially hard for general Ω; to make it tractable, rank(Z) is replaced by its convex relaxation, the nuclear norm ‖Z‖* (the sum of the singular values of Z).

The above can then be reformulated in Lagrange form:

minimize over Z: (1/2) Σ_{(i,j)∈Ω} (X_ij − Z_ij)² + λ ‖Z‖*

Convergence Analysis

  • Unlike generic first-order methods (Nesterov, 2003) including competitive first-order methods for nuclear norm regularization, Soft-Impute does not involve the choice of any additional step-size.
  • The algorithm is readily scalable for solving large scale problems.
  • The sequence Z_λ^k generated by Soft-Impute converges asymptotically, i.e. as k → ∞, to a minimizer of the objective function
  • Proofs are complex and are based on trace inequalities

Computational Complexity

  • The computationally demanding part of the algorithm is evaluating S_λ(P_Ω(X) + P_Ω^⊥(Z_λ^k)), the soft-thresholded SVD of the filled-in matrix
  • This requires calculating a low-rank SVD, since the underlying model assumption is that rank(Z) ≪ min{m, n}
  • There are very efficient direct matrix factorization methods for calculating the SVD of matrices of moderate size (at most a few thousand in dimensions), from numerical linear algebra literature.
  • For large matrices, one must resort to indirect iterative methods for calculating the leading singular vectors/values of a matrix
  • Authors use PROPACK algorithm (Larsen, 2004, 1998) because of (1) Its low storage requirements, (2) effective flop count (3) structure of the problem which involves sparse matrices and (4) its well-documented MATLAB version
Algorithm for Nuclear Norm Regularization
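The iteration can be sketched in a few lines of numpy. This is a minimal illustration using a dense SVD (the paper uses PROPACK for the low-rank SVD of large sparse matrices; `soft_impute` and its parameters are illustrative names):

```python
import numpy as np

def soft_impute(X, lam, n_iters=100, tol=1e-4):
    """SOFT-IMPUTE sketch: iterate Z <- S_lam(P_Omega(X) + P_Omega_perp(Z)),
    where S_lam soft-thresholds the singular values by lam."""
    obs = ~np.isnan(X)              # Omega: indices of the observed entries
    Xf = np.where(obs, X, 0.0)      # observed values, zeros elsewhere
    Z = np.zeros_like(Xf)
    for _ in range(n_iters):
        W = np.where(obs, Xf, Z)    # fill missing entries with current estimate
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        s = np.maximum(s - lam, 0.0)            # soft-threshold singular values
        Z_new = (U * s) @ Vt
        # stop when the relative change between iterates is small
        if np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0):
            return Z_new
        Z = Z_new
    return Z
```

Note there is no step size to choose, matching the convergence discussion above: each iteration is a single soft-thresholded SVD of the filled-in matrix.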

Simulation Study: Application on Netflix Data

Netflix Prize

  • The Netflix training data consists of the ratings of 17,770 movies by 480,189 Netflix customers. Resulting data matrix is extremely sparse, with 100,480,507 or 1% of the entries observed.
  • The task was to predict the unseen ratings for a qualifying set and a test set of about 1.4 million ratings each with the true ratings in these data sets held in secret by Netflix.
  • Netflix’s own algorithm has an RMSE of 0.9525, and the contest goal was to improve this by 10%, or an RMSE of 0.8572 or better.


  • A maximum of 10 iterations was performed for each of the examples
  • The results are not competitive with those of the competition leaders, but demonstrate the feasibility of applying SOFT-IMPUTE to such a large data set

Deep AutoEncoders

Motive: Missing Data is an important problem and the current imputation models have the following issues:

  • Inability to handle mixed data types (categorical and continuous)
  • Strict distributional assumptions
  • Often cannot handle arbitrary missing-data patterns
  • Existing models that do overcome these issues are limited in 1) their ability to model highly nonlinear relationships, and 2) their ability to handle high-volume data and complex interactions while preserving inter-variable dependencies

Idea: Recent advancements in Deep Learning have established state-of-the-art results in many fields (CV, Speech Recognition, Motion Planning) because:

  • Deep architectures have the capability to automatically learn latent representations and complex inter-variable associations, which is not possible using classical models.
  • Denoising Autoencoders (DAEs), part of the deep learning framework, are designed to recover clean output from noisy input.
  • Missing data is a special case of noisy input, making DAEs ideal as an imputation model.


AutoEncoder Architecture

An autoencoder takes an input x ∈ [0,1]^d and maps (encodes) it to an intermediate representation h ∈ [0,1]^d′ using an encoder, where d′ is the dimension of a different subspace.

  • The encoded representation is then decoded back to the original d-dimensional space using a decoder.
  • Encoder and decoder are both artificial neural networks:

h = s(Wx + b),  z = s(W′h + b′)

where z is the decoded result and s is any nonlinear function; the reconstruction error between x and z is minimized during the training phase.

Denoising Autoencoders (DAEs)

  • Denoising autoencoders are a natural extension to autoencoders.
  • The approach is to corrupt the input data and force the network to reconstruct the clean output, by making the hidden layers learn robust features.
  • Corruption can be applied in different ways, such as randomly setting some input values to zero or using distributional additive noise.


  • Employ an atypical, overcomplete representation in the DAE (i.e. more units in successive hidden layers during the encoding phase than in the input layer)
  • Overcomplete DAEs create representations capable of adding lateral connections, aiding data recovery
  • Start with an initial n-dimensional input, then add nodes at each successive hidden layer, increasing the dimensionality by Θ at each step
  • For the experiments, Θ = 7 was used (a hyperparameter)
  • Inputs are standardized between 0 and 1 to facilitate faster convergence for small to moderate sample sizes

The architecture for Θ hidden layers

Usage and Algorithm

  • Initialize placeholders: the respective column average for continuous variables and the most frequent label for categorical variables.
  • The training phase begins with a stochastic corruption process that sets random inputs to zero; the model learns to map the corrupted input to clean output.
  • Assumption: there is enough complete data to train the model, so the model learns to recover true data via the stochastic corruption of the inputs, rather than learning to treat the placeholders as valid imputations.
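The corruption-then-reconstruct loop above can be illustrated with a tiny numpy denoising autoencoder. This sketch has a single sigmoid hidden layer; the paper's MIDA model is deeper and overcomplete, so treat this purely as an illustration of the training idea:

```python
import numpy as np

def train_dae(X, hidden=12, epochs=500, lr=0.5, corrupt=0.3, seed=0):
    """Single-hidden-layer denoising autoencoder: corrupt inputs by
    randomly zeroing entries, learn to reconstruct the clean data.
    Returns the trained reconstruction function."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    s = lambda a: 1.0 / (1.0 + np.exp(-a))          # sigmoid nonlinearity
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        X_tilde = X * (rng.random(X.shape) > corrupt)   # stochastic corruption
        H = s(X_tilde @ W1 + b1)                        # encode
        Z = s(H @ W2 + b2)                              # decode
        dZ = (Z - X) * Z * (1 - Z) / n                  # grad of MSE vs clean X
        dH = (dZ @ W2.T) * H * (1 - H)
        W2 -= lr * (H.T @ dZ); b2 -= lr * dZ.sum(axis=0)
        W1 -= lr * (X_tilde.T @ dH); b1 -= lr * dH.sum(axis=0)
    return lambda Xin: s(s(Xin @ W1 + b1) @ W2 + b2)    # reconstruction
```

Because missing entries are initialized with placeholders and then treated exactly like corrupted inputs, running the trained reconstruction function over the placeholder-filled data yields the imputations.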

Simulation Study: Comparative Study with MICE

Competitor — Current state-of-the-art in multiple imputations is Multivariate Imputation by Chained Equations (MICE)

  • A fully conditional specification approach, which specifies the multivariate model variable by variable via a set of conditional densities, one for each variable with missing data.
  • Draws imputations by iterating over the conditional densities.
  • Has the added advantage of being able to model a different density for each variable.


The MIDA model outperforms MICE in 100% of cases with data MCAR and MNAR with a uniform missing pattern, and in > 70% of cases with the random missing pattern.

  • Model’s superior performance in small to moderate dataset sizes with constrained dimensionality is indicative of its utility when datasets are large and are of higher dimensionality (bottleneck for other multiple imputation models).
  • The model does not need a certain proportion of available data to predict missing value. Computational cost associated with MIDA is at par or better than imputations based on MICE for small to moderate-sized datasets. Computational gains are significant when modeling a complete dataset in a single attempt compared to the iterative variable by variable imputation in MICE.


  • Interpretability is low since it is a DL model.
  • Might have to one-hot-encode categorical variables!
  • How to deal with mixed variables?

MIDA, a deep learning approach using denoising autoencoders, shows better performance than the current state-of-the-art methods, so a general-purpose Python implementation of it would be a useful addition to the set of missing data imputation algorithms.

Future Work: Implement an open-source Python package for general purpose Missing Data Imputation, but including categorical and mixed variables.

GitRepo: Python Module for Missing Data Imputation