Exploring patterns enriched in a dataset with contrastive principal component analysis

Original article can be found here (source): Deep Learning on Medium

“Torture the data, and it will confess to anything.”

Hello guys,

I hope you are doing absolutely fine.

This blog will cover concepts of contrastive PCA (cPCA) with a tutorial.

Before diving into the concepts of cPCA, we should be familiar with the concepts of PCA.

What is PCA?

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Enough with the definition. Let’s do some real work!

Let’s take a sample use case

In this use case we use the MNIST digits dataset.

The MNIST digits dataset is a fun dataset to try this algorithm on because it has 28×28 = 784 variables (one per pixel) in an easy-to-visualize form (a picture). What happens when we keep just the first two principal components and plot the result?

Here we take only the first two labels, 0 and 1.

preparing dataset
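Here is a minimal sketch of this preparation step, assuming the digits are already loaded as a flattened array `images` with integer `labels`. The small random arrays are stand-ins (not real MNIST), and the helper name `keep_digits` is made up for this post.

```python
import numpy as np

def keep_digits(images, labels, digits=(0, 1)):
    """Return only the rows whose label is one of `digits`."""
    mask = np.isin(labels, digits)
    return images[mask], labels[mask]

# Stand-in data (NOT real MNIST): 10 fake "images" of 28*28 = 784 pixels each.
rng = np.random.default_rng(0)
images = rng.random((10, 784))
labels = np.array([0, 1, 2, 3, 0, 1, 4, 5, 0, 9])

X, y = keep_digits(images, labels)
print(X.shape, sorted(set(y.tolist())))
```

With real MNIST you would pass the full image and label arrays instead of the stand-ins.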

Now let’s run the PCA algorithm on the data we just prepared.
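As a rough sketch of what PCA does under the hood, here is a NumPy-only version that centres the data, builds the covariance matrix and projects onto the two directions of largest variance. The function name `pca_project` and the two-cluster toy data are assumptions for illustration, not the MNIST run itself.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    cov = Xc.T @ Xc / (len(Xc) - 1)            # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues come back ascending
    top = eigvecs[:, ::-1][:, :n_components]   # directions of largest variance
    return Xc @ top

# Stand-in data (NOT MNIST): two clusters in 50 dimensions.
rng = np.random.default_rng(0)
zeros = rng.normal(0.0, 1.0, size=(100, 50))
ones = rng.normal(0.0, 1.0, size=(100, 50))
ones[:, 0] += 10.0                             # clusters differ along feature 0
X = np.vstack([zeros, ones])

Z = pca_project(X)
print(Z.shape)
```

Plotting `Z[:, 0]` against `Z[:, 1]` and colouring by class gives the kind of 2-D picture described here.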

This is the 2-D visualization of PCA output

Simple, right?

Now let’s wrap this up and move on to our main topic, which is contrastive principal component analysis (cPCA).

So what is cPCA?

Contrastive PCA is a tool for unsupervised learning that efficiently reduces dimensionality to enable visualization and exploratory data analysis. This separates cPCA from the large class of supervised learning methods, such as linear discriminant analysis (LDA), whose primary goal is to classify or discriminate between various datasets. It also distinguishes cPCA from methods that integrate multiple datasets with the goal of identifying correlated patterns among two or more datasets, rather than patterns unique to each individual dataset.

Why cPCA?

PCA is designed to explore one dataset at a time. When multiple datasets, or multiple conditions in one dataset, are to be compared, the current state of practice is to perform PCA on each dataset separately and then manually compare the various projections.

Contrastive PCA (cPCA) is designed to fill in this gap in data exploration and visualization by automatically identifying the projections that exhibit the most interesting differences across datasets. The main advantages of cPCA are its generality and ease of use.
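To make the idea concrete: following the cPCA paper (Abid et al., 2018), the contrastive directions are the top eigenvectors of cov(target) − α·cov(background). Below is a minimal NumPy sketch of that formula. The function name `cpca_directions` and the toy data are assumptions for illustration, and this is not the `contrastive` library’s own implementation.

```python
import numpy as np

def cpca_directions(target, background, alpha, n_components=2):
    """Top eigenvectors of cov(target) - alpha * cov(background)."""
    c_target = np.cov(target, rowvar=False)
    c_background = np.cov(background, rowvar=False)
    contrast = c_target - alpha * c_background
    eigvals, eigvecs = np.linalg.eigh(contrast)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]   # largest first
    return eigvecs[:, order]

# Toy data: feature 0 is loud noise shared by both datasets,
# feature 1 carries a weaker two-cluster signal that only the target has.
rng = np.random.default_rng(0)
n = 400
background = rng.normal(size=(n, 5))
background[:, 0] *= 10.0
target = rng.normal(size=(n, 5))
target[:, 0] *= 10.0
target[:, 1] += np.where(np.arange(n) < n // 2, -3.0, 3.0)

v_pca = cpca_directions(target, background, alpha=0.0, n_components=1)[:, 0]
v_cpca = cpca_directions(target, background, alpha=2.0, n_components=1)[:, 0]

# Index of the feature each leading direction leans on most.
print(np.argmax(np.abs(v_pca)), np.argmax(np.abs(v_cpca)))
```

In this toy setup plain PCA (alpha = 0) locks onto the shared noisy feature, while the contrastive direction picks the feature that is special to the target.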

Enough with the theoretical part!

Let’s do some experiments with cPCA.

Before moving forward, you will have to install the cPCA library.

You can install it with the following pip command.

pip install contrastive

Now you are good to go!!
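Once it is installed, a call along these lines should work. This is a sketch based on the usage shown in the library’s README (the `CPCA` class and its `fit_transform(foreground, background)` method). The wrapper name `run_cpca` is made up for this post, and the keyword arguments for picking alpha by hand are an assumption worth checking against your installed version.

```python
def run_cpca(target, background, alpha=None):
    """Project `target` with cPCA, using `background` as the contrast set.

    `run_cpca` is a hypothetical helper for this post, not part of the library.
    """
    # Imported lazily so this file still loads if the package is missing.
    from contrastive import CPCA

    mdl = CPCA()  # default settings
    if alpha is None:
        # README usage: projections for a few automatically chosen alphas.
        return mdl.fit_transform(target, background)
    # Assumed keyword names for a hand-picked alpha; verify for your version.
    return mdl.fit_transform(target, background,
                             alpha_selection='manual', alpha_value=alpha)
```

Here `target` and `background` are expected to be 2-D arrays with one row per sample and the same number of columns.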

So, we are going to do two different experiments.

  1. cPCA on image dataset.
  2. cPCA on Mice Protein dataset.

Let’s get on with our first experiment.

1. cPCA on image dataset.

Datasets used :

  • MNIST digits image dataset.
  • Grass images.

These grass pictures are found in this OneDrive link, or they can be downloaded from ImageNet using the synset ‘grass’. (Note: replace IMAGE_PATH with the path to the downloaded images.)

As a related example, consider a dataset that consists of handwritten digits on a complex background, such as different images of grass. The goal of a typical unsupervised learning task may be to cluster the data, revealing the different digits in the image. However, if we apply standard PCA on these images, we find that the top principal components do not represent features related to the handwritten digits, but reflect the dominant variation in features related to the image background.

Source: https://www.nature.com/articles/s41467-018-04608-8.pdf/

Now, let’s see how we managed to get this result using cPCA!

Load MNIST

Load Natural Images of Grass

Corrupt MNIST by Superimposing Images of Grass

We don’t have digit images that are already corrupted with grass, so we build them ourselves by superimposing grass images on the original digit images.

To create each of the 5000 corrupted digits, we randomly choose a 28px by 28px region from a grass image and superimpose it on top of the digit.
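The corruption step can be sketched roughly like this. The helper name `superimpose_grass`, the simple additive overlay and the random stand-in arrays are all assumptions for illustration; the real experiment uses MNIST digits and grass photos.

```python
import numpy as np

def superimpose_grass(digits, grass_images, rng, patch=28):
    """Add a random patch x patch crop of a random grass image to each digit."""
    corrupted = np.empty_like(digits, dtype=float)
    for i, digit in enumerate(digits):
        grass = grass_images[rng.integers(len(grass_images))]
        top = rng.integers(grass.shape[0] - patch + 1)
        left = rng.integers(grass.shape[1] - patch + 1)
        crop = grass[top:top + patch, left:left + patch]
        corrupted[i] = digit + crop            # additive overlay (an assumption)
    return corrupted.reshape(len(digits), -1)  # flatten to (n, 784)

# Stand-in arrays (NOT real MNIST or ImageNet data).
rng = np.random.default_rng(0)
digits = rng.random((5, 28, 28))          # pretend digit images
grass_images = rng.random((3, 100, 100))  # pretend 100x100 greyscale grass

target = superimpose_grass(digits, grass_images, rng)
print(target.shape)
```

The flattened `target` array is the kind of input that both PCA and cPCA expect: one row per corrupted image.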

Some Examples of corrupted Images

Digit images corrupted by grass images

PCA on Corrupted MNIST

Result of PCA on corrupted digit images.

As we can see, PCA is not able to differentiate between digit 0 and digit 1 because of the noise in the images, namely the grass superimposed on the digits.

Now let’s see what happens when we use cPCA on these same images!!

cPCA on Corrupted MNIST

Result of cPCA on corrupted digit images.

WOW!

Look how cPCA managed to differentiate between the 0s and the 1s.

Now you guys must be wondering what target, background and alpha are.

Let’s see,

Target is the dataset with the corrupted digit images.

Background is the dataset with grass images only. It serves as a reference that tells the model what grass looks like, so the model can treat grass as noise in the target dataset and more easily pick out the digits that have grass superimposed on them.

Alpha is a hyperparameter in cPCA. Each value of α yields a direction with a different trade-off between high target variance and low background variance. With α = 0, cPCA reduces to ordinary PCA on the target data; as α grows, directions with high background variance are penalized more heavily. By trying different alpha values, the user can find the projection that best reveals the clusters of interest.
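To get a feel for alpha, here is a small sweep on toy data. For each alpha it takes the leading eigenvector of cov(target) − α·cov(background) and reports how much target and background variance that direction captures. The function name `top_direction`, the alpha values and the toy data are assumptions for illustration.

```python
import numpy as np

def top_direction(target, background, alpha):
    """Leading eigenvector of cov(target) - alpha * cov(background)."""
    contrast = (np.cov(target, rowvar=False)
                - alpha * np.cov(background, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(contrast)
    return eigvecs[:, -1]   # eigh sorts eigenvalues in ascending order

rng = np.random.default_rng(1)
scales = np.array([5.0, 1.0, 1.0])                  # feature 0 is noisy in both
background = rng.normal(size=(500, 3)) * scales
target = rng.normal(size=(500, 3)) * scales
target[:, 1] += rng.choice([-2.0, 2.0], size=500)   # hidden two-group signal

target_cov = np.cov(target, rowvar=False)
background_cov = np.cov(background, rowvar=False)
results = {}
for alpha in [0.0, 0.5, 1.0, 2.0, 10.0]:
    v = top_direction(target, background, alpha)
    target_var = v @ target_cov @ v
    background_var = v @ background_cov @ v
    results[alpha] = (target_var, background_var)
    print(f"alpha={alpha:5.1f}  target var={target_var:6.2f}  "
          f"background var={background_var:6.2f}")
```

In this toy example, small alphas still pick the noisy shared feature, while larger alphas switch to the direction that carries little background variance.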

Features Captured by PCA vs. cPCA

We can clearly see that cPCA works better than PCA here: the bold, defining pixels of the digits stand out in the features captured by cPCA.

Denoising with PCA vs. cPCA

cPCA also gives better results when denoising and helps bring out the desired target, which is the digits!!
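One common way to denoise with a set of components is to project the data onto them and map it back. Here is a minimal sketch of that idea comparing PCA (alpha = 0) with a contrastive direction (alpha = 2). The function names and toy data are assumptions for illustration, not the code behind the figures in this post.

```python
import numpy as np

def top_directions(target, background, alpha, k):
    """Top-k eigenvectors of cov(target) - alpha * cov(background)."""
    contrast = (np.cov(target, rowvar=False)
                - alpha * np.cov(background, rowvar=False))
    _, eigvecs = np.linalg.eigh(contrast)      # ascending eigenvalues
    return eigvecs[:, ::-1][:, :k]

def reconstruct(X, directions):
    """Project X onto orthonormal `directions` (columns) and map back."""
    mean = X.mean(axis=0)
    return (X - mean) @ directions @ directions.T + mean

# Toy data: a clean two-group signal in feature 1, loud noise in feature 0.
rng = np.random.default_rng(0)
n = 300
noise_scale = np.array([10.0, 0.1, 0.1, 0.1, 0.1])
clean = np.zeros((n, 5))
clean[:, 1] = rng.choice([-3.0, 3.0], size=n)
target = clean + rng.normal(size=(n, 5)) * noise_scale
background = rng.normal(size=(n, 5)) * noise_scale   # noise only

denoised_pca = reconstruct(target, top_directions(target, background, 0.0, 1))
denoised_cpca = reconstruct(target, top_directions(target, background, 2.0, 1))

err_pca = np.mean((denoised_pca - clean) ** 2)
err_cpca = np.mean((denoised_cpca - clean) ** 2)
print(err_pca, err_cpca)
```

On this toy data the PCA reconstruction keeps the noise and loses the signal, so its error against the clean data is much larger than the cPCA one.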

So, this was all about cPCA on images.

2. cPCA on Mice Protein dataset.

Dataset used :

You can find the required dataset here (https://github.com/abidlabs/contrastive/tree/master/experiments/datasets)

Load the Dataset

Separate target and background datasets:

  • Target consists of mice that have been stimulated by shock therapy. Some have Down Syndrome, others don’t, but we assume this label is not known to us a priori
  • Background consists of mice that have not been stimulated by shock therapy, and do not have Down Syndrome
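The split described above can be sketched with boolean masks. The arrays below are tiny hypothetical stand-ins (not the real mice protein data), and the variable names `shocked` and `down_syndrome` are made up for this post.

```python
import numpy as np

# Hypothetical stand-in table (NOT the real mice data): one row per mouse.
rng = np.random.default_rng(0)
n = 12
proteins = rng.random((n, 4))                         # pretend expression levels
shocked = np.array([True, False] * 6)                 # stimulated by shock therapy?
down_syndrome = np.array([True, True, False, False] * 3)

# Target: all shocked mice, with or without Down Syndrome.
target = proteins[shocked]
# Kept only to colour the final plot; never passed to cPCA.
target_labels = down_syndrome[shocked]
# Background: mice that were not shocked and do not have Down Syndrome.
background = proteins[~shocked & ~down_syndrome]

print(target.shape, background.shape)
```

The Down Syndrome labels are held back on purpose, so any grouping that shows up in the projection was found without them.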

Run Contrastive PCA

You can see that cPCA separates the target data into clearer groups than PCA (alpha = 0).

For Specific Value of Alpha and GUI

Let’s take alpha = 2.

You can also use the GUI to sweep through different values of alpha and see an example of how such an animation can be used to reveal clusters within the data.

For more experiments with cPCA, you can visit this GitHub repository:

https://github.com/abidlabs/contrastive

So, that’s it for cPCA.

Hope you guys learnt something from this blog.

If you like this post, please follow me and press that Clap button as many times as you think I deserve. If you have noticed any mistakes in the reasoning, formulas, animations or code, please let me know.

Adios!!