MNIST- Exploration to Execution.

Source: Deep Learning on Medium

Hello all! This is my first story in this publication, and I want to make it as useful as possible.

In this story I take MNIST, the most famous dataset in the ML community, explore it as much as possible, and finally build good models and draw some conclusions.

Note: Reading this story takes far longer than Medium’s estimate, and probably longer than you expect, so if you are serious about learning, give it that time.

About MNIST.

→ It consists of 28*28 grayscale images of handwritten digits.

→ It’s probably one of the first datasets used to prove the effectiveness of the algorithms and ideas behind the neural networks we build.

→ It contains 60,000 training images and 10,000 testing images, used for the task of image classification.

→ It’s probably one of the cleanest datasets you will ever find on the internet for machine/deep learning models: the digits are size-normalized and centered, and the classes are roughly balanced.

→ The lowest reported error rate at the time of writing is about 0.21%, achieved with convolutional neural networks and data augmentation.

Hardware and software used.

Ubuntu (Linux), GPU (RTX 2080 Ti), CPU (AMD Ryzen), 32 GB RAM, CUDA 10, Python, PyTorch, NumPy, Matplotlib, scikit-learn, and Jupyter Notebook.


What this story covers.

  1. Understanding the stats/distribution of the dataset.
  2. Dimensionality reduction visualization.
  3. Best model finding/fine tuning.
  4. Optimizer comparisons on the dataset.
  5. Understanding the distribution of the trained weights.
  6. Trained model gradient visualization.
  7. Visualizing the trained hidden layers.
  8. GAN training.
  9. Transfer learning on MNIST.

Style of Explanation.

  1. All the Jupyter notebook code will be available on my GitHub; here I attach images of the code snippets (I love looking at the code and its output).
  2. I skip attaching/explaining unnecessary code, which is available on GitHub anyway.
  3. I assume that readers know the ML/DL vocabulary, some of the concepts, and the math.
  4. Understanding the intuition is more important than understanding the code/logic (why I do something matters more than how I do it).

Let’s roll.

Let’s first load and look at the data (torch is imported as t and torchvision as tv).

Here I plot 25 sample images from a batch of 32 (train_loader iterates over batches of images).

The pixel intensities have been scaled to the range 0 to 1.

Understanding the stats/distribution of the dataset.

I stacked all the images into one big NumPy array and computed the class-wise mean images to get some sense of the data.
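
The class-wise mean can be sketched roughly like this. It is only a minimal sketch on synthetic data: the arrays `images` and `labels` are stand-ins I made up for the real MNIST arrays, and the original notebook may compute the means differently.

```python
import numpy as np

# Synthetic stand-in for MNIST: 1000 "images" of 28x28 with values in [0, 1].
# (Assumption: the real arrays have the same layout, just more rows.)
rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))
labels = np.arange(1000) % 10  # 100 samples for each of the 10 classes

# One mean image per class: average over all images that share a label.
mean_images = np.stack([images[labels == c].mean(axis=0) for c in range(10)])

print(mean_images.shape)
```

Each entry of `mean_images` can then be shown with Matplotlib's `imshow` to get a picture of the "average" digit of that class.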

Here are the results.

The mean images look really good, which suggests that the data does not have a lot of noise and that the digits are consistently placed, so the data is crystal clear for DL models.

Let’s understand the pixel distribution of the data.

I plot a histogram for each class, counting the pixel values over all of that class’s images.

X: pixel value (0 to 1), Y: number of pixels.

As you can see, for every class most pixel values are zero (black), some are one (white), and only a few lie between 0 and 1.
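
A per-class pixel histogram along these lines could be computed as below. This is a sketch on synthetic binary images (about 80% black pixels), not the real data, and the names `images`, `labels`, and `hist_per_class` are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic images: roughly 80% black (0.0) pixels and 20% white (1.0) pixels.
images = (rng.random((500, 28, 28)) > 0.8).astype(np.float32)
labels = np.arange(500) % 10  # 50 samples per class

bins = np.linspace(0.0, 1.0, 11)  # 10 equal-width bins covering [0, 1]
hist_per_class = {}
for c in range(10):
    pixels = images[labels == c].ravel()         # all pixels of this class
    counts, _ = np.histogram(pixels, bins=bins)  # pixel count per bin
    hist_per_class[c] = counts
```

The counts for a class can be drawn with something like `plt.bar(bins[:-1], hist_per_class[c], width=0.1, align="edge")`.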

If we plot the mean pixel value of each class, it looks like this. Class 1 has the fewest white pixels (the most dark pixels), of course, since a 1 is just a thin stroke.

X: class (0–9), Y: mean pixel value of that class.

Understanding these stats and distribution is important when you want to do feature engineering/scaling for a dataset.

For example, every class here has a different mean, so you can treat the mean as another feature, or you can zero-center the entire dataset and feed it to the models without the mean feature. #uptoyou #depends
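
Both options can be sketched in a few lines of NumPy. This assumes the images are flattened into a matrix `X` of shape (samples, 784); the random data and the variable names are placeholders of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 784))  # synthetic stand-in for flattened images

# Option 1: append each image's mean intensity as an extra feature column.
mean_feature = X.mean(axis=1, keepdims=True)             # shape (200, 1)
X_with_mean = np.concatenate([X, mean_feature], axis=1)  # shape (200, 785)

# Option 2: zero-center the data by subtracting the per-pixel mean.
pixel_mean = X.mean(axis=0)   # shape (784,)
X_centered = X - pixel_mean   # every pixel column now averages to ~0
```

If you go with option 2, the same `pixel_mean` computed on the training set should also be subtracted from the test set.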

Dimensionality Reduction Visualization.

The whole point of dimensionality reduction techniques is to convert high-dimensional data to low-dimensional data effectively (without losing too much of the information in the data).

They are good for data visualization, feature selection, and feature engineering.

Here the input X has 28*28 pixels (a 784-dimensional vector), so let’s apply PCA.

PCA is a linear dimensionality reduction technique that projects the data onto the directions of greatest variance (here the top 2 or 3 components instead of all 784 dimensions).
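
To make the idea concrete, here is a small PCA sketch built on NumPy's SVD. It is not the code from the notebook (which most likely uses a library implementation); `pca_project` is a helper name I made up, and the input is random data standing in for the flattened images.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its top principal components (hypothetical helper)."""
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, sorted by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return X_centered @ components.T, ratio

rng = np.random.default_rng(0)
X = rng.random((300, 784))  # synthetic stand-in for flattened MNIST
projected, ratio = pca_project(X, n_components=2)
print(projected.shape)
```

`ratio` tells you how much of the total variance each kept component explains, which is a quick check of how much information the 2-d picture throws away.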

As you can see here, the classes fall into 10 loosely separated groups.

Although the visualization looks good even with just 2 dimensions, it is not enough to separate the classes. Luckily we have another technique called t-SNE, which is a non-linear, probabilistic dimensionality reduction technique, unlike PCA, which is a deterministic linear projection.

t-SNE requires a lot of computation, so it takes a lot of time (minutes to hours) compared to PCA (seconds to minutes). Here I used Multicore TSNE, which took around 10–15 minutes, instead of scikit-learn’s t-SNE, which seemed to take far longer.

Note: RAPIDS cuML, a more recent GPU implementation of t-SNE, takes only a few seconds for this job.

Here we can clearly see that the classes have been split well.

The variable embeddings holds the 2-d vector for each 28*28 MNIST image.
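
For reference, producing such an `embeddings` array can be sketched with scikit-learn's `TSNE`. Note the swap: the story uses Multicore TSNE on the full dataset, while this sketch uses scikit-learn on a tiny random sample so it finishes quickly; the perplexity value is just an assumption.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((100, 784)).astype(np.float32)  # synthetic stand-in data

# perplexity must be smaller than the number of samples; 20 is a guess here.
tsne = TSNE(n_components=2, perplexity=20.0, init="pca", random_state=0)
embeddings = tsne.fit_transform(X)  # one 2-d vector per input row

print(embeddings.shape)
```

On the real data you would pass the flattened images as `X` and then scatter-plot `embeddings[:, 0]` against `embeddings[:, 1]`, colored by label.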

Here I plot the distributions of the original data and of the t-SNE embeddings.

784-d vs 2-d distribution.

So you can compare the distribution of the original data with that of the t-SNE embeddings.

Distribution of the embeddings along axes 0 and 1.

Let’s plot the embeddings of each class.

If you look closely, classes 3 and 8 have a tough time, while the others separate pretty well, especially 0, 1, and 6.

Best Model finding/fine tuning.

This is probably the most interesting and most important step for DL practitioners.

Although there is no particular recipe, there are some things that work well (this field is progressing very quickly, so new tricks keep coming to wipe out the old ones).

Rule 1: Everything depends on the data that you have.

Rule 2: Sometimes it depends on cool tricks and algorithms.

The way machine learning works is as follows.

The data gets multiplied by and added to learnable parameters of arbitrary dimension (weights and biases), mapping it into a solution space where a good X-to-Y mapping is achieved.

The loss, the optimization, the processing: everything depends on the numbers (data) that we have, so having a good dataset is super important.

Since we have very clean data, let’s create a simple model for the classification task (as you might know, CNNs work well for image tasks, so I use only those).

A 2-layer model (Conv+FC).

I just took a 2-layer network (Conv+FC) with a learning rate of 0.01 and momentum of 0.5 (this learning rate generally works well).
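
A network of this kind might look like the sketch below. Only the Conv+FC structure, the learning rate, and the momentum come from the story; the class name `FirstModel`, the channel count, the kernel size, and the pooling step are my assumptions.

```python
import torch as t
import torch.nn as nn
import torch.nn.functional as F

class FirstModel(nn.Module):  # hypothetical name, not from the notebook
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)  # 28x28 -> 24x24
        self.fc1 = nn.Linear(10 * 12 * 12, 10)        # one logit per digit

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)    # 24x24 -> 12x12
        x = x.flatten(1)                              # keep the batch dim
        return self.fc1(x)

model = FirstModel()
optimizer = t.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

batch = t.rand(32, 1, 28, 28)  # a fake batch of 32 grayscale images
logits = model(batch)
print(logits.shape)
```

The logits would go into a cross-entropy loss during training; the fake batch is only there to confirm that the layer shapes line up.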

Let’s train it.

As you can see, after 10 epochs the training accuracy reaches 99%, because the classes in this data are pretty easy for the model to separate.

Attention: It is not about fitting the data; it’s all about generalization. More data requires bigger networks, and bigger networks require more data.

Let’s add one more conv layer and train it.

As you can see, the accuracies improved a bit after adding another conv layer.

Optimizer comparisons on the dataset.

Above I used SGD as the optimizer. Let’s try other optimizers with the same network and the same training procedure.

I took the same model, gave each copy a different optimizer, and trained them all with the same procedure.
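
The comparison loop can be sketched like this. The set of optimizers, their learning rates, the tiny CNN, and the random data are all assumptions for illustration; the story's actual runs use the real network and the MNIST loaders.

```python
import torch as t
import torch.nn as nn
import torch.nn.functional as F

def make_model():
    t.manual_seed(0)  # identical initial weights for every optimizer
    return nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(8 * 12 * 12, 10),
    )

# Assumed candidates; the story does not list which optimizers were tried.
optimizer_factories = {
    "SGD": lambda params: t.optim.SGD(params, lr=0.01, momentum=0.5),
    "Adam": lambda params: t.optim.Adam(params, lr=0.001),
    "RMSprop": lambda params: t.optim.RMSprop(params, lr=0.001),
}

t.manual_seed(1)
x = t.rand(256, 1, 28, 28)      # fake images
y = t.randint(0, 10, (256,))    # fake labels

final_loss = {}
for name, make_optimizer in optimizer_factories.items():
    model = make_model()
    optimizer = make_optimizer(model.parameters())
    for _ in range(20):  # same number of steps for every optimizer
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    with t.no_grad():    # loss after the last update
        final_loss[name] = F.cross_entropy(model(x), y).item()
```

Resetting the seed inside `make_model` keeps the comparison fair, because every optimizer starts from the same weights and sees the same data.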

Final epoch results.

RMSprop is the clear winner in this race, so we can keep the same network as before and use RMSprop as its optimizer.

Since this dataset is super easy even for small networks, let’s stop searching for better models here and focus on the trained models.

Understanding the weights & gradients

Let’s understand how weights and gradients are changing during the training of the current best model.

I took the same network (SecondModel()), which has 2 conv and 2 FC layers, used RMSprop as the optimizer since it performed well, and ran it for 5 epochs.

During training, I take the weights, plot their histograms, and save the plots as images.
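
Collecting the per-layer weight histograms can be sketched as follows. The helper `weight_histograms` and the small stand-in network are hypothetical; in the story the same idea is applied to SecondModel() at intervals during training.

```python
import numpy as np
import torch.nn as nn

# Small stand-in network (assumption: any nn.Module works the same way).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(8 * 12 * 12, 10),
)

def weight_histograms(model, bins=50):
    """Return {parameter name: (counts, bin_edges)} for all weight tensors."""
    hists = {}
    for name, param in model.named_parameters():
        if name.endswith("weight"):  # skip the biases
            values = param.detach().cpu().numpy().ravel()
            hists[name] = np.histogram(values, bins=bins)
    return hists

hists = weight_histograms(model)
print(sorted(hists))
```

Calling this every few steps and saving each result as a Matplotlib figure gives the frames for an animation of the weight distributions.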

The GIF below shows how the weight distribution of each layer changes during training.

The initially uniform distribution of the weights slowly transforms into a roughly normal distribution during training. Also observe that Fc1 has a lot of neurons, so most of its weights are very close to zero.

Let’s also save the gradients and plot them.

I save the gradients by calling the save_gradients function right after loss.backward() has run, and I plot them after training. Here is how the gradients flow from the last layer to the first layer.
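
The body of save_gradients is not shown in this story, so the version below is only my guess at what such a function could look like: it records the mean absolute gradient of every parameter after a backward pass. The network, the data, and the `gradient_log` dictionary are placeholders.

```python
import torch as t
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(8 * 12 * 12, 10),
)
gradient_log = {name: [] for name, _ in model.named_parameters()}

def save_gradients(model, log):
    """Hypothetical version: store the mean |gradient| of each parameter."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            log[name].append(param.grad.abs().mean().item())

x = t.rand(16, 1, 28, 28)     # fake batch
y = t.randint(0, 10, (16,))   # fake labels
loss = F.cross_entropy(model(x), y)
loss.backward()               # fills param.grad for every parameter
save_gradients(model, gradient_log)
```

Plotting the logged values per layer, ordered from the last layer to the first, shows whether the gradient magnitude shrinks as it flows backward.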