Things learnt and tried during my kaggle digit recognizer submission-Part-1(PCA explained)

Source: Deep Learning on Medium

Note: This blog is a note to self and for those who are interested. The blog may not be interesting to those who are already familiar with all the concepts mentioned.

So when i started with the well known digit recognizer problem in kaggle(recognition of handwritten digits) I didn’t start right away with a neural network approach but rather from the basic feature engineering.I will summarize the methods i used and also mistakes made and the submission that increased my rank by 1149 places.

One of the prime aspects of machine learning is to focus primarily on feature engineering. By feature engineering, I mean methods for generating, extracting or altering features in the data set. Of course some competition winning solutions use a black box like ensembles of gradient boosting but i wanted to use this as an opportunity to learn. And feature engineering is very important and will also help to explain your model better. Even neural networks can be explained by gradient checking to assert or verify whether you are right about your model(more on this in another post).

Ok so without further delay to the main subject. I started with PCA(principal component analysis) for dimensionality reduction.There are other methods like TSNE but i will be explaining how i used PCA and also i will try my best to give an intuitive explanation of it.


SO what is PCA?

PCA is a linear transformation algorithm that seeks to project the original features of our data onto a smaller set of features ( or different co-ordinate axis) while still retaining most of the information. To do this the algorithm tries to find the most appropriate directions/angles ( which are the principal components ) that maximize the variance in the new coordinate-axis.

Ok, so above definition doesn’t seem so intuitive. Let me explain with a real world example which i would say i formulated after reading a stackexchange answer.

Let’s say you go to a showroom to buy a pair of shoes or through online shopping. You would first look for desirable features right? Depending on whether you would use them for running or walking you would look for certain characteristics like sole strength, brand, waterproof easy wear and so on.

We can compose a whole list of different characteristics of each shoe in the display. But many of them will measure related properties and so will be redundant. If so, we should be able to summarize each shoe with fewer features. This is what PCA does.But it doesn’t mean PCA finds repetitive characteristics and discards them but instead it constructs new characteristics that summarizes our shoe and these features are computed using old features.

For example a new feature maybe computed as (sole_strength-shoe_weight) (linear combination of features).We can find the best features using PCA, that summarize the list of shoes so well.

Why is PCA so useful? Well, when we are purchasing a shoe we are looking for some properties (characteristics) that strongly differ across shoes and brands. If we come up with a property that is the same for most of the shoes. This would not be very useful, would it? Shoes are very different, but the new property makes them all look the same! This would certainly be a bad summary. Instead, PCA looks for properties that show as much variation across shoes as possible.

We look out for the properties that would allow you to reconstruct, the original shoe characteristics. Again, imagine that we come up with a property that has no relation to the original characteristics; if you use only this new property, there is no way to reconstruct the original ones. This, again, would be a bad summary. So PCA looks for properties that allow to reconstruct the original characteristics as well as possible.

The above two goals are equivalent.So the gist is that a simple object can be often described by a lot of variables. So to describe it with simple and lesser number of variables is by choosing a proper co-ordinate system.

That’s exactly what PCA analysis does. The term “principal component” denotes new variables (or coordinate systems) we choose to describe our data in a more concise or convenient way. All principal components must satisfy two conditions:

  1. They must be perpendicular (or mathematically, Orthogonal) to each other.

This means principal components are NOT linear correlated between each other.

  1. And that’s why PCA can reduce the number of features with out losing information(remember we don’t discard old features but rather new features are a linear combination of old ones).Variables in raw data are not independent, and correlated variables cause redundancy.

So, now to the code :

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import plotly.offline as py
import plotly.graph_objs as go
import as tls
import seaborn as sns
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

# Import the 3 dimensionality reduction methods
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
train = pd.read_csv('../input/train.csv')

The MNIST set consists of 42,000 rows and 785 columns. There are 28 x 28 pixel images of digits ( contributing to 784 columns) as well as one extra label column which is essentially a class label to state whether the row-wise contribution to each digit gives a 1 or a 9. Each row component contains a value between one and zero and this describes the intensity of each pixel.

Before we start off, let’s conduct some cleaning of the train data by saving the label feature and then removing it from the dataframe

# save the labels to a Pandas series target
target = train['label']

train = train.drop("label",axis=1)

SO, PCA mainly depends on variance and it may be useful to observe how the variances look like for the digits in the dataset. Therefore to achieve this, let us calculate the eigenvectors and eigenvalues of the covarience matrix as follows:

# Standardize the data
from sklearn.preprocessing import StandardScaler
X = train.values
X_std = StandardScaler().fit_transform(X)
# Calculating Eigenvectors and eigenvalues of Cov matirx
mean_vec = np.mean(X_std, axis=0)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [ (np.abs(eig_vals[i]),eig_vecs[:,i]) for i in range(len(eig_vals))]
# Sort the eigenvalue, eigenvector pair from high to low
eig_pairs.sort(key = lambda x: x[0], reverse= True)
# Calculation of Explained Variance from the eigenvalues
tot = sum(eig_vals)
var_exp = [(i/tot)*100 for i in sorted(eig_vals, reverse=True)] # Individual explained variance
cum_var_exp = np.cumsum(var_exp) # Cumulative explained variance

Plotly visualisation package can be used to produce an interactive chart to showcase individual explained variance and cumulative explained variance:

trace1 = go.Scatter(
y= cum_var_exp,
name="'Cumulative Explained Variance'",
# hoverinfo= cum_var_exp,
color = 'goldenrod'
trace2 = go.Scatter(
y= var_exp,
name="'Individual Explained Variance'",
# hoverinfo= var_exp,
color = 'black'
fig = tls.make_subplots(insets=[{'cell': (1,1), 'l': 0.7, 'b': 0.5}],
fig.append_trace(trace1, 1, 1)
fig.layout.title = 'Explained Variance plots - Full and Zoomed-in'
fig.layout.xaxis = dict(range=[0, 80], title = 'Feature columns')
fig.layout.yaxis = dict(range=[0, 60], title = 'Explained Variance')
# fig['data'] = []
# fig['data'] += (go.Scatter(x= list(range(784)) , y=cum_var_exp, xaxis='x2', yaxis='y2', name = 'Cumulative Explained Variance'))
# fig['data'] += [go.Scatter(x=list(range(784)), y=var_exp, xaxis='x2', yaxis='y2',name = 'Individual Explained Variance')]
# fig['layout'] = layout
# fig['layout'] += layout2
py.iplot(fig, filename='inset example')

It can be seen from above plot , out of our 784 features or columns approximately 90% of the Explained Variance can be described by using just over 200 over features. So if one wanted to implement a PCA on this, extracting the top 200 features would be a very good choice as they already account for the majority of the data. In the section below, I will use Sklearn toolkit and its built-in PCA method

target = train['label']
train = train.drop('label', axis=1)
from sklearn.decomposition import PCA
n_components = 60
pca = PCA(n_components=n_components).fit(train.values)
eigenvalues = pca.components_.reshape(n_components, 28, 28)
# Extracting the PCA components ( eignevalues )
eigenvalues = pca.components_

Next let us visualize the eigenvalues

n_row = 5
n_col = 12
# Plot the first 28 eignenvalues
for i in list(range(n_row * n_col)):
offset =0
plt.subplot(n_row, n_col, i + 1)
plt.imshow(eigenvalues[i].reshape(28,28), cmap='jet')
title_text = 'Eigenvalue ' + str(i + 1)
plt.title(title_text, size=6.5)

Next let us visualize some values

for digit_num in range(0,70):
grid_data = train.iloc[digit_num].as_matrix().reshape(28,28) # reshape from 1d to 2d pixel array
plt.imshow(grid_data, interpolation = "none", cmap = "afmhot")

Next i choose features with variance above 0.83

pca = PCA(n_components= 0.8388)
X_5d = pca.transform(X_std)
Target = target

So now with my features selected I used knn classifer to fit and predict the digits.

from sklearn.model_selection import train_test_split
del train
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
X = (train.values[:,1:])
Y = (train.values[:,0])
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.666666, random_state = 0)


from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(include_bias = False)
xtrain = pca.fit_transform(X_train)
xtest = pca.transform(X_test)
xtrainpoly = poly.fit_transform(xtrain)
xtestpoly = poly.transform(xtest)
clf = KNeighborsClassifier(n_neighbors=16)
clf =, y_train)
output_label = clf.predict(xtestpoly)
output = pd.DataFrame(output_label,columns = ['Label'])
output['index'] = output['index'] + 1
output.rename(columns={'index': 'ImageId'}, inplace=True)
output.to_csv('output.csv', index=False)

So this gave me a score of 0.3 something on leaderboard.(Note: I additionally used polynomialfeatures to improve my accuracy. Please look into that and ping me if u need the concept behind it.

Ok, so that’s it in part-1. In part 2 ill tell how i climbed to rank 1159 something.