Source: Deep Learning on Medium

# MNIST digits classification with Deep Learning using Python and Numpy

## In this era of deep learning frameworks, many who start out with the subject often just watch a video or attend a class to learn the theory about Neural Networks and algorithms that are associated with it and then straightaway code their way into DL Frameworks such as Tensorflow, Pytorch, Keras and many more. But in order to understand the nuts and bolts of how they all work, it is very important to go down the inner workings and build our intuition from there. Here I attempt to do the same with the classical problem of machine learning, the MNIST dataset of handwritten digits where only with the theoretical knowledge of the functioning of Neural Networks, some algorithms, Python and Numpy we could put together quite a decent Deep Neural Network.

Basic knowledge of the concept of Neural Networks and Deep Learning is assumed to be known.

The basic architecture of our NN (Neural Network) will be that, it’d contain a total of 4 layers (barring the input layer) out of which, 3 of them are hidden layers of 50 neurons each and the output layer containing 10 neurons. We use One Hot Encoding for the labels in both the train and test set for classification of digits from 0 to 9 i.e 10 classes.

**Importing the libraries and the dataset**. Here we also import ‘to_categorical’ from Keras for One Hot Encoding.

`import numpy as np`

import matplotlib.pyplot as plt

from tensorflow.keras.datasets import mnist

from keras.utils import to_categorical

2.** Load the dataset** by splitting it into (X_train_orig, Y_train_orig), (X_test_orig, Y_test_orig). We have used ‘orig’ because later we need to process the data for which we’ll then later assign the simpler (X_train, Y_train), (X_test, Y_test).

`(X_train_orig, Y_train_orig), (X_test_orig, Y_test_orig) = mnist.load_data()`

3. **Processing the data for Y labels**. We first reshape the original Y_train and Y_test into (60000, 1) and (10000, 1) respectively, because by default the Tensorflow dataset has this weird shape of (60000,) and (10000,). Then we use ‘to_categorical’ function to One Hot Encode them to 10 classes by passing the argument ‘num_classes=10’. Then the data is transposed to give us Y_train and Y_test.

`Y_tr_resh = Y_train_orig.reshape(60000, 1)`

Y_te_resh = Y_test_orig.reshape(10000, 1)

Y_tr_T = to_categorical(Y_tr_resh, num_classes=10)

Y_te_T = to_categorical(Y_te_resh, num_classes=10)

Y_train = Y_tr_T.T

Y_test = Y_te_T.T

4. **Processing the data for X pixel intensity input values**. Originally the X train and test set have this format for their shape which is (60000, 28, 28) and (10000, 28, 28) wherein each image is 28 by 28 pixels in size. To apply the concept of Deep Learning here, we need to ROLL out that 28 by 28 pixels square into a horizontal line containing the same pixels, i.e in total 28*28 = 784 pixels. This process is known as FLATTENING the image. Later we divide both the train and test set by 255. to normalize the pixel intensity values between 0 and 1 for further calculations to give us X_train and X_test.

`X_train_flatten = X_train_orig.reshape(X_train_orig.shape[0], -1).T`

X_test_flatten = X_test_orig.reshape(X_test_orig.shape[0], -1).T

X_train = X_train_flatten / 255.

X_test = X_test_flatten / 255.

5. **Define the Activation Functions**. Here we’ll use Relu for the hidden layer activations and Softmax for the output layer activations.

def relu(p):

return np.maximum(0, p)def softmax(u):

return np.exp(u) / np.sum(np.exp(u), axis=0, keepdims=True)

6. **Initialise the Weights and Biases for each of the layers**. We use a ‘for loop’ to do that and simplify our work and feed in the array with the total number of units in each layer (including the input layer). We initialize the weights randomly and initialize the biases to zeros.

`parameters = {}`

def initialize_parameters(layer_dims):

L = len(layer_dims)

for l in range(1, L):

parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * (np.sqrt(2 / layer_dims[l - 1]))

parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))

return parameters

To make sure all your matrix dimensions match, print them all out (later comment them out).

`for l in range(1, 5):`

print("W" + str(l) + " = " + str(parameters["W" + str(l)]))

print("W" + str(l) + "shape" + " = " + str(parameters["W" + str(l)].shape))

print("b" + str(l) + " = " + str(parameters["b" + str(l)]))

print("b" + str(l) + "shape" + " = " + str(parameters["b" + str(l)].shape))

7.** Forward Propagation**

`outputs = {}`

activation = {}

def forward_prop(parameters, X_train, activation):

m = X_train.shape[1]

outputs["Z" + str(1)] = np.dot(parameters["W1"], X_train) + parameters["b1"]

activation["A" + str(1)] = relu(outputs["Z" + str(1)])

for l in range(2, 4):

outputs["Z" + str(l)] = np.dot(parameters["W" + str(l)], activation["A" + str(l - 1)]) + parameters["b" + str(l)]

activation["A" + str(l)] = relu(outputs["Z" + str(l)])

outputs["Z4"] = np.dot(parameters["W4"], activation["A3"]) + parameters["b4"]

activation["A4"] = softmax(outputs["Z4"])

return outputs, activation

8. **Compute the Cost** using the cross entropy loss to compute their average to get the Cost.

`def compute_cost(activation):`

loss = - np.sum((Y_train * np.log(activation["A4"])), axis=0, keepdims=True)

cost = np.sum(loss, axis=1) / m

return cost

9. **Define** **a function to calculate the derivative of the Relu activation function**

`def drelu(x):`

x[x <= 0] = 0

x[x > 0] = 1

return x

10. **Compute the gradients **of the Loss w.r.t the Weights and the Biases

`grad_reg = {}`

m = X_train.shape[1]

def grad_re(parameters, outputs, activation):

grad_reg["dZ4"] = (activation["A4"] - Y_train) / m

for l in range(1, 4):

grad_reg["dA" + str(4 - l)] = np.dot(parameters["W" + str(4 - l + 1)].T, grad_reg["dZ" + str(4 - l + 1)])

grad_reg["dZ" + str(4 - l)] = grad_reg["dA" + str(4 - l)] * drelu(outputs["Z" + str(4 - l)])

grad_reg["dW1"] = np.dot(grad_reg["dZ1"], X_train.T)

grad_reg["db1"] = np.sum(grad_reg["dZ1"], axis=1, keepdims=True)

for l in range(2, 5):

grad_reg["dW" + str(l)] = np.dot(grad_reg["dZ" + str(l)], activation["A" + str(l - 1)].T)

grad_reg["db" + str(l)] = np.sum(grad_reg["dZ" + str(l)], axis=1, keepdims=True)

return parameters, outputs, activation, grad_reg

11. **Update the parameters using gradient descent** algorithm in which you minimize the Cost function w.r.t the parameters.

`def learning(grad_reg, learning_rate=0.005):`

for i in range(1, 5):

parameters["W" + str(i)] = parameters["W" + str(i)] - (learning_rate * grad_reg["dW" + str(i)])

parameters["b" + str(i)] = parameters["b" + str(i)] - (learning_rate * grad_reg["db" + str(i)])

return parameters

12. **The Model.** Put together all these above functions into another function which is the model we’ll be executing for the final computations.

`num_iterations = 1000`

print_cost = True

costs = []

def grad_descent(num_iterations, costs, activation):

initialize_parameters([X_train.shape[0], 50, 50, 50, 10])

for l in range(0, num_iterations):

forward_prop(parameters, X_train, activation)

cost = compute_cost(activation)

grad_re(parameters, outputs, activation)

learning(grad_reg, learning_rate=0.005)

if l % 100 == 0:

costs.append(cost)

if print_cost and l % 100 == 0:

print("Cost after iteration %i: %f" % (l, cost))

return costs, parameters

13. **Execution.** Call the function ‘grad_descent’ to start the computations of jiggling the parameters.

`grad_descent(num_iterations, costs, activation)`

14. **Plotting.** Visualise the performance of your model using MatplotLib

`plt.plot(costs)`

plt.xlabel('iterations')

plt.ylabel('cost')

plt.show()

For my execution of the model with learning_rate = 0.005 and num_iterations = 1000, I obtained the following plot of the Cost vs Number of Iterations curve which proves that the model is healthy because the cost function is decreasing as the num_iterations increases.

I’m a student too and I have yet to learn to evaluate the model for accuracy, precision and recall for multi-label classification using *Confusion Matrix *after which I’ll be updating the code.

Also I’ll be publishing my work on Regularization and Optimization for better performance of NN on MEDIUM after I’m done with it which also will be done with Python and Numpy only.

Kudos on reaching the end! I’d provided the code in chunks so that it could be implemented in Jupyter Notebooks easily. I’ll also provide the final code I used here.

GitHub Link — https://github.com/pranavsastry/mnist

LinkedIn: Pranav Sastry GitHub: pranavsastry