Original article was published by Muhammad Ardi on Becoming Human: Artificial Intelligence Magazine

### Deep Autoencoder in Action: Reconstructing Handwritten Digit

Hello world, welcome back to my page! Here I want to show you another project that I just finished: **a deep autoencoder**. An autoencoder is essentially a kind of neural network architecture, yet it is special thanks to its ability to generate new data from a lower-dimensional representation of a given sample. Here I am going to use the MNIST Handwritten Digit dataset, in which each image has a size of 28 by 28 pixels. Each image is then flattened, so we end up with 784 values per image.

As usual, I also include all the code required for this project at the end of this article.

Before we jump into the code, let me explain first about the structure of a deep autoencoder. Look at the figure below.

What you are seeing in the picture above is the structure of the deep autoencoder that we are going to construct in this project. An autoencoder has two main parts, namely the **encoder** and the **decoder**. The encoder, which covers the first half of the network, maps a sample into its lower-dimensional representation. In this case, the encoder consists of an input layer which takes 784 features. It is connected to a hidden layer of 32 neurons, followed by a layer of 2 neurons. The encoder ends at this 2-neuron layer, which is usually called the **latent space**. Since this latent space has exactly two dimensions, we can represent all the data in a simple Cartesian coordinate system to find out where each digit is encoded.

The second half of an autoencoder is called the **decoder**. The architecture of the decoder mirrors that of the encoder. However, instead of lowering the dimensionality of the data, it maps a value in the latent space back to the original image shape. In this project, the decoder takes two input values, which are the two coordinates of a location in the latent space. It is then attached to a hidden layer of 32 neurons and an output layer of 784 neurons (the original flattened shape).

I think that’s all of the explanation about autoencoder, so now let’s start to implement this!

As usual, the first thing to do is to import all required modules, namely NumPy, Matplotlib, and Keras. The MNIST Handwritten Digit dataset that we will use is available from Keras datasets, so we can load it directly through the code.

import numpy as np

import matplotlib.pyplot as plt

from keras.datasets import mnist

from keras.models import Model

from keras.layers import Dense, Input

# Loading MNIST Digit dataset

(X_train, y_train), (X_test, y_test) = mnist.load_data()

Now we have 4 variables, where *X_train* and *y_train* consist of 60000 data-label pairs while the test variables consist of 10000 pairs. Spoiler alert: we do not use *y_train* or *y_test* for training. I will explain the reason later.

### Preprocessing

After loading the dataset, the next thing to do is to preprocess the data. Fortunately, the preprocessing steps are very simple in this case because the images already have a uniform shape (28 by 28). So what we need to do now is flatten both *X_train* and *X_test*, then store the flattened arrays in *X_train_flat* and *X_test_flat*.

# Convert 2D arrays into 1D (flattening)

X_train_flat = X_train.reshape(60000, X_train.shape[1]*X_train.shape[2])

X_test_flat = X_test.reshape(10000, X_test.shape[1]*X_test.shape[2])

If we check the shape of both flattened variables, we will get (60000, 784) and (10000, 784) for the train and test data respectively.
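The hard-coded sample counts work for MNIST, but the same flattening can be written without them. Here is a minimal sketch, assuming a small random stand-in array (`fake_images` is hypothetical, not the real dataset), which also shows that flattening loses no information:

```python
import numpy as np

# Hypothetical stand-in for X_train: 5 grayscale "images" of 28x28 pixels.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# -1 lets NumPy infer the second dimension (28 * 28 = 784).
fake_flat = fake_images.reshape(len(fake_images), -1)
print(fake_flat.shape)  # (5, 784)

# Flattening is lossless: reshaping back recovers the original images.
restored = fake_flat.reshape(-1, 28, 28)
print(np.array_equal(restored, fake_images))  # True
```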

The next preprocessing step is normalizing the array values. Pixel brightness in these images is represented with integer values ranging from 0 to 255. Neural networks generally train better when the inputs lie between 0 and 1, even though in some other cases this step might not matter much. Here it matters because the sigmoid output layer used later produces values in exactly that range. The normalization can be done like this:

# Normalize values

X_train_flat = X_train_flat/255

X_test_flat = X_test_flat/255
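To see what this division does to the data type and value range, here is a small sketch on made-up pixel values (`pixels` is hypothetical). The cast to `float32` is an optional memory-saving choice that I am assuming for illustration; the code above does not do it.

```python
import numpy as np

# Hypothetical pixel values covering the full 8-bit range.
pixels = np.array([0, 64, 128, 255], dtype=np.uint8)

# True division turns the integers into floating-point numbers.
# Casting to float32 afterwards is optional (assumption, see text).
scaled = (pixels / 255).astype(np.float32)

print(scaled.dtype)
print(float(scaled.min()), float(scaled.max()))  # 0.0 1.0
```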

In addition, we do not convert the labels (*y_train* and *y_test*) into a **one-hot** encoded representation because, as I said earlier, those data are not used to train the neural network model.

### Constructing the autoencoder

With all preprocessing steps done, we are now able to construct the autoencoder. The structure of this deep autoencoder was already shown in the figure in the early part of this article. Below is the code implementation of the architecture.

input_1 = Input(shape=(X_train_flat.shape[1],))

hidden_1 = Dense(32, activation='relu')(input_1)

latent_space = Dense(2, activation='relu')(hidden_1)

hidden_2 = Dense(32, activation='relu')(latent_space)

output_1 = Dense(X_train_flat.shape[1], activation='sigmoid')(hidden_2)

Technically speaking, this deep autoencoder takes an array of size 784 as the input value (the flattened image array). Next, those values are passed through the next layers, namely *hidden_1*, *latent_space* and *hidden_2* respectively, before eventually reaching the last layer called *output_1*. Note that we call this a **deep autoencoder** due to the existence of the *hidden_1* and *hidden_2* layers. If those two layers did not exist, we could simply call it an **autoencoder**.

Next, we need to define the segments of the network (**encoder**, **decoder**, and the entire model). Here I will use a variable called *autoencoder* to store the entire neural network model and use *encoder* variable to store the first half of the network.

autoencoder = Model(inputs=input_1, outputs=output_1)

encoder = Model(inputs=input_1, outputs=latent_space)

Notice the way I define the model variables. The *autoencoder* takes the very first layer (*input_1*) as the input and the very last layer (*output_1*) as the output in order to cover all 5 layers of the network. The encoder part, however, stops at *latent_space* because we want to take the value from this layer as the lower-dimensional representation of an image.

The decoder part is kinda tricky though. Below is the code creating the decoder part:

decoder_input = Input(shape=(2,))

decoder_layer_1 = autoencoder.layers[-2](decoder_input)

decoder_output = autoencoder.layers[-1](decoder_layer_1)

decoder = Model(inputs=decoder_input, outputs=decoder_output)

First we need to create a placeholder called *decoder_input*. This is done because we want to feed a particular value directly into the layers that come after *latent_space*, while *latent_space* itself is not an input layer. So we can say that *decoder_input* and *latent_space* play the same role, but *decoder_input* takes a value from the user while *latent_space* takes its value from the previous layer of the network.

Next, I define the remaining decoder layers, which are taken directly from the layers of the model stored in the *autoencoder* variable. *decoder_layer_1* is exactly the same layer as the second-to-last layer of the entire network, while *decoder_output* is the same as the output layer of *autoencoder*. Because these are the same layer objects, the decoder shares its weights with the autoencoder. Lastly, we define the *decoder* variable itself, which is the second half of the entire network.

We can check the entire structure of this deep autoencoder using *autoencoder.summary()* to verify that we have constructed the model exactly as displayed in the picture I showed earlier. Below is the model summary.
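The idea of building the decoder out of layers that already belong to the autoencoder can be illustrated without Keras. The sketch below uses a hypothetical `TinyDense` class with random, untrained weights (it is a stand-in, not the Keras API). It shows that running the encoder half and then the decoder half gives the same result as running the whole network, because both paths use the same layer objects.

```python
import numpy as np

class TinyDense:
    """Hypothetical stand-in for a dense layer (not the Keras API)."""

    def __init__(self, n_in, n_out, activation, rng):
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)
        self.activation = activation

    def __call__(self, x):
        return self.activation(x @ self.W + self.b)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layers = [
    TinyDense(784, 32, relu, rng),     # plays the role of hidden_1
    TinyDense(32, 2, relu, rng),       # plays the role of latent_space
    TinyDense(2, 32, relu, rng),       # plays the role of hidden_2
    TinyDense(32, 784, sigmoid, rng),  # plays the role of output_1
]

def run(x, layer_list):
    """Pass x through the given layers in order."""
    for layer in layer_list:
        x = layer(x)
    return x

x = rng.random((3, 784))          # three fake flattened images
full = run(x, layers)             # whole autoencoder
code = run(x, layers[:2])         # encoder half
rebuilt = run(code, layers[-2:])  # decoder half, reusing the same objects

print(code.shape, rebuilt.shape)   # (3, 2) (3, 784)
print(np.allclose(full, rebuilt))  # True
```

Note that this toy list has no separate input layer, so the indices differ slightly from `autoencoder.layers` in the real model.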

Model: "model_1"

_________________________________________________________________

Layer (type) Output Shape Param #

=================================================================

input_1 (InputLayer) (None, 784) 0

_________________________________________________________________

dense_1 (Dense) (None, 32) 25120

_________________________________________________________________

dense_2 (Dense) (None, 2) 66

_________________________________________________________________

dense_3 (Dense) (None, 32) 96

_________________________________________________________________

dense_4 (Dense) (None, 784) 25872

=================================================================

Total params: 51,154

Trainable params: 51,154

Non-trainable params: 0

_________________________________________________________________

You may also run *encoder.summary()* or *decoder.summary()* if you want.
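The parameter counts in the summary can be checked by hand. A dense layer has one weight per input-output pair plus one bias per output, so a short sketch in plain Python reproduces the numbers from the layer widths alone:

```python
# Layer widths: input, hidden_1, latent_space, hidden_2, output_1.
sizes = [784, 32, 2, 32, 784]

# A dense layer has (inputs * outputs) weights plus one bias per output.
params = [n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:])]

print(params)       # [25120, 66, 96, 25872]
print(sum(params))  # 51154
```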

### Compiling and fitting the model

A neural network cannot be trained before we define the loss function and the optimizer. In this case, I decided to go with the binary cross-entropy loss function and the *adam* optimizer. You may change this loss function to something like *mse* (Mean Squared Error), while other optimizers like *adagrad* or *adadelta* are also applicable. Below is how I compile the model:

autoencoder.compile(loss='binary_crossentropy', optimizer='adam')
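For intuition about what this loss measures, here is a plain-Python sketch of the binary cross-entropy formula applied per pixel. The function below is my own illustration, and the clipping constant `eps` is an assumption; the Keras implementation may differ in such details. One thing it makes visible is that for grayscale targets strictly between 0 and 1, even a perfect reconstruction does not give a loss of zero.

```python
import math

def binary_crossentropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy over paired pixel values in [0, 1]."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(targets)

# Three made-up pixels: black, white, and mid-gray.
target = [0.0, 1.0, 0.5]

good = binary_crossentropy(target, [0.1, 0.9, 0.5])
bad = binary_crossentropy(target, [0.9, 0.1, 0.5])
perfect = binary_crossentropy(target, target)

print(good < bad)     # True
print(perfect > 0.0)  # True: the mid-gray pixel still contributes ln(2)
```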

Now our deep autoencoder is ready to train. Training this kind of model is quite different from training a model for classification.

autoencoder.fit(X_train_flat, X_train_flat, epochs=10, validation_data=(X_test_flat, X_test_flat))

Notice that when fitting (a.k.a. training) the neural network model, the first and second arguments are the same variable (both are *X_train_flat*). If you are familiar with classification tasks, we usually pass the samples (X) as the first argument and the ground truth (y) as the second. The reason we pass *X* twice for an autoencoder is that we want the output of the model to be as similar as possible to the input data. Therefore, as I mentioned earlier in this article, loading *y_train* and *y_test* is actually not necessary for the training process.

Anyway, below is the output of the model fitting after 10 epochs. We can see that the loss value decreases as training progresses. In principle, the loss may still go lower if we increase the number of epochs. Note that I removed the results of epochs 2 to 9 for simplicity.
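To make the "input equals target" idea concrete, here is a toy sketch in NumPy. It is not the model from this article: it is a tiny linear autoencoder (8 features, 2 latent values, no activations) trained with plain gradient descent on made-up random data, and all names and hyperparameters are assumptions chosen for illustration. The point is only that the error is always measured against the input itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))               # made-up data: 200 samples, 8 features

W_enc = rng.normal(scale=0.1, size=(8, 2))  # 8 -> 2 (latent)
W_dec = rng.normal(scale=0.1, size=(2, 8))  # 2 -> 8 (reconstruction)
learning_rate = 0.1
losses = []

for step in range(200):
    Z = X @ W_enc      # encode
    Y = Z @ W_dec      # decode
    error = Y - X      # the target is the input itself
    losses.append(float(np.mean(error ** 2)))

    # Gradients of the mean squared error for both weight matrices.
    grad_Y = 2.0 * error / error.size
    grad_W_dec = Z.T @ grad_Y
    grad_W_enc = X.T @ (grad_Y @ W_dec.T)

    W_dec -= learning_rate * grad_W_dec
    W_enc -= learning_rate * grad_W_enc

print(losses[0] > losses[-1])  # True: the reconstruction error went down
```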

Train on 60000 samples, validate on 10000 samples

Epoch 1/10

60000/60000 [==============================] - 11s 186us/step - loss: 0.2133 - val_loss: 0.2098

.

.

.

.

.

Epoch 10/10

60000/60000 [==============================] - 10s 171us/step - loss: 0.1983 - val_loss: 0.1985

Up to this point, our deep autoencoder has been trained. Now we are able to find the lower-dimensional representation of all images and draw their distribution in a simple scatter plot using the **encoder**. Then we can also use the **decoder** to perform digit image reconstruction.

### What’s done by encoder?

After training the entire deep autoencoder model, we can map a 784-dimensional flattened image to the 2-dimensional latent space. Now we are going to map all the training data into the latent space using only the **encoder** part of the model, which can be achieved using the following code:

encoded_values = encoder.predict(X_train_flat)

Here the shape of the *encoded_values* variable is (60000, 2), which represents the number of samples and the number of latent dimensions respectively. We can think of the two values of each sample as a point in an x-y coordinate system. Hence, we are able to put all the data into a scatter plot. Notice that *y_train* is used only to color-code the samples. Below is the code to do so:

plt.figure(figsize=(13,10))

plt.scatter(encoded_values[:,0], encoded_values[:,1], s=4, c=y_train, cmap='hsv')

plt.colorbar()

plt.show()

Which displays the following image:

And there it is! The figure above shows the handwritten digit distribution in the two-dimensional latent space. Previously, each image in the dataset was represented in 784 dimensions, which makes it impossible to visualize the label distribution directly. Now it is a lot easier to see the distribution of all images because we have encoded that high-dimensional data into only 2 dimensions.

Furthermore, the scatter plot above tells us some interesting facts. First, we can see that the images with label 1 (orange) lie very far from most of the other images. Here we can say that images with label 1 are very different from most other digits. Next, when we pay more attention to the data points with labels 4 and 9 (those in between pink and purple), we can say that these two handwritten digits are quite similar to each other because their data points are spread across the same cluster.
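These visual impressions could also be put into numbers by comparing the average position (centroid) of each digit in the latent space. Below is a minimal sketch on made-up coordinates: `codes` and `labels` are hypothetical stand-ins for *encoded_values* and *y_train*, and the values are invented so that label 1 sits far away while labels 4 and 9 sit close together.

```python
import numpy as np

# Made-up 2-D latent codes and labels, for illustration only.
codes = np.array([[0.0, 0.0], [2.0, 0.0],      # label 1
                  [9.0, 8.0], [11.0, 8.0],     # label 4
                  [10.0, 9.0], [10.0, 11.0]])  # label 9
labels = np.array([1, 1, 4, 4, 9, 9])

# Mean latent position of each digit.
digits = np.unique(labels)
centroids = np.array([codes[labels == d].mean(axis=0) for d in digits])

# Pairwise Euclidean distances between the class centroids.
diffs = centroids[:, None, :] - centroids[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

print(centroids)
print(distances.round(2))
```

With the real data, a small centroid distance between two digits would match what the scatter plot suggests for 4 and 9.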

That’s all of the **encoder**, now let’s jump into **decoder**.

### What’s done by decoder?

Now, what if we are given a pair of x-y coordinates representing a point in the latent space? Can we reconstruct the handwritten image from that point? Yes we can! It can be achieved simply by performing prediction using the **decoder** model that we defined earlier.

decoded_values = decoder.predict(encoded_values)

Remember, the shape of the *encoded_values* variable is (60000, 2), meaning that it contains 60000 data points in our latent space, each represented by two values. Now we use this variable as the argument of the *predict()* method of the decoder, which returns 60000 flattened images, each having 784 values representing the brightness of each pixel. Since an MNIST image should have the size of 28 by 28, we still need to reshape this output. Below is the code for it.

decoded_values = decoded_values.reshape(60000, 28, 28)
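As an aside, the decoder is not limited to points produced by the encoder: any array of shape (n, 2) has the right form. A rough sketch of building a regular grid of latent points is shown below. The range `latent_max` is an assumed value (in practice it could be read from the scatter plot), and the lower bound of 0 reflects the ReLU activation on the latent layer, which cannot output negative numbers.

```python
import numpy as np

# Assumed upper bound of the latent coordinates (hypothetical value).
latent_max = 10.0
n = 5  # grid points per axis

axis = np.linspace(0.0, latent_max, n)
xx, yy = np.meshgrid(axis, axis)

# One row per grid point, matching the decoder's input shape (None, 2).
latent_grid = np.column_stack([xx.ravel(), yy.ravel()])

print(latent_grid.shape)  # (25, 2)
print(latent_grid[:3])
```

Such a grid could then be passed to *decoder.predict()* and reshaped to 28 by 28 images in the same way as above.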

Up to this point, we have the reconstructed images stored in the *decoded_values* variable. Now we can compare each sample stored in that variable with the actual handwritten digit image stored in the *X_train* variable. Here I decided to display the 10 images at indices 110 to 119 (out of 60000).

Below is the code to display the actual images taken from *X_train* along with their labels:

# Display some images

fig, axes = plt.subplots(ncols=10, sharex=False, sharey=True, figsize=(20, 7))

counter = 0

for i in range(110, 120):

    axes[counter].set_title(y_train[i])

    axes[counter].imshow(X_train[i], cmap='gray')

    axes[counter].get_xaxis().set_visible(False)

    axes[counter].get_yaxis().set_visible(False)

    counter += 1

plt.show()

And the code below is used to display the reconstructed images, again titled with their ground-truth labels.

# Display the reconstructed images

fig, axes = plt.subplots(ncols=10, sharex=False, sharey=True, figsize=(20, 7))

counter = 0

for i in range(110, 120):

    axes[counter].set_title(y_train[i])

    axes[counter].imshow(decoded_values[i], cmap='gray')

    axes[counter].get_xaxis().set_visible(False)

    axes[counter].get_yaxis().set_visible(False)

    counter += 1

plt.show()

Now we can compare some of the actual and reconstructed images pretty clearly. In fact, these reconstructed images are exactly what I expected. Remember the latent space I displayed earlier. It shows that data points with label 1 are clearly separated from nearly all other samples. This makes reconstructing the handwritten digit 1 pretty easy, so there is not much noise in its reconstructed image. Next, for the digits 4 and 9, as I explained earlier as well, the two seem quite similar to each other because they are spread across the same cluster in the latent space. We can also see in the reconstructed images above that the digits 4 and 9 are kinda indistinguishable.

Don’t worry if you get a different latent space image when running the code yourself, since it also produces different results on my computer when I run it multiple times.
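Besides comparing by eye, the reconstruction quality of each image could be summarized with a single number such as the mean squared error per image. The sketch below uses made-up arrays: `originals` stands in for the original images scaled to the 0 to 1 range, and `reconstructions` stands in for *decoded_values*, built here by adding increasing amounts of random noise. Both are hypothetical. With the real data, the originals would need to be on the same 0 to 1 scale as the decoder output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for four original and reconstructed 28x28 images.
originals = rng.random((4, 28, 28))
noise_levels = np.array([0.0, 0.01, 0.05, 0.2]).reshape(4, 1, 1)
reconstructions = originals + noise_levels * rng.normal(size=originals.shape)

# Mean squared error per image: average over the two pixel axes.
per_image_mse = ((originals - reconstructions) ** 2).mean(axis=(1, 2))

print(per_image_mse.shape)        # (4,)
print(np.argsort(per_image_mse))  # indices from best to worst reconstruction
```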
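The run-to-run differences come from randomness in training, such as the random initial weights. The sketch below only illustrates the general idea of a fixed seed using NumPy: `init_weights` is a hypothetical helper, not part of Keras, and how to seed Keras itself depends on the version you use, so that is not shown here.

```python
import numpy as np

def init_weights(seed):
    """Draw a hypothetical 784x32 weight matrix from a seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=0.05, size=(784, 32))

same_a = init_weights(seed=42)
same_b = init_weights(seed=42)
other = init_weights(seed=7)

print(np.array_equal(same_a, same_b))  # True: same seed, same starting point
print(np.array_equal(same_a, other))   # False: different seed, different start
```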

Thank you very much for reading! Hope you learn something new from this post!

