Tutorial 1: MNIST, the Hello World of Deep Learning

Source: Deep Learning on Medium

By David Yang

Prerequisite: Tutorial 0 (setting up Google Colab, TPU runtime, and Cloud Storage)

MNIST is a dataset containing tiny gray-scale images, each showing a handwritten digit, that is, 0, 1, 2, …, 9. Your mission is to analyze such an image, and tell what digit is written there. The dataset looks like this:

MNIST, the “Hello World” of deep learning. © Yann LeCun et al.

Handwritten digit recognition, in general, is a realistic task. The MNIST dataset is also not particularly small: it contains 60,000 images in the training set and 10,000 in the test set. Each image has a resolution of 28×28, totaling 28²=784 features — a rather high dimensionality. So why is MNIST a “Hello World” example? One reason is that it is surprisingly easy to obtain decent accuracy, like 90%, even with a weak or poorly designed machine learning model. A practical setting, seemingly challenging task, high accuracy with little work — a perfect combination for beginners.

For deep learning, image classification is typically done with a convolutional neural network, or ConvNet. A ConvNet is so effective on MNIST that even if we randomly flip the labels for most of the dataset, it still reaches high accuracy.

Even with 100 noisy labels for every clean label the ConvNet still attains a performance of 91%. — Rolnick et al.

With the original dataset, it’s pretty trivial to get 99% accuracy with a basic ConvNet. Spend some time tuning the hyperparameters, and you get 99.2%-99.3%. 99.4% is the first accuracy level that requires some work. This is what we are going to do now.

Google has an MNIST tutorial for TPU, which is supposed to reach 99.4%. However, the code is long, and the tutorial contains a lot of irrelevant details about the Google Cloud Platform. Further, their code leads to 99.2% accuracy, not 99.4% (as of April 2019). In this tutorial, we use exactly the same model and fix it to get 99.4%.

Before we do anything, let’s set up TensorFlow and Fenwicks:

import tensorflow as tf
import os
import numpy as np
# Clone Fenwicks only if it has not been cloned already.
if not tf.gfile.Exists('./fenwicks'):
    !git clone https://github.com/fenwickslab/fenwicks.git
import fenwicks as fw

Hyperparameters. Let’s define two hyperparameters: batch size and number of epochs. Colab provides a nice form interface for this, with sliders and dropdown boxes.

In Google’s TPU tutorial, the batch size is set to 32, not 256 as we do here. In fact, they also use a batch size of 256: the number 32 is the batch size per TPU core, and Colab’s TPU contains 8 cores. Fenwicks removes this confusing detail.
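
For reference, the two hyperparameters can be written as a Colab form cell along the following lines. This is a minimal sketch: the batch size of 256 comes from the text, while the value of EPOCHS and the constant N_TPU_CORES are assumptions made for illustration. The last lines show where the per-core value of 32 comes from.

```python
# Colab renders "#@param" comments as form widgets; outside Colab they are
# ordinary comments and the cell still runs as plain Python.
BATCH_SIZE = 256  #@param {type:"integer"}
# Assumed value: the number of epochs is not given in the text.
EPOCHS = 10  #@param {type:"integer"}

# Hypothetical helper constant: a Colab TPU exposes 8 cores.
N_TPU_CORES = 8

# The global batch is split evenly across the cores.
per_core_batch_size = BATCH_SIZE // N_TPU_CORES
print(per_core_batch_size)  # 32
```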

GCS. Next we set up Google Cloud Storage (GCS):
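
The setup cell itself is not reproduced in this text. As a rough sketch, and assuming the standard Colab authentication helper rather than whatever Fenwicks provides for this step, it could look like the following. The bucket and project values are hypothetical placeholders, and whether the bucket should carry the gs:// prefix is also an assumption.

```python
# Hypothetical placeholders: substitute your own bucket and project names.
# Assumption: the bucket is given as a gs:// URL.
BUCKET = 'gs://your-bucket-name'
PROJECT = 'tutorial1'


def authenticate_colab_user():
    """Sign in with a Google account so the notebook can access GCS.

    Only works inside Colab, where the google.colab package is available.
    """
    from google.colab import auth  # imported lazily: Colab-only module
    auth.authenticate_user()
```

In a Colab cell, call authenticate_colab_user() once before any code that reads from or writes to the bucket.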


Colab then prompts you to sign in with your Google account, which returns a code. Paste this code back into Colab. In this tutorial, we don’t use Cloud Storage to store the dataset. However, TensorFlow still needs to store intermediate files created during training, such as model checkpoints, in GCS. In our tutorials, we call the directory containing these intermediate files work_dir.

_, work_dir = fw.io.get_gcs_dirs(BUCKET, PROJECT)

Data and input pipeline. Now we are ready for MNIST. Let’s download the data with Keras, and do some standard transformations:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
n_train, n_test = len(X_train), len(X_test)
X_train = (X_train.reshape(-1, 28, 28, 1) / 255.0).astype('float32')
X_test = (X_test.reshape(-1, 28, 28, 1) / 255.0).astype('float32')
y_train = y_train.astype('int64')
y_test = y_test.astype('int64')
n_classes = np.max(y_train)+1
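
To see what these transformations do without downloading anything, here is a small sketch that applies the same steps to a fake batch of random uint8 images. The names fake_images, fake_labels, x, y and n_classes_fake are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for MNIST: 5 gray-scale images of 28x28 pixels, values 0..255.
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
fake_labels = np.array([3, 1, 4, 1, 9], dtype=np.uint8)

# Add a trailing channel axis and scale pixel values to [0, 1].
x = (fake_images.reshape(-1, 28, 28, 1) / 255.0).astype('float32')
# Use 64-bit integer class labels.
y = fake_labels.astype('int64')
# Number of classes inferred from the largest label present.
n_classes_fake = np.max(y) + 1

print(x.shape, x.dtype)  # (5, 28, 28, 1) float32
print(y.dtype)           # int64
print(n_classes_fake)    # 10
```

Note that inferring the number of classes from the largest label only works when that label actually occurs in the data, which is the case for the full MNIST training set.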

To feed the data to the TPU, we need to build an input pipeline using the tf.data API. Doing this in plain TensorFlow is rather complicated. Fortunately, Fenwicks provides one-liners for this.

train_input_func = lambda params: fw.io.numpy_ds(X_train, y_train, batch_size=params['batch_size'], shuffle_buf_sz=n_train, training=True)
eval_input_func = lambda params: fw.io.numpy_ds(X_test, y_test, batch_size=params['batch_size'], training=False)
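
Conceptually, such a pipeline shuffles the training data, cuts it into fixed-size batches, and repeats. The following NumPy generator is only an illustration of that behavior, not how Fenwicks or tf.data implement it; the function name iterate_batches and its arguments are made up. It drops the final partial batch, on the assumption that every batch should have the same shape, which is what a TPU wants.

```python
import itertools
import numpy as np


def iterate_batches(X, y, batch_size, training, seed=0):
    """Yield (X_batch, y_batch) pairs of exactly batch_size examples."""
    n = len(X)
    if batch_size > n:
        raise ValueError('batch_size must not exceed the dataset size')
    rng = np.random.default_rng(seed)
    while True:
        # Reshuffle on every pass when training; keep order when evaluating.
        order = rng.permutation(n) if training else np.arange(n)
        # Stop before a partial batch so every batch has the same shape.
        for start in range(0, n - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            yield X[idx], y[idx]
        if not training:
            return  # a single pass is enough for evaluation


X_demo = np.arange(10, dtype='float32').reshape(10, 1)
y_demo = np.arange(10, dtype='int64')

# Training: an endless stream, so take only the first 5 batches.
train_batches = list(itertools.islice(
    iterate_batches(X_demo, y_demo, batch_size=4, training=True), 5))
# Evaluation: one ordered pass; 10 examples give 2 full batches of 4.
eval_batches = list(
    iterate_batches(X_demo, y_demo, batch_size=4, training=False))
print(len(train_batches), len(eval_batches))  # 5 2
```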

Building ConvNet. We build exactly the architecture in Google’s TPU tutorial:

def build_nn(c=6, c_dense=200):
    model = fw.Sequential()
    # Three convolution blocks: Conv2D + BatchNorm + ReLU.
    model.add(fw.layers.ConvBN(c, kernel_size=3))
    model.add(fw.layers.ConvBN(c*2, kernel_size=6, strides=2))
    model.add(fw.layers.ConvBN(c*4, kernel_size=6, strides=2))
    # Fully-connected block: Dense + BatchNorm + ReLU + Dropout.
    model.add(fw.layers.DenseBN(c_dense, drop_rate=0.5))
    return model

Here, ConvBN is a 2D convolution layer, followed by BatchNormalization and ReLU activation. Similarly, DenseBN is a fully-connected layer, followed by BatchNorm, ReLU and Dropout. Classifier is simply a fully-connected layer. The kernel size, number of channels, dropout rate and so on come from Google’s TPU tutorial.
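
To get a feel for the sizes involved, the following sketch traces the feature-map shape through the three convolution layers. It assumes 'same' padding, which this tutorial does not state, so that the spatial size shrinks only through the strides and the kernel size plays no role. The helper same_padding_out is a made-up name.

```python
import math


def same_padding_out(size, stride):
    """Output width/height of a convolution with 'same' padding."""
    return math.ceil(size / stride)


c = 6  # base channel count, as in build_nn
# (output channels, stride) for the three ConvBN layers.
conv_layers = [(c, 1), (c * 2, 2), (c * 4, 2)]

size = 28  # MNIST images are 28x28
shapes = []
for channels, stride in conv_layers:
    size = same_padding_out(size, stride)
    shapes.append((size, size, channels))

print(shapes)  # [(28, 28, 6), (14, 14, 12), (7, 7, 24)]
# Number of values if the last feature map is flattened for a dense layer.
print(size * size * channels)  # 1176
```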

Training the ConvNet. Following Google’s TPU tutorial, we use the Adam optimizer with an exponentially decaying learning rate, as follows:

steps_per_epoch = n_train // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS
lr_func = fw.train.exp_decay_lr(base_lr=0.0001, init_lr=0.01, decay_steps=2000)
opt_func = fw.train.adam_optimizer(lr_func)
model_func = fw.tpuest.get_clf_model_func(build_nn, opt_func)
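
A common form of such a schedule adds a constant floor to an exponentially decaying term. The sketch below assumes that form and a decay rate of 1/e per decay_steps; neither is confirmed here as what fw.train.exp_decay_lr actually computes, and EPOCHS = 10 is a placeholder value. The function name exp_decay_lr_sketch is made up.

```python
import math

n_train = 60000
BATCH_SIZE = 256
EPOCHS = 10  # placeholder value, not given in the text

steps_per_epoch = n_train // BATCH_SIZE   # 234
total_steps = steps_per_epoch * EPOCHS    # 2340


def exp_decay_lr_sketch(step, base_lr=0.0001, init_lr=0.01,
                        decay_steps=2000, decay_rate=1 / math.e):
    """Assumed form: base_lr + init_lr * decay_rate ** (step / decay_steps)."""
    return base_lr + init_lr * decay_rate ** (step / decay_steps)


# Print the learning rate at a few points during training.
for step in (0, 1000, 2000, total_steps):
    print(step, round(exp_decay_lr_sketch(step), 5))
```

Under these assumptions the learning rate starts slightly above init_lr and approaches base_lr from above, never dropping below it.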

What does the exponential decay learning rate look like? Let’s plot it out:

fw.plt.plot_lr_func(lr_func, total_steps)

The plot is interactive: click on it, then hover the mouse over it to see the exact learning rate values.

Now let’s start training:

est = fw.tpuest.get_tpu_estimator(n_train, n_test, model_func, work_dir, trn_bs=BATCH_SIZE)
est.train(train_input_func, steps=total_steps)

The TPU is indeed fast: about 0.3 seconds per epoch for our 5-layer neural net. It takes some time to initialize the TPU though, which includes connecting to the remote TPU server and compiling the neural network into the TPU’s hardware instructions.

Unlike Keras training, training on the TPU doesn’t evaluate the model on the test set after every epoch. Instead, we have to do so explicitly:

result = est.evaluate(eval_input_func, steps=1)

The above code puts the whole test set in one single batch and finishes the evaluation in a single step. Let’s print the evaluation result:

print(f'Test results: accuracy={result["accuracy"] * 100:.2f}%, loss={result["loss"]:.2f}.')

The exact accuracy varies from run to run, since weight initialization, shuffling, and dropout are all random. Most of the time, the accuracy is over 99.4%.
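
It helps to translate these percentages into counts of misclassified test images. A small sketch of the arithmetic, with a made-up helper name n_errors:

```python
n_test = 10000  # size of the MNIST test set


def n_errors(accuracy_percent, n=n_test):
    """Number of misclassified images implied by an accuracy in percent."""
    return round(n * (100 - accuracy_percent) / 100)


for acc in (99.0, 99.2, 99.4):
    print(acc, n_errors(acc))
# 99.0 100
# 99.2 80
# 99.4 60
```

So going from 99.2% to 99.4% means getting about 20 more test images right, which is why run-to-run randomness is noticeable at this level.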

Finally, let’s clean up work_dir, since we only have 5 GB of free space in GCS:

# One way to do this with the TF 1.x file API: delete work_dir and everything in it.
tf.gfile.DeleteRecursively(work_dir)

That’s it. With Fenwicks taking care of the TPU details, the code feels like plain old Keras. In the next tutorial, we deal with the much more challenging CIFAR-10 dataset.