Tutorial 2: 94% accuracy on Cifar10 in 2 minutes

Source: Deep Learning on Medium


By David Yang

Prerequisite: Tutorial 0 (setting up Google Colab, TPU runtime, and Cloud Storage)

Cifar10 is a classic dataset for deep learning, consisting of 32×32 images belonging to 10 different classes, such as dog, frog, truck, ship, and so on. Cifar10 resembles MNIST — both have 10 classes and tiny images. However, while getting 90% accuracy on MNIST is trivial, getting 90% on Cifar10 requires serious work. In this tutorial, the mission is to reach 94% accuracy on Cifar10, which is reportedly human-level performance. In other words, getting >94% accuracy on Cifar10 means you can boast about building a super-human AI.

Cifar10: build a 10-class classifier for tiny images of 32×32 resolution. This looks like a toy dataset, like MNIST. It is not — serious people have spent serious time and money writing serious-looking papers about 0.1%-ish accuracy improvements. Try it, and you’ll find getting high accuracy frustratingly difficult. ©CIFAR

Training Cifar10 to 94% is quite challenging, and the training can take a very long time. There is an online competition about fast training called DAWNBench, and the winner (as of April 2019) is David C. Page, who built a custom 9-layer Residual ConvNet, or ResNet. In the following, we refer to this model as “DavidNet”, named after its author.

On a Tesla V100 GPU (125 TFLOPS), DavidNet reaches 94% after 75 seconds of training (excluding evaluation time). On Colab’s TPUv2 (180 TFLOPS), we expect at least comparable training speed, with the whole run finishing within 2 minutes because the TPU takes a long time to initialize. This is the target for this tutorial.

Setting up. First, we make necessary imports, download Fenwicks, and set up Google Cloud Storage (GCS).

import numpy as np
import tensorflow as tf
import os
if tf.gfile.Exists('./fenwicks'):
    tf.gfile.DeleteRecursively('./fenwicks')
!git clone https://github.com/fenwickslab/fenwicks.git
import fenwicks as fw
fw.colab_tpu.setup_gcs()

Then, we define some tunable hyperparameters, using Colab’s nice interface of sliders and dropdown boxes.

BATCH_SIZE = 512 #@param ["512", "256", "128"] {type:"raw"}
MOMENTUM = 0.9 #@param ["0.9", "0.95", "0.975"] {type:"raw"}
WEIGHT_DECAY = 0.000125 #@param ["0.000125", "0.00025", "0.0005"] {type:"raw"}
LEARNING_RATE = 0.4 #@param ["0.4", "0.2", "0.1"] {type:"raw"}
EPOCHS = 24 #@param {type:"slider", min:0, max:100, step:1}
WARMUP = 5 #@param {type:"slider", min:0, max:24, step:1}
BUCKET = 'gs://gs_colab'
PROJECT = 'cifar10'

The values for batch size, momentum, learning rate, number of epochs, and number of warmup epochs (explained soon) are all from the original DavidNet implementation. The weight decay rate, however, is only a quarter of the 0.0005 that DavidNet uses. In fact, if we use a weight decay of 0.0005, the resulting accuracy is much lower (around 90%) because the model underfits. This is probably caused by architectural differences between a TPU and a GPU, a topic beyond the scope of this tutorial.

Preparing data. Cifar10 is a bigger dataset than MNIST, since it contains color images (3 numbers per pixel, for red, green and blue respectively) rather than grayscale images (1 number per pixel). So, putting the entire dataset in memory, as we did in the MNIST tutorial, would lead to an out-of-memory error. Therefore, we need to store the dataset on GCS. In our tutorial series, we call the data directory on GCS data_dir, and the directory for storing intermediate files generated during training work_dir.

data_dir, work_dir = fw.io.get_gcs_dirs(BUCKET, PROJECT)

Let’s first download the dataset using Keras:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
n_train, n_test = X_train.shape[0], X_test.shape[0]
img_size = X_train.shape[1]
n_classes = y_train.max() + 1

Then we apply standard scaling: for every color channel, subtract the mean and divide by the standard deviation:

X_train_mean = np.mean(X_train, axis=(0,1,2))
X_train_std = np.std(X_train, axis=(0,1,2))
X_train = (X_train - X_train_mean) / X_train_std
X_test = (X_test - X_train_mean) / X_train_std
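
To make the axis=(0,1,2) reduction concrete, here is a minimal sketch on synthetic data. The random array merely stands in for Cifar10, and the variable names are illustrative rather than taken from the tutorial. Reducing over the image, height, and width axes leaves one mean and one standard deviation per color channel, which NumPy then broadcasts across the last axis.

```python
import numpy as np

# Synthetic stand-in for a set of RGB images: (N, height, width, channels).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 32, 32, 3)).astype(np.float64)

# Reducing over axes 0, 1, 2 leaves one statistic per color channel.
channel_mean = images.mean(axis=(0, 1, 2))
channel_std = images.std(axis=(0, 1, 2))

# Broadcasting applies the 3 per-channel values along the trailing axis.
scaled = (images - channel_mean) / channel_std

print(channel_mean.shape)  # (3,)
```

After scaling, each channel of the data used to compute the statistics has mean 0 and standard deviation 1. The test set is scaled with the training statistics, so its channels are only approximately standardized.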

In Tensorflow, the preferred file format is TFRecord, which is compact and efficient since it is based on Google’s ubiquitous ProtoBuf serialization library. Fenwicks provides a one-liner for this:

train_fn = os.path.join(data_dir, "train.tfrec")
test_fn = os.path.join(data_dir, "test.tfrec")

fw.io.numpy_tfrecord(X_train, y_train, train_fn)
fw.io.numpy_tfrecord(X_test, y_test, test_fn)

Data augmentation and input pipeline. In DavidNet, training images go through the standard Cifar10 transformations, that is, pad 4 pixels to 40×40, crop back to 32×32, and randomly flip left and right. In addition, it applies the popular Cutout augmentation as a regularization measure, which alleviates overfitting. Cutout is a bit tricky to implement in Tensorflow. Fortunately, Fenwicks again provides one-liners:

def parser_train(tfexample):
    x, y = fw.io.tfexample_numpy_image_parser(tfexample, img_size,
                                              img_size)
    x = fw.transform.ramdom_pad_crop(x, 4)
    x = fw.transform.random_flip(x)
    x = fw.transform.cutout(x, 8, 8)
    return x, y

parser_test = lambda x: fw.io.tfexample_numpy_image_parser(x, img_size, img_size)
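
For readers who want to see what these three transformations do, the following NumPy sketch mimics them on a single image. It is not Fenwicks' implementation: the function names, the zero padding, and the choice to keep the Cutout rectangle fully inside the image are all assumptions made for illustration.

```python
import numpy as np

def random_pad_crop(img, pad, rng):
    """Zero-pad `pad` pixels on each side, then crop back to the original size."""
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]

def random_flip(img, rng):
    """Flip left-right with probability 0.5."""
    return img[:, ::-1] if rng.random() < 0.5 else img

def cutout(img, cut_h, cut_w, rng):
    """Zero out one randomly placed cut_h x cut_w rectangle (assumed fully inside)."""
    h, w, _ = img.shape
    top = rng.integers(0, h - cut_h + 1)
    left = rng.integers(0, w - cut_w + 1)
    out = img.copy()
    out[top:top + cut_h, left:left + cut_w] = 0.0
    return out

rng = np.random.default_rng(0)
img = np.ones((32, 32, 3))
aug = cutout(random_flip(random_pad_crop(img, 4, rng), rng), 8, 8, rng)
print(aug.shape)  # (32, 32, 3)
```

Because the images were standardized earlier, filling the Cutout rectangle with zeros corresponds to filling it with the per-channel mean color.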

With the input parsers ready, we now build our input pipeline, with Fenwicks’ one-liner:

train_input_func = lambda params: fw.io.tfrecord_ds(train_fn, parser_train, batch_size=params['batch_size'], training=True)
eval_input_func = lambda params: fw.io.tfrecord_ds(test_fn, parser_test, batch_size=params['batch_size'], training=False)

Building ConvNet. Building a ConvNet in Fenwicks isn’t hard: we just keep adding layers to a Sequential model, as in good old Keras. For DavidNet, things are a bit tricky because the original implementation is in PyTorch, and there are some subtle differences between PyTorch and Tensorflow. Most notably, PyTorch’s default way of setting the initial, random weights of layers does not have a counterpart in Tensorflow. Fenwicks takes care of that. The ConvNet is built as follows:

def build_nn(c=64, weight=0.125):
    # c: number of channels in the first layer; weight: scale applied to the logits.
    model = fw.Sequential()
    model.add(fw.layers.ConvBN(c, **fw.layers.PYTORCH_CONV_PARAMS))
    model.add(fw.layers.ConvResBlk(c*2, res_convs=2,
                                   **fw.layers.PYTORCH_CONV_PARAMS))
    model.add(fw.layers.ConvBlk(c*4, **fw.layers.PYTORCH_CONV_PARAMS))
    model.add(fw.layers.ConvResBlk(c*8, res_convs=2,
                                   **fw.layers.PYTORCH_CONV_PARAMS))
    model.add(tf.keras.layers.GlobalMaxPool2D())
    # Use the function argument rather than a hard-coded 0.125.
    model.add(fw.layers.Classifier(n_classes,
        kernel_initializer=fw.layers.init_pytorch, weight=weight))
    return model

This corresponds to the following architecture:

DavidNet architecture © David C. Page

Another thing to note is that the final fully-connected classifier layer is followed by a scaling operation, which multiplies the logits by 0.125. This scaling factor is hand-tuned in DavidNet, and we use the same value.
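
The tutorial does not explain why this scaling helps, but its direct effect on the output probabilities is easy to show. The sketch below uses made-up logits and a plain NumPy softmax, both of which are illustrative assumptions: multiplying the logits by 0.125 leaves the predicted class unchanged and makes the softmax distribution flatter.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([8.0, 4.0, 0.0])      # made-up raw classifier outputs
p_raw = softmax(logits)                 # sharply peaked on the first class
p_scaled = softmax(0.125 * logits)      # same ranking, flatter distribution

print(p_raw)
print(p_scaled)
```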

Model training. DavidNet trains the model with Stochastic Gradient Descent with Nesterov momentum, with a slanted triangular learning rate schedule. Let’s build the learning rate schedule and plot it:

steps_per_epoch = n_train // BATCH_SIZE 
total_steps = steps_per_epoch * EPOCHS
lr_func = fw.train.triangular_lr(LEARNING_RATE/BATCH_SIZE, total_steps, warmup_steps=WARMUP*steps_per_epoch)
fw.plt.plot_lr_func(lr_func, total_steps)
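
As a rough picture of the shape of this schedule, here is a minimal sketch that does not use Fenwicks. It assumes the learning rate rises linearly from 0 to its peak during warmup and then falls linearly back to 0 at the last step. The helper name triangular_lr_sketch is hypothetical, and the real fw.train.triangular_lr may differ in its details.

```python
import numpy as np

def triangular_lr_sketch(peak_lr, total_steps, warmup_steps):
    """Return a function mapping a step index to a learning rate."""
    def lr_at(step):
        # Piecewise-linear: 0 -> peak_lr over warmup, then peak_lr -> 0.
        return float(np.interp(step,
                               [0, warmup_steps, total_steps],
                               [0.0, peak_lr, 0.0]))
    return lr_at

n_train, batch_size, epochs, warmup = 50000, 512, 24, 5
steps_per_epoch = n_train // batch_size          # 97
total_steps = steps_per_epoch * epochs           # 2328
warmup_steps = warmup * steps_per_epoch          # 485
lr_at = triangular_lr_sketch(0.4 / batch_size, total_steps, warmup_steps)

for step in (0, warmup_steps, total_steps):
    print(step, lr_at(step))
```

With the tutorial's hyperparameters, the peak of 0.4 / 512 = 0.00078125 is reached at step 485, the end of the fifth epoch.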

Then we build the SGD optimizer and model function for TPUEstimator:

opt_func = fw.train.sgd_optimizer(lr_func, mom=MOMENTUM,
                                  wd=WEIGHT_DECAY*BATCH_SIZE)
model_func = fw.tpuest.get_clf_model_func(build_nn, opt_func,
                                          reduction=tf.losses.Reduction.SUM)
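
One way to read the LEARNING_RATE/BATCH_SIZE and WEIGHT_DECAY*BATCH_SIZE scaling is that it compensates for the loss being summed, rather than averaged, over the batch. The sketch below checks this numerically on a toy linear model. It assumes weight decay is added to the gradient as an L2 term, which is an assumption about the optimizer rather than something the tutorial states, and it ignores momentum.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 512                                   # batch size
x = rng.normal(size=(B, 4))               # toy inputs
y = rng.normal(size=B)                    # toy targets
w = rng.normal(size=4)                    # toy weights
lr, wd = 0.4, 0.000125

# Gradients of a squared-error loss, reduced by sum and by mean over the batch.
residual = x @ w - y
grad_sum = 2 * x.T @ residual
grad_mean = grad_sum / B

# Update with a sum-reduced loss, lr / B and wd * B.
step_sum = (lr / B) * (grad_sum + (wd * B) * w)
# Update with a mean-reduced loss and the unscaled lr and wd.
step_mean = lr * (grad_mean + wd * w)

print(np.allclose(step_sum, step_mean))  # True
```

Under these assumptions the two formulations produce the same weight update, so the effective learning rate and weight decay are still 0.4 and 0.000125.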

Let the training begin:

est = fw.tpuest.get_tpu_estimator(n_train, n_test, model_func, work_dir, trn_bs=BATCH_SIZE)
est.train(train_input_func, steps=total_steps)

After the slowish initialization and first epoch, each epoch takes around 2.5 seconds. Since there are 24 epochs in total, the total amount of time spent on training is roughly a minute. Let’s evaluate the model on the test set to see the accuracy:

result = est.evaluate(eval_input_func, steps=1)
print(f'Test results: accuracy={result["accuracy"] * 100: .2f}%, loss={result["loss"]: .2f}.')

Most of the time, the evaluation accuracy is over 94%. Finally, we delete all files in work_dir to save space on GCS. Optionally, we can clean up data_dir as well.

fw.io.create_clean_dir(work_dir)