Classifying Fashion-MNIST with Gluon

In a previous post, we took a first look at the Gluon API, a high-level API built on top of Apache MXNet.

In this post, we’ll keep exploring Gluon, but first we need a cool dataset to work with.

Code is available on GitHub.

The Fashion-MNIST dataset

Put together by e-tailer Zalando, Fashion-MNIST is a drop-in replacement for the well-known (and probably over-used) MNIST dataset: same number of samples, same number of classes and same filenames! You’ll find plenty of details in this technical report.

Instead of digits, this dataset contains the following fashion items: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot. Some items look very similar, which is likely to make our classification job harder.

Samples from Fashion-MNIST

Loading the dataset

Thanks to the Gluon vision API, it couldn’t be simpler to load the training set and the validation set. Each sample is a 28×28 greyscale image shaped (28, 28, 1). We’ll use a simple transform function to move the channel axis to the front, giving (1, 28, 28), the channels-first layout that Gluon convolution layers expect by default.
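The gist with the loading code isn’t reproduced here, so this is only a minimal sketch of what such a transform could look like. The names `to_channels_first` and `load_fashion_mnist` are hypothetical, and scaling pixels to [0, 1] is an assumption. The NumPy helper runs on its own; the second function assumes MXNet 1.x is installed and downloads the data on first use.

```python
import numpy as np


def to_channels_first(image):
    """Turn an (H, W, C) uint8 image into a (C, H, W) float32 array in [0, 1]."""
    return np.transpose(image.astype(np.float32), (2, 0, 1)) / 255.0


def load_fashion_mnist():
    """Assumed Gluon equivalent (needs MXNet; downloads the dataset)."""
    from mxnet import gluon, nd  # imported lazily so the NumPy part runs alone

    def transform(data, label):
        data = nd.transpose(data.astype('float32'), (2, 0, 1)) / 255
        return data, label.astype('float32')

    train = gluon.data.vision.FashionMNIST(train=True, transform=transform)
    val = gluon.data.vision.FashionMNIST(train=False, transform=transform)
    return train, val


sample = np.full((28, 28, 1), 255, dtype=np.uint8)  # an all-white dummy image
converted = to_channels_first(sample)
print(converted.shape, converted.dtype, converted.max())
```

The two datasets returned by `load_fashion_mnist` could then be wrapped in `gluon.data.DataLoader` objects to get shuffled mini-batches.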

Building a configurable Convolutional Neural Network

We’re going to try out a number of network architectures, so let’s write a function that lets us build a variety of CNNs from the following Gluon layers:

  • Convolution and Max Pooling layers, with parameters for kernel size, padding, pooling and stride,
  • a Dense layer for final classification,
  • Dropout and Batch Normalization layers to fight overfitting and help our network learn better.

By default, we’ll use the ReLU activation function but let’s also plan for Leaky ReLU, which is a separate layer in Gluon.

Thanks to this function, we can now build a variety of CNNs with one single line of code. Here’s a simple example.
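The original builder function isn’t shown here, so the following is a sketch under stated assumptions rather than the author’s code. `cnn_plan` and `to_gluon` are hypothetical names, the `(channels, kernel, padding, pool)` tuple layout is made up for illustration, and so is the Leaky ReLU slope of 0.1. The plan is plain Python; only `to_gluon` needs MXNet.

```python
def cnn_plan(conv_blocks, dense_units, classes=10, activation='relu',
             batch_norm=False, dropout=0.0):
    """Describe a CNN as a list of (Gluon layer name, keyword arguments)."""
    plan = []

    def add_activation():
        if batch_norm:
            plan.append(('BatchNorm', {}))  # normalize the activation's inputs
        if activation == 'leaky':
            plan.append(('LeakyReLU', {'alpha': 0.1}))  # separate layer in Gluon
        else:
            plan.append(('Activation', {'activation': activation}))

    def add_dropout():
        if dropout > 0:
            plan.append(('Dropout', {'rate': dropout}))

    for channels, kernel, padding, pool in conv_blocks:
        plan.append(('Conv2D', {'channels': channels, 'kernel_size': kernel,
                                'strides': 1, 'padding': padding}))
        add_activation()
        plan.append(('MaxPool2D', {'pool_size': pool, 'strides': pool}))
        add_dropout()
    plan.append(('Flatten', {}))
    for units in dense_units:
        plan.append(('Dense', {'units': units}))
        add_activation()
        add_dropout()
    plan.append(('Dense', {'units': classes}))  # output layer, no activation
    return plan


def to_gluon(plan):
    """Instantiate the plan as a Gluon Sequential network (needs MXNet)."""
    from mxnet.gluon import nn  # lazy import: the plan itself is plain Python
    net = nn.Sequential()
    for name, kwargs in plan:
        net.add(getattr(nn, name)(**kwargs))
    return net


# One line per network: two conv blocks (channels, kernel, padding, pool)
# followed by two hidden Dense layers.
plan = cnn_plan([(64, 3, 1, 2), (64, 3, 1, 2)], [256, 64])
for layer in plan:
    print(layer)
```

Keeping the description separate from the Gluon objects makes it easy to print or compare architectures before building them with `to_gluon(plan)`.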

The flexibility of the Gluon API is really great here. I must admit that this would have been more work with the symbolic API in MXNet 🙂

Initializing the model

We have to initialize weights, pick an optimizer and set its parameters. Nothing unusual. Let’s settle for Xavier initialization, but feel free to try something else.
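As a rough illustration of this step, here is a sketch with two hypothetical helpers. `xavier_uniform_bound` shows the range that Xavier (Glorot) uniform initialization draws weights from when fan-in and fan-out are averaged. `init_model` is an assumed Gluon setup: the learning rate of 0.1 is a placeholder, not a value taken from the original code.

```python
import math


def xavier_uniform_bound(fan_in, fan_out, magnitude=3.0):
    """Half-width of the uniform range for Xavier/Glorot initialization with
    averaged fan-in/fan-out: weights are drawn from [-bound, +bound]."""
    return math.sqrt(magnitude / ((fan_in + fan_out) / 2.0))


def init_model(net, optimizer='sgd', learning_rate=0.1):
    """Assumed Gluon setup (needs MXNet): Xavier weights plus a Trainer."""
    import mxnet as mx
    from mxnet import gluon
    net.initialize(mx.init.Xavier())
    return gluon.Trainer(net.collect_params(), optimizer,
                         {'learning_rate': learning_rate})


# Range for a Dense layer with 256 inputs and 64 outputs
print(xavier_uniform_bound(256, 64))
```

The wider a layer is, the narrower the range, which keeps the scale of activations roughly constant from layer to layer.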

Computing accuracy

During training, we’d like to measure training and validation accuracy. Let’s use an MXNet metric to compute them.
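To make clear what the metric measures, here is a plain Python sketch of classification accuracy, followed by an assumed Gluon version built on `mx.metric.Accuracy` (MXNet 1.x). Both function names are hypothetical.

```python
def accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels)


def evaluate(net, data_loader):
    """Assumed Gluon version (needs MXNet 1.x): accuracy over a DataLoader."""
    import mxnet as mx
    metric = mx.metric.Accuracy()
    for data, label in data_loader:
        metric.update(labels=[label], preds=[net(data)])
    return metric.get()[1]  # get() returns a (name, value) pair


scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]
print(accuracy(scores, [1, 0, 1]))  # two of the three predictions match
```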

Training the model

Our training loop is the standard Gluon training loop:

  • Iterate over epochs and batches,
  • Record gradients while running the forward pass and computing the loss, then backpropagate,
  • Apply a training step, i.e. update the weights.

In the process, we’re also computing accuracies and storing their values for plotting purposes.
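The steps above can be sketched as follows. This is an assumed reconstruction, not the original gist: the function name `train` and its arguments are hypothetical, it targets MXNet 1.x, and it leaves out moving batches to a GPU context to stay short.

```python
def train(net, trainer, train_data, val_data, epochs):
    """Standard Gluon loop (needs MXNet 1.x); returns per-epoch accuracies."""
    import mxnet as mx
    from mxnet import autograd, gluon

    loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
    history = []
    for epoch in range(epochs):
        train_acc = mx.metric.Accuracy()
        for data, label in train_data:
            with autograd.record():          # record operations for autograd
                output = net(data)           # forward pass
                loss = loss_fn(output, label)
            loss.backward()                  # backpropagate
            trainer.step(data.shape[0])      # update weights, scaled by batch size
            train_acc.update(labels=[label], preds=[output])

        val_acc = mx.metric.Accuracy()
        for data, label in val_data:
            val_acc.update(labels=[label], preds=[net(data)])

        history.append((train_acc.get()[1], val_acc.get()[1]))
        print('Epoch#%d Training=%.4f Validation=%.4f' % ((epoch,) + history[-1]))
    return history
```

Here `train_data` and `val_data` are expected to be Gluon `DataLoader` objects and `trainer` a `gluon.Trainer`; the returned list holds one `(training, validation)` pair per epoch.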

Measuring and plotting accuracy

Once training is complete, let’s plot training accuracy and validation accuracy vs epochs. As usual, we’ll use matplotlib. Pretty standard stuff, right?
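For reference, a minimal plotting sketch. Both helper names are hypothetical, and saving to a PNG file (rather than showing a window) is an assumption made so the code also works on a headless machine.

```python
def best_epoch(val_acc):
    """Return (epoch, accuracy) for the highest validation accuracy."""
    epoch = max(range(len(val_acc)), key=val_acc.__getitem__)
    return epoch, val_acc[epoch]


def plot_accuracy(train_acc, val_acc, path='accuracy.png'):
    """Plot both curves against the epoch number (needs matplotlib)."""
    import matplotlib
    matplotlib.use('Agg')  # render to a file, no display required
    import matplotlib.pyplot as plt

    epochs = range(len(train_acc))
    plt.figure()
    plt.plot(epochs, train_acc, label='Training')
    plt.plot(epochs, val_acc, label='Validation')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.savefig(path)
    plt.close()
```

`best_epoch` is handy next to the plot, since the top validation score is rarely reached at the last epoch.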


Network architecture, hyperparameters, layer parameters: so many combinations to explore… Let’s try to set some guidelines.

We’ll start from a basic CNN, known to work well on the MNIST dataset. We’ll apply it as is to Fashion-MNIST to get a baseline.

First, we’ll work on getting the best training performance possible, making sure that the network is large enough to learn the training dataset.

We’ll probably end up overfitting it in the process, which is why we’ll then work on improving validation accuracy.

Very well then, let’s get to work!

First try: basic CNN

The following network scores 99.2% validation accuracy on MNIST.

  • Convolutional layer with 64 3×3 filters, padding and stride set to 1 (1x28x28 → 64x28x28). This layer doesn’t shrink the image (a.k.a. ‘same’ convolution) as it’s quite small already.
  • Max Pooling layer with 2×2 pooling and stride set to 2 (64x28x28 → 64x14x14).
  • Convolutional layer with 64 3×3 filters, padding and stride set to 1 (64x14x14 → 64x14x14).
  • Max Pooling layer with 2×2 pooling and stride set to 2 (64x14x14 → 64x7x7).
  • Flatten layer (64x7x7 → 3136).
  • Fully connected layer with 256 neurons (3136 → 256).
  • Fully connected layer with 64 neurons (256 → 64).
  • Output layer with 10 neurons (64 → 10).
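The shapes in this list can be checked with the usual output-size formula for convolution and pooling layers, floor((size + 2 × padding − kernel) / stride) + 1. The sketch below applies it to the kernel, padding and stride values listed above; `out_size` is a hypothetical helper name.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for the four layers before Flatten
layers = [('conv', 3, 1, 1), ('pool', 2, 2, 0),
          ('conv', 3, 1, 1), ('pool', 2, 2, 0)]

channels, size = 64, 28
sizes = []
for name, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    sizes.append(size)
    print('%s: %dx%dx%d' % (name, channels, size, size))

flattened = channels * size * size
print('flatten:', flattened)
```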

We’ll use ReLU for all activation layers.

Here’s the result after 50 epochs (training log).

Epoch#49 Training=0.9988 Validation=0.9242

Top validation accuracy is 92.42%. This is significantly lower than the MNIST score (99.2%), which goes to show that Fashion-MNIST is indeed more difficult to learn. Good :->
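The log line above also shows how far apart the two accuracies are, which is the overfitting we expected. A small sketch for pulling the numbers out of a log line in this format (`parse_log_line` is a hypothetical helper):

```python
import re

LOG_LINE = re.compile(r'Epoch#(\d+) Training=([\d.]+) Validation=([\d.]+)')


def parse_log_line(line):
    """Return (epoch, training accuracy, validation accuracy) or None."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


epoch, train_acc, val_acc = parse_log_line(
    'Epoch#49 Training=0.9988 Validation=0.9242')
print('epoch %d: training minus validation = %.4f'
      % (epoch, train_acc - val_acc))
```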

On the bright side, it does look like this network is capable of learning the dataset. It also scored much higher than all non Deep Learning based techniques benchmarked on Fashion-MNIST (the top one is a variation of Support Vector Machines at 89.7%).

So, hurrah for Deep Learning, but let’s improve this score, shall we?

Second try: use a better optimizer

We used SGD with a fixed learning rate, which is OK to get a quick feeling for how the network performs. However, more modern optimizers should improve performance.

Popular choices include AdaDelta (paper), AdaGrad (pdf) or Adam (paper). Which one should we pick? It looks like everyone tends to rely on Adam, so let’s try it for 50 epochs (training log).

Epoch#10 Training=0.9716 Validation=0.9305

Top validation accuracy is 93.05% at epoch #10 (!). Adam does learn very fast indeed.
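To give an idea of why Adam moves so quickly, here is a sketch of its update rule for a single parameter, using the default hyperparameters from the Adam paper. This is an illustration in plain Python, not the MXNet implementation, and `adam_step` is a hypothetical name.

```python
import math


def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


# Minimize f(x) = x^2 (gradient 2x), starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    x, m, v = adam_step(x, 2 * x, m, v, t)
    print(t, x)
```

Because the step is divided by the running gradient magnitude, its size depends on the learning rate rather than on the raw scale of the gradient. With Gluon, switching optimizers should only require passing `'adam'` instead of `'sgd'` when creating the `gluon.Trainer`.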

Third try: add Batch Normalization

Batch Normalization (paper) is a technique that helps train faster and avoid overfitting by normalizing values for each training batch. The authors recommend applying it to the inputs of activation layers.

Let’s update our network accordingly and use this technique for both the convolutional layers and fully connected layers.
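As a rough picture of what a Batch Normalization layer computes at training time, here is a plain Python sketch for one feature across a batch. In a real layer, `gamma` and `beta` are learned, and running averages of the mean and variance are used at inference time; the function name is hypothetical.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then scale and shift it."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)  # batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in values]


batch = [1.0, 2.0, 3.0, 4.0]  # one feature, four samples
print(batch_norm(batch))
```

Whatever the scale of the incoming values, the activation function that follows sees inputs with zero mean and roughly unit variance.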

Here’s the result (training log).

Epoch#45 Training=0.9994 Validation=0.9331

Compared to previous runs, this one learned even faster. With respect to validation accuracy, we got a small improvement at 93.31%.

Fourth try: add Dropout

Training performance is now very good. Let’s now work on improving validation accuracy. Batch Normalization did help a bit, but we should be able to do even better by adding Dropout layers.

Dropout (paper) is a technique that randomly sets to zero a configurable fraction of the activations passed from one layer to the next. By throwing this wrench into the training process, we slow it down, make it work harder at figuring out unexpected inputs and hopefully help the model generalize better.

Let’s add 30% dropout after each convolution block and before each Dense layer. That’s a lot of Dropout: training should be much slower, so we’ll train for 100 epochs.
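Here is a plain Python sketch of what a 30% Dropout layer does to its inputs. It uses the common "inverted" variant, where surviving values are scaled up during training so nothing needs to change at inference time. The function name and the seed are arbitrary choices for illustration.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate` and scale
    the survivors by 1 / (1 - rate) so the expected sum is unchanged."""
    if not training or rate == 0:
        return list(values)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10
print(dropout(activations, rate=0.3, rng=rng))
print(dropout(activations, rate=0.3, training=False))  # unchanged at inference
```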

Here’s the training log. The best validation accuracy is reached at epoch #72: 94.39%. Dropout helped us squeeze out about one extra percentage point of accuracy!

Epoch#72 Training=0.9956 Validation=0.9439

Now what?

I’m sure we could go higher if we kept experimenting: tuning dropout, trying out different activation functions like Leaky ReLU, using data augmentation, maybe adding more convolution kernels and so on. This is a rather long post already, so let’s stop there 🙂 However, please keep tweaking, it’s the best way to learn (pun not intended) and the Gluon API makes it particularly easy to build networks programmatically.

Just out of curiosity, I ran this improved network on MNIST and got to 99.52% accuracy after only 16 epochs!

As always, thanks for reading. Happy to answer questions here or on Twitter.

Improving validation accuracy… there’s no way but the hard way :*)

Source: Deep Learning on Medium