Recognizing Styles of Brands and Products

On 17th May 2018, handbags became the talk of the town.

Items from 37 luxury brands such as Hermes, Chanel and Bijan were found in the homes of Najib Razak. Seizures from six locations included:

  1. 1,400 chains,
  2. 2,200 rings and bangles,
  3. 2,800 pairs of earrings,
  4. 14 tiaras,
  5. 423 watches from brands like Rolex, Chopard and Richard Mille,
  6. 234 pairs of sunglasses from Versace, Dior and Gucci,
  7. 567 handbags, including 272 from Hermes, and
  8. 26 currencies.

The total value was estimated at MYR 900 million to MYR 1.1 billion, making this the biggest seizure under Malaysia’s Anti-Money Laundering, Anti-Terrorism Financing and Proceeds of Unlawful Activities Act 2001. This is in relation to the international 1Malaysia Development Berhad (1MDB) case.

Image of items seized in the 1MDB case

Before the items were announced, I had begun developing a neural network to identify luxury brands. Najib Razak’s court trials began on 4th July 2018.

One attribute of this work is that it runs on a regular computer without a graphics processing unit (GPU). This is thanks to Jeremy Howard, Rachel Thomas and the Fast.AI community, who develop and maintain the library for use both with and without a GPU.

At the time of writing, Fast.AI had just announced on 2nd July that “AdamW and Super-convergence is now the fastest way to train neural nets”, building on the AdamW work of Ilya Loshchilov and Frank Hutter.

Comparison between Adam and AdamW from www.Fast.AI

The coding outline for multi-label classification of brands and products:

1. Set up the environment with library imports and paths
2. Data: inspection; assign 20% for validation
3. Basic model: product recognition (11 classes)
4. Expand: F-beta, weight decay, dropouts, ADAM, differential learning rates
5. Expand: data into multi-label brands and products (62 classes)
6. Apply training techniques: data augmentation, size-up, cosine annealing
7. Review findings.

Data: Styles

Thanks to Olga Belitskaya for hosting the ‘Styles’ data set, which made my experiment in training a convolutional neural network possible.

The data set contains 894 images, each 150 x 150 pixels, across 62 classes of brands and product groups, about 24 megabytes in total. As luxury items can be exclusive, some brands and product groups have few images on file.

Usually a data scientist would duplicate the images to train the model. I tried a different approach in an attempt to learn beyond conventional practice.

Sample of Styles data set

Model: ResNet34

As a product classifier, ResNet34 achieved 95% accuracy with little effort, and its accuracy increased further through guided training.

As a multi-label classifier for 62 classes, it began at 61% accuracy.
After about a dozen epochs it reached 75% accuracy, then overfit at the 13th epoch.
Where would it land?

In an earlier experiment using a ResNet50 model, I reached 77% accuracy.

Metric: F-Beta

Migrating to a multi-label classifier, the F-beta score metric is applied.

The F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0.

The beta parameter determines the weight of precision in the combined score. beta < 1 lends more weight to precision, while beta > 1 favors recall (beta -> 0 considers only precision, beta -> inf only recall).

Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples.
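The definition above can be checked directly. Below is a minimal plain-Python sketch of the F-beta formula (not the Fast.AI implementation), with hypothetical precision and recall values:

```python
def fbeta(precision, recall, beta=2.0):
    """Weighted harmonic mean of precision and recall:
    beta < 1 weights precision more, beta > 1 weights recall more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta = 1 this reduces to the familiar F1 score:
print(fbeta(0.8, 0.6, beta=1.0))  # harmonic mean of 0.8 and 0.6, about 0.6857
```

With beta = 2 (a common choice for multi-label work) the same inputs score closer to the recall value, which is exactly the "favors recall" behavior described above.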

Weight decay & Dropouts

Earlier experiments showed that the neural network overfits in fewer than 20 epochs. Hence, weight decay and dropouts are applied. Typically, weight decay helps the model generalize better.

What does ‘generalize better’ mean?

My intuition is that some features, such as corners and edges, are easily identified, while subtler patterns and less prominent features require more data and training to be ‘found’. I believe weight decay intentionally shrinks the weights of simple features so the model can pick up on complex features.
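Mechanically, weight decay adds a penalty that pulls every weight a little toward zero on each update. A minimal plain-Python sketch of one SGD step with L2-style weight decay (hypothetical numbers, not the library code):

```python
def sgd_step_with_decay(w, grad, lr=0.01, wd=1e-4):
    # weight decay adds wd * w to the gradient, shrinking every
    # weight slightly toward zero on each step
    return w - lr * (grad + wd * w)

w = 2.0
w = sgd_step_with_decay(w, grad=0.0, lr=0.1, wd=0.5)
print(w)  # 2.0 - 0.1 * 0.5 * 2.0 = 1.9, decays even with zero gradient
```

Note that the weight shrinks even when the gradient is zero: only weights the data keeps pushing back up survive, which is one way to see why decay discourages the model from memorizing.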

ADAM Optimizer

ADAM stands for Adaptive Moment Estimation, which comes from the work of Diederik Kingma of OpenAI and Jimmy Ba of the University of Toronto in their 2015 paper.

ADAM is an algorithm that calculates exponential moving averages of the gradient and the squared gradient; the parameters beta1 and beta2 control the decay rates of these moving averages, which effectively tunes the step size while the neural net is training.
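The update above can be written out in a few lines. A plain-Python sketch of a single Adam step for one scalar weight (toy numbers; a real optimizer applies this per parameter):

```python
import math

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and the squared gradient (v), with bias correction for early steps."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected mean
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected variance
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
w = adam_step(1.0, grad=0.5, state=state, lr=0.1)
print(w)  # about 0.9: the first step size is roughly lr, regardless of gradient scale
```

The bias correction is what makes the very first step about the size of the learning rate even though the moving averages start at zero.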

Differential Learning Rate

The Fast.AI library comes with pretrained weights, from which a data scientist starts by training the neural net’s final layers on the present data set. The initial training results of 95% and 61% are better than starting from random weights. Hence we apply transfer learning to tune the final layer.

The multi-label classifier started around 61% accuracy and would train up to 69%. To get better results, I unfroze all the layers while holding the batch normalization weights fixed. The neural net went on to achieve 83% accuracy in 7 epochs.

Training the last layer takes about 1 minute per epoch. Training all layers takes more time, about 5 minutes per epoch.

To clarify, the 7 epochs are applied in 3 cycles: cycle #1 has 1 epoch, then the length is multiplied by 2, giving 2 epochs in cycle #2 and 4 epochs in cycle #3. This method also relates to cosine annealing (more later).

Differential learning rates are applied in operation [23] as follows:

  • Early layers use learning rate: 0.00005
  • Middle layers use learning rate: 0.0002
  • Later layers use learning rate: 0.0006

The early layers of a neural net look for simple features, and later layers look for progressively more defined features. Hence the early layers of a model pretrained on ImageNet data would not require much change/learning. We therefore progress to higher learning rates (from 0.00005 to 0.0002 to 0.0006) in later layers to acquire finer features on the present data set.
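The idea can be sketched without the library: split the parameters into layer groups and scale each group’s update by its own rate. A toy plain-Python example (hypothetical one-weight groups, not the Fast.AI internals):

```python
lrs = [5e-5, 2e-4, 6e-4]  # early, middle, later layer groups

def step(layer_groups, grad_groups, lrs):
    """One SGD step with a different learning rate per layer group."""
    return [
        [w - lr * g for w, g in zip(group, grads)]
        for group, grads, lr in zip(layer_groups, grad_groups, lrs)
    ]

groups = [[1.0], [1.0], [1.0]]  # one toy weight per group
grads = [[1.0], [1.0], [1.0]]   # identical gradients everywhere
new = step(groups, grads, lrs)
# the later group moves 12x further than the early group (6e-4 vs 5e-5),
# so pretrained early layers are disturbed the least
```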

Data Augmentation

Data augmentation enables the neural net to learn image features better, because general features are identified across the varied copies. The data are augmented by:

  • Rotate image
  • Flip image horizontally
  • Zoom image

These are performed by the Fast.AI library rather than manually. What a convenience and time saver for data scientists!
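For intuition, two of the transforms above are simple array operations. A plain-Python sketch on a tiny 2 x 2 "image" of pixel values (illustrative only; the library works on real image tensors):

```python
def flip_horizontal(img):
    # mirror each row left-to-right
    return [row[::-1] for row in img]

def rotate_90(img):
    # rotate the image 90 degrees clockwise
    return [list(row) for row in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]
print(flip_horizontal(img))  # [[2, 1], [4, 3]]
print(rotate_90(img))        # [[3, 1], [4, 2]]
```

Each transformed copy keeps the same label, so the model sees the same brand or product from several viewpoints without any new photographs.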

Size Up: start smaller, go bigger.

I started training the multi-label classifier at image size 104 x 104 pixels, down from the original 150 x 150, to allow more training time for the neural net and limit overfitting.

This is also a feature in Fast.AI, not a manual step.

I stopped at size 128 x 128, as I wanted to be careful with my CPU temperature: I was training at ambient room temperature, with no air conditioning or liquid cooling rig.

I tried to put my GPU to good use but was unable to find a suitable driver for my NVIDIA GeForce GTX 860M, so I set up the CPU libraries instead. Any suggestions here are most welcome.

Cosine Annealing: Stochastic Gradient Descent with Restarts (SGDR)

Neural networks are sensitive to the learning rate. Initial training runs at one rate until accuracy plateaus; the learning rate is then lowered so the neural net can train and learn further, and so on.

Fast.AI can train a model in cycles: within a cycle, the learning rate is annealed (with the ADAM optimizer adjusting along the way) until the cycle completes, then the model restarts the next cycle at a learning rate higher than where it ended, training epoch by epoch in one sitting.

The cycle multiplier multiplies the number of epochs each time training restarts into the next cycle. This gives the neural net longer to find more features while ADAM adjusts the learning rate mathematically.
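The schedule described above (cosine annealing within each cycle, a restart to the top rate, and cycle lengths multiplied by 2) can be sketched in plain Python. The numbers mirror the 1-2-4 epoch run mentioned earlier; `steps_per_epoch` is a made-up value for illustration:

```python
import math

def sgdr_schedule(lr_max, cycle_len=1, cycle_mult=2, n_cycles=3, steps_per_epoch=10):
    """Cosine-annealed learning rates with warm restarts (SGDR).
    Cycle lengths grow by cycle_mult: here 1, 2, then 4 epochs."""
    lrs = []
    epochs = cycle_len
    for _ in range(n_cycles):
        total = epochs * steps_per_epoch
        for t in range(total):
            # cosine from lr_max down toward 0 within the cycle
            lrs.append(0.5 * lr_max * (1 + math.cos(math.pi * t / total)))
        epochs *= cycle_mult  # next cycle is longer
    return lrs

lrs = sgdr_schedule(lr_max=0.01)
# 1 + 2 + 4 = 7 epochs in total, matching the 3-cycle run above;
# each cycle ends near zero, then the rate jumps back to lr_max
```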

A similar technique is the ‘learning rate ensemble’, a manual process of recording the best learning rates and then assigning them when training the model.

Bonus: Learning Rate Finder

You will find in my code that I applied a learning rate finder provided within the Fast.AI library, based on the research of Leslie N. Smith on cyclical learning rates for training neural networks.

Finding the learning rate in operation [22], shown on my GitHub

In general, find where the graph line declines to its lowest point and use a learning rate one point before that lowest point.

In my case, I was looking for rates to use in the differential learning rate scheme before training all the layers. Since I like a small learning rate on the early layers, I picked 5e-5 (0.00005), then 2e-4 (0.0002), and third 6e-4 (0.0006): at each of these points, moving to the right, the graph line is still declining.
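Under the hood, the finder simply sweeps the learning rate geometrically over mini-batches while recording the loss. A plain-Python sketch of the sweep itself (the bounds and step count are illustrative, not the library defaults):

```python
def lr_candidates(lr_min=1e-5, lr_max=1.0, steps=100):
    """Geometric sweep of learning rates, one per mini-batch,
    as a learning rate finder would try them."""
    ratio = (lr_max / lr_min) ** (1 / (steps - 1))
    return [lr_min * ratio ** i for i in range(steps)]

lrs = lr_candidates()
# train one mini-batch at each rate, record the loss, then pick a
# rate shortly before the loss curve reaches its lowest point
```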


The process went through about 50 experiments, and I am glad to have learned the intuition behind setting parameters: the learning rate, the ADAM optimizer, weight decay, dropouts, differential learning rates, and the effect of batch normalisation.

An earlier experiment on ResNet50 went from 31% to 77% accuracy. The latest experiment on ResNet34 went from 61% to 83% accuracy.

Test time augmentation re-examines the model with augmented data to reveal a bigger picture of its accuracy. Since the multi-label data set had 22 classes with fewer than 10 images each, the accuracy dipped to 82%.

I suspect the thin data yields less stable accuracy. For example, some classes had only 2 images, so generalisation and precision would be questionable.

Another finding is that some images of jewelry were vague.

Data from Styles

What is that (the image above)?

You can view the code on GitHub here:

I started this project with two friends, Choy Hon Yoong and Muhammad Danial bin Rusdi, when we were attending AI.Saturdays, held by Nurture.AI (founded by Yap Jia Qing) and hosted by James Lee, Desmon, Yen Ping, Han Chong, Lee Rou En and Hafid, at venues sponsored by Mindvalley and NEXT Academy. Cloud computing and GPUs were provided by Google Cloud. Lastly, to Jeremy Howard, Rachel Thomas and the Fast.AI community at The Data Institute, USFCA: I would like to express my thanks.

I would love to hear your feedback, questions and suggestions.
– Khoo KC.

Source: Deep Learning on Medium