Original article was published by Prajwal Paudyal on Deep Learning on Medium
Training a Neural Network to do 18 different things.
Convolutional Neural Network (CNN) architectures can be pretty general purpose for vision tasks. In this article, I’ll relay my experience in using the same network architecture for 18 different classification tasks.
The classification tasks include facial features such as chin length (3 gradations), hair style (111 types), and hair color (10 colors).
For these experiments, I used the 10K version of the dataset (Google's Cartoon Set). On initial exploration, the dataset consists of 10 folders. The first order of business is to download the dataset from the website and extract it. You will see these 10 folders:
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Inside each sub-folder there are ‘.png’ image files of the cartoon and a .csv descriptor file.
Let’s do a quick visualization (check out my code on GitHub).
There are corresponding .csv files (with the same name as the images) that have descriptions in the format:
“face_shape”, 4, 7
“facial_hair”, 14, 15
“hair”, 29, 111
“eye_color”, 2, 5
Each of these descriptions can serve as a ‘feature’ along which we can build a deep learning network to classify the images. According to the dataset description page, “Each cartoon face in these sets is composed of 18 components that vary in 10 artwork attributes, 4 color attributes, and 4 proportion attributes”. The number of options per attribute (which will become the classes for each model) ranges from 3 for chin length to 111 for hairstyle.
According to the dataset design page, “Each of these components and their variation were drawn by the same artist, Shiraz Fuman, resulting in approximately 250 cartoon component artworks and ~10^13 possible combinations”.
As promised, I will build a total of 18 networks that should all (hopefully) specialize as feature classifiers. In a subsequent article, I will solve this same problem with several different approaches involving transfer learning and multi-label classification.
First the network definition:
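The original definition appeared as a code screenshot; here is a minimal Keras sketch of a comparable architecture. The exact layer widths and the dropout rate are my assumptions, chosen to illustrate the conv/pool/batch-norm/dropout stack described below and to land near the parameter count mentioned there.

```python
import tensorflow as tf

NUM_CLASSES = 7  # e.g. face_shape has 7 options; set per task

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(256, 256, 3)),
    # Strided convolution plus pooling shrinks the spatial dims quickly.
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2D(128, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(40, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

With these sizes the model comes in at roughly 1.1M trainable parameters, the bulk of them in the first dense layer after flattening.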
A little background on the various layers
Convolution layers help the network learn shift- (or space-) invariant features, building that prior belief into the network structure. In this sense, convolutional neural networks are regularized versions of fully connected networks and greatly reduce the number of parameters to learn. The network structure used here has slightly over a million parameters to train.
Max pooling layers perform sample-based discretization with the objective of down-sampling the input representation. The feature maps start at the input size of (256, 256, 3) and gradually get narrower and deeper as they pass through the network, due to the max-pooling layers as well as the strides selected.
Batch normalization shifts the inputs to zero mean and unit variance. A very high-level understanding is that it helps make the data comparable across features. It is generally known to allow faster learning rates (although there are other interpretations of its effectiveness).
Finally, dropout is a regularization technique that approximates training a large ensemble of neural networks with many different architectures. It achieves this by randomly dropping the activations of nodes in each layer (as specified by the dropout probability). The effect is that it introduces noise into the training process and thus, like any regularization technique, helps the network generalize better.
Now, on to training the network. For quick iteration, I wanted to use Keras’s image_dataset_from_directory, as it takes care of image resizing, validation splits, interpolation, and batching. The function returns a tf.data.Dataset, which is pretty straightforward to manipulate and work with.
train_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    "cache/face_shape",  # illustrative path: one sub-folder per class
    image_size=(256, 256), batch_size=32,
    validation_split=0.2, subset="training", seed=1)
Since image_dataset_from_directory infers labels from the folder structure (one sub-folder per class), I wrote a function that copies the files into a cache temp directory laid out that way. I used an SSD location to speed up IO. The function is copy_images_to_labels_folder, which is available in the GitHub repo.
Additionally, I set up a tensorboard callback to visualize the loss. Here is a sample visualization for the neural network built to classify ‘face_shapes’.
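Setting that up is only a couple of lines; the log directory naming here is illustrative, with one timestamped run per feature so the curves stay separated in TensorBoard.

```python
import datetime
import tensorflow as tf

# One run directory per model/feature, timestamped so re-runs don't collide.
log_dir = "logs/face_shape/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

# Pass it to fit(), then run `tensorboard --logdir logs` to watch the loss:
# model.fit(train_dataset, epochs=30, callbacks=[tensorboard_cb])
```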
Finally, here is a table showing the accuracies obtained by the various networks. I trained all of these on an NVIDIA 2080 Ti.
Everything worked in the end, but for some of the cases, like ‘eye-lashes’, the network had initially failed to converge. Specifically, the loss went to ‘nan’, which indicates either an exploding-gradient or a vanishing-gradient problem. Below are the main reasons I have found for nans showing up while training a convnet. This is not meant to be an exhaustive list.
a) Exploding gradients — your LR is probably too large.
b) Faulty loss function — are you using a custom loss, or mis-using a standard one? I once used an output layer with fewer nodes than the number of classes I was predicting and got nans all day.
c) Input data — are there corrupt instances, as in the case of the first network here?
d) Hyperparameters — some of them have co-dependencies, so check the documentation before changing the defaults.
e) Improper LR — try an adaptive optimizer like Adam to see if it helps.
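Points (a), (b), and (e) can each be addressed with a line or two of Keras; the learning rate, clipping value, and class count below are illustrative, not the settings used in the article.

```python
import tensorflow as tf

NUM_CLASSES = 111  # e.g. hairstyle; set per task

# (a)/(e) For exploding gradients or a bad LR: an adaptive optimizer with a
# modest learning rate, plus gradient-norm clipping as a safety net.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)

# (b) The output layer must have exactly one node per class, and from_logits
# must match whether that layer applies softmax itself.
outputs = tf.keras.layers.Dense(NUM_CLASSES)  # no softmax -> from_logits=True
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```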
I used integrated gradients, a technique for attributing a deep network’s prediction to its input features. The diagrams below can be read as where the network was paying ‘attention’ when it made its decision (see the GitHub page for the code and other visualizations, and the reference page to learn more).
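The core idea fits in a few lines: average the gradient of the output along a straight path from a baseline input to the actual input, then scale by the input difference. This is a toy NumPy sketch of that math, not the code used for the article’s visualizations.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Attribute a scalar model output to the features of input `x`.

    grad_fn(x) must return the gradient of the output w.r.t. x. We average
    that gradient at `steps` points along the straight path from `baseline`
    to `x`, then scale by (x - baseline).
    """
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.array([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# For a linear model f(x) = w . x the gradient is constant, so the
# attributions are exactly w * (x - baseline), and they sum to
# f(x) - f(baseline) (the completeness property).
w = np.array([1.0, -2.0, 3.0])
ig = integrated_gradients(lambda x: w, np.ones(3), np.zeros(3))
```

For an image classifier, `x` is the image, the baseline is usually a black image, and the per-pixel attributions are what get rendered as the ‘attention’ heat map.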
A few of the results:
- Eye color: 85.00%
- Eye–eyebrow distance: 84.00%
- Glasses color: 54.66%
- Face color: 26.04%
- Hair color: 96.13%
- Hair style: 99.70%
If you want to know a bit more about interpretability and the general sub-field of explainable-AI, look at this post.
Time Spent Analysis:
I spent about 3 hours coding, and all 18 models trained overnight on an NVIDIA 2080 Ti for 30 epochs.
In this post I shared my experience setting up and training a CNN to solve 18 different classification tasks without any intervention or hyper-parameter tuning. The performance of these networks can definitely be improved with more data, transfer learning, a more robust architecture, or a more careful selection of hyper-parameters, but that is hardly the point: CNNs are pretty general purpose and do decently well without much time spent.
If you found the article or code helpful, or have suggestions, let’s continue the discussion in the comments section.
Interested in computer vision, generative networks, or reinforcement learning? Follow me here for future articles, and come network on LinkedIn.