Training the same CNN to do 18 different things and visualizing what it learned.



Toon Images


Convolutional Neural Network (CNN) architectures can be pretty general-purpose for vision tasks. In this article, I’ll share my experience using the same network architecture for 18 different classification tasks.

The classification tasks involve facial features such as chin length (3 gradations), hair style (111 types), and hair color (10 colors).

I will be using the CartoonSet 100k Image dataset from Google available here. My code for these experiments is available here.

For these experiments, I used the 10k version of the dataset. The first order of business is to download the dataset from the website and extract it. On initial exploration, you will see these 10 folders:

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
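
A quick sanity check in Python (the directory name is an assumption based on where you extract the archive):

import os

data_dir = 'cartoonset10k'  # assumed extraction path; adjust to where you unpacked the archive
print(sorted(os.listdir(data_dir)))  # should print the 10 folders above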

Inside each sub-folder there are ‘.png’ image files of the cartoons, each with a matching .csv descriptor file:

['cs11502169095236683120.csv',
'cs11502169095236683120.png',
'cs11502298889929094331.csv',
'cs11502298889929094331.png',
'cs11502404786906647764.csv',
'cs11502404786906647764.png',
'cs11502407216397343631.csv',
'cs11502407216397343631.png',
'cs11502919926067511421.csv',
'cs11502919926067511421.png']

Let’s do a quick visualization (check out my code on GitHub):
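
If you would rather not dig through the repo, a minimal version of that visualization looks roughly like this (the glob pattern assumes the directory layout above):

import glob
import random

import matplotlib.pyplot as plt
from PIL import Image

# Plot a random 3x3 grid of cartoon faces.
paths = random.sample(glob.glob('cartoonset10k/*/*.png'), 9)
fig, axes = plt.subplots(3, 3, figsize=(6, 6))
for ax, path in zip(axes.flat, paths):
    ax.imshow(Image.open(path).convert('RGB'))  # drop the alpha channel
    ax.axis('off')
plt.tight_layout()
plt.show()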

There are corresponding .csv files (with the same names as the images) that describe each cartoon, one attribute per row, in the format:

“face_shape”, 4, 7
“facial_hair”, 14, 15
“hair”, 29, 111
“eye_color”, 2, 5
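
A minimal reader for these descriptor files could look like the sketch below; each row holds the attribute name, the selected variant, and the number of variants (the example file name is just one from the listing above):

import csv

def read_attributes(csv_path):
    # Each row: attribute name, selected variant index, number of variants.
    with open(csv_path) as f:
        reader = csv.reader(f, skipinitialspace=True)
        return {row[0]: int(row[1]) for row in reader}

attrs = read_attributes('cartoonset10k/0/cs11502169095236683120.csv')
print(attrs['hair'])  # e.g. 29, one of the 111 hairstyles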

Each of these descriptions could be a ‘feature’ along which we can build a deep learning network to classify the images. According to the dataset description page, “Each cartoon face in these sets is composed of 18 components that vary in 10 artwork attributes, 4 color attributes, and 4 proportion attributes”. The number of options per attribute (what will become the classes for each model) ranges from 3 for chin length to 111 for hair style.

According to the dataset design page, “Each of these components and their variation were drawn by the same artist, Shiraz Fuman, resulting in approximately 250 cartoon component artworks and ~10^13 possible combinations”.

As promised, I will build a total of 18 networks, each of which should (hopefully) specialize as a feature classifier. In a subsequent article, I will solve this same problem with several different approaches based on transfer learning and multi-label classification.

First the network definition:

Structure of the neural network I used, built with Keras. You will note that after every convolutional layer I do a max pooling, a batch normalization, and a dropout. The exact order of these layers is really a matter of opinion and should not greatly affect the performance; you could reverse the order of batch norm and dropout and see if it works better. My guess is that it wouldn’t change much either way. Note that only the final dense layer has a different number of nodes for the different classification tasks.
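
Since the figure is not reproduced here, below is a minimal Keras sketch matching that description. The filter counts, dropout rate, and dense width are my assumptions; only num_classes changes between the 18 tasks:

from tensorflow.keras import layers, models

def build_model(num_classes, input_shape=(256, 256, 3)):
    model = models.Sequential([layers.Input(shape=input_shape)])
    # Repeated blocks: convolution -> max pooling -> batch norm -> dropout.
    for filters in (32, 64, 128):
        model.add(layers.Conv2D(filters, 3, activation='relu'))
        model.add(layers.MaxPooling2D())
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation='relu'))
    # Only this final layer differs across the 18 classifiers.
    model.add(layers.Dense(num_classes, activation='softmax'))
    return model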

A little background on the various layers

Convolution layers help the network learn shift- and space-invariant features, introducing that prior belief into the network structure. In this sense, convolutional neural networks are regularized versions of fully-connected networks and greatly reduce the number of parameters to be learned. This network structure, as can be seen in the figure above, has slightly over a million parameters to train.

Max pooling layers perform sample-based discretization with the objective of down-sampling the input representation. You can see that the feature maps start at the input size of (256, 256, 3) and gradually become narrower and deeper as they move through the network, due to the max-pooling layers as well as the strides selected.

Batch normalization shifts the inputs to zero mean and unit variance. A very high-level understanding is that it makes the data comparable across features. It is generally known to permit faster learning rates (although there are other interpretations of its effectiveness).

Finally, dropout is a regularization technique that approximates training a large number of neural networks with many different architectures. It achieves this by randomly dropping the activations of nodes in each layer (with the specified dropout probability). The effect is that it introduces noise into the training process and thus, like any regularization technique, helps the network generalize better.

Now, on to training the network. For quick iteration, I wanted to use Keras’ image_dataset_from_directory, as it takes care of image resizing, validation splits, interpolation, and batching. The function returns a tf.data.Dataset, which is pretty straightforward to manipulate and work with.

import tensorflow as tf

train_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    training_dir,  # one sub-folder per class; created below
    labels="inferred",
    label_mode="int",
    class_names=None,
    color_mode="rgb",
    batch_size=32,
    image_size=(256, 256),
    shuffle=True,
    seed=42,
    validation_split=0.2,
    subset="training",
    interpolation="bilinear",
    follow_links=False,
)

Since labels=”inferred” derives each label from the directory structure, I wrote a function that copies the files into a cache temp directory with one sub-folder per class. I used an SSD location to speed up IO. The function, copy_images_to_labels_folder, is available in the GitHub repo.
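
For reference, a simplified sketch of what that function does (the real implementation is in the repo; the csv parsing mirrors the reader above, and the paths are illustrative):

import csv
import glob
import os
import shutil
import tempfile

def copy_images_to_labels_folder(src_dir, attribute, dest_root=None):
    # Copy each image into a sub-folder named after its class for `attribute`,
    # which is the layout image_dataset_from_directory expects for inferred labels.
    dest_root = dest_root or tempfile.mkdtemp()  # point this at an SSD for faster IO
    for csv_path in glob.glob(os.path.join(src_dir, '*', '*.csv')):
        with open(csv_path) as f:
            attrs = {row[0]: row[1] for row in csv.reader(f, skipinitialspace=True)}
        class_dir = os.path.join(dest_root, attrs[attribute])
        os.makedirs(class_dir, exist_ok=True)
        shutil.copy(csv_path.replace('.csv', '.png'), class_dir)
    return dest_root

training_dir = copy_images_to_labels_folder('cartoonset10k', 'face_shape')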

Additionally, I set up a TensorBoard callback to visualize the loss; a sample visualization for the network built to classify ‘face_shape’ is discussed below.
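
Wiring the callback into training looks roughly like this (build_model is the sketch from earlier; val_dataset comes from the same image_dataset_from_directory call with subset=”validation”; this is my reconstruction rather than the exact repo code):

import tensorflow as tf

tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir='logs/face_shape')

model = build_model(num_classes=6)  # 6 face shapes in this run
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer labels (label_mode='int')
              metrics=['accuracy'])
model.fit(train_dataset,
          validation_data=val_dataset,
          epochs=30,
          callbacks=[tensorboard_cb])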

There were a total of 6 face shapes in the dataset. This network structure seems to be pretty good at differentiating face shapes, as can be seen by both the training and the validation loss approaching 0 very quickly. The final accuracy of the model trained for 30 epochs was 100%. It is interesting to note that the validation loss decreases at a much faster rate than the training loss; this is because the training loss is a trailing, smoothed measure (averaged over the batches within the epoch, while the validation loss is computed once at the end of it).
For the ‘face-color’ detector, the network shows overfitting, as seen by the gap between the green (validation) and pink (training) metrics. This could be mitigated by increasing the regularization (for example, the dropout probability) or by getting more training data. Ultimately the network reaches an okay accuracy. Or perhaps discarding the alpha channel during the conversion from RGBA to RGB messes things up.

Finally, here is a table showing the accuracies obtained by the various networks. I trained all of these on an NVIDIA 2080 Ti.

Accuracies obtained after 30 epochs by CNNs trained on the same architecture. The first row is blank because of some corrupt data. The second column denotes the number of classes for each facial attribute.

Almost everything worked the first time, but for a few cases like ‘eye-lashes’ the network initially failed to converge: the loss went to ‘nan’, which indicates either an exploding-gradient or a vanishing-gradient problem. Below are the main reasons I have found for nans showing up while training a convnet. This is not meant to be an exhaustive list.

a) Exploding gradients — your learning rate is probably too large.

b) Faulty loss function — are you using a custom loss, or using a standard one improperly? I once used an output layer with fewer nodes than the number of classes I was predicting and got nans all day.

c) Input data — are there corrupt instances, like in the case of the first network here?

d) Hyperparameters — some of them have co-dependencies, so check the documentation before changing the defaults.

e) Improper learning rate — try an adaptive optimizer like Adam to see if it helps, as sketched below.
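
As a concrete example for (a) and (e), a conservative learning rate plus gradient clipping is a common first fix; the exact values here are illustrative, not necessarily what I used:

from tensorflow.keras.optimizers import Adam

# An adaptive optimizer with a small learning rate and gradient clipping
# guards against the exploding-gradient flavor of nan losses.
model.compile(optimizer=Adam(learning_rate=1e-4, clipnorm=1.0),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])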

Finally, Interpretability.

I used integrated gradients, a technique for attributing the prediction of a deep network to its input features. The diagrams below can be interpreted as where the network was paying ‘attention’ when it made its decision (see the GitHub page for code and other visualizations, and the reference page to learn more).
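
The core of integrated gradients is short enough to sketch here. The idea is to accumulate gradients along a straight path from a baseline image to the actual input; the black baseline and 50-step count below are common defaults, not necessarily what my repo uses:

import tensorflow as tf

def integrated_gradients(model, image, target_class, baseline=None, steps=50):
    # image: float32 tensor of shape (256, 256, 3), scaled like the training inputs.
    if baseline is None:
        baseline = tf.zeros_like(image)  # black-image baseline
    # Interpolate between the baseline and the input in `steps` increments.
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps + 1), (-1, 1, 1, 1))
    interpolated = baseline + alphas * (image - baseline)
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        preds = model(interpolated)
        target = preds[:, target_class]
    grads = tape.gradient(target, interpolated)
    # Trapezoidal approximation of the path integral of the gradients.
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)
    return (image - baseline) * avg_grads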

CHIN-LENGTH: 97%

First, the chin-length classifier: both the normal and the integrated gradients seem to have found the importance of the curvature of the chin. However, they are still not totally honed in.

EYE COLOR: 85%

For the eye-color classifier, the normal gradients look spot on; you almost have to squint to see them. The integrated gradients, not so much. Perhaps this is because of some implementation hyperparameters that I am not setting correctly.

EYE- EYEBROW DISTANCE: 84%

This one is the eye-to-eyebrow distance classifier. The gradients are spot on.

GLASSES COLOR: 54.66%

The normal gradients look like they are in the expected spot, but color differentiation is definitely not a strong suit for this CNN. Can you guess why? (Hint: think about channel information — when and how the channels get combined.) I’ll include a mitigation strategy in a future post.

FACE COLOR: 26.04%

Again, a bad one related to color classification; the CNN seems to be color-blind (not always a bad thing, some would say). Looking at the gradient map is a good way to check whether the neural network is doing at least the intuitive thing. (This image was misclassified, by the way.)

Here is how the training went for this one:

The training shows wild variation in the validation loss with a relatively smooth curve for training, and there are tell-tale signs of overfitting. Given that I turned dropout off almost completely (for speed) and relied on what little regularization effect batch normalization has, a quick experiment to remedy this would be to use a decent level of dropout.

HAIR COLOR: 96.13%

The hair color classifier was much better than expected.

HAIR STYLE: 99.70%

The hair-style classifier, with 111 classes, did surprisingly well, reaching a validation accuracy of over 99.7% in only 30 epochs.

If you want to know a bit more about interpretability and the general sub-field of explainable-AI, look at this post.

Time Spent Analysis:

I spent about 3 hours coding, and all 18 models trained overnight on an NVIDIA 2080 Ti for 30 epochs each.

Conclusion:

In this post I shared my experience setting up and training a CNN to solve 18 different classification tasks without any intervention or hyper-parameter tuning. The performance of these networks can definitely be improved by using more data, transfer learning, a more robust architecture, or a more careful selection of hyper-parameters, but that is hardly the point. The point is that CNNs are pretty general purpose and do decently well without much time spent.

If you found the article or code helpful, or have suggestions, let’s continue the discussion in the comments section.

Interested in computer vision, generative networks, or reinforcement learning? Follow me here for future articles, and come network on LinkedIn.