Generalizable Real-World Classifiers for Computer Vision

Original article was published on Deep Learning on Medium

Data Augmentation Approaches

Color Transformations

In order to differentiate between pixels, I thought the best way to augment the data would be to single out the colors in the images. Thus, I spent most of my time experimenting with the transforms.ColorJitter function. I thought that by adjusting its hue, saturation, brightness, and contrast fields, I would be able to sharpen the characteristics of all the images. I began with brightness, because I wanted to see visually what would happen as I changed it. The picture on the left shows the image after simple normalization only.

However, when the brightness is increased too much (e.g. brightness = 10), all of the colors merge together, so it is important to adjust the brightness carefully. I found a good medium was to set brightness to 1.

The same logic applied to saturation, contrast, and hue. Increasing the values slightly emphasized the colors, but past a certain threshold all the colors merged together. The graph here shows, for instance, how the accuracy of the model changes with increased contrast levels.

With contrast set to 0.8, the accuracy on the augmented data is much higher than with contrast set to 5.

At this point, I realized that by trying to emphasize all the pixels, I was overfitting to the training data and ultimately increasing the validation loss. Thus, instead of increasing the contrast to emphasize the data, I decided to try the opposite and narrow the range of these images. After doing more research, I saw that techniques such as blurring and randomizing would help the model generalize better to validation data.

As a result, I decided to move away from increasing the ColorJitter fields. I found better results by setting all four of the fields to 0.1. I assume this is because it heightened the colors, but not enough to extend the range and overfit the data.

Instead, I thought a great way to compress these images would be to blur them. Unfortunately, the transforms image augmentation functions available to me did not include blurring. If I had more time, I would add a function to randomly blur images with some probability p.

Thus, I tried randomizing tactics. While experimenting with random horizontal and vertical flips, I noticed that horizontal flips were much more effective than vertical ones.

This might be due to the specific data that we are using. In addition, I noticed that this random flip with probability 0.2 did not really change the accuracy. However, with probability 0.6, there was a significant increase in accuracy. This is probably because 0.6 would actually make an impact on such a large dataset instead of allowing 0.8 of the data to stay the same.

Moreover, I played around with the RandomCrop and Pad functions in the transforms package. RandomCrop crops the image to a certain size at a random location. To keep the image sizes consistent throughout the dataset, you then pad the cropped image back to the original size.

In the torchvision.transforms augmentation package, there are various other randomizing tactics, such as RandomAffine, RandomGrayscale, RandomRotation, etc., but I noticed the best results came from RandomCrop/Pad and RandomHorizontalFlip. Thus, I decided to try to combine a few of the concepts together to see the results.

These were the best results from the various combinations. As we can see, the RandomHorizontalFlip seems to be performing the best, while the ColorJitter does not have as good results.

Overall, my initial research into color augmentation did not seem to perform well. I thought that by extending the variation in color (especially through contrast), the model would be able to better differentiate between objects and classify them better. However, I forgot to account for new, unseen data, and the validation loss was a lot larger as a result. Therefore, instead of extending this range, I thought the better tactic would be to compress the images more through blurring and randomization. By using this strategy, I was able to increase my validation accuracy significantly at the cost of some training accuracy.

Random Erase

A common perturbation in images in the real world is when an item that should be recognized is partially blocked, for example when a stop sign is partially blocked by a tree branch. Image augmentation can be used to simulate such cases without having to collect and classify new images where the main object is obstructed. In Comparing Data Augmentation Strategies for Deep Image Classification, the authors speak to the issue of insufficient training data for training deep nets and compare the impacts of using Rotation, Skew Tilt, Shear, Random Erase, Random Distortion, and Gaussian Distortion as data augmentation strategies on an adapted ResNet model. The single augmentation with the best accuracy was Random Erase, with a 1.5% increase in accuracy compared to their ResNet benchmark. PyTorch’s RandomErasing transform cites the 2017 Random Erasing Data Augmentation paper, which tested the impact of graying out the randomly-erased boxes versus filling them with random pixels, among other parameters.

source: Random Erasing Data Augmentation

Filling in the boxes with random pixels lowered their test error rate the most, so I set value='random' and left all other transforms.RandomErasing parameters at their defaults, which matched successful parameters from the source paper. Inspired by the paper’s finding that Random Erase was complementary to other data augmentations, I also investigated the impact of combining it with Random Flip, and with Random Flip and Random Crop. I used PyTorch’s pre-trained ResNet-18 model and trained on the augmented data for 10 epochs for the results below.

Disappointingly, including Random Crop caused a significant drop in accuracy on images that were augmented using cropping, but the baseline model did even worse on that data set, which shows that cropped images are just more challenging to classify. Despite the relatively low accuracy of Model-FCE, we still included it in our ensemble because it did so much better than the baseline on cropped images, and did not do devastatingly worse on clean images. The other two models showed exciting improvements over the baseline on both augmented and clean data, so they were clear candidates for the ensemble.

Adversarial Attacks

Adversarial examples are what separates computer vision from human vision the most. By exploiting a neural network’s gradients, we can modify an image such that the human eye does not notice any difference, while the network outputs an incorrect prediction with high confidence. Below is an illustration of the method that we used, the Fast Gradient Sign Method, first introduced in Explaining And Harnessing Adversarial Examples [1]. With access to the network gradients, it adds a small amount of noise that looks random to the human eye, yet is chosen to increase the network’s loss.
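In code, the Fast Gradient Sign Method amounts to one gradient computation with respect to the input, followed by a step of size epsilon in the direction of the gradient's sign. The sketch below uses a tiny linear model and random data as stand-ins (assumptions) for the real networks and images; the function name fgsm_attack is ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon):
    """Return images + epsilon * sign(gradient of the loss w.r.t. the images)."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    # Gradient with respect to the input pixels, not the model weights.
    grad, = torch.autograd.grad(loss, images)
    return (images + epsilon * grad.sign()).detach()

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))  # stand-in
images = torch.randn(4, 3, 8, 8)
labels = torch.tensor([0, 1, 2, 3])

adv_images = fgsm_attack(toy_model, images, labels, epsilon=0.2)
# Every pixel moves by at most epsilon.
print((adv_images - images).abs().max().item())
```

For a black-box attack like the one described in this section, the model passed to fgsm_attack would be the source network, and the resulting images would then be fed to a different target classifier.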


What’s problematic about the adversarial attacks in the real world is that the examples often transfer from one model to another. This allows the attackers to fool the targeted classifier even without access to its parameters. Such an attack is called a black-box attack and we strived to make our classifier more robust to it. We used DenseNet, fine-tuned on our clean dataset, as our source network to craft adversarial examples based on its gradients. We then used those examples to attack our classifier based on the ResNet architecture.

Adversarial Objective Function

In our first approach, we modified the objective function as proposed in [1].


We used cross-entropy as our original loss function. To compute the new loss, we first compute the gradients of the cross-entropy function w.r.t. the original images in the mini-batch. We then create adversarial examples based on the fast gradient sign method and compute another loss on those examples. Our final loss is a linear combination of the two losses as seen above. In our training, we used alpha=0.5 and epsilon=0.2. Since epsilon scales the added noise, we can interpret it as the adversary strength. Below is a comparison of the validation accuracy of our baseline model and a new model trained using the modified objective for different epsilon values. The data we used to evaluate each model performance was created using DenseNet, as explained earlier.
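A minimal sketch of this objective, assuming the form given in [1]: alpha * J(x, y) + (1 - alpha) * J(x + epsilon * sign(grad_x J(x, y)), y). The toy model, the random data, and the function name are placeholders rather than the code used in the project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adversarial_objective(model, images, labels, alpha=0.5, epsilon=0.2):
    """alpha * clean loss + (1 - alpha) * loss on FGSM examples."""
    images = images.clone().detach().requires_grad_(True)
    clean_loss = F.cross_entropy(model(images), labels)
    # retain_graph=True because clean_loss is back-propagated again later.
    grad, = torch.autograd.grad(clean_loss, images, retain_graph=True)
    # The adversarial images are treated as constants during training.
    adv_images = (images + epsilon * grad.sign()).detach()
    adv_loss = F.cross_entropy(model(adv_images), labels)
    return alpha * clean_loss + (1 - alpha) * adv_loss

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))  # stand-in
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
images = torch.randn(4, 3, 8, 8)
labels = torch.tensor([0, 1, 2, 3])

# One training step on the combined loss.
optimizer.zero_grad()
loss = adversarial_objective(toy_model, images, labels)
loss.backward()
optimizer.step()
```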

Our network accuracy on the adversarial examples improved by 4–5% even though it has never seen them before.

Adversarial Data Training

In our second approach, we injected adversarial examples into the training set, as proposed in Adversarial Machine Learning at Scale. Because of the transferability of adversarial examples, training the network on examples derived from its own gradients should make it more robust to black-box attacks derived from different architectures.


We followed the above algorithm with m=32, k=16, and optimized the network using a modified loss function, with hyperparameter lambda=0.5.
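The sketch below shows one way the mini-batch step could look, assuming the algorithm is the one from Adversarial Machine Learning at Scale: k of the m examples in each batch are replaced by adversarial versions crafted from the current model, and the loss is (sum of clean losses + lambda * sum of adversarial losses) / ((m - k) + lambda * k). The use of FGSM with a fixed epsilon = 0.2, the toy model, and all function names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm(model, images, labels, epsilon):
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad, = torch.autograd.grad(loss, images)
    return (images + epsilon * grad.sign()).detach()

def mixed_batch_loss(model, images, labels, k=16, lam=0.5, epsilon=0.2):
    """Weighted loss over a batch whose first k examples are adversarial."""
    m = images.shape[0]
    adv_images = fgsm(model, images[:k], labels[:k], epsilon)
    adv_loss = F.cross_entropy(model(adv_images), labels[:k], reduction='sum')
    clean_loss = F.cross_entropy(model(images[k:]), labels[k:], reduction='sum')
    return (clean_loss + lam * adv_loss) / ((m - k) + lam * k)

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))  # stand-in
images = torch.randn(32, 3, 8, 8)      # m = 32
labels = torch.randint(0, 5, (32,))

loss = mixed_batch_loss(toy_model, images, labels)  # k = 16, lambda = 0.5
loss.backward()
```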


Compared to our first approach, the new model maintained the improvement on the adversarial data, but this time performed better on the clean data.

Stability Training for Random Noise

As we’ve seen, small perturbations in the input images can distort the neural network outputs. While adversarially crafted examples can occur in the real world, random noise perturbations are more common due to things such as JPEG compression, resizing, or camera debris. We implemented Stability Training, proposed in Improving the Robustness of Deep Neural Networks via Stability Training, to make our model more robust to small random distortions. Below is an example from our dataset.

The main idea is to train the network on both the original images and copies with added random noise sampled from a normal distribution, and to add a second loss term that pushes the outputs on the perturbed images to be as close as possible to the outputs on the original images. The schematic of the training process is illustrated below.


The final loss is a combination of the original loss (cross-entropy) and the stability loss. For the stability loss, we use KL-Divergence to measure the “distance” between the likelihood on the natural and perturbed inputs. To compute the probabilities for the stability loss, we use torch.nn.functional.softmax and torch.nn.functional.log_softmax.
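One way to write this loss is sketched below. F.kl_div expects log-probabilities as its first argument and probabilities as its second, which is where log_softmax and softmax come in. The defaults alpha = 0.01 and noise standard deviation 0.1 are the values reported in this section; the toy model and data are stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def stability_training_loss(model, images, labels, alpha=0.01, noise_std=0.1):
    """Cross-entropy on clean inputs plus alpha * KL(clean || perturbed)."""
    clean_logits = model(images)
    noisy_logits = model(images + noise_std * torch.randn_like(images))
    task_loss = F.cross_entropy(clean_logits, labels)
    stability = F.kl_div(
        F.log_softmax(noisy_logits, dim=1),  # log-probabilities, perturbed
        F.softmax(clean_logits, dim=1),      # probabilities, clean
        reduction='batchmean',
    )
    return task_loss + alpha * stability

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 5))  # stand-in
images = torch.randn(4, 3, 8, 8)
labels = torch.tensor([0, 1, 2, 3])

loss = stability_training_loss(toy_model, images, labels)
loss.backward()
```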


We tuned the hyperparameter alpha and found that 0.01 gave the best trade-off between validation accuracy on clean data and on noisy data. We sampled the noise from a normal distribution with mean 0 and standard deviation 0.1 (since our inputs are normalized). Below is our model performance using those parameters.

Reflection/Rotation Invariance


In the real world, objects are viewed in differing orientations. However, datasets often standardize the orientation of images so that their subjects are viewed “right-side-up”. Nonetheless, a dog is a dog no matter the orientation; therefore, it is desirable for the model’s predictions to be orientation-invariant, i.e. the model predicts the same classes regardless of the orientation of the image.

Each image in Tiny ImageNet is cropped to be square. Therefore, there are eight transformations to consider: the identity transform; rotations by 90º, 180º, and 270º; horizontal and vertical reflections; and reflections across the diagonals of the square. Each of these transforms corresponds to an element of the group of symmetries of the square (the dihedral group of order 8), which are the transformations that preserve the overall shape of the image. Rotations by arbitrary angles and reflections across arbitrary axes require the image to be cropped, and therefore downscaled, and thus were not considered.

To augment Tiny ImageNet, the RandomSymmetryTransform was implemented, which randomly selects one of these transforms (including the identity) and applies it to the given image. This is accomplished using the rotation/reflection transformations provided in torchvision.transforms.functional. Here is an example of an image from the training set, before and after the transform was applied:
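The project's implementation is not shown, so the following is only a sketch of how such a transform might be written. It swaps torchvision.transforms.functional for torch.rot90 and torch.flip on C x H x W tensors, and the helper apply_symmetry is a hypothetical name. Composing a number of quarter turns with an optional horizontal flip reaches all eight symmetries.

```python
import random
import torch

def apply_symmetry(image, quarter_turns, reflect):
    """Rotate a C x H x W tensor by quarter turns, then optionally mirror it."""
    out = torch.rot90(image, quarter_turns, dims=(1, 2))
    if reflect:
        out = torch.flip(out, dims=(2,))  # horizontal reflection
    return out

class RandomSymmetryTransform:
    """Apply one of the 8 symmetries of the square, chosen uniformly."""

    def __call__(self, image):
        quarter_turns = random.randrange(4)  # 0, 90, 180 or 270 degrees
        reflect = random.random() < 0.5      # with or without a reflection
        return apply_symmetry(image, quarter_turns, reflect)

random.seed(0)
image = torch.rand(3, 64, 64)  # stand-in for a Tiny ImageNet image
transformed = RandomSymmetryTransform()(image)
print(transformed.shape)
```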

In this case, the RandomSymmetryTransform chose to apply a 180º rotation. Because each transformation is applied with equal probability, the training data will on average contain an equal amount of each image in each orientation. Therefore, we should expect that the model is not biased towards any particular orientation of an image. The RandomSymmetryTransform was applied to both the training and validation sets to create an augmented version of Tiny ImageNet.


Initially, only the classifier layer was re-trained. However, this led to disappointing results, with the best model only reaching 46% top-1 training accuracy and 44% top-1 validation accuracy on the augmented dataset. This is most likely because the orientation of the image is not encoded in the classifier layer but rather in the convolutional layers, which feature orientation-sensitive filters. If we want to “teach” orientation sensitivity out of ResNet, we will likely have to re-train some of these filters. We can get a sense of this orientation sensitivity by visualizing the first layer of filters in the baseline model:

The first-layer filters are clearly sensitive to the orientation of objects in the image: some detect diagonal lines, others horizontal ones, and so on. This is good, as it’s part of what allows CNNs to excel in object recognition applications. We don’t want to teach out the orientation sensitivity of individual filters; instead, we want to re-train the way ResNet combines these filters to render the final class scores orientation-invariant, that is, not dependent on the orientation of the input image.

However, Tiny ImageNet is a relatively small dataset, and we are working with pre-trained CNNs. Therefore, the more layers we re-train, the more overfitting becomes a concern. To find a good balance, we trained multiple versions of the same model with varying numbers of layers re-trained. Each was trained for 20 epochs, but only the model with the best validation performance in each category was chosen. We then compared these models’ performance on the non-augmented dataset. The results are summarized in the following table. (All accuracies in this section are top-1.)

To understand what is meant by “Layer $n$+” in the table, here is a diagram of the structure of ResNet18, from

The layers in light blue are collectively “layer 1”, those in orange “layer 2”, and so on. “Layer 3+” indicates that layer 3 and all downstream layers (layer 4 and the classifier) were retrained in that experiment. Note that this structure differs slightly from the one used in our project, as our classifier outputs to 200 class scores rather than 1000.

Note that all of the experiments which retrained more than just the classifier had relatively high levels of overfitting, as seen by the disparity between the best training accuracy and best validation accuracy. However, since only the model with highest validation accuracy (which generally had lower training accuracy) was used on the non-augmented data, this issue is mitigated somewhat.

Based on the results from this experimentation, we decided to move forward with the model in which layers 2+ were re-trained, as this model had the highest validation accuracy and the highest accuracy on non-augmented data. To improve it further, we took the layer 2+ model and re-trained only layer 4 and the classifier for an additional five epochs, as the layer 4+ model from the above experiment performed about as well as the layer 3+ model. In summary, to develop our “best” model, we took ResNet18, replaced the classifier layer to fit it to Tiny ImageNet, re-trained layers 2–4 and the classifier for 20 epochs, selected the intermediate model with the best validation accuracy, and finally re-trained layer 4 and the classifier for an additional 5 epochs. The results are summarized below:

The baseline’s poor accuracy on augmented data compared to clean, non-augmented data, even though it is the same data it was trained on (just rotated or reflected), represents the orientation sensitivity of vanilla ResNet. While the best model does not achieve the same accuracy as the baseline on non-augmented data, it is within 1.5%, and more importantly, the accuracy on augmented and clean data is nearly identical. This model is therefore effectively orientation-invariant, with little sacrifice in overall accuracy.

Multi-Task Learning

We tried to leverage the fact that our dataset included bounding boxes to improve the robustness of our classifier. Multi-Task Learning serves as an implicit data augmentation and can help the classifier to generalize better. As described in An Overview of Multi-Task Learning in Deep Neural Networks, “[…] a model that learns two tasks simultaneously is able to learn a more general representation. Learning just task A bears the risk of overfitting to task A, while learning A and B jointly enables the model to obtain a better representation F through averaging the noise patterns.” Multi-Task Learning is commonly used in industry, e.g. Tesla Autopilot predicts tens of different tasks at the same time using a big network with many shared parameters across tasks. We implemented hard parameter sharing, illustrated below.

source: CS182 lecture slides
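Hard parameter sharing can be sketched as one shared feature extractor feeding two task-specific heads. The example below assumes the second task is bounding-box regression (four coordinates per image) and uses a deliberately tiny backbone; the class name, the smooth L1 box loss, and the box_weight parameter are illustrative choices, not details taken from the project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardSharingNet(nn.Module):
    """Shared backbone with a classification head and a box-regression head."""

    def __init__(self, num_classes=200):
        super().__init__()
        self.shared = nn.Sequential(  # parameters shared by both tasks
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.class_head = nn.Linear(16, num_classes)  # task A: class scores
        self.box_head = nn.Linear(16, 4)              # task B: box coordinates

    def forward(self, x):
        features = self.shared(x)
        return self.class_head(features), self.box_head(features)

def multitask_loss(class_logits, box_pred, labels, boxes, box_weight=1.0):
    class_loss = F.cross_entropy(class_logits, labels)
    box_loss = F.smooth_l1_loss(box_pred, boxes)
    return class_loss + box_weight * box_loss

torch.manual_seed(0)
net = HardSharingNet()
images = torch.rand(2, 3, 64, 64)   # stand-in images
labels = torch.tensor([3, 7])
boxes = torch.rand(2, 4)            # stand-in normalized box coordinates

class_logits, box_pred = net(images)
loss = multitask_loss(class_logits, box_pred, labels, boxes)
loss.backward()  # gradients from both tasks reach the shared layers
```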

While it reduced our model overfitting, its performance on the validation set was sub-optimal. We tested it on various affine transformations applied to the validation set and didn’t see any improvement. It’s possible that this model could outperform our baseline model on a completely new dataset but we didn’t have time to test it.

After inspecting the bounding boxes, we noticed that their quality was very poor on some images. We found that the bounding boxes are machine-generated and concluded that their quality might not be good enough to improve our model.