Original article was published on Deep Learning on Medium
II. Middle Man Data Augmentations
These strategies are widely used in day-to-day practice. Anyone with a reasonably powerful GPU can apply these augmentation techniques. Let's discuss each in chronological order.
Elastic distortion is a technique in which the input image is deformed by random displacements. Let's look at an example to get a clear understanding.
It is one of the simplest image augmentation techniques in use. PyPI has a package named "elasticdeform" that makes it easy to deform the input image. [Check out: PyPI and GitHub if you want to implement it]
Simple distortions such as translations, rotations, and skewing can be generated by applying affine displacement fields to images. This is done by computing, for every pixel, a new target location with respect to the original location. For instance, if ∆x(x,y)=1 and ∆y(x,y)=0, the new location of every pixel is shifted by 1 to the right. If the displacement field were ∆x(x,y)=αx and ∆y(x,y)=αy, the image would be scaled by α from the origin location (x,y)=(0,0). Since α can be a non-integer value, interpolation is necessary. [Source]
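To make the displacement-field idea concrete, here is a minimal NumPy sketch (the function name and nearest-neighbor sampling are our simplifications; a real implementation such as the elasticdeform package interpolates, since the displacements are usually non-integer):

```python
import numpy as np

def apply_displacement(img, dx, dy):
    """Move each pixel (x, y) to (x + dx(x, y), y + dy(x, y)).

    Nearest-neighbor sampling; pixels whose source falls outside
    the image are set to 0.
    """
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # invert the mapping: each output pixel reads from (x - dx, y - dy)
    src_x = np.round(xs - dx).astype(int)
    src_y = np.round(ys - dy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

img = np.arange(16.0).reshape(4, 4)
# dx(x, y) = 1, dy(x, y) = 0: every pixel shifts one step to the right
shifted = apply_displacement(img, np.ones((4, 4)), np.zeros((4, 4)))
```

For an elastic distortion, the displacement fields would be random (smoothed Gaussian noise) rather than a constant shift.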
Dropout has a very interesting concept behind it. We will dive into the intuition, but before that, you should be familiar with kernels and channels to fully understand how dropout can help in fully connected layers and not just in CNN layers as a whole.
Let's take a look at an example. The following image shows the normal behavior of a CNN layer (left).
Now suppose we were to black out a few neurons (set them to 0) in the input layer. What would the output be? Of course it changes, but that is not all that changes. When we randomly drop out neurons, the model is forced to rely on different neurons, which can help it classify the input better. Each neuron captures a set of features based on its kernel, and if we drop a few of them, the model has to classify based on the neurons that remain.
In effect, we are telling the model: given two images of a person, one perfectly fine and the other with the hands blacked out (pixels set to 0 wherever the hands were), can you still learn to classify a person even without those hand features? This is a more practical scenario, right?
Dropout can help with over-fitting, as stated in the original paper. How? Because we might randomly drop a few redundant features that the model has learned too well. However, this is a topic of debate. Also remember that dropout is applied to neurons at random during each training step, so the set of dropped neurons changes with every training instance. It all seems to just click and work; however, we also assume that we do not randomly drop all the important features. If we do, accuracy may not necessarily improve. Dropout can be tricky, and it is difficult to judge it at the level of individual neurons. This was recognized in 2015 by Tompson et al. (a team that included LeCun), who proposed a different strategy: spatial dropout. Spatial dropout drops whole channels instead of individual neurons, so the output of a kernel is not used at all. Again, which channels are dropped changes with each training instance. That said, spatial dropout is used less frequently than normal dropout. Keras provides two separate layers:
# for importing normal dropout, use
from keras.layers import Dropout
# for importing spatial dropout, use
from keras.layers import SpatialDropout2D
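To make the distinction concrete, here is a framework-agnostic NumPy sketch (the variable names and shapes are ours) of the two masking patterns, using the usual inverted-dropout scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.5
# one feature map of shape (batch, height, width, channels)
acts = np.ones((1, 4, 4, 8))

# standard dropout: every individual activation is kept/dropped independently
mask = rng.random(acts.shape) >= rate
dropped = acts * mask / (1 - rate)        # scale survivors by 1/(1-rate)

# spatial dropout: one keep/drop decision per channel,
# broadcast over the whole spatial extent of that channel
ch_mask = rng.random((1, 1, 1, 8)) >= rate
spatial = acts * ch_mask / (1 - rate)

# each channel is either fully kept or entirely zeroed out
per_channel = spatial.sum(axis=(1, 2))    # shape (1, 8)
```

With spatial dropout, every channel sum is either 0 (dropped) or the full scaled activation (kept); with standard dropout, zeros are scattered across all channels.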
- Batch Normalization (2015)
To understand batch normalization, refer to our article on batch normalization.
Cutout is a very simple augmentation technique: we simply cut out a portion of the image. That's it. How is it helpful? It follows the same reasoning we just discussed. If we remove random features from the image, the model looks for other features in order to classify it. Cutting out portions of the image therefore forces the model to learn additional features rather than relying only on the dominant ones.
If you want to learn the code, you can make use of the following links.
NOTE: You must understand what your model is learning before integrating cutout into your project. If your model is not learning well, cutout may make things worse. There are interpretability methods for finding out what the model is looking at in an image; one of them is Grad-CAM.
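A minimal cutout can be implemented in a few lines of NumPy (the function name is ours; patches that overlap the border are simply clipped, as in the original paper):

```python
import numpy as np

def cutout(img, size, rng):
    """Zero out a random (size x size) square patch of the image."""
    h, w = img.shape[:2]
    # pick a random center; the patch may be clipped at the border
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = img.copy()
    out[y0:y1, x0:x1] = 0
    return out

rng = np.random.default_rng(42)
img = np.ones((32, 32))
aug = cutout(img, size=8, rng=rng)
```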
Mixup is a very interesting augmentation. Let’s try to understand it with the help of an example.
Mixup alpha-blends two images to produce a new image, which forces the model to predict two classes from a single image. Randomly mixing images in this way forces the model to learn the features that belong to each class. In the above example, 20% of the dog's features are visible, so the model must detect those features and report that a dog is present with 20% confidence. [Mixup source]
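The blending can be sketched in a few lines of NumPy (the function name is ours; as in the paper, the blending weight is drawn from a Beta distribution and applied to both the images and their one-hot labels):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Alpha-blend two images and their one-hot labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)       # blending weight in [0, 1]
    x = lam * x1 + (1 - lam) * x2      # blended image
    y = lam * y1 + (1 - lam) * y2      # soft label, e.g. 80% cat / 20% dog
    return x, y, lam

cat = np.full((8, 8), 0.0); dog = np.full((8, 8), 1.0)
y_cat = np.array([1.0, 0.0]); y_dog = np.array([0.0, 1.0])
x, y, lam = mixup(cat, y_cat, dog, y_dog, rng=np.random.default_rng(0))
```

The soft label always sums to 1, so it is still a valid probability distribution over the classes.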
Mixup is related to the label smoothing technique. If you don't know label smoothing, here is a simple example. Imagine we have 2 classes and our labels are hard 0/1 values: 1 for the true class and 0 for the other. Training on such hard targets pushes the model toward extreme, over-confident predictions. Label smoothing softens these targets: instead of 1 and 0, the true class might get 0.95 and the other class 0.05, so the model is never asked to be completely certain. Mixup produces soft labels in a similar spirit, except that the label weights come from the blending ratio of the two images.
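Label smoothing itself is a one-liner; here is a sketch (the function name and eps value are ours, the true class gets 1 − eps and the rest share eps uniformly):

```python
import numpy as np

def smooth_labels(onehot, eps=0.1):
    """Soften hard 0/1 targets to discourage over-confident predictions."""
    k = onehot.shape[-1]                 # number of classes
    return onehot * (1 - eps) + eps / k  # true class: 1-eps+eps/k, rest: eps/k

y = np.array([0.0, 0.0, 1.0, 0.0])
smoothed = smooth_labels(y, eps=0.1)
```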
Smart Augmentation is a very specific image augmentation technique, introduced in 2017. It can only be applied to specific classes, where it provides great results. It works by learning to merge two or more samples from one class. The merged sample is then used to train a target network, and the loss of the target network is used to inform the augmenter at the same time. The result is more data for the target network to use, and the process often lets the network come up with unusual or unexpected but high-performing augmentation strategies.
The only drawback is that it can only be applied to images where the object size is constant and the location of the object is fixed.
In this technique, we synthesize a new sample from one image by overlaying another image randomly chosen from the training data (i.e., taking the average of the two images at each pixel). By pairing two images randomly selected from the training set, we can generate N² new samples from N training samples. [Source]
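The operation is trivial to sketch in NumPy (the function name is ours; unlike mixup, SamplePairing averages the pixels with equal weight and keeps only the first image's label):

```python
import numpy as np

def sample_pairing(x1, y1, x2):
    """Average the two images pixel-wise; keep the first image's label."""
    return (x1 + x2) / 2.0, y1

a = np.full((4, 4), 0.2)
b = np.full((4, 4), 0.8)
mixed, label = sample_pairing(a, "cat", b)
```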
This technique can be tricky to work with. It will most often give you lower training accuracy; however, it helps reduce your validation and test error. Why? During the training phase, the model cannot properly classify the blended pair, so training accuracy drops. This may be discouraging with regard to your model's performance, but the model is also learning many features that could belong to either object, which helps it classify better at test time. The research paper reports the same behavior. Let's look at some stats.
As we can see, the error decreased with sample pairing during validation but increased during training. This is the expected behavior, and we do not need to worry about the model's performance.
RICAP crops four training images and patches them together to construct a new training image. It selects the images and determines the cropping sizes randomly, and the size of the final image is identical to that of the originals. RICAP also mixes the class labels of the four images with ratios proportional to their areas, similar to label smoothing and mixup. Compared to mixup, RICAP has three clear distinctions:
- it mixes images spatially
- it uses partial images by cropping
- and it does not create features that are absent in the original dataset except for boundary patching.
RICAP shares concepts with cutout, mixup, and label smoothing, and potentially overcomes their shortcomings.
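The core of RICAP can be sketched as follows (a simplified NumPy version with names of our choosing; the paper draws the boundary point from a Beta distribution, while here it is uniform for brevity):

```python
import numpy as np

def ricap(imgs, labels, rng):
    """Patch crops of four images into one canvas; mix the four
    one-hot labels in proportion to the patch areas."""
    h, w = imgs[0].shape[:2]
    # a random boundary point splits the canvas into four rectangles
    by, bx = rng.integers(1, h), rng.integers(1, w)
    sizes = [(by, bx), (by, w - bx), (h - by, bx), (h - by, w - bx)]
    slots = [(0, 0), (0, bx), (by, 0), (by, bx)]
    canvas = np.zeros_like(imgs[0])
    mixed_label = np.zeros_like(labels[0], dtype=float)
    for img, lab, (ph, pw), (y, x) in zip(imgs, labels, sizes, slots):
        # take a random crop of size (ph, pw) from this image
        cy = rng.integers(0, h - ph + 1)
        cx = rng.integers(0, w - pw + 1)
        canvas[y:y + ph, x:x + pw] = img[cy:cy + ph, cx:cx + pw]
        mixed_label += lab * (ph * pw) / (h * w)   # weight by patch area
    return canvas, mixed_label

rng = np.random.default_rng(0)
imgs = [np.full((8, 8), v) for v in (1.0, 2.0, 3.0, 4.0)]
labels = [np.eye(4)[i] for i in range(4)]
patched, y = ricap(imgs, labels, rng)
```

Because the four patch areas tile the canvas exactly, the mixed label weights always sum to 1.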
As we see, with RICAP, the results are much better.
Fig 18 shows a heatmap of where the model is looking in order to classify the corresponding input image. The baseline model's heatmaps are not focused on the right locations for deciding the output class, but after RICAP, the heatmaps are much more accurate and the model is looking at the right locations.
In this technique, patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches. Instead of simply removing pixels, we replace the removed regions with a patch from another image. The ground truth labels are also mixed proportionally to the number of pixels of combined images. CutMix now enjoys the property that there is no uninformative pixel during training, making training efficient while retaining the advantages of regional dropout to attend to non-discriminative parts of objects. The added patches further enhance localization ability by requiring the model to identify the object from a partial view. The training and inference budgets remain the same. CutMix shares similarity with Mixup which mixes two samples by interpolating both the image and labels. While certainly improving classification performance, Mixup samples tend to be unnatural. CutMix overcomes the problem by replacing the image region with a patch from another training image. [Source]
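A simplified NumPy sketch of CutMix (the function name is ours; the paper samples a box around a random center and clips it, whereas here we sample a top-left corner so the box always fits, then recompute the label weight from the exact pasted area):

```python
import numpy as np

def cutmix(x1, y1, x2, y2, rng):
    """Paste a random rectangle from image 2 into image 1;
    mix the one-hot labels by the area ratio of the patch."""
    h, w = x1.shape[:2]
    lam = rng.beta(1.0, 1.0)                       # target keep-ratio for x1
    cut_h = int(h * np.sqrt(1 - lam))              # patch covers ~(1-lam)
    cut_w = int(w * np.sqrt(1 - lam))              # of the total area
    y0 = rng.integers(0, h - cut_h + 1)
    x0 = rng.integers(0, w - cut_w + 1)
    out = x1.copy()
    out[y0:y0 + cut_h, x0:x0 + cut_w] = x2[y0:y0 + cut_h, x0:x0 + cut_w]
    lam_adj = 1 - (cut_h * cut_w) / (h * w)        # fraction of x1 kept
    return out, lam_adj * y1 + (1 - lam_adj) * y2

rng = np.random.default_rng(1)
a, b = np.zeros((16, 16)), np.ones((16, 16))
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed, y = cutmix(a, y_a, b, y_b, rng)
```

Unlike mixup, every pixel of the result comes from exactly one of the two source images, so the training image stays locally natural.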
The table shows significant improvements in the results in ImageNet classification, Localization, and VOC detections.
We know you feel this way, but just hold on for few more minutes. Knowledge is worth the risk 😜 😉