Mixed Sample Data Augmentation

Data Augmentation is widely used in machine learning as a form of regularization. It is a clever way to circumvent the problem of small datasets. Most of us are familiar with standard data augmentations like rotation, translation, zoom, etc. In this article, we will get to know a new class of data augmentation techniques called Mixed Sample Data Augmentations (MSDAs), which have achieved SOTA results on popular datasets. MSDAs are augmentation techniques that mix two images, as well as their labels, to create new training samples. Specifically, we will go through two things:

  1. First, I will explain three popular MSDAs, namely Mixup, Cutmix, and FMix.
  2. Second, I will implement them on the FashionMNIST dataset and compare their performances.

The Mixup and Cutmix code snippets are adapted from this wonderful Kaggle discussion post: https://www.kaggle.com/c/bengaliai-cv19/discussion/126504

Let us dive into the battle of MSDAs!

1. Mixup

Mixup creates virtual training examples by linearly mixing two images from the dataset, as well as their labels. The mixing weights are sampled from a beta distribution. This favors simple linear behavior in-between training examples. The authors of the paper also find that it reduces the memorization of corrupt labels, increases the robustness of the model against adversarial examples, and stabilizes the training of generative adversarial networks.

Mixup formula, taken from the original paper:

x̃ = λ·xᵢ + (1 − λ)·xⱼ
ỹ = λ·yᵢ + (1 − λ)·yⱼ, with λ ~ Beta(α, α)

Here (xᵢ, yᵢ) and (xⱼ, yⱼ) are two examples drawn at random from the training data.

The simple algorithm for mixup, taken from the original paper, is presented above. A simple PyTorch implementation is presented below.

The Mixup function
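
In case the embedded snippet does not render, here is a minimal PyTorch sketch of the standard mixup batch function, in the spirit of the Kaggle implementation linked above (the name mixup_data is my own):

```python
import numpy as np
import torch

def mixup_data(x, y, alpha=1.0):
    """Return mixed inputs, the two sets of targets, and the mixing weight."""
    # lam ~ Beta(alpha, alpha); alpha = 1 makes this a uniform distribution
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
    # Pair each image with a randomly chosen partner from the same batch
    index = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam
```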

This mixing of labels is also taken into account while calculating the training loss. The loss function for mixup then becomes:

The loss function for the Mixup
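
Concretely, the loss is a convex combination of the criterion evaluated against both sets of labels. A sketch (mixup_criterion is my naming, and it assumes the mixup_data function above):

```python
def mixup_criterion(criterion, pred, y_a, y_b, lam):
    # Weight the loss on each label set by its share of the mixed image
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

# Usage inside a training step:
# criterion = torch.nn.CrossEntropyLoss()
# mixed_x, y_a, y_b, lam = mixup_data(x, y, alpha=1.0)
# loss = mixup_criterion(criterion, model(mixed_x), y_a, y_b, lam)
```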

The criterion used above is CrossEntropyLoss. I tried this code on the FashionMNIST dataset and got the following augmented images. The labels on the top refer to the two classes mixed in the image. Visually, it looks like the code has worked!

Example of Mixup on the FashionMNIST dataset

Please do check out the original paper. You will find an ablation study on the alpha parameter of the beta distribution and learn how mixup performs on different datasets. Here is the link: https://arxiv.org/pdf/1710.09412.pdf

2. Cutmix

Regularization techniques like coarse dropout/cutout remove information from training images by placing rectangular or square black patches over certain areas. While helpful for regularization, this also causes an undesirable loss of information. In Cutmix, patches are instead cut and pasted among training images, and the corresponding labels are mixed in proportion to the area of the patch. The following algorithm is taken from the paper.

Cutmix formula, taken from the original paper:

x̃ = M ⊙ x_A + (1 − M) ⊙ x_B
ỹ = λ·y_A + (1 − λ)·y_B

Here ⊙ denotes element-wise multiplication.

M refers to a binary mask of the same size as the images. Like in mixup, lambda is sampled from a beta distribution, and the authors used alpha = 1 in all their experiments. All pixels within the following bounding box in the mask M are set to 0, and the others are set to 1.

Coordinates of the bounding box:

r_x ~ Unif(0, W), r_w = W·√(1 − λ)
r_y ~ Unif(0, H), r_h = H·√(1 − λ)

The box is centred at (r_x, r_y) with width r_w and height r_h, so the replaced area is a 1 − λ fraction of the image.

I used the following code for the implementation.

CutMix code
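
In case the embed does not render, here is a sketch along the lines of the official CutMix implementation (the function names are mine; the bounding box sampling follows the formula above):

```python
import numpy as np
import torch

def rand_bbox(size, lam):
    """Sample a box covering a (1 - lam) fraction of the image area."""
    H, W = size[2], size[3]
    cut_rat = np.sqrt(1.0 - lam)
    cut_w, cut_h = int(W * cut_rat), int(H * cut_rat)
    # Uniformly sampled centre, clipped so the box stays inside the image
    cx, cy = np.random.randint(W), np.random.randint(H)
    x1, y1 = np.clip(cx - cut_w // 2, 0, W), np.clip(cy - cut_h // 2, 0, H)
    x2, y2 = np.clip(cx + cut_w // 2, 0, W), np.clip(cy + cut_h // 2, 0, H)
    return x1, y1, x2, y2

def cutmix_data(x, y, alpha=1.0):
    """Paste a random patch from a shuffled batch and mix the labels."""
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(x.size(0), device=x.device)
    x1, y1, x2, y2 = rand_bbox(x.size(), lam)
    x[:, :, y1:y2, x1:x2] = x[index, :, y1:y2, x1:x2]
    # Recompute lam as the exact fraction of unreplaced pixels
    lam = 1 - ((x2 - x1) * (y2 - y1) / (x.size(-1) * x.size(-2)))
    return x, y, y[index], lam
```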

The same loss function as mentioned in mixup is used for cutmix as well. Now, let us take a look at the sample images from the FashionMNIST dataset.

Cutmix sample images from the FashionMNIST dataset

Here is the link to the original paper: https://arxiv.org/pdf/1905.04899.pdf

The authors have compared the performance of Cutmix against that of mixup, cutout, and other SOTA methods. Since then, other researchers have developed new variants of Cutmix, such as CAM Cutmix and Attentive Cutmix.

3. FMix

In Cutmix, masks are rectangles whose centre coordinates are sampled from a uniform distribution; in FMix, by contrast, binary masks are generated by applying a threshold to low-frequency grey-scale images sampled from Fourier space. Thus, the main focus of FMix is to improve the masking technique of Cutmix. Quote from the paper: “we first sample a random complex tensor for which both the real and imaginary parts are independent and Gaussian. We then scale each component according to its frequency via the parameter δ such that higher values of δ correspond to increased decay of high-frequency information. Next, we perform an inverse Fourier transform on the complex tensor and take the real part to obtain a grey-scale image. Finally, we set the top proportion of the image to have value ‘1’ and the rest to have value ‘0’ to obtain our binary mask”. Seems complex to implement?! Don’t worry, the authors have made their code available, and I’ll share it below. The complete mask generation algorithm can be summarised in three steps:

  1. Sample a complex random variable Z whose real and imaginary parts are drawn from independent normal distributions, and apply a low-pass filter (δ = 3) that attenuates high-frequency components.
  2. Take the inverse Fourier transform of the filtered tensor and keep only the real part to obtain a grey-scale image.
  3. Set the top λ proportion of pixel values to 1 and the rest to 0 to obtain the binary mask.
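
For intuition, here is a rough NumPy sketch of these three steps. This is not the authors' code (their official implementation is linked below); the function name and implementation details are mine:

```python
import numpy as np

def fmix_mask(shape, delta=3.0, lam=0.5):
    """Threshold a low-frequency grey-scale image into a binary mask."""
    h, w = shape
    # 1. Complex Gaussian noise in Fourier space, attenuated by frequency
    z = np.random.randn(h, w // 2 + 1) + 1j * np.random.randn(h, w // 2 + 1)
    freq = np.sqrt(np.fft.fftfreq(h)[:, None] ** 2 + np.fft.rfftfreq(w)[None, :] ** 2)
    freq[0, 0] = 1.0 / max(h, w)   # avoid dividing by zero at the DC component
    z /= freq ** delta             # larger delta => smoother, lower-frequency masks
    # 2. Inverse Fourier transform gives a real-valued grey-scale image
    grey = np.fft.irfft2(z, s=(h, w))
    # 3. Set the top lam proportion of pixels to 1, the rest to 0
    threshold = np.quantile(grey, 1 - lam)
    return (grey > threshold).astype(np.float32)
```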

For the implementation, the authors have made the FMix library available. You can directly use the ‘sample_and_apply’ function from the library to obtain an FMix-ed batch of images. Here is the link to their repo: https://github.com/ecs-vlc/FMix

I applied the ‘sample_mask’ function to obtain masks for the FashionMNIST dataset.
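
As a sketch, the usage looks roughly like this (the argument names follow my reading of the repo; do check its README for the exact signatures):

```python
import torch
from fmix import sample_mask  # from https://github.com/ecs-vlc/FMix

# Sample one FMix mask for 28x28 FashionMNIST images (alpha = 1, delta = 3)
lam, mask = sample_mask(alpha=1, decay_power=3, shape=(28, 28))

x = torch.randn(64, 1, 28, 28)           # stand-in for a FashionMNIST batch
index = torch.randperm(x.size(0))        # random pairing within the batch
mask = torch.from_numpy(mask).float()    # broadcasts over batch and channel dims
x_fmix = mask * x + (1 - mask) * x[index]
# The mixup-style loss applies here too, weighted by lam and (1 - lam)
```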

We can see that the masks are not rectangular, unlike in Cutmix, and that the black regions in some masks are disjoint. To gain more insight and results, do look at the original paper: https://arxiv.org/abs/2002.12047

4. Results On FashionMNIST

Section B of the FMix paper provides the experimental details. I used the same hyperparameters for Mixup, Cutmix, and FMix as mentioned in the paper. A preactivated ResNet18 was used in the experiments; preactivation refers to using BatchNorm → Non-linear Activation → Conv2D instead of Conv2D → BatchNorm → Non-linear Activation. α = 1, δ = 3, a weight decay of 1 × 10^-4, and SGD with momentum of 0.9 are used for optimisation. The model is trained for 200 epochs. The initial learning rate is 0.1, and it is multiplied by 0.1 at the 100th and 150th epochs.
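
Put together, the training setup looks roughly like this sketch (PreActResNet18 is a placeholder for any preactivated ResNet18 implementation):

```python
from torch import optim

model = PreActResNet18(num_classes=10)   # placeholder: any preact ResNet18
optimizer = optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=1e-4)
# Multiply the learning rate by 0.1 at epochs 100 and 150
scheduler = optim.lr_scheduler.MultiStepLR(optimizer,
                                           milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    # ... one pass over the training set with the chosen MSDA per batch ...
    scheduler.step()
```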

Above is the accuracy-vs-epochs plot for the Baseline, Mixup, Cutmix, and FMix. The best accuracy achieved is reported in the legend beside the name of each MSDA. A jump in accuracy is observed after the 100th epoch because the learning rate is reduced from 0.1 to 0.01. The literature reports best accuracies for FMix, Mixup, and Cutmix of 96.36%, 96.28%, and 96.03% respectively, while I managed to achieve 96.07%, 95.91%, and 95.92%. The baseline accuracy is 95.52%. Thus, all the MSDA techniques improved upon the baseline.

I hope you enjoyed reading this article! Now the ball is in your court: try applying these MSDAs in your projects and see if they improve your model performance. Please leave any queries and suggestions in the comments below, and I will be happy to address them. If you like this article, then share it and click on the clap icon on the left. The fun part: I never knew we could clap for an article multiple times. Precisely speaking, you can clap up to 50 times per post. Your claps will motivate me to write more. Thank you!