These are the Easiest Data Augmentation Techniques in Natural Language Processing you can think of…

Source: Deep Learning on Medium

By Jason Wei
Augmentation operations for NLP proposed in [this paper]. SR=synonym replacement, RI=random insertion, RS=random swap, RD=random deletion. The GitHub repository for these techniques can be found [here].

Data augmentation is commonly used in computer vision. In vision, you can almost certainly flip, rotate, or mirror an image without risk of changing the original label. However, in natural language processing (NLP), the story is totally different. Changing one word has the potential to change the meaning of the entire sentence. So we can’t come up with easy rules for data augmentation. Or can we?

I present to you EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks (for a quick implementation, see the EDA GitHub repository). EDA consists of four simple operations that do a surprisingly good job of preventing overfitting and helping train more robust models. Here they are:

  1. Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
  2. Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
  3. Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
  4. Random Deletion: Randomly remove each word in the sentence with probability p.
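
The four operations above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the EDA repository: the `STOP_WORDS` set and the `SYNONYMS` table are tiny hypothetical stand-ins for a real stop-word list and thesaurus, and the rule that random deletion never returns an empty sentence is an assumption made here to keep the output usable.

```python
import random

# Hypothetical stand-ins: a real setup would use a full stop-word list and a
# thesaurus. These tiny tables only keep the sketch self-contained.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "in", "it", "this", "me"}
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "movie": ["film", "picture"],
    "great": ["excellent", "superb"],
}

def _candidates(words):
    """Distinct non-stop words that have at least one known synonym."""
    return [w for w in dict.fromkeys(words) if w not in STOP_WORDS and w in SYNONYMS]

def synonym_replacement(words, n, rng=random):
    """SR: replace up to n distinct non-stop words with a random synonym."""
    candidates = _candidates(words)
    picked = rng.sample(candidates, min(n, len(candidates)))
    replacement = {w: rng.choice(SYNONYMS[w]) for w in picked}
    return [replacement.get(w, w) for w in words]

def random_insertion(words, n, rng=random):
    """RI: n times, insert a synonym of a random non-stop word at a random position."""
    out = list(words)
    for _ in range(n):
        candidates = _candidates(out)
        if not candidates:
            break  # nothing in the sentence has a known synonym
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), synonym)
    return out

def random_swap(words, n, rng=random):
    """RS: n times, swap the words at two randomly chosen positions."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng=random):
    """RD: drop each word independently with probability p."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Assumption: never return an empty sentence; fall back to one random word.
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)  # seeded so repeated runs give the same sentences
sentence = "this movie was great and it made me happy".split()
print("SR:", " ".join(synonym_replacement(sentence, 2, rng)))
print("RI:", " ".join(random_insertion(sentence, 2, rng)))
print("RS:", " ".join(random_swap(sentence, 2, rng)))
print("RD:", " ".join(random_deletion(sentence, 0.2, rng)))
```

Each function takes a list of tokens and returns a new list, so the original sentence is left untouched and the operations can be applied independently.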

Do these techniques really work? Surprisingly, yes! Although some generated sentences may be a little nonsensical, introducing some noise into the dataset can be extremely helpful for training a more robust model, especially on smaller datasets. As shown in [this paper], EDA outperforms normal training at almost all dataset sizes across five benchmark text classification tasks, and does much better when training on small amounts of data. On average, training a recurrent neural network (RNN) with EDA while using only 50% of the available training set achieved the same accuracy as normal training with all available data:

Performance on text classification tasks with respect to percent of dataset used for training. Using EDA (easy data augmentation) operations significantly outperforms normal training on small datasets.

Does EDA conserve true labels of augmented sentences?

Now I know what you’re thinking. Can you really just do these augmentation operations while maintaining the true labels of augmented sentences? Let’s take a visualization approach to find out…

So you train an RNN on positive and negative product reviews, then run it on both regular and augmented sentences, extract the last layer of the neural network, and use t-SNE to get a latent space visualization:

Latent space visualization of original and augmented sentences in the Pro-Con dataset.

It turns out that latent space representations for augmented sentences closely surround those of the original sentences! This indicates that generated augmented sentences most likely maintained the same label as their original sentence.

Do all these operations work?

Now, let’s find out what the individual effects of each of the data augmentation techniques are. Synonym replacement makes sense, but do the other three operations actually do anything? We can do a test that isolates each of the techniques and uses them to varying degrees of α, a parameter that roughly means “percent of words in a sentence that are changed”:

Average performance gain of EDA operations over five text classification tasks for different training set sizes. The α parameter roughly means “percent of words in sentence changed by each augmentation.” SR: synonym replacement. RI: random insertion. RS: random swap. RD: random deletion.
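
To make α concrete, here is one plausible way to turn it into per-sentence settings: the number of words n touched by SR, RI, and RS scales with sentence length, and α is used directly as the deletion probability p for RD. The function name `ops_budget` and the choice to always change at least one word are assumptions for this sketch.

```python
def ops_budget(sentence, alpha):
    """Translate alpha into per-operation settings for one sentence."""
    num_words = len(sentence.split())
    # Assumption: round down, but always change at least one word.
    n = max(1, int(alpha * num_words))
    return {"n_sr": n, "n_ri": n, "n_rs": n, "p_rd": alpha}

print(ops_budget("the battery lasts all day and charges fast", 0.25))
```

With this mapping, short sentences still get one change even at small α, which is why α only roughly corresponds to the percent of words changed.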

You can see that the performance gain is largest for small datasets, at around 2–3%, and more modest for larger ones (~1%). Still, every technique can help train more robust models when used at a reasonable augmentation parameter (don’t change more than a quarter of the words in a sentence).

How much augmentation?

Finally, how many augmented sentences should we generate for each real sentence? The answer depends on the size of your dataset. If you only have a small dataset, overfitting is more likely, so you can generate a larger number of augmented sentences. For larger datasets, adding too much augmented data can be unhelpful, since your model may already generalize well when there is a large amount of real data. This figure shows performance gain with respect to the number of augmented sentences generated per original sentence:

Average performance gain of EDA across five text classification tasks for various training set sizes. n_aug is the number of generated augmented sentences per original sentence.
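
In code, n_aug is just the number of augmented copies added for each labeled example, with the original label carried over unchanged. The sketch below assumes a hypothetical `augment_dataset` helper and uses a single random swap (`swap_once`) as a placeholder augmenter; any of the four EDA operations could be plugged in instead.

```python
import random

def swap_once(text, rng):
    """Hypothetical placeholder augmenter: swap two randomly chosen words."""
    words = text.split()
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def augment_dataset(examples, n_aug, augment=swap_once, seed=0):
    """Return each (text, label) pair followed by n_aug augmented copies of it."""
    rng = random.Random(seed)
    out = []
    for text, label in examples:
        out.append((text, label))
        # The augmented copies keep the label of the sentence they came from.
        out.extend((augment(text, rng), label) for _ in range(n_aug))
    return out

reviews = [
    ("easy to set up and works well", "pro"),
    ("battery dies too quickly", "con"),
]
for n_aug in (1, 4, 16):
    print(n_aug, len(augment_dataset(reviews, n_aug)))
```

The training set grows by a factor of 1 + n_aug, so a large n_aug on an already large dataset mostly adds training time.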

What now?

We’ve shown that simple data augmentation operations can significantly boost performance in text classification. If you are training a text classifier on a small dataset and looking for an easy way to get better performance, feel free to add these operations to your training pipeline, or pull the code for EDA from GitHub. You can find more details in [this paper].

Feel free to read about my work on my [personal website] and shoot me an email. Best of luck!