Data Augmentation in Natural Language Processing

Original article was published on Deep Learning on Medium

Have you worked with augmented images before? Augmenting images helps a model generalize and perform a lot better by exposing it to more data when you have little of it. In this post, we will go through data augmentation in Natural Language Processing.

Why Data Augmentation?

Go through more examples to understand better!!

If you have never heard of this term before, here is a brief description of what it is and why we use it. Image data augmentation is a technique that artificially expands the size of a training dataset by creating modified versions of the images in it. Training deep learning models on more data can result in more skillful models, and the augmentation techniques create variations of the images that can improve the ability of the fitted models to generalize what they have learned to new images.

Data Augmentation in NLP

A paper on this topic was released recently, titled EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. The paper proposes four simple ideas for augmenting text. Yes, these ideas are as simple as editing the text. Here they are:

  1. Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
  2. Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
  3. Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
  4. Random Deletion: Randomly remove each word in the sentence with probability p.
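
The four operations above can be sketched in a few lines of Python. This is only a minimal illustration, not the paper's implementation: the `STOP_WORDS` set and the `SYNONYMS` dictionary are tiny hypothetical stand-ins for a real stop-word list and a thesaurus such as WordNet, and the function names are chosen here for readability.

```python
import random

# Hypothetical toy resources; a real setup would use a proper stop-word
# list and a thesaurus instead of these hand-written stand-ins.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "in", "me"}
SYNONYMS = {
    "movie": ["film", "picture"],
    "great": ["excellent", "wonderful"],
    "happy": ["glad", "cheerful"],
}


def synonym_replacement(words, n, rng):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    out = list(words)
    candidates = [i for i, w in enumerate(out)
                  if w not in STOP_WORDS and w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, n, rng):
    """n times: insert a synonym of a random non-stop word at a random position."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w not in STOP_WORDS and w in SYNONYMS]
        if not candidates:
            break  # nothing in the sentence has a known synonym
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng):
    """n times: pick two distinct positions and swap the words there."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # seeded so that runs are repeatable
sentence = "the movie was great and the ending made me happy".split()
print(" ".join(synonym_replacement(sentence, 2, rng)))
print(" ".join(random_insertion(sentence, 2, rng)))
print(" ".join(random_swap(sentence, 2, rng)))
print(" ".join(random_deletion(sentence, 0.2, rng)))
```

Every function returns a new list and leaves the input untouched, so the same original sentence can be augmented several times.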

Here is a sense of what these augmented sentences may look like:

Original and augmented sentences in all four ways.

Were these ideas actually successful? At first, it seems that only the first idea, Synonym Replacement, would work well. But experiments on a dataset show that all four ideas worked almost equally well. Applying random operations to a small dataset exposes the model to augmented sentences that may make less sense, yet this worked really well: it kept the model from overfitting to a limited set of examples and failing to generalize. Below is an image that shows how EDA outperformed training without augmentation.

What about the labels of augmented data?

It is a valid question: what happens to the labels of the augmented data? Do they remain the same, or do they change? If they change, EDA is not an effective way of doing data augmentation.

To figure it out, train an RNN for sentiment analysis on the dataset without augmentation. Then apply EDA to the test set, generating N augmented sentences per original sentence. These are fed into the RNN along with the original sentences, and the output vectors of these examples are extracted from the last dense layer. After applying dimensionality reduction to these output vectors, plot them in the latent space. The plot looked something like this:

Visualizing the outputs of the original and the augmented examples.

The points for the augmented examples lie quite close to the points for their original examples. This suggests that the augmented examples retain the labels of their original examples. But wait!! Is this always true? Read further.

Visualizing four ideas individually

How does each of these ideas affect performance on its own? To see this, apply the four ideas one at a time. Here are the plots of how each idea performed individually when applied to an RNN model.

Performance Gain vs Alpha parameter.

Alpha parameter: Sentence length varies a lot, so we cannot apply the same amount of augmentation to every sentence. A long sentence can absorb more changes, while applying that same number of changes to a short sentence would prove to be a disaster!! To avoid this, we use something called the alpha parameter, which sets the number of words to change in proportion to the length of the sentence.

n = alpha_parameter * length_of_the_sentence
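
As a small worked example of this formula, the sketch below computes n for sentences of different lengths. The helper name `num_changes` is made up for illustration, and two details are assumptions that the formula alone does not fix: the product is rounded down, and every non-empty sentence gets at least one change.

```python
def num_changes(sentence, alpha):
    """Number of words to alter: n = alpha * sentence length.

    Assumptions for this sketch: round down, and use at least one
    change for any non-empty sentence.
    """
    length = len(sentence.split())
    if length == 0:
        return 0
    return max(1, int(alpha * length))


short = "not bad at all"
long_ = " ".join(["word"] * 20)  # placeholder 20-word sentence
for text in (short, long_):
    print(len(text.split()), "words ->", num_changes(text, alpha=0.25), "changes")
```

With the same alpha, the longer sentence receives more edits than the shorter one, which is exactly the scaling the alpha parameter is meant to provide.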

We see that for all four ideas, a high alpha value has a negative impact on accuracy, which we seriously want to avoid. This is because making too many swaps or changing too many words alters the identity of the sentence. Hence augmented sentences do not always retain the labels of the original sentences when they are significantly different from the originals.

Size of the dataset vs alpha value?

From the above plots, we see that EDA works much better on small datasets than on larger ones. This makes sense: with a larger dataset, the model will already have generalized well from real data. Augmenting the data and exposing the model to less meaningful, artificial sentences may then mislead the model, and it may perform worse than it did before. Look at the plot below:

Performance Gain vs Naug for the dataset of different sizes.

What do we infer from the picture above? The performance gain is highest for small datasets with a larger number of augmented sentences (Naug) per original sentence. The value of Naug that gives the maximum performance gain falls from 16 to 2 as the size of the dataset grows from 500 to more than 5000.

Hyperparameters to be considered

We have got two hyperparameters here.

  1. alpha -> Percent of words to be altered in a sentence.
  2. Naug -> number of augmented sentences per original sentence.
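
To make the role of the two hyperparameters concrete, here is a rough sketch of how a labeled dataset could be expanded. The names `augment_dataset` and `swap_augment` are hypothetical, and `swap_augment` uses only random swaps as a stand-in for the full set of EDA operations. Each augmented sentence simply inherits the label of its original.

```python
import random


def swap_augment(text, alpha, rng):
    """Stand-in augmenter: swap random word pairs n = alpha * length times."""
    words = text.split()
    if len(words) < 2:
        return text
    n = max(1, int(alpha * len(words)))  # assumed: at least one change
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def augment_dataset(examples, augment_fn, alpha, n_aug, seed=0):
    """Return each original (text, label) pair followed by n_aug augmented copies."""
    rng = random.Random(seed)
    augmented = []
    for text, label in examples:
        augmented.append((text, label))  # keep the original example
        for _ in range(n_aug):
            # The augmented sentence keeps the label of its original.
            augmented.append((augment_fn(text, alpha, rng), label))
    return augmented


train = [
    ("the plot was thin but the acting was superb", 1),
    ("a dull and lifeless film", 0),
]
bigger = augment_dataset(train, swap_augment, alpha=0.1, n_aug=4)
print(len(train), "->", len(bigger), "examples")
```

Passing the augmenter in as a function keeps the expansion loop independent of which editing operations are used.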

Based on these results, the recommended parameter values are:

Recommended values for parameters for datasets of different sizes.

That is it for this post. Experienced readers, please help me understand more by sharing suggestions and deeper insights. Connect with me on LinkedIn.

Thank you for reading this post. Have a great day ahead 🙂