NLP Augmentation Hands-On

Original article was published on Deep Learning on Medium


Part-1

Augmentation in Computer Vision is one of the important techniques and has proved to be effective. In NLP, augmentation is also tried and shown improvements in quite a few cases.

In this part, we will first cover the following:

  • What data augmentation is and why it works
  • Why it works so well for computer vision
  • Benefits of augmentation
  • Rules of data augmentation
  • Types of NLP augmentation

Then we will jump into one type of NLP augmentation and get hands-on with it.

What is augmentation and why does it work?

Data augmentation is a technique for synthetically generating new data points such that the generated data has the same semantics as the original data. In other words, data augmentation is a semantically invariant transformation.

Data augmentation is valuable for these primary reasons:

  • Data scarcity: it synthetically enlarges small datasets
  • Improved generalization (reduced overfitting)
  • Test-time augmentation (more confident predictions)

Why does it work so well for Computer Vision?

In computer vision, deep learning algorithms in particular are data hungry, meaning more data is always welcome.

That said, some researchers question the trade-off between data volume and data quality. If you want to understand more about it, please go through https://www.slideshare.net/xamat/10-lessons-learned-from-building-machine-learning-systems

Transformations applied to an image during augmentation still preserve its meaning; hence they are semantically invariant transformations.

(Reference — https://medium.com/secure-and-private-ai-writing-challenge/data-augmentation-increases-accuracy-of-your-model-but-how-aa1913468722)
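As a minimal sketch of such a semantically invariant transformation, consider a horizontal flip: the flipped image still depicts the same object, so its label is unchanged. The array below is a toy stand-in for a real image, not taken from the original notebook.

```python
import numpy as np

def augment_flip(image: np.ndarray, label: int):
    """Return the original and the horizontally flipped (image, label) pairs.

    The flip changes pixel positions but not the image's meaning, so the
    label is copied unchanged onto the new data point.
    """
    flipped = np.fliplr(image)
    return [(image, label), (flipped, label)]

# Toy 2x3 "image": one training example becomes two.
img = np.array([[1, 2, 3],
                [4, 5, 6]])
pairs = augment_flip(img, label=1)
```

The same idea generalizes to small rotations, crops, and brightness shifts, as long as each transform keeps the depicted content recognizable.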

Rules of Data Augmentation

  1. The augmented data must follow a statistical distribution similar to that of the original data.
  2. A human being should not be able to distinguish between the augmented data and the original data.
  3. Data augmentation involves semantically invariant transformations.
  4. In supervised learning, the transformations allowed for data augmentation are those that do not modify the class label of the new data generated.
  5. In order to respect the semantic invariance, the number of successive or combined transformations must be limited, empirically to two (2).

Reference for the above rules: Text Data Augmentation Made Simple

Benefits of Data Augmentation

The benefits of augmentation are widely documented in computer vision research.

  • Implicit regularization
  • Semi-supervised applications where labeled data is insufficient
  • A cost-effective route to data gathering and labeling: automated synthetic data generation helps alleviate tedious data collection processes

Now that we have some understanding of data augmentation, we will shift our attention to text augmentation. Text augmentation and NLP augmentation can be treated as synonyms.

NLP augmentation can be classified into the following major categories, each of which contains several techniques.

Categories of NLP Augmentation

  • Lexical Substitution
  • Back Translation
  • Text Surface Transformation
  • Random Noise Injection
  • Instance Crossover Augmentation
  • Syntax-tree Manipulation

In this part we will do hands-on with Lexical Substitution. First, load the data.

Load and Clean data
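The notebook's loading code is not reproduced here, so below is a minimal sketch of such a load-and-clean step. The inline `texts`/`labels` corpus is an illustrative stand-in for whatever dataset the original notebook reads.

```python
import re
import string

# Tiny stand-in corpus (the real notebook loads its own dataset).
texts = [
    "This movie was GREAT!!!",
    "Worst film ever...",
    "An absolutely wonderful story.",
    "I did not enjoy this boring plot.",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

cleaned = [clean_text(t) for t in texts]
```

Any comparable normalization works here; the key point is that the same cleaning is applied before both the un-augmented and the augmented runs, so the comparison is fair.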

Now train and evaluate on the un-augmented dataset using MultinomialNB.

Train and Eval for un-augmented data
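The article names MultinomialNB as the classifier; the rest of this sketch (the bag-of-words vectorizer and the tiny inline train/validation split) is an assumption standing in for the notebook's actual pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative cleaned corpus; the real notebook uses its own split.
train_texts = [
    "this movie was great",
    "an absolutely wonderful story",
    "worst film ever",
    "i did not enjoy this boring plot",
]
train_labels = [1, 1, 0, 0]

# Bag-of-words counts feed the multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

val_texts = ["a wonderful great movie", "boring worst film"]
val_labels = [1, 0]
val_score = model.score(val_texts, val_labels)
```

This un-augmented `val_score` is the baseline the augmented run will be compared against.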

Now we will augment the data using WordNet synonyms, replacing words in each sentence that are tagged as nouns or adjectives.

Load Clean and Augment Data
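The original notebook looks up synonyms through NLTK's WordNet (e.g. `wordnet.synsets(word)` and each lemma's `name()`), restricted to nouns and adjectives. To keep this sketch self-contained and runnable without corpus downloads, a tiny hand-written synonym table of nouns and adjectives stands in for those WordNet lookups.

```python
import random

# Hand-written stand-in for WordNet synsets, limited to nouns/adjectives.
SYNONYMS = {
    "great": ["excellent", "wonderful"],   # adjective
    "movie": ["film", "picture"],          # noun
    "boring": ["dull", "tedious"],         # adjective
    "story": ["tale", "narrative"],        # noun
}

def augment_sentence(sentence: str, rng: random.Random) -> str:
    """Replace each word found in the synonym table with a random synonym.

    Words outside the table (verbs, stopwords, unknowns) pass through
    unchanged, mirroring the noun/adjective-only substitution rule.
    """
    words = sentence.split()
    out = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]
    return " ".join(out)

rng = random.Random(0)
augmented = augment_sentence("this movie was great", rng)
```

Swapping only nouns and adjectives keeps the sentence structure intact, which is one simple way to stay within the semantic-invariance rule above.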

Now that we have augmented the data with the above functions, we will train and evaluate using the same approach as the un-augmented version.
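A sketch of that retraining step, again with illustrative sentences: the augmented sentences are appended to the originals with the same labels, and the identical pipeline is refit.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Originals plus their synonym-substituted copies; labels carry over unchanged.
train_texts = ["this movie was great", "worst film ever"]
train_labels = [1, 0]
augmented_texts = ["this film was excellent", "worst picture ever"]
augmented_labels = [1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts + augmented_texts, train_labels + augmented_labels)

# The synonyms seen during augmentation now carry signal at prediction time.
pred = model.predict(["an excellent film"])[0]
```

The augmented model has seen "film" and "excellent" with a positive label even though the original corpus never contained them in a positive sentence, which is exactly where the validation gain comes from.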

We noticed a slight improvement in the validation score, which suggests that augmentation can help models generalize.

I understand that this post has become quite long. No worries: from the next part of the series onward, I will stick to one specific method and its practices.

Hope you liked this. Feel free to clap and share. I will be working on Part-2 (Word Embedding based Substitution).

Code and Jupyter Notebook for this article can be found at –