Data Augmentation + Transfer Learning in NLP in low resource settings

Deep neural networks are getting better and better at mapping inputs to outputs, given huge amounts of data. As you might guess, getting huge amounts of data for every task and every language can be difficult, and is sometimes practically impossible. This is where Data Augmentation and Transfer Learning come to our rescue.

In this blog post, we’ll see how to get *amazing* results on a classification task in Hindi by combining Data Augmentation and Transfer Learning with the iNLTK library.

Problem Statement

Classify Hindi movie reviews into one of [Positive, Neutral, Negative] categories: a multi-class classification problem.

Dataset

The Hindi Movie Reviews dataset consists of ~900 movie reviews collected from Hindi news websites.

Here is a starter kernel for the dataset.

[Figure: Training set distribution]
[Figure: Test set distribution]

Let’s solve the problem

We’ll be using the pre-trained native ULMFiT model which I trained for the iNLTK library. Instructions to download the pre-trained weights, along with performance metrics, are in the NLP for Hindi repository.

Transfer Learning with pre-trained native ULMFiT model in Hindi

Transfer Learning has been used to produce SOTA results on multiple downstream tasks. This blog by Sebastian Ruder explains Transfer Learning very well, so we’ll jump straight into the code and results.

We’ll first fine-tune the pre-trained LM on our dataset and then train the classifier on top of it.

You can check out all of the code in this notebook.
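For orientation, here is a minimal sketch of the two steps using fastai v1 (the library ULMFiT is implemented in). The file names, column names and hyperparameters below are illustrative assumptions, not the notebook’s exact values:

```python
# Sketch of the two ULMFiT steps with fastai v1. File names, column
# names and hyperparameters are assumptions for illustration.
import pandas as pd
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

train_df = pd.read_csv('train.csv')  # assumed columns: 'label', 'text'
valid_df = pd.read_csv('valid.csv')

# Step 1: fine-tune the pre-trained Hindi language model on the reviews
data_lm = TextLMDataBunch.from_df('.', train_df=train_df, valid_df=valid_df,
                                  text_cols='text')
learn_lm = language_model_learner(
    data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False,
    pretrained_fnames=['hindi_lm', 'hindi_itos'])  # weights + vocab from the repo, assumed to be in ./models/
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.unfreeze()
learn_lm.fit_one_cycle(5, 1e-3)
learn_lm.save_encoder('ft_enc')  # keep the fine-tuned encoder

# Step 2: train the classifier on top of the fine-tuned encoder
data_clas = TextClasDataBunch.from_df('.', train_df=train_df, valid_df=valid_df,
                                      text_cols='text', label_cols='label',
                                      vocab=data_lm.vocab, bs=32)
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('ft_enc')
learn_clf.fit_one_cycle(4, 1e-2)
```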

As you can see, we achieve the following results:

[Figure: Results using Transfer Learning]

Data Augmentation + Transfer Learning with pre-trained native ULMFiT model in Hindi

Now, in addition to what we did above, we’ll also be using Data Augmentation.

Preparing Augmented Data for Training

For preparing augmented data, we’ll be using the iNLTK library’s get_similar_sentences function.

Getting 5 augmented variations for every data point in train
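A minimal sketch of what this step looks like (the sample review below is made up for illustration; setup downloads the pre-trained Hindi model on first use):

```python
# Sketch: generating 5 augmented variations of a review with iNLTK.
from inltk.inltk import setup, get_similar_sentences

setup('hi')  # one-time download of the pre-trained Hindi model

review = 'यह फिल्म बहुत अच्छी थी'  # made-up sample: "This movie was very good"
variations = get_similar_sentences(review, 5, 'hi')  # most similar first
print(variations)
```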

This takes ~10 hours to complete, but at the end we get 5 variations for every movie review in our training data.

Out of the 5 variations generated, I chose the first one for every movie review to add to the training set. I did not experiment with choosing, say, the 5th variation returned by get_similar_sentences (note that the variations are returned in order of decreasing similarity).

This doubles the size of our training set. You can check out the augmented training set here.
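Putting it together, here is a sketch of how the augmented set can be assembled, assuming the same train.csv layout as above (the column names are assumptions):

```python
# Sketch: add the most similar variation of each review to the training
# set, doubling its size. Column names ('text', 'label') are assumptions.
import pandas as pd
from inltk.inltk import get_similar_sentences

train_df = pd.read_csv('train.csv')
aug_rows = []
for _, row in train_df.iterrows():
    variations = get_similar_sentences(row['text'], 5, 'hi')
    aug_rows.append({'text': variations[0], 'label': row['label']})  # first = most similar

augmented_df = pd.concat([train_df, pd.DataFrame(aug_rows)], ignore_index=True)
augmented_df.to_csv('train_augmented.csv', index=False)  # 2x the original size
```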

Language Model fine-tuning and Classifier Training

Using the training set prepared above (original + augmented), we can now do Language Model fine-tuning and Classifier training, exactly as before.

You can find all of the code in this notebook.

As you can see, the accuracy of the classifier improves from 62.22% to 68.33%, a relative improvement of more than 9% ((68.33 - 62.22) / 62.22 ≈ 9.8%).

The Kappa score of the classifier likewise improves by more than 20%.
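If you want to reproduce these metrics yourself, both are available in scikit-learn (the label arrays below are placeholders, not the actual test set):

```python
# Sketch: computing accuracy and Cohen's kappa with scikit-learn.
# y_true / y_pred are placeholders for the test labels and predictions.
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = ['positive', 'neutral', 'negative', 'positive']  # placeholder
y_pred = ['positive', 'neutral', 'positive', 'positive']  # placeholder

print('Accuracy:', accuracy_score(y_true, y_pred))
print('Kappa:   ', cohen_kappa_score(y_true, y_pred))
```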

One gotcha here: because our dataset (and hence the test set) is quite small, the percentage by which Data Augmentation helps the metrics on your dataset might vary. Nonetheless, Data Augmentation + Transfer Learning is the way to go in low resource settings, that’s for sure!

About me

I’m working as an ML Engineer-2 at Haptik on fundamental Conversational-AI problems. I’m also the creator of the open-source iNLTK library.

Check out my homepage to know more about me or to reach out!