Emotion Detection from Hindi Text Corpus Using ULMFiT



Figure 1: Source- Impact

Written by Ankit Singh, Dhairya Patel and Kaustumbh Jaiswal

Introduction

Deep Learning has charged up the space of Image recognition and Speech processing for some time now.

We are witnessing a similar trend in Natural Language Processing.

Deep Learning for NLP was less impressive at first, but with the introduction of techniques like ULMFiT, ELMo, Transformers, BERT, etc., it has become an impact driver, yielding state-of-the-art (SOTA) results for common NLP tasks.

Named entity recognition (NER), part of speech (POS) tagging, Sentiment analysis, etc., are some of the problems where neural network models have outperformed traditional approaches. The progress in machine translation is perhaps the most remarkable amongst all.

In this blog we will showcase a ULMFiT model and use it for emotion detection. ULMFiT is a technique that applies transfer learning to text classification tasks.

Let’s begin!

Transfer Learning

Transfer learning is the technique of using weights from a pre-trained deep neural network and tweaking them a bit to suit our application. In other words, it is applying the knowledge of an already trained model to a different but related problem.

Figure 2: Source- EverythingAi

It is well suited to applications with small datasets and also reduces computation time.

What is ULMFiT?

ULMFiT stands for Universal Language Model Fine-tuning for Text Classification, a technique introduced by Jeremy Howard and Sebastian Ruder to bring transfer learning to NLP tasks.

The USPs of ULMFiT are:

  • Discriminative fine-tuning
  • Slanted triangular learning rates
  • Gradual unfreezing

Discriminative Fine-Tuning

Figure 3: Source- towardsdatascience

Different layers of a neural network capture different types of information, so they should be fine-tuned to different extents. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning allows us to tune each layer with a different learning rate.
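As a minimal sketch (the numbers below are illustrative, not the authors'): the ULMFiT paper suggests dividing the learning rate by a factor of 2.6 for each lower layer group, and fastai 0.7 accepts such an array of per-layer-group rates directly in learner.fit().

# Illustrative discriminative learning rates: each lower layer group gets a rate
# 2.6x smaller than the group above it (the factor suggested in the ULMFiT paper)
lr = 3e-3
lrs = np.array([lr/(2.6**4), lr/(2.6**3), lr/(2.6**2), lr/2.6, lr])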

Slanted Triangular Learning Rates

Figure 4: Source- ULMFiT

The model should quickly converge to a suitable region of the parameter space at the beginning of training and then refine its parameters. Using a constant learning rate throughout training is not the best way to achieve this behaviour. Instead, slanted triangular learning rates (STLR) first increase the learning rate linearly and then linearly decay it.
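As a sketch, the STLR schedule from the ULMFiT paper can be written as follows (cut_frac, ratio and lr_max are the paper's default hyper-parameters; T is the total number of training iterations):

def stlr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at iteration t out of T (ULMFiT paper)."""
    cut = int(T * cut_frac)                             # iteration at which the rate peaks
    if t < cut:
        p = t / cut                                     # short linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1/cut_frac - 1))    # longer linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio       # ratio = lr_max / lr_min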

Gradual Unfreezing

Gradual unfreezing is the practice of unfreezing the layers gradually, which avoids catastrophic forgetting of the knowledge the model already possesses. It first unfreezes the top layer and fine-tunes all the unfrozen layers for one epoch. It then unfreezes the next lower frozen layer and repeats, until all layers have been unfrozen and fine-tuned to convergence in the final iteration.
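In fastai 0.7 this translates into the pattern sketched below, which is essentially what we apply to the classifier at the end of this post (learner, lrs and wd are placeholders for the learner, learning rates and weight decay; the real calls later also pass use_clr):

learner.freeze_to(-1)                       # unfreeze only the last layer group
learner.fit(lrs, 1, wds=wd, cycle_len=1)    # fine-tune it for one epoch
learner.freeze_to(-2)                       # also unfreeze the next lower group
learner.fit(lrs, 1, wds=wd, cycle_len=1)
learner.unfreeze()                          # finally unfreeze the whole network
learner.fit(lrs, 1, wds=wd, cycle_len=10)   # train until convergence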

For a detailed explanation of ULMFiT we strongly suggest going through this paper.

Let’s Code!

Installation

To run the code explained in the subsequent sections, make sure fastai version 0.7 is installed on your system. To install fastai, follow the instructions given here.

from fastai.text import *       # fastai 0.7; also pulls in numpy (np), pandas (pd), torch, etc.
import html
import sklearn.model_selection  # used later for the train/validation split

Getting Started

We start by creating separate folders for the classification and language models.

PATH = Path('') # path to the data
CLAS_PATH=Path('emotion_hindi_clas/')
CLAS_PATH.mkdir(exist_ok=True)
LM_PATH=Path('emotion_hindi_lm/')
LM_PATH.mkdir(exist_ok=True)

Dataset

The dataset is created manually as there's no pre-existing dataset for Hindi emotion detection. It comprises 5 labels: Angry, Happy, Neutral, Sad and Excited.

Each entry of the dataset is then converted to a text file which is stored in a folder of the class to which it belongs. Now, let’s load the dataset.
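For reference, the directory layout assumed by the loading code below looks like this (file names are illustrative):

train/
    angry/0.txt, 1.txt, ...
    excited/ ...
    happy/ ...
    neutral/ ...
    sad/ ...
test/
    angry/ ..., excited/ ..., happy/ ..., neutral/ ..., sad/ ...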

CLASSES = ['angry', 'excited', 'happy', 'neutral', 'sad']

def get_texts(path):
    texts, labels = [], []
    for idx, label in enumerate(CLASSES):
        for fname in (path/label).glob('*.*'):
            texts.append(fname.open('r', encoding='utf-8').read())
            labels.append(idx)
    return np.array(texts), np.array(labels)

trn_texts, trn_labels = get_texts(PATH/'train')
val_texts, val_labels = get_texts(PATH/'test')

Our dataset consists of 5 classes: Angry, Excited, Happy, Neutral and Sad.

The get_texts() function loads the data and stores all the texts in trn_texts and val_texts and their respective labels in trn_labels and val_labels.

Data Pre-processing

Now we convert our data into CSV format with two columns, labels and text.

col_names = ['labels', 'text']

df_trn = pd.DataFrame({'text': trn_texts, 'labels': trn_labels}, columns=col_names)
df_val = pd.DataFrame({'text': val_texts, 'labels': val_labels}, columns=col_names)

df_trn.to_csv(CLAS_PATH/'train_hindi.csv', header=False, index=False)
df_val.to_csv(CLAS_PATH/'test_hindi.csv', header=False, index=False)

(CLAS_PATH/'classes_hindi.txt').open('w', encoding='utf8').writelines(f'{o}\n' for o in CLASSES)

We also create a separate CSV to train our language model, with all labels set to 0 (labels are not required for training the language model).

trn_texts, val_texts = sklearn.model_selection.train_test_split(
    np.concatenate([trn_texts, val_texts]), test_size=0.1)

df_trn = pd.DataFrame({'text': trn_texts, 'labels': [0]*len(trn_texts)}, columns=col_names)
df_val = pd.DataFrame({'text': val_texts, 'labels': [0]*len(val_texts)}, columns=col_names)

df_trn.to_csv(LM_PATH/'train_hindi.csv', header=False, index=False)
df_val.to_csv(LM_PATH/'test_hindi.csv', header=False, index=False)

Language Model

We use a language model pre-trained on the Hindi Wikipedia dump corpus. The language model gives the model a better understanding of the language: for example, given an incomplete sentence, it will try to complete it by predicting the next word.

re1 = re.compile(r'  +')

def fixup(x):
    x = x.replace('#39;', "'").replace('\\"', '"').replace('#146;', "'")
    return re1.sub(' ', html.unescape(x))

The fixup function cleans up some of the stray escape codes and extra whitespace present in the dataset.

BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data-field tag

def get_texts(df, n_lbls=1):
    labels = df.iloc[:, range(n_lbls)].values.astype(np.int64)
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls + 1, len(df.columns)):
        texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
    texts = list(texts.apply(fixup).values)
    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)

The get_texts function applies the fixup function and inserts the xbos and xfld tags, which mark the beginning of a sentence and the start of a field, respectively.

def get_all(df, n_lbls):
    tok, labels = [], []
    for i, r in enumerate(df):
        print(i)
        tok_, labels_ = get_texts(r, n_lbls)
        tok += tok_
        labels += labels_
    return tok, labels

The get_all function iterates over the DataFrame chunks, tokenizes the data and returns the tokenized texts and the labels.

chunksize = 24000   # get_all expects an iterator of DataFrame chunks, so re-read the csvs in chunks
df_trn = pd.read_csv(LM_PATH/'train_hindi.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(LM_PATH/'test_hindi.csv', header=None, chunksize=chunksize)
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)
(LM_PATH/'tmp').mkdir(exist_ok=True)

We create a list itos which maps integer indices to tokens (the position of a token in the list is its integer id).

# freq holds every token together with its frequency of occurrence
freq = collections.Counter(p for o in tok_trn for p in o)
max_vocab = 60000   # the maximum size of the vocabulary
min_freq = 2        # keep only tokens that occur more than min_freq times
itos = [o for o, c in freq.most_common(max_vocab) if c > min_freq]
itos.insert(0, '_pad_')   # padding token (ends up at index 1)
itos.insert(0, '_unk_')   # unknown-word token (index 0)

Also, a dictionary stoi is required to convert the tokens back to their integer indices (tokens outside the vocabulary map to index 0).

stoi = collections.defaultdict(lambda: 0, {v: k for k, v in enumerate(itos)})
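The tokens are then numericalised with stoi, giving the integer arrays trn_lm and val_lm used in the next section (this step is not shown in the original post; it follows the standard fastai ULMFiT workflow):

# Map every token to its integer id so the corpora can be fed to the model
trn_lm = np.array([[stoi[o] for o in p] for p in tok_trn])
val_lm = np.array([[stoi[o] for o in p] for p in tok_val])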

Fine Tuning the Language Model

Next we load the pre-trained language model and fine tune it on our dataset. We also load the itos file of the pre-trained language model to map the vocab of the dataset to the pre-trained language model’s. For example, if खुश maps to 7 in the pre-trained language model then खुश in the dataset should also map to 7.

wgts = torch.load(PRE_LM_PATH, map_location=lambda storage, loc: storage)
itos2 = pickle.load((PRE_PATH/'itos_wiki_hindi.pkl').open('rb'))
stoi2 = collections.defaultdict(lambda: -1, {v: k for k, v in enumerate(itos2)})
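The vocabulary mapping itself is done by copying the pre-trained embedding rows into a new weight matrix ordered by our own itos, with unseen words falling back to the mean embedding. The snippet below is a sketch following the fastai ULMFiT (imdb) notebook and assumes the standard fastai 0.7 AWD-LSTM weight names:

vs = len(itos)                                  # vocabulary size of our dataset
enc_wgts = to_np(wgts['0.encoder.weight'])      # pre-trained embedding matrix
row_m = enc_wgts.mean(0)                        # mean embedding, used for unseen words
new_w = np.zeros((vs, enc_wgts.shape[1]), dtype=np.float32)
for i, w in enumerate(itos):
    r = stoi2[w]                                # index in the pre-trained vocab, -1 if absent
    new_w[i] = enc_wgts[r] if r >= 0 else row_m
wgts['0.encoder.weight'] = T(new_w)                             # encoder embedding
wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w))
wgts['1.decoder.weight'] = T(np.copy(new_w))                    # tied decoder weights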

To fine-tune the language model, we create a LanguageModelData object from our dataset, which is then used to create an instance of the language model.
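The hyper-parameters used below are not spelled out in the original post; the values shown here are assumptions taken from the fastai ULMFiT (imdb) notebook:

em_sz, nh, nl = 400, 1150, 3    # embedding size, hidden units, LSTM layers (AWD-LSTM defaults)
bs, bptt = 52, 70               # batch size and back-prop-through-time length
wd = 1e-7                       # weight decay
lr = 1e-3
lrs = lr                        # a single rate is enough at the language-model stage
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * 0.7   # AWD-LSTM dropout values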

trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])
learner.metrics = [accuracy]
learner.model.load_state_dict(wgts)   # loading the pre-trained weights

Fine-tuning of the model is done using gradual unfreezing explained in the above sections.

learner.freeze_to(-1)
learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1)
learner.unfreeze()
learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=7)

After fine-tuning, the language model is saved along with its encoder weights which will be used by the classifier.

learner.save('lm_fine_tuned')
learner.save_encoder('lm_enc_fine_tuned')

Classification Model

We begin by pre-processing the data in the same way as done for the language model and then make our RNN classifier.
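The intermediate data objects are not shown in the original post, so the sketch below (again following the fastai ULMFiT notebook) indicates how the md, c and dps used next are typically built; trn_clas/val_clas stand for the numericalised texts and trn_labels/val_labels for the class indices produced by the pre-processing step:

trn_ds = TextDataset(trn_clas, trn_labels)
val_ds = TextDataset(val_clas, val_labels)
trn_samp = SortishSampler(trn_clas, key=lambda x: len(trn_clas[x]), bs=bs//2)
val_samp = SortSampler(val_clas, key=lambda x: len(val_clas[x]))
trn_dl = DataLoader(trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
md = ModelData(PATH, trn_dl, val_dl)
c = int(trn_labels.max()) + 1                       # number of classes (5 here)
dps = np.array([0.4, 0.5, 0.05, 0.3, 0.4]) * 0.5    # illustrative classifier dropout values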

m = get_rnn_classifer(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                      pad_token=1, layers=[em_sz*3, 50, c],
                      drops=[dps[4], 0.1], dropouti=dps[0], wdrop=dps[1],
                      dropoute=dps[2], dropouth=dps[3])

# Adam is used as the optimiser
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

RNN_Learner handles the creation of a learner object from text data using a given bptt.

learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.clip=.25
learn.metrics = [accuracy]

We load the encoder weights of the fine-tuned language model and train our classifier on that using gradual unfreezing.

learn.load_encoder('lm_enc_fine_tuned')
learn.freeze_to(-1)
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))
learn.freeze_to(-2)
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))
learn.unfreeze()
learn.fit(lrs, 1, wds=wd, cycle_len=10, use_clr=(32,10))

The pre-trained Hindi language model and the notebook can be found here.

Results

The model achieved a peak accuracy of 90.26% on the validation set.
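As a quick sanity check (a sketch using fastai 0.7 helpers rather than code from the original post), the validation accuracy can be recomputed from the trained learner:

# Log-probabilities and targets for the validation set, then the accuracy
log_preds, targs = learn.predict_with_targs()
print(accuracy_np(log_preds, targs))   # roughly 0.90 for the result reported above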

End Notes

We hope you found this blog post helpful and have understood the concept of ULMFiT. There is still a lot to explore in ULMFiT using the fastai library and we encourage you to take a look. For a deeper understanding of the code, we suggest going through the fastai course mentioned in the reference section. If you have any doubts or suggestions, please feel free to mention them in the comment section.

Thanks for reading. Happy coding! 👨🏽‍💻😊

References

  1. http://course18.fast.ai/lessons/lesson10.html
  2. Regularizing and Optimizing LSTM Language Models
  3. A disciplined approach to neural network hyper-parameters