A brief introduction to Intent Classification

Recently I learned about something called “intent classification” for a project, so I thought I would share it with all of you, along with how I created a classifier for it. Intent classification is an important component of Natural Language Understanding (NLU) systems in any chatbot platform.

The best way to understand it is with an example:

As I said, it is an important component of a chatbot platform, and chatbots are essentially assistants for us in our daily lives. Say you have an assistant and you tell him/her to ‘book you a cab’. Your assistant knows how to respond to that query because he/she has a brain trained for it. But how will you train your chatbot to respond to a particular query? In the case of chatbots, to make them respond according to the user’s query we use “intent classification”, and the categories in which a chatbot responds are known as “intents”.

So if you ask it to book a cab, it will respond under that category; if you ask it to book a flight, it will respond under that category, and so on.

Problem Statement

I was given a small dataset of 1,113 statements (or queries) with their respective intents and was asked to build an intent classifier for it. There are a total of 21 intents (categories/classes) in this dataset.

I used Python and a Google Colab notebook to develop this, with deep learning components to build the classifier.

The complete notebook is available; to get the full code, fork the notebook.

Data Preparation

In the field of machine learning and deep learning, I think this step is the deciding factor of your project. Simplify your data as much as you can, which in turn gives your model a helping hand to train more easily and faster.

import pandas as pd

def load_dataset(filename):
    # read the CSV of sentences and their intents
    df = pd.read_csv(filename, encoding="latin1",
                     names=["Sentence", "Intent"])
    intent = df["Intent"]
    unique_intent = list(set(intent))
    sentences = list(df["Sentence"])

    return (intent, unique_intent, sentences)
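
As a quick illustration of how this might be called (the filename here is just a placeholder, not the actual dataset path):

# "Dataset.csv" is a hypothetical filename; substitute your own file
intent, unique_intent, sentences = load_dataset("Dataset.csv")
print(len(sentences), "sentences across", len(unique_intent), "intents")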

These are the steps I performed:

1. Data Cleaning

Data is like crude oil. It’s valuable, but if unrefined it cannot really be used.

If you are using raw data, you should clean it before feeding it to your model. There is no single definitive method for cleaning data; we can use several methods and tricks.

import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer  # assuming NLTK's WordNet lemmatizer

lemmatizer = WordNetLemmatizer()

def cleaning(sentences):
    words = []
    for s in sentences:
        # keep only letters, digits and spaces
        clean = re.sub(r'[^ a-z A-Z 0-9]', " ", s)
        w = word_tokenize(clean)
        # lowercase and lemmatize every token
        words.append([lemmatizer.lemmatize(i.lower()) for i in w])

    return words

Here I first removed all punctuation and special characters (if any) from the data, then tokenized the sentences into words. After this I lowercased all the words and applied lemmatization to them.

Let’s take a quick look at what lemmatization is and why I used it here.

1.1 Lemmatization

“Lemmatisation (or lemmatization) in linguistics is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.”

That is the official definition, but I never quite understood it, so in my own words: lemmatization is a process in which we get the lemma (the actual dictionary form) of a word.

For example:

lemmatizer.lemmatize("cats") ==> cat
lemmatizer.lemmatize("churches") ==> church
lemmatizer.lemmatize("abaci") ==> abacus

This is lemmatization, and I used it so that if someone writes a word differently, the classifier can still understand it and give us the best result possible.
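
The article does not say which lemmatizer is used; assuming NLTK’s WordNetLemmatizer (a natural fit alongside word_tokenize), a minimal runnable version of the example above looks like this:

import nltk
from nltk.stem import WordNetLemmatizer

# one-time downloads: the WordNet corpus and the punkt tokenizer models
nltk.download("wordnet")
nltk.download("punkt")

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))      # cat
print(lemmatizer.lemmatize("churches"))  # church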

2. Encoding

2.1 Input Encoding

After cleaning the data I got, for each sentence, a list of words. To convert these words into indices that can be used as input, I use the Tokenizer class of Keras.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# creating tokenizer
def create_tokenizer(words,
                     filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~'):
    token = Tokenizer(filters=filters)
    token.fit_on_texts(words)
    return token

# getting maximum length
def max_length(words):
    return len(max(words, key=len))

# encoding list of words
def encoding_doc(token, words):
    return token.texts_to_sequences(words)

Here I used filters for a reason that you will see later. After running these, I got a vocabulary size of 462 and a maximum sentence length of 28 words.

After this I use padding to make all sequences the same length so that they can be used in the model.

def padding_doc(encoded_doc, max_length):
    return pad_sequences(encoded_doc, maxlen=max_length,
                         padding="post")
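
Putting the pieces above together, the preprocessing pipeline would look roughly like this; it is a sketch, using the variable names that appear later in the article (padded_doc, word_tokenizer, max_length), with the commented values taken from the results described above:

cleaned_words = cleaning(sentences)

word_tokenizer = create_tokenizer(cleaned_words)
vocab_size = len(word_tokenizer.word_index) + 1   # the article reports 462
max_length = max_length(cleaned_words)            # 28; this shadows the helper above,
                                                  # matching how later code refers to it

encoded_doc = encoding_doc(word_tokenizer, cleaned_words)
padded_doc = padding_doc(encoded_doc, max_length)
print(padded_doc.shape)                           # (1113, 28)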

2.2 Output Encoding

For the outputs I did the same thing: first I indexed the intents using the Tokenizer class of Keras.

output_tokenizer = create_tokenizer(unique_intent,
                                    filters='!"#$%&()*+,-/:;<=>?@[\]^`{|}~')

Here I used a different filter than the default one, because when I checked the outputs (intents) they looked like this:

{'commonq.assist',
'commonq.bot',
'commonq.how',
'commonq.just_details',
'commonq.name',
'commonq.not_giving',
'commonq.query',
'commonq.wait',
'contact.contact',
'faq.aadhaar_missing',
'faq.address_proof',
'faq.application_process',
'faq.apply_register',
'faq.approval_time',
'faq.bad_service',
'faq.banking_option_missing',
'faq.biz_category_missing',
'faq.biz_new',
'faq.biz_simpler',
'faq.borrow_limit',
'faq.borrow_use'}

Looking at the output intents, I found that there are ‘.’ and ‘_’ characters present in the output strings. When I used the default filter of the Tokenizer class, it removed them, and I was getting only strings like “commonq”, “faq” and so on. So to keep each string as it is, I changed the default filter and removed ‘.’ and ‘_’ from it, which in turn preserves the output labels.
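
To illustrate the effect, here is a small sketch comparing the default filter with the modified one on one of the labels above:

from keras.preprocessing.text import Tokenizer

# the default filter strips '.' and '_', splitting the label into two tokens
t_default = Tokenizer()
t_default.fit_on_texts(["commonq.bot"])
print(t_default.word_index)   # {'commonq': 1, 'bot': 2}

# with '.' and '_' removed from the filter, the label is kept intact
t_custom = Tokenizer(filters='!"#$%&()*+,-/:;<=>?@[\]^`{|}~')
t_custom.fit_on_texts(["commonq.bot"])
print(t_custom.word_index)    # {'commonq.bot': 1}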

After indexing those 21 intents it’s time to one-hot encode them so that they can be fed to the model.

from sklearn.preprocessing import OneHotEncoder

def one_hot(encode):
    o = OneHotEncoder(sparse=False)
    return o.fit_transform(encode)
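
For context, here is a sketch of how the intents could be indexed and one-hot encoded; encoded_output is my own intermediate name, while output_one_hot is the name used in the split below. The reshape is needed because OneHotEncoder expects a 2-D array:

import numpy as np

encoded_output = encoding_doc(output_tokenizer, intent)
encoded_output = np.array(encoded_output).reshape(len(encoded_output), 1)
output_one_hot = one_hot(encoded_output)
print(output_one_hot.shape)   # (1113, 21)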

3. Train and Validation set

The data is ready for the model, so the final step is to split the dataset into a training set and a validation set.

from sklearn.model_selection import train_test_split

train_X, val_X, train_Y, val_Y = train_test_split(
    padded_doc, output_one_hot, shuffle=True, test_size=0.2)

Here I split the dataset into an 80% training set and a 20% validation set, and we get these shapes:

Shape of train_X = (890, 28) and train_Y = (890, 21)
Shape of val_X = (223, 28) and val_Y = (223, 21)

And this concludes data preparation, or preprocessing. Now all we have to do is create a model architecture and feed this data into it.

Defining the Model

I am using a bidirectional GRU here, but you can try different networks and see the difference.

from keras.models import Sequential
from keras.layers import (Embedding, Bidirectional, GRU, Dense,
                          Dropout, BatchNormalization)

def create_model(vocab_size, max_length):
    model = Sequential()

    model.add(Embedding(vocab_size, 128,
                        input_length=max_length, trainable=False))
    model.add(Bidirectional(GRU(128)))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.5))
    model.add(BatchNormalization())
    model.add(Dense(21, activation="softmax"))

    return model

I trained this model with the Adam optimizer, a batch size of 16, and 100 epochs. I achieved 89% training accuracy and 87% validation accuracy.
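
The training call itself is not shown here, but with the settings mentioned (Adam, batch size 16, 100 epochs) it would look roughly like this; categorical_crossentropy is my assumption for the loss, given the one-hot encoded targets:

model = create_model(vocab_size, max_length)
model.compile(loss="categorical_crossentropy",   # assumed loss for one-hot targets
              optimizer="adam",
              metrics=["accuracy"])

hist = model.fit(train_X, train_Y,
                 epochs=100,
                 batch_size=16,
                 validation_data=(val_X, val_Y))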

Here are some plots for better visualization of results.

Plot between training loss and validation loss

Plot between training accuracy and validation accuracy

Making Predictions

import numpy as np

def predictions(text):
    # clean the input text the same way as the training data
    clean = re.sub(r'[^ a-z A-Z 0-9]', " ", text)
    test_word = word_tokenize(clean)
    test_word = [lemmatizer.lemmatize(w.lower()) for w in test_word]
    test_ls = word_tokenizer.texts_to_sequences(test_word)

    # check for unknown words (they come back as empty sequences)
    if [] in test_ls:
        test_ls = list(filter(None, test_ls))

    test_ls = np.array(test_ls).reshape(1, len(test_ls))

    x = padding_doc(test_ls, max_length)

    pred = model.predict_classes(x)
    return pred

By passing input text to the function above, I get the predicted class.

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

To convert the integer index that I get from the prediction back into a word, I use the function above, which maps the integer back through the output tokenizer.

text = "Can you help me?"
pred = predictions(text)
word = word_for_id(pred, output_tokenizer)

I get pred = 17, and the word related to this integer is “commonq.bot”.

Conclusion

And there you have it! I hope you’ve enjoyed learning about intent classification. There are many things you can try yourself here, and you may get better accuracy. You can use different networks in the model, different hyperparameters, and different preprocessing; these choices vary from person to person.

You can tweak all those hyperparameters to generate even better results. Try it out yourself by forking this notebook.
