User Queries Classification using Deep Neural Networks



By Madhu Vamsi
Text Classification

Problem: The provider follows a ticketing system for all telephone calls received across all of its departments.

Calls to the provider can be about New Appointments, Cancellations, Lab Queries, Medical Refills, Insurance, General Doctor Advice, etc. Each ticket contains a summary and a description of the call, written by various staff members with no standard text guidelines.

The challenge: based on the text of the summary and description of the call, stored in the “converse” column, classify the ticket into the appropriate category (out of 21 categories).


We have ‘n’ tickets for ‘n’ calls. Each ticket contains free-form text that does not follow any specific format, and it has to be classified into one of the 21 categories based on that text.

Example:
Text in ticket: “scheduled appointment, new patient name Akash Awasthi suffering from Fever” 
Corresponding Category: NEW APPOINTMENT

Since there are 21 categories and each ticket is assigned to exactly one of them, this is a multi-class classification problem.

Approach:
The approach includes three major steps:

1. Feature Engineering

2. Feature Selection

3. Using different machine learning and deep learning algorithms

1. Feature Engineering: All the words in the tickets are embedded as low-dimensional vectors using CBOW (Continuous Bag of Words), which accounts for the similarity between context words (see the sketch after this list).
2. Feature Selection: This removes noisy features, i.e. words with low information gain such as the most frequent words and stop words (“the”, “of”, “a”, “an”, ...).
3. Using different machine learning and deep learning algorithms:
 a. ANN with Global Max-Pooling
 b. CNN
 c. Bi-LSTM
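The post does not show the embedding and stop-word steps as code, so here is a minimal sketch of what they could look like, assuming gensim 4.x for the CBOW (sg=0) word vectors and scikit-learn's built-in English stop-word list; it reuses the data['converse'] column that appears later in the post, and 'appointment' is just an illustrative query word.

from gensim.models import Word2Vec
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Tokenize each ticket and drop low-information stop words ("the", "of", "a", "an", ...)
tokenized = [
    [w for w in str(text).lower().split() if w not in ENGLISH_STOP_WORDS]
    for text in data['converse']
]

# Train 100-dimensional CBOW embeddings (sg=0 selects CBOW rather than skip-gram)
cbow = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, sg=0)

# Words used in similar contexts end up with similar vectors
print(cbow.wv.most_similar('appointment', topn=5))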


a. ANN with Global Max Pooling: A global max-pooling layer and dense layers are applied on top of the embedded vectors, and the output is fed to a softmax layer.
Advantage: thanks to the pooling layer, the ANN does not suffer from the bias problem where later words dominate the representation.
Disadvantage: the range of context depends entirely on the global max-pooling layer, so it can only account for very small context patterns (unigrams).

b. CNN with Max-Pooling: Convolution operations are applied to the embedded vectors, followed by max pooling (the figure below depicts the process).
Advantage: thanks to the max-pooling layer, the CNN also avoids the bias problem, so it may capture the semantics of the text better than recursive or recurrent neural networks.
Disadvantage: the range of context depends entirely on the kernel size used in the architecture. If the kernel size is too small (e.g. 2), it can only account for small context patterns (bigrams); if it is too large, the number of computations may explode.

(Figure source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/)

c. Bi-LSTM: A special case of the RNN architecture that propagates the signal both forward and backward (an LSTM propagates it only forward). A Bi-LSTM can also selectively read, write and forget through its input, forget and output gates.
Advantages: the context of the whole utterance is used to interpret what is being said, rather than a strictly left-to-right interpretation; and unlike a CNN, there is no constraint imposed by a fixed kernel size.
Disadvantages: heavy computation.


Data Pre-processing:

As our labels are strings, the model cannot work with the label names directly. We convert all the labels into unique numbers, where each unique number represents one label.

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
nx = labelencoder.fit_transform(data['categories'])

To see which label was assigned to which number:

#(it will print the corresponding label for the index_number)
labelencoder.classes_[index_number]

Now each label has been converted into a unique number. Next, we convert these numbers into one-hot vectors:

import numpy as np

def convert_to_one_hot(Y, C):
    Y = np.eye(C)[Y.reshape(-1)]
    return Y

Y_oh_train = convert_to_one_hot(nx, C=21)
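As a side note, the same one-hot conversion could also be done with Keras' built-in utility; a quick sketch, assuming the same integer labels nx produced above:

from keras.utils import to_categorical

# Equivalent one-hot encoding of the 21 integer labels
Y_oh_train = to_categorical(nx, num_classes=21)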

We lemmatize the data to get good performance. Let's see what lemmatization does to the training data set. After that, we create a dictionary and convert all the words to indices.

Lemmatization example:

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

train_val = data['converse'].values
train_doc = []

lemmatizer = WordNetLemmatizer()

for sen in range(0, len(train_val)):
    # Convert to lowercase
    document = str(train_val[sen]).lower()
    # Lemmatize word by word (WordNetLemmatizer works on single tokens)
    train_doc.append(' '.join(lemmatizer.lemmatize(w) for w in document.split()))

# Get the length (in words) of the longest sentence in train_doc
maxLen_train = max(len(doc.split()) for doc in train_doc)

# Create a dictionary that contains all the words used in the training data set
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vectorizer.fit(data['converse'].astype('U'))
dic = vectorizer.vocabulary_

# Function to convert sentences to sequences of word indices
def sentences_to_indices(X, word_to_index, max_len):
    m = X.shape[0]
    X_indices = np.zeros((m, max_len))
    for i in range(m):
        sentence_words = [w.lower() for w in X[i].split()]
        j = 0
        for w in sentence_words:
            try:
                X_indices[i, j] = word_to_index[w]
            except KeyError:
                # Word not in the vocabulary: leave its index at 0
                print(w)
            j += 1
    return X_indices

X1 = np.array(train_doc)
X1_indices = sentences_to_indices(X1, dic, max_len=maxLen_train)

# Input will be like:  'please book a appointment ...'
print("X1 =", X1)
# Output will be like: '102 533 45 2 ...'
print("X1_indices =", X1_indices)
# The created dic will be like: {'please': 102, 'book': 533, 'a': 45, ...}
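One caveat not addressed in the original post: CountVectorizer numbers its vocabulary from 0, while sentences_to_indices pads with 0, so whichever word happens to receive index 0 becomes indistinguishable from padding. If that matters, a small optional adjustment, sketched below, reserves 0 for padding; the Embedding layers would then need input_dim = vocab_size + 1 rather than the hard-coded 39288.

# Optional sketch: reserve index 0 for padding by shifting every vocabulary index up by one
dic = {word: idx + 1 for word, idx in vectorizer.vocabulary_.items()}
vocab_size = len(dic) + 1  # +1 so index 0 (padding) still fits inside the Embedding layer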

ANN with Global Average Pooling

import tensorflow as tf
import keras
from keras.models import Sequential

model1 = Sequential()
model1.add(keras.layers.Embedding(39288, 100))
model1.add(keras.layers.GlobalAveragePooling1D())
model1.add(keras.layers.Dense(100, activation=tf.nn.relu))
model1.add(keras.layers.Dense(21, activation=tf.nn.sigmoid))

model1.summary()
model1.compile(optimizer='adam',
               loss='categorical_crossentropy',
               metrics=['acc'])

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, None, 100) 3928800
_________________________________________________________________
global_average_pooling1d_1 ( (None, 100) 0
_________________________________________________________________
dense_1 (Dense) (None, 100) 10100
_________________________________________________________________
dense_2 (Dense) (None, 21) 2121
=================================================================
Total params: 3,941,021
Trainable params: 3,941,021
Non-trainable params: 0
_________________________________________________________________

from keras.callbacks import EarlyStopping, ModelCheckpoint

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=4)
mc = ModelCheckpoint('best_model_txt.h5', monitor='val_loss', mode='min', verbose=0, save_best_only=True)
model1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model1.fit(Xtrain, ytrain, epochs=100, batch_size=10, verbose=1, validation_split=0.2, callbacks=[es, mc])
#####################################################
Train on 38959 samples, validate on 9740 samples
Epoch 1/100
38959/38959 [==============================] - 53s - loss: 2.0186 - acc: 0.3243 - val_loss: 1.6079 - val_acc: 0.5141
Epoch 2/100
38959/38959 [==============================] - 47s - loss: 1.3820 - acc: 0.5732 - val_loss: 1.2021 - val_acc: 0.6362
Epoch 3/100
38959/38959 [==============================] - 45s - loss: 1.0652 - acc: 0.6644 - val_loss: 1.0454 - val_acc: 0.6783
Epoch 4/100
38959/38959 [==============================] - 46s - loss: 0.9213 - acc: 0.7053 - val_loss: 0.9626 - val_acc: 0.6992
Epoch 5/100
38959/38959 [==============================] - 46s - loss: 0.8278 - acc: 0.7318 - val_loss: 0.9220 - val_acc: 0.7080
Epoch 6/100
38959/38959 [==============================] - 48s - loss: 0.7585 - acc: 0.7492 - val_loss: 0.9276 - val_acc: 0.7016
Epoch 7/100
38959/38959 [==============================] - 41s - loss: 0.7035 - acc: 0.7656 - val_loss: 0.9030 - val_acc: 0.7079
Epoch 8/100
38959/38959 [==============================] - 38s - loss: 0.6548 - acc: 0.7798 - val_loss: 0.9036 - val_acc: 0.7124
Epoch 9/100
38959/38959 [==============================] - 38s - loss: 0.6073 - acc: 0.7948 - val_loss: 0.9547 - val_acc: 0.7105
Epoch 10/100
38959/38959 [==============================] - 38s - loss: 0.5648 - acc: 0.8074 - val_loss: 1.0037 - val_acc: 0.6911
Epoch 11/100
38959/38959 [==============================] - 39s - loss: 0.5289 - acc: 0.8205 - val_loss: 0.9740 - val_acc: 0.7084
Epoch 12/100
38959/38959 [==============================] - 39s - loss: 0.4937 - acc: 0.8319 - val_loss: 1.0020 - val_acc: 0.7075
Epoch 00011: early stopping

CNN with Max-Pooling

from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout, Activation

model2 = Sequential()
model2.add(Embedding(39288, 100))
model2.add(Conv1D(filters=10, kernel_size=5, padding='valid', activation='relu', strides=1))
model2.add(Conv1D(filters=30, kernel_size=3, padding='valid', activation='relu', strides=1))
model2.add(Conv1D(filters=80, kernel_size=1, padding='valid', activation='relu', strides=1))
model2.add(GlobalMaxPooling1D())
model2.add(Dense(10))
model2.add(Dropout(0.2))
model2.add(Activation('relu'))
model2.add(Dense(21))
model2.add(Activation('sigmoid'))
model2.summary()

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_6 (Embedding) (None, None, 100) 3928800
_________________________________________________________________
conv1d_12 (Conv1D) (None, None, 10) 5010
_________________________________________________________________
conv1d_13 (Conv1D) (None, None, 30) 930
_________________________________________________________________
conv1d_14 (Conv1D) (None, None, 80) 2480
_________________________________________________________________
global_max_pooling1d_4 (Glob (None, 80) 0
_________________________________________________________________
dense_9 (Dense) (None, 10) 810
_________________________________________________________________
dropout_4 (Dropout) (None, 10) 0
_________________________________________________________________
activation_7 (Activation) (None, 10) 0
_________________________________________________________________
dense_10 (Dense) (None, 21) 231
_________________________________________________________________
activation_8 (Activation) (None, 21) 0
=================================================================
Total params: 3,938,261
Trainable params: 3,938,261
Non-trainable params: 0
_________________________________________________________________

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
mc = ModelCheckpoint('best_model_txt.h5', monitor='val_loss', mode='min', verbose=0, save_best_only=True)
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history2 = model2.fit(Xtrain, ytrain, epochs=100, batch_size=100, verbose=1, validation_split=0.2, callbacks=[es, mc])
#################################
Train on 38959 samples, validate on 9740 samples
Epoch 1/100
38959/38959 [==============================] - 14s - loss: 2.3662 - acc: 0.2068 - val_loss: 1.8195 - val_acc: 0.2395
Epoch 2/100
38959/38959 [==============================] - 14s - loss: 1.6986 - acc: 0.3426 - val_loss: 1.3626 - val_acc: 0.4417
Epoch 3/100
38959/38959 [==============================] - 14s - loss: 1.3936 - acc: 0.4993 - val_loss: 1.1063 - val_acc: 0.6601
Epoch 4/100
38959/38959 [==============================] - 14s - loss: 1.1596 - acc: 0.6252 - val_loss: 1.0368 - val_acc: 0.6749
Epoch 5/100
38959/38959 [==============================] - 14s - loss: 1.0442 - acc: 0.6581 - val_loss: 1.0466 - val_acc: 0.6754
Epoch 6/100
38959/38959 [==============================] - 14s - loss: 0.9648 - acc: 0.6770 - val_loss: 1.0527 - val_acc: 0.6765
Epoch 7/100
38959/38959 [==============================] - 14s - loss: 0.8867 - acc: 0.7001 - val_loss: 1.1021 - val_acc: 0.6740
Epoch 8/100
38959/38959 [==============================] - 14s - loss: 0.8367 - acc: 0.7121 - val_loss: 1.1250 - val_acc: 0.6693
Epoch 9/100
38959/38959 [==============================] - 14s - loss: 0.7863 - acc: 0.7283 - val_loss: 1.1611 - val_acc: 0.6709
Epoch 10/100
38959/38959 [==============================] - 14s - loss: 0.7302 - acc: 0.7423 - val_loss: 1.2754 - val_acc: 0.6720
Epoch 00009: early stopping

Bi-LSTM

from keras.layers import Bidirectional, LSTM

model = Sequential()
model.add(Embedding(39288, 100))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.2))
model.add(Dense(21, activation='sigmoid'))
model.summary()

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, None, 100) 3928800
_________________________________________________________________
bidirectional_2 (Bidirection (None, 128) 84480
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_2 (Dense) (None, 21) 2709
=================================================================
Total params: 4,015,989
Trainable params: 4,015,989
Non-trainable params: 0
_________________________________________________________________

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=4)
mc = ModelCheckpoint('best_modelc_txt.h5', monitor='val_loss', mode='min', verbose=0, save_best_only=True)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history4 = model.fit(Xtrain, ytrain, epochs=100, batch_size=100, verbose=1, validation_split=0.2, callbacks=[es, mc])
Train on 38959 samples, validate on 9740 samples
Epoch 1/100
38959/38959 [==============================] - 631s - loss: 1.8634 - acc: 0.3920 - val_loss: 1.1067 - val_acc: 0.6724
Epoch 2/100
38959/38959 [==============================] - 621s - loss: 0.9793 - acc: 0.7054 - val_loss: 0.8956 - val_acc: 0.7200
Epoch 3/100
38959/38959 [==============================] - 621s - loss: 0.7814 - acc: 0.7604 - val_loss: 0.8743 - val_acc: 0.7256
Epoch 4/100
38959/38959 [==============================] - 627s - loss: 0.6761 - acc: 0.7892 - val_loss: 0.8589 - val_acc: 0.7320
Epoch 5/100
38959/38959 [==============================] - 628s - loss: 0.5940 - acc: 0.8111 - val_loss: 0.9029 - val_acc: 0.7253
Epoch 6/100
38959/38959 [==============================] - 630s - loss: 0.5456 - acc: 0.8278 - val_loss: 0.9076 - val_acc: 0.7239
Epoch 7/100
38959/38959 [==============================] - 635s - loss: 0.4839 - acc: 0.8476 - val_loss: 0.9498 - val_acc: 0.7233
Epoch 8/100
38959/38959 [==============================] - 643s - loss: 0.4391 - acc: 0.8621 - val_loss: 0.9921 - val_acc: 0.7161
Epoch 9/100
38959/38959 [==============================] - 716s - loss: 0.4080 - acc: 0.8718 - val_loss: 0.9891 - val_acc: 0.7158
Epoch 00008: early stopping
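The post stops at training, but the pipeline above can be reused for inference with the checkpoint saved by ModelCheckpoint. Below is a rough sketch (not from the original article) that reuses dic, maxLen_train, sentences_to_indices and labelencoder defined earlier; 'best_modelc_txt.h5' is the Bi-LSTM checkpoint file name used above.

from keras.models import load_model
from nltk.stem import WordNetLemmatizer

# Load the best Bi-LSTM weights saved by the ModelCheckpoint callback
best_model = load_model('best_modelc_txt.h5')

def predict_category(ticket_text):
    # Apply the same preprocessing as the training data: lowercase + lemmatize,
    # truncated to the training max length so sentences_to_indices does not overflow
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(w) for w in str(ticket_text).lower().split()][:maxLen_train]
    doc = ' '.join(words)
    # Convert words to indices with the vocabulary built on the training set
    indices = sentences_to_indices(np.array([doc]), dic, max_len=maxLen_train)
    # Pick the category with the highest predicted score
    probs = best_model.predict(indices)
    return labelencoder.classes_[np.argmax(probs)]

print(predict_category("scheduled appointment, new patient name Akash Awasthi suffering from Fever"))
# expected output: NEW APPOINTMENT (or whichever category the model assigns)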

ANN with Global Pooling: the global pooling layer summarizes the whole sequence at once, so the model captures only global information and fails to capture local sequences.
CNN with Max-Pooling: it is good at capturing local sequences (n-gram patterns), but it fails to capture the sentiment/category of the sentence as a whole.
Bi-LSTM: thanks to its selective read, write and forget, it can capture both local and global sequences.


Specifications: These models were trained on a Tesla K20c GPU with 4 GB of memory and a 24-core multicore processor with 90 GB of RAM.

Conclusion: Based on the validation accuracies reported above, the Bi-LSTM (best val_acc of roughly 0.73) outperforms both the ANN with global pooling and the CNN architectures.