Original article was published on Deep Learning on Medium
I chose the Stacking model (an ensemble of AdaBoost over Logistic Regression and a plain Logistic Regression), as it had a decent training accuracy and a reasonable validation accuracy. You might be thinking that accuracies in the 0.5 to 0.6 range are surely not great. But this was a 5-way multiclass classification, so the chance of picking a score at random and getting it right was only 0.2. These are also subjective scores, and even a human can find it hard to land on exactly the right one. This is better demonstrated with a confusion matrix.
You can see that most of the time the model predicts the correct score, as shown by the diagonal line. Most of the error (the reason accuracy sits in the 50–60% range) comes from adjacent scores, e.g. predicting a score of 1 when the true score was 2. I was happy with this, as the model would still be good enough to distinguish between great, average, and bad reviews.
At this point the computer could interpret the input text and get some sense of its sentiment.
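To make the "off by one score" idea concrete, here is a minimal sketch of how a confusion matrix and a "within one score" accuracy could be computed. The labels are made up purely for illustration, and the helper names `confusion_matrix` and `accuracy_within` are my own here, not the code or the data behind the plot.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """Rows are true scores, columns are predicted scores (scores run 1..n_classes)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t - 1][p - 1] += 1
    return matrix


def accuracy_within(matrix, tolerance):
    """Fraction of predictions at most `tolerance` scores away from the truth."""
    total = sum(sum(row) for row in matrix)
    close = sum(
        count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
        if abs(i - j) <= tolerance
    )
    return close / total


# Made-up labels for illustration only (assumption, not the article's data).
y_true = [1, 2, 2, 3, 4, 5, 5, 3, 1, 4]
y_pred = [1, 1, 2, 3, 5, 5, 4, 3, 2, 2]

cm = confusion_matrix(y_true, y_pred)
exact = accuracy_within(cm, 0)       # ordinary accuracy: the diagonal only
off_by_one = accuracy_within(cm, 1)  # diagonal plus the adjacent scores
```

With a tolerance of 0 this is plain accuracy. Raising the tolerance to 1 also counts the cells next to the diagonal, which is a quick way to check how much of the error comes from near misses.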
I wanted better.
Why not make it more human? Neural networks are loosely modelled on the way neurons in our brains work, so switching to one seemed like the most promising way to improve my model.
The preprocessing for the neural network model was a little different.
I created a dictionary whose keys were all the unique words in the corpus and whose values were a number assigned to each word. I also added 4 special keys for padding, start of review, unknown words, and unused words. In total the dictionary had 17317 entries, built from 9405 reviews.
word_index_dict['<PAD>'] = 0
word_index_dict['<START>'] = 1
word_index_dict['<UNK>'] = 2
word_index_dict['<UNUSED>'] = 3
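Here is a minimal sketch of how such a dictionary could be built and used to turn a review into numbers. The whitespace tokenizer, the helper names `build_word_index` and `encode`, and the choice to prepend `<START>` to every review are illustrative assumptions rather than the exact code behind the model.

```python
SPECIAL_TOKENS = ['<PAD>', '<START>', '<UNK>', '<UNUSED>']


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_word_index(reviews):
    """Give the special tokens ids 0-3, then number each new word as it appears."""
    word_index_dict = {token: i for i, token in enumerate(SPECIAL_TOKENS)}
    for review in reviews:
        for word in tokenize(review):
            if word not in word_index_dict:
                word_index_dict[word] = len(word_index_dict)
    return word_index_dict


def encode(review, word_index_dict):
    """Map a review to ids; words missing from the dictionary become <UNK>."""
    unk = word_index_dict['<UNK>']
    ids = [word_index_dict['<START>']]  # assumption: mark the start of each review
    ids += [word_index_dict.get(word, unk) for word in tokenize(review)]
    return ids


# Toy corpus for illustration only.
reviews = ["Great hotel great staff", "Bad room"]
word_index_dict = build_word_index(reviews)
encoded = encode("great room terrible", word_index_dict)
```

In this toy example "terrible" never appears in the corpus, so it is encoded with the `<UNK>` id instead of raising an error.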
As a final preprocessing step, I padded every encoded review to a maximum length of 250 words. Then I trained the model.
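Padding simply means making every encoded review the same length so the reviews can be batched together. A minimal sketch is below, assuming padding is added at the end with the `<PAD>` id of 0 and that longer reviews are cut off at 250. Both choices are assumptions, and the helper name `pad_sequence` is hypothetical. Keras offers a `pad_sequences` utility that does the same job.

```python
def pad_sequence(ids, max_len=250, pad_id=0):
    """Cut the sequence to max_len, then fill the remainder with pad_id."""
    trimmed = ids[:max_len]
    return trimmed + [pad_id] * (max_len - len(trimmed))


short_review = pad_sequence([1, 4, 8, 2])      # padded up to 250 ids
long_review = pad_sequence(list(range(300)))   # truncated down to 250 ids
```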
- Neural Network Architecture:
The special layer for NLP here is the Embedding Layer.
The words are mapped to vectors in a vector space, in my case 16-dimensional vectors. These vectors are learned during training, so each word ends up with a vector shaped by the words it appears alongside, its context. This differs from the TF-IDF vectorisation used earlier: we aren’t just looking at frequency-based metrics, but at the impact of each word given its context.
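Under the hood, an embedding layer is essentially a lookup table with one trainable row per dictionary entry. The sketch below shows only the lookup, in plain Python with random starting values. In a real network these values are updated during training. The averaging step at the end is a hypothetical way to collapse a review into one vector, not necessarily what my architecture does.

```python
import random

VOCAB_SIZE = 17317   # number of entries in the word dictionary
EMBEDDING_DIM = 16   # size of each word vector

rng = random.Random(0)
# One 16-dimensional vector per dictionary entry, randomly initialised.
embedding_matrix = [
    [rng.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(ids):
    """Replace every word id with its row from the embedding matrix."""
    return [embedding_matrix[i] for i in ids]


def average_pool(vectors):
    """Assumed pooling step: average the word vectors into one review vector."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


review_ids = [1, 4, 8, 2] + [0] * 246   # a padded review of length 250
vectors = embed(review_ids)             # 250 vectors, each of length 16
review_vector = average_pool(vectors)   # a single vector of length 16
```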
This is starting to feel more human.
Now words like good, great, bad, and worse have more meaningful numbers (vectors) associated with them. New reviews that the model is tested on won’t just contain some of these words, but also the words that surround them, which paint a better picture of what the writer of the review is trying to say. This picture could be sharpened with more data, but the current 9405 reviews will do a fine job.
- Testing Neural Network Model
The testing accuracy of the model came to 0.5710, which is better than our previous model’s accuracy of 0.5077. That is an improvement of about 6 percentage points (0.0633), which is a meaningful step up, but again the best way to assess this 5-way multi-class classification is by looking at a confusion matrix.
As you can see, the model never predicted a review with a score of 5 as a score of 1, or vice versa. The other misclassifications have also improved, and the majority of the predictions sit closer to the main diagonal.
I have built a demo application of the model using Streamlit and Heroku, which you can try out here: www.hilton-hotel-app.herokuapp.com/
Improvements to be made:
- Use a bigger training dataset
- Try a deeper neural network
- Reduce complexity of classification to binary classification
- Implement other pre-trained vectorisation methods, such as word2vec or GloVe