e-commerce Text Classification with Attention and Self-Attention

Source: Deep Learning on Medium

With the recent hype and advancement in Natural Language Processing driven by the rise of Deep Learning, text classification has seen dramatic improvements, especially with the introduction of transfer learning in NLP using large models such as BERT and XLNet. However, these large models come at the expense of complexity, which incurs higher prediction time and much higher deployment cost. In this post, we introduce attention as well as self-attention for text classification without using large-scale models such as BERT. Behold: this post assumes familiarity with Deep Learning. At the end, we will compare the performance of BERT, a biLSTM with Attention, a biLSTM with Self-Attention, and a plain biLSTM model.

Introduction

Our text classification problem is to classify product listings on e-commerce websites into categories such as COUNTERFEIT, SMOKE, WEAPON, PHARMACY, ADULT, LEGITIMATE, TORRENT MOVIE SOFTWARE and so forth. The use case is to flag products that are deemed unsuitable for the e-commerce platform. What to flag is entirely up to the platform companies such as Shopee, 11street, Qoo10 and so forth. For example, if e-commerce company X would like to flag and ban all adult-related products, such a classification task comes in very handy.

Sample Product Listing (src: https://www.qoo10.sg/item/1-7-2020-KOREA-NEW-STYLE-BLOUSE-DRESS-LONG-BLOUSE-SHORT-BLOUSE/539260045?stcode=77)

Let’s quickly break down the image above. Many kinds of data are available, such as product images, title, prices, product descriptions (if you scroll down) and so forth. It is possible to also use the images for a separate classification problem and combine it with the text-based classifier in an ensemble to improve performance. It is also possible to train on both images and text with a single multi-task learning model. However, this post only discusses text classification.

We will be talking about three main models: BERT, a bi-directional LSTM with attention, and a bi-directional LSTM with self-attention.

Bi-directional LSTM with Attention

Attention was first introduced by Bahdanau et al., 2015 to improve Neural Machine Translation (NMT) with a sequence-to-sequence (seq2seq) model. However, a text classification task does not require an encoder-decoder architecture, so a variant of the model is needed.

Our model can be visualized in the following diagram.

Edited from the diagram of this paper Bahdanau et al., 2015

My model is inspired by the Hierarchical Attention Network (HAN) paper. The idea is very similar to the four-step framework for NLP tasks introduced by Dr. Matthew Honnibal, the creator of spaCy: embed, encode, attend and predict. For those who are interested, the link is here: https://www.youtube.com/watch?v=8MCNMjF04-0

Hierarchical Attention Network (HAN) Paper

IIT Madras Research Institute actually provides a very good lecture series on Deep Learning, which also covers HAN in detail.

The difference between HAN and our model is that HAN deals with multiple sentences in a document hierarchy, whereas our model only has one sentence: we use only the product listing titles for the classification task. It is also possible to use product descriptions to improve classification accuracy. However, I have found that the titles alone provide very good accuracy and F1-score, and in certain cases even outperform a classifier formulated with more information.
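
To make the embed-encode-attend-predict pipeline concrete, here is a minimal sketch of the first two steps in Keras for title-only input. All sizes (vocabulary, title length, dimensions) are made-up placeholders, not the values used in this project.

```python
from tensorflow.keras import layers

MAX_LEN = 30        # product titles are short
VOCAB_SIZE = 50000  # assumed vocabulary size
EMBED_DIM = 300     # e.g. the dimension of GloVe / word2vec vectors
LSTM_UNITS = 128

title_ids = layers.Input(shape=(MAX_LEN,), dtype="int32", name="title_tokens")

# Embed: map token ids to (pre-trained) word vectors.
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(title_ids)

# Encode: the bi-directional LSTM returns one hidden state per token;
# the attention layer in the next step will weight and pool these.
hidden_states = layers.Bidirectional(
    layers.LSTM(LSTM_UNITS, return_sequences=True)
)(embedded)
# hidden_states has shape (batch, MAX_LEN, 2 * LSTM_UNITS)
```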

Attention Score Calculation by Lilian Weng
Attention Score Functions Summary provided by Lilian Weng

We chose to follow Bahdanau’s score function because it also resembles the equations for the self-attention model and the BERT model that come later, which helps ensure a fair comparison between the different models. Model explainability is highly important, as clients will often ask for a detailed explanation of why the model makes certain predictions, so we also visualize the attention weights. Darker colors correspond to lower values, and vice versa.
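
Continuing the embed/encode sketch above, here is one way the attend and predict steps could look with a Bahdanau-style additive score, written as a custom Keras layer. This is my reading of the setup rather than the exact implementation used here; the layer also returns its attention weights so they can be visualized for explainability.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AdditiveAttentionPooling(layers.Layer):
    """Scores each time step with v^T tanh(W h_t), as in Bahdanau-style
    (additive) attention, and pools the hidden states by the softmax weights."""

    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.W = layers.Dense(units, activation="tanh")
        self.v = layers.Dense(1)

    def call(self, hidden_states):
        # hidden_states: (batch, time, features)
        scores = self.v(self.W(hidden_states))            # (batch, time, 1)
        weights = tf.nn.softmax(scores, axis=1)           # attention over tokens
        context = tf.reduce_sum(weights * hidden_states, axis=1)
        return context, tf.squeeze(weights, -1)

# `title_ids` and `hidden_states` come from the embed/encode sketch above.
context, attn_weights = AdditiveAttentionPooling(units=64)(hidden_states)

N_CLASSES = 8  # placeholder number of categories (COUNTERFEIT, SMOKE, ...)
probs = layers.Dense(N_CLASSES, activation="softmax")(context)

# Exposing the weights as a second output lets us plot them per title.
model = tf.keras.Model(title_ids, [probs, attn_weights])
```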

Model Output for Explainability

We will leave the full model comparison to the end. However, a biLSTM model with attention already performs much better than the traditional machine learning model (which was in production) or a plain biLSTM model. There is probably no surprise here.

Bi-directional LSTM with Self-Attention

Self-attention as used here was introduced in the paper Attention Is All You Need (Vaswani et al., 2017), the Transformer architecture on which BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding) is built. Instead of calculating attention scores between the decoder and the encoder states, self-attention calculates attention scores of a sentence with respect to itself. This concept was extremely vague to me when I first read the paper, as the model is highly complex, primarily due to the new terminology of key, value and query. However, it is possible to build up the idea starting from a seq2seq model with attention. This video provides a very clear description of the self-attention mechanism.

To keep the equations consistent, since I started the post with the equations as laid out by Lilian Weng, I will also follow her conventions when writing the documentation for this task. Mapping the new terminology of self-attention onto the seq2seq model with attention:

An NMT diagram taken from the web, with my labels added

Now that we have mapped the new terminology, we introduce a new way of calculating attention weights. So far there is nothing new, just different conventions.

BERT is a highly complex model built on various concepts such as self-attention, multi-head attention, positional encoding and so forth. However, understanding self-attention is enough for the classification task under discussion for now.

The magic formula to calculate the attention score, scaled dot-product attention, is Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where d_k is the dimension of the keys:

https://arxiv.org/pdf/1706.03762.pdf
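
As a sanity check of the formula, here is a small numpy sketch of scaled dot-product attention on a toy sentence; the projection matrices are random stand-ins for learned weights.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (tokens, tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                              # output keeps V's width

# Toy self-attention: 6 tokens ("Hello", ",", "how", "are", "you", "?"),
# each embedded in 8 dimensions. Q, K and V all come from the same sentence.
tokens = np.random.randn(6, 8)
W_q, W_k, W_v = (np.random.randn(8, 8) for _ in range(3))
out, attn = scaled_dot_product_attention(tokens @ W_q, tokens @ W_k, tokens @ W_v)
print(out.shape, attn.shape)   # (6, 8) (6, 6)
```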

To improve our understanding of the self-attention model, it is always good to visualize the embeddings.

src: https://towardsdatascience.com/breaking-bert-down-430461f60efb

Notice that the dimensions of the input and of the embeddings produced by the self-attention mechanism are the same. This is so that the operation can be repeated multiple times in parallel and the results concatenated. The diagram illustrates this need to compute attention scores several times in parallel: the multi-head attention mechanism, whose head outputs are concatenated at the end.
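
A rough sketch of the multi-head idea, reusing the scaled_dot_product_attention function from the previous snippet: the model dimension is split across heads, attention runs independently per head, and the concatenated output has the same dimension as the input. The head count and sizes are arbitrary.

```python
import numpy as np

def multi_head_attention(X, num_heads=4):
    """Split the model dimension across heads, attend per head, concatenate."""
    tokens, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head projections (random stand-ins for learned weight matrices).
        W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
        head_out, _ = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
        heads.append(head_out)                    # each is (tokens, d_head)
    return np.concatenate(heads, axis=-1)         # (tokens, d_model), same as input

X = np.random.randn(6, 8)
print(multi_head_attention(X, num_heads=4).shape)   # (6, 8)
```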

If you have read the paper, you will notice that in the appendix the attention scores are visualized like this:

https://arxiv.org/pdf/1706.03762.pdf

To put it into perspective, take for example the sentence “Hello, how are you?” with a total of 6 tokens. The author of this post has made a very nice write-up helping us visualize the attention weights, which can be rendered as a heatmap as shown above.
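
For reference, a heatmap like the one above can be drawn with a few lines of matplotlib; the weight matrix below is a random stand-in for the real attention scores.

```python
import matplotlib.pyplot as plt
import numpy as np

tokens = ["Hello", ",", "how", "are", "you", "?"]
# Stand-in attention matrix: each row sums to 1 (as softmax output would).
attn = np.random.dirichlet(np.ones(len(tokens)), size=len(tokens))

fig, ax = plt.subplots()
ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("attended token")
ax.set_ylabel("query token")
plt.show()
```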

The output of Attention(Q, K, V)

Unlike the biLSTM model with attention, for which I had to reinvent the wheel myself since no code was available online (or maybe I just couldn’t find it), for self-attention somebody has done the hard work for us and implemented a self-attention layer in Keras.
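
For illustration, here is one way to wire self-attention into the biLSTM classifier. The original used a third-party Keras self-attention layer; in this sketch I substitute Keras’s built-in MultiHeadAttention with query = key = value, which implements the same self-attention idea, and all sizes are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN, VOCAB_SIZE, EMBED_DIM, LSTM_UNITS, N_CLASSES = 30, 50000, 300, 128, 8

title_ids = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(title_ids)
x = layers.Bidirectional(layers.LSTM(LSTM_UNITS, return_sequences=True))(x)

# Self-attention: every token of the title attends to every other token.
x = layers.MultiHeadAttention(num_heads=2, key_dim=64)(query=x, value=x, key=x)
x = layers.GlobalAveragePooling1D()(x)
probs = layers.Dense(N_CLASSES, activation="softmax")(x)

model = tf.keras.Model(title_ids, probs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```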

Using this self-attention mechanism, the model unsurprisingly performs better than the baseline model in production. For a comparison of all the deep models, keep reading 🙂

Bidirectional Encoder Representations From Transformers (BERT)

I doubt anyone will be surprised that BERT performs the best of the three models. The question, though, is how much better, and how much slower? There is a reason why I left this model for last. As a little sneak peek: the final model chosen is not BERT.

BERT relies on transfer learning, just like the other two models. However, the other two models (biLSTMs) rely only on pre-trained word embeddings for transfer learning. BERT, on the other hand, is pre-trained differently, using two additional mechanisms, namely

  1. Masked Language Model (MLM)
  2. Next Sentence Prediction
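
In practice, the pre-trained weights are then fine-tuned on the classification task at hand. Here is a minimal fine-tuning sketch using the Hugging Face transformers library, which is an assumption on my part since the post does not say which BERT implementation was used; the titles, labels and hyperparameters are toy placeholders.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=8)      # placeholder number of categories

# Toy stand-ins for the real product titles and labels.
titles = ["korea new style long blouse dress", "replica branded handbag sale"]
labels = [0, 1]

enc = tokenizer(titles, padding=True, truncation=True, max_length=32,
                return_tensors="tf")

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(dict(enc), tf.constant(labels), epochs=3, batch_size=16)
```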

I learned about the way BERT is pre-trained in this post, which I would definitely recommend. For a complete head-to-toe description of BERT, I would recommend reading this post.

Now, to motivate word embeddings, since I did not mention them earlier: the previous two models are initialized with pre-trained word embeddings such as word2vec, GloVe, FastText, ELMo, etc. It is also possible to use BERT to generate context-aware embeddings, since it is a pre-trained language model; there are multiple ways to combine the output tokens, such as taking the last few tokens and averaging them, et cetera. A comparison between different word embeddings could be a post of its own. However, I did a poor job of documenting the differences in performance across embeddings. In the future, given another NLP task, I will document it properly and give better insights into word embeddings. In general, word2vec and GloVe work quite well (good enough), and it is no surprise that context-aware embeddings such as ELMo work better (though not by an obvious margin).
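
For completeness, here is a common way to initialize a Keras Embedding layer with pre-trained GloVe vectors; the file path and toy vocabulary are placeholders.

```python
import numpy as np
from tensorflow.keras import layers, initializers

EMBED_DIM = 300
word_index = {"blouse": 1, "dress": 2}   # toy vocabulary: word -> integer id
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))

# Assumed local copy of the pre-trained GloVe vectors.
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word, vector = parts[0], np.asarray(parts[1:], dtype="float32")
        if word in word_index:
            embedding_matrix[word_index[word]] = vector

embedding_layer = layers.Embedding(
    len(word_index) + 1, EMBED_DIM,
    embeddings_initializer=initializers.Constant(embedding_matrix),
    trainable=False)   # freeze, or set True to fine-tune the embeddings
```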

BERT model

Model Comparison

The project aims to reduce false positives (FP). This is because, for every positive (i.e. flagged product listing), somebody will be assigned to investigate the matter further. Perhaps the merchant needs to be taken down altogether, or, if product listings are taken down automatically due to the model’s prediction, the owner may appeal and extra manpower will need to be assigned to handle these cases.

Model Comparison. x and t are used to represent the true values for ease of comparison and for privacy protection

The production model is a shallow machine learning model with the features extracted using a Term Frequency Inverse Document Frequency (TF-IDF) score.
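
For reference, a baseline of this kind can be sketched in a few lines of scikit-learn; the post does not say which shallow model was in production, so the logistic regression below is just an illustrative stand-in.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real product titles and labels.
titles = ["toy replica weapon knife set", "korea new style long blouse dress"]
labels = ["WEAPON", "LEGITIMATE"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
baseline.fit(titles, labels)
print(baseline.predict(["korea style blouse"]))
```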

BERT is the best-performing model, but at a huge added cost in prediction time. The prediction time reported is the time required to make a prediction with the model on a CPU; a CPU is used in production rather than a GPU to save cost.

The bidirectional LSTM model with attention is chosen due to the explainability inherent in the model itself. It is much more difficult to interpret the attention weights from the self-attention model, because its attention scores are relative to the words themselves, whereas the attention scores from the attention model directly assign an “importance” score to each word with respect to the label. This is much more intuitive for business stakeholders and clients to understand. The performance does not differ much anyway, and we still reap the full benefits of a deep learning model.

Hence, here’s a visualization of the metrics of our newly chosen deep learning model, i.e. the Bidirectional LSTM with Attention, relative to the production model.

Comparison with the baseline model (note that the numbers do not mean anything important)