Source: Deep Learning on Medium

Predicting Amazon reviews scores using Hierarchical Attention Networks with Pytorch and Apache Mxnet

This post and the code here are part of a larger repo that I have (very creatively) called “NLP-stuff”. As the name indicates, I include in that repo projects that I do and/or ideas that I have (as long as there is code associated with those ideas) that are related to NLP. In every directory I have included a README file and a series of explanatory notebooks that I hope help to explain the code. I intend to keep adding projects throughout 2020, not necessarily the latest and/or most popular releases, but simply papers or algorithms I find interesting and useful. In particular, the code related to this post is in the directory amazon_reviews_classification_HAN.

First things first, let’s start by acknowledging the relevant people that did the hard work. This post and the companion repo are based on the paper “Hierarchical Attention Networks for Document Classification” (Zichao Yang, et al, 2016). In addition, I have also used in my implementation the results, and code, presented in “Regularizing and Optimizing LSTM Language Models” (Stephen Merity, Nitish Shirish Keskar and Richard Socher, 2017). The dataset that I have used for this and other experiments in the repo is the Amazon product data (J. McAuley et al., 2015 and R. He, J. McAuley 2016), in particular the Clothing, Shoes and Jewellery dataset. I strongly recommend having a look at these papers and the references therein.

1. The Network Architecture

Once that is done, let’s start by describing the network architecture we will be implementing here. The following figure is Figure 2 in the Zichao Yang et al. paper.

Figure 1 (Figure 2 in their paper). Hierarchical Attention Network (HAN)

We consider a document comprising L sentences s_i, where each sentence contains T words; w_it, with t ∈ [1, T], represents the words in the i-th sentence. As shown in the figure, the authors used a word encoder (a bidirectional GRU, Bahdanau et al., 2014), along with a word attention mechanism, to encode each sentence into a vector representation. These sentence representations are passed through a sentence encoder with a sentence attention mechanism, resulting in a document vector representation. This final representation is passed to a fully connected layer with the corresponding activation function for prediction. The word “hierarchical” refers here to the process of encoding first sentences from words, and then documents from sentences, naturally following the “semantic hierarchy” in a document.

1.1 The Attention Mechanism

Assuming one is familiar with the GRU formulation (if not, have a look here), all the math one needs to understand the attention mechanism is included below. The mathematical expressions I include here refer to the word attention mechanism. The sentence attention mechanism is identical, but at sentence level. Therefore, I believe explaining the following expressions, along with the code snippets below, will be enough to understand the full process. The first 3 expressions are pretty standard:

Equation Group 1 (extracted directly from the paper): GRU output
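For reference, the three expressions can be written as follows. This is a transcription in the notation of the paper, where W_e denotes the embedding matrix and the annotation h_it is the concatenation of the forward and backward outputs:

```latex
\begin{aligned}
x_{it} &= W_e w_{it}, \quad t \in [1, T] \\
\overrightarrow{h}_{it} &= \overrightarrow{\mathrm{GRU}}(x_{it}), \quad t \in [1, T] \\
\overleftarrow{h}_{it} &= \overleftarrow{\mathrm{GRU}}(x_{it}), \quad t \in [T, 1] \\
h_{it} &= [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]
\end{aligned}
```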

Where x_it is the word embedding vector of word t in sentence i. The vectors h_it are the forward and backward output features from the bidirectional GRU, which are concatenated before applying attention. The attention mechanism is formulated as follows:

Equation Group 2 (extracted directly from the paper): Word Attention. Sentence Attention is identical but at sentence level.
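For reference, the word attention expressions (Equations 5 to 7 in the paper) can be transcribed as follows, where W_w and b_w are the weights and bias of the one-layer MLP and u_w is the word-level context vector:

```latex
\begin{aligned}
u_{it} &= \tanh(W_w h_{it} + b_w) \\
\alpha_{it} &= \frac{\exp(u_{it}^\top u_w)}{\sum_{t} \exp(u_{it}^\top u_w)} \\
s_i &= \sum_{t} \alpha_{it} h_{it}
\end{aligned}
```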

First, the h_it features go through a one-layer MLP with a hyperbolic tangent function. This results in a hidden representation of h_it, u_it. Then, the importance of each word is measured as the dot product between u_it and a context vector u_w, and these scores are normalised through a softmax function, obtaining a so-called normalised importance weight α_it. After that, the sentence vector s_i is computed as the weighted sum of the h_it features based on the normalised importance weights. For more details, please read the paper, section 2.2 “Hierarchical Attention”. As mentioned earlier, the sentence attention mechanism is identical but at sentence level.

Word and sentence attention can be coded as:


Snippet 1. Word and Sentence Attention Mechanism with Pytorch


Snippet 2. Word and Sentence Attention Mechanism with Mxnet.

where inp refers to h_it and h_i for word and sentence attention respectively.
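Since the snippets themselves are not reproduced here, the following is a minimal Pytorch sketch of what such an attention module could look like. It is not the code from the repo: the layer names attn and contx are illustrative assumptions, the context vector is implemented as a bias-free linear layer, and padding is not masked.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionWithContext(nn.Module):
    """Attention pooling with a learned context vector (illustrative sketch)."""

    def __init__(self, hidden_dim):
        super().__init__()
        # One-layer MLP: u = tanh(W h + b)
        self.attn = nn.Linear(hidden_dim, hidden_dim)
        # Context vector, written as a bias-free linear map to a scalar score
        self.contx = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, inp):
        # inp: (bsz, seq_len, hidden_dim), i.e. h_it (words) or h_i (sentences)
        u = torch.tanh(self.attn(inp))           # (bsz, seq_len, hidden_dim)
        a = F.softmax(self.contx(u), dim=1)      # (bsz, seq_len, 1), sums to 1 over seq_len
        s = (a * inp).sum(dim=1)                 # (bsz, hidden_dim), weighted sum
        return a.squeeze(-1), s


h = torch.randn(32, 20, 128)                     # e.g. bidirectional GRU output
alpha, s = AttentionWithContext(128)(h)
```

Here alpha holds one importance weight per position and s is the pooled representation; a caller that needs a (bsz, 1, hidden_dim) tensor can add the singleton dimension with s.unsqueeze(1).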

As one can see, the Mxnet implementation is nearly identical to that in Pytorch, albeit with some subtle differences. This is going to be the case throughout the whole HAN implementation. However, I would like to add a few lines to clarify the following: this is my second “serious” dive into Mxnet and Gluon. The more I use it, the more I like it, but I am pretty sure that I could have written better, more efficient code. With that in mind, if you, the reader, are a Mxnet user and have suggestions and comments, I would love to hear them.

1.1.1 Word Encoder + Word Attention

Once we have the AttentionWithContext class, coding WordAttnNet (Word Encoder + Word Attention) is straightforward. The snippet below is a simplified version of that in the repo, but contains the main components. For the full version, please have a look at the code in the repo.


Snippet 3: Pytorch implementation of the Word Encoder + Word Attention module (aka Word Attention Network)


Snippet 4: Mxnet implementation of the Word Encoder + Word Attention module (aka Word Attention Network)

You will notice the presence of 3 dropout related parameters: embed_drop, weight_drop and locked_drop. I will describe them in detail in Section 2. For the time being, let’s ignore them and focus on the remaining components of the module.

Simply, the input tokens (X) go through the embeddings lookup table (word_embed). The resulting token embeddings go through the bidirectional GRU (rnn), and the output of the GRU goes to AttentionWithContext (word_attn), which will return the importance weights (α), the sentence representation (s) and the hidden state h_n.

Note that returning the hidden state is necessary since the document (the Amazon review here) comprises a series of sentences. Therefore, the initial hidden state of sentence i+1 will be the last hidden state of sentence i. We could say that we will treat the documents themselves as “stateful”. I will come back to this later in the post.
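To make the flow above concrete, here is a minimal Pytorch sketch of a word attention network without any of the dropout options. It is an approximation, not the repo code: the attention is inlined (as two layers named attn and contx, both assumptions) so that the block is self-contained, and vocab_size=1000 is an arbitrary value chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordAttnNet(nn.Module):
    """Word encoder + word attention, dropout options omitted (sketch)."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, padding_idx=0):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=padding_idx)
        self.rnn = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Attention parameters: one-layer MLP plus context vector
        self.attn = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        self.contx = nn.Linear(2 * hidden_dim, 1, bias=False)

    def forward(self, X, h_n=None):
        embed = self.word_embed(X)                # (bsz, maxlen_sent, embed_dim)
        h_t, h_n = self.rnn(embed, h_n)           # h_t: (bsz, maxlen_sent, 2*hidden_dim)
        u = torch.tanh(self.attn(h_t))
        a = F.softmax(self.contx(u), dim=1)       # (bsz, maxlen_sent, 1)
        s = (a * h_t).sum(dim=1, keepdim=True)    # (bsz, 1, 2*hidden_dim)
        return a.squeeze(-1), s, h_n


net = WordAttnNet(vocab_size=1000)
sent_1 = torch.randint(1, 1000, (32, 20))         # i-th sentence of 32 reviews
sent_2 = torch.randint(1, 1000, (32, 20))         # (i+1)-th sentence
a_1, s_1, h_n = net(sent_1)                       # no state given: zero initial state
a_2, s_2, h_n = net(sent_2, h_n)                  # starts from the previous sentence's state
```

The second call is the “stateful” part: the hidden state returned for one sentence is passed as the initial state for the next one.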

1.1.2 Sentence Encoder + Sentence Attention

Given the fact that we do not need an embedding lookup table for the sentence encoder, SentAttnNet (Sentence Encoder + Sentence Attention) is simply:


Snippet 5: Pytorch implementation of the Sentence Encoder + Sentence Attention module (aka Sentence Attention Network)


Snippet 6: Mxnet implementation of the Sentence Encoder + Sentence Attention module (aka Sentence Attention Network)

Here, the network will receive the output of WordAttnNet (X), which will then go through the bidirectional GRU (rnn) and then through AttentionWithContext (sent_attn).

At this point we have all the building blocks to code the HAN.

1.1.3 Hierarchical Attention Networks (HANs)


Snippet 7: Pytorch implementation of a Hierarchical Attention Network (HAN)


Snippet 8: Mxnet implementation of a Hierarchical Attention Network (HAN)

I believe it might be useful here to illustrate the flow of the data through the network with some numbers related to the dimensions of the tensors as they navigate the network. Let’s assume we use a batch size (bsz) of 32, token embeddings of dim (embed_dim) 100 and GRUs with hidden size (hidden_dim) 64.

The input to HierAttnNet in the snippet above, X, is a tensor of dim (bsz, maxlen_doc, maxlen_sent), where maxlen_doc and maxlen_sent are the maximum number of sentences per document and words per sentence respectively. Let’s assume that these numbers are 5 and 20. Therefore, X is here a tensor of dim (32, 5, 20).

The first thing we do is to permute the axes, resulting in a tensor of dim (5, 32, 20). This is because we are going to process one sentence at a time, feeding the last hidden state of one sentence as the initial hidden state of the next sentence, in a “stateful” manner. This will happen within the loop in the forward pass.

In that loop we are going to process one sentence at a time, i.e. a tensor of dim (32, 20) containing the i-th sentence for all 32 reviews in the batch. This tensor is then passed to wordattnnet, which is simply Word Encoder + Word Attention as described before. There, it will first go through the embeddings layer, resulting in a tensor of dim (32, 20, 100). Then it goes through the bidirectional GRU, resulting in a tensor of dim (32, 20, 128), since the forward and backward outputs (64 each) are concatenated, and finally through the attention mechanism, resulting in a tensor of dim (32, 1, 128). This last tensor is sᵢ in Equation 7 of the Zichao Yang et al. paper, and corresponds to the i-th sentence vector representation.

After running the loop we will have maxlen_doc (i.e. 5) tensors of dim (32, 1, 128) that will be concatenated along the 2nd dimension, resulting in a tensor of dim (32, 5, 128), i.e. (bsz, maxlen_doc, hidden_dim*2). This tensor is then passed through sentattnnet, which is simply Sentence Encoder + Sentence Attention as described before. There it will first go through the bidirectional GRU, resulting in a tensor of dim (32, 5, 128), and finally through the attention mechanism, resulting in a tensor of dim (32, 128). This last tensor will be the v in Equation 10 of their paper.

Finally, v is passed through a fully connected layer and a softmax function for prediction.
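The dimensions described in this walk-through can be checked with a short Pytorch sketch. It uses plain layers and a small helper (attend, a hypothetical name) rather than the classes in the repo; vocab_size=1000 is an arbitrary assumption and num_class=4 is the number of classes used later in the post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

bsz, maxlen_doc, maxlen_sent = 32, 5, 20
vocab_size, embed_dim, hidden_dim, num_class = 1000, 100, 64, 4


def attend(h, mlp, ctx):
    """Attention pooling over the sequence axis, keeping a singleton dim."""
    a = F.softmax(ctx(torch.tanh(mlp(h))), dim=1)    # (bsz, seq_len, 1)
    return (a * h).sum(dim=1, keepdim=True)          # (bsz, 1, 2*hidden_dim)


embed = nn.Embedding(vocab_size, embed_dim)
word_rnn = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
sent_rnn = nn.GRU(2 * hidden_dim, hidden_dim, bidirectional=True, batch_first=True)
word_mlp = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
word_ctx = nn.Linear(2 * hidden_dim, 1, bias=False)
sent_mlp = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
sent_ctx = nn.Linear(2 * hidden_dim, 1, bias=False)
fc = nn.Linear(2 * hidden_dim, num_class)

X = torch.randint(0, vocab_size, (bsz, maxlen_doc, maxlen_sent))   # (32, 5, 20)
X_perm = X.permute(1, 0, 2).contiguous()                           # (5, 32, 20)

h_n, sent_vecs = None, []          # None means a zero initial hidden state
for sent in X_perm:                # sent: (32, 20), the i-th sentence of every review
    emb = embed(sent)              # (32, 20, 100)
    out, h_n = word_rnn(emb, h_n)  # out: (32, 20, 128), h_n: (2, 32, 64)
    sent_vecs.append(attend(out, word_mlp, word_ctx))   # (32, 1, 128)

doc = torch.cat(sent_vecs, dim=1)                       # (32, 5, 128)
out, _ = sent_rnn(doc)                                  # (32, 5, 128)
v = attend(out, sent_mlp, sent_ctx).squeeze(1)          # (32, 128)
probs = F.softmax(fc(v), dim=1)                         # (32, 4)
```

In practice one would usually return the raw output of the fully connected layer and let the loss function apply the softmax; the explicit softmax here only mirrors the description above.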

2. Embedding, Locked and Weight Dropout

When I started to run experiments I noticed that the model overfitted quite early during training. The best validation loss and accuracy happened within the first couple of epochs, or even after the first epoch. When overfitting occurs there are a number of options:

  • Reduce model complexity: I explore this by running a number of models with a small number of embeddings and/or hidden sizes.
  • Early Stopping: this is always used via an early_stop function.
  • Additional regularisation, such as dropout, label smoothing (Christian Szegedy et al, 2015) or data augmentation. I write “additional” because I already used weight decay.
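As an illustration of the early stopping option, here is a minimal, framework-independent sketch. It is not the early_stop function from the repo: the class name, the patience and min_delta parameters and the validation losses are all made up for the example.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_losses = [0.90, 0.85, 0.87, 0.88, 0.80]   # made-up values for illustration
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

With these values the loop stops at epoch 4, after two consecutive epochs without improvement over the best loss of 0.85, and never sees the last value.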

I have not explored label smoothing or data augmentation in this exercise. If you want to dig a bit more into how to implement label smoothing in Pytorch, have a look at this repo. In the case of Mxnet, the gluonnlp API has its own LabelSmoothing class.

Regarding data augmentation, the truth is that I have not tried it here and perhaps I should. Not only because it normally leads to notable improvements in terms of model generalisation, but also because I already have most of the code from another experiment where I implemented EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (Jason Wei and Kai Zou 2019). Nonetheless, one has to stop somewhere and I decided to focus on exploring different dropout mechanisms.

The 3 different forms of dropout I used here are: embedding dropout, locked dropout and weight dropout. The code that I used is taken directly from the salesforce repo corresponding to the implementation of the AWD-LSTM (Merity, Shirish Keskar and Socher, 2017). In this section I will focus on discussing the Pytorch implementation, but I will also include information regarding Mxnet’s implementation. Note that these dropout mechanisms were initially conceived and implemented in the context of language models. However, there is no reason why they should not work here (or at least no reason why we should not try them).

2.1 Embedding Dropout

This is discussed in detail in Section 4.3 of the Merity et al. paper and is based on the work of Gal & Ghahramani (2016). No one is better placed than the authors themselves to explain it. In their own words: “This is equivalent to performing dropout on the embedding matrix at a word level, where the dropout is broadcast across all the word vector’s embedding. […]”

In code (the code below is a simplified version of that in the original repo):

Simplified (i.e. incomplete) implementation of embedding dropout. From Merity, Shirish Keskar and Socher 2017: Regularizing and Optimizing LSTM Language Models

Basically, we create a mask of 0s and 1s along the 1st dimension of the embeddings tensor (the “word” dimension) and then we expand that mask along the second dimension (the “embedding” dimension), scaling the remaining weights accordingly. As the authors said, we drop words.
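The masking described above can be sketched in Pytorch as follows. This is a simplified approximation written for this post, not the original code: the function signature and the training flag are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def embedded_dropout(embed, words, dropout=0.1, training=True):
    """Embedding lookup after dropping whole rows (words) of the weight matrix."""
    weight = embed.weight
    if training and dropout > 0:
        # One Bernoulli draw per word (row), broadcast across the embedding dimension
        keep = torch.bernoulli(weight.new_full((weight.size(0), 1), 1 - dropout))
        # Rescale the surviving rows so the expected value is unchanged
        weight = weight * keep / (1 - dropout)
    return F.embedding(words, weight, padding_idx=embed.padding_idx)


embed = nn.Embedding(50, 8)
words = torch.arange(50).unsqueeze(0)                   # every word once, shape (1, 50)
out = embedded_dropout(embed, words, dropout=0.5)[0]    # (50, 8)
dropped = (out == 0).all(dim=1)                         # True for words dropped as a whole
```

Every row of out is either entirely zero or the original embedding scaled by 1 / (1 - dropout), which is what “dropping words” means here.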

2.2 Locked Dropout

This is also based on the work of Gal & Ghahramani (2016). Again in the words of the authors: “[…] sample a binary dropout mask only once upon the first call and then to repeatedly use that locked dropout mask for all repeated connections within the forward and backward pass”.

In code:

Implementation of Locked Dropout. From Merity, Shirish Keskar and Socher 2017: Regularizing and Optimizing LSTM Language Models

Simply, LockedDropout will receive a 3-dim tensor; it will then generate a mask along the second dimension and expand that mask along the first dimension. For example, when applied to a tensor like (batch_size, seq_length, embed_dim), it will create a mask of dim (1, seq_length, embed_dim) and apply it to the whole batch. Mxnet’s nn.Dropout module has an axes parameter that directly implements this type of dropout.
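A minimal Pytorch sketch of a module with this behaviour is shown below. It follows the description in this post (one mask of shape (1, d1, d2) shared along the first axis) and is an approximation rather than a copy of the original code; which axis ends up sharing the mask therefore depends on the layout of the tensor that is passed in.

```python
import torch
import torch.nn as nn


class LockedDropout(nn.Module):
    """Dropout with a single mask shared along the first axis (sketch)."""

    def __init__(self, dropout=0.5):
        super().__init__()
        self.dropout = dropout

    def forward(self, x):
        if not self.training or self.dropout == 0:
            return x
        # Sample one mask of shape (1, d1, d2) and broadcast it over the first axis
        mask = x.new_empty((1, x.size(1), x.size(2))).bernoulli_(1 - self.dropout)
        return x * mask / (1 - self.dropout)


x = torch.ones(32, 20, 100)            # (batch_size, seq_length, embed_dim)
y = LockedDropout(0.5)(x)              # modules are in training mode by default
```

Because the mask is broadcast, every slice y[i] has zeros in exactly the same positions.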

And finally…

2.3 Weight Dropout

This is discussed in Section 2 in their paper. Once again, in their own words: “We propose the use of DropConnect (Wan et al., 2013) on the recurrent hidden to hidden weight matrices which does not require any modifications to an RNN’s formulation.”

In code (the code below is a simplified version of that in the original repo):

Simplified (i.e. incomplete) Implementation of Weight Dropout. From Merity, Shirish Keskar and Socher 2017: Regularizing and Optimizing LSTM Language Models

WeightDrop will first copy and register the hidden-to-hidden weights (or, in general terms, the weights in the list weights) with a suffix _raw (line 14). Then, it will apply dropout and assign the weights again to the module (line 25 if variational, or 27 otherwise). As shown in the snippet, the variational option does the same as discussed before in the case of Embedding Dropout, i.e. it generates a mask along the first dimension of the tensor and expands (or broadcasts) it along the second dimension.
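The two masking modes can be illustrated with a small Pytorch function. This sketch only computes the dropped weights; it deliberately leaves out the part of WeightDrop that registers the _raw parameters and re-assigns the result to the wrapped module. The function name drop_weights and its signature are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def drop_weights(raw_w, dropout=0.5, variational=False, training=True):
    """Return a copy of raw_w with DropConnect-style dropout applied."""
    if variational:
        # One draw per row, broadcast across columns; in this sketch the mask is
        # always sampled, regardless of the training flag
        mask = F.dropout(raw_w.new_ones((raw_w.size(0), 1)), p=dropout, training=True)
        return mask.expand_as(raw_w) * raw_w
    # Standard element-wise dropout on the weight matrix
    return F.dropout(raw_w, p=dropout, training=training)


rnn = nn.GRU(100, 64, batch_first=True)
raw_w = rnn.weight_hh_l0                          # hidden-to-hidden weights, (3*64, 64)
w_elem = drop_weights(raw_w, 0.5)                 # individual weights dropped
w_rows = drop_weights(raw_w, 0.5, variational=True)   # whole rows dropped
row_zero = (w_rows == 0).all(dim=1)
```

In the element-wise mode individual connections are removed, whereas in the variational mode entire rows of the matrix are zeroed and the remaining rows are rescaled.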

There are a couple of drawbacks in this implementation. In the first place, given some input weights, the final model will contain the original weights (referred to as weight_name_raw) and those with dropout (referred to as weight_name), which is not very efficient. Secondly, it changes the name of the parameters, adding ‘module’ to the original name.

To be honest, these are not major drawbacks at all, but I can use them as an excuse to introduce another two implementations that are perhaps a bit better (although of course based on the original one). One is the implementation within the great text API of the fastai library. I guess at this point everyone knows about this library, but if you don’t, let me write a couple of lines here. I find this library excellent, not only for the high-level APIs that it offers, or the clever defaults, but also because there are a lot of little gems hidden in the source code. If you are not familiar with the library, give it a go; there is no turning back.

Another nice implementation is the function apply_weight_drop in Mxnet’s gluonnlp API, which I used here. In fact, in their implementation of the AWDRNN language model this function is used for both the embedding and the hidden-to-hidden weight dropout. It is available through their utils module:

from gluonnlp.model.utils import apply_weight_drop

As far as implementation goes, this is it. Time to run some experiments.

3. Results and Visualising Attention

3.1 Results

I eventually recorded 59 experiments (I ran a few more), 40 of them using the Pytorch implementation and 19 using Mxnet. Throughout the experiments I used different batch sizes, learning rates, embedding dimensions, GRU hidden sizes, dropout rates, learning rate schedulers, optimisers, etc. They are all shown in Tables 1 and 2 in the notebook 04_Review_Score_Prediction_Results.ipynb. The best results on the test dataset for each implementation are shown in the table below, along with the best result I obtained from previous attempts using tf-idf along with LightGBM and Hyperopt for the classification and hyper-parameter optimisation tasks.

Table 1. Best results obtained using HANs with Pytorch and Mxnet, and tf-idf+LightGBM. Please see the repo for more details

In the first place, it is worth reiterating that I only ran 19 experiments with the Mxnet implementation. This is in part due to the fact that, as I mentioned earlier in the post, I have more experience with Pytorch than with Mxnet and Gluon, which influenced the corresponding experimentation. Therefore, it is quite possible that I missed a minor tweak to the Mxnet models that would have led to better results than those in the table.

Other than that, we can see that the HAN-Pytorch model performs better than a thoroughly tuned tf-idf+LightGBM model on the test dataset for all three metrics: accuracy, F1 score and precision. Therefore, the next immediate question most will be asking is: is it worth using HAN over tf-idf+LightGBM (or your favourite classifier)? And the answer is, as with most things in life, “it depends”.

It is true that HANs perform better, but the increase is relatively small. In general, leaving aside the particular case of the Amazon reviews, if in your business a ~3% increase in F1 score is important (i.e. leads to a sizeable increase in revenue, savings or some other benefits) then there is no question, one would use the DL approach. On top of that, attention mechanisms might give you some additional, useful information (such as the expressions within the text that lead to a certain classification) beyond just the keywords that one would obtain by using approaches such as tf-idf (or topic modelling).

Finally, my implementation of HANs is inefficient (see next section). Even in that scenario, the results presented in the table are always obtained in less than 10 epochs and each epoch runs in around 3 min (or less, depending on the batch size) on a Tesla K80. Therefore, this is certainly not a computationally expensive algorithm to train, and it performs well. In summary, I’d say that HANs are a good algorithm to have in your repertoire when it comes to performing text classification.

3.2 Visualising Attention

Let’s now have a look at the attention weights, in particular at the word and sentence importance weights (α).

Figure 2. Attention plots for reviews that have been correctly classified, for a positive (top) and a negative (bottom) review. Colour Intensity corresponds to the values of the importance weights (the αs in Equation Group 2 in this post or Equations 6 and 9 in their paper).

Figure 2 shows both word and sentence attention weights for two reviews that were classified correctly. The xxmaj token is a special token introduced by the fastai tokenizer to indicate that the next token starts with a capital letter. In addition, it is worth mentioning that in the original dataset review scores range from 1–5 stars. During preprocessing, I merge reviews with 1 and 2 stars into one class and re-label the classes to start from 0 (see here for details). Therefore, the final number of classes is 4: {0, 1, 2, 3}.
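The re-labelling just described amounts to the following mapping. The function name is hypothetical and the actual preprocessing code in the repo may be written differently; only the mapping itself is taken from the description above.

```python
def merge_and_relabel(stars):
    """Map a 1-5 star score to 4 classes: {1, 2} -> 0, 3 -> 1, 4 -> 2, 5 -> 3."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a score between 1 and 5, got {stars!r}")
    # Scores 1 and 2 collapse to 2, then everything is shifted to start at 0
    return max(stars, 2) - 2


labels = [merge_and_relabel(s) for s in [1, 2, 3, 4, 5]]
```

So a 5 star review becomes class 3 and a 1 star review becomes class 0, which is how the scores quoted for the figures in this section should be read.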

The figure shows how, when predicting the review score, the HAN places attention on phrases and constructions like “fit was perfect”, “very superior” or “rubs […] wrong places”, as well as isolated words like “bought” or “not”. In addition, we can see that in the top plot, a bit more attention is placed on the 3rd sentence relative to the other 3.

Figure 3. Same as Figure 2 but for misclassifications.

Figure 3 shows both word and sentence attention weights for two reviews that were misclassified. The top review was predicted as 0 while the true score was 3 (real score in the original dataset is 5). Someone found those boots “yuck”, “disappointing” and “bad” yet gave them a 5 star score. The review at the bottom was predicted as 3 while the true score was 0 (real score in the original dataset is 1). It is easy to understand why the HAN misclassified this review mostly based on the first sentence, where it places the highest attention.

Nonetheless, the figures show that the attention mechanism works well, capturing the relevant pieces in the reviews that lead to a certain classification. Notebook 05_Visualizing_Attention.ipynb contains the code that I used to generate these plots.

4. Discussion

At this stage, there are a few comments worth making. First of all, I ran all the experiments manually (with a bash file), which is not the best way of optimising the hyper-parameters of the model. Ideally, one would like to wrap up the train and validation processes in an objective function and use Hyperopt, as I did with all the other experiments in the repo that focus on text classification. I will include a .py script to do that in the near future.

On the other hand, looking at Figures 2 and 3 one can see that attention is normally focused on isolated words or on constructions and phrases of 2 or 3 words. Therefore, one might think that using a non-DL approach along with n-grams might improve the results in the table. I actually did that in this notebook and the difference between using or not using n-grams (in particular bigrams via gensim.models.phrases) is negligible.

Other issues worth discussing are related to model generalisation and efficiency. For example, I already mentioned that one could use label smoothing and data augmentation to add regularisation. In fact, even after adding some dropout, the best validation loss and metrics are still obtained early during training, even more so in the case of the Mxnet implementation. This is not necessarily bad and might simply reflect the fact that the model reaches its best performance after just a few epochs. However, more exploration is required.

In addition, if you have a look at the details of my implementation, you will realise that the input tensors have a lot of unnecessary padding. Nothing will be learned from this padding, but it still has to be processed, i.e. this is inefficient for the GPU. To remedy this situation, one could group reviews of similar lengths into buckets and pad accordingly, reducing the computation required to process the documents. Furthermore, one could adjust both learning rate and batch size according to the document length. All these approaches have already been used to build language models (e.g. see this presentation) and are readily available in the gluonnlp API. At this point I have only scratched the surface of what this API can do and I am looking forward to more experimentation in the near future.
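The saving that bucketing can bring is easy to see with a toy calculation. The sketch below is a simplification: it measures each document by a single made-up length (in tokens) and pads every batch to its longest item, whereas the implementation in this post pads to a fixed number of sentences and words. The function name and the lengths are assumptions.

```python
def padded_tokens(lengths, batch_size):
    """Token slots processed when each batch is padded to its longest item."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total


doc_lengths = [12, 95, 14, 90, 10, 100, 15, 88]     # made-up lengths for illustration
real_tokens = sum(doc_lengths)
unsorted_cost = padded_tokens(doc_lengths, batch_size=4)
bucketed_cost = padded_tokens(sorted(doc_lengths), batch_size=4)
```

For these 424 real tokens, batching in the original order processes 780 token slots, while grouping documents of similar length brings that down to 460.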

5. Summary and Conclusions

I have implemented “Hierarchical Attention Networks for Document Classification” (Zichao Yang, et al, 2016) using Pytorch and Mxnet to predict Amazon reviews scores, and compared the results with those of previous implementations that did not involve Deep Learning. HANs perform better across all the evaluation metrics, are relatively easy to implement and fast to train. Therefore, I believe this is an algorithm worth having in the repertoire for text classification tasks.

Other than that, and as always, I hope you found this post useful.

For any comments or suggestions, please email me or, even better, open an issue in the repo.


Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio 2014. Neural Machine Translation by Jointly Learning to Align and Translate.

Yarin Gal, Zoubin Ghahramani 2016. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.

Ruining He, Julian McAuley 2016. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering.

Julian McAuley, Christopher Targett, Qinfeng (‘Javen’) Shi, Anton van den Hengel 2015. Image-based Recommendations on Styles and Substitutes.

Stephen Merity, Nitish Shirish Keskar, Richard Socher 2017. Regularizing and Optimizing LSTM Language Models.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna 2015. Rethinking the Inception Architecture for Computer Vision.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, Rob Fergus 2013. Regularization of Neural Networks using DropConnect.

Jason Wei, Kai Zou 2019. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy 2016. Hierarchical Attention Networks for Document Classification.