Fine-tuning BERT for Amazon Food Reviews

Original article can be found here (source): Deep Learning on Medium

You can download the data from Kaggle. We will concern ourselves with the “Reviews.csv” file. Create a folder named “AMAZON-DATASET” inside the cloned repo and place “Reviews.csv” in it. The dataset is a single, self-contained CSV file.

Dataset Prep

We will use pandas to process our data and feed it to a dataloader.

import pandas as pd
df = pd.read_csv("./AMAZON-DATASET/Reviews.csv", delimiter=',')
# Random shuffle and return a data frame
df = df.sample(frac=1).reset_index(drop=True)

Splitting the data into training and validation sets

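def get_train_and_val_split(df, splitRatio=0.8):
    # Sample splitRatio of the rows for training; the rest is validation
    train = df.sample(frac=splitRatio)
    val = df.drop(train.index)
    print("Number of Training Samples: ", len(train))
    print("Number of Validation Samples: ", len(val))
    return train, val

train, val = get_train_and_val_split(df, config["splitRatio"])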

The target column that we would like to predict is “Score”. Let’s have a look at its distribution.

num_classes = df['Score'].nunique()
print("Number of Target Output Classes:", num_classes)
totalDatasetSize = len(df)
symbols = df.groupby('Score')
scores_dist = []
for i in range(num_classes):
    # Fraction of reviews whose Score is i+1
    scores_dist.append(len(symbols.groups[i+1]) / totalDatasetSize)
    print("The label ", i+1, " is ", scores_dist[i]*100, " % of the dataset")

Maximum Length. As in any NLP pipeline, we have to set a fixed maximum length for our input sentences. The longest review in this dataset is well over 500 words; for the sake of quicker training, I will set MAX_LENGTH to 100.
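One way to sanity-check a cutoff like 100 is to measure how many reviews it would truncate. The sketch below is only a rough illustration and not the article's code: it counts whitespace-separated words (the BERT tokenizer usually produces somewhat more tokens than that), and both the sample reviews and the helper name fraction_within_limit are made up for the example.

```python
def fraction_within_limit(reviews, max_length):
    """Return the share of reviews whose word count fits in max_length.

    Two slots are reserved for the [CLS] and [SEP] tokens that are
    added to every sequence later in the pipeline.
    """
    if not reviews:
        return 0.0
    budget = max_length - 2
    fitting = sum(1 for text in reviews if len(text.split()) <= budget)
    return fitting / len(reviews)


# Hypothetical reviews standing in for the review text in Reviews.csv.
sample_reviews = [
    "I loved the pizza",
    "Great taste and fast delivery",
    "word " * 600,  # an unusually long review
]
coverage = fraction_within_limit(sample_reviews, max_length=100)
print(f"{coverage:.0%} of the sample reviews fit without truncation")
```

If most reviews fit, a short cutoff costs little information while keeping training fast; if many do not, a larger value may be worth the extra compute.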

Define the Dataloader

This class inherits from the Dataset class provided by torch. We make use of the base pre-trained BERT model (bert-base-uncased) for this exercise. Here is a short summary of the architecture of this model:

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on lower-cased English text.

It is very important to use the tokenizer that comes with the pre-trained BERT model, since the string-to-token conversion needs to be aligned with the existing model.

You really do not want the word “Hello” being represented both by token 45 and 75 in your models.

Focus on the __getitem__ function. Its objective is simply to provide the DataLoader with the input tokens, the mask (we will get to that) and the output label.

A review sentence goes through the following transformations. We will also take a sample sentence of “I loved the pizza” with a maxlen of 10 and see how it gets processed.

  1. Tokenize: tokens = self.tokenizer.tokenize(review). Converts the review into a list of tokens (lowercased, since the model is uncased). [“i”, “loved”, “the”, “pizza”]
  2. Append CLS and SEP tokens: tokens = ['[CLS]'] + tokens + ['[SEP]']. BERT requires the input in this format for this particular class, “BertModel”. Other BERT classes may require a slight variation of this, so refer to them if you intend to use any of the other classes provided by BERT. [“[CLS]”, “i”, “loved”, “the”, “pizza”, “[SEP]”]
  3. Add padding and truncate: For sentences shorter than maxlen, PAD tokens are appended to bring them to a uniform length. Sentences longer than maxlen are truncated, with a SEP token placed at the end. [“[CLS]”, “i”, “loved”, “the”, “pizza”, “[SEP]”, “[PAD]”, “[PAD]”, “[PAD]”, “[PAD]”]
  4. Convert tokens to token ids: The list of strings is converted to numbers using the tokenizer's internal vocabulary. Out-of-vocabulary words are also handled by the tokenizer.
  5. Masks: A list with 0 in the position of every PAD token, if any, and 1 everywhere else. [1,1,1,1,1,1,0,0,0,0]

We subtract 1 from the label, since class indices start from 0 in this multiclass classification problem (scores 1 to 5 become classes 0 to 4).
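The transformations above can be sketched in plain Python. This is a simplified illustration rather than the repo's __getitem__: it assumes the review has already been tokenized, the function name prepare_example and the toy vocabulary are invented for the example, and unknown tokens simply fall back to [UNK], whereas the real BERT tokenizer first tries to split them into word pieces.

```python
def prepare_example(word_tokens, score, maxlen, vocab):
    """Turn tokenized review text into (token_ids, attention_mask, label)."""
    # Step 2: wrap the sequence in the special tokens.
    tokens = ["[CLS]"] + word_tokens + ["[SEP]"]

    # Step 3: pad short sequences, truncate long ones (keeping a final [SEP]).
    if len(tokens) < maxlen:
        tokens = tokens + ["[PAD]"] * (maxlen - len(tokens))
    else:
        tokens = tokens[: maxlen - 1] + ["[SEP]"]

    # Step 4: map tokens to ids; unknown tokens fall back to [UNK].
    token_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

    # Step 5: 1 for real tokens, 0 for padding.
    attention_mask = [0 if tok == "[PAD]" else 1 for tok in tokens]

    # Scores 1..5 become class indices 0..4.
    label = score - 1
    return token_ids, attention_mask, label


# Toy vocabulary, invented for this example (not BERT's real ids).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "loved": 5, "the": 6, "pizza": 7}

ids, mask, label = prepare_example(["i", "loved", "the", "pizza"], 5, 10, toy_vocab)
print(ids)    # [2, 4, 5, 6, 7, 3, 0, 0, 0, 0]
print(mask)   # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(label)  # 4
```

In the actual dataset class, the id list and the mask would be converted to tensors before being returned to the DataLoader.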

Initialize the DataLoader

I would suggest setting the number of worker threads to 6 if your machine has 8 hardware threads (for example, 4 cores with 2 threads each).

We create 2 loaders, one for the training set and the other for validation.

Define the Classifier

The classifier inherits from torch.nn.Module and needs to implement the __init__ and forward functions.

Before we get to the classifier, let us try to understand what we will be accomplishing. The pre-trained BERT model gives us the entire network shown in yellow in the image below. For a classification task, the information encoded in the representation of the first input token, [CLS], is sufficient to make a prediction. We will use a feed-forward dense linear layer followed by a logarithmic softmax to make the classification. The image shows a binary classifier; we will implement the multiclass case.


init function – We define the linear layer that takes as input a vector of length 768 and maps it to a vector of length 5. Why 768? That is the number of hidden units in the BERT model we have used; if you are using a different model, update it accordingly. We also specify whether the pre-trained BERT layer should be trained in this process. Since we opted to fine-tune this model on the Amazon dataset, we unfreeze the BERT layer, so that its weights are updated and the fine-tuned model can be saved and reused for future tasks.

forward function – We feed the input through the BERT layer to obtain the contextualized representations. From these, we pick the first one (the [CLS] representation) and feed it to the linear layer.
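To make the arithmetic of the classification head concrete, here is a minimal pure-Python sketch of a linear layer followed by a log-softmax. It is not the SentimentClassifier from the repo: the vector has 4 hidden units instead of 768, there are 3 classes instead of 5, and all numbers and function names are invented for illustration.

```python
import math


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_sum for z in logits]


# Hypothetical [CLS] representation with 4 hidden units (768 in bert-base).
cls_rep = [0.5, -1.0, 2.0, 0.1]
# Hypothetical 3-class head (5 classes in the article): a 3 x 4 weight matrix.
W = [[0.2, 0.0, 0.1, 0.0],
     [0.0, 0.3, 0.0, 0.1],
     [0.1, 0.1, 0.5, 0.0]]
b = [0.0, 0.1, -0.2]

log_probs = log_softmax(linear(cls_rep, W, b))
predicted_class = max(range(len(log_probs)), key=log_probs.__getitem__)
print(log_probs, predicted_class)
```

The outputs are log-probabilities: exponentiating them gives values that sum to 1, and the largest entry marks the predicted class.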

Initialize the classifier

We initialize the classifier in

net = classifier.SentimentClassifier(num_classes, config["device"], freeze_bert=False)

Loss Function and Optimizer

The loss function we will use is NLLLoss. We will also make use of its optional weight argument. The classes in our dataset are highly imbalanced: rating 5 (class 4) makes up over 60 percent of all examples, with the other classes hovering around 10 percent each. This weighting is needed to mitigate the impact of the imbalanced training set on the loss calculation.

loss_func = nn.NLLLoss(weight=weights)
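The article does not show how the weights are computed, so the sketch below assumes one common choice, inverse class frequency, and spells out the weighted mean that NLLLoss computes when a weight is given. The class counts, function names and example log-probabilities are all hypothetical.

```python
import math


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count)."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]


def weighted_nll(log_probs, labels, weights):
    """Weighted mean negative log-likelihood.

    Each sample's loss is scaled by the weight of its true class, and the
    sum is divided by the sum of those weights.
    """
    numerator = sum(-weights[y] * row[y] for row, y in zip(log_probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


# Hypothetical counts mimicking the imbalance described above.
counts = [100, 100, 100, 100, 600]
class_weights = inverse_frequency_weights(counts)

# Two made-up predictions (rows of log-probabilities) and their true classes.
batch_log_probs = [
    [math.log(p) for p in (0.1, 0.1, 0.1, 0.1, 0.6)],
    [math.log(p) for p in (0.7, 0.1, 0.1, 0.05, 0.05)],
]
batch_labels = [4, 0]

loss = weighted_nll(batch_log_probs, batch_labels, class_weights)
print(class_weights, loss)
```

With these weights, a mistake on one of the rare classes contributes about six times as much to the loss as a mistake on the dominant class.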

The optimizer is Adam: optim.Adam(net.parameters(), lr = 2e-5). Feel free to play around with the learning rate if you have the resources.

Training Loop

We loop over the epochs; in each epoch, the data is split into batches of size 64. We calculate the logits, logits = net(seq, attn_masks), and then the loss, loss = loss_func(m(logits), labels), using the loss function defined earlier. Here m is the log-softmax, which turns the logits into the log-probabilities that NLLLoss expects. The loss is then backpropagated.

We calculate the training accuracy once every 100 batches and save the network model (each new checkpoint simply overwrites the previous one). At the end of each epoch, we calculate the validation loss and accuracy. If the model from this epoch performs better than the best one so far, it replaces it.

Given how long such training runs take, it is wise to save models at regular intervals. That way, in the event of a crash, one can reload the model from the last saved checkpoint and resume. You wouldn’t want to restart GTA 5 from the beginning every single time, would you?
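The checkpointing pattern can be sketched without any deep learning library. The example below is an assumption-laden illustration, not the repo's code: it uses pickle as a stand-in for the serialization (with PyTorch one would typically call torch.save and torch.load on the model's state dict), and the file names, state contents and accuracies are made up. Writing to a temporary file and swapping it in means a crash during saving cannot corrupt the previous checkpoint.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the checkpoint to a temp file, then atomically swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)  # overwrites the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical state; a real run would store the model and optimizer state.
checkpoint_dir = tempfile.mkdtemp()
latest_path = os.path.join(checkpoint_dir, "latest.pkl")
best_path = os.path.join(checkpoint_dir, "best.pkl")

best_val_acc = 0.0
for epoch, val_acc in enumerate([0.61, 0.66, 0.58]):  # made-up accuracies
    state = {"epoch": epoch, "val_acc": val_acc}
    save_checkpoint(state, latest_path)      # always overwrite the latest
    if val_acc > best_val_acc:               # keep the best model separately
        best_val_acc = val_acc
        save_checkpoint(state, best_path)

resumed = load_checkpoint(latest_path)
best = load_checkpoint(best_path)
print(resumed["epoch"], best["epoch"])
```

After a crash, training would resume from the state in the latest checkpoint, while the best checkpoint is the one to keep for later use.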

Saving the fine-tuned model also allows you to reuse it for related tasks, such as food reviews from a different source. Leveraging this fine-tuned model to analyze food reviews on Zomato, for instance, would be a form of transfer learning.

Note: The validation set is pretty huge and it takes a long time to evaluate the validation set. To fit my use case and to reduce the running time of the loop, I have introduced a “validationFraction” param in config. This is the fraction of the validation set that is actually used by the “evaluate” function. For accurate results, increase this fraction.
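How the fraction is applied is not shown here, so the following is only one plausible sketch: evaluate on the leading share of the validation batches. The helper name limited_batches and the toy batches are hypothetical; the only requirement is that the loader supports len() and iteration, which a list does.

```python
import itertools


def limited_batches(loader, fraction):
    """Iterate over only the first `fraction` of the batches (at least one)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(len(loader) * fraction))
    return itertools.islice(loader, keep)


# Eight toy "batches" standing in for the validation loader.
val_batches = [[i, i + 1] for i in range(0, 16, 2)]

used = list(limited_batches(val_batches, fraction=0.25))
print(len(used), "of", len(val_batches), "batches evaluated")
```

A smaller fraction gives a faster but noisier estimate of the validation accuracy, which is the trade-off the note describes.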

Training On A GPU (Skip if you don’t have a physical GPU)

While going through the code, you will have noticed occurrences of “.to(device)”. We check whether a CUDA device is available and, if so, use it for the computations. The evolution of Transformers, and of BERT with them, was driven by efficient parallelization and efficient use of GPUs. No code change is needed to run on a GPU.

Note: In case running on a GPU gives errors, do the following.

1) Set forceCPU in the config to True and run the model. If it runs without any errors, drop a comment here and we will figure out the issue with the GPU version.

2) There might be insufficient memory on your GPU. Reduce the batch size from 64 to a smaller value, such as 32 or 16, that fits.

Training on Google Colab

Training a BERT model on such a large dataset on a CPU might not be the best of ideas (unless you can afford to replace a burnt-out processor). With the current config, it could take around 72 hours on a 6-core i7 with 8 threads dedicated to training. If you do not have access to a GPU, Google Colab provides one for free (for an uninterrupted duration of up to 12 hours).

  1. Head to
  2. Create a New notebook.
  3. Click on Files->Upload Notebook. Upload the BERT_Amazon_Reviews.ipynb file that comes with the repo.
  4. Open your Google Drive and create a folder named Bert at the topmost level. Upload the AMAZON-DATASET folder to it.
  5. Head back to Google Colab. Running the second cell mounts Google Drive in Colab; you will be asked to authenticate your account, so complete that step. The advantage of using Google Drive is that the models saved by the checkpoints stay intact even after the 12-hour session ends (they get wiped if you upload the files directly to Colab).
  6. Run the rest of the cells.
  7. The last cell starts the process of training.

Note: If Google Colab complains of insufficient memory, reduce the batch size in the config from 64 to a smaller value, such as 32 or 16, that fits. The 12 GB of GPU memory allocated is not dedicated and behaves pretty erratically during initial allocation.


I obtained a training accuracy of around 92 percent. The model has not completed its first epoch yet; I will update the validation accuracy once it does, but preliminary runs look very hopeful.

Going Forward

You can experiment with different BERT classes. BERT has classes for tasks such as sentence prediction, classification and masked word prediction (the last would be a great way to fine-tune a pre-trained model on an unlabelled corpus from a domain when you have no fixed objective yet).

Let me know in the comments if you face any issues.