BERT Classifier: Just Another Pytorch Model

Source: Deep Learning on Medium

Valencia, Spain. Whenever I don’t do projects with image outputs I just use parts of my photo portfolio….

At the end of 2018 Google released BERT. The base version is essentially a 12 layer network that was pretrained on English Wikipedia and the BooksCorpus. The training protocol is interesting because, unlike other recent language models, BERT is trained to take into account language context from both directions rather than just things to the left of the word. In pretraining, BERT masks out random words in a given sentence and uses the rest of the sentence to predict the missing words. Google also benchmarks BERT by training it on datasets of comparable size to other language models and shows stronger performance.
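To make the masking idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the function name `mask_tokens`, the whitespace tokenization, and the example sentence are my own assumptions, and the real pretraining recipe is more involved (the paper selects about 15% of tokens and does not always substitute `[MASK]` for them).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets); targets maps position -> original token."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide this token from the model
            targets[i] = tok      # the model would be trained to recover it
        else:
            masked.append(tok)
    return masked, targets

# Hypothetical example sentence, split on whitespace for simplicity
sentence = "the movie was surprisingly good and i would watch it again".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The point is that the prediction target at a masked position can draw on words to both the left and the right, which is where the "both directions" part comes from.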

NLP is an area that I am somewhat familiar with, and it is cool to see the field having its “ImageNet” moment where practitioners can now apply state of the art models fairly easily to their own problems. As a quick recap, ImageNet is a large open source dataset and the models trained on it are commonly found in libraries like Tensorflow, Pytorch, and so on. These pretrained models let data scientists spend more time attacking interesting problems rather than having to reinvent the wheel and focus on curating datasets (although dataset curation is still super important). You now need datasets in the thousands, not the millions, to start deep learning.

For work I have used BERT a few times in a limited capacity, mostly building off of other tutorials I have found. However, I had been putting off diving deeper to tear apart the pipeline and rebuild it in a manner I am more familiar with… In this post I just want to gain a greater understanding of how to create BERT pipelines in the fashion I am used to so that I can begin to use BERT in more complicated use cases. Mainly I am interested in integrating BERT into multi-task ensembles of various networks.

By going through this learning process, my hope is to show that while BERT is a state of the art model that is pushing the boundaries of NLP, it is just like any other Pytorch model, and that by understanding its different components we can use it to create other interesting things. What I really want is to get over my fear/intimidation of using BERT and to use BERT with the same general freedom I use other pretrained models.


So for this post I used the classic IMDB movie review dataset. This dataset has 50K movie reviews, each marked with the sentiment “positive” or “negative”. Unlike my other posts I did not build a custom dataset, partially because I do not know quick ways of building text datasets and I didn’t want to spend a lot of time on it, and this one is easy to find around on the internet.

Overall I agree that this is not really the most interesting thing I could have done, but for this post I am more so focusing on how to build a pipeline using BERT. Once the pipeline is in place we can swap out datasets as we choose for more varied/interesting tasks.

Classification Architecture

For this post I will be using a Pytorch port of BERT by a group called Hugging Face (cool group, odd name… makes me think of half life facehuggers). Often it is best to use whatever framework the network was built in to avoid accuracy losses from a newly ported implementation… but Google gave Hugging Face a thumbs up on their port which is pretty cool.

Anyway… continuing on…

The first thing I had to do was establish a model architecture. For this I mostly took an example out of the Hugging Face examples called BertForSequenceClassification. At the moment this class looks to be outdated in the documentation, but it serves as a good example for how to build a BERT classifier. Basically you can initialize a pretrained BERT model using the BertModel class. Then you can add additional layers to act as classifier heads as needed. This is the same way you create other custom Pytorch architectures.

Like other Pytorch models you have two main sections. First you have the init, where you define the pieces of the architecture: in this case the BERT model core (the smaller uncased model, ~110M parameters and 12 layers), the dropout to apply, and a classifier layer. Second is the forward section, where we define how the architecture pieces fit together into a full pipeline.

import torch.nn as nn
from pytorch_pretrained_bert import BertModel

class BertForSequenceClassification(nn.Module):

    def __init__(self, num_labels=2):
        super(BertForSequenceClassification, self).__init__()
        self.num_labels = num_labels
        # Pretrained BERT core (12 layer uncased model)
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # Read the sizes from the loaded model's own config
        config = self.bert.config
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
        # pytorch-pretrained-bert returns (encoded_layers, pooled_output)
        _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        return logits
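To see the "pretrained core plus classifier head" pattern without downloading any weights, here is a small sketch that swaps BERT for a toy encoder. `TinyEncoder` and `EncoderClassifier` are hypothetical names of mine, not part of any library; the toy encoder only mimics the shape of what BertModel returns (a sequence output and a pooled output built from the first position).

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for BertModel: returns (sequence_output, pooled_output)."""
    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.hidden_size = hidden_size
        self.embed = nn.Embedding(vocab_size, hidden_size, padding_idx=0)
        self.pool = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids):
        sequence_output = self.embed(input_ids)  # (batch, seq_len, hidden)
        # Pool from the first position only, passed through a dense layer and tanh
        pooled_output = torch.tanh(self.pool(sequence_output[:, 0]))
        return sequence_output, pooled_output

class EncoderClassifier(nn.Module):
    """Same layout as the classifier above: encoder -> dropout -> linear head."""
    def __init__(self, encoder, num_labels=2, dropout_prob=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(encoder.hidden_size, num_labels)

    def forward(self, input_ids):
        _, pooled_output = self.encoder(input_ids)
        return self.classifier(self.dropout(pooled_output))

torch.manual_seed(0)
model = EncoderClassifier(TinyEncoder())
model.eval()
dummy_ids = torch.randint(1, 100, (4, 256))  # batch of 4 made-up sequences of length 256
with torch.no_grad():
    logits = model(dummy_ids)
print(logits.shape)  # torch.Size([4, 2])
```

Swapping the toy encoder for the real BertModel changes the weights and the hidden size, but not the way the pieces are wired together.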

Now that the model is defined we just have to figure out how to structure our data so that we can feed it through and optimize the weights. In the case of images this would usually just be figuring out what transformations we need to apply and making sure we get everything into the correct format. For BERT we need to be able to tokenize strings and convert them into IDs that map to words in BERT’s vocabulary.

Mendoza, Argentina. Lots of good wine!

BERT Data Preprocessing

The main piece of functionality we need for data prep with BERT is how to tokenize inputs and convert them into their corresponding IDs in BERT’s vocabulary. Hugging Face has added VERY nice functionality to both the BertModel and BertTokenizer classes where you can just put in the name of the model you want to use. For this post it is the ‘bert-base-uncased’ model.

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

To tokenize the text all you have to do is call the tokenize function of the tokenizer class. See below.

tokenized_text = tokenizer.tokenize(some_text)

Then once you convert a string to a list of tokens you have to convert it to a list of IDs that match words in the BERT vocabulary. This time you just have to call the convert_tokens_to_ids function on the previously tokenized text.

ids = tokenizer.convert_tokens_to_ids(tokenized_text)
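If you are curious what those two calls are doing, the sketch below mimics them on a toy vocabulary. Everything here is a simplified assumption on my part: the vocabulary and its IDs are made up, the splitting is whitespace only, and the real BertTokenizer also handles punctuation and other cleanup first. The core idea it shows is the greedy longest-match-first sub-word split, where continuation pieces carry a `##` prefix.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary with hypothetical IDs (not BERT's real ones)
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "act": 3, "##ing": 4, "was": 5, "great": 6}

def tokenize(text, vocab):
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens

def convert_tokens_to_ids(tokens, vocab):
    return [vocab[t] for t in tokens]

tokens = tokenize("The acting was great", toy_vocab)
ids = convert_tokens_to_ids(tokens, toy_vocab)
print(tokens)  # ['the', 'act', '##ing', 'was', 'great']
print(ids)     # [2, 3, 4, 5, 6]
```

This is why a tokenized review usually has more tokens than words, which matters once we start truncating to a fixed length.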


So with these basics in place we can put together the dataset generator, which like always is kind of the unsung hero of the pipeline. It lets us avoid loading the entire dataset into memory, which is a pain and makes learning on large datasets unreasonable.

Custom BERT Dataset Class

In general, Pytorch dataset classes are extensions of the base Dataset class where you specify how to get the next item and what is returned for that item. In this case it is a tensor of IDs of length 256 and a one hot encoded target value. Technically you can use sequences of up to length 512, but I need a larger graphics card for that. I am currently training on an RTX 2080 Ti with 11GB of GPU RAM. On my previous 1080 card I was only able to use sequences of 128 comfortably.

import numpy as np
import torch
from torch.utils.data import Dataset

max_seq_length = 256

class text_dataset(Dataset):
    def __init__(self, x_y_list):
        self.x_y_list = x_y_list

    def __getitem__(self, index):

        tokenized_review = tokenizer.tokenize(self.x_y_list[0][index])

        if len(tokenized_review) > max_seq_length:
            tokenized_review = tokenized_review[:max_seq_length]

        ids_review = tokenizer.convert_tokens_to_ids(tokenized_review)

        padding = [0] * (max_seq_length - len(ids_review))

        ids_review += padding

        assert len(ids_review) == max_seq_length

        ids_review = torch.tensor(ids_review)

        sentiment = self.x_y_list[1][index]
        list_of_labels = [torch.from_numpy(np.array(sentiment))]

        return ids_review, list_of_labels[0]

    def __len__(self):
        return len(self.x_y_list[0])

Since this is a decent bit of uncommented code… let’s break it down a bit!

First, the variable x_y_list. It is just something I frequently do when I build datasets… It is basically a list of the x’s and y’s, whatever and however many they may be. I then index into that list of lists to retrieve specific x or y elements as needed.

If anyone has looked at my other image pipelines I basically always have this and it is usually a list of image urls corresponding to the test or training sets. In this case the first element is the test or training movie review text and the second element is the labels for those movie review texts.

class text_dataset(Dataset):
    def __init__(self, x_y_list):
        self.x_y_list = x_y_list

So with that out of the way! The most important part of this is how the dataset class defines the preprocessing for a given sample.

  1. For this BERT use case we retrieve a given review at “self.x_y_list[0][index]”.
  2. We then tokenize that review with “tokenizer.tokenize” as described above.
  3. All of the sequences need to be of uniform length so, if the sequence is longer than the max length of 256 it is truncated down to 256.
  4. Then the tokenized and truncated sequence is converted into BERT vocabulary IDs by “tokenizer.convert_tokens_to_ids”
  5. In the case a sequence is shorter than 256, it is now padded with 0’s up to 256.
  6. The review is converted into a torch tensor.
  7. The function then returns the tensors for the review and its one hot encoded positive or negative label.

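    def __getitem__(self, index):
        # Steps 1 and 2: fetch the review text and tokenize it
        tokenized_review = tokenizer.tokenize(self.x_y_list[0][index])

        # Step 3: truncate anything longer than max_seq_length
        if len(tokenized_review) > max_seq_length:
            tokenized_review = tokenized_review[:max_seq_length]

        # Step 4: map tokens to BERT vocabulary IDs
        ids_review = tokenizer.convert_tokens_to_ids(tokenized_review)

        # Step 5: pad shorter sequences with 0's up to max_seq_length
        padding = [0] * (max_seq_length - len(ids_review))
        ids_review += padding
        assert len(ids_review) == max_seq_length

        # Step 6: convert the review to a torch tensor
        ids_review = torch.tensor(ids_review)

        # Step 7: return the review tensor and its label tensor
        sentiment = self.x_y_list[1][index]
        list_of_labels = [torch.from_numpy(np.array(sentiment))]

        return ids_review, list_of_labels[0]

    def __len__(self):
        return len(self.x_y_list[0])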
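One variant worth noting, which is NOT what the dataset class above does: BERT was pretrained on inputs that start with a [CLS] token and end with a [SEP] token, and the model's forward also accepts an attention mask so padding can be ignored. The sketch below shows how those inputs could be built in plain Python. The helper name `build_inputs` is my own, the example token IDs are made up, and I have not measured whether this changes accuracy on this task; the only hard facts are the special token IDs for bert-base-uncased.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # IDs of [PAD], [CLS], [SEP] in bert-base-uncased

def build_inputs(token_ids, max_seq_length=256):
    """Truncate, wrap in [CLS] ... [SEP], pad, and build the matching attention mask."""
    body = token_ids[: max_seq_length - 2]   # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 = real token
    pad_len = max_seq_length - len(input_ids)
    input_ids += [PAD_ID] * pad_len
    attention_mask += [0] * pad_len          # 0 = padding, to be ignored by attention
    return input_ids, attention_mask

# Made-up token IDs standing in for a short tokenized review
ids, mask = build_inputs([2023, 3185, 2001, 2307], max_seq_length=8)
print(ids)   # [101, 2023, 3185, 2001, 2307, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

With this variant, __getitem__ would return the mask as a third tensor and the training loop would pass it to the model as attention_mask.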


At this point the training pipeline is pretty standard (now that BERT is just another Pytorch model). I was able to use a normal training for loop; see block 21 of the notebook if you want to check. The only real difference between this and my other notebooks was a stylistic one where I take the softmax of the final classifier layer outside of the network itself.

outputs = F.softmax(outputs,dim=1)
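As a quick illustration of what that line does to the raw classifier outputs, here is a sketch with made-up logits and made-up one hot targets. The numbers and the accuracy computation are my own example, not taken from the notebook.

```python
import torch
import torch.nn.functional as F

# Made-up logits for a batch of 3 reviews and 2 classes
logits = torch.tensor([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]])

probs = F.softmax(logits, dim=1)     # each row now sums to 1
preds = torch.argmax(probs, dim=1)   # index of the most likely class per review

# Made-up one hot targets, in the same format the dataset class returns
targets = torch.tensor([[1, 0], [0, 1], [1, 0]])
accuracy = (preds == torch.argmax(targets, dim=1)).float().mean().item()

print(probs.sum(dim=1))
print(preds)     # tensor([0, 1, 1])
print(accuracy)  # 2 of the 3 made-up predictions match
```

Keeping the softmax outside the network means the model itself returns raw logits, which leaves the choice of loss function open.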

The final interesting part is that I assign specific learning rates to different sections of the network. I got interested in doing this a few months back when I skimmed over the fastai videos and have found it to be useful.

The first thing that this section does is assign two learning rate values called lrlast and lrmain. lrlast is fairly standard at .001 while lrmain is much lower at .00001. The idea is that when parts of the network are randomly initialized while others are already trained, you do not want to apply aggressive learning rates to the pretrained sections because you run the risk of destroying the pretrained weights. However, the new randomly initialized sections may not converge if they are at a super low learning rate… so applying higher or lower learning rates to different parts of the network helps each section learn appropriately. The new section can be aggressive while the pretrained section makes gradual adjustments.

The mechanics for applying this come in the list of dictionaries where you specify the learning rates to apply to different parts of the network within the optimizer, in this case an Adam optimizer.

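import torch.optim as optim

lrlast = .001
lrmain = .00001
# One parameter group per section of the network, each with its own learning rate
optim1 = optim.Adam([
    {"params": model.bert.parameters(), "lr": lrmain},
    {"params": model.classifier.parameters(), "lr": lrlast},
])

optimizer_ft = optim1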
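The same two-group pattern can be tried on a toy model without loading BERT. The sketch below also attaches a StepLR scheduler, since the training run described next lowers the learning rate every 3 epochs. The toy model and the decay factor gamma=0.1 are assumptions of mine; the post does not say what factor was used.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

# Toy stand-in: "body" plays the pretrained part, "head" the new classifier
model = nn.ModuleDict({"body": nn.Linear(8, 8), "head": nn.Linear(8, 2)})

lrmain, lrlast = 1e-5, 1e-3
optimizer = optim.Adam([
    {"params": model["body"].parameters(), "lr": lrmain},
    {"params": model["head"].parameters(), "lr": lrlast},
])
# gamma=0.1 is an assumption; the post only says the rate drops every 3 epochs
scheduler = lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

history = []
for epoch in range(10):
    # Record the learning rate of each group at the start of the epoch
    history.append([group["lr"] for group in optimizer.param_groups])
    # ... one epoch of training would go here ...
    optimizer.step()   # placeholder: no gradients exist, so no weights change
    scheduler.step()

for epoch, (lr_body, lr_head) in enumerate(history):
    print(epoch, lr_body, lr_head)
```

Both groups are scaled by the same factor at each step, so the head stays 100 times more aggressive than the body for the whole run.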

With the learning rates set I let it run for 10 epochs decreasing the learning rate every 3 epochs. The network starts at a very strong point…

Epoch 0/9
train total loss: 0.4340
train sentiment_acc: 0.8728
val total loss: 0.4089
val sentiment_acc: 0.8992

Basically, initializing the network with BERT’s pretrained weights means it already has a very good understanding of language.

Epoch 9/9
train total loss: 0.3629
train sentiment_acc: 0.9493
val total loss: 0.3953
val sentiment_acc: 0.9160

By the end of the process the accuracy has gone up a few points and the loss has decreased slightly… I haven’t really seen how models score on this dataset normally but I think this is reasonable and good enough for now to show that the network is doing some learning.

10 epochs on this dataset took 243m 48s to complete on my new 2080ti card. As a side note there were a number of annoyances on getting the card to work with Pytorch… mostly just updating various versions of things.

Buenos Aires Metropolitan Cathedral

Closing Thoughts

For me this was important to do to show myself that while BERT is state of the art I shouldn’t be intimidated when trying to apply it to my own problems. Since folks put in a lot of effort to port BERT over to Pytorch to the point that Google gave them the thumbs up on its performance, it means that BERT is now just another tool in the NLP box for data scientists the same way that Inception or Resnet are for computer vision.

In terms of performance I think that I could squeeze out a few extra percentage points by adding additional layers before the final classifier. This would allow for a few more layers specialized in this specific task. I am also not well versed in how to do data augmentation in the NLP field so that would be something else to examine, perhaps using other language models trained to generate synthetic text… But now that I have a BERT pipeline and know that I can build custom classifiers on top of it the way I would any other model… who knows… there are a lot of exciting possibilities here.
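For what "additional layers before the final classifier" might look like, here is a rough sketch of a deeper head. The class name `DeeperHead`, the inner size of 256, and the ReLU are arbitrary choices of mine, and I have not tested whether this actually improves accuracy. The input size of 768 matches the hidden size of bert-base.

```python
import torch
import torch.nn as nn

class DeeperHead(nn.Module):
    """Hypothetical head: one extra hidden layer before the final classifier."""
    def __init__(self, hidden_size=768, inner_size=256, num_labels=2, dropout_prob=0.1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Dropout(dropout_prob),
            nn.Linear(hidden_size, inner_size),  # extra task-specific layer
            nn.ReLU(),
            nn.Dropout(dropout_prob),
            nn.Linear(inner_size, num_labels),   # final classifier
        )

    def forward(self, pooled_output):
        return self.layers(pooled_output)

head = DeeperHead()
head.eval()
fake_pooled = torch.randn(4, 768)  # stands in for BERT's pooled output on a batch of 4
with torch.no_grad():
    logits = head(fake_pooled)
print(logits.shape)  # torch.Size([4, 2])
```

In the classifier from earlier in the post, this would take the place of the dropout and single linear layer, and its parameters would go into the higher learning rate group.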

Per usual, feel free to check out the notebook here. For simplicity the dataset is also in the repo so if you install pytorch and the pytorch-pretrained-bert libraries you should be good to go.