Learn Hugging Face Transformers & BERT with PyTorch in 5 Minutes

Original article can be found here (source): Deep Learning on Medium

Bidirectional Encoder Representations from Transformers (BERT) marked a new era for Natural Language Processing last year. At the time the paper was published, it achieved state-of-the-art results on 11 Natural Language Understanding tasks. It showed that a language model properly pre-trained on a huge corpus can substantially improve downstream tasks.

The Transformers package developed by HuggingFace unifies the implementation of different BERT-based models. It provides an easy-to-use interface and a wide variety of BERT-based models as shown in the image below.

The various BERT-based models supported by HuggingFace Transformers package

This blog post will use BERT as an example. The usage of the other models is more or less the same. The rest of the article is split into three parts: the tokenizer, directly using BERT, and fine-tuning BERT. You will learn how to implement BERT-based models in 5 minutes.


Tokenizer

Most of the BERT-based models use similar tokenizers with small variations. For instance, BERT uses ‘[CLS]’ as the starting token and ‘[SEP]’ to denote the end of a sentence, while RoBERTa uses <s> and </s> to enclose the entire sentence.

In the transformers package, we only need three lines of code to tokenize a sentence.
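Here is a minimal sketch of what those lines might look like, assuming the 'bert-base-uncased' checkpoint. The helper names load_tokenizer and tokenize_for_bert are hypothetical and only used for illustration; BertTokenizer.from_pretrained and tokenize are the library calls doing the work.

```python
from transformers import BertTokenizer


def load_tokenizer(name_or_path="bert-base-uncased"):
    # Assumption: the uncased base checkpoint. The vocabulary is downloaded
    # on first use and read from the local cache afterwards.
    return BertTokenizer.from_pretrained(name_or_path)


def tokenize_for_bert(tokenizer, sentence):
    # tokenize() returns only the WordPiece tokens, so the special tokens
    # are added by hand at both ends.
    return [tokenizer.cls_token] + tokenizer.tokenize(sentence) + [tokenizer.sep_token]


# Usage (needs network access on the first run):
# tokenizer = load_tokenizer()
# tokens = tokenize_for_bert(
#     tokenizer, "Learn Hugging Face Transformers & BERT with PyTorch in 5 Minutes"
# )
```

Passing the tokenizer in as an argument keeps the helper usable with any BERT vocabulary, including one stored in a local directory.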

The tokens variable should contain a list of tokens:

['[CLS]', 'learn', 'hugging', 'face', 'transformers', '&', 'bert', 'with', 'p', '##yt', '##or', '##ch', 'in', '5', 'minutes', '[SEP]']

Then, we can simply call the tokenizer's convert_tokens_to_ids method on tokens to convert these tokens to integers that represent the sequence of ids in the vocabulary.

[101, 4553, 17662, 2227, 19081, 1004, 14324, 2007, 1052, 22123, 2953, 2818, 1999, 1019, 2781, 102]
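As a rough sketch, the conversion can be done by hand with convert_tokens_to_ids, or in one step with encode, which adds the special tokens on its own. The function names ids_two_ways and demo are hypothetical, and 'bert-base-uncased' is again an assumed checkpoint.

```python
from transformers import BertTokenizer


def ids_two_ways(tokenizer, sentence):
    # Manual route: tokenize, add the special tokens, then look up each id.
    tokens = [tokenizer.cls_token] + tokenizer.tokenize(sentence) + [tokenizer.sep_token]
    manual_ids = tokenizer.convert_tokens_to_ids(tokens)
    # Shortcut: encode() tokenizes and inserts the special tokens itself.
    auto_ids = tokenizer.encode(sentence, add_special_tokens=True)
    return manual_ids, auto_ids


def demo(sentence, name_or_path="bert-base-uncased"):
    # Assumption: the uncased base checkpoint (downloaded on first use).
    tokenizer = BertTokenizer.from_pretrained(name_or_path)
    manual_ids, auto_ids = ids_two_ways(tokenizer, sentence)
    # convert_ids_to_tokens maps the ids back to their tokens.
    return manual_ids, auto_ids, tokenizer.convert_ids_to_tokens(manual_ids)
```

Both routes should give the same list of ids, so the manual one is mostly useful when we need to inspect or modify the tokens in between.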

How to map the original word position to tokenized position?

In some applications, such as named entity recognition or event extraction, we need to know the tokenized position of an original word. Since the tokenizer may break a single word into several sub-words, you may wonder how we can find the tokenized position of a given original word.

Following Hugging Face’s practice, we loop over each word in the sentence and create a mapping from the original word position to the tokenized position.
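The loop can be sketched roughly as follows. This version assumes each word maps to the position of its first sub-word, and it takes the tokenize function as an argument so it works with any tokenizer. The function name build_orig_to_tok_index is hypothetical.

```python
def build_orig_to_tok_index(words, tokenize, cls_token="[CLS]", sep_token="[SEP]"):
    """Tokenize word by word and record where each original word starts."""
    tokens = [cls_token]
    orig_to_tok_index = []
    for word in words:
        # The next token appended will be this word's first sub-word.
        orig_to_tok_index.append(len(tokens))
        tokens.extend(tokenize(word))
    tokens.append(sep_token)
    return tokens, orig_to_tok_index


# Usage with a BERT tokenizer (assumes `tokenizer` is a loaded BertTokenizer):
# words = "Learn PyTorch in 5 Minutes".split()
# tokens, orig_to_tok_index = build_orig_to_tok_index(words, tokenizer.tokenize)
```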

The mapping is stored in the variable orig_to_tok_index where the element e at position i corresponds to the mapping (i, e).

Directly Using Pre-Trained BERT

Getting the pre-trained BERT is straightforward. Thanks to the transformers package, we only need one line of code to do so.

This line of code will automatically fetch the pre-trained weights of a PyTorch BERT model and store them in a cache directory for future use.
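A minimal sketch of loading the model and running a forward pass is shown below, assuming the 'bert-base-uncased' checkpoint. The helper names load_bert and encode_ids are hypothetical; the one line that matters is BertModel.from_pretrained.

```python
import torch
from transformers import BertModel


def load_bert(name_or_path="bert-base-uncased"):
    # Assumption: the uncased base checkpoint. Weights are downloaded on
    # first use and read from the cache afterwards.
    return BertModel.from_pretrained(name_or_path)


def encode_ids(model, input_ids):
    # input_ids is a plain list of vocabulary ids for a single sentence.
    model.eval()  # switch off dropout for inference
    with torch.no_grad():
        outputs = model(torch.tensor([input_ids]))
    # The first output holds the last hidden states,
    # shaped (batch_size, sequence_length, hidden_size).
    return outputs[0]


# Usage (needs network access on the first run):
# model = load_bert()
# hidden = encode_ids(model, [101, 4553, 102])
```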

Fine-tuning Pre-Trained BERT

To fine-tune BERT, the suggested way is to subclass BertPreTrainedModel. When we call from_pretrained on our custom classifier class, it automatically passes an extra config object to the model's constructor.
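One possible shape for such a class is sketched below. The class name BertClassifier, the dropout layer, and the use of the pooled output are illustrative assumptions, not necessarily the setup used in the original code.

```python
import torch
from torch import nn
from transformers import BertModel, BertPreTrainedModel


class BertClassifier(BertPreTrainedModel):
    def __init__(self, config):
        # from_pretrained builds the config and passes it in here.
        super().__init__(config)
        # The attribute is named `bert` so the pre-trained weights can be
        # matched to it when loading.
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.init_weights()

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        pooled = outputs[1]  # pooled representation of the first token
        return self.classifier(self.dropout(pooled))


# Usage (needs network access on the first run; num_labels=3 is an example):
# model = BertClassifier.from_pretrained("bert-base-uncased", num_labels=3)
```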

This way, BERT will be trained jointly with the Linear layer.
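To make the joint training concrete, here is a hedged sketch of a single training step. The function name train_step, the cross-entropy loss, and the learning rate in the usage comment are assumptions for illustration. The key point is that the optimizer receives model.parameters(), which covers both the BERT weights and the Linear layer.

```python
import torch
from torch import nn


def train_step(model, optimizer, input_ids, labels):
    # One gradient update over every parameter the optimizer was given.
    model.train()
    optimizer.zero_grad()
    logits = model(input_ids)
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage (assumes `model` is a classifier that returns logits for input_ids):
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# loss = train_step(model, optimizer, input_ids, labels)
```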


In this article, you have learned the three most common usages of the transformers package: tokenization, directly using BERT, and fine-tuning BERT. In the next article, I will demonstrate how I use a BERT-based model in one of my current research projects, event extraction in the biomedical domain. Stay tuned 🤗