Teaching GPT-2 a sense of humor — Fine-tuning large Transformer models on a single GPU in PyTorch



In this post, I demonstrate how you can use pre-trained GPT-2 to generate text and then fine-tune it on a specific language modeling task using a single GPU. In this case, I try to teach the model to be funny by fine-tuning it on a jokes dataset.

The GPT-2

Recently, the OpenAI team published an article, Better Language Models, and a technical paper, Language Models Are Unsupervised Multitask Learners, about training bigger and better language models. They study the model's ability to generate coherent text and to solve NLP tasks in a zero-shot setting, meaning the model is used for tasks it was not explicitly trained on.

They created a transformer-based language model, which they called GPT-2, and trained it on a huge 40 GB dataset of internet text. The model was trained on the language modeling task: predicting the probability of the next word in a sequence. Pre-training a model for language modeling and then fine-tuning it for a specific task is one of the most common training paths in NLP. Pre-training on language modeling is convenient because it does not require labeled data to learn the structure of language; it only requires plain text, which is openly available in vast amounts. Most publicly available pre-trained NLP models are trained for language modeling.
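To make the language modeling objective concrete, here is a small sketch of my own (using the Huggingface transformers library and the small GPT-2 checkpoint) that prints the probabilities the pre-trained model assigns to candidate next words:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# The small GPT-2 checkpoint is enough to illustrate next-word prediction.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

input_ids = tokenizer.encode("The Matrix is all around", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids).logits        # shape: (1, sequence_length, vocab_size)

# The last position holds the distribution over the next word after the prompt.
next_word_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_word_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([token_id.item()])!r}: {prob.item():.3f}")
```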

The results they got at generating text after training are very impressive; the generated fragments feel so human and coherent that it's almost creepy. The model also achieved state-of-the-art scores on a variety of language modeling benchmarks in the zero-shot setting, and showed promising zero-shot results on tasks such as summarization, reading comprehension, and translation.

Fine-tuning experiment plan

So I decided to experiment a little with the GPT-2. I thought it would be fun to teach the model to crack some jokes. To do that, I need a jokes dataset and a pre-trained GPT-2 model for fine-tuning.

Thanks to the generosity of the AI community, and of the specific teams that publish pre-trained neural network models, relatively cheap solutions to challenging tasks like this one are possible. Training such a large neural network model from scratch would cost tens of thousands of dollars, in some cases even hundreds of thousands. Fine-tuning a pre-trained model on a new task, on the other hand, might take only a few hours on a single GPU. And that is exactly what I'll do.

Huggingface has made many pre-trained Transformer models available for easy use in PyTorch. I’ll use their pre-trained GPT-2 and fine-tune it on this Short Jokes dataset published on Kaggle.

GPT-2 comes in 4 different sizes — small, medium, large, and XL, with 124M, 355M, 774M, and 1.5B parameters, respectively. I found that a medium-size GPT-2 model is the largest of the models that I could fine-tune with reasonable input sequence length on a single GPU.
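If you want to check those numbers yourself, the parameter counts can be computed from the published model configs without downloading any weights (a quick sketch; note that instantiating the XL config still allocates several gigabytes of RAM):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Build randomly initialized models from the configs (no weight download)
# and count their parameters.
for name in ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]:
    config = GPT2Config.from_pretrained(name)   # fetches only a small config file
    model = GPT2LMHeadModel(config)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```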

Image source: The Illustrated GPT-2, which is an excellent post that I highly recommend reading.

Testing the pre-trained model by generating text

Before fine-tuning the model on jokes, I'll test it by generating some text.

In the following gist, I demonstrate how to generate text using the pre-trained medium-size GPT-2 from Huggingface. I'll feed the model the following text fragments to start with and let it generate the rest:

‘ The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work… when you go to church… when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth… ‘

‘ Artificial general intelligence is… ‘

‘ The Godfather: “I’m going to make him an offer he can’t refuse.”… ‘
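Roughly, the generation code boils down to the sketch below; the sampling settings (top-k sampling through the transformers generate API) are illustrative choices rather than a faithful copy of the gist:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").to(device)
model.eval()

prompt = "Artificial general intelligence is"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

# Top-k sampling keeps the continuations varied instead of deterministic and repetitive.
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=200,
        do_sample=True,
        top_k=40,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```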

Judging by the generated conspiracy theories about technology, the threatening predictions about the AI industry, and The Godfather carrying on a dialogue with himself, I would say that text generation is working.

Fine-tuning the model on a single GPU

Large Transformer models are usually trained in multi-GPU (or TPU) settings because training a large model with a reasonable batch size and sequence length requires a lot of GPU/TPU memory. My machine is equipped with a single GeForce 1080 Ti, which has 11 GB of memory. Through empirical tests on the medium-size GPT-2 model, I found that the maximum total number of sequence elements (tokens) my GPU can process in one batch is approximately 550, which is not a lot and might not be sufficient for successful fine-tuning.

But there are some things we can take into account to improve the situation.

The first thing to notice is that, in a transformer-based model, the result of a forward-backward pass for a given sequence does not depend on the batch it is processed in, because Layer Normalization is used instead of Batch Normalization. In Layer Normalization, each example is normalized across its own feature dimension, so the batch dimension is not involved at all.
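A tiny sanity check of this point: with Layer Normalization, an example produces exactly the same output whether it is processed alone or inside a larger batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_norm = nn.LayerNorm(8)

x = torch.randn(4, 8)              # a batch of 4 feature vectors
alone = layer_norm(x[:1])          # first example normalized on its own
in_batch = layer_norm(x)[:1]       # the same example normalized inside the batch

# LayerNorm statistics are computed per example, so the two results match exactly.
print(torch.allclose(alone, in_batch))  # True
```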

Second, we can accumulate gradients over multiple forward-backward passes and only then perform the model weight update. This way, we don't have to hold the computational graph of a whole batch in memory; we can process one sequence at a time and achieve the same result as if the whole batch had been processed in a single forward-backward pass.
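The pattern looks like this (the toy model and the BATCH_SIZE value are placeholders, just to show the mechanics):

```python
import torch
import torch.nn as nn

# Toy model and data, only to illustrate the accumulation pattern.
model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
data = [(torch.randn(1, 10), torch.randn(1, 1)) for _ in range(64)]

BATCH_SIZE = 16  # number of forward-backward passes per weight update

optimizer.zero_grad()
for step, (x, y) in enumerate(data):          # one "sequence" at a time
    loss = loss_fn(model(x), y) / BATCH_SIZE  # scale so the accumulated gradient matches a batch average
    loss.backward()                           # gradients add up in the .grad buffers
    if (step + 1) % BATCH_SIZE == 0:
        optimizer.step()                      # one weight update per BATCH_SIZE sequences
        optimizer.zero_grad()
```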

Taking it all into account, I'll process one sequence at a time, with a maximum length of 550 tokens, and perform a model weight update every BATCH_SIZE processed sequences.

The length of the jokes varies a lot in the dataset; there are many short sequences. To make the total number of tokens in one optimization step more consistent, I'll try to fit as many jokes as possible into each 550-element sequence, as sketched below.
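Putting the pieces together, the fine-tuning loop looks roughly like the sketch below; the Joke column name from the Kaggle CSV, the learning rate, the number of epochs, and BATCH_SIZE are illustrative choices rather than tuned values:

```python
import csv
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

MAX_SEQ_LEN = 550   # the empirical per-sequence limit from above
BATCH_SIZE = 16     # sequences per weight update (illustrative)
EPOCHS = 3

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def pack_jokes(csv_path, tokenizer, max_len=MAX_SEQ_LEN):
    """Pack as many jokes as possible into each <=max_len-token sequence,
    separating them with GPT-2's end-of-text token."""
    sequences, current = [], []
    with open(csv_path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            ids = tokenizer.encode(row["Joke"] + tokenizer.eos_token)[:max_len]
            if len(current) + len(ids) > max_len:
                sequences.append(current)
                current = []
            current.extend(ids)
    if current:
        sequences.append(current)
    return sequences

sequences = pack_jokes("shortjokes.csv", tokenizer)

model.train()
optimizer.zero_grad()
for epoch in range(EPOCHS):
    for step, ids in enumerate(sequences):
        input_ids = torch.tensor([ids], device=device)
        # GPT2LMHeadModel shifts the labels internally, so passing labels=input_ids
        # directly gives the language modeling loss for this sequence.
        loss = model(input_ids, labels=input_ids).loss / BATCH_SIZE
        loss.backward()
        if (step + 1) % BATCH_SIZE == 0:
            optimizer.step()
            optimizer.zero_grad()

model.save_pretrained("gpt2-medium-joker")
tokenizer.save_pretrained("gpt2-medium-joker")
```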

Results and conclusions

Teaching an AI to generate text that seems funny to a human is a hard problem, and I think it is much harder than generating merely coherent text. Even for a human it is not easy; it takes a special kind of creativity, an understanding of context, and even an understanding of human psychology. Feeding lots of jokes to a language model might not be sufficient for it to actually learn what makes something funny. Training human-level joking models might require more sophisticated techniques and a lot more data.

Nevertheless, it is hilarious to see this language model trying. Once in a while, the model manages to generate a funny, human-level joke.

*When I started the experiment, I did not notice that a significant portion of the jokes in the dataset are racist and rude, which means you can expect the same in the list of jokes generated by the model. I apologize for that; be prepared.

Here is the full generated jokes list.

If you see something good and funny in the generated jokes list, post it in the comments. 🙂 I didn’t read through all of them myself.

This is a repost from my original blog.