Step-by-step guide on how to train GPT-2 on books using Google Colab

Original article can be found here (source): Artificial Intelligence on Medium

If you find the following tutorial helpful, please share and star the GitHub repo — https://github.com/mohamad-ali-nasser/gpt-2.

Preparing your Google Colab Notebook

We will use Google Drive to save our checkpoints (a checkpoint is the most recently saved state of our trained model). Once our trained model is saved, we can load it whenever we want to generate conditional or unconditional text.

First, set your Colab’s runtime to GPU; you’ll thank me later.
Use the code below to connect your Google Drive:

from google.colab import drive
drive.mount('/content/drive')

Now that you have your Google Drive connected let’s create a checkpoints folder:

%cd drive
%cd My\ Drive
%mkdir NAME_OF_FOLDER
%cd /content/
!ls
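If you rerun the notebook, %mkdir will complain that the folder already exists. A minimal Python sketch that creates the folder only when it is missing could look like this; the helper name ensure_dir and the commented usage path are assumptions for illustration, not part of the repository.

```python
import os

def ensure_dir(path):
    """Create path (and any missing parents) if needed; do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path

# Hypothetical usage in Colab, after mounting Drive:
# ensure_dir("/content/drive/My Drive/NAME_OF_FOLDER")
```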

Now let’s clone the GPT-2 repository that we will use. It is forked from nshepperd’s awesome repository (itself a fork of OpenAI’s, with the awesome addition of train.py). I have added a conditional_model() method, which lets us pass multiple sentences at once and returns a dictionary with the relevant model output samples. It also lets us avoid using bash code.

!git clone https://github.com/mohamad-ali-nasser/gpt-2.git

Now let’s download our model of choice. We will work with the 345M model, which is pretty decent. We use it rather than the 774M or the 1558M model because of the limited GPU memory available in Colab when training. That said, we can use the 774M and 1558M models if we only want the pre-trained GPT-2, without any fine-tuning or training on our corpus.

%cd gpt-2
!python3 download_model.py 345M

Now that the model is downloaded, let’s get our corpus ready and load our checkpoints in case we have previously trained our model.

# In Case I have saved checkpoints
!cp -r /content/drive/My\ Drive/checkpoint/ /content/gpt-2/

Download and merge texts into one corpus

Let’s create and move into our corpus folder and get those texts.

%cd src
%mkdir corpus
%cd corpus/

I have my text in a GitHub repository, but you can replace the url variable with whatever link you need. You can also manually upload your text files to Google Colab or, if the texts are in your Google Drive, just cd into that folder.

import requests
import os
file_name = "NAME_OF_TEXT.txt"
if not os.path.isfile(file_name):
    url = "https://raw.githubusercontent.com/mohamad-ali-nasser/the-communist-ai/master/corpus/manifesto.txt?token=AJCVVAFWMDCHUIOOUDSD2FK6PYTN2"
    data = requests.get(url)
    with open(file_name, 'w') as f:
        f.write(data.text)
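One thing the download code above does not do is check whether the request succeeded, so an expired link could end up saved as your "book". The sketch below is one way to guard against that. It swaps requests for urllib from the standard library (urlopen raises an error on responses such as 404), and the names fetch_text and save_text_if_missing are made up for this example.

```python
import os
import urllib.request

def fetch_text(url):
    """Download url and decode it as UTF-8; urlopen raises on HTTP errors such as 404."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def save_text_if_missing(file_name, url, fetch=fetch_text):
    """Download url into file_name unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    The fetch parameter is a hook (an assumption of this sketch) so the
    function can be tried without network access.
    """
    if os.path.isfile(file_name):
        return False
    text = fetch(url)
    # Write explicitly as UTF-8 so the result does not depend on the platform default.
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(text)
    return True
```

Calling save_text_if_missing("NAME_OF_TEXT.txt", url) a second time is a no-op, which keeps reruns of the notebook cheap.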

Now that the texts are downloaded, use the code below to merge multiple text files into one. Skip this code if you will only be using one text file.

# Get file names
import glob
filenames = glob.glob("*.txt")
# Add all texts into one file
with open('corpus.txt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
# Use %env rather than !export: a ! command runs in its own subshell, so its exports are lost
%env PYTHONIOENCODING=UTF-8
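If you rerun the merge, corpus.txt itself matches *.txt, and the files are read with whatever encoding the platform defaults to. The sketch below is a slightly more defensive variant under a few assumptions of mine: the function name merge_texts is hypothetical, files are read and written as UTF-8, and the texts are joined with GPT-2’s end-of-text marker <|endoftext|> so that one book does not run straight into the next. Pass separator="" to get plain concatenation as in the code above.

```python
import glob
import os

def merge_texts(folder, out_name="corpus.txt", separator="\n<|endoftext|>\n"):
    """Concatenate every .txt file in folder into out_name, skipping out_name itself."""
    out_path = os.path.join(folder, out_name)
    # Sort for a stable order, and leave out a corpus file from an earlier run.
    sources = sorted(
        p for p in glob.glob(os.path.join(folder, "*.txt"))
        if os.path.basename(p) != out_name
    )
    texts = []
    for path in sources:
        with open(path, encoding="utf-8") as infile:
            texts.append(infile.read())
    with open(out_path, "w", encoding="utf-8") as outfile:
        outfile.write(separator.join(texts))
    return sources

# Hypothetical usage from inside the corpus folder:
# merge_texts(".")
```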

Download libraries and CUDA

Now that we have everything ready, let’s prepare the environment. We’ll start by installing the requirements file.
(If you happen to be running GPT-2 in a local Jupyter notebook, install the requirements file before running download_model.py.)

# Move back up from src/corpus into the gpt-2 folder
%cd ../..
!pip3 install -r requirements.txt

Now, the steps below are optional; they are needed if you want to use the 774M or the 1558M model. I have used the 345M without them, but I recommend doing them anyway, especially the CUDA installation, because your model will run faster and they may fix some random bugs you might otherwise hit.

!pip install tensorflow-gpu==1.15.0
!pip install 'tensorflow-estimator<1.15.0rc0,>=1.14.0rc0' --force-reinstall

Now install CUDA v9.0:

!wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
!dpkg -i cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
!apt-key add /var/cuda-repo-*/7fa2af80.pub
!apt-get update
!apt-get install cuda-9-0
!export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64/
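A note on the last line: each ! command runs in its own subshell, so a variable exported there is gone as soon as the command finishes. If you want later ! commands (such as the training run) to see the CUDA library path, one option is to set it from Python, as sketched below. The variable is stored in the notebook process, so this cell has to be run again after a runtime restart.

```python
import os

cuda_lib = "/usr/local/cuda-9.0/lib64/"  # path taken from the export line above
current = os.environ.get("LD_LIBRARY_PATH", "")

# Append only once, so rerunning the cell does not duplicate the entry.
if cuda_lib not in current.split(":"):
    os.environ["LD_LIBRARY_PATH"] = f"{current}:{cuda_lib}" if current else cuda_lib
```

Processes started afterwards with ! inherit the updated value from the notebook.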

Restart the runtime and move back into the gpt-2 folder:

%cd gpt-2

Let’s train the model:

Now for the moment we have all been waiting for: fine-tuning the model. Copy the one-liner below and run it.

!PYTHONPATH=src ./train.py --dataset src/corpus/corpus.txt --model_name '345M'

The model will load the latest checkpoint and train from there (it seems that loading previously trained checkpoints and continuing to train them can cause memory problems with the 345M model in Colab).
You can also specify the batch size and the learning rate. The main benefit of a larger batch size is faster training. The default batch size is 1, and increasing it can cause memory problems, so be careful how much you increase it, especially when you are running on one GPU. I used a batch size of 1; try 2 or 4 and add your results in the comments.

While increasing the learning rate can speed up training, it could cause the model to get stuck at a local minimum or even to overshoot, so I recommend keeping the learning rate as is, or decreasing it if the loss stops decreasing. The default learning rate is 0.00001.

!PYTHONPATH=src ./train.py --dataset src/corpus/corpus.txt --model_name '345M' --batch_size 1 --learning_rate 0.00001
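To make the overshooting risk concrete, here is a toy illustration. It is plain gradient descent on f(x) = x², not GPT-2 and not the optimizer train.py uses, and the learning rates are picked only to show the effect: a small step shrinks x toward the minimum at 0, while a step that is too large makes x jump further away on every update.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2 * x                  # derivative of x**2
        x = x - learning_rate * grad  # standard gradient step
    return x

small = gradient_descent(learning_rate=0.1)  # each step multiplies x by 0.8
large = gradient_descent(learning_rate=1.1)  # each step multiplies x by -1.2

# Converged toward 0 versus blown up past the starting point.
print(abs(small) < 1.0, abs(large) > 1.0)  # prints: True True
```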

The model will save a checkpoint every 1000 steps. You can keep it running for minutes, hours or days; it all depends on you. Just make sure you don’t overfit the model if you have a small text. To stop training, simply stop the cell; it will save the last trained step (so if you stop at step 1028, you will have two checkpoints, 1000 and 1028).
Note: When you stop training, Colab might seem unresponsive and keep training. To avoid that, clear the output while it trains and then stop it; this will save you a few clicks.
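If you want to confirm which steps were actually saved, you can list them from the checkpoint folder. The sketch below assumes the checkpoints are written as model-<step>.index files (the usual TensorFlow 1.x naming, with "model" as the prefix); check the contents of checkpoint/run1 and adjust the pattern if yours differ. The function name checkpoint_steps is hypothetical.

```python
import os
import re

def checkpoint_steps(run_dir):
    """Return the sorted step numbers found in run_dir, based on model-<step>.index files."""
    pattern = re.compile(r"^model-(\d+)\.index$")
    steps = []
    for name in os.listdir(run_dir):
        match = pattern.match(name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)

# Hypothetical usage:
# checkpoint_steps("/content/gpt-2/checkpoint/run1")
```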

Now that the model is trained, let’s save the checkpoints in our Google Drive:

!cp -r /content/gpt-2/checkpoint/ /content/drive/My\ Drive/checkpoint

Copy the newly saved checkpoints into the model’s directory

!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/345M/
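One thing to watch with cp -r: with GNU cp, if the destination folder already exists, the source folder is copied inside it, so you can end up with checkpoint/checkpoint on your Drive. A small Python sketch that merges into an existing destination instead is shown below. The helper name sync_dir is made up for this example, and dirs_exist_ok needs Python 3.8 or newer.

```python
import os
import shutil

def sync_dir(src, dst):
    """Copy the contents of src into dst, creating dst or merging into it.

    Returns False (and copies nothing) if src does not exist.
    """
    if not os.path.isdir(src):
        return False
    shutil.copytree(src, dst, dirs_exist_ok=True)  # needs Python 3.8+
    return True

# Hypothetical usage in Colab:
# sync_dir("/content/gpt-2/checkpoint", "/content/drive/My Drive/checkpoint")
```

Because it returns False when the source is missing, the same helper can also be used for the restore step at the start of a fresh session.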

Generating Conditional Text

import os
%cd src
from conditional_model import conditional_model
%cd ..

Run the below code to understand what the method’s arguments are:

conditional_model??

I recommend setting a seed so that you get reproducible results. The sentences argument takes a string or a list of strings, and the method returns a dictionary with your input as key and the output sample as value.

Now for the magic:

conditional_model(seed=1, sentences=['How are you today?', 'Hi i am here'])
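Since the result is a dictionary of input sentence to generated sample, a short loop makes it easier to read than the raw dictionary. The helper below is a sketch; the name print_samples is made up, and the example dictionary is a stand-in with the shape described above, not real model output.

```python
def print_samples(results):
    """Print each prompt followed by its generated sample and return the number of pairs."""
    for prompt, sample in results.items():
        print(f"PROMPT: {prompt}")
        print(f"SAMPLE: {sample}")
        print("-" * 40)
    return len(results)

# Stand-in dictionary; in the notebook you would pass the value returned by conditional_model.
example = {"How are you today?": "(generated text)", "Hi i am here": "(generated text)"}
print_samples(example)
```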

You have now fine-tuned GPT-2 on your selection of books and created your own A.I. Congratulations!

If you liked this project, don’t forget to follow The Communist A.I and to star the GitHub repository. Thanks.

Happy coding!