Understanding ELECTRA and Training an ELECTRA Language Model


Training Your Own ELECTRA Model

One huge advantage of the ELECTRA pre-training approach is that it’s possible to train your own language models on a single GPU!

Below, I’ll show you how you can train your own language model with the Simple Transformers library.

Installation

  1. Install the Anaconda or Miniconda package manager from here.
  2. Create a new virtual environment and install packages.
    conda create -n simpletransformers python pandas tqdm
    conda activate simpletransformers
    conda install pytorch cudatoolkit=10.1 -c pytorch
  3. Install Apex if you are using fp16 training. Please follow the instructions here. (Installing Apex from pip has caused issues for several people.)
  4. Install simpletransformers.
    pip install simpletransformers

Data preparation

We’ll be training our language model on Esperanto (inspired by Hugging Face’s tutorial here). Don’t worry if you don’t speak Esperanto; neither do I!

For pre-training a model, we are going to need a (preferably large) corpus of text in Esperanto. I’ll be using the Esperanto text files from the Leipzig Corpora Collection. Specifically, I downloaded the following datasets:

  1. 2011 — Mixed (1M sentences)
  2. 2012 — Newscrawl (1M sentences)
  3. 2012 — Web (1M sentences)
  4. 2016 — Wikipedia (300k sentences)

You should be able to improve the results by using a bigger dataset.

Download the datasets and extract the archives into a data/ directory.

Take all the “sentence” files and move them into the data/ directory.

Script for Linux/bash users. (Others: Why aren’t you on Linux? 😉)

# Run this inside data/ to pull the sentence files out of the extracted sub-directories
for d in */;
do mv "$d"*sentences.txt .;
done;

If you open one of the files, you’ll notice that they have two columns, with the first column containing the index and the second column containing the text. We just need the text, so we’ll drop the indexes, combine all the text, and split the text into train and test files.
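Here’s a rough sketch of that step in Python. It assumes the Leipzig files are tab-separated with the sentence in the second column, and it writes the combined text to data/train.txt and data/test.txt (those output names are my own choice, so adjust them to taste):

import csv
import glob

import pandas as pd

# Collect all the Leipzig "sentences" files we moved into data/
files = glob.glob("data/*sentences.txt")

texts = []
for path in files:
    # Each file is tab-separated: <index>\t<sentence>
    df = pd.read_csv(
        path, sep="\t", header=None, names=["idx", "text"], quoting=csv.QUOTE_NONE
    )
    texts.extend(df["text"].astype(str).tolist())

# Hold out the last 5% of lines as a test set
split = int(len(texts) * 0.95)
with open("data/train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts[:split]))
with open("data/test.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts[split:]))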

Now we are ready to start training!

Language Modeling Model

In Simple Transformers, all language modelling tasks are handled with the LanguageModelingModel class. You have tons of configuration options that you can use when performing any NLP task in Simple Transformers, although you don’t need to set each one (sensible defaults are used wherever possible).

List of common configuration options and their usage here.

Language modelling specific options and their usage here.

The code below sets up a LanguageModelingModel which can be used to train our new model.
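This is a minimal sketch of that setup. It assumes the data/train.txt and data/test.txt files from the data preparation step, and the exact argument names may differ slightly between Simple Transformers versions:

from simpletransformers.language_modeling import LanguageModelingModel

# Files produced in the data preparation step
train_file = "data/train.txt"
test_file = "data/test.txt"

train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "vocab_size": 52000,  # size of the tokenizer trained on train_file
    # Generator: same hidden size as the discriminator, but only 3 hidden layers
    "generator_config": {
        "embedding_size": 128,
        "hidden_size": 256,
        "num_hidden_layers": 3,
    },
    # Discriminator: keeps the default 12 hidden layers
    "discriminator_config": {
        "embedding_size": 128,
        "hidden_size": 256,
    },
}

# model_name=None tells Simple Transformers to build a new ELECTRA model from scratch;
# train_files is used to train the new tokenizer
model = LanguageModelingModel(
    "electra",
    None,
    args=train_args,
    train_files=train_file,
)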

A single training epoch (with this configuration) takes a little under 2 hours on a Titan RTX GPU. To speed up training, you can increase evaluate_during_training_steps or turn off evaluate_during_training altogether.
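For example, something like this in train_args (the step value is purely illustrative):

train_args = {
    # ... other options ...
    "evaluate_during_training": True,
    # Evaluate less often to reduce overhead, or set evaluate_during_training to False to skip it entirely
    "evaluate_during_training_steps": 50000,
}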

When training a language model from scratch, Simple Transformers will automatically create a tokenizer for us from the specified train_file. You can configure the size of the trained tokenizer by setting a vocab_size in the train_args. In my case, I’m using a vocabulary of 52000 tokens.

You can also configure the architecture of the generator and the discriminator models as required. The configuration options are set in the two dictionaries generator_config and discriminator_config found in train_args. The ELECTRA paper recommends using a generator model that is 0.25–0.5 times the size of the discriminator. They also recommend decreasing the number of hidden layers while keeping the other parameters constant between the generator and the discriminator. With that in mind, I went with a small (12-layer) architecture for the discriminator and a similar generator, albeit with a quarter of the hidden layers (3).

"generator_config": {"embedding_size": 128,"hidden_size": 256,"num_hidden_layers": 3,},"discriminator_config": {"embedding_size": 128,"hidden_size": 256,}

You can find all the architecture configuration options and their default values here.

Training the model

Now that we’ve set up our models, all we need to do is initiate the training.
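Here’s a sketch, reusing the model, train_file, and test_file from the setup above (train_model is the Simple Transformers training entry point, and eval_file enables evaluation during training):

# Train on the training file and evaluate on the held-out test file
model.train_model(train_file, eval_file=test_file)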

Running the above script will start training our language model!