Training Neural Networks for Leela Zero Using PyTorch and PyTorch Lightning

Original article can be found here (source): Deep Learning on Medium

Leela Zero

The first step was to figure out the inner workings of Leela Zero’s neural network. I referenced Leela Zero’s documentation and its TensorFlow training pipeline heavily.

Neural Network Architecture

Leela Zero’s neural network is composed of a ResNet “tower” with two “heads”, the policy head and the value head, as described in the AlphaGo Zero paper. All convolution filters are 3×3 except for the ones at the start of the policy and value heads, which are 1×1, as in the paper. The game and board features are encoded as tensors of shape [batch size, board width, board height, number of features] and fed through the ResNet tower first. The tower extracts abstract features and passes them to both heads: the policy head outputs a probability distribution over the next move, and the value head outputs a value that predicts the winner of the game.

You can find the implementation details of the network in the code snippet below.

Leela Zero neural network implemented in PyTorch
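
As a rough illustration of the architecture described above, here is a minimal sketch, not the original implementation. The class names, the 18 input feature planes, the 2 and 1 filters in the policy and value heads, and the 256-unit hidden layer in the value head are assumptions taken from the AlphaGo Zero design; the default tower size is arbitrary. Because nn.Conv2d expects channels first, the sketch takes input of shape [batch size, number of features, board height, board width]. Batch norm is used without gamma and beta, with beta added as a separate parameter, for the reason explained in the Weights Format section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """Convolution -> batch norm (no gamma/beta) -> separate beta -> optional ReLU."""

    def __init__(self, in_channels, out_channels, kernel_size, relu=True):
        super().__init__()
        # Padding of kernel_size // 2 keeps the board size unchanged for 1x1 and 3x3 filters.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels, affine=False)
        self.beta = nn.Parameter(torch.zeros(out_channels))
        self.relu = relu

    def forward(self, x):
        x = self.bn(self.conv(x)) + self.beta.view(1, -1, 1, 1)
        return F.relu(x) if self.relu else x


class ResBlock(nn.Module):
    """Two 3x3 conv blocks with a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = ConvBlock(channels, channels, 3)
        self.conv2 = ConvBlock(channels, channels, 3, relu=False)

    def forward(self, x):
        return F.relu(x + self.conv2(self.conv1(x)))


class Network(nn.Module):
    def __init__(self, board_size=19, in_channels=18,
                 residual_channels=32, residual_layers=4):
        super().__init__()
        n_points = board_size * board_size
        self.conv_input = ConvBlock(in_channels, residual_channels, 3)
        self.tower = nn.Sequential(
            *[ResBlock(residual_channels) for _ in range(residual_layers)])
        # Policy head: 1x1 conv, then a linear layer over all points plus "pass".
        self.policy_conv = ConvBlock(residual_channels, 2, 1)
        self.policy_fc = nn.Linear(2 * n_points, n_points + 1)
        # Value head: 1x1 conv, then two linear layers and tanh.
        self.value_conv = ConvBlock(residual_channels, 1, 1)
        self.value_fc1 = nn.Linear(n_points, 256)
        self.value_fc2 = nn.Linear(256, 1)

    def forward(self, planes):
        x = self.tower(self.conv_input(planes))
        policy_logits = self.policy_fc(torch.flatten(self.policy_conv(x), start_dim=1))
        value = F.relu(self.value_fc1(torch.flatten(self.value_conv(x), start_dim=1)))
        value = torch.tanh(self.value_fc2(value))
        return policy_logits, value


net = Network()
policy_logits, value = net(torch.zeros(2, 18, 19, 19))
```

The policy output here is a vector of unnormalized logits; applying a softmax gives the move probability distribution.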

Weights Format

Leela Zero uses a simple text file to save and load network weights. Each row in the text file has a series of numbers that represent weights of each layer of the network. The residual tower is first, followed by the policy head, and then the value head.

Convolutional layers have 2 weight rows:

  1. Convolution weights with shape [output, input, filter size, filter size]
  2. Channel biases

Batchnorm layers have 2 weight rows:

  1. Batchnorm means
  2. Batchnorm variances

Innerproduct (fully connected) layers have 2 weight rows:

  1. Layer weights with shape [output, input]
  2. Output biases
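
The row layout above can be sketched as follows. This is only an illustration: the helper names are hypothetical, and it assumes that the values in a row are separated by spaces, that multi-dimensional weights are flattened in row-major order, and that the file starts with a version row containing "1".

```python
import numpy as np


def to_row(values):
    """Flatten an array of any shape into one space-separated text row."""
    flat = np.asarray(values, dtype=np.float32).ravel()  # row-major order (assumed)
    return " ".join(f"{float(v):.8g}" for v in flat)


def layer_rows(first, second):
    """Every layer type listed above contributes exactly two rows."""
    return [to_row(first), to_row(second)]


# A toy convolution (8 outputs, 4 inputs, 3x3 filters) followed by batch norm.
conv_weights = np.zeros((8, 4, 3, 3))
conv_biases = np.zeros(8)
bn_means = np.zeros(8)
bn_variances = np.ones(8)

rows = ["1"]  # version row (assumed value)
rows += layer_rows(conv_weights, conv_biases)
rows += layer_rows(bn_means, bn_variances)
weights_text = "\n".join(rows) + "\n"
```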

I wrote unit tests to make sure my weight files are correct. An additional simple sanity check I used was to calculate the number of layers and compare it to what Leela Zero says after loading my weight files. The equation for the number of layers is:

n_layers = 1 (version number) +
2 (input convolution) +
2 (input batch norm) +
n_res (number of residual blocks) *
8 (first conv + first batch norm +
second conv + second batch norm) +
2 (policy head convolution) +
2 (policy head batch norm) +
2 (policy head linear) +
2 (value head convolution) +
2 (value head batch norm) +
2 (value head first linear) +
2 (value head second linear)
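
The same count can be written as a small helper, which is handy in a unit test. The function name is hypothetical; the arithmetic follows the equation above term by term.

```python
def expected_n_layers(n_res: int) -> int:
    """Number of rows a weight file should have for n_res residual blocks."""
    version = 1
    input_block = 2 + 2        # input convolution + input batch norm
    residual_tower = n_res * 8  # per block: (conv + batch norm) twice, 2 rows each
    policy_head = 2 + 2 + 2    # convolution + batch norm + linear
    value_head = 2 + 2 + 2 + 2  # convolution + batch norm + two linear layers
    return version + input_block + residual_tower + policy_head + value_head


n_layers_for_20_blocks = expected_n_layers(20)  # 1 + 4 + 160 + 6 + 8 = 179
```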

This so far seems simple enough, but there is a quirky implementation detail you need to be aware of. Leela Zero actually uses the bias for the convolutional layer to represent the learnable parameters (gamma and beta) of the following batch norm layer. This was done so that the format of the weights file, which only has one line for the layer weights and another for the bias, didn’t have to change when batch norm layers were added.

Currently, Leela Zero only uses the beta term of batch norm, and sets gamma to 1. Then, how do you actually use the convolutional bias to produce the same results as applying the learnable parameters in batch norm? Let’s first take a look at the equation for batch norm:

y = gamma * (x - mean) / sqrt(var + eps) + beta

Since Leela Zero sets gamma to 1, the equation becomes:

y = (x - mean) / sqrt(var + eps) + beta

Now, let x_conv be the output of a convolutional layer without the bias. We want to add some bias to x_conv so that running it through batch norm without beta gives the same result as running x_conv through the beta-only batch norm equation above. In equation form:

(x_conv + bias - mean) / sqrt(var + eps) =
(x_conv - mean) / sqrt(var + eps) + beta
x_conv + bias - mean =
x_conv - mean + beta * sqrt(var + eps)
bias = beta * sqrt(var + eps)

So if we set the convolutional bias to beta * sqrt(var + eps) in the weight file, we get the desired output, and this is what Leela Zero does.

Then, how do we actually implement this? In TensorFlow, you can tell the batch norm layer to ignore just the gamma term by calling tf.layers.batch_normalization(scale=False) and be done with it. Unfortunately, in PyTorch you can’t set batch normalization layers to ignore only gamma; you can only ignore both gamma and beta by setting the affine parameter to False: BatchNorm2d(out_channels, affine=False). So, I set batch normalization to ignore both, then simply added a tensor after it to represent beta. Then, I used the equation bias = beta * sqrt(var + eps) to calculate the convolutional bias for the weight file.
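
The equivalence can be checked numerically. The sketch below is an illustration under assumptions: the layer sizes and the random statistics are made up, and eps is PyTorch's default BatchNorm2d value of 1e-5. It compares the PyTorch-side computation (batch norm without gamma and beta, then adding beta) with the weight-file-side computation (adding the converted bias before the same batch norm).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
eps = 1e-5  # PyTorch's default eps for BatchNorm2d

conv = nn.Conv2d(4, 8, kernel_size=3, padding=1, bias=False)
bn = nn.BatchNorm2d(8, affine=False, eps=eps)
beta = torch.randn(8)  # stands in for the learned beta tensor

# Pretend training has already produced these running statistics.
bn.running_mean.copy_(torch.randn(8))
bn.running_var.copy_(torch.rand(8) + 0.5)
bn.eval()  # use the running statistics instead of batch statistics

x = torch.randn(2, 4, 5, 5)
with torch.no_grad():
    x_conv = conv(x)
    # Training-side form: batch norm without gamma/beta, then add beta.
    y_with_beta = bn(x_conv) + beta.view(1, -1, 1, 1)
    # Weight-file form: fold beta into a convolutional bias, then batch norm.
    bias = beta * torch.sqrt(bn.running_var + eps)
    y_with_bias = bn(x_conv + bias.view(1, -1, 1, 1))

max_difference = (y_with_beta - y_with_bias).abs().max().item()
```

The two outputs should agree up to floating-point rounding, so max_difference should be very close to zero.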

Training Pipeline

After figuring out the details of Leela Zero’s neural network, it was time to tackle the training pipeline. As I mentioned, I wanted to practice using two tools, PyTorch Lightning and Hydra, to speed up writing training pipelines and cleanly manage experiment configurations. Let’s dive into the details of how I used them.

PyTorch Lightning

Writing the training pipeline is by far my least favorite part of research: it involves a lot of repetitive boilerplate code, and is hard to debug. Because of this, PyTorch Lightning was like a breath of fresh air to me. It is a lightweight library without many auxiliary abstractions on top of PyTorch that takes care of most of the boilerplate code in writing training pipelines. It allows you to focus on the more interesting parts of your training pipelines, like the model architecture, and to make your research code more modular and debuggable. Furthermore, it supports multi-GPU and TPU training out of the box!

To use PyTorch Lightning for my training pipeline, most of the coding I had to do was to write a class, which I called NetworkLightningModule, that inherits from LightningModule and specifies the details of my training pipeline, and then pass it to the Trainer. You can follow the official PyTorch Lightning documentation for details on how to write your own LightningModule.


Hydra

Another part of research for which I have been searching for a good solution is experiment management. When you conduct research, you unavoidably run a myriad of variants of your experiment to test your hypothesis, and it’s extremely important to keep track of them in a scalable way. I have so far relied on configuration files to manage my experiment variants, but flat configuration files quickly become unmanageable. You could use templates, but I have found that templates eventually become messy too: as you overlay multiple layers of value files to render your templated configuration files, it becomes difficult to keep track of which value came from which value file.

Hydra, on the other hand, is a composition-based configuration management system. Instead of having separate templates and value files to render the final configuration, you combine multiple smaller configuration files to compose the final configuration. Although composition is not as flexible as template-based configuration management, I find that composition-based systems strike a good balance between flexibility and maintainability, and I found Hydra easy to use. It is a bit heavy-handed in its invocation, as it requires that you use it as a decorator on the main entry point function of your script, but I actually think this design choice makes it easy to integrate with your training scripts. Furthermore, it allows you to manually override configurations via the command line, which is very useful when running different variations of your experiment. I used Hydra to manage different sizes of the network architecture and training pipeline configurations.


To evaluate my trained networks, I used GoMill to run Go tournaments. It is a library for running tournaments between Go Text Protocol (GTP) engines, of which Leela Zero is one. You can find a tournament configuration I used here.