Original article can be found here (source): Deep Learning on Medium

# Leela Zero

The first step was to figure out the inner-workings of Leela Zero’s neural network. I referenced Leela Zero’s documentation and its Tensorflow training pipeline heavily.

## Neural Network Architecture

Leela Zero’s neural network is composed of a ResNet “tower” with two “heads”, the policy head and the value head, as described in the AlphaGo Zero paper. All convolution filters are 3×3, except the ones at the start of the policy and value heads, which are 1×1, as in the paper. Game and board features are encoded as tensors of shape [batch size, board width, board height, number of features] and fed through the ResNet tower, which extracts abstract features. Each head then consumes those features: the policy head computes a probability distribution over the next move, and the value head computes a value for the position, used to predict the winner of the game.

You can find the implementation details of the network in the code snippet below.
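As a rough illustration of the structure described above, here is a minimal PyTorch sketch of the tower-and-heads design. The class names and default sizes are illustrative (the head widths loosely follow the AlphaGo Zero paper), not Leela Zero’s exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """One residual block of the tower: two 3x3 convs with batch norm."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # skip connection

class Network(nn.Module):
    """ResNet tower feeding a policy head and a value head."""
    def __init__(self, board_size=19, in_planes=18, channels=64, n_res=4):
        super().__init__()
        self.input_conv = nn.Conv2d(in_planes, channels, 3, padding=1)
        self.input_bn = nn.BatchNorm2d(channels)
        self.tower = nn.Sequential(*[ResBlock(channels) for _ in range(n_res)])
        # policy head: 1x1 conv, then a linear over all intersections + pass
        self.policy_conv = nn.Conv2d(channels, 2, 1)
        self.policy_fc = nn.Linear(2 * board_size**2, board_size**2 + 1)
        # value head: 1x1 conv, two linear layers, tanh output in [-1, 1]
        self.value_conv = nn.Conv2d(channels, 1, 1)
        self.value_fc1 = nn.Linear(board_size**2, 256)
        self.value_fc2 = nn.Linear(256, 1)

    def forward(self, x):
        x = F.relu(self.input_bn(self.input_conv(x)))
        x = self.tower(x)
        p = F.relu(self.policy_conv(x)).flatten(1)
        p = self.policy_fc(p)  # policy logits over moves + pass
        v = F.relu(self.value_conv(x)).flatten(1)
        v = torch.tanh(self.value_fc2(F.relu(self.value_fc1(v))))
        return p, v
```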

## Weights Format

Leela Zero uses a simple text file to save and load network weights. Each row in the text file has a series of numbers that represent weights of each layer of the network. The residual tower is first, followed by the policy head, and then the value head.

Convolutional layers have 2 weight rows:

- Convolution weights with shape [output, input, filter size, filter size]
- Channel biases

Batchnorm layers have 2 weight rows:

- Batchnorm means
- Batchnorm variances

Innerproduct (fully connected) layers have 2 weight rows:

- Layer weights with shape [output, input]
- Output biases
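Since each row is just whitespace-separated numbers, loading the file is straightforward. A minimal parser sketch (the function name is mine; this assumes the first line holds the format version, as counted in the layer equation below):

```python
def load_weight_rows(path):
    """Parse a Leela Zero weight file: the first line holds the format
    version, and every following line is one row of space-separated
    floats belonging to a single layer."""
    with open(path) as f:
        version = int(f.readline().strip())
        rows = [[float(w) for w in line.split()] for line in f if line.strip()]
    return version, rows
```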

I wrote unit tests to make sure my weight files are correct. An additional simple sanity check I used was to calculate the number of layers and compare it to what Leela Zero says after loading my weight files. The equation for the number of layers is:

```
n_layers = 1 (version number) +
           2 (input convolution) +
           2 (input batch norm) +
           n_res (number of residual blocks) *
               8 (first conv + first batch norm +
                  second conv + second batch norm) +
           2 (policy head convolution) +
           2 (policy head batch norm) +
           2 (policy head linear) +
           2 (value head convolution) +
           2 (value head batch norm) +
           2 (value head first linear) +
           2 (value head second linear)
```
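This count translates directly into a small helper for the sanity check (the function is my own, not part of Leela Zero):

```python
def expected_weight_rows(n_res):
    """Expected line count of a Leela Zero weight file, following the
    breakdown above: one version line plus two rows per (sub)layer."""
    return (1              # version number
            + 2 + 2        # input convolution + input batch norm
            + n_res * 8    # per residual block: 2 convs + 2 batch norms
            + 2 + 2 + 2    # policy head: conv + batch norm + linear
            + 2 + 2        # value head: conv + batch norm
            + 2 + 2)       # value head: first + second linear
```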

This so far seems simple enough, but there is a quirky implementation detail you need to be aware of: Leela Zero actually uses the bias of the convolutional layer to represent the learnable parameters (`gamma` and `beta`) of the following batch norm layer. This was done so that the format of the weights file, which only has one line for the layer weights and another for the bias, didn’t have to change when batch norm layers were added.

Currently, Leela Zero only uses the `beta` term of batch norm, and sets `gamma` to 1. How, then, do you actually use the convolutional bias to produce the same results as applying the learnable parameters in batch norm? Let’s first take a look at the batch norm equation:

`y = gamma * (x - mean) / sqrt(var + eps) + beta`

Since Leela Zero sets `gamma` to 1, the equation becomes:

`y = (x - mean) / sqrt(var + eps) + beta`

Now, let `x_conv` be the output of a convolutional layer without the bias. We want to add some bias to `x_conv` so that running it through batch norm without `beta` gives the same result as running `x_conv` through the `beta`-only batch norm equation above. In equation form:

```
(x_conv + bias - mean) / sqrt(var + eps) = (x_conv - mean) / sqrt(var + eps) + beta
          x_conv + bias - mean = x_conv - mean + beta * sqrt(var + eps)
                          bias = beta * sqrt(var + eps)
```

So if we set the convolutional bias to `beta * sqrt(var + eps)` in the weight file, we get the desired output, and this is what Leela Zero does.

Then, how do we actually implement this? In Tensorflow, you can tell the batch norm layer to ignore just the `gamma` term by calling `tf.layers.batch_normalization(scale=False)` and be done with it. Unfortunately, in PyTorch you can’t set batch normalization layers to ignore only `gamma`; you can only ignore both `gamma` and `beta` by setting the `affine` parameter to `False`: `BatchNorm2d(out_channels, affine=False)`. So, I set batch normalization to ignore both, and simply added a tensor afterwards to represent `beta`. Then, I used the equation `bias = beta * sqrt(var + eps)` to calculate the convolutional bias for the weight file.
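Concretely, the PyTorch side might look like the following sketch (my own simplified version, not the project’s actual code): batch norm runs with `affine=False`, a separate `beta` parameter is added after normalization, and an export helper converts `beta` into the convolutional bias the weight file expects.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution followed by batch norm with no learnable affine
    parameters; a separate learnable beta is added after normalization."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels, affine=False)
        self.beta = nn.Parameter(torch.zeros(out_channels))

    def forward(self, x):
        x = self.bn(self.conv(x))
        # broadcast beta over (batch, channel, height, width)
        return x + self.beta.view(1, -1, 1, 1)

def conv_bias_for_export(beta, running_var, eps=1e-5):
    """Convolutional bias to write into the weight file:
    bias = beta * sqrt(var + eps), per the derivation above."""
    return beta * torch.sqrt(running_var + eps)
```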

# Training Pipeline

After figuring out the details of Leela Zero’s neural network, it was time to tackle the training pipeline. As I mentioned, I wanted to practice using two tools — PyTorch Lightning and Hydra — to speed up writing training pipelines and cleanly manage experiment configurations. Let’s dive into the details of how I used them.

## PyTorch Lightning

Writing the training pipeline is by far my least favorite part of research: it involves a lot of repetitive boilerplate code, and is hard to debug. Because of this, PyTorch Lightning was like a breath of fresh air to me. It is a lightweight library without many auxiliary abstractions on top of PyTorch that takes care of most of the boilerplate code in writing training pipelines. It allows you to focus on the more interesting parts of your training pipelines, like the model architecture, and to make your research code more modular and debuggable. Furthermore, it supports multi-GPU and TPU training out of the box!

In order to use PyTorch Lightning for my training pipeline, most of the coding I had to do was to write a class, which I called `NetworkLightningModule`, that inherits from `LightningModule` to specify the details of my training pipeline, and to pass it to the `Trainer`. You can follow the official PyTorch Lightning documentation for details on how to write your own `LightningModule`.
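The overall shape of such a class is sketched below. To keep the sketch self-contained it subclasses `nn.Module`, but in the real pipeline it would subclass `pytorch_lightning.LightningModule` and `Trainer` would call these hooks automatically. The loss terms (policy cross-entropy plus value mean squared error, as in AlphaGo Zero) and the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetworkLightningModule(nn.Module):  # stand-in base for this sketch
    """Hooks a LightningModule would define for this training pipeline."""
    def __init__(self, network, lr=0.01):
        super().__init__()
        self.network = network  # the ResNet tower with both heads
        self.lr = lr

    def training_step(self, batch, batch_idx):
        planes, target_policy, target_value = batch
        policy_logits, value = self.network(planes)
        policy_loss = F.cross_entropy(policy_logits, target_policy)
        value_loss = F.mse_loss(value.squeeze(-1), target_value)
        return policy_loss + value_loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.network.parameters(),
                               lr=self.lr, momentum=0.9)
```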

## Hydra

Another part of research for which I have been searching for a good solution is experiment management. When you conduct research, you inevitably run a myriad of variants of your experiments to test your hypotheses, and it’s extremely important to keep track of them in a scalable way. I have so far relied on configuration files to manage my experiment variants, but flat configuration files quickly become unmanageable. You could use templates, but I have found that templates eventually become messy too: as you overlay multiple layers of value files to render your templated configuration files, it becomes difficult to keep track of which value came from which value file.

Hydra, on the other hand, is a composition-based configuration management system. Instead of rendering the final configuration from separate templates and value files, you compose it from multiple smaller configuration files. While this is not as flexible as template-based configuration management, I find that composition-based systems strike a good balance between flexibility and maintainability, and Hydra in particular was easy to use. It is a bit heavy-handed in its invocation, since it requires you to use it as a decorator on your script’s main entry-point function, but I actually think this design choice makes it easy to integrate with your training scripts. Furthermore, it allows you to override configurations from the command line, which is very useful when running different variations of an experiment. I used Hydra to manage different sizes of the network architecture and training pipeline configurations.
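For illustration, a hypothetical layout (the file names, group names, and values here are mine, not my actual configuration): a top-level config composes one file from each configuration group, and any value can be overridden at invocation time.

```yaml
# conf/config.yaml -- top-level config; the defaults list picks one
# file from each configuration group to compose the final config
defaults:
  - network: small
  - training: default

# conf/network/small.yaml -- one option in the "network" group
residual_blocks: 6
channels: 128

# conf/training/default.yaml -- one option in the "training" group
learning_rate: 0.05
batch_size: 512
```

The script’s entry point is then wrapped with `@hydra.main(config_path="conf", config_name="config")`, and individual values can be overridden on the command line, e.g. `python train.py network=large training.learning_rate=0.01`.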

# Evaluation

To evaluate my trained networks, I used GoMill to run Go tournaments. It is a library for running tournaments between Go Text Protocol (GTP) engines, of which Leela Zero is one. You can find a tournament configuration I used here.
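For reference, a gomill playoff control file has roughly the following shape. The paths, player names, and settings below are illustrative, not my actual configuration:

```python
# tournament.ctl -- gomill control files use Python syntax
competition_type = 'playoff'
board_size = 19
komi = 7.5

players = {
    'baseline'  : Player("./leelaz --gtp --noponder -w baseline.txt"),
    'candidate' : Player("./leelaz --gtp --noponder -w candidate.txt"),
    }

matchups = [
    Matchup('baseline', 'candidate',
            alternating=True,       # swap colors between games
            number_of_games=100),
    ]
```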