How to Find a Decent Learning Rate Using TensorFlow 2

Source: Deep Learning on Medium

Taken from http://www.merzpraxis.de/index.php/2016/06/13/der-suchende/

When it comes to building and training neural networks, you need to set a massive number of hyperparameters. Setting them right has a tremendous influence on the success of your net and also on the time you spend heating up the air, aka training your model. One parameter you always have to choose is the so-called learning rate (also known as update rate or step size). For a long time, selecting it was more like trial and error or a black art. However, there is a very smart yet simple technique for finding a decent learning rate, which I guess became popular through its use in fastai. In this article, I present a quick summary of that approach and show you an implementation in TensorFlow 2 that is also available through my repo. So let’s get it on.

The Problem

The learning rate l is a single floating-point number that determines how far you move in the direction of the negative gradient when you update your network. As already said in the introduction, choosing it correctly tremendously influences the time you spend training your model until you get good results and stop swearing. Why is that so? If you choose it too small, your model will take ages to reach the optimum because you only take tiny little baby update steps. If you choose it too large, your model will just bounce around, jumping over the optimum and eventually fail to reach it at all.
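To make the two failure modes concrete, here is a minimal sketch on a toy problem that is not part of the original article: plain gradient descent on f(w) = w**2, whose optimum is w = 0. The function name gradient_descent and the three learning rates are assumptions chosen purely for illustration.

```python
def gradient_descent(lr: float, w0: float = 1.0, n_steps: int = 20) -> float:
    """Minimize f(w) = w**2 starting at w0; return the final distance |w| to the optimum."""
    w = w0
    for _ in range(n_steps):
        grad = 2.0 * w     # derivative of w**2
        w = w - lr * grad  # move against the gradient, scaled by the learning rate
    return abs(w)


too_small = gradient_descent(lr=0.001)  # baby steps: still roughly 0.96 away after 20 updates
decent = gradient_descent(lr=0.3)       # converges: roughly 1e-8 away
too_large = gradient_descent(lr=1.1)    # overshoots on every update and ends up further away than it started

print(too_small, decent, too_large)
```

Each update multiplies w by (1 - 2 * lr), so on this toy function any learning rate above 1 makes the distance to the optimum grow instead of shrink.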

The Solution

Leslie N. Smith presented a very smart and simple approach to systematically find a learning rate in a short amount of time that will make you very happy. The only prerequisite is that you have a model and a training set that is split into n batches.

  1. You initialize your learning rate to a small value l=l_min, for example l_min=0.00001
  2. You take one batch of your training set and update your model
  3. You calculate the loss and record both the loss and the learning rate used
  4. You exponentially increase the current learning rate
  5. You either go back to 2 or stop the search if the learning rate has reached a predefined maximum value l_max or the loss has increased too much
  6. You pick, from all tested learning rates, the one that led to the largest decrease in loss between 2 consecutive trials.
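The six steps can be sketched in plain Python on a toy problem before bringing TensorFlow into the picture. Everything here is an assumption made for illustration: the "model" is a single weight w, the "loss" is w**2, the name find_lr is hypothetical, and the stop factor of 4 is borrowed from the implementation in this article. It is a sketch of the procedure, not the article's code.

```python
def find_lr(l_min: float = 1e-5, l_max: float = 10.0, n_steps: int = 100):
    """Run the learning rate search on the toy loss f(w) = w**2."""
    w = 1.0  # toy model: one weight
    lrs, losses = [], []
    best = float("inf")
    for step in range(n_steps):
        # Steps 1 and 4: start at l_min and grow exponentially towards l_max
        lr = l_min * (l_max / l_min) ** (step / (n_steps - 1))
        w -= lr * 2.0 * w  # Step 2: one gradient update (gradient of w**2 is 2w)
        loss = w * w       # Step 3: compute the loss ...
        lrs.append(lr)     # ... and record it together with the learning rate
        losses.append(loss)
        best = min(best, loss)
        if loss > 4 * best:  # Step 5: stop once the loss has increased too much
            break
    # Step 6: largest decrease in loss between two consecutive trials
    diffs = [after - before for before, after in zip(losses, losses[1:])]
    idx = diffs.index(min(diffs)) + 1
    return lrs[idx], lrs, losses


best_lr, lrs, losses = find_lr()
print(f"suggested learning rate: {best_lr:.4f} after {len(lrs)} trials")
```

On this toy loss the search stops early, before l_max is reached, because the updates start to overshoot once the learning rate exceeds 1.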

To make this all a bit more visual, I show you the smoothed loss plotted over the learning rate on a log scale. The red line marks the computed optimal learning rate.
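The smoothing mentioned here is an exponential moving average with bias correction, the same formula the implementation in this article applies to the recorded losses. A minimal standalone sketch, where the function name smooth and the default beta=0.9 are assumptions for illustration:

```python
from typing import List


def smooth(losses: List[float], beta: float = 0.9) -> List[float]:
    """Exponential moving average of losses with bias correction; beta must be in [0, 1)."""
    avg = 0.0
    smoothed = []
    for i, loss in enumerate(losses):
        avg = beta * avg + (1 - beta) * loss
        # Without this division the first values would be pulled towards the initial avg of 0
        smoothed.append(avg / (1 - beta ** (i + 1)))
    return smoothed


noisy = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
print(smooth(noisy))
```

The bias correction is what makes the first smoothed value equal to the first raw loss. Note that beta = 1.0 would divide by zero, so the smoothing factor has to stay strictly below 1.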

The Implementation

As I am currently learning TensorFlow 2 (TF2), I thought it would be a good idea to practice by implementing the learning rate finder using the new TF2 concepts. Apart from that, I (and hopefully also you) now have the LR Finder for all upcoming TF2 projects that I (or you) want to pursue, yeahhh. In the code posted here, I have marked the corresponding lines with Step 1–6 to reference the above listing. The code shown here is a bit simplified and refactored to increase readability on Medium. You can find the full code together with a small example and plotting functions on my GitHub repo.

from dataclasses import dataclass

import numpy as np
import tensorflow as tf
from tensorflow import Tensor


# eq=False keeps identity-based hashing, so instances stay hashable
# when tf.function binds the decorated method below.
@dataclass(eq=False)
class LRFinder:
    model: tf.keras.Model
    optimizer: tf.keras.optimizers.Optimizer
    loss_fn: tf.keras.losses.Loss
    # Unannotated, so these are class attributes rather than dataclass fields;
    # __call__ rebinds the lists on the instance at the start of every search.
    opt_idx = None
    lrs = []
    losses = []
    smoothed_losses = []

    @tf.function
    def step(self, src: Tensor, trg: Tensor, lr: Tensor) -> Tensor:
        # lr is passed as a tensor so the function is traced once
        # instead of once per distinct Python float.
        self.optimizer.learning_rate.assign(lr)
        with tf.GradientTape() as tape:
            loss = self.loss_fn(trg, self.model(src))  # Step 3
        grads = tape.gradient(loss, self.model.trainable_weights)

        self.optimizer.apply_gradients(
            zip(grads, self.model.trainable_weights)
        )  # Step 2 (model update)
        return loss

    def __call__(
        self,
        dataset,
        min_lr: float,
        max_lr: float,
        n_steps: int,
        smoothing: float = 0.98,  # assumed default; must be in [0, 1), 1.0 divides by zero below
    ) -> float:
        self.lrs, self.losses, self.smoothed_losses = [], [], []
        avg_loss = 0.0

        def exp_annealing(step: int) -> float:
            # Grows exponentially from min_lr (step 0) to max_lr (step n_steps - 1)
            return min_lr * (max_lr / min_lr) ** (step / (n_steps - 1))

        for step, (source, target) in enumerate(dataset):  # Step 2
            lr = exp_annealing(step)  # Step 1 and Step 4
            lr_tensor = tf.constant(lr, dtype=tf.float32)
            loss = self.step(source, target, lr_tensor).numpy()

            # Exponential moving average of the loss with bias correction
            avg_loss = smoothing * avg_loss + (1 - smoothing) * loss
            s_loss = avg_loss / (1 - smoothing ** (step + 1))

            best = loss if step == 0 or loss < best else best

            self.lrs.append(lr)  # Step 3
            self.losses.append(loss)  # Step 3
            self.smoothed_losses.append(s_loss)  # Step 3

            # Stop after n_steps batches or once the loss has blown up
            if step + 1 == n_steps or s_loss > 4 * best:  # Step 5
                break

        sls = np.array(self.smoothed_losses)
        self.opt_idx = np.argmin(sls[1:] - sls[:-1]) + 1  # Step 6
        return self.lr_opt

    @property
    def lr_opt(self) -> float:
        return self.lrs[self.opt_idx]

The End

Thanks for following along with my small article and my first TensorFlow 2 implementation. For questions, comments, or suggestions, feel free to contact me.