Comparing Modern Scalable Hyperparameter Tuning Methods

Original article was published by Ayush Chaurasia on Deep Learning on Medium


The Search Space

We’ll use the same search space for all the experiments to keep the comparison fair.
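With Ray Tune, a search space is just a dictionary of sampling functions. The snippet below is a minimal sketch of what such a space could look like for a GAN; the hyper-parameter names and ranges (lr_gen, lr_dis, beta1, batch_size) are illustrative assumptions, not the exact values used in these experiments.

```python
from ray import tune

# Sketch of a search space for a GAN tuned on the inception score.
# Names and ranges below are illustrative assumptions.
search_space = {
    "lr_gen": tune.loguniform(1e-5, 1e-2),    # generator learning rate
    "lr_dis": tune.loguniform(1e-5, 1e-2),    # discriminator learning rate
    "beta1": tune.uniform(0.3, 0.9),          # Adam beta1 momentum term
    "batch_size": tune.choice([32, 64, 128]),
}
```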

Random Search

Let’s perform a random search across the search space to see how well it optimizes. This will also act as the baseline for our comparison. Our experimental setup has 2 GPUs and 4 CPUs, and we’ll parallelize the trials across the GPUs. Ray Tune does this automatically for you if you specify resources_per_trial.
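Here is a minimal sketch of the random-search run. It assumes a trainable function train_gan that reports an is_score (inception score) metric via tune.report each epoch; the function name, metric name and number of samples are assumptions for illustration. Requesting half a GPU per trial lets four trials run concurrently on the 2-GPU, 4-CPU machine.

```python
from ray import tune

# Random search: Tune samples the search space at random by default.
analysis = tune.run(
    train_gan,                                   # assumed trainable reporting is_score
    config=search_space,
    num_samples=20,                              # number of trials to run
    metric="is_score",
    mode="max",
    resources_per_trial={"cpu": 1, "gpu": 0.5},  # 4 concurrent trials on 2 GPUs / 4 CPUs
)
print("Best config found:", analysis.best_config)
```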

Let’s see the results.

Inference

As expected, we get varied results.

  • Some of the models did optimize, as the tuner got lucky and chose the right set of hyper-parameters.
  • Other models’ inception score graphs remained flat: they did not optimize because of bad hyper-parameter values.
  • Thus, with random search you might end up reaching the optimal value, but you will definitely waste a lot of resources on runs that don’t add any value.

Bayesian Search With HyperOpt

The basic idea behind Bayesian hyper-parameter tuning is to not choose hyper-parameters completely at random, but instead use the information from prior runs to choose the values for the next run. Tune supports HyperOpt, which implements Bayesian search algorithms. Here’s how you do it.
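The sketch below plugs HyperOpt into Tune as a search algorithm, reusing the same assumed trainable, metric and sample count as before. The import path shown is the classic Ray 1.x one; newer releases expose it under ray.tune.search.hyperopt.

```python
from ray import tune
from ray.tune.suggest.hyperopt import HyperOptSearch  # ray.tune.search.hyperopt in newer Ray

# Bayesian (TPE) search: new configs are proposed based on previous results.
analysis = tune.run(
    train_gan,                     # assumed trainable reporting is_score
    config=search_space,
    search_alg=HyperOptSearch(),
    metric="is_score",
    mode="max",
    num_samples=20,
    resources_per_trial={"cpu": 1, "gpu": 0.5},
)
```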

Here’s what the results look like.

Inference

  • There is a significant improvement compared to the previous experiment, as there is only 1 flat curve.
  • This implies that the search algorithm chose the hyper-parameter values based on the results of previous runs.
  • On average, the runs performed better than random search.
  • The remaining resource wastage can be avoided by terminating the bad runs earlier.

Bayesian Search with Asynchronous HyperBand

The idea behind Asynchronous HyperBand is to terminate the runs that don’t perform well early. It makes sense to combine this with the Bayesian search to see if we can further reduce the resources wasted on runs that don’t optimize. We just need to make a small change in our code to accommodate the HyperBand scheduler.
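A sketch of combining the HyperOpt searcher with Tune’s ASHAScheduler, its asynchronous HyperBand implementation; the max_t, grace_period and reduction_factor values below are illustrative assumptions, and the trainable and metric are the same assumed ones as before.

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.suggest.hyperopt import HyperOptSearch

# ASHA terminates badly performing trials early instead of training them fully.
asha = ASHAScheduler(
    max_t=30,              # maximum training iterations per trial
    grace_period=5,        # minimum iterations before a trial can be stopped
    reduction_factor=2,
)

analysis = tune.run(
    train_gan,
    config=search_space,
    search_alg=HyperOptSearch(),
    scheduler=asha,
    metric="is_score",
    mode="max",
    num_samples=20,
    resources_per_trial={"cpu": 1, "gpu": 0.5},
)
```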

Let us now see how this performs.

Inference

  • Only 2 out of 20 runs were executed for the full number of epochs, while the others were terminated early.
  • The highest score achieved was still slightly higher than in the runs without the HyperBand scheduler.
  • Thus, by terminating bad runs early in the training process, we have not only sped up the tuning job but also saved compute resources.

Population-Based Training

Image source — https://docs.ray.io/en/latest/tune/tutorials/tune-advanced-tutorial.html

The last tuning algorithm that we’ll cover is population-based training (PBT), introduced by DeepMind. The basic idea behind the algorithm, in layman’s terms:

  • Run the optimization process for a population of samples for a given number of time steps (or iterations) T.
  • After every T iterations, compare the runs, copy the weights of the well-performing runs to the badly performing ones, and change their hyper-parameter values to be close to the values of the runs that performed well.
  • Terminate the worst-performing runs.

Although the idea behind the algorithm seems simple, there is a lot of complex optimization math that goes into building it from scratch. Tune provides a scalable and easy-to-use implementation of the state-of-the-art PBT algorithm.
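Here is a sketch of setting up Tune’s PopulationBasedTraining scheduler, again with the assumed train_gan trainable and is_score metric. The perturbation interval and mutation ranges are illustrative, and the trainable has to save and restore checkpoints so that PBT can copy weights between trials.

```python
import random

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# Every perturbation_interval iterations (the "T" above), weights of
# well-performing trials are copied to poorly performing ones and the
# listed hyper-parameters are resampled or perturbed.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=5,
    hyperparam_mutations={
        "lr_gen": lambda: random.uniform(1e-5, 1e-2),
        "lr_dis": lambda: random.uniform(1e-5, 1e-2),
    },
)

analysis = tune.run(
    train_gan,
    config=search_space,
    scheduler=pbt,
    metric="is_score",
    mode="max",
    num_samples=20,
    resources_per_trial={"cpu": 1, "gpu": 0.5},
)
```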

Let us now look at the results.

Inference

The results look quite surprising, and several things about them stand out.

  • Almost all the runs have reached the optimal point.
  • The highest score (6.29) was achieved by one of the runs.
  • The runs that started off as bad performers or outliers also converged as the experiment proceeded.
  • No run has a flat inception score graph.
  • Some badly performing runs were stopped in the middle of the process.
  • Thus, no resources were wasted on bad runs.

How did PBT optimize the runs that started off with bad hyper-parameter values?

The answer is the hyper-parameter mutation done by the PBT scheduler. After every T time steps, the algorithm also mutates the values of hyper-parameters to maximize the desired metric. Here’s how the parameters were mutated by the PBT scheduler for this experiment.

Hyper-parameter Mutations

Let us now see how the hyper-parameters were adjusted by the PBT algorithm to maximize the inception score.
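One way to inspect this is to read each trial’s progress dataframe from the analysis object and plot the hyper-parameter values over training iterations. The sketch below assumes those dataframes contain the flattened config that was active at each reported iteration (e.g. a config/lr_gen column); the exact column names depend on your search space and Ray version.

```python
import matplotlib.pyplot as plt

# Plot the (assumed) generator learning rate of each trial over time.
for trial_dir, df in analysis.trial_dataframes.items():
    if "config/lr_gen" in df.columns:
        plt.plot(df["training_iteration"], df["config/lr_gen"])

plt.xlabel("training iteration")
plt.ylabel("generator learning rate")
plt.title("PBT mutations of the generator learning rate")
plt.show()
```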