Bayesian Optimization

Source: Deep Learning on Medium

What is Bayesian optimization? How does it work to tune hyperparameters for a deep neural network?

An intuitive explanation of Bayesian optimization for hyperparameter tuning for a deep neural network

Let’s first understand Bayesian statistics

Suppose a cricket team has won 5 of the 12 matches it played on a particular ground. It had rained heavily before three of the five wins, and one match played after rain was lost, so it rained before 4 of the 12 matches in total. What is the probability that the team will win the next match if it rains?

We want to know p(winning|it rained).

p(winning) = 5/12 = 0.417

p(raining before the match) = 4/12 = 0.333

p(it rained|winning) = 3/5 = 0.600

Applying Bayes’ theorem:

p(A|B) = p(B|A) x p(A)/p(B)

p(winning|it rained) = p(it rained|winning) x p(winning)/p(raining before the match)

0.600 x 0.417/0.333 = 0.75

So there is a 75% probability that the team wins the match if it rains. A direct count agrees: the team won 3 of the 4 matches played after rain.
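The arithmetic above can be checked in a few lines of Python. This is only a sketch of the cricket example; the variable names are ours, and the counts are the ones given in the example.

```python
# Counts taken from the cricket example.
matches = 12
wins = 5
rainy_wins = 3
rainy_losses = 1
rainy_matches = rainy_wins + rainy_losses  # 4

p_win = wins / matches                # p(winning)
p_rain = rainy_matches / matches      # p(raining before the match)
p_rain_given_win = rainy_wins / wins  # p(it rained|winning)

# Bayes' theorem: p(A|B) = p(B|A) x p(A) / p(B)
p_win_given_rain = p_rain_given_win * p_win / p_rain

# Cross-check by counting rainy matches directly.
direct_count = rainy_wins / rainy_matches

print(round(p_win_given_rain, 3), round(direct_count, 3))
```

Both numbers come out the same, which is a useful sanity check that the three input probabilities were read off the story correctly.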

Bayesian statistics provides a mathematical method for calculating the probability of a future event. The prediction is based on what we know about prior events.

Bayesian statistics starts with a prior belief, which is expressed as a prior distribution. This is then updated with new evidence to yield a posterior belief. The posterior belief is also a probability distribution.
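To make "prior distribution in, posterior distribution out" concrete, here is a minimal sketch that reuses the rainy-match counts from the cricket example. The grid of candidate values and the flat prior are our assumptions for illustration; they are not part of the original example.

```python
# Candidate values for the unknown probability of winning when it rains.
thetas = [i / 100 for i in range(101)]

# Prior belief: every candidate value is equally plausible (an assumption).
prior = [1 / len(thetas)] * len(thetas)

# Evidence from the example: 3 wins and 1 loss in matches played after rain.
wins, losses = 3, 1
likelihood = [t**wins * (1 - t) ** losses for t in thetas]

# Posterior is proportional to likelihood x prior, then normalized to sum to 1.
unnormalized = [l * p for l, p in zip(likelihood, prior)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

most_probable = thetas[posterior.index(max(posterior))]
posterior_mean = sum(t * p for t, p in zip(thetas, posterior))
print("most probable value:", most_probable)
print("posterior mean:", round(posterior_mean, 3))
```

The posterior is a whole distribution rather than a single number. Its peak sits at the 3-out-of-4 win rate, while its spread reflects how little evidence four matches provide.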

Let’s now switch gears and talk about hyperparameter tuning.

In deep learning we perform the following steps:

  • Create the model
  • Apply hyperparameters to the model
  • Train the model on the training dataset
  • Evaluate the model’s performance on the validation set
  • Fine-tune the model for the best accuracy

Hyperparameter tuning is one of the techniques for fine-tuning the model’s performance.

Examples of hyperparameters for a neural network:

  • Number of hidden units
  • Dropout rate
  • Number of epochs
  • Batch size
  • Learning rate
  • Optimizer

Deep learning models are deep and complex, with many hyperparameters to tune. Every trial means training a model, which makes hyperparameter tuning computationally very expensive.

We can optimize the hyperparameters for a neural network using:

  • Manual search
  • Grid search
  • Random search
  • Bayesian optimization

Grid search performs an exhaustive search over the specified parameter values. The grid is the Cartesian product of all the values listed for each hyperparameter, so every value of one hyperparameter is tried with every value of all the others. The model is trained and scored once per combination, and the combination with the lowest loss is kept. Because the number of combinations multiplies with every hyperparameter added, this is computationally very expensive.
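A minimal sketch of grid search in Python follows. The hyperparameter names, their values, and the `validation_loss` function are hypothetical stand-ins; in a real run, `validation_loss` would train a network and return its measured loss.

```python
from itertools import product

# Hypothetical search grid; names and values are illustrative only.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64],
    "dropout": [0.2, 0.5],
}

def validation_loss(params):
    """Stand-in for training a network and returning its validation loss."""
    return (
        (params["learning_rate"] - 0.01) ** 2
        + (params["dropout"] - 0.2) ** 2
        + abs(params["batch_size"] - 64) / 1000
    )

names = list(grid)
# Cartesian product: every value of each hyperparameter with every value of the others.
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

best = min(combinations, key=validation_loss)
print(len(combinations), "combinations evaluated; best:", best)
```

Three values, two values and two values already give 3 x 2 x 2 = 12 training runs. Adding one more hyperparameter with three values would triple that.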

Random search samples each hyperparameter from a statistical distribution. Because the hyperparameters are selected at random, not every combination is tried. As the number of hyperparameters increases, random search is often the better option, as it tends to arrive at a good combination in fewer trials. For the same reason, it is usually more efficient than grid search in terms of computing cost.
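Random search can be sketched the same way. The distributions below (log-uniform for the learning rate, uniform for dropout) and the `validation_loss` stand-in are our assumptions for illustration, not a recommendation.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_params():
    """Draw one configuration from assumed distributions (illustrative only)."""
    return {
        # Log-uniform between 1e-4 and 1e-1.
        "learning_rate": 10 ** random.uniform(-4, -1),
        "batch_size": random.choice([16, 32, 64, 128]),
        "dropout": random.uniform(0.0, 0.6),
    }

def validation_loss(params):
    """Stand-in for training a network and returning its validation loss."""
    return (
        (math.log10(params["learning_rate"]) + 2) ** 2
        + (params["dropout"] - 0.2) ** 2
        + abs(params["batch_size"] - 64) / 1000
    )

# The budget is fixed up front and does not grow with the number of hyperparameters.
n_trials = 20
trials = [sample_params() for _ in range(n_trials)]

best = min(trials, key=validation_loss)
print("best of", n_trials, "random trials:", best)
```

Note that each trial is drawn without looking at the results of the earlier ones.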

Random search and grid search do not learn from previous results. They are completely uninformed by past evaluations. As a result, we spend a significant amount of time evaluating hyperparameters that are unlikely to be useful.

Bayesian optimization is an elegant solution to the hyperparameter optimization problem.

Bayesian optimization incorporates prior data: the hyperparameters already tried, together with the accuracy or loss the model achieved with them. This prior information helps it make a better-informed choice of the hyperparameters to try next.

It offers a principled approach to modeling uncertainty. This allows exploration and exploitation to be naturally balanced during the search.

Bayesian optimization is a Sequential Model-Based Optimization (SMBO) algorithm. SMBO algorithms are used in applications where evaluating the fitness function is expensive. Hyperparameter tuning is a computationally expensive task, which makes it a good candidate for Bayesian optimization.

The entire concept of Bayesian model-based optimization is to reduce the number of times the objective function needs to be run. This is done by choosing only the most promising set of hyperparameters for evaluation. The hyperparameter selection is based on previous calls to the evaluation function.

The next set of hyperparameters is selected using a probabilistic model of the objective function, called a surrogate, which is much cheaper to evaluate than the objective function itself.

Here our objective can be to minimize the loss of the neural network or to maximize the accuracy of the model.

Steps for Bayesian Optimization

  1. Create a domain of hyperparameters that we want to explore
  2. Create an objective function that takes in hyperparameters and outputs a score. The score can be a loss or error that we want to minimize, or an accuracy that we want to maximize. The objective function builds and trains the deep learning model with the hyperparameters passed in and returns the model’s accuracy or error.
  3. Create a criterion, called a selection function, for choosing which hyperparameters to evaluate next based on the surrogate model, a cheap probabilistic approximation of the objective function
  4. Keep a history of score and hyperparameter pairs, which the algorithm uses to update the surrogate model after each evaluation
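The steps above can be sketched end to end for a single hyperparameter. Everything specific here is our assumption for illustration: the toy `objective` stands in for an expensive training run, the surrogate is a small Gaussian process, and the selection function is expected improvement. Other surrogates and selection functions are possible, and in practice a dedicated library would replace the hand-written linear algebra.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def objective(x):
    """Stand-in for 'train the network and return the validation loss'."""
    return (x - 0.3) ** 2 + 0.1 * math.sin(15 * x)

def kernel(a, b, length=0.15):
    """Squared-exponential similarity between two hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def surrogate(xs, ys, x_new, jitter=1e-6):
    """Gaussian-process estimate (mean, standard deviation) of the loss at x_new."""
    K = [[kernel(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [kernel(a, x_new) for a in xs]
    y_mean = sum(ys) / len(ys)
    alpha = solve(K, [y - y_mean for y in ys])
    weights = solve(K, k_star)
    mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_new, x_new) - sum(k * w for k, w in zip(k_star, weights))
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best_loss):
    """Selection function: expected reduction below the best loss seen so far."""
    z = (best_loss - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best_loss - mean) * cdf + std * pdf

# Step 1: the domain, here one hyperparameter scaled to [0, 1] on a grid.
candidates = [i / 100 for i in range(101)]

# Step 4: the history of (hyperparameter, score) pairs, seeded with 3 random trials.
xs = random.sample(candidates, 3)
ys = [objective(x) for x in xs]  # Step 2: the expensive call

for _ in range(10):
    best_loss = min(ys)
    untried = [x for x in candidates if x not in xs]
    # Step 3: score every untried candidate on the cheap surrogate only.
    x_next = max(
        untried,
        key=lambda x: expected_improvement(*surrogate(xs, ys, x), best_loss),
    )
    xs.append(x_next)
    ys.append(objective(x_next))  # one real evaluation per iteration

best_x = xs[ys.index(min(ys))]
print("evaluations:", len(ys), "best x:", best_x, "best loss:", round(min(ys), 4))
```

The loop calls the expensive `objective` only 13 times in total, while the surrogate is queried for every untried candidate in each iteration. Expected improvement is large where the predicted loss is low (exploitation) or where the surrogate is uncertain (exploration), which is how the two are balanced.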

Conclusion:

Bayesian optimization is an effective technique for finding good hyperparameter values for a deep learning model. Unlike grid search and random search, it does not evaluate each trial in isolation. It makes an informed decision about which hyperparameters to try next, using the hyperparameters and the accuracy or loss from earlier trials. Because it typically needs fewer training runs to reach a good configuration, it is computationally efficient.

References:

https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

http://proceedings.mlr.press/v37/snoek15.pdf

https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf