Regularization, Optimization & Hyperparameter Tuning


*This article has been a collective effort of Karanpreet, Adarssh, and Khyati from the Neuron writing team*

Machine learning draws on a wide range of algorithms and techniques, each unique in its own way. In this article, we will discuss three such topics and how they fit into the broader practice of machine learning: regularization, optimization, and hyperparameter tuning.

Regularization in Regression

To get to the concept of Regularization, one must understand Regression first.

Here’s a way to understand it: let us say that you are building a pricing model for a shoe company, and you collect inputs from various sources. Assume your variables include shoe size, color, brand, store location, customers’ age and gender, whether they prefer loafers over shoes, and so on.

While building the model, you realize that far too many of these inputs make no sense and are in no way useful. The machine then learns from too much irrelevant background information, resulting in poor predictions. This is called overfitting.

To overcome the above problem, regularization comes in handy. Regularization is the process of adding information to solve an ill-posed problem or to prevent overfitting.

There are three common ways in which one can perform regularization.

1. L1 regularization

2. L2 regularization

3. Drop-out regularization

Drop-out regularization

As the name suggests, neurons are simply dropped out at random during the training phase: the dropped units have zero participation while the model trains on that pass. As discussed, the main goal of any regularization is to prevent overfitting, and drop-out regularization does exactly that. A rough sketch is given below.
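
Here is a minimal PyTorch sketch of where a dropout layer sits in a network; the layer sizes and the drop probability p=0.5 are arbitrary choices for illustration:

import torch.nn as nn

# Arbitrary layer sizes; p=0.5 means each hidden unit is dropped
# with probability 0.5 on every training pass.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropped units contribute nothing to this pass
    nn.Linear(64, 1),
)

model.train()  # dropout active during training
model.eval()   # dropout disabled at inference time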

Of the three, L1 and L2 regularization are the most widely used.

A loss function is defined as a measure of the difference between the expected output and the output generated by the model.

Now let us take an equation:

y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ (Equation 1)

CL (cost/loss function) = RSS (residual sum of squares) = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² (Equation 2)

Where,

n = number of observations

p = number of coefficients

β = regression coefficients

X = variables

yᵢ = true value

ŷᵢ = predicted value
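
To make Equation 2 concrete, here is a minimal NumPy computation of the RSS; the arrays are made-up values:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])  # yᵢ: true values (made-up)
y_pred = np.array([2.8, 5.3, 6.5])  # ŷᵢ: predicted values (made-up)

rss = np.sum((y_true - y_pred) ** 2)  # Equation 2
print(rss)  # 0.38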

In Equation 1, if the regression coefficients are very large, even small changes in X₁ result in very large changes in y. Such coefficients must therefore be kept under control. To do this, we add a penalty term to the loss function, using either L1 or L2 regularization.

L2 regularisation, also known as Ridge regression, adds the “squared magnitude” of the coefficients as a penalty term to the cost/loss function (CL).

Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ᵖ βⱼ² = L + λ Σⱼ₌₁ᵖ βⱼ²

Here the penalty term is added to the loss function (L), and the coefficients are estimated by minimizing this combined function. λ plays an important role in this equation: it allows us to penalize the flexibility of our model. Since flexibility grows as the coefficients grow, minimizing the penalized function keeps the coefficients small.

When λ = 0, the penalty term has no effect and the equation reduces to the ordinary loss. However, as λ→∞, the impact of the penalty grows and the ridge regression coefficient estimates approach zero. Therefore, selecting a good value of λ is crucial, and cross-validation comes in handy for this purpose, as in the sketch below.
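
Here, scikit-learn’s RidgeCV picks λ (called alpha in that library) by cross-validation over a handful of made-up candidate values:

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Toy data standing in for the shoe-pricing example
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# Try each candidate λ and keep the one with the best cross-validated score
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)
ridge.fit(X, y)
print(ridge.alpha_)  # the λ chosen by cross-validation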

L1 regularisation, also known as Lasso regression, adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function (L).

This regularisation has an advantage over L2 in the way it penalizes large coefficients, as explained below.

Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + λ Σⱼ₌₁ᵖ |βⱼ| = L + λ Σⱼ₌₁ᵖ |βⱼ|

The only difference in the equation is that it uses |βⱼ| instead of βⱼ². Also, in this type of regression multiple solutions may exist, whereas in L2 regression only one optimal solution exists.

This method is advantageous for large λ. In L2 regression, as λ→∞ the coefficients tend towards zero but never become exactly zero, which is not always favorable. L1 regression overcomes this difficulty: when the tuning parameter λ is sufficiently large, the coefficients of the unimportant variables shrink to exactly zero, yielding sparse solutions and effectively performing variable selection. A brief sketch follows.
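
Below, a scikit-learn sketch with made-up data in which only a few features are informative; lasso drives most of the remaining coefficients to exactly zero (alpha is scikit-learn’s name for λ):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Toy data where only 3 of the 10 features actually matter
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)  # alpha plays the role of λ
lasso.fit(X, y)
print(lasso.coef_)  # coefficients of unimportant variables come out exactly 0.0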

Optimization in Machine Learning

The interplay between optimization and machine learning is one of the most important developments in modern computational science. Machine learning starts with defining some kind of loss or cost function and ends with minimizing it using one optimization routine or another. Your choice of optimization algorithm can make the difference between getting a good accuracy in hours or in days. Optimization in this field has countless applications and is a widely researched topic in industry as well as academia.

An optimization algorithm is a procedure that is executed iteratively, comparing various solutions until an optimum or a satisfactory solution is found. With the arrival of computers, optimization has become a part of computer-aided design activities.

For the sake of clarity, the optimization algorithms are classified into several groups, which are discussed briefly as follows.

  1. Single-variable optimization algorithms. These algorithms are classified into two categories:

a. Direct methods: These methods do not use any derivative information of the objective function; only objective function values are used to guide the search process.

b. Gradient-based methods: These methods use derivative information (first- and/or second-order) to guide the search process.

Single-variable algorithms are also used as unidirectional search methods in multi-variable optimization algorithms.

2. Multi-variable optimization algorithms: These algorithms demonstrate how the search for the optimum point progresses in multiple dimensions.

3. Constrained optimization algorithms: These algorithms use single-variable and multi-variable optimization algorithms repeatedly while keeping the search effort inside the feasible search region. These algorithms are mostly used in engineering-optimization problems.

4. Specialized optimization algorithms, such as genetic algorithms.

There are two distinct types of optimization algorithms widely used today:

  1. Deterministic algorithms: These algorithms use specific rules for moving from one solution to the next, and their output does not change across different runs. PCA (principal component analysis) is an example, since it gives the same result every time it is run on the same data.
  2. Stochastic algorithms: “Stochastic” refers to a process whose outcome involves some randomness or uncertainty; stochastic algorithms use probabilistic or random transition rules. Typically, “random” refers to a lack of dependence between observations in a sequence: the rolls of a fair die are random, as are the flips of a fair coin. These algorithms are gaining popularity since they often have better properties than deterministic algorithms.

For this topic, one should have prior knowledge of the concept of gradient descent. In this session, we shall focus mainly on stochastic algorithms.

Optimization extensions in gradient descent

Various extensions have been designed for the gradient descent algorithm. A number of them are discussed below:

  1. Momentum method: This method is employed to accelerate the gradient descent algorithm by taking into consideration the exponentially weighted average of the gradients. Using averages makes the algorithm converge towards the minima faster, because the gradients along uncommon directions cancel out. The pseudocode for the momentum method is given below.

V = 0
for each iteration i:
    compute dW
    V = beta * V + (1 - beta) * dW
    W = W - alpha * V

Here V and dW are analogous to velocity and acceleration, respectively. Alpha is the learning rate, and beta is normally kept at 0.9.
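
To see the update rule in action, here is a runnable sketch that minimizes the toy loss L(W) = W², whose gradient is dW = 2W; the starting point and constants are arbitrary:

alpha, beta = 0.1, 0.9  # learning rate and momentum, as in the text
W, V = 5.0, 0.0         # arbitrary starting point

for i in range(100):
    dW = 2 * W                      # gradient of the toy loss L(W) = W**2
    V = beta * V + (1 - beta) * dW  # exponentially weighted average of gradients
    W = W - alpha * V               # parameter update

print(W)  # ends up very close to the minimum at W = 0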

  2. AdaGrad optimization: In this method we store, for each parameter, the sum of squares of all its historical gradients, which is later used to scale the learning rate. In contrast to the other optimizers, each parameter here gets its own learning rate. The following is the pseudocode.

squared_gradients = 0
for i in range(iterations_count):
    param_gradients = evaluate_gradients(loss_function, data, params)
    squared_gradients += param_gradients * param_gradients
    # the 1e-8 term avoids division by zero
    params -= learning_rate * param_gradients / (np.sqrt(squared_gradients) + 1e-8)

Now the question is: how does this scaling help us when the loss function has a very high condition number?

For parameters with high gradient values, the squared term is large, so dividing by it makes the updates in that direction smaller and slower. Similarly, parameters with low gradients produce smaller squared terms, so the updates in those directions remain comparatively large and faster.

However, notice that since the gradient is squared at every step, the accumulated estimate grows monotonically over time, and hence the step size our algorithm takes to converge to a minimum gets smaller and smaller.

In a sense this is beneficial for convex problems, where we are expected to slow down near the minimum anyway. However, the same gift becomes a curse for non-convex optimization problems, where the chance of getting stuck at saddle points increases.

  3. RMSprop optimization: This is a slight variation of AdaGrad that works better in practice, as it addresses the issue AdaGrad leaves open. As in AdaGrad, we keep an estimate of the squared gradients, but instead of letting that estimate accumulate over training, we let it decay gradually by multiplying the current estimate of the squared gradients by a decay rate. The idea is to apply the exponentially weighted average method to the second moment of the gradients (dW²). The pseudocode is as follows.

S = 0
for each iteration i:
    compute dW
    S = β * S + (1 - β) * dW²
    W = W - α * dW / (√S + ε)

  4. Adam optimization: This incorporates the best features of RMSprop and of gradient descent with momentum. Specifically, the algorithm keeps an exponential moving average of the gradients and of the squared gradients, with parameters β1 and β2 controlling the decay rates of these moving averages. The pseudocode for the approach is as follows.

V = 0
S = 0
for each iteration i:
    compute dW
    V = β1 * V + (1 - β1) * dW
    S = β2 * S + (1 - β2) * dW²
    V_corrected = V / (1 - β1^i)
    S_corrected = S / (1 - β2^i)
    W = W - α * V_corrected / (√S_corrected + ε)

Notice that we’ve initialized the second moment to zero, so at the beginning it is estimated as something very close to zero. Consequently, we would be updating the parameters by dividing by a very small number, and hence making overly large updates at the start. To rectify this, we create unbiased estimates of the first and second moments by incorporating the current step count, and then update the parameters based on these unbiased estimates rather than on the raw moments. The proposers of Adam (Kingma and Ba) recommended the following values for the hyperparameters.

α = 0.001, β1 = 0.9, β2 = 0.999, ε = 10⁻⁸
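
Putting the pseudocode and the recommended values together, here is a runnable sketch on the same toy loss L(W) = W² used above; a minimal illustration, not a production implementation:

import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8  # recommended values
W, V, S = 5.0, 0.0, 0.0                             # arbitrary starting point

for i in range(1, 10001):
    dW = 2 * W                            # gradient of L(W) = W**2
    V = beta1 * V + (1 - beta1) * dW      # first moment
    S = beta2 * S + (1 - beta2) * dW**2   # second moment
    V_hat = V / (1 - beta1**i)            # bias-corrected first moment
    S_hat = S / (1 - beta2**i)            # bias-corrected second moment
    W = W - alpha * V_hat / (np.sqrt(S_hat) + eps)

print(W)  # approaches the minimum at W = 0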

The SGD algorithm:

In SGD, since just one sample from the dataset is chosen randomly for every iteration, the path taken by the algorithm to reach the minima is typically noisier than that of the standard gradient descent algorithm. But that doesn’t matter much, because the path taken does not matter as long as we reach the minima, and we do so with a significantly shorter training time.

[Figure: the noisy path taken by SGD towards the minimum]

One thing to note is that, as SGD is generally noisier than typical gradient descent, it usually takes a higher number of iterations to reach the minima because of the randomness in its descent. Even though it requires more iterations, each one is computationally much cheaper than a full pass of typical gradient descent. Hence, in most scenarios, SGD is preferred over batch gradient descent for optimizing a learning algorithm. A minimal sketch follows.
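
This NumPy sketch fits a linear regression with one randomly chosen sample per update; all data and constants are made up:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # made-up features
true_w = np.array([1.0, -2.0, 0.5])  # made-up true coefficients
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.01
for step in range(5000):
    i = rng.integers(len(X))  # one random sample per iteration
    err = X[i] @ w - y[i]
    w -= lr * err * X[i]      # gradient step on that sample's squared error

print(w)  # noisy path, but ends close to true_w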

Hyperparameter Tuning in Machine Learning

While making an ML model, one often faces the dilemma of choosing the most efficient methods and algorithms. Going through various model architectures is a good way to start the search for an optimal one, and the process works even better if you let the machine decide. The settings that govern the chosen methods and architecture are called hyperparameters, and the process of finding the combination that works best for your model is called hyperparameter tuning.

One should also note: hyperparameters are not model parameters, and they cannot be learned from the data.

So what does the process of hyperparameter tuning entail?

  1. Define a model
  2. Define the range of possible values for all hyperparameters
  3. Define a method for sampling hyperparameter values
  4. Define evaluative criteria to judge the model
  5. Define a cross-validation method

At a very basic level, you should train on a subset of your total dataset, holding out the remaining data for evaluation to gauge the model’s ability to generalize, as in the sketch below.
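
With scikit-learn, such a hold-out split can look like this; the 20% split is a common convention rather than a rule, and the data are made up:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# Train on 80% of the data, hold out 20% for evaluation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)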

There are various ways one can go about hyperparameter tuning, some of the methods are mentioned below:

  • Grid search
  • Random search
  • Bayesian optimization
  • Population-based

Grid Search

The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.

Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually setting bounds and discretizing may be necessary before applying grid search, as in the sketch below.
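
As a minimal sketch, scikit-learn’s GridSearchCV performs exactly this cross-validated sweep; the model and grid values are arbitrary examples:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Exhaustively evaluate every combination in the grid with 5-fold CV
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)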

Random Search

Random Search replaces the exhaustive enumeration of all combinations by selecting them randomly. This can be simply applied to the discrete setting described above, but also generalizes to continuous and mixed spaces. It can outperform Grid search, especially when only a small number of hyperparameters affect the final performance of the machine learning algorithm.
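
The random-search counterpart in scikit-learn samples configurations from distributions instead of enumerating a grid; the distributions and iteration budget here are arbitrary:

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Draw 20 random configurations from continuous distributions
param_dist = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)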

Bayesian Optimization
Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model and then updating that model, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the location of the optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected to be close to the optimum). A minimal sketch of this loop follows the steps below.

When it comes to using Bayesian principles in hyperparameter tuning the following steps are generally followed:

  • Pick a combination of hyperparameter values (our belief) and train the machine learning model with it.
  • Get the evidence (i.e. score of the model).
  • Update our belief in light of that evidence, so that the next choice of hyperparameters is more promising.
  • Terminate when a stopping criterion is met (e.g., when a loss quantity is minimized or classification accuracy is maximized).
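
A hedged sketch of this loop, assuming the third-party scikit-optimize package is available; the objective below is a toy stand-in for “train the model and return its validation loss”:

from skopt import gp_minimize
from skopt.space import Real

# Toy stand-in for training a model and returning its validation loss
def objective(params):
    c, gamma = params
    return (c - 1.0) ** 2 + (gamma - 0.1) ** 2

space = [Real(1e-2, 1e2, prior="log-uniform", name="C"),
         Real(1e-3, 1e1, prior="log-uniform", name="gamma")]

# Each call updates the probabilistic model and picks the next promising point
result = gp_minimize(objective, space, n_calls=25, random_state=0)
print(result.x)  # hyperparameters with the lowest observed loss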

Population-Based

Population-Based Training (PBT) is a recent approach that jointly optimizes neural network weights and hyperparameters, periodically copying the weights of the best performers and mutating hyperparameters during training.

It leverages information sharing across a population of concurrently running optimization processes, and allows for online propagation/transfer of parameters and hyperparameters between members of the population based on their performance. Furthermore, unlike most other adaptation schemes, the method is capable of performing an online adaptation of hyperparameters.

Conclusion

Machine learning is ever-growing and, at this point, seemingly everlasting. Although new technologies and methods arise in this field quite often, they all seem to be variations of existing and prevalent methods. That makes it essential to understand these fundamentals as a basis for your growth in the field.