Original article was published by Imdadul Haque Milon on Deep Learning on Medium
Optimization Algorithms Around the World
Key concepts and mathematical foundations behind the popular optimization algorithms
In this era, everything is influenced by Artificial Intelligence, Machine Learning, and deep learning. Nowadays it is common to use very big datasets and we need fast and efficient optimization algorithms to get the leverage of this big amount of data. Training Neural Network is generally much harder than the other optimization problems in deep learning, A good optimization algorithm is easy to implement and faster at the same time. In this post, we will explain the mathematical foundation behind the most common and effective optimization algorithms for neural networks.
We will start with the gradient descent algorithm which is the most basic and useful optimization algorithm for neural networks. Then we will explain the concept of mini-batch and Stochastic Gradient Descent (SGD) which is just a modification of Gradient Descent. After that, we will introduce the concept of momentum which is the core concept of the many modern optimization algorithms. Finally, we will derive the mathematical concepts of RMSProp and Adam Optimization that are two highly efficient algorithms.
Gradient descent is a mathematical concept of first-order iterative optimization algorithm for finding a local minimum of a differentiable function. In a neural network, this function is called loss function or cost function. Suppose we want to solve a classification problem using a neural network. For simplicity, suppose the neural network contains a single hidden layer only.
Suppose, we have a set of inputs or features X with associated classification labels Y
where x ⁽¹⁾, x⁽²⁾, … , x⁽ᵐ⁾ are different training examples and we have m training examples. Then the output, A will be
where a[ˡ]⁽ᵐ⁾ denotes the output of different layers of different examples, l denotes the layer number and m denotes the different examples. Weights of this network are initialized with 0 or randomly.
Then we will use a loss function E to find the error between the prediction and the real value. Our goal is to minimize this error and make predictions as close as possible to the real values by updating the weights. To measure this error we use a metric known as the loss function. A common loss function is the Sum of Squared Error (SSE) defined as
where ŷ is the prediction and y is the true value. One thing to notice is that errors are the functions of weights. We have to tune up the weights to alter the network’s prediction which in turn will influence the overall error. Our goal is to find the weights that will minimize the errors and to do that we use gradients. Suppose, we plot the weights in the x-axis and the error (E) in the y-axis to get a curve.
Here we are showing the simple depiction of the error with one weight. Our goal is to find the weight that minimizes the error. We start with a random weight and step towards the direction of the minimum. The direction is opposite to the gradient or the slope. After taking several steps towards the direction eventually we will be able to reach the minimum of the error function. To update the weights we use
Here η is a constant that is called the learning rate of a neural network. Learning rate is a hyperparameter that needs to be adjusted in the network. Deriving the derivative term
We can simplify the equation defining another term ∆ called error term
Mini-Batch and Stochastic Gradient Descent(SGD)
In gradient descent algorithm, we used all the training examples that are also known as batch gradient descent. But there are some problems with the batch gradient descent. If the number of training examples is too big it takes longer and requires a lot of memory to compute one single epoch. For example, if we have ten million training examples we have to process all the training examples before taking a single step towards the minimum. To resolve this issue we use mini-batch gradient descent by splitting the training examples into small chunks called batches and training one batch at a time. Suppose, we have a set of m number of inputs or features X with associated classification labels Y.
In mini-batch gradient descent, instead of processing all the training examples all together, we split them into small batches e.g. 1000 training examples each.
For simplicity, we can denote mini-batches as
The training process is now similar to batch gradient descent. We will pass the split batches for training and have to update the weights for every single batch. The forward propagation and loss function for mini-batch gradient descent will be
Like before, we can define error term ∆ for simplification and write the equations as
When the mini-batch size is 1 the method is called stochastic gradient descent.
- If t=N, Batch Gradient Descent
- If 1<t<N, Mini-Batch Gradient Descent
- If t=1, Stochastic Gradient Descent
Gradient Descent With Momentum
Gradient descent is a very basic and fundamental optimization algorithm and has some problems. While converging to the minimum gradient descent oscillates in up and down direction and takes a lot of steps. These oscillations make gradient descent a lot slower preventing us to use a larger learning rate. Another big problem is to get stuck in a local minimum instead of a global minimum. Gradient descent with momentum helps to address these issues. In this method, we calculate an exponentially weighted average of our gradients and use that to update our weights.
Suppose, we are trying to optimize a cost function that has contours like above and the red dot denotes the position of the local optima. Starting gradient descent from the first point we reach the second position after one iteration and after another iteration, we reach the third position and so on. With each iteration, we move closer to the local optima. If we look closely we can see there are two types of motions, e.g. horizontal and vertical. We can see that there are up and down oscillations in a vertical direction. Due to these oscillations, it takes longer to reach the optima. Also, due to these oscillations, we can not use a bigger learning rate since it may diverge for a bigger learning rate.
In this method, we use an exponentially weighted average which in simple words is just taking previous values into account while updating the weights. Previously, to update weights we used
In the momentum method, we use exponentially weighted averages of ∆w₁ and ∆w₂ and denote it V∆ᵥᵥ₁ and V∆ᵥᵥ₂ respectively.
Here β₁ is a hyperparameter that balances values between the previous and the current values and ranges from [0,1]. After calculating the exponentially weighted averages we will update our parameters using these averages.
The intuition behind this method is quite simple. When we are taking the exponential average of the previous values the up and down oscillations cancel out each other and the vertical motion gets closer to zero. But in the horizontal direction, all the gradients are pointing in the same direction. So it doesn’t slow down in the horizontal direction after adding up the previous values.
Root Mean Square Prop or RMSProp is another optimization algorithm that is quite similar to the gradient descent with momentum algorithm. Like previously, suppose that we are trying to optimize a cost function that has contours like below and the red dot denotes the position of the local optima. Starting gradient descent from the first point after one iteration we reach the second position and after another iteration, we reach the third position and so on. With each iteration, we move closer to the local optima. If we look closely we can see there are two types of motions, e.g. horizontal and vertical motion. We can also see that there are up and down oscillations in the vertical direction. Due to these oscillations, it takes longer to reach the optima. Also, due to these oscillations, we can not use a bigger learning rate since it may diverge for a bigger learning rate.
For simplicity, we are deriving the equations on two-dimensional space with two weights w₁ and w₂ where w₁ is moving in the horizontal direction, and w₂ is moving in the vertical direction. In RMSProp, we use exponentially weighted averages like before but here we use the square of the gradients
After calculating the exponentially weighted averages we will update our parameters using these averages.
Here we use ε for numerical stability and it is generally a very small number, 10⁻⁸.
The intuition behind RMSProp is that in the horizontal direction or in our case in w₁ direction we want learning to go fast while in w₂ direction we want to slow it down to reduce the oscillations. Since we are dividing by S∆ᵥᵥ₁ and S∆ᵥᵥ₂ we want S∆ᵥᵥ₁ to be bigger and S∆ᵥᵥ₂ to be smaller. If we look at the derivatives the angle or slope is much larger in the vertical direction while much smaller in the horizontal direction. So the square of S∆ᵥᵥ₂ will be relatively larger than S∆ᵥᵥ₁. In summary, we are dividing the updates in the vertical direction with a much bigger number to reduce the oscillations while dividing the updates in the horizontal direction with a smaller number that has very little impact.
Adam or Adaptive moment estimation algorithm is another very popular optimization algorithm for different types of neural networks. This algorithm is basically using momentum and RMSProp together. Below is the algorithm for Adam optimization.
In this post, we have discussed various optimization algorithms used in deep learning. In these optimization algorithms, there are different hyperparameters as well. Even though there’s no straightforward rule to choose these hyperparameters, we try to follow some common values for β₁,
β₂, and ε.
- Learning Rate, η — Needs to be tuned
- β₁ — 0.9
- β₂ — 0.999
- ε — 10⁻⁸