Original article was published on Artificial Intelligence on Medium
Gradient Descent Explained
A comprehensive guide to Gradient Descent
Optimization refers to the task of minimizing/maximizing an objective function f(x) parameterized by x. In machine/deep learning terminology, it’s the task of minimizing the cost/loss function J(w) parameterized by the model’s parameters w ∈ R^d.
Optimization algorithms (in the case of minimization) have one of the following goals:
- Find the global minimum of the objective function. This is feasible if the objective function is convex, i.e. any local minimum is a global minimum.
- Find the lowest possible value of the objective function within its neighborhood. That’s usually the case if the objective function is not convex as the case in most deep learning problems.
Gradient Descent is an optimizing algorithm used in Machine/ Deep Learning algorithms. The goal of Gradient Descent is to minimize the objective convex function f(x) using iteration.
Let’s see how Gradient Descent actually works by implementing it on a cost function.
For ease, let’s take a simple linear model.
Error = Y(Predicted)-Y(Actual)
A machine learning model always wants low error with maximum accuracy, in order to decrease error we will intuit our algorithm that you’re doing something wrong that is needed to be rectified, that would be done through Gradient Descent.
We need to minimize our error, in order to get pointer to minima we need to walk some steps that are known as alpha(learning rate).
Steps to implement Gradient Descent
- Randomly initialize values.
- Update values.
3. Repeat until slope =0
A derivative is a term that comes from calculus and is calculated as the slope of the graph at a particular point. The slope is described by drawing a tangent line to the graph at the point. So, if we are able to compute this tangent line, we might be able to compute the desired direction to reach the minima.
Learning rate must be chosen wisely as:
1. if it is too small, then the model will take some time to learn.
2. if it is too large, model will converge as our pointer will shoot and we’ll not be able to get to minima.
Vanilla gradient descent, however, does not guarantee good convergence, but offers a few challenges that need to be addressed:
- Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
- Learning rate schedules try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset’s characteristics.
- Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
- Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for vanilla Gradient Descent to escape, as the gradient is close to zero in all dimensions.
In simple words, every step we take towards minima tends to decrease our slope, now if we visualize, in steep region of curve derivative is going to be large therefore steps taken by our model too would be large but as we will enter gentle region of slope our derivative will decrease and so will the time to reach minima.
Momentum Based Gradient Descent
If we contemplate, Simple Gradient Descent completely relies on computation i.e. if there are 100 steps, then our model would run Simple Gradient Descent for 100 times.
In laymen language, suppose a man is walking towards his home but he don’t know the way so he ask for direction from by passer, now we expect him to walk some distance and then ask for direction but man is asking for direction at every step he takes, that is obviously more time consuming, now compare man with Simple Gradient Descent and his goal with minima.
In order to overcome the shortcoming of vanilla Gradient Descent, we introduced momentum based Gradient Descent where the idea is to decrease the computation time and that is achieved when we introduce the concept of experience i.e. the confidence using previous steps.
Pseudocode for momentum based Gradient Descent:
update = learning_rate * gradient
velocity = previous_update * momentum
parameter = parameter + velocity – update
In this way rather than computing new steps again and again we are averaging the decay and as decay increases its effect in decision making decreases and thus the older the step less effect on decision making.
More the history more bigger steps will be taken. Even in the gentle region momentum based Gradient Descent is able to take large steps because of momentum it carries along. Momentum helps accelerate vanilla Gradient Descent in the relevant direction and dampens oscillations.
But due to larger steps it overshoots its goal by longer distance as it oscillate around minima due to steep slope, but despite such hurdles it is faster than vanilla Gradient Descent.
In simple words, suppose a man want to reach destination that is 1200m far and he doesn’t know the path, so he decided that after every 250m he will ask for direction, now if he asked direction for 5 times he would’ve travelled 1250m that’s he has already passed his goal and to achieve that goal he would need to trace his steps back. Similar is the case of Momentum based GD where due to high experience our model is taking larger steps that is leading to overshooting and hence missing the goal but to achieve minima model have to trace back its steps.
Nesterov Accelerated Gradient Descent(NAG)
To overcome the problems of momentum based Gradient Descent we use NAG, in this we move first and then compute gradient so that if our oscillations dampen then it must be relatively smaller then Momentum Based Gradient Descent.
Nesterov accelerated gradient (NAG) is a way to give our momentum term this kind of prescience. We know that we will use our momentum term to move the parameters θ. Computing θ−γvt−1 thus gives us an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea where our parameters are going to be. We can now effectively look ahead by calculating the gradient not w.r.t. to our current parameters θ.
Let’s see NAG in action!
In figure (a), update 1 is positive i.e., the gradient is negative because as w0 increases L decreases. Even update 2 is positive as well and you can see that the update is slightly larger than update 1, thanks to momentum. By now, you should be convinced that update 3 will be bigger than both update 1 and 2 simply because of momentum and the positive update history. Update 4 is where things get interesting. In vanilla momentum case, due to the positive history, the update overshoots and the descent recovers by doing negative updates.
But in NAG’s case, every update happens in two steps — first, a partial update, where we get to the look ahead point and then the final update, see figure (b). First 3 updates of NAG are pretty similar to the momentum-based method as both the updates (partial and final) are positive in those cases. But the real difference becomes apparent during update 4. As usual, each update happens in two stages, the partial update (4a) is positive, but the final update (4b) would be negative as the calculated gradient at w_lookahead would be negative (convince yourself by observing the graph). This negative final update slightly reduces the overall magnitude of the update, still resulting in an overshoot but a smaller one when compared to the vanilla momentum-based gradient descent. And that my friend, is how NAG helps us in reducing the overshoots, i.e. making us take shorter U-turns.
Now that we are able to adapt our updates to the slope of our error function and speed up SGD in turn, we would also like to adapt our updates to each individual parameter to perform larger or smaller updates depending on their importance.
Gradient Descent Strategies.
Stochastic Gradient Descent
Instead of going through all examples, Stochastic Gradient Descent (SGD) performs the parameters update on each example (x^i,y^i). Therefore, learning happens on every example:
- Shuffle the training data set to avoid pre-existing order of examples.
- Partition the training data set into m examples.
Advantages : —
a. Easy to fit in memory
b. Computationally fast
c. Efficient for large dataset
a. Due to frequent updates steps taken towards minima are very noisy.
b. Noise can make it large to wait.
c. Frequent updates are computationally expensive.
Batch Gradient Descent
Batch Gradient Descent is a greedy approach, when we add all examples on each iteration when performing the updates to the parameters. Therefore, for each update, we have to sum over all examples.
a. Less noisy steps
b. produces stable GD convergence.
c. Computationally efficient as all resources aren’t used for single sample but rather for all training samples
a. Additional memory might be needed.
b. It can take long to process large database.
c. Approximate gradients
Mini Batch Gradient Descent
Instead of going over all examples, Mini-batch Gradient Descent sums up over lower number of examples based on the batch size. Therefore, learning happens on each mini-batch of b examples.
It is combination of both Batch Gradient Descent and Stochastic Gradient Descent.
a. Easy fit in memory.
b. Computationally efficient.
c. Stable error go and convergence.
What if we cater sparse data to Gradient Descent?
In case of sparse data, we would experience sparse ON(1) features and more frequent OFF(0) features, now, most of the time gradient update will be NULL as derivative is zero in most cases and when it will be one, the steps would be too small to reach minima.
For frequent features we require low learning rate, but for high features we require high learning rate.
So, in order to boost our model for sparse nature data, we need to chose adaptive learning rate.
Hopefully, this article has not only increased your understanding of Gradient Descent but also made you realize machine learning is not difficult and is already happening in your daily life.
As always, thank you so much for reading, and please share this article if you found it useful! 🙂