Source: Deep Learning on Medium
Most widely used Optimization techniques : Optimizing Algorithms.
The below mentioned are few of the widely used optimizing algorithms which I will be covering in this post –
- Vanilla Gradient Descent,
- Momentum Based Gradient Descent,
- Nesterov Accelerated Gradient Descent,
Before going into these techniques let me tell you why did we come up with these techniques or what was the problem with our usual Gradient Descent update rule that was used to update the weights and bias as follows –
w = w — eta * dw — — — — — — -(1)
where, ‘w’ — weight associated with the input, eta — learning rate, dw — derivative of loss w.r.t weight. Lets analyse the graph of a slope
From the above graph we can say that when the slope is steep the derivative is high and when the slope is gentle the derivative is low. You might be wondering how does it is linked to our update rule?.
So this could be problem when our initialization points are on the flatter regions because the algorithms would not move fast enough. If it so happens that our random initialization of ‘w’ and ‘b’ start at a flat region, then the algorithm would need to run many epochs to get out of the plateau and running many epochs might cost us more and time consuming.
One of the main issue with the gradient descent is that it takes a lot of time to navigate regions with gentle slope as the gradient is very small in these regions.
Okay now we got to know why we come up with new techniques right!
What can we think to do when there is flat region?
Just a vague idea is take bigger steps right!. So An intuitive solution would be that if the algorithm is repeatedly being asked to go in the same direction, then it should probably gain some confidence and start taking bigger steps in that direction.
Say when you are going to some shopping market and you do not know the route after being asked one person will say walk straight for around 1KM and there you can find the shop. After being said this you won’t ask after 100M to every person the same route, right? So you get confidence that moving for 1KM will get you to your desired destination so does with these algorithms.
Various Optimization algorithms :
Vanilla Gradient Descent : It is the first update rule we used to follow by directly updating the weights and biases as follows :
ωt+1 = ωt — η * ∇ωt — — — — — -(2)
Momentum Based Gradient Descent : Now instead of updating the weights directly lets accelerate a bit in the direction it’s moving so that it takes lesser steps in moving in the same direction.
The intuition behind this is as we progress further and further down a series or direction, we can place lesser and lesser importance to the later gradients as we move along the same direction because the ‘gamma’ is getting powered up as we go ahead means it starts decaying and more importance to the last step we have taken.
υt = γ * υt−1 + η∇ωt — — — — — -(3)
ωt+1 = ωt — υt — — — — — — — — -(4)
υt−1 -history of the movement and γ- ranges from 0–1, η — learning rate
Here, we take an exponentially decaying weighted sum, whereby as we move further and further into the series, the weight decays more.
We can see from the above graph that in fig-3(a) for just 29 epochs even in the regions having gentle slopes MBGD is able to take large steps because the momentum carries along. But where as in fig-3(b) which took 40 epochs then too couldn’t be able to cross the plateau.
IS MOVING FAST ALWAYS GOOD?
Would there be a situation where momentum would cause us to run pass our goal?
No always moving fast isn’t a good idea because there comes a problem of overshooting and oscillating about the minima. So this is the common issue for both the MBGD and VGD of overshooting with a high learning rate.
So to come up with the problem of overshooting they introduced what is called Nesterov accelerated gradient descent.
Nesterov Accelerated Gradient Descent : So the problem associated with the momentum and vanilla gradient descent of oscillation can be overcome by nesterov method. What we do here is first we go with the past history and then before moving ahead we then calculate the derivative at that point for the second step whereas in momentum we used to calculate the derivative before taking a step and then move by its history and current which would lead to overshoot problems.
Now lets step back a bit and see the equations of MBGD — Substituting equation (3) in (4) –
ωt+1 = ωt — γ * υt−1 — η∇ωt — — — — — -(5)
What we can see from the above equation is that the movement occurs in two steps –
i. First is with the history term — γ * υt−1
ii. Next is with the weight term — η∇ωt.
So when moving both steps each time it is possible to overshoot the minima between the two steps. So we can consider first moving with the history term, then calculate the second step from where we were located after the first step (ωtemp ).
Using the above intuition, the Nesterov Accelerated Gradient Descent solves the problem of overshooting and multiple oscillations.
ωtemp = ωt — γ * υt-1 — — — — — -(6); Compute based on movement with the history
ωt+1 = ωtemp − η * ∇ωtemp — — -(7); Move further in the direction of the derivative of ωtemp.
υt = γ * υt-1 + η * ∇ωtemp — — —-(8); Update history with movement due to derivative of ωtemp.
Looking ahead helps NAG in correcting its course quicker than MBGD. Hence the oscillations are smaller and the chances of escaping the minima valley is also smaller.
AdaGrad : The motivation for adagrad comes from the sparse and dense features from the data set. When in particular feature if most of the points are ‘0’ then there is no point in updating weights since it means as it is because of the ‘0’ value. So what we can do is for the cases when there are less number of 1’s then 0’s then we can boost the learning rate for such features and decay the learning rate for sparse features.
The basic intuition for the Adagrad is decay the learning rate for parameters in proportion to their update rule.
υt = υt−1 + (∇ωt)^2 — — — — -(9)
In the case of dense features, it increments for most iterations, resulting in a larger υt value.
For sparse features it does not increment much as the gradient value is often 0, leading to a lower υt value.
ωt+1 = ωt − (∇ωt) * η /√((υt) + ε) — — -(10)
From the above equation we can say that the denominator term ‘υt’ serves to regulate the learning rate η that is why it is called the adaptive learning rate.
i. For dense features, υt is larger, √(υt ) becomes larger thereby lowering η.
ii. For sparse features, vt is smaller, √(υt ) becomes much smaller thereby lowering η to a smaller extent. The ‘ε’ term is added to the denominator √(υt ) + ε to prevent a divide-by-zero error from occurring in the case of very sparse features i.e. where all the data points yield zero up till the measured instance.
Black — Vanilla gradient descent, Red — Momentum based gradient descent, Blue — Nesterov accelerated gradient descent, Green — AdaGrad
So from figure we can observe that for the curves(black, red and blue) my ‘w’ was not changing initially only ‘b’ was changing. That’s the problem with these algorithms the issue was that my ‘w’ was a very sparse feature that’s why it’s derivative were zero in most cases and was not updating and ‘b’ was mostly dense feature it was updating.
From the green curve we can clearly see that parameters corresponding to sparse features get better updates. But still misses the convergence the RMS Prop comes into picture to solve this issue of converging.
The LR decays very aggressively as the denominator grows(not good for parameters corresponding to dense features)
RMS Prop : The learning rate decays aggressively as the denominator grows which is not suitable for the parameters corresponding to dense features. So the motivation or the intuition for RM prop is that why not decay the denominator and prevent its decay? Right!.
Here we are taking the exponential decaying growth to make the curve to converge to the minima.
υt = beta * υt−1 + (1 — beta) * (∇ωt)^2 — — — -(11)
ωt+1 = ωt − (∇ωt) * η /√((υt) + ε) — — — — — -(12)
From the above graph we can say that Adagrad was stuck when it was close to convergence(it was no longer able to move in the vertical (b) direction because of the decayed learning rate). RMS prop overcomes this problem by being less aggressive on the decay.
Adam : We know that both momentum and RMS prop were using history. The momentum was using history of gradient and RMS was using history of the square of the gradient, right!.
Now in case of momentum this cumulative history actually helps in moving fast out of flat surface and used to update the current weight. Where as in rms prop, history was used to adjust the learning rate suitably used to shrink or grow.
Can we be to combine it both ?.
Yes we can taking the advantage of momentum as well as rms for faster movement and in the mean time using rms to prevent the learning rate from getting blown up or null.
From the above graph we can see that it over shoot the minima since it uses momentum based gd.
mt = beta1 * vt-1 + (1 — beta1)(wt) — — — -(13)
vt = beta2 * vt — 1 + (1 — beta2)(wt)2 — — -(14)
wt + 1 = wt — mt * η /√((vt) + ε) — — — — -(15)
The equation(13) takes the advantage of the momentum and (14) takes the advantage of rms prop. And the updates are performed as shown in (15). The adam also provides what is called the bias correction which ensures that the training is smoother or say the inputs are not like weird kind of. The mt and vt are updated using the below mentioned equations.
mt = mt / (1 — beta1^t) — — — — -(16)
vt = vt / (1-beta2^t) — — — — — -(17)
The concepts were from One Fourth Labs:
#deeplearning #machinelearning #datascience #optimization #ADAM