This post is to visually show that gradient descent with momentum can converge faster compare to vanilla gradient descent when the loss surface is raven-like (i.e., one direction is substantially steeper than other directions). Let’s start with creating a two-dimensional raven loss surface.

L (

x,y)= 1/16x² + 9 y²

<script src=”https://gist.github.com/hengluchang/b5832ce205fca11060b114729a6d67eb.js“></script>

The below figure is the raven loss surface. For this loss surface, there is a global minimum at the point (0, 0) where the loss is equal to zero.

**Gradient descent with momentum**

The momentum method can accumulate velocity in the direction where the gradient is pointing towards the same direction across iterations. It achieves this by adding a portion of the previous weight update to the current one.

The following equation is the gradient descent with momentum update. β is the portion of the previous weight update you want to add to the current one ranges from [0, 1]. When β = 0, it reduces to vanilla gradient descent.

v = βv + ∇L(w)

w = w − 𝛼v

Let’s first initialize our weights at (-2.4, 0.2) in the ravine loss surface we’ve created earlier. Next, try vanilla gradient descent (β = 0) with a learning rate 𝛼 = 0.1, run for 50 iterations and see how the loss decay.

The loss is around 0.1 after 50 iterations but it is still far from the global minimum (0, 0) where the loss is zero. We can see that the gradient keep changing directions because the gradient in the *y* direction is changing signs for every iteration, this makes the algorithm navigates slowly towards the optimum. We can add momentum to accelerate the learning speed. Let’s try β = 0.8 and run same number of iterations for weight updates.

We now achieve a loss of 2.8e-5 for same number of iterations using momentum! **Because the gradient in the x direction always points to the positive x direction, the velocity can be accumulated.** Effectively this means larger learning rate in the *x* direction so it can reach to lower loss for same number of iterations compare to vanilla gradient descent. The is also why momentum is said to be able to damp oscillation.

We might wonder what if we choose β = 0.9, wouldn’t it accumulates velocity more and reach to the global minimum faster?

The answer is no. The above figure shows that too much velocity will run pass the global minimum with large step. We can also plot the velocity-iteration relation to se how velocity changes in the *x* and *y* direction, respectively.

The python code below is for generating figures above including weight trajectories and velocity-iteration plots, first script is needed to be imported as a module.

<script src=”https://gist.github.com/hengluchang/b36f6c2d512d878720931c5fddd5ed5a.js“></script>

#### Conclusion

This post is to try to visually understand why gradient descent momentum works well when the loss surface is raven-like. However, we also learn that β can not be too large for it will accumulate too much momentum and run past the optimal point quickly.

This post is motivated by Sebastian Ruder’s great post on gradient descent algorithm overview in which inspired me to visualize the momentum update using python and understand it more clearly, and David Robinson’s blog post to advice every data scientist to start a blog. This is my first post so any suggestions and comments are more than welcome!

Source: Deep Learning on Medium