Original article was published on Deep Learning on Medium.

Adam optimization is an extension of stochastic gradient descent and can be used in place of classical stochastic gradient descent to update network weights more efficiently.

Note that the name Adam is not an acronym. In fact, the authors, Diederik P. Kingma of OpenAI and Jimmy Lei Ba of the University of Toronto, state in their paper, first presented as a conference paper at ICLR 2015 under the title Adam: A Method for Stochastic Optimization, that the name is derived from adaptive moment estimation.

The authors waste no time in listing the many benefits of applying Adam to non-convex optimization problems, which I will share here:

• Straightforward to implement (we will be implementing Adam later in this article, and you will see, first hand, how leveraging powerful deep learning frameworks makes implementation much simpler, with fewer lines of code)
• Computationally efficient
• Little memory requirements
• Invariant to diagonal rescaling of the gradients (this means that Adam is invariant to multiplying the gradient by a diagonal matrix with only positive entries; to understand this better, see this Stack Exchange discussion)
• Well suited for problems that are large in terms of data and/or parameters
• Appropriate for non-stationary objectives
• Appropriate for problems with very noisy and/or sparse gradients
• Hyperparameters have intuitive interpretation and typically require little tuning (we will cover this more in the configuration section)

## Well… How does it work?

To put it simply, Adam uses Momentum and Adaptive Learning Rates to converge faster.
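To make that concrete, here is a minimal NumPy sketch of a single Adam update as described in the paper: the first moment estimate plays the role of momentum, the second moment estimate adaptively rescales the learning rate per parameter, and both are bias-corrected. The function name `adam_step` and the toy objective are mine for illustration; the default hyperparameters are the ones suggested by the authors.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter theta given gradient grad at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad        # 1st moment estimate (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd moment estimate (adaptive scaling)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the 1st moment
    v_hat = v / (1 - beta2 ** t)              # bias correction for the 2nd moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```

After enough steps, `theta` settles near the minimum at zero. Note that the effective per-parameter step size is bounded by roughly `lr`, which is part of what makes Adam's behavior predictable across problems.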

## Momentum

When explaining momentum, researchers and practitioners alike prefer the analogy of a ball rolling down a hill, gathering speed as it approaches the local minimum. Essentially, what we must know is that the momentum algorithm accelerates stochastic gradient descent in the relevant direction while dampening oscillations.

To introduce momentum into our neural network, we add a fraction of the update vector from the past time step to the current update vector. This gives the effect of the ball gaining momentum with each step. This can be expressed mathematically as shown in figure 2.

The momentum term γ is usually initialized to 0.9 or a similar value, as mentioned in Sebastian Ruder’s paper An overview of gradient descent optimization algorithms.
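The update just described can be sketched in a few lines: the velocity accumulates a fraction γ of the previous update plus the current scaled gradient, and the parameter moves by the velocity. The function name `momentum_step` and the toy quadratic objective are illustrative choices, not part of the original paper.

```python
def momentum_step(theta, grad, velocity, lr=0.1, gamma=0.9):
    """One SGD-with-momentum update: v_t = gamma * v_{t-1} + lr * grad; theta -= v_t."""
    velocity = gamma * velocity + lr * grad   # carry over a fraction of the past update
    theta = theta - velocity                  # step by the accumulated velocity
    return theta, velocity

# Toy example: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, velocity = 5.0, 0.0
for _ in range(100):
    grad = 2.0 * theta
    theta, velocity = momentum_step(theta, grad, velocity)
```

Run on this toy quadratic, the iterate spirals in toward the minimum at zero; the characteristic overshoot-and-correct behavior is the "ball rolling down a hill" from the analogy above.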