Original article was published on Deep Learning on Medium
What is Adam?
Adam is an extension of stochastic gradient descent (SGD) and can be used in place of classical SGD to update network weights more efficiently.
Note that the name Adam is not an acronym. In fact, the authors (Diederik P. Kingma of OpenAI and Jimmy Lei Ba of the University of Toronto) state in the paper, first presented at ICLR 2015 and titled Adam: A Method for Stochastic Optimization, that the name is derived from adaptive moment estimation.
The authors waste no time in listing the benefits of applying Adam to non-convex optimization problems, which I summarize as follows:
- Straightforward to implement (we will be implementing Adam later in this article, and you will see, first hand, how leveraging powerful deep learning frameworks makes implementation much simpler, with fewer lines of code)
- Computationally efficient
- Little memory requirements
- Invariant to diagonal re-scaling of the gradients (this means Adam is unaffected by multiplying the gradient by a diagonal matrix with only positive entries; to understand this better, read this Stack Exchange post)
- Well suited for problems that are large in terms of data and/or parameters
- Appropriate for non-stationary objectives
- Appropriate for problems with very noisy and/or sparse gradients
- Hyperparameters have intuitive interpretation and typically require little tuning (we will cover this more in the configuration section)
Well… How does it work?
To put it simply, Adam uses Momentum and Adaptive Learning Rates to converge faster.
When explaining momentum, researchers and practitioners alike prefer the analogy of a ball rolling down a hill, gathering speed as it approaches the local minimum. Essentially, what we need to know is that the momentum algorithm accelerates stochastic gradient descent in the relevant direction while dampening oscillations.
To introduce momentum into our neural network, we add a fraction of the update vector of the past time step to the current update vector. This gives the ball's descent extra momentum. This can be expressed mathematically as shown in Figure 2.
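The momentum update described above can be sketched in a few lines of NumPy. The function name, hyperparameter values, and toy objective here are illustrative choices, not taken from any particular paper:

```python
import numpy as np

def momentum_step(params, grads, velocity, lr=0.01, gamma=0.9):
    """One momentum update: v_t = gamma * v_{t-1} + lr * grad_t,
    then theta_t = theta_{t-1} - v_t."""
    velocity = gamma * velocity + lr * grads
    return params - velocity, velocity

# Toy example: minimize f(x) = x^2, whose gradient is 2x.
x, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    x, v = momentum_step(x, 2 * x, v)
```

Because the velocity accumulates a decaying sum of past gradients, consecutive gradients that point the same way build up speed, while gradients that flip sign partially cancel, which is exactly the oscillation-dampening effect described above.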
The momentum term γ is usually initialized to 0.9 or a similar value, as mentioned in Sebastian Ruder’s paper An overview of gradient descent optimization algorithms.
Adaptive Learning Rate
Adaptive learning rates can be thought of as adjustments to the learning rate during the training phase, typically by reducing it according to a pre-defined schedule, as we see in AdaGrad, RMSprop, Adam and AdaDelta. This is also referred to as a learning rate schedule; for more details on this subject, Suki Lau wrote a very informative blog post called Learning Rate Schedules and Adaptive Learning Rate Methods for Deep Learning.
Without going too much into the AdaGrad optimization algorithm, I will explain RMSprop: how it improves on AdaGrad, and how it changes the learning rate over time.
RMSprop, or Root Mean Square Propagation, was developed by Geoff Hinton, and as stated in An overview of gradient descent optimization algorithms, its purpose is to resolve AdaGrad’s radically diminishing learning rates. Put simply, RMSprop divides the learning rate by a decaying average of squared gradients rather than AdaGrad’s ever-growing sum, so the learning rate shrinks far more slowly, while the per-parameter adaptation that gives AdaGrad its faster convergence is retained. See Figure 3 for the mathematical expression.
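One RMSprop update can be sketched in NumPy as follows. The learning rate and decay values here are illustrative (and the learning rate is deliberately large for the toy problem), not tuned recommendations:

```python
import numpy as np

def rmsprop_step(params, grads, cache, lr=0.1, decay=0.9, eps=1e-8):
    """Keep an exponentially decaying average of squared gradients and
    scale the step by its root, so each parameter gets its own
    effective learning rate."""
    cache = decay * cache + (1 - decay) * grads ** 2
    return params - lr * grads / (np.sqrt(cache) + eps), cache

# Toy example: minimize f(x) = x^2, whose gradient is 2x.
x, c = np.array([5.0]), np.zeros(1)
for _ in range(200):
    x, c = rmsprop_step(x, 2 * x, c)
```

Because the cache is a decaying average rather than a cumulative sum, old gradients eventually stop contributing, which is what prevents the AdaGrad-style collapse of the step size.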
This allows the learning rate to adapt over time, which is important to understand since this phenomenon is also present in Adam. When we put the two together (momentum and RMSprop) we get Adam; Figure 4 shows the detailed algorithm.
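Putting the two ideas together, a minimal NumPy sketch of a single Adam step looks like the following. The defaults in the signature are the settings recommended in the paper; the toy objective and the larger learning rate passed in the loop are my own illustrative choices:

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum-style first moment, RMSprop-style second moment,
    bias correction for both, then the parameter update."""
    m = beta1 * m + (1 - beta1) * grads          # first moment estimate
    v = beta2 * v + (1 - beta2) * grads ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)                 # (t starts at 1)
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy example: minimize f(x) = x^2, whose gradient is 2x.
x = np.array([5.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
```

The bias-correction terms matter because m and v start at zero and would otherwise be biased toward zero during the first steps; dividing by (1 - beta^t) undoes that, and the correction fades away as t grows.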
Thank you for reading to this point. Some further reading is linked below, and if you’d like to get in contact with me you can find me on LinkedIn as Kurtis Pykes (click on my name for direct access).