A General Introduction to Deep Learning Optimizers

Source: Deep Learning on Medium

Convex or nonconvex

Convexity plays a vital role in the design of optimization algorithms. As I’ve stated above, this is largely because convex problems are much easier to analyze theoretically and to compare performance on. Convex problems can also act as a baseline for optimization algorithms: if an algorithm performs poorly even on a convex function, we cannot expect it to work well in more complicated situations.

Here we first define a convex function. The description here is kept slightly informal for simplicity. A function f is convex if, for any two points x and y in its domain,

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y),

where λ ∈ [0, 1]. This may look hard to parse at first, but it simply says that the line segment connecting any two points on the curve lies on or above the curve itself. We can visualize a few functions to better illustrate it. (The figure here is from https://d2l.ai/chapter_optimization/convexity.html)

The cosine function is nonconvex, while the parabola and the exponential function are convex. If the inequality above is strict whenever x ≠ y and λ ∈ (0, 1), then f is strictly convex.
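To make the definition concrete, here is a small numerical sketch (the helper name is_convex_on_samples is just illustrative, not from any library) that tests the inequality above on random pairs of points for the three functions we just mentioned:

```python
import numpy as np

def is_convex_on_samples(f, xs, n_trials=1000, seed=0):
    """Test the chord inequality f(l*x + (1-l)*y) <= l*f(x) + (1-l)*f(y)
    on randomly chosen pairs of sample points."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x, y = rng.choice(xs, size=2)
        lam = rng.uniform(0.0, 1.0)
        if f(lam * x + (1 - lam) * y) > lam * f(x) + (1 - lam) * f(y) + 1e-9:
            return False
    return True

xs = np.linspace(-5.0, 5.0, 201)
print(is_convex_on_samples(np.cos, xs))     # False: cosine is nonconvex
print(is_convex_on_samples(np.square, xs))  # True:  the parabola is convex
print(is_convex_on_samples(np.exp, xs))     # True:  the exponential is convex
```

Passing this check on sampled points is of course only evidence of convexity, not a proof; a single failure, however, is enough to prove a function is nonconvex on that interval.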

Convex functions have several important properties. One is that a convex function has no local minima other than global minima, i.e. every local minimum is also a global minimum. This property ensures that the algorithm cannot “get stuck” at a suboptimal point while we minimize the function. However, convexity does not mean that the global minimum is unique, or even that one exists at all. Another property is that, for a twice-differentiable convex function, the second derivative satisfies f′′(x) ≥ 0; for multivariate problems, the Hessian matrix is positive semi-definite.
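To connect the second-order condition to something we can actually compute, here is a rough sketch (the finite-difference helper numerical_hessian is our own illustrative code, and finite differences only approximate the true Hessian) that estimates the Hessian of a convex quadratic and of a saddle-shaped function and checks whether all eigenvalues are non-negative:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Estimate the Hessian of a scalar function f at point x by finite differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / eps**2
    return H

convex_f = lambda v: v[0]**2 + v[1]**2 + v[0] * v[1]   # convex quadratic
saddle_f = lambda v: v[0]**2 - v[1]**2                 # nonconvex saddle

for f in (convex_f, saddle_f):
    eig = np.linalg.eigvalsh(numerical_hessian(f, np.array([0.3, -0.7])))
    print(eig, "PSD" if np.all(eig >= -1e-6) else "not PSD")
```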

Some convex function examples are:

  • The absolute value function f(x) = |x| is convex even though it is not differentiable at x = 0.
  • The exponential function is convex, and in fact strictly convex.
  • The square function is convex, and in fact strictly convex.
  • The logarithmic function is concave, but the negative logarithm is convex. (So the cross-entropy loss is convex as a function of the predicted probabilities for a fixed ground-truth label, but as a function of a network’s intermediate weights it is generally nonconvex; see the sketch after this list.)
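To illustrate the parenthetical claim in the last bullet, here is a toy sketch. The data, the one-hidden-layer tanh network, and the helper names are all ours, purely for illustration: part (a) checks the chord inequality for −log p as a function of the predicted probability, while part (b) shows that the same loss, viewed as a function of the weights, can violate it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data (the labels play the role of fixed ground truth).
X = rng.normal(size=(64, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

def net(W1, W2, X):
    """Tiny one-hidden-layer tanh network with a sigmoid output."""
    logits = np.tanh(X @ W1) @ W2
    return 1.0 / (1.0 + np.exp(-logits))

def cross_entropy(p, y, eps=1e-12):
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# (a) As a function of the predicted probability p, -log p satisfies the chord inequality.
p1, p2, lam = 0.1, 0.9, 0.3
assert -np.log(lam*p1 + (1-lam)*p2) <= lam*(-np.log(p1)) + (1-lam)*(-np.log(p2))

# (b) As a function of the weights, the same loss can violate it.  Flipping the sign
# of one hidden unit's incoming and outgoing weights leaves the network function
# (and hence the loss) unchanged, but the midpoint of the two weight settings
# silences that unit entirely -- so whenever silencing the unit raises the loss,
# the chord inequality fails and the loss cannot be convex in the weights.
violations = 0
for _ in range(100):
    W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=3)
    W1_flip, W2_flip = W1.copy(), W2.copy()
    W1_flip[:, 0] *= -1.0
    W2_flip[0] *= -1.0
    loss_end = cross_entropy(net(W1, W2, X), y)      # equals the loss at the flipped point
    W1_mid, W2_mid = 0.5 * (W1 + W1_flip), 0.5 * (W2 + W2_flip)
    loss_mid = cross_entropy(net(W1_mid, W2_mid, X), y)
    if loss_mid > loss_end + 1e-9:
        violations += 1
print(violations)   # usually well above 0 on this toy problem
```

Part (b) relies on the sign-flip symmetry of tanh units: two different weight vectors compute exactly the same function, yet their midpoint computes a different and usually worse one, which would be impossible if the loss were convex in the weights.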

Things are different for deep learning models. If we expand the chain rule used in backpropagation, we get an expression roughly of the form

∂J/∂W_1 = (∂J/∂a_L)(∂a_L/∂a_{L−1}) ⋯ (∂a_2/∂a_1)(∂a_1/∂W_1),

where J is the loss, a_l the output of layer l, and W_1 the weights of the first layer.

We can design the activation function and the loss function to be convex on their own. However, as neural layers are stacked together, these nested nonlinear functions ultimately produce a nonconvex objective. The actual loss surface is full of local minima and saddle points, which can prevent models from making further progress towards the global minimum or a better local minimum. What’s more, a single model can have millions of trainable parameters, and this curse of dimensionality adds further difficulty to theoretical analysis and to designing better algorithms. Fortunately, simple SGD can give acceptable results in most cases.
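As a reminder of just how simple that baseline is, here is a minimal sketch of plain mini-batch SGD (the helper names and the deliberately simple least-squares example are ours, for illustration only):

```python
import numpy as np

def sgd(grad_fn, w0, data, lr=0.1, epochs=100, batch_size=8, seed=0):
    """Plain mini-batch SGD: repeatedly step against the gradient of the
    loss evaluated on a randomly drawn mini-batch."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            w = w - lr * grad_fn(w, batch)
    return w

# Usage: fit y ≈ a*x + b by least squares (a deliberately simple, convex example).
xs = np.linspace(-1.0, 1.0, 100)
ys = 2.0 * xs + 1.0 + 0.05 * np.random.default_rng(1).normal(size=100)
data = np.stack([xs, ys], axis=1)

def grad_fn(w, batch):
    x, t = batch[:, 0], batch[:, 1]
    err = w[0] * x + w[1] - t
    return np.array([np.mean(err * x), np.mean(err)])

print(sgd(grad_fn, np.zeros(2), data, lr=0.5, epochs=200))  # roughly [2.0, 1.0]
```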