Gradient Based Optimizations: Jacobians, Jababians & Hessians

Source: Deep Learning on Medium


Sometimes we need to find all of the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is the Jacobian.


The Jacobian matrix J is given by:

Example of Jacobian Matrix

Let’s say:

The function f, takes in a vector of size 2 and outputs a vector of size 2. The operation can be explained by 2 functions in a vector:


And the Jacobian Matrix of f is:


Jababians are a fictional race of aliens from Men in Black.

Hessian Matrix

Hessian is a square matrix of second order partial derivatives of a scalar-valued function or scalar field.

It describes the local curvature of a function of many variables.

The loss functions of neural nets are very high dimensional and computing and storing the full Hessian matrix takes n squared memory, which makes it intractable.

The directional second derivative tell us how well we can expect a gradient descent step to perform. We can make a second order Taylor Series approximation to the function f(x) around the current point x⁰:

… where g is the gradient and H is the Hessian at x⁰.

If we use a learning rate of epsilon, then the new point x, will be given by:

… substituting this into our equation, we get:

You can think of each part of the equation like this: there’s the original value of the function, the expected improvement due to the slope of the function and then the curvature correction from the second derivative.

Let’s call

When alpha is 0 or negative, the Taylor series predicts that increasing epsilon forever will result in the function decreasing forever. This is not true, in practice Taylor expansion is not accurate for large epsilon.

When alpha is positive, solving for the optimal step size yields:

Taylor Series

Taylor Series is a series expansion of a function about a point. The one-dimensional Taylor Series is an expansion of a real function f(x) about a point x= a:

In order for this to be true, the function must meet certain requirements, but given that it meets them, Taylor Series will always work.

Taylor Series also works for complex variables and multivariable expansion.

If you remember your Calculus, there’s the whole concept of optimization, where you can start to graph a polynomial from just the equation. You know that if the first derivative at a point is positive that it must be sloping upwards and if the second derivative is positive then it must be accelerating or curving up.

You also know that when a point has a first derivative of 0, that it’s a critical point and it must be either a minima or maxima. Which brings you to the point that there can be many minima and maxima. Which are referred to as local/global minima/maxima.

And finally you also know that at a critical point you can determine if its a minima or maxima by finding the second derivate which tells you that the if its negative then it must be a maximum.

So similarly in multivariable state, you can use the Jacobian and the Gradient as the first derivative and the Hessian as the second derivative. And roughly apply the same principles to graph in 3D space.

Remember also that you can introduce artificial constraints to your graphs as the problem suggests a real physical phenomena. After all we live in a constrained physical universe. Here we treat these constraint points as critical points. This greatly helps us in finding the global minima and maxima as it reduces our search space.

Constrained Optimization

Preliminary Approaches

Formally, instead of finding the maximum or minimum over all possible values of x, we may wish to find the max or min in some set S.

The points that lie in set S, are called the feasible points.

The common approach is to impose a norm constraint, such that:

If we pick a small epsilon, we can always check whether each resulting value is in set S. Of course this is not elegant.

But if we know certain things, we can find sophisticated approaches to each constraint, but it’s case by case. One example would be, if we wanted x, to be constrained by the unit L2 norm, we can instead minimize:

… with respect to theta, the return [cos(theta), sin(theta)] as the solution to the original problem.

Generalized Lagrange Function

Let set S be:

… where g and h are equality constraints and inequality constraints respectively. The generalized Lagrangian is then defined as:

Karush Kuhn Tucker Approach

The properties that the gradient of the generalized Lagrangian is zero, all constraints on both x, and KKT multipliers are satisfied, and the dot product of alpha and h(x) = 0, are called Karush Kuhn Tucker conditions.

Linear Least Squares

Suppose we have an equation:

… and we wanna minimize it.

Algorithmic Approach

Let alpha be:

Let beta be:

Set the step size (epsilon) and tolerance (delta) to small, positive numberswhile (alpha > delta) do
x = beta
end while

Linear Algebra Approach

… we will approach this later.

Calculus Approach with some Linear Algebra Help

Also called the Newton’s Method, suppose we wish to minimize the same function with constraint dot product of x is less than or equal to 1.

We can now solve

The the smallest norm solution to the unconstrained problem is found using the Moore Penrose pseudoinverse. If this point is feasible, then that’s it.

Otherwise, we differentiate the Lagrangian with respect to x, and get this:

Solving for x, we get:

We must choose the magnitude of lambda so that it makes sense:

When the norm of x exceeds 1, this derivative is positive. So when the derivative is zero and we know the lambda is correct.

Up Next…

In the next article, I’ll talk about the theory behind the complete machine learning theory. If you would like me to write another article explaining a topic in depth, please leave a comment.