Why Logarithms Are So Important In Machine Learning

As you can see, it’s getting pretty cluttered, right?

Moreover, it’s tedious.

The first derivative of the logarithm of the same function is a lot simpler:

The second derivative is also simpler:
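To make this concrete (the exact expressions from the article are not reproduced here, so the snippet below uses an assumed product of exponential terms as a stand-in), you can let SymPy do the differentiation and compare:

```python
import sympy as sp

w = sp.symbols('w', real=True)

# Assumed stand-in for the article's function: a product of bell-shaped
# exponential terms, the kind of expression you meet as a likelihood.
f = sp.exp(-(w - 1)**2) * sp.exp(-(w - 2)**2) * sp.exp(-(w - 3)**2)

df = sp.simplify(sp.diff(f, w))             # derivative of f: still drags the exponentials along
dlogf = sp.simplify(sp.diff(sp.log(f), w))  # derivative of log f: collapses to a linear expression

print(df)     # something like (12 - 6*w)*exp(-(w - 1)**2 - (w - 2)**2 - (w - 3)**2)
print(dlogf)  # 12 - 6*w (up to how SymPy orders the terms)
```

The derivative of f still carries the whole exponential around, while the derivative of log f is just a linear expression.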

Of course, when you actually take the logarithm, you are no longer working with the same function.

Think of it like walking versus driving to work. You do not take exactly the same route: car lanes are separate from the paths used by pedestrians. But you do not really care about that.

You are not particularly interested in the shops along the way, either. You already had a light snack at home and just want to get to your workplace, so the details of the route do not matter; only the destination does.

You want to minimize your loss function, and what you really need are the parameters that minimize it, not the value of the loss itself. That is exactly what a function and its logarithm have in common: they are minimized by the same parameters.

So whether you differentiate the loss function or its logarithm and set the derivative to zero, you recover the same minimizers.
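Here is a quick numerical sanity check of that claim, using a made-up, strictly positive toy loss (not the article's function) and a plain grid search:

```python
import numpy as np

# Assumed toy loss with a single minimum; it is strictly positive, so its log is defined.
def f(w):
    return 1.0 + (w - 0.7)**2

w_grid = np.linspace(-2.0, 3.0, 100001)

# The grid point minimizing f and the grid point minimizing log f coincide.
w_min_f = w_grid[np.argmin(f(w_grid))]
w_min_logf = w_grid[np.argmin(np.log(f(w_grid)))]

print(w_min_f, w_min_logf)  # both print 0.7 (up to the grid resolution)
```

Both searches land on the same parameter, even though f and log f take very different values there.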

A mathematical proof

Let’s prove that the parameter that minimizes a function is equal to the parameter that minimizes the logarithm of that function.

Let us suppose that a point w* is a local minimum of g(w) = log f(w), which means that for any parameter w in a small neighborhood of w* we have g(w*) ≤ g(w). Now, since the exponential function exp is strictly increasing and therefore preserves this inequality, we get:

exp(g(w*)) ≤ exp(g(w)), that is, f(w*) ≤ f(w) for every w in that neighborhood.

In other words, w* is a local minimum of the function f as well, which is what we wanted to prove.

This means that applying the logarithm to a (positive) function preserves its minimizers and maximizers, i.e., the parameters at which the minimum or maximum is attained, even though the actual values of the function change.

Strictly speaking, we should also check the converse, that taking the logarithm does not eliminate or introduce any minima; the argument is the same, because the logarithm is strictly increasing too.
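For reference, the whole argument can be written as a chain of equivalences, since both log and exp are strictly increasing; the converse direction mentioned above falls out of the same chain:

```latex
\[
\begin{aligned}
w^* \text{ is a local minimum of } g = \log f
  &\iff g(w^*) \le g(w) \quad \text{for all } w \text{ near } w^* \\
  &\iff e^{g(w^*)} \le e^{g(w)} \quad (\text{because } \exp \text{ is strictly increasing}) \\
  &\iff f(w^*) \le f(w) \quad \text{for all } w \text{ near } w^* \\
  &\iff w^* \text{ is a local minimum of } f.
\end{aligned}
\]
```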

As we saw in the examples above, working with the logarithm leads to simpler computations and better numerical stability.
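The stability part is easy to see with a classic example: multiplying many small probabilities underflows in floating point, while summing their logarithms does not. A minimal sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 made-up probabilities between 1e-4 and 1e-2, standing in for
# per-sample likelihood terms.
p = rng.uniform(1e-4, 1e-2, size=1000)

likelihood = np.prod(p)             # underflows to exactly 0.0 in float64
log_likelihood = np.sum(np.log(p))  # an ordinary finite number

print(likelihood)      # 0.0
print(log_likelihood)  # a finite negative number (on the order of -5.6e3), no underflow
```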

If this still feels abstract, let's illustrate it with some plots.

Let’s take the following function:

A portion of its graph looks like the following:

Its logarithm is:

A portion of its graph looks like the following:

As you can see, in both cases the maximum is attained at x = 0.3.
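The original function and its plots are not reproduced here, but you can recreate the same picture with any positive stand-in that peaks at x = 0.3, for example a simple Gaussian-style bump:

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed stand-in: a positive bump that peaks at x = 0.3, so log f peaks there too.
x = np.linspace(-0.5, 1.1, 500)
f = np.exp(-25 * (x - 0.3) ** 2)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(x, f)
ax1.set_title("f(x): maximum at x = 0.3")
ax2.plot(x, np.log(f))
ax2.set_title("log f(x): maximum at the same x = 0.3")
plt.tight_layout()
plt.show()
```

Both panels peak at the same x, which is the whole point.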

Taking the logarithm does not give us the same function, but it does give us the same critical points, and those are exactly what we need when minimizing a loss function.

This alone is very helpful when training Machine Learning models.