Math Fundamentals for Neural Networks: Part 1

Calculus can be a scary word. But fear not, fellow aspiring neural network engineers! Lucky for us, only a few basic concepts are needed to start understanding how neural networks work. At first, it might not be apparent how these concepts apply to neural networks, but over the course of several posts, I’ll be moving from these crucial building blocks to implementations in Python code, so stay tuned, and hang in there!

Derivatives

Let’s say we’ve recorded a car’s speed over time and plotted the data with time on the x axis and speed on the y axis:

This curve can tell us how fast the car was going at any point in time up to 35 minutes after starting its journey. What a human will naturally take away from this is the degree to which the car was accelerating or decelerating, based on how steep the curve is at any point. It’s easy to tell, for example, that the car was accelerating 5 minutes into the drive, but how do we represent this acceleration, or rate of change in speed, mathematically?

We can define acceleration as the local gradient of the speed/time graph: in other words, the slope (m) of the curve at any single point. To represent these local gradients, we need to find the tangent at each point. A tangent is a straight line that touches the curve at a chosen point and has the same slope as the curve there.

Fig. 2 shows three tangents for points A, B, and C. If you stayed awake for your high school math classes, you’ll recall that slope is simply rise over run between any two points on a line. With this in mind, we see that the tangents for points A, B, and C have positive, zero, and negative slopes respectively. With respect to the car’s speed, these tangents tell us that speed was increasing at point A, constant at point B, and decreasing at point C.

Before we move on, think about what a corresponding data point for acceleration might look like for point B. Even though the car’s speed is 35 mph, at that specific moment in time its speed is neither increasing nor decreasing. Therefore, our value for acceleration at point B would be zero. Since each tangent is found by looking at how speed changes as time changes, we can say that acceleration comes from the relationship between speed and time: given the speed curve, we can derive the acceleration at any moment.

Let’s take it a step further and find the slopes of the tangents for every point on the curve at every second of the car’s trip, then plot each of those acceleration rates on a corresponding graph, overlaid on our original speed/time graph:

The new acceleration line, shown in aqua, is at 0 for every point on our blue speed line where the speed levels off, or stops changing. At around minute 13, when the car is decelerating, the acceleration curve shows negative values. In this example, acceleration is the derivative of speed with respect to time. Pretty simple, right?
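As a rough sketch of the procedure above, the slope between each pair of neighboring readings can stand in for the tangent slope. The readings below are made-up values, not taken from the figures, and `slopes_between_samples` is a hypothetical helper name.

```python
# Hypothetical readings: time in minutes, speed in mph.
# These numbers are invented for illustration only.
times = [0, 5, 10, 15, 20, 25, 30, 35]
speeds = [0, 20, 35, 25, 25, 30, 30, 20]


def slopes_between_samples(xs, ys):
    """Return the rise-over-run slope for each pair of neighboring samples."""
    return [
        (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    ]


# Each entry approximates acceleration (mph per minute) over one interval.
acceleration = slopes_between_samples(times, speeds)

for start, value in zip(times, acceleration):
    print(f"from minute {start}: {value:+.1f} mph per minute")
```

Positive values mean the car is speeding up, zero means the speed is level, and negative values mean it is slowing down. With readings taken more often, these slopes would track the tangent slopes more closely.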

What if we followed the same procedure, but this time on our acceleration line, deriving a third curve representing the rate of change in acceleration? This new curve would be called the second derivative of the speed. If we continued to find derivatives of each new curve, we would obtain the third, fourth, and fifth derivatives, and so on.
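Here is a small sketch of that repetition, assuming evenly spaced samples of a made-up curve (t cubed). The helper `differences` is a hypothetical name, and neighboring-sample slopes are only an approximation of true derivatives.

```python
def differences(values, step=1.0):
    """Slope between each pair of neighboring, evenly spaced samples."""
    return [(b - a) / step for a, b in zip(values, values[1:])]


# Hypothetical curve sampled at t = 0, 1, 2, 3, 4 (values of t ** 3).
samples = [t ** 3 for t in range(5)]

first = differences(samples)   # approximates the first derivative
second = differences(first)    # approximates the second derivative
third = differences(second)    # approximates the third derivative

print(first)
print(second)
print(third)
```

Each pass uses the output of the previous one as its input, which is all that "taking the derivative of the derivative" means. Note that every pass produces a list one entry shorter than the one before.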

Things get even more interesting if we work backwards. What if we only had an acceleration curve, to which we applied the inverse of the procedure we just discussed? This would give us the antiderivative. In our example, the antiderivative of the acceleration curve would give us back the speed curve (as long as we know the starting speed), and the antiderivative of the speed curve would represent the distance of the car from its starting point.
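Working backwards can be sketched as a running total: multiply each rate of change by the length of its time interval and keep adding. The acceleration values and the 5-minute interval below are assumptions made up for illustration, and `accumulate` is a hypothetical helper.

```python
def accumulate(rates, step, start=0.0):
    """Running total of rate * step, beginning from a known starting value."""
    totals = [start]
    for rate in rates:
        totals.append(totals[-1] + rate * step)
    return totals


# Hypothetical acceleration readings in mph per minute, one per 5-minute interval.
acceleration = [4.0, 3.0, -2.0, 0.0, 1.0, 0.0, -2.0]

# Accumulating acceleration over time rebuilds speed, given the starting speed.
speeds = accumulate(acceleration, step=5, start=0.0)
print(speeds)
```

The `start` argument matters: the rates alone only say how the quantity changes, not where it began. Applying the same running total to a list of speeds (with consistent time units) would give distance traveled.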

This process of finding derivatives is called differentiation.

Let’s turn up the heat a bit.

It’s easy to find the slope of a straight line by dividing the change in y by the change in x between two points on the line, in other words rise over run.

Since the line in figure 4 is straight, the slope will remain the same, regardless of which two points we choose. To arrive at our acceleration curve in figure 3, however, we needed to find the slope of a curve, whose slope, or gradient, is different at each point.
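To see the difference in numbers, here is a minimal sketch using a made-up straight line (y = 2x + 1) and a made-up curve (y = x squared). Neither comes from the figures, and `slope` is a hypothetical helper.

```python
def slope(p1, p2):
    """Rise over run between two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)


# Hypothetical straight line y = 2x + 1 and curve y = x ** 2, sampled at x = 0..4.
line_points = [(x, 2 * x + 1) for x in range(5)]
curve_points = [(x, x ** 2) for x in range(5)]

# On the line, any pair of points gives the same slope.
line_slope_a = slope(line_points[0], line_points[1])
line_slope_b = slope(line_points[1], line_points[4])

# On the curve, the slope depends on which pair we pick.
curve_slope_a = slope(curve_points[0], curve_points[1])
curve_slope_b = slope(curve_points[1], curve_points[2])

print(line_slope_a, line_slope_b)
print(curve_slope_a, curve_slope_b)
```

The two line slopes agree, while the two curve slopes do not, which is why a curve needs a slope defined point by point.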

Let’s take a look at our speed curve again. For a point A on this curve, let’s label our y value as f(x), meaning our y value depends on the point’s x value.

If we then choose another point B, we can represent that point’s x and y values in terms of its distance from the original point A. We call this distance delta x, meaning change in x, and write it as 𝛥x. It follows that if point A’s y value is f(x), then point B’s y value would be f(x + 𝛥x).

Now let’s draw tangents at several points moving from point B to point A.

Notice how, as we move closer to point A, the slopes of these tangents get progressively closer to the slope of the tangent at point A. To represent this concept mathematically, we use limit notation, which in plain English reads “as point B moves to point A”. Since 𝛥x is the distance between the two points, it shrinks to zero as B arrives at A, so we can replace “as B moves to A” with “as 𝛥x approaches 0”.

Here’s what it looks like:

lim (𝛥x → 0)

Remember our basic definition of slope as rise over run, or change in y over change in x. For a straight line through two points (X₁, Y₁) and (X₂, Y₂), this would be

m = (Y₂ − Y₁) / (X₂ − X₁)

Or simply

m = 𝛥y / 𝛥x

However, since we’re now referring to our y values for points A and B as f(x) and f(x + 𝛥x) respectively, and the change in x between them is 𝛥x, the slope of the line through A and B becomes

m = (f(x + 𝛥x) − f(x)) / 𝛥x

This notation starts to look more cryptic, but don’t lose sight of the fact that it’s still just a fancy way to represent change in y over change in x.
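To make the notation less cryptic, here is a small sketch that computes the change in y over the change in x between x and x + 𝛥x for smaller and smaller values of 𝛥x. The function f(x) = x squared and the point x = 3 are arbitrary choices for illustration, and `difference_quotient` is a hypothetical helper name.

```python
def f(x):
    """A made-up curve, chosen only because its slope at x = 3 is known to be 6."""
    return x ** 2


def difference_quotient(func, x, delta_x):
    """Change in y over change in x between x and x + delta_x."""
    return (func(x + delta_x) - func(x)) / delta_x


x = 3.0
delta_xs = (1.0, 0.1, 0.01, 0.001)
quotients = [difference_quotient(f, x, dx) for dx in delta_xs]

for dx, q in zip(delta_xs, quotients):
    print(f"delta x = {dx}: slope is roughly {q:.4f}")
```

As 𝛥x shrinks, the printed slopes settle toward 6, the slope of the tangent to this curve at x = 3.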

This formula works for any point on our curve, but now we want an equation for the full derivative curve. As you’ll remember, that means applying the formula at point A while point B moves ever closer to A, which is exactly what our limit notation expresses, and doing the same at every point on the curve. The resulting formula is our basic definition of a derivative, represented by these equivalent notations:

f′(x) = dy/dx = lim (𝛥x → 0) (f(x + 𝛥x) − f(x)) / 𝛥x
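A computer can’t take a true limit, but a very small 𝛥x gets close. The sketch below builds an approximate derivative curve that way. The `speed` function (t squared) is a made-up stand-in for the car’s speed curve, `derivative` is a hypothetical helper, and the step size 1e-6 is an arbitrary choice.

```python
def derivative(func, delta_x=1e-6):
    """Return a function that approximates the slope of func at any x.

    This uses a small but nonzero delta_x in place of the true limit,
    so the result is an approximation, not an exact derivative.
    """
    def slope_at(x):
        return (func(x + delta_x) - func(x)) / delta_x
    return slope_at


def speed(t):
    """Hypothetical speed curve; its exact derivative is 2 * t."""
    return t ** 2


acceleration = derivative(speed)

# Evaluate the approximate derivative at several points to trace a curve.
curve = [(t, acceleration(t)) for t in range(5)]
for t, a in curve:
    print(f"t = {t}: acceleration is roughly {a:.3f}")
```

The printed values land very close to 2t. Making `delta_x` far smaller does not keep helping, because floating-point rounding eventually dominates the subtraction.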