Source: Deep Learning on Medium
This is the 2nd part in the series of posts discussing the working of sigmoid neuron and it’s learning algorithm:
2 | Sigmoid Neuron Learning Algorithm Explained With Math (current story)
In this post, we will discuss the mathematical intuition behind the sigmoid neuron learning algorithm in detail.
Citation Note: The content and the structure of this article is based on the deep learning lectures from One-Fourth Labs — Padhai.
Sigmoid Neuron Recap
A sigmoid neuron is similar to the perceptron neuron, for every input
xi it has weight
wi associated with the input. The weights indicate the importance of the input in the decision-making process. The output from the sigmoid is not 0 or 1 like the perceptron model instead it is a real value between 0–1 which can be interpreted as a probability. The most commonly used sigmoid function is the logistic function, which has a characteristic of an “S” shaped curve.
The objective of the learning algorithm is to determine the best possible values for the parameters (w and b), such that the overall loss (squared error loss) of the model is minimized as much as possible.
We initialize w and b randomly. We then iterate over all the observations in the data, for each observation find the corresponding predicted outcome using the sigmoid function and compute the squared error loss. Based on the loss value, we will update the weights such that the overall loss of the model at the new parameters will be less than the current loss of the model.
Learning Algorithm: Math Part
We saw how the weights are getting updated based on the loss value. In this section, we will see why would this specific update rule work to reduce the loss of the model. To understand why the update work rule works, we need to understand the math behind it.
Let’s get started with the math part of the learning algorithm.
In the sigmoid neuron function, we have two parameters w and b. I will represent these parameters in the form of a vector theta, theta is a vector of parameters that belong to R². The objective is to find the optimal value of theta for which the loss function is minimized.
The way we reach the objective is to update θ with some small random value, i.e. delta theta Δθ which is a collection of change in w and change in b. Δθ is also a vector that belongs to R². Let’s see the geometric representation of θ and Δθ,
We have a vector θ and we are adding another vector Δθ to it. From the parallelogram law of vectors, we will get our resultant vector θnew, which is nothing but the diagonal of the parallelogram. From the geometrical representation, we can see that there is a large change in the value of θ between θ and θnew. Instead of taking large steps in learning θ, let’s be a bit conservative and move only a small quantity in the same direction.
Remember that Δθ is a vector in order to take small steps in the same direction I need to have a smaller magnitude for Δθ. To get a small magnitude, I will multiply the Δθ with the learning rate (a very small value). Now the resultant vector is the red vector which represents the θnew. The new θ is going to be the movement of the vector θ from its intial position.
Using Taylor Series
But how do we decide the value of Δθ and what is the right Δθ to use?. How do we get the correct Δθ in a principal manner, so that the loss (which is a function of w and b) at the new theta should be less than the loss at the old theta. The answer to this question comes from Taylor series.
What Taylor series tells us is that, if we have a function f and we know the value of the function at a particular point x then the value of a function f at a new point that is very close to x is given by below formula,
The function value after the small step close to x is equal to the function value at x and some other quantity (present in the red box). If we look at the quantity present in the red box, it’s value is entirely dependent on the delta x. If I am able to find delta x such that, the quantity represented in the red box would be negative, then I would know that the new loss at new x is less than the loss at old x.
Representing the Taylor series in terms of sigmoid neuron parameters and a loss function. For ease of notation, let Δθ = u then we have,
Comparing it with the general Taylor series equation, loss at the new theta is equal to the loss at the old theta and some other quantity (presented in the red box). The value of the quantity present in the box is dependent on the change vector uᵀ. So we need to find the change vector uᵀ such that the quantity present in the red box turns out to be negative. If it is negative, then we will know that the loss at the θnew is less than the loss at the θold.
To find the optimal change vector uᵀ, let’s get rid of some terms in Taylor series for simplicity. We know that the learning rate η is very small then η², η³ etc… would be very close to zero by using this logic Taylor series can be reduced to the form,
The change vector uᵀ is optimal only if the loss at the new θ is less than old θ,
By comparing the above two equations (change vector uᵀ optimization equation and truncated Taylor series equation) we can imply that,
To solve the above equation let’s go back to a bit of linear algebra, specifically cosine angle between two vectors. we know that cos(θ) ranges from -1 to 1.
Since the denominator of cosine angle formula is a magnitude and it will always be positive, we can multiply the cosine formula throughout by k (denominator) without affecting the sign of the inequality. We want the quantity present in-between (marked in the blue box) the inequality to be negative, that is only possible if the cosine angle between the uᵀ and ∇ is equal to -1. If the cosine angle is equal to -1 then we know the angle between the vectors is equal to 180⁰.
If the angle is equal to 180⁰ that means the direction of change vector uᵀ we choose should be opposite to the gradient vector. We can summarise the gradient descent rule as follows,
Similarly, we can write the parameter update rule as follows,
Computing Partial Derivatives
So far in the previous sections, we have seen how we can use Taylor series and cosine angle between the vectors to arrive at a parameter update rule. But how do we compute the partial derivatives for w and b?.
For the sake of simplicity, let’s assume that there is only one point to fit into the sigmoid neuron. So the loss function for one point can be written as,
We need to compute the partial derivatives of w and b with respect to the loss function. Let’s derive these derivatives,
First, we will push the partial derivate inside the bracket next, we have y which is constant and its partial derivative will be zero. In the place of f(x), we will substitute with the logistic function and find the partial derivate of logistic function w.r.to w. The right part of the above image explains the partial derivative of the logistic function. The above final expression (marked in red) valid only one observation and for two observations expression changes to,
Similarly, we can compute the partial derivative of b with respect to loss,
Gradient Descent Rule
Now we have everything that we need to implement the parameter update rule using gradient descent. Let’s take an example and see how we can implement the gradient descent using the parameter update rule we just derived to minimize the loss of the model.
I have taken some toy data and initialized the parameters to w = 0, b = -8 and set the learning rate = 1.0, then iterated over all the observations for 1000 epochs and computed loss for different values of w (-2 to 6) and b (-10 to 2). Once I got the loss values for all possible combinations of w and b, I was able to generate the below 3D contour plot to show the variation in loss values.
Above code snippet shows the implementation of gradient descent update rule for parameters w and b in case of 2D toy data. We can initialize the values of parameters randomly but in this case, I have taken these specific values just for better illustration purposes.
The darker shade of red color in the graph indicates the area where the loss value is a high and darker shade of blue in the valley indicates the lowest point of loss. The below animation shows the gradient descent rule in action for fixed iterations, the points at the bottom indicate the different combinations of w & b and the points on the contour indicate the loss value for the corresponding parameter values.
We have successfully derived the gradient descent update rule which guarantees to find the best parameter values for w and b where the loss of the function is minimum as long as we move in the direction opposite to gradient.
In this post, we looked at the sigmoid neuron in brief. We then saw the sigmoid neuron math free version of the learning algorithm and then we went on to see the mathematical intuition behind the learning algorithm in detail using the Taylor series, linear algebra and partial differentiation. Finally, we have proved that our derived update rule using gradient descent is guaranteed to find the best parameters by using toy data.
This is a bit lengthy & math heavy post, even if you don’t understand fully in the first read, go through the post again. If you can any questions or suggestions feel free to post them in the responses section.
The building block of the deep neural networks is called the sigmoid neuron.towardsdatascience.com
In future posts, we will discuss how to implement the sigmoid neuron model in python and also discuss some other non-linear functions like RELU, ELU. Connect with me on LinkedIn or follow me on twitter for updates on future posts & discussions on the current story and also check out my GitHub account for interesting projects on ML.