Source: Deep Learning on Medium
This article covers the content discussed in the Sigmoid Neuron module of the Deep Learning course and all the images are taken from the same module.
Sigmoid Model and a drawback of the Perceptron Model:
The limitation of the perceptron model is that a harsh step function (boundary) separates the classes on two sides, as depicted below
We would like a smoother transition curve, one that is closer to the way humans make decisions: the outcome does not flip abruptly but changes gradually over a range of values. So, we would like to have something like the S-shaped function (in red in the below image).
Deep Learning uses the Sigmoid family of functions, many of which are S-shaped. One such function is the logistic function (a smooth, continuous function), and it is defined by the below equation:
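For reference, the standard textbook form of the logistic function for a single input is written out here, using the same symbols as the rest of this article (w for the weight, b for the bias):

```latex
y = \frac{1}{1 + e^{-(wx + b)}}
```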
So, we will now approximate the relationship between the input x(which could be n-dimensional) and the output y using this Logistic function(Sigmoid function). This function would have some parameters and we would try to learn the parameters using the data in such a way that the loss is minimized.
Now to visualize this function, we can take some values of x and y and plot it to see what it looks like, for example in the below case, we are plotting (‘wx + b’) on the x-axis and ‘y’ value on the y-axis.
If ‘wx + b’ is 0, then the equation(y) is reduced to:
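Assuming the standard logistic form y = 1 / (1 + e^{-(wx + b)}), the substitution works out as:

```latex
y = \frac{1}{1 + e^{-0}} = \frac{1}{1 + 1} = \frac{1}{2}
```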
Let’s try some other value:
And y, in this case, would be:
We have plotted some points of the function
And we can get the general trend of the function. So, we can visualize the function to see what the function looks like or how it varies with respect to the input.
So, this is clearly a smoother function as opposed to the if-else condition that we have in the Perceptron case.
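The point-by-point evaluation described above can be sketched in a few lines of Python. This is only an illustration: the sample values of ‘wx + b’ are assumptions chosen for the example, not the ones used in the original plot.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical sample values of z = wx + b, chosen only for illustration.
z_values = [-6, -2, 0, 2, 6]
points = [(z, sigmoid(z)) for z in z_values]

for z, y in points:
    # Each pair is one point of the curve: z on the x-axis, y on the y-axis.
    print(f"z = {z:>3}  ->  y = {y:.4f}")
```

The output rises gradually from near 0 to near 1 and passes through exactly 0.5 at z = 0, which is the smooth transition the step function lacks.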
For 2 inputs case, the function equation would be:
And if we plot it, it would look like:
To understand the plot better, we try to look at it from the top:
The dark red region(circled) in the above image is the region where the output value is close to 0 because as we plug larger and larger values, the output would be close to 0.
The green region would correspond to value 1 and the middle region(orange color) would correspond to value 0.5.
If we have more than 2 inputs, then we would write our equation as:
And the below summation
would be the same as the dot product of the two vectors w and x.
The output is going to be a scalar value between 0 and 1 no matter how many inputs we have.
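A minimal sketch of the n-input case, assuming the standard logistic form applied to the dot product of w and x plus b. The function name `sigmoid_neuron` and all weights and inputs below are hypothetical, chosen only to show that the output stays a single number between 0 and 1 regardless of the number of inputs.

```python
import math

def sigmoid_neuron(x, w, b):
    """Sigmoid neuron for n inputs: logistic function of (w . x + b)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    # The summation over w_i * x_i is the dot product of the two vectors.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and inputs, for illustration only.
y2 = sigmoid_neuron([1.0, 2.0], [0.5, -0.25], 0.1)                       # 2 inputs
y5 = sigmoid_neuron([1, 0, 3, 2, 1], [0.1, 0.4, -0.2, 0.3, -0.5], 0.0)   # 5 inputs
print(y2, y5)
```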
Let’s consider the 2-D case, we have the equation as:
Let the value of w1 be 0.2, w2 be -0.2 and b be equal to -8.
The output of the sigmoid function would be equal to 0.5 when the below quantity is 0 as then only the overall denominator would be 2:
Putting in the values of w1, w2 and b:
which is same as
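Written out with the values from this example, the steps above are as follows. This assumes the standard logistic form, whose denominator 1 + e^{-(w1x1 + w2x2 + b)} equals 2 exactly when the exponent is 0:

```latex
w_1 x_1 + w_2 x_2 + b = 0
\;\Longrightarrow\; 0.2\,x_1 - 0.2\,x_2 - 8 = 0
\;\Longrightarrow\; x_1 - x_2 = 40
```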
So, for this 2D case, whenever the difference of the two input values (x1 - x2) is 40, the sum w1x1 + w2x2 + b is 0 and, in effect, the value of y is 0.5. This is how we can go about plotting this function.
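A quick numerical check of this 2D example, using the article's values w1 = 0.2, w2 = -0.2 and b = -8. The specific input points are assumptions picked for illustration.

```python
import math

def sigmoid_2d(x1, x2, w1=0.2, w2=-0.2, b=-8.0):
    """Two-input sigmoid with the parameter values from the example."""
    z = w1 * x1 + w2 * x2 + b
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative points that all satisfy x1 - x2 = 40.
on_line = [(40, 0), (50, 10), (100, 60)]
for x1, x2 in on_line:
    print(x1, x2, round(sigmoid_2d(x1, x2), 6))   # each is approximately 0.5

above = sigmoid_2d(60, 10)   # x1 - x2 = 50, so the output is above 0.5
below = sigmoid_2d(30, 10)   # x1 - x2 = 20, so the output is below 0.5
print(above, below)
```

Every point with x1 - x2 = 40 lies on the 0.5 contour; moving to either side of that line pushes the output towards 1 or towards 0.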
How does the model help when the data is not linearly separable?
Let’s consider the below case where we have two inputs: Salary in LPA and Family Size and based on these inputs, we are going to make a decision whether that person is going to buy a car or not. We are assuming that there is some relation between the inputs(x1 and x2) and the output y. We don’t know the true relation between the input and the output and we are approximating this relation using the sigmoid(logistic) function.
This is a Yes-No decision-making process, and the sigmoid function also gives an output between 0 and 1.
If we plot out all the data:
Red points are the points for which the output is 0 and the green points are the ones for which the output is 1. It is clear that no matter how we draw a line, we would not be able to separate the red points from the green points. If we train a perceptron model on this data, it would not converge, but we could still train it until we are okay with the number of errors it makes (meaning the wrong classification of some points).
And if we plot out the perceptron linear boundary for the above data, we have:
In the above image, in the red region we largely have the red points and in the green region we largely have the green points but, of course, there are errors on both sides.
The important thing to note is that the perceptron does not make any distinction between the two circled points in the below image:
The point in yellow in the above image is way inside the decision boundary. For this point, a person with an annual income of 2.5 Lakhs and a family size of 8, a human decision-maker would be very confident that the person will not buy a car. For the point in pink, we would be less sure whether the person will buy a car or not. The Perceptron decision boundary, however, is very firm: the model is equally confident for both points (in yellow and in pink) that the person is not going to buy a car, even though one is near the boundary, almost sitting on the fence, and the other is way inside it. The Perceptron output is not able to make this distinction because it is either 1 or 0, not a smooth value between 0 and 1.
Now let’s see what would be the scenario if we try to fit it using the Sigmoid function:
We will look at the data and using some learning algorithm and some loss function, we will find the parameters of the model/function.
If we try to fit the data using Sigmoid, we would get the below kind of plot:
And the equivalent 2D plot would look like:
If we look at the circled points in the above image, for a person with an annual income of 2.5 Lakhs and with a family size of 8, the output is close to 0(as the point lies in the dark red region for which the output is 0 or close to 0).
And for the pink circled input point in the below image
the output would be close to 0.3 or 0.4, which means the model is not very confident. The output is not clearly 1 and not clearly 0, but it is on the lower side: there is maybe a 30% chance that this person might buy a car. That is the interesting thing about the Sigmoid function: its output lies between 0 and 1, and another quantity we care about, probability, also lies between 0 and 1. So, we can interpret the output of the Sigmoid Neuron as a probability. When the output is 0, we can say there is a 0% chance of this person buying a car, when the output is 1, we can say there is a 100% chance of this person buying a car, and so on.
So, now we have a nice way of interpreting the output. Rather than being rigid and saying 0 here and 1 there, we can also account for the fence-sitters: we can say that a person is leaning towards the positive side but is not completely at 1. This is how we can interpret the output of the sigmoid function.
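The contrast between the two models can be sketched as follows. The two values of z = w·x + b below are hypothetical: they are not fitted to the car-purchase data and only stand in for a point far inside the negative region and a point near the boundary.

```python
import math

def perceptron_output(z):
    """Hard threshold: the output is either 0 or 1."""
    return 1 if z >= 0 else 0

def sigmoid_output(z):
    """Graded output between 0 and 1, readable as a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activation values, for illustration only.
z_far = -5.0    # stands in for the point way inside the boundary
z_near = -0.6   # stands in for the fence-sitter near the boundary

for name, z in [("far", z_far), ("near", z_near)]:
    print(name, perceptron_output(z), round(sigmoid_output(z), 3))
```

Both points get the same perceptron output of 0, while the sigmoid output is close to 0 for the far point and roughly 0.35 for the near one.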
We are still not able to separate the green points from the red points. The non-linearity that we have introduced is giving us a graded output which allows a better interpretation to evaluate it in terms of probability.
Now, as we keep changing the values of the parameters w, b, we will get different types of sigmoid function for example:
We will get different sigmoid plots for different values of the parameters, but none of them would be able to separate the green points from the red points.
How does the function change with the change in w and b?
Let’s consider this for only one input, in that case we have the equation as:
where the parameters w and b are scalar values and x represents the input.
If we take w as -0.3 and b as 0, we have the plot as below:
As w is negative, the slope of the sigmoid function is also negative: as we increase the value of x, the value of the output decreases. This is what the negative slope means.
As we make the slope more and more negative, the curve becomes sharper. That is what a high negative slope means: even if we change the value of x slightly, the value of the output drops drastically:
And now if we make the value of w positive, the slope is going to be positive and the smaller the slope the less drastic is the change in the value of output.
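The effect of w can be checked numerically with a small sketch. The helper name `change_around_zero` and the list of w values are assumptions made for illustration; b is held at 0 as in the example above.

```python
import math

def sigmoid(x, w, b=0.0):
    """One-input sigmoid with weight w and bias b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def change_around_zero(w, dx=1.0):
    """How much the output changes when x moves from -dx to +dx."""
    return sigmoid(dx, w) - sigmoid(-dx, w)

# Hypothetical weights: strongly negative, mildly negative, mildly positive, strongly positive.
for w in [-3.0, -0.3, 0.3, 3.0]:
    print(f"w = {w:>4}: change in output = {change_around_zero(w):+.4f}")
```

The sign of the change follows the sign of w, and a larger magnitude of w gives a larger change over the same range of x, which is the sharper curve described above.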
The next thing to show is how the function changes as we change the value of b:
To start with, we have taken the value of b as 4.9. If we keep decreasing the value of b (keeping w constant), the function shifts towards the right.
And there is an explanation for why this happens:
We know that the value of the sigmoid function would be 0.5 when
So, the value of the sigmoid is 0.5 when x is equal to the below:
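Written out, and assuming the standard logistic form y = 1 / (1 + e^{-(wx + b)}) with w not equal to 0, the two conditions above are:

```latex
wx + b = 0 \quad\Longrightarrow\quad x = -\frac{b}{w}
```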
As we keep decreasing b, the negative of b keeps increasing, so the boundary shifts towards the right (assuming w is positive).
The implication of all this is that when we are minimizing some loss function and changing the parameters, we have an idea of how the plot of the function is going to change.
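A small sketch of the shift: for a fixed positive w, the input at which the output equals 0.5 is x = -b/w, so decreasing b moves that point to the right. The value w = 1.0 and the b values after 4.9 are assumptions chosen for illustration.

```python
import math

def sigmoid(x, w, b):
    """One-input sigmoid with weight w and bias b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def midpoint(w, b):
    """Input x at which the sigmoid output equals 0.5 (requires w != 0)."""
    return -b / w

w = 1.0                              # hypothetical fixed positive weight
b_values = [4.9, 2.0, -1.0, -3.0]    # decreasing b, starting from the article's 4.9
midpoints = [midpoint(w, b) for b in b_values]

for b, m in zip(b_values, midpoints):
    print(f"b = {b:>4}: output is 0.5 at x = {m}")
```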
Sigmoid: Data and Tasks
So far we have looked at the MP Neuron and Perceptron models, where the task was Binary Classification (the output could be 0 or 1). We could also use the Sigmoid Neuron for this kind of task, with the difference that instead of 0 or 1 it gives a value between 0 and 1, say 0.7, which we can use to indicate whether the output is closer to class 1 or class 0. We can then pick a threshold based on the task at hand to map the output to a particular class: for example, if the threshold is 0.5, any value of 0.5 or more is mapped to class 1 and any value less than 0.5 is mapped to class 0.
Of course, once we put a threshold it becomes the same as dealing with a Perceptron model, except that now we have more flexibility.
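The thresholding step can be sketched as below. The function name `to_class` and the convention that a value exactly at the threshold goes to class 1 are assumptions; the threshold itself is a choice that depends on the task.

```python
def to_class(probability, threshold=0.5):
    """Map a sigmoid output in [0, 1] to class 1 or class 0."""
    return 1 if probability >= threshold else 0

# Illustrative sigmoid outputs.
outputs = [0.7, 0.3, 0.5]
print([to_class(p) for p in outputs])                  # default threshold of 0.5
print([to_class(p, threshold=0.8) for p in outputs])   # a stricter threshold
```

The extra flexibility is that the same graded output can be mapped to classes with a different threshold without retraining the model.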
We could also use this function for a Regression task where the output lies between 0 and 1.
Data could be a bunch of inputs say ’n’ inputs, true output is some value between 0 to 1.
Sigmoid Loss Function:
We have looked at the 3 jars: Model, Data and Task and we are approximating the relationship between the input and the output using a Sigmoid function.
Now we want to compute the loss given input data, true output and the Sigmoid function:
We will first compute the predicted output as per the Sigmoid function for the given input data (let’s say we have the parameter values, so we are able to compute the predicted output). Once we have the predicted output, we can use the Squared Error Loss function:
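One common way to write the squared error loss for a sigmoid neuron is given below. It assumes a plain sum over N data points, which is an assumption of this sketch: some texts average over N or add a factor of 1/2, and that changes only the scale of the loss.

```latex
\hat{y}_i = \frac{1}{1 + e^{-(w \cdot x_i + b)}}, \qquad
L = \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
```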
In practice, we might have the true output as Binary and in that case, we could still use the Sigmoid function as the approximation between the input and the output and we could still compute the Loss using the squared error loss:
The point of treating the output as a probability, instead of a hard 0 or 1 prediction, is that it helps the model see which data points contribute more to the loss and adjust its parameters accordingly. For example, let’s say the true output is 1 for two points and the predicted outputs are 0.6 and 0.7 for these two data points. As 0.6 is farther from 1 than 0.7 is, the first point contributes more to the loss. This would not have been the case for the Perceptron model, where the predicted output would have been 1 for both points instead of 0.6 and 0.7.
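The example above can be worked through in code. This sketch assumes the squared error is summed over the data points without any scaling factor; the function name `squared_error_loss` is hypothetical.

```python
def squared_error_loss(y_true, y_pred):
    """Sum of squared differences between true and predicted outputs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

y_true = [1, 1]
sigmoid_pred = [0.6, 0.7]     # graded outputs from the example above
perceptron_pred = [1, 1]      # hard outputs: both points predicted as 1

# Per-point contributions: roughly 0.16 for 0.6 and 0.09 for 0.7.
per_point = [(t - p) ** 2 for t, p in zip(y_true, sigmoid_pred)]
print(per_point)
print(squared_error_loss(y_true, sigmoid_pred))      # roughly 0.25
print(squared_error_loss(y_true, perceptron_pred))   # 0
```

With the graded outputs, the point predicted at 0.6 contributes more than the one predicted at 0.7, whereas with the hard outputs both contribute nothing and cannot be told apart.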
We are now left with 2 jars for the Sigmoid Neuron model which are the Learning Algorithm and the Evaluation metrics which are discussed in this article.