Understanding the learning of sigmoid activations in a neural network

In my previous blog, I explained what the individual neurons in a neural network try to learn. We know that neurons learn classification boundaries by optimizing their weights and biases according to the loss function. Now let’s get to the actual question: what happens after a neuron learns the perfect classification boundary (in the case where one exists)? For illustration purposes, let us reuse a simple example from my previous blog.

Dataset

In the figure we have the dataset and the neural network that is trying to classify it. All the learning the neuron has to do is optimize its weights and bias to mimic the red classification line.

Network

Now let’s say the network needs 25 epochs to learn the exact classification boundary. What if we run the model beyond 25 epochs, say for 100? Is the computation spent on the remaining 75 epochs worthless? Where does all of that learning go? One might think that once the accuracy reaches 100% (in an ideal scenario), there is nothing left for the network to do. That seems like a credible argument, since we have reached the global minimum of the loss curve. But it is not correct. Learning still happens in the network, only for a different purpose than optimizing the loss. To understand this, we must look at a few concepts before we jump to the explanation.

What is the neuron in a neural network trying to learn?

As I explained in my previous blog, each neuron has its own objective of classifying its input based on the feature it is trying to learn. It tries to learn a classification boundary through its weights and bias, assigning positive and negative values to points on opposite sides of the line (for a detailed explanation, I recommend reading my previous blog).

In the example shown in the figure, the neuron should learn the line w0*X + w1*Y = 3, where X and Y are the features, the weights w0 and w1 are both 1, and the bias is -3 (so the pre-activation w0*X + w1*Y - 3 is zero exactly on the line). The first question to note from this section: what about the lines 2*w0*X + 2*w1*Y = 6, 3*w0*X + 3*w1*Y = 9, and so on? If all of these equations represent the same line, why did the network choose w0*X + w1*Y = 3 (or did it actually)?
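
To see this concretely, here is a minimal NumPy sketch (the test point and the scaling constants are illustrative assumptions, not values from the post). Scaling the weights and bias by a constant k leaves the sign of the pre-activation, and hence the side of the line a point falls on, unchanged; only the magnitude grows:

```python
import numpy as np

# The boundary X + Y = 3 written as w.x + b = 0, together with the
# same line scaled by k = 2 and k = 3. For a fixed point, the sign
# of w.x + b never changes; only its magnitude does.
point = np.array([4.0, 2.0])  # an arbitrary point above X + Y = 3

for k in [1, 2, 3]:
    w = k * np.array([1.0, 1.0])  # scaled weights k*w0, k*w1
    b = k * -3.0                  # scaled bias
    z = w @ point + b
    print(f"k={k}: w.x + b = {z:+.1f} (same sign, larger magnitude)")
```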

Why to use sigmoid in the first place?

Why do we use the sigmoid activation function at all? If the objective is just to assign binary values to opposite sides of the line, we could have used the step function (as in the perceptron). The answer comes from the backpropagation algorithm: the step function is not differentiable, so we use the sigmoid as an approximation, smooth and differentiable at all points.
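
As a quick illustration, here is a small sketch comparing the two activations (the sample inputs are arbitrary). The step function is flat on both sides of zero, so its gradient carries no signal; the sigmoid has the smooth, nonzero derivative s'(z) = s(z)(1 - s(z)) that backpropagation needs:

```python
import numpy as np

def step(z):
    # Perceptron activation: flat on both sides of 0, so its gradient
    # is 0 wherever it is defined and backpropagation gets no signal.
    return (z >= 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Smooth everywhere: s'(z) = s(z) * (1 - s(z)), never exactly zero.
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])
print("step:        ", step(z))
print("sigmoid:     ", sigmoid(z).round(3))
print("sigmoid grad:", sigmoid_grad(z).round(3))
```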

Sigmoid activation function

The point to note from this section: in the figure shown, each curve corresponds to a different sigmoid function. Which curve does a neuron choose?

Connecting the dots

The different sigmoid curves in the figure come from different weights inside the sigmoid’s argument. We have seen that multiplying the weights and bias by a constant gives different equations for the same classification boundary, and these equations give rise to the different sigmoid curves in the figure. Among these curves, the neuron keeps optimizing its weights so that the sigmoid approximates the step function: as the weights and bias grow proportionately, the activation becomes steeper and approaches a step. Let us analyze this with a simple experiment. For the network and dataset shown previously, the weights learned after different numbers of epochs are shown in the table.
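
Since the original table is not reproduced here, the following is a hypothetical sketch of such an experiment, not the author’s exact setup: a single sigmoid neuron trained by full-batch gradient descent on a toy dataset separable along X + Y = 3 (the data, learning rate, and epoch counts are all assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: points labeled by which side of X + Y = 3
# they fall on, so a single neuron can separate them perfectly.
X = rng.uniform(0, 6, size=(200, 2))
y = (X.sum(axis=1) > 3).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.5

for epoch in range(1, 101):
    p = sigmoid(X @ w + b)
    # Gradient of the mean binary cross-entropy loss
    grad_w = X.T @ (p - y) / len(y)
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b
    acc = ((p > 0.5) == y).mean()
    if epoch in (1, 25, 50, 100):
        print(f"epoch {epoch:3d}: acc={acc:.2f}  w={w.round(2)}  b={b:+.2f}")
```

On a run like this, the accuracy typically saturates within the first few epochs, while w and b keep growing in roughly the fixed ratio 1 : 1 : -3, i.e., the same boundary line with an ever steeper sigmoid.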

Conclusion: Even though the network is at its global minimum, the weights and biases keep updating, and this is where the learning goes after the accuracy saturates.

Suggestions to improve the article are most welcome. I hope this article helps in understanding the learning of neural networks in more depth. If you like reading my articles, please visit my website for more blogs on artificial intelligence, machine learning, natural language processing, and more.