The Differences between Sigmoid and Softmax Activation Function

Original article can be found here (source): Deep Learning on Medium

There are many algorithms available for solving classification problems. Today's topic is Artificial Neural Networks and how to decide whether our classifier should be allowed to give several answers at once or be binary, with only one answer. It all comes down to the Sigmoid and Softmax Activation Functions.

Neurons and Artificial Neural Network

An Artificial Neural Network is a computational model inspired by the human nervous system. It is designed to receive information, process it, and send it onward in the form of an output value.

It consists of connected units called Artificial Neurons, which resemble the neurons in a biological brain. Each connection can transmit a signal to other neurons, just like a synapse in a biological brain. After the signal has been sent, the next neuron receives it, processes it, and transmits it further, until the signal reaches the output.

In the implementation, the signal itself is a real number, and the output of each neuron is computed by applying some non-linear function to its inputs. Connections have weights that are adjusted during learning; these adjustments increase or decrease the strength of the signal at a specific connection. Each neuron may also have a threshold, such that a signal is only sent onward if the aggregated value crosses that threshold. It is essential to mention that these neurons are aggregated into layers that may perform different transformations. Input values travel from the first layer (the input layer) to the last layer (the output layer), possibly crossing multiple hidden layers in between.

Across these layers, neurons can produce a wide range of raw values, depending largely on whether a specific neuron is activated or not. To normalize this range of values, we apply Activation Functions, which keep the whole process numerically well behaved.
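As a minimal sketch of the forward pass described above (the weights, biases, and input here are made-up illustrative values, not a trained network):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# One hidden layer: 4 inputs feeding 3 hidden neurons.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 3))   # connection weights (random for illustration)
b_hidden = np.zeros(3)               # thresholds/biases

x = np.array([0.5, -1.0, 2.0, 0.1])  # example input signal
raw = x @ W_hidden + b_hidden        # weighted sum arriving at each hidden neuron
activated = sigmoid(raw)             # activation normalizes each value into (0, 1)
print(activated)
```

Whatever raw values the weighted sums produce, the activation maps them into a bounded range before they travel to the next layer.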

The Sigmoid Activation Function

The Sigmoid Activation Function is a mathematical function with a recognizable "S"-shaped curve. It is used in logistic regression and in basic neural network implementations. If we want a classifier for a problem with more than one right answer, the Sigmoid function is the right choice: we apply it to each element of the raw output independently. The return value of the logistic Sigmoid always lies between 0 and 1; the related Hyperbolic Tangent covers the range -1 to 1.

There is a wide family of these S-shaped functions. Apart from the Logistic function, the Hyperbolic Tangent has also been used in artificial neurons, and the Logistic function also serves as a Cumulative Distribution Function. The Sigmoid is straightforward and reduces the time required for implementation. On the other hand, there is a significant drawback: its derivative is small everywhere (at most 0.25) and approaches zero for large inputs, which leads to significant information loss when gradients are propagated backwards.

This is how the Sigmoid function looks:

σ(x) = 1 / (1 + e^(-x))

With its first derivative:

σ'(x) = σ(x) · (1 − σ(x))

The more layers our Neural Network has, the more the signal is compressed per layer, and these small derivatives multiply together, which amplifies the effect and causes significant loss of gradient information overall.
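To make the formulas above and the vanishing-derivative drawback concrete, here is a short Python sketch:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25, the largest the derivative ever gets
print(sigmoid_derivative(5.0))  # ~0.0066: gradients vanish for large inputs
```

Since every layer multiplies the gradient by at most 0.25, stacking many Sigmoid layers shrinks the gradient exponentially.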

The Softmax Activation Function

The Softmax, also known as SoftArgMax or the Normalized Exponential Function, is a fascinating activation function that takes a vector of real numbers as input and normalizes it into a probability distribution proportional to the exponentials of the input numbers. Before applying it, some input values could be negative or greater than 1, and they might not sum to 1. After applying Softmax, each element lies in the range 0 to 1, and the elements add up to 1, so they can be interpreted as a probability distribution. For clarification: the larger the input number, the larger its probability will be.

Udacity Deep Learning Slide on Softmax

Softmax is often used in:

  • Artificial and Convolutional Neural Networks — the idea is to map the non-normalized output to a probability distribution over the output classes. It is used in the final layer of neural-network-based classifiers, which are trained under a log-loss or cross-entropy regime. The result is a non-linear variant of multinomial logistic regression (Softmax Regression).
  • Other Multiclass Classification Methods such as Multiclass Linear Discriminant Analysis, Naive Bayes Classifiers, etc.
  • Reinforcement Learning — Softmax function can be used to convert values into action probabilities.

Softmax is used for multiclass classification in the Logistic Regression model, whereas Sigmoid is used for binary classification in the Logistic Regression model.

This is how the Softmax function looks:

softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

This is similar to the Sigmoid function. The difference is that, in the denominator, we sum together the exponentials of all the values. To explain this further: when calculating the value of Softmax for a single raw output, we can't look at that element alone; we have to take all the output values into account.
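A sketch of Softmax in Python; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result, since the shift multiplies numerator and denominator equally:

```python
import numpy as np

def softmax(x):
    # Shift by the max so np.exp never overflows on large inputs;
    # softmax is invariant to adding a constant to every input.
    e = np.exp(x - np.max(x))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p)          # larger inputs get larger probabilities
print(p.sum())    # 1.0 by construction
```

Without the shift, inputs like 1000 would overflow `np.exp`; with it, the same probabilities come out for any constant offset.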

This is the main reason why the Softmax is cool: it makes sure that the sum of all our output probabilities is equal to one.

For example, if we’re classifying digits and applying a Softmax to our raw outputs, for the Artificial Neural Network to increase the probability that a particular example is classified as “5”, some of the probabilities for the other digits (0, 1, 2, 3, 4, 6, 7, 8 and/or 9) need to decrease.
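This coupling is easy to see numerically; in this small sketch, raising one raw score pulls the other probabilities down while the total stays at one:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([1.0, 1.0, 1.0])
before = softmax(logits)       # equal scores -> each class gets 1/3

logits[0] += 2.0               # raise only the first raw score
after = softmax(logits)

print(before)                  # all three equal to 1/3
print(after)                   # first probability rose, the others fell
print(after.sum())             # still 1.0
```

No matter which raw score we push up, the remaining probabilities must absorb the change so the distribution still sums to one.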

Applying Sigmoid or Softmax

The output layer of the Neural Network classifier is a vector of raw values. Let us say that our raw output values are:

[-0.5, 1.2, -0.1, 2.4].

So, what do these raw output values mean?

The idea is to convert these raw values into an understandable format: probabilities, rather than arbitrary-looking output numbers.

The next step is to convert these raw output values into probabilities using some of the Activation Functions, either Sigmoid or a Softmax Activation Function.

As you see, the Sigmoid and Softmax Activation Functions produce different results.

Sigmoid input values: -0.5, 1.2, -0.1, 2.4

Sigmoid output values: 0.37, 0.77, 0.48, 0.91

Softmax input values: -0.5, 1.2, -0.1, 2.4

Softmax output values: 0.04, 0.21, 0.05, 0.70
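The two sets of outputs above can be reproduced with a few lines of NumPy (the printed values match the table to within rounding):

```python
import numpy as np

raw = np.array([-0.5, 1.2, -0.1, 2.4])

sig = 1.0 / (1.0 + np.exp(-raw))        # each element squashed independently
soft = np.exp(raw) / np.exp(raw).sum()  # coupled through the shared denominator

print(sig)          # roughly 0.38, 0.77, 0.48, 0.92
print(sig.sum())    # about 2.54 -- not constrained to sum to 1
print(soft)         # roughly 0.04, 0.21, 0.06, 0.70
print(soft.sum())   # 1 up to floating-point error
```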

The key takeaway from this example is:

  • Sigmoid: the probabilities produced by a Sigmoid are independent, and they are not constrained to sum to one: 0.37 + 0.77 + 0.48 + 0.91 = 2.53. This is because the Sigmoid looks at each raw output value separately.
  • Softmax: the outputs are interrelated. The Softmax probabilities always sum to one by design: 0.04 + 0.21 + 0.05 + 0.70 = 1.00. So if we want to increase the likelihood of one class, the others have to decrease by the same total amount.


Characteristics of a Sigmoid Activation Function

  • Used for Binary Classification in the Logistic Regression model
  • The sum of the probabilities does not need to be 1
  • Used as an Activation Function while building a Neural Network

Characteristics of a Softmax Activation Function

  • Used for multiclass classification in the Logistic Regression model
  • The probabilities sum will be 1
  • Typically used in the output layer of a Neural Network classifier