Softmax Regression

Source: Deep Learning on Medium

Multi-class classification using a softmax layer and training a neural network

What is Softmax?

In a binary classification problem there are two labels, 1 or 0. For example, we might be trying to predict whether an image shows a cat or not. If you have solved this kind of problem, you will know that a logistic regression classifier can handle it. Even with neural networks (which I have been writing about a lot lately), you will remember that the last layer uses a “sigmoid” activation function, which gives the probability that the label is 1.

Now, say you have to predict whether the image shows a cat, a dog, baby chicks, or none of these. The moment we have more than two classes, we need a way to generalize logistic regression so that it works for a multi-class problem.

There is a generalization of logistic regression called Softmax Regression that lets you predict one of several classes.

Mathematical Formulation of Softmax

Let’s see how the generalization and its mathematical formulation work. We stay with our example, where we are trying to predict whether an image shows a cat, a dog, baby chicks, or none of these.

Here we will use a neural network whose last layer has 4 nodes, since we have c = 4 classes. In general, the last layer for a multi-class problem has n[L] = c nodes, where “c” is the number of classes.

To build this network we need a softmax layer as the output layer. Its outputs are conditional probabilities, one per class given the input, so they add up to 1.
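Written out, the softmax activation is commonly defined as below. The symbol a[L] for the output of the last layer is notation assumed here to match the n[L] and “z” used in this article; the definition itself is the standard one.

```latex
a^{[L]}_i = \frac{e^{z^{[L]}_i}}{\sum_{j=1}^{c} e^{z^{[L]}_j}},
\qquad i = 1, \dots, c,
\qquad \sum_{i=1}^{c} a^{[L]}_i = 1
```

Each output is positive because of the exponential, and dividing by the sum of all exponentials is what makes the c outputs add up to 1.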

Let’s pick a concrete example. Suppose that in our problem the final layer computes the “z” vector [5, 2, -1, 3]. Its corresponding softmax values are [0.842, 0.042, 0.002, 0.114]. Note that the order from highest to lowest is retained when you move from the “z” vector to its softmax representation.
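The numbers in this example can be reproduced with a few lines of NumPy. This is a minimal sketch, not code from the original article; the function name softmax and the max-subtraction trick for numerical stability are my own additions.

```python
import numpy as np

def softmax(z):
    """Map a vector of raw scores z to probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

z = np.array([5.0, 2.0, -1.0, 3.0])
a = softmax(z)
print(np.round(a, 3))  # [0.842 0.042 0.002 0.114]
```

The largest entry of “z” gets the largest probability, and the ranking of the other entries is unchanged.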

A “hardmax”, by contrast, would take the values of “z” and map them to the vector [1, 0, 0, 0], with a 1 for the largest element and 0 everywhere else. Softmax is “soft” in the sense that it normalizes the values into probabilities and so maps to the output classes more gently, which is useful because the output still shows how confident the network is in each class.
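For comparison, here is a small sketch of what a hardmax could look like in NumPy. The function name hardmax is hypothetical; it simply puts a 1 at the position of the largest score.

```python
import numpy as np

def hardmax(z):
    """Return a one-hot vector with a 1 at the position of the largest score."""
    z = np.asarray(z, dtype=float)
    one_hot = np.zeros_like(z)
    one_hot[np.argmax(z)] = 1.0
    return one_hot

z = np.array([5.0, 2.0, -1.0, 3.0])
h = hardmax(z)
print(h)  # [1. 0. 0. 0.]
```

Unlike softmax, this output throws away how close the other scores were to the winner.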

Training a Softmax Classifier using a Neural Network

Given the “softmax” output layer, how do you train the neural network? The basic idea is the same as before: define a loss function, then search for a minimum of it (ideally the global minimum) so that the parameters “w” and “b” are optimized.
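The article does not name a specific loss function. A common choice for a softmax output is the cross-entropy loss, so the sketch below assumes that. The function name cross_entropy, the class order (cat, dog, baby chicks, none) and the small eps guard are assumptions for illustration.

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Cross-entropy loss for one example: -sum_j y_j * log(y_hat_j)."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    eps = 1e-12  # guards against log(0)
    return -np.sum(y * np.log(y_hat + eps))

y_hat = np.array([0.842, 0.042, 0.002, 0.114])  # softmax output from the example
y_cat = np.array([1, 0, 0, 0])  # true class is "cat" (assumed to be index 0)
y_dog = np.array([0, 1, 0, 0])  # true class is "dog" (assumed to be index 1)

loss_if_cat = cross_entropy(y_hat, y_cat)  # roughly 0.17
loss_if_dog = cross_entropy(y_hat, y_dog)  # roughly 3.17
print(loss_if_cat, loss_if_dog)
```

The loss is small when the network puts high probability on the true class and large when it does not, which is the quantity training tries to minimize.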

Once the loss is calculated by forward propagation through the layers of the network, we compute its derivatives with respect to the parameters, a phase known as “backpropagation”, and use gradient descent to update the weights until the algorithm converges to a minimum.
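To make the forward pass, backward pass and update concrete, here is a minimal sketch of softmax regression (a network with no hidden layers) trained by gradient descent in NumPy. The toy data, the learning rate, the number of steps and the cross-entropy loss are all assumptions chosen for illustration, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points in 2-D, 4 classes clustered around 4 corners.
c, n_features, m = 4, 2, 200
centers = np.array([[2, 2], [-2, 2], [-2, -2], [2, -2]], dtype=float)
labels = rng.integers(0, c, size=m)
X = centers[labels] + 0.5 * rng.standard_normal((m, n_features))
Y = np.eye(c)[labels]  # one-hot targets, shape (m, c)

def softmax_rows(Z):
    """Apply softmax to each row of Z."""
    T = np.exp(Z - Z.max(axis=1, keepdims=True))
    return T / T.sum(axis=1, keepdims=True)

W = np.zeros((n_features, c))
b = np.zeros(c)
learning_rate = 0.1
losses = []

for step in range(200):
    # Forward propagation: scores -> probabilities -> average cross-entropy loss.
    A = softmax_rows(X @ W + b)
    losses.append(-np.mean(np.sum(Y * np.log(A + 1e-12), axis=1)))
    # Backward pass: for softmax with cross-entropy, dL/dZ = A - Y.
    dZ = (A - Y) / m
    dW = X.T @ dZ
    db = dZ.sum(axis=0)
    # Gradient descent update of the parameters.
    W -= learning_rate * dW
    b -= learning_rate * db

accuracy = np.mean(np.argmax(softmax_rows(X @ W + b), axis=1) == labels)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}, accuracy: {accuracy:.2f}")
```

With all parameters starting at zero, every class gets probability 0.25, so the first loss value is log(4); it then falls as the weights are updated.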

Computing the derivatives in backprop by hand for a deep neural network is painstaking and error-prone, so deep learning frameworks like TensorFlow and Keras do it for us. All we need to do is get the forward propagation right; the frameworks take care of backprop.


  1. Deep Learning Specialization by Andrew Ng & team.