Softmax Regression


Build a Softmax Regression Model from Scratch in Python!

Figure: MNIST handwritten digits dataset.

In my previous article, we learned about logistic regression, which is used for binary classification. However, in real-world applications there are often more than two classes to be classified, for example in digit classification. In this case, the model is called multinomial logistic regression, also known as softmax regression.

Derivation of Softmax Equation

Consider a classification problem which involves k classes.
Let x be the feature vector and y the corresponding class, where y ∈ {1, 2, … , k}.

Now, we would like to model the probability of y given x, P(y|x), which is a vector of the probabilities of y being each of the classes given the features:
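$$P(y \mid x) = \begin{bmatrix} P(y=1 \mid x) \\ P(y=2 \mid x) \\ \vdots \\ P(y=k \mid x) \end{bmatrix}$$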

Recall that in logistic regression, the log-odds for y=1 with respect to y=0 is assumed to have a linear relationship with the independent variable x:
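$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \theta^{T} x$$

where θ is the weight vector of the logistic regression model.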

Using the same analogy, we assume that the log-odds for y=i with respect to y=k has a linear relationship with the independent variable x:
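$$\log \frac{P(y=i \mid x)}{P(y=k \mid x)} = \theta_i^{T} x \quad \Longrightarrow \quad P(y=i \mid x) = P(y=k \mid x)\, e^{\theta_i^{T} x}$$

where θ_i is the weight vector associated with class i.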

Since the sum of P(y=j|x) for j=1, 2, 3, … , k is equal to 1:
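$$\sum_{j=1}^{k} P(y=j \mid x) = P(y=k \mid x) \sum_{j=1}^{k} e^{\theta_j^{T} x} = 1 \quad \Longrightarrow \quad P(y=k \mid x) = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^{T} x}}$$

(The relation above also holds for j = k, since the log-odds of class k with respect to itself is 0 and therefore $e^{\theta_k^{T} x} = 1$.)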

By substitution:
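$$P(y=i \mid x) = \frac{e^{\theta_i^{T} x}}{\sum_{j=1}^{k} e^{\theta_j^{T} x}}$$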

The derived equation above is known as the softmax function. From the derivation, we can see that the probability of y=i given x can be estimated by the softmax function.
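As a quick illustration, a numerically stable NumPy implementation of the softmax function might look like the sketch below; subtracting the row-wise maximum before exponentiating is a common trick to avoid overflow.

import numpy as np

def softmax(scores):
    # scores: array of shape (m, k), one row of raw class scores per example
    # subtract the row-wise maximum before exponentiating to avoid overflow
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    # normalise each row so that the k probabilities sum to 1
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)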

Summary of the model:

θ_g is the weight vector associated with class g.
The weight matrix stacks these vectors, so each element corresponds to one feature of one class.

Figure: illustration of the softmax regression model.

With the output probability vector, we can classify the input as the class with the highest probability.
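Concretely, assuming an input matrix X of shape (m, n_features) and a weight matrix W of shape (n_features, k) (names and shapes chosen here for illustration), prediction could be sketched as:

def predict(X, W):
    scores = X @ W                   # raw class scores, shape (m, k)
    probs = softmax(scores)          # softmax function defined earlier
    return np.argmax(probs, axis=1)  # most probable class for each example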

Maximum Likelihood Estimation

Before we proceed, let’s introduce the indicator function, which outputs 1 if its argument is true and 0 otherwise.
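In symbols:

$$\mathbb{1}\{\text{statement}\} = \begin{cases} 1 & \text{if the statement is true} \\ 0 & \text{otherwise} \end{cases}$$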

Indicator function.

To get the likelihood on the training data, we need to compute all of the probabilities of y=y⁽ⁱ⁾ given x⁽ⁱ⁾ for i=1, 2, 3, …, m, where m is the total number of training examples.
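Using the indicator function, each of these probabilities can be written compactly as:

$$P\!\left(y^{(i)} \mid x^{(i)}\right) = \prod_{l=1}^{k} P\!\left(y=l \mid x^{(i)}\right)^{\mathbb{1}\{y^{(i)}=l\}}$$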

With the expression of P(y⁽ⁱ⁾|x⁽ⁱ⁾), we can compute the likelihood function L(θ) as follows:
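$$L(\theta) = \prod_{i=1}^{m} P\!\left(y^{(i)} \mid x^{(i)}\right) = \prod_{i=1}^{m} \prod_{l=1}^{k} P\!\left(y=l \mid x^{(i)}\right)^{\mathbb{1}\{y^{(i)}=l\}}$$

assuming the training examples are independent.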

The likelihood function is a measure of the goodness of fit of a model on sample data with respect to the model parameters (θ).

As in every machine learning task, our ultimate goal is to optimize the weights, which is usually done by minimizing a loss function. In this case, however, we have the likelihood function as the measure of the model's performance.

Note: The higher the likelihood the better the model.

So, we need to maximize the likelihood function instead of minimizing it.

As mentioned in the article on logistic regression, the optimization process often involves differentiation, which is much easier to carry out on a sum than on a product. So, we take the natural logarithm of the likelihood function, which gives the log-likelihood function.
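Taking the logarithm turns the products into sums:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \sum_{l=1}^{k} \mathbb{1}\{y^{(i)}=l\} \, \log P\!\left(y=l \mid x^{(i)}\right)$$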

Since max f(x) = −min −f(x), we can instead minimize the negative of the log-likelihood function, which keeps the process consistent with other machine learning models that are optimized by minimizing a loss function.

Simplifying the loss function:
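Substituting the softmax expression for P(y=l|x⁽ⁱ⁾) into J(θ) = −ℓ(θ):

$$J(\theta) = -\sum_{i=1}^{m} \sum_{l=1}^{k} \mathbb{1}\{y^{(i)}=l\} \left[ \theta_l^{T} x^{(i)} - \log \sum_{j=1}^{k} e^{\theta_j^{T} x^{(i)}} \right] = -\sum_{i=1}^{m} \left[ \theta_{y^{(i)}}^{T} x^{(i)} - \log \sum_{j=1}^{k} e^{\theta_j^{T} x^{(i)}} \right]$$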

Note that in the last two steps, the summation term Σ 1(y⁽ⁱ⁾=l) for l=1 to k vanishes because it is equal to 1, as explained below:
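$$\sum_{l=1}^{k} \mathbb{1}\{y^{(i)}=l\} = 1$$

since y⁽ⁱ⁾ belongs to exactly one of the k classes, so exactly one term in the sum is 1 and all others are 0.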

Finally, we have our loss function as the negative of the log-likelihood function. We will use the gradient descent algorithm to optimize the weights by minimizing this loss function.

The partial derivative of the loss function with respect to any element of the weight matrix is:
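$$\frac{\partial J(\theta)}{\partial \theta_{j,d}} = \sum_{i=1}^{m} x_d^{(i)} \left[ P\!\left(y=j \mid x^{(i)}\right) - \mathbb{1}\{y^{(i)}=j\} \right]$$

where θ_{j,d} denotes the weight of feature d for class j (notation chosen here for concreteness). Collecting all features of class j gives the gradient with respect to its weight vector:

$$\frac{\partial J(\theta)}{\partial \theta_{j}} = \sum_{i=1}^{m} x^{(i)} \left[ P\!\left(y=j \mid x^{(i)}\right) - \mathbb{1}\{y^{(i)}=j\} \right]$$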

The update rule for each iteration of gradient descent:
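$$\theta_j := \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}, \quad j = 1, 2, \ldots, k$$

where α is the learning rate.

Putting the pieces together, a minimal training loop might look like the sketch below; the names X, y, W, learning_rate and n_iterations are illustrative choices, labels are taken to be 0-indexed, and the softmax function defined earlier is reused.

import numpy as np

def one_hot(y, k):
    # indicator matrix 1{y_i = l}: shape (m, k), with a single 1 per row
    m = y.shape[0]
    Y = np.zeros((m, k))
    Y[np.arange(m), y] = 1
    return Y

def train_softmax(X, y, k, learning_rate=0.1, n_iterations=1000):
    # X: (m, n_features) features, y: integer labels in {0, ..., k-1}
    m, n_features = X.shape
    W = np.zeros((n_features, k))    # one weight vector (column) per class
    Y = one_hot(y, k)
    for _ in range(n_iterations):
        probs = softmax(X @ W)       # P(y=j | x) for every example and class
        grad = X.T @ (probs - Y)     # gradient of the loss with respect to W
        W -= learning_rate * grad    # gradient descent update
    return W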