Original article was published by Anjali Bhardwaj on Artificial Intelligence on Medium
What is the Softmax Function? — Teenager Explains
A brief explanation of the softmax function: what it is, how it works, and some example code.
The softmax function is an activation function that turns real values into probabilities.
In a normal school year, at this moment, I might have been sitting in a coffee shop two hours away from my house, reviewing my lecture notes before my computer programming class. Or perhaps I might have been in class, trying to keep up with my professor’s explanation of some exact equation. With schools shut down all over the world, students like myself are left to fight their procrastination on their own. A sad struggle, one I am clearly not winning. For the past couple of weeks, despite the overwhelming online workload from my university, I decided it was the best time to learn everything I could about deep learning. Like a lot of other curious people, I took a course on Udacity called Deep Learning with PyTorch (naturally, this article is inspired by concepts in that course).
If you have followed along from my previous article on perceptrons, you already know that a perceptron is a binary classification algorithm that makes its predictions using a linear function. But what if we wanted to classify more than two kinds of data? How can we do this? We can use an activation function called the softmax function. In this post we will discuss what the softmax function is, compare binary classification and multiclass classification, explain how the softmax function works, and provide some example code.
What is a Softmax Function?
A softmax function is a generalization of the logistic function that can be used to classify multiple kinds of data. The softmax function takes in real values of different classes and returns a probability distribution.
Where the standard logistic function is capable of binary classification, the softmax function is able to do multiclass classification.
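To make this concrete: softmax raises e to each score and divides by the sum of those exponentials, so every output is positive and all outputs sum to 1. Here is a minimal sketch in plain Python (the function name and example scores are my own, not from the course):

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    # Subtract the max score before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# Each entry of probs lies between 0 and 1, and the entries sum to 1.
```

Note that the largest score always gets the largest probability, which is exactly what we want from a classifier.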
Let’s look at how binary classification and multiclass classification work.
Binary Classification Models
Let’s say we have a model that predicts whether or not you will get a job, where the probabilities are given as,
So, the probability that you will get a job is p(job) = 0.8 and, consequently, the probability that you do not get the job is p(no job) = 0.2.
The model will take in some inputs, like: What are your final exam grades? How many projects have you built? Based on a linear model, it will return a ‘score’ (an unbounded value), where the higher the score, the more likely the outcome is.
Then the probability of getting a job is simply the sigmoid function of the score. (If you want to know more about how the sigmoid function works, check this video out.) As we know, the sigmoid function converts the score into a probability, where probabilities are bounded between 0 and 1.
This probability is then compared to the probabilities we know (so, p(job) = 0.8 and p(no job) = 0.2).
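The binary case above can be sketched in a few lines. The score value below is made up for illustration; it is chosen so the sigmoid output lands near the p(job) = 0.8 from the example:

```python
import math

def sigmoid(score):
    """Squash an unbounded score into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-score))

# A hypothetical score from the linear model. sigmoid(1.386) is roughly 0.8,
# matching p(job) = 0.8 from the example above.
p_job = sigmoid(1.386)
p_no_job = 1 - p_job  # in the binary case the two probabilities sum to 1
```

Notice that the binary case needs only one number: once we know p(job), p(no job) is fully determined.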
Multiclass Classification Models
Now suppose we are trying to build a model that classifies 3 different items: a pencil, a pen, and an eraser. Let’s say the probability of getting a pencil is p(pencil) = 0.5, the probability of getting a pen is p(pen) = 0.4, and the probability of getting an eraser is p(eraser) = 0.1. Therefore the probabilities look like,
Where the sum of the probabilities is 1.
Now, let’s apply the same steps to this model as we did in binary classification: given some inputs, calculate the linear function and assign a score to each item.
Now that we have the scores, how can we find the probabilities of all three items? This is where the softmax function comes in.
How does the softmax function work?
Given the unbounded scores above, let’s try to convert them into probabilities.
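A tempting first attempt is to divide each score by the sum of all scores, but that breaks down as soon as a score is negative: some “probabilities” would come out negative, and the denominator could even be zero. Exponentiating each score first makes every value positive, and then normalizing works. A sketch with hypothetical scores (the numbers are mine, not from the course):

```python
import math

# Hypothetical linear-model scores for pencil, pen, and eraser.
scores = [2.0, 1.0, -1.0]

# Naive normalization (score / sum of scores) would fail here because of
# the negative score. Exponentiating first keeps every value positive.
exps = [math.exp(s) for s in scores]
total = sum(exps)
probabilities = [e / total for e in exps]
# Each entry lies in (0, 1) and the entries sum to 1,
# with the highest score receiving the highest probability.
```

This is exactly the softmax function written out step by step: exponentiate, then normalize.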