Source: Deep Learning on Medium

# Multi-class Classification with Deep Learning

Hello, today I am going to do multi-class classification using Keras. The activation function that I have chosen is the Softmax function. Let’s talk about why I chose it.

The crucial difference between working with a binary dataset and working with a multiclass dataset is the choice of activation function. For binary classification we used the Sigmoid function, which turns a score into a probability ranging from 0 to 1:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

But now we are introducing more than 2 classes, so this function is no longer an ideal choice. Instead we will use the Softmax function. Its equation is below:

$$P(m_k) = \frac{e^{m_k}}{\sum_{j=1}^{n} e^{m_j}}$$

The above equation gives us the probability for the score $m_k$ of class $k$ when there are $n$ classes in the dataset. So basically, for each score, $e^{m_k}$ is divided by the sum of the exponentials of all $n$ scores. This preserves the ordering of the scores (a larger score always gets a larger probability) and also ensures that all the probabilities sum to one, as required. Please note that the scores are the raw outputs of the neural network, one per class, so the score is different for each class.
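To make the Softmax description concrete, here is a minimal NumPy sketch. The score values are made up for illustration, and subtracting the maximum score is a common numerical-stability trick that I am adding here; it does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn a 1-D array of raw class scores into probabilities."""
    scores = np.asarray(scores, dtype=float)
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()

# Hypothetical network outputs for 3 classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

print(probs)        # largest probability goes to the largest score
print(probs.sum())  # sums to 1, up to floating-point rounding
```

The probabilities come out in the same order as the scores, and they add up to one.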

Okay, so now we know why we are using the Softmax function and what its advantages are. Let’s talk about another important factor: Cross Entropy. For the binary dataset, the formula for Cross Entropy was:

$$CE = -\sum_{i} \left[ y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right]$$

Here $y_i$ is the label of point $i$ (1 for “positive”, 0 for “negative”) and $p_i$ is the predicted probability that the point is positive. So we took the Ln of the probability of the points in the “positive” region and added it to the Ln of the probability of the points in the “negative” region. This is done for every single point, and then we take the negative of the sum to obtain the total cross entropy. Why negative? I’ll leave that for you to find out.
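As a quick check of the binary case, here is a small sketch that computes the total cross entropy for two sets of predictions. The labels and probabilities are made-up values, and `binary_cross_entropy` is just an illustrative helper name.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred):
    """Total cross entropy; p_pred is the predicted probability of the positive class."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    # Positive points contribute ln(p), negative points contribute ln(1 - p)
    per_point = y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred)
    return -per_point.sum()

y_true = [1, 1, 0, 0]                  # two positive points, two negative points
confident = [0.9, 0.8, 0.2, 0.1]       # hypothetical predictions that match the labels well
unsure = [0.4, 0.3, 0.7, 0.6]          # hypothetical predictions that mostly miss

good_loss = binary_cross_entropy(y_true, confident)
bad_loss = binary_cross_entropy(y_true, unsure)
print(good_loss, bad_loss)  # the better predictions give the smaller value
```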

For a multiclass dataset, let’s take 3 classes in our case. For each point, the Softmax function gives 3 output probabilities, but the true label assigns a probability of 1 to exactly one class. For example, say there are 3 fruit classes: A, B and C. When we feed an A into our model, it outputs 3 probabilities: the probability that the input is A, the probability that it is B and the probability that it is C. Since the input is actually A, we select the probability of A and take Ln(prob A). Similarly we take Ln(prob B) when the input is B and Ln(prob C) when the input is C. Then we add all three and take the negative. Why negative? 🙂

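Please note that if the labels are already One Hot Encoded, you can instead multiply the Ln of every class probability by its one-hot value and add up the terms for all 3 classes. The result will be the same, because the terms for the wrong classes are multiplied by 0.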
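The equivalence between picking the true-class probability and summing over one-hot labels can be checked in a few lines. The probability table below is invented for illustration; each row stands for the Softmax output of one sample.

```python
import numpy as np

# Hypothetical Softmax outputs for three samples whose true classes are A, B and C
probs = np.array([
    [0.7, 0.2, 0.1],   # input was A (class 0)
    [0.1, 0.8, 0.1],   # input was B (class 1)
    [0.2, 0.3, 0.5],   # input was C (class 2)
])
labels = np.array([0, 1, 2])
one_hot = np.eye(3)[labels]  # one row per sample, a single 1 in the true-class column

# Option 1: pick the probability of the true class, take Ln, sum, negate
picked = -np.log(probs[np.arange(3), labels]).sum()

# Option 2: multiply every Ln(prob) by its one-hot value, sum everything, negate
summed = -(one_hot * np.log(probs)).sum()

print(picked, summed)  # both give the same cross entropy
```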

Enough of the math, let’s jump to the code:

import numpy as np

import keras

from sklearn import datasets

import matplotlib.pyplot as plt

from keras.models import Sequential

from keras.layers import Dense

from keras.optimizers import Adam

from keras.utils import to_categorical

In the code above, we import datasets from sklearn. I also imported the Sequential model to define my neural network. I imported the “Dense” layer because it connects every node in the preceding layer to every node in the next layer. I also imported the Adam optimizer. The “to_categorical” function is used for One Hot Encoding.

n= 1000

center= [[0,0],[1,1],[1,-1]]

x,y= datasets.make_blobs(n_samples=n,centers=center, cluster_std=.4)

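I specified the centers explicitly, but this is completely optional; by default make_blobs generates 3 centers. Here I am using “make_blobs” to generate clusters of points. The cluster_std parameter controls the spread of each cluster: a smaller value gives denser clusters and a larger value gives wider ones.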
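To get a feel for what the centers and cluster_std arguments do, here is a rough NumPy-only sketch of the idea. This is not scikit-learn's actual implementation, and `make_blobs_sketch` is a hypothetical helper name; it simply places Gaussian noise around each center.

```python
import numpy as np

def make_blobs_sketch(n_samples, centers, cluster_std, seed=0):
    """Rough stand-in for the idea behind make_blobs (not sklearn's code)."""
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers, dtype=float)
    # Spread the samples over the clusters in (near-)equal numbers
    labels = np.arange(n_samples) % len(centers)
    # Each point is its cluster center plus isotropic Gaussian noise
    noise = rng.normal(scale=cluster_std, size=(n_samples, centers.shape[1]))
    return centers[labels] + noise, labels

center = [[0, 0], [1, 1], [1, -1]]
pts_tight, lab = make_blobs_sketch(1000, center, cluster_std=0.1)
pts_wide, _ = make_blobs_sketch(1000, center, cluster_std=0.4)

print(pts_tight.shape, lab.shape)      # (1000, 2) (1000,)
print(pts_tight[lab == 1].std(axis=0))  # small spread around [1, 1]
print(pts_wide[lab == 1].std(axis=0))   # larger spread around [1, 1]
```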

Let’s plot our dataset:

plt.scatter(x[y==0,0], x[y==0,1])

plt.scatter(x[y==1,0], x[y==1,1])

plt.scatter(x[y==2,0], x[y==2,1])

ycat= to_categorical(y,3)

The above function takes 2 arguments: the first is the array of integer labels to be One Hot Encoded, and the second is the number of classes in our dataset.
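If you want to see what One Hot Encoding produces without running Keras, the same kind of array can be built with plain NumPy. This is a sketch of the idea rather than what to_categorical does internally, and the label values are made up.

```python
import numpy as np

y_demo = np.array([0, 2, 1, 2])   # hypothetical integer labels for 4 points
ycat_demo = np.eye(3)[y_demo]     # row k of the 3x3 identity matrix encodes class k

print(ycat_demo)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]

# argmax over each row recovers the original integer labels
recovered = ycat_demo.argmax(axis=1)
print(recovered)
```

Each row has a single 1 in the column of its class, which is the format categorical_crossentropy expects for the labels.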

model= Sequential()

model.add(Dense(activation='softmax', units=3, input_shape=(2,)))

model.compile(optimizer=Adam(.1), metrics=['accuracy'], loss='categorical_crossentropy')

hist= model.fit(x,ycat,verbose=1,batch_size=100, epochs=20)

Here I have created a simple neural network with 1 input layer and 1 output layer. In model.add, the activation function is Softmax, units is 3 since we want 3 outputs (one per class), and input_shape is (2,) because each point has 2 coordinates. The model is compiled with the Adam optimizer (learning rate 0.1) and the categorical_crossentropy loss function, and we track the accuracy of the model as a metric. Lastly we train the model using model.fit: verbose=1 displays the progress, the batch size is 100 and the number of epochs is 20. Please note that too few epochs can lead to underfitting, whereas too many epochs can lead to overfitting.
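It can help to see what this single Dense layer computes for a batch of points. Below is a NumPy sketch of the forward pass under the assumption of randomly chosen weights; the real weights are whatever Keras learns during model.fit, and `dense_softmax` is just an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # hypothetical weights: 2 inputs -> 3 units
b = np.zeros(3)               # one bias per unit

def dense_softmax(x, W, b):
    """Forward pass of a Dense(3, activation='softmax') layer on a batch x."""
    scores = x @ W + b                                    # one score per class
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)

batch = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, -1.0]])   # the three blob centers
layer_out = dense_softmax(batch, W, b)

print(layer_out.shape)        # (3, 3): one row of class probabilities per point
print(layer_out.sum(axis=1))  # every row sums to 1
print(W.size + b.size)        # 9 trainable parameters in total
```

With 2 inputs and 3 units the layer has 2 × 3 weights plus 3 biases, which is why the model is so small.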

def plot_multiclass_decision_boundary(x, y, model):

    # Span a grid slightly larger than the data in both dimensions

    x_span = np.linspace(min(x[:,0]) - 1, max(x[:,0]) + 1)

    y_span = np.linspace(min(x[:,1]) - 1, max(x[:,1]) + 1)

    xx, yy = np.meshgrid(x_span, y_span)

    grid = np.c_[xx.ravel(), yy.ravel()]

    # predict_classes was removed from recent Keras releases; argmax gives the same class indices

    pred_func = np.argmax(model.predict(grid), axis=1)

    z = pred_func.reshape(xx.shape)

    plt.contourf(xx, yy, z)

The code above is used to plot the decision boundaries. We build a grid of points covering the data, feed the entire grid to model.predict, and take the argmax of the predicted probabilities at each grid point, which gives an array of predicted classes (older Keras versions offered model.predict_classes for this, but it has since been removed). Finally we draw filled contours of the predictions using the contourf function, which shows the distinct class zones.
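The grid construction is the least obvious part, so here is a tiny version of it with a stand-in classifier instead of the trained model. The rule "class 1 where the first coordinate is positive" is purely hypothetical; the point is the shapes. Note that np.linspace defaults to 50 points, so the real function evaluates a 50 × 50 grid.

```python
import numpy as np

x_span = np.linspace(-1, 1, 4)   # 4 values along the horizontal axis
y_span = np.linspace(0, 2, 3)    # 3 values along the vertical axis

xx, yy = np.meshgrid(x_span, y_span)
grid = np.c_[xx.ravel(), yy.ravel()]   # one (x, y) row per grid point

print(xx.shape)    # (3, 4)
print(grid.shape)  # (12, 2)

# Stand-in for the trained model (hypothetical rule): class 1 where x > 0, else class 0
fake_pred = (grid[:, 0] > 0).astype(int)

# Reshape the flat predictions back to the grid layout that contourf expects
z = fake_pred.reshape(xx.shape)
print(z)
```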

plot_multiclass_decision_boundary(x, ycat, model)

plt.scatter(x[y==0, 0], x[y==0, 1])

plt.scatter(x[y==1, 0], x[y==1, 1])

plt.scatter(x[y==2, 0], x[y==2, 1])

And the output should look somewhat like the plot below:

For those who are curious about the accuracy:

And I think this completes multi-class classification using a neural network.