Sign Language Classifier using CNN

Source: Deep Learning on Medium

Implementation of a Convolutional Neural Network on the MNIST sign language dataset.

1. The dataset

The MNIST sign language dataset is available on Kaggle (https://www.kaggle.com/datamunge/sign-language-mnist).

It is composed of 27,455 training instances and 7,172 testing instances.

Each instance is a 28×28 pixel grayscale image, flattened into an array of 784 values (each between 0 and 255) so that each image can be read as a 1-D array. Here is a picture of all the signs with their corresponding letters:

Signs and their corresponding letters

2. Frame the problem

The objective of this project is to label an image of a sign with its corresponding letter.

This type of problem is a supervised problem because every instance has an associated label. It is also a classification task because the label values are discrete (letters from A to Z, excluding J and Z because they require motion).

This dataset was created to be hard to solve with standard machine learning techniques, because the images have complex shapes and contours.

We will be using a Convolutional Neural Network to solve this problem because it performs well at finding common patterns in complex images.

3. Data exploration

In order to load the data we will be using Pandas.

Loading .csv files using Pandas
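The loading step might look like the sketch below. It assumes the Kaggle CSV layout (a `label` column followed by one column per pixel). The helper name `load_sign_mnist` is hypothetical, and the tiny in-memory CSV is only a stand-in so the snippet runs without the download.

```python
import io

import pandas as pd


def load_sign_mnist(source):
    """Read a Sign Language MNIST style CSV: a `label` column plus pixel columns."""
    return pd.read_csv(source)


# Stand-in for the real file: 2 rows and 4 pixel columns instead of 784.
# With the Kaggle download you would pass a file path instead,
# for example "sign_mnist_train.csv".
fake_csv = io.StringIO(
    "label,pixel1,pixel2,pixel3,pixel4\n"
    "3,107,118,127,134\n"
    "6,155,157,156,156\n"
)

train_df = load_sign_mnist(fake_csv)
print(train_df.shape)  # (number of rows, 1 label column + pixel columns)
print(train_df.head())
```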

We can use the built-in Seaborn function countplot() to show the number of observations in each category.

Class distribution using Seaborn countplot function
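A minimal sketch of such a plot, using synthetic labels in place of the real `label` column. The labels in the dataset run from 0 to 25 with no instances of 9 (J) or 25 (Z); the sample size and random seed here are arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the label column of the training DataFrame.
rng = np.random.default_rng(0)
valid_labels = [i for i in range(25) if i != 9]  # 24 classes, J (9) and Z (25) absent
df = pd.DataFrame({"label": rng.choice(valid_labels, size=1000)})

fig, ax = plt.subplots(figsize=(10, 4))
sns.countplot(x="label", data=df, ax=ax)  # draws one bar per class
ax.set_title("Number of instances per class")

# The same information as numbers, handy for checking the balance precisely.
class_counts = df["label"].value_counts().sort_index()
print(class_counts)
```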

The bar plot shows us that the classes are roughly evenly distributed, so we do not have to resample the dataset.

4. Prepare the data

4.1 Converting DataFrame into Numpy arrays

We will be converting our DataFrame records into Numpy arrays. That way we will have one 1-D array per image instead of a DataFrame record with 784 columns.

Converting DataFrame records into Numpy arrays
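One way to do the conversion, sketched on a small stand-in DataFrame (3 images of 4 pixels each, whereas the real files have one column per pixel of the image). The variable names are assumptions.

```python
import pandas as pd

# Stand-in DataFrame with the same layout as the CSV files:
# a `label` column followed by the pixel columns.
df = pd.DataFrame(
    {
        "label": [0, 3, 6],
        "pixel1": [107, 155, 187],
        "pixel2": [118, 157, 188],
        "pixel3": [127, 156, 188],
        "pixel4": [134, 156, 187],
    }
)

y = df["label"].to_numpy()                 # 1-D array of labels
X = df.drop(columns=["label"]).to_numpy()  # 2-D array, one row (1-D array) per image

print(X.shape, y.shape)  # (3, 4) (3,)
print(X[0])              # the first image as a 1-D array
```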

4.2 Creating a validation dataset

We will split the training dataset into two datasets, one for training and the other for validation. We will use 80% of the dataset for training and 20% for validation. Scikit-Learn's built-in function train_test_split does that for us:

Call of the train_test_split function
Array shape after splitting
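A sketch of the split on synthetic arrays. The `random_state` value and the `stratify` argument are assumptions added here for reproducibility and to keep the class proportions equal in both subsets; the article only specifies the 80/20 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 "images" of 4 pixels, 5 classes with 20 instances each.
X = np.arange(100 * 4).reshape(100, 4)
y = np.arange(100) % 5

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,    # 20% of the instances go to the validation set
    random_state=42,  # assumption: fixed seed so the split is reproducible
    stratify=y,       # assumption: keep the class proportions in both subsets
)

print(X_train.shape, X_val.shape)  # (80, 4) (20, 4)
print(y_train.shape, y_val.shape)  # (80,) (20,)
```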

4.3 Normalize the data

In most Machine Learning or Deep Learning projects, normalizing the data is an important step, because inputs on a small, common scale make it easier for the model to compute its weights. Here we just have to divide each value by 255, since pixel values range from 0 to 255.

Normalizing grayscale pixel data
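The division can be sketched as follows, on a tiny array of made-up pixel values. Casting to `float32` first is an assumption; it avoids integer division issues and matches the dtype Keras typically works with.

```python
import numpy as np

# Two tiny stand-in "images" with pixel values in the 0-255 range.
X = np.array([[0, 51, 255], [102, 204, 153]], dtype=np.uint8)

# Cast to float before dividing so the result is a real number in [0, 1].
X_norm = X.astype("float32") / 255.0

print(X_norm)
print(X_norm.min(), X_norm.max())  # 0.0 1.0
```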

4.4 Reshape the labels

Since we are dealing with categorical data, we reshape the labels using one-hot encoding. Otherwise the model would assume that there is a natural ordering between the categories, which leads to poor performance and unexpected results. One-hot encoding replaces each label with a binary array whose length is the number of possible label values (24). All values are set to 0 except the one that corresponds to the instance's class, which is set to 1. We can use Scikit-Learn's built-in LabelBinarizer class to do that:

Implementation of the LabelBinarizer
Example of a label that is a D
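A minimal sketch of the encoding. It fits the binarizer on the 24 label values that occur in the dataset (0 to 24 without 9, since J has no instances and 25 for Z never appears), then encodes three example labels, the first being a D (label 3).

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# The 24 label values present in the dataset.
all_labels = np.array([i for i in range(25) if i != 9])

binarizer = LabelBinarizer()
binarizer.fit(all_labels)

y = np.array([3, 0, 24])  # D, A, Y
y_onehot = binarizer.transform(y)

print(y_onehot.shape)  # (3, 24)
print(y_onehot[0])     # all zeros except a single 1 at index 3 (the D)

# The encoding is reversible, which is useful for reading predictions back as labels.
print(binarizer.inverse_transform(y_onehot))
```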

4.5 Adding a bias neuron

The bias neuron is used in most Deep Learning algorithms. It feeds a constant value into a layer, and its weight is learned during training just like the weights of the input neurons.

5. Training the model

5.1 Hyperparameters

The hyperparameters are the batch size, the number of classes and the number of epochs.

The batch size is the number of instances that are propagated through the model at once. If we specify 128, the model trains on the first 128 instances, then on the next 128 instances, and so on until the whole dataset has been used. This technique limits memory use. It is common practice to choose a batch size that is a power of 2 for computational reasons.

The number of epochs is the number of times the forward pass and backpropagation are applied to the whole dataset. The more epochs we add, the more computation time our model requires to train.

Finally, the number of classes is the number of different labels that our model needs to predict.

Defining hyperparameters
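A sketch of how these values could be defined, with a small worked calculation of how many weight updates they imply. The batch size and `n_train` are illustrative assumptions, not the values from the original code; the 24 classes and 10 epochs follow the text.

```python
import math

batch_size = 128   # assumption: a power of 2, as suggested in the text
num_classes = 24   # letters A to Z without J and Z
epochs = 10        # number of full passes over the training set

# Hypothetical training set size, only to illustrate the arithmetic.
n_train = 1000

# The last batch of an epoch may be smaller than batch_size, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 8
print(total_updates)    # 80
```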

5.2 Convolutional Layers

We will be training a CNN (Convolutional Neural Network) to solve this problem. CNNs, like other neural networks, are made up of neurons with learnable weights and biases. Each neuron receives several inputs, takes a weighted sum over them, passes it through an activation function and responds with an output.

Convolution is the process of sliding a filter over a multi-dimensional input (such as an image) in order to extract local features. We specify the size of the filter, and it goes through each possible position it can take on the image. At each position it outputs the dot product of the filter and the area of the picture it covers. The output of one convolutional layer is a feature map that is usually slightly smaller than the input image.

One convolution applied to an image
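The sliding dot product described above can be sketched in plain NumPy. This is an illustration with no padding and a stride of 1, not how Keras implements it internally; the function name `conv2d_valid` and the example filter are assumptions. Like CNN layers, it does not flip the filter.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and return the dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # area currently covered by the filter
            out[i, j] = np.sum(patch * kernel)  # dot product of patch and filter
    return out


image = np.arange(16, dtype=float).reshape(4, 4)  # a 4x4 "image" with values 0..15
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])      # a 2x2 diagonal-difference filter

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3): smaller than the 4x4 input
print(feature_map)        # every entry is -5.0 for this image and filter
```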

When we stack up multiple convolutional layers, we end up with the high-level features of an image. For example, here is what a multi-layer convolutional neural network can do on a picture of a car:

Convolutional layers output

In the high-level features we can see recognizable parts of a car, such as the rims or the lights. The classifier can then detect whether a picture shows a car based on the high-level features it finds during convolution.

5.3 Pooling Layers

A pooling layer is used to reduce the size of the convolution output. This reduces the number of parameters to optimize in the following layers, and therefore the computation time. We will be using Max Pooling, which is the most common pooling layer.

Example of a Max Pooling layer

Max Pooling is implemented with a filter that goes through the image just like the convolution filter. The main difference is that it outputs the highest value in the current window, which reduces the dimensions of the image.
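As an illustration, max pooling can be sketched in NumPy. The function name `max_pool2d`, the 2×2 window and the stride of 2 are assumptions chosen because they are common defaults.

```python
import numpy as np


def max_pool2d(feature_map, size=2, stride=2):
    """Keep the highest value of each `size` x `size` window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out


feature_map = np.array(
    [
        [1, 3, 2, 4],
        [5, 6, 7, 8],
        [3, 2, 1, 0],
        [1, 2, 3, 4],
    ]
)

pooled = max_pool2d(feature_map)
print(pooled)  # [[6 8]
               #  [3 4]]
```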

The final layer is a fully connected layer, which links the last hidden layer to the label classes.

5.4 Implementation of the model definition & training

Here is my implementation of a CNN using Keras :

Model definition using Keras
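The original screenshot is not available, so the following is only a sketch of what such a model could look like, not the exact architecture used here. The number of filters, kernel sizes, dense layer width and the helper name `build_model` are all assumptions. It expects images reshaped to 28×28×1.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model(input_shape=(28, 28, 1), num_classes=24):
    """Small CNN: two convolution + max pooling stages, then fully connected layers."""
    return keras.Sequential(
        [
            keras.Input(shape=input_shape),
            layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Flatten(),                                 # feature maps to a 1-D vector
            layers.Dense(128, activation="relu"),             # fully connected hidden layer
            layers.Dense(num_classes, activation="softmax"),  # one probability per class
        ]
    )


model = build_model()
model.summary()
```

The layer sizes are the kind of values that can be tuned, as discussed in this section.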

This code is largely self-explanatory. The parameters that can be tuned are the number of neurons in each layer, the size of the filters, the type of pooling and the number of hidden layers.

We will be using the Adam optimizer to minimize our cost function. It is very popular in deep learning because it performs well on most tasks, including computer vision problems. Here is my implementation of the optimizer and cost function using Keras:

Definition of our optimizer and cost function
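A sketch of the compile step on a small stand-in model. Categorical cross-entropy is assumed as the cost function because it is the usual choice for one-hot encoded labels; the text does not name the one actually used. Adam is left at its default learning rate.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Small stand-in model so the snippet runs on its own.
model = keras.Sequential(
    [
        keras.Input(shape=(28, 28, 1)),
        layers.Flatten(),
        layers.Dense(24, activation="softmax"),
    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),  # the optimizer updates the weights
    loss="categorical_crossentropy",    # assumption: the cost function being minimized
    metrics=["accuracy"],               # reported during training and evaluation
)
```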

Once the model is defined, we just have to train it on the data we preprocessed earlier. We need to pass the hyperparameters we defined; note that these hyperparameters can also be tuned. We will use the validation split built into Keras, which holds out part of the training set and evaluates the model on it at the end of each epoch. We specify 10% as the size of the validation split.

Training call & results
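A sketch of the training call with a 10% validation split. It runs on random stand-in data with a deliberately tiny model, so the numbers it prints say nothing about the real dataset; the batch size, epoch count and layer sizes are assumptions chosen to keep it fast.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Random stand-in data shaped like the preprocessed images and one-hot labels.
rng = np.random.default_rng(0)
X = rng.random((200, 28, 28, 1)).astype("float32")
y = keras.utils.to_categorical(rng.integers(0, 24, size=200), num_classes=24)

model = keras.Sequential(
    [
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(8, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(24, activation="softmax"),
    ]
)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

history = model.fit(
    X,
    y,
    batch_size=32,
    epochs=2,
    validation_split=0.1,  # the last 10% of the data is held out for validation
    verbose=0,
)

print(list(history.history.keys()))  # training and validation metrics, one value per epoch
```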

6. Validate the model

6.1 Accuracy score on the testing dataset

After training, we can see that our model performed very well on the validation set. We can now use the model to predict the labels of the whole testing dataset. Then we will compute the accuracy score between the predicted labels and the true labels to see how well it performs. Scikit-Learn has a built-in function for that.

Computing accuracy score
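A minimal sketch of the scoring step. The `probs` array is a hand-written stand-in for the class probabilities a model would return for the test set, using 3 classes instead of 24 to keep it readable.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Stand-in for the predicted probabilities: 4 instances, 3 classes.
probs = np.array(
    [
        [0.1, 0.7, 0.2],
        [0.8, 0.1, 0.1],
        [0.2, 0.3, 0.5],
        [0.6, 0.3, 0.1],
    ]
)
# One-hot encoded true labels for the same 4 instances.
y_true_onehot = np.array([[0, 1, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Go back from probabilities / one-hot vectors to class indices.
y_pred = probs.argmax(axis=1)
y_true = y_true_onehot.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.75: three of the four predictions are correct
```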

6.2 Learning curve

We can also plot the learning curve from the accuracy recorded during training. Here is the learning curve for the first 10 epochs:

Learning curve on 10 epochs

This plot is very important because it shows us when our model has stopped improving. If we add more epochs, we get a learning curve like this:

Learning curve on 50 epochs

This second plot shows us that it is useless to train for more than 10 epochs because it does not improve our accuracy.
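Curves like these can be drawn from the per-epoch metrics that Keras records during training. The sketch below assumes they are stored under the keys `accuracy` and `val_accuracy` (older Keras versions use `acc` and `val_acc`); the helper name `plot_learning_curve` is hypothetical, and the numbers are placeholders to exercise the function, not results from this article.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display

import matplotlib.pyplot as plt


def plot_learning_curve(history_dict, train_key="accuracy", val_key="val_accuracy"):
    """Plot training and validation accuracy against the epoch number."""
    epochs = range(1, len(history_dict[train_key]) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, history_dict[train_key], label="training")
    ax.plot(epochs, history_dict[val_key], label="validation")
    ax.set_xlabel("epoch")
    ax.set_ylabel("accuracy")
    ax.legend()
    return ax


# Placeholder values only; with a trained model you would pass `history.history`.
dummy_history = {"accuracy": [0.2, 0.5, 0.7], "val_accuracy": [0.25, 0.45, 0.6]}
ax = plot_learning_curve(dummy_history)
```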

6.3 Confusion Matrix

A good way of knowing where your classifier fails is to plot a confusion matrix. That way you can see which classes the model mixes up.

Confusion matrix on the testing dataset
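A sketch of how the matrix can be computed with Scikit-Learn and searched for the most frequent mix-up. The labels are a made-up toy example with 3 classes, not the predictions of this model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true and predicted class indices.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1])

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Ignore the diagonal (correct predictions) to find the largest confusion.
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_cls, pred_cls = np.unravel_index(errors.argmax(), errors.shape)
print(f"Most frequent mix-up: true class {true_cls} predicted as {pred_cls}")
```

For a plot like the one above, the matrix can be passed to a heatmap function such as Seaborn's `heatmap`.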

This shows us, for example, that our model mixes up the U with the R, which is understandable because these two signs are very similar. To improve our model, we could add more training examples for these classes so that it can find more high-level features that separate the two.

7. Going further

This classifier can be used to type letters on a computer using sign language. On its own it is not very useful, because typing is faster. But the technology behind it is mostly limited by the training data. If the dataset covered not just the alphabet but all the words that can be expressed in sign language, our classifier could convert sign language to text. That implies training on video, because most words require motion, and having far more classes to predict. The dataset would be much larger, and so would the computation power needed to train on it. Multi-resolution processing with a CNN architecture would be one way to approach this problem.