Source: Deep Learning on Medium
Sign Language Classifier using CNN
Implementation of a Convolutional Neural Network on the MNIST sign language dataset.
1. The dataset
The MNIST sign language dataset is available on Kaggle (https://www.kaggle.com/datamunge/sign-language-mnist).
It is composed of 27,455 training instances and 7,172 testing instances.
Each instance is a 28×28 pixel image, flattened into an array of 784 grayscale values so that each image can be read as a 1-D array. Here is a picture of all the signs with their corresponding letters:
2. Frame the problem
The objective of this project is to label an image of a sign with its corresponding letter.
This is a supervised learning problem, because every instance has an associated label. It is also a classification task, because the label values are discrete (from A to Z, excluding J and Z because those signs require motion).
This dataset was designed to be hard to solve with standard machine learning techniques, because the images contain complex shapes and contours.
We will be using a Convolutional Neural Network to solve this problem, because CNNs perform well at finding common patterns in complex images.
3. Data exploration
In order to load the data we will be using Pandas.
We can use the built-in Seaborn function countplot() to show the number of observations in each category.
The bar plot shows us that the dataset is evenly distributed across classes, so we do not have to resample it.
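A minimal sketch of the loading and exploration step. The Kaggle CSV file names below are an assumption, so the runnable part uses a tiny synthetic frame with the same layout (a label column followed by pixel columns):

```python
import pandas as pd

# On Kaggle the files are named like this (assumption; check the dataset page):
# train_df = pd.read_csv("sign_mnist_train.csv")
# test_df = pd.read_csv("sign_mnist_test.csv")

# Tiny synthetic stand-in with the same layout as the real CSV.
train_df = pd.DataFrame({
    "label":  [0, 1, 0, 2, 1, 0],
    "pixel1": [12, 200, 34, 90, 180, 60],
    "pixel2": [45, 10, 99, 120, 30, 5],
})

# Count observations per class; seaborn draws the same counts as bars:
# import seaborn as sns; sns.countplot(x="label", data=train_df)
counts = train_df["label"].value_counts().sort_index()
print(counts.tolist())
```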
4. Prepare the data
4.1 Converting DataFrame into Numpy arrays
We will be converting our DataFrame records into Numpy arrays. That way we will have one 1-D array per image instead of a DataFrame record with 784 columns.
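A sketch of this conversion, using a randomly generated frame as a stand-in for the real CSV. The reshape to 28×28×1 is what the convolutional layers will expect later:

```python
import numpy as np
import pandas as pd

# Stand-in mimicking the Kaggle layout: a 'label' column plus 784 pixel columns.
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.integers(0, 256, size=(10, 785)),
                        columns=["label"] + [f"pixel{i}" for i in range(1, 785)])

y_train = train_df["label"].to_numpy()
X_train = train_df.drop(columns=["label"]).to_numpy()  # one 1-D array of 784 values per image

# The CNN expects 2-D images, so we also reshape to 28x28 with one channel.
X_train = X_train.reshape(-1, 28, 28, 1)
print(X_train.shape, y_train.shape)
```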
4.2 Creating a validation dataset
We will split the training dataset into two datasets, one for training and the other for validation. We will use 80% of the data for training and 20% for validation. Scikit-Learn's built-in function train_test_split does that for us:
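A minimal sketch of the split, on stand-in arrays of the right shape:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100 * 784).reshape(100, 784)  # stand-in for the flattened images
y = np.arange(100) % 24                     # stand-in labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)   # 80% train / 20% validation
print(X_train.shape, X_val.shape)
```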
4.3 Normalize the data
In any Machine Learning or Deep Learning project, it is good practice to normalize the data: it helps the model converge when computing the weights. Since the pixel values range from 0 to 255, we just have to divide each value by 255.
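In code, this is a single division:

```python
import numpy as np

X = np.array([[0, 51, 255], [128, 64, 32]], dtype=np.float32)
X_norm = X / 255.0  # pixel values now lie in [0, 1]
print(X_norm)
```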
4.4 Reshape the labels
Since we are dealing with categorical data, it is important to reshape the labels using One-Hot Encoding. Otherwise the model would assume a natural ordering between categories, which results in poor performance and unexpected results. One-Hot encoding replaces each label with a binary array whose size is the number of possible label values (24). All values are set to 0 except the one that corresponds to the instance's class. We can use Scikit-Learn's built-in LabelBinarizer to do that.
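A sketch with a handful of stand-in labels (four classes instead of 24, so the output is easy to read):

```python
from sklearn.preprocessing import LabelBinarizer

labels = [0, 3, 1, 3, 2]           # small stand-in for the 24 sign classes
lb = LabelBinarizer()
onehot = lb.fit_transform(labels)  # one binary column per distinct class
print(onehot)
```

LabelBinarizer also gives us inverse_transform, which is handy later for turning the model's one-hot predictions back into letters.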
4.5 Adding a bias neuron
The bias neuron is used in most Deep Learning architectures. It adds a constant input whose weight is learned by the model during training, just like the weights of the input neurons.
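Keras layers add a bias automatically (use_bias=True by default), but the idea can be sketched by hand as a column of ones appended to the inputs, for which the model then learns a weight:

```python
import numpy as np

X = np.array([[0.2, 0.7],
              [0.5, 0.1]])
ones = np.ones((X.shape[0], 1))   # constant bias input of 1 for every instance
X_bias = np.hstack([X, ones])     # the model learns a weight for this extra column
print(X_bias)
```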
5. Training the model
The hyperparameters are the batch size, the number of classes and the number of epochs.
The batch size is the number of instances that will be propagated through the model at once. If we specify 128, the model will train on the first 128 instances, then on the next 128, and so on until the whole dataset has been used. This technique is used to limit memory use. It is common practice to choose a batch size that is a power of 2, for computational reasons.
The number of epochs is the number of forward passes and backpropagations applied to the whole dataset. The more epochs we add, the more computation time our model will require to train.
Finally, the number of classes is the number of different labels that our model needs to predict.
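These three hyperparameters can be collected at the top of the script (the values are plausible choices, not necessarily the article's exact ones):

```python
batch_size = 128   # a power of two, to limit memory use per step
num_classes = 24   # letters A-Z, minus J and Z which require motion
epochs = 10        # full passes over the training set
```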
5.2 Convolutional Layers
We will be training a CNN (Convolutional Neural Network) to solve this problem. CNNs, like ordinary neural networks, are made up of neurons with learnable weights and biases. Each neuron receives several inputs, takes a weighted sum over them, passes it through an activation function and responds with an output.
The convolution is the process of applying a filter to a multi-dimensional input (such as an image). We specify the size of the filter, and it slides over every possible position on the image. At each position it outputs the dot product between the filter and the area of the picture it covers. The final output of one convolutional layer is a feature map with reduced dimensions.
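The sliding dot product can be sketched in plain numpy on a tiny 4×4 image and a 2×2 filter (a "valid" convolution, no padding):

```python
import numpy as np

image = np.array([[1, 2, 0, 1],
                  [0, 1, 3, 1],
                  [2, 1, 0, 0],
                  [1, 0, 1, 2]], dtype=float)
kernel = np.array([[1, 0],
                   [0, 1]], dtype=float)  # 2x2 filter

h = image.shape[0] - kernel.shape[0] + 1  # valid positions vertically
w = image.shape[1] - kernel.shape[1] + 1  # valid positions horizontally
out = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        patch = image[i:i + 2, j:j + 2]
        out[i, j] = np.sum(patch * kernel)  # dot product of patch and filter
print(out)
```

Note that the 4×4 input shrinks to a 3×3 feature map, as described above.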
When we stack up multiple convolutional layers we end up with the high-level features an image contains. For example, here is what a multi-layer convolutional neural network can extract from a picture of a car:
In the high-level features we can see recognizable parts of a car, such as the rims or the lights. The classifier can then detect whether a picture is a car or not based on the high-level features it finds during convolution.
5.3 Pooling Layers
A pooling layer is used to reduce the size of the convolution output. This reduces the number of parameters to optimize, and therefore the computation time. We will be using Max Pooling, which is the most common pooling layer.
Max Pooling is implemented with a filter that slides over the image just like a convolution filter. The main difference is that it outputs the highest value within the filter window, which reduces the dimensions of the image.
The final layer is a fully connected layer; it links the final hidden layer to the label classes.
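For example, 2×2 max pooling with stride 2 halves each dimension of a feature map, keeping only the largest value in each window:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [4, 8, 3, 1]], dtype=float)

# Split into non-overlapping 2x2 blocks and take the max of each block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
```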
5.4 Implementation of the model definition & training
Here is my implementation of a CNN using Keras:
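The original code gist is not reproduced here, but a plausible architecture for this task follows the layers described above: two convolution + max-pooling stages, then fully connected layers. This is a sketch, not necessarily the exact network from the article:

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 24  # A-Z minus J and Z

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                     # one 28x28 grayscale image
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),     # one probability per sign
])
model.summary()
```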
This code pretty much reads itself. The parameters that can be tuned are the number of neurons in each layer, the size of the filters, the type of pooling and the number of hidden layers.
We will be using the Adam optimizer, with categorical cross-entropy as our cost function. Adam is very popular in deep learning because it performs well on most tasks, especially computer vision problems. Here is my implementation of the compilation step using Keras:
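A minimal sketch of the compile call (the tiny Dense model below is just a stand-in so the snippet runs on its own; in the project it would be the CNN defined earlier):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in model; in the project this is the CNN defined above.
model = keras.Sequential([layers.Input(shape=(784,)),
                          layers.Dense(24, activation="softmax")])

model.compile(optimizer="adam",                 # Adam optimizer
              loss="categorical_crossentropy",  # cost function for one-hot labels
              metrics=["accuracy"])
```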
Once the model is defined, we just have to train it using the data we preprocessed earlier. We need to specify the hyperparameters we defined; it is important to note that these hyperparameters can also be optimized. We will use Keras's built-in validation split, which measures the model's performance at each epoch by holding out part of the training set for evaluation. We specify 10% as the size of the validation split.
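The training call can be sketched as follows, with small synthetic stand-ins so it runs end to end (in the project, X_train and y_train are the preprocessed images and one-hot labels):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Tiny synthetic stand-ins for the preprocessed data.
X_train = np.random.rand(64, 784).astype("float32")
y_train = keras.utils.to_categorical(np.random.randint(0, 24, 64), 24)

model = keras.Sequential([layers.Input(shape=(784,)),
                          layers.Dense(24, activation="softmax")])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    batch_size=128, epochs=2,
                    validation_split=0.1,  # hold out 10% to evaluate each epoch
                    verbose=0)
```

The returned history object records the training and validation metrics per epoch, which we use below to plot the learning curve.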
6. Validate the model
6.1 Accuracy score on the testing dataset
After training we can see that our model performed very well on the validation set. We can now use the model to predict labels for the whole testing dataset, then compute the accuracy score between the predicted labels and the true labels to see how well it performs. Scikit-Learn has a built-in function for that.
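A sketch with stand-in labels (in the project, y_pred would come from model.predict(X_test).argmax(axis=1) and y_true from the test CSV):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 2, 2, 1])  # stand-in true labels
y_pred = np.array([0, 1, 2, 1, 1])  # stand-in predictions: 4 of 5 correct

acc = accuracy_score(y_true, y_pred)
print(acc)
```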
6.2 Learning curve
We can also plot the learning curve using the training history of our model. Here is the learning curve over the first 10 epochs:
This plot is very important because it lets us know when our model has stopped improving. If we add more epochs, we get a learning curve like this:
This second plot shows that it is useless to compute more than 10 epochs, because doing so does not improve our accuracy.
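The plotting itself is a few matplotlib calls over history.history; the numbers below are illustrative stand-ins for the values model.fit returns:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the script runs without a display
import matplotlib.pyplot as plt

# Stand-in for history.history returned by model.fit.
history = {"accuracy":     [0.62, 0.81, 0.88, 0.93, 0.96],
           "val_accuracy": [0.60, 0.78, 0.85, 0.90, 0.92]}

epochs = range(1, len(history["accuracy"]) + 1)
plt.plot(epochs, history["accuracy"], label="training accuracy")
plt.plot(epochs, history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.savefig("learning_curve.png")
```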
6.3 Confusion Matrix
A good way of knowing where your classifier fails is to plot a Confusion Matrix. That way you can see which classes our model mixes up.
It shows us, for example, that our model mixes up U with R, which makes sense because these two signs are very similar. To improve our model we could add more training examples for these classes, so it can find more high-level features that separate the two.
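Scikit-Learn computes the matrix directly from the true and predicted labels; here on stand-in values with three classes:

```python
from sklearn.metrics import confusion_matrix

# Stand-ins; in the project these are the test labels and model predictions.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

The diagonal holds the correct predictions; any large off-diagonal cell marks a pair of classes the model mixes up.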
7. Going further
This classifier can be used to type letters on a computer using sign language. That is not very useful on its own, because typing is faster; but the technology behind it is mostly limited by the training data. If the dataset covered not just the alphabet but every word in sign language, our classifier could convert sign language to text. That implies training on video, because most words require motion, and having far more classes to predict. The dataset would be much larger, and the computation power needed to train on it would grow accordingly. Multi-resolution processing with a CNN architecture would be one way to tackle this problem.