Original article was published by on AI Magazine
Vision is a key factor in sign language, and every sign language is intended to be understood by one person located in front of the other, from this perspective, a gesture can be completely observable. Viewing a gesture from another perspective makes it difficult or almost impossible to be understood since every finger position and movement will not be observable.
Trying to understand sign language from a first-vision perspective has the same limitations, some gestures will end up looking the same way. But, this ambiguity can be solved by locating more cameras in different positions. In this way, what a camera can’t see, can be perfectly observable by another camera.
The vision system is composed of two cameras: a head-mounted camera and a chest-mounted camera. With these two cameras we obtain two different views of a sign, a top-view, and a bottom-view, that works together to identify signs.
Another benefit of this design is that the user will gain autonomy. Something that is not achieved in classical approaches, in which the user is not the person with disability but a third person that needs to take out a system with a camera and focus a signer while the signer is performing a sign.
To develop the first prototype of this system is was used a dataset of 24 static signs from the Panamanian Manual Alphabet.
To model this problem as an image recognition problem, dynamic gestures such as letter J, Z, RR, and Ñ were discarded because of the extra complexity they add to the solution.
Data collection and preprocessing
To collect the dataset it was asked to four users to wear the vision system and perform every gesture for 10 seconds while both cameras were recording in a 640×480 pixel resolution.
It was requested to the users to perform this process in three different scenarios: indoors, outdoors, and in a green background scenario. For the indoors and outdoors scenarios the users were requested to move around while performing the gestures in order to obtain images with different backgrounds, light sources, and positions. The green background scenario was intended for a data augmentation process, we’ll describe later.
After obtaining the videos, the frames were extracted and reduced to a 125×125 pixel resolution.
Since the preprocessing before going to the convolutional neural networks was simplified to just rescaling, the background will always get passed to the model. In this case, the model needs to be able to recognize a sign despite the different backgrounds it can have.
To improve the generalization capability of the model it was artificially added more images with different backgrounds replacing the green backgrounds. This way it is obtained more data without investing too much time.
During the training, it was also added another data augmentation process consisting of performing some transformations, such as some rotations, changes in light intensity, and rescaling.
These two data augmentation process were chosen to help improve the generalization capability of the model.
Top view and bottom view datasets
This problem was model was a multiclass classification problem with 24 classes, and the problem itself was divided into two smaller multi-class classification problems.
The approach to decide which gestures would be classified whit the top view model and which ones with the bottom view model was to select all the gestures that were too similar from the bottom view perspective as gestures to be classified from the top view model and the rest of gestures were going to be classified by the bottom view model. So basically, the top view model was used to solved ambiguities.
As a result, the dataset was divided into two parts, one for each model as shown in the following table.
DEEP LEARNING MODEL
As state-of-the-art technology, convolutional neural networks was the option chosen for facing this problem. It was trained two models: one model for the top view and one for the bottom view.
The same convolutional neural network architecture was used for both, the top view and the bottom view models, the only difference is the number of output units.
The architecture of the convolutional neural networks is shown in the following figure.
To improve the generalization capability of the models it was used dropout techniques between layers in the fully connected layer to improve model performance.
The models were evaluated in a test set with data corresponding to a normal use of the system in indoors, in other words, in the background it appears a person acting as the observer, similar to the input image in the figure above (Convolutional neural networks architecture). The results are shown below.
Although the model learned to classify some signs, such as Q, R, H; in general, the results are kind of discouraging. It seems that the generalization capability of the models wasn’t too good. However, the model was also tested with real-time data showing the potential of the system.
The bottom view model was tested with real-time video with a green uniform background. I wore the chest-mounted camera capturing video at 5 frames per second while I was running the bottom view model in my laptop and try to fingerspell the word fútbol (my favorite sport in Spanish). The entries for every letter were emulated by a click. The results are shown in the following video.
Note: Due to the model performance, I had to repeat it several times until I ended up with a good demo video.