From zero to Real-Time Hand Keypoints detection in five months with OpenCV, Tensorflow, and Fastai

Source: Deep Learning on Medium

By Rafik Rahoui

In this article, I will show you, step by step, how to build your own real-time hand keypoints detector with OpenCV, TensorFlow and Fastai (Python 3.7). I will focus on the challenges I faced while building it during a fascinating, intensive five-month journey.

You can see the models in action here:

The light green box marks the hand detected in the image. I then crop the image along the box formed by connecting the magenta dots and feed the crop to a CNN for hand keypoints detection.

Motivation:

It all started with an incredible obsession to understand the dynamics at the heart of Artificial Intelligence. Five months ago, I googled “AI vs Machine learning vs Deep learning” in my first attempt to grasp the nuances between the different concepts 😊.

After reviewing multiple videos and articles, I decided to start with computer vision by developing my own hand keypoints detector using a mobile camera.

Knowing that the human brain requires only 20 watts to operate, I aimed, and always will aim, to keep things simple and reduce the computational requirements of any model wherever possible. Complicated models require heavy computation, which is itself highly energy intensive.

A few words about my learning curve:

I have a civil engineering academic background with some visual basic coding skills. I have worked in the field of finance since graduation.

Unusually, I started my journey by learning JavaScript (ex1, ex2). That helped me understand the “general logic” behind code and was certainly useful when I later started learning Python and Django.

Three and a half months into intensive coding, I started the Andrew Ng machine learning course while reading hundreds and hundreds of articles. It was important to understand all the mechanics under the hood by building my own artificial neural network from scratch and coding forward propagation and back-propagation.

The pipeline:

My process for detecting hand keypoints with a camera has the following architecture:

Pipeline for hand keypoints detection

⁃ The image is grabbed by the camera;

⁃ A first deep learning model detects the hand in the image and estimates the coordinates of the box around it (done by retraining a TensorFlow Object Detection API model on hand detection; you could also achieve it by building a custom deep learning model);

⁃ A second deep learning regression model takes the image inside the box and estimates the coordinates of all hand keypoints (achieved by transfer learning from a ResNet34 with a customised head).
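The three steps above can be sketched as one small function. This is only an illustration under stated assumptions: detect_hand and detect_keypoints are hypothetical stand-ins for the two models, the box layout (x_min, y_min, x_max, y_max) is assumed, and a nested list stands in for the camera frame.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels (assumed layout)
Point = Tuple[float, float]       # (x, y) in pixels


def crop(image: Sequence[Sequence], box: Box) -> List[list]:
    """Keep only the rows and columns of the frame that fall inside the box."""
    x0, y0, x1, y1 = box
    return [list(row[x0:x1]) for row in image[y0:y1]]


def run_pipeline(
    image: Sequence[Sequence],
    detect_hand: Callable[[Sequence[Sequence]], Optional[Box]],
    detect_keypoints: Callable[[List[list]], List[Point]],
) -> Tuple[Optional[Box], List[Point]]:
    """Detect the hand box, crop to it, then estimate keypoints on the crop."""
    box = detect_hand(image)
    if box is None:               # no hand found in this frame
        return None, []
    hand = crop(image, box)
    local_points = detect_keypoints(hand)
    x0, y0 = box[0], box[1]
    # Shift crop-relative keypoints back into full-frame coordinates.
    return box, [(x + x0, y + y0) for x, y in local_points]


# Toy usage with fake models (hypothetical, for illustration only).
def fake_hand_detector(image):
    return (2, 1, 6, 5)


def fake_keypoint_model(hand):
    # Pretend the only keypoint sits at the centre of the crop.
    return [(len(hand[0]) / 2, len(hand) / 2)]


frame = [[0] * 8 for _ in range(6)]   # 6 rows by 8 columns
box, points = run_pipeline(frame, fake_hand_detector, fake_keypoint_model)
```

Keeping the two models behind plain callables makes it easy to swap either one without touching the rest of the loop.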

Hand detection:

For this part, I decided to retrain one of TensorFlow’s object detection models (pre-trained on the COCO dataset) on a hand dataset. I picked MobileNet_v2 for speed.

I won’t cover this part in detail. You can find many tutorials from public sources.

If you are using the Open Images dataset, I have written a custom script to convert the data to the required format:
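The original script is not reproduced here. As a rough sketch of what such a conversion might involve, the code below assumes that the Open Images box annotations are CSV rows with ImageID, LabelName, XMin, XMax, YMin and YMax columns holding coordinates normalized to [0, 1], and that the target is the pixel-coordinate CSV layout (filename, width, height, class, xmin, ymin, xmax, ymax) used by many TFRecord-generation tutorials. The label ids, the image size lookup and the function name are hypothetical.

```python
import csv
import io

PIXEL_FIELDS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]


def open_images_to_pixel_rows(csv_file, label_id, class_name, image_sizes):
    """Keep boxes with the wanted label and convert them to pixel coordinates.

    image_sizes maps an image id to (width, height); how you obtain it
    (for example by opening each image) is left out of this sketch.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        if row["LabelName"] != label_id:
            continue
        width, height = image_sizes[row["ImageID"]]
        rows.append({
            "filename": row["ImageID"] + ".jpg",
            "width": width,
            "height": height,
            "class": class_name,
            "xmin": round(float(row["XMin"]) * width),
            "ymin": round(float(row["YMin"]) * height),
            "xmax": round(float(row["XMax"]) * width),
            "ymax": round(float(row["YMax"]) * height),
        })
    return rows


# Toy input: the label ids below are placeholders, not real Open Images ids.
sample = io.StringIO(
    "ImageID,LabelName,XMin,XMax,YMin,YMax\n"
    "img001,/m/hand_placeholder,0.25,0.75,0.5,1.0\n"
    "img001,/m/other_placeholder,0.0,0.1,0.0,0.1\n"
)
converted = open_images_to_pixel_rows(
    sample, "/m/hand_placeholder", "hand", {"img001": (640, 480)}
)

# Write the converted rows as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=PIXEL_FIELDS)
writer.writeheader()
writer.writerows(converted)
```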

It took me about 6 hours to retrain the model.

Keypoints detection:

I tried different approaches before sticking with Fastai:

1- I first tried to use Keras and TensorFlow, but at an early stage I ran into the challenge of data augmentation. I had no choice but to implement my own data augmentation in Python using Tensorpack (a low-level API), which was quite complicated because of the number of transformations I had to perform (zooming, cropping, stretching, lighting and rotating) and because every image transformation also had to be applied to the coordinates, which are stored in JSON or CSV format.

2- The second approach was to draw the location of the coordinates associated with each hand on a grayscale image (see the mask below for illustration) and use ImageDataGenerator from Keras to perform data augmentation on both the images and their corresponding masks. The model performed well as far as the metrics (loss and accuracy) showed, but the predictions were chaotic. I couldn’t figure out what was wrong and moved to a different approach. Keras is a great API but was difficult to debug in my case.

Hand keypoints mask (grayscale image)

3- The next move proved to be successful. After reading about Fastai, I decided to give it a try. The first advantage of Fastai is that you can debug all your code. The second is that coordinate augmentation is built into the core of the library.
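To illustrate why coordinate augmentation is the hard part in all three approaches, here is a minimal sketch, in plain Python, of applying a rotation to keypoints. It assumes keypoints are (x, y) pixel pairs and that the image itself is warped with the very same matrix; the function names are hypothetical and this is not how Fastai implements it internally.

```python
import math


def rotation_matrix(angle_degrees, center):
    """2x3 affine matrix that rotates points by angle_degrees around center (x, y)."""
    theta = math.radians(angle_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    return [
        [cos_t, -sin_t, cx - cos_t * cx + sin_t * cy],
        [sin_t, cos_t, cy - sin_t * cx - cos_t * cy],
    ]


def apply_affine(points, matrix):
    """Apply a 2x3 affine matrix to a list of (x, y) points."""
    (a, b, tx), (c, d, ty) = matrix
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


# Rotate two keypoints by 90 degrees around the centre of a 4x4 image.
keypoints = [(3.0, 2.0), (2.0, 2.0)]
matrix = rotation_matrix(90, (2.0, 2.0))
rotated = apply_affine(keypoints, matrix)
```

The same pattern extends to zooming, cropping and stretching: each one is a matrix (or an offset) that must be applied to the labels as well as to the pixels, otherwise the keypoints no longer match the transformed image.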

I followed the first lesson’s tutorial to get used to it and immediately started implementing my code in a Jupyter notebook.

The most interesting thing about Fastai and PyTorch is that the whole code boils down to the following script (easy, right 😊!):

After running “learn.lr_find()” and “learn.recorder.plot()” to determine the optimal learning rate, I ran the code for 3 days in total over different cycles (on a CPU!).
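Each training cycle here is a call to fit_one_cycle, which follows the one-cycle policy: the learning rate warms up to a maximum and then anneals to a very small value. The sketch below is a plain-Python approximation of that schedule, not Fastai’s actual code; the default values (div_factor of 25, pct_start of 0.3, a final divisor of div_factor times 1e4) and the cosine shape are assumptions based on my understanding of Fastai v1.

```python
import math


def cosine_anneal(start, end, fraction):
    """Move from start to end along half a cosine as fraction goes from 0 to 1."""
    return end + (start - end) / 2 * (math.cos(math.pi * fraction) + 1)


def one_cycle_lr(step, total_steps, max_lr, div_factor=25.0, pct_start=0.3, final_div=None):
    """Learning rate at a given step of a one-cycle schedule (assumed defaults)."""
    if final_div is None:
        final_div = div_factor * 1e4
    warmup_steps = int(total_steps * pct_start)
    if warmup_steps > 0 and step < warmup_steps:
        # Phase 1: warm up from max_lr / div_factor to max_lr.
        return cosine_anneal(max_lr / div_factor, max_lr, step / warmup_steps)
    # Phase 2: anneal from max_lr down to max_lr / final_div.
    decay_steps = total_steps - warmup_steps
    return cosine_anneal(max_lr, max_lr / final_div, (step - warmup_steps) / decay_steps)


# Learning rates at the start, the peak and the end of a 100-step cycle.
schedule = [one_cycle_lr(s, 100, 6e-3) for s in (0, 30, 100)]
```

The learning-rate finder is what suggests a sensible max_lr for this schedule.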

The last cycle “learn.fit_one_cycle(36,slice(6e-3))” ended up with the following results:

To make predictions, use one of the following snippets:

img = im.open_image('path_to/hand_image.png')
preds = learn.predict(img)[0]

or:

img = im.open_image('path_to/hand_image.png')
preds = learn.predict(img)

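preds = preds[1] + torch.ones(21, 2)  # denormalizing: shift from [-1, 1] to [0, 2]
# The scaling line is truncated in the source; the next line is a reconstruction that assumes scaling by half the image size.
preds = torch.mm(preds, torch.tensor([[img.size[0] / 2, 0], [0, img.size[1] / 2]], dtype=torch.float))
preds = ImagePoints(FlowField(img.size, preds))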
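The denormalizing step can be easier to follow without tensors. The sketch below assumes, as I understand Fastai v1, that predicted points are (y, x) pairs in the range [-1, 1] and that the image size is (height, width); the function names are hypothetical.

```python
def denormalize_points(points, size):
    """Map (y, x) points from [-1, 1] to pixel coordinates for an image of size (height, width)."""
    height, width = size
    return [((y + 1) * height / 2, (x + 1) * width / 2) for y, x in points]


def normalize_points(points, size):
    """Inverse mapping: pixel (y, x) points back to the [-1, 1] range."""
    height, width = size
    return [(2 * y / height - 1, 2 * x / width - 1) for y, x in points]


# Corners and centre of a 100 (height) by 200 (width) image.
pixels = denormalize_points([(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)], (100, 200))
```

Adding 1 shifts the range to [0, 2], and multiplying by half the image size stretches it to [0, height] and [0, width].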

Inference and visualization:

The model is exported for inference with learn.export(). Note that Fastai did not export the Reshape function or the custom loss class, so both must be defined in your script before loading the model for inference.

To draw the keypoints, you need to add the following to your visualization code:


learn = load_learner('path_to_export.pkl')  # load the inference model saved previously with learn.export()

Then:
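The original drawing code is not shown here. As a rough sketch of the preparation step it would need, the helper below turns predicted keypoints into integer (x, y) tuples positioned on the full frame. It assumes the predictions are (y, x) pixel pairs relative to the cropped hand image and that the crop’s top-left corner in the frame is known; the helper name is hypothetical.

```python
def to_draw_points(points_yx, crop_origin_xy=(0, 0)):
    """Convert (y, x) float keypoints on the crop into integer (x, y) points on the frame."""
    origin_x, origin_y = crop_origin_xy
    return [
        (int(round(x)) + origin_x, int(round(y)) + origin_y)
        for y, x in points_yx
    ]


# One keypoint predicted at y=10.4, x=20.6 on a crop whose corner is at (100, 50).
draw_points = to_draw_points([(10.4, 20.6)], crop_origin_xy=(100, 50))
```

Each resulting tuple can then be passed as the center argument of OpenCV’s cv2.circle(frame, center, radius, color, thickness) to draw one keypoint on the frame.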

Where do I go from here?

1- I would like to develop an equity trading model using deep learning. I developed a few quant models in the past, and they were verbose and complicated to implement. Now I am very curious to see what markets look like through DL.

2- I would also like to release a fun end-to-end iOS app at the intersection of computer vision and augmented reality.

Thank you for your interest.

If you’ve got any questions, feel free to email me or connect with me on LinkedIn.