Realtime JavaScript Face Tracking and Face Recognition using face-api.js’ MTCNN Face Detector

Introducing face-api.js’ MTCNN for Face Detection and 5 Point Face Landmarks with tensorflow.js

Never trust a shitty GIF! Try it out yourself!

If you are reading this right now, chances are that you have already read my introduction article (face-api.js — JavaScript API for Face Recognition in the Browser with tensorflow.js) or played around with face-api.js before. If you haven’t heard of face-api.js yet, I would highly recommend that you go ahead and read the introduction article first and have a look at the repo! ;)

And as always, there is a code example waiting for you in this article. We are going to hack together a small application, which is going to perform live face detection and face recognition from webcam images in the browser, so stay with me!

Face Detection with face-api.js

So far, face-api.js solely implemented an SSD Mobilenet v1 based CNN for face detection. While this one turns out to be a pretty accurate face detector, SSD is not quite as fast (in terms of inference time) as other architectures, and it might not be possible to achieve realtime performance with this face detector unless you or the users of your webapp have a decent GPU built into their machines.

It turns out you don’t always need that degree of accuracy, and sometimes you would rather trade some accuracy for a much faster face detector.

That’s where MTCNN comes into play, which is now available in face-api.js! MTCNN is a much more lightweight face detector. In the following I will point out how it compares to SSD Mobilenet v1:

Pros:

  • shorter inference times (faster detection speed)
  • simultaneous detection of 5 face landmark points (we get face alignment for free)
  • much smaller model size: only ~2MB compared to ~6MB (quantized SSD Mobilenet v1 weights)
  • configurable: there are some parameters you can tune to increase performance for your specific requirements

Cons:

  • less accurate than SSD Mobilenet v1

MTCNN — Simultaneous Face Detection & Landmarks

MTCNN (Multi-task Cascaded Convolutional Neural Networks) is an algorithm consisting of 3 stages, which detects the bounding boxes of faces in an image along with their 5 Point Face Landmarks (link to the paper). Each stage gradually improves the detection results by passing its inputs through a CNN, which returns candidate bounding boxes with their scores, followed by non-maximum suppression.
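To make the filtering step at the end of each stage a bit more concrete, here is a minimal sketch of greedy non-maximum suppression in plain JavaScript. This is not the code face-api.js runs internally; the box shape ({ x, y, width, height, score }) and the IoU threshold of 0.5 are assumptions picked for illustration.

```javascript
// Intersection over union of two boxes given as { x, y, width, height }.
function iou(a, b) {
  const x1 = Math.max(a.x, b.x)
  const y1 = Math.max(a.y, b.y)
  const x2 = Math.min(a.x + a.width, b.x + b.width)
  const y2 = Math.min(a.y + a.height, b.y + b.height)
  const intersection = Math.max(0, x2 - x1) * Math.max(0, y2 - y1)
  const union = a.width * a.height + b.width * b.height - intersection
  return union > 0 ? intersection / union : 0
}

// Greedy NMS: visit boxes by descending score and keep a box only if it
// does not overlap an already kept box by more than iouThreshold.
function nonMaxSuppression(boxes, iouThreshold = 0.5) {
  const kept = []
  const byScore = [...boxes].sort((a, b) => b.score - a.score)
  for (const candidate of byScore) {
    if (kept.every(k => iou(k, candidate) <= iouThreshold)) {
      kept.push(candidate)
    }
  }
  return kept
}
```

Of several candidates that cover the same face, only the one with the highest score survives, which is why each stage hands fewer boxes to the next one.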

In stage 1 the input image is scaled down multiple times to build an image pyramid and each scaled version of the image is passed through the CNN of that stage. In stages 2 and 3 we extract image patches for each bounding box, resize them (24×24 in stage 2 and 48×48 in stage 3) and forward them through the CNN of that stage. Besides bounding boxes and scores, stage 3 additionally computes 5 face landmark points for each bounding box.
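The image pyramid of stage 1 can be sketched as follows. This assumes the pyramid construction commonly used in MTCNN implementations, where the stage 1 CNN looks at 12×12 cells; the function name and the example frame size are hypothetical and not taken from face-api.js.

```javascript
// Input size of the stage 1 CNN (12x12 in the MTCNN paper).
const CELL_SIZE = 12

// Computes the scale factors of the stage 1 image pyramid.
// minFaceSize: smallest face (in px) we want to find
// scaleFactor: shrink factor between two pyramid levels
function pyramidScales(minFaceSize, scaleFactor, height, width) {
  // scale at which a face of minFaceSize shrinks to CELL_SIZE pixels
  let scale = CELL_SIZE / minFaceSize
  let minSide = Math.min(height, width) * scale
  const scales = []
  // keep shrinking while the image is still at least as large as one cell
  while (minSide >= CELL_SIZE) {
    scales.push(scale)
    scale *= scaleFactor
    minSide *= scaleFactor
  }
  return scales
}

// 640x480 frame with a scale factor of 0.709 (example values)
console.log(pyramidScales(20, 0.709, 480, 640).length)  // 10
console.log(pyramidScales(200, 0.709, 480, 640).length) // 3
```

A larger minimum face size starts the pyramid at a smaller scale, so fewer and much smaller images have to go through stage 1.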

After fiddling around with some MTCNN implementations, it turns out that you can actually get quite solid detection results at much lower inference times compared to SSD Mobilenet v1, even by running inference on the CPU. As an extra bonus, from the 5 Point Face Landmarks we get face alignment for free! This way we don’t have to perform 68 Point Face Landmark detection as an intermediate step before computing a face descriptor.

Since this seemed so promising to me, I went ahead and implemented it in tfjs-core. After some days of hard work, I was finally able to get a working solution. :) Let’s see it in action!

Webcam Face Tracking and Face Recognition

As promised, we will now have a look at how to implement face tracking and face recognition using your webcam. In this example I am gonna use my webcam to track and recognize faces of some Big Bang Theory Protagonists again, but of course you can use this bit of code for tracking and recognizing yourself accordingly.

To display frames from your webcam, you can simply use a video element as follows. Furthermore, I am placing an absolutely positioned canvas on top of the video element, with the same height and width. We will use the canvas as a transparent overlay, which we can later on draw the detection results onto:
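The setup described above is usually written as markup. To keep all examples in one language, the sketch below builds the same structure with DOM calls instead. The element ids (inputVideo, overlay), the function name and the way the onPlay callback is passed in are assumptions made for this example.

```javascript
// Builds the video element and a transparent canvas overlay inside
// `container`. `doc` is the page's document; ids and sizes are hypothetical.
function createVideoWithOverlay(doc, container, width, height, onPlay) {
  // the container must be positioned so the canvas can sit on top of the video
  container.style.position = 'relative'

  const video = doc.createElement('video')
  video.id = 'inputVideo'
  video.autoplay = true
  video.muted = true
  video.width = width
  video.height = height
  // fires once the video starts playing
  video.onplay = () => onPlay(video)

  const canvas = doc.createElement('canvas')
  canvas.id = 'overlay'
  canvas.width = width
  canvas.height = height
  canvas.style.position = 'absolute'
  canvas.style.top = '0'
  canvas.style.left = '0'

  container.appendChild(video)
  container.appendChild(canvas)
  return { video, canvas }
}

// In the browser:
// createVideoWithOverlay(document, document.body, 640, 480, onPlay)
```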

Once the page is loaded, we will load the MTCNN model as well as the face recognition model, to compute the face descriptors. Furthermore, we are attaching our webcam stream to the video element using navigator.getUserMedia:
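A minimal sketch of this step could look like the following. The loader names faceapi.loadMtcnnModel and faceapi.loadFaceRecognitionModel and the model location '/models' are assumptions and may differ in your version of face-api.js. The article uses the callback-based navigator.getUserMedia; the sketch swaps in the promise-based navigator.mediaDevices.getUserMedia, which has replaced the deprecated callback form in current browsers.

```javascript
// `faceapi` is the loaded face-api.js module, `nav` is window.navigator.
// The loader function names and modelUrl are assumptions.
async function run(faceapi, nav, videoEl, modelUrl = '/models') {
  // load the face detector and the face recognition net
  await faceapi.loadMtcnnModel(modelUrl)
  await faceapi.loadFaceRecognitionModel(modelUrl)

  // ask for webcam access and stream the frames into the video element
  const stream = await nav.mediaDevices.getUserMedia({ video: true })
  videoEl.srcObject = stream
}

// In the browser, once the page is loaded:
// run(faceapi, navigator, document.getElementById('inputVideo'))
```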

You should now be asked to grant the browser access to your webcam. In the onPlay callback that we specified for the video element, we will handle the actual processing for each frame. Note that the play event, which onPlay is hooked onto, is triggered once the video starts playing.

Face Detection

As I said, we can configure some detection parameters here. The default parameters are these:
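The parameter object below is reconstructed from the face-api.js documentation of that time rather than from this article, so treat the names and values as assumptions and check them against the version you are using.

```javascript
// Assumed MTCNN defaults, verify against your face-api.js version.
const defaultMtcnnParams = {
  // number of scaled versions of the input image passed through the
  // stage 1 CNN; lower numbers are faster but less accurate
  maxNumScales: 10,
  // scale factor between two levels of the stage 1 image pyramid
  scaleFactor: 0.709,
  // score thresholds used to filter the bounding boxes of stage 1, 2 and 3
  scoreThresholds: [0.6, 0.7, 0.7],
  // minimum face size to expect in px; higher values are faster,
  // but smaller faces will not be detected
  minFaceSize: 20
}
```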

For tracking faces from your webcam, we will increase the minFaceSize to at least 200px. Detecting only faces of larger sizes allows us to achieve much lower inference times, as the net will scale down the images by a much larger factor:
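As a sketch, the detection call could look like this. The function faceapi.mtcnn(input, params) and the shape of its results (one { faceDetection, faceLandmarks } object per face) are assumptions here, and the element id inputVideo is hypothetical.

```javascript
// Limit the search space to large faces, which suits a webcam setup.
const webcamMtcnnParams = { minFaceSize: 200 }

// `faceapi.mtcnn` and its result shape are assumptions,
// check them against your version of face-api.js.
async function detectFaces(faceapi, videoEl, params = webcamMtcnnParams) {
  return faceapi.mtcnn(videoEl, params)
}

// In the browser, inside onPlay:
// const mtcnnResults = await detectFaces(faceapi, document.getElementById('inputVideo'))
```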

As you can see, we can simply feed it the video element, just like an image or canvas element.

A forward pass through the MTCNN gives us an array of FaceDetections (bounding box + score) along with the FaceLandmark5s for each detected face. Now we can draw the results onto our overlay:
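face-api.js ships its own drawing helpers. To avoid guessing their signatures, the sketch below draws with the plain canvas 2D API instead. It assumes the results were already converted into plain objects ({ box: { x, y, width, height }, landmarks: [{ x, y }, ...] }) in canvas pixel coordinates; that shape and the function name are hypothetical.

```javascript
// Draws bounding boxes and landmark points onto a 2D canvas context.
function drawResults(ctx, canvasWidth, canvasHeight, results) {
  // wipe the drawings of the previous frame, the overlay stays transparent
  ctx.clearRect(0, 0, canvasWidth, canvasHeight)
  for (const { box, landmarks } of results) {
    ctx.strokeStyle = 'blue'
    ctx.lineWidth = 2
    ctx.strokeRect(box.x, box.y, box.width, box.height)

    ctx.fillStyle = 'red'
    for (const pt of landmarks) {
      ctx.beginPath()
      ctx.arc(pt.x, pt.y, 3, 0, 2 * Math.PI)
      ctx.fill()
    }
  }
}

// In the browser:
// const canvas = document.getElementById('overlay')
// drawResults(canvas.getContext('2d'), canvas.width, canvas.height, results)
```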

Just to show an example, up to this point we will end up with the following:

Computing the Face Descriptors

From my previous tutorial you should already know that we want to align the face bounding boxes based on the positions of the face landmarks before computing any face descriptors. From the aligned boxes, we extract the aligned face tensors, which we can then pass through the face recognition net:
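The steps above can be sketched as follows. All library calls in this block are assumptions about the face-api.js API of that time: an align() method on the landmarks that returns the aligned box, faceapi.extractFaceTensors(input, boxes) and faceapi.computeFaceDescriptor(tensor). Please verify them against your version before relying on this.

```javascript
// mtcnnResults: assumed to be [{ faceDetection, faceLandmarks }, ...]
async function computeDescriptors(faceapi, input, mtcnnResults) {
  // align each bounding box using its 5 landmark points
  const alignedBoxes = mtcnnResults.map(res => res.faceLandmarks.align())
  // cut the aligned faces out of the input, one tensor per face
  const faceTensors = await faceapi.extractFaceTensors(input, alignedBoxes)
  try {
    return await Promise.all(
      faceTensors.map(t => faceapi.computeFaceDescriptor(t))
    )
  } finally {
    // tensors are not garbage collected, so free them explicitly
    faceTensors.forEach(t => t.dispose())
  }
}
```

Disposing the tensors in a finally block is a design choice of this sketch: it frees the memory even if computing one of the descriptors fails.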

If that’s too much code for you, there is also a convenient shortcut function, faceapi.allFacesMtcnn, to detect all faces of an image and compute their descriptors, similar to faceapi.allFaces:
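A minimal sketch of using the shortcut: faceapi.allFacesMtcnn is the function named above, while the parameter object and the fields of the returned objects (detection, landmarks, descriptor) are assumptions.

```javascript
// Detects all faces and returns their boxes and descriptors in one go.
// The result field names are assumptions, check your face-api.js version.
async function describeFaces(faceapi, input, params = { minFaceSize: 200 }) {
  const fullFaceDescriptions = await faceapi.allFacesMtcnn(input, params)
  return fullFaceDescriptions.map(({ detection, descriptor }) => ({
    detection,
    descriptor
  }))
}
```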

Face Recognition

From now on, we simply proceed the same way as we did in the previous tutorial. Recall that in order to identify a face, before running the main loop, we have to precompute at least one face descriptor from an example image for each person we want to recognize (reference data). To decide which person is sitting in front of the webcam, we compare the query face descriptor to the face descriptors in the reference data and return the most similar match:
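The matching step can be sketched in plain JavaScript. The layout of the reference data, the function names and the distance threshold of 0.6 are assumptions for this example; in the actual application you would use faceapi.euclideanDistance instead of the local helper.

```javascript
// Euclidean distance between two descriptors of equal length.
function euclideanDistance(a, b) {
  let sum = 0
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i]
    sum += d * d
  }
  return Math.sqrt(sum)
}

// referenceData: [{ label, descriptors: [descriptor, ...] }, ...]
// Returns the closest label, or 'unknown' if nothing is close enough.
function findBestMatch(queryDescriptor, referenceData, maxDistance = 0.6) {
  let best = { label: 'unknown', distance: Infinity }
  for (const { label, descriptors } of referenceData) {
    for (const ref of descriptors) {
      const distance = euclideanDistance(queryDescriptor, ref)
      if (distance < best.distance) best = { label, distance }
    }
  }
  return best.distance <= maxDistance
    ? best
    : { label: 'unknown', distance: best.distance }
}
```

Returning 'unknown' above the threshold keeps the application from labeling a stranger with the name of the least dissimilar person in the reference data.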

If you solely want to track yourself, it is sufficient to take a picture of yourself and run faceapi.allFaces once to retrieve a face descriptor of your own face (reference descriptor). Then you can directly calculate the distance of the query face descriptor from your webcam image and the reference descriptor using faceapi.euclideanDistance.

Finally we draw the text with the predicted labels and distances relative to the position of the bounding boxes onto the overlay canvas again:
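A minimal sketch of the text drawing with the plain canvas 2D API. The shape of the match objects ({ box, label, distance } with box in canvas pixels), the font and the offsets are assumptions for this example.

```javascript
// Writes "label (distance)" just above each bounding box.
function drawLabels(ctx, matches) {
  ctx.font = '20px sans-serif'
  ctx.fillStyle = 'red'
  for (const { box, label, distance } of matches) {
    const text = `${label} (${distance.toFixed(2)})`
    // keep the text inside the canvas when the box touches the top edge
    const y = Math.max(box.y - 6, 20)
    ctx.fillText(text, box.x, y)
  }
}
```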

Afterwards, don’t forget to call onPlay to keep on iteratively processing the most recent frame:
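One way to structure this loop is sketched below. The helper createFrameLoop and its parameters are hypothetical; processFrame stands for the detection, recognition and drawing steps from the previous sections.

```javascript
// Returns an onPlay handler that processes frame after frame.
// `schedule` defaults to setTimeout so the browser can render in between.
function createFrameLoop(processFrame, schedule = fn => setTimeout(fn)) {
  return async function onPlay(videoEl) {
    // stop when playback stops; the next play event restarts the loop
    if (videoEl.paused || videoEl.ended) return
    await processFrame(videoEl)
    // yield to the browser, then process the most recent frame
    schedule(() => onPlay(videoEl))
  }
}
```

Because the next call is only scheduled after the current frame has been processed, frames that arrive while the nets are busy are simply skipped.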

And that’s it already!

Some Final Remarks

Note that recomputing the query face descriptors for every single frame is a very naive approach. Obviously you can come up with a more efficient approach, like keeping track of your detection results and updating their face descriptors only every x frames. Usually the pose of a tracked face doesn’t change that drastically within a few frames. But for the sake of simplicity I will just leave it like that. Just keep that in mind, in case you want to squeeze some more fps out of it.
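As a rough illustration of the "every x frames" idea, the hypothetical wrapper below runs an expensive function only on every n-th call and hands out the cached result in between. It deliberately ignores the harder part, which is matching cached descriptors to faces that have moved between frames.

```javascript
// Wraps an expensive function so it only runs on every n-th call and
// returns the cached result (or cached promise) in between.
function everyNthFrame(n, compute) {
  let calls = 0
  let cached
  return (...args) => {
    if (calls % n === 0) cached = compute(...args)
    calls++
    return cached
  }
}

// Example: refresh the descriptors on every 5th frame only, where
// computeDescriptors is whatever function computes them in your loop.
// const throttledDescriptors = everyNthFrame(5, computeDescriptors)
```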

You can find the full source code for the example here. Finally, make sure to also check out the other examples, and of course stay tuned for further updates and features that might make it into face-api.js in the future! ;)

If you liked this article you are invited to leave some claps and follow me on medium and/or twitter :). Also feel free to leave a star on the github repository. Stay tuned for more tutorials!

Source: Deep Learning on Medium