Real-time face recognition with CPU

Source: Deep Learning on Medium

In this post, we will use the ultra-light detector. If you are interested in any of the other detection methods mentioned, you can refer to my GitHub repository here.

To use the ultra-light model, the following Python packages are required (Python 3.6):

onnx==1.6.0
onnx-tf==1.3.0
onnxruntime==0.5.0
opencv-python==
tensorflow==1.13.1

Use pip install to install all the dependencies.

After preparing the environment, we can get the frame feed from our webcam using the OpenCV library via the following code:
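A minimal sketch of the capture loop, assuming the default webcam is at device index 0. The helper names `frame_stream` and `show_webcam` are hypothetical and only used for illustration:

```python
def frame_stream(capture):
    """Yield frames from an opened capture object until a read fails."""
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        yield frame


def show_webcam(device_index=0):
    """Display the webcam feed until the 'q' key is pressed."""
    import cv2  # imported lazily so frame_stream can be used without OpenCV

    capture = cv2.VideoCapture(device_index)
    try:
        for frame in frame_stream(capture):
            cv2.imshow("Video", frame)
            # waitKey(1) lets the window refresh and returns the pressed key
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        capture.release()
        cv2.destroyAllWindows()
```

Calling `show_webcam()` opens a window with the live feed. Each `frame` is a BGR image array, which is what the later steps take as input.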

For each frame we acquire, we need to apply exactly the same pre-processing pipeline that was used during model training to achieve the expected performance.

As we will be using the pretrained ultra_light_640.onnx model, we have to resize the input image to 640×480. If you are using the 320 model, please resize accordingly.

Code is shown below:
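A sketch of the pre-processing step, under some assumptions: the 640×480 size comes from the text above, but the normalization constants (subtract 127, divide by 128), the BGR-to-RGB conversion and the channels-first layout are assumptions about how the model was trained, so check them against the model's own documentation. The function names are hypothetical.

```python
import numpy as np


def to_model_input(rgb_image, mean=127.0, scale=128.0):
    """Turn an HxWx3 RGB uint8 image into a 1x3xHxW float32 batch."""
    img = (rgb_image.astype(np.float32) - mean) / scale  # assumed normalization
    img = np.transpose(img, (2, 0, 1))  # HWC -> CHW
    return np.expand_dims(img, axis=0)  # add the batch dimension


def preprocess_frame(frame_bgr, size=(640, 480)):
    """Convert a BGR webcam frame to the detector's input tensor."""
    import cv2  # imported lazily; only this function needs OpenCV

    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    rgb = cv2.resize(rgb, size)  # size is (width, height)
    return to_model_input(rgb)
```

For the 320 model, the same function would be called with the corresponding smaller `size`.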

After pre-processing the image, we will have to prepare the ONNX model and create an ONNX inference session. To learn more about model inference, you can check the link here.

Code to prepare the model and create an inference session is shown below:

Now it is time to detect some faces with the following code:
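The session setup and the detection call could look roughly like this with onnxruntime. The output order (confidences first, boxes second) is an assumption about this particular model export, and `create_session` and `detect_raw` are hypothetical helper names:

```python
def create_session(model_path):
    """Create an ONNX Runtime session and look up the model's input name."""
    import onnxruntime as ort  # imported lazily; only needed to load the model

    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    return session, input_name


def detect_raw(session, input_name, batch):
    """Run the detector on a pre-processed batch.

    Passing None as the first argument asks for all model outputs.
    Assumption: the model returns [confidences, boxes] in that order.
    """
    confidences, boxes = session.run(None, {input_name: batch})
    return confidences, boxes
```

A typical call would be `session, input_name = create_session("ultra_light_640.onnx")` once at start-up, followed by `detect_raw(session, input_name, batch)` for every frame.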

The confidences variable contains a pair of confidence values for each box in the boxes variable. The first and second values of each pair are the probabilities that the box contains background and a face, respectively.

As the boxes value contains all the boxes generated, we will have to identify the boxes with high probability of containing a face and remove the duplicates according to the corresponding Jaccard Index (a.k.a. Intersection over Union).

Code to get the right boxes is shown below:

The predict function takes an array of boxes and the corresponding confidence levels for each label. It first filters by confidence to retain only the boxes with a high probability of containing a face.

After that, the intersection over union (IoU) between the remaining boxes is calculated. Finally, the boxes are filtered using non-maximum suppression with a hard IoU threshold to remove near-duplicates.
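The filtering described above can be sketched in plain NumPy. This is an illustrative version, not necessarily the exact function from the repository: the thresholds (0.7 and 0.5) are assumed defaults, boxes are assumed to be in corner form [x1, y1, x2, y2], and the inputs are assumed to already have the batch dimension removed.

```python
import numpy as np


def iou_of(box, others):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    left_top = np.maximum(box[:2], others[:, :2])
    right_bottom = np.minimum(box[2:], others[:, 2:])
    wh = np.clip(right_bottom - left_top, 0.0, None)  # zero if no overlap
    overlap = wh[:, 0] * wh[:, 1]
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return overlap / (area + areas - overlap + 1e-9)


def hard_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop every remaining box that overlaps the kept one too much
        ious = iou_of(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]
    return keep


def predict(boxes, confidences, prob_threshold=0.7, iou_threshold=0.5):
    """Keep confident face boxes and suppress near-duplicates.

    boxes: (N, 4) array; confidences: (N, 2) array of [background, face].
    """
    face_probs = confidences[:, 1]
    mask = face_probs > prob_threshold
    boxes, face_probs = boxes[mask], face_probs[mask]
    if boxes.shape[0] == 0:
        return boxes, face_probs
    keep = hard_nms(boxes, face_probs, iou_threshold)
    return boxes[keep], face_probs[keep]
```

With the raw model outputs, this would be called as `predict(boxes[0], confidences[0])` if the model returns a leading batch dimension.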

Once we have the filtered boxes, we can draw them on each frame and show the result in the video stream:

Result from laptop webcam with Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz:
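A short sketch of the drawing step. It assumes the detector returns box coordinates normalized to the range [0, 1], so they are scaled back to the frame size first; if your boxes are already in pixels, skip `to_pixel_boxes`. Both helper names and the box color are hypothetical.

```python
import numpy as np


def to_pixel_boxes(boxes, width, height):
    """Scale normalized [x1, y1, x2, y2] boxes to integer pixel coordinates."""
    scale = np.array([width, height, width, height], dtype=np.float32)
    return np.round(boxes * scale).astype(np.int32)


def draw_boxes(frame, pixel_boxes, color=(80, 18, 236), thickness=2):
    """Draw one rectangle per box on the frame (modified in place)."""
    import cv2  # imported lazily; only drawing needs OpenCV

    for x1, y1, x2, y2 in pixel_boxes:
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), color, thickness)
    return frame
```

Inside the capture loop, the annotated frame can then be passed to `cv2.imshow` as usual.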

Full code for the detection part can be found here.

Face Recognition

After detecting the faces, the next step is to recognize them. There are many techniques for facial recognition, including OpenFace, FaceNet, VGGFace2, and MobileNetV2 [2]. The model we will use in this article is MobileFaceNet [1], which is inspired by MobileNetV2. Details of the network architecture and how it is trained can be found here.

Generally, there are three steps to recognizing a face: (1) data pre-processing, (2) facial feature extraction, and (3) comparison of features between the target face and the faces in the database.


The data we will be using is a video clip of Jimmy Kimmel’s interview with Jennifer Aniston. We will take the video clip and extract Jennifer Aniston’s faces. You can add your own training data in the corresponding folders.

The file structure looks like the following:

Once the training data is in place, we can perform face extraction on the video clips with the code below:

Faces are captured inside boxes. Now, we can start with face pre-processing.

We will identify five facial landmarks, align faces with proper transformation and resize them to 112×112.

We will be using dlib and imutils to accomplish these subtasks. Use pip install to install these two packages if you have not done so.

After meeting the requirements, we need to initialize shape_predictor (for facial landmark prediction) and FaceAligner with the following code:

The shape_predictor_5_landmarks.dat file can be downloaded here. The desiredLeftEye parameter controls how much of the output image the face occupies. Its value usually ranges from 0.2 to 0.4; the smaller the value, the larger the face appears.
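The sketch below shows one way to set up the aligner and why a smaller desiredLeftEye enlarges the face. The value (0.3, 0.3) is an assumed setting within the range mentioned above, and `build_aligner` and `alignment_scale` are hypothetical helpers. The scale formula mirrors what imutils' FaceAligner computes internally from the distance between the eyes.

```python
def alignment_scale(eye_distance_px, desired_left_eye_x, desired_face_width=112):
    """Zoom factor applied to a face whose eyes are eye_distance_px apart."""
    # The right eye is placed symmetrically to the left eye in the output
    desired_right_eye_x = 1.0 - desired_left_eye_x
    # Target distance between the eyes in the output image, in pixels
    desired_distance = (desired_right_eye_x - desired_left_eye_x) * desired_face_width
    return desired_distance / eye_distance_px


def build_aligner(landmark_model_path, desired_left_eye=(0.3, 0.3), face_width=112):
    """Create a dlib landmark predictor and an imutils FaceAligner."""
    import dlib  # imported lazily; both packages must be installed
    from imutils import face_utils

    predictor = dlib.shape_predictor(landmark_model_path)
    # aligner.align(image, gray, rect) later returns the aligned face crop
    return face_utils.FaceAligner(
        predictor, desiredFaceWidth=face_width, desiredLeftEye=desired_left_eye
    )
```

For eyes detected 56 pixels apart, `alignment_scale(56, 0.2)` is larger than `alignment_scale(56, 0.4)`, which is the "smaller value, larger face" behavior described above.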

The code below applies face alignment to all the extracted faces and writes the results to files:


Eyes are aligned and faces are of similar sizes.

Further pre-processing is required in order to use the MobileFaceNet model. We have to subtract 127.5 from the pixel values of the aligned face and divide the result by 128, as described in the paper [1].

Code for this additional pre-processing:
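As a small sketch, the normalization can be written as a single NumPy expression. The function name `normalize_face` is hypothetical; the constants are the ones stated above.

```python
import numpy as np


def normalize_face(aligned_face):
    """Map uint8 pixel values in [0, 255] to float32 values in about [-1, 1]."""
    return (aligned_face.astype(np.float32) - 127.5) / 128.0
```

The shape is unchanged, so a 112×112×3 aligned face stays 112×112×3 after normalization.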

Calculating face embeddings

It’s time to get the facial features (a.k.a. embeddings) from the pre-processed faces. We will begin by loading the TensorFlow model:

Next, we will define the network input, get the embeddings and save to a pickle file:
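A rough sketch of these two steps using the TensorFlow 1.x API. The checkpoint paths and the tensor names inside the graph are not given here, so they are left as arguments for the caller to supply; all function names and the pickle layout (a dict with "embeddings" and "labels") are assumptions made for illustration.

```python
import pickle

import numpy as np


def restore_model(meta_path, checkpoint_path):
    """Restore a TensorFlow 1.x checkpoint into its own graph and session."""
    import tensorflow as tf  # imported lazily; TensorFlow 1.x API

    graph = tf.Graph()
    with graph.as_default():
        sess = tf.Session(graph=graph)
        saver = tf.train.import_meta_graph(meta_path)
        saver.restore(sess, checkpoint_path)
    return sess, graph


def compute_embeddings(sess, input_tensor, embedding_tensor, faces, extra_feed=None):
    """Run pre-processed faces through the network and return embeddings."""
    feed = {input_tensor: np.asarray(faces, dtype=np.float32)}
    if extra_feed:  # e.g. a training-phase placeholder, if the graph has one
        feed.update(extra_feed)
    return sess.run(embedding_tensor, feed_dict=feed)


def save_embeddings(path, embeddings, labels):
    """Store embeddings and their labels together in one pickle file."""
    with open(path, "wb") as f:
        pickle.dump({"embeddings": np.asarray(embeddings), "labels": list(labels)}, f)


def load_embeddings(path):
    """Load the embeddings and labels written by save_embeddings."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["embeddings"], data["labels"]
```

The input and embedding tensors would be looked up with `graph.get_tensor_by_name(...)`, using whatever names the restored graph actually defines.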

Recognize a face

To recognize a face, simply load our embedding dataset with corresponding labels. Then use Euclidean distance and a threshold to determine who each detected face belongs to.

Code is shown below:
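The matching step can be sketched as a nearest-neighbor search with a cutoff. The threshold of 0.6 is only a placeholder and should be tuned on your own data; `recognize` and the "unknown" label are hypothetical names.

```python
import numpy as np


def recognize(query, known_embeddings, known_labels, threshold=0.6, unknown="unknown"):
    """Return (label, distance) of the closest known face, or unknown.

    query: (D,) embedding; known_embeddings: (N, D) array with N >= 1.
    """
    # Euclidean distance from the query to every stored embedding
    distances = np.linalg.norm(known_embeddings - query, axis=1)
    best = int(np.argmin(distances))
    if distances[best] < threshold:
        return known_labels[best], float(distances[best])
    return unknown, float(distances[best])
```

A lower threshold makes the system stricter (fewer false matches, more faces reported as unknown), while a higher one does the opposite.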


Let’s see our results:

Embeddings acquired for the six main characters from the Friends series

Again, you will be able to find the full code here.


With that, we have created a system that can perform real-time face recognition on a CPU. Although it runs at only around 13 FPS, it is still much faster than approaches that use complex CNNs.

However, there are still many things we could do to improve the performance (both accuracy and speed) of this system. Potentially, we could apply knowledge distillation to compress the current model and further reduce the model size using low-bit quantization. Moreover, we could improve the accuracy by applying other machine learning classification methods to the embeddings.

Thank you for reading! Hope you find this helpful.

Stay tuned and see ya~


[1]: Chen, Sheng, et al. “MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices.” Chinese Conference on Biometric Recognition. Springer, Cham, 2018.

[2]: Sandler, Mark, et al. “MobileNetV2: Inverted Residuals and Linear Bottlenecks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.