Original article was published on Deep Learning on Medium
Before we can perform face recognition, we need to detect faces.
We will use the Multi-Task Cascaded Convolutional Neural Network, or MTCNN, for face detection, i.e., finding and extracting faces from photos. This is a state-of-the-art deep learning model for face detection, described in the 2016 paper titled “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks.”
Face detection and alignment in an unconstrained environment is challenging due to varied poses, illumination and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, the authors propose a deep cascaded multi-task framework that exploits the inherent correlation between detection and alignment to boost the performance of both. In particular, the framework uses a cascaded architecture with three stages of carefully designed deep convolutional networks to predict face and landmark locations in a coarse-to-fine manner. In addition, the authors propose an online hard sample mining strategy that further improves performance in practice. The method achieves superior accuracy over state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmarks for face detection and on the AFLW benchmark for face alignment, while maintaining real-time performance.
We can use the mtcnn library to create a face detector and extract faces for use with the FaceNet face recognition models in subsequent sections.
The first step is to load an image. We will also convert the image to RGB, just in case the image has an alpha channel or is black and white.
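A minimal sketch of this loading step, assuming Pillow and NumPy are installed. The helper name `load_rgb_pixels` is hypothetical and chosen only for illustration.

```python
import numpy as np
from PIL import Image


def load_rgb_pixels(source):
    """Load an image from a path or file-like object as an RGB pixel array."""
    with Image.open(source) as image:
        # convert("RGB") drops any alpha channel and expands
        # grayscale images to three channels
        rgb = image.convert("RGB")
    # The resulting array has shape (height, width, 3)
    return np.asarray(rgb)
```

Returning a NumPy array is a convenience here, since the detector in the next step works on pixel arrays rather than on image objects.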
Next, we can create an MTCNN face detector class and use it to detect all faces in the loaded photograph.
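The detection step might look like the sketch below. It relies on the `detect_faces` method of the mtcnn package, which returns one dictionary per detected face. The wrapper `detect_face_boxes` is a hypothetical helper, and the detector is passed in as an argument so the function does not depend on how it was constructed.

```python
def detect_face_boxes(detector, pixels):
    """Return the [x, y, width, height] box of every detected face."""
    # Each result is a dict with 'box', 'confidence' and 'keypoints' entries
    results = detector.detect_faces(pixels)
    return [result["box"] for result in results]


# Usage with the mtcnn package (requires `pip install mtcnn`):
#   from mtcnn.mtcnn import MTCNN
#   detector = MTCNN()
#   boxes = detect_face_boxes(detector, pixels)
```

Creating the detector once and reusing it across photos avoids reloading the network weights for every image.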
The result is a list of bounding boxes, where each bounding box defines the top-left corner of the box (in image coordinates, where y increases downward), as well as the width and height.
If we assume there is only one face in the photo for our experiments, we can determine the pixel coordinates of the bounding box from the first result. Sometimes the library will return a negative pixel index, which I think is a bug. We can work around this by taking the absolute value of the coordinates.
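The coordinate handling described here can be sketched as follows, assuming the image is a NumPy array and the box is in `[x, y, width, height]` form. The function name `crop_face` is hypothetical.

```python
import numpy as np


def crop_face(pixels, box):
    """Crop the region described by an [x, y, width, height] box."""
    x1, y1, width, height = box
    # Work around occasional negative coordinates with absolute values
    x1, y1 = abs(x1), abs(y1)
    x2, y2 = x1 + width, y1 + height
    # Arrays are indexed [row, column], i.e. [y, x]
    return pixels[y1:y2, x1:x2]
```

An alternative not taken in this article would be to clamp negative values to zero with `max(0, x1)`, which keeps the crop anchored at the image edge. That is a suggestion on my part rather than something the mtcnn library prescribes.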
We can then use the preprocessing functions to resize this small image of the face to the required size; specifically, the model expects square input faces.
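One possible way to do the resize uses Pillow, as sketched below. The helper `resize_face` is hypothetical, and the default of 160×160 is only a placeholder assumption: check the input size your face recognition model actually expects.

```python
import numpy as np
from PIL import Image


def resize_face(face, required_size=(160, 160)):
    """Resize a cropped uint8 RGB face array to a fixed square size."""
    image = Image.fromarray(face)
    # resize() takes (width, height) and returns a new image
    image = image.resize(required_size)
    return np.asarray(image)
```

Note that a plain resize does not preserve the aspect ratio of the crop, so a non-square face box will be stretched slightly to fit the square input.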
Wrapping these steps in a single function lets us extract faces as needed and provide them as input to the FaceNet model in the next section.
Here we use MobileFaceNets, a mobile-oriented counterpart to FaceNet that uses fewer than 1 million parameters and is specifically tailored for high-accuracy, real-time face verification on mobile and embedded devices. The MobileFaceNets authors first analyzed the weakness of common mobile networks for face verification and then designed MobileFaceNets to overcome it. Under the same experimental conditions, MobileFaceNets achieve significantly superior accuracy as well as more than 2 times actual speedup over MobileNetV2. After training with ArcFace loss on the refined MS-Celeb-1M dataset, a single MobileFaceNet of 4.0 MB achieves 99.55% accuracy on LFW and 92.59% TAR@FAR1e-6 on MegaFace, which is comparable to state-of-the-art large CNN models that are hundreds of MB in size. The fastest MobileFaceNet has an actual inference time of 18 milliseconds on a mobile phone. For face verification, MobileFaceNets achieve significantly improved efficiency over previous state-of-the-art mobile CNNs.