Mask Detection Using Deep Learning

Original article was published by Harsh Sharma on Deep Learning on Medium


Please Wear a Mask!

Hello readers! Just like my previous article, this one relates to our current dire situation of COVID-19. As the title indicates, I will explain how you can build a mask detection system on a video feed using Deep Learning. Basically, you will be able to detect whether someone is wearing a mask or not, and you can use that result to generate a trigger.

This system can be used in a workplace to monitor whether employees are wearing masks. It can also be used in shopping malls, stations, etc. to make announcements from time to time pointing out people not following the mask rule.

Our final product of detecting masks in a frame involves two major steps: one, detecting faces in the frame, and two, classifying each detected face as having a mask on or not.
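The two-step pipeline can be sketched as follows. This is a minimal illustration only: `detect_faces` and `classify_mask` are hypothetical stand-ins for the real RetinaFace detector and ResNet classifier, which come later.

```python
import numpy as np

def detect_faces(frame):
    """Stand-in for a face detector such as RetinaFace.
    Returns a list of (x1, y1, x2, y2) boxes."""
    h, w = frame.shape[:2]
    # Pretend we found one face covering the center of the frame.
    return [(w // 4, h // 4, 3 * w // 4, 3 * h // 4)]

def classify_mask(face_crop):
    """Stand-in for a ResNet mask / no-mask classifier.
    Returns True if a mask is 'detected'."""
    # A real classifier would run the crop through a CNN;
    # here we just make a dummy decision on pixel intensity.
    return face_crop.mean() > 0

def process_frame(frame):
    """Step 1: detect faces. Step 2: classify each detected face."""
    results = []
    for (x1, y1, x2, y2) in detect_faces(frame):
        crop = frame[y1:y2, x1:x2]
        results.append(((x1, y1, x2, y2), classify_mask(crop)))
    return results

frame = np.full((480, 640, 3), 128, dtype=np.uint8)  # dummy video frame
print(process_frame(frame))
```

In a real deployment, `process_frame` would run on every frame of the video feed, and the boolean per face would drive the announcement or alert trigger.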

To do the face detection we will use an architecture called RetinaFace, a state-of-the-art model for detecting faces in a picture, and to classify each face as mask or no mask we will use a ResNet architecture. I believe that if you know what you are using, you will be more comfortable using it. So, I will explain these two architectures first and then discuss their implementation and provide you the code.

Here I’ll be explaining the RetinaFace architecture. In my next article, I will explain the ResNet architecture and discuss in detail how to combine and implement these two models using Python.


Earlier, face detection was done using two-stage detectors, which had a region proposal network whose proposed regions were then sent to another network to find the boxes around the faces. If you don’t know about single-stage and two-stage detectors, you can go through my previous article where I have explained a bit about them.


RetinaFace was one of the first single-stage detectors that performed really well in detecting small and highly occluded faces. In one of my articles, I have explained a state-of-the-art CNN architecture in object detection, RetinaNet. The RetinaFace architecture is similar to it, but with some changes specific to face detection. In RetinaFace too, we use an FPN (Feature Pyramid Network) on top of a ResNet-152 backbone. Again, if you don’t know about FPNs and receptive fields, you can go through my previous article here.

Taken from RetinaNet Paper

From here on I’ll assume that you know about FPNs. The architecture uses the outputs of different ResNet layers, which have different receptive fields and therefore make it possible to detect faces of different sizes. Instead of just using these outputs to locate and refine the boxes around faces, the authors add one more layer of computation on top of each output, which they call a Context Module.
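To make the multi-scale idea concrete, here is a small sketch of how anchors of different sizes can be tied to different pyramid levels: a coarser level (larger stride) gets larger anchors, so each level specializes in one face scale. The strides and anchor sizes below are illustrative assumptions, not the exact values from the RetinaFace paper.

```python
import numpy as np

def level_anchors(feat_h, feat_w, stride, size):
    """Generate square anchors for one FPN level.
    Each feature-map cell gets one anchor centered on it,
    mapped back to image coordinates via the stride."""
    ys, xs = np.mgrid[0:feat_h, 0:feat_w]
    cx = (xs + 0.5) * stride  # anchor centers in image coordinates
    cy = (ys + 0.5) * stride
    boxes = np.stack([cx - size / 2, cy - size / 2,
                      cx + size / 2, cy + size / 2], axis=-1)
    return boxes.reshape(-1, 4)

# Illustrative pyramid over a 640x640 image: (stride, anchor size) pairs.
image_size = 640
levels = [(8, 32), (16, 64), (32, 128)]  # assumed values, for illustration
all_anchors = [level_anchors(image_size // s, image_size // s, s, a)
               for s, a in levels]
for (s, a), anch in zip(levels, all_anchors):
    print(f"stride {s:2d}: {len(anch)} anchors of size {a}")
```

The fine level (stride 8) contributes many small anchors for tiny faces, while the coarse level (stride 32) contributes a few large ones, which is exactly what the differing receptive fields of the FPN outputs buy us.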

Context Module

The concept of a context module was not introduced by the RetinaFace authors; it was already used by earlier researchers, who report better accuracy with this kind of addition in face detection architectures. In the context module, the receptive field is increased by using something called a Deformable Convolutional Network (DCN). DCNs are similar to CNNs, but a DCN has a few extra offset parameters that remove the constraint that the kernel can only look at a fixed, square window (a 3 × 3 kernel can normally only look at a 3 × 3 square window at a time). To learn more about DCNs, you can get an overview from this beautifully explained article. Basically, as you can see in the image above, the context module makes it easier for the model to learn different orientations of faces while increasing the receptive field of each output; and since we are adding computation along with residual connections, it increases the contextual information the model can incorporate.
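A toy sketch of the offset idea behind deformable convolution: each of the nine taps of a 3 × 3 kernel samples at its grid position plus a (dy, dx) offset, using bilinear interpolation for fractional locations. In a real DCN the offsets are learned by a separate conv layer; here they are fixed by hand purely for illustration.

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinearly sample a 2D array at fractional location (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, img.shape[0] - 1), min(x0 + 1, img.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0] + wy * wx * img[y1, x1])

def deform_sample(img, cy, cx, weights, offsets):
    """One output value of a 3x3 'deformable' convolution at (cy, cx):
    each kernel tap samples at its grid position plus a (dy, dx)
    offset, instead of the rigid 3x3 square."""
    out, k = 0.0, 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            oy, ox = offsets[k]
            out += weights[k] * bilinear(img, cy + dy + oy, cx + dx + ox)
            k += 1
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
w = np.full(9, 1.0 / 9.0)      # averaging kernel
rigid = [(0.0, 0.0)] * 9       # zero offsets -> ordinary 3x3 convolution
print(deform_sample(img, 2, 2, w, rigid))
```

With all offsets zero the result is just a plain 3 × 3 average; non-zero offsets let the same nine weights attend to an arbitrarily shaped neighborhood, which is what helps with rotated or occluded faces.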

One more important thing they use in the architecture is the mesh decoder. This part of the architecture is pretty complex to dive deep into, but I will give an overview of what it ultimately does.

Mesh Decoder

This part of the architecture is a bit unique: it works on the 3D structure of the face. Mathematically, a 3D face can be represented as V ∈ R^(n×6), where V is the set of n vertices and each vertex is represented by 6 numbers: the (x, y, z) spatial coordinates and the (R, G, B) color values. This representation of a face can be converted into a 2D image by a 3D renderer, which is a fairly complex function, hence we do not need to dive into it (if you are enthusiastic, you can go through section 3.2 of this paper to understand the function). So, our architecture includes a part that generates a 128-dimensional vector from the predicted face box. This 128-dimensional vector is treated as the shape and texture information for that face and is passed to a network (the mesh decoder) which decodes it into the n × 6 matrix that is the 3D representation of the face. This predicted mesh is then sent to the renderer function, which converts it into a 2D image, and the obtained image is compared with the original image using a pixel-wise difference.
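The data flow can be sketched in a few lines. Note the heavy simplifications: the real mesh decoder is a graph convolutional network and the renderer is a differentiable 3D renderer; here a single random linear map and a random "rendered" image stand in for them, just to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)

n_vertices = 50                      # toy mesh; the real model uses far more
code = rng.normal(size=128)          # 128-d shape/texture vector for one face

# Hypothetical one-layer "mesh decoder": 128-d code -> n x 6 mesh.
W = rng.normal(scale=0.01, size=(128, n_vertices * 6))
mesh = (code @ W).reshape(n_vertices, 6)   # each row: (x, y, z, R, G, B)
xyz, rgb = mesh[:, :3], mesh[:, 3:]
print(mesh.shape)                    # V in R^(n x 6)

# Pixel-wise comparison of the rendered face against the original crop.
rendered = rng.random((32, 32, 3))   # stand-in for the renderer's output
original = rng.random((32, 32, 3))   # stand-in for the original face crop
dense_loss = float(np.abs(rendered - original).mean())
print(dense_loss)
```

The only point to take away is the shape bookkeeping: a 128-dimensional code becomes an n × 6 mesh, the mesh becomes an image, and the image difference becomes a scalar loss.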

This type of architecture and loss function incorporates information about the 3D structure of a face into the model, which is very important when we want the model to locate faces in a given image.

Now, we have an understanding of the architecture. Next, we will look into the loss function which is arguably the second most important part of any neural network.

Loss Function

The loss function that they are using is a combination of 4 types of losses as shown in the image below.

Fig : 1
  1. Classification loss : the face classification loss Lcls(pi, p*i), where pi is the predicted probability of anchor i being a face and p*i is 1 for a positive anchor and 0 for a negative anchor. Lcls is the softmax loss over the two classes (face / not face).
  2. Box regression loss : given by Lbox(ti, t*i), where ti = {tx, ty, tw, th}i and t*i = {t*x, t*y, t*w, t*h}i represent the coordinates of the predicted box and the ground-truth box associated with a positive anchor, respectively. It is a smooth-L1 loss.
  3. Facial landmark regression loss : along with the shape and location of the boxes around faces, our model also predicts 5 facial landmarks (left eye, right eye, nose tip, left mouth corner, right mouth corner). These predicted landmarks are compared with the annotated landmarks for each face using a smooth-L1 loss (Lpts).
  4. Dense regression loss : this is the loss from the mesh decoder discussed above, which incorporates the 3D information of the face by taking the pixel-wise difference of the rendered output. The actual function used to calculate this loss is :
Taken from paper
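Both the box and landmark regression terms use the smooth-L1 loss, which is quadratic near zero and linear for large errors. A minimal implementation, assuming the standard threshold of 1:

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth-L1 loss, summed over coordinates.
    Quadratic for |d| < 1, linear beyond, so large box or landmark
    errors do not produce exploding gradients."""
    d = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    per_elem = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)
    return float(per_elem.sum())

# Example: predicted box offsets vs. ground truth (illustrative numbers).
t_pred = np.array([0.1, 0.2, 1.5, 0.9])   # (tx, ty, tw, th)
t_true = np.array([0.0, 0.0, 0.0, 0.0])
print(smooth_l1(t_pred, t_true))          # -> 1.43
```

The same function serves Lbox on 4 box coordinates and Lpts on the 10 landmark coordinates (5 points × 2).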

In Fig : 1, which showed the combined loss function, we can see that there are some lambda (λ) parameters. These are called loss-balancing parameters, and they control how much each individual loss contributes to the total.
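Putting the pieces together, the total loss is a weighted sum of the four terms. The defaults below use λ1 = 0.25, λ2 = 0.1, λ3 = 0.01, which to my reading are the values reported in the RetinaFace paper; treat them as a starting point, not gospel.

```python
def total_loss(l_cls, l_box, l_pts, l_mesh, lambdas=(0.25, 0.1, 0.01)):
    """Combined multi-task loss: classification plus three regression
    terms, each scaled by its loss-balancing parameter lambda."""
    lam1, lam2, lam3 = lambdas
    return l_cls + lam1 * l_box + lam2 * l_pts + lam3 * l_mesh

# Illustrative per-anchor loss values.
print(total_loss(0.7, 1.2, 2.0, 5.0))   # -> 1.25
```

Raising a λ tells the optimizer to care more about that sub-task; in the paper the regression terms are down-weighted relative to classification.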