Mask Detection Using Deep Learning (Part -II)

Original article was published by Harsh Sharma on Deep Learning on Medium

Hello readers. Continuing from my previous article, where I explained how RetinaFace works, I will now cover the second part of the mask detection pipeline, i.e. classification using ResNet, and then provide the code implementation to get the task done.

If you have gone through the first part of Mask Detection using Deep Learning, you know that by now we have extracted all the faces in an image using RetinaFace, whose internal working I also explained there.

Now we need a classifier that can label each face as “face with mask” or “face without mask”. For this we will use ResNet, which is one of the best architectures for image classification. Training a classifier requires two main things: a dataset with enough data and a loss function to optimize.

Datasets: I combined several publicly available datasets, like this one, to build a custom dataset of images with and without masks. I also used the LFW dataset for images without masks. My custom dataset had 18k+ images of faces with masks and 14k+ images of faces without masks. I did some data cleaning, such as removing duplicates, blurred images and wrong images. On top of this, I extracted only the faces from these datasets using RetinaFace and labelled them accordingly, because in the end we are only going to classify faces.

Loss: This is a two-class classification task, so I used binary cross-entropy as the loss function to train the model.
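To make the loss concrete, here is a minimal sketch of binary cross-entropy in plain Python. It is only an illustration and not the fastai implementation used for training; the convention that label 1 means “with mask” and the sample probabilities are assumptions made up for the example.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clamp the probability so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Assumed convention for this sketch: 1 = face with mask, 0 = face without mask.
labels = [1, 0, 1, 0]
good_predictions = [0.9, 0.1, 0.8, 0.2]  # mostly agree with the labels
bad_predictions = [0.1, 0.9, 0.2, 0.8]   # mostly disagree with the labels

print("loss for good predictions:", binary_cross_entropy(labels, good_predictions))
print("loss for bad predictions: ", binary_cross_entropy(labels, bad_predictions))
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong, which is what pushes the classifier towards correct predictions during training.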

To train the ResNet model, I used the fastai library, as it provides a very simple interface and close to state-of-the-art results for various tasks. It is a very easy-to-use library, and once you get the hang of it, you won’t want to go anywhere else to train your deep learning models. The library is written on top of PyTorch. Once you start using it, building a model feels like assembling Lego pieces, and then you are good to go.

As I mentioned earlier, I am using ResNet to classify the face images, and if you have gone through my previous articles you will have noticed that I like to explain what we use in a task. So, I will explain a bit about ResNet now.


In the early days, VGGNet, AlexNet, Inception etc. were considered state-of-the-art models for image classification. These architectures extract feature information from an image, and through a loss function and backpropagation they learn which features are important for that particular task. So people started building deeper networks in the hope that they would extract more features and capture more information. But the results were the opposite of what was expected: after adding more layers the loss saturated, and in some cases even increased, which drove people to look for the reason. Then came this revolutionary paper, which made a minimal change to the network but was game changing. The authors introduced the residual block, which adds an identity (skip) connection around a small group of CNN layers.

So why did this identity connection work, and what is the actual architecture?

Deeper models were performing worse than shallower ones because, as the number of layers increased, they became harder to train. A commonly cited reason is the vanishing gradient problem: some layers were not able to learn properly because their weights were not getting adjusted properly. (The paper itself calls this the degradation problem.) To address it, the authors added skip connections between layers. If the input to a layer is X and the convolution operation is denoted by f(X), then the output of the layer is X + f(X), whereas in earlier models it was just f(X). This skip connection gives gradients a direct path to flow back.
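The effect of the X + f(X) form on gradients can be seen with a toy calculation. The sketch below is not a real ResNet: each “layer” is just a scalar function f(x) = tanh(w * x) with a small, made-up weight, standing in for a block of convolutions. It compares the gradient of the output with respect to the input for a plain stack and for a stack with skip connections.

```python
import math


def layer(x, w):
    """Toy 'layer' f(x) = tanh(w * x), standing in for a block of conv layers."""
    return math.tanh(w * x)


def layer_grad(x, w):
    """Derivative of f(x) = tanh(w * x) with respect to x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def input_gradient(x, weights, residual):
    """Run x through the stack and return d(output)/d(input) via the chain rule."""
    grad = 1.0
    for w in weights:
        local = layer_grad(x, w)  # f'(x) at this layer's input
        if residual:
            grad *= 1.0 + local   # d/dx [x + f(x)] = 1 + f'(x)
            x = x + layer(x, w)
        else:
            grad *= local         # d/dx [f(x)] = f'(x)
            x = layer(x, w)
    return grad


weights = [0.1] * 20  # 20 toy layers with small weights (illustrative values)
plain = input_gradient(0.5, weights, residual=False)
skip = input_gradient(0.5, weights, residual=True)

print(f"plain stack gradient:    {plain:.3e}")
print(f"residual stack gradient: {skip:.3e}")
```

In the plain stack the gradient is a product of 20 small numbers, so it collapses towards zero. With skip connections every factor is 1 + f'(x), so the identity path always contributes a 1 and the gradient cannot vanish in the same way.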

Below is a comparison between a VGG network and a ResNet. I have pasted a cutout of the original image from the paper here, just to give an idea of how the information flows.

Taken from the paper

Typically, a residual block consists of 2 or 3 convolutional layers in sequence, and its output is the sum of the input to the first layer and the output obtained after the convolution operations and activations. There are various ResNet architectures depending on the number of layers, like ResNet-18, ResNet-34, ResNet-50 etc.

The output obtained after an image goes through all the layers of a ResNet (arranged sequentially) is then connected to n output units, where n is the number of classes in the classification task (2 in our case).
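As a rough sketch of that final step, the snippet below maps a feature vector to n = 2 output units and turns the resulting scores into probabilities with a softmax. The feature values, weights, biases and the class order are all made up for illustration; in the trained model they come from the network itself.

```python
import math


def linear_head(features, weights, biases):
    """One output unit per class: score_c = dot(weights[c], features) + biases[c]."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ["mask", "no_mask"]   # hypothetical class order
features = [0.2, -1.3, 0.7]     # made-up features coming out of the network
weights = [[0.5, -1.0, 0.3],    # made-up weights, one row per class
           [-0.4, 0.8, 0.1]]
biases = [0.0, 0.1]

scores = linear_head(features, weights, biases)
probs = softmax(scores)
predicted = CLASSES[probs.index(max(probs))]

print("probabilities:", dict(zip(CLASSES, probs)))
print("predicted class:", predicted)
```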

There is one thing to note here, which was new for me when I learnt about the ResNet architecture. Earlier, a trained model required a fixed image size, otherwise it would fail, because the dense fully connected layers at the end of the architecture expected an input of a particular size. But here they use adaptive average pooling, which removes the fixed-size constraint because it always produces an output of the same size, whatever the size of the incoming feature map. I am not sure who introduced this concept, but I came to know about it when I read this paper.
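Here is a small pure-Python sketch of the idea behind adaptive average pooling. The window boundaries use a floor/ceil rule that, as far as I know, matches how PyTorch's AdaptiveAvgPool2d splits its input, but treat that as an assumption; the point is only that the output size is chosen up front and the pooling windows adapt to the input size.

```python
import math


def adaptive_avg_pool2d(feature_map, out_h, out_w):
    """Average-pool a 2D list of numbers down to exactly out_h x out_w."""
    in_h, in_w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(out_h):
        # Rows covered by output row i: floor(i*in/out) .. ceil((i+1)*in/out)
        r0 = (i * in_h) // out_h
        r1 = math.ceil((i + 1) * in_h / out_h)
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = math.ceil((j + 1) * in_w / out_w)
            window = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Two feature maps of different sizes (made-up values).
small = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 4 x 4
large = [[1.0] * 9 for _ in range(7)]                             # 7 x 9

for name, fmap in (("small", small), ("large", large)):
    out = adaptive_avg_pool2d(fmap, 2, 2)
    print(name, "->", len(out), "x", len(out[0]), out)
```

Both inputs come out as 2 x 2, so whatever layer follows always sees the same number of values, regardless of the size of the image that went in.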

So now we know a bit about ResNet and how to perform classification with it (earlier we saw ResNet being used as a feature extractor for FPN). Next, I will explain how to implement mask detection and provide the code for it.

Implementation Of Mask Detection

As explained, our mask detection process involves two stages: first we detect faces, and then we classify each detected face. For face detection we will be using this repository, as it has trained weights for RetinaFace. The other reason for using this repository is that it has the official implementation of InsightFace, a state-of-the-art model for face recognition, which I will cover in my next article on how to build an attendance marker system.

Step I

Clone this repository which contains all the code required for our application by using the following command :

git clone

Download the models for RetinaFace and ResNet classification from this drive. Make a directory “models/retinaface” inside the Face_detection folder and extract “” in that folder.

Make a directory “model” inside the Mask_classification folder and put “model_clean_data.pkl” in that folder.

Step II

Let's assume you cloned the repository into the ‘loc’ folder. The next step is to change the working directory to the Face_detection folder by running the following command:

cd loc/Mask-detection/Face_detection

The implementation of RetinaFace is in MXNet, and for classification using ResNet I am using fastai’s implementation in PyTorch. So, to keep things clean, we will use two separate conda environments.

To create the conda environment for face detection, we will use the requirements.yml file in the Face_detection folder by running the following command:

conda env create -n detection -f requirements.yml

Note : If you don’t know about conda and virtual environments, please go through my previous article here where I have explained a bit about them.

Step III

The other environment that we will create is for the classification task. The requirements.yml file for that environment is inside the Mask_classification folder.

First change the working directory to Mask_classification by running:

cd loc/Mask-detection/Mask_classification

Then create the conda environment in a similar way using :

conda env create -n classification -f requirements.yml

Step IV

Now that we have created the conda environments, we need to run a Flask server that will respond to requests to classify the face images.

Note: We will run a server for classification, and we will use the detection environment to detect the faces and then get a response from the classification server for each detected face.
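The overall flow can be sketched as below. This is not the code from the repository: the detector and classifier are passed in as plain functions, and the two stubs at the bottom are hypothetical stand-ins. In the real setup the detector would be RetinaFace, and the classifier function would send each face crop to the Flask server and read the label from its response.

```python
def crop(image, box):
    """Cut the region (x1, y1, x2, y2) out of an image stored as a list of rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def detect_and_classify(image, detect_faces, classify_face):
    """Stage 1: find face boxes. Stage 2: classify each cropped face."""
    results = []
    for box in detect_faces(image):
        label = classify_face(crop(image, box))
        results.append((box, label))
    return results


# Hypothetical stand-ins, only so that the sketch runs on its own.
def fake_detector(image):
    return [(0, 0, 2, 2), (2, 2, 4, 4)]


def fake_classifier(face):
    pixels = [value for row in face for value in row]
    return "mask" if sum(pixels) / len(pixels) > 0.5 else "no_mask"


image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

for box, label in detect_and_classify(image, fake_detector, fake_classifier):
    print(box, label)
```

Keeping the two stages behind separate functions mirrors the two-environment setup: the detection side does not need to know anything about fastai or PyTorch, it only needs a way to ask for a label.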

We will run the classification server on port 3000, which is where our detection environment will send its requests to get the result. We can change the port, but then we have to set the request address in the script accordingly. To run the server, we will use the following set of commands:

First activate conda environment using :

conda activate classification

Then change the working directory to Mask_classification using:

cd loc/Mask-detection/Mask_classification

Then run the flask server using:

flask run --port 3000

Step V

Now that the server is up and running, we will run the “” script in the root folder, first activating its environment using:

conda activate detection

Then change the working directory to the root directory by running:

cd loc/Mask-detection

Note: Do not close the server. Run the script in a new terminal.

The script “” has three arguments: “is_image”, “in_path” and “out_path”.

If you want to run inference on an image, set the first argument to True; if you want to run inference on a video, set it to False. The next two arguments are self-explanatory: “in_path” is the path of the input file and “out_path” is the location where the output will be saved.

You can run the script in the following way after activating the environment:

python --is_image True --in_path path/to/image --out_path path/to/save

Also specify the name of the output file in the path.

I have put some sample images in the “sample-images” folder; you can try those images to start with. The script will return an image with green boxes around faces wearing a mask and red boxes around faces that are not.
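A side note on flags like “is_image” that take True or False on the command line. I am not showing how the script in the repository parses its arguments here, so the sketch below is only an assumption of how such an interface can be built with argparse. The thing to watch out for is that in Python bool("False") is True, so the string has to be converted explicitly.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False' style strings to a real boolean."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    """Hypothetical parser with the three arguments described above."""
    parser = argparse.ArgumentParser(description="Mask detection on an image or a video")
    parser.add_argument("--is_image", type=str_to_bool, required=True,
                        help="True for an image input, False for a video input")
    parser.add_argument("--in_path", required=True, help="path of the input file")
    parser.add_argument("--out_path", required=True,
                        help="output path, including the output file name")
    return parser


# Parse an example command line (the file names are made up).
args = build_parser().parse_args(
    ["--is_image", "False", "--in_path", "clip.mp4", "--out_path", "out/clip_annotated.mp4"]
)
print("is_image:", args.is_image)
print("in_path: ", args.in_path)
print("out_path:", args.out_path)
```

Using type=bool directly would treat any non-empty string, including "False", as True, which is why the explicit conversion function is used in this sketch.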