Original article was published by Francesco Zuppichini on Deep Learning on Medium
Face Unlock with 2D Data
A deep learning approach
Today we are going to use deep learning to create a face unlock algorithm. To complete our puzzle, we need three main pieces.
- an algorithm to find faces in an image
- a way to embed the faces in vector space
- a function to compare the encoded faces
First of all, we need a way to find a face inside an image. We can use an end-to-end approach called MTCNN (Multi-task Cascaded Convolutional Networks).
Just a little bit of technical background: it is called Cascaded because it is composed of multiple stages, and each stage has its own neural network. The following image shows the framework.
We rely on the MTCNN implementation from facenet-pytorch repo.
We need images! I have put together a couple of images of myself, Leonardo DiCaprio, and Matt Damon.
Following PyTorch best practices, I load the dataset using ImageFolder. I created an MTCNN instance and passed it to the dataset as its transform.
My folder structure follows the ImageFolder convention: one sub-directory per person, each containing that person's images.
MTCNN will automatically crop and resize the input. I used image_size=160 because the model we are going to use was trained on images of that size. I also added a margin of 18 pixels, just to be sure the whole face is included.
```
(tensor([[[ 0.9023,  0.9180,  0.9180,  ...,  0.8398,  0.8242,  0.8242],
          [ 0.9023,  0.9414,  0.9492,  ...,  0.8555,  0.8320,  0.8164],
          [ 0.9336,  0.9805,  0.9727,  ...,  0.8555,  0.8320,  0.7930],
          ...,
          [-0.7070, -0.7383, -0.7305,  ...,  0.4102,  0.3320,  0.3711],
          [-0.7539, -0.7383, -0.7305,  ...,  0.3789,  0.3633,  0.4102],
          [-0.7383, -0.7070, -0.7227,  ...,  0.3242,  0.3945,  0.4023]],

         [[ 0.9492,  0.9492,  0.9492,  ...,  0.9336,  0.9258,  0.9258],
          [ 0.9336,  0.9492,  0.9492,  ...,  0.9492,  0.9336,  0.9258],
          [ 0.9414,  0.9648,  0.9414,  ...,  0.9570,  0.9414,  0.9258],
          ...,
          [-0.3633, -0.3867, -0.3867,  ...,  0.6133,  0.5352,  0.5820],
          [-0.3945, -0.3867, -0.3945,  ...,  0.5820,  0.5742,  0.6211],
          [-0.3711, -0.3633, -0.4023,  ...,  0.5273,  0.6055,  0.6211]],

         [[ 0.8867,  0.8867,  0.8945,  ...,  0.8555,  0.8477,  0.8477],
          [ 0.8789,  0.8867,  0.8789,  ...,  0.8789,  0.8633,  0.8477],
          [ 0.8867,  0.9023,  0.8633,  ...,  0.9023,  0.8789,  0.8555],
          ...,
          [-0.0352, -0.0586, -0.0977,  ...,  0.7617,  0.7070,  0.7461],
          [-0.0586, -0.0586, -0.0977,  ...,  0.7617,  0.7617,  0.8086],
          [-0.0352, -0.0352, -0.1211,  ...,  0.7227,  0.8086,  0.8086]]]), 0)
```
Perfect, the dataset returns a tensor. Let's visualize all the inputs; MTCNN has normalized them image-wise. The last three images of the last row are selfies of myself 🙂
Our data pipeline is ready. To compare faces and find out if two are similar, we need a way to encode them in a vector space where, if two faces are similar, the two vectors associated with them are also similar.
We can use one model trained on one of the famous face datasets, such as vgg_face2, and use the output of the last layer (latent space) before the classification head as the encoder.
A model trained on a faces dataset must have learned important features about the inputs. The last layer (just before the fully connected layers) encodes the high-level features of these images. Thus, we can use it to embed our inputs in a vector space where, hopefully, similar images are close to each other.
In detail, we are going to use an Inception-ResNet v1 trained on the vggface2 dataset. The embedding space has 512 dimensions.
Perfect, we had 8 images and we obtained 8 embedding vectors, one per image.
To compare our vectors, we can use cosine similarity to see how close they are to each other. Cosine similarity outputs a value in [-1, 1]; in the trivial case where the two compared vectors are identical, their similarity is 1. So the closer to 1, the more similar the two faces.
We can now find all the distances between each pair in our dataset.
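As a sanity check, cosine similarity and the pairwise comparison can be written in a few lines of numpy:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a.b / (|a| |b|), always in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_cosine(X):
    # Normalize each row, then one matrix product gives every pair at once.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

v = np.array([1.0, 0.0])
print(cosine_similarity(v, v))                     # 1.0 (identical vectors)
print(cosine_similarity(v, np.array([0.0, 1.0])))  # 0.0 (orthogonal vectors)
```

Calling pairwise_cosine on the (8, 512) embedding matrix yields the full 8x8 similarity table in one shot.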
Apparently, I am not very similar to Matt or Leo, but they have something in common!
We can go further and run PCA on the embeddings to project the images onto a 2-D plane.
Take this image with a grain of salt. We are compressing 512 dimensions into 2, so we are losing a lot of information.
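For completeness, a 2-D PCA projection can be computed with a plain SVD; the random matrix below just stands in for the real (8, 512) embeddings:

```python
import numpy as np

def pca_2d(X):
    # Center the data, then project onto the two top principal axes
    # (the first two right-singular vectors of the centered matrix).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T  # shape (n_samples, 2)

X = np.random.default_rng(0).normal(size=(8, 512))  # stand-in embeddings
print(pca_2d(X).shape)  # (8, 2)
```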
Okay, we have a way to find faces and to check whether they are similar to each other. Now we can create our face unlock algorithm.
My idea is to take n images of the allowed person, find the center of their embeddings in the vector space, select a threshold, and check whether the cosine similarity between the center and the embedding of a new image is above or below it.
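The unlock rule sketched above fits in a few lines of numpy. The 0.8 threshold and the tiny 4-d "embeddings" are made up for the demo; a real threshold should be tuned on held-out images:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_center(embeddings):
    # Center = mean of the n enrolled embeddings of the allowed person.
    return np.mean(embeddings, axis=0)

def unlock(center, new_embedding, threshold=0.8):
    # Hypothetical threshold: unlock only if the new face is close enough.
    return cosine_similarity(center, new_embedding) >= threshold

# Toy 4-d embeddings of the enrolled person.
enrolled = np.array([[1.0, 0.1, 0.0, 0.0],
                     [0.9, 0.0, 0.1, 0.0],
                     [1.1, 0.0, 0.0, 0.1]])
center = build_center(enrolled)

print(unlock(center, np.array([1.0, 0.05, 0.05, 0.0])))  # True: close to the center
print(unlock(center, np.array([0.0, 1.0, 0.0, 0.0])))    # False: a different face
```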
Let’s test it!
People say I look like Rio from La Casa de Papel.
The similarity score was higher than the previous image, so I guess it is true!
Let’s try with a new selfie of myself
It worked, I am in.
We have seen an appealing way to create a face unlock algorithm using only 2D data (images). It relies on a neural network to encode cropped faces in a high-dimensional vector space where similar faces are close to each other. However, I don't know how the model was trained, and it may be easy to fool (even though in my experiments the algorithm worked well).
What if the model was trained without data augmentation? Then simply flipping an image of the same person could disrupt its latent representation.
A more robust training routine would be a self-supervised one (something like BYOL) that relies heavily on data augmentation.
Thank you for reading.