Source: Deep Learning on Medium
Today, before getting into the details of the project, let's talk about movies for a bit. Every movie has a number of actors, and out of all of them, you may know some and not others. So what if you had to figure out who all the actors in a particular scene are, so you could tell a friend how well (or how badly) they acted? You don't want to put yourself through the havoc of searching for every actor in a search engine, right? Let's see if advances in computer vision and deep learning can help us. Thank god we have a good number of libraries, so we don't have to do everything from scratch :)
So let me consider the famous "That's my secret, Captain: I'm always angry" scene from The Avengers. We have five actors in this scene (Mark Ruffalo, Scarlett Johansson, Robert Downey Jr., Jeremy Renner and Chris Evans). For now, let us not consider Chris Hemsworth because that dude got just one second of screen presence.
After a bit of research, I figured out a way to do this: why not iterate over every frame in the video and find each actor in each frame? OpenCV will help us get this done. Create a new Python file and paste the below code.
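The frame-extraction step can be sketched like this. The filename avengers_scene.mp4 and the 25 fps rate are my assumptions for illustration; substitute your own video and its actual frame rate.

```python
def keep_indices(total_frames, fps):
    """Indices of the frames we save: one frame per second of video."""
    return [i for i in range(total_frames) if i % fps == 0]

def extract_frames(video_path, fps=25, out_prefix="frame"):
    """Walk the video and write one JPEG per second to the current folder."""
    import cv2  # pip install opencv-python; imported here so keep_indices() runs without it
    video = cv2.VideoCapture(video_path)
    count, saved = 0, 0
    while True:
        success, frame = video.read()
        if not success:       # end of video
            break
        if count % fps == 0:  # keep one frame per second
            cv2.imwrite("%s_%d.jpg" % (out_prefix, saved), frame)
            saved += 1
        count += 1
    video.release()
    return saved

# extract_frames("avengers_scene.mp4")  # hypothetical filename
```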
Make sure your Python file and video are in the same folder; otherwise, specify the full path instead of just the video name.
Now you have images where each image is a frame. They will look like this…
Now place all these frame images in a folder so that we can iterate through them later. Then create a new folder with one image of every actor, which we will use for training. These are the images of actors I used for training.
Import the necessary libraries.
argparse lets us pass our data as arguments on the command line. We use paths from imutils to deal with image paths. face_recognition is the library we are going to use; if you are not familiar with it, I suggest you go through this blog of mine.
The argument names are traindata and checkdata (I know, that's odd). Through traindata we pass the original images of the actors, and through checkdata we pass our frame images. imagePaths is a list of paths to the images in the training data, and video_imagePaths is a list of paths to the frames from our video. They will look like this.
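A sketch of that setup, with the folder names actors and frames assumed just for illustration. imutils' paths.list_images does roughly what the stdlib-only list_images below does: collect every image file under a folder.

```python
import argparse
import os

ap = argparse.ArgumentParser()
ap.add_argument("--traindata", required=True, help="folder with one clear image per actor")
ap.add_argument("--checkdata", required=True, help="folder with the extracted frames")
# Sample values for illustration; on the command line just call parse_args()
args = vars(ap.parse_args(["--traindata", "actors", "--checkdata", "frames"]))

def list_images(folder):
    """Rough stand-in for imutils.paths.list_images: every image file in a folder."""
    exts = (".jpg", ".jpeg", ".png")
    return [os.path.join(folder, f) for f in sorted(os.listdir(folder))
            if f.lower().endswith(exts)]

# imagePaths = list_images(args["traindata"])
# video_imagePaths = list_images(args["checkdata"])
```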
Now, let us iterate through the five actor images in imagePaths, create an encoding for each, and append them to a list called Encodings.
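A sketch of that loop. face_recognition.face_encodings returns one 128-number encoding per detected face, and the training images are assumed to contain exactly one face each; the library is imported inside the function so the small helper above it runs on its own.

```python
def first_face(face_encodings):
    """Training images should contain exactly one face; take it or complain."""
    if not face_encodings:
        raise ValueError("no face found in training image")
    return face_encodings[0]

def build_encodings(imagePaths):
    import face_recognition  # imported lazily so first_face() above runs without it
    Encodings = []
    for imagePath in imagePaths:
        image = face_recognition.load_image_file(imagePath)
        Encodings.append(first_face(face_recognition.face_encodings(image)))
    return Encodings
```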
Wait, where the hell did this concept of encodings come from? Fasten your seat belts, let's get into a flashback.
Back in 2015, researchers at Google introduced a project called FaceNet, which is based on one-shot learning. Previously, one had to train on a lot of images of a person to recognise him/her in a new picture, and the accuracy was not that great. Training on many images of one person is also not practical for real-time projects. Everyone understood the need for one-shot learning, where a single clear, straight-on image of a person is enough to train on and can later be used to recognise them in other pictures.
This is when the concept of face encodings came into the picture.
The main idea of encodings is to map a face into a 128-dimensional Euclidean space, or simply a list of 128 numbers per face, so that the encodings of the same person in two different pictures are almost the same. This mapping is tuned using triplet loss.
Triplet loss is actually a simple concept with some math. During training, you have three images: 1. Anchor, 2. Positive, 3. Negative. The anchor image is the image of the person you would like to verify, the positive image is a different picture of the same person, and the negative image is a picture of a different person.
Now that we have encodings for the three pictures, we find the Euclidean distance of the positive and negative images from the anchor image. The distance between the positive and anchor encodings should be small, and the distance between the negative and anchor encodings should be large. So, what is this Euclidean distance? It's quite easy. Consider two lists L1 and L2.
L1 = [x1, x2, x3, x4] and L2 = [y1, y2, y3, y4]
Euclidean_distance(L1, L2) = square_root((x1-y1)**2 + (x2-y2)**2 + (x3-y3)**2 + (x4-y4)**2)
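That formula in a few lines of Python, for lists of any matching length:

```python
def euclidean_distance(L1, L2):
    """Square root of the sum of squared coordinate-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(L1, L2)) ** 0.5

# A classic 3-4-5 right triangle: the distance comes out to exactly 5
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```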
Triplet loss tunes these encodings so that our two conditions are met: positive and anchor end up close together, and negative and anchor end up far apart. This core idea of encoding a face and tuning the encodings with triplet loss is what enables one-shot recognition.
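The FaceNet paper folds both distance conditions into a single expression. A minimal sketch, using plain distances for readability (the paper uses squared distances), with the 0.2 margin being an assumption, since it is a tunable hyperparameter:

```python
def euclidean_distance(L1, L2):
    return sum((x - y) ** 2 for x, y in zip(L1, L2)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the positive is closer than the negative by at least the margin."""
    d_pos = euclidean_distance(anchor, positive)  # should be small
    d_neg = euclidean_distance(anchor, negative)  # should be large
    return max(d_pos - d_neg + margin, 0.0)
```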
Now let's get back to the coding part. Encodings is a list of the five actors' encodings, and each actor's encoding is a list of 128 numbers. It will look like this.
Let us create a list of the actors' names. You could also create a dictionary where the keys are the actors' names and the values are their face encodings.
The small function below extracts the image number from the full path. We sort on that number, since video_imagePaths doesn't return the images in order. The images must be in order, since they are frames of the video and we just can't afford to shuffle them.
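That helper might look like this, assuming frame filenames of the form frame_12.jpg:

```python
import os
import re

def frame_number(path):
    """Pull the integer out of a path like 'frames/frame_12.jpg'."""
    name = os.path.basename(path)
    return int(re.search(r"\d+", name).group())

video_imagePaths = ["frames/frame_10.jpg", "frames/frame_2.jpg", "frames/frame_0.jpg"]
video_imagePaths.sort(key=frame_number)  # numeric order, not lexicographic
```

Sorting on the extracted integer matters because a plain string sort would put frame_10 before frame_2.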
In the above code, you sort the frame paths, then load one frame per iteration and create encodings for all the faces in that frame.
If there is a face in the frame, we compare its encoding with the Encodings list. This returns a list of boolean values, where True represents a match and False a mismatch. You can then find the index of True, look up the corresponding actor name in actors_list, and add that actor to final_list. If there are no faces in the image, append "no one" to final_list. See the code below for better understanding.
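A sketch of that matching logic, split so the boolean-list handling is visible on its own. compare_faces returns one boolean per known encoding; this sketch keeps only the first detected face per frame, which is a simplification.

```python
def actor_in_frame(matches, actors_list):
    """matches: booleans from face_recognition.compare_faces(Encodings, face_encoding)."""
    if True in matches:
        return actors_list[matches.index(True)]
    return "no one"

def label_frames(video_imagePaths, Encodings, actors_list):
    import face_recognition  # lazy import so actor_in_frame() above runs without it
    final_list = []
    for path in video_imagePaths:
        image = face_recognition.load_image_file(path)
        faces = face_recognition.face_encodings(image)
        if not faces:
            final_list.append("no one")  # empty frame
            continue
        matches = face_recognition.compare_faces(Encodings, faces[0])
        final_list.append(actor_in_frame(matches, actors_list))
    return final_list
```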
Print your final_list and it will look like this…
The index of this final_list is the frame number, and the value at that index is the actor present in that frame. So, in frame 0 (second 0), we have Scarlett Johansson; in frame 1 (second 1), we have Mark Ruffalo, and so on.
Now that we have our final list, the task is quite simple. Imagine a list of repeated numbers where you need to find the first and last occurrence of every number, along with its count. The same applies here: you have a list of actor names, and you need to find the first and last frame each actor appeared in, plus the total number of frames they are present in. The total number of frames is their screen time in the scene, and the first and last frames they appear in are their first and last appearances in the scene.
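That counting pass can be sketched as a single loop over final_list. Since we extracted one frame per second, the count doubles as seconds of screen time.

```python
def screen_time(final_list):
    """First frame, last frame, and frame count for every actor in final_list."""
    stats = {}
    for frame, actor in enumerate(final_list):
        if actor == "no one":
            continue  # empty frames don't count towards anyone's screen time
        if actor not in stats:
            stats[actor] = {"first": frame, "last": frame, "count": 0}
        stats[actor]["last"] = frame
        stats[actor]["count"] += 1
    return stats
```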
Let's get a bit creative and present the output as a .txt file. Below is the code to do this, and it is really self-explanatory.
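A sketch of the report writer, assuming the per-actor stats dictionary from the counting step (first/last frame and frame count) and an output name of screen_time.txt:

```python
def report_lines(stats):
    """One human-readable line per actor."""
    return ["%s: first seen at second %d, last seen at second %d, "
            "on screen for %d seconds" % (a, s["first"], s["last"], s["count"])
            for a, s in stats.items()]

def write_report(stats, path="screen_time.txt"):
    with open(path, "w") as f:
        f.write("\n".join(report_lines(stats)) + "\n")
```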
The .txt file will look like this…
The performance is very good but not excellent; there are a few exceptions. Consider these two frames.
In the first frame, only Jeremy Renner was recognised but not Chris Evans, and in the second image Mark Ruffalo was not recognised. Why did this happen?
While creating encodings, we can pass either "hog" (Histogram of Oriented Gradients) or "cnn" (Convolutional Neural Network) as the face-detection model. CNN is more accurate than HOG, but it is slower and needs a GPU (Graphics Processing Unit) to run well. Since most of us don't have a GPU, it's better to stick with HOG, which is the default for encodings.
One real-world application of this project can be found in Amazon Prime Video, where every actor in a scene is listed, along with their picture, on the left side of the screen.