Original article was published by Tim Lee on Artificial Intelligence on Medium
Moviegoer: Vision Features — Faces
This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).
We’ve been using computer vision to draw conclusions about many aspects of a film’s visuals, but we can learn a great deal from faces in particular. Faces contain lots of “data”, the most apparent being a character’s current emotional state. They also carry basic demographic information about characters (age, race, gender). We can also infer a few cinematography- and structure-related features from how large a face is and where it sits in the frame. Let’s take a look at what we can learn.
We’ve previously figured out how to find self-introductions, like “I’m Ben”. When we read a self-introduction like this in the subtitles, we can generally assume that the onscreen face is Ben. Using the Python library face_recognition, we can save his facial encoding and recognize Ben’s face whenever he’s onscreen.
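A minimal sketch of this idea, assuming the subtitle step has already handed us a name and a frame in which exactly one face is visible. The face_recognition calls (load_image_file, face_encodings) are the library's own; the helper names encode_single_face and identify are ours for illustration, and the 0.6 tolerance simply mirrors the library's default for compare_faces.

```python
import math


def encode_single_face(image_path):
    """Return the encoding of the only face in an image, or None."""
    # Third-party library; imported lazily so the helpers below work without it.
    import face_recognition

    image = face_recognition.load_image_file(image_path)
    encodings = face_recognition.face_encodings(image)
    # Only trust the frame if exactly one face is onscreen.
    return encodings[0] if len(encodings) == 1 else None


def face_distance(a, b):
    """Euclidean distance between two encodings (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(known_faces, encoding, tolerance=0.6):
    """Return the closest known name within `tolerance`, else None.

    known_faces maps a character name to a saved encoding.
    """
    best_name, best_dist = None, tolerance
    for name, known in known_faces.items():
        d = face_distance(known, encoding)
        if d <= best_dist:
            best_name, best_dist = name, d
    return best_name
```

In use, we would store `known_faces["Ben"] = encode_single_face(frame_path)` for the frame that coincides with the “I’m Ben” subtitle (the frame path is a placeholder), then call `identify` on the encodings of later frames.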
Even if we don’t have a self-introduction, we can still identify when a unique face appears in multiple frames. Using hierarchical agglomerative clustering (HAC), we can cluster similar face encodings and find the frames in which each face appears. In one example scene, we haven’t been introduced to Alice, but we know that her face appears in about half of the frames of the scene, across from Ben. (He mentions her name several times during this scene, but more on that in a future NLP post.)
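To make the clustering step concrete, here is a small pure-Python sketch of agglomerative clustering with average linkage. It is not the project's actual code; in practice a library implementation such as scikit-learn's AgglomerativeClustering would likely be used instead. The function name cluster_faces and the 0.6 distance threshold are assumptions for illustration.

```python
import math


def _dist(a, b):
    """Euclidean distance between two encodings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _average_linkage(encodings, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(_dist(encodings[i], encodings[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_faces(encodings, threshold=0.6):
    """Repeatedly merge the two closest clusters until no pair is closer
    than `threshold`. Returns a list of clusters, each a list of indices."""
    clusters = [[i] for i in range(len(encodings))]
    while len(clusters) > 1:
        best = None  # (distance, position_a, position_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = _average_linkage(encodings, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are too far apart to be the same face
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters
```

If each encoding index corresponds to a frame number, every returned cluster is the list of frames in which one (still unnamed) face appears.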
Face Counting and Primary Character
We can count the number of faces found in a frame. We can also determine whether a face is the frame’s “primary character”. If there’s only one face, it’s the primary character. If there are multiple faces, we compare their sizes: if one face is significantly larger than the others, we designate it the primary character of the frame.
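The rule above can be sketched in a few lines. The boxes are assumed to be (top, right, bottom, left) pixel tuples, the order returned by face_recognition.face_locations. The source only says “significantly larger”, so the 1.5x area ratio below is an assumed cutoff, and the function names are ours.

```python
def face_area(box):
    """Area in pixels of a (top, right, bottom, left) bounding box."""
    top, right, bottom, left = box
    return max(0, bottom - top) * max(0, right - left)


def primary_face(boxes, ratio=1.5):
    """Return the index of the frame's primary face, or None.

    `ratio` is an assumed cutoff: the largest face must have at least
    this many times the area of the second largest.
    """
    if not boxes:
        return None
    if len(boxes) == 1:
        return 0  # a lone face is always the primary character
    ranked = sorted(range(len(boxes)), key=lambda i: face_area(boxes[i]), reverse=True)
    largest = face_area(boxes[ranked[0]])
    runner_up = face_area(boxes[ranked[1]])
    return ranked[0] if largest >= ratio * runner_up else None
```

The face count for the frame is simply `len(boxes)`.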
Mirrored Shots (Shot/Reverse-Shot)
Generally, two-character conversations use the shot/reverse-shot model. We see a medium close-up of Character A on the left side of the screen, and then we cut to a medium close-up of Character B on the right side of the screen. Then we cut back and forth between A and B. The shots are usually mirrored: Character A’s face is the same size as Character B’s face. We can take advantage of this convention by looking for pairs of shots that feature two different characters with roughly equal face sizes, one at the left rule-of-thirds alignment point and the other at the right alignment point.
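Here is one way the mirrored-pair check might look, assuming each shot has already been reduced to its primary face. The ShotFace record, the 0.12 position tolerance, and the 20% size tolerance are all illustrative assumptions rather than values from the project.

```python
from dataclasses import dataclass


@dataclass
class ShotFace:
    character: str   # cluster label or name of the shot's primary face
    center_x: float  # horizontal face center as a fraction of frame width (0 to 1)
    height: float    # face height as a fraction of frame height


def side(face, tol=0.12):
    """Return 'left' or 'right' if the face sits near a rule-of-thirds line."""
    if abs(face.center_x - 1 / 3) <= tol:
        return "left"
    if abs(face.center_x - 2 / 3) <= tol:
        return "right"
    return None


def is_mirrored_pair(a, b, size_tol=0.2):
    """True when two shots show different characters on opposite thirds
    with face heights within `size_tol` (relative) of each other."""
    if a.character == b.character:
        return False
    if {side(a), side(b)} != {"left", "right"}:
        return False
    return abs(a.height - b.height) <= size_tol * max(a.height, b.height)
```

Running `is_mirrored_pair` over consecutive shots in a scene would flag the A/B alternation typical of a two-character conversation.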
To assist with dialogue attribution, we can check whether a character’s mouth is open.
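One common heuristic, sketched here as an assumption rather than the project's confirmed method, compares the gap between the lips to the thickness of the upper lip. The inputs are the 12-point “top_lip” and “bottom_lip” lists from face_recognition.face_landmarks; the index pairs assume that library's usual landmark ordering (outer edge first, then inner edge), and the 0.5 ratio is an arbitrary starting point.

```python
import math


def _mean_gap(points_a, points_b, pairs):
    """Average distance between paired (x, y) landmark points."""
    total = 0.0
    for i, j in pairs:
        (ax, ay), (bx, by) = points_a[i], points_b[j]
        total += math.hypot(ax - bx, ay - by)
    return total / len(pairs)


def mouth_is_open(top_lip, bottom_lip, ratio=0.5):
    """Guess whether a mouth is open from its lip landmarks.

    Assumes 12 points per lip, with indices 8 to 10 lying on the inner edge.
    """
    # Thickness of the upper lip: outer edge vs. inner edge.
    lip_thickness = _mean_gap(top_lip, top_lip, [(2, 10), (3, 9), (4, 8)])
    # Opening: inner edge of the upper lip vs. inner edge of the lower lip.
    opening = _mean_gap(top_lip, bottom_lip, [(8, 10), (9, 9), (10, 8)])
    return opening > ratio * lip_thickness
```

Normalizing by lip thickness keeps the check independent of how large the face is in the frame.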
Emotion and Demography
The deepface library can automatically predict a face’s age, gender, race, and emotional state. This analysis is very resource-intensive, so we’re using it sparingly for now.
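A rough sketch of what “sparingly” could look like: analyze one frame per shot rather than every frame. The sampling strategy and both helper names are assumptions for illustration. DeepFace.analyze is the library's entry point, but its return shape and key names have varied between releases, so the code below hedges on both.

```python
def frames_to_sample(shot_boundaries):
    """Pick the midpoint frame of each shot instead of analyzing every frame.

    shot_boundaries is a list of (first_frame, last_frame) pairs.
    """
    return [(first + last) // 2 for first, last in shot_boundaries]


def analyze_face(image_path):
    """Run deepface on one image and keep only the headline predictions."""
    # Third-party and heavy; imported lazily so sampling works without it.
    from deepface import DeepFace

    result = DeepFace.analyze(
        img_path=image_path,
        actions=["age", "gender", "race", "emotion"],
    )
    # Recent releases return a list with one dict per face;
    # older ones return a single dict. Handle both.
    first = result[0] if isinstance(result, list) else result
    return {
        "age": first.get("age"),
        "gender": first.get("dominant_gender", first.get("gender")),
        "race": first.get("dominant_race"),
        "emotion": first.get("dominant_emotion"),
    }
```

Sampling once per shot trades temporal detail for speed, which seems reasonable for age, gender, and race, though emotion can change within a shot.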