A Deep Learning Examination of Facial Attraction

Source: Deep Learning on Medium


Ryan Hedges

Introduction

For this project we are going to explore the application of Convolutional Neural Nets (ConvNets) to the analysis of faces, specifically a face’s perceived attractiveness. The face is a complicated image, but ConvNets have proven valuable for many facial image tasks, such as face recognition and facial expression classification. We are going to put a ConvNet to the task of breaking down facial images with the specific objective of assessing their degree of attractiveness. Additionally, we will use visualization techniques to explore which facial features are most helpful to the net in making its decisions.

The Data

For this project we will be using the CelebA dataset, found at the following link:

Figure 1

CelebA contains over 200K images of over 10K individual subjects. Additionally, the dataset comes with 40 binary annotations, capturing characteristics such as gender, the presence of a beard, hair color, and our attribute of interest for this project: attractive (Y/N). This leads us to our next, extremely important section: a disclaimer regarding the attractive attribute.

Attractiveness Disclaimer

The results of the neural net’s attractiveness scorings are purely a representation of the data that the net is trained on. Our attractiveness training data was scored by an unknown entity and is therefore not a reflection of the perceptions of the author of this blog (me) or the general public, and it should in no way be used to pass judgement on any subjects. If the dataset contains bias, which it probably does in certain ways, so will the model. These scorings should not be generalized, and caution should be exercised in all interpretations of the model output. Additionally, it’s important to acknowledge the limitation imposed by the binary nature of the attractiveness data. An image is labeled either attractive or not attractive, with nothing in between, which limits the precision of the model.

Data Exploration

The CelebA dataset can provide endless entertainment by sifting through photos of many familiar faces. We will be examining the photos through the lens of attractiveness as defined by the provided binary annotations. Figure 2 previews a few select images of faces labeled as attractive (top row) and a few that are labeled as not attractive (bottom row). As a hater of the New York Yankees, I have no qualms with Jason Giambi’s presence among the not attractive images.

Figure 2

Figure 3 depicts the distribution of our two labels. They are represented with roughly equal proportion, which will be a convenience to our model-building process.

Figure 3
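For readers following along, here is a minimal sketch of loading the annotations and checking the label balance, assuming the standard list_attr_celeba.txt layout (an image-count line, a header line of attribute names, then one row of -1/1 flags per image):

```python
import pandas as pd

# Load the 40 binary annotations. pandas treats the first data column
# (the image filename) as the index because the header line has one
# fewer field than the data rows.
attrs = pd.read_csv("list_attr_celeba.txt", sep=r"\s+", skiprows=1)
attrs = attrs.replace(-1, 0)  # recode -1/1 flags as 0/1

# Check the balance of our label of interest.
print(attrs["Attractive"].value_counts(normalize=True))
```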

Methodology

We will train a Convolutional Neural Network (CNN) on the data to assess whether an image is classified as attractive or not. CNNs use a sequence of layers of learned filters that capture the hierarchical structure of images. The model then learns to use these geometric representations to make its classification prediction in the final layer of the network.

Prior to diving into the construction and training of our neural net, we must first partition our data into a training set, a validation set, and a final test set. Note that the partitions are created using the specific subjects, not the individual images. We do not want individuals that the model was trained on to also appear in the set that we use to test our model.
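A sketch of such a subject-level split, assuming the identity_CelebA.txt mapping that ships with the dataset (the 80/10/10 proportions here are illustrative, not necessarily the exact ones used):

```python
import numpy as np
import pandas as pd

# identity_CelebA.txt maps each image to a subject ID
# (one "filename identity" pair per line).
identity = pd.read_csv("identity_CelebA.txt", sep=r"\s+", header=None,
                       names=["image", "person_id"])

# Shuffle the unique subjects, then split on people rather than photos,
# so no individual appears in more than one partition.
rng = np.random.default_rng(42)
people = identity["person_id"].unique()
rng.shuffle(people)

n = len(people)
train_ids = set(people[:int(0.8 * n)])
val_ids = set(people[int(0.8 * n):int(0.9 * n)])
test_ids = set(people[int(0.9 * n):])

train_imgs = identity.loc[identity["person_id"].isin(train_ids), "image"]
val_imgs = identity.loc[identity["person_id"].isin(val_ids), "image"]
test_imgs = identity.loc[identity["person_id"].isin(test_ids), "image"]
```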

Constructing the architecture of the neural net was an iterative process that alternated between increasing model complexity and taking measures to combat overfitting. We began with a simple 3-layer architecture to gauge how daunting the prediction task would be. This model overfit after only a few epochs of training, and its maximum validation accuracy was 72%.

As facial images are a complicated composition of intricate curves, edges, and colors, a deeper model is required for more satisfactory performance. The final net was composed of a convolutional base containing five convolutional layers of 32, 64, 64, 128, and 256 filters, each followed by a MaxPooling downsampling layer. The Dense classification layer contained 512 neurons. To combat overfitting, a dropout rate of 0.3 was added to the Dense layer, and L2 regularization was added to the final three convolutional layers. An early stopping mechanism was used to interrupt training once the model’s validation accuracy plateaued.
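For the curious, here is a Keras sketch of the architecture just described. The kernel sizes, L2 strength, optimizer, and dropout placement are assumptions filled in for illustration (the input shape assumes CelebA’s native 218×178 images):

```python
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.callbacks import EarlyStopping

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(218, 178, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # L2 regularization on the final three convolutional layers.
    layers.Conv2D(64, (3, 3), activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(256, (3, 3), activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.3),                    # dropout for the Dense layer
    layers.Dense(512, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(attractive)
])

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Interrupt training once validation accuracy plateaus.
early_stop = EarlyStopping(monitor="val_accuracy", patience=3,
                           restore_best_weights=True)
```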

Results and Analysis

The net achieved its best out-of-sample performance after 23 training epochs, reaching a classification accuracy of 80% on the validation data. On the final test set, the net’s accuracy was 80.4%. Figure 4 depicts how accuracy and loss developed over the training epochs.

Figure 4
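Plots like Figure 4 can be produced directly from the History object that model.fit returns; a minimal sketch:

```python
import matplotlib.pyplot as plt

# history is the object returned by model.fit(...).
epochs = range(1, len(history.history["accuracy"]) + 1)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs, history.history["accuracy"], label="training")
plt.plot(epochs, history.history["val_accuracy"], label="validation")
plt.title("Accuracy"); plt.xlabel("epoch"); plt.legend()

plt.subplot(1, 2, 2)
plt.plot(epochs, history.history["loss"], label="training")
plt.plot(epochs, history.history["val_loss"], label="validation")
plt.title("Loss"); plt.xlabel("epoch"); plt.legend()
plt.show()
```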

One interesting aspect of the results is the distribution of prediction scores. How confident was the net in its predictions? Figure 5 displays a histogram of the net’s prediction confidence, broken out by the true label. The shapes of the two distributions differ starkly.

Figure 5
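A histogram like Figure 5 takes only a few lines, assuming the test partition is held in arrays (test_images and test_labels are placeholder names for however the data was batched):

```python
import matplotlib.pyplot as plt

# test_images: float array of shape (n, 218, 178, 3), scaled to [0, 1];
# test_labels: numpy array of 0/1 labels.
probs = model.predict(test_images).ravel()  # sigmoid P(attractive)

plt.hist(probs[test_labels == 1], bins=50, alpha=0.6, label="attractive")
plt.hist(probs[test_labels == 0], bins=50, alpha=0.6, label="not attractive")
plt.xlabel("predicted P(attractive)")
plt.ylabel("count")
plt.legend()
plt.show()
```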

A main question of interest is how the model makes its decisions. Which facial features, or parts of the images, are most useful to the net in arriving at its output? We will use a technique called Class Activation Mapping (CAM) to investigate which parts of the face the net found useful in making its decision. For this exercise, we will use the images that the net most confidently labeled as attractive and those it most confidently labeled as not attractive.
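Chollet’s book (see References) walks through this kind of visualization; below is a minimal Grad-CAM-style sketch under the assumptions noted in the comments:

```python
import numpy as np
import tensorflow as tf

# "conv2d_4" is a placeholder for the name of the model's last
# convolutional layer; check model.summary() for the actual name.
last_conv = model.get_layer("conv2d_4")
grad_model = tf.keras.Model(model.inputs, [last_conv.output, model.output])

def grad_cam(img):
    """img: one image of shape (218, 178, 3), scaled to [0, 1]."""
    x = tf.convert_to_tensor(img[np.newaxis])
    with tf.GradientTape() as tape:
        conv_out, pred = grad_model(x)
        score = pred[:, 0]  # sigmoid score for "attractive"
    grads = tape.gradient(score, conv_out)
    # Weight each feature map by the mean gradient of the score w.r.t. it,
    # then keep only the positively contributing regions.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```

For images the net labeled not attractive, the same sketch applies with the score negated (e.g. score = 1.0 - pred[:, 0]), since the single sigmoid output encodes both classes.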

10 most confidently predicted as attractive:

Figure 6

Below are the CAM outputs for the top attractive:

Figure 7

10 most confidently predicted as not attractive:

Figure 8

Below are the CAM outputs for the top not attractive:

Figure 9

The CAM outputs are incredibly interesting, as they point to the specific areas of the image that the net found most useful in predicting how the image was labeled. Across our samples from both classes, the forehead, mouth, and cheeks appear to be the most common areas of the face to exhibit high activation.

References/Citations:

Chollet, François. Deep Learning with Python. Manning Publications, pp. 172–176.