Deep Learning for Cosmetics

At Mira, we build tools that empower beauty enthusiasts to learn, gather inspiration and make informed buying decisions. In conversations with roughly 75 beauty consumers, we’ve learned that one of the foremost challenges that a consumer faces in finding the right products and techniques is identifying authentic and authoritative voices who can speak to their individual concerns.

In this blog post, we’ll demonstrate how we can use computer vision to solve a particularly poignant instance of this problem: finding influencers, images and videos that address a specific eye shape and complexion. Along the way, we’ll illustrate how three simple yet powerful ideas — geometric transformations, the triplet loss function and transfer learning — allow us to solve a variety of difficult inference problems with minimal human input.

Background: Eye Shape and Complexion

A sample of useful eye classifications, from Smashbox

Finding the right products and techniques for your eyes is notoriously tricky — every individual has a unique shape and complexion. While Birchbox and others have published helpful visual guides, one of the things we’ve learned from our community of beauty enthusiasts is that people typically seek advice from authentic, independent voices in their community, and that finding quality advice from others with similar eye concerns is challenging even for experts.

Techniques with the same product can vary wildly across eye shapes. Adapted from Makeup.com

But what if the characteristics of your eyes, along with the countless other facets that make you unique, seamlessly informed your beauty browsing and buying decisions?

The Problem

Let’s formalize the problem: given a set of images of faces, along with a small number of human-labeled images (eye color, lid shape, etc.), find an intuitive visual similarity metric between eyes (“this beauty guru has eyes similar to yours!”) and a classifier that captures the human-labeled properties. In this blog post, we will focus on eye similarity; a follow-up will address classification tasks.

Jackie Aina, aka LaBronze James, #slaying with a smokey eye

Raw images are not well suited to either computing visual similarity or performing classification. They can contain many superficial similarities (e.g. similar makeup applied, or different skin tones washed out by strong lighting) that are unrelated to eye structure and complexion. Furthermore, raw images live in a high-dimensional space, requiring a large amount of labeled training data for classification tasks (see the curse of dimensionality).

Similar eyes when pixels are compared directly; note that eyeshadow, lighting conditions and gaze direction are consistent, but eye color/complexion vary.
The challenges of working with raw images: while clearly quite different to the human eye, these two images are relatively close when their raw data is compared. (Uses Euclidean distance between raw pixels.)
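
To make the issue concrete, here is a minimal sketch of the naive comparison above: flatten the raw pixels and take the Euclidean distance between them. The image paths and the 64×128 resize are placeholder choices, and the two images are assumed to have the same number of channels.

```python
import numpy as np
from skimage import io, transform

def raw_pixel_distance(path_a, path_b, size=(64, 128)):
    """Euclidean distance between two images, compared pixel-by-pixel."""
    a = transform.resize(io.imread(path_a), size, anti_aliasing=True)
    b = transform.resize(io.imread(path_b), size, anti_aliasing=True)
    return np.linalg.norm(a.ravel() - b.ravel())

# Two structurally different eyes can still score as "close" under this
# metric when eyeshadow, lighting and gaze direction happen to match.
```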

Our primary challenge lies in deriving low-dimensional and dense mathematical representations of eye images — known as embeddings — that capture the qualities that we care about and nothing more. That is, these embeddings should intentionally ignore:

  • Eye pose/gaze direction
  • Specific lighting conditions (and insta filters, of course)
  • Whatever makeup is already applied

When eye embeddings are trained with the triplet loss function, the model learns to ignore superficial, irrelevant features (e.g. the applied eyeshadow or eye pose in the images above) and focus on what matters.
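
We’ll dig into training details in a follow-up post, but the triplet loss itself is compact. Here is a minimal PyTorch-style sketch; the 0.2 margin is an illustrative value, not necessarily the one used in production:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor embedding toward the positive (a similar eye)
    and push it away from the negative by at least `margin`."""
    pos_dist = F.pairwise_distance(anchor, positive)  # distance to a similar eye
    neg_dist = F.pairwise_distance(anchor, negative)  # distance to a dissimilar eye
    return torch.clamp(pos_dist - neg_dist + margin, min=0.0).mean()
```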

Image Normalization via Projective Transform

We can eliminate an entire class of superficial similarities with a simple preprocessing step: the projective transform.

While cropped images of eyes will exhibit many obvious structural differences (e.g. the eye isn’t centered, or is rotated due to head tilt), the projective transformation allows us to “warp” images such that the same eye landmarks are guaranteed to occupy the same coordinates.

This is explained well in the scikit-image documentation. With a little bit of linear algebra mathemagic, we can warp an image such that a set of points map to a new, desired shape, rotating and stretching the image in the process:

Using a projective transformation, we can warp the top image such that the four red points become a rectangle, “straightening” the text. We apply a similar method to normalize images of eyes. (from the scikit-image documentation)
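
A minimal sketch of that figure using scikit-image’s API (the corner coordinates below are illustrative; the full walkthrough is in the scikit-image documentation):

```python
import numpy as np
from skimage import data, transform

# Corners of the rectangle we want in the output, and the (illustrative)
# matching corners of the tilted text in the input image.
src = np.array([[0, 0], [0, 50], [300, 50], [300, 0]])
dst = np.array([[155, 15], [65, 40], [260, 130], [360, 95]])

tform = transform.ProjectiveTransform()
tform.estimate(src, dst)  # solve for the 3x3 homography mapping src -> dst

# warp() expects a map from output coordinates to input coordinates,
# which is exactly the transform we just estimated.
warped = transform.warp(data.text(), tform, output_shape=(50, 300))
```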

We can apply the same technique to normalize eye images, rotating and stretching them into a more consistent form. We detect facial landmarks using dlib, crop the eyes, and warp them to ensure alignment and consistency. This preprocessing step significantly improves our ability to converge on embeddings that are invariant to head tilt and pose. (A detailed overview of this method, applied to general face alignment, is available here.)
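
Putting the pieces together, a sketch of the normalization step might look like the following. The landmark indices, template coordinates, output size and predictor path are illustrative assumptions rather than our production values.

```python
import dlib
import numpy as np
from skimage import transform

detector = dlib.get_frontal_face_detector()
# The 68-point landmark model distributed with dlib (path is a placeholder).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

EYE_POINTS = list(range(36, 42))  # six landmarks outlining one eye

def normalized_eye(image, out_shape=(32, 64)):
    """Detect the face, grab the eye landmarks and warp them onto a
    fixed template so every crop shares the same alignment."""
    face = detector(image, 1)[0]
    shape = predictor(image, face)
    src = np.array([[shape.part(i).x, shape.part(i).y] for i in EYE_POINTS], float)

    # Hypothetical template: where the six landmarks should land in the crop.
    dst = np.array([[4, 16], [20, 6], [44, 6], [60, 16], [44, 26], [20, 26]], float)

    tform = transform.ProjectiveTransform()
    tform.estimate(dst, src)  # map output (template) coords -> input image coords
    return transform.warp(image, tform, output_shape=out_shape)
```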