Deep Learning for Cosmetics
At Mira, we build tools that empower beauty enthusiasts to learn, gather inspiration and make informed buying decisions. In conversations with roughly 75 beauty consumers, we’ve learned that one of the foremost challenges that a consumer faces in finding the right products and techniques is identifying authentic and authoritative voices who can speak to their individual concerns.
In this blog post, we’ll demonstrate how we can use computer vision to solve a particularly poignant instance of this problem: finding influencers, images and videos that address a specific eye shape and complexion. Along the way, we’ll illustrate how three simple yet powerful ideas — geometric transformations, the triplet loss function and transfer learning — allow us to solve a variety of difficult inference problems with minimal human input.
Background: Eye Shape and Complexion
Finding the right products and techniques for your eyes is notoriously tricky — every individual has a unique shape and complexion. While Birchbox and others have published helpful visual guides, one of the things we’ve learned from our community of beauty enthusiasts is that people typically seek advice from authentic, independent voices in their community, and that finding quality advice from others with similar eye concerns is challenging even for experts.
But what if the characteristics of your eyes, along with the countless other facets that make you unique, seamlessly informed your beauty browsing and buying decisions?
Let’s formalize the problem: given a set of images of faces, along with a small number of human-labeled images (eye color, lid shape, etc.), find an intuitive visual similarity metric between eyes (“this beauty guru has eyes similar to yours!”) and a classifier that captures the human-labeled properties. In this blog post, we will focus on eye similarity; a follow-up will address classification tasks.
Raw images are not well suited to either computing visual similarity or performing classification. They can contain many superficial similarities (e.g. similar makeup applied, or different skin tones washed out by strong lighting) that are unrelated to eye structure and complexion. Furthermore, raw images live in a high-dimensional space, so classification tasks on them require a large amount of labeled training data (see the curse of dimensionality).
Our primary challenge lies in deriving low-dimensional and dense mathematical representations of eye images — known as embeddings — that capture the qualities that we care about and nothing more. That is, these embeddings should intentionally ignore:
- Eye pose/gaze direction
- Specific lighting conditions (and insta filters, of course)
- Whatever makeup is already applied
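To make the idea of such an embedding concrete, here is a minimal NumPy sketch of the triplet loss mentioned earlier: it trains an embedding by pushing an anchor closer to a “positive” (a similar eye) than to a “negative” (a dissimilar eye) by at least a margin. The vectors and margin value below are illustrative assumptions, not our actual training setup.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Squared-distance triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = np.sum((anchor - positive) ** 2)  # anchor-to-positive distance
    d_neg = np.sum((anchor - negative) ** 2)  # anchor-to-negative distance
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the positive sits near the anchor, the negative far away.
a = np.array([0.0, 1.0])
p = np.array([0.1, 0.9])
n = np.array([1.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0 — already separated by more than the margin
```

During training, this loss is minimized over many (anchor, positive, negative) triplets, which shapes the embedding space so that distance reflects the eye qualities we care about rather than lighting or makeup.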
Image Normalization via Projective Transform
We can eliminate an entire class of superficial similarities with a simple preprocessing step: the projective transform.
While cropped images of eyes will exhibit many obvious structural differences (e.g. the eye isn’t at the center, or is rotated due to head tilt, etc.), the projective transformation allows us to “warp” images such that the same eye landmarks are guaranteed to occupy the same coordinates.
This is explained well in the scikit-image documentation. With a little bit of linear algebra mathemagic, we can warp an image such that a set of points map to a new, desired shape, rotating and stretching the image in the process:
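Following the scikit-image documentation, the warp can be sketched in a few lines: estimate a `ProjectiveTransform` that maps four source landmarks onto a fixed rectangle, then resample the image through its inverse. The corner coordinates below are illustrative placeholders, not real landmarks.

```python
import numpy as np
from skimage import data, transform

image = data.astronaut()  # any RGB image works; this one ships with skimage

# Four points in the source image (x, y), and where we want them to land.
src = np.array([[120, 90], [380, 100], [390, 360], [110, 350]], dtype=float)
dst = np.array([[0, 0], [255, 0], [255, 255], [0, 255]], dtype=float)

tform = transform.ProjectiveTransform()
tform.estimate(src, dst)  # solve for the homography mapping src -> dst

# warp() expects the inverse map (output coords -> input coords).
warped = transform.warp(image, tform.inverse, output_shape=(256, 256))
print(warped.shape)  # (256, 256, 3)
```

Note that `warp` takes the *inverse* transform: for each output pixel it looks up where to sample in the source image, which avoids holes in the result.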
We can apply the same technique to normalizing eye images, rotating/stretching them to a more consistent form. We detect facial landmarks using dlib, crop the eyes, and warp them to ensure alignment and consistency. This preprocessing step significantly improves our ability to converge on embeddings that are invariant to head tilt/pose. (A detailed overview of this method, when applied to general face alignment, is available here)
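The eye-normalization step described above might look roughly like the sketch below: take landmark points around one eye and warp them onto fixed template coordinates. The template coordinates, output size, and the dlib model path shown in the comments are all illustrative assumptions rather than our production values.

```python
import numpy as np
from skimage import transform

# Fixed template (x, y): where four eye landmarks should land in a 64x128 crop.
EYE_TEMPLATE = np.array([[8, 32], [120, 32], [88, 16], [88, 48]], dtype=float)

def normalize_eye(image, eye_points):
    """Warp `image` so the four given eye landmarks land on EYE_TEMPLATE.

    eye_points: (4, 2) array of (x, y) coordinates, e.g. the outer corner,
    inner corner, and upper/lower lid midpoints of one eye.
    """
    tform = transform.ProjectiveTransform()
    tform.estimate(np.asarray(eye_points, dtype=float), EYE_TEMPLATE)
    return transform.warp(image, tform.inverse, output_shape=(64, 128))

# In production the landmarks would come from dlib, roughly:
#   detector = dlib.get_frontal_face_detector()
#   predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
#   shape = predictor(image, detector(image)[0])
#   pts = [(shape.part(i).x, shape.part(i).y) for i in (36, 39, 37, 41)]
# Here we use synthetic points on a random image for illustration.
rng = np.random.default_rng(0)
img = rng.random((240, 320, 3))
pts = [(100, 120), (180, 118), (150, 105), (150, 132)]
crop = normalize_eye(img, pts)
print(crop.shape)  # (64, 128, 3)
```

Every normalized crop now has the same shape with the eye landmarks in the same positions, so downstream embedding models never have to account for head tilt or framing.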