This Dress Doesn’t Exist

Source: Deep Learning on Medium

Sample images from 4 days of training SR StyleGAN

This post was originally published on the ShopRunner Engineering blog. Feel free to check it out, along with some of the other work our teams are doing.

Our ShopRunner Data Science team gives every member a quarterly hack week. It is important for data science teams to keep innovating, so once per quarter team members spend a week working on more speculative projects of their choice. For my 2019 Q3 hack week I decided to build a series of generator models to attempt to create fake products. Generator models are trained to create realistic images or text based on real-world examples. This project may seem fairly outlandish, and it is, but my general idea is that if we can build strong generator models that capture the diversity of our product catalog, then we could use them to augment low-frequency classes within our catalog for other deep learning projects such as taxonomy classification or attribute tagging.

The two networks I decided to use were OpenAI’s GPT-2 small model (117M parameters) for text generation and Nvidia’s StyleGAN for image generation. I then fine-tuned both on internal ShopRunner datasets. For both models I found the original TensorFlow implementations to be the best path forward, since ports to other frameworks either lacked features that I needed or were not as well built out.

Both models were trained on an Nvidia 2080 Ti graphics card.

GPT-2

In February of 2019 OpenAI announced its newest language model, GPT-2. GPT-2 was trained on 40GB of internet text, but OpenAI initially restricted the release to much smaller versions of the model due to concerns about malicious use. A large part of building that high-quality dataset was collecting pages linked from higher-quality Reddit posts. In its raw state GPT-2 is excellent at generating realistic-sounding text, but the text tends to fall into either Reddit-style dialogue or Wikipedia-style description. So to get the most use out of GPT-2 as a generator for fake products, we need to tune it to our specific use case, in this case fashion.

The repo provided by nshepperd offers a series of scripts and instructions for fine-tuning. To perform fine-tuning we really just need to format a text dataset for consumption by GPT-2. For this I wrote 100K product descriptions, each on its own line, to a .txt file, with the GPT-2-specific <|endoftext|> token appended to each one so that GPT-2 would learn how to end a product description and hopefully learn to structure descriptions more realistically.
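
As a rough illustration of that formatting step, here is a minimal sketch. The function name and file path are hypothetical, and it assumes the end-of-text token is spelled `<|endoftext|>` as in GPT-2's vocabulary. It keeps any line breaks inside a description, which is an assumption on my part rather than a record of how the original file was built.

```python
END_OF_TEXT = "<|endoftext|>"  # GPT-2's document separator token


def write_finetune_file(descriptions, out_path):
    """Write each description followed by the end-of-text token and a newline.

    Returns the number of descriptions written. Empty descriptions are skipped.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for text in descriptions:
            text = text.strip()
            if not text:
                continue  # nothing useful to learn from a blank entry
            f.write(text + END_OF_TEXT + "\n")
            count += 1
    return count
```

Calling `write_finetune_file(descriptions, "products.txt")` would produce a single text file that a fine-tuning script can read as its dataset.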

GPT-2 trained fairly quickly, producing good results after 15K batches/steps, which took a few hours.

ShopRunner GPT-2 (SR GPT-2)

We fine-tuned GPT-2 on ShopRunner data for 15K steps, leading to what we call SR GPT-2. After fine-tuning, SR GPT-2 is able to generate fairly realistic-looking product descriptions, including line breaks and formatting. These are also fairly entertaining to read.

======================================== SAMPLE 1 ========================================
WATERDOG | PINK HALSTON COLLECTION. HALSTON'S HOODED SILK FIT IS SO AWESOME HONG KONG\'s black wool-blend hooded cowl jacket is handmade from lightweight wool sourced from two countries located in the Arctic Circle and Wye Hydroelectric Power Supply.
- HALSTON WOOL CLIMB JERRY WOOL FIT
- Fabric: 90% wool, 10% wool crepe; 12% viscose, 8% polyester and 5% nylon
- Washable, dry clean recommended
- Imported
SKU: HU18-17FBCFABRIC & CAREMachine wash cold with similar colorsWarm down with whiteWash in cold water and hang dry
======================================== SAMPLE 2 ========================================
Wool-Cotton Cropped T-Shirt Dress Dress with Boho Floral Print - Black This T-Shirt Dress features a Boho Floral printed front, a short sleeve hemline, and an asymmetrical ruffled hemline. A limited edition piece from our New York Fashion Show capsule.
New York Fashion is celebrating the 20th anniversary of their popular T-Shirt Dress in 2011. The T-Shirt Dress is a loose-fitting bodice cut out cut in black silk crepe de chine with a fitted waistband. The dress has a V neckline, long sleeves, two front split pockets, and a draped waist.100% SilkDry Clean OnlyImportedSI38S-F18Size & FitModel is wearing a size 6Model Measurements: Height: 5ft 9in; Bust-Chest: 33in; Waist: 23.6in; Hips: 34.5inAbout SI38S-F18Sieviast designer and renowned expert James Simone launched his versatile collection of womens ready-to-wear and accessories in 2008. The label\'s garments have been named the year\'s top fashion showpieces, and the lines are sure to become a summer staple.

However, using GPT-2 we can also generate context-based examples.

Here is a sample generated with the context “sleeveless red asymmetric dress”:

DetailsThis asymmetric fit-and-flare fit is cut in a slim fit. Cut for a flattering, relaxed fit.- Round neckline. 
- Sleeveless.
- One center front, two center back.
- 100% polyester.
- Machine wash.
- Imported.
- Model is 5\'10""/178 cm and has a 32"" bust. She wears a US size 2.

This example seems fairly realistic and reasonable. One thing I was happy about is that, besides the description looking realistic, words like “asymmetric” and “sleeveless” that were used in the input phrase also appear in the generated description.

On a less serious note, with context that sounds less like fashion we can still get some of the original GPT-2 training and structure bleeding through.

For example, take the context “meat dress”, inspired by Lady Gaga.

Below are two examples generated with that same context. When SR GPT-2 generates a sample, it favors the words with the highest probability of following the given context. However, there is some randomness involved in the process, so outputs can come out quite differently. The first example is a fairly reasonable-looking product description. In the second, SR GPT-2 falls back a bit into the more definition-style text of the original GPT-2 training.
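
To make the "highest probability plus some randomness" idea concrete, here is a small sketch of temperature and top-k sampling over a list of next-token scores. The post does not say which sampling settings were used, so the function name, the defaults, and the example scores are all assumptions for illustration.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=40, rng=random):
    """Pick a token index from `logits` using temperature and top-k sampling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rank token indices from highest to lowest score.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    # Keep only the k best candidates (top_k <= 0 keeps everything).
    candidates = ranked[:top_k] if top_k > 0 else ranked
    # Softmax weights over the survivors; lower temperature sharpens them.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one candidate in proportion to its weight.
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With `top_k=1` this always returns the single most likely token, so every run produces the same text. Larger `top_k` or higher `temperature` lets less likely tokens through, which is why the same context can yield very different descriptions.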


Meat dress: our Italian twist, crafted of a crepe fabric with a stretchy, crinkle finish. Features hand-woven details, an embroidered floral pattern throughout.
- Adjustable, pull-on, belt
- Side slit
- Adjustable, belt with cut from a relaxed fit
- Fabric has been softened by hand washing
- 95% rayon, 5% spandex blend; lining: 100% polyester crepe de chine
- Washable
- Imported

dress made of meat, bone, and vegetable gabardine. In honor of the American Heart Foundation.

StyleGAN

Generative Adversarial Networks (GANs) are an interesting area of deep learning where the training process involves two networks: a generator and a discriminator. The generator starts from random noise and learns to create images on its own, while the discriminator gives feedback by looking at training examples and generator output and predicting whether each is “real” or “fake”. Over time this feedback helps the generator create more realistic images.
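
For readers who prefer a formula, the textbook form of this two-player game from the original GAN formulation is shown below. Here x is a real training image, z is a random noise vector, G(z) is a generated image, and D(·) is the discriminator's estimated probability that its input is real. Practical implementations, StyleGAN included, typically train with variants of this objective rather than this exact form.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```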

StyleGAN is a model that was released by Nvidia near the end of 2018. It is an improvement over a previous Nvidia model called ProGAN. ProGAN was trained to generate high-quality 1024×1024 images, and did so with a progressive training cycle: it starts training on low-resolution images (4×4) and increases that resolution over time by adding additional layers. Starting with low-resolution images made training faster and increased the quality of the final images, as the networks were able to learn important lower-level characteristics. However, ProGAN has limited ability to control the generated images.
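
The progressive schedule can be sketched as a simple doubling sequence. This only lists the resolutions visited and is not ProGAN's training code; the function name is hypothetical.

```python
def progressive_resolutions(start=4, final=1024):
    """Return the resolutions visited when doubling from `start` up to `final`."""
    if start <= 0 or final < start:
        raise ValueError("need 0 < start <= final")
    resolutions = []
    res = start
    while res <= final:
        resolutions.append(res)
        res *= 2  # each growth stage doubles width and height
    if resolutions[-1] != final:
        raise ValueError("final must be start times a power of two")
    return resolutions
```

Going from 4×4 to 1024×1024 takes nine stages, while stopping at 512×512 takes eight.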

StyleGAN improves on ProGAN by giving the ability to control the “style” of outputs by allowing users to manipulate the latent space vectors of a generated image. Every image that StyleGAN generates is represented by a vector that exists within StyleGAN’s latent space. So if you modify that vector you can adjust the characteristics of the image within StyleGAN’s latent space to create a new image with desired characteristics.
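
A simple way to picture "modifying the vector" is linear interpolation between two latent vectors, which is the usual basis for latent space walk videos. The sketch below uses plain Python lists and assumes a 512-dimensional latent vector. The helper names are hypothetical, and a trained generator would still be needed to turn each vector into an image.

```python
import random


def random_latent(dim=512, seed=None):
    """Draw a latent vector from a standard normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def lerp(z_a, z_b, t):
    """Blend two latent vectors: t=0 gives z_a, t=1 gives z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def latent_walk(z_a, z_b, steps=8):
    """Return `steps` evenly spaced latent vectors from z_a to z_b inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]
```

Feeding each vector from `latent_walk` through the generator in order would give the frames of a smooth transition between two generated products.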

This is just a brief description of StyleGAN. For more information, check out the paper or other write-ups online.

ShopRunner StyleGAN (SR StyleGAN)

I ended up training SR StyleGAN for around 4 days and generated around 2 million 512×512 images in the process. As a starting point for weight initialization I actually used an anime-trained StyleGAN. I used this anime StyleGAN as a starting point because the original Nvidia StyleGAN was trained to generate 1024×1024 images, which are great but also harder to work with because they require more computational firepower. The anime StyleGAN, in comparison, was trained to generate 512×512 images, so it is more manageable.

The dataset for SR StyleGAN was around 9000 mostly dress product images which I pruned down based on a few criteria. Step 1 I did with a few lines of code, but the last three steps were manual.

  1. Size: I threw out images with a width or height below 300. If you leave low quality images in the dataset you end up with pixelated looking final generated images.
  2. Composition: For simplicity I tried to keep images where the model/product was located in the center of the image.
  3. Background: I removed overly complex backgrounds, since they would mostly just mean lots of additional effort on the model’s part to begin to generate them well.
  4. Non-product shots: Certain images were either blank placeholder images or zoomed-in shots of pattern/fabric. I found that leaving shots like this in the dataset gives StyleGAN an easy way to cheat and generate “realistic” looking images to fool the discriminator. This is not really the most desirable behavior, so I did my best to remove them.
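
The size filter from step 1 could look something like the sketch below. This is not the original code: the function names are hypothetical, and reading dimensions with Pillow is an assumption (any reader that returns a width and height would do, which is why it is passed in as a parameter).

```python
def read_size_with_pillow(path):
    """Return (width, height) for an image file; assumes Pillow is installed."""
    from PIL import Image  # imported lazily so the filter works with other readers

    with Image.open(path) as img:
        return img.size


def filter_by_size(paths, min_side=300, read_size=read_size_with_pillow):
    """Keep only the images whose width and height are both at least `min_side`."""
    kept = []
    for path in paths:
        width, height = read_size(path)
        if width >= min_side and height >= min_side:
            kept.append(path)
    return kept
```

The remaining three criteria are judgment calls about composition and content, which is why they were handled by hand.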

Walks through SR StyleGAN’s latent space

Generating Low Frequency Examples

Now that we have walked through some of the training details of SR StyleGAN we can start talking about how to generate those low frequency products. In the following video you see a few seconds where jumpsuits are generated even though this dataset was mostly dresses. So for this hack week I used jumpsuits as an example low frequency class.

This shows some jumpsuits

One quick method to generate additional jumpsuit samples would be to generate a large number of SR StyleGAN images unconditionally and search through those to find examples of the low frequency class we care about.
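
This generate-then-search approach can be sketched as follows. Both `generate` and `classify` are hypothetical stand-ins: the first would wrap a trained generator, and the second could be a taxonomy classifier or, as in this hack week, a person looking at the images. Keeping the latent vector next to each hit records where in the latent space the rare class was found.

```python
import random


def find_rare_class(generate, classify, target, num_samples, latent_dim=512, seed=0):
    """Sample latent vectors at random and keep those whose image matches `target`.

    Returns a list of (latent_vector, image) pairs.
    """
    rng = random.Random(seed)
    hits = []
    for _ in range(num_samples):
        # Unconditional sampling: every latent vector is pure random noise.
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        image = generate(z)
        if classify(image) == target:
            hits.append((z, image))
    return hits
```

Because the sampling is unconditional, the yield depends entirely on how often the generator happens to produce the rare class.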

Here are some examples that I manually pulled out of a few hundred generated StyleGAN images. This works fine, but if we can figure out where exactly jumpsuits exist in the SR StyleGAN latent space, we could generate them as we see fit.

Style Mixing with SR StyleGAN

Each image is represented by a feature vector in SR StyleGAN’s latent space, so if we combine different vectors we can start to get at “style mixing”. In the two videos below, the top right image gets mapped onto the bottom left image. The resulting mixture is the bottom right image, which should be dominated by the characteristics of the top right image. In both videos you see the characteristics of the top right image shift in response to changes in the bottom left image.

Top right getting mapped onto bottom left image. Result shown in the bottom right
Top right mapped onto bottom left for a synthesized image. The upper right dress mostly just shifts in response to length.

This is cool: it gives us a way to combine two images by combining their feature vectors in SR StyleGAN’s latent space.
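
In StyleGAN, style mixing is usually done per layer: the generator takes one style vector per layer, and a mixed image uses some layers from one source and the rest from another. The sketch below shows only that bookkeeping, with hypothetical function names. Which layers were swapped in the videos above is not stated, so the layer range in the usage note is an assumption.

```python
def num_style_layers(resolution):
    """Number of per-layer style inputs: two per resolution from 4x4 upward."""
    if resolution < 4 or resolution & (resolution - 1):
        raise ValueError("resolution must be a power of two, at least 4")
    levels = resolution.bit_length() - 1  # log2 of a power of two
    return 2 * (levels - 1)


def mix_styles(styles_a, styles_b, layers_from_b):
    """Return per-layer styles taken from A, except the listed layers taken from B."""
    if len(styles_a) != len(styles_b):
        raise ValueError("both sources need the same number of layers")
    chosen = set(layers_from_b)
    return [styles_b[i] if i in chosen else styles_a[i] for i in range(len(styles_a))]
```

A 512×512 generator has 16 style layers by this count. Taking the first few layers from B, for example `mix_styles(a, b, range(0, 4))`, would borrow B's coarse structure while keeping the finer details of A.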

Combining Image and Text Generators?

Since the goal was to generate realistic-looking fake products, here are two examples of generated images with contextually generated text. For now the text context is written manually, but a future project could be to build a captioning model or simply use tags generated by the internal attribute and taxonomy models the team has been working on. These generated tags can serve as a stand-in for a product title. For example, potential attributes of the following dress could be “sleeveless red asymmetric dress”, which could be fed into GPT-2 to get a contextually generated product description.

SR StyleGAN generated dress.

Context for SR GPT-2: sleeveless red asymmetric dress

DetailsThis asymmetric fit-and-flare fit is cut in a slim fit. Cut for a flattering, relaxed fit.- Round neckline. 
- Sleeveless.
- One center front, two center back.
- 100% polyester.
- Machine wash.
- Imported.
- Model is 5\'10""/178 cm and has a 32"" bust. She wears a US size 2.

A second example uses a jumpsuit generated by SR StyleGAN, with potential tags being “black short sleeve jumpsuit”, which feels like a reasonable description or a boring title.

SR StyleGAN generated Jumpsuit

SR GPT-2 context: black short sleeve jumpsuit.

A loose, fluid silhouette lends a comfortable wear to any look. The buttonless back features a keyhole on the chest.Material and CareMaterial information: 100% Cotton, Lining: 100% Viscose, Lining: 100% Polyester

Then, as a fun follow-up, I fed these two examples through our internal taxonomy classification service, which takes image and text input, and found that it successfully categorizes the two products as a “women’s dress” and a “jumpsuit”.

Wrapping Things Up

Over the course of this hack week I spent a lot of time training models and looking at generated image and text outputs. I still think that, while hard to utilize, these generator-style models could be very useful for adding business value. I initially mentioned synthetic data augmentation for low-frequency classes, but another idea that came from the team is letting users generate items and manipulate them, if we can figure out how to locate and manipulate different features in the GAN’s latent space. If users can generate items they like, then we can do more standard visual searches of our catalog, and so on.

Some notes on the models:

GPT-2 seems fine and learns quickly, though it potentially overfits quickly as well. Given appropriate context it can generate reasonable text, so the real question is what to use as context. One thought is a captioning model run on a real image or a GAN image. A simpler way would be to feed in all available attributes and taxonomy information as plain text and see how it does.

StyleGAN is decently trained, but to get better results I would likely need to train it from scratch, or at least from a much earlier point in its training. I intentionally had StyleGAN start at a point where it was already generating fairly large images, all while using those anime weights.
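
The plain-text option is about as simple as it sounds. Here is a minimal sketch, with a hypothetical function name and a word order (attributes first, then category) that is only an assumption based on the contexts used earlier in this post.

```python
def build_context(attributes, category):
    """Join attribute tags and a taxonomy category into one lowercase prompt."""
    words = [tag.strip().lower() for tag in attributes if tag and tag.strip()]
    words.append(category.strip().lower())
    return " ".join(words)
```

For example, `build_context(["sleeveless", "red", "asymmetric"], "dress")` gives the same context string used for the red dress above.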

Something that I experimented with but did not find much success with was mapping images in and out of StyleGAN’s latent space. The general idea is to use a pretrained network to find the closest approximation of an image in StyleGAN’s latent space by generating images from StyleGAN vectors and comparing how close each one is to the original. If we can successfully map items into StyleGAN’s latent space, then we can combine those vectors to have a bit more control over what we are modifying. For example, we could map a bunch of jumpsuits into StyleGAN’s latent space and mix those jumpsuits together to make new samples. Another related step is finding where certain attributes or patterns exist in the latent space, since then we could potentially control those attributes in generated images.
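
The projection idea boils down to a search: propose a latent vector, generate from it, measure the distance to the target, and keep the proposal if it is closer. The sketch below is a toy version of that loop using random search and a plain squared error. Real projection approaches typically use gradient descent and a perceptual distance computed with a pretrained network, so treat this as an illustration of the loop only. All names here are hypothetical, and `generate` stands in for a trained generator.

```python
import random


def squared_error(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def project_to_latent(generate, target, latent_dim, steps=2000, step_size=0.1, seed=0):
    """Random-search for a latent vector whose generated output is close to `target`.

    Returns (best_latent, best_loss).
    """
    rng = random.Random(seed)
    best_z = [0.0] * latent_dim
    best_loss = squared_error(generate(best_z), target)
    for _ in range(steps):
        # Propose a small random move away from the best vector found so far.
        candidate = [v + rng.gauss(0.0, step_size) for v in best_z]
        loss = squared_error(generate(candidate), target)
        if loss < best_loss:  # keep the move only if it gets closer to the target
            best_z, best_loss = candidate, loss
    return best_z, best_loss
```

Once several real products have been projected this way, their latent vectors can be blended or mixed like any other vectors in the latent space.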

For a full training run, Nvidia lists a training time of 42 days on a single GPU. You could likely get good results in a full week or two of training, since other folks who have trained from scratch report that the last few weeks are really just about cleaning up minor details.

If I end up continuing this hack week idea, a lot of the future work will likely be around manipulating SR StyleGAN outputs and locating where things like dress/sleeve length, colors, and patterns live in the latent space, in order to allow more fine-grained control over different aspects of the generated images.