Conditional Icon Generation with GANs

Source: Deep Learning on Medium

The Deepest of Learners — Aniruddh Bharadwaj, Charlie Hwang, Rachit Kataria, Charles Lee


Apple, Google, McDonald’s — all have timeless icons that convey company mission and set a tone for external communication. With the growth of digital media and the swift adoption of mobile phones, the demand for high quality digital content, of which a critical component is branding, is growing exponentially. In fact, “77 percent of B2B marketing leaders say branding is critical to growth” [1]. This is for good reason; designers can provide a valuable, human-centric view of the product. The most visible representation of these views today is the icon.

Our primary motivation is to lower the barrier to designing quality icons for creators who want to release valuable products into the world.

Problem Statement

The goal of our project is to leverage Generative Adversarial Networks (GANs) in order to produce reasonable-looking app icons, conditioned on a desired app genre or description. The novelty here derives from the conditioning, which would make the resulting icons substantially more useful for our end creators.

High Level Intuition

For our particular icon domain, a key driving intuition is that icons have very visually differentiable features. Text, shape, foreground and background separation, and color are just a few of these traits. The recent success of GANs performing style extraction seems to indicate that mixing and matching visible image styles is feasible, and could directly apply to our use case.


No ready-made icon training set existed, so we decided to build our own robust dataset by leveraging the existing icons from the millions of apps on online app stores. Using an off-the-shelf API scraping tool, we curated a dataset of roughly 11,000 icons from the Apple App Store [2].

All icons were initially retrieved as JPEGs at a 100x100x3 resolution, which we then downscaled to 32x32x3 for faster training. We also categorized the images by their primary genre type (e.g. Books, Social Networking, Travel) and collection type (e.g. top free, top paid, top grossing). In addition to the icons themselves, we also collected metadata per icon, stored alongside as JSON files. For later experiments, which did not involve the primary genre type retrieved from the Apple App Store, we created a separate, flattened dataset to decrease the time required for experiment setup.

A sample representation of our data can be seen below.


Our first method of conditional icon generation consisted of bucketing our icons by their primary genre, which we fetched from each icon’s respective metadata file, and passing the one-hot encodings of our labels (the primary genres) as the condition vectors alongside our icons. Our intuition here was that icons from different genres may have defining features and specific visual characteristics that can be learned such that icons generated for different genres are visually differentiable from each other.
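The genre bucketing described above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the helper name `one_hot_encode` and the four sample genres are assumptions made for the example.

```python
import numpy as np

def one_hot_encode(labels):
    """Return (one_hot_matrix, sorted_class_names) for a list of string labels."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    ids = np.array([index[name] for name in labels])
    # Row i of the identity matrix is the one-hot vector for class i.
    one_hot = np.eye(len(classes), dtype=np.float32)[ids]
    return one_hot, classes

# Illustrative genres only; the real labels come from each icon's metadata file.
genres = ["Books", "Travel", "Social Networking", "Travel"]
condition_vectors, genre_names = one_hot_encode(genres)
```

Each row of `condition_vectors` is the condition vector that would accompany the corresponding icon during training.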

To accomplish this conditioning, we looked at a variety of existing Conditional StyleGAN implementations and relevant blog posts, and settled on an approach in which we conditioned the generator’s and discriminator’s loss functions on the one-hot encoded label of each icon, leading to the following loss function:
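For orientation, the standard conditional GAN objective, in which both networks receive the label y, is usually written as below. This is the textbook form and is shown only as an assumption about the general shape of the loss; a StyleGAN-based implementation may use a different variant (for example, a non-saturating logistic loss).

```latex
\min_G \max_D \;
\mathbb{E}_{(x, y) \sim p_{\text{data}}}\bigl[\log D(x \mid y)\bigr]
+ \mathbb{E}_{z \sim p_z,\; y \sim p_y}\bigl[\log\bigl(1 - D(G(z \mid y) \mid y)\bigr)\bigr]
```

Here x is a real icon, y its one-hot label, z the latent vector, G the generator, and D the discriminator.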

We also experimented with introducing noise to the one-hot encoded labels by multiplying the matrix of one-hot encoded labels, with shape (batch size, # of labels), by a noise matrix with shape (# of labels, latent dimension) sampled from 𝑁(𝜇, 𝜎²).
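A minimal sketch of this multiplication is shown below. The sizes (batch of 4, 25 labels, latent dimension 512) and the values of 𝜇 and 𝜎 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, num_labels, latent_dim = 4, 25, 512  # illustrative sizes (assumed)
mu, sigma = 0.0, 1.0                             # assumed noise parameters

# One-hot labels, shape (batch_size, num_labels).
label_ids = rng.integers(0, num_labels, size=batch_size)
one_hot = np.eye(num_labels, dtype=np.float32)[label_ids]

# Noise matrix, shape (num_labels, latent_dim), sampled from N(mu, sigma^2).
noise = rng.normal(mu, sigma, size=(num_labels, latent_dim)).astype(np.float32)

# Each row of the product is the noise row selected by that sample's label.
noisy_labels = one_hot @ noise  # shape (batch_size, latent_dim)
```

Because each label row is one-hot, the product simply looks up one noise row per label, so every label maps to a dense vector in the latent dimension.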

Our second set of novel approaches involved modifying our genre-conditioning pipeline to support additional conditioning vectors, including one-hot encoded vectors supplied by the result of a pre-trained ResNet-18 model predicting ImageNet-1000 labels, as well as embedding vectors provided by a pre-trained BERT language transformer.

In our ResNet-based approach, we passed our 32x32x3 icons through a pre-trained ResNet-18 model, which re-bucketed our icons into 392 different ImageNet labels. We then one-hot encoded these labels, and passed them as condition vectors to our GAN, with our intuition being that ResNet, a state-of-the-art recognition network, could cluster the icons more tightly than their primary genres do.

On the other hand, in our BERT-based approach, we took the descriptions of each icon and passed them through a keyword-extraction library called RAKE (Rapid Automated Keyword Extraction) in order to retrieve top keywords from each description. We then converted each keyword into an embedding vector of shape (1, 1024) by passing it through BERT, and averaged all the embedding vectors per icon to get the final description vector passed as a condition to our GAN.



DC-GAN served as our baseline experiment, so that we could determine a good GAN architecture to build on. The key idea of DC-GAN is to generate an image by up-sampling from the latent space using a series of transposed convolutional (“deconvolutional”) layers, and to have the discriminator judge the image by down-sampling it using a series of convolutional layers. Our intuition in this experiment was that the convolution-based architecture of the DC-GAN would capture meaningful visual information from our icon dataset in a low-dimensional latent space, which could then be used as the basis for newly generated icons.

However, as can be seen from the sampled output above, taken from the DC-GAN after 80 epochs of training (~14 hours), the results were less than ideal. The generator’s loss starts low but then spikes and remains between 2 and 5, while the discriminator’s loss starts above the generator’s loss but quickly drops to between 0 and 1. This indicated to us that the discriminator quickly learned how to differentiate between the real icons and the generator’s output, causing the generator to stop improving the realism of its outputs over time.
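To make the up-sampling and down-sampling concrete, the sketch below works through the spatial sizes with the standard output-size formulas for (transposed) convolutions. The 4×4 starting feature map and the kernel 4, stride 2, padding 1 settings are common DC-GAN choices and are assumptions here, not a record of our exact configuration.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel

def conv_out(size, kernel, stride, padding):
    """Output size of a standard convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# Generator: start from an assumed 4x4 feature map and double it three times.
gen_sizes = [4]
for _ in range(3):
    gen_sizes.append(conv_transpose_out(gen_sizes[-1], kernel=4, stride=2, padding=1))

# Discriminator: mirror the generator with strided convolutions.
disc_sizes = [32]
for _ in range(3):
    disc_sizes.append(conv_out(disc_sizes[-1], kernel=4, stride=2, padding=1))

print(gen_sizes)   # [4, 8, 16, 32]
print(disc_sizes)  # [32, 16, 8, 4]
```

Three stride-2 layers are enough to move between a 4×4 feature map and our 32×32 icons in either direction.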


Next, we decided to try using PGGANs, also known as Progressive Growing of GANs.

PGGANs have a discriminator and generator which are progressively grown from a low resolution to a high resolution. As training progresses, more layers are added to the model, which increases the level of detail in the generated images. We specifically used the implementation found in this GitHub repository [3] using Google Colab.

We opted to try PGGANs since they offer more stable and faster training than a vanilla GAN architecture. PGGANs also perform extremely well on CIFAR-10 image generation, so we expected them to succeed at icon generation, since CIFAR-10 images and our scraped icons are of similar quality and detail. Our only concern was that icon art is significantly more abstract than the natural images found in the CIFAR-10 dataset.
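One ingredient of progressive growing described in the PGGAN paper is a fade-in: when a new, higher-resolution layer is added, its output is blended with the upsampled output of the previous stage while a weight alpha ramps from 0 to 1. The sketch below illustrates only that blending step on random stand-in arrays; the function names and the 4 to 32 resolution schedule are assumptions for illustration, not code from the repository we used.

```python
import numpy as np

def upsample_nearest(images):
    """Double height and width of a (N, H, W, C) batch by pixel repetition."""
    return images.repeat(2, axis=1).repeat(2, axis=2)

def fade_in(low_res_rgb, high_res_rgb, alpha):
    """Blend the upsampled old output with the new layer's output."""
    return (1.0 - alpha) * upsample_nearest(low_res_rgb) + alpha * high_res_rgb

rng = np.random.default_rng(0)
old_output = rng.random((2, 16, 16, 3))  # stand-in for the 16x16 stage output
new_output = rng.random((2, 32, 32, 3))  # stand-in for the new 32x32 layer output

resolutions = [4, 8, 16, 32]             # assumed growth schedule up to our icon size
blended = fade_in(old_output, new_output, alpha=0.25)
```

At alpha = 0 the network behaves exactly like the previous stage, and at alpha = 1 the new layer has fully taken over, which is what keeps the transition between resolutions smooth.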

The results after ~7 hours of training on a single Tesla K80 GPU are below:

We were surprised by the level of quality and detail PGGAN was able to produce, despite the relatively low amount of resources put into training the model.


StyleGAN’s architecture is an extension of PGGAN. The generator no longer feeds a point from the latent space directly into its first layer; rather, a mapping network transforms the latent point into an intermediate style vector that modulates every layer, and noise layers are added throughout. The mapping network allows the model to separate different aspects of the image (e.g. foreground, background) and gives the model control over the higher-level styles present. The additional noise introduced per feature map gives the model fine-grained, per-pixel variation within each icon style.

We attempted this approach because it has proven to work extremely well in the past at mixing facial features to generate realistic images of human faces [4]. As app icons tend to have different controllable styles (e.g. text, color, shape) and levels of detail, StyleGAN seemed to be the most appropriate GAN for our purposes. We specifically used NVLabs’ official TensorFlow implementation of StyleGAN [5].

Our first StyleGAN experiment aimed to determine whether it could generate high-quality synthetic icons from our entire dataset without any conditioning. Given the less-than-ideal results of DC-GAN and the fact that StyleGAN is more recent than PGGAN, a good result here would give us confidence that StyleGAN could serve as the baseline model to add conditioning to.
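The mapping network and style modulation can be sketched roughly as below. This is a toy NumPy illustration with random weights and small assumed sizes (latent dimension 64, 8 channels, a 2-layer mapping network), not the NVLabs implementation; the style modulation follows the adaptive instance normalization (AdaIN) idea from the StyleGAN paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mapping_network(z, weights):
    """Map latent z to an intermediate latent w with a small leaky-ReLU MLP."""
    h = z
    for W in weights:
        h = h @ W
        h = np.where(h > 0, h, 0.2 * h)
    return h

def adain(features, style_scale, style_bias, eps=1e-8):
    """Adaptive instance normalization on (N, H, W, C) feature maps."""
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    return style_scale[:, None, None, :] * normalized + style_bias[:, None, None, :]

latent_dim, channels = 64, 8  # toy sizes (assumed)
z = rng.normal(size=(2, latent_dim))
mlp = [rng.normal(scale=0.1, size=(latent_dim, latent_dim)) for _ in range(2)]
w = mapping_network(z, mlp)

# Affine maps from w to per-channel scale and bias (learned in practice, random here).
A_scale = rng.normal(scale=0.1, size=(latent_dim, channels))
A_bias = rng.normal(scale=0.1, size=(latent_dim, channels))

features = rng.normal(size=(2, 4, 4, channels))
# Per-pixel noise, broadcast over channels with a fixed (assumed) strength of 0.1.
features = features + 0.1 * rng.normal(size=(2, 4, 4, 1))
styled = adain(features, 1.0 + w @ A_scale, w @ A_bias)
```

The style vector `w` sets the per-channel statistics of each feature map, while the injected noise only perturbs individual pixels, which is the split between high-level style and fine detail described above.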

We trained our StyleGAN for 250 ticks, where each tick corresponds to a single run of 1,000 images, on our 32×32 icon inputs. After 34 hours of training on a Tesla V100 GPU, our results are as follows:

Unconditioned StyleGAN Fakes

Unconditioned StyleGAN Reals

These outputs, from a qualitative perspective, seemed very comparable to the real dataset icons. One can clearly make out hearts, video cameras, and lettering across a variety of colorful backgrounds.

Genre-Conditioned StyleGAN

Given the promising outputs of our baseline StyleGAN, we moved ahead with our second experiment: conditioning StyleGAN on one-hot encoded app genres. This use case seemed more relevant and useful in terms of icon output for a designer with an app genre already in mind.

We trained our newly conditioned StyleGAN for 250 ticks on our 32×32 icon inputs, split across 25 discrete genres. After 28 hours of training on a Tesla V100 GPU, our results are as follows:

Genre-Conditioned StyleGAN Fakes

Genre-Conditioned StyleGAN Reals

(Order of genres from top to bottom: Entertainment, Finance, Reference, Education, Photo & Video, Business, Health & Fitness, News, Productivity, Food & Drink)

Two key concerns emerge with these results: quality and distribution. With regard to quality, the icons are significantly less recognizable in style than those of our baseline. This is most likely due to the limited number of icons per genre bucket for StyleGAN to train on. Distribution is another issue that we did not initially investigate. Looking at the real dataset icons per genre, it is very difficult to find a distinguishing set of style features between rows. In other words, the genre distributions are all quite visually similar. Only Games appeared to stand out, given that apps in that genre usually have more involved, complex figures and borders.

This lack of differentiation made it much more difficult to assess whether or not our outputs actually reflect the given genre or are succumbing to noise.

ResNet-Conditioned StyleGAN

After realizing that distributions across genres are not unique, we considered alternative ways to group icons based on similarities. Manually grouping the icons based on qualitative similarities was unrealistic because of the size of our dataset. Therefore, we opted to use ResNet to do transfer learning and pre-classify our dataset into classes specified in the ImageNet database [6]. Specifically, we used a pretrained ResNet-18 model from the CNTK library [9]. We chose ResNet as our architecture because it remains a strong architecture in terms of accuracy and has many publicly accessible pretrained models.

Although fine-tuning the pretrained ResNet-18 model on the icon domain would likely have yielded better results, we decided to use the model as is. Fine-tuning would have involved creating “ground truth” labels for icons and manually assigning them to a portion of our dataset. This would not have been the best use of our time, especially since we didn’t know whether ResNet would provide reasonable results.

ResNet-Conditioned StyleGAN Reals

ResNet actually managed to group similar-looking icons together into approximately 350 labels of the ImageNet database. Because we obtained reasonable clustering, we converted the classes into one-hot vectors and used them in the same Conditional StyleGAN model from genre conditioning. After training for 34 hours using a single Tesla V100 GPU, we gathered the following results:

ResNet-Conditioned StyleGAN Fakes

We found that the distributions across classes were much more distinct! As you can see in the illustrations above, rows with a consistent background color or foreground shape in the reals translated to the same, respective background color or foreground shape in the fakes (e.g. the last row with a red background and the middle row with a white background and circular foreground).

To confirm that ResNet was able to detect patterns within the images, we also constructed t-SNE plots of our icon data. t-SNE plots reduce the dimensionality of high dimensional datasets in order to identify clustering. Below are our t-SNE plots when using the top 80 ResNet labels and the top 20 ResNet labels respectively:

We noticed that when using 80 labels, the data does not cluster at all. However, when using only 20 labels, we see that the left (purple) and right (green) regions are generally separable. This tells us that we might achieve better results by reducing the number of ResNet label buckets.

BERT-Conditioned StyleGAN

Given the reasonably-distributed outputs of our ResNet-conditioned StyleGAN, we decided to move ahead with our novel, fourth experiment — attempting to condition StyleGAN on text-embedding vectors generated by a pre-trained language model. One observation we made throughout our experiments was that different keywords generally correspond to different icon visuals, and that keywords of icons from one category generally don’t overlap with those from another category. For example, keywords from mobile gaming app descriptions — “multiplayer”, “battle royale”, “gun game” — rarely overlap with keywords from messaging app descriptions — “chatting”, “message”, “real time”. Thus, our intuition was that it may be possible to condition our GAN on vector-representation of keywords, in hopes that it learns the conditional distribution of icons over keywords and can generate icons whose visual characteristics are representative of their keywords. We specifically chose text-embeddings over other vector-representations because text-embeddings capture semantic information and context, which we thought would improve the quality of our conditionally-generated outputs.

To accomplish this, we parsed the first 100 words of each app description, which strips away unwanted description text such as privacy policies. We then extracted the 4 most important keywords using RAKE [7] and passed each keyword through the pre-trained BERT language model [8]. The BERT model gave us an embedding vector with shape (1, 1024) for each keyword. In our experiments, we either a) summed or b) averaged the keyword embeddings into a final embedding vector, representing all keywords for a given icon, on which our GAN was conditioned. This use case seemed to be the most relevant and useful in terms of icon output, as designers could simply provide a set of relevant keywords in order to generate a representative app icon.
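The sum and average variants can be sketched as below. To keep the example self-contained, `embed_keyword` is a hypothetical stand-in that returns a fixed random (1, 1024) vector per keyword; it does not call RAKE or BERT, and the function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 1024
_cache = {}

def embed_keyword(keyword):
    """Hypothetical stand-in for a BERT lookup: one fixed random (1, 1024) vector per keyword."""
    if keyword not in _cache:
        _cache[keyword] = rng.normal(size=(1, EMBED_DIM))
    return _cache[keyword]

def description_vector(keywords, mode="avg"):
    """Combine per-keyword embeddings into one (1, 1024) condition vector."""
    stacked = np.concatenate([embed_keyword(k) for k in keywords], axis=0)  # (K, 1024)
    if mode == "sum":
        return stacked.sum(axis=0, keepdims=True)
    if mode == "avg":
        return stacked.mean(axis=0, keepdims=True)
    raise ValueError(f"unknown mode: {mode}")

keywords = ["caribou", "supplies", "hunting", "season"]
condition = description_vector(keywords, mode="avg")
```

With a fixed number of keywords per icon, the two variants differ only by a constant scale factor, so any difference in results would come from how the GAN reacts to the magnitude of the condition vector.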

We trained each of our BERT-conditioned StyleGANs for 210 ticks on our 32×32 icons, each icon paired with a (1, 1024) text-embedding vector representing all keywords extracted from that icon’s description. After 28 hours of training each on a Tesla V100 GPU, our results are as follows:

BERT Reals

BERT (Avg) Fakes

BERT (Sum) Fakes

Unfortunately, our BERT-conditioned models were unable to capture the conditional distribution of icons over keywords, resulting in mostly blob-like outputs that lacked structure or semantic meaning. We posit that this is because keywords are often tightly tied to an individual app and vary considerably across icons. This diminishes learning, as our GAN rarely sees similar text-embedding vectors, and thus does not have enough information to learn the conditional distribution of icons over some of the less-frequent keywords. For example, our GAN only saw the list of keywords “caribou”, “supplies”, “hunting”, and “season” once, making it hard for our GAN to conditionally create an icon based on these keywords at generation time.

However, our GAN was able to produce some semi-realistic outputs for certain text-embedding conditions (shown below), likely representing the few common keywords that were seen repeatedly across different icon descriptions.

Room for Improvement

First, we could generate 64×64 icons instead of 32×32 icons, since the higher resolution would capture finer details and improve overall image quality. We were limited to 32×32 icons by computing resources and time.

Second, our first genre-conditioned model rested on the assumption that app icons correlate strongly with their primary genres. This turned out not to be the case. Had we run t-SNE over our dataset in the early stages, we would have realized that the ground-truth distributions weren’t very visually distinctive, and could have spent more time developing different strategies to cluster the data.


Alongside our t-SNE analysis, we wanted a human assessment of our results. The first evaluation we tried is as follows (please refer to picture A below): the two icons on the left-hand side are generated from two different ResNet labels, and only one of them comes from the label distribution shown on the right-hand side. The evaluator’s job is to select the icon that belongs to that distribution. Accuracy was measured as the number of icons matched correctly over the total number of test cases attempted.

We achieved around 72% to 78% accuracy, which tells us that the evaluators were able to discern some visual differences between the icons we generated.

(Picture A)

Due to the abstract nature of our generated images, we also attempted an alternative way to evaluate our results. Instead of picking the one icon that matches a given label, we gave the evaluators two labels and asked them to match them to their corresponding distributions. The answer in this case would be either “swap” or “no swap” (please refer to pictures B and C below).

We achieved 82.3% accuracy across 8 evaluators and 1200+ test cases. We believe these results are meaningful: evaluators could only reach such accuracy if there are notable visual differences between the icons generated for different labels.
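As a rough sanity check on this figure, the sketch below compares the reported accuracy with the 50% expected from guessing on a two-way “swap” or “no swap” choice. It assumes exactly 1,200 independent trials (a lower bound on the “1200+” reported) and uses a simple normal approximation; both are assumptions, and the helper name is hypothetical.

```python
import math

def accuracy_summary(accuracy, n_trials, chance=0.5):
    """Return the standard error of the accuracy and its z-score against chance."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_trials)
    z_score = (accuracy - chance) / math.sqrt(chance * (1.0 - chance) / n_trials)
    return std_err, z_score

# 82.3% accuracy; n_trials = 1200 is an assumed lower bound on the number of test cases.
std_err, z_score = accuracy_summary(0.823, 1200)
print(f"standard error: {std_err:.3f}, z-score vs. chance: {z_score:.1f}")
```

Under these assumptions the accuracy sits many standard errors above chance. Since trials from the same evaluator are unlikely to be fully independent, the true margin is probably smaller.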

(Picture B, swap)

(Picture C, no swap)

Next Steps

Although we were able to achieve a relatively high accuracy rate in our human validation tests, we were not satisfied with the level of detail and quality of our generated icons as we were hoping to generate icons that would be indistinguishable from real icons. Thus, our next steps are as follows.

Continue to Explore ResNet-Conditioned StyleGAN

We observed that one issue with the ResNet classification was that many of the labels had only one or two icons, which created sparsity in our paired dataset. For these labels, the generated output had far less detail and quality. We plan to scrape more icons from other application stores, such as the Google Play Store for Android, to increase the number of icons in these sparsely populated labels.

Another potentially significant improvement is to change the ResNet’s domain from ImageNet to our icon domain. We know we don’t have an ideal model for grouping these icons because ResNet is comparing them to natural images of animals and objects, whereas icons are very abstract. By taking the time to come up with labels specific to our icon domain (e.g. circular foreground, white background, letter, etc.) and labeling a small set of icons, we would be able to train our ResNet model to be significantly more accurate and hopefully result in better conditioning.

We would also like to reduce the number of labels the ResNet can classify our icons into. As illustrated by our t-SNE plot above, reducing the number of labels may result in even better clustering. However, it is important to note that, when tweaking the number of labels, we must strike a balance between variance across labels and variance within a label. Specifically, by reducing the number of labels and thereby reducing variance across labels, we will be increasing the variance and potential noise within the new “compressed” label. By experimenting with different label counts, we hope to improve our ResNet classification.
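One simple way to experiment with label counts is to keep only the k most frequent labels. The sketch below folds everything else into a catch-all bucket; that bucket, the helper name, and the sample labels are assumptions for illustration, and dropping the rare icons entirely would be an equally valid choice.

```python
from collections import Counter

def keep_top_k_labels(labels, k, other="other"):
    """Keep the k most frequent labels and map the rest to a catch-all bucket."""
    top = {label for label, _ in Counter(labels).most_common(k)}
    return [label if label in top else other for label in labels]

# Illustrative ImageNet-style labels, not our actual label counts.
labels = ["envelope", "envelope", "envelope", "web site", "web site", "analog clock", "maze"]
reduced = keep_top_k_labels(labels, k=2)
```

Sweeping `k` and re-running t-SNE on the reduced labels would be one way to look for the balance between variance across labels and variance within a label.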

Explore Other Means of Grouping and Conditioning Icons

So far, we have only attempted conditioning on genres, descriptions, and appearance. Although genres and descriptions didn’t give us the best conditioned results, there are other potential ways we can group the icons together.

Conditioning on the app’s developer may work since icons developed by the same team or person will typically look similar to each other. This method of conditioning would be helpful for developers who want to continue to create different icons that share an artistic theme.

Another method is to strictly focus on the foreground and background colors of the icons, instead of the appearance of the icon as a whole. This may work because icons with similar color combinations may share similar styles overall. This would help creators quickly compare various color combinations.

Finally, a time-consuming but potentially accurate means of clustering would be to manually group similar-looking icons together. This would be similar to our ResNet approach in terms of considering the holistic appearance of the icons, but may yield better results because icons are inherently abstract.

Explore Other Baseline Architectures

Instead of switching what we condition on, we can also switch our baseline architecture. In this project, we experimented heavily with StyleGAN. However, there are many other variations of the baseline GAN architecture that may lead to more success. Namely, we would like to try AttentionGAN and BigGAN.

AttentionGAN highlights certain features of the input image to focus on during training. This architecture may help when conditioning off specific elements of the icons (e.g. foreground, background, color, boundaries, etc.).

With StyleGAN, our outputs were noticeably low in resolution, and we had to manually incorporate conditioning into the base architecture. BigGAN, in contrast, has class conditioning built into the architecture and can generate high-resolution images. Furthermore, it outperformed earlier state-of-the-art models on ImageNet image generation when it was published. Therefore, it may help create icons with higher fidelity.


Although we weren’t able to produce icons that were indistinguishable from real icons, we were able to generate decent images that humans could accurately classify into their respective conditioned labels. We’ve learned to avoid making assumptions about the distribution of the dataset, and observed how the fidelity of image generation can change when conditioning is added to a baseline architecture. We hope to continue with the next steps mentioned above to eventually provide an effective and cheap icon generation service for creators and designers.