Automate the UI design process with AI (pix2code)

Brainstorming, mockups and prototyping take up a large share of the time in product design.

Source Alex Yee

The process involves idea sketching,

Source Alex Yee


Source Zeplin

and prototyping.

Source Invision

Going from layouts to working phone prototypes can take hours. With deep learning, however, much of this process can be automated. Here is a 3-minute demonstration from UIzard that generates native mobile applications directly from sketches.

pix2code is still a developing technology. This article walks through building a deep learning network that generates working XCode and HTML code from design sketches.

Create image captions using RNN

pix2code uses a deep network composed of a CNN and two LSTM networks. Its design is similar to image-captioning models in deep learning. For example, an image-captioning model reads the yellow bus picture below and automatically generates the caption “A yellow school bus idles near a park.”

We feed the image into a CNN to extract image features. During training, the LSTM module takes these features together with the label (the true caption provided with the sample) as input and learns to generate the caption.

Let’s unroll the LSTM to see how the model is trained. Suppose the true label is “<start> A yellow school bus idles near a park . <end>”. We feed one word of the label into each LSTM cell at the bottom. In the diagram below, for the first token “<start>”, the LSTM predicts the word “A”. We continue with the second token of the true label, “A”, for which the model predicts “new”. Because every input comes from the true label rather than from the model’s previous prediction, a wrong prediction does not change the next input. In this example, the model ends up predicting the caption “A new bus is parking near the street.”

Once the model is trained, we run inference by feeding validation images into it. For simplicity, we reuse the school bus image as our example. We start with “<start>” as the first input token, and the LSTM produces the word “A”. Here is the major difference from training: the next input to the LSTM cell is the output of the previous step, i.e. we feed the last output “A” into the model at the next time step. “A” predicts “yellow”, which is then fed into the model to produce “bus”. We repeat these steps until the model has generated the caption “A yellow bus idles near a park.”
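To make the training setup concrete, here is a minimal sketch of how the (input, target) pairs are formed from one true caption. The helper name `teacher_forcing_pairs` is hypothetical and the LSTM itself is left out; the sketch only shows which token is fed in and which token the model is asked to predict at each step.

```python
# Toy illustration: during training, the input at each time step is the
# true previous token, no matter what the model predicted at that step.
caption = "<start> A yellow school bus idles near a park . <end>".split()


def teacher_forcing_pairs(tokens):
    """Return (input_token, target_token) pairs for one training caption."""
    return list(zip(tokens[:-1], tokens[1:]))


pairs = teacher_forcing_pairs(caption)
for step, (inp, target) in enumerate(pairs, start=1):
    print(f"step {step}: input={inp!r} -> target={target!r}")
```

At inference time there is no true caption to read from, so the input at each step has to come from the model's own previous output instead.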


pix2code takes a hand-drawn UI mockup from a designer and feeds it to the deep network to produce an XCode project with the UI design. pix2code can also produce code for Android or Web applications using different HTML/CSS/JS frameworks.

pix2code model

Here is the model architecture. pix2code is composed of an encoder (the CNN and the LSTM on the left side) and a decoder (LSTM’). The CNN encodes the GUI picture into latent features. Each training sample comes with a context containing information about the GUI design, and the LSTM encodes that context.



The context is the DSL (domain-specific language) code of the GUI mockup.


The context above has a stack of rows and a footer that hold the UI elements. Since we are only interested in the GUI components and their layout, the actual textual values of the components are ignored. This significantly reduces the vocabulary size and allows the tokens (like <stack>) to be encoded as one-hot vectors rather than word embeddings, which saves the model from training an embedding layer.
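As a rough sketch of the one-hot encoding, assume a tiny made-up vocabulary. The token names in `VOCAB` and the sample DSL string are hypothetical; the real token set depends on the target platform.

```python
# Hypothetical DSL vocabulary, for illustration only.
VOCAB = ["<start>", "<end>", "stack", "row", "footer", "label", "btn", "{", "}", ","]
TOKEN_TO_INDEX = {token: index for index, token in enumerate(VOCAB)}


def one_hot(token):
    """Encode a DSL token as a vector with a single 1 at the token's index."""
    vector = [0] * len(VOCAB)
    vector[TOKEN_TO_INDEX[token]] = 1
    return vector


# A made-up context: one row inside a stack, followed by a footer.
dsl = "<start> stack { row { label , btn } } footer { btn } <end>".split()
encoded = [one_hot(token) for token in dsl]

print(one_hot("stack"))  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```

With a vocabulary this small, each vector stays short, which is why an embedding layer is not needed.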

Vision model encoder (CNN)

Images are rescaled to 256×256 pixels without preserving the aspect ratio, and the pixel values are normalized. The vision model is composed of 3 convolutional modules. Each module consists of 2 convolutional layers with 3×3 filters and stride 1, followed by a 2×2 max-pool for downsampling and a dropout layer for regularization. The convolutional modules output 32, 64 and 128 channels respectively. With unpadded ('valid') convolutions, as in the Keras implementation later in this article, the spatial size shrinks from 256 to 28, so the final shape is 28×28×128. The data is then flattened and fed into two fully connected layers of size 1024 with ReLU activations and dropout.
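A quick way to check the feature-map size is to trace the spatial dimension through each layer. The sketch below assumes unpadded ('valid') convolutions, the setting used in the Keras snippet later in this article; the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1):
    """Spatial size after an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (rounds down)."""
    return size // pool


def vision_feature_shape(size=256, channels=(32, 64, 128)):
    """Trace a square input through the three convolutional modules."""
    for _ in channels:
        size = conv_out(conv_out(size))  # two 3x3 convolutions
        size = pool_out(size)            # one 2x2 max-pool
    return (size, size, channels[-1])


shape = vision_feature_shape()
flattened = shape[0] * shape[1] * shape[2]
print(shape, flattened)  # (28, 28, 128) 100352
```

Under this assumption, the first fully connected layer receives 100,352 inputs. With zero-padded ('same') convolutions, the same arithmetic would give 32×32×128 instead.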

Language model encoder (LSTM)

The context is encoded by a language model consisting of a stack of two LSTM layers. Each LSTM is unrolled over 48 time steps (48 LSTM cells). The output at each time step is a vector of 128 dimensions (i.e. h1 at time step 1 is a vector with 128 elements).

Decoder (LSTM’)

The latent features for both the context and the image are concatenated and fed to a decoder. The decoder contains a stack of two LSTM layers with an output dimension of 512 at each time step. Its output is then fed into a fully connected layer that uses softmax to compute a probability for each token in the vocabulary. We select the output DSL token with the highest probability. For example, if our vocabulary size is just 5, the model may output (0.05, 0.1, 0.05, 0.3, 0.5) as the probabilities of the five tokens, and we select the fifth token.
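The last step of the decoder can be sketched in a few lines. The five-token vocabulary below is hypothetical, and the probabilities are the ones from the example above.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical five-token vocabulary, for illustration only.
VOCAB = ["stack", "row", "label", "btn", "<end>"]
probs = [0.05, 0.1, 0.05, 0.3, 0.5]

# Greedy choice: pick the token with the highest probability.
best_index = max(range(len(probs)), key=lambda i: probs[i])
print(VOCAB[best_index])  # <end>
```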


Modified from source

For the context with tokens (x1, x2, x3, x4, x5, …), we create a sliding window to feed data into the LSTM for training. We start with the first training sample (0, 0, …, 0, x1).

We then shift the tokens one position to the left to prepare the next training sample, (0, …, 0, x1, x2). The following diagram shows the next two training samples fed into the model.
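The windowing can be sketched as follows. This is a simplified illustration that follows the description above, not necessarily the exact data pipeline of the reference implementation; `sliding_windows` and `PAD` are hypothetical names.

```python
CONTEXT_LENGTH = 48
PAD = 0  # placeholder for positions that have no token yet


def sliding_windows(tokens, context_length=CONTEXT_LENGTH):
    """Build (context, next_token) training pairs for one token sequence."""
    padded = [PAD] * (context_length - 1) + list(tokens)
    pairs = []
    for i in range(len(tokens) - 1):
        context = padded[i:i + context_length]  # fixed-length window
        target = tokens[i + 1]                  # token the model should predict
        pairs.append((context, target))
    return pairs


# A short window makes the pattern easy to see.
for context, target in sliding_windows(["x1", "x2", "x3", "x4", "x5"], context_length=4):
    print(context, "->", target)
```

Each window has the same length, so the samples can be stacked into a mini-batch regardless of how long the original DSL sequence is.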

The model is trained with mini-batches of 64 image-sequence pairs. The total loss, using the cross entropy, for a single image is:
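Written out as standard categorical cross-entropy, the loss can be expressed as below. The notation is an assumption made here for illustration: T is the number of time steps, V is the vocabulary size, x_{t+1} is the one-hot vector of the true next token, and y_t is the softmax output at step t.

```latex
L = -\sum_{t=1}^{T} \sum_{v=1}^{V} x_{t+1,v} \, \log y_{t,v}
```

Because x_{t+1} is one-hot, only the probability assigned to the correct next token contributes at each step.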


To make a prediction, we feed the GUI image and a context of 48 tokens with values (0, 0, …, 0, <start>) into the model. With the first prediction h1 from the model, we create another context (0, 0, …, 0, <start>, h1) for the second prediction h2. We continue the process until the model predicts the <end> token. The resulting sequence of DSL tokens (<start>, h1, h2, …, <end>) is then compiled to the desired target language (HTML, XCode) using traditional compiler techniques.
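The prediction loop can be sketched like this. `predict_next` stands in for the trained network, and `fake_predict_next` is a hypothetical stub that replays a fixed script so the loop can run without a model; the `max_tokens` cap is also an assumption added as a safety limit.

```python
CONTEXT_LENGTH = 48
PAD, START, END = 0, "<start>", "<end>"


def greedy_decode(predict_next, image, max_tokens=150):
    """Generate DSL tokens one at a time until <end> is predicted."""
    context = [PAD] * (CONTEXT_LENGTH - 1) + [START]
    output = [START]
    for _ in range(max_tokens):
        token = predict_next(image, context)
        output.append(token)
        if token == END:
            break
        # Drop the oldest entry and append the new prediction.
        context = context[1:] + [token]
    return output


def fake_predict_next(image, context):
    """Hypothetical stand-in for the network: maps the last token to the next."""
    script = {START: "stack", "stack": "{", "{": "row", "row": "}", "}": END}
    return script[context[-1]]


print(greedy_decode(fake_predict_next, image=None))
# ['<start>', 'stack', '{', 'row', '}', '<end>']
```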



We use BLEU to measure how closely our outputs match the true labels. It breaks each token sequence into n-grams of sizes 1 to 4 and, for each size, computes the fraction of predicted n-grams that also appear in the true label. If the true label is (<start>, tk1, tk2, tk3, <end>) and the prediction is (<start>, tk1, tk2, wr3, <end>), the calculation is:


= (4/5) * 0.25 + (2/4) * 0.25 + (1/3) * 0.25 + (0/2) * 0.25

= 0.2 + 0.125 + 0.083 + 0 = 0.408

Since the prediction and the true label have the same length, no brevity penalty is applied to further reduce the BLEU score.
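The arithmetic above can be reproduced with a short script. Note that this implements the simplified score used in this article, an equally weighted sum of the 1-gram to 4-gram precisions; standard BLEU combines the precisions with a geometric mean and adds a brevity penalty. The function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simplified_bleu(reference, prediction, max_n=4):
    """Equally weighted sum of n-gram precisions for n = 1..max_n."""
    score = 0.0
    for n in range(1, max_n + 1):
        predicted = ngrams(prediction, n)
        remaining = ngrams(reference, n)
        matches = 0
        for gram in predicted:
            if gram in remaining:
                remaining.remove(gram)  # each reference n-gram matches once
                matches += 1
        if predicted:
            score += (matches / len(predicted)) / max_n
    return score


reference = ["<start>", "tk1", "tk2", "tk3", "<end>"]
prediction = ["<start>", "tk1", "tk2", "wr3", "<end>"]
print(round(simplified_bleu(reference, prediction), 3))  # 0.408
```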


Here is the Keras code snippet that builds the vision model (source). This implementation consists of 3 convolution modules using max pooling, dropout and ReLU, followed by 2 fully connected layers.

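from keras.layers import (Input, Dense, Dropout, RepeatVector,
                          Conv2D, MaxPooling2D, Flatten)
from keras.models import Sequential

# AModel and CONTEXT_LENGTH (48) are defined in the pix2code repository.
class pix2code(AModel):
    def __init__(self, input_shape, output_size, output_path):
        image_model = Sequential()
        # Module 1: two 3x3 convolutions (32 channels), max-pool, dropout.
        image_model.add(Conv2D(32, (3, 3), padding='valid',
                               activation='relu', input_shape=input_shape))
        image_model.add(Conv2D(32, (3, 3), padding='valid', activation='relu'))
        image_model.add(MaxPooling2D(pool_size=(2, 2)))
        image_model.add(Dropout(0.25))  # rates follow the reference code

        # Module 2: same layout with 64 channels.
        image_model.add(Conv2D(64, (3, 3), padding='valid', activation='relu'))
        image_model.add(Conv2D(64, (3, 3), padding='valid', activation='relu'))
        image_model.add(MaxPooling2D(pool_size=(2, 2)))
        image_model.add(Dropout(0.25))

        # Module 3: same layout with 128 channels.
        image_model.add(Conv2D(128, (3, 3), padding='valid', activation='relu'))
        image_model.add(Conv2D(128, (3, 3), padding='valid', activation='relu'))
        image_model.add(MaxPooling2D(pool_size=(2, 2)))
        image_model.add(Dropout(0.25))

        # Flatten the feature maps before the two fully connected layers.
        image_model.add(Flatten())
        image_model.add(Dense(1024, activation='relu'))
        image_model.add(Dropout(0.3))
        image_model.add(Dense(1024, activation='relu'))
        image_model.add(Dropout(0.3))

        # Repeat the image features once per time step of the context.
        image_model.add(RepeatVector(CONTEXT_LENGTH))

        visual_input = Input(shape=input_shape)
        encoded_image = image_model(visual_input)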

The second code snippet is the language model encoder with a stack of two LSTM layers:

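from keras.layers import Input, LSTM
from keras.models import Sequential

class pix2code(AModel):
    def __init__(self, input_shape, output_size, output_path):
        # (continued) Language model: two stacked LSTM layers, 128 units each.
        # return_sequences=True keeps one output vector per time step.
        language_model = Sequential()
        language_model.add(LSTM(128, return_sequences=True,
                                input_shape=(CONTEXT_LENGTH, output_size)))
        language_model.add(LSTM(128, return_sequences=True))

        # The context: CONTEXT_LENGTH one-hot vectors of size output_size.
        textual_input = Input(shape=(CONTEXT_LENGTH, output_size))
        encoded_text = language_model(textual_input)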

Finally, this is the decoder with a stack of two LSTM layers, followed by the optimizer:

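from keras.layers import LSTM, Dense, concatenate
from keras.models import Model
from keras.optimizers import RMSprop

class pix2code(AModel):
    def __init__(self, input_shape, output_size, output_path):
        # (continued) Join image and context features at every time step.
        decoder = concatenate([encoded_image, encoded_text])

        decoder = LSTM(512, return_sequences=True)(decoder)
        decoder = LSTM(512, return_sequences=False)(decoder)
        # Softmax over the DSL vocabulary gives the next-token probabilities.
        decoder = Dense(output_size, activation='softmax')(decoder)

        self.model = Model(inputs=[visual_input, textual_input],
                           outputs=decoder)

        # Newer Keras releases name this argument learning_rate instead of lr.
        optimizer = RMSprop(lr=0.0001, clipvalue=1.0)
        self.model.compile(loss='categorical_crossentropy',
                           optimizer=optimizer)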

Future of pix2code

In HTML and CSS coding, there are many hidden rules and constraints imposed by browser implementations and their shortfalls. As a result, UI implementation is sometimes a trial-and-error effort. Since deep learning can extract millions of patterns, AI will eventually win. (Just like AlphaGo beat the Go master.)

However, even though coding layouts is tedious, one of the real challenges of front-end coding is flexibility and maintainability: how easy it is to make changes. pix2code needs to demonstrate how well it can group and organize information. Can related components share the same CSS attributes? This is why I started this article with prototyping. Because prototype code is thrown away, the quality requirement is much lower and the market is ready for such automation.

Airbnb Sketching Interfaces

pix2code is similar to the language translation problem. Instead of translating text into a different language, we transcribe images into a UI DSL. Airbnb has demonstrated its Sketching Interfaces project, which is similar to pix2code. Airbnb standardizes the UI components for all of its applications, and the whole system contains about 150 types of components. With only 150 words (components), the model is much easier to train. Of course, such a model is less general: you cannot draw just any design. But since many corporations have strict design guidelines, this may not be an issue.

Future exploration

Apply GAN

For each mockup, designers usually prepare multiple visual design options. By combining GAN with pix2code, we can create multiple variants from the mockups.

Here is an example of generating anime characters using deep learning. The right side defines the character’s attributes, and the deep network creates the anime character on the left.


Theoretically, we can apply a similar GAN technique to produce design variants, including different color schemes, layouts and data hierarchies.


In cognitive science, selective attention describes how we restrict our attention to particular objects in our surroundings. It helps us focus, so we can tune out irrelevant information and concentrate on what really matters. Attention also helps us learn more efficiently. Instead of looking at the whole image at every time step, we use the current LSTM state to narrow our focus. In the following picture, each output caption word is generated from a more focused region of interest determined by the LSTM state.
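One common way to implement this idea is soft attention: score each image region against the current LSTM state, turn the scores into weights with softmax, and take the weighted average of the region features. The sketch below assumes dot-product scoring and made-up feature vectors; captioning models often score regions with a small learned network instead.

```python
import math


def soft_attention(regions, state):
    """Weight image regions by their similarity to the LSTM state."""
    # Dot-product score between each region's features and the state.
    scores = [sum(r * s for r, s in zip(region, state)) for region in regions]
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the region features.
    dim = len(regions[0])
    context = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(dim)]
    return weights, context


# Three hypothetical image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = soft_attention(regions, state)
print(weights)
```

Regions whose features align with the state receive larger weights, so the next word is generated mostly from those regions.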

Other possible improvements to pix2code

  • Bidirectional LSTM models
  • Emil Wallner has suggested using stride-2 convolutions instead of max-pooling in the CNN to improve accuracy.


Completely replacing the layout coding task with AI may still be years away. The accuracy needs to improve for much more complicated designs. But some corporations have strict design guidelines, which may make it happen sooner rather than later.

Other resources

The pix2code research paper.

The pix2code GitHub code.

The pix2code dataset.

Source: Deep Learning on Medium