Image Generation: Text to Image

Source: Deep Learning on Medium


Researchers today are very interested in machine generation of content: images, audio, video, and more. Advances in deep learning have made all of this possible, and generative models can now produce remarkably convincing output. One good example is a machine generating faces of people who do not exist; another is the “Talking Mona Lisa”. Examples like these show the strength of deep learning.

Realistic faces, right?

Generative Adversarial Networks (GANs) are a breakthrough technology in the field of deep learning: they can generate images from random noise, as I explain later in this article. Looking at the image above, you might think these are real faces captured on camera, but in fact these people do not exist; the faces were generated by a GAN. Interesting, right?

In this article, I will talk about generating images from text using StackGAN, in the context of my Image Recreation project, which consists of two parts:

  1. Image Captioning: Converting Image to Text representation
  2. Image Generation: Generating Image from Text description

I discussed the first task in Image Captioning: Image to Text. Now let’s talk about the second task.


Generating photo-realistic images from text is an important problem with many applications, including photo editing and computer-aided design. Recently, Generative Adversarial Networks (GANs) have shown promising results in synthesizing real-world images. StackGAN is one type of GAN: it takes a text description as input and generates a corresponding image, in the style of the data set it was trained on.

Several GANs generate images from text descriptions, such as Conditional GAN and AttnGAN. The main problem with image generation is that training takes a long time and it is hard to generate high-resolution images efficiently. StackGAN addresses this by stacking two GANs in sequence, as the following sections explain.


Generative Adversarial Network: a GAN is composed of two networks

  1. Generator network: takes d-dimensional noise as input and outputs an RGB image, through a series of convolution layers. These images are termed fake images.
  2. Discriminator network: takes fake and real images as input and discriminates between them. For a fake image it should output 0, and for a real image 1.
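
The 0/1 labelling above can be made concrete with a small loss calculation. This is only a sketch: it assumes the discriminator outputs a probability between 0 and 1 and is scored with binary cross-entropy, and it uses the common non-saturating form for the generator. The function names and the example scores are made up for illustration.

```python
import math

def bce(prediction: float, label: float) -> float:
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    eps = 1e-12  # guards against log(0)
    p = min(max(prediction, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def discriminator_loss(real_scores, fake_scores):
    """Average loss when real images are labelled 1 and fake images 0."""
    losses = [bce(s, 1.0) for s in real_scores]
    losses += [bce(s, 0.0) for s in fake_scores]
    return sum(losses) / len(losses)

def generator_loss(fake_scores):
    """The generator wants its fakes to be scored as real (label 1)."""
    return sum(bce(s, 1.0) for s in fake_scores) / len(fake_scores)

# Hypothetical discriminator outputs, for illustration only.
good_d = discriminator_loss(real_scores=[0.9, 0.8], fake_scores=[0.1, 0.2])
fooled_d = discriminator_loss(real_scores=[0.9, 0.8], fake_scores=[0.7, 0.9])
```

A discriminator that scores fakes close to 0 gets a lower loss (`good_d`) than one that is fooled (`fooled_d`), while the generator's loss moves in the opposite direction. That tension is what makes the training adversarial.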

If you want to learn more about GAN, take a look at this article.

Architectural flow

Architectural model of StackGAN

Look at the architecture of StackGAN starting from the top-left corner, where it takes a text description as input. This text is first converted into an embedding vector, which can be done with transfer learning, using a pre-trained model; two good pre-trained models are available, named Word2Vec and Embedding.
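
To give a feel for what "converting text into an embedding vector" means, here is a toy sketch that averages per-word vectors. Everything in it is an assumption for illustration: the three-dimensional `TOY_VECTORS` table is invented, and a real system would load pre-trained vectors or use a trained text encoder instead.

```python
# Toy 3-dimensional word vectors, invented for this example.
# Real pre-trained vectors typically have hundreds of dimensions.
TOY_VECTORS = {
    "small": [0.1, 0.3, 0.0],
    "yellow": [0.9, 0.1, 0.2],
    "bird": [0.2, 0.8, 0.5],
}

def embed_text(description: str, vectors: dict) -> list:
    """Average the vectors of the known words in a description."""
    words = [w for w in description.lower().split() if w in vectors]
    if not words:
        raise ValueError("no known words in description")
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

# "a" is not in the toy table, so only three words contribute.
embedding = embed_text("A small yellow bird", TOY_VECTORS)
```

Whatever the encoder, the result is a fixed-length vector that the rest of the network can consume.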

The architecture is divided into two stages, Stage 1 and Stage 2.

Components of Stage 1: (input=[text], output=[64 x 64 image])

  1. Conditioning augmentation (CA): this component combines the embedding vector with random noise to make a d-dimensional array. The random noise introduces variation into the input vector, so that the same description can produce different images.
  2. Stage-1 generator: the output of the CA component is given to the stage-1 generator, which generates a low-resolution 64 x 64 RGB image. This image captures the coarse features of the described object, such as shape, color, and rough regions.
  3. Stage-1 discriminator: takes real and generated images, discriminates between them, and outputs 0 for a fake image and 1 for a real image.
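
The conditioning augmentation step can be sketched as follows. In the StackGAN paper the conditioning vector is sampled from a Gaussian whose mean and standard deviation are computed from the text embedding by a learned layer. This sketch simplifies that: it assumes the embedding itself is the mean and uses a fixed `sigma`. The function names, `sigma`, and the example embedding are hypothetical.

```python
import random

def conditioning_augmentation(embedding, sigma, rng):
    """Sample a conditioning vector around the text embedding.

    Assumption: the embedding is used directly as the mean and sigma is a
    fixed constant; a trained model would compute both with a learned layer.
    """
    return [m + sigma * rng.gauss(0.0, 1.0) for m in embedding]

def stage1_input(embedding, noise_dim, sigma=0.1, seed=None):
    """Concatenate the sampled conditioning vector with a noise vector."""
    rng = random.Random(seed)
    condition = conditioning_augmentation(embedding, sigma, rng)
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return condition + noise  # list concatenation: one d-dimensional input

# Hypothetical 3-dimensional text embedding, for illustration only.
text_embedding = [0.4, 0.4, 0.23]
v1 = stage1_input(text_embedding, noise_dim=5, seed=0)
v2 = stage1_input(text_embedding, noise_dim=5, seed=1)
```

The two calls use the same description but different seeds, so `v1` and `v2` differ. This is how a single caption can lead to many different generated images.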

Components of Stage 2: (input=[64 x 64 image], output=[256 x 256 image])

  1. Stage-2 generator: takes the low-resolution (64 x 64) image generated by Stage 1 together with the text embedding, applies a series of convolution operations, and generates a high-resolution 256 x 256 RGB image. This image includes specific fine-grained details such as a bird's beak, eyes, and throat.
  2. Stage-2 discriminator: takes the images generated by the stage-2 generator and real images from the data set and discriminates between them. This is done through a series of downsampling steps.
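
The resolution bookkeeping for the two stages can be sketched with plain lists. This is only shape arithmetic under stated assumptions: a real generator learns its upsampling with convolution layers rather than the fixed nearest-neighbour copy used here, and the final 4 x 4 size for the discriminator is an assumed stopping point, not something stated above.

```python
def upsample_nearest(image, factor=2):
    """Nearest-neighbour upsampling of a 2-D grid (a list of rows)."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def downsample_sizes(size, stop):
    """Spatial sizes visited when each step halves the resolution."""
    sizes = [size]
    while sizes[-1] > stop:
        sizes.append(sizes[-1] // 2)
    return sizes

# Stands in for one channel of a 64 x 64 Stage-1 image.
low_res = [[0] * 64 for _ in range(64)]
# Two 2x upsampling steps: 64 -> 128 -> 256.
high_res = upsample_nearest(upsample_nearest(low_res))
# Discriminator side: halve 256 until an assumed 4 x 4 feature map.
steps = downsample_sizes(256, stop=4)
```

Going from 64 x 64 to 256 x 256 takes two doublings, while a discriminator that halves the resolution at every step needs six steps to bring 256 x 256 down to 4 x 4.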

For a detailed implementation of StackGAN, see this GitHub repository.

Scope of the task

The aim of my project, Image Recreation, is to train the GAN efficiently and generate more realistic images. The image captioning task produces a caption for a given image, and this caption is passed to the StackGAN architecture. The discriminator therefore has one fixed real image (the one given to the image captioning task) to compare against, which should help the architecture learn efficiently.


  1. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks