Summary: GameGAN

Original article was published on Deep Learning on Medium

Ideas from this summary are taken from the GameGAN paper.

Simulated environments are crucial to training reinforcement learning agents. We previously reviewed World Models. The general idea is to learn a dynamics model that understands the world and then either use features learned by the model to train an agent or train an agent entirely inside the model. In the case of GameGAN, we focus specifically on building the dynamics model.

Previous approaches to building dynamics models struggled to produce consistent samples beyond a small number of time steps. GameGAN takes a large step past previous work by introducing an external memory module and a way to disentangle static and dynamic elements of the environment.



The architecture contains three major components:

  1. External Memory Module
  2. Dynamics Engine
  3. Rendering Engine

External Memory Module


The external memory module is inspired by the Neural Turing Machine (NTM) architecture. The main difference is that attention shifts are gated on whether an action is valid (e.g., walking into a wall is invalid, so the attention location should not move).

K, G, ε are MLPs

M is the memory block, α is the attention location, w is a shift kernel, and g gates the shift kernel. Once the shift has been gated on whether the action and hidden state form a valid combination, and the LSTM's hidden state has been encoded, the module follows a standard NTM external memory read/write.
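The gated shift can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the update has the form α_t = g · shift(α_{t-1}, w) + (1 − g) · α_{t-1}, uses wrap-around borders for brevity, and takes w and g as given rather than computing them with the MLPs K and G. The function names are hypothetical.

```python
import numpy as np

def shift_attention(alpha, w, g):
    """Gated shift of a 2D attention map.

    alpha: (N, N) float attention over memory locations, sums to 1.
    w:     (3, 3) shift kernel, sums to 1; w[di + 1, dj + 1] is the
           weight for moving attention by (di, dj).
    g:     scalar in [0, 1]; 0 keeps alpha fixed (e.g. an invalid move).
    """
    shifted = np.zeros_like(alpha)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            # np.roll wraps around the borders (an assumption made for brevity).
            shifted += w[di + 1, dj + 1] * np.roll(alpha, (di, dj), axis=(0, 1))
    return g * shifted + (1.0 - g) * alpha

def write_memory(M, alpha, erase, add):
    """NTM-style write: erase, then add, weighted by attention. M is (N, N, D)."""
    a = alpha[..., None]
    return M * (1.0 - a * erase) + a * add

def read_memory(M, alpha):
    """NTM-style read: attention-weighted sum of memory slots, shape (D,)."""
    return np.tensordot(alpha, M, axes=([0, 1], [0, 1]))
```

With g = 0 the attention map is returned unchanged, which is the behavior wanted when the agent walks into a wall; with g = 1 the attention follows the shift kernel completely.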

Dynamics Engine

The dynamics engine begins with two steps:

  1. Encode the image observation (x) into a lower-dimensional vector (s).
  2. Modify the hidden state (h) based on the action (a), the external memory (m), and a random variable (z). The result, v, can be thought of as the modified hidden state.

H is an MLP and C is a CNN.

These steps are followed by a standard LSTM forward pass.
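A rough NumPy sketch of one dynamics step follows. It rests on several assumptions that are not stated above: the hidden state is modified by an element-wise product with the output of H, a single linear layer stands in for the MLP H, the encoded frame s is passed in directly instead of being computed by the CNN C, and v takes the place of the hidden state in the LSTM gates. The parameter names (`W_H`, `W_v`, `W_s`, `b`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d_h, d_a, d_z, d_m, d_s, seed=0):
    """Random weights for the sketch; sizes are illustrative only."""
    rng = np.random.default_rng(seed)
    return {
        "W_H": rng.normal(scale=0.1, size=(d_h, d_a + d_z + d_m)),
        "W_v": rng.normal(scale=0.1, size=(4 * d_h, d_h)),
        "W_s": rng.normal(scale=0.1, size=(4 * d_h, d_s)),
        "b": np.zeros(4 * d_h),
    }

def dynamics_step(h_prev, c_prev, a, z, m_prev, s, p):
    """One step: modify h into v, then run an LSTM cell on (v, s)."""
    # Stand-in for the MLP H: one linear layer over [a, z, m].
    v = h_prev * (p["W_H"] @ np.concatenate([a, z, m_prev]))
    # Standard LSTM gates, driven by v (modified hidden state) and s (encoded frame).
    gates = p["W_v"] @ v + p["W_s"] @ s + p["b"]
    i, f, o, g = np.split(gates, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c
```

Calling `dynamics_step` repeatedly, feeding each returned `(h, c)` back in with the next action and memory read, unrolls the engine over time.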

Rendering Engine


m and h are passed through a CNN and split channel-wise to produce attribute (A) and object (O) maps. A linear layer produces the type map (v, not to be confused with the modified hidden state). O is passed through either a spatial softmax or a sigmoid. A rough sketch (R) is generated by masking v with O and passing the result through a transposed CNN. The rough sketch and the attribute map masked by O are passed through a SPADE layer followed by a transposed CNN to produce a fine mask (η) and an image component (X). The image is generated by summing the element-wise products of corresponding fine masks and image components.

In the above example we use m and h to generate the image (x), but it is possible to include extra context (e.g., previous memory and hidden state vectors) by adding their corresponding fine masks and image components to the sum of element-wise products.
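The final compositing step is simple enough to show directly. The sketch below assumes the fine masks are logits that are normalized with a softmax across components at each pixel, so the weights at every location sum to 1; that normalization is an assumption of this example. The function name `composite` and the array shapes are illustrative.

```python
import numpy as np

def composite(fine_masks, components):
    """Blend K image components into one image.

    fine_masks: (K, H, W) mask logits, one map per component.
    components: (K, H, W, 3) image components.
    """
    # Softmax over the component axis (assumed normalization), computed stably.
    e = np.exp(fine_masks - fine_masks.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)
    # Sum over element-wise products of masks and components.
    return (weights[..., None] * components).sum(axis=0)
```

Adding extra context only means stacking more entries along the first axis: with K = 2 the image is built from the memory and hidden-state components, and larger K includes components from earlier vectors.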



GameGAN is trained adversarially with three discriminators:

  1. Single Image: optimizes for realistic frames
  2. Action-Conditioned: ensures predicted frames are consistent with respect to an action
  3. Temporal: accounts for temporal consistency. For example, walls should stay in the same place (difficult in partially observed environments)

The GAN loss is the sum of the single-frame, action-conditioned, and temporal GAN losses. Additional objectives are added to help stabilize training: Action (cross entropy), Info (mutual information), Reconstruction (L2), and Feature (L2) losses. When the memory module is present, the cycle loss can also be included.

Cycle Loss

After running through T time steps, the corresponding memory vectors (m) and read locations (α) are stored. Since dynamic elements do not stay in the same location over time, they are encouraged not to be stored in X^m, the image component generated from m.
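One way such a loss could be computed is sketched below. It assumes a step not spelled out above: the final memory block is re-read at each stored α to obtain a vector m̂, and the component rendered from m̂ is compared with the one rendered from the original m. If a moving object had been written to memory, the content at that location would change over time and the two renders would differ, so the penalty pushes dynamic elements out of X^m. Here `render_static` is a hypothetical stand-in for the memory branch of the rendering engine, and the L2 norm is an assumption.

```python
import numpy as np

def cycle_loss(M_final, alphas, ms, render_static):
    """Sum of distances between renders of stored and re-read memory vectors.

    M_final: (N, N, D) memory block after T steps.
    alphas:  list of T attention maps, each (N, N).
    ms:      list of T memory vectors read during the rollout, each (D,).
    render_static: callable mapping a memory vector to an image component.
    """
    loss = 0.0
    for alpha_t, m_t in zip(alphas, ms):
        # Re-read the final memory at the location used at time t.
        m_hat = np.tensordot(alpha_t, M_final, axes=([0, 1], [0, 1]))
        loss += np.linalg.norm(render_static(m_hat) - render_static(m_t))
    return loss
```

The loss is zero when the memory content at every visited location is unchanged by the end of the rollout, which is what one would expect for static elements such as walls.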