Original article was published on Deep Learning on Medium
Ideas from this summary are taken from the GameGAN paper.
Simulated environments are crucial to training reinforcement learning agents. We previously reviewed World Models, where the general idea is to learn a dynamics model that understands the world and then either use features learned by the model to train an agent or train an agent entirely inside the model. In the case of GameGAN we will focus specifically on building the dynamics model.
Previous approaches to building dynamics models did a subpar job of producing consistent samples beyond a small number of time-steps. GameGAN takes a large step past previous work by introducing an external memory module and a way to disentangle static and dynamic elements of the environment.
The architecture contains three major components:
- External Memory Module
- Dynamics Engine
- Rendering Engine
External Memory Module
The external memory module is inspired by the Neural Turing Machine architecture. The main difference is that attention shifts are gated based on whether an action is valid (ex: a shift corresponding to walking through a wall is blocked).
M is the memory block, α is the attention location, w is a shift kernel, and g gates the shift kernel. After gating the shift kernel based on the action and hidden state combination and encoding the LSTM’s hidden state, the module performs a standard NTM external memory read/write.
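The gating idea can be sketched in a few lines. This is a minimal illustrative version, not the paper's implementation: `gated_memory_step`, the identity-kernel fallback, and the plain attention-weighted read are all simplifying assumptions here.

```python
import numpy as np

def gated_memory_step(M, alpha, w, g):
    """One gated attention shift plus an NTM-style read (illustrative sketch).

    M:     (N, D) memory block
    alpha: (N,) current attention over memory slots
    w:     (K,) shift kernel, K odd, centered on "no shift"
    g:     scalar gate in [0, 1] derived from the action/hidden state;
           g near 0 suppresses the shift (e.g., for an invalid action)
    """
    K = len(w)
    identity = np.zeros(K)
    identity[K // 2] = 1.0
    # Gated kernel: blend the proposed shift with "stay in place".
    w_g = g * w + (1.0 - g) * identity
    # Circular convolution of attention with the gated kernel.
    N = len(alpha)
    new_alpha = np.zeros_like(alpha)
    for i in range(N):
        for j in range(K):
            new_alpha[i] += alpha[(i - (j - K // 2)) % N] * w_g[j]
    # Read: attention-weighted sum over memory slots.
    read = new_alpha @ M
    return new_alpha, read
```

With `g = 0` the attention stays put regardless of the proposed shift, which is the behavior we want when an action would move through a wall.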
Dynamics Engine
The dynamics engine is inspired by the approach found in Recurrent Environment Simulators. The basic idea is to begin each LSTM pass by:
- encoding an image observation (x) to a lower dimensional vector (s)
- modifying the hidden state (h) based on the action (a), external memory (m), and a random variable (z); v can be thought of as the modified hidden state
This is then followed by a standard LSTM forward pass.
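The steps above can be sketched as a single update function. Everything here is an assumption for illustration: the dense encoder standing in for a CNN, the concatenation-based fusion producing v, and all parameter shapes; the real model learns these components end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(dim_in, dim_out):
    # Random weights stand in for learned parameters.
    return rng.standard_normal((dim_in, dim_out)) * 0.1

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

D = 8                          # hidden size (illustrative)
W_enc = linear(16, D)          # image encoder (stand-in for a CNN)
W_v = linear(4 * D, D)         # fuses h, a, m, z into the modified state v
W_gates = linear(2 * D, 4 * D) # standard LSTM gates over [v, s]

def dynamics_step(x, h, c, a, m, z):
    """One step of a GameGAN-style dynamics engine (illustrative sketch)."""
    s = np.tanh(x @ W_enc)                            # encode observation x -> s
    v = np.tanh(np.concatenate([h, a, m, z]) @ W_v)   # modified hidden state
    gates = np.concatenate([v, s]) @ W_gates          # standard LSTM forward pass
    i, f, o, g_bar = np.split(gates, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g_bar)  # cell update
    h = sigmoid(o) * np.tanh(c)                       # new hidden state
    return h, c
```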
Rendering Engine
m and h are passed through a CNN and split channel-wise to produce attribute (A) and object (O) maps. A linear layer produces the type map (v, distinct from the modified hidden state above). O is either spatially softmaxed or sigmoided. A rough sketch (R) is generated by passing v, masked by O, through a transposed CNN. The rough sketch and the attribute map masked by O are passed through a SPADE layer followed by a transposed CNN to produce a fine mask (η) and an image component (X). The image is generated by summing the element-wise products between corresponding fine masks and image components.
In the above example we use m and h to generate the image (x) but it’s possible to include extra context (ex: previous memory and hidden state vectors) by including their corresponding fine mask and image component in the summation over element-wise products.
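The final composition step is simple enough to write out directly. This is a minimal sketch of the summation over element-wise products; the stacked-array layout (components along the leading axis) is an assumption for illustration.

```python
import numpy as np

def compose_image(fine_masks, components):
    """x = sum_k eta_k * X_k: element-wise products of fine masks and image
    components, summed over the component axis.

    fine_masks: (K, H, W, 1) masks (broadcast across channels)
    components: (K, H, W, C) image components
    """
    fine_masks = np.asarray(fine_masks)
    components = np.asarray(components)
    return np.sum(fine_masks * components, axis=0)
```

Adding extra context, as described above, just means appending more (mask, component) pairs along the K axis before summing.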
There are three GAN components.
- Single Image: optimizes for realistic frames
- Action-Conditioned: ensures predicted frames are consistent with respect to an action
- Temporal: accounts for temporal consistency. For example, walls should stay in the same place (difficult in partially observed environments)
The GAN loss is the sum of the single-image, action-conditioned, and temporal GAN losses. Additional objectives help stabilize training: an Action (cross entropy) loss, an Info (mutual information) loss, a Reconstruction (L2) loss, and a Feature (L2) loss. When the memory module is present, the cycle loss can also be included.
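The overall objective can be sketched as a weighted sum. The dictionary keys and the default weight of 1.0 are placeholders, not the paper's settings.

```python
def gamegan_loss(losses, weights=None):
    """Total objective: the three GAN losses plus weighted auxiliary terms.

    losses:  dict of scalar loss values; the three GAN terms are required,
             auxiliary terms ("action", "info", "reconstruction", "feature",
             "cycle") are optional.
    weights: optional dict of per-term coefficients (hypothetical defaults).
    """
    weights = weights or {}
    total = losses["single_image"] + losses["action_conditioned"] + losses["temporal"]
    for name in ("action", "info", "reconstruction", "feature", "cycle"):
        if name in losses:
            total += weights.get(name, 1.0) * losses[name]
    return total
```

The cycle term is only included when the memory module is present, mirroring the conditional in the loop above.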
The goal of the novel cycle loss is to secure long-term consistency of static elements. This is done by disentangling static and dynamic elements.
After running through T time steps, the corresponding memory vectors (m) and read locations (α) are stored. Since dynamic elements do not stay in the same location over time, they are encouraged not to be stored in X^m.
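The consistency penalty itself reduces to an L2 comparison per stored time step. This is a heavily simplified sketch of that idea, assuming the stored memory reads have already been decoded back to image space; the function name and list-of-arrays interface are hypothetical.

```python
import numpy as np

def cycle_loss(static_components, memory_decodes):
    """L2 between the static image component produced at each step and the
    image decoded from the memory read back at the stored location alpha_t.

    Both arguments are lists of equally shaped arrays, one per time step.
    A matching pair contributes zero; dynamic content that "leaked" into
    memory shows up as a large squared error.
    """
    terms = [np.mean((s - d) ** 2)
             for s, d in zip(static_components, memory_decodes)]
    return float(np.mean(terms))
```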
The environments generated in this paper are very compelling. The cycle loss is a clever way to disentangle static and dynamic representations. The fact that they were able to train this end-to-end is incredible considering how brittle GANs and LSTMs can be. Ablation studies on the architecture of GameGAN would be interesting future work.