Source: Deep Learning on Medium

### Introduction

The generative query network (GQN) is an unsupervised generative network, published in Science in June 2018. It is a scene-based method that allows an agent to infer the image seen from a new viewpoint, based on prior knowledge of the environment and observations from some other viewpoints. Because it is trained without supervision, the GQN paves the way towards machines that autonomously learn to understand the world around them.

The GQN mainly comprises three architectures: a representation, a generation, and an auxiliary inference architecture. The representation architecture takes images from different viewpoints and yields a concise, abstract scene representation, from which the generation architecture generates an image for a new query viewpoint. The inference architecture, which serves as the encoder in a variational autoencoder, provides a way to train the other two architectures in an unsupervised manner.

Now let’s delve into these three architectures and see what they look like. Note that this post involves some math expressions; for better readability, you are welcome to refer to my personal blog.

### Representation Architecture

The representation architecture describes the true layout of the scene by capturing the most important elements, such as object positions, colors and room layout, in a concise distributed representation.

#### Input & Output

**Input**: For each scene *i*, the input comprises observed images xᵢ¹,…,xᵢᵐ and their respective viewpoints vᵢ¹,…,vᵢᵐ, where the superscript indexes the different recorded views of a scene. Each viewpoint vᵢᵏ is parameterized by a 7-dimensional vector

$$v^k = (w, \cos y, \sin y, \cos p, \sin p)$$

where *w* consists of the 3-dimensional position of the camera, and *y* and *p* correspond to its yaw and pitch respectively.
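As a minimal sketch of how such a 7-dimensional vector can be assembled, the snippet below assumes that yaw and pitch are each encoded by their cosine and sine (3 + 2 + 2 = 7 components). The function name `encode_viewpoint` is hypothetical, and the angles are assumed to be given in radians.

```python
import math

def encode_viewpoint(position, yaw, pitch):
    """Return the 7-d viewpoint vector (w, cos y, sin y, cos p, sin p).

    Assumption: yaw and pitch are in radians and are encoded by their
    cosine and sine, which avoids the jump at +/- pi.
    """
    if len(position) != 3:
        raise ValueError("position must have exactly 3 components")
    x, y, z = position
    return [x, y, z,
            math.cos(yaw), math.sin(yaw),
            math.cos(pitch), math.sin(pitch)]

# Camera at (1, 2, 0.5), turned a quarter circle, looking level.
v = encode_viewpoint((1.0, 2.0, 0.5), yaw=math.pi / 2, pitch=0.0)
```

Encoding the angles this way keeps nearby camera orientations close together in the input space, which a raw angle in radians would not do around the wrap-around point.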

**Output**: Representation *r*, an effective summary of the observations at a scene, computed by the scene representation network:

$$r^k = \psi(x^k, v^k), \qquad r = \sum_{k=1}^{m} r^k$$

where ψ is a ConvNet shown later, and *r* is simply the element-wise sum of all rᵏ.
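The aggregation step can be sketched in a few lines of NumPy. Here `psi` is a hypothetical stand-in (a single linear map followed by a ReLU) rather than the actual ConvNet, and all shapes are toy values chosen for illustration; only the element-wise sum mirrors the description above.

```python
import numpy as np

def psi(image, viewpoint, weights):
    """Hypothetical stand-in for the ConvNet psi: one linear map plus ReLU.

    It only keeps the interface: one image and one viewpoint in,
    one per-observation representation r^k out.
    """
    features = np.concatenate([image.ravel(), viewpoint])
    return np.maximum(weights @ features, 0.0)

def scene_representation(images, viewpoints, weights):
    """Sum the per-observation representations r^k element-wise to get r."""
    return sum(psi(x, v, weights) for x, v in zip(images, viewpoints))

rng = np.random.default_rng(0)
images = [rng.random((4, 4, 3)) for _ in range(3)]   # three toy 4x4 RGB views
viewpoints = [rng.random(7) for _ in range(3)]       # 7-d viewpoint vectors
weights = rng.standard_normal((8, 4 * 4 * 3 + 7))    # toy 8-d representation

r = scene_representation(images, viewpoints, weights)
r_shuffled = scene_representation(images[::-1], viewpoints[::-1], weights)
```

Because the observations are combined by a plain sum, `r` and `r_shuffled` agree up to floating-point rounding: the representation does not depend on the order in which the views are presented, and any number of views can be summarized in a vector of the same size.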

#### Candidate Networks

The authors suggest three different architectures for ψ: ‘Pyramid’, ‘Tower’ and ‘Pool’.

They find that the ‘Tower’ architecture learns fastest across datasets and extrapolates quite well. Interestingly, the three architectures do not share the same factorization and compositionality properties. We leave the discussion of the properties of the scene representation to the end.

### Generation Architecture

During training, the generator learns about typical objects, features, relationships, and regularities in the environment. This shared set of ‘concepts’ enables the representation network to describe the scene in a highly compressed, abstract manner, leaving it to the generation network to fill in the details where necessary.

#### Input & Output

**Input**: Query viewpoint *v^q*, representation *r* computed from a set of images *x* and viewpoints *v* of the same scene as the query viewpoint, latent variable vector *z*

**Output**: Query image *x^q* corresponding to *v^q*.

#### Network

The query viewpoint *v^q* and the representation *r* are the same across LSTM cells, while the latent vector zₗ is derived from the hidden state hₗᵍ and therefore varies from cell to cell. More precisely, zₗ is sampled from the prior

$$\pi_\theta(z_l \mid v^q, r, z_{<l}) = \mathcal{N}\big(z_l \mid \eta^\pi_\theta(h^g_l)\big)$$

a Gaussian distribution whose mean and standard deviation are computed by the ConvNet η_θ^π.

This recurrent architecture splits the vector of latent variables *z* into *L* groups of latents zₗ, *l=1,…,L*, so that the *prior* can be written as an auto-regressive density:

$$\pi_\theta(z \mid v^q, r) = \prod_{l=1}^{L} \pi_\theta(z_l \mid v^q, r, z_{<l})$$

The generated image is sampled from the generator

$$g_\theta(x^q \mid z, v^q, r) = \mathcal{N}\big(x^q \mid \eta^g_\theta(u_L), \sigma_t^2\big)$$

also a Gaussian distribution, whose mean is computed by the ConvNet ηᵍ_θ from the final canvas u_L and whose standard deviation σ_t is annealed over the duration of training. The authors explain that annealing the variance encourages the model to focus on large-scale aspects of the prediction problem in the beginning and only later on the low-level details.
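To make the idea of an annealed standard deviation concrete, here is one possible schedule. The linear shape, the start and end values, and the number of annealing steps are all illustrative assumptions, not values taken from the text above; `annealed_sigma` is a hypothetical helper name.

```python
def annealed_sigma(step, sigma_start=2.0, sigma_end=0.7, anneal_steps=200_000):
    """Linearly interpolate the pixel standard deviation, then hold it fixed.

    All three defaults are illustrative assumptions.
    """
    fraction = min(max(step / anneal_steps, 0.0), 1.0)
    return sigma_start + (sigma_end - sigma_start) * fraction

# Early steps use a wide Gaussian, later steps a narrow one.
schedule = [annealed_sigma(s) for s in (0, 100_000, 200_000, 1_000_000)]
```

With a large standard deviation, per-pixel errors are penalized only weakly, so the coarse structure dominates the gradient; as the value shrinks, small pixel-level differences start to matter more.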

The workflow at each skip-connection convolutional LSTM cell is as follows:

- Concatenate *v^q*, *r* and zₗ to the hidden state hₗᵍ
- Pass the resulting tensor to convolutional layers with *sigmoid/tanh* activations to compute the forget gate, input gate, candidate values and output gate separately
- As in a standard LSTM, update the cell state and then the hidden state
- Upsample the hidden state via a transposed convolutional layer and add the result to uₗ. Notice that uₗ is constructed bit by bit as the LSTM unrolls, which means that the eventual image is also built up part by part over the steps.
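The workflow above can be sketched in NumPy as follows. To keep the example short and dependency-free, it makes two simplifications that are not part of the original model: the convolutions are replaced by 1x1 convolutions (a per-pixel linear map), and the transposed convolution is replaced by a 1x1 convolution followed by nearest-neighbour upsampling. All names, shapes and the random initialization are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv1x1(x, w, b):
    """1x1 convolution on an (H, W, C_in) tensor: a per-pixel linear map."""
    return x @ w + b

def generator_cell_step(v_q, r, z_l, h, c, u, params, scale=4):
    """One step of a toy generation cell following the listed workflow."""
    height, width, _ = h.shape

    def tile(vec):
        # Broadcast a vector over the spatial grid of the hidden state.
        return np.broadcast_to(vec, (height, width, vec.shape[-1]))

    # 1. Concatenate v^q, r and z_l to the hidden state.
    inp = np.concatenate([h, tile(v_q), tile(r), z_l], axis=-1)
    # 2. Separate layers for the three gates and the candidate values.
    f = sigmoid(conv1x1(inp, *params["forget"]))
    i = sigmoid(conv1x1(inp, *params["input"]))
    o = sigmoid(conv1x1(inp, *params["output"]))
    g = np.tanh(conv1x1(inp, *params["candidate"]))
    # 3. Standard LSTM update: cell state first, then hidden state.
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    # 4. Upsample the hidden state and accumulate it on the canvas u.
    up = conv1x1(h_new, *params["upsample"])
    up = up.repeat(scale, axis=0).repeat(scale, axis=1)
    return h_new, c_new, u + up

rng = np.random.default_rng(0)
H, W, hid, z_dim, u_dim = 4, 4, 6, 3, 5
in_dim = hid + 7 + 8 + z_dim          # hidden + viewpoint + representation + latent

def layer(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

params = {name: layer(in_dim, hid)
          for name in ("forget", "input", "output", "candidate")}
params["upsample"] = layer(hid, u_dim)

h = np.zeros((H, W, hid))
c = np.zeros((H, W, hid))
u = np.zeros((H * 4, W * 4, u_dim))
v_q, r = rng.random(7), rng.random(8)
for _ in range(3):                                # unroll three steps
    z_l = rng.standard_normal((H, W, z_dim))      # stand-in for a prior sample
    h, c, u = generator_cell_step(v_q, r, z_l, h, c, u, params)
```

Note that `v_q` and `r` are passed unchanged at every step, while a fresh `z_l` is supplied each time and the canvas `u` keeps accumulating the upsampled hidden states.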

### Inference Architecture

The inference architecture acts much like the encoder network in a VAE: it approximates the distribution of the latent variables and thereby makes it possible to train the other two architectures.

**Input**: Representation *r*, query viewpoint *v^q*, query image *x^q*

**Output**: The variational posterior density *q_ϕ(z|x,v,r)*, a Gaussian that approximates the true posterior over *z* and that the prior *π_θ(z|v,r)* is trained to match.

#### Network

Here’s my drawing of the LSTM cell for the inference architecture:

The workflow is very similar to that of the generation architecture, except that there is no *u* to update and the cell outputs the variational posterior density *q_ϕ(z|x,v,r)*.

### Optimization

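The loss function is derived in almost the same way as for the variational autoencoder, except that now we want to maximize the conditional probability *p_θ(x|v,r)*. For a detailed derivation, please refer to my previous post on the VAE. Here we just state the final loss function:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{z \sim q_\phi}\Big[-\log \mathcal{N}\big(x^q \mid \eta^g_\theta(u_L)\big) + \sum_{l=1}^{L} \mathrm{KL}\big[\,q_\phi(z_l \mid x^q, v^q, r, z_{<l}) \,\big\|\, \pi_\theta(z_l \mid v^q, r, z_{<l})\,\big]\Big]$$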

The loss function mainly consists of two parts:

- Reconstruction likelihood: *-log N(x|ηᵍ_θ(u))*. This term measures how likely the real query image is under the distribution produced by the generator. It is crucial to realize that the latent variable *z* is sampled from *q_ϕ(z|x^q,v^q,r)*, which is why we can expect the generator to produce *x^q*. For the same reason, the reconstruction likelihood is not involved in the update of the ConvNet η_θ^π.
- The KL divergence of the approximate posterior *q_ϕ(zₗ|x,v,r,z_{<l})* from the prior *π_θ(zₗ|v,r,z_{<l})*. This term penalizes the difference between the distribution of the latent variable *z* used in the generator and that in the inference architecture. Note that this term is accumulated sequentially through the network and can be calculated analytically by conditioning on the previous latent samples. A detailed analytical computation of the KL divergence between two Gaussians is appended at the end. It is also worth noting that, since the reconstruction likelihood disregards the prior *π_θ(z|v,r)*, the KL divergence does not contribute to the update of the transposed convolutional layer in *Cᵍ_θ*.

### Algorithm Pseudocode

#### Loss

To compute the ELBO, we first construct the representation *r*, then we run the generator architecture in parallel with the inference architecture, computing the KL divergence sequentially. Finally, we add the reconstruction likelihood to form the ELBO.
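The procedure can be sketched as a short loop. This is a toy version under stated assumptions: `prior_step`, `posterior_step` and `decode` are hypothetical callables standing in for the generator and inference cells (in the real model they would depend on the LSTM hidden states, *v^q* and *r*), and the function returns the negative ELBO so that it can be minimized.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)), summed over dimensions."""
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2) - 0.5)

def gaussian_nll(x, mean, sigma):
    """-log N(x | mean, sigma^2) for a scalar sigma, summed over pixels."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mean)**2 / (2 * sigma**2))

def negative_elbo(x_q, prior_step, posterior_step, decode, sigma, num_steps, rng):
    """Accumulate the KL step by step, then add the reconstruction term."""
    latents, kl = [], 0.0
    for _ in range(num_steps):
        mu_p, sigma_p = prior_step(latents)            # prior given z_{<l}
        mu_q, sigma_q = posterior_step(latents, x_q)   # posterior also sees x^q
        kl += gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
        # During training, z_l is drawn from the posterior, not the prior.
        latents.append(mu_q + sigma_q * rng.standard_normal(mu_q.shape))
    return gaussian_nll(x_q, decode(latents), sigma) + kl

# Toy stand-ins (hypothetical) so the sketch runs end to end.
rng = np.random.default_rng(0)
x_q = rng.random(4)
prior_step = lambda zs: (np.zeros(2), np.ones(2))
posterior_step = lambda zs, x: (np.full(2, x.mean()), np.full(2, 0.5))
decode = lambda zs: np.tanh(sum(zs).sum()) * np.ones(4)

loss = negative_elbo(x_q, prior_step, posterior_step, decode,
                     sigma=1.0, num_steps=3, rng=rng)
```

The KL term is available in closed form at every step because both distributions are Gaussians once the earlier latents are fixed, so only the reconstruction term relies on the sampled latents.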

#### Training

At each training step, we first choose a batch of scenes. For each scene, we sample several viewpoints and their corresponding images, along with a query viewpoint and image, to form our training data. Then we compute the loss function and optimize the network with the Adam optimizer.
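The data-sampling part of a training step might look like the sketch below. The dataset layout (a list of scenes, each a list of `(image, viewpoint)` pairs), the function name and the random number of context views per scene are assumptions made for illustration; the loss computation and the Adam update are omitted.

```python
import random

def sample_training_batch(dataset, batch_size, max_context, rng):
    """Pick scenes, then split each scene's sampled views into context and query.

    dataset: hypothetical list of scenes, each a list of (image, viewpoint)
    pairs with at least two views per scene.
    """
    batch = []
    for scene in rng.sample(dataset, batch_size):
        num_context = rng.randint(1, min(max_context, len(scene) - 1))
        # Sample without replacement so the query view is never a context view.
        *context, query = rng.sample(scene, num_context + 1)
        batch.append({"context": context, "query": query})
    return batch

# Toy dataset: 5 scenes with 6 views each; strings stand in for real arrays.
dataset = [[(f"img_{s}_{k}", f"view_{s}_{k}") for k in range(6)] for s in range(5)]
batch = sample_training_batch(dataset, batch_size=2, max_context=3,
                              rng=random.Random(0))
```

Keeping the query view out of the context set forces the model to predict an image it has not been shown, rather than copying one of its inputs.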

#### Generation

At the generation stage, we first compute the representation of the query scene and then run the generation network to compute a Gaussian distribution from which we sample the query image.

### Supplementary Materials

#### Properties of The Scene Representation

The abstract scene representation exhibits some desirable properties as follows:

- t-SNE visualization of GQN scene representation vectors shows clear clustering of images of the same scene despite marked changes in viewpoint

- When prompted to reconstruct a target image, GQN exhibits compositional behavior as it is capable of both representing and rendering combinations of scene elements it has never encountered during training. An example of compositionality is given below

- Scene properties such as object color, shape, and size are factorized: a change in one still results in a similar representation

- GQN is able to carry out scene algebra. That is, by adding and subtracting representations of related scenes, object and scene properties can be controlled, even across object positions

- Because it’s a probabilistic model, GQN also learns to integrate information from different viewpoints in an efficient and consistent manner. That is, the more viewpoints are provided, the more accurate the prediction is likely to be.

#### Analytical Computation of The KL Divergence Between Two Gaussians

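Assume we have two Gaussians

$$p_1(x) = \mathcal{N}(x \mid \mu_1, \sigma_1^2), \qquad p_2(x) = \mathcal{N}(x \mid \mu_2, \sigma_2^2)$$

We compute their KL divergence as follows:

$$\begin{aligned}
\mathrm{KL}(p_1 \,\|\, p_2) &= \int p_1(x) \log \frac{p_1(x)}{p_2(x)}\, dx \\
&= \mathbb{E}_{p_1}\left[\log \frac{\sigma_2}{\sigma_1} - \frac{(x-\mu_1)^2}{2\sigma_1^2} + \frac{(x-\mu_2)^2}{2\sigma_2^2}\right] \\
&= \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}
\end{aligned}$$

For multivariate Gaussians with diagonal covariance matrices, letting *J* be the dimensionality of *x*, we have

$$\mathrm{KL}(p_1 \,\|\, p_2) = \sum_{j=1}^{J} \left[\log \frac{\sigma_{2,j}}{\sigma_{1,j}} + \frac{\sigma_{1,j}^2 + (\mu_{1,j}-\mu_{2,j})^2}{2\sigma_{2,j}^2} - \frac{1}{2}\right]$$

where *μ_{i,j}* and *σ_{i,j}* are the *j*th dimensions of *μᵢ* and *σᵢ* respectively.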
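As a sanity check, the standard closed form for two univariate Gaussians, log(σ₂/σ₁) + (σ₁² + (μ₁−μ₂)²)/(2σ₂²) − 1/2, can be compared against a direct numerical approximation of the integral of p₁ log(p₁/p₂). The helper names below are hypothetical, and the integration grid (a plain Riemann sum over ±12 standard deviations) is an arbitrary choice that is accurate enough for this purpose.

```python
import math

def kl_closed_form(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for scalars."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

def kl_diagonal(mu1, sigma1, mu2, sigma2):
    """Diagonal-covariance case: sum the scalar KL over the J dimensions."""
    return sum(kl_closed_form(a, b, c, d)
               for a, b, c, d in zip(mu1, sigma1, mu2, sigma2))

def log_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def kl_numeric(mu1, sigma1, mu2, sigma2, n=20001, width=12.0):
    """Riemann-sum approximation of the integral of p1 * log(p1 / p2)."""
    lo = mu1 - width * sigma1
    dx = 2 * width * sigma1 / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * dx
        lp1, lp2 = log_pdf(x, mu1, sigma1), log_pdf(x, mu2, sigma2)
        total += math.exp(lp1) * (lp1 - lp2) * dx
    return total

closed = kl_closed_form(0.0, 1.0, 1.0, 2.0)
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
two_dims = kl_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0])
```

For this example the closed form evaluates to log 2 + 2/8 − 1/2, and the numerical estimate agrees with it closely; in the diagonal case the per-dimension values simply add up.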

### References

S. M. Ali Eslami et al. (DeepMind). Neural scene representation and rendering. *Science*, 2018.