Getting Computers to Imagine.

How DeepMind’s Generative Query Network brings AI one step closer to human level imagination.


The idea of the generative model is perhaps one of the most significant ideas, not only in the field of artificial intelligence, but in the entirety of the technological era. Since the creation of the Boltzmann Machine (Geoffrey Hinton and Terry Sejnowski, 1985) and the Autoencoder ( Dana H. Ballard, 1987), exponential progress has been made within the field of generative models. Today, we have numerous types of generative models, ranging from variations of classical generative models such as Variational Autoencoders (a type of autoencoder), and new types of generative models such as Generative Adversarial Networks (Ian Goodfellow, 2014).

Less than a week ago, Google DeepMind published a paper describing a new algorithm which they call the Generative Query Network, or the GQN. This new algorithm is capable of “imagining” and generating 3-dimensional models of a scene from various viewpoints, from observing only a few 2-dimensional pictures of various locations within this scene.

A GIF from DeepMind’s blog post showing the observation image which is inputted into the GQN (left)and the three-dimensional model/rendering of the scene outputted by the network (right).

What is a Generative Query Network?

For a human, it only takes a few glances and milliseconds for our visual cortex to interpret and make inferences regarding a certain scene; within a matter of seconds, the brain uses the information which it has gathered to “imagine” and construct a model or representation of the room. For example, when a person walks into a room and sees only part of a table due to a certain obstruction, he or she can infer that the full table is present despite not being able to see the complete table, and can even visualize how that table might look. Similarly, if we look at any person’s face from the right side, we would easily be able to visualize what that person’s face might look like from the left side as well. This is exactly what DeepMind has got its Generative Query Network to accomplish.

The Generative Query Network is capable of learning about the details of the objects within a certain scene from the color and texture of individual objects to the lighting and spatial relationships between the different objects, only from a few snapshots of the scene. Using this information, the network is able to render a three-dimensional model of the scene from a query viewpoint(a human given input to the network which asks it to render the scene from a certain viewpoint) which is fed into the network, hence the name Generative Query Network.

Another GIF from DeepMind’s blog which shows the neural rendering of objects which are generated by the GQN after it is given a few observation images of the objects from certain viewpoints.

As stated by DeepMind in their blog post: “ Much like infants and animals, the GQN learns by trying to make sense of its observations of the world around it. In doing so, the GQN learns about plausible scenes and their geometrical properties, without any human labelling of the contents of scenes.”

As you may have noticed from this quote, the problem which a GQN is solving is an unsupervised learning problem: the network gets no explicit details regarding the scene, its contents and properties, and learns everything from what it “sees” in the training images of the scene which it is given. This, according to me, is perhaps the most interesting aspect of this algorithm, as it means that the GQN, like the mind of a human baby, perceives the world and innately makes sense and generalizations of its observations. If you think about it, it makes intuitive sense: imagination and visualization in general are unsupervised learning tasks: they are spontaneous and unguided, and do not depend on any explicit, externally given direction.

Explaining the Generative Query Network

The GQN is primarily made up of two networks: a representation network and a generation network. The job of the representation network is essentially to make sense of the data (snapshots of a scene) given to the network and the job of the generation network is to actually visualize and create a 3-dimensional model using the knowledge gained by the representation network. More specifically, the representation network takes the observations of the scene as an input and creates a representation vector which is packed with information describing the scene; you can think of it as the encoder segment of an autoencoder producing a latent space representation of the input image given to the autoencoder. The generation network then uses this representation vector to construct the three-dimensional scene and to visualize the scene from various, previously unobserved perspectives; you can think of this as the decoder segment of an autoencoder creating a reconstructed image using the aforementioned latent space representation.

The architecture of the Generative Query Network, with the Representation Network creating a neural scene representation (representation vector) from the input to the network and the Generation Network using the same to predict/render the scene from a different viewpoint.

In order to give the generator network the most amount of data in order to make the most accurate model it possibly can, the representation network must be able to extract as much information as possible from the scenes which it is given, and must pack as much of this information into the representation vector. The representation network not only captures data regarding the positions of objects within scenes, but also the objects’ orientation, color, and texture, as well as the general layout of the room. As the generator network also learns about objects, features, relationships and patterns within the scene during the training phase, the representation network can create an extremely compact representation vector which only contains the most essential information regarding the scene, using which the generator network can create the scene filling in minor gaps wherever necessary.

This animation from DeepMind gives a quick run through of what the GQN is doing. The animation shows how, given a few observations of a scene from a certain viewpoints, the GQN can learn the scene and can be queried to render the scene from various, previously unseen viewpoints of the scene. We see that the GQN successfully and accurately renders the scene from given viewpoints from not only different angles, but different locations in the scene as well.

Implications of the Generative Query Network

As DeepMind writes in their blog: “ The GQN’s generation network can ‘imagine’ previously unobserved scenes from new viewpoints with remarkable precision. When given a scene representation and new camera viewpoints, it generates sharp images without any prior specification of the laws of perspective, occlusion, or lighting. The generation network is therefore an approximate renderer that is learned from data.”

Additionally, the researchers from DeepMind say that the “ GQN’s representation network can learn to count, localize and classify objects without any object-level labels.” Despite the fact that the representation vector generated by the representation network is compact, the generation network’s constructions are essentially indistinguishable from the truth; this goes to show the extent to which the representation network can accurately perceive and extract information from the data which it is provided with. The generator network is also capable of accounting for uncertainty and and inconsistency within training data, as it can still construct a model of a scene even when it only has a few partial views of a certain part of the scene by stitching together these partial scenes and relating the information which it gathered from the partial scenes to combine these scenes in a way which is most representative of the ground truth.

This GIF from DeepMind’s blog shows how the GQN renders a scene by accounting for the uncertainty caused by the limited observations of the scene. The GIF also shows how the GQN updates its neural rendering of the scene as the number of observations of the scene increases and its uncertainty regarding the ground truth scene decreases.

The capabilities of the GQN can allow for much easier training of other deep learning models, especially in the fields of computer vision. Using the GQN to train object detection neural networks could drastically reduce the amount of training data required for the models to learn and could help object detection models achieve much higher performance with regards to understanding spatial relationships between objects in an input image. Furthermore, researchers at DeepMind highlight that using the representations produced by the GQN could allow “state-of-the-art deep reinforcement learning agents learn to complete tasks in a more data-efficient manner compared to model-free baseline agents”. The representation vector generated by the GQNs representation network could be used by RL agents as “innate” knowledge of the environment of the agent.

The above image from DeepMind’s blog shows how the representation generated by GQNs can be used for extremely data efficient reinforcement learning. We see that compared to other methods such as learning from raw image pixels, the RL agents can learn at a profoundly quicker rate using the representation vector of GQN. This same idea can be applied in other fields as well such as computer vision.

All of these characteristics of Generative Query Networks makes it an extremely powerful an valuable tool which can be applied in an vast number of fields. As stated above, GQNs could prove to be a significant step forward in the development of fully autonomous vehicles. Normally, neural networks used in self-driving cars use hundreds of millions of images and take hundreds of hours training. However, using the information-packed representations by the GQN, in addition to its capabilities to learn relational inference and learning spatial relationships between objects, in autonomous vehicle systems could drastically decrease the training time and increase the accuracy of these systems when it comes to tasks such as object detection and semantic scene segmentation. Furthermore, GQNs could also be applied in the fields of virtual reality, augmented reality, and in simulation/game engines for “querying across space and time to learn a common sense notion of physics and movement”, which could be used ton make more realistic video games and simulations.

As the most important aspect of autonomous cars is to understand their surroundings or the “scene” wherein they are located, the GQNs scene representation and understanding abilities has immense potential to be extremely useful within this field.

However, I believe that perhaps the most powerful, influential and scary application of GQNs is to combine its autonomous scene understanding capabilities with the learning capabilities of state-of-the-art reinforcement learning algorithms within robots to create truly autonomous, self-aware robots which would be able to explore, understand and learn from scenes without any sort of human guidance. Similar to the robots from the Terminator movies, these robots would be hyper-aware of their surroundings, visualize and imagine hypothetical scenarios, and use the same to make extremely accurate decisions and predictions in order to achieve their goals. The creation of such robots could either be the greatest innovation in the entirety of mankind, or the reason why mankind will cease to exist….

Could GQNs be the first step in the creation of self-aware autonomous robots, such as those from the “Terminator” movie franchise?


In its blog, DeepMind has mentioned that despite all of its successes, GQNs are still under development, and have numerous limitations, most notably, their inability to recreate scenes with an extremely high complexity, and the fact that until now it has only been tested on rendering simple, synthetic (human generated) scenes. Nevertheless, the development of GQNs represents a gigantic leap in not only scene understanding and rendering, but but computer vision as a whole, and the significance of this discovery and its potential for the future is undeniable. As of today, only time will be able to tell us whether the aforementioned scenarios will remain as science-fiction, or become a reality.

Source: Deep Learning on Medium