Learning to Simulate

Source: Deep Learning on Medium

How learning to simulate better synthetic data can improve deep learning

The paper, presented at ICLR 2019, can be found here. Slides and a poster explaining the work in detail are also available.

Photo by David Clode on Unsplash

Deep neural networks are an amazing piece of technology. With enough labelled data they can learn to produce very accurate classifiers for high-dimensional input such as images and sound. In recent years the machine learning community has been able to successfully tackle problems such as classifying objects, detecting objects in images and segmenting images.

The caveat in the above statement is "with enough labelled data". Simulations of real phenomena and of the real world can sometimes help. There are cases where synthetic data has improved the performance of deep learning systems in computer vision or robotic control applications.

Simulation can give us accurate scenes with free labels, but building the simulated world is not free. Take Grand Theft Auto V (GTA) for example. Researchers have collected a dataset by free-roaming the GTA world and have used it to bootstrap deep learning systems, among other things. Many game designers and map creators worked on creating the intricate world of GTA. They painstakingly designed it, street by street, and then went through the streets with a fine-tooth comb, adding pedestrians, cars, objects, etc.

An example image from GTA V (Grand Theft Auto V)

This is expensive, both in time and in money. Also, by generating random scenes we might undersample important cases. Let’s imagine we are trying to train a classifier which detects dangerous scenes. In the real world we will run into dangerous scenes like the one below with very low frequency.

Example of a dangerous traffic scene

Using random simulated scenes we might not do much better. This means important edge cases might be severely undersampled and our classifier might not learn how to detect them correctly.

Learning to simulate is the idea that we can learn how to procedurally generate scenes in an optimal way, such that a deep network trained on them either learns a very good representation or performs well in a downstream task.

We create a parameterized procedural traffic scene simulator using Unreal Engine 4 and the CARLA plugin. Our simulator creates a road of variable length with different types of intersections (X, T, or L). We can populate the road with buildings on the side and cars of 5 different types on the road. We can also change the weather between 4 different weather types, which control lighting and rain effects.

A demo of our procedural scene simulator. We vary the length of the road, the intersections, the number of cars, the type of cars and the number of houses

We place a car on the road of each generated scene. This car captures RGB images of the scene, which come with semantic segmentation labels and depth annotations.
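To make the idea of a parameterized simulator concrete, here is a minimal sketch of what a set of scene parameters could look like, together with a purely random sampler. The class name, field names and value ranges are hypothetical and chosen only to mirror the description above; they are not the actual interface of our simulator or of CARLA.

```python
import random
from dataclasses import dataclass

# Hypothetical constants mirroring the description in the text.
INTERSECTIONS = ("X", "T", "L")
NUM_CAR_TYPES = 5
NUM_WEATHER_TYPES = 4


@dataclass
class SceneParams:
    road_length: int     # length of the road (unit is an assumption)
    intersection: str    # one of INTERSECTIONS
    num_buildings: int   # buildings placed along the road
    car_types: list      # one car-type index per spawned car
    weather: int         # index of one of the 4 weather types


def sample_random_scene(rng: random.Random) -> SceneParams:
    """Draw every parameter uniformly at random (the ranges are assumptions)."""
    num_cars = rng.randint(0, 10)
    return SceneParams(
        road_length=rng.randint(5, 50),
        intersection=rng.choice(INTERSECTIONS),
        num_buildings=rng.randint(0, 20),
        car_types=[rng.randrange(NUM_CAR_TYPES) for _ in range(num_cars)],
        weather=rng.randrange(NUM_WEATHER_TYPES),
    )


scene = sample_random_scene(random.Random(0))
print(scene)
```

Sampling uniformly like this corresponds to random simulation. Learning to simulate replaces the uniform draws with distributions whose parameters are learned.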

An inside view of the generated scenes from our simulator with a fixed set of parameters

However, the learning to simulate algorithm is more general than this. It can apply to any type of parameterized simulator. By this we mean that, for any simulator that takes in parameters as an input, we present a way to search for the best parameters such that the data generated is optimal for a deep network to learn the downstream task. Our work, to the best of our knowledge, is the first to do simulation optimization to maximize performance on a main task, as well as apply it to traffic scenes.

A traditional machine learning setup is the following, where data is sampled from a distribution P(x,y) (x is the data and y is the label). Usually this happens by collecting data in the real world and manually labeling the samples. This dataset is fixed, and we use it to train our model.

Traditional machine learning setup

By using a simulator to train a main task network, we can generate data from a new distribution Q defined by the simulator. This dataset is not fixed and we can generate as much data as our computation and time constraints allow. Still, the data generated in this domain randomization setup is randomly sampled from Q, so the amount of data needed to obtain a good model could be large. Can we do better?

We introduce learning to simulate, which optimizes a metric of our choice on a main task. The pipeline is trained by defining a reward function R which is directly related to this metric (usually it is identical to the metric itself). At every iteration of the algorithm we sample data from a parameterized simulator Q(x,y|Θ) and train the main task model on it. The reward R is obtained by testing the trained network on a validation set, and is then used to update the policy which controls the parameter Θ. In our case, we use vanilla policy gradient to optimize our policy.
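The loop described above can be sketched on a toy problem. This is only an illustration under strong assumptions, not our implementation: the "simulator" is reduced to a categorical distribution over four weather types, the policy is just the logits of that distribution, and training plus validating a network is replaced by a hypothetical proxy reward that is higher when the sampled batch resembles a hidden validation distribution. The update is a plain score-function (REINFORCE) policy gradient step with a moving-average baseline.

```python
import math
import random


def softmax(logits):
    """Convert logits into probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_reward(batch, target):
    """Stand-in for 'train the main task model, then evaluate on validation'.

    Assumption: the reward is the negative L1 distance between the batch
    histogram and the (hidden) validation distribution.
    """
    n = len(batch)
    hist = [batch.count(j) / n for j in range(len(target))]
    return -sum(abs(h - t) for h, t in zip(hist, target))


def learn_to_simulate(target, iters=1000, batch_size=64, lr=1.0, seed=0):
    rng = random.Random(seed)
    num_types = len(target)
    logits = [0.0] * num_types   # start from a uniform simulator
    baseline = 0.0               # moving average of past rewards
    for _ in range(iters):
        probs = softmax(logits)
        # 1. Sample a dataset from the simulator with the current parameters.
        batch = rng.choices(range(num_types), weights=probs, k=batch_size)
        # 2. Obtain the reward (here: the proxy instead of a trained network).
        reward = proxy_reward(batch, target)
        advantage = reward - baseline
        # 3. Policy gradient step: the gradient of the mean log-probability
        #    of the batch w.r.t. logit j is (empirical frequency - probs[j]).
        for j in range(num_types):
            score = batch.count(j) / batch_size - probs[j]
            logits[j] += lr * advantage * score
        baseline = 0.9 * baseline + 0.1 * reward
    return softmax(logits)


target = [0.4, 0.4, 0.1, 0.1]   # hypothetical validation weather proportions
learned = learn_to_simulate(target)
print([round(p, 2) for p in learned])
```

In the real pipeline, step 2 is by far the most expensive part, since it involves rendering scenes and training a network at every iteration.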

Informally, we are trying to find the best parameter Θ, which gives us the distribution Q(x,y|Θ) that maximizes accuracy (or whichever metric we choose) for the main task.
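One way to write this down is as a two-level optimization problem. The notation below is an assumption made for this post: w denotes the weights of the main task model f_w, ℒ is its training loss, π_ω is a policy with parameters ω from which Θ is sampled, and b is a baseline. The last line is the standard score-function (REINFORCE) estimate used by vanilla policy gradient, written for a single sampled Θ.

```latex
\begin{aligned}
\Theta^{*} &= \arg\max_{\Theta} \; R\bigl(w^{*}(\Theta)\bigr) \\
\text{s.t.} \quad w^{*}(\Theta) &= \arg\min_{w} \;
  \mathbb{E}_{(x,y)\sim Q(x,y\mid\Theta)}
  \bigl[\mathcal{L}\bigl(f_{w}(x),\, y\bigr)\bigr] \\[4pt]
\nabla_{\omega} J(\omega) &\approx \bigl(R - b\bigr)\,
  \nabla_{\omega} \log \pi_{\omega}(\Theta),
  \qquad \Theta \sim \pi_{\omega}
\end{aligned}
```

The inner problem is ordinary training on simulated data. The outer problem treats the reward as a black box, which is why a policy gradient method is a natural fit: no gradient has to flow through the simulator or the training procedure.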

We apply learning to simulate to a car counting task and compare it to what happens when using only random simulation. In the graph below, focus on the red and grey curves, which show how learning to simulate (LTS) achieves a much higher reward (lower mean absolute error of cars counted) after 250 epochs. The random sampling case improves briefly, but performance decreases once the random batch sampled is not adequate for the task. The grey curve rises slowly over several iterations, but learning to simulate converges on the best possible accuracy shown by the blue curve (where we use the ground-truth simulation parameters).

Reward for the car counting task. Note how learning to simulate converges to the best possible reward (on a simulated dataset) shown by the blue curve.
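As a small illustration of how a reward can be tied to the metric, the sketch below computes the mean absolute error of predicted car counts and negates it, so that a lower error gives a higher reward. The function names and the choice of simple negation are assumptions for illustration; other monotonic mappings from error to reward would work as well.

```python
def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and true car counts."""
    if not true_counts or len(predicted_counts) != len(true_counts):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return total / len(true_counts)


def car_counting_reward(predicted_counts, true_counts):
    """Hypothetical reward: the negated error, so higher is better."""
    return -mean_absolute_error(predicted_counts, true_counts)


# A model that is off by 1, 0 and 2 cars on three validation images.
reward = car_counting_reward([3, 5, 2], [2, 5, 4])
print(reward)
```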

What is happening? A nice way to look at it is by visualizing the probabilities of different scenarios and objects in our scene. We plot the weather probabilities over time. The ground-truth validation dataset oversampled certain weather types (clear noon and clear sunset) and undersampled the rest. We can see that our algorithm recovers the rough proportions!

Weather probabilities (logits) over time

Let’s do the same with car spawning probabilities. Our ground-truth dataset oversampled certain types of cars. Learning to simulate reflects these proportions after training as well. In essence, the algorithm pushes the simulator parameters to generate datasets which are similar to the ground-truth dataset.

Car probabilities (logits) over time

Now we show an example of how learning to simulate improves accuracy over random simulation on the KITTI traffic segmentation dataset, which was captured in the real world.

An example image from the KITTI dataset.