Training Object Detectors with No Real Data using Domain Randomization

Solving sim2real transfer for specialized object detectors with no budget

Deep learning has recently become the favored approach to object detection problems. However, as with many other uses of this technology, annotating training data is cumbersome and time-consuming, especially if you are a small company with a specific use case. In this article, I present some of our work on synthetic data generation for object detection and show a few live examples.

Recently, many research papers and private enterprises have focused their attention on automatic object detection in images. Gone are the days of sliding-window detectors on image pyramids; convolutional neural networks have taken their place. Researchers around the world are fine-tuning networks and adding bells and whistles to their training schemes to improve scores on large datasets such as Pascal VOC or COCO.

But what if your application calls for none of the classes offered by COCO and the like? You would normally have to build a new dataset of your selected object(s) of interest (OOI), which can be costly and time-consuming, especially if you are a small company. A good dataset shows your OOI from a variety of angles, in different lighting, and with enough additional variation that the detector does not overfit to one specific instance; for example, it is counterproductive to train a human detector on a single person, or a car detector on a single car model.

Data augmentation has been around for as long as training data has, from noise added to signal inputs to black boxes pasted over image regions. This paradigm has shifted slightly with the need for, and introduction of, Domain Randomization (DR), a powerful tool used to improve robot maneuvering and similar tasks that require lots of training data. DR relies on randomizing the parameters that are unimportant to the task at hand, so that a network learns to ignore them. Say you are looking to detect a Raspberry Pi with a camera. The boards come in different colours, and they can appear against different backgrounds and in different lighting environments. They can also be occluded by various other objects, be close to or far from the camera, and sit in many orientations. These are the parameters you want to randomize. The one thing you do not want to randomize in this scenario is the general physical shape of the object.
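As a rough illustration of that split between nuisance factors and signal, a per-frame sampler might look like the sketch below. The parameter names and ranges here are invented for the example, not taken from our actual generator:

```python
import random

# Nuisance factors: re-sampled for every frame so the detector learns to ignore them.
def sample_nuisance_parameters():
    return {
        "board_colour":    [random.random() for _ in range(3)],   # RGB in [0, 1]
        "background_id":   random.randrange(1000),                # arbitrary backdrop
        "light_intensity": random.uniform(0.2, 3.0),
        "light_count":     random.randint(1, 4),
        "distance_m":      random.uniform(0.1, 2.0),              # camera-to-object distance
        "rotation_deg":    [random.uniform(0, 360) for _ in range(3)],
        "occluder_count":  random.randint(0, 5),                  # random "garbage" objects
    }

# The signal: the object's mesh is the one thing that is never randomized.
OOI_MESH = "raspberry_pi.obj"  # hypothetical asset name
```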

Thanks to modern computer graphics, we can control these parameters and continuously render infinitely many variations of a Raspberry Pi given a 3D model of it. This leaves us with two paths: render images with nice lighting and realistic environments, possibly using advanced CG methods such as ray tracing, or render low-detail images with sub-optimal lighting much faster.

Coincidentally, we have been experimenting with training object detectors solely on synthetic data with an arbitrary number of random parameters. Our philosophy has been that the more varied training data we can produce in less time, the better and faster our model can learn to generalize. An image of a real, natural-looking object then becomes just another variation the model has potentially already seen.

Creating a Synthetic Data Generator

We selected five objects in the lab with matching 3D models and set out to create a fast synthetic image generator. Given my previous experience with Unity and the scale of the project, that was the engine we chose. The overall idea of the system is simple: completely randomize every parameter of the object and its environment that is not critical for determining its class and position. This is mostly experimental, intended to show what opportunities complete domain randomization offers.

A randomly generated Unity3D scene containing light sources, random “garbage” objects and Objects of Interest.

We randomize the camera parameters, material colours and other material properties, the intensity, colour and number of lights in the scene, and a host of other things in order to create the most varied dataset we can. The OOI are randomly scaled, rotated and positioned. The scene is then rendered twice: once to generate the image, and once to generate the bounding boxes. An example of a randomly generated scene is shown above.
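I won't go into the Unity details here, but one common way to get boxes out of a second render pass is to draw each OOI with a flat, unique ID and read back the resulting mask. The helper below is a sketch under that assumption (the mask format and function name are illustrative, not our exact implementation):

```python
import numpy as np

def bbox_from_id_mask(id_mask: np.ndarray, object_id: int):
    """Return a normalized (ymin, xmin, ymax, xmax) box for one object.

    id_mask: (H, W) array where each pixel holds the ID of the object
             rendered there (0 = background), i.e. the output of the
             second, flat-shaded render pass.
    """
    ys, xs = np.nonzero(id_mask == object_id)
    if ys.size == 0:
        return None  # object fully occluded or off-screen
    h, w = id_mask.shape
    return (ys.min() / h, xs.min() / w, (ys.max() + 1) / h, (xs.max() + 1) / w)

# Tiny example: a 4x4 mask with object 1 occupying the bottom-right quarter.
mask = np.zeros((4, 4), dtype=np.uint8)
mask[2:4, 2:4] = 1
print(bbox_from_id_mask(mask, 1))  # (0.5, 0.5, 1.0, 1.0)
```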

Below, I show some of the wacky renders of Raspberry Pis the system outputs. The red boxes merely show us where the bounding boxes are and are of course not part of the training data. As you can see, the outputs do not look realistic at all. In fact, it can be difficult to see at a glance what they represent. However, I can generate hundreds of thousands of these per hour on my laptop, which gives our model a decent chance of learning the shape representation of the objects and, we assume, learning to ignore everything else.

Examples of randomly generated images of a Raspberry Pi used as training data

I decided to train a Single Shot MultiBox Detector (SSD) model with a MobilenetV2 backbone, as we are interested in using the trained model in a mobile application. This means accuracy is expected to be slightly lower than that of, say, a VGG19-based model, but this model should both train and run faster. I wrote a quick script to continuously fetch images from the Unity application and write them to a Tensorflow .record file. This allows us to use transfer learning with model_main.py, the Tensorflow Object Detection API script that retrains existing models based on some configurable parameters. We can then visualize the results using Tensorboard. I also prepared a small test dataset of real images annotated with labelImg.
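For reference, a stripped-down version of the record-writing step might look like the following. The feature keys are the standard Tensorflow Object Detection API ones; the Unity-fetching function in the usage comment is a hypothetical stand-in for our bridge script:

```python
import tensorflow as tf

def make_example(jpeg_bytes, width, height, boxes, labels, label_texts):
    """boxes: normalized (ymin, xmin, ymax, xmax) tuples; labels: ints; label_texts: bytes."""
    feature = {
        "image/encoded": tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        "image/format":  tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"jpeg"])),
        "image/width":   tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        "image/height":  tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        "image/object/bbox/ymin": tf.train.Feature(float_list=tf.train.FloatList(value=[b[0] for b in boxes])),
        "image/object/bbox/xmin": tf.train.Feature(float_list=tf.train.FloatList(value=[b[1] for b in boxes])),
        "image/object/bbox/ymax": tf.train.Feature(float_list=tf.train.FloatList(value=[b[2] for b in boxes])),
        "image/object/bbox/xmax": tf.train.Feature(float_list=tf.train.FloatList(value=[b[3] for b in boxes])),
        "image/object/class/label": tf.train.Feature(int64_list=tf.train.Int64List(value=labels)),
        "image/object/class/text":  tf.train.Feature(bytes_list=tf.train.BytesList(value=label_texts)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Hypothetical usage with a generator that yields rendered images and their boxes:
# with tf.io.TFRecordWriter("train.record") as writer:
#     for jpeg, w, h, boxes, labels, texts in fetch_from_unity():
#         writer.write(make_example(jpeg, w, h, boxes, labels, texts).SerializeToString())
```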

Training an Object Detector on Synthetic Data

One of the things this endeavor has shown us is that models trained on synthetic data are often very unstable: loss and mAP fluctuate heavily between validation steps. One crucial fix is to increase the batch size. Additionally, we train without hard example mining, a method that has otherwise become standard in object detection training. The hypothesis here is that some examples are weighted so highly that the model overfits to them by evaluation time. The evidence for this is that, at each validation step, several real images received predicted bounding boxes and classes that were very similar in position and size, even though the image did not contain an object of that class at all. Examples of this are shown below.

Examples of predictions made with hard example mining. Left: Predictions. Right: Ground truth.

There is also a large discrepancy between training loss and evaluation loss. This is expected, as the two kinds of images are not similar. Since training loss converges much faster than evaluation loss, the latter hardly improves once the former slows. Additional data augmentation techniques and training methods come in handy here: we add random black patches, pixel shifts, and Gaussian blobs to further increase the difficulty for the detector (sketched below), and introduce dropout in the network. If you attempt this at home, beware: patience is required. It takes a while for the idea to “click” with the detector, at which point mAP scores start to increase. The “click” generally happens once regularization loss starts dropping consistently. Until then, predictions might seem totally random and loss might fluctuate significantly.
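A minimal numpy version of those extra augmentations could look like this; the patch counts, shift ranges and blob sizes are placeholder values, not our exact settings:

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply random black patches, a pixel shift and a Gaussian blob to an HxWx3 uint8 image."""
    img = image.copy()
    h, w = img.shape[:2]

    # Random black patches pasted over parts of the image.
    for _ in range(rng.integers(0, 4)):
        ph, pw = rng.integers(h // 8, h // 3), rng.integers(w // 8, w // 3)
        y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)
        img[y:y + ph, x:x + pw] = 0

    # Small random pixel shift (roll the whole image a few pixels).
    img = np.roll(img, shift=(rng.integers(-5, 6), rng.integers(-5, 6)), axis=(0, 1))

    # One soft Gaussian blob of random brightness added on top.
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    yy, xx = np.mgrid[0:h, 0:w]
    blob = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * (min(h, w) / 8) ** 2))
    img = np.clip(img + blob[..., None] * rng.integers(30, 100), 0, 255).astype(np.uint8)
    return img

# Example usage on a dummy 300x300 image:
# augmented = augment(np.zeros((300, 300, 3), dtype=np.uint8), np.random.default_rng(0))
```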

Results and Considerations

Below, we show some results from step ~28,000. Generating enough image data to train for this many iterations takes a few hours. These images are from Tensorboard and show both the predicted and the ground-truth bounding boxes on the evaluation set.

Decent predictions made at step 28,000. Left: Predictions. Right: Ground truth.

This image shows some decent predictions made by the network. Despite (likely) never having been trained on anything remotely as realistic-looking as this, the network localizes and classifies both a Raspberry Pi and a drill fairly accurately, even with random objects scattered about the scene and in front of the OOI.

At this point, the network has started to learn a strong representation of the objects. However, more data is required to better distinguish actual objects from background noise. Some examples are shown below.

Poor predictions made at step 28,000. Left: Predictions. Right: Ground truth.

These images show a selection of wrong predictions made at step ~28,000. Note how the detector still picks out the general outlines of objects, just not the right ones. The object it predicts as a Raspberry Pi is also the most detailed object in the scene, which suggests the model has learned that many lines and contours hint at that class. Also note that these confidences are significantly lower than those of the correct predictions in the previous images, so a slightly higher confidence threshold during visualization might mitigate the problem.

To better visualize the results, I present the detector running in the TensorflowLite Object Detector Demo App for Android. These results are based on the configuration discussed previously, i.e. no hard example mining, but I let the model train on a few million images. In synthetic-data-generation time, that is roughly equivalent to starting the generator as you leave the office and having the data ready when you get in the following morning.
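Getting the model into the app requires converting it to TFLite. The exact export path depends on your Object Detection API version; assuming the trained detector has already been exported as a SavedModel at the (placeholder) path below, a minimal conversion sketch is:

```python
import tensorflow as tf

# "exported_model/saved_model" is a placeholder for wherever your export lands.
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional post-training quantization

with open("detect.tflite", "wb") as f:
    f.write(converter.convert())
```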

This project was a fun experiment, and it showed us that synthetic data is a viable alternative to real data at a significantly lower cost, in both time and money. In fact, this project cost us nothing: all the parts are free to use and open source. Whether you are a hobbyist, a project lead, or anywhere in between, you have the opportunity to create exceptional software using the latest in deep learning at no cost other than a little effort.

I work in the Visual Computing Lab at the Alexandra Institute, a Danish non-profit company specializing in state-of-the-art IT solutions. In our lab, we focus on applying the newest computer vision and computer graphics research. We are currently working on data annotation, generation, and augmentation techniques to help smaller companies and individuals get started with deep learning. We are always open to collaboration!