Thermal Image Generation from RGB

Source: Deep Learning on Medium

By Hannes Liik
Thermal images from the FLIR ADAS dataset [1].


Thermal images have useful discriminative properties. Concretely, warm bodies (e.g. humans, animals, hot vehicles) tend to be objects of interest.

Unfortunately, thermal cameras:

  • Are expensive
  • Have a narrow field of view
  • Are not sufficient for all the tasks of perception (e.g. cannot read traffic lights)

For these reasons, RGB cameras will remain necessary for perception tasks such as semantic segmentation.

Semantic segmentation: one of the ultimate perception tasks, but extremely expensive to get labels for. Adapted from Cityscapes.

Perhaps, however, we could leverage thermal cameras in neural network training, to get better performance at test time using RGB cameras only?

In this work, I test the ability of neural networks to learn from thermal images by learning to convert RGB images to thermal images.


Examples of aligned RGB and thermal images.

For my experiments I used the FLIR thermal dataset, which has 14k paired RGB and thermal images (split into train and validation sets). The dataset also has bounding box labels for some objects, but I won’t be using them.

The paired images were not aligned. Some effort was needed to compute transforms that align the pixels of the two images. The alignment results are shown above.

There is, however, an interesting method that promises image-to-image domain translation without image pairs:
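The post does not say which kind of transform was used or how it was estimated. As a minimal sketch, assuming a single 2D affine transform is enough and that a few matching points have been picked by hand in both images (the coordinates below are hypothetical), the transform can be fit by least squares:

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine matrix M such that dst ~ M @ [x, y, 1]."""
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    # Homogeneous source coordinates: one row [x, y, 1] per point.
    A = np.hstack([src, np.ones((src.shape[0], 1))])
    # Solve A @ X = dst for X (3x2), then transpose to the usual 2x3 layout.
    X, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return X.T

def apply_affine(M, pts):
    """Map (x, y) points through the 2x3 affine matrix M."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]

# Hypothetical correspondences (RGB pixel -> thermal pixel), for illustration only.
rgb_pts = [(100, 80), (900, 90), (880, 700), (120, 690)]
thermal_pts = [(60, 35), (460, 40), (450, 345), (70, 340)]

M = fit_affine(rgb_pts, thermal_pts)
# Largest remaining misalignment, in pixels, over the chosen points.
residual = np.abs(apply_affine(M, rgb_pts) - np.asarray(thermal_pts)).max()
```

A 2x3 matrix of this form is what image-warping routines such as OpenCV's `cv2.warpAffine` expect. If the two cameras differ by more than scale, rotation and shift, a homography or lens-distortion model may be needed instead.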

CycleGAN [2]

Examples of how CycleGAN is able to transition between image domains. Image adapted from CycleGAN [2]

CycleGAN is perhaps one of the most elegant methods in deep learning. It works roughly as follows:

  1. A Convolutional Neural Network (CNN) takes an image from domain A and generates an image in domain B. A -> B’.
    A discriminator will force the generated image to look like it is from the target domain (thermal images).
  2. A CNN in the opposite direction will generate a reconstruction B’->A’.
    A simple L1 loss between the original image and the reconstruction forces the intermediate fake image B’ to contain all the information necessary for the reconstruction. This is why the generated images look like the inputs rendered in the target domain, rather than like arbitrary images from the target domain.
Diagrams of CycleGAN losses: (a) discriminator losses; (b), (c) reconstruction (cycle-consistency) losses. Image adapted from CycleGAN [2].
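The reconstruction loss from step 2 can be sketched in a few lines. This is an illustration only: the two "generators" below are hypothetical toy functions standing in for the CNNs, and the discriminator loss is left out.

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute difference between two arrays of the same shape."""
    return float(np.mean(np.abs(a - b)))

def cycle_consistency_loss(real_a, gen_ab, gen_ba):
    """L1 distance between an image and its round trip A -> B' -> A'."""
    fake_b = gen_ab(real_a)     # step 1: translate to the target domain
    recon_a = gen_ba(fake_b)    # step 2: translate back
    return l1_loss(real_a, recon_a)

# Toy stand-ins for the two CNNs (hypothetical, for illustration only).
def rgb_to_thermal(img):
    return img.mean(axis=-1, keepdims=True)   # collapses 3 channels to 1

def thermal_to_rgb(img):
    return np.repeat(img, 3, axis=-1)         # copies 1 channel to 3

rng = np.random.default_rng(0)
gray_image = np.repeat(rng.random((4, 4, 1)), 3, axis=-1)   # R = G = B
color_image = rng.random((4, 4, 3))

loss_gray = cycle_consistency_loss(gray_image, rgb_to_thermal, thermal_to_rgb)
loss_color = cycle_consistency_loss(color_image, rgb_to_thermal, thermal_to_rgb)
```

The gray image survives the round trip almost exactly, while the color image does not, because the toy forward mapping throws the color away. In CycleGAN that non-zero loss is the signal that pushes the generators to keep whatever the reconstruction needs.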

Practical TL;DR:
We just need a bunch of (random) RGB and thermal images and CycleGAN will hopefully learn the transitions between the domains. To make things easier, the images will be from one dataset, although in theory this isn’t necessary.
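"Random" here means the two domains are sampled independently of each other. A minimal sketch of such an unpaired sampler, with hypothetical file names and list sizes:

```python
import random

def sample_unpaired_batch(rgb_files, thermal_files, batch_size, rng):
    """Draw RGB and thermal files independently, so no pairing is implied."""
    rgb_batch = [rng.choice(rgb_files) for _ in range(batch_size)]
    thermal_batch = [rng.choice(thermal_files) for _ in range(batch_size)]
    return rgb_batch, thermal_batch

# Hypothetical file names; the two lists need not have the same length.
rgb_files = [f"rgb_{i:05d}.jpg" for i in range(10)]
thermal_files = [f"thermal_{i:05d}.jpeg" for i in range(7)]

rng = random.Random(0)
rgb_batch, thermal_batch = sample_unpaired_batch(rgb_files, thermal_files, 4, rng)
```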

Pix2Pix [3]

Because we do have aligned images, we can use supervised learning to train a network to generate thermal images.

A simple L1 (or alternatively mean squared error) loss for a fully convolutional neural network would generate blurry images.
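One way to see where the blur comes from: when several sharp outputs are equally plausible for the same input, the single prediction that minimizes the expected squared error is their average. The toy 1D example below uses made-up data, not the FLIR images, and shows this for an edge whose position is uncertain.

```python
import numpy as np

# Two equally plausible ground-truth rows for the same input:
# a sharp edge whose position differs by two pixels.
target_a = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
target_b = np.array([0, 0, 0, 0, 0, 1, 1, 1], dtype=float)
targets = np.stack([target_a, target_b])

def expected_mse(prediction):
    """Average squared error of one prediction against all plausible targets."""
    return float(np.mean((targets - prediction) ** 2))

# Pixel-wise average of the targets: a smeared edge with 0.5 in the uncertain pixels.
blurry = targets.mean(axis=0)

mse_blurry = expected_mse(blurry)
mse_sharp_a = expected_mse(target_a)
mse_sharp_b = expected_mse(target_b)
```

Here the smeared prediction scores 0.0625, while committing to either sharp edge scores 0.125, so a network trained with this loss is rewarded for hedging.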

To generate more detailed images, Pix2Pix simply adds a discriminator. Blurry outputs would be an easy tell for the discriminator, so the generator is forced to create sharp images.
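A rough sketch of how the two terms combine, assuming the discriminator outputs a probability that its input is real. The L1 weight of 100 is the value reported in the Pix2Pix paper; whether the same value was used for these experiments is not stated. The function names are illustrative and the networks themselves are omitted.

```python
import numpy as np

def bce(prob, label):
    """Binary cross-entropy of discriminator probabilities against a constant label."""
    prob = np.clip(prob, 1e-7, 1 - 1e-7)
    return float(np.mean(-(label * np.log(prob) + (1 - label) * np.log(1 - prob))))

def generator_loss(disc_prob_on_fake, fake, target, l1_weight=100.0):
    """Adversarial term (fool the discriminator) plus weighted L1 to the ground truth."""
    adversarial = bce(disc_prob_on_fake, 1.0)   # generator wants "real" verdicts
    l1 = float(np.mean(np.abs(fake - target)))
    return adversarial + l1_weight * l1

def discriminator_loss(disc_prob_on_real, disc_prob_on_fake):
    """Real images should score 1, generated images 0."""
    return bce(disc_prob_on_real, 1.0) + bce(disc_prob_on_fake, 0.0)

# Toy values: the same fake image, judged by the discriminator in two ways.
target = np.ones((4, 4))
fake = np.full((4, 4), 0.9)
loss_when_caught = generator_loss(np.array(0.1), fake, target)
loss_when_fooling = generator_loss(np.array(0.9), fake, target)
```

With the L1 term held fixed, the generator's loss is lower when the discriminator is fooled, which is the pressure toward sharp, realistic detail described above.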


Comparison of generated fake thermal images and ground truth. Samples are from the validation set which was not seen during training.

From the results I observe that:

  1. CycleGAN, an unsupervised method, generates images that look like thermal images because of the color scheme, but fails to capture the thermal properties of objects.
  2. Pix2Pix, a supervised (and therefore data-constrained) method, is able to generate more meaningful thermal images, but sometimes fails completely.
Example of Pix2Pix generating gibberish.
The car is shown to be hot, which is a good result, but humans are indistinguishable (Pix2Pix).
CycleGAN prediction vs ground truth (not aligned). CycleGAN’s behavior is much like converting the image into grayscale and coloring the sky black. Not very useful.


Unfortunately, CycleGAN was not able to learn to generate meaningful thermal images. Pix2Pix was better in this regard, but its predictions are not reliable.

Extra: Discussion

Thermal images may not be a good source of learning for neural networks because they have a high variance: the scaling can be different between images, humans can be colder than the ground, etc.

A GAN approach might be a bad fit: the detail it creates is there only to make the images more believable, not more accurate. We would probably be totally fine with blurry images, as long as they hint at possibly warm regions of interest. A counter-argument might be that the discriminator is actually an extra source of learning signal, which is good.


The general idea is by Tambet Matiisen.

Thanks to Martin Valgur for help with aligning RGB and thermal images.


[1] FREE FLIR Thermal Dataset for Algorithm Training

[2] (CycleGAN) Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros.

[3] (Pix2Pix) Image-to-Image Translation with Conditional Adversarial Networks.