Original article can be found here (source): Deep Learning on Medium
Regressing colors on lightness with convolutional neural networks
What is this project about?
The automatic colorization of grayscale images is a problem that has been drawing my attention for a long time. In this image-to-image translation problem, we want to infer the colors in an image based only on the patterns and textures that are recognizable in its colorless grayscale variant. Unfortunately, this arguably creative process is highly subjective, since one can think of many different plausible colorizations for the same grayscale image. I approached this problem by fitting a regression model in the form of a deep convolutional neural network that maps the lightness information in an image onto its colors. During this project, I learned more about so-called color spaces and discovered a new deep learning framework (PyTorch) for this and future projects. (GitHub)
What is a digital image?
We have to clarify what we understand as a digital image. In contrast to us humans, computers rely on silicon-based hardware and digital circuits, which restricts them to finite and discrete representations of the real world. Therefore, a natural image captured by a camera is typically stored in a digital format composed of a few grid-based layers of numeric intensity values. Each of these grid layers (channels) has a certain semantic to it, which depends on the underlying color space. A single position in these grids, given by a depth vector with numeric values from all the channels, is called a pixel. For instance, in the most commonly used color space (RGB), each of the channels encodes the light intensity values for one of the three light colors red, green, and blue, so that a single pixel is a three-dimensional vector. Unfortunately, the RGB color space has a few problems when it comes to the colorization of grayscale images. Because of that, I followed the approach of many related papers and tackled this image translation problem in another, more suitable color space.
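As a small illustration (using NumPy and the common height × width × channels memory layout), an RGB image is just a three-dimensional array, and a pixel is the depth vector at one grid position:

```python
import numpy as np

# a tiny 2x3 RGB image: height x width x channels, 8-bit intensities
img = np.zeros((2, 3, 3), dtype=np.uint8)
img[0, 0] = [255, 0, 0]   # top-left pixel: pure red
img[1, 2] = [0, 0, 255]   # bottom-right pixel: pure blue

pixel = img[0, 0]          # a single pixel is a three-dimensional depth vector
print(img.shape, pixel)
```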
What is a color space: RGB vs. LAB?
- RGB: As mentioned above, in the RGB color space each of the three channels is associated with one of the three additive primary colors: red, green, and blue. According to basic color theory, three monochromatic lights of these colors can emit light according to the pixels’ intensity values in order to create every color in between. Hence, all of these channels affect the color as well as the lightness of the entire image. This is suboptimal for our goal of automatically colorizing images solely based on their pixels’ lightness values: we would have had to transform the RGB image into a single-channel grayscale image and map it onto its three-channel ground-truth counterpart. Operating in another color space, one that I had never heard of before this project, felt a lot more natural and suited to this image translation problem.
- LAB: In contrast to RGB, the responsibilities for lightness and hue/saturation are divided between the three channels in the LAB color space. The first channel (L) is given by the perceived lightness intensity values and contains no information about the hue or the saturation of the respective pixel. The two remaining channels (A and B) are responsible for this. Together, these two color channels span a two-dimensional plane, where each point refers to a certain hue and saturation combination. The first of these two channels specifies the amount of green (-) or red (+), while the second channel specifies the amount of blue (-) or yellow (+) that is present in the respective pixel. Without going into too much detail, this color space is more elaborate and approximately ensures that the Euclidean distance between two colors resembles their actual perceptual distance. This is a strong argument for relying on mean-squared instead of mean-absolute regression during the training of the model. However, arguably the most convincing reason for working in this color space is that the lightness and the color-dependent channels are completely separated, which conceptually simplifies the structure of the colorization model.
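To make the separation concrete, here is a minimal NumPy sketch of the standard sRGB → XYZ → LAB conversion for a single pixel (D65 white point). In practice one would rather call a library routine such as skimage.color.rgb2lab, so treat this as an illustration of the math, not a reference implementation:

```python
import numpy as np

def rgb_to_lab(rgb):
    """rgb: one pixel of sRGB values in [0, 1]; returns (L, A, B)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    # 1. undo the sRGB gamma curve to get linear light
    lin = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # 2. linear RGB -> XYZ (sRGB primaries, D65 white)
    M = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    x, y, z = M @ lin
    # 3. normalize by the D65 reference white and apply the cube-root curve
    def f(t):
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    L = 116 * fy - 16          # lightness only
    A = 500 * (fx - fy)        # green (-) .. red (+)
    B = 200 * (fy - fz)        # blue (-) .. yellow (+)
    return L, A, B

# a gray pixel lands at A ≈ B ≈ 0: all of its information sits in L
print(rgb_to_lab([0.5, 0.5, 0.5]))
```

Note how any gray input maps (up to rounding in the matrix) to the origin of the AB plane, which is exactly the property the colorization model exploits.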
The network architecture
The basic neural network architecture that I used in this project is a so-called cascaded refinement network, which is composed of several refinement blocks, each operating on a certain image resolution. These blocks are chained together, starting at a very small resolution and getting increasingly larger until the final target resolution is reached. Between these refinement blocks, bilinear upsampling is used to reduce the number of learnable parameters and to induce some kind of prior on the generative function. Each of these blocks receives as input a concatenation of a bilinearly downsampled version of the input lightness channel (L) and the upsampled output of the previous block. The forward pass through this generator network is then recursively defined as a flow from the initial block, which only receives a downscaled version of the main input L, to the final refinement block, which produces the two AB color channels. Feeding the lightness channel to the network multiple times at different resolutions is supposed to help the network keep the shapes and textures of the grayscale image in mind and presumably allows it to focus more on an iterative refinement of its color choices. For more technical information about the exact structure of the generator blocks, I refer to my GitHub repository. This cascaded network was then trained in a supervised manner using the mean-squared-error loss function. Unfortunately, this basic generator network struggles with standard mean-squared regression and in the majority of cases produced only low-saturated colorizations with little variety in colors.
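The recursive structure described above can be sketched in PyTorch roughly as follows; the block depth, channel width, and number of resolutions below are made-up placeholders, and the exact layers live in the GitHub repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineBlock(nn.Module):
    """One refinement block operating at a fixed resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
        )
    def forward(self, x):
        return self.body(x)

class CascadedGenerator(nn.Module):
    def __init__(self, n_blocks=4, width=32):
        super().__init__()
        self.n_blocks = n_blocks
        # first block sees only the downscaled L channel;
        # later blocks see previous features concatenated with L at their scale
        blocks = [RefineBlock(1, width)]
        blocks += [RefineBlock(width + 1, width) for _ in range(n_blocks - 1)]
        self.blocks = nn.ModuleList(blocks)
        self.to_ab = nn.Conv2d(width, 2, 1)  # project features to the AB channels

    def forward(self, L):
        _, _, H, W = L.shape
        sizes = [(H >> i, W >> i) for i in reversed(range(self.n_blocks))]  # coarse -> fine
        x = self.blocks[0](F.interpolate(L, size=sizes[0], mode="bilinear",
                                         align_corners=False))
        for blk, s in zip(self.blocks[1:], sizes[1:]):
            x = F.interpolate(x, size=s, mode="bilinear", align_corners=False)
            L_s = F.interpolate(L, size=s, mode="bilinear", align_corners=False)
            x = blk(torch.cat([x, L_s], dim=1))  # refined features + fresh L
        return self.to_ab(x)  # two AB channels at full resolution
```

A forward pass with a 64×64 lightness channel, `CascadedGenerator()(torch.randn(1, 1, 64, 64))`, yields a tensor of shape `(1, 2, 64, 64)`: the predicted A and B channels.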
Brownish grays: Why mean-squared regression alone is problematic
In mean-squared regression, we want to minimize the mean sum of the squared differences between the pixels of the network’s prediction and the ground-truth target. In our case, we are comparing the true AB color channels of the original image to the two generated color channels of the network. So for each of the pixels in the image, we have to compute the distance between two points on a two-dimensional plane. Each position on this plane corresponds to one value on the A-axis (green-red) and one value on the B-axis (blue-yellow). The larger the magnitude of a vector on this plane, whose origin is given by the representative of no color (gray), the more saturated the associated color is.
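Written out with NumPy (the shapes and value ranges below are illustrative), the loss is just the mean over all pixels of the squared Euclidean distance between the predicted and the true (A, B) points:

```python
import numpy as np

rng = np.random.default_rng(0)
ab_true = rng.uniform(-100, 100, size=(2, 32, 32))  # ground-truth A/B channels
ab_pred = rng.uniform(-100, 100, size=(2, 32, 32))  # network output

# per-pixel squared distance on the AB plane, averaged over the image
sq_dist = ((ab_pred - ab_true) ** 2).sum(axis=0)    # shape (32, 32)
mse = sq_dist.mean()

# chroma (saturation) of a pixel = distance of its (A, B) point
# from the gray origin of the plane
chroma = np.hypot(ab_true[0], ab_true[1])
```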
If the network’s prediction is identical to the ground-truth colorization, all the point pairs on the plane are aligned and hence the sum of their distances is equal to zero. So if the network is quite confident about what color to put in a certain pixel after training, the mean squared distance for this pixel might be low. However, in the case of multiple plausible but vastly different colors, the mean-squared error is hard to keep reliably low. For instance, the network might predict the color of the sky with confidence but struggle with the hue of the body of a car. In other words, in the presence of uncertainty, the mean-squared error forces the generator network to average all plausible hypotheses together instead of encouraging it to commit to one of them with confidence. As all of us know, wildly mixing colors in reality usually leads to a brownish, non-saturated mush. Something similar happens with the prediction of the network when it aims at minimizing the mean-squared error in the case of multiple plausible color choices.
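A two-point thought experiment makes this concrete: if a car body is equally often red and blue in the training data, the prediction that minimizes the expected squared error is the mean of the two (A, B) points, and that mean is noticeably less saturated than either plausible color (the ab values below are made up for illustration):

```python
import numpy as np

red  = np.array([55.0, 40.0])    # hypothetical ab value of a red car body
blue = np.array([20.0, -60.0])   # hypothetical ab value of a blue car body

# the minimizer of 0.5*|y - red|^2 + 0.5*|y - blue|^2 is the mean of the targets
mse_optimal = (red + blue) / 2

def chroma(ab):
    # saturation = distance from the gray origin of the AB plane
    return float(np.hypot(ab[0], ab[1]))

print(chroma(red), chroma(blue), chroma(mse_optimal))
```

The averaged prediction sits much closer to gray than either target, which is exactly the "brownish mush" behavior observed in the basic generator.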