I recently took part in a competition organized by Weights & Biases where the objective was to automatically colorize black & white pictures.
When I first heard of the challenge, I thought it would be really fun to take part in it.
Challenges are a great way to test your skills and see what you still need to improve. During the Lyft Perception Challenge, I learnt a lot by digging through papers, fixing all kinds of bugs, and implementing whatever ideas came to mind. This new one was a great excuse to learn more (a new framework, architecture, supporting tools, code structure…).
Also, the reward was to have the opportunity to meet and chat with Shivon Zilis while playing with a development version of a Tesla, which ended up happening and was super cool!
The first step was to know what’s currently done in the industry. According to HowStuffWorks:
Most of the classic black-and-white movies have been “colorized,” mainly so that they can be shown on television in color. It turns out that the process used to add the color is extremely tedious — someone has to work on the movie frame by frame, adding the colors one at a time to each part of the individual frame.
Not very helpful…
What is interesting in this challenge is that the task would be difficult even for a human. When trying manually, it is hard to choose a color; sometimes we can barely even identify what we are looking at.
Before going straight to the code, it is important to develop an intuition on how the neural network could work, if at all, by comparing it to a human approach.
While a tone of grey can correspond to a large palette of colors, once we know the type of flower, only a small number of colors are possible. Even when several colors are possible, we can infer the right one by how dark it looks relative to the overall picture. For example, a darker color on a tulip would probably mean it is red while a lighter one may be orange, yellow or pink.
The same applies to the background. If we recognize a cloud, we know that around it is a sky, which is most likely blue. If we see trees we expect the branches to be brown and the leaves green.
The fact that the data-set is limited to flowers helps reduce the space of possible inputs.
The recognition of objects at a pixel level reminded me of semantic segmentation problems, which made me directly think of U-Net / SegNet architectures. The main difference is that instead of predicting classes for each pixel, we need to output a color (3 channels), and this is now a regression problem.
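To make the "one color per pixel" idea concrete, here is a minimal shape-bookkeeping sketch of a U-Net-style encoder/decoder. It is not the network used in the challenge: the depth, the filter counts and the function name `unet_shapes` are illustrative assumptions, and nothing is trained. It only tracks how the feature maps shrink and grow, and shows that the output keeps the input resolution with one value per color channel.

```python
def unet_shapes(height, width, depth=4, base_filters=32, out_channels=3):
    """List (stage name, (height, width, channels)) for a U-Net-style network.

    Hypothetical layout: each encoder level doubles the filters and halves
    the resolution; each decoder level upsamples by 2 and concatenates the
    matching encoder output (the skip connection).
    """
    if height % 2 ** depth or width % 2 ** depth:
        raise ValueError("height and width must be divisible by 2 ** depth")

    stages = [("input", (height, width, 1))]  # single grayscale channel
    skips = []  # channel count of each encoder level, reused by the decoder
    h, w = height, width

    # Encoder: convolve, remember the output for the skip, then downsample.
    for level in range(depth):
        c = base_filters * 2 ** level
        stages.append((f"enc{level}", (h, w, c)))
        skips.append(c)
        h, w = h // 2, w // 2

    c = base_filters * 2 ** depth
    stages.append(("bottleneck", (h, w, c)))

    # Decoder: upsample, concatenate the skip, convolve back down.
    for level in reversed(range(depth)):
        h, w = h * 2, w * 2
        stages.append((f"concat{level}", (h, w, c + skips[level])))
        c = base_filters * 2 ** level
        stages.append((f"dec{level}", (h, w, c)))

    # Regression head: one value per pixel and per color channel.
    stages.append(("output", (h, w, out_channels)))
    return stages


for name, shape in unet_shapes(256, 256):
    print(f"{name:>10}: {shape}")
```

The only structural change compared to a segmentation network is the last stage: instead of one score per class, it outputs `out_channels` continuous values per pixel.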
One issue was that the challenge started right before my vacation. And despite the great interest I had in this challenge, it could not beat a beach in Brazil.
Nevertheless, I wanted to give it a try, so I printed a bunch of relevant papers to read on the plane and between two naps.
I ended up not finding a satisfactory solution in my printed papers. Some of them tackled this problem, but I was not really convinced by their approach.
I took a lot of notes on training implementation and hyper-parameter tuning for many famous architectures (ResNets, MobileNets…) even though the conclusions often contradicted each other (such as when to use batchnorm, dropout, regularization…).
My first programming activity was to implement an architecture which I thought would be suitable: a U-Net/SegNet-type architecture for which the output layers would be the color channels.
After doing a quick search on color spaces, I found out that the YCrCb color space had the “gray” image on one of its channels, so I would have to predict only 2 channels, reducing the problem.
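A minimal per-pixel sketch of why this color space helps, assuming the full-range BT.601 coefficients used by JPEG (the exact variant used in the challenge is not stated, and the function names are hypothetical). The Y channel is the grayscale picture, so the network only has to predict the (Cr, Cb) pair, and the color picture is rebuilt from the known Y and the predicted chroma.

```python
def rgb_to_ycrcb(r, g, b):
    """Convert one RGB pixel (0-255 floats) to (Y, Cr, Cb), BT.601 full range."""
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luma: the "gray" image
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    return y, cr, cb


def ycrcb_to_rgb(y, cr, cb):
    """Inverse conversion: rebuild RGB from luma and the two chroma values."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return r, g, b


# Hypothetical reddish "tulip" pixel.
y, cr, cb = rgb_to_ycrcb(200.0, 30.0, 40.0)

network_input = y          # what the model sees
network_target = (cr, cb)  # what the model has to predict (2 values, not 3)

# At inference time, the known Y is combined with the predicted chroma.
print(ycrcb_to_rgb(network_input, *network_target))
```

A gray pixel maps to Cr = Cb = 128, which makes the chroma channels a natural "no color" baseline for the regression target.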
With this approach and after only a few epochs, we quickly get brownish pictures with green surroundings, probably corresponding to the average picture.
After a few more iterations, we get more promising results, with more diverse colors.
It’s now time to think of optimizations!
At this point, I have less than 10 days to complete the challenge (working mainly at night) and my network takes about 24h to reach decent results on my desktop. I need to decide on the best strategy to adopt:
- My model seems to be heading in the right direction. I could look for pre-trained models, though I think they may have limited success as my data-set is limited to flowers. Plus it’s always more fun to create your own network!
- There are about 5,000 pics in the data-set. I quickly scrolled through them and a lot are irrelevant (why do we always find babies and dogs in any data-set?). I estimated that I could sort 30–50 pictures per minute and, on a small batch, saw that I actually removed close to 35% of them.
I sadly realized that the best use of my time would be to increase my data-set (mainly with Google Open Images v4) and open pictures of flowers one by one to select the ones I wanted to keep (good quality, no filter, no dogs…). If I had known how long I would spend on this task, I would have kept the “bad” pictures and created a neural network to sort the pictures for me (would probably work well enough).
My data-set got a nice upgrade from 5,000 unsorted pictures to 20,000 relevant high-resolution pictures (perfect for data augmentation with crop/scale effects).
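As an illustration of crop/scale augmentation, here is a minimal sketch that picks a random square crop box, which would then be resized down to the network input size. The function name `random_crop_box`, the `min_scale` parameter and the 256-pixel output size are assumptions for the example, not details of the pipeline used for the challenge.

```python
import random


def random_crop_box(img_w, img_h, out_size, min_scale=0.5, rng=random):
    """Return a random square crop box (left, top, right, bottom).

    The crop side is drawn between `min_scale` times the shorter image side
    and the full shorter side, and is never smaller than `out_size`, so the
    crop only ever needs to be downscaled afterwards.
    """
    short_side = min(img_w, img_h)
    if short_side < out_size:
        raise ValueError("image is smaller than the requested output size")

    smallest = max(out_size, int(min_scale * short_side))
    side = rng.randint(smallest, short_side)   # random scale
    left = rng.randint(0, img_w - side)        # random position
    top = rng.randint(0, img_h - side)
    return left, top, left + side, top + side


# Hypothetical 1600x1200 picture, cropped for a 256x256 network input.
rng = random.Random(0)  # seeded for reproducibility
for _ in range(3):
    print(random_crop_box(1600, 1200, out_size=256, rng=rng))
```

High-resolution sources matter here: the larger the original picture, the more distinct crops and scales can be drawn from it without upscaling.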
In parallel, I was already refining my network. Due to the limited capacity of my desktop, I ran a few experiments on Paperspace and closely monitored the most promising training runs through Weights & Biases.
Here are a few insights from training & fine-tuning:
- Up-convolutions do not seem much better than simple up-sampling while increasing model size by 35–40% (so many parameters!). My intuition is that they also just learn an interpolation to pass on the features to larger size pictures.
- I tried to go as deep as possible (isn’t it supposed to be better?) & started to clearly over-fit at 7 layers. In the end, even my final architecture (6 layers) may have been too deep & wide but would over-fit only much later (probably needs more data to avoid it).
- Training with weight decay was way too slow, even after decreasing its contribution to the final loss several times, by several orders of magnitude. It may have led to better final results and avoided over-fitting problems, but who has the time for that?
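The first point above comes down to parameter counting: a fixed up-sampling layer has no weights, while an up-convolution (transposed convolution) has as many as a regular convolution with the same kernel and channels. The sketch below compares the two on a hypothetical decoder; the channel widths, kernel sizes and helper names are assumptions, skip connections are ignored, and the resulting ratio is not meant to reproduce the 35–40% measured on the actual model.

```python
def conv2d_params(kernel, in_ch, out_ch, bias=True):
    """Weights of a 2D (or transposed 2D) convolution with a square kernel."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


def decoder_params(channels, kernel=3, learned_upsampling=False, up_kernel=2):
    """Total parameters of a simplified decoder.

    `channels` lists the channel counts from the bottleneck to the last
    decoder stage. Each stage upsamples by 2, then applies one convolution.
    """
    total = 0
    for in_ch, out_ch in zip(channels, channels[1:]):
        if learned_upsampling:
            # Transposed convolution that keeps the channel count.
            total += conv2d_params(up_kernel, in_ch, in_ch)
        # Fixed up-sampling (nearest / bilinear) adds no parameters.
        total += conv2d_params(kernel, in_ch, out_ch)
    return total


widths = [512, 256, 128, 64, 32]  # hypothetical decoder widths
fixed = decoder_params(widths, learned_upsampling=False)
learned = decoder_params(widths, learned_upsampling=True)
print(f"fixed up-sampling:  {fixed:,} parameters")
print(f"up-convolutions:    {learned:,} parameters")
print(f"extra parameters:   {learned - fixed:,}")
```

How much this weighs on the whole model depends on the encoder size and on how the up-convolutions change the channel count, so the numbers printed here are only indicative.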
Please refer to my W&B report for more details on the training and tuning of the final architecture.
The final architecture was the following:
See below the results!
Note: I like to look first only at the black & white column, hiding the others, to see if I could predict the colors. Then I add the middle one and see if the prediction looks realistic, and finally I add the original picture.
This challenge made me think a lot and gave me ideas of other problems I now want to tackle:
- With a regression model, if the network hesitates 50/50 between blue and yellow, it will minimize the loss by outputting green (mid point), even if it thinks there is no chance for the output to be green.
- I want to keep track of difficult training data on the fly and reinforce learning on it (so many possible approaches: larger loss, larger update, more frequent sampling…).
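The first point above can be checked numerically. Assuming a mean squared error loss (the exact loss used is not stated here) and, for simplicity, a single hypothetical color axis where blue sits at 0.0 and yellow at 1.0, the prediction that minimizes the expected loss is the probability-weighted average of the candidates, even though that value is never the right answer.

```python
def expected_mse(prediction, targets, probs):
    """Expected squared error when the true target follows `probs`."""
    return sum(p * (prediction - t) ** 2 for t, p in zip(targets, probs))


def best_prediction(targets, probs, steps=1000):
    """Grid-search the prediction that minimizes the expected squared error."""
    lo, hi = min(targets), max(targets)
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda x: expected_mse(x, targets, probs))


blue, yellow = 0.0, 1.0  # hypothetical positions on one color axis
targets = [blue, yellow]

for probs in ([0.5, 0.5], [0.7, 0.3]):
    best = best_prediction(targets, probs)
    print(
        f"P(blue)={probs[0]:.1f}  best prediction={best:.3f}  "
        f"loss there={expected_mse(best, targets, probs):.3f}  "
        f"loss at blue={expected_mse(blue, targets, probs):.3f}"
    )
```

Committing to either real color always costs more than the in-between value, which is one motivation for treating colorization as classification over color bins rather than pure regression.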
I am also probably going to start using PyTorch (I was waiting for the release of 1.0) instead of TensorFlow or Keras. While I really like all their pre-built layers, models and utility functions, I hit a roadblock on this challenge when I wanted to implement some of my ideas that involved highly customized layers. Hopefully I’ll be able to do it with PyTorch (and spend less time trying to understand graph execution bugs…).
Let me know if you have any questions, remarks or ideas!
Source: Deep Learning on Medium