Inferring a super-resolution neural network on Raspberry Pi GPU

It all makes sense from the research standpoint… but is frustratingly impractical. If you do not have the reference HR image and you do not control the entire image formation process, you most likely do not have any sort of “degradation kernel”. In a very general setting you only have an image to upscale in a sharp, visually pleasing fashion, and you have no idea how exactly and how many times it has been resampled, or what else it may have experienced in its obscure past before getting to you.

To get around this in a simple way, I extended the training set with LR images produced with as many degradation kernels as I had immediately available. I simply took everything offered by OpenCV’s resize: the bilinear, bicubic and Lanczos interpolators, rendered new LR images and added them to the original ones. The pre-trained model was then re-fitted to the new, larger dataset in the same way. This allowed it to outperform the bicubic baseline on Set5 and Set14, with a modest gain that still produces a noticeable visual difference (and no ugly over-sharpened images). The price to pay is the PSNR on the original DIV2K validation set: the fine-tuned model achieves only 32.57 dB instead of 33.26 dB.
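Here is a minimal sketch of this augmentation, assuming a ×2 scale factor and HR images loaded with OpenCV; the function name is mine, but the interpolation flags are the ones actually offered by cv2.resize.

```python
import cv2

# The extra "degradation kernels": the interpolators offered by cv2.resize.
INTERPOLATORS = (cv2.INTER_LINEAR, cv2.INTER_CUBIC, cv2.INTER_LANCZOS4)

def render_lr_variants(hr_image, scale=2):
    """Produce one LR version of an HR image per interpolation kernel;
    these are added to the original (bicubically downscaled) LR images."""
    h, w = hr_image.shape[:2]
    lr_size = (w // scale, h // scale)
    return [cv2.resize(hr_image, lr_size, interpolation=interp)
            for interp in INTERPOLATORS]
```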

The impact of fine-tuning with multiple degradation kernels on PSNR on Set5 and Set14.
Bicubic baseline (left, 24.99 dB) vs fine-tuned model (right, 26.11 dB) on an image from Set14. There is no antialiasing applied when doing the bicubic downscale, so our network learns to correct aliasing artifacts and produce smooth edges. You can see the difference without zooming in, right?

Although our model and its training can be further explored and improved in many aspects, it will be our final model. So let me stop here for the machine learning part and proceed to the inference implementation.

Implementing the inference using OpenGL

Inferring a convolutional neural net on an image typically requires a lot of computation. Fortunately, it is easily parallelizable and thus well suited to a GPU.

GPUs are originally here to render pictures. A long time ago, graphics pipelines were fixed (non-programmable) and capable of a predefined set of standard computer graphics operations. But using them for more general-purpose computations intrigued many of us. There was even a specific term for this, “GPGPU”, that you will find almost nowhere today.

This is because things changed, and now we use GPUs to compute pretty much anything by means of a dedicated interface such as CUDA, OpenGL compute shaders, OpenCL, etc. For example, TensorFlow uses CUDA to talk to the GPU. CUDA is a proprietary Nvidia technology, so if you have a graphics card from another vendor, you will unfortunately not get the most out of your hardware with TensorFlow any time soon. But maybe I am too pessimistic: some time ago TensorFlow Lite introduced OpenGL compute shader support for some models and applications. This helps offload the CPU when running things like face detection on Android devices, which are not exactly packed with Nvidia GPUs.

OpenGL is ubiquitous. Any decent GPU from any vendor conforms to some version of OpenGL, generally offering a certain level of programmability.

So does the Raspberry Pi. I am not talking about the most recent Pi model available at the moment of writing, the 4 Model B, whose GPU is OpenGL ES 3.1-conformant, making it capable of almost anything a decent Android smartphone can do. I am talking about all the other Pi models, which are only compliant with the OpenGL ES 2.0 standard. This means: no compute shaders, only vertex and fragment ones; no floating point for input/output; no ability to output multiple values from a shader (only four 8-bit scalars)…

Regardless, it is enough to run the inference of the model we have just built.

It is worth noting that if you are in love with the Raspberry Pi, there are more efficient ways to access its GPU computing power without the OpenGL overhead: here, here, or even a Python library for doing GPGPU on the Pi here. This becomes very Pi-specific, but you can go much faster. I keep going with OpenGL here because I want to run the inference on other devices too.

Overview

To put it simply, we implement the operations performed during the inference as small programs (shaders) written in GLSL (OpenGL Shading Language). The shaders will also contain the hardcoded trained network weights. All the images and feature maps become textures, all at the input LR image resolution.

GLSL is much like C, with some syntax differences and limitations. Shaders are compiled at runtime by the GPU driver into hardware-specific binary code that the GPU executes, much like a CPU does. But there are differences, mainly due to the SIMD nature of the hardware behind GPUs. For example, GLSL is not a Turing-complete language, so you cannot recurse in GLSL code the way you do in C++ or Python. Fortunately, we do not need this for the inference of a feedforward convolutional neural network.

Since there are no compute shaders in the OpenGL ES 2.0 standard, we proceed in the traditional way, where a vertex shader and a fragment shader are needed to perform a render pass.

  • Our vertex shaders are trivial: they render a single quadrilateral projecting the entire input onto the entire viewport. I will not detail their code here.
  • Fragment shaders are where the magic happens. They will sample input textures containing the LR input (for the input layer) or feature maps (for hidden and output layers) and compute output feature maps. The GLSL code of fragment shaders is generated by a Python script from the trained model.

To get GLSL shaders running you typically need to write some nasty platform-dependent code setting up an OpenGL context and implementing all the machinery to perform a render pass. I skip the details here; the whole code is available anyway.

We proceed in the way explained above: the Y component of the input gets upscaled by the neural net, while the Cb and Cr chrominance channels are upscaled as a regular texture. OpenGL natively supports bilinear interpolation along with nearest-neighbor (which, by the way, is used to sample all the feature maps), so the chroma gets interpolated bilinearly. This is not the only option: one could implement bicubic chroma interpolation in a shader, or apply the neural net to the R, G and B channels successively.
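To make the data flow concrete, here is a rough CPU-side sketch of this split using OpenCV as a stand-in; upscale_luma is a hypothetical placeholder for the network inference, and OpenCV happens to store the channels in YCrCb order.

```python
import cv2

def upscale_x2(lr_bgr, upscale_luma):
    """Mimic the GL pipeline on the CPU: the luma goes through the network,
    the chroma is simply interpolated bilinearly."""
    ycrcb = cv2.cvtColor(lr_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    h, w = y.shape
    y_hr = upscale_luma(y)                                    # neural net, x2
    cr_hr = cv2.resize(cr, (2 * w, 2 * h), interpolation=cv2.INTER_LINEAR)
    cb_hr = cv2.resize(cb, (2 * w, 2 * h), interpolation=cv2.INTER_LINEAR)
    hr_ycrcb = cv2.merge([y_hr, cr_hr, cb_hr])
    return cv2.cvtColor(hr_ycrcb, cv2.COLOR_YCrCb2BGR)
```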

This is our model, with each brick being a shader: 32 of them in total. The output of every shader is a texture containing 4 feature maps, all at the input LR resolution, except for the output image of course.

Constraints

As mentioned above, our model is shaped by constraints coming from the Raspberry Pi OpenGL ES implementation. Let me finally explain them.

  • A fragment shader is a program executed for every pixel. It has a single output: a 4-component pixel color (this is what gets written to gl_FragColor). Therefore we can only compute (up to) four feature channels in a single shader. This multiplies the number of shaders we need but does not constrain the model size, so we can live with it. It also heavily increases the memory bandwidth usage, since the feature map textures are sampled many times… but there is no other way with GL ES 2.0 on the Raspberry Pi, to my knowledge.
  • All the feature map values are stored as 8-bit fixed-point values in the [0, 1] range. A way to cope with this is to use an activation function whose output range fits into [0, 1]. This is why we use the [0, 1]-bounded ReLU as the activation function everywhere. Actually, the simple fact of writing to gl_FragColor clamps the value to the [0, 1] range, so we do not even need to implement it explicitly: GLSL applies the bounded ReLU anyway. Cool!
  • A fragment shader has a limited number of input textures: at least 8 according to the standard, and exactly 8 on the Raspberry Pi. Since textures are (at most) 4-channel images containing RGBA colors, we end up with at most 8*4=32 feature maps on input. This is an actual constraint: to compute a 2D convolution we need access to all input feature maps in a single shader. Otherwise we would have to split the convolution across several shaders, each sampling at most 32 channels, and then use yet another shader to put the partial results together… It quickly becomes a mess and may be unfeasible due to the 8-bit shader output constraint. Therefore, all the feature maps can have at most 32 channels.
  • There are two extra conditions limiting the number of input channels. Firstly, there is a limit on the number of texture sampling operations per shader (64 for the Pi). To compute a 3×3 convolution, every texture gets sampled 3*3 = 9 times. With the limit of 64 samples we can thus use at most 7 textures, i.e. 28 feature maps. For 1×1 convolutions this is not an issue.
  • Secondly, there is a limit on the total number of instructions per shader. 3×3 convolutions over many input feature maps are the greediest in this sense. An implementation with 12 input and 8 output feature maps passes on all the hardware I had at hand (although, to be honest, I think I messed up freeing the GPU driver memory after linking the GLSL programs on my Raspberry Pi, so it is further optimizable). There might be a way to fit more feature channels by going with depthwise convolutions as in MobileNet, but that leads to a model with even smaller capacity, which did not seem to perform well in the few tests I did. Therefore we rely on grouped 3×3 convolutions over 12 feature maps, with pointwise convolutions on top of the grouped blocks to mix up their feature channels (see the sketch after this list). This is the key design decision shaping the model, giving 48–32–24–16 feature maps on the layer outputs.
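To illustrate what such a grouped block could look like in Keras, here is a sketch under my own assumptions (48 input channels split into 4 groups of 12, each mapped to 8 channels by a 3×3 convolution, then mixed by a 1×1 convolution); the exact group sizes and layer ordering of the actual model may differ.

```python
import tensorflow as tf

def conv(x, filters, size):
    """Convolution followed by the [0, 1]-bounded ReLU."""
    y = tf.keras.layers.Conv2D(filters, size, padding="same")(x)
    return tf.keras.layers.ReLU(max_value=1.0)(y)

def grouped_block(x, group_in=12, group_out=8, mixed_out=32):
    """Grouped 3x3 convolutions (so that each shader sees at most
    `group_in` input channels) followed by a pointwise 1x1 mix."""
    groups = [conv(x[..., g * group_in:(g + 1) * group_in], group_out, 3)
              for g in range(x.shape[-1] // group_in)]
    return conv(tf.keras.layers.Concatenate()(groups), mixed_out, 1)

# Hypothetical usage: a 48-channel feature map mixed down to 32 channels.
inp = tf.keras.Input(shape=(None, None, 48))
model = tf.keras.Model(inp, grouped_block(inp))
```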

This is it: we now have a trained model that fits the hardware. There are a few things remaining to get it running!

GLSL implementation

Converting such a fully convolutional model into a bunch of GLSL shaders becomes simple once it respects the hardware constraints: all we need is to take the weights and biases from the trained model and implement the convolutions in fragment shaders. Accessing the trained parameters of a layer in your favorite machine learning framework is generally not a problem (as simple as layer.kernel.numpy() and layer.bias.numpy() to get the Numpy arrays in TensorFlow 2 / Keras), so a Python script would do the job.

As for storing the network parameters in GLSL, there are different options, for example putting them into a separate texture or a uniform variable. However, the model is small, so the simplest option is to expose the weights and biases as hardcoded constants in the GLSL code. This is likely the most efficient way too.

Here is what the very last 1×1 convolution shader (the fifth layer of the model) looks like. It is the smallest of the 32 shaders in terms of code size. All it does is sample the 16 input feature maps (4 textures of 4 channels each), convolve them with the learned kernel, add the bias and write the result out to the fragment color variable.

I used the dot GLSL function to implement the convolution. It appeared quite efficient in some trials I performed on the Raspberry Pi, but there are of course other ways to organize the computation.
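To give an idea of the structure of such a generated shader, here is a sketch of a Python function producing the GLSL source of this last 1×1 convolution from the Keras layer parameters; the identifiers (x0…x3, texCoord) and the formatting are mine, not the ones used in the actual repository.

```python
import numpy as np

def vec4(values):
    # Format four floats as a GLSL vec4 literal.
    return "vec4(" + ", ".join(f"{v:.6f}" for v in values) + ")"

def last_layer_glsl(kernel, bias):
    """Emit a fragment shader computing a 1x1 convolution of 16 input
    feature maps (4 RGBA textures) into 4 output maps, one dot() per
    (input texture, output channel) pair; writing to gl_FragColor then
    clamps the result, acting as the bounded ReLU."""
    w = np.asarray(kernel).reshape(16, 4)   # layer.kernel.numpy(), shape (1, 1, 16, 4)
    b = np.asarray(bias).reshape(4)         # layer.bias.numpy()
    samples = "\n".join(f"    lowp vec4 f{t} = texture2D(x{t}, texCoord);"
                        for t in range(4))
    channels = []
    for c in range(4):
        terms = [f"dot(f{t}, {vec4(w[4 * t:4 * t + 4, c])})" for t in range(4)]
        channels.append(" + ".join(terms) + f" + {b[c]:.6f}")
    body = ",\n        ".join(channels)
    return ("uniform sampler2D x0, x1, x2, x3;\n"
            "varying mediump vec2 texCoord;\n"
            "void main() {\n"
            f"{samples}\n"
            "    gl_FragColor = vec4(\n"
            f"        {body}\n"
            "    );\n"
            "}\n")
```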

Together with the convolutional shaders, one last fragment shader is needed: the one merging the nicely upsampled luma with the cheaply upsampled chroma (the salmon-colored one in the scheme above). Indeed, the fifth layer output is a 4-channel texture of the LR input size; at each pixel position, its 4 channels contain 4 pixel values of the output HR luminance. So in this last fragment shader we demultiplex these values using the gl_FragCoord GLSL variable and add the chrominance from the input image:
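The shader itself is not reproduced verbatim here; the following is a rough sketch of the idea, written as a Python string as it could appear alongside the generator script. The channel-to-position mapping, identifiers and color conversion coefficients are my own guesses, not necessarily those of the actual implementation.

```python
MERGE_SHADER = """
uniform sampler2D luma4;    // layer-5 output: 4 HR luma values per LR pixel
uniform sampler2D lrInput;  // original LR image, sampled with bilinear filtering
varying mediump vec2 texCoord;
void main() {
    // Position of this HR pixel inside its 2x2 block.
    mediump vec2 sub = mod(floor(gl_FragCoord.xy), 2.0);
    lowp vec4 l = texture2D(luma4, texCoord);
    lowp float y = sub.x < 0.5 ? (sub.y < 0.5 ? l.r : l.b)
                               : (sub.y < 0.5 ? l.g : l.a);
    // Chroma taken from the bilinearly interpolated input (JPEG-style YCbCr).
    lowp vec3 rgb = texture2D(lrInput, texCoord).rgb;
    mediump float cb = dot(rgb, vec3(-0.169, -0.331,  0.500));
    mediump float cr = dot(rgb, vec3( 0.500, -0.419, -0.081));
    gl_FragColor = vec4(y + 1.402 * cr,
                        y - 0.344 * cb - 0.714 * cr,
                        y + 1.772 * cb,
                        1.0);
}
"""
```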

The Fun Part: Testing on all the hardware around

Raspberry Pi

The whole thing is designed to run on Pi, so let the tests begin on Pi!

To upscale a 256*256 input image to 512*512 pixels, a Raspberry Pi 3 Model B runs the inference in ~130 ms on average (with a 9.5 ms standard deviation over 10 repetitions). A Raspberry Pi Zero W goes a little slower (171 ms): it has the same Broadcom VideoCore IV GPU onboard but runs it at a lower clock frequency. The shaders get compiled in 3.5 to 4 seconds (and, as a reminder, this is done once, not for every image we may have to process).

Is this slow? Well, rather yes. But we just managed to run the inference of a neural net on the Raspberry Pi GPU, which is still great.

For the sake of visibility, an image from Set5 (top left) gets downscaled here by a factor of four (top right). Then it is upscaled back using bicubic interpolation (bottom left) or by applying our model twice (bottom right). We get 22.82 dB, while the bicubic interpolation finishes at 21.41 dB. And yes, the bottom-right image comes straight from the Raspberry Pi.

Android smartphones

Android smartphones have OpenGL ES-compliant GPUs too, so there is no way for them to escape our tests. Here are the figures for the same 256*256 to 512*512 upscaling test on some Android phones.

I do not put any PSNR numbers here: there is no visible difference between images coming from different GPUs. The results do differ slightly across hardware, but they remain very similar. Among three images rendered from the same source on the Pi, an Android smartphone and a high-end desktop GPU, the two most different ones still match each other at 45.8 dB.

Another test is passing a small-resolution camera preview through the network in real time. Without any additional pixel transfer, the camera image can be accessed as an OpenGL texture through the samplerExternalOES sampler in GLSL, so we can plug it directly into our network.
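Concretely, on the GLSL side the change for the first-layer shader is essentially the sampler declaration, something along these lines (the identifier is hypothetical):

```python
# Input declaration of the first-layer shader when reading the Android
# camera preview instead of a regular texture.
CAMERA_INPUT_DECL = """
#extension GL_OES_EGL_image_external : require
uniform samplerExternalOES cameraInput;  // bound to the camera SurfaceTexture
"""
```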

Huawei P20 Lite manages to run the upsampling of a 352*288 input to 704*576 output at 6 to 7 FPS. My old Asus K016 (Fonepad 8) does the same job at ~13.2 FPS. Huawei P10 runs at 30 FPS!