Creating artistic live video filters with deep neural networks

Original article was published on Deep Learning on Medium

Scaling Inputs

Many phones today can take stunning 4k videos, including the iPhone XS that I developed on. While the A12 chip in the device is powerful, it would be far too slow to use a deep neural network on every frame of that size. Usually video frames are downscaled for image recognition on devices and the model is run on a subset of frames. For instance, an object recognition app may run a model every second on a 224 x 244 frame, instead of 30 times per second on a 4096 x 2160 frame. That works in an object detection use case, as objects don’t change that much between frames.

This obviously won’t work for stylizing video frames. Having only a single stylized frame flicker every second would not be appealing to a user. However there are some takeaways from this. First, it is completely reasonable to downscale the frame size. It is common for video to be streamed at 360p and scaled up to the device’s 1080p screen. Second, perhaps running a model on 30 frames per second is not necessary and a slower frame rate would be sufficient.

There is a trade-off between model resolution and frame rate, as there are a limited number of computations the GPU can make in a second. You may see some video chat platforms have a slower frame rate or more buffering when using convolutions for video effects (i.e. changing the background). To get a sense of what different frame rates and input shapes looked like, I created a few stylized videos on a computer with the original neural network and OpenCV. I settled on a goal frame rate of 15 fps with 480 x 853 inputs. I found these to still be visually appealing as well as recognizable numbers for benchmark testing.

Utilizing the GPU

I used tfcoreml and coremltools to transform the Tensorflow model to a CoreML model. A gist of the complete method can be found below. There were a couple of considerations with this. First, I moved to batch normalization instead of instance normalization. This was because CoreML does not have an instance normalization layer out of the box, and this simplified the implementation since only one frame would be in each batch at interference time. A custom method could also be used in tfcoreml to convert the instance normalization layer.

Next, the tensor shapes differ between TensorFlow and CoreML. The TensorFlow model has an out in (B, W, H, CH) while CoreML supports (B, CH, W, H) for images. After converting the model to a CoreML model, I edited the model spec to have a transpose layer to adjust to shape. Only after changing the model output shape to the format (B, CH, W, H) did I change the output type to an image. This something that has to manually be done on the model spec; as of this writing tfcoreml supports images as inputs as a parameter but not outputs.

Additionally, since I downscaled the images to pass them through the network, I was able to add an upscaling layer using coremltools to bring the image to to a reasonable 1920 x 1080 frame size. An alternative would be to resize the pixel buffer after getting the result of the network, but this would either involve work from the CPU or additional queueing on the GPU. CoreML’s resize layer has a bilinear scaling and provided satisfactory upscaling with few feature or pixel artifacts. Since this resizing layer is not based on convolutions, it also added minimal time to the model inference time.

One final way I utilized the GPU was in displaying the frames. Since I added custom processing to the frames, I could not send them directly to standard AVCaptureVideoPreviewLayer. I used an MTKView from the Metal Kit to present the frames, which utilizes the GPU. While the Metal shader was a simple pass through function (the input was returned as the output), the drawing proved performant and the queues in the view were also helpful in the event that a frame was dropped.

Simplifying the model architecture

The original model architecture had five residual convolutional layers. While very performant on a standard GPU this was too deep for a the A12 processor, at least for a typical frame rate. One primary component for the neural net to learning a variety of textures is the five residual blocks. If a texture is simple, the later residual layers may look closer to identity filters. If the texture is more complex, all layers may have meaningful filters. I experimented with trimming out some of these blocks for a more performant network, at the cost of not being able to learn some highly complex textures. Additionally, I experimented with separable convolutional layers instead of traditional fully connected layers, as used in other light weight architectures such as MobileNets.

Architectures used on the A12 testing. The grayed convolutional residual blocks (2–4) were experimented on by removing or using separable convolutional layers in the blocks.

I tested several different architectures on a computer GPU to narrow down the most performant networks with minimal degradation in texture. I kept the downscaling and upscaling from the original architecture largely consistent though I only used 3 x 3 sized kernels. Some changes (reducing the residual blocks to 1, narrowing the number of filters to 64) had fast inference times, but had a high degradation in quality. Afterward the GPU testing I tested the models on an iPhone XS with an A12 chip.

Results (in milliseconds) of various model architectures.

These are the results (in milliseconds) of a benchmark test of 100 iterations, with an input frame size of 480 x 853. The first frame was omitted since it was an outlier from the model “starting up”. One interesting take from these results is that the separable convolutional blocks did not make the network more performant. Separable convolutional layers are often difficult to implement for efficiency. I’ve read a variety cases in which the separable layers do not perform as anticipated in different environments, which could be the case here as well and this deserves more investigation.

I used the 3 full (not separable) residual blocks for the following results. This model worked very well on a variety of styles and cases. With 15 fps being 66 milliseconds per frame, this implementation was probably the upper bound of the device with this implementation as there were a couple of occurrences of dropped frames or lag.