AI-Assisted Transcoding of VVC on VideoCoin Network

Original article was published by VideoCoin on Artificial Intelligence on Medium

VVC (Versatile Video Coding), also known as H.266, is the next-generation video coding standard. It is expected to be widely adopted not only for its compression efficiency but also for its versatility in handling screen content and adaptive resolution change. VVC targets up to 50% bitrate savings compared to H.265 (HEVC) at about double the decoding complexity. Encoder complexity grows steeply because of the new coding tools and the support for resolutions up to 16K: the VVC encoder is estimated to be 10x to 27x more complex than an HEVC encoder, depending on the coding profile. Several initiatives from academia and industry are working on reducing the complexity of the VVC codec so that it can reach real-time encoding and decoding. These techniques include, but are not limited to, algorithmic encoder optimizations using artificial intelligence (AI).

VideoCoin Network is a decentralized and distributed framework that can bring huge computing resources together and support scalable and flexible computational workflows. The heterogeneous ecosystem of VideoCoin Network is a conglomeration of compute nodes with specialized hardware- or GPU-based compression codecs and compute nodes with generic computing capabilities, and it will be joined by nodes that support image analytics. A built-in micropayment system and a service validation system enable a vibrant community of content publishers, developers, and compute resource owners.

The versatility of VVC comes mainly from two toolsets: Screen Content Coding (SCC) and Reference Picture Resampling (RPR). In this paper, we present a model in which the flexibility and scalability of VideoCoin Network enable these toolsets and support use cases that leverage them. We aim to support use cases in high-resolution (4K or 8K) game streaming and in adaptive streaming with resolution switching. The rapid adoption of 5G enhanced mobile broadband, and the variation in the bandwidth that mobile handsets receive, raise the need for enhancements to current adaptive streaming technologies. The RPR toolset enables adaptive streaming with resolution switching (spatial scalability), and efficient resampling filters are the key to this toolset. VideoCoin Network’s ability to integrate cloud and edge computing nodes that specialize in compression and AI algorithms opens up unique opportunities to support these use cases.

Convolutional neural nets (CNNs) for video compression are a heavily researched topic. Video compression standards specify the bitstream syntax, while the implementation of the encoder is left open. Unlike traditional encoding, where resolution and target bitrate control the transcoding, the new approaches target perceptual quality. An AI algorithm may look at the complexity of each scene and estimate the encode parameters that satisfy the targeted perceptual quality. These dynamically optimized profiles are claimed to save up to 50% in bitrate while providing the same perceptual quality as static profiles [4]. The paper “An Integrated CNN-based Post Processing Filter For Intra Frame in Versatile Video Coding”[31] proposes a CNN-based post-processing method that serves as an integrated filtering solution to replace the in-loop filters. It achieves good performance by taking advantage of information about quantization parameters and partitioning structure. The paper “On Versatile Video Coding at UHD with Machine-Learning-Based Super-Resolution” studies spatial up- and downsampling using CNNs.
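As a rough illustration of quality-targeted encoding, the sketch below picks, for each scene, the cheapest candidate setting whose predicted quality meets the target. The function name `pick_profile`, the candidate lists, and the quality scores are hypothetical; in a real system the predictions would come from an AI model rather than a hand-written table.

```python
from typing import List, Tuple

# (bitrate_kbps, predicted_quality_score); all values here are made up.
Candidate = Tuple[int, float]


def pick_profile(candidates: List[Candidate], target_quality: float) -> Candidate:
    """Return the lowest-bitrate candidate whose predicted quality meets the target.

    Falls back to the highest-quality candidate when none reaches the target.
    """
    good_enough = [c for c in candidates if c[1] >= target_quality]
    if good_enough:
        return min(good_enough, key=lambda c: c[0])
    return max(candidates, key=lambda c: c[1])


# A simple scene reaches the target at a low bitrate; a complex one needs more.
static_scene = [(1000, 92.0), (2000, 95.0), (4000, 97.0)]
action_scene = [(1000, 70.0), (2000, 85.0), (4000, 93.0)]
print(pick_profile(static_scene, 90.0))
print(pick_profile(action_scene, 90.0))
```

A static profile would spend the same bitrate on both scenes; the per-scene choice is where the claimed savings would come from.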

Patent licensing is the major issue that hindered the adoption of H.264 and H.265, the predecessors of VVC. VVC is not free. There are early initiatives, such as the “Media Coding Industry Forum”[33], that aim to smooth the adoption of the codec. Traditional codec software offerings struggle to accommodate licensing obligations because the software is bundled and distributed statically. VideoCoin Network, thanks to its transparency and built-in payment system, helps reduce this friction and can dynamically bundle a mix of paid and royalty-free algorithms to suit users’ requirements.

Overview of AI-Assisted VVC Encoding on VideoCoin Network

A major benefit of the VideoCoin Network is that it brings together heterogeneous computing resources distributed over the network to perform highly compute-intensive software tasks. In the context of VVC encoding, this makes it possible to distribute different phases of encoding to different computing workers (resources). For example, codec-control algorithms and analysis can run on a computing platform suited to AI algorithms, while the actual codec algorithms can run on one or more computing platforms tuned for that task. This distributed approach helps support resolutions up to 16K and serves two use cases: adaptive multi-resolution encoding for ABR streaming and screen content encoding.

Open Framework

The following diagram shows the major elements of VideoCoin Network used for setting up the VVC transcoding flow. VVC compression software can consist of free and licensed modules that are bundled together dynamically for the targeted application. An open protocol framework will enable the construction of the transcode pipeline from components offered by different software vendors.

VVC is currently available as a reference software implementation (VVC VTM reference software[13]). We plan to use an FFmpeg-based framework with external codec plugins. AI workers and transcode workers will communicate through metadata-based information flows. They may run on the same node or on separate nodes, depending on the tolerable latencies and the length of look-aside buffers.

AI workers may choose one of several available CNN models based on the user’s preference, the pricing of the model, and the analytic capability of the AI node.

Software vendors can publish offerings that consist of both pre-trained CNN models and codec implementations. The VideoCoin Network payment system compensates them for the usage of these software resources.

Publishers can dynamically create the workflows that bundle the relevant software modules and AI models depending on the targeted application. For example, a game streaming application may bundle an AI model pre-trained to make decisions between Intra-prediction mode and IBC mode. This will help in efficient compression of “Screen Content Coding”.

Transcode workers provide the computing resources for the codec algorithms.

AI workers provide the resources to run CNN models and interface with transcode workers through the VideoCoin Network open protocol framework.
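To make the roles above more concrete, here is a minimal sketch of how a publisher might describe a workflow that bundles an AI model and a codec module from different vendors. The class names, module names, vendor names, and price units are all hypothetical and do not reflect an actual VideoCoin Network API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Module:
    name: str
    vendor: str
    worker: str            # "ai" or "transcode": which kind of worker runs it
    price_per_minute: int  # hypothetical price units


@dataclass
class Workflow:
    application: str
    modules: List[Module] = field(default_factory=list)

    def add(self, module: Module) -> "Workflow":
        self.modules.append(module)
        return self

    def for_worker(self, worker: str) -> List[str]:
        """Names of the modules that a given worker type must run."""
        return [m.name for m in self.modules if m.worker == worker]

    def cost(self, minutes: int) -> int:
        """Total usage fee owed to all vendors for a job of this length."""
        return minutes * sum(m.price_per_minute for m in self.modules)


# A game streaming publisher bundles an IBC mode-decision model with an encoder.
game = (Workflow("game-streaming")
        .add(Module("ibc-mode-decision-cnn", "vendor-a", "ai", 3))
        .add(Module("vvc-encoder", "vendor-b", "transcode", 5)))
print(game.for_worker("ai"), game.cost(10))
```

Keeping the price on each module is one way the payment system could compensate vendors separately for the parts of the bundle that were actually used.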

Overview of VVC Encoding and Bitstream Syntax

Like most previous standards, VVC has a block-based hybrid coding architecture that combines inter- and intra-picture prediction and transform coding with entropy coding. The following figure shows the general block diagram of the encoder. In-loop filtering includes in-loop reshaping, the deblocking filter, sample adaptive offset, and the adaptive loop filter.

A VVC bitstream consists of a sequence of data units called network abstraction layer (NAL) units. A picture in VVC is divided into one or more tile rows and one or more tile columns. A tile is a sequence of Coding Tree Units (CTUs) that covers a rectangular region of a picture. The CTUs in a tile are scanned in raster scan order within that tile. A slice consists of an integer number of complete tiles or an integer number of consecutive complete CTU rows within a tile of a picture. Two slice modes are supported: the raster-scan slice mode and the rectangular slice mode. In the raster-scan slice mode, a slice contains a sequence of complete tiles in a tile raster scan of a picture. In the rectangular slice mode, a slice contains either a number of complete tiles that collectively form a rectangular region of the picture or a number of consecutive complete CTU rows of one tile that collectively form a rectangular region of the picture. Tiles within a rectangular slice are scanned in tile raster scan order within the rectangular region corresponding to that slice.
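The scan order described above can be sketched in a few lines: tiles are visited in raster order across the picture, and the CTUs inside each tile are visited in raster order within that tile. The function name `ctu_scan_order` and the way the tile grid is passed in (column widths and row heights measured in CTUs) are choices made for this illustration only.

```python
from typing import List


def ctu_scan_order(tile_col_widths: List[int], tile_row_heights: List[int]) -> List[int]:
    """Return picture-raster CTU addresses in tile scan order.

    tile_col_widths / tile_row_heights give the tile grid in units of CTUs.
    """
    pic_width = sum(tile_col_widths)
    order = []
    y0 = 0
    for tile_h in tile_row_heights:          # tile rows, top to bottom
        x0 = 0
        for tile_w in tile_col_widths:       # tiles in a row, left to right
            for y in range(y0, y0 + tile_h):         # CTU rows inside the tile
                for x in range(x0, x0 + tile_w):     # CTUs inside the row
                    order.append(y * pic_width + x)
            x0 += tile_w
        y0 += tile_h
    return order


# A picture 4 CTUs wide and 2 CTUs high, split into two 2x2 tiles.
print(ctu_scan_order([2, 2], [2]))
```

With a single tile the result is the plain picture raster scan; with several tiles all CTUs of one tile are emitted before the next tile starts.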

A CTU is split into Coding Units (CUs) to adapt to the characteristics of various local textures. A quadtree with a nested multi-type tree (QTMT) offers more flexibility for CU partition shapes, which is why a CU can be square or rectangular. First, a CTU is partitioned by a quaternary tree (QT) structure. Then, a quaternary tree leaf node can be partitioned by a binary tree (BT) or ternary tree (TT) structure, both of which are collectively called a multi-type tree (MT) structure. The BT structure includes vertical and horizontal binary splitting (BTV/BTH), while the TT structure involves vertical and horizontal ternary splitting (TTV/TTH). A CU is split into two, three, or four sub-CUs according to the partition mode. For a 4:2:0 YUV sequence, the default size of the CTU is 128×128 (i.e., there is a 128×128 luminance CTB and two 64×64 chrominance CTBs). MinQTSize is 16×16, MaxBtSize is 64×64, MaxTtSize is 64×64, MinBtSize and MinTtSize are 4×4, and MaxMttDepth is 4.
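The five split modes can be illustrated with a small helper that returns the sub-CU sizes produced by one split. The function name `split_cu` is made up for this sketch, and it ignores the size and depth limits listed above. The 1:2:1 ratio used for ternary splits follows the VVC design and is not spelled out in the text.

```python
from typing import List, Tuple

Size = Tuple[int, int]  # (width, height) in luma samples


def split_cu(width: int, height: int, mode: str) -> List[Size]:
    """Return the sub-CU sizes produced by applying one split mode to a CU."""
    if mode == "QT":    # quaternary split: four equal quadrants
        return [(width // 2, height // 2)] * 4
    if mode == "BTV":   # vertical binary split: left and right halves
        return [(width // 2, height)] * 2
    if mode == "BTH":   # horizontal binary split: top and bottom halves
        return [(width, height // 2)] * 2
    if mode == "TTV":   # vertical ternary split in a 1:2:1 ratio
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    if mode == "TTH":   # horizontal ternary split in a 1:2:1 ratio
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    raise ValueError(f"unknown split mode: {mode}")


print(split_cu(128, 128, "QT"))
print(split_cu(64, 64, "TTV"))
```

Whatever the mode, the sub-CU areas add up to the area of the parent CU, which is a quick sanity check for any partitioning code.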

Convolutional Neural Nets

A Convolutional Neural Net (CNN or ConvNet) is a deep learning algorithm used mainly in image processing applications. The preprocessing required by a ConvNet is very low compared to other classification algorithms. Image processing filters are usually hand-engineered, but with the advent of ConvNets these filters can be learned with enough training. The architecture of ConvNets is inspired by the organization of the visual cortex in the human brain. A ConvNet is able to capture the spatial and temporal dependencies in an image through the application of relevant filters. A ConvNet performs automatic feature extraction by employing a mix of convolution, max-pooling, and fully connected layers. The convolutional layer computes the dot product between the input image X and a set of learnable filters Kj. Each filter Kj, sized k1 × k2, moves across the input space and is convolved with local subblocks of the input, producing the feature map Yj (Yj = X * Kj + Bj, where * denotes convolution and Bj is the bias term). Generally, the outputs are connected to a ReLU activation layer. The convolutional layer is followed by a downsampling layer using average or max-pooling filters. Finally, the learned features are classified using a fully connected neural net layer. Several ConvNet architectures are in use depending on the complexity of the task, e.g. image classification, object detection, or image segmentation. They range from classical networks such as LeNet-5, AlexNet, and VGG-16 to modern architectures such as Inception (GoogLeNet) and ResNet.
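The convolution, ReLU, and max-pooling steps can be written out in plain Python to make the arithmetic visible. This is a teaching sketch, not an efficient implementation: like most deep learning frameworks it slides the filter without flipping it (strictly speaking a cross-correlation), and it uses no padding and a stride of 1, which are assumptions of this example.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(x: Matrix, k: Matrix, bias: float = 0.0) -> Matrix:
    """Slide filter k over input x and add the bias to every output sample."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw)) + bias
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fm: Matrix) -> Matrix:
    """Replace negative activations with zero."""
    return [[max(0.0, v) for v in row] for row in fm]


def max_pool2x2(fm: Matrix) -> Matrix:
    """Downsample by keeping the maximum of each non-overlapping 2x2 window."""
    return [[max(fm[i][j], fm[i][j + 1], fm[i + 1][j], fm[i + 1][j + 1])
             for j in range(0, len(fm[0]) - 1, 2)]
            for i in range(0, len(fm) - 1, 2)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]   # a tiny diagonal-difference filter
feature_map = relu(conv2d(image, kernel, bias=5.0))
print(feature_map, max_pool2x2(feature_map))
```

In a trained network the entries of `kernel` and the bias are the learned weights; here they are fixed by hand.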

During development, ConvNets undergo training and testing phases before they can be deployed. A ConvNet adjusts its filter values (or weights) through a training process called backpropagation. The training set consists of thousands of labeled images relevant to the targeted area of usage. Training is an iterative process involving (1) a forward pass, (2) calculation of the loss, (3) a backward pass, and (4) a weight update. After training, the ConvNet is tested against test data. If the results are satisfactory, it is deployed; otherwise it is redesigned. Generally, either a pre-trained model is used as it is, or a technique called transfer learning is used to customize a model for a given task. This reduces the overall effort and accelerates the deployment of ConvNets.
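The four training steps can be seen in miniature on a model with a single weight. The sketch below fits y = w·x by gradient descent on a squared-error loss; a real ConvNet repeats the same loop over millions of weights, with the gradient computed by backpropagation through every layer. The function name, learning rate, and data are chosen for illustration only.

```python
from typing import List, Tuple


def train(samples: List[Tuple[float, float]], lr: float = 0.01, epochs: int = 200) -> Tuple[float, float]:
    """Fit y = w * x with per-sample gradient descent; return (w, last loss)."""
    w = 0.0
    loss = 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = w * x                 # (1) forward pass
            loss = (pred - y) ** 2       # (2) calculation of the loss
            grad = 2 * (pred - y) * x    # (3) backward pass: d(loss)/dw
            w -= lr * grad               # (4) weight update
    return w, loss


# Labeled training data generated from the relation y = 2x.
weight, final_loss = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
print(round(weight, 4), final_loss)
```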

Example use cases

We present a deeper overview of an AI algorithm based on convolutional neural nets[11][13] as an example of AI-assisted VVC encoding. This algorithm improves the performance of adaptive multi-resolution coding in VVC. The algorithms discussed are available in the open-source project Deep Learning-Based Video Coding (DLVC)[5].

We set the context by briefly introducing the VVC encoding process and the concepts of convolutional neural nets. We then present an encoding optimization algorithm called block adaptive resolution coding[7] and the details of integrating it into the VideoCoin Network.

Game Streaming and Screen Content Coding

In recognition of the popularity of screen content applications such as online game streaming and remote desktop sharing, screen content coding (SCC) is a recognized feature of the Versatile Video Coding (VVC) standard. To support efficient coding of this computer-generated content, SCC tools that have been studied in the past, especially those in the HEVC extensions on SCC, are incorporated in the new standard. Tools such as the Intra Block Copy (IBC) prediction mode help make game streaming efficient. The paper “Deep learning-based intra prediction filter”[30] proposes a multi-scale CNN-based intra filtering scheme (MIF). The input of the network is the combination of the prediction block and the surrounding reference pixels; the output is the block filtered by MIF. The proposal aims to exploit the nonlinear correlation between the surrounding reference pixels and the current block. Extracting this correlation with a deep learning method may produce better intra prediction performance.

Adaptive Streaming Applications and Reference Picture Resampling

Reference Picture Resampling (RPR) is a key feature in VVC that allows reference pictures to be stored at a resolution different from that of the current picture and then resampled in order to perform regular decoding operations. The inclusion of this technique supports interesting application scenarios such as real-time communication with adaptive resolution, adaptive streaming with open GOP structures, and enhanced omnidirectional viewport-based streaming, for example by allowing different temporal structures for different parts of the picture. RPR also makes it possible to support scalability, in particular spatial scalability, from the first version of the codec. This is a major change compared to AVC and HEVC, where scalability was defined and introduced only after the completion of the first version of the standard and was supported by separate profiles.
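The resampling step can be sketched as follows: a reference picture stored at one resolution is rescaled to the resolution of the current picture before it is used for prediction. VVC specifies its own normative resampling filters; the plain bilinear interpolation below is a stand-in chosen for brevity, and the function name `resample` is hypothetical.

```python
from typing import List

Picture = List[List[float]]


def resample(ref: Picture, out_w: int, out_h: int) -> Picture:
    """Rescale a reference picture to out_w x out_h with bilinear interpolation."""
    in_h, in_w = len(ref), len(ref[0])
    sx, sy = in_w / out_w, in_h / out_h   # horizontal and vertical scaling ratios
    out = []
    for j in range(out_h):
        # Map the centre of the output sample into reference coordinates.
        fy = min(max((j + 0.5) * sy - 0.5, 0.0), in_h - 1)
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for i in range(out_w):
            fx = min(max((i + 0.5) * sx - 0.5, 0.0), in_w - 1)
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            top = ref[y0][x0] * (1 - wx) + ref[y0][x1] * wx
            bottom = ref[y1][x0] * (1 - wx) + ref[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Upscale a 2x2 reference to 4x4, e.g. after a switch to a higher resolution.
upscaled = resample([[0.0, 2.0], [4.0, 6.0]], 4, 4)
print(len(upscaled), len(upscaled[0]))
```

Because the same routine handles both upscaling and downscaling, a stream can switch resolution in either direction without waiting for a new intra-coded picture.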

A Detailed View of Adaptive Resolution Using CNN

We will examine a Convolutional Neural Net that provides an enhancement called block adaptive resolution coding (BARC), which was proposed to the Joint Video Experts Team (JVET) and is available as an open-source project.

The following diagram shows the general scheme of the approach and how CNN filters are used in the VVC encoding framework. It mainly uses two ConvNets, shown in blue in the diagram, that provide downsampling and upsampling as alternatives to conventional bicubic downsampling and upsampling[17].

The CNN-based downsampler is claimed to outperform the bicubic downsampler and to save up to 30% of computational resources.

Convolutional Neural Net based block adaptive resolution coding (BARC)

Each CTU of the input picture can be coded either at full resolution or at low resolution.
For the low-resolution coding scheme, there are two down/up-sampling methods:

⁃ The FIR filters

⁃ The CNN-based filters
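One way to picture the per-CTU choice is as a rate-distortion decision: the encoder evaluates each coding option and keeps the one with the lowest cost J = D + λ·R. The sketch below assumes that this is how the choice is made; the mode names, distortion values, and bit counts are invented for illustration.

```python
from typing import Dict, Tuple


def choose_ctu_mode(costs: Dict[str, Tuple[float, int]], lam: float) -> str:
    """Return the mode with the lowest rate-distortion cost D + lam * R.

    costs maps a mode name to (distortion, bits) measured for one CTU.
    """
    return min(costs, key=lambda mode: costs[mode][0] + lam * costs[mode][1])


# Hypothetical measurements for one CTU under the three coding options.
ctu_costs = {
    "full_resolution": (100.0, 1000),
    "low_resolution_fir": (180.0, 400),
    "low_resolution_cnn": (150.0, 420),
}
print(choose_ctu_mode(ctu_costs, lam=0.1))    # bits are expensive
print(choose_ctu_mode(ctu_costs, lam=0.01))   # bits are cheap
```

A larger λ corresponds to lower target bitrates, where the bit savings of low-resolution coding are more likely to outweigh the added distortion.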

The following diagram shows the Convolutional Neural Net used for downsampling. It is called CRCNN (CNN for compact resolution). The CNN for up-sampling is denoted as CNN-SR (CNN for Super-Resolution).

Figure: Convolutional Neural Net for downsampling

The ConvNet basically contains a convolution layer that downsamples to half resolution, followed by 9 more layers that extract features. Its learned filter weights enable more efficient low-resolution coding.
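The halving of resolution follows from the standard output-size formula for a strided convolution. The sketch below assumes a 3×3 kernel with stride 2 and padding 1 for the downsampling layer; these hyperparameters are illustrative and are not taken from the CRCNN design.

```python
def conv_output_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# A 128x128 CTU becomes 64x64 after the stride-2 layer, and the
# stride-1 layers that follow keep that size.
print(conv_output_size(128), conv_output_size(64, stride=1))
```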

The following diagram shows the formulation of the loss functions used for training the Compact Resolution Convolutional Neural Net (CRCNN).

Training of CRCNN

  • End-to-end training with two loss functions
  • The reconstruction loss keeps the compact-resolution image as informative as possible compared to the original image
  • The regularization loss makes the compact-resolution image friendly to compression
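A two-term training objective of this kind can be sketched as a weighted sum. The code below assumes mean squared error for both terms, assumes that the reconstruction term compares an upsampled version of the compact-resolution image with the original, and assumes that the regularization term compares the compact-resolution image with a conventionally downsampled reference. The exact formulation used for CRCNN may differ, and the name `joint_loss` and the weight are hypothetical.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized sample lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def joint_loss(original: Sequence[float],
               reconstructed: Sequence[float],
               compact: Sequence[float],
               lowres_reference: Sequence[float],
               weight: float = 0.5) -> float:
    """Reconstruction loss plus weighted regularization loss."""
    reconstruction = mse(reconstructed, original)      # stay informative
    regularization = mse(compact, lowres_reference)    # stay easy to compress
    return reconstruction + weight * regularization


print(joint_loss([0.0, 0.0], [1.0, 1.0], [0.0], [2.0]))
```

Training both terms end to end lets the downsampler trade a little reconstruction accuracy for a low-resolution image that the codec can compress more cheaply.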

Service Validation and Rewarding

It will be a challenging task to verify the services provided by AI models and the enhancements they bring to compression algorithms, particularly when it comes to measuring video quality improvements. The existing framework of VideoCoin Network can be easily extended to accommodate AI services and provide a rich ecosystem for service developers and users.
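As one example of the kind of measurement a validator could run, the sketch below computes PSNR between a reference frame and a decoded frame. PSNR is used here only because it is simple to compute; the article does not state which quality metric the network would use, and perceptual metrics may be a better fit for AI-enhanced output.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], decoded: Sequence[int], peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized sample lists."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf   # identical frames
    return 10 * math.log10(peak ** 2 / mse)


# A decode that stays closer to the reference scores higher.
print(psnr([100, 100], [101, 99]) > psnr([100, 100], [110, 90]))
```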