TVM: An AI-Tuning AI-Compiler

Source: Deep Learning on Medium

Source: TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Implementation: tvm

I am not the author of the paper; this post is just a quick reading summary.

This post is intended for readers who have some knowledge of deep learning and compiler optimization.

Introduction

The increasing need to bring machine learning to a wide diversity of hardware devices challenges existing compiler technologies. Briefly, each accelerator has its own primitives (hardware instructions), and it is hard for compiler optimizations to fit every vendor-specific accelerator, including server-class GPUs, embedded GPUs, and FPGA-based deep learning accelerators (DLAs).

TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding.

Background

The memory hierarchy varies across CPUs, GPUs, and DLAs. One characteristic of deep learning workloads is that both data movement and arithmetic intensity are very high. The most popular class of neural networks is the CNN, and n-D convolution (which can be lowered to matrix multiplication) is one of the heaviest workloads in a CNN. Therefore, convolution operators are used for evaluation in this work.
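
To make the "convolution as matrix multiplication" point concrete, here is a minimal NumPy sketch (not TVM code) of the im2col lowering. The single-channel input, stride 1, and absence of padding are simplifying assumptions for illustration only.

```python
import numpy as np

def conv2d_via_matmul(x, w):
    """2-D convolution (stride 1, no padding) lowered to a matrix multiply
    via im2col; single channel and small shapes are illustrative only."""
    H, W = x.shape
    KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    # im2col: gather every KHxKW input patch into one row of a matrix.
    cols = np.empty((OH * OW, KH * KW))
    for i in range(OH):
        for j in range(OW):
            cols[i * OW + j] = x[i:i + KH, j:j + KW].ravel()
    # The convolution is now a single matrix-vector product.
    return (cols @ w.ravel()).reshape(OH, OW)

x, w = np.random.rand(8, 8), np.random.rand(3, 3)
print(conv2d_via_matmul(x, w).shape)   # (6, 6)
```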

Broadly, compiler optimizations for these hardware devices aim to improve memory utilization and reduce the number of hardware instructions, but they traditionally require a lot of hand-crafting. TVM is therefore proposed to perform these optimizations automatically for different hardware.

Key Ideas

Optimizing Computational Graphs

  • Operator Fusion

This optimization can greatly reduce execution time, particularly on GPUs and specialized accelerators. Specifically, the paper recognizes four categories of graph operators (a small fusion sketch follows the list):

(1) injective (one-to-one map, e.g., add),

(2) reduction (e.g., sum),

(3) complex-out-fusable (can fuse element-wise map to output, e.g., conv2d), and

(4) opaque (cannot be fused, e.g., sort).
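
Here is a minimal NumPy sketch (not TVM code) of what fusion buys: the unfused version of a conv2d → add → relu chain materializes a full intermediate tensor for every injective op, while the fused version applies them inside the loop that emits the producer's output. The scalar bias and the 4×4 tile are illustrative assumptions.

```python
import numpy as np

def unfused(conv_out, bias):
    # Two separate kernels: each injective op reads its input from memory
    # and materializes a full intermediate tensor.
    t1 = conv_out + bias          # injective: add
    t2 = np.maximum(t1, 0.0)      # injective: relu (reads t1 back)
    return t2

def fused(conv_out, bias):
    # One kernel: the injective ops are folded into the loop that emits
    # each element of the complex-out-fusable producer (conv2d), so no
    # intermediate tensor is written to memory.
    out = np.empty_like(conv_out)
    flat_in, flat_out = conv_out.ravel(), out.ravel()
    for i in range(flat_in.size):
        flat_out[i] = max(flat_in[i] + bias, 0.0)
    return out

conv_out = np.random.rand(4, 4) - 0.5     # stand-in for a conv2d output tile
print(np.allclose(unfused(conv_out, 0.1), fused(conv_out, 0.1)))  # True
```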

  • Data Layout Transformation

Data layout optimization converts a computational graph into one that can use better internal data layouts for execution on the target hardware. It starts by specifying the preferred data layout for each operator given the constraints dictated by memory hierarchies. We then perform the proper layout transformation between a producer and a consumer if their preferred data layouts do not match.
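
A minimal NumPy sketch of such a transformation, inserted between a producer that prefers NCHW and a consumer that prefers NHWC or a channel-tiled layout; the tile width of 4 is an illustrative assumption, not a TVM default.

```python
import numpy as np

def nchw_to_nhwc(x):
    # Layout transform inserted when a producer prefers NCHW
    # but a consumer prefers NHWC.
    return np.transpose(x, (0, 2, 3, 1))

def nchw_to_tiled(x, c_tile=4):
    # Channel-tiled layout (NCHW -> N, C//c_tile, H, W, c_tile) of the kind
    # accelerators with fixed-width tensor units often prefer.
    n, c, h, w = x.shape
    assert c % c_tile == 0
    return x.reshape(n, c // c_tile, c_tile, h, w).transpose(0, 1, 3, 4, 2)

x = np.random.rand(1, 8, 5, 5)     # NCHW
print(nchw_to_nhwc(x).shape)       # (1, 5, 5, 8)
print(nchw_to_tiled(x).shape)      # (1, 2, 5, 5, 4)
```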

Generating Tensor Operations

  • Example (a matrix-multiplication sketch follows below)
  • Explicit Memory Latency Hiding

Latency hiding refers to the process of overlapping memory operations with computation to maximize utilization of memory and compute resources. It requires different strategies depending on the target hardware back-end. On CPUs, memory latency hiding is achieved implicitly with simultaneous multithreading or hardware prefetching. GPUs rely on rapid context switching of many warps of threads [44]. In contrast, specialized DL accelerators such as the TPU usually favor leaner control with a decoupled access-execute (DAE) architecture and offload the problem of fine-grained synchronization to software.
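
For the "Example" item above: the paper demonstrates tensor operation generation by declaring a matrix multiplication in the tensor expression language and then choosing a schedule. A minimal sketch using TVM's Python tensor expression API (the tvm.te spelling of recent releases; the paper itself uses an older t.* notation, and the tiling factors here are arbitrary):

```python
import tvm
from tvm import te

n = 1024
# Declare the computation: what to compute, not how.
A = te.placeholder((n, n), name="A")
B = te.placeholder((n, n), name="B")
k = te.reduce_axis((0, n), name="k")
C = te.compute((n, n),
               lambda y, x: te.sum(A[k, y] * B[k, x], axis=k),
               name="C")

# A schedule decides how to compute it; tiling is one of the
# schedule primitives the paper describes.
s = te.create_schedule(C.op)
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=8, y_factor=8)

# Lower to a loop program (the representation the cost model later consumes).
print(tvm.lower(s, [A, B, C], simple_mode=True))
```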

Automating Optimization

  • ML-Based Cost Model

TVM takes a statistical approach to the cost modeling problem. In this approach, a schedule explorer proposes configurations that may improve an operator’s performance. For each schedule configuration, TVM uses an ML model that takes the lowered loop program as input and predicts its running time on a given hardware back-end.

The model, trained on runtime measurement data collected during exploration, does not require the user to provide detailed hardware information. TVM updates the model periodically as more configurations are explored during optimization, which improves accuracy for related workloads as well.
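
A schematic of this explore → predict → measure → retrain loop, assuming XGBoost as the gradient tree boosting model the paper uses. propose_configs, extract_features, and measure_on_device are hypothetical placeholders with synthetic bodies standing in for TVM's schedule explorer, loop-program features, and hardware measurements; the real explorer additionally uses parallel simulated annealing to propose candidates.

```python
import random
import numpy as np
from xgboost import XGBRegressor   # the paper's cost model is gradient tree boosting

# Hypothetical stand-ins for TVM internals; bodies are synthetic so the sketch runs.
def propose_configs(num):
    return [{"tile_x": random.choice([4, 8, 16, 32]),
             "tile_y": random.choice([4, 8, 16, 32]),
             "unroll": random.choice([0, 1])} for _ in range(num)]

def extract_features(cfg):           # stand-in for lowered loop-program features
    return [cfg["tile_x"], cfg["tile_y"], cfg["unroll"]]

def measure_on_device(cfg):          # stand-in for a real hardware measurement
    return 1.0 / (cfg["tile_x"] * cfg["tile_y"]) + 0.01 * cfg["unroll"]

def explore(n_rounds=10, batch=64, top_k=8):
    model = XGBRegressor(n_estimators=200)
    X, y = [], []                             # measured (features, runtime) pairs
    for _ in range(n_rounds):
        candidates = propose_configs(batch)
        feats = np.array([extract_features(c) for c in candidates])
        # Rank candidates by predicted runtime (random before any data exists).
        pred = model.predict(feats) if X else np.random.rand(batch)
        for i in np.argsort(pred)[:top_k]:    # measure only the most promising ones
            X.append(feats[i]); y.append(measure_on_device(candidates[i]))
        model.fit(np.array(X), np.array(y))   # periodically retrain on new measurements
    return min(y)

print("best measured cost:", explore())
```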

Evaluation

The authors focused on four critical questions:

  • Can TVM optimize DL workloads over multiple platforms?
  • How does TVM compare to existing DL frameworks (which rely on heavily optimized libraries) on each back-end?
  • Can TVM support new, emerging DL workloads (e.g., depthwise convolution, low precision operations)?
  • Can TVM support and optimize for new specialized accelerators?

Therefore, the paper reports many experiments; here are the ones I am most interested in:

  • The performance on server-class GPUs
  • The performance compared to mobile NN frameworks
  • The performance compared to mobile libraries

Conclusion

TVM is an end-to-end compilation stack proposed to solve fundamental optimization challenges for deep learning across a diverse set of hardware back-ends. The TVM system includes automated end-to-end optimization, which has historically been a labor-intensive and highly specialized task.