Original article was published by Jonathan Hui on Artificial Intelligence on Medium
To demonstrate its potentials, the code below demonstrates how to code an ML algorithm, the k-means clustering, with these patterns.
As another example, the convolution layer of CNN can be expressed with the map and reduce below.
For more complex data-intense applications and DL models, we can concatenate and nest these parallel patterns into a hierarchical structure to represent them.
Our modern programming model is developed for general computing. Concurrency and parallelism are usually discovered at the fine-grain local level only. It is often extremely hard to rediscover the global structure. In short, you can’t see the forest for the trees. But by encoding the algorithms in parallel patterns above, these high-level structures are maintained throughout the process and can be easily optimized later. Let’s get an overview in the next 2 sections before detailing it with an example.
Metapipelining is a pipelining of pipelining. ML algorithms can be realized as a hierarchical dataflow containing pipelines and Metapipelining.
By coding the problem as a hierarchical pipeline, we carry a master concurrency plan that can be accelerated in hardware. In addition, we can use explicit controls to design a more complex dataflow as we wish.
Tiling and Memory Hierarchy
But there are still major gaps between this logical view and the hardware design. In particular, the logical view treats the input as one single data space with unlimited memory. In the physical world, we tile the computation, i.e. we subdivide a workload such that each chunk can fit into the limited memory and computation resources.
In the example below, instead of working on an unlimited large data space, we use the outer “reduce” loop to break up the data in the off-chip memory into tiles (tA and tB) of size B.
In many AI algorithms, data movement is a major performance bottleneck and energy hog. Reducing data movement is critical. By maximizing data locality, we can minimize the data movement between off-chip and on-chip memory. Therefore, to move one step closer to hardware design, System 2.0 introduces the memory hierarchy and data movement into the model. For example, the code below provides a parameterized model (sized with parameters C, D) in declaring data storage using external memory (DRAM), on-chip SRAM tiles, and registers.
Below is another example of moving data onto local distributed SRAM (more detailed example later).
By defining such data movement, the compiler will size the resources properly in avoiding global memory access or complex caching scheme. In summary, tiling and Metapipelining capture locality and nested parallelism. To accelerate AI algorithms, we can take advantage of reconfigurable hardware devices to exploit the nested parallelism in Metapipelining and data locality in the memory hierarchies and tiles.
Let’s get into the details to fill some of the holes in our discussion.
Spatial is a language to encode the IR (intermediate representation) for the hardware accelerator (called accelerator IR). It is also a compiler to translate the IR into hardware design configurations.
Starting with models/codes in a DSL (domain-specific language), like Tensorflow, they can be transformed into IRs containing parallel patterns. Then they are further transformed into the accelerator IR encoded in Spatial. Finally, these IRs will be eventually compiled into hardware configurations describing the hardware designs, say for programming FPGAs.
Say we want to create a hardware solution in accelerating the dot product of two vectors. The logical model for this operation can be represented by the map and reduce pattern below.
The code on the right below is how the dot product will be represented (encoded) in the accelerator IR using Spatial.
As shown, it adds an outer reduce loop to break down the workload with N elements into chunks with B elements, i.e. it tiles the computation. In the outer loop, it loads two tiles (tA and tB) from the DRAM (global shared memory) into SRAM (local memory). Then the inner loop performs the multiplication and the accumulation. It makes B an explicit parameter in which its value will be later optimized by the Spatial compiler according to the capacity of the target hardware and the type of the computation using the tile.
As shown below, the left diagram is the accelerator IR for the dot product of two vectors. The Spatial compiler will take the IR and compiles it into hardware design for the target chips. As a demonstration, the right diagram below is the logical block diagram for the compiled hardware. Data is loaded into a scratchpad memory (a distributed local memory). Then it iterates the elements in computing the dot product. In this example, the inner loops include a 4-way multiplication for parallelization.
In the Spatial IR, many configurations, like sizing information, are implicitly or explicitly parameterized. This defers the decision of their values and allows the compiler to optimize them based on the target device capacity and the DNN. In this example, these parameters include:
- the tile size B that we already discussed,
- how much pipeline (# of stages) or no pipeline (sequential) in the Metapipelining and the inner loop.
Here, we have a 3-stage pipeline below.
In the design below, a double buffer is deployed — one is fetching data and one is feeding data to the pipeline. This allows stages to run without stalling.
- what level of parallelism (4-way) to exploit in each component.
- how much memory banking (how many units of data can be read in parallel from the memory) and the banking mode (details later) for the tiles to maintain a constant supply of data.
Here is the IR with the corresponding parameters (middle section) to be optimized by the compiler.
Again, the diagram below repeats this information using the logical blocks for the hardware design.
For completeness, here is the list of parameters and controls that the compiler can optimize.
Another Spatial Example (Optional)
To demonstrate the power of the Spatial IR, here is another more complex example demonstrating a matrix multiplication.
As shown below, Spatial provides many control primitives in creating a complex hierarchical dataflow.
Plasticine (A Reconfigurable Architecture For Parallel Patterns)
Now, we have an accelerator IR written in Spatial. It encodes a coarse-grain dataflow using hierarchical parallel patterns. It also contains tiling, memory hierarchy, and data movement information. We can use the Spatial compiler to exploit this information to generate a hardware design.
Finally, we are ready to discuss Pasticine — a reconfigurable general processor for data-intense applications including ML and DL. It is designed to optimize parallel patterns with goals to achieve near ASIC performance with high energy efficiency and small die size, but with the programmability and generalization of CPU and GPU.
Plasticine composes of alternating Pattern Compute Units (PCU) and Pattern Memory Units (PMU) connected by switch boxes.
Pattern Compute Unit (PCU)
The Spatial compiler will map the inner loop computation to a PCU, i.e. a PCU is responsible for executing a single, innermost parallel pattern. Each PCU contains a multi-stage SIMD pipeline that combines pipeline parallelism and SIMD for data parallelism.
The following diagram shows a PCU with 4 pipeline stages and 4 SIMD lanes (a.k.a. process a single instruction on 4 data concurrently).
Using a producer-consumer pattern, PCU buffers the vector and scalar input with separate and small FIFOs. Each SIMD lane is composed of functional units (FU) interleaved with pipeline registers (PR). These registers store intermediate scalar results and propagate values between stages within the same lane. Therefore, no external PCU communication or scratchpad accesses are needed for intermediate calculations. The output of these lanes can be merged to form a vector output.
These FUs, containing arithmetic and binary operations, are reconfigurable to implement the function f needed in the parallel pattern.
Crosslane communication between FUs can be done by
- a reduction tree network to reduce values from multiple lanes into a single scalar, or
- a lane shifting network to present PRs with sliding windows across stages.
PCU also contains a control block with reconfigurable combinational logic. This block controls a programmable chain of counters that serve as a state machine that triggers the start of a workflow and used as the loop indices for parallel patterns. (PCU execution begins when the control block enables one of the counters.)
To scale the system, Plasticine follows a hierarchical control structure to control pipelines, i.e. the parent pipeline coordinates the execution of its children. Based on such flow dependencies and the explicit application’s control (if any), we can configure the control block to combine control signals from parent and sibling PCUs, as well as the local FIFOs status, to trigger the execution of a PCU.
Let’s introduce some memory access patterns before we discuss the Pattern Memory Unit (PMU). The memory access pattern in parallel patterns can be simplified as:
- statically predictable linear functions of the pattern indices,
- or random accesses.
The former case usually creates a pattern that we can load the data in parallel easily. For the latter case, we often consolidate multiple requests together. On top of that, data may or may not be reused by PCU. We call them streaming (one time use) and tiled data respectively.
Pattern Memory Unit (PMU)
PMUs contains a programmer-managed scratchpad memory (a distributed local memory) for tiled data.
The scratchpad is coupled with a reconfigurable scalar unit for address calculation (address decoding). To support multiple SIMD lanes in PCU, the scratchpad contains multiple SRAM matching the number of PCU lanes for parallel streaming.
To improve the memory bandwidth for different data access patterns, the address decoding logic can be configured to operate in several banking modes to support various access patterns. These modes are:
- Strided banking mode allows linear access patterns for dense data structures.
- FIFO mode support data streaming.
- Line buffer mode follows a sliding window pattern.
- Duplication mode duplicates contents across all memory banks for parallelized random access (gather operation).
In addition, to support coarse-grained pipeline parallelism, the scratchpad can be configured to operate as an N-buffer with any of the banking modes.
PMU also has a control block that works similar to PCU. Its logic is reconfigurable and is responsible for triggering the PMU read and/or write operations.
The PCU and PMU communicate with other units through a pipelined interconnect (switch). These interconnects are statically configured for custom datapaths. It contains three separate networks and all have the same topology. But each network is responsible for different data types, one for scalar, one for vector, and one for control. Links in switches include registers to prevent long wire delays.
These switches are similar to the interconnects in FPGA. But in FPGA, it is done at a fine-grain level (in bit level). Such fine granularity controls take up a lot of die space and powers. Plasticine involves complex components, like PCU, in a coarse-grain level. By avoiding fine-grain configuration, Plasticine saves a lot of die space and being power efficient. In addition, the place and route algorithms take minutes instead of hours or days in FPGA.
Off-chip Memory Access
Off-chip DRAM is accessed by 4 DDR memory channels through address generators (AG) on two sides of the chip. Each AG contains a reconfigurable scalar unit for address calculation similar to PMU. And each AG contains FIFOs to buffer outgoing commands, data, and incoming responses.
Multiple AGs are connected to an address coalescing unit. The memory commands from AGs can be dense or sparse. Dense requests are converted to multiple DRAM burst requests by the coalescing unit to bulk transfer contiguous DRAM areas. This is commonly used in reading or writing tiles of data. On the other hand, sparse requests are enqueued into a stream of addresses in the coalescing unit. The random reads and writes will be performed with gathers and scatters. This minimizes the number of issued DRAM requests.
Plasticine uses a distributed and hierarchical control scheme to minimize synchronization between units. The controls are done in the outer loop of a nested pipeline which controls the execution sequences of the pipelines that it contains. It supports three types of controller scheme for its children:
- sequential execution — execution is done sequentially without pipeline optimization.
- coarse-grained pipelining, — pipeline optimization is enabled, and
- streaming — it allows the concatenation of units with FIFOs to form a large pipeline.
The diagram below summarizes the type of hardware acceleration that we discussed in supporting the specific programming model.
Mapping Spatial IR to Hardware
Let’s see how the Spatial IR will map into the Plasticine architecture. On the right diagram below,
- the light blue area is responsible for loading data from DRAM. That will be done on AG with the light blue color on the left diagram.
- Tile tA and tB will be corresponding to PMU, both in deep blue.
- The red area executes the inner loop of the parallel pattern and maps it to a PCU.
- A separate PCU in yellow will be responsible for the reduction operation.
The following is how a NN can map into this architecture.
To unify the solutions across data-intense applications, System 2.0 uses parallel patterns to model these problem domains. Regardless of the software platforms that the problem is coded on (TensorFlow or Spark), they can be converted and represented with parallel patterns. This lays down a path in creating a new general AI chip.
First, the DSL application models or codes are converted into parallel patterns — the Parallel Pattern IR.
Using a high-level compiler, information like tiling is added into the Spatia IR to bridge the gap between the logical model and the hardware design. Throughout the process, the IR retains the hierarchical pipelines structures. With Plasticine designed to optimize parallel pattern executions, the Spatial IR can map to the Pasticine easily in a simple and coarse-grain level. In fact, the Spatial compiler only takes minutes to run, in contrast to hours in FPGA’s place and route. This reflects how well Pastcine is designed for executing parallel patterns with minimum design gaps.
Researchers make things perfect and engineers make things cheap.
We don’t have the crystal ball to tell whether System 2.0 is the next computing architect for AI or the existing solutions like GPU may still have a lot of mileage to run. This all depends on the market share and how well engineers can simplify the concept, prioritize goals, and execute them correctly. And many unforeseen problems will pop up and need to be solved. SambaNova is likely working on the compiler as well as a chip similar to Plasticine. The cofounders also address the need for hardware acceleration in handling dynamic precision (HALP), sparsity, and access to training data. In addition, they want to address the labeling complexity with weak supervision. But we don’t have enough detail on how this complete workflow may look like in the context of SambaNova. Or this may be just another Startup concept like the Snorkel AI. Whether System 2.0 or Plasticine will be ahead of its time or everyone will jump into the bandwagon, it can only tell by time — likely in the next couple of years.
We had discussed half of the AI chip companies that we want to cover. In the final article, we will discuss Intel and Mobile SoC, like Apple, Qualcomm, and Samsung. We will also look into other chips specialized for AI accelerations as well as developments in the Asian countries. Finally, we will look into vendors focused on lower-power AI edge devices.