AI inferencing solutions with commercial silicon and hardware: A full stack strategy

Source: Deep Learning on Medium

By Rohit Mittal

Recent market analysis shows that the total addressable market (TAM) for inferencing is much larger than the TAM for training.

Figure 1: Estimated TAM of AI workloads (source: Barclays research, October 2018)

This short blog presents quantitative data showing that owning the full stack is necessary to optimize performance and compete in this market. Vendors that do not pair their hardware with the best-suited open source software to build a complete solution will end up with inferior solutions. We quote extensively from our other blogs on these topics.

Use case: We took a use case from the hardware manufacturing industry [1]. The last stage of a typical manufacturing line requires human visual inspection for quality control. Visual inspection by humans suffers from several issues: (a) results that differ from person to person, (b) fatigue leading to errors, (c) under-rejection of bad parts, leading to product recalls, and (d) over-rejection of good parts, leading to revenue loss. An executive at a well-known manufacturing company commented that they employ 100+ operators on this task and still suffer from the issues outlined above.

Training was performed on good and bad images taken from the manufacturing line. A multi-GPU appliance was used for training. The hyper-parameters were tuned to keep false positives very low, as per the user requirement. This blog focuses on inferencing, so training is not discussed further.
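The two error types from the use case (over-reject and under-reject) can be measured on a labelled validation set. The sketch below is illustrative only: the function name `reject_rates`, the "good"/"bad" label strings, and the sample data are assumptions and do not come from the study.

```python
def reject_rates(labels, predictions):
    """Return (over_reject_rate, under_reject_rate).

    over-reject:  a good part predicted as bad, as a fraction of all good parts
    under-reject: a bad part predicted as good, as a fraction of all bad parts
    """
    good = sum(1 for y in labels if y == "good")
    bad = sum(1 for y in labels if y == "bad")
    pairs = list(zip(labels, predictions))
    over = sum(1 for y, p in pairs if y == "good" and p == "bad")
    under = sum(1 for y, p in pairs if y == "bad" and p == "good")
    over_rate = over / good if good else 0.0
    under_rate = under / bad if bad else 0.0
    return over_rate, under_rate


# Made-up labels for illustration only (not data from the study).
labels = ["good", "good", "good", "good", "bad", "bad"]
predictions = ["good", "bad", "good", "good", "bad", "good"]

over_rate, under_rate = reject_rates(labels, predictions)
print(f"over-reject rate: {over_rate:.2f}, under-reject rate: {under_rate:.2f}")
```

Which of the two rates counts as a "false positive" depends on whether "bad" or "good" is treated as the positive class, so it is worth stating that convention next to any reported number.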

We used multiple approaches to compare inferencing performance. In one case we employed different edge accelerators [2]. In another case we employed a cloud RESTful application [3]. This lets us compare the latency of both inferencing use cases outlined in the market analysis in Figure 1. It also lets us compare different software stacks and their impact on performance with the same workload. Note that latency is the most critical parameter for inferencing, assuming the same model is used regardless of the stack. Accuracy varied between stacks, but it was above the 95% target in every run, so accuracy numbers are not quoted in the comparisons.
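A latency comparison of this kind needs the same timing harness around every stack. The following is a minimal sketch of one way to time single-sample (batch size 1) inference. It is not the harness used for the measurements; `measure_latency` and `dummy_predict` are hypothetical names, and `dummy_predict` stands in for a real model call.

```python
import statistics
import time


def measure_latency(predict, sample, warmup=5, runs=50):
    """Time predict(sample) one call at a time and summarize in milliseconds."""
    # Warm-up calls are excluded so one-time setup cost does not skew results.
    for _ in range(warmup):
        predict(sample)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }


def dummy_predict(image):
    # Stand-in for a real inference call (assumption, not a real model).
    return sum(image) > 0


stats = measure_latency(dummy_predict, [0.1] * 1000)
print(stats)
```

Swapping `dummy_predict` for each stack's inference call, while keeping the input and run count fixed, keeps the comparison like for like.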

Edge inferencing: The trained model was first tested with a traditional software stack (Keras) on a CPU. Intel’s OpenVINO software stack implements optimizations for inferencing applications. These include hardware-independent optimizations (such as pruning and layer removal) as well as hardware-dependent optimizations (such as the choice of quantization level). The trained model was then tested with OpenVINO to measure the benefit of software optimization on the same hardware (CPU).
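To make "quantization level" concrete, here is a generic sketch of symmetric 8-bit quantization of a few weights in plain Python. It only illustrates the idea of trading precision for cheaper arithmetic; it is not how OpenVINO implements quantization, and the function names and sample weights are assumptions.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Made-up weights for illustration only.
weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale factor, which is why fewer bits (a larger scale) cost accuracy and why the usable level depends on the target hardware.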

Following this, the same model was run on multiple accelerators, with varying degrees of performance benefit. One of the accelerators was tested with two different software stacks, and performance differed markedly between them. The two stacks in this case were the Movidius NCS graph and OpenVINO. The edge hardware in this case was a Raspberry Pi with an Intel Movidius stick.

Lastly, a more powerful accelerator (an FPGA) was chosen to compare the benefits of hardware acceleration. An FPGA has more compute power, allowing more computations to run in parallel. Other benefits of an FPGA include a wider choice of quantization levels.

Cloud inferencing: The trained model was hosted as a RESTful API on cloud infrastructure. It is assumed that the cloud infrastructure has been optimized with low-latency hardware. The same benchmark data was run again and the latency tabulated.
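For the cloud case, the measured latency is the full round trip of one API call. The sketch below shows one way to time such a call using only the Python standard library. Everything in it is an assumption made for illustration: the `/predict` path, the JSON payload, and the local stub server that stands in for the real cloud endpoint.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubInferenceHandler(BaseHTTPRequestHandler):
    """Local stand-in for a hosted inference API (hypothetical)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the request body
        body = json.dumps({"label": "good"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def timed_request(url, payload):
    """POST one sample and return (parsed response, round-trip time in ms)."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    # Bypass any configured proxy so the local stub is reached directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.perf_counter()
    with opener.open(request, timeout=5) as response:
        result = json.loads(response.read())
    return result, (time.perf_counter() - start) * 1000.0


server = HTTPServer(("127.0.0.1", 0), StubInferenceHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/predict"

result, latency_ms = timed_request(url, {"pixels": [0, 1, 2]})
server.shutdown()
server.server_close()
print(result, f"{latency_ms:.2f} ms")
```

Against a real endpoint the timing includes network transit as well as model execution, so it is not directly comparable to an on-device number unless that is kept in mind.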

No attempt was made to optimize the scalability of the solution, since the chosen inferencing workload had a batch size of 1. However, a component vendor that does not think about scalability in a cost-effective manner will likely suffer performance and TCO penalties compared to vendors that do have such a solution. After all, one of the benefits of a cloud solution is latency that is independent of the number of API calls.

Conclusion: It is clear that optimizing the software stack and the choice of accelerator is critical to improving performance. The intent was also to see how much improvement accelerators can deliver over a traditional CPU. Software optimization yields a 30X to 100X improvement in latency. There are multiple software stacks in the marketplace. Finding the optimum one is non-trivial but very important for component vendors. In such a fast-moving market, it is not a given that documentation, support, and revision numbers of hardware and software will move together in a cohesive manner.
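An "NX" improvement here is the ratio of baseline latency to optimized latency. A small sketch of that arithmetic follows. The latency values in it are hypothetical placeholders chosen to make the division easy; they are not measurements from this study.

```python
def speedup(baseline_ms, optimized_ms):
    """Return how many times faster the optimized run is than the baseline."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


# Hypothetical latencies, for illustration only.
example = speedup(baseline_ms=1500.0, optimized_ms=50.0)
print(f"{example:.0f}X")
```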

A component vendor that does not use the full stack could potentially see 100X poorer performance.