AI Accelerator Products

Source: Deep Learning on Medium

Authors: Partha Deka and Rohit Mittal

In our previous blog post on Medium (below), we described how we used deep learning for quality inspection in a manufacturing setting:

In this post we present in detail various Intel software and hardware AI products. These products are available on the market and accelerate deep learning inference in production environments across a range of hardware.

What is deep learning inference?

Once we have developed the optimal deep neural network model by training for hours or days on powerful GPUs, we must operationalize the model to realize its core business value. Take a computer vision use case: say we trained a deep neural network classifier for hardware product defect detection with Keras and TensorFlow on GPUs. With a trained model in hand, it is time to deploy it in a production environment, which could be an on-premises server, a cloud instance, or the edge. If you have followed our previous blog, here is a pictorial representation of the inference flow for the abstract use case we talked about:

We get a new image, perform ROI extraction with the trained CNN ROI generator, and then perform defect detection with the trained CNN defect detector. Inference with an ML or DL model is part of the production pipeline, where a real-time or batch data engineering pipeline processes an image and invokes the trained model to make a prediction. The SLA for an inference could be a fraction of a second or just a few milliseconds. In our experience, the models we built for this use case with TensorFlow/Keras or PyTorch used to take more than 1 second to infer on a single image on a standard CPU when loaded from the serialized .h5 or .pt format. But how do we infer in milliseconds?
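
A latency budget like this is easier to reason about once it is measured. The sketch below is one minimal way to time repeated calls in milliseconds using only the Python standard library. The names measure_latency_ms and dummy_predict are hypothetical, and dummy_predict merely stands in for a real model's prediction call.

```python
import statistics
import time


def measure_latency_ms(predict, sample, warmup=3, runs=20):
    """Return per-call latencies in milliseconds for predict(sample)."""
    for _ in range(warmup):  # warm-up calls are not timed
        predict(sample)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def dummy_predict(image):
    # Hypothetical stand-in for a real model call; it only sums the pixels.
    return sum(image)


latencies = measure_latency_ms(dummy_predict, [0.5] * 10_000)
print(f"median: {statistics.median(latencies):.3f} ms, "
      f"worst: {max(latencies):.3f} ms")
```

Reporting the median together with the worst case is a deliberate choice here: an SLA is usually broken by the slow outliers, not by the typical call.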

Software Accelerator - Optimize inference with OpenVINO:

Intel built the OpenVINO toolkit to optimize inference with convolutional neural network (CNN) models. The toolkit extends across Intel hardware and maximizes performance:

– Enables CNN-based deep learning inference on the edge

– Supports heterogeneous execution across computer vision accelerators (CPU, GPU, Intel Movidius Neural Compute Stick, and FPGA) using a common API

– Speeds time to market via a library of functions and pre-optimized kernels

– Includes optimized calls for OpenCV and OpenVX*

Two steps towards deployment: Model Optimizer and Inference Engine

Step one is to convert the pre-trained model into an Intermediate Representation (IR) using the Model Optimizer, which has two responsibilities:

  • Produce a valid Intermediate Representation: If this main conversion artifact is not valid, the Inference Engine cannot run. The primary responsibility of the Model Optimizer is to produce the two files that form the Intermediate Representation.
  • Produce an optimized Intermediate Representation: Pre-trained models contain layers that are important for training, such as the dropout layer. These layers are useless during inference and might increase the inference time. In many cases, they can be automatically removed from the resulting Intermediate Representation. In addition, if a group of layers can be represented as one mathematical operation, and thus as a single layer, the Model Optimizer recognizes such patterns and replaces these layers with one. The result is an Intermediate Representation that has fewer layers than the original model, which decreases the inference time.
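
To make the two optimizations concrete, here is a toy sketch in plain Python. It is not the Model Optimizer's implementation: the layers are scalar operations rather than tensors, and the layer names and the optimize and run helpers are invented for illustration. It drops a dropout layer and folds a scale-and-shift layer into the linear layer before it, then checks that the output is unchanged.

```python
def optimize(layers):
    """Drop training-only layers and fold each scale-shift into the
    preceding linear layer. Layers are (kind, params) tuples."""
    optimized = []
    for kind, params in layers:
        if kind == "dropout":
            continue  # identity at inference time, so it can be removed
        if kind == "scale_shift" and optimized and optimized[-1][0] == "linear":
            w, b = optimized[-1][1]
            a, c = params
            # a * (w * x + b) + c == (a * w) * x + (a * b + c)
            optimized[-1] = ("linear", (a * w, a * b + c))
            continue
        optimized.append((kind, params))
    return optimized


def run(layers, x):
    """Evaluate the toy network on a single scalar input."""
    for kind, params in layers:
        if kind in ("linear", "scale_shift"):
            scale, offset = params
            x = scale * x + offset
        elif kind == "relu":
            x = max(0.0, x)
        # dropout does nothing at inference time
    return x


model = [
    ("linear", (2.0, 1.0)),
    ("scale_shift", (0.5, -1.0)),
    ("relu", None),
    ("dropout", 0.5),
    ("linear", (3.0, 0.0)),
]
fast_model = optimize(model)
print(len(model), "->", len(fast_model), "layers")
print(run(model, 4.0), run(fast_model, 4.0))
```

The fused network has three layers instead of five and produces the same result, which is the property any such rewrite has to preserve.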

The IR is a pair of files that describe the whole model:

.xml: Describes the network topology

.bin: Contains the weights and biases binary data
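
Because the IR is always a pair, it can help to verify that both files are present before handing them to the Inference Engine. The sketch below assumes the .bin file sits next to the .xml file and shares its base name; ir_pair and missing_ir_files are hypothetical helper names, not part of any Intel API.

```python
from pathlib import Path


def ir_pair(xml_path):
    """Return (xml, bin) paths for an IR, assuming both share a base name."""
    xml = Path(xml_path)
    if xml.suffix != ".xml":
        raise ValueError(f"expected an .xml topology file, got {xml.name}")
    return xml, xml.with_suffix(".bin")


def missing_ir_files(xml_path):
    """List whichever of the two IR files is not on disk."""
    return [path for path in ir_pair(xml_path) if not path.is_file()]


xml, weights = ir_pair("trained_model.xml")
print(xml, weights)
```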

Step two is to use the Inference Engine to read, load, and infer the IR files, using a common API across the CPU, GPU, or VPU hardware

Please refer to the OpenVINO documentation below for an overview of the Python API:

Please refer to the flow diagram and the steps below to deploy a neural network model with OpenVINO:

Please follow the steps (code snippets) below:

Step 1: Save the trained model in h5 format (for example, as 'trained_model.h5')

Step 2: Convert the trained model to TensorFlow pb format

(Use the github repo:

python3 -input_model_file trained_model.h5

Step 3: Run the Model Optimizer (located in C:\Intel\OpenVINO\computer_vision_sdk_2018.3.343\deployment_tools\model_optimizer\)

python3 --input_model <your_model.pb> --input_shape=[1,224,224,3]

Step 4: Pre-process the image and run inference
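
One pre-processing detail that is easy to miss is the memory layout. Image libraries such as OpenCV return pixels as height x width x channel, while a converted network typically expects channel x height x width with a leading batch dimension; that expectation is an assumption to check against your own model. The sketch below shows the reordering on nested lists so that it runs without extra packages. The helper name hwc_to_chw is ours.

```python
def hwc_to_chw(image):
    """Reorder a nested-list image from [height][width][channel]
    to [channel][height][width]."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A tiny 2x2 image with 3 channels per pixel.
image = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
chw = hwc_to_chw(image)
batch = [chw]  # add the leading batch dimension: shape (1, 3, 2, 2)
print(chw[0])
```

With NumPy arrays the same reordering is a single call, image.transpose((2, 0, 1)).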

Some of our inference benchmarks on a single image:

Hardware Accelerators:

Intel Movidius:

Intel Movidius is custom built for deep neural network inference on images. It is powered by the Intel Movidius Vision Processing Unit (VPU), which is purpose-built for computer vision.

Technical Specifications

  • Processor: Intel® Movidius™ Myriad™ X Vision Processing Unit (VPU)
  • Supported frameworks: TensorFlow* and Caffe*
  • Connectivity: USB 3.0 Type-A
  • Dimensions: 2.85 in. x 1.06 in. x 0.55 in. (72.5 mm x 27 mm x 14 mm)
  • Operating temperature: 0° C to 40° C
  • Compatible operating systems: Ubuntu* 16.04.3 LTS (64 bit), CentOS* 7.4 (64 bit), and Windows® 10 (64 bit)

Please refer to the video below for an introduction:

Please take a look at the Intel Movidius GitHub documentation:

Intel Movidius Neural Compute SDK (Software Development Kit):

Command line tools: The Intel Movidius Neural Compute SDK provides tools for profiling, tuning, and compiling a deep neural network (DNN) model on a development computer.

mvNCCompile is a command line tool that compiles the network and weights of Caffe or TensorFlow models into an Intel Movidius graph that is compatible with the Intel Movidius Neural Compute SDK (NCSDK) and Neural Compute API (NCAPI).

mvNCCompile inception-v1.pb -s 12 -in=input -on=InceptionV1/Logits/Predictions/Reshape_1 -is 224 224 -o InceptionV1.graph

The Intel Movidius Neural Compute SDK comes with a Python API that lets applications run accelerated deep neural networks on neural compute devices such as the Intel Movidius Neural Compute Stick.

Python API: The Python API is provided as a single Python module, which is placed on the development computer when the NCSDK is installed. It is available for both Python 2.7 and 3.5.

Python API overview:

1. Import the NCAPI module

The Python NCAPI is in the mvncapi module within the mvnc package.

from mvnc import mvncapi

2. Set up a neural compute device

The Device class represents a neural compute device and provides methods to communicate with the device.

# Get a list of available device identifiers
device_list = mvncapi.enumerate_devices()

Initialize the Device with one of the device identifiers obtained from the call to enumerate_devices():

# Initialize the device and open communication
device = mvncapi.Device(device_list[0])
device.open()

3. Set up a network graph and associated FIFO queues for the device

The NCSDK requires a neural network graph file compiled with the mvNCCompile NCSDK tool. Many network models from TensorFlow and Caffe are supported.

Once we have a compiled graph, we load the graph file into a buffer:

# Load graph file data
GRAPH_FILEPATH = './graph'
with open(GRAPH_FILEPATH, mode='rb') as f:
    graph_file_buffer = f.read()

The Graph class provides methods for utilizing the network graph.

We can initialize the Graph with a name string. The name string can be anything we like up to mvncapi.MAX_NAME_SIZE characters, or just an empty string.

# Initialize a Graph object
graph = mvncapi.Graph('graph')

Graph input and output is done with FIFO (first-in, first-out) queues. The Fifo class represents one of these queues and provides methods for managing it.

We should create input and output Fifo queues for our Graph and allocate the graph to our device with Graph.allocate_with_fifos(). We can omit the keyword parameters to use the default Fifo settings, or we can specify other values as needed.

# Allocate the graph to the device and create input and output Fifos with default arguments
input_fifo, output_fifo = graph.allocate_with_fifos(device, graph_file_buffer)

# Alternatively, allocate the graph and create the Fifos with keyword arguments
input_fifo, output_fifo = graph.allocate_with_fifos(
    device, graph_file_buffer,
    input_fifo_type=mvncapi.FifoType.HOST_WO, input_fifo_data_type=mvncapi.FifoDataType.FP32, input_fifo_num_elem=2,
    output_fifo_type=mvncapi.FifoType.HOST_RO, output_fifo_data_type=mvncapi.FifoDataType.FP32, output_fifo_num_elem=2)

4. Get input tensor

Use the cv2 module to read an image from file and resize it to fit the network's requirements:

import cv2

# Read an image from file
tensor = cv2.imread('img.jpg')
# Do pre-processing specific to this network model (resizing, subtracting network means, etc.)
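
What that pre-processing looks like depends entirely on the network. As one hedged illustration, the sketch below converts OpenCV's BGR channel order to RGB, subtracts a per-channel mean, and divides by a scale factor. The mean and scale values are placeholders rather than the values for any particular model, the function names are ours, and nested lists are used so the sketch runs without extra packages.

```python
def preprocess_pixel(bgr, mean_rgb=(127.5, 127.5, 127.5), std=127.5):
    """Convert one BGR pixel to RGB, subtract the mean and divide by std."""
    b, g, r = bgr
    return tuple((value - mean) / std for value, mean in zip((r, g, b), mean_rgb))


def preprocess(image):
    """Apply preprocess_pixel to every pixel of a [height][width] BGR image."""
    return [[preprocess_pixel(pixel) for pixel in row] for row in image]


tiny = [[(0, 127.5, 255)]]  # one pixel: blue=0, green=127.5, red=255
print(preprocess(tiny))
```

Whether a model wants RGB or BGR input, and which mean and scale it was trained with, should be taken from the model's own documentation.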

5. Perform an inference

We use Graph.queue_inference_with_fifo_elem() to write the input tensor to the input Fifo and queue it for inference. When the inference is complete, the input tensor is removed from the input_fifo queue and the result tensor is placed in the output_fifo queue.

# Write the tensor to the input_fifo and queue an inference
graph.queue_inference_with_fifo_elem(input_fifo, output_fifo, tensor, 'user object')

If the input Fifo is full, this method call will block until there is room to write to the Fifo.

After the inference is complete, we can get the inference result with Fifo.read_elem(). This also returns the user object that we passed to Graph.queue_inference_with_fifo_elem().

# Get the results from the output queue
output, user_obj = output_fifo.read_elem()

We can then use the output result as intended for our particular network model
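
For readers new to this queueing pattern, here is an analogy built only on the Python standard library. It does not use the NCAPI: fake_device is a hypothetical stand-in for the device, and queue.Queue plays the role of the Fifo class. It shows a bounded input queue, a result queue, and a user object that travels with the tensor.

```python
import queue
import threading

input_fifo = queue.Queue(maxsize=2)   # put() blocks while 2 items are waiting
output_fifo = queue.Queue(maxsize=2)


def fake_device():
    """Hypothetical stand-in for the device: doubles each tensor value."""
    while True:
        item = input_fifo.get()
        if item is None:  # sentinel: stop the worker
            break
        tensor, user_obj = item
        output_fifo.put(([2 * value for value in tensor], user_obj))


worker = threading.Thread(target=fake_device, daemon=True)
worker.start()

# Queue an inference together with a user object, then read both back.
input_fifo.put(([1, 2, 3], "frame-0"))
output, user_obj = output_fifo.get(timeout=5)
print(output, user_obj)

input_fifo.put(None)
worker.join(timeout=5)
```

The user object is useful for matching a result to the frame that produced it once several inferences are in flight.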

6. Clean up

Before closing communication with the device, we use Graph.destroy() and Fifo.destroy() to destroy the Graph and Fifo objects and clean up the associated memory. The Fifos must be empty before being destroyed. Then we use Device.close() to close the device and Device.destroy() to destroy the Device object and clean up its associated memory.

# Clean up
input_fifo.destroy()
output_fifo.destroy()
graph.destroy()
device.close()
device.destroy()
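
If inference raises an exception part-way through, the clean-up calls above are skipped. One possible way to guard against that, sketched here with hypothetical stand-in objects instead of the real Device, Graph, and Fifo classes, is to register each clean-up step on a contextlib.ExitStack as soon as the resource exists. The stack then unwinds in reverse order even when an error occurs.

```python
from contextlib import ExitStack

calls = []  # records the clean-up order for this demonstration


class Resource:
    """Hypothetical stand-in for a Device, Graph or Fifo object."""

    def __init__(self, name):
        self.name = name

    def close(self):
        calls.append(f"{self.name}.close")

    def destroy(self):
        calls.append(f"{self.name}.destroy")


try:
    with ExitStack() as stack:
        device = Resource("device")
        stack.callback(device.destroy)  # registered first, so it runs last
        stack.callback(device.close)
        graph = Resource("graph")
        stack.callback(graph.destroy)
        input_fifo, output_fifo = Resource("input_fifo"), Resource("output_fifo")
        stack.callback(output_fifo.destroy)
        stack.callback(input_fifo.destroy)  # registered last, so it runs first
        raise RuntimeError("inference failed")  # simulate an error mid-run
except RuntimeError:
    pass

print(calls)
```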

Intel FPGA:

FPGAs are silicon devices that can be programmed for workloads such as data analytics, image inference, encryption, and compression. FPGAs enable faster processing, better power efficiency, and lower latency, maximizing compute capacity within the power, space, and cooling constraints of data centers.

Benefits of Intel FPGAs

  • Ease of deployment – The Intel Programmable Acceleration Card (Intel PAC) provides an Intel FPGA in a PCIe-based card that is available on validated servers from several leading OEMs. The Intel Acceleration Stack for Intel Xeon CPU with FPGAs abstracts away much of the complexity of programming FPGAs.
  • Standardization – The Intel Xeon CPU with FPGAs defines standardized interfaces that FPGA developers and development and operations teams can use to hot-swap accelerators and enable application portability.
  • Accelerator Solutions – A portfolio of accelerator solutions developed by Intel and third-party technologists to expedite application development and deployment. Applications that can benefit from FPGA acceleration include streaming analytics and image inference.

Image inference with OpenVINO with FPGA support:

Step 1: Configuring the FPGA board: Hardware

Some hardware-side setup must be performed before moving on to the software configuration. We need to follow the instructions here:

Make sure we have both the JTAG connection (through micro USB) and the PCIe connection.

Step 2: Configuring the FPGA board: Software

Please follow the instructions in the following link:

2.1 Initialize the Intel Arria 10 GX for use with OpenCL

The board must be correctly initialized with an OpenCL image before it can be used with OpenCL. Please follow the instructions in the "Initializing the Intel Arria 10 GX FPGA Development Kit for use with OpenCL" section of the link below:

2.2 Install OpenCL Runtime Driver

The OpenCL runtime driver (FPGA RTE for OpenCL) comes with the OpenVINO installation, so we need to make sure that OpenVINO is installed first. Follow the instructions in the "Installing OpenCL Runtime Environment" section of the following link:

2.3 Program the board with an OpenVINO-provided bitstream

Once we complete all the steps above, we need to program the board with a bitstream for a particular CNN topology.

The bitstream we program should correspond to the topology we want to deploy. Depending on how many bitstreams we select, there could be different folders for each FPGA card type which would be downloaded in the OpenVINO package. For the Intel Arria 10GX DevKit FPGA, the pre-trained bitstreams are in


For the Intel Vision Accelerator Design with Intel Arria 10 FPGA, the pre-trained bitstreams are in /opt/intel/computer_vision_sdk/bitstreams/a10_vision_design_bitstreams

Please follow the steps below to program a bitstream:

1. Rerun the environment setup script.

source /home/<user>/Downloads/fpga_support_files/setup_env.s

2. Change to home directory

cd /home/<user>

3. We need to choose the option based on the card we have:

  • Program the bitstream for Intel® Arria® 10 FPGA Development Kit
aocl program acl0 /opt/intel/computer_vision_sdk/a10_devkit_bitstreams/2-0-1_A10DK_FP11_SqueezeNet.aocx
  • Program the bitstream for the Intel® Vision Accelerator Design with Intel® Arria® 10 FPGA
aocl program acl0 /opt/intel/computer_vision_sdk/bitstreams/a10_vision_design_bitstreams/4-0_PL1_FP11_SqueezeNet.aocx
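
If this step is scripted, the choice between the two commands can be made explicit. The sketch below only builds the argument list and does not run it. The bitstream paths are copied from the two commands above, while the dictionary keys and the program_command helper are hypothetical names chosen for illustration.

```python
# Bitstream paths copied from the two commands above; the dictionary keys
# are hypothetical labels chosen for this sketch.
BITSTREAMS = {
    "a10_devkit": "/opt/intel/computer_vision_sdk/a10_devkit_bitstreams/"
                  "2-0-1_A10DK_FP11_SqueezeNet.aocx",
    "a10_vision_design": "/opt/intel/computer_vision_sdk/bitstreams/"
                         "a10_vision_design_bitstreams/"
                         "4-0_PL1_FP11_SqueezeNet.aocx",
}


def program_command(card, device="acl0"):
    """Build the `aocl program` argument list for the given card label."""
    if card not in BITSTREAMS:
        raise ValueError(
            f"unknown card {card!r}; expected one of {sorted(BITSTREAMS)}")
    return ["aocl", "program", device, BITSTREAMS[card]]


cmd = program_command("a10_devkit")
print(" ".join(cmd))
```

Passing cmd to subprocess.run(cmd, check=True) would execute it on a machine where the aocl tool is installed.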

Step 3: Set up a Neural Network Model for FPGA

Please follow the steps below to create an FP16 model for inference:

1. Make a directory for the FP16 SqueezeNet model:

mkdir /home/<user>/squeezenet1.1_FP16

2. Go to /home/<user>/squeezenet1.1_FP16:

cd /home/<user>/squeezenet1.1_FP16

3. Use the Model Optimizer to convert the SqueezeNet Caffe model into an optimized FP16 Intermediate Representation (IR):

python3 /opt/intel/computer_vision_sdk/deployment_tools/model_optimizer/ --input_model /home/<user>/openvino_models/FP32/classification/squeezenet/1.1/caffe/squeezenet1.1.caffemodel --data_type FP16 --output_dir .

4. The squeezenet1.1.labels file contains the classes ImageNet uses. This file is included so that the inference results show text instead of classification numbers. Copy squeezenet1.1.labels to the optimized model location

cp /home/<user>/openvino_models/ir/squeezenet1.1/FP32/squeezenet1.1.labels  .

5. Copy a sample image to the release directory. You will use this with your optimized model

sudo cp /opt/intel/computer_vision_sdk/deployment_tools/demo/car.png  ~/inference_engine_samples/intel64/Release
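
The --data_type FP16 option in step 3 stores the weights in half precision, which trades a little accuracy for smaller weights. To get a feel for the size of that rounding, the sketch below round-trips a few example values through the IEEE 754 half-precision format using Python's struct module. The values are arbitrary examples, not weights from SqueezeNet, and to_fp16 is our own helper name.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


weights = [0.1, 1.0, 3.14159, 1000.123]
for w in weights:
    h = to_fp16(w)
    print(f"{w:>10} -> {h:<18} relative error {abs(h - w) / abs(w):.2e}")
```

Half precision keeps about three significant decimal digits, which is why the relative error stays small across very different magnitudes.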

Step 4: Run a Sample Application

1. Go to the samples directory:

cd /home/<user>/inference_engine_samples/intel64/Release

2. Use an Inference Engine sample to run a sample application on the CPU:

./classification_sample -i car.png -m ~/openvino_models/ir/squeezenet1.1/FP32/squeezenet1.1.xml

Note the CPU throughput in frames per second (FPS). This tells us how quickly the inference is done on the hardware. Now let's run the inference using the FPGA:

./classification_sample -i car.png -m ~/squeezenet1.1_FP16/squeezenet1.1.xml -d HETERO:FPGA,CPU
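
The -d HETERO:FPGA,CPU option lists devices in priority order: layers run on the FPGA where it supports them and fall back to the CPU otherwise. The toy sketch below illustrates that idea only. The capability table is invented for this example, and parse_hetero and assign_layers are hypothetical helpers, not part of the Inference Engine.

```python
def parse_hetero(device):
    """Split a device string such as 'HETERO:FPGA,CPU' into a priority list."""
    prefix, _, names = device.partition(":")
    if prefix != "HETERO" or not names:
        return [device]  # a plain device name such as 'CPU'
    return names.split(",")


def assign_layers(layers, priorities, supported):
    """Give each layer to the first device in priority order that supports it."""
    plan = {}
    for layer in layers:
        for dev in priorities:
            if layer in supported[dev]:
                plan[layer] = dev
                break
        else:
            raise ValueError(f"no device supports layer {layer!r}")
    return plan


# Hypothetical capability table, invented for this sketch only.
SUPPORTED = {
    "FPGA": {"conv", "relu", "pool"},
    "CPU": {"conv", "relu", "pool", "softmax"},
}
priorities = parse_hetero("HETERO:FPGA,CPU")
plan = assign_layers(["conv", "relu", "pool", "softmax"], priorities, SUPPORTED)
print(plan)
```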

Our inference benchmarks on FPGA+OpenVino on a single image: