Deploying AI at the Edge with Intel OpenVINO - Part 3 (final part)

Source: Deep Learning on Medium

In my previous posts, I introduced OpenVINO, described how to install it on a Windows computer, how to process input and output, and how to get a model or prepare one with the Model Optimizer. We are now at the final step: performing inference with the Inference Engine. So let's begin. The topics discussed in this post are:

  • Inference Engine
  • Feeding a model to the inference engine
  • Checking for unsupported layers and using CPU extension
  • Sending inference request
  • Handling the output
  • Integrating into an app

Inference Engine

The Inference Engine runs the actual inference on a model. In part 1, we downloaded a pre-trained model from the OpenVINO model zoo, and in part 2, we converted some models into the IR format with the Model Optimizer. The Inference Engine works only with this intermediate representation. In part 2, we saw how the Model Optimizer optimizes a model by reducing its size and complexity. The Inference Engine provides further hardware-based optimization to ensure that as few hardware resources as possible are used. This is what makes it practical to deploy AI at the edge on IoT devices.

To communicate with the hardware efficiently, the Inference Engine is built in C++, so you can use the engine in your C++ app directly. There is also a Python wrapper for using the engine from Python code. In this post, we will use Python.

How to work with the Inference Engine

Here are the steps to work with the Inference Engine, from first to last:

Feed a model > Check for any unsupported layers > Send Inference Request > Handle the Result > Integrate with your APP

Now we will look at the work needed for each step in detail. In part 1, we created two Python files, one for the inference code and one for the app. We will be continuing the code from there.

Feed a Model


We need to work with two Python classes from the "openvino.inference_engine" library: IECore and IENetwork. The documentation for IECore and IENetwork describes the methods available in each class, so check it out.

IENetwork holds the information about the model network read from the IR and allows further modification of the network. After the necessary processing, it feeds the network to IECore, which creates an executable network.

We will begin by importing the necessary classes into our inference file.

from openvino.inference_engine import IECore
from openvino.inference_engine import IENetwork

Now I am going to create a function named "load_to_IE" which will take one argument, model (the location of the model's *.xml file); from that we will also derive the location of the *.bin file. There will also be another variable named "cpu_ext" which I will explain later (as an example, I am using the same face detection model that I worked with in part 1).

cpu_ext = "C:/Program Files (x86)/IntelSWTools/openvino_2019.3.379/deployment_tools/inference_engine/bin/intel64/Release/cpu_extension_avx2.dll"

Now let's define the function.

def load_to_IE(model):
    # Getting the *.bin file location
    model_bin = model[:-3] + "bin"

    # Loading the Inference Engine API
    ie = IECore()

    # Loading IR files
    net = IENetwork(model=model, weights=model_bin)
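The string slicing above assumes the path always ends in ".xml". As a side note, a slightly more defensive way to derive the weights path (my own helper, not part of OpenVINO) is to use os.path.splitext:

```python
import os

def weights_path(model_xml):
    # Derive the *.bin weights path from the *.xml model path
    base, ext = os.path.splitext(model_xml)
    if ext != ".xml":
        raise ValueError("expected a *.xml model file, got: " + model_xml)
    return base + ".bin"

print(weights_path("face-detection-adas-0001.xml"))  # face-detection-adas-0001.bin
```

This also fails loudly if you accidentally pass a non-IR file instead of silently producing a wrong path.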

Check for Unsupported Layers


Even after successfully converting a model to IR, some layers might still be unsupported by the CPU. In that case, we can use a CPU extension file to add support for those layers in the Inference Engine. The CPU extension file location differs slightly between operating systems. For example, I found my CPU extension *.dll file in this location: <installation_directory>\openvino_2019.3.379\deployment_tools\inference_engine\bin\intel64\Release.

The file is named "cpu_extension_avx2.dll". On Linux there are several extension files, and on Mac there is just one.
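If you want one script to run on several machines, a small helper can pick a candidate extension path per OS. This is only a sketch: the folder layout and filenames below match my 2019.3 install and are assumptions that may differ in your version.

```python
import platform

def cpu_extension_path(install_dir, system=None):
    # Return a plausible CPU extension library path for the given OS.
    # Filenames follow the OpenVINO 2019.3 layout; check your own install.
    system = system or platform.system()
    if system == "Windows":
        return install_dir + "/deployment_tools/inference_engine/bin/intel64/Release/cpu_extension_avx2.dll"
    if system == "Darwin":  # Mac has a single extension file
        return install_dir + "/deployment_tools/inference_engine/lib/intel64/libcpu_extension.dylib"
    # Linux ships several variants (sse4, avx2, ...); avx2 is one of them
    return install_dir + "/deployment_tools/inference_engine/lib/intel64/libcpu_extension_avx2.so"
```

You could then set cpu_ext = cpu_extension_path("C:/Program Files (x86)/IntelSWTools/openvino_2019.3.379") instead of hard-coding the full path.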

Not all models need a CPU extension, so first we will check whether ours does. Continue the code inside the "load_to_IE" function.

    # Listing all the layers and supported layers
    cpu_extension_needed = False
    network_layers = net.layers.keys()
    supported_layer_map = ie.query_network(network=net, device_name="CPU")
    supported_layers = supported_layer_map.keys()

Let me explain the code. First, I set a flag indicating that the CPU extension is not needed. Then I list the names of all the layers in our network in the "network_layers" variable. Next, I use the "query_network" method of the IECore class, which returns a dictionary of the layers supported in the current configuration. By extracting the keys from that dictionary, I create a list of the supported layers and store it in the "supported_layers" variable.

Now I will iterate over all the layers in our current network and check whether each belongs to the supported-layer list. If all the layers are present in the supported list, a CPU extension won't be necessary; otherwise, we will set our flag to True and move on to adding the CPU extension. Continue the following code inside the "load_to_IE" function.

    # Checking if CPU extension is needed
    for layer in network_layers:
        if layer not in supported_layers:
            cpu_extension_needed = True
            print("CPU extension needed")
            break

We will use the “add_extension” method of the IECore class to add the extension. Continue the code inside the same function.

    # Adding CPU extension
    if cpu_extension_needed:
        ie.add_extension(extension_path=cpu_ext, device_name="CPU")
        print("CPU extension added")
    else:
        print("CPU extension not needed")

To be safe, we will run the same check again to see whether, after adding the CPU extension, all the layers are now supported. If they are, we can move to the next step; if not, we will exit the program.

    # Getting the supported layers of the network
    supported_layer_map = ie.query_network(network=net, device_name="CPU")
    supported_layers = supported_layer_map.keys()

    # Checking for any unsupported layers; if any, exit
    unsupported_layer_exists = False
    network_layers = net.layers.keys()
    for layer in network_layers:
        if layer not in supported_layers:
            print(layer + ' : Still Unsupported')
            unsupported_layer_exists = True
    if unsupported_layer_exists:
        print("Exiting the program.")
        exit(1)

Now that we are sure that all the layers are supported, we will load the network to our inference engine.

    # Loading the network to the inference engine
    exec_net = ie.load_network(network=net, device_name="CPU")
    print("IR successfully loaded into Inference Engine.")
    return exec_net  # exec_net is short for executable network

Send Inference Request


We will send our inference request to the executable network "exec_net" that is returned by our "load_to_IE" function. There are two methods of inference: synchronous and asynchronous. In the synchronous method, the app sends an inference request to the engine and does nothing but wait until the request is complete. In the asynchronous method, on the other hand, the app can continue other work while the Inference Engine performs the inference. This is helpful if the inference request processing is slow for some reason and we don't want our app to hang while the inference completes. So, in the asynchronous method, while the inference request on one frame is being processed, the app can continue to gather and pre-process the next frame instead of freezing, as it would in the synchronous method.

Let's define two functions, one for synchronous inference and one for asynchronous inference. The synchronous function, named "sync_inference", will take the executable network and the pre-processed image as arguments. The asynchronous function, named "async_inference", will take one additional argument, "request_id", which I default to 0; in this example we will send just one request, so we don't need to worry about the ids.

When you use this method to make your app asynchronous in the real sense (in our example, we actually wait while the IE completes the inference, as there is nothing else for our simple demo app to do), you might feed several images one after another with different request ids before the engine finishes the inference on any one image. When the engine completes an inference, you extract the corresponding result using these ids, so make sure to assign unique ids to your images in a real app. You can read the documentation to learn more about the asynchronous method. Handling the output of the two methods is different: the synchronous method can return the result directly, but the asynchronous method will return the executable network.

def sync_inference(exec_net, image):
    input_blob = next(iter(exec_net.inputs))
    result = exec_net.infer({input_blob: image})
    return result

def async_inference(exec_net, image, request_id=0):
    input_blob = next(iter(exec_net.inputs))
    exec_net.start_async(request_id, inputs={input_blob: image})
    return exec_net

When the asynchronous inference is complete, the result can be extracted using the request id, so it needs a bit more processing.
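To make the multiple-request idea above concrete, here is a sketch of round-robin pipelining over a fixed pool of request slots. Since it needs no loaded network, a small stand-in class (my own, not the real OpenVINO API) takes the place of the in-flight requests; only the scheduling logic is the point.

```python
class FakeRequest:
    """Stand-in for an in-flight inference request (not the real API)."""
    def __init__(self):
        self.busy = False
        self.result = None

    def start_async(self, frame):
        self.busy = True
        self._frame = frame

    def wait(self):
        self.busy = False
        self.result = "processed:%s" % self._frame
        return 0  # status 0 means complete

def pipeline(frames, num_requests=2):
    # Cycle frames over a fixed pool of request slots; when a slot is
    # still busy, collect its result before reusing it.
    requests = [FakeRequest() for _ in range(num_requests)]
    results = []
    for i, frame in enumerate(frames):
        req = requests[i % num_requests]
        if req.busy:
            req.wait()
            results.append(req.result)
        req.start_async(frame)
    # Drain the remaining in-flight requests in submission order
    for j in range(num_requests):
        req = requests[(len(frames) + j) % num_requests]
        if req.busy:
            req.wait()
            results.append(req.result)
    return results

print(pipeline([0, 1, 2, 3]))  # results come back in frame order
```

With the real engine, start_async would be exec_net.start_async(request_id, inputs={...}) and wait would be exec_net.requests[request_id].wait(-1), but the slot-reuse pattern is the same.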

Handling the Output


As we have seen, the synchronous method directly gives us the result of the inference. In part 1, we used this result to draw the bounding box; of course, the processing varies according to what we want the app to do. To get the result of inference from the async method, we will define another function, which I named "get_async_output". This function takes two arguments: the executable network that "async_inference" returned, and the request id for which we want the result.

def get_async_output(exec_net, request_id=0):
    output_blob = next(iter(exec_net.outputs))
    status = exec_net.requests[request_id].wait(-1)
    if status == 0:
        result = exec_net.requests[request_id].outputs[output_blob]
        return result

The "wait" method returns the status of the processing. If we call it with the argument 0, it returns the status instantly, even if the processing is not complete. But if we call it with -1, it waits for the request to complete.
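The difference between the two modes can be illustrated without a loaded network. Below, a stand-in request object (my own, not the real InferRequest) completes after a few polls; the "not ready" status code is a placeholder, while 0 means complete, as in our "get_async_output" function.

```python
class PollingRequest:
    """Stand-in request that completes after a few polls (not the real API)."""
    def __init__(self, polls_needed=3):
        self._remaining = polls_needed

    def wait(self, timeout):
        if timeout == -1:        # block until the request completes
            self._remaining = 0
            return 0
        self._remaining -= 1     # timeout == 0: just report current status
        return 0 if self._remaining <= 0 else -9  # -9: placeholder "not ready"

def poll_until_done(request, do_other_work=lambda: None):
    # Keep the app responsive: do other work between status checks
    polls = 0
    while request.wait(0) != 0:  # non-blocking check
        do_other_work()          # e.g. grab and preprocess the next frame
        polls += 1
    return polls
```

Calling request.wait(-1) blocks and always comes back with status 0, whereas the wait(0) loop gives the app a chance to do useful work in between checks.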

Integrate with your APP


We are done with our inference code. Now we will use our functions in our app file, which is very simple. First, we need to import the functions we just defined into the app file. Then, inside the "main" function (that we defined in part 1), we will call the functions as needed.

from inference2 import preprocessing, load_to_IE, sync_inference, async_inference, get_input_shape, get_async_output

def main():
    #*..................................*#
    exec_net = load_to_IE(model)

    # Synchronous method
    result = sync_inference(exec_net, image=preprocessed_image)


Remember how we hard-coded the required image dimensions in the first part? We don't need to do that anymore. Define this new function in the file with our other inference functions.

def get_input_shape(model):
    model_bin = model[:-3] + "bin"
    net = IENetwork(model=model, weights=model_bin)
    input_blob = next(iter(net.inputs))
    return net.inputs[input_blob].shape

Now, import the function in the app file, and add these lines inside the "main" function instead of hard-coding the height and width.

from inference import get_input_shape

def main():
    n, c, h, w = get_input_shape(model)
    preprocessed_image = preprocessing(image, h, w)

I have uploaded the complete code to a GitHub repository. If you find any part of the code confusing, check the full code.


The Intel OpenVINO toolkit has endless possibilities. The huge model zoo gives developers everything they need to build powerful edge applications with cutting-edge AI technologies without worrying too much about training. OpenVINO also brings AI to your low-powered device without a hassle. Even on a small Raspberry Pi paired with an Intel NCS, you can build amazing AI applications with OpenVINO. Your imagination is the limit here. I hope my posts have helped you get started with OpenVINO and build your edge app. Until next time, enjoy coding.