Deploying AI at the Edge with Intel OpenVINO - Part 2

Source: Deep Learning on Medium

Photo by Drew Patrick Miller on Unsplash

In part 1, I talked about how to download a pretrained model that was already optimized for use in the OpenVINO toolkit. In this part, we will see how to optimize an unoptimized model with the model optimizer. The topics covered in this post are:

  • Optimization techniques
  • Supported frameworks
  • Intermediate Representation
  • Converting ONNX model
  • Converting Caffe model
  • Converting TensorFlow model

How Optimization Is Done

The model optimizer is a command-line tool that converts a model built with a supported framework into an intermediate representation (IR) that the inference engine can use. The model optimizer is a Python file named “” and it can be found in this location:


During the conversion, the model optimizer changes the model in ways that slightly affect its performance. So what exactly does the model optimizer do to make the model light enough for edge applications? It can do several things. Three of them are quantization, freezing and fusion.


Quantization

The model optimizer can reduce the precision of the weights and biases from floating-point values to integers or lower-precision floating-point values. Very high precision is useful for training, but for inference the precision can be lowered, for example from FP32 to FP16 or INT8, without hurting the accuracy of the model significantly. This reduces both the computation time and the model size at the cost of a very small loss of accuracy. This process is called quantization. Models in OpenVINO usually default to FP32. Among the pre-trained models, INT8 and FP16 precisions are available, but the model optimizer itself currently does not support INT8.

Reduction of weights from 32-bit floating point to 8-bit integers
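To make the idea of reduced precision concrete, here is a minimal sketch in plain Python. It is not how the model optimizer implements quantization internally. It assumes a simple symmetric scheme with one scale per tensor, and the function names (to_fp16, quantize_int8, dequantize) are made up for this illustration.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    # Assumed scheme: one scale for the whole tensor, chosen from the largest magnitude.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error of each weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale, "max error:", max_error)
print("0.1 stored as FP16:", to_fp16(0.1))
```

Each weight now needs 8 bits instead of 32, and the only extra value to store is the scale. Very small weights, such as 0.003 here, collapse to zero, which is the kind of small accuracy loss the text refers to.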


Freezing

In the context of training a neural network, freezing means keeping some layers fixed so that they are not trained while the other layers are fine-tuned. In the model optimizer, freezing means removing metadata and operations that were useful only for training the network, not for inference. Freezing is used specifically for TensorFlow models. For example, backpropagation is used only for training and plays no part in prediction, so the steps used for backpropagation can be removed from the model during the conversion to the intermediate representation.

Left: Model during training, Right: Model after freezing


Fusion

Fusion means combining several layers into one single layer. For example, a batch normalization layer, an activation layer and a convolutional layer can be combined into one single layer. Several operations may run in separate GPU kernels, but a fused layer runs in one kernel, which removes the overhead of switching from one kernel to another.

Combining multiple layers into one single layer
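One common form of fusion is folding a batch normalization layer into the convolution that precedes it. The sketch below shows the arithmetic on a single scalar weight, as a stand-in for a 1x1 convolution with one channel. It illustrates the general technique and is not taken from the model optimizer source. All names and numbers are assumptions chosen for the example.

```python
import math

def conv_then_bn(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Two separate steps: a (scalar) convolution followed by batch normalization."""
    y = weight * x + bias
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold_batch_norm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch norm statistics into a new weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    fused_weight = weight * scale
    fused_bias = (bias - mean) * scale + beta
    return fused_weight, fused_bias

params = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=0.04)
fused_w, fused_b = fold_batch_norm(**params)

x = 2.0
separate = conv_then_bn(x, **params)   # two operations
fused = fused_w * x + fused_b          # one operation, same result
print(separate, fused)
```

Because the batch normalization statistics are fixed at inference time, the two steps collapse into a single multiply and add, so one layer does the work of two.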

Supported Frameworks

The model optimizer supports several popular deep learning frameworks, including:

  • Caffe
  • TensorFlow
  • ONNX (includes PyTorch and Apple ML)

How a model is handled differs somewhat depending on the framework it comes from, but most of the work is the same for all models. I will talk about converting different models later in this post.

Intermediate Representation

So what is this intermediate representation that we have been talking about for so long and trying to convert different models into? It is a kind of common dialect for neural networks. What does that mean? Different frameworks use different names for the same type of operation. For example, ‘Conv2D’ in TensorFlow, ‘Convolution’ in Caffe and ‘Conv’ in ONNX are all called ‘Convolution’ in the common tongue used by the intermediate representation. If you are interested, the OpenVINO documentation has the list of all the different layer names. This renaming takes place when the optimization methods discussed in the earlier section are applied to a model.

The IR can be fed directly into the inference engine, which then uses the model to perform inference. The representation consists of two files, a *.xml file and a *.bin file. The *.xml file carries the architecture of the model and the *.bin file carries its weights and biases.
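Since the *.xml file is plain XML, the architecture can be inspected with ordinary tools. The sketch below parses a hand-written, heavily simplified stand-in for an IR file. The element and attribute names (net, layers, layer, name, type) are assumptions about the layout, and a real file generated by the model optimizer contains much more detail, so treat this only as an illustration of the idea.

```python
import xml.etree.ElementTree as ET

# Hand-written toy example, NOT a real model optimizer output.
SAMPLE_IR_XML = """
<net name="toy_model" version="10">
  <layers>
    <layer id="0" name="input" type="Parameter"/>
    <layer id="1" name="conv1" type="Convolution"/>
    <layer id="2" name="relu1" type="ReLU"/>
  </layers>
</net>
"""

def list_layer_types(xml_text):
    """Return (name, type) pairs for every <layer> element in the document."""
    root = ET.fromstring(xml_text.strip())
    return [(layer.get("name"), layer.get("type")) for layer in root.iter("layer")]

for name, layer_type in list_layer_types(SAMPLE_IR_XML):
    print(name, "->", layer_type)
```

Whatever the source framework called its convolution, it would show up here under the common name, while the actual numbers for that layer live in the separate *.bin file.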

Let’s Convert A Model!

We will start with an ONNX model, which is the simplest of all. Then we will see how to convert a Caffe model, and finally we will work with a TensorFlow model, which is a little more complex. (Make sure that you have configured your model optimizer by following steps 8 to 14 mentioned in part 0, and activate the virtual environment that we created.)


Converting ONNX Model

ONNX is a format used to exchange models between different frameworks. As PyTorch is not directly supported in OpenVINO, a PyTorch model can first be converted into the ONNX format, and the converted model can then be optimized into the IR format very easily with the model optimizer. There is a list of ONNX models in the OpenVINO documentation. Let’s try the first model, ‘bvlc_alexnet’.

  • Open the command prompt (Windows) or terminal (Linux, Mac) and change your current directory to the location where the “model.onnx” file is saved.
  • Run the following command.
python "C:\Program Files (x86)\IntelSWTools\openvino\deployment_tools\model_optimizer\" --input_model model.onnx

The “--input_model” argument indicates which model we want to convert. (The example command is run on Windows and uses the default installation directory of OpenVINO. If your installation location is different, use the appropriate path to the “” file, “<installation_directory>\openvino\deployment_tools\model_optimizer\”)

  • You will get a printed line telling you the location of the created *.xml and *.bin files, which are the expected IR files.
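If you convert models often, the same command can be launched from a Python script. The following is a minimal sketch. The helper names are made up for this post, and mo_script is a placeholder for the full path to the model optimizer script in your own installation. Only the “--input_model” argument comes from the command above.

```python
import subprocess
import sys

def build_mo_command(mo_script, input_model, extra_args=None):
    """Build the argument list for one model optimizer run."""
    command = [sys.executable, mo_script, "--input_model", input_model]
    command.extend(extra_args or [])
    return command

def convert(mo_script, input_model, extra_args=None):
    """Run the conversion and raise CalledProcessError if it fails."""
    subprocess.run(build_mo_command(mo_script, input_model, extra_args), check=True)

# Placeholder path: replace it with the location of the script on your machine.
example = build_mo_command("<path/to/model_optimizer_script>", "model.onnx")
print(example)
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as “C:\Program Files (x86)”.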

Converting Caffe Model

Converting a Caffe model to IR is also pretty simple. The difference from ONNX model optimization is that, for Caffe, the model optimizer can take some additional arguments that are specific to Caffe models. You can find details about them in the related documentation. I will keep things simple here.

Let’s first download an example model to work with. We will use this SqueezeNet model as an example. After downloading, change your current directory to the “SqueezeNet_v1.1” folder where the “squeezenet_v1.1.caffemodel” file is saved. Run the following command (this example is for Windows, so use the appropriate location and symbols for your case):

python "C:\Program Files (x86)\IntelSWTools\openvino\deployment_tools\model_optimizer\" --input_model squeezenet_v1.1.caffemodel --input_proto deploy.prototxt

We have used an additional argument, “--input_proto”. This argument is necessary if the *.prototxt file does not have the same name as the *.caffemodel file or if the two files are saved in different directories. If the *.caffemodel and *.prototxt files share the same name and directory, this argument isn’t necessary.

You will get an output informing you about the directory of newly created *.xml and *.bin files.
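The rule for when “--input_proto” is needed can be written as a small check. This is only a sketch of the rule as described above, and needs_input_proto is a hypothetical helper, not part of OpenVINO.

```python
from pathlib import Path

def needs_input_proto(caffemodel_path, prototxt_path):
    """Return True when the *.prototxt must be passed explicitly."""
    model, proto = Path(caffemodel_path), Path(prototxt_path)
    same_directory = model.parent == proto.parent
    same_name = model.stem == proto.stem  # file name without the extension
    return not (same_directory and same_name)

# Our SqueezeNet example: the names differ, so the argument is required.
print(needs_input_proto("squeezenet_v1.1.caffemodel", "deploy.prototxt"))
# Same name and same directory: the argument can be left out.
print(needs_input_proto("squeezenet_v1.1.caffemodel", "squeezenet_v1.1.prototxt"))
```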

Converting TensorFlow Model

The TF model zoo extends the range of pre-trained models available to you in OpenVINO even further. TF models come in frozen and non-frozen types, and both are available in the model zoo. The procedure changes slightly depending on whether you are working with a frozen or a non-frozen model. The dedicated documentation page for converting a TF model explains the different steps.

If you are working with a non-frozen model, first convert it into a frozen model. That is done in Python using TensorFlow. Here is the code to freeze a model.

import tensorflow as tf
from tensorflow.python.framework import graph_io

# Replace variables with constants, keeping only nodes the outputs depend on (TF 1.x API).
frozen = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ["name_of_the_output_node"])
graph_io.write_graph(frozen, '<where/to/save/>', '<name_of_the_generated_graph>.pb', as_text=False)

“sess” is the instance of the TensorFlow Session object where the network topology is defined. [“name_of_the_output_node”] is the list of output node names in the graph. The frozen graph will include only those nodes from the original “sess.graph_def” that are directly or indirectly used to compute the given output nodes.
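The idea of keeping only the nodes that the outputs depend on can be illustrated without TensorFlow. The toy sketch below walks a hand-made graph backwards from the output node. It is an illustration of the principle, not TensorFlow’s actual implementation, and the graph and node names are invented.

```python
def nodes_needed_for(graph, output_nodes):
    """Collect every node reachable backwards from the output nodes.

    graph maps a node name to the list of node names it takes as input.
    """
    needed = set()
    stack = list(output_nodes)
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(graph.get(node, []))
    return needed

# Invented toy graph: the last three nodes exist only for training.
toy_graph = {
    "input": [],
    "weights": [],
    "conv": ["input", "weights"],
    "softmax": ["conv"],
    "labels": [],
    "loss": ["softmax", "labels"],
    "gradients": ["loss"],
}

kept = nodes_needed_for(toy_graph, ["softmax"])
print(sorted(kept))
```

The “labels”, “loss” and “gradients” nodes are never reached from “softmax”, so they would be dropped, which matches the earlier description of freezing as removing training-only operations.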

Alternatively, you can use a non-frozen model directly in the model optimizer following the specific instructions in the documentation.

As an example, I will use a frozen graph from the model zoo. The model I will be working with is SSD MobileNet V2 COCO. Download the model if you want to follow the steps yourself.

First, change the current directory to the location where the “frozen_inference_graph.pb” file is saved.

Then, run the following command in the command line,

python "C:\Program Files (x86)\IntelSWTools\openvino\deployment_tools\model_optimizer\" --input_model frozen_inference_graph.pb --tensorflow_object_detection_api_pipeline_config pipeline.config --reverse_input_channels --tensorflow_use_custom_operations_config "C:\Program Files (x86)\IntelSWTools\openvino\deployment_tools\model_optimizer\extensions\front\tfssd_v2_support.json"

Let’s break it down.

As always, we are running the Python file “”, which takes several arguments. Some of the arguments used here are TF-specific.

  • --input_model: Takes the name of the model that we are converting. In our case it’s the frozen_inference_graph.pb file.
  • --tensorflow_object_detection_api_pipeline_config: Takes a pipeline configuration file.
  • --reverse_input_channels: TensorFlow models are trained in the RGB (Red Green Blue) color channel format, while OpenCV uses the BGR (Blue Green Red) format. This argument is necessary if we want to work with the BGR format.
  • --tensorflow_use_custom_operations_config: In some cases, the model optimizer might not be able to convert a model because it contains custom operations that the model optimizer does not recognize, or because the tensor layouts used by the model optimizer and the model do not match. In those cases, a *.json file is supplied to give the model optimizer hints about the custom layers in a form that it can recognize.
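To see what the RGB and BGR difference means for the data, here is a tiny sketch that reverses the channel order of an image stored as nested Python lists. It only shows the effect on pixel values. It does not show how the model optimizer implements “--reverse_input_channels”, and reverse_channels is a made-up helper.

```python
def reverse_channels(image):
    """Reverse the channel order of every pixel (RGB <-> BGR).

    image is a list of rows, and each row is a list of 3-value pixel tuples.
    """
    return [[tuple(reversed(pixel)) for pixel in row] for row in image]

# A 2x2 image in RGB order: pure red, pure green, pure blue and a mixed pixel.
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 20, 30)],
]

bgr_image = reverse_channels(rgb_image)
print(bgr_image)
```

If a model trained on RGB images receives BGR frames from OpenCV without this kind of swap, red and blue are exchanged and the predictions can degrade, which is why the argument matters.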

If everything went well, you will get an output stating the location of the newly created *.xml and *.bin files.

What’s Next

We now have a trained model in hand, either downloaded directly in the IR format from the OpenVINO model zoo or converted to the IR format using the model optimizer. The next step is to use this model in the inference engine to perform the actual inference and get results. In part 3, we will use the inference engine. So, let’s move to the next part.