Introduction to OpenVINO

Model Optimizer

Model Optimizer is a cross-platform command-line tool that facilitates the transition between the training and deployment environments. It adjusts deep learning models for optimal execution on end-point target devices.

Fig 2: Model Optimizer | Image Credits: Intel

Working

Fig 3: Working of Model Optimizer | Image Credits: Intel

Model Optimizer loads a model into memory, reads it, builds the internal representation of the model, optimizes it, and produces the Intermediate Representation (IR). The IR is the only format that the Inference Engine accepts and understands.

The Model Optimizer does not infer models. It is an offline tool that runs before the inference takes place.

Model Optimizer has two main purposes:

  • Produce a valid Intermediate Representation. The primary responsibility of the Model Optimizer is to produce two files that form the Intermediate Representation: an .xml file describing the network topology and a .bin file containing the weights and biases (a minimal conversion command is sketched after this list).
  • Produce an optimized Intermediate Representation. Pretrained models contain layers that are important for training, such as the Dropout layer. These layers are useless during inference and might increase the inference time. In many cases, these layers can be automatically removed from the resulting Intermediate Representation. Furthermore, if a group of layers can be represented as one mathematical operation, and thus as a single layer, the Model Optimizer recognizes such patterns and fuses them into a single layer. The result is an Intermediate Representation that has fewer layers than the original model, which decreases the inference time.
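For instance, a minimal conversion might look like the command below, where mo.py is the Model Optimizer script shipped with OpenVINO and model.onnx and the output directory are placeholders:

  python mo.py --input_model model.onnx --output_dir ir/

This would produce ir/model.xml and ir/model.bin, which together form the IR that the Inference Engine loads.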

Operations

1. Reshaping

  • The Model Optimizer allows us to reshape our input images. Suppose you have trained your model with an image size of 256 × 256 and you now want to feed it images of size 100 × 100; you can simply pass the new image size as a command-line argument and the Model Optimizer will handle the rest (see the sketch below).
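As a rough sketch, assuming a hypothetical ONNX model named model.onnx and the NCHW layout used later in this article:

  python mo.py --input_model model.onnx --input_shape [1,3,100,100]

The --input_shape argument makes the generated IR expect 100 × 100 inputs instead of the 256 × 256 inputs the model was trained with.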

2. Batching

  • We can change the batch size that the model will use at inference time by passing the desired batch size as a command-line argument.
  • We can also pass the full input shape, for example [4,3,100,100]. Here we are specifying a batch of 4 RGB images, each with 3 channels and a width and height of 100. An important thing to note is that each inference request will now take longer, since it processes a batch of 4 images rather than a single image (see the sketch below).
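As a sketch, again assuming the hypothetical model.onnx, either of the following should set a batch size of 4 in the generated IR:

  python mo.py --input_model model.onnx --batch 4
  python mo.py --input_model model.onnx --input_shape [4,3,100,100]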

3. Modifying the Network Structure

  • We can modify the structure of our network, i.e. we can remove layers from the top or from the bottom. We can specify a particular layer where we want the execution to begin or where we want it to end (see the sketch below).
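A hedged sketch of such model cutting, where new_input_node and new_output_node are purely hypothetical layer names:

  python mo.py --input_model model.onnx --input new_input_node --output new_output_node

Everything before new_input_node and after new_output_node would be left out of the generated IR.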

4. Standardizing and Scaling

  • We can perform operations like normalization (mean subtraction) and scaling (standardization) on our input data by passing the mean and scale values as command-line arguments (see the sketch below).
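For example, the per-channel means and standard deviations commonly used for ImageNet models could be passed as shown below; the exact values depend on how the model was trained, and model.onnx is again a placeholder:

  python mo.py --input_model model.onnx --mean_values [123.68,116.78,103.94] --scale_values [58.395,57.12,57.375]

The Model Optimizer embeds this pre-processing into the generated IR, so raw images can be fed to the Inference Engine directly.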

Quantization

Quantization is an important step in the optimization process. Most deep learning models generally use the FP32 format for their weights and input data. FP32 consumes a lot of memory and compute, and hence increases the inference time. So, intuitively, we might think we can reduce our inference time by changing the format of our data. There are other formats, such as FP16 and INT8, that we can use, but we need to be careful while performing quantization, as it can also result in a loss of accuracy.
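For instance, converting a model's weights to FP16 is a single extra flag on the same hypothetical command used above:

  python mo.py --input_model model.onnx --data_type FP16

INT8, by contrast, is not produced by this flag alone; it requires the calibration process described below.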

Using the INT8 format can help us reduce our inference time significantly, but currently only certain layers are compatible with the INT8 format: Convolution, ReLU, Pooling, Eltwise and Concat. So we essentially perform hybrid execution, where some layers run in FP32 while others run in INT8. A separate layer handles these conversions, i.e. we don't have to explicitly specify the type conversion from one layer to another.

The Calibrate layer handles all these intricate type conversions. It works as follows:

  • First, we define a threshold value. It determines how much of a drop in accuracy we are willing to accept.
  • The Calibrate layer then takes a subset of the data and tries to convert the data format of layers from FP32 to INT8 or FP16.
  • It then checks the accuracy drop, and if the drop is less than the specified threshold value, the conversion takes place.

Fig 4: Calibrate Layer | Image Credits: Intel