Efficient and High-Quality Object Detection on Raspberry Pi

Source: Deep Learning on Medium

Currently, a major limiting factor for running neural networks in edge environments such as the Raspberry Pi is the limited hardware. Networks that have many parameters or involve complex processing logic require a large amount of memory and processing power. To address this, optimization techniques such as pruning and quantization have emerged; however, they do not always preserve quality. So is it possible to run an efficient, good-quality network in a constrained edge environment? In this post we discuss ESPNetv2, a network designed to be lightweight by construction while maintaining quality levels equal to (or even higher than) state-of-the-art deep learning networks such as YOLOv2 and MobileNetv1.

At the time of writing, there is another limitation to running neural networks on the Raspberry Pi: the compatibility of deep learning frameworks with the Pi's architecture. ESPNetv2 is built entirely on PyTorch, which is not yet available for the armv7l architecture. To work around this, we used cross-compilation to generate pre-compiled PyTorch and Torchvision .whl files for armv7l. If you have time, these libraries can also be compiled manually; however, some compilation runs can take days (and fail due to lack of memory). Another library that does not have an armv7l version yet is OpenCV, which is also required to run ESPNetv2. For it we have prepared an installation script with the necessary steps, although the script may take a while depending on the environment: in our tests on a Raspberry Pi 3 Model B+, the process took up to 7 hours. A complete tutorial on how to easily install PyTorch, Torchvision and OpenCV on the Raspberry Pi can be found here. With these files, you will be able to run not only ESPNetv2 but also most other networks that use PyTorch (and are small enough to run in this environment).
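Once the installation finishes, it can help to confirm that the libraries are actually visible to Python before trying the network. The following is a minimal sketch and not part of the original tutorial: it uses only the Python standard library, and the helper name `check_environment` is hypothetical. It assumes the packages are importable under their usual module names (`torch`, `torchvision`, `cv2`).

```python
import importlib.util
import platform


def check_environment(modules=("torch", "torchvision", "cv2")):
    """Report the CPU architecture and which required modules can be found."""
    # platform.machine() typically reports "armv7l" on a 32-bit Raspberry Pi OS
    report = {"machine": platform.machine()}
    for name in modules:
        # find_spec returns None when a top-level module is not installed;
        # it locates the module without importing it
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A `False` next to any module name suggests the corresponding wheel or build step did not complete.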

Below you can see some results for ESPNetv2's Mean Average Precision (mAP), the quality metric used to measure the accuracy of an object detection model, i.e. the higher the better, together with its FLOPs, the number of floating-point operations (mainly multiplications and additions) the network needs to process a single input, which measures computational cost, i.e. the lower the better. This table was taken from the official article of the network.

ESPNetv2 builds on several well-known techniques for optimizing convolutional operations; its main contribution is to use them together in order to get the best of each one. Operations such as depth-wise separable convolution, dilated convolution, and group convolution have emerged as options for those who want to make convolutions cheaper. Below is a table taken from the ESPNetv2 article itself that compares these convolutions.

As you can see from the table, the most notable convolution is the depth-wise separable dilated convolution, which is precisely the convolutional operation proposed in the ESPNetv2 architecture. It uses the separation method of depth-wise separable convolution to reduce the number of parameters, and at the same time uses the kernel expansion method of dilated convolution, yielding a convolution that covers a large number of features with fewer parameters, i.e. the best of both worlds.
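To make the savings concrete, here is a small sketch that counts weights using the standard textbook formulas for these convolutions (biases ignored). The function names and the example sizes (a 3x3 kernel, 64 input and 64 output channels, dilation rate 2) are illustrative assumptions, not values taken from the ESPNetv2 article.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """One k x k filter per input channel, then a 1 x 1 point-wise mix."""
    return k * k * c_in + c_in * c_out


def effective_kernel_size(k, dilation):
    """Side length of the area covered by a dilated k x k kernel."""
    return (k - 1) * dilation + 1


if __name__ == "__main__":
    k, c_in, c_out, dilation = 3, 64, 64, 2  # illustrative values only
    print(standard_conv_params(k, c_in, c_out))        # 36864
    print(depthwise_separable_params(k, c_in, c_out))  # 4672
    print(effective_kernel_size(k, dilation))          # 5
```

Dilation does not add any weights: the 3x3 kernel still has nine values per channel but covers a 5x5 area, which is why combining it with the depth-wise separable factorization widens the coverage while keeping the parameter count low.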

Depth-wise separable dilated convolution is inherited from other networks proposed by the same author. ESPNetv2 additionally introduces optimizations based on group convolution, which, in short, groups similar convolutions so that a separate kernel does not have to be instantiated for each one; a single grouped kernel is used for all of them. This grouping reduces the number of parameters in the network. Check below the evolution of this architecture.
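The parameter reduction from grouping can be sketched with the general definition of group convolution, the same one exposed through the `groups` argument of PyTorch's `nn.Conv2d`. The function name and the channel sizes below are illustrative assumptions and do not reproduce the exact configuration used inside ESPNetv2.

```python
def grouped_conv_params(k, c_in, c_out, groups):
    """Weights in a k x k convolution split into independent groups (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group maps c_in/groups input channels to c_out/groups output channels
    return k * k * (c_in // groups) * (c_out // groups) * groups


if __name__ == "__main__":
    # 1 x 1 (point-wise) convolution with 64 input and 64 output channels
    print(grouped_conv_params(1, 64, 64, groups=1))  # 4096
    print(grouped_conv_params(1, 64, 64, groups=4))  # 1024
```

With `groups=1` the formula reduces to a standard convolution; with g groups the weight count drops by a factor of g.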

With that explained, let's get to the main point of this post: how do you run this network on a Raspberry Pi? Assuming you have followed the tutorial on installing the dependencies needed to run ESPNetv2 on the Raspberry Pi (and were able to install them), you just need to go to the official network repository on GitHub, download it, and run the demo file for the object detection task. At the time of this post's release, however, ESPNetv2 ran into segmentation faults when making inferences on the Raspberry Pi. Fortunately, we debugged the network and were able to fix the problem. We made a pull request to the repository with the fix, which has already been merged into the ESPNetv2 code.

To perform object detection on the Raspberry Pi with ESPNetv2, simply download its repository and run the following command:

$ python3 detection_demo.py

Without parameters, this command runs inference on the images in the sample_images folder and saves the results in the vis_detect folder. You can pass several parameters to this script, such as --im-dir to select your own image folder and --im-size to select the pre-trained model for 300px or 512px images, among others. We strongly suggest opening the detection_demo.py script in your preferred code editor and examining which options are available.
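As an illustration of passing those parameters, the sketch below assembles the command line from Python. It assumes the two options are spelled `--im-dir` and `--im-size` as described above; the helper `build_demo_command` and the folder name `my_images` are hypothetical, and the script itself remains the authority on which options and values are accepted.

```python
def build_demo_command(im_dir=None, im_size=None, script="detection_demo.py"):
    """Assemble the argument list for the detection demo script."""
    command = ["python3", script]
    if im_dir is not None:
        command += ["--im-dir", str(im_dir)]    # folder with your own images
    if im_size is not None:
        command += ["--im-size", str(im_size)]  # 300 or 512, per the text above
    return command


if __name__ == "__main__":
    # "my_images" is a made-up folder name used only for illustration
    print(" ".join(build_demo_command(im_dir="my_images", im_size=512)))
```

The printed line can be pasted into a terminal inside the downloaded repository, or the returned list can be handed to `subprocess.run`.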

Below is a comparison of inference on the same image with the SSD-Mobilenet v1 and ESPNetv2 models.

SSD-Mobilenet v1 with 0.4 threshold
ESPNetv2 with 0.4 threshold

ESPNetv2's superior detection quality is noticeable: in addition to providing more accurate detections, it also detects objects at greater depth in the scene.

And that's it! You have just run an extremely efficient network with high detection accuracy, all in a constrained Raspberry Pi environment!