Source: Deep Learning on Medium
This post stems from a seminar held at the University of Bologna for the students of the Machine Learning course. It is also my first post on Medium 🎉, so I do hope you will enjoy it. The aim is to provide a general overview of what Edge AI is and its applications, with a focus on Object Detection.
Edge AI, or simply Artificial Intelligence on the Edge, is a term coined by Intel in an article in which they advocate a change in how Machine Learning services are used. Today, those services are mostly exploited as Cloud APIs, while what we would like to do is run inference directly on the device, mainly for the following reasons:
- Connectivity: there might be places where the network is not available.
- Latency: sending data to a server and getting results back is not suitable for real-time applications.
- Costs: running inference in the cloud is not free.
- Privacy: this data is personal and we want to make sure no one else can access it.
Edge AI is required when a smart system needs to make a decision in the same physical location where the data are generated and, of course, this decision must be made immediately. At this point the reader might be wondering what kind of practical applications can be developed using this paradigm. To give a glimpse:
- Automatically authorize access to your home by face identification. A smart home could unlock the doors for maids, babysitters, or pet sitters.
- Drones used in war zones that could autonomously distinguish between a child and an adult. In this case, a computer vision algorithm could save hundreds of innocent children.
- A census of wild animals living in a remote area. Smart cameras can be strategically placed and they could identify animals without human operators.
- A smart traffic light could maximize traffic efficiency by counting how many vehicles are on the road instead of using a predefined time-based schedule.
In order to deal with these systems, we must adopt the following workflow:
- Choosing a Deep Learning Framework to develop the model locally.
- Training the model.
- Monitoring performance.
- Exporting the model once performance converges. This step is particularly relevant because the exported model represents the ‘intelligence’ we want to instill in our system. The exported model, also called a frozen model, contains both the architecture and the weights tuned to obtain certain results.
- Deploying the model on the smart device. At this point we are ready to run inference locally; no cloud is needed anymore.
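The “frozen model” idea, bundling the architecture and the trained weights into a single artifact that the device only reads, can be illustrated with a toy sketch: plain JSON and a one-layer linear model standing in for a real TensorFlow frozen graph.

```python
import json

# A toy stand-in for a trained model: a single linear layer y = w·x + b.
# A real frozen graph (e.g. a TensorFlow .pb file) bundles the same two
# things: the architecture description and the trained weights.
def export_frozen_model(weights, bias, path):
    """Serialize architecture and weights together in one file."""
    frozen = {"architecture": "linear", "weights": weights, "bias": bias}
    with open(path, "w") as f:
        json.dump(frozen, f)

def load_and_infer(path, x):
    """On-device step: load the frozen model and run inference locally."""
    with open(path) as f:
        frozen = json.load(f)
    assert frozen["architecture"] == "linear"
    return sum(w * xi for w, xi in zip(frozen["weights"], x)) + frozen["bias"]

export_frozen_model([2.0, -1.0], 0.5, "frozen_model.json")
print(load_and_infer("frozen_model.json", [3.0, 1.0]))  # 2*3 - 1 + 0.5 = 5.5
```

Note that `load_and_infer` needs no training code and no network connection: this is exactly what the deployment step on the edge device looks like.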
From Image Classification to Object Detection
Before showing two case studies, it is necessary to provide some basic knowledge about Deep Learning and Deep Architecture for Computer Vision.
Convolutional Neural Networks in a nutshell
Convolutional Neural Networks (CNNs) or ConvNets are the base model for every computer vision task solved using Deep Learning. This model was introduced by Yann LeCun in the late 1980s, but only recently has it come back into vogue thanks to the availability of huge amounts of data and recent advances in GPU computing. CNNs were specifically invented for image processing. The magic behind this model lies in the local connections and shared weights among artificial neurons of different layers. This way, the network can efficiently process high-resolution images.
By combining Convolution and Pooling operations at each layer, a hierarchical representation of the image is created, such that layers at the beginning are capable of detecting features and small details while layers at the end are able to see the whole image. The last layer of the network is fully connected, and it is where the classification phase happens.
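To make the Convolution and Pooling operations concrete, here is a minimal pure-Python sketch (no framework) of a ‘valid’ convolution followed by 2×2 max pooling; real layers do the same thing over many channels in parallel:

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D convolution (strictly, cross-correlation, as in most DL
    frameworks): slide the kernel over the image with no padding."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool2x2(fmap):
    """2x2 max pooling with stride 2: halves each spatial dimension."""
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# A 6x6 image convolved with a 3x3 kernel gives a 4x4 feature map;
# 2x2 pooling then shrinks it to 2x2.
image = [[1, 1, 1, 0, 0, 0]] * 6
kernel = [[1, 0, -1]] * 3          # simple vertical-edge detector
fmap = conv2d_valid(image, kernel)
pooled = max_pool2x2(fmap)
print(len(fmap), len(fmap[0]))      # 4 4
print(len(pooled), len(pooled[0]))  # 2 2
```

The feature map lights up exactly where the vertical edge sits in the input, which is the ‘small details first’ behaviour described above.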
So, given an image with a cat, a CNN can tell us: “Hey, there is a cat!”. But what if the image contains several animals and, for each of them, we also want to find their location in the picture? This is exactly what Object Detection means: we want a predicted label plus a tuple that represents the location: <x, y, height, width>. Mathematically speaking, detection is a matter of classification and regression. In fact, the aim of the neural network is to find a function that maps the input image to a tuple made of 4 numbers.
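A predicted <x, y, height, width> tuple is usually compared against the ground-truth box with Intersection over Union (IoU), the standard metric for evaluating detectors. A small sketch, assuming (x, y) is the top-left corner of the box:

```python
def iou(box_a, box_b):
    """Intersection over Union between two boxes given as
    (x, y, height, width), with (x, y) the top-left corner."""
    xa, ya, ha, wa = box_a
    xb, yb, hb, wb = box_b
    # Overlap along each axis (zero if the boxes do not intersect)
    inter_w = max(0, min(xa + wa, xb + wb) - max(xa, xb))
    inter_h = max(0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = inter_w * inter_h
    union = ha * wa + hb * wb - inter
    return inter / union if union else 0.0

# Two 10x10 boxes sharing half their area
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))  # 50 / 150 = 0.333...
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds some threshold, often 0.5.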
There are several Deep Architectures for Object Detection. I will not go into the details, but if you are interested they are explained in this article.
However, we can say that a general model for Object Detection exploits the Selective Search algorithm to find Regions of Interest (regions where there might potentially be objects). Then, these Regions of Interest are processed by a CNN whose last layer is replaced by a classifier (e.g. an SVM) and a regressor (e.g. a bounding box regressor). The output is a label with the predicted class plus the bounding boxes.
In the next part of the article we will focus on two embedded devices: the first one is the new Google AIY Vision kit, a smart camera for developers. The second one is a sort of autonomous vehicle made by Audi.
Google AIY Vision Kit
The first case study we consider is based on the Google AIY Vision Kit. This device is currently sold only in the US, but we hope it will soon be available in Europe. It consists of a Raspberry Pi connected to a Pi Camera. What makes this kit so special is the presence of the new compact accelerator board called the ‘Vision Bonnet’. This board is equipped with the Movidius Myriad 2 MA2450 chip, a Vision Processing Unit designed by Intel and intended for machine vision in low-power environments. The Vision Bonnet allows the kit to run real-time Deep Neural Networks directly on the device, rather than in the cloud.
The VPU provides hardware acceleration for running neural network graphs at low power. On top of the hardware acceleration, the inference engine has been written from scratch by Google to enhance runtime performance.
The Vision Bonnet reads data directly from the Pi Camera through the flex cable, processes it and passes the results to the Raspberry Pi. This way, while the code is running, the process has complete access to the camera, and the whole processing phase adds no overhead to the Raspberry Pi, which is equipped with just a 1 GHz single-core ARM processor.
The Google AIY Vision Kit supports TensorFlow as its Machine Learning framework. For real-time Object Detection it can only be used with embedded_ssd_mobilenet, while Image Classification and offline Object Detection can also be achieved using MobileNets or SqueezeNets.
MobileNets are a special kind of deep architecture specifically designed for embedded systems. Developers can easily trade off processing speed against accuracy by setting two global hyper-parameters, while the architecture remains mindful of the restricted resources of on-device and embedded applications. Another key feature is the depth-wise separable convolution, which consists in splitting the traditional convolution into a separate layer for filtering and another layer for combining. This mechanism dramatically reduces computation and model size. MobileNets can be built upon for classification, detection and image segmentation.
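The computational saving of the depth-wise separable convolution can be checked with a quick back-of-envelope calculation, following the cost model from the MobileNet paper (the layer sizes below are purely illustrative):

```python
def conv_cost(k, m, n, f):
    """Multiply-adds of a standard k x k convolution with m input
    channels and n output channels on an f x f feature map."""
    return k * k * m * n * f * f

def separable_cost(k, m, n, f):
    """Depth-wise filtering (k*k*m*f*f) plus 1x1 point-wise
    combining (m*n*f*f): the two-layer split described above."""
    return k * k * m * f * f + m * n * f * f

# An illustrative mid-network layer: 3x3 kernel, 256 -> 256 channels,
# 14x14 feature map
std = conv_cost(3, 256, 256, 14)
sep = separable_cost(3, 256, 256, 14)
print(sep / std)  # = 1/n + 1/k^2 = 1/256 + 1/9, roughly 8-9x cheaper
```

The ratio depends only on the number of output channels and the kernel size, which is why the saving holds across the whole network.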
Pikachu Detector 🤓
We would like to develop a custom model capable of detecting anything we want in real time. I chose to create a ‘Pikachu Detector’ since I had this small puppet in my room: it makes the testing phase easier, and images can be downloaded from the web in just a few seconds. I will not report all the details, but I will try to give you a small overview of how to reproduce this process.
We need to install the TensorFlow Object Detection APIs and then manually label about a hundred images taken from the web. The core of this process is a configuration file that defines both the architecture of the neural network and the training pipeline. The quickest way to obtain results is to apply Transfer Learning: instead of training a neural network from scratch, we take a pre-trained model and retrain only the last few layers. Using TensorBoard we can monitor performance and, following the Edge AI pipeline explained before, export the trained model as a frozen graph. The frozen graph must be compiled, and then it is ready to be executed on the Google AIY Vision Kit. The last part requires a bit of coding and, as a starting point, I found this tutorial written by Chad Hart from Cogint very helpful.
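The Transfer Learning step boils down to freezing the pre-trained layers and retraining only the last few. A framework-agnostic sketch of the idea (the layer names are purely illustrative, not the actual SSD-MobileNet graph):

```python
# Each layer carries a flag saying whether training may update its
# weights. Layer names here are illustrative only.
layers = [
    {"name": "conv1",      "trainable": True},
    {"name": "conv2",      "trainable": True},
    {"name": "conv3",      "trainable": True},
    {"name": "box_head",   "trainable": True},
    {"name": "class_head", "trainable": True},
]

def freeze_backbone(layers, retrain_last=2):
    """Keep the pre-trained weights fixed everywhere except the last
    few layers, which are retrained on the new (Pikachu) dataset."""
    for layer in layers[:-retrain_last]:
        layer["trainable"] = False
    return layers

freeze_backbone(layers)
print([l["name"] for l in layers if l["trainable"]])
# ['box_head', 'class_head']
```

In TensorFlow the same effect is obtained through the pipeline configuration file rather than by flipping flags by hand, but the principle is identical: the backbone keeps the generic features it learned on a large dataset, and only the detection head adapts to the new classes.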
We can then run our custom model, capable of detecting Pikachu, on the Google AIY Vision Kit. Here are some visual results:
The project is available on my Github repository.
Audi Autonomous Driving Cup 2018
Every year the famous car manufacturer Audi organizes the Audi Autonomous Driving Cup to test new technologies in the automotive field. This is the first year the competition is open to teams from outside Germany. After a careful selection based on projects submitted by European universities, only 10 teams were chosen for the finals. Among them is a team from the University of Bologna.
The competition involves several challenges, but most of them require object detection for solving tasks such as avoiding pedestrians, recognizing road signs, detecting zebra crossings, allowing emergency vehicles to pass.
The car 🚘 🚙
The hardware platform, developed by Audi specifically for the contest, is a 1:8 scale replica of the Audi Q2. For the Audi Autonomous Driving Cup 2018, the vehicles were equipped with a Mini-ITX board featuring:
• an Intel Core i3 processor
• 8GB RAM
• a fast 128 GB M.2 SSD hard drive
• an NVIDIA GeForce GTX 1050 Ti graphics card
In addition to two Gigabit Ethernet ports, the board also has several USB 3.0 interfaces and a USB-C port. Furthermore, a Bluetooth and a WLAN module (IEEE 802.11ac) are available. The sensor set of the 2018 AADC model car comes closer to that of a real car. For development, ADTF (Automotive Data and Time-triggered Framework), a tested industry-standard environment, is installed. A developer license for the software is available on every vehicle computer, so convenient development directly on the vehicle is possible. At the front, the car is equipped with a double camera. The first one is the Intel RealSense R200, which can stream at 640×480 or 1920×1080 resolution, at 15, 30 or 60 FPS. Especially for road sign and lane detection, the car is equipped with an additional front camera with a 1280×960 resolution at 45 FPS. Both cameras are good enough for real-time scenarios.
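As a back-of-envelope check of why on-board processing matters, we can compute the raw data rate of these camera streams, assuming uncompressed RGB at 3 bytes per pixel (real streams are typically compressed, so these are upper bounds):

```python
def raw_stream_mb_per_s(width, height, fps, bytes_per_pixel=3):
    """Uncompressed data rate of a camera stream in megabytes per
    second, assuming RGB with 3 bytes per pixel."""
    return width * height * bytes_per_pixel * fps / 1e6

# The two front cameras at representative settings
print(raw_stream_mb_per_s(1920, 1080, 30))  # 186.624 MB/s
print(raw_stream_mb_per_s(1280, 960, 45))   # 165.888 MB/s
```

Hundreds of megabytes per second is far too much to ship to a cloud API with any useful latency, which is exactly the Edge AI argument from the beginning of the post.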
🤖 🧠 AI-related driving tasks
Each team is called upon to solve different kinds of tasks. Among these, there is a category called ‘Artificial Intelligence driving tasks’, which includes:
Adult versus child: the vehicle must be able to distinguish between adults 🙋♀️ and children 👶. If a child is detected, the speed shall be reduced and this shall be indicated by the brake lights. For adults, no actions are required.
Yielding to Emergency Vehicles: the car must identify normal cars 🚗 and emergency vehicles 🚔 as two separate classes.
🧙🏻♂️ 💍 One Neural Network to rule them all
The graphics card mounted on the car is fully compatible with SSD_Mobilenet, so we decided to create a single model to detect adults, children, emergency vehicles and normal vehicles. The dataset was created by filming every object and then extracting frames from each video. In total, the dataset consists of 3960 images: 660 used as the test set and 3300 as the training set.
In order to avoid biases, each object had almost the same number of images in the whole dataset. Audi already provided dolls for simulating adults and children. The mini car was considered a normal vehicle, while we decided to consider as emergency vehicles only cars with sirens on them. This means that our model only needs to identify flashing sirens.
Results are brilliant. The network is able to correctly detect adults, children, normal vehicles and even emergency vehicles. Check it out:
Finally, a video that shows the car slowing down as soon as a child is detected on the right side of the road:
Edge AI is a core technology that will play a central role in the next generation of intelligent devices. On one side, companies such as Intel and Nvidia are investing heavily in engineering chipsets designed for these specific tasks (e.g. VPUs, TPUs). On the other, we still need deep architectures that take full advantage of the aforementioned processors. In this direction, we must choose models that are fast enough for real-time applications, even if this means losing a bit of accuracy. Some examples are MobileNets, SqueezeNets and YOLO.
This post is a short extract from my master thesis, which is available at this link. Feel free to read it and even re-post it, but please do not forget to cite the source.