Edge AI

Cloud-based architecture

Here’s what a cloud-based setup looks like; it involves the steps detailed below:

Cloud-only architecture for inference (image references at end).

Step 1: Request with input image

There are two possible options here:

  • We can send the raw image (RGB or YUV) from the edge device as it’s captured from the camera. Raw images are always bigger and take longer to send to the cloud.
  • We can encode the raw image to JPEG/PNG or some other lossy format before sending, and decode it back to a raw image on the cloud before running inference. This approach involves an additional step to decode the compressed image, as most deep learning models are trained on raw images (a minimal sketch of this decode step follows the list). We will cover more ground on different raw image formats in future articles in this series.
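For reference, the extra decode step the second option would need on the cloud side could look something like this minimal sketch. It assumes @tensorflow/tfjs-node (whose tf.node.decodeImage handles JPEG/PNG buffers); the function name is illustrative, not code from this article.

```javascript
// Hypothetical decode step for option 2: the cloud receives compressed bytes and
// turns them back into the raw [height, width, 3] RGB tensor the model expects.
const tf = require('@tensorflow/tfjs-node');

function decodeCompressedImage(jpegOrPngBuffer) {
  // decodeImage infers the format (JPEG/PNG/BMP/GIF) from the buffer contents
  // and returns an int32 tensor with the requested number of channels (3 = RGB).
  return tf.node.decodeImage(jpegOrPngBuffer, 3);
}
```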

To keep the setup simple, the first approach (raw RGB image) is used. HTTP is used as the communication protocol to POST the image to a REST endpoint (http://<ip-address>:<port>/detect).
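For a sense of what the edge side of this looks like, here is a minimal client sketch (not the article's code). It assumes Node 18+ for the global fetch/FormData/Blob, a form field named image, and port 3000, all of which are illustrative.

```javascript
// Hypothetical edge-side client: read a raw RGB frame from disk, base64-encode it
// (as the benchmark below does) and POST it as a multipart file to /detect.
const fs = require('fs');

async function detect(framePath, host, port) {
  const raw = fs.readFileSync(framePath);      // raw RGB bytes captured from the camera
  const b64 = raw.toString('base64');          // base64-encode before sending

  const form = new FormData();                 // global in Node 18+
  form.append('image', new Blob([b64]), 'frame.rgb');

  const started = Date.now();
  const res = await fetch(`http://${host}:${port}/detect`, { method: 'POST', body: form });
  const predictions = await res.json();
  console.log(`total latency: ${Date.now() - started} ms`, predictions);
  return predictions;
}

detect('frame.rgb', '<ip-address>', 3000).catch(console.error);
```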

Step 2: Run inference in the cloud

  • TensorFlow.js is used to run inference on an EC2 (t2.micro) instance; only a single Node.js worker instance is used (no load balancing, no failover, etc.).
  • The MobileNet version used is hosted here.
  • Apache Bench (ab) is used to collect latency numbers for the HTTP requests. To use ab, the RGB image is base64-encoded and POSTed to the endpoint; express-fileupload is used to handle the POSTed image. A minimal sketch of the serving code follows this list.
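To make the setup concrete, here is a minimal sketch of what that serving code could look like. It is not the article's actual implementation: the field name image, the port 3000, the 224x224x3 input shape, and the @tensorflow-models/mobilenet classification wrapper (standing in for the hosted MobileNet) are all assumptions.

```javascript
// Minimal /detect endpoint: express + express-fileupload + tfjs-node, with the
// MobileNet wrapper standing in for the article's hosted model (assumption).
const express = require('express');
const fileUpload = require('express-fileupload');
const tf = require('@tensorflow/tfjs-node');
const mobilenet = require('@tensorflow-models/mobilenet');

const app = express();
app.use(fileUpload());                      // exposes uploaded files on req.files

let model;                                  // loaded once at startup, reused per request

app.post('/detect', async (req, res) => {
  // The benchmark POSTs a base64-encoded raw RGB image as a multipart file field.
  const raw = Buffer.from(req.files.image.data.toString('ascii'), 'base64');

  // Wrap the raw bytes in a [height, width, channels] tensor; 224x224x3 is assumed here.
  const input = tf.tensor3d(new Uint8Array(raw), [224, 224, 3], 'int32');

  const started = Date.now();
  const predictions = await model.classify(input);
  input.dispose();

  // Inference time alone, useful for splitting out the network cost in the
  // latency breakdown below.
  console.log(`inference: ${Date.now() - started} ms`);
  res.json(predictions);
});

mobilenet.load().then((m) => {
  model = m;
  app.listen(3000, () => console.log('inference server listening on :3000'));
});
```

Loading the model once at startup and reusing it across requests keeps the per-request cost down to the inference itself plus the HTTP round trip.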

Total latency (RGB) = HTTP request + inference time + HTTP response
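The multipart body that ab POSTs (post_data.txt in the command below) can be generated ahead of time. A hypothetical helper, using the same boundary as the ab command and the assumed field name image, might look like this:

```javascript
// Hypothetical helper to write post_data.txt for ab: a multipart/form-data body
// carrying the base64-encoded raw RGB frame, with the boundary used below.
const fs = require('fs');

const boundary = '1234567890';
const b64 = fs.readFileSync('frame.rgb').toString('base64');

const body =
  `--${boundary}\r\n` +
  'Content-Disposition: form-data; name="image"; filename="frame.rgb"\r\n' +
  'Content-Type: application/octet-stream\r\n\r\n' +
  `${b64}\r\n` +
  `--${boundary}--\r\n`;

fs.writeFileSync('post_data.txt', body);
```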

```
ab -k -c 1 -n 250 -g out_aws.tsv -p post_data.txt -T "multipart/form-data; boundary=1234567890" http://<ip-address>:<port>/detect

This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking <ip-address> (be patient)
Completed 100 requests
Completed 200 requests
Finished 250 requests

Server Software:
Server Hostname:        <ip-address>
Server Port:            <port>

Document Path:          /detect
Document Length:        22610 bytes

Concurrency Level:      1
Time taken for tests:   170.875 seconds
Complete requests:      250
Failed requests:        0
Keep-Alive requests:    250
Total transferred:      5705000 bytes
Total body sent:        50267500
HTML transferred:       5652500 bytes
Requests per second:    1.46 [#/sec] (mean)
Time per request:       683.499 [ms] (mean)
Time per request:       683.499 [ms] (mean, across all concurrent requests)
Transfer rate:          32.60 [Kbytes/sec] received
                        287.28 kb/s sent
                        319.89 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   5.0      0      79
Processing:   530  683 258.2    606    2751
Waiting:      437  513 212.9    448    2512
Total:        530  683 260.7    606    2771

Percentage of the requests served within a certain time (ms)
  50%    606
  66%    614
  75%    638
  80%    678
  90%    812
  95%   1084
  98%   1625
  99%   1720
 100%   2771 (longest request)
```
Histogram of end-to-end inference latencies for the cloud-based architecture (bucket size of 1 s). It shows the inference latencies for requests generated by Apache Bench (ab) in a given second.
End-to-end inference latencies for the cloud-based architecture, sorted by response time (ms). This article explains the difference between the two plots.

As we can see here, the 95th percentile request latency is around 1084 ms.