Original article was published on Artificial Intelligence on Medium
Object detection algorithms are used to locate persons within the image frame. Faster R-CNN, SSD (Single Shot MultiBox Detector), and YOLO (You Only Look Once) are three of the most common architectures for object detection problems. Although each has its own advantages, this project uses SSD, which is less computationally expensive and the fastest to run. SSD was the only one that ran flawlessly and smoothly on a CPU in near real time. Among the many CNN architectures, the one used here is MobileNet, developed by Google, which stands out for being very light and fast.
COCO is a publicly available dataset with 80 object classes, and the MobileNet SSD used here was trained on it. Since we needed only one class from that dataset, the ‘person’ class, all other classes were filtered out. A MobileNet SSD pre-trained on COCO was used instead of training from scratch; the weights and architecture were imported through transfer learning. The implementation was done in Python because of its easy-to-use libraries such as TensorFlow, Keras, and OpenCV.
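The class filtering described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Detection` tuple, the `keep_persons` helper, and the 0.5 confidence cut-off are hypothetical, and the person class ID of 1 is an assumption based on the label map commonly shipped with TensorFlow COCO models, so check it against the label map of the model you load.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """One detector output (hypothetical container for illustration)."""
    class_id: int
    score: float
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


PERSON_CLASS_ID = 1  # assumption: 'person' ID in the model's COCO label map
MIN_SCORE = 0.5      # assumption: confidence cut-off, tune per deployment


def keep_persons(detections: List[Detection],
                 person_id: int = PERSON_CLASS_ID,
                 min_score: float = MIN_SCORE) -> List[Detection]:
    """Keep only confident 'person' detections and drop every other class."""
    return [d for d in detections
            if d.class_id == person_id and d.score >= min_score]
```

Filtering after inference like this leaves the pre-trained network untouched, which is why no retraining is needed to work with a single class.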
Once the persons are detected in the frame, the distance from each person to every other person is computed as the Euclidean distance. A threshold is set, and every person whose distance to another person is below the threshold is highlighted with a red bounding box. The total number of violators is displayed on the screen. The threshold, or minimum distance, is 1 metre, as recommended by the WHO.
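The pairwise check can be sketched in a few lines. This is an illustrative version under stated assumptions: `find_violators` is a hypothetical helper, and the input points are assumed to be person positions already expressed in metres.

```python
import math
from itertools import combinations
from typing import Sequence, Set


def find_violators(points: Sequence[Sequence[float]],
                   min_distance: float = 1.0) -> Set[int]:
    """Return the indices of persons closer than min_distance to anyone else.

    points: one coordinate tuple per person, in metres (assumption).
    min_distance: threshold in metres; 1.0 follows the article's WHO figure.
    """
    violators: Set[int] = set()
    # Compare every unordered pair of persons exactly once.
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if math.dist(p, q) < min_distance:
            violators.update((i, j))
    return violators
```

The size of the returned set is the violator count shown on screen, and its indices identify which bounding boxes to draw in red.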
The best approach to measuring distance would be to combine the vision part with depth sensors such as LiDAR, radar, or sonar, but since this is a purely software-based project, that option is out of scope. Camera calibration is required whenever the system is deployed, and the calibration will vary depending on the camera. For depth, a pinhole camera model and a similar-triangles approach were used to estimate the depth of each person from the camera, which resulted in more accurate distance calculations between persons.
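As a rough illustration of the similar-triangles idea, the standard pinhole relation says that an object of real height H at depth Z appears with pixel height h = f * H / Z, so Z = f * H / h. The sketch below applies that relation; it is not necessarily the exact formula used in this project. The helper name `estimate_depth` and the default person height of 1.7 m are assumptions, and the focal length in pixels has to come from calibration or be assumed.

```python
def estimate_depth(box_height_px: float,
                   focal_length_px: float,
                   real_height_m: float = 1.7) -> float:
    """Estimate the depth of a person from the camera, in metres.

    box_height_px: height of the person's bounding box in pixels.
    focal_length_px: focal length in pixels (calibrated or assumed).
    real_height_m: assumed real-world height of a person (assumption: 1.7 m).
    """
    if box_height_px <= 0:
        raise ValueError("bounding-box height must be positive")
    # Similar triangles: real_height / depth == box_height / focal_length.
    return focal_length_px * real_height_m / box_height_px
```

Because a fixed person height is assumed, the estimate degrades for children, seated people, or partially occluded boxes, which is one reason calibration per camera matters.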
Figure 2 shows the diagram underlying the similar-triangles approach, and Figure 3 gives the equations we used to find the depth of each person from the camera. We had to assume one value, the focal length, because physical calibration was not possible: the source of the video did not state the camera specifications or other external values. When the system is applied in our own environment, we can measure reference values and use them for calibration. The depth of each person was used to obtain that person's 3-D coordinates, and the Euclidean distance between those 3-D coordinates gave the distance between each pair of detected persons in the image. This added step produced better depth estimates and consequently more accurate distances between persons.
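To make the last step concrete, here is a minimal sketch of turning a pixel position plus a depth estimate into 3-D camera coordinates and measuring the distance between two persons. It uses the standard pinhole back-projection, which may differ in detail from the equations in Figure 3. The function names are hypothetical, and placing the principal point at the image centre is an assumption that a real calibration would replace.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]


def to_camera_coords(u: float, v: float, depth_m: float,
                     focal_length_px: float,
                     image_size: Tuple[int, int]) -> Point3D:
    """Back-project pixel (u, v) at a known depth to (X, Y, Z) in metres."""
    width, height = image_size
    # Assumption: principal point at the image centre.
    cx, cy = width / 2.0, height / 2.0
    x = (u - cx) * depth_m / focal_length_px
    y = (v - cy) * depth_m / focal_length_px
    return (x, y, depth_m)


def distance_between(a: Point3D, b: Point3D) -> float:
    """Euclidean distance between two 3-D points, in metres."""
    return math.dist(a, b)
```

A typical use would be to pass the centre of each person's bounding box as (u, v), together with that person's estimated depth, and then compare every pair of resulting points against the 1 metre threshold.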