Video Analytics — Deep Learning Edition


There is a lot of stuff out there when it comes to doing Video Analysis based on Deep Learning (use The Googles). It’s always been a bit of a pain to set up though — as with most things Deep Learning, it’s largely a case of Do It Yourself. Install the packages, re-install them because you didn’t get it quite right, build your models, start training them, reinstall the packages because you didn’t get the GPU version… you get the point, it takes a while to just get things set up before you can even start doing anything.

Or, it used to take a while. Thanks to Akshay Bhat, herewith a one-stop shop for all your Deep Video Analytics needs. In one single set of docker containers, you get everything you need to do DL based video analytics, from importing, to analyzing, and thence to searching.

To begin with, you import all the videos you want to process, either directly, from YouTube, or from S3/GS/etc. Once you’ve imported them, the project comes with a bunch of pre-trained models (seriously, is this so hard? Why don’t others do it?), including
• Zhang et al.’s MTCNN (•) for face detection, which is remarkably good at working regardless of pose, lighting, obstructions, etc.
• Schroff et al.’s FaceNet (••) for face recognition and verification
• Object detection trained on MS COCO
And if that’s not enough, it also comes with support for pretty much any model that is compatible with the TensorFlow Object Detection API (VGG, AVA, and more are included out of the box)
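FaceNet-style verification, by the way, ultimately boils down to comparing embedding vectors: two faces count as the same person if their embeddings are close enough. Here’s a minimal sketch of just that final comparison step, using made-up 128-d vectors in place of real FaceNet output (the threshold is purely illustrative, not the paper’s tuned value):

```python
import numpy as np

def same_person(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 1.1) -> bool:
    """Decide whether two L2-normalized face embeddings belong to the same person.

    FaceNet is trained so that embeddings of the same identity end up close
    together in L2 distance; 'threshold' here is an illustrative cut-off.
    """
    dist = np.linalg.norm(emb_a - emb_b)
    return dist < threshold

# Stand-in embeddings -- real ones would come from a trained FaceNet model.
rng = np.random.default_rng(0)
face_1 = rng.normal(size=128)
face_1 /= np.linalg.norm(face_1)                           # normalize to unit length
face_1_again = face_1 + rng.normal(scale=0.01, size=128)   # slightly perturbed copy
face_1_again /= np.linalg.norm(face_1_again)
stranger = rng.normal(size=128)
stranger /= np.linalg.norm(stranger)

print(same_person(face_1, face_1_again))  # nearby embeddings -> True
print(same_person(face_1, stranger))      # unrelated vector -> False
```

The nice part is that once every face is an embedding, recognition, verification, and clustering are all just distance computations on the same vectors.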

The part I like is that when you’re importing the videos, you can do Group of Pictures (GOP) segmentation, and then run map/reduce style workflows on these segments for analysis purposes. I can see this being particularly useful as input for your own research purposes, e.g., “I need to find all the GOP segments that have cats in them” etc.
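The map/reduce idea is simple enough to sketch: map a detector over each segment independently (those calls parallelize trivially), then reduce down to the segments you care about. Everything below — the segment records and the cat-spotting detector — is made up for illustration; in the real system the map step would invoke one of the pre-trained models:

```python
from functools import reduce

# Hypothetical GOP segment records: (video_id, segment_index, labels a detector found)
segments = [
    ("vid_01", 0, {"person", "dog"}),
    ("vid_01", 1, {"cat"}),
    ("vid_02", 0, {"car"}),
    ("vid_02", 1, {"cat", "person"}),
]

def detect(segment):
    """Map step: inspect one segment on its own, keep it if it contains a cat."""
    video, idx, labels = segment
    return [(video, idx)] if "cat" in labels else []

# Reduce step: concatenate the per-segment hits into one answer.
hits = reduce(lambda acc, seg: acc + detect(seg), segments, [])
print(hits)  # [('vid_01', 1), ('vid_02', 1)]
```

Since each map call only looks at its own segment, you can fan the work out across as many workers as you have GPUs.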

Anyhow, once the videos are processed, the project also includes all sorts of fun stuff to help you find things in them (seriously, this is one well thought out system!). Basically, you set up pipelines that automagically index the output from the detection and analysis phases. These incorporate Facebook’s FAISS for similarity searches, and Yahoo’s LOPQ for approximate nearest neighbor searches. Once you’ve got all this info in your indexes, it’s just a case of deciding what you want to look for, right?
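What FAISS and LOPQ buy you is doing this at scale; the core operation underneath is just nearest-neighbor search over the indexed vectors. A brute-force NumPy version of that query (on a tiny made-up index, not the project’s actual code — FAISS’s exact index does the same math, and its approximate indexes and LOPQ trade a little accuracy for a lot of speed):

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for indexed frame/detection embeddings: 1000 vectors of dimension 64.
index_vectors = rng.normal(size=(1000, 64))
# A query meaning "find me frames like this one": vector 123 plus a little noise.
query = index_vectors[123] + rng.normal(scale=0.05, size=64)

# Brute-force k-nearest-neighbors by L2 distance over the whole index.
k = 5
dists = np.linalg.norm(index_vectors - query, axis=1)
nearest = np.argsort(dists)[:k]
print(nearest[0])  # 123 -- the perturbed source vector comes back as the top hit
```

Swap the brute-force scan for a FAISS index and the same query works on hundreds of millions of vectors.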

There is more, much more, including a custom query language, REST APIs, GPU and non-GPU versions of the containers, immutable operations, oh, the list goes on.

The code for the whole thing is available here, and the k8s info is here. Go take a look at the project, and if you’ve got some time, take a look at the preso about this.


(•) “Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks” — by Zhang et al.
(••) “FaceNet: A Unified Embedding for Face Recognition and Clustering” — by Schroff et al.

(This article also appears on my blog)
