Video processing pipeline with OpenCV

In the previous story, I explained how we could implement an image processing pipeline in a modular way, and why. The task was to detect faces in a bunch of image files and save them to a separate folder together with a nicely structured JSON summary file. If you haven’t read it yet, read it first!

Let’s do the same with a video stream. For this purpose, we will construct the following pipeline:

Face detection pipeline from a video stream

First, we need to capture the video stream. This pipeline task will generate a sequence of images from a video file or webcam (frame by frame). Next, we will detect faces in every frame and save them.
The next three blocks are optional. Their goal is to create an output video with annotations such as boxes around the detected faces; we can display the annotated video and save it.
The last task gathers the information about the detected faces and saves a JSON summary file with the box coordinates and confidence of every face.

If you haven’t already set up the jagin/image-processing-pipeline repository to review the source code and run some examples, you can do it now:

$ git clone git://github.com/jagin/image-processing-pipeline.git
$ cd image-processing-pipeline
$ git checkout 7df1963247caa01b503980fe152138b88df6c526
$ conda env create -f environment.yml
$ conda activate pipeline

If you have already cloned the repo and set up the environment, just update it with:

$ git pull
$ git checkout 7df1963247caa01b503980fe152138b88df6c526
$ conda env update -f environment.yml

Checking out the 7df1963247caa01b503980fe152138b88df6c526 commit will ensure that you can use the code out of the box with this story.

Capture video

Video capturing is very simple with OpenCV. We need to create a VideoCapture object, passing either the device index (the number specifying which camera to use) or the name of a video file as the argument. Then we can capture the video stream frame by frame.
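
Stripped of all the pipeline machinery, the OpenCV part looks roughly like this (pass 0 instead of the file name to read from the default webcam):

import cv2

# A file path reads a video file; an integer index such as 0 selects a camera
cap = cv2.VideoCapture("assets/videos/faces.mp4")

while True:
    ok, frame = cap.read()   # grab the next frame
    if not ok:               # end of the stream or a read error
        break
    # ... process the frame here ...

cap.release()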

We can implement the capture video task with the following CaptureVideo class which extends Pipeline:

Pipeline generator task capturing video frames (pipeline/capture_video.py)

In __init__ we create the VideoCapture object (line 6) and extract the properties of the video stream, such as frames per second and the number of frames. We will need them to display a progress bar and to save the output video properly.
The image frames are yielded in the generator function (line 30) with the following dictionary structure:

data = {
    "image_id": f"{image_idx:05d}",
    "image": image,
}

including a sequence number of the image and the binary data of the frame.
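
To make the structure of the task more concrete, here is a stripped-down sketch of such a generator class. The real implementation in pipeline/capture_video.py extends Pipeline and does more bookkeeping, so treat the names and details below as illustrative only:

import cv2

class CaptureVideo:
    def __init__(self, src):
        self.cap = cv2.VideoCapture(src)
        # Stream properties used later for the progress bar and for saving the video
        self.fps = int(self.cap.get(cv2.CAP_PROP_FPS))
        self.frame_count = int(self.cap.get(cv2.CAP_PROP_FRAME_COUNT))

    def generator(self):
        image_idx = 0
        while self.cap.isOpened():
            ok, image = self.cap.read()
            if not ok:
                break
            yield {
                "image_id": f"{image_idx:05d}",
                "image": image,
            }
            image_idx += 1
        self.cap.release()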

Detect faces

We are ready to detect faces. This time, as I promised in the previous story, we will use the deep neural network module from OpenCV instead of a Haar cascade. The model we are going to use is much more accurate and additionally gives us a confidence score.

An example image from the movie “Friends” with detected faces (notice no false-positives).

Since version 3.3, OpenCV has supported many deep learning frameworks, like Caffe, TensorFlow and PyTorch, allowing us to load a model, pre-process an input image and run inference to obtain the output.
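
For example, loading a Caffe model and running it on a single image takes only a few lines. The prototxt/caffemodel file names below are the ones commonly shipped with OpenCV's SSD face detector and are just placeholders here; point them at your own copies:

import cv2

net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                               "res10_300x300_ssd_iter_140000.caffemodel")

image = cv2.imread("example.jpg")
# Pre-process: resize to the 300x300 network input and subtract the training means
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 1.0, (300, 300),
                             (104.0, 177.0, 123.0))
net.setInput(blob)
# Each row of detections[0, 0] holds [_, _, confidence, x1, y1, x2, y2]
detections = net.forward()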

There is an excellent blog post by Adrian Rosebrock explaining how to implement face detection with OpenCV and deep learning. We will use part of that code in our FaceDetector class:

Face detector class (pipeline/libs/face_detector.py)

This simple class does one thing: it detects faces in a batch of images.
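
Since the gist is not reproduced here, the snippet below is only a rough approximation of what such a class can look like; the constructor arguments, the detect method and its return format are my assumptions, not the exact API of pipeline/libs/face_detector.py:

import cv2
import numpy as np

class FaceDetector:
    def __init__(self, prototxt, model, confidence=0.5):
        self.net = cv2.dnn.readNetFromCaffe(prototxt, model)
        self.confidence = confidence

    def detect(self, images):
        # One blob for the whole batch; every image is resized to 300x300 internally
        blob = cv2.dnn.blobFromImages(images, 1.0, (300, 300),
                                      (104.0, 177.0, 123.0))
        self.net.setInput(blob)
        detections = self.net.forward()

        # Each row of detections[0, 0] is [image_idx, label, confidence, x1, y1, x2, y2]
        faces = [[] for _ in images]
        for det in detections[0, 0]:
            conf = float(det[2])
            if conf < self.confidence:
                continue
            image_idx = int(det[0])
            h, w = images[image_idx].shape[:2]
            box = (det[3:7] * np.array([w, h, w, h])).astype(int)
            faces[image_idx].append((box, conf))
        return faces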

Remember, we are trying to be modular and to decouple the pipeline building blocks. This approach gives us manageable code and makes tests easier to write:

Face detector test (tests/pipeline/libs/test_face_detector.py)
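
The gist itself is not shown here either; against a detector with the interface sketched above, a pytest-style test could look more or less like this (the model and image paths are placeholders):

import cv2
from face_detector import FaceDetector   # the sketched class from above

def test_detects_at_least_one_face():
    detector = FaceDetector("deploy.prototxt",
                            "res10_300x300_ssd_iter_140000.caffemodel")
    image = cv2.imread("tests/assets/group_photo.jpg")   # placeholder test image
    faces = detector.detect([image])
    assert len(faces) == 1        # one result list per input image
    assert len(faces[0]) >= 1     # at least one face found
    for box, confidence in faces[0]:
        assert confidence >= 0.5  # the default threshold filters weaker detections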

Using the pipeline architecture, it is easy to swap out CascadeDetectFaces from the previous post for the more accurate deep learning face detector.

Let’s use FaceDetector in our new DetectFaces pipeline step:

Detect faces pipeline task (pipeline/detect_faces.py)

This part of the code requires some explanation. With CaptureVideo we already generate data (that is, image frames) for our pipeline, so why is there another generator in DetectFaces? The answer is batch processing.
We buffer the stream of images (lines 15–20) until it hits the batch_size (line 24), then we detect faces on all buffered images (line 28), collect the face coordinates and confidences (lines 31–32) and re-yield them (lines 35–37).
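
In pseudo-Python, the buffering logic boils down to something like this; the function names and the "faces" key are simplifications, not the exact structure used in pipeline/detect_faces.py:

def detect_faces(stream, detector, batch_size=4):
    buffer = []
    for data in stream:                # data comes from the capture task
        buffer.append(data)
        if len(buffer) == batch_size:  # enough frames collected: run one batched inference
            yield from process_batch(buffer, detector)
            buffer = []
    if buffer:                         # flush the last, incomplete batch
        yield from process_batch(buffer, detector)

def process_batch(buffer, detector):
    images = [data["image"] for data in buffer]
    faces_per_image = detector.detect(images)   # a single inference for the whole batch
    for data, faces in zip(buffer, faces_per_image):
        data["faces"] = [{"box": box.tolist(), "confidence": conf}
                         for box, conf in faces]
        yield data                     # re-yield the enriched frame down the pipeline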

When we use a GPU (graphics processing unit), we have thousands of processing cores running simultaneously, specialised in matrix operations. It is usually faster to run inference in batches, presenting several images to the deep learning model at once, than to process them one by one.

Save faces and summary

SaveFaces and SaveSummary produce the output results. I encourage you to look at the source code (pipeline/save_faces.py and pipeline/save_summary.py).

The SaveFaces class, using the map function, loops over all detected faces, crops them from the image and saves them to the output directory.
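
The cropping itself is just NumPy slicing followed by cv2.imwrite. A simplified function (not the actual SaveFaces class) could look like this:

import os
import cv2

def save_faces(data, output_dir="output"):
    # One sub-directory per frame, one JPEG per detected face
    frame_dir = os.path.join(output_dir, data["image_id"])
    os.makedirs(frame_dir, exist_ok=True)
    for face_idx, face in enumerate(data["faces"]):
        x1, y1, x2, y2 = face["box"]
        face_image = data["image"][y1:y2, x1:x2]   # crop by slicing the NumPy array
        cv2.imwrite(os.path.join(frame_dir, f"{face_idx:05d}.jpg"), face_image)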

The task of the SaveSummary class is to gather all the metadata about the recognized faces and save it as a well-structured JSON file. The map function is used for buffering the metadata. Then we extend the class with an extra write function, which we trigger at the end of the pipeline to save the JSON summary file.
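
A stripped-down version of that idea, again only a sketch and not the actual class, could look like this:

import json

class SaveSummary:
    def __init__(self, path):
        self.path = path
        self.summary = {}

    def map(self, data):
        # Buffer only the metadata, not the image itself
        self.summary[data["image_id"]] = data["faces"]
        return data

    def write(self):
        # Triggered once, at the end of the pipeline
        with open(self.path, "w") as f:
            json.dump(self.summary, f, indent=2)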

The images of the faces are stored in a separate directory for every frame.

output
├── 00000
│   └── 00000.jpg
├── 00001
│   └── 00000.jpg
├── 00002
│   └── 00000.jpg
...
├── 00260
│   ├── 00000.jpg
│   ├── 00001.jpg
│   ├── 00002.jpg
│   └── 00003.jpg
├── 00261
│   └── 00000.jpg
...
├── 00457
│   ├── 00000.jpg
│   └── 00001.jpg
├── 00458
│   └── 00000.jpg
├── 00459
│   └── 00000.jpg
└── summary.json

Video output

To observe the results of the pipeline, it is nice to be able to display the video with the annotated faces. That is what AnnotateImage (pipeline/annotate_image.py) and DisplayVideo (pipeline/display_video.py) are for.

Image with an annotated face (green box and confidence value).

Using SaveVideo (pipeline/save_video.py), we can also save the annotated video itself to present the results to our colleagues.

Pipeline in action

In process_video_pipeline.py we can see that the whole pipeline is defined as follows:

pipeline = (capture_video |
            detect_faces |
            save_faces |
            annotate_image |
            display_video |
            save_video |
            save_summary)

The optional tasks can be set to None if we don’t need them, as is the case for annotate_image, display_video and save_video.

Neat and clear!
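
If you ever need to build such a chain programmatically, one simple way to skip the unused steps (not necessarily how the repository handles None internally) is to filter them out before chaining:

from functools import reduce

steps = [capture_video, detect_faces, save_faces,
         annotate_image, display_video, save_video,   # any of these may be None
         save_summary]

# Chain the remaining steps with the same | operator as above
pipeline = reduce(lambda left, right: left | right,
                  [step for step in steps if step is not None])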

That was a lot of explanation, but video and images speak louder than words. Let’s see our pipeline in full glory by triggering the command:

$ python process_video_pipeline.py -i assets/videos/faces.mp4 -p -d -ov faces.avi

-p will show us the progress bar,
-d will display the video results with annotated faces on them,
-ov faces.avi will save the video result to the output folder.
(for other options and their defaults, see the source)

The resulting video is presented to us:

and the output directory is full of frame folders with images of detected faces.

As you can see in the sample video, not all faces are detected. We can lower the confidence threshold of the deep learning model by setting the parameter
--confidence 0.2 (the default is 0.5). Dropping the confidence threshold can increase the number of false positives (a face reported in a place of the image where no face is present).

Let’s experiment with the batch size of our DetectFaces class:

$ python process_video_pipeline.py -i assets/videos/faces.mp4 -p --batch-size 1
100%|███████████████████████████| 577/577 [00:11<00:00, 52.26it/s]
[INFO] Saving summary to output/summary.json...

$ python process_video_pipeline.py -i assets/videos/faces.mp4 -p --batch-size 4
100%|███████████████████████████| 577/577 [00:09<00:00, 64.66it/s]
[INFO] Saving summary to output/summary.json...

$ python process_video_pipeline.py -i assets/videos/faces.mp4 -p --batch-size 8
100%|███████████████████████████| 577/577 [00:10<00:00, 56.04it/s]
[INFO] Saving summary to output/summary.json...

On my hardware (a Core i7-8750H CPU @ 2.20GHz and an NVIDIA RTX 2080 Ti) I got 52.26 frames per second with --batch-size 1, while with --batch-size 4 we get a small speed-up, to 64.66 frames per second.

We can also observe that a bigger batch size doesn’t necessarily mean faster processing. There are other IO operations blocking us, such as reading the frames and writing the results. We can deal with them using Python threading, which is useful in this kind of situation.
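
As a teaser, the general idea is to move the blocking IO into its own thread and hand the frames over through a queue; a minimal sketch of the reading side (not the code of the next story) looks like this:

import queue
import threading
import cv2

frames = queue.Queue(maxsize=64)

def read_frames(src):
    # Runs in a background thread so that reading never blocks detection
    cap = cv2.VideoCapture(src)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.put(frame)      # blocks when the queue is full
    frames.put(None)           # sentinel marking the end of the stream
    cap.release()

threading.Thread(target=read_frames,
                 args=("assets/videos/faces.mp4",), daemon=True).start()

while True:
    frame = frames.get()
    if frame is None:
        break
    # ... detect faces / write results here while the reader keeps working ...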

I will talk about it more in the next story. Happy coding!