Detecting animals in the backyard — practical application of deep learning.

Source: Deep Learning on Medium

  1. Install OpenCV
  2. Multiprocessing VideoReader
  3. Tensorflow model Megadetector
  4. Batches
  5. Possible optimizations: Graph Optimize, TensorRT

Since I didn’t have the data, resources, or time to train my own animal detection neural network, I searched the net for what was already available. I found that the task, even with state-of-the-art neural networks and data gathered from all over the world, is not as simple as it seemed.

Of course, there are products and research projects doing animal detection. Still, they differ from what I was looking for in one main way: they detect creatures in shots from photo cameras or smartphone cameras, and such shots differ in color, shape, and quality from what you get from motion detection cameras.

But, whatever. There are still projects with the same goal as mine. My searches led me to the CameraTraps project from Microsoft. As I understand it, they are building an image recognition API using data collected from wildlife cameras all over the world. As part of that work, they open-sourced a pre-trained model, called “MegaDetector,” that detects whether an “animal” or a “human” is present in the image.

The main limitation of that model follows from its name: it is only a “Detector,” not a “Classifier.”

Statement from Microsoft about detectors and classifiers

Even with limitations like that, the approach fit me perfectly.

The model is trained to detect three different classes:

  1. Animal
  2. Person
  3. Vehicle

Raccoon identified with “Animal” class

Most blog posts about video object detection describe real-time video. My case was a bit different: as input, I had a huge pile of video files produced by the camera, and as output, I also wanted video files.

For reading and writing video files in Python, the OpenCV library is considered the de facto standard today. It is also my favorite image manipulation package.

The logic of running inference on a video file is quite straightforward:

  1. Read: Get a frame from the video
  2. Detect: Run inference on the image
  3. Write: Save the video frame to the new file, with detections if there are any
  4. Repeat: Run steps 1–3 until the end of the video

It can be implemented with this code sample

Tensorflow object detection with OpenCV VideoReader
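
The embedded script is not reproduced in this text, so here is a minimal sketch of the four steps above rather than the original code. It assumes a hypothetical `detect(frame)` callable that wraps the TensorFlow model and returns a list of boxes as `(x1, y1, x2, y2)` pixel coordinates. The function names, the `mp4v` codec, and the choice to write every frame (with boxes drawn only when something was detected) are assumptions as well.

```python
def process_frames(read_frame, detect, draw, write_frame):
    """Run the read -> detect -> write loop until the reader is exhausted.

    read_frame() returns (ok, frame), like cv2.VideoCapture.read().
    detect(frame) is a hypothetical model wrapper returning a list of detections.
    draw(frame, detections) returns the frame with the detections drawn on it.
    write_frame(frame) stores one frame, like cv2.VideoWriter.write().
    Returns (frames_read, frames_with_detections).
    """
    frames_read = frames_with_detections = 0
    while True:
        ok, frame = read_frame()                # 1. Read
        if not ok:
            break                               # 4. Repeat until the video ends
        frames_read += 1
        detections = detect(frame)              # 2. Detect
        if detections:
            frame = draw(frame, detections)
            frames_with_detections += 1
        write_frame(frame)                      # 3. Write
    return frames_read, frames_with_detections


def process_video(src_path, dst_path, detect):
    """Wire the loop to OpenCV (imported here so the loop itself has no dependencies)."""
    import cv2

    def draw_boxes(frame, boxes):
        # Assumes boxes are (x1, y1, x2, y2) in pixel coordinates.
        for x1, y1, x2, y2 in boxes:
            cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        return frame

    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 10.0     # fall back if the file reports 0
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")    # codec choice is an assumption
    out = cv2.VideoWriter(dst_path, fourcc, fps, size)
    try:
        return process_frames(cap.read, detect, draw_boxes, out.write)
    finally:
        cap.release()
        out.release()
```

Keeping the loop separate from the OpenCV wiring makes it easy to try with a fake reader and detector before pointing it at real video files.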

Even though such a straightforward approach has several bottlenecks, such as the same thread doing both the reading and the writing, it works. So, if you are looking for code to try your model on a video, check that script.

It took me around 10 minutes to process a one-minute Full HD video file at 10 FPS.

Detection took 9 minutes and 18.18 seconds. Average detection time per frame: 0.93 seconds

But you can find many tutorials like that, telling you how to run vanilla OpenCV/Tensorflow inference. The challenging part is making that code run continuously and with good performance.

I/O blocks

With the code provided, reading frames, detecting, and writing back all happen in the same loop, which means that sooner or later one of the operations will become a bottleneck, for example, reading video files from “not-very-stable” network storage.

To get rid of that bottleneck, I used instructions from the fantastic computer vision blogger Adrian Rosebrock and his library imutils. He suggests splitting frame reading and frame processing into separate threads, and this approach gave me a prepopulated queue of frames ready to be processed.

Modified FileVideoStream from Adrian Rosebrock

It doesn’t affect inference time much, but it helps with slow drives, which are often used for video storage.
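
The idea of a reader thread feeding a queue can be sketched with the standard library alone. This is not the imutils `FileVideoStream` code or the modified version mentioned above; the class name `ThreadedFrameReader` and the queue size are assumptions made for illustration.

```python
import queue
import threading


class ThreadedFrameReader:
    """Read frames on a background thread into a bounded queue."""

    _END = object()  # sentinel marking the end of the stream

    def __init__(self, read_frame, queue_size=128):
        # read_frame() returns (ok, frame), like cv2.VideoCapture.read().
        self._read_frame = read_frame
        self._frames = queue.Queue(maxsize=queue_size)
        self._thread = threading.Thread(target=self._fill, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _fill(self):
        try:
            while True:
                ok, frame = self._read_frame()
                if not ok:
                    break
                self._frames.put(frame)      # blocks while the queue is full
        finally:
            self._frames.put(self._END)      # always tell the consumer to stop

    def __iter__(self):
        while True:
            frame = self._frames.get()       # blocks until a frame is available
            if frame is self._END:
                return
            yield frame
```

With OpenCV, usage would look like `for frame in ThreadedFrameReader(cap.read).start(): ...`, where `cap` is a `cv2.VideoCapture`. The bounded queue keeps memory use in check: when detection is slower than reading, the reader thread simply waits.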

Optimization: Graph analysis

Another thing I had heard about was optimizing models for deployment. I followed a guide and managed to achieve some improvement by assigning layers that are not supported on the GPU to the CPU.

[INFO] :: Detection took 8 minutes and 39.91 seconds. Average detection time per frame: 0.86 seconds

Batch inference

In my previous experience, one of the bottlenecks in deep learning training was data transfer from disk to GPU. To minimize that time, so-called “batches” are used, so that the GPU gets several images at once.

I wondered if it was possible to do the same batch processing at inference time. Luckily, according to a StackOverflow answer, it was.

I just needed to find the largest acceptable batch size and pass an array of frames for inference. For that, I extended the FileVideoStream class with batch functionality.
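
The extended class itself is not shown in this text. As a rough illustration of the batching step only, the hypothetical helper below groups any stream of frames into fixed-size batches; the name `batched` and its signature are assumptions, not part of imutils.

```python
def batched(frames, batch_size):
    """Yield lists of up to batch_size consecutive frames."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # The final batch may be smaller than batch_size.
        yield batch
```

If the frames are NumPy arrays of the same shape, calling `np.stack(batch)` on each list would give a single array with a leading batch dimension that can be passed to the model in one call.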

[INFO] :: Detection took 8 minutes and 1.12 second. Average detection time per frame: 0.8 seconds
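
To put the three reported timings side by side, the small calculation below converts each log line to seconds and computes the time saved relative to the first run. The labels in `runs` are taken from the sections above; the post does not say whether each run includes the previous optimization.

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds


# Timings copied from the log lines in this post.
runs = {
    "baseline": to_seconds(9, 18.18),
    "graph analysis": to_seconds(8, 39.91),
    "batch inference": to_seconds(8, 1.12),
}


def saving_vs_baseline(name):
    """Fraction of the baseline run time saved by the named run."""
    return 1 - runs[name] / runs["baseline"]


for name, total in runs.items():
    print(f"{name}: {total:.2f} s, {saving_vs_baseline(name):.1%} less than baseline")
```

The graph analysis run comes out roughly 7% shorter than the baseline, and the batch inference run roughly 14% shorter.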

Optimization: Compiling from sources

Another important part of running heavy, time-consuming computations is squeezing the most out of the hardware.

One of the most straightforward approaches is to use packages optimized for the machine type. Every Tensorflow user has seen this message:

tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2

It means that Tensorflow is underutilizing the hardware because it ignores built-in CPU optimizations. The reason is that the generic package was installed, which works on any type of x86 machine.
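
If you want to check which of these instruction sets your own CPU reports before hunting for an optimized build, something like the sketch below may help. It is Linux- and x86-specific, since it parses the `flags` line of `/proc/cpuinfo`, and the helper names are made up for this example.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def supported(flags, wanted=("avx", "avx2")):
    """Return the wanted instruction sets that the CPU reports."""
    return [name for name in wanted if name in flags]


def read_linux_cpu_flags(path="/proc/cpuinfo"):
    # Only works on Linux; other systems expose this information differently.
    with open(path) as f:
        return cpu_flags(f.read())
```

Calling `supported(read_linux_cpu_flags())` on a Linux machine lists which of AVX and AVX2 the CPU advertises.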

One way of increasing performance is to install an optimized package from a third party.

Another way is to follow the instructions from Google and build the package from source. But keep in mind that it can be a bit difficult if you haven’t done it before, and it is quite a time- and RAM-consuming process (last time it took 3.5 hours on my six-core CPU).

The same goes for OpenCV, but that is an even more complex topic, so I’m not covering it here. There are handy guides by Adrian Rosebrock; if you are interested in that topic, please follow them.


A small Python application waits for incoming videos with detections. When a video arrives, it posts an update to my Telegram channel. I reused my previous project, which resent incoming videos to my Telegram channel.

The app is configured as follows and continuously monitors a folder for new files using the watchdog library:

"xiaomi_video_watch_dir" : PATH_TO_WATCH,
"xiaomi_video_temp_dir" : PATH_TO_STORE_TEMP_FILES,
"xiaomi_video_gif_dir" : PATH_WITH_OUTPUT_GIFS,
"tg_key" : TELEGRAM_KEY

Initial Telegram group version
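
The application code is not included in this text. As a sketch of how folder monitoring with watchdog can look, the example below calls a user-supplied `on_video` callback for each new video file. The suffix list, the function names, and the callback are assumptions; sending the result to Telegram is left out.

```python
import time
from pathlib import Path

VIDEO_SUFFIXES = {".mp4", ".avi", ".mkv"}  # assumed camera output formats


def is_video(path):
    """True when the path looks like a video file, judging by its suffix."""
    return Path(path).suffix.lower() in VIDEO_SUFFIXES


def watch(watch_dir, on_video):
    """Call on_video(path) for every new video file created in watch_dir."""
    # Imported here so is_video can be used without watchdog installed.
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class NewVideoHandler(FileSystemEventHandler):
        def on_created(self, event):
            # Note: the file may still be being written when this fires.
            if not event.is_directory and is_video(event.src_path):
                on_video(Path(event.src_path))

    observer = Observer()
    observer.schedule(NewVideoHandler(), str(watch_dir), recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        pass
    finally:
        observer.stop()
        observer.join()
```

Here `watch_dir` would be the value configured as `xiaomi_video_watch_dir`, and `on_video` would run the detection and post the result.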

What didn’t work out?

This project taught me a lot, and even though I reached my final goal, I went through some failed attempts along the way. I think those are among the most important parts of any project.

Image Enhancing

During my research, I came across several reports from iWildCam Kaggle competition participants. They quite often mentioned applying the CLAHE algorithm to input images for histogram equalization. I tried that algorithm and several others, but with no success: applying image modifications reduced the number of successful detections. To be honest, though, the night camera images did look sharper and crisper.

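import cv2 as cv  # cv.xphoto needs the opencv-contrib-python package

# Assumed setup: the post does not show how wb and clahe were created.
wb = cv.xphoto.createSimpleWB()
clahe = cv.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

def enchance_image(frame):
    # White-balance the frame, then equalize only the lightness channel in Lab space.
    img_wb = wb.balanceWhite(frame)
    img_lab = cv.cvtColor(img_wb, cv.COLOR_BGR2Lab)
    l, a, b = cv.split(img_lab)
    img_l = clahe.apply(l)
    img_clahe = cv.merge((img_l, a, b))
    return cv.cvtColor(img_clahe, cv.COLOR_Lab2BGR)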