Slam Dunk Video Classification (Tensorflow 2.0 Distributed Training with the NVIDIA Data Science PC!)

NVIDIA Data Science PC (Built by Digital Storm)

The recently announced NVIDIA Data Science PC is a very interesting step forward for Artificial Intelligence and Deep Learning. This article will highlight the power of the PC's 2 Titan RTX GPUs in tandem with the easy syntax of Tensorflow 2.0's new Distributed Training API for Computer Vision applications! In this example, distributed training achieves a surprising ~2.3x speedup, averaging 63s / epoch vs. 143s / epoch compared to training on a single GPU on the same machine.

This end-to-end tutorial will build a binary image classifier to process video frames. The motivation behind the project is a Computer Vision assistant to clip out short dunk videos from long recordings containing mostly empty frames. I gathered the data by setting my iPhone on a tripod recording the half court. The irritating part is that I end up with 2-minute videos containing only about four 5-second dunk clips. Most of the video is just an empty half court, as shown in the image below.

Most of the video is an empty hoop; we want to use the classifier to crop out these frames

This end-to-end tutorial will build a classifier to parse out dunks from a full workout tape. The workflow consists of 6 steps: extract image frames from the video files, label the data, resize the data, train the classifier on the Data Science PC with TF 2.0 Distributed Training, use the classifier to parse video frames, and finally, stitch these frames together into videos containing just the dunks from the workout!

I have also made a video tutorial for this if you would prefer to follow along that way!

Step 1: Extract Frames from Video Files

The first step in the workflow is to get the image frames from video files. Videos typically contain about 30 frames per second, which in our case means 30 images per second of video. On average these 1–2 minute raw video clips lead to ~2,000 image frames. One interesting characteristic of the code below is that the cv2 video reader was reading the frames rotated by default, so I had to use the imutils library to rotate them back upright. I also recommend observing how the count is incremented so that the image frames are not constantly written over one another.

CV2 to extract image frames from video files!
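The original code embed isn't reproduced here, so below is a minimal sketch of this step. The paths, file names, and rotation angle are assumptions for illustration; note how count is incremented so each frame gets a unique file name.

import cv2
import imutils

# Hypothetical input video and output folder
video_path = 'DunkClips/BlackShirtClip8.MOV'
vidcap = cv2.VideoCapture(video_path)

success, image = vidcap.read()
count = 0
while success:
    # cv2 read the frames rotated by default; rotate them back upright
    # (the 90-degree angle here is an assumption)
    image = imutils.rotate_bound(image, 90)
    # Increment count so frames are not written over one another
    cv2.imwrite('Frames/BlackShirtClip8-{}.jpg'.format(count), image)
    success, image = vidcap.read()
    count += 1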

Step 2: Label Data

After extracting video frames this way, we are left with these folders. The numbers at the end of the file names, e.g. "BlackShirtClip8-k.jpg", denote the ordering of video frames. Luckily for the sake of labeling, we can take advantage of this sequential ordering to label several data points at a time. For example, if I first appear in the scene at frame 405, we can safely assume frames 1–404 can be labeled as "Empty", i.e. not "InFrame".

My strategy for labeling data in this project is to take advantage of the sequential nature of frames to mass label data points.
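As a concrete illustration of this mass-labeling strategy, the sketch below moves a whole range of frames into a label folder at once. The folder names ("Empty", "InFrame") and the frame range are hypothetical.

import shutil

# I first appear at frame 405, so frames 0-404 can all be labeled 'Empty'
for i in range(0, 405):
    shutil.move('Frames/BlackShirtClip8-{}.jpg'.format(i),
                'Empty/BlackShirtClip8-{}.jpg'.format(i))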

Step 3: Resize Data

This classifier will use the ResNet50 model built into tf.keras.applications. This network takes 224 x 224 images as input. Resizing does distort the natural 1080 x 1920 frames, but it doesn't seem to hurt performance, so for now we will accept it as is. The following code loops through the labeled data directory and uses Pillow to open and resize the images.

The keras.applications.resnet50 classifier takes in 224 x 224 images as input!
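Since the resize code isn't embedded here, this is a minimal sketch of it, assuming the labeled frames live in "Empty" and "InFrame" folders from Step 2.

import os
from PIL import Image

# Loop through the labeled data directories and resize every frame to 224 x 224
for folder in ['Empty', 'InFrame']:
    for fname in os.listdir(folder):
        path = os.path.join(folder, fname)
        img = Image.open(path).resize((224, 224))
        img.save(path)  # overwrite the original with the resized copy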

Step 4: Train the Classifier!

Now that we have our training data ready for the model, we will look at training a binary image classifier with the Tensorflow 2.0 Distributed Training API on the NVIDIA Data Science PC. The first block of code shown below imports the necessary libraries from Keras and loads in the image data. We load the image data this way so that it can later be wrapped in a tf.data object and bound to the GPUs.

Import necessary libraries for the classifier and load the images into memory to fit into a tf.data object
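A minimal sketch of this block, assuming the same "Empty" and "InFrame" folders from the earlier steps; the pixel normalization is my assumption.

import os
import numpy as np
import tensorflow as tf
from PIL import Image

# Load the resized frames into memory with binary labels:
# 'Empty' = 0, 'InFrame' = 1
images, labels = [], []
for label, folder in enumerate(['Empty', 'InFrame']):
    for fname in os.listdir(folder):
        images.append(np.array(Image.open(os.path.join(folder, fname))))
        labels.append(label)

x_train = np.array(images, dtype=np.float32) / 255.0  # scale pixels to [0, 1]
y_train = np.array(labels, dtype=np.float32)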

The following block of code sets up our model and distributed training strategy. This tutorial uses the Mirrored Strategy covered in the Tensorflow distributed training documentation. Our model takes the ResNet50 base, removes the 1000-way classification layer used for ImageNet, and adds 3 fully connected layers on top. The final layer has a single node with a sigmoid activation to classify images as either having me in the frame or not.

Setup for the TF 2.0 Distributed Training API with the Mirrored Strategy for distribution!
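A sketch of the model and strategy setup, continuing from the block above. The hidden layer sizes, pooling layer, optimizer, and loss are assumptions; the article only specifies 3 fully connected layers ending in a single sigmoid node.

import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# With MirroredStrategy the model must be built and compiled inside the scope
with strategy.scope():
    # include_top=False removes the 1000-way ImageNet classification layer
    base = ResNet50(weights='imagenet', include_top=False,
                    input_shape=(224, 224, 3))
    x = GlobalAveragePooling2D()(base.output)
    x = Dense(256, activation='relu')(x)   # layer sizes are assumptions
    x = Dense(64, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)  # 1 node: in frame vs. empty
    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])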

The code block below loads the training data into a tf.data object and caches it for the GPUs. The amazing part of the Tensorflow 2.0 Distributed Training API is its simplicity: all we have to do to scale training across 2 GPUs is call the same familiar "model.fit"! After we train the model, we save the weights to use later on to clip the videos.

Training the model with Distributed Training on the NVIDIA Data Science PC
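Continuing the sketch above: the batch size, epoch count, and weights file name below are assumptions.

# Scale the batch size with the number of GPUs in sync
BATCH_SIZE_PER_REPLICA = 32
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

# Wrap the arrays in a tf.data object, cache, shuffle, and batch
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.cache().shuffle(len(y_train)).batch(GLOBAL_BATCH_SIZE)

# The same model.fit syntax as single-GPU training scales across both GPUs
model.fit(train_ds, epochs=10)

# Save the weights so we can reload them for inference in Step 5
model.save_weights('dunk_classifier.h5')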

Speed Comparison with Training on 1 Titan RTX GPU

The following training output shows the remarkable speedup achieved with the Tensorflow 2.0 Distributed Training API + the NVIDIA Data Science PC. We are able to achieve 63s / epoch vs. 143s / epoch using the exact same code and setup everywhere else!

Note: 143s / epoch using just 1 GPU, compared to 63s / epoch using 2 GPUs with the Distributed Training API

Step 5: Use the Model to Crop Out Dunks!

Now that we have finished training the model, we load it into memory to apply to our video frames. The following block of code loads in the model and the saved weights file.

Load the model for inference
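A sketch of the inference setup: we rebuild the same architecture as in the Step 4 sketch and load the saved weights (the file name is the assumption carried over from training).

import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Rebuild the architecture from Step 4; weights=None since we load our own
base = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation='relu')(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(inputs=base.input, outputs=out)
model.load_weights('dunk_classifier.h5')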

Now that we have our model, we loop through the video frames and save the indices of the frames that I'm in (labeled as 1 by the classifier). This is done in the line:

if label >= 0.95:
    action_frames.append(frame_counter)

I moved the decision boundary up to 0.95 to avoid false positives; I found this worked well to suppress noise in the classifications.
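The surrounding loop isn't embedded here, so below is a minimal sketch of it; the "Frames" folder and file-name pattern are carried over from the Step 1 sketch.

import os
import numpy as np
from PIL import Image

# Walk through the frames in order and record the indices the classifier
# scores as 'in frame' (sigmoid output >= 0.95)
action_frames = []
num_frames = len(os.listdir('Frames'))
for frame_counter in range(num_frames):
    img = Image.open('Frames/BlackShirtClip8-{}.jpg'.format(frame_counter))
    img = np.array(img.resize((224, 224)), dtype=np.float32) / 255.0
    label = model.predict(np.expand_dims(img, axis=0))[0][0]
    if label >= 0.95:
        action_frames.append(frame_counter)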

Label frames which contain me in the frame, compared to the empty hoop frames shown previously

Now we have an action_frames list that looks something like:

[405, 406, 407, ..., 577, 768, 769, ..., 999, 1000, ...]

This list contains all the frames in which I am in the scene. I want to reduce this list to pairs of (start, end) frames. This is achieved by looping through the action list with the following code:

Write clips to separate folders
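A sketch of this logic: a gap larger than 1 between consecutive indices marks the end of one clip and the start of the next, and each clip's frames are then copied to its own folder (the folder names are hypothetical).

import os
import shutil

# Collapse the sorted frame indices into (start, end) pairs
clips = []
start = action_frames[0]
for prev, cur in zip(action_frames, action_frames[1:]):
    if cur - prev > 1:
        clips.append((start, prev))
        start = cur
clips.append((start, action_frames[-1]))

# Write the frames of each clip to its own folder
for i, (clip_start, clip_end) in enumerate(clips):
    clip_dir = 'Clip{}'.format(i)
    os.makedirs(clip_dir, exist_ok=True)
    for f in range(clip_start, clip_end + 1):
        shutil.copy('Frames/BlackShirtClip8-{}.jpg'.format(f), clip_dir)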

The result of the code above is an array such as [(405, 577), (768, 999)]. We then write the frames contained in these intervals to separate folders and assemble them into videos in Step 6.

Step 6: Assemble Videos from Image Frames

Now that we have the image frames that correspond to each dunk from the video, we can use the cv2 VideoWriter to make them into their own videos. Note that you can also lower the fps from, say, 30 to 15 if you want a slow-motion effect in the videos!

Assemble the dunk clips into videos!
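A minimal sketch of this step, reusing the clips list and per-clip folders from Step 5; the codec and output file names are assumptions.

import cv2

fps = 30  # drop to 15 for a slow-motion effect
for i, (clip_start, clip_end) in enumerate(clips):
    # Read the first frame to get the output video dimensions
    first = cv2.imread('Clip{}/BlackShirtClip8-{}.jpg'.format(i, clip_start))
    height, width, _ = first.shape
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    writer = cv2.VideoWriter('Dunk{}.mp4'.format(i), fourcc, fps,
                             (width, height))
    for f in range(clip_start, clip_end + 1):
        writer.write(cv2.imread('Clip{}/BlackShirtClip8-{}.jpg'.format(i, f)))
    writer.release()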

Thank you for reading this tutorial; please let me know if you have any questions! I have also made a video tutorial going through these steps if you are interested!