Automating the process of selecting thumbnails

Originally published by Simon Oury in Deep Learning on Medium.


Our method: Automatically selecting a correct thumbnail

Once we were satisfied with the performance of our model, we needed to create an end-to-end pipeline that ingests a video as input and outputs an appropriate thumbnail. There are four main steps in the process of selecting a thumbnail:

Diagram: End-to-end code for automatically finding thumbnails

1. Extracting frames with FFmpeg

The first step was to extract images from the input video. To do so, we used FFmpeg, an open-source tool and library for handling video. With FFmpeg, we extracted images at a rate of one frame per second. We believe this is a good extraction rate: a higher rate would just extract near-identical images, whereas a lower one might miss important frames. We used the following command to extract images:

"ffmpeg -i {} -r 1 -f image2 image-%3d.png".format(video_name)

We could then work with these images to try and find the most suitable thumbnail.
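
To give a concrete picture of this step, here is a minimal Python sketch that wraps the FFmpeg call with subprocess; the extract_frames helper, the output directory layout and the zero-padded file pattern are our own assumptions, not code from the original pipeline.

import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str = "frames", fps: int = 1) -> list:
    # Hypothetical helper: extract `fps` frames per second from the video with FFmpeg.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # -r 1 samples one frame per second; the image2 muxer writes numbered PNG files.
    cmd = [
        "ffmpeg", "-i", video_path,
        "-r", str(fps),
        "-f", "image2",
        f"{out_dir}/image-%03d.png",
    ]
    subprocess.run(cmd, check=True)
    return sorted(Path(out_dir).glob("image-*.png"))

# Usage (assumed file name):
# frames = extract_frames("my_video.mp4")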

2. Preprocessing the images

Applying network architectures directly to raw frames is impractical because of the sheer amount of data involved (hundreds of terabytes) and the storage and computational scaling it requires. To make the dataset more scalable and practical, we used the Inception V3 model to preprocess the images. Instead of working with 299x299x3 images, we worked with 2048-dimensional feature vectors, making every subsequent calculation faster and more efficient.
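
As a rough illustration of this preprocessing step (assuming a Keras/TensorFlow setup, which the article does not specify), the 2048-dimensional vectors can be read from Inception V3's global-average-pooling layer:

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Inception V3 without its classification head; global average pooling turns
# each 299x299x3 image into a 2048-dimensional feature vector.
feature_extractor = InceptionV3(include_top=False, weights="imagenet", pooling="avg")

def image_to_features(image_path: str) -> np.ndarray:
    # Load an image, resize it to 299x299 and return its 2048-d embedding.
    img = tf.keras.utils.load_img(image_path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)
    x = preprocess_input(x[np.newaxis, ...])  # scale pixel values to [-1, 1]
    return feature_extractor.predict(x, verbose=0)[0]  # shape: (2048,)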

3. Clustering the images

An important step was to cluster our data, i.e. to group similar images together so that each cluster contains a set of resembling frames. To find the appropriate number of clusters for a specific video, we used DBSCAN (Density-Based Spatial Clustering of Applications with Noise). DBSCAN analyses the data, groups similar images together and outputs k, the number of clusters. It is important to keep in mind that DBSCAN is not entirely autonomous: it depends on two external hyper-parameters, epsilon and the minimum number of samples. While the latter can be set to 1, a single value of epsilon cannot fit every video.

Diagram: DBSCAN variables

Epsilon corresponds to the maximum distance between two points for them to be clustered together. The greater this radius, the fewer clusters we obtain. In our code, epsilon varies between 8 and 18, and its precise value depends on the number of extracted images.

We combined this with the K-means algorithm: DBSCAN provides the number of clusters k, and K-means then assigns each image to one of those k clusters, as we can see in the image below:

Example: Clustering data

Clustering the extracted images allows us to work with distinct images only. Since we extract one frame per second, we often end up with near-duplicate images. After clustering, these similar images are grouped in their appropriate cluster, and since we keep only the best-looking image from each cluster, we deal with fewer images overall.
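
A minimal sketch of this two-stage clustering with scikit-learn is shown below; the features array (one 2048-d vector per frame) and the default epsilon value are assumptions on our part rather than the exact production code.

import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def cluster_frames(features: np.ndarray, eps: float = 12.0) -> np.ndarray:
    # DBSCAN estimates the number of clusters k; in the article, eps would be
    # chosen between 8 and 18 depending on how many frames were extracted.
    # min_samples=1 means even an isolated frame forms its own cluster.
    db = DBSCAN(eps=eps, min_samples=1).fit(features)
    k = len(set(db.labels_))  # number of clusters found by DBSCAN
    # K-means then produces the final assignment of every frame to one of the k clusters.
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    return kmeans.labels_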

4. Ranking the images

Finally, we rank the images. To do so, we pass the feature vectors into our model, which attributes to each image a score out of 10. The image with the highest score is selected as the video's thumbnail. As we can see below, the algorithm accurately separates images with high and low aesthetic quality, which is exactly what we need.

Example: Score ranking on a music video
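
To make the ranking step concrete, here is a hedged sketch of the selection logic; score_model is a placeholder for the trained aesthetic model and is assumed to map a batch of 2048-d feature vectors to one score out of 10 per frame.

import numpy as np

def pick_thumbnail(frame_paths: list, features: np.ndarray, score_model) -> str:
    # Score every frame with the (hypothetical) aesthetic model and keep the best one.
    scores = np.asarray(score_model.predict(features)).reshape(-1)  # one score per frame
    best = int(np.argmax(scores))  # index of the top-scoring frame
    return str(frame_paths[best])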

Results of our Algorithm

To assess the overall performance of the algorithm, we processed videos from the Yahoo Thumbnail Dataset, a dataset of 1,018 videos. We compared our results to the automatic thumbnail generation of FFmpeg (which is currently used at Dailymotion):

ffmpeg -i input.mp4 -vf "thumbnail,scale=640:360" -frames:v 1 thumb.png

A curated group of human validators was selected to qualitatively assess the model's performance. Their task was to blindly evaluate, given two thumbnails, one generated by FFmpeg and the other by our deep-learning-based algorithm, which was the more attractive. Our algorithm produced a better or similar thumbnail 74% of the time. As evaluating the performance of our algorithm can be subjective, we found this was the most reliable way to examine the results. Here are a few examples: