Downloading The Kinetics Dataset For Human Action Recognition in Deep Learning

Source: Deep Learning on Medium

If you are interested in deep learning for human activity or action recognition, you are bound to come across the Kinetics dataset released by DeepMind. There are three main versions of the dataset: Kinetics 400, Kinetics 600 and Kinetics 700. Kinetics 700 is the latest version at the time of writing.

The Kinetics 700 dataset is described on the DeepMind website as:

A large-scale, high-quality dataset of URL links to approximately 650,000 video clips that covers 700 human action classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 600 video clips. Each clip is human annotated with a single action class and lasts around 10s.

The URL links mentioned above are YouTube links, so the videos are YouTube videos.

The dataset is becoming a standard for human activity recognition and is increasingly being used as a benchmark in action recognition papers, as well as a baseline for deep learning architectures designed to process video data. The main vision for the Kinetics dataset is that it becomes the ImageNet equivalent for video data.

This blog will go through the steps taken to download the videos from the annotations files, the challenges faced, and some strategies used to get around them. It will highlight some basic statistics about the data which will hopefully help you make informed decisions if you choose to download it yourself. However, it won’t go into much detail about the annotations dataset, e.g. how it was collected or the distribution of the different classes. This information can be found by reading the following papers:

Getting the Kinetics annotation dataset

The biggest pain point when dealing with the Kinetics dataset, as opposed to the ImageNet or COCO equivalents, is that the actual videos are not available for download. Instead, an annotations file is provided in json and csv format, containing a list of entries with the YouTube URL link, the action category, and the start and end times of the action within the video.

The implication is that you have to download the videos yourself and crop them to the correct temporal range. With about 650,000 videos, this is not an easy task because of the various challenges covered later.

The annotations file can be downloaded from the following link. Below is a screenshot of what you should see.

Kinetics 700 is the dataset of focus for this blog. Clicking the “Download dataset” link downloads a 25 MB gzip file containing the annotation files. After extracting the contents of the gzip file, there are 3 folders which contain the train, val and test datasets in 2 file formats (csv and json). The structure of the csv file is:

label,youtube_id,time_start,time_end,split
testifying,---QUuC4vJs,84,94,validate
washing feet,--GkrdYZ9Tc,0,10,validate
air drumming,--nQbRBEz2s,104,114,validate

The items in the CSV file can be broken down as follows:

  • The label indicates what type of human activity is found within the video e.g. testifying, washing feet, air drumming etc.
  • The youtube_id is the unique video identifier YouTube uses for each video. The complete video can be downloaded by substituting the youtube_id into a YouTube video URL in place of {youtube_id}.
  • The time_start and time_end (in seconds) indicate the section of the video where the activity named by the label occurs. Taking the first row of the csv sample, labelled testifying, as an example: the video is 95 seconds long (this can be verified from the video with ID ---QUuC4vJs), and the action of interest lies between 84 and 94 seconds, which forms the temporal range.
  • The split indicates whether it belongs to the training, validation or testing dataset.
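As a rough illustration of the fields above, the following sketch parses annotation rows and derives a URL and temporal range for each clip. It assumes the csv has a header row using the field names from the list, and it assumes the standard YouTube watch URL as the template; `WATCH_URL`, `SAMPLE` and `parse_annotations` are names made up for this example.

```python
import csv
import io

# Assumption: the standard YouTube watch URL. The exact template used by
# the original post is not shown in the text.
WATCH_URL = "https://www.youtube.com/watch?v={youtube_id}"

# Two rows from the sample above, with an assumed header row.
SAMPLE = """label,youtube_id,time_start,time_end,split
washing feet,--GkrdYZ9Tc,0,10,validate
air drumming,--nQbRBEz2s,104,114,validate
"""


def parse_annotations(csv_text):
    """Turn annotation rows into dicts with a URL and a temporal range."""
    clips = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        start, end = int(row["time_start"]), int(row["time_end"])
        clips.append({
            "label": row["label"],
            "url": WATCH_URL.format(youtube_id=row["youtube_id"]),
            "start": start,
            "end": end,
            "duration": end - start,  # length of the cropped clip in seconds
            "split": row["split"],
        })
    return clips


clips = parse_annotations(SAMPLE)
for clip in clips:
    print(clip["label"], clip["url"], clip["start"], clip["end"])
```

For a real annotations file, the same function could be fed the file contents read from disk instead of `SAMPLE`.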

The structure of the json file is as follows, which should be easy to follow from the csv context:

"---QUuC4vJs": {
    "annotations": {
        "label": "testifying",
        "segment": [84.0, 94.0]
    },
    "duration": 10.0,
    "subset": "validate",
    "url": ""
},
"--GkrdYZ9Tc": {
    "annotations": {
        "label": "washing feet",
        "segment": [0.0, 10.0]
    },
    "duration": 10.0,
    "subset": "validate",
    "url": ""
}

The json files are much larger than the csv files, occupying 197.5 MB as opposed to 24.5 MB, so it might be a bit faster to read the data from csv than from json. However, most open source tools capable of downloading the Kinetics dataset from the annotations file use the json format, so you might need to pre-process the csv data into the correct format. Personally, I chose the json format because of the open source codebase I ended up using.
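If you do start from csv, the pre-processing step could look roughly like the sketch below. It assumes a header row with the field names listed earlier, assumes that "segment" holds the start and end times, and fills "url" with the standard YouTube watch URL, which is also an assumption. `csv_to_json_entries` is a hypothetical helper, not part of any downloader.

```python
import csv
import io
import json


def csv_to_json_entries(csv_text):
    """Convert csv annotation rows into the json-style dictionary layout."""
    entries = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        start, end = float(row["time_start"]), float(row["time_end"])
        entries[row["youtube_id"]] = {
            "annotations": {
                "label": row["label"],
                # Assumption: segment is [time_start, time_end] in seconds.
                "segment": [start, end],
            },
            "duration": end - start,
            "subset": row["split"],
            # Assumption: standard YouTube watch URL.
            "url": "https://www.youtube.com/watch?v=" + row["youtube_id"],
        }
    return entries


sample = (
    "label,youtube_id,time_start,time_end,split\n"
    "washing feet,--GkrdYZ9Tc,0,10,validate\n"
)
entries = csv_to_json_entries(sample)
print(json.dumps(entries, indent=4))
```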

Technological Environment

The data was downloaded primarily on a desktop computer running Ubuntu 18.04 with 16 GB of memory and a consistent internet connection at about 60 Mb/s download speed. However, some of the downloading was done on my MacBook Pro when I was not using it.

I did try to use AWS and Google Cloud, but there were significant throttling issues from YouTube, which will be addressed in the errors section.

Code base to download the data

The next thing to consider is the codebase to download the data. There are two main options:

  • Write the code base yourself.
  • Find an existing open source codebase and if necessary modify it as required.

The second option was chosen. The codebase selected was showmax/kinetics-downloader, and a fork of it was created in dancelogue/kinetics-datasets-downloader. The main requirements to use the codebase are python ≥ 3.4, ffmpeg and youtube-dl. youtube-dl does the actual download, while ffmpeg crops the video to the required segment, i.e. between the time_start and time_end times.
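To make the division of labour between the two tools concrete, here is a minimal sketch of the two commands involved. It is not taken from the downloader codebase; the function names and file paths are hypothetical, and only basic youtube-dl and ffmpeg options (`-o`, `-i`, `-ss`, `-t`, `-y`) are used.

```python
import subprocess


def download_command(url, raw_path):
    """youtube-dl command that saves the full video to raw_path."""
    return ["youtube-dl", "--quiet", "-o", raw_path, url]


def crop_command(raw_path, clip_path, time_start, time_end):
    """ffmpeg command that keeps only the annotated segment."""
    duration = time_end - time_start
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", raw_path,          # full downloaded video
        "-ss", str(time_start),  # seek to the start of the segment
        "-t", str(duration),     # keep this many seconds
        clip_path,
    ]


def fetch_clip(url, raw_path, clip_path, time_start, time_end):
    """Download the full video, then crop it (requires both tools installed)."""
    subprocess.run(download_command(url, raw_path), check=True)
    subprocess.run(crop_command(raw_path, clip_path, time_start, time_end), check=True)


# Inspect the crop command for the "air drumming" sample row (104 to 114 s).
print(crop_command("raw.mp4", "clip.mp4", 104, 114))
```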

How to use the codebase is covered in the file, so we will not delve into the code. It is worth noting, though, that it uses the python multiprocessing module, which I found to be necessary when downloading such a large dataset, and we will cover why in this blog.

Some modifications were made based on issues encountered when downloading the dataset. The modifications to the codebase include:

  • Ability to write to a stats.csv file to track how long each download took as well as how long ffmpeg took to crop each video. Unfortunately, the idea for this functionality only came after half the dataset had been downloaded, so the stats do not cover the whole sample, but they should be sufficient to gain insight into the download process.
  • Ability to write to a failed.csv file to record which videos had errors and what errors were returned.
  • Ability to pause the download process once throttling occurred.

The stats and failed logs were used to generate basic stats about the data and will hopefully help you make informed decisions if you choose to download the data yourself.

Overall Stats

YouTube is a dynamic platform, which means videos are added and removed all the time. Therefore, downloading the Kinetics dataset at different times will not give consistent results, because videos get taken down. The following pie chart shows the downloaded and missing videos in my Kinetics dataset.

The total number of downloaded videos is 631604, while 15380 videos failed, which means 2.37 % of the 646984 videos in the dataset could not be downloaded. I assume this is within acceptable error margins.

Total Download Duration

In order to figure out how long it would take to download the entire dataset, the download time and the time it took to crop each video (FFMPEG duration) were logged in seconds. As mentioned, stats were logged for only 298651 videos. The table below shows the mean and max of the individual processes.

Full refers to all of the logged data, while IQR refers to the data within the interquartile range. Restricting to the interquartile range was necessary to exclude extreme outliers, which show up as the high max values for the download duration and the FFMPEG duration. The theoretical time to download 646984 videos sequentially is:

  • Using the Full mean the anticipated download time is 176.1 days.
  • With the IQR mean the anticipated download time is 80.7 days.

This assumes the videos are downloaded one after another without any interruptions. Luckily, we have multiprocessing in our favour: I was running 16 separate processes using the python multiprocessing module.
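The arithmetic behind these estimates can be sketched as follows. The per-video means are back-derived from the day counts quoted above rather than read from the logs, and the parallel figure assumes perfect scaling across 16 processes with no throttling, which is an idealisation. The function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TOTAL_VIDEOS = 646984


def implied_seconds_per_video(sequential_days, n_videos=TOTAL_VIDEOS):
    """Mean time per video implied by a sequential estimate in days."""
    return sequential_days * SECONDS_PER_DAY / n_videos


def ideal_parallel_days(sequential_days, processes):
    """Wall-clock days if the work split perfectly across processes."""
    return sequential_days / processes


for name, days in [("Full", 176.1), ("IQR", 80.7)]:
    per_video = implied_seconds_per_video(days)
    parallel = ideal_parallel_days(days, 16)
    print("{}: {:.1f} s per video, {:.1f} days with 16 processes".format(
        name, per_video, parallel))
```

Under that idealised assumption the estimates drop to roughly 11 days (Full) and 5 days (IQR). In practice the throttling described later matters far more than the raw processing time.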

The pie chart below shows how the tasks split between download-dominant and ffmpeg (cropping) dominant.

It can be seen that for most videos the download itself dominates, while the ffmpeg step dominates only 1.67 % of the time. Thus the main bottleneck in the entire process is downloading the videos from YouTube.


One of the first mistakes I made when downloading the Kinetics dataset was downloading videos at a higher quality than necessary (this could explain why there were such extreme outliers).

Eventually I settled on videos with a max resolution of 360p. After all, these videos are meant to be consumed by machines and not people. Videos at this quality contain enough information to train the relevant deep learning algorithms, are significantly faster to download and crop, and take up less space on disk. It can be argued that an even lower resolution, i.e. 240p or 144p, could be tried as well, which might lead to significant space and time savings while maintaining the same baseline/benchmark accuracies.
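One way to cap the resolution is youtube-dl's format selection option (`-f`) with a height filter. The sketch below only builds the command; the helper names are hypothetical, and whether the forked downloader sets the cap in exactly this way is not something the text confirms.

```python
def format_selector(max_height=360):
    """youtube-dl format filter: best single file no taller than max_height."""
    return "best[height<={}]".format(max_height)


def capped_download_command(url, output_path, max_height=360):
    """youtube-dl command that downloads at most max_height pixels tall."""
    return ["youtube-dl", "-f", format_selector(max_height), "-o", output_path, url]


print(capped_download_command("some-url", "raw.mp4"))
print(format_selector(240))
```

Trying 240p or 144p would then only be a matter of changing `max_height`.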

Space Requirements

A quick calculation showed that the entire cropped dataset occupies 628.43 GB on disk. To download the dataset you probably need about 20 GB extra (depending on the number of concurrent downloads) to account for the full uncropped videos, which need to be stored temporarily.

Failed Video Downloads

The reasons 2.37 % of the videos failed to download were recorded and are shown in the following pie chart. The number next to each description in the legend is the total number of times that particular error occurred.

The majority of the errors are based on YouTube’s error messages, and each description stands for a group of errors. These are:

  • Video Not Available (10606) errors were by far the largest cause of failures and covered a variety of reasons, such as uploaders deleting their YouTube accounts or having their accounts deleted by YouTube, videos only being available in certain countries, videos being made private etc.
  • Content warning (2451) errors were probably due to age-restricted content; I assume an authenticated account whose holder is over 18 years of age was required.
  • HTTP Error 404 (943) errors could potentially be due to the Kinetics youtube_id being wrong, as 404 generally indicates a page not found error. I didn’t have time to investigate this hypothesis though.
  • Copyright (672) errors are videos removed due to copyright claims.
  • Video Removed By User (337) errors are, as the name states, videos deleted by the user.
  • Miscellaneous (144) errors were probably caused by the library being used, or the error reason could not be identified.
  • Violating YouTube Terms (134) errors were usually videos taken down for violating community guidelines on spam, racism, bullying, nudity etc.
  • Duplicate Video (2) errors seem to indicate that YouTube doesn’t allow duplicate videos.
  • HTTP Error 503 (1) occurred only once. A 503 status means the service is unavailable; I am not sure why this happened.
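To put the groups in proportion, the counts above can be tallied in a few lines. The counts are copied from the list; they sum to 15290, slightly below the 15380 failures reported earlier, so the shares in this sketch are taken relative to the listed total.

```python
# Failure counts per group, copied from the list above.
FAILURES = {
    "Video Not Available": 10606,
    "Content warning": 2451,
    "HTTP Error 404": 943,
    "Copyright": 672,
    "Video Removed By User": 337,
    "Miscellaneous": 144,
    "Violating YouTube Terms": 134,
    "Duplicate Video": 2,
    "HTTP Error 503": 1,
}

total = sum(FAILURES.values())
# Percentage share of each group relative to the listed total.
shares = {reason: 100 * count / total for reason, count in FAILURES.items()}

for reason, count in sorted(FAILURES.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<26}{:>6}{:>8.2f} %".format(reason, count, shares[reason]))
print("{:<26}{:>6}".format("Total listed", total))
```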

Even though there were issues downloading some videos, the failed videos formed only 2.37 % of the entire dataset, which can be considered within acceptable error margins. It is worth noting, though, that the fraction of failed videos will increase as more videos get taken down over time.

While these were the errors that prevented individual videos from being downloaded, one error proved to be the most frustrating part of downloading the YouTube videos: the dreaded 429 Too Many Requests error.

HTTP Error 429: Too Many Requests

It is by far the biggest pain point of downloading the Kinetics dataset, and it was a source of frustration for the duration of the download process.

The primary cause of this error is YouTube throttling requests, which I assume is done by blacklisting the requesting IP address. It makes sense for YouTube to throttle requests, for reasons such as reducing the load on its servers and preventing malicious parties from accessing the data. But it is a pain when downloading 650 000 video clips.

What makes it especially challenging is the time it took for the requesting IP address to be allowed again, i.e. the “cooling off” period. From experience, it took anywhere from 12 hours to 5 days. I wasn’t able to find a discernible pattern to get around it. The largest number of videos I was able to download in a single session before being throttled was 136963. The pie chart below shows the distribution between the runs (some of the runs were terminated manually rather than by throttling).

The throttling issue has been highlighted in different sources as a major hindrance when downloading data from YouTube.


As far as I could tell, the criteria by which an IP address is blacklisted are not clear. On my home desktop I could download over 50 000 videos before hitting the 429 error code, whereas on AWS or Google Cloud I could manage maybe 100 downloads before hitting the 429 error. Perhaps YouTube uses some criterion to immediately blacklist IP addresses from cloud VMs as opposed to personal machines.

When the HTTP Error 429 is encountered, it’s best to stop the download and either try again at a later time or change IP addresses.
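A minimal sketch of that "stop on throttling" behaviour is shown below. It is not the code from the forked downloader: `download_until_throttled` and its `download` callable are hypothetical, and it assumes the downloader raises an exception whose message contains "HTTP Error 429" when throttled.

```python
THROTTLE_MARKER = "HTTP Error 429"  # assumed to appear in the error message


def download_until_throttled(video_ids, download):
    """Download videos in order and stop as soon as throttling is detected.

    Returns (done, failed, remaining): finished ids, (id, error) pairs for
    ordinary failures, and the ids left to retry after the cool-off period.
    """
    video_ids = list(video_ids)
    done, failed = [], []
    for index, video_id in enumerate(video_ids):
        try:
            download(video_id)
        except Exception as error:
            message = str(error)
            if THROTTLE_MARKER in message:
                # Throttled: keep this id and everything after it for later.
                return done, failed, video_ids[index:]
            failed.append((video_id, message))
            continue
        done.append(video_id)
    return done, failed, []


def fake_download(video_id):
    """Stand-in for a real downloader, used only to exercise the loop."""
    if video_id == "b":
        raise RuntimeError("Video not available")
    if video_id == "c":
        raise RuntimeError("HTTP Error 429: Too Many Requests")


done, failed, remaining = download_until_throttled(["a", "b", "c", "d"], fake_download)
print(done, failed, remaining)
```

The `remaining` list can be written to disk so that a later run, or a run from a different IP address, picks up where this one stopped.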

The main viable option I was able to come up with was to change IP addresses by switching networks. Having 2 different operating systems (e.g. Windows and Ubuntu) on the same machine worked for some time. If all else fails, wait for the cool-off period.

As downloading the dataset was not a huge priority at the time, when all the networking workarounds were hitting the HTTP Error 429 status, I stopped the download and tried again a few days later. I didn’t explore other options such as using a VPN.


One major topic that hasn’t been covered so far is ethics, i.e. scraping YouTube videos. On the one hand, the annotations file for the videos exists and was provided by DeepMind, which is a subsidiary of Google. On the other hand, what are the rules for downloading the dataset, especially for deep learning research? There are quite a few papers out there which make use of the data, which shows that people are downloading it. It kind of feels like a head-in-the-sand scenario.

This could possibly be the reason why the video data has not been made publicly available, so anyone interested in deep learning must download it themselves. I believe there are several issues with this approach:

  • Firstly, it hinders deep learning research with video data, as the Kinetics dataset is not a trivial dataset to download.
  • Secondly, the datasets of 2 different researchers might differ due to missing videos, which means results reported in research papers might not be exactly reproducible.

I am not sure what the workaround for the ethical situation around making the data public can be, but hopefully DeepMind will make the video data easily accessible for non-commercial use.


Hopefully this blog has given you some insight into downloading the Kinetics dataset and the challenges you will face should you attempt it yourself.

I needed the Kinetics dataset because I spent the whole of 2019 on a personal project, building a “Shazam for dance” deep learning start-up. The Kinetics data was used to pre-train the dance algorithms as a proof of concept. I will soon be blogging about this process.

Follow me to get notified as I post the Shazam for dance using deep learning series of blogs.