Zero-Shot Action Recognition in Videos: A Survey

Original article was published on Deep Learning on Medium


Zero-Shot Learning (ZSL) aims at recognizing objects from unseen classes: it is the ability of a learner to identify classes from observed data without ever having observed that type of data during the training phase.

We all know how taxing the process of collecting, annotating, and labeling data for training purposes is. There is therefore strong demand for methods that can classify instances from classes not present in the training samples, especially in videos.

In this article, I aim to provide an overall review of ZSL, especially in videos: how ZSL reads data and relates it to the classes available for training, using visual and semantic embeddings.

The learner in ZSL reads the data, relates it to the classes available from training, and estimates the error to produce an accurate output. In simple terms, ZSL recognizes an action by its closest match among the available class representations. Hence, the accuracy of the technique is limited by the classes available in the training data. To overcome this obstacle, ZSL makes use of both visual embedding labels and semantic embedding labels in tandem, with the help of different mapping techniques.

Let us elaborate on what the above statement means:

Visual Embedding Labels: Visual embedding is the conversion of imagery into a machine-readable vector representation. After training on a large data set of images, the model identifies different actions and tries to classify them into the available classes to produce a result.

Semantic Embedding Labels: Semantic embedding is like visual embedding, with the only difference being that the model reads words (such as class labels or attribute descriptions) instead of images and classifies them by comparing them with the available data set.
Now how do both come into play when it comes to Zero-Shot Learning?

Photo Credit: https://arxiv.org/pdf/1909.06423.pdf

As illustrated in the picture above, ZSL extracts both the visual and the semantic information, identifies them as classes, and applies a mapping function between the two spaces. The mapping function outputs an estimate that is compared against the available class representations, and the class with the minimum estimation error is identified as the unseen class.
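To make the mapping step concrete, here is a minimal sketch in Python. The 3-d class embeddings, the class names, and the identity projection are all hypothetical stand-ins; a real system would learn the mapping function and use word vectors or attributes as class embeddings.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy semantic embeddings for class labels (hypothetical 3-d vectors;
# in practice these would come from word2vec/GloVe or attribute annotations).
class_embeddings = {
    "archery":  [0.9, 0.1, 0.0],
    "biking":   [0.1, 0.8, 0.2],
    "swimming": [0.0, 0.2, 0.9],
}

def predict_unseen(visual_feature, project, class_embeddings):
    # Map the visual feature into the semantic space, then pick the
    # class whose embedding is closest (minimum error = maximum similarity).
    z = project(visual_feature)
    return max(class_embeddings, key=lambda c: cosine(z, class_embeddings[c]))

# Identity projection as a stand-in for a learned mapping function.
identity = lambda x: x
print(predict_unseen([0.85, 0.05, 0.1], identity, class_embeddings))  # -> archery
```

Here the "unseen" class is recovered purely by proximity in the semantic space, which is exactly why the quality of both embeddings and of the mapping bounds the accuracy.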

Zero-Shot Recognition in Videos:
When compared to images, ZSL becomes considerably more difficult in videos due to the added dimension of the data. For instance, while a still image is constrained by Pixel x Pixel, a video is constrained by Pixel x Pixel x Number of Frames. This added dimension increases the complexity of the problem. Since the solution, as discussed above, is mapping together the visual and semantic labels, the method is constrained by three attributes:
A) The efficiency of the Visual Embedding and the Semantic Embedding Methods
B) The Mapping Techniques
C) The Data sets used

A) Visual Embedding and Semantic Embedding techniques

Visual Embedding Methods are usually categorized as:

1. Hand Crafted Methods

2. Deep Features Methods

All the different techniques used can usually be grouped into one of these two categories. Valter Luís Estevam Junior et al. surveyed the popularity of these methods across a pool of over 100 published papers and arrived at the analysis shown in the picture below:

photo credit: https://arxiv.org/pdf/1909.06423.pdf
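As a concrete illustration of the deep-feature route, a common pattern is to extract per-frame features with a pretrained CNN and pool them over time into a single video-level embedding. The sketch below uses toy 2-d frame features in place of real CNN outputs:

```python
def mean_pool(frame_features):
    # Temporal average pooling: collapse a list of per-frame feature
    # vectors into one video-level visual embedding.
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

# Hypothetical per-frame features (in practice produced by a deep network
# such as C3D, or a 2-D CNN applied frame by frame).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_embedding = mean_pool(frames)
print(video_embedding)
```

Hand-crafted methods differ mainly in how the per-frame (or spatio-temporal) features are computed; the pooling idea into a fixed-length vector is shared by both families.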

A similar analysis can be done for Semantic Embedding Methods as well. The Semantic techniques are categorized as:

1. Attribute

2. Word Embedding

photo credit: https://arxiv.org/pdf/1909.06423.pdf
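The two semantic families can be illustrated side by side. The attribute names and the toy word-vector values below are hypothetical stand-ins for a real annotated attribute set and a learned word embedding such as word2vec:

```python
# Two ways to embed the label "basketball" semantically.

# (1) Attribute vector: manually defined binary attributes
#     (hypothetical attribute set for illustration).
attributes = ["indoor", "uses_ball", "team_sport", "in_water"]
basketball_attr = [1, 1, 1, 0]

# (2) Word embedding: a dense vector learned from text corpora
#     (toy 4-d values standing in for a ~300-d word2vec vector).
basketball_w2v = [0.12, -0.48, 0.33, 0.05]

def describe(attr_vec, attr_names):
    # Human-readable view of an attribute embedding.
    return [name for name, v in zip(attr_names, attr_vec) if v]

print(describe(basketball_attr, attributes))  # -> ['indoor', 'uses_ball', 'team_sport']
```

Attribute vectors are interpretable but require manual annotation per class; word embeddings are free to obtain for any label but are harder to interpret.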

B) Mapping Techniques

The mapping techniques are broadly classified into:

1. Attribute-based methods

2. Word Label Embedding

3. Semantic Relationships

4. Objects as Attributes

5. Multi-Modal Learning

6. Generative Methods

These methods can be further broken down according to the aspects they consider, such as actions or activities, few-shot learning, and the transductive setting. Grouping all the methods together and splitting their aspects into pie charts gives a general idea of which aspects appear in the majority of methods; it can be observed that the actions aspect is covered by the majority of them.

photo credit: https://arxiv.org/pdf/1909.06423.pdf
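As a sketch of the attribute-based family, here is a direct-attribute-prediction style scorer: attribute probabilities are first predicted from the video, then each class is scored by how well its attribute signature matches. All attribute names, probabilities, and class signatures below are hypothetical:

```python
def dap_score(attr_probs, class_signature):
    # Direct-attribute-prediction style scoring: the probability that the
    # video exhibits exactly the attribute pattern of the class.
    score = 1.0
    for p, a in zip(attr_probs, class_signature):
        score *= p if a == 1 else (1.0 - p)
    return score

# Hypothetical attribute probabilities predicted from a video,
# e.g. [indoor, uses_ball, in_water].
attr_probs = [0.9, 0.8, 0.1]

class_signatures = {
    "basketball": [1, 1, 0],
    "swimming":   [0, 0, 1],
}

best = max(class_signatures, key=lambda c: dap_score(attr_probs, class_signatures[c]))
print(best)  # -> basketball
```

The other mapping families swap out this scoring step, for example ranking classes by word-vector similarity, exploiting detected objects as attributes, or generating synthetic features for unseen classes.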

C) Data sets Used
The data sets used to train the model are characterized by the actions performed in the videos, the quality of the videos, and the number of videos. The data sets used by researchers may differ in the classes they include, and a particular data set may be chosen depending on the application of the model. For example, a model might be trained on a cooking data set (comprising cooking activities) or the Olympic Sports data set (comprising videos of people performing sports activities), depending on whether it will be used for a cooking show or a sporting event. For general purposes, the most widely used data sets are HMDB51 and UCF101. General statistics of the data sets used are:

photo credit: https://arxiv.org/pdf/1909.06423.pdf
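Zero-shot evaluation on these data sets typically partitions the classes into disjoint seen and unseen sets; for UCF101 a 51/50 seen/unseen split is commonly used. A minimal sketch of such a split, with placeholder class names:

```python
import random

def zero_shot_split(class_names, n_seen, seed=0):
    # Randomly partition the dataset's classes into a seen set (used for
    # training) and an unseen set (used only at test time), as in common
    # ZSL evaluation protocols.
    rng = random.Random(seed)
    shuffled = class_names[:]
    rng.shuffle(shuffled)
    return sorted(shuffled[:n_seen]), sorted(shuffled[n_seen:])

# UCF101 has 101 action classes; class names here are placeholders.
classes = [f"action_{i:03d}" for i in range(101)]
seen, unseen = zero_shot_split(classes, n_seen=51)
print(len(seen), len(unseen))  # -> 51 50
```

Results are usually averaged over many random splits, since the difficulty of the task depends heavily on which classes end up unseen.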

Conclusion:

In this article, I tried to present ZSL methods for action recognition in videos, including visual and semantic feature extraction, several mapping techniques that bridge the semantic gap, and a few data sets together with general statistics on their usage.

Reference

Valter Luís Estevam Junior et al., "Zero-Shot Action Recognition in Videos: A Survey." https://arxiv.org/pdf/1909.06423.pdf