Human Activity Recognition (HAR) using Multi-Modal Attention

Source: Deep Learning on Medium

“A breakthrough in machine learning is worth 10 Microsofts”

-Bill Gates

What is HAR and why is it important?

Knowing what is happening in a video, a live stream or a movie is both an interesting and a useful task. It could help us better understand the huge volume of content available. In 2012, over 7.6 exabytes of data were being consumed every day. That number keeps growing, and the data itself has become denser and more complex, so it is close to impossible for a human to go through such rich content and share their understanding of it.

This is where an automated system can help. Advances in deep learning, primarily in computer vision and natural language processing, allow us to tackle this problem and make sense of rich multi-modal data.

The goal of human activity recognition is to identify human activities in everyday settings. It is a challenging problem because of the diverse range and complexity of human actions. The task is trivial for human beings given our cognitive ability, but it is difficult for a machine. Despite the incredible progress in image tasks over the past few years thanks to deep learning, progress in architectures for video tasks has been slow.

Activity recognition can aid in the understanding of video streams: it can be used in applications ranging from abandoned object detection to CCTV surveillance, anomaly detection and aggressive behaviour detection.

Related Works

AttnSense: Multi-level Attention Mechanism For Multimodal Human Activity Recognition

Ma et al. [8] proposed using sensor-level data (accelerometer, gyroscope, etc.) for human activity recognition. Multiple modalities are handled by subnet-level attention: an attention subnet is applied across the CNNs and a separate attention over the GRU, and the result is used to classify the action.

Multimodal Multi-stream Deep Learning for Egocentric Activity Recognition[9]

A fusion of CNNs based on optical flow, single frames, etc. is used together with a fusion of LSTMs based on sensor data. The two fused scores are then merged, and softmax values are used for the prediction.

Human Action Recognition using Multimodality

Chiu et al. [2] proposed a method for merging and analysing multiple modes of data to uncover suspicious human behaviour. These modalities include RGB videos, depth videos, skeleton positions, and inertial signals from a Kinect camera and a wearable inertial sensor for a comprehensive set of 27 human actions [1].

The Charades Dataset

The Charades dataset consists of videos of hundreds of people enacting the action given to them.

The Charades dataset is composed of 9848 videos of indoor activities collected through Amazon Mechanical Turk. The users were given a sentence and asked to record a video acting it out, resembling a game of Charades. Each video has been annotated using the consensus of 4 workers on the training set and 8 workers on the test set.

The dataset contains 66,500 temporal annotations for 157 action classes, 41,404 labels for 46 object classes and 27,847 textual descriptions of the videos. The videos are encoded in H.264 / MPEG-4 using ffmpeg and keep their original resolutions and frame rates. The dataset also contains the JPEG frames extracted from the videos at 24 fps.
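To make the temporal annotations concrete, here is a minimal sketch of how one might turn them into frame indices for the 24 fps JPEG frames. The string layout (`label start end` triples separated by semicolons), the `c092`-style class ids, and the names `ActionSegment`, `parse_actions` and `frame_range` are assumptions made for illustration; check them against the annotation files you actually download.

```python
from typing import List, NamedTuple

FPS = 24  # extraction rate of the JPEG frames, as stated above


class ActionSegment(NamedTuple):
    label: str    # action class id such as "c092" (assumed naming)
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds


def parse_actions(field: str) -> List[ActionSegment]:
    """Parse "label start end;label start end" (assumed layout) into segments."""
    segments = []
    for chunk in field.split(";"):
        parts = chunk.split()
        if len(parts) != 3:
            continue  # skip empty or malformed chunks
        label, start, end = parts
        segments.append(ActionSegment(label, float(start), float(end)))
    return segments


def frame_range(segment: ActionSegment, fps: int = FPS) -> range:
    """Zero-based frame indices covered by a segment, truncated to whole frames."""
    return range(int(segment.start * fps), int(segment.end * fps))


# Hypothetical annotation string, for illustration only.
segments = parse_actions("c092 0.50 2.00;c147 0.00 1.25")
for seg in segments:
    frames = frame_range(seg)
    print(seg.label, frames.start, frames.stop)
```

Because a video can carry several overlapping segments, the same frame may belong to more than one action class.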

The training data also includes additional fields such as:

  1. Quality: 7-point scale, 7 denoting highest, judged by an annotator.
  2. Relevance: 7-point scale, 7 denoting highest, judged by an annotator.
  3. Script: The sentence on which the video was based.
  4. Verified: Whether the annotator successfully verified that the video matches the script.
  5. Descriptions: List of descriptions by annotators watching the video.
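The fields above can be held in a small record type. The sketch below assumes a CSV file with lowercase column names, `Yes`/`No` in the verified column and semicolon-separated descriptions; those details, the `VideoMeta` and `parse_row` names, and the sample row are all illustrative assumptions rather than a description of the official loader.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VideoMeta:
    quality: int             # 1 to 7, 7 is best
    relevance: int           # 1 to 7, 7 is best
    script: str              # sentence the actor was asked to perform
    verified: bool           # annotator confirmed the video matches the script
    descriptions: List[str]  # free-text descriptions from annotators


def parse_row(row: Dict[str, str]) -> VideoMeta:
    """Convert one CSV row (assumed column names) into a typed record."""
    return VideoMeta(
        quality=int(row["quality"]),
        relevance=int(row["relevance"]),
        script=row["script"],
        verified=row["verified"].strip().lower() == "yes",
        descriptions=[d.strip() for d in row["descriptions"].split(";") if d.strip()],
    )


# Made-up sample row, for illustration only; not taken from the dataset.
SAMPLE_CSV = (
    "quality,relevance,script,verified,descriptions\n"
    '6,7,"A person opens a book and starts reading.",Yes,'
    '"A person reads a book;Someone sits and reads"\n'
)

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]

# Example use: keep only verified, reasonably relevant videos.
usable = [r for r in rows if r.verified and r.relevance >= 5]
print(len(usable), usable[0].script)
```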

Our methodology and novelty

A multi-modal, multi-level model is used for the task. Multi-modal refers to the use of more than one modality, e.g. image + audio or audio + text. As described above, our modalities are text (the script corresponding to the action) and the video of the action.

[Figure: The multi-modal network architecture]
[Figure: Text embedding generation for a sentence]

An overview of the procedure

  1. The extracted video frames (available in the dataset) are processed to ensure that there is no redundancy.
  2. These frames are then given as input to the MobileNet_V2 and Inception models to compute the embeddings.
  3. The script file is also processed and multi-level features are extracted from it:
     • The script is divided into sentences, and the corresponding sentence embeddings are computed using USE (Universal Sentence Encoder).
     • The sentences are further divided into words, and the corresponding word embeddings are extracted with GloVe.
  4. The extracted embeddings are fed into multiple BiLSTMs, the sentence embedding is fed into a feed-forward neural network, and the results are given as input to the attention layer.
  5. A residual connection is made from the inputs to the next attention layer, which also takes the categorical relevance and quality values.
  6. These features together are used for the output prediction.
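The attention step in the procedure above can be pictured as a softmax-weighted sum over per-modality feature vectors. The sketch below is a simplified illustration, not the authors' implementation: it assumes each modality has already been encoded to a vector of the same size, uses a fixed `query` vector where a real model would learn one, and leaves out the residual connection and the relevance and quality inputs. The names `attend`, `features` and `query` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    features: Dict[str, List[float]], query: List[float]
) -> Tuple[List[float], Dict[str, float]]:
    """Score each modality against the query, then return the weighted sum
    of the modality vectors together with the per-modality attention weights."""
    names = list(features)
    scores = [sum(q * x for q, x in zip(query, features[n])) for n in names]
    weights = softmax(scores)
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


# Toy vectors standing in for the encoder outputs (illustrative values only).
features = {
    "video": [1.0, 0.0],     # e.g. BiLSTM output over frame embeddings
    "words": [0.0, 1.0],     # e.g. BiLSTM output over word embeddings
    "sentence": [1.0, 1.0],  # e.g. feed-forward output over the sentence embedding
}
query = [1.0, 1.0]  # assumed fixed here; learned in a real model

fused, weights = attend(features, query)
print(weights)
print(fused)
```

The `weights` dictionary plays the role of the per-modality attention scores: a modality whose features align better with the query contributes more to the fused vector that feeds the classifier.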


[Figure: Cross-entropy loss plotted against the number of epochs (training set)]
[Figure: Cross-entropy loss plotted against the number of epochs (test set)]
[Figure: Comparing the loss on the test and the training set]
[Figure: Comparing training loss with the LSTMs and the CNN]
[Figure: Attention score for the multiple modalities]


The results show that the multimodal approach gives better prediction accuracy than a simple CNN-based model. The richer representation of the input (video plus text at the sentence and word level) helps the model generalise to new environments and unseen data. The attention-based method helps assign an appropriate weight to each modality.

Future Works

  • The work can be further extended by introducing attention between LSTM timesteps.
  • The attention block can be replaced with self-attention (constrained to access only the past, similar to a transformer decoder layer).
  • Meta-learning techniques can be used to identify similar tasks, which can help tackle the lack of data.
  • Reinforcement-learning-based attention models can be used to adapt better to new datasets.
  • Image and text embeddings from various other networks can be used.
  • Multiple camera views (egocentric, wide angle, etc.) can be combined to better understand the annotated activity.
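As a rough picture of the self-attention idea listed above, the sketch below applies scaled dot-product attention over a sequence of timestep vectors, with each step allowed to look only at itself and earlier steps. It is a simplified illustration under stated assumptions: queries, keys and values are the raw inputs (a transformer decoder would apply learned projections first), and the function name `causal_self_attention` is hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(steps: List[List[float]]) -> List[List[float]]:
    """For each timestep t, attend over timesteps 0..t only (no access to the future)."""
    dim = len(steps[0])
    scale = math.sqrt(dim)
    outputs = []
    for t, query in enumerate(steps):
        visible = steps[: t + 1]  # the causal constraint: past and present only
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in visible]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, visible)) for i in range(dim)]
        )
    return outputs


# Toy sequence of three timestep vectors (illustrative values only).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(sequence)
for step in attended:
    print(step)
```

Because of the causal constraint, changing a later timestep never changes the output at an earlier one.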


  1. C. Chen, R. Jafari, and N. Kehtarnavaz, “UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2015, pp. 168–172.
  2. C. Chiu, J. Zhan, and F. Zhan, “Uncovering suspicious activity from partially paired and incomplete multimodal data,” IEEE Access, vol. 5, pp. 13689–13698, 2017.
  3. A. Shahroudy, T. T. Ng, Q. Yang, and G. Wang, “Multimodal multipart learning for action recognition in depth videos,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2123–2129, Oct. 2016.
  4. M. P. Fortin and B. Chaib-draa, “Multimodal multitask emotion recognition using images, texts and tags,” in WCRML ’19, 2019.
  5. S. Zhang, S. Zhang, T. Huang, and W. Gao, “Multimodal deep convolutional neural network for audio-visual emotion recognition,” in Proc. 2016 ACM Int. Conf. Multimedia Retrieval (ICMR ’16), ACM, New York, NY, USA, 2016, pp. 281–284.
  6. K. Chen, T. Bui, C. Fang, Z. Wang, and R. Nevatia, “AMC: Attention guided multi-modal correlation learning for image search,” in CVPR, 2017.
  7. C. Hori, T. Hori, T. Y. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi, “Attention-based multimodal fusion for video description,” in 2017 IEEE International Conference on Computer Vision (ICCV).
  8. H. Ma, W. Li, X. Zhang, S. Gao, and S. Lu, “AttnSense: Multi-level attention mechanism for multimodal human activity recognition,” in Proc. 28th Int. Joint Conf. Artificial Intelligence, AAAI Press, 2019, pp. 3109–3115.