Guide to Multimodal Machine Learning

Original article was published by Parth Chokhra on Deep Learning on Medium

Analysing Text and Image at the same time!

Meme with the same text but different meaning. Source: Author of this post

My interest in multimodal learning started with Facebook’s recent Hateful Memes Challenge 2020 on DrivenData. The challenge is about building an effective tool for detecting hate speech, one that can understand content the way people do. It is a pretty cool challenge because it uses both text and image to analyse content, which is similar to what humans do. Let’s dive into Multimodal Machine Learning to see what it actually is.

Multimodal Learning

By definition, multimodal means communicating through a combination of two or more modes. Modes include written language, spoken language, and patterns of meaning that are visual, audio, gestural, tactile and spatial.

In order to create an Artificial Intelligence (even A.G.I 🤩) that is on par with humans, we need AI to understand, interpret and reason with multimodal messages. Multimodal machine learning aims to build models that can process and relate information from multiple modalities.

To understand how to approach this problem, we first need to understand the challenges that must be addressed in Multimodal Machine Learning.

The challenges of Multimodal AI

Representation: The first and foremost difficulty is how to represent and summarize multiple modalities in a way that exploits their complementary and redundant nature. Usually, all the modes of information we take into account point towards a single piece of information: the lip movements we see and the sound we hear from a person represent the same thing. Using both together gives us the robustness that helps us understand what the other person wants to convey. So the first challenge is how to combine multimodal data. For example, language is often symbolic, while audio and visual modalities are represented as signals. How can we combine them?
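
One simple way to picture a joint representation is to turn each modality into a fixed-length numeric vector and concatenate the two. The sketch below is only an illustration under loud assumptions: the bag-of-words text encoder, the toy image feature (mean intensity per pixel row), the caption, and the names VOCAB, encode_text, encode_image and joint_representation are all made up for this example. A real system would use learned embeddings instead.

```python
# Hypothetical toy encoders; real systems would use learned embeddings.
VOCAB = ["look", "how", "many", "people", "love", "you"]

def encode_text(text):
    """Symbolic modality -> bag-of-words count vector over VOCAB."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]

def encode_image(pixels):
    """Signal modality -> mean intensity of each pixel row (toy feature)."""
    return [sum(row) / len(row) for row in pixels]

def l2_normalize(vec):
    """Scale a vector to unit length so neither modality dominates."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)

def joint_representation(text, pixels):
    """Concatenate the normalized per-modality vectors into one vector."""
    return l2_normalize(encode_text(text)) + l2_normalize(encode_image(pixels))

caption = "look how many people love you"   # made-up meme text
image = [[0.0, 0.5], [1.0, 1.0]]            # made-up 2x2 grayscale image
joint = joint_representation(caption, image)
print(len(joint))  # 6 text dimensions + 2 image dimensions
```

Normalizing each part before concatenation is one possible design choice here: it keeps a modality with large raw values from swamping the other one.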

Alignment: Secondly, we need to identify the direct relations between sub-elements from different modalities. Let’s make this easy with a real-life example. We have a video on how to complete a cooking recipe, and we also have its subtitles. To follow along, we need to match the steps shown in the video with the subtitles to make complete sense of what’s going on. This is known as alignment. How do we align different modalities and deal with possible long-range dependencies and ambiguities?
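
To make the cooking-video example concrete, here is a minimal sketch of one classic alignment technique, dynamic time warping (DTW). It is not the method used by any particular system. It assumes each video segment and each written step has already been reduced to a single number (a hypothetical "step feature"), and the function name dtw_align and the sample values are made up for illustration.

```python
def dtw_align(video, steps):
    """Align two 1-D feature sequences with dynamic time warping.

    Returns (total_cost, path), where path is a list of
    (video_index, step_index) pairs in order.
    """
    n, m = len(video), len(steps)
    inf = float("inf")
    # cost[i][j] = best cost of aligning video[:i] with steps[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(video[i - 1] - steps[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # video advances
                                 cost[i][j - 1],      # text advances
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover which segment matches which step.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves)
    path.reverse()
    return cost[n][m], path

# Five video segments, three written steps (made-up feature values).
video_feats = [1.0, 1.0, 2.0, 3.0, 3.0]
step_feats = [1.0, 2.0, 3.0]
total, path = dtw_align(video_feats, step_feats)
print(total, path)
# 0.0 [(0, 0), (1, 0), (2, 1), (3, 2), (4, 2)]
```

Here the first two video segments map to step 0, the third to step 1, and the last two to step 2, which is the kind of many-to-one matching a long video and a short list of steps needs.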

Translation: This is the process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective. At some point, we might need to convert one form of information to another. Image captioning is a prime example of this. But there are a number of correct ways to describe an image, and one perfect translation may not exist. So how do we map data from one modality to another?

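Fusion: The fourth challenge is to join information from two or more modalities to perform a prediction. The competition discussed above, Facebook AI’s Hateful Memes Challenge, is one example of it. Usually, we divide fusion techniques into two parts: Early Fusion and Late Fusion (model-agnostic approaches). Early fusion combines the features of each modality before a single model sees them, while late fusion runs a separate model per modality and combines their predictions.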

Early Fusion And Late Fusion. Source: Author of this post
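
The difference between the two fusion styles can be sketched in a few lines. This is a toy illustration, not the approach of any challenge entry: the feature values and weights are made up, the logistic model has fixed weights rather than trained ones, and the names linear_score, early_fusion and late_fusion are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(features, weights, bias=0.0):
    """Probability from a fixed-weight logistic model (weights are made up)."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)

def early_fusion(text_feats, image_feats, weights):
    """Concatenate the modality features first, then run ONE model."""
    fused = text_feats + image_feats
    return linear_score(fused, weights)

def late_fusion(text_feats, image_feats, text_weights, image_weights):
    """Run one model PER modality, then average their predictions."""
    p_text = linear_score(text_feats, text_weights)
    p_image = linear_score(image_feats, image_weights)
    return (p_text + p_image) / 2.0

text_feats = [0.2, 0.9]        # hypothetical text features
image_feats = [0.7, 0.1, 0.4]  # hypothetical image features

p_early = early_fusion(text_feats, image_feats, [0.5, -1.0, 1.5, 0.3, -0.2])
p_late = late_fusion(text_feats, image_feats, [0.5, -1.0], [1.5, 0.3, -0.2])
print(round(p_early, 3), round(p_late, 3))
```

In general, early fusion gives the single model a chance to pick up interactions between text and image features, while late fusion keeps the per-modality models independent, which can be easier to train and still works when one modality is missing.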

Co-Learning: This is about transferring knowledge between modalities, including their representations and predictive models. This is an interesting one because sometimes we have a unimodal problem, and what we want from other modalities is some extra information at training time so that our system performs better at testing time.

If Multimodal Machine Learning got you hooked after reading this, I would suggest going through the CMU Multimodal Machine Learning course. Link in the reference.