Pythia (Facebook)— Greek god doing Deep learning

Source: Deep Learning on Medium

“Artificial Intelligence” in 2019 has been exciting. Can it get more exciting than this? Guess what: I found an answer, and it comes from Facebook Research.

By Sai Nath

Recently, Facebook open-sourced its new framework, Pythia.

Pythia: Demo

ELI5: Give Pythia an image, shoot questions at it, and get them answered. Sounds interesting, doesn’t it? But Pythia can do more than this.

Pythia is a new multimodal research framework, built on top of PyTorch, for supercharging vision and language tasks.

Is there a reason behind naming it Pythia?

The name ‘Pythia’ is an homage to the Oracle of Apollo at Delphi, who answered questions in Ancient Greece.

Lycurgus Consulting the Pythia

Ever wondered how great it would be to have a single framework that handles both NLP and vision tasks easily? That is what Pythia does: it is designed for answering questions about visual data and for automatically generating image captions.

CloudCV has a demo of Pythia up and running, which you can try out.

Pythia incorporates elements of Facebook’s winning entries in recent AI competitions (the VQA Challenge 2018 and the VizWiz Challenge 2018), built by the A-STAR (Agents that See, Talk, Act, and Reason) team at Facebook AI Research (FAIR).

Visual Question Answering (VQA) Challenge: Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Pythia achieved an accuracy of 70.24% on the VQA v2.0 dataset.
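To give the accuracy figure some context, here is a minimal sketch of how a single VQA answer is commonly scored: each question comes with ten human answers, and a prediction earns full credit once at least three annotators gave the same answer. This is a simplification for illustration; the official evaluation also normalizes answer strings and averages over subsets of annotators, and the function name `vqa_accuracy` is my own.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA score for one question: min(#matching humans / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Ten made-up annotator answers for one question.
human = ["yes"] * 2 + ["no"] * 8

score_no = vqa_accuracy("no", human)        # 8 matches, capped at 1.0
score_yes = vqa_accuracy("yes", human)      # 2 matches, partial credit
score_maybe = vqa_accuracy("maybe", human)  # nobody said this
```

A dataset-level accuracy is then the average of these per-question scores.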

In addition to multitasking, Pythia also supports distributed training and a variety of datasets, as well as custom losses, metrics, scheduling, and optimizers.

Now let’s get into the technical details.

1. Bottom-up and Top-down attention

The critical idea in the up-down model is to use an object detector (Faster R-CNN pre-trained on the Visual Genome dataset, with a ResNet-101 backbone) to extract image features with bottom-up attention, i.e., purely visual, feed-forward attention. Visual Genome is a knowledge base that connects structured image concepts to language.

Visual Genome Dataset

The question text is then used to compute the top-down attention, i.e., task-specific attention, for each object in the image. Multi-modal fusion is done through a simple Hadamard product (also called the Schur product, i.e., element-wise multiplication), followed by a multi-label classifier that uses a sigmoid activation function to predict the answer scores. With an ensemble of 30 models, the up-down model reached 70.34% on the VQA 2.0 test-std split.

Hadamard product
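The fusion step is easy to picture in a few lines of code. Below is a minimal, framework-free sketch, not Pythia's actual implementation: the feature vectors, the classifier weights and the function names are all made up for illustration. It multiplies the image and question features element-wise, then scores each candidate answer independently with a sigmoid.

```python
import math


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def answer_scores(image_feat, question_feat, weights, biases):
    """Fuse both modalities, then score every candidate answer independently."""
    fused = hadamard(image_feat, question_feat)
    return [
        sigmoid(sum(w * f for w, f in zip(row, fused)) + b)
        for row, b in zip(weights, biases)
    ]


# Toy 3-dimensional features and 2 candidate answers (all values are made up).
image_feat = [1.0, 2.0, 0.5]
question_feat = [0.5, 1.0, 2.0]
weights = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]  # hypothetical classifier weights
biases = [0.0, 0.0]

fused = hadamard(image_feat, question_feat)
scores = answer_scores(image_feat, question_feat, weights, biases)
```

Because each answer gets its own sigmoid rather than sharing a softmax, the scores do not have to sum to 1, which is what makes the classifier multi-label.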

2. Model Architecture

Weight normalization followed by ReLU was used instead of the gated hyperbolic tangent activation, to reduce the computation involved.

As is evident from what Pythia does, it needs some way of connecting text and image. When computing the top-down (task-specific) attention, the features from the text and visual modalities are combined by element-wise multiplication instead of plain feature concatenation.

How do we represent the question? The words are embedded with GloVe (Global Vectors), and the word embeddings are passed to a GRU network (Gated Recurrent Unit, a gated recurrent cell similar to an LSTM but with only update and reset gates and no separate cell state) and a question attention module, which extracts the attentive text features and discards unnecessary ones.
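To make the attention step concrete, here is a rough sketch of top-down attention with element-wise multiplication, written in plain Python. It is a simplification under my own assumptions: the real model applies learned projections to both modalities first, and `attn_weights`, the softmax normalization and the toy numbers are illustrative choices rather than Pythia's code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_down_attention(region_feats, question_feat, attn_weights):
    """Weight each image region by its relevance to the question.

    region_feats: one feature vector per detected object region
    question_feat: question embedding with the same dimension as a region vector
    attn_weights: hypothetical learned vector turning a joint vector into a logit
    """
    logits = []
    for region in region_feats:
        # Combine the modalities by element-wise multiplication, not concatenation.
        joint = [r * q for r, q in zip(region, question_feat)]
        logits.append(sum(w * j for w, j in zip(attn_weights, joint)))
    alphas = softmax(logits)
    dim = len(question_feat)
    # Attention-weighted sum of the region features.
    attended = [
        sum(alpha * region[d] for alpha, region in zip(alphas, region_feats))
        for d in range(dim)
    ]
    return alphas, attended


# Three made-up regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [2.0, 0.0]  # this question only "cares" about the first feature
attn_w = [1.0, 1.0]

alphas, attended = top_down_attention(regions, question, attn_w)
```

In this toy setup the second region gets the lowest attention weight, since it has nothing in the feature the question cares about.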

3. Optimizer

The optimizer used is Adamax (a variant of Adam based on the infinity norm). This fantastic post gives an overview of most of the optimizers.
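For the curious, the Adamax update rule is short enough to write out by hand. The sketch below follows the rule from the Adam paper using plain Python lists; the small `eps` in the denominator is my own addition to avoid dividing by zero, and the quadratic toy problem is only there to show the optimizer moving toward a minimum. In practice you would use the optimizer that ships with your framework (PyTorch provides `torch.optim.Adamax`).

```python
def adamax_step(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update on lists of floats; t is the 1-based step count."""
    # Exponential moving average of the gradient (first moment), as in Adam.
    m = [beta1 * m_i + (1.0 - beta1) * g for m_i, g in zip(m, grad)]
    # Infinity-norm accumulator: a decayed running maximum of |gradient|.
    u = [max(beta2 * u_i, abs(g)) for u_i, g in zip(u, grad)]
    # Bias correction is only needed for the first moment.
    step_size = lr / (1.0 - beta1 ** t)
    theta = [p - step_size * m_i / (u_i + eps) for p, m_i, u_i in zip(theta, m, u)]
    return theta, m, u


# A single step from x = 1.0 on f(x) = x^2, whose gradient is 2x.
first_theta, first_m, first_u = adamax_step([1.0], [2.0], [0.0], [0.0], t=1, lr=0.01)

# Toy problem: minimize f(x) = x^2 starting from x = 1.0.
theta, m, u = [1.0], [0.0], [0.0]
for t in range(1, 501):
    grad = [2.0 * theta[0]]
    theta, m, u = adamax_step(theta, grad, m, u, t, lr=0.01)
```

Compared with Adam, the squared-gradient average is replaced by a running maximum, so the step size is bounded by the learning rate and needs no second bias correction.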

Why does Pythia matter?

Pythia smooths the process of entering the growing subfield of vision and language and frees researchers to focus on faster prototyping and experimentation. Our goal is to accelerate progress by increasing the reproducibility of these models and results. This will make it easier for the community to build on, and benchmark against, successful systems. We hope that removing some of the obstacles will allow researchers to more quickly develop new ways for people and intelligent machines to communicate. This work should also help researchers develop adaptive AI that synthesizes multiple kinds of understanding into a more context-based, multimodal understanding. In addition to this open source release, we plan to continue adding tools, tasks, data sets, and reference models.(source-

Features of Pythia

  • Model Zoo: Reference implementations for state-of-the-art vision and language models
  • Multi-Tasking: Support for multi-tasking which allows training on multiple datasets together
  • Datasets: Includes built-in support for various datasets, including VQA, VizWiz, TextVQA, VisualDialog and COCO Captioning.
  • Modules: Provides implementations for many commonly used layers in vision and language domain
  • Distributed: Support for distributed training based on DataParallel as well as DistributedDataParallel.
  • Unopinionated: Unopinionated about the dataset and model implementations built on top of it.
  • Customization: Custom losses, metrics, scheduling, optimizers, TensorBoard; suits all your custom needs.
Source: GitHub: Pythia

With open source driving the AI community, Pythia can be seen as a step toward democratizing AI, which was earlier confined within the walls of research labs and large companies. This easy-to-use framework can give birth to exciting projects from independent groups and researchers, thus accelerating progress in this field.

Stay tuned for Part 2 of this article, where we go through setting up Pythia and running it on Tensorpad.


