Facebook AI & Retrieval-Augmented Generation (RAG)

Original article was published by Alex Moltzau 莫战 on Artificial Intelligence on Medium


A new open-source model released through the Hugging Face Transformers library in 2020

There is so much we do not understand. At times it all seems to blend together. Yet, there is a wish to aggregate our language, all of our human communication, and make sense of it.

How much language data does Facebook process?

According to Statista, Facebook had over 2.7 billion monthly active users as of the second quarter of 2020, and is the biggest social network worldwide.

Hard to say. Truly, it is hard to say just how much data is flowing through Facebook.

The Hive

An article on Kinsta has gathered a variety of stats about Facebook and it has a section on Data and Usage [bold added]:

“Facebook generates 4 petabytes of data per day — that’s a million gigabytes. All that data is stored in what is known as the Hive…

…which contains about 300 petabytes of data. This enormous amount of content generation is without a doubt connected to the fact that Facebook users spend more time on the site than users spend on any other social network, putting in about an hour a day.”

I found a post by Facebook Engineering from 2009, and then the Wikipedia article about Apache Hive, which was originally developed at Facebook.

It has quite a weird logo, an elephant-wasp.

A bit later I found the project on GitHub.

“The Apache Hive (TM) data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.”

Regardless, this is a digression…

What I wanted to write about is Facebook developing new software!

RAG framework for AI

Facebook has designed a novel framework for AI that can create more intelligent natural language processing (NLP) models.

Facebook announced its new Retrieval-Augmented Generation (RAG) architecture.

It is being released as part of the open-source Hugging Face Transformers library.
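To make this concrete, here is a minimal sketch of loading the released RAG model through Transformers. The model names and flags follow the Hugging Face model hub and the library's documentation at the time; exact API details may vary between library versions, and the retriever additionally needs the datasets and faiss packages installed.

```python
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

# Pretrained RAG checkpoint published by Facebook on the Hugging Face model hub.
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")

# use_dummy_dataset avoids downloading the full Wikipedia passage index,
# which is enough to see the retrieve-then-generate pipeline end to end.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

# Ask a question: the retriever fetches supporting passages and the
# seq2seq generator produces an answer conditioned on them.
inputs = tokenizer(
    "who wrote the paper on retrieval-augmented generation?", return_tensors="pt"
)
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```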

Natural language in NLP largely means human language, the way you and I communicate through words or utterances. First and foremost, this is often thought of in terms of words in different human languages.

How do we understand words and what meanings do words hold?

This may seem like an easy question when you talk, but creating algorithms that make sense of words and sentences, whether a handful or billions, and generate insight from them is quite a challenging task!

There is so much context that goes into different utterances.

What has changed now for Facebook?

RAG combines an information retrieval component with a text generator, so that models can give more accurate answers to questions without having to be ‘constantly retrained’.

RAG is based on the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks by Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel and Douwe Kiela.

“Retrieval-augmented generation (“RAG”) models combine the powers of pretrained dense retrieval (DPR) and sequence-to-sequence models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.”
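The word “marginalize” is doing a lot of work here: the model does not commit to a single retrieved document, but sums over the top-k retrieved documents, weighting each possible answer by how relevant the retriever judges that document to be. Roughly, in the paper’s RAG-Sequence formulation, with x the input, z a retrieved document, y the output, p_η the retriever and p_θ the generator:

$$
p(y \mid x) \;\approx\; \sum_{z \in \mathrm{top}\text{-}k\, p_\eta(\cdot \mid x)} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})
$$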

To understand this statement it may be useful to retrieve a few descriptions of what these terms entail.

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It is based on the following paper:

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020.
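The core idea of dense retrieval is that questions and passages are embedded by two separate encoders and compared with a simple dot product, so the most relevant passages can be found by vector search. Below is a small, hedged sketch using the DPR checkpoints Facebook published on the Hugging Face hub; the passages themselves are made up for illustration.

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# One encoder for questions, one for passages (contexts).
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = "What is Apache Hive?"
passages = [
    "Apache Hive is a data warehouse built on top of Apache Hadoop.",
    "Facebook had over 2.7 billion monthly active users in 2020.",
]

# Embed the question and the candidate passages as dense vectors.
q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
ctx_emb = ctx_encoder(**ctx_tokenizer(passages, return_tensors="pt", padding=True)).pooler_output

# Rank passages by dot-product similarity with the question embedding.
scores = torch.matmul(q_emb, ctx_emb.T)
print(passages[scores.argmax().item()])
```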

Sequence-to-sequence: a typical sequence-to-sequence model has two parts, an encoder and a decoder. The two parts are essentially separate neural network models combined into one larger network. Broadly, the task of the encoder network is to understand the input sequence and create a smaller-dimensional representation of it; the decoder then generates the output sequence from that representation.
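Since RAG’s generator is BART-based, BART makes a convenient illustration of the encoder-decoder split. This is just an illustrative sketch of the seq2seq idea, not the RAG pipeline itself.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

inputs = tokenizer("Facebook released the RAG architecture in 2020.", return_tensors="pt")

# The encoder compresses the input sequence into hidden states...
encoder_states = model.get_encoder()(**inputs).last_hidden_state
print(encoder_states.shape)  # (batch, sequence_length, hidden_size)

# ...and the decoder generates an output sequence conditioned on those states.
generated = model.generate(**inputs, max_length=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```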

Furthermore, this sentence may need some exploring:

“The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.”
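In practice, “fine-tuned jointly” means the whole retrieve-then-generate model exposes a single training loss, so gradients flow back into both the retriever’s question encoder and the generator, while the document index itself stays fixed. Here is a hedged sketch following the 2020-era Transformers documentation, with a made-up training pair and an illustrative learning rate.

```python
import torch
from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative value

# One (question, answer) training pair, invented for illustration.
input_dict = tokenizer.prepare_seq2seq_batch(
    ["who wrote the paper on retrieval-augmented generation?"],
    ["Patrick Lewis and colleagues"],
    return_tensors="pt",
)

# A single loss covers retrieval weighting and generation together.
outputs = model(input_ids=input_dict["input_ids"], labels=input_dict["labels"])
outputs.loss.backward()  # gradients reach the query encoder and the generator
optimizer.step()
```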

It may be helpful to look at a figure from their paper: