NLP News Cypher | 10.11.20

Original article was published by Quantum Stat on Artificial Intelligence on Medium

Legal BERT

Straight out of EMNLP, we now have pre-trained models for the legal domain called Legal BERT! These models were trained with an eye toward applications in legal research, computational law, and legal technology. For pre-training, the model was exposed to 12 GB of English legal text derived from legislation, court cases, and contracts. 👇

pre-training corpora

Where were some of the best performance gains?

“Performance gains are stronger in the most challenging end-tasks (i.e., multi-label classification in ECHR-CASES and contract header, and lease details in CONTRACTS-NER)”

*The model was evaluated on text classification and sequence tagging tasks.


The Models:
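A minimal sketch of loading one of these checkpoints with Hugging Face Transformers — `nlpaueb/legal-bert-base-uncased` is the base checkpoint the authors published on the model hub; swap in another variant if you need the contracts or case-law flavor:

```python
# Sketch: loading the Legal-BERT tokenizer from the Hugging Face hub.
# "nlpaueb/legal-bert-base-uncased" is the authors' published base checkpoint;
# the full model loads the same way via AutoModel.from_pretrained(model_name).
from transformers import AutoTokenizer

model_name = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The legal vocabulary keeps domain terms more intact than vanilla BERT would.
tokens = tokenizer.tokenize("The lessee shall indemnify the lessor.")
print(tokens)
```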

Indic BERT

If you are interested in Indic languages, check out the Indic BERT library, built on HF transformers 👀. Its multilingual ALBERT model supports 12 languages and was trained on a custom 9-billion-token corpus. The library holds a multitude of evaluation tasks:

News Category Classification, Named Entity Recognition, Headline Prediction, Wikipedia Section Title Prediction, Cloze-style Question Answering (WCQA), Cross-lingual Sentence Retrieval (XSR), and many more.


Thank you Tathagata for forwarding their model to us!

Colab of the Week

Intent Detection in the Wild

Bottom line: we need more real-world* datasets. In this recent Haptik paper, the authors showed how 4 NLU platforms (RASA, Dialogflow, LUIS, Haptik) and BERT performed when given 3 real-world datasets containing in-scope and out-of-scope queries. (Results were mixed, given the difficulty of generalizing to the test sets.)

What kind of datasets?

“Each dataset contains diverse set of intents in a single domain — mattress products retail, fitness supplements retail and online gaming…”

*Real-world here means real user queries as opposed to crowdsourcing.
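The in-scope/out-of-scope setup can be sketched with a simple confidence threshold: score a query against known intents and route anything below the threshold to out-of-scope. The intent names and keyword "scores" below are toy placeholders, not what the paper's NLU platforms actually do:

```python
# Sketch: thresholding intent-classifier confidence to flag out-of-scope queries.
# The intents and keyword "scores" are toy placeholders; a real system would
# use probabilities from a trained NLU model.
def score_intents(query, intent_keywords):
    """Crude per-intent score: fraction of the intent's keywords present."""
    words = set(query.lower().split())
    return {
        intent: len(words & keywords) / len(keywords)
        for intent, keywords in intent_keywords.items()
    }

def classify(query, intent_keywords, threshold=0.5):
    scores = score_intents(query, intent_keywords)
    best_intent = max(scores, key=scores.get)
    if scores[best_intent] < threshold:
        return "out_of_scope"   # low confidence -> don't force an intent
    return best_intent

INTENTS = {
    "order_status": {"order", "track", "status"},
    "product_info": {"mattress", "size", "firmness"},
}

print(classify("track my order status", INTENTS))      # an in-scope query
print(classify("what's the weather today?", INTENTS))  # an out-of-scope query
```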

Find the data here:



Have you heard of Wikipedia2Vec? It's been around for a couple of years now. It contains embeddings of words and of concepts that have corresponding pages in Wikipedia. Since Wikipedia is one of the most researched datasets in IR, this may come in handy. The embeddings come in 12 languages, and the library includes an API.
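Because words and entities live in one shared vector space (the library exposes accessors like `Wikipedia2Vec.load(...)`, `get_word_vector`, and `get_entity_vector`), you can compare them directly. Here is a minimal sketch of that comparison, with dummy NumPy vectors standing in for real embeddings so no pretrained file is needed:

```python
import numpy as np

# Dummy stand-ins for vectors you'd get from a pretrained Wikipedia2Vec file,
# e.g. wiki2vec.get_word_vector("jazz") / wiki2vec.get_entity_vector("Miles Davis").
word_vec = np.array([0.9, 0.1, 0.3])
entity_vec = np.array([0.8, 0.2, 0.4])

def cosine_similarity(a, b):
    """Standard cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words and entities share one vector space, so this comparison is meaningful.
print(round(cosine_similarity(word_vec, entity_vec), 3))
```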

Applications of Wikipedia2Vec:


Data Augmentation for Text

NLPaug is a handy library for data augmentation: it lets you inject noise into your dataset at the character or word level, which can improve model performance.

Here’s an example of what I mean:

Here are a few of its features:

  • Character: OCR Augmenter, QWERTY Augmenter and Random Character Augmenter
  • Word: WordNet Augmenter, word2vec Augmenter, GloVe Augmenter, fasttext Augmenter, BERT Augmenter, Random Word Augmenter
  • Flow: Sequential Augmenter, Sometimes Augmenter
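To make the character-level idea concrete, here is a tiny, dependency-free sketch of what a QWERTY-style augmenter does: randomly swap characters for keyboard neighbors to simulate typos. The neighbor table is a small hand-rolled subset, not nlpaug's actual map:

```python
import random

# Toy keyboard-neighbor table (a tiny subset; nlpaug ships a full QWERTY map).
QWERTY_NEIGHBORS = {
    "a": "qsz", "e": "wrd", "o": "ipl", "t": "ryg", "n": "bhm",
}

def qwerty_augment(text, prob=0.3, seed=0):
    """Randomly replace characters with a QWERTY neighbor, like a typo."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    out = []
    for ch in text:
        if ch in QWERTY_NEIGHBORS and rng.random() < prob:
            out.append(rng.choice(QWERTY_NEIGHBORS[ch]))
        else:
            out.append(ch)
    return "".join(out)

print(qwerty_augment("the quick brown fox"))
```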

Blog post on the library:


Secret: There’s an nlpaug colab in the SDNR 👆

State of AI Report 2020

The annual State of AI Report is out, and NLP is winning big.

TL;DR on the NLP side of things:

  • Only 15% of papers publish their code.
  • Facebook’s PyTorch is fast outpacing Google’s TensorFlow in research papers.
  • A larger model needs less data than a smaller one to achieve the same performance.
  • Biology is experiencing its “AI moment”: over 21,000 papers in 2020 alone.
  • Brain drain of academics leaving universities for tech companies.
  • A rise of MLOps in the enterprise.
  • NLP is used to automate the quantification of a company’s Environmental, Social and Governance (ESG) perception using the world’s news.
  • Model and dataset sharing is driving NLP’s Cambrian explosion.

Honorable Papers


Paper: Sparse Open Domain QA

Paper: Novel Framework for Distillation

Paper: Semantic Role Labeling Graphs

Dataset of the Week: eQASC

What is it?

The dataset contains 98k 2-hop explanations for questions in the QASC dataset, with annotations indicating whether each is a valid or invalid explanation.
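A quick sketch of how such annotated 2-hop chains might be represented and filtered down to the valid ones — the records and field names here are illustrative, not the dataset's actual schema:

```python
# Toy records mimicking 2-hop explanation chains with validity annotations.
# Field names are illustrative; check the eQASC release for the real schema.
chains = [
    {"question": "What do plants need to make food?",
     "hop1": "Plants perform photosynthesis.",
     "hop2": "Photosynthesis requires sunlight.",
     "valid": True},
    {"question": "What do plants need to make food?",
     "hop1": "Plants are green.",
     "hop2": "Green is a color.",
     "valid": False},
]

# Keep only chains annotated as valid explanations.
valid_chains = [c for c in chains if c["valid"]]
print(len(valid_chains))
```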