Originally published by Quantum Stat in Artificial Intelligence on Medium
Straight out of EMNLP, we now have pre-trained models for the legal domain: Legal BERT! These models were trained with an eye toward applications in legal research, computational law, and legal technology. For training, the model was exposed to 12 GB of English legal text derived from legislation, court cases, and contracts. 👇
Where were some of the best performance gains?
“Performance gains are stronger in the most challenging end-tasks (i.e., multi-label classification in ECHR-CASES and contract header, and lease details in CONTRACTS-NER)”
*The model was evaluated on text classification and sequence tagging tasks.
If you are interested in Indic languages, check out the Indic BERT library, built on HF transformers 👀. Their multilingual ALBERT model supports 12 languages and was trained on a custom 9-billion-token corpus. The library includes a multitude of evaluation tasks:
News Category Classification, Named Entity Recognition, Headline Prediction, Wikipedia Section Title Prediction, Cloze-style Question Answering (WCQA), Cross-lingual Sentence Retrieval (XSR), and many more.
Thank you Tathagata for forwarding their model to us!
Colab of the Week
Intent Detection in the Wild
Bottom line: we need more real-world* datasets. In this recent Haptik paper, the authors showed how 4 NLU platforms (RASA, Dialogflow, LUIS, Haptik) and BERT performed on 3 real-world datasets containing in-scope and out-of-scope queries. (Results were mixed, given difficulties in generalizing to the test sets.)
What kind of datasets?
“Each dataset contains diverse set of intents in a single domain — mattress products retail, fitness supplements retail and online gaming…”
*Real-world here means real user queries as opposed to crowdsourcing.
Find the data here:
Have you heard of Wikipedia2Vec? It's been around for a couple of years now. It contains embeddings of words and concepts that have corresponding pages in Wikipedia. Since Wikipedia is one of the most researched datasets in IR, this may come in handy. Their embeddings come in 12 languages, and they include an API.
Applications of Wikipedia2Vec:
- Entity linking: Yamada et al., 2016, Eshel et al., 2017, Chen et al., 2019, Poerner et al., 2020.
- Named entity recognition: Sato et al., 2017, Lara-Clares and Garcia-Serrano, 2019.
- Question answering: Yamada et al., 2017, Poerner et al., 2020.
- Entity typing: Yamada et al., 2018.
- Text classification: Yamada et al., 2018, Yamada and Shindo, 2019.
- Relation classification: Poerner et al., 2020.
- Paraphrase detection: Duong et al., 2018.
- Knowledge graph completion: Shah et al., 2019.
- Fake news detection: Singh et al., 2019.
- Plot analysis of movies: Papalampidi et al., 2019.
- Enhancement of BERT using Wikipedia knowledge: Poerner et al., 2019.
- Novel entity discovery: Zhang et al., 2020.
- Entity retrieval: Gerritse et al., 2020.
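The key idea that enables the applications above is that Wikipedia2Vec embeds words and Wikipedia entities in a single shared vector space, so they can be compared directly. Here is a toy sketch of that idea in plain Python; it is not the Wikipedia2Vec API, and the vectors are made up for the demo.

```python
import math

# Toy illustration of a shared word/entity embedding space -- NOT the
# Wikipedia2Vec API. The vectors below are invented for the demo.
embeddings = {
    "word:guitar":          [0.9, 0.1, 0.2],
    "entity:Jimi_Hendrix":  [0.8, 0.2, 0.1],
    "entity:Mount_Everest": [0.1, 0.9, 0.7],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Because words and entities share one space, a word lands closer to a
# related concept page than to an unrelated one.
sim_related = cosine(embeddings["word:guitar"], embeddings["entity:Jimi_Hendrix"])
sim_unrelated = cosine(embeddings["word:guitar"], embeddings["entity:Mount_Everest"])
print(sim_related > sim_unrelated)  # -> True
```

This single-space property is what lets downstream tasks like entity linking score word/entity pairs without any extra alignment step.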
Data Augmentation for Text
NLPaug is a handy library for data augmentation: you can inject noise into your dataset at the character or word level to improve model robustness and performance.
Here are a few of its features:
Character: OCR Augmenter, QWERTY Augmenter and Random Character Augmenter
Word: WordNet Augmenter, word2vec Augmenter, GloVe Augmenter, fasttext Augmenter, BERT Augmenter, Random Word Augmenter
Flow: Sequential Augmenter, Sometimes Augmenter
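To make the character-level idea concrete, here is a minimal sketch in plain Python (not nlpaug's actual API) of typo-style noise injection, in the spirit of the Random Character Augmenter:

```python
import random

def random_char_swap(text: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Inject character-level noise by swapping adjacent characters,
    mimicking the keyboard-typo style of augmentation. A sketch only --
    nlpaug's augmenters offer far more control (QWERTY neighbors, OCR
    confusions, insertion/deletion, etc.)."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)  # pick a position to perturb
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Training on lightly corrupted copies of each sentence can make a model
# more robust to typos at inference time.
print(random_char_swap("the quick brown fox"))
```

In practice you would augment each training example a few times with different seeds and train on the union of clean and noisy copies.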
Blog post on the library:
Secret: There’s an nlpaug colab in the SDNR 👆
State of AI Report 2020
The annual State of AI Report is out, and NLP is winning big.
TL;DR on the NLP side of things:
Only 15% of papers publish their code.
Facebook’s PyTorch is fast outpacing Google’s TensorFlow in research papers.
A larger model needs less data than a smaller peer to achieve the same performance.
Biology is experiencing its “AI moment”: Over 21,000 papers in 2020 alone.
Brain drain of academics leaving university for tech companies.
A rise of MLOps in the enterprise.
NLP is being used to automate the quantification of a company's Environmental, Social, and Governance (ESG) perception from world news.
Model and dataset sharing is driving NLP’s Cambrian explosion.
Paper: https://arxiv.org/pdf/2009.13013.pdf Sparse Open Domain QA
Paper: https://arxiv.org/pdf/2010.03099.pdf Novel Framework for Distillation
Paper: https://arxiv.org/pdf/2010.03604.pdf Semantic Role Labeling Graphs
Dataset of the Week: eQASC
What is it?
The dataset contains 98k two-hop explanations for questions in the QASC dataset, with annotations indicating whether each is a valid or invalid explanation.
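To make the annotation setup concrete, here is a hedged sketch of filtering valid two-hop chains; the field names below are hypothetical illustrations, not the released eQASC schema.

```python
# Hypothetical record layout for illustration only -- the actual eQASC
# release may use different field names and structure.
records = [
    {"question": "What do plants need for photosynthesis?",
     "chain": ["Plants absorb sunlight.", "Sunlight drives photosynthesis."],
     "valid": True},
    {"question": "What do plants need for photosynthesis?",
     "chain": ["Plants have roots.", "Roots absorb water."],
     "valid": False},
]

# Keep only chains annotators marked as valid explanations.
valid_chains = [r["chain"] for r in records if r["valid"]]
print(len(valid_chains))  # -> 1
```

Each explanation is a two-sentence chain linking the question to its answer, and the valid/invalid labels make the corpus usable for training explanation-scoring models.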