ICLR 2020: NLP Highlights

This post collects some of my highlights from ICLR 2020. Because my current research is in Natural Language Processing (NLP), the post focuses on that area; however, Computer Vision, quantum-inspired algorithms, general deep learning methods, and more are mentioned as well.

This is also my first post on Medium, so I would love to connect and hear feedback from the community as well!

1. About ICLR 2020

From the main website https://iclr.cc/:

ICLR Logo (Source: https://iclr.cc/)

“The International Conference on Learning Representations (ICLR) is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence called representation learning, but generally referred to as deep learning.”

ICLR is one of the top-tier international AI conferences, along with NeurIPS and ICML. One difference is that ICLR’s main focus is on deep learning.

Participants can come from many backgrounds, “from academic and industrial researchers, to entrepreneurs and engineers, to graduate students and postdocs.”

The review process is double-blind and open-review: authors and reviewers do not know each other’s identities during the review process, and all reviews and author responses are publicly viewable.

All materials can be accessed on the ICLR Virtual Page.

Papers with code for ICLR 2020

2. Technology trends

My main interest is in Natural Language Processing (NLP), so I will focus my recap on this area.

a. Hot Topics:

There are many other NLP topics that I do not cover here, so please visit the conference website to read about and discover them.

b. Emerging Topics:

There are several topics that I found to be emerging at this year’s conference.

3. Thoughts

This year was the first time ICLR was held virtually, and from my perspective the organizers did an excellent job: a search engine for accepted papers, paper visualizations with similarity measures, chat forums for authors and participants, Zoom rooms, and more.

I had the chance to ask many of the authors questions directly, and to join public NLP forums with researchers from Hugging Face (a popular NLP framework). I also took part in the companies’ sponsor chats.

The closing talk by Yann LeCun and Yoshua Bengio on the future directions of deep learning was also very insightful: how the successes of self-supervised learning in NLP might transfer to Computer Vision, and why they believe quantum algorithms and ideas from neuroscience will also matter. Their stories about how they got to where they are today were inspiring as well.

4. Highlights of selected papers

Cross-Lingual Ability of Multilingual BERT: An Empirical Study

A popular hypothesis about multilingual models is that they work well across languages because of lexical overlap between languages. In this paper, the authors find that “lexical overlap between languages plays a negligible role in the cross-lingual success, while the depth of the network is an integral part of it.” (Karthikeyan et al., 2020)

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Source: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

This paper opens the door for groups with limited computing resources to pre-train their own language models. In particular, the authors “train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark.” (Clark et al., 2020)

  • The code is available on Github.
  • ELECTRA models are available on Hugging Face.
  • A project pre-training ELECTRA-small for Vietnamese is available on Github.
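
Since the pre-trained checkpoints are distributed through Hugging Face’s transformers library, using the discriminator looks roughly like the sketch below (my own illustration; the google/electra-small-discriminator checkpoint name is an assumption to verify on the model hub):

```python
# Minimal sketch: run ELECTRA's replaced-token-detection head with the
# Hugging Face `transformers` library (checkpoint name assumed; verify on the hub).
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

# ELECTRA is pre-trained as a discriminator: for every token it predicts whether
# the token is the original one or was substituted by a small generator network.
sentence = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one "was this token replaced?" score per token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
for token, score in zip(tokens, logits.squeeze().tolist()):
    print(f"{token}\t{'replaced' if score > 0 else 'original'}")
```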

On the Relationship between Self-Attention and Convolutional Layers

Source: On the Relationship between Self-Attention and Convolutional Layers

After the success of replacing RNNs with attention-based models in NLP tasks, researchers are now exploring whether the same can be done with convolutional layers in Computer Vision tasks. In this paper, the authors “prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis.” (Cordonnier et al., 2019)

  • The code is available on Github.
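
To build intuition for the construction, here is a toy numerical sketch of my own (not the authors’ code): if each attention head attends deterministically to a single fixed relative pixel offset, and its value projection equals the corresponding kernel slice, then summing K x K such heads reproduces a K x K convolution.

```python
# Toy check: K*K "hard" attention heads, each looking at one fixed relative pixel
# offset, sum up to exactly a K x K convolution. Purely illustrative.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H = W = 5                    # spatial size of the toy image
C_in, C_out, K = 3, 4, 3     # channels and kernel size

x = torch.randn(1, C_in, H, W)
conv = torch.nn.Conv2d(C_in, C_out, K, padding=K // 2, bias=False)

x_pad = F.pad(x, [K // 2] * 4)            # same zero padding as the convolution
out = torch.zeros(1, C_out, H, W)
for dy in range(K):
    for dx in range(K):
        # "Head (dy, dx)" attends with weight 1 to the pixel at this offset ...
        attended = x_pad[:, :, dy:dy + H, dx:dx + W]
        # ... and its value projection is the matching slice of the conv kernel.
        w_head = conv.weight[:, :, dy, dx]            # shape (C_out, C_in)
        out += torch.einsum("oc,bchw->bohw", w_head, attended)

print(torch.allclose(out, conv(x), atol=1e-5))        # True: heads sum to the conv
```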

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

Source: StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

In this paper, the authors add two pre-training tasks that “leverage language structures at the word and sentence levels” and achieve state-of-the-art results on the GLUE benchmark. (Wang et al., 2019)
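
For intuition, here is a rough data-preparation sketch of the word-level objective as I understand it (my own illustration, not the authors’ code): a few random trigrams in the input are shuffled, and the model is trained to reconstruct the original token at each shuffled position.

```python
# Rough sketch of StructBERT-style trigram shuffling for the word structural
# objective (illustrative only; details differ from the paper's implementation).
import random

def shuffle_trigrams(tokens, num_trigrams=2, seed=None):
    """Shuffle a few length-3 spans and record the original tokens as targets."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}  # position -> original token the model must reconstruct
    for _ in range(num_trigrams):
        if len(tokens) < 3:
            break
        start = rng.randrange(len(tokens) - 2)
        span = corrupted[start:start + 3]
        rng.shuffle(span)
        corrupted[start:start + 3] = span
        for offset in range(3):
            targets[start + offset] = tokens[start + offset]
    return corrupted, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = shuffle_trigrams(tokens, seed=0)
print(corrupted)   # shuffled sequence fed to the encoder
print(targets)     # per-position reconstruction targets
```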

Are Transformers universal approximators of sequence-to-sequence functions?

This paper is more on the theory side. The authors “show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain.” They also prove that “fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers.” (Yun et al., 2019)
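
To give a flavor of the formal result, the theorem can be paraphrased as follows (my simplified restatement; see the paper for the exact assumptions on permutation equivariance and positional encodings):

```latex
% Paraphrase of the main theorem of Yun et al. (2019), notation simplified.
% For any continuous sequence-to-sequence function f with compact support,
% any p >= 1 and any epsilon > 0, there exists a Transformer network g with
\[
  \mathrm{d}_p(f, g)
  \;=\;
  \left( \int \lVert f(X) - g(X) \rVert_p^p \, \mathrm{d}X \right)^{1/p}
  \;<\;
  \epsilon .
\]
```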

About Me

I am an AI Engineer and Data Engineer, focusing on researching state-of-the-art AI solutions and building machine learning systems. You can reach me on LinkedIn.

References

Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems. 2017.

Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805 (2018).

Yun, Chulhee, et al. “Are Transformers universal approximators of sequence-to-sequence functions?” arXiv preprint arXiv:1912.10077 (2019).

Brunner, Gino, et al. “On Identifiability in Transformers.” (2020).

Shi, Zhouxing, et al. “Robustness verification for transformers.” arXiv preprint arXiv:2002.06622 (2020).

Cordonnier, Jean-Baptiste, Andreas Loukas, and Martin Jaggi. “On the Relationship between Self-Attention and Convolutional Layers.” arXiv preprint arXiv:1911.03584 (2019).

Wang, Wei, et al. “StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding.” arXiv preprint arXiv:1908.04577 (2019).

Lee, Cheolhyoung, Kyunghyun Cho, and Wanmo Kang. “Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models.” arXiv preprint arXiv:1909.11299 (2019).

Wu, Zhanghao, et al. “Lite Transformer with Long-Short Range Attention.” arXiv preprint arXiv:2004.11886 (2020).

Lan, Zhenzhong, et al. “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.” arXiv preprint arXiv:1909.11942 (2019).

You, Yang, et al. “Large batch optimization for deep learning: Training bert in 76 minutes.” International Conference on Learning Representations. 2019.

Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. “Reformer: The Efficient Transformer.” arXiv preprint arXiv:2001.04451 (2020).

Clark, Kevin, et al. “ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.” arXiv preprint arXiv:2003.10555 (2020).

Rae, Jack W., et al. “Compressive Transformers for Long-Range Sequence Modelling.” arXiv preprint arXiv:1911.05507 (2019).

Fan, Angela, Edouard Grave, and Armand Joulin. “Reducing Transformer Depth on Demand with Structured Dropout.” arXiv preprint arXiv:1909.11556 (2019).

Melis, Gábor, Tomáš Kočiský, and Phil Blunsom. “Mogrifier LSTM.” arXiv preprint arXiv:1909.01792 (2019).

Orhan, A. Emin, and Xaq Pitkow. “Improved memory in recurrent neural networks with sequential non-normal dynamics.” arXiv preprint arXiv:1905.13715 (2019).

Tu, Zhuozhuo, Fengxiang He, and Dacheng Tao. “Understanding Generalization in Recurrent Neural Networks.” International Conference on Learning Representations. 2020.

Karthikeyan, K., et al. “Cross-lingual ability of multilingual BERT: An empirical study.” International Conference on Learning Representations. 2020.

Berend, Gábor. “Massively Multilingual Sparse Word Representations.” International Conference on Learning Representations. 2020.

Cao, Steven, Nikita Kitaev, and Dan Klein. “Multilingual alignment of contextual word representations.” arXiv preprint arXiv:2002.03518 (2020).

Wang, Zirui, et al. “Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple Unified Framework.” arXiv preprint arXiv:1910.04708 (2019).

Panahi, Aliakbar, Seyran Saeedi, and Tom Arodz. “word2ket: Space-efficient Word Embeddings inspired by Quantum Entanglement.” arXiv preprint arXiv:1911.04975 (2019).

Kerenidis, Iordanis, Jonas Landman, and Anupam Prakash. “Quantum Algorithms for Deep Convolutional Neural Networks.” arXiv preprint arXiv:1911.01117 (2019).

Su, Weijie, et al. “VL-BERT: Pre-training of Generic Visual-Linguistic Representations.” arXiv preprint arXiv:1908.08530 (2019).

Chen, Yu, Lingfei Wu, and Mohammed J. Zaki. “Reinforcement learning based graph-to-sequence model for natural question generation.” arXiv preprint arXiv:1908.04942 (2019).

Clift, James, et al. “Logic and the 2-Simplicial Transformer.” arXiv preprint arXiv:1909.00668 (2019).

Zhang, Matthew Shunshi, and Bradly Stadie. “One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation.” arXiv preprint arXiv:1912.00120 (2019).

Yu, Haonan, et al. “Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP.” arXiv preprint arXiv:1906.02768 (2019).