The Best Deep Natural Language Papers You Should Read — BERT, GPT-2 and looking forward

Source: Deep Learning on Medium

Predictions for 2020

I would be remiss not to make a few predictions for what will happen next. By the time you read this, some of this may have already happened, or even published. Don’t @ me.

More scaling. As our group at NVIDIA showed with Megatron and Google showed with T5, there are still gains to be made, even on existing NLP metrics, from training bigger models for longer, on more fresh data. The gains may seem to be diminishing, but that will change as GLUE, SuperGLUE, etc add harder, more niche tasks. As Amit Singhal used to tell us on the Google Search team — as we make search better, the users keep asking harder questions.

Data augmentation, selection, efficiency. As T5 points out, big Transformers are really good at overfit (since memorization a part of text understanding)— thus you should never train it on the same text twice. As finding more good data on the internet gets harder and harder, we’ll see more focus on augmenting the data we have (BPE-dropout is a great example, and Quoc Le’s group at Google had other good suggestions). Right now, data augmentation for text is not where it is for deep computer vision. And even in CV, Prof Le’s same group is still making breakthroughs.

Furthermore, as a FB executive friend liked to point out — the data you really care about, is always scarce, even if less important data is abundant.

For example, take resume screening — a project I worked at for NVIDIA early in my tenure. The number of outstanding resumes can be tiny — especially for a specific role. Overfitting is a problem — and data augmentation is often our best solution.

Another problem is that text datasets may contain 90% news and clickbait, and 0.01% of articles on your topic of interest. In practice, that’s enough for a huge model to learn good embeddings for your niche task, be it baseball, science or financial topics. But wouldn’t you like to do better? Or to do as well, more efficiently?

Given that the naive approach to this works (I’ve done it, I’m sure others have too), I think we’ll see methods emerge — think of them as RL for downstream task — then select documents for you, to help with niche domain pre-training. Instead of over-fitting to your task, why not just over-sample relevant documents in pre-training? The main reason we’ve not seen this, I think, is because the benefits of scale have swamped everything else so far. But at some point the lines between more training and better selection will start to cross.

For now, filtering by sub-reddit, is a fine way to build your niche-specific pre-training dataset — for some topics.

Rewriting — for style, and otherwise. I’ve found the Transformer models are sneaky good at style detection and generation. Much better than at reasoning, specific knowledge, etc. Style is mostly a local feature — which word or phrase to use, can we keep a consistent vocabulary with the previous sentence (including implied vocabulary that was not actually used). Using the very big BERT-style models (T5 seems tailor made), I expect big breakthroughs in text rewriting, document editing tools, etc. Right now Google (and to a lesser extent Apple) will help you write short emails and finish your sentences. Imagine a tool instead that trims the fat, and suggests good rewrites. Doing a full email re-write in one shot seems hard, but all text edits more or less break down to individual operations:

  • insert, delete, replace
  • move a sentence

There’s no reason to think that a model could not make such suggestions. And that you could not agree/disagree with them one at a time, after each it will recompute more suggestions. Shoot, an RL agent could even make those choices for you. And it probably should.

Non-local gradients and backprop — through reinforcement learning. I’ve touched on this several times, but the best and worst feature of language modeling is that we get very far by optimizing for local losses, perhaps with a large context window, but writing the outputs — in permanent marker — one token at a time. However, what we care about in good writing, can only be measured across multiple words. To make this more concrete, conditioned image generation works on the whole image, because you can backprop into the individual pixels. You can’t really do that with large Transformers. Or can’t you? I expect to see some progress in this space. Maybe on a small scale, but it doesn’t seem impossible — to pass some signal, between multiple trainable text tokens.

Think of this as complimentary to data selection and re-writing.

Given the interest in Turing test and generative models, I expect some serious resources and brain power are being devoted to this already.

You can’t manage, what you can’t m̶e̶a̶s̶u̶r̶e̶ backprop.

Transformers beyond text. It’s obvious that Transformer modules will be useful on problems other than text. We’ve already seen large Transformers make a big improvement on protein modeling and I’ll have a paper out soon on our own genomics work, that also includes a Transformer module. Transformers have been useful for some computer vision tasks — mainly because they easily support a larger receptive field than convolutions. It can be useful to add a Transformer module after a few initial convolutional layers (and that’s what we’ve done on our genomics problem).

The question is: will the mammals go back to the ocean? Can we learn anything anything from non-text Transformers that will be helpful back to text?

Which of these will pay off most for my practical problems? Honestly, in the medium term, I think the re-writing. That’s a strange thing to say given that Transformers are not doing this at all right now. But we all need an editor. Summarizing and re-writing content for a particular use case in mind, perhaps paired with a human in the loop to choose “which is better” will be huge. Not just for jokes and games, although those will probably be the first impressive use cases.