MIT Researcher Neil Thompson on Deep Learning’s Insatiable Compute Demands and Possible Solutions

Original article was published by Synced on Artificial Intelligence on Medium

In June 2018, OpenAI introduced its first GPT (Generative Pre-Training) large language model. Trained on massive unlabelled text corpora and built on the breakthrough Transformer architecture, GPT-1 made short work of complex language understanding tasks.

In February 2019, the deep learning community welcomed the new and improved GPT-2, whose 1.5 billion parameters made it 12 times larger than the original. Then, this spring, OpenAI rolled out GPT-3, a behemoth packing 175 billion parameters.

As the size of deep learning models continues to increase, so does their appetite for compute. And that has Neil Thompson, a research scientist with MIT’s Computer Science and Artificial Intelligence Lab (CSAIL), concerned.

“The growth in computing power needed for deep learning models is quickly becoming unsustainable,” Thompson recently told Synced. Thompson is first author of the paper The Computational Limits of Deep Learning, which examines years of data and analyzes 1,058 research papers covering domains such as image classification, object detection, question answering, named-entity recognition and machine translation. The paper proposes that deep learning is not computationally expensive by accident, but by design, and that these increasing computational costs have been central to its performance improvements.

“For decades, software of many types has increased its usage of computing power. But those increases grew in proportion to the hardware improvements provided by Moore’s Law, so these heavier software demands didn’t significantly change the economic or environmental impact of these systems. That hasn’t been true for deep learning systems since 2012. Their economic and environmental footprints are growing worryingly fast,” says Thompson.

The paper explains how the introduction of GPU-based (and later ASIC-based) deep learning led to the widespread adoption of these powerful systems by AI researchers. “But the amount of computing power used in cutting-edge systems grew even faster, at approximately 10× per year from 2012 to 2019. This rate is far faster than the [at that point] ≈ 35× total improvement from moving to GPUs, the meager improvements from the last vestiges of Moore’s Law, or the improvements in neural network training efficiency.”
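A quick back-of-the-envelope calculation makes the gap vivid. Assuming the steady 10×-per-year growth the paper reports (a simplification of an average trend), seven years of compounding dwarfs the one-time ≈ 35× gain from moving to GPUs:

```python
# Back-of-the-envelope: cumulative growth of deep learning compute
# at ~10x per year from 2012 to 2019, versus the one-time ~35x
# improvement the paper attributes to the move to GPUs.
annual_growth = 10
years = 2019 - 2012                      # 7 years of compounding
compute_growth = annual_growth ** years  # 10,000,000x

gpu_gain = 35
print(f"Compute demand grew ~{compute_growth:,}x over {years} years")
print(f"That is ~{compute_growth / gpu_gain:,.0f}x beyond the GPU transition alone")
```

The point of the comparison: hardware gains were a one-off step, while demand has been compounding exponentially.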

The team says much of the increase in computing power has come from running models for more time on more machines. Just two years ago, when Google introduced its BERT (Bidirectional Encoder Representations from Transformers) model for NLP pretraining, its 340 million parameters were considered “extreme” (GPT-3 is more than 500 times that size). Google AI trained the 340M model in 4 days on 16 Cloud TPUs (64 TPU chips total). Thompson and the team point out that Google Research’s 2019 Evolved Transformer model required more than 2 million GPU hours to train, at a cost of millions of dollars.
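To see how 2 million GPU hours translates into dollars, here is an illustrative sketch. The per-hour rate below is an assumed cloud price, not a figure from the article; actual rates vary widely by hardware and provider.

```python
# Illustrative cost sketch for a 2-million-GPU-hour training run.
# The hourly rate is a hypothetical cloud price (NOT from the
# article); it is only meant to show the order of magnitude.
gpu_hours = 2_000_000
assumed_usd_per_gpu_hour = 1.50  # hypothetical cloud rate

cost = gpu_hours * assumed_usd_per_gpu_hour
print(f"Estimated training cost: ${cost:,.0f}")  # millions of dollars
```

Even with a modest assumed rate, the bill lands in the millions, consistent with the article's characterization.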

Why should we be concerned about this trend?

“If we continue on the path we’re on, systems will go from costing millions or tens of millions of dollars to train to costing hundreds of millions or billions. The environmental impacts will grow similarly quickly. So, if we don’t find a way to improve performance more efficiently, fewer and fewer researchers will be able to continue doing this work, and the environmental damage will mount,” Thompson told Synced. “I started this study after hearing talks by big companies with enormous computing resources. Even they would talk about how their deep learning models were overflowing their available resources.”

Today’s SOTA systems achieve an error rate of about 11.5 percent in image recognition on the benchmark ImageNet dataset. The paper estimates that training to reach an error rate of 1 percent would theoretically cost over US$100 quintillion and add 100 quintillion pounds (roughly 50 quadrillion tonnes) of carbon emissions. Thompson believes these exponentially mounting costs will leave researchers little choice but to pivot towards more efficient methods.

What are the likely impacts of these computational limits on deep learning? And what are the alternatives?

The paper suggests that deep learning will be forced “towards less computationally-intensive methods of improvement, and machine learning towards techniques that are more computationally-efficient than deep learning,” and identifies several key areas and approaches for countering mounting computational burdens:

  • Increasing computing power: hardware accelerators.
  • Reducing computational complexity: network compression and acceleration.
  • Finding high-performing small deep learning architectures: neural architecture search and meta-learning.

“There are some exciting techniques being explored within the Deep Learning community,” says Thompson, “for example the ‘lottery ticket hypothesis’ where researchers are trying to prune their networks early in training. If that works, it will mean that many fewer connections in the network need to be trained, saving a lot of computation. In the long run, there is even more potential in work like [physics-based approach] ‘A.I. Feynman’, where complicated networks can be condensed down to easy-to-calculate equations.”
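As a minimal illustration of the pruning idea behind this line of work, the sketch below zeroes out the smallest-magnitude weights in a layer. This is a generic magnitude-pruning toy, not the lottery-ticket procedure itself (which also involves rewinding and retraining the surviving weights), and the function name is our own:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the
    smallest absolute values, keeping the rest unchanged.

    A toy sketch of magnitude pruning: the lottery ticket
    hypothesis asks whether such sparse subnetworks, identified
    early, can be trained in place of the full network.
    """
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)       # number of weights to drop
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the cutoff
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w, sparsity=0.75)
print(f"Nonzero weights kept: {np.count_nonzero(pruned)} of {w.size}")
```

At 75 percent sparsity, only a quarter of the connections survive, which conveys why early pruning could save so much training computation.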

While Thompson says these techniques hold promise, he believes “at some point we may also need to build more expert insights into our models rather than relying on the flexibility of deep learning to discover them. This is what was done in the early days of computing and it can make models much more efficient — albeit usually at the cost of more work for the designers!”

The paper The Computational Limits of Deep Learning is on arXiv.