Original article was published on Deep Learning on Medium
Welcome to My Week in AI! Each week this blog will have the following parts:
- What I have done this week in AI
- An overview of an exciting and emerging piece of AI research
Absorbing Best Practices
This week I attended the Spark + AI Summit, hosted by Databricks. There were lots of informative and useful talks, mostly on the topics of data engineering and productionizing machine learning models. I found two talks particularly enlightening — one was ‘Accelerating MLflow Hyper-parameter Optimization Pipelines with RAPIDS’ by John Zedlewski from NVIDIA, and the other was ‘Scaling up Deep Learning by Scaling Down’ by Nick Pentreath from IBM.
Training Models Rapidly
The first talk introduced RAPIDS, a collection of libraries that run on GPUs and can be used in place of the standard Python data science libraries (pandas, scikit-learn, PyTorch, Matplotlib). These libraries allow machine learning model development to happen in a tiny fraction of the typical computation time. This is because none of the standard libraries, with the exception of PyTorch, have built-in GPU support, so they compute on the CPU, which takes significantly longer. Models that took an hour to train in scikit-learn were trained in less than 5 minutes using cuML, RAPIDS’ corresponding library. When I heard this statistic, I was astounded, because that is a lot of time saved — and on top of the speed, the libraries are very easy to use. There is a RAPIDS counterpart to each of the Python data science libraries, and each counterpart exposes the same functions as the Python version. Therefore, to use the RAPIDS version, you just replace each instance of pandas in your code with cuDF, and similarly for the other libraries. The talk went on to demonstrate a hyperparameter sweep with Hyperopt and how RAPIDS integrates with it to make the sweep extremely fast compared with a grid search in scikit-learn. RAPIDS is a toolkit that I plan to explore further, as computation time is a significant frustration for me (as it is for many data scientists!).
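The “just swap the import” idea above can be sketched as follows. This is a minimal, hedged example — the DataFrame contents are made up, and it falls back to pandas when cuDF (which needs a compatible NVIDIA GPU) isn’t installed; the point is that the downstream code is unchanged:

```python
# cuDF mirrors the pandas API, so GPU acceleration is often an import swap.
# Fall back to CPU pandas when cudf is unavailable (e.g. no GPU present).
try:
    import cudf as pd  # GPU-backed DataFrame library from RAPIDS
except ImportError:
    import pandas as pd  # same API surface, runs on CPU

# Illustrative data — any pandas-style workflow would do here.
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0], "label": [0, 1, 0, 1]})

# This groupby/mean call is identical whether pd is pandas or cudf.
means = df.groupby("label")["feature"].mean()
print(means)
```

The same pattern applies to the other pairings the talk described (scikit-learn → cuML, and so on), since the RAPIDS libraries keep function names and signatures aligned with their CPU counterparts.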
Optimizing Models for Production
The second talk was on running deep learning models for inference on edge devices like mobile phones. These devices typically have limited resources, so models have to be scaled down to run efficiently on them. The speaker presented four main ways of doing this — architecture improvement, model pruning, quantization and model distillation. Each of the four techniques leads to significant efficiency improvements; however, their effect on accuracy varies. Architecture improvement and model distillation typically cause a decrease in accuracy, whereas model pruning and quantization can often cause an increase in accuracy. I think it is easy for models to become bloated, so these techniques can be useful in managing memory and computation time regardless of whether the models are being run on edge devices.
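Of the four techniques, quantization is the easiest to illustrate in a few lines. Here is a rough NumPy sketch of post-training affine quantization — the weight values are made up, and real toolkits (e.g. TensorFlow Lite or PyTorch’s quantization utilities) handle this per-layer with calibration data — but it shows where the 4x memory saving comes from:

```python
import numpy as np

# Illustrative float32 "weights" — in practice these come from a trained layer.
weights = np.array([-0.42, 0.13, 0.78, -0.91, 0.05], dtype=np.float32)

# Affine quantization: map the observed float range onto the 0..255 int range.
scale = (weights.max() - weights.min()) / 255.0       # float step per int level
zero_point = int(round(-float(weights.min()) / scale))  # int that represents 0.0

# Store as 8-bit integers: one quarter of float32's memory footprint.
quantized = np.clip(np.round(weights / scale) + zero_point, 0, 255).astype(np.uint8)

# Dequantize to approximate the originals; the round trip loses a little precision.
dequantized = (quantized.astype(np.float32) - zero_point) * scale
print(np.abs(weights - dequantized).max())  # worst-case error is below one step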
A cheaper and more accurate BERT
Scaled-down models are also the topic of the research that I will be presenting this week. This week’s paper is titled ‘ALBERT: A Lite BERT for Self-supervised Learning of Language Representations’ by Lan et al.¹ and presents a successor to the famous BERT. The research was presented at the ICLR conference in April 2020. The researchers demonstrate two ways to reduce the training time and memory consumption of BERT, whilst also attaining superior accuracy on benchmark tasks.
Their optimized architecture, ALBERT, uses two parameter reduction techniques to achieve this — factorized embedding parameterization and cross-layer parameter sharing. Factorized embedding parameterization splits the vocabulary embedding matrix into two smaller matrices, so that the vocabulary embedding size is no longer tied to the size of the hidden layers in the model. Cross-layer parameter sharing means all parameters are shared across layers; thus, the number of parameters does not grow as the network becomes deeper.
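The saving from factorized embeddings is easy to verify with back-of-the-envelope arithmetic. Using round numbers in the spirit of the paper’s setup (a vocabulary of roughly 30,000 tokens, a hidden size of 1,024 as in BERT-large, and a small embedding size of 128 — the exact values here are illustrative):

```python
# Factorized embedding parameterization: replace one V x H embedding matrix
# with a V x E matrix followed by an E x H projection, where E << H.
V = 30_000  # vocabulary size (approximate)
H = 1_024   # hidden size (BERT-large-like)
E = 128     # reduced embedding size

standard = V * H            # single large embedding matrix
factorized = V * E + E * H  # two smaller matrices

print(standard, factorized, round(standard / factorized, 1))
```

With these numbers the embedding parameters shrink from about 30.7M to about 4.0M — a roughly 7–8x reduction in that one component, before cross-layer sharing even enters the picture.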
Furthermore, the researchers used a sentence-order prediction loss in training the model instead of the next-sentence prediction loss used in training BERT. Next-sentence prediction loss is a binary classification loss used to predict whether two sequences of text appear sequentially in a dataset. The aim of this loss was to improve BERT’s performance on downstream tasks such as natural language inference by combining topic prediction and coherence prediction; however, studies have found it to be unreliable. The loss proposed by Lan et al. focuses only on coherence prediction and helped to train an ALBERT model that is consistently more accurate on downstream tasks than BERT.
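To make the sentence-order idea concrete, here is a hedged sketch of how such training pairs can be constructed: positives are two consecutive segments in their original order, and negatives are the same two segments swapped (the helper name and the toy sentences are mine, not from the paper):

```python
def make_sop_pairs(sentences):
    """Build sentence-order prediction examples from an ordered document.

    Each consecutive pair yields a positive (original order, label 1)
    and a negative (swapped order, label 0).
    """
    pairs = []
    for a, b in zip(sentences, sentences[1:]):
        pairs.append(((a, b), 1))  # correct order -> coherent
        pairs.append(((b, a), 0))  # swapped order -> incoherent
    return pairs

doc = ["The cat sat down.", "Then it fell asleep.", "It woke at noon."]
pairs = make_sop_pairs(doc)
print(pairs[0])
```

Because both segments always come from the same document, the model cannot solve the task by topic alone — it has to learn discourse coherence, which is the distinction Lan et al. draw against next-sentence prediction.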
An ALBERT configuration analogous to BERT-large has about 1/18th the number of parameters and trains in less than 2/3 of the time. Furthermore, ALBERT achieved state-of-the-art accuracy on three standard NLP benchmarks: GLUE, RACE and SQuAD.
Seeing the advances made in NLP research since BERT was released has been very exciting, and for me, being able to use such powerful, optimized pretrained models makes NLP tasks much easier.