Training the Deep Neural Networks Cheaply: The Ongoing Research



(This article is based on my experience, my reading on research articles in the field and the talk by Vivienne Sze in NeurIPS 2019)

This image is taken from https://towardsdatascience.com/paper-summary-optimal-dnn-primitive-selection-with-partitioned-boolean-quadratic-programming-84d8ca4cdbfc

Well, when people talk about how to train DNNs free of charge, most will think of Google Colaboratory or Kaggle Notebooks. But if you want to train on a dataset with high-resolution images and a large number of classes, you will run into lots of problems with these free GPU/TPU platforms. 😞

I started to become curious 🤔 about how to train Deep Neural Network (DNN) architectures in the most efficient way while my students and I were taking part in this Kaggle competition. As our university is just a small postgraduate college and I am practically the pioneering deep learning researcher here, we don’t have sufficient research funding to buy PCs with good GPUs to train our DNNs, so we chose to rely solely on Google Colab and Kaggle notebooks to compete in that competition.

Then, I started to dig into what current researchers are doing to overcome the hardware problem, especially when you want to embed DNN models in mobile hardware. And I found the keyword: network compression! This field is quite close to my previous PhD research field, where I mostly worked on image compression. Researchers compress networks using low-rank optimization, downsampling, scaling and pruning methods.

The conclusion I reached after all the reading: it’s a trade-off. If you compress the model, it becomes faster and can run on cheap GPU hardware, but unfortunately you will lose some accuracy.

Let’s see which models have the fewest parameters and the highest accuracy.

The comparison of accuracy and parameters between current DNN models. (This image is taken from https://arxiv.org/pdf/1905.11946.pdf)

So far, for mobile-oriented, low-parameter DNN architectures, we currently have:
1. EfficientNet — it uses a compound scaling method to scale the network’s depth, width and input resolution
2. MobileNet — it implements depth-wise separable convolutions in its convolutional layers (see the sketch after this list)
3. ShuffleNet — it uses point-wise group convolutions and shuffles the channels in its architecture
You may want to read this medium article to understand more about why these three architectures are fast.
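To see why depth-wise separable convolutions are cheap, here is a minimal PyTorch sketch (my own illustration, not taken from any of these papers) comparing the parameter counts of a standard convolution and its depth-wise separable replacement:

```python
import torch
import torch.nn as nn

in_ch, out_ch, k = 64, 128, 3

# Standard convolution: one large kernel that mixes all input channels.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)

# Depth-wise separable convolution: a per-channel (depth-wise) 3x3 conv
# followed by a 1x1 (point-wise) conv that mixes the channels.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

def n_params(*modules):
    return sum(p.numel() for m in modules for p in m.parameters())

x = torch.randn(1, in_ch, 32, 32)
assert standard(x).shape == pointwise(depthwise(x)).shape  # same output shape

print("standard conv params: ", n_params(standard))              # 73856
print("separable conv params:", n_params(depthwise, pointwise))  # 8960
```

For this layer the separable version needs roughly 8x fewer parameters (and a similar reduction in multiply-accumulates), which is the main trick behind MobileNet’s small size.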

Another way to reduce the hardware burden of DNN training is to map the DNN layers to matrix multiplications.

From Vivienne Sze’s NeurIPS 2019 Talks
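As a rough illustration of what “mapping to a matrix multiplication” means, here is a tiny NumPy sketch of the im2col lowering (the function name and shapes are my own, not from the talk): every patch of the input is unrolled into a row, so the whole convolution becomes a single matrix-vector product.

```python
import numpy as np

def im2col(x, k):
    """Unfold every k x k patch of a 2-D array into a row of a matrix."""
    H, W = x.shape
    rows = []
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            rows.append(x[i:i + k, j:j + k].ravel())
    return np.array(rows)                       # shape: (num_patches, k*k)

x = np.random.rand(6, 6)
w = np.random.rand(3, 3)

# Convolution (cross-correlation, as in DNN frameworks) as one matmul:
out = (im2col(x, 3) @ w.ravel()).reshape(4, 4)

# Direct sliding-window computation for comparison.
ref = np.array([[np.sum(x[i:i + 3, j:j + 3] * w) for j in range(4)]
                for i in range(4)])
assert np.allclose(out, ref)
```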

Recently, researchers have used the Fast Fourier Transform (FFT), Strassen’s algorithm and the Winograd transform to speed up these matrix multiplications. The compiler selects which transform to use based on the filter size.
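Here is a toy 1-D check of the FFT route (my own sketch, using only NumPy): by the convolution theorem, convolution in the spatial domain is an element-wise product in the frequency domain, so the sliding-window result and the FFT-based result match.

```python
import numpy as np

x = np.random.rand(64)   # input signal (e.g. one row of a feature map)
h = np.random.rand(7)    # filter

# Direct (sliding-window) convolution: O(N*K) multiply-accumulates.
direct = np.convolve(x, h)

# FFT-based convolution: pad to the full output length, multiply the
# spectra, transform back. For large filters this needs far fewer MACs.
n = len(x) + len(h) - 1
fft_conv = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

assert np.allclose(direct, fft_conv)
```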

According to Vivienne, in order to design a CPU and GPU for a low-cost DNN architecture, we need to consider the following things:
1. Software (compiler)
– Reduce unnecessary multiply-and-accumulate (MAC) operations
– Increase processing element (PE) utilization
2. Hardware
– Reduce the time per MAC
– Increase the number of parallel MACs
– Increase PE utilization

She also mentioned that the algorithm and the hardware should be co-designed. This can be done by reducing the size of operands for storage/compute (reduced precision) and reducing the number of operations for storage/compute (sparsity and efficient network architectures). Commercial products that currently support reduced precision include Nvidia’s Pascal, Google’s TPU and Intel’s NNP-L. Sparsity also helps to reduce the number of MACs and the data movement: the well-known ReLU activation function outputs many zeros, which increases the sparsity of the activation data, and pruning makes the weights sparse. According to her talk, it is important to consider a comprehensive set of metrics when evaluating different DNN solutions: accuracy, throughput, latency, power, energy, flexibility, scalability and cost.
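As a rough NumPy sketch of those two ideas (my own illustration, not from the talk), the snippet below quantizes a weight tensor to int8 (reduced precision, 4x smaller storage) and measures how many activations a ReLU zeroes out (sparsity):

```python
import numpy as np

# --- Reduced precision: symmetric post-training quantization to int8 ---
w = np.random.randn(256, 256).astype(np.float32)   # fp32 weights
scale = np.abs(w).max() / 127.0                     # one scale per tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale

print("storage: fp32 =", w.nbytes, "bytes, int8 =", w_int8.nbytes, "bytes")
print("max quantization error:", np.abs(w - w_dequant).max())

# --- Sparsity: ReLU zeroes out roughly half of a zero-mean activation ---
act = np.random.randn(1, 256, 32, 32)
relu_out = np.maximum(act, 0.0)
print("fraction of zero activations after ReLU:", np.mean(relu_out == 0.0))
# Every zero activation is a MAC and a memory access that can be skipped.
```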

Besides the common CNNs and DNNs for image classification and object detection, researchers are currently also looking at spectral graph theory for designing neural networks. A recent ICLR paper on the Graph Wavelet Neural Network (GWNN) presented a graph neural network that uses graph wavelets instead of the usual eigenvectors of the graph Laplacian. The graph wavelets can be obtained with a fast algorithm that does not require the matrix eigendecomposition, which would otherwise lead to a high computational cost.

Why did they use graphs and wavelets?

So, what is a Graph Neural Network?
According to this article, the typical application of Graph Neural Networks (GNNs) is node classification, where every node is associated with a label and we need to predict the labels of nodes without ground truth. GNNs are a good fit for data represented as a graph, for example when the class labels of the images are organized in a graph-like structure.
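To make the idea concrete, here is a minimal NumPy sketch of a single graph-convolution layer in the style of a standard spectral GNN (the graph, features and weights are made up for illustration): each node’s new features are a normalized average of its neighbours’ features, multiplied by a learnable weight matrix.

```python
import numpy as np

# Tiny undirected graph with 4 nodes and 2 features per node.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # adjacency matrix
X = np.random.rand(4, 2)                    # node features
W = np.random.rand(2, 3)                    # learnable layer weights

# One graph-convolution layer: H = ReLU(D^-1/2 (A + I) D^-1/2 X W)
A_hat = A + np.eye(4)                               # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)
print(H.shape)   # (4, 3): a new feature vector per node
```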

What are wavelets?
The wavelet transform decomposes a signal into low-frequency and high-frequency sub-bands at multiple resolutions, whereas the Fourier transform represents the whole signal with global frequency components of a single resolution. The wavelet transform is commonly applied in image compression and image denoising. You can read more about how the wavelet transform is used for image compression in my Ph.D dissertation.
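As a tiny NumPy sketch (my own toy example, not from the dissertation), one level of the Haar wavelet transform splits a signal into a half-length low-frequency band and a half-length high-frequency band, and the original can be perfectly reconstructed from the two:

```python
import numpy as np

x = np.array([4., 6., 10., 12., 8., 6., 5., 5.])   # toy 1-D signal

# One level of the Haar wavelet transform: pairwise averages (low band)
# and pairwise differences (high band), each half the original length.
low = (x[0::2] + x[1::2]) / np.sqrt(2)    # approximation coefficients
high = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients

# Perfect reconstruction from the two bands.
rec = np.empty_like(x)
rec[0::2] = (low + high) / np.sqrt(2)
rec[1::2] = (low - high) / np.sqrt(2)
assert np.allclose(rec, x)

# Smooth signals concentrate their energy in 'low'; 'high' is mostly
# near zero, which is exactly the sparsity that compression (and GWNN)
# exploits.
```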

What is the advantage of using the GWNN?
According to the authors, the first advantage is that the GWNN is more efficient than the normal spectral GNN. The second is high sparseness. Remember the importance of sparsity that Vivienne highlighted just now? More sparseness means more zeros in your signal, and computations on those zeros are useless and can easily be skipped or thrown away. Sparsity is the most important factor for compressing signals, and a neural network is also a signal. Third, localized convolution. Fourth, flexible neighbourhoods.

In conclusion, in order to reduce the model size and the number of operations and achieve a low-cost DNN architecture, we need to:
1. Exploit sparsity (this is one of the main reasons ReLU is used as the activation function in most convolutional layers)
2. Use channel scaling or downsampling
3. Consider other types of networks such as graph, wavelet or spiking NNs.

Back to the story in the introduction of this article: so what did we do to overcome the hardware limitations in Google Colab? We used a very small batch size of 8 with the EfficientNet architecture. Unfortunately, we only managed to get about 50% accuracy.

Who knows, perhaps in the next 5 years everyone will be able to afford a GPU for training, or maybe we won’t even need one to train our deep learning models! 👏🏼

Looking forward to a brighter future in AI 🙏