Three Learning Paradigms For The Future Development Of Deep Learning

Original article was published by Md Ashikquer Rahman on Deep Learning on Medium


Deep learning is a vast field built around neural network algorithms: networks whose behavior is shaped by millions or even billions of continually adjusted parameters. It seems that a flood of new methods is proposed every few days.

Generally speaking, though, current deep learning methods can be grouped into three basic learning paradigms. Each one holds great potential and interest for improving the current capability and scope of deep learning.

Blended learning: how can modern deep learning methods cross the boundary between supervised and unsupervised learning to take advantage of the vast amounts of unused unlabeled data?

Component learning: how can different components be linked in innovative ways to produce a composite model that performs better than the simple sum of its parts?

Simplified learning: how can we reduce the size and information flow of a model while maintaining equal or comparable predictive power, to meet performance and deployment goals?

The future of deep learning lies largely in these three paradigms, and each is closely linked to the others.

Blended learning

This paradigm tries to cross the boundary between supervised and unsupervised learning. It is often used in commercial settings, where labeled data is scarce and collecting labeled datasets is expensive. Essentially, blended learning is the answer to the question:

How can we use supervised methods to solve, or connect to, unsupervised problems?

Semi-supervised learning, for example, is becoming increasingly popular in machine learning because it can perform exceptionally well on supervised problems with very little labeled data. A well-designed semi-supervised generative adversarial network (GAN) has achieved over 90% accuracy on the MNIST dataset using only 25 labeled training samples.

Semi-supervised learning is designed for datasets with many unlabeled samples and only a few labeled ones. Traditionally, supervised learning uses the labeled part of the dataset while unsupervised learning uses the unlabeled part; a semi-supervised model can combine the labeled data with information extracted from the unlabeled portion.

The semi-supervised generative adversarial network (SGAN for short) is an extension of the standard GAN. Its discriminator not only outputs 0 or 1 to judge whether an image is generated, but also outputs the class of the sample (multi-output learning).

This builds on the idea that by learning to distinguish real images from generated ones, the discriminator can learn their underlying structure without any labels. With a small additional boost from labeled data, a semi-supervised model can achieve top performance with a minimal amount of supervision.
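To make the SGAN output trick concrete, here is a minimal, dependency-free sketch (not from the original article) of how a discriminator with K + 1 outputs can serve both heads at once: the first K softmax entries are class scores, and the extra entry is the "generated" class. The class count and logits below are invented purely for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sgan_discriminator_heads(logits_k_plus_1):
    """Split a (K+1)-way discriminator output into SGAN's two heads.

    The first K entries are class probabilities for labeled real data
    (the supervised head); the last entry is the probability that the
    image was generated (the unsupervised real/fake head).
    """
    probs = softmax(logits_k_plus_1)
    class_probs = probs[:-1]  # which class, if real
    p_fake = probs[-1]        # real vs. generated
    return class_probs, p_fake
```

Because the two heads share one softmax, every real/fake judgment also trains the features used for classification, which is how unlabeled data helps the supervised task.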

You can read more about SGAN and semi-supervised learning here.

GANs also touch another area of blended learning: self-supervised learning, in which an unsupervised problem is explicitly framed as a supervised one. GANs manufacture supervised data by introducing a generator; the created labels identify real versus generated images. From unsupervised data, a supervised task is created.

Also consider encoder-decoder models for compression. In their simplest form, these are neural networks with a small number of nodes in the middle, forming a bottleneck. The two halves on either side are the encoder and the decoder.

The network is trained to produce an output identical to its input vector (a supervised task hand-crafted from unsupervised data). Because of the deliberately narrow bottleneck in the middle, the network cannot simply pass information through; instead, it must find the best way to condense the input into a very small representation that the decoder can reconstruct well.

After training, the encoder and decoder are separated. The encoder can then be used to compress or encode data for transmission, conveying information in a very compact format with minimal loss; it can also be used for dimensionality reduction.
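The bottleneck idea can be demonstrated with a tiny linear autoencoder in pure Python (a toy sketch, not the article's model): 2-D inputs are squeezed through a single code unit and the network is trained by gradient descent to reproduce its own input.

```python
def train_autoencoder(data, lr=0.01, epochs=500):
    w_enc = [0.5, 0.1]  # encoder weights: 2 inputs -> 1 code unit (bottleneck)
    w_dec = [0.1, 0.5]  # decoder weights: 1 code unit -> 2 outputs
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            code = w_enc[0] * x[0] + w_enc[1] * x[1]    # compress
            x_hat = [w_dec[0] * code, w_dec[1] * code]  # reconstruct
            err = [x_hat[0] - x[0], x_hat[1] - x[1]]
            total += err[0] ** 2 + err[1] ** 2
            # Gradients of the squared reconstruction error.
            g_dec = [2 * err[0] * code, 2 * err[1] * code]
            g_common = 2 * (err[0] * w_dec[0] + err[1] * w_dec[1])
            g_enc = [g_common * x[0], g_common * x[1]]
            w_dec = [w - lr * g for w, g in zip(w_dec, g_dec)]
            w_enc = [w - lr * g for w, g in zip(w_enc, g_enc)]
        losses.append(total)
    return w_enc, w_dec, losses

# Points on a line are perfectly compressible to a single number,
# so a 1-unit bottleneck can learn to reconstruct them.
data = [[t, 2 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
w_enc, w_dec, losses = train_autoencoder(data)
```

The reconstruction loss falls as training forces the single code unit to retain the one degree of freedom the data actually has.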

Another example: consider a large collection of texts (say, comments from a digital platform). Through some clustering or manifold learning method, we can generate cluster assignments for the texts and then treat them as labels (provided the clustering works well).

After interpreting each cluster (for example, cluster A represents complaints about the product, cluster B represents positive feedback, and so on), a deep NLP architecture such as BERT can be used to classify new text into these clusters, all from completely unlabeled data and with minimal human involvement.
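The cluster-then-classify step can be sketched with a toy one-dimensional k-means that turns raw scores into pseudo-labels. A real pipeline would cluster high-dimensional text embeddings (for example with scikit-learn's KMeans) and fine-tune BERT on the resulting labels; the scores and seeds below are purely illustrative.

```python
def kmeans_1d(values, k=2, iters=20):
    # Naive 1-D k-means; seeds the k=2 centers at the extremes.
    centers = [min(values), max(values)]
    assignments = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest center.
        assignments = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assignments

# Pretend these are scalar sentiment scores extracted from unlabeled
# reviews; the cluster assignments become free "pseudo-labels" that a
# supervised classifier can then be trained on.
scores = [0.10, 0.15, 0.20, 0.80, 0.85, 0.90]
pseudo_labels = kmeans_1d(scores)
```

The point is the hand-off: an unsupervised step manufactures the labels that make a supervised step possible.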

This is another interesting application of turning an unsupervised task into a supervised one. In an era where the vast majority of data is unlabeled, creatively bridging supervised and unsupervised learning through blended learning has enormous value and potential.

Component learning

Component learning uses knowledge from multiple models, not just one. The belief is that through unique combinations of information or inputs (both static and dynamic), deep learning can reach deeper understanding and better performance than any single model.

Transfer learning is an obvious example of component learning. The idea is that model weights pre-trained on a similar problem can be fine-tuned for a specific problem. Pre-trained models such as Inception or VGG-16 are built to distinguish many different categories of images.

If I plan to train a model to recognize animals (such as cats and dogs), I will not train a convolutional neural network from scratch, because it would take far too long to reach good results. Instead, I will take a pre-trained model like Inception, which already encodes the fundamentals of image recognition, and train it on the cat-and-dog dataset for a few additional iterations.
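The freeze-the-backbone, train-the-head pattern can be sketched without any deep learning library (a toy stand-in, not Inception itself): a fixed "pretrained" feature extractor is reused as-is, and only a small linear head is trained on top of its features.

```python
def pretrained_features(x):
    # Stand-in for a frozen pre-trained backbone: its "weights" are
    # fixed, and we only reuse the representation it produces.
    return [x[0] + x[1], x[0] - x[1]]

def train_head(data, epochs=100, lr=0.1):
    # Train only a small linear head (perceptron rule) on the
    # frozen features; the backbone is never updated.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = pretrained_features(x)
            pred = 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * fi for wi, fi in zip(w, f)]
            b += lr * err
    return w, b

def predict(w, b, x):
    f = pretrained_features(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0

# Hypothetical two-class data: label 1 whenever x[0] > x[1].
data = [([2, 1], 1), ([3, 0], 1), ([1, 2], 0),
        ([0, 3], 0), ([5, 2], 1), ([2, 5], 0)]
w, b = train_head(data)
```

Because the backbone already maps inputs to useful features, the head needs only a handful of updates, which is exactly why fine-tuning is so much cheaper than training from scratch.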

Similarly, word embeddings in NLP neural networks map related words to nearby positions in the embedding space (for example, "apple" lies closer to "orange" than to "truck"). Pre-trained embeddings like GloVe can be dropped into a neural network, giving it a head start with meaningful representations that already map words to numbers.

Less obviously, competition can also stimulate knowledge growth. Generative adversarial networks borrow from the component learning paradigm by fundamentally pitting two neural networks against each other: the generator's goal is to fool the discriminator, and the discriminator's goal is not to be fooled.

Competition between models is called “adversarial learning”, not to be confused with the other sense of the term: designing malicious inputs to expose weak decision boundaries in a model.

Adversarial learning can stimulate models, usually of different types, where one model's performance is expressed relative to the others'. Much research remains to be done here; so far, generative adversarial networks are adversarial learning's one outstanding success.

Competitive learning, on the other hand, resembles adversarial learning at the level of individual nodes: nodes compete for the right to respond to a subset of the input data. It is implemented in a “competitive layer”, in which a group of otherwise identical neurons start out with randomly distributed weights.

Each neuron's weight vector is compared with the input vector, and the most similar neuron is activated as the “winner-takes-all” neuron (output = 1); the others are “disabled” (output = 0). This unsupervised technique is a core part of self-organizing maps and feature discovery.
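A winner-takes-all layer is simple enough to write out directly; the sketch below (illustrative only, using dot-product similarity) also shows the self-organizing-map-style update in which the winner's weights move toward the input.

```python
def competitive_layer(weight_vectors, x):
    # Compare each neuron's weight vector with the input; the most
    # similar neuron outputs 1, all others output 0.
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    sims = [dot(w, x) for w in weight_vectors]
    winner = sims.index(max(sims))
    return [1 if i == winner else 0 for i in range(len(weight_vectors))]

def update_winner(weight_vectors, x, lr=0.5):
    # Move only the winning neuron's weights toward the input, so
    # neurons gradually specialize on different regions of the data.
    out = competitive_layer(weight_vectors, x)
    i = out.index(1)
    weight_vectors[i] = [w + lr * (xi - w)
                         for w, xi in zip(weight_vectors[i], x)]
    return weight_vectors
```

Repeated over many inputs, each neuron drifts toward a different cluster of the data, which is the unsupervised feature discovery the text describes.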

Another example of component learning is neural architecture search. Put simply, in a reinforcement learning setting a neural network (usually a recurrent one) learns to generate the best network architecture for a given dataset: the algorithm finds the architecture for you. You can read more about the theory here, and find Python code here.

Ensemble methods are also important in component learning, and deep ensembles have proven effective. End-to-end stacking of models, such as encoders and decoders, has become very popular.

Much of component learning is about finding unique ways to connect different models. It all rests on this idea:

A single model, even a very large one, usually performs worse than several smaller models or components, each assigned one part of the task.

For example, consider the task of building a restaurant chatbot.

We could split the bot into three separate parts: greeting/small talk, information retrieval, and actions, and design a model for each. Or we could assign all three tasks to a single model.

It should come as no surprise that the composite model performs better while taking up less space. Moreover, these kinds of nonlinear topologies are easy to build with tools like the Keras functional API.
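The restaurant-bot decomposition can be sketched as a tiny router dispatching to three component handlers. This is a deliberately simplified stand-in: the rule-based `route` function plays the role of an intent classifier, and each handler plays the role of a specialized model; all names and replies are invented.

```python
def greeting_bot(msg):
    # Component 1: greetings and small talk.
    return "Hello! Welcome to our restaurant."

def info_bot(msg):
    # Component 2: information retrieval (hours, menu, location).
    return "We are open 11am-10pm daily."

def action_bot(msg):
    # Component 3: actions such as booking a table.
    return "Your table is booked."

def route(msg):
    # A toy keyword router standing in for a learned intent model;
    # each component handles only its own slice of the task.
    text = msg.lower()
    if any(w in text for w in ("hello", "hey", "good morning")):
        return greeting_bot(msg)
    if any(w in text for w in ("hours", "menu", "where")):
        return info_bot(msg)
    return action_bot(msg)
```

Each component can be developed, tested, and replaced independently, which is the practical payoff of component learning over one monolithic model.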

To handle increasingly diverse data types such as video and 3D data, researchers must construct creative composite models.

Read more about component learning and its future here.

Simplified learning

In deep learning, and especially in NLP (the most exciting area of deep learning research), model sizes keep growing. The latest GPT-3 model has 175 billion parameters; comparing it to BERT is like comparing Jupiter to a mosquito (well, not literally). Is the future of deep learning simply bigger?

Arguably not. GPT-3 is very persuasive, but history has repeatedly shown that “successful science” is the science with the greatest impact on humanity. Academic work that drifts too far from reality fades away. In the late 20th century, neural networks were forgotten for a time because there was too little available data; however clever the idea, it was useless.

GPT-3 is, at bottom, another language model that writes convincing text. Where are its applications? Yes, it can, for example, generate answers to queries. But there are more efficient ways to do that (for example, traversing a knowledge graph and using a smaller model such as BERT to output the answer).

With computing power finite, the enormous size of GPT-3 (to say nothing of even larger models) is neither feasible nor necessary.

“Moore’s Law is kind of useless.” Satya Nadella, CEO of Microsoft

Instead, we are moving toward a world of embedded AI: smart refrigerators that order food automatically, drones that navigate entire cities on their own. Powerful machine learning methods should be able to run on personal computers, mobile phones, and small chips.

This requires lightweight AI: making neural networks smaller while maintaining their performance.

Directly or indirectly, almost everything in deep learning research relates to reducing the number of parameters needed, which is closely tied to generalization and performance. For example, the introduction of convolutional layers greatly reduced the number of parameters neural networks need to process images. Recurrent layers incorporate the idea of time while reusing the same weights, letting networks process sequences with fewer parameters.
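The parameter savings from convolution are easy to quantify with simple arithmetic; the layer sizes below are common illustrative choices, not from the article.

```python
def dense_params(in_units, out_units):
    # Fully connected layer: every input connects to every output,
    # plus one bias per output unit.
    return in_units * out_units + out_units

def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    # Convolutional layer: the same small kernel is reused across the
    # whole image, so the count is independent of image size.
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

# Hypothetical comparison on a 28x28 grayscale image:
dense = dense_params(28 * 28, 128)  # flatten into a 128-unit dense layer
conv = conv_params(3, 3, 1, 32)     # 32 filters of size 3x3
```

A dense layer here needs over a hundred thousand parameters, while the convolutional layer needs a few hundred, and the gap widens as images get larger.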

Embedding layers explicitly map entities to values with semantic meaning, so the burden does not fall on extra parameters. In one interpretation, dropout layers prevent parameters from operating on certain parts of the input. L1/L2 regularization ensures the network uses all of its parameters by keeping any of them from growing too large, so each parameter maximizes its information value.

With such special-purpose layers, networks need fewer and fewer parameters for ever more complex and larger data. Other, newer methods explicitly seek to compress the network.

Neural network pruning seeks to remove the synapses and neurons that contribute nothing to the network's output. Through pruning, a network can maintain its performance while deleting almost all of itself.
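The simplest variant, magnitude pruning, can be sketched in a few lines (an illustrative toy over a flat weight list; real frameworks such as PyTorch prune per-layer tensors): keep only the largest-magnitude weights and zero the rest.

```python
def prune_by_magnitude(weights, keep_fraction=0.1):
    # Zero out all but the largest-magnitude weights. Ties at the
    # threshold may keep slightly more than the target fraction.
    n_keep = max(1, int(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[n_keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]
```

Zeroed weights cost nothing to store in a sparse format, which is where the size reduction comes from; in practice the network is usually fine-tuned after pruning to recover any lost accuracy.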

Other methods, like patient knowledge distillation, find ways to compress large language models into a form that can be downloaded onto, say, a user's phone. This consideration was necessary for Google Neural Machine Translation, the system powering Google Translate, which needed to offer a high-performance translation service accessible offline.
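At the heart of knowledge distillation is a simple loss: the small student model is trained to match the large teacher's softened output distribution. Below is a minimal sketch of that soft-target loss (the temperature value and logits are illustrative; patient distillation adds further terms over intermediate layers not shown here).

```python
import math

def softmax_t(logits, T=1.0):
    # Softmax with temperature T; T > 1 flattens the distribution.
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    # Cross-entropy of the student against the teacher's softened
    # distribution; the temperature exposes the teacher's knowledge
    # of relative class similarities, not just its top prediction.
    p_teacher = softmax_t(teacher_logits, T)
    p_student = softmax_t(student_logits, T)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
```

The loss is smallest when the student reproduces the teacher's full distribution, which transfers more information per example than hard labels alone.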

Essentially, simplified learning centers on deployment-oriented design, which is why most research on it comes from corporate research departments. One aspect of deployment-centric design is not to chase a dataset's benchmark metrics blindly, but to focus on the potential problems that arise when the model is deployed.

For example, the adversarial inputs mentioned earlier are malicious inputs designed to deceive a network: spray paint or stickers on road signs can induce an autonomous vehicle to exceed the speed limit. Responsible simplified learning not only makes models lightweight enough to use, but also ensures they can handle corner cases not represented in the dataset.

Simplified learning may receive the least attention in deep learning research, because “we achieved good performance with a feasible architecture size” is far less glamorous than “we achieved state-of-the-art performance with a gigantic architecture”.

Inevitably, as the history of innovation shows, when the chase for ever-higher benchmark scores fades, simplified learning, which is really just practical learning, will receive the attention it deserves.


  • The goal of blended learning is to cross the boundary between supervised and unsupervised learning. Methods such as semi-supervised and self-supervised learning can extract valuable information from unlabeled data, which matters enormously as the amount of unlabeled data grows exponentially.
  • As tasks grow more complex, component learning deconstructs a task into simpler components. When these components work together, or against each other, the result is a better model.
  • Simplified learning has not yet received much attention because deep learning is in a hype phase, but practical, deployment-centric design will emerge soon enough.