10 essential learning methods for artificial intelligence practitioners



Over the past decade, interest in machine learning has shown no sign of slowing down. You will see machine learning in computer science programs, industry conferences, and the Wall Street Journal almost every day. Amid all this discussion, many people conflate what machine learning can do with what they wish it could do. Fundamentally, machine learning uses algorithms to extract information from raw data and represent it in a model. We then use that model to make inferences about data we have not yet seen.

Neural networks are a class of machine learning models that have been around for more than 50 years. Their basic unit is a node that applies a nonlinear transformation, inspired by the biological neurons of the mammalian brain. The connections between these nodes also imitate the biological brain and, like it, develop through training over time.

Significant progress was made on many important neural network architectures in the mid-1980s and early 1990s. However, the time and the amount of data needed to get good results hindered adoption, and interest fell sharply. In the early 2000s, computing power grew exponentially, and the industry witnessed a "Cambrian explosion" of computing techniques that had not been feasible before. Deep learning stood out as a serious contender during this decade of explosive computing growth, winning many important machine learning competitions. Its popularity peaked in 2017, and today you can find deep learning in every corner of machine learning.

After a great deal of research and study, I want to share 10 powerful deep learning methods that engineers can apply to their own machine learning problems. Before we start, let's define deep learning. Pinning down a definition is a challenge for many people, because the field has slowly changed shape over the past decade. To define it visually, the diagram below shows the relationship between artificial intelligence, machine learning, and deep learning.

Artificial intelligence is the broadest field and has existed for more than 60 years. Deep learning is a sub-field of machine learning, and machine learning is a sub-field of artificial intelligence. What usually distinguishes deep learning from a traditional feed-forward multi-layer network is the following:

  • More neurons
  • More complex connections between layers
  • “Cambrian Explosion” for training computing power
  • Automatic feature extraction

When I say "more neurons", I mean that the number of neurons has grown year after year to express ever more complex models. The connectivity between layers has also evolved: from every layer being fully connected in multi-layer networks, to patches of locally connected neurons in convolutional neural networks, to recurrent connections in which a neuron feeds back into itself across time steps (in addition to its connection to the previous layer) in recurrent neural networks.

Deep learning can then be defined as a neural network with a large number of parameters and layers, built on one of the following four basic network architectures:

  • Unsupervised pre-trained networks
  • Convolutional neural networks
  • Recurrent neural networks
  • Recursive neural networks

This article will mainly cover the latter three architectures. A convolutional neural network is basically a standard neural network that has been extended across space by sharing weights. It is designed to recognize images through internal convolutions, which pick up features such as the edges of the objects in the image. A recurrent neural network is extended across time: instead of feeding its outputs into the next layer within the same time step, it feeds them into itself at the next time step. Recurrent networks are designed to recognize sequences, such as speech signals or text, and their internal loops store short-term memory in the network. A recursive neural network, by contrast, is more like a hierarchical network: the input sequence has no real time dimension, but the input must be processed hierarchically, in a tree-like manner. The following 10 methods can be applied to all of these architectures.

1 — Back propagation

Back propagation is simply a method of computing the partial derivatives (the gradient) of a function that has the form of a composite function, as a neural network does. When you solve an optimization problem with a gradient-based method (gradient descent is just one of them), you need to compute the gradient of the function at each iteration.

In neural networks, the objective function usually takes the form of a composite function. How do we compute its gradient? There are two ways: (i) analytic differentiation, where the form of the function is known and the derivatives can be computed directly using the chain rule; (ii) approximate differentiation using finite differences. The second method is computationally expensive because the number of function evaluations is O(N), where N is the number of parameters, which is far more work than analytic differentiation. Finite differences are therefore usually used only to verify a back propagation implementation during debugging.
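As a concrete illustration, here is a minimal NumPy sketch comparing the two approaches on a toy composite function; the function f and its analytic gradient are illustrative choices, not something from the article.

```python
import numpy as np

def f(w):
    # Toy composite objective: f(w) = sum(tanh(w)^2)
    return np.sum(np.tanh(w) ** 2)

def analytic_grad(w):
    # Chain rule: d/dw tanh(w)^2 = 2 * tanh(w) * (1 - tanh(w)^2)
    return 2 * np.tanh(w) * (1 - np.tanh(w) ** 2)

def finite_difference_grad(w, eps=1e-5):
    # O(N) function evaluations: perturb each parameter in turn
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

w = np.random.randn(5)
print(np.allclose(analytic_grad(w), finite_difference_grad(w), atol=1e-6))  # True
```

The finite-difference version loops over every parameter, which is exactly why it is only practical as a debugging check for the analytic (back propagation) gradient.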

2 — Stochastic gradient descent

An intuitive way to understand gradient descent is to imagine the path of a river running down the mountain. The goal of gradient descent is exactly what the river strives to achieve — that is, to reach the lowest point (foot of the mountain).

Suppose the terrain of the mountain lets the river flow without stopping until it reaches the lowest point (in the ideal case, in machine learning this means reaching the global minimum, the optimal solution, from the initial point). However, some terrains have many dips that make the river stagnate partway along its path. In machine learning terminology these pits are called local minima, and they are situations we want to avoid. There are many ways to deal with this problem.

Therefore, gradient descent tends to get stuck at local minima, depending on the nature of the terrain (or, in machine learning, the function). When the terrain is of a special bowl-shaped kind, called a convex function in machine learning, the algorithm is guaranteed to find the optimal solution. Convex functions are the functions we most wish for in machine learning optimization. Moreover, starting from different peaks (initial points) leads to different paths down to the lowest point, and the speed of the river's flow (the learning rate or step size in gradient descent) also affects the shape of the path. These variables determine whether gradient descent gets trapped in a local optimum or escapes it.
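To make the metaphor concrete, below is a minimal sketch of stochastic gradient descent fitting a toy linear model; the data, learning rate, and epoch count are all illustrative assumptions.

```python
import numpy as np

# Toy linear regression fit with stochastic gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)            # initial point (the "starting peak")
lr = 0.05                  # learning rate (the "flow rate of the river")
for epoch in range(20):
    for i in rng.permutation(len(X)):           # one randomly chosen sample per step
        grad = 2 * (X[i] @ w - y[i]) * X[i]     # gradient of the squared error
        w -= lr * grad                          # take a step downhill
print(w)   # close to true_w
```

Because each update uses a single random sample instead of the whole data set, the path downhill is noisy, which is exactly the "stochastic" part of stochastic gradient descent.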

3 — Learning rate decay

Adjusting the learning rate during stochastic gradient descent can improve performance and reduce training time; this is called learning rate annealing or adaptive learning rates. The simplest and probably most commonly used adjustment is to reduce the learning rate over time: a larger learning rate early in training produces larger updates, and a smaller learning rate later lets you fine-tune the weights.

Two simple and commonly used learning rate decay schemes are the following (a sketch of both is given after the list):

  • Decrease the learning rate gradually as the number of epochs grows;
  • Drop the learning rate sharply at specific epochs.
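A minimal sketch of both schedules might look like the following; the initial learning rate, decay factor, and drop interval are illustrative values.

```python
def time_based_decay(lr0, epoch, decay=0.01):
    # Gradually shrink the learning rate as the epoch count increases
    return lr0 / (1.0 + decay * epoch)

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every `epochs_per_drop` epochs
    return lr0 * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 10, 20, 30):
    print(epoch, time_based_decay(0.1, epoch), step_decay(0.1, epoch))
```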

4 — Dropout

Deep neural networks with large numbers of parameters are very powerful machine learning systems, but they suffer from serious overfitting. Large networks are also slow to run, which makes it slow to fight overfitting at test time by combining the predictions of many different large neural networks. Dropout is a technique aimed at exactly this problem.

The key idea is to randomly drop units (and their connections) from the network during training, thereby preventing overfitting. During training, dropout samples from an exponential number of different thinned networks. At test time, the effect of averaging the predictions of all these thinned networks can be approximated simply by using a single unthinned network with scaled-down weights. This significantly reduces overfitting and gives larger improvements than other regularization methods. Dropout has been shown to improve neural network performance on supervised learning tasks in computer vision, speech recognition, text classification, and computational biology, achieving top results on several benchmark data sets.
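Below is a minimal sketch of the training-time masking, using the widely used "inverted dropout" variant that rescales activations during training so the test-time network needs no weight scaling; the drop probability is an illustrative choice.

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, training=True):
    """Inverted dropout: rescale at training time so nothing changes at test time."""
    if not training:
        return activations                      # test phase: use the full (unthinned) network
    mask = (np.random.rand(*activations.shape) >= p_drop) / (1.0 - p_drop)
    return activations * mask                   # randomly drop units, rescale the survivors

h = np.random.randn(4, 8)                       # a batch of hidden-layer activations
print(dropout_forward(h, p_drop=0.5))
```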

5 — Maximum pooling

Max pooling is a sample-based discretization method. The goal is to down-sample an input representation (an image, the output matrix of a hidden layer, etc.), reduce its dimensionality, and allow assumptions to be made about the features contained in the sub-regions that are binned together.

By providing an abstracted form of the representation, this approach helps to some extent with overfitting. It also reduces the amount of computation by reducing the number of parameters to learn, and it gives the internal representation basic translation invariance. Max pooling extracts features and guards against overfitting by taking the maximum value within each (usually non-overlapping) sub-region of the initial representation.
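A minimal NumPy sketch of non-overlapping max pooling over a 2D feature map; the 4x4 input and 2x2 window are illustrative.

```python
import numpy as np

def max_pool_2d(x, size=2):
    """Non-overlapping max pooling over an H x W feature map (H and W divisible by `size`)."""
    h, w = x.shape
    x = x.reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))    # keep only the strongest activation in each sub-region

fmap = np.arange(16).reshape(4, 4)
print(max_pool_2d(fmap))         # 2x2 output holding the max of each 2x2 block
```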

6 — Batch normalization

Neural networks (including deep networks) usually require careful tuning of weight initialization and learning parameters. Batch normalization makes this process easier.

Weight problem:

  • Whatever the weight initialization, whether random or empirically chosen, the initial weights are far from the learned weights. Considering a mini-batch in an early epoch, there may be many outliers among the required feature activations.
  • Deep neural networks are inherently ill-conditioned: a small change in the early layers causes a large change in the later layers.

During back propagation these phenomena distort the gradients: the gradients have to compensate for the outliers before the weights can learn to produce the required outputs, so extra epochs are needed to converge.

Batch normalization normalizes the activations systematically across each mini-batch, preventing the gradients from being pulled off course by outliers and keeping them directed toward the common goal (via the normalization) within the range of the mini-batch.

Learning rate issues:

  • The learning rate is usually kept small so that only a small fraction of the gradient corrects the weights, because gradients produced by outlier activations should not affect the already-learned activations. With batch normalization these outlier activations are reduced, so a larger learning rate can be used to accelerate learning.
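A minimal sketch of the training-time batch normalization computation (at inference time, running averages of the mean and variance would be used instead); the batch shape and the gamma and beta values are illustrative.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift (training-time pass)."""
    mean = x.mean(axis=0)                       # per-feature mean over the batch
    var = x.var(axis=0)                         # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)     # zero-mean, unit-variance activations
    return gamma * x_hat + beta                 # learnable scale (gamma) and shift (beta)

batch = np.random.randn(32, 8) * 5 + 3          # activations with large mean and variance
out = batch_norm_forward(batch, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # roughly 0 and 1 per feature
```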

7 — Long short-term memory

The neurons of a long short-term memory (LSTM) network differ from the neurons commonly used in other RNNs, and have the following three characteristics:

  • It can decide when to let input into the neuron;
  • It can decide when to remember what was computed in the previous time step;
  • It can decide when to pass the output on to the next time step.

The power of the LSTM is that it decides all of the above based only on the current input. Take a look at the chart below:

The input signal x(t) at the current time step determines all three of the above values. The input gate makes the first decision, the forget gate the second, and the output gate the third. This design is inspired by the way our brains work and can handle sudden scene changes in the input.
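Here is a minimal sketch of a single LSTM time step with the three gates; the weight shapes and random values are illustrative, and a real implementation would learn W, U, and b during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the input, forget, output, and candidate blocks."""
    z = W @ x_t + U @ h_prev + b                 # all four pre-activations at once
    H = h_prev.size
    i = sigmoid(z[0:H])                          # input gate: let new input in?
    f = sigmoid(z[H:2 * H])                      # forget gate: keep the previous memory?
    o = sigmoid(z[2 * H:3 * H])                  # output gate: pass the output on?
    g = np.tanh(z[3 * H:4 * H])                  # candidate cell content
    c_t = f * c_prev + i * g                     # updated cell (long-term) state
    h_t = o * np.tanh(c_t)                       # updated hidden (short-term) state
    return h_t, c_t

D, H = 3, 4                                      # illustrative input and hidden sizes
x_t, h0, c0 = np.random.randn(D), np.zeros(H), np.zeros(H)
W, U, b = np.random.randn(4 * H, D), np.random.randn(4 * H, H), np.zeros(4 * H)
print(lstm_step(x_t, h0, c0, W, U, b))
```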

8 — Skip-gram

The goal of a word embedding model is to learn a high-dimensional dense representation for each vocabulary item, in which similarity between embedding vectors reflects semantic or syntactic similarity between the corresponding words. Skip-gram is one model for learning word embeddings.

The main idea behind the skip-gram model (and many other word embedding models) is that two vocabulary items are similar if they appear in similar contexts.

In other words, suppose you have a sentence such as "cats are mammals". If you replace "cats" with "dogs", the sentence is still meaningful. So in this example, "dogs" and "cats" share a similar context (i.e., "are mammals").

Based on this assumption, we consider a context window, that is, a window containing k consecutive items. We then skip one of the words and train a neural network that takes all the items except the skipped one and tries to predict it. If two words repeatedly share similar contexts in a large corpus, their embedding vectors will end up very similar.
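A minimal sketch of how (center word, context word) training pairs could be generated for a skip-gram model; the toy corpus and window size k are illustrative.

```python
# Generate (center, context) training pairs for a skip-gram model.
corpus = "cats are mammals and dogs are mammals".split()
k = 2   # context window: up to k words on each side

pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - k), min(len(corpus), i + k + 1)):
        if j != i:
            pairs.append((center, corpus[j]))

print(pairs[:6])
# A neural network is then trained to predict each context word from its center word;
# words that share contexts (here "cats" and "dogs") end up with similar embeddings.
```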

9 — Continuous bag-of-words model

In natural language processing, we want to represent each word in a document as a numeric vector, so that words appearing in similar contexts get very similar vectors. In the continuous bag-of-words model (CBOW), the goal is to use the context of a particular word to predict the probability of that word appearing.

We do this by extracting a large number of sentences from a big corpus. Every time the model sees a word, we also extract the context words that appear around it. These context words are then fed into a neural network that predicts the probability of the center word given its context.

With tens of thousands of such context-and-center-word pairs, we have a data set for training the neural network. The hidden layer of the trained network ends up encoding the embedding of each specific word: words with similar contexts get similar vectors, and it is precisely these vectors that we use to represent the meaning of a word.
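For comparison with skip-gram, here is a minimal sketch of the reversed direction used by CBOW, where the extracted context predicts the center word; again, the corpus and window size are illustrative.

```python
# For CBOW the direction is reversed: the context words predict the center word.
corpus = "cats are mammals and dogs are mammals".split()
k = 2

samples = []
for i, center in enumerate(corpus):
    context = [corpus[j] for j in range(max(0, i - k), min(len(corpus), i + k + 1)) if j != i]
    samples.append((context, center))

print(samples[0])   # (['are', 'mammals'], 'cats')
# A network averages the context word vectors in its hidden layer and is trained to
# maximize the probability of the center word; that hidden layer becomes the embedding.
```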

10 — Transfer learning

Now consider how an image actually flows through a convolutional neural network; this helps explain how the knowledge learned by a general CNN can be transferred to other image recognition tasks. Suppose we feed an image into the first convolutional layer and get an output that combines pixels, perhaps into recognizable edges. Applying convolution again combines these edges and lines into simple shape outlines. Iterating in this way, the convolutions hierarchically build up ever more specific patterns, so the last layer combines the earlier abstract features to detect very specific patterns. If our network was trained on ImageNet, the last layer combines those abstract features to recognize the specific 1,000 ImageNet categories. If we replace that last layer with one for the categories we actually want to recognize, it can be trained and can recognize them very efficiently.

Each layer of a deep convolutional network builds an increasingly high-level feature representation. The last few layers tend to be specialized to the data fed into the network, while the features learned by the earlier layers are more general.

Transfer learning works by modifying a CNN that has already been trained. We usually cut off the last layer and retrain a newly created final classification layer on the new data. This process can be interpreted as reassembling the high-level features into the new targets we need to recognize. Training time and data requirements drop dramatically, because we only need to retrain the last layer to complete training of the whole model.
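A minimal PyTorch sketch of this recipe, assuming torchvision is available; resnet18, the ImageNet weights string, and the 10-class output layer are illustrative choices rather than the article's specific setup.

```python
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

# Load a CNN pre-trained on ImageNet (resnet18 is just one convenient choice).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the earlier layers: their general features are reused, not retrained.
for param in model.parameters():
    param.requires_grad = False

# Cut off the final 1000-class layer and attach a new classifier for our own categories.
num_classes = 10                                   # illustrative: our new recognition task
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the parameters of the new last layer are handed to the optimizer for retraining.
optimizer = optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
```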

Deep learning is a heavily empirical craft: many of its techniques lack detailed explanations or theoretical derivations, yet most experimental results show that they work. So perhaps understanding these techniques from first principles is the work still left for us to do.