Original article was published on Deep Learning on Medium

# Principles of Machine Learning

Deep Blue is the first computer to win the chess world championship. It was 1996, and after 20 years, another program AlphaGo defeated the best human Go player. Deep Blue is a model-based system with hard-line chess rules. AlphaGo is a data mining system. It is a deep neural network trained through thousands of Go games. There is no improved hardware, but the breakthrough of the software is essential for defeating top chess players to defeating top Go players.

Machine learning or “artificial intelligence” does not always involve data mining strategies. In fact, the most popular and profitable data mining method works without any fancy neural networks or support vector machines.

**Principles of machine learning**

A learning algorithm will provide data samples, usually in some way to derive data from historical prices. Each sample consists of n variables x 1 .. xn, which are usually called predictors, features, signals or inputs. These predictors can be the price return of the last n bars, they can also be a collection of classic indicators, or they can be any other imaginable function of the price curve. Each sample usually also contains the target variable y, such as the next transaction return or the next price change after sampling. In the literature, you can find y also known as tags or targets. During the training process, the algorithm learns to predict the target y from the predictor variables x 1 .. xn. The learned “memory” is stored in a data structure called a model, which is specific to the algorithm (not to be confused with the financial model of model-based strategy)! ). The machine learning model may be a function with prediction rules in the C code generated by the training process. It can also be a set of connection weights of the neural network.

Predictors, characteristics, or any variables you call must have enough information to accurately predict the target y. They usually must also meet two formal requirements. First, all predicted values should be in the same range, for example -1 .. +1 (for most R algorithms) or -100 .. +100 (for Zorro or TSSB algorithms). Therefore, before sending them to your computer, you need to standardize them in some way. Second, the sample should be balanced, that is, evenly distributed over all values of the target variable. Therefore, there should be as many winning samples as there are losing samples. If you do not comply with these two requirements, you will wonder why machine learning algorithms produce bad results.

The regression algorithm can predict a value, such as the magnitude and sign of the next price movement. Classification algorithms can predict qualitative sample categories, for example, rising or falling. Some algorithms (such as neural networks, decision trees or support vector machines) can be run in two modes.

Some algorithms learn to divide samples into classes without any objective y. This is unsupervised learning, not supervised learning using goals. Something in between is reinforcement learning, that is, the system performs self-training by running simulations with a given function and using the results as training targets. AlphaGo’s successor, AlphaZero, uses reinforcement learning by playing millions of Go games on itself. In the financial field, there are few applications for unsupervised learning or reinforcement learning. 99% of machine learning strategies use supervised learning.

No matter which signals we use as financial forecast indicators, they are very likely to contain a lot of noise and a small amount of information, and are unstable above it. Therefore, financial prediction is one of the most difficult tasks in machine learning. More complex algorithms may not necessarily achieve better results. Choosing predictors is critical to success. Using a large number of predictors is not a good idea, because it will only cause overfitting and out-of-sample operations to fail. Therefore, data mining strategies usually use a preselection algorithm to determine a small number of predictors among many predictors. The pre-selection can be based on the correlation, importance, information content of the predictor variables or the prediction success based only on the test set.

This is a list of the most popular data mining methods in finance.

**Technical index**

Most of the trading systems we program for clients are not based on financial models. Customers just want trading signals from certain technical indicators, and combine with other technical indicators to filter more technical indicators. When asked how to turn these messy indicators into a profitable strategy, he usually answers: “Trust me. I am trading manually and it works fine.”

Indeed, at least sometimes it will succeed. Although most of these systems have not passed the WFA test (some even fail a simple backtest), the number is surprisingly large. And these are often profitable in real trading. Customers systematically tried technical indicators until they found a combination that could be used for real-time trading of certain assets. This trial-and-error technical analysis method is a classic data mining method that is only performed by humans, not by machines. I can’t really recommend this method-it may involve a lot of luck, let alone money-but I can prove that it sometimes brings a profitable system.

**Candlestick pattern**

Not to be confused with the Japanese candle pattern that had its best date long ago. Modern equivalents are price action transactions. You are still viewing the candle’s opening price, highest price, lowest price and closing price. You still want to find a pattern that predicts the price direction. However, you are now data mining contemporary price curves to collect these patterns. There are software packages for this purpose. They search for patterns that can benefit from certain user-defined criteria and use them to build specific pattern detection capabilities. It may look like this (from Zorro’s pattern analyzer):

When the signal matches one of the patterns, this C function will return 1, otherwise it returns 0. As you can see from the lengthy code, this is not the fastest way to detect patterns. A better method that Zorro uses when it does not need to export detection functions is to sort the signals according to their size and check the sort order. An example of such a system can be found here.

Can price action trading really work? Just like technical indicators, it is not based on any rational financial model. At best, one can only imagine that the sequence of price fluctuations will cause market participants to react in some way, thereby establishing a temporary forecasting model. However, when you only view the sequence of a few adjacent candles, the number of patterns is very limited. The next step is to compare candles that are not adjacent but arbitrarily selected over a longer period of time. In this way, you will get an almost unlimited number of patterns-but at the expense of eventually leaving the realm of reason. It is difficult to imagine how to predict price movements through some candlestick patterns a few weeks ago.

Nevertheless, much effort is still required. Blog author Daniel Fernandez runs a subscription website (Asirikuy), specializing in data mining candlestick patterns. He refined the model transaction to the smallest detail, and if anyone can make any profit in this way, it is him. But to his subscribers’ disappointment, the real-time trading model (QuriQuant) produced results that were very different from his excellent backtest results. If there are indeed profitable price action systems, obviously no one has yet found them.

**Linear regression**

The simple basis of many complex machine learning algorithms: predicting the target variable y through a linear combination of predictors x 1 .. xn.

**Sensor**

Usually called a neural network with only one neuron. In fact, the perceptron is a regression function as described above, but has a binary result, so it is called logistic regression. It is not a regression, but a classification algorithm. Zorro’s advisor (PERCEPTRON,…) function generates a C code that returns 100 or -100, depending on whether the predicted result is above the threshold

**Neural Networks**

Linear or logistic regression can only solve linear problems. Many are not in this category — a famous example is predicting the output of a simple XOR function. And most likely also predict prices or return on investment. Artificial neural networks (ANN) can solve nonlinear problems. This is a bunch of perceptrons, which are connected together in multiple layers. Any perceptron is a neuron of the network. Its output will enter the input of all neurons in the next layer, as shown below:

Like perceptrons, neural networks can also learn by determining coefficients that minimize the error between sample prediction and sample target. But this now requires an approximation process, usually backpropagating the error from the output to the input to optimize the weights. This process imposes two restrictions. First, the neuron output must now have a continuously differentiable function instead of a simple perceptron threshold. Second, the network must not be too deep-the neurons between the input and output must not have too many “hidden layers”. The second constraint limits the complexity of the problems that standard neural networks can solve.

When using neural networks to predict transactions, you will use a lot of parameters, and if you are not careful, there will be a lot of selection bias:

**Hidden layers**

**Number of neurons in each hidden layer**

**The number of back propagation cycles, called the period**

**Learning rate, the pace of an era**

**Momentum and weight adaptive inertia factor**

**Activation function**

For back propagation, you need a continuously differentiable function that can generate a “soft” step at a certain value of x. Sigmoid, tanh or softmax functions are usually used. Sometimes it is also a linear function, returning only the weighted sum of all inputs. In this case, the network can be used for regression, to predict numerical values rather than binary results.

**Deep learning**

Deep learning methods use neural networks with many hidden layers and thousands of neurons, and traditional backpropagation can no longer effectively train them. In the past few years, several methods for training such a huge network have become very popular. They usually pre-train hidden neuron layers to achieve a more effective learning process. A restricted Boltzmann machine (RBM) is an unsupervised classification algorithm with a special network structure that has no connection between hidden neurons. A sparse autoencoder (SAE) uses a traditional network structure, but by reproducing the input signal on the layer output with as few active connections as possible, the hidden layer is pre-trained in a clever way. These methods allow the use of very complex networks to handle very complex learning tasks. For example, defeat the best human Go player in the world.

Deepnet and darch R software packages provide deep learning networks. Deepnet provided the autoencoder, and Darch provided the restricted Boltzmann machine. I haven’t tried Darch yet, but here is an example script that uses the Deepnet autoencoder with 3 hidden layers, which can be used to trade signals through Zorro’s Neuro() function:

**Support Vector Machines**

Like neural networks, support vector machines (SVM) are another extension of linear regression. When we look at the regression formula again

We can interpret the feature xn as the coordinates of the n-dimensional feature space. Setting the target variable y to a fixed value will determine a plane in this space, called a hyperplane, because it has more than two dimensions (actually n-1). The hyperplane separates samples with y>0 from samples with y<0. The coefficient can be calculated in the following way: the distance from the plane to the nearest sample (called the “support vector” of the plane, hence the name of the algorithm) is the largest. In this way, we have a binary classifier that can best separate the winning and losing samples.

The problem is: usually these samples are not linearly separable-they are scattered irregularly in the feature space. There is no flat surface between the winner and the loser. If possible, we can use a simpler method to calculate the plane and perform linear discriminant analysis. But for common situations, we need the SVM trick: add more dimensions to the feature space. To this end, the SVM algorithm uses the kernel function to generate more functions to combine the existing two predictors into a new function. This is similar to the above step from simple regression to polynomial regression. In this step, by adding the unique predictor to the nth power, more features are added. The more dimensions added, the easier it is to separate the sample with a flat hyperplane. Then, the plane is converted back to the original n-dimensional space, wrinkled and wrinkled on the way. By cleverly selecting the kernel function, the process can be performed without actually calculating the conversion.

Like neural networks, support vector machines can be used not only for classification, but also for regression. They also provide some parameters to optimize and possibly overfit the prediction process:

Kernel function. Normally, you use RBF kernels (radial basis functions, symmetric kernels), but you can also choose other kernels, such as Sigmoid, polynomial, and linear.

**Gamma, the width of the RBF kernel**

**Cost parameter C, the “penalty” for misclassification in training samples**

The commonly used SVM is the libsvm library. It is also available through the R version of the e1071 software package. In the next and final part of this series, I plan to describe the trading strategy using this SVM.

**K nearest neighbor**

Compared with the heavy ANN and SVM, this is a nice simple algorithm with unique properties: no training is required. So the sample is the model. You can use this algorithm for a permanently learned trading system by simply adding more and more samples. The nearest neighbor algorithm calculates the distance from the current feature value to the feature space of the k nearest samples. Just like two dimensions, calculate the distance between two feature sets (x 1 .. xn) and (y 1 .. yn) in n-dimensional space:

The algorithm only predicts the target based on the average of the k target variables of the latest sample (weighted by its inverse distance). It can be used for classification and regression. Software techniques borrowed from computer graphics, such as adaptive binary tree (ABT), can make nearest neighbor search quite fast. In my past life as a computer game programmer, we used this method in games to accomplish tasks such as self-learning enemy intelligence. You can call the knn function in R for nearest neighbor prediction-or write a simple function in C for this purpose.

**K mean**

This is an approximate algorithm for unsupervised classification. It has some similarities with k nearest neighbors, not just the name. In order to classify the samples, the algorithm first places k random points in the feature space. Then, assign all the samples with the shortest distance to any of these points. Then move the point to the average of these recent samples. This will result in a new sample allocation because some samples are now closer to another point. The process is repeated until the allocation is no longer changed by moving these points, that is, each point is located exactly at the average of its most recent sample. Now, we have k sample categories, and each sample is near one of the k points.

This simple algorithm can produce surprisingly good results. In R, the kmeans function can solve the problem. An example of the k-means algorithm used to classify candlesticks can be found here: Unsupervised candlestick classification for fun and profit.

**Naive Bayes**

The algorithm uses Bayes’ theorem to classify samples of non-numeric features (ie, events), such as the candle pattern described above. Suppose that event X (for example, the “open” of the previous bar is lower than the “open” of the current bar) appears in 80% of all winning samples. So, what is the probability of winning when the sample contains event X? Probably not 0.8 as you think. Probability can be calculated using Bayes’ theorem:

P(Y | X) is the probability that event Y (fi wins) occurs in all samples containing event X (in our example, Open(1)<Open(0)). According to the formula, it is equal to the probability of occurrence of X in all winning samples (here 0.8) multiplied by the probability of Y in all samples (approximately 0.5 when following my recommendations for balanced samples above) and divided by the probability of X in all samples .

If we are naive and assume that all events X are independent of each other, we can calculate the overall probability of the sample winning by simply multiplying the probability P (X | winning) of each event X. In this way, we finally get the following formula

The scale factor is s. In order for the formula to work, the functions should be selected as independently as possible, which constitutes an obstacle to the use of Naive Bayes in trading. For example, the two events Close (1) <Close (0) and Open (1) <Open (0) are most likely not independent of each other. By dividing numbers into separate ranges, you can convert numeric predictors into events.

Naive Bayes algorithm can be used in the ubiquitous e1071 R software package.

**Decision tree and regression tree**

These trees predict results or values based on a series of if-then-else decisions, and their structure is similar to the branches of the tree. Any decision is the presence or absence of an event (in the case of non-numeric features) or the comparison of feature values to fixed thresholds. Typical tree functions generated by Zorro’s tree generator are as follows:

How to produce such a tree from a set of samples? There are several methods. Zorro uses Shannon information entropy. First it will check one of the features, for example x1. It puts the hyperplane with the plane formula x1 = t into the feature space. This hyperplane is separated from the sample by X 1> t from sample X 1 <t. The division threshold t is selected so that the information gain — the difference between the information entropy of the entire space and the information entropy of the two divided subspaces is the largest. This is the case when the samples in the subspace are more similar to each other than the samples in the entire space.

Then, repeat this process using the next feature x 2 and two hyperplanes that split the two subspaces. Each segment is equivalent to a comparison of features and thresholds. Through repeated splitting, we will soon get a huge tree, which contains thousands of threshold comparisons. Then, run the process backwards by pruning the tree and deleting all decisions that do not lead to substantial information acquisition. In the end, we get the relatively small tree in the above code.

Decision trees have a wide range of applications. They can produce excellent predictions that are superior to neural networks or support vector machines. But they are not a omnipotent solution, because their segmentation planes are always parallel to the axis of the feature space. This limits their prediction to some extent. They can be used not only for classification, but also for regression, for example by returning the percentage of samples that contributed to a certain branch of the tree. Zorro’s tree is a regression tree. The most famous classification tree algorithm is C5.0, R can be found in the C50 software package.

In order to further improve the prediction effect or overcome the parallel axis limitation, a bush called random forest can be used. The predictions are then generated by averaging or voting the predictions from a single tree.

**Conclusion**

You can use many different data mining and machine learning methods. Key question: Will a strategy based on machine learning be better? There is no doubt that machine learning has many advantages. You don’t need to care about the market’s microstructure, economy, and trader’s psychology. You can focus on pure math. Machine learning is a more elegant and attractive way of generating trading systems. Although there are many enthusiastic topics on the trading forum, it tends to mysteriously fail in real-time trading.

Every two weeks, a new paper about trading using machine learning methods will be published (some of them can be found below). Please be cautious about these publications. According to some papers, the phantom win rate of 70%, 80% or even 85% has been reached. Although the winning rate is not the only relevant standard-even if the winning rate is high, you may lose money-the accuracy of the forecast transaction is usually 85%, which is higher than the profit rate of 50%. Through such a system, the relevant scientists should also be billionaires. Unfortunately, I have never used the method described to reproduce those win rates, and they are not even close. As a result, there may be many selection biases. Or maybe I am too stupid.

Compared with model-based strategies, so far, I have not seen many successful machine learning systems. From the algorithmic methods heard by successful hedge funds, machine learning still seems to be rarely used. However, with the advent of more processing power and new algorithms for deep learning, this situation may change in the future.