Source: Deep Learning on Medium
Machine learning is nothing more than predicting values and gaining insight into data: predicting a class or label, like different kinds of fruit, predicting what the next price of a certain stock will be, or grouping examples with clustering methods to understand the data better. Nothing about machine learning is hard to understand if you don’t dive into the actual mathematics. The mathematics can be hard, but everyone can understand and reason about machine learning and how it works. It all comes down to spatial reasoning, as you will see with feature spaces and decision boundaries.
A quick overview of what we’re going to cover.
- introduction to machine learning
- different types of machine learning
- basic machine learning concepts
- different machine learning methods
- testing our learned models
One very simple example will explain almost everything you need to know about what machine learning is and does. Machine learning comes down to feature spaces and decision boundaries. As you will see, machine learning is nothing more than spatial geometric reasoning.
A feature is something you use to describe an object. Features of people could be height, width, skin color, etc. Features of cars could be horsepower, color, length, etc. Features of the stock market could be price, time, number of tweets, etc. Alright, you get the point. Features describe something, and a feature space is the x-dimensional space in which these objects live. If we describe fruit with 2 features, width and height, we get a 2-dimensional feature space. If we add ‘roundness’ with values from 1–10, we get a 3-dimensional space. Adding another feature like color gets us a 4-dimensional feature space, and so on. We can only imagine up to 3 spatial dimensions.
Let’s look a little closer at the fruit example with 2 features, width and height. We measure the width and height of a number of apples and pears. If we plot these measurements in our feature space, we get something like this.
Our machine learning model (in this case a standard perceptron, which generates a linear decision boundary; what a perceptron is and does exactly will be explained further on) learns the distinction by splitting these points with what is called a decision boundary. If we introduce a new fruit (black label) and would like to know whether it’s an apple or a pear, we add it to our feature space. Our model checks on which side of the decision boundary it lands and labels our new fruit an apple.
Keep in mind that this is just a 2-dimensional feature space; with 100 features it would be a 100-dimensional feature space. We cannot imagine that (we can only imagine up to 3 spatial dimensions), but mathematics doesn’t care: the calculations stay the same regardless of the number of spatial dimensions and thus features.
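The fruit example above can be sketched in a few lines. A minimal sketch with scikit-learn, assuming made-up width and height measurements (the numbers are purely illustrative):

```python
# A perceptron learning a linear decision boundary in a 2-dimensional
# feature space (width, height). The measurements are invented for
# illustration, not real fruit data.
import numpy as np
from sklearn.linear_model import Perceptron

# width (cm), height (cm): apples roughly as wide as tall, pears taller
X = np.array([[8.0, 7.5], [7.5, 7.8], [8.2, 8.0],   # apples
              [6.0, 9.0], [5.8, 9.5], [6.3, 9.2]])  # pears
y = np.array([0, 0, 0, 1, 1, 1])                    # 0 = apple, 1 = pear

clf = Perceptron().fit(X, y)

# a new, unlabeled fruit: which side of the decision boundary is it on?
new_fruit = np.array([[7.9, 7.6]])
print(clf.predict(new_fruit))  # [0] -> apple
```

The same call works unchanged if `X` has 100 columns instead of 2; the mathematics doesn’t care about the number of dimensions.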
Let’s dig a little deeper into decision boundaries. Each machine learning method operates a little differently: some produce straight decision boundaries (linear decision boundary functions), some produce curvy ones (higher polynomial decision boundary functions). It all depends on which method you use. A linear SVM has linear decision boundaries because it uses a linear kernel (more on this later); with a non-linear kernel such as an RBF (Radial Basis Function) kernel, it produces curved decision boundaries. Decision trees split your data based on your features, so their decision boundaries are straight. Random forests use a lot of decision trees and perform bagging (more on this later), which lets them map your feature space more tightly. These are just a few examples of machine learning methods and their decision boundaries; the principle stays the same whatever method you use.
First things first, there are some very important distinctions you need to know: the difference between classification and regression, between supervised and unsupervised learning, and between deep learning and reinforcement learning. Machine learning encompasses all of these, including reinforcement and deep learning.
Classification vs regression vs clustering
With classification, you want to assign a label to something and separate your data as much as possible. This can be an object in an image, the sentiment of a specific sentence, the kind of a certain fruit, etc. Regression is about fitting a curve as closely as possible to your data and learning how variables influence each other; it can be applied to time series, stock predictions, weather forecasts, etc. Classification predicts discrete variables whereas regression predicts continuous variables. Clustering is used to gain insight into your data and does not require labels.
Unsupervised vs supervised learning
Learning from your data requires labeling your data, right? How else could the machine possibly know what you want it to learn? You have to specify that a certain image is of an apple and another is of a banana if you want your machine learning model to learn the distinction. This is called supervised learning. It is the most used approach, but also the most time-consuming: labeling data is cumbersome and boring, and is seen as the Achilles heel of machine learning. The other approach does not label data and is called unsupervised learning. Methods such as k-means, which will be explained later on, are unsupervised and are used for clustering problems where you would like to find similarities in your data without actually knowing what it represents. They can be used to gain more insight into your data.
Reinforcement learning is a special kind of learning and is widely used and very popular for machines learning to play video games (see OpenAI for more examples of reinforcement learning). Learning based on reinforcement happens through an ‘agent’ which tries to execute steps towards a goal. If this agent does something wrong, for example goes the wrong way, it is penalized. If it does something right, such as jump over an enemy, it is rewarded. It tries to optimize a cost function that defines the goal to reach and when it is penalized or rewarded.
A very impressive use case is from OpenAI and features OpenAI Five which currently beats the top DOTA2 teams around the world.
Standard machine learning becomes deep learning when neural networks with multiple layers are involved. A neural network has an input and an output layer and can have any number of hidden layers in between, with a vast variety of ways in which they’re interconnected. Deep learning is very popular in computer vision, where deep convolutional layers are used.
Let’s start by explaining the basic concepts in machine learning.
What does it mean for your data to be linearly separable? If your data can be separated by a first-degree polynomial function, such as f(x)=x, then your data is linearly separable. If your data is more complex, as seen in the image below, it’s not linearly separable.
Examples of linear classifiers that can separate data linearly, and thus produce linear decision boundaries, are perceptrons (what a perceptron is will be explained later on). If your data can only be separated by higher polynomial functions, it’s not linearly separable; examples of classifiers that can handle such data are support vector machines (SVMs) with a radial basis function kernel (which will also be explained later).
Curse of dimensionality
Every problem has a certain dimensionality: a certain number of features that have to be taken into account when training your model. If the number of dimensions approaches or exceeds the number of training examples, it becomes very easy for your machine learning model to construct decision boundaries that separate all training instances perfectly, and it will overfit on your dataset.
Underfitting & overfitting
Training a model can go wrong in two ways: you either don’t learn your data well enough, or you learn it a little too well. If your machine learning method is not able to learn your data, you’re underfitting; if it learns it too well, you’re overfitting. Both are bad. Overfitting shows up as very high accuracy on your training data but very bad results on unseen test data. The fitting of your model is highly correlated with concepts called bias and variance, which are explained next.
Bias & variance
Bias and variance are two important concepts to take into account when training and validating a machine learning method, and there is always a trade-off between them. A high bias results in underfitting: the resulting model will not learn the concept well enough to make accurate predictions about unseen instances. A high variance results in overfitting: the resulting model will have learned the training data too well and might not generalize enough to make accurate predictions on unseen instances. In numerous experiments, accuracy on unseen data drops once a method starts overfitting and thus has high variance.
High bias with low variance results in a high error rate because the model has not learned the data well enough. Low bias with high variance gives a low error rate on the training data, because the model has learned the data too strictly and overfitted, but can result in a high error rate on unseen test data because the model didn’t generalize well. There is always a trade-off between bias and variance, between underfitting and overfitting your data; this is called the bias-variance trade-off.
Bagging & boosting
Bagging and boosting are ways to combine individual learners. These can reduce variance and bias, and thus overfitting and underfitting.
Variance reduction is obtained with bagging. Bagging, or bootstrap aggregating, generates smoother decision boundaries; it reduces variance and thus helps against overfitting. It does so by smoothing out outliers and noise: many models are trained, each on a different bootstrap sample (a random subset of the same size as the training set, drawn with replacement), and their outputs are averaged. A random forest constructs many trees this way, one tree per bag of data, and after constructing the forest it averages out the votes of those trees on unseen data. Bagging cannot increase the predictive force of your learner, but it can reduce variance, which results in less overfitting and better generalization on unseen data.
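Bagging as described above can be sketched with scikit-learn; the dataset here is synthetic, purely for illustration:

```python
# Bagging: many decision trees, each trained on a bootstrap sample of the
# data, with their votes averaged at the end.
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# 100 trees, each fit on a random sample (with replacement) of the data
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                           random_state=0).fit(X, y)
print(bagged.score(X, y))
```

Plotting the decision boundary of `bagged` against that of a single deep tree shows the smoothing effect: the averaged votes iron out the jagged regions a single tree carves around noisy points.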
Whereas bagging is a parallel ensemble in which each model is built independently, boosting is a sequential ensemble that adds new models where the previous models fall short: each new learner is trained to focus on the data the previous learners misclassified. Boosting is used to reduce bias, and thus helps against underfitting, by combining weak learners into a stronger one. An example is gradient boosting, as implemented in xgboost.
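The sequential nature of boosting looks like this in scikit-learn (again on made-up data; xgboost would work similarly):

```python
# Boosting: learners are added one after another, each fitted to the
# errors left by the ensemble built so far.
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# 100 shallow trees (weak learners), added sequentially
boosted = GradientBoostingClassifier(n_estimators=100, max_depth=2,
                                     random_state=0).fit(X, y)
print(boosted.score(X, y))
```

Note the deliberately shallow trees: boosting starts from weak, high-bias learners and reduces that bias step by step.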
Information gain & information entropy
Being able to learn your data can be based on information gain and entropy. Decision trees use these metrics to differentiate between choices and thus decide how to build the tree.
Information entropy expresses how random a given event was and how much information was contained in that event. Decision trees use information entropy to maximize the information gain for a particular node and the choice that has to be made at that node. The amount of entropy is calculated with the following equation, where N is the number of distinct classes and pᵢ the proportion of examples belonging to class i:

H(S) = −∑ᵢ₌₁ᴺ pᵢ log₂ pᵢ
Information gain builds on information entropy. By calculating the entropy before and after a given choice at a node of the decision tree, the amount of information gained by making that particular choice can be obtained. It is important to maximize information gain so that the constructed tree contains as few nodes as possible, each node yielding the greatest amount of information when traversed. The information gain of splitting a set S on an attribute A into subsets Sᵥ is calculated with the following equation:

IG(S, A) = H(S) − ∑ᵥ (|Sᵥ| / |S|) H(Sᵥ)
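The two formulas can be worked out directly in a few lines of Python:

```python
# Information entropy H(S) = -sum(p_i * log2(p_i)) and information gain
# IG(S, A) = H(S) - sum(|S_v|/|S| * H(S_v)), computed on toy label lists.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the weighted entropy of the child groups."""
    total = len(labels)
    return entropy(labels) - sum(len(g) / total * entropy(g) for g in groups)

# a 50/50 mix of two classes has maximal entropy: 1 bit
print(entropy(["apple", "pear", "apple", "pear"]))            # 1.0
# a split that perfectly separates the classes gains that full bit back
print(information_gain(["apple", "pear", "apple", "pear"],
                       [["apple", "apple"], ["pear", "pear"]]))  # 1.0
```

A pure group (all one class) has entropy 0, which is why a perfect split maximizes the gain.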
The kernel trick is very important when you cannot easily draw a decision boundary or surface between your classes. With support vector machines (SVMs, explained later on), you can implicitly transform your data to a higher-dimensional space, which can make it possible to separate your training data with a decision surface there. The trick is that the kernel computes similarities in that higher-dimensional space without ever computing the transformation itself, which lets your machine learning method learn your data efficiently even with non-linear kernels such as the RBF (Radial Basis Function, explained later on).
Let’s walk through some of the more common and popular machine learning methods available, such as decision trees, random forests, SVMs, neural networks, k-nearest neighbors (k-NN) and k-means.
A decision tree classifies by following the branches of the tree down to the leaves, guided by a question at every node. A decision tree is generated using the principle of maximum information gain: each step has to deliver the most information. Each node represents a choice about which path to follow, and the traversal of a given instance continues until it reaches a leaf. The leaves represent the classes into which the decision tree classifies.
Random forests are an ensemble of individual decision trees and are a discriminative classifier, which means they construct decision boundaries to learn a model of the training data. A random forest trains many decision trees and combines their responses at the output. Training multiple decision trees helps reduce variance and overfitting on the training data.
Support vector machine
Support vector machines are maximum-margin discriminative classifiers: they rely on support vectors (see the next figure) to construct their decision boundaries. Support vector machines use hyperplanes to separate the instances of different classes and try to maximize the space between the classes by maximizing the margin. Support vectors are the points closest to the decision boundary and exert the largest force onto it; that is why they are called ‘support’ vectors, they support the decision surface. The space between the support vectors is called the margin.
Linear kernel & radial basis function kernel
There are 2 important kernels you can use: a linear kernel and an RBF (Radial Basis Function) kernel. The linear kernel tries to separate your training data linearly. If that is not sufficient, you can rely on the kernel trick (as explained earlier) with the more computationally expensive RBF kernel, which is a non-linear kernel and can separate your data non-linearly.
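The difference between the two kernels is easy to demonstrate on data that is not linearly separable, such as two concentric rings (a synthetic dataset, used here only for illustration):

```python
# A linear kernel cannot separate two concentric rings; an RBF kernel can.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print(linear.score(X, y))  # well below the RBF score: no line splits the rings
print(rbf.score(X, y))     # near 1.0
```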
The neuron has three important parts: the cell body, dendrites, and axon. The dendrites act as electrical conveyors feeding information into the neuron from other neurons. Some neurons have only one dendrite, others have many; this varies the amount of information the neuron bases its decision on. The cell body determines what to do with all the incoming electrical information and acts as an activation function: if the sum of electrical input signals exceeds a certain threshold, the neuron fires. When the neuron fires, the signal propagates through the axon. The axon leads information away from the neuron and can vary in size, depending on how important the neuron is.
There are 2 types of artificial neural networks, single- and multi-layered.
The single-layer artificial neural network is also known as the perceptron. It consists of only one neuron, as seen in the next figure. The neuron has dendrites that control the information flow into the neuron, the cell body which determines whether to fire the neuron or not, and the axon which conveys the signal away from the neuron. To mimic this with a perceptron, the dendrites are input values given weights to express the importance of the information, the cell body is simulated by an activation function, and the axon is the output of the perceptron.
The perceptron calculates the weighted sum u of the input vector x ∈ ℝᴺ and the weight vector w ∈ ℝᴺ, where N is the dimension of the input vector: u = w · x = ∑ᵢ wᵢxᵢ. After calculating the weighted sum u, the perceptron applies an activation function f(u): ℝ → ℝ to produce its output. The concept of the perceptron has been around for quite some time, from as early as the rise of the computer itself. It gained a lot of interest at first but was rejected due to its limitations in computing certain functions, e.g. the XOR Boolean function, and due to other branches of AI achieving much better results. Those other branches focused less on machine learning and more on direct approaches to different computational problems. It took some time for interest to return to the perceptron and, following it, the multilayer perceptron.
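The computation u = w · x followed by a step activation f(u) can be sketched directly; the weights below are chosen by hand for illustration, not learned:

```python
# A perceptron forward pass: weighted sum plus step activation.
import numpy as np

def perceptron(x, w, b):
    u = np.dot(w, x) + b        # weighted sum of the inputs
    return 1 if u > 0 else 0    # step activation: fire or not

# hand-picked weights implementing a logical AND of two binary inputs
w, b = np.array([1.0, 1.0]), -1.5
print([perceptron(np.array(x), w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [0, 0, 0, 1]
```

No choice of w and b makes this single neuron compute XOR: that is exactly the limitation mentioned above, and the reason hidden layers were needed.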
The multilayer artificial neural network, also called multilayer perceptron (MLP), consists of multiple neurons, with hidden layers and multiple output neurons. The MLP is more expressive and more widely used than the single-layer perceptron. With those extra hidden layers, much more difficult target functions can be approximated. A lot of interest has emerged in this type of artificial neural network due to its expressiveness and the successes of deep learning, which uses a lot of processing layers.
There are different kinds of multilayer perceptrons, each with its own specialty, as listed below.
- Feedforward Neural Networks: information flows in one direction, from input to output, without going back to previous layers and thus without loops.
- Recurrent Neural Networks: information can flow in all directions; there are no limitations as with feedforward neural networks, and no limit on the number of loops the network contains.
Variants: Boltzmann Machines and Hopfield Networks.
- Convolutional Neural Networks: inspired by the visual cortex of animals and thus very suitable for computer vision applications. Convolutional layers are stacked so that each successive layer builds a richer understanding of what the input represents.
The activation function decides whether to fire the neuron or not. The step function is a straightforward example of such an activation function. The most widely used is the sigmoid function, characterized by its ‘S’-shape; whereas the step function is binary, the sigmoid function is continuous. The tanh activation function is a rescaled version of the sigmoid that ranges from -1 to 1. The rectified linear unit (relu) function is widely used in deep neural networks and is one-sided, in contrast with the antisymmetry of the tanh function, as is the exponential linear unit (elu) function.
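The activation functions mentioned above, written out in NumPy:

```python
import numpy as np

def step(u):    return np.where(u > 0, 1.0, 0.0)      # binary
def sigmoid(u): return 1.0 / (1.0 + np.exp(-u))       # 'S'-shaped, range (0, 1)
def tanh(u):    return np.tanh(u)                     # rescaled sigmoid, (-1, 1)
def relu(u):    return np.maximum(0.0, u)             # one-sided, [0, inf)
def elu(u, a=1.0):                                    # smooth variant of relu
    return np.where(u > 0, u, a * (np.exp(u) - 1.0))

print(sigmoid(0.0))  # 0.5
print(tanh(0.0))     # 0.0
print(relu(-2.0))    # 0.0
```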
The cost function represents the error on the output and drives the learning of the neural network: it represents a cost that has to be minimized to get to an optimal solution. The negative gradient of the cost function in a higher-dimensional landscape always points toward the bottom of that landscape, so by taking the gradient of the cost function and traveling downhill, the network can be trained more accurately and the optimal set of parameters can be found.
The total error over the weights is represented by the sum, over all training examples, of the squared difference between the expected output t and the actual output y:

E = ½ ∑ (t − y)²

The delta rule then updates the weights according to a certain learning rate ε: each weight change ∆wᵢ is calculated by multiplying the learning rate with the input and the difference between the expected output and the actual output:

∆wᵢ = ε (t − y) xᵢ
Learning rate ε: the learning rate is a very important parameter for finding a minimum in the configuration of the weights. If the learning rate is too large, the algorithm behaves erratically and may never settle into a cost minimum. If the learning rate is very small, it takes a lot of time to reach the local or global cost minimum. A global cost minimum is guaranteed to be found only when the cost function is convex; otherwise it is not certain whether the minimum found is local or global.
A gradient specifies how much one variable changes when another variable is adjusted. So by using a gradient, it is possible to calculate how much the total error changes when one weight is changed relative to the others, and the optimal weight configuration can be found by following that gradient downhill. Without this method, you would have to brute-force all possible combinations of all parameters, which would probably still be running after we are no longer here. Gradient descent is used to minimize the divergence between the actual and expected results, known as the loss function. With the loss function, the neural network can calculate how far off it is and update its weights accordingly. (Note that the term ‘cost function’ is used in optimization theory and the term ‘loss function’ in parameter estimation.) On one end you have the different configurations of the weights, and on the other the cost of each configuration: each configuration produces an output, and with it an error with respect to the expected output. By repeatedly recalculating the gradient of the loss function and updating the weights, following the cost function downhill, the optimal configuration of the weights is found, with minimal divergence between expected and actual output.
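The downhill walk can be sketched on a toy cost function C(w) = (w − 3)², whose minimum sits at w = 3:

```python
# Gradient descent on C(w) = (w - 3)^2: repeatedly step along the
# negative gradient until the minimum at w = 3 is reached.
def gradient(w):
    return 2 * (w - 3)   # dC/dw for C(w) = (w - 3)^2

w = 0.0                  # initial weight
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * gradient(w)

print(round(w, 4))       # 3.0
```

Setting `learning_rate` above 1.0 here makes the updates overshoot and diverge, which is the erratic behavior described above; setting it to 0.001 still converges, just far more slowly.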
The backpropagation algorithm uses gradient descent and consists of two phases: the propagation phase and the weight update phase. It reduces the loss function by taking the gradients of the error with respect to the weights. The propagation phase corresponds to one forward pass through the network and the weight update phase to one backward pass.
Propagation phase: in the forward pass, the input is fed through the network to generate the output at the output neurons, from which the total error on the network is calculated using the cost function. In the backward pass, all ∆w values, representing the difference between actual and expected output, are calculated for all neurons in the network; the errors of each layer are derived from the neurons of the layer above.
Weight update phase: during the weight update phase, all weights of the neurons are updated using the derivatives of the errors of the neurons one layer higher. There are several frequencies at which the weights can be updated: online, batch, or stochastic. Batch gradient descent uses the batch mode; mini-batch gradient descent also uses the batch mode but with restrictions on the size of the batch; and mini-batch gradient descent with randomness in picking the batches is the stochastic mode. With online mode, the weights are updated after each training example. This can be quite helpful when implementing a system that learns gradually from the continuous input of people interacting with it, without retraining the whole model after each sentence or each set of sentences.
This is one of the more straightforward machine learning methods. It has only one parameter, k, which denotes the number of neighboring data points that are taken into consideration when labeling a new data point. In the example below, we already have 2 sets of labels, blue and orange, and 2 features, x₁ and x₂. If we set k=3, we look at the 3 closest points: 2 are blue and 1 is orange, so there are more blue examples close to the new data point and k-NN predicts the new data point to be of category ‘blue’.
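The majority vote with k=3 can be sketched with scikit-learn; the coordinates below are made up to mirror the example:

```python
# k-NN with k=3: a new point takes the majority label of its 3 nearest
# neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],   # blue
              [4.0, 4.0], [4.5, 4.2], [3.8, 4.4]])  # orange
y = np.array(["blue", "blue", "blue", "orange", "orange", "orange"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.3, 1.1]]))  # ['blue']
```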
K-means is unsupervised and does not need labeled data. This machine learning method will thus not predict labels but will cluster data, so we can find similarities between data points. If we set k=2, we first pick 2 random points as cluster centers, annotated as blue and red in the example below. After assigning these initial values, a few steps are repeated.
- the distance between each data point and the cluster centers is calculated, and each data point is assigned to the nearest cluster
- the red and blue cluster centers are updated based on their assigned data points, by averaging the x and y values of all points in each cluster
- repeat until the cluster centers don’t move anymore
Once finished, k clusters will have formed. This can give valuable insights into your data; it is already used for spam filtering, document analysis, etc. Also keep in mind that this example only has 2 features, but in real-life problems there are a lot more, and this 2-dimensional representation quickly goes beyond our capability of imagining the x-dimensional space. Remember that our way of working stays the same whether this feature space has 2 dimensions or 2000.
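The steps above can be sketched with scikit-learn on two made-up blobs of points, with k=2 as in the red/blue example:

```python
# k-means with k=2: assign points to the nearest center, recompute the
# centers as the mean of their assigned points, repeat until stable.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one blob
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])  # another blob

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one center near (1, 1), one near (8, 8)
print(km.labels_)           # first 3 points in one cluster, last 3 in the other
```

Note that the data carries no labels at all: the grouping comes purely from the distances between the points.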
Testing your model is as important as training it. Always be careful when interpreting results. Chances are you are overfitting and didn’t specify a test set, or that you accidentally also trained on your test data because you didn’t split them correctly.
K-fold cross-validation is needed to efficiently validate the model which the machine learner has built. Cross-validation consists of taking part of the dataset as the training set and the rest as a validation set. The ‘k’ in k-fold stands for the number of runs that are performed and determines how the training and test data are distributed. If you have a dataset that contains 100 examples and you do 10-fold cross-validation, 10 runs are done over the dataset, and the average accuracy is calculated by averaging the results of those 10 runs. The following figure, which represents 5-fold cross-validation, makes it clearer how k-fold cross-validation works. Each vertical column is the dataset, split into a training part and a validation part. Because it is 5-fold cross-validation, 5 runs are performed and their results are averaged to obtain the final average accuracy of the trained model.
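5-fold cross-validation takes one call in scikit-learn; here on the bundled iris dataset with a decision tree:

```python
# 5-fold cross-validation: the dataset is split into 5 parts, each part
# serves once as the validation set, and the 5 accuracies are averaged.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)         # 5 accuracies, one per fold
print(scores.mean())  # the averaged accuracy
```

Because every example ends up in the validation set exactly once, this also guards against accidentally training on your test data.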
The F-score is defined as a weighted average combining precision and recall:

F = 2 · (precision · recall) / (precision + recall)

Precision measures how many of the selected items are relevant, and recall measures how many of the relevant items are selected, as seen in the figure below. A system with high recall and low precision returns a lot of classified instances, of which few are correct. A system with low recall and high precision returns very few classified instances, but a large percentage of those are classified correctly.
Precision measures how precisely the system classifies instances. If the system has high precision, a large portion of the returned instances is correct. This does not necessarily mean that the system performs well overall: if it returns 10 instances of which 9 are correctly labeled, the system has high precision, but if the dataset contains over 1000 relevant instances, it performed poorly and will have low recall.
Recall measures how many of the relevant results the system returns. High recall does not necessarily mean high accuracy either, because a large portion of the returned results could still be classified wrongly.
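The high-precision, low-recall situation from the example above can be worked out by hand, assuming 100 relevant instances existed in total:

```python
# Precision, recall and F-score for a toy system that returned 10
# instances, 9 of them correct, out of 100 relevant instances overall.
true_positives = 9
false_positives = 1     # returned but wrong
false_negatives = 91    # relevant but never returned

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(precision)        # 0.9
print(recall)           # 0.09
print(round(f1, 3))     # 0.164
```

The F-score punishes the imbalance: despite 90% precision, the score stays low because 91 relevant instances were never found.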