Source: Deep Learning on Medium

Machine learning is nothing more than trying to predict values and gaining insight into our data. Whether it’s predicting a class or label like different kinds of fruits or trying to predict what the next price of a certain stock will be. Gaining insight into our data by using clustering methods and trying to group examples to understand our data. Nothing about machine learning is hard to understand if you don’t dive into the actual mathematics. The actual mathematics can be hard, but everyone can understand and reason about machine learning and how it works. It all comes down to spatial reasoning as you will see with feature spaces and decision boundaries.

A quick overview of what we’re going to cover.

- introduction into machine learning
- different types of machine learning
- basic machine learning concepts
- different machine learning methods
- testing our learned models

# Introduction

One very simple example will explain almost everything you need to know about what machine learning is and does. Machine learning comes down to feature spaces and decision boundaries. As you will see, machine learning is nothing more than spatial geometric reasoning.

## Feature space

A feature is something you need to define objects. Features of people could be height, width, skin color, etc. Features of cars could be horsepower, color, length, etc. Features of the stock market could be price, time, amount of tweets, etc. Alright, you get the point. Features are used to define something, and a feature space is an x-dimensional space in which these objects live. If we have 2 features of fruit, width, and height, we will get a 2-dimensional feature space. If we add ‘roundness’ to this with values from 1–10, we get a 3-dimensional space. Adding another feature like color will get us a 4-dimensional feature space, etc. We can imagine up to 3 spatial dimensions.

If we look a little into the example of fruits with 2 features, width, and height. We measure apples and pears and measure their size. If we plot them onto our feature space, we get something like this.

Our machine learning model *(in this case a standard perceptron which will generate a linear decision boundary, will be explained further what this is and does exactly)* will learn the distinction by splitting these points in what is called a **decision boundary**. If we introduce new fruit (black label) and we would like to know if it’s an apple or pear, we add this to our feature space. Our machine learning model will verify on which side of the decision boundary it is placed and will label our new fruit an apple.

Keep in mind that this is just a 2-dimensional feature space, this can easily be a 100-dimensional feature space with 100 features, we cannot imagine this (we can only imagine up to 3 spatial dimensions), but mathematics doesn’t care. Everything stays the same regarding calculations, regardless of the number of spatial dimensions and thus features.

## Decision boundaries

Let’s dig a little deeper into decision boundaries. Each machine learning method operates a little different, some have straight decision boundaries (linear decision boundary functions), some have curvy ones (higher polynomial decision boundary functions). It all depends on which method you use. As you can see the Linear SVM has **linear decision boundaries** because it uses a linear kernel (more on this later), if it uses a higher polynomial function kernel such as an RBF (Radial Basis Function), it will have **higher dimensional decision boundaries**. Decision trees operate on splitting your data based on your features, so their decision boundaries will be straight. Random Forests use a lot of decision trees and perform bagging (more on this later) and can map your feature space more tightly. These are just a few examples of machine learning methods and their decision boundaries, the principle stays the same whatever machine learning method you use.

# Different Types

First things first, there are very important distinctions you need to know, the difference between classification and regression, supervised and unsupervised learning and the differences between deep learning and reinforcement learning. Machine learning engulfs all other types of learning such as reinforcement and deep learning.

## Classification vs regression vs clustering

With classification, you want to classify something, you want to assign a label to it and try to separate your data as much as possible. This can be an object in an image, this can be a certain sentiment a specific sentence has, this can be the classification of a certain fruit, etc. Regression is about trying to fit a curve as close as possible to your data and getting to know how variables influence each other. Regression can be applied in time series, stock predictions, weather forecasts, etc. Classification predicts discrete variables whereas regression predicts continuous variables. Clustering is used to **gain insight into your data** and does not involve labeling.

## Unsupervised vs supervised learning

Learning your data, requires labeling your data, right? How could the machine otherwise possible know what you want to learn? You’ll have to specify that a certain image is of an apple and a certain image is of a banana if you want your machine learning model to learn the distinction. This requires **labeling your data**. This is called supervised learning. And is the most used approach, although this is also the most time-consuming. Labeling data is cumbersome and boring and is seen as the Achilles heel of machine learning. Other approaches involve not labeling your data, called unsupervised learning. Machine learning methods such as k-means, which will be explained later on, are unsupervised and are used for clustering problems where you would like to know similarities between your data without actually knowing what it represents. It can be used to gain more insight into your data.

## Reinforcement learning

Reinforcement learning is a special kind of learning and is widely used and very popular with machines learning to play video games (see OpenAI for more examples of reinforcement learning). Learning based on reinforcement will happen with an ‘agent’ which will try to execute certain steps towards a goal. If this agent does something wrong, for example, goes the wrong way, it will be penalized. If it does something right, such as jump over an enemy, it will be rewarded. It will try to optimize a cost function that will define the goal to reach and when it will be penalized and rewarded.

A very impressive use case is from OpenAI and features OpenAI Five which currently beats the top DOTA2 teams around the world.

## Deep learning

Standard machine learning becomes deep learning when neural networks with multiple layers are involved. A neural network has and input and output layer and can have any number of hidden layers in between with a vast variety of how they’re interconnected. Deep learning is very popular with computer vision and uses deep convolutional layers.

# Basic Concepts

Let’s start by explaining the basic concepts in machine learning.

## Linearly separable

What does it mean for your data to be linearly separable? If your data can be separated with a lower polynomial function, such as `f(x)=x`

, then your data is linearly separable. If your data is more complex, as seen in the image below, it’s not linearly separable.

Examples of linear classifiers that can separate data linearly and thus produce linear decision boundaries are perceptrons (explained later on what a perceptron is). If your data can only be separated by higher polynomial functions, it’s not linearly separable. Examples of such classifiers are support vector machines (SVM) with a radial basis function kernel (which will also be explained later).

## Curse of dimensionality

Every problem has a certain dimension, a certain amount of features that have to be taken into account when training your model. If these dimensions exceed the actual number of training examples, it becomes very easy for your machine learning model to learn how to construct its decision boundaries. It will separate all instances very easily and will overfit on your dataset.

## Underfitting & overfitting

Training a model can go bad both ways, you either don’t learn your data well enough or you learn it a little too well. If your machine learning method is not able to learn your data, you’re underfitting your data, if it learns it too well, you’re overfitting on your data. Both are bad. Overfitting occurs when you have very high accuracy on your training data, but very bad results on unseen test data. The fitting of your model is highly correlated with something called bias and variance which will be explained next.

## Bias & variance

Bias and variance are two important concepts to take into account when training and validating a machine learning technique. It is visible in numerous experiments that the accuracy goes down when the machine learning methods start **overfitting and thus have a high variance**. There is always a trade-off between bias and variance. A high bias results in underfitting and the resulting model of the machine learning technique will not learn the concept well enough to make accurate predictions about unseen instances. A high variance results in overfitting and the resulting model will have learned the model too well and might not be able to generalize enough to make accurate predictions of unseen instances.

A high bias and low variance will result in a high error rate because the model has not learned the data well enough. Low bias and high variance will have a low error rate on the training data because it will have learned the data too strict and will have overfitted, this could result in a high error rate on unseen test data because the model didn’t generalize too well. There is always a **trade-off between bias and variance**, between underfitting and overfitting your data, this is called the bias-variance trade-off.

## Bagging & boosting

Bagging and boosting are ways to combine individual learners. These can reduce variance and bias, and thus overfitting and underfitting.

**bagging**

Variance reduction is obtained by using tree bagging. Bagging or bootstrap aggregating generates smoother decision boundaries. It **reduces variance** and thus helps against overfitting. Bagging reduces variance by smoothing out outliers and noise by averaging over a lot of generated models. By taking different subsets of the same size of the training data and learning a model from each of those datasets, it creates additional training data which helps reducing variance. The random forest constructs many trees by using bagging over the training data and constructing a tree for each bag of data. After constructing the forest and thus the multiple decision trees it averages out the votes of each of those trees on unseen data. It is not possible to increase the predictive force of your learner by applying bagging but it is possible to reduce the variance which results in less overfitting and better generalization on unseen data.

**boosting**

Whereas bagging is a **parallel** **ensemble** with each model built independently, boosting is a **sequential** **ensemble** that tries to add new models that do well where previous models lack. Incrementally train a new learner on data that the previous learner misclassified. Boosting is used to **reduce bias**, and thus helps against underfitting by generating a stronger learner. An example is gradient boosting, such as xgboost.

## Information gain & information entropy

Being able to learn your data can be based on something called *information gain* and *entropy*. Decision trees use this metric to differentiate between choices and thus how to build the tree.

**information entropy**

Information entropy resembles how random a given event was and how much information was contained in that event. Decision trees use the information entropy to maximize the information gain for a particular node and the choice that has to be made at that node. The amount of entropy is calculated by using the following equation with `N`

= a distinct amount of classes.

**information gain**

Information gain results from information entropy. By calculating the amount of entropy for a given node in the decision tree and thus a given choice, the amount of information by making that particular choice can be obtained. It is important to maximize information gain for a decision tree so that the constructed tree contains as little nodes as possible, with each node the greatest amount of information that will be obtained after traversing through. The information gain is calculated by the following equation.

## Kernel trick

The kernel trick is very important if you are not able to easily draw a decision boundary or surface between your classes. With support vector machines (SVM, explained later on), you can **transform your data to higher dimensional space** which could enable you to easily separate your training data with a decision surface. This will make it possible for your machine learning method to learn your data more efficiently without using higher dimension kernels such as RBF (Radial Basis Function, explained later on).

# Different Methods

Let’s walk through some of the more common and popular machine learning methods available, such as decision trees, random forests, SVM’s, neural networks, k-nearest neighbors (k-NN) and k-means.

## Decision tree

A decision tree generates decision boundaries by following the branches of the tree down to the leaves. Each branch gets further down into the tree guided by questions at every node. A decision tree is generated by using the principle of maximum entropy or information gain. Each step has to deliver the most amount of information. Each node is constructed so that the decision tree has to make a choice about which path to follow and thus continue to try and classify a given instance when the path traversing reaches a leaf. These leaves represent the classes to which the decision tree will classify.

## Random forests

Random forests are an ensemble of individual decision trees and are a discriminative classifier which means it makes up decision boundaries to learn a model of the training data. It trains many decision trees and combines each of their responses at the output. Training multiple decision trees help reduce variance and overfitting the training data.

## Support vector machine

Support vector machines are maximum margin discriminative classifiers, they rely on support vectors, see next figure, to make up their decision boundaries. Support Vector Machines use hyperplanes to separate several instances of other classes and try to maximize the space between each class by maximizing the margin. Support vectors are the closest points to the decision boundary and exert the largest force onto the decision sheet. That is why they are called ‘support’ vectors, they support the decision surface. The space between the support vectors is called the margin.

**Linear kernel & radial basis function kernel**

There are 2 important kernels you can use, a linear kernel and an RBF (Radial Basis Function) kernel. The linear kernel will try to separate your training data linearly, you can apply the kernel trick (as explained earlier) to transform your data into higher dimensional space so it becomes linearly separable. If this is not sufficient, you can use the more computational expensive RBF kernel which is a non-linear kernel and will be able to separate your data non-linearly.

## Neural networks

The neuron has three important parts: the cell body, dendrites, and axon. The dendrites act as electrical conveyors feeding information into the neuron from other neurons. Some neurons have only one dendrite, others have many. This varies the amount of information the neuron decides on. The cell body determines what to do with all the incoming electrical information and acts as an activation function. If the number of electrical input signals exceeds a certain threshold, the activation function decides to fire the neuron or not. When the neuron is fired, this signal propagates through the axon. The axon leads information away from the neuron and can vary in size, depending on how important the neuron is.

There are 2 types of artificial neural networks, single- and multi-layered.

**single-layered**

The single-layer artificial neural network is also known as the perceptron. It consists of only one neuron as seen in the next figure. The neuron has dendrites that control the information flow into the neuron, the cell body which determines to fire the neuron or not and the axon which conveys the signal away from the neuron. To mimic this with a perceptron, the dendrites are input vectors given weights to clarify the importance of the information. The cell body is simulated by an activation function and the axon is the output of the perceptron.

The perceptron calculates the weighted sum `u`

from the input vectors xᵢ ∈ ℝᴺ and weight vectors wᵢ ∈ ℝᴺ where `N`

is the dimension of the input vector. After calculating the weighted sum `u`

, the perceptron calculates the function f(u): ℝ → ℝ by taking the sum of the dot products of the inputs vectors with the weight vectors. The concept of the perceptron is around for quite some time, from as early as the rise of the computer itself. It gained a lot of interest at first but was rejected due to its limitations for computing certain functions e.g. XOR Boolean function and due to other branches in AI achieving much better results. Those other branches focused less on machine learning and more on direct approaches to different computational problems. It took some time for interest to return in the perceptron and following the multilayer perceptrons.

**multi-layer**

The multilayer artificial neural network, also called multilayer perceptron (MLP), consists of multiple neurons, with hidden layers and multiple output neurons. The MLP is more expressive and more widely used as the single-layer perceptron. With those extra hidden layers, a lot more difficult target functions can be approximated. A lot of interest has emerged for this type of artificial neural network due to its expressiveness and successes of deep learning, which uses a lot of processing layers.

**different types**

There are different kinds of multilayer perceptrons, each with their specialty which are listed below.

**Feedforward Neural Networks**Information must flow in one direction, from input to output without going back to previous layers, thus without loops.

Variants: a*uto-encoders*.**Recurrent Neural Networks**Information can flow in all directions, there are no limitations as with feedforward neural networks. There are also no limitations about the number of loops the network contains.

Variants:*Boltzmann Machines*and*Hopfield Networks.***Convolutional Neural Networks**Convolutional neural networks are based on the visual cortex of animals and thus very suitable for computer vision applications. Each convolutional layer is stacked to get a better understanding of each layer that is represented by the input.

**activation function**

The activation function decides whether to fire the neuron or not. The step function is a straightforward example of such an activation function. The most widely used is the sigmoid function which is characterized by its ‘S’-shape. Whereas the step function is binary, the sigmoid function is continuous. There is also a tanh activation function which is a rescaled version of the sigmoid function which ranges from -1 to 1. The rectified linear unit (relu) function is widely used in deep neural networks and is one-sided in contrast with the antisymmetry of the tanh function as well as the exponential linear unit (elu) function.

**cost function**

The cost function is needed to represent the error on the output and optimize the learning of the neural network. It represents a certain cost which has to be optimized to get to an optimal solution. The gradient of the cost function in a higher dimensional landscape is always directed to the bottom of that landscape, so by taking the gradient of the cost function and traveling downhill, more accurate training of the neural network can occur and help to find the optimal set of parameters.