Source: Deep Learning on Medium
Deep Learning 101 — The Theory
There are many machine learning practitioners who question the approach to theory that many machine learning instruction typically takes. Their position is one of pragmatism — learn to apply the model, see how it works, then when you’re comfortable, go back and learn the theory. I can appreciate this approach, after all, machine learning is a rather empirical exercise. However, I prefer to start with the theory because I’m simply not fully comfortable with developing and implementing models if I don’t have at least some understanding of how they work under the hood. If you’re like me, then read on as we walk through the important theoretical underpinnings to deep learning. If you’re in the former camp, then check out this post to dive right in.
We look at neural networks, their structure, purpose, and the math that underpins them including topics related to linear algebra, gradient descent, and back propagation. Medium doesn’t do a great job with equations, so if you want all of the gory details, check out my original post.
What is Deep Learning?
Is a technique that has exploded into the mainstream of AI and machine learning research over the past five years. It consists of artificial neural networks with many layers (hence the characterization of “deep”) which allows for the expression of complex representations. The advances in image recognition, for example, have largely been driven by deep learning where the successive layers of the neural network allow simple features to be aggregated to higher and higher levels until a proper classification can be determined. The raw pixels come in and simple features such as contours are combined to form lines, edges, boundaries, and eventually the entire image is recognized by the network. Deep learning applications extend beyond image recognition into a wide-variety of different spaces such as natural language processing (NLP), forecasting, a myriad of classification problems, and can be applied to other, classic predictive analytic problems.
Anatomy of a Neural Network
The layout or architecture of a neural network is one of the drivers of its performance, and there is a veritable zoo of architectures out there. Despite the wide variation, they contain the same few components, just rearranged. At the most basic level, we have an input layer, hidden layer, and output layer, each with various nodes (sometimes called cells). Below, we have a simple image of a two-layer neural network (we don’t typically count the input layer when counting neural network layers).
Here, the inputs undergo a transformation as they move from the input layer to the hidden layer. There are typically two components to any transformation, a linear and non-linear component, known as the activation function (more info later on this). For classification purposes, the output layer typically takes whatever it received from the last hidden layer and applies a transform to convert the results into some sort of probability or interpetable result so that the data it has received can be mapped to one of the possible classes. If this all sounds a bit mysterious to you, then don’t worry, we’ll get into the mathematical basics below to explain how all of this happens!
There are a few fundamental mathematical techniques to understand to be able to work with deep learning effectively.1
In this post, I want to introduce:
- Linear Algebra
- Activation Functions
- Forward Propagation
- Back Propagation
Linear algebra is a branch of mathematics that deals with vector spaces and their mappings and comprises a huge chunk of deep learning. Confession: during my undergrad, nearly everyone in my program took linear algebra except for me, instead I replaced it with a graduate level game theory course thinking I wouldn’t need linear algebra. Ooops! If you need to get up to speed on the basics, I can recommend no better book than Coding the Matrix which teaches you linear algebra through Python. Not only will you become a pro in the subject, you do all of it in Python and become a better programmer in the process.
A few terms that are thrown around in this area: scalars, vectors, matrices, and tensors, all of which are important to understand.
Whether you know the term or not, you’re already familiar with scalars. Scalars are just a single number, like -1, or 371.12, or π. They are numbers on their own and don’t require any special treatment.
Vectors are ordered lists (sometimes called 1-D arrays) of numbers. Let’s define a vector as follows:
Now, our vector has four elements, each of which are scalars. They are ordered, so if we want to access the first element, which is 6, we know that it is indexed as element 0: x_0 (in zero-indexed languages like
Python, others like
R would assign the index 1). Often times you’ll do things like sum vectors, which is just to add all of the elements of the vector to one another. So in this case, the sum of our vector is:
Which is a scalar value. When working with languages like
R, we often say they are vectorized meaning we don’t have to do anything special to apply values to the vectors. So, going back to our sample vector, we can multiply it by a scalar value and it will apply to all of the values in the vector. So if you’re in
Python and you have your vector
x = np.array([6, 18, 4, 23]) and you multiply that by 2 (
x * 2), all the values will double and it remains a vector.2
Sometimes you’ll see vectors written in a line like I did above, other times as a column. Regardless, it’s the same thing. I typically denote vectors with a little arrow on top like:
other times authors like to write them in bold like, x, these tend to be the two most common conventions. Finally, in mathematical notation, you may see vectors written as:
which is basically saying, “vector x consists of real-numbers and has n elements.” If you add two vectors together, each element will be added if and only if they have the same length, otherwise the operation can’t proceed according to the rules of linear algebra.
If vectors are 1-D arrays, matrices are in 2-D and have height/length and width. Whereas vectors had only one value as an index, matrices have two, typically denoted by i and j. Take the following matrix:
This is a 2 × 2 matrix, so has four elements. The first is given by A(0,0) = 6 (again if you want your indexing notation to match your programming language of choice), the second on the top as A(0,1) = 18 and so on. Matrices share similar properties as vectors in that multiplication by a scalar is applied evenly to all elements, you can sum matrices, and do all sorts of transformations with them. One key thing to keep in mind though, are the dimensions of the matrix which is often written as:
where m gives the height or number of rows in the matrix, and n gives the width or number of columns. The matrix notation is read just like the vector notation above, “matrix A consists of real numbers and has m rows and n columns.” These dimensions are critical to keep straight, however, because with deep learning, you’ll often be multiplying matrices and you can only do that if the dimensions match properly!
Scalars, vectors, matrices, all of these are specific types of tensors, namely tensors with a particular rank. Scalars are rank 0 tensors. Vectors are rank 1, and matrices are rank 2. Higher ranked tensors (i.e 3 and up) don’t usually get special names because they are, by and large, less common than the first three, so they’re just known collectively as tensors. If we have a rank 3 tensor T, for example, we’d denote it’s values using indices i, j, and k like:
We would just keep on adding letters as we get higher and higher ranked tensors, but don’t worry because these aren’t so common.
Also, you may be aware of a popular deep learning framework known as TensorFlow. Hopefully now the first part of the name makes some sense.
With the linear algebra terminology out of the way, let’s turn our attention to some of the key linear algebraic operations. Deep learning consists of matrix transformation on top of matrix transformation. Do you see all of those nodes and lines in that neural network image above? Basically to move from one are of the network to the next you need to multiply some matrices! As I mentioned before, multiplying matrices requires the proper dimensions. If we go back to our old matrix A, it’s a 2 × 2 matrix or:
If we are going to multiply it by another matrix B, we need that other matrix to have the same number of rows as A has columns. This multiplication will result in a new matrix C with the same number of rows that A has and the same number of columns that B has. Writing this out, we can say:
With this shape in mind, we can compute the dot product of A and B. This multiplies all of the values in the column with the values in the corresponding row, then sums them to yield the result. Let’s make A a 2 × 3 matrix and B a 3 × 2 matrix to see how this works.
The dot product is incredibly common and used extensively in deep learning. Often times the exact dimensions don’t match and you must transpose your matrix, this means turning it so that the rows become the columns and the columns become the rows. If you’re careful about your dimensions, you’ll quickly see where this is necessary. In
Python, this is given as
np.dot(A, B) or, if
A is already a
numpy array, you can write
There is also element-wise multiplication also called the Hadamard product which requires matrices of identical dimensions. To try an example in code of this type of multiplication, let’s define our matrices A and B as:
A = np.array([[6, 18], [4, 23]])
B = np.array([[2, 1], [1, 2]])
To multiply these together in
Python, you can either write
C = A * B, or
C = np.multiply(A, B). Both yield the product of each of the elements with the corresponding element in the other matrix:
C = np.array([[12, 18], [4, 46]])
Finally, you can add matrices together, but they must have the same dimensions, and you can add vectors to matrices. In this latter case, the vector must match one of the dimensions of the matrix and will be added to each element in turn.
Of course, there’s more linear algebra to cover, but this should be enough to make you dangerous.
So far, we covered a lot of linear algebra, but there’s one problem, all of that remains linear. To get around this, we introduce activation functions into our network to provide needed non-linearities to the that allows learning intricate patterns and functions that wouldn’t be possible otherwise. Hidden layer transforms have two components, a linear and non-linear. The first, linear transformation is applied using the matrix multiplication and vector addition discussed above, the second comes from the activation function which is applied to each node of the hidden layer. Different layers can use different activation functions which will adjust the output accordingly. If we look at a node or neuron, we would see something like this:
This is the linear component which consists of weights W and bias terms b.³ The non-linear piece is then applied to the output, Z. We often denote the activation function as g(x) and the result is the matrix .
This result is then passed along to the next layer as the input to another linear transformation and then to another output until we reach the final layer of our neural network, the output layer.
Types of Activation Functions
Activation functions got their inspiration from observing the firing of actual neurons which have been observed to fire if a certain action potential is reached. If it hits this threshold, it fires and sends a signal to the connected neurons. This same idea is taken and applied to activation functions, so what you’ll see here are a series of functions that follow the same sort of pattern and determine whether the node or neuron “fired” or was “activated” or not.4
There are a variety of activation functions to explore because of the various success that different functions have had as innovation has continued in this area. Because the linear weights we looked at previously are adjusted using a technique known as back propagation, the derivatives of the activation functions are very important as well and can affect the learning speed and capability of the network.
If you feel like you need a refresher on calculus or are new to the subject, I would recommend Silvanus Thompson’s brilliantly titled book from 1910, Calculus Made Easy: Being a very simplest introduction to those beautiful methods of reckoning which are generally called by the terrifying names of differential calculus and the Integral calculus. The title is worth the sticker price alone. Because of it’s age, it is in the public domain, so you can get a free, scanned PDF of the second edition from 1914 here as well as on Project Gutenberg.
Perhaps one of the oldest and most common activation functions is the sigmoid function, σ(x), also called the logistic function. This function is defined as:
The derivative is given as:
Plotting the function and its derivative, we can note a few important values.
First, looking at the sigmoid function itself, we see that it ranges from 0–1 which makes it a great function to apply to the output layer of a binary classification problem because it can be interpreted probabalistically. The other thing to note is how flat the values are towards the edges and reflected in the derivative nearing 0 at these points. This is a big problem of the sigmoid function because we use the derivative to provide updates to the weights. If the derivatives are near 0, then a very small change will be made to the weight, even if a larger change would be more helpful. This is a variant of the vanishing gradient problem and can slow or even prevent sufficient learning from taking place. For that reason, a lot of current deep learning practitioners have moved away from using the sigmoid apart from the output layer for binary classification.
Perhaps the workhorse of modern deep learning activation functions, the rectified linear unit (ReLU) function has grown in popularity.
This is a much different function than the sigmoid function. It is linear for all values greater than 0, and 0 for all other values. It has a number of advantages most notably that it’s easy and efficient to compute and has no vanishing gradient. There are a couple of problems with it, however, that it’s non-differentiable at 0 and can encounter the “dying ReLU problem”. The first case can be addressed in a few different ways, but most commonly, we just set the derivative at 0 to 0, or 0.5, or 1. The “dying ReLU” problem is a little more complicated and arises during back propagation (more below on this) because of the 0 output values. So, if a ReLU node has a 0 or negative value as the input, it has 0 as the gradient. Because back propagation uses the gradients to adjust the weights, they’ll (almost) never change and the node will fail to provide results in the future leading to a dead node. Other ReLU’s like leaky ReLU’s have been proposed to help mitigate this where values less than 0 are multiplied by a small number (i.e. 0.001).
This post seems like it could grow indefinitely as we continue to list out activation functions, so it’s time to call it quits. Needless to say, there are numerous others out there like tanh, ELU, SELU, PReLU, softplus, softmax (output layer only), even Gaussian functions. If you want to see what else is being used, Wikipedia tends to keep a pretty up-to-date list handy. Play around with them and see how they work for your problem. The thing with data science that I keep coming back to is that it is highly empirical, so some functions and values may work well for one problem and not another, so go and experiment!
Putting it Together
Ok, so we’ve gone through the basics of linear algebra, seen a basic neural network and have some activation functions. How do we go about combining this stuff to train our network and make predictions? This comes down to two basic procedures, forward propagation and backward propagation. The forward prop pass takes our input data and pushes it through each layer of the network applying the linear and non-linear transformations to each successive matrix until you get to the output. The bacward propagation step is a bit more complicated (and where the magic happens), it looks at the results of the forward propagation pass and utilizes a loss function to determine how good your steps are. It then takes that loss function and calculates the derivative of it and all of the other functions along the way pushing these derivative values backward through the network to update the weights using gradient descent. Let’s examine the details of each in turn.
Suppose we have some data in a matrix called X and some binary labels (e.g. cat vs. non-cat) in another vector called y⃗. Our network has two layers with hidden nodes in the hidden layer which uses a ReLU activation function. The output layer has one node and uses a sigmoid function so we can classify our value. We want to give our input data to the network to see how well it predicts, so we commence a forward propagation step. Algebraically, our forward prop pass would look like the following few steps (I’m including all the matrix dimensions just to see how it changes through out the network where m just gives the number of observations we’re providing):
- Input layer applies linear weights (W1) and biases (b1⃗) gives our output Z1:
- Z1 then gets passed to our ReLU activation function (g1(x)) to give us A1:
- A1 then gets passed to our linear output layer with weights (W2) and biases (b2⃗) to give us Z2:
- Now, we have a vector of values, but we want to determine if our image represents a cat or not. Our final activation function is going to be the sigmoid, because it ranges from 0 to 1, and we can treat it like a probability. In other words, if the network applies a score of 0.67 to one of our images, you can understand that as, “the network ‘believes’ there’s a 67% probability that this is a cat picture.” We’ll call this activation function (g2(x)) and get our classification probabilities y^:
And in four quick steps, we’ve just made a prediction using a neural network! Notice that on the final output, (y^), we have an (m×1) matrix of values that range between 0 and 1 because our output activation layer contained a sigmoid function. To map that to our class, we just take everything with a value less than 0.5 and map that to class 0 and everything greater than 0.5 and map that to class 1.Loss Function
Ok, so our cat detector is going to perform really poorly out of the box. We need to train it on those examples. As stated above, assume we have some vector of labels y⃗, where there’s a 1 if it’s a cat picture and a 0 if it’s not a cat picture. From here, we can then calculate the cross-entropy loss to compare the predictions (y^) with the actual values. The loss will then be used to update the weights and biases in the network to help it improve predictions in the future. The more data we show the network, the better it’s going to get at detecting cats.
For the math, the cross-entropy loss is given as:
Where m is the number of examples in the data set (same m as in our m×n matrix dimension), y(i) denotes the i-th label, and y(i)^denotes the i-th predicted value. Notice what happens if the true value of the point y(i) is 1. In this case, the second term inside the summation goes to 0 and we’re left with the log of y(i)^. If our predicted value is close to the true value (in this case 1 because y(i)=1), then the log is very low and thus our loss or error is very low. On the other hand, if our predicted value is close to 0 when the true value is 1, then the loss increases towards infinity. Once the log probabilities like this are calculated for each of the data points we have, we take the average by summing them all and dividing by m. The negative value on the outside ensures that our total loss value is positive because log values for everything less than one are always negative.
We then update our network to minimize this value. In other words, we try to change the weights and biases using the back propagation algorithm to get values that move our loss as close to 0 as possible.
Back propogation is the most mathematically and computationally intensive part of deep learning. Because we use gradient descent to make our updates to the parameters, we have to calculate the derivative of the loss with respect to all of the parameters in the network. This is done by applying the chain rule from calculus to each of the functions in the network (and here’s another link to Thompson’s book in case you changed your mind).
The chain rule is key because it is simply a way to find the derivative of a function when it is nested within a larger function — which is basically all a neural network is, a massive, nested function. Getting the derivative for each parameter in the network helps us determine how much error each parameter is contributing to the network and correct it by adjusting the weight to minimize the error, which, in turn will improve our predictions.
To start with back propagation, we need to calculate the derivative of our loss function with respect to our prediction (see the end of this post for the full derivation).
Once we have this, we can work backwards through our layers and continue to calculate the partial derivatives of each output value with respect to the loss.
- Output layer activation function derivative:
- Output layer weight and bias derivatives:
- Input layer activation function derivative:
- Input layer weight and bias derivatives:
This can be quite time consuming in practice if we need to calculate everything by hand! Notice too how quickly the chain rule speeds things up because we’re able to simply calculate the derivative of each value as a product of the later layers and its inputs. If you’re a bit fuzzy on the details, don’t fret too much, because modern frameworks do calculate these derivatives for you. If you do want to dig deeper, check out Geoffrey Hinton’s video on back propagation or Andrew Ng’s video on the subject.
Once we have the gradients for all of our parameters listed above, we can apply our update rule. The general form for gradient descent updating is:
Where θ represents all of our parameters and gets updated according to all of the derivatives that we calculated. Also, I use “:=” just as an update rule, so the value on the right hand side of the equation replaces the value on the left when the right is calculated. α is our learning rate and determines how much of that derivative gets applied, 1%, 10%, or more. Usually α is small (always less than 0) and often between 0.01 and 0.1. Another thing to note is that, in practice, it’s a good idea to vary your learning rate by decaying it over time so that it moves more slowly as it gets closer and closer to the optimal result. This is helpful to avoid the network from passing by the optimal value. Thankfully, modern deep learning frameworks can do all of this for you automatically!
Applying it to each of our parameters, we have:
Once we’ve updated our parameters, we go back to the beginning and run another forward propagation pass and repeat the process. These small, incremental updates are repeated until the loss reaches an acceptable level, isn’t improving anymore, or we’ve decided that we’ve calculated this enough.
In the next post, we’ll take all of this and apply it in code to demonstrate how it works, step-by-step from scratch.
Why does it work?
One big question remains, why does this work? Moreover, why does it work so well? This question is intimately related to what Nobel Prize Laureate Eugene Wigner’s called “the unreasonable effectiveness of mathematics” at explaining the world. This is a deep issue with no clear answer. Regardless, deep learning seems to work by being able to pick out various features of high importance or that have high predictive value and assigning that value to those features. Still, the question remains, out of all of the possible features and functions that a network can approximate, why does it seem to approximate the right (or at least useful) ones? Some, such as Henry Lin of Harvard and MIT’s Max Tegmark and David Rolnick postulate that there’s a deep connection with the physical laws that govern our universe, and these laws are surprisingly few and surprisingly simple, which allows deep learning to be so successful. Another, recent proposal sees deep learning as an “information bottleneck” which compresses noisy and irrelevant information and retains relevant information. As the network processes inputs through each layer, it reaches a “bottleneck” or a theoretical limit the network has at extracting information. This is certainly an area of active exploration that has seen proposals of robust ensemble methods for explanation, to more speculative explanations like the holographic principle, and others.
Regardless of how it works, we do know that it does work and I hope this post has helped clarify the theory behind the techniques and paved the way for your own applications and exploration.
Next post, building your own neural network from the ground up!