Source: Deep Learning on Medium
There are only and exactly two types of layers:
- Layers that contain parameters
- Layers that contain activations
Parameters are the things that your model learns. You update them with gradient descent, like
parameters -= learning_rate * parameters.grad. Those parameters are used by multiplying them by input activations in a matrix product.
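As a minimal sketch of that update step, here is one gradient descent step in NumPy. The parameter values, gradients, and learning rate are made up for illustration; in a real model the gradients would come from backpropagation.

```python
import numpy as np

# Hypothetical parameters and gradients, chosen only for illustration.
parameters = np.array([0.5, -1.0, 2.0])
grad = np.array([0.1, -0.2, 0.4])
learning_rate = 0.1

# One gradient descent step: move each parameter against its gradient.
parameters -= learning_rate * grad
```

Note the subtraction: each parameter moves a small step in the direction that reduces the loss.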
We take some input activations or some layer activations and multiply them by a weight matrix to get a bunch of activations. So activations are numbers, but they are numbers that are calculated. The input is a special kind of activation: it is not calculated.
Activations don’t only come out of matrix multiplications, they also come out of activation functions. The most important thing to remember about an activation function is that it’s an element-wise function: it is applied to each element of the input activations in turn and creates one activation for each input element. So if it starts with a 20-long vector, it creates a 20-long vector, by looking at each element in turn, doing one thing to it, and spitting out the answer. ReLU is the main one we’ve looked at, and honestly it doesn’t matter too much which you pick. We don’t spend much time talking about activation functions because if you just use ReLU, you’ll get a pretty good answer pretty much all the time.
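The two sources of activations can be sketched in a few lines of NumPy. The sizes (10 inputs, 20 outputs) and the random values are assumptions for illustration only.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: replace every negative value with zero."""
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(1, 10))     # input activations (given, not calculated)
weights = rng.normal(size=(10, 20))   # parameters (what the model learns)

pre_activations = inputs @ weights    # activations calculated by a matrix product
activations = relu(pre_activations)   # activations calculated element-wise
```

The ReLU output has exactly the same shape as its input: a 20-long vector goes in and a 20-long vector comes out.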
Components of Deep Neural Networks
- Activation functions / nonlinearities
- Fine tuning
What happens when we take a ResNet34 and do transfer learning? The first thing to notice is that the ResNet34 we took from ImageNet has a very specific weight matrix at the end. The matrix has 1000 columns.
Why? Because the problem they asked you to solve in the ImageNet competition is to figure out which one of 1000 image categories a picture belongs to. The target vector has length 1000: you have to predict a probability for each of those thousand categories.
There are a few reasons this weight matrix is not helpful to you when doing transfer learning. The first is that you probably don’t have a thousand categories. For example, our first image classification was between teddy bears, brown bears, and black bears. I don’t need 1000 categories. The second reason is that even if I had exactly 1000 categories, they are most likely not the same thousand categories that are in ImageNet. That means this whole weight matrix is a waste of time, so we don’t use it. When you use
create_cnn in fastai, it deletes that matrix. Instead, it puts in two new weight matrices in there for you with a ReLU in between.
There are some defaults as to what size the first one is; however, the second one is as big as you need it to be. From the data bunch you passed to your learner, we know how many activations you need. If you’re doing classification, it’s how many classes you have; if you’re doing regression, it’s however many numbers you’re trying to predict. Remember, if your data bunch is called
data, that number will be called
data.c. So we’ll add for you this weight matrix of size
data.c by however much was in the previous layer.
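A rough NumPy sketch of the replaced head may make the shapes concrete. This is not fastai’s implementation: the feature count of 512 and the hidden size of 512 are assumptions, and n_classes plays the role of data.c for the three-bear example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_body_features = 512   # assumed number of activations leaving the pretrained body
n_hidden = 512          # assumed size of the first new weight matrix
n_classes = 3           # plays the role of data.c: teddy, brown, and black bears

# The 1000-column ImageNet matrix is discarded; two new random matrices replace it.
# Shapes are written as (inputs, outputs) so that activations @ weights works.
w1 = rng.normal(scale=0.01, size=(n_body_features, n_hidden))
w2 = rng.normal(scale=0.01, size=(n_hidden, n_classes))

def new_head(body_activations):
    hidden = np.maximum(body_activations @ w1, 0)   # matrix multiply, then ReLU
    return hidden @ w2                              # one output per class

batch = rng.normal(size=(4, n_body_features))       # stand-in features for 4 images
outputs = new_head(batch)
```

Whatever the hidden size is, the last matrix always ends in n_classes columns, so the model produces one number per class for each image.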
Now we need to train those because initially these weight matrices are full of random numbers. New weight matrices are always full of random numbers. We’ve just grabbed them and thrown them in there, so we need to train them. However, the other layers are not new. The other layers have been trained and are good at something.
The later layers become more sophisticated, but also more specific. For example, layer 4 might find things like eyeballs. If you want to transfer to something like microscopic images of cancer, there are probably not going to be eyeballs in those, right? So the later layers are not useful. But the earlier layers, where it recognizes things like repeating patterns, are useful. The earlier you go in the model, the more likely it is that you want those weights to stay as they are.
We definitely need to train these new weights because they’re random. Let’s not bother training the other weights at all.
Let’s freeze all the early layers that we aren’t going to train. What does that mean? It means that we’re asking fastai and PyTorch that when we train (however many epochs we do), when we call fit, they don’t backpropagate the gradients into those layers. In other words, when you do
parameters = parameters - learning_rate * gradient, only do it for the new layers and not for these other layers. That’s what freezing means: don’t update those parameters.
Because we’re updating fewer things, it’ll be a bit faster and take up less memory, since there are fewer gradients to store. Most importantly, it’s not going to change weights that are already better than nothing: they’re better than random at the very least.
That’s what happens when you call freeze. It doesn’t freeze the whole thing. It freezes everything except the randomly generated added layers that we put on for you.
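Freezing can be sketched as an update loop that skips any group not marked trainable. The layer group names, weights, and gradients below are hypothetical, and the sketch only skips the update; a real framework also avoids computing and storing gradients for frozen layers.

```python
import numpy as np

# Hypothetical model: each layer group holds weights, gradients, and a trainable flag.
layer_groups = {
    "early":  {"w": np.ones(3), "grad": np.full(3, 0.5), "trainable": True},
    "middle": {"w": np.ones(3), "grad": np.full(3, 0.5), "trainable": True},
    "head":   {"w": np.ones(3), "grad": np.full(3, 0.5), "trainable": True},
}

def freeze(groups, new_group="head"):
    """Mark every group as frozen except the newly added one."""
    for name, group in groups.items():
        group["trainable"] = (name == new_group)

def sgd_step(groups, learning_rate):
    """Apply the gradient descent update only to trainable groups."""
    for group in groups.values():
        if group["trainable"]:
            group["w"] = group["w"] - learning_rate * group["grad"]

freeze(layer_groups)
sgd_step(layer_groups, learning_rate=0.1)
```

After the step, only the weights in the "head" group have moved; the pretrained groups are untouched.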
Unfreezing and Using Discriminative Learning Rates
Then what happens? After a while we say, “okay, this is looking pretty good. We probably should train the rest of the network now”. So we unfreeze. Now we’re going to train the whole thing, but we still have a pretty good sense that these new layers we added to the end probably need more training, and the ones right at the start (the early layers) probably don’t need a lot of training. So we split our model into a few sections and say, “let’s give different parts of the model different learning rates.” So the earlier part of the model, we might give a learning rate of
1e-5 and the later part of the model we might give a learning rate of
1e-3 for example.
What’s going to happen now is that we can keep training the entire network. But because the learning rate for the early layers is smaller, it’s going to move them around less, since we think they’re already pretty good. Also, if they’re already pretty close to the optimal values, a higher learning rate might actually make them much worse, which we really don’t want to happen. This process is called using discriminative learning rates. You won’t find much about it online; fastai is really the first group of people who have discussed using this technique in transfer learning.
How do we do discriminative learning rates in fastai? You can use them anywhere you can put a learning rate in fastai, such as with the
fit method. The first argument is the number of epochs, the second is the learning rate (the same is true for
fit_one_cycle). The learning rate can be a few different things:
- You can pass a single number (
1e-3): Every layer will get the same learning rate. So you wouldn’t be using discriminative learning rates.
- You can write a slice with a single number (
slice(1e-3)): The final layers will get a learning rate of whatever you pass (
1e-3) and the other layers will get the same learning rate divided by 3 (
1e-3 / 3). The last layer will be
- You can also write a slice with two numbers (
slice(1e-5, 1e-3)). The final layers (the ones with random numbers) get a learning rate of the second argument (
1e-3). The first layers will get the learning rate of the first argument (
1e-5) and the other layers will get learning rates that are multiplicatively equally spread between those two. If there are three layers, the learning rates would be
One tweak makes things a little simpler to manage: we don’t actually give a different learning rate to every layer. We give a different learning rate to every “layer group”. Specifically, fastai treats the randomly initialized extra layers as one layer group by default. Then it splits all the rest in half into two layer groups.
By default (with a CNN), you’ll get three layer groups. If you say
slice(1e-5, 1e-3) , you’ll get
1e-5 learning rate for the first layer group,
1e-4 for the second,
1e-3 for the third. So now if you go back and look at the way that we’re training, hopefully you’ll see that this makes a lot of sense.
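The three ways of passing a learning rate can be sketched with a small helper. The function learning_rates_for is hypothetical, not fastai’s API, and the divisor of 3 is taken from the description above; the library version you use may apply a different divisor.

```python
def learning_rates_for(lr, n_groups=3):
    """Hypothetical helper: expand a learning rate into one value per layer group."""
    if not isinstance(lr, slice):
        # A single number: every layer group gets the same learning rate.
        return [lr] * n_groups
    if n_groups == 1:
        return [lr.stop]
    if lr.start is None:
        # slice(end): the last group gets `end`, the others get end / 3
        # (divisor as described in the text; treat it as an assumption).
        return [lr.stop / 3] * (n_groups - 1) + [lr.stop]
    # slice(start, end): spread the rates multiplicatively evenly between the two.
    ratio = (lr.stop / lr.start) ** (1 / (n_groups - 1))
    return [lr.start * ratio ** i for i in range(n_groups)]

same_everywhere = learning_rates_for(1e-3)
last_group_highest = learning_rates_for(slice(1e-3))
spread = learning_rates_for(slice(1e-5, 1e-3))
```

With three layer groups, slice(1e-5, 1e-3) works out to roughly 1e-5, 1e-4, and 1e-3, because each step multiplies the previous rate by 10.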
This divided by three thing will be talked about later. It has to do with batch normalization.
That is fine tuning.
In this collaborative filtering example, we called
fit_one_cycle and passed in just a single number for the learning rate. This makes sense, because in collaborative filtering we only have one layer. There are a few different pieces in it, but there isn’t a matrix multiply followed by an activation function followed by another matrix multiply.
An affine function is not always exactly a matrix multiplication, but they are similar. An affine function is a linear function plus a constant: you multiply things together and add them up. Convolutions are matrix multiplications where some of the weights are tied, so it is slightly more accurate to call them affine functions. For our purposes, the term affine function essentially just means a linear function, something very close to a matrix multiplication. Specifically for collaborative filtering, the model we were using was this one:
It was where we had a bunch of numbers on the left and a bunch of numbers at the top, and we took the dot product of them. Given that one here is a row and one is a column, that’s the same as a matrix product. The average sum of squared error got down to 0.39.
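A minimal NumPy sketch of this dot-product model follows. The 15 users and 15 movies match the worksheet described here, but the 5 factors, the random weights, and the ratings are made up for illustration, so the error it produces is not the 0.39 from the spreadsheet.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 15, 15, 5   # n_factors is an assumption

user_factors = rng.normal(size=(n_users, n_factors))     # one row per user
movie_factors = rng.normal(size=(n_movies, n_factors))   # one row per movie

# A few made-up (user, movie, rating) triples, using 0-based indices.
user_ids = np.array([0, 1, 1, 14])
movie_ids = np.array([3, 0, 7, 14])
ratings = np.array([4.0, 3.5, 5.0, 2.0])

# Prediction for each rating: dot product of that user's and that movie's vectors.
predictions = (user_factors[user_ids] * movie_factors[movie_ids]).sum(axis=1)

# Average of the squared errors across all ratings.
mse = ((predictions - ratings) ** 2).mean()
```

Training would then adjust user_factors and movie_factors with gradient descent to push this error down.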
We talked about this idea of embedding matrices. To understand that, let’s look at this other worksheet.
There’s a weight matrix for the users and a weight matrix for the movies. Both matrices have the same dimensions.
Initially these values were random. We can train them with gradient descent. In the original data, the user IDs and movie IDs were numbers like these. To make it a little more convenient, the IDs have been converted to numbers from 1 to 15. So in these columns, for every rating, I’ve got user ID, movie ID, and rating. The mapped numbers are contiguous, starting at one.
Now replace user ID number 1 with this vector: a 1 followed by 14 zeros.
Then the user number is two. I’m going to replace that with a vector of a 0, then a 1, and then 13 zeros.
These are called one-hot encodings. This is not part of a neural net. This is just like some input pre-processing where I’m literally making this my new input:
So this is my new inputs for my movies, this is my new inputs for my users. These are the inputs to a neural net.
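The one-hot pre-processing step can be sketched like this in NumPy. The three example IDs are made up; the IDs are assumed to be contiguous and to start at one, as described above.

```python
import numpy as np

n_users = 15
user_ids = np.array([1, 2, 15])   # contiguous IDs starting at one

# One row per rating, one column per user; put a single 1 in the user's column.
one_hot = np.zeros((len(user_ids), n_users))
one_hot[np.arange(len(user_ids)), user_ids - 1] = 1
```

The first row is a 1 followed by 14 zeros, and the second row is a 0, then a 1, then 13 zeros.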
I’m going to take this input matrix and do a matrix multiply with the weight matrix. That works because the weight matrix has 15 rows and this one-hot encoded input has 15 columns. I can multiply those two matrices together because the dimensions match.
The user activations are the matrix product of this input matrix and this parameter matrix (or weight matrix). So that’s just a normal neural network layer. It’s just a regular matrix multiply. We can do the same thing for movies, and so here’s the matrix multiply for movies.
This input, we claim, is the one-hot encoded version of user ID number 1, and these activations are the activations for user ID number 1. Why is that? If you think about it, the matrix multiplication between a one-hot encoded vector and some matrix is actually going to find the Nth row of that matrix when the one is in position N. So what we’ve done here is we’ve actually got a matrix multiply that is creating these output activations. But it’s doing it in a very interesting way: it’s effectively looking up a particular row in the weight matrix.
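This claim is easy to check numerically. In the sketch below the weight matrix is random and the 5 factors per user are an assumption; only the 15 rows come from the worksheet.

```python
import numpy as np

rng = np.random.default_rng(0)
user_weights = rng.normal(size=(15, 5))   # 15 users; 5 factors is an assumption

one_hot_user_1 = np.zeros(15)
one_hot_user_1[0] = 1                     # user ID 1 puts the 1 in the first position

via_matmul = one_hot_user_1 @ user_weights   # a regular matrix multiply
via_lookup = user_weights[0]                 # simply reading off the first row
```

Both routes give the same vector, so the lookup is a cheaper way of computing exactly what the matrix multiply computes.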
Having done that, we can then multiply those two sets together (with a dot product), find the squared error for each rating, and then find the average loss.
0.39 is the same as the number from the solver in the previous example because they’re doing the same thing.
This one (“dotprod” version) was finding this particular user’s embedding vector, this one (“movielens_1hot” version) is just doing a matrix multiply, and therefore we know they are mathematically identical.
To be continued in another post.