Source: Deep Learning on Medium
What does the neural loss surface look like in practice?
Training neural networks can be difficult to begin with: very large batch sizes (9,000+) tend to find worse minimizers, while smaller ones (around 100) generally do better. More CNN filters also tend to give a better-behaved loss surface. Either way, training comes down to optimizing the filter weights to minimize the error.
The main way to train neural networks is by minimizing the loss function, which is essentially a function of the weight parameters. Because nonlinearities are stacked on top of nonlinearities, we end up with a non-convex optimization problem. Even with a GPU it is expensive and time-consuming to plot a loss function, so the loss computations were run on a GPU cluster to produce a visualization of the loss function that looks like this. We cover the loss functions every machine learner should know in depth here.
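A common way to probe such a landscape is to take the trained weights, pick a random direction in weight space, and plot the loss along that line. Below is a minimal sketch of that idea, with a toy linear least-squares model standing in for a real network — all data and variable names here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network": a linear model with squared loss, standing in for a real net.
# X, y, and the trained weights w_star are synthetic data for illustration.
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true
w_star = w_true.copy()  # pretend this is the minimizer found by training

def loss(w):
    """Mean squared error of the model with weights w."""
    return np.mean((X @ w - y) ** 2)

# Pick a random direction d in weight space and evaluate loss(w_star + alpha * d).
d = rng.normal(size=w_star.shape)
d /= np.linalg.norm(d)  # unit length, so alpha is a meaningful distance

alphas = np.linspace(-1.0, 1.0, 21)
curve = [loss(w_star + a * d) for a in alphas]

# The minimizer sits at alpha = 0, so the curve bottoms out at the middle index.
print(alphas[int(np.argmin(curve))])  # → 0.0
```

A 2D surface plot works the same way, using two random directions and a grid of `(alpha, beta)` coefficients.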
This is a loss function trained on the SSD 300 pedestrian dataset. Sharp versus flat minima matter here: it is widely believed that flat minimizers generalize better than sharp ones. What is fascinating is that when you turn on weight decay you get the opposite result: now the small-batch minimizers are very sharp, and the large-batch ones are flat. So what matters here is the scale of the weights: when the weights are large, the same minimum looks flat.
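Why would weight scale change how sharp a minimum looks? One reason is that ReLU networks are scale-invariant across layers: rescaling one layer up and the next down leaves the function unchanged, yet a fixed-size perturbation now looks tiny relative to the inflated weights. A small sketch with a made-up two-layer ReLU net (all weights here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny two-layer ReLU net: f(x) = W2 @ relu(W1 @ x). The weights are random
# stand-ins; the point is the scale invariance, not the task.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=3)

def forward(w1, w2, x):
    return w2 @ np.maximum(w1 @ x, 0.0)

# Rescaling W1 by c and W2 by 1/c leaves the output unchanged,
# because relu(c * z) = c * relu(z) for c > 0.
c = 10.0
out_a = forward(W1, W2, x)
out_b = forward(c * W1, W2 / c, x)
print(np.allclose(out_a, out_b))  # → True: same function, different weight scale
```

Because the function is identical but the weights of the first layer are 10x larger, a unit-length perturbation to `c * W1` barely changes the output, and the minimum appears flatter. This is the motivation for normalizing the plotting directions by the weight scale (filter-wise normalization) before comparing sharpness.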
Does this identify the correct distance scale for this problem? What affects the loss landscape?
To start with, there are different kinds of neural nets we want to look at: VGG-like nets, object-recognition models with up to 19 layers; and ResNets, where information from shallower layers skips over convolutions and gets added to the output of deeper convolutional layers.
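The structural difference between the two families is small but consequential: a residual block adds its input back to its output, while a plain (VGG-style) block does not. A minimal sketch in numpy — the block shapes and the near-zero weights are illustrative assumptions, not the real architectures:

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(z):
    return np.maximum(z, 0.0)

# A plain (VGG-style) block transforms its input; a residual block adds the
# input back via a skip connection.
def plain_block(x, W):
    return relu(W @ x)

def residual_block(x, W):
    return relu(W @ x) + x  # skip connection

x = rng.normal(size=8)
W = rng.normal(size=(8, 8)) * 0.01  # near-zero weights, an illustrative choice

# With tiny weights, a stack of plain blocks nearly erases its input, while a
# stack of residual blocks passes it through almost unchanged - one intuition
# for why skip connections keep very deep networks trainable.
deep_plain, deep_res = x.copy(), x.copy()
for _ in range(20):
    deep_plain = plain_block(deep_plain, W)
    deep_res = residual_block(deep_res, W)

print(np.linalg.norm(deep_plain), np.linalg.norm(deep_res))
```

After 20 blocks, the plain stack's output has collapsed toward zero while the residual stack's output still carries the input signal.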
This is a 56-layer VGG-like net, one of the simplest neural network architectures, and the minimizer sits at the center of the plot. We can easily visualize the non-convexity of these landscapes. What is more surprising is that when you change the architecture by adding skip connections (ResNet), the loss surface changes entirely, as you can see below.
If you don’t select random directions but instead pick special directions, those kinds of plots don’t show the non-convex behavior. When you go deep enough, there is a transition from convex behavior to chaotic behavior, and these transitions also tend to correspond to an increase in generalization error. So if you try to train a neural network that is too big and not well behaved, you end up in a situation where the gradients produced from different mini-batches are not correlated: the gradients effectively become random, and when this happens the net is basically untrainable and you can’t find good minimizers anymore.
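The gradient-correlation idea can be checked directly: compute the gradient on two disjoint mini-batches and measure the cosine similarity between them. A sketch on a toy linear least-squares problem (the dataset and batch indices are synthetic, chosen only to illustrate the measurement):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic linear least-squares problem standing in for a real training set.
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=1000)
w = np.zeros(10)  # current weights, far from the minimizer

def batch_grad(w, idx):
    """Gradient of the mean squared error on the mini-batch with row indices idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two disjoint mini-batches: a cosine near 1 means consistent training signal;
# a cosine near 0 means the gradients are effectively uncorrelated noise and
# optimization stalls.
g1 = batch_grad(w, np.arange(0, 100))
g2 = batch_grad(w, np.arange(100, 200))
print(cosine(g1, g2))
```

On this well-behaved convex problem the cosine comes out close to 1; in a chaotic region of a deep net’s landscape it would hover near 0.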
Skip connections completely prevent these chaotic transitions from happening, at least for deep enough networks (up to 300 layers, and you can go even deeper if you have enough GPU memory). So it seems that the properties of the model matter a lot: the differences between architectures are correlated with the shape of the loss landscape, and training optimization takes place on that landscape.
Well-designed neural networks have nice loss functions, with landscapes populated by large, flat, convex-like minimizers. Perhaps, rather than asking whether sharp versus flat matters, we should be looking at chaotic versus non-chaotic.
- Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Automated inference with adaptive batches. In AISTATS, 2017.
- Marcus Gallagher and Tom Downs. Visualization of learning in multilayer perceptron networks using principal component analysis. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 33(1):28–34, 2003.
- Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the Loss Landscape of Neural Nets. arXiv, 7 Nov 2018.