Solving overfitting in Neural Nets with Regularization


Why Regularization reduces overfitting

When we implement regularization, we add a term based on the Frobenius norm of the weight matrices, which penalizes the weights for being too large. So the question to think about is: why does shrinking the Frobenius norm reduce overfitting?
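As a quick refresher, the penalty added to the cost is (λ / 2m) times the sum of the squared Frobenius norms of the weight matrices. Here is a minimal NumPy sketch of that regularized cost (the function name and arguments are illustrative, not from the original article):

```python
import numpy as np

def l2_regularized_cost(unregularized_cost, weights, lambd, m):
    """Add the Frobenius-norm penalty (lambd / 2m) * sum_l ||W_l||_F^2
    to the unregularized cost. `weights` is a list of weight matrices."""
    frobenius_term = sum(np.sum(np.square(W)) for W in weights)
    return unregularized_cost + (lambd / (2 * m)) * frobenius_term
```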

Idea 1

One idea is that if you crank the regularization parameter λ up to be really big, the optimization is strongly incentivized to set the weight matrices W close to zero. So one piece of intuition is that the weights become so close to zero for a lot of hidden units that the impact of those hidden units is basically zeroed out. If that is the case, the neural network becomes a much smaller, simplified network — almost like a stack of logistic regression units, just as deep but much thinner. That would take you from the overfitting case much closer to the high-bias case, so hopefully there is an intermediate value of λ that gives an optimal solution. To sum up, you are zeroing out or reducing the impact of some hidden units, which essentially leaves you with a simpler network.

The intuition of completely zeroing out a bunch of hidden units isn't quite right, and it does not work quite that way in practice. What actually happens is that all the hidden units are still used, but each of them has a much smaller effect. You still end up with something like a simpler network, and a simpler network is less prone to overfitting.
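One concrete way to see the shrinking effect is the gradient descent update itself: the penalty contributes an extra (λ/m)·W to the gradient, so each step scales W down before applying the usual backprop gradient, which is why L2 regularization is also called weight decay. A rough sketch, with illustrative variable names:

```python
import numpy as np

def regularized_update(W, dW_backprop, alpha, lambd, m):
    """One gradient descent step with the Frobenius-norm penalty included.
    The (lambd / m) * W term shrinks W a little on every update ("weight decay");
    the larger lambd is, the faster the weights are driven toward zero."""
    dW = dW_backprop + (lambd / m) * W
    return W - alpha * dW  # same as (1 - alpha * lambd / m) * W - alpha * dW_backprop
```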

Idea 2

Here is another intuition for why regularization reduces overfitting. To understand this idea, take the example of the tanh activation function, so g(z) = tanh(z).

Notice that if z takes on only a small range of values, that is, if |z| is close to zero, then you are just using the linear regime of the tanh function. Only if z is allowed to wander to larger positive or negative values, so that |z| is farther from 0, does the activation function start to become less linear. So the intuition you might take away is that if λ, the regularization parameter, is large, then the parameters will be relatively small, because large weights are penalized in the cost function.
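A quick numerical check of this linear regime, just printing NumPy's tanh for a few values of z:

```python
import numpy as np

# Near zero, tanh(z) is almost exactly z (the linear regime);
# farther from zero it saturates and is clearly non-linear.
for z in [0.01, 0.1, 0.5, 2.0]:
    print(f"z = {z:<4}  tanh(z) = {np.tanh(z):.4f}  |tanh(z) - z| = {abs(np.tanh(z) - z):.4f}")
```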

And if the weights w are small, then because z = wx + b, z will also be relatively small. In particular, if z takes on relatively small values, g(z) will be roughly linear, so every layer behaves roughly like a linear function, as in linear regression. The whole network then behaves much like a linear network — and even a very deep network with linear activation functions can only compute a linear function. Such a network cannot fit very complicated decision boundaries.
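To make that concrete, here is a tiny sketch (random weights, biases omitted for simplicity): when the weights are scaled down enough that every pre-activation z stays in tanh's near-linear regime, a three-layer tanh network is numerically almost identical to the single linear map you get by multiplying its weight matrices together.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))

# Small weights keep every pre-activation z in tanh's near-linear regime,
# so the stacked network behaves almost like the one linear map W3 @ W2 @ W1.
W1, W2, W3 = (0.01 * rng.normal(size=(3, 3)) for _ in range(3))

deep_output = np.tanh(W3 @ np.tanh(W2 @ np.tanh(W1 @ x)))
linear_output = W3 @ W2 @ W1 @ x

print(np.max(np.abs(deep_output - linear_output)))  # tiny difference: effectively linear
```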

So if your neural network was overfitting by learning a very complicated decision boundary, pushing it toward this simpler, closer-to-linear behavior can definitely help reduce the overfitting.