 # Optimization for Deep Learning

Source: Deep Learning on Medium

### Why do we need Optimization?

In general, debugging a deep learning algorithm is like debugging any other complex piece of software. If something goes wrong, you need to make hypotheses about what might have happened and look for evidence to test those hypotheses. This requires a thorough understanding of the principles of optimization.

### How to Visualize Gradient Descent

When we train a neural network, we’re trying to minimize some cost function E, which is a function of the network’s parameters, which we’ll denote with the vector θ. In general, θ would contain all of the network’s weights and biases. In order to think qualitatively about optimization problems, we’ll need some ways to visualize them. Suppose θ consists of two weights, w1 and w2. One way to visualize E is to draw the cost surface, as in Figure 1; this is an example of a surface plot.

This particular cost function has two local minima, or points which minimize the cost within a small neighbourhood. One of these local optima is also a global optimum, a point which achieves the minimum cost over all values of θ. A function can have multiple global minima if there are multiple points that achieve the minimum cost. Technically speaking, global optima are also local optima, but informally when we refer to “local optima,” we usually mean the ones which aren’t global optima. In the context of optimization, local and global minima are also referred to as local and global optima.
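To make the distinction between local and global minima concrete, here is a small sketch using a hypothetical one-dimensional cost (E(w) = w⁴ − 3w² + w, chosen for illustration; it is not from the article). It has two minima, and gradient descent lands in a different one depending on where it starts:

```python
# Hypothetical toy cost with two local minima, one of which is global:
# E(w) = w**4 - 3*w**2 + w has a shallow minimum near w ~ 1.1 and a
# deeper (global) minimum near w ~ -1.3.
def E(w):
    return w**4 - 3*w**2 + w

def grad_E(w):
    # derivative of E with respect to w
    return 4*w**3 - 6*w + 1

def gradient_descent(w, lr=0.01, steps=2000):
    # repeatedly step opposite to the gradient
    for _ in range(steps):
        w = w - lr * grad_E(w)
    return w

# Starting on different sides of the hill between the two minima,
# gradient descent converges to different local optima.
w_left = gradient_descent(-2.0)   # ends near the global minimum
w_right = gradient_descent(2.0)   # ends near the non-global local minimum
print(w_left, E(w_left))
print(w_right, E(w_right))
```

The point of the example is that plain gradient descent only finds *a* local minimum; which one depends entirely on the initialization.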

Surface plots are hard to interpret, so people generally use contour plots for visualizing two-dimensional optimization problems. The idea behind a contour plot is to visualize weight space (w1, w2, etc.). Each contour represents a level set: the set of parameters where the cost takes a particular value.

One of the most important things we can visualize on a contour plot is the gradient of the cost function. We can’t determine the magnitude of the gradient from the plot, but it is easy to determine its direction: the gradient is always perpendicular to the level sets.
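We can check the perpendicularity claim numerically. The sketch below uses a hypothetical quadratic cost E(w1, w2) = w1² + 2·w2² (an assumption for illustration): moving a tiny unit step along the contour barely changes the cost (a second-order effect), while the same step along the gradient changes it the most:

```python
import numpy as np

# Hypothetical quadratic cost for illustration
def E(w):
    return w[0]**2 + 2*w[1]**2

def grad(w):
    # vector of partial derivatives (dE/dw1, dE/dw2)
    return np.array([2*w[0], 4*w[1]])

p = np.array([1.0, 1.0])
g = grad(p)

# In 2D, a tangent vector to the level set at p is the gradient
# rotated by 90 degrees.
t = np.array([-g[1], g[0]])

eps = 1e-4
t_unit = t / np.linalg.norm(t)
g_unit = g / np.linalg.norm(g)

change_along_contour = abs(E(p + eps*t_unit) - E(p))   # ~0: cost constant on the level set
change_along_gradient = abs(E(p + eps*g_unit) - E(p))  # largest possible change for a step of size eps
print(change_along_contour, change_along_gradient)
```

This is exactly why the gradient direction is called the direction of steepest ascent, and its negative the direction of steepest descent.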

The gradient is simply the vector of partial derivatives of the cost function:

∇E(θ) = (∂E/∂θ1, …, ∂E/∂θn)

Gradient descent then repeatedly applies the update

θ ← θ − α ∇E(θ)

where α (alpha) is the learning rate. This shows directly that gradient descent moves opposite to the gradient, i.e. in the direction of steepest descent. The learning rate is one of the most important hyperparameters of a learning algorithm, so it is very important to tune it.
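A minimal sketch of this update rule, again on a hypothetical two-weight quadratic cost E(w) = w1² + 2·w2² (an assumption for illustration), shows why the learning rate matters so much: a moderate α converges, while an α that is too large overshoots the minimum and diverges:

```python
import numpy as np

def grad(w):
    # gradient of the hypothetical cost E(w) = w1**2 + 2*w2**2
    return np.array([2*w[0], 4*w[1]])

def descend(alpha, steps=100):
    w = np.array([3.0, -2.0])
    for _ in range(steps):
        w = w - alpha * grad(w)  # theta <- theta - alpha * grad E
    return w

w_good = descend(alpha=0.1)  # shrinks toward the minimum at (0, 0)
w_bad = descend(alpha=0.6)   # each step overshoots; the iterates blow up
print(np.linalg.norm(w_good))
print(np.linalg.norm(w_bad))
```

For this quadratic, each step multiplies the w2 coordinate by (1 − 4α), so any α above 0.5 makes that factor larger than 1 in magnitude and the iterates oscillate with growing amplitude; tuning α is the difference between convergence and divergence.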