Source: Deep Learning on Medium

**Why do we need Optimization?**

In general, debugging a Deep learning algorithm is like debugging any other complex piece of software. If something goes wrong you need to make hypotheses about what might have happened, and look for evidence to test those hypotheses. This requires thorough understanding of the *principles of Optimization.*

**How to Visualize Gradient Descent**

When we train a neural network, we’re trying to minimize some cost function E, which is a function of the network’s parameters, which we’ll denote with the vector θ. In general, θ would contain all of the network’s weights and biases. In order to think qualitatively about optimization problems, we’ll need some ways to visualize them.

Suppose θ consists of two weights, w1 and w2. One way to visualize E is to draw the cost surface, as in Figure 1; this is an example of a **surface plot**. This particular cost function has two local minima, or points which minimize the cost within a small neighbourhood. One of these local optima is also a global optimum, a point which achieves the minimum cost over all values of θ. A function can have multiple global minima if there are multiple points that achieve the minimum cost.

Technically speaking, global optima are also local optima, but informally when we refer to “local optima,” we usually mean the ones which aren’t global optima. In the context of optimization, local and global minima are also referred to as **local and global optima**.
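To make the local-vs-global distinction concrete, here is a minimal sketch with an illustrative one-dimensional cost (the function is an assumption for this example, not the one in Figure 1). It has two local minima, and only the lower one is the global optimum:

```python
import numpy as np

# Illustrative cost with two minima: E(w) = (w^2 - 1)^2 + 0.2*w
# (a hypothetical example, not the article's cost function)
def E(w):
    return (w**2 - 1)**2 + 0.2 * w

# Evaluate on a fine grid and find local minima numerically:
# interior points that are lower than both neighbours.
w = np.linspace(-2, 2, 100001)
cost = E(w)
is_min = (cost[1:-1] < cost[:-2]) & (cost[1:-1] < cost[2:])
minima = w[1:-1][is_min]

print(minima)      # two local minima, near w = -1 and w = +1
print(E(minima))   # the left one has lower cost, so it is the global optimum
```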

A surface plot can be hard to interpret, so people generally use a **contour plot** to visualize two-dimensional optimization problems. The idea behind a contour plot is to visualize the weight space (w1, w2). Each contour represents a level set, the set of parameters where the cost takes a particular value.

One of the most important things we can visualize on a contour plot is the **gradient of the cost function**. We can’t determine the magnitude of the gradient from the plot, but it is easy to determine its direction: the gradient is always perpendicular to the level sets.
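We can check this perpendicularity numerically. The sketch below uses an illustrative two-weight quadratic cost (an assumption for this example): moving a tiny step along the level set (perpendicular to the gradient) leaves the cost essentially unchanged, which is exactly what it means to stay on a contour.

```python
import numpy as np

# Illustrative two-weight cost for this sketch: E = w1^2 + 3*w2^2
def E(w):
    return w[0]**2 + 3 * w[1]**2

def grad(w):
    return np.array([2 * w[0], 6 * w[1]])

p = np.array([1.0, 0.5])
g = grad(p)
tangent = np.array([-g[1], g[0]])      # rotate the gradient 90 degrees
tangent /= np.linalg.norm(tangent)     # unit vector along the level set
eps = 1e-5

# A step along the tangent stays on the contour (cost change is second-order small),
# while a step along the gradient changes the cost the fastest.
print(E(p + eps * tangent) - E(p))     # ~0
print(E(p + eps * g / np.linalg.norm(g)) - E(p))
```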

The gradient is simply the vector of partial derivatives of the cost function: ∇E(θ) = (∂E/∂θ1, …, ∂E/∂θn).
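One way to see this definition in action is to approximate each partial derivative with central finite differences and compare against the analytic gradient. The cost function below is an illustrative quadratic, not one from the article:

```python
import numpy as np

# Illustrative cost: E(theta) = theta1^2 + 3*theta2^2,
# with analytic gradient (2*theta1, 6*theta2).
def E(theta):
    return theta[0]**2 + 3 * theta[1]**2

def numerical_gradient(f, theta, h=1e-6):
    """Approximate each partial derivative dE/dtheta_i by central differences."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

theta = np.array([1.0, 2.0])
print(numerical_gradient(E, theta))   # close to the analytic gradient [2, 12]
```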

**Gradient Descent Update Rule**

θ ← θ − α ∇E(θ)

where α (alpha) is the learning rate. This shows directly that gradient descent moves opposite to the gradient, i.e. in the direction of steepest descent. The learning rate is one of the most important hyperparameters of a learning algorithm, so it is very important to tune it.
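Here is a minimal sketch of the update rule in code, run on the same kind of illustrative quadratic cost used above (the function, starting point, and learning rate are all assumptions for the example):

```python
import numpy as np

# Gradient of the illustrative cost E = theta1^2 + 3*theta2^2
def grad(theta):
    return np.array([2 * theta[0], 6 * theta[1]])

theta = np.array([4.0, -2.0])   # arbitrary starting point
alpha = 0.1                     # learning rate (hyperparameter to tune)

for _ in range(200):
    theta = theta - alpha * grad(theta)   # theta <- theta - alpha * gradient

print(theta)   # converges toward the minimum at [0, 0]
```

With too large a learning rate the iterates would diverge; with too small a one, convergence would be very slow, which is why tuning α matters.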

**Stochastic Gradient Descent**

In machine learning, our cost function generally consists of the average of the costs for individual training examples: E(θ) = (1/N) Σᵢ Eᵢ(θ). Likewise, the gradient is the average of the gradients for the individual examples.
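The sketch below makes this concrete with a hypothetical one-parameter linear model and squared-error cost (both assumptions for the example): the full-batch gradient is exactly the mean of the per-example gradients.

```python
import numpy as np

# Toy per-example cost: E_i(theta) = 0.5 * (theta * x_i - y_i)^2
# (a hypothetical 1-D linear model used only for this sketch)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])        # data generated by theta = 2

def per_example_grad(theta, xi, yi):
    return (theta * xi - yi) * xi    # dE_i/dtheta

theta = 0.5
grads = [per_example_grad(theta, xi, yi) for xi, yi in zip(x, y)]
batch_grad = np.mean(grads)          # batch gradient = average of per-example gradients
print(batch_grad)
```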

If we use this formula directly, we must visit every training example to compute the gradient. This is known as **batch training**, since we’re treating the entire training set as a batch. But this can be very time-consuming, and it’s also unnecessary: we can get a stochastic estimate of the gradient from a single training example. In **stochastic gradient descent (SGD)**, we pick a training example, and update the parameters opposite the gradient for that example.
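A minimal SGD sketch on the same toy one-parameter model (model, data, seed, and learning rate are all illustrative assumptions): each update touches only one randomly chosen example, yet the parameter still converges.

```python
import numpy as np

# SGD on a toy 1-D linear model with squared-error cost:
# each step uses the gradient of a single training example.
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                        # true parameter is 2

theta, alpha = 0.0, 0.05
for _ in range(1000):
    i = rng.integers(len(x))               # pick one training example at random
    grad_i = (theta * x[i] - y[i]) * x[i]  # gradient for that example only
    theta -= alpha * grad_i                # update opposite that gradient

print(theta)   # close to the true value 2
```

Note that each step here costs O(1) in the dataset size, which is what lets SGD make progress before a full pass over the training set.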

SGD is able to make a lot of progress even before the whole training set has been visited. A lot of datasets are so large that it can take hours or longer to make a single pass over the training set; in such cases, batch training is impractical, and we need to use a stochastic algorithm.

Some of the contents are from a Deep Learning class I took at the University of Toronto. Thanks to Roger Grosse (CS professor at the University of Toronto) for the wonderful notes and slides.