Source: Deep Learning on Medium

### Foreword

Previous articles have repeatedly mentioned the use of gradient descent optimization methods to find an optimal solution to a problem that has identified four key parts of machine learning.

For example, for the issue of house price forecasting, 1) for the digitized representation of the input, you can use a vector containing the house area, the house year, etc. to indicate the input, 2) for the output is the actual transaction price of the house, 3) for the input to the output The mapping method can be linear regression, 4) For the evaluation criteria of the mapping method, Abs Criterion can be used, that is, the absolute value of the difference between the predicted value and the actual value. Once these contents are determined, the optimization method of gradient descent can be used to solve the unknowns in the mapping method, so that the mapping reaches an optimal state. In the work of the weekdays, engineers can’t change the gradient descent method or innovate. There are some researches in this field, but they are still a minority. Nonetheless, the gradient descent optimization method is an important part of machine learning. A full understanding of this part will be helpful to make a better solution about the four key parts of machine learning, for example, when designing the map. It is necessary to ensure that the mapping is deductible so that it can be solved by the gradient descent method.

### Theory

As long as four key parts of machine learning are identified, the cost can be expressed as a function related to input, target output, and mapping. The input and target outputs are all deterministic values. In most cases, there will be some uncertain unknowns in the map. These unknowns can be obtained by optimization methods so that the cost is as small as possible.

To demonstrate, here is a simplified function that represents the cost.

Its graphical representation is as follows, the x-axis direction represents w₁, the y-axis direction represents, cost.

In order to find the optimal value of w₁ which let the cost minimized. It is easy to see from the figure that the minimum value of cost is when w₁=-1. In short, you can find the position of the minimum value when the derivative is 0 by using the derivative function of the cost function. However, in the actual situation, the expression of cost is very complicated, including hundreds and thousands of variables. When the number of equations with zero decimation is 0, the complexity is very high and the calculation amount is large. Therefore, the researchers invented a method that uses the computer-calculated value of the value of w₁ to make the cost as mininum as possible.

#### The steps

Take a random value for w₁, For example, let w₁=0.5, at this time cost = 2 ⨉ 0.5² + 4 ⨉ 0.5 = 2.5 . We know that cost’ = 4 ⨉ w₁ + 4 , at which cost’= 4 ⨉ 0.5 + 4 = 6 . This means that if w₁ is added a little bit, the cost will increase by a little 6 times. In turn, w₁ will be reduced a little bit, and the cost will be reduced a little. This way, there is a way to reduce the cost step by step. so you can get the w₁ value for mininum cost.

Here a little bit is equal 0.5. Because 0.5 is much larger than a small number, the change in cost is not strictly six times that of 0.5, but it has little effect on our goals.

The first iteration: at this time cost’= 4 ⨉ 0.5 + 4 = 6, w₁ = w₁–0.5 = 0.5–0.5 = 0, cost = 0

The second iteration: at this time cost’= 4 ⨉ 0 + 4 = 4, w₁ = w₁–0.5 = -0.5, cost=-1.5

The third iteration: at this time cost’= 4 ⨉ -0.5 + 4 = 2, w₁ = w₁–0.5 = -1, cost=-2

The way to update w₁ here is very simple, only according to the direction of the derivative. If the derivative is positive, reduce w₁ a little bit, or the derivative is negative, then add w₁ a little bit, so that it slowly approaches the desired value. If you find that w₁ changes, the cost does not change much, you can stop iterating.

### Computing module in Torch

At present, many deep learning libraries have built-in mapping methods, as well as evaluation criteria for mapping methods. The function calculation and the part of the derivative have internal methods that can be called directly. Here, Torch, a deep learning framework, is used as an example to give some examples of use. Most of the deep learning libraries are designed in the same way. When you understand other libraries, you can apply the ideas here, but you need to pay attention to the subtle differences, such as Torch vs PyTorch.

The box on the left side of the figure above is a linear transformation module that transforms two dimensions into one dimension. The image on the right is a block of the square of the MSE Criterion difference, used to calculate the effect of this mapping of linear transformations. The circle represents the variable, the red circle represents the known number, x1, x2 represents the two elements of a two-dimensional input vector, and y represents the value of the target of 10. A green circle indicates an unknown number, or a variable number, in the module. Circles without color represent some intermediate variables. The squares represent operations such as multiplication, addition, subtraction, square, and so on.

The code related to Torch is as follows.

The Forward function, which is the result of the calculation function, the following example is a linear transformation function, used in Linear Regression.`require 'nn';`

``l = nn.Linear(2, 1)l.weight[1][1] = 1l.weight[1][2] = 5l.bias[1] = 3a = torch.Tensor(2)a[1] = 2a[2] = 3res = l:forward(a) --res = 2 * 1 + 3 * 5 + 3 = 20, print(res)--will print --20--[torch.DoubleTensor of size 1]``
``crit = nn.MSECriterion()targets = torch.Tensor(1)targets[1] = 10cost = crit:forward(res, targets)print(cost) --cost = (20 - 10) * (20 - 10) = 100--will print--100``

Backward function:

The derivative of cost for each variable, `nn.Linear`, `nn.MSECriterion` is provided.

The principle is as follows,

Through the rules of the chain of derivative in mathematics, it can be concluded that

``require 'nn';l = nn.Linear(2, 1)l.weight[1][1] = 1l.weight[1][2] = 5l.bias[1] = 3a = torch.Tensor(2)a[1] = 2a[2] = 3res = l:forward(a) --res = 2 * 1 + 3 * 5 + 3 = 20, print(res)--will print --20--[torch.DoubleTensor of size 1]crit = nn.MSECriterion()targets = torch.Tensor(1)targets[1] = 10cost = crit:forward(res, targets)print(cost) --cost = (20 - 10) * (20 - 10) = 100--will print--100backward1 = crit:backward(res, targets) print(backward1) -- 2 * (20 - 10) = 20--will print--20--[torch.DoubleTensor of size 1]backward2 = l:backward(a, backward1) -- This is result of d(cost) / d(y') same to 2(y'-y)print(l.gradWeight) -- dw1 = 20 * 2 = 40, dw2 = 20 * 3 = 60--will print--40  60--[torch.DoubleTensor of size 1x2]print(l.gradBias)  --db = 20--will print--20--[torch.DoubleTensor of size 1]``