Original article can be found here (source): Artificial Intelligence on Medium
Gradient Descent Using Line Search
Most machine learning algorithms involve the minimization of a loss and gradient is one of the most popular strategies to accomplish this goal. More on gradient descent can be found here.
The main goal is to converge to the minima of the loss functions in the least amount of time. This involves converging fast to the minima in a computationally efficient manner. Minimization using Newton’s method provides super-linear convergence at the cost of huge computational expenses. Gradient descent, although computationally efficient, provides a slow rate of convergence. This is where line search comes into place and provides much better rate of convergence at a slight increase in computational spending.
In gradient descent, a step size is either fixed or updated as the gradient of the loss function changes. If the step size is too small, it can take forever to converge and if it is too big, we might overshoot the minima and oscillate back and forth. Line search is a strategy for picking a good value of step size which results in a quicker convergence.
For demonstrating line search, let us define logistic loss :
The loss function returns the logistic loss and its gradient for a given X, y, w and regularization term lambda.
Generate the data
import numpy as np
m = 5000
n = 12000
X = np.random.rand(n, m)
y = np.random.rand(n, 1)
We generate our input data X and y with 12000 examples each with 5000 features.
Lets run gradient descent on our generated data:
50th Iteration Loss :: 242.70309219567366 gradient :: 207.52578738484715
100th Iteration Loss :: 224.33403843602025 gradient :: 177.12067698718036
150th Iteration Loss :: 210.53161197583918 gradient :: 155.86772526251653
200th Iteration Loss :: 199.61732667360204 gradient :: 140.03048141454516
250th Iteration Loss :: 190.6750569151979 gradient :: 127.69248238098645
Gradient Descent with Line Search
Steps for line search are given below:
- Calculate initial loss and initialize step size to a large value.
- Update value of weights using the gradient and step size and calculate new loss.
- Decrease the value of step size by some factor and repeat step 2 until the new loss is less than the initial loss.
Here we choose initial step size to be 1 and reduce the step size by a factor of 2. We also make use of an additional parameter gamma to guarantee the new loss will be less than the initial loss by at least:
Lets run gradient descent with line search for the generated data.
15th Iteration Loss :: 143.29412846343416 gradient :: 5.161827846700099
30th Iteration Loss :: 30.075989488139246 gradient :: 0.4525504460057531
45th Iteration Loss :: 29.66848123493795 gradient :: 0.012072573984844907
60th Iteration Loss :: 29.66781114884413 gradient :: 0.002405275249607662
75th Iteration Loss :: 29.667783478402605 gradient :: 0.000494190727108184
90th Iteration Loss :: 29.667782310274475 gradient :: 0.00010154086627732706
105th Iteration Loss :: 29.66778226095854 gradient :: 2.0863900171776335e-05
120th Iteration Loss :: 29.667782258876443 gradient :: 4.287048632986741e-06
135th Iteration Loss :: 29.667782258788534 gradient :: 8.809059148513826e-07
150th Iteration Loss :: 29.667782258784822 gradient :: 1.8101258135569298e-07
As seen above we get a much lower loss value in less number of iterations.
Gradient descent with line search reaches the minima much faster by improving the rate of convergence while not being too computationally expensive. This version of line search can still be improved further and these will be discusses in Part 2 coming soon.