Vectorized Implementation

Gradient descent is one of the most popular optimization algorithms for finding a model's optimal parameters. The goal is to find the parameters that minimize the cost function.

(Full Source Code: https://github.com/SSaishruthi/Gradient_Descent_Algorithm_Using_Numpy)

*Process*

The local gradient of the error function with respect to each parameter is computed, and the algorithm steps in the direction of decreasing gradient. The cost function should decrease after every iteration. When the decrease in the cost function is less than 10^(-3), the minimum point is near.
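The stopping rule above can be sketched as a loop that halts once the cost improvement drops below the tolerance. A minimal illustration (not from the original post) on the toy function f(w) = w^2:

```python
# Minimize f(w) = w^2 by gradient descent and stop once the decrease
# in cost falls below a tolerance of 1e-3, as described above.
def minimize(lr=0.1, tol=1e-3, max_iter=1000):
    w = 5.0
    prev_cost = w ** 2
    for i in range(max_iter):
        w -= lr * 2 * w               # gradient of w^2 is 2w
        cost = w ** 2
        if prev_cost - cost < tol:    # decrease < 1e-3: near the minimum
            return w, i + 1
        prev_cost = cost
    return w, max_iter

w_final, n_iters = minimize()
```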

*Learning Rate*

The learning rate is one of the most important hyper-parameters; it controls how big the steps are during gradient descent. The smaller the learning rate, the smaller the steps the algorithm takes toward the global minimum. If the learning rate is too large, the algorithm may overshoot the minimum and fail to converge; at times, it may even diverge.
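This behavior is easy to see on the toy function f(w) = w^2 (a hypothetical illustration, not from the original post): a small learning rate converges steadily, while an overly large one overshoots and diverges.

```python
# Effect of the learning rate on gradient descent for f(w) = w^2.
def run(lr, steps=50):
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w   # gradient step; gradient of w^2 is 2w
    return w

w_small = run(0.05)   # small steps: steadily approaches the minimum at 0
w_large = run(1.1)    # step too large: every update overshoots, so w diverges
```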

*Feature Scaling*

If there is more than one feature, feature scaling helps gradient descent converge quickly. Two feature scaling methods can be applied.

1. Normalization: this adjusts for differences among attributes in terms of mean, range, and variance. Replace x(i) with x(i) - Mean(x).

2. Standardization: subtract the mean and divide by the standard deviation.
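Both methods can be applied column-wise with NumPy. A minimal sketch on made-up data:

```python
import numpy as np

# Toy feature matrix: two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# 1. Mean normalization: replace x with x - mean(x), per column.
X_norm = X - X.mean(axis=0)

# 2. Standardization: subtract the mean and divide by the standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```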

*Gradient Descent for linear regression*

*Equation*

Y(pred) = b0 + b1*x

*Error Function to be minimized*

Mean squared error over m samples:

cost = (1/m) * sum((Y - Y(pred))^2)

*General Gradient Equation*

Derivative of the error function with respect to the parameters:

db1 = (-2/m) * sum(x * (Y - Y(pred)))
db0 = (-2/m) * sum(Y - Y(pred))

*Gradient Descent Algorithms*

Three types of gradient descent algorithms are discussed below:

- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent

*Batch Gradient Descent*

Flow of the algorithm

As we approach the minimum, gradient descent takes smaller and smaller steps, so there is no need to decrease the learning rate over time.

import numpy as np

def model_optimize(w, b, X, Y):
    # Number of training samples
    m = X.shape[0]

    # Prediction: shape (1, m)
    final_result = np.dot(w, X.T) + b

    # Mean squared error
    cost = (1 / m) * np.sum((Y.T - final_result) ** 2)

    # Gradients of the cost with respect to w and b
    # (dw has one entry per feature, matching the shape of w)
    dw = (-2 / m) * np.dot((Y.T - final_result), X)
    db = (-2 / m) * np.sum(Y.T - final_result)

    grads = {"dw": dw, "db": db}

    return grads, cost

def gradientUpdate(w, b, X, Y, learning_rate, no_iterations):
    costs = []
    for i in range(no_iterations):
        # Gradients and cost on the full training set
        grads, cost = model_optimize(w, b, X, Y)
        dw = grads["dw"]
        db = grads["db"]

        # Weight update
        w = w - (learning_rate * dw)
        b = b - (learning_rate * db)

        # Track the cost at every iteration
        costs.append(cost)

    coeff = {"w": w, "b": b}
    gradient = {"dw": dw, "db": db}

    return coeff, gradient, costs

Cost function with respect to the number of iterations
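To see the loop in action, here is a compact, self-contained sketch of batch gradient descent on hypothetical synthetic data (y = 2x + 1 plus a little noise). The update mirrors the functions above, but everything is redefined so the snippet runs on its own:

```python
import numpy as np

# Synthetic data: 100 points on y = 2x + 1 with small noise.
np.random.seed(0)
X = np.random.rand(100, 1)
Y = 2 * X + 1 + 0.01 * np.random.randn(100, 1)

w, b = np.zeros((1, 1)), 0.0
costs = []
m = X.shape[0]
for _ in range(500):
    pred = np.dot(w, X.T) + b              # predictions, shape (1, m)
    err = Y.T - pred
    dw = (-2 / m) * np.dot(err, X)         # gradient w.r.t. w
    db = (-2 / m) * np.sum(err)            # gradient w.r.t. b
    w -= 0.5 * dw                          # full-batch update
    b -= 0.5 * db
    costs.append((1 / m) * np.sum(err ** 2))
```

The recorded costs decrease monotonically, and the fitted parameters land close to the true slope 2 and intercept 1.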

*Stochastic Gradient Descent*

Flow of the algorithm

The cost function goes up and down; it decreases only on average.

This randomness can help the algorithm escape a local minimum, but it also means the final parameters are not as precise as with the batch gradient method. Gradually decreasing the learning rate over time helps the algorithm settle closer to the true minimum.

The process of gradually reducing the learning rate at every step is akin to simulated annealing. The function that determines the learning rate at each iteration is called the learning schedule.
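A common learning schedule has the form t0 / (t + t1); the constants t0 and t1 here are hypothetical, chosen only for illustration. The rate decays at every step, as described above:

```python
# Learning schedule: returns a smaller learning rate at every step t.
def learning_schedule(t, t0=5.0, t1=50.0):
    return t0 / (t + t1)

rates = [learning_schedule(t) for t in range(100)]
# rates[0] is 5/50 = 0.1, and each subsequent rate is strictly smaller.
```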

def stochasticUpdate(w, b, X, Y, learning_rate, no_iterations):
    costs = []
    n_points = X.shape[0]

    for i in range(no_iterations):
        for j in range(n_points):
            # Pick one random training sample
            index = np.random.randint(n_points)
            x_pt = X[index:index + 1]
            y_pt = Y[index:index + 1]

            # Gradients and cost on that single sample
            grads, cost = model_optimize(w, b, x_pt, y_pt)
            dw = grads["dw"]
            db = grads["db"]

            # Weight update
            w = w - (learning_rate * dw)
            b = b - (learning_rate * db)

            costs.append(cost)

    coeff = {"w": w, "b": b}
    gradient = {"dw": dw, "db": db}

    return coeff, gradient, costs

As discussed, the cost function goes up and down from one iteration to the next

*Mini-batch Gradient Descent*

Gradient descent is performed on small random subsets of the data. It is less erratic than stochastic gradient descent: averaging over a mini-batch reduces the variance of the parameter updates, so it can get closer to the minimum and produces more stable updates.
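Drawing one random mini-batch per step amounts to shuffling the sample indices and taking the first batch_size rows. A NumPy-only sketch of that sampling idea (the data shapes here are made up for illustration):

```python
import numpy as np

# Sample a random mini-batch: permute the row indices, keep batch_size of them.
def sample_minibatch(X, Y, batch_size=40):
    idx = np.random.permutation(X.shape[0])[:batch_size]
    return X[idx], Y[idx]

# Toy data: 100 samples, 2 features, 1 target column.
X = np.arange(200, dtype=float).reshape(100, 2)
Y = np.arange(100, dtype=float).reshape(100, 1)
x_b, y_b = sample_minibatch(X, Y)
```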

from sklearn.utils import shuffle

def miniBatchUpdate(w, b, X, Y, learning_rate, no_iterations):
    costs = []

    for i in range(no_iterations):
        # Shuffle the data and take a random mini-batch of 40 samples
        X, Y = shuffle(X, Y)
        x_random = X[:40]
        y_random = Y[:40]

        # Gradients and cost on the mini-batch
        grads, cost = model_optimize(w, b, x_random, y_random)
        dw = grads["dw"]
        db = grads["db"]

        # Weight update
        w = w - (learning_rate * dw)
        b = b - (learning_rate * db)

        costs.append(cost)

    coeff = {"w": w, "b": b}
    gradient = {"dw": dw, "db": db}

    return coeff, gradient, costs

Overall performance comparison


Source: Deep Learning on Medium