Original article was published by Khalid Moataz on Artificial Intelligence on Medium
Gradient Descent in a Nutshell
You may be asking what gradient descent is and how it works. In this article, I will briefly show you exactly what gradient descent is, so let’s get started.
Suppose we have a data set that consists of three features and we want to know which features are more important.
So first we give each feature a WEIGHT. The weight shows how important the feature is: if the weight is large, the feature is important; if it’s small, the feature is not that important and will barely affect the output.
These weights are first initialized randomly and are then updated after each iteration over the data set until they reach their optimum values.
So you may be asking now: WHERE IS THE GRADIENT DESCENT? Gradient descent is a way to update those weights so that they reach their optimum values and minimize the error.
We will use the sum squared error (SSE) to compute the error value, which we want to be as small as possible.
To get a better understanding of gradient descent, follow these steps with me to see how we update the weights of our features using it.
Components of Gradient Descent :
- Learning Rate.
- Activation Function.
- Random Initialized Weights.
Step 1 :
We have 3 features, so we assume 3 weights, one per feature, each starting at an arbitrary number. For example: W1 = 0.3443, W2 = 0.3213, W3 = 0.5642
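Here is a minimal Python sketch of this initialization step. The fixed seed and the range [0, 1) are assumptions made for illustration; the article only says the weights start at arbitrary numbers.

```python
import random

random.seed(0)  # assumption: fixed seed so the example is repeatable

n_features = 3

# One random starting weight per feature, drawn uniformly from [0, 1)
weights = [random.random() for _ in range(n_features)]

print(weights)
```

Any small random values would do here; gradient descent will move them toward better values in the later steps.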
Step 2 :
Pick a suitable learning rate!!
YOU MAY ASK NOW: WHAT IS THE LEARNING RATE?!
The learning rate is how big a step you take toward the minimum loss on each update. If it’s too small, the model takes more time to train; if it’s too big, we may overshoot and never reach the minimum loss point. This figure will illustrate more of what I mean. Imagine this curve as .. So we pick a learning rate like 0.001 or 0.01.
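To make the effect of the learning rate concrete, here is a small sketch on a toy one-weight loss, (w - 3)^2, whose minimum is at w = 3. The toy loss, the function name `descend`, and the three learning rates tried are assumptions chosen for illustration; they are not from the article.

```python
def descend(learning_rate, steps=20, start=0.0):
    """Run gradient descent on the toy loss (w - 3)^2 and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)            # derivative of (w - 3)^2 with respect to w
        w = w - learning_rate * gradient  # step against the gradient
    return w

slow = descend(0.001)    # tiny steps: still far from 3 after 20 steps
good = descend(0.1)      # moderate steps: ends close to 3
too_big = descend(1.1)   # oversized steps: overshoots further on every step

print(slow, good, too_big)
```

With the same number of steps, the tiny rate has hardly moved, the moderate rate lands near the minimum, and the oversized rate moves away from it.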
Step 3 :
Calculate ‘Y’, the output of the system. We calculate ‘Y’ as the dot product of the features and the weights.
Y = X1*W1 + X2*W2 + X3*W3
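As a quick sketch, the weighted sum for three features can be computed like this. The feature values are a hypothetical sample invented for the example, and `weighted_sum` is an illustrative helper name; the weights are the ones from Step 1.

```python
features = [1.0, 0.5, 2.0]           # hypothetical sample: X1, X2, X3
weights = [0.3443, 0.3213, 0.5642]   # W1, W2, W3 from Step 1

def weighted_sum(features, weights):
    """Dot product of the features and the weights."""
    return sum(x * w for x, w in zip(features, weights))

y = weighted_sum(features, weights)
print(y)
```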
Step 4 :
Choose an activation function. Here we’ll be using a sigmoid function, so we need to update our output ‘Y’ to a new value. Since we’re using the sigmoid function, Y = 1 / (1 + e^(-x)), where ‘x’ is the weighted sum from Step 3.
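A short sketch of the sigmoid in Python, using only the standard library. The input 1.63335 is a hypothetical weighted sum used for illustration. Note the parentheses around the denominator: without them, the expression would be read as (1/1) + e^(-x).

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))       # an input of 0 sits exactly in the middle
print(sigmoid(1.63335))   # hypothetical weighted sum from the previous step
```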
Step 5 :
Update the weights using gradient descent.
New Weight 1 = Old Weight 1 + Learning Rate * Feature 1 * (d-y) * y * (1-y)
- Old Weight 1 is the W1 we initialized in step 1.
- Learning Rate is the one we picked in step 2.
- Feature 1 is the first feature (Because we’re updating weight 1).
- (d-y) is the error, where ‘d’ is the output we want (0, 1, etc.) and ‘y’ is the value we obtained from the last step, which equals 1 / (1 + e^(-x)) because we are using a sigmoid function. Here ‘x’ is the weighted sum of the features from Step 3.
- y*(1-y) is the derivative of the sigmoid function, which appears because we’re using a sigmoid activation.
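The update rule above can be sketched for all three weights at once. The sample features, the target of 1, and the helper name `update_weights` are assumptions made for the example; the formula itself is the one given in this step.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_weights(weights, features, target, learning_rate):
    """Apply one gradient descent update for a single sample."""
    x = sum(f * w for f, w in zip(features, weights))  # weighted sum (Step 3)
    y = sigmoid(x)                                     # activation (Step 4)
    delta = (target - y) * y * (1 - y)                 # (d-y) * y * (1-y)
    # New Weight i = Old Weight i + Learning Rate * Feature i * delta
    return [w + learning_rate * f * delta for w, f in zip(weights, features)]

old_weights = [0.3443, 0.3213, 0.5642]   # from Step 1
features = [1.0, 0.5, 2.0]               # hypothetical sample
new_weights = update_weights(old_weights, features, target=1, learning_rate=0.01)

print(new_weights)
```

Because the target (1) is above the current output and every feature is positive, each weight moves up slightly, which pushes the next output closer to the target.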
Step 6 :
Compute the error using the sum squared error over the whole test data: for each sample, subtract the predicted value (from Step 4) from the actual value (the needed output), square the difference, and add up the results.
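Putting the steps together, here is a minimal end-to-end sketch that measures the SSE before and after repeated weight updates. The four-sample data set, the 1000 passes over the data, and all function names are assumptions made for illustration; the starting weights and the learning rate of 0.01 come from the earlier steps.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, features):
    """Weighted sum (Step 3) followed by the sigmoid (Step 4)."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)))

def sum_squared_error(weights, data):
    """Sum of (d - y)^2 over every sample in the data."""
    return sum((target - predict(weights, features)) ** 2
               for features, target in data)

def train(weights, data, learning_rate, epochs):
    """Repeat the Step 5 update for every sample, for several passes."""
    weights = list(weights)  # work on a copy so the caller's list is unchanged
    for _ in range(epochs):
        for features, target in data:
            y = predict(weights, features)
            delta = (target - y) * y * (1 - y)
            weights = [w + learning_rate * f * delta
                       for w, f in zip(weights, features)]
    return weights

# Hypothetical data: (features, wanted output d)
data = [
    ([1.0, 0.0, 0.0], 1),
    ([0.0, 1.0, 0.0], 0),
    ([1.0, 0.0, 1.0], 1),
    ([0.0, 1.0, 1.0], 0),
]

initial_weights = [0.3443, 0.3213, 0.5642]
sse_before = sum_squared_error(initial_weights, data)
trained_weights = train(initial_weights, data, learning_rate=0.01, epochs=1000)
sse_after = sum_squared_error(trained_weights, data)

print(sse_before, sse_after)
print(trained_weights)
```

In this made-up data the first feature matches the wanted output and the second opposes it, so after training the first weight should end up larger than the second, which ties back to the idea that a larger weight marks a more important feature.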