Why Feature Scaling is required for Training ML Models

Original article was published on Deep Learning on Medium

A question every data science learner asks is why data needs to be normalized. Here we discuss the answer and build some intuition for the mechanics.

Feature scaling is a method used to normalize the range of independent variables or features of data.

What is Normalization?

Normalization refers to shifting and scaling values so that features share a comparable range. There are several ways features can be scaled:

  1. Min Max Scaling
  2. Standard Scaling
  3. Max abs Scaling
  4. Z-Score Scaling

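Most readers will already know the details of these methods, so we leave them for later. Note that standard scaling and Z-score scaling are two names for the same transformation.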

Here we look at the Z-score formula. The mean of the dataset is subtracted from each value, and the result is divided by the standard deviation of the dataset: z = (x - μ) / σ.

Z-score Normalization
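The scaling methods listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names and the toy data are made up for this example, the standard deviation is NumPy's default population version, and the code assumes no column is constant (which would cause a division by zero).

```python
import numpy as np

def min_max_scale(x):
    """Rescale each column to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    return (x - col_min) / (col_max - col_min)

def max_abs_scale(x):
    """Divide each column by its largest absolute value."""
    x = np.asarray(x, dtype=float)
    return x / np.abs(x).max(axis=0)

def z_score_scale(x):
    """Subtract the column mean, then divide by the column standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Hypothetical toy data: two features with very different ranges.
data = np.array([[1.0, 0.1],
                 [50.0, 0.5],
                 [100.0, 0.9]])

scaled = z_score_scale(data)
print(scaled.mean(axis=0))  # close to 0 for each column
print(scaled.std(axis=0))   # close to 1 for each column
```

After Z-score scaling, both columns have roughly zero mean and unit standard deviation, so neither feature dominates purely because of its units.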

Why Scale the Data?

Models can be trained no matter what range of values they are given. The reason to scale anyway lies deep in how the optimization functions of learning models work.

Normalized data with comparable ranges leads to better and more efficient optimization of learning algorithms.

Intuition

We can get some intuition from the most famous optimization algorithm:
The Gradient Descent Algorithm.

A complex visualization of a gradient descent surface

Let us look at a simplified version.

The size and shape of this curve depend on the ranges of the features being trained on.

For a model with two features whose ranges are similar, gradient descent can look like this.

With a well-chosen learning rate, the algorithm smoothly reaches the global minimum, so optimization is fast and effective.

Now let us consider unscaled data:
X1: range (1, 100)
X2: range (0, 1)

BUT IT LOOKS FINE AND HAS NO PROBLEM!!

Let us see how it reaches the minimum for a given learning rate (from the perspective of X1).

WHAT !! BUT WHY!!

This looks unreal, but let's look at it from another perspective (that of X2), for the same learning rate.

Sure, we can decrease the learning rate, but then the learning in X1 will decrease too. Hence it is best to normalize and make the ranges comparable.
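To make this concrete, here is a small experiment, offered as a sketch under stated assumptions rather than a definitive benchmark. It runs plain batch gradient descent on a linear model with mean squared error, using synthetic data whose features span the ranges above. The `gradient_descent` helper, the target coefficients, the learning rates, and the step counts are all hypothetical choices made for illustration.

```python
import numpy as np

def gradient_descent(X, y, lr, steps):
    """Batch gradient descent on mean squared error for y ~ X @ w + b.

    Returns the final training loss.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        err = X @ w + b - y
        w -= lr * (2.0 / n) * (X.T @ err)
        b -= lr * (2.0 / n) * err.sum()
    return np.mean((X @ w + b - y) ** 2)

# Synthetic data (an assumption for this demo): X1 in (1, 100), X2 in (0, 1).
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(1, 100, n)
x2 = rng.uniform(0, 1, n)
X = np.column_stack([x1, x2])
y = 0.5 * x1 + 3.0 * x2 + 2.0  # noiseless target with made-up coefficients

# Z-score scaling of both features.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Unscaled data, moderate learning rate: the updates overshoot and the loss grows.
loss_raw_big_lr = gradient_descent(X, y, lr=0.1, steps=20)

# Unscaled data, tiny learning rate: stable, but progress is slow.
loss_raw_small_lr = gradient_descent(X, y, lr=1e-4, steps=500)

# Scaled data, moderate learning rate: converges comfortably.
loss_scaled = gradient_descent(X_scaled, y, lr=0.1, steps=500)

print("unscaled, lr=0.1  :", loss_raw_big_lr)
print("unscaled, lr=1e-4 :", loss_raw_small_lr)
print("scaled,   lr=0.1  :", loss_scaled)
```

In this setup the unscaled run with the larger learning rate diverges, the unscaled run with the tiny learning rate stays stable but is still far from the minimum after 500 steps, and the scaled run reaches a loss that is effectively zero. A single learning rate cannot suit both features until their ranges are comparable.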