Hands-on with Feature Engineering Techniques: Transforming Variables

The original article was published in Artificial Intelligence on Medium.

This post is part of a series about feature engineering techniques for machine learning with Python.


Welcome to another article in our series on feature engineering! In this post, we’re going to discuss the different transformations you can apply to your variables in a given dataset.

Specifically, we are going to explain the mathematical transformations that help satisfy an assumption of linear models: that the variables follow a normal distribution.

Why These Transformations?

Some machine learning models, like linear and logistic regression, assume that the variables follow a normal distribution. In practice, however, variables in real datasets are more likely to follow a skewed distribution.

By applying transformations to these variables, mapping their skewed distributions to something closer to a normal distribution, we can improve the performance of our models.
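As a quick illustration of this idea, here is a minimal sketch (using simulated log-normal data, since the article's dataset isn't shown here) of how a simple log transformation can pull a right-skewed variable toward a normal shape:

```python
import numpy as np
from scipy.stats import skew

# Simulate a right-skewed variable (illustrative only -- not the article's data)
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# A log transformation maps the skewed values closer to a normal shape
x_log = np.log(x)

print(f"skewness before: {skew(x):.2f}")     # strongly right-skewed
print(f"skewness after:  {skew(x_log):.2f}")  # close to 0
```

The skewness statistic drops from a large positive value toward zero after the transformation, which is exactly the effect we want before feeding the variable to a linear model.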

First, we have to check whether our variables follow a normal distribution. We can assess normality with histograms and Q-Q plots. Here's an example of a Q-Q plot:

Q-Q Plot Example

In a Q-Q plot, if the variable follows a normal distribution, its values should fall along a 45-degree line when plotted against the theoretical quantiles.

Here’s the code snippet in Python to generate the previous plot:
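The article's exact snippet isn't reproduced here, but a minimal sketch along these lines, using `scipy.stats.probplot` with simulated data, produces such a plot:

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Hypothetical data -- a normally distributed sample for illustration
rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=1_000)

# Q-Q plot: sample quantiles against theoretical normal quantiles.
# Points lying along the 45-degree reference line suggest normality.
stats.probplot(data, dist="norm", plot=plt)
plt.title("Q-Q Plot Example")
plt.show()
```

To apply this to your own dataset, replace `data` with the column you want to check, e.g. `stats.probplot(df["variable"], dist="norm", plot=plt)`.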