Data Preprocessing for Machine Learning

Original article was published by Kr. Wan on Artificial Intelligence on Medium

Numerical Preprocessing

Now comes the fun part: processing the numerical values in the dataset. There are two things we need to do. First, we replace missing values; second, we standardize the values.

Replacing Missing Values

Datasets often have missing values, represented as NaN, and those are a pain for anyone who wishes to use the data. Luckily, sklearn provides a simple and efficient way to deal with this problem. With a few lines of code, we can replace each missing value with the median of the column it is in.

Sklearn imputer to replace missing values

Utterly simple! First we get rid of the non-numerical attributes, since a median is only defined for numbers. Then we fit the imputer on our data and transform it, so that each missing value is replaced with the median of its column. Lastly, we convert the result back to a pandas DataFrame for consistency, because by default the imputer returns a NumPy array.
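The original code snippet did not survive here, so the following is only a minimal sketch of the steps described above, not necessarily the author's exact code. It assumes SimpleImputer from sklearn.impute with strategy="median"; the toy DataFrame and its column names (rooms, income, region) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "income": [2.5, 4.0, np.nan, 6.0],
    "region": ["north", "south", "east", "west"],  # non-numerical attribute
})

# Step 1: keep only the numerical attributes.
df_num = df.select_dtypes(include=[np.number])

# Step 2: fit the imputer and replace each NaN with its column's median.
imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(df_num)  # NumPy array by default

# Step 3: convert the array back to a pandas DataFrame.
df_num_imputed = pd.DataFrame(
    imputed, columns=df_num.columns, index=df_num.index
)
```

After fitting, the medians the imputer learned for each column are available in imputer.statistics_, which is handy for checking what was filled in.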


Standardizing the values is even simpler!

Standardize numerical attributes

Just import StandardScaler from sklearn.preprocessing and call fit_transform() on the dataset we have. This rescales each column to zero mean and unit variance. Done!
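As a rough sketch of that step (again, not necessarily the author's exact code), the example below standardizes a small, made-up numerical DataFrame. The data and column names are assumptions for illustration; the input is expected to have no missing values left.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical data with no missing values.
df_num = pd.DataFrame({
    "rooms": [3.0, 4.0, 5.0, 4.0],
    "income": [2.5, 4.0, 4.0, 6.0],
})

# Learn each column's mean and standard deviation, then rescale.
scaler = StandardScaler()
scaled = scaler.fit_transform(df_num)  # NumPy array by default

# Wrap the result in a DataFrame to keep the column names.
df_scaled = pd.DataFrame(scaled, columns=df_num.columns, index=df_num.index)
```

Note that StandardScaler divides by the population standard deviation, so checking the result with pandas requires df_scaled.std(ddof=0) rather than the default sample standard deviation.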