Categorical feature encoding in Machine Learning


Machine learning models generally perform best with numeric data, but in real-life machine learning problems the data comes in the form of both numeric and discrete/categorical values. In this case, the first job of the data scientist is to convert these discrete/categorical values to numeric values, a process that falls under feature engineering. Feature engineering basically has two tasks:

1- Deriving new features from existing features (a quick example follows this list)

2- Categorical feature encoding
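
A minimal sketch of the first task, assuming the Titanic training data has already been loaded into a DataFrame named train_df (the loading step is shown in the next section) and using the standard Kaggle columns SibSp and Parch:

# Task 1 example: derive a FamilySize feature from two existing columns.
# SibSp = siblings/spouses aboard, Parch = parents/children aboard.
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1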

We are going to discuss categorical feature encoding with the help of the Titanic data, one of the best data sets for learning ML. You can download the Titanic data set from Kaggle. When you start working with it, you will find that some features are not numeric, so we have to convert those features to numeric before feeding them to an ML algorithm.

Read the Titanic data set:

import os
import pandas as pd

# dataset_path is assumed to point at the folder holding the Kaggle Titanic files
train_df = pd.read_csv(os.path.join(dataset_path, 'train.csv'))
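
To see which features need encoding, one quick check is to list the columns whose dtype is 'object'; in the standard Kaggle file these include Sex and Embarked:

# Columns with dtype 'object' contain strings and must be encoded before modelling.
print(train_df.dtypes)
print(train_df.select_dtypes(include='object').columns)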

Categorical feature encoding provides several techniques for converting discrete values to numeric values. The following are the main feature encoding techniques:

1- Binary Encoding

2- Label Encoding

3- One-Hot Encoding

Binary Encoding:

Binary encoding maps a feature to 0 or 1. You can use binary encoding when your feature column has only two labels. In the Titanic data set, the 'Sex' column has only two labels, 'male' and 'female', so we can simply convert male to 1 and female to 0 with the help of the 'where' function in the 'numpy' module. 'where' iterates through each row and places 1 where the value equals 'male' and 0 otherwise, as you can see in the following example.

import numpy as np
train_df['IsMale'] = np.where(train_df.Sex == 'male', 1, 0)
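
An equivalent way to do the same mapping, staying entirely in pandas, is the Series.map method with an explicit label-to-number dictionary:

# Explicit mapping: 'male' -> 1, 'female' -> 0; any unseen label would become NaN.
train_df['IsMale'] = train_df['Sex'].map({'male': 1, 'female': 0})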

Label Encoding:

Label encoding is useful when a feature column has more than two labels and the labels carry an inherent order, i.e. one label has greater weight than another.

To perform label encoding we can use the 'fit_transform' method, which is defined in the LabelEncoder class. To use it, we first have to import LabelEncoder from the sklearn.preprocessing module.

Python code:

from sklearn.preprocessing import LabelEncoder

Now create an object of the LabelEncoder class and assign it to la_en.

la_en = LabelEncoder()

Call the 'fit_transform' method, pass the Fare_Bin feature column as the parameter, and store the returned values in the data variable.

data = la_en.fit_transform(df['Fare_Bin'])

Now create a new column, 'Fare_Bin_with_LabelEncoder', in your data frame and assign those values to it.

df['Fare_Bin_with_LabelEncoder'] = data

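Putting the steps together: note that Fare_Bin is not a column in the raw Titanic CSV, so this sketch assumes it was derived from the continuous Fare column, e.g. with pd.qcut. Also note that LabelEncoder assigns integers in alphabetical order of the labels, not in their ordinal order:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = train_df.copy()
# Hypothetical derivation of Fare_Bin: split Fare into four equal-sized bins.
df['Fare_Bin'] = pd.qcut(df['Fare'], 4, labels=['very_low', 'low', 'high', 'very_high'])

la_en = LabelEncoder()
df['Fare_Bin_with_LabelEncoder'] = la_en.fit_transform(df['Fare_Bin'])

# classes_ shows the label-to-integer mapping (sorted alphabetically):
# ['high' 'low' 'very_high' 'very_low'] -> 0, 1, 2, 3
print(la_en.classes_)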

One-Hot Encoding:

In one-hot encoding, each label in the feature column becomes a new feature column, filled with 1 where that label is present in a row and 0 where it is not.

There is a very handy 'get_dummies' function in the pandas module for converting a categorical feature to a one-hot encoding. Let's take the above example again to understand one-hot encoding. In the following code, get_dummies takes two arguments: the first is df, the data frame where all the data is present; the second is columns=['Fare_Bin'], the name of the column on which we want to perform one-hot encoding.

import pandas as pd
df_FareBin = pd.get_dummies(df, columns=['Fare_Bin'])
df_FareBin.loc[1:10, 'Fare_Bin_very_low':]
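
One common refinement, assumed here rather than taken from the article: when the dummy columns feed a linear model, one of them is redundant (the four columns always sum to 1), and get_dummies can drop the first level directly:

# drop_first=True keeps three of the four dummy columns, avoiding perfect collinearity.
df_FareBin = pd.get_dummies(df, columns=['Fare_Bin'], drop_first=True)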

The 'Fare_Bin' column has four labels: 1- very_low, 2- low, 3- high, 4- very_high. After applying one-hot encoding it is converted into four different columns, 1- Fare_Bin_very_low, 2- Fare_Bin_low, 3- Fare_Bin_high, 4- Fare_Bin_very_high, and a value (0 or 1) is filled in according to the label present in each row.

In this article I tried to explain categorical feature encoding. I hope you enjoyed it.