Useful scikit-learn tips -1

Column Transformer and Column Selector


This series of blog posts is inspired by Kevin Markham, founder of Data School, and his videos on cool scikit-learn tips.

More often than not, we come across datasets whose columns are of different types: some are categorical variables, some are numerical. Clearly, we need different pre-processing strategies for these, such as encoding the categorical variables and imputing the missing values of the numerical ones, while some columns can be retained as they are because they already have the data just the way we need it.

Here’s a cool tip to apply different pre-processing techniques to different columns 😎. You’ll need scikit-learn version 0.20 or later to use this feature.
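If you're not sure which version you have installed, here's a quick check:

# Print the installed scikit-learn version (should be 0.20 or later)
import sklearn
print(sklearn.__version__)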

Let's import the dataset that we need (the famous Titanic dataset from Kaggle 😊).

# Import pandas and read the first six rows of the dataset into a DataFrame
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)

# Select the subset of columns of interest
cols = ['Fare', 'Embarked', 'Sex', 'Age']
X = df[cols]

Our DataFrame X looks like this:

[Image: the first six rows of the DataFrame X]

Let's inspect the DataFrame X: there are two columns with categorical variables ('Embarked' and 'Sex') that should be one-hot encoded, an 'Age' column with missing values that have to be imputed, and a 'Fare' column with no missing values.
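A quick way to confirm this is to check the column dtypes and count the missing values per column:

# Check each column's dtype and the number of missing values per column
print(X.dtypes)
print(X.isna().sum())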

# Import the necessary pre-processing functions
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer

# Instantiate the one-hot encoder and the imputer
# (SimpleImputer defaults to strategy='mean', i.e. mean imputation)
ohe = OneHotEncoder()
imp = SimpleImputer()

We can now use the make_column_transformer function to apply the appropriate pre-processing to each set of columns:

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),  # apply OneHotEncoder to Embarked and Sex
    (imp, ['Age']),              # apply SimpleImputer to Age
    remainder='passthrough')     # include the remaining column (Fare) in the output

# Output column order: Embarked (3 columns), Sex (2 columns), Age (1 column), Fare (1 column)
ct.fit_transform(X)
[Image: output array after pre-processing]

We now see that the columns 'Embarked' and 'Sex' have been one-hot encoded and that the missing value in the 'Age' column has been replaced with the mean of the other values (mean imputation). Setting the remainder argument to 'passthrough' ensures that the other columns are passed through unchanged, as they do not require any encoding or imputation strategy.
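If you're on scikit-learn 1.0 or later, you can also verify the output column order by listing the transformed feature names (the names in the comment below are indicative for this example):

# List the names of the transformed output columns to verify the order
ct.fit(X)
print(ct.get_feature_names_out())
# e.g. ['onehotencoder__Embarked_C', 'onehotencoder__Embarked_Q',
#       'onehotencoder__Embarked_S', 'onehotencoder__Sex_female',
#       'onehotencoder__Sex_male', 'simpleimputer__Age', 'remainder__Fare']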

The code used above can be found in this GitHub repo, and the accompanying video is on YouTube.

In the above example, we've selected columns by name, but there are several other ways to do it. Let's look at them in the following code snippet.

# All of these produce the same result

# Select by column names
ct = make_column_transformer((ohe, ['Embarked', 'Sex']))

# Select by integer positions
ct = make_column_transformer((ohe, [1, 2]))

# Alternatively, use slicing
ct = make_column_transformer((ohe, slice(1, 3)))

# Use a boolean mask to select columns
# (True -> include the column, False -> exclude it)
ct = make_column_transformer((ohe, [False, True, True, False]))
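As a small sanity check (a sketch, assuming the X and ohe defined in the snippets above), we can confirm that all four selection styles yield the same encoded array:

import numpy as np
from scipy import sparse

def to_dense(m):
    # The transformer may return a sparse or a dense array depending
    # on the output density, so normalise both before comparing
    return m.toarray() if sparse.issparse(m) else np.asarray(m)

selectors = [
    ['Embarked', 'Sex'],         # column names
    [1, 2],                      # integer positions
    slice(1, 3),                 # slice
    [False, True, True, False],  # boolean mask
]
results = [
    to_dense(make_column_transformer((ohe, sel)).fit_transform(X))
    for sel in selectors
]
assert all(np.array_equal(results[0], r) for r in results)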

In scikit-learn version 0.22 and later, there's another function, make_column_selector, that we can use to choose the columns to which we would like to apply a particular encoding strategy, as illustrated in the following code snippet.

from sklearn.compose import make_column_selector

# Select columns using a regular expression
ct = make_column_transformer((ohe, make_column_selector(pattern='E|S')))

# Apply to all object-type columns
ct = make_column_transformer((ohe, make_column_selector(dtype_include=object)))

# Apply to all non-numerical columns
ct = make_column_transformer((ohe, make_column_selector(dtype_exclude='number')))

# One-hot encode Embarked and Sex (and drop all other columns)
ct.fit_transform(X)
[Image: output array after transformation]

Note that the remainder argument takes 'drop' as its default value; hence, when it is not explicitly specified, the remaining columns are dropped from the output.
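If you do want to keep the remaining columns here as well, you can combine make_column_selector with remainder='passthrough', just as in the earlier example:

# Keep the non-object columns (Fare, Age) in the output instead of dropping them
ct = make_column_transformer(
    (ohe, make_column_selector(dtype_include=object)),
    remainder='passthrough')
ct.fit_transform(X)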

The above code can be found in this GitHub repo and the video is on YouTube.

Happy Learning✨! Until next time 😊