Original article was published by Bala Priya C on Artificial Intelligence on Medium
Useful scikit-learn tips -1
Column Transformer and Column Selector
More often than not, we come across datasets whose columns are of different types: some are categorical, some numerical. Clearly, we need different pre-processing strategies for them, such as encoding the categorical variables and imputing the missing values in numerical columns, while some columns can be kept as they are because the data is already in the form we need.
Here’s a cool tip to apply different pre-processing techniques to different columns😎. You’d need scikit-learn version 0.20 or later to use this feature.
Let’s import the dataset that we need (The famous Titanic dataset from Kaggle😊 )
# Import pandas and read in the dataset into a DataFrame
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)

# Read in a subset of the DataFrame containing columns of interest
cols = ['Fare', 'Embarked', 'Sex', 'Age']
X = df[cols]
Our DataFrame X looks like this
Let’s inspect the DataFrame X. There are two categorical columns (‘Embarked’ and ‘Sex’) that should be one-hot encoded, an ‘Age’ column with missing values that have to be imputed, and a ‘Fare’ column with no missing values.
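One quick way to confirm this kind of inspection is to look at the column dtypes and count the missing values per column. The sketch below uses a small hand-made stand-in for X (the name X_demo and all of its values are made up for illustration, not the actual Titanic rows) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X: same column names, made-up values
X_demo = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0, 40.0],
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [20.0, 30.0, np.nan, 40.0],
})

# Column dtypes: non-numeric columns are candidates for one-hot encoding
print(X_demo.dtypes)

# Missing values per column: a non-zero count means the column needs imputing
missing_counts = X_demo.isna().sum()
print(missing_counts)
```

Here only the Age column of the stand-in data reports a missing value, which is the situation the imputer is meant to handle.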
# Import necessary pre-processing functions
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer

# Instantiate the One-Hot Encoder and Imputer
ohe = OneHotEncoder()
imp = SimpleImputer()
We can now use the make_column_transformer function to apply each pre-processing step to the columns that need it:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),  # apply OneHotEncoder to Embarked and Sex
    (imp, ['Age']),              # apply SimpleImputer to Age
    remainder='passthrough')     # include remaining column (Fare) in the output

# column order: Embarked (3 columns), Sex (2 columns), Age (1 column), Fare (1 column)
ct.fit_transform(X)
We now see that the columns ‘Embarked’ and ‘Sex’ have been one-hot encoded and the missing value in the ‘Age’ column has been replaced with the mean of the other values (mean imputation, the default strategy of SimpleImputer). Setting the argument remainder to 'passthrough' ensures that the other columns are passed through unchanged, as they do not require any encoding or imputing.
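To check the shape and column order described above without downloading the dataset, here is a minimal sketch on a small hand-made DataFrame. The names X_demo and ct_demo and all the values are assumptions made for illustration; only the transformer setup mirrors the article.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X: same column names, made-up values
X_demo = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0, 40.0],
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [20.0, 30.0, np.nan, 40.0],
})

ct_demo = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),  # 3 + 2 one-hot columns
    (SimpleImputer(), ['Age']),              # mean imputation by default
    remainder='passthrough')                 # Fare is appended unchanged

out = ct_demo.fit_transform(X_demo)
# The result can be a sparse matrix in some cases; densify it for display
if hasattr(out, 'toarray'):
    out = out.toarray()

print(out.shape)  # rows x (3 Embarked + 2 Sex + 1 Age + 1 Fare)
print(out[:, 5])  # Age column, with the missing value imputed
print(out[:, 6])  # Fare column, passed through as-is
```

In this stand-in data the missing Age is replaced by 30.0, the mean of 20, 30 and 40, and the passthrough column lands last in the output.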
In the above example, we’ve selected columns by name, but there are several other ways to do it too. Let’s look at them now in the following code snippet.
# all of these produce the same results

# Choose by column names
ct = make_column_transformer((ohe, ['Embarked', 'Sex']))

# Choose by integer positions
ct = make_column_transformer((ohe, [1, 2]))

# Alternatively, we could use slicing
ct = make_column_transformer((ohe, slice(1, 3)))

# Use a Boolean mask to choose columns
# True -> Include Column
# False -> Exclude Column
ct = make_column_transformer((ohe, [False, True, True, False]))
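As a sanity check on the claim that these selection styles are interchangeable, the following sketch fits one transformer per style on a small hand-made DataFrame and compares the outputs. X_demo, the selectors dictionary and the values are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X: column order matters for positions and masks
X_demo = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0, 40.0],
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [20.0, 30.0, np.nan, 40.0],
})

# Four ways of pointing at the Embarked and Sex columns
selectors = {
    'names': ['Embarked', 'Sex'],
    'positions': [1, 2],
    'slice': slice(1, 3),
    'mask': [False, True, True, False],
}

results = {}
for label, columns in selectors.items():
    ct_demo = make_column_transformer((OneHotEncoder(), columns))
    out = ct_demo.fit_transform(X_demo)
    # Densify if the result is sparse so the arrays can be compared directly
    results[label] = out.toarray() if hasattr(out, 'toarray') else out

reference = results['names']
for label, out in results.items():
    print(label, np.array_equal(out, reference))
```

Positions, slices and masks all depend on the column order of the DataFrame, so selecting by name is the most robust choice if that order might change.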
In scikit-learn version 0.22 and later, there’s another function, make_column_selector, that we can use to choose the columns to which we would like to apply a particular encoding strategy, as illustrated in the following code snippet:
# Import the column selector (scikit-learn 0.22 or later)
from sklearn.compose import make_column_selector

# using regular expressions
ct = make_column_transformer((ohe, make_column_selector(pattern='E|S')))

# apply to all object type columns
ct = make_column_transformer((ohe, make_column_selector(dtype_include=object)))

# apply to all non-numerical columns
ct = make_column_transformer((ohe, make_column_selector(dtype_exclude='number')))

# one-hot encode Embarked and Sex (and drop all other columns)
ct.fit_transform(X)
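A selector created with make_column_selector is a callable that takes a DataFrame and returns a list of column names, so it can be called directly to preview which columns it would pick. The sketch below does this on a hand-made DataFrame; X_demo and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical stand-in for X: same column names, made-up values
X_demo = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0, 40.0],
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [20.0, 30.0, np.nan, 40.0],
})

# Column names containing an uppercase 'E' or 'S' (the match is case-sensitive)
by_pattern = make_column_selector(pattern='E|S')(X_demo)

# All non-numeric columns
by_exclude = make_column_selector(dtype_exclude='number')(X_demo)

# All numeric columns, for comparison
numeric = make_column_selector(dtype_include='number')(X_demo)

print(by_pattern)
print(by_exclude)
print(numeric)
```

Note that the pattern is matched against column names, not column values, which is why ‘Fare’ and ‘Age’ (lowercase e only) are not picked up by 'E|S'.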
Note that the remainder argument defaults to 'drop', so when it is not explicitly specified, the remaining columns are dropped.
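The effect of the default is easy to see by comparing output shapes with and without remainder='passthrough'. This is a minimal sketch on a hand-made DataFrame; X_demo, ct_drop, ct_pass and the values are hypothetical names and data used only here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for X: same column names, made-up values
X_demo = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0, 40.0],
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [20.0, 30.0, np.nan, 40.0],
})

# remainder not specified: Fare and Age are dropped from the output
ct_drop = make_column_transformer((OneHotEncoder(), ['Embarked', 'Sex']))
dropped = ct_drop.fit_transform(X_demo)

# remainder='passthrough': Fare and Age are appended unchanged
ct_pass = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    remainder='passthrough')
kept = ct_pass.fit_transform(X_demo)

print(ct_drop.remainder)  # the default setting
print(dropped.shape)      # only the one-hot columns
print(kept.shape)         # one-hot columns plus Fare and Age
```

Passthrough columns are not transformed in any way, so the missing Age value is still missing in the second output.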
Happy Learning✨! Until next time 😊