Santander Case — Part A: Classification

Original article was published by Pedro Couto on Artificial Intelligence on Medium



Here you will find: Data Cleaning, Feature Selection, Bayesian Optimization, Classification and Model Validation.

Customer Classification. Source: https://miro.medium.com/max/1400/1*PM4dqcAe6N7kWRpXKwgWag.png.

The Problem

The Santander Group is a global banking group, led by Banco Santander S.A., the largest bank in the euro area. It has its origin in Santander, Cantabria, Spain. Like every bank, it has a retention program that should be applied to unsatisfied customers.

To be able to use this program properly, we need to develop a machine learning model to classify if the customer is satisfied or not. Customers classified as unsatisfied should be the target of the retention program.

The retention program costs $10 per customer, and an effective application (on genuinely unsatisfied customers) returns a profit of $100. In the classification task we can have the following scenarios:

  1. False Positive (FP): classify the customer as UNSATISFIED but they are SATISFIED. Cost: $10, Return: $0;
  2. False Negative (FN): classify the customer as SATISFIED but they are UNSATISFIED. Cost: $0, Return: $0;
  3. True Positive (TP): classify the customer as UNSATISFIED and they are UNSATISFIED. Cost: $10, Return: $100;
  4. True Negative (TN): classify the customer as SATISFIED and they are SATISFIED. Cost: $0, Return: $0.

In summary, we want to minimize the rates of FP and FN as well as maximize the rate of TP. To do so, we will use the AUC (area under the curve) of the ROC curve (receiver operating characteristic) as our metric, because it lets us find the best model as well as the best threshold.
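The cost structure above can be made concrete with a small sketch. The counts used in the example call are purely illustrative; only the $10 cost and $100 return come from the problem statement.

```python
# Hypothetical sketch: expected profit of a retention campaign given the
# outcome counts of a classifier, using the costs stated above.
def campaign_profit(tp, fp, cost=10, profit=100):
    """Each contacted customer (TP or FP) costs $10; each TP returns $100.
    FN and TN involve no contact, so they cost and return nothing."""
    return tp * profit - (tp + fp) * cost

# Illustrative counts: 300 true positives, 500 false positives
print(campaign_profit(tp=300, fp=500))  # 300*100 - 800*10 = 22000
```

A threshold that trades a few TPs for many fewer FPs can therefore increase profit even if accuracy drops, which is why a threshold-aware metric like ROC AUC fits this problem.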

You can check the complete notebook with this solution on my Github.

Let’s go.

1 Loading Data and Packages

# Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline

# Loading the Train and Test datasets
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")

The data can be found in this old Santander competition on Kaggle.

2 Basic Exploratory Analysis

For this step, let us address the following points:

  • Are the data in the columns numeric or do they need to be encoded?
  • Can the test dataset really be used or is it useful only for a Kaggle competition?
  • Are there any missing data?
  • What is the proportion of dissatisfied customers (1) in the dataset df_train?
  • Does it make sense to apply a feature selection method on the data?

First, let’s get an overview of the datasets and their features.

# Checking the first 5 rows of df_train
df_train.head()
df_train.head() output.
# Checking the first 5 rows of df_test
df_test.head()
df_test.head() output.
# Checking the general info of df_train
df_train.info()
df_train.info() output.
# Checking the general info of df_test
df_test.info()
df_test.info() output.

Looking at the outputs of the cells above, we can say that:

  1. All columns are already in a numeric format. This means we don’t need to do any encoding to convert any type of variable into a numeric one;
  2. Since this is an anonymized dataset, we have no clue whether any of the numeric columns actually represent categorical variables, so there is no encoding we can do to address that either;
  3. Lastly, df_train has 371 columns and df_test has 370. This happens because these are competition datasets: df_test has no TARGET column.

Another crucial point is to check whether there are any missing values in these datasets. Let’s check it out.

# Checking if there is any missing value in both train and test datasets
df_train.isnull().sum().sum(), df_test.isnull().sum().sum()
No missing values for the datasets.

Now, we can conclude that there is no missing data in either dataset.

Finally, let’s investigate the proportion of unsatisfied customers (our target) in the df_train dataset.

# Investigating the ratio of unsatisfied to satisfied customers on df_train
rate_insatisfied = df_train.TARGET.value_counts()[1] / df_train.TARGET.value_counts()[0]
rate_insatisfied * 100
Ratio of unsatisfied to satisfied customers (%).

We have an extremely imbalanced dataset: roughly 4.12 unsatisfied customers for every 100 satisfied ones. This must be taken into account in two situations:

  1. When splitting the data into train and test sets;
  2. When choosing hyperparameters such as class_weight in Random Forest.
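To illustrate the second point, here is a minimal sketch of two common ways of informing a model about class imbalance. The class counts below are approximate for this dataset, and the hyperparameter conventions shown are general practice rather than choices made in this article.

```python
# Sketch: two common conventions for handling class imbalance.
# Approximate class counts for this dataset (~4% positives).
from sklearn.ensemble import RandomForestClassifier

n_neg, n_pos = 73012, 3008

# XGBoost convention: scale_pos_weight is usually set to negatives / positives
scale_pos_weight = n_neg / n_pos  # ~24.3

# scikit-learn convention: class_weight="balanced" reweights classes
# inversely proportional to their frequencies
rf = RandomForestClassifier(class_weight="balanced", random_state=42)
```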

3 Dataset Split (train — test)

The train_test_split method segments the data randomly, so even with an extremely imbalanced dataset, the split should roughly preserve the proportion of unsatisfied customers in both the training and test partitions.
However, since randomness alone cannot guarantee this, we can make a stratified split based on the TARGET variable, thus ensuring that the proportion is exact in both datasets.

from sklearn.model_selection import train_test_split

# Splitting the dataset in a proportion of 80% for train and 20% for test
X_train, X_test, y_train, y_test = train_test_split(df_train.drop('TARGET', axis = 1), df_train.TARGET,
                                                    train_size = 0.8, stratify = df_train.TARGET, random_state = 42)
# Checking the split
X_train.shape, y_train.shape[0], X_test.shape, y_test.shape[0]

We successfully split the data into train (X_train, y_train) and test (X_test, y_test) partitions.
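The effect of stratification can be verified on toy data (the column names below are illustrative): both partitions end up with exactly the overall positive rate.

```python
# Sketch with toy data: a stratified split preserves the class proportion
# in both partitions.
import pandas as pd
from sklearn.model_selection import train_test_split

# 1000 rows, 4% positives (every 25th row)
toy = pd.DataFrame({"f1": range(1000),
                    "TARGET": [1 if i % 25 == 0 else 0 for i in range(1000)]})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop("TARGET", axis=1), toy.TARGET,
    train_size=0.8, stratify=toy.TARGET, random_state=42)

print(y_tr.mean(), y_te.mean())  # both equal the overall 4% positive rate
```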

4 Feature Selection

Since the training data is relatively large (around 76k rows and 370 columns) and we don’t know what each feature represents or how it impacts the model, feature selection is warranted for three reasons:

  1. To know which features bring the most relevant predictive power to the model;
  2. To avoid using features that could degrade the model’s performance;
  3. To minimize the computational cost by using the smallest set of features that provides the best model performance.

For this reason, we will try to answer the following questions:

  • Are there any constant and/or semi-constants features that can be removed?
  • Are there duplicate features?
  • Does it make sense to perform some more filtering to reach a smaller group of features?

4.1 Removing low variance features

If a feature’s variance is low or close to zero, the feature is approximately constant and will not improve the performance of the model. In that case, it should be removed.

# Investigating if there are constant or semi-constant features in X_train
from sklearn.feature_selection import VarianceThreshold
# Removing all features that have variance under 0.01
selector = VarianceThreshold(threshold = 0.01)
selector.fit(X_train)
mask_clean = selector.get_support()
X_train = X_train[X_train.columns[mask_clean]]

Now let’s check how many columns were removed.

# Total of remaning features
X_train.shape[1]
Amount of remaining features.

With this filtering, 104 features were removed, leaving 266. The dataset has become leaner without losing predictive power: since these features add no information to the machine learning model, removing them does not hurt its ability to classify an instance.
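One detail worth keeping in mind, sketched here on toy data with illustrative column names: the selector is fitted on X_train only, and the same column mask should later be applied to the test partition, so that the model sees identical features at prediction time.

```python
# Sketch: a VarianceThreshold fitted on the train partition is reused
# (not re-fitted) to keep the same columns in the test partition.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

X_tr = pd.DataFrame({"a": [0, 0, 0, 0], "b": [1, 2, 3, 4]})  # "a" is constant
X_te = pd.DataFrame({"a": [0, 0], "b": [5, 6]})

selector = VarianceThreshold(threshold=0.01).fit(X_tr)  # fit on train only
mask = selector.get_support()
X_tr = X_tr[X_tr.columns[mask]]
X_te = X_te[X_te.columns[mask]]  # same columns, no re-fitting on test

print(list(X_tr.columns), list(X_te.columns))  # ['b'] ['b']
```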

4.2 Removing repeated features

For obvious reasons, removing duplicated features improves the performance of our model: it removes redundant information and leaves us a lighter dataset to work with.

# Checking if there is any duplicated column
remove = []
cols = X_train.columns
for i in range(len(cols) - 1):
    column = X_train[cols[i]].values
    for j in range(i + 1, len(cols)):
        if np.array_equal(column, X_train[cols[j]].values):
            remove.append(cols[j])

# If there are any, they are dropped here
X_train.drop(remove, axis = 1, inplace = True)

Now let’s check the result.

# Checking if any column was dropped
X_train.shape
The shape of X_train dataframe.

There were 266 columns before checking for duplicate features and there are 251 now, so 15 repeated features were removed.
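The pairwise loop above is quadratic in the number of columns. An equivalent, more concise alternative in pandas is to transpose the frame and use duplicated(), shown here on toy data (note that transposing a large mixed-dtype frame can be memory-hungry, so the explicit loop is a reasonable choice too).

```python
# Sketch: duplicate-column detection via transpose + duplicated(), on toy data.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [7, 8, 9]})

# Transposing makes columns into rows, so duplicated() compares columns;
# the first column of each duplicate group is kept.
dup_mask = df.T.duplicated().values
to_drop = df.columns[dup_mask].tolist()
df = df.drop(columns=to_drop)

print(to_drop, list(df.columns))  # ['b'] ['a', 'c']
```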

4.3 Using SelectKBest to select features

There are two scoring functions for evaluating features with SelectKBest: f_classif (fc) and mutual_info_classif (mic). The first works best when the features and the target have a more linear relationship; the second is more appropriate when the relationships are non-linear.

Since the dataset is anonymized and the number of features is too large for a careful study of the feature-target relationships, both methods will be tested, and the one that produces a stable region with the highest AUC values will be chosen.

For this, different K values will be tested with the SelectKBest class; each selected feature subset will be used to train an XGBClassifier model evaluated with the AUC metric. With these values collected, one graph for fc and another for mic will be created.

Thus, through visual analysis, it is possible to choose the best K value as well as the best method for scoring features.

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.metrics import roc_auc_score as auc
from sklearn.model_selection import cross_val_score
import xgboost as xgb
# Create an automated routine to test different K values in each of these methods
K_vs_score_fc = []   # List to store the AUC of each K with f_classif
K_vs_score_mic = []  # List to store the AUC of each K with mutual_info_classif
total_start = time.time()
for k in range(2, 247, 2):
    start = time.time()

    # Instantiating a KBest object for each of the metrics in order to
    # obtain the K features with the highest score
    selector_fc = SelectKBest(score_func = f_classif, k = k)
    selector_mic = SelectKBest(score_func = mutual_info_classif, k = k)

    # Selecting K features and modifying the dataset
    X_train_selected_fc = selector_fc.fit_transform(X_train, y_train)
    X_train_selected_mic = selector_mic.fit_transform(X_train, y_train)

    # Instantiating an XGBClassifier object
    clf = xgb.XGBClassifier(seed = 42)

    # Using 10-fold CV to calculate the AUC for each K value, avoiding overfitting
    auc_fc = cross_val_score(clf, X_train_selected_fc, y_train,
                             cv = 10, scoring = 'roc_auc')
    auc_mic = cross_val_score(clf, X_train_selected_mic, y_train,
                              cv = 10, scoring = 'roc_auc')

    # Storing the average values obtained in the CV for further analysis
    K_vs_score_fc.append(auc_fc.mean())
    K_vs_score_mic.append(auc_mic.mean())

    end = time.time()
    # Reporting the metrics for the tested K and the time spent on this iteration
    print("k = {} - auc_fc = {} - auc_mic = {} - Time = {}s".format(k, auc_fc.mean(), auc_mic.mean(), end - start))

print(time.time() - total_start)  # Computing the total time spent

The code above returns two lists with 123 K-scores each. By plotting them against K, we can choose the best K value as well as the best scoring method.
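A plot like the one described could be sketched as below. The two score lists are filled with dummy values here, since the real ones come from the (long-running) routine above.

```python
# Sketch of the K-vs-AUC plot; list contents are illustrative placeholders.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch runs headlessly
import matplotlib.pyplot as plt

k_values = list(range(2, 247, 2))  # the 123 K values tested in the loop
# K_vs_score_fc and K_vs_score_mic would come from the routine above
K_vs_score_fc = [0.5 + 0.001 * k for k in k_values]
K_vs_score_mic = [0.5 + 0.0009 * k for k in k_values]

fig, axes = plt.subplots(1, 2, figsize=(14, 4), sharey=True)
axes[0].plot(k_values, K_vs_score_fc)
axes[0].set(title="f_classif", xlabel="K", ylabel="mean CV AUC")
axes[1].plot(k_values, K_vs_score_mic)
axes[1].set(title="mutual_info_classif", xlabel="K")
fig.savefig("k_vs_auc.png")
```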