# Santander Case — Part A: Classification

Original article published by Pedro Couto on Artificial Intelligence on Medium.

# 1 The Problem

The Santander Group is a global banking group, led by Banco Santander S.A., the largest bank in the euro area. It originated in Santander, Cantabria, Spain. Like every bank, it has a retention program that should be applied to unsatisfied customers.

To be able to use this program properly, we need to develop a machine learning model to classify if the customer is satisfied or not. Customers classified as unsatisfied should be the target of the retention program.

The retention program costs \$10 per customer, and an effective application (to a genuinely unsatisfied customer) returns a profit of \$100. In the classification task we can have the following scenarios:

1. False Positive (FP): classify the customer as UNSATISFIED but he is SATISFIED. Cost: \$10, Earn: \$0;
2. False Negative (FN): classify the customer as SATISFIED but he is UNSATISFIED. Cost: \$0, Earn: \$0;
3. True Positive (TP): classify the customer as UNSATISFIED and he is UNSATISFIED. Cost: \$10, Earn: \$100;
4. True Negative (TN): classify the customer as SATISFIED and he is SATISFIED. Cost: \$0, Earn: \$0.

In summary, we want to minimize the FP and FN rates and maximize the TP rate. To do so, we will use the AUC (area under the curve) of the ROC curve (receiver operating characteristic) as our metric, because it lets us compare models independently of any particular threshold and then pick the best threshold afterwards.
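To make the threshold choice concrete, here is a minimal sketch (not from the original article) of how the cost matrix above could be turned into a profit-maximizing threshold over the ROC curve. The function name and the toy scores are my own; only the \$10 cost and \$100 earn come from the problem statement.

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_profit_threshold(y_true, y_score, cost=10, earn=100):
    """Return the score threshold that maximizes expected campaign profit.

    Each TP earns $100 and costs $10; each FP only costs $10;
    FN and TN contribute nothing, matching the scenarios above."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # Profit at each candidate threshold along the ROC curve
    profit = tpr * n_pos * (earn - cost) - fpr * n_neg * cost
    best = np.argmax(profit)
    return thresholds[best], profit[best]

# Toy usage with made-up scores (the real model scores come later)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.2, 0.4, 0.9, 0.7])
thr, profit = best_profit_threshold(y_true, y_score)
```

This is why AUC is a reasonable model-selection metric here: once the model with the best ranking ability is chosen, the operating threshold can be tuned separately against the business costs.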

You can check the complete notebook with this solution on my Github.

Let’s go.

```python
# Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline

# Loading the Train and Test datasets
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")
```

The data can be found in this old Santander Competition on Kaggle.

# 2 Basic Exploratory Analysis

For this step, let us address the following points:

• Are the data in the columns numeric or do they need to be encoded?
• Can the test dataset really be used or is it useful only for a Kaggle competition?
• Are there any missing data?
• What is the proportion of dissatisfied customers (1) in the dataset df_train?
• Does it make sense to apply a feature selection method on the data?

First, let’s get an overview of the datasets and their features.

```python
# Checking the first 5 rows of df_train
df_train.head()
```

```python
# Checking the first 5 rows of df_test
df_test.head()
```

```python
# Checking the general info of df_train
df_train.info()
```

```python
# Checking the general info of df_test
df_test.info()
```

Looking at the outputs of the cells above, we can say that:

1. All columns are already in a numeric format. This means we don’t need to do any encoding to convert any type of variable into a numeric one;
2. Since this is an anonymized dataset, we have no clue whether any of the numeric columns actually represent categorical variables, so there is no encoding to be done on that front either;
3. Lastly, df_train has 371 columns and df_test has 370. This happens because these are competition datasets: df_test has no TARGET column.

Another crucial point is to check whether there are any missing values in these datasets. Let’s check it out.

```python
# Checking if there is any missing value in both train and test datasets
df_train.isnull().sum().sum(), df_test.isnull().sum().sum()
```

Now we can conclude that there is no missing data in either dataset.

Finally, let’s investigate the proportion of unsatisfied customers (our target) in the df_train dataset.

```python
# Investigating the proportion of unsatisfied customers on df_train
rate_insatisfied = df_train.TARGET.value_counts() / df_train.TARGET.count()
rate_insatisfied * 100
```

We have an extremely unbalanced dataset, approximately 4.12% positive. This must be taken into account in two situations:

1. When splitting the data into train and test sets;
2. When choosing hyperparameters such as `class_weight` in Random Forest.
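As a concrete illustration of point 2: XGBoost, the model used later, exposes `scale_pos_weight` instead of `class_weight`, and a common heuristic sets it to the negative/positive ratio. A minimal sketch with toy labels (the 96/4 split below is made up to mimic the ~4% imbalance):

```python
import numpy as np

# Toy labels with ~4% positives, mimicking the TARGET imbalance
y = np.array([0] * 96 + [1] * 4)

# Common heuristic for XGBoost: ratio of negatives to positives
n_neg = int(np.sum(y == 0))
n_pos = int(np.sum(y == 1))
scale_pos_weight = n_neg / n_pos

# Hypothetical usage:
# clf = xgb.XGBClassifier(scale_pos_weight=scale_pos_weight, seed=42)
```

This reweights the loss so the few positive examples are not drowned out by the majority class.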

# 3 Dataset Split (train — test)

The train_test_split method performs the split randomly, so even with an extremely unbalanced dataset, both the training and test sets should end up with roughly the same proportion of unsatisfied customers.
However, since that randomness is hard to guarantee in practice, we can make a stratified split based on the TARGET variable, ensuring that the proportion is exact in both datasets.

```python
from sklearn.model_selection import train_test_split

# Splitting the dataset in a proportion of 80% for train and 20% for test
X_train, X_test, y_train, y_test = train_test_split(
    df_train.drop('TARGET', axis=1), df_train.TARGET,
    train_size=0.8, stratify=df_train.TARGET, random_state=42)

# Checking the split
X_train.shape, y_train.shape, X_test.shape, y_test.shape
```

We successfully split the data into train (X_train, y_train) and test (X_test, y_test) sets.

# 4 Feature Selection

Since the datasets are relatively large (around 76k rows and 370 columns) and we don’t know what each feature represents or how it might impact the model, feature selection is warranted for three reasons:

1. To know which features bring most relevant prediction power to the model;
2. Avoid using features that could degrade the model performance;
3. Minimize the computational cost by using the minimal amount of features that provide the best model performance.

For this reason, we will try to answer the following questions:

• Are there any constant and/or semi-constants features that can be removed?
• Are there duplicate features?
• Does it make sense to perform some more filtering to reach a smaller group of features?

## 4.1 Removing low variance features

If the variance is low or close to zero, then a feature is approximately constant and will not improve the performance of the model. In that case, it should be removed.

```python
# Investigating if there are constant or semi-constant features in X_train
from sklearn.feature_selection import VarianceThreshold

# Removing all features that have variance under 0.01
selector = VarianceThreshold(threshold=0.01)
selector.fit(X_train)
mask_clean = selector.get_support()
X_train = X_train[X_train.columns[mask_clean]]
```

Now let’s check how many columns were removed.

```python
# Total of remaining features
X_train.shape
```

With this filtering, 104 features were removed, leaving 266. The dataset has become leaner without losing predictive power: since these features add no information to the machine learning model, removing them does not hurt its ability to classify an instance.

## 4.2 Removing repeated features

For obvious reasons, removing duplicated features improves the performance of our model by eliminating redundant information, and it also gives us a lighter dataset to work with.

```python
# Checking if there is any duplicated column
remove = []
cols = X_train.columns
for i in range(len(cols) - 1):
    column = X_train[cols[i]].values
    for j in range(i + 1, len(cols)):
        if np.array_equal(column, X_train[cols[j]].values):
            remove.append(cols[j])

# If yes, then they will be dropped here
X_train.drop(remove, axis=1, inplace=True)
```

Now let’s check the result.

```python
# Checking if any column was dropped
X_train.shape
```

There were 266 columns before checking for duplicate features and now there are 251. So there were 15 repeated features.

## 4.3 Using SelectKBest to select features

There are two scoring functions for evaluating features with SelectKBest: f_classif (fc) and mutual_info_classif (mic). The first works best when the features and the target have a more linear relationship; the second is more appropriate when the relationships are non-linear.

Since the dataset is anonymized and the number of features is too large for a careful study of the feature–target relationships, both methods will be tested, and the one that produces a stable region with the highest AUC will be chosen.

For this, different K values will be tested with the SelectKBest class; each selected subset will be used to train an XGBClassifier model evaluated with the AUC metric. From the collected values, one graph for fc and another for mic will be created.

Thus, through visual analysis, it is possible to choose the best K value as well as the best method for scoring features.

```python
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.metrics import roc_auc_score as auc
from sklearn.model_selection import cross_val_score
import xgboost as xgb

# Automated routine to test different K values with each of these methods
K_vs_score_fc = []   # List to store the AUC of each K with f_classif
K_vs_score_mic = []  # List to store the AUC of each K with mutual_info_classif

total_start = time.time()
for k in range(2, 247, 2):
    start = time.time()

    # Instantiating a SelectKBest object for each of the scoring functions
    # in order to obtain the K features with the highest score
    selector_fc = SelectKBest(score_func=f_classif, k=k)
    selector_mic = SelectKBest(score_func=mutual_info_classif, k=k)

    # Selecting the K features and transforming the dataset
    X_train_selected_fc = selector_fc.fit_transform(X_train, y_train)
    X_train_selected_mic = selector_mic.fit_transform(X_train, y_train)

    # Instantiating an XGBClassifier object
    clf = xgb.XGBClassifier(seed=42)

    # Using 10-fold CV to calculate the AUC for each K value, avoiding overfitting
    auc_fc = cross_val_score(clf, X_train_selected_fc, y_train,
                             cv=10, scoring='roc_auc')
    auc_mic = cross_val_score(clf, X_train_selected_mic, y_train,
                              cv=10, scoring='roc_auc')

    # Storing the average values obtained in the CV for further analysis
    K_vs_score_fc.append(auc_fc.mean())
    K_vs_score_mic.append(auc_mic.mean())

    end = time.time()
    # Reporting the metrics for the tested K and the time spent on this iteration
    print("k = {} - auc_fc = {} - auc_mic = {} - Time = {}s".format(
        k, auc_fc.mean(), auc_mic.mean(), end - start))

print(time.time() - total_start)  # Computing the total time spent
```

The code above returns two lists of 123 K-scores each. By plotting them against K, we can choose the best K value as well as the best scoring method.
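The plotting step itself is not shown above; a minimal sketch of how the two curves could be drawn is below. The placeholder score values stand in for the `K_vs_score_fc` and `K_vs_score_mic` lists produced by the loop.

```python
import matplotlib.pyplot as plt

# K values tested by the loop: 2, 4, ..., 246 (123 values in total)
k_values = list(range(2, 247, 2))

# Placeholder scores; in practice, use the lists filled by the CV loop
K_vs_score_fc = [0.80 + 0.0001 * k for k in k_values]
K_vs_score_mic = [0.79 + 0.0001 * k for k in k_values]

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(k_values, K_vs_score_fc, label="f_classif")
ax.plot(k_values, K_vs_score_mic, label="mutual_info_classif")
ax.set_xlabel("K (number of selected features)")
ax.set_ylabel("Mean CV AUC")
ax.set_title("SelectKBest: AUC vs K")
ax.legend()
plt.show()
```

A flat plateau in one of these curves marks the stable region mentioned earlier: the smallest K at the start of the plateau is usually the best trade-off between performance and cost.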