Santander Case — Part B: Net Promoter Score (NPS)

Original article was published by Pedro Couto on Artificial Intelligence on Medium


Here you will find: a system to score the satisfaction level of your customers.

The Problem

NPS is a management tool used as a measure of customer satisfaction and has been shown to correlate with revenue growth relative to competitors. NPS has been widely adopted by Fortune 500 companies and other organizations.

The metric was developed by (and is a registered trademark of) Fred Reichheld, Bain & Company and Satmetrix. It was introduced by Reichheld in his 2003 Harvard Business Review article, “The One Number You Need to Grow”. Its popularity and broad use have been attributed to its simplicity and its openly available methodology.

In this task, we need to assign each customer in the test set a score from 1 to 5 that represents their level of satisfaction, derived from the ‘TARGET’ feature. The following points guide the scoring system:

  • 1 represents the most dissatisfied and 5 the most satisfied;
  • The retention program should only be applied to customers with a satisfaction score of 1.

You can check the complete notebook with this solution on my GitHub.

This case was made as part of the prize for winning the Santander Data Masters Competition. I explain more about the competition itself, the hard skills I learned, and the soft skills I used on my way to winning it in this article.

1 Loading Data and Packages

# Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

%matplotlib inline

# Loading the Train and Test datasets
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")

The data can be found in this old Santander competition.

2 The Classification Model

Knowing the satisfaction score allows us to take the actions that maximize profit, as well as to understand the behaviour and satisfaction of each customer.

2.1 Method

The classification model that we built in Part A can also output the probability of a customer being unsatisfied.

By using this type of output, we can then create 5 intervals, one for each level of satisfaction. The customer will receive a satisfaction label according to the interval in which the outputted probability fits.
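
As a minimal sketch of this idea, the snippet below maps probabilities to scores from 1 to 5 with np.digitize. The four cut points are hypothetical, equal-width placeholders used only for illustration (they are not the thresholds chosen in this article), and the helper name probability_to_score is likewise an assumption.

```python
import numpy as np

# Hypothetical cut points (equal-width); NOT the thresholds chosen in the article.
CUT_POINTS = np.array([0.2, 0.4, 0.6, 0.8])

def probability_to_score(proba, cut_points=CUT_POINTS):
    """Map P(unsatisfied) to a score: 5 = most satisfied, 1 = most dissatisfied."""
    proba = np.asarray(proba, dtype=float)
    # np.digitize returns 0..4: the index of the interval each probability falls in
    interval = np.digitize(proba, cut_points)
    # A low probability of being unsatisfied means a high satisfaction score
    return 5 - interval

example_proba = np.array([0.05, 0.30, 0.50, 0.70, 0.95])
example_scores = probability_to_score(example_proba)
print(example_scores)
```

With real data, the same function could be applied to the array returned by predict_proba(...)[:, 1] once suitable cut points are known.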

So let’s first rebuild the model of Part A.

2.2 Dataset Split (train — test)

As mentioned in Part A, section 3, the train_test_split method splits the data randomly. Even with an extremely unbalanced dataset, a random split is expected to give training and test sets a similar proportion of unsatisfied customers, but it does not guarantee it. To be safe, we can make a stratified split based on the TARGET variable, which keeps the proportion practically identical in both datasets.

from sklearn.model_selection import train_test_split

# Splitting the dataset in a proportion of 80% for train and 20% for test.
X_train, X_test, y_train, y_test = train_test_split(
    df_train.drop('TARGET', axis = 1), df_train.TARGET,
    train_size = 0.8, stratify = df_train.TARGET, random_state = 42)

# Checking the split
X_train.shape, y_train.shape[0], X_test.shape, y_test.shape[0]
Output of the code above.
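
To see what stratification buys us, here is a small sketch on synthetic data (not the Santander dataset). The array names X_demo and y_demo and the 4% positive rate are assumptions made for the example; it only compares the share of positive labels in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily unbalanced labels: 40 positives out of 1000 (4%)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:40] = 1

# Same settings as in the article: 80/20 split, stratified on the label
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=42)

# Share of positive labels in each split
rate_train = y_tr.mean()
rate_test = y_te.mean()
print(f"train: {rate_train:.4f}  test: {rate_test:.4f}")
```

On the real data, the equivalent check would be comparing y_train.mean() with y_test.mean().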

2.3 Rebuilding the selected dataset

Here we need to:

  • Remove constant / semi-constant features;
  • Remove duplicate features;
  • Select only the best 96 features we found in Part-A.

Removing constant and semi-constant features:

# Investigating if there are constant or semi-constant features in X_train
from sklearn.feature_selection import VarianceThreshold
# Removing all features that have variance under 0.01
selector = VarianceThreshold(threshold = 0.01)
selector.fit(X_train)
mask_clean = selector.get_support()
X_train = X_train[X_train.columns[mask_clean]]
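
For intuition about what this filter drops, here is a sketch on a toy DataFrame (the column names are made up for the example). A binary column that is 1 in only 1 of 200 rows has a variance of about 0.005, so it falls under the 0.01 threshold together with the truly constant column.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

n_rows = 200
toy = pd.DataFrame({
    "constant": [3] * n_rows,                    # variance 0
    "semi_constant": [1] + [0] * (n_rows - 1),   # variance ~0.005
    "varying": [0, 1] * (n_rows // 2),           # variance 0.25
})

vt = VarianceThreshold(threshold=0.01)
vt.fit(toy)

# get_support() is a boolean mask over the columns that pass the threshold
kept = toy.columns[vt.get_support()].tolist()
print(kept)
```

Keep in mind that variance depends on the scale of each feature, so a fixed threshold such as 0.01 is stricter for features with a small numeric range.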

Removing duplicate features:

# Checking if there is any duplicated column
remove = []
cols = X_train.columns
for i in range(len(cols)-1):
    column = X_train[cols[i]].values
    for j in range(i+1,len(cols)):
        if np.array_equal(column, X_train[cols[j]].values):
            remove.append(cols[j])
# If yes, then they will be dropped here
X_train = X_train.drop(remove, axis = 1)
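
As an aside, the nested loop above can also be expressed with pandas alone. The sketch below uses a toy DataFrame with made-up column names and is offered as a possible alternative, not as what the original notebook does; transposing a very wide dataset can be memory-hungry.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],   # same values as "a"
    "b_copy": [4, 5, 6],   # same values as "b"
})

# Transposing turns columns into rows, so duplicated() flags repeated columns;
# the first occurrence is kept, as in the loop above.
is_duplicate = toy.T.duplicated()
deduplicated = toy.loc[:, ~is_duplicate.values]
print(deduplicated.columns.tolist())
```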

Selecting the 96 best features:

# Selecting the 96 best features according to f_classif
from sklearn.feature_selection import SelectKBest, f_classif
selector_fc = SelectKBest(score_func = f_classif, k = 96)
selector_fc.fit(X_train, y_train)
mask_selected = selector_fc.get_support()
# Saving the selected columns in a list
selected_col = X_train.columns[mask_selected]
# Creating datasets that include only the 96 selected features
X_train_selected = X_train[selected_col]
X_test_selected = X_test[selected_col]

Now that we have successfully rebuilt the selected datasets, we can move forward to the next steps.

2.4 Retraining the model

Because we want the model developed in Part A to output the probability of each customer being unsatisfied, we need to retrain it as we did in Part A. The good news is that we already have the optimal hyperparameters, so we can use them directly in training. Let’s recap the best hyperparameters:

  • learning_rate: 0.007961566078062952;
  • n_estimators: 1397;
  • max_depth: 4;
  • min_child_weight: 5.711008778424264;
  • gamma: 0.2816441089227697;
  • subsample: 0.692708251269958;
  • colsample_bytree: 0.5079831261101071.

So let’s train the model.

# Generating the model with the optimized hyperparameters
import xgboost as xgb
clf_optimized = xgb.XGBClassifier(learning_rate = 0.007961566078062952, n_estimators = 1397,
                                  max_depth = 4, min_child_weight = 5.711008778424264,
                                  gamma = 0.2816441089227697, subsample = 0.692708251269958,
                                  colsample_bytree = 0.507983126110107, seed = 42)

# Fitting the model to the X_train_selected dataset
clf_optimized.fit(X_train_selected, y_train)

Now that we have a trained model, we can check if its performance is the same as in Part A, using the test split (X_test_selected).

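# Evaluating the performance of the model on the test data (which has not been used so far).
# Assumption: `auc` in the original notebook refers to ROC AUC, computed here with roc_auc_score.
from sklearn.metrics import roc_auc_score
y_predicted = clf_optimized.predict_proba(X_test_selected)[:,1]
roc_auc_score(y_test, y_predicted)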
AUC of the trained model on test split data (X_test_selected)

As in Part A, the model scored an AUC of 0.8477! This means we have the model we want and can continue to the next steps. But first, let’s take a look at what the model’s output looks like in probability format.

# checking the output in probability format
clf_optimized.predict_proba(X_test_selected)[:,1]
The output of the code above in probability format.

As we can see, the output is an array of probabilities between 0 and 1: values close to 0 indicate a satisfied customer and values close to 1 indicate an unsatisfied customer.
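
The [:,1] indexing may look cryptic, so here is a small sketch of what predict_proba returns. It uses a scikit-learn LogisticRegression on toy data as a stand-in for the XGBoost model (an assumption made to keep the example light); classifiers that follow the scikit-learn API return one column per class, ordered as in classes_.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: a single feature; larger values belong to class 1
X_toy = np.array([[0.0], [0.5], [1.0], [2.0], [2.5], [3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LogisticRegression().fit(X_toy, y_toy)
proba = toy_clf.predict_proba(X_toy)

print(toy_clf.classes_)   # column order of predict_proba
print(proba.shape)        # one row per sample, one column per class

# Column 1 holds the probability of the class labelled 1
p_class1 = proba[:, 1]
```

In this article, class 1 is the unsatisfied customer, which is why the second column is the one we keep.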

Now that we have a model and its output in the format we need to create the NPS system, let’s move forward.

3 Strategy & Method

3.1 Threshold selection

Now that we have a probability output that lies within the range 0 to 1, we can split this range into 5 intervals. Each interval will correspond to a satisfaction score, so knowing the probability output for a specific customer, we can give them a satisfaction label. The question is how we should split this range in a way that gives us the best NPS system. To answer this question, let’s plot the distribution of probabilities for the test split data (X_test_selected).

# Plotting the distribution of probabilities for X_test_selected
fig, ax = plt.subplots(figsize = (18, 8))
ax.hist(clf_optimized.predict_proba(X_test_selected)[:,1], bins = 20)
ax.set_xlim(0, 1)
# Ticks every 0.1, including the right edge at 1.0
ax.set_xticks(np.arange(0, 1.1, 0.1))
ax.set_title('Probability distribution for unsatisfied classification', fontsize = 18)
ax.set_ylabel('Frequency', fontsize = 16)
ax.set_xlabel('Probability', fontsize = 16)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)