This is what a Data Scientist can do with your data

Original article was published on Artificial Intelligence on Medium

Many conclusions emerge from the graphs above. We have already seen that customers with 4 products are not retained, but now we see that for Switzerland and Austria, customers with 3 products are not retained either. We also see an odd peak around a balance of 0 euros for these countries, which does not appear for Germany. Knowing that these plots are KDEs, a likely explanation is that Switzerland and Austria have customers with a balance of exactly 0 euros, while Germany does not. So let’s make a plot better suited to this case, using a histogram, which uses bins instead of kernels, so we can check our theory.
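The check above can be sketched as follows. This is a minimal illustration on synthetic data, not the article's actual code: the column names `country` and `balance`, and the country mix, are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the bank dataset: 'country' and 'balance'
# are assumed column names, and the zero-balance spike is built in for
# Switzerland and Austria but not Germany, mirroring the article's theory.
df = pd.DataFrame({
    "country": ["Switzerland"] * 500 + ["Austria"] * 300 + ["Germany"] * 300,
    "balance": np.concatenate([
        np.where(rng.random(500) < 0.4, 0.0, rng.normal(80_000, 30_000, 500)),
        np.where(rng.random(300) < 0.4, 0.0, rng.normal(80_000, 30_000, 300)),
        rng.normal(80_000, 30_000, 300),  # Germany: no zero-balance customers
    ]),
})

# A histogram uses bins, so a mass of exactly-0 balances shows up as a
# spike in the first bin instead of being smoothed out as in a KDE.
fig, ax = plt.subplots()
for country, group in df.groupby("country"):
    ax.hist(group["balance"], bins=40, alpha=0.5, label=country)
ax.set_xlabel("balance (EUR)")
ax.set_ylabel("customers")
ax.legend()

# Check the theory directly: count customers with a balance of exactly 0.
zero_counts = df[df["balance"] == 0].groupby("country").size()
print(zero_counts)
```

Counting the exact zeros alongside the plot confirms whether the KDE peak really comes from a point mass at 0 rather than from many small balances.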

In fact, we see that there are customers with a balance of 0 euros in Switzerland and Austria, but not in Germany. Another thing we can see here is that there are more customers in Switzerland than in Austria or Germany.

We can add a bar plot to see this more clearly, and take advantage of it by adding the ‘retained’ variable to the plot.
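A bar plot like that could look as follows. This is a sketch on a tiny made-up frame; the column names `country` and `retained` and the counts are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy frame; 'country' and 'retained' are assumed column names.
df = pd.DataFrame({
    "country": ["Switzerland"] * 5 + ["Austria"] * 3 + ["Germany"] * 3,
    "retained": [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1],
})

# Grouped bar plot: customer counts per country, split by retention.
counts = pd.crosstab(df["country"], df["retained"])
counts.plot(kind="bar")
plt.ylabel("customers")

# Retention rate per country, to compare countries directly.
retain_rate = df.groupby("country")["retained"].mean()
print(retain_rate)
```

Printing the per-country mean of `retained` alongside the plot gives the same comparison numerically.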

The plot shows not only that there are more customers in Switzerland than in Austria and Germany, but also that customers from Germany have a lower retain factor than the rest.

gender VS retained

The last variable that we will analyze is gender, which we saw earlier is correlated with the retain factor. As it has only two categories, we will again use a bar plot.
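The gender plot follows the same pattern. Again a hedged sketch on invented data: the column names `gender` and `retained` and the category labels are assumptions, not the article's actual schema.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy frame; 'gender' and 'retained' are assumed column names.
df = pd.DataFrame({
    "gender": ["male"] * 6 + ["female"] * 5,
    "retained": [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
})

# Grouped bar plot: customer counts per gender, split by retention.
pd.crosstab(df["gender"], df["retained"]).plot(kind="bar")
plt.ylabel("customers")

# Retain factor per gender, for a direct numerical comparison.
retain_rate = df.groupby("gender")["retained"].mean()
print(retain_rate)
```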

We see that males have a higher retain factor. It also seems that there are more male than female customers, but since the bars are split, we cannot be 100% sure. So we will add one last plot, just for the gender distribution.
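For the distribution alone, a single-variable count plot removes the ambiguity. Same caveat as before: the `gender` column and its labels are assumptions for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical 'gender' column; the name and labels are assumptions.
gender = pd.Series(["male"] * 6 + ["female"] * 5, name="gender")

# Unsplit counts per gender make the size comparison unambiguous.
counts = gender.value_counts()
counts.plot(kind="bar")
plt.ylabel("customers")
print(counts)
```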

Question 3: Which are the top k customers at highest risk of leaving the bank?

For this question we trained different classification models: Logistic Regression, Random Forest, SVM, and Multilayer Perceptron classifiers. You can check the code in this Jupyter notebook if you are interested. We then chose the best classifier, which turned out to be the Random Forest model, and it gave us this result (with k = 5):
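The top-k step can be sketched like this. This is not the notebook's code: the features, the synthetic target, and all column names are invented for the example; only the general recipe (fit a Random Forest, rank customers by predicted churn probability, take the k largest) follows the article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical synthetic features standing in for the bank's customer data.
n = 1000
X = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "balance": rng.normal(60_000, 25_000, n),
    "num_products": rng.integers(1, 5, n),
})
# Fake target: churn is likely for customers with 4 products (mirroring the
# article's observation), plus some random noise.
y = (X["num_products"] >= 4).astype(int) | (rng.random(n) < 0.1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Rank customers by predicted probability of leaving and take the top k.
k = 5
churn_prob = model.predict_proba(X_test)[:, 1]  # column 1 = P(churn)
top_k = X_test.assign(churn_prob=churn_prob).nlargest(k, "churn_prob")
print(top_k)
```

Using `predict_proba` rather than hard class predictions is what makes the ranking possible: every customer gets a risk score, and the answer to the question is simply the k highest scores.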