How to Build a Dataset to Predict Customer Churn

Original article was published by Frederik Bussler on Artificial Intelligence on Medium


AI is wildly hyped in 2020, and seemingly every startup claims to use it. However, relevant, clean data is a basic prerequisite for AI, and many organizations don’t have it yet.

Churn analysis is a powerful AI use case, but you can’t build an accurate churn model without sufficient, high-quality data to plug in.

Churn is when a customer quits a service. The goal of churn analysis is to reduce churn and increase customer retention, which can be done through product upgrades, one-on-one customer interactions, better pricing, more targeted user acquisition, and so on.

Here’s how to get the data you need to build an accurate churn model.

Building the Dataset

We want to predict churn, so we need historical data in which one column records whether the customer churned. This is a binary classification problem, so the labels in the churn column should look like “Yes” or “No” (or “1” or “0”, or any other pair of class labels).
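Historical exports often spell the same label several ways, so it can help to normalize the churn column before modeling. The snippet below is a minimal sketch of that step, not a prescribed method: the accepted spellings and the `encode_churn` helper are assumptions made for illustration.

```python
# Assumed spellings for each class; adjust these sets to match your own data.
POSITIVE = {"yes", "y", "1", "true"}
NEGATIVE = {"no", "n", "0", "false"}


def encode_churn(label):
    """Map a raw churn label to 1 (churned) or 0 (retained)."""
    key = str(label).strip().lower()
    if key in POSITIVE:
        return 1
    if key in NEGATIVE:
        return 0
    # Fail loudly instead of silently guessing a class.
    raise ValueError(f"Unrecognized churn label: {label!r}")


# Hypothetical raw values, mixing strings, integers, and stray whitespace.
raw_labels = ["Yes", "No", "no", 1, "0", " YES "]
encoded = [encode_churn(value) for value in raw_labels]
print(encoded)  # [1, 0, 0, 1, 0, 1]
```

Raising an error on unknown values is a deliberate choice here: a mislabeled row is easier to fix when it is caught before training.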

If you have a monthly subscription service, each row could represent one client in one month, and the other columns (besides churn) are attributes of that client, such as their tenure, selected add-ons, contract type, and so on.
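One way to produce that client-per-month layout is to expand each subscription record into one row per active month. The sketch below assumes a hypothetical `subscriptions` list with `start` and `end` dates, and it assumes churn is marked as 1 in the month a client cancels; your own source tables and churn definition may differ.

```python
from datetime import date

# Hypothetical subscription records; "end" is None for clients who are still active.
subscriptions = [
    {"client_id": "A", "start": date(2020, 1, 1), "end": date(2020, 3, 1), "contract": "monthly"},
    {"client_id": "B", "start": date(2020, 2, 1), "end": None, "contract": "annual"},
]


def next_month(d):
    """Return the first day of the month after d."""
    year = d.year + 1 if d.month == 12 else d.year
    month = 1 if d.month == 12 else d.month + 1
    return date(year, month, 1)


def client_month_rows(subs, last_observed):
    """Expand each subscription into one row per client per active month."""
    rows = []
    for sub in subs:
        final_month = sub["end"] or last_observed
        month, tenure = sub["start"], 1
        while month <= final_month:
            rows.append({
                "client_id": sub["client_id"],
                "month": month.strftime("%Y-%m"),
                "tenure_months": tenure,
                "contract": sub["contract"],
                # Assumption: the label is 1 only in the month the client cancels.
                "churn": int(sub["end"] is not None and month == sub["end"]),
            })
            month, tenure = next_month(month), tenure + 1
    return rows


rows = client_month_rows(subscriptions, last_observed=date(2020, 3, 1))
for row in rows:
    print(row)
```

With these two records, client A contributes three rows and client B contributes two, and only A’s final month carries a churn label of 1.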

Here’s a simple example from a fictional telecom company.