How did I predict customer churn before it knocked on my door?

Original article can be found here (source): Artificial Intelligence on Medium

Data Exploration


The data is provided by DSND. The full dataset is about 12GB, but to prove the concept we worked with a tiny subset (128MB) of it, using PySpark to build the machine learning pipelines and the ETLs.

To load the data, we first create a Spark session and then read the user activity data tracked by Sparkify. Each record contains the listening session, artist, song, duration, user information including some demographics, and the visited page. Below is an initial look at the data schema and a sample of the expected values.

fig[1] dataset schema
fig[2] dataset values sample

Data Cleaning

Dealing with null values is a basic step in any data cleaning phase. Here I explored the null values and found them in many columns.

fig[3] Null values average in the dataset

However, they fall into two types:

I. Null values in ID columns, such as “userId” and “sessionId”.

II. Null values in non-ID columns, such as “artist” and “length”.

The most common decision for ID columns is to drop the affected rows, because we cannot tell which user the data refers to. For the other columns, we can decide later based on how each feature is used.
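The null-value check and the ID-column rule described above can be sketched as follows. This is a minimal illustration in plain Python rather than PySpark, so that it runs without a Spark installation. The sample rows are made up, and treating the empty string as missing is an assumption, not something stated in the article.

```python
# Hypothetical sample rows; the field names follow the columns named in the article.
rows = [
    {"userId": "10", "sessionId": 1, "artist": "Adele", "length": 210.5},
    {"userId": "10", "sessionId": 1, "artist": None, "length": None},
    {"userId": "", "sessionId": 2, "artist": None, "length": None},
    {"userId": None, "sessionId": 3, "artist": "Muse", "length": 180.0},
]

ID_COLUMNS = ("userId", "sessionId")


def is_missing(value):
    # Assumption: both None and the empty string count as missing.
    return value is None or value == ""


def null_fraction(rows):
    """Return the share of missing values for every column."""
    columns = rows[0].keys()
    return {
        col: sum(is_missing(r[col]) for r in rows) / len(rows)
        for col in columns
    }


def drop_missing_ids(rows):
    """Keep only rows where every ID column is present."""
    return [r for r in rows if not any(is_missing(r[c]) for c in ID_COLUMNS)]


fractions = null_fraction(rows)
clean_rows = drop_missing_ids(rows)
print(fractions)
print(len(clean_rows))
```

Rows with a missing “artist” or “length” are kept on purpose, since the decision for non-ID columns is deferred to the feature step.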

Another common cleaning step is formatting date-time columns, which are usually not stored in a human-readable format. We need to format them properly, or extract multiple features from them, depending on the analysis we follow. In our case we format the columns and split them into “event_time”, “registration_time” and “event_hour”, which we will use later in the feature engineering step.

fig[4] DateTime column after processing
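A minimal sketch of this date-time step in plain Python is shown below. It assumes the raw values are stored as milliseconds since the Unix epoch in fields named "ts" and "registration". Both the unit and the field names are assumptions for illustration; only the three output column names come from the article.

```python
from datetime import datetime, timezone


def from_epoch_ms(ms):
    # Assumption: the raw value is milliseconds since the Unix epoch, read as UTC.
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def add_time_features(row):
    """Return a copy of the row with the three derived time columns."""
    event = from_epoch_ms(row["ts"])
    registered = from_epoch_ms(row["registration"])
    out = dict(row)
    out["event_time"] = event.strftime("%Y-%m-%d %H:%M:%S")
    out["registration_time"] = registered.strftime("%Y-%m-%d %H:%M:%S")
    out["event_hour"] = event.hour
    return out


# Hypothetical record with made-up timestamps.
sample = {"userId": "30", "ts": 1538352117000, "registration": 1538265600000}
enriched = add_time_features(sample)
print(enriched["event_time"], enriched["registration_time"], enriched["event_hour"])
```

Keeping "event_hour" as an integer makes it easy to group activity by hour of day later on.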

Defining Churn

When can a customer be defined as churned?

Once a user reaches the cancellation confirmation page, which appears in the user activity logs, they can be defined as churned from our service and will no longer show up in the log.

fig[5] churn users count
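The churn definition above can be sketched as a per-user label. This is a plain-Python illustration with made-up events. The exact page value "Cancellation Confirmation" and the field name "page" are assumptions about how the log stores the visited page.

```python
# Assumed label of the cancellation confirmation page in the log.
CHURN_PAGE = "Cancellation Confirmation"

# Hypothetical events for three users.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "3", "page": "Thumbs Up"},
]


def label_churn(events):
    """Map each userId to 1 if the user ever reached the churn page, else 0."""
    labels = {}
    for e in events:
        hit = 1 if e["page"] == CHURN_PAGE else 0
        labels[e["userId"]] = max(labels.get(e["userId"], 0), hit)
    return labels


labels = label_churn(events)
print(labels)
print(sum(labels.values()), "churned out of", len(labels))
```

The label is attached to the user rather than to the single event, so all of a churned user's earlier activity can be compared against that of users who stayed.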

Looking at this factor, we find that 52 out of 225 users churned. We need to label them as churned users, dig into their activity since they registered on the service, and look at the aspects explored below:

1. Churn pattern between genders:

Is one gender more likely to churn than the other?

fig[6] Churn number in genders

Male customers are slightly more likely to churn than female customers.
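A comparison like this one comes down to a churn rate per group. The sketch below is a plain-Python illustration with made-up users, so its numbers are not the article's results. The helper name `churn_rate_by` and the "churn" field are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-user table: one row per user with a 0/1 churn label.
users = [
    {"userId": "1", "gender": "M", "churn": 1},
    {"userId": "2", "gender": "M", "churn": 0},
    {"userId": "3", "gender": "F", "churn": 0},
    {"userId": "4", "gender": "F", "churn": 0},
]


def churn_rate_by(users, key):
    """Return the share of churned users for each value of the given attribute."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for u in users:
        totals[u[key]] += 1
        churned[u[key]] += u["churn"]
    return {group: churned[group] / totals[group] for group in totals}


rates = churn_rate_by(users, "gender")
print(rates)
```

Comparing rates rather than raw counts matters when the groups differ in size. The same helper could be pointed at any other per-user attribute.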

2. User plan type and churning:

fig[6] churning per user plan

Churn usually happens when a customer is on the free plan, possibly because free users feel less committed to continuing with the service.

3. Listening activity since registration:

A. Number of listened songs

fig[7] streamed songs number for each user and gender type

B. Thumb up songs Count

fig[8] thumbs-up streamed songs number for each user and gender type

The number of songs over a customer's lifetime didn’t vary much, whether they were liked or just listened to.
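Both counts can be derived from the event log by counting one page type per user. The sketch below uses plain Python and made-up events. The page values "NextSong" and "Thumbs Up" are assumed labels for a song play and a like.

```python
from collections import Counter

SONG_PAGE = "NextSong"        # assumed label for a song play
THUMBS_UP_PAGE = "Thumbs Up"  # assumed label for a like

# Hypothetical events for two users.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Thumbs Up"},
    {"userId": "1", "page": "NextSong"},
    {"userId": "2", "page": "NextSong"},
]


def count_page_per_user(events, page):
    """Count how many times each user hit the given page."""
    return Counter(e["userId"] for e in events if e["page"] == page)


songs = count_page_per_user(events, SONG_PAGE)
thumbs_up = count_page_per_user(events, THUMBS_UP_PAGE)
print(dict(songs))
print(dict(thumbs_up))
```

A `Counter` returns 0 for users who never hit the page, which avoids missing keys when the two counts are joined later.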

4. Songs activity per session

fig[9] activity session count per user type

Loyal users spend more sessions in the service than users who later churn.
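One way to reproduce this comparison is to count distinct sessions per user and then average the counts within each churn group. The following is a plain-Python sketch with made-up events and labels. The helper names are hypothetical, and label 1 means churned.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical events; a session can contain several events.
events = [
    {"userId": "1", "sessionId": 10},
    {"userId": "1", "sessionId": 10},
    {"userId": "1", "sessionId": 11},
    {"userId": "2", "sessionId": 20},
    {"userId": "3", "sessionId": 30},
    {"userId": "3", "sessionId": 31},
    {"userId": "3", "sessionId": 32},
]
# Hypothetical churn labels per user (1 = churned, 0 = stayed).
labels = {"1": 0, "2": 1, "3": 0}


def sessions_per_user(events):
    """Count the distinct sessionId values seen for each user."""
    sessions = defaultdict(set)
    for e in events:
        sessions[e["userId"]].add(e["sessionId"])
    return {user: len(ids) for user, ids in sessions.items()}


def mean_by_churn(per_user, labels):
    """Average a per-user value separately for each churn label."""
    groups = defaultdict(list)
    for user, value in per_user.items():
        groups[labels[user]].append(value)
    return {flag: mean(values) for flag, values in groups.items()}


session_counts = sessions_per_user(events)
averages = mean_by_churn(session_counts, labels)
print(session_counts)
print(averages)
```

Using a set per user ensures that a session with many events is counted only once.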

5. Do churned users listen to more songs than unchurned users?

fig[10] listening songs count for churned users

Unchurned users listen to more songs than churned users.

6. Are some periods of the day more active than others?

fig[11] 24-hours songs activity

Daytime hours have higher activity than night hours.
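The hourly view can be built by counting song plays for each value of “event_hour”. This is a minimal plain-Python sketch with made-up events. The "NextSong" page value is an assumed label for a song play.

```python
from collections import Counter

SONG_PAGE = "NextSong"  # assumed label for a song play

# Hypothetical events that already carry the derived event_hour column.
events = [
    {"userId": "1", "page": "NextSong", "event_hour": 9},
    {"userId": "1", "page": "NextSong", "event_hour": 9},
    {"userId": "2", "page": "NextSong", "event_hour": 15},
    {"userId": "2", "page": "Thumbs Up", "event_hour": 15},
    {"userId": "3", "page": "NextSong", "event_hour": 23},
]


def songs_per_hour(events):
    """Return a 24-element list with the number of song plays per hour of day."""
    counts = Counter(e["event_hour"] for e in events if e["page"] == SONG_PAGE)
    return [counts.get(hour, 0) for hour in range(24)]


hourly = songs_per_hour(events)
print(hourly)
```

Returning a fixed list of 24 entries keeps hours without any activity visible as zeros, which makes the result straightforward to plot.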