Original article can be found here (source): Artificial Intelligence on Medium
First, the data is provided by DSND. The full dataset is about 12GB, but as a proof of concept we worked with a tiny subset (128MB), using PySpark to build the machine learning pipelines and the ETLs.
To load the data, we first create a Spark session and then read the user activity data tracked by Sparkify. Each record contains the listening session, artist, song, duration, and user information, including some demographics and the pages visited. Below is an initial look at the data schema and a sample of the expected values.
Dealing with null values is a basic operation in any dataset cleaning phase. Here I explored the null values and found them in many columns, but I grouped them into two types:
I. Null values in ID columns like “userId” and “sessionId”.
II. Null values in non-ID columns like “artist” and “length”.
The most common decision for ID columns is to drop the affected rows, because you cannot tell which user the data refers to. For the other columns, we can decide later based on how each feature is used.
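As a rough illustration of that rule, here is a plain-Python sketch (not the PySpark code used for the project) that counts missing values per column and drops rows with a missing ID. The sample rows and the helper names `is_missing`, `count_missing`, and `drop_rows_missing_ids` are made up for the example, and treating an empty string as missing is an assumption.

```python
from typing import Any, Dict, Iterable, List

ID_COLUMNS = ("userId", "sessionId")


def is_missing(value: Any) -> bool:
    """Treat None and blank strings as missing (the blank-string rule is an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def count_missing(rows: List[Dict[str, Any]], columns: Iterable[str]) -> Dict[str, int]:
    """Count missing values per column, to see which columns are affected."""
    return {col: sum(is_missing(row.get(col)) for row in rows) for col in columns}


def drop_rows_missing_ids(rows: List[Dict[str, Any]],
                          id_columns: Iterable[str] = ID_COLUMNS) -> List[Dict[str, Any]]:
    """Keep only rows where every ID column has a value."""
    id_columns = tuple(id_columns)
    return [row for row in rows
            if not any(is_missing(row.get(col)) for col in id_columns)]


# Hypothetical sample rows, only for illustration.
rows = [
    {"userId": "30", "sessionId": 29, "artist": "Artist A", "length": 277.9},
    {"userId": "", "sessionId": 8, "artist": None, "length": None},
    {"userId": "9", "sessionId": 8, "artist": None, "length": None},
]

clean = drop_rows_missing_ids(rows)
print(count_missing(rows, ["userId", "sessionId", "artist", "length"]))
print(len(clean))
```

Rows with a missing “artist” or “length” are kept on purpose here, since the decision for non-ID columns is deferred.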
Another common cleaning step is formatting date-time columns, which are usually not stored in a human-readable format. We need to either reformat them or extract multiple features from them, depending on the analysis we plan to do. In our case we formatted the columns and split them into “event_time”, “registration_time” and “event_hour”, which we will use later in the feature engineering step.
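A minimal plain-Python sketch of this step is shown below. It assumes the raw timestamps are Unix epoch milliseconds stored in fields named `ts` and `registration`, and that times are read as UTC. Those field names, the time zone, and the helper names are assumptions for illustration; only the three output names come from the text above.

```python
from datetime import datetime, timezone
from typing import Any, Dict

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def ms_to_datetime(epoch_ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


def add_time_features(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with readable time columns added."""
    event_dt = ms_to_datetime(row["ts"])                # assumed raw event timestamp
    reg_dt = ms_to_datetime(row["registration"])        # assumed raw registration timestamp
    out = dict(row)
    out["event_time"] = event_dt.strftime(TIME_FORMAT)
    out["registration_time"] = reg_dt.strftime(TIME_FORMAT)
    out["event_hour"] = event_dt.hour                   # hour of day, 0 to 23
    return out


# Hypothetical record, only for illustration.
row = {"userId": "30", "ts": 1538406000000, "registration": 1538265600000}
formatted = add_time_features(row)
print(formatted["event_time"], formatted["registration_time"], formatted["event_hour"])
```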
When can a customer be defined as churned?
Once a user reaches the cancellation confirmation page, which appears in the user activity logs, they can be defined as churned from our service and will no longer show up in the log.
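This definition can be sketched in plain Python as follows. The exact page value `"Cancellation Confirmation"`, the `page` field name, and the helper names are assumptions for illustration. The idea is to collect every user who ever hit that page and then flag all of that user's events, not just the cancellation event itself.

```python
from typing import Any, Dict, List, Set

CHURN_PAGE = "Cancellation Confirmation"  # assumed page value in the logs


def churned_user_ids(events: List[Dict[str, Any]],
                     churn_page: str = CHURN_PAGE) -> Set[str]:
    """Users who visited the cancellation confirmation page at least once."""
    return {e["userId"] for e in events if e.get("page") == churn_page}


def label_events(events: List[Dict[str, Any]],
                 churn_page: str = CHURN_PAGE) -> List[Dict[str, Any]]:
    """Add churn=1 to every event of a churned user, churn=0 otherwise."""
    churned = churned_user_ids(events, churn_page)
    return [dict(e, churn=int(e["userId"] in churned)) for e in events]


# Hypothetical log events, only for illustration.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Thumbs Up"},
]

labeled = label_events(events)
print(sorted(churned_user_ids(events)))
print([e["churn"] for e in labeled])
```

Labeling at the user level means that a churned user's earlier listening activity can be compared with the activity of users who stayed.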
Using this definition, we find that 52 out of 225 users churned. We need to label them as churned users and look more closely at their activity since they registered on the service, from the aspects explored below:
1. Churn pattern between genders:
Is one gender more likely to churn than the other?
Male customers are slightly more likely to churn than female customers.
2. User plan type and churning:
Churn usually happens when a customer is on a free plan and may not feel committed to continuing with the service.
3. Listening activity since registration:
A. Number of songs listened to
B. Thumbs-up count
Over a customer's lifetime, the counts did not vary much, whether for songs liked or songs just listened to.
4. Songs activity per session
Loyal users spend more sessions in the service than users who later churn.
5. Do churned users listen to more songs than unchurned users?
Unchurned users listen to more songs than churned users.
6. Are some periods of the day more active than others?
Daytime hours have higher activity than night hours.
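Comparisons like the ones above can be sketched in plain Python as below. The sketch assumes each event already carries a `churn` flag (0 or 1), a `gender` field, and an `event_hour` field. The sample events and the helper names `churn_rate_by` and `events_per_hour` are made up for illustration and are not the project's PySpark code. Churn rates are computed over distinct users rather than over events, so that very active users do not dominate the result.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List


def churn_rate_by(events: List[Dict[str, Any]], key: str) -> Dict[Any, float]:
    """Share of distinct users in each group (e.g. gender) who churned."""
    users_by_group = defaultdict(set)
    churned_by_group = defaultdict(set)
    for e in events:
        users_by_group[e[key]].add(e["userId"])
        if e["churn"] == 1:
            churned_by_group[e[key]].add(e["userId"])
    return {group: len(churned_by_group[group]) / len(users)
            for group, users in users_by_group.items()}


def events_per_hour(events: List[Dict[str, Any]]) -> Counter:
    """Number of logged events for each hour of the day."""
    return Counter(e["event_hour"] for e in events)


# Hypothetical labeled events, only for illustration.
events = [
    {"userId": "1", "gender": "M", "churn": 1, "event_hour": 14},
    {"userId": "1", "gender": "M", "churn": 1, "event_hour": 15},
    {"userId": "2", "gender": "M", "churn": 0, "event_hour": 14},
    {"userId": "3", "gender": "F", "churn": 0, "event_hour": 2},
]

rates = churn_rate_by(events, "gender")
hourly = events_per_hour(events)
print(rates)
print(sorted(hourly.items()))
```

The same `churn_rate_by` helper could be pointed at a plan-type field, with the caveat that a user who switched plans would be counted in both groups.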