Predicting Churn on Insurance Data

Original article was published by Frederik Bussler on Artificial Intelligence on Medium

Acquiring a new customer is expensive: anywhere from 5 to 25 times more so than retaining an existing one, according to HBR¹.

Growing without minimizing churn is like pouring water into a leaky bucket. Blockbuster fell from an $8 billion media giant to bankruptcy because it failed to address churn. Doing so would have revealed why customers were dropping like flies: they wanted convenient features like DVD by mail, and later streaming. Without these, customers switched to competitors like Netflix, which now has a market cap of over $200 billion.

When employees quit, the churn is called “attrition,” but in this article I’ll focus on service churn, which happens when customers stop paying for a service. Previously, I explored how to predict churn based on a Telco dataset. Here, I’ll analyze insurance churn data.


Our data will come from the MachineHack insurance churn challenge². MachineHack is an online platform for machine learning competitions and a popular alternative to Kaggle.

To preserve customer privacy, the data is completely anonymized, so the attributes are simply named feature_0 through feature_15. Our KPI is a simple boolean column called labels, which is either 1 (the customer churned) or 0 (they didn’t).

The data is in the format needed for predictive analytics, with a KPI and attributes that describe that KPI.
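
To make that layout concrete, here is a minimal sketch in Python that checks a CSV for the sixteen feature columns and the labels column, then computes the share of churned customers. It uses only the standard library and a tiny synthetic sample, because the real file is not reproduced here. The helper names (make_synthetic_csv, load_rows, churn_rate), the made-up values, and the assumption that every feature parses as a number are mine for illustration, not part of the MachineHack challenge.

```python
import csv
import io

# Column names as described in the article: feature_0 .. feature_15 plus "labels".
FEATURE_COLUMNS = [f"feature_{i}" for i in range(16)]
LABEL_COLUMN = "labels"  # 1 = customer churned, 0 = customer stayed


def make_synthetic_csv(labels):
    """Build a small CSV string with the described schema (values are made up)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(FEATURE_COLUMNS + [LABEL_COLUMN])
    for row_index, label in enumerate(labels):
        # Placeholder feature values, not real data.
        features = [round(0.1 * (row_index + col), 2) for col in range(16)]
        writer.writerow(features + [label])
    return buffer.getvalue()


def load_rows(csv_text):
    """Parse CSV text into (features, label) pairs, checking the expected columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = set(FEATURE_COLUMNS) | {LABEL_COLUMN}
    missing = expected - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for record in reader:
        label = int(record[LABEL_COLUMN])
        if label not in (0, 1):
            raise ValueError(f"unexpected label value: {label}")
        # Assumption: every feature is numeric; adjust if a column is categorical.
        features = [float(record[name]) for name in FEATURE_COLUMNS]
        rows.append((features, label))
    return rows


def churn_rate(rows):
    """Return the fraction of rows whose label is 1 (churned)."""
    if not rows:
        raise ValueError("no rows to summarize")
    return sum(label for _, label in rows) / len(rows)


sample_rows = load_rows(make_synthetic_csv([0, 0, 1, 0, 1, 0, 0, 0]))
print(f"{len(sample_rows)} rows, churn rate = {churn_rate(sample_rows):.2%}")
# prints "8 rows, churn rate = 25.00%"
```

With the real competition file, the same check would start from the file's contents instead of the synthetic string. Looking at the churn rate first is a cheap way to see how imbalanced the KPI is before any modeling.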