Source: Deep Learning on Medium
AmExpert 2019 — Machine Learning Hackathon Approach
Analytics Vidhya conducted an ML hackathon using data provided by the American Express team. The contest ran from Sat, Sep 28, 2019 to Sun, Oct 6, 2019. I spent some time solving this problem.
We need to predict the coupon redemption probability for each customer and coupon ID combination given in the test dataset. The data available for this problem contains the following information, including details of a sample of campaigns and coupons used in previous campaigns:
- User Demographic Details
- Campaign and coupon Details
- Product details
- Previous transactions
Based on previous transaction and performance data from the last 18 campaigns, we have to predict, for each coupon and customer combination in the next 10 campaigns (the test set), the probability that the customer will redeem the coupon. The dataset description and the mapping between the files are shown below.
The train data has about 78k rows and the test data about 50k rows. I explored the data by looking at the descriptive statistics of each variable in the train set. redemption_status is the target variable, and its distribution is hugely imbalanced.
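A quick way to quantify that imbalance is to look at the positive rate of the target. A minimal sketch with a toy stand-in for the train set (the real competition file and its exact columns may differ; in the actual data the positive class is only around 1%):

```python
import pandas as pd

# Toy stand-in for the real train set; redemption_status is the binary target
# (1 = coupon redeemed). Column names are assumptions based on the description.
train = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4, 5, 6, 7],
    "coupon_id":   [10, 11, 10, 12, 13, 10, 14, 15],
    "redemption_status": [0, 0, 1, 0, 0, 0, 0, 0],
})

counts = train["redemption_status"].value_counts()
positive_rate = counts.get(1, 0) / len(train)
print(positive_rate)  # on the real data this is far smaller
```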
I tried to solve the problem by deriving metrics from the below 4 parts.
a) Purchase done before the start date of campaign
b) Purchase done before and after the campaign time frame
c) Purchase done for only coupon claimed i.e., coupon_discount < 0
d) Purchase done between campaign start and end dates
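The four time windows above amount to filtering the transaction log against each campaign's start and end dates. A minimal sketch, assuming toy frames standing in for the transaction and campaign tables (real column names may differ slightly):

```python
import pandas as pd

# Toy transaction log; coupon_discount < 0 means a coupon was claimed.
trans = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "date": pd.to_datetime(["2019-01-05", "2019-02-10", "2019-02-20"]),
    "coupon_discount": [0.0, -5.0, -2.0],
})
campaign = {"start_date": pd.Timestamp("2019-02-01"),
            "end_date": pd.Timestamp("2019-02-15")}

# a) purchases before the campaign start date
before = trans[trans["date"] < campaign["start_date"]]
# c) purchases where a coupon was actually claimed (negative discount)
claimed = trans[trans["coupon_discount"] < 0]
# d) purchases within the campaign window (inclusive on both ends)
during = trans[trans["date"].between(campaign["start_date"],
                                     campaign["end_date"])]
```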
The metrics or fields I derived from the above parts are aggregated at the customer_id and coupon_id level:
i) # of items purchased
ii) Total quantity of items purchased
iii) Total selling price
iv) coupon_discount received
v) Total time taken in days to purchase after the campaign start date
vi) Item counts at the category level
vii) Item counts for the top 20 brands, at the brand-code level
viii) # items purchased at brand type level
ix) # items purchased at concatenated brand_type and category level
x) Target mean encoding at each of the category columns i.e., campaign_id, coupon_id, customer_id, age_range, family_size and income_bracket
xi) Frequency count features of all the category variables mentioned above
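Most of these features reduce to pandas group-by aggregations plus categorical encodings. A minimal sketch on a toy frame (column names are assumptions; the real pipeline builds these at the customer_id/coupon_id level, and target mean encoding should be computed out-of-fold to avoid leakage):

```python
import pandas as pd

# Toy frame standing in for the joined train/transaction data.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "coupon_id":   [10, 10, 11, 11, 12],
    "quantity":    [2, 1, 4, 1, 3],
    "selling_price": [20.0, 10.0, 40.0, 5.0, 30.0],
    "redemption_status": [0, 0, 1, 0, 1],
})

# i)-iii) simple aggregates per (customer, coupon) pair
agg = df.groupby(["customer_id", "coupon_id"]).agg(
    n_items=("quantity", "size"),
    total_qty=("quantity", "sum"),
    total_price=("selling_price", "sum"),
).reset_index()

# xi) frequency count feature for a categorical column
df["customer_freq"] = df["customer_id"].map(df["customer_id"].value_counts())

# x) naive target mean encoding (in practice, use out-of-fold means)
df["customer_te"] = df.groupby("customer_id")["redemption_status"].transform("mean")
```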
I used LightGBM with 'auc' as the evaluation metric and 5-fold cross validation.
These settings worked best for me, achieving a public leaderboard score of 0.9246039336 without any hyperparameter tuning. The most important features are shown below.
Other models I tried were XGBoost and AutoML, but LightGBM gave me the best score.
To be frank, I found it really difficult to improve my score beyond this point, and I ran out of ideas for better features. For this kind of problem, I don't think the choice of algorithm matters much for boosting the score; the only real lever is feature engineering.
Things to learn from other competitors or through self-study:
- What is the best CV strategy for a class-imbalance problem?
- How to derive the best features from a business problem dominated by categorical variables
- How to handle data (e.g., retail data) that captures customer purchase behavior at the item (1000+ category levels), brand, and category level
- What are the best encoding techniques for categorical variables?
- How to approach the problem when we run out of ideas? (This is the point where you feel disappointed and helpless.)
Let me know what you think; any inputs or suggestions are welcome. The toppers have scores of 0.945+, and I am really looking forward to seeing their approaches and improving myself.
My entire code can be found on my GitHub [here].