AmExpert 2019 — Machine Learning Hackathon Approach

Brief Summary:

Analytics Vidhya conducted a machine learning hackathon on data provided by the American Express team. The contest ran from Saturday, September 28, 2019 to Sunday, October 6, 2019. I spent some time solving this problem.

Problem Statement:

We need to predict the coupon redemption probability for each customer and coupon ID combination in the test dataset. The data available for this problem contains the following information, including details of a sample of campaigns and coupons used in previous campaigns:

  • User demographic details
  • Campaign and coupon details
  • Product details
  • Previous transactions

Based on transaction and performance data from the previous 18 campaigns, the task is to predict, for each coupon and customer combination in the next 10 campaigns (the test set), the probability that the customer will redeem the coupon. The dataset description and mapping are shown below.

Approach:

The train data has about 78k rows and the test data about 50k rows. I explored the data by looking at descriptive statistics of each variable in the train set. redemption_status is the target variable, and its distribution is heavily imbalanced.
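For reference, a minimal first check of shapes and class balance, assuming the competition CSVs have been downloaded locally (the file names are placeholders):

```python
import pandas as pd

# Load the competition files (file names are assumptions; adjust to your copies)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)

# Inspect the target distribution; redemption_status is heavily imbalanced
print(train["redemption_status"].value_counts(normalize=True))
```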

Feature Engineering:

I tried to solve the problem by deriving metrics from the four transaction windows below (see the sketch after this list).

a) Purchases made before the campaign start date

b) Purchases made before and after the campaign time frame

c) Purchases where a coupon was redeemed, i.e., coupon_discount < 0

d) Purchases made between the campaign start and end dates
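As a rough illustration, here is how these windows could be carved out of the transaction log, continuing from the loading snippet above. The file and column names are assumptions based on the dataset description, not a verified schema.

```python
# Assumed schema: train(campaign_id, coupon_id, customer_id, redemption_status),
# campaign_data(campaign_id, start_date, end_date),
# customer_transaction_data(date, customer_id, item_id, quantity,
#                           selling_price, coupon_discount)
campaigns = pd.read_csv("campaign_data.csv", parse_dates=["start_date", "end_date"])
transactions = pd.read_csv("customer_transaction_data.csv", parse_dates=["date"])

# Attach campaign dates to each (campaign, coupon, customer) row in train,
# then join that customer's transactions
pairs = train.merge(campaigns, on="campaign_id")
tx = pairs.merge(transactions, on="customer_id")

# a) purchases before the campaign start date
before = tx[tx["date"] < tx["start_date"]]

# b) purchases outside the campaign time frame (before or after)
outside = tx[(tx["date"] < tx["start_date"]) | (tx["date"] > tx["end_date"])]

# c) purchases where a coupon was actually redeemed
coupon_only = tx[tx["coupon_discount"] < 0]

# d) purchases within the campaign window
during = tx[(tx["date"] >= tx["start_date"]) & (tx["date"] <= tx["end_date"])]
```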

The metrics or fields I derived from the above parts are aggregated at the customer_id and coupon_id level (a sketch follows the list):

i) # of items purchased

ii) Total quantity of items purchased

iii) Total selling price

iv) coupon_discount received

v) Total time taken in days to purchase after the campaign start date

vi) Item counts at category level

vii) Purchased-item counts for the top 20 brands, at brand code level

viii) # of items purchased at brand type level

ix) # of items purchased at the concatenated brand_type and category level

x) Target mean encoding for each of the categorical columns, i.e., campaign_id, coupon_id, customer_id, age_range, family_size and income_bracket

xi) Frequency-count features for all the categorical variables mentioned above
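A minimal sketch of the aggregation and encoding steps, continuing from the window DataFrames above. It assumes the demographic columns have already been merged into train; note also that target mean encoding computed directly on the full train set leaks the target, so in practice it should be done out-of-fold.

```python
# Aggregations at (customer_id, coupon_id) level, shown for the "before"
# window; the same pattern applies to the other windows
agg = before.groupby(["customer_id", "coupon_id"]).agg(
    n_items=("item_id", "count"),
    total_quantity=("quantity", "sum"),
    total_selling_price=("selling_price", "sum"),
    total_coupon_discount=("coupon_discount", "sum"),
)
train = train.merge(agg, on=["customer_id", "coupon_id"], how="left")

# Target mean encoding and frequency counts for the categorical columns
# (age_range, family_size, income_bracket assumed merged from demographics)
cat_cols = ["campaign_id", "coupon_id", "customer_id",
            "age_range", "family_size", "income_bracket"]
for col in cat_cols:
    # caution: in-sample mean encoding leaks the target; prefer out-of-fold
    train[f"{col}_target_mean"] = train[col].map(
        train.groupby(col)["redemption_status"].mean()
    )
    train[f"{col}_freq"] = train[col].map(train[col].value_counts())
```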

Model Build:

I used LightGBM with ‘auc’ as the evaluation metric and 5-fold cross validation.
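A minimal sketch of such a setup follows; the parameters shown are near LightGBM defaults and are placeholders, not necessarily the exact values I ran with, and it assumes all feature columns are numeric by this point.

```python
import lightgbm as lgb
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# "id" is assumed to be the row-identifier column in train
features = [c for c in train.columns if c not in ("id", "redemption_status")]
X, y = train[features], train["redemption_status"]

params = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.05,  # illustrative, untuned
    "verbosity": -1,
}

oof = np.zeros(len(train))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for tr_idx, va_idx in skf.split(X, y):
    dtrain = lgb.Dataset(X.iloc[tr_idx], y.iloc[tr_idx])
    dvalid = lgb.Dataset(X.iloc[va_idx], y.iloc[va_idx])
    model = lgb.train(
        params, dtrain, num_boost_round=2000,
        valid_sets=[dvalid],
        callbacks=[lgb.early_stopping(stopping_rounds=100)],
    )
    oof[va_idx] = model.predict(X.iloc[va_idx],
                                num_iteration=model.best_iteration)

print("OOF AUC:", roc_auc_score(y, oof))
```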

This setup achieved a public leaderboard score of 0.9246039336 without any hyperparameter tuning. The most important features are shown below.

Other models I tried and tested were XGBoost and AutoML, but LightGBM gave me the best score.

Challenges:

To be frank, I found it very difficult to improve beyond the score mentioned above, and I ran out of ideas for better features. For this kind of problem I don't think the choice of algorithm is what boosts the score; the only real way to improve is feature engineering.

Things to learn from other competitors or through self-learning:

  1. What is the best CV strategy for a class-imbalance problem?
  2. How to derive the best features from a business problem that has mostly categorical variables?
  3. How to handle data (e.g., retail data) that captures customer purchase behavior at the item (1000+ category levels), brand, and category level?
  4. What are the best encoding techniques for categorical variables?
  5. How to approach the problem when we run out of all the ideas we have gathered? (This is the point where you feel disappointed and helpless.)

Let me know what you think; any inputs or suggestions are welcome. The toppers have scores of 0.945+, and I am really looking forward to seeing their approaches and improving.

My entire code can be found at my GitHub location [here].