Targeted Marketing with Machine Learning

Feature Engineering

Now, to compare features by the value they add, we derive new features from the ones we already have in search of better ones. We first create all the candidate features, then compare and rank them.

First of all, we focus on creating business-oriented features. Then we apply Box-Cox transformations to all features, searching for the power transformation that gives each one the best discrimination ability. Finally, we merge the remaining categories with low discrimination ability and apply PCA to try to reduce the dimensionality of the attributes.

Business-oriented features (a sketch of how these could be computed follows the list):

  • RFM (Recency, Frequency, and Monetary value);
  • Number of campaigns accepted;
  • Proportion of money spent on each product category (wines, meat, gold products, etc.);
  • Monetary value (total spent);
  • Buy potential (ratio of total spent to income);
  • Frequency of product shopping.
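
As an illustration, here is a minimal sketch of how such features could be derived with pandas. It assumes the usual marketing-campaign column names (MntWines, NumWebPurchases, Income, and so on); the exact columns and the quintile-based RFM score are assumptions, not the article's own code.

import pandas as pd

# Assumed spending columns; adjust to your schema.
mnt_cols = ['MntWines', 'MntFruits', 'MntMeatProducts',
            'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']

def add_business_features(df):
    df = df.copy()
    # Monetary value: total spent across all product categories.
    df['Mnt'] = df[mnt_cols].sum(axis=1)
    # Frequency: purchases summed over the web, catalog, and store channels.
    df['Freq'] = (df['NumWebPurchases'] + df['NumCatalogPurchases']
                  + df['NumStorePurchases'])
    # Proportion of money spent on each category (wines shown here;
    # guard against Mnt == 0 in real code).
    df['PrpWines'] = df['MntWines'] / df['Mnt']
    # Buy potential: total spent relative to income.
    df['BuyPot'] = df['Mnt'] / df['Income']
    # Number and proportion of campaigns accepted.
    cmp_cols = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
                'AcceptedCmp4', 'AcceptedCmp5']
    df['NmbAccCmps'] = df[cmp_cols].sum(axis=1)
    df['PrpAccCmps'] = df['NmbAccCmps'] / len(cmp_cols)
    # One possible RFM score: quintile ranks combined into a single number
    # (recency is inverted because recent buyers should score higher).
    r = pd.qcut(df['Recency'], 5, labels=False, duplicates='drop')
    f = pd.qcut(df['Freq'], 5, labels=False, duplicates='drop')
    m = pd.qcut(df['Mnt'], 5, labels=False, duplicates='drop')
    df['RFM'] = (4 - r) + f + m
    return df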

Now we apply the Box-Cox Transformation.

A Box-Cox transformation is a way to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox transformation lets you run a broader range of tests.
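
The article settles on simple powers (listed below), but for reference SciPy can estimate the optimal Box-Cox lambda by maximum likelihood. This is a minimal sketch, assuming a Freq column in X_train; Box-Cox requires strictly positive input, hence the shift.

import numpy as np
from scipy import stats

x = X_train['Freq'].to_numpy(dtype=float)
# Box-Cox is only defined for positive values, so shift if necessary.
if (x <= 0).any():
    x = x - x.min() + 1.0
# With no lambda given, boxcox picks the one that maximizes the log-likelihood.
x_transformed, best_lambda = stats.boxcox(x)
print(f"Estimated lambda for Freq: {best_lambda:.2f}")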

Best power transformation for each feature (an "x" means the feature is left unchanged):

  • NumCatalogPurchases: sqrt
  • NumStorePurchases: sqrt
  • NumWebVisitsMonth: sqrt
  • Days_Customer: **2
  • PrpGoldProds: **1/4
  • NmbAccCmps: x
  • PrpAccCmps: x
  • PrpWines: sqrt
  • PrpFruits: sqrt
  • PrpMeat: sqrt
  • PrpFish: **1/4
  • Mnt: **2
  • BuyPot: exp
  • Freq: **2
  • RFM: **2
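
A straightforward way to apply these choices is a mapping from feature name to transformation. This sketch assumes X_train and X_test are pandas DataFrames that already contain the features above; the unchanged ("x") features are simply omitted.

import numpy as np

transforms = {
    'NumCatalogPurchases': np.sqrt,
    'NumStorePurchases': np.sqrt,
    'NumWebVisitsMonth': np.sqrt,
    'Days_Customer': np.square,
    'PrpGoldProds': lambda s: s ** 0.25,
    'PrpWines': np.sqrt,
    'PrpFruits': np.sqrt,
    'PrpMeat': np.sqrt,
    'PrpFish': lambda s: s ** 0.25,
    'Mnt': np.square,
    'BuyPot': np.exp,
    'Freq': np.square,
    'RFM': np.square,
}

# Apply the same transformation to train and test consistently.
for df in (X_train, X_test):
    for col, fn in transforms.items():
        df[col] = fn(df[col])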

Now we merge the less significant categories and create dummy variables for the remaining ones. Then we apply PCA, which can lead either to a good summary of the data or to an excessive loss of information. Note that it is not correct to use categorical variables in principal component analysis, which is why they are excluded from the projection below.
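
As a hypothetical illustration of the merging step, infrequent levels of a categorical column can be folded into an "Other" bucket before one-hot encoding. The column name Marital_Status and the frequency threshold here are assumptions for the sake of example.

import pandas as pd

counts = df['Marital_Status'].value_counts()
rare_levels = counts[counts < 30].index   # threshold chosen for illustration
df['Marital_Status'] = df['Marital_Status'].replace(rare_levels, 'Other')
# One-hot encode the merged column into dummy variables.
df = pd.get_dummies(df, columns=['Marital_Status'], prefix='Marital')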

The main idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of many variables correlated with each other, either heavily or lightly, while retaining as much of the variation present in the dataset as possible.

from sklearn.decomposition import PCA

# Drop the categorical/binary columns: PCA only makes sense for numeric features.
columns = X_train.columns.drop(
    ['Kidhome', 'Teenhome', 'NumberOff', 'AcceptedCmp3', 'AcceptedCmp4',
     'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response',
     'Marital_Status_bin', 'Education_bin', 'HasOffspring'])

# Fit the projection on the training set only, then reuse it for the test set.
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X_train[columns])
principalComponents_test = pca.transform(X_test[columns])

X_train["pc1"] = principalComponents[:, 0]
X_train["pc2"] = principalComponents[:, 1]
X_test["pc1"] = principalComponents_test[:, 0]
X_test["pc2"] = principalComponents_test[:, 1]