Feature Engineering

Here are the main methods of feature engineering:

A] Creation of indicators

Every dataset covers a specific topic, so a good data scientist develops domain knowledge about that topic. For example, if the dataset is about football, you build up some football knowledge however you wish. Once you have that domain knowledge, you start to get a sense of what is important, and if you feel that something valuable is missing from the data, you make a judgment call and add an indicator column/feature/attribute. This helps with your analysis as well as improving your machine learning model.

Example:

Through my newly developed domain knowledge, I know that the number of interceptions in football is a very relevant indicator of whether or not a defensive player is good. So I set a threshold of five interceptions, and a binary indicator tells me whether someone has more than five interceptions or not.

Interceptions (value) | Indicator (1 or 0)
7 | 1
3 | 0
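A minimal sketch of this in pandas, where the DataFrame and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical player data; names and values are illustrative only
df = pd.DataFrame({
    "player": ["A", "B", "C"],
    "interceptions": [7, 3, 6],
})

# Binary indicator: 1 if the player has more than five interceptions, else 0
df["over_five_interceptions"] = (df["interceptions"] > 5).astype(int)
print(df)
```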

B] Adding dummy variables

Machine learning algorithms do not understand the categories in our data; they only understand numbers. Therefore, we have to present categories in a way the machine can understand and perform calculations on. This is when we use dummy variables: for instance, male and female would become 1 and 0.

Python: in pandas this is done with get_dummies; a minimal sketch follows.
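```python
import pandas as pd

# Hypothetical categorical column for illustration
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# One-hot encode the category into 0/1 dummy columns
dummies = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)
```

Passing drop_first=True to get_dummies keeps only one of the two columns, which avoids carrying a redundant, perfectly correlated dummy.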

This is an absolute must for machine learning, but it is not necessary for EDA.

C] Combining sparse classes

There are two primary concerns in machine learning: overfitting and underfitting. Combining sparse classes helps combat overfitting, which in turn gives a better overall model.

Now, what is combining sparse classes?

Let’s say you have a feature column called “Material” whose values are 1180 iron, 1101 earth, 1000 wood, 4 pinewood, 2 awesome wood, and 11 Nebraska wood, and you need to use that feature to predict whether or not the product you’re creating will be successful.

As you can see, the distinction between all the types of wood may be useful to you; however, it is not exactly useful to the machine learning algorithm, because at the end of the day they are all simply wood and share some characteristics, so why not combine them into one class?

Now we have 1180 iron, 1101 earth, and 1017 wood. The feature is more general and less specific, so the model is less likely to overfit than before.
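A minimal sketch in pandas, assuming a hypothetical “Material” column:

```python
import pandas as pd

# Hypothetical "Material" column containing several rare wood sub-types
df = pd.DataFrame({
    "Material": ["iron", "earth", "wood", "pinewood", "awesome wood", "Nebraska wood"],
})

# Collapse the sparse wood sub-types into the single "wood" class
sparse_woods = ["pinewood", "awesome wood", "Nebraska wood"]
df["Material"] = df["Material"].replace(sparse_woods, "wood")

print(df["Material"].value_counts())
```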

P.S. Choosing which classes to group together is subjective and requires domain knowledge.

D] Removing unused attributes

Let’s say that we are trying to predict who wins a match. There are things that help you do that, such as knowing who had more shots or who had the better defense, etc. But at no point during your speculation did you consider someone’s ID or T-shirt color as a factor, because these are extraneous features, at least in the scope of our predictions. So we need to limit what we feed our machine learning model to only the relevant features that actually contribute.
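As a minimal sketch, assuming a hypothetical match dataset (all column names are made up):

```python
import pandas as pd

# Hypothetical match data; only some columns help predict the winner
df = pd.DataFrame({
    "player_id": [101, 102, 103],
    "tshirt_color": ["red", "blue", "red"],
    "shots": [14, 9, 11],
    "tackles_won": [22, 31, 17],
    "won_match": [1, 0, 1],
})

# Drop the extraneous identifier columns before training
X = df.drop(columns=["player_id", "tshirt_color", "won_match"])
y = df["won_match"]
print(X.columns.tolist())
```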

E] Creating interaction features (the opposite of the previous method, D)

Interaction features are novel metrics that a data scientist creates from the existing data; a minimal sketch follows the list below.

Examples:

  • decomposing a date and time into its subcomponents [Year, Month, Day], but this has to be supported by your domain knowledge
  • creating ratios and proportions
  • binning/bucketing values into ranges
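Here is the sketch, assuming a hypothetical match dataset (all column names are made up for illustration):

```python
import pandas as pd

# Hypothetical match records for illustration
df = pd.DataFrame({
    "match_date": pd.to_datetime(["2019-03-01", "2019-11-15"]),
    "shots": [12, 4],
    "shots_on_target": [6, 1],
})

# Decompose the date into its subcomponents
df["year"] = df["match_date"].dt.year
df["month"] = df["match_date"].dt.month
df["day"] = df["match_date"].dt.day

# A ratio/proportion feature: share of shots that were on target
df["on_target_ratio"] = df["shots_on_target"] / df["shots"]

# Bin/bucket the shot count into coarse ranges
df["shot_volume"] = pd.cut(df["shots"], bins=[0, 5, 10, 20], labels=["low", "mid", "high"])

print(df)
```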