Importance of Feature Engineering in Machine learning and Deep learning

Real-world data is often not linearly separable, and this is a big problem that needs to be addressed: linear models like Logistic Regression, Support Vector Machine classifiers and Linear Regression fail to achieve the required objective on such data. Sometimes even complex models like Random Forest, XGBoost classifiers and Neural Networks do not produce effective results when the data is not separable.

Is there a way to transform our data so that the classes become partly or fully separable?


The answer is YES, with the help of Feature Engineering.

Feature Engineering

“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.”

— Dr. Jason Brownlee

In simpler terms, Feature Engineering is the process of creating (or transforming) new features from the existing ones, so that Machine learning/Deep learning models work better.

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

This means that the features used are mainly responsible for the end results.

Feature Engineering with Toy Example

Source code: link

Let’s say we have data with two features and a target with two labels, 0 being the Negative class and 1 the Positive class.

Goal (classification task): to build a plane (a line in 2D) that separates the positive and negative classes.

Now the question is, can we build such a plane (line) that clearly and effectively separates these two classes?

Unfortunately, No.

Building a plane (e.g., with Logistic Regression) is a difficult task on this data, because no single line completely separates the positive and negative classes.
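Since the notebook is only linked above, here is a minimal, self-contained sketch of data with this kind of layout. The exact dataset is an assumption on my part: the negative class is clustered near the origin and the positive class lies on a surrounding ring, with scikit-learn's make_circles used as a stand-in.

    # Minimal sketch (assumed stand-in data, not the article's notebook):
    # negative points near the origin, positive points on a surrounding ring.
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_circles

    X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=42)
    y = 1 - y  # flip labels so the inner cluster is the negative (0) class

    plt.scatter(X[y == 0, 0], X[y == 0, 1], label="negative (0)")
    plt.scatter(X[y == 1, 0], X[y == 1, 1], label="positive (1)")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.legend()
    plt.show()  # no single straight line separates the two clouds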

Is there any alternative? Can we transform the data to make this task easier? Can we apply some mathematical transformation so that the classes become separable? Think about it!

Now comes the art of feature engineering. If you look at the data, the Negative class points lie close to each other, whereas the Positive class points lie far away from them.

What if we square these two features?

Let’s see how they look if they are squared.

Source code: link

Bingo!!!

By squaring the features, the larger values of Feature 1 and Feature 2 become much larger and the smaller values become much smaller (and non-negative).

The table below shows the values before and after the transformation.

+-----------+-----------+-----------+-----------+
| feat_1 | feat_2 | feat_1sq | feat_2sq |
+-----------+-----------+-----------+-----------+
| -0.814954 | 4.933138 | 0.66415 | 24.33585 |
| 1.418311 | -4.794621 | 2.011606 | 22.988394 |
| 3.234597 | 3.812792 | 10.462615 | 14.537385 |
| 4.984155 | -0.397743 | 24.841801 | 0.158199 |
+-----------+-----------+-----------+-----------+

Hence, the classes can be clearly separated just by squaring the original features, and we can now draw a line between them using Logistic Regression.
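As a quick sanity check, the squared columns in the table can be reproduced with a few lines of NumPy/pandas. The four rows are assumed here to be positive-class points; note that each row's squared features sum to roughly 25, i.e. these points lie on a circle of radius 5 around the origin, which is exactly why squaring makes them easy to separate.

    # Reproduce the "before and after" table above with NumPy/pandas.
    import numpy as np
    import pandas as pd

    feats = np.array([
        [-0.814954,  4.933138],
        [ 1.418311, -4.794621],
        [ 3.234597,  3.812792],
        [ 4.984155, -0.397743],
    ])

    squared = feats ** 2  # element-wise square of both features
    df = pd.DataFrame(np.hstack([feats, squared]),
                      columns=["feat_1", "feat_2", "feat_1sq", "feat_2sq"])
    print(df)
    print(df[["feat_1sq", "feat_2sq"]].sum(axis=1))  # ~25 for every row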


That’s the Art and Essence of Feature Engineering that I would like you to know about.

NOTE: How do we know which transforms to use, and when to apply a particular transform?

Answer: This comes only with Domain Knowledge/Domain Expertise.

Some sources of domain knowledge: blogs, recent research papers, and Kaggle kernels for the corresponding domain.

Results

Let us build two Logistic Regression classifiers, one on the original data and one on the feature-engineered data, with 'C' as the hyper-parameter and ROC AUC as the evaluation metric. ROC AUC values range between 0 and 1; the higher the AUC, the better the model, and vice versa.

Source code: link
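As the notebook itself is only linked, here is a minimal sketch of the comparison, again with make_circles standing in for the article's data; the value of C and the train/test split are assumptions, not settings taken from the notebook.

    # Minimal sketch: the same Logistic Regression (C as the hyper-parameter)
    # on raw vs. squared features, evaluated with ROC AUC.
    from sklearn.datasets import make_circles
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=42)
    y = 1 - y  # negative class near the origin, as described above

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    for name, (tr, te) in {"original": (X_tr, X_te),
                           "squared": (X_tr ** 2, X_te ** 2)}.items():
        clf = LogisticRegression(C=1.0).fit(tr, y_tr)
        auc = roc_auc_score(y_te, clf.predict_proba(te)[:, 1])
        print(f"ROC AUC on {name} features: {auc:.3f}")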

We can clearly see that the feature-engineered data gives a ROC AUC of 1, which means every positive point is scored above every negative point; the classifier separates the two classes perfectly, i.e. 100% accuracy on this data.

Note: Performing hyper-parameter tuning did not make any noticeable difference on this problem.

Another Toy Example

Source code: link

The above data is generally referred to as XOR data, as it follows the XOR truth table. A data point is labeled positive when both features have the same sign (both positive or both negative), and negative when the two features have opposite signs.

Can we apply any transformation on this data to make this task simpler?

Yes, again!!!

Source code: link

Since the features of a positively labeled data point are either both positive or both negative, we create a new feature set:

Feature 1 new = Feature 1 * Feature 2

Feature 2 new = Feature 2

+-----------+-----------+---------------+---------------+
| Feature 1 | Feature 2 | Feature 1 new | Feature 2 new |
+-----------+-----------+---------------+---------------+
| 10 | 10 | 100 | 10 |
| 12 | -10 | -120 | -10 |
| -10 | 10 | -100 | 10 |
| -12 | -10 | 120 | -10 |
+-----------+-----------+---------------+---------------+

Feature 1 new is always positive for positively labeled points and always negative for negatively labeled points, so the two sets of points fall in different regions.

Now we can easily build a plane (line) that separates these two classes.
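A minimal sketch of this transformation on the four points from the table above; the labels follow the sign rule described earlier, and the choice of classifier settings is mine, not the article's.

    # XOR-style toy data: positive (1) when both features share a sign.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[10, 10], [12, -10], [-10, 10], [-12, -10]], dtype=float)
    y = np.array([1, 0, 0, 1])

    # Engineered features: Feature 1 new = f1 * f2, Feature 2 new = f2
    X_new = np.column_stack([X[:, 0] * X[:, 1], X[:, 1]])

    # The sign of the first engineered feature alone separates the classes,
    # so a linear model now fits the data easily.
    clf = LogisticRegression().fit(X_new, y)
    print(clf.predict(X_new))  # expected: [1 0 0 1]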

Other Feature Engineering Techniques

There are many transformation techniques. Let f1 (Feature 1) and f2 (Feature 2) be real-valued features. Then we can apply, for example, the following (a short sketch follows the list):

  1. Polynomial Transformations: f1**2, f1*f2, f1 + f2, 3*f1 + 2*f2, etc.
  2. Trigonometric Transformations: sin(f1), tanh(f2**2), cot(f1/f2), etc.
  3. Boolean Transformations: AND, OR, NOT, NAND, XOR, etc.
  4. Logarithmic Transformations: log(f1), log(f1*f2), etc.
  5. Exponential Transformations: pow(e, f1), pow(e, 2*f2), etc.
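A short sketch of a few of these transforms in NumPy and scikit-learn; the sample values are made up purely for illustration.

    # Illustrative transforms on two made-up real-valued features f1, f2.
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    f1 = np.array([0.5, 1.0, 2.0])
    f2 = np.array([3.0, 4.0, 5.0])
    X = np.column_stack([f1, f2])

    # 1. Polynomial: expands to [1, f1, f2, f1**2, f1*f2, f2**2]
    poly = PolynomialFeatures(degree=2).fit_transform(X)

    # 2. Trigonometric
    trig = np.column_stack([np.sin(f1), np.tanh(f2 ** 2)])

    # 4. Logarithmic (only valid for positive feature values)
    logs = np.column_stack([np.log(f1), np.log(f1 * f2)])

    # 5. Exponential
    exps = np.column_stack([np.exp(f1), np.exp(2 * f2)])

    print(poly.shape, trig.shape, logs.shape, exps.shape)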

For text data, commonly used Feature Engineering techniques are (a short sketch follows the list):

  1. Bag of words
  2. Continuous Bag of Words
  3. Term Frequency Inverse Document Frequency(TFIDF)
  4. Word2Vec technique(W2V)
  5. Average TFIDF and W2V
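A short sketch of Bag of Words and TFIDF with scikit-learn; the example sentences are made up, and Word2Vec (typically built with a separate library such as gensim) is not shown.

    # Bag of Words and TF-IDF on two made-up documents.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["feature engineering improves models",
            "models learn from engineered features"]

    bow = CountVectorizer().fit_transform(docs)    # raw term counts
    tfidf = TfidfVectorizer().fit_transform(docs)  # TF-IDF weighted counts

    print(bow.toarray())
    print(tfidf.toarray().round(2))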

Conclusions

There is a general misconception that tuning the hyper-parameters, guided by the bias-variance trade-off, will always give better results. This is not true: if the data is not separable, building a classifier directly on top of it is of little use. What rescues you in such a situation is Feature Engineering. So Feature Engineering plays a vital role in the field of Machine learning and Deep learning.

Simpler models like Logistic Regression with Feature Engineering can outperform complex models like Random Forest, XGBoost Classifier and Neural Networks without Feature Engineering.

Alert: Building complex models without understanding the data is the dumbest thing that you can ever do.

Extension to Deep learning

The concept of feature engineering extends to Deep Learning as well. The advantage of using engineered features in Deep Learning is that optimal accuracy can be achieved with (a small sketch follows the list):

  1. Simpler networks
  2. Fewer layers
  3. Fewer epochs, etc.
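As a rough illustration (my own sketch, not an experiment from the article), a tiny neural network trained on the squared features from the first toy example reaches a high ROC AUC with a single small hidden layer:

    # A tiny network on squared features; scikit-learn's MLPClassifier is
    # used here for brevity (an assumption, the article names no framework).
    from sklearn.datasets import make_circles
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    net = MLPClassifier(hidden_layer_sizes=(4,), solver="lbfgs",
                        max_iter=1000, random_state=42)
    net.fit(X_tr ** 2, y_tr)  # train on the engineered (squared) features
    print(roc_auc_score(y_te, net.predict_proba(X_te ** 2)[:, 1]))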

Reference

AppliedAiCourse

Source code

https://github.com/gnana1997/Importance-of-Featurization

Footnote

I appreciate any feedback and constructive criticism. Thanks for reading!!

Follow me for more cool stuff in Machine Learning.

  1. LinkedIn
  2. GitHub