Original article was published by Dr. Monica on Artificial Intelligence on Medium
When we build machine learning models for real-world scenarios, feature selection is an important part of the process because it can significantly improve the performance of our models. Training the model with every feature we have may lead to a common problem called the curse of dimensionality. In simple terms, this is the idea behind the phrase “Too many cooks spoil the soup”.
What is feature selection?
Feature selection is a process in which you automatically pick the features in your data that contribute most to the prediction variable or output you are interested in.
Benefits of feature selection:
Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
Improves Accuracy: Less misleading data means modeling accuracy improves.
Reduces Training Time: Less data means that algorithms train faster.
Types of feature selection techniques:
1. Univariate Selection:
Statistical tests can be used to select the features that have the strongest relationship with the output variable. The scikit-learn library offers a SelectKBest class that keeps a chosen number of features, and it can be combined with a suite of different statistical tests to score them.
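As a minimal sketch of univariate selection, the snippet below pairs SelectKBest with the ANOVA F-test (`f_classif`). The synthetic data from `make_classification`, the choice of test, and `k=3` are assumptions made for illustration, not values taken from the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 200 samples, 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Score every feature with the ANOVA F-test and keep the 3 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(selector.scores_.round(2))           # one F-score per original feature
print(selector.get_support(indices=True))  # column indices of the kept features
print(X_selected.shape)                    # (200, 3)
```

Other scoring functions, such as `chi2` for non-negative features, can be passed as `score_func` in the same way.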
2. Recursive Feature Elimination:
Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on the attributes that remain. At each step it ranks the attributes by the importance the fitted model assigns to them (for example, its coefficients or feature importances) and drops the weakest, which identifies the attributes (and combinations of attributes) that contribute most to predicting the target attribute.
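The sketch below shows one way RFE might be run with scikit-learn. The logistic regression estimator, the synthetic data, and the target of 3 features are illustrative assumptions. scikit-learn's `RFE` ranks features by the coefficients or importances of the fitted estimator and removes `step` features per round.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 samples, 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Refit the model repeatedly, dropping the weakest feature each round
# until only 3 features remain.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=3, step=1)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask, True for the features that were kept
print(rfe.ranking_)  # rank 1 marks a selected feature, higher ranks were dropped earlier
```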
3. Principal Component Analysis
Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form. It is generally described as a data reduction technique. A useful property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result. Note that each component is a linear combination of the original features rather than a subset of them.
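A short sketch of choosing the number of principal components with scikit-learn follows. The synthetic data, the standardization step, and `n_components=3` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 10 features.
X, _ = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=2, random_state=0,
)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Project the 10 original features onto 3 principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                         # (200, 3)
print(pca.explained_variance_ratio_.round(3))  # share of variance captured per component
```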
4. Feature Importance
Ensembles of decision trees such as Random Forest and Extra Trees can be used to estimate the importance of features.
In this blog, we are going to implement the Extra Trees classifier technique. Let us work through a particular use case and see how we can select features for our machine learning model.
The Extra Trees (Extremely Randomized Trees) classifier is an ensemble learning technique that aggregates the results of many de-correlated decision trees collected in a “forest” to produce its classification result. It is quite similar to a Random Forest classifier in principle and differs from it only in the way the decision trees in the forest are built.
Each decision tree in the Extra Trees forest is built from the original training sample. At each test node, the tree is given a random sample of k features from the feature set, from which it must choose the best feature to split the data according to some mathematical criterion (typically the Gini index). This random sampling of features leads to many de-correlated decision trees.
To perform feature selection with this forest structure, the normalized total reduction in the split criterion (the Gini index, if that is the criterion used to build the forest) is computed for each feature while the forest is being constructed. This value is called the Gini importance of the feature.
The features are then ranked in descending order of Gini importance, and the user selects the top k features according to their preference.
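The ranking step described above can be sketched as follows. The synthetic data, the number of trees, and `k = 4` are assumptions chosen for illustration. `feature_importances_` is the attribute scikit-learn exposes for the normalized impurity-based importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 300 samples, 12 features, 4 of them informative.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4,
    n_redundant=0, random_state=0,
)

# Build the forest using the Gini index as the split criterion.
forest = ExtraTreesClassifier(n_estimators=100, criterion="gini", random_state=0)
forest.fit(X, y)

# Normalized importances: one value per feature, summing to 1.
importances = forest.feature_importances_

# Sort feature indices by importance, highest first, and keep the top k.
k = 4
top_k = np.argsort(importances)[::-1][:k]
for idx in top_k:
    print(f"feature {idx}: {importances[idx]:.4f}")
```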
Bank telemarketing use case:
Here, we are considering the use case of predicting the success of bank telemarketing. The dataset and the entire notebook code can be found here.
After importing the necessary libraries and reading the dataset into pandas, we print the first 5 rows of the dataset along with its columns.
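The inspection step might look like the sketch below. The original file name and columns are not shown here, so a tiny in-memory CSV with hypothetical columns (`age`, `duration`, `campaign`, `y`) stands in for the real dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file; the column names and values
# are made up for illustration only.
csv_text = """age,duration,campaign,y
30,120,1,0
45,300,2,1
52,80,1,0
38,410,3,1
27,95,1,0
61,220,2,0
"""

# With a real file this would be pd.read_csv("path/to/your_file.csv").
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())            # first 5 rows
print(data.columns.tolist())  # column names
print(data.shape)             # (rows, columns) of the full dataset
```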
The output shows 5 rows × 62 columns, and it is not recommended to train the model with this many features. So we are going to use the Extra Trees classifier to identify the top 10 features.
Printing the feature importances of the fitted model gives one value per feature:
[2.21058745e-02 1.11119854e-01 1.60791210e-02 2.09130859e-02
4.43678476e-03 3.83687896e-02 1.05711908e-02 1.38615522e-02
6.92845394e-02 4.34670523e-02 1.68238409e-02 2.70002094e-02
3.38319966e-03 2.27420042e-03 7.26212370e-03 3.33765994e-03
3.45090421e-03 1.01580463e-02 2.28330313e-03 1.75731727e-02
2.64174666e-03 1.04237588e-03 1.48773548e-02 1.93321844e-02
2.01570037e-02 3.73663495e-04 2.62631368e-02 2.67472146e-02
4.95723953e-05 1.63644589e-02 2.36237821e-02 5.92626182e-03
2.16979848e-02 2.59067640e-02 4.77323094e-06 2.55260076e-02
3.91215948e-03 2.79759220e-02 9.00655620e-03 3.44425610e-03
1.00713714e-02 1.92098382e-02 3.42614836e-02 7.11670718e-03
6.86531865e-03 5.88462224e-04 6.94024108e-03 4.66581856e-03
1.56135613e-03 2.88009556e-02 9.60638356e-03 3.93827984e-03
1.55331230e-03 2.03218704e-02 2.50614829e-02 1.73065992e-02
1.53337762e-02 1.66701632e-02 1.52280559e-02 2.28734829e-02
Let’s visualize them in a bar graph to get a clearer view of the features involved.
Finally, we have the top 10 features that have the most impact on the machine learning model for predicting the success of bank telemarketing.
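One way such a bar graph could be drawn is sketched below. Synthetic data and hypothetical feature names (`feature_0`, `feature_1`, ...) stand in for the bank dataset, and the output file name is likewise an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_redundant=0, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# Attach names to the importances and keep the 10 largest.
feat_importances = pd.Series(model.feature_importances_, index=feature_names)
top10 = feat_importances.nlargest(10)

# Horizontal bar chart, sorted so the most important feature is at the top.
ax = top10.sort_values().plot(kind="barh")
ax.set_xlabel("Gini importance")
plt.tight_layout()
plt.savefig("top10_features.png")
```

In a notebook, the backend line and `plt.savefig` can be dropped in favor of `plt.show()`.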
We can apply different feature selection techniques and build a more appropriate model depending on our application.