Feature Selection Methods in Machine Learning

Original article was published by Sivasai Yadav Mudugandla on Artificial Intelligence on Medium

3 Top Feature Selection Techniques in Machine Learning

Improve your model performance with features that contribute more to predictions.

Image by Arek Socha from Pixabay


When do you say a model is good? When a model performs well on unseen data, we say it is a good model. As Data Scientists, we perform various operations in order to build a good Machine Learning model. These operations include data pre-processing (dealing with NAs and outliers, column type conversions, dimensionality reduction, normalization, etc.), exploratory data analysis (EDA), hyperparameter tuning/optimization (the process of finding the set of hyperparameters of the ML algorithm that delivers the best performance), and feature selection.

“Garbage in, Garbage out.”
If the data fed into an ML model is of poor quality, the model will be of poor quality.


Feature Selection

Feature Selection is the process of selecting, from the existing set of features, the subset that contributes most to predicting the output. This improves the performance of the model by removing redundant and/or irrelevant features that carry noise and decrease its accuracy. One of the major advantages of feature selection is that it reduces the risk of overfitting when working with high-dimensional datasets. By reducing the number of features, it also reduces the training time of the ML algorithm, i.e., the computational cost involved.

There are 3 main feature selection techniques:

  1. Filter methods
  2. Embedded methods
  3. Wrapper methods

1. Filter methods

Filter methods use statistical techniques to measure the relationship between each feature and the target variable. They generally use statistical test scores and variances to rank the importance of individual features.


  1. Filter methods are independent of any machine learning algorithm, so the selected features can be used as input to any machine learning model.
  2. Filter methods are very fast.

Three types of filter methods, plus a scikit-learn helper:

1. ANOVA F-value

ANOVA F-value estimates the degree of linearity between an input feature (i.e., an independent feature) and the output feature (i.e., the dependent feature). A high F-value indicates a high degree of linearity, and a low F-value indicates a low degree of linearity.

Scikit-learn provides two functions to calculate the F-value:

  1. sklearn.feature_selection.f_regression for regression problems
  2. sklearn.feature_selection.f_classif for classification problems

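As a minimal sketch (assuming scikit-learn is installed), here is how f_classif scores the features of the built-in iris dataset:

```python
# Rank the iris features by ANOVA F-value.
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)

# f_classif returns one F-value and one p-value per input feature.
f_values, p_values = f_classif(X, y)

for name, f in zip(load_iris().feature_names, f_values):
    print(f"{name}: F = {f:.1f}")
```

Features with higher F-values have a stronger linear relationship with the class label and would be ranked higher for selection.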


ANOVA F-value captures only linear relationships between the input and output features.

2. Variance Threshold

Variance Threshold removes the features whose variance is below the pre-defined threshold value. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.



Since it does not use the target variable, this method can be used for unsupervised learning.


Variance Threshold considers each feature on its own; it does not account for the relationship between the input features and the output feature.

3. Mutual information

Mutual information (MI) measures the dependence of one variable on another by quantifying the amount of information obtained about one feature through the other feature. MI is symmetric and non-negative; it is equal to zero if and only if the two random variables are independent, and higher values mean higher dependency.

Scikit-learn provides two functions to calculate mutual information:

  1. sklearn.feature_selection.mutual_info_classif for classification problems
  2. sklearn.feature_selection.mutual_info_regression for regression problems

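A minimal sketch on the iris dataset (assuming scikit-learn is installed); the MI estimator is randomized, so a random_state is fixed for reproducibility:

```python
# Score the iris features with mutual information.
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# Higher scores mean the feature shares more information with the class label;
# a score of zero means the feature is independent of the label.
mi_scores = mutual_info_classif(X, y, random_state=0)

for name, mi in zip(load_iris().feature_names, mi_scores):
    print(f"{name}: MI = {mi:.3f}")
```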


MI can capture non-linear relationships between the input and output features.

4. Scikit-learn’s SelectKBest

SelectKBest scores the features with a given function (in this case, ANOVA F-value) and then “removes all but the k highest scoring features”.
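A minimal sketch combining SelectKBest with f_classif on the iris dataset (assuming scikit-learn is installed):

```python
# Keep the k highest-scoring features by ANOVA F-value with SelectKBest.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature with f_classif, then keep only the 2 best.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)  # 4 features reduced to 2
```

Swapping score_func for mutual_info_classif would apply the same top-k selection using mutual information instead.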

In my next article, I will talk about Embedded and Wrapper methods.

Thank you for reading!

Any feedback and comments are greatly appreciated!
