Feature Selection — Overview of Everything You Need to Know

Original article was published by Mikhail Raevskiy on Deep Learning on Medium

Feature Selection — Overview of Everything You Need to Know

We will consider the existing methods of feature selection for learning tasks with and without a teacher. Each method is illustrated with an open-source Python implementation so you can quickly test the proposed algorithms. However, this is not a complete selection: over the past 20 years, many algorithms have been created, and here you will find the most basic ones. For more in-depth exploration, check out this overview.

Feature Selection analogy. Source: Pinterest

The correct selection of features for data analysis allows:

  • improve the quality of supervised and unsupervised machine learning models,
  • reduce training time and reduce the required computing power,
  • and in the case of input data of high dimension, it helps to weaken the “curse of dimension”.

Assessing the importance of features is necessary to interpret model results.

Models with and without a teacher

There are supervised selection algorithms that allow you to determine the appropriate features for better performance of supervised learning tasks (for example, in classification and regression problems). These algorithms need access to labeled data. For unlabeled data, there are also a number of feature selection methods that evaluate all features based on various criteria: variance, entropy, ability to maintain local similarity, etc. Relevant features discovered using unsupervised heuristic methods can also be used in supervised models because they allow other patterns to be detected in the data besides the correlation of features with the target variable.

Feature selection methods are usually divided into 4 categories: filters, wrappers, inline, and hybrid.


With this approach, we estimate the efficiency of a subset of features, taking into account the final result of the applied learning algorithm (for example, what is the increase in accuracy when solving the classification problem). Any learning algorithm can be used in this combination of search strategy and simulation.

Existing selection strategies:

  • Forward selection: start with an empty feature set and then iteratively add features that provide the best gain in model quality.
  • Backward selection: we start with a set consisting of all features, then, at each iteration, we remove the “worst” feature.

Implementation: these algorithms are implemented in the mlxtend package, here is an example of use.

  • RFE (Recursive feature elimination, recursive removal characteristics ) “greedy” search algorithm which selects features by a recursive definition of increasingly small feature sets. It ranks the features according to the order in which they are removed.
  • Implementation: scikit-learn

Built-in methods

This group includes algorithms that simultaneously train the model and select features. This is usually done with an L1- regularizer (sparsity regularizer) or a condition that constrains some features.

  • SMLR (Sparse Logistic Regression MULTINOMIAL, sparse multinomial logistic regression ): This algorithm implements the l1-regularization using ARD (Automatic relevance determination, automatic determination of relevance ) in the classical multinomial logistic regression. Regularization determines the importance of each feature and nullifies those that are useless for forecasting.
  • Implementation: SMLR
  • The ARD (the Automatic by Relevance Determination Regression, Regression automatically determines the relevance ) model uses a Bayesian ridge regression (Bayesian Ridge Regression). It more biases the weights of the coefficients towards zero in comparison, for example, with the method of least squares.
  • ARD zeroes out the weight of some features, thereby helping to identify the relevant dimensions.
  • Implementation: scikit-learn

Other examples of regularization algorithms: Lasso (implements l1 -regularization), ridge regression (implements l2 -regularization), Elastic Net (implements l1- and l2 -regularization). If you depict these methods graphically, you can see that Lasso regression limits the coefficients to the area of ​​the square, ridge regression outlines a circle, and Elastic Net occupies an intermediate position.


A comprehensive description of these algorithms is provided here.


With this approach, we evaluate the importance of features only on the basis of their inherent characteristics, without involving learning algorithms. These methods are faster and less computationally expensive than wrapper methods. If there is not enough data to model the statistical correlation between features, then filters can give worse results than wrappers. Unlike wrappers, such methods are less prone to overfitting. They are widely used to work with high-dimensional data when wrapping methods are too computationally intensive.

Supervised methods


  • This method randomly selects samples from a dataset and updates the significance of each feature based on the difference between the selected instance and the two closest objects of the same and opposite class. If there is a difference in the values ​​of a feature for the two nearest neighbors of the same class, its importance decreases, and if, on the contrary, there is a difference between the values ​​of the feature for objects of different classes, the importance, respectively, increases.
  • The weight of a feature decreases if its value differs for the closest objects of the same class more than for the closest objects from different classes; otherwise, the weight increases.
    The advanced ReliefF algorithm uses feature weighting and searches for more nearest neighbors.
  • Implementation: scikit-rebate , Relief

Fisher score

  • Typically used in binary classification problems. The Fisher ratio (FiR) is defined as the distance between the mean values ​​of traits for each class, divided by their variance:

Chi-squared score

  • Checks if there is a significant difference between the observed and expected frequencies of two categorical variables. Thus, the null hypothesis that there is no relationship between the two variables is tested.
Chi-square test of independence.
  • To correctly apply the chi-square test to check the relationship between different features from the dataset and the target variable, the following conditions must be met: the variables must be categorical, independent, and must have an expected frequency greater than 5. The latter condition ensures that the CDF (cumulative density function) of the test statistic can be approximated using the chi-square distribution. Read more here.
  • Implementation: sklearn , scipy


CFS (Correlation-based feature selection, feature selection based on correlations ): The rationale for this method can be stated as follows:

Features are relevant if their meanings change systematically depending on belonging to a particular category.

Thus, a good subset of features contains features that are highly correlated with the target variable without correlating with each other. The estimate for a subset of k features is calculated as follows:

Here r_{cf} Is the mean of all correlations between the trait and the class, and r¯_{ff}– the average value of all correlations between features. The CFS criterion is defined as follows:


  • FCBF (the Fast Correlation-based filter, quick filter based correlation ): This method is faster and more efficient than ReliefF and of CFS, and is therefore often used for high-dimensional input data. In fact, this is a typical approach that takes into account relevance and redundancy, in which first, for all features, Symmetrical Uncertainty is calculated (mutual information between X and YI (X, Y), divided by the sum of their entropies), then the features are sorted by this criterion, and then redundant ones are removed.
  • Implementation: skfeature, https://github.com/shiralkarprashant/FCBF

Methods without a teacher


  • It has been shown that estimating the variance of a feature can be an effective way to select features. As a rule, features with almost zero variance are not significant and can be removed.
  • Implementation: Variance Threshold

Average absolute difference

  • We calculate the average absolute difference between the values ​​of the attribute and its average value ( implementation).
  • Higher values ​​tend to have higher predictive power.
  • The arithmetic mean divided by the geometric mean. Higher variance corresponds to more relevant characteristics (implementation).

Laplacian Score

  • It is based on the observation that data from the same class are often located closer to each other, so the importance of a trait can be assessed by its ability to reflect this proximity. The method consists of embedding data in the nearest neighbor’s graph by measuring an arbitrary distance and then calculating the weight matrix. Then, for each feature, we calculate the Laplace criterion and obtain such a property that the smallest values ​​correspond to the most important dimensions. However, in practice, when selecting a subset of features, a different clustering algorithm (k-means method) is usually used, with the help of which the most effective group is selected.
  • Implementation: scikit-feature

Laplace’s test combined with distance-based entropy

  • The algorithm is based on the Laplace’s test, where k-means clustering is replaced by entropy. The algorithm demonstrates higher stability on high-dimensional datasets (implementation).


  • MCFS (Multi-Cluster Feature selection, multicluster feature selection ): the spectral analysis is performed to measure the correlation between the different attributes. Eigenvectors of the Laplace operator (graph Laplacian) are used to cluster data and evaluate features. Their calculation is described in this work.
  • Implementation: https://github.com/danilkolikov/fsfc


  • Algorithms LFSBSS (Localised feature selection, selection of localized symptoms ), weighted k-average (weighted k-means) SPEC, and Apriori discussed herein and implemented in the package.

Hybrid methods

Another way to implement feature selection is a hybrid of filters and wrappers combined in a two-phase process: first, features are filtered by statistical properties, and then wrapper methods are applied.

Other sources

A lot of literature has been written in which the problem of feature selection is considered, and here we have only slightly touched on the entire array of scientific research works.

A complete list of other feature selection algorithms that I have not mentioned has been implemented in the scikit-feature package.

We can etermine relevant features also using PLS (Partial least squares, partial least squares ) as described in this article, or by a linear dimension reduction methods, as shown here.