Original article was published by Namrata Kapoor on Artificial Intelligence on Medium

# Feature Selection: Selecting the Features That Matter with Python

Eliminating waste and keeping what is good is a universal rule. In machine learning we want models that are accurate and fast to train, so we use feature selection techniques to eliminate unnecessary variables.

Top reasons to use feature selection are:

- To train the machine learning model faster.
- To improve the accuracy of a model, if the optimized subset is chosen.
- To reduce the complexity of a model.
- To reduce overfitting and make it easier to interpret.

Feature selection techniques that are easy to use and also give good results are:

**A)** **Filter Methods**

- Dropping constant features
- Univariate Selection
- Feature Importance
- Correlation Matrix with Heat map

**Dropping constant features**

This filter removes features that hold the same value in every row. Such features carry no information for solving the problem.

In Python, this can be done with scikit-learn's `VarianceThreshold` class:

    from sklearn.feature_selection import VarianceThreshold

    # a threshold of 0 flags features whose variance is zero (constant columns)
    var_thres = VarianceThreshold(threshold=0)
    var_thres.fit(data)

    # columns not supported by the selector are constant
    kept_columns = data.columns[var_thres.get_support()]
    constant_columns = [column for column in data.columns
                        if column not in kept_columns]

    # drop() returns a new DataFrame, so assign the result
    data = data.drop(constant_columns, axis=1)
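
To see the effect without loading a dataset, here is a minimal sketch on a toy DataFrame. The frame `toy` and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: column "b" never changes.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],
    "c": [0.5, 0.1, 0.9, 0.3],
})

selector = VarianceThreshold(threshold=0)
selector.fit(toy)

# get_support() is a boolean mask over the columns that are kept
kept = toy.columns[selector.get_support()].tolist()
dropped = [col for col in toy.columns if col not in kept]

print(kept)     # ['a', 'c']
print(dropped)  # ['b']
```

Only exactly constant columns are removed with `threshold=0`. A small positive threshold would also remove nearly constant columns.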

**Univariate Selection**

In univariate selection, statistical tests are used to select the features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

- Pearson’s Correlation Coefficient: f_regression()
- ANOVA: f_classif()
- Chi-Squared: chi2()
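
Any of these score functions can be passed to `SelectKBest`. As a small sketch, assuming scikit-learn's bundled iris data rather than the article's dataset, the ANOVA test `f_classif` can pick the two strongest features:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: the iris dataset stands in for a real classification problem.
X_iris, y_iris = load_iris(return_X_y=True, as_frame=True)

# Score every feature with the ANOVA F-test and keep the best two
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X_iris, y_iris)

selected = X_iris.columns[selector.get_support()].tolist()
print(selected)  # ['petal length (cm)', 'petal width (cm)']
```

Swapping `f_classif` for `f_regression` or `chi2` changes only the statistical test, not the rest of the workflow.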

The example below uses the chi-squared (chi²) statistical test, which requires non-negative features, to select the 10 best features from the Mobile Price Range Prediction dataset.

To understand chi², we need a few terms:

**Degrees of freedom** refers to the maximum number of logically independent values that are free to vary. In simple words, it is the total number of observations minus the number of independent constraints imposed on them.

A chi-square test is used in statistics to test the independence of two events. Given data for two variables, we can obtain the observed count O and the expected count E. Chi-square measures how much the observed count O deviates from the expected count E.

$$\chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Where:

c = degrees of freedom,

O = observed value(s),

E = expected value(s).

When two features are independent, the observed count is close to the expected count, so the chi-square value is small. A high chi-square value therefore indicates that the hypothesis of independence is incorrect. In simple words, the higher the chi-square value, the more the response depends on the feature, and the better a candidate the feature is for model training.
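
The formula can be checked by hand on a small contingency table. The sketch below uses made-up counts (an assumption for illustration only) and compares the manual result with `scipy.stats.chi2_contingency`. It illustrates the statistic itself, not the internals of scikit-learn's `chi2` scorer.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = feature absent/present, columns = class 0/1.
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi_square = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom of an r x c table: (r - 1) * (c - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cross-check with SciPy (Yates' continuity correction disabled)
stat, p_value, scipy_dof, _ = chi2_contingency(observed, correction=False)

print(round(chi_square, 3), dof)  # 16.667 1
```

Here the expected count is 20 in each cell of the first row and 30 in each cell of the second, so every cell deviates by 10 and the statistic is 5 + 5 + 3.33 + 3.33.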

In Python, this filter can be applied with the `SelectKBest` class and the `chi2` score function from scikit-learn, as in the following code:

    import pandas as pd
    import numpy as np
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2

    data = pd.read_csv("train.csv")
    X = data.iloc[:, 0:20]  # independent columns
    y = data.iloc[:, -1]    # target column, i.e. price range

    # apply SelectKBest class to extract top 10 best features
    bestfeatures = SelectKBest(score_func=chi2, k=10)
    fit = bestfeatures.fit(X, y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)

    # concat two dataframes for better visualization
    featureScores = pd.concat([dfcolumns, dfscores], axis=1)
    featureScores.columns = ['Specs', 'Score']  # naming the dataframe columns
    print(featureScores.nlargest(10, 'Score'))  # print 10 best features

**Feature Importance**

Feature importance assigns a score to each feature in the dataset: the higher the score, the more important or relevant the feature is to the output variable.

You can get the importance of each feature of your dataset from the `feature_importances_` attribute of a fitted model.

This attribute is built into tree-based classifiers. We will use an Extra Trees classifier to extract the top 10 features of the dataset.

In Python, it can be done with the following code:

    from sklearn.ensemble import ExtraTreesClassifier
    import matplotlib.pyplot as plt

    model = ExtraTreesClassifier()
    model.fit(X, y)
    print(model.feature_importances_)  # built-in feature_importances_ attribute of tree-based classifiers

    # plot graph of feature importances for better visualization
    feat_importances = pd.Series(model.feature_importances_, index=X.columns)
    feat_importances.nlargest(10).plot(kind='barh')
    plt.show()
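
The scores can also be used to build a reduced feature matrix rather than only a plot. A minimal sketch, assuming synthetic data from `make_classification` and hypothetical column names `f0` to `f7`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Assumption: synthetic data stands in for a real dataset.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# A fixed random_state makes the importances reproducible between runs
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

# Importances are normalised, so they sum to 1 across features
importances = pd.Series(model.feature_importances_, index=X_demo.columns)

# Keep only the three highest-scoring columns
top_features = importances.nlargest(3).index.tolist()
X_reduced = X_demo[top_features]
print(X_reduced.shape)  # (300, 3)
```

The choice of three features here is arbitrary. In practice the cut-off is usually tuned by checking model performance on held-out data.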

**Correlation Matrix with Heatmap**

Correlation describes how the features are related to each other and to the target variable.

Correlation can be positive (the two variables tend to increase together) or negative (one tends to decrease as the other increases).

A heat map makes it easy to identify which features are most related to the target variable. We will plot a heat map of the correlated features using the seaborn library.

In Python, it can be done with the following code:

    import seaborn as sns

    # get correlations of each feature in the dataset
    corrmat = data.corr()
    top_corr_features = corrmat.index
    plt.figure(figsize=(20, 20))

    # plot heat map
    g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap=plt.cm.CMRmap_r)
    plt.show()

Darker cells indicate higher correlation.
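
The same correlation matrix can be read numerically instead of visually. The sketch below ranks features by the absolute value of their correlation with the target. The data and the column names (`strong`, `weak`, `noise`, `target`) are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
weak = rng.normal(size=n)
noise = rng.normal(size=n)

# Hypothetical target: driven mostly by "strong", negatively by "weak"
frame = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "noise": noise,
    "target": 3 * strong - 2 * weak + rng.normal(scale=0.5, size=n),
})

# Correlation of every feature with the target column
corr_with_target = frame.corr()["target"].drop("target")

# Rank by absolute value: strong negative correlations matter as much as positive ones
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Sorting by absolute value is a design choice: a feature with correlation -0.5 is as informative for a linear model as one with +0.5.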

**B)** **Wrapper Methods**

Some common examples of wrapper methods are:

1) Forward feature selection

2) Backward feature elimination

3) Recursive feature elimination

**1: Forward Selection**: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves the model, until adding a new variable no longer improves its performance.

In Python, it can be implemented as:

    # step forward feature selection
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    from mlxtend.feature_selection import SequentialFeatureSelector as SFS

    # select numerical columns
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numerical_vars = list(data.select_dtypes(include=numerics).columns)
    data = data[numerical_vars]

    # features and target (assumes the target is the last column, as in the earlier examples)
    X = data.iloc[:, :-1]
    y = data.iloc[:, -1]

    # separate train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # find and remove correlated features
    def correlation(dataset, threshold):
        col_corr = set()  # set of all the names of correlated columns
        corr_matrix = dataset.corr()
        for i in range(len(corr_matrix.columns)):
            for j in range(i):
                if abs(corr_matrix.iloc[i, j]) > threshold:  # we are interested in absolute coeff value
                    colname = corr_matrix.columns[i]  # getting the name of column
                    col_corr.add(colname)
        return col_corr

    corr_features = correlation(X_train, 0.8)
    print('correlated features: ', len(corr_features))

    # remove correlated features
    X_train = X_train.drop(labels=list(corr_features), axis=1)
    X_test = X_test.drop(labels=list(corr_features), axis=1)
    X_train = X_train.fillna(0)

    # step forward feature selection
    sfs1 = SFS(RandomForestRegressor(),
               k_features=10,
               forward=True,
               floating=False,
               verbose=2,
               scoring='r2',
               cv=3)

    sfs1 = sfs1.fit(np.array(X_train), y_train)
    print(X_train.columns[list(sfs1.k_feature_idx_)])  # names of the selected features
    print(sfs1.k_feature_idx_)                         # their column indices
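
To make the greedy loop itself visible, here is a minimal hand-written sketch of forward selection. It is not how mlxtend implements it. The helper `forward_select`, its `tol` stopping tolerance and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, estimator, cv=3, tol=1e-4):
    """Greedy forward selection: add the feature that raises the CV score most."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_score = -np.inf
    while remaining:
        # Score each candidate feature together with those already selected
        scores = {
            j: cross_val_score(
                estimator, X[:, selected + [j]], y, cv=cv, scoring="r2"
            ).mean()
            for j in remaining
        }
        candidate = max(scores, key=scores.get)
        # Stop when the best candidate no longer improves the score
        if scores[candidate] - best_score <= tol:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score


# Hypothetical data: only columns 0 and 2 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 4 * X_demo[:, 0] - 3 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

chosen, score = forward_select(X_demo, y_demo, LinearRegression())
print(chosen, round(score, 3))
```

Column 0 is picked first because it explains the most variance on its own, and column 2 follows. Whether any noise column is added afterwards depends on the tolerance.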

**2: Backward Elimination**: In backward elimination, we start with all the features and, at each iteration, remove the least significant feature, that is, the one whose removal most improves the performance of the model. We repeat this until removing features brings no further improvement.

In Python, it can be implemented as:

    # step backward feature elimination
    sfs1 = SFS(RandomForestRegressor(),
               k_features=10,
               forward=False,
               floating=False,
               verbose=2,
               scoring='r2',
               cv=3)

    sfs1 = sfs1.fit(np.array(X_train), y_train)
    print(X_train.columns[list(sfs1.k_feature_idx_)])  # names of the selected features
    print(sfs1.k_feature_idx_)                         # their column indices
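
If mlxtend is not available, scikit-learn ships its own `SequentialFeatureSelector` (a different class from the mlxtend one used above, with different parameter names). A small sketch of backward elimination with it, on synthetic data that is assumed for illustration:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Hypothetical data: only columns 0 and 2 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 4 * X_demo[:, 0] - 3 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

# Start from all six features and remove one per step until two remain
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="backward",
    scoring="r2",
    cv=3,
)
sfs.fit(X_demo, y_demo)

kept = np.flatnonzero(sfs.get_support()).tolist()
print(kept)  # [0, 2]
```

Setting `direction="forward"` gives the forward variant with the same interface.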

**3: Recursive Feature Elimination**: It is a greedy optimization algorithm that aims to find the best-performing feature subset.

It repeatedly creates models and sets aside the best or the worst performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted. It then ranks the features based on the order of their elimination.

In Python, it can be implemented as:

    from sklearn.svm import SVC
    from sklearn.datasets import load_digits
    from sklearn.feature_selection import RFE

    # Load the digits dataset
    digits = load_digits()
    X = digits.images.reshape((len(digits.images), -1))
    y = digits.target

    # Create the RFE object and rank each pixel
    svc = SVC(kernel="linear", C=1)
    rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
    rfe.fit(X, y)

    # rfe.ranking_ now holds the rank of each pixel (1 = most important)
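
How the ranking relates to the order of elimination is easier to see on a tiny problem. The following sketch assumes synthetic regression data in which only columns 1 and 3 matter:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Hypothetical data: column 1 has the largest effect, column 3 a smaller one
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 5 * X_demo[:, 1] + 2 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Eliminate one feature per iteration until a single feature is left
rfe = RFE(estimator=LinearRegression(), n_features_to_select=1, step=1)
rfe.fit(X_demo, y_demo)

# Rank 1 = the surviving feature, rank 2 = the last one eliminated, and so on
print(rfe.ranking_)
print(np.flatnonzero(rfe.support_))  # [1]
```

At each step the feature with the smallest squared coefficient is dropped, so the three noise columns go first, then column 3, and column 1 survives with rank 1.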

**C)** **Embedded Methods**

**1: LASSO Regression**

Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients.

Regularisation consists of adding a penalty to the parameters of the machine learning model to reduce the freedom of the model, in other words, to avoid overfitting. In linear model regularisation, the penalty is applied to the coefficients that multiply each of the predictors. Among the different types of regularisation, Lasso (L1) has the property that it can shrink some of the coefficients to exactly zero. The corresponding features can then be removed from the model.
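
Written out, the penalised objective looks as follows. This is the form documented for scikit-learn's `Lasso`. The symbols are notation introduced here, not the article's: $n$ is the number of samples, $w$ the coefficient vector and $\alpha$ the penalty strength (the `alpha` argument).

$$\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

The first term rewards fitting the data and the second charges for every non-zero coefficient in proportion to its absolute value, which is why a large enough $\alpha$ drives weak coefficients to exactly zero.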

In Python, it is implemented as:

    # load libraries
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import Lasso
    from sklearn.feature_selection import SelectFromModel
    from sklearn.preprocessing import StandardScaler

    # separate train and test sets
    # (X is the feature DataFrame and y the target, as in the earlier examples)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # the features are on different scales, so scaling them helps the regression
    scaler = StandardScaler()
    scaler.fit(X_train.fillna(0))

    # use a large alpha (penalty) to force the algorithm to shrink some coefficients;
    # a suitable value depends on the scale of the target
    sel_ = SelectFromModel(Lasso(alpha=100))
    sel_.fit(scaler.transform(X_train.fillna(0)), y_train)

    # make a list with the selected features and print the output
    selected_feat = X_train.columns[sel_.get_support()]
    print(selected_feat)

We can see that Lasso regularisation helps to remove unimportant features from the dataset. Increasing the penalisation increases the number of features removed.

If the penalty is too high and important features are removed, we will notice a drop in the performance of the algorithm and then realise that we need to decrease the regularisation.
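
The effect of the penalty strength can be observed directly by counting the non-zero coefficients for several values of `alpha`. This is a sketch on synthetic data, with coefficients and `alpha` values chosen for illustration only:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical data: columns 0, 2 and 4 matter, with decreasing strength
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (4 * X_demo[:, 0] - 3 * X_demo[:, 2] + 0.5 * X_demo[:, 4]
          + rng.normal(scale=0.5, size=200))

X_scaled = StandardScaler().fit_transform(X_demo)

# Count how many coefficients survive at each penalty strength
kept_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    coef = Lasso(alpha=alpha).fit(X_scaled, y_demo).coef_
    kept_counts[alpha] = int(np.count_nonzero(coef))

print(kept_counts)
```

With `alpha=1.0` only the two strongest features survive, and with `alpha=10.0` every coefficient is zero, which is the over-penalised situation described above.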

**2: Random Forest / Ensemble Techniques**

Decision tree ensembles, such as random forests, are useful for feature selection in addition to being effective classifiers.

Random forests are one of the most popular machine learning algorithms. They are successful because they generally provide good predictive performance, low overfitting and easy interpretability.

For classification, the measure of impurity is either the Gini impurity or the information gain/entropy. For regression the measure of impurity is variance. Therefore, when training a tree, it is possible to compute how much each feature decreases the impurity. The more a feature decreases the impurity, the more important the feature is. In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.
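
The impurity decrease of a single split can be computed by hand. The sketch below uses Gini impurity on made-up labels. The helper names `gini` and `impurity_decrease` are hypothetical and not part of scikit-learn.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A feature that separates the classes perfectly
good_split = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])

# A feature whose split leaves both children as mixed as the parent
poor_split = impurity_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(good_split)  # 0.5
print(poor_split)  # 0.0
```

The feature behind the first split would receive a high importance, and the one behind the second would receive none.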

To give a better intuition, features that are selected at the top of the trees are in general more important than features that are selected at the end nodes of the trees, as generally the top splits lead to bigger information gains.

In Python, it can be implemented as:

    # Import libraries
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    # Encode categorical variables
    X = pd.get_dummies(X, prefix_sep='_')
    y = LabelEncoder().fit_transform(y)

    # Normalize feature vector
    X2 = StandardScaler().fit_transform(X)

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.30, random_state=0)

    # instantiate the classifier with n_estimators = 100
    clf = RandomForestClassifier(n_estimators=100, random_state=0)

    # fit the classifier to the training set
    clf.fit(X_train, y_train)

    # predict on the test set
    y_pred = clf.predict(X_test)
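
The listing above stops at prediction. As a hedged sketch of how the forest's impurity-based importances could drive the actual selection, the example below wraps a random forest in `SelectFromModel` and keeps the features whose importance is at least the mean. The synthetic data and the `"mean"` threshold are assumptions, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Assumption: synthetic data; with shuffle=False the first 3 columns are informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

# Fit the forest and keep features with importance >= the mean importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="mean",
)
selector.fit(X_train, y_train)

importances = selector.estimator_.feature_importances_
selected = np.flatnonzero(selector.get_support())

# Apply the same column selection to both splits
X_train_reduced = selector.transform(X_train)
X_test_reduced = selector.transform(X_test)

print(selected)
print(X_train_reduced.shape)
```

Fitting the selector on the training split only, and then transforming both splits, keeps information from the test set out of the selection step.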

I hope that after reading this blog, applying the different types of feature selection in Python will be easier.

Thanks for reading!