Introduction
Machine learning is a subset of artificial intelligence focused on building systems that learn from historical data, identify patterns, and make decisions with little or no human intervention. As machine learning becomes more widely used, it is increasingly valuable to understand its core techniques. In this article, we look at feature selection in machine learning.

We will cover feature selection in machine learning with scikit-learn. We will look at univariate feature selection, recursive feature elimination (RFE), and sequential feature selection. In the end, we will cover SelectFromModel with examples.
What is feature selection?
The input variables we give to our ML models are called features; each column of our dataset holds one feature. To train a model well, we should use only the features that actually matter. With too many features, the model can pick up unimportant patterns and learn from noise; with a smaller, well-chosen set of features, it is usually more efficient and generalizes better. Feature selection is therefore an important task: it is the process of choosing the subset of attributes that is most useful for the problem.
A good understanding of feature selection methods is a valuable asset: it leads to better-performing models and a clearer view of the underlying structure of the data. Now that we have seen what feature selection is, it is time to learn about univariate selection.
Univariate feature selection using scikit-learn
Univariate feature selection examines each feature individually to determine the strength of its relationship with the response variable. These methods are generally simple to run and understand and are good for getting a first sense of the structure of the data.
One of the best methods to understand a feature's relation to the response variable is the Pearson correlation coefficient, which measures the linear correlation between two variables.
import numpy as np
from scipy.stats import pearsonr
np.random.seed(0)
size = 100
# a standard normal variable
number = np.random.normal(0, 1, size)
# correlating the variable with a slightly noisy and a very noisy copy of itself
print("Lower noise", pearsonr(number, number + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(number, number + np.random.normal(0, 10, size)))
Output:

In the above code, we compare a variable with two noisy versions of itself. With less noise the correlation is relatively strong and the p-value is low, while for the noisier comparison the correlation is small and the p-value is high.
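Scikit-learn wraps univariate scoring functions like this in transformer classes such as SelectKBest. Below is a minimal sketch, assuming the iris dataset and the ANOVA F-test score function f_classif, that keeps the two highest-scoring features:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
# loading a small example dataset
num1, num2 = load_iris(return_X_y=True)
# keeping the 2 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
num1_new = selector.fit_transform(num1, num2)
print(num1.shape, num1_new.shape)  # (150, 4) (150, 2)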
Recursive feature elimination using scikit-learn
Recursive feature elimination, or RFE, is a wrapper-type feature selection algorithm: a separate machine learning algorithm is given and used at the core of the method, wrapped by RFE, and used to help select features. This is in contrast to filter-based methods, which score each feature independently of any model.
RFE works by fitting the given model on all features in the training dataset, ranking the features by importance, removing the least important ones, and repeating until the desired number of features remains. Now we will see a coded example of RFE for classification.
Creating a dataset,
from sklearn.datasets import make_classification
# defining a dataset
num1, num2 = make_classification(n_samples=2000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# summarizing the dataset
print(num1.shape, num2.shape)
Output:
(2000, 10) (2000,)
Now we will evaluate RFE feature selection on this dataset. We will use a DecisionTreeClassifier both to choose the features and as the final model, and we will set the number of features to select to 6.
# evaluate RFE for classification
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
# defining a dataset
num1, num2 = make_classification(n_samples=2000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# creating a pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=6)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluating the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, num1, num2, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# checking the performance
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
Output:
Accuracy: 0.803 (0.035)
In the above code, we can see that RFE uses a decision tree to select six features, a decision tree is then fit on the selected features, and the pipeline achieves a mean accuracy of about 80.3%.
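If we want to see which of the ten features RFE actually kept, we can fit the RFE object on its own and inspect its support_ and ranking_ attributes. A short sketch, reusing rfe, num1, and num2 from the code above:
# fitting RFE on its own to inspect the selected features
rfe.fit(num1, num2)
for i in range(num1.shape[1]):
    # support_ marks whether a column was kept; ranking_ gives its elimination rank
    print('Column: %d, Selected: %s, Rank: %d' % (i, rfe.support_[i], rfe.ranking_[i]))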
Implementing Sequential Feature selection using scikit-learn
Sequential feature selection is a family of greedy search algorithms that reduce an initial d-dimensional feature space to a k-dimensional feature subspace with k < d. The motivation is to automatically select the subset of features most relevant to the problem.
There are four common variants of sequential feature selection:
- Sequential Forward Selection
- Sequential Backward selection
- Sequential Forward floating selection
- Sequential Backward floating selection
Scikit-learn's SequentialFeatureSelector greedily adds (forward selection) or removes (backward selection) features to build a feature subset; the floating variants are not part of scikit-learn's implementation. At each step it chooses the best feature to add or remove based on the cross-validation score of the supplied estimator.
Below is an example where we are using SequentialFeatureSelector,
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
# loading the iris dataset
num1, num2 = load_iris(return_X_y=True)
# using a k-nearest neighbours classifier as the estimator
knn = KNeighborsClassifier(n_neighbors=3)
# greedily selecting 3 of the 4 iris features (forward selection by default)
sfs = SequentialFeatureSelector(knn, n_features_to_select=3)
sfs.fit(num1, num2)
# boolean mask of the selected features
sfs.get_support()
# shape of the reduced dataset
print(sfs.transform(num1).shape)
Output:
(150, 3)
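SequentialFeatureSelector also supports backward elimination through its direction parameter. A minimal sketch, reusing the knn estimator and the iris data from above:
# backward selection: start from all features and greedily remove them
sfs_backward = SequentialFeatureSelector(knn, n_features_to_select=3, direction='backward')
sfs_backward.fit(num1, num2)
print(sfs_backward.transform(num1).shape)  # (150, 3)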
SelectFromModel Feature Selection
The scikit-learn API provides the SelectFromModel class for selecting the best features of a dataset according to importance weights. SelectFromModel is a meta-estimator that keeps the features whose importance weights (for example, coefficients or feature importances of the fitted estimator) are above a given threshold value.
Below is an example of SelectFromModel feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# a small toy dataset with 3 features
X = [[ 0.87, -1.34,  0.31],
     [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55],
     [ 1.92,  1.48,  0.65]]
y = [0, 1, 0, 1]
# fitting a logistic regression and selecting features by coefficient magnitude
selector = SelectFromModel(estimator=LogisticRegression()).fit(X, y)
print("The threshold value is:", selector.threshold_)
print(selector.estimator_.coef_)
Output:

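Once fitted, the selector can also report which features passed the threshold and reduce the data accordingly. For example, continuing from the code above:
# boolean mask of the features whose weights exceed the threshold
print(selector.get_support())
# keeping only the selected columns of X
print(selector.transform(X).shape)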