Table of contents
1. Introduction
2. What is feature selection?
2.1. Univariate feature selection using scikit-learn
2.2. Recursive feature selection using scikit-learn
2.3. Implementing Sequential Feature selection using scikit-learn
2.4. SelectFromModel Feature Selection
3. Frequently Asked Questions
3.1. Is high variance in data good or bad?
3.2. List some applications of ensemble learning.
3.3. State the difference between correlation and causality.
3.4. What is deep learning?
3.5. Can you mention some advantages and disadvantages of decision trees?
4. Conclusion

Feature Selection in ML with scikit-learn


Introduction

Machine learning is a subset of Artificial Intelligence focused on building systems that learn from historical data, identify patterns, and make logical decisions with little or no human intervention. As machine learning grows more popular by the day, it becomes important to equip yourself with its core techniques. So in this article, we learn about feature selection in machine learning.


We will cover the topic of feature selection in ML with scikit-learn. We will learn about univariate feature selection, recursive feature selection, and sequential feature selection using scikit-learn. In the end, we will cover SelectFromModel with examples.

What is feature selection?

As we all know, the input variables we give to our ML models are called features; each column of our dataset holds one feature. To train a model optimally, we need to ensure that we use only the essential features. With too many features, the model can capture unimportant patterns and learn from noise; with fewer, well-chosen features, the model is more efficient and effective. Our goal is therefore to reduce the number of features to the ones that matter, and the process of choosing these essential attributes is called feature selection.

A good understanding of feature selection is a great asset: it leads to better-performing models and a clearer picture of the underlying structure of the data. Now that we have seen what exactly feature selection is, it is time to learn about univariate selection.
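
Before diving into the individual techniques, here is a minimal sketch of the basic idea, using scikit-learn's VarianceThreshold (a simple selector not covered in detail later in this article) to drop a near-constant column. The toy data and the threshold value below are illustrative assumptions, not part of any later example.

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# toy dataset: the second column is almost constant and carries little information
X = np.array([[1.0, 0.0, 3.2],
              [2.4, 0.0, 1.1],
              [0.7, 0.0, 4.8],
              [3.1, 0.1, 2.9]])

# keep only features whose variance exceeds the (illustrative) threshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # boolean mask of the columns that were kept
print(X_reduced.shape)         # (4, 2) -> the near-constant column was dropped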

Univariate feature selection using scikit-learn

Univariate feature selection examines every feature individually to determine the strength of its relationship with the response variable. These methods are generally simple to run and understand and are good for gaining insight into the underlying structure of the data.

One of the best methods to understand a feature's relation to the response variable is the Pearson correlation coefficient, which measures the linear correlation between two variables.

import numpy as np
from scipy.stats import pearsonr

np.random.seed(0)
size = 100
number = np.random.normal(0, 1, size)

# comparing the variable with a slightly noisy and a very noisy copy of itself
print("Lower noise", pearsonr(number, number + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(number, number + np.random.normal(0, 10, size)))

 

Output:

(correlation coefficient and p-value for the lower-noise and higher-noise comparisons)

In the above code, we compare a variable with two noisy copies of itself. With a smaller amount of noise, the correlation is relatively strong and the p-value is low, while for the noisier comparison, the correlation is small and the p-value is high.
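
Scikit-learn wraps this univariate idea in transformers such as SelectKBest. As a hedged sketch (the scoring function f_classif and k=2 below are illustrative choices, not the only options), we can keep the k best features of the iris dataset according to an ANOVA F-test:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# loading a small example dataset
X, y = load_iris(return_X_y=True)

# keeping the 2 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(X.shape, X_new.shape)    # (150, 4) (150, 2)
print(selector.get_support())  # boolean mask of the selected features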

Recursive feature selection using scikit-learn

Recursive feature elimination, or RFE, is a wrapper-type feature selection algorithm. This means a different machine learning algorithm sits at the core of the method: RFE wraps this model and uses its importance scores to help select features. Technically, RFE also uses filter-based feature selection internally, since it ranks features by the scores the wrapped model assigns to them.

RFE starts with all the features present in the training dataset and repeatedly removes the least important ones until the desired number of features remains. Now we will see a coded example of implementing RFE for classification.

Creating a dataset,

from sklearn.datasets import make_classification
# defining a dataset
num1, num2 = make_classification(n_samples=2000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# summarizing the dataset
print(num1.shape, num2.shape)

 

Output:

(2000, 10) (2000,)

 

Now we will evaluate RFE feature selection on this dataset. We will use a DecisionTreeClassifier to choose the features and set the number of features to select to 6.

# evaluate RFE for classification
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE

# defining a dataset
num1, num2 = make_classification(n_samples=2000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# creating a pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=6)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])

# evaluating the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, num1, num2, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# checking the performance
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

 

Output:

Accuracy: 0.803 (0.035)

In the above code, RFE uses a decision tree to select six features, a decision tree is then fit on the selected features, and the pipeline achieves a mean accuracy of about 80.3%.
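
If we also want to know which features RFE kept, we can fit the selector on its own and inspect its support_ and ranking_ attributes. A minimal sketch, reusing the same synthetic dataset defined above:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE

# same synthetic dataset as above
num1, num2 = make_classification(n_samples=2000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# fitting RFE directly to see which columns survive
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=6)
rfe.fit(num1, num2)

for i in range(num1.shape[1]):
    print('Column %d, selected: %s, rank: %d' % (i, rfe.support_[i], rfe.ranking_[i]))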

Implementing Sequential Feature selection using scikit-learn

Sequential feature selection is a family of greedy search algorithms that reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The motivation for sequential feature selection is to automatically select the subset of features most relevant to the problem.

There are four different types of sequential feature selectors (scikit-learn's SequentialFeatureSelector implements the forward and backward variants):

  1. Sequential Forward Selection
  2. Sequential Backward selection
  3. Sequential Forward floating selection
  4. Sequential Backward floating selection

This Sequential Feature Selector greedily adds (forward selection) or removes (backward selection) features to build a feature subset. At each step, the selector chooses the best feature to add or remove based on the cross-validation score of an estimator.

Below is an example where we are using SequentialFeatureSelector,

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector

# loading the iris dataset
num1, num2 = load_iris(return_X_y=True)

# selecting 3 features for a 3-nearest-neighbours classifier
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=3)
sfs.fit(num1, num2)

mask = sfs.get_support()          # boolean mask of the selected features
print(sfs.transform(num1).shape)  # shape of the reduced dataset

 

Output:

(150, 3)
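
By default SequentialFeatureSelector performs forward selection; the backward variant is a one-line change via the direction parameter. A minimal sketch, assuming the knn, num1, and num2 objects defined above:

# removing features one at a time instead of adding them (backward selection)
sfs_backward = SequentialFeatureSelector(knn, n_features_to_select=3, direction='backward')
sfs_backward.fit(num1, num2)
print(sfs_backward.get_support())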

SelectFromModel Feature Selection

The scikit-learn API provides the SelectFromModel class for extracting the best features of a given dataset according to the importance weights of a fitted model. SelectFromModel is a meta-estimator that keeps the features whose importance weights are above the given threshold value.

Below is an example of SelectFromModel feature selection

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]
selector = SelectFromModel(estimator=LogisticRegression()).fit(X, y)
print("The threshold value is: ",selector.threshold_ )
print(selector.estimator_.coef_)

 

Output: 

(the computed threshold value and the fitted logistic-regression coefficients)
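
Once the selector is fitted, the reduced dataset can be obtained with transform. A short sketch continuing the example above (which features are kept depends on the learned coefficients and the threshold):

# features whose importance (absolute coefficient) is above the threshold are kept
print(selector.get_support())
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (number of samples, number of selected features)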

Frequently Asked Questions

Is high variance in data good or bad?

If the variance is high, the spread of the data is large and the feature takes a wide variety of values, which is generally not a sign of good data quality.

List some applications of ensemble learning.

  1. Person recognition, including iris, face, fingerprint, and behavior recognition.
     
  2. Medical applications such as X-ray analysis, human genome analysis, and examining sets of medical data to look for anomalies.

State the difference between correlation and causality.

Correlation occurs when two actions, X and Y, are related, but X does not necessarily cause Y. Causality occurs when one action, X, directly causes another action, Y.

What is deep learning?

It is a branch of machine learning that uses artificial neural networks to create systems that think and learn like people. The word "deep" refers to neural networks containing several layers.

Can you mention some advantages and disadvantages of decision trees?

The advantages are that they are easy to interpret and are non-parametric, whereas the main disadvantage is that they are prone to overfitting.

Conclusion

We have discussed the topic of feature selection in ML with scikit-learn. We have also learned about univariate, recursive, and sequential feature selection using scikit-learn, and we concluded by covering SelectFromModel with an example.

We hope this article has helped you in some way and if you liked our article, do upvote our article and help other ninjas grow. You can refer to our Guided Path on Coding Ninjas Studio to upskill yourself in Data Structures and Algorithms, Competitive Programming, System Design, and many more!

Head over to our practice platform Coding Ninjas Studio to practice top problems, attempt mock tests, read interview experiences and interview bundles, follow guided paths for placement preparations, and much more!!
