Table of contents
1. Introduction
2. Data Preprocessing
3. Data Preprocessing in Scikit-learn
3.1. Load data with the Scikit-learn
3.2. Perform the Exploratory data analysis
3.3. Handle missing values
3.4. Infer new features with feature engineering
3.5. Encode categorical features
3.6. Scale numeric features
3.7. Create a LogisticRegression
4. Frequently asked questions
4.1. What is scikit-learn used for?
4.2. What is the main purpose of scikit-learn?
4.3. Is scikit-learn deep learning?
4.4. Does scikit-learn use multiple cores?
4.5. Is scikit-learn a library or module?
5. Conclusion
Last Updated: Mar 27, 2024

Preprocessing with scikit-learn

Author Muskan Sharma

Introduction

Hey Readers!!

Welcome back to another article on Python. While working with Python, you've probably come across libraries such as NumPy, Pandas, Keras, and many more.

In this article, you'll learn about one of those libraries, scikit-learn, which is used for machine learning, and how data preprocessing is done with it.

Let’s begin!!

Data Preprocessing 

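Data preprocessing is the process of converting raw data into a form that can be used for analysis or model training, for example by filling in missing values, encoding categories, and scaling numbers.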

Data Preprocessing in Scikit-learn

Now let's look at the steps of data preprocessing in scikit-learn. We'll use the Titanic dataset throughout.

Load data with the Scikit-learn

Load the Titanic dataset using fetch_openml(), which downloads it from OpenML:

import pandas as pd
from sklearn.datasets import fetch_openml
 
df = fetch_openml('titanic', version=1, as_frame=True)['data']
df.head(4)

Perform the Exploratory data analysis

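Now let's begin analyzing the dataset. df.info() lists each column's data type and non-null count, and df.isnull().sum() counts the missing values in each column.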

Null Columns 

df.info()

Missing Values 

df.isnull().sum()

Handle missing values 

Missing values can be handled by:

  • Removing the rows or columns that contain them
  • Filling them in (imputing) with the mean, median, or mode of the column
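
Both options can be sketched on a small example. The DataFrame below is hypothetical toy data that only imitates the Titanic age and embarked columns, and using the median for age is an illustrative choice rather than the article's prescribed method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the Titanic 'age' and 'embarked' columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 40.0],
    "embarked": ["S", "C", np.nan, "S"],
})

# Option 1: remove every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill the gaps instead of removing rows
filled = toy.copy()

# Numeric column: replace missing values with the column median
age_imputer = SimpleImputer(strategy="median")
filled["age"] = age_imputer.fit_transform(toy[["age"]]).ravel()

# Categorical column: replace missing values with the most frequent value (mode)
most_common_port = toy["embarked"].mode()[0]
filled["embarked"] = toy["embarked"].fillna(most_common_port)

print(dropped)
print(filled)
```

Removing rows is simplest but throws data away, so imputing is usually preferable when only a few values in a column are missing.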

Infer new features with feature engineering

The process of extracting features from unprocessed data using domain knowledge is known as feature engineering. 

Let’s take the example of checking whether a passenger travelled alone or with family.

SibSp: the number of siblings and spouses travelling with the passenger.

Parch: the number of parents and children travelling with the passenger.

By adding the two, we can determine whether the passenger travelled alone (a sum of 0) or with family.
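
A minimal sketch of this idea follows. The rows are made up, the lowercase sibsp and parch names follow the OpenML version of the dataset, and family_size and travelled_alone are illustrative names that do not exist in the dataset.

```python
import pandas as pd

# Hypothetical rows; 'sibsp' and 'parch' mirror the lowercase OpenML column names
toy = pd.DataFrame({"sibsp": [0, 1, 0, 3], "parch": [0, 0, 2, 1]})

# 'family_size' and 'travelled_alone' are illustrative names, not dataset columns
toy["family_size"] = toy["sibsp"] + toy["parch"]

# A passenger travelled alone when no relatives were aboard
toy["travelled_alone"] = (toy["family_size"] == 0).astype(int)

print(toy)
```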

Encode categorical features

Most scikit-learn estimators expect numeric input, so categorical features must be encoded as numbers before a classification model can be trained.

A common approach is one-hot encoding, which turns each category into its own binary (0 or 1) column.

There are two ways to do that:

scikit-learn: OneHotEncoder()

import pandas as pd 
 
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import OneHotEncoder
 
df = fetch_openml('titanic', version=1, as_frame=True)['data']
 
df[['female','male']] = OneHotEncoder().fit_transform(df[['sex']]).toarray()
df[['sex','female','male']] 

pandas: get_dummies()

import pandas as pd 
 
from sklearn.datasets import fetch_openml
 
df = fetch_openml('titanic', version=1, as_frame=True)['data']
 
df['sex'] = pd.get_dummies(df['sex'],drop_first=True)
 
df.head(4)

Scale numeric features

Numeric features often have very different ranges.

For instance, one feature may be measured in kilometres while another is measured in miles.

As a result, we should bring the features onto a comparable scale.

We can apply StandardScaler or MinMaxScaler to do this.

  • MinMaxScaler scales each feature to a specified range (by default 0 to 1).
  • StandardScaler removes the mean and scales to unit variance, so that each feature has μ = 0 and σ = 1.
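
The difference between the two scalers can be seen on a small array. The values below are hypothetical and chosen only so that the two columns have very different ranges.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical values for two features on very different scales
X = np.array([[20.0, 7.0], [30.0, 70.0], [40.0, 700.0]])

# MinMaxScaler maps each column to the range [0, 1] by default
minmax = MinMaxScaler().fit_transform(X)

# StandardScaler gives each column a mean of 0 and a standard deviation of 1
standard = StandardScaler().fit_transform(X)

print(minmax)
print(standard)
```

In practice, the scaler should be fitted on the training data only and then reused to transform the test data, so that no information from the test set leaks into training.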

Create a LogisticRegression

Now let's put the steps together and fit a LogisticRegression model.

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
 
X, y = fetch_openml('titanic', version=1, as_frame=True, return_X_y=True)
 
# impute missing values (mode() returns a Series, so take its first entry)
X['age'] = X['age'].fillna(X['age'].mean())
X['embarked'] = X['embarked'].fillna(X['embarked'].mode()[0])
 
# handle categorical data
X = pd.get_dummies(X[['age','embarked', 'sex','pclass']],drop_first=True)
 
# fit machine learning model (max_iter raised so the solver can converge on unscaled data)
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
 
# make prediction
model.predict(X)

Build a pipeline

A Pipeline chains preprocessing steps and a final estimator into a single object.

Using the Pipeline class, we define the set of steps we want to carry out sequentially; calling fit() or predict() on the pipeline then runs each step in order.
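
One way such a pipeline might look is sketched below. The data is a hypothetical toy sample shaped like a few Titanic columns, and the step names ("impute", "scale", "num", "cat", "preprocess", "classify") as well as the use of ColumnTransformer are illustrative choices, not something the article specifies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows in the shape of the Titanic columns used above
X = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, 14.0],
    "pclass": [3, 1, 3, 1, 1, 3, 2, 2],
    "sex": ["male", "female", "female", "female",
            "male", "male", "female", "female"],
})
y = pd.Series([0, 1, 1, 1, 0, 0, 1, 1])

# Numeric columns: fill missing values with the mean, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Apply the numeric steps to numeric columns and one-hot encode 'sex'
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "pclass"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

# Preprocessing and the classifier run sequentially as one estimator
model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression(max_iter=1000)),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the preprocessing lives inside the pipeline, the same imputation, scaling, and encoding learned during fit() are applied automatically whenever predict() is called on new data.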

Frequently asked questions

What is scikit-learn used for?

Scikit-learn is an open-source machine learning library for Python and one of the most widely used in the Python ecosystem. It is used for data preprocessing, model training, and model evaluation.

What is the main purpose of scikit-learn?

It offers a range of effective machine learning and statistical modeling capabilities, such as dimensionality reduction, clustering, regression, and classification.

Is scikit-learn deep learning?

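No. Scikit-learn focuses on classical machine learning. It includes basic neural network models such as MLPClassifier, but it is not a deep learning framework. Scikit Flow was a separate high-level wrapper around the TensorFlow deep learning library.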

Does scikit-learn use multiple cores?

Yes. Some scikit-learn estimators and tools parallelize expensive computations across several CPU cores, typically controlled through the n_jobs parameter.

Is scikit-learn a library or module?

Scikit-learn is a free machine learning library for Python. It is installed as the scikit-learn package and imported under the name sklearn.

Conclusion

You should now understand what data preprocessing is and the different steps for doing it with scikit-learn.

Thank you

Happy Learning!
