Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
2.
Installation
3.
Basic Guides
4.
Implementation
5.
Here are the main ways the Scikit-learn library is used.
6.
Frequently Asked Questions 
6.1.
1) What are the features of Scikit-learn?
6.2.
2) What is the transform method in Scikit-learn?
6.3.
3) What is the fit_transform method in Scikit-learn?
6.4.
4) What is the fit method used in Scikit-learn?
6.5.
5) What are the cons of using Scikit-learn?
7.
Conclusion
Last Updated: Mar 27, 2024

Basic Guide to Scikit-learn

Author Rajkeshav
0 upvote
gp-icon
Basics of Machine Learning
Free guided path
9 chapters
29+ problems
gp-badge
Earn badges and level up

Introduction

 

Scikit-learn is a library that is used to perform machine learning in Python. Scikit-learn is an open-source library and is usable in various contexts, increasing academics and commercial use. It is also based on popular libraries such as NumPy, Scipy, and Matplotlib. Also, the best part about Scikit-learn is that it has many tuning parameters, outstanding documentation, and a support community. So let me show that as well. So I open my Google, and I type in Scikit-learn.

 

 

Overhear the first link is the official documentation of Scikit-learn. So I just clicked it. 

 

 

As you can see here, this is the official page of Scikit-learn over here. This is the introduction that I have already discussed. Over here, we have some applications and some algorithms. So let's say in classification, we have applications such as spam detection and image classification and algorithms such as SVM, nearest-neighbor random forest, and many more. 

 

Installation

 

Now we will install the Scikit-learn package. You need to go to your command line and type in pip install Scikit-learn. And if you are using Anaconda distribution, you can type in Conda install Scikit-learn as I mentioned that it already contains a lot of algorithms. It has a library such as NumPy, Scipy which makes the work easy with Arrays and machine learning techniques much more effortless. So if you go back to Scikit-learn documentation, you will first see an Install tab, then a User guide, and so on.

 

 

Set the click on install; this will tell us how to install Scikit-learn. Then we have beautiful documentation named User guide where you can look up the guides and explore this. 

 

Get the tech career you deserve, faster!
Connect with our expert counsellors to understand how to hack your way to success
User rating 4.7/5
1:1 doubt support
95% placement record
Akash Pal
Senior Software Engineer
326% Hike After Job Bootcamp
Himanshu Gusain
Programmer Analyst
32 LPA After Job Bootcamp
After Job
Bootcamp

Basic Guides

 

Every algorithm is exposed In Scikit-learn via an estimator. So first, we need to import the model. The general form to do this is by typing in from sklearn.family import Model.

Let's talk about the first machine learning algorithm that we learned, linear regression. So we'll go ahead and type in from sklearn.linear_model import LinearRegression. Here linear_model is the family, and LinearRegression is the model. And then, we need to instantiate that model to implement any algorithm using Scikit-learn. 

We will be implementing the Classification model. I am taking the Iris data set, one of the oldest data sets, and carrying out easy supervised learning tasks. This data set contains three classes of 50 instances, each of which refers to a type of iris Iris-Setosa Iris-Virginica and Iris-Versicolor. It has a robust measurement among all the species: the Sepal length, the Sepal width, the Petal length, and the Petal width. These are the four features from each sample.

So now, let's pick a classification model. In classification, we have different algorithms. We have decision trees, Random forest, Naive Bayes classifier, and many more. Here I will discuss the support vector machine, also called the SVM classifier. SVM is a supervised machine learning algorithm used for both classification and regression challenges. In general, SVM is used for classification problems, so it defines a hyperplane that can split the data optimally. There is a wide margin between the hyperplane and observation. 

 

 

 

Sourcelink

 

SVM tries to define a hyperplane that will segregate data into different categories. The data present on the right side of the hyperplane belong to a different type, whereas the data on the left side of the hyperplane belong to some other category. 

We have to use an SVM classifier on the Iris data set and create a model which classifies the flowers based on the feature.

 

Implementation

#importing the essential modules

from sklearn import svm
from sklearn import datasets

# Loading iris datasets

iris = datasets.load_iris()

# let's print the type of the iris
print(type(iris))
 
<class 'sklearn.utils.Bunch'>

#The datatype of the iris is 'Bunch .'This Bunch contains the iris datasets and all the attributes. 
#One of the attributes is 'data.' 'data' contains all the specifications of the iris datasets.

print(iris.data)


Output:
[[5.2 3.9 1.2 0.2]
[4.9 3.1  2.4 0.1]
[4.7 3.0 1.3 0.2]
[4.9 3.1 2.5 0.2]
[5.1  3.7 1.2 0.2]
[5.4 3.8 1.7 0.3]

#As we can see, four columns are denoting, Sepal Length, Sepal Width, Petal Length, and Petal width
# Let's get the feature names

print(iris.feature_names)


Output:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

# There is one more attribute called 'Target .'Target is what we are going to predict

print(iris.target)


Output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

# 0 denotes setosa, 1 denotes versicolor, and 2 represents virginica 


print(iris.target_names)


Output:
['setosa' 'versicolor' 'virginica']

# Create different arrays for storing the Dependent and Independent variables

X = iris.data
Y = iris.target

# splitting the datasets into train data and test data
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2, random_state=4)

# Now we need to use SVM to build a classifier which helps to classify whenever we provide new data.

model= svm.SVC(kernel = 'linear')

# We have to fit our model and pass in the parameters independent variable,i.e., X_train and the dependent variable Y-train

model.fit(X_train, Y_train)

# The model is ready with us and we can predict the class based on the test dataset

accuracy = model.score(X_test, Y_test)
print(accuracy)


Output:
0.9666666666666667

# The accuracy score is 96%.

Y_predicted = model.predict(X_test)
print('The predicted values are',Y_predicted)
print('The actual values are',Y_test)


Output:
The predicted values are [2 0 2 2 2 1 2 0 0 2 0 0 0 1 2 0 1 0 0 2 0 2 1 0 0 0 0 0 0 2]
The actual values are [2 0 2 2 2 1 1 0 0 2 0 0 0 1 2 0 1 0 0 2 0 2 1 0 0 0 0 0 0 2]

 

Here are the main ways the Scikit-learn library is used.

Classification: The classification tools identify the category associated with the provided data. For example, classifying images and texts on social media platforms whether they are appropriate or not, they can be used to categorise email messages as either spam or not.

Classification algorithms in Scikit-learn include:

  • Support vector machines (SVMs)
  • Nearest neighbours
  • Random forest

The most famous model used in classification is SVM as it finds the best-suited path for the classification model. If You want to work on the model of SVM here is the official link for the support vector machine. The below graph shows the difference between a simple classifier and SVM. It finds the best possible classifying line for the data.

Regression: It involves creating a model that tries to comprehend the relationship between input and output data. Regression models have a tendency to bend their curves for a better approach for problem-solving. For example, regression tools can be used to understand the behaviour of stock prices.

Regression algorithms include:

  • SVMs
  • Ridge regression
  • Lasso

Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). Lasso is considered as important model for the regression problems. The graph for the regression models is below and the official link is here.

Clustering: The Scikit-learn clustering tools are used to automatically group data with the same characteristics into sets. Classifying groups in data models are basically something that clustering deals with, for example, customer data can be segmented based on their localities.

Clustering algorithms include:

  • K-means
  • Spectral clustering
  • Mean-shift

Let’s explore K-Means

Clustering is a Machine Learning technique that involves the grouping of data points. Given a set of data points, we can use a clustering algorithm to classify each data point into a specific group.

The above graph shows three clusters of separate data with specific properties K-means is a scikit module that takes raw input of the mathematical form of data and binds them in clusters. It is basically a type of unsupervised learning method. An unsupervised learning method is a method in which we draw references from datasets consisting of input data without labelled responses. Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups. It is basically a collection of objects on the basis of similarity and dissimilarity between them.

Dimensional Reduction: It lowers the number of random variables for analysis. For example, to increase the efficiency of visualisations, outlying data may not be considered.

Dimensionality reduction algorithms include:

  • Missing Values Ratio
  • Low Variance Filter
  • High Correlation Filter
  • Random Forests/Ensemble Trees
  • Principal Component Analysis (PCA)
  • Backward Feature Elimination
  • Forward Feature Construction

The number of input variables or features for a dataset is referred to as its dimensionality. Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset. More input features often make a predictive modelling task more challenging to model, more generally referred to as the curse of dimensionality.

High-dimensionality statistics and dimensionality reduction techniques are often used for data visualisation. Nevertheless, these techniques can be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.

Dimension reduction is generally performed to keep the important information only and curve the memory use for the dataset.

Model selection Tools: Model selection algorithms offer tools to compare, validate, and select the best parameters and models to use in your data science projects. Model selection modules that can deliver enhanced accuracy through parameter tuning include:

  • Grid search
  • Cross-validation
  • Metrics
  • K-fold

We can’t perform every machine learning algorithm on the same model as the data differs the need to change the model also comes into action so we take the help of the sklearn model selection tools, for example, I take Cross-validation module using this module we break the train data into segments and try to perform the classifier defined on the data and it gives the score for the fragments by this you can easily get an idea of how good your model is performing. Below is the flowchart explanation for K-fold.

Preprocessing: The Scikit-learn preprocessing tools are important in feature extraction and normalisation during data analysis. For example, you can use these tools to transform input data—such as text—and apply their features in your analysis.

Preprocessing modules include:

  • Preprocessing
  • Feature extraction
  • Scaling of data
  • Dealing with outliers
  • Data Exploration (EDA)

If there is much irrelevant and redundant information present or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data preprocessing includes cleaning, Instance selection, normalization, transformation, feature extraction and selection, etc. The product of data preprocessing is the final training set.

Frequently Asked Questions 

 

1) What are the features of Scikit-learn?

  • Scikit-learn has various inbuilt dummy data sets upon which we can work.
  • Scikit learn contains various inbuilt machine learning algorithms which we can directly use in our program. 

 

2) What is the transform method in Scikit-learn?

=> The transform()  method in Scikit-learn is used to transform the data according to the fitted model.

 

3) What is the fit_transform method in Scikit-learn?

=> The fit_transform way in Scikit-learn does both the fitting and the transformation of the data according to the fitted model.

 

4) What is the fit method used in Scikit-learn?

=> The fit() method in Scikit-learn is used for generating a learning model from training data. The fit() model is used to train the model.

 

5) What are the cons of using Scikit-learn?

  • Scikit-learn is not very optimized in graph algorithm
  • Scikit-learn is not very good at string processing
  • Scikit-learn does not have a solid linear algebra library 

 

Conclusion

 

If you are interested in learning the Machine learning algorithms and the various interesting libraries in great detail, you must visit here.

(Don't forget to check: Logistic Regression on Iris datasets)

Next article
Must Know Functions in Scikit-Learn
Guided path
Free
gridgp-icon
Basics of Machine Learning
9 chapters
29+ Problems
gp-badge
Earn badges and level up
Live masterclass