**Basic Guides**

Every algorithm is exposed In Scikit-learn via an estimator. So first, we need to import the model. The general form to do this is by typing in from sklearn.family import Model.

Let's talk about the first machine learning algorithm that we learned, linear regression. So we'll go ahead and type in **from sklearn.linear_model import LinearRegression.** Here linear_model is the family, and LinearRegression is the model. And then, we need to instantiate that model to implement any algorithm using Scikit-learn.

We will be implementing the Classification model. I am taking the** Iris data set**, one of the oldest data sets, and carrying out easy supervised learning tasks. This data set contains three classes of 50 instances, each of which refers to a type of iris **Iris-Setosa Iris-Virginica and Iris-Versicolor**. It has a robust measurement among all the species: **the Sepal length, the Sepal width, the Petal length, and the Petal width**. These are the four features from each sample.

So now, let's pick a classification model. In classification, we have different algorithms. We have decision trees, Random forest, Naive Bayes classifier, and many more. Here I will discuss the support vector machine, also called the **SVM classifier**. SVM is a supervised machine learning algorithm used for both classification and regression challenges. In general, SVM is used for classification problems, so it defines a hyperplane that can split the data optimally. There is a wide margin between the hyperplane and observation.

**Source**- __link__

SVM tries to define a hyperplane that will segregate data into different categories. The data present on the right side of the hyperplane belong to a different type, whereas the data on the left side of the hyperplane belong to some other category.

We have to use an SVM classifier on the Iris data set and create a model which classifies the flowers based on the feature.

**Implementation**

```
#importing the essential modules
from sklearn import svm
from sklearn import datasets
# Loading iris datasets
iris = datasets.load_iris()
# let's print the type of the iris
print(type(iris))
<class 'sklearn.utils.Bunch'>
#The datatype of the iris is 'Bunch .'This Bunch contains the iris datasets and all the attributes.
#One of the attributes is 'data.' 'data' contains all the specifications of the iris datasets.
print(iris.data)
Output:
[[5.2 3.9 1.2 0.2]
[4.9 3.1 2.4 0.1]
[4.7 3.0 1.3 0.2]
[4.9 3.1 2.5 0.2]
[5.1 3.7 1.2 0.2]
[5.4 3.8 1.7 0.3]
#As we can see, four columns are denoting, Sepal Length, Sepal Width, Petal Length, and Petal width
# Let's get the feature names
print(iris.feature_names)
Output:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
# There is one more attribute called 'Target .'Target is what we are going to predict
print(iris.target)
Output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
# 0 denotes setosa, 1 denotes versicolor, and 2 represents virginica
print(iris.target_names)
Output:
['setosa' 'versicolor' 'virginica']
# Create different arrays for storing the Dependent and Independent variables
X = iris.data
Y = iris.target
# splitting the datasets into train data and test data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2, random_state=4)
# Now we need to use SVM to build a classifier which helps to classify whenever we provide new data.
model= svm.SVC(kernel = 'linear')
# We have to fit our model and pass in the parameters independent variable,i.e., X_train and the dependent variable Y-train
model.fit(X_train, Y_train)
# The model is ready with us and we can predict the class based on the test dataset
accuracy = model.score(X_test, Y_test)
print(accuracy)
Output:
0.9666666666666667
# The accuracy score is 96%.
Y_predicted = model.predict(X_test)
print('The predicted values are',Y_predicted)
print('The actual values are',Y_test)
Output:
The predicted values are [2 0 2 2 2 1 2 0 0 2 0 0 0 1 2 0 1 0 0 2 0 2 1 0 0 0 0 0 0 2]
The actual values are [2 0 2 2 2 1 1 0 0 2 0 0 0 1 2 0 1 0 0 2 0 2 1 0 0 0 0 0 0 2]
```

**Here are the main ways the Scikit-learn library is used.**

**Classification: **The classification tools identify the category associated with the provided data. For example, classifying images and texts on social media platforms whether they are appropriate or not, they can be used to categorise email messages as either spam or not.

Classification algorithms in Scikit-learn include:

- Support vector machines (SVMs)
- Nearest neighbours
- Random forest

The most famous model used in classification is SVM as it finds the best-suited path for the classification model. If You want to work on the model of SVM here is the official link for the support vector machine. The below graph shows the difference between a simple classifier and SVM. It finds the best possible classifying line for the data.

**Regression: **It** **involves creating a model that tries to comprehend the relationship between input and output data. Regression models have a tendency to bend their curves for a better approach for problem-solving. For example, regression tools can be used to understand the behaviour of stock prices.

Regression algorithms include:

- SVMs
- Ridge regression
- Lasso

**Lasso regression** is a type of linear **regression** that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The **lasso** procedure encourages simple, sparse models (i.e. models with fewer parameters). Lasso is considered as important model for the regression problems. The graph for the regression models is below and the official link is here.

**Clustering: **The Scikit-learn clustering tools are used to automatically group data with the same characteristics into sets. Classifying groups in data models are basically something that clustering deals with, for example, customer data can be segmented based on their localities.

Clustering algorithms include:

- K-means
- Spectral clustering
- Mean-shift

Let’s explore K-Means

**Clustering** is a Machine Learning technique that involves the grouping of data points. Given a set of data points, we can use a **clustering algorithm** to classify each data point into a specific group.

The above graph shows three clusters of separate data with specific properties K-means is a scikit module that takes raw input of the mathematical form of data and binds them in clusters. It is basically a type of unsupervised learning method. An unsupervised learning method is a method in which we draw references from datasets consisting of input data without labelled responses. Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features and groupings inherent in a set of examples.

**Clustering** is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups. It is basically a collection of objects on the basis of similarity and dissimilarity between them.

**Dimensional Reduction: **It lowers the number of random variables for analysis. For example, to increase the efficiency of visualisations, outlying data may not be considered.

Dimensionality reduction algorithms include:

- Missing Values Ratio
- Low Variance Filter
- High Correlation Filter
- Random Forests/Ensemble Trees
- Principal Component Analysis (PCA)
- Backward Feature Elimination
- Forward Feature Construction

The number of input variables or features for a dataset is referred to as its dimensionality. Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset. More input features often make a predictive modelling task more challenging to model, more generally referred to as the curse of dimensionality.

High-dimensionality statistics and dimensionality reduction techniques are often used for data visualisation. Nevertheless, these techniques can be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.

Dimension reduction is generally performed to keep the important information only and curve the memory use for the dataset.

**Model selection Tools**: Model selection algorithms offer tools to compare, validate, and select the best parameters and models to use in your data science projects. Model selection modules that can deliver enhanced accuracy through parameter tuning include:

- Grid search
- Cross-validation
- Metrics
- K-fold

We can’t perform every machine learning algorithm on the same model as the data differs the need to change the model also comes into action so we take the help of the sklearn model selection tools, for example, I take Cross-validation module using this module we break the train data into segments and try to perform the classifier defined on the data and it gives the score for the fragments by this you can easily get an idea of how good your model is performing. Below is the flowchart explanation for K-fold.

**Preprocessing**: The Scikit-learn preprocessing tools are important in feature extraction and normalisation during data analysis. For example, you can use these tools to transform input data—such as text—and apply their features in your analysis.

Preprocessing modules include:

- Preprocessing
- Feature extraction
- Scaling of data
- Dealing with outliers
- Data Exploration (EDA)

If there is much irrelevant and redundant information present or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data preprocessing includes cleaning, Instance selection, normalization, transformation, feature extraction and selection, etc. The product of data preprocessing is the final training set.

## F**requently Asked Questions **

**1) What are the features of Scikit-learn?**

- Scikit-learn has various inbuilt dummy data sets upon which we can work.
- Scikit learn contains various inbuilt machine learning algorithms which we can directly use in our program.

**2) What is the transform method in Scikit-learn?**

=> The transform() method in Scikit-learn is used to transform the data according to the fitted model.

**3) What is the fit_transform method in Scikit-learn?**

=> The fit_transform way in Scikit-learn does both the fitting and the transformation of the data according to the fitted model.

**4) What is the fit method used in Scikit-learn?**

=> The fit() method in Scikit-learn is used for generating a learning model from training data. The fit() model is used to train the model.

**5) What are the cons of using Scikit-learn?**

- Scikit-learn is not very optimized in graph algorithm
- Scikit-learn is not very good at string processing
- Scikit-learn does not have a solid linear algebra library

## Conclusion

If you are interested in learning the Machine learning algorithms and the various interesting libraries in great detail, you must visit __here__.

(Don't forget to check: __Logistic Regression on Iris datasets__)