Table of contents
1. Introduction
2. Pre-Requisites
3. Steps for data pre-processing
4. Model Building
5. FAQs
6. Key Takeaways
Last Updated: Mar 27, 2024

Data Preprocessing and Building ML Models

Introduction

 

In a predictive modeling task, machine learning algorithms learn a mapping from input variables to a target variable. The most prevalent kind of predictive modeling project involves structured data, often known as tabular data: data in the form of a spreadsheet or matrix, with rows of examples and columns of features for each example.

We can't fit and assess machine learning algorithms on raw data; instead, we have to alter the data to satisfy the needs of the particular algorithms. Furthermore, to acquire the highest performance given our available resources on a predictive modeling project, we must choose a data representation that optimally exposes the unknown underlying structure of the prediction problem to the learning algorithms.

Fitting models has become routine now that we have standard implementations of highly parameterized machine learning algorithms in open-source libraries. As a result, the toughest aspect of any predictive modeling project is preparing the one unique ingredient: the data used for modeling.

Pre-Requisites

The majority of machine learning algorithms require clean datasets as input. These are the datasets you'll use to train and test, divided into X_train, y_train and X_test, y_test.

However, before getting to the clean dataset, we must first perform some significant operations on the raw input data to arrive at a usable dataset.

Let's discuss the naming pattern we are going to follow:

  • df — the variable name for the data frame
  • X_train — the training dataset (x-variables / features)
  • y_train — the training dataset (y-variable / target variable)
  • X_test — the test dataset (x-variables / features)
  • y_test — the test dataset (y-variable / target variable)

Steps for data pre-processing

 

1. List and Check Data Types

  • What type of information is contained in each column? Is every column labeled with the appropriate data type? Change the data types as necessary.
df.dtypes    # data types of the columns in the data frame
  • Is there anything stored as numeric that should really be categorical, or vice versa? ZIP codes, for example, are commonly stored as numeric, but they are really categories (see the sketch below).
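A minimal sketch of such a conversion, assuming a hypothetical zip_code column that pandas has read in as numeric:

df['zip_code'] = df['zip_code'].astype('category')   # treat ZIP codes as categories, not numbers
print(df.dtypes)                                     # confirm the new dtype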

2. Checking your column names

  • Are there any long names, special characters, spaces, or other odd characters in the column names? If so, should the column names be updated? Removing spaces from column names, in particular, makes the columns easier to work with (see the sketch below).
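A minimal cleanup sketch using pandas string methods (the exact rules, such as lower-casing and replacing spaces with underscores, are just one reasonable convention):

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')   # standardize column names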

3. Checking for missing values

  • Are there any missing values in any of your features?
df.isnull().sum() # to check
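Once you know where the gaps are, two common (but not the only) options are dropping the affected rows or imputing values; which is appropriate depends on your data:

df_dropped = df.dropna()                            # option 1: drop rows with any missing value
df_filled = df.fillna(df.mean(numeric_only=True))   # option 2: fill numeric gaps with the column mean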

4. Irrelevant data 

  • Do you have any erroneous data in your datasets (e.g., '?' or '-' used to represent a null value)? Replacing such placeholders, as sketched below, can make the data easier to work with. Are there any zeros or negative numbers in your dataset where only natural numbers make sense?
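A minimal sketch of replacing such placeholders with proper nulls, assuming '?' and '-' are the offending values in your data:

import numpy as np
df = df.replace(['?', '-'], np.nan)   # turn placeholder strings into real missing values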
     

5. Data Scaling

  • Does your data need scaling or normalization across features, and if so, do you need standard (z-score) or min-max scaling?
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)   # standardize each feature to zero mean and unit variance
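If min-max scaling suits your data better, scikit-learn's MinMaxScaler is used the same way as the snippet above:

from sklearn.preprocessing import MinMaxScaler
X_scaled = MinMaxScaler().fit_transform(X)   # rescale each feature to the [0, 1] range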

6. Checking for outliers

  • Are there any outliers? You can check by drawing box plots to visualize them, then verify whether the extreme values are genuine or errors, and set up methods to treat them accordingly (see the sketch below).
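As a sketch, a box plot plus the interquartile range (IQR) rule is a common way to flag outliers; the example below assumes a hypothetical numeric column named col, and the 1.5 multiplier is a convention rather than a hard rule:

import matplotlib.pyplot as plt

df.boxplot(column='col')   # visualize the spread and potential outliers
plt.show()

q1, q3 = df['col'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['col'] < q1 - 1.5 * iqr) | (df['col'] > q3 + 1.5 * iqr)]   # flagged rows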

7. Categorical Encoding

  • Check whether any columns are categorical. If so, which encoding would you prefer: one-hot (dummy) encoding or label encoding? A sketch of both follows below.
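A minimal sketch of both options, assuming a hypothetical categorical column named color:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.get_dummies(df, columns=['color'])                  # one-hot (dummy) encoding: one binary column per category
# df['color'] = LabelEncoder().fit_transform(df['color'])   # label encoding: one integer code per category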

 

8. Feature reduction step

  • Are there any features that can be dropped? What criteria will you use to decide which features to include in your model?
  • Have you considered the possibility of multicollinearity? Some algorithms perform poorly when x-variables are closely correlated; a detection sketch follows below.
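A quick sketch for spotting closely related x-variables, assuming X is your feature DataFrame; the correlation matrix is the simplest check, and statsmodels provides the VIF:

from statsmodels.stats.outliers_influence import variance_inflation_factor

print(X.corr())   # pairwise correlations; values near ±1 suggest multicollinearity
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]   # VIF per feature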

Model Building

 

The use of machine learning algorithms to model data has become commonplace. Most algorithms are well understood and parameterized, and open-source software such as Python's scikit-learn machine learning library provides standard definitions and implementations.
 

1. Linear regression
from sklearn.linear_model import LinearRegression
lxr = LinearRegression()
lxr.fit(X_train, y_train)


The first line imports the LinearRegression() class from the sklearn.linear_model submodule. LinearRegression() is then assigned to the lxr variable, and the .fit() method performs the actual model training on the input data X_train and y_train.
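After fitting, the trained model can score unseen examples; for instance, predicting on the test set and checking the R² score:

from sklearn.metrics import r2_score

y_pred = lxr.predict(X_test)      # predictions for the test features
print(r2_score(y_test, y_pred))   # coefficient of determination on the test set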
 

2. Random forest

Random forest (RF) is an ensemble learning method in which the predictions of numerous decision trees are combined. RF's built-in feature importance is one of its best attributes.
 

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(max_depth=2, random_state=42)
rf.fit(X_train, y_train)

 

It's worth noting that RandomForestRegressor is the regression version (used when the Y variable is made up of numerical values), while RandomForestClassifier is the classification version (used when the Y variable contains categorical values).
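A short sketch of reading the built-in feature importance mentioned above, assuming X_train is a DataFrame so column names are available:

import pandas as pd

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))   # most influential features first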

 

3. Other models

Let's look at a few examples of regressors that we can use:

  • sklearn.linear_model.Ridge
  • sklearn.linear_model.SGDRegressor
  • sklearn.ensemble.ExtraTreesRegressor
  • sklearn.ensemble.GradientBoostingRegressor
  • sklearn.neighbors.KNeighborsRegressor
  • sklearn.neural_network.MLPRegressor
  • sklearn.tree.DecisionTreeRegressor
  • sklearn.tree.ExtraTreeRegressor
  • sklearn.svm.LinearSVR
  • sklearn.svm.SVR
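
Because all of these regressors share the same scikit-learn fit/predict interface, swapping them in and out is straightforward. A minimal comparison sketch with two of them:

from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

for model in [Ridge(), KNeighborsRegressor()]:
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))   # R² on the held-out test set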

 

FAQs

  1. What does the term feature engineering mean?
    Features are the fundamental attributes of a prediction task that influence the outcome. Creating new features, transforming existing ones, and encoding features is known as feature engineering.
     
  2. What are dummy variables?
    In regression analysis, categorical columns must be converted to binary variables. These binary variables are called dummy variables.
     
  3. What is heteroscedasticity?
    Heteroscedasticity means that the variability of a variable is unequal across the range of values of a second variable that predicts it.
     
  4. How are you going to deal with multicollinearity?
    Multicollinearity can be detected using the correlation coefficient, the variance inflation factor (VIF), or eigenvalues; the correlation coefficient is the simplest to calculate. Highly correlated features can then be dropped or combined.
     
  5. What exactly do you mean when you say "data transformation"?
    Data transformation consolidates or aggregates your data columns. It can have an impact on the performance of your machine learning model.

Key Takeaways

This is a beginner's guide to creating checklists for the data pre-processing and model-building steps. There are a few other items you may add to your own checklist, which you can find here.

 
