Steps for data pre-processing
1. Listing and checking data types
- What type of information does each column contain? Is every column labeled with the appropriate data type? Change the data types as necessary.
df.dtypes  # data types of the columns in the DataFrame
- Is anything stored as numeric that is really categorical? ZIP codes, for example, are commonly stored as numbers, but they are categories.
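As a sketch, here is how a ZIP-code column read in as an integer can be converted to text (the DataFrame and column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame: 'zip_code' is read in as an integer,
# but it is a category, not a quantity.
df = pd.DataFrame({"zip_code": [10001, 94105], "price": [250.0, 410.5]})

print(df.dtypes)                             # zip_code shows as int64 here
df["zip_code"] = df["zip_code"].astype(str)  # treat ZIP codes as categorical text
```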
2. Checking your column names
- Do the column names contain long names, special characters, or spaces? If so, is it necessary to rename them? Removing spaces from the column names also makes the columns easier to work with.
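One possible clean-up pass, shown on a hypothetical frame: lower-case the names, strip special characters, and replace spaces with underscores.

```python
import pandas as pd

# Hypothetical frame with long, spaced, special-character column names.
df = pd.DataFrame({"Total Price ($)": [10, 20], "Customer  Name": ["A", "B"]})

# Lower-case, drop symbols, trim, and turn runs of spaces into underscores.
df.columns = (
    df.columns
      .str.lower()
      .str.replace(r"[^a-z0-9 ]", "", regex=True)
      .str.strip()
      .str.replace(r"\s+", "_", regex=True)
)
print(df.columns.tolist())
```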
3. Checking for missing values
- Are there any missing values in any of your features?
df.isnull().sum()  # count missing values in each column
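Two common treatments, sketched on a hypothetical frame: fill numeric gaps with the median, and drop rows that are still missing a required field.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both a numeric and a text column.
df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["NY", "LA", None]})

print(df.isnull().sum())  # per-column missing counts

# Fill the numeric gap with the column median, then drop rows
# that lack the required 'city' field.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["city"])
```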
4. Irrelevant data
- Do you have any erroneous data in your datasets (e.g., '?' or '-' used to represent a null value)? Replacing such placeholders can make the data easier to work with. Are there zeros or negative numbers in your data set where only natural numbers make sense?
5. Data Scaling
- Does your data need scaling or normalization across features, and if so, do you need standard (z-score) or min-max scaling?
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
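The min-max alternative, sketched on a small hypothetical feature matrix, maps every column to the [0, 1] interval instead of standardizing to zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix with very different column ranges.
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# Min-max scaling maps each column onto [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)
print(X_minmax)
```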
6. Checking for outliers
- Are there any outliers? You can check by drawing a box plot and visualizing them. Then verify whether they are genuine values or data errors, and if they are genuine, set up methods to treat them.
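The IQR rule behind a box plot can also be applied directly; this sketch (on a hypothetical series) flags points outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]:

```python
import pandas as pd

# Hypothetical series with one extreme value.
s = pd.Series([10, 12, 11, 13, 12, 95])

# A box plot (s.plot.box()) draws these same boundaries;
# here we compute them explicitly.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())
```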
7. Categorical encoding
- Check whether any columns are categorical. If so, which encoding would you prefer: one-hot (dummy) encoding or label encoding?
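Both options side by side, on a hypothetical column: one-hot encoding produces one binary column per category, while label encoding assigns one integer code per category.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "blue", "red"]})

# One-hot (dummy) encoding: one binary column per category.
dummies = pd.get_dummies(df["color"], prefix="color")

# Label encoding: one integer code per category (alphabetical here).
df["color_code"] = df["color"].astype("category").cat.codes
print(dummies.columns.tolist(), df["color_code"].tolist())
```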
8. Feature reduction step
- Are there any features that can be dropped? Finally, what criteria would you use to choose which features to include in your model?
- Have you considered the possibility of multicollinearity? (This matters for algorithms that do not tolerate closely related x-variables.)
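A quick multicollinearity screen is to look for near-1 absolute correlations between pairs of x-variables; this sketch builds hypothetical features where one is nearly a linear copy of another:

```python
import numpy as np
import pandas as pd

# Hypothetical features: x2 is almost an exact linear copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.01, size=100),
    "x3": rng.normal(size=100),
})

# |correlation| close to 1 between two x-variables signals multicollinearity.
corr = df.corr()
print(corr.round(2))
```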
Model Building
Modeling data with machine learning algorithms has become commonplace. Most algorithms are well understood and parameterized, and open-source software such as Python's scikit-learn machine learning library provides standard definitions and implementations.
1. Linear regression
from sklearn.linear_model import LinearRegression
lxr = LinearRegression()
lxr.fit(X_train, y_train)
The first line imports the LinearRegression class from the sklearn.linear_model submodule. LinearRegression() is then assigned to the lxr variable, and the .fit() method does the actual model training on the input data X_train and y_train.
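After fitting, .predict() applies the model to new inputs. The snippet below is self-contained, standing in toy data for X_train and y_train (y = 3x + 1 exactly):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data standing in for X_train / y_train: y = 3x + 1 exactly.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 3 * X_train.ravel() + 1

lxr = LinearRegression()
lxr.fit(X_train, y_train)

# .predict() applies the fitted model to unseen inputs.
print(lxr.predict([[4.0]]))
```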
2. Random forest
Random forest (RF) is an ensemble learning method that combines the predictions of numerous decision trees. RF's built-in feature importance is one of its best attributes.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(max_depth=2, random_state=42)
rf.fit(X_train, y_train)
It's worth noting that RandomForestRegressor is the regression version (i.e., where the Y variable is made up of numerical values), while RandomForestClassifier is the classification version (i.e., this is used when the Y variable contains categorical values).
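The built-in importances mentioned above are exposed as feature_importances_ after fitting. In this sketch on toy data, the target depends only on the first of three features, so that feature should dominate:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: the target depends only on the first of three features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(max_depth=2, random_state=42)
rf.fit(X, y)

# Importances sum to 1; the informative feature gets the largest share.
print(rf.feature_importances_)
```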
3. Other models
Let's look at a few examples of regressors that we can use:
- sklearn.linear_model.Ridge
- sklearn.linear_model.SGDRegressor
- sklearn.ensemble.ExtraTreesRegressor
- sklearn.ensemble.GradientBoostingRegressor
- sklearn.neighbors.KNeighborsRegressor
- sklearn.neural_network.MLPRegressor
- sklearn.tree.DecisionTreeRegressor
- sklearn.tree.ExtraTreeRegressor
- sklearn.svm.LinearSVR
- sklearn.svm.SVR
FAQs
- What does the term feature engineering mean?
Features are the fundamental aspects of any forecast that influence the outcome. Feature engineering is the creation of new features, the transformation of existing ones, and the encoding of features.
- What are dummy variables?
In regression analysis, all categorical columns must be converted to binary variables. These binary variables are called dummy variables.
- What is heteroscedasticity?
Heteroscedasticity means that the variability of a variable is unequal across the range of values of a second variable that predicts it.
- How are you going to deal with multicollinearity?
The correlation coefficient, the variance inflation factor (VIF), and eigenvalues can all be used to discover multicollinearity. The correlation coefficient is the simplest to calculate.
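As a sketch of the VIF check, one can use the identity that the VIFs are the diagonal of the inverse of the correlation matrix of the x-variables (the data here is hypothetical; values above roughly 5-10 are commonly taken to signal a problem):

```python
import numpy as np
import pandas as pd

# Hypothetical features: x2 nearly duplicates x1, x3 is independent.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

# VIFs = diagonal of the inverse correlation matrix of the x-variables.
vif = np.diag(np.linalg.inv(df.corr().to_numpy()))
print(dict(zip(df.columns, vif.round(1))))
```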
- What exactly do you mean when you say "data transformation"?
Data transformation consolidates or aggregates your data columns. It can affect the performance of your machine learning model.
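Aggregation is one common transformation; this sketch collapses hypothetical transaction rows into one row per customer:

```python
import pandas as pd

# Hypothetical transaction data aggregated to one row per customer.
df = pd.DataFrame({"customer": ["A", "A", "B"], "amount": [10, 30, 25]})

# Per-customer totals: a simple aggregation transformation.
totals = df.groupby("customer", as_index=False)["amount"].sum()
print(totals)
```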
Key Takeaways
This is a beginner's guide to creating checklists for the data pre-processing and model-building steps. There are a few other items you may want to add to your own checklist.