Assumptions of the Regression Model
Before building a Multiple Linear Regression model, it’s important to understand the assumptions it relies on. These assumptions ensure the model’s accuracy & reliability. If these assumptions are violated, the results may not be valid. Let’s discuss these one by one:
1. Linearity
The relationship between the independent variables & the dependent variable should be linear. This means that changes in the independent variables should result in proportional changes in the dependent variable. To check this, we can use scatter plots or residual plots.
2. Independence of Errors
The residuals (errors) should not be correlated with each other. In other words, there should be no pattern in the errors. This is often checked using the Durbin-Watson test.
3. Homoscedasticity
The residuals should have constant variance at every level of the independent variables. If the variance changes, it’s called heteroscedasticity. A residual vs. fitted value plot can help identify this.
4. Normality of Residuals
The residuals should be normally distributed, especially for small sample sizes. This can be checked using a Q-Q plot or a histogram of residuals.
5. No Multicollinearity
The independent variables should not be highly correlated with each other. High multicollinearity can make it difficult to interpret the model. This can be checked using the Variance Inflation Factor (VIF).
Let’s now implement these checks in Python. Below is a complete example:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.api import qqplot
import statsmodels.api as sm
# Example dataset
data = {
'Size': [1200, 1500, 1700, 2000, 2200],
'Bedrooms': [2, 3, 3, 4, 4],
'Age': [10, 5, 8, 2, 1],
'Price': [300000, 400000, 450000, 500000, 550000]
}
df = pd.DataFrame(data)
# Define independent (X) & dependent (y) variables
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']
# Add a constant to the independent variables (for statsmodels)
X = sm.add_constant(X)
# Fit the model
model = sm.OLS(y, X).fit()
# Check Linearity: Residual vs. Fitted plot
sns.residplot(x=model.fittedvalues, y=model.resid, lowess=True)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual vs. Fitted Plot')
plt.show()
# Check Normality: Q-Q plot
qqplot(model.resid, line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()
# Check Multicollinearity: VIF
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
In this code:
1. Linearity Check: The residual vs. fitted plot helps us see whether there is any pattern in the residuals. If they scatter randomly around zero with no visible curve or trend, the linearity assumption holds.
2. Normality Check: The Q-Q plot compares the distribution of the residuals to a normal distribution. If the points lie close to the reference line, the residuals are approximately normally distributed.
3. Multicollinearity Check: The VIF measures how much the variance of a coefficient is inflated due to multicollinearity. A VIF greater than 5 is commonly taken to indicate high multicollinearity (the large VIF reported for the added constant can be ignored).
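The example above covers the linearity, normality, and multicollinearity checks. The remaining two assumptions, independence of errors and homoscedasticity, can be checked with statsmodels as well; a minimal sketch, assuming the fitted `model` from the code above:
# Check Independence of Errors: Durbin-Watson statistic (values near 2 suggest no autocorrelation)
from statsmodels.stats.stattools import durbin_watson
print("Durbin-Watson statistic:", durbin_watson(model.resid))
# Check Homoscedasticity: Breusch-Pagan test (a small p-value suggests heteroscedasticity)
from statsmodels.stats.diagnostic import het_breuschpagan
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", bp_pvalue)
With a toy dataset of only five rows these tests are purely illustrative, but the same calls work unchanged on real data.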
Handling Categorical Data with Dummy Variables
When working with categorical variables, we must convert them into numerical values using dummy variables.
Example (this assumes `data` is a DataFrame that already contains a categorical column named `Category`):
# Handling categorical data
categorical_data = pd.get_dummies(data['Category'], drop_first=True)
# Merging with original data
data = pd.concat([data, categorical_data], axis=1)
data.drop(['Category'], axis=1, inplace=True)
This prevents the model from treating category labels as if they were ordered numbers, and dropping the first level avoids creating a redundant, perfectly collinear column.
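Because the snippet above assumes an existing DataFrame, here is a small self-contained sketch with a hypothetical `Category` column, showing what the encoding produces:
import pandas as pd
# Hypothetical data with a categorical 'Category' column
homes = pd.DataFrame({
    'Size': [1200, 1500, 1700],
    'Category': ['Apartment', 'House', 'Condo']
})
# One-hot encode and drop the first level to avoid a redundant column
dummies = pd.get_dummies(homes['Category'], drop_first=True)
homes = pd.concat([homes, dummies], axis=1).drop(columns=['Category'])
print(homes)  # columns: Size, Condo, House ('Apartment' becomes the baseline level)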
Multicollinearity in Multiple Linear Regression
Multicollinearity occurs when independent variables are highly correlated with each other, leading to unreliable coefficient estimates. High multicollinearity reduces the model’s interpretability.
Detecting Multicollinearity
We can detect multicollinearity using:
- Variance Inflation Factor (VIF): Measures how much the variance of a coefficient estimate is inflated because that predictor is correlated with the other predictors.
- Correlation Matrix: Identifies pairs of highly correlated variables (a sketch follows the VIF example below).
Example using VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculating VIF
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
A VIF value above 5 (or 10, by a looser rule of thumb) is generally taken to indicate problematic multicollinearity.
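The correlation matrix mentioned above can be checked just as easily; a minimal sketch, assuming `X` is the DataFrame of independent variables (without the added constant):
import seaborn as sns
import matplotlib.pyplot as plt
# Pairwise correlations between the independent variables
corr = X.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Independent Variables')
plt.show()
Pairs with a correlation close to +1 or -1 are candidates for removal or for being combined into a single feature.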
Assumptions of the Multiple Regression Model
- Linearity: The relationship between independent and dependent variables should be linear.
- Homoscedasticity: Residuals should have constant variance.
- No Multicollinearity: Independent variables should not be highly correlated with each other.
- Independence of Errors: Residuals should be independent.
- Normality of Residuals: Errors should be normally distributed.
Checking Homoscedasticity and Normality
import seaborn as sns
import matplotlib.pyplot as plt
# Residual plot (y_test and y_pred come from a fitted model, as in the implementation below)
sns.residplot(x=y_pred, y=(y_test - y_pred), lowess=True)
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.show()
If the residuals are scattered randomly around zero with roughly constant spread, homoscedasticity holds.
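The normality part of this check can be done on the same residuals; a minimal sketch, again assuming `y_test` and `y_pred` from a fitted model:
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import shapiro
residuals = y_test - y_pred
sm.qqplot(residuals, line='s')  # points close to the line suggest normality
plt.title('Q-Q Plot of Residuals')
plt.show()
stat, p_value = shapiro(residuals)  # Shapiro-Wilk test; a small p-value suggests non-normality
print("Shapiro-Wilk p-value:", p_value)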
Implementing Multiple Linear Regression Model in Python
Here is a complete implementation using a sample dataset.
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Load dataset
data = pd.read_csv('data.csv')
X = data[['Feature1', 'Feature2', 'Feature3']]
y = data['Target']
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict values
y_pred = model.predict(X_test)
# Evaluate performance
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
This implementation shows how to build a Multiple Linear Regression model in Python and evaluate its performance.
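After training, the fitted model can also be inspected and applied to new observations; a short sketch, assuming the `model` and the placeholder feature names from the code above:
# Inspect the learned equation: Target = intercept + sum(coefficient * feature)
print("Intercept:", model.intercept_)
print("Coefficients:", dict(zip(X.columns, model.coef_)))
# Predict for a new, hypothetical observation (the values are placeholders)
new_observation = pd.DataFrame({'Feature1': [10.0], 'Feature2': [5.0], 'Feature3': [2.0]})
print("Predicted Target:", model.predict(new_observation)[0])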
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the process of analyzing & summarizing datasets to understand their main characteristics. It helps us identify patterns, detect anomalies, & test assumptions. For Multiple Linear Regression, EDA involves understanding the relationships between variables, checking for missing values, & visualizing data distributions. Let’s break this down step by step:
1. Understanding the Dataset
Start by loading the dataset & examining its structure. Check the number of rows, columns, & data types.
2. Handling Missing Values
Missing data can affect the model’s performance. Identify missing values & decide how to handle them (e.g., removing rows or imputing values).
3. Descriptive Statistics
Calculate summary statistics like mean, median, standard deviation, & percentiles to understand the distribution of the data.
4. Data Visualization
Visualize the data using plots like histograms, scatter plots, & correlation matrices to identify trends & relationships.
Let’s implement EDA in Python using a sample dataset. Below is the complete code:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Example dataset
data = {
'Size': [1200, 1500, 1700, 2000, 2200, np.nan, 2500],
'Bedrooms': [2, 3, 3, 4, 4, 5, 5],
'Age': [10, 5, 8, 2, 1, 15, 20],
'Price': [300000, 400000, 450000, 500000, 550000, 600000, 650000]
}
df = pd.DataFrame(data)
# Step 1: Understanding the Dataset
print("Dataset Overview:")
print(df.head())  # Display the first 5 rows
print("\nDataset Info:")
df.info()  # Check data types & missing values
# Step 2: Handling Missing Values
print("\nMissing Values:")
print(df.isnull().sum())  # Check for missing values
df['Size'] = df['Size'].fillna(df['Size'].mean())  # Fill missing values with the column mean
print("\nAfter Handling Missing Values:")
print(df.isnull().sum())
# Step 3: Descriptive Statistics
print("\nDescriptive Statistics:")
print(df.describe())  # Summary statistics
# Step 4: Data Visualization
# Histogram for numerical columns
df.hist(bins=10, figsize=(10, 8))
plt.suptitle("Histograms of Numerical Columns")
plt.show()
# Scatter plot (pairplot) to check relationships
sns.pairplot(df)
plt.suptitle("Pairplot of Variables", y=1.02)
plt.show()
# Correlation matrix
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
In this code:
1. Understanding the Dataset: We use `head()` to see the first few rows & `info()` to check data types & missing values.
2. Handling Missing Values: We use `isnull().sum()` to identify missing values & `fillna()` to impute them with the column mean (the alternative, dropping incomplete rows, is sketched after this list).
3. Descriptive Statistics: The `describe()` function provides summary statistics like mean, median, & standard deviation.
4. Data Visualization:
- Histograms show the distribution of numerical columns.
- Scatter plots (pairplot) help visualize relationships between variables.
- A correlation matrix shows how variables are related to each other.
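As mentioned in step 2, dropping incomplete rows is the alternative to imputation; a minimal sketch that re-creates the DataFrame before the `fillna()` step above:
# Alternative to imputation: drop any row containing a missing value
df_raw = pd.DataFrame(data)  # re-create the DataFrame before imputation
df_dropped = df_raw.dropna()  # removes the row with the missing 'Size'
print("Rows before:", len(df_raw), "Rows after:", len(df_dropped))
Dropping rows is simplest when only a few values are missing; imputation keeps more data but introduces assumptions about the missing values.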
Model Building
Once we’ve performed Exploratory Data Analysis (EDA) & ensured that the assumptions of regression are met, the next step is to build the Multiple Linear Regression model. This involves splitting the data into training & testing sets, training the model, & evaluating its performance. Let’s understand this in detail:
1. Splitting the Data
We divide the dataset into two parts: a training set (used to train the model) & a testing set (used to evaluate the model). A common split ratio is 80% training & 20% testing.
2. Training the Model
We use the training data to fit the regression model. This involves finding the coefficients for each independent variable that minimize the error between the predicted & actual values.
3. Making Predictions
Once the model is trained, we use it to predict the dependent variable for the testing data.
4. Evaluating the Model
We evaluate the model’s performance using metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), & R-squared.
Let’s implement this in Python. Below is the complete code:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Example dataset
data = {
'Size': [1200, 1500, 1700, 2000, 2200, 2500, 2700],
'Bedrooms': [2, 3, 3, 4, 4, 5, 5],
'Age': [10, 5, 8, 2, 1, 15, 20],
'Price': [300000, 400000, 450000, 500000, 550000, 600000, 650000]
}
df = pd.DataFrame(data)
# Define independent (X) & dependent (y) variables
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']
# Step 1: Splitting the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: Training the Model
model = LinearRegression()
model.fit(X_train, y_train)
# Step 3: Making Predictions
y_pred = model.predict(X_test)
# Step 4: Evaluating the Model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Model Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2):", r2)
# Displaying Predictions vs. Actual Values
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print("\nPredictions vs. Actual Values:")
print(results)
In this code:
1. Splitting the Data: We use `train_test_split()` to divide the dataset into training & testing sets. Here, 80% of the data is used for training & 20% for testing.
2. Training the Model: We create an instance of `LinearRegression()` & use the `fit()` method to train the model on the training data.
3. Making Predictions: The `predict()` method is used to generate predictions for the testing data.
4. Evaluating the Model:
- Mean Squared Error (MSE): Measures the average squared difference between actual & predicted values.
- Root Mean Squared Error (RMSE): The square root of MSE, providing error in the same units as the dependent variable.
- R-squared (R2): Indicates the proportion of variance in the dependent variable that’s explained by the independent variables. (These metrics are computed by hand in a short sketch after the example output below.)
Output Example:
Model Coefficients: [ 200. 50000. -5000.]
Intercept: 100000.0
Mean Squared Error (MSE): 25000000.0
Root Mean Squared Error (RMSE): 5000.0
R-squared (R2): 0.95
Predictions vs. Actual Values:
Actual Predicted
0 500000 495000
1 650000 655000
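For intuition about what these metrics mean, they can also be computed directly with NumPy; a short sketch, assuming `y_test` and `y_pred` from the code above:
# Manual computation of MSE, RMSE, and R-squared
errors = y_test.to_numpy() - y_pred
mse_manual = np.mean(errors ** 2)  # average squared difference between actual & predicted
rmse_manual = np.sqrt(mse_manual)  # error in the same units as Price
ss_res = np.sum(errors ** 2)  # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot  # proportion of variance explained
print(mse_manual, rmse_manual, r2_manual)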
Frequently Asked Questions
What is Multiple Linear Regression used for?
Multiple Linear Regression is used for predicting an outcome based on multiple independent variables, such as house price prediction and medical diagnosis.
How do I handle categorical variables in Multiple Linear Regression?
Categorical variables should be converted into numerical values using dummy variables or one-hot encoding to be used in regression models.
What is the impact of multicollinearity in Multiple Linear Regression?
High multicollinearity leads to unstable, unreliable coefficient estimates and makes the model harder to interpret.
Conclusion
In this article, we discussed Multiple Linear Regression in Python, a powerful technique used to model relationships between one dependent variable and multiple independent variables. We walked through its implementation using libraries such as pandas, statsmodels, and scikit-learn, along with its importance in predictive analysis. Understanding multiple linear regression helps in making data-driven decisions across various domains.