Assumptions of the Regression Model
Before building a Multiple Linear Regression model, it’s important to understand the assumptions it relies on. These assumptions ensure the model’s accuracy & reliability. If these assumptions are violated, the results may not be valid. Let’s discuss these one by one:
1. Linearity
The relationship between the independent variables & the dependent variable should be linear. This means that changes in the independent variables should result in proportional changes in the dependent variable. To check this, we can use scatter plots or residual plots.
2. Independence of Errors
The residuals (errors) should not be correlated with each other. In other words, there should be no pattern in the errors. This is often checked using the Durbin-Watson test.
3. Homoscedasticity
The residuals should have constant variance at every level of the independent variables. If the variance changes, it’s called heteroscedasticity. A residual vs. fitted value plot can help identify this.
4. Normality of Residuals
The residuals should be normally distributed, especially for small sample sizes. This can be checked using a Q-Q plot or a histogram of residuals.
5. No Multicollinearity
The independent variables should not be highly correlated with each other. High multicollinearity can make it difficult to interpret the model. This can be checked using the Variance Inflation Factor (VIF).
Let’s now implement these checks in Python. Below is a complete example:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.api import qqplot
import statsmodels.api as sm
# Example dataset
data = {
'Size': [1200, 1500, 1700, 2000, 2200],
'Bedrooms': [2, 3, 3, 4, 4],
'Age': [10, 5, 8, 2, 1],
'Price': [300000, 400000, 450000, 500000, 550000]
}
df = pd.DataFrame(data)
# Define independent (X) & dependent (y) variables
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']
# Add a constant to the independent variables (for statsmodels)
X = sm.add_constant(X)
# Fit the model
model = sm.OLS(y, X).fit()
# Check Linearity: Residual vs. Fitted plot
sns.residplot(x=model.fittedvalues, y=model.resid, lowess=True)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual vs. Fitted Plot')
plt.show()
# Check Normality: Q-Q plot
qqplot(model.resid, line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()
# Check Multicollinearity: VIF
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
In this code:
1. Linearity Check: The residual vs. fitted plot helps us see whether there is any pattern in the residuals. If they scatter randomly around zero with no visible curve or trend, the linearity assumption holds.
2. Normality Check: The Q-Q plot compares the distribution of the residuals to a normal distribution. If the points lie close to the reference line, the residuals are approximately normally distributed.
3. Multicollinearity Check: The VIF measures how much the variance of a coefficient is inflated due to multicollinearity. A VIF greater than 5 is commonly taken to indicate high multicollinearity (the large VIF reported for the added constant can be ignored).
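The example above covers the linearity, normality, and multicollinearity checks. The remaining two assumptions, independence of errors and homoscedasticity, can be checked with statsmodels as well; a minimal sketch, assuming the fitted `model` from the code above:
# Check Independence of Errors: Durbin-Watson statistic (values near 2 suggest no autocorrelation)
from statsmodels.stats.stattools import durbin_watson
print("Durbin-Watson statistic:", durbin_watson(model.resid))
# Check Homoscedasticity: Breusch-Pagan test (a small p-value suggests heteroscedasticity)
from statsmodels.stats.diagnostic import het_breuschpagan
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", bp_pvalue)
With a toy dataset of only five rows these tests are purely illustrative, but the same calls work unchanged on real data.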
Handling Categorical Data with Dummy Variables
When working with categorical variables, we must convert them into numerical values using dummy variables.
Example (this assumes `data` is a DataFrame that already contains a categorical column named `Category`):
# Handling categorical data
categorical_data = pd.get_dummies(data['Category'], drop_first=True)
# Merging with original data
data = pd.concat([data, categorical_data], axis=1)
data.drop(['Category'], axis=1, inplace=True)
This prevents the model from treating category labels as if they were ordered numbers, and dropping the first level avoids creating a redundant, perfectly collinear column.
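Because the snippet above assumes an existing DataFrame, here is a small self-contained sketch with a hypothetical `Category` column, showing what the encoding produces:
import pandas as pd
# Hypothetical data with a categorical 'Category' column
homes = pd.DataFrame({
    'Size': [1200, 1500, 1700],
    'Category': ['Apartment', 'House', 'Condo']
})
# One-hot encode and drop the first level to avoid a redundant column
dummies = pd.get_dummies(homes['Category'], drop_first=True)
homes = pd.concat([homes, dummies], axis=1).drop(columns=['Category'])
print(homes)  # columns: Size, Condo, House ('Apartment' becomes the baseline level)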
Multicollinearity in Multiple Linear Regression
Multicollinearity occurs when independent variables are highly correlated with each other, leading to unreliable coefficient estimates. High multicollinearity reduces the model’s interpretability.
Detecting Multicollinearity
We can detect multicollinearity using:
- Variance Inflation Factor (VIF): Measures how much the variance of a coefficient estimate is inflated because that predictor is correlated with the other predictors.
- Correlation Matrix: Identifies pairs of highly correlated variables (a sketch follows the VIF example below).
Example using VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculating VIF
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
A VIF value above 5 (or 10, by a looser rule of thumb) is generally taken to indicate problematic multicollinearity.
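The correlation matrix mentioned above can be checked just as easily; a minimal sketch, assuming `X` is the DataFrame of independent variables (without the added constant):
import seaborn as sns
import matplotlib.pyplot as plt
# Pairwise correlations between the independent variables
corr = X.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Independent Variables')
plt.show()
Pairs with a correlation close to +1 or -1 are candidates for removal or for being combined into a single feature.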
Assumptions of the Multiple Regression Model
- Linearity: The relationship between independent and dependent variables should be linear.
- Homoscedasticity: Residuals should have constant variance.
- No Multicollinearity: Independent variables should not be highly correlated with each other.
- Independence of Errors: Residuals should be independent.
- Normality of Residuals: Errors should be normally distributed.
Checking Homoscedasticity and Normality
import seaborn as sns
import matplotlib.pyplot as plt
# Residual plot (y_test and y_pred come from a fitted model, as in the implementation below)
sns.residplot(x=y_pred, y=(y_test - y_pred), lowess=True)
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.show()
If the residuals are scattered randomly around zero with roughly constant spread, homoscedasticity holds.
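The normality part of this check can be done on the same residuals; a minimal sketch, again assuming `y_test` and `y_pred` from a fitted model:
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import shapiro
residuals = y_test - y_pred
sm.qqplot(residuals, line='s')  # points close to the line suggest normality
plt.title('Q-Q Plot of Residuals')
plt.show()
stat, p_value = shapiro(residuals)  # Shapiro-Wilk test; a small p-value suggests non-normality
print("Shapiro-Wilk p-value:", p_value)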
Implementing Multiple Linear Regression Model in Python
Here is a complete implementation using a sample dataset.
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Load dataset
data = pd.read_csv('data.csv')
X = data[['Feature1', 'Feature2', 'Feature3']]
y = data['Target']
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict values
y_pred = model.predict(X_test)
# Evaluate performance
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
This implementation shows how to build a Multiple Linear Regression model in Python and evaluate its performance.
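After training, the fitted model can also be inspected and applied to new observations; a short sketch, assuming the `model` and the placeholder feature names from the code above:
# Inspect the learned equation: Target = intercept + sum(coefficient * feature)
print("Intercept:", model.intercept_)
print("Coefficients:", dict(zip(X.columns, model.coef_)))
# Predict for a new, hypothetical observation (the values are placeholders)
new_observation = pd.DataFrame({'Feature1': [10.0], 'Feature2': [5.0], 'Feature3': [2.0]})
print("Predicted Target:", model.predict(new_observation)[0])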
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the process of analyzing & summarizing datasets to understand their main characteristics. It helps us identify patterns, detect anomalies, & test assumptions. For Multiple Linear Regression, EDA involves understanding the relationships between variables, checking for missing values, & visualizing data distributions. Let’s break this down step by step:
1. Understanding the Dataset
Start by loading the dataset & examining its structure. Check the number of rows, columns, & data types.
2. Handling Missing Values
Missing data can affect the model’s performance. Identify missing values & decide how to handle them (e.g., removing rows or imputing values).
3. Descriptive Statistics
Calculate summary statistics like mean, median, standard deviation, & percentiles to understand the distribution of the data.
4. Data Visualization
Visualize the data using plots like histograms, scatter plots, & correlation matrices to identify trends & relationships.
Let’s implement EDA in Python using a sample dataset. Below is the complete code:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Example dataset
data = {
'Size': [1200, 1500, 1700, 2000, 2200, np.nan, 2500],
'Bedrooms': [2, 3, 3, 4, 4, 5, 5],
'Age': [10, 5, 8, 2, 1, 15, 20],
'Price': [300000, 400000, 450000, 500000, 550000, 600000, 650000]
}
df = pd.DataFrame(data)
# Step 1: Understanding the Dataset
print("Dataset Overview:")
print(df.head())  # Display the first 5 rows
print("\nDataset Info:")
df.info()  # Check data types & missing values
# Step 2: Handling Missing Values
print("\nMissing Values:")
print(df.isnull().sum())  # Check for missing values
df['Size'] = df['Size'].fillna(df['Size'].mean())  # Fill missing values with the column mean
print("\nAfter Handling Missing Values:")
print(df.isnull().sum())
# Step 3: Descriptive Statistics
print("\nDescriptive Statistics:")
print(df.describe())  # Summary statistics
# Step 4: Data Visualization
# Histogram for numerical columns
df.hist(bins=10, figsize=(10, 8))
plt.suptitle("Histograms of Numerical Columns")
plt.show()
# Scatter plot (pairplot) to check relationships
sns.pairplot(df)
plt.suptitle("Pairplot of Variables", y=1.02)
plt.show()
# Correlation matrix
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
In this code:
1. Understanding the Dataset: We use `head()` to see the first few rows & `info()` to check data types & missing values.
2. Handling Missing Values: We use `isnull().sum()` to identify missing values & `fillna()` to impute them with the column mean (the alternative, dropping incomplete rows, is sketched after this list).
3. Descriptive Statistics: The `describe()` function provides summary statistics like mean, median, & standard deviation.
4. Data Visualization:
- Histograms show the distribution of numerical columns.
- Scatter plots (pairplot) help visualize relationships between variables.
- A correlation matrix shows how variables are related to each other.
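As mentioned in step 2, dropping incomplete rows is the alternative to imputation; a minimal sketch that re-creates the DataFrame before the `fillna()` step above:
# Alternative to imputation: drop any row containing a missing value
df_raw = pd.DataFrame(data)  # re-create the DataFrame before imputation
df_dropped = df_raw.dropna()  # removes the row with the missing 'Size'
print("Rows before:", len(df_raw), "Rows after:", len(df_dropped))
Dropping rows is simplest when only a few values are missing; imputation keeps more data but introduces assumptions about the missing values.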
Model Building
Once we’ve performed Exploratory Data Analysis (EDA) & ensured that the assumptions of regression are met, the next step is to build the Multiple Linear Regression model. This involves splitting the data into training & testing sets, training the model, & evaluating its performance. Let’s understand this in detail:
1. Splitting the Data
We divide the dataset into two parts: a training set (used to train the model) & a testing set (used to evaluate the model). A common split ratio is 80% training & 20% testing.
2. Training the Model
We use the training data to fit the regression model. This involves finding the coefficients for each independent variable that minimize the error between the predicted & actual values.
3. Making Predictions
Once the model is trained, we use it to predict the dependent variable for the testing data.
4. Evaluating the Model
We evaluate the model’s performance using metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), & R-squared.
Let’s implement this in Python. Below is the complete code:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Example dataset
data = {
'Size': [1200, 1500, 1700, 2000, 2200, 2500, 2700],
'Bedrooms': [2, 3, 3, 4, 4, 5, 5],
'Age': [10, 5, 8, 2, 1, 15, 20],
'Price': [300000, 400000, 450000, 500000, 550000, 600000, 650000]
}
df = pd.DataFrame(data)
# Define independent (X) & dependent (y) variables
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']
# Step 1: Splitting the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: Training the Model
model = LinearRegression()
model.fit(X_train, y_train)
# Step 3: Making Predictions
y_pred = model.predict(X_test)
# Step 4: Evaluating the Model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Model Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2):", r2)
# Displaying Predictions vs. Actual Values
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print("\nPredictions vs. Actual Values:")
print(results)
In this code:
1. Splitting the Data: We use `train_test_split()` to divide the dataset into training & testing sets. Here, 80% of the data is used for training & 20% for testing.
2. Training the Model: We create an instance of `LinearRegression()` & use the `fit()` method to train the model on the training data.
3. Making Predictions: The `predict()` method is used to generate predictions for the testing data.
4. Evaluating the Model:
- Mean Squared Error (MSE): Measures the average squared difference between actual & predicted values.
- Root Mean Squared Error (RMSE): The square root of MSE, providing error in the same units as the dependent variable.
- R-squared (R2): Indicates the proportion of variance in the dependent variable that’s explained by the independent variables. (These metrics are computed by hand in a short sketch after the example output below.)
Output Example:
Model Coefficients: [ 200. 50000. -5000.]
Intercept: 100000.0
Mean Squared Error (MSE): 25000000.0
Root Mean Squared Error (RMSE): 5000.0
R-squared (R2): 0.95
Predictions vs. Actual Values:
Actual Predicted
0 500000 495000
1 650000 655000
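For intuition about what these metrics mean, they can also be computed directly with NumPy; a short sketch, assuming `y_test` and `y_pred` from the code above:
# Manual computation of MSE, RMSE, and R-squared
errors = y_test.to_numpy() - y_pred
mse_manual = np.mean(errors ** 2)  # average squared difference between actual & predicted
rmse_manual = np.sqrt(mse_manual)  # error in the same units as Price
ss_res = np.sum(errors ** 2)  # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot  # proportion of variance explained
print(mse_manual, rmse_manual, r2_manual)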
Frequently Asked Questions
What is Multiple Linear Regression used for?
Multiple Linear Regression is used for predicting an outcome based on multiple independent variables, such as house price prediction and medical diagnosis.
How do I handle categorical variables in Multiple Linear Regression?
Categorical variables should be converted into numerical values using dummy variables or one-hot encoding to be used in regression models.
What is the impact of multicollinearity in Multiple Linear Regression?
High multicollinearity leads to unstable, unreliable coefficient estimates and makes the model harder to interpret.
Conclusion
In this article, we discussed Multiple Linear Regression in Python, a powerful technique used to model relationships between one dependent variable and multiple independent variables. We walked through its implementation using libraries such as pandas, statsmodels, and scikit-learn, along with its importance in predictive analysis. Understanding multiple linear regression helps in making data-driven decisions across various domains.