Cost Function and Gradient Descent
Cost function in the case of Linear regression was 
Here h_{θ}x^{(i) }is output of i^{th} training example on our hypothesis and y^{(i)} is the actual value of the data point.
The cost function basically finds the values of linear regression coefficient and yintercept for which the error difference between the actual output and predicted output is the least.
The cost function in the case of multivariate linear regression is given by 
Source  link
The cost function is simulated using the gradient descent algorithm. Gradient descent is an optimisation algorithm that minimises the cost function.
We see that for a very small increase in value of θ, there is a sharp decrease in the value of the cost function J. However this difference in J gets smaller and smaller for the increasing value of θ. To a point where it finally reaches the global minima and after that there is an increase in J value for an increase in θ. The corresponding independent variables at the global minima are then chosen for the final model.
This is an iterative process and is given by 
Here α is the learning rate of the gradient descent.
Feature Scaling
The features or the independent variables or features need to be scaled in the same range for the gradient descent to work optimally. This is done by dividing all the features by the max of the feature column. For eg, suppose consider this dummy dataset { 3, 15, 200, 82, 500}. We can divide all the values in the data set with the Absoulte maximum value in the dataset (500). So the scaled data would lie in the range of 1 to 1. {3/500 , 15/500, 200/500 ,82/500 ,500/500}. This scaling method is called Absolute maximum Minmax scaling is another popular scaling method where we subtract the minimum value from every data in the set and then divide it by the range of the set. It is given by 
(DD_{min}) / (D_{max}D_{min})
There are other ways to scale the dataset as well.
Unscaled features lead to skewed contour plotting as seen above.
Source  link
Learning Rate
Source  link
Learning rate is the rate at which the gradient descent approaches and converges at global minima. The learning rate needs to be optimal as a small learning rate leads to slow convergence, which leads to slower execution and higher computational cost.
For a very high value of learning rate, the descent might never reach the global minima and just pass over it.
So an optimum value of α is necessary for the gradient descent to work efficiently.
Source  link
Implementation
We’ll now have a look at the python implementation of multivariate linear regression.
The dataset used is boston housing prices which is imported from sklearn library. The house price is our target variable.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from pandas import DataFrame
import sklearn as skt
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
You can also try this code with Online Python Compiler
Run Code
Creating the data frame and splitting the dataset
class Dataset:
def __str__(self):
return "Dataset: The 'scikitlearn' Boston HousePrices"
def __init__(self):
print(self)
def load(self):
boston = load_boston()
self.data = boston
self.df = DataFrame(np.concatenate((boston.data, boston.target.reshape(1, 1)), axis=1),
columns=np.concatenate((boston.feature_names, ["MEDV"])))
X = boston.data
# Choose only a subset of features
features_limit = ["LSTAT", "RM", "PTRATIO"]
if len(features_limit)>0:
features_index = np.where(dataset.data.feature_names==np.array(features_limit).reshape(1,1))[1]
X = X[:, features_index]
X = self.preprocess(X)
y = boston.target.reshape(1, 1) # Convert to Column vector
# Insert bias feature (x0 = 1) into features for all examples
X = np.concatenate((np.ones((X.shape[0],1)), X), axis=1)
self.X = X
self.y = y
print("Informations:")
print("\t X ~", np.shape(X))
print("\t y ~", np.shape(y))
# Sanity check
assert(len(X)==len(y))
m = len(X)
self.m = m
print("\t m:", m)
# Number of features
n = np.shape(X)[1]
self.n = n
print("\t n:", n)
# train, dev, test sets split
X_train_dev, X_test, y_train_dev, y_test = train_test_split(X, y, test_size=.1, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_train_dev, y_train_dev, test_size=.1, random_state=0)
# Number of examples
m_train = len(X_train)
m_dev = len(X_dev)
m_test = len(X_test)
print("\t m_train:", m_train)
print("\t m_dev:", m_dev)
print("\t m_test:", m_test)
assert(m_train+m_dev+m_test == m)
return X_train, X_dev, X_test, y_train, y_dev, y_test, m_train, m_dev, m_test, m, n
def preprocess(self, X):
X_ = (Xnp.mean(X, axis=0)) / np.std(X, axis=0) # Z normalization
return X_
You can also try this code with Online Python Compiler
Run Code
dataset = Dataset()
X_train, X_dev, X_test, y_train, y_dev, y_test, m_train, m_dev, m_test, m, n = dataset.load()
You can also try this code with Online Python Compiler
Run Code
Custom Multivariate Linear Regression Model
class Model:
def __str__(self):
return "Multivariate Linear Regression"
#
# Model parameters and hyperparameters initialization
def __init__(self):
print(self)
self.theta = np.zeros((n, 1)) # Model parameters
print("theta ~", np.shape(self.theta) )
self.alpha = 0.06 # Learning rate #0.15 #0.01 #0.001 #0.0001
self.epochs = 80
# Cost function values during the training
self.costs_train = []
self.costs_dev = []
#
# Fitting the model to the training set w.r.t. the development set
def fit(self, X_train, X_dev, y_train, y_dev):
for epoch in np.arange(1, self.epochs+1):
# Predictions
preds_train = self.predict(X_train)
preds_dev = self.predict(X_dev)
# Costs
J_train = self.cost(preds_train, y_train)
J_dev = self.cost(preds_dev, y_dev)
# Accumulating costs for plotting the learning curves
self.costs_train.append(J_train)
self.costs_dev.append(J_dev)
# Derivatives
dtheta = (1/m) * np.matmul(X_train.T , (preds_train  y_train))
# Update parameters
self.update(dtheta)
# Plot the learning curve
self.plot_learning_curve()
#
# Update the parameters of the model
def update(self, dtheta):
self.theta = self.alpha * dtheta
#
# Model prediction for some examples
def predict(self, X):
# Hypothesis
preds = np.matmul(X, self.theta)
return preds
#
# Evaluation of model on the test set
def evaluate(self, X_test, y_test):
m_test = X_test.shape[0]
print("test set>")
preds_test = self.predict(X_test)
J_test = self.cost(preds_test, y_test)
se_test = np.sum( np.power( (preds_testy_test), 2 ) )
mse_test = (1/m_test)*(se_test)
rmse_test = np.sqrt( mse_test )
r2_score = 1( se_test / np.sum( np.power(y_testnp.mean(y_test),2) ) )
#print("\t", "Cost: ", J_test)
print("\t", "MSE: ", mse_test)
print("\t", "RMSE: ", rmse_test)
print("\t", "r2 score: ", r2_score)
#
# Use for getting the parameters of the model especially after the training
def get_parameters(self):
return self.theta
#
# Plot the learning curve to decision making purposes
def plot_learning_curve(self):
def config_figure():
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.title("Learning curve [train set]")
plt.legend()
data = [[self.costs_train], [self.costs_dev], [self.costs_train, self.costs_dev]]
labels = [["Trainig cost"], ["Development (validation) cost"], ["Trainig cost", "Development (validation) cost"]]
styles = [["r"], ["b"], ["r", "b"]]
for i, [dt, label, style] in enumerate(zip(data, labels, styles), 0):
plt.figure()
for j, d in enumerate(dt, 0):
plt.plot(np.arange(self.epochs), d, style[j], label=label[j])
config_figure()
plt.show()
#
def cost(self, preds, y):
J = (1/2) * np.mean( np.power( np.sum(predsy), 2 ) )
return J
model=Model()
You can also try this code with Online Python Compiler
Run Code
model.fit(X_train, X_dev, y_train, y_dev)
You can also try this code with Online Python Compiler
Run Code
Scikitlearn Multivariate Linear Regression
from sklearn.metrics import mean_squared_error, r2_score
clf = LinearRegression()
clf.fit(X_train, y_train)
preds_test = clf.predict(X_test)
You can also try this code with Online Python Compiler
Run Code
Comparing the results of both the models
print("scikitlearn")
print("test set>")
print("\t", "MSE: ", mean_squared_error(y_test, preds_test))
print("\t", "RMSE: ", np.sqrt(mean_squared_error(y_test, preds_test)))
print("\t", "r2 score: ", r2_score(y_test, preds_test))
You can also try this code with Online Python Compiler
Run Code
model.evaluate(X_test, y_test)
You can also try this code with Online Python Compiler
Run Code
As we can see, our custom model is giving out similar results as the one by scikit learn library.
Frequently Asked Questions

What is the difference between simple linear regression and multivariate linear regression?
Simple linear regression model makes its predictions with a single feature. Multivariate linear regression is a more generalised form of linear regression with n features. It could be said that simple linear regression is a special case of multivariate linear regression.

What is the learning rate of gradient descent?
Learning rate is defined as the rate at which gradient descent converges at the global minima. This needs to be an optimum value since a small learning rate leads to slow convergence and a higher computational cost. While a bigger value can lead to missing out on the minima entirely.

What are the various regression evaluation metrics?
r^{2 } score, mean absolute error, mean squared error.
Key Takeaways
In this blog, we covered multivariate linear regression in detail. From its mathematical significance to its python implementation. We also learned about the nittygritty of gradient descent and it can be used to simulate the cost function. We highly recommend you implement the code yourself to better understand how to go about a regression problem.
Happy Learning