Table of contents
1. Introduction
2. Why MLP?
3. Algorithm
4. Implementation using Sklearn
5. Frequently Asked Questions
6. Key Takeaways
Last Updated: Mar 27, 2024

MLP (Multi-Layer Perceptron)

Author Mayank Goyal
Introduction

The MLP is an integral part of deep learning. It is a neural network in which the mapping between inputs and outputs is non-linear. An MLP contains:

  • One input layer.
  • One or more hidden layers.
  • One final layer called the output layer.

 

The layers close to the input are called lower layers, and the ones close to the output are known as upper layers. Every layer except the output layer is fully connected to the next layer and includes a bias neuron.

 

An MLP has a minimum of three layers, including one hidden layer. If an MLP contains more than one hidden layer, it is called a deep ANN. The Multi-Layer Perceptron is an example of a feedforward artificial neural network.

 

The number of layers and the number of neurons in each layer are hyperparameters of the neural network, and they need tuning. Cross-validation is one technique for finding their optimal values.
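For example, scikit-learn's GridSearchCV can cross-validate over candidate hidden-layer sizes; the dataset and the grid of sizes below are purely illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

# toy classification data (illustrative)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# try a few hidden-layer sizes and keep the one with the best CV score
grid = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=0),
    param_grid={"hidden_layer_sizes": [(3,), (5,), (10,)]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

The same idea extends to other hyperparameters such as the learning rate or the number of layers.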

 

The weight adjustment during training is done through backpropagation. The deeper the neural network, the better it can model complex data. However, deeper networks can suffer from the vanishing gradient problem.


Why MLP?

A Multi-Layer Perceptron (MLP) contains one or more hidden layers (apart from one input and one output layer). While a single-layer perceptron can only learn linear functions, a multi-layer perceptron can also learn non-linear functions. This limitation was famously demonstrated with the exclusive OR (XOR) gate: XOR outputs 1 only when its two inputs differ, and no single linear boundary separates its classes, so a single-layer perceptron cannot represent it.
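We can verify this directly: scikit-learn's linear Perceptron never reaches full accuracy on the XOR truth table (a small illustrative check):

```python
import numpy as np
from sklearn.linear_model import Perceptron

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

clf = Perceptron(max_iter=1000, random_state=0)
clf.fit(X, y)
# XOR is not linearly separable, so accuracy stays below 1.0
print(clf.score(X, y))
```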

 

Algorithm

  • Initially, we divide the input dataset into mini-batches. The network handles one mini-batch at a time and traverses the whole dataset multiple times; each full pass is called an epoch.
def fit(x, y, n_features=2, n_neurons=3, n_output=1, iterations=10, eta=0.001):
    """
    Args:
        x (ndarray): matrix of features
        y (ndarray): vector of expected values
        n_features (int): number of feature vectors
        n_neurons (int): number of neurons in the hidden layer
        n_output (int): number of output neurons
        iterations (int): number of iterations over the training set
        eta (float): learning rate

    Returns:
        errors (list): list of errors over iterations
        param (dict): dictionary of learned parameters
    """

    ## Initialize parameters
    param = init_params(n_features=n_features,
                        n_neurons=n_neurons,
                        n_output=n_output)

    for _ in range(iterations):
        # forward pass
        Z1 = linear(param['W1'], x, param['b1'])
        S1 = sigmoid(Z1)
        Z2 = linear(param['W2'], S1, param['b2'])
        S2 = sigmoid(Z2)


def sigmoid(Z):
    """
    Args:
        Z (ndarray): weighted sum of features

    Returns:
        S (ndarray): neuron activation
    """
    return 1 / (1 + np.exp(-Z))


def linear(W, X, b):
    """
    Args:
        W (ndarray): weight matrix
        X (ndarray): matrix of features
        b (ndarray): vector of biases

    Returns:
        Z (ndarray): weighted sum of features
    """
    return np.dot(X, W) + b
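The fit function above processes the whole dataset at once; the mini-batching described in the first step can be sketched with an illustrative helper like this (iterate_minibatches is not part of the article's code):

```python
import numpy as np

def iterate_minibatches(X, y, batch_size):
    # shuffle once per epoch, then yield consecutive slices
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batches = list(iterate_minibatches(X, y, batch_size=4))
print(len(batches))  # 3 batches of sizes 4, 4, 2
```

Each epoch would then loop over these batches, running the forward pass and the weight update once per batch instead of once per epoch.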

 

  • Next, we compute the network’s output error.
    ## store the error after each iteration
    errors = []
    error = cost_function(S2, y)
    errors.append(error)


def cost_function(V, y):
    """
    Args:
        V (ndarray): neuron activation
        y (ndarray): vector of expected values

    Returns:
        error (float): total squared error
    """
    return np.mean(np.power(V - y, 2)) / 2
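As a quick sanity check on the cost function, we can evaluate it on a small hand-computed example:

```python
import numpy as np

def cost_function(V, y):
    return np.mean(np.power(V - y, 2)) / 2

V = np.array([0.9, 0.1, 0.8, 0.2])  # network activations
y = np.array([1, 0, 1, 0])          # expected values
# squared errors are [0.01, 0.01, 0.04, 0.04]; their mean is 0.025, halved -> 0.0125
print(cost_function(V, y))  # 0.0125
```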

 

  • Then, we calculate how much each output connection contributed to the error, and how much of that error contribution comes from each connection in the previous layers, applying the chain rule until we reach the input layer. This reverse pass measures the error gradient across all the connection weights in the network by propagating the error gradient backward through it, which is known as backpropagation. Finally, we update all the weights and biases in the network using the error gradients just computed.

 

        # update output weights
        d2 = (S2 - y) * S2 * (1 - S2)
        W2_gradients = np.dot(S1.T, d2)
        para["W2"] = para["W2"] - W2_gradients * eta

        # update output bias
        para["b2"] = para["b2"] - np.sum(d2, axis=0,
          keepdims=True) * eta

        # update hidden weights
        d1 = np.dot(d2, para["W2"].T) * S1 * (1 - S1)
        W1_gradients = np.dot(X.T, d1)
        para["W1"] = para["W1"] - W1_gradients * eta

        # update hidden bias
        para["b1"] = para["b1"] - np.sum(d1, axis=0,
          keepdims=True) * eta

 

That's the algorithm we follow while implementing an MLP. As stated above, I used the XOR problem as the example, since XOR was one of the main motivations for moving beyond the single-layer perceptron.

 

If that is too much to grasp at once, let us summarize the algorithm. First, we feed each training instance into the network and perform forward propagation. Then we measure the error and go back through each layer to compute the error contribution from each neuron (backward propagation). Finally, we update the weights and biases to reduce the error.

 

NOTE:

We should always initialize the connection weights randomly, or else training will fail. For example, if we initialize all the weights and biases to the same value, all the neurons in a given layer will behave identically: backpropagation affects them in exactly the same way, so they stay identical forever and the network learns nothing useful. Initializing the weights randomly breaks this symmetry and allows backpropagation to train a diverse set of neurons.
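A small illustrative computation shows why symmetric initialization fails: when hidden neurons start with identical weights, every neuron receives exactly the same gradient (the inputs, weights, and target below are made up for the demonstration):

```python
import numpy as np

X = np.array([[0., 1.], [1., 0.]])  # two toy inputs
W = np.ones((2, 3))                 # symmetric (all-equal) initialization

Z = X @ W
S = 1 / (1 + np.exp(-Z))            # sigmoid activations, identical per neuron
d = (S - 0.9) * S * (1 - S)         # delta term, same form as in the article
grad = X.T @ d                      # gradient for the hidden weights

# every hidden neuron gets an identical gradient column
print(np.allclose(grad[:, 0], grad[:, 1]))  # True
```

Since the gradients are identical, a gradient step keeps the weights equal, and the three neurons never differentiate.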

 

Let us look at the whole code at once:

import numpy as np

def init_params(n_features, n_neurons, n_output):
    np.random.seed(100)  # for reproducibility
    W1 = np.random.uniform(size=(n_features, n_neurons))
    b1 = np.random.uniform(size=(1, n_neurons))
    W2 = np.random.uniform(size=(n_neurons, n_output))
    b2 = np.random.uniform(size=(1, n_output))

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters


def sigmoid_function(Z):
    return 1 / (1 + np.exp(-Z))

def linear_function(W, X, b):
    return np.dot(X, W) + b

def cost_function(V, y):
    return np.mean(np.power(V - y, 2)) / 2

def predict(X, W1, W2, b1, b2):
    Z1 = linear_function(W1, X, b1)
    S1 = sigmoid_function(Z1)
    Z2 = linear_function(W2, S1, b2)
    S2 = sigmoid_function(Z2)
    return np.where(S2 >= 0.5, 1, 0)

def fit(X, y, n_features=2, n_neurons=3, n_output=1, iterations=10, eta=0.001):
    para = init_params(n_features=n_features,
                       n_neurons=n_neurons,
                       n_output=n_output)

    ## store the error after each iteration
    errors = []
    for _ in range(iterations):
        # forward pass
        Z1 = linear_function(para['W1'], X, para['b1'])
        S1 = sigmoid_function(Z1)
        Z2 = linear_function(para['W2'], S1, para['b2'])
        S2 = sigmoid_function(Z2)
        error = cost_function(S2, y)
        errors.append(error)

        ## backpropagation

        # update output weights
        d2 = (S2 - y) * S2 * (1 - S2)
        W2_gradients = np.dot(S1.T, d2)
        para["W2"] = para["W2"] - W2_gradients * eta

        # update output bias
        para["b2"] = para["b2"] - np.sum(d2, axis=0,
          keepdims=True) * eta

        # update hidden weights
        d1 = np.dot(d2, para["W2"].T) * S1 * (1 - S1)
        W1_gradients = np.dot(X.T, d1)
        para["W1"] = para["W1"] - W1_gradients * eta

        # update hidden bias
        para["b1"] = para["b1"] - np.sum(d1, axis=0,
          keepdims=True) * eta

    return errors, para


# expected values
y = np.array([[0, 1, 1, 0]]).T

# features
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]]).T

errors, para = fit(X, y, iterations=5000, eta=0.1)
y_pred = predict(X, para["W1"], para["W2"], para["b1"],
                 para["b2"])

correct_predictions = (y_pred == y).sum()
accuracy = (correct_predictions / y.shape[0]) * 100
print('Multi-layer perceptron accuracy: %.2f%%' % accuracy)

 

Output:

Multi-layer perceptron accuracy: 100.00%

 

Implementation using Sklearn

Importing libraries

import numpy as np
from sklearn import metrics
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

 

Initializing Data

# expected values
y = np.array([0, 1, 1, 0])

# features
x = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]]).T

 

Splitting the Data

train_features, test_features, train_targets, test_targets = train_test_split(x, y, test_size=0.1, random_state=123)

 

Training the Model:

def MLPerceptron(train_features, test_features, train_targets, test_targets, num_neurons=50):
    classifier = MLPClassifier(hidden_layer_sizes=num_neurons, max_iter=1000,
                               activation='relu', solver='sgd', verbose=20,
                               random_state=124, learning_rate='invscaling')
    classifier.fit(train_features, train_targets)

    predictions = classifier.predict(test_features)
    score = np.round(metrics.accuracy_score(test_targets, predictions), 2)
    print("Mean accuracy: " + str(score))

 

Model:

MLPerceptron(train_features, test_features, train_targets, test_targets)

 

Output:

 

 

That’s all from the implementation part; here too we got 100% accuracy. You can vary the different parameters to see how the results change.

Frequently Asked Questions

  1. What are the limitations of the perceptron?
    Perceptron networks have several limitations. First, the output values are binary. Second, perceptrons can only classify linearly separable sets of vectors.
     
  2. Can MLP be used for regression?
    Yes. Multi-Layer Perceptrons, like artificial neural networks in general, support both regression and classification problems.
     
  3. What’s the use of bias in MLP?
    The bias can be thought of as a measure of how flexible the perceptron is. It is similar to the constant b of a linear function y = ax + b: it allows us to move the line up and down to better fit the prediction to the data.
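Following up on the regression question, here is a minimal sketch using scikit-learn's MLPRegressor; the toy linear dataset below is purely illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# toy regression data: learn y = 2x on a handful of points
X = np.linspace(0, 1, 50).reshape(-1, 1)
y = 2 * X.ravel()

reg = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
reg.fit(X, y)
print(reg.predict([[0.5]]))  # should be close to 1.0
```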

Key Takeaways

Let us briefly recap the article.

First, we saw the introduction to MLPs, then we learned why we need them, and lastly, we saw the implementation of an MLP in two different ways: first using plain NumPy, and second using sklearn.

That’s one of the fundamental algorithms of neural networks. With the Multi-Layer Perceptron, the horizons expand: a neural network can now have many layers of neurons and is ready to learn more complex patterns.


That’s the end of the article. I hope you liked it.

Keep learning, Ninjas!
