Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
2.
Description of PageRank Algorithm
3.
Algorithm
3.1.
Simplified Algorithm
4.
Python Implementation
4.1.
Code
4.2.
Output
5.
Applications of PageRank Algorithm
6.
Advantages and Disadvantages 
7.
Frequently Asked Questions
7.1.
Name some other link-based ranking algorithms.
7.2.
State the factors that influenced the PageRank Algorithm.
7.3.
What is the damping factor?
7.4.
Can the PageRank of a webpage be zero?
8.
Conclusion
Last Updated: Mar 27, 2024

PageRank Algorithm

Author Pankhuri Goel
1 upvote
Master Python: Predicting weather forecasts
Speaker
Ashwin Goyal
Product Manager @

Introduction

PageRank (PR) is a Google Search algorithm that ranks websites in search engine results. Larry Page, one of Google's founders, provided the inspiration for PageRank. PageRank is a metric for determining how important a website's pages are. Google says that:

PageRank calculates a rough estimate of the importance of a website by measuring the quantity and quality of links to that page. The basic premise is that more important websites are more likely to gain links from other sites.

It is not the only algorithm used by Google to organise search engine results; nonetheless, it is the company's first and most well-known algorithm.

Description of PageRank Algorithm

PageRank is a link analysis algorithm that gives each element of a hyperlinked group of documents, such as the World Wide Web, a numerical weighting to "measure" its relative relevance within the set. Any collection of entities containing reciprocal quotations and references can be used with the algorithm. The PageRank of E is the numerical weight it allocates to any given element E and is denoted by PR(E).

A PageRank is the output of a mathematical process based on the web graph, which comprises all World Wide Web pages as nodes and hyperlinks as edges, with authority hubs taken into consideration. The importance of a page is indicated by its rank value. A vote of support is a hyperlink to a page.

A page's PageRank is determined by the quantity and PageRank metric of all pages connected to it in a recursive manner ("incoming links"). A page with a high PageRank that is connected to many other pages with a high PageRank earns a high PageRank as well.

Since Page and Brin's original work, many research studies on PageRank have been released. The PageRank idea may be sensitive to manipulation in practice. The purpose of the research was to discover erroneously affected PageRank rankings. The goal is to find a way to ignore connections from documents with skewed PageRank effectively.

Get the tech career you deserve, faster!
Connect with our expert counsellors to understand how to hack your way to success
User rating 4.7/5
1:1 doubt support
95% placement record
Akash Pal
Senior Software Engineer
326% Hike After Job Bootcamp
Himanshu Gusain
Programmer Analyst
32 LPA After Job Bootcamp
After Job
Bootcamp

Algorithm

The PageRank algorithm generates a probability distribution that indicates the possibility of a random user clicking on links and ending up on a specific page. PageRank can be determined for any large collection of documents. Several research publications assume that the distribution is uniformly distributed among all documents in the collection at the start of the computational process. The PageRank computations necessitate numerous trips through the collection, referred to as "iterations", to update PageRank values to represent the actual theoretical value more closely.

A probability is a number between 0 and 1, representing the likelihood of something happening. A "50 per cent possibility" of something happening is typically stated as a 0.5 probability. As a result, a document with a PageRank of 0.5 means that a person clicking on a random link will be sent to that document 50% of the time.

Simplified Algorithm

Assume a universe with four web pages, namely, A, B, C, and D. Links between pages are disregarded, as are repeated outbound links from one single page to another single page. For all pages, PageRank is set to the same value. The whole number of pages on the web at the time was the sum of PageRank over all pages in the original form of PageRank; therefore, each page in this scenario would have an initial value of 1. The remainder of this section and later versions of PageRank, on the other hand, assume a probability distribution between 0 and 1. As a result, in this example, the initial value for each page is 0.25.

On the next iteration, the PageRank sent from a specific page to the targets of its outbound links is distributed evenly across all outbound connections.

If the system's only links to A were from pages B, C, and D, each link would transmit 0.25 PageRank to A on the next iteration, for a total of 0.75.

Instead, suppose page B is linked to pages C and A, page C is linked to page A, and page D is connected to all three sites. As a result, page B would transmit half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C on the first iteration. Page C's entire existing value, 0.25, would be transferred to the single page it connects to, A. D would transfer one-third of its existing value, or around 0.083, to A because it had three outgoing linkages. Page A will have a PageRank of roughly 0.458 at the end of this cycle.

To put it another way, an outgoing link's PageRank is equal to the document's PageRank score divided by the number of outbound links L().

The PageRank value for any page u can be stated as follows in general:

i.e., the PageRank value of a page u is determined by the PageRank values of each page v in the set Bu (set of all pages connecting to page u), divided by the number L(v) of links from page v.  

For the calculation of PageRank, the algorithm uses a damping factor. The formula, once the damping factor is factored in, is as follows:

here, d is the damping factor, and N is the number of elements in set Bu(total pages in the collection). 

Python Implementation

The implementation of the PageRank Algorithm using Python is below:

Code

import numpy as np


def pageRank(M, num_iter: int = 100, d: float = 0.85):
    """
    Parameters
    ----------
    M : numpy array
        adjacency matrix where M[i,j] represents the link from 'j' to 'i', such that for all 'j' -> sum(i, M[i,j]) = 1
    num_iter : int, optional
        number of iterations (default 100)
    d : float, optional
        damping factor (default 0.85)


    Returns
    -------
    numpy array
        vector of ranks such that v[i] is the i-th rank from [0, 1],
        v sums to 1
    """
    N = M.shape[1]
    v = np.ones(N) / N
    M_hat = (d * M + (1 - d) / N)
    for i in range(num_iter):
        v = M_hat @ v
    return v


M = np.array([[0, 0, 0, 0, 1],
              [0.5, 0, 0, 0, 0],
              [0.5, 1, 0, 0, 0],
              [0, 0, 1, 0.5, 0],
              [0, 0, 0, 0.5, 0]])
v = pageRank(M, 100, 0.85)
print(v)

Output

If we calculate the sum of these five values, it will equal 1.

Applications of PageRank Algorithm

The applications of the PageRank algorithm are stated as follows:

  • The PageRank algorithm's earliest and most obvious application is in search engines. PageRank can rank websites to give more relevant search results faster because it was built expressly for use in Google's search engine.
  • The PageRank algorithm is used to search networks that are not connected to the internet. This can be used in academic papers; PageRank can determine the most effective and referenced publications in a field by utilising citations instead of links.
  • An example of a real-world application of the PageRank method is identifying significant species in an ecology. Applying the PageRank algorithm to the relationships between species in an ecosystem allows the user to determine the most notable species. As a result, assigning value to essential animal and plant species in an ecosystem makes it easier to predict outcomes such as extinction or a species' removal from the ecosystem.

Advantages and Disadvantages 

The advantages and disadvantages of the PageRank algorithm are as follows:

Also Read - Kadanes Algorithm

Frequently Asked Questions

Name some other link-based ranking algorithms.

The HITS algorithm (developed by Jon Kleinberg and now used by Teoma and Ask.com), the IBM CLEVER project, the TrustRank algorithm, and the Hummingbird algorithm are other link-based ranking algorithms for Web pages.

 

State the factors that influenced the PageRank Algorithm.

The factors that highly influenced the PageRank Algorithm are:
-> Anchor text
-> The likelihood of being clicked
-> Internal links
-> NoFollow links

 

What is the damping factor?

According to the PageRank theory, a hypothetical surfer who clicks on links at random will soon stop clicking at some point in time. A damping factor d is the probability that the individual will continue to surf at any point. Different damping factors have been investigated in various experiments; however, it is generally anticipated that the damping factor will be fixed at around 0.85.

 

Can the PageRank of a webpage be zero?

As observed from the formula above, the PR of a page cannot be zero. If the page has no pages pointing to it, still the PR will be

The page can still be accessible if it is saved in, say, bookmarks.

Conclusion

In this article, we learned about the PageRank algorithm. We saw its description, simplified working, applications as well as its advantages and disadvantages.

We hope this blog has helped you enhance your knowledge. If you want to learn more, check out our articles on Data Mining and Data Analytics - Coding Ninjas Coding Ninjas StudioAdaBoost Algorithm - Coding Ninjas Coding Ninjas StudioExpectation-Maximization Algorithm - Coding Ninjas Coding Ninjas Studio and Top 10 Common Data Mining Algorithms – Coding Ninjas Blog. Do upvote our blog to help other ninjas grow.

Head over to our practice platform Coding Ninjas Studio to practice top problems, take various guided paths, attempt mock tests, read interview experiences, solve problems, participate in contests and much more!

Happy Reading!

Previous article
Expectation-Maximization Algorithm
Next article
AdaBoost Algorithm
Live masterclass