Table of contents
1.
Introduction
2.
What is HITS algorithm
3.
Advantages
4.
Algorithm
5.
Implementation in Python 
6.
Frequently Asked Questions
6.1.
How does the HITS algorithm differ from PageRank?
6.2.
Can the HITS algorithm be used for purposes other than web search?
6.3.
What are some limitations of the HITS algorithm?
7.
Conclusion
Last Updated: Aug 18, 2024
Easy

Hyperlink Induced Topic Search (HITS) Algorithm

Author Rinki Deka
0 upvote
Career growth poll
Do you think IIT Guwahati certified course can help you in your career?

Introduction

HITS (Hyperlink-Induced Topic Search) is a link analysis algorithm used to measure the importance of web pages. It helps search engines understand which pages are the most relevant & authoritative for a given topic. 

Hyperlink Induced Topic Search (HITS) Algorithm

In this article, we will discuss what the HITS algorithm is, how it works, & its advantages. Finally, we will see the code implementation in python.

What is HITS algorithm

The HITS algorithm, also known as "hubs & authorities", is used to rank web pages based on their importance & relevance to a specific topic. It was developed by Jon Kleinberg in 1999. 
 

The algorithm works by assigning two scores to each web page: a hub score & an authority score. A page with a high hub score is one that links to many other relevant pages, while a page with a high authority score is one that is linked to by many relevant pages. 

 

The idea behind HITS is that a good hub points to many good authorities, & a good authority is pointed to by many good hubs. The algorithm starts with a set of web pages relevant to a topic & then iteratively updates the hub & authority scores of each page based on the scores of the pages it links to & is linked from.
 

HITS is used by search engines as part of their ranking algorithms to determine the most important & relevant pages for a given search query. It helps provide users with high-quality search results that are both informative & trustworthy.

Advantages

1. Topic-specific rankings: HITS generates rankings that are specific to a particular topic, ensuring that the most relevant & authoritative pages for that topic are surfaced.
 

2. Mutually reinforcing relationship: The hub & authority scores in HITS reinforce each other. Good hubs boost the authority scores of the pages they link to, while good authorities boost the hub scores of the pages that link to them. This helps identify the most important pages.
 

3. Spam resistance: HITS is relatively resistant to spam because it relies on the structure of the web graph & the relationships between pages, rather than just the content or keywords on a page. This makes it harder for spammers to manipulate rankings.
 

4. Efficient computation: The iterative nature of HITS allows it to be computed efficiently, even for large sets of web pages. This is important for search engines that need to rank millions or billions of pages.
 

5. Complementary to other algorithms: HITS can be used in conjunction with other ranking algorithms, such as PageRank, to provide a more comprehensive & robust ranking of web pages.

Algorithm

The HITS algorithm involves the following steps:
 

1. Start with a set of web pages relevant to a specific topic. This can be obtained through a keyword search or other methods.
 

2. Construct a directed graph representing the hyperlink structure of the web pages, where each page is a node & each hyperlink is a directed edge.
 

3. Initialize the hub & authority scores of each page to 1.
 

4. Iteratively update the scores until convergence:

   - For each page, update its authority score to be the sum of the hub scores of the pages that link to it.

   - For each page, update its hub score to be the sum of the authority scores of the pages it links to.

   - Normalize the scores by dividing each score by the square root of the sum of the squares of all scores of the same type (hub or authority).
 

5. After convergence, the pages with the highest authority scores are considered the most important & relevant for the topic, while the pages with the highest hub scores are the best hubs for finding relevant content.

Implementation in Python 

import numpy as np

def hits(graph, max_iterations=100, tolerance=1e-8):
    n = len(graph)
    hub = np.ones(n)
    authority = np.ones(n)

    for _ in range(max_iterations):
        prev_hub = hub.copy()
        prev_authority = authority.copy()

        hub = np.dot(graph, authority)
        authority = np.dot(graph.T, hub)

        hub /= np.linalg.norm(hub)
        authority /= np.linalg.norm(authority)

        if np.linalg.norm(hub - prev_hub) < tolerance & np.linalg.norm(authority - prev_authority) < tolerance:
            break

    return hub, authority


This code takes a graph represented as an adjacency matrix & iteratively updates the hub & authority scores until convergence or a maximum number of iterations is reached.

Frequently Asked Questions

How does the HITS algorithm differ from PageRank?

While both algorithms rank web pages, HITS computes topic-specific scores (hubs & authorities), whereas PageRank calculates a global importance score for each page.

Can the HITS algorithm be used for purposes other than web search?

Yes, HITS can be applied to any network or graph where the importance of nodes needs to be determined based on their relationships, such as social networks or citation networks.

What are some limitations of the HITS algorithm?

HITS can be sensitive to the initial set of pages & may not handle topic drift well. It also does not consider the content or quality of the pages, only their link structure.

Conclusion

In this article, we learned about the HITS algorithm, a link analysis method used to rank web pages based on their importance & relevance to a specific topic. We discussed how HITS assigns hub & authority scores to pages & iteratively updates them based on the hyperlink structure of the web graph. We also talked about the advantages of HITS. Finally, we looked at the code implementation in Python.

You can also check out our other blogs on Code360.

Live masterclass