Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
2.
Aho Corasick Algorithm
2.1.
Example
2.2.
Implementation
2.2.1.
Input
2.2.2.
Output
2.3.
Complexities
2.3.1.
Time Complexity
2.3.2.
Space Complexity
3.
FAQs
4.
Key Takeaways
Last Updated: Mar 27, 2024

Aho Corasick Algorithm

Author Ujjawal Gupta
0 upvote

Introduction

This blog will discuss an Aho Corasick Algorithm used in pattern searching. Aho Corasick Algorithm is a crucial topic and is widely asked in various coding contests and multiple interviews. This blog will discuss what aho corasick algorithm is and its applications.

Also Read, Byte Array to String

Aho Corasick Algorithm

Aho corasick algorithm was given by Alfred Aho and Margaret corasick in 1975. This algorithm helps get all the occurrences of all the given sets of keywords. It is the type of dictionary-matching algorithm. It uses a trie data structure. After making the trie, it converts it into an automaton, due to which we can search in linear time. Aho corasick algorithm works in three different phases. These are:

  1. Go-To: This phase makes the trie using all the given keywords.
  2. Failure: This phase tries to find the backward transition to get the proper suffix of some keywords.
  3. Output: In this phase, it finds all the words that are ending at that stage for each state of automation.
     

Let’s understand more clearly by taking an example.

Example

Suppose we have to find whether the strings “ACG”, “ATC”, “CAT”, “GCG” is present in the string “GCATCG” or not.

The first phase is the Go-To phase, in which we have to build the trie-like data structure, which is shown in the below figure.

Source

As seen in the above figure, several failures links come into play when there is a mismatch in character. This is also known as the failure phase. When any mismatch occurs, then failure links refer the string to the substring with the most prefix match.

There are a few red dots that signify the output phase. These signify that the string corresponding to that node is matched.

Implementation

#include <bits/stdc++.h>
using namespace std;

int output[500];
int fail[500];
int gotoMat[500][26];

int buildTree(string array[], int size)
{
    /*All elements of output is 0*/
    for (int i = 0; i < 500; i++)
        output[i] = 0;

    /*All elements of failure array is -1*/
    for (int i = 0; i < 500; i++)
        fail[i] = -1;

    /*All elements of goto matrix is -1*/
    for (int i = 0; i < 500; i++)
        for (int j = 0; j < 26; j++)
            gotoMat[i][j] = -1;

    /*There is only one state*/
    int state = 1;

    /*Make trie structure for all pattern in array*/
    for (int i = 0; i < size; i++)
    {
        string word = array[i];
        int presentState = 0;

        for (int j = 0; j < word.size(); ++j)
        {
           
            /* all pattern is added */
            int ch = word[j] - 'a';
            if (gotoMat[presentState][ch] == -1)    
                gotoMat[presentState][ch] = state++;
            presentState = gotoMat[presentState][ch];
        }
        output[presentState] |= (1 << i);
    }

    /*If ch is not directly connected to root*/
    for (int ch = 0; ch < 26; ++ch)
        if (gotoMat[0][ch] == -1)
            gotoMat[0][ch] = 0;

    queue<int> q;

    for (int ch = 0; ch < 26; ++ch)
    {
        /*Node goes to previous state when fails*/
        if (gotoMat[0][ch] != 0)
        {
            fail[gotoMat[0][ch]] = 0;
            q.push(gotoMat[0][ch]);
        }
    }

    while (q.size())
    {
        /*remove front node*/
        int state = q.front();
        q.pop();

        for (int ch = 0; ch <= 26; ++ch)
        {
            if (gotoMat[state][ch] != -1)
            {
                /* if goto state is present*/
                int failure = fail[state];

                /*Find deepest node with proper suffix*/
                while (gotoMat[failure][ch] == -1)
                    failure = fail[failure];
                failure = gotoMat[failure][ch];
                fail[gotoMat[state][ch]] = failure;
                output[gotoMat[state][ch]] |= output[failure];
                q.push(gotoMat[state][ch]);                    
            }
        }
    }
    return state;
}

int getNextState(int presentState, char nextChar)
{
    int answer = presentState;
    int ch = nextChar - 'a';

    /*if go to is not found, use fail function*/
    while (gotoMat[answer][ch] == -1)
        answer = fail[answer];
    return gotoMat[answer][ch];
}

void patternSearch(string arr[], int size, string text)
{

    /*Make the trie*/
    buildTree(arr, size);
    int presentState = 0;

    for (int i = 0; i < text.size(); i++)
    {
        /* find all occurances of pattern */
        presentState = getNextState(presentState, text[i]);
        if (output[presentState] == 0)
            for (int j = 0; j < size; ++j)
            {
                if (output[presentState] & (1 << j))
                {
                    cout << "Word " << arr[j] << " location: " << i - arr[j].size() + 1 << endl;
                }
            }
    }
}

int main()
{
    int n;
    cin>>n;
    string arr[n],text;
    for(int i=0;i<n;i++)
        cin>>arr[i];

    cin>>text;
    int k = sizeof(arr) / sizeof(arr[0]);
    patternSearch(arr, k, text);
    return 0;
}
You can also try this code with Online C++ Compiler
Run Code

Input

5
their there answer any bye
isthereanyanswerofgoodbye

Output

Word their location: 2
Word any location: 7
Word answer location: 10
Word bye location: 22

Complexities

Time Complexity

The time complexity of the above program is O(N + L + Z), where ‘N’ is the length of the text, ‘L’ is the length of keywords, and ‘Z’ is the number of matches.

Space Complexity

O(L * Q)where ‘Q’ is the length of the alphabet since that is the maximum number of children a node can have.

Check out this article - Converting NFA to DFA

FAQs

  1. How many phases are in the Aho corasick algorithm?
    Aho corasick algorithm works in three phases. These phases are the Go-To, Failure, and Output phases.
     
  2. Why is the aho corasick algorithm known as a linear search pattern?
    Aho Corasick algorithm is known as linear search pattern because it searches the pattern in linear time.
     
  3. Why is the aho corasick algorithm also known as dictionary matching algorithm? 
    Aho corasick algorithm is also known as dictionary matching algorithm because it checks whether all the keywords of the dictionary are a match in text or not. 
     
  4. What are some other applications of the aho corasick algorithm?
    Aho corasick algorithm can also be used to find the lexicographically smallest string of a given length that doesn't match any given string.
     
  5. Which data structure is used in implementing the Aho corasick algorithm?
    Trie-like data structure is used in implementing aho corasick algorithm. After creating the trie-like data structure, it creates an automaton.

Key Takeaways

In this blog, we learned about an interesting topic based on a Search pattern. We have discussed the aho corasick algorithm and learned how it is implemented using trie and automaton. Another topic involving pattern search is: here.

Hence learning never stops, and there is a lot more to learn.

So head over to our practice platform Coding Ninjas Studio to practice top problems, attempt mock tests, read interview experiences, and much more. Till then, Happy Coding!

Live masterclass