Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
1.1.
Example
1.1.1.
Input
1.1.2.
Output
2.
Algorithm
2.1.
C++ implementation
2.1.1.
Output
2.2.
Complexities
2.2.1.
Time complexity
2.2.2.
Space complexity
3.
Frequently Asked Questions
4.
Key Takeaways
Last Updated: Mar 27, 2024
Easy

Aho-corasick Algorithm

Author Manan Singhal
0 upvote
Crack Google SDE interview : Essential projects
Speaker
Saurav Prateek
SDE-2 @
20 Jun, 2024 @ 01:30 PM

Introduction

The Aho-Corasick technique may be used to search for various patterns in a vast blob of text, making it a valuable tool in data science and many other fields. Alfred Aho and Margaret Corasick proposed this algorithm.

Let’s Dig Deeper into the Algorithm Itself.

Example

Given an input text and an array containing some words, arr[], find occurrences of all words in the input text.

Input

text = "codingninjas"    
arr[] = {"coding", "ninja", "as", "ding"}

Output

Word coding appears from 0 to 5
Word ding appears from 2 to 5
Word ninja appears from 6 to 10
Word as appears from 10 to 11

Algorithm

We can solve this problem efficiently by using a Tree-like data structure to store our strings called ‘Trie’. Trie can efficiently store multiple strings and make string matching accessible. If you want to know more about Tries and their implementation, look at Using Trie in Data Structures. But here is a catch. We easily match patterns using a trie, but it will cost us a significant time and space complexity. Building a Trie takes linear time on average. For the above-given parameters, time complexity using a simple trie is O(kmax*n) time complexity, where kmax is the maximum pattern length, which can improve this complexity.

C++ implementation

/* C++ program for implementation of Aho-Corasick algorithm */
#include <bits/stdc++.h>
using namespace std;

/* Maximum number of characters in input alphabet */
const int MAXC = 26;

/* Maximum number of states in the matching machine, which is equal to the sum of the length of all keywords. */
const int MAXS = 500;

/* OUTPUT FUNCTION IS IMPLEMENTED USING out[] */
int out[MAXS];

/* GOTO FUNCTION IS IMPLEMENTED USING g[][] */
int g[MAXS][MAXC];

/* FAILURE FUNCTION IS IMPLEMENTED USING f[] */
int f[MAXS];

/* arr - array of words.
It will return the number of states that the built machine has. */

int buildMatchingMachine(string arr[], int k)
{
	/* Initialize values to 0 in output function. */
	memset(out, 0, sizeof out);
	
	/* Initialize values to -1 in goto function. */
	memset(g, -1, sizeof g);
	
	int states = 1;
	
	/* Construct values for goto function */
	for (int i = 0; i < k; ++i)
	{
 		const string &word = arr[i];
 		int currentState = 0;
 		
 		/* Insert all characters in arr[] */
 		int j;
 		for (j = 0; j < word.size(); ++j)
 		{
  			int ch = word[j] - 'a';
  			
  			/* Create a new node if a node for ch doesn't exist. */
  			if (g[currentState][ch] == -1)
   				g[currentState][ch] = states++;
   				
  			currentState = g[currentState][ch];
 		}
 		
 		out[currentState] |= (1 << i);
	}
	
	/* Characters which does not have an edge, add a goto edge to state 0 */
	for (int ch = 0; ch < MAXC; ++ch)
 		if (g[0][ch] == -1)
  			g[0][ch] = 0;
	
	memset(f, -1, sizeof f);
	queue<int> q;
	
	for (int ch = 0; ch < MAXC; ++ch)
	{
 		/* All nodes with depth 1 have failure function value as 0. */
 		if (g[0][ch] != 0)
 		{
  			f[g[0][ch]] = 0;
  			q.push(g[0][ch]);
 		}
	}
	
	/* Now queue has states 1 and 3 */
	while (q.size())
	{
 		/* Now, remove the front state from queue */
 		int state = q.front();
 		q.pop();
 		
 		/* Find failure function for those characters for which goto function is not defined. */
 		for (int ch = 0; ch <= MAXC; ++ch)
 		{
  			if (g[state][ch] != -1)
  			{
   				/* Find failure state of removed state */
   				int failure = f[state];
   				
   				while (g[failure][ch] == -1)
    				failure = f[failure];
   				failure = g[failure][ch];
   				f[g[state][ch]] = failure;
   				out[g[state][ch]] |= out[failure];
   				
   				/* Insert the next level node in Queue */
   				q.push(g[state][ch]);
  			}
 		}
	}
	return states;
}


int findNextState(int currentState, char nextInput)
{
	int answer = currentState;
	int ch = nextInput - 'a';
	while (g[answer][ch] == -1)
 		answer = f[answer];
	return g[answer][ch];
}


void searchWords(string arr[], int k, string text)
{
	buildMatchingMachine(arr, k);
	int currentState = 0;
	
	/* Traverse the text using built machine to find all occurrences of words in arr[] */
	int i;
	for (i = 0; i < text.size(); ++i)
	{
 		currentState = findNextState(currentState, text[i]);
 		
 		/* If match does not found, move to next state */
 		if (out[currentState] == 0)
  			continue;
 		
 		/* if match found, print all matching words of arr[] */
 		for (int j = 0; j < k; ++j)
 		{
  			if (out[currentState] & (1 << j))
  			{
   				cout << "Word " << arr[j] << " appears from "
    				<< i - arr[j].size() + 1 << " to " << i << endl;
  			}
 		}
	}
}


/* Main code */
int main()
{
	string arr[] = {"coding", "ninja", "as", "ding"};
	string text = "codingninjas";
	int k = sizeof(arr)/sizeof(arr[0]);
	searchWords(arr, k, text);
	return 0;
}

Output

Word coding appears from 0 to 5
Word ding appears from 2 to 5
Word ninja appears from 6 to 10
Word as appears from 10 to 11

Complexities

Time complexity

O(n+m+z), where n is the length of the text, m is the total number of characters in all words, z is the total number of occurrences of words in the text.
Reason: The Aho-corasick Algorithm formed the basis of the original Unix command fgrep.

Space complexity

O(alphabet_size * average key length * N), where N is the number of words in the trie.
Reason: Space complexity is used to create a trie data structure in Aho-corasick Algorithm.

Read More - Time Complexity of Sorting Algorithms

Get the tech career you deserve, faster!
Connect with our expert counsellors to understand how to hack your way to success
User rating 4.7/5
1:1 doubt support
95% placement record
Akash Pal
Senior Software Engineer
326% Hike After Job Bootcamp
Himanshu Gusain
Programmer Analyst
32 LPA After Job Bootcamp
After Job
Bootcamp

Frequently Asked Questions

  1. What is a trie?
    A trie is an ordered data structure that stores associative data structures. It is a type of search tree. It's also known as a Prefix tree or a Radix tree.
     
  2. How does Aho Corasick work?
    It's a dictionary-matching method that finds elements of a finite collection of strings (the "dictionary") in an input text. It matches all strings at the same time. The algorithm's complexity is proportional to the length of the strings, the length of the searched text, and the number of output matches.
     
  3. When should the Aho Corasick algorithm be used?
    The Aho-Corasick algorithm can be used to quickly search for various patterns in a big blob of text, making it a valuable tool in data science and other fields.
     
  4. What is the purpose of a trie data structure?
    A Trie is a unique data structure for storing texts that can be viewed as a graph. Each node has a maximum of 26 children, with edges connecting each parent node to its offspring.
     
  5. What is the space complexity of trie?
    The space complexity of creating a trie is O(alphabet_size * average key length * N), where N is the number of words in the trie.

Key Takeaways

In this article, we have learned the Aho-Corasick algorithm to search for various patterns in a vast blob of text. We hope that this blog will help you understand the concept of the Aho-Corasick algorithm using trie, and if you would like to learn more about trie, check out this blog on Trie.

Check out this problem - Longest Common Prefix

Attempt our Online Mock Test Series on Coding Ninjas Studio now!

Happy Coding!

Previous article
Longest Common Prefix Using Trie
Next article
Binary Search Algorithm
Live masterclass