Introduction
The Aho-Corasick technique may be used to search for various patterns in a vast blob of text, making it a valuable tool in data science and many other fields. Alfred Aho and Margaret Corasick proposed this algorithm.
Let’s Dig Deeper into the Algorithm Itself.
Example
Given an input text and an array containing some words, arr[], find occurrences of all words in the input text.
Input
text = "codingninjas"
arr[] = {"coding", "ninja", "as", "ding"}
Output
Word coding appears from 0 to 5
Word ding appears from 2 to 5
Word ninja appears from 6 to 10
Word as appears from 10 to 11
Algorithm
We can solve this problem efficiently by using a Tree-like data structure to store our strings called ‘Trie’. Trie can efficiently store multiple strings and make string matching accessible. If you want to know more about Tries and their implementation, look at Using Trie in Data Structures. But here is a catch. We easily match patterns using a trie, but it will cost us a significant time and space complexity. Building a Trie takes linear time on average. For the above-given parameters, time complexity using a simple trie is O(kmax*n) time complexity, where kmax is the maximum pattern length, which can improve this complexity.
C++ implementation
/* C++ program for implementation of Aho-Corasick algorithm */
#include <bits/stdc++.h>
using namespace std;
/* Maximum number of characters in input alphabet */
const int MAXC = 26;
/* Maximum number of states in the matching machine, which is equal to the sum of the length of all keywords. */
const int MAXS = 500;
/* OUTPUT FUNCTION IS IMPLEMENTED USING out[] */
int out[MAXS];
/* GOTO FUNCTION IS IMPLEMENTED USING g[][] */
int g[MAXS][MAXC];
/* FAILURE FUNCTION IS IMPLEMENTED USING f[] */
int f[MAXS];
/* arr - array of words.
It will return the number of states that the built machine has. */
int buildMatchingMachine(string arr[], int k)
{
/* Initialize values to 0 in output function. */
memset(out, 0, sizeof out);
/* Initialize values to -1 in goto function. */
memset(g, -1, sizeof g);
int states = 1;
/* Construct values for goto function */
for (int i = 0; i < k; ++i)
{
const string &word = arr[i];
int currentState = 0;
/* Insert all characters in arr[] */
int j;
for (j = 0; j < word.size(); ++j)
{
int ch = word[j] - 'a';
/* Create a new node if a node for ch doesn't exist. */
if (g[currentState][ch] == -1)
g[currentState][ch] = states++;
currentState = g[currentState][ch];
}
out[currentState] |= (1 << i);
}
/* Characters which does not have an edge, add a goto edge to state 0 */
for (int ch = 0; ch < MAXC; ++ch)
if (g[0][ch] == -1)
g[0][ch] = 0;
memset(f, -1, sizeof f);
queue<int> q;
for (int ch = 0; ch < MAXC; ++ch)
{
/* All nodes with depth 1 have failure function value as 0. */
if (g[0][ch] != 0)
{
f[g[0][ch]] = 0;
q.push(g[0][ch]);
}
}
/* Now queue has states 1 and 3 */
while (q.size())
{
/* Now, remove the front state from queue */
int state = q.front();
q.pop();
/* Find failure function for those characters for which goto function is not defined. */
for (int ch = 0; ch <= MAXC; ++ch)
{
if (g[state][ch] != -1)
{
/* Find failure state of removed state */
int failure = f[state];
while (g[failure][ch] == -1)
failure = f[failure];
failure = g[failure][ch];
f[g[state][ch]] = failure;
out[g[state][ch]] |= out[failure];
/* Insert the next level node in Queue */
q.push(g[state][ch]);
}
}
}
return states;
}
int findNextState(int currentState, char nextInput)
{
int answer = currentState;
int ch = nextInput - 'a';
while (g[answer][ch] == -1)
answer = f[answer];
return g[answer][ch];
}
void searchWords(string arr[], int k, string text)
{
buildMatchingMachine(arr, k);
int currentState = 0;
/* Traverse the text using built machine to find all occurrences of words in arr[] */
int i;
for (i = 0; i < text.size(); ++i)
{
currentState = findNextState(currentState, text[i]);
/* If match does not found, move to next state */
if (out[currentState] == 0)
continue;
/* if match found, print all matching words of arr[] */
for (int j = 0; j < k; ++j)
{
if (out[currentState] & (1 << j))
{
cout << "Word " << arr[j] << " appears from "
<< i - arr[j].size() + 1 << " to " << i << endl;
}
}
}
}
/* Main code */
int main()
{
string arr[] = {"coding", "ninja", "as", "ding"};
string text = "codingninjas";
int k = sizeof(arr)/sizeof(arr[0]);
searchWords(arr, k, text);
return 0;
}
Output
Word coding appears from 0 to 5
Word ding appears from 2 to 5
Word ninja appears from 6 to 10
Word as appears from 10 to 11
Complexities
Time complexity
O(n+m+z), where n is the length of the text, m is the total number of characters in all words, z is the total number of occurrences of words in the text.
Reason: The Aho-corasick Algorithm formed the basis of the original Unix command fgrep.
Space complexity
O(alphabet_size * average key length * N), where N is the number of words in the trie.
Reason: Space complexity is used to create a trie data structure in Aho-corasick Algorithm.
Read More - Time Complexity of Sorting Algorithms