Introduction
This blog will discuss an Aho Corasick Algorithm used in pattern searching. Aho Corasick Algorithm is a crucial topic and is widely asked in various coding contests and multiple interviews. This blog will discuss what aho corasick algorithm is and its applications.
Also Read, Byte Array to String
Aho Corasick Algorithm
Aho corasick algorithm was given by Alfred Aho and Margaret corasick in 1975. This algorithm helps get all the occurrences of all the given sets of keywords. It is the type of dictionary-matching algorithm. It uses a trie data structure. After making the trie, it converts it into an automaton, due to which we can search in linear time. Aho corasick algorithm works in three different phases. These are:
- Go-To: This phase makes the trie using all the given keywords.
- Failure: This phase tries to find the backward transition to get the proper suffix of some keywords.
-
Output: In this phase, it finds all the words that are ending at that stage for each state of automation.
Let’s understand more clearly by taking an example.
Example
Suppose we have to find whether the strings “ACG”, “ATC”, “CAT”, “GCG” is present in the string “GCATCG” or not.
The first phase is the Go-To phase, in which we have to build the trie-like data structure, which is shown in the below figure.
As seen in the above figure, several failures links come into play when there is a mismatch in character. This is also known as the failure phase. When any mismatch occurs, then failure links refer the string to the substring with the most prefix match.
There are a few red dots that signify the output phase. These signify that the string corresponding to that node is matched.
Implementation
#include <bits/stdc++.h>
using namespace std;
int output[500];
int fail[500];
int gotoMat[500][26];
int buildTree(string array[], int size)
{
/*All elements of output is 0*/
for (int i = 0; i < 500; i++)
output[i] = 0;
/*All elements of failure array is -1*/
for (int i = 0; i < 500; i++)
fail[i] = -1;
/*All elements of goto matrix is -1*/
for (int i = 0; i < 500; i++)
for (int j = 0; j < 26; j++)
gotoMat[i][j] = -1;
/*There is only one state*/
int state = 1;
/*Make trie structure for all pattern in array*/
for (int i = 0; i < size; i++)
{
string word = array[i];
int presentState = 0;
for (int j = 0; j < word.size(); ++j)
{
/* all pattern is added */
int ch = word[j] - 'a';
if (gotoMat[presentState][ch] == -1)
gotoMat[presentState][ch] = state++;
presentState = gotoMat[presentState][ch];
}
output[presentState] |= (1 << i);
}
/*If ch is not directly connected to root*/
for (int ch = 0; ch < 26; ++ch)
if (gotoMat[0][ch] == -1)
gotoMat[0][ch] = 0;
queue<int> q;
for (int ch = 0; ch < 26; ++ch)
{
/*Node goes to previous state when fails*/
if (gotoMat[0][ch] != 0)
{
fail[gotoMat[0][ch]] = 0;
q.push(gotoMat[0][ch]);
}
}
while (q.size())
{
/*remove front node*/
int state = q.front();
q.pop();
for (int ch = 0; ch <= 26; ++ch)
{
if (gotoMat[state][ch] != -1)
{
/* if goto state is present*/
int failure = fail[state];
/*Find deepest node with proper suffix*/
while (gotoMat[failure][ch] == -1)
failure = fail[failure];
failure = gotoMat[failure][ch];
fail[gotoMat[state][ch]] = failure;
output[gotoMat[state][ch]] |= output[failure];
q.push(gotoMat[state][ch]);
}
}
}
return state;
}
int getNextState(int presentState, char nextChar)
{
int answer = presentState;
int ch = nextChar - 'a';
/*if go to is not found, use fail function*/
while (gotoMat[answer][ch] == -1)
answer = fail[answer];
return gotoMat[answer][ch];
}
void patternSearch(string arr[], int size, string text)
{
/*Make the trie*/
buildTree(arr, size);
int presentState = 0;
for (int i = 0; i < text.size(); i++)
{
/* find all occurances of pattern */
presentState = getNextState(presentState, text[i]);
if (output[presentState] == 0)
for (int j = 0; j < size; ++j)
{
if (output[presentState] & (1 << j))
{
cout << "Word " << arr[j] << " location: " << i - arr[j].size() + 1 << endl;
}
}
}
}
int main()
{
int n;
cin>>n;
string arr[n],text;
for(int i=0;i<n;i++)
cin>>arr[i];
cin>>text;
int k = sizeof(arr) / sizeof(arr[0]);
patternSearch(arr, k, text);
return 0;
}
Input
5
their there answer any bye
isthereanyanswerofgoodbye
Output
Word their location: 2
Word any location: 7
Word answer location: 10
Word bye location: 22
Complexities
Time Complexity
The time complexity of the above program is O(N + L + Z), where ‘N’ is the length of the text, ‘L’ is the length of keywords, and ‘Z’ is the number of matches.
Space Complexity
O(L * Q), where ‘Q’ is the length of the alphabet since that is the maximum number of children a node can have.
Check out this article - Converting NFA to DFA