Last Updated: 3 Feb, 2021

Boyer Moore Algorithm For Pattern Searching

Moderate
Asked in companies
GoogleGoldman SachsRxLogix

Problem statement

You are given a string ‘text’ and a string ‘pattern’, your task is to find all occurrences of pattern in the string ‘text’ and return an array of indexes of all those occurrences. You may assume that the length of ‘text’ is always greater than the length of ‘pattern’.

For example :
Let text = “this is a good place to have a good start”, pattern = “good” so you have to return {10, 31} because at 10 and 31 index pattern is present in the text.
Note :
If there is no such index in the text then just return an array containing -1.
Input format :
The first line of input contains an integer ‘T’ denoting the number of test cases.

The first line of each test case contains a string ‘text’ in which the pattern will be searched for.

The second line of each test case contains a string ‘pattern’ of which you have to find all occurrences in ’text’.
Output format :
For each test case, return an array representing the indices where the pattern is found in the text. The output of each test case is printed in a separate line. 
Note :
You don’t have to print anything; it has already been taken care of. Just implement the given function. 
Constraints :
1 <= T <= 5
1 <= M < 100
1 <= N < M

Time Limit: 1 sec

Approaches

01 Approach

The idea is to find out which character mismatches in the text from the pattern that is the character of the text that doesn't match with the current character of the pattern and we call that character a bad character.

 

  • One thing here needed to be observed is that unlike other algorithms for pattern searching boyer-Moore algorithm starts matching from the last character of the pattern.
  • Therefore on finding a mismatch seen from the last, we will shift the pattern until the mismatch becomes a match and the pattern moves past the mismatched character i.e. pattern crosses the bad character. We will have two cases here as discussed-
  • While implementing this we preprocess the pattern and store the last occurrence of every possible character in an array of size equal to alphabet size.
  • In the first case when the mismatch becomes a match through shifting, We will find the position of the last occurrence of the mismatching character in the pattern and if the mismatching character exists in the pattern then we’ll shift the pattern such that it gets aligned to the mismatching character in the given text. For example ‘text’ = “ABCAFDBACA” and ‘pattern’ = “BACA”.
    • We can clearly see that we have our first mismatch at position 1 and the mismatched character is “A” so we will find the last occurrence of “A” in a pattern that is present at position 1 as we check from last.
    • Now we will shift the pattern two times so that “A” in the pattern gets aligned with the “A” in text.
  • Now for the second case where we have to move the pattern past the mismatched character because if the bad character does not exist in the pattern there is no point in searching again so we move the pattern past it like in the above example after moving the pattern two times ahead we get a mismatch at position 4 and the mismatching character “F” does not exist in the pattern before position 4 so we’ll shift pattern past to the position 4 and eventually in above example we have got a perfect match of pattern.

02 Approach

 The idea is to shift the pattern to the right by the least amount that guarantees not to skip any occurrence of the good-suffix already matched. Good suffix heuristic is based on the matched suffix. Here, we shift the pattern to the right in such a way that the matched suffix subpattern is aligned with another occurrence of the same suffix in the pattern. The good suffix is the suffix that we have matched so far. We will have three cases here as discussed-

 

  • First case when the matched substring of pattern in the text we move the pattern until we find another occurrence of that substring in the pattern, In this case, we will try to shift the pattern to align that next occurrence of a substring within the text.
  • The next case is if the prefix of the pattern, which matches with a suffix of the substring in the text if we fail the first case then we can search for some suffix of substring matching with some prefix of Pattern and try to align them by shifting Pattern.
  • If both the above cases fail, we will just shift the pattern past the substring that moves, we the pattern ahead of that substring.
  • There is an enhancement in this good suffix rule called the strong good suffix rule.
    • In this case, let us consider substring ‘sub’ which is in Pattern that is from i to M got matched with subtext in text and badCharacter = pattern[i-1] is the mismatching character. Unlike case first, we will search for subtext in a pattern that is not preceded by badCharacter . The closest such occurrence is then aligned with subtext in the text by shifting pattern. For example –
    • text = ”XXYXYXYXZYXWXYYZXY” and pattern = “XXZZXZZXZ“.
    • Here, sub = pattern[7 to 8] got matched with subtext in text. The mismatching character badCharacter is “Z” at position pattern[6]. Now if we start searching subtext in the pattern we will get the first occurrence of subtext starting at position 4. But this occurrence is preceded by “Z” which is equal to c, so we will skip this and carry on searching. At position 1 we got another occurrence of subtext. This occurrence is preceded by “X” which is not equivalent to badCharacter. So we will shift the pattern 6 times to align this occurrence with subtext in the text. We are doing this because we already know that character badCharacter = “C” causes the mismatch. So any occurrence of subtext preceded by badCharacter will again cause mismatch when aligned with subtext, so that’s why it is better to skip this.
  • In this case, also we will do some preprocessing. Here, we create an array ‘Skip’ in which we will store the distance pattern that will be shifted if a mismatch occurs at position i-1. That is, the suffix of the pattern starting at position i is matched and a mismatch occurs at position i-1. This is the preprocessing for the first case.
  • Now for preprocessing for strong good suffix and case 2.
    • For strong good suffix, we will store an array endBorders where we will store the starting index of the border for suffix starting at index i in the given pattern. Where border is referred to as a substring which is both proper suffix and proper prefix.
    • For the second case for each suffix, the largest border of the whole pattern that is contained in that suffix is determined. The starting index of the largest border of the pattern at all is stored in endBorders[0] and this computer value endBorders[0] is stored initially in all entries of array ‘Skip’. But when the suffix of the pattern becomes shorter than endBorders[0], this continues with the next-largest border of the pattern, i.e. with endBorders[0].
  • After this preprocessing in the search function, we will add the indices in the array and finally return them.