Last Updated: 18 Jun, 2023

Search Pattern (KMP - Algorithm)

Moderate
Asked in companies
AmazonAdobeIBM

Problem statement

You’re given two strings, 'text' of length 'n' and 'pattern' of length 'm', consisting of lowercase characters.


Find all the occurrences of the string ‘pattern’ in ‘text’.


For each occurrence, print the index from where it starts in the string ‘text’ (1 - indexed).


Example:
Input: ‘text’ = “cxyzghxyzvjkxyz” and ‘pattern’ = “xyz”

Output: Occurrences = [2, 7, 13]

Explanation: The pattern ‘pattern’ = “xyz” appears at 3 positions in ‘text’.
Input Format:
The first line contains the string ‘text’.
The second line contains the string ‘pattern’.


Output format:
The first line contains an integer ‘count’, the number of occurrences.
The first line contains ‘count’ integers, the positions of occurrence of ‘pattern’ in ‘text’ in sorted order.


Note:
You need not print anything. Return the list of positions in any order. They will be sorted and printed.

Approaches

01 Approach

Check each substring in ‘text’ of length ‘m’ and see if they are equal to ‘pattern’.

If we find a match we add the current index (1 - based) in our list ‘ans’.

The steps are as follows:

List stringMatch(string ‘text’, string ‘pattern’)

  1. Let ‘n’ = ‘text.length’, ‘m’ = ‘pattern.length’
  2. Let ‘ans’ be a list storing the indices of occurrences.
  3. For ‘i’ in the range from 0 to ‘n’ - ‘m’ (inclusive):
    • Check whether the substring of ‘text’ from index ‘i’ to ‘i’ + ‘m’ is equal to ‘pattern’ or not.
    • If the substring is equal to ‘pattern’:
      • Add ‘i’ + 1 in ‘ans’.
  4. Return ‘ans’.

02 Approach

The basic idea of the Knuth-Morris-Pratt algorithm is that in case of a mismatch, we need not check the entire ‘pattern’ string from the beginning. For this, we will create an LPS array of ‘pattern’.

LPS(Longest Prefix Suffix) is an array data structure that captures the length of the longest proper prefix which is also a proper suffix for every substring. Proper prefix means a prefix that is not equal to the actual string.

For example: Let ‘pattern’ = “abaabc”

For “a”, there is no proper prefix. Therefore LPS of “a” is 0.

For “ab”, proper prefix = [“a”] and proper suffix = [“b”]. Therefore LPS of “ab” is also 0.

For “aba”, proper prefix = [“a”, “ab”] and proper suffix = [“a”, “ba”]. Therefore LPS of “aba” is 1 because “a” is common.

For “abaa”, proper prefix = [“a”, “ab”, “aba”] and proper suffix = [“a”, “aa”, “baa”]. Therefore LPS of “abaa” is also 1.

For “abaab”, proper prefix = [“a”, “ab”, “aba”, “abaa”] and proper suffix = [“b”, “ab”, “aab”, “baab”]. Therefore LPS of “abaab” is 2 because “ab” is the longest common prefix that is also suffix.

For “abaabc”, proper prefix = [“a”, “ab”, “aba”, “abaa”, “abaab”] and proper suffix = [“c”, “bc”, “abc”, “aabc”, “baabc”]. Therefore LPS of “abaabc” is 0.

So our final lps array is: [0, 0, 1, 1, 2, 0].

Now when we will compare our strings, whenever we find a mismatch, instead of assigning the current variable pointer of ‘pattern’ as 0, we will assign it to the previous lps value, if possible. We will never decrement the variable pointer of ‘text’, so effectively the time complexity of this algorithm is almost linear.

The steps are as follows:

List stringMatch(string ‘text’, string ‘pattern’)

  1. Let ‘n’ = ‘text.length’, ‘m’ = ‘pattern.length’
  2. Let ‘ans’ be a list storing the indices of occurrences.
  3. Let ‘lps’ be the LPS array of size ‘m’. Now we will initialize it.
  4. ‘lps[0]’ = 0
  5. Let ‘i’ = 1, ‘j’ = 0
  6. While ‘i’ < ‘m’:
    • If ‘pattern[i]’ == ‘pattern[j]’:
      • ‘lps[i]’ = ‘j’ + 1
      • ‘i’ = ‘i’ + 1
      • ‘j’ = ‘j’ + 1
    • Else if ‘j’ > 0:
      • ‘j’ = ‘lps[j - 1]’
    • Else:
      • ‘lps[i]’ = 0
      • ‘i’ = ‘i’ + 1
  7. ‘i’ = 0, ‘j’ = 0
  8. While ‘i’ < ‘n’:
    • If ‘text[i]’ == ‘pattern[j]’:
      • ‘i’ = ‘i’ + 1
      • ‘j’ = ‘j’ + 1
    • Else if ‘j’ > 0:
      • ‘j’ = ‘lps[j - 1]’
    • Else:
      • ‘i’ = ‘i’ + 1
    • If ‘j’ == ‘m’, that is when the pattern matches:
      • Add (‘i’ - ‘m’ + 1) in ‘ans’.
      • ‘j’ = ‘lps[j - 1]’
  9. Return ‘ans’.