Table of contents
1.
Introduction
2.
The Problem Statement
2.1.
Example
3.
Approach 
3.1.
Brute Force Approach
3.1.1.
Time Complexity
3.1.2.
Space Complexity
3.2.
Suffix Array Approach
3.3.
Pseudo Code
3.3.1.
Time Complexity
3.3.2.
Space Complexity
4.
FAQs
5.
Key Takeaways
Last Updated: Mar 27, 2024

Substrings and Frequency

Author Ujjawal Gupta
0 upvote
Career growth poll
Do you think IIT Guwahati certified course can help you in your career?

Introduction

In this blog, we are going to solve the problem based on the string. The problem will be solved using the suffix array approach. Suffix array-based coding problems are widely asked in various coding contests and in various competitive programming interviews.

The Problem Statement

Given a string ‘S’. Your task is to process ‘Q’ queries wherein each query you are given an integer ‘F,’ and you have to find the number of substrings of ‘S’ that occurs at least ‘F’ times in ‘S’.
Note: Two strings are said to be different in two cases: they have different lengths or a different starting index.

Example

Let S = “aaeddf” and we have one query F = 2. Here, substrings “a” and “d” occur twice each in the string ‘S’. Therefore the required answer for F = 2 in string ‘S’ is 4. That is all occurrences of “a” and “d”.

Approach 

Brute Force Approach

The naive approach to solving the above problem is to generate all the substrings of string ‘S’ and count the frequency of each string formed. After storing the frequency of all the substrings in the map, count the number of strings whose frequency is greater than or equal to ‘F’ by traversing the map. This is the simplest but not the most efficient approach to solve the above problem. 

Time Complexity

O( N ^ 2), where ‘N’ is the length of the given string. Because generating all the substrings takes O(N ^ 2). 

Space Complexity

O(N ^ 2), where ‘N’ is the length of the string. Because storing all the strings in the map takes O(N ^ 2) extra space.

Suffix Array Approach

The above problem can be solved using the suffix array approach. We need to construct a suffix array ‘SA’ of the string ‘S’ in this approach. Create an array ‘LCP’ where LCP[i] = lcp(SA[i], SA[i+1]) where lcp is the longest common prefix.

The ‘LCP’ array can be seen as a histogram with each bar having height ‘LCP[i]’. Then for some range [i,j] if minimum of [LCP[i], LCP[i+1]… LCP[j]] is ‘M’, then it means a substring of length ‘M’, which is present at j-i+2 indices in string ‘S’.

Using this property, we can find the solution in the following way. Let ‘D[i]’ represent the number of substrings that repeat exactly ‘i’ times. If we can compute the array ‘D’, we can easily answer all the queries. For a given frequency F, required answer will be D[F] + D[F+1] + … D[length of S].

For a bar of height H (LCP[i] = H), let us assume that we know the largest interval (L[i], R[i]) such that the minimum of LCP array over the interval is equal to H, then it means that we have got a substring of length H, which repeats exactly R[i]-L[i]+2 times. 

We will update the ‘D’ array accordingly.

Pseudo Code

for V = 0 to N:
    sz = P[val].size()
    /* right is maintained for implementing two-pointer method.*/
    right = 0;
    for j = 0 to sz:
        index = P[val][j]
        /* to take care of the issue 2, don't overcount, always check whether your current element under*/
        /* consideration is not inside the range which has already been considered.*/
        if (index >= right):
            lo = L[index], hi = R[index]
            /* len denotes the current number of prefixes of the suffix corresponding to size V. */
            len = V;
            /* to take care of the issue 1.*/
            if (hi is defined, ie 0 <= hi < LCP.size()):
                len = min(len, val - LCP[hi])
            if (lo is defined, ie 0 <= lo < LCP.size()):
                len = min(len, val - LCP[hi]);
            /* add the current prefixes */
            D[hi-lo+2] += len * (hi - lo + 2);
       
            /* update the right pointer */
            right = index + 1;

You can also try this code with Online C++ Compiler
Run Code

Time Complexity

O(N), where ‘N’ is the length of the given string. Because storing all the suffix takes O(N). 

Space Complexity

O(N), where ‘N’ is the length of the string. Because storing all the suffix of string in the array takes O(N) extra space.

Also check out - Substr C++, and Euclid GCD Algorithm

FAQs

  1. What is string hashing?
    String hashing is the best approach to change over a string into a whole number known as a hash of that string.
     
  2. What are the advantages of Suffix Tree over Tries?
    Suffix Trees consume lesser space, and when you are storing a vast text, the difference between linear and quadratic space complexity can be huge. Suffix tree use can also reduce lookup time.
     
  3. What are the similarities between the suffix tree and the generalized suffix tree?
    Both suffix trees and generalized suffix trees are implemented using a tree data structure, and both are used to perform operations in the suffix of the string.
     
  4. What are the differences between the suffix tree and the generalized one? 
    The suffix tree is made only for a single string, whereas the generalized suffix tree is used for more than one string.
     
  5. What are the applications of the Suffix Tree?
    While there are many applications of this data structure, the fundamental ones are searching a string in a Text in O(len(string)) time for multiple checks, searching for longest repeated string, longest common substring, etc. It's also used for searching patterns in DNA or protein sequences.

Key Takeaways

In this blog, we learned about an interesting problem based on a string. We solved the problem efficiently using the Combinatorics approach. Another problem involving substring is: here.

Check out this problem - Longest Common Prefix

Hence learning never stops, and there is a lot more to learn.
So head over to our practice platform Coding Ninjas Studio to practice top problems, attempt mock tests, read interview experiences, and much more. Till then, Happy Coding!

Live masterclass