What is pair in C++ STL?

pair STL is commonly used to store tuples of size two. It can store two elements with different data types together. One important fact about them is that std::unordered_map doesn't contain a hash function for pairs and requires the user to supply the hash function of their choice.

What are the uses of "auto" in C++?

It can be used to initialize elements that have a complicated type. In our code above, we wrote "auto" instead of having to write "pair ". Using longer type names usually makes the code harder to read. It's important to remember that you can't declare an auto variable without initializing it to something.

What is the Z function?

Z function is calculated using the Z algorithm. It's a mapping between the index of any character of the string (except 0 because the z function isn't defined for index 0) and the maximum number of characters including and after the index that match exactly with the prefix of the string.

What is the prefix function?

The prefix function (П) is similar to the Z function. It's a mapping between the index of any character of the string (π[0] = 0) and the maximum number of characters including and before the index i that match exactly with the prefix of the string.

Brute force complexity for calculating the pi function of a string vs. the z function?

While the Z function and Prefix function look almost identical, the brute force time complexity for calculating the Z function is O(N2) and calculating the prefix function is O(N3). In the brute force approach, for the Z function for each index i, we keep iterating forward from i till the characters match with the prefix of the string and count the characters matched, these iterations are of the order O(N) and the time complexity is thus O(N2).On the other hand, for the prefix function, for each index i, we need to check for all substrings of the type s[j … i] where 0 < j <= i, if they match with the prefix of the string with the length j - i + 1. Here getting the substring and matching the substring will take O(N) each. Hence the time complexity for this one turns out to be O(N3).

Count all substrings of all lengths that match exactly with the prefix and suffix of a string

Introduction

The article discusses a problem involving strings, these problems are very commonly asked in various contests and interviews. Here in this article, we aim to understand how we can improve on the brute force approach and apply Z-algorithm to get the most optimal approach.

Problem Statement

Let there be a string s (length n), containing only lowercase letters [a-z] and n <= 2 * 10⁵. Our objective is to find for all l (1 <= l <= n), the count of substrings that have a length of l that matches exactly with the prefix and suffix of the string, i.e., the prefix of length l of string s = s[0…l-1] and the suffix of length l of string s = s[n - l ...n-1]. As output, print the count of all l for which the count is non zero and print each l with their count.

For example,

s = “aabaab”

for lengths 1, 2, 4 and 5, the prefix and suffix of that length don’t exactly match and therefore, our answer will not include them.

For length 3, the prefix and suffix are both “aab”. This string occurs two times.

For length 6, the prefix and suffix are both “aabaab” and occur only once.

The output will be:

2
3 2
6 1

Brute Force Approach

In the brute force approach, we check if the prefix of length i (0 <= i < n) matches the suffix of length i. If both the prefix and suffix match exactly, we find the count of substrings that matches exactly with the prefix of length i in the complete string.

Step by step implementation for the approach is.

We iterate over all the prefixes of length i.
In every iteration, we take the prefix of length i of the string and match it with the suffix of length i of the string.
If both the prefix and suffix are equal, we count the occurrences of the prefix of length i of the string in the original string.
We store the length of the prefix and the count as a pair in a vector.
After all the iterations are complete, we print the length of the vector and all the pairs as our final answer.

Code in C++

#include <iostream>
#include <vector>
#include <utility>
#include <string>
using namespace std;


void solve(int _){
    string s = "aabaab";
    int n = s.size();
    
    vector<pair<int,int>> ans;
    for(int i = 1; i <= n; i++){
        // prefix of length i
        string prefix_i = s.substr(0, i);
        // suffix of length i
        string suffix_i = s.substr(n - i, i);
        
        // check if the prefix and suffix match
        if(prefix_i == suffix_i){
            pair<int,int> countForLengthI = {i, 1};
            
            // count all occurunces of the prefix_i in the string s
            for(int j = 1; j <= n - i; j++){
                if(s.substr(j, i) == prefix_i){
                    countForLengthI.second += 1;
                }
            }
            // push the answer in the answer vector
            ans.push_back(countForLengthI);
        }
    }
    
    cout << ans.size() << endl;
    for(auto x: ans){
        cout << x.first << " " << x.second << endl;
    }
}


int32_t main(){
    solve(1);
}

You can also try this code with Online C++ Compiler

Run Code

Output

2
3 2
6 1

Time Complexity

The time complexity of the above algorithm is O(N³). We first loop over all the lengths (N), and then we iterate over all the substrings of length i (N - i - 1 < N - 1). Finally, we find the substring and compare each substring of length i (i <= N).

Space Complexity

The algorithm takes O(N) extra space. This is because we are making copies of part of the string when we take their substring. The overall complexity can’t be improved as we have to save the answer that can take O(N) extra memory even if, instead of taking substring, we simply loop over and compare those strings.

Optimized Approach (Z-Algorithm)

An O(N³) algorithm is slow and will not be sufficient for any practical use given the input constraint of the problem. Hence, we propose an improved solution using Z-Algorithm (if this is a new term for you, refer here). We can use Z-algorithm to find the length of the largest prefix that matches the substring starting at position i.

The idea for the improved algorithm is as follows:

We initially compute the z function of the string s, where z[i] stands for the longest matching prefix that matches with the substring of the string starting at position i.

We then try to find the count of all prefixes that can be obtained from various i.

Let’s say that the z function of string s is, z = [0, 1, 0, 3, 1, 0]. We make z[0] = n for ease of calculation.

Here ‘1’ occurs in the z array two times. Similarly, if we store the count for each prefix's occurrence we obtain substr_pref = [2, 2, 0, 1, 0, 0, 1], substr_pref[1] = 2 implies that the count of 1 in the array “z” is two.

Another observation to be made is that if the prefix of length 3 has a count of 2, then there's also the prefix of length 2 and 1 present within it, so we add that count to substr_pref[1] and substr_pref[2].

After that, we obtain the below array as substr_pref, which contains the count of all prefixes present as a substring in the string s.

substr_pref = [2, 4, 2, 2, 1, 1, 1]

After obtaining the above result, we check for each length if the prefix of length i matches the suffix of length i. This can bed done by checking if z[n-i] equals i, implying that the next i characters after n - i - 1 match with the prefix of length i.

If the check is true and z[n - i] == i, then we push the pair {i, substr_pref[i]} as one of our answers into the “ans” vector.

Step by step algorithm for the approach is:

We calculate the z array of string s using the Z-algorithm.
We find the count of all prefix lengths in the z array obtained above.
We update the count with the observation that if there is one prefix of length i, it contains prefixes of all lengths from 1 to i - 1 inside itself.
We iterate over the length of the prefixes, and for each length i, we check if z[n-i] equals i.
If the condition holds, then the prefix of length i matches with the suffix of length i, now we push the pair {i, substr_pref[i]} into the "ans" vector. Here substr_pref[i] is the count of all substrings that match the prefix of length i.
After all the iterations are complete, we print the length of the vector and all the pairs as our final answer.

Code in C++

#include <iostream>
#include <vector>
#include <utility>
#include <string>
using namespace std;


vector<int> z_function(string a){
    int n = a.size();
    vector<int> z(n);
    int l, r;
    l = r = 0;
    for(int i = 1; i < n; i++){
        // z[i - l] is assigned to z[i] but i + z[i] <= r here
        // as we haven't examined characters beyond index r
        z[i] = min(max(r - i + 1, 0), z[i - l]);
        
        // keep increasing z[i] as long as possibled
        for(int j = z[i]; j < n - i; j += 1){
            if(a[j] != a[j + i]) break;
            z[i] += 1;
        }
        
        // update the range l to r
        if(i + z[i] - 1 > r){
            l = i;
            r = i + z[i] - 1;
        }
    }
    return z;
}


void solve(int _){
    string s = "aabaab";
    int n = s.size();
    
    // z_function for string s
    vector<int> z_res = z_function(s);
    z_res[0] = n;


    vector<int> substr_pref(n + 1);
    for(int i = 0; i < n; i++){
        substr_pref[z_res[i]] += 1;
    }
    
    // get the count of all substrings with length i that matches
    // first i characters of the string i.e., the prefix of it
    for(int i = n - 1; i >= 1; i--){
        substr_pref[i] += substr_pref[i + 1];
    }
    
    // we then determine if a suffix exists for string s that matches exactly with the prefix of length i
    vector<pair<int,int>> ans;
    for(int i = 1; i <= n; i++){
        if(z_res[n - i] == i){
            ans.push_back({i, substr_pref[i]});
        }
    }
    
    // print the result along with the count
    cout << ans.size() << endl;
    for(auto element: ans){
        cout << element.first << " " << element.second << endl;
    }
}


int32_t main(){
    solve(1);
}

You can also try this code with Online C++ Compiler

Run Code

Output

2
3 2
6 1

Time Complexity

The time complexity of the above algorithm is O(N). The time complexity of the Z-algorithm is O(N), and all other requirements were also calculated using a single iteration (O(N)). Hence, the combined time complexity of the algorithm is also O(N).

Space Complexity

The algorithm takes O(N) extra space. We are storing various results (that take O(N) extra space) and "ans" vector (that can at max take O(N) extra space).

Also check out - Substr C++

FAQs

What is pair in C++ STL?
pair STL is commonly used to store tuples of size two. It can store two elements with different data types together. One important fact about them is that std::unordered_map doesn't contain a hash function for pairs and requires the user to supply the hash function of their choice.
What are the uses of "auto" in C++?
It can be used to initialize elements that have a complicated type. In our code above, we wrote "auto" instead of having to write "pair<int, int>". Using longer type names usually makes the code harder to read. It's important to remember that you can't declare an auto variable without initializing it to something.
What is the Z function?
Z function is calculated using the Z algorithm. It's a mapping between the index of any character of the string (except 0 because the z function isn't defined for index 0) and the maximum number of characters including and after the index that match exactly with the prefix of the string.
What is the prefix function?
The prefix function (П) is similar to the Z function. It's a mapping between the index of any character of the string (π[0] = 0) and the maximum number of characters including and before the index i that match exactly with the prefix of the string.
Brute force complexity for calculating the pi function of a string vs. the z function?
While the Z function and Prefix function look almost identical, the brute force time complexity for calculating the Z function is O(N²) and calculating the prefix function is O(N³). In the brute force approach, for the Z function for each index i, we keep iterating forward from i till the characters match with the prefix of the string and count the characters matched, these iterations are of the order O(N) and the time complexity is thus O(N²).
On the other hand, for the prefix function, for each index i, we need to check for all substrings of the type s[j … i] where 0 < j <= i, if they match with the prefix of the string with the length j - i + 1. Here getting the substring and matching the substring will take O(N) each. Hence the time complexity for this one turns out to be O(N³).

Key Takeaways

The article aims to help you understand the use of the Z-algorithm and how to use it to solve a real problem with optimal time complexity. To understand the blog well, read a bit about Z-Algorithm here and other commonly used string algorithms here.

Check out this problem - Longest Common Prefix

Learn more about the C++ STL library from here if you have trouble understanding some part of the code. Visit the link here for carefully crafted courses on campus placement and interview preparation on coding ninjas.

Happy learning!