Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
2.
Problem Statement
3.
Approach 1: Greedy Algorithm
3.1.
Code
3.2.
Output
3.3.
Complexity Analysis
3.3.1.
Time Complexity: NP-Hard
3.3.2.
Space Complexity: O(N)
4.
Approach 2: Dynamic Programming
4.1.
Code
4.2.
Output
4.3.
Complexity Analysis 
4.3.1.
Time Complexity: O(N2 * 2N)
4.3.2.
Space Complexity: O(2N * N)
5.
Frequently Asked Questions
6.
Key Takeaways
Last Updated: Mar 27, 2024

Shortest Superstring

Author Rhythm Jain
1 upvote

Introduction

String problems are becoming increasingly popular as interview topics. To ace technical interviews with big companies, you'll need to master questions based on string applications.

Let's proceed deeper into the problem and its solution approach.

Problem Statement

We are given a set of N strings called arr[]. Our task is to find the smallest string that contains each string in the given set as a substring. (Assume that no string in arr[] is a substring of another string)

Example:

Assume we have the following set of strings:

Input:

arr = ["catg","ctaagt","gcta","ttca","atgcatc"]

Output:

“gctaagttcatgcatc"

Also see, Euclid GCD Algorithm

Approach 1: Greedy Algorithm

We can approach this problem in a greedy manner by finding maximum overlapping strings. Overlap means suffix of one string matches prefix of another string.

Maximum overlap means when matching suffix and prefix length is maximum.

Algorithm:

  • Create an auxiliary array of strings, temp[].  Copy contents of arr[] to temp[].
  • While the size of temp[] is greater than 1
    • In temp[], find the strings that result in maximum overlapping. Say we have two strings “s1” and “s2” that results in maximum overlapping among all the given strings in temp[].
    • Remove string “s1” and “s2” from the temp[] and add the resultant string after merging s1 and s2 in temp[].
  • Return temp[0], the only final string left in temp which is the required shortest superstring.

Code

#include <bits/stdc++.h>
using namespace std;

int Overlap(string s1,string s2, string &str)
{
	int max_len = INT_MIN;
	int len1 = s1.length();
	int len2 = s2.length();

	//suffix of s1 with prefix of s2
	for (int i = 1; i <=min(len1, len2); i++)
	{
		
		// Compare last i char of s1 with first i char of s2.
		if (s1.compare(len1-i, i, s2,0, i) == 0)
		{
			if (max_len < i)
			{
				max_len = i;
				str = s1 + s2.substr(i);
			}
		}
	}

	//suffix of s2 with prefix of s1
	for (int i = 1; i <=min(len1, len2); i++)
	{
		
		// Compare last i char of s1 with first i char of s2.
		if (s1.compare(0, i, s2,len2-i, i) == 0)
		{
			if (max_len < i)
			{
				max_len = i;
				str = s2 + s1.substr(i);
			}
		}
	}

	return max_len;
}

// Function to calculate
// smallest string that contains
// each string in the given
// set as substring.
string findShortestSuperstring(vector<string> arr)
{
	
	int len=arr.size();
	while(len>1)
	{
		int max_len = INT_MIN;
		int l, r;
		string max_string;
	
		// Maximum overlap
		for (int i = 0; i < len; i++)
		{
			for (int j = i + 1; j < len; j++)
			{
				string str;
				int res = Overlap(arr[i],arr[j], str);

				if (max_len < res)
				{
					max_len = res;
					max_string=str;
					//indexes to replace
					l = i, r = j;
				}
			}
		}
		len--;
		if (max_len == INT_MIN)
			arr[0] += arr[len];
		else
		{
			arr[l] = max_string;
			arr[r] = arr[len];
		}
	}
	return arr[0];
}

// Driver program
int main()
{
	vector<string> arr= {"catg","ctaagt","gcta","ttca","atgcatc"};
	cout<< findShortestSuperstring(arr);

	return 0;
}

Output

“gctaagttcatgcatc”

Complexity Analysis

Time Complexity: NP-Hard

Since the greedy approach is an approximate solution only for this NP-Hard problem, the exact time complexity is unknown.

Space Complexity: O(N)

Since we need to use O(N) extra space as temp[] array to store all the strings.

Also check out - Substr C++

Approach 2: Dynamic Programming

If we observe, this is the Traveling salesman problem (TSP). If we think of our strings as nodes, we can calculate their distance. 

For example, the distance between “abcde” and “cdefghij” is 5 because we need to use 5 additional symbols “fghij” to complete the first string to get the second. Because our graph is not symmetric, therefore it is a directed graph.

So we have a challenge similar to the Traveling Salesman Problem (TSP): we need to determine the shortest path that visits all nodes.

Algorithm:

  • We must arrange the words in a row, with each word possibly overlapping the previous one. This is due to the fact that no word is enclosed within another word. We must try to maximise this overlap.
  • Let's pretend we've written some words in a sequence that ends with the letter A[i]. Let's say we write the word A[j] as the next word, but we haven't yet written the word A[j]. By overlapping, the overlap grows (A[i], A[j]).
  • There are two states in this situation:
    • mask: a binary mask of length n that contains nodes that have already been visited.
    • Last: the node that was last visited
    • dp[mask][last] will contain the length of the build string.
  • The main recursion is : 
    dp(mask ^ (1<<j), j) = max(overlap(A[i], A[j]) + dp(mask, i))
    where the jth bit is not active in the mask, and i ranges over all bits set in the mask.
  • Later we have to reconstruct the answer from the parent information since we only have the value of maximum overlap. So we would also need to store the parent information.

Code

#include <iostream>
#include <bits/stdc++.h>
using namespace std;


string findShortestSuperstring(vector<string>& A) {
        const int n = A.size();
        vector<vector<int> > g(n, vector<int>(n));
        for(int i = 0; i < n; i++)
            for(int j = 0; j < n; j++){
                g[i][j] = A[j].length();
                for(int k = 1; k <= min(A[i].length(), A[j].length()); k++){
                    if(A[i].substr(A[i].length()- k) == A[j].substr(0, k)){
                        g[i][j] = A[j].length() - k;
                    }
                }
            }
        
        vector<vector<int> > dp(1<<n, vector<int>(n, INT_MAX/2));
        vector<vector<int> > parent(1<<n, vector<int>(n, -1));
        for(int i = 0; i < n; i++){ dp[1<<i][i] = A[i].length();}        
        
        for(int s = 1; s < (1<<n); s++){
            for(int j = 0; j < n; j++){
                if(!s&(1<<j)) continue;
                //if(!(s&(1<<j))) continue;
                int ps = s & ~(1<<j);
                for(int i = 0; i < n; i++){
                    if(dp[s][j] > dp[ps][i] + g[i][j]){
                        dp[s][j] = dp[ps][i] + g[i][j];
                        parent[s][j] = i;
                    }
                }
            }
        }
        
        string res;
        auto it = min_element(begin(dp.back()), end(dp.back()));
        int j = it - begin(dp.back()); // cur
        int s = (1 << n) - 1;
        while(s){
            int i = parent[s][j]; // pre
            if(i < 0) res = A[j] + res;
            else res = A[j].substr(A[j].length() - g[i][j]) + res;            
            s &= ~(1<<j);
            j = i;
        }
        return res;        
    }



// Driver program
int main()
{
vector<string> arr= {"catg","ctaagt","gcta","ttca","atgcatc"};
cout<< findShortestSuperstring(arr);


return 0;
}

Output

“gctaagttcatgcatc”

Complexity Analysis 

Time Complexity: O(N2 * 2N)

Where n is the number of strings, we have a total of (2N * N) possible states, and O(N) work is required for each state.

Space Complexity: O(2N * N)

Because we have a 2d array of size O(2N * N).

 

Check out this problem - Longest Common Prefix

Frequently Asked Questions

  1. What is the NP-Hard problem?
    NP stands for Non-deterministic Polynomial Time. If a method for solving a problem can be transformed into an algorithm for solving any NP-problem (nondeterministic polynomial time) issue, the problem is NP-hard.
     
  2. What do you mean by overlapping of strings in superstring problem?
    Overlap means suffix of one string matches prefix of another string. Maximum overlap means when matching suffix and prefix length is maximum.

Key Takeaways

The task of finding the shortest superstring is NP-Hard. However, a greedy approach to solving this problem can lead to a "near-optimal" solution.

If you wish to learn more about greedy algorithms, visit Greedy Algorithms in Array.

Since the greedy approach is approximate and not 100% correct, we moved over to the dynamic programming approach, which is much more efficient and 100% accurate.

Recommended problems -

 

You can study more about Dynamic Programming by visiting ‘Dynamic Programming and Algorithms.’

Happy Coding!

Live masterclass