Approach 1: Greedy Algorithm
We can approach this problem in a greedy manner by finding maximum overlapping strings. Overlap means suffix of one string matches prefix of another string.
Maximum overlap means when matching suffix and prefix length is maximum.
Algorithm:
- Create an auxiliary array of strings, temp[]. Copy contents of arr[] to temp[].
-
While the size of temp[] is greater than 1
- In temp[], find the strings that result in maximum overlapping. Say we have two strings “s1” and “s2” that results in maximum overlapping among all the given strings in temp[].
- Remove string “s1” and “s2” from the temp[] and add the resultant string after merging s1 and s2 in temp[].
- Return temp[0], the only final string left in temp which is the required shortest superstring.
Code
#include <bits/stdc++.h>
using namespace std;
int Overlap(string s1,string s2, string &str)
{
int max_len = INT_MIN;
int len1 = s1.length();
int len2 = s2.length();
//suffix of s1 with prefix of s2
for (int i = 1; i <=min(len1, len2); i++)
{
// Compare last i char of s1 with first i char of s2.
if (s1.compare(len1-i, i, s2,0, i) == 0)
{
if (max_len < i)
{
max_len = i;
str = s1 + s2.substr(i);
}
}
}
//suffix of s2 with prefix of s1
for (int i = 1; i <=min(len1, len2); i++)
{
// Compare last i char of s1 with first i char of s2.
if (s1.compare(0, i, s2,len2-i, i) == 0)
{
if (max_len < i)
{
max_len = i;
str = s2 + s1.substr(i);
}
}
}
return max_len;
}
// Function to calculate
// smallest string that contains
// each string in the given
// set as substring.
string findShortestSuperstring(vector<string> arr)
{
int len=arr.size();
while(len>1)
{
int max_len = INT_MIN;
int l, r;
string max_string;
// Maximum overlap
for (int i = 0; i < len; i++)
{
for (int j = i + 1; j < len; j++)
{
string str;
int res = Overlap(arr[i],arr[j], str);
if (max_len < res)
{
max_len = res;
max_string=str;
//indexes to replace
l = i, r = j;
}
}
}
len--;
if (max_len == INT_MIN)
arr[0] += arr[len];
else
{
arr[l] = max_string;
arr[r] = arr[len];
}
}
return arr[0];
}
// Driver program
int main()
{
vector<string> arr= {"catg","ctaagt","gcta","ttca","atgcatc"};
cout<< findShortestSuperstring(arr);
return 0;
}
Output
“gctaagttcatgcatc”
Complexity Analysis
Time Complexity: NP-Hard
Since the greedy approach is an approximate solution only for this NP-Hard problem, the exact time complexity is unknown.
Space Complexity: O(N)
Since we need to use O(N) extra space as temp[] array to store all the strings.
Also check out - Substr C++
Approach 2: Dynamic Programming
If we observe, this is the Traveling salesman problem (TSP). If we think of our strings as nodes, we can calculate their distance.
For example, the distance between “abcde” and “cdefghij” is 5 because we need to use 5 additional symbols “fghij” to complete the first string to get the second. Because our graph is not symmetric, therefore it is a directed graph.
So we have a challenge similar to the Traveling Salesman Problem (TSP): we need to determine the shortest path that visits all nodes.
Algorithm:
- We must arrange the words in a row, with each word possibly overlapping the previous one. This is due to the fact that no word is enclosed within another word. We must try to maximise this overlap.
- Let's pretend we've written some words in a sequence that ends with the letter A[i]. Let's say we write the word A[j] as the next word, but we haven't yet written the word A[j]. By overlapping, the overlap grows (A[i], A[j]).
-
There are two states in this situation:
- mask: a binary mask of length n that contains nodes that have already been visited.
- Last: the node that was last visited
- dp[mask][last] will contain the length of the build string.
-
The main recursion is :
dp(mask ^ (1<<j), j) = max(overlap(A[i], A[j]) + dp(mask, i)),
where the jth bit is not active in the mask, and i ranges over all bits set in the mask.
- Later we have to reconstruct the answer from the parent information since we only have the value of maximum overlap. So we would also need to store the parent information.
Code
#include <iostream>
#include <bits/stdc++.h>
using namespace std;
string findShortestSuperstring(vector<string>& A) {
const int n = A.size();
vector<vector<int> > g(n, vector<int>(n));
for(int i = 0; i < n; i++)
for(int j = 0; j < n; j++){
g[i][j] = A[j].length();
for(int k = 1; k <= min(A[i].length(), A[j].length()); k++){
if(A[i].substr(A[i].length()- k) == A[j].substr(0, k)){
g[i][j] = A[j].length() - k;
}
}
}
vector<vector<int> > dp(1<<n, vector<int>(n, INT_MAX/2));
vector<vector<int> > parent(1<<n, vector<int>(n, -1));
for(int i = 0; i < n; i++){ dp[1<<i][i] = A[i].length();}
for(int s = 1; s < (1<<n); s++){
for(int j = 0; j < n; j++){
if(!s&(1<<j)) continue;
//if(!(s&(1<<j))) continue;
int ps = s & ~(1<<j);
for(int i = 0; i < n; i++){
if(dp[s][j] > dp[ps][i] + g[i][j]){
dp[s][j] = dp[ps][i] + g[i][j];
parent[s][j] = i;
}
}
}
}
string res;
auto it = min_element(begin(dp.back()), end(dp.back()));
int j = it - begin(dp.back()); // cur
int s = (1 << n) - 1;
while(s){
int i = parent[s][j]; // pre
if(i < 0) res = A[j] + res;
else res = A[j].substr(A[j].length() - g[i][j]) + res;
s &= ~(1<<j);
j = i;
}
return res;
}
// Driver program
int main()
{
vector<string> arr= {"catg","ctaagt","gcta","ttca","atgcatc"};
cout<< findShortestSuperstring(arr);
return 0;
}
Output
“gctaagttcatgcatc”
Complexity Analysis
Time Complexity: O(N2 * 2N)
Where n is the number of strings, we have a total of (2N * N) possible states, and O(N) work is required for each state.
Space Complexity: O(2N * N)
Because we have a 2d array of size O(2N * N).
Check out this problem - Longest Common Prefix
Frequently Asked Questions
-
What is the NP-Hard problem?
NP stands for Non-deterministic Polynomial Time. If a method for solving a problem can be transformed into an algorithm for solving any NP-problem (nondeterministic polynomial time) issue, the problem is NP-hard.
-
What do you mean by overlapping of strings in superstring problem?
Overlap means suffix of one string matches prefix of another string. Maximum overlap means when matching suffix and prefix length is maximum.
Key Takeaways
The task of finding the shortest superstring is NP-Hard. However, a greedy approach to solving this problem can lead to a "near-optimal" solution.
If you wish to learn more about greedy algorithms, visit Greedy Algorithms in Array.
Since the greedy approach is approximate and not 100% correct, we moved over to the dynamic programming approach, which is much more efficient and 100% accurate.
Recommended problems -
You can study more about Dynamic Programming by visiting ‘Dynamic Programming and Algorithms.’
Happy Coding!