Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
1.1.
Example 1
1.1.1.
Input
1.1.2.
Output
1.1.3.
Explanation
1.2.
Example 2
1.2.1.
Input
1.2.2.
Output
1.2.3.
Explanation
2.
Approach 
3.
Code:
4.
Time Complexity 
5.
Space Complexity 
6.
FAQs
7.
Key Takeaways
Last Updated: Mar 27, 2024

Longest Common Substring using a suffix tree

Author Apoorv
1 upvote
Master Python: Predicting weather forecasts
Speaker
Ashwin Goyal
Product Manager @

Introduction

Determine the longest common substring from a given string. The naive approach  [O(N * M ^ 2)] and Dynamic Programming [O(N * M)] techniques have already been discussed for the calculation of the Longest Common Substring. So first go and read all the approaches from here. One linear time strategy based on suffix trees will be discussed in this article which can find the Longest Palindromic Substring in O(N) time.

Example 1

Input

“abcde”

“abfce”

Output

Longest Common Substring in abcde and abfce is: ab, of length: 2

Explanation

There are a lot of common substrings between these two strings like a,b,f, ab but we have to find the longest common substring which is common between these two strings which is ab in this case.

Example 2

Input

qdef

defq

Output

3

Explanation

There are a lot of common substrings between these two strings like "q", "d", "e", "f", "de", "ef", "def". "def" is the longest common substring. So, the length of the longest common substring between these two substrings will be "3".

Approach 

First, we need to make a generalized suffix tree in linear time for the two given substrings. The Suffix Tree is a compressed trie that stores all of a string's suffixes. Suffix tree has been shown to be quite useful in string-related tasks such as detecting different substrings in a string, pattern matching, and determining the largest palindrome, among others. To read more about suffix trees, you can refer to here. To read more about the generalized suffix tree, you can read from here. Below is the generalized subtree of two substrings x = ABAB and y = BABA.

source

The generalized subtree for the given strings is ABAB#BABA$

The generalized subtree for the given strings is ABAB#BABA$. Now, after building a suffix tree, the task is to find the longest common substring. For that, we need to compare the generalized suffix tree between the given two strings. After that, find the common occurrences of the string by traversing the tree deep. Then the most important step is to find the common occurrences and then choose the longest string among them. Let X and Y be two strings containing some common substring. We will construct a generalized suffix tree with delimiters # for X and $ for y.

Then perform the depth-first traversal on the generalized suffix tree formed and count the number of occurrences of # and $ as a tuple(#,$).

Then compare the (#, $) tuple values and choose the longest substring. While we maintain occurrences of # and $ at each node, maintain an array of the max count. If at any internal node the value of the tuple (#,$) is greater than the current, update the entry in the array. In case we get the same maximum tuple value for multiple nodes, compare the length of string for which we are getting the tuple value.   

Making use of the suffix tree constructed in linear time using ukkonen’s algorithm we can solve the problem of the longest substring in o(1) time. Below is the detailed algorithm for the given approach.

Also check out - Substr C++

Get the tech career you deserve, faster!
Connect with our expert counsellors to understand how to hack your way to success
User rating 4.7/5
1:1 doubt support
95% placement record
Akash Pal
Senior Software Engineer
326% Hike After Job Bootcamp
Himanshu Gusain
Programmer Analyst
32 LPA After Job Bootcamp
After Job
Bootcamp

Code:

/*
    Ukkonen's Suffix Tree Construction is implemented in C++
    We create a generalised suffix tree for two strings in this example
    The longest common substring of the two input strings is then found
*/
#include <bits/stdc++.h>
using namespace std;

#define MAX_CHAR 256

struct SuffixTreeNode {
    struct SuffixTreeNode *children[MAX_CHAR];

    //pointer to other node via suffix link
    struct SuffixTreeNode *suffixLink;

    /*
        The (start, end) interval indicates the edge that connects the node to its parent node. 
        Each edge will connect two nodes, one parent and one child, and the child node will record 
        the (start, end) interval of each edge. If two nodes X and Y are joined by an edge with the 
        indices (5, 8) stored in node Y, then the indices (5, 8) will be stored in node Y.
    */
    int start;
    int *end;

    //  It saves the suffix index for the path from root to leaf for leaf nodes.
    int suffixIndex;
};

typedef struct SuffixTreeNode Node;

// Input string
char text[100]; 

// Root node pointer
Node *root = NULL; 

/*
    lastNewNode will point to a newly generated internal node that is awaiting the setting of its suffix link, 
    which may receive a new suffix link (other than root) in the following phase extension. When the previous 
    newly generated internal node (if any) has its suffix link reset to a new internal node created in the 
    following extension of the same phase, lastNewNode will be set to NULL.
*/
Node *lastNewNode = NULL;
Node *activeNode = NULL;

// activeEdge is represented as a character index in the input string (not the character itself)
int activeEdge = -1;
int activeLength = 0;

// remainingSuffixCount indicates how many suffixes in the tree have yet to be added.
int remainingSuffixCount = 0;
int leafEnd = -1;
int *rootEnd = NULL;
int *splitEnd = NULL;

//Length of input string
int size = -1; 

//String 1 size that is x
int size1 = 0; 

Node *newNode(int start, int *end)
{
    Node *node =(Node*) malloc(sizeof(Node));
    int i;
    for (i = 0; i < MAX_CHAR; i++)
        node->children[i] = NULL;

    /*
        The suffixLink property will be set to NULL for the root node. In the current extension, 
        suffixLink will be set to root by default for internal nodes, however this may change in
         the future.
    */
    node->suffixLink = root;
    node->start = start;
    node->end = end;

    /*
        By default, suffixIndex is set to -1, and the actual suffix index is set afterwards for 
        leaves at the end of all phases.
    */
    node->suffixIndex = -1;
    return node;
}

int edgeLength(Node *n) {
    if(n == root)
        return 0;
    return *(n->end) - (n->start) + 1;
}

int walkDown(Node *currNode)
{

    /*
        Using the Skip/Count Trick, alter activePoint for a walk down (APCFWD) (Trick 1). Set 
        next internal node as activeNode and change activeEdge and activeLength to represent same 
        activePoint if activeLength is higher than current edge length.
    */
    if (activeLength >= edgeLength(currNode))
    {
        activeEdge += edgeLength(currNode);
        activeLength -= edgeLength(currNode);
        activeNode = currNode;
        return 1;
    }
    return 0;
}

void extendSuffixTree(int pos)
{

    /*
        Extension Rule 1 is responsible for extending all of the leaves that have been formed so far
         in the tree.
    */
    leafEnd = pos;

    /*
        Increase the amount of time left. SuffixCount indicates that a new suffix has been added to the 
        list of suffixes that have not yet been added to the tree.
    */
    remainingSuffixCount++;

    /*
        When entering a new phase, set lastNewNode to NULL to indicate that no internal node is waiting
         for its suffix link to be reset in the current phase.
    */
    lastNewNode = NULL;

    //Add all suffixes (yet to be added) one by one in tree
    while(remainingSuffixCount > 0) {

        if (activeLength == 0)
            activeEdge = pos; //APCFALZ

        // Important: Add all suffixes (that haven't been added yet) to the tree one by one.
        if (activeNode->children] == NULL)
        {

            //Starting with activeEdge from activeNode, there is no outgoing edge.
            activeNode->children] = newNode(pos, &leafEnd);

            /*
                In the above line, a new leaf edge is generated, starting from an existing node (the current
                 activeNode), and if any internal nodes are waiting for their suffix links to be reset, point 
                 the suffix link from that last internal node to the current activeNode. Then set lastNewNode 
                 to NULL to indicate that there are no more nodes waiting for the suffix link to be reset.
            */
            if (lastNewNode != NULL)
            {
                lastNewNode->suffixLink = activeNode;
                lastNewNode = NULL;
            }
        }
        
        // From activeNode, there is an outgoing edge called activeEdge.
        else
        {

            // Start with activeEdge to get the next node at the end of the edge.
            Node *next = activeNode->children];

            // Go down
            if (walkDown(next))
            {

                // Begin at the next node (the new activeNode)
                continue;
            }

            // Extension Rule 3 (the currently processed character is already on the edge)
            if (text[next->start + activeLength] == text[pos])
            {

                /*
                    Set the suffix link of a freshly generated node to the current active node
                    if it is waiting for its suffix link to be set.
                */
                if(lastNewNode != NULL && activeNode != root)
                {
                    lastNewNode->suffixLink = activeNode;
                    lastNewNode = NULL;
                }

                // APCFER3
                activeLength++;

                // STOP ALL ADDITIONAL PROCESSING IN THIS PHASE AND GO TO THE NEXT PHASE
                break;
            }

            /*
                make sure When activePoint is in the middle or mid way of the edge being traversed and the current 
                character being processed is not on the edge (we fall off the tree), we will be
                here. In this scenario, we're going to add a new internal node and a new leaf edge
                that comes out of it. This is Extension Rule 2, which creates a new leaf edge
                and an internal node.
            */
            splitEnd = (int*) malloc(sizeof(int));
            *splitEnd = next->start + activeLength - 1;

            // New node (internal node)
            Node *split = newNode(next->start, splitEnd);
            activeNode->children] = split;

            // A new leaf emerges from a new internal node.
            split->children] = newNode(pos, &leafEnd);
            next->start += activeLength;
            split->children] = next;

            /*
                We've added a new internal node to the mix. If any internal nodes generated in 
                previous extensions of the same phase have yet to have their suffix links reset,
                 do so now.
            */
            if (lastNewNode != NULL)
            {

                // lastNewNode's suffixLink links to the current freshly generated internal node.
                lastNewNode->suffixLink = split;
            }

            /*
                Make the current newly constructed internal node wait for its suffix link to be reset 
                (which is currently pointing to root). If we may come across any other internal node (existing 
                or freshly formed) in the next extension of the same phase, suffixLink of this node will 
                point to that internal node when a new leaf edge is added (i.e. when Extension Rule 2 applies
                in any of the next extensions of the same phase).
            */
            lastNewNode = split;
        }

        // One suffix was added to the tree, reducing the number of suffixes still to be added.
        remainingSuffixCount--;

        // check for active node
        if (activeNode == root && activeLength > 0) //APCFER2C1
        {
            activeLength--;
            activeEdge = pos - remainingSuffixCount + 1;
        }
        
        else if (activeNode != root) //APCFER2C2
        {
            activeNode = activeNode->suffixLink;
        }
    }
}

void print(int i, int j)
{
    int k;
    for (k=i; k<=j && text[k] != '#'; k++)
        printf("%c", text[k]);
    if(k<=j)
        printf("#");
}

/*
    As well as printing the suffix tree, you may also set the suffix index.
    As a result, the tree will be printed in DFS format.
    Each edge will be printed together with its suffix index.
*/
void setSuffixIndexByDFS(Node *n, int labelHeight)
{
    if (n == NULL) return;

    if (n->start != -1) // A non-root node
    {

    }
    int leaf = 1;
    int i;
    for (i = 0; i < MAX_CHAR; i++)
    {
        if (n->children[i] != NULL)
        {

            // Because it has outgoing /edges, the current node is not a leaf.
            leaf = 0;
            setSuffixIndexByDFS(n->children[i], labelHeight +
                                edgeLength(n->children[i]));
        }
    }

    // check for leaf
    if (leaf == 1)
    {

        // Iteration
        for(i= n->start; i<= *(n->end); i++)
        {
            if(text[i] == '#')
            {
                n->end = (int*) malloc(sizeof(int));
                *(n->end) = i;
            }
        }
        n->suffixIndex = size - labelHeight;
    }
}

void freeSuffixTreeByPostOrder(Node *n)
{
    if (n == NULL)
        return;
    int i;
    for (i = 0; i < MAX_CHAR; i++)
    {
        if (n->children[i] != NULL)
        {
            freeSuffixTreeByPostOrder(n->children[i]);
        }
    }
    if (n->suffixIndex == -1)
        free(n->end);
    free(n);
}

/*
    Create the suffix tree and output the edge labels, as well as suffixIndex. 
    For leaf edges, suffixIndex will be >= 0 while for non-leaf edges, suffixIndex 
    will be -1.
*/
void buildSuffixTree()
{
    size = strlen(text);
    int i;
    rootEnd = (int*) malloc(sizeof(int));
    *rootEnd = - 1;

    /*
        Because it has no parent from which an edge comes to root, root is a special 
        node with start and end indices of -1.
    */
    root = newNode(-1, rootEnd);

    // root will be first active node
    activeNode = root;
    for (i=0; i<size; i++)
        extendSuffixTree(i);
    int labelHeight = 0;
    setSuffixIndexByDFS(root, labelHeight);
}

int doTraversal(Node *n, int labelHeight, int* maxHeight,
int* substringStartIndex)
{
    if(n == NULL)
    {
        return;
    }
    int i=0;
    int ret = -1;

    // only if it is internal node
    if(n->suffixIndex < 0) 
    {
        for (i = 0; i < MAX_CHAR; i++)
        {
            if(n->children[i] != NULL)
            {
                ret = doTraversal(n->children[i], labelHeight +
                    edgeLength(n->children[i]),
                    maxHeight, substringStartIndex);
                
                if(n->suffixIndex == -1)
                    n->suffixIndex = ret;
                else if((n->suffixIndex == -2 && ret == -3) ||
                    (n->suffixIndex == -3 && ret == -2) ||
                    n->suffixIndex == -4)
                {

                    // Mark this given node as XY
                    n->suffixIndex = -4;

                    // always keep the track of the deepest node
                    if(*maxHeight < labelHeight)
                    {
                        *maxHeight = labelHeight;
                        *substringStartIndex = *(n->end) -
                            labelHeight + 1;
                    }
                }
            }
        }
    }

    // suffix of X
    else if(n->suffixIndex > -1 && n->suffixIndex < size1)

        // Mark this node as X
        return -2;

    // suffix of Y
    else if(n->suffixIndex >= size1) 

    // Mark this node as Y       
    return -3;    
    return n->suffixIndex;
}

void getLongestCommonSubstring()
{
    int maxHeight = 0;
    int substringStartIndex = 0;
    doTraversal(root, 0, &maxHeight, &substringStartIndex);
    
    int k;
    for (k=0; k<maxHeight; k++)
        printf("%c", text[k + substringStartIndex]);
    if(k == 0)
        printf("No common substring");
    else
        printf(", of length: %d",maxHeight);
    printf("\n");
}

// main function which will test all the given functions
int main(int argc, char *argv[])
{
    size1 = 7;
    printf("Longest Common Substring in xabxac and abcabxabcd is: ");
    strcpy(text, "xabxac#abcabxabcd$"); buildSuffixTree();
    getLongestCommonSubstring();

    //Free the dynamically allocated memory
    freeSuffixTreeByPostOrder(root);

    size1 = 10;
    printf("Longest Common Substring in xabxaabxa and babxba is: ");
    strcpy(text, "xabxaabxa#babxba$"); buildSuffixTree();
    getLongestCommonSubstring();

    //Free the dynamically allocated memory
    freeSuffixTreeByPostOrder(root);
    
    size1 = 6;
    printf("Longest Common Substring in abcde and fghie is: ");
    strcpy(text, "abcde#fghie$"); buildSuffixTree();
    getLongestCommonSubstring();

    //Free the dynamically allocated memory
    freeSuffixTreeByPostOrder(root);

    size1 = 6;
    printf("Longest Common Substring in pqrst and uvwxyz is: ");
    strcpy(text, "pqrst#uvwxyz$"); buildSuffixTree();
    getLongestCommonSubstring();

    //Free the dynamically allocated memory
    freeSuffixTreeByPostOrder(root);

    return 0;
}

 

Output:

Longest Common Substring in xabxac and abcabxabcd is: abxa, of length: 4
Longest Common Substring in xabxaabxa and babxba is: abx, of length: 3
Longest Common Substring in abcde and fghie is: e, of length: 1
Longest Common Substring in pqrst and uvwxyz is: No common substring

Time Complexity 

O(n)

The time complexity to find the Longest common Substring using the suffix tree method will cost only linear time complexity and is the most optimized approach for finding the  Longest Common Substring. The first construction of the entire suffix tree will cost o(n) time complexity, where n is the size of the given string. So in the worst-case iteration on the entire tree may cost o(n) time complexity, so the total time complexity to solve the entire problem will be o(n) time.

Space Complexity 

O(n)

The space complexity to find the Longest common Substring using the suffix tree method will cost only linear space complexity and is the most optimized approach for finding the  Longest common Substring. The entire construction of the suffix tree will cost o(n) space complexity, where n is the size of the given string.

FAQs

  1. What is the longest common string?
    The longest common substring problem in computer science is to identify the longest string that is a substring of two or more strings.
  2. What are suffix trees?
    A suffix tree (also known as a PAT tree or, in an earlier form, a position tree) is a compressed trie that contains all the suffixes of a given text as keys and text positions as values. Many key string operations can be implemented very quickly using suffix trees.

Key Takeaways

In this blog, we discussed the most optimized solution to find the Longest common Substring using suffix trees. Along with the solution, We also explored the time and space complexity of the solution.

Check out this problem - Longest Common Prefix

If you want to learn more about such hidden number theory algorithms and want to practice some quality questions which require you to excel your preparation s a notch higher, then you can visit our Guided Path for these number theory algorithms on  Coding Ninjas Studio.To be more confident in data structures and algorithms, try out our DS and Algo Course. Until then, All the best for your future endeavours, and Keep Coding.

Previous article
Substrings and Repetitions
Next article
Searching All Patterns Using Suffix Tree
Live masterclass