Data structures & algorithms (Beginner to Intermediate)

Free guided path13 chapters99+ problems

Earn badges and level up

Introduction

You must think about what the suffix Tree would be and Ukkonen's algorithm. Don't worry, and we will be covering everything in brief.

Let's get started!

Suffix Tree

A suffix tree is a compressed trie of all the suffixes in a particular string (compressed trie trees use a relatively small number of nodes, which provides a significant memory advantage, especially for large lines). Many string-related issues, such as pattern matching, locating unique substrings within a given string, locating the largest palindrome, etc., can be resolved using suffix trees. This article will discuss the following topics:

Compressed Trie

Suffix Tree Construction (Brute Force)

Brief description of Ukkonen's Algorithm

A suffix tree is a compressed trie in which all of the suffixes in a string are stored. A suffix tree has been shown to help with problems like finding unique substrings in a string, pattern matching, finding the longest palindrome, and other string-related issues. This post will look at another application of the suffix tree: Substring Check.

We will generate the Suffix Tree using Ukkonen's Approach, a linear time-taking algorithm in the following scenario.

Get the tech career you deserve, faster!

Connect with our expert counsellors to understand how to hack your way to success

User rating 4.7/5

1:1 doubt support

95% placement record

Akash Pal

Senior Software Engineer

326% Hike After Job Bootcamp

Himanshu Gusain

Programmer Analyst

32 LPA After Job Bootcamp

After Job Bootcamp

Ukkonen’s Algorithm

Esko Ukkonen proposed Ukkonen's algorithm in 1995 as a linear-time, online algorithm for constructing suffix trees. The algorithm starts with an implicit suffix tree containing the string's first character. Then it goes through the string, adding characters one by one until the tree is complete. This character order addition gives Ukkonen's algorithm its "online" property. Peter Weiner's original algorithm worked backwards from the last character to the first, from the shortest to the longest suffix. Edward M. McCreight discovered a simpler algorithm that goes from the longest to the most temporary suffix.

Understanding the Algorithm with an Example

Problem Statement

Implement a function that uses the suffix tree to determine whether the original text S contains a string, as well as the brute force approach for creating a suffix tree from a text S.

The string "ana" and “nan” are valid substrings of the text S, and hence we output True.

The string "nun" doesn't exist as a substring, and the output is False.

Approach

By inserting each suffix and compressing the resulting trie, the brute force technique aims to create a suffix tree. See the example below for how we condense a chain of nodes into a single node.

If a node is added repeatedly, it will eventually exist. If there is a relevant node, we examine its label to see how much of the current suffix ([i:n-1]) matches the label.

If we completely match, we go to its child; if we just partially match, we generate two new nodes. The leftover Suffix string is likewise formed into a child of the unmatched label node, which is then put into a node with the matched label component.

The above chain of nodes converges into the single node below.

The compressed chain of characters is created to save memory

Algorithm

Text S should include an additional character at the end that isn't present in S. In our example, the dollar sign was applied. This will result in N distinct leaf nodes.

Build a root that has no label.

One by one, add each of the N suffixes to the compressed trie of suffixes.

We add a node for each suffix if it doesn't already exist.

We match the suffix with the node's label if it is present. If everything matches, we proceed to the child node.

If only a piece of the suffix string matches, we build two nodes, assign the matched section to one, assign the remaining portion to the other, and make the parent of the previous node the parent of the new node. Create a child node for the previous node and rename it to reflect the mismatched piece. The node with the last suffix string node will be the other child.

We get the suffix tree after completing everything above.

Implementation

class Node:
"""
Class representing a singular node in a suffix tree
"""
def __init__(self, set_label):
self.node_lab = set_label
self.nexts = {}
class SuffixTreeBruteForce:
"""
This implementation takes O(M^2) time to build the suffix tree and consumes O(M) Nodes (<= 2 * M - 1)
"""
def __init__(self, s):
"""
Build Suffix Tree here
"""
# add a char that didn't occur earlier in the text like '$' to the end
s += '$'
# creating root of our suffix tree
self.root = Node(None)
for i in range(len(s)):
# insert individual suffixes [i, len(s) - 1]
curr = self.root
j = i
while(j < len(s)):
if s[j] in curr.nexts:
child = curr.nexts[s[j]]
label = child.node_lab
# check if the current node label is entirely consumed by the suffix
k = j
label_end = False
while(k - j < len(label)):
if(k == len(s)):
# suffix comes short of the label's length
label_end = True
break
elif(s[k] == label[k-j]):
# we keep checking for matching character in suffix with node label
k += 1
else:
# we reached a mismatch in the label
# prompting us to make introduce two new nodes
label_end = True
break
if label_end:
# special case, here we create two new nodes
# we create a node here with the left half of the label (the part that matched) and replace the child node's position with it
curr.nexts[s[j]] = Node(label[:k-j])
# the new position of the child node is as child of the previosly created node
curr.nexts[s[j]].nexts[label[k-j]] = child
# it's label is now the right part of label(the unmatched part)
child.node_lab = label[k-j:]
# we also create another Node containing the unmatched part of the suffix
curr.nexts[s[j]].nexts[s[k]] = Node(s[k:])
break
else:
# the whole node was consumed and we move to it's child
curr = child
j = k
else:
# no child found, create child and end execution
curr.nexts[s[j]] = Node(s[j:])
break
def findSubstring(self, s):
"""
function traces through the Suffix Tree for the given string
and returns true if the string is found, the time complexity for this function is O(len(s))
"""
curr = self.root
i = 0
while(i < len(s)):
if s[i] in curr.nexts:
child = curr.nexts[s[i]]
label = child.node_lab
k = i
while(k - i < len(label)):
if(k == len(s)):
# string ends in the middle of label
return True
elif(s[k] == label[k-i]):
# we keep checking for matching character with node label
k += 1
else:
# mismatch
return False
curr = child
i = k
else:
return False
return True
s = 'banana'
p = SuffixTreeBruteForce(s)
print(p.findSubstring('ban'))
print(p.findSubstring('nan'))
print(p.findSubstring('aua'))

Input

banana
ban
nan
aua

Output

True
True
False

Complexity

Time Complexity

O(M^2): As each suffix is inserted separately into the suffix tree, the temporal complexity of the aforementioned approach is O(M^2).

Explanation: The Suffix Tree will take O(N) time to generate because Ukkonen's Algorithm is used. For the substring check, the operations do the traversal, which takes a total of O time (M^2). The total complexity will thus be O(M^2).

Auxiliary Space

O(M^2): where M is a text string length.

Explanation: This algorithm's space complexity is O (M^2). If our text has M distinct characters, we will generate nodes for each suffix with labels that are O(M^2) memory in size. If we save the pair of start index and offset instead of the entire label, we can easily decrease this to O(M). If you test that implementation, you'll see how much better this method is than Tries and how much less space it typically requires.

Note: It's vital to remember that there will only ever be one edge originating from one character from a specific node (internal or root). Any node with a character at the beginning will not have more than one edge leaving it.

Implicit suffix trees in Ukkonen's algorithm

An implicit suffix tree is a tree that was created by using the suffix tree for t$.

Removing all instances of $ from the edge labels,

Eliminating edges without labels

Removing unary nodes.

True Suffix Tree

Finally, the implicit suffix tree A real suffix tree can be created from Tn in O(n) time.

Follow t with the terminal symbol $.

Use this character to continue Ukkonen's algorithm.

When no suffix is the prefix of another suffix, the resultant tree is the real suffix tree. Each suffix comes to a leaf's end.

Frequently Asked Questions

Why would someone utilise a suffix tree?

A list of strings is often stored in a suffix tree, a type of tree data structure. It is also a trie that has been compressed because, unlike a trie, each distinct suffix in the list is represented by a single node or branch in a suffix tree.

How is a suffix tree created?

Each suffix, starting with the top node, we build a suffix tree, giving each character its edge. If a set of characters already present in the tree are the first characters of the new suffix to be added, we follow those characters until we find a different one, forming a new branch.

Is trie same as a suffix tree?

It would be simple to query a Trie for the substrings of a string if you imagined a Trie into which you could insert some word suffixes. This is the underlying principle behind the suffix tree, which is essentially a "suffix trie." So, you could say that trie is a kind of suffix tree.

What exactly are suffix arrays and suffix trees?

A suffix array can be created by performing a depth-first traverse of a suffix tree. Suppose edges are traversed in the lexicographical order of their initial character. The leaf labels are listed in the suffix array according to the order in which they are encountered throughout the traversal.

What benefits do tries have?

Strings are kept in a tree called Tries. A node can have as many children as the length of the alphabet. Trie offers O(L)-time search, insert, and delete operations, where L is the key length.

Conclusion

In this article, we have discussed an Ukkonen's algorithm called Suffix Tree with an example. Ukkonen's Algorithm. An implicit suffix tree containing the string's first character serves as the algorithm's starting point. Once the tree is complete, it continues through the string, adding new characters at each step. The "online" quality of Ukkonen's method is due to this character order addition.

Refer to our Guided PathonCoding Ninjas Studioto upskill yourself in pygame, Competitive Programming, JavaScript, System Design, and many more! You can check out the mock exam series and take part in the competitions held on Coding Ninjas Studio to test your coding proficiency. But suppose you have just started your learning process and are looking for questions asked by tech giants like Amazon, Microsoft, Uber, etc. In that case, you must look at the problems,interview experiences, and interview bundle for placement preparations.

Nevertheless, you may consider our paid courses to give your career an edge over others!

Do upvote our blogs if you find them helpful and engaging!