Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
2.
Suffix Tree
3.
Ukkonen’s Algorithm
4.
Understanding the Algorithm with an Example
4.1.
Problem Statement
4.2.
Example
4.2.1.
Input
4.2.2.
Output
4.2.3.
Explanation: 
4.3.
Approach
4.4.
Algorithm
4.5.
Implementation
4.5.1.
Input
4.5.2.
Output
4.6.
Complexity
4.6.1.
Time Complexity
4.6.2.
Auxiliary Space
5.
Implicit suffix trees in Ukkonen's algorithm
6.
True Suffix Tree
7.
Frequently Asked Questions
7.1.
Why would someone utilise a suffix tree?
7.2.
How is a suffix tree created?
7.3.
Is trie same as a suffix tree?
7.4.
What exactly are suffix arrays and suffix trees?
7.5.
What benefits do tries have?
8.
Conclusion
Last Updated: Mar 27, 2024
Hard

Suffix Tree-Ukkonen's Algorithm

gp-icon
Data structures & algorithms (Beginner to Intermediate)
Free guided path
13 chapters
99+ problems
gp-badge
Earn badges and level up

Introduction

You must think about what the suffix Tree would be and Ukkonen's algorithm. Don't worry, and we will be covering everything in brief.

Let's get started!

introduction

Suffix Tree

suffix tree is a compressed trie of all the suffixes in a particular string (compressed trie trees use a relatively small number of nodes, which provides a significant memory advantage, especially for large lines). Many string-related issues, such as pattern matching, locating unique substrings within a given string, locating the largest palindrome, etc., can be resolved using suffix trees. This article will discuss the following topics:

  • Compressed Trie
  • Suffix Tree Construction (Brute Force)
  • Brief description of Ukkonen's Algorithm
     

A suffix tree is a compressed trie in which all of the suffixes in a string are stored. A suffix tree has been shown to help with problems like finding unique substrings in a string, pattern matching, finding the longest palindrome, and other string-related issues. This post will look at another application of the suffix tree: Substring Check.

We will generate the Suffix Tree using Ukkonen's Approach, a linear time-taking algorithm in the following scenario.

Get the tech career you deserve, faster!
Connect with our expert counsellors to understand how to hack your way to success
User rating 4.7/5
1:1 doubt support
95% placement record
Akash Pal
Senior Software Engineer
326% Hike After Job Bootcamp
Himanshu Gusain
Programmer Analyst
32 LPA After Job Bootcamp
After Job
Bootcamp

Ukkonen’s Algorithm

Esko Ukkonen proposed Ukkonen's algorithm in 1995 as a linear-time, online algorithm for constructing suffix trees. The algorithm starts with an implicit suffix tree containing the string's first character. Then it goes through the string, adding characters one by one until the tree is complete. This character order addition gives Ukkonen's algorithm its "online" property. Peter Weiner's original algorithm worked backwards from the last character to the first, from the shortest to the longest suffix. Edward M. McCreight discovered a simpler algorithm that goes from the longest to the most temporary suffix.

Ukkonen Algorithm

Understanding the Algorithm with an Example

Problem Statement

Implement a function that uses the suffix tree to determine whether the original text S contains a string, as well as the brute force approach for creating a suffix tree from a text S.

Example

Input

S = “banana”
Find Substring "ana"
Find Substring "nan"
Find Substring "nun"

Output

True
True
False

Explanation:
 

The string "ana" and “nan” are valid substrings of the text S, and hence we output True.

The string "nun" doesn't exist as a substring, and the output is False.
 

Approach

By inserting each suffix and compressing the resulting trie, the brute force technique aims to create a suffix tree. See the example below for how we condense a chain of nodes into a single node.

If a node is added repeatedly, it will eventually exist. If there is a relevant node, we examine its label to see how much of the current suffix ([i:n-1]) matches the label. 

If we completely match, we go to its child; if we just partially match, we generate two new nodes. The leftover Suffix string is likewise formed into a child of the unmatched label node, which is then put into a node with the matched label component. 

 

The above chain of nodes converges into the single node below.

The compressed chain of characters is created to save memory

Algorithm

  • Text S should include an additional character at the end that isn't present in S. In our example, the dollar sign was applied. This will result in N distinct leaf nodes.
     
  • Build a root that has no label.
     
  • One by one, add each of the N suffixes to the compressed trie of suffixes.
     
  • We add a node for each suffix if it doesn't already exist.
     
  • We match the suffix with the node's label if it is present. If everything matches, we proceed to the child node.
     
  • If only a piece of the suffix string matches, we build two nodes, assign the matched section to one, assign the remaining portion to the other, and make the parent of the previous node the parent of the new node. Create a child node for the previous node and rename it to reflect the mismatched piece. The node with the last suffix string node will be the other child.
     
  • We get the suffix tree after completing everything above.

Implementation

class Node:
  """
  Class representing a singular node in a suffix tree
  """
  def __init__(self, set_label):
      self.node_lab = set_label
      self.nexts = {}
class SuffixTreeBruteForce:
  """
  This implementation takes O(M^2) time to build the suffix tree and consumes O(M) Nodes (<= 2 * M - 1)
  """
  def __init__(self, s):
      """
      Build Suffix Tree here
      """
      # add a char that didn't occur earlier in the text like '$' to the end
      s += '$'
      
      # creating root of our suffix tree
      self.root = Node(None)
      
      for i in range(len(s)):
          # insert individual suffixes [i, len(s) - 1]
          curr = self.root
          j = i
          while(j < len(s)):
              if s[j] in curr.nexts:
                  child = curr.nexts[s[j]]
                  label = child.node_lab
                  
                  # check if the current node label is entirely consumed by the suffix
                  k = j
                  label_end = False
                  while(k - j < len(label)):
                      if(k == len(s)):
                          # suffix comes short of the label's length
                          label_end = True
                          break
                      elif(s[k] == label[k-j]):
                          # we keep checking for matching character in suffix with node label
                          k += 1
                      else:
                          # we reached a mismatch in the label
                          # prompting us to make introduce two new nodes
                          label_end = True
                          break
                  
                  if label_end:
                      # special case, here we create two new nodes
                      # we create a node here with the left half of the label (the part that matched) and replace the child node's position with it
                      curr.nexts[s[j]] = Node(label[:k-j])
                      
                      # the new position of the child node is as child of the previosly created node
                      curr.nexts[s[j]].nexts[label[k-j]] = child
                      # it's label is now the right part of label(the unmatched part)
                      child.node_lab = label[k-j:]
                      
                      # we also create another Node containing the unmatched part of the suffix
                      curr.nexts[s[j]].nexts[s[k]] = Node(s[k:])
                      break
                  else:
                      # the whole node was consumed and we move to it's child
                      curr = child
                      j = k
              else:
                  # no child found, create child and end execution
                  curr.nexts[s[j]] = Node(s[j:])
                  break
  
  def findSubstring(self, s):
      """
      function traces through the Suffix Tree for the given string
      and returns true if the string is found, the time complexity for this function is O(len(s))
      """
      curr = self.root
      i = 0
      while(i < len(s)):
          if s[i] in curr.nexts:
              child = curr.nexts[s[i]]
              label = child.node_lab
              
              k = i
              while(k - i < len(label)):
                  if(k == len(s)):
                      # string ends in the middle of label
                      return True
                  elif(s[k] == label[k-i]):
                      # we keep checking for matching character with node label
                      k += 1
                  else:
                      # mismatch
                      return False
              curr = child
              i = k
          else:
              return False
      return True
s = 'banana'
p = SuffixTreeBruteForce(s)
print(p.findSubstring('ban'))
print(p.findSubstring('nan'))
print(p.findSubstring('aua'))

Input

banana
ban
nan
aua

Output

True
True
False

Complexity

Time Complexity

O(M^2): As each suffix is inserted separately into the suffix tree, the temporal complexity of the aforementioned approach is O(M^2).

Explanation: The Suffix Tree will take O(N) time to generate because Ukkonen's Algorithm is used. For the substring check, the operations do the traversal, which takes a total of O time (M^2). The total complexity will thus be O(M^2).

Auxiliary Space

O(M^2): where M is a text string length.

Explanation: This algorithm's space complexity is O (M^2). If our text has M distinct characters, we will generate nodes for each suffix with labels that are O(M^2) memory in size. If we save the pair of start index and offset instead of the entire label, we can easily decrease this to O(M). If you test that implementation, you'll see how much better this method is than Tries and how much less space it typically requires.

Note: It's vital to remember that there will only ever be one edge originating from one character from a specific node (internal or root). Any node with a character at the beginning will not have more than one edge leaving it.

Implicit suffix trees in Ukkonen's algorithm

An implicit suffix tree is a tree that was created by using the suffix tree for t$.

  • Removing all instances of $ from the edge labels,
  • Eliminating edges without labels
  • Removing unary nodes.

True Suffix Tree

Finally, the implicit suffix tree A real suffix tree can be created from Tn in O(n) time.

  • Follow t with the terminal symbol $.
  • Use this character to continue Ukkonen's algorithm.
     

When no suffix is the prefix of another suffix, the resultant tree is the real suffix tree. Each suffix comes to a leaf's end.

Frequently Asked Questions

Why would someone utilise a suffix tree?

A list of strings is often stored in a suffix tree, a type of tree data structure. It is also a trie that has been compressed because, unlike a trie, each distinct suffix in the list is represented by a single node or branch in a suffix tree.

How is a suffix tree created?

Each suffix, starting with the top node, we build a suffix tree, giving each character its edge. If a set of characters already present in the tree are the first characters of the new suffix to be added, we follow those characters until we find a different one, forming a new branch.

Is trie same as a suffix tree?

It would be simple to query a Trie for the substrings of a string if you imagined a Trie into which you could insert some word suffixes. This is the underlying principle behind the suffix tree, which is essentially a "suffix trie." So, you could say that trie is a kind of suffix tree.

What exactly are suffix arrays and suffix trees?

A suffix array can be created by performing a depth-first traverse of a suffix tree. Suppose edges are traversed in the lexicographical order of their initial character. The leaf labels are listed in the suffix array according to the order in which they are encountered throughout the traversal.

What benefits do tries have?

Strings are kept in a tree called Tries. A node can have as many children as the length of the alphabet. Trie offers O(L)-time search, insert, and delete operations, where L is the key length.

Conclusion

In this article, we have discussed an Ukkonen's algorithm called Suffix Tree with an example. Ukkonen's Algorithm. An implicit suffix tree containing the string's first character serves as the algorithm's starting point. Once the tree is complete, it continues through the string, adding new characters at each step. The "online" quality of Ukkonen's method is due to this character order addition.

Check out this problem - Check If A String Is Palindrome

Refer to our Guided Path on Coding Ninjas Studio to upskill yourself in pygameCompetitive ProgrammingJavaScriptSystem Design, and many more! You can check out the mock exam series and take part in the competitions held on Coding Ninjas Studio to test your coding proficiency. But suppose you have just started your learning process and are looking for questions asked by tech giants like Amazon, Microsoft, Uber, etc. In that case, you must look at the problems, interview experiences, and interview bundle for placement preparations.

Nevertheless, you may consider our paid courses to give your career an edge over others!

Do upvote our blogs if you find them helpful and engaging!

Happy Learning!

 

Thank you
Previous article
Advantages of Trie Data Structure
Next article
Substring Check Using Suffix Tree
Guided path
Free
gridgp-icon
Data structures & algorithms (Beginner to Intermediate)
13 chapters
109+ Problems
gp-badge
Earn badges and level up
Live masterclass