Table of contents
1.
Introduction
2.
What is the Boyer-Moore algorithm?
3.
Problem Statement
4.
Bad Character Heuristic
5.
Good Suffix Heuristic
5.1.
Here's how it works
5.2.
Implementation
5.3.
Python
6.
Frequently Asked Questions
6.1.
What is Boyer-Moore's voting algorithm used for?
6.2.
What is the last function of the Boyer-Moore algorithm?
6.3.
Is Boyer-Moore algorithm better than Rabin-Karp?
6.4.
What is the best-case time complexity of Boyer-Moore?
6.5.
Why is the Boyer Moore algorithm faster than other search algorithms?
6.6.
Can the Boyer Moore algorithm be used on any type of text?
6.7.
What happens if the pattern is not found in the text?
7.
Conclusion
Last Updated: Nov 11, 2024
Easy

Boyer Moore Algorithm

Author Gaurav Gandhi
0 upvote
Career growth poll
Do you think IIT Guwahati certified course can help you in your career?

Introduction

The Boyer-Moore algorithm is an efficient string searching algorithm that skips over sections of text by utilizing pattern matching heuristics. It was established by Robert S. Boyer and J Strother Moore in 1977.

Searching through texts & code to find specific patterns is a common yet crucial task in computing. This is where the Boyer Moore algorithm shines, offering an efficient way to handle string searches. By cleverly skipping sections of text, it reduces the comparison count, making it faster than many other search algorithms. 

Boyer Moore Algorithm

This article will explain the Boyer Moore algorithm, looking into its main principles & showcasing its implementation. We'll explore its problem statement, explain the Bad Character & Good Suffix Heuristics, & provide a proper coding example. 

What is the Boyer-Moore algorithm?

The Boyer-Moore algorithm is an efficient string searching (or pattern matching) algorithm that searches for a substring (pattern) within a main text (string). It works by comparing characters of the pattern from right to left, and it skips over portions of the text that are guaranteed not to match the pattern.

Problem Statement

When we talk about searching for patterns in texts, what we're really looking at is finding a specific sequence of characters within a larger text. Imagine you've got a huge book, & you're trying to find a single sentence in it. That's the challenge we face in computing too, especially with vast amounts of data. The Boyer Moore algorithm is our tool for this job. It's designed to find these patterns quickly & efficiently, skipping over parts of the text where the pattern definitely isn't found. This makes it stand out for its speed, especially with long texts & patterns. 

Bad Character Heuristic

The Bad Character Heuristic is a smart move in the Boyer Moore algorithm that speeds up the search by using mismatches. Instead of comparing every character in the text with the pattern, it starts at the end of the pattern and works backwards. When a character in the text doesn't match the current character in the pattern, the algorithm jumps forward. It skips sections that don't need checking, based on where the mismatched character appears in the pattern. This leap is calculated so the next alignment ensures the mismatched character aligns with its last occurrence in the pattern or moves past it if it's not there. This way, the algorithm reduces the number of comparisons dramatically, making the search faster.

To illustrate the Bad Character Heuristic, let's consider a simple example. Imagine we want to find the pattern "NEEDLE" in the text "FINDINANEEDLEINAHEAPOFHAY".

  • We start comparing from the last character of "NEEDLE" with the corresponding character in the text. Our first comparison would be 'E' (from "NEEDLE") with 'D' (in "FINDINA...").
     
  • Since 'E' does not match 'D', we look for 'D' in the pattern "NEEDLE". 'D' is not found, so we can safely jump the entire length of the pattern. Now, the comparison starts with 'E' from "NEEDLE" and 'E' from "HAY".
     
  • This time, the characters match, so we move backwards in the pattern and the text simultaneously until we find a mismatch or complete the pattern.
     
  • If a mismatch is found (say 'N' from "NEEDLE" and 'A' from "HAY"), we look for 'A' in "NEEDLE". Since it's not present, we again jump the whole pattern length. This time, the first 'E' of "NEEDLE" aligns with the 'E' in "NEEDLEINA...".
     
  • We continue this process until we find a complete match or reach the end of the text.

Good Suffix Heuristic

After understanding the Bad Character Heuristic, let's move on to another clever part of the Boyer Moore algorithm called the Good Suffix Heuristic. This strategy comes into play when a part of the pattern has been found to match a section in the text. Instead of starting all over when a mismatch happens, the Good Suffix Heuristic uses the information from the matching part, known as the "good suffix", to decide how far to jump ahead.

Here's how it works

If a part of the pattern matches but then a mismatch occurs, the algorithm looks for occurrences of this matching part elsewhere in the pattern. It then shifts the pattern in a way that aligns the next matching part of the pattern with the part in the text. If there's no such matching part, the algorithm finds the longest prefix of the pattern that matches a suffix of the good suffix and aligns them. If no such prefix exists, the pattern is moved past the good suffix entirely.

This approach significantly reduces the number of comparisons needed, making the search process even faster. Especially in cases where patterns have repeating segments, the Good Suffix Heuristic can skip large portions of the text that don't need to be checked.

To see the Good Suffix Heuristic in action, consider this simplified example: Suppose we're searching for the pattern "ABCD" in the text "ABCABCD". When the pattern is aligned starting from the first character, "ABC" matches but "D" doesn't. The Good Suffix Heuristic notices the "ABC" part matches a prefix of the pattern and shifts the pattern to align this "ABC" with the prefix "ABC" of the pattern, leading to a match in the next step.

By smartly leveraging the information from the good suffix, the algorithm becomes highly efficient for searches, particularly in texts with repetitive patterns.

Implementation

Implementing the Boyer Moore algorithm involves combining the Bad Character and Good Suffix Heuristics. Here, we'll focus on a simplified version that primarily uses the Bad Character Heuristic, giving you a foundational understanding of how the algorithm works in practice.

Let's start with a Python code example:

  • Python

Python

def boyer_moore_search(text, pattern):
# Length of pattern and text
m = len(pattern)
n = len(text)

# Create the bad character list
bad_char = [-1] * 256

# Fill the bad character list by calling the preprocessing function
for i in range(m):
bad_char[ord(pattern[i])] = i

# Start from the beginning of the text
s = 0
while s <= n - m:
j = m - 1

# Decrease index j of pattern while characters are matching
while j >= 0 and pattern[j] == text[s + j]:
j -= 1

# If the pattern is present at the current shift, then index j will become -1
if j < 0:
print("Pattern found at index:", s)

# Shift the pattern so that the next character in text aligns with the last occurrence of it in the pattern
s += (m - bad_char[ord(text[s + m])] if s + m < n else 1)
else:
# Shift the pattern so that the bad character in text aligns with the last occurrence of it in the pattern
s += max(1, j - bad_char[ord(text[s + j])])

# Example usage
text = "ABAAABCD"
pattern = "ABC"
boyer_moore_search(text, pattern)
You can also try this code with Online Python Compiler
Run Code

Output

Pattern found at index: 4


This code defines a boyer_moore_search function that takes a text and a pattern as input. It first prepares a list called bad_char, which is used to determine how far to shift the pattern when a mismatch occurs. The search begins at the start of the text, comparing the pattern to the text from right to left. If a mismatch is found, the algorithm uses the bad_char list to decide how many positions to skip before the next comparison. When the pattern matches the text, the start index of the match is printed.

By using the Bad Character Heuristic, this implementation can quickly skip over parts of the text that are guaranteed not to match the pattern, making it efficient for searching large texts.

Frequently Asked Questions

What is Boyer-Moore's voting algorithm used for?

The Boyer-Moore voting algorithm is used for finding the majority element in an array, where the element appears more than half the time.

What is the last function of the Boyer-Moore algorithm?

The last function of the Boyer-Moore algorithm is the search phase, where it performs pattern matching by comparing the pattern to the text.

Is Boyer-Moore algorithm better than Rabin-Karp?

The Boyer-Moore algorithm is often faster than Rabin-Karp, especially for longer patterns and texts, as it skips more portions of the text.

What is the best-case time complexity of Boyer-Moore?

The best-case time complexity of the Boyer-Moore algorithm is O(n), where n is the length of the text, when there are no mismatches.

Why is the Boyer Moore algorithm faster than other search algorithms?

The Boyer Moore algorithm is often faster because it skips sections of the text that don't match the pattern, thanks to its Bad Character and Good Suffix Heuristics. This means fewer comparisons, making it very efficient, especially for long patterns and texts.

Can the Boyer Moore algorithm be used on any type of text?

Yes, the Boyer Moore algorithm can be applied to any type of text, including binary data. It's versatile and effective for various search tasks, from simple document searches to complex bioinformatics applications.

What happens if the pattern is not found in the text?

If the pattern doesn't exist in the text, the Boyer Moore algorithm will go through the text without finding a match. The algorithm efficiently concludes that the pattern is not present by making the minimum number of comparisons necessary.

Conclusion

In this article, we have learned about the Boyer Moore algorithm, a powerful tool for pattern searching in texts. We started with a basic understanding of the problem it solves and then delved into the workings of the Bad Character and Good Suffix Heuristics, which are key to its efficiency. Through a practical example, we saw how to implement the algorithm in Python, focusing on the Bad Character Heuristic. The Boyer Moore algorithm stands out for its ability to reduce the number of comparisons needed to find a pattern, making it faster than many other search methods, especially for large texts and patterns. 

You can refer to our guided paths on Code360. You can check our course to learn more about DSADBMSCompetitive ProgrammingPythonJavaJavaScript, etc. Also, check out some of the Guided Paths on topics such as Data Structure and AlgorithmsCompetitive ProgrammingOperating SystemsComputer Networks, DBMSSystem Design, etc., as well as some Contests, Test Series, and Interview Experiences curated by top Industry Experts.

Live masterclass