Table of contents
1. Introduction
2. What is the Flajolet-Martin Algorithm?
3. Code Implementation
3.1. Python
4. Why Do We Need This Algorithm?
4.1. Handling Large Datasets
4.2. Real-time Processing
4.3. Approximate Counting
4.4. Scalability
4.5. Detecting Duplicates
4.6. Estimating Data Skew
5. Disadvantages
6. Frequently Asked Questions
6.1. What is the time complexity of the Flajolet-Martin algorithm?
6.2. Can the Flajolet-Martin algorithm handle duplicate elements?
6.3. How does the bitmap size affect the accuracy of the Flajolet-Martin algorithm?
7. Conclusion
Last Updated: Aug 18, 2024

Flajolet Martin Algorithm

Author: Pallavi Singh

Introduction

The Flajolet-Martin algorithm is an intelligent way to estimate the number of distinct elements in a large dataset. It's a probabilistic algorithm that uses hash functions to create a compact summary of the data. This summary, called a sketch, allows us to quickly estimate the cardinality (the number of unique items) without having to store every single element. 


In this article, we'll learn how the Flajolet-Martin algorithm works, look at the steps involved, & see how to implement it in Python. We'll also discuss the advantages & limitations of this algorithm.

What is the Flajolet-Martin Algorithm?

The Flajolet-Martin algorithm is a probabilistic algorithm used to estimate the number of distinct elements (cardinality) in a large dataset using a limited amount of memory. It works by applying hash functions to the elements & maintaining a compact summary called a bitmap. By analyzing the bitmap, the algorithm provides an approximate count of the unique elements in the dataset without the need to store or compare each individual item.


In simpler terms, the Flajolet-Martin algorithm is a clever way to count the number of unique items in a huge collection of data without actually counting them one by one. It trades a bit of accuracy for efficiency, allowing us to get a good estimate of the cardinality even when dealing with massive datasets that wouldn't fit in memory.


The algorithm relies on the properties of hash functions & probability theory to create a compact representation of the data. By setting bits in the bitmap based on the hash values of the elements, it captures the essence of the dataset's uniqueness. The position of the least significant zero bit in the bitmap is then used to estimate the total number of distinct elements.


The steps for the Flajolet-Martin algorithm are:

1. Initialize a bitmap of L bits, all set to 0, where L is large enough that 2^L comfortably exceeds the number of distinct elements (L = 32 or 64 is typical).

2. For each element in the dataset, apply a hash function that maps the element to a number between 0 & 2^L - 1.

3. Find the position of the least significant bit (LSB) that is set to 1 in the hash value, i.e., count the trailing zeros in its binary representation.

4. Update the bitmap by setting the bit at the position found in step 3 to 1.

5. After processing all elements, find R, the position of the lowest-order bit in the bitmap that is still 0.

6. Calculate the estimate using the formula: 2^R / φ, where φ is a constant (approximately 0.77351).

The key idea behind the Flajolet-Martin algorithm is that the number of trailing zeros in the hash values follows a geometric distribution: about half of the distinct elements hash to a value ending in 1, a quarter to a value ending in 10, an eighth to a value ending in 100, & so on. If bit r of the bitmap is set, we have probably seen around 2^r distinct elements, so analyzing the bitmap & applying the estimation formula gives a good approximation of the number of distinct elements in the dataset.
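
Here is a minimal sketch of the bitmap version described in these steps, for illustration only. It uses Python's hashlib.md5 purely as a convenient deterministic hash, & the helper name fm_bitmap_estimate is our own; the full implementation in the next section uses a simplified variant that tracks the longest trailing-zero run directly.

import hashlib

def fm_bitmap_estimate(data, L=32):
    bitmap = 0
    phi = 0.77351
    for item in data:
        # Step 2: hash the element to an L-bit integer
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & ((1 << L) - 1)
        # Step 3: position of the least significant 1-bit (count of trailing zeros)
        r = (h & -h).bit_length() - 1 if h else L - 1  # clamp the rare all-zero hash
        # Step 4: set that bit in the bitmap
        bitmap |= 1 << r
    # Step 5: R = position of the lowest-order bit that is still 0
    R = 0
    while bitmap & (1 << R):
        R += 1
    # Step 6: estimate = 2^R / phi
    return 2 ** R / phi

print(fm_bitmap_estimate(["apple", "banana", "orange", "banana", "apple", "grape"]))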

Code Implementation 

Python

import mmh3

def flajolet_martin(data, num_bits):
    max_zero_run = 0
    for item in data:
        # Hash the item and keep the low num_bits bits as an unsigned integer
        hash_value = mmh3.hash(str(item)) & ((1 << num_bits) - 1)

        # Find the position of the first 1 in the binary representation (count trailing zeros)
        zero_run_length = 0
        while hash_value > 0 and (hash_value & 1) == 0:
            zero_run_length += 1
            hash_value >>= 1

        # Update the maximum zero run length seen so far
        max_zero_run = max(max_zero_run, zero_run_length)

    # Estimate the number of distinct elements from the longest run of trailing zeros
    estimate = 2 ** max_zero_run / 0.77351
    return estimate

# Example usage
data = ["apple", "banana", "orange", "banana", "apple", "grape"]
num_bits = 32
estimate = flajolet_martin(data, num_bits)
print("Estimated number of distinct elements:", estimate)


Output:

Estimated number of distinct elements: 5.160992690384847


In this code:

1. We import the `mmh3` library, which provides the MurmurHash3 hash function.

2. We define the `flajolet_martin` function that takes the `data` (input dataset) & `num_bits` (number of bits for hashing) as parameters.

3. We initialize `max_zero_run` to 0, which will keep track of the maximum number of trailing zeros encountered in the hash values.

4. We iterate over each `item` in the `data`:

   - We hash the `item` using `mmh3.hash()` & mask it to `num_bits` bits using `& ((1 << num_bits) - 1)`. This ensures that the hash value is non-negative.

   - We initialize `zero_run_length` to 0, which will count the number of trailing zeros in the binary representation of the hash value.

   - We enter a loop that continues as long as `hash_value` is greater than 0 & its least significant bit (LSB) is 0:

     - We increment `zero_run_length` to count the trailing zeros.

     - We right-shift `hash_value` by 1 bit using `>>= 1` to remove the LSB.

   - After the loop, `zero_run_length` represents the count of trailing zeros in the hash value.

   - We update `max_zero_run` by taking the maximum of the current `max_zero_run` & the `zero_run_length` for the current item.

5. After processing all items, we estimate the number of distinct elements using the formula: `2 ** max_zero_run / 0.77351`. The constant `0.77351` is a scaling factor derived from the analysis of the algorithm. For instance, if `max_zero_run` is 2, the estimate is 2^2 / 0.77351 ≈ 5.17.

6. Finally, we return the estimated cardinality.

Why Do We Need This Algorithm?

1. Handling Large Datasets

When dealing with massive datasets that contain millions or billions of elements, counting the exact number of distinct elements can be impractical or even impossible due to memory & computational limitations. The Flajolet-Martin algorithm provides an efficient way to estimate the cardinality without requiring the entire dataset to be stored in memory.

2. Real-time Processing

In some applications, such as network traffic monitoring or real-time data analysis, processing data as it arrives is crucial. The Flajolet-Martin algorithm allows for on-the-fly estimation of the number of distinct elements without needing to store or process the entire dataset at once. This makes it suitable for real-time scenarios where quick results are required.
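
To make the real-time use concrete, here is a minimal streaming sketch that reuses the same mmh3 hashing as the implementation above; the FMStream class & its method names are our own illustration, not a standard API:

import mmh3

class FMStream:
    def __init__(self, num_bits=32):
        self.num_bits = num_bits
        self.max_zero_run = 0

    def add(self, item):
        # Constant-time update per arriving element: hash, count trailing zeros, keep the max
        h = mmh3.hash(str(item)) & ((1 << self.num_bits) - 1)
        run = (h & -h).bit_length() - 1 if h else self.num_bits
        self.max_zero_run = max(self.max_zero_run, run)

    def estimate(self):
        # The estimate can be queried at any point in the stream
        return 2 ** self.max_zero_run / 0.77351

stream = FMStream()
for item in ["apple", "banana", "apple"]:
    stream.add(item)
print(stream.estimate())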

3. Approximate Counting

In many cases, an exact count of distinct elements is not necessary, & an approximate estimate is sufficient. The Flajolet-Martin algorithm provides a good balance between accuracy & efficiency. By sacrificing a small degree of precision, it enables faster computation & reduces memory usage compared to exact counting methods.

4. Scalability

As datasets grow larger, the Flajolet-Martin algorithm scales well in terms of performance. The algorithm's time complexity is linear with respect to the number of elements, making it suitable for processing huge datasets. Additionally, the algorithm can be easily parallelized, allowing for distributed processing across multiple machines to handle even larger datasets.
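
As a sketch of that parallelism, suppose each worker runs the hashing loop from the implementation above on its own partition & reports only its local max_zero_run; the three values below are made-up placeholders:

# Hypothetical max_zero_run values reported by three workers
partial_runs = [4, 6, 5]

# The longest trailing-zero run over the union of all partitions is just the
# maximum of the per-partition maxima, so merging partial results is a single max()
merged_run = max(partial_runs)
print("Merged estimate:", 2 ** merged_run / 0.77351)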

5. Detecting Duplicates

The Flajolet-Martin algorithm can be used as a preprocessing step to identify the presence of duplicate elements in a dataset. By estimating the number of distinct elements & comparing it with the total number of elements, we can quickly determine if duplicates exist without the need for expensive pairwise comparisons.
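
A quick sketch of that check, reusing the flajolet_martin function from the code section (the 0.9 threshold is an arbitrary illustrative choice):

data = ["apple", "banana", "orange", "banana", "apple", "grape"]
estimate = flajolet_martin(data, 32)

# If the distinct-count estimate falls well below the total count, duplicates are likely
if estimate < 0.9 * len(data):
    print("Dataset likely contains duplicates")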

6. Estimating Data Skew

In some applications, understanding the distribution of distinct elements across different subsets of the data is important. The Flajolet-Martin algorithm can be applied to estimate the cardinality of different partitions or segments of the data, providing insights into data skew & helping in load balancing or resource allocation decisions.
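
For example, one might run the estimator separately on each segment; the partition names & data below are made up, & flajolet_martin is the function from the code section:

partitions = {
    "server_a": ["u1", "u2", "u2", "u3"],
    "server_b": ["u3", "u3", "u3"],
}
for name, items in partitions.items():
    # Per-partition cardinality estimates reveal how distinct values are spread
    print(name, flajolet_martin(items, 32))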

Disadvantages


1. Approximate Results

The Flajolet-Martin algorithm provides an estimate of the number of distinct elements rather than an exact count. While the estimate is generally close to the actual value, there is always some inherent error in the approximation. The accuracy of the estimate depends on factors such as the size of the bitmap & the characteristics of the dataset. In situations where precise counts are crucial, the Flajolet-Martin algorithm may not be the best choice.
 

2. False Positives

The Flajolet-Martin algorithm relies on hash functions to map elements to bit positions in the bitmap. Due to the probabilistic nature of hash functions, there is a possibility of collisions, where different elements may map to the same bit position. These collisions can lead to false positives, where the algorithm overestimates the number of distinct elements. While the impact of false positives can be mitigated by using larger bitmaps or multiple hash functions, it still remains a potential drawback.
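
One common mitigation is to run the estimator with several independent hash seeds & average the zero-run lengths before exponentiating. The sketch below assumes mmh3's optional seed argument & is our own illustration, not part of the original implementation:

import mmh3

def fm_multi_seed(data, num_bits=32, num_seeds=16):
    total_run = 0
    for seed in range(num_seeds):
        max_run = 0
        for item in data:
            h = mmh3.hash(str(item), seed) & ((1 << num_bits) - 1)
            run = (h & -h).bit_length() - 1 if h else num_bits
            max_run = max(max_run, run)
        # Accumulate the per-seed exponents; averaging them smooths out unlucky hashes
        total_run += max_run
    return 2 ** (total_run / num_seeds) / 0.77351

data = ["apple", "banana", "orange", "banana", "apple", "grape"]
print(fm_multi_seed(data))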
 

3. Bitmap Size

The size of the bitmap used in the Flajolet-Martin algorithm affects both the accuracy & the memory usage. A larger bitmap reduces the probability of collisions & improves the accuracy of the estimate. However, it also requires more memory to store the bitmap. Choosing an appropriate bitmap size involves balancing the trade-off between accuracy & memory consumption based on the specific requirements of the application.
 

4. Sensitivity to Data Distribution

The Flajolet-Martin algorithm assumes that the elements in the dataset are uniformly distributed. If the dataset has a skewed distribution, where some elements appear much more frequently than others, the algorithm's estimate may be biased. In such cases, the algorithm may underestimate the number of distinct elements. Techniques like using multiple hash functions or combining the Flajolet-Martin algorithm with other cardinality estimation methods can help mitigate this issue to some extent.
 

5. Limited Information

The Flajolet-Martin algorithm only provides an estimate of the number of distinct elements. It does not give any information about the actual elements themselves or their frequencies. If the application requires more detailed information about the distinct elements, such as their identities or counts, additional data structures or algorithms need to be used in conjunction with the Flajolet-Martin algorithm.
 

6. Overhead for Small Datasets

For small datasets where the number of distinct elements is relatively low, the overhead of using the Flajolet-Martin algorithm may outweigh its benefits. In such cases, using simpler methods like hash sets or direct counting may be more efficient in terms of both memory & computation.
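
For comparison, the exact distinct count for a small dataset is a one-liner with a hash set:

data = ["apple", "banana", "orange", "banana", "apple", "grape"]
print("Exact number of distinct elements:", len(set(data)))  # 4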

Frequently Asked Questions

What is the time complexity of the Flajolet-Martin algorithm?

The time complexity of the Flajolet-Martin algorithm is O(n), where n is the number of elements in the dataset. It processes each element once, making it efficient for large datasets.

Can the Flajolet-Martin algorithm handle duplicate elements?

Yes, the Flajolet-Martin algorithm can handle duplicate elements. It estimates the number of distinct elements, so duplicates do not affect the final estimate.
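
Because hashing is deterministic, repeated elements set the same bits again, so the estimate is unchanged. A quick check using the flajolet_martin function from the code section:

print(flajolet_martin(["a", "b", "c"], 32) == flajolet_martin(["a", "b", "c", "a", "b"], 32))  # True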

How does the bitmap size affect the accuracy of the Flajolet-Martin algorithm?

A larger bitmap size generally improves the accuracy of the estimate by reducing the probability of collisions. However, it also increases the memory usage. The choice of bitmap size depends on the desired balance between accuracy & memory consumption.

Conclusion

In this article, we discussed the Flajolet-Martin algorithm, a probabilistic technique for estimating the number of distinct elements in a large dataset. We learned about its key steps, like hashing elements, updating the bitmap, & calculating the final estimate. We also looked at the situations where this algorithm is needed, like handling massive datasets, real-time processing, & approximate counting scenarios.

You can also check out our other blogs on Code360.
