Code Implementation
Python
import mmh3
def flajolet_martin(data, num_bits):
max_zero_run = 0
for item in data:
# Hash the item and treat as an unsigned integer
hash_value = mmh3.hash(str(item)) & 0xffffffff
# Find the position of the first 1 in the binary representation (count trailing zeros)
zero_run_length = 0
while hash_value > 0 and (hash_value & 1) == 0:
zero_run_length += 1
hash_value >>= 1
# Update the maximum zero run length seen
max_zero_run = max(max_zero_run, zero_run_length)
# Estimate the number of distinct elements
estimate = 2 ** max_zero_run / 0.77351
return estimate
# Example usage
data = ["apple", "banana", "orange", "banana", "apple", "grape"]
num_bits = 32
estimate = flajolet_martin(data, num_bits)
print("Estimated number of distinct elements:", estimate)

You can also try this code with Online Python Compiler
Run Code
Output :
Estimated number of distinct elements: 5.160992690384847
In this code:
1. We import the `mmh3` library, which provides the MurmurHash3 hash function.
2. We define the `flajolet_martin` function that takes the `data` (input dataset) & `num_bits` (number of bits for hashing) as parameters.
3. We initialize `max_zero_run` to 0, which will keep track of the maximum number of trailing zeros encountered in the hash values.
4. We iterate over each `item` in the `data`:
- We hash the `item` using `mmh3.hash()` & convert it to an unsigned 32-bit integer using `& 0xffffffff`. This ensures that the hash value is non-negative.
- We initialize `zero_run_length` to 0, which will count the number of trailing zeros in the binary representation of the hash value.
- We enter a loop that continues as long as `hash_value` is greater than 0 & its least significant bit (LSB) is 0:
- We increment `zero_run_length` to count the trailing zeros.
- We right-shift `hash_value` by 1 bit using `>>= 1` to remove the LSB.
- After the loop, `zero_run_length` represents the count of trailing zeros in the hash value.
- We update `max_zero_run` by taking the maximum of the current `max_zero_run` & the `zero_run_length` for the current item.
5. After processing all items, we estimate the number of distinct elements using the formula: `2 ** max_zero_run / 0.77351`. The constant `0.77351` is a scaling factor derived from the analysis of the algorithm.
6. Finally, we return the estimated cardinality.
Why do we need of this Algorithm ?
1. Handling Large Datasets
When dealing with massive datasets that contain millions or billions of elements, counting the exact number of distinct elements can be impractical or even impossible due to memory & computational limitations. The Flajolet-Martin algorithm provides an efficient way to estimate the cardinality without requiring the entire dataset to be stored in memory.
2. Real-time Processing
In some applications, such as network traffic monitoring or real-time data analysis, processing data as it arrives is crucial. The Flajolet-Martin algorithm allows for on-the-fly estimation of the number of distinct elements without needing to store or process the entire dataset at once. This makes it suitable for real-time scenarios where quick results are required.
3. Approximate Counting
In many cases, an exact count of distinct elements is not necessary, & an approximate estimate is sufficient. The Flajolet-Martin algorithm provides a good balance between accuracy & efficiency. By sacrificing a small degree of precision, it enables faster computation & reduces memory usage compared to exact counting methods.
4. Scalability
As datasets grow larger, the Flajolet-Martin algorithm scales well in terms of performance. The algorithm's time complexity is linear with respect to the number of elements, making it suitable for processing huge datasets. Additionally, the algorithm can be easily parallelized, allowing for distributed processing across multiple machines to handle even larger datasets.
5. Detecting Duplicates
The Flajolet-Martin algorithm can be used as a preprocessing step to identify the presence of duplicate elements in a dataset. By estimating the number of distinct elements & comparing it with the total number of elements, we can quickly determine if duplicates exist without the need for expensive pairwise comparisons.
6. Estimating Data Skew
In some applications, understanding the distribution of distinct elements across different subsets of the data is important. The Flajolet-Martin algorithm can be applied to estimate the cardinality of different partitions or segments of the data, providing insights into data skew & helping in load balancing or resource allocation decisions.
Disadvantages
1. Approximate Results
The Flajolet-Martin algorithm provides an estimate of the number of distinct elements rather than an exact count. While the estimate is generally close to the actual value, there is always some inherent error in the approximation. The accuracy of the estimate depends on factors such as the size of the bitmap & the characteristics of the dataset. In situations where precise counts are crucial, the Flajolet-Martin algorithm may not be the best choice.
2. False Positives
The Flajolet-Martin algorithm relies on hash functions to map elements to bit positions in the bitmap. Due to the probabilistic nature of hash functions, there is a possibility of collisions, where different elements may map to the same bit position. These collisions can lead to false positives, where the algorithm overestimates the number of distinct elements. While the impact of false positives can be mitigated by using larger bitmaps or multiple hash functions, it still remains a potential drawback.
3. Bitmap Size
The size of the bitmap used in the Flajolet-Martin algorithm affects both the accuracy & the memory usage. A larger bitmap reduces the probability of collisions & improves the accuracy of the estimate. However, it also requires more memory to store the bitmap. Choosing an appropriate bitmap size involves balancing the trade-off between accuracy & memory consumption based on the specific requirements of the application.
4. Sensitivity to Data Distribution
The Flajolet-Martin algorithm assumes that the elements in the dataset are uniformly distributed. If the dataset has a skewed distribution, where some elements appear much more frequently than others, the algorithm's estimate may be biased. In such cases, the algorithm may underestimate the number of distinct elements. Techniques like using multiple hash functions or combining the Flajolet-Martin algorithm with other cardinality estimation methods can help mitigate this issue to some extent.
5. Limited Information
The Flajolet-Martin algorithm only provides an estimate of the number of distinct elements. It does not give any information about the actual elements themselves or their frequencies. If the application requires more detailed information about the distinct elements, such as their identities or counts, additional data structures or algorithms need to be used in conjunction with the Flajolet-Martin algorithm.
6. Overhead for Small Datasets
For small datasets where the number of distinct elements is relatively low, the overhead of using the Flajolet-Martin algorithm may outweigh its benefits. In such cases, using simpler methods like hash sets or direct counting may be more efficient in terms of both memory & computation.
Frequently Asked Questions
What is the time complexity of the Flajolet-Martin algorithm?
The time complexity of the Flajolet-Martin algorithm is O(n), where n is the number of elements in the dataset. It processes each element once, making it efficient for large datasets.
Can the Flajolet-Martin algorithm handle duplicate elements?
Yes, the Flajolet-Martin algorithm can handle duplicate elements. It estimates the number of distinct elements, so duplicates do not affect the final estimate.
How does the bitmap size affect the accuracy of the Flajolet-Martin algorithm?
A larger bitmap size generally improves the accuracy of the estimate by reducing the probability of collisions. However, it also increases the memory usage. The choice of bitmap size depends on the desired balance between accuracy & memory consumption.
Conclusion
In this article, we discussed the Flajolet-Martin algorithm, a probabilistic technique for estimating the number of distinct elements in a large dataset. We learned about its key steps, like hashing elements, updating the bitmap, & calculating the final estimate. We also explained that in which situations we need this algorithm like in handling massive datasets, real-time processing, & approximate counting scenarios.
You can also check out our other blogs on Code360.