In hashing, we pass the key to be hashed to a hash function. The hash function generates the index (basically a location) corresponding to the given key. A good hash function uses properties of the objects involved that distinguish them from each other.
No matter how different data items are from one another, there is always a chance that they have something in common, which compels the hash function to generate the same value for different keys. For example, in an array of strings, some elements may have equal lengths; for numbers, being even or odd can be the common property. Relying on such common properties in hashing functions can sometimes lead to a collision.
Have you heard of the Birthday Paradox? In simple terms, it says that the probability of two keys in a group or table sharing the same property grows surprisingly quickly as the number of keys increases. For more details, visit Birthday problem.
Let’s consider the following scenario:
Being the class monitor, you are assigned to record each student’s name and respective test score using hashmaps. So, the keys will be the student names, and the values will be the corresponding test scores. What about the hashing function? For now, let’s say that it’s equal to the length of the key: h(x) = length(key), where h(x) is the hashing function.
Let’s look at the student data we need to input:

Student’s Name    Test Score
Arun              79
Diksha            98
Ram               66
Srijan            88
Vivek             91
After inserting this data with our hash function, each score is placed at the index given by the length of the student’s name. As you can see, ‘Srijan’ and ‘Diksha’ have equal string lengths, so both keys are mapped to the same index, and the hashing function encounters a collision. That is why we need a mechanism to handle such collisions.
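If you want to see this collision concretely, here is a minimal Python sketch; the table size of 10 and the helper names are assumptions made purely for illustration:

```python
# A sketch of the length-based hash function from the example.
# TABLE_SIZE = 10 is an assumed table size for illustration.
TABLE_SIZE = 10

def h(key: str) -> int:
    # h(x) = length(key), kept within the table bounds
    return len(key) % TABLE_SIZE

students = {"Arun": 79, "Diksha": 98, "Ram": 66, "Srijan": 88, "Vivek": 91}

for name in students:
    print(name, "->", h(name))
# 'Srijan' and 'Diksha' both have length 6, so they map to the same index: a collision.
```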
Each mechanism offers its advantages and disadvantages. So, let’s dive in and study two of the most common and intuitive collision handling mechanisms: Separate Chaining and Open Addressing.
Separate Chaining
Can you guess the mechanism from the name? Go ahead. You are probably right. Separate Chaining doesn’t follow the traditional way of key-value mapping, where every key is mapped to a memory block that holds only one value. Instead, in Separate Chaining, each key is mapped to a chain (represented by an array or linked list) that can hold several values.
So, every time a collision is encountered, the key-value pair is simply pushed to the back of the respective chain. Let’s revisit our example of students’ test score records discussed in the previous section. This time our classmate ‘Srijan’ can get his test score mapped in the hashmap: both ‘Srijan’ and ‘Diksha’ now live at index 6, chained together in the same bucket.
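A minimal Separate Chaining sketch in Python might look like the following; the table size of 10, the function names, and the use of Python lists as chains are all assumptions made for illustration:

```python
# A minimal separate-chaining sketch (assumed table size 10; Python lists act as chains).
TABLE_SIZE = 10

table = [[] for _ in range(TABLE_SIZE)]   # one chain (list) per index

def insert(key: str, value: int) -> None:
    index = len(key) % TABLE_SIZE         # h(x) = length(key)
    table[index].append((key, value))     # push the pair to the back of the chain

def search(key: str):
    index = len(key) % TABLE_SIZE
    for k, v in table[index]:             # scan only this key's chain
        if k == key:
            return v
    return None

for name, score in [("Arun", 79), ("Diksha", 98), ("Ram", 66), ("Srijan", 88), ("Vivek", 91)]:
    insert(name, score)

print(table[6])          # [('Diksha', 98), ('Srijan', 88)] -- both share one chain
print(search("Srijan"))  # 88
```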
Every mechanism comes with its pros and cons. It is always better to first analyze the advantages and disadvantages offered by a mechanism before using it. Let’s analyze the Separate Chaining collision handling to check where it should be used.
Advantages
Efficient domain split: As the inputs are distributed across different chains, each CRUD operation (create, read, update, delete) only has to work on one short chain rather than the whole table.
Constant insertion time: For insertion, the data can be pushed back in its respective chain, making the insertion an O(1) operation.
Dynamic memory management: If we use a linked list or vector for implementation, the size of the chain can dynamically be modified, making Separate Chaining a memory-efficient collision handling mechanism.
Disadvantages
Linear time complexity in the worst case: Separate Chaining is a suitable collision handling mechanism, but its search and delete operations run in linear time (O(N) time complexity) when the hashing function works poorly for the given data, that is, when most of the elements end up in the same chain.
Extra space for link management: Extra space is needed to maintain bookkeeping information such as the next-node address in each chain (basically metadata). This overhead is acceptable for large inputs but wasteful when the input size is small.
Performance Analysis
Let n = number of keys to be inserted, and
m = size of the table.
Let 𝞪 be the load factor:
𝞪 = n / m
What does the load factor signify?
A load factor of 1 means the number of keys equals the table size, i.e., one key per index on average.
A load factor less than 1 means there are more indexes than keys, so some indexes remain empty.
A load factor greater than 1 means there are more keys than indexes, so collisions are unavoidable.
As insertion only requires a push back into the chain, it is an O(1) operation.
Similarly, once a key has been located, removing its node from the chain is an O(1) operation.
Under a reasonably uniform hash function, search and delete take O(1 + 𝞪) time on average: one hash computation plus a scan of a chain whose average length is 𝞪. In the worst case, when most elements land non-uniformly in a single chain, these operations degrade to O(N).
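As a quick worked example (the numbers are chosen only for illustration): inserting n = 15 keys into a table of size m = 10 gives 𝞪 = 15 / 10 = 1.5, so the average chain holds 1.5 keys and a search inspects roughly 1 + 1.5 = 2.5 nodes.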
Open Addressing
In case of collision, the Open Addressing mechanism finds the next free memory address to map the key. Unlike Separate Chaining, the Open Addressing mechanism offers multiple ways to find the next available memory location. The most common ones are linear probing, quadratic probing, and double hashing. Let’s discuss each of these in detail.
Linear Probing
As the name suggests, the linear probing mechanism traverses linearly through the array to look for the next available memory for insert, delete, and search operations. Didn’t get it? Let’s try with an example.
For your class assignment, you have got the following key-value pairs to store in an array ‘ARR’ of size 10: (13, 6.7), (4, 2.5), (7, 7.3), and (3, 4.4).
For simplicity, we use modulo 10 as our hashing function, so the index of a value will be equal to (key % 10). Let’s create a lookup to get a better look at the key-value pairs and their respective indices:

Key    Value    Index (key % 10)
13     6.7      3
4      2.5      4
7      7.3      7
3      4.4      3
Based on the above data, the insert, search and delete will work as follows:
Insert
Based on the lookup we created, the array occupancy may look like this:
Index 0: Free
Index 1: Free
Index 2: Free
Index 3: Occupied
Index 4: Occupied
Index 5: Free
Index 6: Free
Index 7: Occupied
Index 8: Free
Index 9: Free
As you can see here, pairs (13,6.7) and (3, 4.4) are both given index 3 by our hashing function. Pair (13, 6.7) will be inserted successfully. But for pair (3, 4.4), the method will encounter a collision. For collision handling using linear probing, the next index is checked. Unfortunately, index 4 is already occupied by pair (4, 2.5). On probing further, the method will find a free memory block at index 5. So, the pair (3, 4.4) will be stored at index 5. So, the array will look like this:
Index 0: Free
Index 1: Free
Index 2: Free
Index 3: (13, 6.7)
Index 4: (4, 2.5)
Index 5: (3, 4.4)
Index 6: Free
Index 7: (7, 7.3)
Index 8: Free
Index 9: Free
Array (ARR) using Linear Probing
Thus, for linear probing, the probe sequence becomes (h(x) + i) % table size, where i = 0, 1, 2, … is the offset used to reach the next candidate slot.
If you have come this far, you would be aware that a hashmap usually stores only the values; the keys themselves are not stored but are used to map the locations of those values. Linear probing is different: since a key’s final location is not fixed, the complete key-value pair must be stored in the array so that, while probing, we can tell which key occupies each slot.
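Here is a minimal linear-probing insert sketch in Python for the ARR example; the function name and the use of None to mark free slots are assumptions made for illustration:

```python
# A minimal linear-probing insert sketch (table size 10, hash = key % 10).
SIZE = 10
ARR = [None] * SIZE            # None marks a free slot; occupied slots hold (key, value) pairs

def insert(key: int, value: float) -> None:
    for i in range(SIZE):                  # offset i = 0, 1, 2, ...
        index = (key % SIZE + i) % SIZE    # h(x) + i, wrapped around the table
        if ARR[index] is None:             # free slot found
            ARR[index] = (key, value)
            return
    raise RuntimeError("hashmap is full")

for key, value in [(13, 6.7), (4, 2.5), (7, 7.3), (3, 4.4)]:
    insert(key, value)

print(ARR)
# [None, None, None, (13, 6.7), (4, 2.5), (3, 4.4), None, (7, 7.3), None, None]
```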
Search
The required key is not guaranteed to sit at the index computed by the hash function. To search for a key, we first calculate the index using the defined hash function and check whether the required key-value pair is present at that index.
Consider the array ‘ARR’ and look up the value of the key 13. The index is calculated as (13 % 10) = 3. Index 3 holds the key 13, and its respective value is 6.7.
The pair may, however, have encountered a collision during insertion and been stored elsewhere through probing. If the key-value pair is not present at the computed index, the mechanism probes linearly and checks the pairs at the following indices.
Now find the value of the key 3 in the array ‘ARR’. The hash function gives index (3 % 10) = 3, but we encountered a collision at index 3 with the pair (13, 6.7). Do you remember that? Probe linearly, and you’ll find the key 3 at index 5.
What if the key is not present at all? While probing linearly, if you encounter a free memory block or reach the end of the array, it means the key is not present in the array.
Delete
To delete a particular key, first search for its index as described above. If the key is present, replace the key-value pair with a reserved marker (such as (-1, -1)). The marker tells future insertions that the slot can be reused, while searches know to keep probing past it instead of stopping early as they would at a truly free slot.
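A minimal Python sketch of search and delete with linear probing might look like this, continuing the ARR example; the (-1, -1) pair plays the role of the reserved “deleted” marker described above, and the array is pre-filled to match the table shown earlier:

```python
# A minimal linear-probing search/delete sketch for the ARR example.
SIZE = 10
DELETED = (-1, -1)   # reserved marker for deleted slots
ARR = [None, None, None, (13, 6.7), (4, 2.5), (3, 4.4), None, (7, 7.3), None, None]

def search(key: int):
    for i in range(SIZE):
        index = (key % SIZE + i) % SIZE
        if ARR[index] is None:            # truly free slot: the key cannot be further ahead
            return None
        if ARR[index] != DELETED and ARR[index][0] == key:
            return ARR[index][1]          # deleted slots are skipped, probing continues
    return None

def delete(key: int) -> None:
    for i in range(SIZE):
        index = (key % SIZE + i) % SIZE
        if ARR[index] is None:
            return                        # key not present
        if ARR[index] != DELETED and ARR[index][0] == key:
            ARR[index] = DELETED          # mark the slot as reusable
            return

print(search(13))   # 6.7
print(search(3))    # 4.4 -- found at index 5 after probing past indices 3 and 4
delete(13)
print(search(3))    # still 4.4: the marker lets the probe continue past index 3
```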
Before you jump to use linear probing, it’s essential to know about two special conditions: Primary Clustering and Secondary Clustering.
Primary Clustering
It’s a scenario where most of the elements tend to accumulate in one part of the array. In that case, you may have to probe through a long run of occupied slots every time you insert a new element. Doesn’t this push insertion towards O(N) time complexity? You’re absolutely right.
Secondary Clustering
Turn on your intuition mode because you can’t directly see this kind of clustering in your array. If many elements start probing from the same index, they all follow the same sequence of offsets and keep colliding with one another, so the insertion operation can take linear time (O(N) time complexity) to find a vacant address.
Consider the key-value pair table discussed above:
For this table, the data populated by linear probing is shown below:
Index 0: Free
Index 1: Free
Index 2: Free
Index 3: (13, 6.7)
Index 4: (4, 2.5)
Index 5: (3, 4.4)
Index 6: Free
Index 7: (7, 7.3)
Index 8: Free
Index 9: Free
Array (ARR) using Linear Probing
Let’s say several new keys that also hash to index 3 (for example, 23, 33, and 43 under our modulo-10 hash) undergo insertion.
In the above example, all the key-value pairs will encounter collision and linearly probe through the array. Since all the values start probing from the same index, insertion time complexity becomes linear, and the hashmap becomes inefficient. This kind of clustering is called Secondary Clustering.
Quadratic Probing
Quadratic probing is very similar to the linear probing mechanism, except that the probe sequence here is modified to (h(x) + i * i) % table size. Using a quadratic offset eliminates primary clustering, one of the biggest disadvantages of linear probing. However, there is still a good chance of encountering secondary clustering.
Can you guess the index of pair (3, 4.4) using quadratic probing collision handling?
Hint: If you reach the end of the array, you can start from the beginning of the array to look for a free memory location.
Refer to the array given below for your answer:
Index 0: Free
Index 1: Free
Index 2: (3, 4.4)
Index 3: (13, 6.7)
Index 4: (4, 2.5)
Index 5: Free
Index 6: Free
Index 7: (7, 7.3)
Index 8: Free
Index 9: Free
Array (ARR) using Quadratic Probing
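To verify the answer, here is a minimal quadratic-probing sketch in Python for the same pairs; the function name and the wrap-around via the modulo operator are assumptions made for illustration:

```python
# A minimal quadratic-probing sketch (table size 10, hash = key % 10).
SIZE = 10
ARR = [None] * SIZE

def insert(key: int, value: float) -> None:
    for i in range(SIZE):
        index = (key % SIZE + i * i) % SIZE   # h(x) + i*i, wrapped around the table
        if ARR[index] is None:
            ARR[index] = (key, value)
            return
    raise RuntimeError("no free slot found")

for key, value in [(13, 6.7), (4, 2.5), (7, 7.3), (3, 4.4)]:
    insert(key, value)

print(ARR)
# [None, None, (3, 4.4), (13, 6.7), (4, 2.5), None, None, (7, 7.3), None, None]
# For key 3: offsets 0, 1, 4 hit occupied indices 3, 4, 7, and offset 9 wraps to index 2.
```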
Double Hashing
As the name suggests, double hashing uses two hashing functions (not necessarily the same) to compute the index for data mapping. While one of the hashing functions is used on all the inputs, the other function is only used if a collision occurs. Let’s consider an example for better understanding.
Remember our students' test score record example? Refer to the table given below for the same.
Student’s Name    Test Score
Arun              79
Diksha            98
Ram               66
Srijan            88
Vivek             91
Our hashing function used the length of the ‘NAME’ strings to find the index of every record stored, but a collision was encountered because the two keys ‘Srijan’ and ‘Diksha’ have the same string length. Double hashing can be used here for collision handling. Let’s add another hashing function that uses (score % array size) to compute the index whenever the first hash collides. After collision handling, ‘Diksha’ keeps the index given by the first hash, while ‘Srijan’ is mapped by the second hash to a different, collision-free index.
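A minimal Python sketch of this scheme might look like the following. It follows the description above (the second hash is used only on a collision); the table size of 10 and the behaviour when both hashes collide are assumptions, since the text does not specify them:

```python
# A minimal double-hashing sketch: h1 uses the name length, h2 uses the score,
# and h2 is consulted only when h1 causes a collision. SIZE = 10 is assumed.
SIZE = 10
table = [None] * SIZE

def h1(name: str) -> int:
    return len(name) % SIZE

def h2(score: int) -> int:
    return score % SIZE

def insert(name: str, score: int) -> None:
    index = h1(name)
    if table[index] is not None:          # collision: switch to the second hash
        index = h2(score)
    if table[index] is not None:          # a full implementation would keep probing here
        raise RuntimeError("both hashes collided")
    table[index] = (name, score)

for name, score in [("Arun", 79), ("Diksha", 98), ("Ram", 66), ("Srijan", 88), ("Vivek", 91)]:
    insert(name, score)

print(table)
# 'Diksha' stays at index 6, while 'Srijan' (score 88) moves to index 88 % 10 = 8.
```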
Advantages and Disadvantages
Like Separate Chaining, Open Addressing offers its pros and cons. Based on the advantages and disadvantages given below, you can choose your collision handling mechanism as per your needs.
Advantages
Easy to implement mechanism: To implement open addressing mechanisms, all you need is a hashing function and an offset function. No complex implementation like a linked list is required.
More combinations of hashing functions, fewer collisions: In Open Addressing, you can combine more than one hashing function (as in double hashing) for collision handling. This provides more flexibility than Separate Chaining.
Disadvantages
Higher chances of primary clustering: Collision handling mechanisms like linear probing insert colliding keys into consecutive slots, which causes data to cluster around one region of the array.
Higher chances of secondary clustering: Collision handling mechanisms like quadratic probing do not help much against secondary clustering. Keys that hash to the same index are probed with the same sequence of offsets, which can make insertion a linear-time operation.
Inefficient in the case of the Birthday Paradox: In Separate Chaining, colliding keys can simply be pushed to the end of a chain. That is not possible in Open Addressing: when many keys share the property the hashing function relies on, every one of them collides, and the only way out is probing, which can take linear time.
Under uniform hashing with load factor 𝞪 < 1, the expected number of probes for insert, search, and delete operations in Open Addressing is about 1 / (1 - 𝞪). Note that this requires the number of keys to be smaller than the size of the hashmap; as 𝞪 approaches 1, the cost grows sharply (at 𝞪 = 0.9, for example, an operation needs roughly 10 probes on average). Check out this problem - Longest String Chain
Frequently Asked Questions
What does the load factor signify?
A load factor of 1 means the number of keys equals the table size, i.e., one key per index on average.
A load factor less than 1 means there are more indexes than keys, so some indexes remain empty.
A load factor greater than 1 means there are more keys than indexes, so collisions are unavoidable.
What is Double Hashing?
As the name suggests, double hashing uses two hashing functions (not necessarily the same) to compute the index for data mapping. While one of the hashing functions is used on all the inputs, the other function is only used if a collision occurs.
What is Linear Probing?
As the name suggests, the linear probing mechanism traverses linearly through the array to look for the next available memory for insert, delete, and search operations.
Conclusion
You did really well in learning these collision-handling techniques. Head over to Coding Ninjas Studio to learn more.
Hold on to your curiosity, and it will take you to greater heights. Those who learn something new every day are destined to succeed. For people with ambitions, CodingNinjas offers guided paths and much more to make the learning process easier and more interesting.