Introduction
Hash tables are a powerful data structure used in computer science to store & retrieve data efficiently. Imagine you have information like student IDs paired with student names. Without a hash table, you might have to look through a list of thousands of students to find a name, but with a hash table, you can do this almost instantly by using the student ID as a key. They help you with fast lookups, inserts, and deletions by using a special function called a hash function to map keys to specific indexes in an array. In C++, hash tables can be implemented using the unordered_map container.
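For instance, here is a minimal sketch of that student lookup using unordered_map (the IDs and names are made up for illustration):

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main() {
    // Map student IDs (keys) to student names (values)
    unordered_map<int, string> students;
    students[101] = "Alice";
    students[102] = "Bob";
    // Lookup by key is near-instant, no scanning through a list
    cout << "Student 101: " << students[101] << endl;
    return 0;
}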
In this article, we will discuss what hash tables are, how they work, some examples of hash functions, handling collisions, and analyzing the time and space complexity of hash table operations.
Some Examples of Hash Functions
Hash functions are the core of how hash tables work. Their job is to take a key (like a string or number) & map it to an index in the underlying array. A good hash function should:
1. Be deterministic (same input always maps to same output)
2. Distribute keys evenly across indexes to minimize collisions
3. Be fast to compute
For example:
#include <iostream>
#include <string>
using namespace std;

// Hash function for integers
int hashInt(int key, int tableSize) {
    return key % tableSize;
}

// Hash function for strings
int hashString(string key, int tableSize) {
    int hash = 0;
    for (char c : key) {
        hash = (hash + static_cast<int>(c)) % tableSize;
    }
    return hash;
}

int main() {
    cout << "Hash of 42: " << hashInt(42, 10) << endl;
    cout << "Hash of 'hello': " << hashString("hello", 10) << endl;
    return 0;
}
The integer hash function simply uses the modulo operator to map the key to an index. The string hash function iterates through each character, casts it to an int, adds it to the running hash, & takes the modulo at each step to keep the value within the table's bounds.
Bucket Index
In a hash table, the array that stores the key-value pairs is often referred to as a bucket array or simply buckets. The hash function computes the index into this array. So when we say a key "hashes to bucket 5," we mean the hash function mapped that key to index 5 in the underlying array.
For example:
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

// Hash function for strings
int hashString(string key, int tableSize) {
    int hash = 0;
    for (char c : key) {
        hash = (hash + static_cast<int>(c)) % tableSize;
    }
    return hash;
}

int main() {
    unordered_map<string, int> hashTable;
    string key = "example";
    int tableSize = hashTable.bucket_count();
    // Hash the key to get the bucket index
    int bucketIndex = hashString(key, tableSize);
    cout << "Key '" << key << "' hashes to bucket " << bucketIndex << endl;
    return 0;
}
In this code, we create an unordered_map (C++'s implementation of a hash table), get its current bucket count, and then use our hashString function from earlier to compute a bucket index for the key "example." Keep in mind that unordered_map uses its own internal hash function (std::hash), so this index only illustrates the concept; the container's actual bucket for a key can be queried with its bucket() member function.
The bucket index tells us where in the underlying array a key-value pair will be stored. If the hash table needs to grow and rehash, the bucket_count() and bucket indexes can change.
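As a quick sketch of this, the following forces a rehash and shows the bucket index moving (the call to rehash(100) is just an arbitrary way to trigger resizing):

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main() {
    unordered_map<string, int> hashTable;
    hashTable["example"] = 1;
    cout << "Before: " << hashTable.bucket_count() << " buckets, 'example' in bucket "
         << hashTable.bucket("example") << endl;
    // Request at least 100 buckets, forcing a rehash
    hashTable.rehash(100);
    cout << "After: " << hashTable.bucket_count() << " buckets, 'example' in bucket "
         << hashTable.bucket("example") << endl;
    return 0;
}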
Rehashing
As more key-value pairs are added to a hash table, the number of collisions can start to increase, slowing down lookups. To solve this issue, hash tables dynamically resize and rehash all the elements when a certain load factor is reached.
The load factor is the number of elements divided by the bucket count. For example, if a hash table has 100 buckets & 75 key-value pairs stored in it, the load factor is 0.75.
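You can query these numbers directly on an unordered_map; here is a minimal sketch (the element count of 75 is an arbitrary choice):

#include <iostream>
#include <unordered_map>
using namespace std;

int main() {
    unordered_map<int, int> hashTable;
    for (int i = 0; i < 75; i++) {
        hashTable[i] = i;
    }
    // load_factor() == size() / bucket_count()
    cout << "Elements: " << hashTable.size() << endl;
    cout << "Buckets: " << hashTable.bucket_count() << endl;
    cout << "Load factor: " << hashTable.load_factor() << endl;
    return 0;
}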
When the load factor exceeds a certain threshold (typically 0.75 or 1), the hash table will:
1. Create a new bucket array with double the previous size
2. Recompute the hash & bucket index for every existing key-value pair
3. Insert each pair into the new bucket array at the new index
This process is called rehashing. While it takes some time to rehash every element, by resizing exponentially, we keep the overall amortized cost of inserts to O(1).
For example:
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main() {
    unordered_map<int, string> hashTable;
    // Insert elements and watch the bucket count grow
    for (int i = 0; i < 10; i++) {
        hashTable[i] = "value" + to_string(i);
        cout << "Bucket count: " << hashTable.bucket_count() << endl;
    }
    return 0;
}
If you run this code, you'll see the bucket_count jump several times as rehashing occurs due to the increasing load factor (the exact growth sequence is implementation-defined, but it increases roughly geometrically).
Note: In practice, C++'s unordered_map handles rehashing automatically, but it's important to understand the concept & performance implications.
Collision
Even with a good hash function, it's possible for two distinct keys to hash to the same bucket index. This is called a collision. There are a few main ways to handle collisions:
1. Separate Chaining: Each bucket is a linked list of key-value pairs. If a collision occurs, the new pair is simply appended to the list for that bucket. This allows any number of collisions to be resolved, but lookups in a bucket become O(n) in the worst case (see the sketch after this list).
2. Open Addressing: If a collision occurs, the hash table probes for the next empty bucket in a deterministic sequence. The most common probe sequences are linear probing (check the next bucket in a loop), quadratic probing (add quadratic offsets), and double hashing (use a second hash function for offsets). Open addressing is more space-efficient but can lead to clustering and fails completely if all buckets are full.
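To make separate chaining concrete, here is a minimal sketch of a chained hash table; the fixed bucket count, the put and get helpers, and the reuse of our earlier string hash are simplifying assumptions for illustration, not how unordered_map is implemented internally:

#include <iostream>
#include <list>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Minimal separate-chaining table (illustrative sketch only)
class ChainedHashTable {
public:
    ChainedHashTable() : buckets(8) {}  // fixed bucket count for simplicity

    void put(const string& key, int value) {
        auto& chain = buckets[bucketIndex(key)];
        for (auto& entry : chain) {
            if (entry.first == key) { entry.second = value; return; }
        }
        chain.push_back({key, value});  // collision: append to this bucket's list
    }

    bool get(const string& key, int& value) {
        for (auto& entry : buckets[bucketIndex(key)]) {
            if (entry.first == key) { value = entry.second; return true; }
        }
        return false;
    }

private:
    vector<list<pair<string, int>>> buckets;

    int bucketIndex(const string& key) {
        int tableSize = static_cast<int>(buckets.size());
        int hash = 0;
        for (char c : key) {
            hash = (hash + static_cast<int>(c)) % tableSize;
        }
        return hash;
    }
};

int main() {
    ChainedHashTable table;
    table.put("key1", 1);
    table.put("key2", 2);
    int value;
    if (table.get("key2", value)) {
        cout << "key2 -> " << value << endl;
    }
    return 0;
}

In practice you rarely write this yourself, since C++'s unordered_map already implements separate chaining.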
For example:
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main() {
    unordered_map<string, int> hashTable;
    hashTable["key1"] = 1;
    hashTable["key2"] = 2;
    hashTable["key3"] = 3;
    // Print the bucket each key landed in; keys that share
    // a bucket index have collided
    for (const auto& entry : hashTable) {
        cout << "Key '" << entry.first << "' is in bucket "
             << hashTable.bucket(entry.first) << endl;
    }
    return 0;
}
If two keys print the same bucket index, they have collided: both are stored in that bucket's linked list. When looking up one of them, the hash table first finds the correct bucket and then searches the linked list for the matching key.
Note: C++'s unordered_map uses separate chaining for collision resolution by default. The maximum load factor can be customized if desired.
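For instance, here is a quick sketch of tuning these knobs (the values 0.5 and 100 are arbitrary choices for illustration):

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main() {
    unordered_map<string, int> hashTable;
    // Rehash earlier: trigger resizing once buckets are half full
    hashTable.max_load_factor(0.5f);
    // Pre-allocate buckets for ~100 elements to avoid rehashing during inserts
    hashTable.reserve(100);
    cout << "Max load factor: " << hashTable.max_load_factor() << endl;
    cout << "Bucket count: " << hashTable.bucket_count() << endl;
    return 0;
}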
Time Complexity and Space Complexity
The main advantage of hash tables is their excellent average case time complexity for lookups, inserts, and deletes. Let’s understand this in the form of a table:
Operation | Average Case | Worst Case
--------- | ------------ | ----------
Insert    | O(1)         | O(n)
Delete    | O(1)         | O(n)
Lookup    | O(1)         | O(n)
In the average case, these operations are all constant time because the hash function distributes keys evenly, and collisions are rare. However, in the worst case (all keys hash to the same bucket), these operations degrade to linear time because the bucket's linked list must be searched or modified.
The space complexity of a hash table is O(n), where n is the number of key-value pairs stored: the table holds each pair once, and the bucket array is kept proportional to the number of elements.
Note: It's very important to remember that the worst-case time complexity can be made extremely unlikely by using a good hash function and choosing an appropriate load factor threshold for rehashing.
Frequently Asked Questions
What is the best way to handle collisions in a hash table?
Separate chaining is generally preferred as it allows for unlimited collisions & is simpler to implement than open addressing.
How do I choose a good hash function?
A good hash function should be fast, deterministic, and distribute keys evenly. std::hash is usually sufficient for most types.
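As a quick sketch, hashing a string with std::hash looks like this (the table size of 10 is an arbitrary choice):

#include <functional>
#include <iostream>
#include <string>
using namespace std;

int main() {
    // std::hash produces a size_t; modulo maps it to a bucket index
    size_t h = hash<string>{}("hello");
    cout << "std::hash of \"hello\": " << h << endl;
    cout << "Bucket index (table size 10): " << h % 10 << endl;
    return 0;
}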
When should I use a hash table over other data structures?
Hash tables are ideal when you need fast O(1) lookups, inserts & deletes on average. They are less suitable if ordering matters or the keys are not hashable.
Conclusion
In this article, we discussed the fundamentals of hash tables in C++, including hash functions, bucket indexes, rehashing, collision resolution, & time/space complexity. We learned that hash tables provide excellent average case performance for key-based operations, making them a powerful tool in any programmer's arsenal.