Introduction
In this blog, we will discuss the topic of string hashing. String hashing is a very interesting as well as an important topic. String hashing based-coding problems are widely asked in coding contests and various coding interviews.
String Hashing
String hashing is the technique or process of mapping string and integer. In other words, String hashing is the way to convert the string into an integer, and that integer is known as the hash of a string.
String hashing is used to compare two strings whether they are equal or not. They both are equal if the hash value of both the string is equal. The hash value of the string is calculated by using a hash function. Let’s discuss the hash function to convert a string into an integer.
What is Hash Function?
String hash function is the function that converts a string into an integer. The hash of a string can be calculated as:
Hash(S) =(S[0] * P^0 + S[1] * P^1 +S[2] * P^2 + …….+ S[N-1] * P^(N-1)) % M
Where,
‘S’ is the given string.
‘P’ is the prime number that must be greater than or equal to the number of distinct characters in the string ‘S’.
‘M’ should be large to avoid the collision.
The above function is known as the polynomial rolling hash function.
Collision
It may be possible that two different strings have the same hash values. This may occur because we take modulo ‘M’ in the final hash value. In that case, two different strings may have the same hash values, called a collision. We need to design our hash function so that the collision probability is very low. One way to reduce the chances of collision is that not take the mod of the final hash value with ‘M’. But this is not efficient as the hash value may be very large, and our storage memory is limited.
Another way to reduce the collision probability is to take the value of ‘M’ as large as possible.
The probability of collision is 1 / M.
Example
The hash of string S = “cat” can be calculated as:
hash(S) = (‘c’ - ‘a’) + (‘a’ - ‘a’) * 31 + (‘c’ - ‘a’) * 31^2.
Here, P = 31 as all the characters are in lowercase, so the count of all distinct lowercase characters is 26, and 31 is the prime number and greater than 26.
Also see, Euclid GCD Algorithm