8. Hashing


What is Hashing?
 Hashing is the process of mapping a large amount of data to a smaller table with the help of a hashing function.
 Hashing is also known as a Hashing Algorithm or Message Digest Function.
 It is a technique to convert a range of key values into a range of indexes of an array.
 It provides a faster searching method than linear or binary search.
 Hashing allows us to update and retrieve any data entry in constant time O(1).
 Constant time O(1) means the operation does not depend on the size of the data.
 Hashing is used with databases to enable items to be retrieved more quickly.
 It is used in generating and verifying digital signatures.

What is Hash Function?


 A fixed process that converts a key to a hash key is known as a Hash Function.
 This function takes a key and maps it to a value of a certain length, which is called a Hash value or Hash.
 The hash value represents the original string of characters, but it is normally smaller than the original.
 In digital signatures, the hash value and the signature are sent to the receiver. The receiver uses the same hash function to generate the hash value and then compares it to the one received with the message.
 If the hash values are the same, the message was transmitted without alteration.

What is Hash Table?


 A hash table or hash map is a data structure used to store key-value pairs.
 It is a collection of items stored in a way that makes it easy to find them later.
 It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
 It is an array of lists, where each list is known as a bucket.
 Each bucket contains values located by the key.
 In Java, the Hashtable class implements the Map interface and extends the Dictionary class.
 Java's Hashtable is synchronized and contains only unique keys.

 The above figure shows a hash table of size n = 10. Each position of the hash table is called a Slot. There are n slots in the table, named {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}: slot 0, slot 1, slot 2 and so on. Initially the hash table contains no items, so every slot is empty.
 The mapping between an item and the slot where that item belongs in the hash table is called the hash function. The hash function takes any item in the collection and returns an integer in the range of slot names, 0 to n-1.
 Suppose we have the integer items {26, 70, 18, 31, 54, 93}. One common method of determining a hash key is the division method of hashing, with the formula:
Hash Key = Key Value % Number of Slots in the Table
 The division method (or remainder method) takes an item, divides it by the table size and returns the remainder as its hash value.
Data Item Value % No. of Slots Hash Value
26 26 % 10 = 6 6
70 70 % 10 = 0 0
18 18 % 10 = 8 8
31 31 % 10 = 1 1
54 54 % 10 = 4 4
93 93 % 10 = 3 3

 After computing the hash values, we can insert each item into the hash table at the designated position, as shown in the figure above. In this hash table, 6 of the 10 slots are occupied. The ratio of stored items to table size is called the load factor, denoted λ = No. of items / table size; here λ = 6/10 = 0.6.
 It is easy to search for an item using the hash function: it computes the slot name for the item and then checks the hash table to see if the item is present.
 A constant amount of time O(1) is required to compute the hash value and access the hash table at that index (a short sketch follows this list).
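The example above can be reproduced with a minimal Python sketch (function names such as build_table are illustrative, not from the text):

def hash_key(value, table_size):
    # Division (remainder) method: Hash Key = Key Value % Number of Slots
    return value % table_size

def build_table(items, table_size=10):
    table = [None] * table_size                      # every slot starts empty
    for item in items:
        table[hash_key(item, table_size)] = item     # assumes no collisions occur
    return table

items = [26, 70, 18, 31, 54, 93]
table = build_table(items)
print(table)                      # [70, 31, None, 93, 54, None, 26, None, 18, None]
print(len(items) / len(table))    # load factor = 0.6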

Types of hashing methods:


There are various types of hash functions which are used to place the data in a hash table:

1. Division method
In this method the hash function depends upon the remainder of a division. For example, suppose the records 52, 68, 99, 84 are to be placed in a hash table, and let us take the table size as 10.
Then:
h(key) = record % table size
h(52) = 52%10 = 2
h(68) = 68%10 = 8
h(99) = 99%10 = 9
h(84) = 84%10 = 4

2. Multiplication method
In this method, the digits of the key are multiplied together, and the remainder of the result divided by the table size gives the address.
For example: consider a key 2134. Now, h(2134) = (2*1*3*4) % 10 = 4 is the address of the key.
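A small sketch of this digit-product variant (the helper name digit_product_hash is illustrative):

def digit_product_hash(key, table_size=10):
    # Multiply the digits of the key, then take the remainder modulo the table size
    product = 1
    for digit in str(key):
        product *= int(digit)
    return product % table_size

print(digit_product_hash(2134))   # (2*1*3*4) % 10 = 24 % 10 = 4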

3. Mid square method


Let p be the maximum number of digits in a key and q be the maximum number of digits in a bucket address. In this method, the key is first squared and then the middle part of the result is taken as the index.
For example, consider p = 3 and q = 3. Suppose we want to place the key 3101 and the size of the table is 1000 (0 – 999). Then 3101*3101 = 9616201, i.e. h(3101) = 162 (the middle 3 digits) is the address of the key.
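A sketch of the mid-square computation for this example (mid_square_hash is an illustrative name):

def mid_square_hash(key, q=3):
    # Square the key and take the middle q digits of the result as the address
    squared = str(key * key)
    start = (len(squared) - q) // 2
    return int(squared[start:start + q])

print(mid_square_hash(3101))      # 3101*3101 = 9616201 -> middle 3 digits = 162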

4. Folding method
Most machines have a small number of primitive data types for which there are arithmetic instructions. Frequently the key to be used will not fit easily into one of these data types. The solution is to combine the various parts of the key in such a way that all parts of the key affect the final result; such an operation is termed folding the key. That is, the key is partitioned into a number of parts, each part having the same length as the required address. The values of the parts are added, ignoring the final carry, to get the required address.
This is done in two ways:

Fold-shifting: Here the actual values of the parts of the key are added.
Suppose the key is 12345678 and the required address is of two digits; then break the key into 12, 34, 56, 78.
Adding these, we get 12 + 34 + 56 + 78 = 180; ignoring the leading 1, we get 80 as the location.

Fold-boundary: Here the reversed values of the outer parts of the key are added.
Suppose the key is 12345678 and the required address is of two digits; then break the key into 21, 34, 56, 87.
Adding these, we get 21 + 34 + 56 + 87 = 198; ignoring the leading 1, we get 98 as the location.
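A minimal sketch of both folding variants for a two-digit address (the function names are illustrative):

def fold_shift(key, width=2):
    # Break the key into parts of the required address width and add them
    digits = str(key)
    parts = [int(digits[i:i + width]) for i in range(0, len(digits), width)]
    return sum(parts) % (10 ** width)          # ignore any carry beyond 'width' digits

def fold_boundary(key, width=2):
    # Same as fold-shifting, but the outer (boundary) parts are reversed first
    digits = str(key)
    parts = [digits[i:i + width] for i in range(0, len(digits), width)]
    parts[0], parts[-1] = parts[0][::-1], parts[-1][::-1]
    return sum(int(p) for p in parts) % (10 ** width)

print(fold_shift(12345678))       # 12+34+56+78 = 180 -> 80
print(fold_boundary(12345678))    # 21+34+56+87 = 198 -> 98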

Characteristics of good hashing function:


 The hash function should generate different hash values for similar strings.
 The hash function should be easy to understand and simple to compute.
 The hash function should distribute the keys uniformly over the array.
 The number of collisions should be low when placing the data in the hash table.
 A hash function is a perfect hash function when it uses all the input data.

Hash Table Data Structure:


There are two different forms of hashing.
1. Open hashing or external hashing
Open or external hashing allows records to be stored in unlimited space (which could be a hard disk). It places no limitation on the size of the tables. Each bucket in the bucket table is the head of a linked list of records mapped to that bucket.

2. Close hashing or internal hashing


Closed or internal hashing uses a fixed space for storage and thus limits the size of the hash table. A closed hash table keeps the elements in the buckets themselves; only one element can be put in a bucket. In closed hashing, collision handling is a very important issue.

Applications of Hashing:
Hashing provides constant-time search, insert and delete operations on average. This is why hashing is one of the most widely used data structures; example problems include finding distinct elements, counting frequencies of items, finding duplicates, etc.
There are many other applications of hashing, including modern day cryptography hash functions. Some of
these applications are listed below:
 Message Digest
 Password Verification
 Data Structures(Programming Languages)
 Compiler Operation
 Rabin-Karp pattern matching Algorithm
 Linking File name and path together

Collision
A collision is a situation in which the hash function returns the same hash key for more than one record. Sometimes resolving a collision may lead to an overflow condition; frequent collisions and overflows are the mark of a poor hash function.

What is Collision?
Since a hash function maps a key, which may be a big integer or string, to a small number, there is a possibility that two keys result in the same value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called a collision, and it must be handled using some collision handling technique.

What are the chances of collisions with large table?


Collisions are very likely even if we have a big table to store keys. An important observation is the Birthday Paradox: with only 23 people, the probability that two of them share a birthday is about 50%.

How to handle Collisions?
When a collision occurs, it can be handled by applying certain techniques. These techniques are called collision resolution techniques. Several of them are described below:

1. Open Addressing Methods


a) Linear Probing
b) Quadratic Probing
c) Double Hashing

2. Chaining Method
a) Separate Chaining

3. Coalesced Chaining

1. Open Addressing Methods:


In open addressing, all elements are stored in the hash table itself. So at any point, the size of the table must be greater than or equal to the total number of keys (note that we can increase the table size by copying old data if needed).
Insert(k): Keep probing until an empty slot is found. Once an empty slot is found, insert k.
Search(k): Keep probing until the slot's key becomes equal to k or an empty slot is reached.
Delete(k): The delete operation is interesting. If we simply delete a key, then a later search may fail. So, the slots of deleted keys are marked specially as "deleted".
Insert can insert an item in a deleted slot, but the search doesn't stop at a deleted slot.

Open Addressing is done following ways:


a) Linear Probing: In linear probing, we linearly probe for the next slot. The typical gap between two probes is 1, as in the example below.
Let hash(x) be the slot index computed using hash function and S be the table size
If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
..................................................
Hash function is h(key) = (key + i) % S, for i = 0, 1, 2, …….
Let us consider a simple hash function as “key mod 7” and sequence of keys as 50, 700, 76, 85, 92, 73, 101.
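The figure for this example is not reproduced here; a minimal Python sketch of linear probing with tombstone deletion (class and sentinel names are illustrative) reproduces the placement:

EMPTY, DELETED = object(), object()      # sentinel markers for free and deleted slots

class LinearProbingTable:
    def __init__(self, size):
        self.size = size
        self.slots = [EMPTY] * size

    def insert(self, key):
        for i in range(self.size):                 # probe at most 'size' slots
            idx = (key + i) % self.size            # h(key) = (key + i) % S
            if self.slots[idx] in (EMPTY, DELETED):
                self.slots[idx] = key
                return idx
        raise OverflowError("hash table is full")

    def search(self, key):
        for i in range(self.size):
            idx = (key + i) % self.size
            if self.slots[idx] is EMPTY:           # stop only at a truly empty slot
                return -1
            if self.slots[idx] == key:
                return idx
        return -1

    def delete(self, key):
        idx = self.search(key)
        if idx != -1:
            self.slots[idx] = DELETED              # mark as deleted; do not break probe chains

t = LinearProbingTable(7)
for k in [50, 700, 76, 85, 92, 73, 101]:
    t.insert(k)
print(t.slots)    # [700, 50, 85, 92, 73, 101, 76]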

Clustering: The main problem with linear probing is clustering: many consecutive occupied slots form groups (clusters), and it starts taking more time to find a free slot or to search for an element.

b) Quadratic Probing: We look for the (i*i)-th slot in the i-th iteration.


Let hash(x) be the slot index computed using hash function.
If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
..................................................
Hash function is h(key) = (key + i*i) % S, for i = 0, 1, 2, …….

Example of Quadratic probing:
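The example figure is not reproduced here; as a rough illustration (assuming the same "key mod 7" example as above), quadratic probing can be sketched as:

def quadratic_probe_insert(table, key):
    # Try (key + i*i) % S for i = 0, 1, 2, ... until a free slot is found
    S = len(table)
    for i in range(S):
        idx = (key + i * i) % S
        if table[idx] is None:
            table[idx] = key
            return idx
    raise OverflowError("no free slot found")

table = [None] * 7
for k in [50, 700, 76, 85, 92, 73, 101]:
    quadratic_probe_insert(table, k)
print(table)    # [700, 50, 85, 73, 101, 92, 76]

Note that quadratic probing is only guaranteed to find a free slot when the table is at most half full (for a prime table size); it happens to place all seven keys in this particular case.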

c) Double Hashing: We use another hash function hash2(x) and look for the i*hash2(x) slot in the i-th iteration.
Let hash(x) be the slot index computed using hash function.
If slot hash(x) % S is full, then we try (hash(x) + 1*hash2(x)) % S
If (hash(x) + 1*hash2(x)) % S is also full, then we try (hash(x) + 2*hash2(x)) % S
If (hash(x) + 2*hash2(x)) % S is also full, then we try (hash(x) + 3*hash2(x)) % S
..................................................
Hash function is hash1(key) = key % S and
one good choice is to choose a prime R < S and hash2(key) = R – (key % R)
So, the function is (hash1(key) + i * hash2(key)) % S, for i = 1, 2, 3,…

Example of Double Hashing:
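The example figure is not reproduced here; a rough sketch (assuming the same keys as above and the illustrative choice R = 5, a prime smaller than S = 7):

def double_hash_insert(table, key, R=5):
    # hash1(key) = key % S, hash2(key) = R - (key % R), probe (hash1 + i*hash2) % S
    S = len(table)
    h1, h2 = key % S, R - (key % R)
    for i in range(S):
        idx = (h1 + i * h2) % S
        if table[idx] is None:
            table[idx] = key
            return idx
    raise OverflowError("no free slot found")

table = [None] * 7
for k in [50, 700, 76, 85, 92, 73, 101]:
    double_hash_insert(table, k)
print(table)    # [700, 50, 101, 92, 85, 73, 76] with this choice of R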

Performance of Open Addressing:
Like chaining, the performance of open addressing can be evaluated under the assumption that each key is equally likely to be hashed to any slot of the table (simple uniform hashing).
m = Number of slots in the hash table
n = Number of keys to be inserted in the hash table
Load factor α = n/m (α < 1)
Expected time to search/insert/delete < 1/(1 - α)
So search, insert and delete take O(1/(1 - α)) time. For example, with α = 0.5 the expected number of probes is below 2, while with α = 0.9 it is below 10.

Chaining Method: The idea is to make each cell of the hash table point to a linked list of records that have the same hash function value. Chaining is simple, but requires additional memory outside the table.

Separate Chaining:
Each cell of the hash table points to a linked list of the records that hash to that cell. Let us consider a simple hash function "key mod 7" and the sequence of keys 50, 700, 76, 85, 92, 73, 101 (a sketch follows below).
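A minimal sketch of separate chaining (using Python lists in place of linked lists; the class name is illustrative):

class ChainedHashTable:
    def __init__(self, size):
        self.buckets = [[] for _ in range(size)]   # each slot holds a chain

    def insert(self, key):
        self.buckets[key % len(self.buckets)].append(key)

    def search(self, key):
        return key in self.buckets[key % len(self.buckets)]

t = ChainedHashTable(7)
for k in [50, 700, 76, 85, 92, 73, 101]:
    t.insert(k)
print(t.buckets)    # [[700], [50, 85, 92], [], [73, 101], [], [], [76]]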

Advantages:
1) Simple to implement.
2) Hash table never fills up, we can always add more elements to chain.
3) Less sensitive to the hash function or load factors.
4) It is mostly used when it is unknown how many and how frequently keys may be inserted or deleted.

Disadvantages:
1) Cache performance of chaining is not good as keys are stored using linked list. Open addressing provides
better cache performance as everything is stored in same table.
2) Wastage of Space (Some Parts of hash table are never used)
3) If the chain becomes long, then search time can become O(n) in worst case.
4) Uses extra space for links.

Performance of Chaining:
Performance of hashing can be evaluated under the assumption that each key is equally likely to be hashed to
any slot of table (simple uniform hashing).

m = Number of slots in hash table

n = Number of keys to be inserted in hash table
Load factor α = n/m
Expected time to search = O(1 + α)
Expected time to insert/delete = O(1 + α)
Time complexity of search, insert and delete is O(1) if α is O(1), i.e. if the number of slots grows in proportion to the number of keys.

Comparison of the above three: Linear probing has the best cache performance but suffers from clustering. Another advantage of linear probing is that it is easy to compute. Quadratic probing lies between the two in terms of cache performance and clustering. Double hashing has poor cache performance but no clustering, and it requires more computation time as two hash functions need to be computed.
Comparison of Separate Chaining and Open Addressing:
1. Separate Chaining: Simpler to implement.
   Open Addressing: Requires more computation.
2. Separate Chaining: The hash table never fills up; we can always add more elements to a chain.
   Open Addressing: The table may become full.
3. Separate Chaining: Less sensitive to the hash function or load factor.
   Open Addressing: Requires extra care to avoid clustering and a high load factor.
4. Separate Chaining: Mostly used when it is unknown how many and how frequently keys may be inserted or deleted.
   Open Addressing: Used when the frequency and number of keys are known.
5. Separate Chaining: Cache performance is not good, as keys are stored in linked lists.
   Open Addressing: Provides better cache performance, as everything is stored in the same table.
6. Separate Chaining: Wastage of space (some parts of the hash table are never used).
   Open Addressing: A slot can be used even if no input maps to it.
7. Separate Chaining: Uses extra space for links.
   Open Addressing: No links.

Coalesced hashing:
Coalesced hashing is a collision resolution technique for a fixed-size table. It is a combination of both separate chaining and open addressing. It uses the concept of open addressing (linear probing) to find the first empty place for a colliding element, searching from the bottom of the hash table, and the concept of separate chaining to link the colliding elements to each other through pointers. The hash function used is h(key) = key % n, where n is the table size (equal to the total number of keys in the example below). Inside the hash table, each node has three fields:
 h(key): The value of the hash function for a key.
 Data: The key itself.
 Next: The link to the next colliding element.

The basic operations of Coalesced hashing are:


1. INSERT(key): The insert operation places the key at the index given by its hash value if that slot is empty; otherwise the key is inserted in the first empty place from the bottom of the hash table, and the address of this empty place is stored in the NEXT field of the previous node of the chain (explained in the example below).
2. DELETE(key): The key, if present, is deleted. If the node to be deleted contains the address of another node in the hash table, then this address is copied into the NEXT field of the node pointing to the node being deleted.
3. SEARCH(key): Returns True if the key is present, otherwise returns False.

The best-case complexity of all these operations is O(1) and the worst-case complexity is O(n), where n is the total number of keys. Coalesced hashing can be more space-efficient than separate chaining because it stores the colliding elements within the hash table itself instead of creating new linked-list nodes as in separate chaining.
Example:
n = 10
Input : {20, 35, 16, 40, 45, 25, 32, 37, 22, 55}

Hash function
h(key) = key%10

Steps:
1. Initially an empty hash table is created, with all NEXT fields initialized to NULL and h(key) values ranging from 0 to 9.
HASH VALUE DATA NEXT
0 – NULL
1 – NULL
2 – NULL
3 – NULL
4 – NULL
5 – NULL
6 – NULL
7 – NULL
8 – NULL
9 – NULL

2. Let's start by inserting 20: h(20) = 0 and index 0 is empty, so we insert 20 at index 0.
HASH VALUE DATA NEXT
0 20 NULL
1 – NULL
2 – NULL
3 – NULL
4 – NULL
5 – NULL
6 – NULL
7 – NULL
8 – NULL
9 – NULL

3. The next element to be inserted is 35: h(35) = 5 and index 5 is empty, so we insert 35 there.
HASH VALUE DATA NEXT
0 20 NULL
1 – NULL
2 – NULL
3 – NULL
4 – NULL
5 35 NULL
6 – NULL
7 – NULL
8 – NULL
9 – NULL

4. Next we have 16: h(16) = 6, which is empty, so 16 is inserted at index 6.


HASH VALUE DATA NEXT
0 20 NULL
1 – NULL
2 – NULL
3 – NULL
4 – NULL
5 35 NULL
6 16 NULL
7 – NULL
8 – NULL
9 – NULL

5. Now we have to insert 40: h(40) = 0, which is already occupied, so we search for the first empty slot from the bottom and insert it there, i.e. at index 9. The address of this newly inserted node (by address we mean the index value of a node), i.e. 9, is stored in the NEXT field of the node at index 0.
HASH VALUE DATA NEXT
0 20 9
1 – NULL
2 – NULL
3 – NULL
4 – NULL
5 35 NULL
6 16 NULL
7 – NULL
8 – NULL
9 40 NULL

6. To insert 45: h(45) = 5 is occupied, so again we search for the first empty slot from the bottom, i.e. index 8, and store the address of this newly inserted node, i.e. 8, in the NEXT field of the node at index 5 (the node holding key = 35).
HASH VALUE DATA NEXT
0 20 9
1 – NULL
2 – NULL
3 – NULL
4 – NULL
5 35 8
6 16 NULL
7 – NULL
8 45 NULL
9 40 NULL

7. Next we insert 25: h(25) = 5 is occupied, so we search for the first empty slot from the bottom, i.e. index 7, and insert 25 there. It is important to note that the address of this new node cannot be stored at the index-5 node, which is already pointing to another node. To store the address of the new node we have to follow the link chain from the index-5 node until we reach NULL in a NEXT field, and store the address of the new node there; from the index-5 node we go to the index-8 node, which contains NULL in its NEXT field, so we store the address of the new node, i.e. 7, in the NEXT field of the index-8 node.
HASH VALUE DATA NEXT
0 20 9
1 – NULL
2 – NULL
3 – NULL
4 – NULL
5 35 8
6 16 NULL
7 25 NULL
8 45 7
9 40 NULL

8. To insert 32: h(32) = 2, which is empty, so insert 32 at index 2.


HASH VALUE DATA NEXT
0 20 9
1 – NULL
2 32 NULL
3 – NULL
4 – NULL
5 35 8
6 16 NULL
7 25 NULL
8 45 7
9 40 NULL

9. To insert 37: h(37) = 7 is occupied, so search for the first free slot from the bottom, which is index 4. Insert 37 at index 4 and copy the address of this node into the NEXT field of the node at index 7.
HASH VALUE DATA NEXT
0 20 9
1 – NULL
2 32 NULL
3 – NULL
4 37 NULL
5 35 8
6 16 NULL
7 25 4
8 45 7
9 40 NULL

10. To insert 22: h(22) = 2 is occupied, so insert it at index 3 and store the address of this node in the NEXT field of the node at index 2.

HASH VALUE DATA NEXT


0 20 9
1 – NULL
2 32 3
3 22 NULL
4 37 NULL
5 35 8
6 16 NULL
7 25 4
8 45 7
9 40 NULL

11. Finally, to insert 55: h(55) = 5 is occupied and the only empty slot is index 1, so insert 55 there. Again, to store the address of this new node we follow the chain starting from the index-5 node until we reach NULL in a NEXT field, i.e. index 5 -> index 8 -> index 7 -> index 4, which contains NULL in its NEXT field, and we store the address of the newly inserted node in the NEXT field of the node at index 4 (a sketch reproducing the final table follows below).
HASH VALUE DATA NEXT
0 20 9
1 55 NULL
2 32 3
3 22 NULL
4 37 1
5 35 8
6 16 NULL
7 25 4
8 45 7
9 40 NULL
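The walkthrough above can be reproduced with a short sketch (class and field names are illustrative; NEXT is kept as a parallel array):

class CoalescedHashTable:
    def __init__(self, size):
        self.size = size
        self.data = [None] * size      # the stored keys
        self.next = [None] * size      # index of the next colliding key, if any

    def insert(self, key):
        idx = key % self.size                      # h(key) = key % n
        if self.data[idx] is None:                 # home slot is free
            self.data[idx] = key
            return
        while self.next[idx] is not None:          # walk to the end of the chain
            idx = self.next[idx]
        free = self.size - 1                       # first empty slot from the bottom
        while self.data[free] is not None:         # assumes at least one free slot remains
            free -= 1
        self.data[free] = key
        self.next[idx] = free                      # link it to the end of the chain

t = CoalescedHashTable(10)
for k in [20, 35, 16, 40, 45, 25, 32, 37, 22, 55]:
    t.insert(k)
print(t.data)   # [20, 55, 32, 22, 37, 35, 16, 25, 45, 40]
print(t.next)   # [9, None, 3, None, 1, 8, None, 4, 7, None]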

The deletion process is simple. For example:

Case 1: To delete key = 37, first search for 37. If it is present, delete the data value. If the deleted node contains an address in its NEXT field and the node to be deleted (37) is itself pointed to by some other node (here key = 25), then copy the address stored in the NEXT field of 37 into the NEXT field of the node pointing to 37 (key = 25), reset the NEXT field of key = 37 to NULL, and erase key = 37.

HASH VALUE DATA NEXT
0 20 9
1 55 NULL
2 32 3
3 22 NULL
4 – NULL
5 35 8
6 16 NULL
7 25 1
8 45 7
9 40 NULL

Case 2: If the key to be deleted is 35, which is not pointed to by any other node, then we have to pull the chain attached to the node being deleted (35) one step back and set the NEXT field of the last node of the chain back to NULL.
HASH VALUE DATA NEXT
0 20 9
1 – NULL
2 32 3
3 22 NULL
4 – NULL
5 45 8
6 16 NULL
7 55 NULL
8 25 7
9 40 NULL

Load Factor and Rehashing


How hashing works:
For the insertion of a key (K) – value (V) pair into a hash map, the following steps are required:
1. K is converted into a small integer (called its hash code) using a hash function.
2. The hash code is used to find an index (hashCode % arrSize), and the entire linked list at that index (separate chaining) is first searched for the presence of K.
3. If K is found, its value is updated; if not, the K-V pair is stored as a new node in the list.

Complexity and Load Factor


 For the first step, the time taken depends on K and the hash function. For example, if the key is the string "abcd", then its hash function may depend on the length of the string. But for very large values of n (the number of entries in the map), the length of the keys is almost negligible in comparison to n, so hash computation can be considered to take place in constant time, i.e. O(1).
 For the second step, traversal of the list of K-V pairs present at that index needs to be done. For this, the
worst case may be that all the n entries are at the same index. So, time complexity would be O(n). But,
enough research has been done to make hash functions uniformly distribute the keys in the array so this
almost never happens.
 So, on average, if there are n entries and b is the size of the array, there would be n/b entries at each index. This value n/b is called the load factor and represents the load on our map.
 This load factor needs to be kept low, so that the number of entries at one index remains small and the complexity stays almost constant, i.e. O(1).
Rehashing:
As the name suggests, rehashing means hashing again. Basically, when the load factor increases to more than
its pre-defined value (default value of load factor is 0.75), the complexity increases. So to overcome this, the
size of the array is increased (doubled) and all the values are hashed again and stored in the new double sized
array to maintain a low load factor and low complexity.

Why rehashing?
Rehashing is done because whenever key value pairs are inserted into the map, the load factor increases, which
implies that the time complexity also increases as explained above. This might not give the required time
complexity of O(1).
Hence, rehashing must be done, increasing the size of the bucketArray so as to reduce the load factor and the time complexity.

How Rehashing is done?


Rehashing can be done as follows:
 For each addition of a new entry to the map, check the load factor.
 If it is greater than its pre-defined value (or the default value of 0.75 if not given), then rehash.
 For a rehash, make a new array of double the previous size and make it the new bucketArray.
 Then traverse each element in the old bucketArray and call insert() for each, so as to insert it into the new, larger bucket array (a sketch follows this list).
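A compact sketch of a chained hash map that rehashes when the load factor exceeds 0.75 (class and method names are illustrative):

class RehashingMap:
    LOAD_FACTOR_LIMIT = 0.75

    def __init__(self, capacity=4):
        self.buckets = [[] for _ in range(capacity)]   # separate chaining
        self.count = 0

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for pair in chain:
            if pair[0] == key:              # key already present: update its value
                pair[1] = value
                return
        chain.append([key, value])
        self.count += 1
        if self.count / len(self.buckets) > self.LOAD_FACTOR_LIMIT:
            self._rehash()                  # load factor exceeded: double and reinsert

    def _rehash(self):
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(2 * len(old_buckets))]
        self.count = 0
        for chain in old_buckets:
            for key, value in chain:
                self.put(key, value)        # reinsert every entry into the larger array

m = RehashingMap()
for i in range(10):
    m.put("key%d" % i, i)
print(len(m.buckets))    # the bucket array grew from 4 to 16 as the load factor crossed 0.75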
