
HASHING

What is Hashing
• The searching time of a sequential search technique depends on the
number of comparisons, i.e., up to n comparisons are required for an array A with n
elements
• To increase efficiency, i.e., to reduce the searching time, we
need to avoid unnecessary comparisons
• Hashing is a technique where we compute the location of the
desired record so that it can be retrieved in a single access (without
comparisons)
• A hash table is a data structure
– that uses a hash function to efficiently translate keys
(e.g., person names) into associated values (e.g., their
telephone numbers).
Introduction
• Suppose there is a table of n employee records, where each
employee record is identified by a unique employee
code; the code is the key to the record, which also stores the employee name
• If the key (the employee code) is used as the array
index, then the record can be accessed by the key
directly
– Ideally the hash function should map each possible key to a different slot index
– but this goal is rarely achievable in practice.
– Most hash table designs assume that hash collisions — pairs of different keys with
the same hash values — are normal occurrences, and accommodate them in some
way.
Applications

• Real-time databases
• Organizing files on the hard disk
• Air traffic control
• Packet routing
• Correct delivery of data in computer networks

Hash Tables
• Hashing is used for storing relatively large amounts
of data in a table, the hash table ADT.
• The hash table usually has a fixed size, H-size, which is larger
than the amount of data that we want to store.
• We define the load factor (λ) to be the ratio of the number of stored
items (n) to the size of the hash table: λ = n / H-size.
• The hash function maps an item's key to an index in the range 0 to H-size - 1.
[Figure: a key (item) is fed to the hash function, which maps it to one of the hash-table slots 0, 1, 2, 3, ..., H-1]
Hash Tables
• Hashing is a technique used to perform insertions, deletions, and
searches/finds in constant average time.
• To insert or find an item, we assign a key to the element and use a
function, called the hash function, to determine the element's location
within the table.
• Hash tables are fixed-size arrays of cells containing data or keys
corresponding to data.
• For each key, we use the hash function to map the key to some number in
the range 0 to H-size-1.

Unfortunately, such a function H may not yield different values (indices); it is
possible that two different keys k1 and k2 will yield the same hash address.
This situation is called a hash collision, and is discussed later.
Hash Function
• The basic idea of a hash function is the
transformation of the key into the corresponding
location in the hash table
• A hash function H can be defined as a function
that takes a key as input and transforms it into a
hash-table index



• The following are the most popular hash functions:

1. Division method
2. Mid Square method
3. Folding method



Division Method
• TABLE is an array or database file where the employee details
are stored
• Choose a number m that is larger than the number of keys
k, i.e., m is greater than the total number of records in the
TABLE
• Ideally, m is chosen to be a prime number
to minimize collisions
• The hash function H is defined by
H(k) = k mod m   // k % m in code
• Here H(k) is the hash address and k mod m is the
remainder when k is divided by m



For Example
• Let a company have 90 employees, and let 00, 01, 02, ..., 90 be
the two-digit memory addresses (indices, or hash addresses)
available to store the records
• We use the employee code as the key
• Choose m in such a way that it is greater than 90
• Suppose m = 91. Then for the following employee codes (keys k):
H(k) = H(2103) = 2103 mod 91 = 10
H(k) = H(6147) = 6147 mod 91 = 50
H(k) = H(3750) = 3750 mod 91 = 19

index : key
  00  :
  10  : 2103
  19  : 3750
  50  : 6147
  ... :
  90  :

• So if we feed the employee code to the hash function, we
can retrieve TABLE[H(k)] directly
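
A minimal Python sketch of the division method (the function name division_hash is illustrative; the divisor is the m = 91 used above):

# A minimal sketch of the division method.
def division_hash(key, m=91):
    """Map a numeric key to a hash address in 0..m-1 using H(k) = k mod m."""
    return key % m

# Reproduces the employee-code example above (m = 91 > 90 records).
for code in (2103, 6147, 3750):
    print(code, "->", division_hash(code))   # 2103 -> 10, 6147 -> 50, 3750 -> 19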



Properties of Hash Function
A good hash function should:

• Minimize collisions
• Be easy and quick to compute
• Distribute key values evenly in the hash table
• Use all the information provided in the key



Designing a Good Hash Function
• If the divisor is even and there are more even than
odd key values, the hash function will produce an
excess of even hash values; likewise, an excess of odd
key values produces an excess of odd hash values.
• However, if the divisor is odd, then either kind of
excess among the key values still gives a balanced
distribution of odd/even results.
• Thus, the divisor should be odd. But this is not
enough.
Designing a Good Hash Function (cont'd)
• If the divisor itself is divisible by a small odd
number (like 3, 5, or 7), the results are
unbalanced again. Ideally, the divisor should be a
prime number. If no such prime number
works for our table size (the divisor), we
should use an odd number
with no small factors.
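
The effect of the divisor's parity can be checked with a short experiment. In the Python sketch below (the key list and the helper even_share are illustrative), a set of mostly even keys is hashed with an even divisor and with a nearby odd divisor, and the share of even hash addresses is compared:

# Illustrative check: an even divisor preserves the parity of the keys,
# while an odd (prime) divisor spreads them over even and odd addresses.
keys = [2 * i for i in range(100)] + [3, 17, 41]   # mostly even keys

def even_share(divisor):
    hashes = [k % divisor for k in keys]
    return sum(1 for h in hashes if h % 2 == 0) / len(hashes)

print(even_share(10))   # ~0.97: almost every address is even
print(even_share(11))   # ~0.55: roughly balanced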
Mid-Square Method
• In this method, the key is squared and the address is selected
from the middle of the squared number
• The hash function H is defined by:
H(k) = l
where l is obtained by selecting digits from the middle of k²,
the number of digits depending on the size of the hash table
• The most obvious limitation of this method is the size of the
key
• Given a key of 6 digits, the square will have up to 12 digits, which may
be beyond the maximum integer size of many computers
• The same number of digits must be selected from the square for all of the keys
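
A minimal Python sketch of the mid-square idea (the function name and the sample key 3205 are illustrative, not values from the slides):

# Square the key and take the middle digits as the hash address.
def mid_square_hash(key, address_digits=2):
    squared = str(key * key)
    mid = len(squared) // 2
    start = max(mid - address_digits // 2, 0)
    return int(squared[start:start + address_digits])

# Example: 3205 * 3205 = 10272025; the middle two digits give 72.
print(mid_square_hash(3205))   # 72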



For Example
• Consider the following keys in the table and their hash indices:
[Figure: Hash Table with Mid-Square Division]


Folding Method
• In this method, the key k is partitioned into a number of
parts k1, k2, ..., kr
• The parts have the same number of digits as the required
hash address, except possibly for the last part
• The parts are then added together, ignoring the final carry:
H(k) = k1 + k2 + ... + kr
• Here we are dealing with a hash table with indices from
00 to 99, i.e., a two-digit hash address
• So we divide k into parts of two digits
• For example, k = 7148 is split into 71 and 48, so
H(7148) = 71 + 48 = 119; ignoring the leading carry gives H(7148) = 19
For improvement
• Extra milling can also be applied: the even-numbered parts
k2, k4, ... are each reversed before the addition
• For k = 7148, the parts are 71 and 48; reversing the second
part gives 84
• H(7148) = 71 + 84 = 155; here we eliminate the
leading carry (i.e., 1), so H(7148) = 55
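
The folding method, including the reversal ("extra milling") variant, can be sketched in Python as follows (the function name and parameters are illustrative; the calls reproduce the H(7148) examples above):

# Split the key into two-digit parts, add them, and drop the leading carry.
def folding_hash(key, part_digits=2, reverse_even_parts=False):
    digits = str(key)
    parts = [digits[i:i + part_digits] for i in range(0, len(digits), part_digits)]
    if reverse_even_parts:                          # the "extra milling" variant
        parts = [p if i % 2 == 0 else p[::-1] for i, p in enumerate(parts)]
    total = sum(int(p) for p in parts)
    return total % 10 ** part_digits                # ignore the leading carry

print(folding_hash(7148))                           # 71 + 48 = 119 -> 19
print(folding_hash(7148, reverse_even_parts=True))  # 71 + 84 = 155 -> 55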
Hash Collision
• It is possible that two non-identical keys K1, K2
are hashed into the same hash address
• This situation is called Hash Collision
• Let us consider a hash table having 10 locations, as
shown in the figure



• The division method is used to hash the key
• H(k) = k mod m
• Here m is chosen as 10
• The hash function produces an integer between 0
and 9, depending on the value of the key
• If we want to insert a new record with key 500, then
• H(500) = 500 mod 10 = 0
• Location 0 in the table is already filled (i.e., not
empty)
• Thus a collision occurs



Collision handling
• Collisions are almost impossible to avoid, but they can
be minimized considerably by introducing any one of
the following techniques:
1. Open addressing (closed hashing)
   a. Linear probing
   b. Quadratic probing
   c. Double hashing
2. Separate chaining (open hashing)
3. Bucket addressing
Closed Hashing
If a collision occurs, we try to find an alternative cell within the table.
Closed hashing is also known as open addressing.
For insertion, we try cells in sequence using an increment function:
hi(x) = (hash(x) + f(i)) mod H-size, with f(0) = 0
The function f is the collision-resolution strategy.
The table must be bigger than the number of data items.
Different choices of the function f give:
Linear probing
Quadratic probing
Double hashing
Linear probing
Use a linear function f(i) = i.
The key is placed in the first free position found, close to its home position.
It is the least complex resolution function.
It may result in primary clustering:
elements that hash to different locations end up probing the same
alternative cells.
The complexity of this probing depends on the
value of λ (the load factor).
We do not use this probing if λ > 0.5.
Linear Probing Example

• h(k) = k mod 13
• Insert keys: 18, 41, 22, 44, 59, 32, 31, 72

key:       18  41  22  44  59  32  31  72
k mod 13:   5   2   9   5   7   6   5   7

Resulting table (after linear probing):
index:   0    1    2    3    4    5    6    7    8    9   10   11   12
value:   -    -   41    -    -   18   44   59   32   22   31   72    -
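
A minimal Python sketch of insertion with linear probing (the helper name is illustrative); running it on the keys above reproduces the table just shown:

# Probe slots home, home+1, home+2, ... until a free cell is found.
def linear_probe_insert(table, key):
    size = len(table)
    home = key % size
    for i in range(size):
        slot = (home + i) % size        # h_i(key) = (hash(key) + i) mod size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")

table = [None] * 13
for k in (18, 41, 22, 44, 59, 32, 31, 72):
    linear_probe_insert(table, k)
print(table)
# [None, None, 41, None, None, 18, 44, 59, 32, 22, 31, 72, None]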
Clustering
• Sometimes data will cluster: this is caused
when many elements hash to the same (or
nearby) locations and linear probing is used.
We can reduce this problem by choosing the
divisor in our hash
function carefully and by carefully choosing our table
size.
Problems of Linear Probing
• The majority of the problems are caused by
clustering. These problems can be helped by
using Quadratic probing instead.
Quadratic probing
• Eliminate primary clustering by selecting f(i) = i²
• Problems arise with a hash table that is more
than half full.
• You have to select an appropriate table size, one that is
not the square of a number.
• We can prove that quadratic probing with a prime
table size that is at least half empty will always find a
location for an element.
• Elements that hash to the same location will probe the
same alternative cells (secondary clustering).
Quadratic Probing
• Works like linear probing, but instead of
looking at the next available position, the next
location is chosen by looking at the positions
that are 1², 2², 3², etc. positions ahead.
Quadratic Probing
• Consider the data with keys 24, 42, 34, 62, 73 inserted
into a table of size 10. These entries are placed
into the table at the following locations:

key:      24  42  34  62  73
H(key):    4   2   4   2   3

index:   0    1    2    3    4    5    6    7    8    9
value:   -    -   42   62   24   34    -   73    -    -
Quadratic Probing
• 24 % 10 = 4. Position is free. 24 is placed into slot 4.
• 42 % 10 = 2. Position is free. 42 is placed into slot 2.
• 34 % 10 = 4. Position is occupied. Try the place 1² away in the
table (5). 34 is placed into slot 5.
• 62 % 10 = 2. Position is occupied. Try the place 1² away in the
table (3). 62 is placed into slot 3.
• 73 % 10 = 3. Position is occupied. Try the place 1² away in the
table (4). Same problem. Try the place 2² away in the table (7). 73
is placed into slot 7.
– Thus, we jumped over the existing cluster.
• This doesn't completely solve our problem, but it helps.
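
The same trace can be reproduced with a short quadratic-probing sketch in Python (the helper name is illustrative); it probes the positions 1², 2², 3², ... ahead of the home slot:

# Probe slots home, home+1, home+4, home+9, ... until a free cell is found.
def quadratic_probe_insert(table, key):
    size = len(table)
    home = key % size
    for i in range(size):
        slot = (home + i * i) % size    # h_i(key) = (hash(key) + i*i) mod size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found")

table = [None] * 10
for k in (24, 42, 34, 62, 73):
    quadratic_probe_insert(table, k)
print(table)
# [None, None, 42, 62, 24, 34, None, 73, None, None]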
Double Hashing
• Use two hash functions h(key) and hp(key)
• hi(key) = [h(key) + i*hp(key)] mod table size

Performance of Double hashing:


– Much better than linear or quadratic probing because it eliminates both primary and
secondary clustering.
– BUT requires a computation of a second hash function hp.
Example: Load the keys 18, 26, 35, 9, 64, 47, 96, 36, and 70 in this order, in an empty hash table of
size 13

(a) using double hashing with the first hash function: h(key) = key % 13 and the second hash
function: hp(key) = 1 + key % 12
(b) using double hashing with the first hash function: h(key) = key % 13 and the second hash
function: hp(key) = 7 - key % 7
Show all computations.
Double Hashing (cont’d)

hi(key) = [h(key) + i*hp(key)] % 13
h(key) = key % 13
hp(key) = 1 + key % 12

h0(18) = 18 % 13 = 5
h0(26) = 26 % 13 = 0
h0(35) = 35 % 13 = 9
h0(9)  =  9 % 13 = 9   collision
  hp(9) = 1 + 9 % 12 = 10
  h1(9) = (9 + 1*10) % 13 = 6
h0(64) = 64 % 13 = 12
h0(47) = 47 % 13 = 8
h0(96) = 96 % 13 = 5   collision
  hp(96) = 1 + 96 % 12 = 1
  h1(96) = (5 + 1*1) % 13 = 6   collision
  h2(96) = (5 + 2*1) % 13 = 7
h0(36) = 36 % 13 = 10
h0(70) = 70 % 13 = 5   collision
  hp(70) = 1 + 70 % 12 = 11
  h1(70) = (5 + 1*11) % 13 = 3
Double Hashing (cont'd)

hi(key) = [h(key) + i*hp(key)] % 13
h(key) = key % 13
hp(key) = 7 - key % 7

h0(18) = 18 % 13 = 5
h0(26) = 26 % 13 = 0
h0(35) = 35 % 13 = 9
h0(9)  =  9 % 13 = 9   collision
  hp(9) = 7 - 9 % 7 = 5
  h1(9) = (9 + 1*5) % 13 = 1
h0(64) = 64 % 13 = 12
h0(47) = 47 % 13 = 8
h0(96) = 96 % 13 = 5   collision
  hp(96) = 7 - 96 % 7 = 2
  h1(96) = (5 + 1*2) % 13 = 7
h0(36) = 36 % 13 = 10
h0(70) = 70 % 13 = 5   collision
  hp(70) = 7 - 70 % 7 = 7
  h1(70) = (5 + 1*7) % 13 = 12   collision
  h2(70) = (5 + 2*7) % 13 = 6
• Another possible pair of hash functions: H1(k) = k mod 7, H2(k) = 5 - (k mod 5)
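
A minimal Python sketch of double hashing with the part (a) functions, h(key) = key % 13 and hp(key) = 1 + key % 12 (the helper name is illustrative); it reproduces the final table implied by the computations above:

# Step through the table in jumps of hp(key) until a free cell is found.
def double_hash_insert(table, key):
    size = len(table)
    home = key % size                    # h(key)
    step = 1 + key % 12                  # hp(key)
    for i in range(size):
        slot = (home + i * step) % size  # h_i(key) = (h(key) + i*hp(key)) mod size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found")

table = [None] * 13
for k in (18, 26, 35, 9, 64, 47, 96, 36, 70):
    double_hash_insert(table, k)
print(table)
# [26, None, None, 70, None, 18, 9, 96, 47, 35, 36, None, 64]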
Open Hashing
The collision problem is solved by inserting all elements that hash to the
same bucket into a single collection of values.
Open hashing:
Keep a linked list of all the elements that hash to the same cell
(separate chaining).
Each cell in the hash table contains a pointer to a linked list containing the
data.
Functions and analysis of open hashing:
Inserting a new element into the table: we add the element at the beginning
or the end of the appropriate linked list,
depending on whether we want to check for duplicates,
and on how frequently we expect to access the most recently added
elements.
Separate Chaining

Open Hashing
For a search, we use the hash function to determine
which linked list holds the element, and then traverse
that linked list to find the element.
For a deletion, we remove the element from the appropriate
linked list after finding it.
We could use other structures, such as a tree or another
hash table, in each cell of the hash table to resolve
collisions.
The main advantage of this method is that it
can handle any amount of data (dynamic expansion).
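
A minimal separate-chaining table can be sketched in Python as follows (class and method names are illustrative, not from the slides); each cell holds a chain, here a Python list, of key-value pairs:

# Each bucket is a list of (key, value) pairs; collisions simply extend the chain.
class ChainedHashTable:
    def __init__(self, size=13):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):   # replace the value if the key exists
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))           # otherwise append to the chain

    def find(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self._index(key)
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]

t = ChainedHashTable()
t.insert("alice", "555-0100")
t.insert("bob", "555-0101")
print(t.find("alice"))   # 555-0100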
Perfect Hashing
• If all of the keys that will be used are known ahead of time,
and there are no more keys than the hash table can hold, a
perfect hash function can be used to create a perfect hash
table, in which there will be no collisions.
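
When the key set is fixed and known in advance, even a simple search for a collision-free divisor produces a perfect hash. The Python sketch below is purely illustrative (the keys and names are assumptions, not from the slides); it looks for a table size m such that k mod m is distinct for every key:

# Find a divisor m for which every known key gets its own slot.
KEYS = (10, 23, 51, 77)

def make_perfect_hash(keys):
    for m in range(len(keys), 10 * len(keys)):
        if len({k % m for k in keys}) == len(keys):   # no two keys collide
            return (lambda k: k % m), m
    raise ValueError("no collision-free divisor found in the searched range")

h, m = make_perfect_hash(KEYS)
print(m, [h(k) for k in KEYS])   # each key maps to a distinct slot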
Summary
Hash table: an array
Hash function: a function that maps a key to a
number in the range [0, size of hash table)
Collision resolution
Open hashing
Separate chaining
Closed hashing (Open addressing)
Linear probing
Quadratic probing
Double hashing
Summary
• Advantage
– Constant average running time, plus the time to resolve collisions
• Disadvantages
– Difficult (not efficient) to print all elements in the hash
table
– Inefficient to find the minimum or maximum
element
– Not growable (for closed hashing/open addressing)
– Wastes some space
Conclusions

Hashing is a search method, used when
– sorting is not needed
– access time is the primary concern
Conclusions(cont’d)

To choose a good hash function is a "black art".
The choice depends on the nature of the keys and the
distribution of the numbers corresponding to the keys.
Conclusions(cont’d)
Best course of action:
• Separate chaining: if the number of records is
not known in advance
• Open addressing: if the number of records
can be predicted and there is enough memory
available
