Hashing: Fundamentals, Solving Search and Insert Problem Using Hashing, Deletion From Hash Table, Collision Resolution

Hashing

Fundamentals, solving search and insert problem using hashing,
deletion from hash table, collision resolution
Fundamentals of Hashing
• Hashing is a search technique used in cases where there are a large
number of entries stored in a table
• Searching in long tables is a commonly encountered problem and
therefore efficient methods of searching are desirable
• Hashing is a search technique based on key values
• The main idea is to take as input a particular key value and return the
location in the table where the element is present, if the element
exists
Concept of Hash Function and Collision
• A function is used to map the key value to a location in the table, this
function is called ‘Hash Function’
• It is entirely possible that the hash function returns same table
location for two different elements, that is a ‘collision’ occurs
• We shall see various methods for resolving collision
• Collision resolution is an important part of Hashing
• The benefits obtained by use of Hashing depend on two factors:
• The efficiency of the Hash function
• The efficiency of Collision resolution technique used
The Search and Insert Problem
• The problem can be stated as follows:
Given a list of items (the list may be empty initially), search for a given
item in the list. If the item is not found, insert it in the list
• In solving this problem, the major design issue is searching
• This is because insertion can be done in constant time; however,
search time keeps increasing as the size of the list grows
• You have already seen some of the possible solutions to this problem;
for the sake of revision, we shall first list them
Solving the Search and Insert Problem
• One solution is to store the items sequentially, one after the other, as they
come up. Here, searching takes linear time (O(n)) and the complete list
must be checked before a new item may be inserted. The advantage of
this method lies in its simplicity
• Storage using a linked list has similar disadvantages
• Another solution is to maintain a sorted list. Here, search time reduces to
logarithmic order (O(logn)) but additional time is required for adding each
element in such a manner that the resulting list is always sorted
• The list can also be stored as a binary search tree; in this case, insertion and
searching can both be done efficiently if the tree does not get too
unbalanced
• Another possible solution is Hashing, which we shall study in detail now
Hashing
• Suppose we have a list of capacity n
• Thus, the index numbers will range from 0 to n-1
• A hash function on this list would be a function which can translate a
key value to one of these index numbers
• There can be several possible functions that can perform this
• We choose a hash function whose outputs cover all values
between 0 and n-1
• A hash function that maps all the keys to only some index values will
unnecessarily cause high collisions and reduce search performance
Choosing Hash Function
• A possible hash function could be:
key % 10 (this will give values between 0 and 9)

This may not be a good choice where the table size is larger than 10, since only the first 10 locations would ever be used

• In general, for a range of n values, the following hash function turns out to be a reasonable
choice:
key % n
• Thus, for n = 10 and key = 23, we will get index = 3 i.e., 23 will be placed at index 3

In hashing, we want the keys to be scattered all over the table. If the keys are
hashed to only one area in the table, we can end up with an unnecessarily high number of
collisions. Such possibilities should be avoided
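The division method described above can be sketched in a few lines. This is only an illustration: the function name `division_hash` and the sample keys other than 23 are assumptions, not part of the slides.

```python
def division_hash(key: int, n: int) -> int:
    """Map a non-negative integer key to an index in the range 0..n-1."""
    return key % n


# The example from the slide: n = 10 and key = 23 give index 3.
print(division_hash(23, 10))

# A few more (hypothetical) keys, to see how they scatter over the table.
sample_keys = [23, 47, 105, 8, 90]
print([division_hash(k, 10) for k in sample_keys])
```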
Mid-Square Method
• The key is multiplied by itself and the address is obtained by selecting
an appropriate number of bits or digits from the middle of the square
• Usually, the number of bits chosen depends on the size of the hash
table
• The same position in the square must be used for all values

123456 * 123456 = 15241383936
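A minimal sketch of the mid-square method follows. The slides do not say which digits are taken, so the choices here are assumptions: the square is zero-padded to a fixed width (the hypothetical parameter `square_width`) so that the same position is used for every key, and the digits are taken from the centre of that padded string.

```python
def mid_square_hash(key: int, digits: int, square_width: int) -> int:
    """Square the key and return `digits` digits from the middle of the square.

    The square is left-padded with zeros to `square_width` characters so the
    selected position is the same for all keys (an assumption of this sketch).
    """
    squared = str(key * key).zfill(square_width)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


print(123456 * 123456)                     # 15241383936, as on the slide
print(mid_square_hash(123456, 3, 12))      # middle 3 digits of 015241383936
```

With these assumed settings the key 123456 yields the address 413. A different choice of width or position gives a different, equally valid, mid-square function.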


Folding Method
• In this method, the key is partitioned into a number of parts, each of
which has the same length as the required address with the possible
exception of the last part
• These parts are then added together, ignoring the final carry to form
the address
• Example: for the key 3569427812, all the parts are added, excluding the carry, to
generate a 3-digit address
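The folding example can be worked through as below. The slide does not show how the key is split, so this sketch assumes the parts are cut from the left in groups of three digits (356, 942, 781, 2), with the last part shorter; `folding_hash` is a hypothetical name.

```python
def folding_hash(key: int, part_len: int) -> int:
    """Split the key into parts of `part_len` digits, add them, drop the carry."""
    text = str(key)
    parts = [int(text[i:i + part_len]) for i in range(0, len(text), part_len)]
    # Dropping the final carry is the same as keeping the last `part_len` digits.
    return sum(parts) % (10 ** part_len)


# 356 + 942 + 781 + 2 = 2081; ignoring the carry leaves 081.
print(folding_hash(3569427812, 3))
```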
Digit Analysis
• This method forms address by selecting and shifting digits or bits of
the original key
• For a given key set, the same position in the keys and same
arrangement pattern must be used consistently
• Let the address space be 4 digits long and let the address be formed by reversing the
selected digits. 4 digits are selected starting from the 2nd position
• Then, the address can be obtained as follows:
7546123 → 5461 → 1645
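The digit-analysis example above can be reproduced with a short sketch. The function name and parameters are assumptions; positions are counted from 1, as on the slide.

```python
def digit_analysis_hash(key: int, start_pos: int, count: int) -> int:
    """Select `count` digits starting at 1-based position `start_pos`, then reverse them."""
    text = str(key)
    selected = text[start_pos - 1:start_pos - 1 + count]
    return int(selected[::-1])


# 7546123 -> digits 2..5 are "5461" -> reversed "1645"
print(digit_analysis_hash(7546123, 2, 4))
```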
Length Dependent Method
• In this method, the length of the key is used along with some portion
of the key to produce either a table address directly or more
commonly an intermediate key which is used with the division
method to produce the final table address
• Example: for the key 7546123, size = 7
• Intermediate key: r = 7 x (some portion of key), e.g. r = 7 x 754
• Perform (r mod m) for some m to get the final address
(division method to find final address)
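A sketch of the length-dependent method, following the slide's example. Taking the first three digits as the "portion of the key" mirrors the 7 x 754 example; the modulus m = 97 is purely an assumed value for illustration.

```python
def length_dependent_hash(key: int, portion_digits: int, m: int) -> int:
    """Multiply the key length by a leading portion of the key, then apply division."""
    text = str(key)
    portion = int(text[:portion_digits])   # e.g. 754 for key 7546123
    r = len(text) * portion                # intermediate key, e.g. 7 * 754 = 5278
    return r % m                           # division method gives the final address


print(length_dependent_hash(7546123, 3, 97))
```

With the assumed m = 97, the intermediate key 5278 maps to address 40.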
Algebraic Coding
• This is a cluster separating hash function based on the algebraic
coding theory
• It was originally proposed for implementation in hardware rather
than software
• An r-bit key in this method is considered to be a polynomial
$k(x) = \sum_{i=1}^{r} k_i x^{i-1}$, for i = 1, 2, …, r
• Above polynomial is used as an intermediate for multiplicative
hashing
Handling Collision
• Collision might happen when a key maps to an index where an element already
exists
• The easiest way is to put the element in the next location
• If this location is also filled, we keep moving to the next locations until a free location
is found
• In such cases, these items must be searched linearly during a search
• So total time becomes – hashing (constant time) + linear search over items with
collision
• During deletion, a special key may be used to mark the vacated location as free,
so that it can be filled later
• In practice, we never allow the hash table to become completely full
• In general, the hash technique works better when there are more free locations
in the table
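The behaviour described above (probe the next location on collision, mark deleted locations with a special key) can be sketched as follows. The class and marker names are assumptions made for this illustration, and the table stores integer keys only.

```python
EMPTY = object()     # location never used
DELETED = object()   # special marker: location was freed by a deletion


class LinearProbingTable:
    def __init__(self, n: int):
        self.n = n
        self.slots = [EMPTY] * n

    def _probe(self, key: int):
        """Yield the hashed location, then each next location, wrapping around."""
        loc = key % self.n
        for _ in range(self.n):
            yield loc
            loc = (loc + 1) % self.n

    def search(self, key: int):
        """Return the location of key, or None if it is absent."""
        for loc in self._probe(key):
            if self.slots[loc] is EMPTY:      # a never-used slot ends the search
                return None
            if self.slots[loc] == key:        # DELETED markers are skipped over
                return loc
        return None

    def insert(self, key: int) -> int:
        """Search first; if the key is not found, place it in the first free slot."""
        found = self.search(key)
        if found is not None:
            return found
        for loc in self._probe(key):
            if self.slots[loc] is EMPTY or self.slots[loc] is DELETED:
                self.slots[loc] = key
                return loc
        raise OverflowError("hash table is full")

    def delete(self, key: int) -> bool:
        loc = self.search(key)
        if loc is None:
            return False
        self.slots[loc] = DELETED
        return True


table = LinearProbingTable(10)
print([table.insert(k) for k in (23, 33, 43)])  # all hash to 3, so they land in 3, 4, 5
table.delete(33)
print(table.search(43))   # still found at 5: the search passes over the marker at 4
print(table.insert(53))   # reuses the freed location 4
```

The marker matters for searching as well as for reuse: if location 4 were simply reset to empty, the search for 43 would stop there and wrongly report the key as missing.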
Collision Resolution
• We have seen the way of resolving collision by looking at the next
location in the table
• This method is also known as Linear Probing
• There are other methods also for finding a location to place an
element if a collision occurs
• The three main techniques used for resolving collision are – Linear
Probing, Quadratic Probing and Chaining
Collision Resolution

• Open Addressing: Linear Probing, Quadratic Probing, Random Probing, Rehashing
• Chaining
Linear Probing
• For a given key and table size n, we get the location for insertion as:
loc = key % n
• If this location is not free, we apply linear probing as
loc = (loc + 1) % n
i.e., we go to the next location; if it is not free, we move to the one after it
• Suppose multiple keys hash to a single location; in this case we will
keep on adding values to the next available locations
• In some cases, these elements might form long chains, resulting in
more collisions
• This phenomenon is called clustering.
• This is one of the main drawbacks of linear probing
• It is also possible that chains for two different key values join each
other and form even longer chain, again the possibility of this
happening increases with long chains

Thus, two types of clustering are defined:


• Primary Clustering: occurs when keys that hash to different locations
trace the same sequence in looking for an empty location. Linear
probing exhibits this phenomenon.
• Secondary Clustering: occurs when keys that hash to the same
location trace the same sequence in looking for an empty location.
Linear probing exhibits this phenomenon as well.
• loc + k, where k is a constant might also be another way of doing
linear probing
• In this case, locations that are k distance apart will be searched for if a
collision occurs
• This could be advantageous if there is little chance of finding a free
location immediately next to the collision
• However, it might not visit all possible locations and may therefore result
in even more collisions
• If there are m possible table locations, this problem can be avoided by
choosing k such that m and k are relatively prime. In this case, all
possible locations will be visited
• In general, loc + k where k varies with key gives one of the best ways
to implement hashing
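The effect of the step size k can be checked directly. In this sketch (function name and sample values are assumptions) a table of m = 10 locations is probed from location 3 with two different steps.

```python
from math import gcd


def probe_sequence(start: int, k: int, m: int) -> list[int]:
    """Locations visited by repeatedly adding k (mod m), starting at `start`."""
    return [(start + i * k) % m for i in range(m)]


m = 10
for k in (4, 3):
    visited = set(probe_sequence(3, k, m))
    print(f"k={k}, gcd(m, k)={gcd(m, k)}, distinct locations visited: {len(visited)}")
```

With k = 4, which shares the factor 2 with m = 10, only 5 of the 10 locations are ever visited; with k = 3, which is relatively prime to 10, all 10 are reached.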
Performance of Linear Probing
• It depends upon how long we have to search for finding a free location to
insert
• i.e. the performance may be measured from the average search length
• This in turn depends on the load factor
f = (number of entries in table) / (number of table locations)
• In general, for linear probing to give reasonable performance, we should
keep the table not more than 75% full
• In this way, we can guarantee good performance with a simple algorithm
• In general, the size of the table is kept about 1.3 times the number of keys
that are to be stored
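As a small worked calculation, the load factor and the "about 1.3 times" sizing rule can be written out as below. The helper names are assumptions, and the factor 1.3 is rounded up using integer arithmetic.

```python
def load_factor(entries: int, locations: int) -> float:
    """f = number of entries in table / number of table locations."""
    return entries / locations


def suggested_table_size(num_keys: int) -> int:
    """Roughly 1.3 times the number of keys, rounded up to a whole location."""
    return (num_keys * 13 + 9) // 10


size = suggested_table_size(100)
print(size)                              # 130 locations for 100 keys
print(round(load_factor(100, size), 3))  # load factor when all 100 keys are stored
```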
Quadratic Probing
• Here, in case of a collision, we find the next free location for insertion
according to the following formula:
$loc = loc + (a i + b i^2)$

In the above equation, a and b are constants and i takes on values as below:
• i = 1, if first collision happens at a location
• i = 2, if second happens
• i = 3, if third happens and so on
• Suppose a = 1, b = 1 and a collision occurs at location 7
In the first run i = 1
a*i + b*i^2 = 1 + 1 = 2
loc = 7 + 2 = 9
Suppose a collision occurs at location 9 also; then,
for the second run i = 2
a*i + b*i^2 = 2 + 4 = 6
loc = 9 + 6 = 15 and so on

• If at any point we reach the end of the table in this process, we wrap
around from that point
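The worked example above can be traced with a short sketch that applies the slide's update rule, loc = loc + (a*i + b*i^2), with wrap-around. The table size of 20 is an assumed value, chosen so that the first two probes match the slide before wrapping occurs.

```python
def quadratic_probe_sequence(start: int, n: int, a: int = 1, b: int = 1, steps: int = 3) -> list[int]:
    """Successive locations tried after a collision at `start`, wrapping at n."""
    loc = start
    sequence = []
    for i in range(1, steps + 1):
        loc = (loc + a * i + b * i * i) % n
        sequence.append(loc)
    return sequence


# Collision at 7 with a = b = 1: 7 + 2 = 9, then 9 + 6 = 15, then 15 + 12 = 27 -> wraps to 7
print(quadratic_probe_sequence(7, 20))
```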
Performance of Quadratic Probing
• Here, the keys that map to different locations trace different
sequences; therefore, primary clustering is eliminated
• Secondary clustering still remains
• If n is a power of 2, that is $n = 2^m$ for some m, this method explores
only a small fraction of the locations in the table and is therefore not
very effective
• If n is prime, the method can reach half the locations in the table; this
is usually sufficient for most practical purposes
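The table-size claims can be checked numerically. Note the assumption: this sketch uses the common textbook probe form (start + i^2) mod n rather than the incremental formula given earlier, because that is the form for which the power-of-2 and prime results are usually stated.

```python
def reachable_locations(n: int, start: int = 0) -> set[int]:
    """Locations reached by probing (start + i*i) % n for i = 0..n-1."""
    return {(start + i * i) % n for i in range(n)}


print(len(reachable_locations(16)))  # n = 2^4: only 4 of 16 locations
print(len(reachable_locations(17)))  # n prime: 9 of 17, about half the table
```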
Pseudo-Random Probing
• In this method, a random sequence of positions is generated in place
of an ordered sequence, when a collision occurs
• The random sequence generated must contain every position
between 1 and n exactly once
• The table is full when the first duplicate position is generated
• This method reduces the problem of primary clustering
• Because of the expense of random number generator, this method is
not often used
• The pseudo-random generator takes the ith permutation of the numbers
from 1 to m and uses it as the probe sequence to find the next free
location.
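One way to sketch pseudo-random probing is to shuffle the offsets once with a fixed seed and add them to the hashed location. Everything specific here is an assumption for illustration: the seed value, the use of Python's `random.Random`, and indexing locations from 0 to n-1 to match the earlier slides.

```python
import random


def pseudo_random_probe_sequence(key: int, n: int, seed: int = 42) -> list[int]:
    """Hashed location first, then every other location once, in a fixed shuffled order."""
    offsets = list(range(1, n))
    random.Random(seed).shuffle(offsets)   # same seed -> same permutation every time
    home = key % n
    return [home] + [(home + off) % n for off in offsets]


sequence = pseudo_random_probe_sequence(23, 10)
print(sequence[0])                          # 3, the hashed location
print(sorted(sequence) == list(range(10)))  # True: each location appears exactly once
```

Because the permutation is fixed by the seed, a later search for the same key retraces exactly the sequence used at insertion.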
Double Hashing
• In this method, if a collision occurs, another hash function is used to
decide the next location to search for insertion
• The value generated by the second hash function gives the offset
from the original location
• The value of offset usually depends on the key and therefore reduces
the chances of primary clustering
• When the size of the table is a prime number, double hashing is
seen to perform very well in practice
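A minimal sketch of double hashing follows. The slides do not name a second hash function, so 1 + key % (n - 1) is used here as a common choice and should be read as an assumption; the table size 11 is an assumed prime.

```python
def double_hash_insert(table: list, key: int) -> int:
    """Insert key using a second hash function as the probe offset."""
    n = len(table)
    loc = key % n                 # first hash function: original location
    step = 1 + key % (n - 1)      # second hash function: offset, never zero
    for _ in range(n):
        if table[loc] is None:
            table[loc] = key
            return loc
        loc = (loc + step) % n
    raise OverflowError("hash table is full")


table = [None] * 11
# 5, 16 and 27 all hash to location 5, but their offsets differ (6, 7 and 8).
print([double_hash_insert(table, k) for k in (5, 16, 27)])
```

The three colliding keys end up at locations 5, 1 and 2, so they do not pile up next to each other as they would under linear probing.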
Chaining
• Open Addressing is applicable where the key values themselves are stored
as table entries; it is also known as closed hashing
• Another option could be to store pointers to key values; this results in a
new method for collision resolution known as chaining
• In this method, all items that hash to the same location are held on a linked
list
• Each time an element is not found at its hashed location, the
corresponding linked list is searched in sequential manner
• Chaining is usually implemented in a manner such that each hash table
location contains a pointer to the head of a linked list and each element is
represented by a separate node of that list
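Chaining as described above can be sketched with a table of head pointers and singly linked nodes. The class names are assumptions, and the sketch stores integer keys with the key % n hash function used earlier.

```python
class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class ChainedHashTable:
    def __init__(self, n: int):
        self.n = n
        self.heads = [None] * n   # each location points to the head of a chain

    def search(self, key: int) -> bool:
        node = self.heads[key % self.n]
        while node is not None:   # sequential search along the chain
            if node.key == key:
                return True
            node = node.next
        return False

    def insert(self, key: int) -> None:
        if not self.search(key):
            loc = key % self.n
            self.heads[loc] = Node(key, self.heads[loc])   # add at the head

    def delete(self, key: int) -> bool:
        loc = key % self.n
        prev, node = None, self.heads[loc]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[loc] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False


table = ChainedHashTable(10)
for k in (23, 33, 43):        # all three hash to location 3 and share one chain
    table.insert(k)
print(table.search(33))       # True
print(table.delete(33))       # True
print(table.search(33), table.search(43))   # False True
```

Deletion here is an ordinary linked-list removal, so no special marker is needed, unlike in open addressing.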
Applications
• Cryptography (Message digest)
• Compiler Operation (Symbol table)
• Rabin-Karp Algorithm
Symbol Table
• It is a set of name-value pairs
• Associated with each name in the table is an attribute, a collection
of attributes, or some instructions about what further processing is
needed
• Symbol tables are normally used when building loaders, assemblers, compilers
or any keyword-driven translator
Operations on Symbol Table
• The operations that are generally performed on a symbol table are:
• Checking for the presence of a particular key
• Retrieving attributes for a particular key value
• Inserting a new name and its value
• Deleting a name and its value
(Figure: a two-column symbol table with the headings Name and Value)
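The four operations can be illustrated with Python's built-in `dict`, which is itself implemented as a hash table. The symbol names and attribute values below are made up for the example.

```python
# A tiny symbol table: each name maps to a collection of attributes.
symbols = {}

# Inserting a new name and its value
symbols["count"] = {"type": "int", "scope": "global"}
symbols["total"] = {"type": "float", "scope": "local"}

# Checking for the presence of a particular key
print("count" in symbols)        # True
print("missing" in symbols)      # False

# Retrieving attributes for a particular key value
print(symbols["total"]["type"])  # float

# Deleting a name and its value
del symbols["count"]
print("count" in symbols)        # False
```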