
Direct Address Table

A direct address table is a data structure that maps keys to their corresponding records using an array: records are placed using their key values directly as indexes. It supports fast search, insertion, and deletion operations.
Direct addressing is a simple technique that works well when the universe U of keys is reasonably small.
Suppose that an application needs a dynamic set in which each element has a key drawn from the
universe 𝑈 = {0,1, … , 𝑚 − 1}, where m is not too large. We shall assume that no two elements have the
same key.
To represent the dynamic set, we use an array, or direct-address table, denoted by 𝑇[0. . 𝑚 − 1], in which
each position, or slot, corresponds to a key in the universe 𝑈. Figure below illustrates the approach; slot 𝑘
points to an element in the set with key 𝑘. If the set contains no element with key 𝑘, then 𝑇[𝑘] = 𝑁𝐼𝐿.

In the above figure each key 𝑈 = {0,1, … , 𝑚 − 1} in the universe corresponds to an index in the direct-
address table 𝑇. The set 𝐾 = {2, 3, 5, 8} of actual keys determines the slots in the table that contain pointers
to elements. The other slots, heavily shaded, contain NIL.
For some applications, the direct-address table itself can hold the elements in the dynamic set. That is, rather
than storing an element’s key and satellite data in an object external to the direct-address table, with a
pointer from a slot in the table to the object, we can store the object in the slot itself, thus saving space.
We can understand the concept using the following example. We create an array of size equal to maximum
value plus one (assuming 0 based index) and then use values as indexes. For example, in the following
diagram key 21 is used directly as index.
Advantages:
1. Searching in 𝑶(𝟏) Time: Direct address tables use arrays, which are random-access data structures, so the key values (which are also the indexes of the array) can be used to find records
in 𝑂(1) time.
2. Insertion in 𝑶(𝟏) Time: We can easily insert an element in an array in 𝑂(1) time. The same thing
follows in a direct address table also.
3. Deletion in 𝑶(𝟏) Time: Deletion of an element takes 𝑂(1) time in an array. Similarly, to delete an
element in a direct address table we need 𝑂(1) time.
These operations are trivial to implement:
DIRECT-ADDRESS-SEARCH (T, k)
1 return T[k]
DIRECT-ADDRESS-INSERT (T, x)
1 T[x.key] = x
DIRECT-ADDRESS-DELETE (T, x)
1 T[x.key] = NIL
Each of these operations takes only O(1) time.
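As a minimal sketch, the three operations above can be written in Python; the `Record` class with a `key` attribute is a hypothetical element type (not part of the original text), standing in for an object with key and satellite data:

```python
class DirectAddressTable:
    """Direct-address table over the key universe {0, 1, ..., m-1}."""
    def __init__(self, m):
        self.slots = [None] * m  # T[0..m-1], all NIL initially

    def search(self, k):
        return self.slots[k]      # DIRECT-ADDRESS-SEARCH: return T[k]

    def insert(self, x):
        self.slots[x.key] = x     # DIRECT-ADDRESS-INSERT: T[x.key] = x

    def delete(self, x):
        self.slots[x.key] = None  # DIRECT-ADDRESS-DELETE: T[x.key] = NIL


class Record:
    """Hypothetical element type: a key plus satellite data."""
    def __init__(self, key, data):
        self.key, self.data = key, data


t = DirectAddressTable(10)
t.insert(Record(3, "alpha"))
print(t.search(3).data)  # -> alpha
t.delete(t.search(3))
print(t.search(3))       # -> None
```

Each method does a single array access, which is where the 𝑂(1) bounds come from.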
Limitations:
1. Prior knowledge of maximum key value
2. Practically useful only if the maximum key value is not too large.
3. It wastes memory space if the number of records is much smaller than the maximum key
value.
Hashing can overcome these limitations of direct address tables.
Hash Tables:
What is Hash Table?
In computing, a hash table is a data structure that implements an associative array, also called a dictionary,
which is an abstract data type that maps keys to values. A hash table uses a hash function to compute an
index, also called a hash code, into an array of buckets or slots, from which the desired value can be found.
During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.
In simple words, a hash table is defined as a data structure used to insert, look up, and remove key-value
pairs quickly. It operates on the hashing concept, where each key is translated by a hash function into an
index in an array. The index functions as the storage location for the matching value; in short, the table
maps keys to values.
The downside of direct addressing is obvious: if the universe 𝑈 is large, storing a table 𝑇 of size |𝑈| may
be impractical, or even impossible, given the memory available on a typical computer. Furthermore, the
set 𝐾 of keys actually stored may be so small relative to 𝑈 that most of the space allocated for 𝑇 would be
wasted.
When the set 𝐾 of keys stored in a dictionary is much smaller than the universe 𝑈 of all possible keys, a hash
table requires much less storage than a direct address table. Specifically, we can reduce the storage
requirement to 𝜃(|𝐾|) while we maintain the benefit that searching for an element in the hash table still
requires only 𝑂(1) time. The catch is that this bound is for the average-case time, whereas for direct
addressing it holds for the worst-case time.
With direct addressing, an element with key 𝑘 is stored in slot 𝑘. With hashing, this element is stored in
slot ℎ(𝑘); that is, we use a hash function ℎ to compute the slot from the key 𝑘. Here, ℎ maps the universe
𝑈 of keys into the slots of a hash table 𝑇[0, 1, … , 𝑚 − 1]:
ℎ: 𝑈 → {0, 1, … , 𝑚 − 1},
where the size 𝑚 of the hash table is typically much less than |𝑈|. We say that an element with key 𝑘 hashes
to slot ℎ(𝑘); we also say that ℎ(𝑘) is the hash value of key 𝑘. Figure below illustrates the basic idea. The
hash function reduces the range of array indices and hence the size of the array. Instead of a size of |𝑈|, the
array can have size 𝑚.

There is one hitch: two keys may hash to the same slot. We call this situation a collision.
Of course, the ideal solution would be to avoid collisions altogether.
Because |𝑈| > 𝑚, however, there must be at least two keys that have the same hash value; avoiding
collisions altogether is therefore impossible. Thus, while a well-designed, “random”-looking hash function
can minimize the number of collisions, we still need a method for resolving the collisions that do occur.
Hash Collision
When the hash function generates the same index for multiple keys, there will be a conflict (what value to
be stored in that index). This is called a hash collision.
We can resolve the hash collision using one of the following techniques.
 Collision resolution by chaining
 Open Addressing: Linear Probing
Collision resolution by chaining
In chaining, if the hash function produces the same index for multiple elements, these elements are stored
at the same index using a linked list, as the following figure shows.
If 𝑗 is the slot for multiple elements, it contains a pointer to the head of the list of elements. If no element
is present, 𝑗 contains 𝑁𝐼𝐿.
Each hash-table slot 𝑇[𝑗] contains a linked list of all the keys whose hash value is 𝑗. For example,
ℎ(𝑘₁) = ℎ(𝑘₄) and ℎ(𝑘₅) = ℎ(𝑘₇) = ℎ(𝑘₂). The linked list can be either singly or doubly linked.
The operations on a hash table 𝑇 are easy to implement when collisions are resolved by chaining:
CHAINED-HASH-INSERT (𝑇, 𝑥)
1 insert 𝑥 at the head of list 𝑇[ℎ(𝑥. 𝑘𝑒𝑦)]
CHAINED-HASH-SEARCH (𝑇, 𝑘)
1 search for an element with key 𝑘 in list 𝑇[ℎ(𝑘)]
CHAINED-HASH-DELETE (𝑇, 𝑥)
1 delete 𝑥 from the list 𝑇[ℎ(𝑥. 𝑘𝑒𝑦)]
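A minimal Python sketch of chaining, using ordinary lists as the chains (class and method names are illustrative, not from the original text):

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining."""
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return key % self.m                   # simple division-method hash

    def insert(self, key, value):
        self.table[self._h(key)].insert(0, (key, value))  # insert at head of chain

    def search(self, key):
        for k, v in self.table[self._h(key)]: # scan the chain at slot h(key)
            if k == key:
                return v
        return None

    def delete(self, key):
        slot = self._h(key)
        self.table[slot] = [(k, v) for k, v in self.table[slot] if k != key]


t = ChainedHashTable(7)
t.insert(38, "a")
t.insert(52, "b")    # 38 % 7 == 52 % 7 == 3, so both share slot 3
print(t.search(52))  # -> b
print(t.search(38))  # -> a
```

Insertion at the head of the chain is 𝑂(1); search and delete take time proportional to the chain length.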
How well does hashing with chaining perform? In particular, how long does it take to search for an element
with a given key?
Given a hash table 𝑇 with 𝑚 slots that stores 𝑛 elements, we define the load factor 𝜶 for 𝑇 as 𝑛/𝑚, that is,
the average number of elements stored in a chain.
Our analysis will be in terms of 𝛼, which can be less than, equal to, or greater than 1. The worst-case
behavior of hashing with chaining is terrible: all 𝑛 keys hash to the same slot, creating a list of length 𝑛.
The worst-case time for searching is thus 𝑂(𝑛) plus the time to compute the hash function—no better than
if we used one linked list for all the elements. Clearly, we do not use hash tables for their worst-case
performance.
The average-case performance of hashing depends on how well the hash function ℎ distributes the set of
keys to be stored among the 𝑚 slots, on the average.

Example: Let's understand with the help of examples. Given below is the hash function:
ℎ(𝑘𝑒𝑦) = 𝑘𝑒𝑦 % 𝑡𝑎𝑏𝑙𝑒 𝑠𝑖𝑧𝑒
In a hash table of size 7, keys 42 and 38 would get 0 and 3 as hash indices respectively.
If we insert a new element 52, it would also go to index 3, since 52 % 7 is 3, colliding with 38.
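A quick check of the example, with h(key) = key % 7:

```python
table_size = 7
for key in (42, 38, 52):
    print(key, "->", key % table_size)
# 42 -> 0, 38 -> 3, 52 -> 3 (52 collides with 38 at index 3)
```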

The lookup cost is that of scanning all the entries of the selected linked list for the required key. If the keys
are uniformly distributed, the average lookup cost is the average number of keys per linked list.
Open Addressing:
Open Addressing, also known as closed hashing, is a simple yet effective way to handle collisions in hash
tables. Unlike chaining, it stores all elements directly in the hash table. This method uses probing techniques
like Linear, Quadratic, and Double Hashing to find space for each key, ensuring easy data management and
retrieval in hash tables.

The main concept of Open Addressing hashing is to keep all the data in the same hash table and
hence a bigger Hash Table is needed. When using open addressing, a collision is resolved by
probing (searching) alternative cells in the hash table until our target cell (empty cell while
insertion, and cell with value 𝑥 while searching for 𝑥) is found. It is advisable to keep the load factor 𝛼
below 0.5, where 𝛼 = 𝑛/𝑚, 𝑛 being the total number of entries in the hash table
and 𝑚 the size of the hash table. As explained above, since all the keys are stored in the same
hash table, it is obvious that 𝛼 ≤ 1 because 𝑛 ≤ 𝑚 always. If a collision happens,
alternative cells of the hash table are checked until the target cell is found.

More formally,
 Cells ℎ₀(𝑥), ℎ₁(𝑥), ℎ₂(𝑥), …, ℎₙ(𝑥) are tried consecutively until the target cell has been found in
the hash table, where ℎᵢ(𝑥) = (ℎ(𝑥) + 𝑓(𝑖)) % 𝑆𝑖𝑧𝑒, keeping 𝑓(0) = 0.
 The collision function 𝑓 is chosen according to the collision resolution strategy.
Linear Probing
In linear probing, collisions are resolved by searching the hash table consecutively (with wrap around) until
an empty cell is found. The definition of collision function 𝑓 is quite simple in linear probing. As suggested
by the name it is a linear function of 𝑖 or simply 𝑓(𝑖) = 𝑖.
Operations in linear probing collision resolution technique -
 For inserting 𝑥 we search the cells ℎ(𝑥) + 0, ℎ(𝑥) + 1, ..., ℎ(𝑥) + 𝑘 until we find an empty
cell in which to insert 𝑥.
 For searching 𝑥 we again search the cells ℎ(𝑥) + 0, ℎ(𝑥) + 1, ..., ℎ(𝑥) + 𝑘 until we find a cell
with value 𝑥. If we find a cell that has never been occupied, 𝑥 is not present in the hash
table.
 For deletion, we repeat the search process; if a cell is found with value 𝑥, we replace 𝑥 with
a predefined unique value (say ∞) to denote that this cell contained a value in the past.
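The three operations can be sketched as a small Python class; a `DELETED` sentinel plays the role of the ∞ marker described above (names are illustrative, not from the original text):

```python
DELETED = object()  # sentinel marking a slot that once held a value

class LinearProbingTable:
    """Open addressing with linear probing: f(i) = i."""
    def __init__(self, m):
        self.m = m
        self.slots = [None] * m

    def _probe(self, key):
        h = key % self.m
        for i in range(self.m):           # h(x)+0, h(x)+1, ... with wrap-around
            yield (h + i) % self.m

    def insert(self, key):
        for j in self._probe(key):
            if self.slots[j] is None or self.slots[j] is DELETED:
                self.slots[j] = key       # first empty (or deleted) cell wins
                return j

    def search(self, key):
        for j in self._probe(key):
            if self.slots[j] is None:     # never-occupied cell: key is absent
                return None
            if self.slots[j] == key:
                return j                  # found: return the slot index

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED       # mark, don't empty, so later probes still work
```

Note that delete marks the slot rather than emptying it; emptying it would make later searches stop too early.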
Example of linear probing -
Table Size = 7, Hash Function - ℎ(𝑘𝑒𝑦) = 𝑘𝑒𝑦 % 7, Collision Resolution Strategy - 𝑓(𝑖) = 𝑖
 Insert - 16,40,27,9,75
 Search - 75,21
 Delete - 40
Steps involved are
 Step 1 - Make an empty hash table of size 7.

 Step 2 - Inserting 16, 40 and 27.


o ℎ(16) = 16%7 = 2
o ℎ(40) = 40%7 = 5
o ℎ(27) = 27%7 = 6
As we do not get any collision we can easily insert values at their respective indexes generated by the hash
function. After inserting, the hash table will look like –
 Step 3 - Inserting 9 and 75.
o ℎ(9) = 9%7 = 2 But 16 is already placed at index 2, so a collision occurs; as per
linear probing we search consecutive cells until we find an empty cell.
o So we probe ℎ(9) + 1, i.e. cell 3; since cell 3 is not occupied, we place 9
in cell 3.
o ℎ(75) = 75%7 = 5 Again a collision happens because 40 is already placed in cell 5. So
we search the consecutive cells: cell 6 is also occupied, so we then check
cell (ℎ(75) + 2)%7, i.e. cell 0, which is empty, and place 75 there.
After inserting 9 and 75, the hash table will look like –

 Step 4 - Search 75 and 21-


o ℎ(75) = 75%7 = 5 But cell 5 does not contain 75, so we search consecutive cells
until we find an empty cell or a cell with the value 75. Cell 6 does not contain 75
either, so we move on to cell 0, and we stop our search here as we have
found 75.
o ℎ(21) = 21%7 = 0 We look for 21 in cell 0, but it contains 75, so we check the
next cell, ℎ(21) + 1, i.e. cell 1. Since cell 1 is empty, 21 does not exist in our
table.
 Step 5 - Delete 40
o ℎ(40) = 40%7 = 5 First we search for 40, which results in a successful search as we
find 40 in cell 5; we then remove 40 from cell 5 and replace it with a unique value (say
- ∞).
After all these operations our hash table will look like –
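The whole trace can be checked with a short, self-contained script; the string "DEL" stands in for the ∞ deletion marker:

```python
table = [None] * 7

def insert(key):
    j = key % 7
    while table[j] is not None:   # linear probing with wrap-around
        j = (j + 1) % 7
    table[j] = key

for k in (16, 40, 27, 9, 75):
    insert(k)

table[table.index(40)] = "DEL"    # delete 40 by marking its slot
print(table)  # -> [75, None, 16, 9, None, 'DEL', 27]
```

The printed list matches the steps above: 16, 40, 27 land in cells 2, 5, 6; 9 is displaced to cell 3; 75 wraps around to cell 0; and cell 5 holds the deletion marker.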

Hash functions:
What is a Hash function?
A hash function creates a mapping between a key and an index in the table by means of a mathematical
formula. The result of the hash function is referred to as a hash value or hash.
The hash value is a representation of the original string of characters, but usually smaller than the original.
For example: Consider an array as a Map where the key is the index and the value is the value at that index.
So for an array A if we have index i which will be treated as the key then we can find the value by simply
looking at the value at A[i].
Properties of a Good hash function:
A hash function that maps every item into its own unique slot is known as a perfect hash function. We can
construct a perfect hash function if we know the items and the collection will never change but the problem
is that there is no systematic way to construct a perfect hash function given an arbitrary collection of items.
Fortunately, we will still gain performance efficiency even if the hash function isn’t perfect. We can achieve
a perfect hash function by increasing the size of the hash table so that every possible value can be
accommodated. As a result, each item will have a unique slot. Although this approach is feasible for a small
number of items, it is not practical when the number of possibilities is large.
So, we can construct our own hash function, but there are things we must be careful about while doing so.
A good hash function should have the following properties:
1. Efficiently computable.
2. Should uniformly distribute the keys (Each table position is equally likely for each item).
3. Should minimize collisions.
4. Should have a low load factor (number of items in the table divided by the size of the table).
Types of Hash functions:
There are many hash functions that use numeric or alphanumeric keys. Here we discuss two different hash
functions:
1. Division Method.
2. Multiplication Method
Division Method
This is the simplest method for generating a hash value. The hash function divides the value k
by M and uses the remainder obtained.
Formula:
h(k) = k mod M
Here,
k is the key value, and
M is the size of the hash table.
It is best if M is a prime number, as that helps ensure the keys are distributed more uniformly.
The hash function depends entirely on the remainder of a division.
Example:
k = 12345
M = 95
h(12345) = 12345 mod 95
= 90
k = 1276
M = 11
h(1276) = 1276 mod 11
=0
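The two calculations above can be checked directly:

```python
def h_div(k, M):
    """Division-method hash: h(k) = k mod M."""
    return k % M

print(h_div(12345, 95))  # -> 90
print(h_div(1276, 11))   # -> 0
```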
Pros:
1. This method works with any value of M, though some choices of M distribute keys better than others.
2. The division method is very fast since it requires only a single division operation.
Cons:
1. This method can lead to poor performance, since consecutive keys map to consecutive hash values in
the hash table.
2. Extra care must be taken in choosing the value of M (for example, avoiding powers of 2).
Multiplication Method
This method involves the following steps:
1. Choose a constant value A such that 0 < A < 1.
2. Multiply the key value with A.
3. Extract the fractional part of kA.
4. Multiply the result of the above step by the size of the hash table i.e. M.
5. The resulting hash value is obtained by taking the floor of the result obtained in step 4.
Formula:
h(k) = floor(M (kA mod 1))
Here,
M is the size of the hash table.
k is the key value.
A is a constant value.
Example:
k = 12345
A = 0.357840
M = 100
h(12345) = floor[ 100 (12345*0.357840 mod 1)]
= floor[ 100 (4417.5348 mod 1) ]
= floor[ 100 (0.5348) ]
= floor[ 53.48 ]
= 53
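The five steps above, and the worked example, can be reproduced as:

```python
import math

def h_mul(k, M, A=0.357840):
    """Multiplication-method hash: h(k) = floor(M * (k*A mod 1))."""
    fractional = (k * A) % 1           # step 3: fractional part of k*A
    return math.floor(M * fractional)  # steps 4-5: scale by M and take the floor

print(h_mul(12345, 100))  # -> 53
```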
Pros:
The advantage of the multiplication method is that it can work with any value between 0 and 1, although
there are some values that tend to give better results than the rest.
Cons:
The multiplication method is generally slower than the division method. It is most suitable when the table
size is a power of two, in which case the index computation can be performed very quickly.
Commonly used hash functions:
Hash functions are widely used in computer science and cryptography for a variety of purposes, including
data integrity, digital signatures, password storage, and more.
There are many types of hash functions, each with its own strengths and weaknesses. Here are a few of the
most common types:
1. SHA (Secure Hash Algorithm): SHA is a family of cryptographic hash functions designed by the
National Security Agency (NSA) in the United States. The most widely used SHA algorithms are SHA-
1, SHA-2, and SHA-3. Here’s a brief overview of each:
 SHA-1: SHA-1 is a 160-bit hash function that was widely used for digital signatures and other
applications. However, it is no longer considered secure due to known vulnerabilities.
 SHA-2: SHA-2 is a family of hash functions that includes SHA-224, SHA-256, SHA-384, and
SHA-512. These functions produce hash values of 224, 256, 384, and 512 bits, respectively. SHA-
2 is widely used in security protocols such as SSL/TLS and is considered secure.
 SHA-3: SHA-3 is the latest member of the SHA family and was selected as the winner of the NIST
hash function competition in 2012. It is based on a different internal design (the Keccak sponge
construction) than SHA-2 and produces hash values of 224, 256, 384, and 512 bits.
2. CRC (Cyclic Redundancy Check): CRC is a non-cryptographic hash function used primarily for error
detection in data transmission. It is fast and efficient but is not suitable for security purposes. The basic
idea behind CRC is to append a fixed-length check value, or checksum, to the end of a message. This
checksum is calculated based on the contents of the message using a mathematical algorithm, and is
then transmitted along with the message.
When the message is received, the receiver can recalculate the checksum using the same algorithm, and
compare it with the checksum transmitted with the message. If the two checksums match, the receiver
can be reasonably certain that the message was not corrupted during transmission.
The specific algorithm used for CRC depends on the application and the desired level of error detection.
Some common CRC algorithms include CRC-16, CRC-32, and CRC-CCITT.
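The append/recompute/compare round trip described above can be sketched with Python's standard `zlib.crc32` (a CRC-32 implementation); the message content is an arbitrary example:

```python
import zlib

message = b"hello world"
checksum = zlib.crc32(message)                       # CRC-32 check value
transmitted = message + checksum.to_bytes(4, "big")  # append checksum to message

# Receiver side: split, recompute, and compare
received_msg, received_sum = transmitted[:-4], transmitted[-4:]
ok = zlib.crc32(received_msg) == int.from_bytes(received_sum, "big")
print(ok)  # -> True
```

If any bit of `received_msg` were flipped in transit, the recomputed checksum would almost certainly differ and the comparison would fail.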
3. MD5 (Message Digest 5): MD5 is a widely-used cryptographic hash function that produces a 128-bit
hash value. It is fast and efficient but is no longer recommended for security purposes due to known
vulnerabilities. The basic idea behind MD5 is to take an input message of any length and produce a
fixed-length output, known as the hash value or message digest. This hash value acts as a fingerprint of
the input message and is generated using a mathematical algorithm that involves a series of logical
operations, such as bitwise operations, modular arithmetic, and logical functions.
MD5 is widely used in a variety of applications, including digital signatures, password storage, and
data integrity checks. However, it has been shown to have weaknesses that make it vulnerable to attacks.
In particular, it is possible to generate two different messages with the same MD5 hash value, a
vulnerability known as a collision attack.
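The digest sizes quoted above for MD5 and the SHA family can be checked with Python's standard `hashlib` module:

```python
import hashlib

data = b"example message"
for name in ("md5", "sha1", "sha256", "sha512"):
    h = hashlib.new(name, data)
    print(name, h.digest_size * 8, "bits")  # md5 128, sha1 160, sha256 256, sha512 512
```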
There are many other types of hash functions, each with its own unique features and applications. The
choice of hash function depends on the specific requirements of the application, such as speed, security,
and memory usage.
Static and Dynamic Hashing:
Static Hashing
In static hashing, the hash function always generates the same bucket address for a given key. For example,
suppose we have a data record for employee_id = 106 and the hash function is H(x) = x mod 5, where x is
the id. Then the operation will take place like this:
H(106) = 106 mod 5 = 1.
This indicates that the data record should be placed or searched in bucket 1 (hash index 1) in the
hash table.
Example:

The primary key is used as the input to the hash function and the hash function generates the output as the
hash index (bucket’s address) which contains the address of the actual data record on the disk block.
Static Hashing has the following Properties
 Data Buckets: The number of buckets in memory remains constant. The size of the hash table is
decided initially; it may also implement chaining to handle some collision issues, though this is
only a slight optimization and may not prove worthwhile if the database size keeps
fluctuating.
 Hash function: It uses a simple hash function, generally a modulo hash function, to map data
records to their appropriate buckets.
 Efficient for known data size: It is very efficient when we know the data size and its
distribution in the database in advance.
 It is inefficient and inaccurate when the data size varies dynamically, because space is limited
and the hash function always generates the same value for a given input. When the data size
fluctuates very often it is not at all useful, because collisions keep happening and result in
problems like bucket skew and insufficient buckets.
To resolve this problem of bucket overflow, techniques such as – chaining and open addressing are used.
Dynamic Hashing
Dynamic hashing, also known as extendible hashing, is used to handle databases whose data sets change
frequently. This method offers a way to add and remove data buckets on demand, so as the number of
data records varies, the buckets grow and shrink accordingly whenever a change is made.
Properties of Dynamic Hashing
 The buckets vary in number dynamically as changes are made, offering more flexibility.
 Dynamic Hashing aids in improving overall performance by minimizing or completely preventing
collisions.
 It has the following major components: Data bucket, Flexible hash function, and directories
 A flexible hash function means that it generates more dynamic values and keeps changing
periodically, adapting to the requirements of the database.
 Directories are containers that store the pointer to buckets. If bucket overflow or bucket skew-like
problems happen to occur, then bucket splitting is done to maintain efficient retrieval time of data
records. Each directory will have a directory id.
 Global Depth: It is defined as the number of bits in each directory id. The more the number of
records, the more bits are there.
Working of Dynamic Hashing
Example: If the global depth is k = 2, keys are mapped to hash indices using their k bits starting
from the LSB. That leaves us with the following 4 possible directory ids: 00,
01, 10, 11.
As the figure above shows, the k bits taken from the LSBs of the hash value form the hash index, which
is matched against the directory ids; each directory entry then points to the appropriate bucket. Each
bucket holds the values whose keys, written in binary, end in those k bits.
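The directory lookup described above (taking the k least-significant bits of a key) can be sketched as follows; the function name is illustrative:

```python
def directory_index(key, global_depth):
    """Map a key to a directory entry using its k least-significant bits."""
    return key & ((1 << global_depth) - 1)

k = 2  # global depth
for key in (4, 5, 6, 7):
    print(key, "->", format(directory_index(key, k), "02b"))
# 4 -> 00, 5 -> 01, 6 -> 10, 7 -> 11
```

When a bucket overflows and the global depth grows from k to k+1, the directory doubles and keys are redistributed by their k+1 low-order bits.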
