Hash Tables: A Detailed Description

HASH TABLES

Raashid Altaf - 16BCS036

Introduction
In many scenarios, we need to store a set of values in a data structure so that they
can be retrieved later. This set of values (let's call it 'data') could be in any
form: integers, letters, or in some cases, even mathematical functions. In order to retrieve the
data, our chosen data structure should provide a certain index (let's call it a 'key') which
uniquely identifies each data point. Let's also assume that our data is a subset of a larger set called
the Universe, U.
What we want is for our data structure to assign a particular key to each data point,
so that we can extract any particular data point out of the whole data set by referring to its
respective key. This raises several important questions, such as:

 What type of key should we use?
 What should the size of our data structure be?
 How do we 'map' a data point to a particular key?
There are many methods available that let us solve these, and other, problems to some
extent. We'll discuss two of them: the array and the dictionary.
Array: An array is a data structure that uses keys in the form of integers. It is, in many ways, the
base upon which many other data structures have been built. Because the key is an
integer, an array gives a result in just O(1) time:
e.g. given an array A of size m, we can store values in A[0], A[1], …, A[m-1], and the
index (key) of any element A[i] is i itself.
Dictionary: Instead of the keys being integers as in the case of an array, dictionaries enable us to
set any type of value as the key. This key could be a word, a number, or anything in between. In
Python, dictionaries are declared as
Dict = {key1: value1, key2: value2, …, keyN: valueN}
Here, by referring to Dict[keyN] we can retrieve valueN. A dictionary, in the abstract
sense, can be implemented as a balanced Binary Search Tree, in which case retrieval has time
complexity of the order O(log2 N), where N is the number of entries.
Both of the above data structures have their merits and demerits. While an array is fast, it
uses integers as keys, which makes it hard to remember which key was assigned to which data
point. In the case of dictionaries, even though the keys are more convenient, the retrieval itself is slower.
The question that now arises is: can we reach a compromise between the two? Can we
have the best of both worlds, where the keys are as flexible as in dictionaries while retaining the
speed of arrays? That's where hash tables come in.
Over the course of this paper, we will first describe what hash tables are and then give a detailed
description of how they work, defining the related terms that we encounter on the way.

Definition:
A hash table is a data structure which allows us to store data and label it with certain 'keys'.
The hash table then stores this data according to a shorter index. For the purpose of
simplicity in further arguments, we'll consider hashing to an integer index.
These integer indices are computed by taking the user-defined keys and applying a certain
function to them. This function is known as a hash function, and the method is known as
hashing.
Consider the following diagram of a hash table with six slots (0-5), where names are the
keys and floats are the stored values:

Key:   Ram   Rahim  Abdul  Tom   Din   Dan
Value: 3.5   5.6    8.7    4.5   5.4   4.6

HASH FUNCTION: Abdul -> 2
HASH TABLE:    slot 2 holds 8.7
In the above implementation of a hash table, we have made a list which uses names as
keys and maps them to certain float values. The hash table uses the hash function to compute
the integer indices corresponding to our keys (i.e. the names); e.g. for Abdul, the hash function
computes the value 2. Thus, when we want to retrieve Abdul's data, we just input the
key into the table, and the table computes its index value and returns the corresponding
result.
Thus we're able to keep user-defined keys while also maintaining the speed of arrays.
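
As a minimal sketch of the table above, here is a toy Python implementation. Note that
the slot numbers produced by Python's built-in hash() will generally differ from the
diagram, and collisions are ignored for now:

# A toy hash table mapping name keys to float values. hash() is
# Python's built-in hash function; taking it modulo the table size
# turns an arbitrary key into a slot index in the range [0, 5].
TABLE_SIZE = 6
table = [None] * TABLE_SIZE

def put(key, value):
    index = hash(key) % TABLE_SIZE      # compute the integer index
    table[index] = (key, value)

def get(key):
    index = hash(key) % TABLE_SIZE      # recompute the same index
    entry = table[index]
    return entry[1] if entry is not None and entry[0] == key else None

put("Abdul", 8.7)
print(get("Abdul"))   # 8.7, retrieved in O(1) time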

What is a hash function?


Let’s get a basic idea of what a hash function is:
Suppose there is a set A of m elements that we have to store in a hash table of size k ≥ m. There
exists a Universe U such that A ⊆ U. Let the total number of elements in U be u. Usually,
k ≪ u.
We create a hash function h that maps every possible element of U to the hash table,
which is an array indexed from 0 to k-1,
i.e. the hash function h: U → {0, 1, …, k-1}
If the array is T, we can say that we have hashed an element p to T[h(p)].
Considering the aforementioned example, h(Abdul) = 2 and T[2] = 8.7.
There's a point to consider here. If we create a hash table such that u = k, then we can use the
hash function h(t) = t. This is commonly called a direct access table or, more simply,
an array.

How do we decide the hash functions to use?


Now that we know what a hash function is, we will focus our attention on the problems we face
while creating hash functions.
If the size of our hash table is small, we encounter a problem called collision. Two elements a
and b are said to have collided if our hash function assigns the same value to both of them, i.e.
h(a) = h(b). In the worst-case scenario, a subset of the elements (of size u/k) will always
collide, i.e. they will all hash to the same location. This will bring our
program to a crawl.
In an ideal scenario, we would make a hash function that doesn't cause any collisions. But that is
practically impossible, as the data to be entered depends on the user and thus is effectively random
and outside our knowledge domain.
Thus there will always exist a certain set of pairs of elements that collide. Our focus
should be on minimising the number of collisions, and that is how we design our hash
functions.
One way to partly avoid this problem is to choose the hash
function randomly: we create a set of hash functions H, and when we run the program to retrieve or edit the
table, we choose one of the hash functions from the set H according to a certain method or
distribution.
Let's consider a popular example of a hash function called "the division method". This hash
function calculates h(y) = y mod k, where k is the table size. This function might sound
promising, but in reality it is quite weak. If k is of the form 2^t where t is an integer, the
whole model falls down: in that case, h(y) is simply the lowest t bits of y, so keys that
differ only in their higher bits will all collide. There is a modification to the
above method which says that k should be a prime number. This helps, but it doesn't solve the
problem entirely, because any pair of elements whose difference is an integral multiple of k
will still collide, i.e.
h(t + ak) = h(t) for any integer a.
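
Here is a small Python sketch of the division method, using made-up keys to show how the
choice of k changes which keys collide:

# The division method: h(y) = y mod k, where k is the table size.
def h(y, k):
    return y % k

keys = [5, 13, 21, 29]            # all differ by multiples of 8
print([h(y, 8) for y in keys])    # [5, 5, 5, 5] -> with k = 2^3, only the
                                  # low 3 bits matter, so every key collides
print([h(y, 7) for y in keys])    # [5, 6, 0, 1] -> a prime k spreads them out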

TYPES OF HASH FUNCTIONS:


There are many types of hash functions currently in use, e.g. "the division method",
which is explained above. Some other simple hash functions are:
1. Folding method: In this technique, the number to be hashed is divided into a certain
number of parts. These parts are then added together, and a selected number of digits from
the result is used as the hashed value (see the sketch after this list).
2. Radix transformation method: If the value to be hashed is numeric, its radix can be
changed, i.e. a decimal number can be converted into a hexadecimal or BCD number,
resulting in a different hashed value.
3. Digit rearrangement method: In this method, a chosen sequence of digits in the value
is reversed, resulting in a value different from the original.
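
As an illustration, here is a minimal Python sketch of one common variant of the folding
method; the part length and the number of kept digits are assumptions for the example:

# Folding: split the key's digits into parts, sum the parts, and keep
# a selected number of digits of the sum as the hashed value.
def fold_hash(key, part_len=2, digits=2):
    s = str(key)
    parts = [int(s[i:i + part_len]) for i in range(0, len(s), part_len)]
    total = sum(parts)             # e.g. 123456 -> 12 + 34 + 56 = 102
    return total % (10 ** digits)  # keep the last 2 digits -> 2

print(fold_hash(123456))   # 2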
It should be noted that hashing is not just used for fast retrieval of data. Hashing
is also used extensively in cryptography.

TYPES OF HASHING TECHNIQUES


The idea behind hashing is to use an algorithm to build a function, called the hash
function, that is applied to a key and produces the required field. Relating this to locating
a block of memory: the hash key points to a record, and by applying the function to the hash key,
the block in which this record is located is returned. Assume that the unit of a hash table is a
bucket and that the records are contained in buckets.
There are several types of hashing techniques.
1. Static Hashing: In this type of hashing, the records are allocated sequentially. In
simpler terms, for a particular key, the static hash function will always compute the
same value, even if the contents of the dictionary have changed.
In this technique, all the objects in the dictionary should be static, i.e. the objects and
what they refer to should not change. This limits the applications
of static hashing. The primary blocks are thus fixed, and if further records are to be added,
overflow pages are used. Also, to find a particular record, the whole bucket has to be
searched sequentially.
e.g. if mod-5 hashing is used, it will only ever produce 5 values, and for a particular input we
will always get the same output. The number of buckets is thus also
unchanging.
(Figure: keys hashed to a fixed set of buckets)
2. Dynamic Hashing: The problem we face with static hashing is that the number of buckets
does not change dynamically; rather, it is fixed permanently. In dynamic hashing,
we therefore keep adding and deleting buckets dynamically. The hash function
initially forms many buckets; some of these buckets may stay unused, and these are then
deleted accordingly.
(Figure: a directory of binary prefixes 00, 01, 10, 11 pointing to buckets 1-4)

NOTE: Dynamic Hashing is also known as Extendible Hashing
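
True extendible hashing splits buckets via a directory of binary prefixes, as in the figure
above. As a simpler stand-in, here is a Python sketch of the core idea of growing the bucket
count dynamically as records are added:

# A simplified dynamic table: when the load factor passes a threshold,
# double the number of buckets and rehash every key.
class DynamicTable:
    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]
        self.count = 0

    def put(self, key, value):
        if self.count / len(self.buckets) > 0.75:     # table getting full
            self._grow()
        self.buckets[hash(key) % len(self.buckets)].append((key, value))
        self.count += 1

    def _grow(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]   # double buckets
        for bucket in old:
            for key, value in bucket:                      # rehash all keys
                self.buckets[hash(key) % len(self.buckets)].append((key, value))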

DIRECT ACCESS TABLE:


Suppose we have to create a database where we store the details of the employees of a company.
The key to this table will be the phone numbers of the employees. This scenario can be
implemented with various structures, like arrays or linked lists: we can store the phone numbers
and details of the employees in an array or linked list and use their respective integer indices. The
problem with linked lists and arrays is that the time to search for a record will be
large: of the order O(n) in general, or O(log n) even with binary search on a sorted array. Not
only this, the insertion and deletion of records will also take up much time.
Naturally, we look for other ways to solve our problem. That is where direct access tables come
in.
In a direct access table, we use the keys to identify the pointer to the location where the records
are stored. This not only simplifies our structure but also shortens the retrieval time. In the case
of direct access tables, we can get a result in O(1) time.
Correlating this with the above example, to make a direct access table, we will use phone
numbers as keys to a list of pointers which point to the respective records. In order to insert an
entry, say the record of an employee, we first create this record with all the necessary details.
Then the phone number, which is also the index, is used along with the pointer to the created
record and we then store this in the table.
This might seem to be the perfect way to store the data, but unfortunately this method is, at
best, idealistic. For a phone number of n digits and a pointer of size m, we would need a table
of size O(m · 10^n), which is unrealistic to say the least. That is why, instead of direct access
tables, we use hash tables. Hashing gives us a solution to almost all the problems in
this case.
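
For illustration, here is a minimal Python sketch of a direct access table; to keep the
table small, it assumes toy 3-digit phone numbers (000-999) rather than real ones:

# Direct access table: the key itself is the array index, so lookup is
# O(1), but the table needs one slot for every possible key.
MAX_KEY = 1000                    # 3-digit phone numbers: 000-999
records = [None] * MAX_KEY

def insert(phone, details):
    records[phone] = details      # the key is used directly as the index

def lookup(phone):
    return records[phone]         # O(1) retrieval, no hashing needed

insert(736, "employee record for extension 736")
print(lookup(736))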

COLLISION HANDLING:
When hashing, we use a function, known as a hash function, to compute the index from the key. This
means that we convert a large key into a small, preferably integral, key. It is only natural that
during this process there may arise a case where two different keys result in the same hashed
value, i.e. two different keys are hashed to the same memory location. This is known as a collision.
To counter this problem, there are many techniques available. They come under the category
of collision handling. A few of the more common types are:
1. Chaining: In this method, if a collision occurs at a particular hashed value, the
colliding records are connected via a linked list. To retrieve a record, we first go to the entry in the hash
table, and then traverse the linked list if multiple records hashed to that slot. This
method, though relatively simple, requires additional memory for the lists.
e.g. let us consider the hash function 'key mod 7' and the sequence of keys 20, 700, 38,
45. The chaining would be done as follows:

Insert 20:  slot 6 is empty        -> slot 6: [20]
Insert 700: slot 0 is empty        -> slot 0: [700]
Insert 38:  slot 3 is empty        -> slot 3: [38]
Insert 45:  slot 3 is taken by 38  -> collision: add 45 to the chain, slot 3: [38 -> 45]
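
A minimal Python sketch of chaining, using the same assumed 'key mod 7' example
(Python lists stand in for the linked lists):

# Chaining: each slot holds a list, and colliding keys share the list.
TABLE_SIZE = 7
chains = [[] for _ in range(TABLE_SIZE)]

def insert(key):
    chains[key % TABLE_SIZE].append(key)

def search(key):
    return key in chains[key % TABLE_SIZE]   # walk the chain at that slot

for key in (20, 700, 38, 45):
    insert(key)

print(chains[3])    # [38, 45] -> 38 and 45 collided and share a chain
print(search(45))   # True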

2. Open Addressing/Probing: In open addressing, every element is stored in the hash table
itself. Each entry in the table holds either the value of a record or a null value.
This clearly implies that the size of the table must be equal to or greater than the
number of entries to be stored in it. A null value means that no entry is
hashed to that particular slot in the table. To retrieve a particular record, we start at its
hashed slot and step through the table slot by slot until the element is found or we can
conclude it isn't in the table.
The operations are performed as follows in open addressing:
a. Insert: Starting at the hashed slot, the table is probed for an empty slot. Once one
is found, the entry is placed there.
b. Search: We keep probing the table until we match the key in the table with
the key of the record we are searching for, or hit an empty slot.
c. Delete: Here we encounter a problem: if we simply delete an entry, it might
break a probe chain in our table. Instead, we mark such entries in a specific way, say as
"deleted". This way, when searching, our function just skips over these entries,
while new records can later be written in their place.
Open addressing is done in the following ways:
a) Linear Probing: As the name suggests, in this method we probe the table linearly:
starting from the hashed slot, the algorithm checks successive slots (with a fixed step,
typically 1) until it finds the next empty slot. Here is an example:
Let us consider the hash function 'key mod 7' and the sequence of keys 20, 700, 38, 45,
46. The probing would be done as follows:

Insert 20:  slot 6 is empty        -> 20 goes to slot 6
Insert 700: slot 0 is empty        -> 700 goes to slot 0
Insert 38:  slot 3 is empty        -> 38 goes to slot 3
Insert 45:  slot 3 is taken by 38  -> probe slot 4, which is free: 45 goes to slot 4
Insert 46:  slot 4 is taken by 45  -> collision again: probe slot 5, which is free:
            46 goes to slot 5
The following is the algorithm for open-addressing insertion (referred from Jayakanth
Srinivasan @MIT):

index := hash(key)
while Table(index) is full do
    index := (index + 1) mod Table_Size
    if index = hash(key) then
        return table_full
Table(index) := entry
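
A runnable Python rendering of the same insert logic (a sketch, not the MIT original):

# Linear probing insert: probe successive slots, wrapping around, and
# report failure if we come back to where we started.
TABLE_SIZE = 7
table = [None] * TABLE_SIZE

def insert(key, value):
    index = hash(key) % TABLE_SIZE
    start = index
    while table[index] is not None:       # slot full -> keep probing
        index = (index + 1) % TABLE_SIZE
        if index == start:                # wrapped all the way around
            raise RuntimeError("table full")
    table[index] = (key, value)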
b) Quadratic Probing: Instead of checking slots linearly, in quadratic probing the table
is traversed in a quadratic fashion, i.e. on the i-th probe, the slot offset by i^2 from
the original hash is checked.

c) Double Hashing: In this case, instead of one hash function, two hash functions are used:
the probe sequence steps through the slots (h1(key) + i · h2(key)) mod k for i = 0, 1, 2, …
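
As a sketch, here are the two probe sequences side by side for an assumed table of size 7
(the choice of h2 is a common convention, not from the original text):

# Probe sequences for quadratic probing and double hashing.
TABLE_SIZE = 7

def quadratic_probes(key, k=TABLE_SIZE):
    home = key % k
    return [(home + i * i) % k for i in range(k)]   # offsets 0, 1, 4, 9, ...

def double_hash_probes(key, k=TABLE_SIZE):
    h1 = key % k
    h2 = 1 + (key % (k - 1))       # second hash; must never evaluate to 0
    return [(h1 + i * h2) % k for i in range(k)]

print(quadratic_probes(45))     # [3, 4, 0, 5, 5, 0, 4] -> revisits slots
print(double_hash_probes(45))   # [3, 0, 4, 1, 5, 2, 6] -> visits every slot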

PERFECT HASHING:
In most cases, when we theoretically analyse hash functions, we assume that our hash
functions are ideal. This means that each key is equally likely to hash to any of the
table's slots. Mathematically:

For a uniform hash function

P[h(x) = i] = 1/k for all keys x and slots i, where k is the table size.

But finding such functions is extremely difficult, and hence the focus is on finding a
set of hash functions as close to uniform as possible.

In another class of ideal hash functions, we focus our attention on perfect hash functions.
Perfect hash functions are those with which no collision occurs. Perfect hashing is possible when
we know beforehand exactly which keys are or will be available to us. This is popularly used
when hashing keywords for compilers.
A simple example of perfect hashing is to use a function which maps each key to its own index, i.e. a set
of n keys will be mapped to the range [0, n-1]. This is feasible for a small set of records,
but as the size increases, so does the memory requirement, and hence this type of hashing has
limited applicability.
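
A minimal Python sketch of this idea for a fixed, known keyword set (the keyword list is
made up for the example):

# Trivial perfect hashing: because every key is known in advance, each
# gets its own unique slot in [0, n-1], so collisions can never occur.
KEYWORDS = ["if", "else", "while", "for", "return"]
slot = {kw: i for i, kw in enumerate(sorted(KEYWORDS))}   # key -> index

table = [None] * len(KEYWORDS)
for kw in KEYWORDS:
    table[slot[kw]] = kw        # each keyword lands in a distinct slot

print(slot["while"], table)    # every keyword has a unique, fixed slot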

CONCLUSION:
In this paper, we discussed hash tables in detail and how they are broadly
implemented. We also dived into some of the techniques used in hashing, while
discussing some of the popular hash functions in use, and defined the related terms
along the way.
To end, let us discuss some applications of hashing and hash tables:

 Hash tables could be used to build a primitive form of search engine. This would lead
us to build two hash tables: one would map a set of keywords to sets of URLs, and the
other would store the URLs within each set.
 Hash tables are also heavily used in the building of compilers. In compiler design,
hash tables are used to store keywords that the compiler can quickly look up
when doing lexical analysis.
 Hashing is also widely used in cryptography. A message is hashed to a fixed-length
value called a digest; because good cryptographic hash functions are practically
impossible to reverse, the receiver can verify the message's integrity by hashing it
again and comparing the digests.

REFERENCES:
 http://web.mit.edu/16.070/www/lecture/hashing.pdf
 https://www.geeksforgeeks.org/hashing-set-2-separate-chaining
 https://www.hackerearth.com/practice/data-structures/hash-tables/basics-of-hash-tables/tutorial/
 https://www.cs.cmu.edu/~fp/courses/15122-f15/lectures/12-hashtables.pdf
 http://ee.usc.edu/~redekopp/cs104/slides/L21_Hashing.pdf
