Unit1 Notes ADS


Dictionary (Data Structure):

In computer science, a dictionary is a data structure that stores a collection of key-value pairs,
where each key must be unique. It is also known by other names such as a map, associative
array, or symbol table. The idea is to associate a value with a unique identifier (key), allowing
efficient retrieval and modification of values based on their keys.

Abstract Data Type (ADT):

An Abstract Data Type (ADT) is a high-level description of a set of operations that can be
performed on a data structure, without specifying how these operations are implemented. It
defines what operations are possible and what their semantics are, but it does not prescribe how
the operations should be carried out. In the case of a dictionary, the ADT would include
operations like insert, delete, and search.
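As a concrete illustration of the dictionary ADT, the sketch below uses Python's built-in dict (one possible implementation of the ADT; the names and values are made up for the example):

```python
# Python's built-in dict is one concrete realization of the dictionary ADT.
phone_book = {}

# insert: associate a value with a unique key
phone_book["alice"] = "555-0101"
phone_book["bob"] = "555-0102"

# search: retrieve the value stored under a key
assert phone_book["alice"] == "555-0101"

# delete: remove a key-value pair
del phone_book["bob"]
assert "bob" not in phone_book
```

Note that the ADT says nothing about how dict achieves this internally; a hash table, a tree, or a skip list could all sit behind the same operations.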

Implementation of Dictionaries:

There are various ways to implement dictionaries, each with its own advantages and
disadvantages. Here are a few common implementations:

Hash Tables:

Idea: Use a hash function to map keys to indices in an array.

Pros: Efficient average-case time complexity for basic operations (insert, delete, search).

Cons: Possibility of collisions (two keys hashing to the same index), which requires collision
resolution strategies.

Binary Search Trees (BST):

Idea: Store key-value pairs in a binary tree, where keys to the left are smaller, and keys to the
right are larger.

Pros: Maintains a sorted order of keys, easy to find the minimum and maximum keys.

Cons: The tree can become unbalanced, leading to inefficient operations in the worst case.

Skip Lists:

Idea: Linked lists with multiple layers of links, allowing for faster search operations.

Pros: Simpler than many other data structures, and has good average-case performance.

Cons: More complex than a simple linked list, and may not perform as well as hash tables in
certain scenarios.

Trie (Prefix Tree):

Idea: Organize keys in a tree structure where each node represents a character in a key.

Pros: Efficient for operations involving prefixes, like autocomplete.

Cons: Can be memory-intensive, especially if there are many keys with common prefixes.

Hashing:

Hashing is a technique used to map data of arbitrary size to fixed-size values, usually for the
purpose of quickly and efficiently locating a data record. It's commonly employed in data
structures like hash tables to achieve constant-time average complexity for basic operations like
insertion, deletion, and retrieval.

Hash Function:

A hash function takes an input (or "key") and produces a fixed-size value, typically called a
hash code. The goal is to distribute the keys uniformly across the range of possible hash values
to minimize collisions. An ideal hash function is fast to compute and minimizes the likelihood of
two different keys producing the same hash code.
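The simplest such function is the remainder method used throughout the examples below. A minimal Python sketch (the function name is illustrative):

```python
def simple_hash(key, table_size):
    """Remainder-method hash: map an integer key to a slot index."""
    return key % table_size

# Keys from the running example, hashed into a table of size 11
keys = [54, 26, 93, 17, 77, 31, 44, 55, 20]
slots = [simple_hash(k, 11) for k in keys]
print(slots)  # [10, 4, 5, 6, 0, 9, 0, 0, 9] -- 77, 44, 55 all collide at slot 0
```

The repeated slot values (0 appears three times, 9 twice) are exactly the collisions that the resolution techniques below must handle.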

Collision Resolution Techniques:

Collisions occur when two distinct keys hash to the same value. Various techniques are
employed to address collisions:

Collision Resolution

We now return to the problem of collisions. When two items hash to the same slot, we must have
a systematic method for placing the second item in the hash table. This process is called collision
resolution. If the hash function is perfect, collisions will never occur. However, since this is
often not possible, collision resolution becomes a very important part of hashing.

One method for resolving collisions looks into the hash table and tries to find another open slot
to hold the item that caused the collision. A simple way to do this is to start at the original hash
value position and then move in a sequential manner through the slots until we encounter the first
slot that is empty. Note that we may need to go back to the first slot (circularly) to cover the
entire hash table. This collision resolution process is referred to as open addressing in that it
tries to find the next open slot or address in the hash table. By systematically visiting each slot
one at a time, we are performing an open addressing technique called linear probing.

Figure 8 shows an extended set of integer items under the simple remainder method hash
function (54,26,93,17,77,31,44,55,20). Table 4 above shows the hash values for the original
items. Figure 5 shows the original contents. When we attempt to place 44 into slot 0, a collision
occurs. Under linear probing, we look sequentially, slot by slot, until we find an open position.

Again, 55 should go in slot 0 but must be placed in slot 2 since it is the next open position. The
final value of 20 hashes to slot 9. Since slot 9 is full, we begin to do linear probing. We visit slots
10, 0, 1, and 2, and finally find an empty slot at position 3.

Collision Resolution with Linear Probing

Once we have built a hash table using open addressing and linear probing, it is essential that we
utilize the same methods to search for items. Assume we want to look up the item 93. When we
compute the hash value, we get 5. Looking in slot 5 reveals 93, and we can return True. What if
we are looking for 20? Now the hash value is 9, and slot 9 is currently holding 31. We cannot
simply return False since we know that there could have been collisions. We are now forced to
do a sequential search, starting at position 10, looking until either we find the item 20 or we find
an empty slot.

A disadvantage to linear probing is the tendency for clustering; items become clustered in the
table. This means that if many collisions occur at the same hash value, a number of surrounding
slots will be filled by the linear probing resolution. This will have an impact on other items that
are being inserted, as we saw when we tried to add the item 20 above. A cluster of values
hashing to 0 had to be skipped to finally find an open position. This cluster is shown in Figure 9.

One way to deal with clustering is to extend the linear probing technique so that instead of
looking sequentially for the next open slot, we skip slots, thereby more evenly distributing the
items that have caused collisions. This will potentially reduce the clustering that occurs. Figure
10 shows the items when collision resolution is done with a "plus 3" probe. This means that once
a collision occurs, we will look at every third slot until we find one that is empty.
The general name for this process of looking for another slot after a collision is rehashing. With
simple linear probing, the rehash function is newhashvalue = rehash(oldhashvalue),

where rehash(pos) = (pos + 1) % sizeoftable. The "plus 3" rehash can be defined
as rehash(pos) = (pos + 3) % sizeoftable. In general, rehash(pos) = (pos + skip) % sizeoftable.
It is important to note that the size of the "skip" must be such that all the slots in the table will
eventually be visited. Otherwise, part of the table will be unused. To ensure this, it is often
suggested that the table size be a prime number. This is the reason we have been using 11 in our
examples.
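The requirement that every slot be reachable can be checked directly. A small sketch (the helper name is illustrative):

```python
def probe_sequence(start, skip, table_size):
    """Return the distinct slots visited by rehash(pos) = (pos + skip) % table_size."""
    seen, pos = [], start
    while pos not in seen:
        seen.append(pos)
        pos = (pos + skip) % table_size
    return seen

# With a prime table size (11), any skip eventually visits every slot:
assert len(probe_sequence(0, 3, 11)) == 11
# With table size 10 and skip 2, half the slots are never probed:
assert probe_sequence(0, 2, 10) == [0, 2, 4, 6, 8]
```

This is why a prime table size is recommended: the skip and the table size are then guaranteed to be relatively prime, so the probe sequence covers the whole table.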

A variation of the linear probing idea is called quadratic probing. Instead of using a constant
"skip" value, we use a rehash function that increments the hash value by 1, 3, 5, 7, 9, and so on.
This means that if the first hash value is h, the successive values are h+1, h+4, h+9, h+16, and
so on. In other words, quadratic probing uses a skip consisting of successive perfect squares.
Figure 11 shows our example values after they are placed using this technique.

Collision Resolution with Quadratic Probing

An alternative method for handling the collision problem is to allow each slot to hold a reference
to a collection (or chain) of items. Chaining allows many items to exist at the same location in
the hash table. When collisions happen, the item is still placed in the proper slot of the hash
table. As more and more items hash to the same location, the difficulty of searching for the item
in the collection increases. Figure 12 shows the items as they are added to a hash table that uses
chaining to resolve collisions.
Collision Resolution with Chaining
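The chaining idea described above can be sketched as follows (a minimal Python sketch; the class and method names are illustrative, not from the notes):

```python
class ChainedHashTable:
    """Hash table that resolves collisions by keeping a chain (list) per slot."""

    def __init__(self, size=11):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def insert(self, key):
        chain = self.slots[key % self.size]
        if key not in chain:  # keys are unique in a dictionary
            chain.append(key)

    def contains(self, key):
        # Only the chain at the key's own slot needs to be searched.
        return key in self.slots[key % self.size]

table = ChainedHashTable()
for k in [54, 26, 93, 17, 77, 31, 44, 55, 20]:
    table.insert(k)
print(table.slots[0])  # [77, 44, 55] -- the three colliding keys share slot 0
```

As the chains grow, searching degrades toward a linear scan of the chain, which is the cost of chaining noted above.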

The three major collision resolution strategies are:

o Linear Probing
o Quadratic Probing
o Double Hashing

1. Linear Probing:

It is a scheme in computer programming for resolving collisions in hash tables.

Suppose a new record R with key k is to be added to the memory table T, but the memory
location with hash address H (k) is already filled.

The natural way to resolve the collision is to assign R to the first available location following
T [H (k)]. We assume that the table T with m locations is circular, so that T [1] comes after T [m].

The above collision resolution is called "Linear Probing".

Linear probing is simple to implement, but it suffers from an issue known as primary clustering.
Long runs of occupied slots build up, increasing the average search time. Clusters arise because
an empty slot preceded by i full slots gets filled next with probability (i + 1)/m. Long runs of
occupied slots therefore tend to get longer, and the average search time increases.

Given an ordinary hash function h': U → {0, 1, ..., m-1}, the method of linear probing uses the hash
function

1. h (k, i) = (h' (k) + i) mod m

where m is the size of the hash table, h' (k) = k mod m, and i = 0, 1, ..., m-1.

Given key k, the first slot probed is T [h' (k)]. We next probe slot T [h' (k) + 1], and so on up to
slot T [m-1]. Then we wrap around to slots T [0], T [1], ..., until finally slot T [h' (k) - 1]. Since
the initial probe position determines the entire probe sequence, only m distinct probe sequences are
used with linear probing.

Example: Consider inserting the keys 24, 36, 58,65,62,86 into a hash table of size m=11 using
linear probing, consider the primary hash function is h' (k) = k mod m.

Solution: Initial state of hash table

Insert 24. We know h (k, i) = [h' (k) + i] mod m


Now h (24, 0) = [24 mod 11 + 0] mod 11
= (2+0) mod 11 = 2 mod 11 = 2
Since T [2] is free, insert key 24 at this place.

Insert 36. Now h (36, 0) = [36 mod 11 + 0] mod 11


= [3+0] mod 11 = 3
Since T [3] is free, insert key 36 at this place.

Insert 58. Now h (58, 0) = [58 mod 11 +0] mod 11


= [3+0] mod 11 =3
Since T [3] is not free, so the next sequence is
h (58, 1) = [58 mod 11 +1] mod 11
= [3+1] mod 11= 4 mod 11=4
T [4] is free; Insert key 58 at this place.

Insert 65. Now h (65, 0) = [65 mod 11 +0] mod 11


= (10 +0) mod 11= 10
T [10] is free. Insert key 65 at this place.

Insert 62. Now h (62, 0) = [62 mod 11 +0] mod 11


= [7 + 0] mod 11 = 7
T [7] is free. Insert key 62 at this place.

Insert 86. Now h (86, 0) = [86 mod 11 + 0] mod 11


= [9 + 0] mod 11 = 9
T [9] is free. Insert key 86 at this place.
Thus, after inserting all the keys, the final hash table is
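The worked insertions above can be sketched in Python as follows (a minimal sketch; the function name and list-based table are illustrative, not from the notes):

```python
def linear_probe_insert(table, key):
    """Insert key using h(k, i) = (k mod m + i) mod m, for i = 0, 1, ..., m-1."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")

table = [None] * 11
for k in [24, 36, 58, 65, 62, 86]:
    linear_probe_insert(table, k)

# 58 collides with 36 at slot 3 and is placed in slot 4 instead.
print(table)  # [None, None, 24, 36, 58, None, None, 62, None, 86, 65]
```

The resulting table matches the example: 24 in slot 2, 36 in slot 3, 58 displaced to slot 4, 65 in slot 10, 62 in slot 7, and 86 in slot 9.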

2. Quadratic Probing:

Suppose a record R with key k has the hash address H (k) = h. Then instead of searching the
locations with addresses h, h+1, h+2, ..., we search the locations with addresses

h, h+1, h+4, h+9, ..., h+i²

Quadratic probing uses a hash function of the form

h (k, i) = (h' (k) + c1i + c2i²) mod m

where (as in linear probing) h' is an auxiliary hash function, c1 and c2 (with c2 ≠ 0) are auxiliary
constants, and i = 0, 1, ..., m-1. The initial position probed is T [h' (k)]; later positions are
offset by amounts that depend in a quadratic manner on the probe number i.

Example: Consider inserting the keys 74, 28, 36,58,21,64 into a hash table of size m =11 using
quadratic probing with c1=1 and c2=3. Further consider that the primary hash function is h' (k) = k
mod m.

Solution: For Quadratic Probing, we have

h (k, i) = [k mod m + c1i + c2i²] mod m

This is the initial state of hash table

Here c1= 1 and c2=3


h (k, i) = [k mod m + i + 3i²] mod m
Insert 74.

h (74,0)= (74 mod 11+0+3x0) mod 11


= (8 +0+0) mod 11 = 8
T [8] is free; insert the key 74 at this place.

Insert 28.

h (28, 0) = (28 mod 11 + 0 + 3 x 0) mod 11


= (6 +0 + 0) mod 11 = 6.
T [6] is free; insert key 28 at this place.

Insert 36.

h (36, 0) = (36 mod 11 + 0 + 3 x 0) mod 11


= (3 + 0+0) mod 11=3
T [3] is free; insert key 36 at this place.

Insert 58.

h (58, 0) = (58 mod 11 + 0 + 3 x 0) mod 11


= (3 + 0 + 0) mod 11 = 3
T [3] is not free, so next probe sequence is computed as
h (58, 1) = (58 mod 11 + 1 + 3 x 1²) mod 11
= (3 + 1 + 3) mod 11
=7 mod 11= 7
T [7] is free; insert key 58 at this place.

Insert 21.

h (21, 0) = (21 mod 11 + 0 + 3 x 0) mod 11


= (10 + 0 + 0) mod 11 = 10
T [10] is free; insert key 21 at this place.

Insert 64.
h (64, 0) = (64 mod 11 + 0 + 3 x 0) mod 11
= (9 + 0 + 0) mod 11 = 9.
T [9] is free; insert key 64 at this place.

Thus, after inserting all keys, the hash table is
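The quadratic-probing example with c1 = 1 and c2 = 3 can be sketched as follows (the function name and list-based table are illustrative):

```python
def quadratic_probe_insert(table, key, c1=1, c2=3):
    """Insert key using h(k, i) = (k mod m + c1*i + c2*i**2) mod m."""
    m = len(table)
    for i in range(m):
        slot = (key % m + c1 * i + c2 * i * i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no empty slot found in the probe sequence")

table = [None] * 11
for k in [74, 28, 36, 58, 21, 64]:
    quadratic_probe_insert(table, k)

# 58 collides with 36 at slot 3; probe i=1 gives (3 + 1 + 3) mod 11 = 7.
print(table)  # [None, None, None, 36, None, None, 28, 58, 74, 64, 21]
```

Unlike linear probing, a quadratic probe sequence with arbitrary constants is not guaranteed to visit every slot, which is why the sketch raises an error rather than looping forever.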

3. Double Hashing:

Double Hashing is one of the best techniques available for open addressing because the
permutations produced have many of the characteristics of randomly chosen permutations.

Double hashing uses a hash function of the form

h (k, i) = (h1(k) + i h2 (k)) mod m

Where h1 and h2 are auxiliary hash functions and m is the size of the hash table.

h1 (k) = k mod m and h2 (k) = k mod m'. Here m' is slightly less than m (say, m-1 or m-2).

Example: Consider inserting the keys 76, 26, 37,59,21,65 into a hash table of size m = 11 using
double hashing. Consider that the auxiliary hash functions are h1 (k)=k mod 11 and h2(k) = k mod
9.

Solution: Initial state of Hash table is

1. Insert 76.
h1(76) = 76 mod 11 = 10
h2(76) = 76 mod 9 = 4
h (76, 0) = (10 + 0 x 4) mod 11
= 10 mod 11 = 10
T [10] is free, so insert key 76 at this place.

2. Insert 26.
h1(26) = 26 mod 11 = 4
h2(26) = 26 mod 9 = 8
h (26, 0) = (4 + 0 x 8) mod 11
= 4 mod 11 = 4
T [4] is free, so insert key 26 at this place.

3. Insert 37.
h1(37) = 37 mod 11 = 4
h2(37) = 37 mod 9 = 1
h (37, 0) = (4 + 0 x 1) mod 11 = 4 mod 11 = 4
T [4] is not free, the next probe sequence is
h (37, 1) = (4 + 1 x 1) mod 11 = 5 mod 11 = 5
T [5] is free, so insert key 37 at this place.

4. Insert 59.
h1(59) = 59 mod 11 = 4
h2(59) = 59 mod 9 = 5
h (59, 0) = (4 + 0 x 5) mod 11 = 4 mod 11 = 4
Since, T [4] is not free, the next probe sequence is
h (59, 1) = (4 + 1 x 5) mod 11 = 9 mod 11 = 9
T [9] is free, so insert key 59 at this place.

5. Insert 21.
h1(21) = 21 mod 11 = 10
h2(21) = 21 mod 9 = 3
h (21, 0) = (10 + 0 x 3) mod 11 = 10 mod 11 = 10
T [10] is not free, the next probe sequence is
h (21, 1) = (10 + 1 x 3) mod 11 = 13 mod 11 = 2
T [2] is free, so insert key 21 at this place.

6. Insert 65.
h1(65) = 65 mod 11 = 10
h2(65) = 65 mod 9 = 2
h (65, 0) = (10 + 0 x 2) mod 11 = 10 mod 11 = 10
T [10] is not free, the next probe sequence is
h (65, 1) = (10 + 1 x 2) mod 11 = 12 mod 11 = 1
T [1] is free, so insert key 65 at this place.
Thus, after insertion of all keys the final hash table is
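The double-hashing example with h1(k) = k mod 11 and h2(k) = k mod 9 can be sketched as follows (function name and list-based table are illustrative):

```python
def double_hash_insert(table, key, m_prime=9):
    """Insert key using h(k, i) = (h1(k) + i*h2(k)) mod m,
    with h1(k) = k mod m and h2(k) = k mod m_prime."""
    m = len(table)
    h1, h2 = key % m, key % m_prime
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no empty slot found in the probe sequence")

table = [None] * 11
for k in [76, 26, 37, 59, 21, 65]:
    double_hash_insert(table, k)

# 37, 59, 21 and 65 all collide on their first probe and then step by h2(k).
print(table)  # [None, 65, 21, None, 26, 37, None, None, None, 59, 76]
```

One caveat: if h2(k) ever evaluates to 0 the probe never advances, so a real implementation would use something like 1 + (k mod m'); the example keys happen to avoid that case.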

Rehashing

Rehashing is a technique that dynamically expands the size of a hash table (such as a Map or
Hashtable) to maintain O(1) complexity for the get and put operations.

It can also be defined as the process of re-calculating the hash codes of already stored entries
and moving them to a bigger hash table when the number of elements in the map reaches the
maximum threshold value.

In simple words, rehashing is hashing performed again over a larger table; it is how the table
retains its performance as it fills. In rehashing, the load factor plays a vital role.

Load Factor

The load factor is a measure that decides when to increase the HashMap or Hashtable capacity to
maintain the get() and put() operation of complexity O(1). The default value of the load factor of
HashMap is 0.75 (75% of the map size). In short, we can say that the load factor decides "when
to increase the number of buckets to store the key-value pair."

LARGER load factor: lower space consumption but higher lookup cost.

SMALLER load factor: larger space consumption compared to the required number of elements, but
faster lookups.

How is the Resize Threshold Calculated?

The threshold at which the table grows can be calculated by using the following formula:

1. Threshold = the initial capacity of the HashMap * the load factor of the HashMap

Example:

The initial capacity of HashMap is = 16


The default load factor of HashMap = 0.75
According to the formula: 16*0.75 = 12

This means the HashMap keeps its size at 16 buckets up to the 12th key-value pair. As soon as the
13th element (key-value pair) comes into the HashMap, it will increase its size from the
default 2^4 = 16 buckets to 2^5 = 32 buckets.
Load Factor Example

Let's understand the load factor through an example.

We know that the default bucket size of the HashMap is 16. We insert the first element, now
check whether we need to increase the HashMap capacity or not. It can be determined by the
formula:

Size of the HashMap (m) / Number of Buckets (n)

In this case, the size of the HashMap is 1, and the bucket size is 16. So, 1/16 = 0.0625. Now
compare the obtained value with the default load factor (0.75).

0.0625 < 0.75

The value is smaller than the default value of the load factor. So, no need to increase the
HashMap size. Therefore, we do not need to increase the size of the HashMap up to the
12th element because

12/16 = 0.75

The obtained value is equal to the default load factor, i.e., 0.75.

As soon as we insert the 13th element into the HashMap, the size of the HashMap is increased because:

13/16 = 0.8125

Here, the obtained value is greater than the default load factor value.

0.8125 > 0.75

In order to insert the 13th element into the HashMap, we need to increase the HashMap size.

If you want to keep get() and put() operation complexity O(1), it is advisable to have a load factor
around 0.75.
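The resize decision described above can be sketched in a few lines (the function name and parameters are illustrative, not from any particular library):

```python
def needs_resize(num_entries, num_buckets, load_factor=0.75):
    """Grow the table when entries / buckets exceeds the load-factor threshold."""
    return num_entries / num_buckets > load_factor

# With 16 buckets the threshold is 16 * 0.75 = 12 entries:
assert not needs_resize(12, 16)  # 12/16 = 0.75, still within the threshold
assert needs_resize(13, 16)      # 13/16 = 0.8125 > 0.75, time to rehash
```

This mirrors the worked example: the 12th entry exactly meets the threshold, and the 13th entry triggers the rehash.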

Why rehashing is required?

Rehashing is required when the load factor increases. The load factor increases when we insert
key-value pair in the map and it also increases the time complexity. Generally, the time complexity
of HashMap is O(1). In order to reduce the time complexity and load factor of the HashMap, we
implement the rehashing technique.

Dynamic Hashing

o The dynamic hashing method is used to overcome the problems of static hashing, like
bucket overflow.
o In this method, data buckets grow or shrink as the records increase or decrease. This
method is also known as the extendible hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting
in poor performance.

How to search a key


o First, calculate the hash address of the key.
o Check how many bits are used in the directory; this number of bits is called i.
o Take the least significant i bits of the hash address. This gives an index into the directory.
o Now, using this index, go to the directory and find the address of the bucket where the record
might be.
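The directory-index step above can be sketched with a bit mask (the function name is illustrative):

```python
def directory_index(hash_address, i):
    """The least significant i bits of the hash address index the directory."""
    return hash_address & ((1 << i) - 1)

# Key 9 in the example below has hash address 10001 (binary). With a 2-bit
# directory the last two bits are 01, pointing at bucket B1; after a split
# the directory uses 3 bits, and the last three bits are 001.
assert directory_index(0b10001, 2) == 0b01
assert directory_index(0b10001, 3) == 0b001
```

Doubling the directory simply increases i by one, which is what happens when a full bucket is split.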

How to insert a new record


o First, follow the same procedure as for retrieval, ending up in some bucket.
o If there is still space in that bucket, then place the record in it.
o If the bucket is full, then we will split the bucket and redistribute the records.

For example:

Consider the following grouping of keys into buckets, depending on the prefix of their hash
address:

The last two bits of 2 and 4 are 00, so they go into bucket B0. The last two bits of 5 and 6 are
01, so they go into bucket B1. The last two bits of 1 and 3 are 10, so they go into bucket B2.
The last two bits of 7 are 11, so it goes into bucket B3.
Insert key 9 with hash address 10001 into the above structure:
o Since key 9 has hash address 10001, its last two bits are 01, so it must go into bucket B1. But
bucket B1 is full, so it will get split.
o The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they go into
bucket B1, while the last three bits of 6 are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100 entries
because the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110 entries
because the last two bits of both entries are 10.
o Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 entries because the
last two bits of both entries are 11.
Advantages of dynamic hashing
o In this method, the performance does not decrease as the data grows in the system. The method
simply increases the size of the memory to accommodate the data.
o In this method, memory is well utilized as it grows and shrinks with the data. There will
not be any memory lying unused.
o This method is good for dynamic databases where data grows and shrinks frequently.

Disadvantages of dynamic hashing


o In this method, if the data size increases, the number of buckets also increases, and the
addresses of the buckets must be maintained in the bucket address table. Because bucket
addresses keep changing as buckets grow and shrink, a huge increase in data makes
maintaining the bucket address table tedious.
o The bucket overflow situation can still occur here, but it takes longer to reach than in
static hashing.
