
Data Structures Algorithm and Design

DSECFZG519 Akshaya Ganesan


BITS Pilani Asst. Prof [OFF-CAMPUS]
Pilani|Dubai|Goa|Hyderabad

Page 1
Dictionary ADT
• The dictionary ADT models a searchable collection of key-
element items.
• A dictionary stores key-element pairs ( k , e), which we call
items, where k is the key and e is the element
• The main operations of a dictionary are searching, inserting,
and deleting items
• Multiple items with the same key are allowed
• Applications:
– Address book
– Document diff
– Mapping host names (e.g., cs16.net) to internet
addresses (e.g., 128.148.34.101)

Page 2
The Unordered Dictionary ADT
• A dictionary stores key-element pairs ( k , e), which we call
items, where k is the key and e is the element.
• For example, in a dictionary storing student records (such as
the student's name, address, and course grades), the key might
be the student's ID number.
• A key is an identifier that is assigned by an application or user
to an associated element.
• In cases when keys are unique, the key associated with an
object can be viewed as an "address" for that object in
memory.

Page 3
Dictionary ADT methods:
• Dictionary ADT methods:
– findElement(k): if the dictionary has an item with key k,
returns its element, else, returns the special element
NO_SUCH_KEY
– insertItem(k, e): Insert an item with element e and key k
into D
– removeElement(k): if the dictionary has an item with key
k, removes it from the dictionary and returns its element,
else returns the special element NO_SUCH_KEY
– size(), isEmpty()
– keys(), elements() - iterators

Page 4
Dictionary ADT methods:
• The special element (NO_SUCH_KEY) is known as a sentinel.
• If we wish to store an item 'e' in a dictionary so that the item is itself its own key, then we would insert e with the method call insertItem(e, e).
• findAllElements(k), which returns an iterator of all elements with key equal to k
• removeAllElements(k), which removes from D all the items with key equal to k

Page 5
Log Files/Unordered Sequence Implementation
• A log file is a dictionary implemented by means of an
unsorted sequence
• Often called an audit trail
• We store the items of the dictionary in a sequence
(based on a vector or list to store the key-element
pairs), in arbitrary order
• The space required for a log file is Θ(n), since the
vector data structure can maintain its memory usage
to be proportional to its size.

Page 6
Log Files
• Performance:
– Insertion? insertItem takes O(1) time since we can insert the new item at the end of the sequence
– Search? Removal? findElement and removeElement take O(n) time since in the worst case (the item is not found) we traverse the entire sequence to look for an item with the given key
• The log file is effective only for dictionaries of small size or
for dictionaries on which insertions are the most common
operations, while searches and removals are rarely performed
• (e.g., historical record of logins to a workstation)
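The log-file behavior above can be sketched in Java. This is a minimal illustration, not the textbook's code: the class name, the use of long keys and elements, and returning null in place of the NO_SUCH_KEY sentinel are all assumptions.

```java
import java.util.ArrayList;

// Log file: an unsorted sequence of (key, element) pairs.
class LogFile {
    private final ArrayList<long[]> items = new ArrayList<>(); // each entry: {key, element}

    // O(1): append the new item at the end of the sequence
    void insertItem(long k, long e) { items.add(new long[]{k, e}); }

    // O(n): in the worst case we traverse the entire sequence
    Long findElement(long k) {
        for (long[] it : items) if (it[0] == k) return it[1];
        return null; // stands in for NO_SUCH_KEY
    }

    // O(n): scan, remove the first item with key k, return its element
    Long removeElement(long k) {
        for (int i = 0; i < items.size(); i++)
            if (items.get(i)[0] == k) return items.remove(i)[1];
        return null; // stands in for NO_SUCH_KEY
    }

    int size() { return items.size(); }
}
```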

Page 7
Hash Tables
• A hash table for a given key type consists of
– Hash function h
– Array (called table) of size N
• When implementing a dictionary with a hash
table, the goal is to store item (k, e) at index i = h(k)

Page 8
Hash Tables
• A hash table consists of two major components,
– Bucket array
– Hash Functions.

Page 9
Bucket Arrays
• A bucket array for a hash table is an array A of size N,
where each cell of A is thought of as a "bucket" (that
is, a container of key-element pairs)
• Integer N defines the capacity of the array.
• An element e with key k is simply inserted into the bucket A[k].
• Any bucket cells associated with keys not present in the dictionary are assumed to hold the special NO_SUCH_KEY object.

Page 10
Bucket Arrays
• If keys are not unique, then two different elements may be mapped to the same bucket in A: a collision has occurred.
• If each bucket of A can store only a single element, then we cannot associate more than one element with a single bucket, a problem in the case of collisions.
• There are ways of dealing with collisions, but the best strategy is to try to avoid them in the first place
Page 12
Bucket Arrays - Analysis
• If keys are unique, then collisions are not a concern, and searches, insertions, and removals in the hash table take worst-case time O(1).
• It has two major drawbacks:
– It uses space Θ(N)
– The bucket array requires keys to be unique integers in the range [0, N - 1], which is often not the case

Page 13
Bucket Arrays - Analysis
• What can we do?
• Define the hash table data structure to consist of a bucket array together with a "good" mapping from our keys to integers in the range [0, N - 1]

Page 14
Hash Functions
• A hash function h maps keys of a given type
to integers in a fixed interval [0, N - 1]
• Example:
h(x) = x mod N
is a hash function for integer keys
• The integer h(x) is called the hash value of key x.

Page 15
Hash Functions
• A hash function maps each key k in our dictionary to an integer in the range [0, N - 1], where N is the capacity of the bucket array for this table.
• Now, bucket array method can be applied to arbitrary
keys

Page 16
Hash Functions
• Main Idea
• Use the hash function value, h(k), as an index into the bucket array, A, instead of the key k (which is most likely inappropriate for use as a bucket array index).
• That is, store the item (k, e) in the bucket A[h(k)].
• A hash function is "good" if it maps the keys in our
dictionary so as to minimize collisions as much as
possible.

Page 17
Hash Tables - Example
• We design a hash table for a dictionary storing items (SSN, Name), where SSN (social security number) is a nine-digit positive integer
• The hash table uses an array of size N = 10,000 and the hash function h(x) = last four digits of x

0     ∅
1     025-612-0001
2     981-101-0002
3     ∅
4     451-229-0004
...
9997  ∅
9998  200-751-9998
9999  ∅

Page 18
Evaluation of a hash function, h(k)
• A hash function, h(k), is usually specified as the composition of two functions:
Hash code map:
h1: keys → integers [mapping the key k to an integer]
Compression map:
h2: integers → [0, N - 1] [mapping the hash code to an integer within the range of indices of a bucket array]

Page 19
Evaluation of a hash function, h(k)
• The hash code map is applied first, and
the compression map is applied next on
the result, i.e.,
h(x) = h2(h1(x))
• The goal of the hash function is to “disperse” the
keys in an apparently random way.
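The two-stage evaluation h(x) = h2(h1(x)) can be sketched as follows, pairing a component-sum hash code for long keys with division compression; the class and method names are illustrative:

```java
// h(x) = h2(h1(x)): hash code map first, then compression map.
class ComposedHash {
    // h1: keys -> integers (component sum of the two 32-bit halves)
    static int h1(long key) {
        return (int) ((key >>> 32) + (int) key);
    }

    // h2: integers -> [0, N-1] (floorMod keeps negative codes in range)
    static int h2(int code, int N) {
        return Math.floorMod(code, N);
    }

    static int h(long key, int N) {
        return h2(h1(key), N);
    }
}
```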

Page 21
Hash Code Maps
• To be consistent with all of our keys, the hash code we
use for a key k should be the same as the hash code for
any key that is equal to k.
• Memory address as Hash Codes:
– We reinterpret the memory address of the key object as an
integer (default hash code of all Java objects)
– Good in general, except for numeric and string keys

Page 22
Hash Code Maps
• Component sum:
• We partition the bits of the key into components of fixed
length (e.g., 16 or 32 bits) and we sum the components
(ignoring overflows)
• Suitable for numeric keys of fixed length greater than or
equal to the number of bits of the integer type (e.g., long
and double in Java)
Ex:
static int hashCode(long i)
{return (int)((i >>> 32) + (int) i);}

Page 23
Hash Code Maps
• This approach of summing components can be further extended
to any object x whose binary representation can be viewed as a
k-tuple (x0, x1, ..., xk-1) of integers, for we can then form a hash
code by summing xi.
• Ex. Given any floating point number, we can sum its mantissa
and exponent as integers and then apply a hash code for long
integers to the result.
• These computations of hash codes for the primitive types are
actually used by the corresponding wrapper classes in their
implementations of the method hashCode().

Page 24
Hash Code Maps

• Example: the characters of the string "SPOT" as integer components:

S    P    O    T
83   80   79   84

Page 25
Hash Code Maps
• Polynomial accumulation:
– We choose a nonzero constant, a ≠ 1, and calculate
x0·a^(k-1) + x1·a^(k-2) + ... + x(k-2)·a + x(k-1)
as the hash code, ignoring overflows.
– This is simply a polynomial in a that takes the components (x0, x1, ..., xk-1) of an object x as its coefficients
– Especially suitable for strings (e.g., the choice a = 33 gives at most 6 collisions on a set of 50,000 English words)
– The polynomial p(a) can be evaluated in O(k) time using Horner's rule
– Refer to Section 30.1 in the CLRS textbook
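A sketch of polynomial accumulation evaluated by Horner's rule (the class name is illustrative; a is passed as a parameter). With a = 31 this is exactly the polynomial that Java's own String.hashCode() computes:

```java
// Polynomial hash x0*a^(k-1) + x1*a^(k-2) + ... + x(k-1),
// evaluated left to right by Horner's rule, ignoring overflow.
class PolyHash {
    static int hash(String s, int a) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = a * h + s.charAt(i); // ((x0*a + x1)*a + x2)*a + ...
        return h;
    }
}
```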

Page 26
Hash Code Maps

S    P    O    T    S
83   80   79   84   83

h(SPOTS) = 83·a^4 + 80·a^3 + 79·a^2 + 84·a + 83

Page 27
Hash Code Maps
• Many Java implementations choose the polynomial hash function.
• For the sake of speed, however, some Java implementations only apply the polynomial hash function to a fraction of the characters in long strings, say every 8 characters.
• This computation can cause an overflow, especially for long strings. Java ignores these overflows.
• 31, 33, 37, 39, and 41 are particularly good choices for 'a' when working with character strings that are English words
• Default Java String.hashCode() uses a = 31

Page 28
Compression Maps
• Division:
– Let y = h1(k) be the integer hash code for a key object k
– h2 (y) = y mod N
– The size N of the hash table is usually chosen to be a prime
– The reason has to do with number theory
• Multiply, Add and Divide (MAD):
– h2 (y) = (ay + b) mod N
– a and b are nonnegative integers such that a mod N ≠ 0
– Otherwise, every integer would map to the same value b

Page 29
Compression Maps
• Multiply, Add and Divide (MAD):
– Hash index = ((a × hashCode + b) % p) % N, with:
p = a prime number > N
a = an integer from the range [1..(p-1)]
b = an integer from the range [0..(p-1)]
– Mathematical analysis shows that this compression function spreads integers (more) evenly over the range [0..(N-1)] if we use a prime number for p
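A MAD compression map might be sketched as below. The prime p = 109345121 is an arbitrary illustrative choice (any prime larger than N works), and Math.floorMod keeps the result nonnegative even when the hash code is negative; in practice a and b would be chosen at random once per table.

```java
// MAD compression: ((a * hashCode + b) mod p) mod N
class MadCompression {
    static final long P = 109345121L; // an assumed prime, larger than any N used here

    static int compress(int hashCode, long a, long b, int N) {
        // floorMod handles negative hash codes; the outer % N is then safe
        return (int) (Math.floorMod(a * hashCode + b, P) % N);
    }
}
```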

Page 30
Evaluating Hashing Functions: TRY OUT!!!
The following three hashing functions can be considered:
• t1: using the length of the string as its hash value
• t2: adding the components of the string as its hash value
• t3: hashing the first three characters of the string with polynomial hashing
• The compression function: the division method
• The input file can have nearly 4000 random names and 4000 unique words from C source code

Page 31
Collision-Handling Schemes
• Collisions occur when different elements are mapped
to the same cell
• Separate Chaining: let each cell in the table point to a
linked list of elements that map there

0  ∅
1  025-612-0001
2  ∅
3  ∅
4  451-229-0004 → 981-101-0004

Page 32
Collision-Handling Schemes
• Chaining is simple, but requires additional memory
outside the table


Page 33
Separate Chaining

Example of a hash table of size 13, storing 10 integer keys, with collisions resolved by the chaining method. The compression map in this case is h(k) = k mod 13.

Page 37
Separate Chaining
• A good hash function will try to minimize collisions as
much as possible, which will imply that most of our buckets
are either empty or store just a single entry.
• Assuming we use a good hash function to index the n entries of our map in a bucket array of capacity N, we expect each bucket to be of size n/N on average
• This value is called the load factor of the hash table
• It should be bounded by a small constant, preferably below 1

Page 38
Separate Chaining
• For a good hash function, the expected running time of the operations findElement, insertItem, and removeElement in a dictionary implemented with a hash table that uses separate chaining to resolve collisions is O(n/N).
• Thus we can expect the standard dictionary operations to run in O(1) expected time provided we know that n is O(N)
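Separate chaining can be sketched as a bucket array of linked lists. Integer keys and h(k) = k mod N follow the chaining example in the slides, while the class layout itself is an assumption:

```java
import java.util.LinkedList;

// Separate chaining: each bucket holds a list of (key, element) pairs.
class ChainedHashTable {
    private final LinkedList<long[]>[] buckets; // each entry: {key, element}
    private int n = 0;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int N) {
        buckets = new LinkedList[N];
        for (int i = 0; i < N; i++) buckets[i] = new LinkedList<>();
    }

    private int h(long k) { return (int) Math.floorMod(k, buckets.length); }

    void insertItem(long k, long e) { buckets[h(k)].add(new long[]{k, e}); n++; }

    // Expected O(n/N): only the chain at bucket h(k) is scanned
    Long findElement(long k) {
        for (long[] it : buckets[h(k)]) if (it[0] == k) return it[1];
        return null; // NO_SUCH_KEY
    }

    double loadFactor() { return (double) n / buckets.length; }
}
```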

Page 39
Rehashing
• A maximum load factor of 0.75 is common
• Whenever we add elements we need to increase the size of our
bucket array and change our compression map to match this
new size, in order to keep the load factor below the specified
constant.
• Moreover, we must then insert all the existing hash-table
elements into the new bucket array using the new compression
map. Such a size increase and hash table rebuild is called
rehashing
• A good choice is to rehash into an array roughly double the
size of the original array, choosing the size of the new array to
be a prime number
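The capacity choice when rehashing, a prime at least double the old size, can be sketched with a simple trial-division primality test (the class and method names are illustrative):

```java
// Pick the new bucket-array capacity for rehashing:
// the smallest prime that is at least double the old capacity.
class Rehash {
    static boolean isPrime(int p) {
        if (p < 2) return false;
        for (int d = 2; (long) d * d <= p; d++)
            if (p % d == 0) return false;
        return true;
    }

    static int grownCapacity(int oldN) {
        int m = 2 * oldN;
        while (!isPrime(m)) m++;
        return m;
    }
}
```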

Page 42
Open Addressing
• Open addressing: the colliding item is placed in a
different cell of the table
• This approach saves space because no auxiliary
structures are employed, but it requires a bit more
complexity to deal with collisions

Page 43
Linear Probing
• Linear probing handles collisions by placing the
colliding item in the next (circularly) available table
cell
• Each table cell inspected is referred to as a “probe”
• Colliding items lump together, causing future collisions to produce longer sequences of probes

Page 44
Linear Probing
• In this method, if we try to insert an entry (k, v) into a
bucket A[i] that is already occupied, where i = h(k),
then we try next at A[(i+1) mod N].
• This process will continue until we find an empty
bucket that can accept the new entry.

Page 45
Linear Probing
Example: An insertion into a hash table using
linear probing to resolve collisions.
The compression map is h(k) = k mod 11.

Page 46
Linear Probing
• Example:
– h(x) = x mod 13
– Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order

index:  0   1   2   3   4   5   6   7   8   9   10  11  12
key:    -   -   41  -   -   18  44  59  32  22  31  73  -

Page 47
Linear Probing
• Example
– We want to add the following (phone, address) entries to an addressBook with
size 101:
– addressBook.add("869-1264", "8-128");
– addressBook.add("869-8132", "9-101");
– addressBook.add("869-4294", "8-156");
– addressBook.add("869-2072", "9-101");
The hash function is h(k) = (k % 10000) % 101

All of the above keys (phone numbers) map to index 52. By linear probing, all entries will be put into indices 52 - 55

Page 48
Search with Linear Probing
• Consider a hash table A that uses linear probing
• findElement(k)
– We start at cell h(k)
– We probe consecutive locations until one
of the following occurs
• An item with key k is found, or
• An empty cell is found, or
• N cells have been unsuccessfully probed

Page 49
Updates with Linear Probing
• To handle insertions and deletions, we introduce a special object, called AVAILABLE, which replaces deleted elements
removeElement(k)
• We search for an item with key k
• If such an item (k, e) is found, we replace it with the special
item AVAILABLE and we return element e
• Else, we return NO_SUCH_KEY

Page 50
Updates with Linear Probing
• insertItem(k, e)
– We throw an exception if the table is full
– We start at cell h(k)
– We probe consecutive cells until one of the following occurs
• A cell i is found that is either empty or stores AVAILABLE, or
• N cells have been unsuccessfully probed
– We store item (k, e) in cell i
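The three operations above can be sketched together. The representation is an assumption: null marks an empty cell, and a boxed Long.MIN_VALUE stands in for the AVAILABLE sentinel (a real implementation would use a dedicated marker object rather than reserving a key value).

```java
// Linear probing with an AVAILABLE sentinel for deletions.
class LinearProbingTable {
    private static final long AVAILABLE = Long.MIN_VALUE; // sentinel key, assumed never inserted
    private final Long[] keys;  // null = empty cell
    private final Long[] elems;

    LinearProbingTable(int N) { keys = new Long[N]; elems = new Long[N]; }

    private int h(long k) { return (int) Math.floorMod(k, keys.length); }

    void insertItem(long k, long e) {
        for (int j = 0; j < keys.length; j++) {
            int i = (h(k) + j) % keys.length;
            if (keys[i] == null || keys[i] == AVAILABLE) { // empty or reusable cell
                keys[i] = k; elems[i] = e; return;
            }
        }
        throw new IllegalStateException("table full"); // N cells probed
    }

    Long findElement(long k) {
        for (int j = 0; j < keys.length; j++) {
            int i = (h(k) + j) % keys.length;
            if (keys[i] == null) return null;       // empty cell ends the search
            if (keys[i] != AVAILABLE && keys[i] == k) return elems[i];
        }
        return null; // N cells probed without success
    }

    Long removeElement(long k) {
        for (int j = 0; j < keys.length; j++) {
            int i = (h(k) + j) % keys.length;
            if (keys[i] == null) return null;
            if (keys[i] != AVAILABLE && keys[i] == k) {
                keys[i] = AVAILABLE;                // keep the probe chain intact
                return elems[i];
            }
        }
        return null;
    }
}
```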

Page 51
Problem with Linear Probing
• Primary Clustering: runs of occupied cells build up, so a key hashing anywhere into a run must probe to the end of that run, which makes the run and future probe sequences even longer.

Page 52
Quadratic Probing
• This open addressing strategy involves iteratively trying the buckets
A[(i + f(j)) mod N],
for j = 0, 1, 2, ..., where f(j) = j^2, until finding an empty bucket
• This strategy may fail to find an empty slot even when the table is not full: if the bucket array is at least half full, quadratic probing is not guaranteed to find an empty bucket
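The half-full limitation can be checked empirically: from any start index i, quadratic probing only ever visits the cells (i + j^2) mod N, and for a prime N there are exactly (N + 1)/2 distinct such cells. A small illustrative sketch:

```java
import java.util.HashSet;
import java.util.Set;

// How many distinct cells can quadratic probing ever reach from start index i?
class QuadraticCoverage {
    static int reachableCells(int i, int N) {
        Set<Integer> seen = new HashSet<>();
        for (int j = 0; j < N; j++)
            seen.add((i + j * j) % N); // the cells A[(i + f(j)) mod N], f(j) = j^2
        return seen.size();
    }
}
```

So once more than (N + 1)/2 cells are occupied, an insertion can fail even though empty cells remain.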

Page 53
Linear vs Quadratic Probing
• An advantage of linear probing is that it can reach
every location in the hash table.
• This property is important since it guarantees the
success of the insertItem operation when the hash
table is not full.
• Quadratic probing can only guarantee a successful
insertItem operation when the hash table is at most
half full.

Page 54
Double Hashing
• Double hashing uses a secondary hash function h'
• If h maps some key k to a bucket A[i], with i = h(k), that is
already occupied, then we iteratively try the bucket
– A[(i + f(j)) mod N],
– for j = 1, 2, 3, ..., where f(j) = j*h'(k)
• The secondary hash function cannot have zero values
• The table size N must be a prime to allow probing of all
the cells
• Choose a secondary hash function that will attempt
to minimize clustering as much as possible

Page 55
Double Hashing
• Common choice of compression map for the secondary hash function:
h'(k) = q - (k mod q)
where
– q < N
– q is a prime
• The possible values for h'(k) are 1, 2, ..., q

Page 56
Example of Double Hashing
• Consider a hash table storing integer keys that handles collisions with double hashing
– N = 13
– h(k) = k mod 13
– h'(k) = 7 - (k mod 7)
• Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order

Page 57
Example of Double Hashing
k     h(k)  h'(k)  Probes
18    5     3      5
41    2     1      2
22    9     6      9
44    5     5      5 10
59    7     4      7
32    6     3      6
31    5     4      5 9 0
73    8     4      8

index:  0   1   2   3   4   5   6   7   8   9   10  11  12
key:    31  -   41  -   -   18  32  59  73  22  44  -   -
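The probe sequences in the table above can be reproduced with a small sketch (the class layout is illustrative; insert returns the cell finally used so each placement can be checked against the table):

```java
// Double hashing with h(k) = k mod 13 and h'(k) = 7 - (k mod 7).
class DoubleHashingTable {
    static final int N = 13;
    private final Long[] keys = new Long[N];

    // Returns the index of the cell where key k ends up.
    int insert(long k) {
        int i = (int) (k % N);            // h(k), keys here are nonnegative
        int step = (int) (7 - k % 7);     // h'(k), always in 1..7, never zero
        for (int j = 0; j < N; j++) {
            int cell = (int) ((i + (long) j * step) % N);
            if (keys[cell] == null) { keys[cell] = k; return cell; }
        }
        throw new IllegalStateException("table full");
    }
}
```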

Page 58
Performance of Hashing
• In the worst case, searches, insertions and removals on a hash
table take O(n) time
• The worst case occurs when all the keys inserted into
the dictionary collide
• The load factor α = n/N affects the performance of a hash table
• Assuming that the hash values are like random numbers, it can be shown that the expected number of probes for an insertion with open addressing is
1 / (1 - α) [R2: Section 11.4, Theorems 11.6 and 11.8]

Page 59
Performance of Hashing
• The expected running time of all the dictionary ADT
operations in a hash table is O(1)
• In practice, hashing is very fast provided the
load factor is not close to 100%
• Applications of hash tables:
– small databases
– compilers
– browser caches

Page 60
Application

• In the cryptographic sense, hash functions must have two properties to be useful: they must be one-way and must be collision resistant.
• For these reasons, simple checksums and CRCs are not good hash functions for cryptography.
• Being one-way implies that given the output of a hash function, learning anything useful about the input is nontrivial.
• Most trivial checksums are not one-way, since they are linear functions.
• For short enough inputs, deducing the input from the output is often a simple computation.
Thank You!!

Page 62

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
