Reminder: Dictionary ADT
Dictionary operations
 create
 destroy
 insert
 find
 delete
Example entries (key: value)
 Adrien: Roller-blade demon
 Hannah: C++ guru
 Dave: Older than dirt
 Donald: l33t haxtor
Example calls
 insert(Donald, l33t haxtor)
 find(Adrien) returns (Adrien, Roller-blade demon)

Stores values associated with user-specified keys
 values may be any (homogeneous) type
 keys may be any (homogeneous) comparable type

Dictionary Implementations So Far
                     Insert     Find       Delete
Unsorted list        O(1)       O(n)       O(n)
Trees                O(log n)   O(log n)   O(log n)
Sorted array         O(n)       O(log n)   O(n)
Array special case   O(1)       O(1)       O(1)
known keys ∈ {1, … ,

ADT Legalities: A Digression on Keys
Methods are the contract between an ADT and the
outside agent (client code)
 Ex: Dictionary contract is {insert, find, delete}
 Ex: Priority Q contract is {insert, deleteMin}

Keys are the currency used in transactions between an outside agent and ADT
 Ex: insert(key), find(key), delete(key)

So …
 How about O(1) insert/find/delete for any key type?

Hash Table Goal: Key as Index
We can access a record as a[5]
 a[2] holds (Adrien, roller-blade demon)
 a[5] holds (Hannah, C++ guru)
We want to access a record as a[“Hannah”]
 a[“Adrien”] holds (Adrien, roller-blade demon)
 a[“Hannah”] holds (Hannah, C++ guru)

Hash Table Approach

Adrien f(x)

But… is there a problem with this pipe-dream?

Hash Table
Dictionary Data Structure
Hash function: maps keys to table indices
Can quickly find the right spot for a given entry
Unordered and sparse table
Result:
 Cannot efficiently list all entries
 Cannot efficiently find min, max, ordered ranges
(figure: keys such as Dave, Adrien, and Ed pass through f(x) into table slots)

Hash Table Taxonomy
hash function
Adrien f(x)

load factor λ = (# of entries in table) / tableSize


Hash Table Design

What should the hash function be?

What should the table size be?

How should we resolve collisions?

Hash Function
Hash function maps a key to a table index
Value & find(Key & key) {
  int index = hash(key) % tableSize;
  return Table[index];
}

What Makes A Good Hash
Fast runtime

O(1) and fast in practical terms

Distributes the data evenly

hash(a) % size ≠ hash(b) % size

Uses the whole hash table

for all 0 ≤ i < size, ∃ k such that hash(k) % size = i

Good Hash Function for
Integer Keys

tableSize is prime

hash(n) = n

Example:

tableSize = 7

insert(4)
find(12)
delete(17)
(figure: table with slots 0 through 6)
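
The integer hash above can be sketched as follows. This is a minimal illustration that assumes non-negative keys; the name hashInt is hypothetical and does not come from the slides.

```cpp
#include <cstddef>

// Simple integer hash: the key itself, reduced modulo a (prime) table size.
// Assumes tableSize > 0; an unsigned key avoids negative remainders.
std::size_t hashInt(unsigned key, std::size_t tableSize) {
    return key % tableSize;
}
```

With tableSize = 7, the keys 4, 12, and 17 from the example land in slots 4, 5, and 3.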

Good Hash Function for Strings?
Let s = s1s2s3s4…sn: choose
 hash(s) = s1 + s2·128 + s3·128^2 + s4·128^3 + … + sn·128^(n-1)

Think of the string as a base 128 (aka radix 128) number

Problems:

hash(“really, really big”) = well… something really, really big

hash(“one thing”) % 128 = hash(“other thing”) % 128
(mod 128 keeps only s1, and both strings start with “o”)

String Hashing

Issues and Techniques

Minimize collisions

Make tableSize and radix relatively prime
Typically, make tableSize not a multiple of 128

Simplify computation

Use Horner’s Rule

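// Horner's Rule: evaluate the base-128 polynomial from the last character
// down, reducing mod tableSize at every step so h stays small.
int hash(const std::string & s) {
  int h = 0;
  for (int i = s.length() - 1; i >= 0; i--) {
    h = (s[i] + 128*h) % tableSize;
  }
  return h;
}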

Good Hashing:

Multiplication Method
Hash function is defined by size plus a parameter A
hA(k) = ⌊size * (k*A mod 1)⌋ where 0 < A < 1

Example: size = 10, A = 0.485

hA(50) = ⌊10 * (50*0.485 mod 1)⌋
= ⌊10 * (24.25 mod 1)⌋ = ⌊10 * 0.25⌋ = ⌊2.5⌋ = 2

 no restriction on size!
 when building a static table, we can try several values of A
 more computationally intensive than a single mod
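
A minimal sketch of the multiplication method, assuming non-negative integer keys and double-precision arithmetic. The function name multiplicationHash is illustrative only.

```cpp
#include <cmath>

// Multiplication-method hash: floor(size * frac(k * A)), with 0 < A < 1.
// "k*A mod 1" is the fractional part of k*A.
int multiplicationHash(int k, double A, int size) {
    double product = k * A;
    double frac = product - std::floor(product);
    return static_cast<int>(std::floor(size * frac));
}
```

For the example above, multiplicationHash(50, 0.485, 10) takes the fractional part of 24.25, scales it to 2.5, and rounds down to slot 2.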

Hashing Dilemma
Suppose your Worst Enemy 1) knows your hash function; 2) gets
to decide which keys to send you?

Faced with this enticing possibility, Worst Enemy decides to:

a) Send you keys which maximize collisions for your hash function.
b) Take a nap.

Moral: No single hash function can protect you!

Faced with this dilemma, you:

a) Give up and use a linked list for your Dictionary.
b) Drop out of software, and choose a career in fast foods.
c) Run and hide.
d) Proceed to the next slide, in hope of a better alternative.

Universal Hashing
Suppose we have a set K of possible keys, and a finite set H of hash functions that map keys to entries in a hashtable of size m.
(figure: a function h maps keys k1, k2 in K to table entries 0 … m-1)
H is a universal collection of hash functions if and only if …
For any two distinct keys k1, k2 in K, there are at most |H|/m functions in H for which h(k1) = h(k2).

 So … if we randomly choose a hash function from H, our chances of collision are no more than if we get to choose hash table entries at random

Random Hashing – Not!
How can we “randomly choose a hash function”?

Certainly we cannot randomly choose hash functions at
runtime, interspersed amongst the inserts, finds, deletes!
Why not?

We can, however, randomly choose a hash function each
time we initialize a new hashtable.


Worst Enemy never knows which hash function we will
choose – neither do we!

No single input (set of keys) can always evoke worst-case behavior

Good Hashing:
Universal Hash Function A (UHFa)
Parameterized by prime table size and vector:
a = <a0 a1 … ar> where 0 <= ai < size

Represent each key as r + 1 integers where ki < size

 size = 11, key = 39752 ==> <3,9,7,5,2>
 size = 29, key = “hello world” ==>
ha(k) = (∑ ai·ki) mod size, summing over i = 0 … r

UHFa: Example

Context: hash strings of length 3 in a table of
size 131

let a = <35, 100, 21>

ha(“xyz”) = (35*120 + 100*121 + 21*122) % 131
= 18862 % 131
= 129
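
The dot-product form of UHFa can be sketched as below. This is an illustration under stated assumptions: the key has already been split into integers ki, the vector a has the same length, and the function name uhfA is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Universal hash function A: dot product of the key pieces k_i with the
// vector a, taken modulo a prime table size.
// Assumes a.size() == k.size() and every k_i < size.
int uhfA(const std::vector<int> & k, const std::vector<int> & a, int size) {
    long long sum = 0;
    for (std::size_t i = 0; i < k.size(); ++i) {
        sum = (sum + static_cast<long long>(a[i]) * k[i]) % size;
    }
    return static_cast<int>(sum);
}
```

Calling uhfA({120, 121, 122}, {35, 100, 21}, 131) reproduces the “xyz” example, since 'x', 'y', 'z' have character codes 120, 121, 122.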

Thinking about UHFa
 works on any type as long as you can form ki’s
 if we’re building a static table, we can try many
values of the hash vector <a>

random <a> has guaranteed good properties no
matter what we’re hashing

 must choose prime table size larger than any ki

Good Hashing:
Universal Hash Function 2 (UHF2)
Parameterized by j, a, and b:

j * size should fit into an int

a and b must be less than size

hj,a,b(k) = ((a*k + b) mod (j*size)) / j, using integer division

UHF2 : Example
Context: hash integers in a table of size 16

let j = 32, a = 100, b = 200

hj,a,b(1000) = ((100*1000 + 200) % (32*16)) / 32
= (100200 % 512) / 32
= 360 / 32
= 11
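
A minimal sketch of UHF2 as a function. The name uhf2 and the use of long long (to keep a*k + b from overflowing for moderate inputs) are assumptions of this sketch.

```cpp
// UHF2: h(k) = ((a*k + b) mod (j*size)) / j, with integer division.
// The result is always in the range [0, size).
int uhf2(long long k, long long j, long long a, long long b, long long size) {
    return static_cast<int>(((a * k + b) % (j * size)) / j);
}
```

With j = 32, a = 100, b = 200, and size = 16, uhf2(1000, 32, 100, 200, 16) follows the same steps as the worked example and yields 11.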

Thinking about UHF2

if we’re building a static table, we can try many
parameter values

random a,b has guaranteed good properties no
matter what we’re hashing

can choose any size table

very efficient if j and size are powers of 2 (why?)


need to turn non-integer keys into integers

Hash Function Summary
Goals of a hash function

reproducible mapping from key to table index
 evenly distribute keys across the table

separate commonly occurring keys (neighboring keys?)

fast runtime

Some hash function candidates

h(n) = n % size

h(n) = string as base 128 number % size

Multiplication hash: compute percentage through the table

Universal hash function A: dot product with random vector

Universal hash function 2: next pseudo-random number

Hash Function Design

Know what your keys are

Study how your keys are distributed

Try to include all important information
in a key in the construction of its hash

Try to make “neighboring” keys hash to
very different places

Prune the features used to create the
hash until it runs “fast enough” (very
application dependent)

Handling Collisions
Pigeonhole principle says we can’t avoid all collisions
 try to hash without collision n keys into m slots with n > m
 try to put 6 pigeons into 5 holes

What do we do when two keys hash to the same entry?

 Separate Chaining: put a little dictionary in each entry
 Open Addressing: pick a next entry to try within hashtable

Terminology madness :-(

 Separate Chaining sometimes called Open Hashing
 Open Addressing sometimes called Closed Hashing

Separate Chaining
Put a little dictionary at each entry
 Commonly, unordered linked list
 Or, choose another Dictionary type as appropriate (search tree, hashtable, etc.)
Properties
 λ can be greater than 1
 performance degrades with length of chains
 Alternate Dictionary type (e.g. search tree, hashtable) can speed up secondary search
(figure: table with slots 0 through 6; h(a) = h(d), so a and d share one chain; h(e) = h(b), so e and b share another)

Separate Chaining Code
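void insert(const Key & k, const Value & v) {
  findBucket(k).insert(k, v);
}

Value & find(const Key & k) {
  return findBucket(k).find(k);
}

// named remove because delete is a reserved word in C++
void remove(const Key & k) {
  findBucket(k).remove(k);
}

Dictionary & findBucket(const Key & k) {
  return table[hash(k)%table.size];
}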
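
For a self-contained version of the same idea, here is a sketch that uses an unordered linked list as the little dictionary in each entry. The class name ChainedTable, the string key and value types, and the use of std::hash are assumptions made for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Separate chaining: each table entry holds a small unordered list of
// (key, value) pairs. std::hash stands in for the hash function.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t size) : table(size) {}

    // Insert a new pair, or overwrite the value of an existing key.
    void insert(const std::string & k, const std::string & v) {
        for (auto & entry : bucket(k)) {
            if (entry.first == k) { entry.second = v; return; }
        }
        bucket(k).emplace_back(k, v);
    }

    // Return a pointer to the stored value, or nullptr if k is absent.
    const std::string * find(const std::string & k) {
        for (auto & entry : bucket(k)) {
            if (entry.first == k) return &entry.second;
        }
        return nullptr;
    }

    // Remove k if present.
    void remove(const std::string & k) {
        bucket(k).remove_if(
            [&k](const std::pair<std::string, std::string> & e) {
                return e.first == k;
            });
    }

private:
    using Bucket = std::list<std::pair<std::string, std::string>>;
    std::vector<Bucket> table;

    Bucket & bucket(const std::string & k) {
        return table[std::hash<std::string>{}(k) % table.size()];
    }
};
```

Constructing the table with a single entry, ChainedTable t(1), forces every key into one chain, which makes it easy to see that λ can exceed 1 while lookups still work.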

Load Factor in Separate Chaining

Search cost

unsuccessful search:

successful search:

Desired load factor:

Open Addressing
Allow one key at each table entry
 two objects that hash to the same spot can’t both go there
 first one there gets the spot
 next one must go in another spot
Properties
 λ ≤ 1
 performance degrades with difficulty of finding right spot
(figure: table with slots 0 through 6, where h(a) = h(d) and h(e) = h(b), so d and b are pushed to later slots)

Requires collision resolution function f(i)

Probing how to:

First probe - given a key k, hash to h(k)

Second probe - if h(k) is occupied, try h(k) + f(1)

Third probe - if h(k) + f(1) is occupied, try h(k) + f(2)

And so forth
Probing properties

we force f(0) = 0

ith probe is to (h(k) + f(i)) mod size

if i reaches size - 1, the probe has failed

depending on f(), the probe may fail sooner

long sequences of probes are costly!

Linear Probing
f(i) = i
Probe sequence is

h(k) mod size

(h(k) + 1) mod size

(h(k) + 2) mod size

// Assumes λ < 1, so an empty slot always ends the loop.
bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash(k) % size;  // start inside the table
  do {
    entry = &table[probePoint];
    probePoint = (probePoint + 1) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

Linear Probing Example
          insert(76)  insert(93)  insert(40)  insert(47)  insert(10)  insert(55)
          76%7 = 6    93%7 = 2    40%7 = 5    47%7 = 5    10%7 = 3    55%7 = 6
slot 0                                        47          47          47
slot 1                                                                55
slot 2                93          93          93          93          93
slot 3                                                    10          10
slot 4
slot 5                            40          40          40          40
slot 6    76          76          76          76          76          76
probes:   1           1           1           3           1           3
Load Factor in Linear Probing
For any λ < 1, linear probing will find an empty slot
Search cost (for large table sizes)
 successful search: (1 + 1/(1 - λ))/2
 unsuccessful search: (1 + 1/(1 - λ)^2)/2
Linear probing suffers from primary clustering
Performance quickly degrades for λ > 1/2
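
The linear probing example can be replayed with a short sketch. It assumes non-negative integer keys, uses key mod size as the hash, and marks empty slots with -1; the name insertLinear is hypothetical.

```cpp
#include <vector>

// Insert key into an open-addressing table with linear probing, f(i) = i.
// A slot holding -1 is empty (assumes keys are non-negative).
// Returns the number of probes used, or -1 if the table is full.
int insertLinear(std::vector<int> & table, int key) {
    const int kEmpty = -1;
    int size = static_cast<int>(table.size());
    for (int i = 0; i < size; ++i) {
        int slot = (key % size + i) % size;
        if (table[slot] == kEmpty) {
            table[slot] = key;
            return i + 1;
        }
    }
    return -1;
}
```

Starting from std::vector<int> t(7, -1) and inserting 76, 93, 40, 47, 10, 55 in that order uses 1, 1, 1, 3, 1, 3 probes, with 47 wrapping around to slot 0 and 55 to slot 1.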

Quadratic Probing
f(i) = i^2
Probe sequence:

h(k) mod size

(h(k) + 1) mod size

(h(k) + 4) mod size

(h(k) + 9) mod size

bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash(k) % size, i = 0;
  do {
    entry = &table[probePoint];
    i++;
    // i^2 - (i-1)^2 = 2i - 1, so this steps from offset (i-1)^2 to i^2
    probePoint = (probePoint + (2*i - 1)) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

Good Quadratic Probing Example

          insert(76)  insert(40)  insert(48)  insert(5)   insert(55)
          76%7 = 6    40%7 = 5    48%7 = 6    5%7 = 5     55%7 = 6
slot 0                            48          48          48
slot 1
slot 2                                        5           5
slot 3                                                    55
slot 4
slot 5                40          40          40          40
slot 6    76          76          76          76          76
probes:   1           1           2           3           3
Bad Quadratic Probing Example

          insert(76)  insert(93)  insert(40)  insert(35)  insert(47)
          76%7 = 6    93%7 = 2    40%7 = 5    35%7 = 0    47%7 = 5
slot 0                                        35          35
slot 1
slot 2                93          93          93          93
slot 3
slot 4
slot 5                            40          40          40
slot 6    76          76          76          76          76
probes:   1           1           1           1           ∞
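
Both quadratic probing examples can be checked with a small sketch. It assumes non-negative integer keys, key mod size as the hash, and -1 as the empty marker; the name insertQuadratic is hypothetical.

```cpp
#include <vector>

// Insert key using quadratic probing, f(i) = i^2. A slot holding -1 is
// empty (assumes non-negative keys). Because i^2 mod size repeats once i
// reaches size, trying i = 0 .. size-1 covers every reachable slot.
// Returns the number of probes used, or -1 if no reachable slot is free.
int insertQuadratic(std::vector<int> & table, int key) {
    const int kEmpty = -1;
    int size = static_cast<int>(table.size());
    for (int i = 0; i < size; ++i) {
        int slot = (key % size + i * i) % size;
        if (table[slot] == kEmpty) {
            table[slot] = key;
            return i + 1;
        }
    }
    return -1;
}
```

In the bad example the table holds 76, 93, 40, 35 (λ = 4/7 > ½) and insertQuadratic reports failure for 47, because its probes only ever visit slots 5, 6, 2, and 0, all of which are occupied.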

Quadratic Probing Succeeds for λ ≤ ½
If size is prime and λ ≤ ½, then quadratic probing will find an empty slot in size/2 probes or fewer.
 show for all 0 ≤ i, j ≤ size/2 and i ≠ j
(h(x) + i^2) mod size ≠ (h(x) + j^2) mod size
 by contradiction: suppose that for some i, j:
(h(x) + i^2) mod size = (h(x) + j^2) mod size
i^2 mod size = j^2 mod size
(i^2 - j^2) mod size = 0
[(i + j)(i - j)] mod size = 0
 size is prime, so it must divide (i + j) or (i - j)
 but how can i + j = 0 or i + j = size when i ≠ j and i,j ≤ size/2?
 same for (i - j) mod size = 0

Quadratic Probing May Fail
for λ > ½

For any i larger than size/2, there is
some j smaller than i that adds with i to
equal size (or a multiple of size). D’oh!

Load Factor in Quadratic Probing

For any λ ≤ ½, quadratic probing will find
an empty slot

For λ > ½, quadratic probing may find a slot

Quadratic probing does not suffer from
primary clustering

Quadratic probing does suffer from
secondary clustering

How could we possibly solve this?

Double Hashing
f(i) = i*hash2(k)
Probe sequence:
 h1(k) mod size
 (h1(k) + 1 ⋅ h2(k)) mod size
 (h1(k) + 2 ⋅ h2(k)) mod size

// Assumes hash2(k) is never 0 (mod size) and an empty slot is reachable.
bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k) % size, delta = hash2(k);
  do {
    entry = &table[probePoint];
    probePoint = (probePoint + delta) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

A Good Double Hash Function…

…is quick to evaluate.

…differs from the original hash function.
…never evaluates to 0 (mod size).

One good choice:

Choose a prime p < size
Let hash2(k)= p - (k mod p)

Double Hashing Example (p=5)
          insert(76)  insert(93)  insert(40)  insert(47)      insert(10)  insert(55)
          76%7 = 6    93%7 = 2    40%7 = 5    47%7 = 5        10%7 = 3    55%7 = 6
                                              5 - (47%5) = 3              5 - (55%5) = 5
slot 0
slot 1                                        47              47          47
slot 2                93          93          93              93          93
slot 3                                                        10          10
slot 4                                                                    55
slot 5                            40          40              40          40
slot 6    76          76          76          76              76          76
probes:   1           1           1           2               1           2
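
The double hashing example can be replayed with the sketch below. It assumes non-negative integer keys, hash1(k) = k mod size, hash2(k) = p - (k mod p), and -1 as the empty marker; the name insertDouble is hypothetical.

```cpp
#include <vector>

// Insert key using double hashing: probe i goes to
// (hash1(key) + i * hash2(key)) mod size, with hash1(k) = k mod size and
// hash2(k) = p - (k mod p). A slot holding -1 is empty.
// Returns the number of probes used, or -1 after `size` failed probes.
int insertDouble(std::vector<int> & table, int key, int p) {
    const int kEmpty = -1;
    int size = static_cast<int>(table.size());
    int delta = p - (key % p);   // always in 1 .. p, never 0
    int slot = key % size;
    for (int i = 0; i < size; ++i) {
        if (table[slot] == kEmpty) {
            table[slot] = key;
            return i + 1;
        }
        slot = (slot + delta) % size;
    }
    return -1;
}
```

With a table of size 7 and p = 5, the colliding keys 47 and 55 step by 3 and 5 respectively, so they end up in slots 1 and 4 after two probes each.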
Load Factor in Double Hashing
For any λ < 1, double hashing will find an empty slot (given appropriate table size and hash2)

Search cost appears to approach optimal (random hash):

 successful search: (1/λ) ln(1/(1 - λ))
 unsuccessful search: 1/(1 - λ)
No primary clustering and no secondary clustering
One extra hash calculation

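Deletion in Open Addressing
(figure: a table with slots 0 through 6, shown before delete(2) and again when find(7) is attempted afterwards. Where is it?!)
 Emptying a slot can break the probe sequence that leads to a later entry
 Must use lazy deletion!
 On insertion, treat a (lazily) deleted item as an empty slot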
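
Lazy deletion can be sketched with a per-slot state flag. This illustration assumes linear probing, non-negative integer keys, and key mod size as the hash; the class name LazyTable and the three state names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Open addressing with linear probing and lazy deletion. Each slot is
// EMPTY, FULL, or DELETED; a DELETED slot keeps probe sequences intact.
class LazyTable {
public:
    explicit LazyTable(std::size_t size) : keys(size, 0), state(size, EMPTY) {}

    // Reuse the first EMPTY or DELETED slot on the probe sequence.
    // (A fuller version would first check that key is not already stored.)
    bool insert(int key) {
        for (std::size_t i = 0; i < keys.size(); ++i) {
            std::size_t slot = (home(key) + i) % keys.size();
            if (state[slot] != FULL) {
                keys[slot] = key;
                state[slot] = FULL;
                return true;
            }
        }
        return false;  // table is full
    }

    bool find(int key) const { return locate(key) >= 0; }

    void remove(int key) {
        int slot = locate(key);
        if (slot >= 0) state[slot] = DELETED;  // mark the slot, do not empty it
    }

private:
    enum State { EMPTY, FULL, DELETED };
    std::vector<int> keys;
    std::vector<State> state;

    std::size_t home(int key) const {
        return static_cast<std::size_t>(key) % keys.size();
    }

    // Probe past DELETED slots; stop only at an EMPTY slot.
    int locate(int key) const {
        for (std::size_t i = 0; i < keys.size(); ++i) {
            std::size_t slot = (home(key) + i) % keys.size();
            if (state[slot] == EMPTY) return -1;
            if (state[slot] == FULL && keys[slot] == key) {
                return static_cast<int>(slot);
            }
        }
        return -1;
    }
};
```

With a table of size 7 holding 0, 1, 2, and 7, calling remove(2) leaves a DELETED marker in slot 2, so find(7) keeps probing past it and still reaches 7 in slot 3.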

The Squished Pigeon

Insert using Open Addressing cannot work with
λ ≥ 1.

Insert using Open Addressing with quadratic
probing may not work with λ ≥ ½.

With Separate Chaining or Open Addressing,
large load factors lead to poor performance!

How can we relieve the pressure on the pigeons?

Hint: what happens when we overrun array storage
in a {queue, stack, heap}?

What else must happen with a hashtable?

When the λ gets “too large” (over some constant
threshold), rehash all elements into a new, larger table:
 takes O(n), but amortized O(1) as long as we (just about)
double table size on the resize
 spreads keys back out, may drastically improve performance
 gives us a chance to retune parameterized hash functions
 avoids failure for Open Addressing techniques
 allows arbitrarily large tables starting from a small table
 clears out lazily deleted items
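
A minimal sketch of rehashing on a separate-chaining table of integers. The threshold (λ may not exceed 1), the growth rule (2 * size + 1), the starting size of 5, and the class name RehashingSet are all assumptions of this sketch; keys are assumed non-negative.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Separate-chaining set of ints that rehashes into a table of roughly
// double the size whenever an insert would push the load factor above 1.
class RehashingSet {
public:
    RehashingSet() : table(5), count(0) {}

    void insert(int key) {
        if (contains(key)) return;
        if (count + 1 > table.size()) rehash();
        table[index(key, table.size())].push_back(key);
        ++count;
    }

    bool contains(int key) const {
        for (int k : table[index(key, table.size())]) {
            if (k == key) return true;
        }
        return false;
    }

    std::size_t tableSize() const { return table.size(); }

private:
    std::vector<std::list<int>> table;
    std::size_t count;

    static std::size_t index(int key, std::size_t size) {
        return static_cast<std::size_t>(key) % size;
    }

    // O(n): move every key into a new table of about twice the size.
    void rehash() {
        std::vector<std::list<int>> bigger(2 * table.size() + 1);
        for (const auto & chain : table) {
            for (int k : chain) {
                bigger[index(k, bigger.size())].push_back(k);
            }
        }
        table.swap(bigger);
    }
};
```

Because the table roughly doubles each time, the occasional O(n) rehash is spread over the O(n) cheap inserts that preceded it, which is where the amortized O(1) bound comes from.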

Case Study
Spelling dictionary
 30,000 words
 static
 preprocessing time
Goals
 fast spell checking
 minimal storage
Practical notes
 almost all searches are successful – Why?
 words average about 8 characters in length
 30,000 words at 8 bytes/word ~ .25 MB
 pointers are 4 bytes
 there are many regularities in the structure of English

Case Study: Design Considerations
Possible Solutions

sorted array + binary search

Separate Chaining

Open Addressing + linear probing


Which data structure should we use?

Which type of hash function should we use?

Case Study:

Assume words are strings and entries are pointers to strings
 Array + binary search
 Separate Chaining
 Open addressing
How many pointers does each use?

Case Study:

                    storage                       time
Binary search       n pointers + words            log2n ≤ 15 probes per access, worst case
                    = 360KB
Separate Chaining   n + n/λ pointers + words      1 + λ/2 probes per access on average
                    (λ = 1 ⇒ 600KB)               (λ = 1 ⇒ 1.5 probes)
Open Addressing     n/λ pointers + words          (1 + 1/(1 - λ))/2 probes per access on average
                    (λ = 0.5 ⇒ 480KB)             (λ = 0.5 ⇒ 1.5 probes)
What to do, what to do? …

Perfect Hashing
When we know the entire key set in
advance …

Examples: programming language
keywords, CD-ROM file list, spelling
dictionary, etc.

… then perfect hashing lets us achieve:

Worst-case O(1) time complexity!

Worst-case O(n) space complexity!

Perfect Hashing Technique

Static set of n known keys

Separate chaining, two-level

Primary hash table size=n
 jth secondary hash table size=nj^2 (where nj keys hash to slot j in primary hash table)

Universal hash functions in all hash tables

Conduct (a few!) random trials, until we get collision-free hash functions
(figure: primary hash table with slots 0 through 6, each pointing to its secondary hash table)

Perfect Hashing
Theorem: If we store n keys in a hash table of size n^2 using a randomly chosen universal hash function, then the probability of any collision is < ½.

Theorem: If we store n keys in a hash table of size m=n using a randomly chosen universal hash function, then
 E[ ∑ nj^2 ] < 2n, summing over j = 0 … m-1,
where nj is the number of keys hashing to slot j.

Corollary: If we store n keys in a hash table of size m=n using a randomly chosen universal hash function and we set the size of each secondary hash table to mj=nj^2, then:
a) The expected amount of storage required for all secondary hash tables is less than 2n.
b) The probability that the total storage used for all secondary hash tables exceeds 4n is less than ½.

Intro to Algorithms, 2nd ed., Cormen, Leiserson, Rivest, Stein
Perfect Hashing
Perfect hashing theorems set tight expected
bounds on sizes and collision behavior of all the
hash tables (primary and all secondaries).

 Conduct a few random trials of universal hash functions, by simply varying UHF parameters, until we get a set of UHFs and associated table sizes which deliver …
 Worst-case O(1) time complexity!
 Worst-case O(n) space complexity!
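
The two-level construction can be sketched as follows. This is an illustration under assumptions that are not in the slides: keys are distinct integers in [0, P) with P = 2^31 - 1, the hash family is h(k) = ((a*k + b) mod P) mod m (a standard universal family), and the names PerfectTable and Level are hypothetical.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Two-level perfect hashing for a static set of distinct keys in [0, P).
struct PerfectTable {
    static const long long P = 2147483647LL;  // the prime 2^31 - 1

    struct Level { long long a = 1, b = 0; std::vector<long long> slots; };

    long long a = 1, b = 0;        // primary hash parameters
    std::vector<Level> secondary;  // one secondary table per primary slot

    static std::size_t h(long long a, long long b, long long k, std::size_t m) {
        return static_cast<std::size_t>(((a * k + b) % P) % static_cast<long long>(m));
    }

    explicit PerfectTable(const std::vector<long long> & keys, unsigned seed = 1) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<long long> pickA(1, P - 1), pickB(0, P - 1);
        std::size_t n = keys.size();
        if (n == 0) return;
        std::vector<std::vector<long long>> groups;
        // Primary level: retry until total secondary space is below 4n.
        for (;;) {
            a = pickA(rng); b = pickB(rng);
            groups.assign(n, std::vector<long long>());
            for (long long k : keys) groups[h(a, b, k, n)].push_back(k);
            std::size_t total = 0;
            for (const auto & g : groups) total += g.size() * g.size();
            if (total < 4 * n) break;
        }
        // Secondary level: slot j gets a table of size nj^2; retry until
        // its hash function is collision-free.
        secondary.resize(n);
        for (std::size_t j = 0; j < n; ++j) {
            Level & lv = secondary[j];
            std::size_t m = groups[j].size() * groups[j].size();
            bool collision = true;
            while (m > 0 && collision) {
                lv.a = pickA(rng); lv.b = pickB(rng);
                lv.slots.assign(m, -1);  // -1 marks an empty slot
                collision = false;
                for (long long k : groups[j]) {
                    long long & s = lv.slots[h(lv.a, lv.b, k, m)];
                    if (s != -1) { collision = true; break; }
                    s = k;
                }
            }
        }
    }

    // Worst-case O(1): one primary hash, one secondary hash, one compare.
    bool contains(long long k) const {
        if (secondary.empty()) return false;
        const Level & lv = secondary[h(a, b, k, secondary.size())];
        return !lv.slots.empty() && lv.slots[h(lv.a, lv.b, k, lv.slots.size())] == k;
    }
};
```

Each retry loop succeeds with probability greater than ½ by the theorems above, so only a few trials are expected at either level, and the total secondary storage stays below 4n slots.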

Extendible Hashing:

Cost of a Database Query

I/O to CPU ratio is 300-to-1!

Extendible Hashing
Hashing technique for huge data sets
 optimizes to reduce disk accesses
 each hash bucket fits on one disk block

better than B-Trees if order is not important – why?

Table contains

buckets, each fitting in one disk block, with the data

a directory that fits in one disk block is used to hash to
the correct bucket

Extendible Hash Table

Directory entry: key prefix (first k bits) and a pointer to the
bucket with all keys starting with its prefix

Each block contains keys matching on first j ≤ k bits, plus
the data associated with each key

directory for k = 3
000 001 010 011 100 101 110 111

(2)      (2)      (3)      (3)      (2)

00001    01001    10001    10101    11001
00011    01011    10011    10110    11011
00100    01100             10111    11100
00110                               11110
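
Looking up a directory entry amounts to taking the first k bits of the key. A minimal sketch, assuming keys are stored as unsigned integers of a known width; the name directoryIndex and the keyBits parameter are hypothetical.

```cpp
// Directory lookup in extendible hashing: the first k bits of a
// keyBits-wide key index a directory of 2^k entries.
// Assumes 0 < k <= keyBits and keyBits no larger than the width of unsigned.
unsigned directoryIndex(unsigned key, unsigned keyBits, unsigned k) {
    return key >> (keyBits - k);
}
```

For the 5-bit key 11011 (decimal 27) and k = 3, the index is 110 (decimal 6), so the lookup follows directory entry 110 to the bucket holding keys that start with 11.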

Inserting (easy case)

000 001 010 011 100 101 110 111

(2)      (2)      (3)      (3)      (2)

00001    01001    10001    10101    11001
00011    01011    10011    10110    11100
00100    01100             10111    11110


000 001 010 011 100 101 110 111

(2)      (2)      (3)      (3)      (2)

00001    01001    10001    10101    11001
00011    01011    10011    10110    11011
00100    01100             10111    11100
00110                               11110
Splitting a Leaf

000 001 010 011 100 101 110 111

(2)      (2)      (3)      (3)      (2)

00001    01001    10001    10101    11001
00011    01011    10011    10110    11011
00100    01100             10111    11100
00110                               11110

000 001 010 011 100 101 110 111

(2)      (2)      (3)      (3)      (3)      (3)

00001    01001    10001    10101    11000    11100
00011    01011    10011    10110    11001    11110
00100    01100             10111    11011
Splitting the Directory
1. insert(10010)
   But, no room to insert and no adoption!
2. Solution: Expand directory
3. Then, it’s just a normal split.

00 01 10 11

(2)      (2)      (2)

01101    10000    11001
         10001    11110
         10011

000 001 010 011 100 101 110 111

If Extendible Hashing Doesn’t
Cut It
Store only pointers to the items
+ (potentially) much smaller M
+ fewer items in the directory
– one extra disk access!
+ potentially better distribution over the buckets
+ fewer unnecessary items in the directory
– can’t solve the problem if there’s simply too much data

What if these don’t work?

 use a B-Tree to store the directory!

Hash Wrap
Collision resolution
•Separate Chaining
 Expand beyond hashtable via secondary Dictionaries
 Allows λ > 1
•Open Addressing
 Expand within hashtable
 Secondary probing: {linear, quadratic, double hash}
 λ ≤ 1 (by definition!)
 λ ≤ ½ (by preference!)
Rehashing
 Tunes up hashtable when λ crosses the line
Hash functions
 Simple integer hash: prime table size
 Multiplication method
 Universal hashing guarantees no (always) bad input
Perfect hashing
 Requires known, fixed keyset
 Achieves O(1) time, O(n) space - guaranteed!
Extendible hashing
 For disk-based data
 Combine with b-tree directory if needed


