Hashing
Reminder: Dictionary ADT
Dictionary operations:
  create, destroy
  insert(key, value)   e.g. insert(Donald, l33t haxtor)
  find(key)            e.g. find(Adrien) returns roller-blade demon
  delete(key)
[Figure: dictionary entries Adrien → roller-blade demon, Hannah → C++ guru, Dave → Older than dirt, Donald → l33t haxtor]

Dictionary Implementations So Far

                      Insert     Find       Delete
  Unsorted list       O(1)       O(n)       O(n)
  Trees               O(log n)   O(log n)   O(log n)
  Sorted array        O(n)       O(log n)   O(n)
  Array (special      O(1)       O(1)       O(1)
   case: known keys
   ∈ {1, …, K})

ADT Legalities: A Digression on Keys
Methods are the contract between an ADT and the outside agent (client code)
  Ex: Dictionary contract is {insert, find, delete}
  Ex: Priority Queue contract is {insert, deleteMin}
So… how about O(1) insert/find/delete for any key type?

Hash Table Goal: Key as Index
We can access a record as a[5]; we want to access a record as a["Hannah"].
[Figure: array with slot 2 holding (Adrien, roller-blade demon) and slot 5 holding (Hannah, C++ guru)]

Hash Table Approach
[Figure: keys Hannah, Dave, Adrien, Donald, Ed feed through a hash function f(x) into slots of a table]

Hash Table: Dictionary Data Structure
Hash function: maps keys to integers
  Result: can quickly find the right spot for a given entry
Table is unordered and sparse
  Result: cannot efficiently list all entries
  Result: cannot efficiently find min, max, or ordered ranges

Hash Table Taxonomy
[Figure: keys (Hannah, Dave, Adrien, Donald, Ed) enter the hash function f(x), which maps them to table slots; two keys sent to the same slot form a collision]

Hash Table Design Decisions
What should the hash function be?
What should the table size be?
How should we resolve collisions?

Hash Function
A hash function maps a key to a table index:

  Value & find(Key & key) {
    int index = hash(key) % tableSize;
    return Table[index];
  }

What Makes a Good Hash Function?
Fast runtime
  O(1) and fast in practical terms

Good Hash Function for Integer Keys
Choose
  tableSize is prime
  hash(n) = n
Example, with tableSize = 7:
  insert(4), insert(17), find(12), insert(9), delete(17)
[Figure: table with slots 0–6]

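A minimal sketch of this slide's scheme in C++ (names like IntTable are mine, and collisions are simply ignored, since the example's keys happen not to collide):

```cpp
#include <cassert>
#include <vector>

// Sketch of the slide's integer hash: tableSize is prime, hash(n) = n,
// index = n % tableSize. Slots hold -1 when empty (keys assumed non-negative).
const int kTableSize = 7;  // prime, as the slide recommends

int intHash(int n) { return n % kTableSize; }

struct IntTable {
    std::vector<int> slots = std::vector<int>(kTableSize, -1);
    void insert(int n)     { slots[intHash(n)] = n; }   // ignores collisions in this sketch
    bool find(int n) const { return slots[intHash(n)] == n; }
    void remove(int n)     { if (find(n)) slots[intHash(n)] = -1; }
};
```

Replaying the slide's sequence: 4 lands in slot 4, 17 in slot 3 (17 % 7 = 3), 9 in slot 2, and find(12) probes the empty slot 5.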
Good Hash Function for Strings?
Let s = s1s2s3s4…sn; choose
  hash(s) = s1 + s2·128 + s3·128² + s4·128³ + … + sn·128^(n−1)
Think of the string as a base 128 (aka radix 128) number
Problems:
  hash("really, really big") = well… something really, really big
  hash("one thing") % 128 = hash("other thing") % 128 whenever the first characters match

String Hashing
Simplify computation using Horner's Rule:

  int hash(String s) {
    int h = 0;
    for (int i = s.length() - 1; i >= 0; i--) {
      h = (s[i] + 128*h) % tableSize;
    }
    return h;
  }

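The slide's pseudocode becomes compilable with std::string and declared types; the algorithm is unchanged, and taking the mod at every step keeps h small:

```cpp
#include <cassert>
#include <string>

// Runnable version of the slide's Horner's-rule string hash.
// tableSize is assumed prime; 128 is the radix from the slide.
int hornerHash(const std::string& s, int tableSize) {
    int h = 0;
    for (int i = (int)s.length() - 1; i >= 0; i--) {
        h = (s[i] + 128 * h) % tableSize;  // mod every step avoids overflow
    }
    return h;
}
```

Iterating from the last character forward yields exactly s1 + s2·128 + s3·128² + …, reduced mod tableSize.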
Good Hashing: Multiplication Method
Hash function is defined by size plus a parameter A, 0 < A < 1:
  hA(k) = ⌊size · (k·A mod 1)⌋
  no restriction on size!
  when building a static table, we can try several values of A
  more computationally intensive than a single mod

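A sketch of the multiplication method; the constant A = (√5 − 1)/2 is a popular choice (often attributed to Knuth), though the slide only requires 0 < A < 1:

```cpp
#include <cassert>
#include <cmath>

// Multiplication method: h_A(k) = floor(size * (k*A mod 1)), 0 < A < 1.
int multHash(int k, int size, double A) {
    double frac = k * A - std::floor(k * A);  // (k*A mod 1): the fractional part
    return (int)(size * frac);                // scale up and take the floor
}
```

Note the "no restriction on size" point: the table size can be a power of two here, unlike the prime-size mod scheme.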
Hashing Dilemma
Suppose your Worst Enemy 1) knows your hash function and 2) gets to decide which keys to send you?

Universal Hashing
Suppose we have a set K of possible keys and a finite set H of hash functions that map keys to entries in a hashtable of size m.
[Figure: keys k1, k2 in K mapped by functions h, hi, hj in H to slots 0 … m−1]
Definition:
H is a universal collection of hash functions if and only if…
for any two keys k1, k2 in K, there are at most |H|/m functions in H for which h(k1) = h(k2).

Random Hashing – Not!
How can we "randomly choose a hash function"?
  We certainly cannot randomly choose hash functions at runtime, interspersed amongst the inserts, finds, and deletes! Why not?
  We can, however, randomly choose a hash function each time we initialize a new hashtable.
Conclusions
  Worst Enemy never knows which hash function we will choose – neither do we!
  No single input (set of keys) can always evoke worst-case behavior.

Good Hashing: Universal Hash Function A (UHFa)
Parameterized by prime table size and vector:
  a = <a0 a1 … ar> where 0 ≤ ai < size
Represent each key as a vector <k0 k1 … kr> of r + 1 integers with each ki < size; then
  ha(k) = (Σi ai·ki) mod size

UHFa: Example
Context: hash strings of length 3 in a table of size 131

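The slide elides the worked example, so here is a sketch assuming the standard dot-product universal family: each character code of a length-3 string serves as one ki (ASCII codes are below 131), and the hash is (Σ ai·ki) mod 131 for a randomly chosen vector a:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// UHFa sketch for the slide's context: strings of length 3, prime size 131.
// The dot-product form here is the standard universal family; the slide
// itself elides the formula.
const int kSize = 131;  // prime, larger than any k_i (ASCII < 128)

int uhfA(const std::string& s, const std::vector<int>& a) {
    int h = 0;
    for (size_t i = 0; i < s.length(); i++) {
        h = (h + a[i] * s[i]) % kSize;  // accumulate a_i * k_i mod size
    }
    return h;
}

// Choose the vector a at random, as the universal-hashing slides prescribe.
std::vector<int> randomVector(size_t len) {
    std::vector<int> a(len);
    for (size_t i = 0; i < len; i++) a[i] = std::rand() % kSize;  // 0 <= a_i < size
    return a;
}
```

A fixed vector a gives a reproducible mapping; re-randomizing a is what defeats the Worst Enemy from the earlier slide.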
Thinking about UHFa
Strengths:
  works on any type as long as you can form the ki's
  if we're building a static table, we can try many values of the hash vector <a>
  a random <a> has guaranteed good properties no matter what we're hashing
Weaknesses:
  must choose a prime table size larger than any ki

Good Hashing: Universal Hash Function 2 (UHF2)
Parameterized by j, a, and b:
  h(k) = ((a·k + b) mod (j·size)) div j
  j·size should fit into an int
  a and b must be less than size

UHF2: Example
Context: hash integers in a table of size 16

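The slide elides the worked example; this sketch assumes the form h(k) = ((a·k + b) mod (j·size)) div j, which matches the clues given (j·size fits in an int; fast when j and size are powers of 2, since the mod and division reduce to masking and shifting):

```cpp
#include <cassert>

// UHF2 sketch for the slide's context: integers into a table of size 16.
// This formula is one standard way to fill in the slide's elided definition;
// a and b are less than size, and j*size must fit in an int.
int uhf2(int k, int j, int size, int a, int b) {
    return ((a * k + b) % (j * size)) / j;  // result lies in [0, size)
}
```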
Thinking about UHF2
Strengths:
  if we're building a static table, we can try many parameter values
  random a, b has guaranteed good properties no matter what we're hashing
  can choose any size table
  very efficient if j and size are powers of 2 (why?)
Weaknesses:
  need to turn non-integer keys into integers

Hash Function Summary
Goals of a hash function:
  reproducible mapping from key to table index
  evenly distribute keys across the table
  separate commonly occurring keys (neighboring keys?)
  fast runtime

Hash Function Design Considerations
  Know what your keys are
  Study how your keys are distributed
  Try to include all important information in a key in the construction of its hash
  Try to make "neighboring" keys hash to very different places
  Prune the features used to create the hash until it runs "fast enough" (very application dependent)

Handling Collisions
The pigeonhole principle says we can't avoid all collisions:
  try to hash n keys into m slots with n > m and no collisions
  try to put 6 pigeons into 5 holes

Separate Chaining
Put a little dictionary at each entry
  commonly, an unordered linked list (chain)
  or, choose another Dictionary type as appropriate (search tree, hashtable, etc.)
[Figure: table with slots 0–6; h(a) = h(d), so a and d share one chain; h(e) = h(b), so e and b share another; c sits alone]
Properties
  λ can be greater than 1
  performance degrades with length of chains
  an alternate Dictionary type (e.g. search tree, hashtable) can speed up the secondary search

Separate Chaining Code

  void insert(const Key & k, const Value & v) {
    findBucket(k).insert(k,v);
  }

  [private]
  Dictionary & findBucket(const Key & k) {
    return table[hash(k) % table.size];
  }

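A runnable version of the same idea, with std::list chains standing in for the bucket Dictionaries (type names here are mine):

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Separate chaining sketch: each bucket is an unordered chain of key/value
// pairs, found via findBucket, matching the slide's structure.
struct ChainedTable {
    std::vector<std::list<std::pair<int, std::string>>> table;
    explicit ChainedTable(int size) : table(size) {}

    std::list<std::pair<int, std::string>>& findBucket(int k) {
        return table[k % (int)table.size()];  // keys assumed non-negative
    }
    void insert(int k, const std::string& v) { findBucket(k).push_back({k, v}); }
    bool find(int k, std::string& out) {
        for (auto& kv : findBucket(k))          // linear scan of the chain
            if (kv.first == k) { out = kv.second; return true; }
        return false;
    }
};
```

Two keys that collide simply share a chain, which is why λ may exceed 1 here.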
Load Factor in Separate Chaining
Load factor λ = (number of entries) / tableSize
Search cost
  unsuccessful search: traverse a whole chain, λ links on average
  successful search: about 1 + λ/2 links on average

Open Addressing
Allow one key at each table entry
  two objects that hash to the same spot can't both go there
  first one there gets the spot
  next one must go in another spot
[Figure: table with slots 0–6; h(a) = h(d) and h(e) = h(b), so d and b land in slots after a and e]
Properties
  λ ≤ 1
  performance degrades with difficulty of finding the right spot

Probing
Requires a collision resolution function f(i): the ith probe examines slot (h(k) + f(i)) mod size

Linear Probing
f(i) = i
Probe sequence is
  h(k) mod size
  (h(k) + 1) mod size
  (h(k) + 2) mod size
  …

  bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash(k) % size;
    do {
      entry = &table[probePoint];
      probePoint = (probePoint + 1) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
  }

Linear Probing Example
  insert(76): 76 % 7 = 6, 1 probe
  insert(93): 93 % 7 = 2, 1 probe
  insert(40): 40 % 7 = 5, 1 probe
  insert(47): 47 % 7 = 5; slots 5 and 6 full, lands in slot 0: 3 probes
  insert(10): 10 % 7 = 3, 1 probe
  insert(55): 55 % 7 = 6; slots 6 and 0 full, lands in slot 1: 3 probes
Final table (size 7): 0: 47, 1: 55, 2: 93, 3: 10, 4: empty, 5: 40, 6: 76
probes: 1 1 1 3 1 3

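The example can be replayed with a small probe-counting sketch (-1 marks an empty slot; keys are assumed non-negative and the table is assumed not full):

```cpp
#include <cassert>
#include <vector>

// Linear probing sketch that counts probes, so the size-7 example above
// can be replayed exactly.
struct LinearTable {
    std::vector<int> slots;
    explicit LinearTable(int size) : slots(size, -1) {}

    // Returns the number of probes used; assumes the table is not full.
    int insert(int k) {
        int size = (int)slots.size();
        int probePoint = k % size, probes = 1;
        while (slots[probePoint] != -1) {
            probePoint = (probePoint + 1) % size;  // f(i) = i: step one slot
            probes++;
        }
        slots[probePoint] = k;
        return probes;
    }
};
```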
Load Factor in Linear Probing
For any λ < 1, linear probing will find an empty slot
Search cost (for large table sizes)
  successful search:   (1/2)(1 + 1/(1 − λ))
  unsuccessful search: (1/2)(1 + 1/(1 − λ)²)
Linear probing suffers from primary clustering
Performance quickly degrades for λ > 1/2

Quadratic Probing
f(i) = i²
Probe sequence:
  h(k) mod size
  (h(k) + 1) mod size
  (h(k) + 4) mod size
  (h(k) + 9) mod size
  …

  bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash(k) % size, i = 0;
    do {
      entry = &table[probePoint];
      i++;
      probePoint = (probePoint + (2*i - 1)) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
  }

Good Quadratic Probing Example
[Figure truncated: five inserts into a table of size 7; the visible final slots are 2: 5, 3: 55, 5: 40, 6: 76]
probes: 1 1 2 3 3

Bad Quadratic Probing Example
[Figure truncated: inserts into a table of size 7; the visible final slots are 2: 93, 5: 40, 6: 76; the last insert never finds an empty slot]
probes: 1 1 1 1 ∞

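A sketch implementing the 2i−1 increment trick from the code above, demonstrating both behaviors on a size-7 table: one insert sequence that settles quickly, and one whose probe sequence cycles without ever reaching an empty slot. The specific keys are illustrative choices of mine, in the spirit of the slides' size-7 examples:

```cpp
#include <cassert>
#include <vector>

// Quadratic probing: (h + i*i) - (h + (i-1)*(i-1)) = 2i - 1, so adding
// 2i - 1 each round walks h, h+1, h+4, h+9, ... mod size.
struct QuadTable {
    std::vector<int> slots;  // -1 marks empty; keys assumed non-negative
    explicit QuadTable(int size) : slots(size, -1) {}

    // Returns probes used, or -1 if probing cycles without finding a slot.
    int insert(int k) {
        int size = (int)slots.size();
        int probePoint = k % size, probes = 1;
        for (int i = 1; slots[probePoint] != -1; i++) {
            if (i > size) return -1;                    // give up: sequence is cycling
            probePoint = (probePoint + 2 * i - 1) % size;
            probes++;
        }
        slots[probePoint] = k;
        return probes;
    }
};
```

In the failing sequence below, 47 hashes to slot 5 and its probe sequence cycles through slots 5, 6, 2, 0 forever, even though slots 1, 3, 4 are empty.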
Quadratic Probing Succeeds for λ ≤ ½
If size is prime and λ ≤ ½, then quadratic probing will find an empty slot in size/2 probes or fewer.
  show that for all 0 ≤ i, j ≤ size/2 with i ≠ j:
    (h(x) + i²) mod size ≠ (h(x) + j²) mod size
  by contradiction: suppose that for some such i, j:
    (h(x) + i²) mod size = (h(x) + j²) mod size
    ⇒ i² mod size = j² mod size
    ⇒ (i² − j²) mod size = 0
    ⇒ [(i + j)(i − j)] mod size = 0
  since size is prime, it must divide (i + j) or (i − j)
  but how can i + j = 0 or i + j = size when i ≠ j and i, j ≤ size/2?
  same for (i − j) mod size = 0

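The claim can be spot-checked mechanically: for a prime size, the first size/2 + 1 quadratic probe locations are all distinct; for a composite size they need not be:

```cpp
#include <cassert>
#include <set>

// Check: are the probe locations h, h+1, h+4, ..., h+(size/2)^2 all
// distinct mod size? The proof says yes whenever size is prime.
bool firstHalfProbesDistinct(int size, int h) {
    std::set<int> seen;
    for (int i = 0; i <= size / 2; i++) {
        seen.insert((h + i * i) % size);
    }
    return (int)seen.size() == size / 2 + 1;  // no repeats among the probes
}
```

For size 8 and h = 0 the sequence is 0, 1, 4, 1, 0: it repeats well before covering half the table.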
Quadratic Probing May Fail for λ > ½
For any i larger than size/2, there is some j smaller than i that adds with i to equal size (or a multiple of size). D'oh!

Load Factor in Quadratic Probing
For any λ ≤ ½, quadratic probing will find an empty slot
For λ > ½, quadratic probing may fail to find a slot
Quadratic probing does not suffer from primary clustering
Quadratic probing does suffer from secondary clustering
  How could we possibly solve this?

Double Hashing
f(i) = i·hash2(k)
Probe sequence:
  h1(k) mod size
  (h1(k) + 1·h2(k)) mod size
  (h1(k) + 2·h2(k)) mod size
  …

  bool findEntry(const Key & k, Entry *& entry) {
    int probePoint = hash1(k) % size, delta = hash2(k);
    do {
      entry = &table[probePoint];
      probePoint = (probePoint + delta) % size;
    } while (!entry->isEmpty() && entry->key != k);
    return !entry->isEmpty();
  }

A Good Double Hash Function…
…is fast, differs from the primary hash function, and never evaluates to 0 mod size.
For integer keys, a common choice is hash2(k) = p − (k mod p) for a prime p < size.

Double Hashing Example (p = 5)
hash2(k) = 5 − (k % 5)
  insert(76): 76 % 7 = 6, 1 probe
  insert(93): 93 % 7 = 2, 1 probe
  insert(40): 40 % 7 = 5, 1 probe
  insert(47): 47 % 7 = 5, full; step 5 − (47 % 5) = 3 lands in slot 1: 2 probes
  insert(10): 10 % 7 = 3, 1 probe
  insert(55): 55 % 7 = 6, full; step 5 − (55 % 5) = 5 lands in slot 4: 2 probes
Final table (size 7): 1: 47, 2: 93, 3: 10, 4: 55, 5: 40, 6: 76
probes: 1 1 1 2 1 2

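A sketch replaying this example, with hash2(k) = 5 − (k % 5) hard-coded as the step (keys assumed non-negative; -1 marks empty; the table is assumed not full):

```cpp
#include <cassert>
#include <vector>

// Double hashing sketch for the slide's example: size-7 table,
// step size hash2(k) = 5 - (k % 5), which is never 0.
struct DoubleHashTable {
    std::vector<int> slots;
    explicit DoubleHashTable(int size) : slots(size, -1) {}

    int insert(int k) {
        int size = (int)slots.size();
        int probePoint = k % size;
        int delta = 5 - (k % 5);  // never 0, so probing always advances
        int probes = 1;
        while (slots[probePoint] != -1) {
            probePoint = (probePoint + delta) % size;
            probes++;
        }
        slots[probePoint] = k;
        return probes;
    }
};
```

Because 5 and 7 are coprime, every step size visits all seven slots, which is why double hashing avoids both primary and secondary clustering here.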
Load Factor in Double Hashing
For any λ < 1, double hashing will find an empty slot (given appropriate table size and hash2)
Search cost (for large table sizes)
  successful search:   (1/λ) ln(1/(1 − λ))
  unsuccessful search: 1/(1 − λ)
No primary clustering and no secondary clustering
One extra hash calculation

Deletion in Open Addressing
delete(2), then find(7)
[Figure: 7 collided with 2 on insert and was probed to the next slot; after deleting 2, the probe for 7 stops at the emptied slot. Where is it?!]
Must use lazy deletion!
On insertion, treat a (lazily) deleted item as an empty slot

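A sketch of lazy deletion on top of linear probing, using the slide's keys 2 and 7 (table size 5 is my choice so that the two keys collide). The DELETED sentinel is the key point: find probes past it, while insert may reuse it:

```cpp
#include <cassert>
#include <vector>

// Lazy deletion for linear probing. Each slot is EMPTY, DELETED, or holds a
// non-negative key. The table is assumed never completely full of non-EMPTY
// slots, so probes terminate.
const int EMPTY = -1, DELETED = -2;

struct LazyTable {
    std::vector<int> slots;
    explicit LazyTable(int size) : slots(size, EMPTY) {}

    void insert(int k) {
        int size = (int)slots.size();
        int p = k % size;
        while (slots[p] != EMPTY && slots[p] != DELETED) p = (p + 1) % size;
        slots[p] = k;  // a lazily deleted slot counts as empty on insert
    }
    bool find(int k) {
        int size = (int)slots.size();
        int p = k % size;
        while (slots[p] != EMPTY) {            // DELETED does not stop the probe
            if (slots[p] == k) return true;
            p = (p + 1) % size;
        }
        return false;
    }
    void remove(int k) {
        int size = (int)slots.size();
        int p = k % size;
        while (slots[p] != EMPTY) {
            if (slots[p] == k) { slots[p] = DELETED; return; }
            p = (p + 1) % size;
        }
    }
};
```

If remove set the slot to EMPTY instead of DELETED, the final find(7) below would stop early and wrongly report "not found".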
The Squished Pigeon Principle
Insert using Open Addressing cannot work with λ ≥ 1.
Insert using Open Addressing with quadratic probing may not work with λ ≥ ½.
With Separate Chaining or Open Addressing, large load factors lead to poor performance!

Rehashing
When λ gets "too large" (over some constant threshold), rehash all elements into a new, larger table:
  takes O(n), but amortized O(1) as long as we (just about) double the table size on the resize
  spreads keys back out, may drastically improve performance
  gives us a chance to retune parameterized hash functions
  avoids failure for Open Addressing techniques
  allows arbitrarily large tables starting from a small table
  clears out lazily deleted items

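A sketch of rehashing for a chaining table; the threshold (λ > 1), initial size, and doubling policy are illustrative choices:

```cpp
#include <cassert>
#include <list>
#include <vector>

// Rehashing sketch: when the load factor crosses a threshold, roughly double
// the table size and re-insert every key under the new size.
struct RehashingTable {
    std::vector<std::list<int>> table;
    int n = 0;  // number of entries

    RehashingTable() : table(4) {}

    void insert(int k) {
        table[k % (int)table.size()].push_back(k);
        n++;
        if ((double)n / table.size() > 1.0) rehash();  // threshold: lambda > 1
    }
    void rehash() {
        std::vector<std::list<int>> old = table;
        table.assign(old.size() * 2, std::list<int>());  // double the size
        for (auto& chain : old)                          // O(n) re-insert pass
            for (int k : chain) table[k % (int)table.size()].push_back(k);
    }
    bool find(int k) {
        for (int x : table[k % (int)table.size()]) if (x == k) return true;
        return false;
    }
};
```

Each individual rehash costs O(n), but doubling means each key is re-inserted only O(1) times on average over the table's lifetime, which is the amortized O(1) claim above.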
Case Study
Spelling dictionary
  30,000 words
  static
  arbitrary(ish) preprocessing time
Goals
  fast spell checking
  minimal storage
Practical notes
  almost all searches are successful – Why?
  words average about 8 characters in length
  30,000 words at 8 bytes/word ≈ 0.25 MB
  pointers are 4 bytes
  there are many regularities in the structure of English words

Case Study: Design Considerations
Possible solutions
  sorted array + binary search
  Separate Chaining
  Open Addressing + linear probing
Issues
  Which data structure should we use?
  Which type of hash function should we use?

Case Study: Storage
Assume words are strings and entries are pointers to strings
[Figure: storage layouts for array + binary search, Open Addressing, and Separate Chaining]
How many pointers does each use?

Case Study: Analysis

                      storage                    time
  Binary search       n pointers + words         log2 n ≤ 15 probes per
                      = 360KB                    access, worst case
  Separate Chaining   n + n/λ pointers + words   1 + λ/2 probes per access on average
                      (λ = 1 ⇒ 600KB)            (λ = 1 ⇒ 1.5 probes)
  Open Addressing     n/λ pointers + words       (1 + 1/(1 − λ))/2 probes per access on average
                      (λ = 0.5 ⇒ 480KB)          (λ = 0.5 ⇒ 1.5 probes)

What to do, what to do? …

Perfect Hashing
When we know the entire key set in advance…
Examples: programming language keywords, CD-ROM file list, spelling dictionary, etc.

Perfect Hashing Technique
Static set of n known keys
Separate chaining, two-level hash:
  primary hash table of size n
  jth secondary hash table of size nj² (where nj keys hash to slot j in the primary hash table)
Universal hash functions in all hash tables
Conduct (a few!) random trials, until we get collision-free hash functions
[Figure: primary hash table with slots 0–6, each pointing to its secondary hash table]

Perfect Hashing Theorems¹
Theorem: If we store n keys in a hash table of size n² using a randomly chosen universal hash function, then the probability of any collision is < ½.
Theorem: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function, then
  E[ Σ_{j=0}^{m−1} nj² ] < 2n
where nj is the number of keys hashing to slot j.

¹ Intro to Algorithms, 2nd ed., Cormen, Leiserson, Rivest, Stein

Perfect Hashing Conclusions
Perfect hashing theorems set tight expected bounds on sizes and collision behavior of all the hash tables (primary and all secondaries).

Extendible Hashing
Hashing technique for huge data sets
  optimizes to reduce disk accesses
  each hash bucket fits on one disk block
  better than B-Trees if order is not important – why?
Table contains
  buckets, each fitting in one disk block, with the data
  a directory that fits in one disk block and is used to hash to the correct bucket

Extendible Hash Table
Directory entry: a key prefix (first k bits) and a pointer to the bucket holding all keys starting with that prefix
Each block contains keys matching on the first j ≤ k bits, plus the data associated with each key
[Figure: directory for k = 3, with entries 000 001 010 011 100 101 110 111]

Inserting (easy case)
insert(11011)
insert(11000)
[Figure: directory entries 000 … 111; each key goes into the bucket matching its prefix]

If Extendible Hashing Doesn't Cut It
Store only pointers to the items
  + (potentially) much smaller M
  + fewer items in the directory
  – one extra disk access!
Rehash
  + potentially better distribution over the buckets
  + fewer unnecessary items in the directory
  – can't solve the problem if there's simply too much data

Hash Wrap
Collision resolution
  Separate Chaining
    expand beyond the hashtable via secondary Dictionaries
    allows λ > 1
  Open Addressing
    expand within the hashtable
    secondary probing: {linear, quadratic, double hash}
    λ ≤ 1 (by definition!)
    λ ≤ ½ (by preference!)
  Rehashing
    tunes up the hashtable when λ crosses the line
Hash functions
  simple integer hash: prime table size
  multiplication method
  universal hashing guarantees no (always) bad input
  perfect hashing
    requires known, fixed keyset
    achieves O(1) time, O(n) space - guaranteed!
  extendible hashing
    for disk-based data
    combine with B-tree directory if needed