Professional Documents
Culture Documents
Hashing
Hashing
Hashing
... ...
2 / 40
Introduction
Naive implementations
Linear search
Array (or list) is unordered
Sequentially check each item until match is found
Binary search
Array is maintained in order
Can perform binary search to find item
3 / 40
Introduction
Advanced implementations
Binary search tree
Organization of data items in BST, following certain order
Any node's key is greater than all keys in its left subtree
Any node's key is smaller than all keys in its right subtree
At most, explore height of tree
4 / 40
Introduction
Ideal data structure
Can we achieve constant time on all three main operations? Search Insert Delete
5 / 40
Introduction
Example #2: frequency of words
Compute the frequency of all the (alphabetical) characters in a text
Key Value
I was born in Tuckahoe, near Hillsborough, and about twelve miles from
Easton, in Talbot county, Maryland. I have no accurate knowledge of my 'of' 6
age, never having seen any authentic record containing it. By far the larger 'time' 5
part of the slaves know as little of their ages as horses know of theirs, and it
is the wish of most masters within my knowledge to keep their slaves thus 'to' 3
ignorant. I do not remember to have ever met a slave who could tell of his 'the' 3
birthday. They seldom come nearer to it than planting-time, harvest-time,
'it' 3
cherry-time, spring-time, or fall-time. [...]
'I' 3
... ...
Issues
Very large number of words in English
Would need a huge table to index by word
std::vector<size_t> fmap(171476);
Cannot index an array with a string
fmap["time"] is not a proper C++ construct
Can only index arrays with an integer
6 / 40
Introduction
Hashing 0
5 1
Compute an array index from any key
hash("time") = 1 2
Translate key range 0..K-1 to index range 0..M-1 hash("of") = 4 3
How to provide hashing for any type of key? 6 4
Boolean, integers, floating-point numbers, strings, ...
user-defined objects, etc. M-1
Collisions 0
5 1
Hash function is said to be non-injective
hash("time") = 1 2
Key range is typically much larger than index range hash("of") = 4 3
Two keys can be reduced to the same hash hash("to") = 4 6 4
??
How to resolve potential collisions properly? ...
M-1
7 / 40
Hash functions
Required properties
Simple to compute, and fast
Hash of a key is deterministic
Equal keys must produce the same hash value
assert(a == b && hash(a) == hash(b));
Uniform distribution of the keys across the index range
key(-value) 0
Hash function 1
Hash Array 2
code index 3
Any key [0,N) [0-M) 4
...
M-1
8 / 40
Hash functions
Some pitfalls
Worse (yet functional) hash function
Problem
template<typename K>
size_t GalaxyHashCode(K &key) { Can only insert one key
return 42;
}
All the following keys will collide, regardless
of their value
Solution
Need to distinguish keys by value
Key subset selection
Problem
Possible scenario:
Phone numbers with same area code will
Keys are phone numbers: e.g., (530) 752- collide
1011 Very likely to happen if hash function was
Hash table is of size 1000 used on campus!
Idea: use first three digits as a direct Solution
index? Select another set of 3 digits that shows
better "random" properties (e.g., the last
hash("(530) 752-1011") = 530
three)
Or consider the entire phone number
instead
9 / 40
Hash functions
Hashing positive integers
Key is a simple unsigned integer in the range 0..K-1
Array index is always an unsigned integer in the range 0..M-1
Hashing is done via modulo operation
size_t HashUInt(unsigned key, size_t M) {
return key % M;
}
uint_hash_uniform.cc
10 / 40
Hash functions
Consideration about table size (1/2) Key % 100 % 97
Conclusion 384 84 93
812 12 36
528 28 43
568 68 83
Collisions 3 2
11 / 40
Hash functions
Consideration about table size (2/2) Key % 100 % 97
0 0 0
Does the size of the hash table matter?
25 25 25
Same scenario as last slide
50 50 50
Keys are random numbers between 0 and 999
75 75 75
Hashing of 25 keys
100 0 3
Hash table can either be of size 100 or size 97
125 25 28
But distribution of keys is non-uniform
150 50 53
Conclusion 175 75 78
200 0 6
Table size becomes critical
225 25 31
If distribution of keys is non-uniform
250 50 56
Prime numbers exhibit good properties
275 75 81
300 0 9
325 25 34
350 50 59
Collisions 21 0
12 / 40
ECS 36C - Hashing (part 2)
Prof. Joël Porquet-Lupine
100 0 3
Collisions 21 0
14 / 40
Hash functions
Hashing strings size_t HashStr(std::string key, size_t M) {
size_t h = 0;
Naive approach for (auto c : key)
h += c;
Sum ASCII value of each character return h % M;
}
Reduce to table size string_hash_naive.cc
Observations Issues
Average word length is 4.5 Assuming a hash table of size 10,000
4.5 ∗ 127 = 576 Indices up to 500 will be crowded
Longest word is 45 letters long Indices above 5000 will never be used
45 ∗ 127 = 5715
Experiment Result
15 / 40
Hash functions
Hashing strings size_t HashStr(std::string key, size_t M) {
size_t h = 0;
Better approach for (auto c : key)
h = (h * 31) + c;
Multiple/add ASCII value of each return h % M;
}
character string_hash_better.cc
Horner's method
Consider string of length L as a polynomial
h = key[0] ∗ 31L−1 + key[1] ∗ 31L−2 + ... + key[L − 1] ∗ 310
Properties Result
If unsigned hash value gets too big, overflow is Hash(the) => 4801
Hash(and) => 6727
controlled Hash(that) => 8823
31 is an interesting prime number Hash(have) => 5240
Hash(for) => 1577
Multiply item by 32 (by shifting << 5) and Hash(not) => 9267
Hash(with) => 9734
subtract item
Hash(you) => 9839
Experimentations has shown good distribution Hash(pneumonoultramicroscopicsilicovolca
noconiosis) => 2420
of indices overall
16 / 40
Hash functions
Hashing compound data types
Example: phone number struct PhoneNumber {
unsigned area, exch, ext; // (530) 752-1011
Fields of same type };
Same method as string, but for size_t HashPhone(PhoneNumber &p, size_t M) {
each field return ((p.area * 31 + p.exch) * 31 + p.ext) % M;
}
17 / 40
Collisions
Uniform hashing
We assume that the hash function has a uniform distribution
Keys are mapped as evenly as possible over the index range
Collision probability
Assuming an array index in the range 0..M-1 Hash Key % 97
After how many hashed keys will there be the first collision? #1 548 63
#2 592 10
Experiments #3 715 36
Example: #4 844 68
#10 623 41
#11 645 63
18 / 40
Collisions
Birthday problem
In a group of N random people, what is the
probability that two people have the same
birthdate?
100% if group counts 367 people (pigeonhole
principle)
But 50% if group counts only 23 people
And 99.9% if group counts 70 people!
π
For an output range of 97: 2 97
= 12 Hash Key % 97
#11 645 63
19 / 40
Collisions
Observation
Unless table size is quadratically bigger than necessary
Impossible to avoid collisions
Conclusion
Need to deal with collisions gracefully
Two options:
1. Separate chaining: couple the hash table with another container
2. Open addressing: put the key at another index in case of collision
20 / 40
Separate chaining
Principle
Hash table is an array of linked-lists
Keys are still hashed to array index i in the range 0..M-1
d 0 1
a 2 'u' 'a'
2
v 3
i 0 3 'v'
s 0
4
'c'
21 / 40
Separate chaining
Public API
// Returns whether @key is found in table
bool Contains(const K &key);
// Inserts @key in hash table
void Insert(const K &key);
// Removes @key from hash table
void Remove(const K &key);
ht_chaining.h
// Helper methods
size_t Hash(const K &key);
...
};
ht_chaining.h
22 / 40
Separate chaining
Hash function
template <typename K> std::hash<T> provided by
size_t HTChaining<K>::Hash(const K &key) { C++ library
static std::hash<K> k_hash;
return k_hash(key) % ht.size(); Knows how to hash most basic
}
ht_chaining.h types (e.g., integers, floating-
point, strings)
Example
std::hash<int> int_hash;
std::hash<double> double_hash;
std::hash<std::string> str_hash;
std::hash<int*> intptr_hash;
std::cout << 42 << " => " << int_hash(42) << std::endl;
std::cout << -42 << " => " << int_hash(-42) << std::endl;
std::cout << 42.23 << " => " << double_hash(42.23) << std::endl;
std::cout << "ucdavis" << " => " << str_hash("ucdavis") << std::endl;
int a;
std::cout << &a << " => " << intptr_hash(&a) << std::endl;
std_hash.cc
23 / 40
Separate chaining
API implementation
template <typename K>
bool HTChaining<K>::Contains(const K &key) {
auto &list = ht[Hash(key)];
return std::find(list.begin(), list.end(), key) != list.end();
}
ht_chaining.h
24 / 40
Separate chaining
Performance analysis
Hashing key to array index: O(1)
Searching key in chain: O(length of chain)
Hash function
If using poor hash function, such as GalaxyHashCode()
One chain can end up containing all the items
Runtime complexity of operations: O(N )!
If using hash function providing uniform distribution
On average, chains will be of length N /M = L
L is also know as the load factor
Runtime complexity of operations: O(L)
Load factor
If load factor is maintained to be constant
Then complexity also becomes constant
Number of items N in hash table cannot be controlled
But number of chains M can...
25 / 40
Separate chaining
Resizing Resizing - Rehashing
Increase the size of hash table
Scenario
Rehash every item into new table
ht[5]
0
'd' 'i' 's'
Result
1 ht[11]
'u' 'a' 'c'
2 0
3 1 'd'
'v'
2
4
'c'
3
7 'u'
8
'v'
9
'a'
10
26 / 40
Separate chaining
Resizing implementation
template <typename K> template <typename K>
class HTChaining { void HTChaining<K>::Insert(const K &key) {
public: ...
... // Update size
private: cur_size++;
...
void Resize(size_t capacity); // Resize hash table if load factor is 1
}; if (cur_size > ht.size())
ht_chaining.h
Resize(ht.size() * 2);
}
ht_chaining.h
// Reinsert all previous items into new table Ideally, resize to another
for (auto &list : old_ht) prime number
for (auto &k : list)
Insert(k);
}
ht_chaining.h
27 / 40
Separate chaining
Conclusion
Complexity
Proportional to load factor
Search Insert Delete
Typically force load factor to 1.0
O(1) O(1) amortized O(1)
Pros
Very effective if keys are uniformly distributed
Cons
Rely on another data structure, such as lists, to implement the chains
Note that the chains could also be implemented with balanced trees!
Cost associated to this additional data structure
Memory management, traversal, etc.
28 / 40
ECS 36C - Hashing (part 3)
Prof. Joël Porquet-Lupine
Separate chaining
Lists of keys hashed to same array index Load factor
ht[5] Average length of chains in
'd' 'i' 's' hashtable
0
Defines complexity of most
1 operations
Maintain constant load factor
'u' 'a'
2 via resizing
3 'v'
4
'c'
30 / 40
Open addressing
Principle
Keys hashed to the same array index are not chained together
Inserted in the same flat hash table, but at the next available index
Hash table must be at least the same size as the number of keys total
Formally ht[M]
Key x is inserted at the first index hi available
hash(keyC) ...
hi (x) = (hash(x) + f (i)) mod M
h0 keyA
where f (0) = 0
+f(1) ...
Function f is the collision resolution strategy
Linear probing, quadratic probing, etc. h1 keyB
+f(2) ...
h2 keyC
...
Operations
Insert: keep probing until an empty index is found
Search: keep probing until key is found
Or until encountering an empty index (i.e., key not found)
31 / 40
Linear probing
Principle
Function f is a linear function of i
Typically f (i) =i
In case of collision, try next indices sequentially
3 0 0
Observations
L 8 8
Apparition of primary clusters
U 0 1 (= 0 + 1)
Contiguous block of keys
Caused by multiple collisions V 1 2 (= 1 + 1)
Key hashed into a cluster might require multiple
attempts to be inserted
32 / 40
Linear probing
Data structure
Based on flat hash table
Same public API as for separate chaining implementation
template <typename K>
class HTLinear {
public:
...
private:
struct Node {
K key;
bool taken;
};
static const size_t init_capacity = 11;
std::vector<Node> ht{init_capacity};
size_t cur_size = 0;
// Helper methods
size_t Hash(const K &key);
void Resize(size_t capacity);
};
ht_linear.h
33 / 40
Linear probing
API implementation (1/2)
template <typename K> template <typename K>
bool HTLinear<K>::Contains(const K &key) { void HTLinear<K>::Insert(const K &key) {
size_t idx = Hash(key); // Key already there
while (ht[idx].taken) { if (Contains(key)) return;
if (ht[idx].key == key)
return true; // Find empty spot
idx = (idx + 1) % ht.size(); size_t idx = Hash(key);
} while (ht[idx].taken)
return false; idx = (idx + 1) % ht.size();
}
ht_linear.h // Put key there
ht[idx].key = key;
template <typename K> ht[idx].taken = true;
void HTLinear<K>::Resize(size_t capacity) {
std::vector<Node> old_ht = ht; // Update size
cur_size++;
// Resize to new capacity and empty
ht.resize(capacity); // Resize hash table if 50% full
for (auto &k : ht) if (cur_size > ht.size() / 2)
k.taken = false; Resize(ht.size() * 2);
cur_size = 0; }
ht_linear.h
34 / 40
Linear probing
Deletion principle
Scenario
Assuming following hash table
Key Hash Index
Cluster of three keys
How to remove key U? I 5 5
Key V should still be reachable < 9 9
Problems 3 0 0
Solution
Free key's index upon removal
But rehash every key of cluster located directly after key
V would be reinserted at index 1, and stay reachable
35 / 40
Linear probing
Deletion implementation
template <typename K>
void HTLinear<K>::Remove(const K &key) {
// Key not there
if (!Contains(key)) return;
36 / 40
Linear probing
Conclusion
Complexity
Difficult complexity analysis
Search hit Search miss/Insert
Often simplified to O(1)
O(∼ O(∼
1
2
(1 +
1
1−α
))
1
2
(1 + 1
(1−α)2
))
Pros
No overhead for memory management
Locality of reference, especially if multiple consecutive items can be in the same cache
line
Cons
Requires low load factor (0.5) compared to separate chaining
Bigger hash table
Causes primary clustering
37 / 40
Open addressing strategies
Linear probing
f (i) = i
If index at i is taken, try i+1, then i+2, ..., i+k.
Quadratic probing
f (i) = i2
If index at i is taken, try i+1^2, then i+2^2, ..., i+k^2.
Double hashing
Use second hash functions to compute intervals between probes
f (i) = i ∗ h2 (k)
Comparison
Strategy Locality of reference Prevents clustering
38 / 40
Standard C++ containers
The C++ Standard Library defines two main unordered associative containers, implemented
as hashed data structures:
std::unordered_set
Contains an unordered set of unique objects
Objects must be comparable for equality, as they are used as keys directly
std::unordered_map
Contains unordered pairs of key-value
Keys must be comparable for equality and unique
39 / 40
Standard C++ containers
Demo
#include <iostream> (Bart: 10)
#include <unordered_map> (Lisa: 8)
int main() { (Alice: 36)
// Initialization (Bob: 23)
std::unordered_map<int, std::string> m = {
{36, "Alice"}, {23, "Bob"}
};
return 0;
}
std_unordered_map_example.cc
40 / 40