Hashing

ECS 36C - Hashing
Prof. Joël Porquet-Lupine
UC Davis - Spring Quarter 2020
Copyright © 2018-2020 Joël Porquet-Lupine - CC BY-NC-SA 4.0 International License 1 / 40

Introduction
Symbol table
Association of a value with a key
table[key] = value
Typical operations
Insert(), Remove(), Contains()/Get()
Other possible operations
Min(), Max(), Rank(), SortedList(), etc.
Example #1: frequency of letters

Compute the frequency of all the (alphabetical) characters in a text
Key Value
I was born in Tuckahoe, near Hillsborough, and about twelve miles from
Easton, in Talbot county, Maryland. I have no accurate knowledge of my 'a' 37
age, never having seen any authentic record containing it. By far the larger
'b' 6
part of the slaves know as little of their ages as horses know of theirs, and it
'c' 10
is the wish of most masters within my knowledge to keep their slaves thus
ignorant. I do not remember to have ever met a slave who could tell of his 'd' 10
birthday. They seldom come nearer to it than planting-time, harvest-time,
'e' 54
cherry-time, spring-time, or fall-time. [...]
'g' 11
... ...
2 / 40
Introduction
Naive implementations
Linear search
Array (or list) is unordered
Sequentially check each item until match is found
Search Insert Delete

'I'' ''w''a''s'...
O(N ) O(1) (search time +) O(1)
3 30 9 37 26 ...
Binary search
Array is maintained in order
Can perform binary search to find item
O(log N ) O(N ) O(N ) ...'a''b''c''d''e'...

37 6 10 10 54
3 / 40
Introduction
Advanced implementations
Binary search tree
Organization of data items in BST, following certain order
Any node's key is greater than all keys in its left subtree
Any node's key is smaller than all keys in its right subtree
At most, explore height of tree
Search avg/worst Insert avg/worst Delete avg/worst
O(log N )/O(N ) O(log N )/O(N ) O(log N )/O(N )
'd' AVL or RB tree

10 Balanced BST
Height of tree is kept to O(log N )
'b' 'e'
6 54 Search Insert Delete
O(log N ) O(log N ) O(log N )

'a' 'c'
37 10
4 / 40
Introduction
Ideal data structure
Can we achieve constant time on all three main operations? Search Insert Delete
O(1) O(1) O(1)
Using data as index

Keys are small integers
...
ASCII is 128 characters 97 37
std::vector<size_t> fmap(128);
Can directly be used as index
98 6
fmap[key] = value; 99 10
Limitations 100 10
Doesn't (efficiently) support other operations
Min(), Max(), etc.
101 54
Wasted memory for unused keys ...
E.g., ASCII defines about 30 control characters
What if the key is not a small integer?
5 / 40
Introduction
Example #2: frequency of words
Compute the frequency of all the (alphabetical) characters in a text
Key Value
I was born in Tuckahoe, near Hillsborough, and about twelve miles from
Easton, in Talbot county, Maryland. I have no accurate knowledge of my 'of' 6
age, never having seen any authentic record containing it. By far the larger 'time' 5
part of the slaves know as little of their ages as horses know of theirs, and it
is the wish of most masters within my knowledge to keep their slaves thus 'to' 3
ignorant. I do not remember to have ever met a slave who could tell of his 'the' 3
birthday. They seldom come nearer to it than planting-time, harvest-time,
'it' 3
cherry-time, spring-time, or fall-time. [...]
'I' 3
... ...
Issues
Very large number of words in English
Would need a huge table to index by word
std::vector<size_t> fmap(171476);
Cannot index an array with a string
fmap["time"] is not a proper C++ construct
Can only index arrays with an integer
6 / 40
Introduction
Hashing 0
5 1
Compute an array index from any key
hash("time") = 1 2
Translate key range 0..K-1 to index range 0..M-1 hash("of") = 4 3
How to provide hashing for any type of key? 6 4
Boolean, integers, floating-point numbers, strings, ...
user-defined objects, etc. M-1
Collisions 0
5 1
Hash function is said to be non-injective
hash("time") = 1 2
Key range is typically much larger than index range hash("of") = 4 3
Two keys can be reduced to the same hash hash("to") = 4 6 4
??
How to resolve potential collisions properly? ...
M-1
7 / 40
Hash functions
Required properties
Simple to compute, and fast
Hash of a key is deterministic
Equal keys must produce the same hash value
assert(a == b && hash(a) == hash(b));
Uniform distribution of the keys across the index range
key(-value) 0
Hash function 1
Hash Array 2
code index 3
Any key [0,N) [0-M) 4
...
M-1
Two implicit steps:

1. Transform any potentially non-integer key into an integer hash code
2. Reduce this hash code to an index range
8 / 40
Hash functions
Some pitfalls
Worse (yet functional) hash function
Problem
template<typename K>
size_t GalaxyHashCode(K &key) { Can only insert one key
return 42;
}
All the following keys will collide, regardless
of their value
Solution
Need to distinguish keys by value
Key subset selection
Problem
Possible scenario:
Phone numbers with same area code will
Keys are phone numbers: e.g., (530) 752- collide
1011 Very likely to happen if hash function was
Hash table is of size 1000 used on campus!
Idea: use first three digits as a direct Solution
index? Select another set of 3 digits that shows
better "random" properties (e.g., the last
hash("(530) 752-1011") = 530
three)
Or consider the entire phone number
instead
9 / 40
Hash functions
Hashing positive integers
Key is a simple unsigned integer in the range 0..K-1
Array index is always an unsigned integer in the range 0..M-1
Hashing is done via modulo operation
size_t HashUInt(unsigned key, size_t M) {
return key % M;
}
uint_hash_uniform.cc
// Hash table Hash(548) => 48

Hash(592) => 92
std::vector<bool> ht(100, false);
Hash(715) => 15
Hash(844) => 44
// Uniform random number generator Hash(602) => 2
std::mt19937 mt(0); Hash(857) => 57
std::uniform_int_distribution<unsigned> distr(0, 999); Hash(544) => 44 (collision!)
Hash(847) => 47
// Generate some random keys Hash(423) => 23
for (int i = 0; i < 25 ; i++) { Hash(623) => 23 (collision!)
Hash(645) => 45
unsigned key = distr(mt);
Hash(384) => 84
size_t hash = HashUInt(key, ht.size()); Hash(437) => 37
Hash(297) => 97
std::cout << "Hash(" << key << ") => " << hash; Hash(891) => 91
Hash(56) => 56
if (ht[hash]) Hash(963) => 63
std::cout << " (collision!)"; Hash(272) => 72
ht[hash] = true; Hash(383) => 83
...
std::cout << std::endl;
}
10 / 40
Hash functions
Consideration about table size (1/2) Key % 100 % 97
... ... ...

Does the size of the hash table matter?
857 57 81
Same scenario as previous slide
544 44 59
Keys are random numbers between 0 and 999
847 47 71
Distribution of keys is uniform
423 23 35
Hashing of 25 keys
623 23 41
But hash table can either be of size 100 or size 97
645 45 63
Conclusion 384 84 93
... ... ...

Table size is not critical
477 77 89
If distribution of keys is uniform
791 91 15
812 12 36
528 28 43
... ... ...
568 68 83
Collisions 3 2
11 / 40
Hash functions
Consideration about table size (2/2) Key % 100 % 97
0 0 0
Does the size of the hash table matter?
25 25 25
Same scenario as last slide
50 50 50
Keys are random numbers between 0 and 999
75 75 75
Hashing of 25 keys
100 0 3
Hash table can either be of size 100 or size 97
125 25 28
But distribution of keys is non-uniform
150 50 53
Conclusion 175 75 78
200 0 6
Table size becomes critical
225 25 31
If distribution of keys is non-uniform
250 50 56
Prime numbers exhibit good properties
275 75 81
300 0 9
325 25 34
350 50 59
... ... ...
Collisions 21 0
12 / 40
ECS 36C - Hashing (part 2)

Recap
Hash tables Hashing
Data structure implementing efficient Hash function
associative arrays Compute array index from any type of
Key to value mapping key
Average O(1) for search, insert, and Hash function
key(-value) 0
1
delete operations Hash Array 2
code index 3
Any key [0,N) [0-M) 4
...
M-1
Hashing positive integers Table size

Hash key in range 0..K-1 into array Table size is critical if distribution of keys
index in range 0..M-1 is non-uniform
Simple modulo operation Pick prime number to minimize number
size_t HashUInt(unsigned key, size_t M) {
of collisions
return key % M; Key % 100 % 97
}
... ... ...
100 0 3
... ... ...
Collisions 21 0
14 / 40
Hash functions
Hashing strings size_t HashStr(std::string key, size_t M) {
size_t h = 0;
Naive approach for (auto c : key)
h += c;
Sum ASCII value of each character return h % M;
}
Reduce to table size string_hash_naive.cc
Observations Issues
Average word length is 4.5 Assuming a hash table of size 10,000
4.5 ∗ 127 = 576 Indices up to 500 will be crowded
Longest word is 45 letters long Indices above 5000 will never be used
45 ∗ 127 = 5715
Experiment Result
std::vector<std::string> dict{ Hash(the) => 321

// Some of the most used words Hash(and) => 307
"the", "and", "that", "have", Hash(that) => 433
"for", "not", "with", "you", Hash(have) => 420
// Longest word in English Hash(for) => 327
"pneumonoultramicroscopicsilicovolcanoconiosis" Hash(not) => 337
}; Hash(with) => 444
Hash(you) => 349
for (auto &d : dict) Hash(pneumonoultramicroscopicsilicovolca
std::cout << "Hash(" << d << ") => " noconiosis) => 4880
<< HashStr(d, 10000) << std::endl;
string_hash_naive.cc
15 / 40
Hash functions
Hashing strings size_t HashStr(std::string key, size_t M) {
size_t h = 0;
Better approach for (auto c : key)
h = (h * 31) + c;
Multiple/add ASCII value of each return h % M;
}
character string_hash_better.cc
Reduce to table size
Horner's method
Consider string of length L as a polynomial
h = key[0] ∗ 31L−1 + key[1] ∗ 31L−2 + ... + key[L − 1] ∗ 310
Properties Result
If unsigned hash value gets too big, overflow is Hash(the) => 4801
Hash(and) => 6727
controlled Hash(that) => 8823
31 is an interesting prime number Hash(have) => 5240
Hash(for) => 1577
Multiply item by 32 (by shifting << 5) and Hash(not) => 9267
Hash(with) => 9734
subtract item
Hash(you) => 9839
Experimentations has shown good distribution Hash(pneumonoultramicroscopicsilicovolca
noconiosis) => 2420
of indices overall
16 / 40
Hash functions
Hashing compound data types
Example: phone number struct PhoneNumber {
unsigned area, exch, ext; // (530) 752-1011
Fields of same type };
Same method as string, but for size_t HashPhone(PhoneNumber &p, size_t M) {
each field return ((p.area * 31 + p.exch) * 31 + p.ext) % M;
}
Example: street address struct Address {

unsigned st_num;
Fields of different type std::string st_name;
Combine each field using };
31x + y rule size_t HashAddress(Address &a, size_t M) {

size_t h = 17; // small prime as initial value
// Hash street number

If key is too long to compute, h = h * 31 + a.st_num;
pick a subset of the fields
// Hash street name
E.g., only a few well-chosen for (auto c : a.st_name)
h = h * 31 + c;
characters from the street
name return h % M;
}
17 / 40
Collisions
Uniform hashing
We assume that the hash function has a uniform distribution
Keys are mapped as evenly as possible over the index range
Collision probability
Assuming an array index in the range 0..M-1 Hash Key % 97
After how many hashed keys will there be the first collision? #1 548 63
#2 592 10
Experiments #3 715 36
Example: #4 844 68
Output range of 97 #5 603 20
11 hashes to observe the first collision #6 857 81
Now, what is the output range was 223?... #7 544 59
Need to restart a simulation... #8 847 71
Or need for better mathematical tools! #9 423 35
#10 623 41
#11 645 63
18 / 40
Collisions
Birthday problem
In a group of N random people, what is the
probability that two people have the same
birthdate?
100% if group counts 367 people (pigeonhole
principle)
But 50% if group counts only 23 people
And 99.9% if group counts 70 people!
Back to collision probability

Number of hashed keys to first collision is ≈ π
2
M
π
For an output range of 97: 2 97
= 12 Hash Key % 97
π ... ... ...

And for an output of 223: 2 223
= 18!
#10 623 41
#11 645 63
19 / 40
Collisions
Observation
Unless table size is quadratically bigger than necessary
Impossible to avoid collisions
Conclusion
Need to deal with collisions gracefully
Two options:
1. Separate chaining: couple the hash table with another container
2. Open addressing: put the key at another index in case of collision
20 / 40
Separate chaining
Principle
Hash table is an array of linked-lists
Keys are still hashed to array index i in the range 0..M-1
key hash ht[5]

u 2 'd' 'i' 's'
0
c 4
d 0 1
a 2 'u' 'a'
2
v 3
i 0 3 'v'
s 0
4
'c'
Insertion: key pushed in ith chain

Search: only need to search key in ith chain
21 / 40
Separate chaining
Public API
// Returns whether @key is found in table
bool Contains(const K &key);
// Inserts @key in hash table
void Insert(const K &key);
// Removes @key from hash table
void Remove(const K &key);
ht_chaining.h
Data structure implementation

template <typename K> Small prime number as initial
class HTChaining { hash table size
public:
... Hash table of chains, as a
private:
static const size_t init_capacity = 11;
vector of linked lists
std::vector<std::list<K>> ht{init_capacity};
size_t cur_size = 0;
// Helper methods
size_t Hash(const K &key);
...
};
ht_chaining.h
22 / 40
Separate chaining
Hash function
template <typename K> std::hash<T> provided by
size_t HTChaining<K>::Hash(const K &key) { C++ library
static std::hash<K> k_hash;
return k_hash(key) % ht.size(); Knows how to hash most basic
}
ht_chaining.h types (e.g., integers, floating-
point, strings)
Example
std::hash<int> int_hash;
std::hash<double> double_hash;
std::hash<std::string> str_hash;
std::hash<int*> intptr_hash;
std::cout << 42 << " => " << int_hash(42) << std::endl;
std::cout << -42 << " => " << int_hash(-42) << std::endl;
std::cout << 42.23 << " => " << double_hash(42.23) << std::endl;
std::cout << "ucdavis" << " => " << str_hash("ucdavis") << std::endl;
int a;
std::cout << &a << " => " << intptr_hash(&a) << std::endl;
std_hash.cc
42 => 42 Good and easy way to get a well-distributed

-42 => 18446744073709551574 hash code
42.23 => 12297999112293035700
ucdavis => 2541764663091866736 Still need to reduce it to array index with a
0x7fff232febdc => 140733783731164
modulo operation
23 / 40
Separate chaining
API implementation
template <typename K>
bool HTChaining<K>::Contains(const K &key) {
auto &list = ht[Hash(key)];
return std::find(list.begin(), list.end(), key) != list.end();
}
ht_chaining.h
std::find() returns iterator on matching item within specified bounds

Returns special end() iterator if not found
template <typename K> template <typename K>

void HTChaining<K>::Insert(const K &key) { void HTChaining<K>::Remove(const K &key) {
// Key already there // Key not there
if (Contains(key)) return; if (!Contains(key)) return;
// Push key into right bucket // Remove key from bucket

auto &list = ht[Hash(key)]; auto &list = ht[Hash(key)];
list.push_back(key); auto itr = std::find(list.begin(),
list.end(), key);
// Update size list.erase(itr);
cur_size++;
... // Update size
} cur_size--;
ht_chaining.h
}
ht_chaining.h
24 / 40
Separate chaining
Performance analysis
Hashing key to array index: O(1)
Searching key in chain: O(length of chain)
Hash function
If using poor hash function, such as GalaxyHashCode()
One chain can end up containing all the items
Runtime complexity of operations: O(N )!
If using hash function providing uniform distribution
On average, chains will be of length N /M = L
L is also know as the load factor
Runtime complexity of operations: O(L)
Load factor
If load factor is maintained to be constant
Then complexity also becomes constant
Number of items N in hash table cannot be controlled
But number of chains M can...
25 / 40
Separate chaining
Resizing Resizing - Rehashing
Increase the size of hash table
Scenario
Rehash every item into new table
ht[5]
0
'd' 'i' 's'
Result
1 ht[11]
'u' 'a' 'c'
2 0
3 1 'd'
'v'
2
4
'c'
3
Load factor: 7/5 = 1.4 4

's'
5
'i'
6
7 'u'
8
'v'
9
'a'
10
Load factor: 7/11 = 0.63
26 / 40
Separate chaining
Resizing implementation
class HTChaining { void HTChaining<K>::Insert(const K &key) {
public: ...
... // Update size
private: cur_size++;
...
void Resize(size_t capacity); // Resize hash table if load factor is 1
}; if (cur_size > ht.size())
ht_chaining.h
Resize(ht.size() * 2);
}
ht_chaining.h

void HTChaining<K>::Resize(size_t capacity) {
Complexity
// Back up existing hash table Resizing happens
std::vector<std::list<K>> old_ht = ht;
infrequently
// Resize table to new capacity, and empty it
ht.resize(capacity); Amortized time
for (auto &list : ht) complexity: O(1)
list.clear();
cur_size = 0;
// Reinsert all previous items into new table Ideally, resize to another
for (auto &list : old_ht) prime number
for (auto &k : list)
Insert(k);
}
ht_chaining.h
27 / 40
Separate chaining
Conclusion
Complexity
Proportional to load factor
Typically force load factor to 1.0
O(1) O(1) amortized O(1)
Pros
Very effective if keys are uniformly distributed
Cons
Rely on another data structure, such as lists, to implement the chains
Note that the chains could also be implemented with balanced trees!
Cost associated to this additional data structure
Memory management, traversal, etc.
28 / 40
ECS 36C - Hashing (part 3)

Recap
Hashing Collisions
Horner's method Collisions are inevitable
π
2 M insertions before first collision in
h = key[0] ∗ 31L−1 + key[1] ∗ 31L−2
+... + key[L − 1] ∗ 310 hashtable of size M
Separate chaining
Lists of keys hashed to same array index Load factor
ht[5] Average length of chains in
'd' 'i' 's' hashtable
0
Defines complexity of most
1 operations
Maintain constant load factor
'u' 'a'
2 via resizing
3 'v'
4
'c'
30 / 40
Open addressing
Principle
Keys hashed to the same array index are not chained together
Inserted in the same flat hash table, but at the next available index
Hash table must be at least the same size as the number of keys total
Formally ht[M]
Key x is inserted at the first index hi available

hash(keyC) ...
hi (x) = (hash(x) + f (i)) mod M
h0 keyA

where f (0) = 0
+f(1) ...
Function f is the collision resolution strategy
Linear probing, quadratic probing, etc. h1 keyB
+f(2) ...
h2 keyC
...
Operations
Insert: keep probing until an empty index is found
Search: keep probing until key is found
Or until encountering an empty index (i.e., key not found)
31 / 40
Linear probing
Principle
Function f is a linear function of i
Typically f (i) =i
In case of collision, try next indices sequentially
Example Key Hash Index
Hash table of size M = 17 I 5 5

Insert sequence of characters < 9 9
3 0 0
Observations
L 8 8
Apparition of primary clusters
U 0 1 (= 0 + 1)
Contiguous block of keys
Caused by multiple collisions V 1 2 (= 1 + 1)
Key hashed into a cluster might require multiple
attempts to be inserted
32 / 40
Linear probing
Data structure
Based on flat hash table
Same public API as for separate chaining implementation
class HTLinear {
public:
...
private:
struct Node {
K key;
bool taken;
};
static const size_t init_capacity = 11;
std::vector<Node> ht{init_capacity};
size_t cur_size = 0;
// Helper methods
size_t Hash(const K &key);
void Resize(size_t capacity);
};
ht_linear.h

Same hash function as for size_t HTLinear<K>::Hash(const K &key) {
separate chaining static std::hash<K> k_hash;
return k_hash(key) % ht.size();
implementation }
ht_linear.h
33 / 40
Linear probing
API implementation (1/2)
bool HTLinear<K>::Contains(const K &key) { void HTLinear<K>::Insert(const K &key) {
size_t idx = Hash(key); // Key already there
while (ht[idx].taken) { if (Contains(key)) return;
if (ht[idx].key == key)
return true; // Find empty spot
idx = (idx + 1) % ht.size(); size_t idx = Hash(key);
} while (ht[idx].taken)
return false; idx = (idx + 1) % ht.size();
}
ht_linear.h // Put key there
ht[idx].key = key;
template <typename K> ht[idx].taken = true;
void HTLinear<K>::Resize(size_t capacity) {
std::vector<Node> old_ht = ht; // Update size
cur_size++;
// Resize to new capacity and empty
ht.resize(capacity); // Resize hash table if 50% full
for (auto &k : ht) if (cur_size > ht.size() / 2)
k.taken = false; Resize(ht.size() * 2);
cur_size = 0; }
ht_linear.h
// Reinsert all previous items Try to maintain load factor of maximum

for (auto &k : old_ht)
if (k.taken) 0.5
Insert(k.key);
}
ht_linear.h
34 / 40
Linear probing
Deletion principle
Scenario
Assuming following hash table
Key Hash Index
Cluster of three keys
How to remove key U? I 5 5
Key V should still be reachable < 9 9
Problems 3 0 0
If free index at U, V is not reachable anymore L 8 8

Hash(V) points to available entry
U 0 1
Same issue for key U if remove key 3
V 1 2
Solution
Free key's index upon removal
But rehash every key of cluster located directly after key
V would be reinserted at index 1, and stay reachable
35 / 40
Linear probing
Deletion implementation
void HTLinear<K>::Remove(const K &key) {
// Key not there
if (!Contains(key)) return;
// Find key position and remove it

size_t idx = Hash(key);
while (ht[idx].key != key)
idx = (idx + 1) % ht.size();
ht[idx].taken = false;
// Rehash the next keys of the same cluster

while (ht[idx].taken) {
// Temporarily remove key from its spot, and reinsert it right away
auto &key_cpy = ht[idx].key;
ht[idx].taken = false;
cur_size--;
Insert(key_cpy);

}
// Update size, and resize hash table if 12.5% full

cur_size--;
if (cur_size > 0 && cur_size <= ht.size() / 8)
Resize(ht.size() / 2);
}
ht_linear.h
36 / 40
Linear probing
Conclusion
Complexity
Difficult complexity analysis
Search hit Search miss/Insert
Often simplified to O(1)
O(∼ O(∼
1
2
(1 +

1
1−α
))

1
2
(1 + 1
(1−α)2
))
Pros
No overhead for memory management
Locality of reference, especially if multiple consecutive items can be in the same cache
line
Cons
Requires low load factor (0.5) compared to separate chaining
Bigger hash table
Causes primary clustering
37 / 40
Open addressing strategies
Linear probing
f (i) = i
If index at i is taken, try i+1, then i+2, ..., i+k.
Quadratic probing
f (i) = i2
If index at i is taken, try i+1^2, then i+2^2, ..., i+k^2.
Double hashing
Use second hash functions to compute intervals between probes
f (i) = i ∗ h2 (k)

Comparison
Strategy Locality of reference Prevents clustering
Linear probing Excellent Bad
Quadratic probing Good Good
Double hashing Bad Excellent
38 / 40
Standard C++ containers
The C++ Standard Library defines two main unordered associative containers, implemented
as hashed data structures:
std::unordered_set
Contains an unordered set of unique objects
Objects must be comparable for equality, as they are used as keys directly
std::unordered_map
Contains unordered pairs of key-value
Keys must be comparable for equality and unique
Comparison with ordered associate containers (map, set, etc.)

Constant vs logarithm complexity
Collection is not ordered by key, by design
39 / 40
Standard C++ containers
Demo
#include <iostream> (Bart: 10)
#include <unordered_map> (Lisa: 8)
int main() { (Alice: 36)
// Initialization (Bob: 23)
std::unordered_map<int, std::string> m = {
{36, "Alice"}, {23, "Bob"}
};
// Dynamic insertion by indexing

m[8] = "Lisa";
// Dynamic pair insertion

m.insert({10, "Bart"});
// Alternative pair insertion

// and of duplicated item
m.insert(std::make_pair(10, "Bart"));
// Iterate and print values

for (auto n : m)
std::cout << "(" << n.second
<< ": " << n.first << ")"
<< std::endl;
return 0;
}
std_unordered_map_example.cc
40 / 40

Hashing

Uploaded by

Copyright:

Available Formats

You might also like

Hashing

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Hashing

Uploaded by

Copyright:

Available Formats

ECS 36C - Hashing

Prof. Joël Porquet-Lupine

UC Davis - Spring Quarter 2020

Copyright © 2018-2020 Joël Porquet-Lupine - CC BY-NC-SA 4.0 International License 1 / 40

Example #1: frequency of letters

Search Insert Delete

Search Insert Delete

O(log N ) O(N ) O(N ) ...'a''b''c''d''e'...

Search avg/worst Insert avg/worst Delete avg/worst

O(log N )/O(N ) O(log N )/O(N ) O(log N )/O(N )

'd' AVL or RB tree

O(log N ) O(log N ) O(log N )

O(1) O(1) O(1)

Using data as index

Two implicit steps:

// Hash table Hash(548) => 48

... ... ...

... ... ...

... ... ...

... ... ...

UC Davis - Spring Quarter 2020

Copyright © 2018-2020 Joël Porquet-Lupine - CC BY-NC-SA 4.0 International License 13 / 40

Hashing positive integers Table size

... ... ...

std::vector<std::string> dict{ Hash(the) => 321

Reduce to table size

Example: street address struct Address {

31x + y rule size_t HashAddress(Address &a, size_t M) {

// Hash street number

Output range of 97 #5 603 20

11 hashes to observe the first collision #6 857 81

Now, what is the output range was 223?... #7 544 59

Need to restart a simulation... #8 847 71

Or need for better mathematical tools! #9 423 35

Back to collision probability

π ... ... ...

key hash ht[5]

Insertion: key pushed in ith chain

Data structure implementation

42 => 42 Good and easy way to get a well-distributed

std::find() returns iterator on matching item within specified bounds

template <typename K> template <typename K>

// Push key into right bucket // Remove key from bucket

Load factor: 7/5 = 1.4 4

Load factor: 7/11 = 0.63

template <typename K>

UC Davis - Spring Quarter 2020

Copyright © 2018-2020 Joël Porquet-Lupine - CC BY-NC-SA 4.0 International License 29 / 40

+... + key[L − 1] ∗ 310 hashtable of size M

Example Key Hash Index

Hash table of size M = 17 I 5 5

template <typename K>

// Reinsert all previous items Try to maintain load factor of maximum

If free index at U, V is not reachable anymore L 8 8

// Find key position and remove it

// Rehash the next keys of the same cluster

idx = (idx + 1) % ht.size();

// Update size, and resize hash table if 12.5% full