
Data Structures: Hashing

Course 2016
Data Structures: Reminder
Given a universe U, a dynamic set of records, where each record
consists of a key k together with its satellite data.

I Array
I Linked List (and variations)
I Stack (LIFO): supports push and pop
I Queue (FIFO): supports enqueue and dequeue
I Deque: supports push, pop, enqueue and dequeue
I Heaps: support insertions, deletions, find-Max and find-Min
I Hashing
Dynamic Sets.

Given a universe U and a set of keys S ⊂ U, for any k ∈ S we can


consider the following operations
I Search (S, k): decide if k ∈ S
I Insert (S, k): S := S ∪ {k}
I Delete (S, k): S := S\{k}
I Minimum (S): Returns element of S with smallest k
I Maximum (S): Returns element of S with largest k
I Successor (S, k): Returns element of S with next larger key
to k
I Predecessor (S, k): Returns element of S with next smaller
key to k.
Recall Dynamic Data Structures

DICTIONARY
Data structure for maintaining S ⊂ U together with operations:
I Search (S, k): decide if k ∈ S
I Insert (S, k): S := S ∪ {k}
I Delete (S, k): S := S\{k}

PRIORITY QUEUE
Data structure for maintaining S ⊂ U together with operations:
I Insert (S, k): S := S ∪ {k}
I Maximum (S): Returns element of S with largest k
I Extract-Maximum (S): Returns and erases from S the element
of S with largest k
Priority Queue

Sorted Linked Lists:
I INSERT: O(n)
I EXTRACT-MAX: O(1)

Heaps:
I INSERT: O(lg n)
I EXTRACT-MAX: O(lg n)

Using a Heap is a good compromise between the cost of insertion
and the cost of extraction.
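As a minimal sketch of the heap-based priority queue (using Python's standard heapq module, which implements a min-heap; negating keys turns it into a max-heap):

```python
import heapq

class MaxPQ:
    """Priority queue on a binary heap: O(lg n) INSERT and EXTRACT-MAX."""
    def __init__(self):
        self._heap = []

    def insert(self, k):               # O(lg n)
        heapq.heappush(self._heap, -k)

    def extract_max(self):             # O(lg n): returns and erases the max
        return -heapq.heappop(self._heap)

pq = MaxPQ()
for k in [5, 1, 9, 3]:
    pq.insert(k)
print(pq.extract_max())  # 9
print(pq.extract_max())  # 5
```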
String Matching

• Given a text, find a subtext.
• Given two texts, find common subtexts (plagiarism).
• Given two genomes, find common subchains (consecutive characters).

Search: primality of a number.

Document similarity

• Finding similar documents in the WWW.
• Proliferation of almost identical documents: approximately 30%
of the pages on the web are (near) duplicates.
• Another way to find plagiarism.
Hashing functions
Data structure that supports dictionary operations on a universe
of numerical keys.

Notice the number of possible keys represented as 64-bit integers is
2^64 = 18446744073709551616.
Tradeoff time/space
Define a hashing table T [0, . . . , m − 1] and
a hashing function h : U → T [0, . . . , m − 1].

[Photo: Hans P. Luhn (1896–1964), a pioneer of hashing]
[Figure: h maps the keys of S ⊂ U into T ; two keys sent to the
same slot form a collision]
Simple uniform hashing function.

A good hashing function must have the property that every k ∈ U
has the same probability of ending in any slot T [i].
Given a hashing table T with m slots, we want to store at most
n = |S| keys.
Important measure: load factor α = n/m, the average number of
keys per slot.
The performance of hashing depends on how well h distributes the
keys over the m slots: h is simple uniform if it hashes any key with
equal probability into any slot, independently of where other keys
go.
How to choose h?

Advice: for an exhaustive treatment of hashing, see D. Knuth,
Vol. 3 of The Art of Computer Programming.

h depends on the type of key:

• If k ∈ R, 0 ≤ k ≤ 1, we can use h(k) = ⌊mk⌋.
• If k ∈ R, s ≤ k ≤ t, first scale to [0, 1] and use the previous
method: h(k) = ⌊m(k − s)/(t − s)⌋.
The division method

Choose m prime and as far as possible from a power of 2,

h(k) = k mod m .

Fast (Θ(1)) to compute in most languages (k%m)!

Be aware: if m = 2^r the hash depends only on the r lowest bits
of k.
If r = 6 and k = 1011000111 011010, then h(k) = 011010, the last
6 bits (for instance, 45530 mod 64 = 858 mod 64 = 26).
• In some applications the keys may be very large, for instance
alphanumeric keys, which must be converted via ASCII.
Example: averylongkey is converted via ASCII to
97 · 128^11 + 118 · 128^10 + 101 · 128^9 + 114 · 128^8
+ 121 · 128^7 + 108 · 128^6 + 111 · 128^5 + 110 · 128^4
+ 103 · 128^3 + 107 · 128^2 + 101 · 128^1 + 121 · 128^0 = n,

a number of 84 bits!


Recall mod arithmetic: for a, b, c, m ∈ Z,
(a + b) mod m = (a mod m + b mod m) mod m
(a · b) mod m = ((a mod m) · (b mod m)) mod m
a(b + c) mod m = (ab mod m + ac mod m) mod m
If a ∈ Zm then (a mod m) mod m = a mod m
Horner's rule: given a specific value x0 and a polynomial
A(x) = Σ_{i=0}^{n} a_i x^i = a_0 + a_1 x + · · · + a_n x^n, to evaluate
A(x0) in Θ(n) steps:

A(x0) = a0 + x0 (a1 + x0 (a2 + · · · + x0 (an−1 + an x0 )))


How to deal with large n

For large n, to compute h = n mod m, we can combine mod
arithmetic with Horner's method:

(((((((((((97 · 128 + 118) · 128 + 101) · 128 + 114) · 128 + 121)
· 128 + 108) · 128 + 111) · 128 + 110) · 128 + 103) · 128 + 107)
· 128 + 101) · 128 + 121) mod m
= ((((97 · 128 + 118) mod m) · 128 + 101) mod m · 128 + · · · ) mod m,

taking the mod after every step, so the intermediate values stay small.
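The Horner-with-mod computation above can be sketched in Python; the table size 101 used in the demonstration is an arbitrary choice:

```python
def hash_string(key, m, radix=128):
    """Horner's rule with a mod at every step, so intermediate
    values never exceed radix * m (no 84-bit integers needed)."""
    h = 0
    for c in key:
        h = (h * radix + ord(c)) % m
    return h

print(hash_string("averylongkey", 101))
```

The result equals computing the full 84-bit integer first and reducing mod m at the end, but without ever building a large number.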
Collision resolution: Separate chaining

For each table address, construct a linked list of the items whose
keys hash to that address.

I Every key colliding at slot i goes to the list at T [i]
I Time to explore the list = length of the list

[Figure: keys 20, 27, 8 with h(20) = h(27) = h(8) = i, chained in
the list at slot i]
Cost of average analysis of chaining

The cost of the dictionary operations using hashing:

I Insertion of a new key: Θ(1).
I Search of a key: O(length of the list).
I Deletion of a key: O(length of the list).

Under the hypothesis that h is simple uniform hashing, each key x
is equally likely to be hashed to any slot of T , independently of
where other keys are hashed.
Therefore, the expected number of keys falling into T [i] is
α = n/m.
Cost of search

For an unsuccessful search (x is not in T ) we have to explore the
whole list at h(x) → T [i], so the expected time to search the list
at T [i] is O(1 + α)
(α for searching the list and Θ(1) for computing h(x) and going to
slot T [i]).
For a successful search we can obtain the same bound
(in most cases we only have to search a fraction of the list until
finding the element x).
Therefore we have the following result: under the assumption of
simple uniform hashing, in a hash table with chaining, an
unsuccessful or successful search takes time Θ(1 + n/m) on
average.
Notice that if n = Θ(m) then α = O(1) and search time is Θ(1).
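A minimal sketch of a chained hash table with the division method (the default table size 101 is an arbitrary prime):

```python
class ChainedHashTable:
    """Dictionary via separate chaining; expected O(1 + alpha) search/delete."""
    def __init__(self, m=101):            # m should be prime
        self.m = m
        self.table = [[] for _ in range(m)]

    def _h(self, k):                      # division method
        return k % self.m

    def insert(self, k):                  # Theta(1): append to the chain
        self.table[self._h(k)].append(k)

    def search(self, k):                  # O(length of the chain)
        return k in self.table[self._h(k)]

    def delete(self, k):                  # O(length of the chain)
        chain = self.table[self._h(k)]
        if k in chain:
            chain.remove(k)
```

With m = 7, the keys 20 and 27 collide (both hash to slot 6) and end up in the same chain, exactly as in the figure above.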
Universal hashing: Motivation

For every deterministic hash function, there is a set of bad
instances:
an adversary can arrange the keys so your function hashes most of
them to the same slot.
Idea: create a set H of hash functions on U and choose a hashing
function at random, independently of the keys.
Be careful: once we choose a particular hashing function for a
given key, we must always use the same function to deal with that
key.
Universal hashing

Let U be the universe of keys and let H be a collection of hashing
functions into the hashing table T [0, . . . , m − 1]. H is universal if
∀x, y ∈ U, x ≠ y,

|{h ∈ H | h(x) = h(y)}| ≤ |H|/m.

In an equivalent way, H is universal if ∀x, y ∈ U, x ≠ y, and for
any h chosen uniformly from H, we have

Pr [h(x) = h(y)] ≤ 1/m.
Universality gives good average-case behaviour

Theorem
Pick h u.a.r. from a universal H and use it to hash n keys into a
table T of size m. For any given key x, let Zx be the random
variable counting the number of collisions of x with other keys y.
Then E [Zx ] ≤ n/m.

Proof We want to compute the expected length of the list at T [i].
For each pair of keys x, y, define an indicator r.v. counting whether
they hash to the same slot:

Zxy = 1 if h(x) = h(y), and Zxy = 0 otherwise.

Then E [Zxy ] ≤ 1/m and Zx = Σ_{y ∈ S−{x}} Zxy .
Proof

E [Zx ] = E [ Σ_{y ∈ S−{x}} Zxy ]
       = Σ_{y ∈ S−{x}} E [Zxy ]
       ≤ Σ_{y ∈ S−{x}} 1/m = (n − 1)/m.

Therefore, universal hash functions defeat the adversarial strategy.


Construction of a universal family: H

To construct a family H for N = max{U} and T [0, . . . , m − 1]:

I Choose a prime p, N ≤ p ≤ 2N. Then
U ⊆ Zp = {0, 1, . . . , p − 1}.
I Choose independently and u.a.r. a ∈ Z_p^+ and b ∈ Zp . Given a
key x, define ha,b (x) = ga,b (x) mod m, where ga,b (x) = (ax + b) mod p.
I H = {ha,b | a, b ∈ Zp , a ≠ 0}.

Example: with p = 17 and m = 6 we have H17,6 = {ha,b : a ∈ Z_p^+, b ∈ Zp }.
If x = 8, a = 3, b = 4 then
h3,4 (8) = ((3 · 8 + 4) mod 17) mod 6 = (28 mod 17) mod 6 = 11 mod 6 = 5.
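A minimal sketch of drawing h_{a,b} u.a.r. from this family; for simplicity it takes the smallest prime ≥ N (any prime in [N, 2N] works):

```python
import random

def make_universal_hash(N, m):
    """Return a random h_{a,b}(x) = ((a*x + b) mod p) mod m from H."""
    def is_prime(q):
        if q < 2:
            return False
        return all(q % d for d in range(2, int(q ** 0.5) + 1))
    p = N
    while not is_prime(p):       # smallest prime p >= N
        p += 1
    a = random.randrange(1, p)   # a in Z_p^+ (a != 0)
    b = random.randrange(0, p)   # b in Z_p
    return lambda x: ((a * x + b) % p) % m

# The slide's example: p = 17, a = 3, b = 4, m = 6
h = lambda x: ((3 * x + 4) % 17) % 6
print(h(8))  # 5
```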
Properties of H

1. ha,b : Zp → Zm .
2. |H| = p(p − 1). (We can select a in p − 1 ways and b in p ways.)
3. Specifying an h ∈ H requires O(lg p) = O(lg N) bits.
4. To choose h ∈ H, select a, b independently and u.a.r. from Z_p^+
and Zp .
5. Evaluating h(x) is fast.
Theorem
The family H is universal.

For the proof, see Chapter 11 of Cormen, Leiserson, Rivest, Stein:
Introduction to Algorithms.
Bloom filter

Given a set of elements S, we want a data structure supporting
insertions and querying membership in S.
In particular we wish a DS that
I minimizes the use of memory,
I can check membership as fast as possible.

Burton Bloom: Space/time trade-offs in hash coding with allowable
errors. Comm. ACM, July 1970.

A hash data structure where each register in the table is one bit.
Query on a list of e-mails

We have a set S of 10^9 e-mail addresses, where a typical e-mail
address is 20 bytes. Therefore it does not seem reasonable to store
S in main memory. We can spare 1 Gigabyte of memory, which is
approximately 10^9 bytes or 8 × 10^9 bits. How can we put S in
main memory to query it?
Definition Bloom filter

Create a one-bit hash table T [0, . . . , m − 1] and a hash function h.
Initially all m bits are set to 0.
Given a set S = {x1 , . . . , xn }, define a hashing function h : S → T .
For every xi ∈ S, h(xi ) → T [j] and T [j] := 1.
Given a set S, a function h() and a table T [m]:

Insert (x)
  h(x) → i
  if T [i] == 0 then
    T [i] = 1
  end if

inS(y)
  h(y) → i
  if T [i] == 1 then
    return Yes
  else
    return No
  end if

Notice: once we have hashed S into T we can erase S.
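The definition above, sketched with a single hash function (the multiplicative constant 2654435769 is an arbitrary choice used for illustration):

```python
class BloomFilter:
    """One bit per slot, one hash function, as in the definition above."""
    def __init__(self, m):
        self.m = m
        self.bits = [0] * m

    def _h(self, x):
        return (x * 2654435769) % self.m

    def insert(self, x):
        self.bits[self._h(x)] = 1

    def in_s(self, y):
        # May report false positives, but never false negatives.
        return self.bits[self._h(y)] == 1

bf = BloomFilter(1000)
bf.insert(42)
print(bf.in_s(42))  # True
```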
False positives

[Figure: a table T with a few bits set to 1; keys x, y, z, u ∈ S and
a key w ∉ S whose hash lands on a bit already set to 1]

A Bloom filter needs O(m) space and answers membership queries
in Θ(1).
Inconvenience: it does not support removal and may report false
positives.
In a query y ∈ S?, a Bloom filter always will report correctly if
indeed y ∈ S (h(y) → T [i] with T [i] = 1),
but if y ∉ S it may be the case that h(y) → T [i] with T [i] = 1,
which is called a false positive.
How large is the probability of having a false positive?
How large is the error of having a false positive?
Probability of having a false positive
Let |S| = n; we constructed a BF (h, T [m]) with all elements in S.
If we query y ∈ S?, with y ∉ S, and h(y) → T [i], what is
the probability that T [i] = 1?
After all the elements of S are hashed into the Bloom filter, the
probability that a specific T [i] = 0 is (1 − 1/m)^n ∼ e^{−n/m}
(recall that e = lim_{x→∞} (1 + 1/x)^x and e^{−1} = lim_{x→∞} (1 − 1/x)^x ).
Therefore, for a y ∉ S, the probability π of a false positive is

π = Pr [T [i] = 1] = 1 − (1 − 1/m)^n ∼ 1 − e^{−n/m} .

To minimise π, e^{−n/m} has to be close to 1, i.e., m ≫ n.

For example: if m = 100n, π ≈ 0.00995; if m = n, π ≈ 0.632; and
if m = n/10, π ≈ 0.99995.
Alternative: Amplify
Take k different functions {h1 , h2 , . . . , hk } from the same
2-universal set of functions.
Ex. Bloom filter with 3 hash functions: h1 , h2 , h3 .

[Figure: elements a, b, c, d, e each set 3 bits of T ]

When making a query y ∈ S?, compute h1 (y), . . . , hk (y); if one of
them lands on a 0 bit we know with certainty that y ∉ S; else (if
all the k hashes go to bits with value 1) y ∈ S only with some
probability.
After hashing the n elements k times into T :

Pr [T [i] = 1] ∼ 1 − e^{−kn/m} .

The probability that all of h1 (y), . . . , hk (y) go to bits already set
to 1, i.e. the probability of a false positive, is:

p = (1 − e^{−kn/m})^k .
Asymptotic estimations for k and m

To minimize the probability of having a false positive, set dp/dk = 0.
Let f (k) = ln p, then f (k) = k ln(1 − e^{−kn/m})
⇒ f ′(k) = ln(1 − e^{−kn/m}) + kn e^{−kn/m} / (m(1 − e^{−kn/m}))

Making f ′(k) = 0, we get

kopt = (m/n) ln 2 ≈ (9/13)(m/n).

The probability of having a false positive for kopt is

p0 = (1 − e^{−(9/13)(m/n)(n/m)})^{(9/13)(m/n)} ∼ (1/2)^{(9/13)(m/n)} ≈ 0.619^{m/n} .
2
Asymptotic estimations for k and m

To estimate the size m of T as a function of n and p, for k = kopt :

(9/13)(m/n) ln(1/2) = ln p ⇒ m = −(13 n ln p)/(9 ln 2) ≈ −2.08 n ln p.

Therefore, to maintain a fixed false positive probability, the length
of the Bloom table must grow linearly with n.
Optimal number of hash functions

For given n and m, the number of hash functions k which
minimizes the false positive probability is:

k = (m/n) ln 2 ≈ 0.6931 (m/n).

To obtain the size m of T , substitute the value of k in
p = (1 − e^{−kn/m})^k to get

m = − n ln p / (ln 2)^2 .

Therefore, to maintain a fixed false positive probability, the length
of the Bloom table must grow linearly with n.
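The two formulas above can be turned into a small sizing helper; for instance, for n = 10^6 keys and a target false-positive probability p = 0.01 they give roughly 9.6 million bits and k = 7 hash functions:

```python
import math

def bloom_parameters(n, p):
    """Optimal table size m and number of hash functions k for n keys
    and target false-positive probability p."""
    m = -n * math.log(p) / (math.log(2) ** 2)   # m = -n ln p / (ln 2)^2
    k = (m / n) * math.log(2)                   # k = (m/n) ln 2
    return math.ceil(m), round(k)

print(bloom_parameters(10**6, 0.01))
```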
Practical issues
For password checking:
if the dictionary D has 100000 common words, each of 7
characters, we need 700000 bytes.
Use 5 tables of 160000 bits each ⇒ we need a total of 800000 bits
= 100000 bytes.
The probability of error is 0.02.

On the other hand, although the results shown before are
asymptotic, they also hold for practical values of n: the figure at
the side gives the probability of a false positive as a function of n.
Another application of Bloom filters: Caching structures
Recall: http (Hypertext Transfer Protocol) is the basic network
protocol to distribute information on the WWW (Tim Berners-Lee
(1990)).
HTML (HyperText Markup Language) is the standard language for
creating web pages and web applications.
URL (Uniform Resource Locator) is a web address indicating for
example web pages:
http://www.cs.upc.edu/∼diaz

A web server is a computer system that processes requests using
http to deliver web pages to clients.

A web cache is a technology for temporary storage of web
documents (html pages, images, . . . ) which aims to reduce
bandwidth, server load and lag (latency).
Another application of Bloom filters: Caching structures

Suppose we have a set U of n URLs, each one with 100 characters,
i.e. in total we have 800n bits.
Consider caches C1 , C2 , C3 , each with documents indexed by their
URL. A query for URL x is sent to one of the caches; that cache
must determine which of the caches has x (if x is there at all).
If every Ci stores 10000 documents, that means about 48000000
bits may be exchanged.
Bloom filters may help to reduce the transfer of bits, accepting a
small margin of error.
Another application of Bloom filters: Caching structures

I Each proxy hashes all of the URLs in its cache into a Bloom
filter.
I Proxies periodically exchange Bloom filters, so queries about
the other caches can be answered locally without sending an
ICP message.
Cache filtering

Using a Bloom filter to prevent one-hit-wonders from being stored


in a web cache decreased the rate of disk writes by nearly one half,
reducing the load on the disks and potentially increasing disk
performance.
Nearly three-quarters of the URLs accessed from a typical web
cache are one-hit-wonders accessed by users only once and never
again.
To prevent caching one-hit-wonders, a Bloom filter is used to keep
track of all URLs that are accessed by users.
A web object is cached only when it has been accessed at least
once before.
Further applications of Bloom filters

Bloom filters are useful whenever a set of keys is used and space is
important.
I Packet routing: Bloom filters provide a means to speed up or
simplify packet routing protocols.
I IP Traceback.
I Useful tool for measurement infrastructures used to create
data summaries in routers or other network devices.

A. Broder, M. Mitzenmacher: Network applications of Bloom
filters: A survey. Internet Mathematics, 1(4): 485–509, 2005.
Cuckoo Hashing

Pagh, Rodler: Cuckoo Hashing. ESA 2001.

Cuckoo hashing is a hashing technique where:
I Lookups are Θ(1) worst-case.
I Deletions are Θ(1) worst-case.
I Insertions are O(1) in expectation.
Cuckoo Hashing

I We have two hash tables T1 , T2 , each of size m, and two
hash functions, h1 for T1 and h2 for T2 .
I We can use for instance h1 (k) = k mod m and
h2 (k) = ⌊k/m⌋ mod m.
I Every element k ∈ U can be in only two positions: at h1 (k) in
T1 or at h2 (k) in T2 .
I Lookups take Θ(1) because we only need to check 2 positions.
I Deletions take Θ(1) because we only need to check 2
positions.
I To insert k ∈ U, try h1 (k); if the slot is empty, put k there; if
the slot contains some k ′, kick out k ′, let k stay there, and k ′
repeats the behaviour of k on T2 .
I Repeat this process, bouncing between tables, until all
elements stabilize.
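A minimal sketch of the insertion procedure above, with the h1, h2 from the slide; MAX_KICKS is an arbitrary bound on the displacement chain, and exceeding it signals that a rehash is needed:

```python
class CuckooHash:
    """Two-table cuckoo hashing: h1(k) = k mod m, h2(k) = (k // m) mod m."""
    MAX_KICKS = 32

    def __init__(self, m):
        self.m = m
        self.T1 = [None] * m
        self.T2 = [None] * m

    def h1(self, k):
        return k % self.m

    def h2(self, k):
        return (k // self.m) % self.m

    def lookup(self, k):          # Theta(1) worst case: only 2 positions
        return self.T1[self.h1(k)] == k or self.T2[self.h2(k)] == k

    def insert(self, k):
        for _ in range(self.MAX_KICKS):
            i = self.h1(k)
            if self.T1[i] is None:
                self.T1[i] = k
                return True
            k, self.T1[i] = self.T1[i], k   # kick out the occupant of T1[i]
            j = self.h2(k)
            if self.T2[j] is None:
                self.T2[j] = k
                return True
            k, self.T2[j] = self.T2[j], k   # evicted key bounces back to T1
        return False                        # cycle suspected: caller rehashes
```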
Cuckoo Hashing: Long cycles of insertion

One complication is that the cuckoo process may loop for ever.
The probability of such an event is small. To cope with it, choose
an upper bound on the number of slot exchanges and, if it is
exceeded, do a rehash: choose new functions and start again.

Example: we have keys {y, x, w, z, u} with
h1 (x) = 2; h1 (y) = 2; h1 (w) = 4; h1 (z) = 4;
h2 (x) = 1; h2 (y) = 1; h2 (w) = 2; h2 (z) = 2.
Next we hash u, with h1 (u) = 4 and h2 (u) = 2: the insertion gets
into a cycle.
If insertion gets into a cycle, we perform a rehash: choose new
h1 , h2 and insert all elements back into the table.
Cuckoo Hashing: An example

We wish to hash the set of keys (20, 50, 53, 75, 100, 67, 105, 3, 36, 39, 6)
using h1 (k) = k mod 11 and h2 (k) = ⌊k/11⌋ mod 11.

  k    h1   h2
 20     9    1
 50     6    4
 53     9    4
 75     9    6
100     1    9
 67     1    6
105     6    9
  3     3    0
 36     3    3
 39     6    3
  6     6    0

[Figure: successive snapshots of T1 and T2 while the keys are
inserted in order; each new key whose T1 slot is occupied kicks the
occupant to its T2 slot, which may in turn kick another key. After
inserting all keys up to 39 the tables are:
T1 [1] = 100, T1 [3] = 36, T1 [6] = 50, T1 [9] = 75;
T2 [0] = 3, T2 [1] = 20, T2 [3] = 39, T2 [4] = 53, T2 [6] = 67,
T2 [9] = 105.]

When we try to insert 6 (h1 (6) = 6, h2 (6) = 0), the displacements
cycle: with 6 we have to rehash!!!
Complexity

Cuckoo hashing has the following complexity:

I Search for an element x: constant worst-case complexity (x
can only be in the 2 positions h1 (x) or h2 (x)).
I Delete an element: constant worst-case complexity (look at
the 2 positions and erase the element).
I Insert an element: expected constant complexity.

It is a simple alternative to perfect hashing, to implement a
dictionary with reasonable space and constant searching time.
There are other models, for example d-ary cuckoo hashing with d
tables.
String matching

The string matching problem: given a text TX[1 . . . n] and a
pattern P[1 . . . ℓ], where the elements of TX and P are drawn from
the same alphabet Σ, we wish to find all the occurrences of P in
TX, together with the positions where they start.
TX: a b c a b a a b c a b a b a a c b a a b a b
P: a b a a
TX: a b c a b a a b c a b a b a a c b a a b a b

Given strings x and y:
|x| denotes the length of x,
xy their concatenation, with length |x| + |y|.
Naive algorithm

Search (TX, P)
for i = 1 to n − ℓ + 1 do
  if P[1, . . . , ℓ] = TX[i, . . . , i + ℓ − 1] then
    print P occurs at i
  end if
end for

P: A A A G
TX: A A A A A G A G T C

This algorithm has complexity Θ((n − ℓ + 1)ℓ), worst case O(n^2).
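The naive search above can be sketched in a few lines of Python (0-indexed, returning the list of match positions):

```python
def naive_search(tx, p):
    """Check P against every alignment: Theta((n - l + 1) * l) worst case."""
    n, l = len(tx), len(p)
    return [i for i in range(n - l + 1) if tx[i:i + l] == p]

# The slide's example text and pattern (0-indexed positions):
print(naive_search("abcabaabcababaacbaabab", "abaa"))  # [3, 11]
```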


Hashing

Use hashing! R. Karp, M. Rabin: Efficient randomized
pattern-matching algorithms. IBM JRD, 1987.

Given TX (|TX| = n) and a pattern P (|P| = ℓ), we want to
define a hash function h and a table T [0, . . . , m − 1].
Notice each symbol in TX is a key. Wlog consider the alphabet
Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.
General idea of Karp-Rabin's hashing algorithm

Idea: break TX into overlapping substrings of length ℓ,
S0 , S1 , . . . , Si , . . ., and compute the decimal value of each
substring Si and of P.

Example: TX = 8 6 1 7 9 3 5 7 3 4 2 and P = 1 7 9 3 5, giving
s0 = 86179, s1 = 61793, s2 = 17935, . . . and p = 17935: the
pattern matches S2 .
Brute force implementation of the algorithm

Let si denote the decimal value of Si and p the decimal value of P.
Use Horner's rule to compute p in time Θ(ℓ):

p = P[ℓ − 1] + 10(P[ℓ − 2] + · · · + 10(P[0])) · · · )

In the same way, use Horner's rule to compute, for 0 ≤ i ≤ n − ℓ:

si = TX[i] 10^{ℓ−1} + TX[i + 1] 10^{ℓ−2} + · · · + TX[i + ℓ − 2] 10^1 + TX[i + ℓ − 1] 10^0
   = TX[i + ℓ − 1] + 10(TX[i + ℓ − 2] + · · · + 10(TX[i])) · · · ).
Brute force implementation

I At the beginning, set all registers of T to 0.
I Hash P into T : h(p) = p mod m; if h(p) = i then T [i] := 1.
I Run through TX, hashing each set of ℓ consecutive characters
into T .
I If one of them goes to a slot with T [i] = 1, double check that
the ℓ characters of Sk match P (i.e. sk − p = 0).

Complexity: O(nℓ), where ℓ could be Θ(n).

Rolling Hash

Instead of looking at the O(n) substrings independently, we may
take advantage of the fact that consecutive substrings overlap a
lot:
si = 79357 → si+1 = 93573 → si+2 = 35734

Knowing si , to get si+1 we only have to deal with the digit leaving
(TX[i]) and the digit entering (TX[i + ℓ]):

si+1 = (si − TX[i] · 10^{ℓ−1}) · 10 + TX[i + ℓ]
Rolling Hash

Recall mod magic: for a, b, m ∈ Z,
(a + b) mod m = (a mod m + b mod m) mod m
(a · b) mod m = ((a mod m) · (b mod m)) mod m
(a − b) mod m = ((a mod m) − (b mod m) + m) mod m
(a mod m) mod m = a mod m
Using the hash function h(a) = a mod m, for any a ∈ N:

h(si+1 ) = ((si − TX[i] · 10^{ℓ−1}) · 10 + TX[i + ℓ]) mod m
         = (((h(si ) − TX[i] · (10^{ℓ−1} mod m)) mod m) · 10 + TX[i + ℓ]) mod m,

where h(si ) is known and 10^{ℓ−1} mod m is precomputed.

Therefore, given h(si ) we can compute h(si+1 ) in Θ(1) steps.
Example

TX = 861793, ℓ = 5, m = 73.
Preprocess: h(86179) = 39 and 10^4 mod 73 = 72.

h(61793) = ((86179 − 8 · 10^4) · 10 + 3) mod 73
         = (((h(86179) − 8 · (10^4 mod 73)) mod 73) · 10 + 3) mod 73
         = ((47 · 10) mod 73 + 3) mod 73 = 35
Karp-Rabin Algorithm
Given a text |TX| = n, a pattern |P| = ℓ, a hash table |T | = m
and hash function h(·) = · mod m:

Karp-Rabin (TX, P, T )
h(p) = 0; h(s0 ) = 0; q = 10^{ℓ−1} mod m
for j = 0 to ℓ − 1 do
  h(p) = (10 h(p) + P[j]) mod m
  h(s0 ) = (10 h(s0 ) + TX[j]) mod m
end for
for i = 0 to n − ℓ do
  if h(p) == h(si ) then
    if P[0 . . . ℓ − 1] == TX[i . . . i + ℓ − 1] then
      return Match at i
    end if
  end if
  h(si+1 ) = (10(h(si ) − TX[i] q) + TX[i + ℓ]) mod m
end for
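The algorithm above translates almost line by line into Python; this sketch works on digit strings with radix 10 as in the slides (for general text one would use radix 128 or 256), and the table size m = 101 is an arbitrary prime:

```python
def karp_rabin(tx, p, m=101):
    """Rolling-hash string matching; returns all match positions."""
    n, l = len(tx), len(p)
    if l > n:
        return []
    d = 10                                   # radix for digit strings
    q = pow(d, l - 1, m)                     # 10^(l-1) mod m, precomputed
    hp = hs = 0
    for j in range(l):                       # initial hashes via Horner
        hp = (d * hp + int(p[j])) % m
        hs = (d * hs + int(tx[j])) % m
    matches = []
    for i in range(n - l + 1):
        if hp == hs and tx[i:i + l] == p:    # double-check rules out collisions
            matches.append(i)
        if i < n - l:                        # roll: drop tx[i], add tx[i+l]
            hs = (d * (hs - int(tx[i]) * q) + int(tx[i + l])) % m
    return matches

print(karp_rabin("86179357342", "17935"))  # [2]
```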
Complexity

I Any other radix d ≠ 10 behaves the same as radix 10: we just
substitute 10 by d.
I Using the rolling hash, we speed up the computation of the
hash of each ℓ-substring to Θ(1), once we have computed the
first one in O(ℓ).
I The total complexity depends on the number of
character-by-character comparisons; each comparison takes
Θ(ℓ).
I If TX and P are such that the algorithm must make Θ(n)
comparisons, the total complexity is Θ(nℓ).
I In most practical applications (genomics, text searching, etc.),
string searching using Karp-Rabin takes O(n + ℓ) = O(n).
Complexity

I Regarding collisions from hashing different substrings, we
must choose m to be a large prime that fits into a computer
word, so that the basic operations stay constant.
For instance, if m = O(n) then the expected number of
collisions is Θ(1) per slot; if m = O(n^2) we expect O(1/n)
collisions per slot, which is nice, but at the expense of a very
large T .
I There is a faster algorithm for string matching,
Knuth-Morris-Pratt, running in Θ(n + ℓ). But the simplicity
of Karp-Rabin and the ease of generalizing it to non-textual
applications make K-R a good choice, widely used in practice.
Common substring problem
Given two texts Tx1 and Tx2 , with |Tx1 | = |Tx2 | = n, discover
whether they share a common substring of length ℓ. Define h and
T [0 · · · m − 1] and use a rolling hash (notice blanks should be
treated as an extra symbol):

1. Hash the first substring of length ℓ in Tx1 into T . (O(ℓ))
2. Use the rolling hash to compute the subsequent n − 1 substrings
of Tx1 , hashing each one into T . (O(n))
3. Hash the first substring of length ℓ in Tx2 into T . (O(ℓ))
4. Use the rolling hash to compute the subsequent n − 1 substrings
of Tx2 , hashing each one into T . For each substring, check if there
are collisions with substrings from Tx1 . (O(n))
5. If a substring of Tx1 collides with a substring of Tx2 , do a
string comparison on those substrings. (O(ℓ))

If the number of collisions is small, the complexity is O(n);
but for a large number of collisions it could be O(n^2).
Cryptography

Cryptography is the study of techniques for secure communication
in the presence of adversaries.
I Ciphertext (cryptogram): the encrypted text.
I Encryption (enciphering): plaintext → ciphertext.
I Decryption (deciphering): ciphertext → plaintext.
I Cryptographic system (cipher): the system used for encrypting
and decrypting.
I Cryptanalysis: the process of breaking the code.
I Key: symbols used to encrypt and decrypt.
I Key space: the total number of keys that can be used in a
cryptographic system.
Cryptography is as old as tribal fights; see for example:
The Code Book, Simon Singh. Fourth Estate, 1999.
Cryptography
Cryptography is the study of techniques for secure communication
in the presence of adversaries.

[Figure: Alice encrypts plaintext M into ciphertext C with key K;
Bob decrypts C with key K; Eve sits on the channel between them]

The adversary Eve can eavesdrop on all communication between
Alice and Bob, so if Alice and Bob want to keep the contents
secret, they must encrypt the plaintext messages into a ciphertext
which Eve can't break.
To make clear the meaning of encryption

I Encoding: transform data so that it can be properly consumed
by a different type of system (binary data being sent over
email, ASCII, URL encoding, . . . ). The goal is not to keep
information secret, but rather to ensure that it can be
properly consumed. Encoding works by using a scheme that is
publicly available, so that it can easily be reversed. It does not
require a key, as the only thing required to decode it is the
algorithm that was used to encode it.
I Encryption: transform data in order to keep it secret from
others (sending a password over the internet using PGP). The
goal is to ensure the data cannot be consumed by anyone
other than the intended agent. Encryption transforms data
into another format in such a way that only specific
individual(s) can reverse the transformation. It uses a key,
which is kept secret.
Basic Techniques

Most of the encryption techniques are based on variants and


combinations of two basic techniques:
I Substitution: Symbols are directly replaced by other symbols.
I Transpositions (permutations): Rearrange the order of
appearance of the elements of the plaintext.
Example of Substitution
Caesar cipher: shift k places in the alphabet.
Ex: with K = 3 the alphabet A B C D . . . W X Y Z maps to
D E F G . . . Z A B C, so veni vidi vici encrypts to yhql ylgl ylfl.

The math of Caesar: Ek (i) = (i + k) mod 26; Dk (i) = (i − k) mod 26.
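The Caesar shift formulas can be sketched directly in Python; non-letters pass through unchanged:

```python
def caesar(text, k):
    """E_k(i) = (i + k) mod 26; decrypt by calling with -k."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + k) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

print(caesar("veni vidi vici", 3))    # yhql ylgl ylfl
print(caesar("yhql ylgl ylfl", -3))   # veni vidi vici
```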
Other example of substitution: the key is given by a substitution
table. [Figure omitted]
Pure Transposition

To implement a pure transposition, write your plaintext message
along the rows of a matrix and read it off by columns, in the order
given by the key. Example with key 2 4 1 3:

plaintext:   r e c o
             n y q u
             i n p a
             s t i s

ciphertext: eyntouasrniscqpi

In real-life algorithms, there are multiple rounds of interlaced
transposition and substitution.
Public key encryption

Useful for digital signatures.

A sends a message M to B; E can eavesdrop on M.
How can we ensure E cannot recover M?
Private-Key Systems: the key F is secret. Both A and B have a
copy of F and F −1 (dangerous).
To encrypt message M: compute X = F (M).
To decrypt: compute M = F −1 (X ).

Public-Key Systems (Diffie-Hellman):
S = F is private and secret; P = F −1 is public. Knowing P does
not help in discovering S.
M: A → B; E is the eavesdropper.

Public keys: PA , PB ; secret keys: SA , SB .

Secret and public keys must have the following property: for any
person A we must have M = SA (PA (M)) = PA (SA (M)).
To send M: A → B,
(1.-) A gets PB ,
(2.-) A computes the ciphertext C = PB (M),
(3.-) A sends C to B.

When B gets C : SB (C ) = SB (PB (M)) = M.
Digital signature
A sends to B a pair (M, σ) such that B knows only A could have
sent M. σ is called the signature.
(1.-) A computes σ = SA (M) and sends to B the ciphertext
C = PB (σ) together with M,
(2.-) B recovers σ = SB (C ) and checks that PA (σ) = M (only A
knows SA , so only A could have computed σ).
Applications: Cryptographic hash functions

Cryptographic hash functions: one-way hash functions.

[Figure: the cryptographic hash function maps the short input
"Hola" to DFCD3454BBEA788A751A696C24D97009CA992D17 and
the long input "Setze jutges d'un jutjat mengen el fetge d'un
penjat, si el penjat es despenges es menjaria els setze fetges dels
setze jutges que l'han jutjat" to
46042841935C7F809158585AB94AE24126EB3CEA, both outputs in
hexadecimal.]
Cryptographic hash
Cryptographic hash functions take a string of any length and
output a fixed-length hash value, in general written in hexadecimal.

Hexadecimal = radix 16:
(4CF5)16 = 4 × 16^3 + 12 × 16^2 + 15 × 16^1 + 5 × 16^0 = 19701.

For security reasons, modern crypto-hash implementations yield
very large integers; for instance MD5 gives a 128-bit integer,
SHA-512 yields a 512-bit integer.
To make those output integers more compact, it is customary to
represent them in radix 16.
Main Properties of a Cryptographic Hash Function hc

Due to the applications, malicious adversaries try to break
crypto-hash functions.
Crypto-hash functions should behave as random functions while
being deterministic and efficiently computable.
1. For any message M, hc (M) is fast to compute.
2. Pre-image resistance: it is not feasible to recover M from
C = hc (M) (computing hc^{−1} should not be feasible).
3. Collision resistance: given a text M1 , it should not be feasible
to find M2 ≠ M1 s.t. hc (M1 ) = hc (M2 ).
4. Sensitivity: if we slightly modify M to M ′, then
hc (M) ≠ hc (M ′).
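The sensitivity property can be observed with Python's standard hashlib module: two messages differing in a single character produce digests with no visible relation.

```python
import hashlib

# SHA-1 digests of two messages differing in one character.
print(hashlib.sha1(b"abc").hexdigest())  # a9993e364706816aba3e25717850c26c9cd0d89d
print(hashlib.sha1(b"abd").hexdigest())  # completely different digest
```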
Example cryptographic hash function SHA-1

https://www.sha1-online.com
Applications: Message Digest
A direct application of the collision resistance property:
Alice wants to upload a very large document to a Dropbox-like
repository.
She wants to be sure that when she downloads the document, it is
exactly the same document.
An adversary wants to substitute Alice's document with a forged
one.
Applications: Message Digest
Alice appends a cryptohash h of the document to the stored
document, and keeps a copy of the hash digest for herself (very
short).
The adversary has access to h, but as soon as he tampers with the
document, the digest of the document will be different from the
one appended to the original document.
When Alice retrieves the document, she just has to compare the
digest of the document with the copy that she kept.

[Figure: the repository stores the document together with its
digest 70EEF0D235D6362C752A0CE7; the tampered document
hashes to A65FED252A6B2A94EC2A0E7, which does not match.]
Applications: Password verification

Reduce the security breach when storing passwords:
store the hash digests in a table together with the user names.
Applications: Password verification

There are cryptographic hash functions to get a cryptographic
integer from personal biometric data: fingerprint, retinal scan, etc.
Application: Digital signature and verifying the integrity of
files or messages
Determining whether any changes have been made to a message
or file, and confirming that the sender is Alice.

[Figure: Alice hashes the document, signs the digest with her
private key and sends the document plus the signed digest; Bob
hashes the received document, recovers the digest with Alice's
public key and compares the two digests, e.g.
70EEF0D235D6362C752A0CE7 = 70EEF0D235D6362C752A0CE7.]
How to construct a secure hc : Merkle's scheme
Assume M is the message whose crypto hash we want to compute,
and M is already in binary.
I The input message M is partitioned into L blocks, each of
size exactly m bits.
I For extra security, the ending block includes the total length
of the message whose hash function is to be computed.
I The scheme has L sequential stages, one for each block.
I The i-th stage has as input the m bits of the i-th block and
the n-bit output of the previous stage. The 1st stage is
provided with an n-bit vector, the Initialization Vector (IV).

[Figure: blocks B1 , B2 , B3 , . . . , BL of m bits each feed successive
applications of the compression function f ; the IV (n bits) enters
the first stage, each stage passes n bits to the next, and the final
n-bit output is the hash.]

I The key ingredient is the compression function f , which
depends on the implementation.
The Secure Hash Algorithm (SHA) family

The Secure Hash Algorithm is a family of cryptographic hash
functions developed by the National Institute of Standards and
Technology (NIST) as a U.S. federal standard.
The technical ideas are based on previous work of several
cryptographers: Ron Rivest, Ralph Merkle and others.
The original SHA was designed by NIST in 1993 and revised as
SHA-1 in 1995; it is still the crypto hash of choice in many
systems.
The Secure Hash Algorithm (SHA-1)

Given a message of size < 2^64 bits, SHA-1 produces a message
digest (hash output) of exactly 160 bits.

1. Pad M so its length K satisfies K mod 512 = 448.
Append 64 bits with the length of M. Break the padded M
into blocks of size 512.
2. Produce an initial vector of size 160 bits: 5 words of 32 bits
each.
3. Process each block sequentially using the compression
function f . Notice the input is an m-bit block and an n-bit
state, and the output is an n-bit state which feeds the next stage.
4. The output of the last stage is the message digest, which has
160 bits.
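A quick check with Python's standard-library hashlib confirms the fixed 160-bit output size of SHA-1:

```python
import hashlib

# SHA-1 maps an arbitrary-length message to exactly 160 bits (20 bytes).
digest = hashlib.sha1(b"abc").hexdigest()
assert len(digest) == 40   # 40 hex characters = 160 bits
print(digest)              # a9993e364706816aba3e25717850c26c9cd0d89d
```

The value printed for the input "abc" is the well-known SHA-1 test vector.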
Padding of the input M
Let K be the length of M. We want to break the input into blocks of
16 words each (i.e. 512-bit blocks).
Append one 1-bit followed by Z 0-bits, where Z is the smallest
non-negative solution to K + 1 + Z ≡ 448 mod 512.
Add to the end 64 bits containing the binary representation of K .
As the length field is 64 bits long, the longest M must be < 2^64
bits long.

Toy Example: M = abc, so M = 01100001 01100010 01100011
(the ASCII codes of a, b, c; 24 bits).
The whole padding forms a single 512-bit block:

  01100001 01100010 01100011  (a b c: 24 bits)
  1                           (the appended 1-bit)
  00···0                      (Z = 423 zero bits)
  00···011000                 (64-bit length field, with value K = 24)

24 + 1 + 423 + 64 = 512 bits.
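The padding rule can be sketched in Python (sha1_pad is a hypothetical helper name; it works byte-wise, so the appended 0x80 byte is the 1-bit followed by the first seven of the Z zero bits):

```python
def sha1_pad(message: bytes) -> bytes:
    K = len(message) * 8                 # message length in bits
    padded = message + b"\x80"           # one 1-bit, then seven 0-bits
    # Keep appending zero bytes until K + 1 + Z ≡ 448 (mod 512).
    while (len(padded) * 8) % 512 != 448:
        padded += b"\x00"
    padded += K.to_bytes(8, "big")       # 64-bit length field
    return padded

block = sha1_pad(b"abc")
assert len(block) * 8 == 512             # 'abc' pads to exactly one block
assert block[:3] == b"abc" and block[3] == 0x80
assert int.from_bytes(block[-8:], "big") == 24
```

The assertions reproduce the toy example: 24 message bits, the 1-bit, 423 zeros and the 64-bit field holding 24.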
Initial Vector

The initial vector (hash buffer) is given by the concatenation of 5
words, each one of 32 bits, named a, b, c, d, e. In total the IV has
length 160 bits.
Each of the 5 registers is initialized with a fixed constant
specified by the standard (simple patterned byte values).
The values are (in hex):
a = 67452301,
b = efcdab89,
c = 98badcfe,
d = 10325476,
e = c3d2e1f0.
Compression Function f
Assume that on input M, the SHA-1 algorithm breaks M into L blocks,
each of size 512 bits.
Consider each block Bj (1 ≤ j ≤ L) partitioned into sixteen 32-bit
words wj [0]||wj [1]|| · · · ||wj [15].
For each block there is a preprocessing phase that expands those
16 words into 80 words, creating 64 new words
wj [16], . . . , wj [79], by

wj [i] = (wj [i−3] ⊕ wj [i−8] ⊕ wj [i−14] ⊕ wj [i−16]) ≪ 1, for 16 ≤ i ≤ 79,

where ≪ 1 denotes a left rotation by one bit.
Feeding those wj 's as inputs to the different stages of f assures
that all bits in Bj play a relevant role in the output of f for block Bj .

The compression function for each 512-bit block Bj works in 80
rounds, doing binary operations so that f complies with the 3
properties of crypto hash functions.
The output for the last block BL is the message digest.
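The word-expansion step can be sketched as follows. Note the 1-bit left rotation, which is part of SHA-1 (the earlier SHA-0 omitted it):

```python
def rotl(x: int, n: int) -> int:
    """32-bit left rotation."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def expand(block: bytes) -> list:
    """Expand the 16 words of a 512-bit block into the 80 round words."""
    assert len(block) == 64
    w = [int.from_bytes(block[4 * i:4 * i + 4], "big") for i in range(16)]
    for i in range(16, 80):
        # w[i] = (w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]) rotated left by 1
        w.append(rotl(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1))
    return w

w = expand(b"\x00" * 64)
assert len(w) == 80 and all(x == 0 for x in w)   # all-zero block stays zero
```

Every round of f consumes one of these 80 words, which is how all 512 input bits influence the output.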
High level picture of f for Bj

[Figure: the j-th block is first expanded from the sixteen 32-bit
words wj[0],...,wj[15] into the eighty 32-bit words wj[0],...,wj[79].
The 80 rounds are grouped into 4 groups of 20 rounds; each group
uses the same constant (c(0),...,c(3)) and the same round function.
Each round updates the five 32-bit registers a, b, c, d, e using
sums mod 2^32; the chaining input x(j−1) enters round 0, wj[i] is
fed into round i, and the chaining output xj leaves round 79.]
80 Rounds of the Compression Function


Wrapping up
[Figure: M is split into blocks B1, B2, B3, ..., BL of 512 bits
each; the 160-bit IV feeds f for B1, each stage passes its 160-bit
output on to the next stage, and the 160-bit output of the last
stage is the hash.]
Given an input M, SHA-1 yields a 160-bit crypto hash of M by:
1. Padding M into a binary string whose length is a multiple of
512, and partitioning it into L blocks of size 512 bits.
2. Computing the compression function for each Bj in cascade
fashion: it takes as input the 5-word hash buffer from Bj−1
and the block Bj itself, and returns the new values a||b||c||d||e,
with total length 160 bits, which will be part of the input for
the computation on Bj+1 .
3. The output for the last block is the message digest, i.e. the
crypto hash of M.
Security of the SHA family

I At the moment SHA-1 is the crypto hash algorithm of choice
in a myriad of systems, for example the browsers of Microsoft and
Mozilla.
I Recent theoretical results have indicated the possibility of
breaking SHA-1, in particular its collision resistance property.
Some of the systems using SHA-1 already have plans to
discontinue its use in 2017-18.
I One of the clearest choices at the moment is SHA-512.
The basic scheme is the same as SHA-1, but it uses a
larger block size (1024 bits), can process documents of up to
2^128 bits, and the output is exactly 512 bits. For a few years,
SHA-512 seems it will be a secure choice.
I Another recent alternative is SHA-256. Its working
scheme differs from that of SHA-1.
Distributed Consensus

Distributed consensus protocol: in a P2P network with n nodes, each
with an input, and where a few of those nodes are malicious, a
distributed consensus protocol must have the following two properties:
I It must terminate with all honest nodes in agreement,
I the value must have been generated by an honest node.
Distributed consensus with malicious nodes has been studied in the
framework of classical distributed computing.
However, for us a particularity derived from the lack of a
centralized authority is that the nodes will not have identities.
Hash Pointer

A hash pointer is a data structure
similar to the pointers in a linked list,
but besides the address
of the previous block, the pointer
also contains a cryptographic hash of
the information in the previous block.

Whereas a "regular" pointer gives a way to retrieve the
information, a hash pointer also gives you a way to verify that the
information has not changed.
Notice we can use any known type of data structure, for ex. lists or
trees, and substitute the pointers by hash pointers.
Blockchain

Blockchain: a linked list data structure where the links are hash
pointers.
A nice data structure to implement decentralized consensus, where
authority and trust are transferred to a decentralized virtual
network, enabling its nodes to sequentially record transactions
on a public block, creating a unique blockchain.
Blockchain

Uses a crypto hash function H
(usually SHA-256).

Notice the digest H() contained in
block m is the hash of the
data contained in block m−1.

The contents of a block include its hash pointer to the previous
block.
We can build a blockchain as large as we want, going back to a
first initial block, denoted the genesis block.
Every user of the blockchain needs only to store the head of the
list, which is just a regular hash pointer that points to the most
recent data block.
Blockchain: Impossibility of tampering with the data

An adversary can't tamper with data in any block of the chain
without getting detected.
If Eve modifies one file in block m, the hash of this block (stored
in block m + 1) becomes invalid.
Therefore Eve has to modify also that hash, which in turn changes
the contents of block m + 1, which will become different from the H
in block m + 2.
This goes all the way up to the root hash pointer H() to the head
block, which is difficult to change as every user of the blockchain
has their own copy of the root hash pointer.
Every time a new block is added to the chain, the new H() is
broadcast to all users.
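A minimal sketch of a blockchain with hash pointers, showing that tampering with an interior block invalidates the chain (the class and function names are illustrative):

```python
import hashlib

def H(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Block:
    def __init__(self, data: bytes, prev):
        self.data = data
        self.prev = prev   # regular pointer to the previous block
        # Hash pointer: also store the hash of the previous block's contents.
        self.prev_hash = H(prev.data + prev.prev_hash.encode()) if prev else ""

def valid(head) -> bool:
    """Walk back from the head, re-checking every hash pointer."""
    b = head
    while b.prev:
        if b.prev_hash != H(b.prev.data + b.prev.prev_hash.encode()):
            return False
        b = b.prev
    return True

genesis = Block(b"genesis", None)
b1 = Block(b"tx: A pays B", genesis)
head = Block(b"tx: B pays C", b1)
assert valid(head)

b1.data = b"tx: A pays Eve"   # Eve tampers with an interior block...
assert not valid(head)        # ...and the stored hash in the next block fails
```

A user who remembers only the head's hash pointer can detect any change anywhere in the chain.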
Merkle trees
Another useful hash pointer data structure is the Merkle tree. A
Merkle tree is a binary tree where the blocks with information are
in the leaves of the tree, and the internal nodes store hash pointers.
Usually H is one or two words obtained by applying SHA-256.
It is a static DS: given documents D1 , . . . , Dm , with m even (or
better 2^k ), the Merkle tree will hash a large quantity of data into a
single hash.
The tree is built from the leaves (pairs of blocks) up to the root.
Construction of Merkle's tree

1. (Optional) Sort the given D1 , . . . , Dm . Compute
H(D1 ), . . . , H(Dm ) and store the hashes in the m leaf blocks
of the tree.
2. In a bottom-up fashion, at level h do the hash of the
concatenation of the contents of pairwise blocks and store
them in blocks at level h − 1, until arriving at the root.
[Figure: a Merkle tree over documents D1, D2, D3, D4. The leaves
store H(D1), H(D2), H(D3), H(D4); the next level stores
H(H(D1) || H(D2)) and H(H(D3) || H(D4)); the root stores
H((H(H(D1) || H(D2))) || (H(H(D3) || H(D4)))),
e.g. d47780c084bad3830bcdaf6eafe035e4.]
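The bottom-up construction can be sketched in Python (merkle_root is a hypothetical helper; it assumes the number of documents is a power of two):

```python
import hashlib

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def merkle_root(docs: list) -> bytes:
    """Bottom-up Merkle construction; assumes len(docs) is a power of two."""
    level = [H(d) for d in docs]               # hash the leaves
    while len(level) > 1:
        # Hash the concatenation of each pair, producing the level above.
        level = [H(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]                            # the root hash

root = merkle_root([b"D1", b"D2", b"D3", b"D4"])
# Matches the structure in the figure:
expected = H(H(H(b"D1") + H(b"D2")) + H(H(b"D3") + H(b"D4")))
assert root == expected
```

However many documents there are, the user only needs to remember this single root hash.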
Properties of Merkle's trees

I Users only need to remember the hash pointer at the head of
the tree. The user has the ability to traverse down through the
hash pointers to any point in the list.
I This allows us to be sure that the data hasn't been tampered
with, as any attempt to tamper with any piece of data will be
detected by just remembering the hash pointer at the top.
I The same ideas can be applied for any kind of tree: in fact,
they can be applied to define Merkle structures on any kind of
acyclic graph.
Merkle's trees as a tool for Data Verification

In P2P systems, the same data is replicated in many nodes. It is
important to update copies at the same time.
Merkle's trees are an important tool to save time/space. Instead of
sending the whole file through the internet, we just send a hash of
the file to see if it matches:
In a P2P network, given two sites A and B, to check whether both
have the same files (FA = FB ?), both store the files as a Merkle tree:
1. A sends H(FA ) to B.
2. If H(FA ) = H(FB ), done. Otherwise
3. B requests the two hashes of the sons.
4. A creates the necessary hashes and sends them to B.
5. Repeat until finding the differing blocks.
Hash pointer Data Structures

Blockchain and Merkle trees are a new form of information
technology that will have a relevance and importance comparable
to the development of the TCP/IP internet protocol in 1974.
Blockchain could create a trustworthy and secure distributed
ledger, without the need of a trusted third party.
Hash pointer technology has many applications:
1. Bitcoin
2. Smart contracts
3. Smart properties
Bitcoin: B

Satoshi Nakamoto, 2008. (Craig Wright)

I A bitcoin network is a P2P network of
users, where all users have access to
a common blockchain that is a
trustworthy distributed ledger of
who owns what (at any time in
history).
I Users are identified by a digital
signature. The signature looks random
and the user cannot be uncovered from it.
I The blockchain contains records of
every single transaction of bitcoins.
I The blockchain replaces the bank
(trusted third party).
Bitcoin: B

I Implements a digital currency, i.e. money controlled and
stored by a network of computers.
I Why does bitcoin have any value? Because people want it and
use it!
I Bitcoin is not just a currency for illegal transactions; it is also
a way to make payments without intermediaries.
I It frees money from control by banks and governments or any
central authority.
I Useful for small payments (1 satoshi = 1/10^8 B).
I The bitcoin has a great fluctuation (1 B = 516 €) (Sept 2016).
I The total amount of currency in the system must be
≤ 21 × 10^6 B (by around 2040).
Bitcoin: B

I It is a math-based currency: if you own some coins, to make a
transaction you need a public and a private key associated to an
internet address, which among other things contains your
current balance of B.
I A bitcoin network is a decentralised network. Every time a
transaction occurs between members of the network it needs
to be certified, validated and verified, to avoid double spending.
I This process is carried out by special members of the network
denoted miners. In the process the miners generate new
bitcoins as a reward (actual rate 25 B per block).
I The miners create every 10 minutes a new block
containing the last valid transactions in the web. A
transaction will be valid once its block is added to the
blockchain.
Bitcoin: B
I To use the bitcoin system, a user needs a special software: a
wallet, to keep track of balance, internet addresses and
public keys. There are many suppliers of wallets, some of the
trusted ones:
http://blockchain.info/wallet-legacy/index.htm,
http://bitcointrezor.com
B: How does it work

1. Every user has the bitcoin wallet
software, identified with a private key.
2. User A transfers an amount of B to B,
via QR code and a public key system.
3. In the same transaction A includes a
reference to a previous transaction
where A received B (i.e. proving A can
spend the money).
4. B will generate the output script and
store it in the wallet, so B can spend it
in the future.
5. Miners confirm the transaction, which
together with other transactions will
form a new block.
B: How miners validate a transaction
Miners and rewards
I Miners have the task of validating the last transactions,
gathering them and creating a new block. The miner that gets a
valid new block into the blockchain gets a block reward of 25 B.
This is the only way in which new B are allowed to be created.
I Miners compete among themselves to create a new block, by
competing to solve a hash puzzle: find a number (nonce) s.t.
when taking the hash
H(nonce||previous-hash||TX 1|| · · · ||TXk) < target,
where H is the hash and target is a quite small value that is
given.
I Solving the puzzle is computationally expensive and uses
exhaustive search, but testing whether a hash is below the
target is easy.
I When somebody finds the nonce, the nonce becomes part of
the block, the block is added to the blockchain and the hash
pointer to the new block is broadcast to all vertices in the
network.
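The hash puzzle can be sketched in Python. Encoding the target through a difficulty parameter (number of leading zero bits) is an illustrative assumption, not Bitcoin's actual target format, and the field layout inside the hash is simplified:

```python
import hashlib

def mine(prev_hash: bytes, txs: bytes, difficulty: int):
    """Exhaustively search for a nonce whose hash falls below the target."""
    target = 2 ** (256 - difficulty)   # smaller target => harder puzzle
    nonce = 0
    while True:
        h = hashlib.sha256(
            nonce.to_bytes(8, "big") + prev_hash + txs).digest()
        if int.from_bytes(h, "big") < target:
            return nonce, h            # anyone can re-check this in one hash
        nonce += 1

nonce, h = mine(b"\x00" * 32, b"TX1|TX2", difficulty=12)
assert int.from_bytes(h, "big") < 2 ** (256 - 12)
```

Finding the nonce takes on the order of 2^difficulty hash evaluations, while verifying a claimed solution takes a single one, which is exactly the asymmetry proof of work relies on.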
Block in a Bitcoin blockchain

[Figure: a block in a Bitcoin blockchain. Source: The Economist]


The algorithm workflow

• User A wants to send user B an amount of B. The blockchain should record
that A owns at least that many bitcoins.
• A initiates a transaction with B using A's secret key and B's public key.
• Miners verify the validity of the transaction.
• If the transaction is valid, it is added to a new block, together with other
concurrent transactions.
• A hash pointer to the last transaction is added to the block.
• Miners validate and verify the new block by the proof-of-work algorithm (and
they earn some new B). When there is an agreement on the validity of the
block, it becomes part of the blockchain and the pointer is distributed to all
nodes.
• All transactions in a block are completed once it becomes part of the
blockchain structure. From then on, all transactions in the block are considered
recorded and completed.
B: How it works
[Figure: the Blockchain Information web page]
References

• A. Narayanan, J. Bonneau, E. Felten, A. Miller, S. Goldfeder:
Bitcoin and Cryptocurrency Technologies. Princeton U. P. 2016
• https://bitcoin.org/en/
• The great chain of being sure about things. The Economist,
Oct 31, 2015.
http://www.economist.com/news/briefing/21677228-technology-behind-
bitcoin-lets-people-who-do-not-know-or-trust-each-other-build-dependable
• Bitcoin: A Peer-to-Peer Electronic Cash System. Satoshi Nakamoto.
https://bitcoin.org/bitcoin.pdf
