

John Augustine
Jan 16, 2020

Data Mining (CS6720): Introduction to the Streaming Model
Sampling from a Stream

Basic Model

• Processor (unlimited power. Why?)
• Limited memory. Goal is to minimize the amount of memory.
• Stream: $a_1, a_2, \ldots, a_n$ (typically, $n$ is unknown to the processor).
• Each $a_i \in U$, the universe, which can be $\{1, 2, \ldots, m\}$, a set of $m$ symbols, $\mathbb{R}$, $\mathbb{Z}$, $\mathbb{N}$, etc.
• Query: some question about the items seen so far. (Sample queries below.)
• At time step $i$, item $a_i$ is provided to the processor.
• The processor uses its memory words and $a_i$, and updates memory such that the answer to the query is kept current.
• Several other variants (sliding window, cash register model, graph streaming, etc.). A minimal code sketch of this model follows the sample queries below.

Sample Queries

• Mean, max, etc.
• Representative candidate element(s). (Formally: sample(s) with/without replacement.)
• Suppose $U$ is a collection of $m$ items and $f_j$ is the frequency (number of occurrences) of the $j$th item.
• The $p$th frequency moment is $\sum_j f_j^p$.
• The 0th frequency moment is the number of distinct items.
• The 2nd frequency moment (aka the "surprise number") is useful in computing variance.
• The quantity $\left(\sum_j f_j^p\right)^{1/p}$ captures the frequency of the most frequent item as $p \to \infty$.
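To make the basic model concrete, here is a minimal Python sketch (names and structure are mine, not from the slides) of a processor that answers mean and max queries using O(1) memory words no matter how long the stream is:

class StreamProcessor:
    """Answers mean/max queries over a stream using O(1) words of memory."""

    def __init__(self):
        self.count = 0        # number of items seen so far
        self.total = 0        # running sum, for the mean query
        self.maximum = None   # running maximum

    def process(self, item):
        # One update per arriving stream item a_i.
        self.count += 1
        self.total += item
        self.maximum = item if self.maximum is None else max(self.maximum, item)

    def query_mean(self):
        return self.total / self.count

    def query_max(self):
        return self.maximum

# Usage: feed items one at a time; memory never grows with n.
# p = StreamProcessor()
# for a in [3, 1, 4, 1, 5]:
#     p.process(a)
# p.query_mean(), p.query_max()   # (2.8, 5)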


Sampling Formulation and Reservoir Sampling
Sampling without replacement in a stream. Vitter, 1985.

• Formulation attempt: Pick each item with probability, say, 1/100.
• Stream of (unknown) $n$ elements from a universe of $m$ symbols. Large $n$ and $m$.
• Query: a sample of $k \ll \min(n, m)$ elements chosen uniformly at random without replacement.
• Goal: memory $\le k$ symbols.

• Strawman solution 1: First $k$ or last $k$.
• Strawman solution 2: Choose some random $k$ indices and collect the items at those indices.


Simple Case: $k = 1$

• Only one memory cell.
• Item $a_i$ is placed in the cell with probability $1/i$.
• Event $E_i$: the item $a_i$ is the final sample iff (i) $a_i$ was placed in the cell and (ii) subsequent items were NOT placed in the cell.

$$\Pr[E_i] = \frac{1}{i}\left(1-\frac{1}{i+1}\right)\left(1-\frac{1}{i+2}\right)\cdots\left(1-\frac{1}{n}\right) = \frac{1}{i}\cdot\frac{i}{i+1}\cdot\frac{i+1}{i+2}\cdots\frac{n-1}{n} = \frac{1}{n}.$$

Case: Arbitrary $k$

• Now we are allowed $k$ cells. Place the first $k$ items in the cells.
• When item $a_i$, $i > k$, arrives, pick a random number $r$ from 1 to $i$. If $r \le k$, place $a_i$ in cell $r$. Else, ignore $a_i$.
• Event $E_i$: $a_i$ is a sample iff (i) $a_i$ is placed in one of the $k$ cells and (ii) no subsequent item is placed in that cell.
• Thus,

$$\Pr[E_i] = \frac{k}{i}\left(1-\frac{1}{i+1}\right)\left(1-\frac{1}{i+2}\right)\cdots\left(1-\frac{1}{n}\right) = \frac{k}{n}.$$
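A minimal Python sketch of this reservoir sampling procedure (variable names are mine):

import random

def reservoir_sample(stream, k):
    """Return k items sampled uniformly without replacement from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):   # i is the 1-based time step
        if i <= k:
            reservoir.append(item)               # the first k items fill the cells
        else:
            r = random.randint(1, i)             # uniform in {1, ..., i}
            if r <= k:
                reservoir[r - 1] = item          # place a_i in cell r; else ignore a_i
    return reservoir

# Example: a uniform sample of 5 items from a million-element stream,
# using only k = 5 cells of memory.
# print(reservoir_sample(range(1_000_000), 5))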


Counting Distinct Elements in a Stream

Problem Formulation

• Stream of $n$ numbers from $\{1, 2, \ldots, m\}$.
• Query: how many different numbers?
• Goal: memory $O(\log m)$ bits.
• Deterministic solutions? With $n = m + 1$ stream elements, $m$ bits are required.
• Proof:
  • Suppose after $m$ stream items we have only $m - 1$ bits of memory.
  • $2^m - 1$ possible (nonempty) subsets could have been seen, but there are only $2^{m-1}$ states for the memory.
  • Thus, some two subsets $S_1$ and $S_2$ are represented by the same state.
  • If those subsets are of different sizes, error.
  • Otherwise, if of the same size, then what if the $(m+1)$th item is from $S_1 \setminus S_2$?

[Cloud: Material not covered in class but required to know (and therefore asked in tests/quizzes/exams) will be mentioned in a cloud like this. You are free to search other sources (Internet, textbook).]


Problem Formulation -- Updated

• Stream of $n$ numbers from $\{1, 2, \ldots, m\}$.
• Query: Let $d$ be the number of distinct items. Output $\tilde{d}$ such that $\Pr[d/6 \le \tilde{d} \le 6d] > 1/2$.
• Goal: memory $O(\log m)$.

[Cloud: Why 6? Answer: convenient to prove.]

Intuition

[Figure omitted. This and other subsequent figures & screen clips are from Blum, Hopcroft, and Kannan unless specified otherwise.]


Algorithm

• Assume availability of a two-universal hash function $h: \{1, 2, \ldots, m\} \to \{1, 2, \ldots, M\}$, where $M > m$:
  • $\Pr[h(a) = x] = 1/M$.
  • $\Pr[h(a) = x \wedge h(b) = y] = 1/M^2$ for $a \ne b$.
• Algorithm: hash each value in the stream and only remember the smallest hash value $s$. On being queried, report $\tilde{d} = M/s$.

[Cloud: How to get 2-universal hash functions? Why is it 2-universal?]

Claim:

Let $b_1, b_2, \ldots, b_d$ be the $d$ distinct items encountered in the stream. Then,
$$\Pr\left[\frac{d}{6} \le \tilde{d} \le 6d\right] \ge \frac{2}{3} - \frac{d}{M} > \frac{1}{2} \qquad \text{(when } M \gg d\text{)}.$$
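A runnable sketch of this estimator. The intuition: the $d$ distinct items produce $d$ (pairwise-independent) roughly uniform hash values in $\{1, \ldots, M\}$, whose minimum concentrates around $M/d$, so $M/s \approx d$. The family $h(x) = (ax + b) \bmod M$ over a prime $M$ is one standard answer to the cloud's question; everything else here is my own naming:

import random

def estimate_distinct(stream, M=2**61 - 1):
    """Estimate the number of distinct items in the stream.
    M = 2^61 - 1 is prime, so h(x) = (a*x + b) mod M (with random a, b)
    is a pairwise-independent hash into {0, ..., M-1}; we shift to {1, ..., M}.
    """
    a = random.randrange(1, M)
    b = random.randrange(M)
    s = min((a * x + b) % M + 1 for x in stream)  # remember only the smallest hash value
    return M // s                                 # report d~ = M/s

# Example (duplicates don't matter -- equal items hash equally):
# stream = [random.randrange(10_000) for _ in range(1_000_000)]
# print(estimate_distinct(stream))   # within [d/6, 6d] with probability > 1/2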


Proof.

First, we focus on the event $s \le \frac{M}{6d}$ (i.e., overestimating: $\tilde{d} = M/s \ge 6d$). [Screen clip from Blum, Hopcroft, and Kannan omitted; the step is a union bound: each of the $d$ hash values is at most $M/(6d)$ with probability at most $1/(6d)$, so $\Pr\left[s \le \frac{M}{6d}\right] \le d \cdot \frac{1}{6d} = \frac{1}{6}$.]

Brief Detour into Elementary Tail Bounds

Markov's Inequality: If $X$ is a random variable that only takes nonnegative values, then, for all $a > 0$,
$$\Pr[X \ge a] \le \frac{E[X]}{a}.$$

Chebyshev's Inequality: For any random variable $X$ (not necessarily nonnegative) and for all $a > 0$,
$$\Pr\big[|X - E[X]| \ge a\big] \le \frac{Var[X]}{a^2}.$$
Also, for any $t > 1$,
$$\Pr\big[|X - E[X]| \ge t \cdot E[X]\big] \le \frac{Var[X]}{t^2 \cdot (E[X])^2}.$$

[Cloud: Prove these tail inequalities. Reference: Mitzenmacher and Upfal.]
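One standard bridge between the two bounds (a first step toward the cloud's exercise): Chebyshev's inequality is just Markov's inequality applied to the nonnegative random variable $(X - E[X])^2$:
$$\Pr\big[|X - E[X]| \ge a\big] = \Pr\big[(X - E[X])^2 \ge a^2\big] \le \frac{E\big[(X - E[X])^2\big]}{a^2} = \frac{Var[X]}{a^2}.$$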


Back to our Proof.

Focus on the event $s \ge \frac{6M}{d}$ (i.e., underestimating: $\tilde{d} \le d/6$) and bound its probability to within 1/6.

Recall:
$$\Pr\left[s \ge \frac{6M}{d}\right] = \Pr\left[\forall i,\; h(b_i) \ge \frac{6M}{d}\right].$$

Let
$$Y_i = \begin{cases} 0, & \text{if } h(b_i) \ge \frac{6M}{d}, \\ 1, & \text{otherwise,} \end{cases}$$
and let $Y = \sum_i Y_i$. Then:
$$E[Y_i] = 1 \times \Pr[Y_i = 1] = \Pr\left[h(b_i) < \frac{6M}{d}\right] \approx \frac{6}{d},$$
$$E[Y] = \sum_i E[Y_i] \approx d \cdot \frac{6}{d} = 6,$$
$$Var[Y_i] = E[Y_i^2] - E[Y_i]^2 = E[Y_i] - E[Y_i]^2 \le E[Y_i],$$
$$Var[Y] = \sum_i Var[Y_i] \le E[Y].$$

[Cloud: The last step is due to 2-way independence of the $Y_i$'s. Find out why!]


Continuing:
$$\Pr\left[s \ge \frac{6M}{d}\right] = \Pr\left[\forall i,\; h(b_i) \ge \frac{6M}{d}\right] = \Pr[Y = 0] \le \Pr\big[|Y - E[Y]| \ge E[Y]\big] \le \frac{Var[Y]}{E[Y]^2} \le \frac{1}{E[Y]} \le \frac{1}{6}.$$

[Cloud: The final result follows by the union bound, which says $\Pr[E_1 \cup E_2 \cup \cdots] \le \sum_i \Pr[E_i]$.]

Homework

How to boost the probability and achieve the following claim? For any $c > 0$,
$$\Pr\left[\frac{d}{6} \le \tilde{d} \le 6d\right] \ge 1 - \frac{1}{m^c}.$$

Hint: Use an appropriate number of repetitions and use the median value. For the analysis (i.e., proving the above claim), use Chernoff bounds. (You will only need Equation (5).) A sketch of the repetition pattern follows.
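A sketch of that boosting pattern (my code, assuming any weak estimator that is correct with probability strictly above 1/2, such as estimate_distinct above; the Equation (5) analysis shows $r = \Theta(c \log m)$ repetitions suffice):

import statistics

def boosted_estimate(data, weak_estimator, r):
    """Run r independent copies of a weak estimator and report the median.
    The median is bad only if at least half the copies are bad, and since
    each copy is bad with probability < 1/2 independently, Chernoff bounds
    make that event exponentially unlikely in r.
    """
    estimates = [weak_estimator(data) for _ in range(r)]  # fresh randomness per copy
    return statistics.median(estimates)

# In a true one-pass setting the r copies run in parallel over the stream;
# here `data` is a re-iterable sequence for simplicity.
# Example: boosted_estimate(stream, estimate_distinct, r=64)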


Chernoff Bounds (Mitzenmacher and Upfal)

Let $X_1, X_2, \ldots, X_n$ be independent binomial trials with $\Pr[X_i = 1] = p$. Let $X = X_1 + X_2 + \cdots + X_n$ and $\mu = E[X] = np$. Then, the following hold.

1. For any $\delta > 0$,
$$\Pr[X \ge (1+\delta)\mu] \le \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu. \qquad (1)$$

2. For $0 < \delta \le 1$,
$$\Pr[X \ge (1+\delta)\mu] \le e^{-\mu\delta^2/3}. \qquad (2)$$

3. For $R \ge 6\mu$,
$$\Pr[X \ge R] \le 2^{-R}. \qquad (3)$$

Moreover, for $0 < \delta < 1$,
$$\Pr[X \le (1-\delta)\mu] \le \left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^\mu, \qquad (4)$$
$$\Pr[X \le (1-\delta)\mu] \le e^{-\mu\delta^2/2}, \qquad (5)$$
and putting (2) and (5) together,
$$\Pr\big[|X - \mu| \ge \delta\mu\big] \le 2e^{-\mu\delta^2/3}. \qquad (6)$$
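A quick numeric illustration (my example, not from the slides): for $n = 100$ fair coin flips, $\mu = 50$ and $\delta = 0.2$, bound (2) gives
$$\Pr[X \ge 60] \le e^{-50 \cdot (0.2)^2 / 3} = e^{-2/3} \approx 0.51.$$
The exponent $\mu\delta^2$ is what matters: the same relative deviation over $n = 10{,}000$ flips ($\mu = 5000$) yields $e^{-5000 \cdot 0.04 / 3} \approx e^{-66.7}$, which is astronomically small.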


Frequency Moments
Generalization of Distinct Elements

The $p$th Frequency Moment

Let $f_s$ be the number of occurrences of $s \in \{1, 2, \ldots, m\}$. The $p$th frequency moment is given by
$$\sum_{s=1}^{m} f_s^p.$$
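For concreteness, a short offline Python reference (a streaming algorithm must, of course, avoid storing all $m$ counts):

from collections import Counter

def frequency_moment(stream, p):
    """Compute the p-th frequency moment, sum over s of f_s^p (offline reference)."""
    counts = Counter(stream)            # f_s for every s that actually appears
    if p == 0:
        return len(counts)              # distinct elements, using the 0^0 = 0 convention
    return sum(f ** p for f in counts.values())

# Example, for the stream [1, 1, 2, 3]:
#   frequency_moment([1, 1, 2, 3], 0) == 3   # distinct elements
#   frequency_moment([1, 1, 2, 3], 1) == 4   # stream length n
#   frequency_moment([1, 1, 2, 3], 2) == 6   # the "surprise number"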


Commonly Studied Frequency Moments

• $p = 0$ captures the number of distinct elements (assuming $0^0 = 0$).
• $p = 1$ is simply the stream length $n$.
• $p = 2$ is useful in computing the variance. It is often called the surprise number. Can you guess why? Hint: Consider a random stream vs. a stream with just one item.
• $p = \infty$ captures the frequency of the most frequent element.

Estimating the Second Moment
Alon, Matias, and Szegedy (AMS)
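Why $p \to \infty$ picks out the most frequent element (a one-line squeeze, not on the slides): with $f_{\max} = \max_s f_s$,
$$f_{\max} = \left(f_{\max}^p\right)^{1/p} \le \left(\sum_s f_s^p\right)^{1/p} \le \left(m \cdot f_{\max}^p\right)^{1/p} = m^{1/p}\, f_{\max} \;\longrightarrow\; f_{\max} \quad \text{as } p \to \infty.$$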


The AMS Algorithm

• For each $s \in \{1, 2, \ldots, m\}$, let $x_s$ be $\pm 1$ with probability $\frac{1}{2}$ each.
• For the theory to go through, $x_s$ can be the hash of $s$ into $\{-1, +1\}$ such that the outcomes are 4-way independent.
• Algorithm (unbiased, but weak): Maintain a counter $sum$ that is incremented by $x_s$ when a symbol $s$ arrives. Report $a = sum^2$.
• Thus, at the end, $sum = \sum_s x_s f_s$ and $a = \left(\sum_s x_s f_s\right)^2$.

Claim:
$$a = \left(\sum_s x_s f_s\right)^2$$
is an unbiased estimator of the second frequency moment.
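A minimal sketch of this single-counter estimator (for simplicity it draws an independent random sign per symbol, which costs $O(m)$ memory; the real algorithm replaces this with a 4-wise independent hash to stay within $O(\log m)$ bits):

import random
from collections import defaultdict

def ams_estimate(stream):
    """Single AMS counter: maintain sum = sum_s x_s * f_s, report a = sum^2."""
    x = defaultdict(lambda: random.choice((-1, 1)))  # x_s, drawn lazily, fixed per symbol
    total = 0
    for s in stream:
        total += x[s]          # increment the counter by x_s on each arrival of s
    return total * total       # a = sum^2, an unbiased estimate of F_2

# Averaging many independent runs converges to the second frequency moment:
# stream = [1, 1, 2, 3]
# sum(ams_estimate(stream) for _ in range(10_000)) / 10_000   # ~ 6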


Claim: $Var[a] = E[a^2] - E^2[a] \le 2E^2[a]$.

[Screen clip of the proof from Blum, Hopcroft, and Kannan omitted. Cloud: If one of $s$, $t$, $u$, or $v$ is different from the others, the expectation of that term vanishes. Why?]

Reducing the Error Probability

Algorithm: Repeat the algorithm $r = \frac{2}{\epsilon^2 \delta}$ times, yielding estimates $a_1, a_2, \ldots, a_r$, and report the average $X = \frac{1}{r}\sum_i a_i$.

Claim:
$$\Pr\big[|X - E[X]| > \epsilon E[X]\big] \le \frac{Var[X]}{\epsilon^2 E^2[X]} \le \delta.$$
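The expansion behind these claims (standard, and included here since the slide's proof lives in the omitted screen clip): by pairwise independence, $E[x_s x_t] = 0$ for $s \ne t$ while $x_s^2 = 1$, so
$$E[a] = E\left[\Big(\sum_s x_s f_s\Big)^2\right] = \sum_s f_s^2\, E[x_s^2] + \sum_{s \ne t} f_s f_t\, E[x_s x_t] = \sum_s f_s^2,$$
which is exactly the unbiasedness claimed earlier. The variance claim expands $E[a^2]$ the same way into fourth-order terms $E[x_s x_t x_u x_v]$; 4-way independence makes every term with a lone index vanish, which is the cloud's question above. Averaging $r$ independent copies divides the variance by $r$, so Chebyshev gives $\Pr\big[|X - E[X]| > \epsilon E[X]\big] \le \frac{2}{r\epsilon^2}$, and $r = \frac{2}{\epsilon^2\delta}$ makes this at most $\delta$.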
