
Data Stream Filtering

Filtering and Streaming


• The randomized algorithms and data structures we
have seen so far always produce the correct answer
but have a small probability of being slow.
• In this lecture, we will consider randomized algorithms
that are always fast, but have a small probability of
returning the wrong answer.
• More generally, we are interested in tradeoffs
between the (likely) efficiency of the algorithm and the
(likely) quality of its output.
Bloom Filters
"Whenever a list or set is used, and space is a consideration, a Bloom
filter should be considered. When using a Bloom filter, consider
the potential effects of false positives."
• It is a randomized data structure that is used to represent a
set.
• It answers membership queries.
• It can return a FALSE POSITIVE when answering a membership
query (with small probability).
• It never returns a FALSE NEGATIVE.
• A "yes" answer means POSSIBLY IN SET.
• A "no" answer means DEFINITELY NOT IN SET.
• It is space efficient.
Bloom Filters
• Bloom filters are a natural variant of hashing
proposed by Burton Bloom in 1970 as a
mechanism for supporting membership queries in
sets.
• Applications:
• Example: Email spam filtering
– We know 1 billion “good” email addresses
– If an email comes from one of these, it is NOT spam
Filtering Stream Content
• To motivate the Bloom-filter idea, consider a web
crawler.
• It keeps, centrally, a list of all the URLs it has found
so far.
• It assigns these URLs to any of a number of parallel
tasks; these tasks stream back the URLs they find in
the links they discover on a page.
• It needs to filter out those URLs it has seen before.
Role of the Bloom Filter
• A Bloom filter placed on the stream of URLs will
declare that certain URLs have been seen before.
• Others will be declared new, and will be added to
the list of URLs that need to be crawled.
• Unfortunately, the Bloom filter can have false
positives.
– It can declare a URL has been seen before when it
hasn't.
– But if it says “never seen”, then it is truly new.
How a Bloom Filter Works
• A Bloom filter is an array of bits, together
with a number of hash functions.
• The argument of each hash function is a
stream element, and it returns a position in
the array.
• Initially, all bits are 0.
• When input x arrives, we set to 1 the bits
h(x), for each hash function h.
The Set Membership Task

• x: An element
• S: A set of elements
• Input: x, S
• Output:
– TRUE if x is in S
– FALSE if x is not in S
3/13/2023 MODULE-I DATA ANALYTICS 9
Insert(x)
Find h1(x), h2(x), ..., hk(x) and set the bits at all these positions in the Bloom filter to 1

Query(y)

Find h1(y), h2(y), ..., hk(y)
IF the bits at all positions h1(y), h2(y), ..., hk(y) are 1
RETURN 1 (possibly in set)
ELSE
RETURN 0 (definitely not in set)
We assume each hash function maps an element to a bit position in 0, 1, 2, ..., n-1, where n is the number of bits in the filter
m = number of elements inserted or present in the set
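The insert and query steps above can be sketched in Python. This is a minimal illustration, not a production design: the class name `BloomFilter`, the use of salted SHA-256 to simulate the k hash functions, and the example URLs are all assumptions made for the sketch.

```python
import hashlib


class BloomFilter:
    """Bit array of n bits with k hash functions (n and k as in the slides)."""

    def __init__(self, n: int, k: int) -> None:
        self.n = n                # number of bits in the array
        self.k = k                # number of hash functions
        self.bits = bytearray(n)  # one byte per bit, chosen for readability

    def _positions(self, item: str):
        # Assumption: the k hash functions are simulated by salting SHA-256
        # with the index i; any k independent hash functions would do.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def insert(self, item: str) -> None:
        # Set the bit at every position h_i(item) to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item: str) -> bool:
        # True means "possibly in set"; False means "definitely not in set".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(n=1000, k=4)
seen = ["http://a.example/", "http://b.example/", "http://c.example/"]
for url in seen:
    bf.insert(url)

print([bf.query(url) for url in seen])  # inserted items are always reported
```

Because insertion only ever turns bits on, every inserted item passes the query, which is why false negatives cannot occur.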
Error types
• False Negative: answering “not there” on an element
that is in the set.
‒ Never happens for Bloom Filters

• False Positive: answering “is there” on an element
that is not in the set.
‒ We design the filter so that the probability of a false positive
is very small.
Calculating the Probability of False Positives
• The false positive probability is minimised
by choosing
$k = \frac{n}{m} \ln 2$,
where n is the number of bits in the filter and m is the number of elements inserted.
Bloom Filter -- Analysis
• What fraction of the bit vector B are 1s?
– Throwing k∙m darts at n targets
– So fraction of 1s is $1 - e^{-km/n}$

• But we have k independent hash functions,
and we only let the element x through if all k
of them hash x to a bucket of value 1

• So, false positive probability = $(1 - e^{-km/n})^k$
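The dart-throwing argument can be written out step by step. The sketch below assumes each of the km darts lands on one of the n bits uniformly and independently, and that n is large enough for the exponential approximation to hold.

```latex
\begin{align*}
\Pr[\text{a given bit is missed by one dart}] &= 1 - \frac{1}{n} \\
\Pr[\text{a given bit is still 0 after } km \text{ darts}]
  &= \left(1 - \frac{1}{n}\right)^{km}
   = \left[\left(1 - \frac{1}{n}\right)^{n}\right]^{km/n}
   \approx e^{-km/n} \\
\text{fraction of 1s} &\approx 1 - e^{-km/n} \\
\Pr[\text{false positive}] &\approx \left(1 - e^{-km/n}\right)^{k}
\end{align*}
```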


Bloom Filter – Analysis (2)
• m = 1 billion, n = 8 billion
– k = 1: $1 - e^{-1/8} = 0.1175$
– k = 2: $(1 - e^{-1/4})^2 = 0.0489$

• What happens as we keep increasing k?
• “Optimal” value of k: $\frac{n}{m} \ln 2$
– In our case: optimal k = 8 ln(2) ≈ 5.55 ≈ 6
• Error at k = 6: $(1 - e^{-6/8})^6 = 0.0216$
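To see how the error changes with k, the formula can be evaluated directly. The following is a small sketch; the function names `false_positive_rate` and `optimal_k` are illustrative and not part of the slides.

```python
import math


def false_positive_rate(n_bits: float, m_items: float, k: int) -> float:
    """Approximate false positive probability (1 - e^(-k*m/n))^k."""
    return (1.0 - math.exp(-k * m_items / n_bits)) ** k


def optimal_k(n_bits: float, m_items: float) -> float:
    """Real-valued k that minimises the approximation: (n/m) * ln 2."""
    return (n_bits / m_items) * math.log(2)


n_bits, m_items = 8e9, 1e9  # n = 8 billion bits, m = 1 billion items
for k in (1, 2, 4, 6, 8, 10):
    print(k, round(false_positive_rate(n_bits, m_items, k), 4))
print("optimal k:", round(optimal_k(n_bits, m_items), 2))
```

The printed rates fall until k is near (n/m) ln 2 and then rise again, because each extra hash function also fills the bit array faster.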
Example-1
• n, the size of the Bloom filter (in bits) = 11
• m, the number of items = 2
• k, the number of hash functions = 4
• Pr(all k cells are set to 1) = $\left(1 - e^{-km/n}\right)^k$
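Plugging the example's values into the formula is a short calculation. The sketch below also evaluates the same dart-throwing model without the exponential approximation, which is an addition for comparison: with only n = 11 bits the approximation is rough.

```python
import math

n, m, k = 11, 2, 4  # bits, items, hash functions (values from the example)

# Formula from the slides: (1 - e^(-km/n))^k
approx = (1 - math.exp(-k * m / n)) ** k

# Same model using (1 - 1/n)^(km) directly instead of e^(-km/n)
no_exp_approx = (1 - (1 - 1 / n) ** (k * m)) ** k

print(f"with e^(-km/n):   {approx:.4f}")
print(f"with (1-1/n)^km:  {no_exp_approx:.4f}")
```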
Example-2
(1/e ≈0.37....)
Counting Distinct Elements
Definition
• The data stream consists of elements chosen from a
universe of size N.
• Maintain a count of the number of distinct items seen so far.

Let us consider a stream:

32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
Elements occur multiple times; we want to count the
number of distinct elements.
The number of distinct elements is m (= 6 in this example).
The number of elements in the stream is n (= 11 in this example).
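For a stream this small, the exact answer can be found by remembering every value seen. A minimal sketch of that baseline:

```python
stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]

seen = set()
for x in stream:
    seen.add(x)  # a set stores each value at most once

print(len(stream))  # total number of elements: 11
print(len(seen))    # number of distinct elements: 6
```

The set grows with the number of distinct elements, so this baseline stops being practical once the stream has more distinct values than memory can hold.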
Why do we count distinct elements?
• Number of distinct queries issued
• Unique IP addresses passing packets through a router
• Number of unique users accessing a website per month
• Number of different people passing through a traffic hub
(airport, etc.)
• How many unique products we sold tonight?
• How many unique requests on a website came in today?
• How many different words did we find on a website?
Now, let’s constrain ourselves with limited storage…
• How to estimate (in an unbiased manner) the
number of distinct elements seen?
– Flajolet-Martin (FM) Approach
• The FM algorithm approximates the number of
unique objects in a stream or a database in one
pass.
• If the stream contains n elements with m of them
unique, this algorithm runs in O(n) time and
needs O(log(m)) memory.
FM-sketch (Flajolet-Martin)
Pick a hash function h that maps each of the N elements to
at least $\log_2 N$ bits
For each stream element a, let r(a) be the number of
trailing 0s in h(a)
r(a) = position of the first 1 counting from the right, starting at position 0
E.g., say h(a) = 12, then 12 is 1100 in binary, so
r(a) = 2
Record R = the maximum r(a) seen
$R = \max_a r(a)$, over all the items a seen so far
Estimated number of distinct elements = $2^R$
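The steps above can be sketched in Python. Two choices here are assumptions made for illustration: truncated SHA-256 stands in for the hash function h, and an all-zero hash value is assigned r = 0 (other treatments of that case exist).

```python
import hashlib


def trailing_zeros(v: int) -> int:
    """Number of trailing 0 bits in v (assumption: 0 is assigned r = 0)."""
    if v == 0:
        return 0
    return (v & -v).bit_length() - 1  # v & -v isolates the lowest set bit


def h(item, width: int = 32) -> int:
    # Assumption: SHA-256 truncated to `width` bits plays the role of h;
    # the slides do not fix a particular hash function.
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << width)


def fm_estimate(stream) -> int:
    R = 0  # largest number of trailing zeros seen so far
    for a in stream:
        R = max(R, trailing_zeros(h(a)))
    return 2 ** R


stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]
estimate = fm_estimate(stream)
print(estimate)
```

Only R is stored between elements, and repeated elements hash to the same value, so duplicates never change the estimate. A single hash function gives a coarse estimate that is always a power of 2.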
Example
Calculate the hash value h(x) of each stream element.

The numbers obtained are:

{2,4,3,2,3,4,0,4,2,3,4,2}
Binary Bits

Conversion to binary:
{010,100,011,010,011,100,000,100,010,011,100,010}
Trailing Zeros

Computing r(a), taking r = 0 for the all-zero value 000:
{1,2,0,1,0,2,0,2,1,0,2,1}
Distinct Elements

So, r(a):
{1,2,0,1,0,2,0,2,1,0,2,1}
R = max r(a) = 2
Estimate = $2^R = 2^2 = 4$
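The last step of the example can be checked with a few lines of Python. This sketch uses only the distinct hash values that occur in the example (0, 2, 3 and 4) and follows the example's convention that the all-zero value gets r = 0.

```python
def trailing_zeros(v: int) -> int:
    # Convention used in the example: the all-zero value gets r = 0.
    if v == 0:
        return 0
    return (v & -v).bit_length() - 1


hash_values = [0, 2, 3, 4]  # distinct hash values occurring in the example

r = {v: trailing_zeros(v) for v in hash_values}
R = max(r.values())
estimate = 2 ** R

print(r)         # {0: 0, 2: 1, 3: 0, 4: 2}
print(R)         # 2
print(estimate)  # 4
```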
