Download as pdf or txt
Download as pdf or txt
You are on page 1of 41

Models & Algorithms

for Internet Computing


Lecture 5: BINS AND BALLS MODEL AND
APPLICATIONS
Agenda
• Review: the problem of bins and balls
• Poisson distribution
• Hashing
• Bloom Filters

Probability for Computing 3


Balls into Bins
• We have m balls that are thrown into n bins,
with the location of each ball chosen
independently and uniformly at random from n
possibilities.
• What does the distribution of the balls into the
bins look like
• “Birthday paradox” question: is there a bin with at
least 2 balls
• How many of the bins are empty?
• How many balls are in the fullest bin?
Answers to these questions give solutions to
many problems in the design and analysis of
algorithms Probability for Computing 4
The maximum load
• When n balls are thrown independently and uniformly
at random into n bins, the probability that the maximum
load is more than 3 lnn/lnlnn is at most 1/n for n
sufficiently large.
• By Union bound, Pr [bin 1 receives  M balls] 
• Note that:

• Now, using Union bound again, Pr [ any ball receives  M balls]


is at most

which is  1/n
Probability for Computing 5
Application: Bucket Sort
• A sorting algorithm that
breaks the (nlogn) lower
bound under certain input
assumption
• Bucket sort works as follows:
• Set up an array of initially
empty "buckets."
• Scatter: Go over the original
array, putting each object in its
bucket.
• Sort each non-empty bucket. A set of n =2m integers,
• Gather: Visit the buckets in randomly chosen from
order and put all elements back [0,2k),km, can be sorted
into the original array. in expected time O(n)
Probability for Computing  Why:6 will analyze later!
The Poisson Distribution
• Consider m balls, n bins
• Pr [ a given bin is empty] =
• Let Xj is a indicator r.v. that is 1 if bin j empty, 0 otherwise
• Let X be a r.v. that represents # empty bins

• Generalizing this argument, Pr [a given bin has r balls] =

• Approximately,

• So:

Probability for Computing 7


Limit of the Binomial Distribution

Probability for Computing 8


Application: Hashing
• The balls-and-bins model is good to model hashing
• Example: password checker
• Goal: prevent people from choosing common, easily cracked
passwords
• Keeping a dictionary of unacceptable passwords and check newly
created password against this dictionary.
• Initial approach: Sorting this dictionary and do binary
search on it when checking a password
• Would require (log m) time for m words in the dictionary
• New approach: chain hashing
• Place the words into bins and search appropriate bin for the word
• The worlds in a bin: implemented as a linked list
• The placement of words into bins is done by using a hash function
Probability for Computing 9
Chain hashing
• Hash table
• A hash function f: U  [0,n-1] is a way of placing items from the
universe U into n bins
• Here, U consists of all possible password strings
• The collection of bins called hash table
• Chain hashing: items that fall into the same bin are chained together
in a linked list
• Using a hash table turns the dictionary problem into a
balls-and-bins problem
• m words, hashing range [0..n-1]  m balls, n bins
• Making assumption: we can design perfect hash functions that map
words into bins uniformly random
• A given word could be mapped into any bin with the same probability

Probability for Computing 10


Search time in chain hashing
• To search for an item
• First hash it to find the corresponding bin then find
it in the bin: sequential search through the linked
list
• The expected # balls in a bin is about m/n 
expected time for the search is (m/n)
• If we chose m=n then a search takes expectedly
constant time
• Worst case
• maximum # balls in a bin: (lnn/lnlnn) if choose m=n
• Another disadvantage: wasting a lot of space in
empty bins

Probability for Computing 11


Hashing: bit strings
• In chain hashing, n balls n bins, we waste a lot of
empty bins  should have m/n >>1
• Hashing using sort fingerprints will help
• Suppose: passwords are 8-char, i.e. 64 bits
• We use a hash function that maps each pwd into a 32-bit
string, i.e. a fingerprint
• We store the dictionary of fingerprints of the unacceptable
passwords
• When checking a password, compute its fingerprint then
check it against the dictionary: if found then reject this
password
• But it is possible that our password checker may not
give the correct answer!

Probability for Computing 12


False positives

• This hashing scheme gives a false positive


when it rejects a good password
• The fingerprint of this password accidentally
matches that of an unacceptable password
• For our password checker application this over-
conservative approach is, however, acceptable if
the probability of making a false positive is not
too high

Probability for Computing 13


False positive probability
• How many bits should we use to create
fingerprints?
• We want reasonably small probability of a false
positive match
• Prob [the fingerprint of a given good pwd  any given
unacceptable fingerprint] = 1- 1/2b; here b # bits
• Thus for m unacceptable pwd, prob [false positive
occurs on a given good pwd] = 1- (1- 1/2b)m1- e-m/2b
• Easy to see that: to make this prob less than a given
small constant, we need b= (logn)
• If use b=2logm bits  Prob [ a false positive]= 1-(1-1/m2)m<
1/
m
• Dictionary of 216 words using 32-bit fingerprint  false prob
1/
65,536

Probability for Computing 14


An approximate set membership
problem
• Suppose we have a set S = {s1, s2, s3, …, sm} of
m elements from a large universe set U. We
would like to represent the elements of S in
such a way so that
• We can quickly answer the queries of form “Is x is
an element of S?”
• We want the representation take as little space as
possible
• For saving space we can accept occasional
mistakes in form of false positives
• E.g. in our password checker application

Probability for Computing 15


Bloom filters
• A Bloom filter: a data structure for this
approximate set membership problem
• By generalizing these mentioned hashing ideas to
achieve more interesting trade-off between
required space and the false positive probability
• Consists of an array of n bits, A[0] to A[n-1],
initially set to 0
• Uses k independent hash functions h1, h2, …, hk
with range {0,…n-1}; all these are uniformly
random
• Represent an element sS by setting A[hi(s)] to 1,
i=1,..k

Probability for Computing 16


• Checking: For any
value x, to see if xS
simply check if A[hi(x)]
=1 for all i=1,..k
• If not, clearly x is not a
member of S
• If right, we assume that
x is in S but we could
be wrong!  false
positive

Probability for Computing 17


False positive probability
• The probability of a false positive for an element not in
the set
• After all m elements of S are hashed into Bloom filter, Prob[a
give bit =0] = (1-1/n)km  e –km/n. Let p= e –km/n.
• Prob [a false positive] = (1- (1-1/n)km)k  (1-e –km/n)k = (1-p)k . Let
f= (1-p)k .
• Given m, n what is the optimum k to minimize f?
• Note that a higher k gives us more chance to find a 0-bit for an
element not in S, but using fewer h-functions increases the fraction
of 0-bit in the array.
• Optimal k = ln2.n/m which reaches minimum f = ½k (0.6185)n/m
• Thus Bloom filters allow a small probability of a false positive
while keep the number of storage bit per item a constant
• Note in previous consideration of fingerprints we need (logm)
bits per items

Probability for Computing 18


Bloom filters: applications
• Discovering DoS attack attempt
• Computing the difference between SYN and FIN
packets
• Matching between SYN and FIN packets by 4-tuples of
addresses (source and destination ports)
• Many, many other applications

Probability for Computing 19


Application of hashing: breaking
symmetry
• Suppose that n users want a unique resource (processes
demand CPU time) how can we decide a permutation
quickly and fairly?
• Hashing the User ID into 2b bits then sort the resulting numbers
• That is, smallest hash will go first
• How to avoid two users being hashed to the same value?
• If b large enough we can avoid such collisions as in
birthday paradox analysis
• Fix an user. Prob [another user has the same hash] = 1- (1- 1/2b)n-1
(n-1)/2b
• By union bound, prob [two users have the same hash] = (n-1)n/2b
• Thus, choosing b =3logn guarantees success with probability 1-1/n
• Leader election

Probability for Computing 20


SYN flood DEfense
solutions
21
TCP SYN-Flooding Attack
• TCP services are often susceptible to various
types of DoS attacks
• SYN flood: external hosts attempt to overwhelm the
server machine by sending a constant stream of TCP
connection requests
• Streaming spoofed TCP SYNs
• Forcing the server to allocate resources for each new connection
until all resources are exhausted
• 90% of DoS attacks use TCP SYN floods
• Takes advantage of three way handshake
• Server start “half-open” connections
• These build up… until queue is full and all additional requests are
blocked
TCP: Overview RFCs: 793, 1122, 1323, 2018, 2581
• full duplex data:
• point-to-point: • bi-directional data flow in
same connection
• one sender, one receiver
• MSS: maximum segment
• reliable, in-order byte size
steam: • connection-oriented:
• no “message boundaries” • handshaking (exchange of
• pipelined: control msgs) init’s sender,
receiver state before data
• TCP congestion and flow
exchange
control set window size
• send & receive buffers • flow controlled:
• sender will not
overwhelm receiver
application application
writes data reads data
socket socket
door door
TCP TCP
send buffer receive buffer
segment
TCP segment structure

32 bits
URG: urgent data source port # counting
dest port #
(generally not used) by bytes
sequence number
ACK: ACK # of data
acknowledgement number
valid head not (not segments!)
PSH: push data now Receive window
len used U A P R S F

checksum Urg data pnter # bytes


(generally not used)
rcvr willing
RST, SYN, FIN: Options (variable length) to accept
connection estab
(setup, teardown
commands) application
Internet data
(variable length)
checksum
(as in UDP)
Attack Mechanism

• Transmission Control Block


(TCB) is reserved
• TCP SYN-RECEIVED state:
connection is half-opened
• Up on receiving SYN,
segment TCB
• Transited to ESTABLISHED
until last ACK
Attack Mechanism
• attacker sends a flood
of SYNs  too
manyTCB  host is
exhauted in memory.
• To avoid this, OS only
allows a fixed
maximum number of
TCBs in SYN-
RECEIVED
• If this threshold is
reached, new coming
SYN will be rejected
SYN Flooding
C S

SYNC1 Listening
SYNC2
Store data
SYNC3

SYNC4

SYNC5
Implementation Method
How to create a successful flood
• Making drops of incomplete connection (IC)
• Standard TCP: a connection times out only after some retranmisstion  511 sec
• Assuming 1024 ICs are allowed per socket 2 connection attempts per second to
exhaust all allocated resources.
• Note that existing ICs are dropped when a new SYN request is received.
• If an ACK arrives at the server but does not find a corresponding IC
state  the server fail to establish such required connection
• Round trip time (RTT): time required for the server to have the client reply
• Forcing the server to drop IC state at a rate larger than the RTT,  no connections
are able to complete  success in attack!
• The goal of attack is to recycle every connection before the average RTT
• For a listen queue size of 1024, and a 100 millisecond RTT  need 10,000 packets per
second.
• A minimal size TCP packet is 64 bytes, so the total bandwidth used is only
4Mb/second  practical!

29
Jan 2013
Flood Detection System (FDS)
• Stateless, simple, edge (leaf) routers
• Utilize SYN-FIN pair behavior
• Include (SYNACK - FIN) so client or server
• However, RST violates SYN-FIN behavior
• Placement: First/last mile leaf routers
• First mile – detect large DoS attacker
• Last mile – detect DDoS attacks that first mile
would miss
SYN – FIN Behavior
SYN – FIN Behavior
• Generally every SYN has a FIN
SFD-Method

1- Classification of packets
2-Computing the # of SYN and FIN packets
going through
3-Using algorithm CUSUM to analyze the
(SYN-FIN) pair behaviour

Operating System
Concepts
SFD-BF Method

• Improvement on previous SFD:


• Compute the difference between #SYN and
#FIN when the packets are matched on the 4-
tuple:
• When a SYN packet comes, determine the corresponding
4-tuple and insert this into BF. Increase the counter
specified by this 4-tuple.
• When a FIN/RST packet comes: determines the 4-
tuples and find it’s hash in BF to decrease the
corresponding counter
Intentional Dropping
Scheme for SYN
Flooding Mitigation
Idea

• Normally, if it does not receive a SYN-ACK after


sending a SYN for a certain time a client machine
then would resend another SYN until it gets
connected to the wanted server.
• The idea of this method is to drop all the first SYN
from all the source machine, which would help to
reduce SYN flood which is usually first SYNs with
spoofed addresses
Method

• The solution is to propose using 3 different BFs:


• BF1: stores the 4-tuple address of the first SYN
coming from a given source
• BF2: stores the 4-tupple of all SYNs, with which
the 3-way handshake is already completed
• BF-3: Store the 4-tupple of other SYNs.

Operating System
Concepts
Method

Once a SYN arrives, its 4-tuple address is checked


against the 3 BFs, where occurs 1 of the 3 following
cases:
• 1. Not in any BF This is the first SYN then will be
dropped, also insert the 4-tuple into BF1
• 2. If found in BF-1 this is a second SYN we just
move the 4-tuple from BF1 to BF3
• 3. If in BF-2  Let it go through.
• 4. If in BF-3  let it goes through with probability
p=1/n, where n is the value of corresponding counter in
BF-3
Method

When an ACK comes, its 4-tupple address is checked


against the BFs, which may results in 1 of 3
following cases”
1. Not in any BF  drop the packet
2. If it matches one in BF-2  let it through
3. If in BF-3  the connection is completed 
move the 4-tuple address from BF3 to BF-2
Result

• First SYN from any source will be dropped


• The second SYN from the same source will go through
• If this same source continue sending SYNs, the
probability that the SYN numbered n is allowed to go
through is 1/n
 Thus, the SYN flood caused by an attacking source will
be mitigated.
Thank you for​
your attentions!​

You might also like