
Unit 2

MATHEMATICAL FOUNDATION OF BIG DATA
• Syllabus
Probability: Random Variables and Joint Probability, Conditional Probability
and concept of Markov chains, Tail bounds, Markov chains and random
walks, Pair-wise independence and universal hashing, Approximate counting,
Approximate median.
Data Streaming Models and Statistical Methods: Flajolet Martin
algorithm, Distance Sampling and Random Projections, Bloom filters, Mode,
Variance, standard deviation, Correlation analysis and Analysis of Variance.
Probability Theory
Probability theory is the branch of mathematics that quantifies the
unpredictability of events: it assigns each event a numerical measure of how
likely it is to occur.
Probability Spaces and Events
The sample space of a random experiment is the collection of all
possible outcomes.
An event associated with a random experiment is a subset of the
sample space.
The probability of any outcome is a number between 0 and 1. The
probabilities of all the outcomes add up to 1.
e.g.
• Experiment: toss a coin twice
• Sample space: possible outcomes of an experiment
• S = {HH, HT, TH, TT}
• Event: a subset of possible outcomes
• A={HH}, B={HT, TH}
Joint Probability
For events A and B, joint probability Pr(AB) stands for the probability that
both events happen.
A joint probability, in probability theory, refers to the probability that two
events will both occur.
In other words, joint probability is the likelihood of two events occurring
together.

• Formula for Joint Probability (for independent events)

P(A ⋂ B) = P(A) × P(B)

• Where:
• P(A ⋂ B) is the notation for the joint probability of events “A” and “B”.
• P(A) is the probability of event “A” occurring.
• P(B) is the probability of event “B” occurring.
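As a minimal sketch (assuming Python, and using two illustrative events over the two-coin-toss sample space above, not the A and B on the slide), the joint probability of two independent events can be checked by direct counting:

```python
from fractions import Fraction

# Sample space for two fair coin tosses, as in the example above.
sample_space = ["HH", "HT", "TH", "TT"]

# Illustrative events: A = first toss is heads, B = second toss is heads.
A = {s for s in sample_space if s[0] == "H"}
B = {s for s in sample_space if s[1] == "H"}

def prob(event):
    # Each outcome is equally likely, so count outcomes in the event.
    return Fraction(len(event), len(sample_space))

# Joint probability by direct counting over the sample space.
p_joint = prob(A & B)

# Because the two tosses are independent, P(A ⋂ B) = P(A) * P(B).
assert p_joint == prob(A) * prob(B)
print(p_joint)  # 1/4
```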
Conditional Probability
If A and B are events with Pr(A) > 0, the conditional probability of B
given A is

Pr(B | A) = Pr(AB) / Pr(A)
Conditional probability is defined as the likelihood of an event or outcome
occurring, given that a previous event or outcome has occurred.
Conditional probability is calculated by dividing the joint probability of
both events by the probability of the preceding event.
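The definition Pr(B | A) = Pr(AB) / Pr(A) can be illustrated with a small Python sketch over the same two-coin-toss sample space (the events here are illustrative assumptions):

```python
from fractions import Fraction

sample_space = ["HH", "HT", "TH", "TT"]
A = {"HH", "HT"}  # illustrative event: first toss is heads
B = {"HH", "TH"}  # illustrative event: second toss is heads

def prob(event):
    return Fraction(len(event), len(sample_space))

# Pr(B | A) = Pr(A ∩ B) / Pr(A)
p_b_given_a = prob(A & B) / prob(A)
print(p_b_given_a)  # 1/2
```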
Tail Bounds
A random variable has good concentration if it is close to its mean
with high probability.
Bounds based on the first and second moments are known as
tail bounds.
Types of tail bounds
1. Markov Inequality
2. Chebyshev Inequality

A concentration inequality theorem is also known as Tail Bounds.


1. Markov inequality
In probability theory, Markov's inequality gives an upper bound on the
probability that a non-negative random variable is greater than or equal to
some positive constant:

Pr(X ≥ a) ≤ E[X] / a, for a > 0.

Markov's inequality is the simplest tail bound, requiring only the
existence of the first moment.
2. Chebyshev Inequality
The Chebyshev bound is slightly stronger than Markov’s inequality; it uses
the second moment (the variance):

Pr(|X − μ| ≥ kσ) ≤ 1/k², for k > 0.
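Both tail bounds can be checked empirically with a quick Monte Carlo sketch (a minimal illustration; the choice of random variable, the number of heads in 10 fair coin tosses, is an assumption for demonstration):

```python
import random

# Non-negative random variable: X = number of heads in 10 fair coin tosses.
# E[X] = 5, Var(X) = 2.5.
N, n, p = 100_000, 10, 0.5
random.seed(0)
samples = [sum(random.random() < p for _ in range(n)) for _ in range(N)]

mean, var = 5.0, 2.5

# Markov: Pr(X >= a) <= E[X] / a
a = 8
empirical_markov = sum(x >= a for x in samples) / N
assert empirical_markov <= mean / a

# Chebyshev: Pr(|X - mean| >= k*sigma) <= 1 / k**2
k = 2
sigma = var ** 0.5
empirical_cheb = sum(abs(x - mean) >= k * sigma for x in samples) / N
assert empirical_cheb <= 1 / k**2
```

Note how loose Markov's bound is here (0.625 versus a true tail probability of about 0.055): it uses only the first moment, which is exactly why Chebyshev's variance-based bound is stronger.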
Markov Chains
• A Markov chain or Markov process is a stochastic model describing
a sequence of possible events in which the probability of each event depends
only on the state attained in the previous event.
• A countably infinite sequence, in which the chain moves state at discrete time
steps, gives a discrete-time Markov chain (DTMC).
• A continuous-time process is called a continuous-time Markov
chain (CTMC).
• Markov chains are a fairly common, and relatively simple, way to
statistically model random processes.
• They have been used in many different domains, ranging from text
generation to financial modeling.
• Types
Discrete-time
Continuous-time
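A discrete-time Markov chain can be simulated in a few lines of Python (a minimal sketch; the two weather states and transition probabilities are illustrative assumptions):

```python
import random

# A two-state discrete-time Markov chain over illustrative weather states.
# transition[s] gives the distribution of the next state given current state s.
transition = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def step(state, rng):
    # The next state depends only on the current state (Markov property).
    probs = transition[state]
    return rng.choices(list(probs), weights=list(probs.values()))[0]

rng = random.Random(0)
state = "sunny"
counts = {"sunny": 0, "rainy": 0}
for _ in range(100_000):
    state = step(state, rng)
    counts[state] += 1

# The long-run fraction of "sunny" steps approaches the stationary value 5/6.
print(counts["sunny"] / 100_000)
```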
Random Walks
• A random walk is a mathematical object, known as a stochastic or random
process, that describes a path that consists of a succession of random steps
on some mathematical space such as the integers.
• A random walk is the process by which randomly-moving objects wander
away from where they started.
• A random walk, in probability theory, is a process for determining the probable
location of a point subject to random motions, given the probabilities (the
same at each step) of moving some distance in some direction.
• Random walks are an example of Markov processes, in which future
behavior is independent of past history.
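The simple symmetric random walk on the integers can be sketched in a few lines of Python (a minimal illustration; the step count and seed are assumptions):

```python
import random

# Simple symmetric random walk on the integers: at each step move +1 or -1
# with equal probability. The next position depends only on the current one,
# so the walk is a Markov process.
def random_walk(steps, rng):
    position, path = 0, [0]
    for _ in range(steps):
        position += rng.choice([-1, 1])
        path.append(position)
    return path

rng = random.Random(42)
path = random_walk(1000, rng)
print(path[-1])  # final position after 1000 steps
# E[position] = 0 and E[position**2] = steps, so the typical distance
# from the origin grows like sqrt(steps).
```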
Universal Hashing
• Universal hashing (in a randomized algorithm or data structure) refers to
selecting a hash function at random from a family of hash functions with a
certain mathematical property
• This ensures a minimum number of collisions.
• A randomized algorithm H for constructing hash functions h : U → {1,…,M}
is universal if for all x, y in U such that x ≠ y, Pr h∈H [h(x) = h(y)] ≤
1/M (i.e., for every pair of distinct keys x and y, the probability that
h(x) = h(y) is at most 1/M)
• A set H of hash functions is called a universal hash function family if the
procedure “choose h ∈ H at random” is universal.
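A classic universal family has the form h(x) = ((a·x + b) mod p) mod M, with p a prime larger than any key and a, b chosen at random with a ≠ 0. A minimal Python sketch (the prime and table size are illustrative assumptions):

```python
import random

P = 2_147_483_647  # the Mersenne prime 2**31 - 1; must exceed the key universe
M = 1024           # table size (assumption)

def random_hash(rng):
    # "Choose h from H at random": pick random a != 0 and b modulo P.
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    def h(x):
        return ((a * x + b) % P) % M
    return h

rng = random.Random(0)
h = random_hash(rng)
print(h(12345), h(67890))  # two bucket indices in [0, M)
```

For distinct keys x, y < P, the randomness of a and b makes Pr[h(x) = h(y)] at most roughly 1/M, which is the universality property defined above.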
Pair-wise Independence
• In probability theory, a pairwise independent collection of random variables
is a set of random variables any two of which are independent.
• Any collection of mutually independent random variables is pairwise
independent, but some pairwise independent collections are not mutually
independent.
• Pairwise independent random variables with
finite variance are uncorrelated.
Approximate counting
• The approximate counting algorithm allows the counting of a large
number of events using a small amount of memory.
• Using Morris' algorithm, the counter represents an "order of
magnitude estimate" of the actual count. The approximation is
mathematically unbiased.
• To increment the counter, a pseudo-random event is used, such that the
incrementing is a probabilistic event. To save space, only the exponent is
kept.
• For example, in base 2, the counter can estimate the count to be 1, 2, 4, 8, 16,
32, and all of the powers of two. The memory requirement is simply to hold
the exponent.
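Morris' approximate counter can be sketched as follows (a minimal illustration; the class name and parameters are assumptions): only the exponent c is stored, each increment succeeds with probability 2^−c, and the count is estimated as 2^c − 1, which is unbiased.

```python
import random

class MorrisCounter:
    """Approximate counter storing only an exponent c."""

    def __init__(self, rng):
        self.c = 0
        self.rng = rng

    def increment(self):
        # Probabilistic increment: succeed with probability 2**-c.
        if self.rng.random() < 2.0 ** -self.c:
            self.c += 1

    def estimate(self):
        # 2**c - 1 is an unbiased estimate of the true count.
        return 2 ** self.c - 1

rng = random.Random(0)
counter = MorrisCounter(rng)
for _ in range(10_000):
    counter.increment()
print(counter.estimate())  # an order-of-magnitude estimate of 10,000
```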
Data Streaming Models
There are three streaming window models:
1. Landmark window Model
2. Damped window Model
3. Sliding window Model
1. Landmark window Model
Include all objects from a given landmark.
All points have a weight w=1.

2. Sliding window model:


Remember only the n most recent entries, where n is the window size.
All points within the window have a weight w=1, for the rest: w=0.
3. Damped window model:
In this model, each object is associated with a weight which depends on
its arrival time.
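The sliding window model maps directly onto a fixed-size buffer; a minimal Python sketch using a deque with maxlen (the window size and stream values are illustrative assumptions):

```python
from collections import deque

# Sliding-window model: keep only the n most recent stream items (weight 1);
# everything older is implicitly dropped (weight 0).
n = 3
window = deque(maxlen=n)  # the deque evicts the oldest item automatically

for item in [10, 20, 30, 40, 50]:
    window.append(item)

print(list(window))  # [30, 40, 50]
```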
Flajolet Martin Algorithm
The Flajolet Martin Algorithm, also known as the FM algorithm, is used to
approximate the number of unique elements in a data stream or database in one
pass.
Algorithm
• Create a bit vector (bit array) of sufficient length L, such that 2^L > n, the
number of elements in the stream. Usually a 64-bit vector is sufficient since
2^64 is quite large for most purposes.
• The i-th bit in this vector/array represents whether we have seen a hash
function value whose binary representation ends in 0^i (i.e., in i zeros). So
initialize each bit to 0.
• Once input is exhausted, get the index of the first 0 in the bit array (call this
R). This is just the length of the initial run of consecutive 1s (i.e. we have
seen hash values ending in 0, 00, …, 0^(R−1)).
• Calculate the number of unique elements as 2^R/φ, where φ ≈ 0.77351. A proof
for this can be found in the original Flajolet-Martin paper.
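The steps above can be sketched in Python (a minimal illustration; the hash function, SHA-1 truncated to 64 bits, and the stream contents are assumptions):

```python
import hashlib

PHI = 0.77351
L = 64  # bit-vector length; 2**64 exceeds any realistic stream size

def trailing_zeros(x):
    # Number of trailing zero bits in x (position of lowest set bit).
    if x == 0:
        return L
    return (x & -x).bit_length() - 1

def fm_estimate(stream):
    bits = [0] * L
    for item in stream:
        # Hash each item to a 64-bit integer (SHA-1 truncated: an assumption).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        bits[min(trailing_zeros(h), L - 1)] = 1
    R = bits.index(0)  # index of the first 0 = run length of leading 1s
    return 2 ** R / PHI

# Illustrative stream: 10,000 items, only 500 distinct values.
stream = [f"user{i % 500}" for i in range(10_000)]
est = fm_estimate(stream)
print(est)  # a rough (order-of-magnitude) estimate of 500
```

Because the estimate is a power of two divided by φ, a single sketch is only accurate to within about a factor of two; practical deployments average several sketches with independent hash functions.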
Random Projections
• Random projection is a technique used to reduce the dimensionality of a set of
points which lie in Euclidean space.
• Random Projection is a method of dimensionality reduction and data
visualization that simplifies the complexity of high-dimensional datasets.
• The method generates a new dataset by taking the projection of each data point
along a randomly chosen set of directions.
• The projection of a single data point onto a vector is mathematically equivalent to
taking the dot product of the point with the vector.
• When performing Random Projection, the directions are chosen randomly, which
makes it very efficient. The name "projection" may be a little misleading:
because the vectors are chosen randomly, the transformed points are not,
mathematically, true projections, but they are close to being true projections.
• Random projection methods are known for their power, simplicity, and low error
rates when compared to other methods.
• They have been applied to many natural language tasks under the name random
indexing.
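A minimal random projection sketch in pure Python (the dimensions, the Gaussian projection matrix, and the test points are illustrative assumptions): pairwise distances are approximately preserved, as the Johnson-Lindenstrauss lemma guarantees.

```python
import math
import random

# Project d-dimensional points to k dimensions by multiplying with a random
# Gaussian matrix scaled by 1/sqrt(k).
def random_projection_matrix(d, k, rng):
    return [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]

def project(point, matrix):
    # Each output coordinate is the dot product of the point with one row.
    return [sum(r * x for r, x in zip(row, point)) for row in matrix]

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

rng = random.Random(0)
d, k = 1000, 100
R = random_projection_matrix(d, k, rng)

p = [rng.gauss(0, 1) for _ in range(d)]
q = [rng.gauss(0, 1) for _ in range(d)]

# The projected distance is close to the original distance.
print(dist(p, q), dist(project(p, R), project(q, R)))
```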
Distance Sampling
Distance sampling is a widely used group of closely related methods for
estimating the density and/or abundance of populations. The main methods are
based on line transects or point transects.
In this method of sampling, the data collected are the distances of the objects
being surveyed from these randomly placed lines or points, and the objective
is to estimate the average density of the objects within a region.
Types of Distance Sampling
The two main methods are line transect and point transect, but other methods
(mostly sub-types of either point or line transect) exist.
• Line Transect Surveys
• Cue-Counting
• Point Transect Surveys
• Strip Transects
• Trapping webs
Bloom filters
To understand Bloom filters, you must know what hashing is. A hash
function takes an input and outputs a fixed-length value that is used to
identify the input.
A Bloom filter is a space-efficient probabilistic data structure that is used to
test whether an element is a member of a set.
Interesting Properties of Bloom Filters
• Unlike a standard hash table, a Bloom filter of a fixed size can represent a set
with an arbitrarily large number of elements.
• Adding an element never fails. However, the false positive rate increases
steadily as elements are added until all bits in the filter are set to 1, at which
point all queries yield a positive result.
• Bloom filters never generate a false negative result, i.e., they will never
tell you that a username does not exist when it actually does.
• Deleting elements from the filter is not possible because, if we delete a
single element by clearing the bits at the indices generated by its k hash
functions, we might cause the deletion of a few other elements.
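The properties above can be seen in a minimal Bloom filter sketch (the parameters m and k, and the salted-SHA-1 hashing scheme, are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k salted hash functions."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        # Derive k hash values by salting one hash function (an assumption).
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
print(bf.might_contain("alice"))  # True (no false negatives)
print(bf.might_contain("carol"))  # almost certainly False, but could be a
                                  # false positive as the filter fills up
```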
Mode, Variance, standard deviation, Correlation analysis and
Analysis of Variance
Mode
The mode is the value that occurs most often in the data set. For example, in
{140, 150, 150, 160}, the value 150 occurs twice, so the mode is 150.
Variance
• Variance is a numerical value that describes the variability of the
observations from their arithmetic mean; it is denoted by sigma-squared (σ²).
• Variance measures how far the individuals in the data set are spread out
from the mean.

σ² = Σ (Xᵢ − μ)² / N

Where
Xᵢ : the elements in the data set
μ : the population mean
N : the number of elements in the population
Step 1: Take each element of the data set (population), subtract
the mean, and square the difference; then sum all these values.
Step 2: Divide the sum from Step 1 by the total number of
elements.
The square in the formula nullifies the effect of the negative
sign (−).
Standard Deviation
• It is a measure of the dispersion of observations within a data set relative to
their mean. It is the square root of the variance and is denoted by sigma (σ).
• Standard deviation is expressed in the same units as the values in the data set,
so it measures how much the observations of the data set differ from their mean.
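The mode, population variance, and standard deviation can all be computed with Python's standard statistics module (the data set here is an illustrative assumption):

```python
import statistics

data = [140, 150, 150, 160, 170]

# Mode: the value that occurs most often.
print(statistics.mode(data))  # 150

# Population variance: average squared deviation from the mean.
mu = sum(data) / len(data)                            # 154.0
var = sum((x - mu) ** 2 for x in data) / len(data)
print(var)                                            # 104.0
print(statistics.pvariance(data))                     # 104.0 via the stdlib

# Standard deviation: square root of the variance, in the data's own units.
print(statistics.pstdev(data))                        # sqrt(104) ≈ 10.2
```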
Correlation analysis
• Correlation analysis is a statistical method that is used to discover whether
there is a relationship between two variables/datasets, and how strong that
relationship may be.
• The study of how variables are correlated is called correlation analysis.
• Correlation analysis is used to analyse quantitative data gathered from
research methods such as surveys and polls, to identify whether there are any
significant connections, patterns, or trends between the two.
• Correlation analysis is used for spotting patterns within datasets.
• A positive correlation result means that both variables increase in relation to
each other, while a negative correlation means that as one variable
decreases, the other increases.
Some examples of data that have a high correlation:
• Your caloric intake and your weight.
• Your eye color and your relatives’ eye colors.
• The amount of time your study and your GPA.
Positive Correlation
Any score from +0.5 to +1 indicates a very strong positive
correlation, which means that they both increase at the same
time.

Negative correlation
Any score from -0.5 to -1 indicates a strong negative correlation,
which means that as one variable increases, the other decreases
proportionally.

No Correlation
Very simply, a score of 0 indicates that there is no correlation, or
relationship, between the two variables. The larger the sample
size, the more accurate the result.
Correlation coefficient(r)
Properties
• It is a pure number
• r ranges from -1 to 1
• The correlation between two variables is known as simple correlation
or correlation of zero order
• It is not affected by coding of variables
• The square of r is referred to as the coefficient of determination
• If the two variables are independent the r between them is zero.
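The coefficient r can be computed directly from its definition as a standardized covariance; a minimal Python sketch (the study-hours/GPA data are illustrative, chosen to be perfectly linear):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient; always lies in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

study_hours = [1, 2, 3, 4, 5]
gpa = [2.0, 2.5, 3.0, 3.5, 4.0]  # perfectly linear in study_hours

r = pearson_r(study_hours, gpa)
print(r)       # ≈ 1.0: a perfect positive correlation
print(r ** 2)  # the coefficient of determination
```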
