Unit 2 Mathematical Foundation of Big Data: - Syllabus
• Where:
• P(A ∩ B) is the notation for the joint probability of events “A” and “B”.
• P(A) is the probability of event “A” occurring.
• P(B) is the probability of event “B” occurring.
Conditional Probability
If A and B are events with Pr(A) > 0, the conditional probability of B
given A is
Pr(B | A) = Pr(A ∩ B) / Pr(A)
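As a minimal sketch, the ratio Pr(A ∩ B) / Pr(A) can be computed by enumerating a small sample space. The two-dice experiment and the events below are illustrative assumptions, not part of the syllabus.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def is_a(o):
    # Hypothetical event A: the first die shows an even number.
    return o[0] % 2 == 0

def is_b(o):
    # Hypothetical event B: the two dice sum to 8.
    return o[0] + o[1] == 8

p_a = prob(is_a)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))

# Conditional probability of B given A (requires Pr(A) > 0).
p_b_given_a = p_a_and_b / p_a
print(p_a, p_a_and_b, p_b_given_a)
```

Exact fractions are used so that the result is not blurred by floating-point rounding.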
2. Chebyshev Inequality
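The Chebyshev bound is usually tighter than Markov’s inequality because it
uses the variance as well as the mean. For a random variable X with mean μ and
finite variance σ², Pr(|X − μ| ≥ kσ) ≤ 1/k² for every k > 0.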
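A minimal sketch that checks the bound Pr(|X − μ| ≥ kσ) ≤ 1/k² on sample data. The exponential samples, the seed, and the helper name `tail_fraction` are assumptions chosen for illustration; the bound holds for any data set when μ and σ are the population mean and standard deviation of that data.

```python
import random
import statistics

def tail_fraction(data, k):
    """Fraction of points at least k population standard deviations from the mean."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(1 for x in data if abs(x - mu) >= k * sigma) / len(data)

# Hypothetical data: 10,000 samples from an exponential distribution.
rng = random.Random(42)
samples = [rng.expovariate(1.0) for _ in range(10_000)]

for k in (1.5, 2.0, 3.0):
    # The observed tail fraction should never exceed the Chebyshev bound 1/k^2.
    print(k, tail_fraction(samples, k), "<=", 1 / k**2)
```

The observed tail fractions are typically far below 1/k², which reflects that Chebyshev is a worst-case bound over all distributions with the given variance.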
Markov Chains
• A Markov chain or Markov process is a stochastic model describing
a sequence of possible events in which the probability of each event depends
only on the state attained in the previous event.
• A chain that changes state at discrete time steps, producing a countable
sequence of states, is a discrete-time Markov chain (DTMC).
• A Markov process that evolves in continuous time is called a
continuous-time Markov chain (CTMC).
• Markov chains are a fairly common, and relatively simple, way to
statistically model random processes.
• They have been used in many different domains, ranging from text
generation to financial modeling.
• Types
Discrete-time
Continuous-time
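A discrete-time chain can be sketched as a table of transition probabilities plus a loop that samples the next state from the current one only. The two weather states and their probabilities are hypothetical values chosen for illustration.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[current][next].
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate_chain(start, steps, rng):
    """Simulate a discrete-time Markov chain for the given number of steps."""
    state = start
    path = [state]
    for _ in range(steps):
        # The next state depends only on the current state (Markov property).
        next_states = list(TRANSITIONS[state])
        weights = list(TRANSITIONS[state].values())
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path

path = simulate_chain("sunny", 10, random.Random(0))
print(path)
```

Each row of the table must sum to 1, since the chain has to move to some state (possibly the same one) at every step.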
Random Walks
• A random walk is a mathematical object, known as a stochastic or random
process, that describes a path that consists of a succession of random steps
on some mathematical space such as the integers.
• A random walk is the process by which randomly-moving objects wander
away from where they started.
• In probability theory, a random walk is a process for determining the probable
location of a point subject to random motions, given the probabilities (the
same at each step) of moving some distance in some direction.
• Random walks are an example of Markov processes, in which future
behavior depends only on the current position and not on the earlier history.
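A minimal sketch of a random walk on the integers, assuming a symmetric walk that starts at 0 and moves −1 or +1 with equal probability at each step. The function name and the seed are illustrative.

```python
import random

def random_walk(steps, rng):
    """Return the positions visited by a simple symmetric walk on the integers."""
    position = 0
    path = [position]
    for _ in range(steps):
        # Each step is independent of the past: move left or right by one.
        position += rng.choice((-1, 1))
        path.append(position)
    return path

walk = random_walk(100, random.Random(7))
print(walk[-1])  # final position after 100 steps
```

Because every step changes the position by exactly one, the position after n steps always has the same parity as n.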
Universal Hashing
• Universal hashing (in a randomized algorithm or data structure) refers to
selecting a hash function at random from a family of hash functions with a
certain mathematical property.
• This guarantees a low expected number of collisions, whatever keys are hashed.
• A randomized algorithm H for constructing hash functions h : U → {1,…
,M} is universal if, for all x, y in U with x ≠ y, Pr_{h∈H}[h(x) = h(y)] ≤
1/M (i.e., any two distinct keys collide with probability at most 1/M over
the random choice of h).
• A set H of hash functions is called a universal hash function family if the
procedure “choose h ∈ H at random” is universal.
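One well-known universal family is h(x) = ((a·x + b) mod p) mod M, with p a prime larger than every key, a drawn from 1..p−1 and b from 0..p−1. The sketch below assumes integer keys, uses the range {0, …, M−1} rather than {1, …, M}, and picks p = 101 and M = 10 purely for illustration. It verifies the 1/M bound exactly by enumerating the whole family.

```python
import random
from fractions import Fraction

P = 101  # assumed prime, larger than every key used below
M = 10   # assumed number of buckets

def make_hash(a, b, p=P, m=M):
    """Return the member h(x) = ((a*x + b) mod p) mod m of the family."""
    def h(x):
        return ((a * x + b) % p) % m
    return h

def random_hash(rng, p=P, m=M):
    """Choose h at random from the family: a in 1..p-1, b in 0..p-1."""
    return make_hash(rng.randrange(1, p), rng.randrange(0, p), p, m)

def collision_probability(x, y, p=P, m=M):
    """Exact Pr[h(x) == h(y)] over all (a, b) pairs in the family."""
    collisions = 0
    for a in range(1, p):
        for b in range(p):
            h = make_hash(a, b, p, m)
            if h(x) == h(y):
                collisions += 1
    return Fraction(collisions, p * (p - 1))

h = random_hash(random.Random(3))
print(h(12), h(57))
print(collision_probability(12, 57), "<=", Fraction(1, M))
```

The guarantee is over the random choice of h, not over the keys: a single fixed h can still collide on a given pair.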
Pairwise Independent Hashing
• In probability theory, a pairwise independent collection of random variables
is a set of random variables any two of which are independent.
• Any collection of mutually independent random variables is pairwise
independent, but some pairwise independent collections are not mutually
independent.
• Pairwise independent random variables with
finite variance are uncorrelated.
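The gap between pairwise and mutual independence can be seen in a standard three-variable example: X and Y are independent fair bits and Z = X XOR Y. The sketch below enumerates the four equally likely outcomes; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Outcomes (x, y, z) with z = x XOR y; the four (x, y) pairs are equally likely.
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def pairwise_independent():
    """Check Pr[Vi=u, Vj=v] == Pr[Vi=u] * Pr[Vj=v] for every pair of variables."""
    for i, j in ((0, 1), (0, 2), (1, 2)):
        for u, v in product((0, 1), repeat=2):
            joint = prob(lambda o: o[i] == u and o[j] == v)
            if joint != prob(lambda o: o[i] == u) * prob(lambda o: o[j] == v):
                return False
    return True

def triple_factorizes():
    """Check whether the joint distribution of all three variables factorizes."""
    for u, v, w in product((0, 1), repeat=3):
        joint = prob(lambda o: o == (u, v, w))
        marginals = (prob(lambda o: o[0] == u)
                     * prob(lambda o: o[1] == v)
                     * prob(lambda o: o[2] == w))
        if joint != marginals:
            return False
    return True

print(pairwise_independent(), triple_factorizes())
```

Any two of the three bits are independent, yet the third is fully determined by the other two, so the collection is not mutually independent.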
Approximate counting
• The approximate counting algorithm allows the counting of a large
number of events using a small amount of memory.
• Using Morris' algorithm, the counter represents an "order of
magnitude estimate" of the actual count. The approximation is
mathematically unbiased.
• To increment the counter, a pseudo-random event is used, such that the
incrementing is a probabilistic event. To save space, only the exponent is
kept.
• For example, in base 2, the counter can estimate the count to be 1, 2, 4, 8, 16,
32, and so on through the powers of two. The only memory required is the space
to hold the exponent.
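A minimal sketch of a base-2 Morris counter. It stores only the exponent c, increments c with probability 2^(−c), and reports 2^c − 1, which is the unbiased form of the estimate. The class name, the helper `average_estimate`, and the seeds are assumptions made for illustration.

```python
import random

class MorrisCounter:
    """Approximate counter that keeps only an exponent."""

    def __init__(self, rng):
        self.exponent = 0
        self.rng = rng

    def increment(self):
        # Increase the exponent with probability 2**(-exponent).
        if self.rng.random() < 2.0 ** -self.exponent:
            self.exponent += 1

    def estimate(self):
        # 2**exponent - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.exponent - 1

def average_estimate(n, trials, seed):
    """Average the estimate over independent counters to show unbiasedness."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counter = MorrisCounter(rng)
        for _ in range(n):
            counter.increment()
        total += counter.estimate()
    return total / trials

print(average_estimate(200, 2000, seed=1))  # should be close to 200
```

A single counter can be off by a large factor; averaging several independent counters is one common way to reduce the variance.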
Data Streaming Models
There are three data streaming (window) models:
1. Landmark window Model
2. Damped window Model
3. Sliding window Model
1. Landmark window Model
Includes all objects that arrive from a given landmark onward.
All points have a weight w = 1.
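The three models can be compared by the weight each one gives to an item in the stream. In the sketch below, only the landmark rule (weight 1 from the landmark onward) comes from the text above; the sliding window (weight 1 for the most recent `width` items) and the damped window (weight `decay` raised to the item's age) follow common definitions and are assumptions here, as are all names and the sample stream.

```python
def landmark_weights(n, landmark):
    """Weight 1 for every item at or after the landmark index, 0 before it."""
    return [1.0 if i >= landmark else 0.0 for i in range(n)]

def sliding_weights(n, width):
    """Weight 1 for the most recent `width` items, 0 for older ones (assumed rule)."""
    return [1.0 if i >= n - width else 0.0 for i in range(n)]

def damped_weights(n, decay):
    """Weight decay**age, where the newest item has age 0 (assumed rule)."""
    return [decay ** (n - 1 - i) for i in range(n)]

def weighted_sum(values, weights):
    return sum(v * w for v, w in zip(values, weights))

# Hypothetical stream, oldest item first.
stream = [4, 8, 15, 16, 23, 42]
n = len(stream)

print(weighted_sum(stream, landmark_weights(n, landmark=2)))
print(weighted_sum(stream, sliding_weights(n, width=3)))
print(weighted_sum(stream, damped_weights(n, decay=0.5)))
```

Materializing the weights is only for illustration; a real streaming implementation would update the aggregate incrementally as items arrive.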
Variance
σ² = Σ (Xi − μ)² / N
Where
Xi : the elements in the data set
μ : the population mean
N : the total number of elements
Step 1: Subtract the mean of the data set (population) from each
element, square each difference, and sum the squares.
Step 2: Take the sum from Step 1 and divide by the total number of
elements.
Squaring in the above formula removes the effect of the negative
sign (−), so deviations below and above the mean do not cancel.
Standard Deviation
• It is a measure of the dispersion of the observations in a dataset relative to
their mean. It is the square root of the variance and is denoted by sigma (σ).
• Standard deviation is expressed in the same unit as the values in the dataset, so it
measures how much the observations in the dataset differ from their mean.
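A short sketch of the population variance (the mean of the squared deviations from the mean) and the standard deviation as its square root. The data set is a hypothetical example, and Python's `statistics` module is used only as a cross-check.

```python
import math
import statistics

def population_variance(data):
    """Mean of the squared deviations from the population mean."""
    mu = sum(data) / len(data)
    squared_deviations = [(x - mu) ** 2 for x in data]
    return sum(squared_deviations) / len(data)

def population_std(data):
    """Standard deviation: square root of the population variance."""
    return math.sqrt(population_variance(data))

# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))  # 4.0
print(population_std(data))       # 2.0
print(statistics.pvariance(data), statistics.pstdev(data))
```

Here the squared deviations are 9, 1, 1, 1, 0, 0, 4 and 16, which sum to 32; dividing by the 8 elements gives a variance of 4 and a standard deviation of 2.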
Correlation analysis
• Correlation analysis is a statistical method used to discover whether there is a
relationship between two variables/datasets, and how strong that
relationship may be.
• The study of how variables are correlated is called correlation analysis.
• Correlation analysis is used to analyse quantitative data gathered from
research methods such as surveys and polls, to identify whether there are any
significant connections, patterns, or trends between two variables.
• Correlation analysis is used for spotting patterns within datasets.
• A positive correlation result means that both variables increase in relation to
each other, while a negative correlation means that as one variable
decreases, the other increases.
Some examples of data that have a high correlation:
• Your caloric intake and your weight.
• Your eye color and your relatives’ eye colors.
• The amount of time you study and your GPA.
Positive Correlation
Any score from +0.5 to +1 indicates a strong positive
correlation, which means that both variables increase at the same
time.
Negative correlation
Any score from -0.5 to -1 indicates a strong negative correlation,
which means that as one variable increases, the other decreases.
No Correlation
Very simply, a score of 0 indicates that there is no linear correlation, or
relationship, between the two variables. The larger the sample
size, the more accurate the estimated correlation.
Correlation coefficient (r)
Properties
• It is a pure number
• r ranges from -1 to 1
• The correlation between two variables is known as simple correlation
or correlation of zero order
• It is not affected by coding of the variables (a change of origin or of
positive scale)
• The square of r is referred to as the coefficient of determination
• If the two variables are independent, the r between them is zero (the
converse does not always hold).
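The properties above can be checked numerically. The sketch below assumes that r denotes Pearson's correlation coefficient, computed as the sum of products of deviations divided by the square root of the product of the sums of squared deviations. The study-hours and GPA figures are made-up illustrative data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data: hours studied per day and GPA.
hours = [1, 2, 3, 4, 5]
gpa = [2.0, 2.5, 3.1, 3.4, 3.9]

r = pearson_r(hours, gpa)
print(r)        # strong positive correlation, close to +1
print(r ** 2)   # coefficient of determination

# Coding (change of origin and positive scale) leaves r unchanged.
coded_hours = [60 * h + 15 for h in hours]
print(pearson_r(coded_hours, gpa))
```

The function divides by zero if either variable is constant, since r is undefined when a variable has no spread.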