Probability: Background Material
1 Basic Definitions
1.1 Probability Space and Random Variables
Definition 1.1. A probability space is a triplet (Ω, E, µ) where Ω is the set of potential outcomes, E is
a set of events ([open] subsets of Ω) and µ : E → [0, 1] is a function assigning probability mass to any
E ∈ E. A random variable X over a set of outcomes O is a mapping X : Ω → O.
We often think of the probability space as a random variable X where the chosen outcome is the
(randomly chosen) value of X. In this course we will only consider two cases: when X is discrete (and
typically finite), and when X is a continuous real random variable. When X is discrete then µ essentially
assigns a probability mass µ(x) = Pr[X = x] to each outcome x of X s.t. ∑_x µ(x) = 1. When X is continuous
then (in all cases we will consider) there is no probability mass on a single outcome; rather, we look
at the cumulative distribution function (CDF), which maps any outcome x to the probability Pr[X < x].
The derivative of the CDF is the non-negative probability density function (PDF), and so ∫_R PDF(x) dx = 1.
1.2 Independence
Definition 1.2. Two random variables X and Y are independent if for any two events EX and EY it
holds that Pr[X ∈ EX and Y ∈ EY ] = Pr[X ∈ EX ] · Pr[Y ∈ EY ].
In this course we will think of independence as two separate random coin tosses — we toss the coins
determining the value that X takes in room A and the coins determining the value of Y in room B. The
two have no effect on one another.
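As a sanity check, Definition 1.2 can be verified by exact enumeration on a small example. A minimal sketch in Python, assuming two independent fair dice and two arbitrary events (both are illustrative choices, not taken from the notes):

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of a fair die

# Illustrative events: E_X = "X is even", E_Y = "Y >= 5".
joint = sum(p * p for x, y in product(faces, faces)
            if x % 2 == 0 and y >= 5)
marginal_x = sum(p for x in faces if x % 2 == 0)  # Pr[X in E_X] = 1/2
marginal_y = sum(p for y in faces if y >= 5)      # Pr[Y in E_Y] = 1/3

# Independence: the joint probability factors into the product of marginals.
assert joint == marginal_x * marginal_y
```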
1.3 Expectation
Definition 1.3. The expected value of a random variable X (or the mean of X) is E[X] = ∑_x x·Pr[X = x]
in the discrete case and E[X] = ∫_x x·PDF(x) dx in the continuous case.
The mean of X is simply a weighted average of its outcomes: each outcome x gets weight Pr[X = x].
The expectation also satisfies the two following properties.
Fact 1.4 (Linearity of expectation). For any two random variables X and Y and any a, b ∈ R it holds that
E[aX + bY] = a·E[X] + b·E[Y].
Fact 1.5. For any two independent random variables X and Y it holds that E[X · Y ] = E[X] · E[Y ].
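Both facts are easy to check exactly on a small example. A sketch assuming two independent fair dice and the (arbitrary) constants a = 2, b = 3:

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 36)  # each (x, y) pair is equally likely under independence

E_X = sum(x * Fraction(1, 6) for x in faces)  # E[X] = 7/2; same for Y
E_Y = E_X

E_lin = sum((2 * x + 3 * y) * p for x, y in product(faces, faces))
E_prod = sum(x * y * p for x, y in product(faces, faces))

assert E_lin == 2 * E_X + 3 * E_Y  # Fact 1.4: E[2X + 3Y] = 2E[X] + 3E[Y]
assert E_prod == E_X * E_Y         # Fact 1.5: E[XY] = E[X]E[Y]
```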
1.4 Variance
Definition 1.6. The variance of a random variable X is
Var(X) = E[(X − E[X])²] = E[X²] − E[X]²

The standard deviation of X is sd(X) = √Var(X). The 2nd moment of X is E[X²].
The variance of X represents how likely it is for the outcomes of X to be far from the mean of X. Think,
for instance, of two random variables X and Y s.t. X takes the values 0 and 2 each w.p. 1/2, and Y takes
the value 0 w.p. 999/1000 and the value 1000 w.p. 1/1000.
Then the means of the two variables are equal: E[X] = E[Y] = 1. Yet Var(X) = 1, whereas Var(Y) = 999.
So even though on average the values of X and of Y are the same, the values of Y are far more spread out
than the values of X. The variance also satisfies the following property.
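These two variances can be computed exactly; the concrete pair of distributions below is one illustrative choice with mean 1 and variances 1 and 999:

```python
from fractions import Fraction

def mean(dist):
    """Exact mean of a discrete distribution given as {outcome: probability}."""
    return sum(x * p for x, p in dist.items())

def var(dist):
    """Exact variance: E[(X - E[X])^2]."""
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())

# X: 0 or 2, each w.p. 1/2.  Y: 0 w.p. 999/1000, 1000 w.p. 1/1000.
X = {0: Fraction(1, 2), 2: Fraction(1, 2)}
Y = {0: Fraction(999, 1000), 1000: Fraction(1, 1000)}

assert mean(X) == mean(Y) == 1        # equal means
assert var(X) == 1 and var(Y) == 999  # very different spreads
```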
Fact 1.7. For any two independent random variables X and Y and any a, b ∈ R we have that
Var(aX + bY) = a²·Var(X) + b²·Var(Y).
2 Examples
Below is a list of random variables that we will repeatedly use in this course. You are encouraged to
look for more information about these distributions (and use it in your assignments!) in books or online
(Wikipedia).
(i) Bernoulli Random Variable. A discrete random variable X is called a Bernoulli random variable,
denoted X ∼ Ber(p) if X takes only two values: {0, 1} and p = Pr[X = 1]. The expectation of a Bernoulli
random variable is E[X] = p and its variance is p − p² = p(1 − p).
Bernoulli random variables are often called indicators. With any event E we can associate a corresponding
Bernoulli r.v. X where X = 1 if E holds, and X = 0 otherwise. We denote such an indicator as 1{E} or 1_E.
(ii) Uniform [0, 1]. When X is a r.v. chosen uniformly at random (u.a.r.) from the interval [0, 1], denoted
X ∼ U[0,1], then PDF(x) = 1 for any x ∈ [0, 1] and 0 everywhere else (hence the PDF indeed
integrates to 1), and CDF(x) = x on the [0, 1] interval. The expected value of X is E[X] = 1/2 and
Var(X) = E[X²] − E[X]² = 1/3 − (1/2)² = 1/12.
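A quick numeric sanity check of these two moments, approximating the integrals with a midpoint Riemann sum:

```python
n = 100_000  # number of midpoint samples on [0, 1]
xs = [(i + 0.5) / n for i in range(n)]

mean = sum(xs) / n                   # approximates the integral of x dx = 1/2
second = sum(x * x for x in xs) / n  # approximates the integral of x^2 dx = 1/3
var = second - mean ** 2             # should be close to 1/12
```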
(iii) Exponential Random Variable. A continuous random variable X is called exponential, denoted
X ∼ Exp(λ), if its PDF is defined as PDF(x) = λe^(−λx) for any x ≥ 0. In this case CDF(x) = 1 − e^(−λx) for
any x ≥ 0, the expectation is E[X] = 1/λ and the variance is Var(X) = 1/λ².
(iv) Laplace Random Variable. A continuous random variable X is called Laplace, denoted X ∼ Lap(λ),
if its PDF is defined as PDF(x) = (λ/2)·e^(−λ|x|) for any x ∈ R. Observe that one way to sample a r.v. X ∼ Lap(λ)
is to pick Y ∼ Exp(λ) and then set X = Y w.p. 1/2 and X = −Y w.p. 1/2. The mean of a Laplace
random variable is E[X] = 0 and its variance is Var(X) = 2/λ².
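The Exp-plus-random-sign recipe above translates directly into code. A minimal sketch (λ = 1 and the seed are arbitrary choices), checking the stated mean and variance empirically:

```python
import random

def sample_laplace(lam, rng):
    """Lap(lam) via the recipe above: Exp(lam) magnitude, uniform random sign."""
    y = rng.expovariate(lam)
    return y if rng.random() < 0.5 else -y

rng = random.Random(0)  # fixed seed for reproducibility
lam = 1.0
samples = [sample_laplace(lam, rng) for _ in range(200_000)]

mean = sum(samples) / len(samples)                          # should be close to 0
var = sum((s - mean) ** 2 for s in samples) / len(samples)  # close to 2/lam^2 = 2
```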
(v) Gaussian Random Variable. A continuous random variable X is called Gaussian, denoted X ∼
N(µ, σ²), if its PDF is defined as PDF(x) = (1/√(2πσ²))·e^(−(x−µ)²/(2σ²)) for any x ∈ R. The mean of a Gaussian
random variable is E[X] = µ and its variance is Var(X) = σ². A r.v. X ∼ N(0, 1) is called a normal
random variable.
Gaussians are extremely well-studied. We know that they are closed under linear operations: given independent
X ∼ N(µ_1, σ_1²) and Y ∼ N(µ_2, σ_2²), then for any scalars a, b, c it holds that
aX + bY + c ∼ N(aµ_1 + bµ_2 + c, a²σ_1² + b²σ_2²).
Hence N(µ, σ²) = µ + σ·N(0, 1); namely, we only have to study properties of a normal Gaussian, and
they will propagate to any other Gaussian through translation (by µ) and stretch (by σ).
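The closure under linear operations is easy to check empirically. A sketch with arbitrary illustrative parameters (µ_1 = 2, σ_1 = 3, µ_2 = −1, σ_2 = 0.5, and a, b, c = 0.5, 2, 4):

```python
import random

rng = random.Random(1)  # fixed seed for reproducibility
mu1, s1 = 2.0, 3.0
mu2, s2 = -1.0, 0.5
a, b, c = 0.5, 2.0, 4.0

n = 200_000
zs = [a * rng.gauss(mu1, s1) + b * rng.gauss(mu2, s2) + c for _ in range(n)]

mean = sum(zs) / n                          # theory: a*mu1 + b*mu2 + c = 3.0
var = sum((z - mean) ** 2 for z in zs) / n  # theory: a^2*s1^2 + b^2*s2^2 = 3.25
```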
3 Basic Inequalities
Theorem 3.1 (Union Bound). For any sequence of events {E_1, E_2, . . .} we have that

Pr[∪_i E_i] ≤ ∑_i Pr[E_i]
Theorem 3.2 (Markov's Inequality). For any non-negative random variable X and any t > 0 we have
that

Pr[X > t] ≤ E[X]/t

Proof.

E[X] = ∑_x x·Pr[X = x] ≥ ∑_{x≥t} x·Pr[X = x] ≥ t·∑_{x≥t} Pr[X = x] = t·Pr[X ≥ t] ≥ t·Pr[X > t]
Theorem 3.3 (Chebyshev's Inequality). For any random variable X and any t > 0 it holds that

Pr[|X − E[X]| > t] ≤ Var(X)/t²

Proof. Let Y be the random variable defined as (X − E[X])². Then, by Markov's inequality,

Pr[|X − E[X]| > t] = Pr[(X − E[X])² > t²] = Pr[Y > t²] ≤ E[Y]/t² = Var(X)/t²
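Both inequalities can be checked exhaustively on a small discrete example; a sketch using a fair six-sided die (an illustrative choice):

```python
from fractions import Fraction

die = {x: Fraction(1, 6) for x in range(1, 7)}  # a non-negative r.v.
E = sum(x * p for x, p in die.items())             # E[X] = 7/2
V = sum(p * (x - E) ** 2 for x, p in die.items())  # Var(X) = 35/12

for t in range(1, 6):
    tail = sum(p for x, p in die.items() if x > t)
    assert tail <= E / t        # Markov: Pr[X > t] <= E[X]/t
    dev = sum(p for x, p in die.items() if abs(x - E) > t)
    assert dev <= V / t ** 2    # Chebyshev: Pr[|X - E[X]| > t] <= Var(X)/t^2
```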
4 Chernoff-Hoeffding Bounds
Imagine we toss a fair coin n (many) times. We know that w.p. 1/2 we see Heads, which means that
roughly 1/2 of our tosses are likely to come out Heads and half should come out as Tails. So, what is the
probability that we see a lot of Heads? Say, more than a (1+α)/2 fraction of the tosses are Heads?
Let X_i be the Bernoulli random variable which is 1 if the i-th coin toss comes out Heads, and let
X = ∑_{i=1}^n X_i. Then for every i we have that E[X_i] = 1/2, and by Linearity of Expectation we have that
E[X] = n/2. So now we can use Markov's Inequality and deduce that

Pr[X > (1+α)n/2] ≤ (n/2) / ((1+α)n/2) = 1/(1+α)
This bound is really loose. First of all, it is pretty close to 1. More importantly, it doesn’t improve with n.
Let us now try to bound this event using Chebyshev. Well, Var(X_i) = 1/4 for any i, and since all coin
tosses are independent, Var(X) = n/4. So we now have that

Pr[X > (1+α)n/2] ≤ Pr[|X − n/2| > αn/2] ≤ (n/4) / (α²n²/4) = 1/(α²n)
This is already much better. It means that when n = 2/α² this event happens w.p. < 1/2. Yet, what
if we want this probability to be really small? Not just 1/2 but rather 1/20,000? This means we have to
set n = 20,000/α². In general, if we want this probability to be at most β, then we need to toss the coin
1/(βα²) times.
To improve on this, we use the Chernoff-Hoeffding bounds.
Theorem 4.1. Let n > 0 be an integer and let X_1, X_2, . . . , X_n be independent identically distributed
random variables where E[X_i] = p for every i, and let S = (1/n)·∑_i X_i be their average. Then

Pr[S > (1+α)p] ≤ e^(−pnα²/3)  and  Pr[S < (1−α)p] ≤ e^(−pnα²/2)   (Chernoff)

Pr[|S − p| > α] ≤ 2e^(−2nα²)   (Hoeffding)
Going back to our example, we can use the Chernoff bound and deduce this probability is at most
e^(−nα²/6). So, if we want this probability to be at most β then we need to set n = 6 ln(1/β)/α². Ob-
serve that n now depends on log(1/β) rather than 1/β. The Hoeffding bound gives a similar result
n = O(log(1/β)/α²).¹
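To see the gap concretely, a quick sketch comparing the three sample-size bounds derived above (α = 0.1 and β = 1/20,000 are the illustrative targets from the text):

```python
import math

alpha = 0.1
beta = 1 / 20_000

n_chebyshev = 1 / (beta * alpha ** 2)                # grows like 1/beta
n_chernoff = 6 * math.log(1 / beta) / alpha ** 2     # grows like log(1/beta)
n_hoeffding = math.log(2 / beta) / (2 * alpha ** 2)  # grows like log(1/beta)

# Chebyshev needs ~2,000,000 tosses here; Chernoff/Hoeffding need only thousands or less.
```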
Suppose now that we answer k different queries, estimating the j-th query by an average X̄_j of n
independent samples. Combining the Hoeffding bound with the Union Bound gives

Pr[∃j, |X̄_j − E[X̄_j]| > α] ≤ ∑_{j=1}^k Pr[|X̄_j − E[X̄_j]| > α] ≤ 2k·e^(−2nα²)
Therefore, if we want that w.p. 1 − β all estimations of all k queries are within an error of α, it suffices
to set n = ln(2k/β)/(2α²). In other words, if we want to be 99% confident that we know the answers to all k
questions up to α accuracy, then it suffices to have a sample of size n = O(ln(k)/α²).
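The final sample-size formula is simple enough to wrap in a helper; a sketch (the query count k = 1000, α = 0.05, and β = 0.01 are illustrative choices):

```python
import math

def samples_needed(k, alpha, beta):
    """n = ln(2k/beta) / (2 alpha^2), from the union-bound argument above."""
    return math.ceil(math.log(2 * k / beta) / (2 * alpha ** 2))

n = samples_needed(1000, 0.05, 0.01)
# n grows only logarithmically in the number of queries k.
```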
This argument will recur quite frequently throughout this course. We will often shorten it by
saying "using the Chernoff bound and the union bound we get..."
¹ In general, this quadratic dependence on 1/α is unavoidable. However, if we know that p = O(α), then the Chernoff bound
outperforms the Hoeffding bound: whereas the Hoeffding bound has a dependence of α⁻², the Chernoff bound has n depending only
on 1/α.