
Engineering 83908 – Differential Privacy

Probability: Background Material

1 Basic Definitions
1.1 Probability Space and Random Variables
Definition 1.1. A probability space is a triplet (Ω, E, µ) where Ω is the set of potential outcomes, E is
a set of events ([open] subsets of Ω) and µ : E → [0, 1] is a function assigning probability mass to any
E ∈ E. A random variable X over a set of outcomes O is a mapping X : Ω → O.
We often think of the probability space as a random variable X where the chosen outcome is the
(randomly chosen) value of X. In this course we will only consider two cases: when X is discrete (and
typically finite), and when X is a continuous real random variable. When X is discrete then µ essentially
assigns a probability mass µ(x) = Pr[X = x] to each outcome x of X s.t. ∑_x Pr[X = x] = 1. When X is
continuous then (in all cases we will consider) there is no probability mass at a single outcome, but rather
we look at the cumulative distribution function (CDF) which maps any outcome x to the probability
Pr[X < x]. The derivative of the CDF is the non-negative probability density function (PDF), and so
∫_R PDF(x)dx = 1.

1.2 Independence
Definition 1.2. Two random variables X and Y are independent if for any two events EX and EY it
holds that Pr[X ∈ EX and Y ∈ EY ] = Pr[X ∈ EX ] · Pr[Y ∈ EY ].
In this course we will think of independence as two separate random coin tosses — we toss the coins
determining the value that X takes in room A and the coins determining the value of Y in room B. The
two have no effect on one another.

1.3 Expectation

Definition 1.3. The expected value of a random variable X (or the mean of X) is E[X] = ∑_x x·Pr[X = x]
in the discrete case and E[X] = ∫_x x·PDF(x)dx in the continuous case.

The mean of X is simply a weighted average of its outcomes: each outcome x gets weight Pr[X = x].
The expectation also satisfies the two following properties.
Fact 1.4 (Linearity of expectation). For any two random variables X and Y and any a, b ∈ R it holds that

E[aX + bY ] = aE[X] + bE[Y ]

Fact 1.5. For any two independent random variables X and Y it holds that E[X · Y ] = E[X] · E[Y ].
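Both facts are easy to check numerically. Below is a small Monte Carlo sketch using Python's standard library; the particular distributions (a four-sided die and a fair coin) and the constants a = 2, b = 3 are arbitrary choices for illustration:

```python
import random

random.seed(0)
N = 200_000

# X uniform on {0, 1, 2, 3}; Y an independent fair coin
xs = [random.randrange(4) for _ in range(N)]
ys = [random.randrange(2) for _ in range(N)]

def mean(v):
    return sum(v) / len(v)

# Fact 1.4 (linearity): E[2X + 3Y] = 2E[X] + 3E[Y]. The identity holds
# sample-by-sample, so the empirical means agree up to float rounding.
lhs = mean([2 * x + 3 * y for x, y in zip(xs, ys)])
rhs = 2 * mean(xs) + 3 * mean(ys)
assert abs(lhs - rhs) < 1e-9

# Fact 1.5: E[X * Y] is close to E[X] * E[Y], up to sampling error
assert abs(mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)) < 0.01
```

Note that linearity holds for any pair of variables, dependent or not, while the product rule genuinely needs independence.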

1.4 Variance
Definition 1.6. The variance of a random variable X is
Var(X) = E[(X − E[X])²] = E[X²] − E[X]²

The standard deviation of X is sd(X) = √Var(X). The 2nd moment of X is E[X²].

The variance of X represents how likely it is for the outcomes of X to be far from the mean of X. Think
of the following two random variables, X and Y s.t.

Pr[X = 0] = 1/2, Pr[X = 2] = 1/2 ; Pr[Y = 0] = 0.999, Pr[Y = 1000] = 0.001

Then the means of the two variables are equal: E[X] = E[Y ] = 1. Yet Var[X] = 1, whereas Var[Y ] = 999.
So even though on average, the value of X and of Y are the same, the values of Y are far more spread-out
than the values of X. The variance also satisfies the following property.
Fact 1.7. For any two independent random variables X and Y and any a, b ∈ R we have that

Var[aX + bY ] = a²Var[X] + b²Var[Y ]
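Both the example above and Fact 1.7 can be verified numerically; a Python sketch (the coefficients a = 2, b = 3 and the fair-coin distributions in the second half are arbitrary choices):

```python
import random

# Exact check of the example; each dict maps outcome -> probability
def mean(d):
    return sum(x * p for x, p in d.items())

def var(d):
    m = mean(d)
    return sum((x - m) ** 2 * p for x, p in d.items())

X = {0: 0.5, 2: 0.5}
Y = {0: 0.999, 1000: 0.001}
assert mean(X) == 1.0 and abs(mean(Y) - 1.0) < 1e-9   # equal means
assert var(X) == 1.0 and abs(var(Y) - 999.0) < 1e-9   # Var[X]=1, Var[Y]=999

# Monte Carlo check of Fact 1.7 with independent fair coins and a, b = 2, 3:
# Var[2A + 3B] should be 4*(1/4) + 9*(1/4) = 3.25
random.seed(0)
N = 200_000
zs = [2 * random.randrange(2) + 3 * random.randrange(2) for _ in range(N)]
m = sum(zs) / N
v = sum((z - m) ** 2 for z in zs) / N
assert abs(v - 3.25) < 0.05
```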

2 Examples
Below is a list of random variables that we will repeatedly use in this course. You are encouraged to
look for more information about these distributions (and use it in your assignments!) in books or online
(Wikipedia).

(i) Bernoulli Random Variable. A discrete random variable X is called a Bernoulli random variable,
denoted X ∼ Ber(p), if X takes only two values, {0, 1}, and p = Pr[X = 1]. The expectation of a Bernoulli
random variable is E[X] = p and its variance is p − p².
Bernoulli random variables are often called indicators. For any event E we can associate a corresponding
Bernoulli r.v. where X = 1 if E holds, and X = 0 otherwise. We denote such indicators as 1_E or 1{E}.

(ii) Uniform [0, 1]. When X is a r.v. chosen uniformly at random (u.a.r.) from the interval [0, 1], denoted
X ∼ U[0,1], then for any x ∈ [0, 1] we have that PDF(x) = 1 and 0 everywhere else (hence the PDF indeed
integrates to 1) and CDF(x) = x on the [0, 1] interval. The expected value of X is E[X] = 1/2 and
Var[X] = 1/3 − (1/2)² = 1/12.

(iii) Exponential Random Variable. A continuous random variable X is called exponential, denoted
X ∼ Exp(λ), if its PDF is defined as: PDF(x) = λe^{−λx} for any x ≥ 0. In this case CDF_X(x) = 1 − e^{−λx} for
any x ≥ 0, the expectation is E[X] = 1/λ and the variance is Var(X) = 1/λ².
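A standard way to sample an exponential variable, not stated above but implied by the CDF, is inverse-CDF sampling: if U ∼ U[0,1] then −ln(U)/λ ∼ Exp(λ), since Pr[−ln(U)/λ ≤ x] = Pr[U ≥ e^{−λx}] = 1 − e^{−λx}. A minimal Python sketch (λ = 1.5 is an arbitrary choice):

```python
import math
import random

random.seed(6)
lam = 1.5

# Inverse-CDF sampling: U ~ U[0,1]  =>  -ln(U)/lam ~ Exp(lam).
# (1.0 - random.random() lies in (0, 1], avoiding log(0).)
samples = [-math.log(1.0 - random.random()) / lam for _ in range(200_000)]

m = sum(samples) / len(samples)
v = sum((s - m) ** 2 for s in samples) / len(samples)

assert abs(m - 1 / lam) < 0.01       # E[X] = 1/lam
assert abs(v - 1 / lam ** 2) < 0.01  # Var(X) = 1/lam^2
```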

(iv) Laplace Random Variable. A continuous random variable X is called Laplace, denoted X ∼ Lap(λ),
if its PDF is defined as: PDF(x) = (λ/2)e^{−λ|x|} for any x ∈ R. Observe that one way to sample a r.v. X ∼ Lap(λ)
is to pick Y ∼ Exp(λ) and then set X = Y w.p. 1/2 and X = −Y w.p. 1/2. The mean of a Laplace
random variable is E[X] = 0 and its variance is Var(X) = 2/λ².
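The sampling recipe above (pick Y ∼ Exp(λ), then flip a fair sign) can be sketched directly in Python; λ = 2 is an arbitrary choice:

```python
import random

def sample_laplace(lam, rng=random):
    """Sample X ~ Lap(lam) by flipping a fair sign on Y ~ Exp(lam)."""
    y = rng.expovariate(lam)              # Y ~ Exp(lam), PDF lam*e^(-lam*y)
    return y if rng.random() < 0.5 else -y

random.seed(1)
lam = 2.0
samples = [sample_laplace(lam) for _ in range(200_000)]

m = sum(samples) / len(samples)
v = sum((s - m) ** 2 for s in samples) / len(samples)

assert abs(m) < 0.05                  # E[X] = 0
assert abs(v - 2 / lam ** 2) < 0.05   # Var(X) = 2/lam^2 = 0.5
```

This construction is exactly the noise source behind the Laplace mechanism later in the course, which is why it is worth internalizing.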

(v) Gaussian Random Variable. A continuous random variable X is called Gaussian, denoted X ∼
N(µ, σ²), if its PDF is defined as: PDF(x) = (1/√(2πσ²))·e^{−(x−µ)²/(2σ²)} for any x ∈ R. The mean of a Gaussian
random variable is E[X] = µ and its variance is Var(X) = σ². A r.v. X ∼ N(0, 1) is called a normal
random variable.
Gaussians are extremely well-studied. We know that they are closed under linear operations: given
independent X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²) then for any scalars a, b, c it holds that

aX + bY + c ∼ N(aµ1 + bµ2 + c, a²σ1² + b²σ2²)

Hence N (µ, σ 2 ) = µ + σ · N (0, 1) — namely, we only have to study properties of a normal Gaussian, and
they will propagate to any other Gaussian through translation (by µ) and stretch (by σ).
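This standardization is also how one samples an arbitrary Gaussian in practice: draw a standard normal and apply the translation and stretch. A quick sketch (µ = 3, σ = 2 are arbitrary choices):

```python
import random

random.seed(2)
mu, sigma = 3.0, 2.0

# Sample N(mu, sigma^2) by translating and stretching a standard normal
samples = [mu + sigma * random.gauss(0, 1) for _ in range(200_000)]

m = sum(samples) / len(samples)
v = sum((s - m) ** 2 for s in samples) / len(samples)

assert abs(m - mu) < 0.05            # E[X] = mu
assert abs(v - sigma ** 2) < 0.1     # Var(X) = sigma^2 = 4
```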

3 Basic Inequalities
Theorem 3.1 (Union Bound). For any sequence of events {E1, E2, . . .} we have that

Pr[∪_i E_i] ≤ ∑_i Pr[E_i]

where equality holds iff all events are pair-wise disjoint.


For example, let X ∼ Lap(λ) and fix t > 0. What is the probability Pr[|X| > t]? This
event is the union of two events, X < −t and X > t, which are clearly disjoint. Therefore,

Pr[|X| > t] = Pr[X < −t] + Pr[X > t] = ∫_{−∞}^{−t} (λ/2)e^{λx} dx + ∫_t^∞ (λ/2)e^{−λx} dx = (λ/2) · 2 · (1/λ)e^{−λt} = e^{−λt}
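This closed-form tail is easy to check by simulation: since |X| for X ∼ Lap(λ) has the same distribution as Exp(λ) (by the sign-flip construction of Section 2), we can simulate the tail directly. A sketch with λ = 1, t = 2:

```python
import math
import random

random.seed(3)
lam, t = 1.0, 2.0

# |X| for X ~ Lap(lam) is distributed as Exp(lam), so simulate |X| directly
n = 200_000
hits = sum(1 for _ in range(n) if random.expovariate(lam) > t)

# Empirical Pr[|X| > t] should match e^{-lam*t} up to sampling error
assert abs(hits / n - math.exp(-lam * t)) < 0.01
```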

Theorem 3.2 (Markov's Inequality). For any non-negative random variable X and any t > 0 we have
that

Pr[X > t] ≤ E[X]/t

Proof.

E[X] = ∑_x x·Pr[X = x] ≥ ∑_{x≥t} x·Pr[X = x] ≥ t·∑_{x≥t} Pr[X = x] = t·Pr[X ≥ t] ≥ t·Pr[X > t]

where the first inequality holds because of the non-negativity of X.

Theorem 3.3 (Chebyshev's Inequality). For any random variable X and any t > 0 it holds that

Pr[|X − E[X]| > t] ≤ Var(X)/t²

Proof. Let Y be the random variable defined as (X − E[X])². Then

Pr[|X − E[X]| > t] = Pr[(X − E[X])² > t²] = Pr[Y > t²] ≤ E[Y]/t² = Var(X)/t²

4 Chernoff-Hoeffding Bounds
Imagine we toss a fair coin n (many) times. We know that w.p. 1/2 we see Heads, which means that
roughly 1/2 of our tosses are likely to come out Heads and half should come out as Tails. So, what is the
probability that we see a lot of Heads? Say, more than a (1+α)/2 fraction of the tosses are Heads?
Let X_i be the Bernoulli random variable which is 1 if the i-th coin toss comes out Heads. Let
X = ∑_{i=1}^n X_i. Then for every i we have that E[X_i] = 1/2 and because of Linearity of Expectation we have that
E[X] = n/2. So now we can use Markov's Inequality and deduce that

Pr[X > (1+α)n/2] ≤ (n/2) / ((1+α)n/2) = 1/(1+α)
This bound is really loose. First of all, it is pretty close to 1. More importantly, it doesn’t improve with n.
Let us now try to bound this event using Chebyshev. Well, Var[X_i] = 1/4 for any i and since all coin
tosses are independent then Var(X) = n/4. So we now know that

Pr[X > (1+α)n/2] ≤ Pr[|X − n/2| > αn/2] ≤ (n/4) / (α²n²/4) = 1/(α²n)

This is already much better. It means that when n = 2/α² then this event happens w.p. < 1/2. Yet, what
if we want this probability to be really small? Not just 1/2 but rather 1/20,000? This means we have to
set n = 20,000/α². In general, if we want this probability to be at most β, then we need to toss the coin
1/(βα²) times.
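To see the gap between the two bounds concretely, here is a sketch comparing Markov, Chebyshev, and the empirical tail frequency for fair coin tosses (n = 1000, α = 0.2, and the number of trials are arbitrary choices):

```python
import random

random.seed(4)
n, alpha, trials = 1000, 0.2, 2_000

markov = 1 / (1 + alpha)          # 1/(1+alpha), close to 1
chebyshev = 1 / (alpha ** 2 * n)  # 1/(alpha^2 n), shrinks with n

# Empirical frequency of more than a (1+alpha)/2 fraction of Heads
threshold = (1 + alpha) * n / 2
empirical = sum(
    1 for _ in range(trials)
    if sum(random.randrange(2) for _ in range(n)) > threshold
) / trials

# Both bounds are valid, but Chebyshev is far tighter for large n
assert empirical <= chebyshev <= markov
```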
To improve on this, we use the Chernoff-Hoeffding bounds.

Theorem 4.1. Let n > 0 be an integer and let X1, X2, . . . , Xn be independent identically distributed
random variables where E[Xi] = p for every i, and let S = (1/n)∑_i Xi be their average. Then

(Hoeffding:) Pr[S > p + α] ≤ e^{−2nα²} and Pr[S < p − α] ≤ e^{−2nα²}

and

(Chernoff:) Pr[S > (1 + α)p] ≤ e^{−npα²/3} and Pr[S < (1 − α)p] ≤ e^{−npα²/2}

Going back to our example, we can use the Chernoff bound and deduce this probability is at most
e^{−nα²/6}. So, if we want this probability to be at most β then we need to set n = 6 ln(1/β)/α². Ob-
serve that n now depends on log(1/β) rather than 1/β. The Hoeffding bound gives a similar result
n = O(log(1/β)/α²).¹
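As a sanity check, the Hoeffding bound can be compared against simulated coin tosses (n = 100 and α = 0.2 are arbitrary choices):

```python
import math
import random

random.seed(5)
n, alpha, trials = 100, 0.2, 5_000

hoeffding = math.exp(-2 * n * alpha ** 2)     # e^{-2 n alpha^2} = e^{-8}

# Empirical frequency of S > 1/2 + alpha for the average of n fair coins
empirical = sum(
    1 for _ in range(trials)
    if sum(random.randrange(2) for _ in range(n)) / n > 0.5 + alpha
) / trials

# The bound holds (up to sampling noise in the empirical estimate)
assert empirical <= hoeffding + 0.01
```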

4.1 Multiple Estimations


Question: Preparing for the upcoming elections, we are conducting a phone survey, and we ask randomly
chosen people whether they are pro or con k different current issues. How many people do we need to
survey in order to know that the true answer is within, say, a 5% error for any query? That is, there does
not exist even a single query on which our estimation is more than 5% off.
We formalize this problem as follows. Let n denote the size of our survey and for any j ∈ {1, 2, . . . , k}
we define X_i^j, which is a Bernoulli r.v. indicating whether the i-th person supports the j-th issue. Let
X^j = (1/n)∑_i X_i^j. Note, E[X^j] = E[X_i^j] is precisely the fraction of the population that supports the j-th issue.
Observe that since we pick the survey participants randomly, then for any j it holds that {X_1^j, X_2^j, . . . , X_n^j}
are all mutually independent. (Do note however that X_i^{j1} and X_i^{j2} are not independent, since it is the same
person answering both questions.)
Our goal is to lower-bound the probability Pr[∀j, |X^j − E[X^j]| ≤ α], which is equivalent to upper-
bounding the probability Pr[∃j, |X^j − E[X^j]| > α]. That is, we want to have it so that no question has a
bad estimation. Note that we can't directly use Chernoff, because not all events are independent. Instead,
we can use the following argument.
Fix j. Now the events are independent and we can use the Hoeffding inequality to deduce that for one
issue Pr[|X^j − E[X^j]| > α] < 2e^{−2nα²}. The next step is to use the Union Bound: if there exists a
j s.t. |X^j − E[X^j]| > α then this j is either 1, or 2, or 3, ..., or k. So

Pr[∃j, |X^j − E[X^j]| > α] ≤ ∑_{j=1}^k Pr[|X^j − E[X^j]| > α] ≤ 2ke^{−2nα²}

Therefore, if we want that w.p. 1 − β all estimations to all k queries are within an error of α, it suffices
to set n = ln(2k/β)/(2α²). In other words, if we want to be 99% confident that we know the answer to all k
questions up to α accuracy, then it suffices to have a sample of size n = Ω(ln(k)/α²).
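The bound translates into a concrete sample-size calculation; a sketch (the values of k, α, β below are illustrative):

```python
import math

# Want: w.p. 1 - beta, all k estimates are within error alpha
k, alpha, beta = 20, 0.05, 0.01

# Sample size from the union-bound argument: n = ln(2k/beta) / (2 alpha^2)
n = math.ceil(math.log(2 * k / beta) / (2 * alpha ** 2))

# Doubling k adds at most ln(2)/(2 alpha^2) more samples: logarithmic growth
n2 = math.ceil(math.log(2 * (2 * k) / beta) / (2 * alpha ** 2))
assert n2 - n <= math.ceil(math.log(2) / (2 * alpha ** 2))
```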
This argument will recur quite frequently throughout this course. We will often shorten it by saying
"using the Chernoff bound and the union bound we get..."

¹ In general, this quadratic dependence on 1/α is unavoidable. However, if we know that p = O(α) then the Chernoff bound
outperforms the Hoeffding bound: whereas the Hoeffding bound has a dependence of α⁻², the Chernoff bound yields n depending only
on 1/α.
