Quick Reference of Probability
Hao Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China
zhangh0214@gmail.com
Abstract

In this note, we give a quick reference for some of the key concepts of counting, probability, and statistics. Besides, a section on the balls and bins model is provided as an exercise. For further reading, you may consult [1, 2, 3, 4, 5, 6].

1 Mathematics Basis

1.1 Propositions

Definition 1 (Implies 𝑃 ⇒ 𝑄). True exactly when 𝑃 is false or 𝑄 is true, i.e., 𝑃 ⇒ 𝑄 ≔ ¬𝑃 ∨ 𝑄.

Lemma 1. ∃𝑥, ∀𝑦. 𝑃(𝑥, 𝑦) ⇒ ∀𝑦, ∃𝑥. 𝑃(𝑥, 𝑦).

Lemma 2 (The Well Ordering Principle). Every nonempty set of nonnegative integers has a smallest element.

1.2 Sets and Binary Relations

Lemma 3. The following are properties of sets.
• 𝑆 ⊆ 𝐴 ∧ 𝑇 ⊆ 𝐵 ⇒ 𝑆 × 𝑇 ⊆ 𝐴 × 𝐵.
• DeMorgan's Law. \overline{𝐴 ∪ 𝐵} = 𝐴̄ ∩ 𝐵̄, \overline{𝐴 ∩ 𝐵} = 𝐴̄ ∪ 𝐵̄.

Definition 2 (Binary Relations). A binary relation 𝑅 ∶ 𝐴 → 𝐵 is a subset of 𝐴 × 𝐵. 𝐴 is called the domain, and 𝐵 is called the codomain. The image of a set 𝑆 ⊆ 𝐴 is the subset of 𝐵 that is related to some element in 𝑆. The range is the image of 𝐴. There are several kinds of relations.
• Function: ≤ 1 arrow out.
• Total: ≥ 1 arrow out.
• Total function: = 1 arrow out.
• One-to-one/Injective: ≤ 1 arrow in.
• Onto/Surjective: ≥ 1 arrow in.
• One-to-one correspondence/Bijective: = 1 arrow out and = 1 arrow in.

Definition 3 (Countably Infinite). An infinite set that can be put into a one-to-one correspondence with the natural numbers ℕ; otherwise, it is uncountable.

2 Counting

2.1 Summations

Lemma 4. The following are some well-known series.
• Telescoping series. ∑_{i=1}^{n−1} (a_{i+1} − a_i) = a_n − a_1.
• ∑_{i=1}^{n−1} 1/(i(i+1)) = 1 − 1/n.
• Arithmetic series. ∑_{i=1}^{n} i = n(n+1)/2 = Θ(n^2).
• Sum of squares. ∑_{i=1}^{n} i^2 = n(n+1)(2n+1)/6 = Θ(n^3).
• Sum of cubes. ∑_{i=1}^{n} i^3 = n^2(n+1)^2/4 = Θ(n^4).
• Geometric series. ∑_{i=0}^{n} x^i = (1 − x^{n+1})/(1 − x) if x ≠ 1.
• Infinite geometric series. ∑_{i=0}^{∞} x^i = 1/(1 − x) if |x| < 1.
• Harmonic series. ∑_{i=1}^{n} 1/i = log n + Θ(1).

Lemma 5 (Integration Bound). Let f ∶ ℝ^+ → ℝ^+ be a weakly increasing function,

∫_1^n f(x) dx + f(1) ≤ ∑_{i=1}^{n} f(i) ≤ ∫_1^n f(x) dx + f(n) . (1)

Lemma 6 (Stirling's Formula). ∀n ≥ 1. n! ∼ √(2πn) (n/e)^n.

Lemma 7 (Maclaurin's Theorem). f(x) = ∑_{i=0}^{∞} (1/i!) (d^i f(0)/dx^i) x^i.

Lemma 8. The following are some well-known series.
• 1/(1 − x) = ∑_{i=0}^{∞} x^i.
• exp x = ∑_{i=0}^{∞} x^i/i!.
• exp cx = ∑_{i=0}^{∞} c^i x^i/i!.
• log(1 − x) = −∑_{i=1}^{∞} x^i/i.

2.2 Cardinality Rules

Lemma 9 (Rule of Sum). If we have n_1 ways to perform action 1 and n_2 ways to perform action 2, and we cannot do both at the same time, then there are n_1 + n_2 ways to choose one of the actions.

Lemma 10 (Rule of Product). If there are n_1 ways to perform action 1 and then n_2 ways to perform action 2, then there are n_1 ⋅ n_2 ways to perform action 1 followed by action 2.
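The closed forms in Lemma 4 can be spot-checked numerically. The following Python sketch is illustrative only (the values of n, x, and the sequence a are arbitrary choices):

```python
# Spot-check a few closed forms from Lemma 4 with arbitrary inputs.
n = 100

# Arithmetic series: sum_{i=1}^{n} i = n(n+1)/2.
assert sum(range(1, n + 1)) == n * (n + 1) // 2

# Sum of squares: sum_{i=1}^{n} i^2 = n(n+1)(2n+1)/6.
assert sum(i * i for i in range(1, n + 1)) == n * (n + 1) * (2 * n + 1) // 6

# Finite geometric series: sum_{i=0}^{n} x^i = (1 - x^(n+1)) / (1 - x) for x != 1.
x = 0.5
assert abs(sum(x ** i for i in range(n + 1)) - (1 - x ** (n + 1)) / (1 - x)) < 1e-12

# Telescoping series: sum_{i=1}^{n-1} (a_{i+1} - a_i) = a_n - a_1 for any sequence.
a = [i * i + 3 for i in range(1, n + 1)]
assert sum(a[i + 1] - a[i] for i in range(n - 1)) == a[-1] - a[0]

print("all series identities hold for n =", n)
```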
Lemma 11 (Pigeonhole Principle). If |𝐴| > 𝑘|𝐵|, then for every total function 𝑓 ∶ 𝐴 → 𝐵, there exist 𝑘 + 1 different elements of 𝐴 that are mapped to the same element of 𝐵.

Lemma 12 (Inclusion-Exclusion Principle). |𝐴 ∪ 𝐵| = |𝐴| + |𝐵| − |𝐴 ∩ 𝐵|.

Definition 4 (Ordinary Generating Function). The ordinary generating function 𝐹(𝑥) for the sequence (𝑓_i)_{i=0}^{∞} is the power series 𝐹(𝑥) ≔ ∑_{i=0}^{∞} 𝑓_i 𝑥^i.

Lemma 13 (Convolution Rule). Let 𝐹(𝑥) be the generating function for selecting items from a set 𝐴, and 𝐺(𝑥) for the set 𝐵, where 𝐴 and 𝐵 are disjoint. The generating function for selecting items from the union 𝐴 ∪ 𝐵 is 𝐹(𝑥)𝐺(𝑥).

2.3 Permutations and Combinations

Definition 5 (Permutation). A permutation of a finite set is an ordered sequence of all the elements in that set, with each element appearing exactly once.

The number of ways of distributing 𝑛 objects into 𝑟 distinct bins, with 𝑘_1 objects in the first bin, 𝑘_2 objects in the second bin, and so on, is 𝑛!/(𝑘_1! 𝑘_2! ⋯ 𝑘_r!).

3 Probability Basis

3.1 Terminologies

Definition 7 (Probability). Mathematical language for quantifying uncertainty. In probability, there are a few logically self-contained rules. The random process is fully known, and the objective is to find the probability of a certain outcome. There is only one correct answer that logically follows these rules.

Definition 8 (Statistics). Using data to infer the distribution that generated the data. In statistics, the outcome is known, and we apply probability to make inferences (illuminate the unknown random process) from experimental data. There is no single correct answer.

Definition 9 (Experiment). A repeatable procedure with well-defined possible outcomes.

Definition 10 (Sample Space Ω). Set of all possible outcomes 𝜔 of an experiment.

Definition 11 (Event 𝐸). Subset of the sample space 𝐸 ⊆ Ω.

Definition 12 (Probability Function Pr). A function Pr ∶ Ω → [0, 1] giving the probability of each outcome. Besides, Pr must also satisfy ∑_{𝜔∈Ω} Pr(𝜔) = 1.

Definition 13 (Probability of an Event). The probability of an event 𝐸 is the sum of the probabilities of all the outcomes in 𝐸, i.e., Pr(𝐸) ≔ ∑_{𝜔∈𝐸} Pr(𝜔). Suppose Ω is finite and each outcome is equally likely; then Pr(𝐸) = |𝐸|/|Ω|.

Lemma 17. For two events 𝐸_1 and 𝐸_2, Pr(𝐸_1 ∪ 𝐸_2) = Pr(𝐸_1) + Pr(𝐸_2) − Pr(𝐸_1 ∩ 𝐸_2). Specifically, if 𝐸_1 and 𝐸_2 are disjoint, Pr(𝐸_1 ∪ 𝐸_2) = Pr(𝐸_1) + Pr(𝐸_2).

Lemma 18 (Law of Total Probability). For events 𝐵_1, 𝐵_2, …, 𝐵_n that partition Ω,

Pr(𝐴) = ∑_{i=1}^{n} Pr(𝐴 ∩ 𝐵_i) = ∑_{i=1}^{n} Pr(𝐴 ∣ 𝐵_i) Pr(𝐵_i) . (2)

Definition 15 (Independent 𝐴 ⟂ 𝐵). Two events 𝐴 and 𝐵 are independent if knowing that 𝐵 occurred does not change the probability that 𝐴 occurred.

𝐴 ⟂ 𝐵 ⇔ { Pr(𝐴 ∣ 𝐵) = Pr(𝐴) if Pr(𝐵) ≠ 0; Pr(𝐵 ∣ 𝐴) = Pr(𝐵) if Pr(𝐴) ≠ 0. } (3)

In other words, 𝐴 ⟂ 𝐵 ⇔ Pr(𝐴 ∩ 𝐵) = Pr(𝐴) Pr(𝐵). Note that disjoint events with positive probability are not independent.

Theorem 19 (Bayes' Theorem). For two events 𝐴 and 𝐵,

Pr(𝐵 ∣ 𝐴) = Pr(𝐴 ∣ 𝐵) Pr(𝐵) / Pr(𝐴) = Pr(𝐴 ∣ 𝐵) Pr(𝐵) / (∑_i Pr(𝐴 ∣ 𝐵_i) Pr(𝐵_i)) . (4)
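Theorem 19 can be illustrated with a small numeric example (the prevalence, sensitivity, and false-positive rate below are made up for illustration). Let B be "has a condition" and A be "a test is positive":

```python
# Hypothetical Bayes' theorem example: compute Pr(B | A) from Pr(A | B),
# using the law of total probability over the partition {B, not B}.
p_b = 0.01              # assumed prior Pr(B)
p_a_given_b = 0.99      # assumed sensitivity Pr(A | B)
p_a_given_not_b = 0.05  # assumed false-positive rate Pr(A | not B)

# Pr(A) = Pr(A | B) Pr(B) + Pr(A | not B) Pr(not B).
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 4))  # 0.1667: the posterior is far below the sensitivity
```

Even with a 99% sensitive test, the low prior keeps the posterior Pr(B ∣ A) at about 1/6.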
Table 1: Distribution functions of random variables. We allow 𝑥 to be any number. If 𝑥 is a value that 𝑋 never takes, then
𝑝(𝑥) = 0.
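The convention in Table 1 that 𝑝(𝑥) = 0 for values 𝑋 never takes is easy to mirror in code. A minimal Python sketch (the fair-die distribution is an arbitrary example):

```python
# PMF and CDF of a fair six-sided die; p(x) = 0 for values X never takes.
pmf = {outcome: 1 / 6 for outcome in range(1, 7)}

def p(x):
    """PMF; returns 0 for any x the variable never takes (cf. Table 1)."""
    return pmf.get(x, 0.0)

def cdf(x):
    """CDF F(x) = Pr(X <= x), summing the PMF over outcomes <= x."""
    return sum(prob for outcome, prob in pmf.items() if outcome <= x)

print(p(3), p(2.5), cdf(4))  # 1/6, 0.0 (an impossible value), 4/6
```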
4 Random Variables

4.1 Terminologies

Definition 16 (Discrete Random Variable 𝑋). A discrete random variable is a function 𝑋 ∶ Ω → ℝ.

For pairwise independent random variables 𝑋_1, …, 𝑋_n,

var(∑_{i=1}^{n} 𝑐_i 𝑋_i) = ∑_{i=1}^{n} 𝑐_i^2 var 𝑋_i . (8)

Definition 17 (Correlation 𝜌). The correlation between 𝑋 and 𝑌 is 𝜌 ≔ cov(𝑋, 𝑌)/(𝜎_𝑋 𝜎_𝑌) ∈ [−1, 1].
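The variance identity in Eq. (8) can be sanity-checked by Monte Carlo. This is only a sketch: the coefficients and the two independent uniform distributions are arbitrary choices.

```python
# Monte Carlo check of var(c1*X1 + c2*X2) = c1^2 var X1 + c2^2 var X2
# for independent X1 ~ Uniform(0, 1) and X2 ~ Uniform(0, 2).
import random

random.seed(0)
m = 200_000
c1, c2 = 2.0, -3.0

samples = [c1 * random.uniform(0, 1) + c2 * random.uniform(0, 2) for _ in range(m)]
mean = sum(samples) / m
var_est = sum((s - mean) ** 2 for s in samples) / (m - 1)  # sample variance

# Uniform(0, b) has variance b^2 / 12, so the true value is 4/12 + 9*4/12 = 10/3.
var_true = c1 ** 2 * (1 / 12) + c2 ** 2 * (4 / 12)
print(round(var_est, 3), round(var_true, 3))  # the two should nearly agree
```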
Table 3: Common discrete distributions.
Table 4: Common inequalities, where 𝑋̄ is the sample mean, and 𝜇 and 𝜎 2 are the mean and variance, respectively.
Lemma 23. Geometric distribution and exponential distribution are memoryless,

Pr(𝑋 = 𝑥 + 𝑥_0 ∣ 𝑋 ≥ 𝑥_0) = Pr(𝑋 = 𝑥) . (12)

Definition 20 (Multivariate Gaussian Distribution). 𝑿 ∼ 𝒩(𝝁, 𝚺) if

𝑝(𝒙) = (1/√((2𝜋)^d det 𝚺)) exp(−(1/2) (𝒙 − 𝝁)^⊤ 𝚺^{−1} (𝒙 − 𝝁)) . (13)

Definition 21 (Multinomial Distribution). 𝑿 ∼ Mult(𝑛, 𝒑) if

𝑝(𝒙) = \binom{𝑛}{𝑥_1, 𝑥_2, …, 𝑥_K} ∏_{i=1}^{K} 𝑝_i^{𝑥_i} . (14)

It models drawing 𝑛 balls from an urn with 𝐾 different colors. 𝑝_i is the probability of drawing color 𝑖, and 𝑥_i is the count of the number of balls of color 𝑖. \binom{𝑛}{𝑥_1, 𝑥_2, …, 𝑥_K} ≔ 𝑛!/∏_{i=1}^{K} 𝑥_i! is also called the multinomial coefficient.

Lemma 24 (Multinomial Expansion).

(∑_{i=1}^{r} 𝑎_i)^n = ∑_{𝑥_1+⋯+𝑥_r=𝑛} \binom{𝑛}{𝑥_1, 𝑥_2, …, 𝑥_r} ∏_{i=1}^{r} 𝑎_i^{𝑥_i} . (15)

5 Inequalities and Convergence

5.1 Inequalities

Table 4 summarizes common inequalities.

Theorem 25 (Cauchy-Schwarz Inequality).

𝔼[|𝑋𝑌|] ≤ √(𝔼[𝑋^2] 𝔼[𝑌^2]) . (16)

Theorem 26 (Jensen's Inequality). If 𝑓 is convex,

𝔼[𝑓(𝑋)] ≥ 𝑓(𝔼[𝑋]) . (17)

5.2 Convergence of Random Variables

Theorem 27 (Law of Large Numbers). Suppose 𝑋_1, 𝑋_2, …, 𝑋_m, … are i.i.d. random variables with mean 𝜇 and variance 𝜎^2. Let 𝑋̄_m ≔ (1/𝑚) ∑_{i=1}^{m} 𝑋_i be the sample mean. Then

∀𝜀 > 0. lim_{𝑚→∞} Pr(|𝑋̄_m − 𝜇| < 𝜀) = 1 . (18)

In other words, the sample mean is very close to the true mean of the distribution with high probability.

Corollary 28 (Law of Large Numbers for Histograms). With high probability, the density histogram of a large number of samples from a distribution is a good approximation of the graph of the underlying PDF 𝑝.
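Theorem 27 can be watched in action with a short simulation (an illustrative sketch using fair-die rolls, whose true mean is 3.5):

```python
# Law of Large Numbers demo: the sample mean of fair-die rolls
# concentrates around the true mean mu = 3.5 as m grows.
import random

random.seed(1)
mu = 3.5

for m in (100, 10_000, 1_000_000):
    xbar = sum(random.randint(1, 6) for _ in range(m)) / m
    # |xbar - mu| is typically on the order of sigma / sqrt(m).
    print(m, round(abs(xbar - mu), 4))
```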
Theorem 29 (Central Limit Theorem). Suppose 𝑋_1, 𝑋_2, …, 𝑋_m, … are i.i.d. random variables with mean 𝜇 and variance 𝜎^2. Let 𝑋̄_m ≔ (1/𝑚) ∑_{i=1}^{m} 𝑋_i be the sample mean. Then for large 𝑚,

𝑋̄_m ∼ 𝒩(𝜇, 𝜎^2/𝑚) , (19)
(𝑋̄_m − 𝜇)/(𝜎/√𝑚) ∼ 𝒩(0, 1) . (20)

6 Standard Model: Balls and Bins

6.1 Distinct Balls

Consider the process that we randomly toss 𝑛 distinct balls into 𝑘 bins.

Lemma 30. Suppose the order within a bin does not matter. The number of ways of placing the balls into bins is 𝑘^𝑛.

Proof. There are 𝑘 ways to place each of the 𝑛 balls, respectively.

Lemma 31. Suppose the balls in each bin are ordered. The number of ways of placing the balls into bins is (𝑛+𝑘−1)!/(𝑘−1)!.

Proof. The number of ways of arranging 𝑛 balls and 𝑘 − 1 sticks (as separators) in a row is (𝑛 + 𝑘 − 1)!. Since sticks are indistinguishable, (𝑘 − 1)! of the arrangements are duplicated. Thus the answer is (𝑛+𝑘−1)!/(𝑘−1)!.

Lemma 32 (Birthday Paradox). The probability that at least two balls are tossed into the same bin is ≥ 1 − exp(−𝑛(𝑛−1)/(2𝑘)).

Proof. By looking at the complementary event. The probability that all balls are tossed into different bins is

\binom{𝑘}{𝑛} 𝑛!/𝑘^𝑛 = 𝑘(𝑘 − 1)(𝑘 − 2) ⋯ (𝑘 − 𝑛 + 1)/𝑘^𝑛
= (1 − 0/𝑘)(1 − 1/𝑘)(1 − 2/𝑘) ⋯ (1 − (𝑛−1)/𝑘)
< exp(0) exp(−1/𝑘) exp(−2/𝑘) ⋯ exp(−(𝑛−1)/𝑘)
= exp(−∑_{i=0}^{𝑛−1} 𝑖/𝑘)
= exp(−𝑛(𝑛−1)/(2𝑘)) . (21)

When 𝑛 = √(2𝑘), it approximates 1/e.

Lemma 33. The expected number of pairs of balls within the same bin is 𝑛(𝑛−1)/(2𝑘).

Proof. Let 𝑋_{ij} = 𝕀(balls 𝑖 and 𝑗 are tossed into the same bin). Then

𝔼[∑_{i=1}^{𝑛−1} ∑_{j=i+1}^{𝑛} 𝑋_{ij}] = ∑_{i=1}^{𝑛−1} ∑_{j=i+1}^{𝑛} 𝔼[𝑋_{ij}] = \binom{𝑛}{2} (1/𝑘) = 𝑛(𝑛−1)/(2𝑘) . (22)

6.2 Identical Balls

Consider the process that we randomly toss 𝑛 identical balls into 𝑘 bins.

Lemma 34. The number of ways of placing the balls into bins is \binom{𝑛+𝑘−1}{𝑛}.

Proof. If balls are distinct, there are (𝑛+𝑘−1)!/(𝑘−1)! ways. Since balls are indistinguishable, 𝑛! of the arrangements are duplicated. Thus the answer is (𝑛+𝑘−1)!/((𝑘−1)! 𝑛!) = \binom{𝑛+𝑘−1}{𝑛}.

Lemma 35. Suppose no bin may contain more than one ball, so that 𝑛 ≤ 𝑘. The number of ways of placing the balls is \binom{𝑘}{𝑛}.

Proof. Select 𝑛 out of the 𝑘 bins to put balls in.

Lemma 36. Suppose no bin may be left empty. Assuming that 𝑛 ≥ 𝑘, the number of ways of placing the balls is \binom{𝑛−1}{𝑘−1}.

Proof. First, we put a ball in each bin. Then we use the result of Lemma 34: \binom{𝑛−𝑘+𝑘−1}{𝑛−𝑘} = \binom{𝑛−1}{𝑛−𝑘} = \binom{𝑛−1}{𝑘−1}.

Lemma 37. The expected number of balls that fall in a given bin is 𝑛/𝑘.

Proof. It follows a binomial distribution Bin(𝑛, 1/𝑘).

Lemma 38. The expected number of balls we toss until a given bin contains a ball is 𝑘.

Proof. It follows a geometric distribution Geo(1/𝑘).

Lemma 39 (Coupon Collector's Problem). The expected number of balls we toss until every bin contains at least one ball is ∼ 𝑘 log 𝑘.

Proof. Partition the tosses into 𝑘 stages. The 𝑖-th stage consists of the tosses after a ball first lands in the (𝑖 − 1)-th distinct bin, until a ball first lands in the 𝑖-th distinct bin. During the 𝑖-th stage, 𝑖 − 1 bins contain balls and 𝑘 − 𝑖 + 1 bins are empty. It follows a geometric distribution Geo((𝑘−𝑖+1)/𝑘). The expected number of tosses is 𝑘/(𝑘−𝑖+1). The overall expected number of tosses is

∑_{i=1}^{𝑘} 𝑘/(𝑘−𝑖+1) = 𝑘 ∑_{i=1}^{𝑘} 1/𝑖 ∼ 𝑘 log 𝑘 . (23)
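The ∼ 𝑘 log 𝑘 bound of Lemma 39 matches simulation closely. The following sketch (the bin count and trial count are arbitrary choices) compares the empirical average with the exact expectation 𝑘 ∑_{i=1}^{𝑘} 1/𝑖:

```python
# Coupon collector simulation: average tosses until all k bins are nonempty,
# compared against the exact expectation k * H_k (which is ~ k log k).
import random

random.seed(2)

def tosses_until_full(k):
    """Toss balls into k bins uniformly until every bin has at least one."""
    seen, count = set(), 0
    while len(seen) < k:
        seen.add(random.randrange(k))
        count += 1
    return count

k, trials = 50, 2000
avg = sum(tosses_until_full(k) for _ in range(trials)) / trials
harmonic = sum(1 / i for i in range(1, k + 1))
print(round(avg, 1), round(k * harmonic, 1))  # empirical vs. exact expectation
```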
7 Statistics Basis

7.1 Terminologies

Definition 22 (Point Estimation 𝜃̂). Provide a single "best guess" of some quantity 𝜃 of interest. We denote the bias of 𝜃̂ as 𝔼[𝜃̂] − 𝜃. We say that 𝜃̂ is unbiased if 𝔼[𝜃̂] = 𝜃, and consistent if ∀𝜀 > 0. lim_{𝑚→∞} Pr(|𝜃̂ − 𝜃| < 𝜀) = 1. We denote the mean squared error of 𝜃̂ as 𝔼[(𝜃̂ − 𝜃)^2].

Definition 23 (Sample Mean 𝑋̄). 𝑋̄ ≔ (1/𝑚) ∑_{i=1}^{𝑚} 𝑋_i. The sample mean is unbiased, 𝔼[𝑋̄] = 𝜇, and var 𝑋̄ = 𝜎^2/𝑚.

Definition 24 (Sample Variance 𝑠^2). 𝑠^2 ≔ (1/(𝑚−1)) ∑_{i=1}^{𝑚} (𝑋_i − 𝑋̄)^2. The sample variance is unbiased, 𝔼[𝑠^2] = 𝜎^2.

Definition 25 (Maximum Likelihood Estimation). The maximum likelihood estimate of 𝜃 maximizes the likelihood of the observed data,

𝜃̂ = arg max_𝜃 ∏_{i=1}^{𝑚} 𝑝(𝑋_i; 𝜃) = arg max_𝜃 ∑_{i=1}^{𝑚} log 𝑝(𝑋_i; 𝜃) . (24)

References

[1] G. Casella and R. L. Berger. Statistical Inference, volume 2. Duxbury, Pacific Grove, CA, 2002.
[2] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, 3rd Edition. MIT Press, 2009.
[3] R. Hogg, J. McKean, and A. Craig. Introduction to Mathematical Statistics, 7th Edition. Pearson Press, 2012.
[4] R. J. Larsen and M. L. Marx. Introduction to Mathematical Statistics and Its Applications: Pearson New International Edition. Pearson Higher Ed, 2013.
[5] E. Lehman, T. Leighton, and A. Meyer. Mathematics for Computer Science. MIT, 2017.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, 2013.