
Quick Reference of Probability

Hao Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China
zhangh0214@gmail.com

Abstract

In this note, we give a quick reference for some of the key concepts of counting, probability, and statistics. In addition, a section on the balls and bins model is provided as an exercise. For further reading, you may consult [1, 2, 3, 4, 5, 6].

1 Mathematics Basis

1.1 Propositions

Definition 1 (Implies P ⇒ Q). True exactly when P is false or Q is true, i.e., P ⇒ Q := ¬P ∨ Q.

Lemma 1. ∃x, ∀y. P(x, y) ⇒ ∀y, ∃x. P(x, y).

Lemma 2 (The Well Ordering Principle). Every nonempty set of nonnegative integers has a smallest element.

1.2 Sets and Binary Relations

Lemma 3. The following are properties of sets.
• S ⊆ A ∧ T ⊆ B ⇒ S × T ⊆ A × B.
• De Morgan's law. $\overline{A \cup B} = \bar{A} \cap \bar{B}$, $\overline{A \cap B} = \bar{A} \cup \bar{B}$.

Definition 2 (Binary Relations). A binary relation R : A → B is a subset of A × B. A is called the domain, and B is called the codomain. The image of a set S ⊆ A is the subset of B that is related to some element in S. The range is the image of A. There are several kinds of relations.
• Function: ≤ 1 arrow out.
• Total: ≥ 1 arrow out.
• Total function: = 1 arrow out.
• One-to-one/Injective: ≤ 1 arrow in.
• Onto/Surjective: ≥ 1 arrow in.
• One-to-one correspondence/Bijective: = 1 arrow out and = 1 arrow in.

Definition 3 (Countably Infinite). An infinite set that can be put into a one-to-one correspondence with the natural numbers ℕ is countably infinite; otherwise, it is uncountable.

2 Counting

2.1 Summations

Lemma 4. The following are some well-known series.
• Telescoping series. $\sum_{i=1}^{n-1} (a_{i+1} - a_i) = a_n - a_1$.
• $\sum_{i=1}^{n-1} \frac{1}{i(i+1)} = 1 - \frac{1}{n}$.
• Arithmetic series. $\sum_{i=1}^{n} i = \frac{n(n+1)}{2} = \Theta(n^2)$.
• Sum of squares. $\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6} = \Theta(n^3)$.
• Sum of cubes. $\sum_{i=1}^{n} i^3 = \frac{n^2(n+1)^2}{4} = \Theta(n^4)$.
• Geometric series. $\sum_{i=0}^{n} x^i = \frac{1 - x^{n+1}}{1 - x}$ if $x \ne 1$.
• Infinite geometric series. $\sum_{i=0}^{\infty} x^i = \frac{1}{1 - x}$ if $|x| < 1$.
• Harmonic series. $\sum_{i=1}^{n} \frac{1}{i} = \log n + \Theta(1)$.

Lemma 5 (Integration Bound). Let $f : \mathbb{R}^+ \to \mathbb{R}^+$ be a weakly increasing function,
$$\int_1^n f(x) \,\mathrm{d}x + f(1) \le \sum_{i=1}^{n} f(i) \le \int_1^n f(x) \,\mathrm{d}x + f(n) . \quad (1)$$

Lemma 6 (Stirling's Formula). $\forall n \ge 1.\ n! \sim \sqrt{2\pi n} \left(\frac{n}{e}\right)^n$.

Lemma 7 (Maclaurin's Theorem). $f(x) = \sum_{i=0}^{\infty} \frac{1}{i!} \frac{\mathrm{d}^i f(0)}{\mathrm{d}x^i} x^i$.

Lemma 8. The following are some well-known series.
• $\frac{1}{1-x} = \sum_{i=0}^{\infty} x^i$.
• $\exp x = \sum_{i=0}^{\infty} \frac{1}{i!} x^i$.
• $\exp cx = \sum_{i=0}^{\infty} \frac{c^i}{i!} x^i$.
• $\log(1-x) = -\sum_{i=1}^{\infty} \frac{1}{i} x^i$.

2.2 Cardinality Rules

Lemma 9 (Rule of Sum). If we have $n_1$ ways to perform action 1 and $n_2$ ways to perform action 2, and we cannot do both at the same time, then there are $n_1 + n_2$ ways to choose one of the actions.

Lemma 10 (Rule of Product). If there are $n_1$ ways to perform action 1 and then $n_2$ ways to perform action 2, then there are $n_1 \cdot n_2$ ways to perform action 1 followed by action 2.
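As a quick sanity check of the two counting rules, the following minimal Python sketch enumerates the choices explicitly (the menu items are made-up examples, not from the text):

```python
from itertools import product

# Rule of sum: choose one action from two mutually exclusive groups.
starters = ["soup", "salad"]          # n1 = 2 ways
desserts = ["cake", "pie", "fruit"]   # n2 = 3 ways
assert len(starters) + len(desserts) == 5  # n1 + n2 ways to pick a single item

# Rule of product: perform action 1, then action 2.
meals = list(product(starters, desserts))
assert len(meals) == len(starters) * len(desserts)  # n1 * n2 = 6 ordered pairs
print(meals)
```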

Lemma 11 (Pigeonhole Principle). If |A| > k|B|, then for every total function f : A → B, there exist k + 1 different elements of A that are mapped to the same element of B.

Lemma 12 (Inclusion-Exclusion Principle). |A ∪ B| = |A| + |B| − |A ∩ B|.

Definition 4 (Ordinary Generating Function). The ordinary generating function F(x) for the sequence $(f_i)_{i=0}^{\infty}$ is the power series $F(x) := \sum_{i=0}^{\infty} f_i x^i$.

Lemma 13 (Convolution Rule). Let F(x) be the generating function for selecting items from a set A, and G(x) for the set B, where A and B are disjoint. The generating function for selecting items from the union A ∪ B is F(x)G(x).

2.3 Permutations and Combinations

Definition 5 (Permutation). A permutation of a finite set is an ordered sequence of all the elements in that set, with each element appearing exactly once.

Theorem 14. The number of permutations of k elements out of a set of n elements is $\frac{n!}{(n-k)!}$. Specifically, the number of permutations of a set of n elements is n!.

Definition 6 (Combination). A combination of a finite set is a subset of that set.

Theorem 15. The number of combinations of k elements out of a set of n elements is $\binom{n}{k} := \frac{n!}{k!(n-k)!}$. This is also called the binomial coefficient. The following are properties of the binomial coefficient.
• Binomial expansion. $(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k$. In particular, $2^n = \sum_{k=0}^{n} \binom{n}{k}$.
• Binomial bounds. $\left(\frac{n}{k}\right)^k \le \binom{n}{k} \le \left(\frac{en}{k}\right)^k$.
• Pascal's triangle identity. $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$.

Theorem 16. The number of ways of depositing n distinct objects into r distinct bins, with $k_1$ objects in the first bin, $k_2$ objects in the second bin, and so on, is $\frac{n!}{k_1! \, k_2! \cdots k_r!}$.

3 Probability Basis

3.1 Terminologies

Definition 7 (Probability). Mathematical language for quantifying uncertainty. In probability, there are a few logically self-contained rules. The random process is fully known, and the objective is to find the probability of a certain outcome. There is only one correct answer that logically follows from these rules.

Definition 8 (Statistics). Using data to infer the distribution that generated the data. In statistics, the outcome is known, and we apply probability to make inferences (illuminate the unknown random process) from experimental data. There is no single correct answer.

Definition 9 (Experiment). A repeatable procedure with well-defined possible outcomes.

Definition 10 (Sample Space Ω). The set of all possible outcomes ω of an experiment.

Definition 11 (Event E). A subset of the sample space, E ⊆ Ω.

Definition 12 (Probability Function Pr). A function Pr : Ω → [0, 1] giving the probability of each outcome. Besides, Pr must also satisfy $\sum_{\omega \in \Omega} \Pr(\omega) = 1$.

Definition 13 (Probability of an Event). The probability of an event E is the sum of the probabilities of all the outcomes in E, i.e., $\Pr(E) := \sum_{\omega \in E} \Pr(\omega)$. Suppose Ω is finite and each outcome is equally likely; then $\Pr(E) = \frac{|E|}{|\Omega|}$.

Lemma 17. For two events $E_1$ and $E_2$, $\Pr(E_1 \cup E_2) = \Pr(E_1) + \Pr(E_2) - \Pr(E_1 \cap E_2)$. Specifically, if $E_1$ and $E_2$ are disjoint, $\Pr(E_1 \cup E_2) = \Pr(E_1) + \Pr(E_2)$.

3.2 Conditional Probability

Definition 14 (Conditional Probability Pr(A ∣ B)). The probability of an event A given that another event B has occurred. If Pr(B) ≠ 0, $\Pr(A \mid B) := \frac{\Pr(A \cap B)}{\Pr(B)}$. Furthermore, Pr(A ∩ B) = Pr(A ∣ B) Pr(B) is called the multiplication rule.

Trees are a great way to organize the underlying process into a sequence of actions. Each level of the tree shows the outcomes of one action. Conditional probabilities are written along the branches, so the probability of getting to any node is the product of the probabilities along the path to get there.

Theorem 18 (Law of Total Probability). Suppose $B_1, B_2, \ldots, B_n$ are a partition of the sample space; then for any event A,
$$\Pr(A) = \sum_{i=1}^{n} \Pr(A \cap B_i) = \sum_{i=1}^{n} \Pr(A \mid B_i) \Pr(B_i) . \quad (2)$$

Definition 15 (Independent A ⟂ B). Two events A and B are independent if knowing that B occurred does not change the probability that A occurred.
$$A \perp B \Leftrightarrow \begin{cases} \Pr(A \mid B) = \Pr(A) & \text{if } \Pr(B) \ne 0 ; \\ \Pr(B \mid A) = \Pr(B) & \text{if } \Pr(A) \ne 0 . \end{cases} \quad (3)$$
In other words, A ⟂ B ⇔ Pr(A ∩ B) = Pr(A) Pr(B). Note that disjoint events with positive probability are not independent.

Theorem 19 (Bayes' Theorem). For two events A and B,
$$\Pr(B \mid A) = \frac{\Pr(A \mid B) \Pr(B)}{\Pr(A)} = \frac{\Pr(A \mid B) \Pr(B)}{\sum_i \Pr(A \mid B_i) \Pr(B_i)} . \quad (4)$$
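As a hedged illustration of Bayes' theorem and the law of total probability, the following Python sketch works through a hypothetical diagnostic-test example (the prevalence and test accuracies are made-up numbers, not from the text):

```python
# Hypothetical numbers: a condition D with 1% prevalence, a test with
# Pr(+ | D) = 0.95 and Pr(+ | not D) = 0.10.
pr_d = 0.01
pr_pos_given_d = 0.95
pr_pos_given_not_d = 0.10

# Law of total probability: Pr(+) = Pr(+ | D)Pr(D) + Pr(+ | not D)Pr(not D).
pr_pos = pr_pos_given_d * pr_d + pr_pos_given_not_d * (1 - pr_d)

# Bayes' theorem: Pr(D | +) = Pr(+ | D)Pr(D) / Pr(+).
pr_d_given_pos = pr_pos_given_d * pr_d / pr_pos
print(f"Pr(D | +) = {pr_d_given_pos:.3f}")  # roughly 0.088
```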

Table 1: Distribution functions of random variables. We allow x to be any number. If x is a value that X never takes, then p(x) = 0.
• PMF, ℝ → [0, 1] (discrete): $p(x) := \Pr(X = x)$.
• PDF, ℝ → [0, ∞) (continuous): $\Pr(a \le X \le b) = \int_a^b p(x) \,\mathrm{d}x$.
• CDF, ℝ → [0, 1]: discrete $\Pr(X \le x) = \sum_{u=-\infty}^{x} p(u)$; continuous $\Pr(X \le x) = \int_{-\infty}^{x} p(u) \,\mathrm{d}u$.
• Joint PMF, ℝ × ℝ → [0, 1] (discrete): $p(x, y) := \Pr(X = x \wedge Y = y)$.
• Joint PDF, ℝ × ℝ → [0, ∞) (continuous): $\Pr(a \le X \le b \wedge c \le Y \le d) = \int_c^d \int_a^b p(x, y) \,\mathrm{d}x \,\mathrm{d}y$.
• Joint CDF, ℝ × ℝ → [0, 1]: discrete $\sum_{u=-\infty}^{x} \sum_{v=-\infty}^{y} p(u, v)$; continuous $\int_{-\infty}^{y} \int_{-\infty}^{x} p(u, v) \,\mathrm{d}u \,\mathrm{d}v$.
• Marginal PMF, ℝ → [0, 1] (discrete): $p(x) := \sum_{y=-\infty}^{\infty} p(x, y)$.
• Marginal PDF, ℝ → [0, ∞) (continuous): $p(x) := \int_{-\infty}^{\infty} p(x, y) \,\mathrm{d}y$.
• Marginal CDF, ℝ → [0, 1]: $F(x) := \lim_{y \to \infty} F(x, y)$ (both discrete and continuous).
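To make the discrete entries of Table 1 concrete, here is a minimal Python sketch (with a made-up two-by-two joint PMF) that computes a marginal PMF and a CDF from a joint PMF:

```python
# Made-up joint PMF of (X, Y) on {0, 1} x {0, 1}; the values sum to 1.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# Marginal PMF of X: p(x) = sum over y of p(x, y).
p_x = {x: sum(p for (u, _), p in joint.items() if u == x) for x in (0, 1)}

# CDF of X: Pr(X <= x) = sum of p(u) for u <= x.
cdf_x = {x: sum(p for u, p in p_x.items() if u <= x) for x in (0, 1)}

print(p_x)    # {0: 0.3, 1: 0.7} (up to floating-point rounding)
print(cdf_x)  # {0: 0.3, 1: 1.0}
```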

Table 2: Properties of expectation, variance, and covariance.

Expectation:
• Notation: $\mathbb{E}[X]$, $\mu$.
• Meaning: measure of central tendency.
• Definition (discrete): $\sum_x x \, p(x)$.
• Definition (continuous): $\int_{-\infty}^{\infty} x \, p(x) \,\mathrm{d}x$.
• Scale and shift: $\mathbb{E}[aX + b] = a \,\mathbb{E}[X] + b$.
• Linearity: $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$.
• Function of X: $\mathbb{E}[f(X)] = \sum_x f(x) \, p(x)$.

Variance:
• Notation: $\mathrm{var}\, X$, $\sigma^2$.
• Meaning: measure of spread around the center.
• Definition (discrete): $\mathbb{E}[(X - \mu)^2] = \sum_x (x - \mu)^2 p(x)$.
• Definition (continuous): $\mathbb{E}[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 p(x) \,\mathrm{d}x$.
• Alternative: $\mathrm{var}\, X = \mathbb{E}[X^2] - \mu^2$.
• Scale and shift: $\mathrm{var}(aX + b) = a^2 \,\mathrm{var}\, X$.
• Linearity: $\mathrm{var}(X + Y) = \mathrm{var}\, X + \mathrm{var}\, Y + 2 \,\mathrm{cov}(X, Y)$.

Covariance:
• Notation: $\mathrm{cov}(X, Y)$.
• Meaning: how much two random variables vary together.
• Definition: $\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]$.
• Alternative: $\mathrm{cov}(X, Y) = \mathbb{E}[XY] - \mu_X \mu_Y$.
• Scale and shift: $\mathrm{cov}(aX + b, cY + d) = ac \,\mathrm{cov}(X, Y)$.
• Linearity: $\mathrm{cov}(X_1 + X_2, Y) = \mathrm{cov}(X_1, Y) + \mathrm{cov}(X_2, Y)$.
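The identities in Table 2 are easy to check numerically. A minimal sketch (reusing the same made-up joint PMF style as above) verifies var(X + Y) = var X + var Y + 2 cov(X, Y):

```python
# Made-up joint PMF of (X, Y); the values sum to 1.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def expect(f):
    """E[f(X, Y)] under the joint PMF."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mean_x) ** 2)
var_y = expect(lambda x, y: (y - mean_y) ** 2)
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))
var_sum = expect(lambda x, y: (x + y - mean_x - mean_y) ** 2)

# var(X + Y) = var X + var Y + 2 cov(X, Y)
assert abs(var_sum - (var_x + var_y + 2 * cov_xy)) < 1e-12
```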

4 Random Variables

4.1 Terminologies

Definition 16 (Discrete Random Variable X). A discrete random variable is a function X : Ω → ℝ. For any value x, we write X = x to mean the event {ω ∈ Ω ∣ X(ω) = x}, so Pr(X = x) := Pr({ω ∈ Ω ∣ X(ω) = x}).

Table 1 summarizes distribution functions of random variables.

Lemma 20 (Relationship Between CDF and PMF/PDF).
$$p(x) = \begin{cases} F(x) - F(x - 1) & \text{if } X \text{ is discrete} ; \\ \frac{\mathrm{d}F(x)}{\mathrm{d}x} & \text{if } X \text{ is continuous} . \end{cases} \quad (5)$$
$$p(x, y) = \frac{\partial^2 F(x, y)}{\partial x \, \partial y} . \quad (6)$$

Table 2 summarizes properties of expectation, variance, and covariance.

Lemma 21. If $X_1, X_2, \ldots, X_n$ are independent,
$$\mathbb{E}\left[\prod_{i=1}^{n} X_i\right] = \prod_{i=1}^{n} \mathbb{E}[X_i] , \quad (7)$$
$$\mathrm{var}\left(\sum_{i=1}^{n} c_i X_i\right) = \sum_{i=1}^{n} c_i^2 \,\mathrm{var}\, X_i . \quad (8)$$

Definition 17 (Correlation ρ). The correlation between X and Y is $\rho := \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1]$. Furthermore,
$$\rho = 1 \Leftrightarrow \exists a > 0, b.\ Y = aX + b , \quad (9)$$
$$\rho = -1 \Leftrightarrow \exists a < 0, b.\ Y = aX + b . \quad (10)$$

Definition 18 (Independent X ⟂ Y). Jointly-distributed random variables X and Y are independent if any event defined by X is independent of any event defined by Y. This is equivalent to F(x, y) = F(x)F(y) or p(x, y) = p(x)p(y).

Lemma 22. X ⟂ Y ⇒ cov(X, Y) = 0. However, the converse is false, since cov(X, Y) only measures the linear relationship between X and Y.

Definition 19 (Quantile). The q-th quantile of X is
$$F^{-1}(q) := \min \{ x : F(x) \ge q \} = \min \{ x : \Pr(X \le x) \ge q \} . \quad (11)$$
In particular, we call $F^{-1}\!\left(\frac{1}{4}\right)$ the first quartile, $F^{-1}\!\left(\frac{1}{2}\right)$ the median, and $F^{-1}\!\left(\frac{3}{4}\right)$ the third quartile.

4.2 Common Distributions

Table 3 summarizes common distributions.

Table 3: Common distributions.

• Bernoulli. Model: one trial resulting in success/failure. Notation: Ber(p). PMF: $p^x (1 - p)^{1 - x}$. Domain: {0, 1}. Expectation: $p$. Variance: $p(1 - p)$.
• Binomial. Model: number of successes in n independent Ber(p) trials. Notation: Bin(n, p). PMF: $\binom{n}{x} p^x (1 - p)^{n - x}$. Domain: {0, …, n}. Expectation: $np$. Variance: $np(1 - p)$.
• Geometric. Model: number of failures before the first success of independent Ber(p) trials. Notation: Geo(p). PMF: $p(1 - p)^x$. Domain: ℕ. Expectation: $\frac{1 - p}{p}$. Variance: $\frac{1 - p}{p^2}$.
• Poisson. Model: count of rare events like radioactive decay and traffic accidents. Notation: Poi(λ). PMF: $\exp(-\lambda) \frac{\lambda^x}{x!}$. Domain: {0, 1, …}. Expectation: $\lambda$. Variance: $\lambda$.
• Uniform (discrete). Model: situation when all the outcomes are equally likely. Notation: Unif(a, b). PMF: $\frac{1}{b - a + 1}$. Domain: {a, …, b}. Expectation: $\frac{a + b}{2}$. Variance: $\frac{(b - a + 1)^2 - 1}{12}$.
• Gaussian. Model: measurement error, averages of lots of data, etc. Notation: $\mathcal{N}(\mu, \sigma^2)$. PDF: $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$. Domain: ℝ. Expectation: $\mu$. Variance: $\sigma^2$.
• Exponential. Model: lifetime of electronic components / waiting time between rare events. Notation: Exp(β). PDF: $\beta \exp(-\beta x)$. CDF: $1 - \exp(-\beta x)$. Domain: [0, ∞). Expectation: $\frac{1}{\beta}$. Variance: $\frac{1}{\beta^2}$.
• Uniform (continuous). Model: situation when all the outcomes are equally likely. Notation: Unif(a, b). PDF: $\frac{1}{b - a}$. CDF: $\frac{x - a}{b - a}$. Domain: [a, b]. Expectation: $\frac{a + b}{2}$. Variance: $\frac{(b - a)^2}{12}$.
• Pareto. Model: probability that an event occurs varies according to a power law. Notation: Pareto(m, α). PDF: $\frac{\alpha m^\alpha}{x^{\alpha + 1}}$. CDF: $1 - \frac{m^\alpha}{x^\alpha}$. Domain: [m, ∞).

Table 4: Common inequalities, where $\bar{X}$ is the sample mean, and $\mu$ and $\sigma^2$ are the mean and variance, respectively.

• Gaussian tail inequality. Assumption: $X \sim \mathcal{N}(0, 1)$. Parameter: $\varepsilon > 0$. Inequality: $\Pr(|X| > \varepsilon) \le \frac{2}{\varepsilon} \exp\left(-\frac{\varepsilon^2}{2}\right)$.
• Gaussian tail inequality. Assumption: $X_1, X_2, \ldots, X_m$ are iid drawn from $\mathcal{N}(0, 1)$. Parameter: $\varepsilon > 0$. Inequality: $\Pr(|\bar{X}| > \varepsilon) \le \frac{2}{\varepsilon \sqrt{m}} \exp\left(-\frac{\varepsilon^2 m}{2}\right)$.
• Markov's inequality. Assumption: $X$ is non-negative. Parameter: $\varepsilon > 0$. Inequality: $\Pr(X > \varepsilon) \le \frac{\mu}{\varepsilon}$.
• Chebyshev's inequality. Parameter: $\varepsilon > 0$. Inequality: $\Pr(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$.
• Hoeffding's inequality. Assumption: $\mu_i = 0$ and $a_i \le X_i \le b_i$. Parameters: $\varepsilon > 0$, $t > 0$. Inequality: $\Pr\left(\sum_{i=1}^{m} X_i \ge \varepsilon\right) \le \exp(-t\varepsilon) \prod_{i=1}^{m} \exp\left(\frac{t^2 (b_i - a_i)^2}{8}\right)$.
• Hoeffding's inequality. Assumption: $X_1, X_2, \ldots, X_m$ are iid drawn from Ber(p). Parameter: $\varepsilon > 0$. Inequality: $\Pr(|\bar{X} - p| > \varepsilon) \le 2 \exp(-2\varepsilon^2 m)$.
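A quick Monte Carlo sketch (with arbitrary parameters) comparing the empirical deviation probability of a Bernoulli sample mean with the last Hoeffding bound in Table 4:

```python
import math
import random

random.seed(1)
p, m, eps, trials = 0.5, 100, 0.1, 20_000

# Empirical Pr(|X_bar - p| > eps) over many simulated samples of size m.
hits = 0
for _ in range(trials):
    x_bar = sum(random.random() < p for _ in range(m)) / m
    hits += abs(x_bar - p) > eps
empirical = hits / trials

bound = 2 * math.exp(-2 * eps ** 2 * m)  # Hoeffding bound 2 exp(-2 eps^2 m)
print(empirical, bound)  # the empirical value should not exceed the bound
```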

Lemma 23. The geometric distribution and the exponential distribution are memoryless,
$$\Pr(X = x + x_0 \mid X \ge x_0) = \Pr(X = x) . \quad (12)$$

Definition 20 (Multivariate Gaussian Distribution). $\boldsymbol{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ if
$$p(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^d \det \boldsymbol{\Sigma}}} \exp\left(-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu})\right) . \quad (13)$$

Definition 21 (Multinomial Distribution). $\boldsymbol{X} \sim \mathrm{Mult}(n, \boldsymbol{p})$ if
$$p(\boldsymbol{x}) = \binom{n}{x_1, x_2, \ldots, x_K} \prod_{i=1}^{K} p_i^{x_i} . \quad (14)$$
It models a situation where we draw n balls from an urn with K different colors. $p_i$ is the probability of drawing color i, and $x_i$ is the count of the number of balls of color i. $\binom{n}{x_1, x_2, \ldots, x_K} := \frac{n!}{\prod_{i=1}^{K} x_i!}$ is also called the multinomial coefficient.

Lemma 24 (Multinomial Expansion).
$$\left(\sum_{i=1}^{r} a_i\right)^n = \sum_{x_1 + \cdots + x_r = n} \binom{n}{x_1, x_2, \ldots, x_r} \prod_{i=1}^{r} a_i^{x_i} . \quad (15)$$

5 Inequalities and Convergence

5.1 Inequalities

Table 4 summarizes common inequalities.

Theorem 25 (Cauchy-Schwarz Inequality).
$$\mathbb{E}[|XY|] \le \sqrt{\mathbb{E}[X^2] \,\mathbb{E}[Y^2]} . \quad (16)$$

Theorem 26 (Jensen's Inequality). If f is convex,
$$\mathbb{E}[f(X)] \ge f(\mathbb{E}[X]) . \quad (17)$$

5.2 Convergence of Random Variables

Theorem 27 (Law of Large Numbers). Suppose $X_1, X_2, \ldots, X_m, \ldots$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}_m := \frac{1}{m} \sum_{i=1}^{m} X_i$ be the sample mean. Then
$$\forall \varepsilon > 0.\ \lim_{m \to \infty} \Pr(|\bar{X}_m - \mu| < \varepsilon) = 1 . \quad (18)$$
In other words, the sample mean is very close to the true mean of the distribution with high probability.

Corollary 28 (Law of Large Numbers for Histograms). With high probability the density histogram of a large number of samples from a distribution is a good approximation of the graph of the underlying PDF p.
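A minimal simulation sketch of the law of large numbers (the distribution and sample sizes are arbitrary choices): the sample mean of i.i.d. die rolls settles around the true mean as the sample grows.

```python
import random

random.seed(2)
mu = 3.5  # true mean of a fair six-sided die

for m in (10, 1_000, 100_000):
    samples = [random.randint(1, 6) for _ in range(m)]
    x_bar = sum(samples) / m
    print(m, x_bar)  # x_bar approaches mu = 3.5 as m grows
```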

Theorem 29 (Central Limit Theorem). Suppose $X_1, X_2, \ldots, X_m, \ldots$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}_m := \frac{1}{m} \sum_{i=1}^{m} X_i$ be the sample mean. Then for large m,
$$\bar{X}_m \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{m}\right) , \quad (19)$$
$$\frac{\bar{X}_m - \mu}{\sigma / \sqrt{m}} \sim \mathcal{N}(0, 1) . \quad (20)$$

6 Standard Model: Balls and Bins

6.1 Distinct Balls

Consider the process that we randomly toss n distinct balls into k bins.

Lemma 30. Suppose the order within a bin does not matter. The number of ways of placing the balls into bins is $k^n$.

Proof. There are k ways to place each of the n balls.

Lemma 31. Suppose the balls in each bin are ordered. The number of ways of placing the balls into bins is $\frac{(n+k-1)!}{(k-1)!}$.

Proof. The number of ways of arranging n balls and k − 1 sticks (as separators) in a row is (n + k − 1)!. Since the sticks are indistinguishable, (k − 1)! of the arrangements are duplicated. Thus the answer is $\frac{(n+k-1)!}{(k-1)!}$.

Lemma 32 (Birthday Paradox). The probability that at least two balls are tossed into the same bin is $\ge 1 - \exp\left(-\frac{n(n-1)}{2k}\right)$.

Proof. By looking at the complementary event. The probability that all balls are tossed into different bins is
$$\frac{\binom{k}{n} n!}{k^n} = \frac{k(k-1)(k-2) \cdots (k-n+1)}{k^n} = \left(1 - \frac{0}{k}\right) \left(1 - \frac{1}{k}\right) \left(1 - \frac{2}{k}\right) \cdots \left(1 - \frac{n-1}{k}\right) < \exp(0) \exp\left(-\frac{1}{k}\right) \exp\left(-\frac{2}{k}\right) \cdots \exp\left(-\frac{n-1}{k}\right) = \exp\left(-\sum_{i=0}^{n-1} \frac{i}{k}\right) = \exp\left(-\frac{n(n-1)}{2k}\right) . \quad (21)$$
When $n = \sqrt{2k}$, it approximates $\frac{1}{e}$.

Lemma 33. The expected number of pairs of balls within the same bin is $\frac{n(n-1)}{2k}$.

Proof. Let $X_{ij} = \mathbb{I}(\text{balls } i \text{ and } j \text{ are tossed into the same bin})$. Then
$$\mathbb{E}\left[\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X_{ij}\right] = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathbb{E}[X_{ij}] = \binom{n}{2} \frac{1}{k} = \frac{n(n-1)}{2k} . \quad (22)$$

6.2 Identical Balls

Consider the process that we randomly toss n identical balls into k bins.

Lemma 34. The number of ways of placing the balls into bins is $\binom{n+k-1}{n}$.

Proof. If the balls are distinct, there are $\frac{(n+k-1)!}{(k-1)!}$ ways. Since the balls are indistinguishable, n! of the arrangements are duplicated. Thus the answer is $\frac{(n+k-1)!}{(k-1)! \, n!} = \binom{n+k-1}{n}$.

Lemma 35. Suppose no bin may contain more than one ball, so that n ≤ k. The number of ways of placing the balls is $\binom{k}{n}$.

Proof. Select n out of the k bins to put balls in.

Lemma 36. Suppose no bin may be left empty. Assuming that n ≥ k, the number of ways of placing the balls is $\binom{n-1}{k-1}$.

Proof. First, we put a ball in each bin. Then we use the result of Lemma 34: $\binom{n-k+k-1}{n-k} = \binom{n-1}{n-k} = \binom{n-1}{n-1-(n-k)} = \binom{n-1}{k-1}$.

Lemma 37. The expected number of balls that fall in a given bin is $\frac{n}{k}$.

Proof. It follows a binomial distribution $\mathrm{Bin}(n, \frac{1}{k})$.

Lemma 38. The expected number of balls we toss until a given bin contains a ball is k.

Proof. It follows a geometric distribution $\mathrm{Geo}(\frac{1}{k})$.

Lemma 39 (Coupon Collector's Problem). The expected number of balls we toss until every bin contains at least one ball is ∼ k log k.

Proof. Partition the tosses into k stages. The i-th stage consists of the tosses after a ball is tossed into the (i − 1)-th empty bin until a ball is tossed into the i-th empty bin. During the i-th stage, i − 1 bins contain balls and k − i + 1 bins are empty. The stage length follows a geometric distribution $\mathrm{Geo}(\frac{k-i+1}{k})$, so its expected number of tosses is $\frac{k}{k-i+1}$. The overall expected number of tosses is
$$\sum_{i=1}^{k} \frac{k}{k-i+1} = k \sum_{i=1}^{k} \frac{1}{i} \sim k \log k . \quad (23)$$
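A hedged simulation sketch of the coupon collector's problem (the bin count and trial count are arbitrary choices): the average number of tosses needed to fill every bin is close to the exact expectation $k \sum_{i=1}^{k} \frac{1}{i}$, which grows like k log k.

```python
import math
import random

random.seed(3)
k, trials = 50, 2_000  # number of bins, number of simulated runs

def tosses_until_full(k):
    """Toss balls into k bins uniformly until every bin has at least one ball."""
    seen, count = set(), 0
    while len(seen) < k:
        seen.add(random.randrange(k))
        count += 1
    return count

avg = sum(tosses_until_full(k) for _ in range(trials)) / trials
exact = sum(k / i for i in range(1, k + 1))  # k * H_k, asymptotically k log k
print(avg, exact, k * math.log(k))
```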

7 Statistics Basis

7.1 Terminologies

Definition 22 (Point Estimation $\hat{\theta}$). Provide a single "best guess" of some quantity θ of interest. We denote the bias of $\hat{\theta}$ as $\mathbb{E}[\hat{\theta}] - \theta$. We say that $\hat{\theta}$ is unbiased if $\mathbb{E}[\hat{\theta}] = \theta$, and consistent if $\forall \varepsilon > 0.\ \lim_{m \to \infty} \Pr(|\hat{\theta} - \theta| < \varepsilon) = 1$. We denote the mean square error of $\hat{\theta}$ as $\mathbb{E}[(\hat{\theta} - \theta)^2]$.

Definition 23 (Sample Mean $\bar{X}$). $\bar{X} := \frac{1}{m} \sum_{i=1}^{m} X_i$. The sample mean is unbiased, $\mathbb{E}[\bar{X}] = \mu$, and $\mathrm{var}\, \bar{X} = \frac{\sigma^2}{m}$.

Definition 24 (Sample Variance $s^2$). $s^2 := \frac{1}{m-1} \sum_{i=1}^{m} (X_i - \bar{X})^2$. The sample variance is unbiased, $\mathbb{E}[s^2] = \sigma^2$.

Definition 25 (Empirical CDF $\hat{F}$). Suppose $X_1, X_2, \ldots, X_m$ are iid drawn from a distribution. The empirical CDF $\hat{F}$ puts mass $\frac{1}{m}$ at each data point $X_i$, $\hat{F}(x) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}(X_i \le x)$. $\hat{F}$ is an unbiased and consistent estimate of the true CDF F.

7.2 Parametric Inference


Theorem 40 (Maximum Likelihood Estimate, MLE). Let $X_1, X_2, \ldots, X_m$ be iid with pdf $p$. The maximum likelihood estimate of the parameter of a parametric distribution is the parameter value for which the data are most likely,
$$\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{m} p(X_i; \theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log p(X_i; \theta) . \quad (24)$$
The MLE is a point estimate, and it is asymptotically unbiased and has asymptotically minimal variance.
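As a minimal sketch of Equation (24) (the data are simulated, not from the text), the following computes the MLE of a Bernoulli parameter; for Ber(p) the log-likelihood is maximized by the sample mean, and a numerical grid search reproduces the same value:

```python
import math
import random

random.seed(4)
true_p = 0.3
data = [1 if random.random() < true_p else 0 for _ in range(1_000)]

def log_likelihood(p):
    """Sum of log p(x_i; p) for a Bernoulli model."""
    return sum(math.log(p if x == 1 else 1 - p) for x in data)

# Closed form: the MLE of the Bernoulli parameter is the sample mean.
mle_closed_form = sum(data) / len(data)

# Numerical check: grid search over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
mle_grid = max(grid, key=log_likelihood)

print(mle_closed_form, mle_grid)  # both close to true_p = 0.3
```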

Theorem 41 (Bayes Estimate). The Bayes estimate regards θ as random,
$$p(\theta \mid D) = \frac{p(D \mid \theta) \, p(\theta)}{p(D)} . \quad (25)$$
We can then compute a point estimate from the posterior, e.g.,
$$\hat{\theta} = \mathbb{E}[\theta \mid D] = \int \theta \, p(\theta \mid D) \,\mathrm{d}\theta . \quad (26)$$
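A hedged sketch of Equation (26) for a Bernoulli likelihood (the Beta prior and the data counts are illustrative assumptions, not from the text): with a Beta(a, b) prior on θ and x successes in m trials, the posterior is Beta(a + x, b + m − x), so the posterior mean has a closed form that a simple grid integration reproduces.

```python
import math

a, b = 1.0, 1.0          # Beta prior hyperparameters (uniform prior, arbitrary)
x, m = 300, 1_000        # observed successes and sample size (made-up data)

# Conjugacy: the posterior is Beta(a + x, b + m - x), so E[theta | D] is:
posterior_mean = (a + x) / (a + b + m)

# Numerical check of equation (26): evaluate the unnormalized log-posterior
# log p(D | theta) + log p(theta) on a grid, then normalize.
grid = [i / 10_000 for i in range(1, 10_000)]
log_post = [(a - 1 + x) * math.log(t) + (b - 1 + m - x) * math.log(1 - t) for t in grid]
peak = max(log_post)
weights = [math.exp(lp - peak) for lp in log_post]  # subtract the peak to avoid underflow
numeric_mean = sum(t * w for t, w in zip(grid, weights)) / sum(weights)

print(posterior_mean, numeric_mean)  # both roughly 0.300
```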

Definition 26 (Frequentists). Frequentists say that probability measures the frequency of various outcomes of an experiment. For example, saying a fair coin has a 50% probability of heads means that if we toss it many times then we expect about half the tosses to land heads.

Definition 27 (Bayesians). Bayesians say that probability is an abstract concept that measures a state of knowledge or a degree of belief in a given proposition. In practice, Bayesians do not assign a single value for the probability of a coin coming up heads. Rather, they consider a range of values, each with its own probability of being true.

References

[1] G. Casella and R. L. Berger. Statistical Inference, volume 2. Duxbury, Pacific Grove, CA, 2002.
[2] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, 3rd Edition. MIT Press, 2009.
[3] R. Hogg, J. McKean, and A. Craig. Introduction to Mathematical Statistics, 7th Edition. Pearson Press, 2012.
[4] R. J. Larsen and M. L. Marx. Introduction to Mathematical Statistics and Its Applications: Pearson New International Edition. Pearson Higher Ed, 2013.
[5] E. Lehman, T. Leighton, and A. Meyer. Mathematics for Computer Science. MIT, 2017.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, 2013.
