
Introduction to Probability

(Lectures 1-2)

1 Probability Spaces and Random Variables
The goal of Lecture 1 and Lecture 2 is twofold. First, we will present the formal definitions of a
probability space and a random variable. Then, we will introduce some more concrete concepts that
will be used throughout the course (such as cumulative distribution function and probability density
function).
Overview of Lectures 1-2: The triplet:

(Ω, F, P),

will be our notation to describe a probability space. We will call

1. Ω: The underlying space of uncertainty or set of states of the world.

2. F: The “σ-algebra” of subsets of Ω. We will think about it as a collection of “events”
whose likelihood of occurrence we would like to analyze.

3. P: The probability measure. We will think about it as a [0, 1]-valued function summarizing
the likelihood of an event.

A real-valued random variable will be denoted by the map

X : Ω → R,

which will satisfy a restriction that we will call “measurability”. We will focus on how the probability
measure on (Ω, F) induces a probability over subsets of R.
Once we are done with abstract definitions, we will introduce objects that we will use quite
often during this course. In the context of a real-valued random variable we will define

a) The cumulative distribution function (c.d.f.), which we will connect to the “induced” probability
measure of a random variable.

b) “Discrete” and “Absolutely Continuous” random variables: this classification will emerge from
differences in the c.d.f. of random variables.

c) The probability density function (p.d.f.): a convenient way of summarizing the information in
the c.d.f. of an absolutely continuous random variable.

d) The expectation operator EP [·] (and then the variance of a real-valued random variable).

e) Finally, the moment generating function.

1.1 Measurable spaces, Probability Measures, and Probability Spaces
1.1.1 σ-algebras and Measurable Spaces

We will start with the definition of a “measurable space”. In order to model randomness, we need
to have some structure that allows us to specify what the randomness is all about.

Let Ω be any given nonempty set and let F be a nonempty collection of subsets satisfying the
following properties:

1. F is “closed” under complements: F ∈ F =⇒ Ω\F ∈ F

2. F is “closed” under countable unions: Fn ∈ F for all n ∈ N =⇒ ∪∞n=1 Fn ∈ F.

Definition (σ-algebra). (Billingsley (1995), pp. 19-20) If the nonempty collection F ⊆ 2Ω satisfies
properties 1 and 2 above, then F is called a σ-algebra (or a σ-algebra of subsets of Ω).

Definition (Measurable Space). (Billingsley (1995), p. 161) The pair (Ω, F), where F is a
σ-algebra of subsets of Ω, is called a measurable space. The elements of F are called the events of Ω.

If you are interested in playing with these definitions, here is a simple exercise.

Practice Problem 1 (Optional). Let A be a collection of elements of 2Ω . Define

F ∗ (A) ≡ {F | F is a σ-algebra of subsets of Ω containing A}

i) Show that F ∗ (A) is non-empty.

ii) Let σ(A) denote the intersection of all the σ-algebras in F ∗ (A). Show that σ(A)
is a σ-algebra. The collection σ(A) will be called the “σ-algebra generated by A”. Such a
collection is unique.

Practice Problem 2 (Optional). Show that a σ-algebra is “closed” under countable intersections.
That is, if Fn ∈ F for all n ∈ N then ∩∞n=1 Fn ∈ F.
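Practice Problems 1-2 are easy to experiment with on a finite Ω. Here is a minimal Python sketch (the names are ours, purely illustrative) that checks the two closure properties; on a finite space, countable unions reduce to finite ones, so checking complements and pairwise unions suffices:

```python
# Check the sigma-algebra axioms for a collection of subsets of a finite Omega.
def is_sigma_algebra(omega, collection):
    F = {frozenset(s) for s in collection}
    omega = frozenset(omega)
    closed_complements = all(omega - A in F for A in F)   # property 1
    closed_unions = all(A | B in F for A in F for B in F) # property 2 (finite case)
    return bool(F) and closed_complements and closed_unions

omega = {1, 2, 3, 4}
F = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]
print(is_sigma_algebra(omega, F))                               # True
print(is_sigma_algebra(omega, [set(), {1, 2}, {1, 2, 3, 4}]))   # False: missing {3, 4}
```

Dropping {3, 4} breaks closure under complements, which is exactly what the second call detects.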

1.1.2 Probability Measures

Once we know which objects uncertainty refers to, we need to have a formal way
of expressing the “likelihood” of the different events we have specified. In order to do this we will
define a probability measure.

Definition. (Probability Measure) (Billingsley (1995), p. 22) Let (Ω, F) be a measurable space.
Let P : F → [0, 1] be a “set” function mapping the σ-algebra of subsets of Ω into the unit interval. We
say that P is a probability measure if it satisfies the following properties:

1. P(∅) = 0, P(Ω) = 1 [Normalization]


2. P(∪∞n=1 An ) = Σ∞n=1 P(An ) for a countable sequence of pairwise disjoint elements of F [σ-additivity]

Definition (Probability space). (Billingsley (1995), p. 23) The triplet (Ω, F, P) is called a
probability space.
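A finite example makes the axioms concrete. The sketch below (a fair coin, with names chosen by us for illustration) builds a probability measure on F = 2^Ω and checks normalization and additivity on disjoint events exactly, using rational arithmetic:

```python
from fractions import Fraction

# A minimal finite probability space: Omega = {H, T}, F = 2^Omega,
# P(A) = sum of the weights of A's elements (a fair coin here).
weights = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

def P(event):
    return sum(weights[w] for w in event)

print(P(set()))                               # 0  (normalization)
print(P({"H", "T"}))                          # 1  (normalization)
print(P({"H"}) + P({"T"}) == P({"H", "T"}))   # True (additivity on disjoint events)
```

On a finite Ω, σ-additivity collapses to finite additivity, which is why checking disjoint pairs is enough here.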

Some implications of the definition of a probability measure.

1. Monotonicity: For events A, B, A ⊆ B implies P(A) ≤ P(B).

Proof. Write B = A ∪ (B\A). By assumption, A ∈ F; Practice Problem 2 implies B \ A ∈ F


because B \ A = B ∩ (Ω \ A). By definition A ∩ (B\A) = ∅. Therefore, by finite additivity
(which is implied by σ-additivity) we have P(B) = P(A) + P(B\A), which implies P(B) −
P(A) = P(B\A) ≥ 0, as P is a [0, 1]-valued set function.

2. Boole’s Inequality: (Billingsley (1995) p. 24) Boole’s inequality is a pretty useful result
that will show up quite often. Let An be any sequence of events in F. Then

P(∪∞n=1 An ) ≤ Σ∞n=1 P(An )

We provide a proof of Boole’s inequality for only two events A, B. That is, we show that:

P(A ∪ B) ≤ P(A) + P(B)

Proof. Let A1 = A, A2 = B\A. By finite additivity:

P(A ∪ B) = P(A1 ∪ A2 )
= P(A1 ) + P(A2 )

Since B\A ⊆ B, then P(B\A) ≤ P(B). Therefore, P(A ∪ B) ≤ P(A) + P(B).
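The two-event case is easy to verify exactly on a small space. The sketch below (a fair die, an illustrative choice of ours) shows the inequality is strict whenever the events overlap:

```python
from fractions import Fraction

# Exact check of Boole's inequality for two overlapping events on a fair die.
omega = {1, 2, 3, 4, 5, 6}

def P(A):
    return Fraction(len(A & omega), len(omega))

A, B = {1, 2, 3}, {3, 4}
print(P(A | B))        # 2/3
print(P(A) + P(B))     # 5/6 -- the bound over-counts the overlap {3}
```

The gap between the two numbers is exactly P(A ∩ B) = 1/6, in line with the inclusion-exclusion identity of Practice Problem 3.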

In order for you to manipulate the definition of a probability measure, you will be asked to
write down proofs of the following statements:

Practice Problem 3 (Properties of a probability measure). Let (Ω, F, P) be a probability space.


Show that a probability measure satisfies:

1. P(F1 ∪ F2 ) = P(F1 ) + P(F2 ) − P(F1 ∩ F2 ) for any F1 , F2 ∈ F

2. P(∪n∈N Fn ) ≤ Σn∈N P(Fn ) for any countable collection {Fn }n∈N ⊆ F

1.2 Random Variables


Along with a probability measure, a central concept in probability and statistics is that of a random
variable.

Definition (S-valued random variable). Let (Ω, F), (S, S) be two measurable spaces. The map:

X :Ω→S

is called an S-valued random variable [relative to (Ω, F)-(S, S)] if for all A ∈ S :

X −1 (A) ≡ {ω ∈ Ω | X(ω) ∈ A} ∈ F,

that is, if X −1 (A) is measurable for all A ∈ S.

Note that, by definition, the preimage of a random variable takes events in the space S to well-defined
events in the space Ω. If there is already a well-defined probability measure P on (Ω, F), then there
is a natural “induced” probability measure on (S, S):

Definition (“Induced” Probability Measure or “Law” of a Random Variable). (see Billingsley


(1995) p. 185 on Transformation of measures and see if you can make the connection) Let (Ω, F, P)
be a probability space. The probability measure over (S, S) induced by the random variable X :
Ω → S is given by:

PX (A) ≡ P(X −1 (A)) ∀ A ∈ S.
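On a finite space the induced measure is just a sum over preimages. The sketch below (a fair die mapped to its parity, an illustrative example of ours) computes PX (A) = P(X −1 (A)) directly:

```python
from fractions import Fraction

# The law P_X of X(w) = w mod 2 under the uniform measure on a fair die.
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}

def X(w):
    return w % 2     # 0 for even outcomes, 1 for odd

def law(A):
    # P_X(A) = P(X^{-1}(A)) = P({w : X(w) in A})
    return sum(P[w] for w in omega if X(w) in A)

print(law({0}))      # 1/2
print(law({0, 1}))   # 1  -- the law is itself a probability measure on {0, 1}
```

Note that `law` inherits normalization and additivity from P, which is the content of the definition above.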

1.2.1 Cumulative Distribution Function for an R-valued random variable

In this section, we will analyze the statement: “X is a real-valued random variable with a cumulative
distribution function F : R → [0, 1]”. We will explain the relation between the c.d.f. of a random

variable and the induced probability measure we discussed in the last subsection. We will focus on
real-valued random variables (relative to the Borel σ-algebra on the real line, which is the smallest
σ-algebra containing all open sets).

Definition. (Billingsley (1995), pp. 188 and 256) The cumulative distribution function (c.d.f.) of a
real-valued random variable X : Ω → R is defined as the function FX : R → [0, 1] given
by
FX (x) ≡ P{ω ∈ Ω | X(ω) ≤ x} = P(X −1 ((−∞, x]))

The c.d.f. of a random variable X is nothing other than its induced probability measure evaluated
at sets of the form (−∞, x]. The c.d.f. satisfies the following properties:

Proposition 1. If FX is the c.d.f. of a random variable X : Ω → R then

1. FX is non-decreasing.

2. limx↑∞ FX (x) = 1

3. limx↓−∞ FX (x) = 0

4. limh→0+ FX (x + h) = FX (x) (right-continuity)

Furthermore, if F is a function satisfying 1,2,3,4, then there is a probability space (Ω, F, P) and a
random variable X : Ω → R such that F coincides with FX .
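Properties 1-4 are easy to spot-check numerically for a concrete c.d.f. The sketch below (our illustrative choice: the exponential c.d.f. F(x) = 1 − e^(−x) for x ≥ 0) checks each property at sample points; this is a sanity check, not a proof:

```python
import math

# Numerical spot-check of Proposition 1 for the exponential c.d.f.
def F(x):
    return 0.0 if x < 0 else 1.0 - math.exp(-x)

xs = [-2.0, -1.0, 0.0, 0.5, 1.0, 3.0]
assert all(F(a) <= F(b) for a, b in zip(xs, xs[1:]))   # 1. non-decreasing
assert F(50.0) > 1 - 1e-12                             # 2. tends to 1 at +infinity
assert F(-50.0) == 0.0                                 # 3. tends to 0 at -infinity
assert abs(F(1e-9) - F(0.0)) < 1e-8                    # 4. right-continuous at 0
print("Proposition 1 checks passed")
```

The right-continuity check at x = 0 is the interesting one: F jumps nowhere here, but for a discrete c.d.f. (e.g. Bernoulli) only the limit from the right matches F(x).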

Practice Problem 4. This week’s problem set will walk you through the proof of Proposition 1.

Here are some commonly used examples of c.d.f.s of real-valued random variables:

• Uniform Distribution: The random variable X is said to have a uniform distribution in [a, b]
if the c.d.f. is given by:



F (x) =   0                  if x < a
          (x − a)/(b − a)    if x ∈ [a, b)
          1                  if x ≥ b

• Bernoulli Distribution: The random variable X is said to have a Bernoulli distribution with
parameter p ∈ [0, 1] if its c.d.f. is given by:




F (x) =   0        if x < 0
          1 − p    if x ∈ [0, 1)
          1        if x ≥ 1

• Geometric Distribution: The random variable X is said to have a Geometric Distribution


with parameter p if its c.d.f. is given by:

F (x) =   0                 if x < 1
          1 − (1 − p)^k     if x ∈ [k, k + 1) for k ∈ N

• Exponential Distribution: The random variable X is said to have an exponential distribution


with parameter λ>0 if

F (x) =   0                if x < 0
          1 − e^(−λx)      if x ≥ 0

• Gaussian Distribution: The random variable X is said to have a Gaussian distribution with
parameters (µ, σ 2 ) if

F (x) = ∫−∞^x (1/√(2πσ²)) exp(−(u − µ)²/(2σ²)) du

• Pareto Distribution: The random variable X is said to have a Pareto distribution with pa-
rameters (xm , α) if

F (x) =   0                  if x < xm
          1 − (xm /x)^α      if x ≥ xm

One common way to classify the random variables and their c.d.f.s is the following:

Definition (Discrete and Absolutely Continuous Random Variables).


1. Discrete Random Variables: A random variable X is discrete if there is a countable set
{x1 , x2 , . . .}, x1 < x2 < . . . such that

PX ({x1 , x2 , . . .}c ) = 0

The smallest set {x1 , x2 , . . .} for which PX ({x1 , x2 , . . .}c ) = 0 is called the support of X. The
map p : Supp → [0, 1] defined by p(s) ≡ PX ({s}) is called the probability mass function.

2. Absolutely Continuous Random Variables (w.r.t. Lebesgue Measure): A random
variable X is absolutely continuous if:
Z x
F (x) = f (y)dy
−∞

for some integrable, nonnegative function f that we will call the probability density func-
tion (p.d.f.) of X. If F is continuously differentiable the p.d.f. will coincide with the
derivative of F (see p. 257 in Billingsley (1995)).
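The relation between f and F in the definition above can be checked numerically. The sketch below (our illustrative choice: an exponential with rate λ = 2, so F(x) = 1 − e^(−2x) and f(x) = 2e^(−2x)) verifies both directions: f matches the derivative of F, and integrating f recovers F:

```python
import math

# For an exponential with lam = 2: f should equal F' where F is smooth,
# and the integral of f over [0, x] should equal F(x).
lam = 2.0
F = lambda x: 1.0 - math.exp(-lam * x)
f = lambda x: lam * math.exp(-lam * x)

x, h = 0.7, 1e-6
deriv = (F(x + h) - F(x - h)) / (2 * h)        # central-difference F'(x)
assert abs(deriv - f(x)) < 1e-4

n = 100_000                                    # left Riemann sum of f on [0, x]
riemann = sum(f(i * x / n) for i in range(n)) * (x / n)
assert abs(riemann - F(x)) < 1e-3
print("p.d.f. agrees with F' and integrates back to F")
```

For a discrete random variable no such f exists: its c.d.f. is a step function, and the jumps carry all the probability.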

A Bernoulli r.v. is an example of a discrete distribution with finite support. A Geometric r.v.
is an example of a discrete random variable with countably infinite support. The Gaussian, Pareto,
and exponential distributions are absolutely continuous random variables. Note
that not every random variable falls into one of the categories above (see, for example, the entry
for Cantor’s function in Wikipedia).

1.2.2 Moments of Discrete and Absolutely Continuous Random Variables

In this section we introduce the definition of expectation for only discrete and absolutely continuous
random variables. For a more detailed exposition you can see Chapters 1.4-1.6 in Durrett (2010).

Definition (Expectation of the transformation of an absolutely continuous random variable). Let


X be an absolutely continuous real-valued random variable with c.d.f. F and p.d.f. f . Let
g : R → R be a real-valued function. The expectation of g(X) under F is defined as:
EF [g(X)] ≡ ∫R g(x)f (x) dx
Definition (Expectation of the transformation of a discrete random variable). Let {x1 , x2 , . . .} be
the support of the discrete random variable X. Let g : R → R be a real-valued function. The
expectation of g(X) under the probability mass function p is given by:
EF [g(X)] ≡ Σxi ∈Supp g(xi )p(xi ).
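The discrete definition is a finite sum for finitely supported variables, so it can be computed directly. The sketch below (an illustrative Bernoulli with p = 0.3) evaluates E[g(X)] for two choices of g, recovering the mean p and the variance p(1 − p):

```python
# Expectation of g(X) for a discrete random variable via its p.m.f.
p = 0.3
pmf = {0: 1 - p, 1: p}          # Bernoulli(p): support {0, 1}

def E(g):
    return sum(g(x) * px for x, px in pmf.items())

mean = E(lambda x: x)                    # E[X] = p
var = E(lambda x: (x - mean) ** 2)       # E[(X - mu)^2] = p(1 - p)
assert abs(mean - p) < 1e-12
assert abs(var - p * (1 - p)) < 1e-12
print("Bernoulli mean and variance match p and p(1-p)")
```

The same `E` works for any g on the support, which is the point of defining expectation through transformations g(X).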

Now, some important observations:

1. Mean of a random variable: EF [X] ≡ µ is called the mean or first moment of X. If
EF [|X|] < ∞ then X is said to be integrable.

2. Variance of a random variable: EF [(X−µ)2 ] ≡ σ 2 is called the variance or second centered


moment of X. The mean and the variance of random variables will play an important role

in Weak Laws of Large Numbers and Central Limit Theorems. For certain random variables
the variance need not be finite (the Pareto distribution with parameter α = 2 is an example
of this fact).

3. Jensen’s Inequality: For any convex function g: EF [g(X)] ≥ g(EF [X]). This is an important
result. We will not prove it, but we will use it quite often.

4. Markov’s Inequality: For any random variable X and c > 0, c P(|X| > c) ≤ EF [|X|].

5. k-th Moment: EF [X k ] is called the k-th uncentered moment of the random variable X,
k = 1, 2, . . .. Note that Jensen’s inequality implies that if EF [|X|k ] < ∞ for some k ∈ N then
EF [|X|j ] < ∞ for all j < k. Likewise, if EF [|X|k ] = ∞, then EF [|X|j ] = ∞ for all j > k.

6. Layer Representation of k-th moments: if X is an absolutely continuous random variable
and E[|X|k ] < ∞ then E[|X|k ] = k ∫0^∞ c^(k−1) P(|X| > c) dc.

7. Moment Generating Function: The random variable X is said to have a moment gener-
ating function mX (t) if

mX (t) ≡ EF [exp(tX)] < ∞ for all t ∈ (−ε, ε)

for some ε > 0. Important: Two random variables with the same moment generating
function have the same distribution.

The last question of this week’s problem set will ask you to go over some of these concepts.

Quick Summary of Lectures 1 and 2: We have provided formal definitions of a probability
space and a real-valued random variable.
We defined discrete and absolutely continuous random variables. We used our classification to
provide a definition of the expectation of a function g(X). We defined the mean of a real-valued
random variable (µ), the variance of a real-valued random variable (σ 2 ), the k-th moment of a
real-valued random variable, and at the very end of Lecture 2, the moment generating function.
The moment generating function contains—in a sense that we did not make precise—the same
information as the c.d.f. of a real-valued random variable. Here are some results to keep in mind.

• Two discrete random variables with the same moment generating function have the same
support and the same probability mass function. See Billingsley (1995), p.147, second para-
graph.

• Two positive random variables with the same moment generating function have the same
distribution. See Billingsley (1995), Theorem 22.2. Can you fill in the details of the proof?

Optional: Suppose that FX has a density f . Suppose that mX (t) = E[exp(tX)] is well defined
for every t ∈ R. The question is: how do we recover f from the moment generating function? A
sketch of the answer is the following:

1. Extend the image of the m.g.f. to the complex plane: Let i denote the imaginary unit (i2 = −1). For t ∈ R, define:

φ(t) ≡ mX (it) = E[exp(itX)]

The function φ : R → C is called the characteristic function (in fact, φ is always well
defined, even if the m.g.f. does not exist).

2. Recover the p.d.f. from the characteristic function: It turns out that one can prove the
following equality (or inversion formula):
f (x) = (1/2π) ∫−∞^∞ exp(−itx) φ(t) dt
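The inversion formula f(x) = (1/2π) ∫ exp(−itx) φ(t) dt can be checked numerically for a standard Gaussian, whose characteristic function is φ(t) = exp(−t²/2) (a standard closed form; the truncation and grid below are our illustrative choices):

```python
import cmath
import math

# Invert the N(0, 1) characteristic function back to the N(0, 1) density.
phi = lambda t: math.exp(-t * t / 2)

def f_from_phi(x, T=10.0, n=20_000):
    # (1/2pi) * int_{-T}^{T} exp(-itx) phi(t) dt, via a Riemann sum;
    # phi decays so fast that the truncation error is negligible.
    h = 2 * T / n
    total = sum(cmath.exp(-1j * (-T + k * h) * x) * phi(-T + k * h)
                for k in range(n))
    return (total * h / (2 * math.pi)).real

gauss = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
for x in (0.0, 0.5, 1.5):
    assert abs(f_from_phi(x) - gauss(x)) < 1e-6
print("inversion formula recovers the Gaussian density")
```

The imaginary part of the sum vanishes by symmetry (φ is even and real here), which is why taking `.real` is safe.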

You can find details in Section 26 (pp. 342-349) of Billingsley (1995) or Section 3.3 (pp. 106-111) of
Durrett (2010). The main takeaway (and we will use this result later) is that under certain conditions
two random variables with the same moment generating function have the same distribution
over the Borel σ-algebra in the real line. In this sense, the moment generating function provides a
characterization of the c.d.f.

References
Billingsley, P. (1995): Probability and Measure, John Wiley & Sons, New York, 3rd ed.

Durrett, R. (2010): Probability: Theory and Examples, Cambridge Series in Statistical and
Probabilistic Mathematics, Cambridge University Press, 4th ed.

