
Chapter B

Probability Theory via the Lebesgue Integral
The primary objective of this chapter is to introduce the basic probability model from the measure-theoretic point of view. Consequently, we first start with discussing the idea behind the formal notion of a probability space, and provide a fairly introductory discussion of finite measure theory. A good part of this discussion is likely to be new for the economics student, so our pace is quite leisurely. In particular, we discuss algebras and σ-algebras in detail, pay due attention to Borel σ-algebras, and prove several elementary properties of probability measures. Moreover, we outline the constructions of some useful probability spaces, including those that are induced by distribution functions. As usual, these constructions are achieved by invoking the fundamental extension theorem of Carathéodory. We omit the proof of the existence part of this theorem, but prove its uniqueness part as an application of the Sierpinski Class Lemma. We then introduce the notion of a random variable, and discuss the notion of Borel measurability at some length.
The high point of the chapter is the introduction of the Lebesgue integration theory within the context of finite measure spaces. In fact, we almost exclusively work with probability measures, so the Lebesgue integral for the present exposition is none other than the so-called expectation functional. Our treatment is again leisurely. In particular, we introduce the fundamental convergence theorems for the Lebesgue integral by means of a step-by-step approach. For instance, the Monotone Convergence Theorem is given in four different formulations. First, we prove it for a sequence of nonnegative random variables the pointwise limit of which is real-valued. Then we drop the nonnegativity assumption from the statement of the theorem, and then reintroduce it but this time work with sequences that converge almost surely to an extended real-valued function. Our fourth formulation states the result in full generality. We also study other important properties of the expectation functional, such as its linearity, the change of variables formula, and Jensen's Inequality. The chapter concludes with a brief introduction to the normed linear space of integrable random variables, and other related spaces.
There is, of course, no shortage of truly excellent textbooks on probability theory. In particular, the classic treatments of Billingsley (1986), Durrett (1991), Shiryaev (1996) and Chung (2001) have a scope far more comprehensive than ours. The proofs that we omit here and in the following chapters can be recovered from any one of these books. A more recent reference, which the present author finds most commendable, is Fristedt and Gray (1997).
1 Event Spaces
The most fundamental notion of probability theory is that of a probability measure. Roughly speaking, a probability measure tells us the likelihood of observing any conceivable event in an experiment the outcome of which is uncertain. To formally introduce this concept, however, we need to model the elusive term "conceivable event" in this description; hence the next subsection.
1.1 σ-Algebras
Definition. Given any nonempty set X, let A and Σ be nonempty subsets of 2^X. The class A is called an algebra on X if
(i) X\A ∈ A for all A ∈ A; and
(ii) A ∪ B ∈ A for all A, B ∈ A.
The collection Σ is called a σ-algebra on X if it satisfies (i) and
(iii) ⋃_{i=1}^∞ A_i ∈ Σ whenever A_i ∈ Σ for each i = 1, 2, ....
Any element of Σ is called a Σ-measurable set in X. If Σ is a σ-algebra on X, we refer to the pair (X, Σ) as a measurable space.
In words, an algebra on X is a nonempty collection of subsets of X that is closed under complementation and taking pairwise (and thus finite) unions. It is readily verified that both ∅ and X belong to any algebra A on X, and that an algebra is closed under taking pairwise (and thus finite) intersections. (To prove the first claim, observe that, since A is nonempty, there exists an A ⊆ X in A, and hence X\A belongs to A. Thus X = A ∪ (X\A) ∈ A.) Moreover, a collection Σ of subsets of X is a σ-algebra if it is an algebra and is closed under taking countable unions. By the de Morgan Law, this also implies that Σ is closed under taking countable (finite or infinite) intersections: If C is a nonempty countable subset of Σ, then ⋂C ∈ Σ. It is useful to note that there is no difference between an algebra and a σ-algebra when the ground set X under consideration is finite.¹

¹ These are relatively easy claims, but it is probably a good idea to warm up by proving them. In particular, how do you know that a σ-algebra is actually an algebra?
Before considering some examples, let us provide a quick interpretation of the formal model at hand. Given a nonempty set X and a σ-algebra Σ on X, we think of X as the set of all possible outcomes that may result in an experiment, the so-called sample space, and view any member of Σ (and only such a subset of X) as an "event" that may take place in the experiment. To illustrate, consider the experiment of rolling an ordinary die once. It is natural to take X := {1, ..., 6} as the sample space of this experiment. But what is an "event" here? The answer depends on the actual scenario that one wishes to model. If it is possible to discern the differences between all subsets of X, then we would take 2^X as the σ-algebra of the model, thereby deeming any subset of X as a conceivable event (e.g. {1, 2, 3} would be the event that "a number strictly less than 4 comes up"). On the other hand, the situation we wish to model may call for a different type of an event space. For example, if we want to model the beliefs of a person who will be told after the experiment only whether or not 1 has come up, {1, 2, 3} would not really be deemed as a conceivable event. (If the outcome is 2, one would like to say that {1, 2, 3} has occurred, but given her informational limitation, our individual has no way of concluding this.) Indeed, this person may have an assessment only of the likelihood of 1 coming up in the experiment, so a nontrivial event for her is either "1 comes up" or "1 doesn't come up." Consequently, to model the beliefs of this individual, it makes more sense to choose a σ-algebra like {∅, X, {1}, {2, ..., 6}}. An event in this model would then be one of the four members of this particular collection of sets.
In practice, then, there is some latitude in choosing a particular class of events to endow a sample space X with. However, we cannot do this in a completely arbitrary way. If A is an event, then we need to be able to talk about this event not occurring, that is, to deem the set X\A also as an event. This is guaranteed by condition (i) above. Similarly, we wish to be able to talk about at least one of countably many events occurring, and this is the rationale behind condition (iii) above. In addition, conditions (i) and (iii) force us to view countably many events occurring simultaneously as an event as well. To give an example, consider the experiment of rolling an ordinary die arbitrarily many times. Clearly, we would take X = N^∞ as the sample space of this experiment. Suppose next that we would like to be able to talk about the situation that "in the ith roll of the die, number 2 comes up." Then we would choose a σ-algebra that would certainly contain all sets of the form A_i := {(ω_m) ∈ N^∞ : ω_i = 2}. This σ-algebra must contain many other types of subsets of X. For instance, the situation that "in neither the first nor the second roll 2 turns up" must formally be an event, because {(ω_m) ∈ N^∞ : ω_1, ω_2 ≠ 2} equals (X\A_1) ∩ (X\A_2). Similarly, since each A_i is deemed as an event, a σ-algebra maintains that ⋃A_i ("2 comes up at least once through the rolls") and ⋂A_i ("each roll results in 2 coming up") are considered as events in our model.
In short, given a σ-algebra Σ on X, the intuitive concept of an "event" is formalized as any Σ-measurable set. That is, and mark this, we say that A is an event if and only if A ∈ Σ, and for this reason a σ-algebra on X is often referred to as an event space on X. One may define many different event spaces on a given sample space, so what an "event" really is depends on the model one chooses to work with.
Example 1. [1] 2^X and {∅, X} are σ-algebras on any nonempty set X. The collection 2^X corresponds to the finest event space allowing each subset of X to be deemed as an event.² By contrast, {∅, X} is the coarsest possible event space that allows one to perceive of only two types of events, "nothing happens" and "something happens."

² I have already told you that certain subsets of X may not be deemed as events for an observer with limited information, so 2^X may not always be the relevant event space to endow X with. (I will talk about this issue at greater length when studying the notion of conditional probability in Chapter F.) Apart from this, there are also technical reasons for why one cannot always view 2^X as a useful event space. Roughly speaking, when X is an infinite set, 2^X may be "too large" of a set for one to be able to assign probability numbers to each element of 2^X in a nontrivial way. (More on this in Section 3.5.)
[2] Let X := {a, b, c, d}. None of the collections {∅}, {X}, {∅, X, {a}} and {∅, X, {a}, {b, c, d}, {b}, {a, c, d}} qualify as an algebra on X. On the other hand, each of the collections {∅, X}, {∅, X, {a}, {b, c, d}} and {∅, X, {a}, {b, c, d}, {b}, {a, c, d}, {a, b}, {c, d}} is an algebra on X.
[3] If X is finite and A is an algebra on X, then A is a σ-algebra. So, as noted earlier, the distinction between the notions of an algebra and a σ-algebra disappears in the case of finite sample spaces.
[4] Let us agree to call an interval right-semiclosed if it has the form (a, b] with −∞ ≤ a ≤ b < ∞, or of the form (a, ∞) with a ∈ R. The class of all right-semiclosed intervals is obviously not an algebra on R. But the set A of all finite unions of right-semiclosed intervals, called the algebra induced by right-semiclosed intervals, is an algebra on R. In fact, A is the smallest algebra that contains all right-semiclosed intervals. It is not a σ-algebra. (Proofs?)
[5] A := {S ⊆ N : min{|S|, |N\S|} < ∞} is an algebra on N, but it is not a σ-algebra. Indeed, {i} ∈ A for each odd i ∈ N, but {1, 3, ...} ∉ A.
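Claims like those in [2] can be verified mechanically on a finite ground set. The following Python sketch is an added illustration (the function name is ours): it tests closure under complementation and pairwise unions, which by [3] also settles the σ-algebra question here.

    def is_algebra(X, A):
        # An algebra on X: nonempty, closed under complements and pairwise unions.
        return (len(A) > 0
                and all(X - s in A for s in A)
                and all((s | t) in A for s in A for t in A))

    X = frozenset('abcd')
    sets = lambda *ss: {frozenset(s) for s in ss}

    print(is_algebra(X, sets('', 'abcd', 'a', 'bcd')))               # True
    print(is_algebra(X, sets('', 'abcd', 'a')))                      # False: {b,c,d} missing
    print(is_algebra(X, sets('', 'abcd', 'a', 'bcd', 'b', 'acd')))   # False: {a}∪{b} = {a,b} missing

The last line confirms the fourth collection of [2]: it is closed under complements but not under unions.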
Exercise 1. Let X be a metric space, and let A_1 be the class that consists of all open subsets of X, A_2 the class of all closed subsets of X, and A_3 := A_1 ∪ A_2. Determine if any of these classes is an algebra or a σ-algebra.
Exercise 2. Let X be any nonempty set, and 𝒮 a class of σ-algebras on X.
(a) Show that ⋂𝒮 is a σ-algebra on X.
(b) Give an example to show that ⋃𝒮 need not be an algebra even if 𝒮 is finite.
Exercise 3. Define
A := {A ⊆ N : ((1/n)|A ∩ {1, ..., n}|) is convergent}.
(Note. For any A ∈ A, the number lim (1/n)|A ∩ {1, ..., n}| is called the asymptotic density of A.) True or false: A is an algebra but not a σ-algebra.
Exercise 4. Show that a σ-algebra cannot be countably infinite.
In practice it is not uncommon that we have a pretty good idea about the kinds of sets we wish to consider as events, but we have difficulty in terms of finding a good σ-algebra for the problem, because the collection of sets we have at hand does not constitute a σ-algebra. The resolution is usually to extend the collection of sets which we are interested in to a σ-algebra in a minimal way. (We consider a minimal extension because we wish to depart from our "interesting" sets as little as possible. Otherwise taking 2^X as the event space would trivially solve the problem of extension.) This idea leads us to the following fundamental concept.
Definition. Let X be a nonempty set and A a nonempty subclass of 2^X. The smallest σ-algebra on X that contains A (in the sense that this σ-algebra is included in any other σ-algebra that contains A) is called the σ-algebra generated by A, and is denoted as σ(A).
For example, if X := {a, b, c}, then σ({∅}) = σ({X}) = {∅, X}, σ({∅, X, {a}}) = {∅, X, {a}, {b, c}}, and σ({∅, X, {a}, {b}}) = 2^X. Of course, we have Σ = σ(Σ) for any σ-algebra Σ on any nonempty set.
Does any nonempty class of sets generate a σ-algebra? The answer does not follow readily from the definition above, because it is not self-evident if we can always find a smallest σ-algebra that extends any given nonempty class of sets. Our first proposition, however, shows that we can actually do this, so there is really no existence problem regarding generated σ-algebras.³

³ As you will soon painfully find out, however, the explicit characterization of a generated σ-algebra can be a seriously elusive problem. Just to get a feeling for the difficulties that one may encounter in this regard, try to compute the σ-algebra σ({{a} : a ∈ Q}) on R.
Proposition 1. Let X be a nonempty set and A a nonempty subclass of 2^X. There exists a unique smallest σ-algebra that includes A, so σ(A) is well-defined. We have
σ(A) = ⋂{Σ : Σ is a σ-algebra and A ⊆ Σ}.
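On a finite ground set, the generated σ-algebra can even be computed by brute force, which may help build intuition for Proposition 1. The Python sketch below is an added illustration (the function name is ours): it closes A, together with the trivial events, under complements and pairwise unions until the collection stabilizes; by Example 1.[3], this suffices when X is finite.

    def generate(X, A):
        # Close A (plus the trivial events) under complements and unions;
        # on a finite X this fixed point is exactly sigma(A).
        X = frozenset(X)
        sigma = {frozenset(s) for s in A} | {frozenset(), X}
        while True:
            new = {X - s for s in sigma} | {s | t for s in sigma for t in sigma}
            if new <= sigma:
                return sigma
            sigma |= new

    X = {'a', 'b', 'c'}
    for s in sorted(generate(X, [{'a'}]), key=lambda s: (len(s), sorted(s))):
        print(set(s))
    # Prints set(), {'a'}, {'b', 'c'}, {'a', 'b', 'c'}, matching the example above.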
Exercise 5.ᴴ Prove Proposition 1.
Exercise 6. Does the σ-algebra generated by the algebra of Example 1.[4] include all open sets in R?
Exercise 7.ᴴ Compute σ(A), where A := {S ⊆ R : min{|S|, |R\S|} < ∞}.
1.2 Borel σ-Algebras
Let X be any metric space, and let O_X stand for the set of all open sets in X. The members of O_X are of obvious importance, but unfortunately O_X need not even be an algebra. In metric spaces, then, it is natural to consider the σ-algebra generated by O_X. This σ-algebra is called the Borel σ-algebra on X, and its members are referred to as Borel sets (or in probabilistic jargon, Borel events). Throughout this text, we denote the Borel σ-algebra on a metric space X by B(X). By definition, therefore, we have B(X) = σ(O_X).
Notation. We write B[a, b] for B([a, b]), and B(a, b] for B((a, b]), where −∞ < a < b < ∞.
Example 2. By definition, B(R) = σ(O_R), but one does not actually need all open sets in R for generating B(R). For instance, what if we used instead the class of all open intervals, call it A_1, as a primitive collection and attempt to find σ(A_1)? This would lead us exactly to the σ-algebra σ(O_R)! To see this, observe first that σ(O_R) is obviously a σ-algebra that contains A_1, so that we clearly have σ(A_1) ⊆ σ(O_R). (Recall the definition of σ(A_1)!) To establish the converse containment, remember that every open set in R can be written as the union of countably many open intervals. (Right?) Thus, we have O_R ⊆ σ(A_1). (Why exactly?) But then, since σ(O_R) is the smallest σ-algebra that contains O_R, and σ(A_1) is of course a σ-algebra, we must have σ(O_R) ⊆ σ(A_1). So, we conclude: σ(O_R) = σ(A_1).
In fact, there are all sorts of other ways of generating the Borel σ-algebra on R. For instance, consider the following classes:
A_2 := the set of all closed intervals
A_3 := the set of all closed sets in R
A_4 := the set of all intervals of the form (a, b]
A_5 := the set of all intervals of the form (−∞, a]
A_6 := the set of all intervals of the form (−∞, a).
It is easy to show that all of these collections generate the same σ-algebra:
B(R) := σ(O_R) = σ(A_1) = ··· = σ(A_6).    (1)
We have already showed that σ(O_R) = σ(A_1). On the other hand, for any closed interval [a, b], we have [a, b] = ⋂_{i=1}^∞ (a − 1/i, b + 1/i) ∈ σ(O_R), so we have A_2 ⊆ σ(O_R) so that σ(A_2) ⊆ σ(O_R). Conversely, for any open interval (a, b), we have (a, b) = ⋃_{i=1}^∞ [a + 1/i, b − 1/i] ∈ σ(A_2). So A_1 ⊆ σ(A_2), and it follows that σ(A_1) ⊆ σ(A_2). The rest of the claims in (1) can be proved similarly.
This example shows that different collections of sets might well generate the same σ-algebra. In fact, it is generally true that the Borel σ-algebra on a metric space is also generated by the class of all closed subsets of this space. That is, for any metric space X,
B(X) := σ({O ⊆ X : O is open}) = σ({S ⊆ X : S is closed}).
(Verify!) The following exercises play on this theme a bit more.
Exercise 8. Show that there is a countable subset A of 2^R such that σ(A) = B(R).
Exercise 9. For any n ∈ N, let
A_1 := {⨉_{i=1}^n J_i : J_i is a bounded open interval, i = 1, ..., n},
A_2 := {⨉_{i=1}^n J_i : J_i is a bounded right-closed interval, i = 1, ..., n},
A_3 := {⨉_{i=1}^n J_i : J_i is a bounded closed interval, i = 1, ..., n}.
Show that we have B(R^n) = σ(A_1) = σ(A_2) = σ(A_3).
Exercise 10. Prove: If X is a separable metric space, then B(X) = σ({N_{ε,X}(x) : x ∈ X and ε > 0}).
Exercise 11. For any m ∈ N, and (t_i, B_i) ∈ [0, 1] × B[0, 1], i = 1, ..., m, define
A(t_1, ..., t_m, B_1, ..., B_m) := {f ∈ C[0, 1] : f(t_i) ∈ B_i, i = 1, ..., m},
and
A := {A(t_1, ..., t_m, B_1, ..., B_m) : m ∈ N and (t_i, B_i) ∈ [0, 1] × B[0, 1], i = 1, ..., m}.
Prove that σ(A) = B(C[0, 1]).
The fact that there is often no way of giving an explicit description of a generated σ-algebra is a source of discomfort. Nevertheless, one can usually say quite a bit about σ(A) even without having a specific formula that tells us how its members are derived from those of A. Indeed, in all of the examples (exercises) considered above, we (you) have computed σ(A) by using the definition of the generated σ-algebra directly. The following exercise provides another illustration of this.
Exercise 12.ᴴ Let X be a metric space, and Y a metric subspace of X. Prove that B(Y) = {B ∩ Y : B ∈ B(X)}.
The observation noted in the previous exercise is quite useful. For instance, it implies that the knowledge of B(R) is sufficient to describe the class of all Borel subsets of [0, 1]; we have B[0, 1] = {B ∩ [0, 1] : B ∈ B(R)}. Similarly, B(R^n_+) = {B ∩ R^n_+ : B ∈ B(R^n)}. We conclude with a less immediate corollary.
Exercise 13.ᴴ For any S ∈ B[0, 1] and α ∈ R, show that (S + α) ∩ [0, 1] ∈ B[0, 1].
2 Probability Spaces
We are now ready to introduce the concept of probability measure.⁴

⁴ The origins of probability theory go back to the famous exchange between Blaise Pascal and Pierre Fermat that started in 1654. While Pascal and Fermat were mostly concerned with gambling-type problems, the importance and applicability of the general topic was shortly understood, and the subject was developed by many mathematicians, including Jakob Bernoulli, Abraham de Moivre, and Pierre Laplace. Despite the host of work that took place in the 18th and 19th centuries, however, a universally agreed definition of probability did not appear until 1933. At this date Andrei Kolmogorov introduced the (axiomatic) definition that we are about to present, and set the theory on rigorous grounds, much the same way Euclid has given an axiomatic basis for planar geometry.
Definition. Let (X, Σ) be a measurable space. A function p : Σ → R is said to be σ-additive if
p(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ p(A_i)
for any (A_m) ∈ Σ^∞ with A_i ∩ A_j = ∅ for each i ≠ j. Any σ-additive function p : Σ → R_+ with p(∅) = 0 is called a measure on Σ (or on X, if Σ is clear from the context), and we refer to the list (X, Σ, p) as a measure space. If p(X) < ∞, then p is called a finite measure, and the list (X, Σ, p) is referred to as a finite measure space. In particular, if p(X) = 1 holds, then p is said to be a probability measure, and in this case, (X, Σ, p) is called a probability space.
Definition. Given a metric space X, any measure p on B(X) is called a Borel measure on X, and in this case (X, B(X), p) is referred to as a Borel space. If, in addition, p is a probability measure, then (X, B(X), p) is called a Borel probability space.
Notation. Throughout this text, the set of all Borel probability measures on a metric space X is denoted as P(X).
We think of a probability measure p as a function that assigns to each event (that is, to each member of the σ-algebra that p is defined on) a number between 0 and 1. This number corresponds to the likelihood of the occurrence of that event. The map p is σ-additive in the sense that it is additive with respect to countably many pairwise disjoint events. This additivity property, which is the heart and soul of measure theory, entails several other useful properties for probability measures. For instance, it implies that any probability measure is finitely additive, that is,
p(⋃_{i=1}^m A_i) = ∑_{i=1}^m p(A_i)
for any finite class {A_1, ..., A_m} of pairwise disjoint events. (To verify this, use σ-additivity and the fact that {A_1, ..., A_m, ∅, ∅, ...} is a collection of pairwise disjoint events.) In turn, this implies that, for any events A and B with A ⊆ B, we have p(B\A) = p(B) − p(A), because p(B) = p(A ∪ (B\A)) = p(A) + p(B\A). Some other useful properties of probability measures are given next.
Exercise 14. Let (X, Σ, p) be a probability space, m ∈ N, and let A, B, A_i ∈ Σ, i = 1, ..., m. Prove:
(a) If A ⊆ B, then p(A) ≤ p(B),
(b) p(X\A) = 1 − p(A),
(c) p(⋃_{i=1}^m A_i) ≤ ∑_{i=1}^m p(A_i),
(d) (Bonferroni's Inequality) p(⋂_{i=1}^m A_i) ≥ ∑_{i=1}^m p(A_i) − (m − 1).
Warning. One is often tempted to conclude from Exercise 14.(a) that any subset of an event of probability zero occurs with probability zero. There is a catch here. How do you know that this subset is assigned a probability at all? For instance, let X := {a, b, c}, Σ := {∅, X, {a, b}, {c}} and let p be the probability measure on Σ that satisfies p({c}) = 1. Here, while p({a, b}) = 0, it is not true that p({a}) = 0, since p is not even defined at {a}. This probability space maintains that {a} is not an event.
Note. Those probability spaces for which any subset of an event of probability zero is an event (and hence occurs with probability zero) are called complete. With the exception of a few (optional) remarks, however, this notion will not play an important role in the present exposition.
Exercise 15. (The Exclusion-Inclusion Formula) Let (X, Σ, p) be a probability space, m ∈ N, and A_1, ..., A_m ∈ Σ. Where N_t := {(i_1, ..., i_t) ∈ {1, ..., m}^t : i_1 < ··· < i_t}, t = 1, ..., m, show that
p(⋃_{i=1}^m A_i) = ∑_{i∈N_1} p(A_i) − ∑_{(i,j)∈N_2} p(A_i ∩ A_j) + ∑_{(i,j,k)∈N_3} p(A_i ∩ A_j ∩ A_k) − ··· + (−1)^{m−1} p(⋂_{i=1}^m A_i).
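For a quick sanity check of this formula, one can evaluate both sides on a small concrete space. The Python sketch below is an added illustration (the three events are an arbitrary choice of ours) and uses the fair-die space of Section 1.1.

    from itertools import combinations
    from fractions import Fraction

    X = set(range(1, 7))
    p = lambda S: Fraction(len(S), 6)        # the uniform measure on 2^X

    A = [{1, 2, 3}, {2, 4, 6}, {3, 6}]       # three (arbitrary) events
    m = len(A)

    lhs = p(set.union(*A))
    rhs = sum((-1) ** (t - 1)
              * sum(p(set.intersection(*c)) for c in combinations(A, t))
              for t in range(1, m + 1))
    print(lhs, rhs)                          # both print 5/6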
The following are simple but surprisingly useful observations.
Proposition 2. Let (X, Σ, p) be a probability space, and let (A_m) ∈ Σ^∞. If A_1 ⊆ A_2 ⊆ ··· (in which case we say that (A_m) is an increasing sequence), then
lim p(A_m) = p(⋃_{i=1}^∞ A_i).
On the other hand, if A_1 ⊇ A_2 ⊇ ··· (in which case we say that (A_m) is a decreasing sequence), then
lim p(A_m) = p(⋂_{i=1}^∞ A_i).
Proof. Let (A_m) ∈ Σ^∞ be an increasing sequence. Set B_1 := A_1 and B_i := A_i\A_{i−1}, i = 2, ..., and note that B_i ∈ Σ for each i and ⋃A_i = ⋃B_i. But B_i ∩ B_j = ∅ for any i ≠ j, so, by σ-additivity,
p(⋃_{i=1}^∞ A_i) = p(⋃_{i=1}^∞ B_i) = ∑_{i=1}^∞ p(B_i) = lim_{m→∞} ∑_{i=1}^m p(B_i) = lim_{m→∞} p(⋃_{i=1}^m B_i) = lim_{m→∞} p(A_m).
The proof of the second claim is left as an exercise.
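To see the continuity property at work numerically, consider the following Python sketch (an added illustration; the probability space (N, 2^N, p) with p({i}) = 2^{-i} is our own choice, not one used in the text).

    # p is the probability measure on (N, 2^N) with p({i}) = 2**-i.
    def p(S):
        return sum(2.0 ** -i for i in S)

    # The events A_m = {1, ..., m} increase to N, and p(A_m) = 1 - 2**-m
    # climbs to p(N) = 1, as Proposition 2 predicts.
    for m in (1, 5, 10, 20):
        print(m, p(range(1, m + 1)))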
As an immediate application of this result and Exercise 14.(c), we obtain a basic inequality of probability theory:
Boole's Inequality. For any probability space (X, Σ, p),
p(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ p(A_i) for any (A_m) ∈ Σ^∞.
Proof. Use Proposition 2 and Exercise 14.(c).
Exercise 16.ᴴ Let (X, Σ, p) be a probability space. Show that if (A_m) ∈ Σ^∞ satisfies p(A_i ∩ A_j) = 0 for every i ≠ j, then
p(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ p(A_i).
Exercise 17. Given any probability space (X, Σ, p), show that
p(⋃_{i=1}^∞ A_i) − p(⋃_{i=1}^∞ B_i) ≤ ∑_{i=1}^∞ (p(A_i) − p(B_i))
for all (A_m), (B_m) ∈ Σ^∞ with B_m ⊆ A_m, m = 1, 2, ....
Exercise 18.ᴴ Let (X, B(X), p) be a Borel probability space, and let O_X and C_X denote the class of all open and closed subsets of X, respectively.
(a) Prove that
sup{p(T) : T ∈ C_X and T ⊆ S} = p(S) = inf{p(O) : O ∈ O_X and S ⊆ O}
for any S ∈ B(X).
(b) Show that, if X is σ-compact, that is, it can be written as a union of countably many compact subsets of itself, then
p(S) = sup{p(K) : K is a compact subset of X with K ⊆ S}
for any S ∈ B(X). (Note. Such a Borel probability measure is said to be regular.)
The observations noted in Proposition 2 are often referred to as the continuity (from below and above, respectively) properties of a probability measure.⁵ As the proof of this result makes transparent, these properties are derived directly from the σ-additivity of a probability measure. Indeed, any finite measure on Σ satisfies these properties.

⁵ It is customary to write A_m ↑ ⋃A_i if (A_m) is an increasing sequence of sets, and A_m ↓ ⋂A_i if it is a decreasing one. Thus Proposition 2 states that A_m ↑ ⋃A_i implies p(A_m) → p(⋃A_i), and similarly for decreasing sequences of events. This is the motivation behind the term "continuity" of a probability measure.
Warning. The first claim of Proposition 2 is valid even when p is an infinite measure. (The proof goes through verbatim.) In this case, the validity of the second claim, however, requires the additional assumption that p(A_k) < ∞ for some k. To see the need for this additional hypothesis, consider the measure space (N, 2^N, q), where q(S) := |S|. (q is called the counting measure.) Here, if A_m := {m, m + 1, ...} for each m ∈ N, then ⋂A_i = ∅ and yet q(A_m) = ∞ for each m.
It is useful to observe that a partial converse of this observation is also true. To make this precise, let us agree to refer to a function defined on an algebra A as σ-additive on A if it is additive with respect to any countably many pairwise disjoint events the union of which belongs to A. (Notice that this definition conforms with the way we used the term σ-additive for a measure so far.) It turns out that finite additivity and continuity of a set function imply its σ-additivity.
Proposition 3. Let A be an algebra (on some nonempty set), and q : A → R_+ a finitely additive function such that, for any decreasing sequence (C_m) ∈ A^∞ with ⋂C_i = ∅, we have lim q(C_m) = 0. Then, q is σ-additive on A.
Proof. Take any class {A_m ∈ A : m = 1, 2, ...} such that A_i ∩ A_j = ∅ for each i ≠ j, and A := ⋃A_i ∈ A. Let B_m := ⋃_{i=1}^m A_i, and observe that q(A) = q(A\B_m) + q(B_m) for each m, by finite additivity of q. But (A\B_m) is a decreasing sequence in A with ⋂(A\B_i) = ∅ so that, by hypothesis, lim q(A\B_m) = 0. Therefore, letting m → ∞ in the equation q(A) = q(A\B_m) + q(B_m) and using the finite additivity of q again, we get
q(A) = lim q(B_m) = lim q(⋃_{i=1}^m A_i) = lim_{m→∞} ∑_{i=1}^m q(A_i) = ∑_{i=1}^∞ q(A_i),
and hence the proposition.
In words, a finitely additive set function which is continuous from above at the empty set is σ-additive. As you can now show easily, an analogous result can also be proved in terms of continuity from below.
Exercise 19. Let A := {S ⊆ N : min{|S|, |N\S|} < ∞}, and define p : A → [0, 1] as
p(S) := 1 if |N\S| < ∞, and p(S) := 0 if |S| < ∞.
Show that p is finitely additive, but not σ-additive.
Let us conclude this section with a brief summary. At this point, you should be somewhat comfortable with the notion of probability space (X, Σ, p). In such a space, X stands for the sample space of the experiment being modeled, the set of all outcomes, so to speak. The σ-algebra Σ, on the other hand, tells us which subsets of X can be discerned in the experiment, that is, of which subsets of X we can talk about the likelihood of occurring or not occurring. (But recall that things are not completely arbitrary; by definition of a σ-algebra, there is quite a bit of consistency in what is and is not deemed as an event.) Finally, the probability measure p quantifies the likelihood of the members of Σ. Things hang tight together by the property of σ-additivity; the likelihood of the union of countably many pairwise disjoint events is simply the sum of the individual probabilities of each of these events.
And now, it's time to make things a bit more concrete.
3 Constructing Probability Spaces
3.1 Motivating Examples
Our first example provides the formal description of the canonical probability space whose sample space is finite. This space corresponds to a special case of the more general formulation of what is said to be a simple probability space.
Example 3. Let X be any metric space. The support of any function f ∈ R^X is defined as the set
supp(f) := cl{ω ∈ X : f(ω) > 0}.
Since every finite set is closed in a metric space, we have
supp(f) = {ω ∈ X : f(ω) > 0}
for all f with finite support. On the other hand, we say that f is a simple density function if f ≥ 0, supp(f) is finite and ∑_{ω∈supp(f)} f(ω) = 1. A simple density function f induces a probability measure p_f on a σ-algebra Σ on X as follows:
p_f(S) := ∑_{ω∈supp(f)∩S} f(ω) for any S ∈ Σ.
(Note. If supp(f) ∩ S = ∅, then p_f(S) = 0.) For obvious reasons, such a probability measure is called simple. We denote the set of all simple probability measures by P_s(X), and say that (X, Σ, p_f) is a simple probability space. It is easily seen that any probability space (X, 2^X, p) with |X| < ∞ is a simple probability space. (For, f ∈ [0, 1]^X defined by f(ω) := p({ω}) is a simple density function, and p = p_f.⁶) Since singleton sets are closed in any metric space, (X, 2^X, p) is a Borel probability space, provided that X is a finite set. More generally, we define the support of any Borel probability measure p ∈ P(X), denoted supp(p), as the smallest closed set S such that p(S) = 1. Given this definition, a Borel probability measure is simple iff it has finite support. (Note. Such measures are referred to as simple lotteries in decision theory.)

⁶ That is to say, if X is finite, specifying p on singleton events defines p on the entire 2^X: p(S) = ∑_{ω∈S} p({ω}) for any S ⊆ X.
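As a quick illustration of this construction, the following Python sketch (an addition to the text; the particular density f is an arbitrary choice) computes p_f directly from a simple density function.

    # A simple density on a finite support: nonnegative, and the weights sum to 1.
    f = {'a': 0.5, 'b': 0.3, 'c': 0.2}

    def p_f(S):
        # p_f(S) is the total f-weight of supp(f) ∩ S.
        return sum(w for omega, w in f.items() if omega in S)

    print(p_f({'a', 'b'}))   # 0.8
    print(p_f({'z'}))        # 0.0, since supp(f) ∩ {z} is empty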
Exercise 20. Let X be a nonempty countable set and p : 2^X → R. Prove:
(a) (X, 2^X, p) is a probability space iff there exists an f ∈ [0, 1]^X such that p(S) = ∑_{ω∈S} f(ω).
(b) If (X, 2^X, p) is a probability space such that there exist an ε > 0 and f ∈ [ε, 1]^X with p(S) = ∑_{ω∈S} f(ω) for each S ∈ 2^X, then |X| < ∞.
Exercise 21.ᴴ Let X be a countably infinite set, and take any f ∈ B(X) with f ≥ 0. Show that there exists a g ∈ R^X_+ such that (X, 2^X, p) is a probability space where p : 2^X → R is given by p(S) := ∑_{ω∈S} g(ω)f(ω).
The following example is very important. It shows that constructing non-simple probability measures on an infinite sample space is in general not a trivial matter. We will revisit this example several times throughout the sequel.

Example 4. Consider the experiment of tossing successively k many (fair) coins. Denoting "heads" by 1 and "tails" by 0 (for convenience), the sample space of this experiment can be written as X := {0, 1}^k. Given that X is finite, there is no problem with taking 2^X as the relevant event space. After all, thanks to the finiteness of X, there is a natural way of assigning probabilities to events by using the notion of relative frequency. Hence we define p(S) := |S|/2^k for any event S ∈ 2^X. Of course, p is none other than the simple probability measure induced by the simple density function f(ω) := 1/|X| for each ω ∈ X. If we drop the finiteness assumption, however, things get slightly icy, and it is actually at this point that the use of the formalism of the general probability model kicks in.

Consider the experiment of tossing a (fair) coin infinitely many times. The sample space of this experiment is the sequence space {0, 1}^∞. How do we define events and probabilities here? The problem is that the infinite cardinality of our sample space makes it impossible to use the idea of relative frequency to assign probabilities to all the events that we are interested in. Yet, intuitively, we still want to use the relative frequency interpretation of probability here. For instance, we really want to be able to say that the probability of observing infinitely many tails is 1. Or, what is a bit more problematic, we want to be able to say that after sufficiently many tosses, the probability that the relative frequency of heads tends to 1/2 is large (because the coin is fair).
So, how should we define our event space? Here is the idea. Let us first deal with the "easy" events. For example, consider the set
{(ω_m) ∈ {0, 1}^∞ : ω_1 = a_1, ..., ω_k = a_k},
where k ∈ N and a_1, ..., a_k ∈ {0, 1}. This set is said to be a cylinder set, for it is completely determined by a finite number of its initial elements. This property makes our relative frequency intuition operational. Clearly, we wish to assign probability 1/2^k to this event. More generally, we want to consider as an event any cylinder set, that is, any set of the form {(ω_m) ∈ {0, 1}^∞ : (ω_1, ..., ω_k) ∈ S} where k ∈ N and S ⊆ {0, 1}^k. Since this is the event that the outcome of the first k tosses belongs to S, it is natural to assign to it the probability |S|/2^k. What next? Well, it turns out that this is all we need to do. So far we know that we wish to include the set of all cylinder sets
A := ⋃_{k=1}^∞ {{(ω_m) : (ω_1, ..., ω_k) ∈ S} : S ⊆ {0, 1}^k}
in our event space. (Check that this class is an algebra on {0, 1}^∞.)
So why don't we consider A as the nucleus of our event space, and take the σ-algebra that it generates, σ(A), as the event space for the problem? After all, not only is σ(A) a σ-algebra that differs from A in a minimal way, it also contains all sorts of interesting events that are not contained in A. For instance, in contrast to the collection A, σ(A) maintains that "all tosses after the fifth toss come up heads" is an event, for {(ω_m) : ω_k = 1} is a cylinder set for each k, and hence
{(ω_m) : ω_k = 1 for all k ≥ 6} = ⋂_{k=6}^∞ {(ω_m) : ω_k = 1} ∈ σ(A).    (2)
Similarly, the situation "infinitely many heads come up throughout the experiment" is captured by σ(A) (but not by A). For,
{(ω_m) : ω_k = 1 for infinitely many k} = ⋂_{k=1}^∞ ⋃_{i=k}^∞ {(ω_m) : ω_i = 1} ∈ σ(A).    (3)
(This is not entirely obvious; make sure you verify both of the claims made in (3).) So, you see, there are many non-cylinder events that we can deduce from cylinder sets by taking unions, intersections and complements, and by taking unions, intersections and complements of the resulting sets, and so on. The end point of this process, the explicit description of which cannot really be given, is none other than σ(A).⁷

⁷ Quiz. True or false: {(ω_m) : lim_{k→∞} (1/k) ∑_{i=1}^k ω_i = 1/2} ∈ σ(A).
All this is good; σ(A) certainly looks like a good event space to endow our sample space {0, 1}^∞ with. But it is worrisome that we only know what probabilities to assign to the members of A so far. What do we do about the members of σ(A)\A? (And you are quite right if you suspect that there are very many such sets.) The good news is that we don't have to do anything about them, because the probabilities of these sets are already determined! Read on.
3.2 Carathéodory's Extension Theorem
The stage is now set for the following fundamental theorem of measure theory, which we state here without proof.
Carathéodory's Extension Theorem. Let A be an algebra on a nonempty set X and q : A → R_+. If q is σ-additive on A, then there exists a measure p on σ(A) such that p(A) = q(A) for each A ∈ A. Moreover, if q(X) < ∞, then p is unique.
This is a powerful theorem that allows us to construct a probability measure (uniquely) on a σ-algebra by specifying the behavior of the measure only on the algebra that generates this σ-algebra. Since algebras are often much easier than σ-algebras to work with, Carathéodory's Extension Theorem turns out to be extremely useful in constructing probability measures. For instance, we may apply this theorem to Example 4 where, as discussed above, we know how to assign probabilities to the cylinder subsets of {0, 1}^∞.
More on Example 4. Consider the framework of Example 4, and define q ∈ [0, 1]^A by q({0, 1}^∞) := 1 and
q({(ω_m) : (ω_1, ..., ω_k) ∈ S}) := |S|/2^k
for each k ∈ N and S ⊆ {0, 1}^k. (Is q well-defined?) Now q is easily checked to be finitely additive. Moreover, as we show next, q is continuous from above at the empty set, so we have the following fact:
Claim. q is a σ-additive function on A.
Proof of Claim. Take any decreasing sequence (A_m) of cylinder sets in {0, 1}^∞ with ⋂A_i = ∅. Since {0, 1}^∞ is compact (why?), and each A_i is closed in {0, 1}^∞, ⋂A_i = ∅ implies that the class {A_1, A_2, ...} cannot have the finite intersection property. Thus ⋂_{i=1}^M A_i = ∅ for some M ∈ N. It follows that, for any m ≥ M, we have A_m ⊆ A_M = ⋂_{i=1}^M A_i = ∅ so that lim q(A_m) = q(∅) = 0. Applying Proposition 3 completes the argument.
The stage is now set for Carathéodory's Extension Theorem. Applying this theorem, we actually find a unique probability measure p on σ(A) which agrees with q on each cylinder set. In turn, this solves nicely the problem of finding the "right" probability space for the experiment of Example 4.⁸

⁸ This example attests to the usefulness of the notion of σ-algebra. Suppose you instead designated 2^{{0,1}^∞} as the event space of this experiment. How would you define the probability of an arbitrary set in {0, 1}^∞? Quiz. Show that 2^{{0,1}^∞} ≠ σ(A) in the example at hand.
But, how does p attach probabilities to every event in σ(A)? Unfortunately, a complete answer would require us to go through the proof of Carathéodory's Extension Theorem in this particular context, and we wish to avoid this at this stage. However, it is not difficult to find the probability of at least some non-cylinder events in our experiment. For instance, let us compute the probability of the event that all tosses after the fifth toss come up heads (recall (2)). By using Proposition 2, this is done easily. Defining the cylinder sets A_k := {(ω_m) : ω_6 = ··· = ω_k = 1} for each k ≥ 6, and using that proposition, we get
p(⋂_{k=6}^∞ {(ω_m) : ω_k = 1}) = lim_{k→∞} p(A_k) = lim_{k→∞} q(A_k) = lim_{k→∞} 2^5/2^k = 0,
a very agreeable finding. We can use a similar technique to compute the probability of many other interesting events. For instance, in the case of the event given in (3), we have
p(⋂_{k=1}^∞ ⋃_{i=k}^∞ {(ω_m) : ω_i = 1}) = lim_{k→∞} p(⋃_{i=k}^∞ {(ω_m) : ω_i = 1})
  = 1 − lim_{k→∞} p({(ω_m) : ω_i = 0 for all i ≥ k})
  = 1,
once again quite an intuitive observation.
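The limit computation above is easy to watch numerically; the Python sketch below (an added illustration) just tabulates q(A_k) = 2^5/2^k.

    # q(A_k) for A_k = {omega : omega_6 = ... = omega_k = 1}: the first five
    # coordinates are free, so |S| = 2**5 out of 2**k equally likely prefixes.
    for k in (6, 10, 20, 40):
        print(k, 2 ** 5 / 2 ** k)
    # 0.5, 0.03125, 3.05e-05, 2.9e-11: q(A_k) -> 0, matching lim p(A_k) = 0 above.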
Exercise 22. Consider the probability space ({0, 1}^∞, σ(A), p) we have constructed above for the experiment of tossing infinitely many fair coins. Show that the statements "at least one head comes up after the tenth toss," "only heads come up after finitely many tosses," and "a tail comes up at every even toss," are formally captured as events in the model at hand. Compute the probability of each of these events.
Exercise 23. Let (X, Σ, q) be a probability space, and S a subset of X with S ∉ Σ.
(a) Show that
σ(Σ ∪ {S}) = {(S ∩ A) ∪ ((X\S) ∩ B) : A, B ∈ Σ}.
(b) By using Carathéodory's Extension Theorem, show that there is a probability measure p on σ(Σ ∪ {S}) such that p(A) = q(A) for each A ∈ Σ.
(c) Do part (b) without using Carathéodory's Extension Theorem.
3.3 The Lebesgue-Stieltjes Probability Measure
We next move to another example in which we again use Carathéodory's Extension Theorem to construct the "right" probability space. This example will play a fundamental role in much of what follows.
Let us first recall the notion of distribution function.
Definition. A map F : R → [0, 1] is said to be a distribution function if it is increasing, right-continuous and we have F(−∞) = 0 = 1 − F(∞).⁹

⁹ Notation. F(−∞) := lim_{t→−∞} F(t) and F(∞) := lim_{t→∞} F(t). Moreover, for any real number a, F(a−) denotes the left-limit of F at a, that is, F(a−) := lim_{m→∞} F(a − 1/m). The expression F(a+) is understood similarly.

Exercise 24. Show that a distribution function can have at most countably many discontinuity points. Also show that if a distribution function is continuous, then it must be uniformly continuous.
Example 5. Let A be the algebra induced by the right-semiclosed intervals (Example 1.[4]). Let F be a distribution function. Define the map q ∈ [0, 1]^A as follows:
(α) If −∞ ≤ a ≤ b < ∞, then q((a, b]) := F(b) − F(a);
(β) If a ∈ R, then q((a, ∞)) := 1 − F(a); and
(γ) If A_1, ..., A_m are finitely many disjoint intervals in A, then
q(⋃_{i=1}^m A_i) := ∑_{i=1}^m q(A_i).¹⁰

¹⁰ How do we know that q is well-defined? Since a right-semiclosed interval can be written as a finite union of other right-semiclosed intervals, we have at the moment two different ways of computing the probability of such intervals, which may be, in principle, distinct from each other. But things work out fine here. For instance, it is immediately verified that q((−∞, b]) = q((−∞, b − 1] ∪ (b − 1, b]) for any b ∈ R. One may easily generalize this example to verify that q is well-defined.

So far so good. Once again our problem is to find a probability measure on σ(A) that accords with q, which is a potentially difficult problem. Moreover, what if there are two such measures, which one should we choose? Carathéodory's Extension Theorem deals with these issues at one stroke. If we can show that q is σ-additive on A, we can conclude that q actually defines the probability measure that is "induced by F" on the Borel σ-algebra B(R). After all, Carathéodory's Extension Theorem would then say that there is a unique extension of q to a probability measure on σ(A) = B(R). This probability measure, denoted p_F, is called the Lebesgue-Stieltjes probability measure induced by F on R.
We are not done though; we still have to establish the σ-additivity of q. The strategy of attack is identical to that in the case of Example 4. Since q is obviously finitely additive, it is enough to establish the continuity of q from above at the empty
set (Proposition 3). To this end, take a decreasing sequence (A_m) in A such that ⋂A_i = ∅. Take an arbitrary ε > 0, and fix some index i ∈ N. Since A_i equals the union of finitely many right-closed intervals, and F is right-continuous, one can show that we can find a bounded set B_i in A such that cl_R(B_i) ⊆ A_i and q(A_i) − q(B_i) < ε/2^i. (Proof. Exercise.) The boundedness of B_i implies that cl_R(B_i) is closed in R̄. But R̄ is a compact metric space (yes?), and ⋂ cl_R(B_i) = ∅. It follows that ⋂_{i=1}^M cl_R(B_i) = ∅ for some sufficiently large positive integer M. (Why?) Consequently, for each m ≥ M, we have ⋂_{i=1}^m B_i = ∅, and therefore,
q(A_m) = q(A_m \ ⋂_{i=1}^m B_i) ≤ q(⋃_{i=1}^m (A_i\B_i))
since A_1 ⊇ ··· ⊇ A_m. We are almost done. All we need to observe now is that q(⋃_{i=1}^m (A_i\B_i)) ≤ ∑_{i=1}^m q(A_i\B_i).¹¹ From this, it follows that
q(A_m) ≤ ∑_{i=1}^m (q(A_i) − q(B_i)) < ∑_{i=1}^m ε/2^i < ε for all m ≥ M.
Conclusion: For any ε > 0, there exists an M ∈ N such that q(A_m) < ε for all m ≥ M, that is, lim q(A_m) = 0.

¹¹ Where does this come from? Well, from Exercise 14.(c), or better, from
Boole's Inequality for Finitely Additive Set Functions: If C is an algebra and r : C → [0, 1] is finitely additive, then r(⋃_{i=1}^m C_i) ≤ ∑_{i=1}^m r(C_i) for any m ∈ N and C_1, ..., C_m ∈ C.
The proof is easy. For m = 2, let D = C_1 ∩ C_2 and use finite additivity to get
r(C_1 ∪ C_2) = r((C_1\D) ∪ D ∪ (C_2\D))
  = r(C_1\D) + r(D) + r(C_2\D)
  = r(C_1) − r(D) + r(D) + r(C_2) − r(D)
  ≤ r(C_1) + r(C_2).
The rest follows by induction.
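Before moving on, here is a small Python sketch of the map q from Example 5 (an added illustration; the specific distribution function, the uniform one on [0, 1], is our own choice).

    def F(t):
        # Distribution function of the uniform distribution on [0, 1].
        return min(max(t, 0.0), 1.0)

    def q(a, b):
        # q((a, b]) := F(b) - F(a), as in clause (alpha) of Example 5.
        return F(b) - F(a)

    print(q(0.2, 0.5))                  # 0.3 (up to float rounding)
    print(q(-3.0, 0.0) + q(0.0, 1.0))   # 1.0 = q((-3, 1]), consistent with (gamma)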
A major upshot of Example 5 is the following: One can always define a probability measure on the reals by means of a distribution function. Interestingly, the converse of this is also true. That is to say, any Borel probability measure p on R arises this way. Indeed, for any such p, the map t ↦ p((−∞, t]) on R is a distribution function. This means that, on R, a Borel probability measure can actually be identified with a distribution function.
Exercise 25.ᴴ Show that, for any p ∈ P(R), the map F_p : t ↦ p((−∞, t]) is a distribution function. Moreover, prove that
p({t}) = F_p(t) − F_p(t−) for any t ∈ R,
so F_p is continuous at t iff p({t}) = 0.
We have constructed in Example 5 the Lebesgue-Stieltjes measure induced by a distribution function F on the entire R. The analogous construction works for any interval in R. For instance, if X := (a, b] with −∞ < a < b < ∞, and F ∈ [0, 1]^X is an increasing and right-continuous function with F(a+) = 0 and F(b) = 1, then we can define the Lebesgue-Stieltjes probability measure induced by F on (a, b] by using precisely the approach developed in Example 5. It is not difficult to show that this measure is the restriction of the Lebesgue-Stieltjes probability measure induced by the distribution function G on R to B(a, b], where G is the (unique) distribution function with G|_X = F.
We should also note that F(b) = 1 amounts only to a normalization here. Indeed, the argument outlined in Example 5 works for any right-continuous and increasing F ∈ R^{(a,b]} such that F(b) > F(a). (The only modification needed in the argument given in Example 5 is that we now consider only the right-closed intervals that are in (a, b] when defining q via F and set q(X) = F(b) − F(a).) Of course, the resulting (unique) measure p, now called the Lebesgue-Stieltjes measure induced by F on (a, b], is not a probability measure (unless F(b) − F(a) = 1). This measure rather assesses the "measure" of the space X as F(b) − F(a).
3.4 The Lebesgue Measure
Even though we focus on probability measures throughout this text, we need to consider at least one infinite measure, which is, geometrically speaking, the natural measure of the real line. To introduce this measure, take any i ∈ Z, let X_i := (i, i + 1], and define F_i : X_i → [0, 1] by F_i(t) := t − i. Let ℓ_i denote the Lebesgue-Stieltjes measure induced by F_i on X_i for each i ∈ Z. We define the Lebesgue measure ℓ on B(R) by
ℓ(S) := ∑_{i∈Z} ℓ_i(S ∩ X_i).
Clearly, this measure agrees with each ℓ_i on X_i in the sense that ℓ(S) = ℓ_i(S) for any S ∈ B(X_i), i ∈ Z. Moreover, it assigns to any interval in R its length as its measure.¹²

¹² Quiz. Why is ℓ well-defined?
Exercise 26.ᴴ Prove that (R, B(R), ℓ) is a measure space such that
ℓ((a, b]) = b − a, −∞ < a < b < ∞.
The restriction of ℓ to any Borel subset X of R is a Borel measure on X, that is, (X, B(X), ℓ|_{B(X)}) is a Borel measure space for any X ∈ B(R). For brevity, we denote this measure space simply as (X, B(X), ℓ) in what follows. For instance, ([0, 1], B[0, 1], ℓ) is a Borel probability space.
Let us establish a few elementary facts about the Lebesgue measure. First of all, what is the Lebesgue measure of a singleton set? The answer is:
ℓ({a}) = ℓ(⋂_{m=1}^∞ (a − 1/m, a]) = lim_{m→∞} ℓ((a − 1/m, a]) = lim_{m→∞} 1/m = 0
for any real number a. (Why the second equality?) Consequently, any singleton set in R has Lebesgue measure zero. In fact, any countable set has this property, since by σ-additivity of ℓ, we have
ℓ({a_1, a_2, ...}) = ℓ(⋃_{i=1}^∞ {a_i}) ≤ ∑_{i=1}^∞ ℓ({a_i}) = 0
for any (a_m) ∈ R^∞. For instance, we have ℓ(Q) = 0.¹³

¹³ While various attempts of formulating (what we now call) the Lebesgue measure were made prior to the contributions of Emile Borel and Henri Lebesgue (in their respective doctoral theses of 1894 and 1902), these attempts were not brought to their fruition precisely because they too assigned measure zero to countably infinite sets, an implication that was deemed absurd by the mathematical community of the day. (Even the otherwise revolutionary Cantor was no exception to this.) In succession, Borel and Lebesgue set the theory on a completely rigorous foundation, and as the structure of countable sets were better understood in time, it was eventually accepted that an infinite set can be deemed very "small," in fact negligible, from the measure-theoretic perspective. (See Hawkins (1980), especially pp. 172-180, for a beautiful survey on the origins of the theory of the Lebesgue measure and integral.)
The situation may at first seem somewhat reminiscent of Cantor's countability theory viewing countably infinite sets "smaller" than uncountable sets, but this is misleading. For, there are in fact uncountable sets in [a, b] which have Lebesgue measure zero. While such sets are somewhat esoteric, and will not concern us here, you should make note of the fact that the relative "size" of a subset of the real line from the countability and measure perspectives may well be radically different.
3.5 More on the Lebesgue Measure
The previous subsection contains just about all you need to know about the Lebesgue measure to follow the subsequent development. So, if you wish to get to the core of probability theory right away, you may proceed at this point directly to Section 5. The present subsection aims at completing the above discussion by going over a few highlights of the Lebesgue measure theory. The presentation takes place mostly by means of exercises.
Exercise 27.ᴴ (Translation Invariance of ℓ) For any (S, α) ∈ B(R) × R, show that ℓ(S + α) = ℓ(S).
Exercise 28.ᴴ (a) (Non-atomicity of ℓ) Show that, for any A ∈ B(R) with ℓ(A) > 0, there exists a B ∈ B(R) with B ⊆ A and 0 < ℓ(B) < ℓ(A).
(b) Give an example of a measure on B(R) which does not possess either of the properties mentioned in part (a) and Exercise 27.
It is worth noting that the probability space ([0, 1], B[0, 1], ℓ) is not complete, that is, there are ℓ-null sets A in B[0, 1] such that B ∉ B[0, 1] for some B ⊆ A. (Here by an ℓ-null set A, we mean any Borel subset A of [0, 1] with ℓ(A) = 0.) However, we can complete this space in a straightforward manner. Define
L[0, 1] := {S ∪ B : S ∈ B[0, 1] and B is a subset of an ℓ-null event}.
(Any member of L[0, 1] is said to be a Lebesgue measurable set.) Now define ℓ̄(S ∪ B) := ℓ(S) for any S ∈ B[0, 1] and any subset B of an ℓ-null event. Then ([0, 1], L[0, 1], ℓ̄) is a complete probability space.¹⁴ This space, called the Lebesgue probability space, extends ([0, 1], B[0, 1], ℓ) in the sense that B[0, 1] ⊆ L[0, 1] and ℓ̄|_{B[0,1]} = ℓ. Moreover, it is the smallest such extension in the sense that if ([0, 1], Σ, μ) is any complete probability space with B[0, 1] ⊆ Σ and μ|_{B[0,1]} = ℓ, then L[0, 1] ⊆ Σ. Curiously, L[0, 1] is much larger than B[0, 1].¹⁵ And yet, there are still sets in [0, 1] which do not belong to L[0, 1], that is, there are sets that are not Lebesgue measurable. The following exercise walks you through a proof of this fact.

¹⁴ Quiz. Prove!
¹⁵ The cardinality of L[0, 1] is strictly larger than that of B[0, 1]. (The cardinality of B[0, 1] is the same as that of R.) This is clearly not the right place to prove these facts. If you are interested, have a look at Hewitt and Stromberg (1965), pp. 133-134.
Exercise 29.ᴴ For any set S in [0, 1] and any α ∈ [0, 1], let us agree to write S ⊕ α for the set {t ∈ [0, 1] : t = s + α (mod 1) and s ∈ S}.¹⁶
(a) Show that S ⊕ α is Lebesgue measurable if S is Lebesgue measurable, and in this case ℓ̄(S ⊕ α) = ℓ̄(S).
Now define the equivalence relation ∼ on [0, 1] by ω ∼ ω′ iff ω − ω′ ∈ Q. Use the Axiom of Choice to select exactly one element from each of the induced equivalence classes, and denote the resulting collection by S. Enumerate next the rationals in [0, 1] as {r_1, r_2, ...}, and define S_m := S ⊕ r_m for each m.
(b) Show that {S_1, S_2, ...} is a partition of [0, 1].
(c) Use parts (a) and (b) to conclude that we would have ℓ̄([0, 1]) ∈ {0, ∞} if S was Lebesgue measurable. Thus, S cannot be Lebesgue measurable.¹⁷
(d) Prove Vitali's Theorem: There is no probability space ([0, 1], 2^{[0,1]}, p) such that p(S ⊕ α) = p(S) for all S ⊆ [0, 1] and α ∈ [0, 1].

¹⁶ For any a, b ∈ [0, 1], a + b (mod 1) equals a + b if a + b ≤ 1, and a + b − 1 otherwise.
¹⁷ More generally, every set of positive Lebesgue measure in [0, 1] (or in R) contains a Lebesgue nonmeasurable subset. The present proof, which is due to Giuseppe Vitali, can easily be modified to establish this stronger statement. Thomas (1985) provides an alternative proof that derives from basic graph theory.
Note. Lebesgue nonmeasurable sets cannot be found by the finite constructive method. Loosely said, Solovay (1970) has shown that the existence of such a set in [0, 1] cannot be proved (within the axiomatic system of standard set theory) without invoking the Axiom of Choice. (If you're interested in these sort of things, you may want to read the expository account of Briggs and Schaffter (1979).)
Vitali's Theorem shows that the use of probability spaces that take as the event space the power set of the sample space may sometimes be seriously limited. Insight: The σ-algebra technology is indispensable for the development of probability theory.
4 The Sierpinski Class Lemma
In this short section we provide a proof of the uniqueness part of Carathéodory's Extension Theorem. As you will see later, the technique we will introduce for this purpose is useful in a good number of other occasions as well.
Let us agree to call a class S of subsets of a given nonempty set X a Sierpinski class (or for short, an S-class) on X, provided that
(i) if A, B ∈ S and A ⊆ B, then B\A ∈ S, and
(ii) if A_1, A_2, ... ∈ S and A_1 ⊆ A_2 ⊆ ···, then ⋃A_i ∈ S.
The smallest S-class on X that contains a given class of subsets of X, say A, is called the S-class generated by A, and is denoted by s(A). It is not difficult to verify that such a set exists. Indeed, we have
s(A) = ⋂{C ⊆ 2^X : A ⊆ C and C is an S-class on X}.
This follows from the fact that the intersection of any collection of S-classes on X is again an S-class on X.
The following (easy) exercise reports a useful observation about S-classes that we will need shortly.
Exercise 30.ᴴ Let X be a nonempty set, and S an S-class on X. Prove that if X ∈ S and S is closed under taking finite intersections, then S must be a σ-algebra.
Here comes a major result that we shall later use again and again.
The Sierpinski Class Lemma.¹⁸ Let X be a nonempty set, and let A ⊆ 2^X be closed under taking finite intersections, with X ∈ A. If S is an S-class on X such that A ⊆ S, then σ(A) ⊆ S.

¹⁸ This result is often referred to as Dynkin's π-λ Theorem (where an S-class is instead called a λ-system). However, historically speaking, I think it is more suitable to use the terminology we adopt here, for even a stronger result is proved by Sierpinski (1928), albeit in a non-probabilistic context. (I learned this from Bert Fristedt.)

Proof. Let S_0 := s(A). Obviously, X ∈ S_0. Thus, by Exercise 30, if we can show that S_0 is closed under taking finite intersections, then we can conclude that S_0 is a σ-algebra. From this it would follow that σ(A) ⊆ S_0 ⊆ S, as we seek.¹⁹

¹⁹ So, my objective is to derive the statement
A ∩ B ∈ S_0 for all (A, B) ∈ S_0 × S_0,
from the statement
A ∩ B ∈ S_0 for all (A, B) ∈ A × A,
which is true by hypothesis. Watch out for a very pretty trick! I will first prove the intermediate statement
A ∩ B ∈ S_0 for all (A, B) ∈ S_0 × A,
and then deal the final blow by using this intermediate step.
Define
S_1 := {A ⊆ X : A ∩ B ∈ S_0 for all B ∈ A}.
By hypothesis, we have A ⊆ S_1. Moreover, S_1 is an S-class on X. Indeed, if A, C ∈ S_1 and A ⊆ C, then
(C\A) ∩ B = (C ∩ B)\(A ∩ B) ∈ S_0 for all B ∈ A,
and if A_1, A_2, ... ∈ S_1 and A_1 ⊆ A_2 ⊆ ···, then
(⋃_{i=1}^∞ A_i) ∩ B = ⋃_{i=1}^∞ (A_i ∩ B) ∈ S_0 for all B ∈ A.
It follows that S_0 ⊆ S_1, that is, A ∩ B ∈ S_0 for all (A, B) ∈ S_0 × A.
Now define
S_2 := {B ⊆ X : A ∩ B ∈ S_0 for all A ∈ S_0}.
By what is established in the previous paragraph, we have A ⊆ S_2. But, again, one can easily check that S_2 is an S-class on X. Therefore, S_0 ⊆ S_2, that is, A ∩ B ∈ S_0 for all A, B ∈ S_0. By induction, it follows that S_0 is closed under taking finite intersections, and we are done.
What is the point of all this? Well, the idea is the following. If we learned somehow that a property holds for all sets in a class A which contains the sample space and is closed under taking finite intersections, and if, in addition, we managed to show that the class of all sets for which this property is true is an S-class, then we may use the Sierpinski Class Lemma to conclude that all sets in the σ-algebra generated by A actually belong to the latter class, and hence satisfy the property in question. Since it is usually easier to work with S-classes rather than σ-algebras, this observation may, in turn, provide help when one needs to go from a given set to the σ-algebra generated by that set. To illustrate, consider the following claim:
Proposition 4. Let X and A be as in the Sierpinski Class Lemma. If p and q are two finite measures on σ(A) such that p|_A = q|_A, then p = q.
A moment's reflection will show that this is even stronger than the uniqueness part of Carathéodory's Extension Theorem (for we do not require here A to be an algebra).
How does one prove something like this? Let's use the idea outlined informally above. Define S := {S ⊆ X : p(S) = q(S)}. Using Proposition 2, it is easy to verify that S is an S-class on X. But, by hypothesis, the property p(S) = q(S) holds for all S in A. By the Sierpinski Class Lemma, then, σ(A) ⊆ S, that is, p(S) = q(S) holds for all S ∈ σ(A), and we are done. (Nice trick, no?)
Warning. The uniqueness result reported in Proposition 4 is not valid for infinite measures in general. However, if X can be written as a countable union of disjoint sets X_i, and p(X_i) = q(X_i) < ∞ for each i, then Proposition 4 applies even though p(X) = q(X) = ∞.
We conclude by noting that closedness of A under taking finite intersections is crucial for Proposition 4. To see this, let X := {a, b, c, d} and A := {{a, b}, {b, c}} so that σ(A) = 2^X. Now let p be the probability measure on 2^X that assigns probability 1/2 to the outcomes b and d, and let q be the probability measure that assigns probability 1/2 to the outcomes a and c. Clearly, p and q are probability measures on 2^X with p = q on A but p ≠ q in general. (Compare with Proposition 4.) What goes wrong here is that the Sierpinski Class Lemma does not work when A is not closed under taking finite intersections. Indeed, S = {∅, {a, b}, {b, c}, {c, d}, {a, d}, X} is a superset of A which is an S-class, and yet we have σ(A) = 2^X ⊄ S in this example. Notice that the problem would disappear if we replaced A with A′ = {{a, b}, {b, c}, {b}, X}. Since A′ is closed under taking finite intersections, by the Sierpinski Class Lemma, any two probability measures on 2^X that agree on A′ must agree on σ(A′) = 2^X.²⁰

²⁰ For concreteness, here is a direct proof. Let p and q be two such probability measures. Then p and q agree on both {a, b} and {b} so that p({a}) = p({a, b}) − p({b}) = q({a, b}) − q({b}) = q({a}). One can similarly show that p({c}) = q({c}). Finally, these measures agree on {d} as well, because p({d}) = p(X) − ∑_{t∈{a,b,c}} p({t}) = q(X) − ∑_{t∈{a,b,c}} q({t}) = q({d}).
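The failure of agreement to propagate can be checked directly; the Python sketch below (an added illustration of the example above) verifies that p and q agree on A yet differ on {b}, the intersection of the two members of A.

    p_mass = {'a': 0.0, 'b': 0.5, 'c': 0.0, 'd': 0.5}   # p: mass 1/2 on b and d
    q_mass = {'a': 0.5, 'b': 0.0, 'c': 0.5, 'd': 0.0}   # q: mass 1/2 on a and c

    def measure(mass, S):
        return sum(mass[t] for t in S)

    for S in [{'a', 'b'}, {'b', 'c'}]:                     # the class A
        print(measure(p_mass, S), measure(q_mass, S))      # 0.5 0.5 twice
    print(measure(p_mass, {'b'}), measure(q_mass, {'b'}))  # 0.5 0.0: they
    # disagree on {a,b} ∩ {b,c} = {b}, which is exactly where the missing
    # intersection-closedness of A bites.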
Exercise 31. If X is a metric space and p, q ∈ P(X) with p(O) = q(O) for all open subsets O of X, then p = q. Give two proofs of this, one that uses the uniqueness part of Carathéodory's Extension Theorem, and another that uses the Sierpinski Class Lemma.
Exercise 32.ᴴ Let X be a nonempty set. A class M ⊆ 2^X is said to be a monotone class on X if, for any (A_m) ∈ M^∞, A_1 ⊆ A_2 ⊆ ··· implies ⋃A_i ∈ M and A_1 ⊇ A_2 ⊇ ··· implies ⋂A_i ∈ M.
(a) Show that a monotone class on X which is an algebra is a σ-algebra on X.
(b) Show that if A ⊆ 2^X is an algebra, the smallest monotone class that contains A, denoted as m(A), must be an algebra on X.
(c) (Halmos) Prove the Monotone Class Lemma: If A is an algebra on X, and if A′ is a monotone class on X, then A ⊆ A′ implies σ(A) ⊆ A′.
(d) Prove the uniqueness part of Carathéodory's Extension Theorem by using the Monotone Class Lemma.
20
For concreteness, here is a direct proof. Let p and q be two such probability measures. Then p
and q agree on both {a, b} and {b} so that p({a}) = p({a, b}) p({b}) = q({a, b}) q({b}) = q({a}).
One can similarly show that p({c}) = q({c}). Finally, these measures agree on {d} as well, because
p({d}) = p(X)

t{a,b,c}
p({t}) = q(X)

t{a,b,c}
q({t}) = q({d}).
24
5 Random Variables

One is often interested in a particular characteristic of the outcome of a random experiment. To deal with such situations we need to transform a given probability space (that models the mother experiment) to another probability space the sample space of which is a subset of R (or a more complex metric space). This transformation is done by means of a random variable. For instance, consider the experiment of tossing (independently) two fair dice, and suppose that for some reason we are interested in the sum of the faces of these dice. We could model the mother experiment here by means of the probability space (X, 2^X, p), where X := {1, ..., 6}² and p(S) := |S|/36 for any S ∈ 2^X. On the other hand, this is not immediately useful, for we are interested in the experiment only insofar as its implications for the sum of the faces of the two dice are concerned. To obtain the probability space that is tailored for our purposes here, we would use the map x : X → {2, ..., 12} which is defined by x(i, j) := i + j. (This map is an example of a random variable.) Indeed, the probability space we are after is none other than (Y, 2^Y, q), where Y = {2, ..., 12} and q(S) := p({ω ∈ X : x(ω) ∈ S}) for each S ∈ 2^Y. Of course, we could get to this space directly by defining q(S) := Σ_{a∈S} (1/36)(6 − |7 − a|) for each S ∈ 2^Y, but, as you will see, the previous method is far superior.
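Before moving on, here is a small Python sketch of our own illustrating the two routes to q in the dice example; both the pushforward computation and the counting formula give the same measure:

    # q two ways: as the pushforward of p under x, and via the count 6 - |7 - a|.
    from fractions import Fraction

    X = [(i, j) for i in range(1, 7) for j in range(1, 7)]
    p = {w: Fraction(1, 36) for w in X}      # the uniform measure on X
    x = lambda w: w[0] + w[1]                # the random variable x(i, j) = i + j

    # induced measure: q({a}) = p(x^{-1}({a}))
    q_induced = {a: sum(p[w] for w in X if x(w) == a) for a in range(2, 13)}
    # direct counting formula
    q_direct = {a: Fraction(6 - abs(7 - a), 36) for a in range(2, 13)}

    print(q_induced == q_direct)             # True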
None of this is really new to you, so let us move on to the formal development.
5.1 Random Variables as Measurable Functions
Here is the formal definition of a random variable.

Definition. Let (X, Σ) be a measurable space. A mapping x : X → R such that x^{-1}(B) ∈ Σ for every Borel subset B of R is called a random variable on (X, Σ). More generally, if Y is a metric space, and x is a map from X into Y such that x^{-1}(B) ∈ Σ for every B ∈ B(Y), then x is called a Y-valued random variable on (X, Σ).²¹

²¹ Thus, by convention, I call an R-valued random variable simply a random variable.

Notation. In this book the set of all random variables on a measurable space (X, Σ) is denoted as RV(X, Σ). Moreover, we define

RV_+(X, Σ) := {x ∈ RV(X, Σ) : x ≥ 0}.

(This notation is not standard in the literature.)

A few remarks on terminology are in order.

Remark 1. [1] One often talks about a random variable on (X, Σ, p), but, strictly speaking, this means that x is a random variable on (X, Σ). Indeed, the measure p does not play any role in the definition of a random variable. As will become clear shortly, p is instead used to assign probabilities to events that are defined through a random variable on (X, Σ).

[2] In real analysis, what we refer to here as a random variable on (X, Σ) is called a Σ-measurable real function on X. Furthermore, a random variable on a Borel probability space is said to be Borel measurable. While we mostly stick with the probabilistic jargon in this book, you should nonetheless familiarize yourself with this alternative terminology, since it is widely used elsewhere. (In fact, we too will occasionally use it in later chapters.)

[3] Many authors refer to an R^n-valued random variable as a random n-vector if n ≥ 2. Analogously, an R^∞-valued random variable may be called a random real sequence, a B[0, 1]-valued random variable a random bounded map on [0, 1], and so on. More generally, Resnick (1999) refers to a Y-valued random variable (for any metric space Y) as a random element of Y. None of these terminologies is standard, however.

Let (X, Σ) be a measurable space and Y a metric space. In principle, to verify that a map x : X → Y is a Y-valued random variable, we need to show that x^{-1}(B) ∈ Σ for every B ∈ B(Y). Fortunately, there is a redundancy in this definition. Indeed, if, for any class A of Borel subsets of Y that generates B(Y), we have

x^{-1}(A) ∈ Σ for every A ∈ A,

then we may conclude that x is a Y-valued random variable. For instance, if

x^{-1}(O) ∈ Σ for every open subset O of Y,

or

x^{-1}(S) ∈ Σ for every closed subset S of Y,

then x is a Y-valued random variable. These observations, which are routinely invoked in practice, are proved as follows.
Exercise 33. Let (X, Σ) be a measurable space, Y a metric space, and A any class of subsets of Y such that σ(A) = B(Y). Prove:
(a) for any map x : X → Y, the class {A ⊆ Y : x^{-1}(A) ∈ Σ} is a σ-algebra on Y;
(b) x is a Y-valued random variable on (X, Σ) iff x^{-1}(A) ∈ Σ for all A ∈ A.

The following simple consequence of this exercise, and of the fact that the set of all intervals of the form (−∞, a] generates B(R) (Example 2), is used so frequently that it may be a good idea to single it out.

Observation 1. Given any measurable space (X, Σ), a map x : X → R is a random variable if, and only if,

{ω ∈ X : x(ω) ≤ a} ∈ Σ for any a ∈ R.²²

²² Here ≤ can be replaced with < in this statement. Right?

We should now go through some examples. But first, let us agree on the following jargon.

Definition. Given any metric space Y, a Y-valued random variable (in particular, a random variable) is called simple if its range is a finite set, and discrete if its range is countable.

Example 6. [1] If X is a nonempty set and Y a metric space, then any function defined on X is a random variable on (X, 2^X); that is, Y^X = RV(X, 2^X). If X is finite (countable), then any such function is a simple (discrete) random variable on (X, 2^X).
[2] Let (X, Σ) be a measurable space. For any event A ∈ Σ, recall that the indicator function of A, 1_A ∈ {0, 1}^X, is defined as

1_A(ω) := 1 if ω ∈ A, and 1_A(ω) := 0 otherwise.

Clearly, 1_A is a random variable on (X, Σ), and, more generally, Σ^m a_i 1_{A_i} is a simple random variable on (X, Σ, p) for any A_1, ..., A_m ∈ Σ and a_1, ..., a_m ∈ R, m being any positive integer. Any simple random variable can be expressed in this way. (Really?)

[3] Let X and Y be two metric spaces. If x ∈ Y^X is continuous, then x is a Y-valued random variable on (X, B(X)). (Proof. Apply Exercise 33.) In particular,

C(X) ⊆ RV(X, B(X)).

As a matter of fact, it is enough that x be continuous on X at all but countably many points. (But this is not trivial to prove.) So any monotonic function x on X ⊆ R is a random variable on (X, B(X)). (Note. A more direct way of proving the latter statement is by noting that the inverse image of an interval under a monotonic map in R^X is an interval.)

A random variable on (X, B(X)) need not be continuous, monotonic or bounded. For instance, the function x ∈ R^{[0,1]} defined as

x(ω) := 1/(1 − 2ω) if 0 ≤ ω < 1/2, x(ω) := 1/2 if ω = 1/2, and x(ω) := 1/(2ω − 1) otherwise,

is a random variable on ([0, 1], B[0, 1]).²³
[4] Let x and y be random variables on a measurable space (X, Σ). Then x + y is a random variable on this space as well since, for every real number a, we have

(x + y)^{-1}((−∞, a)) = ⋃_{r∈Q} {ω : x(ω) ≤ r and y(ω) < a − r}.

But the latter set lies in Σ, for, thanks to Observation 1, both x^{-1}((−∞, r]) and y^{-1}((−∞, a − r)) belong to Σ for any r ∈ Q (for the latter, recall the footnote to Observation 1).

By induction, we may generalize this finding in the obvious way: if x_1, ..., x_m ∈ RV(X, Σ), then Σ^m x_i ∈ RV(X, Σ).²⁴

[5] Let x and y be two random variables on a measurable space (X, Σ). If f ∈ C(R²), then f(x, y) is a random variable on (X, Σ). To see this, let O be any open set in R, and observe that f^{-1}(O) is open in R² by the continuity of f. But every open set in R² can be expressed as a countable union of open rectangles with sides parallel to the axes. (Why?) So we may write f^{-1}(O) = ⋃^∞ (I_1^i × I_2^i), where (I_1^m) and (I_2^m) are two sequences of open intervals in R. Thus,

{ω ∈ X : f(x(ω), y(ω)) ∈ O} = {ω ∈ X : (x(ω), y(ω)) ∈ ⋃_{i=1}^∞ (I_1^i × I_2^i)} = ⋃_{i=1}^∞ (x^{-1}(I_1^i) ∩ y^{-1}(I_2^i)) ∈ Σ,

and hence the claim. This observation generalizes the previous example, and it extends to any finite number of random variables in the obvious way.

[6] Let (x_m) be an increasing sequence of random variables on a measurable space (X, Σ). (By increasing here, we mean that x_1 ≤ x_2 ≤ ···.) Then x := lim x_m (i.e., the pointwise limit of the sequence (x_m)) is an R̄-valued random variable. In particular, if we know that x(X) ⊆ R, then x ∈ RV(X, Σ). Indeed, the hypothesis x_1 ≤ x_2 ≤ ··· guarantees that x^{-1}((−∞, a]) = ⋂^∞ x_i^{-1}((−∞, a]) for each a ∈ R, so Observation 1 yields the assertion easily.

[7] Using part [4], we may conclude that RV(X, Σ) is a linear space under the usual (pointwise) addition and scalar multiplication operations. The set of all simple random variables on (X, Σ) constitutes a subspace of this linear space. By part [3], if X is a metric space, then C(X) is also a subspace of RV(X, B(X)). (The same is not true for the set of all monotonic functions on X, since this set is not a linear space in its own right.)

²³ Quiz. Can an everywhere discontinuous function on R be a random variable? Say, is 1_Q ∈ RV(R, B(R))?
²⁴ Quiz. For any n = 2, 3, ..., show that the same claim is true for random n-vectors as well.
While neither B(X) (the space of bounded real maps on X) nor RV(X, Σ) includes the other,²⁵ the set RV_b(X, Σ) := B(X) ∩ RV(X, Σ) can be viewed as a normed linear subspace of B(X).²⁶ As you are asked to prove in Exercise 37, this subspace is closed, and it is thus Banach.

Exercise 34. Let (X, Σ) be a measurable space, and Y and Z two metric spaces. Let x be a Y-valued random variable on (X, Σ), and y a Z-valued random variable on (Y, B(Y)). Show that y ∘ x is a Z-valued random variable on (X, Σ).

Exercise 35.ᴴ Show that if x ∈ R^X is a random variable on a measurable space (X, Σ), then so is |x|, but not conversely.

Exercise 36. Given any metric space X, show that any upper (or lower) semicontinuous map x ∈ R^X is a random variable on (X, B(X)).

Exercise 37.ᴴ Let (X, Σ) be a measurable space, and (x_m) a sequence of R̄-valued random variables on (X, Σ). Show that inf x_m, sup x_m, liminf x_m and limsup x_m are R̄-valued random variables on (X, Σ).²⁷ So if lim x_m is a real-valued function, then it is a random variable.

Exercise 38.ᴴ Let X be a metric space. Show that B(X) is the smallest σ-algebra Σ on X such that C(X) ⊆ RV(X, Σ).

5.2 The Distribution of a Random Variable

Now that we have played around with its formal definition a bit, let us recall the idea behind the concept of a random variable. We begin with a random experiment modeled by a probability space (X, Σ, p). Then we pick a function x mapping X to R, so the values of x are random in the sense that x assumes a given value a iff a certain event occurs in the experiment. A natural candidate for this event is, of course, x^{-1}(a). But for this to make sense formally, the set x^{-1}(a) must really be an event, which explains why we require x^{-1}(a) ∈ Σ when defining a random variable. More generally, to assess the likelihood of the event that the value of x belongs to some open (or closed, or semiclosed) interval I, we need x^{-1}(I) to belong to Σ. A random variable x on (X, Σ) is a real map on X that has precisely this property.

So, for any B ∈ B(R), what is the probability of x^{-1}(B)? Given the probability space at hand, we have an obvious way of assigning a probability to this event, namely, p(x^{-1}(B)). (Notice that we can do this precisely because x^{-1}(B) ∈ Σ = the domain of p.)

²⁵ I know that I have not given you an example of a bounded real map which is not a random variable. These are not so easy to construct. Take my word for it for the moment.
²⁶ However, as I will explain in Section 8, the standard practice is to use a slight modification of the sup-norm in the case of bounded random variables.
²⁷ Notation. These functions are understood to be defined pointwise. That is, sup x_m is the real map defined on X by ω ↦ sup x_m(ω), liminf x_m is the map defined by ω ↦ liminf x_m(ω), etc.
Definition. A random variable x on a probability space (X, Σ, p) induces a Borel probability measure p_x on R as follows:

p_x(B) := p(x^{-1}(B)) for any B ∈ B(R).

This measure is called the distribution of x.²⁸ In turn, the distribution function induced by p_x, denoted as F_x, is called the distribution function of x.

The distribution p_x of a random variable x on a probability space (X, Σ, p) describes the probabilities of all events that involve x. But to confirm this formally, we need to check that p_x ∈ P(R), that is, that p_x is really a Borel probability measure. Of course, we have p_x(R) = p(x^{-1}(R)) = p(X) = 1 and, similarly, p_x(∅) = 0. On the other hand, for any pairwise disjoint sequence of events (A_m) in B(R), we have x^{-1}(A_i) ∩ x^{-1}(A_j) = ∅ for each i ≠ j, and hence

p_x(⋃_{i=1}^∞ A_i) = p(x^{-1}(⋃_{i=1}^∞ A_i)) = p(⋃_{i=1}^∞ x^{-1}(A_i)) = Σ_{i=1}^∞ p(x^{-1}(A_i)) = Σ_{i=1}^∞ p_x(A_i).

Thus p_x ∈ P(R). We may then think of x as transforming the space (X, Σ, p) into the Borel probability space (R, B(R), p_x).
The fact that p_x is indeed a Borel probability measure on R ensures that the distribution function F_x of x, which is defined by F_x(t) := p_x((−∞, t]), is indeed a distribution function in the formal sense of the term. We will see in the next chapter that F_x is often more useful than p_x in making probabilistic computations concerning x.

We say that the random variable x is continuous if F_x is a continuous function. (So, p({ω : x(ω) = a}) = p_x({a}) = 0 for any continuous random variable x and a ∈ R. Why?)

Warning. A continuous random variable is distinct from a random variable which is continuous. The former concept is universally defined, while the latter demands (something like) a metric structure on the sample space. Even in the case of Borel probability spaces these are distinct notions. If X is finite, any real function on X is a random variable on (X, 2^X) which is continuous on X (where we view X as metrized by the discrete metric). But, obviously, in this case no real function on X is a continuous random variable.

²⁸ Again, there is nothing probabilistic about the notion of random variable. It is thus a bit silly to say that x is defined on a probability space (X, Σ, p): x is fully identified by the measurable space (X, Σ). However, in probability theory, one is foremost interested in the distribution of a random variable, and that surely depends on the probability measure that one uses on (X, Σ). For this reason, probabilists often talk of a random variable x on a probability space (X, Σ, p), and henceforth I will adhere to this convention as well.
Remark 2. We often describe the probabilistic behavior of a random variable by specifying its distribution function directly. So the statement "let x be a random variable on X with the distribution function F" means that x is a random variable on some probability space (X, Σ, p) such that F = F_x, that is, p_x = p ∘ x^{-1} induces the distribution function F (or, equivalently, p_x is the Lebesgue–Stieltjes measure induced by F). In turn, a distribution function F is often defined in terms of a piecewise continuous function f : R → R_+ with ∫_{−∞}^{∞} f(t) dt = 1, as follows:²⁹

F(a) = ∫_{−∞}^{a} f(t) dt,  −∞ < a < ∞.  (4)

In this case, f is said to be a density for F, or, equivalently, we say that F is induced by the density function f.³⁰

Two typical examples of distribution functions induced by density functions are the uniform distribution on [a, b], a < b, for which

f(t) = 1/(b − a) if a ≤ t ≤ b, and f(t) = 0 otherwise,

and the exponential distribution (with parameter λ > 0), for which

f(t) = λe^{−λt} if t ≥ 0, and f(t) = 0 otherwise.

A random variable with uniform distribution on [a, b] is said to be uniformly distributed on [a, b]. Random variables that are exponentially distributed are defined similarly.
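As a quick sanity check on the exponential case (an illustration of ours, not part of the text): integrating the density gives F(a) = 1 − e^{−λa} for a ≥ 0, which a crude numerical integration confirms:

    # The exponential density integrates to F(a) = 1 - exp(-lam * a), a >= 0.
    # lam = 2.0 is an arbitrary illustrative choice.
    import math

    lam = 2.0
    f = lambda t: lam * math.exp(-lam * t) if t >= 0 else 0.0

    def F_numeric(a, n=100000):
        h = a / n                            # midpoint Riemann sum over [0, a]
        return sum(f((k + 0.5) * h) for k in range(n)) * h

    for a in (0.5, 1.0, 3.0):
        print(abs(F_numeric(a) - (1 - math.exp(-lam * a))) < 1e-6)   # True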
Remark 3. In Section 3.3 we have seen that there is a one-to-one correspondence between Borel probability measures on R and distribution functions. Interestingly, there is a similar relation between such measures and random variables on the probability space ((0, 1), B(0, 1), ℓ) as well. One direction is indeed obvious: any such random variable x induces a Borel probability measure on R, namely, its distribution ℓ_x. Conversely, every Borel probability measure p on R arises this way, that is, p equals the distribution of some such x. This is easy to see if p is induced by a strictly increasing distribution function, say F. Since F is then a bijection from R onto (0, 1), it is invertible, and we may define in this case x := F^{-1}. By monotonicity, x ∈ RV((0, 1), B(0, 1)), and we have

ℓ_x((a, b]) = ℓ((x^{-1}(a), x^{-1}(b)]) = F(b) − F(a) = p((a, b])

for any a ≤ b < ∞, and

ℓ_x((a, ∞)) = 1 − F(a) = p((a, ∞))

for any a. Thus ℓ_x and p agree on the class A of all right-closed intervals. By Proposition 4, then, ℓ_x = p, as we sought.

Unfortunately, life is a bit more complicated than this, because this method would fail when F is not invertible. What we need to do in the general case is to come up with some sort of a definition for the inverse of an increasing function, even though this function may be constant on some intervals. Here is the deal: we supply the definition, and you show that it works.

Exercise 39. Let p be the Lebesgue–Stieltjes probability measure on R induced by the distribution function F. Define x ∈ R^{(0,1)} by

x(ω) := inf{t ∈ R : F(t) ≥ ω}.

Prove: x is a left-continuous and increasing function (called the pseudo-inverse of F). Moreover, x ∈ RV((0, 1), B(0, 1)) and ℓ_x = p. (Note. x is a nonnegative random variable if F(0) = 0.)

Conclusion: There is a one-to-one correspondence between the notions of Borel probability measure on R, distribution function, and random variable on ((0, 1), B(0, 1), ℓ).

²⁹ A function that maps S ⊆ R into R is called piecewise continuous if it is continuous at all but finitely many points of S.
³⁰ Warning. Not all distribution functions arise this way. In fact, (4) defines F as a continuous function, so no discontinuous distribution function has a density. This fact is very easy to prove for bounded densities. (Take any ε > 0 and let K := sup{|f(t)| : t ∈ R}. For any a < b with |a − b| < ε/K, we have F(b) − F(a) ≤ ∫_a^b |f(t)| dt ≤ K(b − a) < ε.) The claim is in fact true for any Riemann integrable f, but we don't need to prove this here.
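The pseudo-inverse of Exercise 39 is precisely the device behind inverse transform sampling: pushing uniform draws on (0, 1) through x yields draws with distribution function F. A small Python sketch of ours, using the exponential F (for which the pseudo-inverse has the closed form −log(1 − ω)/λ):

    # Inverse transform sampling via the pseudo-inverse x(w) = inf{t : F(t) >= w}.
    # For F(t) = 1 - exp(-lam * t) this has the closed form -log(1 - w)/lam.
    import math
    import random

    lam = 2.0
    F = lambda t: 1 - math.exp(-lam * t) if t >= 0 else 0.0

    def pseudo_inverse(w, lo=0.0, hi=1e6, tol=1e-10):
        # bisection for inf{t : F(t) >= w}, valid since F is increasing
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if F(mid) >= w:
                hi = mid
            else:
                lo = mid
        return hi

    random.seed(0)
    w = random.random()
    print(abs(pseudo_inverse(w) - (-math.log(1 - w) / lam)) < 1e-6)  # True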
We need to introduce one final preliminary concept here, and then we will be ready to launch our attack on the Lebesgue integration theory.

Definition. Let x and y be two random variables on a probability space (X, Σ, p). If

p({ω ∈ X : x(ω) = y(ω)}) = 1,

we say that x and y are equal almost surely, and write x =_a.s. y. The expressions x ≥_a.s. y and x ≤_a.s. y are similarly defined.
The importance of this concept derives from the fact that two random variables
that are equal almost surely are indistinguishable from the probabilistic point of view.
The intuition behind this is hopefully clear. The following exercise supplies the formal
argument.
Exercise 40. Prove: If x and y are two random variables on a probability space (X, Σ, p) such that x =_a.s. y, then p_x = p_y and F_x = F_y.
6 The Expectation Functional
As you know well, the expected value of a simple random variable is the weighted
average of the values taken by this random variable, where the weight of each value
is its probability. As you will also recall, by using (Riemann) integration, we can
extend this idea to the case of a random variable whose distribution is induced by a
density function. In this section, we outline the situation in the case of an arbitrary
random variable.
6.1 The Expectation of Simple Random Variables

Let (X, Σ, p) be a probability space. The task at hand is easy for a simple random variable defined on this space. By definition, for any such random variable x, we have |x(X)| < ∞ and

x = Σ_{a∈x(X)} a 1_{x^{-1}(a)}.  (5)

Using this representation, we define the expectation of x as

E(x) := Σ_{a∈x(X)} a p(x^{-1}(a)).

The intuitive basis for this definition is straightforward. If x is a simple random variable which takes the value a with probability p_x(a), we then define the expected value of x as the weighted average of all its values, where the weight of a ∈ x(X) is p_x(a). That is, E(x) = Σ_{a∈x(X)} a p_x(a). Of course, the alternative notation E_p(x) would be more informative, and we will use it in later chapters. But given that we work with a fixed (though arbitrary) probability measure throughout this section, the simpler notation that we adopt here will not cause confusion.
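Computationally, the definition is just a weighted sum over the range of x. A tiny Python illustration of ours, for the dice-sum variable of Section 5:

    # E(x) = sum over a in x(X) of a * p(x^{-1}(a)), for the dice-sum variable.
    from fractions import Fraction

    X = [(i, j) for i in range(1, 7) for j in range(1, 7)]
    p = {w: Fraction(1, 36) for w in X}
    x = lambda w: w[0] + w[1]

    E = sum(a * sum(p[w] for w in X if x(w) == a) for a in set(map(x, X)))
    print(E)  # 7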
The following simple exercise gives us an easy way to compute the expectation of a simple random variable that may have been expressed in a way different from (5). We will make use of this exercise shortly.

Exercise 41. Let (X, Σ, p) be a probability space, A ⊆ Σ a finite partition of X, and f ∈ R^A. Show that

E(Σ_{A∈A} f(A) 1_A) = Σ_{A∈A} f(A) p(A).

Let us establish a few basic properties of the expectation functional (which we have so far defined only on the linear space of simple random variables). First, note that E is a positive functional, in the sense that if x ≥_a.s. 0, then E(x) ≥ 0. This follows immediately from the definitions. Let us next show that this functional is linear. (This observation, albeit simple, is one of the major building blocks of the subsequent development.) Take any two simple random variables x and y on (X, Σ, p), and let Θ := x(X) × y(X). Define

A_(a,b) := x^{-1}(a) ∩ y^{-1}(b) for any (a, b) ∈ Θ.

Observe that {A_(a,b) : (a, b) ∈ Θ} is a finite partition of X. Hence the following identities are valid:

x = Σ_{(a,b)∈Θ} a 1_{A_(a,b)},  y = Σ_{(a,b)∈Θ} b 1_{A_(a,b)},  and  x + y = Σ_{(a,b)∈Θ} (a + b) 1_{A_(a,b)}.

Thus, by applying the observation noted in Exercise 41 three times, we find

E(x + y) = Σ_{(a,b)∈Θ} (a + b) p(A_(a,b)) = Σ_{(a,b)∈Θ} a p(A_(a,b)) + Σ_{(a,b)∈Θ} b p(A_(a,b)) = E(x) + E(y).

Since the linear homogeneity of the expectation operator is obvious, we have proved that

E(αx + y) = αE(x) + E(y)

for any α ∈ R and any simple random variables x and y on a given probability space. In turn, this observation also allows us to conclude that

E(x) ≥ E(y) whenever x ≥_a.s. y,

since E(x) − E(y) = E(x − y), and the expectation of a simple random variable which is nonnegative almost surely is nonnegative. Therefore, if x =_a.s. y, we must have E(x) = E(y). In what follows we will use these properties of E freely.
6.2 The Expectation of Nonnegative Random Variables

We are now ready to take a major step.

Definition. Let x be a random variable on a probability space (X, Σ, p) with x ≥ 0. The expectation of x, denoted E(x), is defined as

E(x) := sup{E(z) : z ∈ L(x)},

where L(x) stands for the set of all simple random variables z such that z ≤ x.³¹

FIGURE 1 ABOUT HERE

³¹ Quiz. Show that this definition agrees with the previous one we gave in the case of simple random variables.

In words, for any x ∈ RV_+(X, Σ), the number E(x) is defined as the supremum of the weighted averages of all those simple random variables that lie below x. So, in this sense, the idea behind the definition of E(x) is reminiscent of the computation of the area under a given curve in R × R_+ by means of approximating this area by summing up some of the rectangles that lie under the curve and above the horizontal axis. (See Figure 1.) Put this way, you should see that E(x) can be thought of as some sort of an integral of x (it is called the Lebesgue integral of x) in which the sets in the domain of x (analogous to the bases of the rectangles under the curve) are measured according to the underlying probability measure. This motivates the following notation:

∫_X x dp := E(x)

for any x ∈ RV_+(X, Σ). Adopting the widely used conventions of integration theory, we also define

∫_S x dp := E(x 1_S) for any S ∈ Σ.

Therefore, recalling the definition of the expectation of a simple random variable, we have

∫_S dp := E(1_S) = p(S)

for each S ∈ Σ. Moreover, the definition of E entails the following properties for the Lebesgue integral.

In the following set of exercises, (X, Σ, p) stands for an arbitrarily fixed probability space.

Exercise 42. For any α ≥ 0 and x, y ∈ RV_+(X, Σ), show that
(a) ∫_X αx dp = α ∫_X x dp;
(b) ∫_X y dp ≤ ∫_X x dp whenever y ≤ x; and
(c) ∫_A x dp ≤ ∫_B x dp for any A, B ∈ Σ with A ⊆ B.

Exercise 43. For any x ∈ RV(X, Σ) with x > 0, show that

lim_{m→∞} ∫_{x^{-1}((m,∞))} (m/x) dp = 0.

Exercise 44.ᴴ For any x, y ∈ RV_+(X, Σ), show that

x ≤_a.s. y implies ∫_X x dp ≤ ∫_X y dp.

(Corollary. The expectations of two almost surely equal nonnegative random variables are equal to each other.)

Exercise 45. Let x be a nonnegative random variable on (X, Σ, p).
(a) For any m ∈ N, define A_m := {ω ∈ X : x(ω) ≥ 1/m}, and show that (A_m) is an increasing sequence in Σ. What is lim p(A_m)?
(b) Prove: x =_a.s. 0 if and only if ∫_X x dp = 0.
Our task now is to establish the additivity of E, which is actually not at all a trivial thing to do. We begin with the following fundamental theorem, which was proved by Beppo Levi in 1906.

The Monotone Convergence Theorem 1. (Levi) Let (X, Σ, p) be a probability space, and (x_m) an increasing sequence in RV_+(X, Σ) such that lim x_m ∈ R^X. Then lim x_m is a random variable with expectation lim E(x_m), that is,

∫_X lim_{m→∞} x_m dp = lim_{m→∞} (∫_X x_m dp).

Proof. We have noted in Example 6.[6] that x := lim x_m is a random variable on (X, Σ, p). Moreover, it follows from Exercise 42.(b) that E(x_1) ≤ E(x_2) ≤ ··· ≤ E(x), so all we need to do here is to establish that sup E(x_m) ≥ E(x).³² Recalling the definition of E(x), the proof will then be complete if we can show that

sup E(x_m) ≥ sup{E(z) : z ∈ L(x)}.

So our claim is: sup E(x_m) ≥ E(z) for any z ∈ L(x).

To prove this, fix an arbitrary z ∈ L(x) and ε > 0, and define

A_m := {ω ∈ X : x_m(ω) < z(ω) − ε}, m = 1, 2, ...³³

(Verify that A_m ∈ Σ for each m.) Since

z = (z − ε)1_{X\A_m} + z 1_{A_m} + ε 1_{X\A_m},

and E is a linear functional over the linear space of simple random variables, we have

E(x_m) − E(z) = E(x_m) − ∫_{X\A_m} (z − ε) dp − ∫_{A_m} z dp − ∫_{X\A_m} ε dp
 ≥ (E(x_m) − ∫_{X\A_m} (z − ε) dp) − (max z(X)) p(A_m) − ε p(X\A_m).

³² Here, to simplify the notation, I write sup E(x_m) for the number sup{E(x_m) : m = 1, 2, ...}.
³³ Observe that the only reason why I can't safely conclude that E(x_M) ≥ E(z) − ε for some M (and hence for all m ≥ M) is that z is larger than x_M on the set A_M. Indeed, if A_M = ∅ for some M, then it follows readily that E(x_m) ≥ E(z) − ε for each m ≥ M (since A_1 ⊇ A_2 ⊇ ···). Unfortunately, I don't know whether A_m = ∅ for some m. But I do know that the A_m's can be ignored in the limit, for ⋂ A_i = ∅, so that lim p(A_m) = 0. So, for m large enough, the problematic region A_m should not really matter. It is this intuition that the proof is built on.
But, by definition of A_m, we have (z − ε)1_{X\A_m} ∈ L(x_m), so that

E(x_m) − ∫_{X\A_m} (z − ε) dp = E(x_m) − E((z − ε)1_{X\A_m}) ≥ 0

for each m. Combining this inequality with the previous one, and using the fact that p(X\A_m) ≤ 1, we obtain

E(x_m) ≥ E(z) − (max z(X)) p(A_m) − ε, m = 1, 2, ...  (6)

Observe next that x_1 ≤ x_2 ≤ ··· implies that (A_m) is a decreasing sequence. Moreover, since z ≤ x and x_m → x, it must be the case that, for any ω ∈ X, we have x_m(ω) > z(ω) − ε for m large enough. Thus ⋂ A_i = ∅. So, by applying Proposition 2, we find p(A_m) → 0. Therefore, letting m → ∞ in (6), we find sup E(x_m) = lim E(x_m) ≥ E(z) − ε. Since ε > 0 is arbitrary here, we are done.
Before moving towards the applications of this fundamental theorem, let us state one important corollary of it which we will need later on. The way it is stated, the Monotone Convergence Theorem 1 is useful only for establishing the convergence of integrals of increasing sequences of nonnegative random variables. The following celebrated result, which was proved by Pierre Fatou in 1906, often proves useful when studying the convergence problem for arbitrary sequences of nonnegative random variables.

Fatou's Lemma. If (x_m) is a sequence of nonnegative random variables on a probability space (X, Σ, p) such that liminf x_m ∈ R^X, then

∫_X liminf_{m→∞} x_m dp ≤ liminf_{m→∞} (∫_X x_m dp).

Proof. Let y_m := inf{x_m, x_{m+1}, ...} for each m, and y := liminf x_m. By Exercise 37, each y_m and y are random variables on (X, Σ, p). Moreover, we have 0 ≤ y_1 ≤ y_2 ≤ ···, so, by the Monotone Convergence Theorem 1,

∫_X liminf_{m→∞} x_m dp = ∫_X lim_{m→∞} y_m dp = lim_{m→∞} ∫_X y_m dp = liminf_{m→∞} (∫_X y_m dp) ≤ liminf_{m→∞} (∫_X x_m dp).

(Why is the final inequality valid?)
Warning. The inequality in Fatou's Lemma may hold strictly. For instance, let (x_m) be the sequence of random variables defined on the probability space ([0, 1], B[0, 1], ℓ) by

x_m(ω) := m if 0 < ω < 1/m, and x_m(ω) := 0 otherwise.

For this sequence, we have liminf x_m = 0 with E(x_m) = 1 for each m, so E(liminf x_m) < liminf E(x_m). This example also shows that the monotonicity requirement of the Monotone Convergence Theorem 1 cannot be completely relaxed.
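A Python sketch of ours makes the "escaping mass" in this example visible: the approximated expectations stay at 1 while, for any fixed ω, the sequence x_m(ω) is eventually 0:

    # Escaping mass: x_m = m on (0, 1/m) keeps E(x_m) = 1 while x_m -> 0 pointwise.
    def x(m, w):
        return m if 0 < w < 1.0 / m else 0.0

    def expectation(m, n=10**5):
        # midpoint Riemann sum over [0, 1]
        return sum(x(m, (k + 0.5) / n) for k in range(n)) / n

    print([round(expectation(m), 3) for m in (1, 10, 100)])  # [1.0, 1.0, 1.0]
    print([x(m, 0.3) for m in (1, 10, 100)])                 # [1, 0.0, 0.0]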
Recall that we have still not established the additivity of the expectation functional over nonnegative random variables. The way we propose to do this is via a two-step procedure. First, we show that every such random variable can be written as the limit of an increasing sequence of simple random variables. Second, we deduce the linearity of E over the set of all nonnegative random variables from its linearity over the set of simple random variables by using the Monotone Convergence Theorem 1. The first step is formalized in the following technical but often useful result.

Lemma 1. For any probability space (X, Σ, p) and x ∈ RV_+(X, Σ), there exists a sequence (x_m) of simple random variables on (X, Σ, p) such that

0 ≤ x_1 ≤ x_2 ≤ ··· and lim x_m = x.

Proof. (The idea behind the proof is illustrated in Figure 2.) For each m ∈ N and i ∈ {1, ..., m2^m}, define

A_m^i := {ω ∈ X : (i − 1)/2^m ≤ x(ω) < i/2^m} and B_m := {ω ∈ X : x(ω) ≥ m}.

Observe that {B_m, A_m^1, ..., A_m^{m2^m}} is a partition of X (for each m), and moreover, since x ∈ RV(X, Σ), every member of this partition belongs to Σ. Define, for each m,

x_m(ω) := (i − 1)/2^m if ω ∈ A_m^i, i = 1, ..., m2^m, and x_m(ω) := m if ω ∈ B_m,

and note that (x_m) is a sequence of nonnegative simple random variables on (X, Σ, p) with x_1 ≤ x_2 ≤ ···. Now fix any ω ∈ X. Clearly, x(ω) < M for some positive integer M. But then 0 ≤ x(ω) − x_m(ω) < 1/2^m for all m ≥ M, so x_m → x.

FIGURE 2 ABOUT HERE
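The staircase construction in this proof is easy to mechanize. Here is a small Python sketch of ours computing the approximants x_m(ω) for an unbounded nonnegative function; the displayed values increase to x(ω), as Lemma 1 promises:

    # Dyadic staircase approximants of Lemma 1:
    # x_m = (i-1)/2^m on {(i-1)/2^m <= x < i/2^m}, and x_m = m where x >= m.
    import math

    def x_m(value, m):
        if value >= m:
            return float(m)
        return math.floor(value * 2**m) / 2**m

    x = lambda w: 1.0 / max(w, 1e-12)        # an unbounded nonnegative function
    w = 0.37
    print([x_m(x(w), m) for m in (1, 2, 4, 8, 16)])  # increases toward x(w) = 2.7027...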
Finally, we are ready to prove the additivity of the expectation functional with
respect to nonnegative random variables.
Proposition 5. Let (X, Σ, p) be a probability space. Then

E(αx + y) = αE(x) + E(y) for all x, y ∈ RV_+(X, Σ) and α ≥ 0.

Proof. Thanks to Exercise 42.(a), it is enough to prove this claim for α = 1. To do this, we use Lemma 1 to find increasing sequences (x_m) and (y_m) of nonnegative simple random variables on (X, Σ, p) such that x_m → x and y_m → y. Clearly, (x_m + y_m) is an increasing sequence of nonnegative simple random variables such that x_m + y_m → x + y. So, by applying the Monotone Convergence Theorem 1 three times, we get

E(lim_{m→∞} (x_m + y_m)) = lim_{m→∞} E(x_m + y_m) = lim_{m→∞} E(x_m) + lim_{m→∞} E(y_m) = E(lim_{m→∞} x_m) + E(lim_{m→∞} y_m).

It follows that E(x + y) = E(x) + E(y), as we sought.
Exercise 46.ᴴ Let (X, Σ, p) be a probability space, and 𝒳 a nonempty countable subset of RV_+(X, Σ) such that Σ_{x∈𝒳} x(ω) < ∞ for any ω ∈ X. (Recall Proposition 1.10.) Show that

E(Σ_{x∈𝒳} x) = Σ_{x∈𝒳} E(x).

Exercise 47. Let (X, Σ, p) be a probability space, and x ∈ RV_+(X, Σ). Prove:
(a) ∫_X x dp = ∫_A x dp + ∫_{X\A} x dp for any A ∈ Σ;
(b) if A ⊆ Σ is a countable partition of X, then ∫_X x dp = Σ_{A∈A} ∫_A x dp.

The method of combining the forces of Lemma 1 and the Monotone Convergence Theorem 1 is commonly used in integration theory. Here are two further exercises for which you might find this method helpful.

Exercise 48. Let (X, Σ, p) be a probability space, and 0 ≤ a < b < ∞. Take any x ∈ RV(X, Σ) with x(X) ⊆ [a, b]. Prove or disprove:

∫_X x dp = ∫_R t 1_{[a,b]}(t) dp_x(t).

(Interpretation?)

Exercise 49. Let x be a nonnegative random variable on (X, Σ, p) such that E(x) = 1. Define q : Σ → R_+ by q(A) := ∫_A x dp. First show that q is a probability measure on Σ, and then establish that

∫_X y dq = E(xy) for any y ∈ RV_+(X, Σ).
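Exercise 49 is the discrete germ of changing a measure by a density. On a finite sample space the claim can be checked directly, as in this Python sketch of ours (all names and numbers are our own choices):

    # Reweighting p by a density x with E(x) = 1: q(A) := integral of x over A,
    # and then the integral of y dq equals E(xy).
    from fractions import Fraction

    X = [1, 2, 3, 4]
    p = {w: Fraction(1, 4) for w in X}
    x = {1: Fraction(2), 2: Fraction(1), 3: Fraction(1), 4: Fraction(0)}
    assert sum(x[w] * p[w] for w in X) == 1          # E(x) = 1

    q = {w: x[w] * p[w] for w in X}                  # q({w}) = x(w) p({w})
    y = {1: 5, 2: 1, 3: 2, 4: 7}                     # a nonnegative y

    lhs = sum(y[w] * q[w] for w in X)                # integral of y dq
    rhs = sum(x[w] * y[w] * p[w] for w in X)         # E(xy)
    print(lhs == rhs)                                # True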
6.3 The Expectation of Arbitrary Random Variables

The next order of business is to extend the expectation functional to the class of all random variables on a given probability space. There is a standard way of doing this. Let x be a random variable on (X, Σ, p). We define the positive and negative parts of x as

x^+ := max{x, 0} and x^- := max{−x, 0},

respectively. We leave it to you to verify that x^+ and x^- are nonnegative random variables on (X, Σ, p) such that x = x^+ − x^-. (See Figure 3.) In turn, we define the expectation of x as follows:

E(x) := E(x^+) − E(x^-),

provided that E(x^+) and E(x^-) are not both infinite. Otherwise, that is, if E(x^+) = ∞ = E(x^-), we say that the expectation of x does not exist. By contrast, if both E(x^+) and E(x^-) are finite, or, equivalently, E(|x|) < ∞, we say that x is integrable.

FIGURE 3 ABOUT HERE

Warning. It is not at all weird for a random variable not to have an expectation. For instance, let F be the distribution function induced by the following density function:

f(t) := 1/(2t²) if |t| > 1, and f(t) := 0 otherwise.

Now let x be a random variable (on some probability space) with the distribution function F (Remarks 2 and 3). Then E(x^+) = ∞ = E(x^-), so E(x) does not exist.

So there it is. We finally have the full definition of the expectation functional. Before proceeding any further, let us briefly review how we derived it. We first defined E for simple random variables in an intuitive way (the weighted-average idea), and then extended it to the class of all nonnegative random variables by means of approximation. We preserved the linearity of E during this extension. This was a nontrivial thing to do, and it required the combined power of the Monotone Convergence Theorem 1 and the fact that any nonnegative random variable can be approximated (in fact, uniformly, if it is bounded) by nonnegative simple random variables. Finally, we extended E further to the class of all random variables by applying it to the positive and negative parts of random variables in a linear way. It should not be surprising that the linearity of E is preserved in this extension as well.
Example 7. Let (X, 2^X, p) be any probability space with X a countable set, and let us agree to write p(ω) for p({ω}) throughout this example. Our objective is to find an expression for the expectation of an arbitrary random variable x on (X, 2^X, p) with Σ_{ω∈X} |x(ω)| p(ω) < ∞.³⁴ (Any guesses?)

We know the answer if X is finite, for then x is simple by necessity, and we have an explicit formula for the expectation of simple random variables. To focus on the nontrivial case, then, we assume that X is countably infinite, and we enumerate it as X := {ω_1, ω_2, ...}. Observe that if x is nonnegative and simple, then the definition of E for simple random variables and the σ-additivity of p immediately yield

E(x) = Σ_{a∈x(X)} a p_x(a) = Σ_{a∈x(X)} Σ_{ω∈x^{-1}(a)} a p(ω),

so, by Proposition 1.10, we find

E(x) = Σ_{i=1}^∞ x(ω_i) p(ω_i)  (7)

in accordance with our intuition. In fact, we can show that (7) holds in general. To see this, assume next that x is nonnegative. (See, we're simply applying the strategy we have followed throughout this section: start with the simple random variables, then go to the nonnegative ones using a monotone convergence argument, and so on.) For each m ∈ N, we define the simple random variables

x_m(ω) := x(ω) if ω ∈ {ω_1, ..., ω_m}, and x_m(ω) := 0 otherwise.

Clearly, x_m ↑ x, so we have E(x) = lim E(x_m) by the Monotone Convergence Theorem 1. But then, since (7) is valid for simple random variables, we have

E(x) = lim_{m→∞} E(x_m) = lim_{m→∞} Σ_{i=1}^m x(ω_i) p(ω_i) = Σ_{i=1}^∞ x(ω_i) p(ω_i),

which establishes (7) for nonnegative random variables on (X, 2^X, p). Finally, let us drop the assumption that x is nonnegative. In this case, given that Σ |x(ω_i)| p(ω_i) < ∞, both E(x^+) and E(x^-) must be finite numbers. (Why exactly?) So we may use the definition of E and the fact that (7) holds for nonnegative random variables to get

E(x^+) − E(x^-) = Σ_{i=1}^∞ x^+(ω_i) p(ω_i) − Σ_{i=1}^∞ x^-(ω_i) p(ω_i) = Σ_{i=1}^∞ (x^+ − x^-)(ω_i) p(ω_i).

It follows that (7) holds for any such x ∈ RV(X, 2^X), as we sought.

We are not exactly done though, because the formula (7) depends on the particular way we enumerated the sample space X. How do we know that we would not get a different number for E(x) if we enumerated X in another way? Because, given that Σ |x(ω_i)| p(ω_i) < ∞, Dirichlet's Rearrangement Theorem tells us that any rearrangement of the series Σ x(ω_i) p(ω_i) gives the same number. Thus, without a hint of ambiguity, we can conclude here that E(x) = Σ_{ω∈X} x(ω) p(ω).

³⁴ Question: Isn't this somewhat ambiguous; what is the order of summation in the series Σ_{ω∈X} |x(ω)| p(ω)? Answer: Thanks to Proposition 1.10, it doesn't matter!
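For a concrete instance of (7), consider this Python sketch of ours with X = N, p(ω_i) = 2^{−i} and x(ω_i) = (−1)^i i; the series converges absolutely, so any summation order gives the same expectation (here −2/9):

    # Formula (7) on a countable space: E(x) = sum_i x(w_i) p(w_i),
    # legitimate because sum_i |x(w_i)| p(w_i) < infinity.
    N = 200                                   # truncation; the tail is negligible
    p = lambda i: 2.0 ** (-i)
    x = lambda i: (-1) ** i * i

    abs_sum = sum(abs(x(i)) * p(i) for i in range(1, N))
    E = sum(x(i) * p(i) for i in range(1, N))
    print(round(abs_sum, 6), round(E, 6))     # 2.0 -0.222222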
Here is another illustration of how to use the definition of E for an arbitrary random variable. Let (X, Σ, p) be a probability space, and x ∈ RV(X, Σ). Assume that E(x) exists. Then |E(x)| is a well-defined (extended) real number. How does it compare with E(|x|)? Well, since E is a particular type of integral, it is natural to expect that |E(x)| ≤ E(|x|). Indeed,

|E(x)| = |E(x^+) − E(x^-)| ≤ |E(x^+)| + |E(x^-)| = E(x^+) + E(x^-) = E(x^+ + x^-) = E(|x|),

where the third equality is a consequence of Proposition 5. We proved:

Observation 2. For any random variable x on a probability space (X, Σ, p), if E(x) exists, then

|E(x)| ≤ E(|x|).
Here are some further properties of the expectation functional.

Exercise 50.ᴴ Let x and y be two random variables on (X, Σ, p) such that E(x) and E(y) exist, and y is integrable. Show that
(a) E(αx + y) = αE(x) + E(y) for any α ∈ R;
(b) if x ≥_a.s. y, then E(x) ≥ E(y) (so if x =_a.s. 0, then E(x) = 0); and
(c) if E(x) = E(y) and x ≥_a.s. y, then x =_a.s. y.

Exercise 51.ᴴ Let x and y be two random variables on (X, Σ, p) such that E(x) and E(y) exist. Prove: If ∫_S x dp ≥ ∫_S y dp for all S ∈ Σ, then x ≥_a.s. y.

Exercise 52.ᴴ Let (X, Σ, p) be a probability space. Show that, if {A_i : i = 1, 2, ...} ⊆ Σ is a partition of X, then

∫_X x dp = Σ_{i=1}^∞ ∫_{A_i} x dp

for any x ∈ RV(X, Σ) with E(x) < ∞.
A few comments on the findings of Exercise 50 are in order. First, even though part (a) of this exercise might tempt you to think of E as a linear functional on RV(X, Σ) in the formal sense of the term, this is not true. For we cannot write E(x − y) = E(x) − E(y) for x, y ∈ RV(X, Σ) if the expectation of both x and y is ∞. What we need to do is confine attention to the linear space of integrable random variables to declare E a linear functional in full glory. (We will come back to this issue in Section 9.)

Second, note the following two corollaries of Exercise 50.(b). They are routinely invoked in probability theory.

Observation 3. Any two integrable random variables that are almost surely equal must have the same expectation.

Observation 4. If x and y are two random variables with x ≤_a.s. y, then E(y) < ∞ implies that E(x) exists and E(x) < ∞.

Proof. Observation 3 follows readily from Exercise 50.(b). To prove Observation 4, notice that x ≤_a.s. y implies x^+ ≤_a.s. y^+, so we have E(x^+) ≤ E(y^+) < ∞. It follows that E(x) exists. Applying Exercise 50.(b) completes the proof.
The next order of business is to extend the Monotone Convergence Theorem 1 to the case of all random variables. You should check that the next statement generalizes our earlier formulation of this result.

The Monotone Convergence Theorem 2. Let (x_m) be an increasing sequence of random variables on a probability space (X, Σ, p) such that E(x_1) > −∞. If lim x_m is a real-valued function, then E(lim x_m) = lim E(x_m).

Proof. Let x := lim x_m, and recall that x ∈ RV(X, Σ) (Example 6.[6]). Moreover, E(x) exists (although it can equal ∞). Indeed, E(x_1) > −∞ implies E(x_1^-) < ∞, while x ≥ x_1 implies x^- ≤ x_1^-. Thus, by Exercise 42.(b), E(x^-) ≤ E(x_1^-) < ∞.

Now, if E(x_1) = ∞, then, because (x_m) is an increasing sequence, we have E(x) ≥ E(x_m) = ∞ for each m, so E(x) = lim E(x_m) holds trivially. Let us assume, then, that E(x_1) < ∞. The key observation is that (x_m − x_1) is an increasing sequence of nonnegative random variables that converges to the random variable x − x_1. (Yes?) So, by using Exercise 50.(a) twice and the Monotone Convergence Theorem 1 once, we find

E(x) − E(x_1) = E(x − x_1) = lim_{m→∞} E(x_m − x_1) = lim_{m→∞} E(x_m) − E(x_1),

and the result follows from the fact that E(x_1) is a real number.
It is important to note that the condition E(x_1) > −∞ cannot be omitted from the statement of the above theorem (although it can obviously be relaxed to E(x_k) > −∞ for some k). To see this, consider the probability space (N, 2^N, p), where p(∅) := 0 and p(S) := Σ_{i∈S} 1/2^i for any S ∈ 2^N\{∅}. For each positive integer m, define x_m ∈ RV(N, 2^N) by x_m(i) := −2^i/m. By using the observation noted in Example 7, we find that

E(x_m) = ∫_N x_m dp = Σ_{i=1}^∞ x_m(i) p({i}) = Σ_{i=1}^∞ (−1/m) = −∞, m = 1, 2, ...,

even though x_m ↑ 0. Thus, in this example, we have lim E(x_m) = −∞ < 0 = E(lim x_m).
Now that our arsenal includes the Monotone Convergence Theorem 2 and Fatou's Lemma, and we know that the expectation functional possesses the nice linearity property given in Exercise 50.(a), we can prove all sorts of convergence theorems that apply to non-monotonic sequences of random variables. A lot of information is hidden in the following result, which is really a generalization of Fatou's Lemma.

Lemma 2. Let (X, Σ, p) be a probability space, and let y, x_m ∈ RV(X, Σ), m = 1, 2, ... . Assume that E(y) < ∞, that liminf x_m and limsup x_m are real-valued functions, and that |x_m| ≤_a.s. y for each m. Then

E(liminf_{m→∞} x_m) ≤ liminf_{m→∞} E(x_m) ≤ limsup_{m→∞} E(x_m) ≤ E(limsup_{m→∞} x_m).³⁵

Proof. Let y_m := inf{x_m, x_{m+1}, ...} for each m. Since y ≥_a.s. |y_1| ≥ −y_1, we have

∞ > E(y) ≥ E(|y_1|) ≥ E(−y_1) = −E(y_1)

by Exercises 50.(b) and (a). Thus E(y_1) > −∞, and therefore we can obtain the inequality E(liminf x_m) ≤ liminf E(x_m) in exactly the same way we deduced Fatou's Lemma from the Monotone Convergence Theorem 1; all we have to do now is to use the Monotone Convergence Theorem 2 instead. (Check!) The inequality in the middle, on the other hand, is trivial. To prove the rightmost inequality, we apply what we have just established to find E(liminf(−x_m)) ≤ liminf E(−x_m). Multiplying both sides by −1 and using Exercise 50.(a), we get E(limsup x_m) ≥ limsup E(x_m).³⁶
The following is another major convergence theorem. It provides a useful alternative to the Monotone Convergence Theorem 2, because it works for certain non-monotonic sequences of random variables.

³⁵ That the expectations of liminf x_m and limsup x_m exist is part of the assertion, of course. Indeed, the subsequent argument establishes the existence of E(liminf x_m) and E(limsup x_m) indirectly. Here is a direct verification. We have |x_m| ≤_a.s. y for each m, right? Then liminf |x_m| ≤_a.s. y, so E(y) < ∞ implies E(liminf |x_m|) < ∞. Then it follows from Observation 4 that E(liminf x_m) exists and is finite.
³⁶ Reminder. limsup a_m = −liminf(−a_m) for any real sequence (a_m).

The Lebesgue Dominated Convergence Theorem 1. Let (X, Σ, p) be a probability space, and let y, x_m ∈ RV(X, Σ), m = 1, 2, .... If E(y) < ∞, lim x_m ∈ R^X, and |x_m| ≤_a.s. y for each m, then

E(lim_{m→∞} x_m) = lim_{m→∞} E(x_m).

Proof. By Lemma 2,

E(liminf_{m→∞} x_m) ≤ liminf_{m→∞} E(x_m) ≤ limsup_{m→∞} E(x_m) ≤ E(limsup_{m→∞} x_m).

Since lim x_m exists, the leftmost and rightmost expressions here coincide, and we are done.
Exercise 53. Let (X, Σ, p) be a probability space, and x_m ∈ RV(X, Σ), m = 1, 2, .... If lim x_m ∈ R^X, lim x_m ≤_a.s. 0, and lim E(x_m) = 0, then lim E(|x_m|) = 0.
Exercise 54. Let X be a compact metric space, f_m, f ∈ C(X), m = 1, 2, ..., and p ∈ P(X). Show that if f_m → f uniformly, then E(lim f_m) = lim E(f_m), but not conversely.

Exercise 55.ᴴ (Scheffé's Theorem) Let (X, Σ, p) be a probability space, and x, x_m ∈ RV(X, Σ), m = 1, 2, .... Prove: If x ≥ 0, E(x) = E(x_m) < ∞ for each m, and x_m → x, then

lim_{m→∞} ∫_X |x − x_m| dp = 0.
7 Generalizations

7.1 R̄-Valued Random Variables

Consider the probability space ({0, 1}, 2^{{0,1}}, p) where p({1}) := 1, and define x_m ∈ N^{{0,1}} as x_m(0) := m and x_m(1) := 1, m = 1, 2, ... . Clearly, (x_m) is an increasing sequence in RV({0, 1}, 2^{{0,1}}), and we should be able to talk about the expectation of lim x_m. Unfortunately, we can't quite do this right now, because lim x_m(0) = ∞, and our theory so far deals only with real-valued random variables. This is particularly problematic in this example, because it seems that there is really no conceptual problem with thinking about the expected value of lim x_m. This value even seems to be finite, for we have x_m ↑ x, where x : {0, 1} → R̄ satisfies x(0) = ∞ and x(1) = 1, so it appears natural to set E(x) = 1. But the expression E(x) = 1 is not meaningful for us, since E is so far undefined at R̄-valued functions.
The fix is easy. Every major result we proved in the previous section extends (almost readily) to cover R̄-valued random variables. Indeed, by Exercise 37, for any sequence (x_m) of R̄-valued random variables on a given probability space, sup x_m, inf x_m, limsup x_m, liminf x_m, and hence, when it exists, lim x_m, are well-defined R̄-valued random variables. We thus define E(x) for any R̄_+-valued random variable just as we defined it for random variables:

E(x) := sup{E(z) : z ∈ L(x)},

where L(x) stands for the set of all (real-valued) simple random variables z such that z ≤ x. (Observe that this definition entails E(x) = 1 in the example given above.) One may easily extend this definition to R̄-valued random variables (using the positive and negative parts of the functions), and rewrite the previous section so that it covers R̄-valued random variables, modifying the current treatment only in trivial ways. (To illustrate, we will extend the Monotone Convergence Theorem 1 below for such random variables. Also note that Lemma 1 and its proof apply to R̄_+-valued random variables without modification.)
7.2 Almost Sure Convergence

Consider again the probability space ({0, 1}, 2^{{0,1}}, p) where p({1}) := 1, and define x_m ∈ {−1, 1}^{{0,1}} as x_m(0) := (−1)^m and x_m(1) := 1, m = 1, 2, .... What is the expectation of lim x_m? Once again, this is not a meaningful question for us, because lim x_m is not defined at 0. Yet there is a sense in which this is a triviality. The sequence (x_m) behaves irregularly only at ω = 0, but, probabilistically speaking, we shouldn't really care about this, for in our experiment the outcome ω = 0 obtains only with probability zero. Viewed this way, (x_m) is in fact a constant sequence where it matters, that is, on a set of probability 1.

To formalize this point, we need to introduce the notion of almost sure convergence. Given the discussion above, we now work with R̄-valued random variables.

Definition. Let x_m, x, m = 1, 2, ..., be R̄-valued random variables on a probability space (X, Σ, p). We say that (x_m) converges almost surely to x whenever

p({ω ∈ X : x_m(ω) → x(ω)}) = 1.

In this case, we write x_m →_a.s. x (read: (x_m) converges to x almost surely).

For instance, in the context of the example above, we have x_m →_a.s. x, where x(1) := 1 and x(0) = a for any a ∈ R̄. We certainly wish to be able to talk about the almost sure limits of sequences, and to say in this instance something like "the expectation of the sequence (x_m) is 1 in the limit." Thus we need to find a way of extending our convergence theorems to a.s. convergence theorems. It won't take you long to realize that this is not really a difficult task.

To make things more concrete, let us pose the following question:

Does x_m → x almost surely imply E(x_m) → E(x), where each x_m and x are R̄_+-valued random variables?

This is a perfectly meaningful question that we hope to settle in the affirmative. Notice that the Monotone Convergence Theorem 1 does not provide an answer, for two reasons. First, we allow here x_m and/or x to take ∞ as a value. Second, the statement "x_m ↑ x almost surely" allows the sequence (x_m) not to converge to x (or to be downright divergent) on a set of probability zero. In fact, (x_m) does not even have to be increasing everywhere; all we know here is:

p({ω ∈ X : x_1(ω) ≤ x_2(ω) ≤ ··· and lim_{m→∞} x_m(ω) = x(ω)}) = 1.

(We express this situation by writing x_m ↑_a.s. x in what follows.) So the task at hand is to extend the Monotone Convergence Theorem 1 to cover the case of almost surely increasing sequences that converge almost surely. The basic idea behind this generalization is based on the following elementary observation.
Claim. For any R̄_+-valued random variable x, we have

sup_{z∈L*(x)} E(z) = E(x) = sup_{z∈L(x)} E(z),

where L*(x) is the set of all simple random variables z (these are real-valued) such that x ≥_a.s. z.

Proof of Claim. For any z ∈ L*(x), define S(z) := {ω ∈ X : z(ω) ≤ x(ω)}, and observe that z 1_{S(z)} ∈ L(x) and E(z) = E(z 1_{S(z)}). It follows that sup_{z∈L*(x)} E(z) ≤ sup_{z∈L(x)} E(z). Since L(x) ⊆ L*(x), the converse inequality is trivial.

Let x and y be any two R̄_+-valued random variables such that y ≥_a.s. x. Then we must have L*(x) ⊆ L*(y) (why?), so that, by the Claim above, we find E(x) ≤ E(y). (You see, all we are doing here is solving Exercise 42 for R̄_+-valued random variables.) This implies that

E(x) = E(y) whenever x =_a.s. y,  (8)

which parallels Observation 3.
Now let x and x_m, m = 1, 2, ..., be R̄_+-valued random variables on a probability space (X, Σ, p), and assume x_m ↑_a.s. x. Define

A := {ω ∈ X : x(ω) = ∞}.

If p(A) = 0, that is, if x is real-valued almost surely, then life is easy. In this case, we define

S := {ω ∈ X : x_m(ω) ↑ x(ω)} and T := S ∩ (X\A).

Note that x, and hence each x_m, take real values on T, because x_1(ω) ≤ x_2(ω) ≤ ··· ≤ x(ω) < ∞ for all ω ∈ T. Moreover, since S ∩ (X\A) = X\((X\S) ∪ A), we have

p(T) = 1 − p((X\S) ∪ A) ≥ 1 − (p(X\S) + p(A)) = p(S) = 1,

so x_m 1_T =_a.s. x_m and x 1_T =_a.s. x.³⁷ Thus, by the Monotone Convergence Theorem 1 and (8),

lim_{m→∞} E(x_m) = lim_{m→∞} E(x_m 1_T) = E(x 1_T) = E(x),

as we sought.

Consider next the case where p(A) > 0, and define B := {ω ∈ X : x_m(ω) ↑ ∞}. Positivity of p(A) implies that E(x) = ∞ and p(B) > 0. (Why?) Pick any K > 0, and define B_m := {ω ∈ B : x_m(ω) > K} for each m. Since x_1(ω) ≤ x_2(ω) ≤ ··· for all ω ∈ B, we have B_1 ⊆ B_2 ⊆ ···, whereas B ⊆ ⋃ B_i holds by the definition of B. So, by Proposition 2, we have

lim_{m→∞} E(x_m) ≥ lim_{m→∞} K p(B_m) = K p(⋃_{i=1}^∞ B_i) ≥ K p(B).

Since K > 0 is arbitrary here, this implies lim E(x_m) = ∞ = E(x), as we sought.

Consequently, the answer to the question posed above is yes, and we have the following powerful generalization of our earlier work.

The Monotone Convergence Theorem 3. Let (x_m) be a sequence of R̄_+-valued random variables on a probability space (X, Σ, p), and let x ∈ R̄_+^X. If x_m ↑ x or, more generally, if x_m ↑_a.s. x and x is an R̄_+-valued random variable, then E(x) = lim E(x_m).

³⁷ Each x_m 1_T (and also x 1_T) is a nonnegative random variable on (X, Σ, p), right?
The other pointwise convergence theorems we proved in Section 6 can be shown to hold with almost sure convergence as well, by means of a reasoning analogous to the one just given. The idea is simply to replace the R̄-valued random variables involved with suitable random variables that equal the original ones almost surely. Since the expectations of these variables are the same, applying our earlier convergence theorems to the random variables thus obtained will establish the same result for the original R̄-valued random variables. The legitimacy of this technique derives from the following fact.

Observation 5. Let x and y be two R̄-valued random variables (on a given probability space). If both E(x) and E(y) exist, then

E(x) ≥ E(y) whenever x ≥_a.s. y, and E(x) = E(y) whenever x =_a.s. y.

Note also that E(x) exists, provided that x ≤_a.s. y and E(y) < ∞.

Proof. If x ≥_a.s. y, then x^+ ≥_a.s. y^+ and y^- ≥_a.s. x^-, so the argument that led to (8) yields E(x^+) ≥ E(y^+) and E(y^-) ≥ E(x^-). So, if both E(x) and E(y) exist, then E(x) ≥ E(y). Since x =_a.s. y implies x ≥_a.s. y ≥_a.s. x, applying this observation twice yields the second claim. Finally, suppose E(y) exists and is finite. Then, since x ≤_a.s. y implies x^+ ≤_a.s. y^+, we find E(x^+) ≤ E(y^+) < ∞ by using (8) once again. It follows that E(x) exists.
Now use Observation 5 and the technique outlined above (the one we used to prove the Monotone Convergence Theorem 3) to establish the following two convergence theorems.

The Monotone Convergence Theorem 4. Let x and x_m, m = 1, 2, ..., be R̄-valued random variables on a probability space (X, Σ, p). If E(x_1) > −∞ and x_m ↑_a.s. x, then E(x) = lim E(x_m).

The Lebesgue Dominated Convergence Theorem 2. Let x and x_m, m = 1, 2, ..., be R̄-valued random variables on a probability space (X, Σ, p). If there exists an R̄-valued random variable y on (X, Σ, p) such that E(y) < ∞ and |x_m| ≤_a.s. y for all m, then x_m →_a.s. x implies E(x_m) → E(x).

Exercise 56. Prove the above theorems.

Exercise 57. Show that E(x + y) = E(x) + E(y) for any R̄-valued random variables x and y on a probability space (X, Σ, p), provided that the expectations of x, y, and x + y exist.

Exercise 58.ᴴ Let x be a nonnegative random variable on a probability space (X, Σ, p) such that E(x) < ∞. Prove: For any (A_m) ∈ Σ^∞ with lim p(A_m) = 0, we have

lim_{m→∞} ∫_{A_m} x dp = 0.

Exercise 59.ᴴ Let (X, Σ, p) be a probability space, and let x_m, x ∈ RV(X, Σ), m = 1, 2, ..., satisfy lim E(x_m) = E(x) < ∞. Prove: If x_m, x ≥ 0 for each m, and lim p({ω ∈ X : x(ω) − x_m(ω) > ε}) = 0 for every ε > 0, then lim E(|x_m − x|) = 0.

Exercise 60. (Scheffé's Theorem, Again) Let x and x_m, m = 1, 2, ..., be R̄-valued random variables on (X, Σ, p). Prove: If x is an R̄_+-valued random variable on (X, Σ, p) such that E(x) = E(x_m) < ∞ for each m and x_m →_a.s. x, then

∫_X |x − x_m| dp → 0.
7.3 The Lebesgue Integral and Finite Measure Spaces

All of the results about the Lebesgue integral that we derived in this and the preceding sections remain valid in the case of finite measure spaces. To make this point clear, let (X, Σ, q) be such a space; that is, let (X, Σ) be a measurable space and q a measure on Σ with q(X) < ∞. If q(X) = 0, then all is trivial, so let us assume q(X) > 0, and define p := (1/q(X)) q. Obviously, p is a probability measure on Σ. Now if f ∈ R̄^X is Σ-measurable, then f is an R̄-valued random variable on (X, Σ), so we may define

∫_X f dq := q(X) ∫_X f dp,

provided that ∫_X f dp exists.³⁸ It follows from this formula that all properties of the Lebesgue integral with respect to a finite measure can be derived from the corresponding properties of the functional E in a straightforward manner.
7.4 The Lebesgue Integral and (R, B(R), ℓ)

Let f : R → R_+ be any Borel measurable function. We wish to formalize the idea of integrating f on R (or on any Borel subset of R) with respect to the Lebesgue measure ℓ.³⁹ The standard way of doing this is to repeat the development sketched in Section 6, that is, to start with a definition for the case where f(R) is finite, and then to extend this definition to the general case by means of approximation. However, given our work so far, we can arrive at this definition much faster. The trick is simply to realize that (R, B(R), ℓ) is in fact made of countably many disjoint probability spaces.

For any i ∈ Z, let X_i := (i, i + 1], and recall that (X_i, B(X_i), ℓ_i) is a probability space.⁴⁰ Moreover, f|_{X_i} ∈ RV(X_i, B(X_i)) for each i ∈ Z. (Why?) Consequently, ∫_{X_i} f|_{X_i} dℓ_i is a well-defined number in R̄_+ for all i. We may then define the Lebesgue integral of f, denoted by ∫_R f dℓ, as:

∫_R f dℓ := Σ_{i∈Z} ∫_{X_i} f|_{X_i} dℓ_i.

³⁸ Of course, this is equivalent to the definition we would have arrived at if we first defined ∫_X f dq for f with finite range as Σ_{a∈f(X)} a q(f^{-1}(a)), then for nonnegative f as sup{∫_X g dq : f ≥ g ∈ RV(X, Σ) and |g(X)| < ∞}, and then for arbitrary f as ∫_X f^+ dq − ∫_X f^- dq, provided that the latter two integrals exist.
³⁹ Given the purposes of this text, we will discuss this matter only with respect to nonnegative functions. We should note, however, that the general case would be obtained much the same way E is extended from the class of nonnegative random variables to the class of arbitrary random variables.
⁴⁰ Of course, ℓ_i is the unique probability measure that assigns to each subinterval of X_i the length of that interval. Equivalently, it is the Lebesgue–Stieltjes measure induced by F_i ∈ [0, 1]^{X_i}, where F_i(t) := t − i.
In turn, for any S ∈ B(R), we define the Lebesgue integral of f on S as

∫_S f dℓ := ∫_R f 1_S dℓ,

where 1_S is the indicator function of S on R.

Notice that our definition of the Lebesgue integral is duly consistent with the definition of the Lebesgue measure on R, in the sense that

∫_R 1_S dℓ = Σ_{i∈Z} ∫_{X_i} 1_{S∩X_i} dℓ_i = Σ_{i∈Z} ℓ_i(S ∩ X_i) = ℓ(S)

for any S ∈ B(R). Moreover, for any Borel measurable functions f and g in R_+^R and any α ≥ 0, we have

∫_R (αf + g) dℓ = α ∫_R f dℓ + ∫_R g dℓ,

and ∫_R f dℓ ≥ ∫_R g dℓ whenever f ≥ g. Note also that

ℓ({t ∈ R : f(t) ≠ g(t)}) = 0 implies ∫_R f dℓ = ∫_R g dℓ.  (9)
(Proof. If S_i := {t ∈ X_i : f(t) ≠ g(t)}, then ℓ_i(S_i) = ℓ(S_i) = 0, and hence, by appealing to (8), we get ∫_{X_i} f|_{X_i} dℓ_i = ∫_{X_i} g|_{X_i} dℓ_i for all i ∈ Z.) Finally, all of our convergence theorems for the expectation functional have analogous counterparts for the Lebesgue integral on R. The following exercise provides two examples.
Exercise 61. Let f and f_m, m = 1, 2, ..., be nonnegative Borel measurable functions on R such that f_m(t) → f(t) for almost all t ∈ R (i.e., ℓ({t : f_m(t) → f(t) fails}) = 0). Show that

lim_{m→∞} ∫_R f_m dℓ = ∫_R f dℓ,

provided that either of the following conditions holds:
(a) there exists a Borel measurable g : R → R_+ such that ∫_R g dℓ < ∞ and f_m, f ≤ g for all m;
(b) there exists a K > 0 such that f_m, f ≤ K for all m.
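To make the interval-by-interval definition of ∫_R f dℓ concrete, here is a rough numerical sketch of ours, with crude midpoint sums standing in for each ∫_{X_i} · dℓ_i; for f(t) = e^{−|t|} the chunks sum to 2, the value of the integral:

    # Chunked integration over R: sum the integrals of f over X_i = (i, i+1].
    import math

    f = lambda t: math.exp(-abs(t))

    def chunk_integral(i, n=10000):
        # crude midpoint sum standing in for the integral of f over (i, i+1]
        return sum(f(i + (k + 0.5) / n) for k in range(n)) / n

    total = sum(chunk_integral(i) for i in range(-30, 30))
    print(round(total, 6))                    # 2.0, the integral of exp(-|t|) over R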
We won't pursue the investigation of the Lebesgue integral on R any further here. Its connection with the Riemann integral on R will be discussed in the next chapter, and its extension to R² later than that. We will eventually use it to integrate the densities of continuous random variables when we need to make certain conditional probability computations.
8 Further Properties of E

8.1 The Change of Variables Formula

The first property that we will study here concerns the change of variables in computing the expectation of a random variable. To see what we mean by this, take any random variable x on some (X, Σ, p) and recall that x induces the Borel probability space (x(X), B(x(X)), p_x). Now suppose we are given a nonnegative random variable f defined on (x(X), B(x(X)), p_x). What is the expectation of f? Applying the definition directly, we have E_{p_x}(f) = ∫_{x(X)} f dp_x. Alternatively, we could think of f∘x as a random variable on the space (X, Σ, p) (right?) and compute E_p(f∘x). Clearly, while E_{p_x}(f) is the expectation of f with respect to the probability measure induced by x, E_p(f∘x) is computed by using the original probability measure p but on the transformed random variable f∘x. It appears that these two numbers should be the same.
To make things transparent, let us consider the experiment of throwing a single die, whose probability space is (X, 2^X, p) with X := {1, ..., 6} and p(S) := |S|/6 for all S ∈ 2^X. Consider the random variable x ∈ {1, 2}^X which is defined as x(ω) := 1 if ω is even and x(ω) := 2 if ω is odd. In addition, let f(t) := t², t ∈ {1, 2}, and observe that f is a random variable on ({1, 2}, 2^{{1,2}}, p_x). Quite clearly, we have E_{p_x}(f) = (1)(1/2) + (4)(1/2) = 5/2; after all, f takes the value 1 with probability 1/2 and the value 4 with probability 1/2. And how about f∘x? Well, this is a {1, 4}-valued random variable on (X, 2^X, p) that takes the value 1 if the outcome of the experiment is even and 4 if the outcome is odd. Thus, we have E_p(f∘x) = 5/2.
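Since everything in this example is a finite sum, the two computations can be checked mechanically. The following informal Python sketch uses exact rational arithmetic; the dictionaries p, x and p_x simply tabulate the measures and the random variable described above.

from fractions import Fraction

X = [1, 2, 3, 4, 5, 6]
p = {w: Fraction(1, 6) for w in X}            # fair die
x = {w: 1 if w % 2 == 0 else 2 for w in X}    # x = 1 on even, 2 on odd outcomes
f = lambda t: t * t                           # f(t) = t^2

# E_p(f o x): integrate f(x(.)) against the original measure p.
E_p_fx = sum(f(x[w]) * p[w] for w in X)

# E_{p_x}(f): first build the induced measure p_x on {1, 2}, then integrate f.
p_x = {t: sum(p[w] for w in X if x[w] == t) for t in (1, 2)}
E_px_f = sum(f(t) * p_x[t] for t in p_x)

print(E_p_fx, E_px_f)   # both print 5/2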
This example suggests that computing E_p(f∘x) is in general an alternative method of computing the expectation of f. It turns out that this is true as stated, provided that f is nonnegative. In fact, we would not have needed the nonnegativity requirement if f were known to have a finite expectation. As we shall see later, these observations are of great practical value.
Proposition 6. (The Change of Variables Formula) Let x be a random variable on a probability space (X, Σ, p), and let f be any random variable on (x(X), B(x(X)), p_x) with f ≥ 0 or E_{p_x}(f) < ∞. Then

∫_{x(X)} f d(p∘x⁻¹) = ∫_X (f∘x) dp. (10)

Proof. Assume first that f is a simple nonnegative random variable. In this case, we can write f = Σ_{i=1}^m a_i 1_{A_i}, where A_i is a Borel subset of x(X) and a_i ≥ 0, i = 1, ..., m, for some m ∈ N. Then (10) follows readily from the observation that f∘x = Σ_{i=1}^m a_i 1_{x⁻¹(A_i)}. Suppose next that f need not be simple but f ≥ 0. In this case we use Lemma 1 to find an increasing sequence (f_m) of simple nonnegative random variables on (x(X), B(x(X)), p_x) such that lim f_m = f. But then (f_m∘x) is also an increasing sequence of simple random variables, and we have lim(f_m∘x) = f∘x. So, by applying the Monotone Convergence Theorem 1 twice, we find

∫_{x(X)} f d(p∘x⁻¹) = lim_{m→∞} ∫_{x(X)} f_m d(p∘x⁻¹) = lim_{m→∞} ∫_X (f_m∘x) dp = ∫_X (f∘x) dp.

We leave extending this observation to an arbitrary f with E_{p_x}(f) < ∞ as an exercise. ∎
Exercise 62. (a) Complete the proof of Proposition 6.

(b) Give an example that shows that the requirement "f ≥ 0 or E_{p_x}(f) < ∞" cannot be omitted in the statement of Proposition 6.
8.2 Jensen's Inequality

You are already familiar with the next property of the expectation functional, for it is widely used in economics; it goes by the name of Jensen's Inequality. (It was proved by Johan Jensen in 1906.) This inequality compares the expectation of a concave transformation of a random variable x with the same transformation of the expectation of x. Here is its exact statement (as a probabilist would give it).

Jensen's Inequality. Let x be a random variable on a probability space (X, Σ, p) and let φ ∈ R^J be a concave function, where J is a (possibly unbounded) interval with x(X) ⊆ J. If E(x) < ∞, then φ∘x ∈ RV(X, Σ) and

E(φ(x)) ≤ φ(E(x)).

If φ is convex, then the inequality is reversed.

Proof. Since φ ∈ R^J is concave, it is continuous on int_R(J). Given that J is an interval, then, φ is continuous at all points in J except possibly at two points. Thus φ ∈ RV(J, B(J)) (Example 6.[3]). It follows that φ∘x is a random variable on (X, Σ).

To prove the main part of the proposition, define[41]

F := {f ∈ R^J : f is affine and f ≥ φ}.

By using the Minkowski Supporting Hyperplane Theorem, we find that inf_{f∈F} f(t) = φ(t) for all t ∈ J (Figure 4). Moreover, we have E(inf_{f∈F} f∘x) ≤ E(f∘x) for all f ∈ F, and hence E(inf_{f∈F} f∘x) ≤ inf_{f∈F} E(f∘x). Consequently,

E(φ∘x) = E(inf_{f∈F} f∘x) ≤ inf_{f∈F} E(f∘x) = inf_{f∈F} f(E(x)) = φ(E(x)),

where the penultimate equality follows from the affineness of each f. ∎

[FIGURE 4 ABOUT HERE]

[41] Reminder. f ∈ R^J is affine iff there exist a, b ∈ R such that f(t) = at + b for all t ∈ J.
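As an informal numerical aside, one can probe the inequality by simulation: draw a large sample from a nonnegative random variable, and compare the sample analogues of E(φ(x)) and φ(E(x)) for a concave φ. The exponential distribution and φ(t) = log(1 + t) below are arbitrary illustrative choices.

import math
import random

random.seed(0)
sample = [random.expovariate(1.0) for _ in range(200000)]  # x ~ Exp(1), so E(x) = 1

phi = math.log1p   # phi(t) = log(1 + t), a concave function on [0, infinity)

E_phi_x = sum(phi(t) for t in sample) / len(sample)
phi_E_x = phi(sum(sample) / len(sample))

print(E_phi_x, phi_E_x)   # E(phi(x)) <= phi(E(x)), as Jensen's Inequality predicts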
Exercise 63. Show that the hypothesis E(x) < ∞ cannot be omitted in Jensen's Inequality.

We will have plenty of occasions later on to make use of this celebrated inequality. For your immediate enjoyment, here are some quick applications.
Exercise 64. (The Arithmetic-Geometric Mean Inequality) Let m ∈ N, a_1, ..., a_m ≥ 0, and let λ_1, ..., λ_m > 0 satisfy Σ_{i=1}^m λ_i = 1. Use the fact that t ↦ log t is a concave function on R₊₊ to prove that

Σ_{i=1}^m λ_i a_i ≥ Π_{i=1}^m a_i^{λ_i},

with equality holding iff a_1 = ⋯ = a_m.
Exercise 65. (Ordering of Power Means) Let n ∈ N and α ≠ 0. The α-power mean is the function M_α on R₊₊ⁿ defined by

M_α(a_1, ..., a_n) := ((1/n) Σ_{i=1}^n a_i^α)^{1/α}.

Show that, for any α, β ≠ 0 with α < β, we have M_α ≤ M_β.
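The claimed monotonicity in α is easy to observe numerically; in this informal sketch the sample vector a is an arbitrary choice.

def power_mean(a, alpha):
    # M_alpha(a_1, ..., a_n) = ((1/n) * sum of a_i^alpha)^(1/alpha), alpha != 0
    return (sum(t ** alpha for t in a) / len(a)) ** (1.0 / alpha)

a = [0.5, 2.0, 3.0, 7.0]
for alpha in (-2.0, -1.0, 0.5, 1.0, 2.0, 3.0):
    print(alpha, power_mean(a, alpha))   # the printed means increase with alpha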
Exercise 66.ᴴ (A Mean-Median Inequality) Let x be a random variable on a probability space (X, Σ, p) such that E(x) < ∞. A median of x is defined to be any number m such that p_x((−∞, m]) ≥ 1/2 and p_x([m, ∞)) ≥ 1/2. We denote an arbitrarily fixed median of x as med(x).

(a) Show that

med(x) ∈ arg min{E(|x − a|) : a ∈ R}.

(b) Use Jensen's Inequality to prove that

|E(x) − med(x)| ≤ √(E((x − E(x))²)).

Interpretation: The median and the expectation of a random variable (with finite mean) cannot differ by more than one standard deviation.
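Here is an informal simulation of the bound in (b), using the (arbitrarily chosen) lognormal distribution as an example of a skewed law with finite mean.

import random
import statistics

random.seed(1)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(100000)]  # a skewed law

mean = statistics.fmean(sample)
median = statistics.median(sample)
std = statistics.pstdev(sample)

print(abs(mean - median), std)   # the gap stays within one standard deviation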
Exercise 67. Let x be a random variable on (R, B(R)) such that x(R) = [0, 1], and let φ ∈ C[0, 1]. Prove: If E_p(φ(x)) ≤ φ(E_p(x)) holds for all p ∈ P(R), then φ must be concave on [0, 1].

We conclude with an exercise that provides a Jensen-like inequality for functions that are not necessarily concave.
Exercise 68.ᴴ A function φ ∈ R^{[0,∞)} is said to be subquadratic if there exists a function C ∈ R^{[0,∞)} such that

φ(b) − φ(a) − φ(|b − a|) ≤ C(a)(b − a)

for all a, b ≥ 0.[42]

(a) Show that if φ is a nonpositive subquadratic function, then it is concave.

(b) Show that a subquadratic function need not be concave.

(c) Let x be a nonnegative random variable on a probability space (X, Σ, p) and let φ ∈ R^{[0,∞)} be subquadratic and continuous. Prove: If E(x) < ∞, then φ∘x ∈ RV(X, Σ) and

E(φ(x) − φ(|x − E(x)|)) ≤ φ(E(x)).

This inequality holds as an equality if φ(t) = t² for all t ≥ 0.

[42] This exercise is taken from Abramovich, Jameson and Sinnamon (2003), who provide many other interesting results about subquadratic functions.
8.3 Chebyshev's Inequality

Another inequality of probability theory that plays a major role in the derivation of laws of large numbers is Chebyshev's Inequality. (It was proved by Pafnuty Chebyshev in 1867, who used it to prove a generalization of the so-called Weak Law of Large Numbers.) This inequality is an easy consequence of the following simple result, which can also be used to obtain other sorts of estimations of tail probabilities.

Lemma 3. Let (X, Σ, p) be a probability space and x ∈ RV(X, Σ). Then, for any continuous f : R → R₊, we have

p({ω ∈ X : f(x(ω)) ≥ a}) ≤ (1/a) E(f∘x) for all a > 0.

Proof. It is easy to see that f∘x ∈ RV(X, Σ). Define S := {ω ∈ X : f(x(ω)) ≥ a}. Then f∘x ≥ a1_S, so

E(f∘x) ≥ ∫_X a1_S dp = a p(S),

and we are done. ∎

Here is an immediate application of this lemma.

Markov's Inequality. For any random variable x defined on a probability space (X, Σ, p), we have

p({ω ∈ X : |x(ω)| ≥ a}) ≤ (1/a) E(|x|) for all a > 0.

Proof. Apply Lemma 3 with f being the map t ↦ |t| on R. ∎

And here is another.

Chebyshev's Inequality. For any random variable x defined on a probability space (X, Σ, p), we have

p({ω ∈ X : |x(ω)| ≥ a}) ≤ (1/a²) E(x²) for all a > 0.

Proof. Apply Lemma 3 with f being the map t ↦ t² on R (and with a² playing the role of a). ∎
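As an informal empirical illustration, one can compare tail frequencies of a simulated random variable with the Chebyshev bound E(x²)/a²; the standard normal distribution below is an arbitrary choice.

import random

random.seed(2)
sample = [random.gauss(0.0, 1.0) for _ in range(100000)]

E_x2 = sum(t * t for t in sample) / len(sample)   # empirical E(x^2)
for a in (1.0, 2.0, 3.0):
    tail = sum(1 for t in sample if abs(t) >= a) / len(sample)
    print(a, tail, E_x2 / a ** 2)   # the tail frequency never exceeds the bound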
Exercise 69. Let x be a random variable on a probability space (X, Σ, p) such that E(x) exists. We define V(x) := E((x − E(x))²), which is called the variance of x.

(a) Chebyshev's Inequality is often stated as

p({ω : |x(ω) − E(x)| ≥ a}) ≤ (1/a²) V(x) for all a > 0.

Prove this.

(b) Chebyshev's Inequality may hold as an equality. Verify this by using a random variable x ∈ RV((0, 1), B(0, 1)) whose distribution function is (1/2) 1_{[−a,a)} + 1_{[a,∞)}. (How do we know that there is such a random variable?)
9 Spaces of Integrable Random Variables

Let 1 ≤ ρ < ∞, and let (X, Σ, p) be a probability space. The set of all random variables x on (X, Σ, p) such that E(|x|^ρ) < ∞ is denoted as L^ρ(X, Σ, p). Thus L¹(X, Σ, p) is the space of all integrable random variables on (X, Σ, p). In turn, L²(X, Σ, p) is the space of all square-integrable random variables on (X, Σ, p).

It is easily checked that L^ρ(X, Σ, p) is closed under (pointwise) scalar multiplication. It is also closed under pointwise addition. Indeed, if x, y ∈ L^ρ(X, Σ, p), then, because

|x + y|^ρ ≤ (|x| + |y|)^ρ ≤ (2 max{|x|, |y|})^ρ ≤ 2^ρ (|x|^ρ + |y|^ρ),

we find E(|x + y|^ρ) < ∞, that is, x + y ∈ L^ρ(X, Σ, p). It follows that L^ρ(X, Σ, p) is a linear space (under the usual pointwise operations). Moreover, E is a linear functional on this space.[43]
There is a natural way of making L^ρ(X, Σ, p) a normed linear space. Define the real map ‖·‖_ρ on this linear space by

‖x‖_ρ := (E(|x|^ρ))^{1/ρ} = (∫_X |x|^ρ dp)^{1/ρ}.

[43] Warning. But E need not be continuous on this space. Quiz. Prove this by using the example I gave to show that the inequality in Fatou's Lemma may hold strictly.

The following exercise carries you through a proof of the fact that this map is a norm on L^ρ(X, Σ, p).
Exercise 70. (a) (Hölder's Inequality) Let ρ and σ be any two real numbers with ρ > 1 and 1/ρ + 1/σ = 1. For x ∈ L^ρ(X, Σ, p) and y ∈ L^σ(X, Σ, p) with ‖x‖_ρ > 0 and ‖y‖_σ > 0, use the Arithmetic-Geometric Mean Inequality (Exercise 64, with weights λ_1 = 1/ρ and λ_2 = 1/σ) with

a_1 = |x(ω)|^ρ/‖x‖_ρ^ρ and a_2 = |y(ω)|^σ/‖y‖_σ^σ

for any ω ∈ X, and integrate the resulting inequality to conclude that

E(|xy|) ≤ ‖x‖_ρ ‖y‖_σ

for any x ∈ L^ρ(X, Σ, p) and y ∈ L^σ(X, Σ, p).[44]

(b) (Ordering of L^ρ-norms) Let ρ ≥ 1. Use Hölder's Inequality to prove that if z ∈ L^ρ(X, Σ, p), then ‖z‖_σ ≤ ‖z‖_ρ for all 1 ≤ σ ≤ ρ. Compare this result with the observation noted in Exercise 66.

(c) (Minkowski's Inequality) Use Hölder's Inequality to prove that

‖z + w‖_ρ ≤ ‖z‖_ρ + ‖w‖_ρ for all z, w ∈ L^ρ(X, Σ, p).

[44] The case ρ = σ = 2 is the Cauchy-Schwarz Inequality: E(|xy|) ≤ ‖x‖₂ ‖y‖₂.
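Hölder's Inequality, too, can be probed by simulation. In the informal sketch below, X = [0, 1] carries the uniform (Lebesgue) measure, the exponents ρ = 3 and σ = 3/2 are an arbitrary conjugate pair, and x(ω) = ω², y(ω) = 1 + ω are arbitrary test functions; expectations are replaced by sample averages.

import random

random.seed(3)
rho, sigma = 3.0, 1.5   # conjugate exponents: 1/3 + 2/3 = 1

omega = [random.random() for _ in range(50000)]   # uniform draws from X = [0, 1]
x = [t ** 2 for t in omega]                       # x(w) = w^2
y = [1.0 + t for t in omega]                      # y(w) = 1 + w

E = lambda z: sum(z) / len(z)
norm = lambda z, r: E([abs(t) ** r for t in z]) ** (1.0 / r)

print(E([a * b for a, b in zip(x, y)]))   # E(|xy|), roughly 0.58
print(norm(x, rho) * norm(y, sigma))      # ||x||_rho * ||y||_sigma, the larger number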
Minkowski's Inequality brings us very close to declaring (L^ρ(X, Σ, p), ‖·‖_ρ) a normed linear space, but we can't quite do that right now, because ‖·‖_ρ identifies any two random variables that differ from each other only on a set of probability zero. Put differently, the problem is that ‖x‖_ρ = 0 does not guarantee that x = 0; it implies only that x =_{a.s.} 0 (Observation 5). But this is only a minor technical wrinkle. If we agree to identify any two random variables that are equal to each other almost surely, then the difficulty disappears. Formally, this means that L^ρ(X, Σ, p) should really be thought of as composed of equivalence classes of random variables.[45] Viewed this way, L^ρ(X, Σ, p) would carry all of the properties of a normed linear space.

[45] To be completely formal about this, I should define the equivalence relation ≈ on L^ρ(X, Σ, p) by x ≈ y iff x =_{a.s.} y, and make the quotient space L^ρ(X, Σ, p)/≈ a linear space by defining λ[x]_≈ := [λx]_≈ and [x]_≈ + [y]_≈ := [x + y]_≈ for all x, y ∈ L^ρ(X, Σ, p) and λ ∈ R. (This makes L^ρ(X, Σ, p) and L^ρ(X, Σ, p)/≈ homomorphic, if you know what this means.) Define next the norm ‖·‖ on L^ρ(X, Σ, p)/≈ by ‖[x]_≈‖ := ‖x‖_ρ. Whenever one refers to L^ρ(X, Σ, p) as a normed linear space, formally speaking, (s)he must in fact be talking about (L^ρ(X, Σ, p)/≈, ‖·‖).
In fact, we can say a bit more.

The Riesz-Fischer Theorem. Let (X, Σ, p) be a probability space and 1 ≤ ρ < ∞. Then L^ρ(X, Σ, p) is a Banach space.

Proof. Let (x_m) be a Cauchy sequence in L^ρ(X, Σ, p). We wish to find a subsequence of this sequence that converges with respect to the norm ‖·‖_ρ. This is enough to conclude that (x_m) itself converges in L^ρ(X, Σ, p), for a Cauchy sequence with a convergent subsequence must converge.
Because (x_m) is Cauchy, it must have a subsequence (x_{m_k}) such that ‖x_m − x_{m_k}‖_ρ ≤ 1/2^k for all m ≥ m_k. (Yes?) Then

Σ_{i=1}^k ‖x_{m_{i+1}} − x_{m_i}‖_ρ < 1, k = 1, 2, ... (11)

For each k, define

y_k := |x_{m_1}| + |x_{m_2} − x_{m_1}| + ⋯ + |x_{m_{k+1}} − x_{m_k}|.

Obviously, (y_k(ω)) is an increasing sequence for any ω ∈ X, so we may define y : X → [0, ∞] by

y(ω) := lim_{k→∞} y_k(ω).

Moreover, applying Minkowski's Inequality and (11), for each k we have

‖y_k‖_ρ ≤ ‖x_{m_1}‖_ρ + ‖x_{m_2} − x_{m_1}‖_ρ + ⋯ + ‖x_{m_{k+1}} − x_{m_k}‖_ρ < ‖x_{m_1}‖_ρ + 1,

while, by the Monotone Convergence Theorem 3,

lim_{k→∞} ∫_X y_k^ρ dp = ∫_X y^ρ dp.

It follows that there is a constant K (namely, (‖x_{m_1}‖_ρ + 1)^ρ) such that ∫_X y^ρ dp ≤ K. Therefore, y is finite almost surely. (Yes?) Put more precisely, there is a subset S of X such that p(S) = 0 and

|x_{m_1}(ω)| + Σ_{i=1}^∞ |x_{m_{i+1}}(ω) − x_{m_i}(ω)| < ∞ for all ω ∈ X\S,

and this implies that the series

x_{m_1}(ω) + Σ_{i=1}^∞ (x_{m_{i+1}}(ω) − x_{m_i}(ω))

converges for all ω ∈ X\S. Since, for any k = 2, 3, ..., x_{m_k} = x_{m_1} + Σ_{i=1}^{k−1} (x_{m_{i+1}} − x_{m_i}), we may thus conclude that (x_{m_k}(ω)) converges (as k → ∞) for each ω ∈ X\S. So, if we define

x(ω) := lim x_{m_k}(ω) if ω ∈ X\S, and x(ω) := 0 otherwise,

we have x_{m_k} →_{a.s.} x.

Clearly, x ∈ RV(X, Σ) (why?), so it remains to prove that ‖x_{m_k} − x‖_ρ → 0. This is not difficult. Since |x| ≤ y and |x_{m_k}| ≤ y for each k, we have

|x_{m_k} − x|^ρ ≤ 2^ρ (|x_{m_k}|^ρ + |x|^ρ) ≤ 2^{ρ+1} y^ρ.

Since E(y^ρ) < ∞ and |x_{m_k} − x|^ρ →_{a.s.} 0, therefore, we can apply Lebesgue's Dominated Convergence Theorem 2 to conclude that lim ‖x_{m_k} − x‖_ρ^ρ = lim E(|x_{m_k} − x|^ρ) = 0, and the proof is complete. ∎
Remark 4. Consider the probability space ([0, 1], B[0, 1], λ) and define the random variable x on this space by letting x(ω) := 1/ω if ω is a rational number in (0, 1], and x(ω) := 1 otherwise. This random variable is clearly unbounded. Yet x is bounded almost surely; put differently, it is unbounded only where it doesn't matter anyway, that is, on a set of probability zero. (For instance, E(x) = 1, right?) Especially for probabilistic applications, almost sure boundedness is more useful than the standard notion of boundedness.

A random variable x on a probability space (X, Σ, p) is said to be essentially bounded if there exist a K > 0 and a set S ∈ Σ with p(S) = 0 such that |x(ω)| ≤ K for all ω ∈ X\S. The set of all such random variables is denoted L^∞(X, Σ, p). Clearly, B(X) ∩ RV(X, Σ) ⊆ L^∞(X, Σ, p), but the latter cannot be normed by the sup-norm (as the previous example shows). Instead we use the following modification of the sup-norm for L^∞(X, Σ, p):

‖x‖_∞ := inf{K > 0 : p({ω ∈ X : |x(ω)| ≤ K}) = 1}.[46]

Identifying again random variables that are equal to each other almost surely, and using ‖·‖_∞, we can view L^∞(X, Σ, p) as a normed linear space. This space too is Banach. (The proof of this is much easier than that of the Riesz-Fischer Theorem.)

[46] Sometimes the notation ess sup_{ω∈X} |x(ω)| is used instead of ‖x‖_∞. But since ‖·‖_∞ reduces to the usual sup-norm on B(X), the notation I adopt here will not cause confusion.
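The distinction between the sup-norm and ‖·‖_∞ is transparent on a finite space. In the informal sketch below (all names and numbers are illustrative), the point d has probability zero, so the huge value of x there inflates the sup-norm but leaves ‖x‖_∞ untouched.

# A four-point sample space; the point "d" carries probability zero.
p = {"a": 0.5, "b": 0.3, "c": 0.2, "d": 0.0}
x = {"a": 1.0, "b": -2.0, "c": 1.5, "d": 1e6}   # x blows up only on the null point

sup_norm = max(abs(v) for v in x.values())
# ||x||_inf = inf{K > 0 : p(|x| <= K) = 1}; on a finite space this is just the
# largest |x(w)| over the points w of positive probability.
ess_sup = max(abs(x[w]) for w in p if p[w] > 0)

print(sup_norm, ess_sup)   # 1000000.0 versus 2.0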
In Exercises 71-74, (X, Σ, p) stands for an arbitrarily fixed probability space.

Exercise 71. (Markov's Inequality) Let 1 ≤ ρ < ∞, and x ∈ L^ρ(X, Σ, p). Show that

p(|x| ≥ a) ≤ (1/a^ρ) E(|x|^ρ) for all a > 0.

Exercise 72.ᴴ Prove: If 1 ≤ σ ≤ ρ ≤ ∞, then L^ρ(X, Σ, p) ⊆ L^σ(X, Σ, p).

Exercise 73.ᴴ Prove: If 1 ≤ σ < ρ ≤ ∞, then L^ρ(X, Σ, p) = L^σ(X, Σ, p) iff there is no sequence (A_m) in Σ such that p(A_m) > 0 for each m and p(A_m) → 0.
Exercise 74. Let 1 ≤ ρ < ∞ and take any sequence (x_m) in L^ρ(X, Σ, p) such that x_m →_{a.s.} x for some x ∈ L^ρ(X, Σ, p). Show that ‖x_m − x‖_ρ → 0 iff ‖x_m‖_ρ → ‖x‖_ρ.
Exercise 75. (a) Prove that C[0, 1] is dense in L^ρ([0, 1], B[0, 1], λ) for any 1 ≤ ρ < ∞.

(b) Prove that L^ρ([0, 1], B[0, 1], λ) is separable for any 1 ≤ ρ < ∞.

Exercise 76.ᴴ Prove that L^∞([0, 1], B[0, 1], λ) is not separable.
HINTS FOR SELECTED EXERCISES OF CHAPTER 7

Exercise 5. It is enough to show that the right-hand side of the equation above is a σ-algebra.

Exercise 7. σ(A) = {A ⊆ R : either A or R\A is countable}.

Exercise 12. Show that {T ∩ Y : T ∈ B(X)} is a σ-algebra that contains all closed sets in Y. This proves that B(Y) ⊆ {T ∩ Y : T ∈ B(X)}. To prove the converse containment, show that T := {T ⊆ X : T ∩ Y ∈ B(Y)} is a σ-algebra that contains all closed sets in X, so that B(X) ⊆ T.

Exercise 13. Show that T := {A ⊆ [0, 1] : A + θ ∈ B(R)} is a σ-algebra that contains all closed sets in [0, 1], and hence B[0, 1] ⊆ T. This means that (S + θ) ∩ [0, 1] ∈ B(R). Now use Exercise 12.

Exercise 16. Couple the Inclusion-Exclusion Formula with Proposition 2.

Exercise 18. (a) Call a Borel subset S of X nice if p(S) = inf{p(O) : O ∈ O_X and S ⊆ O}. Now define A := {S ∈ B(X) : both S and X\S are nice}. Check that A is a σ-algebra on X.
Exercise 21. Enumerate X as {ω_1, ω_2, ...}, let K := Σ_{i=1}^∞ f(ω_i)/2^i, and define g(ω_i) := 1/(2^i K), i = 1, 2, ....
Exercise 25. Let me show that F_p is actually right-continuous. Take any x ∈ R and let x_m ↓ x. Then (−∞, x_1] ⊇ (−∞, x_2] ⊇ ⋯, so it follows from Proposition 2 that

lim F_p(x_m) = lim p((−∞, x_m]) = p(∩_{i=1}^∞ (−∞, x_i]) = p((−∞, x]) = F_p(x).
Exercise 26. If (A_m) ∈ B(R)^∞ satisfies A_i ∩ A_j = ∅ for all i ≠ j, and X_i := (i, i+1], i ∈ Z, then

λ(∪_{j=1}^∞ A_j) = Σ_{i∈Z} λ_i(∪_{j=1}^∞ (A_j ∩ X_i)) = Σ_{i∈Z} Σ_{j=1}^∞ λ_i(A_j ∩ X_i) = Σ_{j=1}^∞ Σ_{i∈Z} λ_i(A_j ∩ X_i) = Σ_{j=1}^∞ λ(A_j),

where we used Proposition 1.10 to get the third equality.
Exercise 27. First recall Exercise 13. Then let X := [−|θ|, 1 + |θ|], and keeping in mind the point raised in part (a), consider λ as defined on B(X). Now define the function ν ∈ R^{B[0,1]} by ν(S) := λ(S + θ). Show that λ and ν agree on all right-closed intervals in [0, 1], and use Proposition 4.

Exercise 28. (a) Define f ∈ [0, 1]^{[0,1]} by f(t) := λ(A ∩ [0, t]) and use the Intermediate Value Theorem.

Exercise 29. (a) Recall Exercises 13 and 27.

(b) If ω ∈ [0, 1], then ω − s ∈ Q for some s ∈ S. If ω − s = r_m, then we have ω ∈ S_m, m = 1, 2, ... .
Exercise 30. First show that S is closed under taking finite unions. Next, note that (A_m) ∈ S^∞ implies B_m := A_1 ∪ ⋯ ∪ A_m ∈ S for each m, and ∪ A_i = ∪ B_i.

Exercise 32. (b) To show that m(A) is closed under complementation, define M := {A ⊆ X : X\A ∈ m(A)}, and verify that m(A) ⊆ M. To show that m(A) is closed under taking finite unions, define

M_A := {B ⊆ X : A ∪ B ∈ m(A)} for all A ∈ m(A),

and show that M_A is monotone and A ⊆ M_A for any A ∈ m(A).
Exercise 35. |x| = max{x, 0} + max{−x, 0}.

Exercise 37. For any a ∈ R,

{ω ∈ X : liminf_{m→∞} x_m(ω) < a} = ∩_{k=1}^∞ ∪_{m=k}^∞ {ω ∈ X : x_m(ω) < a}

and

{ω ∈ X : limsup_{m→∞} x_m(ω) > a} = ∩_{k=1}^∞ ∪_{m=k}^∞ {ω ∈ X : x_m(ω) > a}.
Exercise 38. Let S denote the class of all σ-algebras Σ on X such that C(X) ⊆ RV(X, Σ), and define Σ* := ∩ S. To show that B(X) ⊆ Σ*, take any nonempty closed subset S of X. Define f : X → R by f(x) := d(x, S). Then f ∈ C(X) and S = f⁻¹(0), so S ∈ Σ for all Σ ∈ S.

Exercise 44. Define S := {ω ∈ X : x(ω) ≥ y(ω)}, and notice that z1_S ∈ L(x) whenever z ∈ L(y).

Exercise 46. Enumerate X as {ω_1, ω_2, ...}, and define y_m := Σ_{i=1}^m x1_{{ω_i}} for each m ∈ N. Now apply the Monotone Convergence Theorem 1 to (y_m).
Exercise 50. (a) Here is how to do this for α = 1. We have x⁺ − x⁻ + y⁺ − y⁻ = x + y = (x + y)⁺ − (x + y)⁻, so that x⁺ + y⁺ + (x + y)⁻ = (x + y)⁺ + x⁻ + y⁻. Now use Proposition 5.

Exercise 51. Define S := {ω ∈ X : x(ω) < y(ω)}, and use the hypothesis and Exercise 50.(a) to get ∫_X (y − x)1_S dp = 0. It follows from Exercise 45 that (y − x)1_S =_{a.s.} 0, that is, p(S) = 0.
Exercise 52. Use Exercise 47.(b).

Exercise 55. Define y_m := x − x_m, m = 1, 2, ..., and use Lebesgue's Dominated Convergence Theorem 1 to get ∫ y_m⁺ dp → 0.

Exercise 58. Observe that x1_{A_m} →_{a.s.} 0, and use Lebesgue's Dominated Convergence Theorem 2.
Exercise 59. Define y_m := x − x_m, and verify that

E(y_m⁺) ≤ 1/k + ∫_{A_{m,k}} x dp, m, k = 1, 2, ...,

where A_{m,k} := {ω : y_m(ω) > 1/k}, m, k ∈ N. Now let m → ∞ first and k → ∞ next, and recall Exercise 58.
Exercise 66. (b) Using Jensen's Inequality, part (a), and Jensen's Inequality again,

|E(x − med(x))| ≤ E(|x − med(x)|) ≤ E(|x − E(x)|) ≤ √(E((x − E(x))²)).

Exercise 68. (b) If φ([0, ∞)) ⊆ [1, 2], then φ is subquadratic.

(c) Use the definition of subquadraticity and the monotonicity of the Lebesgue integral to get E(φ(x) − φ(|x − E(x)|)) − φ(E(x)) ≤ C(E(x)) E(x − E(x)) = 0.
Exercise 70. (b) Choose x = |z|^σ and y = 1.

(c) Choose x = |z| and y = |z + w|^{ρ−1}, and then x = |w| and y = |z + w|^{ρ−1}.

Exercise 72. Use Hölder's Inequality.
Exercise 73. If x ∈ L^σ(X, Σ, p)\L^ρ(X, Σ, p), set A_m := {ω ∈ X : |x(ω)| ≥ m}, m = 1, 2, ..., and check that p(A_m) > 0 for each m, while p(A_m) → 0. (If p(A_m) = 0 for some m, then x ∈ L^∞(X, Σ, p).)

Conversely, suppose there is an (A_m) ∈ Σ^∞ such that p(A_m) > 0 (for each m) and p(A_m) → 0. Without loss of generality, assume p(A_m) ≤ 1/2^m and p(A_{m+1}) ≤ (1/4) p(A_m). Define B_m := A_m\(A_{m+1} ∪ A_{m+2} ∪ ⋯), so that 0 < p(B_m) ≤ 1/2^m for each m, while B_i ∩ B_j = ∅ for each i ≠ j. If ρ < ∞, you can settle the matter by studying the real function x defined on X by

x(ω) := Σ_{i=1}^∞ (1/p(B_i))^{1/ρ} 1_{B_i}(ω).
Exercise 76. A metric space cannot be separable if it contains an uncountable set S such that, for some ε > 0, the distance between any two distinct points in S is at least ε.