Dajani Dirksin Ergodic Theory

A Simple Introduction to Ergodic Theory
Karma Dajani and Sjoerd Dirksin
December 18, 2008

2
Contents
1 Introduction and preliminaries 5

1.1 What is Ergodic Theory? . . . . . . . . . . . . . . . . . . . . . 5
1.2 Measure Preserving Transformations . . . . . . . . . . . . . . 6
1.3 Basic Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Recurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Induced and Integral Transformations . . . . . . . . . . . . . . 15
1.5.1 Induced Transformations . . . . . . . . . . . . . . . . . 15
1.5.2 Integral Transformations . . . . . . . . . . . . . . . . . 18
1.6 Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.7 Other Characterizations of Ergodicity . . . . . . . . . . . . . . 22
1.8 Examples of Ergodic Transformations . . . . . . . . . . . . . . 24
2 The Ergodic Theorem 29

2.1 The Ergodic Theorem and its consequences . . . . . . . . . . . 29
2.2 Characterization of Irreducible Markov Chains . . . . . . . . . 41
2.3 Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Measure Preserving Isomorphisms and Factor Maps 47

3.1 Measure Preserving Isomorphisms . . . . . . . . . . . . . . . . 47
3.2 Factor Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Natural Extensions . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Entropy 55
4.1 Randomness and Information . . . . . . . . . . . . . . . . . . 55
4.2 Definitions and Properties . . . . . . . . . . . . . . . . . . . . 56
4.3 Calculation of Entropy and Examples . . . . . . . . . . . . . . 63
4.4 The Shannon-McMillan-Breiman Theorem . . . . . . . . . . . 66
4.5 Lochs’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3
4
5 Hurewicz Ergodic Theorem 79

5.1 Equivalent measures . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Non-singular and conservative transformations . . . . . . . . . 80
5.3 Hurewicz Ergodic Theorem . . . . . . . . . . . . . . . . . . . . 83
6 Invariant Measures for Continuous Transformations 89

6.1 Existence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Unique Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . . 96
7 Topological Dynamics 101

7.1 Basic Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.2 Topological Entropy . . . . . . . . . . . . . . . . . . . . . . . 108
7.2.1 Two Definitions . . . . . . . . . . . . . . . . . . . . . . 108
7.2.2 Equivalence of the two Definitions . . . . . . . . . . . . 114
7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8 The Variational Principle 127

8.1 Main Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.2 Measures of Maximal Entropy . . . . . . . . . . . . . . . . . . 136
Chapter 1
Introduction and preliminaries
1.1 What is Ergodic Theory?

It is not easy to give a simple definition of Ergodic Theory because it uses
techniques and examples from many fields such as probability theory, statis-
tical mechanics, number theory, vector fields on manifolds, group actions of
homogeneous spaces and many more.
The word ergodic is a mixture of two Greek words: ergon (work) and odos
(path). The word was introduced by Boltzmann (in statistical mechanics)
regarding his hypothesis: for large systems of interacting particles in equilib-
rium, the time average along a single trajectory equals the space average. The
hypothesis as it was stated was false, and the investigation for the conditions
under which these two quantities are equal lead to the birth of ergodic theory
as is known nowadays.
A modern description of what ergodic theory is would be: it is the study
of the long term average behavior of systems evolving in time. The collection
of all states of the system form a space X, and the evolution is represented
by either
– a transformation T : X → X, where T x is the state of the system at
time t = 1, when the system (i.e., at time t = 0) was initially in state x.
(This is analogous to the setup of discrete time stochastic processes).
– if the evolution is continuous or if the configurations have spacial struc-
ture, then we describe the evolution by looking at a group of transfor-
mations G (like Z2 , R, R2 ) acting on X, i.e., every g ∈ G is identified
with a transformation Tg : X → X, and Tgg0 = Tg ◦ Tg0 .
5
6 Introduction and preliminaries
The space X usually has a special structure, and we want T to preserve

the basic structure on X. For example
–if X is a measure space, then T must be measurable.
–if X is a topological space, then T must be continuous.
–if X has a differentiable structure, then T is a diffeomorphism.
In this course our space is a probability space (X, B, µ), and our time is
discrete. So the evolution is described by a measurable map T : X → X, so
that T −1 A ∈ B for all A ∈ B. For each x ∈ X, the orbit of x is the sequence
x, T x, T 2 x, . . . .
If T is invertible, then one speaks of the two sided orbit
. . . , T −1 x, x, T x, . . . .
We want also that the evolution is in steady state i.e. stationary. In the
language of ergodic theory, we want T to be measure preserving.
1.2 Measure Preserving Transformations

Definition 1.2.1 Let (X, B, µ) be a probability space, and T : X → X mea-
surable. The map T is said to be measure preserving with respect to µ if
µ(T −1 A) = µ(A) for all A ∈ B.
This definition implies that for any measurable function f : X → R, the

process
f, f ◦ T, f ◦ T 2 , . . .
is stationary. This means that for all Borel sets B1 , . . . , Bn , and all integers
r1 < r2 < . . . < rn , one has for any k ≥ 1,
µ ({x : f (T r1 x) ∈ B1 , . . . f (T rn x) ∈ Bn }) =
µ {x : f (T r1 +k x) ∈ B1 , . . . f (T rn +k x) ∈ Bn } .

In case T is invertible, then T is measure preserving if and only if µ(T A) =

µ(A) for all A ∈ B. We can generalize the definition of measure preserving to
the following case. Let T : (X1 , B1 , µ1 ) → (X2 , B2 , µ2 ) be measurable, then
T is measure preserving if µ1 (T −1 A) = µ2 (A) for all A ∈ B2 . The following
Measure Preserving Transformations 7
gives a useful tool for verifying that a transformation is measure preserving.

For this we need the notions of algebra and semi-algebra.
Recall that a collection S of subsets of X is said to be a semi-algebra if
(i) ∅ ∈ S, (ii) A ∩ B ∈ S whenever A, B ∈ S, and (iii) if A ∈ S, then
X \A = ∪ni=1 Ei is a disjoint union of elements of S. For example if X = [0, 1),
and S is the collection of all subintervals, then S is a semi-algebra. Or if
X = {0, 1}Z , then the collection of all cylinder sets {x : xi = ai , . . . , xj = aj }
is a semi-algebra.
An algebra A is a collection of subsets of X satisfying:(i) ∅ ∈ A, (ii) if
A, B ∈ A, then A ∩ B ∈ A, and finally (iii) if A ∈ A, then X \ A ∈ A.
Clearly an algebra is a semi-algebra. Furthermore, given a semi-algebra S
one can form an algebra by taking all finite disjoint unions of elements of S.
We denote this algebra by A(S), and we call it the algebra generated by S.
It is in fact the smallest algebra containing S. Likewise, given a semi-algebra
S (or an algebra A), the σ-algebra generated by S (A) is denoted by B(S)
(B(A)), and is the smallest σ-algebra containing S (or A).
A monotone class C is a collection of subsets of X with the following two
properties
– if E1 ⊆ E2 ⊆ . . . are elements of C, then ∪∞
i=1 Ei ∈ C,
– if F1 ⊇ F2 ⊇ . . . are elements of C, then ∩∞

i=1 Fi ∈ C.
The monotone class generated by a collection S of subsets of X is the smallest

monotone class containing S.
Theorem 1.2.1 Let A be an algebra of X, then the σ-algebra B(A) gener-

ated by A equals the monotone class generated by A.
Using the above Theorem, one can get an easier criterion for checking that
a transformation is measure preserving.
Theorem 1.2.2 Let (Xi , Bi , µi ) be probability spaces, i = 1, 2, and T : X1 →

X2 a transformation. Suppose S2 is a generating semi-algebra of B2 . Then,
T is measurable and measure preserving if and only if for each A ∈ S2 , we
have T −1 A ∈ B1 and µ1 (T −1 A) = µ2 (A).
Proof. Let
C = {B ∈ B2 : T −1 B ∈ B1 , and µ1 (T −1 B) = µ2 (B)}.
Then, S2 ⊆ C ⊆ B2 , and hence A(S2 ) ⊂ C. We show that C is a monotone

class. Let E1 ⊆ E2 ⊆ . . . be elements of C, and let E = ∪∞
i=1 Ei . Then,
−1 ∞ −1
T E = ∪i=1 T Ei ∈ B1 .
µ1 (T −1 E) = µ1 (∪∞
n=1 T
−1
En ) = lim µ1 (T −1 En )
n→∞
= lim µ2 (En )
n→∞
= µ2 (∪∞
n=1 En )
= µ2 (E).
Thus, E ∈ C. A similar proof shows that if F1 ⊇ F2 ⊇ . . . are elements of C,

then ∩∞i=1 Fi ∈ C. Hence, C is a monotone class containing the algebra A(S2 ).
By the monotone class theorem, B2 is the smallest monotone class containing
A(S2 ), hence B2 ⊆ C. This shows that B2 = C, therefore T is measurable
and measure preserving.
For example if
– X = [0, 1) with the Borel σ-algebra B, and µ a probability measure on B.
Then a transformation T : X → X is measurable and measure preserving if
and only if T −1 [a, b) ∈ B and µ (T −1 [a, b)) = µ ([a, b)) for any interval [a, b).
– X = {0, 1}N with product σ-algebra and product measure µ. A trans-
formation T : X → X is measurable and measure preserving if and only
if
T −1 ({x : x0 = a0 , . . . , xn = an }) ∈ B,
and
µ T −1 {x : x0 = a0 , . . . , xn = an } = µ ({x : x0 = a0 , . . . , xn = an })

for any cylinder set.
Exercise 1.2.1 Recall that if A and B are measurable sets, then
A∆B = (A ∪ B) \ (A ∩ B) = (A \ B) ∪ (B \ A).
Show that for any measurable sets A, B, C one has
µ(A∆B) ≤ µ(A∆C) + µ(C∆B).
Another useful lemma is the following (see also ([KT]).

Basic Examples 9
Lemma 1.2.1 Let (X, B, µ) be a probability space, and A an algebra gener-

ating B. Then, for any A ∈ B and any > 0, there exists C ∈ A such that
µ(A∆C) < .
Proof. Let
D = {A ∈ B : for any > 0, there exists C ∈ A such that µ(A∆C) < }.
Clearly, A ⊆ D ⊆ B. By the Monotone Class Theorem (Theorem (1.2.1)),

we need to show that D is a monotone
S∞ class. To this end, let A1 ⊆ A2 ⊆ · · ·
be a sequence in D, and let A = n=1 An , notice that µ(A) = lim µ(An ).
n→∞
Let > 0, there exists an N such that µ(A∆AN ) = |µ(A) − µ(AN )| < /2.
Since AN ∈ D, then there exists C ∈ A such that µ(AN ∆C) < /2. Then,
µ(A∆C) ≤ µ(A∆AN ) + µ(AN ∆C) < .
Hence, A ∈ D. Similarly, one can show that D is closed under decreasing in-
tersections so that D is a monotone class containg A, hence by the Monotone
class Theorem B ⊆ D. Therefore, B = D, and the theorem is proved.

1.3 Basic Examples

(a) Translations – Let X = [0, 1) with the Lebesgue σ-algebra B, and
Lebesgue measure λ. Let 0 < θ < 1, define T : X → X by
T x = x + θ mod 1 = x + θ − bx + θc.
Then, by considering intervals it is easy to see that T is measurable and

measure preserving.
(b) Multiplication by 2 modulo 1 – Let (X, B, λ) be as in example (a), and
let T : X → X be given by

2x 0 ≤ x < 1/2
T x = 2x mod 1 =
2x − 1 1/2 ≤ x < 1.
For any interval [a, b),

a b a+1 b+1
T −1 [a, b) = [ , ) ∪ [ , ),
2 2 2 2
and
λ T −1 [a, b) = b − a = λ ([a, b)) .

Although this map is very simple, it has in fact many facets. For example,
iterations of this map yield the binary expansion of points in [0, 1) i.e., using
T one can associate with each point in [0, 1) an infinite sequence of 0’s and
1’s. To do so, we define the function a1 by
(
0 if 0 ≤ x < 1/2
a1 (x) =
1 if 1/2 ≤ x < 1,
then T x = 2x − a1 (x). Now, for n ≥ 1 set an (x) = a1 (T n−1 x). Fix x ∈ X,

for simplicity, we write an instead of an (x), then T x = 2x − a1 . Rewriting
2
we get x = a21 + T2x . Similarly, T x = a22 + T 2 x . Continuing in this manner,
we see that for each n ≥ 1,
a1 a2 an T n x
x= + 2 + ··· + n + n .
2 2 2 2
Since 0 < T n x < 1, we get
n
X ai T nx
x− = → 0 as n → ∞.
i=1
2i 2n
Thus, x = ∞
P ai
i=1 2i . We shall later see that the sequence of digits a1 , a2 , . . .
forms an i.i.d. sequence of Bernoulli random variables.
(c) Baker’s Transformation – This example is the two-dimensional version
of example (b). The underlying probability space is [0, 1)2 with product
Lebesgue σ-algebra B × B and product Lebesgue measure λ × λ. Define
T : [0, 1)2 → [0, 1)2 by
(
(2x, y2 ) 0 ≤ x < 1/2
T (x, y) = y+1
(2x − 1, 2 ) 1/2 ≤ x < 1.
Exercise 1.3.1 Verify that T is invertible, measurable and measure preserv-

ing.
(d) β-transformations
√
– Let X = [0, 1) with the Lebesgue σ-algebra B. Let
1+ 5
β = 2 , the golden mean. Notice that β 2 = β + 1. Define a transformation
Basic Examples 11
T : X → X by

βx 0 ≤ x < 1/β
T x = βx mod 1 =
βx − 1 1/β ≤ x < 1.
Then, T is not measure preserving with respect to Lebesgue measure (give

a counterexample), but is measure preserving with respect to the measure µ
given by Z
µ(B) = g(x) dx,
B
where ( √
5+3 5
10
0 ≤ x < 1/β
g(x) = √
5+ 5
10
1/β ≤ x < 1.
Exercise 1.3.2 Verify that T is measure preserving with respect to µ, and

show that (similar to example (b)) iterations of this map generate expansions
for points x ∈ [0, 1) (known as β-expansions) of the form
∞
X bi
x= ,
i=1
βi
where bi ∈ {0, 1} and bi bi+1 = 0 for all i ≥ 1.
(e) Bernoulli Shifts – Let X = {0, 1, . . . k − 1}Z (or X = {0, 1, . . . k − 1}N ),

F the σ-algebra generated by the cylinders. Let p = (p0 , p1 , . . . , pk−1 ) be a
positive probability vector, define a measure µ on F by specifying it on the
cylinder sets as follows
µ ({x : x−n = a−n , . . . , xn = an }) = pa−n . . . pan .
Let T : X → X be defined by T x = y, where yn = xn+1 . The map T , called

the left shift, is measurable and measure preserving, since
T −1 {x : x−n = a−n , . . . xn = an } = {x : x−n+1 = a−n , . . . , xn+1 = an },
and
µ ({x : x−n+1 = a−n , . . . , xn+1 = an }) = pa−n . . . pan .
Notice that in case X = {0, 1, . . . k − 1}N , then one should consider cylinder
sets of the form {x : x0 = a0 , . . . xn = an }. In this case
T −1 {x : x0 = a0 , . . . , xn = an } = ∪k−1
j=0 {x : x0 = j, x1 = a0 , . . . , xn+1 = an },
and it is easy to see that T is measurable and measure preserving.

(f) Markov Shifts – Let (X, F, T ) be as in example (e). We define a measure
ν on F as follows. Let P = (pij ) be a stochastic k × k matrix, and q =
(q0 , q1 , . . . , qk−1 ) a positive probability vector such that qP = q. Define ν on
cylinders by
ν ({x : x−n = a−n , . . . xn = an }) = qa−n pa−n a−n+1 . . . pan−1 an .
Just as in example (e), one sees that T is measurable and measure pre-
serving.
(g) Stationary Stochastic Processes– Let (Ω, F, IP) be a probability space,
and
. . . , Y−2 , Y−1 , Y0 , Y1 , Y2 , . . .
a stationary stochastic process on Ω with values in R. Hence, for each k ∈ Z
IP (Yn1 ∈ B1 , . . . , Ynr ∈ Br ) = IP (Yn1 +k ∈ B1 , . . . , Ynr +k ∈ Br )
for any n1 < n2 < · · · < nr and any Lebesgue sets B1 , . . . , Br . We want to
see this process as coming from a measure preserving transformation.
Let X = RZ = {x = (. . . , x1 , x0 , x1 , . . .) : xi ∈ R} with the product σ-algebra
(i.e. generated by the cylinder sets). Let T : X → X be the left shift i.e.
T x = z where zn = xn+1 . Define φ : Ω → X by
φ(ω) = (. . . , Y−2 (ω), Y−1 (ω), Y0 (ω), Y1 (ω), Y2 (ω), . . . ).
Then, φ is measurable since if B1 , . . . , Br are Lebesgue sets in R, then
φ−1 ({x ∈ X : xn1 ∈ B1 , . . . xnr ∈ Br }) = Yn−1
1
(B1 ) ∩ . . . ∩ Yn−1
r
(Br ) ∈ F.
Define a measure µ on X by
µ(E) = IP φ−1 (E) .

On cylinder sets µ has the form,

µ ({x ∈ X : xn1 ∈ B1 , . . . xnr ∈ Br }) = IP (Yn1 ∈ B1 , . . . , Ynr ∈ Br ) .
Basic Examples 13
Since
T −1 ({x : xn1 ∈ B1 , . . . xnr ∈ Br }) = {x : xn1 +1 ∈ B1 , . . . xnr +1 ∈ Br },
stationarity of the process Yn implies that T is measure preserving. Further-
more, if we let πi : X → R be the natural projection onto the ith coordinate,
then Yi (ω) = πi (φ(ω)) = π0 ◦ T i (φ(ω)).
(h) Random Shifts – Let (X, B, µ) be a probability space, and T : X → X an
invertible measure preserving transformation. Then, T −1 is measurable and
measure preserving with respect to µ. Suppose now that at each moment
instead of moving forward by T (x → T x), we first flip a fair coin to decide
whether we will use T or T −1 . We can describe this random system by means
of a measure preserving transformation in the following way.
Let Ω = {−1, 1}Z with product σ-algebra F (i.e. the σ-algebra generated
by the cylinder sets), and the uniform product measure IP (see example (e)),
and let σ : Ω → Ω be the left shift. As in example (e), the map σ is measure
preserving. Now, let Y = Ω × X with the product σ-algebra, and product
measure IP × µ. Define S : Y → Y by
S(ω, x) = (σω, T ω0 x).
Then S is invertible (why?), and measure preserving with respect to IP × µ.
To see the latter, for any set C ∈ F, and any A ∈ B, we have
(IP × µ) S −1 (C × A) = (IP × µ) ({(ω, x) : S(ω, x) ∈ (C × A))

= (IP × µ) ({(ω, x) : ω0 = 1, σω ∈ C, T x ∈ A)
+ (IP × µ) {(ω, x) : ω0 = −1, σω ∈ C, T −1 x ∈ A

= (IP × µ) {ω0 = 1} ∩ σ −1 C × T −1 A

+ (IP × µ) {ω0 = −1} ∩ σ −1 C × T A

= IP {ω0 = 1} ∩ σ −1 C µ T −1 A

+ IP {ω0 = −1} ∩ σ −1 C µ (T A)

= IP {ω0 = 1} ∩ σ −1 C µ(A)

+ IP {ω0 = −1} ∩ σ −1 C µ(A)

= IP(σ −1 C)µ(A) = IP(C)µ(A) = (IP × µ)(C × A).
(h) continued fractions – Consider ([0, 1), B), where B is the Lebesgue σ-
algebra. Define a transformation T : [0, 1) → [0, 1) by T 0 = 0 and for x 6= 0
1 1
T x = − b c.
x x
Exercise 1.3.3 Show that T is not measure preserving with respect to

Lebesgue measure, but is measure preserving with respect to the so called
Gauss probability measure µ given by
Z
1 1
µ(B) = dx.
B log 2 1 + x
An interesting feature of this map is that its iterations generate the continued
fraction expansion for points in (0, 1). For if we define
(
1 if x ∈ ( 12 , 1)
a1 = a1 (x) = 1
n if x ∈ ( n+1 , n1 ], n ≥ 2,
1 1
then, T x = x
− a1 and hence x = . For n ≥ 1, let an = an (x) =
a1 + T x
a1 (T n−1 x). Then, after n iterations we see that
1 1
x= = ... = .
a1 + T x 1
a1 +
. 1
a2 + . . +
an + T n x
pn 1
In fact, if = , then one can show that {qn } are mono-
qn 1
a1 +
. 1
a2 + . . +
an
tonically increasing, and
pn 1
|x − | < 2 → 0 as n → ∞.
qn qn
The last statement implies that
1
x= .
1
a1 +
1
a2 +
1
a3 +
..
.
Recurrence 15
1.4 Recurrence
Let T be a measure preserving transformation on a probability space (X, F, µ),
and let B ∈ F. A point x ∈ B is said to be B-recurrent if there exists k ≥ 1
such that T k x ∈ B.
Theorem 1.4.1 (Poincaré Recurrence Theorem) If µ(B) > 0, then

a.e. x ∈ B is B-recurrent.
Proof Let F be the subset of B consisting of all elements that are not B-
recurrent. Then,
F = {x ∈ B : T k x ∈
/ B for all k ≥ 1}.
We want to show that µ(F ) = 0. First notice that F ∩ T −k F = ∅ for all
k ≥ 1, hence T −l F ∩ T −m F = ∅ for all l 6= m. Thus, the sets F, T −1 F, . . .
are pairwise disjoint, and µ(T −n F ) = µ(F ) for all n ≥ 1 (T is measure
preserving). If µ(F ) > 0, then
X
1 = µ(X) ≥ µ ∪k≥0 T −k F = µ(F ) = ∞,
k≥0
a contradiction.
The proof of the above theorem implies that almost every x ∈ B returns
to B infinitely often. In other words, there exist infinitely many integers
n1 < n2 < . . . such that T ni x ∈ B. To see this, let
D = {x ∈ B : T k x ∈ B for finitely many k ≥ 1}.
Then,
D = {x ∈ B : T k x ∈ F for some k ≥ 0} ⊆ ∪∞
k=0 T
−k
F.
Thus, µ(D) = 0 since µ(F ) = 0 and T is measure preserving.
1.5 Induced and Integral Transformations

1.5.1 Induced Transformations
Let T be a measure preserving transformation on the probability space
(X, F, µ). Let A ⊂ X with µ(A) > 0. By Poincaré’s Recurrence Theo-
rem almost every x ∈ A returns to A infinitely often under the action of T .
For x ∈ A, let n(x) := inf{n ≥ 1 : T n x ∈ A}. We call n(x) the first return
time of x to A.
Exercise 1.5.1 Show that n is measurable with respect to the σ-algebra

F ∩ A on A.
By Poincaré Theorem, n(x) is finite a.e. on A. In the sequel we remove

from A the set of measure zero on which n(x) = ∞, and we denote the new
set again by A. Consider the σ-algebra F ∩ A on A, which is the restriction
of F to A. Furthermore, let µA be the probability measure on A, defined by
µ(B)
µA (B) = , for B ∈ F ∩ A,
µ(A)
so that (A, F ∩ A, µA ) is a probability space. Finally, define the induced map

TA : A → A by
TA x = T n(x) x , for x ∈ A.
From the above we see that TA is defined on A. What kind of a transformation
is TA ?
Exercise 1.5.2 Show that TA is measurable with respect to the σ-algebra

F ∩ A.
Proposition 1.5.1 TA is measure preserving with respect to µA .
Proof For k ≥ 1, let
Ak = {x ∈ A : n(x) = k}
Bk = {x ∈ X \ A : T x, . . . , T k−1 x 6∈ A, T k x ∈ A}.
Notice that A = ∞
S
k=1 Ak , and
T −1 A = A1 ∪ B1 and T −1 Bn = An+1 ∪ Bn+1 . (1.1)
Let C ∈ F ∩A, since T is measure preserving it follows that µ(C) = µ(T −1 C).
To show that µA (C) = µA (TA−1 C), we show that
µ(TA−1 C) = µ(T −1 C).

Induced and Integral Transformations 17
B. 1
..
........
...
..
..
B. 1 B. 2 .
.... ....
....... .......
.... ....
T 2 A \ A ∪ T A ≡ (A, 3)
B. 1 B. 2 B. 3
.. .. ..
........ ........ ........
... ... ...
.. .. ..
T A \ A ≡ (A, 2)
B. 1 B. 2 B. 3 B. 4
.... .... .... ....
....... ....... ....... .......
.... .... .... ....
A ≡ (A, 1)
A1 A2 A3 A4 A5
Figure 1.1: A tower.
Now,
∞
[ ∞
[
TA−1 (C) = Ak ∩ TA−1 C = Ak ∩ T −k C,
k=1 k=1
hence ∞
X
TA−1 (C) µ Ak ∩ T −k C .

µ =
k=1
On the other hand, using repeatedly (1.1) , one gets for any n ≥ 1,
µ T −1 (C) = µ(A1 ∩ T −1 C) + µ(B1 ∩ T −1 C)

= µ(A1 ∩ T −1 C) + µ(T −1 (B1 ∩ T −1 C))

= µ(A1 ∩ T −1 C) + µ(A2 ∩ T −2 C) + µ(B2 ∩ T −2 C)
..
.
X n
= µ(Ak ∩ T −k C) + µ(Bn ∩ T −n C).
k=1
Since
∞
! ∞
[ X
1 ≥ µ Bn ∩ T −n C = µ(Bn ∩ T −n C),
n=1 n=1
it follows that
lim µ(Bn ∩ T −n C) = 0.
n→∞
Thus,
∞
X
−1
µ Ak ∩ T −k C = µ(TA−1 C).

µ(C) = µ(T C) =
k=1
This shows that µA (C) = µA (TA−1 C), which implies that TA is measure pre-
serving with respect to µA .
Exercise 1.5.3 Assume T is invertible. Without using Proposition 1.5.1

show that for all C ∈ F ∩ A,
µA (C) = µA (TA C).

√
1+ 5
Exercise 1.5.4 Let G = , so that G2 = G + 1. Consider the set
2
1 [ 1 1
X = [0, ) × [0, 1) [ , 1) × [0, ),
G G G
endowed with the product Borel σ-algebra, and the normalized Lebesgue
measure λ × λ. Define the transformation
 y
 (Gx, G ),

 (x, y) ∈ [0, G1 ) × [0, 1]
T (x, y) =
 (Gx − 1, 1 + y ), (x, y) ∈ [ 1 , 1) × [0, 1 ).


G G
G
(a) Show that T is measure preserving with respect to λ × λ.
(b) Determine explicitely the induced transformation of T on the set [0, 1)×
[0, G1 ).
1.5.2 Integral Transformations

Let S be a measure preserving transformation on a probability space (A, F, ν),
and let f ∈ L1 (A, ν) be positive and integer valued. We now construct a mea-
sure preserving transformation T on a probability space (X, C, µ), such that
the original transformation S can be seen as the induced transformation on
X with return time f .
(1) X = {(y, i) : y ∈ A and 1 ≤ i ≤ f (y), i ∈ N},

Induced and Integral Transformations 19
(2) C is generated by sets of the form
(B, i) = {(y, i) : y ∈ B and f (y) ≥ i} ,
where B ⊂ A, B ∈ F and i ∈ N.
ν(B)
(3) µ(B, i) = Z and then extend µ to all of X.
f (y) dν(y)
A
(4) Define T : X → X as follows:

(
(y, i + 1), if i + 1 ≤ f (y),
T (y, i) =
(Sy, 1), if i + 1 > f (y).
Now (X, C, µ, T ) is called an integral system of (A, F, ν, S) under f . We now

show that T is µ-measure preserving. In fact, it suffices to check this on the
generators.
Let B ⊂ A be F-measurable, and let i ≥ 1. We have to discern the
following two cases:
(1) If i > 1, then T −1 (B, i) = (B, i − 1) and clearly
ν(B)
µ(T −1 (B, i)) = µ(B, i − 1) = µ(B, i) = R .
A
f (y) dν(y)
(2) If i = 1, we write An = {y ∈ A : f (y) = n}, and we have

∞
[
−1
T (B, 1) = (An ∩ S −1 B, n) (disjoint union).
n=1
S∞
Since n=1 An = A we therefore find that
∞
X ν(An ∩ S −1 B) ν(S −1 B)
µ(T −1 (B, 1)) = Z = Z
n=1 f (y) dν(y) f (y) dν(y)
A A
ν(B)
= Z = µ(B, 1) .
f (y) dν(y)
A
This shows that T is measure preserving. Moreover, if we consider the

induced transformation of T on the set (A, 1), then the first return time
n(x, 1) = inf{k ≥ 1 : T k (x, 1) ∈ (A, 1)} is given by n(x, 1) = f (x), and
T(A,1) (x, 1) = (Sx, 1).
1.6 Ergodicity
Definition 1.6.1 Let T be a measure preserving transformation on a proba-
bility space (X, F, µ). The map T is said to be ergodic if for every measurable
set A satisfying T −1 A = A, we have µ(A) = 0 or 1.
Theorem 1.6.1 Let (X, F, µ) be a probability space and T : X → X mea-

sure preserving. The following are equivalent:
(i) T is ergodic.
(ii) If B ∈ F with µ(T −1 B∆B) = 0, then µ(B) = 0 or 1.
(iii) If A ∈ F with µ(A) > 0, then µ (∪∞

n=1 T
−n
A) = 1.
(iv) If A, B ∈ F with µ(A) > 0 and µ(B) > 0, then there exists n > 0 such
that µ(T −n A ∩ B) > 0.
Remark 1.6.1
1. In case T is invertible, then in the above characterization one can replace
T −n by T n .
2. Note that if µ(B4T −1 B) = 0, then µ(B \ T −1 B) = µ(T −1 B \ B) = 0.
Since
B = B \ T −1 B ∪ B ∩ T −1 B ,

and
T −1 B = T −1 B \ B ∪ B ∩ T −1 B ,

we see that after removing a set of measure 0 from B and a set of measure 0
from T −1 B, the remaining parts are equal. In this case we say that B equals
T −1 B modulo sets of measure 0.
3. In words, (iii) says that if A is a set of positive measure, almost every
x ∈ X eventually (in fact infinitely often) will visit A.
4. (iv) says that elements of B will eventually enter A.
Ergodicity 21
Proof of Theorem 1.6.1

(i)⇒(ii) Let B ∈ F be such that µ(B∆T −1 B) = 0. We shall define a
measurable set C with C = T −1 C and µ(C∆B) = 0. Let
∞ [
\ ∞
C = {x ∈ X : T n x ∈ B i.o. } = T −k B.
n=1 k=n
Then, T −1 C = C, hence by (i) µ(C) = 0 or 1. Furthermore,

∞ [ ∞
! ∞ \∞
!
\ [
µ(C∆B) = µ T −k B ∩ B c + µ T −k B c ∩ B
n=1 k=n n=1 k=n
∞
! ∞
!
[ [
−k
≤ µ T B ∩ Bc +µ T −k B c ∩ B
k=1 k=1
∞
X
µ T −k B∆B .

≤
k=1
Using induction (and the fact that µ(E∆F ) ≤ µ(E∆G) + µ(G∆F )), one can
−k

show that for each k ≥ 1 one has µ T B∆B = 0. Hence, µ(C∆B) = 0
which implies that µ(C) = µ(B). Therefore, µ(B) = 0 or 1.
(ii)⇒(iii) Let µ(A) > 0 and let B = ∞ −n
A. Then T −1 B ⊂ B. Since T
S
n=1 T
is measure preserving, then µ(B) > 0 and
µ(T −1 B∆B) = µ(B \ T −1 B) = µ(B) − µ(T −1 B) = 0.
Thus, by (ii) µ(B) = 1.

(iii)⇒(iv) Suppose µ(A)µ(B) > 0. By (iii)
∞
! ∞
!
[ [
µ(B) = µ B ∩ T −n A = µ (B ∩ T −n A) > 0.
n=1 n=1
Hence, there exists k ≥ 1 such that µ(B ∩ T −k A) > 0.

(iv)⇒(i) Suppose T −1 A = A with µ(A) > 0. If µ(Ac ) > 0, then by (iv) there
exists k ≥ 1 such that µ(Ac ∩ T −k A) > 0. Since T −k A = A, it follows that
µ(Ac ∩ A) > 0, a contradiction. Hence, µ(A) = 1 and T is ergodic.
1.7 Other Characterizations of Ergodicity

We denote by L0 (X, F, µ) the space of all complex valued measurable func-
tions on the probability space (X, F, µ). Let
Z
p 0
L (X, F, µ) = {f ∈ L (X, F, µ) : |f |p dµ(x) < ∞}.
X
We use the subscript R whenever we are dealing only with real-valued func-
tions.
Let (Xi , Fi , µi ), i = 1, 2 be two probability spaces, and T : X1 → X2 a mea-
sure preserving transformation i.e., µ2 (A) = µ1 (T −1 A). Define the induced
operator UT : L0 (X2 , F2 , µ2 ) → L0 (X1 , F1 , µ1 ) by
UT f = f ◦ T.
The following properties of UT are easy to prove.
Proposition 1.7.1 The operator UT has the following properties:
(i) UT is linear
(ii) UT (f g) = UT (f )UT (g)
(iii) UT c = c for any constant c.
(iv) UT is a positive linear operator
(v) UT 1B = 1B ◦ T = 1T −1 B for all B ∈ F2 .

R R
(vi) X1 UT f dµ1 = X2 f dµ2 for all f ∈ L0 (X2 , F2 , µ2 ), (where if one side
doesn’t exist or is infinite, then the other side has the same property).
(vii) Let p ≥ 1. Then, UT Lp (X2 , F2 , µ2 ) ⊂ Lp (X1 , F1 , µ1 ), and ||UT f ||p =

||f ||p for all f ∈ Lp (X2 , F2 , µ2 ).
Exercise 1.7.1 Prove Proposition 1.7.1
Using the above properties, we can give the following characterization of

ergodicity
Other Characterizations of Ergodicity 23
Theorem 1.7.1 Let (X, F, µ) be a probability space, and T : X → X mea-

sure preserving. The following are equivalent:
(i) T is ergodic.
(ii) If f ∈ L0 (X, F, µ), with f (T x) = f (x) for all x, then f is a constant

a.e.
(iii) If f ∈ L0 (X, F, µ), with f (T x) = f (x) for a.e. x, then f is a constant

a.e.
(iv) If f ∈ L2 (X, F, µ), with f (T x) = f (x) for all x, then f is a constant

a.e.
(v) If f ∈ L2 (X, F, µ), with f (T x) = f (x) for a.e. x, then f is a constant

a.e.
Proof
The implications (iii)⇒(ii), (ii)⇒(iv), (v)⇒(iv), and (iii)⇒(v) are all clear.
It remains to show (i)⇒(iii) and (iv)⇒(i).
(i)⇒(iii) Suppose f (T x) = f (x) a.e. and assume without any loss of gener-
ality that f is real (otherwise we consider separately the real and imaginary
parts of f ). For each n ≥ 1 and k ∈ Z, let
k k+1
X(k,n) = {x ∈ X : n
≤ f (x) < }.
2 2n
Then, T −1 X(k,n) ∆X(k,n) ⊆ {x : f (T x) 6= f (x)} which implies that
µ T −1 X(k,n) ∆X(k,n) = 0.

By ergodicity of T , µ(X(k,n) ) = 0 or 1, for each k ∈ Z. On the other hand,

for each n ≥ 1, we have
[
X= X(k,n) (disjoint union).
k∈Z

Hence, for each n ≥ 1, there exists a unique integer kn such that µ X(kn ,n) =
kn
1. In fact, X(k1 ,1) ⊇ X(k2 ,2) ⊇ . . ., and { n } is a bounded increasing sequence,
2
kn T
hence limn→∞ exists. Let Y = n≥1 X(kn ,n) , then µ(Y ) = 1. Now, if
2n
kn
x ∈ Y , then 0 ≤ |f (x) − kn /2n | < 1/2n for all n. Hence, f (x) = limn→∞ n ,
2
and f is a constant on Y .
(iv)⇒(i) Suppose T −1 A = A and µ(A) > 0. We want to show that µ(A) = 1.

Consider 1A , the indicator function of A. We have 1A ∈ L2 (X, F, µ), and
1A ◦ T = 1T −1 A = 1A . Hence, by (iv), 1A is a constant a.e., hence 1A = 1 a.e.
and therefore µ(A) = 1.
1.8 Examples of Ergodic Transformations

Example 1–Irrational Rotations. Consider ([0, 1), B, λ), where B is the Lebesgue
σ-algebra, and λ Lebesgue measure. For θ ∈ (0, 1), consider the transforma-
tion Tθ : [0, 1) → [0, 1) defined by Tθ x = x + θ (mod 1). We have seen
in example (a) that Tθ is measure preserving with respect λ. When is Tθ
ergodic?
If θ is rational, then Tθ is not ergodic. Consider for example θ = 1/4, then
the set
A = [0, 1/8) ∪ [1/4, 3/8) ∪ [1/2, 5/8) ∪ [3/4, 7/8)
is Tθ -invariant but µ(A) = 1/2.
p
Exercise 1.8.1 Suppose θ = with gcd(p, q) = 1. Find a non-trivial Tθ -
q
invariant set. Conclude that Tθ is not ergodic if θ is a rational.
Claim. Tθ is ergodic if and only if θ is irrational.

Proof of Claim.
(⇒) The contrapositive statement is given in Exercise 1.8.1 i.e. if θ is rational,

then Tθ is not ergodic.
(⇐) Suppose θ is irrational, and let f ∈ L2 (X, B, λ) be Tθ -invariant. Write

f in its Fourier series
X
f (x) = an e2πinx .
n∈Z
Examples of Ergodic Transformations 25
Since f (Tθ x) = f (x), then

X X
f (Tθ x) = an e2πin(x+θ) = an e2πinθ e2πinx
n∈Z n∈Z
X
2πinx
= f (x) = an e .
n∈Z
2πinθ 2πinx
P
Hence, n∈Z an (1 − e )e = 0. By the uniqueness of the Fourier
2πinθ
coefficients, we have an (1 − e ) = 0 for all n ∈ Z. If n 6= 0, since θ is
irrational we have 1 − e2πinθ 6= 0. Thus, an = 0 for all n 6= 0, and therefore
f (x) = a0 is a constant. By Theorem 1.7.1, Tθ is ergodic.
Exercise 1.8.2 Consider the probability space ([0, 1), B × B, λ × λ), where
as above B is the Lebesgue σ-algebra on [0, 1), and λ normalized Lebesgue
measure. Suppose θ ∈ (0, 1) is irrational, and define Tθ × Tθ : [0, 1) × [0, 1) →
[0, 1) × [0, 1) by
Tθ × Tθ (x, y) = (x + θ mod (1) , y + θ mod (1) ) .
Show that Tθ × Tθ is measure preserving, but is not ergodic.
Example 2–One (or Two) sided shift. Let X = {0, 1, . . . k − 1}N , F the σ-
algebra generated by the cylinders, and µ the product measure defined on
cylinder sets by
µ ({x : x0 = a0 , . . . xn = an }) = pa0 . . . pan ,
where p = (p0 , p1 , . . . , pk−1 ) is a positive probability vector. Consider the
left shift T defined on X by T x = y, where yn = xn+1 (See Example (e) in
Subsection 1.3). We show that T is ergodic. Let E be a measurable subset
of X which is T -invariant i.e., T −1 E = E. For any > 0, by Lemma 1.2.1
(see subsection 1.2), there exists A ∈ F which is a finite disjoint union of
cylinders such that µ(E∆A) < . Then
|µ(E) − µ(A)| = |µ(E \ A) − µ(A \ E)|
≤ µ(E \ A) + µ(A \ E) = µ(E∆A) < .
Since A depends on finitely many coordinates only, there exists n0 > 0
such that T −n0 A depends on different coordinates than A. Since µ is a
product measure, we have
µ(A ∩ T −n0 A) = µ(A)µ(T −n0 A) = µ(A)2 .
Further,
µ(E∆T −n0 A) = µ(T −n0 E∆T −n0 A) = µ(E∆A) < ,
and
µ E∆(A ∩ T −n0 A) ≤ µ(E∆A) + µ(E∆T −n0 A) < 2.

Hence,
|µ(E) − µ((A ∩ T −n0 A))| ≤ µ E∆(A ∩ T −n0 A) < 2.

Thus,
|µ(E) − µ(E)2 | ≤ |µ(E) − µ(A)2 | + |µ(A)2 − µ(E)2 |

= |µ(E) − µ((A ∩ T −n0 A))| + (µ(A) + µ(E))|µ(A) − µ(E)|
< 4.
Since > 0 is arbitrary, it follows that µ(E) = µ(E)2 , hence µ(E) = 0 or 1.

Therefore, T is ergodic.
The following lemma provides, in some cases, a useful tool to verify that a
measure preserving transformation defined on ([0, 1), B, µ) is ergodic, where
B is the Lebesgue σ-algebra, and µ is a probability measure equivalent to
Lebesgue measure λ (i.e., µ(A) = 0 if and only if λ(A) = 0).
Lemma 1.8.1 (Knopp’s Lemma) . If B is a Lebesgue set and C is a class

of subintervals of [0, 1) satisfying
(a) every open subinterval of [0, 1) is at most a countable union of disjoint

elements from C,
(b) ∀A ∈ C , λ(A ∩ B) ≥ γλ(A), where γ > 0 is independent of A,
then λ(B) = 1.
Proof The proof is done by contradiction. Suppose λ(B c ) > 0. Given ε > 0
there exists by Lemma 1.2.1 a set Eε that is a finite disjoint union of open
intervals such that λ(B c 4Eε ) < ε. Now by conditions (a) and (b) (that
is, writing Eε as a countable union of disjoint elements of C) one gets that
λ(B ∩ Eε ) ≥ γλ(Eε ).
Examples of Ergodic Transformations 27
Also from our choice of Eε and the fact that
λ(B c 4Eε ) ≥ λ(B ∩ Eε ) ≥ γλ(Eε ) ≥ γλ(B c ∩ Eε ) > γ(λ(B c ) − ε),
we have that
γ(λ(B c ) − ε) < λ(B c 4Eε ) < ε .
Hence γλ(B c ) < ε + γε, and since ε > 0 is arbitrary, we get a contradiction.

Example 3–Multiplication by 2 modulo 1–Consider ([0, 1), B, λ) be as in Ex-

ample (1) above, and let T : X → X be given by

2x 0 ≤ x < 1/2
T x = 2x mod 1 =
2x − 1 1/2 ≤ x < 1,
(see Example (b), subsection 1.3). We have seen that T is measure preserving.
We will use Lemma 1.8.1 to show that T is ergodic. Let C be the collection
of all intervals of the form [k/2n , (k + 1)/2n ) with n ≥ 1 and 0 ≤ k ≤ 2n − 1.
Notice that the the set {k/2n : n ≥ 1, 0 ≤ k < 2n − 1} of dyadic rationals
is dense in [0, 1), hence each open interval is at most a countable union of
disjoint elements of C. Hence, C satisfies the first hypothesis of Knopp’s
Lemma. Now, T n maps each dyadic interval of the form [k/2n , (k + 1)/2n )
linearly onto [0, 1), (we call such an interval dyadic of order n); in fact,
T n x = 2n x mod(1). Let B ∈ B be T -invariant, and assume λ(B) > 0. Let
A ∈ C, and assume that A is dyadic of order n. Then, T n A = [0, 1) and
1
λ(A ∩ B) = λ(A ∩ T −n B) = λ(T n A ∩ B)
λ(A)
1
= λ(B) = λ(A)λ(B).
2n
Thus, the second hypothesis of Knopp’s Lemma is satisfied with γ = λ(B) >
0. Hence, λ(B) = 1. Therefore T is ergodic.
Exercise 1.8.3 Let β > 1 be a non-integer, and consider the transformation

Tβ : [0, 1) → [0, 1) given by Tβ x = βx mod(1) = βx − bβxc. Use Lemma
1.8.1 to show that Tβ is ergodic with respect to Lebesgue measure λ, i.e. if
Tβ−1 A = A, then λ(A) = 0 or 1.
Example 4–Induced transformations of ergodic transformations– Let T be an

ergodic measure preserving transformation on the probability space (X, F, µ),
and A ∈ F with µ(A) > 0. Consider the induced transformation TA on
(A, F ∩ A, µA ) of T (see subsection 1.5). Recall that TA x = T n(x) x, where
n(x) := inf{n ≥ 1 : T n x ∈ A}. Let (as before)
Ak = {x ∈ A : n(x) = k}
Bk = {x ∈ X \ A : T x, . . . , T k−1 x 6∈ A, T k x ∈ A}.
Proposition 1.8.1 If T is ergodic on (X, F, µ), then TA is ergodic on (A, F∩

A, µA ).
Proof Let C ∈ F ∩ A be such that TA−1 C = C. We want to show S that

µA (C) = 0 or 1; equivalently, µ(C) = 0 or µ(C) = µ(A). Since A = k≥1 Ak ,
we have C = TA−1 C = k≥1 Ak ∩ T −k C. Let E = k≥1 Bk ∩ T −k C, and
S S
F = E ∪C (disjoint union). Recall that (see subsection 1.5) T −1 A = A1 ∪B1 ,
and T −1 Bk = Ak+1 ∪ Bk+1 . Hence,
T −1 F = T −1 E ∪ T −1 C
[
(Ak+1 ∪ Bk+1 ) ∩ T −(k+1) C ∪ (A1 ∪ B1 ) ∩ T −1 C

=
k≥1
[ [
= (Ak ∩ T −k C) ∪ (Bk ∩ T −k C)
k≥1 k≥1
= C ∪ E = F.
Hence, F is T -invariant, and by ergodicity of T we have µ(F ) = 0 or 1.
–If µ(F ) = 0, then µ(C) = 0, and hence µA (C) = 0.
–If µ(F ) = 1, then µ(X \ F ) = 0. Since
X \ F = (A \ C) ∪ ((X \ A) \ E) ⊇ A \ C,
it follows that
µ(A \ C) ≤ µ(X \ F ) = 0.
Since µ(A \ C) = µ(A) − µ(C), we have µ(A) = µ(C), i.e., µA (C) = 1.
T −k A = 1, then, T
S
Exercise 1.8.4 Show that if TA is ergodic and µ k≥1
is ergodic.
Chapter 2
The Ergodic Theorem
2.1 The Ergodic Theorem and its consequences

The Ergodic Theorem is also known as Birkhoff’s Ergodic Theorem or the
Individual Ergodic Theorem (1931). This theorem is in fact a generalization
of the Strong Law of Large Numbers (SLLN) which states that for a sequence
Y1 , Y2 , . . . of i.i.d. random variables on a probability space (X, F, µ), with
E|Yi | < ∞; one has
n
1X
lim Yi = EY1 (a.e.).
n→∞ n
i=1
For example consider X = {0, 1}N , F the σ-algebra generated by the cylinder
sets, and µ the uniform product measure, i.e.,
µ ({x : x1 = a1 , x2 = a2 , . . . , xn = an }) = 1/2n .
Suppose one is interested in finding the frequency of the digit 1. More pre-
cisely, for a.e. x we would like to find
1
lim #{1 ≤ i ≤ n : xi = 1}.
n→∞ n
Using the Strong Law of Large Numbers one can answer this question easily.
Define
1, if xi = 1,
Yi (x) :=
0, otherwise.
29
30 The Ergodic Theorem
Since µ is product measure, it is easy to see that Y1 , Y2 , . . . form an i.i.d.

Bernoulli
Pn process, and EYi = E|Yi | = 1/2. Further, #{1 ≤ i ≤ n : xi = 1} =
i=1 Yi (x). Hence, by SLLN one has
1 1
lim #{1 ≤ i ≤ n : xi = 1} = .
n→∞ n 2
Suppose now we are interested in the frequency of the block 011, i.e., we
would like to find
1
lim #{1 ≤ i ≤ n : xi = 0, xi+1 = 1, xi+2 = 1}.
n→∞ n
We can start as above by defining random variables

1, if xi = 0, xi+1 = 1, xi+2 = 1,
Zi (x) :=
0, otherwise.
Then,
n
1 1X
#{1 ≤ i ≤ n : xi = 0, xi+1 = 1, xi+2 = 1} = Zi (x).
n n i=1
It is not hard to see that this sequence is stationary but not independent. So
one cannot directly apply the strong law of large numbers. Notice that if T
is the left shift on X, then Yn = Y1 ◦ T n−1 and Zn = Z1 ◦ T n−1 .
In general, suppose (X, F, µ) is a probability space and T : X → X a measure
preserving transformation. For f ∈ L1 (X, F, µ), we would like to know under
n−1
1X
what conditions does the limit limn→∞ f (T i x) exist a.e. If it does exist
n i=0
what is its value? This is answered by the Ergodic Theorem which was
originally proved by G.D. Birkhoff in 1931. Since then, several proofs of this
important theorem have been obtained; here we present a recent proof given
by T. Kamae and M.S. Keane in [KK].
Theorem 2.1.1 (The Ergodic Theorem) Let (X, F, µ) be a probability space
and T : X → X a measure preserving transformation. Then, for any f in
L1 (µ),
n−1
1X
lim f (T i (x)) = f ∗ (x)
n→∞ n
i=0
R R ∗
exists a.e., is T -invariant and X f dµ = X
f dµ. If moreover T is ergodic,
∗ ∗
R
then f is a constant a.e. and f = X f dµ.
The Ergodic Theorem and its consequences 31
For the proof of the above theorem, we need the following simple lemma.
Lemma 2.1.1 Let M > 0 be an integer, and suppose {an }n≥0 , {bn }n≥0 are
sequences of non-negative real numbers such that for each n = 0, 1, 2, . . . there
exists an integer 1 ≤ m ≤ M with
an + · · · + an+m−1 ≥ bn + · · · + bn+m−1 .
Then, for each positive integer N > M , one has
a0 + · · · + aN −1 ≥ b0 + · · · + bN −M −1 .
Proof of Lemma 2.1.1 Using the hypothesis we recursively find integers

m0 < m1 < . . . < mk < N with the following properties
m0 ≤ M, mi+1 − mi ≤ M for i = 0, . . . , k − 1, and N − mk < M,

a0 + . . . + am0 −1 ≥ b0 + . . . + bm0 −1 ,
am0 + . . . + am1 −1 ≥ bm0 + . . . + bm1 −1 ,
..
.
amk−1 + . . . + amk −1 ≥ bmk−1 + . . . + bmk −1 .
Then,
a0 + . . . + aN −1 ≥ a0 + . . . + amk −1
≥ b0 + . . . + bmk −1 ≥ b0 + . . . bN −M −1 .

Proof of Theorem 2.1.1 Assume with no loss of generality that f ≥ 0
(otherwise we write f = f + − f − , and we consider each part separately).
fn (x)
Let fn (x) = f (x) + . . . + f (T n−1 x), f (x) = lim supn→∞ , and f (x) =
n
fn (x)
lim inf n→∞ . Then f and f are T -invariant. This follows from
n
fn (T x)
f (T x) = lim sup
n→∞ n

fn+1 (x) n + 1 f (x)
= lim sup · −
n→∞ n+1 n n
fn+1 (x)
= lim sup = f (x).
n→∞ n+1
(Similarly f is T -invariant). Now, to prove that f ∗ exists, is integrable and

T -invariant, it is enough to show that
Z Z Z
f dµ ≥ f dµ ≥ f dµ.
X X X
For since f − f ≥ 0, this would imply that f = f = f ∗ . a.e.

R R
We first prove that X f dµ ≤ X f dµ. Fix any 0 < < 1, and let L > 0 be
any real number. By definition of f , for any x ∈ X, there exists an integer
m > 0 such that
fm (x)
≥ min(f (x), L)(1 − ).
m
Now, for any δ > 0 there exists an integer M > 0 such that the set
X0 = {x ∈ X : ∃ 1 ≤ m ≤ M with fm (x) ≥ m min(f (x), L)(1 − )}
has measure at least 1 − δ. Define F on X by

f (x) x ∈ X0
F (x) =
L x∈/ X0 .
Notice that f ≤ F (why?). For any x ∈ X, let an = an (x) = F (T n x), and

bn = bn (x) = min(f (x), L)(1 − ) (so bn is independent of n).We now show
that {an } and {bn } satisfy the hypothesis of Lemma 2.1.1 with M > 0 as
above. For any n = 0, 1, 2, . . .
–if T n x ∈ X0 , then there exists 1 ≤ m ≤ M such that
fm (T n x) ≥ m min(f (T n x), L)(1 − )

= m min(f (x), L)(1 − )
= bn + . . . + bn+m−1 .
Hence,
an + . . . + an+m−1 = F (T n x) + . . . + F (T n+m−1 x)
≥ f (T n x) + . . . + f (T n+m−1 x) = fm (T n x)
≥ bn + . . . + bn+m−1 .
–If T n x ∈
/ X0 , then take m = 1 since
an = F (T n x) = L ≥ min(f (x), L)(1 − ) = bn .

Hence by Lemma 2.1.1 for all integers N > M one has
F (x) + . . . + F (T N −1 x) ≥ (N − M ) min(f (x), L)(1 − ).
Integrating both sides, and using the fact that T is measure preserving one
gets Z Z
N F (x) dµ(x) ≥ (N − M ) min(f (x), L)(1 − ) dµ(x).
X X
Since Z Z
F (x) dµ(x) = f (x) dµ(x) + Lµ(X \ X0 ),
X X0
one has
Z Z
f (x) dµ(x) ≥ f (x) dµ(x)
X X0
Z
= F (x) dµ(x) − Lµ(X \ X0 )
X
(N − M )
Z
≥ min(f (x), L)(1 − ) dµ(x) − Lδ.
N X
Now letting first N → ∞, then δ → 0, then → 0, and lastly L → ∞ one

gets together with the monotone convergence theorem that f is integrable,
and Z Z
f (x) dµ(x) ≥ f (x) dµ(x).
X X
We now prove that

Z Z
f (x) dµ(x) ≤ f (x) dµ(x).
X X
Fix > 0, for any x ∈ X there exists an integer m such that
fm (x)
≤ (f (x) + ).
m
For any δ > 0 there exists an integer M > 0 such that the set
Y0 = {x ∈ X : ∃ 1 ≤ m ≤ M with fm (x) ≤ m (f (x) + )}

has measure at least 1 − δ. Define G on X by

f (x) x ∈ Y0
G(x) =
0 x∈/ Y0 .
Notice that G ≤ f . Let bn = G(T n x), and an = f (x)+ (so an is independent

of n). One can easily check that the sequences {an } and {bn } satisfy the
hypothesis of Lemma 2.1.1 with M > 0 as above. Hence for any M > N ,
one has
G(x) + . . . + G(T N −M −1 x) ≤ N (f (x) + ).
Integrating both sides yields
Z Z
(N − M ) G(x)dµ(x) ≤ N ( f (x)dµ(x) + ).
X X
R
Since f ≥ 0, the measure ν defined by ν(A) = A f (x) dµ(x) is absolutely
continuous with respect to the measure µ. Hence, there exists δ0 > 0 such
that
R if µ(A) < δ, then ν(A) < δ0 . Since µ(X \ Y0 ) < δ, then ν(X \ Y0 ) =
X\Y0
f (x)dµ(x) < δ0 . Hence,
Z Z Z
f (x) dµ(x) = G(x) dµ(x) + f (x) dµ(x)
X X X\Y0
Z
N
≤ (f (x) + ) dµ(x) + δ0 .
N −M X
Now, let first N → ∞, then δ → 0 (and hence δ0 → 0), and finally → 0,

one gets Z Z
f (x) dµ(x) ≤ f (x) dµ(x).
X X
This shows that Z Z Z
f dµ ≥ f dµ ≥ f dµ,
X X X
hence, f = f = f ∗ a.e., and f ∗ is T -invariant. In case T is ergodic, then the

T -invariance of f ∗ implies that f ∗ is a constant a.e. Therefore,
Z Z
∗ ∗
f (x) = f (y)dµ(y) = f (y) dµ(y).
X X

Remarks
(1) Let us study further the limit f ∗ in the case that T is not ergodic. Let I
be the sub-σ-algebra of F consisting of all T -invariant subsets A ∈ F. Notice
that if f ∈ L1 (µ), then the conditional expectation of f given I (denoted by
Eµ (f |I)), is the unique a.e. I-measurable L1 (µ) function with the property
that Z Z
f (x) dµ(x) = Eµ (f |I)(x) dµ(x)
A A
for all A ∈ I i.e., T −1 A = A. We claim that f ∗ = Eµ (f |I). Since the limit

function f ∗ is T -invariant, it follows that f ∗ is I-measurable. Furthermore,
for any A ∈ I, by the ergodic theorem and the T -invariance of 1A ,
n−1 n−1
1X 1X
lim (f 1A )(T i x) = 1A (x) lim f (T i x) = 1A (x)f ∗ (x) a.e.
n→∞ n n→∞ n
i=0 i=0
and Z Z
f 1A (x) dµ(x) = f ∗ 1A (x) dµ(x).
X X
This shows that f ∗ = Eµ (f |I).
(2) Suppose T is ergodic and measure preserving with respect to µ, and let
ν be a probability measure which is equivalent to µ (i.e. µ and ν have the
same sets of measure zero so µ(A) = 0 if and only if ν(A) = 0), then for
every f ∈ L1 (µ) one has
n−1 Z
1X
lim f (T i (x)) = f dµ
n→∞ n X
i=0
ν a.e.
Exercise 2.1.1 (Kac’s Lemma) Let T be a measure preserving and ergodic

transformation on a probability space (X, F, µ). Let A be a measurable
subset of X of positive µ measure, and denote by n the first return time map
and let TA be the induced transformation of T on A (see section 1.5). Prove
that Z
n(x) dµ = 1.
A
Conclude that n(x) ∈ L1 (A, µA ), and that
n−1
1X 1
lim n(TAi (x)) = ,
n→∞ n µ(A)
i=0
almost everywhere on A.
√
1+ 5
Exercise 2.1.2 Let β = , and consider the transformation Tβ :
2
[0, 1) → [0, 1) given by Tβ x = βx mod(1) = βx − bβxc. Define b1 on
[0, 1) by

0 if 0 ≤ x < 1/β
b1 (x) =
1 if 1/β ≤ x < 1,
Fix k ≥ 0. Find the a.e. value (with respect to Lebesgue measure) of the
following limit
1
lim #{1 ≤ i ≤ n : bi = 0, bi+1 = 0, . . . , bi+k = 0}.
n→∞ n
Using the Ergodic Theorem, one can give yet another characterization of
ergodicity.
Corollary 2.1.1 Let (X, F, µ) be a probability space, and T : X → X a

measure preserving transformation. Then, T is ergodic if and only if for all
A, B ∈ F, one has
n−1
1X
lim µ(T −i A ∩ B) = µ(A)µ(B). (2.1)
n→∞ n
i=0
Proof Suppose T is ergodic, and let A, B ∈ F. Since the indicator function

1A ∈ L1 (X, F, µ), by the ergodic theorem one has
n−1 Z
1X
lim 1A (T i x) = 1A (x) dµ(x) = µ(A) a.e.
n→∞ n X
i=0
Then,
n−1 n−1
1X 1X
lim 1T −i A∩B (x) = lim 1T −i A (x)1B (x)
n→∞ n n→∞ n
i=0 i=0
n−1
1X
= 1B (x) lim 1A (T i x)
n→∞ n
i=0
= 1B (x)µ(A) a.e.
Pn−1
Since for each n, the function limn→∞ n1 i=0 1T −i A∩B is dominated by the
constant function 1, it follows by the dominated convergence theorem that
n Z n−1
1X −i 1X
lim µ(T A ∩ B) = lim 1T −i A∩B (x) dµ(x)
n→∞ n X n→∞ n i=0
i=0
Z
= 1B µ(A) dµ(x) = µ(A)µ(B).
X
Conversely, suppose (2.1) holds for every A, B ∈ F. Let E ∈ F be such that

T −1 E = E and µ(E) > 0. By invariance of E, we have µ(T −i E ∩ E) = µ(E),
hence
n−1
1X
lim µ(T −i E ∩ E) = µ(E).
n→∞ n
i=0
On the other hand, by (2.1)

n−1
1X
lim µ(T −i E ∩ E) = µ(E)2 .
n→∞ n
i=0
Hence, µ(E) = µ(E)2 . Since µ(E) > 0, this implies µ(E) = 1. Therefore, T
is ergodic.
To show ergodicity one needs to verify equation (2.1) for sets A and B
belonging to a generating semi-algebra only as the next proposition shows.
Proposition 2.1.1 Let (X, F, µ) be a probability space, and S a generating

semi-algebra of F. Let T : X → X be a measure preserving transformation.
Then, T is ergodic if and only if for all A, B ∈ S, one has
n−1
1X
lim µ(T −i A ∩ B) = µ(A)µ(B). (2.2)
n→∞ n
i=0
Proof We only need to show that if (2.2) holds for all A, B ∈ S, then it
holds for all A, B ∈ F. Let > 0, and A, B ∈ F. Then, by Lemma 1.2.1
(subsection 1.2) there exist sets A0 , B0 each of which is a finite disjoint union
of elements of S such that
µ(A∆A0 ) < , and µ(B∆B0 ) < .
Since,
(T −i A ∩ B)∆(T −i A0 ∩ B0 ) ⊆ (T −i A∆T −i A0 ) ∪ (B∆B0 ),
it follows that
|µ(T −i A ∩ B) − µ(T −i A0 ∩ B0 )| ≤ µ (T −i A ∩ B)∆(T −i A0 ∩ B0 )

≤ µ(T −i A∆T −i A0 ) + µ(B∆B0 )

< 2.
Further,
|µ(A)µ(B) − µ(A0 )µ(B0 )| ≤ µ(A)|µ(B) − µ(B0 )| + µ(B0 )|µ(A) − µ(A0 )|

≤ |µ(B) − µ(B0 )| + |µ(A) − µ(A0 )|
≤ µ(B∆B0 ) + µ(A∆A0 )
< 2.
Hence,
! !
1Xn−1 n−1
1X
µ(T −i A ∩ B) − µ(A)µ(B) − µ(T −i A0 ∩ B0 ) − µ(A0 )µ(B0 )

n n i=0

i=0
n−1
1 X
µ(T −i A ∩ B) + µ(T −i A0 ∩ B0 ) − |µ(A)µ(B) − µ(A0 )µ(B0 )|

≤
n i=0
< 4.
Therefore, " #
n−1
1X
lim µ(T −i A ∩ B) − µ(A)µ(B) = 0.
n→∞ n i=0

Theorem 2.1.2 Suppose µ1 and µ2 are probability measures on (X, F), and
T : X → X is measurable and measure preserving with respect to µ1 and µ2 .
Then,
(i) if T is ergodic with respect to µ1 , and µ2 is absolutely continuous with

respect to µ1 , then µ1 = µ2 ,
(ii) if T is ergodic with respect to µ1 and µ2 , then either µ1 = µ2 or µ1 and

µ2 are singular with respect to each other.
Proof (i) Suppose T is ergodic with respect to µ1 and µ2 is absolutely con-

tinuous with respect to µ1 . For any A ∈ F, by the ergodic theorem for a.e.
x one has
n−1
1X
lim 1A (T i x) = µ1 (A).
n→∞ n
i=0
Let
n−1
1X
CA = {x ∈ X : lim 1A (T i x) = µ1 (A)},
n→∞ n
i=0
then µ1 (CA ) = 1, and by absolute continuity of µ2 one has µ2 (CA ) = 1. Since

T is measure preserving with respect to µ2 , for each n ≥ 1 one has
n−1 Z
1X
1A (T i x) dµ2 (x) = µ2 (A).
n i=0 X
On the other hand, by the dominated convergence theorem one has

Z n−1 Z
1X
lim 1A (T i x)dµ2 (x) = µ1 (A) dµ2 (x).
n→∞ X n i=0 X
This implies that µ1 (A) = µ2 (A). Since A ∈ F is arbitrary, we have µ1 = µ2 .

(ii) Suppose T is ergodic with respect to µ1 and µ2 . Assume that µ1 6= µ2 .
Then, there exists a set A ∈ F such that µ1 (A) 6= µ2 (A). For i = 1, 2 let
n−1
1X
Ci = {x ∈ X : lim 1A (T j x) = µi (A)}.
n→∞ n
j=0
By the ergodic theorem µi (Ci ) = 1 for i = 1, 2. Since µ1 (A) 6= µ2 (A), then

C1 ∩ C2 = ∅. Thus µ1 and µ2 are supported on disjoint sets, and hence µ1
and µ2 are mutually singular.
We end this subsection with a short discussion that the assumption of er-
godicity is not very restrictive. Let T be a transformation on the probability
space (X, F, µ), and suppose T is measure preserving but not necessarily
ergodic. We assume that X is a complete separable metric space, and F the
corresponding Borel σ-algebra (in order to make sure that the conditional
expectation is well-defined a.e.). Let I be the sub-σ-algebra of T -invariant
measurable sets. We can decompose µ into T -invariant ergodic components
in the following way. For x ∈ X, define a measure µx on F by
µx (A) = Eµ (1A |I)(x).
Then, for any f ∈ L1 (X, F, µ),
Z
f (y) dµx (y) = Eµ (f |I)(x).
X
Note that
Z Z
µ(A) = Eµ (1A |I)(x) dµ(x) = µx (A) dµ(x),
X X
and that Eµ (1A |I)(x) is T -invariant. We show that µx is T -invariant and

ergodic for a.e. x ∈ X. So let A ∈ F, then for a.e. x
µx (T −1 A) = Eµ (1A ◦ T |I)(x) = Eµ (IA |I)(T x) = Eµ (IA |I)(x) = µx (A).
Now, let A ∈ F be such that T −1 A = A. Then, 1A is T -invariant, and hence
I-measurable. Then,
µx (A) = Eµ (1A |I)(x) = 1A (x) a.e.
Hence, for a.e. x and for any B ∈ F,
µx (A ∩ B) = Eµ (1A 1B |I)(x) = 1A (x)Eµ (1B |I)(x) = µx (A)µx (B).
In particular, if A = B, then the latter equality yields µx (A) = µx (A)2 which
implies that for a.e. x, µx (A) = 0 or 1. Therefore, µx is ergodic. (One
in fact needs to work a little harder to show that one can find a set N of
µ-measure zero, such that for any x ∈ X \ N , and any T -invariant set A, one
has µx (A) = 0 or 1. In the above analysis the a.e. set depended on the choice
of A. Hence, the above analysis is just a rough sketch of the proof of what
is called the ergodic decomposition of measure preserving transformations.)
Characterization of Irreducible Markov Chains 41
2.2 Characterization of Irreducible Markov

Chains
Consider the Markov Chain in Example(f) subsection 1.3. That is X =
{0, 1, . . . N − 1}Z , F the σ-algebra generated by the cylinders, T : X → X
the left shift, and µ the Markov measure defined by the stochastic N × N
matrix P = (pij ), and the positive probability vector π = (π0 , π1 , . . . , πN −1 )
satisfying πP = π. That is
µ({x : x0 = i0 , x1 = i1 , . . . xn = in }) = πi0 pi0 i1 pi1 i2 . . . pin−1 in .
We want to find necessary and sufficient conditions for T to be ergodic. To
achieve this, we first set
n−1
1X k
Q = lim P ,
n→∞ n
k=0
(k)
where P k = (pij ) is the k th power of the matrix P , and P 0 is the k × k
identity matrix. More precisely, Q = (qij ), where
n−1
1 X (k)
qij = lim pij .
n→∞ n
k=0
1
Pn−1 (k)
Lemma 2.2.1 For each i, j ∈ {0, 1, . . . N − 1}, the limit limn→∞ n k=0 pij
exists, i.e., qij is well-defined.
Proof For each n,

n−1 n−1
1 X (k) 1 1X
pij = µ({x ∈ X : x0 = i, xk = j}).
n k=0 πi n k=0
Since T is measure preserving, by the ergodic theorem,

n−1 n−1
1X 1X
lim 1{x:xk =j} (x) = lim 1{x:x0 =j} (T k x) = f ∗ (x),
n→∞ n n→∞ n
k=0 k=0
where f ∗ is T -invariant and integrable. Then,

n−1 n−1
1X 1X
lim 1{x:x0 =i,xk =j} (x) = 1{x:x0 =i} (x) lim 1{x:x0 =j} (T k x) = f ∗ (x)1{x:x0 =i} (x).
n→∞ n n→∞ n
k=0 k=0
Since n1 n−1
P
k=0 1{x:x0 =i,xk =j} (x) ≤ 1 for all n, by the dominated convergence
theorem,
n−1
1 1X
qij = lim µ({x ∈ X : x0 = i, xk = j})
πi n→∞ n k=0
Z n−1
1 1X
= lim 1{x:x0 =i,xk =j} (x) dµ(x)
πi X n→∞ n k=0
Z
1
= f ∗ (x)1{x:x0 =i} (x) dµ(x)
πi X
Z
1
= f ∗ (x) dµ(x)
πi {x:x0 =i}
which is finite since f ∗ is integrable. Hence qij exists.
Exercise 2.2.1 Show that the matrix Q has the following properties:
(a) Q is stochastic.
(b) Q = QP = P Q = Q2 .
(c) πQ = π.
We now give a characterization for the ergodicity of T . Recall that the

matrix P is said to be irreducible if for every i, j ∈ {0, 1, . . . N − 1}, there
(n)
exists n ≥ 1 such that pij > 0.
Theorem 2.2.1 The following are equivalent,
(i) T is ergodic.
(ii) All rows of Q are identical.
(iii) qij > 0 for all i, j.
(iv) P is irreducible.
(v) 1 is a simple eigenvalue of P .
Proof
(i) ⇒ (ii) By the ergodic theorem for each i, j,
n−1
1X
lim 1{x:x0 =i,xk =j} (x) = 1{x:x0 =i} (x)πj .
n→∞ n
k=0
By the dominated convergence theorem,

n−1
1X
lim µ({x ∈ X : x0 = i, xk = j}) = πi πj .
n→∞ n
k=0
Hence,
n−1
1 1X
qij = lim µ({x ∈ X : x0 = i, xk = j}) = πj ,
πi n→∞ n k=0
i.e., qij is independent of i. Therefore, all rows of Q are identical.
(ii) ⇒ (iii) If all the rows of Q are identical, then all the columns of Q are
constants. Thus, for each j there exists a constant cj such that qij = cj for
all i. Since πQ = π, it follows that qij = cj = πj > 0 for all i, j.
(iii) ⇒ (iv) For any i, j
n−1
1 X (k)
lim pij = qij > 0.
n→∞ n
k=0
(n)
Hence, there exists n such that pij > 0, therefore P is irreducible.
(iv) ⇒ (iii) Suppose P is irreducible. For any state i ∈ {0, 1, . . . , N − 1}, let
Si = {j : qij > 0}. Since Q is a stochastic matrix, it follows that Si 6= ∅. Let
l ∈ Si , then qil > 0. Since Q = QP = QP n for all n, then for any state j
N
X −1
(n) (n)
qij = qim pmj ≥ qil plj
m=0
(n)
for any n. Since P is irreducible, there exists n such that plj > 0. Hence,
qij > 0 for all i, j.
(iii) ⇒ (ii) Suppose qij > 0 for all j = 0, 1, . . . , N − 1. Fix any state j, and
let qj = max0≤i≤N −1 qij . Suppose that not all the qij ’s are the same. Then
there exists k ∈ {0, 1, . . . , N − 1} such that qkj < qj . Since Q is stochastic
and Q2 = Q, then for any i ∈ {0, 1, . . . , N − 1} we have,
N
X −1 N
X −1
qij = qil qlj < qil qj = qj .
l=0 l=0
This implies that qj = max0≤i≤N −1 qij < qj , a contradiction. Hence, the

columns of Q are constants, or all the rows are identical.
(ii) ⇒ (i) Suppose all the rows of Q are identical. We have shown above
that this implies qij = πj for all i, j ∈ {0, 1, . . . , N − 1}. Hence πj =
(k)
limn→∞ n1 n−1
P
k=0 pij .
Let
A = {x : xr = i0 , . . . , xr+l = il }, and B = {x : xs = j0 , . . . , xs+m = jm }
be any two cylinder sets of X. By Proposition 2.1.1 in Section 2, we must

show that
n−1
1X
lim µ(T −i A ∩ B) = µ(A)µ(B).
n→∞ n
i=0
Since T is the left shift, for all n sufficiently large, the cylinders T −n A and
B depend on different coordinates. Hence, for n sufficiently large,
(n+r−s−m)
µ(T −n A ∩ B) = πj0 pj0 j1 . . . pjm−1 jm pjm i0 pi0 i1 . . . pil−1 il .
Thus,
n−1
1X
lim µ(T −k A ∩ B)
n→∞ n
k=0
n−1
1 X (k)
= πj0 pj0 j1 . . . pjm−1 jm pi0 i1 . . . pil−1 il lim pjm i0
n→∞ n
k=0
= (πj0 pj0 j1 . . . pjm−1 jm )(πi0 pi0 i1 . . . pil−1 il )
= µ(B)µ(A).
Therefore, T is ergodic.
(ii) ⇒ (v) If all the rows of Q are identical, then qij = πj for all i, j. If
P −1
vP = v, then vQ = v. This implies that for all j, vj = ( N i=0 vi )πj . Thus,
v is a multiple of π. Therefore, 1 is a simple eigenvalue.
(v) ⇒ (ii) Suppose 1 is a simple eigenvalue. For any i, let qi∗ = (qi0 , . . . , qi(N −1) )
denote the ith row of Q then, qi∗ is a probability vector. From Q = QP , we get
qi∗ = qi∗ P . By hypothesis π is the only probability vector satisfying πP = P ,
hence π = qi∗ , and all the rows of Q are identical.
2.3 Mixing
As a corollary to the ergodic theorem we found a new definition of ergodicity;
namely, asymptotic average independence. Based on the same idea, we now
define other notions of weak independence that are stronger than ergodicity.
Definition 2.3.1 Let (X, F, µ) be a probability space, and T : X → X a

measure preserving transformation. Then,
(i) T is weakly mixing if for all A, B ∈ F, one has
n−1
1 X
µ(T −i A ∩ B) − µ(A)µ(B) = 0.

lim (2.3)
n→∞ n
i=0
(ii) T is strongly mixing if for all A, B ∈ F, one has
lim µ(T −i A ∩ B) = µ(A)µ(B). (2.4)

n→∞
Notice that strongly mixing implies weakly mixing, and weakly mixing im-
plies ergodicity. This follows from the simple fact that if {an } is a sequence
n−1
1X
of real numbers such that limn→∞ an = 0, then limn→∞ |ai | = 0, and
n i=0
n−1
1X
hence limn→∞ ai = 0. Furthermore, if {an } is a bounded sequence,
n i=0
then the following are equivalent:
n−1
1X
(i) limn→∞ |ai | = 0
n i=0
n−1
1X
(ii) limn→∞ |ai |2 = 0
n i=0
(iii) there exists a subset J of the integers of density zero, i.e.

1
lim # ({0, 1, . . . , n − 1} ∩ J) = 0,
n→∞ n
such that limn→∞,n∈J

/ an = 0.
Using this one can give three equivalent characterizations of weakly mixing
transformations, can you state them?
Exercise 2.3.1 Let (X, F, µ) be a probability space, and T : X → X a

measure preserving transformation. Let S be a generating semi-algebra of
F.
(a) Show that if equation (2.3) holds for all A, B ∈ S, then T is weakly
mixing.
(b) Show that if equation (2.4) holds for all A, B ∈ S, then T is strongly
mixing.
Exercise 2.3.2 Consider the one or two-sided Bernoulli shift T as given in

Example (e) in subsection 1.3, and Example (2) in subsection 1.8. Show that
T is strongly mixing.
Exercise 2.3.3 Let (X, F, µ) be a probability space, and T : X → X a

measure preserving transformation. Consider the transformation T × T de-
fined on (X × X, F × F, µ × µ) by T × T (x, y) = (T x, T y).
(a) Show that T × T is measure preserving with respect to µ × µ.
(b) Show that T × T is ergodic, if and only if T is weakly mixing.

Chapter 3
Measure Preserving
Isomorphisms and Factor Maps
3.1 Measure Preserving Isomorphisms

Given a measure preserving transformation T on a probability space (X, F, µ),
we call the quadruple (X, F, µ, T ) a dynamical system . Now, given two dy-
namical systems (X, F, µ, T ) and (Y, C, ν, S), what should we mean by: these
systems are the same? On each space there are two important structures:
(1) The measure structure given by the σ-algebra and the probability mea-
sure. Note, that in this context, sets of measure zero can be ignored.
(2) The dynamical structure, given by a measure preserving transforma-

tion.
So our notion of being the same must mean that we have a map
ψ : (X, F, µ, T ) → (Y, C, ν, S)
satisfying
(i) ψ is one-to-one and onto a.e. By this we mean, that if we remove a

(suitable) set NX of measure 0 in X, and a (suitable) set NY of measure
0 in Y , the map ψ : X \ NX → Y \ NY is a bijection.
(ii) ψ is measurable, i.e., ψ −1 (C) ∈ F, for all C ∈ C.
47
48 Measure Preserving Isomorphism and Factor Maps
(iii) ψ preserves the measures: ν = µ ◦ ψ −1 , i.e., ν(C) = µ (ψ −1 (C)) for all

C ∈ C.
Finally, we should have that
(iv) ψ preserves the dynamics of T and S, i.e., ψ ◦ T = S ◦ ψ, which is the

same as saying that the following diagram commutes.
T
N... .............................................................. N...
... ...
... ...
... ...
... ...
... ...
ψ ...
...
...
...
...
...
ψ
... ...
. .
......... .........
... ...
S
N 0..............................................................N 0
This means that T -orbits are mapped to S-orbits:
x→ Tx → T 2x → ··· → T nx →
↓ ↓ ↓ ↓ ↓
2 n
ψ(x) → S (ψ(x)) → S (ψ(x)) → · · · → S (ψ(x)) →
Definition 3.1.1 Two dynamical systems (X, F, µ, T ) and (Y, C, ν, S) are

isomorphic if there exist measurable sets N ⊂ X and M ⊂ Y with µ(X\N ) =
ν(Y \ M ) = 0 and T (N ) ⊂ N, S(M ) ⊂ M , and finally if there exists a
measurable map ψ : N → M such that (i)–(iv) are satisfied.
Exercise 3.1.1 Suppose (X, F, µ, T ) and (Y, C, ν, S) are two isomorphic dy-
namical systems. Show that
(a) T is ergodic if and only if S is ergodic.
(b) T is weakly mixing if and only if S is weakly mixing.
(c) T is strongly mixing if and only if S is strongly mixing.
Examples
(1) Let K = {z ∈ C : |z| = 1} be equipped with the Borel σ-algebra B on
K, and Haar measure (i.e., normalized Lebeque measure on the unit circle).
Measure Preserving Isomorphisms 49
Define S : K → K by Sz = z 2 ; equivalently Se2πiθ = e2πi(2θ) .. One can

easily check that S is measure preserving. In fact, the map S is isomorphic
to the map T on ([0, 1), B, λ) given by T x = 2x (mod 1) (see Example
(b) in subsection 1.3, and Example (3) in subsection 1.8). Define a map
φ : [0, 1) → K by φ(x) = e2πix . We leave it to the reader to check that φ
is a measurable isomorphism, i.e., φ is a measurable and measure preserving
bijection such that Sφ(x) = φ(T x) for all x ∈ [0, 1).
(2) Consider ([0, 1), B, λ), the unit interval with the Lebesgue σ-algebra, and
Lebesgue measure. Let T : [0, 1) → [0, 1) be given by T x = N x − bN xc.
Iterations of T generate the N -adic expansion of points in the unit interval.
Let Y := {0, 1, . . . , N − 1}N , the set of all sequences (yn )n≥1 , with yn ∈
{0, 1, . . . , N − 1} for n ≥ 1. We now construct an isomorphism between
([0, 1), B, λ, T ) and (Y, F, µ, S), where F is the σ-algebra generated by the
cylinders, and µ the uniform product measure defined on cylinders by
1
µ({(yi )i≥1 ∈ Y : y1 = a1 , y2 = a2 , . . . , yn = an }) = ,
Nn
for any (a1 , a2 , a3 , . . .) ∈ Y , and where S is the left shift.
Define ψ : [0, 1) → Y = {0, 1, . . . , N − 1}N by
∞
X ak
ψ: x = k
7→ (ak )k≥1 ,
k=1
N
where ∞ k
P
k=1 ak /N is the N -adic expansion of x (for example if N = 2 we
get the binary expansion, and if N = 10 we get the decimal expansion). Let
C(i1 , . . . , in ) = {(yi )i≥1 ∈ Y : y1 = i1 , . . . , yn = in } .
In order to see that ψ is an isomorphism one needs to verify measurability

and measure preservingness on cylinders:
i1 i2 in i1 i2 in + 1
ψ −1 (C(i1 , . . . , in )) = + 2 + ··· + n, + 2 + ··· +
N N N N N Nn
and
1
λ ψ −1 (C(i1 , . . . , in )) = n = µ (C(i1 , . . . , in ))) .

N
Note that
N = {(yi )i≥1 ∈ Y : there exists a k ≥ 1 such that yi = N −1 for all i ≥ k}

is a subset of Y of measure 0. Setting Ỹ = Y \ N , then ψ : [0, 1) → Ỹ is a

bijection, since every x ∈ [0, 1) has a unique N -adic expansion generated
by T . Finally, it is easy to see that ψ ◦ T = S ◦ ψ.
Exercise 3.1.2 Consider ([0, 1)2 , B × B, λ × λ), where B × B is the product
Lebesgue σ-algebra, and λ × λ is the product Lebesgue measure Let T :
[0, 1)2 → [0, 1)2 be given by

1
 (2x, 2 y), 0 ≤ x < 21



T (x, y) =
 1
 (2x − 1, (y + 1)), 12 ≤ x < 1.


2

Show that T is isomorphic to the two-sided Bernoulli shift S on {0, 1}Z , F, µ ,
where F is the σ-algebra generated by cylinders of the form
∆ = {x−k = a−k , . . . , x` = a` : ai ∈ {0, 1}, i = −k, . . . , `}, k, ` ≥ 0,
and µ the product measure with weights ( 21 , 12 ) (so µ(∆) = ( 12 )k+`+1 ).
√
1+ 5
Exercise 3.1.3 Let G = , so that G2 = G + 1. Consider the set
2
1 [ 1 1
X = [0, ) × [0, 1) [ , 1) × [0, ),
G G G
endowed with the product Borel σ-algebra. Define the transformation
 y
 (Gx, G ),

 (x, y) ∈ [0, G1 ) × [0, 1]
T (x, y) =
 (Gx − 1, 1 + y ), (x, y) ∈ [ 1 , 1) × [0, 1 ).


G G
G
(a) Show that T is measure preserving with respect to normalized Lebesgue
measure on X.
(b) Now let S : [0, 1) × [0, 1) → [0, 1) × [0, 1) be given by
 y
 (Gx, ), (x, y) ∈ [0, G1 ) × [0, 1]

 G
S(x, y) =
 (G2 x − G, G + y ), (x, y) ∈ [ 1 , 1) × [0, 1).


G
G2
Factor Maps 51
Show that S is measure preserving with respect to normalized Lebesgue

measure on [0, 1) × [0, 1).
(c) Let Y = [0, 1) × [0, G1 ), and let U be the induced transformation of T

on Y , i.e., for (x, y) ∈ Y , U (x, y) = T n(x,y) , where n(x, y) = inf{n ≥
1 : T n (x, y) ∈ Y }. Show that the map φ : [0, 1) × [0, 1) → Y given by
y
φ(x, y) = (x, )
G
defines an isomorphism from S to U , where Y has the induced measure
structure (see Section 1.5).
3.2 Factor Maps

In the above section, we discussed the notion of isomorphism which describes
when two dynamical systems are considered the same. Now, we give a precise
definition of what it means for a dynamical system to be a subsystem of
another one.
Definition 3.2.1 Let (X, F, µ, T ) and (Y, C, ν, S) be two dynamical systems.

We say that S is a factor of T if there exist measurable sets M1 ∈ F and
M2 ∈ C, such that µ(M1 ) = ν(M2 ) = 1 and T (M1 ) ⊂ M1 , S(M2 ) ⊂ M2 ,
and finally if there exists a measurable and measure preserving map ψ : M1 →
M2 which is surjective, and satisfies ψ(T (x)) = S(ψ(x)) for all x ∈ M1 . We
call ψ a factor map.
Remark Notice that if ψ is a factor map, then G = ψ −1 C is a T -invariant

sub-σ-algebra of F, since
T −1 G = T −1 ψ −1 C = ψ −1 S −1 C ⊆ ψ −1 C = G.
Examples Let T be the Baker’s transformation on ([0, 1)2 , B × B, λ × λ),

given by 
 (2x, 21 y), 0 ≤ x < 21
T (x, y) =
(2x − 1, 12 (y + 1)), 21 ≤ x < 1,

and let S be the left shift on X = {0, 1}N with the σ-algebra F generated by
the cylinders, and the uniform product measure µ. Define ψ : [0, 1)×[0, 1) →
X by
ψ(x, y) = (a1 , a2 , . . .),
an
where x = ∞
P
n=1 n is the binary expansion of x. It is easy to check that ψ
2
is a factor map.
Exercise 3.2.1 Let T be the left shift on X = {0, 1, 2}N which is endowed
with the σ-algebra F, generated by the cylinder sets, and the uniform product
measure µ giving each symbol probability 1/3, i.e.,
1
µ ({x ∈ X : x1 = i1 , x2 = i2 , . . . , xn = in }) = ,
3n
where i1 , i2 , . . . , in ∈ {0, 1, 2}.
Let S be the left shift on Y = {0, 1}N which is endowed with the σ-algebra G,
generated by the cylinder sets, and the product measure ν giving the symbol
0 probability 1/3 and the symbol 1 probability 2/3, i.e.,
2 1
µ ({y ∈ Y : y1 = j1 , y2 = j2 , . . . , yn = jn }) = ( )j1 +j2 +...+jn ( )n−(j1 +j2 +...+jn ) ,
3 3
where j1 , j2 , . . . , jn ∈ {0, 1}. Show that S is a factor of T .
Exercise 3.2.2 Show that a factor of an ergodic (weakly mixing/strongly

mixing) transformation is also ergodic (weakly mixing/strongly mixing).
3.3 Natural Extensions

Suppose (Y, G, ν, S) is a non-invertible measure-preserving dynamical system.
An invertible measure-preserving dynamical system (X, F, µ, T ) is called a
natural extension of (Y, G, ν, S) if S is a factor of T and the factor map ψ
satisfies ∨∞ m −1
m=0 T ψ G = F, where
∞
_
T k ψ −1 G
k=0
is the smallest σ-algebra containing the σ-algebras T k ψ −1 G for all k ≥ 0.

Natural Extensions 53

Example Let T on {0, 1}Z
, F, µ be the two-sided Bernoulli shift, and S on
{0, 1}N∪{0}
, G, ν be the one-sided Bernoulli shift, both spaces are endowed
with the uniform product measure. Notice that T is invertible, while S is
not. Set X = {0, 1}Z , Y = {0, 1}N∪{0} , and define ψ : X → Y by
ψ (. . . , x−1 , x0 , x1 , . . .) = (x0 , x1 , . . .) .
Then, ψ is a factor map. We claim that

∞
_
T k ψ −1 G = F.
k=0
To prove this, we show that ∞ k −1

W
k=0 T ψ G contains all cylinders generating
F.
Let ∆ = {x ∈ X : x−k = a−k , . . . , x` = a` } be an arbitrary cylinder in
F, and let D = {y ∈ Y : y0 = a−k , . . . , yk+` = a` } which is a cylinder in G.
Then,
ψ −1 D = {x ∈ X : x0 = a−k , . . . , xk+` = a` } and T k ψ −1 D = ∆.
This shows that ∞

_
T k ψ −1 G = F.
k=0
Thus, T is the natural extension of S.

Chapter 4
Entropy
4.1 Randomness and Information

Given a measure preserving transformation T on a probability space (X, F, µ),
we want to define a nonnegative quantity h(T ) which measures the average
uncertainty about where T moves the points of X. That is, the value of
h(T ) reflects the amount of ‘randomness’ generated by T . We want to define
h(T ) in such a way, that (i) the amount of information gained by an appli-
cation of T is proportional to the amount of uncertainty removed, and (ii)
that h(T ) is isomorphism invariant, so that isomorphic transformations have
equal entropy.
The connection between entropy (that is randomness, uncertainty) and
the transmission of information was first studied by Claude Shannon in 1948.
As a motivation let us look at the following simple example. Consider a source
(for example a ticker-tape) that produces a string of symbols · · · x−1 x0 x1 · · ·
from the alphabet {a1 , a2 , . . . , an }. Suppose that the probability of receiving
symbol ai at any given time is pi , and that each symbol is transmitted inde-
pendently of what has beenPtransmitted earlier. Of course we must have here
that each pi ≥ 0 and that i pi = 1. In ergodic theory we view this process
as the dynamical system (X, F, B, µ, T ), where X = {a1 , a2 , . . . , an }N , B the
σ-algebra generated by cylinder sets of the form
∆n (ai1 , ai2 , . . . , ain ) := {x ∈ X : xi1 = ai1 , . . . , xin = ain }
µ the product measure assigning to each coordinate probability pi of seeing
55
56 Entropy
the symbol ai , and T the left shift. We define the entropy of this system by
n
X
H(p1 , . . . , pn ) = h(T ) := − pi log2 pi . (4.1)
i=1
If we define log pi as the amount of uncertainty in transmitting the symbol ai ,

then H is the average amount of information gained (or uncertainty removed)
per symbol (notice that H is in fact an expected value). To see why this is an
appropriate definition, notice that if the source is degenerate, that is, pi = 1
for some i (i.e., the source only transmits the symbol ai ), then H = 0. In
this case we indeed have no randomness. Another reason to see why this
definition is appropriate, is that H is maximal if pi = n1 for all i, and this
agrees with the fact that the source is most random when all the symbols are
equiprobable. To see this maximum, consider the function f : [0, 1] → R+
defined by (
0 if t = 0,
f (t) =
−t log2 t if 0 < t ≤ 1.
Then f is continuous and concave downward, and Jensen’s Inequality implies
that for any p1 , . . . , pn with pi ≥ 0 and p1 + . . . + pn = 1,
n n
1 1X 1X 1 1
H(p1 , . . . , pn ) = f (pi ) ≤ f ( pi ) = f ( ) = log2 n ,
n n i=1 n i=1 n n
so H(p1 , . . . , pn ) ≤ log2 n for all probability vectors (p1 , . . . , pn ). But

1 1
H( , . . . , ) = log2 n,
n n
so the maximum value is attained at ( n1 , . . . , n1 ).
4.2 Definitions and Properties

So far H is defined as the average information per symbol. The above defini-
tion can be extended to define the information transmitted by the occurrence
of an event E as − log2 P (E). This definition has the property that the in-
formation transmitted by E ∩ F for independent events E and F is the sum
of the information transmitted by each one individually, i.e.,
− log2 P (E ∩ F ) = − log2 P (E) − log2 P (F ).
Definitions and Properties 57
The only function with this property is the logarithm function to any base.
We choose base 2 because information is usually measured in bits.
In the above example of the ticker-tape the symbols were transmitted
independently. In general, the symbol generated might depend on what has
been received before. In fact these dependencies are often ‘built-in’ to be
able to check the transmitted sequence of symbols on errors (think here of
the Morse sequence, sequences on compact discs etc.). Such dependencies
must be taken into consideration in the calculation of the average information
per symbol. This can be achieved if one replaces the symbols ai by blocks of
symbols of particular size. More precisely, for every n, let Cn be the collection
of all possible n-blocks (or cylinder sets) of length n, and define
X
Hn := − P (C) log P (C) .
C∈Cn
Then n1 Hn can be seen as the average information per symbol when a block
of length n is transmitted. The entropy of the source is now defined by
Hn
h := lim . (4.2)
n→∞ n
The existence of the limit in (4.2) follows from the fact that Hn is a subadditive
sequence, i.e., Hn+m ≤ Hn + Hm , and proposition (4.2.2) (see proposition
(4.2.3) below).
Now replace the source by a measure preserving system (X, B, µ, T ).
How can one define the entropy of this system similar to the case of a
source? The symbols {a1 , a2 , . . . , an } can now be viewed as a partition
A = {A1 , A2 , . . . , An } of X, so that X is the disjoint union (up to sets
of measure zero) of A1 , A2 , . . . , An . The source can be seen as follows: with
each point x ∈ X, we associate an infinite sequence · · · x−1 , x0 , x1 , · · · , where
xi is aj if and only if T i x ∈ Aj . We define the entropy of the partition α by
n
X
H(α) = Hµ (α) := − µ(Ai ) log µ(Ai ) .
i=1
Our aim is to define the entropy of the transformation T which is independent

of the partition we choose. In fact h(T ) must be the maximal entropy over
all possible finite partitions. But first we need few facts about partitions.
58 Entropy
Exercise 4.2.1 Let α = {A1 , . . . , An } and β = {B1 , . . . , Bm } be two parti-

tions of X. Show that
T −1 α := {T −1 A1 , . . . , T −1 An }
and
α ∨ β := {Ai ∩ Bj : Ai ∈ α, Bj ∈ β}
are both partitions of X.
The members of a partition are called the atoms of the partition. We say
that the partition β = {B1 , . . . , Bm } is a refinement of the partition α =
{A1 , . . . , An }, and write α ≤ β, if for every 1 ≤ j ≤ m there exists an
1 ≤ i ≤ n such that Bj ⊂ Ai (up to sets of measure zero). The partition
α ∨ β is called the common refinement of α and β.
Exercise 4.2.2 Show that if β is a refinement of α, each atom of α is a finite

(disjoint) union of atoms of β.
Given two partitions α = {A1 , . . . An } and β = {B1 , . . . , Bm } of X, we

define the conditional entropy of α given β by

XX µ(A ∩ B)
H(α|β) := − log µ(A ∩ B) .
A∈α B∈β
µ(B)
(Under the convention that 0 log 0 := 0.)

The above quantity H(α|β) is interpreted as the average uncertainty
about which element of the partition α the point x will enter (under T )
if we already know which element of β the point x will enter.
Proposition 4.2.1 Let α, β and γ be partitions of X. Then,
(a) H(T −1 α) = H(α) ;
(b) H(α ∨ β) = H(α) + H(β|α);
(c) H(β|α) ≤ H(β);
(d) H(α ∨ β) ≤ H(α) + H(β);
(e) If α ≤ β, then H(α) ≤ H(β);

(f ) H(α ∨ β|γ) = H(α|γ) + H(β|α ∨ γ);

(g) If β ≤ α, then H(γ|α) ≤ H(γ|β);
(h) If β ≤ α, then H(β|α) = 0.
(i) We call two partitions α and β independent if
µ(A ∩ B) = µ(A)µ(B) for all A ∈ α, B ∈ β .
If α and β are independent partitions, one has that
H(α ∨ β) = H(α) + H(β) .
Proof We prove properties (b) and (c), the rest are left as an exercise.
XX
H(α ∨ β) = − µ(A ∩ B) log µ(A ∩ B)
A∈α B∈β
XX µ(A ∩ B)
= − µ(A ∩ B) log
A∈α B∈β
µ(A)
XX
+ − µ(A ∩ B) log µ(A)
A∈α B∈β
= H(β|α) + H(α).
We now show that H(β|α) ≤ H(β). Recall that the function f (t) = −t log t
for 0 < t ≤ 1 is concave down. Thus,
XX µ(A ∩ B)
H(β|α) = − µ(A ∩ B) log
B∈β A∈α
µ(A)
XX µ(A ∩ B) µ(A ∩ B)
= − µ(A) log
B∈β A∈α
µ(A) µ(A)

XX µ(A ∩ B)
= µ(A)f
B∈β A∈α
µ(A)
!
X X µ(A ∩ B)
≤ f µ(A)
B∈β A∈α
µ(A)
X
= f (µ(B)) = H(β).
B∈β
60 Entropy
Exercise 4.2.3 Prove the rest of the properties of Proposition 4.2.1

Wn−1 −i
Now consider the partition i=0 T α, whose atoms are of the form Ai0 ∩
−1 −(n−1)
T Ai1 ∩ . . . ∩ T Ain−1 , consisting of all points x ∈ X with the property
that x ∈ Ai0 , T x ∈ Ai1 , . . . , T n−1 x ∈ Ain−1 .
Exercise 4.2.4 Show that if α is a finite partition of (X, F, µ, T ), then

n−1 n−1 j
_ X _
−i
H( T α) = H(α) + H(α| T −i α).
i=0 j=1 i=1
To define the notion of the entropy of a transformation with respect to a

partition, we need the following two propositions.
Proposition 4.2.2 If {an } is a subadditive sequence of real numbers i.e.,

an+p ≤ an + ap for all n, p, then
an
lim
n→∞ n
exists.
Proof Fix any m > 0. For any n ≥ 0 one has n = km + i for some i between
0 ≤ i ≤ m − 1. By subadditivity it follows that
an akm+i akm ai am ai
= ≤ + ≤ k + .
n km + i km km km km
Note that if n → ∞, k → ∞ and so lim supn→∞ an /n ≤ am /m. Since m is
arbitrary one has
an am an
lim sup ≤ inf ≤ lim inf .
n→∞ n m n→∞ n
Therefore limn→∞ an /n exists, and equals inf an /n.
Proposition 4.2.3 Let α be a finite partitions of (X, B,

Wµ, T ), where T is a
1 n−1 −i
measure preserving transformation. Then, limn→∞ n H( i=0 T α) exists.
Proof Let an = H( n−1 −i

W
i=0 T α) ≥ 0. Then, by Proposition 4.2.1, we have
n+p−1
_
an+p = H( T −i α)
i=0
n−1 n+p−1
_ _
−i
≤ H( T α) + H( T −i α)
i=0 i=n
p−1
_
= an + H( T −i α)
i=0
= an + ap .
Hence, by Proposition 4.2.2

n−1
an 1 _
lim = lim H( T −i α)
n→∞ n n→∞ n
i=0
exists.
We are now in position to give the definition of the entropy of the trans-
formation T .
Definition 4.2.1 The entropy of the measure preserving transformation T

with respect to the partition α is given by
n−1
1 _
h(α, T ) = hµ (α, T ) := lim H( T −i α) ,
n→∞ n
i=0
where
n−1
_ X
H( T −i α) = − µ(D) log(µ(D)) .
i=0 Wn−1
D∈ i=0 T −i α
Finally, the entropy of the transformation T is given by
h(T ) = hµ (T ) := sup h(α, T ) .

α
The following theorem gives an equivalent definition of entropy..

62 Entropy
Theorem 4.2.1 The entropy of the measure preserving transformation T

with respect to the partition α is also given by
n−1
_
h(α, T ) = lim H(α| T −i α) .
n→∞
i=1
Proof Notice that the sequence {H(α| ni=1 −i

W
WnT α) −i
is bounded from below,
and is non-increasing, hence limn→∞ H(α| i=1 T α) exists. Furthermore,
n n j
_ 1X
−i
_
lim H(α| T α) = lim H(α| T −i α).
n→∞ n→∞ n
i=1 j=1 i=1
From exercise 4.2.4, we have

n−1 n−1 j
_ X _
−i
H( T α) = H(α) + H(α| T −i α).
i=0 j=1 i=1
Now, dividing by n, and taking the limit as n → ∞, one gets the desired
result
Theorem 4.2.2 Entropy is an isomorphism invariant.
Proof Let (X, B, µ, T ) and (Y, C, ν, S) be two isomorphic measure preserv-

ing systems (see Definition 1.2.3, for a definition), with ψ : X → Y the
corresponding isomorphism. We need to show that hµ (T ) = hν (S).
Let β = {B1 , . . . , Bn } be any partition of Y , then ψ −1 β = {ψ −1 B1 , . . . , ψ −1 Bn }
is a partition of X. Set Ai = ψ −1 Bi , for 1 ≤ i ≤ n. Since ψ : X → Y is an
isomorphism, we have that ν = µψ −1 and ψT = Sψ, so that for any n ≥ 0
and Bi0 , . . . , Bin−1 ∈ β
ν Bi0 ∩ S −1 Bi1 ∩ . . . ∩ S −(n−1) Bin−1

= µ ψ −1 Bi0 ∩ ψ −1 S −1 Bi1 ∩ . . . ∩ ψ −1 S −(n−1) Bin−1

= µ ψ −1 Bi0 ∩ T −1 ψ −1 Bi1 ∩ . . . ∩ T −(n−1) ψ −1 Bin−1

= µ Ai0 ∩ T −1 Ai1 ∩ . . . ∩ T −(n−1) Ain−1 .

Setting
A(n) = Ai0 ∩ . . . ∩ T −(n−1) Ain−1 and B(n) = Bi0 ∩ . . . ∩ S −(n−1) Bin−1 ,

Calculation of Entropy and Examples 63
we thus find that

n−1
1 _
hν (S) = sup hν (β, S) = sup lim Hν ( S −i β)
β β n→∞ n
i=0
1 X
= sup lim − ν(B(n)) log ν(B(n))
β n→∞ n Wn−1 −i
B(n)∈ i=0 S β
1 X
= sup lim − µ(A(n)) log µ(A(n))
ψ −1 β n→∞ n Wn−1
A(n)∈ i=0 T −i ψ −1 β
−1
= sup hµ (ψ β, T )
ψ −1 β
≤ sup hµ (α, T ) = hµ (T ) ,
α
where in the last inequality the supremum is taken over all possible finite
partitions α of X. Thus hν (S) ≤ hµ (T ). The proof of hµ (T ) ≤ hν (S) is done
similarly. Therefore hν (S) = hµ (T ), and the proof is complete.
4.3 Calculation of Entropy and Examples

Calculating the entropy of a transformation directly from the definition does
not seem feasible, for one needs to take the supremum over all finite parti-
tions, which is practically impossible. However, the entropy of a partition is
relatively easier to calculate if one has full information about the partition
under consideration. So the question is whether it is possible to find a par-
tition α of X where h(α, T ) = h(T ). Naturally, such a partition contains all
the information ‘transmitted’ by T . To answer this question we need some
notations and definitions.
For α = {A1 , . . . , AN } and all m, n ≥ 0, let
m
! −n
!
_ _
σ T −i α and σ T −i α
i=n i=−m
be the smallest σ-algebras containing the partitions m T −i α and −n −i

W W
W−∞ i=n i=−m T α
respectively. Furthermore, let σ i=−∞ T −i α be the smallest σ-algebra con-

W−n
taining all the partitions m −i −i
W
i=n T α and i=−m T α for W∞
all n andm. We
call a partition α a generator with respect to T if σ i=−∞ T −i α = F,
64 Entropy
where F is the σ-algebra

W∞ −i on X. If T is non-invertible, then α is said to be
a generator if σ ( i=0 T α) = F. Naturally, this equality is modulo sets
of measure zero. One has also the following characterization of generators,
saying basically, that each measurable set in X can be approximated by a
finite disjoint union of cylinder sets. See also [W] for more details and proofs.
Proposition 4.3.1 The partition α is a generator of F if for each A ∈ F

and for each ε > 0 there exists a finite disjoint union C of elements of {αnm },
such that µ(A4C) < ε.
We now state (without proofs) two famous theorems known as Kolmogorov-

Sinai’s Theorem and Krieger’s Generator Theorem. For the proofs, we refer
the interested reader to the book of Karl Petersen or Peter Walter.
Theorem 4.3.1 (Kolmogorov and Sinai, 1958) If α is a generator with re-

spect to T and H(α) < ∞, then h(T ) = h(α, T ).
Theorem 4.3.2 (Krieger, 1970) If T is an ergodic measure preserving trans-

formation with h(T ) < ∞, then T has a finite generator.
We will use these two theorems to calculate the entropy of a Bernoulli

shift.
Example (Entropy of a Bernoulli shift)–Let T be the left shift on X =
{1, 2, · · · , n}Z endowed with the σ-algebra F generated by the cylinder sets,
and product measure µ giving symbol i probability pi , where p1 + p2 + . . . +
pn = 1. Our aim is to calculate h(T ). To this end we need to find a partition
α which generates the σ-algebra F under the action of T . The natural choice
of α is what is known as the time-zero partition α = {A1 , . . . , An }, where
Ai := {x ∈ X : x0 = i} , i = 1, . . . , n .
Notice that for all m ∈ Z,
T −m Ai = {x ∈ X : xm = i} ,
and
Ai0 ∩ T −1 Ai1 ∩ · · · ∩ T −m Aim = {x ∈ X : x0 = i0 , . . . , xm = im } .

Calculation of Entropy and Examples 65
In other words, m −i
W
i=0 T α is precisely the collection of cylinders of length
m (i.e., the collection of all m-blocks), and these by definition generate F.
Hence, α is a generating partition, so that
m−1
!
1 _
−i
h(T ) = h(α, T ) = lim H T α .
m→∞ m
i=0
First notice that − since µ is product measure here − the partitions
α, T −1 α, · · · , T −(m−1) α
are all independent since each specifies a different coordinate, and so
H(α ∨ T −1 α ∨ · · · ∨ T −(m−1) α)
= H(α) + H(T −1 α) + · · · + H(T −(m−1) α)
X n
= mH(α) = −m pi log pi .
i=1
Thus,
n n
1 X X
h(T ) = lim (−m) pi log pi = − pi log pi .
m→∞ m
i=1 i=1
Exercise 4.3.1 Let T be the left shift on X = {1, 2, · · · , n}Z endowed with
the σ-algebra F generated by the cylinder sets, and the Markov measure µ
given by the stochastic matrix P = (pij ), and the probability vector π =
(π1 , . . . , πn ) with πP = π. Show that
n X
X n
h(T ) = − πi pij log pij
j=1 i=1
Exercise 4.3.2 Suppose (X1 , B1 , µ1 , T1 ) and (X2 , B2 , µ2 , T2 ) are two dynam-

ical systems. Show that
hµ1 ×µ2 (T1 × T2 ) = hµ1 (T1 ) + hµ2 (T2 ).

66 Entropy
4.4 The Shannon-McMillan-Breiman Theorem

In the previous sections we have considered only finite partitions on X, how-
ever all the definitions and results hold if we were to consider countable par-
titions of finite entropy. Before we state and prove the Shannon-McMillan-
Breiman Theorem, we need to introduce the information function associated
with a partition.
Let (X, F, µ) be a probability space, and α = {A1 , A2 , . . .} be a finite or
a countable partition of X into measurable sets. For each x ∈ X, let α(x)
be the element of α to which x belongs. Then, the information function
associated to α is defined to be
X
Iα (x) = − log µ(α(x)) = − 1A (x) log µ(A).
A∈α
For two finite or countable partitions α and β of X, we define the condi-

tional information function of α given β by

XX µ(A ∩ B)
Iα|β (x) = − 1(A∩B) (x) log .
B∈β A∈α
µ(B)
We claim that
X
Iα|β (x) = − log Eµ (1α(x) |σ(β)) = − 1A (x) log E(1A |σ(β)), (4.3)
A∈α
where σ(β) is the σ-algebra generated by the finite or countable partition β,

(see the remark following the proof of Theorem (2.1.1)). This follows from the
fact (which is easy to prove using the definition of conditional expectations)
that if β is finite or countable, then for any f ∈ L1 (µ), one has
Z
X 1
Eµ (f |σ(β)) = 1B f dµ.
B∈β
µ(B) B
R
Clearly, H(α|β) = X
Iα|β (x) dµ(x).
Exercise 4.4.1 Let α and β be finite or countable partitions of X. Show

that
Iα W β = Iα + Iβ|α .
The Shannon-McMillan-Breiman Theorem 67
Now suppose T : X → X is a measure preserving transformation on

(X, F, µ), and let α = {A1 , A2 , . . .} be any countable partition. Then T −1 =
{T −1 A1 , T −1 A2 , . . .} is also a countable partition. Since T is measure pre-
serving one has,
X X
IT −1 α (x) = − 1T −1 Ai (x) log µ(T −1 Ai ) = − 1Ai (T x) log µ(Ai ) = Iα (T x).
Ai ∈α Ai ∈α
Furthermore,
n Z
1 _
−i 1
lim H( T α) = lim IWni=0 T −i α (x) dµ(x) = h(α, T ).
n→∞ n + 1 n→∞ n + 1 X
i=0
The Shannon-McMillan-Breiman theorem says if T is ergodic and if α has

1 W
finite entropy, then in fact the integrand I n T −i α (x) converges a.e. to
n + 1 i=0
h(α, T ). Notice that the integrand can be written as
n
!
1 W 1 _
I n T −i α (x) = − log µ ( T −i α)(x) ,
n + 1 i=0 n+1 i=0
where ( ni=0 T −i α)(x) is the element of ni=0 T −i α containing x (often referred

W W
to as the α-cylinder of order n containing x). Before we proceed we need the
following proposition.
Proposition 4.4.1 Let α = {A1 , A2 , . . .} be a countable partition with finite

entropy. For each n = 1, 2, 3, . . ., let fn (x) = Iα| Wni=1 T −i α (x), and let f ∗ =
supn≥1 fn . Then, for each λ ≥ 0 and for each A ∈ α,
µ ({x ∈ A : f ∗ (x) > λ}) ≤ 2−λ .
Furthermore, f ∗ ∈ L1 (X, F, µ).
Proof Let t ≥ 0 and A ∈ α. For n ≥ 1, let

n
!
_
−i
fnA (x) = − log Eµ 1A | T α (x),
i=1
and
Bn = {x ∈ X : f1A (x) ≤ t, . . . , fn−1
A
(x) ≤ t, fnA (x) > t}.
68 Entropy
Wn for−i x ∈ A one
Notice that has fn (x) = fnAW(x), and for x ∈ Bn one has
Eµ (1A | i=1 T α) (x) < 2−t . Since Bn ∈ σ ( ni=1 T −i α), then
Z
µ(Bn ∩ A) = 1A (x) dµ(x)
Bn
Z n
!
_
= Eµ 1A | T −i α (x) dµ(x)
Bn i=1
Z
≤ 2−t dµ(x) = 2−t µ(Bn ).
Bn
Thus,
µ ({x ∈ A : f ∗ (x) > t}) = µ ({x ∈ A : fn (x) > t, for some n})
= µ {x ∈ A : fnA (x) > t, for some n}

= µ (∪∞
n=1 A ∩ Bn )
X∞
= µ (A ∩ Bn )
n=1
∞
X
−t
≤ 2 µ(Bn ) ≤ 2−t .
n=1
We now show that f ∗ ∈ L1 (X, F, µ). First notice that
µ ({x ∈ A : f ∗ (x) > t}) ≤ µ(A),
hence,
µ ({x ∈ A : f ∗ (x) > t}) ≤ min(µ(A), 2−t ).

Using Fubini’s Theorem, and the fact that f ∗ ≥ 0 one has

Z Z ∞
∗
f (x) dµ(x) = µ ({x ∈ X : f ∗ (x) > t}) dt
X
Z0 ∞ X
= µ ({x ∈ A : f ∗ (x) > t}) dt
0 A∈α
XZ ∞
= µ ({x ∈ A : f ∗ (x) > t}) dt
A∈α 0
XZ ∞
≤ min(µ(A), 2−t ) dt
A∈α 0
XZ − log µ(A) XZ ∞
= µ(A) dt + 2−t dt
A∈α 0 A∈α − log µ(A)
X X µ(A)
= − µ(A) log µ(A) +
A∈α A∈α
loge 2
1
= Hµ (α) + < ∞.
loge 2

So far we have defined the notion of the conditional entropy Iα|β when α
and β are countable partitions. We can generalize the definition to the case
α is a countable partition and G is a σ-algebra as follows (see equation (4.3)),
Iα|G (x) = − log Eµ (1α(x) |G).
If we denote by ∞
W −i
Wn −i
i=1 T α = σ(∪n i=1 T α), then
Iα| W∞
i=1 T
−i α (x) = lim Iα| n
W
i=1 T
−i α (x). (4.4)
n→∞
Exercise 4.4.2 Give a proof of equation (4.4) using the following important
theorem, known as the Martingale Convergence Theorem (and is stated to
our setting)
Theorem 4.4.1 (Martingale Convergence Theorem) Let C1 ⊆ C2 ⊆ · · · be a
sequence of increasing σalgebras, and let C = σ(∪n Cn ). If f ∈ L1 (µ), then
Eµ (f |C) = lim Eµ (f |Cn )
n→∞
µ a.e. and in L1 (µ).

70 Entropy
Exercise 4.4.3 Show that if T is measure preserving on the probability

space (X, F, µ) and f ∈ L1 (µ), then
f (T n x)
lim = 0, µ a.e.
n→∞ n
Theorem 4.4.2 (The Shannon-McMillan-Breiman Theorem) Suppose T is

an ergodic measure preserving transformation on a probability space (X, F, µ),
and let α be a countable partition with H(α) < ∞. Then,
1 W
lim I ni=0 T −i α (x) = h(α, T ) a.e.
n→∞ n + 1
Proof For each n = 1, 2, 3, . . ., let fn (x) = Iα| Wni=1 T −i α (x). Then,
IWni=0 T −i α (x) = IWni=1 T −i α (x) + Iα| Wni=1 T −i α (x)

= IWn−1
i=0 T
−i α (T x) + fn (x)
= IWn−1
i=1 T
−i α (T x) + Iα| n−1 T −i α (T x) + fn (x)
W
i=1
2
= IWn−2
i=0 T
−i α (T x) + fn−1 (T x) + fn (x)
..
.
= Iα (T n x) + f1 (T n−1 x) + . . . + fn−1 (T x) + fn (x).
Let f R(x) = Iα| W∞

i=1 T
−i α (x) = limn→∞ fn (x). Notice that f ∈ L1 (X, F, µ)
since X f (x) dµ(x) = h(α, T ). Now letting f0 = Iα , we have
n
1 W 1 X
I ni=0 T −i α (x) = fn−k (T k x)
n+1 n + 1 k=0
n n
1 X 1 X
= f (T k x) + (fn−k − f )(T k x).
n + 1 k=0 n + 1 k=0
By the ergodic theorem,

n
X Z
k
lim f (T x) = f (x) dµ(x) = h(α, T ) a.e.
n→∞ X
k=0
n
1 X
We now study the sequence { (fn−k − f )(T k x)}. Let
n + 1 k=0
FN = sup |fk − f |, and f ∗ = sup fn .

k≥N n≥1
Notice that 0 ≤ FN ≤ f ∗ +f , hence FN ∈ L1 (X, F, µ) and limN →∞ FN (x) = 0

a.e. Also for any k, |fn−k − f | ≤ f ∗ + f , so that |fn−k − f | ∈ L1 (X, F, µ) and
lim n → ∞|fn−k − f | = 0 a.e.
For any N ≤ n,
n n−N
1 X 1 X
|fn−k − f |(T k x) = |fn−k − f |(T k x)
n + 1 k=0 n + 1 k=0
n
1 X
+ |fn−k − f |(T k x)
n + 1 k=n−N +1
n−N
1 X
≤ FN (T k x)
n + 1 k=0
N −1
1 X
+ |fk − f |(T n−k x).
n + 1 k=0
If we take the limit as n → ∞, then by exercise (4.4.3) the second term tends
to 0 a.e., and by the ergodic theorem as well as the dominated convergence
theorem, the first term tends to zero a.e. Hence,
1 W
lim I ni=0 T −i α (x) = h(α, T ) a.e.
n→∞ n + 1

The above theorem Wn can−ibe interpreted as providing an estimate of the size
of
Wnthe −iatoms of i=0 T α. For n sufficiently large, a typical element A ∈
i=0 T α satisfies
1
− log µ(A) ≈ h(α, T )
n+1
or
µ(An ) ≈ 2−(n+1)h(α,T ) .
Furthermore, if α is a generating partition (i.e. ∞ −i
W
i=0 T α = F, then in the
conclusion of Shannon-McMillan-Breiman Theorem one can replace h(α, T )
by h(T ).
72 Entropy
4.5 Lochs’ Theorem

In 1964, G. Lochs compared the decimal and the continued fraction expan-
sions. Let x ∈ [0, 1) be an irrational number, and suppose x = .d1 d2 · · · is the
decimal expansion of x (which is generated by iterating the map Sx = 10x
(mod 1)). Suppose further that
1
x = = [0; a1 , a2 , · · · ] (4.5)
1
a1 +
1
a2 +
1
a3 +
...
is its regular continued fraction (RCF) expansion (generated by the map

T x = x1 − b x1 c). Let y = .d1 d2 · · · dn be the rational number determined by
the first n decimal digits of x, and let z = y + 10−n . Then, [y, z) is the
decimal cylinder of order n containing x, which we also denote by Bn (x).
Now let
1
y=
1
b1 +
. 1
b2 + . . +
bl
and
1
z=
1
c1 +
. 1
c2 + . . +
ck
be the continued fraction expansion of y and z. Let
m (n, x) = max {i ≤ min (l, k) : for all j ≤ i, bj = cj } . (4.6)
In other words, if Bn (x) denotes the decimal cylinder consisting of all points
y in [0, 1) such that the first n decimal digits of y agree with those of x,
and if Cj (x) denotes the continued fraction cylinder of order j containing
x, i.e., Cj (x) is the set of all points in [0, 1) such that the first j digits in
their continued fraction expansion is the same as that of x, then m(n, x) is
the largest integer such that Bn (x) ⊂ Cm(n,x) (x). Lochs proved the following
theorem:
Lochs’ Theorem 73
Theorem 4.5.1 Let λ denote Lebesgue measure on [0, 1). Then for a.e.
x ∈ [0, 1)
m(n, x) 6 log 2 log 10
lim = .
n→∞ n π2
In this section, we will prove a generalization of Lochs’ theorem that

allows one to compare any two known expansions of numbers. We show
that Lochs’ theorem is true for any two sequences of interval partitions on
[0, 1) satisfying the conclusion of Shannon-McMillan-Breiman theorem. The
content of this section as well as the proofs can be found in [DF]. We begin
with few definitions that will be used in the arguments to follow.
Definition 4.5.1 By an interval partition we mean a finite or countable

partition of [0, 1) into subintervals. If P is an interval partition and x ∈ [0, 1),
we let P (x) denote the interval of P containing x.
Let P = {Pn }∞ n=1 be a sequence of interval partitions. Let λ denote

Lebesgue probability measure on [0, 1).
Definition 4.5.2 Let c ≥ 0. We say that P has entropy c a.e. with respect
to λ if
log λ (Pn (x))
− → c a.e.
n
Note that we do not assume that each Pn is refined by Pn+1 .

Suppose that P = {Pn }∞ ∞
n=1 and Q = {Qn }n=1 are sequences of interval
partitions. For each n ∈ N and x ∈ [0, 1), define
mP,Q (n, x) = sup {m | Pn (x) ⊂ Qm (x)} .
Theorem 4.5.2 Let P = {Pn }∞ ∞

n=1 and Q = {Qn }n=1 be sequences of interval
partitions and λ Lebesgue probability measure on [0, 1). Suppose that for some
constants c > 0 and d > 0, P has entropy c a.e with respect to λ and Q has
entropy d a.e. with respect to λ. Then
mP,Q (n, x) c
→ a.e.
n d
74 Entropy
Proof First we show that

mP,Q (n, x) c
lim supn→∞ ≤ a.e.
n d
Fix ε > 0. Let x ∈ [0, 1) be a point at which the convergence conditions of
c+η
the hypotheses are met. Fix η > 0 so that c < 1 + ε. Choose N so
c− η
d
that for all n ≥ N
λ (Pn (x)) > 2−n(c+η)
and
λ (Qn (x)) < 2−n(d−η) .
n c o
Fix n so that min n, n ≥ N , and let m0 denote any integer greater than
d
c
(1 + ε) n. By the choice of η,
d
λ (Pn (x)) > λ (Qm0 (x))
so that Pn (x) is not contained in Qm0 (x). Therefore

c
mP,Q (n, x) ≤ (1 + ε) n
d
and so
mP,Q (n, x) c
lim supn→∞ ≤ (1 + ε) a.e.
n d
Since ε > 0 was arbitrary, we have the desired result.
Now we show that
mP,Q (n, x) c
lim inf n→∞ ≥ a.e.
n d
c
Fix ε ∈ (0, 1). Choose η > 0 so that ζ =: εc − η 1 + (1 − ε) > 0. For
d
j c k
each n ∈ N let m̄ (n) = (1 − ε) n . For brevity, for each n ∈ N we call an
d
element of Pn (respectively Qn ) (n, η) −good if
λ (Pn (x)) < 2−n(c−η)
(respectively
λ (Qn (x)) > 2−n(d+η) .)
Lochs’ Theorem 75
For each n ∈ N, let

Pn (x) is (n, η) − good and Qm̄(n) (x) is (m̄ (n) , η) − good
Dn (η) = x : .
and Pn (x) " Qm̄(n) (x)
If x ∈ Dn (η), then Pn (x) contains an endpoint of the (m̄ (n) , η) −good
interval Qm̄(n) (x). By the definition of Dn (η) and m̄ (n),
λ (Pn (x))
< 2−nζ .
λ Qm̄(n) (x)
Since no more than one atom of Pn can contain a particular endpoint of an
atom of Qm̄(n) , we see that λ (Dn (η)) < 2 · 2−nζ and so
∞
X
λ (Dn (η)) < ∞,
n=1
which implies that

λ {x | x ∈ Dn (η) i.o.} = 0.
Since m̄ (n) goes to infinity as n does, we have shown that for almost every
x ∈ [0, 1), there exists N ∈ N, so that for all n ≥ N , Pn (x) is (n, η) −good
and Qm̄(n) (x) is (m̄ (n) , η) −good and x ∈ / Dn (η). In other words, for al-
most every x ∈ [0, 1), there exists N ∈ N, so that for all n ≥ N , Pn (x)
is (n, η) −good and Qm̄(n) (x) is (m̄ (n) , η) −good and Pn (x) ⊂ Qm̄(n) (x).
Thus, for almost every x ∈ [0, 1), there exists N ∈ N, so that for all n ≥ N ,
mP,Q (n, x) ≥ m̄ (n), so that
mP,Q (n, x) c
≥ b(1 − ε) c.
n d
This proves that
mP,Q (n, x) c
lim inf n→∞ ≥ (1 − ε) a.e.
n d
Since ε > 0 was arbitrary, we have established the theorem.
The above result allows us to compare any two well-known expansions
of numbers. Since the commonly used expansions are usually performed for
points in the unit interval, our underlying space will be ([0, 1), B, λ), where
B is the Lebesgue σ-algebra, and λ the Lebesgue measure. The expansions
we have in mind share the following two properties.
76 Entropy
Definition 4.5.3 A surjective map T : [0, 1) → [0, 1) is called a number

theoretic fibered map (NTFM) if it satisfies the following conditions:
(a) there exists a finite or countable partition of intervals α = {Ai ; i ∈ D}

such that T restricted to each atom of α (cylinder set of order 0) is
monotone, continuous and injective. Furthermore, α is a generating
partition.
(b) T is ergodic with respect to Lebesgue measure λ, and there exists a T

invariant probability measure µ equivalent to λ with bounded density.
dµ dλ
(Both and are bounded, and µ(A) = 0 if and only if λ(A) = 0
dλ dµ
for all Lebesgue sets A.).
Let T be an NTFM with corresponding partition α, and T -invariant measure

µ equivalent to λ. Let L, M > 0 be such that
Lλ(A) ≤ µ(A) < M λ(A)
for all Lebesgue sets A (property (b)). For n ≥ 1, let Pn = n−1 −i

W
i=0 T α, then
by property (a), Pn is an interval partition. If Hµ (α) < ∞, then Shannon-
McMillan-Breiman Theorem gives
log µ(Pn (x))

lim − = hµ (T ) a.e. with respect to µ.
n→∞ n
Exercise 4.5.1 Show that the conclusion of the Shannon-McMillan-Breiman
Theorem holds if we replace µ by λ, i.e.
log λ(Pn (x))

lim − = hµ (T ) a.e. with respect to λ.
n→∞ n
Iterations of T generate expansions of points x ∈ [0, 1) with digits in D.
We refer to the resulting expansion as the T -expansion of x.
Almost all known expansions on [0, 1) are generated by a NTFM. Among
them are the n-adic expansions (T x = nx (mod 1), where n is a positive
integer), β expansions (T x = βx (mod 1), where β > 1 is a real number),
continued fraction expansions (T x = x1 − b x1 c), and many others (see the
book Ergodic Theory of Numbers).
Lochs’ Theorem 77
Exercise 4.5.2 Prove Theorem (4.5.1) using Theorem (4.5.2). Use the fact
that the continued fraction map T is ergodic with respect to Gauss measure
µ, given by Z
1 1
µ(B) = dx,
B log 2 1 + x
π2
and has entropy equal to hµ (T ) = .
6 log 2
Exercise 4.5.3 Reformulate and prove Lochs’ Theorem for any two NTFM
maps S and T on [0, 1).
78 Entropy
Chapter 5
Hurewicz Ergodic Theorem
In this chapter we consider a class of non-measure preserving transformations.

In particular, we study invertible, non-singular and conservative transforma-
tions on a probability space. We first start with a quick review of equivalent
measures, we then define non-singular and conservative transformations, and
state some of their properties. We end this section by giving a new proof
of Hurewicz Ergodic Theorem, which is a generalization of Birkhoff Ergodic
Theorem to non-singular conservative transformations.
5.1 Equivalent measures

Recall that two measures µ and ν on a measure space (Y, F) are equivalent
if µ and ν have the same null-sets, i.e.,
µ(A) = 0 ⇔ ν(A) = 0, A ∈ F.
The theorem of Radon-Nikodym says that if µ, ν are σ-finite and equiv-

alent, then there exist measurable functions f, g ≥ 0, such that
Z Z
µ(A) = f dν and ν(A) = g dµ.
A A
Furthermore, for all h ∈ L1 (µ) (or L1 (ν)),

Z Z Z Z
h dµ = hf dν and h dν = hg dµ.
79
80 Hurewicz Ergodic Theorem
dµ dν
Usually the function f is denoted by and the function g by .
dν dµ
Now suppose that (X, B, µ) is a probability space, and T : X → X a
measurable transformation. One can define a new measure µ ◦ T −1 on (X, B)
by µ ◦ T −1 (A) = µ(T −1 A) for A ∈ B. It is not hard to prove that for
f ∈ L1 (µ), Z Z
f d(µ ◦ T −1 ) = f ◦ T dµ (5.1)
Exercise 5.1.1 Starting with indicator functions, give a proof of (5.1).
Note that if T is invertible, then one has that

Z Z
f d(µ ◦ T ) = f ◦ T −1 dµ (5.2)
5.2 Non-singular and conservative transfor-

mations
Definition 5.2.1 Let (X, B, µ) be a probability space and T : X → X an
invertible measurable function. T is said to be non-singular if for any A ∈ B,
µ(A) = 0 if and only if µ(T −1 A) = 0.
Since T is invertible, non-singularity implies that
µ(A) = 0 if and only if µ(T n A) = 0, n 6= 0.
This implies that the measures µ ◦ T n defined by µ ◦ T n (A) = µ(T n A) is

equivalent to µ (and hence equivalent to each other). By the theorem of
Radon-Nikodym, there exists for each n 6= 0, a non-negative measurable
dµ ◦ T n
function ωn (x) = (x) such that
dµ
Z
n
µ(T A) = ωn (x) dµ(x).
A
We have the following propositions.

Non-singular and conservative transformations 81
Proposition 5.2.1 Suppose (X, B, µ) is a probability space, and T : X → X

is invertible and non-singular. Then for every f ∈ L1 (µ),
Z Z Z
f (x) dµ(x) = f (T x)ω1 (x) dµ(x) = f (T n x)ωn (x) dµ(x).
X X X
Proof We show the result for indicator functions only, the rest of the proof
is left to the reader.
Z
1A (x) dµ(x) = µ(A) = µ(T (T −1 A))
X
Z
= ω1 (x) dµ(x)
T −1 A
Z
= 1A (T x)ω1 (x) dµ(x).
X
Proposition 5.2.2 Under the assumptions of Proposition 5.2.1, one has for
all n, m ≥ 1, that
ωn+m (x) = ωn (x)ωm (T n x), µ a.e.
Proof For any A ∈ B,

Z Z
n
ωn (x)ωm (T x)dµ(x) = 1A (x)ωm (T n x)d(µ ◦ T n )(x)
A ZX
= 1A (T −n x)ωm (x)dµ(x)
ZX
= 1T n A (x)d(µ ◦ T m )(x)
X Z
m n m+n
= µ ◦ T (T A) = µ(T A) = ωn+m (x)dµ(x).
A
Hence, ωn+m (x) = ωn (x)ωm (T n x), µ a.e.

Exercise 5.2.1 Let (X, B, µ) be a probability space, and T : X → X an
invertible P
non-singular transformation. For any measurable function f , set
n−1 i
fn (x) = i=0 f (T x)ωi (x), n ≥ 1, where ω0 (x) = 1. Show that for all
n, m ≥ 1,
fn+m (x) = fn (x) + ωn (x)fm (T n x).
Definition 5.2.2 Let (X, B, µ) be a probability space, and T : X → X a

measurable transformation. We say that T is conservative if for any A ∈ B
with µ(A) > 0, there exists an n ≥ 1 such that µ(A ∩ T −n A) > 0.
Note that if T is invertible, non-singular and conservative , then T −1 is also

non-singular and conservative. In this case, for any A ∈ B with µ(A) > 0,
there exists an n 6= 0 such that µ(A ∩ T n A) > 0.
Proposition 5.2.3 Suppose T is invertible, non-singular and conservative

on the probability space (X, B, µ), and let A ∈ B with µ(A) > 0. Then for µ
a.e. x ∈ A there exist infinitely many positive and negative integers n, such
that T n x ∈ A.
Proof Let B = {x ∈ A : T n x ∈ / A for all n ≥ 1}. Note that for any n ≥ 1,

−n
B ∩ T B = ∅. If µ(B) > 0, then by conservativity there exists an n ≥ 1,
such that µ(B∩T −n B) is positive, which is a contradiction. Hence, µ(B) = 0,
and by non-singularity we have µ(T −n B) = 0 for all n ≥ 1.
n
S∞ let−nC = {x ∈ A; T x ∈ A for only finitely many n ≥ 1}, then
Now,
C ⊂ n=1 T B, implying that
∞
X
µ(C) ≤ µ(T −n B) = 0.
n=1
Therefore, for almost every x ∈ A there exist infinitely many n ≥ 1 such that
T n x ∈ A. Replacing T by T −1 yields the result for n ≤ −1.
Proposition 5.2.4 Suppose T is invertible, non-singular and conservative,

then ∞
X
ωn (x) = ∞, µ a.e.
n=1
P∞
Proof Let A = {x ∈ X : n=1 ωn (x) < ∞}. Note that
∞
[ ∞
X
A= {x ∈ X : ωn (x) < M }.
M =1 n=1
If µ(A) > 0, then there exists an M ≥ 1 such that the set

∞
X
B = {x ∈ X : ωn (x) < M }
n=1
Hurewicz Ergodic Theorem 83
has positive measure. Then, B ∞

R P
n=1 ωn (x) dµ(x) < M µ(B) < ∞. However,
Z X ∞ ∞ Z
X
ωn (x) dµ(x) = ωn (x) dµ(x)
B n=1 n=1 B
X∞
= µ(T n B)
n=1
∞ Z
X
= 1T n B (x) dµ(x)
n=1 X
∞
Z X
= 1B (T −n x) dµ(x).
X n=1
R P∞
Hence, X n=1 1B (T −n x) dµ(x) < ∞, which implies that
∞
X
1B (T −n x) < ∞ µ a.e.
n=1
Therefore, for µ a.e. x one has T −n x ∈ B for only finitely many n ≥ 1,

contradicting Proposition 5.2.3. Thus µ(A) = 0, and
∞
X
ωn (x) = ∞, µ a.e.
n=1
5.3 Hurewicz Ergodic Theorem

The following theorem by Hurewicz is a generalization of Birkhoff’s Ergodic
Theorem to our setting; see also Hurewicz’ original paper [H]. We give a new
prove, similar to the proof for Birkhoff’s Theorem; see Section 2.1 and [KK].
Theorem 5.3.1 Let (X, B, µ) be a probability space, and T : X → X an

invertible, non-singular and conservative transformation. If f ∈ L1 (µ), then
n−1
X
f (T i x)ωi (x)
i=0
lim n−1
= f∗ (x)
n→∞ X
ωi (x)
i=0
exists µ a.e. Furthermore, f∗ is T -invariant and

Z Z
f (x) dµ(x) = f∗ (x) dµ(x).
X X
Proof Assume with no loss of generality that f ≥ 0 (otherwise we write

f = f + − f − , and we consider each part separately). Let
fn (x) = f (x) + f (T x)ω1 (x) + · · · + f (T n−1 x)ωn−1 (x),
gn (x) = ω0 (x) + ω1 (x) + · · · + ωn−1 (x), ω0 (x) = g0 (x) = 1,
fn (x) fn (x)
f (x) = lim sup n−1
= lim sup ,
n→∞ X n→∞ gn (x)
ωi (x)
i=0
and
fn (x) fn (x)
f (x) = lim inf n−1
= lim inf .
n→∞ X n→∞ gn (x)
ωi (x)
i=0
By Proposition (5.2.2), one has gn+m (x) = gn (x) + gm (T n x). Using Exercise
(5.2.1) and Proposition (5.2.4), we will show that f and f are T -invariant.
To this end,
fn (T x)
f (T x) = lim sup n
n→∞ gn (T x)
fn+1 (x) − f (x)
ω1 (x)
= lim sup
n→∞ gn+1 (x) − g(x)
ω1 (x)
fn+1 (x) − f (x)
= lim sup
n→∞ gn+1 (x) − g(x)

fn+1 (x) gn+1 (x) f (x)
= lim sup · −
n→∞ gn+1 (x) gn+1 (x) − g(x) gn+1 (x) − g(x)
fn+1 (x)
= lim sup
n→∞ gn+1 (x)
= f (x).
(Similarly f is T -invariant).
Now, to prove that f∗ exists, is integrable and T -invariant, it is enough
to show that Z Z Z
f dµ ≥ f dµ ≥ f dµ.
X X X
For since f − f ≥ 0, this would imply that f = f = f∗ . a.e.

R R
We first prove that X f dµ ≤ X f dµ. Fix any 0 < < 1, and let L > 0 be
any real number. By definition of f , for any x ∈ X, there exists an integer
m > 0 such that
fm (x)
≥ min(f (x), L)(1 − ).
gm (x)
Now, for any δ > 0 there exists an integer M > 0 such that the set
X0 = {x ∈ X : ∃ 1 ≤ m ≤ M with fm (x) ≥ gm (x) min(f (x), L)(1 − )}
has measure at least 1 − δ. Define F on X by

f (x) x ∈ X0
F (x) =
L x∈/ X0 .
Notice that f ≤ F (why?). For any x ∈ X, let an = an (x) = F (T n x)ωn (x),

and bn = bn (x) = min(f (x), L)(1 − )ωn (x). We now show that {an } and
{bn } satisfy the hypothesis of Lemma 2.1.1 with M > 0 as above. For any
n = 0, 1, 2, . . .
–if T n x ∈ X0 , then there exists 1 ≤ m ≤ M such that
fm (T n x) ≥ min(f (x), L)(1 − )gm (T n x).
Hence,
ωn (x)fm (T n x) ≥ min(f (x), L)(1 − )gm (T n x)ωn (x).
Now,
bn + . . . + bn+m−1 = min(f (x), L)(1 − )gm (T n x)ωn (x)

≤ ωn (x)fm (T n x)
= f (T n x)ωn (x) + f (T n+1 x)ωn+1 (x) + · · · + f (T n+m−1 x)ωn+m−1 (x)
≤ F (T n x)ωn (x) + F (T n+1 x)ωn+1 (x) + · · · + F (T n+m−1 x)ωn+m−1 (x)
= an + an+1 + · · · + an+m−1 .
–If T n x ∈
/ X0 , then take m = 1 since
an = F (T n x)ωn (x) = Lωn (x) ≥ min(f (x), L)(1 − )ωn (x) = bn .
Hence by T -invariance of f , and Lemma 2.1.1 for all integers N > M one
has
F (x)+F (T x)+ω1 (x)+· · ·+ωN −1 (x)F (T N −1 x) ≥ min(f (x), L)(1−)gN −M (x).
Integrating both sides, and using Proposition (5.2.1) together with the T -
invariance of f one gets
Z Z
N F (x) dµ(x) ≥ min(f (x), L)(1 − )gN −M (x) dµ(x)
X X
Z
= (N − M ) min(f (x), L)(1 − ) dµ(x).
X
Since Z Z
F (x) dµ(x) = f (x) dµ(x) + Lµ(X \ X0 ),
X X0
one has
Z Z
f (x) dµ(x) ≥ f (x) dµ(x)
X X0
Z
= F (x) dµ(x) − Lµ(X \ X0 )
X
(N − M )
Z
≥ min(f (x), L)(1 − ) dµ(x) − Lδ.
N X
Now letting first N → ∞, then δ → 0, then → 0, and lastly L → ∞ one

gets together with the monotone convergence theorem that f is integrable,
and Z Z
f (x) dµ(x) ≥ f (x) dµ(x).
X X
We now prove that

Z Z
f (x) dµ(x) ≤ f (x) dµ(x).
X X
Fix > 0, for any x ∈ X there exists an integer m such that

fm (x)
≤ (f (x) + ).
gm (x)
For any δ > 0 there exists an integer M > 0 such that the set
Y0 = {x ∈ X : ∃ 1 ≤ m ≤ M with fm (x) ≤ (f (x) + )gm (x)}
has measure at least 1 − δ. Define G on X by

f (x) x ∈ Y0
G(x) =
0 x∈/ Y0 .
Notice that G ≤ f . Let bn = G(T n x)ωn (x), and an = (f (x) + )ωn (x). We
now check that the sequences {an } and {bn } satisfy the hypothesis of Lemma
2.1.1 with M > 0 as above.
–if T n x ∈ Y0 , then there exists 1 ≤ m ≤ M such that
fm (T n x) ≤ (f (x) + )gm (T n x).
Hence,
ωn (x)fm (T n x) ≤ (f (x)+)gm (T n x)ωn (x) = (f (x)+)(ωn (x)+· · ·+ωn+m−1 (x).
By Proposition (5.2.2), and the fact that f ≥ G, one gets
bn + . . . + bn+m−1 = G(T n x)ωn (x) + · · · + G(T n+m−1 x)ωn+m−1 (x)

≤ f (T n x)ωn (x) + · · · + f (T n+m−1 x)ωn+m−1 (x)
= ωn (x)fm (T n x)
≤ (f (x) + )(ωn (x) + · · · + ωn+m+1 (x))
= an + · · · an+m−1 .
–If T n x ∈
/ Y0 , then take m = 1 since
bn = G(T n x)ωn (x) = 0 ≤ (f (x) + )(ωn (x)) = an .
Hence by Lemma 2.1.1 one has for all integers N > M
G(x) + G(T x)ω1 (x) + . . . + G(T N −M −1 x)ωN −M −1 (x) ≤ (f (x) + )gN (x).
Integrating both sides yields

Z Z
(N − M ) G(x)dµ(x) ≤ N ( f (x)dµ(x) + ).
X X
R
Since f ≥ 0, the measure ν defined by ν(A) = A f (x) dµ(x) is absolutely
continuous with respect to the measure µ. Hence, there exists δ0 > 0 such
that
R if µ(A) < δ, then ν(A) < δ0 . Since µ(X \ Y0 ) < δ, then ν(X \ Y0 ) =
X\Y0
f (x)dµ(x) < δ0 . Hence,
Z Z Z
f (x) dµ(x) = G(x) dµ(x) + f (x) dµ(x)
X X X\Y0
Z
N
≤ (f (x) + ) dµ(x) + δ0 .
N −M X
Now, let first N → ∞, then δ → 0 (and hence δ0 → 0), and finally → 0,

one gets Z Z
f (x) dµ(x) ≤ f (x) dµ(x).
X X
This shows that Z Z Z
f dµ ≥ f dµ ≥ f dµ,
X X X
hence, f = f = f∗ a.e., and f∗ is T -invariant.
Remark We can extend the notion of ergodicity to our setting. If T is non-

singular and conservative, we say that T is ergodic if for any measurable set
A satisfying µ(A∆T −1 A) = 0, one has µ(A) = 0 or 1. It is easy to check that
the proof of Proposition (1.7.1) holds in this case, so that T ergodic implies
that each T -invariant function is a constant µ a.e. Hence, if T is invertible,
non-singular, conservative and ergodic, then by Hurewicz Ergodic Theorem
one has for any f ∈ L1 (µ),
n−1
X
f (T i x)ωi (x) Z
i=0
lim n−1
= f dµ µ a.e.
n→∞ X X
ωi (x)
i=0
Chapter 6
Invariant Measures for

Continuous Transformations
6.1 Existence
Suppose X is a compact metric space, and let B be the Borel σ-algebra i.e.,
the σ-algebra generated by the open sets. Let M (X) be the collection of all
Borel probability measures on X. There is natural embedding of the space
X in M (X) via the map x → δx , where δX is the Dirac measure concentrated
at x (δx (A) = 1 if x ∈ A, and is zero otherwise). Furthermore, M (X) is a
convex set, i.e., pµ + (1 − p)ν ∈ M (X) whenever µ, ν ∈ M (X) and 0 ≤ p ≤ 1.
Theorem 6.1.2 below shows that a member of M (X) is determined by how
it integrates continuous functions. We denote by C(X) the Banach space of
all complex valued continuous functions on X under the supremum norm.
Theorem 6.1.1 Every member of M (X) is regular, i.e., for all B ∈ B and
every > 0 there exist an open set U and a closed sed C such that C ⊆
B ⊆ U such that µ(U \ C ) < .
Idea of proof Call a set B ∈ B with the above property a regular set. Let
R = {B ∈ B : B is regular }. Show that R is a σ-algebra containing all the
closed sets.
Corollary 6.1.1 For any B ∈ B, and any µ ∈ M (X),
µ(B) = sup µ(C) = inf µ(U ).
B⊆U :U open
C⊆B:C closed
89
90 Invariant measures for continuous transformations
Theorem 6.1.2 Let µ, m ∈ M (X). If

Z Z
f dµ = f dm
X X
for all f ∈ C(X), then µ = m.
Proof From the above corollary, it suffices to show that µ(C) = m(C) for
all closed subsets C of X. Let > 0, by regularity of the measure m there
exists an open set U such that C ⊆ U and m(U \ C) < . Define f ∈ C(X)
as follows
(
0 x∈/ U
f (x) = d(x,X\U )
d(x,X\U )+d(x,C)
x ∈ U .
Notice that 1C ≤ f ≤ 1U , thus

Z Z
µ(C) ≤ f dµ = f dm ≤ m(U ) ≤ m(C) + .
X X
Using a similar argument, one can show that m(C) ≤ µ(C) + . Therefore,
µ(C) = m(C) for all closed sets, and hence for all Borel sets.
This allows us to define a metric structure on M (X) as follows. A sequence
{µn } in M (X) converges to µ ∈ M (X) if and only if
Z Z
lim f (x) dµn (x) = f (x) dµ(x)
n→∞ X X
for all f ∈ C(X). We will show that under this notion of convergence
the space M (X) is compact, but first we need The Riesz Representation
Representation Theorem.
Theorem 6.1.3 (The Riesz Representation Theorem) Let X be a compact

metric space and J : C(X) → C a continuous linear map such that J is a
positive Roperator and J(1) = 1. Then there exists a µ ∈ M (X) such that
J(f ) = X f (x) dµ(x).
Theorem 6.1.4 The space M (X) is compact.

Existence 91
Idea of proof Let {µn } be a sequence in R M (X). Choose a countable dense

subset of {fn } of C(X). The sequence { X f1 dµn } is a bounded sequence of
R (1)
complex numbers, hence has a convergent subsequence { X f1 dµn }. Now,
R (1)
the sequence { X f2 dµn } is bounded, and hence has a convergent sub-
R (2) R (2)
sequence { X f2 dµn }. Notice that { X f1 dµn } is also convergent. We
(i)
continue in this manner, to get for each i a subsequence {µn } of {µn }
(i) (j) R (i)
such that for all j ≤ i, {µn } is a subsequence of {µn } and { X fj dµn }
(n) R (n)
converges. Consider the diagonal sequence {µn }, then { X fj dµn } con-
R (n)
verges for all j, and hence { X f dµn } converges for all f ∈ C(X). Now
R (n)
define J : C(X) → C by J(f ) = limn→∞ { X f dµn }. Then, J is lin-
ear, continuous (|J(f )| ≤ supx∈X |f (x)|), positive and J(1) = 1. Thus,
by Riesz Representation Theorem, there exists a µ ∈ M (X) such that
R (n) R (n)
J(f ) = limn→∞ { X f dµn } = X f dµ. Therefore, limn→∞ µn = µ, and
M (X) is compact.
Let T : X → X be a continuous transformation. Since B is generated
by the open sets, then T is measurable with respect to B. Furthermore, T
induces in a natural way, an operator T : M (X) → M (X) given by
(T µ)(A) = µ(T −1 A)
i
for all A ∈ B. Then T µ(A) = µ(T −i A). Using a standard argument, one
can easily show that
Z Z
f (x) d(T µ)(x) = f (T x)dµ(x)
X X
for all continuous functions f on X. Note that T is measure preserving with

respect to µ ∈ M (X) if and only if T µ = µ. Equivalently, µ is measure
preserving if and only if
Z Z
f (x) dµ(x) = f (T x) dµ(x)
X X
for all continuous functions f on X. Let

M (X, T ) = {µ ∈ M (X) : T µ = µ}
be the collection of all probability measures under which T is measure pre-
serving.
Theorem 6.1.5 Let T : X → X be continuous, and {σn } a sequence in

M (X). Define a sequence {µn } in M (X) by
n−1
1X i
µn = T σn .
n i=0
Then, any limit point µ of {µn } is a member of M (X, T ).
Proof
R We need toR show that for any continuous function f on X, one
has X f (x) dµ(x) = X f (T x) dµ. Since M (X) is compact there exists a
µ ∈ M (X) and a subsequence {µni } such that µni → µ in M (X). Now for
any f continuous, we have
Z Z Z Z

f (T x) dµ(x) − f (x) dµ(x) = lim f (T x) dµn (x) − f (x) dµ n (x)
j j

X X
j→∞ X X

Z nj −1
1 X
f (T i+1 x) − f (T i x) dσnj (x)

= lim

j→∞ nj X
Z i=0
1
= lim (f (T nj x) − f (x)) dσnj (x)
j→∞ nj X
2 supx∈X |f (x)|
≤ lim = 0.
j→∞ nj
Therefore µ ∈ M (X, T ).
Theorem 6.1.6 Let T be a continuous transformation on a compact metric

space. Then,
(i) M (X, T ) is a compact convex subset of M (X).
(ii) µ ∈ M (X, T ) is an extreme point (i.e. µ cannot be written in a non-

trivial way as a convex combination of elements of M (X, T )) if and
only if T is ergodic with respect to µ.
Proof (i) Clearly M (X, T ) is convex. Now let {µn } be a sequence in

M (X, T ) converging to µ in M (X). We need to show that µ ∈ M (X, T ).
Existence 93
Since T is continuous, then for any continuous function f on X, the function

f ◦ T is also continuous. Hence,
Z Z
f (T x) dµ(x) = lim f (T x) dµn (x)
X n→∞ X
Z
= lim f (x) dµn (x)
n→∞ X
Z
= f (x) dµ(x).
X
Therefore, T is measure preserving with respect to µ, and µ ∈ M (X, T ).

(ii) Suppose T is ergodic with respect to µ, and assume that
µ = pµ1 + (1 − p)µ2
for some µ1 , µ2 ∈ M (X, T ), and some 0 < p ≤ 1. We will show that µ = µ1 .
Notice that the measure µ1 is absolutely continuous with respect to µ, and
T is ergodic with respect to µ, hence by Theorem (2.1.2) we have µ1 = µ.
Conversely, (we prove the contrapositive) suppose that T is not ergodic with
respect to µ. Then there exists a measurable set E such that T −1 E = E,
and 0 < µ(E) < 1. Define measures µ1 , µ2 on X by
µ(B ∩ E) µ (B ∩ (X \ E))
µ1 (B) = and µ1 (B) = .
µ(E) µ(X \ E)
Since E and X \ E are T -invariant sets, then µ1 , µ2 ∈ M (X, T ), and µ1 6= µ2
since µ1 (E) = 1 while µ2 (E) = 0. Furthermore, for any measurable set B,
µ(B) = µ(E)µ1 (B) + (1 − µ(E))µ2 (B),
i.e. µ1 is a non-trivial convex combination of elements of M (X, T ). Thus, µ
is not an extreme point of M (X, T ).
Since the Banach space C(X) of all continuous functions on X (under
the sup norm) is separable i.e. C(X) has a countable dense subset, one gets
the following strengthening of the Ergodic Theorem.
Theorem 6.1.7 If T : X → X is continuous and µ ∈ M (X, T ) is ergodic,
then there exists a measurable set Y such that µ(Y ) = 1, and
n−1 Z
1X i
lim f (T x) = f (x) dµ(x)
n→∞ n X
i=0
for all x ∈ Y , and f ∈ C(X).

Proof Choose a countable dense subset {fk } in C(X). By the Ergodic

Theorem, for each k there exists a subset Xk with µ(Xk ) = 1 and
n−1 Z
1X
lim fk (T i x) = fk (x) dµ(x)
n→∞ n X
i=0
T∞
for all x ∈ Xk . Let Y = k=1 Xk , then µ(Y ) = 1, and
n−1 Z
1X
lim fk (T i x) = fk (x)dµ(x)
n→∞ n X
i=0
for all k and all x ∈ Y . Now, let f ∈ C(X), then there exists a subse-
quence {fkj } converging to f in the supremum norm, and hence is uniformly
convergent. For any x ∈ Y , using uniform convergence and the dominated
convergence theorem, one gets
n−1 n−1
1X 1X
lim f (T i x) = lim lim fkj (T i x)
n→∞ n n→∞ j→∞ n
i=0 i=0
n−1
1X
= lim lim fkj (T i x)
j→∞ n→∞ n
i=0
Z Z
= lim fkj dµ = f dµ.
j→∞ X X
Theorem 6.1.8 Let T : X → X be continuous, and µ ∈ M (X, T ). Then T

is ergodic with respect to µ if and only if
n−1
1X
δT i x → µ a.e.
n i=0
Proof Suppose T is ergodic with respect to µ. Notice that for any f ∈

C(X), Z
f (y) d(δT i x )(y) = f (T i x),
X
Existence 95
Hence by theorem 6.1.7, there exists a measurable set Y with µ(Y ) = 1 such
that
n−1 Z n−1 Z
1X 1X
lim f (y) d(δT i x )(y) = lim f (T i x) = f (y) dµ(y)
n→∞ n X n→∞ n i=0 X
i=0
1
Pn−1
for all x ∈ Y , and f ∈ C(X). Thus, n i=0 δT i x → µ for all x ∈ Y .
Conversely, suppose n1 n−1

P
i=0 δT i x → µ for all x ∈ Y , where µ(Y ) = 1. Then
for any f ∈ C(X) and any g ∈ L1 (X, B, µ) one has
n−1 Z
1X
lim f (T i x)g(x) = g(x) f (y) dµ(y).
n→∞ n X
i=0
By the dominated convergence theorem
n−1 Z Z Z
1X i
lim f (T x)g(x) dµ(x) = g(x)dµ(x) f (y) dµ(y).
n→∞ n X X X
i=0
Now, let F, G ∈ L2 (X, B, µ). Then, G ∈ L1 (X, B, µ) so that
n−1 Z Z Z
1X i
lim f (T x)G(x) dµ(x) = G(x)dµ(x) f (y)dµ(y)
n→∞ n X X X
i=0
for all f ∈ C(X). LetR > 0, there

R exists f ∈ C(X) such that ||F − f ||2 <
which implies that | F dµ − f dµ| < . Furthermore, there exists N so
that for n ≥ N one has
Z
n−1 Z Z
1X i

f (T x)G(x) dµ(x) − Gdµ f dµ < .

Xn

i=0 X X
Thus, for n ≥ N one has

Z
n−1 Z Z
1 X
F (T i x)G(x) dµ(x) − Gdµ F dµ

Xn

i=0 X X
Z n−1
1 X
≤ F (T i x) − f (T i x) |G(x)| dµ(x)
X n i=0
Z
n−1 Z Z
1 X
+ f (T i x)G(x)dµ(x) − Gdµ f dµ

Xn X X
i=0
Z Z Z Z

+ f dµ
Gdµ − F dµ G dµ
X X X X
< ||G||2 + + ||G||2 .
Thus,
n−1 Z Z Z
1X i
lim F (T x)G(x) dµ(x) = G(x) dµ(x) F (y) dµ(y)
n→∞ n X X X
i=0
for all F, G ∈ L2 (X, B, µ) and x ∈ Y . Taking F and G to be indicator

functions, one gets that T is ergodic.

Exercise 6.1.1 Let X be a compact metric space and T : X → X be a
continuous homeomorphism. Let x ∈ X be periodic point of T of period n,
i.e. T n x = x and T j x 6= x for j < i. Show that if µ ∈ M (X, T ) is ergodic
n−1
1X
and µ({x}) > 0, then µ = δT i x .
n i=0
6.2 Unique Ergodicity

A continuous transformation T : X → X on a compact metric space is
uniquely ergodic if there is only one T -invariant probabilty measure µ on
X. In this case, M (X, T ) = {µ}, and µ is necessarily ergodic, since µ is an
extreme point of M (X, T ). Recall that if ν ∈ M (X, T ) is ergodic, then there
exists a measurable subset Y such that ν(Y ) = 1 and
n−1 Z
1X i
lim f (T x) = f (y) dν(y)
n→∞ n X
i=0
Unique Ergodicity 97
for all x ∈ Y and all f ∈ C(X). When T is uniquely ergodic we will see that
we have a much stronger result.
Theorem 6.2.1 Let T : X → X be a continuous transformation on a com-

pact metric space X. Then the following are equivalent:
n−1
1X
(i) For every f ∈ C(X), the sequence { f (T j x)} converges uniformly
n j=0
to a constant.
n−1
1X
(ii) For every f ∈ C(X), the sequence { f (T j x)} converges pointwise
n j=0
to a constant.
(iii) There exists a µ ∈ M (X, T ) such that for every f ∈ C(X) and all
x ∈ X.
n−1 Z
1X i
lim f (T x) = f (y) dµ(y).
n→∞ n X
i=0
(iv) T is uniquely ergodic.
Proof (i) ⇒ (ii) immediate.

(ii) ⇒ (iii) Define L : C(X) → C by
n−1
1X
L(f ) = lim f (T i x).
n→∞ n
i=0
By assumption L(f ) is independent of x, hence L is well defined. It is easy

to see that L is linear, continuous (|L(f )| ≤ supx∈X |f (x)|), positive and
L(1) = 1. Thus, by Riesz Representation Theorem there exists a probability
measure µ ∈ M (X) such that
Z
L(f ) = f (x) dµ(x)
X
for all f ∈ C(x). But

n−1
1X
L(f ◦ T ) = lim f (T i+1 x) = L(f ).
n→∞ n
i=0
Hence, Z Z
f (T x) dµ(x) = f (x) dµ(x).
X X
Thus, µ ∈ M (X, T ), and for every f ∈ C(X),
n−1 Z
1X i
n→∞ n X
i=0
for all x ∈ X.
(iii) ⇒ (iv) Suppose µ ∈ M (X, T ) is such that for every f ∈ C(X),
n−1 Z
1X i
n→∞ n X
i=0
for all x ∈ X. Assume ν ∈ M (X, T ), we will show that µ = ν. For any f ∈

n−1
1X
C(X), since the sequence { f (T j x)} converges pointwise to the constant
n j=0
R
function X f (x) dµ(x), and since each term of the sequence is bounded in
absolute value by the constant supx∈X |f (x)|, it follows by the Dominated
Convergence Theorem that
Z n−1 Z Z Z
1X i
lim f (T x) dν(x) = f (x) dµ(x)dν(y) = f (x) dµ(x).
n→∞ X n X X X
i=0
But for each n,

Z n−1 Z
1X i
f (T x) dν(x) = f (x) dν(x).
X n i=0 X
R R
Thus, X f (x) dµ(x) = X f (x) dν(x), and µ = ν.
(iv) ⇒ (i) The proof is done by contradiction. Assume M (X, T ) = {µ} and
n−1
1X
suppose g ∈ C(X) is such that the sequence { g ◦ T j } does not converge
n j=0
uniformly on X. Then there exists > 0 such that for each N there exists
n > N and there exists xn ∈ X such that
n−1 Z
1 X j

g(T xn ) − g dµ ≥ .
n j=0 X
Unique Ergodicity 99
Let
n−1 n−1
1X 1X j
µn = δT j xn = T δxn .
n j=0 n j=0
Then, Z Z
| gdµn − g dµ| ≥ .
X X
Since M (X) is compact. there exists a subsequence µni converging to ν ∈
M (X). Hence, Z Z
| g dν − g dµ| ≥ .
X X
By Theorem (6.1.5), ν ∈ M (X, T ) and by unique ergodicity µ = ν, which is
a contradiction.
Example If Tθ is an irrational rotation, then Tθ is uniquely ergodic. This is
a consequence of the above theorem and Weyl’s Theorem on uniform distri-
bution: for any Riemann integrable function f on [0, 1), and any x ∈ [0, 1),
one has
n−1 Z
1X
lim f (x + iθ − bx + iθc) = f (y) dy.
n→∞ n X
i=0
As an application of this, let us consider the following question. Consider

the sequence of first digits
{1, 2, 4, 8, 1, 3, 6, 1, . . .}
obtained by writing the first decimal digit of each term in the sequence
{2n : n ≥ 0} = {1, 2, 4, 8, 16, 32, 64, 128, . . .}.
For each k = 1, 2, . . . , 9, let pk (n) be the number of times the digit k appears
in the first n terms of the first digit sequence. The asymptotic relative fre-
pk (n)
quency of the digit k is then limn→∞ . We want to identify this limit
n
for each k ∈ {1, 2, . . . 9}. To do this, let θ = log10 2, then θ is irrational. For
k = 1, 2, . . . , 9, let Jk = [log10 k, log10 (k + 1)). By unique ergodicity of Tθ , we
have for each k = 1, 2, . . . , 9,
n−1
1X k+1
lim 1Jk (Tθj (0)) = λ(Jk ) = log10 .
n→∞ n k
i=0
Returning to our original problem, notice that the first digit of 2i is k if and
only if
k · 10r ≤ 2i < (k + 1) · 10r
for some r ≥ 0. In this case,
r + log10 k ≤ i log10 2 = iθ < r + log10 (k + 1).
This shows that r = biθc, and
log10 k ≤ iθ − biθc < log10 (k + 1).
But Tθi (0) = iθ − biθc, so that Tθi (0) ∈ Jk . Summarizing, we see that the first
digit of 2i is k if and only if Tθi (0) ∈ Jk . Thus,
n−1
pk (n) 1X k+1
lim = lim 1Jk (Tθi (0)) = log10 .
n→∞ n n→∞ n
i=0
k
Chapter 7
Topological Dynamics
In this chapter we will shift our attention from the theory of measure preserv-
ing transformations and ergodic theory to the study of topological dynamics.
The field of topological dynamics also studies dynamical systems, but in a
different setting: where ergodic theory is set in a measure space with a mea-
sure preserving transformation, topological dynamics focuses on a topological
space with a continuous transformation. The two fields bear great similarity
and seem to co-evolve. Whenever a new notion has proven its value in one
field, analogies will be sought in the other. The two fields are, however, also
complementary. Indeed, we shall encounter a most striking example of a map
which exhibits its most interesting topological dynamics in a set of measure
zero, right in the blind spot of the measure theoretic eye.
In order to illuminate this interplay between the two fields we will be merely
interested in compact metric spaces. The compactness assumption will play
the role of the assumption of a finite measure space. The assumption of the
existence of a metric is sufficient for applications and will greatly simplify
all of our proofs. The chapter is structured as follows. We will first intro-
duce several basic notions from topological dynamics and discuss how they
are related to each other and to analogous notions from ergodic theory. We
will then devote a separate section to topological entropy. To conclude the
chapter, we will discuss some examples and applications.
101
102 Basic Notions
7.1 Basic Notions

Throughout this chapter X will denote a compact metric space. Unless
stated otherwise, we will always denote the metric by d and occasionally
write (X, d) to denote a metric space if there is any risk of confusion. We will
assume that the reader has some familiarity with basic (analytic) topology.
To jog the reader’s memory, a brief outline of these basics is included as an
appendix. Here we will only summarize the properties of the space X, for
future reference.
Theorem 7.1.1 Let X be a compact metric space, Y a topological space and

f : X −→ Y continuous. Then,
• X is a Hausdorff space
• Every closed subspace of X is compact
• Every compact subspace of X is closed
• X has a countable basis for its topology
• X is normal, i.e. for any pair of disjoint closed sets A and B of X

there are disjoint open sets U , V containing A and B, respectively
• If O is an open cover of X, then there is a δ > 0 such that every subset

A of X with diam(A) < δ is contained in an element of O. We call
δ > 0 a Lebesgue number for O.
• X is a Baire space
• f is uniformly continuous
• If Y is an ordered space then f attains a maximum and minimum on

X
• If A and B are closed sets in X and [a, b] ⊂ R, then there exists a

continuous map g : X −→ [a, b] such that g(x) = a for all x ∈ A and
g(x) = b for all x ∈ b
103
The reader may recognize that this theorem contains some of the most
important results in analytic topology, most notably the Lebesgue number
lemma, Baire’s category theorem, the extreme value theorem and Urysohn’s
lemma.
Let us now introduce some concepts of topological dynamics: topological
transitivity, topological conjugacy, periodicity and expansiveness.
Definition 7.1.1 Let T : X −→ X be a continuous transformation. For

x ∈ X, the forward orbit of x is the set F OT (x) = {T n (x)|n ∈ Z≥0 }. T is
called one-sided topologically transitive if for some x ∈ X, F OT (x) is dense
in X.
If T is invertible, the orbit of x is defined as OT (x) = {T n (x)|n ∈ Z}. If T
is a homeomorphism, T is called topologically transitive if for some x ∈ X
OT (x) is dense in X.
In the literature sometimes a stronger version of topological transitivity is

introduced, called minimality.
Definition 7.1.2 A homeomorphism T : X −→ X is called minimal if every

x ∈ X has a dense orbit in X.
Example 7.1.1 Let X be the topological space X = {1, 2, . . . , 1000} with

the discrete topology. X is a finite space, hence compact, and the discrete
metric (
0 if x = y
ddisc (x, y) =
1 if x 6= y
induces the discrete topology on X. Thus X is a compact metric space. If
we now define T : X −→ X by the permutation (12 · · · 1000), i.e. T (1) = 2,
T (2) = 3,. . .,T (1000) = 1, then it is easy to see that T is a homeomorphism.
Moreover, OT (x) = X for all x ∈ X, so T is minimal.
Example 7.1.2 Fix n ∈ Z>0 . Let X be the topological space X = {0} ∪

{ n1k |k ∈ Z>0 } equipped with the subspace topology inherited from R. Then
X is a compact metric space. Indeed, if O is an open cover of X, then O
contains an open set U which contains 0. U contains all but finitely many
points of X, so by adding an open set Uy to U from O with y ∈ Uy , for each
y not in U , we obtain a finite subcover. Define T : X −→ X by T (x) = nx .
104 Basic Notions
Note that T is continuous, but not surjective as T −1 ({1}) = ∅. Observe

that F OT (1) = X − {0} and X − {0} = X, so T is one-sided topologically
transitive as x = 1 has a dense forward orbit.
The following theorem summarizes some equivalent definitions of topological

transitivity.
Theorem 7.1.2 Let T : X −→ X be a homeomorphism of a compact metric
space. Then the following are equivalent:
1. T is topologically transitive
2. If A is a closed set in X satisfying T A = A, then either A = X or A
has empty interior (i.e. X − A is dense in X)
3. For any pair of non empty open sets U , V in X there exists an n ∈ Z
such that T n U ∩ V 6= ∅
Proof
Recall that OT (x) is dense in X if and only if for every U open in X,

U ∩ OT (x) 6= ∅.
(1)⇒(2): Suppose that OT (x) = X and let A be as stated. Suppose that A
does not have an empty interior, then there exists an open set U such that
U is open, non-empty and U ⊂ A. Now, we can find a p ∈ Z such that
T p (x) ∈ U ⊂ A. Since T n A = A for all n ∈ Z, OT (x) ⊂ A and taking
closures we obtain X ⊂ A. Hence, either A has empty interior, or A = X.
(2)⇒(3): Suppose U , V are open and non-empty. Then W = ∪∞ n=−∞ T U
n
is open, non-empty and invariant under T . Hence, A = X − W is closed,

T A = A and A 6= X. By (2), A has empty interior and therefore W is dense
in X. Thus W ∩ V 6= ∅, which implies (3).
(3)⇒(1): For an arbitrary y ∈ X we have: y ∈ OT (x) if and only if every
open neighborhood V of y intersects OT (x), i.e. T m (x) ∈ V for some m ∈ Z.
By theorem 7.1.1 there exists a countable basis U = {Un }∞n=1 for the topology
of X. Since every open neighborhood of y contains a basis element Ui such
that y ∈ Ui , we see that OT (x) = X if and only if for every n ∈ Z>0 there is
some m ∈ Z such that T m (x) ∈ Un . Hence,
∞
\ ∞
[
{x ∈ X|OT (x) = X} = T m Un
n=1 m=−∞
105
But ∪∞ m
m=−∞ T Un is T -invariant and therefore intersects every open set in
X by (3). Thus, ∪∞ m
m=−∞ T Un is dense in X for every n. Now, since X is
a Baire space (c.f. theorem 7.1.1), we see that {x ∈ X|OT (x) = X} is itself
dense in X and thus certainly non-empty (as X 6= ∅).
An analogue of this theorem exists for one-sided topological transitivity, see

[W].
Note that theorem 7.1.2 clearly resembles theorem (1.6.1) and (2) implies
that we can view topological transitivity as an analogue of ergodicity (in
some sense). This is also reflected in the following (partial) analogue of
theorem (1.7.1).
Theorem 7.1.3 Let T : X −→ X be continuous and one-sided topologically

transitive or a topologically transitive homeomorphism. Then every continu-
ous T -invariant function is constant.
Proof
Let f : X −→ Y be continuous on X and suppose f ◦ T = f . Then

f ◦ T n = f , so f is constant on (forward) orbits of points. Let x0 ∈ X have a
dense (forward) orbit in X and suppose f (x0 ) = c for some c ∈ Y . Fix ε > 0
and let x ∈ X be arbitrary. By continuity of f at x, we can find a δ > 0 such
that d(f (x), f (x̃)) < ε for any x̃ ∈ X with d(x, x̃) < δ. But then, since x0
has a dense (forward) orbit in X, there is some n ∈ Z (n ∈ Z≥0 ) such that
d(x, T n (x0 )) < δ. Hence,
d(f (x), c) = d(f (x), f (T n (x0 ))) < ε
Since ε > 0 was arbitrary, our proof is complete.
Exercise 7.1.1 Let X be a compact metric space with metric d and let
T : X −→ X is a topologically transitive homeomorphism. Show that if T is
an isometry (i.e. d(T (x), T (y)) = d(x, y), for all x, y ∈ X) then T is minimal.
Definition 7.1.3 Let X be a topological space, T : X −→ X a transforma-

tion and x ∈ X. Then x is called a periodic point of T of period n (n ∈ Z>0 )
if T n (x) = x and T m (x) 6= x for m < n. A periodic point of period 1 is called
a fixed point.
106 Basic Notions
Example 7.1.3 The map T : R −→ R, defined by T (x) = −x has one fixed

point, namely x = 0. All other points in the domain are periodic points of
period 2.
Definition 7.1.4 Let X be a compact metric space and T : X −→ X a

homeomorphism. T is said to be expansive if there exist a δ > 0 such that: if
x 6= y then there exists an n ∈ Z such that d(T n (x), T n (y)) > δ. δ is called
an expansive constant for T .
Example 7.1.4 Let X be a finite space with the discrete topology. X is a

compact metric space, since the discrete metric ddisc induces the topology on
X. Now, any bijective map T : X −→ X is an expansive homeomorphism
and any 0 < δ < 1 is an expansive constant for T .
For a non-trivial example of an expansive homeomorphism, see exercise 7.3.1.

Expansiveness is closely related to the concept of a generator.
Definition 7.1.5 Let X be a compact metric space and T : X −→ X a
homeomorphism. A finite open cover α of X is called a generator for T if
the set ∩∞
n=−∞ T
−n
An contains at most one point of X, for any collection
∞
{An }n=−∞ , Ai ∈ α.
Theorem 7.1.4 Let X be a compact metric space and T : X −→ X a

homeomorphism. Then T is expansive if and only if T has a generator.
Proof
Suppose that T is expansive and let δ > 0 be an expansive constant for T .

Let α be the open cover of X defined by α = {B(a, 2δ )|a ∈ X}. By compact-
ness, there exists a finite subcover β of α. Suppose that x, y ∈ ∩∞ n=−∞ T
−n
Bn ,
m m
Bi ∈ β for all i. Then d(T (x), T (y)) ≤ δ, for all m ∈ Z. But by expan-
siveness, there exists an l ∈ Z such that d(T l (x), T l (y)) > δ, a contradiction.
Hence, x = y. We conclude that β is a generator for T .
Conversely, suppose that α is a generator for T . By theorem 7.1.1, there
exists a Lebesgue number δ > 0 for the open cover α. We claim that
δ/4 is an expansive constant for T . Indeed, if x, y ∈ X are such that
d(T m (x), T m (y)) ≤ δ/4, for all m ∈ Z. Then, for all m ∈ Z, the closed
ball B(T m (x), δ/4) contains T m (y) and has diameter δ/2 < δ. Hence,
B(T m (x), δ/4) ⊂ Am ,

107
for some Am ∈ α. It follows that x, y ∈ ∩∞

m=−∞ T
−m
Am .But α is a generator,
∞ −m
so ∩m=−∞ T Am contains at most one point. We conclude that x = y, T is
expansive.
Exercise 7.1.2 In the literature, our definition of a generator is sometimes

called a weak generator. A generator is then defined by replacing ∩∞
n=−∞ T
−n
An
∞ −n
by ∩n=−∞ T An in definition 7.1.5. Show that in a compact metric space
both concepts are in fact equivalent.
Exercise 7.1.3 Prove the following basic properties of an expansive home-

omorphism T : X −→ X:
a. T is expansive if and only if T n is expansive (n 6= 0)
b. If A is a closed subset of X and T (A) = A, then the restriction of T to

A, T |A , is expansive
c. Suppose S : Y −→ Y is an expansive homeomorphism. Then the product

map T × S : X × Y −→ X × Y is expansive with respect to the metric
d = max{dX , dY }.
Definition 7.1.6 Let X, Y be compact spaces and let T : X −→ X, S :

Y −→ Y be homeomorphisms. T is said to be topologically conjugate to S if
there exists a homeomorphism φ : X −→ Y such that φ ◦ T = S ◦ φ. φ is
called a (topological) conjugacy.
Note that topological conjugacy defines an equivalence relation on the space

of all homeomorphisms.
The term ‘topological conjugacy’ is, in a sense, a misnomer. The following
theorem shows that topological conjugacy can be considered as the counter-
part of a measure preserving isomorphism.
Theorem 7.1.5 Let X, Y be compact spaces and let T : X −→ X, S :
Y −→ Y be topologically conjugate homeomorphisms. Then,
1. T is topologically transitive if and only if S is topologically transitive
2. T is minimal if and only if S is minimal
3. T is expansive if and only if S is expansive

108 Two Definitions
Proof
The proof of the first two statements is trivial. For the third statement
we note that
∞
\ ∞
\ ∞
\
−n −1 −1 −n −1
T (φ An ) = φ ◦S (An ) = φ ( S −n An )
n=−∞ n=−∞ n=−∞
The lefthandside of the equation contains at most one point if and only if
the righthandside does, so a finite open cover α is a generator for S if and
only if φ−1 α = {φ−1 A|A ∈ α} is a generator for T . Hence the result follows
from theorem 7.1.4.
7.2 Topological Entropy

Topological entropy was first introduced as an analogue of the succesful con-
cept of measure theoretic entropy. It will turn out to be a conjugacy invariant
and is therefore a useful tool for distinguishing between topological dynamical
systems. The definition of topological entropy comes in two flavors. The first
is in terms of open covers and is very similar to the definition of measure the-
oretic entropy in terms of partitions. The second (and chronologically later)
definition uses (n,ε) separating and spanning sets. Interestingly, the latter
definition was a topological dynamical discovery, which was later engineered
into a similar definition of measure theoretic entropy.
7.2.1 Two Definitions

We will start with a definition of topological entropy similar to the defini-
tion of measure theoretic entropy introduced earlier. To spark the reader’s
recognition of the similarities between the two, we use analogous notation
and terminology. Before kicking off, let us first make a remark on the as-
sumptions about our space X. The first definition presented will only require
X to be compact. The second, on the other hand, only requires X to be a
metric space. We will obscure from these subtleties and simply stick to the
context of a compact metric space, in which the two definitions will turn out
to be equivalent. Nevertheless, the reader should be aware that both defini-
tions represent generalizations of topological entropy in different directions
109
and are therefore interesting in their own right.

Let X be a compact metric space and let α, β be open covers of X. We say
that β is a refinement of α, and write α ≤ β, if for every B ∈ β there is an
A ∈ α such that B ⊂ A. The common refinement of α and β is defined to
be α ∨ β = {A ∩ B|AW∈ α, B ∈ β}. For a finite collection {αi }ni=1 of open
covers we may define ni=1 αi = {∩ni=1 Aji |Aji ∈ αi }. For a continuous trans-
formation T : X −→ X we define T −1 α = {T −1 A|A ∈ α}. Note that these
are all again open covers of X. Finally, we define the diameter of an open
cover α as diam(α) := supA∈α diam(A), where diam(A) = supx,y∈A d(x, y).
Exercise 7.2.1 Show that T −1 (α ∨ β) = T −1 (α) ∨ T −1 (β) and show that

α ≤ β implies T −1 α ≤ T −1 β.
Definition 7.2.1 Let α be an open cover of X and let N (α) be the number
of sets in a finite subcover of α of minimal cardinality. We define the entropy
of α to be H(α) = log(N (α)).
The following proposition summarizes some easy properties of H(α). The

proof is left as an exercise.
Proposition 7.2.1 Let α be an open cover of X and let H(α) be the entropy
of α. Then
1. H(α) ≥ 0
2. H(α) = 0 if and only if N (α) = 1 if and only if X ∈ α
3. If α ≤ β, then H(α) ≤ H(β)
4. H(α ∨ β) ≤ H(α) + H(β)
5. For T : X −→ X continuous we have H(T −1 α) ≤ H(α). If T is
surjective then H(T −1 α) = H(α).
Exercise 7.2.2 Prove proposition 7.2.1 (Hint: If T is surjective then T (T −1 A) =

A).
We will now move on to the definition of topological entropy for a contin-

uous transformation with respect to an open cover and subsequently make
this definition independent of open covers.
110 Two Definitions
Definition 7.2.2 Let α be an open cover of X and let T : X −→ X be

continuous. We define the topological entropy of T with respect to α to be:
n−1
1 _
h(T, α) = lim H( T −i α)
n→∞ n
i=0
We must show that h(T, α) is well-defined, i.e. that the right hand side exists.
Wn−1 −i
Theorem 7.2.1 limn→∞ n1 H( i=1 T α) exists.
Proof
Define
n−1
_
an = H( T −i α)
i=0
Then, by proposition (4.2.2),, it suffices to show that {an } is subadditive.

Now, by (4) of proposition 7.2.1 and exercise 7.2.1
n+p−1
_
an+p = H( T −i α)
i=0
n−1 p−1
_ _
−i −n
≤ H( T α) + H(T T −j α)
i=0 j=0
≤ an + ap
This completes our proof.
Proposition 7.2.2 h(T, α) satisfies the following:
1. h(T, α) ≥ 0
2. If α ≤ β, then h(T, α) ≤ h(T, β)
3. h(T, α) ≤ H(α)
111
Proof
These are easy consequences of proposition 7.2.1. We will only prove the
third statement. We have:
n−1
_ n−1
X
−i
H( T α) ≤ H(T −i α)
i=0 i=0
≤ nH(α)
Here we have subsequently used (4) and (5) of the proposition.
Finally, we arrive at our sought definition:
Definition 7.2.3 (I) Let T : X −→ X be continuous. We define the topo-

logical entropy of T to be:
h1 (T ) = sup h(T, α)
α
where the supremum is taken over all open covers of X.
We will defer a discussion of the properties of topological entropy till the end
of this chapter.
Let us now turn to the second approach to defining topological entropy.

This approach was first taken by E.I. Dinaburg.
We shall first define the main ingredients. Let d be the metric on the com-
pact metric space X and for each n ∈ Z≥0 define a new metric by:
dn (x, y) = max d(T i (x), T i (y))

0≤i≤n−1
Definition 7.2.4 Let n ∈ Z≥0 , ε > 0 and A ⊂ X. A is called (n, ε)-

spanning for X if for all x ∈ X there exists a y ∈ A such that dn (x, y) < ε.
Define span(n, ε, T ) to be the minimum cardinality of an (n, ε)-spanning set.
A is called (n, ε)-separated if for any x, y ∈ A dn (x, y) > ε. We define
sep(n, ε, T ) to be the maximum cardinality of an (n, ε)-separated set.
112 Two Definitions
Figure 7.1: An (n, ε)-separated set (left) and (n, ε)-spanning set (right). The
dotted circles represent open balls of dn -radius 2ε (left) and of dn -radius ε
(right), respectively.
Note that we can extend this definition by letting A ⊂ K, where K is

a compact subset of X. This generalization to the context of non-compact
metric spaces was devised by R.E. Bowen. See [W] for an outline of this
extended framework.
We can also formulate the above in terms of open balls. If B(x, r) = {y ∈
X|d(x, y) < r} is an open ball in the metric d, then the open ball with centre
x and radius r in the metric dn is given by:
n−1
\
Bn (x, r) := T −i B(T i (x), r)
i=0
Hence, A is (n, ε)-spanning for X if:

[
X= Bn (a, ε)
a∈A
and A is (n, ε)-separated if:
(A − {a}) ∩ Bn (a, ε) = ∅ for all a ∈ A
Definition 7.2.5 For n ∈ Z≥0 and ε > 0 we define cov(n, ε, T ) to be the

minimum cardinality of a covering of X by open sets of dn -diameter less
than ε.
The following theorem shows that the above notions are actually two sides
of the same coin.
113
Theorem 7.2.2 cov(n, 3ε, T ) ≤ span(n, ε, T ) ≤ sep(n, ε, T ) ≤ cov(n, ε, T ).

Furthermore, sep(n, ε, T ), span(n, ε, T ) and cov(n, ε, T ) are finite, for all n,
ε > 0 and continuous T .
Proof
We will only prove the last two inequalities. The first is left as an exercise
to the reader. Let A be an (n, ε)-separated set of cardinality sep(n, ε, T ).
Suppose A is not (n, ε)-spanning for X. Then there is some x ∈ X such
that dn (x, a) ≥ ε, for all a ∈ A. But then A ∪ {x} is an (n, ε)-separated
set of cardinality larger than sep(n, ε, T ). This contradiction shows that A
is an (n, ε)-spanning set for X. The second inequality now follows since the
cardinality of A is at least as large as span(n, ε, T ).
To prove the third inequality, let A be an (n, ε)-separated set of cardinal-
ity sep(n, ε, T ). Note that if α is an open cover of dn -diameter less than ε,
then no element of α can contain more than 1 element of A. This holds in
particular for an open cover of minimal cardinality, so the third inequality is
proved.
The final statement follows for span(n, ε, T ) and cov(n, ε, T ) from the com-
pactness of X and subsequently for sep(n, ε, T ) by the last inequality.
Exercise 7.2.3 Finish the proof of the above theorem by proving the first
inequality.
1
Lemma 7.2.1 The limit cov(ε, T ) := limn→∞ n
log(cov(n, ε, T )) exists and
is finite.
Proof
Fix ε > 0. Let α, β be open covers of X consisting of sets with dm -

diameter and dn -diameter, smaller than ε and with cardinalities cov(m, ε, T )
and cov(n, ε, T ), respectively. Pick any A ∈ α and B ∈ β. Then, for
x, y ∈ A ∩ T −m B,
dm+n (x, y) = max d(T i (x), T i (y))
0≤i≤m+n−1
≤ max{ max d(T i (x), T i (y)), max d(T j (x), T j (y))}

0≤i≤m−1 m≤j≤m+n−1
< ε
114 Equivalence of the two Definitions
Thus, A ∩ T −m B has dm+n -diameter less than ε and the sets {A ∩ T −m B|A ∈
α, B ∈ β} form a cover of X. Hence,
cov(m + n, ε, T ) ≤ cov(m, ε, T )· cov(n, ε, T )

Now, since log is an increasing function, the sequence {an }∞ n=1 defined by
an = log cov(n, ε, T ) is subadditive, so the result follows from proposition 5.
Motivated by the above lemma and theorem, we have the following definition
Definition 7.2.6 (II) The topological entropy of a continuous transforma-

tion T : X −→ X is given by:
1
h2 (T ) = lim lim log(cov(n, ε, T ))
ε↓0 n→∞n
1
= lim lim log(span(n, ε, T ))
ε↓0 n→∞ n
1
= lim lim log(sep(n, ε, T ))
ε↓0 n→∞ n
Note that there seems to be a concealed ambiguity in this definition, since

h2 (T ) still depends on the chosen metric d. As we will see later on, in
compact metric spaces h2 (T ) is independent of d, as long as it induces the
topology of X. However, there is definitely an issue at this point if we drop
our assumption of compactness of the space X.
7.2.2 Equivalence of the two Definitions

We will now show that definitions (I) and (II) of topological entropy coincide
on compact metric spaces. This will justify the use of the notation h(T ) for
both definitions of topological entropy.
In the meantime, we will continue to write h1 (T ) for topological entropy in
the definition using open covers and h2 (T ) for the second definition of entropy.
After all the work done in the last section, our only missing ingredient is the
following lemma:
115
Lemma 7.2.2 Let X be a compact metric space. Suppose {αn }∞ n=1 is a se-
quence of open covers of X such that limn→∞ diam(αn ) = 0, where diam is
the diameter under the metric d. Then limn→∞ h1 (T, αn ) = h1 (T ).
Proof
We will only prove the lemma for the case h1 (T ) < ∞. Let ε > 0 be arbi-
trary and let β be an open cover of X such that h1 (T, β) > h1 (T ) − ε. By
theorem 7.1.1 there exists a Lebesgue number δ > 0 for β. By assumption,
there exists an N > 0 such that for n ≥ N we have diam(αn ) < δ. So, for
such n, we find that for any A ∈ αn there exists a B ∈ β such that A ⊂ B,
i.e. β ≤ αn . Hence, by proposition 7.2.2 and the above,
h1 (T ) − ε < h1 (T, β) ≤ h1 (T, αn ) ≤ h1 (T )
for all n ≥ N .
Exercise 7.2.4 Finish the proof of lemma 7.2.2. That is, show that if
h1 (T ) = ∞, then limn→∞ h1 (T, αn ) = ∞.
Theorem 7.2.3 For a continuous transformation T : X −→ X of a compact
metric space X, definitions (I) and (II) of topological entropy coincide.
Proof
Step 1: h2 (T ) ≤ h1 (T ).
For any n, let {αk }∞ k=1 be the sequence of open covers of X, defined by
1
αk = {Bn (x, 3k )|x ∈ X}. Then, clearly, the dn -diameter of αk is smaller than
1
Wn−1 −i
k
. Now, i=0 T αk is an open W cover of X by sets of dn -diameter smaller
than k , hence cov(n, k , T ) ≤ N ( n−1
1 1 −i
i=0 T αk ). Since limk→∞ diam(αk ) = 0,
by lemma 7.2.2, we can subsequently take the log, divide by n, take the limit
for n → ∞ and the limit for k → ∞ to obtain the desired result.
Step 2: h1 (T ) ≤ h2 (T ).
Let {βk }∞ 1
k=1 be defined by βk = {B(x, k )|x ∈ X}. Note that δk = k is a
2
Lebesgue number for βk , for each k. Let A be an (n, δ2k )-spanning set for X
of cardinality span(n, δ2k , T ). Then, for each a ∈ A the ball B(T i (a), δ2k ) is
contained in a member of βk (for all 0 ≤ i < n), hence Bn (T i (a), δ2k ) is con-
tained in a member of n−1
W −i
Wn−1 −i 1
i=0 T βk . Thus, N ( i=0 T βk ) ≤ span(n, 4k , T ).
Proceeding in the same way as in step 1, we obtain the desired inequality.
Properties of Entropy
We will now list some elementary properties of topological entropy.
Theorem 7.2.4 Let X be a compact metric space and let T : X −→ X be
continuous. Then h(T ) satisfies the following properties:
1. If the metrics d and d˜ both generate the topology of X, then h(T ) is the
same under both metrics
2. h(T ) is a conjugacy invariant
3. h(T n ) = n· h(T ), for all n ∈ Z>0
4. If T is a homeomorphism, then h(T −1 ) = h(T ). In this case, h(T n ) =

|n|· h(T ), for all n ∈ Z
Proof
˜ denote the two metric spaces. Since both induce

(1): Let (X, d) and (X, d)
˜ and j : (X, d)
the same topology, the identity maps i : (X, d) −→ (X, d) ˜ −→
(X, d) are continuous and hence by theorem 7.1.1 uniformly continuous. Fix
ε1 > 0. Then, by uniform continuity, we can subsequently find an ε2 > 0 and
ε3 > 0 such that for all x, y ∈ X:
˜ y) < ε2
d(x, y) < ε1 ⇒ d(x,
˜ y) < ε2 ⇒ d(x, y) < ε3
d(x,
Let A be an (n, ε1 )-spanning set for T under d. Then A is also (n, ε2 )-

spanning for T under d. ˜ Analogously, every (n, ε2 )-spanning set for T under
d˜ is (n, ε3 )-spanning for T under d. Therefore,
˜ ≤ span(n, ε1 , T, d)
span(n, ε3 , T, d) ≤ span(n, ε2 , T, d)
where the fourth argument emphasizes the metric. By subsequently taking

the log, dividing by n, taking the limit for n → ∞ and the limit for ε1 ↓ 0
(so that ε2 , ε3 ↓ 0) in these inequalities, we obtain the desired result.
(2): Suppose that S : Y −→ Y is topologically conjugate to T , where Y is a
117
compact metric space, with conjugacy ψ : Y −→ X. Let dX be the metric on

X. Then the map d˜Y defined by d˜Y (x, y) = dX (ψ(x), ψ(y)) defines a metric
on Y which induces the topology of Y . Indeed, let BdY (y0 , η) be a basis
element of the topology of Y . Then ψ(BdY (y0 , η)) is open in X, as ψ is an
open map. Hence, there is an open ball BdX (ψ(y0 ), η̃) ⊂ ψ(BdY (y0 , η)), since
the open balls are a basis for the topology of X. Now, ψ −1 (BdX (ψ(y0 ), η̃)) ⊂
BdY (y0 , η) and
y ∈ ψ −1 (Bd (ψ(y0 ), η̃)) ⇔ d˜Y (y0 , y) = dX (ψ(y), ψ(y0 )) < η̃
X
hence, Bd˜Y (y0 , η̃) ⊂ BdY (y0 , η). Analogously, for any y0 ∈ Y and δ̃ > 0, there
is a δ > 0 such that
BdY (y0 , δ) ⊂ Bd˜Y (y0 , δ̃)
Thus, d˜Y induces the topology of Y .
Now, for x1 , x2 ∈ X, we have:
dX (T (x1 ), T (x2 )) = dX (T ◦ ψ(y1 ), T ◦ ψ(y2 ))

= dX (ψ ◦ S(y1 ), ψ ◦ S(y2 ))
= d˜Y (S(y1 ), S(y2 ))
Where we have used that x1 = ψ(y1 ), x2 = ψ(y2 ) for some y1 , y2 ∈ Y , since

ψ is a bijection. Thus, we see that the n−diameters (in the metrics dX and
d˜Y ) remain the same under ψ. Hence, cov(n, ε, T, dX ) = cov(n, ε, S, d˜Y ), for
all ε > 0 and n ∈ Z≥0 . By (1), h(S) is the same under dY and d˜Y , so it now
easily follows that h(S) = h(T ).
(3): Observe that
max d(T ni (x), T ni (y)) ≤ max d(T j (x), T j (y))
1≤i≤m−1 1≤j≤nm−1
therefore, span(m, ε, T n ) ≤ span(nm, ε, T ). This implies m1 span(m, ε, T n ) ≤

n
nm
span(nm, ε, T ), thus h(T n ) ≤ nh(T ).
Fix ε > 0. Then, by uniform continuity of T i on X (c.f. theorem 7.1.1),
we can find a δ > 0 such that d(T i (x), T i (y)) < ε for all x, y ∈ X satisfying
d(x, y) < δ and i = 0, . . . , n − 1. Hence, for x, y ∈ X satisfying d(x, y) < δ,
we find dn (x, y) < ε. Let A be an (m, δ)-spanning set for T n . Then, for all
x ∈ X there exists some a ∈ A such that
max d(T ni (x), T ni (a)) < δ
0≤i≤m−1
But then, by the above,
max max d(T ni+k (x), T ni+k (a)) = max d(T j (x), T j (a)) < ε
0≤k≤n−1 0≤i≤m−1 0≤j≤nm−1
Thus, span(mn, ε, T ) ≤ span(m, δ, T n ) and it follows that h(T n ) ≥ nh(T )

(4): Let A be an (n, ε)-separated set for T . Then, for any x, y ∈ A dn (x, y) >
ε. But then,
max d(T −i (T n−1 (x)), T −i (T n−1 (y))) = max d(T i (x), T i (y)) > ε
0≤i≤n−1 0≤i≤n−1
so T n−1 A is an (n, ε)-separated set for T −1 of the same cardinality. Con-

versely, every (n, ε)-separated set B for T −1 gives the (n, ε)-separated set
T −(n−1) B for T . Thus, sep(n, ε, T ) = sep(n, ε, T −1 ) and the first statement
readily follows. The second statement is a consequence of (3).
Exercise 7.2.5 Show that assertion (4) of theorem 7.2.4, i.e. h(T −1 ) =
h(T ), still holds if X is merely assumed to be compact.
Exercise 7.2.6 Let (X, dX ), (Y, dY ) be compact metric spaces and T :

X −→ X, S : Y −→ Y continuous. Show that h(T × S) = h(T ) + h(S).
(Hint: first show that the metric d = max{dX , dY } induces the product
topology on X × Y ).
7.3 Examples
In this section we provide a few detailed examples to get some familiarity
with the material discussed in this chapter. We derive some simple properties
of the dynamical systems presented below, but leave most of the good stuff
for you to discover in the exercises. After all, the proof of the pudding is in
the eating!
Example. (The Quadratic Map) Consider the following biological popu-

lation model. Suppose we are studying a colony of penguins and wish to
model the evolution of the population through time. We presume that there
is a certain limiting population level, P ∗ , which cannot be exceeded. When-
ever the population level is low, it is supposed to increase, since there is
119
plenty of fish to go around. If, on the other hand, population is near the up-
per limit level P ∗ , it is expected to decrease as a result of overcrowding. We
arrive at the following simple model for the population at time n, denoted
by Pn :
Pn+1 = λPn (P ∗ − Pn )
Where λ is a ‘speed’ parameter. If we now set P ∗ = 1, we can think of Pn
as the population at time n as a percentage of the limiting population level.
Then, writing x = P0 , the population dynamics are described by iteration of
the following map:
Definition 7.3.1 The Quadratic Map with parameter λ, Qλ : R −→ R, is

defined to be
Qλ (x) = λx(1 − x)
Of course, in the context of the biological model formulated above, our atten-
tion is restricted to the interval [0, 1] since the population percentage cannot
lie outside this interval.
The quadratic map is deceivingly simple, as it in fact displays a wide range
of interesting dynamics, such as periodicity, topological transitivity, bifurca-
tions and chaos.
Let us take a closer look at the case λ > 4. For x ∈ (−∞, 0) ∪ (1, ∞)
it can be shown that Qnλ (x) → −∞ as n → ∞. The interesting dynam-
ics occur in the interval [0, 1]. Figure 7.2 shows a phase portrait for Qλ
on [0, 1], which is nothing more than the graph of Qλ together with the
graph of the line y = x. In a phase portrait the trajectory of a point
under iteration of the map Qλ is easily visualized by iteratively project-
ing from the line y = x to the graph and vice versa. From our phase
portrait for Qλ it is immediately visible that there is a ‘leak’ in our in-
terval, i.e. the subset A1 = {x ∈ [0, 1]|Qλ (x) ∈
/ [0, 1]} is non-empty. Define
An = {x ∈ [0, 1]|Qλ (x) ∈ [0, 1] for 0 ≤ i ≤ n − 1, Qnλ (x) ∈
i
/ [0, 1]} and set
∞
C = [0, 1] − ∪n=1 An . This procedure is reminiscent of the construction of the
‘classical’ Cantor set, by removing middle-thirds in subsequent steps. In the
exercises below it is shown that C is indeed a Cantor set.
Figure 7.2: Phase portrait of the quadratic map Qλ for λ > 4 on [0, 1]. The
arrows indicate the (partial) trajectory of a point in [0, 1]. The dotted lines
divide [0, 1] in three parts: I0 , A1 and I1 (from left to right).
Example 7.3.1 . (Circle Rotations) Let S 1 denote the unit circle in C. We

define the rotation map Rθ : S 1 −→ S 1 with parameter θ by Rθ (z) = eiθ z =
ei(θ+ω) , where θ ∈ [0, 2π) and z = eiω for some ω ∈ [0, 2π). It is easily seen
that Rθ is a homeomorphism for every θ. The rotation map is very similar
to the earlier introduced shift map Tθ . This is not surprising, considering the
fact that the interval [0, 1] is homeomorphic to S 1 after identification of the
endpoints 0 and 1.
Proposition 7.3.1 Let θ ∈ [0, 2π) and Rθ : S 1 −→ S 1 be the rotation map.

If θ is rational, say θ = ab , then every z ∈ S 1 is periodic with period b. Rθ is
minimal if and only if θ is irrational.
Proof
The first statement follows trivially from the fact that e2niπ = 1 for n ∈ Z.
121
Suppose that θ is irrational. Fix ε > 0 and z ∈ S 1 , so z = eiω for some

ω ∈ [0, 2π). Then
Rθn (z) = Rθm (z) ⇔ ei(ω+nθ) = ei(ω+mθ)

⇔ ei(n−m)θ = 1
⇔ (n − m)θ ∈ Z
Thus, the points {Rθn (z)|n ∈ Z} are all distinct and it follows that {Rθn (z)}∞
n=1
is an infinite sequence. By compactness, this sequence has a convergent
subsequence, which is Cauchy. Therefore, we can find integers n > m such
that
d(Rθn (z), Rθm (z)) < ε
where d is the arc length distance function. Now, since Rθ is distance preserv-
ing with respect to this metric, we can set l = n−m to obtain d(Rθl (z), z) < ε.
Also, by continuity of Rθl , Rθl maps the connected, closed arc from z to Rθl (z)
onto the connected closed arc from Rθl (z) to Rθ2l (z) and this one onto the arc
connecting Rθ2l (z) to Rθ3l (z), etc. Since these arcs have positive and equal
length, they cover S 1 . The result now follows, since the arcs have length
smaller than ε, and ε > 0 was arbitrary.
As mentioned in the proof, Rθ is an isometry with respect to the arc length

distance function and this metric induces the topology on S 1 . Hence, we
see that span(n, ε, Rθ ) = span(1, ε, Rθ ) for all n ∈ Z≥0 and it follows that
h(Rθ ) = 0, for any θ ∈ [0, 2π).
Example 7.3.2 . (Bernoulli Shift Revisited) In this example we will rein-

troduce the Bernoulli shift in a topological context. Let Xk = {0, 1, . , k −1}Z
(or Xk+ = {0, . , k − 1}N∪{0} ) and recall that the left shift on Xk is defined by
L : Xk −→ Xk , L(x) = y, yn = xn+1 . We can define a topology on Xk (or
Xk+ ) which makes the shift map continuous (in fact, a homeomorphism in
the case of Xk ) as follows: we equip {0, . . . , k − 1} with the discrete topology
and subsequently give Xk the corresponding product topology. It is easy to
see that the cylinder sets form a basis for this topology. By the Tychonoff
theorem (see e.g. [Mu]), which asserts that an arbitrary product of compact
spaces is compact in the product topology, our space Xk is compact. In the

exercises you are asked to show that the metric
d(x, y) = 2−l if l = min{|j| : xj 6= yj }
induces the product topology, i.e. the open balls B(x, r) with respect to this
metric form a basis for the product topology. Here we set d(x, y) = 0 if
x = y, this corresponds to the case l = ∞.
We will end this example by calculating the topological entropy of the full
shift L.
Proposition 7.3.2 The topological entropy of the full shift map is equal to
h(L) = log(k).
Proof
We will prove the proposition for the left shift on Xk+ , the case for the left shift
+
on Xk is similar. Fix 0 < ε < 1 and pick any x = {xi }∞ ∞
i=0 , y = {yi }i=0 ∈ Xk .
Notice that if at least one of the first n symbols in the itineraries of x and y
differ, then
dn (x, y) = max d(T i (x), T i (y)) = 1 > ε
0≤i≤n−1
where we have used the metric d on Xk+ defined above. Thus, the set An
consisting of sequences a ∈ Xk+ such that ai ∈ {0, . . . , k −1} for 0 ≤ i ≤ n−1
and aj = 0 for j ≥ n is (n, ε)-separated. Explicitly, a ∈ An is of the form
a0 a1 a2 . . . an−1 000 . . .
Since there are k n possibilities to choose the first n symbols of an itinerary,
we obtain sep(n, ε, L) ≥ k n . Hence,
1
h(T ) = lim lim log(sep(n, ε, L)) ≥ log(k)
ε↓0 n→∞ n
To prove the reverse inequality, take l ∈ Z>0 such that 2−l < ε. Then An+l
is an (n, ε, L)-spanning set, since for every x ∈ Xk+ there is some a ∈ An+l
for which the first n + l symbols coincide. In other words, d(x, a) < 2−l < ε.
Therefore,
1 n+l
h(T ) = lim lim log(span(n, ε, L)) ≤ lim lim log(k) = log(k)
ε↓0 n→∞ n ε↓0 n→∞ n
and our proof is complete.
123
Figure 7.3: Phase portrait of the teepee map T on [0, 1].
Example 7.3.3 . (Symbolic Dynamics) The Bernoulli shift on the bise-

quence space Xk is a topological dynamical system whose dynamics are quite
well understood. The subject of symbolic dynamics is devoted to using this
knowledge to unravel the dynamical properties of more difficult systems. The
scheme works as follows. Let T : X −→ X be a topological dynamical sys-
tem and let α = {A0 , . . . , Ak−1 } be a partition of X such that Ai ∩ Aj = ∅
for i 6= j (Note the difference with the earlier introduced notion of a parti-
tion). We assume that T is invertible. Now, for x ∈ X, we let φi (x) be the
index of the element of α which contains T i (x), i ∈ Z. This defines a map
φ : X −→ Xk , called the Itinerary map generated by T on α,
φ(x) = {φi (x)}∞

i=−∞
The image of x under φ is called the itinerary of x. Note that φ satisfies

φ ◦ T = L ◦ φ. Of course, if T is not invertible, we may apply the same
procedure by replacing Xk by Xk+ .
We will now apply the above procedure to determine the number of pe-
riodic points of the teepee (or tent) map T : [0, 1] −→ [0, 1], defined by (see
figure 7.3) (
2x if 0 ≤ x ≤ 12
T (x) =
2(1 − x) if 12 ≤ x ≤ 1
Let I0 = [0, 12 ], I1 = [ 21 , 1]. We would like to define an itinerary map
ψ : [0, 1] −→ X2+ generated on {I0 , I1 }. Unfortunately, the fact that I0 ∩

I1 6= ∅ causes some complications. Indeed, we have an ambiguous choice of
the itinerary for x = 21 , since it can be described by either (01000 . . .) or
(11000 . . .), or any other point in [0, 1] which will end up in x = 21 upon
iteration of T . We will therefore identify any pair of sequences of the form
(a0 a1 . . . al 01000 . . .) and (a0 a1 . . . al 11000 . . .) and replace the two sequences
by a single equivalence class. The resulting space will be denoted by X̃2+ and
the resulting itinerary map again by ψ : [0, 1] −→ X̃2+ .
We will first show that ψ is injective. Suppose that there are x, y ∈ [0, 1]
with ψ(x) = ψ(y). Then T n (x) and T n (y) lie on the same side of x = 21 for
all n ∈ Z≥0 . Suppose that x 6= y. Then, for all n ∈ Z≥0 , T n ([x, y]) does
not contain the point x = 21 . But then no rational numbers of the form 2pn ,
p ∈ {1, 2, 3, . . . , 2n − 1}, are in [x, y] for all n ∈ Z>0 . Since these form a
dense set in [0, 1], we obtain a contradiction. We conclude that x = y, ψ is
one-to-one.
+
Now let us show that ψ is also surjective. Let a = {ai }∞ i=0 ∈ X̃2 . If a is one
of the aforementioned equivalence classes, we use the ‘0’ element to represent
a. Now, define
An = {x ∈ [0, 1]|x ∈ Ia0 , T (x) ∈ Ia1 , . . . , T n (x) ∈ Ian }
= Ia0 ∩ T −1 Ia1 ∩ . . . ∩ T −n Ian
Since T is continuous, T −i Iai is closed for all i and therefore An is closed for
n ∈ Z≥0 . As [0, 1] is a compact metric space, it follows from theorem 7.1.1
that An is compact. Moreover, {An }∞ n=0 forms a decreasing sequence, in the
sense that An+1 ⊂ An . Therefore, the intersection ∩∞ n=0 An is not empty and
∞
a is precisely the itinerary of x ∈ ∩n=0 An . We conclude that ψ is onto.
Now, since ψ ◦ T = L ◦ ψ, with L : X̃2+ −→ X̃2+ the (induced) left shift, we
have found a bijection between the periodic points of L on X̃2+ and those
of T on [0, 1]. The periodic points for T are quite difficult to find, but the
periodic points of L are easily identified. First observe that there are no
periodic points in the earlier defined equivalence classes in X̃2+ since these
points are eventually mapped to 0. Now, any periodic point a of L of period
n in X2+ must have an itinerary of the form
(a0 a1 . . . an−1 a0 a1 . . . an−1 a0 a1 . . .)
Hence, T has 2n periodic points of period n in [0, 1].
125
Exercise 7.3.1 In this exercise you are asked to prove some properties of
the shift map. Let Xk = {0, 1, . , k − 1}Z and Xk+ = {0, 1, . , k − 1}N∪{0} .
a. Show that the map d : Xk × Xk −→ Xk , defined by
d(x, y) = 2−l if l = min{|j| : xj 6= yj }
is a metric on Xk that induces the product topology on Xk .
b. Show that the map d∗ : Xk × Xk −→ Xk , defined by

∞
∗
X |xn − yn |
d (x, y) =
n=−∞
2|n|
is a metric on Xk that induces the product topology on Xk .
c. Show that the one-sided shift on Xk+ is continuous. Show that the two-
sided shift on Xk is a homeomorphism.
d. Show that the two-sided shift L : Xk −→ Xk is expansive.
Exercise 7.3.2 Let Qλ be the quadratic map and let C, A1 be as in the first
example. Set [0, 1] − A1 = I0 ∪ I1 , where I0 is the√interval left of A1 and I1
the interval to the right. We assume that λ > 2 + 5, so that |Q0λ (x)| > 1 for
all x ∈ I0 ∪ I1 (check this!). Let X2+ = {0, 1}N and let φ : C −→ X2+ be the
itinerary map generated by Qλ on {I0 , I1 }. This exercise serves as another
demonstration of the usefulness of symbolic dynamics.
a. Show that φ is a homeomorphism.
b. Show that φ is a topological conjugacy between Qλ and the one-sided

shift map on X2+ .
c. Prove that Qλ has exactly 2n periodic points of period n in C. Show that

the periodic points of Qλ are dense in C. Finally, prove that Qλ is
topologically transitive on C.
Exercise 7.3.3 Let Qλ be the quadratic map and let C be as in the first
example. This exercise shows that C is a Cantor set, i.e. a closed, totally
disconnected and perfect
√ subset of [0, 1]. As in the previous exercise, we will
assume that λ > 2 + 5.
a. Prove that C is closed.
b. Show that C is totally disconnected, i.e. the only connected subspaces of

C are one-point sets.
c. Demonstrate that C is perfect, i.e. every point is a limit point of C.

Chapter 8
The Variational Principle
In this chapter we will establish a powerful relationship between measure

theoretic and topological entropy, known as the Variational Principle. It
asserts that for a continuous transformation T of a compact metric space X
the topological entropy is given by h(T ) = sup{hµ (T )|µ ∈ M (X, T )}. To
prove this statement, we will proceed along the shortest and most popular
route to victory, provided by [M]1 . The proof is at times quite technical and
is therefore divided into several digestable pieces.
8.1 Main Theorem

For the first part of the proof of the Variational Principle we will only use
some properties of measure theoretic entropy. All the necessary ingredients
are listed in the following lemma:
Lemma 8.1.1 Let α, β, γ be finite partitions of X and T a measure preserv-

ing transformation of the probability space (X, F, µ). Then,
1. Hµ (α) ≤ log(N (α)), where N (α) is the number of sets in the partition
1
of non-zero measure. Equality holds only if µ(A) = N (α) , for all A ∈ α
with µ(A) > 0
2. If α ≤ β, then hµ (T, α) ≤ hµ (T, β)

1
Our exposition of Misiurewicz’s proof follows [W] to a certain extent
127
3. hµ (T, α) ≤ hµ (T, γ) + Hµ (α|γ)
4. hµ (T n ) = n · hµ (T ), for all n ∈ Z>0
Proofs
(1): This is a straighforward consequence of the concavity of the function

f : [0, ∞) −→ R, P defined by f (t) = −t log(t) (t > 0), f (0) = 0. In this
notation, Hµ (α) = A∈α f (µ(A)).
(2): α ≤ β implies that for every B ∈ β, there some A ∈ α such that B ⊂ A
and hence also T −i B ⊂ T −i A, i ∈ Z≥0 . Therefore, for n ≥ 1
n−1
_ n−1
_
−i
T α≤ T −i β
i=0 i=0
The result now follows from proposition .

(3): By proposition (4.2.1)(e) and (b), respectively,
n−1
_ n−1
_ n−1
_
H( T −i α) ≤ H( T −i γ ∨ T −i α)
i=0 i=0 i=0
n−1
_ n−1
_ n−1
_
= H( T −i γ) + H( T −i α| T −i γ)
i=0 i=0 i=0
Now, by successively applying proposition (4.2.1) (d), (g) and finally using
the fact that T is measure preserving, we get
n−1
_ n−1
_ n−1
X n−1
_
−i −i −i
H( T α| T γ) ≤ H(T α| T −i γ)
i=0 i=0 i=0 i=0
n−1
X
≤ H(T −i α|T −i γ)
i=0
= nH(α|γ)
Combining the inequalities, dividing both sides by n and taking the limit for
n → ∞ gives the desired result.
Main Theorem 129
(4): Note that

n−1 k−1 n−1
_ 1 _ −nj _ −i
h(T n , T −i α) = lim H( T ( T α))
k→∞ k
i=0 j=0 i=0
kn−1
n _
= lim H( T −i α)
k→∞ kn
i=0
m−1
n _
= lim H( T −i α)
m→∞ m
i=0
= nh(T, α)
Notice that { n−1 −i

W
i=0 T α | α a partition of X} ⊂ {β | β a partition of X}.
Hence,
nh(T ) = n sup h(T, α)

α
n−1
_
= sup h(T , n
T −i α)
α
i=0
n
≤ sup h(T , α)
α
= h(T n )
Wn−1
Finally, since α ≤ i=0 T −i α, we obtain by (2),
n−1
_
n
h(T , α) ≤ h(T , n
T −i α) = nh(T, α)
i=0
So h(T n ) ≤ nh(T ). This concludes our proof.
Exercise 8.1.1 Use lemma 8.1.1 and proposition (4.2.1) to show that for an
invertible measure preserving transformation T on (X, F, µ):
h(T n ) = |n|h(T ), for all n ∈ Z

We are now ready to prove the first part of the theorem.
Theorem 8.1.1 Let T : X −→ X be a continuous transformation of a

compact metric space. Then h(T ) ≥ sup{hµ (T )|µ ∈ M (X, T )}.
Proof
Fix µ ∈ M (X, T ). Let α = {A1 , . . . , An } be a finite partition of X. Pick

1
ε > 0 such that ε < n log n
. By theorem 16, µ is regular, so we can find closed
sets Bi ⊂ Ai such that µ(Ai − Bi ) < ε, i = 1, . . . , n. Define B0 = X − ∪ni=1 Bi
and let β be the partition β = {B0 , . . . , Bn }. Then, writing f (t) = −t log(t),
n X
n
X µ(Bi ∩ Aj )
Hµ (α|β) = µ(Bi )f ( )
i=0 j=1
µ(Bi )
n
X µ(B0 ∩ Aj )
= µ(B0 ) f( )
j=1
µ(B0 )
in the final step we used the fact that for i ≥ 1,
Bi ∩ Ai = Bi
Bi ∩ Aj = ∅ if i 6= j
We can define a σ-algebra on B0 by F ∩ B0 = {F ∩ B0 |F ∈ F} and define

the conditional measure µ(·|B0 ) : F ∩ B0 −→ [0, 1] by
µ(A ∩ B0 )
µ(A|B0 ) =
µ(B0 )
It is easy to check that (B0 , F ∩ B0 , µ(·|B0 )) is a probability space and

µ(·|B0 ) ∈ M (B0 , T |B0 ). Now, noting that α0 := {A∩B0 |A ∈ α} is a partition
of B0 under µ(·|B0 ), we get
n
X µ(B0 ∩ Aj )
Hµ (α|β) = µ(B0 ) f( ) = µ(B0 )Hµ(·|B0 ) (α0 )
j=1
µ(B0 )
Hence, we can apply (1) of lemma 8.1.1 and use the fact that
µ(B0 ) = µ(X − ∪ni=1 Bi ) = µ(∪ni=1 Ai − ∪ni=1 Bi ) = µ(∪i=1 (Ai − Bi )) < nε

Main Theorem 131
to obtain
Hµ (α|β) ≤ µ(B0 ) log(n) < nε log(n) < 1
Define for each i, i = 1, . . . , n, the open set Ci by Ci = B0 ∪ Bi = X −
∪j6=i Bj . Then we can define an open cover of X by γ = {C1 , . . . , Cn }.
By (1)
W of lemma 8.1.1 weWhave for m ≥ 1, in the notation of the lemma,
Hµ ( m−1
i=0 T −i
β) ≤ log(N ( m−1 −i
i=0 T β)). Note that γ is not necessarily
Wm−1 −ia par-
tition, butWby the proof of (1) of lemma 8.1.1, we still have Hµ ( i=0 T β) ≤
log(2m N ( m−1 −i
i=0 T γ)), since β contains (at most) one more set of non-zero
measure.
Thus, it follows that
hµ (T, β) ≤ h(T, γ) + log 2 ≤ h(T ) + log 2
and by (3) of lemma 8.1.1 we get obtain
hµ (T, α) ≤ hµ (T, β) + Hµ (α|β) ≤ h(T ) + log 2 + 1
Note that if µ ∈ M (X, T ), then also µ ∈ M (X, T m ), so the above inequality

holds for T m as well. Applying (4) of lemma 8.1.1 and (3) of theorem 7.2.4
leads us to mhµ (T ) ≤ mh(T ) + log 2 + 1. By dividing by m, taking the limit
for m → ∞ and taking the supremum over all µ ∈ M (X, T ), we obtain the
desired result.
We will now finish the proof of the Variational Principle by proving the
opposite inequality. This is the hard part of the proof, since it does not only
require more machinery, but also a clever trick.
Lemma 8.1.2 Let X be a compact metric space and α a finite partition of

X. Then, for any µ, ν ∈ M (X, T ) and p ∈ [0, 1] we have Hpµ+(1−p)ν (α) ≥
pHµ (α) + (1 − p)Hν (α)
Proof
Define the function f : [0, ∞) −→ R by f (t) = −t log(t) (t > 0), f (0) = 0.

Then f is concave and so, for A measurable, we have:
0 ≤ f (pµ(A) + (1 − p)ν(A)) − pf (µ(A)) − (1 − p)f (ν(A))

The result now follows easily.
Exercise 8.1.2 Suppose X is a compact metric space and T : X −→ X a

continuous transformation. Use (the proof of) lemma 8.1.2 to show that for
any µ, ν ∈ M (X, T ) and p ∈ [0, 1] we have hpµ+(1−p)ν (T ) ≥ phµ (T ) + (1 −
p)hν (T ).
Exercise 8.1.3 Improve the result in the exercise above by showing that we
can replace the inequality by an equality sign, i.e. for any µ, ν ∈ M (X, T )
and p ∈ [0, 1] we have hpµ+(1−p)ν (T ) = phµ (T ) + (1 − p)hν (T ).
Recall that the boundary of a set A is defined by ∂A = A − A.
Lemma 8.1.3 Let X be a compact metric space and µ ∈ M (X). Then,
1. For any x ∈ X and δ > 0, there is a 0 < η < δ such that µ(∂B(x, η)) =
0
2. For any δ > 0, there is a finite partition α = {A1 , . . . , An } of X such

that diam(Aj ) < δ and µ(∂Aj ) = 0, for all j
3. If T : X −→ X is continuous, µ ∈ M (X, T ) and Aj ⊂ X measurable

such that µ(∂Aj ) = 0, j = 0, . . . , n − 1, then µ(∂(∩n−1
j=0 T
−j
Aj )) = 0
4. If µn → µ in M (X) and A is a measurable set such that µ(∂A) = 0,

then limn→∞ µn (A) = µ(A)
Proof
(1): Suppose the statement is false. Then there exists an x ∈ X and δ > 0
such that for all 0 < η < δ µ(∂B(x, η)) > 0. Let {ηi }∞ i=1 be a sequence of
distinct real numbers P∞ satisfying 0 < ηi < δ and ηi ↑ δ for i → ∞. Then,
∞
µ(∪i=1 ∂B(x, ηi )) = i=1 µ(∂B(x, ηi )) = ∞, contradicting the fact that µ is
a probability measure on X.
(2): Fix δ > 0. By (1), for each x ∈ X, we can find an 0 < ηx < δ/2
such that µ(∂B(x, ηx )) = 0. The collection {B(x, ηx )|x ∈ X} forms an open
cover of X, so by compactness there exists a finite subcover which we denote
by β = {B1 , . . . , Bn }. Define α by letting A1 = B1 and for 0 < j ≤ n let
j−1
Aj = Bj − (∪k=1 Bk ). Then α is a partition of X, diam(Aj ) < diam(Bj ) < δ
and µ(∂Aj ) ≤ µ(∪ni=1 ∂Bi ) = 0, since ∂Aj ⊂ ∪ni=1 ∂Bi .
(3): Let x ∈ ∂(∩n−1 j=0 T
−j
Aj ). Then x ∈ ∩n−1
j=0 T
−j A , but x ∈
j / ∩n−1
j=0 T
−j
Aj .
−j −k
That is, every open neighborhood of x intersects every T Aj , but x ∈ / T Ak
Main Theorem 133
for some 0 ≤ k ≤ n − 1. Hence, x ∈ T −k Ak − T −k Ak and by continuity of

T k,
T k (x) ∈ T k (T −k Ak ) − Ak ⊂ Ak − Ak = ∂Ak
Thus, ∂(∩n−1
j=0 T
−j
Aj ) ⊂ ∪n−1
j=0 T
−j
∂Aj and the statement readily follows.
(4): Recall that µn → µ in M (X) for n → ∞ if and only if:
Z Z
lim f (x)dµn (x) = f (x)dµ(x)
n→∞ X X
for all f ∈ C(X). Let A be as stated and define Ak = {x ∈ X|d(x, A) > k1 },

k ∈ Z>0 . Then, since A and X − Ak are closed, it follows from theorem 7.1.1
that for every k there exists a function fk ∈ C(X) such that 0 ≤ fk ≤ 1,
fk (x) = 1 for all x ∈ A and fk (x) = 0 for all x ∈ X − Ak . Now, for each k,
Z Z Z
lim µn (A) = lim 1A (x)dµn (x) ≤ lim fk (x)dµn (x) = fk (x)dµ(x)
n→∞ n→∞ X n→∞ X X
Note that fk ↓ 1A in µ-measure for k → ∞, so by the Dominated Convergence

Theorem,
Z
lim µn (A) ≤ lim fk (x)dµ(x) = µ(A) = µ(A)
n→∞ k→∞ X
where the final inequality follows from the assumption µ(∂A) = 0. The proof
of the opposite inequality is similar.
Lemma 8.1.4 Let q, n be integers such that 1 < q < n. Define, for 0 ≤ j ≤
q − 1, a(j) = b n−j
q
c, where bc means taking the integer part. Then we have
the following
1. a(0) ≥ a(1) ≥ · · · ≥ a(q − 1)
2. Fix 0 ≤ j ≤ q − 1. Define
Sj = {0, 1, . . . , j − 1, j + a(j)q, j + a(j)q + 1, . . . , n − 1}
Then
{0, 1, . . . , n − 1} = {j + rq + i|0 ≤ r ≤ a(j) − 1, 0 ≤ i ≤ q − 1} ∪ Sj
and card(Sj ) ≤ 2q.

3. For each 0 ≤ j ≤ q − 1, (a(j) − 1)q + j ≤ b n−j

q
− 1cq + j ≤ n − q. The
numbers {j + rq|0 ≤ j ≤ q − 1, 0 ≤ r ≤ a(j) − 1} are distinct and no
greater than n − q.
The three statements are clear after a little thought.
Example 8.1.1 Let us work out what is going on in lemma 8.1.4 for n =
10, q = 4. Then a(0) = b 10−0 4
c = 2 and similarly, a(1) = 2, a(2) = 2
and a(3) = 1. This gives us S0 = {8, 9}, S1 = {0, 9}, S2 = {0, 1} and
S3 = {0, 1, 2, 7, 8, 9}. For example, S3 is obtained by taking all nonnegative
integers smaller than 3 and adding the numbers 3+a(3)q = 7, 3+a(3)q+1 = 8
and 3 + a(3)q + 2 = 9. Note that card(Sj ) ≤ 8. One can readily check all
properties stated in the lemma, e.g. (a(3) − 1)q + 3 ≤ b 10−34
− 1c · 4 + 3 =
3 ≤ 6 = n − q.
We shall now finish the proof of our main theorem. The proof is presented
in a logical order, which is in this case not quite the same as thinking order.
Therefore, before plunging into a formal proof, we will briefly discuss the
main ideas.
We would like to construct a Borel probability measure µ with measure the-
oretic entropy hµ (T ) ≥ sep(ε, T ) := limn→∞ n1 log(sep(n, ε, T )). To do this,
we first find a measure σn with Hσn ( n−1 −i
W
i=0 T α) equal to log(sep(n, ε, T )),
where α is a suitably chosen partition. This is not too difficult. The problem
is to find an element of M (X, T ) with this property. Theorem (6.1.5) sug-
gest a suitable measure µ which can be obtained from the σn . The trick in
lemma 8.1.4 plays a crucial role in getting from an estimate for W Hσn to one
for Hµ . The idea is to first remove the ‘tails’ Sj of the partition n−1 −i
i=0 T α,
make a crude estimate for these tails and later add them back on again.
Lemma 8.1.3 fills in the remaining technicalities.
Theorem 8.1.2 Let T : X −→ X be a continuous transformation of a

compact metric space. Then h(T ) ≤ sup{hµ (T )|µ ∈ M (X, T )}.
Proof
Fix ε > 0. For each n, let En be an (n, ε)-separated set of P cardinal-

ity sep(n, ε, T ). Define σn ∈ M (X) by σn = (1/sep(n, ε, T )) x∈En δx ,
Main Theorem 135
where δPx is the Dirac measure concentrated at x. Define µn ∈ M (X) by

µn = n n−1
1 −i
i=0 σn ◦T . By theorem 19, M (X) is compact, hence there exists a
subsequence {nj }∞ j=1 such that {µnj } converges in M (X) to some µ ∈ M (X).
By theorem (6.1.5), µ ∈ M (X, T ). We will show that hµ (T ) ≥ sep(ε, T ), from
which the result clearly follows.
By (2) of lemma 8.1.3, we can find a µ-measurable partition α = {A1 , . . . , Ak }
of X such that diam(Aj ) < ε and µ(∂Aj ) = 0, j = 1, . . . , k. We may as-
Wn−1
sume that every x ∈ En is contained in some Aj . Now, if A ∈ i=0 T −i α,
then A cannot contain more than one element of En . Hence, W σn (A) = 0 or
k n−1 −i
1/sep(n, ε, T ). Since ∪j=1 Aj contains all of En , we see that Hσn ( i=0 T α) =
log(sep(n, ε, T )).
Fix integers q, n with 1 < q < n and define for each 0 ≤ j ≤ q − 1 a(j) as in
lemma 8.1.4. Fix 0 ≤ j ≤ q − 1. Since
n−1 a(j)−1 q−1
_ _ _ _
−i −(rq+j)
T α=( T ( T −i α)) ∨ ( T −i α)
i=0 r=0 i=0 i∈Sj
we find that
n−1
_
log(sep(n, ε, T )) = Hσn ( T −i α)
i=0
a(j)−1 q−1
X _ X
−rq−j
≤ Hσn (T ( T −i α)) + Hσn (T −i α)
r=0 i=0 i∈Sj
a(j)−1 q−1
X _
≤ Hσn ◦T −(rq+j) ( T −i α) + 2q log k
r=0 i=0
Here we used proposition (4.2.1)(d) and lemma 8.1.1. Now if we sum the
above inequality over j and divide both sides by n, we obtain by (3) and (1)
of lemma 8.1.4
n−1 q−1
q 1X _ 2q 2
log(sep(n, ε, T )) ≤ Hσn ◦T −l ( T −i α) + log k
n n l=0 i=0
n
q−1
_ 2q 2
≤ H µn ( T −i α) + log k
i=0
n
136 Measures of Maximal Entropy
By (3) of lemma 8.1.3, each atom A of q−1 −i

W
i=0 T α has boudary of µ-measure
zero, so by (4) of the same lemma, limj→∞ µnj (A) = µ(A). Hence, if we
replace n by nj in the above inequality and take the limit for j → ∞ we
obtain qsep(ε, T ) ≤ Hµ q−1 −i
W
i=0 T α. If we now divide both sides by q and
take the limit for q → ∞, we get the desired inequality.
Corollary 8.1.1 (The Variational Principle) The topological entropy of a

continuous transformation T : X −→ X of a compact metric space X is
given by h(T ) = sup{hµ (T )|µ ∈ M (X, T )}
To get a taste of the power of this statement, let us recast our proof of the
invariance of topological entropy under conjugacy ((2) of theorem 7.2.4). Let
φ : X1 → X2 denote the conjugacy. We note that µ ∈ M (X1 , T1 ) if and only
if µ ◦ φ−1 ∈ M (X2 , T2 ) and we observe that hµ (T1 ) = hµ◦φ−1 (T2 ). The result
now follows from the Variational Principle. It is that simple.
8.2 Measures of Maximal Entropy

The Variational Principle suggests an educated way of choosing a Borel prob-
ability measure on X, namely one that maximizes the entropy of T .
Definition 8.2.1 Let X be a compact metric space and T : X −→ X be

continuous. A measure µ ∈ M (X, T ) is called a measure of maximal entropy
if hµ (T ) = h(T ). Let Mmax (X, T ) = {µ ∈ M (X, T )|hµ (T ) = h(T )}. If
Mmax (X, T ) = {µ} then µ is called a unique measure of maximal entropy.
Example. Recall that the topological entropy of the circle rotation Rθ is

given by h(Rθ ) = 0. Since hµ (Rθ ) ≥ 0 for all µ ∈ M (X, Rθ ), we see that
every µ ∈ M (X, Rθ ) is a measure of maximal entropy, i.e. Mmax (X, Rθ ) =
M (X, Rθ ). More generally, we have h(T ) = 0 and hence Mmax (X, T ) =
M (X, T ) for any continuous isometry T : X −→ X.
Measures of maximal entropy are closely connected to (uniquely) ergodic

measures, as will become apparent from the following theorem.
Theorem 8.2.1 Let X be a compact metric space and T : X −→ X be

continuous. Then
137
• Mmax (X, T ) is a convex set.
• If h(T ) < ∞ then the extreme points of Mmax (X, T ) are precisely the
ergodic members of Mmax (X, T ).
• If h(T ) = ∞ then Mmax (X, T ) 6= ∅. If, moreover, T has a unique

measure of maximal entropy, then T is uniquely ergodic.
• A unique measure of maximal entropy is ergodic. Conversely, if T is

uniquely ergodic, then T has a measure of maximal entropy.
Proofs
(1): Let p ∈ [0, 1] and µ, ν ∈ Mmax (X, T ). Then, by exercise 8.1.3,
hpµ+(1−p)ν (T ) = phµ (T ) + (1 − p)hν (T ) = ph(T ) + (1 − p)h(T ) = h(T )
Hence, pµ + (1 − p)ν ∈ Mmax (X, T ).

(2): Suppose µ ∈ Mmax (X, T ) is ergodic. Then, by theorem 6.1.6, µ cannot
be written as a non-trivial convex combination of elements of M (X, T ). Since
Mmax (X, T ) ⊂ M (X, T ), µ is an extreme point of Mmax (X, T ). Conversely,
suppose µ is an extreme point of M (X, T ) and suppose there is a p ∈ (0, 1)
and ν1 , ν2 ∈ M (X, T ) such that µ = pν1 + (1 − p)ν2 . By exercise 8.1.3,
h(T ) = hµ (T ) = phν1 (T ) + (1 − p)hν2 (T ). But by the Variational Principle,
h(T ) = sup{hµ (T )|µ ∈ M (X, T )}, so h(T ) = hν1 (T ) = hν2 (T ). In other
words, ν1 , ν2 ∈ Mmax (X, T ), thus µ = ν1 = ν2 . Therefore, µ is also an
extreme point of M (X, T ) and we conclude that µ is ergodic.
For the second part, suppose that Mmax (X, T ) 6= ∅.
(3): By the Variational Principle, for any n ∈ Z>0 , we can find a µn ∈
M (X, T ) such that hµn (T ) > 2n . Define µ ∈ M (X, T ) by
∞ N ∞
X µn X µn X µn
µ= = +
n=1
2n n=1
2n
n=N
2n
But then,
N
X µn
hµ (T ) ≥ >N
n=1
2n
This holds for arbitrary N ∈ Z>0 , so hµ (T ) = h(T ) = ∞ and µ is a measure
of maximal entropy.
Now suppose that T has a unique measure of maximal entropy. Then, for
any ν ∈ M (X, T ), hµ/2+ν/2 (T ) = 21 hµ (T ) + 12 hν (T ) = ∞. Hence, µ = ν,
M (X, T ) = {µ}.
(4): The first statement follows from (2), for h(T ) < ∞, and (3), for
h(T ) = ∞. If T is uniquely ergodic, then M (X, T ) = {µ} for some µ.
By the Variational Principle, hµ (T ) = h(T ).
Example 8.2.1 For θ irrational Rθ is uniquely ergodic with respect to Lebesgue

measure. By the theorem above, it follows that Lebesgue measure is also a
unique measure of maximal entropy for Rθ .
Exercise 8.2.1 Let X = {0, 1, . . . , k − 1}Z and let T : X −→ X be the

full two-sided shift. Use proposition 7.3.2 to show that the uniform product
measure is a unique measure of maximal entropy for T .
We end this section with a generalization of the above exercise.
Proposition 8.2.1 Every expansive homeomorphism of a compact metric

space has a measure of maximal entropy.
Proof
Let T : X −→ X be an expansive homeomorphism and let δ > 0 be an

expansive constant for T . Fix 0 < ε < δ. Define µ ∈ M (X, T ) as in the
proof of theorem 8.1.2. Then, by the proof, hµ (T ) ≥ sep(ε, T ). We will show
that h(T ) = sep(ε, T ). It then immediately follows that µ ∈ Mmax (X, T ).
Pick any 0 < η < ε and let A be an (n, η)-separated set of cardinal-
ity sep(n, η, T ). By expansiveness, we can find for any x, y ∈ A some
k = kx,y ∈ Z such that
d(T k (x), T k (y)) > ε
By theorem 7.2.2, A is finite, so l := max{|kx,y | : x, y ∈ A} is in Z>0 . Now,

by our choice of l, for x, y ∈ T −l A we have
max d(T i (x), T i (y)) > ε

0≤i≤2l+n−1
139
So T −l A is an (2l + n, ε, T )-separated set of cardinality sep(n, η, T ). Hence,

1
sep(η, T ) = lim log(sep(n, η, T ))
n→∞ n
1
≤ lim log(sep(2l + n, ε, T ))
n→∞ n
1
≤ lim log(sep(2l + n, ε, T ))
n→∞ 2l + n
= sep(ε, T )
Conversely, if A is (n, ε)-separated then A is also (n, η)-separated, so
sep(ε, T ) ≤ sep(η, T ).
We conclude that sep(ε, T ) = sep(η, T ). Since 0 < η < ε was arbitrary, this
shows that h(T ) = limη↓0 sep(η, T ) = sep(ε, T ). This completes our proof.
From the proof we extract the following corollary.
Corollary 8.2.1 Let T : X −→ X be an expansive homeomorphism and let

δ be an expansive constant for T . Then h(T ) = sep(ε, T ), for any 0 < ε < δ.
At this point we would like to make a remark on the proof of the above propo-
sition. One might be inclined to believe that the measure µ constructed in
the proof of theorem 8.1.2 is a measure of maximal entropy for any continu-
ous transformation T . The reader should note, though, that µ still depends
on ε and it is therefore only because of the above corollary that µ is a mea-
sure of maximal entropy for expansive homeomorphisms.
Bibliography
[B] Michael Brin and Garrett Stuck, Introduction to Dynamical Systems,

Cambridge University Press, 2002.
[D] K. Dajani and C. Kraaikamp, Ergodic theory of numbers, Carus

Mathematical Monographs, 29. Mathematical Association of Amer-
ica, Washington, DC 2002.
[DF] Dajani, K., Fieldsteel, A. Equipartition of Interval Partitions and an

Application to Number Theory, Proc. AMS Vol 129, 12, (2001), 3453
- 3460.
[De] Devaney, R.L., 1989, An Introduction to Chaotic Dynamical Systems,

Second Edition, Addison-Wesley Publishing Company, Inc.
[G] Gleick, J., 1998, Chaos: Making a New Science, Vintage
[Ho] Hoppensteadt, F.C., 1993, Analysis and Simulation of Chaotic Sys-

tems, Springer-Verlag New York, Inc.
[H] W. Hurewicz, Ergodic theorem withour invariant measure. Annals of

Math., 45 (1944), 192–206.
[KK] Kamae, Teturo; Keane, Michael, A simple proof of the ratio ergodic
theorem, Osaka J. Math. 34 (1997), no. 3, 653–657.
[KT] Kingman and Taylor, Introduction to measure and probability, Cam-

bridge Press, 1966.
[M] Misiurewicz, M., 1976, A short proof of the variational principle for
a Zn+ action on a compact space, Asterique, 40, pp. 147-187
[Mu] Munkres, J.R., 2000, Topology, Second Edition, Prentice Hall, Inc.
141
[Pa] William Parry, Topics in Ergodic Theory, Reprint of the 1981 original.
Cambridge Tracts in Mathematics, 75. Cambridge University Press,
Cambridge, 2004.
[P] Karl Petersen, Ergodic Theory, Cambridge Studies in Advanced

Mathematics, 2. Cambridge University Press, Cambridge, 1989.
[W] Peter Walters, An Introduction to Ergodic Theory, Graduate Texts

in Mathematics, 79. Springer-Verlag, New York-Berlin, 1982.
Index
(X, d), 102 sep(n, ε, T ), 111

(n, ε)-separated, 111 span(n, ε, T ), 111
(n, ε)-spanning, 111
B(x, r), 112 Stationary Stochastic Processes, 12
Bn (x, r), 112
algebra, 7
F OT (x), 103
algebra generated, 7
H(α), 109 atoms of a partition, 58
Mmax (X, T ), 136
N (α), 109 Baker’s Transformation, 10
OT (x), 103 Bernoulli Shifts, 11
Qλ , 119 binary expansion, 10
Rθ , 120 Birkhoff’s Ergodic Theorem, 29
T −1 α, 109
X, 102 Cantor set, 125
α ∨ β, 109 circle rotation, 120
β-transformations, 10 common refinement, 58
Wn of open cover, 109
i=1 αi , 109
∂A, 132 conditional expectation, 35
cov(ε, T ), 113 conditional information function, 66
cov(n, ε, T ), 112 conservative, 82
d, 102 Continued Fractions, 13
ddisc , 103 diameter
dn , 111 of a set, 109
diam(A), 109 of open cover, 109
diam(α), 109 dynamical system, 47
h(T ), 114
h(T, α), 110 entropy of the partition, 57
h1 (T ), 111 entropy of the transformation, 61
h2 (T ), 114 equivalent measures, 79
sep(ε, T ), 134 ergodic, 20
143
ergodic decomposition, 40 monotone class, 7

expansive, 106
constant, 106 natural extension, 52
extreme point, 92 non-singular transformation, 80
factor map, 51 orbit, 103

first return time, 16 forward, 103
fixed point, 105
periodic point, 105
generator, 64, 106 phase portrait, 119
weak, 107 Poincaré Recurrence Theorem, 15
homeomorphism quadratic map, 118

conjugate, 107
Radon-Nikodym Theorem, 79
expansive, 106, 107
random shifts, 13
generator for, 106
refinement
minimal, 103
of open cover, 109
topologically transitive, 103
regular measure, 89
weak generator for, 107
Riesz Representation Representation
Hurewicz Ergodic Theorem, 83
Theorem, 90
induced map, 16
semi-algebra, 7
induced operator, 22
Shannon-McMillan-Breiman Theo-
information function, 66
rem, 70
integral system, 19
shift map
irreducible Markov chain, 42
full, 121
isometry, 105
strongly mixing, 45
isomorphic, 47
subadditive sequence, 60
Kac’s Lemma, 35 symbolic dynamics, 123
Knopp’s Lemma, 26
topological conjugacy, 107
Lebesgue number, 102 topological entropy
Lochs’ Theorem, 72 definition (I), 111
definition (II), 114
Markov measure, 41 of open cover, 109
Markov Shifts, 12 properties, 116
measure of maximal entropy, 136 wrt open cover, 110
unique, 136 topologically transitive, 103
measure preserving, 6 homeomorphism, 103
145
one-sided, 103
translations, 9
uniquely ergodic, 96
Variational Principle, 136
weakly mixing, 45

Dajani Dirksin Ergodic Theory

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Dajani Dirksin Ergodic Theory

Uploaded by

Copyright:

Available Formats

A Simple Introduction to Ergodic Theory

Karma Dajani and Sjoerd Dirksin

December 18, 2008

1 Introduction and preliminaries 5

2 The Ergodic Theorem 29

3 Measure Preserving Isomorphisms and Factor Maps 47

5 Hurewicz Ergodic Theorem 79

6 Invariant Measures for Continuous Transformations 89

7 Topological Dynamics 101

8 The Variational Principle 127

Introduction and preliminaries

1.1 What is Ergodic Theory?

The space X usually has a special structure, and we want T to preserve

If T is invertible, then one speaks of the two sided orbit

1.2 Measure Preserving Transformations

This definition implies that for any measurable function f : X → R, the

In case T is invertible, then T is measure preserving if and only if µ(T A) =

gives a useful tool for verifying that a transformation is measure preserving.

– if F1 ⊇ F2 ⊇ . . . are elements of C, then ∩∞

The monotone class generated by a collection S of subsets of X is the smallest

Theorem 1.2.1 Let A be an algebra of X, then the σ-algebra B(A) gener-

Theorem 1.2.2 Let (Xi , Bi , µi ) be probability spaces, i = 1, 2, and T : X1 →

Then, S2 ⊆ C ⊆ B2 , and hence A(S2 ) ⊂ C. We show that C is a monotone

Thus, E ∈ C. A similar proof shows that if F1 ⊇ F2 ⊇ . . . are elements of C,

for any cylinder set.

Exercise 1.2.1 Recall that if A and B are measurable sets, then

Show that for any measurable sets A, B, C one has

µ(A∆B) ≤ µ(A∆C) + µ(C∆B).

Another useful lemma is the following (see also ([KT]).

Lemma 1.2.1 Let (X, B, µ) be a probability space, and A an algebra gener-

Clearly, A ⊆ D ⊆ B. By the Monotone Class Theorem (Theorem (1.2.1)),

µ(A∆C) ≤ µ(A∆AN ) + µ(AN ∆C) < .

1.3 Basic Examples

Then, by considering intervals it is easy to see that T is measurable and

For any interval [a, b),

then T x = 2x − a1 (x). Now, for n ≥ 1 set an (x) = a1 (T n−1 x). Fix x ∈ X,

Exercise 1.3.1 Verify that T is invertible, measurable and measure preserv-

Then, T is not measure preserving with respect to Lebesgue measure (give

Exercise 1.3.2 Verify that T is measure preserving with respect to µ, and

where bi ∈ {0, 1} and bi bi+1 = 0 for all i ≥ 1.

(e) Bernoulli Shifts – Let X = {0, 1, . . . k − 1}Z (or X = {0, 1, . . . k − 1}N ),

µ ({x : x−n = a−n , . . . , xn = an }) = pa−n . . . pan .

Let T : X → X be defined by T x = y, where yn = xn+1 . The map T , called

T −1 {x : x−n = a−n , . . . xn = an } = {x : x−n+1 = a−n , . . . , xn+1 = an },

and it is easy to see that T is measurable and measure preserving.

On cylinder sets µ has the form,

+ (IP × µ) {ω0 = −1} ∩ σ −1 C × T A

+ IP {ω0 = −1} ∩ σ −1 C µ(A)

= IP(σ −1 C)µ(A) = IP(C)µ(A) = (IP × µ)(C × A).

Exercise 1.3.3 Show that T is not measure preserving with respect to

The last statement implies that

Theorem 1.4.1 (Poincaré Recurrence Theorem) If µ(B) > 0, then

1.5 Induced and Integral Transformations

Exercise 1.5.1 Show that n is measurable with respect to the σ-algebra

By Poincaré Theorem, n(x) is finite a.e. on A. In the sequel we remove

so that (A, F ∩ A, µA ) is a probability space. Finally, define the induced map

Exercise 1.5.2 Show that TA is measurable with respect to the σ-algebra

Proposition 1.5.1 TA is measure preserving with respect to µA .

Proof For k ≥ 1, let

T −1 A = A1 ∪ B1 and T −1 Bn = An+1 ∪ Bn+1 . (1.1)

µ(TA−1 C) = µ(T −1 C).

Figure 1.1: A tower.

µ T −1 (C) = µ(A1 ∩ T −1 C) + µ(B1 ∩ T −1 C)

= µ(A1 ∩ T −1 C) + µ(T −1 (B1 ∩ T −1 C))

µ(A∆C) ≤ µ(A∆AN ) + µ(AN ∆C) < .

µ(E∆T −n0 A) = µ(T −n0 E∆T −n0 A) = µ(E∆A) < ,

|µ(E) − µ((A ∩ T −n0 A))| ≤ µ E∆(A ∩ T −n0 A) < 2.

Since > 0 is arbitrary, it follows that µ(E) = µ(E)2 , hence µ(E) = 0 or 1.

X0 = {x ∈ X : ∃ 1 ≤ m ≤ M with fm (x) ≥ m min(f (x), L)(1 − )}

fm (T n x) ≥ m min(f (T n x), L)(1 − )

an = F (T n x) = L ≥ min(f (x), L)(1 − ) = bn .

F (x) + . . . + F (T N −1 x) ≥ (N − M ) min(f (x), L)(1 − ).