
A Ramble Through Probability


How I Learned to Stop Worrying and Love Measure Theory

Troy Butler, Don Estep, Nishant Panda

May 1, 2019

Here I shall present, without using Analysis, the principles and general results of the
Theory, applying them to the most important questions of life, which are indeed, for
the most part, only problems in probability. One may even say, strictly speaking, that
almost all our knowledge is only probable; and in the small number of things that we
are able to know with certainty, in the mathematical sciences themselves, the principal
means of arriving at the truth - induction and analogy - are based on probabilities, so
that the whole system of human knowledge is tied up with the theory set out in this essay.

P.-S. Laplace

In this work I try to give definitions that are as general and precise as possible to some
numbers that we consider in Analysis: the definite integral, the length of a curve, the
area of a surface.

H. Lebesgue (in his Ph.D. thesis!)

The theory of probability as a mathematical discipline can and should be developed
from axioms in exactly the same way as geometry and algebra.

A. Kolmogorov

The theory of probability, like all other mathematical theories, cannot resolve a priori
concrete questions which belong to the domain of experience. Its only role - beautiful
in itself - is to guide experience and observations by the interpretation it furnishes for
their results.

É. Borel

The statistician who supposes that [their] main contribution to the planning of an ex-
periment will involve statistical theory, finds repeatedly that [they] make [their] most
valuable contribution simply by persuading the investigator to explain why [they wish]
to do the experiment.

G. M. Cox

The combination of some data and an aching desire for an answer does not ensure that
a reasonable answer can be extracted from a given body of data.

J. Tukey

On two occasions I have been asked [by members of Parliament], “Pray, Mr. Babbage,
if you put into the machine wrong figures, will the right answers come out?” I am
not able rightly to apprehend the kind of confusion of ideas that could provoke such a
question.

C. Babbage

While you are young, and are not burdened by the need to take care of the earnings
for keeping the family, try to penetrate deeper into the essence of mathematics and to
appreciate its beauty.

A. Skorokhod

Arithmetic is where the answer is right and everything is nice and you look out the
window and see the blue sky – or the answer is wrong and you have to start over and
try again and see how it comes out this time.

C. Sandburg

Contents

Preface ix

I The Nature Trail 1

1 Setting the Scene 3

2 Preliminaries 5
2.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Cardinality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Sequences of sets . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Extended real number system . . . . . . . . . . . . . . . . . . . . 18
2.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Worked Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3 Law of Large Numbers, Weierstrass Approximation Theorem 21


3.1 Some discrete probability . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Weierstrass Approximation Theorem . . . . . . . . . . . . . . . . 30
3.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 Probability Model for Sequences of Coin Tosses 35


4.1 Probability model for sequences of coin tosses . . . . . . . . . . . 36
4.2 Weak Law of Large Numbers . . . . . . . . . . . . . . . . . . . . 43
4.3 Sets of measure zero . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Strong Law of Large Numbers . . . . . . . . . . . . . . . . . . . 54
4.5 Wish list for measure theory for Rn . . . . . . . . . . . . . . . . . 56
4.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.7 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 57

II The Foothills 59

5 Construction of a General Measure Structure 61


5.1 Sigma algebras . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Sets of measure zero, completion of measure . . . . . . . . . . . . 75


5.4 Outer measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Hausdorff measure . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.6 Premeasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.7 Approximation of measures . . . . . . . . . . . . . . . . . . . . . 99
5.8 Zoology of measure creatures . . . . . . . . . . . . . . . . . . . . 101
5.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.10 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 101

6 Measure Structure in Euclidean Space 103


6.1 Approximation of open sets . . . . . . . . . . . . . . . . . . . . . 104
6.2 Generating the σ-algebra . . . . . . . . . . . . . . . . . . . . . . 106
6.3 Borel, Lebesgue-Stieltjes measures on R . . . . . . . . . . . . . . 109
6.4 Borel, Lebesgue-Stieltjes measures in Rn . . . . . . . . . . . . . . 120


6.5 Regularity of Lebesgue-Stieltjes measure . . . . . . . . . . . . . . 128


6.6 Properties of Lebesgue measure . . . . . . . . . . . . . . . . . 130
6.7 Approximation of Lebesgue-Stieltjes measure . . . . . . . . . . . 132
6.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.9 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 135

7 Probability 137
7.1 Probability spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2 Examples of probability models . . . . . . . . . . . . . . . . . . . 145
7.3 “Infinitely often”, First Borel Cantelli Lemma . . . . . . . . . . . 158
7.4 Conditional probability, independence . . . . . . . . . . . . . . . 168


7.5 Second Borel-Cantelli Lemma . . . . . . . . . . . . . . . . . . . 173
7.6 Tail σ-algebras . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.8 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 176

8 Measurable Functions, Random Variables 177


8.1 Induced σ-algebras . . . . . . . . . . . . . . . . . . . . . . . . . 178
8.2 Measurable functions . . . . . . . . . . . . . . . . . . . . . . . . 180


8.3 Approximation by simple functions . . . . . . . . . . . . . . . . . 187
8.4 Relation to continuity . . . . . . . . . . . . . . . . . . . . . . 194
8.5 Measure induced by a measurable function . . . . . . . . . . . . . 195
8.6 Probability measures of random variables . . . . . . . . . . . . . . 198


8.7 Random vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
8.8 Properties of the Hausdorff measure in Rn . . . . . . . . . . . 210
8.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.10 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 211

9 Integration, Expectation 215


9.1 Simple functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
9.2 Nonnegative measurable functions . . . . . . . . . . . . . . . . . 221
9.3 General measurable functions . . . . . . . . . . . . . . . . . . . . 229
9.4 Measures induced by integration . . . . . . . . . . . . . . . . . . 234
9.5 Sequences of functions . . . . . . . . . . . . . . . . . . . . . . . 241
9.6 Some important inequalities . . . . . . . . . . . . . . . . . . . . . 250
9.7 Moments, variance, covariance . . . . . . . . . . . . . . . . . . . 252
9.8 Change of variables . . . . . . . . . . . . . . . . . . . . . . . . . 255


9.9 Integration of functions that depend on parameters . . . . . . . . . 259
9.10 Riemann and Lebesgue integration . . . . . . . . . . . . . . . 261
9.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
9.12 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 264

III The Peaks 267

10 Measures on Product Spaces 269


10.1 Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
10.2 Product σ-algebras . . . . . . . . . . . . . . . . . . . . . . . . . . 272
10.3 Monotone Class Theorem . . . . . . . . . . . . . . . . . . . . . . 273
10.4 Product measure for two factors . . . . . . . . . . . . . . . . . . . 274
10.5 Integration for a product space with two factors . . . . . . . . . . 280
10.6 Extension to product spaces with n factors . . . . . . . . . . . . . 284
10.7 Independent random variables . . . . . . . . . . . . . . . . . . . . 287
10.8 Convolution, sums of independent random variables . . . . . . . . 296


10.9 Countable collections of independent random variables . . . . . . 298
10.10 Application to a stochastic process model . . . . . . . . . . . . 303
10.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
10.12 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 306

11 Sequences of Random Variables 307


11.1 Sequences of random variables . . . . . . . . . . . . . . . . . . . 308


11.2 Laws of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . 312
11.3 Empirical distribution functions, medians, quantile functions . 316


11.4 Convergence of random series . . . . . . . . . . . . . . . . . . . . 325
11.5 Kolmogorov 0 − 1 Law for random variables . . . . . . . . . . 332
11.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
11.7 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 334

12 Sequences of Measures 335


12.1 Basic notions of weak convergence . . . . . . . . . . . . . . . . . 336
12.2 Convergence in distribution . . . . . . . . . . . . . . . . . . . . . 341
12.3 Tightness, relative compactness . . . . . . . . . . . . . . . . . . . 348
12.4 Convergence of densities, Scheffé’s Theorem . . . . . . . . . . . . 351
12.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
12.6 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 353

13 Lebesgue-Radon-Nikodym Theorem 355


13.1 Hahn, Jordan Decompositions . . . . . . . . . . . . . . . . . . . . 356
13.2 Absolute continuity, mutually singular measures . . . . . . . . . . 360
13.3 Lebesgue-Radon-Nikodym Theorem . . . . . . . . . . . . . . . . 362
13.4 Properties of the Radon-Nikodym derivative . . . . . . . . . . . . 364
13.5 Differentiation of a measure as a limit . . . . . . . . . . . . . . . . 366
13.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
13.7 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 373

14 Conditional Probability, Conditional Expectation 375


14.1 Special case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
14.2 Conditioning probability on a σ-algebra . . . . . . . . . . . . . . 377

14.3 Properties of conditional probability . . . . . . . . . . . . . . . . 384


14.4 Conditioning expectation on a σ-algebra . . . . . . . . . . . . . . 386
14.5 Properties of conditional expectation . . . . . . . . . . . . . . . . 388
14.6 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . 393
14.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
14.8 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 396

15 Lindeberg-Feller Central Limit Theorem 397


15.1 A preliminary result . . . . . . . . . . . . . . . . . . . . . . . . . 397
15.2 The Lindeberg condition . . . . . . . . . . . . . . . . . . . . . . 397
15.3 Fixed point iterations and zero-biased transformations . . . . . . . 397


15.4 Proof of the Lindeberg-Feller Central Limit Theorem . . . . . . . 397
15.5 Law of the Iterated Logarithm . . . . . . . . . . . . . . . . . . 397
15.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
15.7 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 397

16 Lp Spaces 399
16.1 Lp spaces for 1 ≤ p < ∞ . . . . . . . . . . . . . . . . . . . . . . 400
16.2 L∞ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
16.3 Completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
16.4 Approximation by simple functions and separability . . . . . . . . 410
16.5 Comparison of Lp spaces . . . . . . . . . . . . . . . . . . . . . . 411
16.6 Convergence in measure . . . . . . . . . . . . . . . . . . . . . . . 411
16.7 Hilbert space structure of L2 . . . . . . . . . . . . . . . . . . . . 412


16.8 Conditional probability and orthogonal projection . . . . . . . 420


16.9 Duality and the Riesz Representation Theorem . . . . . . . . . 421


16.10 Weierstrass Approximation Theorem . . . . . . . . . . . . . . 423
16.11 Constructing smooth approximations using convolution . . . . 423
16.12 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
16.13 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 423

17 Disintegration of Measures 425


17.1 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
17.2 Worked problems . . . . . . . . . . . . . . . . . . . . . . . . . . 425

A Solutions to Worked Problems 427


A.1 Solutions to problems in Chapter 5 . . . . . . . . . . . . . . . . . 427
A.2 Solutions to problems in Chapter 6 . . . . . . . . . . . . . . . . . 428
A.3 Solutions to problems in Chapter 8 . . . . . . . . . . . . . . . . . 430
A.4 Solutions to problems in Chapter 9 . . . . . . . . . . . . . . . . . 434

B To Do List 437

Bibliography 441

Index 445

Preface

We have a habit in writing articles published in scientific journals to make the work
as finished as possible, to cover up all the tracks, to not worry about the blind alleys
or describe how you had the wrong idea first, and so on. So there isn’t any place to
publish, in a dignified manner, what you actually did in order to get to do the work.
R. Feynman

A good stack of examples, as large as possible, is indispensable for a thorough understanding
of any concept, and when I want to learn something new, I make it my first job
to build one.

P. Halmos

For me, mathematics is a collection of examples; a theorem is a statement about a
collection of examples and the purpose of proving theorems is to classify and explain
the examples...

J. Conway

We often hear that mathematics consists mainly of proving theorems. Is a writer’s job
mainly that of writing sentences?

G.-C. Rota

It is impossible to be a mathematician without being a poet in soul.

S. Kovalevskaya

A professor is one who can speak on any subject – for precisely fifty minutes.
N. Wiener

Life is like riding a bicycle. To keep your balance, you must keep moving.

A. Einstein

This textbook is aimed at providing a mathematically capable reader with the means
to learn measure theory and probability theory together. One consequence of the de-
velopment of abstract measure theory and probability theory on parallel tracks is that
textbooks tend to focus heavily on one or the other. This is unfortunate, because abstract
measure theory and probability theory have a strongly positive symbiotic relationship
in terms of gaining understanding. Probability provides a practical and accessible mo-
tivation for the complicated details required in measure theory, while measure theory
lays bare the essential foundation of numerical approximation. So, this book attempts
to weave both measure theory and probability together.
Even so, there are many fine books on both measure theory and probability theory,
e.g., see [AG96, ADD00, Bil12, Bog00, Bre92, Chu01, Doo94, Dud92, Dur10, Fel68,
Fel66, Fol99, FG97, Gut13, HJ94, JP04, McK14, Par00, Pol02, Res05, Roy88, Ros00,
Swa94], and it is reasonable to ask if another book is necessary. In addition to melding
measure and probabilities theories, this book is designed to address several considera-
tions.
• Measure theory can be presented from a number of viewpoints using a multitude
of approaches including proofs for the major results. Each particular approach is
better suited for some aspects of the subject but less well suited for other aspects.
Over two decades of teaching measure theory and probability, we have come to
rely on a literal library of books, using different books as the basis for different
topics. But, it is not a trivial task for a reader to move between books, because
of differences in notation and definition, order of assumptions and results, and
details in statements of theorems. In some sense, a major part of the work for
this book involved abstracting the approaches from the books we like for various
topics and presenting them systematically with a unified development. We have
borrowed freely from many great textbooks and present detailed descriptions of
the books used as sources at the end of each chapter. In using this approach, we
seem to be following a well worn path among writers of textbooks for measure
theory and probability!
• It may be a consequence of the fact that the development of measure theory since
Lebesgue’s thesis has been focused on abstraction and generalization, but in any
case, graduate textbooks in measure theory and probability theory tend to use
very efficient proofs and have very few examples. By efficient, we mean that an
extremely high value is placed on brevity and abstraction. In our experience, these
are not qualities that are optimal for learning any mathematics. This book uses
“elementary” proofs that explain why theorems hold, when we know them, and
presents a number of examples illustrating definitions and results. This has the
side benefit of making this textbook a course in "How to do analysis".
• At its heart, measure theory is built on numerical approximation of “length, area,
and volume”. But this fact is certainly downplayed, if not hidden outright, in
graduate textbooks in both measure theory and probability theory. This book
emphasizes the numerical approximation aspects of measure theory in its devel-
opment and includes significant material on the application of measure theory to
numerical approximation and practical numerical measure theory that is difficult
to find in textbooks that use rigorous measure theory.
• Mathematics students do not often learn about probability theory, even though
most take a measure theory course. Statistics students tend to have far too little
training in analysis. Both groups learn far too little about numerical approxima-
tion. Students in the applied mathematical sciences, engineering, and science
often find it difficult to gain immediate practical sense from the abstract approach
of most textbooks in measure theory and probability even as rigorous probability
becomes increasingly important in science and engineering. This book aims to
provide a way through measure theory and probability that addresses these peda-
gogical issues.
On the other hand, this book does not present the most general setting for probability and
measure theory nor is the coverage of some advanced probability and measure theory
as comprehensive as found in other textbooks. All of these topics can be pursued after
learning the material in this book by readers that need specialized training.
What is covered in the book. This book presents the core material from measure the-
ory and measure-theoretic probability theory, i.e., the main concepts and theorems along
with numerous examples, albeit in an unusual arrangement. In addition, it covers the
foundation of stochastic computation and some important results in mathematical statis-
tics because both provide good applications of the core material. Those subjects are
not commonly covered in standard courses in probability. The unusual topics are well
marked and are not required to follow the development of the core material.
The book does not cover a few topics commonly found in other books. The main
omission is Fourier analysis and applications to probability including the common proof
of the Central Limit Theorem (we present another proof). Fourier analysis is an im-
portant subject that deserves a careful theoretical and computational treatment and the
authors find that the cursory treatment provided in probability textbooks is less than
useful. This book also omits stochastic processes. That is another beautiful, important
application of probability that deserves its own treatment. It is easy to find good text-
books in Fourier analysis, see [AH01, Bra86, Che01, Fol92, Fol99, Kat04, SW71], and
stochastic processes, see [Fel68, GS04, KT75, KT81, Res92, Ros96], for readers that
want to learn those topics. Also, this book does not develop very abstract measure the-
ory and probability theory, though it gives a taste with a discussion of measure theory
in abstract metric spaces. Again, it is easy to find good books at that level for readers
who need to study more than the core material.
This book also avoids developing machinery for “slick” proofs that is common
among probability books, e.g. the π − λ theorem. In the authors’ opinion, such proofs
do not do a good job of explaining why theorems are true and readers who learn prob-
ability that way tend to see probability as a collection of results rather than a coherent
method of reasoning, estimation and computation. Instead, this book uses what used to
be called "elementary proofs" based on approximation, convergence, and estimation. At
first glance, that makes some proofs longer. But such an approach has recurring patterns,
with the effect that proofs become easier to understand as the material progresses.
A guide to using the book. Every chapter has the same outline. They begin with a
section called “A panorama” that provides a picture of what will be covered. After the
main material, there is a section on references and another on worked problems. There
are many good books on measure theory and probability theory, and we link the material
presented in this book to the material in other textbooks. Where we have frequently
borrowed ideas about presentation and proofs, we note that in the reference section. We
encourage readers to look into other books to learn about different approaches. The
worked problems section presents a few complicated problems for which solutions are
provided in the back of the book.
In order to refrain from laboriously repeating assumptions in theorems, we set global
assumptions for a section or a chapter displayed prominently in a box with red borders
and purple background. It is very important to keep track of such global assumptions.
Any section that can be safely skipped is indicated with a bicycle.
We have divided the chapters into three parts.
Part I presents material on how probability arises as a model and how considering
apparently simple questions in probability quickly escalates into complicated mathe-
matics that can be addressed by measure theory. It also gives a picture of what is in-
volved in developing a general theory for measuring “length, area, and volume”, which
will help the reader when dealing with the complexities of measure theory. Only Chapter 2 in the preliminary material is required for what follows; however, the authors
strongly recommend reading through all of Part I before embarking on the rest of the
book.
We present the foundations of measure theory and measure-theoretic probability in
Part II. We begin with the construction of general measure theory in Chapter 5, then
focus on the application to construction of measures in Euclidean space. Following the
general measure theory, we develop the application to probability theory, and justify the
application by stating and proving several fundamental results. We conclude the part
by investigating integration and its probability counterpart, expectation.
In Part III, we develop more advanced topics in measure theory and probability.

Part I

The Nature Trail



Chapter 1

Setting the Scene

With caution judge of probability. Things deemed unlikely, e’en impossible, experience
oft’

Shakespeare

A very small cause which escapes our notice determines a considerable effect that we
cannot fail to see, and then we say that the effect is due to chance.
H. Poincaré

When things get too complicated, it sometimes makes sense to stop and wonder: ‘Have
I asked the right question?’

E. Bombieri

The origins of measure theory lie in the 1800s, with the start of a decades-long
effort to create a theory for integration that improves and generalizes the Riemann inte-
gral. Early contributors E. Borel, C. Jordan, J. Hadamard, and G. Peano recognize that
a general theory of integration would also involve generalizing the familiar notions of
length, area, and volume for simple geometric figures to more complicated sets. Build-
ing on Borel’s work in particular, H. Lebesgue establishes a systematic theoretical basis
for both measure and integration in his thesis published in 1902.
But that was only a launching point for measure theory, as it turned out that Lebesgue’s
theory could be generalized, abstracted, and applied in a number of ways. Early exten-
sions are made by C. Carathéodory, M. Fréchet, O. Nikodym, J. Radon, W. Sierpiński,
and T.-J. Stieltjes. The formative development of modern measure theory continues for
roughly five more decades, with contributions by many well-known mathematicians.
One of the highlights occurs in 1933, when A. Kolmogorov publishes a book in
which he proposes measure theory as a foundation for rigorous probability, based on
related work of E. Borel, F. Bernstein, F.P. Cantelli, M. Fréchet, P. Lévy, A. Markov,
H. Rademacher, E. Slutsky, H. Steinhaus, S. Ulam and R. von Mises. This lifts prob-
ability out of a disreputable state in the mathematical sciences and initiates a parallel
development in the theory of probability. This also initiates a blossoming of mathemat-
ical statistics.
Today, it would be impossible to overstate the central importance of measure theory
in the mathematical sciences. For example, measure theory provides the foundation
for a comprehensive theory of integration, probability theory, ergodic theory, stochastic
computation, mathematical statistics, theory of partial differential equations, spaces of
functions, functional analysis, and Fourier analysis. It is a theory that is both beautiful
and useful.

Chapter 2

Preliminaries

You know, many years ago – back in the 1930’s – I thought I was interested in the
foundations of mathematics. I even wrote a paper or two in the area. We had a brilliant
young logician on our faculty who was making a name for himself. One term I happened
to lecture in the same classroom, the hour immediately after he did. Each day, before
erasing the blackboard, I would take a look to see what he was up to. Well, in September
he started out with Theorem 1. Shortly before Christmas he was up to Theorem 747.
You know what it was? ’If x=y and y=z, then x=z’! At that point, something within
me snapped. I said to myself, ’There are some things in mathematics that you just have
to take for granted!’ And I never again had anything to do with the foundations of
mathematics.

G. Birkhoff

Later generations will regard Mengenlehre (set theory) as a disease from which one has
recovered.
H. Poincaré

There exists a Dedekind complete, chain ordered field, called the real numbers. It is
unique up to isomorphism...

E. Schechter

Panorama
Measure theory and probability are constructed in order to deal with complex sets that
arise when describing very practical situations. Directly or indirectly, what makes the
sets complicated has to do with limiting behavior of some kind of infinite process. In
this chapter, we describe some of the important concepts and operations for sets.
The key ideas are methods for combining sets in order to get new sets, how to define
functions and their inverses on sets, and how to count the number of elements in sets.
Some of the examples are aimed at helping the reader to think about complicated sets,
including the central idea of a set whose elements are themselves sets, i.e. a “set of
sets”. That idea leads to another central idea of an infinite sequence of sets. We also
explain the importance of distinguishing two “kinds” of infinity.


While this chapter is admittedly very dry, it is very important to everything that
follows. At least, it is relatively short!

2.1 Sets
Sets are familiar, of course.

Definition 2.1.1: Set

A set is a collection of objects called elements or points.

We use capital letters A, B to denote sets, and when we want to distinguish one
particular set, e.g., as the largest set that contains all other elements and sets of a certain
class of objects, we use the “blackboard” font X in measure theory and Ω in probability.
Some important examples with their notation:

Example 2.1.1

N = {0, 1, 2, 3, · · · } (natural numbers),
Z = {· · · , −3, −2, −1, 0, 1, 2, 3, · · · } (integers),
Q = set of rational numbers,
R = set of real numbers,
R+ = set of nonnegative real numbers,
C = set of complex numbers,
Rn = n-dimensional Euclidean space of vectors with real coefficients,
C([a, b]) = real-valued functions that are continuous on the interval [a, b].

A reader can properly complain that there is something circular about defining a
“set” as a “collection” and Definition 2.1.1 is only a statement of the labels we use
to denote the concept rather than a workable definition. In fact, the concept of set is
understood by the operations involving sets that can be performed. Perhaps the most
fundamental of these is “belonging to":

Definition 2.1.2

If A is a set, a ∈ A, or a belongs to A, if a is a point or element of A. We use
a ∉ A to indicate that a does not belong to A. If B is a set, then B ⊂ A (A ⊃ B)
means that every element of B is an element of A. We say that B is a subset of
A. We write A = B if A ⊂ B and B ⊂ A and say that A and B are equal. B is a
proper subset of A if B ⊂ A but A has an element not in B.

Recall the notation {indices : specified condition is satisfied} that is used to describe
subsets.

Example 2.1.2

The set of odd natural numbers is given by {k : k ∈ N, k = 2i + 1, some i ∈ N}.



One of the attractive qualities of measure theory is that it can be applied to familiar
sets in Euclidean space and sets in much more complicated spaces.

Example 2.1.3

The set of continuous nonnegative functions on [0, 1] is {f : f ∈ C([0, 1]), f(x) ≥ 0 for all 0 ≤ x ≤ 1}.

There is one special subset that has nothing in it.

Definition 2.1.3: Empty set

The empty set ∅ is the set that has no elements.

We always allow ∅ ⊂ A for any set A, which means ∅ is an empty subset. For any set
A, we always have a ∉ ∅ where a is any element in A.
At an elementary level, we think of a set in terms of its points. But, we can also
think of a set as defining the collection of its subsets, which is fundamental to measure
theory.

Definition 2.1.4: Power set

The power set PX for a set X is the set consisting of all subsets of X.

The power set of a set includes the set itself and the empty set.

Example 2.1.4

If X = {a, b, c}, then PX = {∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}}.
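
For a finite set, the power set can be enumerated mechanically. The following is a minimal Python sketch (our illustration, not part of the text) that lists all subsets by size; for a set with n elements it produces all 2^n subsets.

from itertools import combinations

def power_set(X):
    # Return the power set of X as a list of frozensets,
    # enumerating subsets of each size r = 0, 1, ..., |X|.
    X = list(X)
    subsets = []
    for r in range(len(X) + 1):
        for combo in combinations(X, r):
            subsets.append(frozenset(combo))
    return subsets

print(power_set({"a", "b", "c"}))   # 8 subsets, from frozenset() (the empty set) to the whole set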


Measure theory is built on sets and set operations. The main operations are:

Definition 2.1.5

Let A and B be sets.

A ∪ B = {a : a ∈ A or a ∈ B} (Union),
A ∩ B = {a : a ∈ A and a ∈ B} (Intersection),
A\B = A − B = {a : a ∈ A and a ∉ B} (Difference).

Note that A\B = ∅ is possible.


In the case that there is a largest or master set X, so that all sets under consideration
are subsets of X, then we define:

Definition 2.1.6: Complement of a set

For any subset A ⊂ X we denote,

Ac = X\A (Complement of A).

Another less familiar operation turns out to be important for measure theory:
i

8 Chapter 2. Preliminaries

Definition 2.1.7: Symmetric difference of sets

If A and B are sets, then

A △ B = (A\B) ∪ (B\A) (Symmetric Difference).

If the symmetric difference of two sets is “small”, the sets nearly coincide.
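
These operations correspond directly to the built-in set operations of many programming languages. A small Python sketch (ours, purely illustrative):

A = {1, 2, 3, 4}
B = {3, 4, 5}

print(A | B)    # union A ∪ B: {1, 2, 3, 4, 5}
print(A & B)    # intersection A ∩ B: {3, 4}
print(A - B)    # difference A\B: {1, 2}
print(A ^ B)    # symmetric difference (A\B) ∪ (B\A): {1, 2, 5}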
We frequently deal with operations and sums of collections of objects indexed by a
set. It is usually important to distinguish the case of the index set being a subset of the
natural numbers from other cases. We use roman letters, e.g. i, j, k, l, m, n, for
indices that take their values in a subset of the natural numbers and Greek letters for
indices in all other cases. Note that we drop the subscript index set A in the statements
when it is clear which index set is being considered.

Example 2.1.5


⋃_{i=0}^∞ (i, i + 1] = R+,    ⋃_{α∈[0,1]} {α} = [0, 1].

We collect the basic facts about these operations in the theorem below.

Theorem 2.1.1

Consider subsets A, {Aα : α ∈ A} of a set X. Then,

(⋃_{α∈A} Aα)^c = ⋂_{α∈A} Aα^c,
(⋂_{α∈A} Aα)^c = ⋃_{α∈A} Aα^c,
⋂_{α∈A} Aα ⊂ Aβ ⊂ ⋃_{α∈A} Aα for any β ∈ A,
A ∪ (⋂_{α∈A} Aα) = ⋂_{α∈A} (A ∪ Aα),
A ∩ (⋃_{α∈A} Aα) = ⋃_{α∈A} (A ∩ Aα),
A ⊂ B ⇒ B^c ⊂ A^c.

Proof. It is a good warmup exercise to prove this theorem. We prove the first result
to give some impetus. If x ∈ (⋃_{α∈A} Aα)^c, then x ∉ ⋃_{α∈A} Aα. This implies that
x ∉ Aα for any α. Hence, x ∈ Aα^c for all α. Therefore, x ∈ ⋂_{α∈A} Aα^c. Vice versa,
x ∈ ⋂_{α∈A} Aα^c ⇒ x ∈ Aα^c for all α. Hence, x ∉ Aα for any α, so x ∉ ⋃_{α∈A} Aα. Hence,
x ∈ (⋃_{α∈A} Aα)^c.
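
For finite collections of finite sets, the identities in Theorem 2.1.1 can also be checked mechanically. A minimal Python sketch, assuming a finite master set X and a three-set collection (both chosen arbitrarily for illustration):

X = set(range(10))                      # master set
As = [{0, 1, 2}, {2, 3, 4}, {4, 5, 6}]  # a finite collection of subsets

union = set().union(*As)
inter = set.intersection(*As)
comp = lambda S: X - S                  # complement relative to X

# De Morgan's laws: the complement of the union is the intersection of
# the complements, and vice versa
assert comp(union) == set.intersection(*[comp(S) for S in As])
assert comp(inter) == set().union(*[comp(S) for S in As])
print("De Morgan's laws hold on this example")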

Below, we use familiar product spaces such as Z2 = Z × Z and Rn. It is useful to have an abstract version.
i

2.2. Functions 9

Definition 2.1.8: Product of Sets

Let {Ai}_{i=1}^m be a collection of nonempty sets. The (Cartesian) product set is a
(new) set of ordered m-tuples, A1 × · · · × Am = ∏_{i=1}^m Ai = {(a1, a2, · · · , am) :
ai ∈ Ai, 1 ≤ i ≤ m}.

The order in the notation ( , · · · , ) is important, so A × B ≠ B × A, for example.

Example 2.1.6

Let A = {a, b, c} and B = {1, 2}. Then,

A × B = {(a, 1), (a, 2), (b, 1), (b, 2), (c, 1), (c, 2)}.

Products of sets behave nicely with respect to intersection.

Theorem 2.1.2

If A, B, C, D are sets, then

(A ∩ B) × (C ∩ D) = (A × C) ∩ (B × D).

However, it does not behave nicely with respect to other set operations, e.g.

(A ∪ B) × (C ∪ D) ≠ (A × C) ∪ (B × D),

in general. We illustrate in Fig. 2.1.


Figure 2.1. Left: Illustration of (A ∩ B) × (C ∩ D) = (A × C) ∩ (B × D). Right:
Illustration of (A ∪ B) × (C ∪ D) ≠ (A × C) ∪ (B × D).

2.2 Functions
Along with sets, measure theory and probability are also built on functions.

Definition 2.2.1: Function

Let X and Y be sets. A function f from X to Y, f : X → Y, is a rule that assigns
one element b ∈ Y to each element a ∈ X. We write

b = f(a),  a ∈ X.

Functions are also called maps, operators, and transformations.

Typically, operator is reserved for maps on infinite dimensional spaces, while map and
transformation usually indicate a function valued in a vector space of finite dimension
larger than 1. A function usually indicates a scalar-valued, e.g. real- or complex-valued, map.

Example 2.2.1

f (x) = x2 is a function from R to R+ .

Example 2.2.2
Definite integration on an interval [a, b], i.e. f(x) → ∫_a^b f(x) dx, is an operator
that maps C([a, b]) into R.

It turns out to be useful for intuition to distinguish a special case of functions.

Definition 2.2.2: Set function

Let X be a set. A set function is a function defined on a subset of PX.

Example 2.2.3

Let X be a set. The cardinality set function is defined,


|A| = ∞ if A has an infinite number of elements,
|A| = the number of elements in A if A has a finite number of elements,

for A ∈ PX. The double use of | · | for absolute value and cardinality ends up not
causing much trouble.

Example 2.2.4

Fix a continuous function f on R. Define the set function on finite intervals,


F([a, b]) = ∫_a^b f(x) dx,

using the familiar Riemann integral. One outcome of measure theory is to extend
the domain of this set function to a much larger class of sets.
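
To make the numerical-approximation point concrete, the set function F can be approximated by a Riemann sum. A Python sketch, with an integrand of our choosing (the helper riemann_F is hypothetical, not the book's):

def riemann_F(f, a, b, n=100_000):
    # Approximate F([a, b]) = integral of f over [a, b] by a left Riemann sum with n panels.
    h = (b - a) / n
    return sum(f(a + i * h) for i in range(n)) * h

f = lambda x: x ** 2
print(riemann_F(f, 0.0, 1.0))   # approximately 1/3: F assigns a number to each finite interval [a, b]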

A set function may be defined on sets consisting of single points. Vice versa, we
can think of a function defined on points of a set as a set function,

Definition 2.2.3

Let X and Y be sets and f : X → Y. For A ⊂ X, define,

f (A) = {f (a) : a ∈ A} ⊂ Y.

So, any function f defines a set function in a natural way.

There are two important sets associated with a function.

Definition 2.2.4

The domain of a function is the set of permissible inputs. The range of a function
is the set of all outputs of a function for a given domain.

In practice, there is some ambiguity in the definitions of domain and range. The
“natural” domain is the set of all inputs for which the function is defined, but we often
restrict this to some subset. Likewise, range is often used to refer to a set that contains
the actual range of a function, e.g. R and R+ both might be called the range of x2 for
x ∈ R. It is important to be very precise about the domain and range in measure theory
and probability. With this in mind, we define:

Definition 2.2.5

• A map f : X → Y is onto if for each b ∈ Y, there is an a ∈ X with f (a) = b.


• A map f : X → Y is 1-1 if for any a1, a2 ∈ X with a1 ≠ a2, f(a1) ≠ f(a2).

Example 2.2.5

f (x) = x2 is an onto but not 1-1 function from R to R+ .

Example 2.2.6

The operator f(x) → 2f(x) is an onto and 1-1 map from C([a, b]) to C([a, b]).

Example 2.2.7

Differentiation is an onto but not 1-1 operator from the set of continuously


differentiable functions on [a, b] to C([a, b]).

The concept of the inverse map to a function is centrally important to measure the-
ory. It is extremely important to pay attention to the domain and range in this situa-
tion.
Definition 2.2.6: Inverse map

Let f : X → Y be a map from domain X to range Y. The inverse image of a point


y ∈ Y is defined,
f −1 (y) = {x : x ∈ X, f (x) = y}. (2.1)

The inverse map or inverse of f is the map f −1 : Y → PX defined by (2.1).

Note that the inverse image of a point is a set in general.

Example 2.2.8
Consider f(x) = x2 mapping R to R+. Then f −1(y) = {−√y, +√y} for y ≥ 0.

Example 2.2.9

Consider f(x, y) = x2 + y2 mapping R2 to R+. f −1(z) is a circle in R2 of radius
√z for z ≥ 0.

Example 2.2.10

Consider f(x) = sin(x) mapping R to [−1, 1]. Then f −1(y) is an infinite set of
points for each −1 ≤ y ≤ 1.

Example 2.2.11

Consider f(x) = x3 mapping R to R. Then f −1(y) = y^{1/3} for y ∈ R. This is a


special case!

Example 2.2.12

Consider the average over [a, b] of a continuous function mapping C([a, b]) to R,
f(x) → (1/(b − a)) ∫_a^b f(x) dx. f −1(c) in C([a, b]) consists of the set of continuous
functions that have average value c.

The natural domain of the inverse map to a function f : X → Y is the range Y. The
range of the inverse map is a subset of the power set PX of the domain of the map. That
is, the inverse map to a function f : X → Y maps Y to a space whose points are sets
in X in general. We can interpret this another way:

Definition 2.2.7

Let f : X → Y be a map from domain X to range Y. The inverse map f −1
defines a space of equivalence classes on X, where a1 and a2 are equivalent if
f(a1) = f(a2).

Example 2.2.13

Consider differentiation on the set of quadratic polynomials with real coefficients,
which is onto the set of linear polynomials with real coefficients. The inverse of
differentiation, the “antiderivative” or integral, is

(d/dx)^{−1}(ax + b) = { (a/2)x2 + bx + c : c ∈ R }.

In other words, the antiderivative of a given linear polynomial is an equivalence
class of quadratic functions in which members of the class differ by an additive
constant.

Remark 2.2.1

In elementary mathematics, the term “invertible function” is defined to be an onto


and 1-1 function, and an almost moral quality is accorded to those functions, at
least judging by the number of exercises that are usually assigned to students. But,
that is really an abuse of language and that case is far too limiting. Measure theory
is built on the inverse of functions defined as equivalence classes.

Considering Def. 2.2.3, it is sometimes convenient to consider a “more symmetric”


formulation of the inverse map between the domain and range. First, we define,

Definition 2.2.8

Let f : X → Y be a map from domain X to range Y. The inverse image of a set


A ⊂ Y is defined,
f −1 (A) = {x : x ∈ X, f (x) ∈ A}.

It follows immediately that

Theorem 2.2.1

Let f : X → Y be a map from domain X to range Y. The inverse defines a map


f −1 : PY → PX .

The reason that inverse maps play such an important role in measure theory is that
the inverse of a map “commutes” with unions, intersections and complements.

Theorem 2.2.2

Let f : X → Y be a map from domain X to range Y, {Bα}α∈A be a collection of
subsets of Y, and B ⊂ Y. Then,

f −1(⋃_{α∈A} Bα) = ⋃_{α∈A} f −1(Bα),
f −1(⋂_{α∈A} Bα) = ⋂_{α∈A} f −1(Bα),
f −1(B^c) = (f −1(B))^c.

Proof. This is another good exercise. We prove the first result to give a start.

x ∈ ⋃_α Bα ⇔ x ∈ Bα0 for some α0 ⇔ f −1(x) ∈ f −1(Bα0) for some α0
⇔ f −1(x) ∈ ⋃_α f −1(Bα).

The importance of Theorem 2.2.2 is revealed below. Note that functions and set
operations do not commute the same way, e.g. f(B1 ∩ B2) ≠ f(B1) ∩ f(B2), in
general.
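
On finite sets, both Theorem 2.2.2 and this warning can be checked directly. A Python sketch (the helper names preimage and img are ours, not the book's):

def preimage(f, domain, B):
    # Inverse image f^{-1}(B) = {x in domain : f(x) in B}.
    return {x for x in domain if f(x) in B}

X = {-3, -2, -1, 0, 1, 2, 3}
f = lambda x: x * x                     # not 1-1 on X

B1, B2 = {1, 4}, {4, 9}
# inverse images commute with intersection (Theorem 2.2.2) ...
assert preimage(f, X, B1 & B2) == preimage(f, X, B1) & preimage(f, X, B2)

# ... but forward images need not: here f(A1 ∩ A2) differs from f(A1) ∩ f(A2)
img = lambda A: {f(x) for x in A}
A1, A2 = {-2, -1}, {1, 2}
print(img(A1 & A2), img(A1) & img(A2))  # set() versus {1, 4}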

2.3 Cardinality
We mentioned above that specifying the size, or cardinality, of an index set is important
in certain places. Formalizing that notion,

Definition 2.3.1

Two sets X and Y are equivalent or have the same cardinality, written X ∼ Y,
if there is a 1-1 and onto map f : X → Y.
• If X = ∅ or X ∼ {1, 2, . . . , n} for some n ∈ N, we say that X is finite.
• If X is finite or X ∼ N, we say that X is countable.
• If X is not empty, finite, or countable, we say that X is uncountable.

Note that there are different cardinalities among the uncountable sets but that is not
important for the material below.
Countable cardinality has some interesting properties. For example

Example 2.3.1

Z is countable. We use the 1-1 and onto map defined by 0 ↔ 0, 1 ↔ 1, −1 ↔ 2,


2 ↔ 3, −2 ↔ 4, ...

Example 2.3.2

The product space Z2 = {(x, y) : x ∈ Z, y ∈ Z} is countable. We arrange the
points in Z2 in an “infinite table” as shown in Fig. 2.2. We then number the points
as shown to obtain the onto and 1-1 correspondence. The numbering scheme is

(i, j) → k = (1/2)(i + j)(i + j − 1) − j + 1,  (i, j) ∈ N × N.
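
The numbering scheme is easy to test. A Python sketch (ours) that checks the map is 1-1 on a finite grid of pairs with i, j ≥ 1, as in the figure:

def k(i, j):
    # Diagonal numbering of pairs (i, j), i, j >= 1; (i + j)(i + j - 1) is always
    # even, so the integer division below is exact.
    return (i + j) * (i + j - 1) // 2 - j + 1

labels = {k(i, j) for i in range(1, 50) for j in range(1, 50)}
print(len(labels) == 49 * 49)   # True: distinct pairs receive distinct labels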

Example 2.3.3

Example 2.3.2 can be used to show that Q is countable, after recognizing that
numbers in Q can be written as the ratio of integers.

In fact, it follows from the definition that all countable sets are equivalent and, in-
deed, can be represented in the same way.

Figure 2.2. Showing that N ∼ Z2. (The points (i, j) are arranged in an infinite table
and numbered along successive diagonals.)

Theorem 2.3.1

A countable set X can be written as {a1, a2, a3, · · · }, where a1, a2, · · · enumerate
the points in X.

As we said, below we construct complicated sets using unions and intersections. A
crucial fact underlying the construction is the following.

Theorem 2.3.2

The countable union of countable sets is countable.

Proof. Since we are dealing with a countable number of sets, we can write them as
{Ai}_{i=0}^∞. Moreover, each Ai is countable, so we can write Ai = {a_{i,0}, a_{i,1}, a_{i,2}, · · · }.
Then we can use the numbering scheme shown in Fig. 2.2 to label ⋃_{i=0}^∞ Ai = {a_{i,j} : i ∈ N, j ∈ N}.

Investigation of the cardinality of non-countable sets is complicated and fairly esoteric. As an example, we use the “Cantor diagonalization argument” to show,

Theorem 2.3.3

The set of real numbers in (0, 1] is uncountable.

Proof. The proof is by contradiction. If (0, 1] is countable, then we can enumerate
the numbers in (0, 1] by {ri}_{i=1}^∞. We show that given any such enumeration, we can
“construct” a point that is not in {ri}_{i=1}^∞, which means that such an enumeration cannot
exist. The construction uses the Axiom of Choice.
First, we recall that a point x in (0, 1] can be represented in a decimal expansion,
x = .a1 a2 a3 · · · , where ai ∈ {0, 1, 2, · · · , 9} for all i. Some numbers have two such
decimal expansions, e.g. .629999999 · · · = .63. By always choosing the expansion
that ends in 9s when possible, each number in (0, 1] therefore has a unique decimal
expansion.

Suppose that {ri}_{i=1}^∞ is an enumeration of the points in (0, 1]. We can write

r1 = .a11 a12 a13 a14 · · ·


r2 = .a21 a22 a23 a24 · · ·
r3 = .a31 a32 a33 a34 · · ·
r4 = .a41 a42 a43 a44 · · ·
...
We construct a number x = .a1 a2 a3 a4 · · · ∈ (0, 1] using the procedure ai = 3 if
aii ≠ 3 and ai = 4 if aii = 3. This means that x is different from ri in the ith digit for
all i, so x ∉ {ri}_{i=1}^∞.
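
The diagonal construction is mechanical enough to automate. A Python sketch (ours), where a short list of digit strings stands in for the supposed enumeration:

def diagonal(expansions):
    # Build x = .a1 a2 a3 ... with ai = 3 if the i-th digit of r_i is not 3, else ai = 4.
    digits = []
    for i, r in enumerate(expansions):
        digits.append("4" if r[i] == "3" else "3")
    return "0." + "".join(digits)

rs = ["314159", "271828", "141421", "577215"]   # digits of r_1, ..., r_4 after the decimal point
print(diagonal(rs))   # 0.4333: differs from each r_i in the i-th digit, so it is not in the list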
all i, so x ∈ i=1 .

Using this, we can conclude that other sets are uncountable

Example 2.3.4

R5 , R+ , R− , {x : x ∈ R, x > 4} are all uncountable since they contain sets that


can be identified with (0, 1).

Example 2.3.5

C([a, b]) is uncountable since it contains the set of constant functions with values
in (0, 1].

The following result about cardinality of product sets is straightforward to prove.

Theorem 2.3.4

For product sets A × B, the cardinality of A × B is |A| |B| when A and B are sets
with finite cardinality. If A and B are countable sets, then A × B is countable. If
A and B are nonempty sets and either A or B or both are uncountable, then A × B
is uncountable.

2.4 Sequences of sets


It turns out that measure theory often deals with countable collections of sets, and we
discuss a few useful ideas.
The first notion presents conditions under which we can think of a collection of sets
as a sequence that has a limit.

Definition 2.4.1

Let {Ai}_{i=1}^∞ be a collection of subsets of a set X.

If A1 ⊂ A2 ⊂ A3 ⊂ · · · and ⋃_{i=1}^∞ Ai = A, then we say that {Ai}_{i=1}^∞ is a (monotone) increasing sequence of sets and that Ai converges to A. We denote this by Ai ↗ A and Ai ↑ A.

If A1 ⊃ A2 ⊃ A3 ⊃ · · · and ⋂_{i=1}^∞ Ai = A, then we say that {Ai}_{i=1}^∞ is a (monotone) decreasing sequence of sets and Ai converges to A. We denote this by Ai ↘ A and Ai ↓ A.

In either case, we say the sequence is monotone.

Some books use “sequence” when referring to both a collection and a monotone se-
quence of sets.

Example 2.4.1

Let Ai = (0, 1 − 1/i) for i ≥ 2. We have

A2 = (0, 1/2) ⊂ A3 = (0, 2/3) ⊂ A4 = (0, 3/4) ⊂ · · ·

and ⋃_{i=2}^∞ Ai = (0, 1).

Example 2.4.2

Let Ai = (0, 1 + 1/i) for i ≥ 1. We have

A1 = (0, 2) ⊃ A2 = (0, 3/2) ⊃ A3 = (0, 4/3) ⊃ · · ·

and ⋂_{i=1}^∞ Ai = (0, 1].

The following result is easy to show.

Theorem 2.4.1

Let {Ai}_{i=1}^∞ be a sequence of subsets of X.

1. If Ai ↗ A, then Ai = ⋃_{j=1}^i Aj.
2. If Ai ↗ A, then Ai^c ↘ A^c.
3. If Ai ↘ A, then Ai^c ↗ A^c.

In several situations, dealing with a collection of sets depends heavily on whether
or not the sets in the collection are non-intersecting.

Definition 2.4.2

A collection {Ai}_{i=1}^∞ of sets in X is (pairwise) disjoint if Ai ∩ Aj = ∅ for i ≠ j.
If {Ai}_{i=1}^∞ is disjoint, then ⋃_{i=1}^∞ Ai is called a union of disjoint sets or a disjoint
union.

Example 2.4.3

We have the disjoint union (0, 1] = (0, 1/2] ∪ (1/2, 3/4] ∪ (3/4, 7/8] ∪ (7/8, 15/16] ∪ · · · .

Some authors use a special notation to indicate the case of a disjoint union. In this
book, we always note that situation in words. When reading any book, it is important to
note how disjoint unions are indicated.

The next set of ideas is based on the observation that given two subsets A, B ⊂ X,
we can write the union as a disjoint union:

A ∪ B = A ∪ (B ∩ A^c).

Theorem 2.1.1 implies the following result, which turns out to be extremely useful in
measure theory proofs.

Theorem 2.4.2

Let {Ai}_{i=1}^∞ be a collection of subsets of X. Then,

1. Set A = ⋃_{i=1}^∞ Ai. Define the sequence of sets {Bj}_{j=1}^∞ via B1 = A1 and Bj = ⋃_{i=1}^j Ai for j ≥ 2. Then Bj ↗ A.

2. Define the collection of sets {Bj}_{j=1}^∞ via B1 = A1 and Bj = Aj \ (⋃_{i=1}^{j−1} Ai) for j ≥ 2. Then {Bj}_{j=1}^∞ is a disjoint collection of sets with ⋃_{i=1}^∞ Ai = ⋃_{i=1}^∞ Bi.
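
Part 2 of the theorem, often called "disjointification", is easy to state in code. A Python sketch on finite sets (the helper disjointify is ours):

def disjointify(As):
    # B_1 = A_1 and B_j = A_j minus (A_1 ∪ ... ∪ A_{j-1}): same union, pairwise disjoint.
    Bs, seen = [], set()
    for A in As:
        Bs.append(A - seen)     # remove everything already covered
        seen |= A
    return Bs

As = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
Bs = disjointify(As)
print(Bs)                                       # [{1, 2, 3}, {4}, {5}], pairwise disjoint
print(set().union(*As) == set().union(*Bs))     # True: the union is unchanged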

2.5 Extended real number system


It is convenient to treat ∞ as a number in some arithmetic computations. Formally, we
define,

Definition 2.5.1

The extended real number system is R̂ = R ∪ {−∞, ∞} with the rules

−∞ ≤ x ≤ ∞, x ∈ R̂,    −∞ < x < ∞, x ∈ R,
x ± ∞ = ±∞, x ∈ R,    ∞ + ∞ = ∞,    −∞ − ∞ = −∞,
x · (±∞) = ±∞, x > 0,    x · (±∞) = ∓∞, x < 0,    0 · (±∞) = 0.

Before the reader starts wondering why Calculus makes a lot of fuss about ∞, note that
these definitions are consistent with how we deal with sequences that may have infinite
limits (and may have been said to diverge). For that reason, we do not assign ∞ − ∞ a
value. Also, the convention that 0 · (±∞) = 0 is only permissible because in measure
theory that is usually the correct value that would be assigned with a careful analysis.
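
IEEE floating-point arithmetic models most, but not all, of these conventions, which makes for a useful comparison. A Python sketch (ours):

import math

inf = math.inf
print(5.0 + inf, 5.0 - inf)     # inf -inf  (x ± ∞ = ±∞)
print(2.0 * inf, -2.0 * inf)    # inf -inf  (x · (±∞) = ±∞ for x > 0)
print(inf + inf)                # inf       (∞ + ∞ = ∞)
print(inf - inf)                # nan: ∞ − ∞ is left undefined, as in Definition 2.5.1
print(0.0 * inf)                # nan: floating point does NOT adopt the measure-theory convention 0 · (±∞) = 0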
With these conventions, other structures associated with real numbers are extended
in the obvious way. In particular,

Definition 2.5.2

R̂+ = [0, ∞] = {x ∈ R̂ : 0 ≤ x ≤ ∞},
R̂n = {(x1, · · · , xn) : xi ∈ R̂, 1 ≤ i ≤ n}.

Definition 2.5.3

In R̂n, n ≥ 1, we write ∞ = (∞, · · · , ∞) and −∞ = (−∞, · · · , −∞).

Remark 2.5.1

It is important to keep in mind that the extended reals are not a field.

We incorporate these definitions into extensions of limit, supremum, and infimum.

Definition 2.5.4

If A ⊂ R̂ is nonempty and not bounded from below, we define inf A = −∞. If
A ⊂ R̂ is nonempty and not bounded from above, we define sup A = ∞.

It follows,

Theorem 2.5.1

Every nonempty subset of R̂ has an infimum and supremum. If {ai}_{i=1}^∞ is a
sequence in R̂, then lim sup ai and lim inf ai exist in R̂.

We extend the notion of convergence of a sequence to include ±∞ analogously,

Definition 2.5.5

A sequence {ai}_{i=1}^∞ ⊂ R̂ converges to ∞ if for every r ∈ R there is an m such
that ai ≥ r for i ≥ m. Likewise, a sequence {ai}_{i=1}^∞ ⊂ R̂ converges to −∞ if
for every r ∈ R there is an m such that ai ≤ r for i ≥ m.

It follows,

Theorem 2.5.2

• If {ai}_{i=1}^∞ is an increasing sequence in R̂, then {ai}_{i=1}^∞ converges to lim sup ai.
• If {ai}_{i=1}^∞ is a decreasing sequence in R̂, then {ai}_{i=1}^∞ converges to lim inf ai.
• A sequence {ai}_{i=1}^∞ in R̂ converges in R̂ if and only if lim sup ai = lim inf ai,
in which case lim ai = lim sup ai = lim inf ai.

Proof. The proof is a good exercise. It requires treating different cases depending on
whether or not ±∞ is involved.

We frequently deal with functions in relation to the extended numbers. Recall how
infinity is incorporated into limits of functions:

Definition 2.5.6

Let f be a function on R and a ∈ R.


• lim_{x→a} f(x) = ∞ if for every N there exists a δ > 0 such that f(x) > N for all x with |x − a| < δ.
• lim_{x→a} f(x) = −∞ if for every N there exists a δ > 0 such that f(x) < N for all x with |x − a| < δ.
• lim_{x→∞} f(x) = a if for every ε > 0 there is an M such that |f(x) − a| < ε for all x > M.
• lim_{x→−∞} f(x) = a if for every ε > 0 there is an M such that |f(x) − a| < ε for all x < M.
• lim_{x→∞} f(x) = ∞ if for every N there is an M such that f(x) > N for all x > M.
• lim_{x→∞} f(x) = −∞ if for every N there is an M such that f(x) < N for all x > M.
• lim_{x→−∞} f(x) = ∞ if for every N there is an M such that f(x) > N for all x < M.
• lim_{x→−∞} f(x) = −∞ if for every N there is an M such that f(x) < N for all x < M.

Using these definitions, we can extend the domain and/or range of some functions from
the real numbers to the extended real numbers.

Example 2.5.1

f(x) = x2 maps R̂ to R̂+ but f(x) = sin(x) cannot be extended to a function on
R̂ since lim_{x→±∞} sin(x) is not defined.
b

There is a particular case that is important for measure theory,

Definition 2.5.7

An extended real valued nonnegative function on a set X is a function that maps
X to R̂+.

2.6 References
2.7 Worked Problems

Chapter 3

Law of Large Numbers, Weierstrass Approximation Theorem

The conception of chance enters into the very first steps of scientific activity in virtue of
the fact that no observation is absolutely correct. I think chance is a more fundamental
conception than causality; for whether in a concrete case, a cause-effect relationship
holds or not can only be judged by applying the laws of chance to the observation.
M. Born

It is mainly the practice of gambling which sets certain stubborn minds against [the]
notion of the independence of successive events. They observe that heads and tails
appear about equally often in a long series of tosses, and they conclude that if there has
been a long run of tails, it should be followed by a head; they look at this as a debt owed
to them by the game. A little reflection will suffice, however, to convince anyone of the
childishness of this anthropomorphism.

É. Borel

The results concerning fluctuations in coin tossing show that the widely held beliefs about
the law of large numbers are fallacious. They were so amazing and so at variance
with common intuition that even sophisticated colleagues doubted that coins actually
misbehave as theory predicts.

W. Feller

It is likely that unlikely things should happen.

Aristotle

I guess I think of lotteries as a tax on the mathematically challenged.


R. Jones

Panorama
The main goal of this chapter is to begin exploring the deep connection between prob-
ability and analysis. That connection is fully described by measure theory. But it is not
at all obvious, which may be the reason measure theory was not applied to probability
until thirty-some years after it was developed.


Analysis becomes relevant when we ask more about probability than simply carry-
ing out familiar probability computations in discrete probability, e.g. chances of pick-
ing a particular card from a deck. For example, much deeper mathematical issues arise
when we play a probability game over and over, e.g. how is the probability of choosing a
king from a single shuffled deck related to the results that are obtained when we choose
at random a single card from each deck in a large, or infinite, collection of decks. That
relation is key to using probability as a model for physical phenomena and developing
a systematic way to carry out probability computations.
To give an example of a deep question in probability, we begin by explaining the
Law of Large Numbers, which partially answers the question of what happens when we
repeat a random experiment over and over. This theorem is a central theme because it is
connected to the use of probability as a mathematical model and we develop a number
of variations in subsequent chapters. Then, in a startling development, we demonstrate
the connection between probability and analysis by using the Law of Large Numbers to
prove the celebrated Weierstrass Approximation Theorem, which states that continuous
functions can be approximated arbitrarily well by polynomials.
We are still some distance from introducing measure theory. But the proofs in this
chapter are a good warmup.

3.1 Some discrete probability


We begin by recalling some discrete probability definitions and ideas that are used in
the rest of the book. As with the material in Chapter 1, it may be a good idea to keep an
elementary probability textbook nearby as a reference while we develop key probability
concepts from a rigorous measure theory point of view.
The following definitions are made in reference to an “experiment”, which may
be thought of as an abstract generalization of a scientific experiment. Thus, it is an
operation we carry out that produces a result that is subject to some degree of uncertainty
or possible variation. The outcomes of a scientific experiment may be subject to a
degree of uncertainty both because of intrinsic reasons, e.g. the physical system varies
in a way that is not completely predictable or at a scale that is not readily observable,
and for external reasons, e.g., due to experimental and observation errors.

Definition 3.1.1

An experiment is a repeatable procedure that yields exactly one outcome from a
specified set of possible results. The set of possible outcomes is called the sample
space. A trial is a single performance of the experiment. Discrete probability
refers to a situation in which the sample space is at most countable.

Often, an experiment may be associated with several different sample spaces.

Example 3.1.1

The experiment is to draw a card from a standard deck. We can classify the possible
outcomes in a number of ways, e.g.,
Sample space 1 A point in the space of 52 outcomes.
Sample space 2 A point in the space of two outcomes: red or black.
Sample space 3 A point in the space of 13 outcomes: {2, 3, 4, . . ., King, Ace}.
Note that sample spaces 2 and 3 are sets whose points are sets.
There is a special case of a sequence of trials that greatly simplifies analysis while
remaining very important in practice.

Definition 3.1.2

A sequence of trials of an experiment is independent if the outcome of one trial


in the sequence does not affect the outcomes of the later trials. We say these are
independent trials and the trials are independent.

Example 3.1.2

We typically assume that a sequence of coin tosses is independent.

Example 3.1.3

Consider a bag holding a collection of white and black marbles. In experiment 1,
we choose a marble without looking, mark its color, put it back in the bag, and
shake vigorously. In experiment 2, we choose a marble without looking, mark
its color, and discard it (it does not go back in the bag). It is reasonable to consider
the trials of experiment 1 to be independent, but the trials of experiment 2 are
certainly not independent.

We often wish to further group elements in a sample space.

Definition 3.1.3

Any collection or set of outcomes in a sample space is called an event. Individual


members (singleton sets) of the sample space are called (sample) points. We say
an event occurs in a trial if the outcome of the trial is a sample point that is in the
event.

Example 3.1.4

Consider Example 3.1.1. In sample space 1, black is an event with 26 points. In


sample space 2, black is an event with 1 point. Black is not an event in sample
space 3.

On a purely functional level, probability is a function defined on events in a sample


space satisfying specific rules.

Definition 3.1.4

Consider a sample space with m outcomes. Probabilities are numbers assigned
to events with the properties:
1. Each sample point is assigned a non-negative number called the probability.


2. The sum of the m probabilities of the m outcomes is 1.
3. If A is an event and P (A) is the probability of A, then P (A) is the sum of
probabilities of the outcomes in A.
P is generally reserved for probability. The probability (function) on the events
in the sample space is the function that assigns a probability to each event. For
later reference, probability is a non-negative finitely additive set function.

Probability is associated with “randomness” and “uncertainty” and the rules governing
probability reflect properties of the experiment. But, it is important to note that there
is nothing uncertain or random about the rules governing probability. The connection
to uncertainty or randomness comes through the interpretation of the probability values
placed on the outcomes and how those values are assigned.

Example 3.1.5

Consider the experiment of flipping a two-sided coin with a head side (H) and a
tail side (T ). The sample space is {H, T }. Given the complexity of modeling the
physics of the motion through the flip to the catch, we might assign probability
by assuming that each outcome is equally likely, i.e. P (H) = P (T ) = 1/2.
The randomness or uncertainty in the experiment is that, short of carrying out
a complex predictive physics simulation, we cannot predict which outcome will
occur before the toss is made.

In general, a common approach for assigning probabilities in the absence of any
information about probabilities of events is based on assuming each outcome is equally
likely. If the sample space has m outcomes, then P (any outcome) = 1/m. It follows
that,

P (event) = (number of outcomes in the event)/(total number of outcomes).

But, there are situations in which it is unclear how to apply that rule or if it is
applicable at all.

Example 3.1.6

An important application of probability is the modeling of the flow of a liquid through
a porous medium. A porous medium is a material structure that is permeated by
“pores” or “voids” that connect to form a complex network through which liquid
may pass. Some common examples include the ground and sponges, see Figure 3.1.
The individual voids may encompass a wide range of scales themselves, while the
scales of the voids in bulk are often much, much smaller than the size of the region
through which the fluid flows. Generally, it is too challenging to use
models for flow valid at the smallest scales, so the most common partial differen-
tial equation models consider “bulk” flow on a scale that is much larger than the
individual pores. Such models use a porosity parameter field that is obtained by
some kind of “upscaling” of the fine grained variation. For example, the porosity
might be obtained in a region by averaging (using the harmonic average) the fine
scaled porosity over the region. The variations in the porosity are often modeled
as a “random function” that perturbs the upscaled porosity.

Figure 3.1. An example of a porous medium with two “substrates”.

Example 3.1.7

We consider a sample space with six outcomes {a, b, c, d, e, f }. The set of events
is the power set, PX = {∅, {a}, {b}, · · · , {f }, {a, b}, {a, c}, · · · , {a, b, c, d, e, f }}.
Setting P (a) = .1, P (b) = .2, P (c) = .1, P (d) = .1, P (e) = .2, and P (f ) = .3,
then if B = {d, e, f }, P (B) = .1 + .2 + .3 = .6.
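
This computation is easy to mechanize. The following minimal Python sketch (the code and names are ours, added for illustration) encodes the three properties of Definition 3.1.4 for this example:

    # A sketch of Definition 3.1.4 for Example 3.1.7: probabilities are
    # assigned to outcomes, and P(A) is the sum over the outcomes in A.
    p = {"a": 0.1, "b": 0.2, "c": 0.1, "d": 0.1, "e": 0.2, "f": 0.3}
    assert abs(sum(p.values()) - 1.0) < 1e-12   # the m probabilities sum to 1

    def P(event):
        """Probability of an event, i.e., of a set of outcomes."""
        return sum(p[outcome] for outcome in event)

    print(P({"d", "e", "f"}))  # 0.6, as computed for B above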

A system involving probability has three ingredients.

Definition 3.1.5

In discrete probability, a sample space together with its power set and a set of
probabilities is called a probability space. If Ω is the sample space and P the
probability, then we write (Ω, PΩ , P ) to emphasize the three ingredients.

We give names to some special kinds of events,

Definition 3.1.6

A sure event must occur in an experiment, so it contains the entire sample space.
An almost sure event is an event with probability one. An event with probability
zero happens almost never. An impossible event never occurs in an experiment,
so it is the event with no outcomes.

Definition 3.1.7

If A is an event in a sample space, its complement Ac is the set of outcomes not


in A.

Probabilities must satisfy certain properties with respect to taking unions and inter-
sections of events. For example,

Theorem 3.1.1

If A, B are events in a probability space, then

P (A ∪ B) = P (A) + P (B) − P (A ∩ B). (3.1)

Definition 3.1.8

Two events in a probability space are (mutually) exclusive if they have no out-
comes in common.

Theorem 3.1.2

If A, B are exclusive events in a probability space, then

P (A ∪ B) = P (A) + P (B). (3.2)

In general, if {Ai : 1 ≤ i ≤ m} is a collection of m pairwise exclusive events, then,

P (A1 ∪ A2 ∪ · · · ∪ Am ) = Σ_{i=1}^{m} P (Ai ). (3.3)
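
For equally likely outcomes, (3.1)–(3.3) reduce to counting, which makes them easy to check mechanically. A small Python check (ours; the encoding of the deck is an arbitrary choice) on sample space 1 of Example 3.1.1:

    # Check (3.1) on the 52 equally likely outcomes of Example 3.1.1.
    from fractions import Fraction

    ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
    suits = ["club", "diamond", "heart", "spade"]
    deck = [(r, s) for r in ranks for s in suits]

    def P(event):
        return Fraction(len(event), len(deck))

    kings = {c for c in deck if c[0] == "K"}
    hearts = {c for c in deck if c[1] == "heart"}
    # P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
    assert P(kings | hearts) == P(kings) + P(hearts) - P(kings & hearts)
    print(P(kings | hearts))  # 4/13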

3.2 Law of Large Numbers


We work in a sample space associated with a given experiment. We assume that a
certain outcome O occurs with probability x when the experiment is conducted.

The Law of Large Numbers is a probabilistic statement about the frequency of oc-
currence of an outcome in a sequence of a large number of repeated independent trials
of an experiment. It is a result that we discuss several more times throughout the book.
In this section, we give an elementary proof of a simple version that does not require
measure theory.
Suppose that we do not know x. How might we determine it? If we conduct a single
trial, O might result or it might not. In either case, it gives little information about x.
However, if we conduct a large number m ≫ 1 of trials, intuition suggests O should
occur “approximately” xm times, at least most of the time. Another way of stating this
intuition is,

(number of times O occurs)/(total number of trials) ≈ probability of O.
But, we have to be wary about intuitive feelings:

Example 3.2.1

If we conduct a set of trials in which we flip a fair coin many times (m), we
expect to see around 50% (m/2) heads most of the time. However, it turns out the
probability of getting exactly m/2 heads in m flips is approximately

√(2/(πm)),

for m large. This goes to 0 as m increases.
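
This approximation, which can be derived from Stirling’s formula, is easy to check numerically. A short sketch (ours) compares the exact binomial probability with √(2/(πm)):

    # Compare the exact probability of exactly m/2 heads with sqrt(2/(pi*m)).
    from math import comb, pi, sqrt

    for m in (10, 100, 1000, 10000):
        exact = comb(m, m // 2) / 2**m     # C(m, m/2) / 2^m
        approx = sqrt(2 / (pi * m))
        print(m, exact, approx)
    # Both columns agree to a few digits for large m and tend to 0.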

It is certainly possible that we might have a run of either “good” or “bad” luck in the
sequence of experiments, which would undermine the intuition. So, we have to be
careful about how we phrase the intuition mathematically.

Theorem 3.2.1: Law of Large Numbers

Let ε > 0 and δ > 0 be given. If k is the number of times O occurs in m
independent trials, then the probability that k/m differs from x by less than δ is
greater than 1 − ε,

P ({k : |k/m − x| < δ}) > 1 − ε, (3.4)

for all m sufficiently large.

It is important to spend some time reading the conclusion of Theorem 3.2.1 and
understanding its meaning. The theorem does not say O will occur exactly xm times
in m trials nor that O must occur approximately xm times in m trials. The role of δ
is that it quantifies the way in which k/m approximates x, thus avoiding the issue in
Ex. 3.2.1. The role of ε is that it allows the possibility, however small in probability, that
the sequence of trials can produce a result that is not expected. For example, we could
have the (mis)fortune to obtain all heads in the sequence of trials. By making δ small,
we obtain a better approximation to x. By making ε small, we obtain the expected result
with higher probability. The cost in each case is having to conduct a large number m
of trials.
As well as being interesting in its own right, the Law of Large Numbers (LLN) is
centrally important to various aspects of probability theory. For example, it is important
in consideration of the role of probability theory as a mathematical model of reality. An
important component of mathematical modeling is “validation”, which roughly is the
process of verifying the usefulness of the model in terms of describing the system being
modeled. One aspect of validation is quantifying the accuracy of predictions made by
the model.
For example, we could try to describe the result of a particular coin flip for a fair coin
deterministically by using physics involving the initial position on the thumb, equations
describing the effects of force and direction of the flip, the effect of airflow, and so on.
This is a complex and computationally expensive undertaking. In the absence of such
a detailed computation for a particular flip, it is reasonable to believe that the outcome
for a fair coin is equally likely to be head or tails. The LLN describes how we could
validate the assignment of a probability to O through repeated trials.
The LLN can be proved using an elementary argument based on the binomial expan-
sion. Of course, the binomial coefficients and binomial expansions are very important
in probability. We briefly review the basic ideas, see [Est02] for a more detailed presen-
tation. This theorem was first proved by Jakob Bernoulli.

Definition 3.2.1

The binomial coefficient is

C(i, j) = i!/(j!(i − j)!), i, j ∈ N, j ≤ i.

Theorem 3.2.2

For any numbers a, b and positive integer m,

(a + b)^m = Σ_{k=0}^{m} C(m, k) a^k b^{m−k},

m(a + b)^{m−1} = Σ_{k=0}^{m} k C(m, k) a^{k−1} b^{m−k},

a(a + b)^{m−1} = Σ_{k=0}^{m} (k/m) C(m, k) a^k b^{m−k},

(1 − m^{−1}) a^2 (a + b)^{m−2} = Σ_{k=0}^{m} (k^2/m^2 − k/m^2) C(m, k) a^k b^{m−k}.

By setting a = x and b = 1 − x, we see that Theorem 3.2.2 implies

1 = Σ_{k=0}^{m} C(m, k) x^k (1 − x)^{m−k}.

Definition 3.2.2

The m + 1 binomial polynomials of degree m are

pm,k (x) = C(m, k) x^k (1 − x)^{m−k}, k = 0, 1, . . . , m.

The connection to probability is,

Theorem 3.2.3

If 0 ≤ x ≤ 1 is the probability of an event E, then pm,k (x) is the probability that


E occurs exactly k times in m independent trials.

Theorem 3.2.2 implies

Theorem 3.2.4

0 ≤ pm,k (x) ≤ 1, for all 0 ≤ x ≤ 1. (3.5a)

Σ_{k=0}^{m} pm,k (x) = 1, for all 0 ≤ x ≤ 1. (3.5b)

Σ_{k=0}^{m} k pm,k (x) = mx, for all 0 ≤ x ≤ 1. (3.5c)

Σ_{k=0}^{m} k^2 pm,k (x) = (m^2 − m)x^2 + mx, for all 0 ≤ x ≤ 1. (3.5d)
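
These identities are easy to verify numerically; a quick check (a sketch of ours, with a sample point chosen arbitrarily) in Python:

    # Verify (3.5b)-(3.5d) at a sample point.
    from math import comb

    def p(m, k, x):
        return comb(m, k) * x**k * (1 - x)**(m - k)

    m, x = 20, 0.3
    assert abs(sum(p(m, k, x) for k in range(m + 1)) - 1) < 1e-12          # (3.5b)
    assert abs(sum(k * p(m, k, x) for k in range(m + 1)) - m * x) < 1e-10  # (3.5c)
    s2 = sum(k**2 * p(m, k, x) for k in range(m + 1))
    assert abs(s2 - ((m**2 - m) * x**2 + m * x)) < 1e-9                    # (3.5d)
    print("identities (3.5b)-(3.5d) verified for m =", m, ", x =", x)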

One important use of binomial polynomials is proving the LLN.

Proof. We prove Theorem 3.2.1 by proving that given ε > 0 and δ > 0,

Σ_{0≤k≤m, |k/m−x|<δ} pm,k (x) > 1 − ε,

for all m sufficiently large.
Consider the complementary sum

Σ_{0≤k≤m, |k/m−x|≥δ} pm,k (x) = 1 − Σ_{0≤k≤m, |k/m−x|<δ} pm,k (x),

which is estimated

Σ_{0≤k≤m, |k/m−x|≥δ} pm,k (x) ≤ (1/δ^2) Σ_{0≤k≤m, |k/m−x|≥δ} (k/m − x)^2 pm,k (x) ≤ (1/(m^2 δ^2)) Tm (x),

where

Tm (x) = Σ_{k=0}^{m} (k − mx)^2 pm,k (x).

Using (3.5c) and (3.5d), we find Tm (x) = mx(1 − x), and so Tm (x) ≤ m/4, for all
0 ≤ x ≤ 1. Therefore,

Σ_{0≤k≤m, |k/m−x|≥δ} pm,k (x) ≤ 1/(4mδ^2) and Σ_{0≤k≤m, |k/m−x|<δ} pm,k (x) ≥ 1 − 1/(4mδ^2), 0 ≤ x ≤ 1. (3.6)

This shows the result.

Remark 3.2.1

It is interesting to consider how the final line implies the result. Given δ and ε, we
require

m ≥ 1/(4εδ^2).

This can be achieved uniformly with respect to the value of x. However, increasing
the accuracy by decreasing δ requires a very substantial increase in the number of
trials m. This adverse scaling occurs again, unfortunately.
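
The adverse scaling is visible numerically. The following sketch (ours; the choices of x, δ, and m are arbitrary) computes the exact tail sum in (3.6) for a fair coin and compares it with the bound 1/(4mδ^2):

    # Exact binomial tail versus the bound 1/(4*m*delta^2) from (3.6).
    from math import comb

    def p(m, k, x):
        return comb(m, k) * x**k * (1 - x)**(m - k)

    x, delta = 0.5, 0.05
    for m in (50, 200, 1000):
        tail = sum(p(m, k, x) for k in range(m + 1) if abs(k / m - x) >= delta)
        print(m, tail, 1 / (4 * m * delta**2))
    # The bound decays like 1/m; halving delta requires quadrupling m to keep
    # the same guarantee, which is the adverse scaling noted in the remark.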

3.3 Weierstrass Approximation Theorem


The Weierstrass Approximation Theorem states that a real-valued continuous function
on a closed interval can be approximated uniformly by a polynomial to any desired
accuracy over the interval. Over-familiarity with the commonplace use of polynomi-
als to approximate functions may weaken the sense of the importance of this fact. It
helps to consider the issues underlying two common ways to generate an approximate
polynomial for a function.
The most familiar approach is the Taylor polynomial, which is computed using the
set of values of a function and its derivatives at a point. The first issue is that this applies
only to functions with derivatives, not functions that are merely continuous. Another
issue is that this is a local approximation in the sense that the approximation is accurate
in a (generally small) neighborhood of the point where the derivatives are evaluated.
The second common approach is to use interpolation. For any set of n + 1 distinct
points in a given interval, there is a unique polynomial of degree at most n that
interpolates, or agrees in value with, the function at the points. An issue is that unless
special care is taken with regard to the construction of the interpolation points, it
becomes increasingly difficult to compute the approximate polynomial with increasing
n and, while the resulting polynomial matches the function exactly at the interpolating
points, it provides a poor approximation in between the interpolating points. Essentially,
interpolating polynomials of high degree must oscillate with extremely large amplitudes
in order to match a set of function values. This issue can be treated under additional
assumptions, e.g., an inner product yielding orthogonality.
Example 3.3.1

The function 1/(1 + x2 ) is a standard example used to show problems that can
arise for polynomial approximation. In Figure 3.2, we plot the Taylor polynomial
of degree 8 centered at x = 0 and an interpolating polynomial of degree 8. The
local nature of the approximation provided by a Taylor polynomial and the issue
of oscillation that can arise when interpolating at equally spaced points (known as
Runge’s phenomenon) are clearly visible.


Figure 3.2. Left: The Taylor polynomial of degree 8 for 1/(1 + x2 ) centered at x = 0.
Right: The polynomial of degree 8 that interpolates 1/(1 + x2 ) at equally spaced points.

Recall the metric space C([a, b]) of real-valued continuous functions on [a, b] with
the max/sup metric. We introduce a convenient way to quantify the continuous behavior
of a function. Recall that a function f that is continuous on [a, b] is actually uniformly
continuous. If f is uniformly continuous on [a, b], then for any δ > 0, the set of numbers

{|f (x) − f (y)| : x, y ∈ [a, b], |x − y| < δ},

is bounded, and this set must have a least upper bound.

Definition 3.3.1

Let f be continuous on [a, b]. The modulus of continuity of f is

κ(f, δ) = sup_{x,y∈[a,b], |x−y|<δ} |f (x) − f (y)|.

Now, we turn to the main results.

Theorem 3.3.1: Weierstrass Approximation

Assume f ∈ C([a, b]). Given ε > 0, there is a polynomial bm of sufficiently
high degree m such that

sup_{[a,b]} |f (x) − bm (x)| < ε.

We observe that the set of polynomials with rational coefficients is dense in the
space of all polynomials on [a, b] with respect to the sup metric. The set of polynomials
with rational coefficients is a countable set. We conclude,

Theorem 3.3.2

C([a, b]) is separable.

We actually prove Theorem 3.3.1 by showing how to construct a specific polynomial
approximation using special polynomials for a given continuous function. It is an easy
exercise to show that it suffices to treat [a, b] = [0, 1]. We partition [0, 1] using m + 1
uniformly spaced nodes xk = k/m for k = 0, 1, . . . , m.

Definition 3.3.2

The Bernstein polynomial of degree m for f on [0, 1] is

bm (f, x) = Σ_{k=0}^{m} f (xk ) pm,k (x).

The degree of bm (f, x) is at most m but not necessarily equal to m.
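
Bernstein polynomials are straightforward to evaluate directly from the definition. Here is a sketch (ours), used on the function of Example 3.3.2 below:

    # Evaluate the Bernstein polynomial b_m(f, x) of Definition 3.3.2.
    from math import comb, cos, pi

    def bernstein(f, m, x):
        """b_m(f, x) = sum over k of f(k/m) * p_{m,k}(x), for x in [0, 1]."""
        return sum(f(k / m) * comb(m, k) * x**k * (1 - x)**(m - k)
                   for k in range(m + 1))

    f = lambda t: cos(pi * t / 2)
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(x, f(x), bernstein(f, 3, x))   # compare with Figure 3.3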


Theorem 3.3.1 follows from
Theorem 3.3.3: Bernstein Approximation

Let f be a continuous function on [0, 1] and m be a positive integer. Then for
x ∈ [0, 1],

|f (x) − bm (f, x)| ≤ (9/4) κ(f, m^{−1/2}). (3.7)

Figure 3.3. Plot of b3 (f, x) for f (x) = cos(πx/2).

Example 3.3.2

We plot b3 (f, x) for f (x) = cos(πx/2) in Fig. 3.3.

Example 3.3.3

We plot the Bernstein approximations of degree 8, 16 and 32 for f (x) = 1/(1 + x^2)
on [−2, 2] in Figure 3.4. To compute the approximations, we use a change of
variables to transform between [−2, 2] and [0, 1]. For example, the approximation
of degree 8 is given by

b8 (1/(1 + 4^2 (x − 1/2)^2), x) = (x − 1)^8/5 − (32x^7 (x − 1))/13
− (32x(x − 1)^7)/13 + 14x^2 (x − 1)^6 − (224x^3 (x − 1)^5)/5 + 70x^4 (x − 1)^4
− (224x^5 (x − 1)^3)/5 + 14x^6 (x − 1)^2 + x^8/5.

We emphasize that Bernstein polynomials are not interpolating polynomials!

Example 3.3.4

For x^2 on [0, 1],

bm (x^2, x) = Σ_{k=0}^{m} (k/m)^2 pm,k (x) = x^2 + (1/m) x(1 − x).

So, bm (x^2, x) ≠ x^2 for x ≠ 0, 1! However, |bm (x^2, x) − x^2| ≤ 1/(4m) → 0 as
m → ∞.

Figure 3.4. Bernstein approximations of degree 8, 16 and 32 for f (x) = 1/(1 + x^2)
on [−2, 2]. Compare to Figure 3.2.

Proof. Intuition for the construction of the Bernstein polynomial approximation is

gained by examining the error f (x) − bm (f, x). Using (3.5b), we write the error as

f (x) − bm (f, x) = Σ_{k=0}^{m} f (x) pm,k (x) − Σ_{k=0}^{m} f (xk ) pm,k (x).

The motivation is to try to get quantities of the form f (x) − f (xk ), which should be
small when x is close to xk . Collecting terms, we get

f (x) − bm (f, x) = Σ_{k=0}^{m} (f (x) − f (xk )) pm,k (x). (3.8)

Now we have to consider two cases: I in which x is close to an xk and the continuity of
f is relevant, and II in which x is not close to any xk . So, we split the sum on the right
of (3.8) into two parts. For δ > 0,

f (x) − bm (f, x) = Σ_{0≤k≤m, |k/m−x|<δ} (f (x) − f (xk )) pm,k (x) + Σ_{0≤k≤m, |k/m−x|≥δ} (f (x) − f (xk )) pm,k (x) = I + II.

δ quantifies how close x is to an xk . Now I is small by continuity,

|I| ≤ Σ_{0≤k≤m, |k/m−x|<δ} |f (x) − f (xk )| pm,k (x) ≤ κ(f, δ) · 1.

What about II, where the x are not close to any xk ? As m increases, the xk “fill
in” [0, 1], so the set of x that are not close to any xk should shrink. We prove that II is
small using the same argument used to prove the Law of Large Numbers. We note that
there is a C such that |f (x)| ≤ C, for 0 ≤ x ≤ 1. Hence, (3.6) in the proof of the LLN
implies

|II| ≤ 2C Σ_{0≤k≤m, |k/m−x|≥δ} pm,k (x) ≤ C/(2mδ^2).

So, we can make II as small as desired by taking m large. It is a good exercise to show
that in fact,

|II| ≤ κ(f, δ) (1 + 1/(4mδ^2)),

and so,

|f (x) − bm (f, x)| ≤ κ(f, δ) (2 + 1/(4mδ^2)).

Setting δ = m^{−1/2} proves the result.

3.4 References
3.5 Worked problems

Chapter 4

Probability Model for Sequences of Coin Tosses

Measure what is measurable, and make measurable what is not so.

G. Galilei

The simplest laws of natural science are those that state the conditions under which
some event of interest to us will either certainly occur or certainly not occur ... An event
A, which under complex conditions S sometimes occurs and sometimes does not occur,
is called random with respect to the complex of conditions.

A. Kolmogorov

... the main question is that of constructing a mathematical model which presents a
sufficiently close agreement with reality. With this main question resolved (or at least
partially resolved, since there are often several solutions), the role of the mathemati-
cian is limited to the study of the properties of the model, and this is a problem of pure
mathematics. The comparison of the results thus obtained with experience, and the de-
velopment of theories which make such comparisons possible, lie outside of the domain
of mathematics. Thus, the role of mathematics is limited, though very important, and
in one’s theoretical research, one must never lose sight of reality and one must always
check new ideas against observations and experience.

É. Borel

So never lose an opportunity of urging a practical beginning, however small, for it is
wonderful how often in such matters the mustard-seed germinates and roots in itself.

F. Nightingale

The sciences do not try to explain, they hardly even try to interpret, they mainly make
models. By a model is meant a mathematical construct which, with the addition of
certain verbal interpretations, describes observed phenomena. The justification of such
a mathematical construct is solely and precisely that it is expected to work.
J. von Neumann


Panorama
In this chapter, we develop a probability model for the experiment of tossing a coin
randomly an infinite number of times. The discrete probability model does not apply
to this experiment, so we have to develop a new one. In this new model, it turns out,
computing the probability of events, e.g., the event in which the sequence heads-heads-
tails-heads occurs infinitely many times, requires computing the size, or measure, of
complicated sets of real numbers. This raises the need to develop a theory for measuring
the size of complicated sets and for determining what is needed in the theory so that it
is useful for probability.
Ways in which a model is judged include fidelity of description, usefulness for gain-
ing understanding, and computational accessibility. The first has to do with how well
the model describes what is being modeled. The second has to do with how the model
helps us understand the system being modeled. The third has to do with whether or not
the model is useful in practical terms. We partially address the first two issues for the
probability model for coin flips, by using the model to ask and answer some interesting
questions in probability. In particular, we state and prove two different versions of the
Law of Large Numbers. We also begin exploring the somewhat maddening role of sets
of “measure zero” in the theory. Maddening because some of the key elements of mea-
sure theory are necessary to deal with such sets, yet in many practical situations we can
ignore them completely.
The proofs we give in this chapter are rigorous, provided we develop a rigorous
theory for measure with the required properties. But, we do not do that until later
chapters. This may cause some uneasiness in a reader accustomed to the usual style of
polished mathematical presentation which usually requires rigor at every stage starting
with the foundation. However, this presentation is much closer to how new mathematics
is discovered, which often involves creating structure out of conjecture, experiment, and
creativity, then returning to fill in the mathematical details, which may in turn require
changing the original creation. After developing rigorous measure theory, we revisit the
material in this chapter to verify that everything discussed is indeed rigorously justified.

4.1 Probability model for sequences of coin tosses


We begin by developing a correspondence between the set of infinitely long sequences
of coin tosses and the unit interval of real numbers. We then use intuition drawn from
some simple examples to propose a systematic method for assigning probabilities in the
space of infinite sequences of coin tosses by measuring the sizes of corresponding sets
of real numbers. However, it turns out that perfectly reasonable probability questions
about sequences of coin tosses correspond to very complicated sets of real numbers.
Thus, this approach to creating a probability model for sequences of coin tosses requires
developing an approach for measuring the size of complicated sets of real numbers.

4.1.1 Bernoulli sequences and the unit interval


We begin by describing the experiments.

Definition 4.1.1

Suppose an experiment has two possible outcomes and the probabilities of these
outcomes are fixed. A finite number of independent trials of the experiment is
called a Bernoulli trial. An infinite sequence of independent trials is called a
Bernoulli sequence.

Example 4.1.1

Let the experiment be the toss of a two-sided coin, with a head denoted (H) and
tails denoted (T ). We display some leading terms of 10 sequences generated ran-
domly from a fair coin with equal probability of heads or tails. The frequency of
occurrence of “runs”, or subsequences of repeated heads or tails, may be surpris-
ing.
HHTHHTTHHHTHHTHTTHHHHTHHHHHTHTHTTTTHHTHTTTHHTTTHHHTHHTTTHTHTHTHHHHHTTTHTHTHTT· · ·
THHHHTHHTHTTHHHTHTTTTHTHTHTHHHTTTHTHHHTTTHTHHHTTTHTHTTTTHHHTHHTHTTTTTTTTHHTTT· · ·
HTTTTTHHHTTTHTTTHHHTHTHTHTTHHTHHTTTTHHHHHTHHTHHHHHTTTTHTTTTTTHTTHHTTTTHTHHTTT· · ·
TTHTHHTHTTHHHTTHHTTHTHHHTTTHTHTHTHHHHTHTTHHTHHHHHHTTHTTTHHHTTHTTHTHHHTHHHHHTT· · ·
TTHHTTHTHHTTTHTTTTTTHHTHTTHHTHTHHHHHTTHTTHHHTTTHTHHTTTTTHTTHTTHTHHHTHTHHHTHTT· · ·
TTHTHHHTTTHHTHHHHTTHHTTTTTTHTHHHHHTHTHHTTTHHTHHTHTHTTTTHTHTTTHTTTHTHTHTHTTTHH· · ·
HHHHTHTTHTTHHTHTHHTHHHHTTTHHTHHHHHHHTTTTTTTHHHHHTTHTHHHTTHHHTTHTTHTTHTHTHHTTT· · ·
HTHTHTHHTTHHTHTHHTHHHHHTTTHHHTHTTTHTTTTTHHHHHHHHTHHHHTHTTHHHHHTHTHHHHHHHHHTTH· · ·
HHTHTTTTTTTTTHTHHTHTTTHHTTHHTTHTTTTHHTHHHTHHHHHHTTTTHTTTTHTHHHTTTTHTHTHHHHTTT· · ·
TTTHTTTHHTHTHHHTHTTTHTTTTTHTTTTTHTTTTTHHHTTHHHTHTTHTTHHHHHHTHHTTHTHHTTHTHHHTT· · ·
For the sake of comparison, here are the results of an unfair coin with proba-
bility 3/4 of heads and 1/4 of tails,
HHHTHHHTTHTHHHHHTHHHTHTHTTHHHHHTHHHHHHHHHTHTHHTHTHHTHTHHHHHHTTTTTHHHHHHHHHHHT· · ·
THHHHHHHHHHTHHHHHHHHTTHHHHTHHHTHHHTHHHHTTHHHTTTHTHTTHHHTHTHHHTHTHHHHHHTHHHHHH· · ·
HHTHHHHHHHHHHHHHHHHHHHHHHTHHHTHHHHTHHTHHHHTHTHHHHHHHHHTHTHHHHHHTTHHTHTHHHTHTT· · ·
HTHHTHHHHHHTHTTHHHHTHHHHHHHHHHHHHHHTHHHHHTTHHHHHHHHHHHHHTHHHHHHHHHHHHTHTHHHHT· · ·
HHHHHHHHTHHHHHTHHTHHHHHHTHHHHHHHHHHHHHHHHHTHHHTHHTHHHHHHHHHHHHTTTTHHTHTTHHHHH· · ·
HHHHHHTHTHHHHHHHTHHHHHHHTHHHHHHHHHTHHHHHHHHHHHHHHHTHHHTTHTHTHHTHTHHTTTTHHHHHT· · ·
THTHTHTHHHHHHTHHHTHHTHHHHHHHHHHHTHHHTHHHHHHHTHTHHHHTHHTHTHTTTTTHTTHHHTHHHHTHH· · ·
TTHHHHTHHHHTHTHHHHHHHHHHHHHTHHTHTHHHTHHTTTHHHHTHTTHHHHHHHHHHHHTHHTTTHHHHTHHHH· · ·
HHTHHHTHHHHHHHHTTHHHHTTHHHHTHHTHTHTHHTHHTHHTHHHHHHTHHHTTHHHHHTHHHHTHHTHHHHHTT· · ·
HHHTTHHHHHHHHHHTHHHTHHHTHHHHTHTHHTTHHHTTHHHHHHHHHHHTHHHHHHTHTHHHHHHHTHHTHHTHH· · ·

We collect all of the sequences resulting from such an experiment to form a space
(master set).

Definition 4.1.2

Fixing a particular experiment with two outcomes of fixed probabilities, we define
the space of Bernoulli sequences, B = { all Bernoulli sequences }. We use H
and T to denote the two outcomes.

For simplicity, we mostly treat the case where the outcomes have equal probability of
occurring, i.e. corresponding to a “fair coin”. The development can be extended to coins
where the two sides have different probabilities of occurring.
We show that B can almost be represented by the real numbers in Ω = (0, 1], which
implies that B is uncountable.

Theorem 4.1.1

If we delete a countable subset of B, we can index the remaining points using the
numbers in Ω.
Recall that by index, we mean there is a 1 − 1 and onto correspondence between the
two sets.
We give a name to the representation of the space of Bernoulli sequences:

Definition 4.1.3

Ω is called the sample space of the probability model.

The theorem implies that the model is not “perfect” in the sense that we cannot represent
all Bernoulli sequences in Ω. For consistency, we have to devise a method for assigning
probabilities that is not affected by the fact that some outcomes in the experimental
space B are not included.

Proof. We construct a map from Ω to B that fails to be onto by a countable subset. Any
point ω ∈ Ω can be written as an expansion in base 2, or fractional binary expansion,

ω = Σ_{i=1}^{∞} ai/2^i, ai = 0 or 1.

Each such expansion corresponds to a Bernoulli sequence. (We use ω for a point in
Ω instead of x to be consistent with common notation in probability.) To see this,
define the ith term of the Bernoulli sequence to be H when ai = 1 and T when ai = 0.
Example 4.1.2

0.10111001001 · · · → HTHHHTTHTTH · · · .
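
This correspondence is easy to compute. The sketch below (ours) extracts binary digits of ω by repeated doubling; note that floating point numbers have terminating expansions, so it produces the terminating representation rather than the non-terminating convention adopted just below:

    # First n terms of the Bernoulli sequence attached to omega via base 2 digits.
    def bernoulli_sequence(omega, n):
        tosses = []
        for _ in range(n):
            omega *= 2
            digit = int(omega)       # next binary digit a_i
            omega -= digit
            tosses.append("H" if digit == 1 else "T")
        return "".join(tosses)

    print(bernoulli_sequence(0.71875, 8))  # 0.71875 = 0.10111 in base 2 -> HTHHHTTT
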
A problem with using real numbers as an index set is the fact that some numbers do
not have a unique binary expansion (recall the proof of Theorem 2.3.3), but we consider
two Bernoulli sequences with different members to be distinct.

Example 4.1.3

1/2 = 0.1000 · · · = 0.0111 · · · , but HTTTT · · · ≠ THHH · · · .

Thus, this method to generate a Bernoulli sequence from a fractional binary expan-
sion does not define a function from Ω into B. To avoid this trouble, we adopt the
convention that if the real number ω has terminating and non-terminating binary expan-
sions, we use the non-terminating expansion. This is the reason for using Ω instead of
[0, 1].
With this convention, the method above defines a 1 − 1 map from Ω into B that is
not onto because it does not produce Bernoulli sequences ending in all T ’s. We claim
that the set BT of such Bernoulli sequences is countable. Let BTk be the finite set of
Bernoulli sequences that have only T ’s after the k th term. We have,

[
BT = BTk . (4.1)
k=1

Theorem 2.3.2 implies that BT is countable.

Remark 4.1.1

The decomposition of a countable set as a countable union of finite sets in (4.1) is
a standard measure theory argument.

There is a 1 − 1 and onto correspondence between Ω and B \ BT . In order to be
able to use Ω as a model for sequences of coin tosses, we ignore BT and abuse notation
to let B denote B \ BT . It would be fair to say that we are hiding the deficiencies of
the model! We address this issue in the way we assign probabilities.

4.1.2 Initial encounter with events and measure


With the description Ω of B in hand, we turn to assigning probabilities to events in B.
The idea is that we assign the probability of event A in B equal to the measure of the
corresponding subset IA ⊂ Ω.
Example 4.1.4

If AH is the event in B consisting of sequences where H is the first outcome, the
corresponding set in Ω is

IAH = {ω ∈ Ω : ω = 0.1a1 a2 a3 . . . , ai = 0 or 1} = (.5, 1].

Note that the largest number not in IAH is 0.100000 . . . while the largest number
in IAH is 0.11111 . . . . We do not include 1/2 because we use non-terminating
expansions. Likewise, if AT is the event where T occurs as the first outcome, then
IAT = (0, .5].
Since B is uncountable, discrete probability does not apply. However, the
tosses in a sequence are independent of each other, and assigning the probabilities
P (AH ) = .5 and P (AT ) = .5 seems reasonable. Interestingly, the lengths of
IAH and IAT are both .5. This suggests that the length or “measure” of IA might
correspond to the probability for A ⊂ B.
One issue with this approach is that we have to ignore the countable set BT in
order to use Ω as a model for B. To be consistent with assigning probability, BT
should be assigned probability 0. This means that the corresponding set IBT , which is
undoubtedly complicated, should be assigned a “length” or measure of 0. Likewise, it
turns out any finite or countable subset of Ω needs to be assigned a length or measure
of 0. This has a number of important ramifications and it motivates devising a way to
measure the size of sets of real numbers that applies to complex sets.
Lebesgue developed an approach to measure the sizes of complex sets of real num-
bers that is the basis for measure theory. Measure theory can be developed in a very
abstract way that applies to spaces of many different kinds of objects, though we focus
on spaces consisting of real numbers in this book. In that context, it is initially rea-
sonable to think of measure as a generalization of length in one dimension, and area
and volume in higher dimensions. But, we also caution that measures can have other
interpretations. For example, we use measure to quantify probability later on.
To fit common conceptions of measuring the sizes of sets, at a minimum, a measure
µ should satisfy some properties.

Definition 4.1.4: First Wish List for Measures

A measure µ is a real-valued function defined on a collection of subsets of a space
X called the measurable sets. If A is a measurable set, µ (A) is the measure of
A. At a minimum, the structure must satisfy:

(Non-negativity) µ should be non-negative.

(Closed under disjoint finite unions) If {Ai}_{i=1}^{m} is a finite collection of disjoint
measurable sets, then ∪_{i=1}^{m} Ai is measurable.

(Finite-additivity) If {Ai}_{i=1}^{m} is a collection of disjoint measurable sets, then,

µ (∪_{i=1}^{m} Ai) = Σ_{i=1}^{m} µ (Ai).

Thus, a measure is a non-negative finitely additive set function, just like a probability
function. There should be a connection here.
We pay particular attention to the case of real numbers:

Example 4.1.5: Lebesgue measure

If the space X is an interval of real numbers and the measurable sets include inter-
vals for which

µ ((a, b)) = µ ([a, b]) = µ ((a, b]) = µ ([a, b)) = b − a, a, b ∈ X,

we call µ the Lebesgue measure on X and write µ = µL .

Note that this implies that the measure of a set consisting of a single point is zero, i.e.,
µL ({a}) = 0.
It also implies that the Lebesgue measure is determined by assigning values on intervals.

4.1.3 Assigning probabilities to events


In Example 4.1.4, we assign probability to the event AH in B consisting of sequences
where H is the first outcome as P (AH ) = µL (IAH ) = .5. Likewise, we assigned
probability to AT as P (AT ) = µL (IAT ) = .5. These assignments are not inevitable.
Rather, they are a modeling choice. But, these assignments fit intuition or belief about
the probabilities of these two events.
We can continue along these lines. If we consider the events AHH , AHT , AT H ,
AT T in B in which the first two outcomes HH, HT , T H, T T are specified, the corre-
sponding intervals are
IAT T = (0, .25], IAT H = (.25, .5], IAHT = (.5, .75], IAHH = (.75, 1].
Since these intervals have equal length and are disjoint, we assign the probability of .25
to each and to each corresponding event.
We can continue with this argument, considering the events corresponding to specification
of the first three outcomes, then the first four outcomes, and so on. Considering
the events in which the first m outcomes are specified, we obtain 2^m disjoint intervals
of equal length, and assign equal probability 2^{−m} to each interval and thus each event.
In this way, we obtain a sequence of “binary” partitions Tm of Ω into 2^m disjoint
subintervals Im,j of equal length such that Ω = ∪_{j=1}^{2^m} Im,j , Im,j = ((j − 1)2^{−m} , j2^{−m} ],
j = 1, · · · , 2^m , see Figure 4.1. Note that we simplify notation by removing mention of
the event in B corresponding to each interval.

Figure 4.1. Illustration of the sequence of “binary” partitions Tm of Ω. We illustrate
an approximation of the interval (a, b) by subintervals in T5 .

We assign equal probabilities to each subinterval in a given partition and to the
corresponding events.
Using this assignment and the properties in the Wish List 4.1.4, we can compute the
probabilities of some interesting events.
Example 4.1.6

Consider the event A in which H is the mth outcome. Then,

IA = {ω ∈ Ω : ω = 0.a1 a2 . . . am−1 1 am+1 am+2 . . . , ai = 0 or 1}.

Let s = 0.a1 a2 . . . am−1 1, so IA contains (s, s + 2^{−m} ]. We can choose a1 , a2 ,
. . ., am−1 in 2^{m−1} different ways and each of the resulting intervals is disjoint
from the others, so we use finite additivity to conclude that,

P (A) = µL (IA ) = 2^{m−1} · 1/2^m = 1/2.

As a concrete example, consider m = 3. Then, we have the cases
HHH, HTH, THH, TTH, corresponding to 4 disjoint intervals of length 1/2^3 ,
and P (A) = 4/8 = 1/2.

Example 4.1.7

Let A be the event where exactly i of the first m outcomes are H, so

IA = {ω ∈ Ω : ω = 0.a1 a2 . . . am am+1 · · · , exactly i of the first m digits are 1
and the remaining digits are 0 or 1}.

Choose a1 , . . . , am so exactly i are 1 and set s = 0.a1 a2 . . . am . IA contains
(s, s + 2^{−m} ]. The intervals corresponding to different choices of a1 , . . . , am are
disjoint and there are exactly C(m, i) such intervals. So

P (A) = µL (IA ) = C(m, i) · 1/2^m .
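
Both examples can be checked by brute force, since the events depend only on the first m digits, and the 2^m choices of those digits index disjoint intervals of equal measure 2^{−m}. A sketch (ours, with a small m chosen for speed):

    # Brute-force check of Examples 4.1.6 and 4.1.7 over all length-m prefixes.
    from fractions import Fraction
    from itertools import product
    from math import comb

    m = 6
    prefixes = list(product((0, 1), repeat=m))   # each carries measure 2**-m

    def P(event):
        return Fraction(sum(1 for a in prefixes if event(a)), 2**m)

    print(P(lambda a: a[2] == 1))                 # H on the 3rd toss: 1/2
    i = 2
    print(P(lambda a: sum(a) == i))               # exactly i heads among m tosses
    print(Fraction(comb(m, i), 2**m))             # C(m, i)/2^m, the same value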

But, what about a general event, e.g., associated with an arbitrary interval (a, b] ⊂
Ω? Such intervals correspond to events in B that are difficult to describe. Figure 4.1
suggests an interesting conjecture. It appears that any interval (a, b] ⊂ Ω can be
“approximated” arbitrarily well by ∪_{Im,j ⊂(a,b]} Im,j in the sense that the intervals of
points not in the approximation, (a, b] \ ∪_{Im,j ⊂(a,b]} Im,j , shrink in size as m increases,
see Fig. 4.1. Since the lengths of the approximations ∪_{Im,j ⊂(a,b]} Im,j converge to
µL ((a, b]), we assign probability µL ((a, b]) to the event in B corresponding to (a, b].
In view of the Wish List 4.1.4 and the fact that µL (Ω) = 1, we extend this as a
general modeling approach.

Definition 4.1.5: A Measure Theory Model for Probability on B

If A is an event in B, we let IA denote the corresponding set of real numbers in
(0, 1]. Then, we assign the probability of A, denoted by P (A), to be µL (IA ).

All of this discussion is terribly vague, since we have not defined µL , described the
collection of measurable sets, or quantified the sense of approximation of sets observed
above! But, we show that these ideas lead to some interesting theorems in the next
couple of sections.

4.1.4 Recapping the construction of the model


We note that there are actually three steps in the construction of a probability model for
B:
Step 1 Identify the interval Ω with B;
Step 2 Map the set of observable events in B to the set of measurable sets in Ω;
Step 3 Assign probabilities to events in B via P (A) = µL (IA ), where IA is the subset
of Ω corresponding to A ⊂ B.
We present more examples of probability models in Section 7.2.
The use of measure theory as a model for probability is not entirely free from con-
troversy and there are alternative proposed frameworks. But it is fair to say that the
proposal of measure theory as a foundation for probability by Kolmogorov stands as
one of the great mathematical achievements of the Twentieth Century. The worthiness
of measure theory as a framework for probability is demonstrated in part by the ability
to state and prove important probabilistic results. We present a couple of examples in
the next two sections and further examples in later chapters.
The assigning of probabilities in Step 3 is also associated with a degree of con-
troversy. Partly, this is due to the fact that “randomness” is used to model various
situations, including systems that are truly stochastic in nature and systems whose state
is unknown but not truly stochastic. Even if a system is random, there may be limited
information on the probability values of different events, and when there is information,
it is often based on a finite set of observations. Above, we defined P (A) = µL (IA ) for
any event A based on what we assigned for some particularly simple events.
We conclude by noting that the model derived in this section can be applied to a
variety of situations.

Example 4.1.8

Ω can index the points in the space corresponding to the random throw of a dart
into a circular target of radius 1 and measuring the distance from the dart’s position
to the origin.

4.2 Weak Law of Large Numbers


We fix a particular experiment with two outcomes of fixed and equal probabilities.

Continuing the program of motivating measure theory as a model for probability in
B, we use it to state and prove some important results in probability. Of course, we
have not shown that it is possible to derive measures yet and we have only described
properties of measures under a lot of restrictions. But, we tackle those issues later. In
the meantime, we begin by revisiting the Law of Large Numbers.
Recall that intuition suggests that it should be possible to detect the probabilities
of H and T in B by examining the outcomes of many repetitions of the experiment.
In particular, the number of times that H occurs in a large number of trials should be
related to the probability of H. However, as discussed earlier, a precise statement of
this intuition is difficult to formulate. Assuming the probability of H is 1/2 and Sm is
the number of H’s that occur in the first m trials, then if we could show that

lim_{m→∞} Sm/m = 1/2,

then this would be a mathematical statement expressing the intuition. But such a result is
certainly false. A sequence of experiments could yield outcomes of all H’s, for example.
So, we need to create a careful formulation.
To state and prove the desired result, we introduce some functions.

Definition 4.2.1

A random variable is a function on the sample space.

The name “random variable” is a rather disconcerting name to assign to a function!
Expressing and proving results in probability by using random variables is a supremely
important technique.

Definition 4.2.2

For ω ∈ Ω, define the random variable,

Sm (ω) = a1 + · · · + am , where ω = 0.a1 a2 · · · am · · · .

Sm gives the number of heads in the first m outcomes of the Bernoulli sequence
corresponding to ω.

Following this example, a random variable can also be viewed as a function on the
outcomes of an experiment.
Given δ > 0, we define

Iδ,m = {ω ∈ Ω : |Sm (ω)/m − 1/2| > δ}. (4.2)

Roughly speaking, this is the event consisting of outcomes for which there are not
approximately the same number of H and T after m trials, where δ quantifies the
discrepancy.
We prove

Theorem 4.2.1: Weak Law of Large Numbers for Bernoulli Sequences

For fixed δ > 0,


µL (Iδ,m ) → 0 as m → ∞. (4.3)

An observant reader should be uncomfortable at this conclusion, because Iδ,m is an
apparently complicated set, and we have not yet specified a procedure for computing the
measure µL of complicated sets. Fortunately, during the proof, it becomes apparent that
Iδ,m is actually a finite collection of nonoverlapping intervals for which µL is defined.
By definition, (4.3) implies that for any fixed δ > 0, given any ε > 0,

µL (Iδ,m ) < ε,

for all sufficiently large m. Identifying µL with P , we see that (4.3) extends the earlier
Law of Large Numbers (3.4) to B.

Remark 4.2.1

The idea of measuring the size of the set where a function takes a specified range
of values is central to measure theory. However, such a set is not a finite collection
of disjoint intervals in general.

Before proving Theorem 4.2.1, we conduct a numerical experiment.

Example 4.2.1

We compute 5000 sequences of “numerical” coin flips for varying m, where each
sequence corresponds to a sample number 0 < ω̂i ≤ 1, 1 ≤ i ≤ 5000. We then
evaluate

D̂i = |Sm (ω̂i )/m − 1/2|. (4.4)

Finally, we compute a normalized histogram of {D̂i }_{i=1}^{5000}. We show plots of the
histograms in Figures 4.2 and 4.3. The results support the conclusion of the Law
of Large Numbers as increasing m has the consequence that more of the sequences
have small values of D̂i .
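
A sketch (ours) of this kind of experiment, with a pseudorandom generator standing in for the sample numbers ω̂i; we use fewer sequences than the text to keep the run time short:

    # Proportion of simulated sequences with D_i < 0.01, for increasing m.
    import random

    random.seed(0)
    n_sequences = 500
    for m in (100, 1000, 10000):
        D = [abs(sum(random.getrandbits(1) for _ in range(m)) / m - 0.5)
             for _ in range(n_sequences)]
        print(m, sum(1 for d in D if d < 0.01) / n_sequences)
    # The proportion of sequences near 0 grows with m, consistent with the
    # concentration visible in Figures 4.2 and 4.3.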

To prove Theorem 4.2.1, we reformulate it using two new random variables.


Figure 4.2. Normalized histograms of the results of 5000 different sequences of coin
tosses of lengths m = 100, m = 1000, m = 10000. Horizontal axis is D̂i (4.4). Vertical axis is
proportion of tosses in the indicated interval. As the number of tosses increases, the proportion
of tosses with values of D̂i near 0 increases.


Figure 4.3. Normalized histograms of the results of 5000 different sequences of coin
tosses of lengths m = 1000 and m = 10000. Horizontal axis is D̂i (4.4). Vertical axis is
proportion of tosses in the indicated interval. We change the horizontal scales so that the “shape”
of the distribution of values of D̂i for larger and larger number of tosses is evident. Comparing
the left-hand plot of Figure 4.2 and these two plots suggests that the distributions in all three
cases have a similar profile even as the horizontal scale changes.

Definition 4.2.3

For ω ∈ Ω, we define the ith Rademacher function by,

Ri (ω) = 2ai − 1, ω = 0.a1 a2 · · · .

Equivalently, Ri (ω) = 1 if ai = 1 and Ri (ω) = −1 if ai = 0.

We plot some of these functions in Fig. 4.4. Ri has a useful interpretation. Suppose we
bet on a sequence of coin tosses such that at each toss, we win $1 if it is heads and lose
$1 if it is tails. Then Ri (ω) is the amount won or lost at the ith toss in the sequence of
tosses represented by ω.
Following this interpretation, we define another random variable.

Definition 4.2.4

The total amount won or lost after the mth toss in the betting game is given by

Wm (ω) = Σ_{i=1}^{m} Ri (ω).


Figure 4.4. Plots of the first three Rademacher functions.

By the definition of Ri ,

Wm (ω) = 2(a1 + a2 + · · · + am ) − m = 2Sm (ω) − m, ω = 0.a1 a2 a3 · · · .

Now,

|Sm (ω)/m − 1/2| > δ ⇔ |2Sm (ω) − m| > 2mδ,

or in other words, if and only if,

|Wm (ω)| > 2δm. (4.5)

Note that since δ is arbitrary, the factor 2 is immaterial. We define,

Am = {ω ∈ Ω : |Wm (ω)| > mδ}.

Note that Am is determined by the inverse image of a set of values under Wm .


We can prove Theorem 4.2.1 by showing that

µL (Am ) → 0 as m → ∞. (4.6)

To do this, we use a special version of an important result.

Theorem 4.2.2: Special Case of Chebyshev’s Inequality

Let f be a non-negative, piecewise constant function on Ω and let α > 0 be a real
number. Then,

µL ({ω ∈ Ω : f (ω) > α}) < (1/α) ∫_0^1 f (ω) dω,

where the integral is the standard Riemann integral, which is well defined for
piecewise constant, nonnegative functions.

We illustrate the theorem in Fig. 4.5.



Figure 4.5. We illustrate a typical set in Chebyshev’s inequality.

Proof. [Theorem 4.2.2] Since f is piecewise constant, there is a mesh 0 = ω1 < ω2 <
· · · < ωm = 1 such that f (ω) = ci for ωi < ω ≤ ωi+1 for 1 ≤ i ≤ m − 1. Since f is
nonnegative,

∫_0^1 f (ω) dω = Σ_{i=1}^{m−1} ci (ωi+1 − ωi ) ≥ Σ_{i: ci>α} ci (ωi+1 − ωi )
> α Σ_{i: ci>α} (ωi+1 − ωi ) = α µL ({ω ∈ Ω : f (ω) > α}).

Now we are ready to prove Theorem 4.2.1.

Proof. [Theorem 4.2.1] We can also describe the set Am as

Am = {ω ∈ Ω : Wm (ω)^2 > m^2 δ^2 },

where Wm (ω)^2 is piecewise constant and non-negative. By Theorem 4.2.2,

µL (Am ) < (1/(m^2 δ^2)) ∫_0^1 Wm (ω)^2 dω.

We compute,

∫_0^1 Wm (ω)^2 dω = ∫_0^1 (Σ_{i=1}^{m} Ri (ω))^2 dω = Σ_{i=1}^{m} ∫_0^1 Ri (ω)^2 dω + Σ_{i,j=1, i≠j}^{m} ∫_0^1 Ri (ω)Rj (ω) dω.

The first integral on the right simplifies since Ri (ω)^2 = 1 for all ω, so

Σ_{i=1}^{m} ∫_0^1 Ri (ω)^2 dω = m.

We consider ∫_0^1 Ri (ω)Rj (ω) dω when i ≠ j. Without loss of generality, we assume
i < j. Set J to be the interval,

J = (ℓ/2^i , (ℓ + 1)/2^i ], 0 ≤ ℓ < 2^i .

Ri is constant on J while Rj oscillates between the values −1 and 1 exactly 2^{j−i} times.
Because this is an even number of oscillations, cancellation implies

∫_J Ri (ω)Rj (ω) dω = Ri ∫_J Rj (ω) dω = 0,

where Ri denotes the constant value Ri takes on J. Therefore,

∫_0^1 Ri (ω)Rj (ω) dω = 0, i ≠ j.

Thus, ∫_0^1 Wm (ω)^2 dω = m, and

µL (Iδ,m ) ≤ (1/(m^2 δ^2)) · m = 1/(mδ^2) ⇒ µL (Iδ,m ) → 0 as m → ∞.
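
The orthogonality used in this proof can be checked numerically. The sketch below (ours) uses a midpoint rule, which is exact here because Ri Rj is piecewise constant on dyadic intervals:

    # Numerical check that the integral of R_i * R_j over (0,1] is 0 for i != j.
    def R(i, omega):
        """i-th Rademacher function: +1 if the i-th binary digit of omega is 1."""
        return 2 * (int(omega * 2**i) % 2) - 1

    def integral_RiRj(i, j, n=2**12):
        # midpoint rule with n a power of 2 exceeding 2**max(i, j)
        return sum(R(i, (k + 0.5) / n) * R(j, (k + 0.5) / n)
                   for k in range(n)) / n

    print(integral_RiRj(2, 5))   # 0.0
    print(integral_RiRj(3, 3))   # 1.0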

The random variables introduced for this proof can be used to investigate other
interesting questions.

Example 4.2.2

Suppose in the betting game above, we start with M dollars. We compute an
expression that yields the probability that we lose all the money.
If Am is the event where we lose the money on the mth toss, then the
corresponding set of numbers is

IAm = {ω ∈ Ω : Wi (ω) > −M for i < m and Wm (ω) = −M } .

The set IAm , determined by where a function has prescribed values, is generally
complicated. The event A of losing all the money, given by

IA = ∪_{m=1}^{∞} IAm ,

is even more complicated. The probability of A is µL (IA ), once we figure out how
that is computed.

4.3 Sets of measure zero


Theorem 4.2.1 states that the size of the event consisting of Bernoulli sequences for a
fair coin for which the relative frequency of H’s in the first m trials is larger than a fixed
distance from 1/2 tends to 0 as m → ∞. But, this leaves open the question: For a fair
coin and a “typical” ω, does

lim_{m→∞} Sm (ω)/m = 1/2 ? (4.7)

This is an important question from the point of view of numerical simulation, as it is
quite common that we would have only one numerical sequence corresponding to a
choice of ω in hand. Can we reliably use the computed example to try to approximate
the answer to statistical questions?

Definition 4.3.1

The set of normal numbers in Ω is

N = {ω ∈ Ω : Sm (ω)/m → 1/2 as m → ∞}.

Another way to state the intuition behind the Law of Large Numbers is that obtaining a
sequence corresponding to a non-normal number should be highly improbable. In fact,
ideally, there should be 0 probability of obtaining such a sequence.

Definition 4.3.2

An event in B has probability zero if the corresponding set of real numbers
has Lebesgue measure 0.

Note, however, that there are non-normal sequences, so having probability zero does
not mean it is impossible to obtain a non-normal sequence.
In this section, we characterize sets with Lebesgue measure zero. We noted above
that the Lebesgue measure of a single point is zero. It follows immediately that finite
collections of points also have Lebesgue measure zero. Infinite collections are appar-
ently more complicated. For example, I is the uncountable union of single points and
does not have Lebesgue measure zero. Working from the assumptions about measure
we have made so far, we develop a general method for characterizing sets with Lebesgue
measure zero. In doing so, we actually motivate several key aspects of measure theory.
The characterization is based on a fundamentally important concept for metric spaces.

Definition 4.3.3

Given a subset A ⊂ R^n , a countable cover of A is a countable collection of sets
{Ai }_{i=1}^{∞} in R^n such that A ⊂ ∪_{i=1}^{∞} Ai . If the sets in a countable cover are open, we
call it an open cover.

We emphasize that the requirement of being countable is important.

Example 4.3.1

Here is a countable cover of R:

R ⊂ ∪_{i=0}^{∞} (i − 1/4, i + 1 + 1/4) ∪ (−i − 1 − 1/4, −i + 1/4).

Definition 4.3.4

A set A ⊂ R has Lebesgue measure zero if for every ε > 0, there is a countable
cover {Ai }_{i=1}^{∞} of A, where each Ai consists of a finite union of open intervals,
such that

Σ_{i=1}^{∞} µL (Ai ) < ε.

We also say that A has measure zero.

Note that because each Ai in the countable cover consists of a finite union of open
intervals, their Lebesgue measure is computable. In this way, we sidestep the issue of
computing µL (A) directly.

Example 4.3.2

We show that the set of natural numbers N has Lebesgue measure 0. Given ε > 0,
we have the open cover:

N ⊂ ∪_{i=0}^{∞} (i − ε/2^{i+2} , i + ε/2^{i+2} ).

We compute

Σ_{i=0}^{∞} µL ((i − ε/2^{i+2} , i + ε/2^{i+2} )) = Σ_{i=0}^{∞} ε/2^{i+1} = ε.

This definition also uses (implicitly) another property of Lebesgue measure:

Definition 4.3.5

If (c, d) ⊆ (a, b), then µL ((c, d)) ≤ µL ((a, b)). We say that Lebesgue measure is
monotone.

We could use half open or closed intervals in the definition instead of open intervals,
but open intervals turn out to be convenient for “compactness” arguments.

Example 4.3.3

We show that a closed interval [a, b] with a ≠ b cannot have measure zero. If [a, b]
is covered by countably many open intervals, we can extract a finite number that
cover [a, b] (a finite subcover) because it is compact. The sum of the lengths of
these intervals must be at least b − a.

We describe some properties of sets of measure zero.

Theorem 4.3.1

1. A measurable subset of a set of measure zero has measure zero.

2. If {Ai }_{i=1}^{∞} is a countable collection of sets of measure zero, then ∪_{i=1}^{∞} Ai has
measure zero.

3. Any finite or countable set of numbers has measure zero.

This states that a countable union of sets of measure zero is a set of measure zero. In
contrast, uncountable unions of sets of measure zero can have nonzero measure. The
assumption that the subset of the set of measure zero in 1. is measurable is an important
point that we address in later chapters.

Proof.
Result 1. This follows from the definition since any countable cover of the larger set
is also a cover of the smaller set.
Result 2. We choose ε > 0. Since Ai has measure zero, there is a countable collection
of open intervals {Bi,1 , Bi,2 , . . .} covering Ai with

Σ_{j=1}^{∞} µL (Bi,j ) ≤ ε/2^i .

By Theorem 2.3.2, the collection {Bi,j }_{i,j=1}^{∞} is countable and covers ∪_{i=1}^{∞} Ai . Moreover,

Σ_{j,i=1}^{∞} µL (Bi,j ) = Σ_{i=1}^{∞} (Σ_{j=1}^{∞} µL (Bi,j )) ≤ Σ_{i=1}^{∞} ε/2^i = ε.

Note that we use non-negativity to switch the order of summation in this argument.
Result 3. This follows from 2. and the observation that a point has measure zero.

Remark 4.3.1

Result 2 is proved using a typical measure theory argument.

An interesting question is whether or not there are any interesting sets of measure
zero. We next show that there are uncountable sets of measure zero. In particular, we
describe the construction of a special example that is used frequently in measure theory.
The set is constructed by an iterative process.

Definition 4.3.6: Cantor (Middle Third) set

Step 1 Beginning with the unit interval F0 = [0, 1], divide F0 into 3 equal parts
and remove the middle third open interval (1/3, 2/3) to get

F1 = [0, 1/3] ∪ [2/3, 1].

See Fig. 4.6.
Step 2 Working on F1 next, divide each of its two pieces into equal thirds and
remove the middle open intervals from the divisions to get F2 ,

F2 = [0, 1/9] ∪ [2/9, 1/3] ∪ [2/3, 7/9] ∪ [8/9, 1].

This has 2^2 closed intervals of length 3^{−2} , see Fig. 4.7.
Step i Divide each of the 2^{i−1} pieces remaining after step i − 1 into equal thirds
and remove the middle piece from each to get Fi . Fi has 2^i closed intervals of
length 3^{−i} .
End result This procedure yields a sequence of closed sets {Fi }, where each Fi
is a finite union of 2^i closed intervals of length 3^{−i} .
The Cantor (Middle Third) Set C is defined,

C = ∩_{i=1}^{∞} Fi .

Figure 4.6. The first step in the construction of the Cantor set.

Figure 4.7. The second step in the construction of the Cantor set.

Theorem 4.3.2

Let C be the Cantor set. Then,


1. C is closed.
2. Every point in C is a limit of a sequence of points in C.
3. C has measure zero.
4. C is uncountable.

Proof.
Result 1 Exercise.
Result 2 Exercise.
Result 3 C is contained in F_i for every i. Since F_i is a union of disjoint intervals whose
lengths sum to (2/3)^i and, for any ε > 0, (2/3)^i < ε for all sufficiently large i, C has
measure zero.

Result 4 We show that every point ω ∈ C can be represented uniquely by a series of
the form

    ω = ∑_{i=1}^∞ a_i/3^i,

where a_i = 0 or 2. This can be recognized as a base 3 expansion. To show
uniqueness, if

    ∑_{i=1}^∞ a_i/3^i = ∑_{i=1}^∞ b_i/3^i

for a_i, b_i = 0 or 2, we show that a_i = b_i for all i. Suppose a_i ≠ b_i for some i. Let m
be the smallest index with a_m ≠ b_m, so |a_m − b_m| = 2. Since |a_i − b_i| ≤ 2 for all i,

    0 = | ∑_{i=1}^∞ (a_i − b_i)/3^i | = | ∑_{i=m}^∞ (a_i − b_i)/3^i |
      ≥ (1/3^m) ( |a_m − b_m| − ∑_{i=m+1}^∞ |a_i − b_i|/3^{i−m} )
      ≥ (1/3^m) ( 2 − ∑_{i=1}^∞ 2/3^i ) = 1/3^m.

This is a contradiction, and so every number in C has a unique expansion of this form.
Now let {G_{i,j}}_{j=1}^{2^{i−1}} be the open “middle third” intervals removed to obtain F_i. Then,
a number given by the base 3 expansion 0.b_1 b_2 b_3 . . ., b_i = 0, 1, 2, is in G_{i,j} for
some j if and only if:
• b_j = 0 or 2 for each j < i, because it is in F_{i−1};
• b_i = 1, because it is in one of the discarded open intervals at this stage;
• the b_j’s for j > i are not all 0 and not all 2, since those expansions give the
  endpoints of the removed interval, which remain in C.

It is a good exercise to use a variation of the Cantor diagonal argument to show that C
is uncountable.
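
The series representation gives an explicit injection of the set of binary sequences,
which we know is uncountable, into C: send the bits (b_1, b_2, . . .) to ∑_{i=1}^∞ 2b_i/3^i.
A minimal Python sketch of this encoding (truncated to finitely many bits so it can be
computed):

```python
from fractions import Fraction

def cantor_point(bits):
    """Map a finite bit string (b_1, ..., b_k) to the partial sum
    sum_i (2*b_i)/3^i, truncating the series defining a point of C."""
    return sum(Fraction(2 * b, 3**i) for i, b in enumerate(bits, start=1))

# Distinct bit strings give distinct points since the expansion with
# digits 0 and 2 is unique, so C absorbs a copy of all binary sequences.
print(cantor_point([1, 0, 1]))   # 2/3 + 0 + 2/27 = 20/27
print(cantor_point([0, 1, 1]))   # 2/9 + 2/27 = 8/27
```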

To give some idea of the importance of the concept of sets of measure zero, we quote
a beautiful result of Lebesgue that gives “if and only if” conditions for a function to be
Riemann integrable. Recall that two aspects of Riemann integration provided significant
impetus to the development of measure theory. First, there was a long search for minimal
equivalent conditions on a function that guarantee the function is Riemann inte-
grable. Second, the Riemann integral has some annoying “flaws”. The resolution of the
search is described in detail in Section 9.10. Here, we simply quote the main result.
To explain the idea, we begin with a canonical example. First,

Definition 4.3.7

A property of sets that holds except on a set of measure zero is said to hold almost
everywhere (a.e.). We say that almost all points in a set have a property if all the
points except those in a set of measure zero have the property.

Now, the example.

Definition 4.3.8

Dirichlet’s function is defined

    D(x) = { 1, if x ∈ Q,
           { 0, if x ∉ Q.

It is a good exercise to puzzle out the proof of the following theorem.

Theorem 4.3.3

D is a bounded function, D(x) = 0 a.e., and D(x) is not continuous a.e.

We emphasize that a function may be equal to a continuous function a.e. but not be
continuous a.e.!
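
One concrete way to see the trouble with D is that its Riemann sums depend on where
each subinterval is sampled. In the Python sketch below (our illustration; the type-based
rationality test is a stand-in for a true membership test for Q), sampling at rational tags
gives Riemann sum 1, while sampling at irrational tags gives 0, so the sums have no
common limit and D is not Riemann integrable, consistent with the theorem quoted next:

```python
import math
from fractions import Fraction

def D(x):
    """Dirichlet's function, evaluated exactly by convention: Fraction
    inputs are rational (D = 1) and the float inputs below are known to
    be irrational (D = 0)."""
    return 1 if isinstance(x, Fraction) else 0

n = 1000
rational_tags = [Fraction(k, n) for k in range(1, n + 1)]           # k/n ∈ Q
irrational_tags = [k / n + math.sqrt(2) / 10**9 for k in range(n)]  # not in Q

print(sum(D(x) for x in rational_tags) / n)    # Riemann sum = 1.0
print(sum(D(x) for x in irrational_tags) / n)  # Riemann sum = 0.0
```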
The next result is part of Theorem 9.10.1 proved later.

Theorem 4.3.4: Lebesgue’s Theorem on Riemann Integration

A bounded function is Riemann integrable on a closed interval if and only if it is
continuous a.e. on the interval.

4.4 Strong Law of Large Numbers


We fix a particular experiment with two outcomes of fixed and equal probabilities.

We return to analyzing the set of normal numbers N.

Theorem 4.4.1: Strong Law of Large Numbers for Bernoulli Sequences

N^c is an uncountable set with Lebesgue measure zero.

This theorem is a statement that is naturally expressed in terms of measure theory. This
version of the Law of Large Numbers is called strong because Theorem 4.4.1 implies
Theorem 4.2.1. This is a consequence of a general result on different kinds of conver-
gence that we discuss in Chapter 12.

Proof. We first show that N^c is uncountable and contains a “Cantor-like” set.
Consider the map f : Ω → Ω,

    f(ω) = 0.a_1 1 1 a_2 1 1 a_3 1 1 . . . ,

for ω = 0.a_1 a_2 a_3 . . . in a fractional binary expansion. The map is 1–1, so the image
of Ω is uncountable. Moreover, f(Ω) is contained in N^c. In fact, if y = f(ω), then
S_{3m}(y) ≥ 2m, and

    S_{3m}(y)/(3m) ≥ 2/3.

Note that {S_{3m}(y)/(3m)}_{m=1}^∞ is a subsequence of {S_m(y)/m}_{m=1}^∞. Such y clearly
violate the Law of Large Numbers. The image set f(Ω) is Cantor-like in that it is the
countable nested intersection of sets consisting of a finite number of well-separated,
disjoint intervals.

We cover the complicated set N^c using a countable cover of much simpler sets.
Recall the set A_m = {ω ∈ Ω : |W_m(ω)| > δm} used in the proof of the Weak Law
of Large Numbers. We use an equivalent definition,

    A_m = {ω ∈ Ω : W_m^4(ω) > δ^4 m^4}.

By Theorem 4.2.2,

    µL(A_m) ≤ (1/(δ^4 m^4)) ∫_0^1 W_m^4 dω ≤ (1/(δ^4 m^4)) ∫_0^1 ( ∑_{i=1}^m R_i )^4 dω.

The integrand yields 5 kinds of terms:

1. R_i^4 for i = 1, . . . , m.
2. R_i^2 R_j^2 for i ≠ j.
3. R_i^2 R_j R_k for i, j, k distinct.
4. R_i^3 R_j for i ≠ j.
5. R_i R_j R_k R_l for i, j, k, l distinct.

Since R_i^4(ω) = 1 and R_i^2(ω) R_j^2(ω) = 1 for all i, j,

    ∫_0^1 R_i^4 dω = ∫_0^1 R_i^2 R_j^2 dω = 1.

We show the other terms integrate to zero because of cancellation. Two follow from the
proof of the Weak Law of Large Numbers:

    ∫_0^1 R_i^2 R_j R_k dω = ∫_0^1 R_j R_k dω = 0,   i, j, k distinct,

    ∫_0^1 R_i^3 R_j dω = ∫_0^1 R_i R_j dω = 0,   i ≠ j.

Finally, assume i < j < k < l, and consider an interval of the form

    J = [r/2^k, (r + 1)/2^k).

R_i R_j R_k is constant on J. However, R_l alternates sign on the 2^{l−k} equal dyadic
subintervals of J, so

    ∫_0^1 R_i R_j R_k R_l dω = 0.

There are m terms of the first kind of integrand and 3m(m − 1) terms involving the
second kind of integrand, so

    ∫_0^1 W_m^4(ω) dω = 3m^2 − 2m ≤ 3m^2,

and

    µL(A_m) ≤ 3/(m^2 δ^4).
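
As an aside, the count 3m^2 − 2m can be checked by brute force for small m: since the
R_i behave like independent ±1 signs, the integral is the average of (s_1 + · · · + s_m)^4
over all 2^m sign patterns s_i = ±1. A quick Python check (our sketch):

```python
from itertools import product

def fourth_moment(m):
    """Average of (s_1 + ... + s_m)^4 over all 2^m sign patterns s_i = ±1,
    which equals the integral of W_m^4 over [0, 1]."""
    return sum(sum(s)**4 for s in product((-1, 1), repeat=m)) / 2**m

for m in range(1, 9):
    print(m, fourth_moment(m), 3 * m**2 - 2 * m)   # the last two agree
```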

We cover N^c using a collection of sets of the form A_m for increasing m and decreas-
ing δ chosen in such a way that the cover has arbitrarily small measure. For a constant
C, set δ_m^4 = C m^{−1/2}, so

    ∑_{m=1}^∞ 3/(δ_m^4 m^2) = (3/C) ∑_{m=1}^∞ 1/m^{3/2},

where the series on the right converges, so the right-hand side can be made smaller
than any ε > 0 by choosing sufficiently large C. Hence, given ε > 0, there is a sequence
{δ_m} such that

    ∑_{m=1}^∞ 3/(δ_m^4 m^2) ≤ ε.

For each m, set

    Ã_m = {ω ∈ Ω : |W_m(ω)| > δ_m m}.

Note Ã_m is a finite union of intervals since W_m is piecewise constant. We have

    µL(Ã_m) ≤ 3/(δ_m^4 m^2),

and

    ∑_{m=1}^∞ µL(Ã_m) ≤ ε.
If we show that N^c ⊂ ⋃_{m=1}^∞ Ã_m, then we are done. This holds if N ⊃ ⋂_{m=1}^∞ Ã_m^c.
If ω ∈ ⋂_{m=1}^∞ Ã_m^c, then for each m, |W_m(ω)| ≤ δ_m m, or |W_m(ω)|/m ≤ δ_m. Since
δ_m → 0, |W_m(ω)|/m → 0, so ω ∈ N.

The proof of Theorem 4.4.1 can be used to draw stronger conclusions. For exam-
ple, a normal number has the property that no finite sequence of digits occurs asymp-
totically more frequently than any other finite sequence of digits of the same length.
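
The Chebyshev-type bound µL(A_m) ≤ 3/(m^2 δ^4) at the heart of the proof is easy to
probe numerically. The Python sketch below (a numerical illustration only; the trial
count is an arbitrary choice) estimates the measure of {|W_m| > δm} by random sampling
and compares it with the bound:

```python
import random

def measure_estimate(m, delta, trials=20000):
    """Estimate the measure of {ω : |W_m(ω)| > delta*m} by sampling
    random ±1 sequences; each sequence corresponds to a dyadic interval
    of ω's of equal measure."""
    hits = sum(abs(sum(random.choice((-1, 1)) for _ in range(m))) > delta * m
               for _ in range(trials))
    return hits / trials

m, delta = 100, 0.2
print("empirical measure:", measure_estimate(m, delta))
print("bound 3/(m^2*delta^4):", 3 / (m**2 * delta**4))  # = 0.1875
```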

4.5 Wish list for measure theory for R^n

With some informal experience with measure theory ideas, we make a second attempt at
a wish list of desirable properties for a measure theory. We are considering the measure
on R^n that extends the standard notions of length, area, and volume. If A ⊂ R^n for
some n, let µL(A) denote its “measure”.

1. µL should be a non-negative set function from sets in R^n into the extended reals R̄.
   µL({x}) = 0 for a single point. µL(A) = ∞ should be possible for unbounded sets.

2. In R, we should have µL([a, b]) = b − a. In R^n, we should have

       µL(Q) = (b_1 − a_1)(b_2 − a_2) · · · (b_n − a_n),

   for generalized rectangles (multi-intervals),

       Q = {x ∈ R^n : a_i ≤ x_i ≤ b_i, 1 ≤ i ≤ n}.

3. If {A_i}_{i=1}^m are disjoint sets, then

       µL(A_1 ∪ A_2 ∪ · · · ∪ A_m) = ∑_{i=1}^m µL(A_i).

   What about infinite collections? Well, µL({x}) = 0. But in R,

       (0, 1) = ⋃_{x∈(0,1)} {x}.

   This is a problem because we cannot have

       1 = µL((0, 1)) = µL( ⋃_{x∈(0,1)} {x} ) = ∑_{x∈(0,1)} µL({x}) = 0.

   So, uncountable collections of sets are a problem and we avoid them. What
   about countable collections? Countable disjoint collections of sets of measure
   zero should have measure zero. Also,

       (0, 1] = (1/2, 1] ∪ (1/3, 1/2] ∪ (1/4, 1/3] ∪ · · · ,

   and,

       1 = µL((0, 1]) = (1 − 1/2) + (1/2 − 1/3) + (1/3 − 1/4) + · · ·
         = µL((1/2, 1]) + µL((1/3, 1/2]) + µL((1/4, 1/3]) + · · · .

   (A numerical check of this telescoping sum is sketched after the list.) So we
   would like to say that if {A_i}_{i=1}^∞ is a countable collection of disjoint sets,
   then

       µL( ⋃_{i=1}^∞ A_i ) = ∑_{i=1}^∞ µL(A_i).

4. If A ⊂ B are sets, then µL (A) ≤ µL (B), or µL should be “monotone”.


5. If a set A is obtained from another set B by rotation, translation, or reflection
maps, then µL (A) = µL (B).
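
Returning to item 3, the telescoping computation is easy to verify numerically: the
partial sums of the interval lengths are 1 − 1/(k + 1), which increase to µL((0, 1]) = 1.
A quick check with exact rationals (our sketch):

```python
from fractions import Fraction

# Lengths of the disjoint intervals (1/(k+1), 1/k] for k = 1, ..., 100.
lengths = [Fraction(1, k) - Fraction(1, k + 1) for k in range(1, 101)]
print(sum(lengths))   # 100/101 = 1 - 1/101, increasing to µL((0,1]) = 1
```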

It turns out that we cannot construct a desirable measure that satisfies all of these
properties. We have to give up something, so we do not require that the measure be
defined on all subsets of R^n. We settle for a measure defined on a class of subsets. On
the other hand, this class of subsets is extremely rich, and it is quite difficult to construct
a set that is not in it.

4.6 References
4.7 Worked Problems