A.

RÉNYI

probability
theory

AKADÉMIAI KIADÓ, BUDAPEST


A. RÉNYI

PROBABILITY THEORY

The book has been compiled on the basis of lectures held by the author
since 1948 for students of mathematics at the University of Budapest. It is
an introductory work: no preliminary knowledge in the field of the theory
of probability is required. The reader should, however, be familiar with
other branches of mathematics: he must have a thorough foundation not
only in the elements of the differential and integral calculus but also in the
theory of real and complex functions. The book is meant for demanding
beginners who intend to get acquainted with the ideas of probability theory
and to acquire a thorough knowledge in this field.

AKADÉMIAI KIADÓ
PUBLISHING HOUSE OF THE HUNGARIAN
ACADEMY OF SCIENCES
BUDAPEST
PROBABILITY THEORY
PROBABILITY
THEORY
by

A. RÉNYI
Member of the Hungarian Academy of Sciences
Professor of Mathematics at the
Eötvös Loránd University, Budapest
Director of the Mathematical Institute of the
Hungarian Academy of Sciences, Budapest

AKADÉMIAI KIADÓ, BUDAPEST 1970


This book is the enlarged version of

WAHRSCHEINLICHKEITSRECHNUNG

VEB Deutscher Verlag der Wissenschaften


Berlin 1962

VALÓSZÍNŰSÉGSZÁMÍTÁS
Tankönyvkiadó, Budapest 1966

and

CALCUL DES PROBABILITÉS


Dunod, Paris 1966

English translation by
DR. LÁSZLÓ VEKERDI

© AKADÉMIAI KIADÓ, BUDAPEST 1970

JOINT EDITION PUBLISHED BY
AKADÉMIAI KIADÓ
PUBLISHING HOUSE OF THE HUNGARIAN ACADEMY OF SCIENCES
AND
NORTH-HOLLAND PUBLISHING COMPANY, AMSTERDAM · LONDON

PRINTED IN HUNGARY
PREFACE

One of the last works of Alfréd Rényi is presented to the reader in this
volume. Before his sudden death on 1 February 1970 he corrected the
first proof of the book, but he no longer had time for the final proofreading*
or for writing the preface he had planned.

* This was done by Mr. P. Bártfai, Mrs. A. Földes and Mrs. L. Rejtő.
This preface is, therefore, a brief memorial to a great mathematician,
mentioning a few features of Alfréd Rényi’s professional career.
Professor Rényi lectured on probability theory at various universities
continuously from 1948 until his untimely death. His academic career
started at the University of Debrecen and was continued at the University
of Budapest, where he was professor at the Chair of Theory of Probability.
In the meantime he was an invited lecturer, for shorter or longer periods,
at several scientific centres of the world. Thus he was visiting professor at
Stanford University, Michigan State University, the University of Erlangen,
and the University of North Carolina.
Besides his teaching activities, Professor Rényi was director of the Mathematical
Institute of the Hungarian Academy of Sciences for one and a half
decades. Under his direction the Institute developed into an important
research centre of mathematics.
He participated in the editorial work of a number of journals. He was
the editor of Studia Scientiarum Mathematicarum Hungarica and a
member of the editorial boards of Acta Mathematica, Annales Sci. Math.,
Publicationes Math., Matematikai Lapok, Zeitschrift für Wahrscheinlichkeitstheorie,
Journal of Applied Probability, Journal of Combinatorial
Analysis, and Information and Control.
The careful reader will certainly note how long teaching experience
and a keen interest in research are amalgamated in the present book. The
material of Professor Rényi's courses on probability theory was first
published in the form of lecture notes. It appeared as a book in Hungarian
in 1954, and in a completely revised German translation in 1962. The latter
book was the basis of the new Hungarian edition of 1965 and of the French
translation published in 1966. In the new Hungarian edition the author
omitted some theoretical chapters of the German text, inserting new ones
dealing with recent results and modern practical methods. The present
book contains the complete texts of the Hungarian and German versions
and is completed with some additional new material. The presentation of
a number of well-chosen problems and exercises will certainly be regarded
as a valuable feature of the book; some of them — following the traditions
of the Hungarian Mathematical Competitions — have been selected from
the material of original publications.
In his lectures and books, Alfréd Rényi always strove to arouse interest
in recent results of research, besides presenting the fundamental textbook
material of probability theory. Accordingly, he often wrote about problems
on which he was engaged at the time. In the present book the reader will
also find many particulars which do not occur in other textbooks dealing
with the same field. These problems have been selected mostly from among
research topics pursued by the author and his school; they are presented
with the aim of bringing within the scope of the beginner the spirit of
living and rapidly developing present-day mathematics.
Pál Révész
CONTENTS

CH. I. ALGEBRAS OF EVENTS

§ 1. Fundamental relations
§ 2. Some further operations and relations
§ 3. Axiomatical development of the algebra of events
§ 4. On the structure of finite algebras of events
§ 5. Representation of algebras of events by algebras of sets
§ 6. Exercises

CH. II. PROBABILITY

§ 1. Aim and scope of the theory of probability
§ 2. The notion of probability
§ 3. Probability algebras
§ 4. Finite probability algebras
§ 5. Probabilities and combinatorics
§ 6. Kolmogorov probability spaces
§ 7. The extension of rings of sets, algebras of sets and measures
§ 8. Conditional probabilities
§ 9. The independence of events
§ 10. "Geometric" probabilities
§ 11. Conditional probability spaces
§ 12. Exercises

CH. III. DISCRETE RANDOM VARIABLES

§ 1. Complete systems of events and probability distributions
§ 2. The theorem of total probability and Bayes' theorem
§ 3. Classical probability distributions
§ 4. The concept of a random variable
§ 5. The independence of random variables
§ 6. Convolutions of discrete random variables
§ 7. Expectation of a discrete random variable
§ 8. Some theorems on expectations
§ 9. The variance
§ 10. Some theorems concerning the variance
§ 11. The correlation coefficient
§ 12. The Poisson distribution
§ 13. Some applications of the Poisson distribution
§ 14. The algebra of probability distributions
§ 15. Generating functions
§ 16. Approximation of the binomial distribution by the normal distribution
§ 17. Bernoulli's law of large numbers
§ 18. Exercises

CH. IV. GENERAL THEORY OF RANDOM VARIABLES

§ 1. The general concept of a random variable
§ 2. Distribution functions and density functions
§ 3. Probability distributions in several dimensions
§ 4. Conditional distributions and conditional density functions
§ 5. Independent random variables
§ 6. The uniform distribution
§ 7. The normal distribution
§ 8. Distribution of a function of a random variable
§ 9. The convolution of distributions
§ 10. Distribution of a function of several random variables
§ 11. The general notion of expectation
§ 12. Expectation vectors of higher dimensional probability distributions
§ 13. The median and the quantiles
§ 14. The general notions of standard deviation and variance
§ 15. On some other measures of fluctuation
§ 16. Variance in the higher dimensional case
§ 17. Exercises

CH. V. MORE ABOUT RANDOM VARIABLES

§ 1. Random variables on conditional probability spaces
§ 2. Generalization of the notion of conditional probability on Kolmogorov probability spaces
§ 3. Generalization of the notion of conditional probability on conditional probability spaces
§ 4. Generalization of the notion of conditional mathematical expectation in Kolmogorov probability spaces
§ 5. Generalization of Bayes' theorem
§ 6. The correlation ratio
§ 7. On some other measures of the dependence of two random variables
§ 8. The fundamental theorem of Kolmogorov
§ 9. Exercises

CH. VI. CHARACTERISTIC FUNCTIONS

§ 1. Random variables with complex values
§ 2. Characteristic functions and their basic properties
§ 3. Characteristic functions of some important distributions
§ 4. Some fundamental theorems on characteristic functions
§ 5. Characteristic properties of the normal distribution
§ 6. Characteristic functions of multidimensional distributions
§ 7. Infinitely divisible distributions
§ 8. Stable distributions
§ 9. Characteristic functions of conditional probability distributions
§ 10. Exercises

CH. VII. LAWS OF LARGE NUMBERS

§ 1. Chebyshev's and related inequalities
§ 2. Stochastic convergence
§ 3. Generalization of Bernoulli's law of large numbers
§ 4. Bernstein's improvement of Chebyshev's inequality
§ 5. The Borel-Cantelli lemma
§ 6. Kolmogorov's inequality
§ 7. The strong law of large numbers
§ 8. The fundamental theorem of mathematical statistics
§ 9. The law of the iterated logarithm
§ 10. Sequences of mixing sets
§ 11. Stable sequences of events
§ 12. Sequences of exchangeable events
§ 13. The zero-one law
§ 14. Kolmogorov's three-series theorem
§ 15. Laws of large numbers on conditional probability spaces
§ 16. Exercises

CH. VIII. THE LIMIT THEOREMS OF PROBABILITY THEORY

§ 1. The central limit theorems
§ 2. The local form of the central limit theorem
§ 3. The domain of attraction of the normal distribution
§ 4. Convergence to the Poisson distribution
§ 5. The central limit theorem for samples from a finite population
§ 6. Generalization of the central limit theorem through the application of mixing theorems
§ 7. The central limit theorem for sums of a random number of random variables
§ 8. Limit distributions for Markov chains
§ 9. Limit distributions for "order statistics"
§ 10. Limit theorems for empirical distribution functions
§ 11. Limit distributions concerning random walk problems
§ 12. Proof of the limit theorems by the operator method
§ 13. Exercises

CH. IX. APPENDIX. INTRODUCTION TO INFORMATION THEORY

§ 1. Hartley's formula
§ 2. Shannon's formula
§ 3. Conditional and relative information
§ 4. The gain of information
§ 5. The statistical meaning of information
§ 6. Further measures of information
§ 7. Statistical interpretation of the information of order α
§ 8. The definition of information for general distributions
§ 9. Information-theoretical proofs of limit theorems
§ 10. Extension of information theory to conditional probability spaces
§ 11. Exercises

TABLES
REMARKS AND BIBLIOGRAPHICAL NOTES
REFERENCES
AUTHOR AND SUBJECT INDEX
CHAPTER I

ALGEBRAS OF EVENTS

§ 1. Fundamental relations

Probability theory deals with events occurring in connection with random
mass-phenomena. As it is an abstract mathematical theory, the concept of
an event has to be dealt with abstractly too; i.e. relations between events are
to be characterized axiomatically. For this reason, we consider first of all in
this chapter the "algebras of events". Indeed, relations between events have
a primarily logical character: one may assign to every event a proposition
stating its occurrence. Thus logical relations between propositions correspond
to the relations between events. The algebraic structure of the set of
events turns out to be a Boolean algebra. Algebras of events as a basis of
probability theory were first considered by V. I. Glivenko [3] (cf. also
A. N. Kolmogorov [9]).
As stated above, events are to be characterized as abstract concepts. We
shall define an algebra of events as a set of events connected with one and
the same "experiment", taken in the widest sense of this word. To every
experiment there belongs a set of possible outcomes; for every event of the
algebra corresponding to the experiment one must be able to decide, for
each possible outcome, whether the event occurred or not.
Let the events A, B, C, ... be elements of the same algebra of events.
Two events which, for every outcome of the experiment, either both occur
or both fail to occur are said to be identical. The fact that the events A and
B are identical is denoted by A = B.
The non-occurrence of an event A is itself an event, denoted by Ā and
called the event complementary to A. From this definition it follows that

Ā̄ = A. (1)

In the realm of logic this corresponds to the proposition that a doubly
negated statement coincides with the statement itself.
If A and B are two events of the same algebra of events, we may ask
whether they both occurred. Let our experiment be, for instance, firing
at a target. By a vertical and a horizontal line we subdivide the target into
four equal parts. Let event A be a hit in the upper half of the target and
event B one in the right half of it. In this case the statement "A and B both
occurred" means that the hit lies in the right upper quadrant of the target
(Fig. 1).

Fig. 1 (the events A, B, and AB on the target)

Event C, occurring if and only if both events A and B occur, is said to be
the product of the events A and B; we write C = AB. Thus we have defined
an operation, namely the multiplication of events. Let us now see what the
properties of this operation are. First, since AB clearly does not depend on
the order of A and B, we have the commutative law

AB = BA. (2)

Also obviously,

AA = A, (3)

i.e. every event A is idempotent with respect to multiplication. The definition
of the product of events may be extended to more than two factors. A(BC)
occurs, by definition, if and only if the events A and BC occur; that is, if the
events A, B, and C all occur. Evidently, (AB)C has the same meaning. Thus
we have the associative law for multiplication:

A(BC) = (AB)C. (4)

Instead of A(BC) we can therefore write simply ABC. Clearly, the event
AB can occur only if A and B do not exclude each other. If A and B are
mutually exclusive, AB is an impossible event. It is useful to consider the
impossible event as an event too. It will be denoted by O. The fact that A
and B are mutually exclusive is thus expressed by AB = O. Since an event
and the complementary event obviously exclude each other, we have

AĀ = O. (5)
If A and B are two events of an algebra of events, one may ask whether
at least one of the events A and B occurred. Let A denote the event that the
hit lies in the upper half of the target and B the event that it lies in the right
half; the statement that at least one of the events A and B occurred means
then that the hit does not lie in the left lower quadrant of the target (Fig. 2).
The event occurring exactly when at least one of the events A and B occurs
is said to be the sum of A and B and is denoted by A + B.

Fig. 2 (the events B and A + B on the target)

It is easy to see that

A + B = B + A (6)

(commutative law of addition) and also that

A + (B + C) = (A + B) + C (7)

(associative law of addition). The definition of the sum is readily extended to
the case of more than two events.
The event A + B thus occurs precisely if A or B occurs; the word "or",
however, does not mean in this connection that A and B exclude one another.
Thus, for instance, in our repeatedly considered example the meaning
of A + B is the statement that the hit lies either in the upper half of the
target (this is the event A) or in the right lower quadrant (the event ĀB).
Therefore we have the relation

A + B = A + ĀB, (8)

where the two terms on the right hand side are now mutually exclusive.
By applying relation (8), every sum of events can be transformed in such
a way that the terms of the sum become pairwise mutually exclusive.
Clearly the formula

A + A = A (9)

is valid. Further we see that the event A + Ā certainly occurs; thus by
introducing the notation I for the "sure event" we have

A + Ā = I. (10)

We agree further that

Ī = O, Ō = I, (11)

i.e. that the event complementary to the sure event is the impossible event
О and conversely.
Evidently, the following relations are also valid:
AO = O, (12)
A + О = A, (13)
AJ = A, (14)
A + I= I. (15)

In order to be able to carry out unrestrictedly all operations in the algebra
of events, we need some further basic relations. First of all, does the distributive
law hold for addition and multiplication in the algebra of events?
Now A(B + C) occurs, by definition, exactly if A occurs and B or C occurs.
This, however, means precisely that either A and B occur or A and C occur,
i.e. that AB + AC occurs. Therefore we have

A(B + C) = AB + AC. (16)

From the distributive law follows the so-called "law of inclusion"

A + AB = A; (17)

indeed, from (14), (16) and (15) we have

A + AB = AI + AB = A(I + B) = AI = A.

Clearly, rule (17) can be verified directly as well; the direct verification is,
however, clumsy for some complicated relations, while by applying the
formal rules of operation one can readily get a formal proof. This is the
reason why the algebra of events is useful; it is therefore advisable to acquire
a certain practice in such formal proofs.
The distributive laws can be extended (just as in ordinary algebra) to
more than two terms. In the algebra of events there exists, however, still
another distributive law:

A + BC = (A + B)(A + C). (18)

The validity of (18) is readily seen: A + BC occurs exactly if A occurs or
B and C occur; if A occurs, both factors of the product on the right hand
side occur, and the same is true if B and C occur, but in no other case.
This consideration being somewhat more difficult than the preceding ones,
it is of interest to show how (18) is implied by the already deduced rules of
the algebra of events. Indeed, because of (2), (3), (16) and (17) we have

(A + B)(A + C) = A + AB + AC + BC = A + BC,

which is what we had to show.


Next we prove some further important relations:

$\overline{AB} = \bar{A} + \bar{B}$, (19)

$\overline{A + B} = \bar{A}\bar{B}$. (20)

The event $\overline{AB}$ occurs exactly if AB does not occur, hence if the events
A and B do not both occur; Ā + B̄ occurs exactly if A or B (or both) do
not occur. These two propositions evidently state the same thing; thus (19)
is valid. Formula (20) can be proved in the same way.
As to the rules of operation valid for the addition and multiplication
of events, we see that both have the same properties (commutativity, associativity,
idempotency of every element) and that the relations between the
two kinds of rules of operation are symmetrical. Formulas (16) and (18)
are obtained from each other by interchanging everywhere the signs of
multiplication and addition. Such formulas are called dual to one another.
Thus for instance the relations

A + AB = A and A(A + B) = A

are dual to one another. Clearly, there exist relations which are, because of
their symmetry, self-dual; e.g. the relation

(A + B)(A + C)(B + C) = AB + AC + BC.
For the sake of brevity we sometimes write $\prod_{k=1}^{n} A_k$ instead of
$A_1 A_2 \ldots A_n$ and $\sum_{k=1}^{n} A_k$ instead of $A_1 + A_2 + \ldots + A_n$.
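Since events combine exactly like sets (product = intersection, sum = union, complement taken with respect to the sure event), all the identities of this paragraph can be checked mechanically. The following Python sketch is our illustration of this remark (the names H, comp, prod, plus are ours); it verifies relations (8), (16), (18), (19) and (20) for all events over a small sample space:

from itertools import combinations

H = frozenset(range(4))                 # the sure event I
EVENTS = [frozenset(c) for r in range(len(H) + 1)
          for c in combinations(H, r)]  # all 2^4 events

comp = lambda a: H - a                  # complementary event
prod = lambda a, b: a & b               # AB: both events occur
plus = lambda a, b: a | b               # A + B: at least one occurs

for A in EVENTS:
    for B in EVENTS:
        assert plus(A, B) == plus(A, prod(comp(A), B))               # (8)
        assert comp(prod(A, B)) == plus(comp(A), comp(B))            # (19)
        assert comp(plus(A, B)) == prod(comp(A), comp(B))            # (20)
        for C in EVENTS:
            assert prod(A, plus(B, C)) == plus(prod(A, B), prod(A, C))  # (16)
            assert plus(A, prod(B, C)) == prod(plus(A, B), plus(A, C))  # (18)
print("relations (8), (16), (18), (19), (20) hold for all events")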

§ 2. Some further operations and relations

Subtraction is defined in the algebra of events by the formula

B − A = BĀ. (1)

With respect to subtraction the following rules hold:

A(B − C) = AB − AC, AB − C = (A − C)(B − C); (2)



they are the two distributive laws of subtraction. Using the subtraction,
the complementary event may be written in the form

Ā = I − A. (3)

Figs 3 and 4 (the events A − B and A Δ B on the target)

The subtraction does not satisfy all the rules of operation known from
ordinary algebra. Thus for instance (A − B) + B is in general not equal
to A; further, A + (B − C) is not always identical with (A + B) − C. Hence,
if the sign of subtraction also figures in relations between events, the
brackets are not to be omitted without consideration. There are, however,
cases when this omission is allowed, e.g.

A − (B + C) = (A − B) − C. (4)

The event A − B occurs exactly if A does and B does not occur; in the
same way, B − A occurs if B does but A does not occur. The meaning of the
expression (A − B) + (B − A) is therefore not O, but the event which
consists in the occurrence of one and only one of the events A and B. It is
reasonable to introduce a new symbol for this event. We put

(A − B) + (B − A) = A Δ B. (5)

The operation denoted by Δ is called the symmetric difference of the events
A and B (cf. Figs 3 and 4). It fulfils the following rules of operation, derived
readily from the already known rules:

A Δ A = O        A Δ B = B Δ A                A(B Δ C) = AB Δ AC
A Δ O = A        A Δ B = (A + B) − AB         B − A = AB Δ B
A Δ I = Ā        A + B = (A Δ B) Δ AB         B − A = (A Δ B)B    (6)
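Each identity in (6) can again be checked mechanically when events are modelled as sets; Python's ^ operator is precisely the symmetric difference. A small verification sketch of ours:

from itertools import combinations

H = frozenset(range(4))
EVENTS = [frozenset(c) for r in range(len(H) + 1)
          for c in combinations(H, r)]
O = frozenset()

for A in EVENTS:
    assert A ^ A == O and A ^ O == A and A ^ H == H - A    # first column of (6)
    for B in EVENTS:
        assert A ^ B == B ^ A                              # commutativity
        assert A ^ B == (A | B) - (A & B)                  # A Δ B = (A + B) - AB
        assert B - A == (A & B) ^ B                        # B - A = AB Δ B
        assert B - A == (A ^ B) & B                        # B - A = (A Δ B)B
        assert A | B == (A ^ B) ^ (A & B)                  # A + B = (A Δ B) Δ AB
        for C in EVENTS:
            assert A & (B ^ C) == (A & B) ^ (A & C)        # A(B Δ C) = AB Δ AC
print("all identities in (6) verified")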

Finally, we mention still another relation between events. If the occurrence
of an event A always entails the occurrence of the event B, then we say
that the event A implies the event B. We denote this fact by the symbol ⊆.
Therefore A ⊆ B means that from the occurrence of the event A there
always follows the occurrence of the event B.

The following relations hold:

1. O ⊆ A.
2. A ⊆ I.
3. A ⊆ A.
4. A ⊆ B and B ⊆ C imply A ⊆ C.
5. A ⊆ B and B ⊆ A imply A = B.
6. A ⊆ A + B.
7. AB ⊆ A.
8. A ⊆ B implies A = AB.
9. A ⊆ C and B ⊆ C imply A + B ⊆ C.
10. C ⊆ A and C ⊆ B imply C ⊆ AB.
11. A ⊆ B implies B = A + BĀ.
12. A ⊆ B implies B̄ ⊆ Ā.
13. A ⊆ B implies AC ⊆ BC.
14. A ⊆ B implies A + C ⊆ B + C.
15. AB = O and C ⊆ A imply BC = O.

It is easy to show that the meaning of A = AB, as well as that of B = A + BĀ,
is the same as that of A ⊆ B; either of these relations could have served for
the definition of the relation ⊆. Indeed, if the relation B = A + BĀ is valid,
then the occurrence of A implies the occurrence of B. If, on the other hand,
we have A = AB, then the occurrence of A again implies the occurrence of
B, since

B = BI = B(A + Ā) = BA + BĀ = A + BĀ.

From this it follows that for the validity of the relation A ⊆ B the validity
of one of the relations A = AB and B = A + BĀ is necessary and sufficient.
The latter relation can be stated in the following form: for the validity of
A ⊆ B a necessary and sufficient condition is the existence of a C such that
AC = O and B = A + C; indeed, from this it follows directly that C = BĀ.
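The equivalence of these formulations of implication is also easy to confirm on sets; the following short Python check (our illustration) tests, for every pair of events, that A ⊆ B, A = AB and B = A + BĀ hold or fail together:

from itertools import combinations

H = frozenset(range(4))
EVENTS = [frozenset(c) for r in range(len(H) + 1)
          for c in combinations(H, r)]

for A in EVENTS:
    for B in EVENTS:
        implies = A <= B                       # A ⊆ B
        assert implies == (A == A & B)         # A = AB
        assert implies == (B == A | (B - A))   # B = A + B·Ā
print("A ⊆ B, A = AB and B = A + B·Ā are equivalent")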

We introduce the following important concept. A system A_1, A_2, ..., A_n
is called a complete system of events if the relations

A_k ≠ O (k = 1, 2, ..., n), A_j A_k = O for j ≠ k, and A_1 + A_2 + ... + A_n = I

are valid. For instance, {A, Ā} is a complete system of events, provided that
A ≠ O and A ≠ I.

§ 3. Axiomatical development of the algebra of events

In the preceding paragraphs, we introduced certain operations on events
and discussed the rules for these operations. Now we have to make a further
abstraction. A set 𝒜 of arbitrary elements A, B, C, ... is said to be a Boolean
algebra if the following conditions are fulfilled: given any two elements
A and B of 𝒜, there exists exactly one element of 𝒜 called the product of
A and B and denoted by AB, and exactly one element of 𝒜 called the sum
of A and B and denoted by A + B;¹ further, there corresponds to every
element A of 𝒜 exactly one element Ā of 𝒜. Let there exist two special
elements of the set 𝒜, namely O and I. Let the elements of the Boolean
algebra fulfil the relations obtained in the preceding paragraph; the following
axioms are therefore assumed to hold:
AA = A (1.1)
AB = BA (1.2)
A(BC) = (AB)C (1.3)
A + A = A (2.1)
A + B = B + A (2.2)
A + (B + C) = (A + B) + C (2.3)
A(B + C) = AB + AC (3.1)
A + BC = (A + B)(A + C) (3.2)
AĀ = O (4.1)
A + Ā = I (4.2)
AI = A (5.1)
A + O = A (5.2)
AO = O (5.3)
A + I = I (5.4)

¹ The notations A ∩ B and A ∪ B are often used instead of AB and A + B,
respectively.

It is to be noted that these axioms are not all mutually independent; thus
for instance (3.2) can be deduced from the others. It is, however, not our
aim to examine here which axioms could be omitted from the system.
The totality of the events connected with an experiment forms a Boolean
algebra if we understand by the product AB of two events A, B the joint
occurrence of both events and by the sum A + B of two events the occurrence
of at least one of the two events, and further if we denote by Ā the event
complementary to A and by O and I the impossible and the sure event,
respectively. Indeed, the above 14 axioms are fulfilled in this case. More
generally, every set of events of an experiment is a Boolean algebra if it
contains the sure event and, further, for every event A its complementary
event Ā and for every A and B the events AB and A + B.
Clearly, one can find other Boolean algebras as well. Thus for instance
the totality of the subsets of a set H is also a Boolean algebra. We define the
sum of two sets as the union of the two sets and their product as the intersection
of the two sets. Let I mean the set H itself and O the empty set;
further, let Ā be the set complementary to A with respect to H, so that B − A
is the set of those elements of B which do not belong to A. A direct verification
of each axiom shows that this system is indeed a Boolean algebra.
There exists a close connection between Boolean algebras of events and
algebras of sets. In our example of the target this connection is clearly
visible. This analogy between a Boolean algebra of sets and an algebra of
events has an important role in the calculus of probability.
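For the complete algebra of the subsets of a set H, each of the 14 axioms can also be verified directly by machine. A Python sketch of ours, with a three-element H so that the triple loop stays small:

from itertools import combinations

H = frozenset(range(3))
S = [frozenset(c) for r in range(len(H) + 1) for c in combinations(H, r)]
O = frozenset()
comp = lambda a: H - a

for A in S:
    assert A & A == A and A | A == A               # (1.1), (2.1)
    assert A & comp(A) == O and A | comp(A) == H   # (4.1), (4.2)
    assert A & H == A and A | O == A               # (5.1), (5.2)
    assert A & O == O and A | H == H               # (5.3), (5.4)
    for B in S:
        assert A & B == B & A and A | B == B | A   # (1.2), (2.2)
        for C in S:
            assert A & (B & C) == (A & B) & C              # (1.3)
            assert A | (B | C) == (A | B) | C              # (2.3)
            assert A & (B | C) == (A & B) | (A & C)        # (3.1)
            assert A | (B & C) == (A | B) & (A | C)        # (3.2)
print("all 14 axioms hold in the complete algebra of sets")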
In order to obtain a Boolean algebra, it is not necessary to consider all
subsets of a set. A collection T of subsets of a set H is said to be an algebra
of sets if the addition can always be carried out in it, if H itself belongs to
T, and if for a set A its complementary set Ā = H − A belongs to T as well;
i.e. if the following conditions are satisfied:¹

1. H ∈ T.
2. A ∈ T, B ∈ T implies A + B ∈ T.
3. A ∈ T implies Ā ∈ T.

The collection of all subsets of a set H is said to be a complete algebra
of sets. An algebra of sets is always a Boolean algebra. Indeed, it
is easy to see that the validity of AB ∈ T follows from A ∈ T and B ∈ T by
conditions 1, 2 and 3, since $AB = \overline{\bar{A} + \bar{B}}$. The above 14 axioms
are evidently fulfilled.

¹ The notation a ∈ M means here and in the following that a belongs to the set
M; a ∉ M means that a does not belong to the set M.

§ 4. On the structure of finite algebras of events

An event A is said to be a compound event if it can be represented as the
sum of two events which are both different from A:

A = B + C, B ≠ A, C ≠ A.

(The condition B ≠ A, C ≠ A is necessary in order to exclude the trivial
representations A = A + O and A = A + A, which are valid for every A.)
Events which do not permit any such representation are said to be elementary
events. Compound events may be obtained in a number of ways, but
elementary events can occur in only one manner. In order to illustrate this
by an example, let A denote the event of throwing 10 in a game with two
dice. This is a compound event; indeed, the number 10 can be obtained by
throwing 5 with both dice as well as by getting 6 on one of the dice and 4
on the other. The latter event is again a compound event, since we can have
6 on the first die and 4 on the second, and conversely. If A means the result
12 with two dice, then A is an elementary event, since it can be realized only
by casting 6 with each die.
If A is an elementary event, then from B ⊆ A follows either B = O or
B = A. Since A ⊆ A always holds, we denote the fact that B ⊆ A with
B ≠ A by the symbol B ⊂ A. Clearly, from B ⊂ A follows B ⊆ A, but the
converse does not hold. By using this notation the definition of an elementary
event may be formulated as follows: the event A ≠ O is an elementary
event if and only if there exists no B (B ≠ O) such that B ⊂ A.¹ Indeed, if
B ⊂ A is valid for some B ≠ O, then from relation (11) of § 2 follows
A = B + AB̄, where B and AB̄ are distinct from A; namely, B ≠ A follows
from the assumption B ⊂ A, while from AB̄ = A there would follow, because
of B ⊂ A and thus B = AB, the equation B = AB = (AB̄)B = AO = O,
which contradicts our assumption.
We give here a further characterisation of elementary events: A ≠ O is
an elementary event if and only if for an arbitrary event B either AB = O
or AB = A. Otherwise, namely, there would exist a decomposition

A = AB + AB̄

of A, where AB ≠ A and AB ≠ O; the converse is readily proved, too.
Next we prove the following

Theorem 1. In an algebra consisting of a finite number of events, every
event can be represented as a sum of elementary events. This representation
is unique except for the order of the terms.

¹ The impossible event O is not considered to be an elementary event.

In order to prove the theorem, we need two lemmas:

Lemma 1. The product of two distinct elementary events is O.

For every A_1 and A_2 we obviously have A_1 A_2 ⊆ A_1. In particular, if A_1
and A_2 are elementary events, we have either A_1 A_2 = O or A_1 A_2 = A_1.
But the latter is impossible, since it would imply A_1 ⊆ A_2, which cannot
hold because of A_1 ≠ O and A_1 ≠ A_2.

Lemma 2. For every compound event B of an algebra of events with a finite
number of events, there exists an elementary event A such that A ⊂ B holds.

Since B is a compound event, there exists an A_1 ≠ O such that A_1 ⊂ B.
If A_1 is elementary, our statement is proved; if, however, it is not elementary,
then there exists an A_2 ⊂ A_1. If A_2 is elementary, we have found the elementary
event required; if not, there exists again a new decomposition, etc.
The procedure must end after a finite number of steps because of the finiteness
of the number of events. Therefore A_n must be elementary for some n.
Proof of Theorem 1. Let B be a compound event. According to Lemma 2
there exists then an elementary event A_1 such that A_1 ⊂ B, i.e. B = A_1 + B_1.
If B_1 is elementary, the first statement of our theorem is proved; if B_1 is
not elementary, we obtain, by applying Lemma 2 repeatedly, a representation
B_1 = A_2 + B_2, where A_2 is elementary; if B_2 is compound, the procedure
is to be continued. Since the number of events is finite, we thus obtain a
representation of B as a sum of elementary events:

B = A_1 + A_2 + ... + A_r. (1)

It is evident from this proof that all the A_i are distinct. If not already known,
this could easily be shown because of the rule A + A = A and the commutativity
of the addition. (The deduction used above to prove the representability
of B as a sum of elementary events is nothing else than the so-called
"descente infinie" known from number theory.) It remains still to prove the
uniqueness of the representation (1). If there existed two essentially different
representations

B = A_1 + A_2 + ... + A_r = A_1' + A_2' + ... + A_s' (2)

such that for instance A_1 ≠ A_j' (j = 1, 2, ..., s), then multiplication of both
sides of (2) by A_1 would yield, by Lemma 1, A_1 = O, i.e. a contradiction.
Herewith Theorem 1 is completely proved. The representation (1) is called
the canonical representation of the event B.
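In the set representation the elementary events are simply the minimal nonempty members of the algebra, and the canonical representation is the list of atoms contained in a given event. A Python sketch of ours makes this explicit for the complete algebra of the subsets of a four-element set:

from itertools import combinations

H = frozenset("abcd")
ALGEBRA = [frozenset(c) for r in range(len(H) + 1)
           for c in combinations(sorted(H), r)]

def atoms(algebra):
    # elementary events: nonempty members containing no smaller nonempty member
    nonempty = [a for a in algebra if a]
    return [a for a in nonempty if not any(b < a for b in nonempty)]

def canonical(event, algebra):
    # Theorem 1: the unique set of atoms whose sum is the event
    return [a for a in atoms(algebra) if a <= event]

B = frozenset("abc")
parts = canonical(B, ALGEBRA)
assert frozenset().union(*parts) == B
print(sorted(map(sorted, parts)))        # [['a'], ['b'], ['c']]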
Theorem 2. The number of events of a finite algebra of events is necessarily
a power of 2.

Proof. Let n denote the number of distinct elementary events of a finite
algebra of events. Every event of the algebra can be expressed as a sum of a
certain number of elementary events; this number r can be one of the numbers
0, 1, ..., n, if the impossible event is included. If r is fixed, the number
of the events which can be expressed as a sum of r distinct elementary events
is equal to the number of possible selections of exactly r from the n elementary
events A_1, A_2, ..., A_n, that is, equal to $\binom{n}{r}$. Thus the number
of elements of the whole algebra of events is $\sum_{r=0}^{n} \binom{n}{r}$, which is
equal to 2^n.

It follows from Theorem 1 that the sure event I can be represented as the
sum of all elementary events:

I = A_1 + A_2 + ... + A_n.

Thus always one and only one of the elementary events A_1, A_2, ..., A_n
occurs. The elementary events form a complete system of events.
Consider now as an example the algebra of events which consists of the
possible outcomes of a game with two dice. Clearly, the number of elementary
events is 36; let us denote them by A_ij (i, j = 1, 2, ..., 6), where A_ij
means that the result for the first die is i and that for the second is j. According
to Theorem 2 the number of events of this algebra of events is
2^36 = 68 719 476 736. It would thus be impossible to discuss all cases.
We choose therefore another example, namely the tossing of a coin twice.
The possible elementary events are: 1. first head, second head as well (let
A_11 denote this case); 2. first head, then tail, denoted by A_12; 3. first tail,
then head, denoted by A_21; 4. first tail, second also tail, denoted by A_22.
The number of all possible events is 2^4 = 16. These are: I, O, the four
elementary events, further A_11 + A_12, A_11 + A_21, A_11 + A_22, A_12 + A_21,
A_12 + A_22, A_21 + A_22, and besides these the four events Ā_11, Ā_12, Ā_21,
Ā_22 complementary to the four elementary events.
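These sixteen events are easy to enumerate by machine. A brief Python sketch of ours, encoding the four elementary events as the four outcome pairs:

from itertools import combinations

outcomes = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]   # A_11, A_12, A_21, A_22
events = [frozenset(c) for r in range(5) for c in combinations(outcomes, r)]
assert len(events) == 2 ** 4          # Theorem 2: 16 events for 4 elementary events

first_toss_head = frozenset(o for o in outcomes if o[0] == "H")   # A_11 + A_12
assert first_toss_head in events
print(len(events), "events in the algebra")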
By the canonical representation we have thus obtained a complete description
of this finite algebra of events. Now the rules of operation also obtain
a new sense. Theorem 1 namely points to a connection which will lead us
to Kolmogorov's theory of probability. A compound event, which is the
sum of elementary events, can be characterized uniquely by the set of terms
of this sum. In this way one can assign to every event a set, namely the set
of the elementary events whose sum is the canonical representation of the
event. Let A' denote the collection of elementary events which form the
event A and similarly let B' denote the collection of the elementary events
from which the event B is composed. One can show that the collection of
elementary events from which the event AB is composed is the intersection
A'B' of A' and B'; further, that the collection of elementary events from
which the event A + B is composed is equal to the union A' + B' of the
sets A' and B'. In this assignment of events to sets, the elementary events
themselves correspond to the sets having only one element. Obviously the
empty set corresponds to the impossible event. To the sure event corresponds
the set of all possible elementary events (with respect to the same
experiment); this set will be denoted by H and will be called the sample
space. Further, it is easy to show that to the complementary event Ā corresponds
the complementary set of A' with respect to H.
In the following paragraph we shall show that to every algebra of events
there corresponds an algebra of the subsets of a set H such that there corresponds
to the event A + B the union of the sets belonging to A and B, to
the product AB the intersection of the sets belonging to A and B, and finally
to the complementary event Ā the complementary set with respect to H of
the set belonging to A. In other words, one can find to every algebra of events
an algebra of sets which is isomorphic to it.
The proof of this theorem, due to Stone [1], is not at all simple; it will
be accomplished in the next paragraph; on first reading it can be omitted,
since Stone's theorem will not be used in what follows. We give the proof
only in order to show that the basic assumption of Kolmogorov's theory,
i.e. that events can always be represented by sets, does not restrict the
generality in any way.
In the case of a finite algebra of events this fact was already established
by means of Theorem 1. Here we even have a uniquely determined event
corresponding to every subset of the sample space.
The theory of Boolean algebras is a particular case of the theory of more
general structures called lattices (cf. e.g. G. Birkhoff [1]).

§ 5. Representation of algebras of events by algebras of sets

In this paragraph we prove a theorem due to Stone, mentioned in § 4.¹

Theorem. There can be associated with every algebra of events an algebra
of sets isomorphic to it.

Proof. Let 𝒜 be an algebra of events; let A, B, C, ... denote its elements.
Consider a subset α of 𝒜 having the following properties:

1. The event O is not contained in α.
2. From A ∈ α and B ∈ α follows AB ∈ α.

¹ The proof given here is due to O. Frink [1].

3. Among the sets satisfying conditions 1 and 2, α is maximal in the following
sense: there exists no set β satisfying conditions 1 and 2 and containing
α as a proper subset.

The sets α fulfilling these three conditions will briefly be called crowds of
events.¹
It is easy to show the following: if α is a crowd of events, we have

a) I ∈ α.
b) A ∈ α implies Ā ∉ α, and conversely.
c) If A + B ∈ α, then A or B belongs to α.
Now we prove the property of the crowds of events stated in

Lemma 1. AB ∈ α implies A ∈ α (and clearly, because of the symmetry
of the formula, it implies B ∈ α as well).

Proof. Suppose that AB ∈ α, A ∉ α. Let β be the union of α and of all
events AC, where C runs through all elements of α; α is a proper subset of β,
since we have I ∈ α and thus A = AI ∈ β, while A ∉ α.
The set β fulfils the conditions 1 and 2. First we prove that β satisfies
condition 1, i.e. that O ∉ β. Indeed, because of AB ∈ α, we have AB ≠ O,
hence A ≠ O; further, since (AC)B = (AB)C and (AB)C ∈ α, we have for
C ∈ α the relation AC ≠ O. From this our first statement follows. It remains
to show that D ∈ β and E ∈ β imply the relation DE ∈ β. Now we have
either D ∈ α, E ∈ α, or D ∈ α, E ∉ α (or conversely), or else D ∉ α, E ∉ α.
If D ∈ α, E ∈ α then, by our assumption, we have DE ∈ α and thus certainly
DE ∈ β. If D ∈ α, E ∉ α, then there exists a C ∈ α such that E = AC,
hence DE = A(CD); since CD ∈ α, we have DE ∈ β. In the case D ∉ α,
E ∉ α there exist two elements C_1 and C_2 of α such that D = AC_1 and
E = AC_2. Then DE = A(C_1 C_2), and since C_1 C_2 ∈ α, we have DE ∈ β.
But this contradicts the assumption that α is maximal. Lemma 1 is therefore
proved.
Next we prove

Lemma 2. Every event A (A ≠ O) of an algebra of events 𝒜 belongs to at
least one crowd of events α.

¹ In lattice theory such systems are called ultrafilters; ultrafilters are commonly
characterized as sets complementary to prime ideals. A nonempty subset β of a Boolean
algebra 𝒜 is called a prime ideal if the following conditions are fulfilled: 1. A ∈ β
and B ∈ β imply A + B ∈ β. 2. A ∈ β and B ∈ 𝒜 imply AB ∈ β. 3. If AB ∈ β, then
A ∈ β or B ∈ β (or both). Cf. e.g. G. Aumann [1].

This lemma is the consequence of a general set-theoretical theorem of
Hausdorff; we have, however, to define some notions in order to state it.
A set 𝒫 is said to be partially ordered if an ordering relation, denoted by
<, is defined for certain pairs of its elements; if a < b, we say that the element
a precedes the element b.
The relation < is required to fulfil the following conditions:

1. For no element does a < a hold.
2. a < b and b < c imply a < c.

The relation < is therefore irreflexive and transitive. A subset 𝒞 of a
partially ordered set 𝒫 is called a chain if for any two elements a and b
of 𝒞 either a < b or b < a holds. A chain is thus a totally ordered subset
of 𝒫. A chain 𝒞 is said to be a maximal chain in 𝒫 if there does not exist
any element of 𝒫 such that by adjoining it to 𝒞 the subset so obtained
would still remain a chain.
We are now ready to state the above-mentioned lemma.

Lemma 3 (Hausdorff). If 𝒫 is a partially ordered set, then every chain
𝒞 of 𝒫 is a subset of a maximal chain.¹

Let us return to the proof of Lemma 2. Let 𝒫 denote the set of those
systems of events β in the algebra of events 𝒜 which fulfil conditions 1 and 2
of the crowds of events. If β < γ means that β is a proper subset of γ, 𝒫 is
a partially ordered set. If A ≠ O, the set β = {A} consisting only of the
element A evidently fulfils conditions 1 and 2. According to Lemma 3 there
exists a maximal chain containing β = {A} as a subset. Let α denote the
union of the subsets γ belonging to this chain. Clearly, α is a crowd of events,
since it is the union of sets β fulfilling the rules 1 and 2 defining the crowds
of events: no element of the chain contains the event O, and thus α does
not contain O either. Further, if B_1 and B_2 belong to α, they belong to a
subset β_1 and a subset β_2 of α, respectively. Since either β_1 < β_2 or the
contrary must hold, B_1 and B_2 both belong to β_1 or both to β_2, and the
same holds for B_1 B_2. Thus B_1 B_2 belongs to α as well. Further we see
that α cannot be extended; this is a consequence of the requirement that
the chain be a maximal chain. Lemma 2 is thus proved.
Now we can construct for every algebra of events an algebra of sets isomorphic
to it. Let ℬ be the set of all crowds of events α of the algebra of
events 𝒜. We assign to every event A of 𝒜 the subset ℬ_A of ℬ consisting
of all crowds of events α containing the event A. The set ℬ_A will be called
the representative of the event A. As A ≠ O, ℬ_A is, by Lemma 2, a nonempty
set. We associate with the event O the empty set. The system consisting
of the empty set and of all the sets ℬ_A will be denoted by 𝒮.

¹ As to the proof of Lemma 3 cf. e.g. F. Hausdorff [1] or O. Frink [2].
We prove the following relations:

ℬ_A ℬ_B = ℬ_{AB}, (1)

ℬ̄_A = ℬ_{Ā}, (2)

ℬ_A + ℬ_B = ℬ_{A+B}. (3)

Further, it will be proved that the correspondence A → ℬ_A is one-to-one;
thus A ≠ B implies ℬ_A ≠ ℬ_B. Hence the algebra of sets 𝒮 is isomorphic
to the algebra of events 𝒜. (We understand of course by ℬ_A ℬ_B the intersection
of ℬ_A and ℬ_B and by ℬ_A + ℬ_B the union of both sets; ℬ̄_A denotes
the complementary set of ℬ_A with respect to ℬ, while AB, A + B, Ā denote
the corresponding operations in the algebra of events 𝒜.)
The relation (1) can be proved as follows. A crowd of events α belonging
to both of ℬ_A and ℬ_B contains both A and B and thus AB as well. Conversely,
if AB belongs to α, then by Lemma 1 A ∈ α and B ∈ α, and thus α
belongs to ℬ_A and to ℬ_B, hence also to ℬ_A ℬ_B.
Proof of (2). If the crowd of events α does not belong to the set ℬ_A,
it must contain an event B such that AB = O; otherwise, namely, AB ≠ O
for every element B of α, and α could be extended by adjoining A and every
product of the form AB (B ∈ α). This leads, just as in the proof of Lemma 1,
to a result which contradicts the maximality of α.
From AB = O it follows that B = AB + ĀB = ĀB. Hence Ā belongs,
according to Lemma 1, to α. Conversely, if Ā belongs to α, A cannot belong
to it, since AĀ = O. From this it follows that ℬ̄_A consists exactly of the
crowds of events containing the event Ā, hence ℬ̄_A = ℬ_{Ā}.
Relation (3) is a direct consequence of relations (1) and (2). Indeed, applying
(2), formula (20) of § 1, and (1), we have

$\mathscr{B}_{A+B} = \overline{\mathscr{B}_{\overline{A+B}}} = \overline{\mathscr{B}_{\bar{A}\bar{B}}} = \overline{\mathscr{B}_{\bar{A}}\mathscr{B}_{\bar{B}}} = \overline{\overline{\mathscr{B}_{A}}\;\overline{\mathscr{B}_{B}}} = \mathscr{B}_{A} + \mathscr{B}_{B}.$

Thus we have proved that 𝒮 is an algebra of sets. In order to show that
𝒮 is isomorphic to the algebra of events 𝒜, it still remains to prove that
the correspondence A → ℬ_A is one-to-one. If A ≠ B, we have A Δ B ≠ O.
Hence at least one of the relations ĀB ≠ O and AB̄ ≠ O is valid as well.
Suppose that ĀB ≠ O. Because of (1), ℬ_{ĀB} = ℬ_{Ā} ℬ_B, and by Lemma 2
ℬ_{ĀB} is not empty. Hence every crowd of events belonging to ℬ_{ĀB}
belongs to ℬ_{Ā} and also to ℬ_B, hence it belongs to ℬ_B and does not
belong to ℬ_A. Thus we have proved the existence of crowds of events which
belong to ℬ_B but do not belong to ℬ_A. Hence ℬ_B and ℬ_A

cannot coincide. Herewith the theorem is proved. In this proof we have
used, as regards 𝒜, only its property of being a Boolean algebra; therefore
we may formulate the theorem just proved in a more general manner:
There exists to every Boolean algebra an algebra of sets isomorphic to it.
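In the finite case the whole construction of the proof can be carried out by machine: the crowds of events can be found by brute force, and the relations (1)–(3) and the one-to-one character of A → ℬ_A can be checked directly. The following Python sketch of ours does this for the algebra of all subsets of a three-element set; it turns out, as expected, that each crowd consists of the events containing one fixed elementary event:

from itertools import combinations

H = frozenset(range(3))
ALG = [frozenset(c) for r in range(len(H) + 1) for c in combinations(H, r)]
O = frozenset()

def extendable(s, x):
    # alpha could be extended by x if no product with x yields the impossible event
    return O not in ({x} | {x & b for b in s})

def is_crowd(s):
    return (O not in s                                           # condition 1
            and all(a & b in s for a in s for b in s)            # condition 2
            and all(not extendable(s, x) for x in ALG if x not in s))  # condition 3

crowds = [frozenset(s) for r in range(1, len(ALG) + 1)
          for s in map(set, combinations(ALG, r)) if is_crowd(s)]

rep = lambda A: frozenset(c for c in crowds if A in c)   # the representative of A

for A in ALG:
    assert rep(H - A) == frozenset(crowds) - rep(A)      # relation (2)
    for B in ALG:
        assert rep(A & B) == rep(A) & rep(B)             # relation (1)
        assert rep(A | B) == rep(A) | rep(B)             # relation (3)
assert len({rep(A) for A in ALG}) == len(ALG)            # the map is one-to-one
print(len(crowds), "crowds of events found")             # 3, one per elementary event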

§ 6. Exercises

1. Prove
a) $\overline{AB + CD} = (\bar{A} + \bar{B})(\bar{C} + \bar{D})$,
b) (A + B)(Ā + B̄) + (A + B̄)(Ā + B) = I,
c) (A + B)(Ā + B)(A + B̄)(Ā + B̄) = O,
d) (A + B)(A + C)(B + C) = AB + AC + BC,
e) A − BC = (A − B) + (A − C),
f) A − (B + C) = (A − B) − C,
g) (A − B) + C = [(A + C) − B] + BC,
h) (A − B) − (C − D) = [A − (B + C)] + (AD − B),
i) A − {A − [B − (B − C)]} = ABC,
j) ABC + ABD + ACD + BCD =
= (A + B)(A + C)(A + D)(B + C)(B + D)(C + D),
k) A + B + C = (A − B) + (B − C) + (C − A) + ABC,
l) A Δ (B Δ C) = (A Δ B) Δ C,
m) (A + B̄) Δ (Ā + B) = A Δ B,
n) AB̄ Δ BĀ = A Δ B.
o) Prove the relations enumerated in § 2 (6) for the symmetric difference.
p) The relation (A + B) − B = A does not hold in general. Under what
conditions is it valid?
q) Prove that A Δ B = C Δ D implies A Δ C = B Δ D.

Hint. If A Δ B = C Δ D and A, B, C, D are subsets of the same set, then every
point of this set belongs to an even number (0, 2, or 4) of the sets A, B, C, D; and
A Δ C = B Δ D means the same.

r) Prove that the elements of an arbitrary algebra of events form an Abelian
group with respect to the symmetric difference as the group operation.

2. The elements of a Boolean algebra form a ring with respect to the operations
of symmetric difference and multiplication. The zero element is O, the unit element is I.
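This ring structure can again be confirmed mechanically for an algebra of sets, with the symmetric difference as ring addition; a Python check of ours:

from itertools import combinations

H = frozenset(range(3))
S = [frozenset(c) for r in range(len(H) + 1) for c in combinations(H, r)]
O = frozenset()

for A in S:
    assert A ^ O == A and A ^ A == O     # O is the zero; every element is its own inverse
    assert A & H == A                    # I is the unit
    for B in S:
        assert A ^ B == B ^ A            # the group of Exercise 1 r) is Abelian
        for C in S:
            assert (A ^ B) ^ C == A ^ (B ^ C)           # associativity of addition
            assert A & (B ^ C) == (A & B) ^ (A & C)     # distributivity
print("ring axioms verified")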

3. In a finite algebra of events containing n elementary events one can give several
complete systems of events. Complete systems of events differing only in the order
of the terms are to be considered as identical. Let T_n denote the number of the
different complete systems of events.
a) Prove that T_1 = 1, T_2 = 2, T_3 = 5, T_4 = 15, T_5 = 52, T_6 = 203.
b) Prove the recursion formula

$T_{n+1} = \sum_{k=0}^{n} \binom{n}{k} T_k$ (where T_0 = 1)

and show that T_10 = 115 975.

c) Prove

$T_n = \frac{1}{e} \sum_{k=1}^{\infty} \frac{k^n}{k!}$.
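The numbers T_n (known in combinatorics as the Bell numbers) are quickly computed from the recursion in b); a short Python sketch of ours:

from math import comb

T = [1]                                        # T_0 = 1 by convention
for n in range(10):
    T.append(sum(comb(n, k) * T[k] for k in range(n + 1)))

assert T[1:7] == [1, 2, 5, 15, 52, 203]        # the values of a)
assert T[10] == 115_975                        # the value required in b)
print(T[1:11])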

4. Let γ_n denote the number of complete systems of events consisting of three
elements in a finite algebra of events consisting of n elementary events. Show that

$\gamma_n = \frac{3^{n-1} + 1}{2} - 2^{n-1}$.

5. Prove the relation

$e^{e^x - 1} = 1 + \sum_{n=1}^{\infty} \frac{T_n}{n!} x^n$

(T_n means the same as in Exercise 3).


6. Let Q_n denote the number of complete systems of events in an algebra with n
elementary events such that every event is the sum of an odd number of distinct
elementary events. Prove that

Q_1 = 1, Q_2 = 1, Q_3 = 2, Q_4 = 5, Q_5 = 12, Q_6 = 37,

further that

$e^{\sinh x} = 1 + \sum_{n=1}^{\infty} \frac{Q_n}{n!} x^n$.

7. We can construct from the events A, B, and C by repeated addition and multiplication
eighteen, in general different, events, namely A, B, C, AB, AC, BC, A + B,
B + C, C + A, A + BC, B + AC, C + AB, AB + AC, AB + BC, AC + BC, ABC,
A + B + C, AB + AC + BC. (The phrase "in general different" means here that
no two of these events are identical for all possible choices of the events A, B, C.)
Prove that from 4 events one can construct 166, from 5 events 7579, and from 6 events
7 828 352 events in this way. (No general formula is known for the number of events
which can be formed from n events.)
8. The divisors of an arbitrary square-free¹ number N form a Boolean algebra
if the operations are defined as follows: we understand by the "sum" of two divisors
of N their least common multiple and by their "product" their greatest common
divisor; d being a divisor of N, we understand by d̄ the number N/d; the number 1
serves as O and the number N as I.
9. Verify that for the example of Exercise 8 our Theorem 1 is the same as the
well-known theorem on the unique representability of (square-free) integers as a
product of prime numbers.
10. The numbers 0, 1, ..., 2^n − 1 form a Boolean algebra if the rules of operation
are defined as follows: represent these numbers in the binary system. We understand
by the "product" of two numbers the number obtained by multiplying the corresponding
digits of both numbers place by place, and by the "sum" the number obtained
by adding the digits place by place and by replacing everywhere the digit 2 obtained
in the course of the addition by 1.

¹ A number is said to be square-free if it is not divisible by any square number
except 1. The square-freeness of N is required only to ensure the existence of the
complementary element.
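Both of these finite Boolean algebras are easy to test by machine. In the Python sketch below (ours), the first block checks complementation and a distributive law in the divisor algebra of the square-free number N = 30 (Exercise 8), and the second block checks complementation in the binary-digit algebra of Exercise 10, where the "product" and "sum" are the bitwise operations & and |:

from math import gcd

N = 30                                       # square-free: 2 * 3 * 5
D = [d for d in range(1, N + 1) if N % d == 0]
lcm = lambda a, b: a * b // gcd(a, b)

for a in D:
    assert lcm(a, N // a) == N and gcd(a, N // a) == 1             # a + ā = I, a·ā = O
    for b in D:
        for c in D:
            assert gcd(a, lcm(b, c)) == lcm(gcd(a, b), gcd(a, c))  # distributivity

n = 4
for a in range(2 ** n):
    comp = a ^ (2 ** n - 1)                  # digitwise complement
    assert a & comp == 0 and a | comp == 2 ** n - 1
print("Exercises 8 and 10 verified for N = 30 and n = 4")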

11. Let A, B, C denote electric relays or networks of relays. Any two of these
may be connected in series or in parallel. Two such networks which are either both
closed (allowing current to pass) or both open (not allowing current to pass) are
considered as equivalent. Let A + B denote that A and B are coupled in parallel,
AB that they are coupled in series. Let Ā denote a network which is always closed
if A is open, and conversely. Let O denote a network allowing no current to pass
and I a network always closed. Prove that all axioms of Boolean algebras are fulfilled.¹

Hint. The relation (A + B)C = AC + BC means for instance that it comes to
the same thing whether we first connect A and B in parallel and couple the network
so obtained with C in series, or first couple A and C in series, then B and C in series,
and then connect the two systems so obtained in parallel. The two systems are
equivalent in the sense that they either both allow the current to pass or both do
not. A similar consideration holds for the other distributive law. Both distributive
laws are illustrated in Fig. 5.

Fig. 5 (relay networks illustrating (A + B)C = AC + BC and AB + C = (A + C)(B + C))
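The behaviour of such networks is captured by Boolean functions: closed = True, open = False, series coupling = "and", parallel coupling = "or". The following Python sketch of ours confirms both distributive laws of Fig. 5 for every state of the three relays:

from itertools import product

for a, b, c in product([False, True], repeat=3):
    assert ((a or b) and c) == ((a and c) or (b and c))    # (A + B)C = AC + BC
    assert ((a and b) or c) == ((a or c) and (b or c))     # AB + C = (A + C)(B + C)
print("both distributive laws hold in all 8 relay states")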

12. A domain of the plane is said to be convex if, for any two of its points, it
contains the segment connecting these points as well. We understand by the "sum"
of two convex domains the least convex domain containing both, and by their
"product" their intersection, which evidently is convex as well. Let further I denote
the entire plane and O the empty set. The addition and multiplication fulfil axioms
(1.1)–(2.3); the distributive laws are, however, not valid, and the complement Ā is
not defined.
13. Let us understand by a linear form a point, a line or a plane of the 3-dimensional
affine space; further, let the empty set and the entire 3-dimensional space be called
linear forms too. We define as the sum of a finite number of linear forms the least
linear form containing their set-theoretical union; let their product be their (set-theoretical)
intersection, which is evidently a linear form too. Prove the same propositions
as in Exercise 12.

¹ This example shows how Boolean algebra can be applied in the theory of networks
and why it is of great importance in communication theory and in the construction
of computers (cf. e.g. M. A. Gavrilov [1]).

14. Let A_1, A_2, ..., A_n be arbitrary events. Form all products of these events
containing k distinct factors, and let S_k be the sum of all such products. Let P_k be
the product of all events representable as a sum of k distinct terms of A_1, A_2, ..., A_n.
Prove the relation

S_k = P_{n−k+1} (k = 1, 2, ..., n)

in a formal way by applying the rules of operation of Boolean algebras, and verify
it directly too (cf. the generalization of Exercise 1d).

Hint. S_k means that among the events A_1, A_2, ..., A_n there are at least k which
occur, and the meaning of P_{n−k+1} is that among these same events there are no
n − k + 1 which do not occur; these two statements are equivalent.
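The identity S_k = P_{n−k+1} can be tested numerically as well; the Python sketch below (ours) draws n random events as subsets of a small sample space and compares the two sides for every k:

from itertools import combinations
import random

random.seed(1)
H = range(12)
n = 4
A = [frozenset(x for x in H if random.random() < 0.5) for _ in range(n)]

def S(k):   # sum of all products of k distinct factors
    return frozenset().union(*(frozenset.intersection(*c)
                               for c in combinations(A, k)))

def P(k):   # product of all sums of k distinct terms
    return frozenset(H).intersection(*(frozenset().union(*c)
                                       for c in combinations(A, k)))

for k in range(1, n + 1):
    assert S(k) == P(n - k + 1)
print("S_k = P_(n-k+1) verified for k = 1, ...,", n)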

15. Let H be a set and T a set of certain subsets of H. T is an algebra of sets if
and only if the following conditions are fulfilled:

a) The set H belongs to T.
b) Whenever A and B belong to T, A − B belongs to T as well.

16. Show that condition b) of the preceding exercise cannot be replaced by b')
"whenever A and B belong to T, A Δ B belongs to T as well".

17. If the conditions

α) A ∈ T implies Ā ∈ T,
β) A ∈ T and B ∈ T imply AB ∈ T

are postulated instead of conditions a) and b) of Exercise 15 and T is nonempty,
then T is an algebra of sets.

18. Condition β) of the preceding exercise may be replaced by

β') "A ∈ T and B ∈ T imply A + B ∈ T."

19. Prove that in Exercise 18 the proposition cannot be replaced by

β'') "A ∈ T, B ∈ T, and AB = O imply A + B ∈ T."

Hint. Let H be a finite set of the elements a, b, c, d and let T consist of the following
subsets of H: {a, b}; {c, d}; {a, c}; {b, d}; O; H.

20. We call a nonempty system 𝒮 of subsets of a set H which contains, with any
two sets A and B, also A + B and A − B, a ring of sets. A ring of sets ℛ is thus an
algebra of sets if and only if H belongs to ℛ. Prove that a nonempty system of sets
containing with A and B also AB and A − B is not necessarily a ring of sets. Show
that the condition "with any two sets A and B, also A + B and AB belong to 𝒮"
is not sufficient for 𝒮 to be a ring of sets either.
CHAPTER II

PROBABILITY

§ 1. Aim and scope of the theory of probability

There exist in nature several phenomena fitting into a deterministic scheme:
given a complex K of circumstances, a certain event A necessarily occurs.
On the other hand, there are a number of phenomena in the sciences as well
as in our everyday life which cannot be described by such schemes. It is
characteristic of such phenomena that, given the complex K of circumstances,
the event A may or may not occur. Such events are called random
events, and such schemes are said to be stochastic schemes. Consider for
instance a radioactive atom during a certain time interval: the atom may or
may not decay during the time of observation. The instant of the decay
depends on processes in the nucleus which are, however, neither known to
us nor observable.
We can see from this example the need for studying stochastic schemes.
Very often it is entirely impossible (at least at the present state of knowledge)
to consider all relevant circumstances. But in many practical problems this
is not at all necessary. An event A may be a random event with respect to
one complex of circumstances and at the same time it may be completely
determined with respect to another, more comprehensive complex of circumstances.
The randomness or determinedness of an event is a matter of fact:
it depends only on whether the given complex of circumstances K does or
does not determine the course of the phenomena (that is, the occurrence or
the non-occurrence of the event A). But the choice of the complex of circumstances
K depends on us, and we have a certain freedom to choose it
within the limits of the possibilities.
Regarding random mass-phenomena we can sketch their outlines in
spite of their random character. Consider for instance radioactive disintegration.
Each radioactive substance continues to decay according to a
well-determined "rate"; we can predict what percentage of the substance
will disintegrate during a given time interval. The disintegration follows an
exponential law (cf. Ch. III) characterized by the half-life period. (The
half-life period is the time interval during which half of the radioactive
substance disintegrates. In the case of radium this is some 1600 years.) The
exponential law is a typical "probability law". This law is confirmed by the
observations with the same accuracy as most of the "deterministic" laws of
nature. The radioactive disintegration is thus a mass phenomenon described,
as to its regularity, by the theory of probability.
As seen in the above example, phenomena described by a stochastic
scheme ate also subject to natural laws. But in these cases the complex of
the considered circumstances does not determine the exact course of the
events; it determines a probability law, giving a bird’s view of the outcome.
Probability theory aims at the study of random mass-phenomena, this
explains its great practical importance. Indeed we encounter random mass-
phenomena in nearly all fields of science, industry, and everyday life.
Almost every “deterministic” scheme of the sciences turns out to be sto­
chastic at a closer examination. The laws of Boyle, Mariotte, and Gay-Lus­
sac for instance are usually considered to be deterministic laws. But the
pressure of the gas is caused by the impacts of the molecules of the gas on
the walls of the container. The mean pressure of the gas is determined by
the number and the velocity of the molecules hitting the wall of the container
per time unit. In fact, the pressure of the gas shows small fluctuations,
which may, however, be neglected in case of greater gas-masses. As another
example consider the chemical reaction of two substances A and B in an aqueous solution. As is well known, the velocity of the reaction is at every instant proportional to the product of the concentrations of A and B. This
law is commonly considered as a causal one, but in reality the situation is
as follows. The atoms (respectively the ions) of the two substances move
freely in the solution. The average number of the “encounters” of an ion
of substance A with an ion of substance B is proportional to the product of
their concentrations; hence this law turns out to be essentially a stochastic
one too.
The development of modern science makes it often necessary to examine
small fluctuations in phenomena dealt with earlier only in their outlines
and considered at that level as causal. In the following, we shall find several
occasions to illustrate these principal questions with concrete examples.

§ 2. The notion of probability

Consider an experiment where the circumstances regarded do not uniquely determine the outcome of the experiment, but leave several possibilities open. Let A be one of these. If we repeat the experiment under invariant conditions, A will occur in some of these experiments, while in the others Ā will occur. If among n experiments the event A occurred exactly k times, then k is called the frequency and k/n the relative frequency of the event A in the given sequence of experiments. Generally, the relative frequency of a random event is not constant in different sequences of experiments.
Consider as an example screws produced by an automatic machine. Let A denote the event that the screw does not fit the requirements, i.e. that it is defective. The frequency of the defective items will in general be different in every series, for instance in the first series 3 in 100 screws, in the second 5, etc. Under constant circumstances of production the percentage of the de-
fective items fluctuates about a certain value. After a change in the circum­
stances of production this value may be different, but about the new value
there will again be fluctuations.
The mentioned fluctuations of the relative frequency may be observed in
the simple experiment of coin tossing by observing for instance the relative
frequency of the heads. If H denotes head and T tail, we may obtain in a
sequence of 25 tossings the following outcome:
HTTHHTTTHHTHHHTTTHTTHHHTT.
Figure 6 represents the fluctuations of the relative frequency.
Anyone may perform such experiments in a few minutes. The order of
heads and tails will every time be different but the general picture will re­
main essentially similar to the above: the relative frequency will always
fluctuate about the value 1/2.
Figure 7 represents the outcome of an experiment of 400 tossings. As
early as in the eighteenth century large sequences of tossings were observed.
Buffon for instance performed an experiment of 4040 tossings. The outcome
was 2048 times head, hence the relative frequency was 0.5069. In the beginning of our century K. Pearson obtained from 24 000 tossings the value 0.5005 for the relative frequency.
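The stability just described is easy to observe numerically. The following minimal simulation (a modern illustration added here; the seed and sample sizes are arbitrary) tosses a fair coin repeatedly and prints the relative frequency of heads:

```python
import random

random.seed(1)  # fixed seed so that the run is reproducible

for n in (10, 100, 1000, 10000, 100000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    # the relative frequency k/n fluctuates about the probability 1/2,
    # and the fluctuations generally decrease as n grows
    print(f"n = {n:6d}   relative frequency of heads = {heads / n:.4f}")
```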
There are thus random events showing a certain stability of the relative
frequency, i.e. the latter fluctuates about a well-determined value and the

more trials are performed, the smaller are, generally, the fluctuations. The
number, about which the relative frequency of an event fluctuates, is called
the probability of the event in question. Thus, for instance, the probability
of “heads” (provided the coin is regular) is equal to 1/2.
Consider now another example. In throws of a regular die of homogeneous substance the relative frequency of any one of the faces 1, 2, 3, 4, 5, 6 fluctuates about 1/6, i.e. the probability of each number is equal to 1/6; if, however, the die is deformed, e.g. by filing down one of its faces, these probabilities will be different.
A further example is the following: there is a well-determined probability
that a certain atom of radioactive substance disintegrates during a given
time interval t. That is, in repeated observation of atoms during a time inter­
val t we find that the proportion of the atoms decaying during this time interval fluctuates about a well-determined value. This value is, according to the observations, $1 - e^{-\lambda t}$, where λ is a positive constant depending on the radioactive substance. (E.g. in the case of radium, if the time is measured in seconds, $\lambda = 1.38 \cdot 10^{-11}$.) Later on we shall give a theoretical foundation of this
law, i.e. we shall deduce it from simple assumptions.
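As a small numerical illustration (an added sketch using the value of λ quoted in the text), one may compute the probability that a given radium atom decays within one year:

```python
import math

lam = 1.38e-11           # decay constant of radium, per second (value quoted above)
t = 365.25 * 24 * 3600   # one year, in seconds

# the probability of decay during a time interval t is 1 - e^(-lambda * t)
p = 1 - math.exp(-lam * t)
print(f"decay probability within one year: {p:.4e}")  # roughly 4.4e-4
```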
Comparing our examples of coin tossing and radioactive disintegration we see that in the first a coin is tossed many times in succession, while in
the second a great number of simultaneous events are observed. This differ­
ence is, however, unessential. Indeed, instead of tossing one coin succes­
sively, we could toss a great number of similar coins simultaneously. The only
essential thing from the point of view of probability theory is the (at least
conceptually) unrestricted repeatability of the observation under the same
circumstances.
To sum up: If the relative frequency of a random event fluctuates about
a well-determined number, the latter is called the probability of the event
in question. The probability of an event may change with the circumstances
of the experiment. The only way to decide whether the relative frequency
has or does not have the property of stability and to determine the value
about which the statistical fluctuations occur, is empirical observation.
Thus the probability is considered to be a value independent of the
observer. It gives the approximate value of the relative frequency of an event
in a long sequence of experiments.
As the probability of a random event is an objective value which does not
depend on the observer, one can approximately calculate the long run
behaviour of such events, and the calculations can be empirically verified.
In everyday life we often make subjective judgements concerning the prob­
ability of a random event. The mathematical theory of probability, how­
ever, does not deal with these subjective judgements but with objective proba­
bilities. These objective probabilities are to be “measured” just like physical
quantities. Probability theory and mathematical statistics have their own
methods to “measure” probabilities. These methods are mostly indirect
and in the end are always based upon the observation of relative frequencies.
But at a conceptual level we must sharply distinguish the probability, which
is a fixed number, from the relative frequency depending on chance.

§ 3. Probability algebras

The theory developed in this book is due to A. N. Kolmogorov. It is the basis of modern probability theory.
In this theory one starts from the assumption that to all possible (or at least all considered) events in an experiment (or in other words: to all
elements of an algebra of events) a numerical value is assigned: the proba­
bility of the event in question. If we perform an experiment n times and find
k occurrences of the event A, then we have, because of the relation $0 \le k \le n$, in any case $0 \le k/n \le 1$; that is, the relative frequency of an event is always
a number lying between zero and one. Evidently, the probability of an event
must therefore lie between zero and one as well. It is further clear that the
relative frequency of the sure event is equal to one and that of the impossible
event is equal to zero. Hence also the probability of the “sure” event must
be equal to one and that of the “impossible” event equal to zero. If A and B are two possible outcomes of the same experiment which mutually exclude each other and if in n performances of the same experiment the event A occurred $k_A$ times and the event B $k_B$ times, then clearly the event A + B occurred $k_A + k_B$ times. Hence, denoting by $f_A$, $f_B$ and $f_{A+B}$ the relative frequencies of A, B, and A + B, respectively, we have:

$$f_{A+B} = f_A + f_B;$$

in other words, the relative frequency of the sum of two mutually exclusive
events is always equal to the sum of the relative frequencies of these events.
Hence also the probability of the sum of two mutually exclusive events must
be equal to the sum of the probabilities of the events. We therefore take the
following axioms:
α) To each element A of an algebra of events a non-negative real number P(A) is assigned, called the probability of the event A.
β) The probability of the sure event is equal to one, that is P(I) = 1.
γ) If AB = O, then P(A + B) = P(A) + P(B).

An algebra of events 𝒜 in which to every element A a number P(A) is assigned satisfying Axioms α), β), γ) will be called a probability algebra.
Let us first see some consequences of the axioms:

Theorem 1. If B ⊆ A, then P(B) ≤ P(A).

Proof. From B ⊆ A it follows A = B + C, where C = AB̄, and hence BC = O. Thus by Axiom γ) we have

P(A) = P(B) + P(C).

Since because of Axiom α) P(C) ≥ 0, Theorem 1 follows immediately. It


can also be deduced directly from the relation between probability and re­
lative frequency. Indeed, if the occurrence of event B implies the occurrence of event A (i.e. if B ⊆ A), then in any sequence of experiments the event A
occurs at least as often as the event B.
Since A ⊆ I for every A ∈ 𝒜, it follows from Theorem 1 that

P(A) ≤ 1 for every A.


Another important consequence of the axioms is the possibility to find the probability of the contrary event Ā from the probability of the event A. Indeed, we have A + Ā = I and AĀ = O, and thus by Axiom γ) P(A) + P(Ā) = P(I) = 1. Herewith we proved

Theorem 2. For any event A the relation

$$P(A) + P(\bar{A}) = 1$$

holds.
Because of O = Ī, it follows from Theorem 2

$$P(O) = 1 - P(I) = 1 - 1 = 0;$$

i.e. the probability of the impossible event is equal to 0.
Theorem 2 can be deduced directly from the conceptual definition of
probability. Indeed, if in a sequence consisting of n experiments the event A occurs k times, then Ā occurs exactly (n − k) times. Hence for the relative frequencies $f_A$ and $f_{\bar{A}}$ we have

$$f_A + f_{\bar{A}} = 1.$$
Axiom γ) states that the probability of the sum of two mutually exclusive events is equal to the sum of the probabilities of the two events. This leads immediately to

Theorem 3. If the events A₁, A₂, ..., Aₙ are pairwise exclusive, i.e. if $A_i A_j = O$ for i ≠ j, then

$$P(A_1 + A_2 + \ldots + A_n) = P(A_1) + P(A_2) + \ldots + P(A_n).$$

The proof proceeds by mathematical induction. Our theorem is valid for n = 2. Suppose that it is proved for n − 1. Since clearly $(A_1 + A_2) A_k = O$ (k = 3, 4, ..., n), we have because of the induction assumption

$$P(A_1 + \ldots + A_n) = P(A_1 + A_2) + P(A_3) + \ldots + P(A_n) = P(A_1) + P(A_2) + \ldots + P(A_n).$$

From Theorem 3 follows, as a generalization of Theorem 2,

Theorem 4. If the events A₁, A₂, ..., Aₙ form a complete system (cf. I, § 2), then

$$P(A_1) + P(A_2) + \ldots + P(A_n) = 1.$$

Indeed, by assumption

$$A_1 + A_2 + \ldots + A_n = I \quad \text{and} \quad A_i A_j = O$$
for i ≠ j, further P(I) = 1; hence Theorem 4 follows immediately from


Theorem 3.
As a further important consequence of the axioms we show how to cal­
culate the probability of the sum of two events A and B, if we drop the re­
striction that A and B should exclude each other.

Theorem 5. Let A and B be arbitrary events. We have then

$$P(A + B) = P(A) + P(B) - P(AB).$$

Proof. A + B can be represented as a sum of two mutually exclusive events. We have namely

$$A + B = A + \bar{A}B.$$

Hence by Axiom γ)

$$P(A + B) = P(A) + P(\bar{A}B). \tag{1}$$

On the other hand, we have $B = AB + \bar{A}B$ and $AB \cdot \bar{A}B = O$. From these it follows

$$P(B) = P(AB) + P(\bar{A}B). \tag{2}$$

Subtraction of Equation (2) from (1) leads to our statement. In particular, if we have AB = O our theorem reduces to the assertion of Axiom γ).
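Theorem 5 can be checked numerically in any classical probability algebra. The following sketch (an added illustration; the two dice events are chosen arbitrarily) enumerates the 36 equally probable outcomes of throwing two dice:

```python
from fractions import Fraction

# the 36 equally probable outcomes of throwing two dice
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

A = {w for w in outcomes if w[0] + w[1] == 7}  # event A: the sum is 7
B = {w for w in outcomes if w[0] == 6}         # event B: the first die shows 6

def P(event):
    # classical probability: favourable cases divided by all cases
    return Fraction(len(event), len(outcomes))

# Theorem 5: P(A + B) = P(A) + P(B) - P(AB)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))  # 11/36
```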
We shall need the following simple theorems:

Theorem 6. If A ⊆ B, then

$$P(B - A) = P(B) - P(A).$$

Proof. We have, by assumption, B = A + (B − A) and A(B − A) = O, hence Theorem 6 follows from Axiom γ).
Clearly the relation P(B - A) = P(B) - P(A) does not hold in general.
It is, however, easy to obtain

Theorem 7. P(B − A) = P(B) − P(AB).

Proof. We have (B − A) + AB = B and (B − A)AB = O; from this our theorem follows because of Axiom γ).
Furthermore we have

Theorem 8. P(A Δ B) = P(A) + P(B) − 2P(AB).

Proof. Because of

$$A \mathbin{\Delta} B = (A - B) + (B - A)$$
and

$$(A - B)(B - A) = O$$

we find

$$P(A \mathbin{\Delta} B) = P(A - B) + P(B - A),$$
and Theorem 8 follows from Theorem 7.
We proved Theorem 3 by repeated application of Axiom γ). In the same manner, we can obtain by repeated application of Theorem 5 a formula for the probability of the sum of an arbitrary number of events. In particular, we have

$$P(A + B + C) = P(A) + P(B) + P(C) - P(AB) - P(BC) - P(CA) + P(ABC).$$
More generally we have

Theorem 9. Let A₁, A₂, ..., Aₙ be arbitrary events of a probability algebra. Then we have

$$P(A_1 + A_2 + \ldots + A_n) = \sum_{k=1}^{n} (-1)^{k-1} S_k^{(n)},$$

where

$$S_k^{(n)} = \sum_{1 \le i_1 < i_2 < \ldots < i_k \le n} P(A_{i_1} A_{i_2} \ldots A_{i_k}).$$

In the summation (i₁, i₂, ..., i_k) runs through all combinations, k at a time, of the numbers 1, 2, ..., n, repetitions not allowed.
Theorem 9 is a particular case of the following more general theorem:

Theorem 10 (Ch. Jordan). Let $V_r^{(n)}$ denote the probability of the occurrence of exactly r among the events A₁, A₂, ..., Aₙ. Then we have

$$V_r^{(n)} = \sum_{k=0}^{n-r} (-1)^k \binom{r+k}{r} S_{r+k}^{(n)} \qquad (r = 0, 1, \ldots, n), \tag{3}$$

where $S_0^{(n)} = 1$ and

$$S_l^{(n)} = \sum_{1 \le i_1 < i_2 < \ldots < i_l \le n} P(A_{i_1} A_{i_2} \ldots A_{i_l}) \quad \text{for } l = 1, 2, \ldots \tag{4}$$

and the summation is to be extended over all combinations of the numbers 1, 2, ..., n, l at a time, repetitions not allowed.
We shall prove Theorem 10 by a general principle which can be used for
the proof of many similar identities.1

1 Cf. A. Rényi [23], [35].


We need some preparatory definitions and remarks. Let A₁, A₂, ..., Aₙ be arbitrary events. There exists a “least algebra of events” 𝒜 containing the events A₁, A₂, ..., Aₙ. Clearly, 𝒜 consists of $2^{2^n}$ elements and is generated by the $2^n$ elementary events $\omega = A_{i_1} A_{i_2} \ldots A_{i_k} \bar{A}_{j_1} \ldots \bar{A}_{j_{n-k}}$, where (i₁, i₂, ..., i_k) is an arbitrary combination of the numbers 1, 2, ..., n, k at a time (k = 0, 1, ..., n), and j₁, j₂, ..., j_{n−k} are those of the numbers 1, 2, ..., n which do not figure among i₁, i₂, ..., i_k.¹
We shall call every event belonging to 𝒜 “an event expressible in terms of A₁, A₂, ..., Aₙ”. Every B from 𝒜 is namely a sum of certain elementary events ω. Thus B can be expressed by application of the basic operations of the algebra of events to A₁, A₂, ..., Aₙ. In other words: every event B ∈ 𝒜 is a function of the events A₁, ..., Aₙ. The functional relation between an event B ∈ 𝒜 and the events A₁, A₂, ..., Aₙ representing B does not depend on the specific choice of the events A₁, A₂, ..., Aₙ. In particular it does not depend on the probabilities of the events A₁, A₂, ..., Aₙ. The event $A_{i_1} A_{i_2} \ldots A_{i_r}$, or the event $E_r$ that exactly r of the events A₁, A₂, ..., Aₙ occur, are for instance events expressible in terms of A₁, A₂, ..., Aₙ.
The following theorem contains the principle mentioned above:

Theorem 11. Let A₁, A₂, ..., Aₙ be n arbitrary events and 𝒜 the algebra of events of all events expressible in terms of the events A_k. Let c₁, c₂, ..., c_m be real numbers and B₁, B₂, ..., B_m a sequence of events such that B_k ∈ 𝒜 (k = 1, 2, ..., m). The inequality

$$\sum_{k=1}^{m} c_k P(B_k) \ge 0 \tag{5}$$

holds for an arbitrary probability algebra obtained by assigning probabilities to the elements of 𝒜 if and only if it holds for those probability algebras in which the sequence of numbers P(A_k) (k = 1, 2, ..., n) consists only of zeros and ones.

Proof. Since every B_k is the sum of certain elementary events ω, (5) is equivalent to the inequality

$$\sum_{\omega} \lambda_{\omega} P(\omega) \ge 0, \tag{6}$$

where the real numbers $\lambda_\omega$ depend only on the numbers c_k and on the functional dependence of the events B_k on the events A_j, but do not depend on the numerical values P(A_j). The summation is over all the 2ⁿ elementary¹

¹ The events A_k (k = 1, ..., n) are here considered as variables (indefinite events); thus there are no relations assumed between the A_k-s.
events ω of 𝒜. Indeed we have

$$\lambda_{\omega} = \sum_{\omega \subseteq B_k} c_k,$$

where the summation is over such values of k for which ω figures in the representation of B_k. Since the nonnegative numbers P(ω) are submitted to the only condition $\sum_{\omega} P(\omega) = 1$, (6) holds, in general, if and only if all numbers $\lambda_\omega$ are nonnegative. But when the sequence of numbers P(A_j) (j = 1, 2, ..., n) consists of nothing but zeros and ones, one and only one of the elementary events ω has probability 1 and all the others have probability 0. Thus the proposition that $\lambda_\omega \ge 0$ for all ω is equivalent to the proposition that (6) is valid whenever the sequence of numbers P(A_k) consists of zeros and ones only. Theorem 11 is now proved.
From Theorem 11 follows immediately

Theorem 12. If A₁, A₂, ..., Aₙ are arbitrary events and B₁, B₂, ..., B_m are certain events expressible by A₁, A₂, ..., Aₙ, then the relation

$$\sum_{k=1}^{m} c_k P(B_k) = 0 \tag{7}$$

holds in every probability algebra if and only if it holds in all cases when the sequence of numbers P(A_k) consists of zeros and ones only.

Proof. Apply Theorem 11 in turn to the inequalities $\sum_{k=1}^{m} c_k P(B_k) \ge 0$ and $\sum_{k=1}^{m} (-c_k) P(B_k) \ge 0$. Theorem 12 follows immediately.
Now we can prove Theorem 10 (and thus also Theorem 9). If l from among the numbers P(A_k) are equal to 1 and the remaining n − l are equal to 0 (l = 0, 1, 2, ..., n), then (3) is reduced (since in this case $S_m^{(n)} = \binom{l}{m}$, while $V_r^{(n)}$ equals 1 for l = r and 0 otherwise) to the identity

$$\sum_{k=0}^{n-r} (-1)^k \binom{r+k}{r} \binom{l}{r+k} = \begin{cases} 1 & \text{for } l = r, \\ 0 & \text{for } l \ne r. \end{cases} \tag{8}$$

For l < r all terms of the left hand side of (8) are equal to 0; for l = r only the term k = 0 is distinct from zero, namely 1; and for l > r the sum can be transformed as follows:

$$\sum_{k=0}^{l-r} (-1)^k \binom{r+k}{r} \binom{l}{r+k} = \binom{l}{r} \sum_{k=0}^{l-r} (-1)^k \binom{l-r}{k} = \binom{l}{r} (1-1)^{l-r} = 0.$$
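Theorem 10 can also be verified by direct enumeration in a classical probability algebra. The following sketch (an added illustration; the three events are chosen arbitrarily) computes the quantities $S_l^{(n)}$ by summation over all combinations and compares the right hand side of (3) with the directly counted probability that exactly r of the events occur:

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# classical probability algebra: the 36 outcomes of throwing two dice
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
events = [
    {w for w in outcomes if w[0] % 2 == 0},    # first die even
    {w for w in outcomes if w[0] + w[1] > 8},  # the sum exceeds 8
    {w for w in outcomes if w[1] == 3},        # second die shows 3
]
n = len(events)

def S(l):
    # S_l = sum of P(A_{i_1} ... A_{i_l}) over all combinations, l at a time
    if l == 0:
        return Fraction(1)
    return sum((Fraction(len(set.intersection(*(events[i] for i in idx))),
                         len(outcomes))
                for idx in combinations(range(n), l)), Fraction(0))

for r in range(n + 1):
    jordan = sum((-1) ** k * comb(r + k, r) * S(r + k) for k in range(n - r + 1))
    # direct count of the outcomes belonging to exactly r of the events
    exact = Fraction(sum(1 for w in outcomes
                         if sum(w in A for A in events) == r), len(outcomes))
    assert jordan == exact
    print(r, jordan)
```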
Let us mention as another application of Theorem 11 the inequality of M. Fréchet and A. J. Gumbel (cf. M. Fréchet [2]).

Theorem 13 (M. Fréchet).

$$\frac{S_{r+1}^{(n)}}{\binom{n}{r+1}} \le \frac{S_r^{(n)}}{\binom{n}{r}} \qquad (r = 0, 1, \ldots, n-1), \tag{9}$$

where the same notations are used as in Theorem 9.

Theorem 14 (A. J. Gumbel).

$$\frac{(r+1)\,S_{r+1}^{(n)}}{\binom{n-1}{r}} \le \frac{r\,S_r^{(n)}}{\binom{n-1}{r-1}} \qquad (r = 1, 2, \ldots, n-1), \tag{10}$$

where the same notations are used as in Theorem 9.

Proof of Theorems 13 and 14. Apply Theorem 11. If l of the numbers P(A_k) are equal to 1, and the others are equal to 0, then (9) will be reduced to the trivial inequality

$$\frac{\binom{l}{r+1}}{\binom{n}{r+1}} \le \frac{\binom{l}{r}}{\binom{n}{r}}$$

and Theorem 14 to the likewise trivial inequality

$$\frac{(r+1)\binom{l}{r+1}}{\binom{n-1}{r}} \le \frac{r\binom{l}{r}}{\binom{n-1}{r-1}};$$

both of these hold, since after simplification each is equivalent to the relation l ≤ n.

§ 4. Finite probability algebras

If the set of events of a probability algebra is finite, we have shown in Chapter I the possibility of representing these events by the class of all subsets of a finite set Ω. Let Ω consist of N elements, denoted by ω₁, ω₂, ..., ω_N.
The probability P(A) of any event A is then uniquely determined by the values of P for the sets which consist of exactly one element. Let {ω_i} be the set consisting of ω_i only and let further be P({ω_i}) = p_i (i = 1, 2, ..., N). Then we have for each event A

$$P(A) = \sum_{\omega_i \in A} p_i.$$

Since P(Ω) = 1, the (nonnegative) numbers p_i must obey the condition

$$\sum_{i=1}^{N} p_i = 1.$$
An important particular case is obtained if all the numbers p_i are equal to each other, that is, if they are equal to 1/N. These special probability algebras are called classical probability algebras, since the classical calculus of
probability exclusively dealt with these algebras.
At the early stages of development of probability theory one wished to
reduce the solution of any kind of problems to this case. But this often turned
out to be either too artificial or unnecessarily involved. Since, however, in the
games of chance (tossing of a coin, games of dice, roulette, card-games,
etc.) the probabilities can be determined in this manner indeed, and since
many problems of science and technology may be reduced to the study of
classical probability algebras, it is worthwhile to deal with them separately.
In the case of classical probability algebras we have

$$P(A) = \frac{K}{N},$$

where K denotes the number of the elements of A. Thus we arrive at the “classical definition” of probability: the probability of an event is equal to the quotient of the number of the favorable cases and the total number of all possible cases, provided these cases are all equally probable.
Today the “classical definition” of probability is not considered as a
definition any more, but only as a way to calculate probabilities, applicable
in case of classical probability algebras, i.e. finite probability algebras whose
elementary events have, for certain reasons (e.g. symmetry-properties), the
same probability.

§ 5. Probabilities and combinatorics

In classical probability algebras the probabilities are determined by combinatorial methods. In what follows we shall give some examples.
Example 1a. A person having N keys in his pocket wishes to open his apartment. He takes one key after the other from his pocket at random and tries to open the door. What is the probability that he finds the right key at the k-th trial? Suppose that the N! possible sequences of the keys all have the same probability. In this case the answer is very simple indeed: N elements have (N − 1)! permutations with a fixed element occupying the k-th place. The probability in question is therefore

$$\frac{(N-1)!}{N!} = \frac{1}{N};$$

that is, the probability of finding the right key at the first, second, ..., N-th trial, respectively, is always 1/N. If the keys are on the same key-ring and if the same key may be tried more than once, the answer is different and will be dealt with later (cf. Ch. III, § 3, 7.).
Example 1b. An urn contains M red and N − M white balls. Balls are drawn from the urn one after the other without replacement. What is the probability of obtaining the first red ball at the k-th drawing? In order to answer the question we have to determine the total number of all permutations of M red and N − M white balls having for their first k − 1 balls white ones and for the k-th a red ball. The first k − 1 balls may be chosen in $\binom{N-M}{k-1}$ (different) ways from the N − M white balls; furthermore, these can be arranged in (k − 1)! different orders. The red ball on the k-th place can be chosen in M different ways and the remaining places can be filled up in (N − k)! different ways. Hence the probability in question – provided that all the N! permutations are equally probable – is given by

$$P_k = \frac{\binom{N-M}{k-1} (k-1)!\, M\, (N-k)!}{N!}.$$

Obviously, the special case M = 1 is equivalent to Example 1a.
In order to make the calculations easier, P_k may be written in the following form:

$$P_k = \frac{M}{N-k+1} \prod_{j=1}^{k-1} \left( 1 - \frac{M}{N-j+1} \right).$$

If N and M are large in comparison to k, and M/N is denoted by p (0 < p < 1), then we have approximately $P_k \approx (1-p)^{k-1} p = \pi_k$. Indeed, if N and M tend to infinity while p = M/N remains constant, then P_k tends to the expression $\pi_k$.
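The accuracy of the approximation is easily examined numerically. The following sketch (an added illustration; the values of N and M are arbitrary) compares the exact probability P_k with the approximating value $\pi_k = (1-p)^{k-1} p$:

```python
from math import prod

def P_exact(N, M, k):
    # P_k = M/(N-k+1) * product over j = 1, ..., k-1 of (1 - M/(N-j+1))
    return M / (N - k + 1) * prod(1 - M / (N - j + 1) for j in range(1, k))

N, M = 1000, 100  # arbitrary illustrative values
p = M / N

for k in (1, 2, 5, 10):
    pi_k = (1 - p) ** (k - 1) * p  # geometric approximation
    print(f"k = {k:2d}   exact = {P_exact(N, M, k):.6f}   approx = {pi_k:.6f}")
```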
Example 2. Consider now the following problem. An urn contains N balls, of which M are red and N − M white, 1 < M < N. From the urn n balls are drawn. What is the probability of obtaining k red and n − k white balls?
A new formulation of the same example discloses the practical importance
of the problem. In the serial production of machine parts a series of N items
contains M rejects. What is the probability, by taking a random sample
of n elements, that this sample will contain k rejects?

(N \
Solution. A sample of n elements may be chosen from N elements in
n)
different ways. Suppose that every such combination is equally probable.
ÍN ]-1
Then the probability of every combination is • Therefore we have only

to count, how many combinations contain just к of the rejects. There can
[ M\
be chosen к elements from M in different ways and n - к elements
К J
, . IN-M\
from N — M in , different ways. Therefore the probability in ques-
\n-k)
tion is:
M\ i N - M
_ U I U - к

k
nI
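This is the so-called hypergeometric probability, and it is immediately computable. The following sketch (an added illustration; the parameters are arbitrary) evaluates it with exact rational arithmetic and checks that the values for k = 0, 1, ..., n sum to 1, as they must, since these events form a complete system:

```python
from fractions import Fraction
from math import comb

def hypergeom(N, M, n, k):
    # probability of exactly k red balls among n drawn without replacement
    return Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))

N, M, n = 20, 5, 4  # arbitrary illustrative values
probs = [hypergeom(N, M, n, k) for k in range(n + 1)]
print(probs)
assert sum(probs) == 1
```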

Example 3. The Maxwell-Boltzmann, Bose-Einstein, and Fermi-Dirac statistics.
We start from the following simple combinatorial problem: how many different ways are there in which n objects can be placed into N cells? Every object can be placed into N different cells, hence there are N possibilities for each object and the number of the possibilities of the different arrangements is $N^n$. If we assume that the probability of every arrangement is the same, i.e. $N^{-n}$, then we obtain the probability of a certain cell being occupied by exactly k objects, if we count the possibilities in question. The k objects occupying the cell can be chosen in $\binom{n}{k}$ different ways, and the number of the possibilities to arrange the remaining n − k objects into N − 1 cells is $(N-1)^{n-k}$, so the number of possibilities in question will be $\binom{n}{k} (N-1)^{n-k}$.
The probability in question is

$$\binom{n}{k} \left( \frac{1}{N} \right)^k \left( 1 - \frac{1}{N} \right)^{n-k}.$$

Such “problems of arrangements” are of paramount importance in statistical mechanics. There it is usual to examine the arrangements of certain kinds of particles (molecules, photons, electrons, etc.) in the “phase space”. The meaning of this is the following: if every particle can be characterized by K data, then there corresponds to the state of the particle in question a point of the phase space, having for coordinates the data characterizing the particle. In subdividing the phase space into K-dimensional parallelepi-
peds (cells) the physical system can be described approximately by giving
the number of particles in each cell. The assumption that all arrangements
have equal probabilities leads to the so-called Maxwell-Boltzmann sta­
tistics. This can be applied for instance in statistical mechanics to systems
of the molecules of a gas. But in the case of photons, electrons, and other
elementary particles we must proceed in a different way. For systems of
photons for instance, the following model was found to be valid: in distrib­
uting n objects into N cells two arrangements containing in each of the
cells the same number of objects are not to be considered as distinct. That
is, the objects are to be considered as not distinguishable and thus only
arrangements having different numbers of objects in the cells can be distin­
guished from each other. This assumption leads to the so-called Bose-Ein-
stein statistics.
Next we calculate the number of possible arrangements under Bose-Ein-
stein statistics. This problem is the same as the following question: In how
many ways can n shillings be distributed among N persons? (Of course,
the number of shillings obtained is of interest, the individuality of the coins
being irrelevant.) This number is equal to the number of combinations of N things, n at a time, repetitions allowed, i.e. to $\binom{N+n-1}{n}$. Another solution of our problem is the following: let the n objects correspond to n collinear points. Let these n points be subdivided by N − 1 separating lines. Every
configuration thus obtained corresponds to one possible arrangement.
Every two consecutive lines signify one cell and the number of the points
lying between two consecutive lines represents the number of objects in the
corresponding cell. If there are no points between two consecutive lines,
the corresponding cell is empty. Figure 8 gives a possible arrangement of
eight objects into six cells; here the first cell contains one, the second two
objects, the third cell is empty, the fourth contains three objects, the fifth
is empty, and in the sixth are two objects.
The number of possibilities in question is thus obtained by dividing the number of permutations of the n points and N − 1 lines by the number obtained by permuting only the points among themselves and the lines among themselves; it is equal to $\binom{N+n-1}{n}$. Hence under the Bose-Einstein hypothesis the probability of each arrangement is $\binom{N+n-1}{n}^{-1}$.

Fig. 8

Now we shall calculate under Bose-Einstein statistics the probability of exactly k particles being in a given cell. For this it suffices to determine the number of possibilities having in the cell in question exactly k particles. This, however, is equal to the number of possibilities of putting n − k particles into the remaining N − 1 cells, hence to $\binom{N+n-k-2}{n-k}$; thus the probability is

$$\frac{\binom{N+n-k-2}{n-k}}{\binom{N+n-1}{n}}.$$
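For small n and N the Bose-Einstein formula can be confronted with brute-force counting. The following sketch (an added illustration; n and N are kept deliberately small) obtains the occupancy vectors, i.e. the Bose-Einstein arrangements, by collapsing all N^n Maxwell-Boltzmann arrangements, and checks the probability just derived:

```python
from fractions import Fraction
from itertools import product
from math import comb

n, N = 4, 3  # 4 particles, 3 cells: small enough for complete enumeration

# each Bose-Einstein arrangement is an occupancy vector (n_1, ..., n_N)
occupancies = {tuple(arr.count(c) for c in range(N))
               for arr in product(range(N), repeat=n)}
assert len(occupancies) == comb(N + n - 1, n)

for k in range(n + 1):
    counted = Fraction(sum(1 for occ in occupancies if occ[0] == k),
                       len(occupancies))
    formula = Fraction(comb(N + n - k - 2, n - k), comb(N + n - 1, n))
    assert counted == formula
    print(k, counted)
```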

We have dealt with Bose-Einstein statistics not only because of their


important physical applications. We also wished to make clear that the
determination of equally probable cases is not a question of pure logic;
experience is involved in it too. This example shows further that a hypo­
thesis relative to equally probable cases cannot always be checked directly.
Often we must rely on experimental verification of the consequences of the
hypothesis in question.
The Bose-Einstein statistics have no general validity. They do not apply
tistics are appropriate. They are obtained by joining to the principle of
indistinguishability of the Bose-Einstein statistics the requirement that every
cell may be occupied by at most one particle (Pauli’s principle). Hence the
number of distinct arrangements is equal to the number of the possibilities of choosing n from N elements, i.e. to $\binom{N}{n}$. The probability of a cell being occupied by one particle (more than one is out of the question) is

$$\frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}.$$
The Fermi-Dirac statistics give good agreement with experiments in
the case of electrons, protons, and neutrons.

§ 6. Kolmogorov probability spaces

The theory discussed up to now can only deal with the most elementary
problems of probabilities; those involving an infinite number of possible
events are not covered by it. To deal with them we need Kolmogorov’s
theory, which will now be discussed.
In Kolmogorov’s probability theory we assume that there is given an
algebra of sets, isomorphic to the algebra of events dealt with. This assump­
tion, as we have seen, does not restrict the generality. We assume further
that this algebra of sets contains not only the sum of any two sets belonging
to it but also the sum of denumerably many sets belonging to the algebra
of sets. Algebras of sets with this property are called σ-algebras or Borel
algebras.
In Kolmogorov’s theory we therefore assume the following axioms:
I. Let there be given a nonempty set Ω. The elements of Ω are said to be elementary events and are denoted by ω.
II. Let there be specified an algebra of sets 𝒜 of the subsets of Ω; the sets A of 𝒜 are called events.
III. 𝒜 is a σ-algebra, that is¹

$$A_k \in \mathcal{A} \ (k = 1, 2, \ldots) \Rightarrow \sum_{k=1}^{\infty} A_k \in \mathcal{A}.$$

From the Axioms I-III it follows immediately that if $A_k \in \mathcal{A}$ (k = 1, 2, ...), then also $\prod_{k=1}^{\infty} A_k \in \mathcal{A}$.
The following axioms prescribe the properties of probabilities:
IV. To each element A of 𝒜 is assigned a nonnegative real number P(A), called the probability of the event A.

¹ Here and in what follows the sign ⇒ stands for the (logical) implication.
V. P(Ω) = 1.
VI. If A₁, A₂, ..., Aₙ, ... is a finite or a denumerably infinite sequence of pairwise disjoint sets belonging to 𝒜, then

$$P(A_1 + A_2 + \ldots + A_n + \ldots) = P(A_1) + P(A_2) + \ldots + P(A_n) + \ldots$$
Requirement VI is called the σ-additivity (or complete additivity) of the set function P(A).
A σ-algebra 𝒜 of subsets of a set Ω on which a set function P(A) is defined such that Axioms I-VI are fulfilled will be called a probability space in the sense of Kolmogorov and will be denoted by [Ω, 𝒜, P].
Theorems proved in the previous paragraph clearly hold for Kolmogorov probability spaces too, as the Axioms α), β), and γ) correspond to Axioms IV, V, and VI respectively. Axiom VI, however, requires more than the Axiom γ) of probability algebras, since it assumes the additivity of P(A) not only for finitely many, but also for denumerably many pairwise disjoint sets belonging to the σ-algebra 𝒜.
Every finite probability algebra is a Kolmogorov probability space, since an additive set function on a finite algebra of sets is trivially σ-additive.
The empty set is denoted by O. Obviously, we have always P(O) = 0 (cf. the note to Theorem 2 of § 3).
Apart from finite probability algebras the most simple probability fields are those in which the space Ω consists of denumerably many elements. Let Ω be a denumerable set, with elements ω₁, ω₂, ..., ωₙ, ...; let 𝒜 consist of all subsets of Ω. Let the set containing the only element ωₙ be denoted by {ωₙ}; let further be P({ωₙ}) = pₙ (n = 1, 2, ...). In order that [Ω, 𝒜, P] be a Kolmogorov probability space the conditions $p_n \ge 0$ and $\sum_{n=1}^{\infty} p_n = 1$ must, according to Axioms IV-VI, be satisfied. Further, if A is an arbitrary subset of Ω, then by Axiom VI we have

$$P(A) = \sum_{\omega_k \in A} p_k.$$

Conversely, if the above conditions are fulfilled, [Ω, 𝒜, P] is in fact a Kolmogorov probability space. Thus we have also proved the consistency of Kolmogorov's axioms. The σ-additivity of P is readily seen, since in any convergent series of nonnegative terms the terms can be rearranged and bracketed in whatever order.
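A concrete denumerable example (an added illustration) is obtained by taking $p_n = 2^{-n}$ (n = 1, 2, ...). Exact rational arithmetic shows the normalization and gives the probability of a simple event:

```python
from fractions import Fraction

def p(n):
    # p_n = 2^(-n); these numbers are nonnegative and sum to 1
    return Fraction(1, 2 ** n)

# partial sums of the p_n tend to 1
print(sum(p(n) for n in range(1, 31)))  # 1 - 2^(-30)

# A = {omega_2, omega_4, omega_6, ...}: P(A) = sum of p_n over even n = 1/3
print(float(sum(p(n) for n in range(2, 31, 2))))  # approaches 1/3
```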

§ 7. The extension of rings of sets, algebras of sets and measures

In this paragraph we shall discuss some results of set theory and measure
theory, used in probability theory. We shall not aim at completeness. We
assume that the reader is familiar with the fundamentals of measure theory
and of the theory of functions of a real variable. Accordingly, proofs are
merely sketched or even omitted, especially if dealing with often-used con­
siderations of the theory of functions of a real variable.1
We have seen already in Chapter I that every algebra of events is isomor­
phic to an algebra of sets. It is always assumed in Kolmogorov’s theory that
the sets assigned to the elements of the algebra of events form a σ-algebra. Hence the algebra of sets constructed for the algebra of events must be extended into a σ-algebra, if it is not already a σ-algebra itself. This exten-
sion is always possible, even in the case of a ring of sets.
A system ℛ of subsets of a set Ω is called a ring of sets if

$$A \in \mathcal{R} \ \text{and} \ B \in \mathcal{R} \Rightarrow A - B \in \mathcal{R} \ \text{and} \ A + B \in \mathcal{R}.$$

A ring of sets is said to be a Borel ring of sets or σ-ring if

$$A_n \in \mathcal{R} \ (n = 1, 2, \ldots) \Rightarrow \sum_{n=1}^{\infty} A_n \in \mathcal{R}.$$

The ring of sets ℛ is an algebra of sets iff² the set Ω belongs to ℛ. In fact, an algebra of sets 𝒜 can be characterized as a system of subsets of a set Ω having the following properties:
I. A ∈ 𝒜 and B ∈ 𝒜 ⇒ A − B ∈ 𝒜;
II. A ∈ 𝒜 and B ∈ 𝒜 ⇒ A + B ∈ 𝒜;
III. Ω ∈ 𝒜.
This is obvious, as I and III imply that whenever A belongs to 𝒜 so does Ω − A = Ā, and thus conditions 1, 2, and 3 of § 3 of Chapter I are fulfilled. Conversely, it follows from these conditions that whenever A ∈ 𝒜 and B ∈ 𝒜 hold, so do Ā ∈ 𝒜 and $A - B = \overline{\bar{A} + B} \in \mathcal{A}$ as well. Hence the conditions I, II, III are equivalent to the conditions 1, 2, 3 of Chapter I. We have now the following theorem:

Theorem 1. Let Ω be any set and ℛ a ring consisting of subsets of Ω. There exists then a uniquely determined σ-ring (or Borel ring) ℬ(ℛ) with the following properties: ℬ(ℛ) contains ℛ, and an arbitrary σ-ring ℛ′ containing ℛ contains ℬ(ℛ) as well. In other words, ℬ(ℛ) is the least σ-ring containing ℛ.

Proof. Obviously, there exists a σ-ring containing ℛ. Such is, for instance, the collection of all subsets of Ω. Let now ℬ(ℛ) be the intersection
1 More particularly see e.g. P. R. Halmos [1] or V. I. Smirnov [2].


2 “iff” stands here and in what follows as usual for “if and only if”.
of all σ-rings containing ℛ. Evidently, all statements of Theorem 1 are fulfilled by ℬ(ℛ).
The assumption in Kolmogorov’s theory that the sets assigned to the
events form a σ-algebra is, according to our theorem, no essential restriction of the generality, since every algebra of sets may be extended by a suitable extension into a σ-algebra.
Let us now consider another example. Let R denote the set of real numbers. Let ℛ be the collection of those subsets of the real axis which can be represented as sums of a finite number of disjoint half-open intervals (closed to the left and open to the right). Obviously, ℛ is a ring of sets, though it is not a σ-ring. By Theorem 1 there exists a least σ-ring ℬ(ℛ) containing ℛ. It is easy to see that ℬ(ℛ) is in our case a σ-algebra as well. The subsets of R belonging to ℬ(ℛ) are called Borel sets.
Let us now assume that a nonnegative and completely additive set function P(A) is defined on the sets of an algebra of sets (which is not a σ-algebra), further that P(Ω) = 1. We have seen that every algebra of sets ℛ may be extended into a σ-algebra. But is it possible (and if so, how) to extend the definition of P to the elements of ℬ(ℛ) while preserving the nonnegativity and the σ-additivity?
A nonnegative, completely additive set function defined over a ring of sets is called a measure. Our question therefore, in a more general formulation, is whether a measure defined over a ring ℛ can be extended to the least σ-ring ℬ(ℛ) containing ℛ.
For our purposes it suffices to consider σ-finite measures. A measure μ defined over a ring ℛ is said to be σ-finite if for every set A ∈ ℛ there exists a sequence $A_n \in \mathcal{R}$ (n = 1, 2, ...) such that every set $A_n$ has a finite measure $\mu(A_n) < +\infty$ and $A \subseteq \sum_{n=1}^{\infty} A_n$. The following theorem can now be asserted:
Theorem 2. If μ(A) is a σ-finite measure defined over a ring of sets ℛ, there exists a uniquely determined σ-finite measure μ̄(A) defined over the extended ring ℬ(ℛ) such that for every A ∈ ℛ one has μ̄(A) = μ(A).

Proof. In order to construct μ̄ let us first define a set function μ* in the following manner. Let μ*(A) for any subset A of Ω be the lower bound of all sums $\sum_{n=1}^{\infty} \mu(A_n)$, where the $A_n$ belong to ℛ and their union contains A; that is

$$\mu^*(A) = \inf_{A \subseteq \sum_n A_n} \left( \sum_{n=1}^{\infty} \mu(A_n) \right).$$
It is easy to verify that μ*(A) has the following properties:
a) μ*(A) ≥ 0;
b) if A ∈ ℛ, then μ*(A) = μ(A);
c) if $A \subseteq \sum_{n=1}^{\infty} A_n$, then $\mu^*(A) \le \sum_{n=1}^{\infty} \mu^*(A_n)$.
μ*(A) is called the outer measure of the set A. A set A is said to be measurable if every subset B of Ω satisfies the following relation:

$$\mu^*(B) = \mu^*(AB) + \mu^*(\bar{A}B),$$

where Ā is the set complementary to A with respect to Ω.
Let ℛ* be the collection of all sets measurable in this sense. μ̄(A) is defined on ℛ* by the equality μ̄(A) = μ*(A). It is not difficult to prove that:

1. ℛ* is a σ-algebra containing ℛ (and thus ℬ(ℛ) as well);

2. μ̄(A) is a measure on ℛ*, i.e. μ̄(A) is completely additive.

From these statements it follows that μ̄(A) satisfies the requirements of Theorem 2.
It can be shown that this extension of μ to ℬ(ℛ) is unique. The measure μ̄ so obtained has, as is readily seen from the construction, the following further property: if A ∈ ℛ* and μ̄(A) = 0, then from B ⊆ A follows B ∈ ℛ* and thus μ̄(B) = 0. Measures having this property are called complete measures. Thus every measure derived from an outer measure in the above described way is complete.
Let us now consider the following example: let R be the real axis and let F(x) be a nondecreasing function continuous from the left satisfying the relations

$$\lim_{x \to -\infty} F(x) = 0, \qquad \lim_{x \to +\infty} F(x) = 1.$$

Let the ring ℛ be the collection of all sets consisting of a finite number of intervals closed to the left and open to the right. μ(A) will be defined as follows: if A consists of the half-open disjoint intervals $[a_k, b_k)$, $a_1 < b_1 \le a_2 < b_2 \le \ldots \le a_r < b_r$, let then be

$$\mu(A) = \sum_{k=1}^{r} \left( F(b_k) - F(a_k) \right). \tag{1}$$
It is easy to see that μ(A) is a measure on ℛ: hence μ(A) ≥ 0, and if $A_n \in \mathcal{R}$ (n = 1, 2, ...), the $A_n$ are pairwise disjoint, and $\sum_{n=1}^{\infty} A_n = A \in \mathcal{R}$, then

$$\mu(A) = \sum_{n=1}^{\infty} \mu(A_n).$$
To prove this we first prove another important general theorem:
Theorem 3. A nonnegative additive set function μ(A) defined on a ring of sets ℛ is a measure on ℛ iff for every sequence of sets $B_n \in \mathcal{R}$ such that $B_{n+1} \subseteq B_n$, $\mu(B_n) < +\infty$ (n = 1, 2, ...) and $\prod_{n=1}^{\infty} B_n = O$ (i.e. for every decreasing sequence of sets $B_n$ having the empty set as their intersection) the relation

$$\lim_{n \to \infty} \mu(B_n) = 0 \tag{2}$$

holds.
The proof is simple. Indeed, if (2) is fulfilled and $A_n \in \mathcal{R}$, $\sum_{n=1}^{\infty} A_n = A \in \mathcal{R}$, while $A_n A_m = O$ for n ≠ m, then we have for every n

$$\mu\left( \sum_{k=1}^{\infty} A_k \right) = \sum_{k=1}^{n-1} \mu(A_k) + \mu\left( \sum_{k=n}^{\infty} A_k \right);$$

thus from the fact that the sets $B_n = \sum_{k=n}^{\infty} A_k$ satisfy the conditions $B_{n+1} \subseteq B_n$, $\prod_{n=1}^{\infty} B_n = O$, it follows that $\lim_{n \to \infty} \mu(B_n) = 0$; thus

$$\mu\left( \sum_{k=1}^{\infty} A_k \right) = \sum_{k=1}^{\infty} \mu(A_k),$$

i.e. μ is completely additive. Conversely, if μ is completely additive, then whenever $B_{n+1} \subseteq B_n$, $\prod_{n=1}^{\infty} B_n = O$ hold, one has $B_1 = \sum_{n=1}^{\infty} (B_n - B_{n+1})$, where $B_n - B_{n+1} \in \mathcal{R}$ and $(B_n - B_{n+1})(B_m - B_{m+1}) = O$ (n ≠ m). Therefore we have

$$\lim_{n \to \infty} \mu(B_n) = \lim_{n \to \infty} \sum_{k=n}^{\infty} \mu(B_k - B_{k+1}) = 0.$$

Now we shall prove that the set function μ defined by (1) satisfies the conditions of Theorem 3. Let therefore $B_n \in \mathcal{R}$, $B_{n+1} \subseteq B_n$ (n = 1, 2, ...) and $\prod_{n=1}^{\infty} B_n = O$. Then $0 \le \mu(B_{n+1}) \le \mu(B_n)$ (n = 1, 2, ...); thus $\lim_{n \to \infty} \mu(B_n) = c$ does exist and $\mu(B_n) \ge c$ (n = 1, 2, ...). We shall show that the assumption c > 0 leads to a contradiction. In order to prove this we construct from the set $B_1$, which consists of the intervals $[a_{1i}, b_{1i})$, another set $B_1' \subset B_1$ consisting of a finite number of closed intervals, obtained from every one of the intervals $[a_{1i}, b_{1i})$ by removing a small open subinterval $(b_{1i}', b_{1i})$ ($a_{1i} < b_{1i}' < b_{1i}$) such that the points $b_{1i}'$ are points of continuity of F(x) and the value of the sum $\sum_i (F(b_{1i}) - F(b_{1i}'))$ is at most c/4. This procedure is repeated with the set $B_1' B_2$ (which may contain closed intervals as well besides the half-open intervals) such that the sum of the increments of the function F(x) belonging to the removed intervals is at most c/8. In this way we obtain a set $B_2'$ which consists of a finite number of closed intervals. By continuing the procedure, we obtain a sequence of sets $B_n'$ having the following properties: a) $B_n'$ consists of finitely many closed intervals; b) $B_{n+1}' \subseteq B_n'$; c) $\prod_{n=1}^{\infty} B_n' = O$; d) the sum $\sum_k (F(b_{nk}') - F(a_{nk}))$ of the increments of the function F(x), taken over the intervals $[a_{nk}, b_{nk}']$ forming $B_n'$, is at least equal to c/2. These properties are, however, contradictory, since from a), b), c) follows the existence of a number N such that for every n > N, $B_n'$ is the empty set (indeed, if $B_n' = B_1' B_2' \ldots B_n' \ne O$ for every n, the sets $B_n'$ being closed, the relation $\prod_{n=1}^{\infty} B_n' \ne O$ would follow). But this contradicts d), and thus we proved our statement that the set function μ defined by (1) is a measure on ℛ.
According to Theorem 2 the definition of the measure μ can be extended to all Borel subsets of the real axis. Thus we obtain on these sets a measure μ̄ such that for A = [a, b) the relation μ̄(A) = F(b) − F(a) is valid. Especially, if

$$F(x) = \begin{cases} 0 & \text{for } x < 0, \\ x & \text{for } 0 \le x \le 1, \\ 1 & \text{for } 1 < x, \end{cases}$$

then the above procedure assigns to every subinterval [a, b) of the interval [0, 1] the value b − a.
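Definition (1) is immediately computable for a concrete F(x). The following sketch (an added illustration) takes the special F(x) given above and evaluates μ(A) for a set A composed of finitely many disjoint half-open intervals:

```python
def F(x):
    # the special distribution function above: F(x) = 0, x, or 1
    return max(0.0, min(x, 1.0))

def mu(intervals):
    # formula (1): mu(A) = sum of F(b_k) - F(a_k) over the intervals [a_k, b_k)
    return sum(F(b) - F(a) for a, b in intervals)

A = [(0.1, 0.25), (0.5, 0.75), (0.9, 2.0)]  # disjoint half-open intervals
print(mu(A))  # 0.15 + 0.25 + 0.10 = 0.50
```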
We have seen that μ̄ is a complete measure determined on a σ-ring ℛ*, which contains the σ-ring ℬ(ℛ). If F(x) has the special form mentioned above, this measure is just the ordinary Lebesgue measure defined on the interval [0, 1]. Any measure μ̄ constructed by means of a function F(x) satisfying the above conditions is called a Lebesgue-Stieltjes measure defined on the real axis.
The same construction can be applied in cases of more than one dimension. Let $F(x_1, x_2, \ldots, x_n)$ be a function of the real variables $x_1, x_2, \ldots, x_n$ having the following properties:
1. $F(x_1, x_2, \ldots, x_n)$ is in any one of its variables a non-decreasing function continuous from the left.
2. $\lim_{x_k \to -\infty} F(x_1, x_2, \ldots, x_n) = 0$ (k = 1, 2, ..., n), and $\lim F(x_1, x_2, \ldots, x_n) = 1$ if every $x_k$ (k = 1, 2, ..., n) tends to $+\infty$.

In order to formulate the third condition let us introduce the following notation: let $\Delta_h^{(k)}$ (k = 1, 2, ..., n) denote the operation of taking the difference with respect to the variable $x_k$ and with difference h; i.e. for a function $G(x_1, x_2, \ldots, x_n)$ we put

$$\Delta_h^{(k)} G(x_1, x_2, \ldots, x_n) = G(x_1, x_2, \ldots, x_k + h, \ldots, x_n) - G(x_1, x_2, \ldots, x_n).$$

Now we can formulate the third condition:

3. For any numbers $h_k > 0$ and any real values $x_k$ (k = 1, 2, ..., n) the relation

$$\Delta_{h_1}^{(1)} \Delta_{h_2}^{(2)} \cdots \Delta_{h_n}^{(n)} F(x_1, x_2, \ldots, x_n) \ge 0$$

should hold.
Let I be an n-dimensional interval consisting of the points $(x_1, x_2, \ldots, x_n)$ of the n-dimensional space satisfying the inequalities $a_k \le x_k < b_k$. Let $h_k = b_k - a_k$ and

$$\mu(I) = \Delta_{h_1}^{(1)} \Delta_{h_2}^{(2)} \cdots \Delta_{h_n}^{(n)} F(a_1, a_2, \ldots, a_n). \tag{3}$$
Let ℛ be the set of all subsets A of the n-dimensional space which can be represented as the union of finitely many pairwise disjoint intervals $I_1, I_2, \ldots, I_r$. For $A = \sum_{k=1}^{r} I_k$ we put

$$\mu(A) = \sum_{k=1}^{r} \mu(I_k). \tag{4}$$

It is readily seen by the same consideration as in the one-dimensional case that the set function μ defined by (3) and (4) is a measure on the ring of sets ℛ; thus μ can be extended to the σ-algebra ℬ(ℛ) formed by all Borel sets of the n-dimensional space.
Especially, if

$$F(x_1, x_2, \ldots, x_n) = \begin{cases} 0 & \text{for } \min\limits_k x_k < 0, \\ \prod\limits_{k=1}^{n} \min(x_k, 1) & \text{for } x_k \ge 0 \ (k = 1, 2, \ldots, n), \end{cases}$$
then the extension of the set function μ(A) defined above leads to the ordinary n-dimensional Lebesgue measure defined on the n-dimensional cube $0 \le x_k \le 1$ (k = 1, 2, ..., n).

§ 8. Conditional probabilities

In the preceding paragraphs we introduced probabilities by means of relative frequencies. Accordingly, in order to introduce the notion of conditional probability we shall examine first conditional relative frequencies. If an event B occurs exactly n times in N trials and if among these n trials the event A occurs k times together with the event B, then the quotient k/n is called the conditional relative frequency of the event A with respect to the condition B. The conditional relative frequency of an event A with respect to the condition B in a sequence of trials is therefore equal to the simple relative frequency of the event A in a subsequence of the sequence of trials in question; this subsequence contains only those trials of the original sequence in which the event B occurred. If $f_B$ denotes the relative frequency of B in the whole sequence of trials, then $f_B = n/N$; defining similarly $f_{AB}$, we have $f_{AB} = k/N$. Finally, if $f_{A|B}$ denotes the conditional relative frequency of A with respect to the condition B, then, by definition, $f_{A|B} = k/n$. Thus

$$f_{A|B} = \frac{f_{AB}}{f_B}.$$
Since $f_{AB}$ fluctuates around P(AB) and $f_B$ around P(B), the conditional relative frequency $f_{A|B}$ will fluctuate for P(B) > 0 around P(AB)/P(B). This number shall be called the conditional probability of the event A with respect to the condition B; it is assumed that P(B) > 0. The notation for the conditional probability is P(A | B); thus we put

$$P(A \mid B) = \frac{P(AB)}{P(B)}. \tag{1}$$

By means of formula (1) the conditional probability of any event A of a probability algebra with respect to any condition B can be calculated, provided that P(B) > 0. If P(B) = 0, formula (1) has no sense; the conditional probability P(A | B) is thus defined only for P(B) > 0.¹ Formula (1) may be expressed in words by saying that the conditional probability of an event A with respect to the condition B is nothing else than the ratio of the probability of the joint occurrence of A and B and the probability of B.
Equality (1) is (in contrast to the standpoint of many older textbooks) neither a theorem nor an axiom; it is the definition of conditional probability.² But this definition is not arbitrary; it is a logical consequence of the concept of probability as the number about which the value of the relative frequency fluctuates.
In the older literature of probability theory as well as in some vulgariza­
tions of modern physics one often finds the misleading formulation that the probability of an event A changes because of the observation of the occurrence of an event B. It is, however, obvious that P(A | B) and P(A) do not differ because the occurrence of the event B was observed, but because of the adjunction of the occurrence of event B to the originally given complex
of conditions.
Let us now state some examples.

Example 1. In the task of pebble-screening one may ask, what part of the
pebbles is small enough to pass through a sieve SA, i.e. what is the probability
of a pebble chosen at random to pass through the sieve SA. Let this event
be denoted by A. Assume now that the pebble was already sieved through
another sieve SB, and the pebbles which did not pass through the sieve
SB were separated. What is the probability that a pebble chosen at random
from those sieved through the sieve SB will pass through the sieve SA as
well? Let В denote the event that a pebble passes through SB, the probability
of this event let be denoted by P(B). Let further AB denote the event that
a pebble passes through both SB and SA, and P(AB) the corresponding
probability. Then the probability that a pebble chosen at random from those which passed $S_B$ will pass $S_A$ as well is, according to the above, $P(A \mid B) = P(AB)/P(B)$.

Example 2. Two dice are thrown, a red one and a white one. What is the
probability of obtaining two sixes, provided that the white die showed a six?

1 We give a more general definition in Chapter IV.


2 In view of certain applications it is advisable to generalize the system of axioms
of probability theory in such a way that the notion of conditional probability is the
fundamental notion (cf. § 11 of this Chapter).
This conditional probability is by definition

$$\frac{1/36}{1/6} = \frac{1}{6}.$$

From a purely mathematical point of view the conditional probability P(A | B) may be considered as a new probability measure. Indeed, let Ω be an arbitrary set, 𝒜 a σ-algebra of the subsets of Ω, and P a probability measure (i.e. a nonnegative, completely additive set function satisfying P(Ω) = 1). Further let B be a fixed element of 𝒜 such that P(B) > 0. Then P(A | B) is a probability measure on 𝒜 as well, i.e. P(A | B) satisfies Axioms IV, V, and VI. Indeed, by introducing the notation P*(A) = P(A | B) we have P*(A) ≥ 0 for A ∈ 𝒜, $P^*(\Omega) = \frac{P(\Omega B)}{P(B)} = 1$; further, if $A_n \in \mathcal{A}$ and $A_n A_m = O$ for n ≠ m, then

$$P^*\left( \sum_{n=1}^{\infty} A_n \right) = \frac{P\left( B \sum_{n=1}^{\infty} A_n \right)}{P(B)} = \frac{\sum_{n=1}^{\infty} P(A_n B)}{P(B)} = \sum_{n=1}^{\infty} P^*(A_n).$$

Hence [Ω, 𝒜, P(A | B)] is again a Kolmogorov probability space. Thus all theorems proved for ordinary probabilities remain valid if the probability of every event is replaced by the conditional probability of the same event relative to some fixed event B (of positive probability).
If A and B are two events of positive probability one can consider, besides the conditional probability of A relative to B, also the conditional probability of B relative to A.
From the definition it follows readily that

$$P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}; \tag{2}$$

hence P(B | A) can be expressed by means of P(A | B), P(A), and P(B). One can write (2) in the following form, equivalent to it:

$$\frac{P(B \mid A)}{P(B)} = \frac{P(A \mid B)}{P(A)}. \tag{3}$$

Formula (1) can be generalized as follows: if $A_1, A_2, \ldots, A_n$ are arbitrary events such that $P(A_1 A_2 \ldots A_{n-1}) > 0$, we have

$$P(A_1 A_2 \ldots A_n) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_1 A_2) \ldots P(A_n \mid A_1 A_2 \ldots A_{n-1}). \tag{4}$$
This formula is immediately verified by expressing the conditional probabilities on the right hand side of (4) by means of (1).
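Definition (1) and formula (4) can again be checked in a classical probability algebra. The following sketch (an added illustration) reproduces Example 2 by enumeration:

```python
from fractions import Fraction

outcomes = [(r, w) for r in range(1, 7) for w in range(1, 7)]  # red and white die

def P(event):
    return Fraction(len(event), len(outcomes))

def P_cond(A, B):
    # formula (1): P(A | B) = P(AB) / P(B), defined only for P(B) > 0
    return P(A & B) / P(B)

A = {x for x in outcomes if x == (6, 6)}  # two sixes
B = {x for x in outcomes if x[1] == 6}    # the white die shows a six

print(P_cond(A, B))  # 1/6, as computed above
# formula (4) for two events: P(AB) = P(B) P(A | B)
assert P(A & B) == P(B) * P_cond(A, B)
```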

§ 9. The independence of events

Let A and B be two events of a probability algebra; assume that P(A) > 0 and P(B) > 0. In the preceding paragraph the conditional probability P(A | B) was defined. Generally it is different from P(A). If, however, it is not, i.e. if

$$P(A \mid B) = P(A), \tag{1}$$

then we say that A is independent of B. If A is independent of B, then B is independent of A as well; indeed, by Formulas (2) and (3) of the preceding paragraph

$$P(B \mid A) = P(B). \tag{1'}$$

It is therefore permissible to say that A and B are independent of each other. From Formula (1) of § 8 follows readily a definition of independence of two events that is symmetrical in A and B. Indeed, because of the independence just defined we have

$$P(AB) = P(A) P(B). \tag{2}$$
A and B being independent, (2) is valid; conversely, if (2) holds and P(A), P(B) are both positive, then (1) and (1′) hold as well, thus A and B are independent. Hence (2) is a necessary and sufficient condition of independence, and it may thus serve as a definition as well. Old textbooks of probability theory used to call relation (2) the product rule of probabilities. However, according to the interpretation followed in this book (2) is not a theorem but the definition of independence. (Since we take Formula (2) as the definition of independence, any event A with P(A) = 0 or P(A) = 1 is independent of every event B.)
If A and B are independent, A and B̄ are independent as well. Namely from (2) it follows that

$$P(A\bar{B}) = P(A) - P(AB) = P(A) - P(A)P(B) = P(A)P(\bar{B}).$$

Therefore the independence of A and B implies the independence of A and B̄ and, similarly, that of Ā and B, further of Ā and B̄.
The independence of two complete systems of events is defined in the following manner: the complete systems of events $(A_1, A_2, \ldots, A_m)$ and $(B_1, B_2, \ldots, B_n)$ are said to be independent if the relations

$$P(A_j B_k) = P(A_j) P(B_k) \quad (j = 1, 2, \ldots, m; \ k = 1, 2, \ldots, n) \tag{3}$$

are valid for them. It is easy to see that from the mn conditions figuring in (3) every one containing $A_m$ or $B_n$ can be omitted. If the remaining mn − (m + n − 1) = (m − 1)(n − 1) conditions are fulfilled, the omitted ones are necessarily fulfilled too, as is seen from the relations

$$\sum_{k=1}^{n} P(A_j B_k) = P(A_j) \quad (j = 1, 2, \ldots, m) \tag{4}$$

and

$$\sum_{j=1}^{m} P(A_j B_k) = P(B_k) \quad (k = 1, 2, \ldots, n). \tag{5}$$

Indeed, if (3) is fulfilled for a given j and k = 1, 2, ..., n − 1, then from (4) it follows that (3) is fulfilled for k = n as well. Similarly, whenever (3) holds for some k and for j = 1, 2, ..., m − 1, it holds for j = m too.
If m = n = 2, we get again the result proved earlier that three of the four conditions

$$P(AB) = P(A)P(B),$$
$$P(\bar{A}B) = P(\bar{A})P(B),$$
$$P(A\bar{B}) = P(A)P(\bar{B}),$$
$$P(\bar{A}\bar{B}) = P(\bar{A})P(\bar{B})$$

are superfluous, since the validity of one implies necessarily the validity of the remaining three. Thus the independence of the events A and B is equivalent to the independence of the complete systems of events (A, Ā) and (B, B̄). This follows also from the relation

$$P(AB) - P(A)P(B) = -\left( P(\bar{A}B) - P(\bar{A})P(B) \right) \tag{6}$$

valid for any two events A and B.

Example 1. For two tosses of a coin, four different outcomes are possible: head-head, tail-head, head-tail, and tail-tail. Suppose that these possibilities are all equally probable; then each will have the probability 1/4. We obtain the same result by the concept of independence, assuming that head and tail are equally probable at both tosses, each having probability 1/2, and that the two tosses are independent of each other. Thus the probability of each possibility is

$$\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}.$$
Let us now extend the concept of independence to more than two events.
If A, B, and C are pairwise independent (i.e. A and B, A and С, В and C
are independent) events of the same probability algebra, the non-existence
of any dependence between the events A, B, and C does not follow. This
may be seen from the following example.
Let us throw two dice; let A denote the event of obtaining an even number
with the first die, B the event of throwing an odd number with the second,
finally C the event of throwing either both even or both odd numbers. Then

P(A) = P(B) = P(C) = 1/2,

further

P(AB) = P(AC) = P(BC) = 1/4.

The events A, B, and C are therefore pairwise independent. Nevertheless,

P(ABC) = 0,

thus

P((AB)C) ≠ P(AB) P(C),

i.e. AB is not independent of C.
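The dependence hidden behind pairwise independence can be verified by direct enumeration. A minimal sketch in Python (an illustration of ours, not taken from the text; the helper prob and the event names are arbitrary) enumerates the 36 equally probable outcomes of the two dice and confirms the probabilities just computed:

from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # the 36 equally likely pairs

def prob(event):
    # probability of an event = number of favourable cases / 36
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] % 2 == 0             # first die even
B = lambda w: w[1] % 2 == 1             # second die odd
C = lambda w: (w[0] + w[1]) % 2 == 0    # both even or both odd

print(prob(A), prob(B), prob(C))        # 1/2 1/2 1/2
print(prob(lambda w: A(w) and B(w)),
      prob(lambda w: A(w) and C(w)),
      prob(lambda w: B(w) and C(w)))    # 1/4 1/4 1/4: pairwise independent
print(prob(lambda w: A(w) and B(w) and C(w)))   # 0, yet P(A)P(B)P(C) = 1/8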


We shall say that A, B, and C are completely independent, if they are pair­
wise independent and each of them is independent of the product of the
remaining two. Thus A, B and C are completely independent, if the relations

P(AB) = P(A)P(B),
P(AC) = P(A)P(C),
P(BC) = P(B)P(C),
P(ABC) = P(A)P(B)P(C)

are valid. The first three of these relations express the pairwise indepen­
dence of A, B, and C, the fourth the fact that each of the events is inde­
pendent of the product of the remaining two. Indeed, from the first three
conditions we have:

P(AB)P(C) = P(AC)P(B) = P(BC)P(A) = P(A)P(B)P(C).


The (complete) independence of more than three events may be defined
in a similar manner. The events A₁, A₂, ..., A_n are said to be completely
independent, if for any k = 2, 3, ..., n the relation

P(A_{i₁} A_{i₂} ··· A_{i_k}) = P(A_{i₁}) P(A_{i₂}) ··· P(A_{i_k}) (7)



is valid for any combination (i₁, i₂, ..., i_k) of the numbers 1, 2, ..., n.
Since from n objects one can choose k objects in \binom{n}{k} ways, (7) consists of
2ⁿ − n − 1 conditions. In what follows, by saying for more than two events
that they are independent we shall mean that they are completely indepen­
dent in the sense just defined. If only pairwise independence is meant this
will be stated explicitly. The independence of more than two complete sys­
tems of events can be defined in a similar manner.
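For a finite sample space the 2ⁿ − n − 1 conditions of (7) can be checked mechanically. A minimal sketch (ours; it assumes the events are given as sets of elementary outcomes with known probabilities, and the tolerance is an arbitrary choice):

from itertools import combinations

def completely_independent(events, prob, tol=1e-12):
    # events: list of sets of elementary outcomes; prob: dict outcome -> probability
    P = lambda s: sum(prob[w] for w in s)
    n = len(events)
    for k in range(2, n + 1):                        # every order k = 2, ..., n
        for idx in combinations(range(n), k):        # every k-tuple of indices
            lhs = P(set.intersection(*(events[i] for i in idx)))
            rhs = 1.0
            for i in idx:
                rhs *= P(events[i])
            if abs(lhs - rhs) > tol:                 # relation (7) is violated
                return False
    return True

Applied to the three dice events of the preceding example (each of the 36 outcomes having probability 1/36), the function returns False, although all pairwise conditions hold.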
Combinatorial methods for the calculation of probabilities have already
been mentioned. They rested upon the assumption of the equiprobability
of certain events. By means of the concept of independence, however, this
assumption may often be reduced to simpler ones. Besides simplifying the
assumptions, this reduction has the advantage that checking the practical
validity of our assumptions sometimes becomes easier.
Example 2. Sampling without replacement. An urn contains n different
objects, numbered somehow from 1 to n. We draw k objects one after the
other without replacement. What is the probability that we obtain a given
combination of k elements? Clearly the number of possible combinations
is \binom{n}{k}.
It was supposed that all combinations are equally probable; the proba­
bility looked for is thus 1/\binom{n}{k}.
This result may also be obtained from the following simpler assumption:
at every drawing the conditional probability of drawing any object still in the
urn is the same. Here the probability that a given combination occurs in a
given order is (1/n) · (1/(n−1)) ··· (1/(n−k+1)). Namely, at the first drawing there
are n objects in the urn, so the probability of choosing any one is 1/n; at the
second drawing the conditional probability of choosing any one of the n−1
objects which are still in the urn is 1/(n−1), etc. Since the elements of the com­
bination in question may be chosen from the urn in k! different orders,
the result obtained must be multiplied by k!, and thus we get that the proba­
bility of drawing a combination of k arbitrary elements is

k! / [n(n−1) ··· (n−k+1)] = 1 / \binom{n}{k}.
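The agreement of the two computations is easy to observe numerically. In the following sketch (ours; n, k and the number of trials are arbitrary choices) one fixed combination is drawn repeatedly without replacement and its relative frequency is compared with 1/\binom{n}{k}:

import random
from math import comb

n, k, trials = 10, 4, 200_000
target = frozenset(range(k))          # one fixed combination of k objects

hits = sum(frozenset(random.sample(range(n), k)) == target
           for _ in range(trials))

print(hits / trials)                  # fluctuates around ...
print(1 / comb(n, k))                 # ... the exact value 1/C(10, 4) = 1/210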

Example 3. Sampling with replacement. An urn contains N balls, namely
M red and N − M white. Let M/N = p.
What is the probability that we obtain in n drawings k times a red ball,
if the chosen ball is always replaced into the urn and the balls are again well
mixed?
In every drawing the probability of choosing a certain ball is equal to
1/N, and the outcomes of the individual drawings are independent. Hence the
probability asked for is

W_k = \binom{n}{k} p^k (1 − p)^{n−k}. (8)

Indeed, let A_i denote the event of choosing a red ball at the i-th drawing
(i = 1, 2, ..., n). These events are, because of the replacement, independent
of each other. The probability that at the i₁-th, i₂-th, ..., i_k-th drawings a
red ball and at all the other (j₁-th, j₂-th, ..., j_{n−k}-th) drawings a white ball will
be chosen is nothing else than the probability of the event
A_{i₁} A_{i₂} ··· A_{i_k} Ā_{j₁} Ā_{j₂} ··· Ā_{j_{n−k}}.
As the events A_{i₁}, A_{i₂}, ..., A_{i_k}, Ā_{j₁}, Ā_{j₂}, ..., Ā_{j_{n−k}} are completely inde­
pendent and P(A_i) = p, P(Ā_j) = 1 − p, we get

P(A_{i₁} ··· A_{i_k} Ā_{j₁} ··· Ā_{j_{n−k}}) = p^k (1 − p)^{n−k}.

Since the order is irrelevant and only the number of red balls drawn is of
interest, the value so obtained must still be multiplied by the number of
the possible orderings, i.e. by \binom{n}{k}. Thus we obtain (8).
This result can immediately be generalized to experiments with more
than two possible outcomes. Let the possible outcomes in every experiment
be A⁽¹⁾, A⁽²⁾, ..., A⁽ʳ⁾; let their probabilities be denoted by P(A⁽ʰ⁾) = p_h
(h = 1, 2, ..., r). Of course we have ∑_{h=1}^{r} p_h = 1. Assume that in repeated
performance of the experiment the outcomes of the individual experiments are
independent of each other. Then the probability that in n repetitions of the
experiment event A⁽¹⁾ occurs k₁ times, event A⁽²⁾ k₂ times, ..., event A⁽ʳ⁾ k_r
times, is

(n! / (k₁! k₂! ··· k_r!)) p₁^{k₁} p₂^{k₂} ··· p_r^{k_r}, (9)

where ∑_{h=1}^{r} k_h = n. For r = 2 Formula (9) reduces to (8).
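Formula (9) translates directly into a short function. A sketch (ours; the probabilities ps below are arbitrary illustrative values) computes the probabilities and checks that they sum to 1:

from math import factorial, prod
from itertools import product

def multinomial(ks, ps):
    # probability that outcome h occurs ks[h] times in n = sum(ks) repetitions
    n = sum(ks)
    coeff = factorial(n)
    for k in ks:
        coeff //= factorial(k)
    return coeff * prod(p ** k for p, k in zip(ps, ks))

ps = (0.5, 0.3, 0.2)                    # r = 3: a trinomial distribution
n = 5
print(multinomial((2, 2, 1), ps))       # 5!/(2!2!1!) * 0.5^2 * 0.3^2 * 0.2
print(sum(multinomial(ks, ps)
          for ks in product(range(n + 1), repeat=len(ps))
          if sum(ks) == n))             # 1.0 up to rounding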

§ 10. "Geometric" probabilities

Let Ω be a measurable subset of the n-dimensional Euclidean space with
positive, finite Lebesgue measure. Let further 𝒜 be the set of all measurable
subsets of Ω and μ(A) the n-dimensional Lebesgue measure of the measur­
able set A. Let P(A) be defined by

P(A) = μ(A) / μ(Ω). (1)

It is easy to see from the results of § 7 that [Ω, 𝒜, P] is a Kolmogorov proba­
bility space. In this probability space probabilities may be obtained by
geometric determination of measures. Probabilities were thus calculated
already in the Eighteenth Century.¹
Some simple examples will be presented here.
Example 1. In shooting at a square target we assume that every shot hits
the target (i.e. we consider only shots with this property). Let the probability
that the bullet hits a given part of the target be proportional to the area of
the part in question. What is the probability that the hit lies in the part A ?
Clearly we only have to determine the factor of proportionality. If Ω de­
notes the entire target, the probability belonging to it must be equal to 1.
Hence

P(A) = μ(A) / μ(Ω),

where μ(Ω) denotes the area of the entire target and μ(A) that of A. Thus
for instance the probability of hitting the left lower quadrant of the target is
equal to 1/4.
As seen from this example, not every subset of the sample space can be
considered as an event. Indeed, one cannot assign an event to every subset
of the target, since the “area”, as it is well known, cannot be defined for
every subset such that it is completely additive and that the areas of con­
gruent figures are equal.
In general, the distribution of probability is said to be uniform, if the
probability that an object situated at random lies in a subset can be obtained
according to the definition (1) from a geometric measure μ invariant under
displacement (e.g. volume, area, length of arc, etc.).
Example 2. A man forgot to wind up his watch and thus it stopped. What
is the probability that the minute hand stopped between 3 and 6 ? Suppose
1 Of course instead of Lebesgue measure the notion of the area (and volume) of
elementary geometry was applied.

the probability that the minute hand stops on a given arc of the circum­
ference of the face of the watch is proportional to the length of the arc in
question. Then the probability asked for will be equal to the quotient of
the length of the arc in question, and the whole circumference of the face;
i.e. in our case to 1/4.
In the above two examples the determination of the probabilities was
reduced to the determination of the area or of the length of the arc
in certain geometric configurations. Though this method is intuitively
very convincing it is nevertheless a very special method. Before applying
it to further examples, let us see its relation to the already described combi­
natorial method. This relation is most evident in Example 2. If we neglect
the fractions of the minutes and are looking for the probability that the
minute hand stops between the zeroth and the first, the first and the second, ...,
the k-th and the (k+1)-th minute (k = 0, 1, ..., 59), then we have a sample
space consisting of 60 elementary events; the probability of every event is
the same, viz. 1/60. In the case of the example of the target let us assume,
for the sake of simplicity, that the sides of the square target are 1 m long. Let
us subdivide the target into n² congruent little squares with sides parallel
to the sides of the target. The probability that a hit lies in a set which can be
obtained as the union of a certain number of the little squares is obtained
by dividing the number of the little squares in question by n². Thus we
see that geometric probabilities can be approximately determined by a
combinatorial method. We must not, however, restrict ourselves to some
fixed n in the subdivision, for then we could not obtain the probability of a
hit lying in a domain limited by a general curve. If the mentioned subdivi­
sion is performed for every n, however large, then the probability of measur­
able sets, or to be more precise, of every domain having an area in the sense
of Jordan, can be calculated by means of limits. For this calculation we
have to consider the quotient k/n², where k means the number of small
squares lying in the domain if the large square is subdivided into n² con­
gruent small squares, and we have to determine the limit of k/n² for
n → ∞.
Probabilities obtained in a combinatorial way (without passing to the
limit) are always rational numbers; geometric probabilities, however, may
assume any value between 0 and 1. Thus for instance the probability that
the hit lies in the circle inscribed into the square target is equal to π/4.

The reduction of the calculation of geometric probabilities to combina­


torial considerations is nowadays of historical interest only. Namely it was
thought for a long time that the classical definition of probability suffices to
establish the calculus of probability. From the point of view of the modern
theory, however, such a reduction is unnecessary (except in the case when a
simplification of the calculations is brought about by it).

Fig. 9
In what follows we continue to consider some further examples. Let us
first deal with the so-called Bertrand paradox.
Example 3. Take a circle and choose at random a chord of it. What is
the probability that this chord will be longer than the side of the regular
triangle inscribed into the circle? The difficulty lies in the fact that it is not
clear, what is meant by the expression that we choose a chord “at random” .
Each of the following interpretations seems to be more or less natural.
Interpretation 1. Since the length of a chord is uniquely determined by the
position of its midpoint, we can accomplish the random choice of the chord
by choosing at random a point in the interior of the circle and construct
the chord whose midpoint is the chosen point. The probability that the
point lies in a domain will be assumed to be proportional to the area of
the domain. Clearly the chord will be longer than the side of the inscribed
regular triangle, if the midpoint of the chord lies inside the circle drawn
about the centre of the original circle with half of its radius (cf. Fig. 9); hence
the answer is

π(r/2)² / (πr²) = 1/4.

Interpretation 2. The length of the chord is uniquely determined by the


distance of its midpoint from the centre of the circle. In view of the sym-

metry of the circle we may assume that the midpoint of the chord lies on a
fixed radius of the circle and choose the midpoint of the chord so that the
probability that it lies in a given segment of this fixed radius is assumed to
be proportional to the length of this segment. The chord will be longer

than the side of the inscribed regular triangle, if its midpoint has a distance
less than r/2 from the centre of the circle; the answer is thus 1/2 (cf. Fig. 10).
Interpretation 3. Because of the symmetry of the circle one of the end­
points of the chord may be fixed, for instance in the point P0; the other end­
point can be chosen on the circle at random. Let the probability that this
other endpoint P lies on an arbitrary arc of the circle be proportional to the
length of this arc. The regular triangle inscribed into the circle having for
one of its vertices the fixed point P0 divides the circumference into three
equal parts. A chord drawn from the point P0 will be longer than the side
of the triangle, if its other endpoint lies on that one-third part of the circum­
ference which is opposite to point P0. Since the length of this latter is one
third of the circumference, the answer is, according to this interpretation,
equal to 1/3.
From a well-known theorem of the elementary geometry concerning the
central and peripheral angles it follows that the third interpretation is equi­
valent to the statement that the probability distribution of the intersection
point of the chord and the semicircle of centre P0 is uniform on this semi­
circle (Fig. 11).
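The three interpretations can all be imitated by simulation, and the simulation makes the "paradox" tangible: three different sampling prescriptions yield three different relative frequencies. A sketch (ours; r = 1 and the trial count are arbitrary; a chord is longer than the side of the inscribed regular triangle, of length r√3, exactly when its midpoint lies closer than r/2 to the centre):

import random, math

def chord_longer(scheme, r=1.0):
    if scheme == 1:                         # midpoint uniform in the disk
        while True:
            x, y = random.uniform(-r, r), random.uniform(-r, r)
            if x * x + y * y <= r * r:
                return x * x + y * y < (r / 2) ** 2
    if scheme == 2:                         # midpoint uniform on a fixed radius
        return random.uniform(0, r) < r / 2
    if scheme == 3:                         # second endpoint uniform on the circle
        phi = random.uniform(0, 2 * math.pi)
        return 2 * r * math.sin(phi / 2) > r * math.sqrt(3)

trials = 100_000
for s in (1, 2, 3):
    print(s, sum(chord_longer(s) for _ in range(trials)) / trials)
    # approximately 0.25, 0.50 and 0.33, respectively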
Obviously, all interpretations discussed above can be realized in physical
experiments. The example seemed once a paradox, because one did not
pay attention to the fact that the three interpretations correspond to dif­
ferent experimental conditions concerning the random choice of the chord

and these of course lead to different probability measures, defined on the


same algebra of events. The obtained measure in the set of straight lines is,
however, invariant with respect to motions of the plane only in the second
interpretation,1 in the other two interpretations congruent sets of lines do
not necessarily have equal measure.

Example 4. Decompose a unit segment into three subsegments by two


points chosen at random. What is the probability that a triangle can be
constructed from the three segments? Clearly we have to examine the proba­
bility that any one of the three segments is less than the sum of the remain­
ing two. Compared with the above, the example contains something new,
as here we have to choose two points at random. However, the problem
can be reduced readily to one similar to those dealt with above. Let indeed
be the segment in question the interval (0, 1) and the abscissas of the two
points chosen at random be x and y. To these two points there corresponds
a point of the plane with abscissa x and ordinate y. Thus there corresponds
to any decomposition of the unit interval (0, 1) into three segments one
point of the unit square of the plane and conversely. Now let the random
choice of the two points on the interval (0, 1) be performed in such a manner
that “the probability that the point representing the decomposition in ques­
tion lies in a domain A of the unit square” be equal to the area of that do­
main (not only proportional, since the area of the unit square is 1). In this

1 Cf. W. Blaschke [1].



case we only need to compute the area of the domain determined by the
inequalities (Fig. 12)

0 < x < 1/2 < y < 1 and y − x < 1/2

or

0 < y < 1/2 < x < 1 and x − y < 1/2.

The area of this domain, i.e. the probability wanted, is equal to 1/4.

Fig. 12
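The phase-space value 1/4 can be confronted with a direct simulation. A minimal sketch (ours):

import random

def triangle_possible():
    x, y = sorted((random.random(), random.random()))
    # the three pieces are x, y - x and 1 - y; a triangle exists iff each piece
    # is shorter than the sum of the other two, i.e. iff each is shorter than 1/2
    return max(x, y - x, 1 - y) < 0.5

trials = 100_000
print(sum(triangle_possible() for _ in range(trials)) / trials)   # about 0.25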
The method just applied is often used, for instance in statistical physics.
Here, to every state of the physical system a point of the “phase space”
may be assigned, having for its coordinates the characterizing data of the
state in question. Accordingly, the phase space has as many dimensions,
as the state of the system has data to characterize it (the so-called degrees
of freedom of the system). In our example we assigned a point of the phase
space to a decomposition of the (0,1) interval by two points; the degree of
freedom of the “ system” is here equal to 2. The analogy can be made still
more obvious by assigning to the decomposition of the (0,1) interval a
physical system: two mass points moving in the interval (0,1).
Clearly the phase space may be chosen in many ways; by solving problems
of probability in this way, however, one must not forget to verify in every
given case separately the assumption that the probabilities belonging to
the subdomains of the phase space are proportional to the area (volume).
Finally we shall discuss here a classical example, Buffon's needle problem
(1777).

Example 5. A plane is partitioned with equidistant parallel lines of dis­


tance d into parallel strips of equal width. A needle of length l is thrown at
random upon the plane. What is the probability that the needle intersects
a line? For the sake of brevity suppose l < d; in this case the needle can inter­
sect at most one of the parallel lines. The problem may then be solved
as follows. The position of the needle on the plane may be characterized by
three data: by the two coordinates of its midpoint and by the angle between
the needle and the direction of the lines. Let the coordinate of the needle's
midpoint perpendicular to the direction of the lines be denoted by x, that
in the direction of the lines by y. Whether the needle does or does not inter­
sect a line depends only on the coordinate x and the angle, so the coordinate
y may be disregarded. Or, what comes to the same thing, we may draw a
perpendicular to the parallels and assume that the midpoint of the needle
always lies on it. If we take the origin of the coordinate system on one of
the parallel lines, it may be assumed that the midpoint of the needle lies in
the first parallel strip, since the strips are all of equal width d. Let φ
denote the angle between the needle and the positive x-axis. The position
of the needle is then characterized in the rectangular coordinate system
(x, <p) (in the “phase space”) by one point and it can be assumed that this
point lies in the rectangle (0 < x < d, 0 < <p < n). The probability that
the point (x, (p) lies in an arbitrary domain of this rectangle is assumed to
be proportional to the area of this domain. Loosely speaking, we assume
that “all positions of the midpoint of the needle are equally probable and
all directions of the needle are equally probable” .
Fixing now the value of φ, the needle intersects the line x = 0 if
0 < x < (l/2) sin φ, and the line x = d if d − (l/2) sin φ < x < d. Thus the needle
intersects the line x = 0 if and only if the point (x, φ) characterizing the
position of the needle lies to the left of the sine curve drawn over the line
x = 0 with amplitude l/2, and it intersects the line x = d if the characterizing
point lies to the right of the sine curve drawn over the line x = d with the same
amplitude (Fig. 13).
Since the area under a half-wave of a sine curve of amplitude l/2 is equal
to l, the area of the domain formed by the points which correspond to inter­
section will be 2l. The area of the whole rectangle being πd, the sought
probability is 2l/(πd). Thus in many repetitions of Buffon's experiment one will
find intersection in approximately a fraction 2l/(πd) of the experiments. It was
tried more than once to determine approximately the value of π by this

method. Since, however, the assumptions (especially the equiprobability of
the directions) are quite difficult to realize, even many thousands of experi­
ments give the value of π only to a few digits. In principle, however, nothing
prevents us from carrying out an experiment in which our assumptions about
position and direction of the needle are very nearly satisfied and thus de­
termining the value of π with any prescribed precision. Of course this would
have no practical importance, since there are more straightforward and
reliable methods to compute the value of π. Still the question is of great
interest, since it shows that certain mathematical problems can be solved
approximately by performing experiments of a probabilistic nature. Now­
adays difficult differential equations and other problems of numerical anal­
ysis are treated in this manner (this is the so-called Monte Carlo method).
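Buffon's experiment is itself a rudimentary Monte Carlo method and is easily imitated on a computer. The following sketch (ours; l = 1, d = 2 and the trial count are arbitrary) samples the rectangle (0 < x < d, 0 < φ < π) described above and recovers an approximation of π from the frequency of intersections:

import random, math

def buffon(l=1.0, d=2.0, trials=1_000_000):
    hits = 0
    for _ in range(trials):
        x = random.uniform(0.0, d)            # midpoint coordinate
        phi = random.uniform(0.0, math.pi)    # angle with the lines
        half = (l / 2) * math.sin(phi)
        if x < half or x > d - half:          # the needle crosses a line
            hits += 1
    return hits / trials                      # fluctuates around 2l/(pi*d)

freq = buffon()
print(freq, 2 * 1.0 / (math.pi * 2.0))        # empirical vs. exact 2l/(pi*d)
print(2 * 1.0 / (2.0 * freq))                 # a crude estimate of pi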
Questions dealt with in this paragraph are closely connected to integral
geometry.

§ 11. Conditional probability spaces


The axiomatic foundation of probability theory given in 1933 by A. N.
Kolmogorov was of paramount importance and since then it furnishes the
very basis of this branch of mathematics. There are, however, problems ei­
ther entirely outside of its range or leading in this theory to serious compli­
cations. In physics (e.g. in quantum mechanics) and in some parts of proba­
bility theory (especially in the theory of Markov chains and stochastic
processes) as well as in applications to number theory and integral geometry,
one often has to deal with so-called unbounded distributions, i.e. unbounded
measures. The use of unbounded measures, however, cannot be justified in
the theory of Kolmogorov. For instance one cannot speak (in the sense of

the preceding paragraph) about a uniform probability distribution in the


whole Euclidean space. Similarly, it is nonsense to speak about the random
choice of an integer such that every integer has the same probability to be
chosen. At the first glance it might seem that this difficulty cannot be over­
come, since the value of probability can never exceed 1. In spite of this,
one can obtain by means of these unbounded, that is “nonsensical”, distri­
butions conditional probabilities which are in agreement with experience.
Thus the necessity arose to generalize the theory of probability in a way
justifying the use of such distributions.1
One can indeed give an axiomatic theory of probability which matches
the above-mentioned requirements.2
This theory contains the theory of Kolmogorov as a special case. The
fundamental concept of the theory is that of conditional probability; it
contains cases where ordinary probabilities are not defined at all.
We start from the following definitions and axioms:
Let there be given a set Ω (called the space of elementary events) and let 𝒜
denote a σ-algebra of subsets of Ω. The elements A, B, ... etc. of 𝒜 are
called events. The set Ω − A will be denoted by Ā. Let further ℬ be a non­
empty system of sets such that ℬ ⊆ 𝒜. We assume that a set function
P(A | B) of two set variables is defined for A ∈ 𝒜 and B ∈ ℬ; P(A | B) will
be called the conditional probability of the event A with respect to the condi­
tion B.
We postulate the following axioms:
a) P(A | B) ≥ 0 and P(B | B) = 1 (A ∈ 𝒜, B ∈ ℬ).
b) For any fixed B ∈ ℬ, P(A | B), as a function of A, is a measure on 𝒜,
i.e. if A_n ∈ 𝒜 and A_n A_m = O for n ≠ m, we have

P(∑_{n=1}^{∞} A_n | B) = ∑_{n=1}^{∞} P(A_n | B).

c) If A ∈ 𝒜, B ∈ ℬ, C ∈ ℬ, B ⊆ C and P(B | C) > 0, then

P(A | B) = P(AB | C) / P(B | C).

If the Axioms a), b), and c) are satisfied, we shall call the system
[Ω, 𝒜, ℬ, P(A | B)] a conditional probability space.

1 The history of mathematics shows that on several occasions procedures, successful


e.g. in physical applications but inexact in a mathematical sense of the word, were
made exact later on by an extension of the mathematical notions involved.
2 Rényi, A. [14], [15], [18]. The idea of such a theory is due to Kolmogorov himself;
he, however, did not publish anything about it.

If P*(A) is a measure defined on 𝒜 and P*(Ω) = 1 (that is, if [Ω, 𝒜, P*]
is a Kolmogorov probability field), further if ℬ* denotes the collection of
all sets B such that P*(B) > 0, then, as it is easy to see, the system
[Ω, 𝒜, ℬ*, P*(A | B)] is a conditional probability space, provided P*(A | B)
is defined by

P*(A | B) = P*(AB) / P*(B)  (A ∈ 𝒜, B ∈ ℬ*).

[Ω, 𝒜, ℬ*, P*(A | B)] will be called the conditional probability space
generated by the Kolmogorov probability space [Ω, 𝒜, P*].
We shall prove some simple theorems which follow directly from our
axioms.

Theorem 1. For A ∈ 𝒜 and B ∈ ℬ we have

P(A | B) = P(AB | B).

Proof. The statement follows from Axioms a) and c) by the substitution
C = B.

Theorem 2. For A ∈ 𝒜 and B ∈ ℬ we have

P(A | B) ≤ 1.

Proof. According to Theorem 1 and Axiom b) we have P(A | B) ≤
≤ P(B | B). Our statement follows by Axiom a).

Theorem 3. For B ∈ ℬ we have P(O | B) = 0.

Proof. The statement is evident because of Axiom b).

Theorem 4. If A ∈ 𝒜, B ∈ ℬ and AB = O, then P(A | B) = 0.

Proof. The statement follows from Theorems 1 and 3.

Theorem 5. For B ∈ ℬ we have P(Ω | B) = 1.

Proof. The statement is obvious because of Axiom a) and Theorem 1.

Theorem 6. If for fixed C ∈ ℬ we put P*_C(A) = P(A | C), the system
[Ω, 𝒜, P*_C] is a Kolmogorov probability space. If B is an element of 𝒜 such
that BC ∈ ℬ and P*_C(B) > 0, further if P*_C(A | B) is, as usually, defined by

P*_C(A | B) = P*_C(AB) / P*_C(B),

we have

P*_C(A | B) = P(A | BC).

Proof. The first statement of the theorem is evident since P*_C is a measure
on 𝒜 and P*_C(Ω) = 1. The second statement follows from Axiom c); indeed
we have by Theorem 1

P*_C(A | B) = P*_C(AB) / P*_C(B) = P(AB | C) / P(B | C) = P(ABC | C) / P(BC | C) = P(A | BC).

Theorem 7. Suppose Ω ∈ ℬ and put P*(A) = P(A | Ω). Then [Ω, 𝒜, P*]
is a Kolmogorov probability space. Further, if P*(B) > 0, we have

P*(A | B) = P*(AB) / P*(B).

Remark. ℬ may contain sets B such that P*(B) = 0. On the other hand, sets
B for which P*(B) > 0 may not belong to ℬ. Hence [Ω, 𝒜, ℬ, P(A | B)]
is not necessarily identical with the conditional probability space generated
by the Kolmogorov probability space [Ω, 𝒜, P(A | Ω)], not even in the case Ω ∈ ℬ.

Proof. Theorem 7 is a special case of Theorem 6.

From the theorems proved above one readily sees how the generalized
theory of probability can be deduced from our axioms.
Let us mention here some further examples.
Example 1. Let Ω be the n-dimensional Euclidean space; let the points
of Ω be denoted by ω = (ω₁, ω₂, ..., ω_n). Let 𝒜 denote the class of all mea­
surable subsets of Ω, let further f(ω) be a nonnegative, measurable function
defined on Ω and ℬ the set of all measurable sets B such that ∫_B f(ω) dω be
finite and positive. Put

P(A | B) = ∫_{AB} f(ω) dω / ∫_B f(ω) dω.

[Ω, 𝒜, ℬ, P(A | B)] is then a conditional probability space. If ∫_Ω f(ω) dω <
< +∞, a conditional probability space generated by a Kolmogorov proba­
bility space is obtained; if, however, ∫_Ω f(ω) dω = +∞, this is not the case.
Especially when f(ω) ≡ 1, we obtain the uniform probability distribution
in the whole n-dimensional space. In this case

P(A | B) = μ_n(AB) / μ_n(B),

where μ_n(C) denotes the n-dimensional Lebesgue measure of the set C.


Example 2. Let Ω be the set of the natural numbers, 𝒜 the class of all
subsets of Ω, further p_n (n = 1, 2, ...) a sequence of arbitrary nonnegative
numbers not all equal to 0; let ℬ denote the set of those subsets B of Ω for
which ∑_{n∈B} p_n is positive and finite. Let ∑_{n∈A} p_n be denoted by r(A) for A ∈ 𝒜
and put

P(A | B) = r(AB) / r(B).

Clearly [Ω, 𝒜, ℬ, P(A | B)] is a conditional probability space. It is gener­
ated by a Kolmogorov probability space if and only if the series ∑_{n=1}^{∞} p_n is
convergent.
Especially when p_n = 1 (n = 1, 2, ...),

P(A | B) = ∑_{n∈AB} 1 / ∑_{n∈B} 1

is equal to the ratio of the number of elements of the set AB and of the set B.¹
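In this uniform case conditional probabilities reduce to counting ratios, so they are harmless to compute for any finite condition B, even though no ordinary probability exists on all of Ω. A small sketch (ours; the choice of A, B and N is arbitrary) computes the conditional probability that a "uniformly chosen" natural number is divisible by 3, given that it lies in B = {1, ..., N}:

def cond_prob(A, B):
    # P(A | B) = |A and B| / |B| for the uniform weights p_n = 1
    B = set(B)
    return len(B & set(A)) / len(B)

N = 1000
B = range(1, N + 1)
A = [m for m in B if m % 3 == 0]
print(cond_prob(A, B))      # 0.333; tends to 1/3 as N grows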
Evidently the question arises how conditional probabilities are connected
with relative frequencies, i.e. whether the generalized theory does have a
frequency-interpretation too.
The answer is affirmative and even very simple. The conditional proba­
bility P(A I B) can be interpreted in the generalized theory (as well as in the
theory of Kolmogorov) as the number about which the relative frequency of
A with respect to the condition В fluctuates. Thus the generalized theory
has the same relation to the empirical world as Kolmogorov’s theory.

¹ In both cases, P(A | B) could have been represented as the ratio μ(AB)/μ(B), where
μ is an unbounded measure. (With respect to the conditions for the existence of such
measures cf. Á. Császár [1], and A. Rényi [18].)

§ 12. Exercises
1. Let p₁, p₂, p₁₂ be given real numbers. Prove that the validity of the four inequal­
ities below is necessary and sufficient for the existence of two events A and B such
that P(A) = p₁, P(B) = p₂, P(AB) = p₁₂:

1 − p₁ − p₂ + p₁₂ ≥ 0, (1)
p₁ − p₁₂ ≥ 0, (2)
p₂ − p₁₂ ≥ 0, (3)
p₁₂ ≥ 0. (4)

Hint. On the left hand sides of the inequalities (1)-(4) we have the probabilities
of ĀB̄, AB̄, ĀB, and AB, respectively. Of course they must be nonnegative, thus the conditions
are necessary. Their sufficiency can be shown as follows: from (1)-(4) it is clear that

0 ≤ p₁₂ ≤ p₁ ≤ p₁ + p₂ − p₁₂ ≤ 1

and similarly

0 ≤ p₁₂ ≤ p₂ ≤ p₁ + p₂ − p₁₂ ≤ 1.

The numbers p₁, p₂, p₁₂ are therefore nonnegative and do not exceed 1.

Consider the interval I = (0, 1) and suppose that a random point P is uniformly
distributed in this interval; i.e. let the probability that P lies in a subinterval of I be
equal to the length of this subinterval. Let A denote the event that the point lies in
the interval 0 < x < p₁, and B that it lies in the interval p₁ − p₁₂ < x < p₁ +
+ p₂ − p₁₂. Then we have P(A) = p₁, P(B) = p₂, P(AB) = p₁₂.
2. Generalize the assertion of Exercise 1 to n events (n = 3, 4, ...).

3. Examine how the conditions of Exercise 2 can be simplified if we assume that
p_{i₁ i₂ ... i_k} = P(A_{i₁} A_{i₂} ... A_{i_k}) (1 ≤ i₁ < i₂ < ... < i_k ≤ n) depends only on k
(k = 1, 2, ..., n − 1).

4. How can the conditions of Exercise 2 be simplified if we assume that for every
k = 2, 3, ..., n

p_{i₁ i₂ ... i_k} = p_{i₁} p_{i₂} ... p_{i_k}?

5. Let A₁, A₂, ..., A_n be any n events and suppose that the probabilities
P(A_{i₁} A_{i₂} ... A_{i_k}) (1 ≤ k ≤ n, 1 ≤ i₁ < i₂ < ... < i_k ≤ n) are known. Find the
probability that at least k of the n events A₁, A₂, ..., A_n will occur.

6. Prove the inequality

P(A Δ C) ≤ P(A Δ B) + P(B Δ C).

Remark. If we define the "distance" d(A, B) of the events A and B as the probability
P(A Δ B), then we have the "triangle inequality" d(A, C) ≤ d(A, B) + d(B, C).

7. If the distance of A and B is defined as

d*(A, B) = P(A Δ B) / P(A + B) for P(A + B) > 0, and d*(A, B) = 0 otherwise,

then the triangle inequality is again valid.

8. What is the probability that in n throws of a die the sum of the obtained numbers
is equal to k?

Hint. Determine the coefficient of x^k in the expansion of the generating function
(x + x² + x³ + x⁴ + x⁵ + x⁶)ⁿ and divide it by 6ⁿ.

9. What is the probability that the sum of the numbers thrown is larger than 10
in a throw with three dice ?
Remark. This was the condition of gain in the “passe-dix” game which was current
in the Seventeenth Century.
10. What is more probable: to get at least one six with four dice or at least one
double six in 24 throws of two dice? (Chevalier de Méré’s problem.)
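For Exercise 10, the rule of complete independence alone settles the comparison; a two-line sketch (ours) computes both probabilities:

p_one_six = 1 - (5 / 6) ** 4        # at least one six in 4 throws of one die
p_double_six = 1 - (35 / 36) ** 24  # at least one double six in 24 throws of two dice

print(p_one_six)      # about 0.5177: a favourable bet
print(p_double_six)   # about 0.4914: an unfavourable one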
11. In a party of n married couples everybody dances. Every gentleman dances
with every one of the ladies with the same probability. What is the probability that
nobody dances with his own wife? Find the limit of this probability for n → ∞.
12. An urn contains n white and m red balls, n ≠ m; balls are drawn from the urn
at random without replacement. What is the probability that at some instant the
numbers of white and red balls drawn are equal?
13. There is a queue of 100 men before the box-office of an exhibition. One ticket
costs 1 shilling. 60 of the men in the queue have only 1 shilling coins, 40 only 2 shilling
coins. The cash contains no money at the start. What is the probability that tickets
can be sold without any trouble (i.e. that no man having only 2 shilling coins comes
before the cash desk at a moment when the latter contains no 1 shilling coin)?
14. A particle moves along the x-axis with unit velocity. If it reaches a point with
integer abscissa it has one of two equiprobable possibilities: either it continues to
proceed or it turns back. Suppose that at the moment t = 0 the particle was at the
point x = 0. Find the probability that at time t the particle will be at a distance
x from the origin (t is a positive integer, x an arbitrary integer).
15. Let the conditions of Exercise 14 be completed by the following: at the point
with abscissa x0 (a positive integer) there is an absorbing wall; if the particle arrives
at the point of abscissa x0 it will be absorbed and does not continue its movement.
Answer the question of the preceding exercise for x ≤ x₀.
16. A box contains M red and N white balls which are drawn one after the other
without replacement. Let P_k denote the probability that the first red ball will be drawn
at the k-th drawing. Since there are N white balls, clearly k ≤ N + 1 and thus
P₁ + P₂ + ... + P_{N+1} = 1. By substituting the explicit expression of P_k we obtain
an identity. How can this identity be proved directly, without using probability theory?
17. Let us place eight rooks at random on a chessboard. What is the probability
that no rook can take another?
Hint. One has to count the number of ways in which 8 rooks can be placed on a
chessboard so that in every row and in every column there is exactly one rook.
18. Put

P_k(M, N) = \binom{M}{k} \binom{N − M}{n − k} / \binom{N}{n}

and

W_k = \binom{n}{k} p^k (1 − p)^{n−k}  (k = 0, 1, ..., n),

where p = M/N. Prove that if M and N tend to infinity so that M/N = p remains
constant, then P_k(M, N) tends to W_k.
19. Put

Q_k(M, N) = M (N − M)! (N − k)! / (N! (N − M − k + 1)!)

and

V_k = p(1 − p)^{k−1}  (k = 1, 2, ...),

where p = M/N. Show that if M and N tend to infinity so that M/N = p remains
constant, then Q_k(M, N) tends to V_k (cf. § 5, Example 1b). Estimate the error
|Q_k(M, N) − V_k|.
20. How many raisins are to be put into 20 ozs of dough in order that the proba­
bility is at least 0.99 that a cake of 1 oz contains at least one raisin?

21. The amount of water in a container may be determined as follows: a certain
amount a of soluble stain is dissolved in 1 gallon of water taken from the container
and the stained water is replaced. After perfect mixing 1 gallon of water is taken again
and its stain content determined. If the latter number is x and the mixing is supposed
to be perfect, the vessel contains a/x gallons of water. Similarly, the number of the
fishes in a pond may be determined as follows: 100 fishes are caught, marked (e.g.
by rings) and replaced into the pond. After an interval of some days 100 fishes are
caught again and the marked ones are counted. If their number is x > 0 then the
pond contains about 10 000/x fishes. If the pond contains 10 000 fishes, what is the
probability that among the fishes of the second catch the number of marked fishes
is 0, 1, 2, or 3?
22. A stick is broken at a random point and the longest part is again broken at
random. What is the probability that a triangle can be formed from the three pieces
so obtained ? (Observe that the conditions of the breaking differ from those of Example
4 of § 10.)
23. Consider an undamped mathematical pendulum. Let the angle of the maximal
elongation be 2°. What is the probability that at a randomly chosen instant the
elongation will be greater than 1°?
24. Let Buffon’s problem be modified by throwing upon the plane a disk instead
of a needle. What is the probability that the disk will not cover any of the lines?
25. In a five storey building the first floor is 8 metres above the ground floor, while
each subsequent storey is 6 metres high. Suppose that the elevator stops somewhere
because of a short circuit. Let the height of the door of the elevator be 1.8 metres.

Compute the probability that at the time o f the stopping only the wall of the elevator
shaft can be seen from the elevator.
26. What conditions must the numbers p, q, r, s satisfy in order that there exist
events A and B such that

P(A | B) = p, P(A | B̄) = q, P(B | A) = r, P(B | Ā) = s?

27. A box contains 1000 screws. These are tested at random so that the probability
of a screw being tested is equal to . Suppose that 2 per cent of the screws are
defective; what is the probability that from the tested screws exactly two are defec­
tive?
28. If A and B are independent events and A ⊆ B, prove that either P(A) = 0
or P(B) = 1.
29. Show by an example that it is possible that the event A is independent of both
BC and B + C, while B and C are also independent, but A is not independent of
either B or C.

30. Prove that if A is independent of BC and of B + C, B of AC, and C of AB,
further if P(A), P(B), and P(C) are positive, then A, B, and C are completely indepen­
dent.
31. We perform n independent experiments; suppose that the probability of the
event A in the j-th experiment is p_j (j = 1, 2, ..., n). Let π_{n,k} denote the probability
that in the n experiments the event A occurs just k times. Prove that one has always
π_{n,k}² ≥ π_{n,k−1} π_{n,k+1}, regardless of the values of the probabilities p₁, p₂, ..., p_n.

Hint. Use the relation

π_{n+1,k} = π_{n,k−1} p_{n+1} + π_{n,k}(1 − p_{n+1})

and proceed by mathematical induction.


32. Let A₁, A₂, ..., A_n be any distinct events. Let P(A_k) = p_k (k = 1, 2, ..., n),
further let U_r denote the probability that exactly r from the events A_k (k = 1, 2, ..., n)
occur. Put

S_k = ∑ P(A_{i₁} A_{i₂} ... A_{i_k})  (k = 1, 2, ..., n),

the summation extended over all 1 ≤ i₁ < i₂ < ... < i_k ≤ n. Then by Theorem 10
of § 3 we have the following relation:

U_r = S_r − \binom{r+1}{1} S_{r+1} + \binom{r+2}{2} S_{r+2} − ... + (−1)^{n−r} \binom{n}{n−r} S_n.

How will this expression be simplified if we assume that the events A₁, A₂, ..., A_n
are completely independent and equiprobable?

33. In the notations of the previous exercise prove that

S_r = U_r + \binom{r+1}{1} U_{r+1} + ... + \binom{n}{n−r} U_n.

34. Which simplification of the relation in the preceding exercise is possible if we
assume that A₁, A₂, ..., A_n are completely independent and equiprobable?
35. Let A₁, A₂, ..., A_n be arbitrary events and B an event which is a function
of the events A₁, A₂, ..., A_n. Prove that there exist numbers C₀ and C_{i₁ i₂ ... i_r}
(1 ≤ r ≤ n; 1 ≤ i₁ < i₂ < ... < i_r ≤ n), independent of the choice of the events
A_k, such that

P(B) = C₀ + ∑_{r=1}^{n} ∑_{1≤i₁<i₂<...<i_r≤n} C_{i₁ i₂ ... i_r} P(A_{i₁} A_{i₂} ... A_{i_r}).

Hint. See the proof of Theorem 11 of § 3.
36. Prove Theorem 10 of § 3 using the results of Exercise 35, by determining
the coefficients C_{i₁ i₂ ... i_r} in the particular case when the events A_k are independent.
According to the statement of Exercise 35 the formula with these coefficients will be
valid in the general case too.
37. As an application of Theorem 10 of § 3 solve the following problem: suppose
that n persons throw their visiting cards into a hat and then everybody draws one
visiting card from the hat. The probability that exactly r persons (r = 0, 1, ..., n) draw
their own cards is

W_r(n) = (1/r!) (1 − 1/1! + 1/2! − 1/3! + ... + (−1)^{n−r}/(n − r)!)

(e.g. for n → ∞ we have W_r(n) → 1/(e · r!)).
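The formula of Exercise 37 can be checked by brute force for small n. A sketch (ours; n = 6 keeps the enumeration of all n! drawings fast):

from math import factorial
from itertools import permutations

def W(r, n):
    # W_r(n) = (1/r!) * sum_{j=0}^{n-r} (-1)^j / j!
    return sum((-1) ** j / factorial(j) for j in range(n - r + 1)) / factorial(r)

n = 6
total = factorial(n)
for r in range(n + 1):
    count = sum(1 for p in permutations(range(n))
                if sum(p[i] == i for i in range(n)) == r)
    print(r, count / total, W(r, n))   # the two columns agree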
38. The events A₁, A₂, ..., A_n are said to be exchangeable¹, if the value of

P(A_{i₁} A_{i₂} ... A_{i_r})  (1 ≤ r ≤ n; 1 ≤ i₁ < i₂ < ... < i_r ≤ n)

depends only on r and does not depend on the choice of the different indices
i₁, i₂, ..., i_r (r = 1, 2, ..., n). Thus if A₁, A₂, ..., A_n are independent and equiprob­
able, they are also exchangeable. Show that from the exchangeability of the events
A₁, A₂, ..., A_n their independence does not follow.
39. a) Let an urn contain M red and N − M white balls; n balls are drawn without
replacement, n ≤ min(M, N − M). Let A_k denote the event that the k-th drawing
yields a red ball. Prove that the events A₁, A₂, ..., A_n are exchangeable.
b) Prove that the events A₁, A₂, ..., A_n defined in Exercise a) are even then
exchangeable, if every replacement of a ball drawn from the urn is accompanied by
throwing R balls of the same colour into the urn.
40. Each of N urns contains red and white balls. Let the number of the red balls
in the r-th urn be a_r, that of the white balls b_r, and let v_r be the probability of drawing
a red ball from the r-th urn; that is, we put v_r = a_r / (a_r + b_r). Perform the following
experiment. Choose first one of the urns; suppose that the probability of choosing
the r-th urn is p_r > 0 (r = 1, 2, ..., N). Draw from the chosen urn n balls with

1 Also called “symmetrically dependent” or “equivalent” events.



replacement. Let A_k denote the event that the k-th drawing yields a red ball. Prove
now the following statements:
a) The events A₁, A₂, ..., A_n are exchangeable.
b) The events A_k are, generally, not even pairwise independent.
c) Let W_k denote the probability that from the n drawings exactly k yield red balls.
Compute the value of W_k.
d) Let π_k denote the probability that the first red ball was drawn at the k-th
drawing; compute the value of π_k.

41. Let A_k denote the event that, under the conditions of Exercise 37, the k-th person
draws his own visiting card. Prove that the events A_k are exchangeable.

42. Let N balls be distributed among n urns such that each ball can fall with the
same probability into any one of the urns. Compute
a) the probability P₀(n, N) that at least one ball falls into every urn;
b) the probability P_k(n, N) that exactly k (k = 1, 2, ..., n − 1) of the urns remain
empty.

43. Let A_k denote in the preceding exercise the event that the k-th urn does not
remain empty; show that the events A_k are exchangeable and that

V_k = P(A₁ A₂ ... A_k) = ∑_{j=0}^{k} \binom{k}{j} (−1)^j (1 − j/n)^N.

Show that if N = λn and n → ∞, then lim V_k = (1 − e^{−λ})^k.

(Remark. V_n is equal to the probability P₀(n, N) occurring in Exercise 42.)
44. Banach was a passionate smoker and used to put one box of matches in both
pockets in order to be never without matches. Every time he needed a match, he chose
at random either the box in his right or that in his left pocket, each with probability
1/2. One day he put into his pockets two full boxes, both containing n matches. Let
P_k denote the probability that on first finding one of the boxes to be empty, the other
box contained k matches. Calculate the value of P_k and find the value of k which
maximizes this probability.
45. An urn contains M red and N − M white balls, M/N = p. Let P_r denote the
probability that in a sequence of drawings with replacement the r-th drawing of a
red ball is preceded by an even number of drawings of white balls. Prove that we have
P_r > 1/2 for every value of p and r.

46. Let an urn contain M red and N − M white balls. Draw all balls from the
urn in turn without replacement and note the serial numbers of the red drawings.
Let these serial numbers be k₁, k₂, ..., k_M, and put X = k₁ + k₂ + ... + k_M. Let
P_n(M, N) denote the probability that X = n (A ≤ n ≤ B), where

A = M(M + 1)/2 and B = A + M(N − M).

Put

F(M, N, x) = ∑_{n=A}^{B} P_n(M, N) xⁿ.

Determine the polynomial F(M, N, x) and thence the probabilities P_n(M, N).¹
Prove that

P_{B−n}(M, N) = P_{A+n}(M, N).
47. Prove by means of probability theory that if φ(n) denotes the number of the
positive integers less than n and relatively prime to n (n = 1, 2, ...), then²

φ(n) = n ∏_{p|n} (1 − 1/p),

where the product is to be taken over all distinct prime factors p of n.

Hint. Choose at random one of the numbers 1, 2, ..., n such that each of these
numbers is equally probable. Let A_p denote the event that the number chosen can
be divided by the prime number p. Show that if p₁, p₂, ... are the distinct prime
factors of the number n, then the events A_{p₁}, A_{p₂}, ... are independent. The proba­
bility that the chosen number is relatively prime to n is, by definition, φ(n)/n. On the
other hand we have P(A_p) = 1/p, hence, because of the independence of the events A_p,

φ(n)/n = P(∏_{p|n} Ā_p) = ∏_{p|n} P(Ā_p) = ∏_{p|n} (1 − 1/p).
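The identity can also be verified numerically. A sketch (ours; phi_product implements the product formula by trial division, phi_direct counts directly):

from math import gcd

def phi_direct(n):
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

def phi_product(n):
    result, m, p = n, n, 2
    while p * p <= m:
        if m % p == 0:
            result -= result // p      # multiply by (1 - 1/p)
            while m % p == 0:
                m //= p
        p += 1
    if m > 1:                          # a remaining prime factor
        result -= result // m
    return result

for n in (12, 30, 97, 360):
    print(n, phi_direct(n), phi_product(n))   # the two values coincide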
48. a) Let Ω be a countably infinite set, let its elements be ω₁, ω₂, ..., ω_n, ....
Let 𝒜 consist of all subsets of Ω and let the probability measure P be defined in the
following manner: P({ω_n}) = p_n, where p_n ≥ p_{n+1} > 0 (n = 1, 2, ...) and ∑_{n=1}^{∞} p_n = 1.
Prove that the set of those numbers x for which an A ∈ 𝒜 can be found such that
P(A) = x is a perfect set.
b) Prove that, given the conditions of Exercise 48 a), the range of the set function
P(A) is identical to the interval [0, 1], if and only if

p_n ≤ ∑_{k=n+1}^{∞} p_k  (n = 1, 2, ...).

c) Given the conditions of Exercise 48 a), prove that to every r-tuple of numbers
x₁, x₂, ..., x_r with

∑_{t=1}^{r} x_t = 1,  x_t ≥ 0  (t = 1, 2, ..., r),

a complete system of events

A₁, A₂, ..., A_r with P(A_t) = x_t (t = 1, 2, ..., r)

¹ This exercise is the basis of an important statistical method, called Wilcoxon's
test.
² φ(n) is called Euler's function.

can be found if and only if

p_n ≤ (1/r) ∑_{k=n}^{∞} p_k  (n = 1, 2, ...).
Hint. a) A number x is said to be representable if there exists an event A ∈ 𝒜
such that P(A) = x, i.e. if x can be represented in the form x = ∑ p_n, summed over
some subset of the indices. If x_n (n = 1, 2, ...) is representable and lim_{n→∞} x_n = x ≠ 0,
then it is readily seen that x is representable too. Indeed, we can select from the sequence
x_n an infinite subsequence x_{n_k} (k = 1, 2, ...) such that in the representation of each
x_{n_k} the greatest member is p_{i₁}. Take now from this sequence an infinite subsequence
having in its representation p_{i₂} for second greatest member. By progressing in this
manner we obtain a sequence p_{i_s} (s = 1, 2, ...) and it is easy to verify that
∑_{s=1}^{∞} p_{i_s} = x. The range of the function P(A) is thus a closed set. Furthermore, if x
is a number which can be represented as a sum of a finite number of the p_i's, e.g.
x = ∑_{t=1}^{N} p_{i_t}, then

x = lim_{n→∞} (∑_{t=1}^{N} p_{i_t} + p_n).

If x = ∑_{t=1}^{∞} p_{i_t}, then

x = lim_{n→∞} ∑_{t=1}^{n} p_{i_t}.

Thus the range of the function P(A) is a perfect set.


c) It is easy to see that the condition is necessary. Its sufficiency can be shown
in the following manner: suppose we have x₁ ≥ x₂ ≥ ... ≥ x_r. Then x₁ ≥ 1/r, and
on the other hand p₁ ≤ 1/r, hence p₁ can be used for the representation of x₁. Let now
x₁' = max(x₁ − p₁, x₂); then x₁' ≥ (1/r)(1 − p₁). Since p₂ ≤ (1/r)(1 − p₁), p₂ can therefore
be used for the representation of x₁', that is, for one of the x_t's. Proceeding in this way
we can see that every p_n can be used for the representation of some x_t. Since

∑_{n=1}^{∞} p_n = ∑_{t=1}^{r} x_t = 1,

we have obtained a decomposition of the series ∑_{n=1}^{∞} p_n into r disjoint subseries such
that the sum of the t-th subseries is equal to x_t. If A_t consists of the elements ω_n for
which p_n occurs in the representation of x_t, then the sets A_t have the required
properties.
b) is the special case r = 2 of the statement c).
49. The Kolmogorov probability space [Ω, 𝒜, P] is said to be non-atomic, if there
exists to every event A of positive probability an event B ⊂ A such that 0 < P(B) <
< P(A). Prove that in the case of a non-atomic probability space [Ω, 𝒜, P] the range
of the function P(A) is the whole interval [0, 1].

Hint. Prove first that for any ε > 0, Ω can be decomposed into a finite number
of disjoint subsets A_j (A_j ∈ 𝒜; j = 1, 2, ..., m) such that P(A_j) < ε. This can be
seen as follows. If A ∈ 𝒜, P(A) > 0, then A contains a subset B ⊆ A such that
0 < P(B) < ε. Indeed, if P(A) < ε, we can choose B = A. If P(A) ≥ ε, then (since
P is non-atomic) a B ⊂ A can be found such that B ∈ 𝒜 and 0 < P(B) < P(A);
here either P(B) or P(A − B) is not greater than P(A)/2. If P(A)/2 < ε, we have
completed the proof; if P(A)/2 ≥ ε, the procedure is continued. Since for large enough
r we have P(A)/2^r < ε, there can be found in a finite number of steps a set B such that
B ⊂ A, B ∈ 𝒜 and 0 < P(B) < ε.


Let us now put

μ_ε(A) = sup_{B⊆A, P(B)<ε} P(B)  for A ∈ 𝒜.

According to what was said above, μ_ε(A) > 0 for P(A) > 0. Choose a set A₁ ∈ 𝒜
for which 0 < P(A₁) < ε, further a set A₂ ⊆ Ω − A₁ for which

ε > P(A₂) ≥ (1/2) μ_ε(Ω − A₁),

and then a set A₃ ⊆ Ω − (A₁ + A₂) for which ε > P(A₃) ≥ (1/2) μ_ε(Ω − (A₁ + A₂));
generally, if the sets A₁, A₂, ..., A_n are already chosen, we choose a set A_{n+1} such
that the conditions

A_{n+1} ⊆ Ω − (A₁ + A₂ + ... + A_n)

and

ε > P(A_{n+1}) ≥ (1/2) μ_ε(Ω − (A₁ + A₂ + ... + A_n))

are satisfied. Then A₁, A₂, ..., A_n, ... are disjoint sets, hence ∑_{n=1}^{∞} P(A_n) ≤ 1
and thus lim_{n→∞} P(A_n) = 0 and at the same time

lim_{n→∞} μ_ε(Ω − (A₁ + A₂ + ... + A_n)) = 0.

Since μ_ε(A) is a monotonic set function, we get, introducing the notation
B = Ω − ∑_{n=1}^{∞} A_n, that μ_ε(B) = 0. But then it follows that P(B) = 0 and thus,
introducing the notation A₁' = A₁ + B, we obtain that

A₁' + ∑_{n=2}^{∞} A_n = Ω,  0 < P(A_n) < ε  (n = 2, 3, ...),

and 0 < P(A₁') < ε.



Choose now N so large that ∑_{n=N}^{∞} P(A_n) < ε. Then the sets A₁', A₂, ..., A_{N−1} and
A_N' = ∑_{n=N}^{∞} A_n possess the required properties. Now we can construct for an arbitrary
number x (0 ≤ x ≤ 1) an A ∈ 𝒜 such that P(A) = x in the following manner: Ω is
decomposed first into a number N₁ of disjoint subsets A_{1j} such that

0 < P(A_{1j}) < x/2  (j = 1, 2, ..., N₁).

Let x_{1,r} = P(∑_{j=1}^{r} A_{1j}). Then x lies in one of the intervals [x_{1,r}, x_{1,r+1}), r = 1, 2, ...,
N₁ − 1; let it be e.g. the interval [x_{1,r₁}, x_{1,r₁+1}). If x = x_{1,r₁}, we have finished the
construction. If x_{1,r₁} < x < x_{1,r₁+1}, we decompose A_{1,r₁+1} into subsets A_{2j} (j =
= 1, 2, ..., N₂) such that

0 < P(A_{2j}) < (x − x_{1,r₁})/2.

Let

x_{2,r} = P(∑_{j=1}^{r₁} A_{1j} + ∑_{j=1}^{r} A_{2j})  (r = 1, 2, ..., N₂).

Then x lies in one of the intervals [x_{2,r}, x_{2,r+1}); e.g. x ∈ [x_{2,r₂}, x_{2,r₂+1}). By con­
tinuing this procedure we obtain a set

A = ∑_{j=1}^{r₁} A_{1j} + ∑_{j=1}^{r₂} A_{2j} + ... + ∑_{j=1}^{r_s} A_{sj} + ...

for which P(A) = x.

50. Prove for an arbitrary probability space that the range of P(A) is a closed set.

Hint. A set A ∈ 𝒜 will be called an atom (with respect to P), if P(A) > 0 and if
B ∈ 𝒜, B ⊆ A imply either P(B) = 0 or P(B) = P(A). Two atoms A and A' are, a
set of zero measure excepted, either identical or disjoint. From this it follows that
there can always be found either a finite or a countably infinite number of disjoint
atoms A_n (n = 1, 2, ...) such that the set Ω − ∑_n A_n contains no further atoms. Put

∑_{n=1}^{∞} A_n = B',  μ₁(A) = P(AB'),  μ₂(A) = P(A(Ω − B')).

Then P(A) = μ₁(A) + μ₂(A). Here μ₁(A) can be considered as a measure on the
class of all subsets of the set Ω' having for its elements the sets A_n, and μ₂(A) is non-
atomic. Hence the statement of Exercise 50 is reduced to the Exercises 48 a) and 49.
CHAPTER III

DISCRETE RANDOM VARIABLES

§ 1. Complete systems of events and probability distributions

We have defined in § 2 of Chapter I the concept of a "complete system of
events" with respect to finite probability algebras; this concept will now be
extended to arbitrary Kolmogorov probability spaces. A finite or denumer­
ably infinite system of events {A_n} (A_n ∈ 𝒜; n = 1, 2, ...) is said to be
complete (in the wider sense), if for i ≠ j A_i A_j = O and if the occurrence of
an event A_n (n = 1, 2, ...) is "almost sure" (i.e. it has the probability 1):

P(∑_n A_n) = 1. (1)

Thus we do not require that ∑_n A_n = Ω; only that P(Ω') = 0 should
hold, where

Ω' = Ω − ∑_n A_n.
The sequence of probabilities of a complete system of events will be called
a probability distribution (or briefly distribution). From a purely mathemati­
cal point of view every sequence of nonnegative numbers p₁, p₂, ... for which

∑_n p_n = 1 (2)

can be considered as a probability distribution.


The expression “probability distribution” hints at the interpretation that
the probability 1 of the sure event is “ distributed” among the events A n
(n = 1 , 2 , . . . ) . There is a close analogy between probability distributions
and mass-distributions in mechanical systems, since every sequence of non­
negative numbers p₁, p₂, ... fulfilling (2) may be considered as a distribu­
tion of the unit mass among a finite or denumerably infinite number of
points. Later on we shall often return to this analogy.

§ 2. The theorem of total probability and Bayes’ theorem

Let B₁, B₂, ..., B_n, ... be a complete system of events and let P(B_i) > 0
(i = 1, 2, ...). Then an arbitrary event A ∈ 𝒜 can be decomposed accord­
ing to the formula

A = ∑_{n=1}^{∞} AB_n.

Since B_i B_j = O holds for i ≠ j, we obtain

P(A) = ∑_n P(AB_n). (1)

According to the definition of conditional probabilities we have

P(AB_n) = P(A | B_n) P(B_n).

When substituted into (1) this gives

P(A) = ∑_n P(A | B_n) P(B_n). (2)

This relation is called the theorem of total probability. Since ∑_n P(B_n) = 1,
according to (2) the probability P(A) is the weighted mean of the conditional
probabilities P(A | B_n) taken with the weights P(B_n). From this it follows
immediately that

inf_n P(A | B_n) ≤ P(A) ≤ sup_n P(A | B_n). (3)

The theorem of total probability is closely connected to the following


simple theorem of mechanics. The center of gravity of a body can be
obtained by decomposing the body into arbitrarily many parts and consider­
ing the mass of each part as concentrated in its center of gravity and then
forming the center of gravity of the resulting point-system. Equation (2) is
further analogous to the following chemical relation: Different solutions
of the same salt are placed into N vessels, the total volume of the solutions
being 1. Let P(B_n) denote the volume of the n-th vessel and P(A | B_n) the
concentration of the solution in the n-th vessel. If we mix the contents of
the vessels and denote by P(A) the concentration of the resulting solution,
Equation (2) will hold for this case too.
Example. Let an urn contain M red and N − M white balls. Draw balls
from the urn without replacement. Let A_k denote the event that we obtain
a red ball at the k-th drawing. Clearly, P(A₁) = M/N. We shall show that
P(A_k) = M/N (k = 2, 3, ..., N). According to the theorem of total proba­
bility

P(A₂) = P(A₂ | A₁) P(A₁) + P(A₂ | Ā₁) P(Ā₁),

hence

P(A₂) = ((M − 1)/(N − 1)) (M/N) + (M/(N − 1)) ((N − M)/N) = M/N.

Similarly we obtain that P(A_k) = M/N, if k = 3, 4, ..., N (cf. Exercise 39a,
§ 12, Ch. II).


Let A and B be any two elements of an algebra of events with P(A) > 0
and P(B) > 0. From the values of P(A), P(B) and P(A | B) one can obtain
P(B | A) as well; indeed (cf. (2), § 8, Ch. II)

P(B | A) = P(A | B) P(B) / P(A). (4)

If {B_n} is a complete system of events and if in (4) B_k is substituted for
B and Expression (2) for P(A), we have

P(B_k | A) = P(A | B_k) P(B_k) / ∑_n P(A | B_n) P(B_n). (5)

This is Bayes' theorem.
There is hardly any other theorem of probability theory so much debated
as this.
Bayes’ theorem is well-proven, its validity cannot be doubted; only its
practical applications are controversial. An often-used name of this theorem
is for instance “theorem of the probability of causes” . This name originates
in the use of Bayes’ theorem to infer the probabilities of the hypotheses
(causes) Bk (k = 1, 2 ,. ..) from the occurrence of an event A; i.e. if one
wishes to examine how much the occurrence of an event A supports or re­
futes certain hypotheses. If the so-called a priori probabilities P(Bk) are known,
then Bayes’ theorem can be applied and the a posteriori probabilities
P(Bk I A) can be computed. However, the probabilities P(Bk) are often un­
known. Then it is usual to give them arbitrary values, which is really a
questionable procedure.
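A worked numerical instance may make this concrete. In the sketch below (ours; the a priori probabilities and the likelihoods are invented illustrative values, not data from the text) the a posteriori probabilities of three hypotheses are computed from (5):

# hypothetical a priori probabilities P(B_k) and likelihoods P(A | B_k)
prior = [0.5, 0.3, 0.2]
likelihood = [0.10, 0.40, 0.80]

total = sum(p * l for p, l in zip(prior, likelihood))   # P(A) by formula (2)
posterior = [p * l / total for p, l in zip(prior, likelihood)]

print(total)       # P(A) = 0.33
print(posterior)   # P(B_k | A): about [0.152, 0.364, 0.485]; they sum to 1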
The name “theorem of the probability of causes” can lead to misunder­
standings, hence we must discuss it a little further. Indeed, from (4) it follows
that
P(A | B) / P(A) = P(B | A) / P(B).
Thus if the occurrence of the event A increases (e.g. doubles) the probability
of B, then the occurrence of the event В increases (doubles) the probability
of the event A as well; hence it is entirely impossible to infer the direction
of the causal relation from the value of the conditional probability only.

We mention finally the following chemical analogy of Bayes’ theorem:


Let N vessels contain solutions of different concentrations of the same
salt. Let the total volume of the solutions be 1. Let P(B_k) denote the volume
of the solution in the k-th vessel and P(A | B_k) the concentration of the salt
in it; then formula (5) gives what part of the total mass of salt is in the k-th
vessel.

§ 3. Classical probability distributions

1. In the preceding Chapter we have already discussed independent repetitions
of a simple alternative. Repeat n times an experiment with two possible
outcomes A and Ā, such that the repetitions are independent. If B_k (k =
0, 1, ..., n) denotes the event that A occurred at exactly k experiments,
then the events B_k (k = 0, 1, ..., n) form a complete system of events and
the corresponding probabilities are

W_k = P(B_k) = \binom{n}{k} p^k q^{n-k}   (k = 0, 1, ..., n)   (1)

where p = P(A) and q = 1 − p = P(Ā). The sequence of numbers W_k is
called the binomial distribution of order n and parameter p. The name hints
at the fact that the numbers W_k are the terms of the expansion of (p + q)^n
according to the binomial theorem.
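A minimal added sketch (illustrative parameters only) computes the numbers W_k and checks that they sum to 1, since (p + q)^n = 1:

```python
from math import comb

def binomial_pmf(n, p):
    """W_k = C(n, k) p^k q^(n-k) for k = 0, ..., n."""
    q = 1 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

w = binomial_pmf(10, 0.3)
print(w)
print(sum(w))   # 1.0 up to rounding
```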
2. A natural extension of the binomial distribution is the polynomial
distribution;¹ it is obtained by independent repetitions of an experiment having
several possible outcomes. Let the possible outcomes of the experiment
be A_1, A_2, ..., A_r and let P(A_j) = p_j (j = 1, 2, ..., r); then these probabilities
fulfil the condition p_1 + p_2 + ... + p_r = 1. Let B_{k_1, k_2, ..., k_r} denote
the event that in n independent repetitions of the experiment the event A_1
occurs k_1 times, the event A_2 occurs k_2 times, ..., the event A_r occurs k_r
times, where k_1 + k_2 + ... + k_r = n. Then we have

P(B_{k_1, k_2, ..., k_r}) = \frac{n!}{k_1! k_2! \cdots k_r!} p_1^{k_1} p_2^{k_2} \cdots p_r^{k_r}.   (2)

The name “polynomial distribution” comes from the fact that the terms
P(B_{k_1, k_2, ..., k_r}) can be obtained by expanding (p_1 + p_2 + ... + p_r)^n
according to the polynomial theorem. If r = 3, we call the distribution
the trinomial distribution.

¹ Called also “multinomial” distribution.



3. Let an experiment have only two possible outcomes, A and Ā. Perform
n independent repetitions of this experiment, but let the probability of A
(and thus of Ā) change from experiment to experiment. Let B_k denote the
event that A occurred exactly k times (k = 0, 1, ..., n); then

P(B_k) = \sum p_{i_1} p_{i_2} \cdots p_{i_k} (1 − p_{j_1})(1 − p_{j_2}) \cdots (1 − p_{j_{n-k}})   (3)

where p_i is the probability that A did occur at the i-th experiment. The
summation is to be taken over all combinations (i_1, i_2, ..., i_k) of the k-th
order of the elements (1, 2, ..., n), and j_1, j_2, ..., j_{n-k} denote those numbers
of the sequence 1, 2, ..., n which do not occur among i_1, i_2, ..., i_k. The
numbers P(B_k) form a probability distribution. If for instance all probabilities
p_i are equal to each other, we obtain as a particular case the binomial
distribution (1).
The distribution (3) occurs for instance in the following practical problem:
In a factory there are n machines which do not work all the time. They are
switched on and switched off independently of each other. Let p_i denote
the probability that the i-th machine is working at a given moment, and let
P(B_k) be the probability that at this instant exactly k machines are working;
then P(B_k) is given by Formula (3). The fact that \sum_{k=0}^{n} P(B_k) = 1 can
be seen directly in the following manner: a simple calculation gives that

\sum_{k=0}^{n} P(B_k) x^k = \prod_{i=1}^{n} (1 − p_i + p_i x);

by substituting x = 1 we obtain \sum_{k=0}^{n} P(B_k) = 1.
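The generating-function identity above also gives a practical way to compute the distribution (3): multiplying out the factors (1 − p_i + p_i x) term by term yields the coefficients P(B_k). An added sketch, with arbitrary illustrative p_i:

```python
def working_machines_dist(ps):
    """Coefficients of prod_i (1 - p_i + p_i x); coef[k] = P(B_k)."""
    coef = [1.0]
    for p in ps:
        new = [0.0] * (len(coef) + 1)
        for k, c in enumerate(coef):
            new[k] += c * (1 - p)   # i-th machine idle
            new[k + 1] += c * p     # i-th machine working
        coef = new
    return coef

dist = working_machines_dist([0.2, 0.5, 0.9, 0.4])
print(dist, sum(dist))   # the sum is 1, as seen by substituting x = 1
```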
4. The following problem was discussed in the preceding Chapter. An
urn contains M red and N − M white balls (M < N). Draw n times one
ball from the urn without replacement (n ≤ N). What is the probability
that there are k red balls among the n balls drawn? Denote this event by
C_k. Then the events C_k [max(0, n − (N − M)) ≤ k ≤ min(n, M)] form
a complete system of events. The corresponding probabilities are, as we have
already shown:

P(C_k) = \frac{\binom{M}{k} \binom{N-M}{n-k}}{\binom{N}{n}}   (k = 0, 1, ..., n).   (4)
This distribution is called the hypergeometric distribution.
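Formula (4) translates directly into code. The following added sketch (illustrative parameters) evaluates it and checks the complete-system property:

```python
from math import comb

def hypergeom_pmf(N, M, n):
    """P(C_k) for k = 0, ..., n; math.comb returns 0 when k exceeds M."""
    return [comb(M, k) * comb(N - M, n - k) / comb(N, n)
            for k in range(n + 1)]

probs = hypergeom_pmf(N=20, M=7, n=5)
print(probs, sum(probs))   # the probabilities sum to 1
```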

5. This distribution can be generalized in the following manner: Suppose
that the urn contains balls of r different colours, namely exactly N_i balls of
the i-th colour (i = 1, 2, ..., r). Let N = \sum_{i=1}^{r} N_i be the total number of balls
and let C_{k_1, k_2, ..., k_r} denote the event that among n balls drawn without
replacement the first colour occurs k_1 times, the second k_2 times, ..., the
r-th colour k_r times (k_1 + k_2 + ... + k_r = n). By a simple combinatorial
consideration we obtain that

P(C_{k_1, ..., k_r}) = \frac{\binom{N_1}{k_1} \binom{N_2}{k_2} \cdots \binom{N_r}{k_r}}{\binom{N}{n}}.   (5)

Distribution (5) is called the polyhypergeometric distribution. It is, for
example, applied in statistical quality control, when the commodities are
classified into several categories. (Such categories are for instance: a) faultless;
b) faulty but still serviceable; c) completely faulty.)
The events

C_{k_1, ..., k_r}   (0 ≤ k_i ≤ min(n, N_i); \sum_{i=1}^{r} k_i = n)

form a complete system of events, thus

\sum P(C_{k_1, ..., k_r}) = 1.

This can be seen directly, if we compare the coefficient of x^n on both sides
of the identity

\prod_{i=1}^{r} (1 + x)^{N_i} = (1 + x)^N.
6. Let an urn contain M red and N − M white balls. Let A_k (k = 0, 1, ...,
N − M) denote the event that at consecutive drawings without replacement
we obtained the first red ball at the (k + 1)-st drawing. As was proved in
§ 5 of the preceding Chapter, we have

P(A_0) = \frac{M}{N},

P(A_k) = \frac{M}{N − k} \prod_{j=0}^{k-1} \left(1 − \frac{M}{N − j}\right)   (k = 1, 2, ..., N − M).   (6)

Since the events A_k (k = 0, 1, ..., N − M) form a complete system of
events, we have the relation

\frac{M}{N} + \sum_{k=1}^{N-M} \frac{M}{N − k} \prod_{j=0}^{k-1} \left(1 − \frac{M}{N − j}\right) = 1.

This identity also has a direct proof, but it is not quite simple. It happens
often that certain identities for which a mathematical proof may be rather
elaborate are readily obtained by means of probability calculus.
7. Let the preceding exercise be modified in the following manner: Let
an urn again contain M red and N − M white balls, but the drawn balls
should now always be replaced. Let A_k denote the event that we obtain the
first red ball at the (k + 1)-st drawing. The most marked difference between
this problem and that of the drawing without replacement dealt with above
is that the number k was there bounded (k ≤ N − M). Now, however, k
can be arbitrarily large; in principle, it is even possible that we always draw
a white ball. Hence it can be questioned whether the events A_k (k = 0, 1, ...)
do form a complete system of events. Clearly, the events A_k mutually
exclude each other; the only thing we have to examine is whether it is sure that
one of the events does occur. By introducing the notation

Ω' = \sum_{k=0}^{\infty} A_k,

we have Ω' ≠ Ω.
We shall prove, however, that the possibility to draw always a white ball
in an infinite repetition of drawings has probability 0, thus in practice it
does not count at all, i.e., the system of events {A_k} is in a wider sense
complete.
First of all we compute the probabilities P(A_k). Put M/N = p, 1 − p = q.
The probability that we obtain at the first k drawings white balls and at the
(k + 1)-st drawing a red one is

P(A_k) = p q^k   (k = 0, 1, ...).   (7)

From this

\sum_{k=0}^{\infty} P(A_k) = p \sum_{k=0}^{\infty} q^k = 1.

Hence the probability of Ω' is 1 and thus P(Ω − Ω') = 0. Though it is, in
principle, possible that Ω − Ω' occurs, this possibility can be neglected in
practice. Hence the system {A_k} of events is, in a wider sense of the word,
complete.
The distribution pq^k (k = 0, 1, ...) is often called the geometric distribution,
since the sum of the terms pq^k is a geometric series. We shall see
later on that this distribution belongs to a larger class of distributions,
namely to the class of negative binomial distributions.
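An added simulation sketch: drawing with replacement until the first red ball reproduces the geometric probabilities pq^k (the value of p is illustrative only).

```python
import random

p = 0.3
TRIALS = 200_000
counts = {}

for _ in range(TRIALS):
    k = 0
    while random.random() >= p:   # a white ball is drawn
        k += 1
    counts[k] = counts.get(k, 0) + 1

for k in range(6):
    est = counts.get(k, 0) / TRIALS
    print(k, round(est, 4), round(p * (1 - p)**k, 4))   # estimate vs pq^k
```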
Examples 1-6 show finite probability algebras; the construction of these
probability algebras causes scarcely any difficulty at all. As Example 7
deals with an infinite probability algebra, this deserves to be examined somewhat
more thoroughly. This example deals with the infinite repetition of a
simple alternative; the elementary events are thus infinite sequences consisting
of the events A and Ā. It is readily seen that the set of the sequences of
this type has cardinal number of the continuum. Indeed, we associate to
every sequence

E_1, E_2, ..., E_n, ...,

where the meaning of E_n can be only A or Ā, the number x having the binary
expansion x = 0.ε_1 ε_2 ... ε_n ..., where

ε_n = 1 if E_n = A,   ε_n = 0 if E_n = Ā.

Thus the set of the elementary events of the sequence of experiments is
mapped onto the interval [0, 1]. This mapping is one-to-one, with the exception
of the binary rational numbers x = k/2^l (k and l are nonnegative integers).
Now we construct the probability space corresponding to this system. First
of all let 𝒜_0 denote the set of events obtained by prescribing the outcome
of a finite number of experiments, assuming nothing about the remaining
experiments. Let i_1 < i_2 < ... < i_k denote the indices of the experiments
where the occurrence of A is prescribed and j_1 < j_2 < ... < j_l similarly for
the occurrence of Ā. Let C denote the event so defined; then we have

P(C) = p^k q^l.

From this we can compute P(C) for every C ∈ 𝒜_0. Clearly [Ω, 𝒜_0, P] is a probability
algebra, but 𝒜_0 is not a σ-algebra. But if we consider the least σ-algebra
𝒜 containing 𝒜_0 and extend the set function P(C) defined over 𝒜_0
(readily seen to be a measure on 𝒜_0) to 𝒜, then we obtain the Kolmogorov
probability space sought for (cf. Ch. II, § 6). In order to prove that P(C)
is a measure on 𝒜_0, let us consider the above mapping of the sample space
onto the interval [0, 1]; let the interval [0, 1] be denoted by Ω*. There corresponds
to the algebra of sets 𝒜_0 the class 𝒜* of the subsets of Ω* consisting
of a finite number of pairwise disjoint intervals with binary rational endpoints.
Just like in Chapter II, § 7 there can be given a function F(x) so that
the probability belonging to the interval [a, b) be equal to F(b) − F(a).
Indeed, if the interval [a, b) is of the form [m/2^n, (m + 1)/2^n) (m being odd) and

a = \frac{m}{2^n} = \frac{1}{2^{i_1}} + \frac{1}{2^{i_2}} + ... + \frac{1}{2^{i_k}}   (i_1 < i_2 < ... < i_k = n),

then we put

F(b) − F(a) = p^k q^{n-k}.

From this F(x) can be determined at every binary rational point x. Thus
for instance

F(0) = 0,   F(1) = 1,   F(1/2) = q,   F(1/4) = q²,
F(3/4) = q + pq,   F(1/8) = q³,   F(3/8) = q² + pq²,
F(5/8) = q + pq²,   F(7/8) = q + pq + p²q,   etc.

In general, if

x = \sum_{k=1}^{r} \frac{1}{2^{i_k}}   (i_1 < i_2 < ... < i_r),

then

F(x) = \sum_{k=1}^{r} p^{k-1} q^{i_k − k + 1}.

It is easy to see that F(x) is an increasing continuous function and F(0) = 0,
F(1) = 1. Hence the result of Chapter II, § 7 can be applied here. The
extension of 𝒜* is in this case the collection of all Borel-measurable
subsets of Ω*. Especially, if p = q = 1/2, then F(x) = x and the measure
P* is the Lebesgue measure. The fact that in an infinite sequence of experiments
the probability of “obtaining except for a finite number of experiments
always the same event” is zero corresponds to the fact that in the
binary expansion of almost every number both digits occur infinitely often.
This is a special case of the well-known theorem of Borel, which will be discussed
later on. The above construction of the probability space [Ω, 𝒜, P]
is a special case of a general theorem of Kolmogorov (the so-called fundamental
theorem of Kolmogorov) which will be proved later.
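The construction of F(x) can be made concrete with a short added sketch: given the positions i_1 < i_2 < ... < i_r of the binary digits 1 of x, the last formula above is evaluated term by term; for p = 1/2 it returns x itself, in accordance with the remark above on the Lebesgue measure.

```python
def F(positions, p):
    """F(x) for x = sum of 2**(-i) over the increasing list of positions."""
    q = 1 - p
    return sum(p**(k - 1) * q**(i - k + 1)
               for k, i in enumerate(positions, start=1))

# x = 7/8 = 1/2 + 1/4 + 1/8,  so  F(7/8) = q + pq + p^2 q
print(F([1, 2, 3], p=0.3))
print(F([1, 2, 3], p=0.5), 7 / 8)   # with p = 1/2, F(x) = x
```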
8. The negative binomial distribution can be obtained as a generalization
of the preceding problem. Consider an experiment having two possible outcomes
A and Ā and let the probability of the event A be p, that of Ā be
q = 1 − p. Let A_k^{(r)} denote the event that during independent repetitions
of the experiment the event A occurred for the r-th time (r ≥ 1) at the
(r + k)-th experiment. We obtain by a simple combinatorial consideration
that

P(A_k^{(r)}) = \binom{k + r − 1}{r − 1} p^r q^k   (k = 0, 1, ...).   (8)

The events A_k^{(r)} (k = 0, 1, ...) form a complete system of events in a wider
sense. Since the events A_k^{(r)} (k = 0, 1, ...) are pairwise disjoint, it is enough
to show that \sum_{k=0}^{\infty} P(A_k^{(r)}) = 1. This follows from (8) in the following manner:

\sum_{k=0}^{\infty} \binom{k + r − 1}{r − 1} p^r q^k = p^r \sum_{k=0}^{\infty} \binom{−r}{k} (−q)^k = \left(\frac{p}{1 − q}\right)^r = 1.

The distribution (8) will be called the negative binomial distribution of r-th
order, since the probabilities in question can be obtained as terms of the
binomial series (for a negative exponent) of the expression p^r (1 − q)^{−r}.
Since the events A_k^{(r)} (k = 0, 1, ...) form a complete system of events, the
probability that the number of occurrences of an event in infinite repetitions
of an alternative remains bounded has the value zero.
Indeed, if C_n denotes the event that in the infinite sequence of experiments
A occurred exactly n times, then, as proved above, P(C_n) = 0. If
therefore C denotes the event that A occurs in the infinite sequence of events
only a finite number of times, then C = \sum_{n=0}^{\infty} C_n, and

P(C) = \sum_{n=0}^{\infty} P(C_n) = 0.

Thus the event A occurs infinitely many times with the probability 1.¹
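An added numerical sketch of the probabilities (8), checking that the truncated sum is nearly 1 (parameters are illustrative):

```python
from math import comb

def neg_binomial_pmf(r, p, kmax):
    """P(A_k^(r)) = C(k+r-1, r-1) p^r q^k for k = 0, ..., kmax."""
    q = 1 - p
    return [comb(k + r - 1, r - 1) * p**r * q**k for k in range(kmax + 1)]

probs = neg_binomial_pmf(r=3, p=0.4, kmax=200)
print(probs[:5])
print(sum(probs))   # tends to 1 as kmax grows
```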
9. Consider the following problem: let an urn contain M red and N − M
white balls. Draw a ball at random, replace the drawn ball and at the same
time place into the urn R extra balls of the same colour as the one drawn.²
Then we draw again a ball, and so on. What is the probability of the event
that in n drawings we obtain exactly k times a red ball? Let this event be
denoted by A_k. Of course we assume that at every drawing each ball of the
urn is selected with the same probability. We compute first the probability
that we obtain at each of the first k drawings a red ball, and white balls at
the remaining n − k drawings. Clearly this probability is

\frac{\prod_{j=0}^{k-1} (M + jR) \prod_{h=0}^{n-k-1} (N − M + hR)}{\prod_{l=0}^{n-1} (N + lR)}.   (9)

From this it follows easily that

P(A_k) = \binom{n}{k} \frac{\prod_{j=0}^{k-1} (M + jR) \prod_{h=0}^{n-k-1} (N − M + hR)}{\prod_{l=0}^{n-1} (N + lR)}.   (10)

Distribution (10) is called the Pólya distribution.
If R = 0 and M/N = p, then we obtain from (10) the binomial distribution
as a particular case. If R = −1, we get as a particular case from (10) the
hypergeometric distribution.
We can also compute the probability that we obtain at the (k + 1)-th
drawing the first red ball. Obviously, this probability is

\frac{M}{N + kR} \prod_{j=0}^{k-1} \frac{N − M + jR}{N + jR}.   (11)

In the cases of R = 0 and R = −1 we have the particular cases already
dealt with.

¹ Later on we shall prove more: let k_n denote the number of occurrences of A in
the first n experiments; then not only lim_{n→∞} k_n = +∞ with probability 1, but more
precisely lim_{n→∞} k_n/n = p with probability 1.
² R can be negative as well. In case of negative R we remove from the urn R balls
of the same colour as the one drawn.
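Formula (10) translates directly into code. The added sketch below (with illustrative parameters) also checks the special cases R = 0 (binomial) and R = −1 (hypergeometric):

```python
from math import comb, prod

def polya_pmf(M, N, R, n):
    """P(A_k) from formula (10); empty products equal 1."""
    return [comb(n, k)
            * prod(M + j * R for j in range(k))
            * prod(N - M + h * R for h in range(n - k))
            / prod(N + l * R for l in range(n))
            for k in range(n + 1)]

print(sum(polya_pmf(M=3, N=10, R=2, n=5)))   # 1.0
print(polya_pmf(M=3, N=10, R=0, n=5))        # binomial with p = 0.3
print(polya_pmf(M=3, N=10, R=-1, n=5))       # hypergeometric
```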

§ 4. The concept of a random variable

So far we have only considered whether a random event does or does not
occur. Qualitative statements like this are often insufficient and quantitative
investigations are necessary. In other words, for the description of random
mass phenomena one needs numerical data. These numerical data are not
constant, they show random fluctuations. Thus for instance the result of
a throw in dicing is such a random number. Another example is the number
of calls arriving at a telephone exchange during a given time-interval, or
the number of disintegrating atoms of a radioactive substance during a
given time-interval.
In order to characterize a random quantity we have to know its possible
values and the probabilities of these values. Such random quantities are

called random variables. In the present Chapter we shall discuss only random
variables having a countable set of values; these are called discrete random
variables. The random variables figuring in the above examples were all
of the discrete type. The life time of a radioactive atom is for instance also
a random variable but it is not a discrete one. General (not discrete) random
variables will be dealt with in the following Chapters. In what follows ran­
dom variables will be denoted by the letters of the Greek alphabet.
Let A be an arbitrary event. Let the random variable ξ_A be defined in
the following way:

ξ_A = 1 if A occurs,   ξ_A = 0 otherwise (i.e. if Ā occurs).

Obviously, the value of ξ_A depends on chance; further we have

P(ξ_A = 1) = P(A),

and similarly

P(ξ_A = 0) = P(Ā) = 1 − P(A).

A random variable ξ_A associated in this way to the event A is called the
indicator of A. Conversely, we can assign to every random variable ξ assuming
only two values, a and b, the event A that ξ = a (where the event Ā
means that ξ = b).
Starting from this trivial remark, we can make a further step forward.
If x_1, x_2, ... are the different possible values of the random variable ξ
(i.e. the set of the possible values is finite or denumerable), a complete system
of events {A_n} can be associated to it. Indeed, let A_n denote the event
that ξ = x_n; then clearly A_n A_m = ∅ if n ≠ m and \sum_{n=1}^{\infty} A_n = Ω, hence
\sum_{n=1}^{\infty} P(A_n) = 1. Conversely, there can be assigned (in several ways) to every complete
system of events {A_n} a random variable ξ such that in case of the occurrence
of A_n the value of ξ should depend on the index n only. ξ can for instance
be defined in the following manner:

ξ = n if A_n occurs (n = 1, 2, ...).

The value n may be replaced by f(n), where f(n) is any function defined for
the positive integers, for which f(n) ≠ f(m) if n ≠ m. Thus we can see that
a complete system of events can be assigned to every discrete random variable
in a unique manner, while there can be assigned infinitely many different
random variables to a complete system of events.
We shall deal in this Chapter with random variables assuming only real
values. It must be said that probability theory deals also with random variables
whose range does not consist of real numbers but for instance of
n-dimensional vectors. There are also random variables whose values are
not vectors of a finite dimension but infinite sequences of numbers or functions,
etc. Later on, we shall also examine such cases.
Now let us see how the notion of a random variable is dealt with in the
general theory of probability.
In Chapter II we were made familiar with Kolmogorov’s foundation of
probability theory. We started from a set Ω, the set of elementary events,
and a σ-algebra 𝒜 consisting of subsets of Ω. Here 𝒜 consists of all events
coming into our considerations. Further there was given a nonnegative
σ-additive set function P defined on 𝒜 such that P(Ω) = 1. The value
P(A) of this function for the set A defines the probability of the event A. Naturally,
we understand by a random variable a quantity depending on which
one of the elementary events in question occurs. A random variable is therefore
a function ξ = ξ(ω) assigning to every element ω of the set Ω (i.e., to
every elementary event) a numerical value.
What kind of restrictions are to be prescribed for such a function? If we
have a probability field where every subset of Ω corresponds to an event,
no restriction is necessary at all. But if this is not the case, then the definition
of a random variable calls for certain restrictions.
Since we consider in this Chapter discrete random variables only, we
confine ourselves (for the present) to the following definition:
Let [Ω, 𝒜, P] be a Kolmogorov probability space. A function ξ = ξ(ω)
defined on Ω with a countable set of values is said to be a discrete random
variable, if the set for which ξ(ω) takes on a fixed value x belongs to 𝒜 for
every choice of this fixed value x.
Let x_1, x_2, ... denote the different possible values of the random variable
ξ = ξ(ω) and A_n the set of the elementary events ω ∈ Ω for which ξ(ω) = x_n;
then A_n must belong to the algebra of sets 𝒜 for every n. Only in this case
is the probability

P(ξ = x_n) = P(A_n)

defined.
A complete system of events associated with a discrete random variable
thus consists of those subsets of the space of events for which the random
variable takes on the same value. Especially, if ξ_A = ξ_A(ω) is the indicator
of the event A, then ξ_A(ω) is a random variable having the value 1 or 0
according as ω does or does not belong to the set A.
The sequence of probabilities of a complete system of events is said to be
a probability distribution. Now that we have introduced the concept of
random variable, this probability distribution can be considered as the set
of all probabilities corresponding to the different values taken on by a random
variable. If for instance an experiment having the possible outcomes
A and Ā is independently repeated n times, then the number ξ of the experiments
showing the occurrence of the event A is a random variable with the
binomial distribution, i.e.

P(ξ = k) = \binom{n}{k} p^k q^{n-k}   (k = 0, 1, ..., n),

where p = P(A) and q = 1 − p.


Let ξ be a random variable and g(x) an arbitrary real valued function of
the real variable x. Then η = g(ξ) is also a random variable. Let further
ξ_1, ξ_2, ..., ξ_r be random variables and let g(x_1, x_2, ..., x_r) be an arbitrary
function of the r real variables x_1, x_2, ..., x_r; then η = g(ξ_1, ξ_2, ..., ξ_r) is a
random variable as well.
The distribution is the most important concept for the characterization
of a random variable; it does not, however, characterize a random variable
completely. If for instance we know the distributions of the random variables
ξ and η, it is not, in general, possible to determine from this alone
the distribution of the random variable ζ = g(ξ, η). In order to do this we
have to know the joint distribution of the random variables ξ and η, that is
the probabilities P(ξ = x_n, η = y_m). But if ξ and η are given as functions
of the elementary event ω ∈ Ω, the joint distribution of ξ and η is herewith
also given, and the distribution of any function ζ = g(ξ, η) as well.
Let the possible values of the random variable ξ be x_1, x_2, ... and let A
be an arbitrary event of positive probability. We define the conditional
distribution of the random variable ξ with respect to the condition A by
the sequence of numbers

P(ξ = x_n | A)   (n = 1, 2, ...).

We introduce further the notion of the distribution function of a random
variable. If ξ is a random variable, then the function F(x) defined by

F(x) = P(ξ < x)

for every real x is said to be the distribution function¹ of ξ. Here P(ξ < x)
stands for the probability of the event that the value of ξ is less than x; this
event can be represented as the set A_x of the elements ω ∈ Ω for which
ξ(ω) < x. If the discrete random variable ξ takes on the values x_n with the
probabilities p_n = P(ξ = x_n) (n = 1, 2, ...), then we have, clearly,

F(x) = \sum_{x_k < x} p_k,

where the sum is to be extended over all values of k such that x_k < x.

¹ Called also cumulative distribution function.
The definition F(x) = P(ξ ≤ x) is also customary. This induces only minor
modifications in its properties; e.g. this function is continuous from the right, while
P(ξ < x) is continuous from the left.

If, for instance, the distribution of ξ is a binomial distribution of order n
and parameter p, then

F(x) = P(ξ < x) = \sum_{k < x} \binom{n}{k} p^k q^{n-k}.   (1)

Sometimes, an integral form of this distribution function is used. For this
purpose we define the incomplete (and the complete) beta integral of Euler. If
α > 0, β > 0, 0 ≤ x ≤ 1, put

B(α, β, x) = \int_0^x t^{α-1} (1 − t)^{β-1} dt.   (2)

B(α, β, x) is called Euler’s incomplete beta integral of order (α, β). It is well
known that

B(α, β, 1) = B(α, β) = \frac{Γ(α) Γ(β)}{Γ(α + β)},   (3)

where

Γ(α) = \int_0^{\infty} x^{α-1} e^{-x} dx   (α > 0)

is the so-called gamma function. B(α, β) is called Euler’s complete beta integral
of order (α, β). It is readily verified through integration by parts that

\sum_{k=0}^{r} \binom{n}{k} p^k q^{n-k} = (n − r) \binom{n}{r} \int_p^1 t^r (1 − t)^{n-r-1} dt = 1 − \frac{B(r + 1, n − r, p)}{B(r + 1, n − r)},   (4)

hence

F(x) = 1 − \frac{B(r + 1, n − r, p)}{B(r + 1, n − r)}   if   r < x ≤ r + 1   (r = 0, 1, ..., n − 1).   (5)
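Identity (4) can be verified numerically. The sketch below is an addition; it assumes scipy is available, whose scipy.special.betainc computes the regularized incomplete beta integral B(α, β, x)/B(α, β):

```python
from math import comb
from scipy.special import betainc   # regularized: B(a, b, x) / B(a, b)

n, p, r = 12, 0.3, 4
q = 1 - p

lhs = sum(comb(n, k) * p**k * q**(n - k) for k in range(r + 1))
rhs = 1 - betainc(r + 1, n - r, p)

print(lhs, rhs)   # the two values agree
```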

The distribution function F(x) of an arbitrary (not necessarily discrete)
random variable ξ(ω) exists iff the set A_x defined above belongs for every
real x to 𝒜. This will always be assumed in the following Chapters
during the study of general random variables. In the case of discrete random
variables, however, this follows from the assumption that for every possible
value x_n the set A_n of elements ω ∈ Ω such that ξ(ω) = x_n belongs to 𝒜.
The distribution function F(x) is always nondecreasing; further lim_{x→−∞} F(x) = 0
and lim_{x→+∞} F(x) = 1.

§ 5. The independence of random variables

It is obvious to call two random variables independent if the complete
systems of events belonging to them are independent. This definition corresponds
to the natural requirement that two random variables should be
considered as independent if the fact that one of them takes on a definite
value has no influence on the random fluctuations of the other. Let ω denote
any element of the space of events Ω; let ξ = ξ(ω) be the first, η = η(ω)
the other random variable; let further A_n be the set of all ω-s for which
ξ(ω) = x_n (n = 1, 2, ...) and B_m of those for which η(ω) = y_m (m = 1, 2, ...).
ξ and η are said to be independent, if

P(A_n B_m) = P(A_n) P(B_m)   (1)

for every n and for every m; that is, if the complete systems of events {A_n}
and {B_m} are independent. Or in a different notation: ξ and η are called
independent, if

P(ξ = x_n, η = y_m) = P(ξ = x_n) P(η = y_m)   (1')

for every n and m. Hence in case of two independent random variables the
joint distribution of ξ and η is, according to (1'), determined by the distributions
of ξ and η.
This definition can be generalized to the case of several random variables.
The discrete random variables ξ_1, ξ_2, ..., ξ_r are said to be (completely)
independent, if for every system of values x_{k_1}, x_{k_2}, ..., x_{k_r} the relation

P(ξ_1 = x_{k_1}, ..., ξ_r = x_{k_r}) = \prod_{j=1}^{r} P(ξ_j = x_{k_j})   (2)

holds. It is easy to see that any s < r out of r independent random variables
are also independent. The independence of the random variables ξ_1, ξ_2, ..., ξ_s
(s < r) can be verified by summing Formula (2) over all possible values
x_{k_{s+1}}, ..., x_{k_r}.
The converse of the statement is not true; from the pairwise independence
of ξ_1, ξ_2, ξ_3 their complete independence does not follow. Let the random
variables ξ_1, ξ_2, ξ_3 be the indicators of the events A_1, A_2, A_3; then relation
(2) expresses just the complete independence of the events A_1, A_2, A_3. Since,
as we know already, the complete independence of three events does
not follow from their pairwise independence, the same holds for random
variables, too.
A constant is, clearly, independent of any random variable. Indeed, if
η = c (c constant) and ξ is an arbitrary random variable, then

P(ξ = x_k, η = c) = P(ξ = x_k) = P(ξ = x_k) P(η = c),

since the set defined by η = c is the entire space Ω.

Next we prove the following theorem:

T h e o r e m 1. Let 1Ъ . . £r be independent discrete random variables


and gfx), g 2( x ) ,. .., gr(x) arbitrary functions o f a real variable x. Then the
random variables rji — gfAi), = 9 AAA are independent as well.

P r o o f . The proof will be given in detail only for r = 2, for r > 2 the
procedure is essentially the same.
Let {xjk} be the sequence of the possible values of the random variable
(J = 1, 2) and {A k) the complete system of events belonging to the ran­
dom variable £ ; Ajk is thus the set of those elementary events со £ Q for
which f(o>) = xjk.
If yjt is one of the possible values of the random variable rjj = g fA if
then the set Bjt defined by g/Aj) = Уд can obviously be obtained as the
union of finitely or denumerably many sets Ajk; Bn is equal to the union of
the sets Ajk whose indices satisfy the equation g f x jk) = yjt.
Since the complete systems of events {Ay,} and {A2k} are independent,
the sum of an arbitrary subsequence of the sets A lk is independent of the
sum of an arbitrary subsequence of the sets A 2k. From this our assertion
follows.
We give a reformulation of the above theorem which we shall need later
on. Let ξ(ω) be a discrete random variable with possible values x_1, x_2, ...,
x_n, ... and let A_n denote the set of those elementary events ω for which
ξ(ω) = x_n. Let further 𝒜_ξ be the least σ-algebra containing the sets A_n;
𝒜_ξ is called the σ-algebra generated by ξ. Clearly, 𝒜_ξ consists of the sets
obtained as the union of finitely or denumerably many of the sets A_n.
Obviously 𝒜_ξ ⊆ 𝒜. If ξ_1, ξ_2, ..., ξ_r are independent random variables,
𝒜_{ξ_1}, ..., 𝒜_{ξ_r} are the σ-algebras generated by ξ_1, ξ_2, ..., ξ_r, and B_j
is an arbitrary element of 𝒜_{ξ_j} (j = 1, 2, ..., r), then the events B_1, B_2, ..., B_r
are independent.

§ 6. Convolutions of discrete random variables

Let ξ and η be two random variables with possible values x_n and y_m
(n, m = 1, 2, ...), respectively. Let the distributions of ξ and η be

P(ξ = x_n) = p_n and P(η = y_m) = q_m   (n, m = 1, 2, ...).

If g(x, y) is any real valued function of two real variables, then, as mentioned
above, ζ = g(ξ, η) is a random variable.

Let us determine the distribution of the random variable ζ. For every
real number z

P(ζ = z) = \sum_{g(x_n, y_m) = z} P(ξ = x_n, η = y_m).   (1)

The sum is here extended over those pairs (n, m) for which g(x_n, y_m) = z.
If such pairs do not exist, the sum on the right hand side of (1) is zero.
In order to compute P(ζ = z) we have to know, therefore, in general,
the joint distribution of ξ and η. If ξ and η are independent, then
P(ξ = x_n, η = y_m) = P(ξ = x_n) P(η = y_m) and thus

P(ζ = z) = \sum_{g(x_n, y_m) = z} p_n q_m.   (1')

Let us consider now the important special case when ξ and η are independent
and g(x, y) = x + y, hence ζ = ξ + η. Then

P(ζ = z) = \sum_{x_n + y_m = z} p_n q_m.   (1'')

If ξ and η assume only integer values and p_n = P(ξ = n), q_m = P(η = m)
(n, m = 0, ±1, ±2, ...), then

P(ζ = k) = \sum_{j=-\infty}^{\infty} p_j q_{k-j}   (k = 0, ±1, ±2, ...).   (2)

If ξ and η have only nonnegative integer values, then

P(ζ = k) = \sum_{j=0}^{k} p_j q_{k-j}.   (3)

The distribution of ζ = ξ + η is called the convolution of the distributions
of ξ and η. In what follows we shall compute the convolution of some discrete
distributions.
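For nonnegative integer-valued variables, formula (3) is a finite sum and is easily coded. The following added sketch convolves two distributions given as lists (element j holding P(ξ = j)) and reproduces the binomial result (5) derived next:

```python
from math import comb

def convolve(p, q):
    """Distribution of xi + eta for independent nonnegative integer-valued
    xi, eta; p[j] = P(xi = j), q[i] = P(eta = i)."""
    r = [0.0] * (len(p) + len(q) - 1)
    for j, pj in enumerate(p):
        for i, qi in enumerate(q):
            r[j + i] += pj * qi
    return r

def binom(n, p):
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

print(convolve(binom(2, 0.3), binom(3, 0.3)))
print(binom(5, 0.3))   # the same distribution
```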
Let ξ and η be independent random variables having binomial distributions
of order n_1 and n_2, respectively, with the same parameter p:

P(ξ = k) = \binom{n_1}{k} p^k q^{n_1 - k}   (k = 0, 1, ..., n_1),

P(η = l) = \binom{n_2}{l} p^l q^{n_2 - l}   (l = 0, 1, ..., n_2),

where q = 1 − p. If ζ = ξ + η, then

P(ζ = k) = \left[\sum_{j=0}^{k} \binom{n_1}{j} \binom{n_2}{k − j}\right] p^k q^{n_1 + n_2 - k}.   (4)

By the well-known identity

\sum_{j=0}^{k} \binom{n_1}{j} \binom{n_2}{k − j} = \binom{n_1 + n_2}{k}

it follows from (4) that

P(ζ = k) = \binom{n}{k} p^k q^{n-k}   (k = 0, 1, ..., n),   (5)

where n = n_1 + n_2.
Hence the random variable ζ has a binomial distribution too. This result
can also be obtained without any computation as follows: Consider an experiment
with the possible outcomes A and Ā; let P(A) = p. In the above
example ξ resp. η is equal to the number of occurrences of A in the course
of n_1 resp. n_2 independent repetitions of the experiment. The assertion that
ξ and η are independent means that we have two independent sequences of
events. Perform a total number of n = n_1 + n_2 independent experiments;
then ζ = ξ + η means the number of the occurrences of A in this sequence
of experiments; hence ζ is a random variable having a binomial distribution
of order n and parameter p; that is, Formula (5) is valid.
We encounter a practical application of this result when estimating the
percentage of defective items. Consider a sampling with replacement from
the population investigated. According to the above, this can be done also
by subdividing the whole population into two parts having the same percentage
of defective items and selecting from one part a sample of n_1 elements
and from the other one a sample of n_2 elements. This estimating procedure
is equivalent to that which consists of the choice of a sample of
n = n_1 + n_2 elements from the whole population.
It is to be noted here that the distribution of the sum of two independent
random variables with hypergeometric distributions does not have a hypergeometric
distribution. Hence the former assertion is not valid if the sampling
is done without replacement. The difference is, however, negligible in
practice, if the number of elements of the population is large with respect
to that of the sample.

§ 7. Expectation of a discrete random variable

The random fluctuations of a random variable are described by its distribution
function. In practice, however, it is often necessary to characterize
a distribution by a small number of data. The most important and simplest
one of such data is the expectation defined below (first for discrete distributions
only).
Let the possible values of the random variable ξ be x_1, x_2, ... with corresponding
probabilities p_n = P(ξ = x_n) (n = 1, 2, ...). Perform N independent
observations of ξ; if N is a large number, then, according to the meaning of
probability, at approximately Np_1 occasions we shall have ξ = x_1, at approximately
Np_2 occasions ξ = x_2, and so on. Taking the arithmetic mean of the
ξ-values obtained at the N observations, we obtain approximately the value

\frac{N p_1 x_1 + N p_2 x_2 + ...}{N} = \sum_k p_k x_k;

this is the value about which the arithmetic mean of the observed values of
ξ fluctuates. Hence we define the expectation E(ξ) of the discrete random
variable ξ by the formula

E(ξ) = \sum_k p_k x_k.   (1)

Obviously, E(ξ) is the weighted arithmetic mean of the values x_k with weights
p_k.¹ In order that the definition be meaningful we have to assume
the absolute convergence of the series figuring on the right side of (1); otherwise
a rearrangement of the x_k values would give different values
for the expectation.

¹ Hence E(ξ) lies always between the lower and the upper limit of the possible
values of ξ.
If ξ can take on infinitely many values, then E(ξ) does not always exist.
E.g. if

P(ξ = 2^k) = \frac{1}{2^k}   (k = 1, 2, ...),

then the series \sum_k p_k x_k is divergent. Clearly, the expectation of discrete and
bounded random variables always exists.
Sometimes, instead of “expectation”, the expressions “mean value” or
“average” are used. But they may lead to confusion with the average of the
observed values. In order to discriminate the observed mean from the
number about which the observed mean fluctuates we always call the latter
“expectation”.
Obviously, the expectation E(ξ) depends only on the distribution of ξ;
hence if ξ_1 and ξ_2 are two discrete random variables having the same distribution,
then E(ξ_1) = E(ξ_2). Therefore E(ξ) can also be called the expectation
of the distribution of ξ. The fluctuation about E(ξ) of the averages
formed from the observed values of ξ is described more precisely by the laws
of large numbers, which we shall discuss later on. Here we mention only
that the average of the observed values of ξ and the expectation E(ξ) are
essentially in the same relationship as the relative frequency and the probability
of an event. This will be readily seen if we consider the indicator ξ_A
of an event A having the probability p; indeed, E(ξ_A) = p · 1 + (1 − p) · 0 = p,
and the average of the observed values of ξ_A is equal to the relative
frequency of the event A.
Next we compute the expectations of some important distributions.
1. The expectation of the binomial distribution. The random variable ξ
has a binomial distribution if it assumes the values k = 0, 1, ..., n with
probabilities

P(ξ = k) = \binom{n}{k} p^k q^{n-k},

where q = 1 − p and 0 < p < 1. Hence according to (1)

E(ξ) = \sum_{k=0}^{n} k \binom{n}{k} p^k q^{n-k} = np \sum_{r=0}^{n-1} \binom{n − 1}{r} p^r q^{(n-1)-r} = np.
Example. The number of atoms disintegrating during a time interval t out
of N atoms of a radioactive substance has a binomial distribution; indeed,
the probability that in the given time interval exactly k atoms will disintegrate
is equal to \binom{N}{k} p^k q^{N-k}, where N means the number of atoms present
at the beginning of the time interval and p = 1 − e^{−λt} (λ is the disintegration
constant). Hence the expected value of the number of atoms disintegrating
during the time interval t is given by N(1 − e^{−λt}); thus the expected number of
nondisintegrated atoms is N e^{−λt}. This exponential law of radioactivity does
not state, as it is sometimes erroneously suggested, that the number of
the nondisintegrated atoms is an exponentially decreasing function of the
time; on the contrary, it only states that the number of nondisintegrated
atoms has an expectation which is an exponentially decreasing function of
the time.
2. The expectation of the negative binomial distribution. The random
variable ξ has a negative binomial distribution if its possible values are
r + k (k = 0, 1, ...) and if it takes on these values with probabilities

P(ξ = r + k) = \binom{k + r − 1}{r − 1} p^r q^k   (k = 0, 1, ...),

where 0 < p < 1, q = 1 − p. Because of (1) we have

E(ξ) = \sum_{k=0}^{\infty} (r + k) \binom{k + r − 1}{r − 1} p^r q^k = \frac{r}{p} \sum_{k=0}^{\infty} \binom{k + r}{r} p^{r+1} q^k = \frac{r}{p}.
Example. In shooting at a target, suppose that every shot hits the target
with the probability p and the outcomes of the shots are independent of
each other. How many shots are necessary to hit the target r times?
The mathematical wording of the problem is as follows: Let the experiments
of a sequence be independent of each other. Let the experiment have
only two outcomes: A (the shot hits the target) and Ā (it does not). Let ξ
denote the serial number of the experiment at which A occurred for the r-th
time. As noted in § 3 of this Chapter, the probability that the event A occurs
in the (k + r)-th experiment for the r-th time is

\binom{k + r − 1}{r − 1} p^r q^k,

hence ξ has a negative binomial distribution of order r. Thus on the average
we need to fire r/p shots in order to get r hits.

3. The expectation of the hypergeometric distribution. The random variable
ξ has a hypergeometric distribution if it takes on the values k = 0, 1,
..., n with probabilities

P(ξ = k) = \frac{\binom{M}{k} \binom{N-M}{n-k}}{\binom{N}{n}}.

We obtain from (1) by a simple calculation that

E(ξ) = \frac{nM}{N}.

Example. The hypergeometric distribution occurs for instance in sampling
without replacement. Let p = M/N denote the fraction of defective items in
the lot examined. We want to estimate p from a sample of size n. The number
of defective items has the same expectation np as in sampling with
replacement.

§ 8. Some theorems on expectations

We shall now prove some basic theorems about expectations.



Theorem 1. If E(ξ) and E(η) exist, then E(ξ + η) exists too and

E(ξ + η) = E(ξ) + E(η).

The statement of this theorem is plausible because of the intuitive meaning
of expectation. Indeed, if the observed values of ξ are ξ_1, ξ_2, ..., ξ_n and
those of η are η_1, η_2, ..., η_n, then \frac{1}{n} \sum_{k=1}^{n} ξ_k fluctuates about the number E(ξ)
and \frac{1}{n} \sum_{k=1}^{n} η_k about the number E(η); hence \frac{1}{n} \sum_{k=1}^{n} (ξ_k + η_k) fluctuates about
the number E(ξ) + E(η); in consequence E(ξ + η) = E(ξ) + E(η). Let us now
give the proof of the theorem. Let the possible values of ξ be x_j (j = 1, 2, ...)
and those of η y_k (k = 1, 2, ...); let further A_jk denote the event that ξ = x_j
and η = y_k. Clearly, the A_jk (j, k = 1, 2, ...) form a complete system of
events. Further

\sum_j P(A_jk) = P(η = y_k)

and

\sum_k P(A_jk) = P(ξ = x_j).

On the other hand, the possible values of ξ + η are the numbers z representable
as x_j + y_k. It may happen that a number z can be represented
in more than one way in the form z = x_j + y_k; in this case

z P(ξ + η = z) = z \sum_{x_j + y_k = z} P(A_jk) = \sum_{x_j + y_k = z} (x_j + y_k) P(A_jk).

Since the sum of two absolutely convergent series is itself absolutely convergent,
we obtain that

E(ξ + η) = \sum_j \sum_k (x_j + y_k) P(A_jk) = E(ξ) + E(η),

and this is what we wished to prove.


The next theorem follows by mathematical induction from Theorem 1.

Theorem 2. If E(ξ_j) (j = 1, 2, ..., n) exist, then E(ξ_1 + ... + ξ_n) exists,
too, and

E(ξ_1 + ξ_2 + ... + ξ_n) = E(ξ_1) + E(ξ_2) + ... + E(ξ_n).

It is easy to prove the following theorem:

Theorem 3. Let c_1, c_2, ..., c_n be constants and ξ_1, ξ_2, ..., ξ_n random variables
the expectations of which exist; then

E\left(\sum_{k=1}^{n} c_k ξ_k\right) = \sum_{k=1}^{n} c_k E(ξ_k).

In other words, E is a linear operator.


It is further easy to show the following properties of the expectation:
If ξ ≥ 0, then E(ξ) ≥ 0. If |ξ| ≤ |η| and E(η) exists, then E(ξ) exists as
well.
Consider now some examples. We have proved already that a random
variable with a binomial distribution of order n has the expectation E(ξ) =
np. This can be deduced immediately from Theorem 2; indeed, ξ can be
written in the form ξ = \sum_{j=1}^{n} ξ_j, where ξ_j is the indicator of the event A at the
j-th experiment. Since E(ξ_j) = p, it follows from Theorem 2 that E(ξ) = np.
Similarly, a random variable having a negative binomial distribution of
order r can be considered as the sum of r independent random variables
each having a negative binomial distribution of the first order, with the
same parameter p. Thus it follows from Theorem 2 that the negative binomial
distribution of order r has the expectation r/p, as proved already.
Similarly, a random variable with a hypergeometric distribution can be
represented as the sum of n indicator variables whose expectation is p (cf.
the example after the theorem of total probability). These indicator
variables are not independent, but this does not affect the validity of
Theorem 2.
Theorem 4. If η = ξ − E(ξ), then E(η) = 0.

Proof. According to the additivity,

E(η) = E(ξ) − E(E(ξ)).

Since the expectation of a constant is obviously the constant itself, we have
E(E(ξ)) = E(ξ), and our statement follows.

Theorem 5. If ξ and η are discrete random variables such that the expectations
E(ξ²) and E(η²) exist, then E(ξη) exists as well and

|E(ξη)| ≤ \sqrt{E(ξ²) E(η²)}.   (1)

Note. Essentially, the inequality (1) is Schwarz’s inequality known from
analysis.

Proof. Consider the random variable

ζ_λ = (ξ − λη)²,

where λ is a real parameter. Since 0 ≤ ζ_λ ≤ 2ξ² + 2λ²η², E(ζ_λ) exists.
Because of Theorem 3 we have

E(ζ_λ) = E(ξ²) − 2λE(ξη) + λ²E(η²).   (2)

Since ζ_λ ≥ 0 we have E(ζ_λ) ≥ 0 for every real λ; therefore the polynomial
(2) in λ of degree 2 is nonnegative. But, as is well known, this is only possible
if (1) holds, which is what we wished to prove.
Let ξ be a discrete random variable and A an event having positive probability.
The conditional expectation of ξ with respect to the condition A is
defined by the formula

E(ξ | A) = \sum_k P(ξ = x_k | A) x_k,   (3)

provided that the series on the right side is absolutely convergent (which is
always fulfilled if E(ξ) exists), where x_k (k = 1, 2, ...) denote the possible
values of ξ. E(ξ | A) is therefore the expectation of the conditional distribution
of ξ with respect to the condition A. If the events A_n (n = 1, 2, ...)
form a complete system of events, then in view of the theorem of total probability

E(ξ) = \sum_k P(ξ = x_k) x_k = \sum_k \sum_n P(ξ = x_k | A_n) P(A_n) x_k = \sum_n P(A_n) E(ξ | A_n).
Thus we proved the following theorem:

Theorem 6. If A_n (n = 1, 2, ...) is a complete system of events and ξ is
a discrete random variable, then

E(ξ) = \sum_n P(A_n) E(ξ | A_n),   (4)

provided that E(ξ) exists.

Particularly, if ξ_B is the indicator of the event B, then E(ξ_B) = P(B),
E(ξ_B | A) = P(B | A), and we obtain the theorem of total probability as a
special case of Theorem 6. Hence Theorem 6 is usually called the theorem
of total expectation.
Theorem 6 may also be interpreted in the following manner: The conditional
expectation E(ξ | A_n) can be considered as a random variable which
takes on the value E(ξ | A_n) if the event A_n (n = 1, 2, ...) occurs. According
to this interpretation the right side of (4) is the expectation of the discrete
random variable E(ξ | A_n). Let η be a random variable whose value depends
on which of the events A_n actually occurs: e.g. put η = n if A_n
occurs (n = 1, 2, ...). Since now E(ξ | η) can be written instead of E(ξ | A_n),
we have, according to the statement of Theorem 6,

E(E(ξ | η)) = E(ξ).   (5)
This relation will be used later on.
Example. Formula (5) can also be used to compute the expectation of the
sum of a random number of random variables. Let ξ_1, ξ_2, ... be independent
random variables and let ν be a random variable independent of ξ_n
(n = 1, 2, ...) and taking on the values 1, 2, ... with probabilities
q_1, q_2, .... Consider the random variable

ζ = ξ_1 + ξ_2 + ... + ξ_ν,

which is the sum of a random number of random variables. It follows from
(5) that

E(ζ) = E(E(ζ | ν)).

If E_n is the expectation of ξ_n, then, in view of Theorem 2 and of the independence
of the random variables ξ_n and ν, we obtain that

E(ζ | ν = n) = E_1 + E_2 + ... + E_n,

hence, according to (5),

E(ζ) = \sum_{n=1}^{\infty} q_n (E_1 + E_2 + ... + E_n),

or, after rearrangement of the terms (which is admissible if the series

\sum_{n=1}^{\infty} q_n (|E_1| + |E_2| + ... + |E_n|)

converges), we have

E(ζ) = \sum_{n=1}^{\infty} E_n \left( \sum_{k=n}^{\infty} q_k \right).

In the special case where the expectations of the random variables ξ_k
are equal, i.e. E_n = E, then

E(ζ) = E \sum_{n=1}^{\infty} n q_n = E · E(ν).   (6)
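Formula (6) is easy to test by simulation. In the added sketch below, ν is geometric on {1, 2, ...} (so E(ν) = 1/p) and the ξ_k are uniform on {1, 2, 3, 4}; all parameters are illustrative only.

```python
import random

E_xi = 2.5            # expectation of each xi (uniform on 1..4)
p_nu = 0.25           # nu geometric on {1, 2, ...}, E(nu) = 1 / p_nu
TRIALS = 100_000

total = 0.0
for _ in range(TRIALS):
    nu = 1
    while random.random() >= p_nu:
        nu += 1
    total += sum(random.randint(1, 4) for _ in range(nu))

print(total / TRIALS)      # estimate of E(zeta)
print(E_xi * (1 / p_nu))   # E * E(nu) = 2.5 * 4 = 10
```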

Theorem 7. If ξ and η are independent discrete random variables and if
E(ξ) and E(η) exist, then E(ξη) exists as well and

E(ξη) = E(ξ) E(η).   (7)

Proof. Let A_jk denote the event ξ = x_j, η = y_k (j, k = 1, 2, ...). Clearly,
the possible values of ξη are the numbers which can be represented in the
form z = x_j y_k. Further z P(ξη = z) = z \sum_{x_j y_k = z} P(A_jk) = \sum_{x_j y_k = z} x_j y_k P(A_jk), hence

E(ξη) = \sum_j \sum_k x_j y_k P(A_jk).   (8)

Because of the independence of ξ and η we have P(A_jk) = P(ξ = x_j) P(η = y_k).
Thus we obtain from (8) that

E(ξη) = \left( \sum_j x_j P(ξ = x_j) \right) \left( \sum_k y_k P(η = y_k) \right) = E(ξ) E(η).

Since a series obtained as the sum of term-by-term products of two absolutely
convergent series is itself absolutely convergent, Theorem 7 is herewith proved.

§ 9. The variance

The expectation of a random variable is the value about which the random
variable fluctuates; but it does not give any information about the magnitude
of this fluctuation. If we compute the expectation of the difference
between a random variable and its expectation we obtain, as we have already
seen, always zero. This is so because the positive and negative deviations
from the expectation cancel each other. Thus it seems natural to consider
the quantity

d(ξ) = E(|ξ − E(ξ)|)   (1)

as a measure of the fluctuations. Since, however, this expression is difficult
to handle, it is the positive square root of the expectation of the random
variable (ξ − E(ξ))² which is most frequently used as a measure of the
magnitude of fluctuation. This quantity, called the standard deviation¹ of ξ,
is thus defined by the expression

D(ξ) = +\sqrt{E((ξ − E(ξ))²)}   (2)

(provided that this value is finite), and D²(ξ) is called the variance of ξ.
The choice of D(ξ) for measuring the fluctuations is advantageous from a
mathematical point of view, as it makes computations easier. The real importance
of the concept of variance is shown, however, by some basic theorems
of probability theory discussed in the following Chapters, e.g. the
central limit theorem.

¹ The letter D hints at the Latin word dispersio.

From the fact that E is a linear operator follows immediately

Theorem 1. If D(ξ) exists, then

D²(ξ) = E(ξ²) − [E(ξ)]².

This is the formula by which the standard deviation is most readily computed.
If the discrete random variable ξ assumes the values x_n (n = 1, 2, ...)
with probabilities p_n = P(ξ = x_n), then

D²(ξ) = \sum_n p_n (x_n − E(ξ))²   (3)

and, according to Theorem 1,

D²(ξ) = \sum_n p_n x_n² − \left( \sum_n p_n x_n \right)².   (4)

We obtain by a similar simple argument the somewhat more general

Theorem 2. For any real number A one has

D²(ξ) = E((ξ − A)²) − [E(ξ) − A]².

From this we obtain immediately the following theorem:

Theorem 3. For any real number A

E((ξ − A)²) ≥ D²(ξ).

The equality holds if and only if A = E(ξ).

Theorems 2 and 3 are similar (from a formal point of view even equal)
to the well-known Steiner theorem in mechanics which states that the moment
of inertia of a linear mass-distribution about an axis perpendicular to
this line is equal to the sum of the moment of inertia about the axis through
the center of gravity and the square of the distance of the axis from the
center of gravity, provided that the total mass is unity; consequently, the
moment of inertia has its minimal value if the axis passes through the center
of gravity.
Theorem 3 exhibits an important relation between the expectation and
the variance.
Theorem 2 is mostly used if the values of ξ lie near to a simple number A
but the expectation is not exactly this value. For computational reasons
it is then more convenient to calculate the value of E((ξ − A)²).

Obviously, the standard deviation D(ξ) is always nonnegative. If D(ξ) = 0,
then ξ is equal to a constant with probability 1. Indeed, because of
(ξ − E(ξ))² ≥ 0 the equality D(ξ) = 0 can only hold if P(ξ = E(ξ)) = 1;
hence ξ is a constant with probability 1.

Theorem 4. For any random variable ξ

d(ξ) ≤ D(ξ).

Proof. According to Theorem 5 of § 8,

d²(ξ) = E²(|ξ − E(ξ)| · 1) ≤ D²(ξ).

Equality can occur in other cases besides the trivial case when ξ is with
probability 1 a constant, thus e.g. if ξ takes on the values +1 and −1 with
the same probability 1/2.

Theorem 5. If η = aξ + b (a and b are constants), then

D(η) = |a| · D(ξ).

Proof. Since E(η) = aE(ξ) + b, we obtain that

D²(η) = E(a²(ξ − E(ξ))²) = a²D²(ξ).

Especially, we obtain that the standard deviation does not change if we add
a constant to the random variable ξ or multiply it by −1.
It is seen from (3) that the variance of a random variable depends on its
distribution only. Hence we can speak about the variance of a distribution.
We shall now compute the variances of certain discrete distributions, and
for the sake of comparison we determine the values of d(ξ) as well.
1. The variance of the binomial distribution. Let the distribution of the
random variable ξ be a binomial distribution of order n:

P(ξ = k) = \binom{n}{k} p^k q^{n-k}   (k = 0, 1, ..., n; 0 < p < 1; q = 1 − p).

In § 7 we have seen that the expectation of ξ is E(ξ) = np; similarly, we
obtain here that

D²(ξ) = \sum_{k=0}^{n} (k − np)² \binom{n}{k} p^k q^{n-k} = npq.   (5)

The value of d(ξ), for the sake of simplicity, will only be determined for a
binomial distribution of an even order and parameter 1/2. If n = 2N and
p = 1/2, then E(ξ) = N and thus

d(ξ) = \frac{1}{2^{2N}} \sum_{k=0}^{2N} |k − N| \binom{2N}{k} = \frac{N}{2^{2N}} \binom{2N}{N}.

By using Stirling’s formula we obtain that for N → +∞

d(ξ) = \frac{N}{2^{2N}} \binom{2N}{N} ≈ \sqrt{\frac{N}{\pi}}.

Here ≈ is the sign of asymptotic equality. If a_N and b_N (N = 1, 2, ...) are
two sequences of numbers (b_N ≠ 0), we say that the two sequences are
asymptotically equal (a_N ≈ b_N) if

\lim_{N→\infty} \frac{a_N}{b_N} = 1.

Since in the case of n = 2N, p = 1/2, according to (5) we have

D(ξ) = \sqrt{\frac{N}{2}},

it follows that

d(ξ) ≈ \sqrt{\frac{2}{\pi}} D(ξ).

Thus the quotient d(ξ)/D(ξ) tends for N → ∞ to the limit \sqrt{2/\pi}. We shall see
later on that this holds for a whole class of distributions.
2. The variance of a negative binomial distribution of the first order. Let
the distribution of the random variable ξ be a negative binomial distribution
of the first order, i.e.

P(ξ = k + 1) = p q^k   (k = 0, 1, ...),

where 0 < p < 1, q = 1 − p. We have seen in § 7 that E(ξ) = 1/p. Thus

D²(ξ) = p \sum_{k=0}^{\infty} (k + 1)² q^k − \frac{1}{p²},

and therefore

D²(ξ) = \frac{q}{p²}.

If p is small, then D(ξ) is approximately equal to E(ξ) = 1/p.
P
3. The variance of the hypergeometric distribution. Let ξ have a hypergeometric
distribution, i.e.

P(ξ = k) = \frac{\binom{M}{k} \binom{N-M}{n-k}}{\binom{N}{n}}   (k = 0, 1, ..., n).

As in the two preceding examples we obtain

D²(ξ) = n \frac{M}{N} \left(1 − \frac{M}{N}\right) \left(1 − \frac{n − 1}{N − 1}\right).

Let us introduce the notations M/N = p, q = 1 − p; then

D(ξ) = \sqrt{npq} \sqrt{1 − \frac{n − 1}{N − 1}},

hence the standard deviation of the hypergeometric distribution is somewhat
less than that of the corresponding binomial distribution; the difference
is, however, small if n is small with respect to N (cf. Example 18, § 12,
Ch. II). Random fluctuations are less for drawing from an urn without replacement
than for drawing with replacement. The quotient of the two standard
deviations tends to 1 for N → +∞, if the value of M/N = p remains
fixed and n increases more slowly than N.

§ 10. Some theorems concerning the variance

In the present paragraph we shall prove several theorems concerning the
variance, which will often be used later on.

Theorem 1. If ξ_1, ξ_2, ..., ξ_n are pairwise independent, then

D²\left(\sum_{k=1}^{n} ξ_k\right) = \sum_{k=1}^{n} D²(ξ_k).

Proof. Let E(ξ_k) = E_k; then

D²\left(\sum_{k=1}^{n} ξ_k\right) = \sum_{k=1}^{n} D²(ξ_k) + 2\sum_{j<k} E((ξ_j − E_j)(ξ_k − E_k)).

From the pairwise independence of the ξ_k and from Theorem 7 of § 8 it follows
that

E((ξ_j − E_j)(ξ_k − E_k)) = 0   if j ≠ k.

Thus we have proved our theorem.
It follows immediately, as a generalization of Theorem 1,

Theorem 2. If ξ_1, ξ_2, ..., ξ_n are pairwise independent and c_1, c_2, ..., c_n
are real constants, then

D²\left(\sum_{k=1}^{n} c_k ξ_k\right) = \sum_{k=1}^{n} c_k² D²(ξ_k).

Because of later applications, the following particular form of Theorem 1
deserves to be mentioned: If ξ_1, ξ_2, ..., ξ_n are pairwise independent random
variables having the same distribution and standard deviation D, their sum
ζ_n = ξ_1 + ξ_2 + ... + ξ_n clearly has

D(ζ_n) = D\sqrt{n}.

Let E denote the expectation of the distribution of ξ_k; then

E(ζ_n) = nE.

Hence the ratio

\frac{D(ζ_n)}{E(ζ_n)} = \frac{D}{E\sqrt{n}}

tends to zero for n → ∞, provided that E is distinct from zero. Consequences
of this are dealt with in Chapter VII. If ξ is a positive random variable, the
quotient D(ξ)/E(ξ) is called the coefficient of variation of ξ.
As an interesting consequence of Theorem 2 we mention that if ξ and η
are independent, then

D²(ξ + η) = D²(ξ − η).

Theorems 1 and 2 can be used to compute the variance of distributions.
1. The variance of the binomial distribution. If ξ_1, ξ_2, ..., ξ_n are independent
random variables assuming the value 1 with probability p and the
value 0 with probability q = 1 − p, then their sum

ζ_n = ξ_1 + ξ_2 + ... + ξ_n

is a random variable having a binomial distribution of order n. Since
D²(ξ_k) = pq, it follows from Theorem 1 that

D²(ζ_n) = npq.

Thus by applying Theorem 1 we can avoid the calculation used in § 9 for
the determination of the variance of the binomial distribution.
2. The variance of the negative binomial distribution. In the former paragraph
the variance of the negative binomial distribution of the first order
was determined. If the independent random variables ξ_1, ξ_2, ..., ξ_r have
a negative binomial distribution of the first order, i.e. if

P(ξ_j = k + 1) = p q^k   (k = 0, 1, ...; j = 1, 2, ..., r),

then, as we know already, D²(ξ_j) = q/p². Applying Theorem 1, it follows for
the negative binomially distributed random variable ξ_1 + ... + ξ_r of
order r that

D²(ξ_1 + ... + ξ_r) = \frac{rq}{p²}.

§ 11. The correlation coefficient

The correlation coefficient gives some information about the dependence
of two random variables. If ξ and η are any two nonconstant discrete random
variables, the value R(ξ, η) defined by the formula

R(ξ, η) = \frac{E([ξ − E(ξ)][η − E(η)])}{D(ξ) D(η)}   (1)

is said to be the correlation coefficient of ξ and η. (If ξ or η is constant, we
put R(ξ, η) = 0.)
From this definition it follows immediately that R(η, ξ) = R(ξ, η). If the
possible values of ξ and η are x_m (m = 1, 2, ...) and y_n (n = 1, 2, ...),
and r_{mn} = P(ξ = x_m, η = y_n), then

R(ξ, η) = \frac{\sum_m \sum_n r_{mn} [x_m − E(ξ)][y_n − E(η)]}{D(ξ) D(η)}.

If ξ is any nonconstant random variable, the random variable

ξ' = \frac{ξ − E(ξ)}{D(ξ)}   (2)

satisfies

E(ξ') = 0 and D(ξ') = 1.

The operation (2) which applied to ξ gives ξ' is called the standardization
of the random variable ξ. It follows immediately from the definition of the
correlation coefficient that

R(ξ, η) = E(ξ'η').   (3)
Now we shall prove some theorems about the correlation coefficient.

Theorem 1. We have

$$R(\xi, \eta) = \frac{E(\xi\eta) - E(\xi)E(\eta)}{D(\xi)\,D(\eta)}. \qquad (4)$$

Proof. It follows from the linearity of the operator $E$ that

$$E\bigl([\xi - E(\xi)][\eta - E(\eta)]\bigr) = E(\xi\eta) - E(\xi)E(\eta).$$
Theorem 2. The value of $R(\xi, \eta)$ lies always between $-1$ and $+1$.

Proof. According to Theorem 5 of § 8,

$$\bigl|E\bigl([\xi - E(\xi)][\eta - E(\eta)]\bigr)\bigr| \le D(\xi)\,D(\eta).$$

Theorem 2 cannot be further sharpened, since

$$R(\xi, \xi) = +1$$

and

$$R(\xi, -\xi) = -1.$$
Theorem 3. If $\xi$ and $\eta$ are independent, then

$$R(\xi, \eta) = 0.$$

Proof. If $\xi$ and $\eta$ are independent, then, according to Theorem 7 of § 8,

$$E(\xi\eta) = E(\xi)E(\eta).$$

Hence Theorem 3 follows from Theorem 1.

Remark. The converse of Theorem 3 does not hold: the independence of $\xi$ and $\eta$ does not follow, in general, from $R(\xi, \eta) = 0$. If $R(\xi, \eta) = 0$, the random variables $\xi$ and $\eta$ are called uncorrelated. While uncorrelated random variables are not necessarily independent, in certain special cases uncorrelatedness does imply independence (cf. e.g. Theorem 4).

Theorem 4. If $\xi_A$ and $\xi_B$ are the indicators of the events $A$ and $B$ with positive probabilities, the condition $R(\xi_A, \xi_B) = 0$ is equivalent to the independence of $\xi_A$ and $\xi_B$.

Proof. Since $E(\xi_A) = P(A)$, $E(\xi_B) = P(B)$ and $E(\xi_A \xi_B) = P(AB)$, it follows from the condition $R(\xi_A, \xi_B) = 0$ that

$$P(AB) = P(A)P(B),$$

which is equivalent to the independence of $A$ and $B$.


The following is an example of two uncorrelated but not independent random variables. Let

$$P(\xi = 1, \eta = 1) = P(\xi = -1, \eta = 1) = P(\xi = 1, \eta = -1) = P(\xi = -1, \eta = -1) = \frac{p}{4},$$

$$P(\xi = 0, \eta = 1) = P(\xi = 0, \eta = -1) = P(\xi = 1, \eta = 0) = P(\xi = -1, \eta = 0) = \frac{1-p}{4},$$

where $0 < p < 1$. Then $E(\xi) = E(\eta) = E(\xi\eta) = 0$, hence $\xi$ and $\eta$ are uncorrelated. Since, however, $P(\xi = 0, \eta = 0) = 0 \ne P(\xi = 0)P(\eta = 0) = \frac{(1-p)^2}{4}$, they are not independent.
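The two claims can be verified mechanically; the following Python enumeration (an illustration added here, with the arbitrary choice $p = 0.4$) computes the moments and the marginals from the table of joint probabilities:

    p = 0.4  # any value with 0 < p < 1
    joint = {(u, v): p / 4 for u in (-1, 1) for v in (-1, 1)}
    for pair in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        joint[pair] = (1 - p) / 4
    E = lambda f: sum(f(u, v) * w for (u, v), w in joint.items())
    print(E(lambda u, v: u), E(lambda u, v: v), E(lambda u, v: u * v))  # 0 0 0
    p_xi0 = sum(w for (u, v), w in joint.items() if u == 0)   # (1 - p)/2
    p_eta0 = sum(w for (u, v), w in joint.items() if v == 0)  # (1 - p)/2
    print(joint.get((0, 0), 0.0), p_xi0 * p_eta0)  # 0.0 versus (1 - p)^2/4

The product moment vanishes, while $P(\xi = 0, \eta = 0) = 0$ differs from the product of the marginal probabilities.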

In what follows we shall study what kind of consequences can be deduced from the knowledge of the value of the correlation coefficient. First we prove the following simple theorem:

Theorem 5. $|R(\xi, \eta)| = 1$ holds if and only if

$$\eta = a\xi + b \qquad (5)$$

with probability 1, where $a$ and $b$ are real constants and $a \ne 0$; in this case $R(\xi, \eta) = +1$ or $-1$ according as $a > 0$ or $a < 0$.

Proof. Let $E(\xi) = m$. If the relation (5) holds between $\xi$ and $\eta$, we have $R(\xi, \eta) = \operatorname{sgn} a$.¹

Suppose now, conversely, $R(\xi, \eta) = +1$. (The case $R(\xi, \eta) = -1$ can be dealt with in the same manner.) Put

$$\xi' = \frac{\xi - m}{D(\xi)}, \qquad \eta' = \frac{\eta - E(\eta)}{D(\eta)};$$

then by (3)

$$E(\xi'\eta') = 1,$$

hence

$$E\bigl((\xi' - \eta')^2\bigr) = 2 - 2 = 0.$$

From this it follows that

$$P(\xi' = \eta') = 1,$$

that is,

$$\eta = E(\eta) + D(\eta)\,\frac{\xi - m}{D(\xi)}$$

with probability 1.

Thus, unless a linear relation of the form (5) holds between $\xi$ and $\eta$, the absolute value of their correlation coefficient is less than 1.

¹ $\operatorname{sgn} x$ means the sign (signum) of $x$; it is defined by

$$\operatorname{sgn} x = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{if } x = 0, \\ -1 & \text{if } x < 0. \end{cases}$$

In the following we shall say that there is a positive correlation between $\xi$ and $\eta$ if $R(\xi, \eta) > 0$, and a negative correlation if $R(\xi, \eta) < 0$.

The following most instructive theorem is due to L. V. Kantorovich:

Theorem 6. Let $\xi$ and $\eta$ be discrete random variables assuming only a finite number of values. Let the possible different values of $\xi$ be $x_i$ $(i = 1, 2, \ldots, m)$ and those of $\eta$ be $y_j$ $(j = 1, 2, \ldots, n)$. If $\xi^h$ and $\eta^k$ are uncorrelated for $h = 1, 2, \ldots, m-1$ and $k = 1, 2, \ldots, n-1$, i.e.

$$E(\xi^h \eta^k) = E(\xi^h)E(\eta^k) \quad (h = 1, \ldots, m-1;\ k = 1, \ldots, n-1), \qquad (6)$$

then $\xi$ and $\eta$ are independent.

Proof. Let

$$P(\xi = x_i) = p_i, \quad P(\eta = y_j) = q_j, \quad P(\xi = x_i, \eta = y_j) = r_{ij};$$

then Equation (6) goes over into the equivalent form

$$\sum_{i=1}^m \sum_{j=1}^n r_{ij}\,x_i^h y_j^k = \Bigl(\sum_{i=1}^m p_i x_i^h\Bigr)\Bigl(\sum_{j=1}^n q_j y_j^k\Bigr);$$

the latter clearly holds also if $h = 0$, $k \le n - 1$ and if $k = 0$, $h \le m - 1$. By introducing the notation $\delta_{ij} = r_{ij} - p_i q_j$ we obtain for these new unknowns the following system of equations:

$$\sum_{i=1}^m \sum_{j=1}^n \delta_{ij}\,x_i^h y_j^k = 0 \quad (h = 0, 1, \ldots, m-1;\ k = 0, 1, \ldots, n-1). \qquad (7)$$

Introducing the notation

$$d_{ik} = \sum_{j=1}^n \delta_{ij}\,y_j^k, \qquad (8)$$

we have for the unknowns $d_{ik}$ $(i = 1, 2, \ldots, m)$ the system of linear equations

$$\sum_{i=1}^m d_{ik}\,x_i^h = 0 \quad (h = 0, 1, \ldots, m-1). \qquad (9)$$

The determinant of this system is the so-called Vandermonde determinant. It is well known that its value is different from 0. Thus the system (9) of equations has no solution distinct from 0, i.e.

$$d_{ik} = 0 \quad (i = 1, 2, \ldots, m).$$

Since the above consideration holds for every $k = 0, 1, \ldots, n-1$, we obtain

$$\sum_{j=1}^n \delta_{ij}\,y_j^k = 0 \quad (k = 0, 1, \ldots, n-1). \qquad (10)$$

The determinant of Equations (10) is again a Vandermonde determinant, thus

$$\delta_{ij} = 0 \quad (j = 1, 2, \ldots, n).$$

The same can be shown for every $i = 1, 2, \ldots, m$. From this follows

$$r_{ij} = p_i q_j,$$

thus $\xi$ and $\eta$ are independent.

Remark. The random variables $\xi$ and $\eta$ must fulfil $(m-1)(n-1)$ conditions in this theorem; as was seen in Chapter II, § 9, the same number of conditions is necessary to ensure the independence of two complete systems of events consisting of $m$ and $n$ events.
Finally, we give an example in which the correlation coefficients are effectively computed. Let the $r$-dimensional distribution of the random variables

$$\xi_1, \xi_2, \ldots, \xi_r$$

be a polynomial distribution

$$P(\xi_1 = k_1, \xi_2 = k_2, \ldots, \xi_r = k_r) = \frac{n!}{k_1!\,k_2!\cdots k_r!}\,p_1^{k_1} p_2^{k_2} \cdots p_r^{k_r},$$

where $0 \le k_i \le n$ $(i = 1, 2, \ldots, r)$ and $k_1 + k_2 + \cdots + k_r = n$; furthermore $0 < p_i < 1$ and $\sum_{i=1}^r p_i = 1$. We compute the correlation coefficient $R(\xi_i, \xi_j)$ $(i \ne j)$. It follows from a simple calculation that

$$E(\xi_i \xi_j) = n(n-1)\,p_i p_j.$$

It is easy to see that every component $\xi_k$ of the polynomial distribution has a binomial distribution and thus

$$E(\xi_k) = np_k \quad \text{and} \quad D(\xi_k) = \sqrt{np_k(1 - p_k)},$$

i.e.

$$R(\xi_i, \xi_j) = -\sqrt{\frac{p_i p_j}{(1 - p_i)(1 - p_j)}} \quad (i \ne j);$$

thus $\xi_i$ and $\xi_j$ are always negatively correlated.
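This value is easy to confirm by simulation; in the Python sketch below (an illustration added here; $n$, the $p_i$, and the sample size are arbitrary choices) the empirical correlation coefficient of two components is compared with the formula just derived:

    import numpy as np

    n, probs = 100, [0.2, 0.3, 0.5]
    samples = np.random.multinomial(n, probs, size=200_000)
    xi1, xi2 = samples[:, 0], samples[:, 1]
    r_empirical = np.corrcoef(xi1, xi2)[0, 1]
    p1, p2 = probs[0], probs[1]
    r_formula = -np.sqrt(p1 * p2 / ((1 - p1) * (1 - p2)))
    print(r_empirical, r_formula)  # both close to -0.327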

§ 12. The Poisson distribution

Under certain conditions, the binomial distribution can be approximated by the so-called Poisson distribution. The Poisson distribution, dealt with in this and in the following paragraph, is one of the most important distributions in probability theory. Let us first consider a practical example.
The following problem occurs in the production of glass bottles. In the melted glass used for the production of the bottles there remain little solid bodies, briefly called "stones". If a stone gets into the mass of a bottle, the latter becomes defective. The stones are situated at random in the melted glass. But under constant circumstances of production, a given mass of glass contains on the average the same number of stones. Suppose for instance that 100 kg of fluid glass contains an average number $x$ of stones; let further the weight of a bottle be 1 kg. What per cent of the produced bottles will be defective because of containing stones? At first glance we could think that, as the mass of 100 bottles contains on the average $x$ stones, approximately $x$ per cent of the bottles will be defective. This consideration is, however, wrong, since it does not take into account that more than one of the stones can get into the mass of one bottle and thus the number of the defective items will usually be less.
The problem in question can be solved by means of probability theory. Let us first reduce the problem to a simplified model, nevertheless fulfilling the practical requirements. In practical applications of mathematics we generally work with such models. Whether such a model gives a true picture of the real situation depends on the adequate choice of the model.

We construct the following model for our problem. Suppose that every stone gets with the same probability into the mass of any of the bottles, independently of what happens to the other stones. Thus the problem is reduced to an urn problem: $n$ balls are dropped at random into $N$ urns; what is the probability that a randomly chosen urn contains exactly $k$ balls? Since there are $N$ equally probable possibilities for every one of the balls, the probability that an urn should contain just $k$ balls is, according to the formula of the binomial distribution,

$$W_k = \binom{n}{k}\Bigl(\frac{1}{N}\Bigr)^k \Bigl(1 - \frac{1}{N}\Bigr)^{n-k}. \qquad (1)$$

We ask for the percentage of defective items if the production of $N$ bottles requires $M$ hundred kilograms of liquid glass. In this case $N = 100M$ and $n = xM$. Since we are interested in the percentage of defective items in a long period of production, we may assume that $M$ is very large. Let $\lambda = \frac{n}{N} = \frac{x}{100}$; then a


simple calculation gives that


$$W_k = \frac{\lambda^k}{k!}\Bigl(1 - \frac{\lambda}{n}\Bigr)^{n-k} \prod_{i=1}^{k-1}\Bigl(1 - \frac{i}{n}\Bigr). \qquad (2)$$

It is known that

$$\lim_{n \to \infty}\Bigl(1 - \frac{\lambda}{n}\Bigr)^n = e^{-\lambda}, \qquad (3)$$

hence from (2)

$$\lim_{n \to \infty} W_k = \frac{\lambda^k}{k!}\,e^{-\lambda} \quad (k = 0, 1, \ldots). \qquad (4)$$

Let

$$w_k = \frac{\lambda^k}{k!}\,e^{-\lambda} \quad (k = 0, 1, \ldots). \qquad (5)$$

From the power series of $e^x$ we have

$$\sum_{k=0}^\infty w_k = e^{-\lambda} \sum_{k=0}^\infty \frac{\lambda^k}{k!} = 1. \qquad (6)$$

Thus the probabilities defined by (5) are the terms of a probability distribution, called the Poisson distribution with parameter $\lambda$; the meaning of $\lambda$ in the above example is the average number of balls in one urn. It can be shown by direct calculation that $\lambda$ is the expectation of the Poisson distribution (5). Namely, from the relation

$$P(\xi = k) = \frac{\lambda^k}{k!}\,e^{-\lambda} \quad (k = 0, 1, \ldots)$$

we have

$$E(\xi) = \sum_{k=0}^\infty k\,\frac{\lambda^k}{k!}\,e^{-\lambda} = \lambda\Bigl(\sum_{k=1}^\infty \frac{\lambda^{k-1}}{(k-1)!}\Bigr)e^{-\lambda} = \lambda e^{\lambda} e^{-\lambda} = \lambda. \qquad (7)$$

Thus the expectation of the Poisson distribution (5) is $\lambda$; hence the distribution (5) can be called the Poisson distribution with expectation $\lambda$. The variance of the Poisson distribution can easily be calculated:

$$E(\xi^2) = \sum_{k=0}^\infty k^2\,\frac{\lambda^k}{k!}\,e^{-\lambda} = \sum_{k=2}^\infty k(k-1)\,\frac{\lambda^k}{k!}\,e^{-\lambda} + \lambda = \lambda^2 + \lambda,$$

hence

$$D^2(\xi) = \lambda^2 + \lambda - \lambda^2 = \lambda;$$

that is, the standard deviation of the Poisson distribution (5) is $D(\xi) = \sqrt{\lambda}$. Thus the variance of a Poisson distribution is equal to the expectation.

In the passage to the limit in (4) no use was made of the property that the probability for a ball to enter a certain urn is $\frac{1}{N}$ with a natural number $N$. Therefore our result can also be stated in the following form: the $k$-th term

$$W_k = \binom{n}{k} p^k q^{n-k} \qquad (8)$$

of the binomial distribution tends to the $k$-th term of the Poisson distribution, i.e. to the limit

$$w_k = \frac{\lambda^k}{k!}\,e^{-\lambda}, \qquad (9)$$

if $n \to \infty$ and $p \to 0$ in such a way that $np = \lambda$, where $\lambda > 0$ is a constant number. (Clearly, the condition $np = \lambda$ can be replaced by the condition $np \to \lambda$.)
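The speed of this convergence can be inspected numerically; the sketch below (a Python illustration added here, with the arbitrary choice $\lambda = 2$) measures the total variation distance between the binomial and Poisson distributions as $n$ grows:

    from math import comb, exp, factorial

    lam = 2.0
    for n in (10, 100, 1000):
        p = lam / n
        dist = 0.5 * sum(abs(comb(n, k) * p**k * (1 - p)**(n - k)
                             - lam**k * exp(-lam) / factorial(k))
                         for k in range(n + 1))
        print(n, dist)  # the distance decreases roughly like 1/n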
The distribution function of the Poisson distribution can be expressed in integral form by means of Euler's incomplete gamma function. Let

$$\Gamma(z, x) = \int_0^x t^{z-1} e^{-t}\,dt \qquad (10)$$

for $x > 0$, $z > 0$, denote the incomplete gamma function of Euler and

$$\Gamma(z) = \Gamma(z, +\infty) = \int_0^\infty t^{z-1} e^{-t}\,dt \qquad (11)$$

the complete gamma function of Euler. Partial integration yields the formula

$$\sum_{k=0}^{r} \frac{\lambda^k}{k!}\,e^{-\lambda} = 1 - \frac{\Gamma(r+1, \lambda)}{\Gamma(r+1)}. \qquad (12)$$

Let us now return to our practical problem. Because of the relation between relative frequency and probability, the ratio of defective bottles to produced bottles is approximately equal to the probability of a bottle being defective, provided the number of manufactured bottles is sufficiently large. This probability, however, is $1 - W_0$, hence approximately $1 - e^{-\lambda}$. Since $\lambda = \frac{x}{100}$, the percentage of defective items is $100\bigl[1 - \exp\bigl(-\frac{x}{100}\bigr)\bigr]$. If $x$ is very small, this is in fact nearly equal to $x$; in the case of large $x$, however, it is not. In the extreme case, when $x = 100$, the fraction of defective bottles is not 100 per cent as it would follow from the consideration mentioned at

the beginning of this paragraph, but only $100(1 - e^{-1}) = 63.21\%$. Of course such a large fraction of defective items will not occur. If for instance $x = 30$, the fraction of defective items is $100(1 - e^{-0.3}) \approx 25.92\%$ instead of 30%. Clearly, if the number of stones is large, it is more economical to produce small bottles, provided of course that there is no way of clearing the liquid glass. Using 0.25 kg of glass per bottle instead of 1 kg, the fraction of defective items decreases for $x = 30$ from 25.92% to 7.22%. As is seen from this example, probability theory can give useful hints for practical problems of production.
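The percentages quoted above are reproduced by a few lines of Python (added here for illustration; the function simply evaluates $100[1 - e^{-\lambda}]$ with $\lambda$ proportional to the bottle weight):

    from math import exp

    def defective_percent(x, bottle_kg=1.0):
        # x stones per 100 kg of glass; lambda = x * bottle_kg / 100
        return 100 * (1 - exp(-x * bottle_kg / 100))

    print(defective_percent(100))       # 63.21...
    print(defective_percent(30))        # 25.91...
    print(defective_percent(30, 0.25))  # 7.22...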

§ 13. Some applications of the Poisson distribution

In the previous paragraph the Poisson distribution was introduced as an approximation to the binomial distribution. Now we shall show that the Poisson distribution represents the exact solution of a problem of probability theory. This problem is of fundamental importance in physics, chemistry, biology, astronomy, and other fields.
First let us deal with the example of radioactive decay. The atoms of a radioactive element disintegrate at random. As experience shows, the probability for an atom (not disintegrated until a certain moment) to disintegrate during the next time interval of length $t$ depends only on the length $t$ of this time interval. Let this probability be $F(t)$ and put $G(t) = 1 - F(t)$. As to the function $G(t)$, we know only that it is monotone decreasing and $G(0) = 1$. Let $A_s$ denote the event that a certain atom does not disintegrate during the time interval $(0, s)$; then clearly $P(A_{s+t} \mid A_s) = G(t)$. It follows from the definition of conditional probability that

$$P(A_{s+t}) = P(A_{s+t} \mid A_s)\,P(A_s), \qquad (1)$$

hence

$$G(s + t) = G(s)\,G(t). \qquad (2)$$

Thus we obtained a functional equation for $G(t)$. If we assume further that $G(t)$ is differentiable at the point $t = 0$, $G(t)$ may be obtained in the following simple manner: substitute $\Delta t$ for $s$ in (2); then it follows that

$$\frac{G(t + \Delta t) - G(t)}{\Delta t} = G(t)\,\frac{G(\Delta t) - 1}{\Delta t}. \qquad (3)$$

Let $\Delta t$ tend to 0. Because of $G(0) = 1$, we get

$$G'(t) = G'(0)\,G(t). \qquad (4)$$



In the deduction of this equation the existence of the derivative of $G(t)$ was supposed only at the point $t = 0$; namely, if the limit on the right side of (3) exists for $\Delta t \to 0$, then the same holds for the left side as well. $G'(0)$ is necessarily negative. It follows namely from the monotone decreasing property of $G(t)$ that $G'(0) \le 0$. If we had $G'(0) = 0$, it would follow from (4) that $G(t) \equiv 1$, which would mean that no radioactive disintegration can occur. Thus putting $G'(0) = -\lambda$ we have $\lambda > 0$. The solution of (4) with $G(0) = 1$ is

$$G(t) = e^{-\lambda t}. \qquad (5)$$

The same result can be obtained without the assumption of the existence of $G'(0)$; the assumption of $G(t)$ being monotone decreasing suffices. In fact, we have from (2)

$$G(2t) = G^2(t), \quad G(3t) = G^3(t),$$

or, generally, for every positive integer $n$,

$$G(nt) = G^n(t). \qquad (6)$$

Let $nt = s$; then

$$\Bigl[G\Bigl(\frac{s}{n}\Bigr)\Bigr]^n = G(s). \qquad (7)$$

From (6) and (7) we obtain

$$G\Bigl(\frac{m}{n}\,t\Bigr) = [G(t)]^{\frac{m}{n}},$$

hence for every positive rational number $r$

$$G(rt) = [G(t)]^r. \qquad (8)$$

Since $0 < G(1) < 1$, $G(1)$ can be written in the form $G(1) = e^{-\lambda}$ with $\lambda > 0$. Thus we obtain from (8) that for every rational $t$

$$G(t) = e^{-\lambda t}. \qquad (9)$$

However, because of the monotonicity of $G(t)$, (9) holds for every $t$. Therefore

$$F(t) = 1 - G(t) = 1 - e^{-\lambda t}. \qquad (10)$$
Let us now examine the physical meaning of the constant $\lambda$. By expanding the function $F(\Delta t) = 1 - e^{-\lambda \Delta t}$ in powers of $\Delta t$ we obtain the equality

$$F(\Delta t) = \lambda \Delta t + O\bigl((\Delta t)^2\bigr). \qquad (11)$$

The left side of (11) is the probability that an atom which did not disintegrate until the moment $t$ will disintegrate before the moment $t + \Delta t$. $\lambda$ has

thus the following physical meaning: the probability that an atom disintegrates during the time interval between $t$ and $t + \Delta t$ is (up to higher powers of $\Delta t$) equal to $\lambda \Delta t$. The constant $\lambda$ is called the constant of disintegration; it characterizes the radioactive element in question and may serve for its identification. It is attractive to give another interpretation of the number $\lambda$, which enables us to measure it. The time during which approximately half of the mass of the radioactive substance disintegrates is said to be the half-life period. More exactly, this is the time interval such that during it each of the atoms of the substance has probability $\frac{1}{2}$ of disintegrating. Consider a given mass of a radioactive element of disintegration constant $\lambda$. Since every atom disintegrates during the half-life period $T$ with probability $\frac{1}{2}$, we have $F(T) = \frac{1}{2}$. However, $G(T) = 1 - F(T) = e^{-\lambda T}$, thus $e^{-\lambda T} = \frac{1}{2}$ and

$$\lambda = \frac{\ln 2}{T}. \qquad (12)$$

The disintegration constant is therefore inversely proportional to the half-life period. The obtained result may be expressed as follows: the life time of any atom of a radioactive element is a random variable $\xi$ such that its distribution function $F(t) = P(\xi < t)$ has the form

$$F(t) = 1 - e^{-\lambda t} \quad (t > 0),$$

where $\lambda$ is a positive constant, the disintegration constant of the element in question. (For $t \le 0$ clearly $F(t) = 0$, since the life time cannot be negative.) More concisely: the life time of a radioactive atom is an exponentially distributed random variable. Hence the custom to speak about the exponential law of radioactive disintegration.
Suppose that at time $t = 0$ there are $N$ atoms. How many non-disintegrated atoms will there be at time $t > 0$? The probability of disintegration during this time is for every atom $1 - e^{-\lambda t}$. So, in view of the relation between relative frequency and probability, the number of disintegrations will be approximately $N(1 - e^{-\lambda t})$. Hence approximately $Ne^{-\lambda t}$ atoms remain non-disintegrated.

Let $P_k(t)$ be the probability that during the time interval $(0, t)$ exactly $k$ atoms disintegrate. Suppose that the disintegration of each atom is an event independent of the disintegration of the others; then we have

$$P_k(t) = \binom{N}{k}\bigl(1 - e^{-\lambda t}\bigr)^k e^{-(N-k)\lambda t}. \qquad (13)$$



The number of disintegrations thus obeys the binomial law. If $\lambda t$ is small and $k$ not too large, $P_k(t)$ may be approximated by a Poisson distribution; the probability $P_k(t)$ is approximately

$$\bar{P}_k(t) = \frac{\bigl[N(1 - e^{-\lambda t})\bigr]^k}{k!}\,\exp\bigl[-N(1 - e^{-\lambda t})\bigr]. \qquad (14)$$

As a further step we can replace, for small values of $\lambda t$, the quantity $1 - e^{-\lambda t}$ simply by $\lambda t$. Thus $P_k(t)$ is near to

$$P_k^*(t) = \frac{(N\lambda t)^k\,e^{-N\lambda t}}{k!}. \qquad (15)$$

The half-life period of radium is 1580 years. Taking a year for unit we obtain $\lambda = 0.000439$. If $t$ is less than a minute, $\lambda t$ is of the order $10^{-9}$. For 1 g of uranium mineral, containing approximately $10^{15}$ radium atoms, the relative errors committed in replacing $P_k(t)$ by $P_k^*(t)$ are of the order $10^{-3}$.
If we restrict ourselves to the case where $t$ is small with respect to the half-life period, we can choose the model so that the Poisson distribution represents the exact distribution of the number of radioactive disintegrations. Consider a certain mass of radioactive substance and assume the following:

1. If $t_1 < t_2 < t_3$ and $A_k(t_1, t_2)$ denotes the event that "during the time interval $(t_1, t_2)$ $k$ disintegrations occur", then the events $A_k(t_1, t_2)$ and $A_l(t_2, t_3)$ are independent for all nonnegative integer values of $k$ and $l$.

2. The events $A_k(t_1, t_2)$, $k = 0, 1, \ldots$ form a complete system. If $k$ is given, $P[A_k(t_1, t_2)]$ depends only on the difference $t_2 - t_1$. In other words, the process of radioactive disintegration is homogeneous with respect to time. Let $W_k(t)$ denote the probability of $k$ disintegrations during a time interval of length $t$ $(t_2 - t_1 = t)$.

3. If $t$ is small enough, the probability that during a time interval $t$ there occurs more than one disintegration is negligibly small compared to the probability that there occurs exactly one. That is,

$$\lim_{t \to 0} \frac{1 - W_0(t) - W_1(t)}{W_1(t)} = 0, \qquad (16)$$

or equivalently

$$\lim_{t \to 0} \frac{1 - W_0(t)}{W_1(t)} = 1. \qquad (17)$$

In words: the probability that there occurs at least one disintegration is, in the limit, equal to the probability that there occurs exactly one.

Clearly, $W_0(0) = 1$ and $W_k(0) = 0$ for $k \ge 1$. Further, $W_0(t)$ is a monotone decreasing function of $t$. From this and from conditions 1 and 2 it follows that

$$W_0(t + s) = W_0(t)\,W_0(s);$$

hence we have

$$W_0(t) = e^{-\mu t} \quad \text{where } \mu > 0. \qquad (18)$$

In order to determine the functions $W_k(t)$ we show first that

$$\lim_{\Delta t \to 0} \frac{W_k(\Delta t)}{\Delta t} = 0 \quad \text{if } k = 2, 3, \ldots. \qquad (19)$$

Obviously, this is a consequence of (16) and of the relation

$$\sum_{k=2}^\infty W_k(\Delta t) = 1 - W_0(\Delta t) - W_1(\Delta t). \qquad (20)$$

Since $W_k(0) = 0$ for $k > 1$, (19) can be written in the form

$$W_k'(0) = 0 \quad (k = 2, 3, \ldots). \qquad (21)$$

It is to be noted here that the existence of $W_k'(0)$ was not assumed, but proved.
The event that $k$ disintegrations occur during the time interval $(0, t + \Delta t)$ can happen in three ways:

a) $k - 1$ disintegrations occur between 0 and $t$ and one between $t$ and $t + \Delta t$;

b) $k$ disintegrations occur between 0 and $t$ and 0 between $t$ and $t + \Delta t$;

c) at most $k - 2$ disintegrations occur between 0 and $t$ and at least 2 between $t$ and $t + \Delta t$.

Thus, because of conditions 1 and 2, we get

$$W_k(t + \Delta t) = W_k(t)\,W_0(\Delta t) + W_{k-1}(t)\,W_1(\Delta t) + R, \qquad (22)$$

where $R = o(\Delta t)$, according to condition 3 and relation (19). In view of (17) and (18) we obtain from (22) for $\Delta t \to 0$

$$W_k'(t) = \mu\bigl(W_{k-1}(t) - W_k(t)\bigr) \quad (k = 1, 2, \ldots). \qquad (23)$$

Thus we obtained for the $W_k(t)$ a readily solvable system of differential equations. Put

$$V_k(t) = W_k(t)\,e^{\mu t}; \qquad (24)$$

then, from (23),

$$V_k'(t) = \mu\,V_{k-1}(t) \quad (k = 1, 2, \ldots). \qquad (25)$$

From $W_0(t) = e^{-\mu t}$ follows $V_0(t) = 1$ and we obtain

$$V_1(t) = \mu t,$$

$$V_2(t) = \frac{(\mu t)^2}{2!},$$

and, in general,

$$V_k(t) = \frac{(\mu t)^k}{k!}.$$

Hence

$$W_k(t) = \frac{(\mu t)^k\,e^{-\mu t}}{k!} \quad (k = 0, 1, \ldots).$$

Thus we have proved that the number of disintegrations during a time interval $t$, given conditions 1-3, has a Poisson distribution with expectation proportional to $t$.
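The result can also be checked by simulation: in the Python sketch below (an illustration added here; $\mu = 1.5$ and $t = 2$ are arbitrary) the disintegration instants are produced by independent exponential waiting times, and the observed frequencies of $k$ events in $(0, t)$ are compared with $(\mu t)^k e^{-\mu t}/k!$:

    import random
    from math import exp, factorial

    mu, t, trials = 1.5, 2.0, 100_000
    counts = {}
    for _ in range(trials):
        instant, k = random.expovariate(mu), 0
        while instant <= t:       # waiting times are exponential with rate mu
            k += 1
            instant += random.expovariate(mu)
        counts[k] = counts.get(k, 0) + 1
    for k in range(8):
        print(k, counts.get(k, 0) / trials,
              (mu * t)**k * exp(-mu * t) / factorial(k))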
The Poisson distribution can also be used in studying the number of telephone calls during a given time interval. Let $A_k(t_1, t_2)$ be the event "between the moments $t_1$ and $t_2$ a telephone exchange receives exactly $k$ calls"; the assumptions introduced for the radioactive disintegration are here approximately valid (at least during the "rush hours"). The number of the calls has thus a Poisson distribution. The situation is analogous for the number of electrons emitted by the glowing cathode of an electron tube during a time interval $t$; also for the number of shooting stars observed during a time interval $t$, as well as for other phenomena exhibiting random fluctuations.

As an application of the Poisson distribution in astronomy let us consider now the mean density $\lambda$ of the stars in some region of the Milky Way. This density can be considered to be constant. We understand by this that a volume $V$ contains on the average $V\lambda$ stars.

In the same manner as in the case of radioactive disintegration (reformulating of course conditions 1-3 adequately), it can be shown that the probability that a region of volume $V$ of the space contains exactly $k$ stars is equal to

$$\frac{(\lambda V)^k\,e^{-\lambda V}}{k!} \quad (k = 0, 1, \ldots). \qquad (26)$$

The distribution of the stars thus follows the same law as the radioactive disintegration; the only difference is that here the volume plays the role

of time. The same reasoning holds for particular kinds of stars as well, e.g. for double stars. In the same manner the distribution of red and white cells in the blood can be determined. Let $A_k$ denote the event that there are exactly $k$ cells to be seen in the visual field of the microscope; then we have

$$P(A_k) = \frac{(\lambda T)^k\,e^{-\lambda T}}{k!} \quad (k = 0, 1, \ldots), \qquad (27)$$

where $T$ is the area of the visual field and $\lambda$ is the average number of cells per unit area.

§ 14. The algebra of probability distributions

In the present paragraph we shall summarize systematically the relations between probability distributions which we encountered in the previous paragraphs. In particular, we shall deal with relations which permit the construction of other distributions from a given one. We shall consider probability distributions belonging to discrete random variables $\xi$ taking on nonnegative integer values only; such distributions will be denoted by $\{p_0, p_1, \ldots, p_k, \ldots\}$, where $p_k = P(\xi = k)$ $(k = 0, 1, \ldots)$. For the sake of brevity the notation $\mathcal{P} = \{p_0, p_1, \ldots, p_k, \ldots\}$ will be used as well.

A fundamental operation is the mixing of probability distributions. Let $\{\alpha_n\}$ $(n = 0, 1, \ldots)$ be nonnegative numbers with sum equal to 1 and let $\mathcal{P}_n = \{p_{nk}\}$ be for each value of $n$ $(n = 0, 1, \ldots)$ a probability distribution. Let us form the expression

$$\pi_k = \sum_{n=0}^\infty \alpha_n\,p_{nk}. \qquad (1)$$

Obviously, the numbers $\pi_k$ $(k = 0, 1, \ldots)$ form again a probability distribution; indeed $\pi_k \ge 0$ and

$$\sum_{k=0}^\infty \pi_k = \sum_{n=0}^\infty \alpha_n \sum_{k=0}^\infty p_{nk} = \sum_{n=0}^\infty \alpha_n = 1. \qquad (2)$$

Let the probability distribution $\Pi = \{\pi_k\}$ be defined by

$$\Pi = \sum_{n=0}^\infty \alpha_n\,\mathcal{P}_n;$$

$\Pi$ will be called the mixture of the probability distributions $\mathcal{P}_n$ taken with the weights $\alpha_n$.

For instance, the mixture of the binomial distributions

$$\mathcal{B}_n(p) = \Bigl\{\binom{n}{k} p^k q^{n-k}\Bigr\}$$

taken with the weights $\alpha_n = \frac{\lambda^n e^{-\lambda}}{n!}$ is a Poisson distribution. In fact,

$$\sum_{n=k}^\infty \frac{\lambda^n e^{-\lambda}}{n!} \binom{n}{k} p^k q^{n-k} = \frac{(\lambda p)^k}{k!}\,e^{-\lambda p} \quad (k = 0, 1, \ldots). \qquad (3)$$

Another example is the mixture of the hypergeometric distributions

$$\Biggl\{\frac{\dbinom{M}{k}\dbinom{N-M}{n-k}}{\dbinom{N}{n}}\Biggr\}$$

with weights $\alpha_n = \binom{N}{n} p^n q^{N-n}$. This leads to the binomial distribution $\mathcal{B}_M(p)$, as is seen from the relation

$$\sum_n \binom{N}{n} p^n q^{N-n}\,\frac{\dbinom{M}{k}\dbinom{N-M}{n-k}}{\dbinom{N}{n}} = \binom{M}{k} p^k q^{M-k}. \qquad (4)$$
Geometrically, mixtures of distributions can be represented in the following way: two distributions $\mathcal{P}_1 = \{p_{1k}\}$ and $\mathcal{P}_2 = \{p_{2k}\}$ can be considered as two points of an infinite dimensional space having the coordinates $p_{1k}$ and $p_{2k}$ respectively. The mixture

$$\alpha\mathcal{P}_1 + \beta\mathcal{P}_2 = \{\alpha p_{1k} + \beta p_{2k}\} \quad (0 \le \alpha \le 1,\ \beta = 1 - \alpha)$$

subdivides the "segment" $\mathcal{P}_1\mathcal{P}_2$ in proportion $\alpha : \beta$. All probability distributions $\mathcal{P} = \{p_n\}$ lie on the "hyperplane" of this space with equation $\sum_{n=0}^\infty p_n = 1$, namely in that part of this hyperplane for which $p_n \ge 0$. These points constitute thus a "simplex" $S$. Since

$$\alpha\mathcal{P}_1 + \beta\mathcal{P}_2 \quad (0 \le \alpha \le 1,\ \beta = 1 - \alpha)$$

is a probability distribution as well, it follows that $S$ contains with any two of its points the segment joining them. $S$ is thus convex.

Another often-used operation is the convolution of probability distributions. The convolution of the distributions $\mathcal{P} = \{p_k\}$ and $\mathcal{Q} = \{q_k\}$ is the distribution $\mathcal{R} = \{r_k\}$, where

$$r_k = \sum_{j=0}^k p_j\,q_{k-j}. \qquad (5)$$

As was seen in § 6, $\mathcal{R}$ is the distribution of the sum $\xi + \eta$ of two independent random variables $\xi$ and $\eta$ having the distributions $\mathcal{P}$ and $\mathcal{Q}$ respectively. Even without the knowledge of this result, it is readily shown that $\mathcal{R}$ is a probability distribution. In fact $r_k \ge 0$ and

$$\sum_{k=0}^\infty r_k = \sum_{j=0}^\infty p_j \sum_{h=0}^\infty q_h = 1. \qquad (6)$$

The convolution of $\mathcal{P}$ and $\mathcal{Q}$ is denoted by $\mathcal{P}\mathcal{Q}$. Since

$$\sum_{j=0}^k p_j\,q_{k-j} = \sum_{j=0}^k q_j\,p_{k-j},$$

we have

$$\mathcal{P}\mathcal{Q} = \mathcal{Q}\mathcal{P}. \qquad (7)$$

The convolution is thus a commutative operation. It is associative as well:

$$\mathcal{P}_1(\mathcal{P}_2\mathcal{P}_3) = (\mathcal{P}_1\mathcal{P}_2)\mathcal{P}_3 = \mathcal{P}_1\mathcal{P}_2\mathcal{P}_3. \qquad (8)$$

In fact, if $\mathcal{P}_j = \{p_{jk}\}$ $(j = 1, 2, 3)$, the $k$-th term of the distribution $\mathcal{P}_1(\mathcal{P}_2\mathcal{P}_3)$ as well as of the distribution $(\mathcal{P}_1\mathcal{P}_2)\mathcal{P}_3$ is equal to

$$\sum_{i+j+h=k} p_{1i}\,p_{2j}\,p_{3h}.$$

In this manner multiple convolutions and convolution-powers of a distribution may be defined. By the $n$-th convolution-power of a distribution $\mathcal{P}$ we understand the $n$-fold convolution of the distribution $\mathcal{P}$ with itself, in symbols $\mathcal{P}^n$. Thus for instance if $p_0 = q = 1 - p$ and $p_1 = p$, we obtain as the $n$-th convolution-power of the binomial distribution $\mathcal{B}_1(p) = \{q, p\}$ of order 1 the binomial distribution

$$\mathcal{B}_n(p) = (\mathcal{B}_1(p))^n. \qquad (9)$$

In fact,

$$\sum_{j=0}^k \binom{m}{j} p^j q^{m-j} \binom{n}{k-j} p^{k-j} q^{n-k+j} = \binom{m+n}{k} p^k q^{m+n-k},$$

hence

$$\mathcal{B}_m(p)\,\mathcal{B}_n(p) = \mathcal{B}_{m+n}(p). \qquad (10)$$

Relation (9) can be obtained from (10) by mathematical induction.

Similarly, it can be shown for the negative binomial distribution that $\mathcal{G}_r(p) = [\mathcal{G}_1(p)]^r$, where $\mathcal{G}_r(p) = \{p_k^{(r)}\}$ with $p_k^{(r)} = 0$ for $k < r$ and $p_k^{(r)} = \binom{k-1}{r-1} p^r q^{k-r}$ for $k \ge r$.

It can be shown finally that the convolution of two Poisson distributions is again a Poisson distribution. If $\mathcal{P}(\lambda) = \Bigl\{\dfrac{\lambda^k e^{-\lambda}}{k!}\Bigr\}$, then

$$\mathcal{P}(\lambda)\,\mathcal{P}(\mu) = \mathcal{P}(\lambda + \mu) \qquad (11)$$

since

$$\sum_{j=0}^k \frac{\lambda^j e^{-\lambda}}{j!}\cdot\frac{\mu^{k-j} e^{-\mu}}{(k-j)!} = \frac{(\lambda + \mu)^k\,e^{-(\lambda+\mu)}}{k!},$$

i.e. the distribution obtained as the convolution "product" of two Poisson distributions has for its parameter the sum of the parameters of the two "factors".
Let us now introduce the degenerate distribution $\mathcal{E}_0$. It is defined by

$$\mathcal{E}_0 = \{1, 0, 0, \ldots, 0, \ldots\}.$$

Obviously, for any distribution $\mathcal{P}$ one has

$$\mathcal{E}_0\mathcal{P} = \mathcal{P}. \qquad (12)$$

Thus the distribution $\mathcal{E}_0$ plays the role of the unit element with respect to the convolution operation.¹ The distributions $\mathcal{E}_n$, defined by $p_n = 1$, $p_m = 0$ for $m \ne n$, are also degenerate distributions. It is easy to show that

$$\mathcal{E}_r\mathcal{E}_s = \mathcal{E}_{r+s}, \qquad \mathcal{E}_r = \mathcal{E}_1^r. \qquad (13)$$

It is readily seen that the operations mixture and convolution commute:

$$\Bigl(\sum_{n=0}^\infty \alpha_n\mathcal{P}_n\Bigr)\mathcal{Q} = \sum_{n=0}^\infty \alpha_n(\mathcal{P}_n\mathcal{Q}). \qquad (14)$$

By means of the operations mixture and convolution, functions of probability distributions can be defined in the following manner: let $g(z) = \sum_{n=0}^\infty W_n z^n$ be a power series with nonnegative coefficients such that $g(1) = \sum_{n=0}^\infty W_n = 1$. If $\mathcal{P}$ is an arbitrary probability distribution, let $g(\mathcal{P})$ be defined by

$$g(\mathcal{P}) = \sum_{n=0}^\infty W_n\,\mathcal{P}^n \quad (\mathcal{P}^0 = \mathcal{E}_0). \qquad (15)$$

If for instance $\mathcal{E}_1$ is the degenerate distribution defined above and if $g(z) = (pz + q)^n$ $(0 < p < 1)$, then, because of (13), we have

$$\mathcal{B}_n(p) = (p\mathcal{E}_1 + q)^n. \qquad (16)$$

Similarly, if $g(z) = e^{\lambda(z-1)}$ $(\lambda > 0)$, then

$$\mathcal{P}(\lambda) = \exp[\lambda(\mathcal{E}_1 - 1)], \qquad (17)$$

where $\mathcal{P}(\lambda)$ is a Poisson distribution of parameter $\lambda$. In fact,

$$\exp[\lambda(\mathcal{E}_1 - 1)] = e^{-\lambda} \sum_{k=0}^\infty \frac{\lambda^k\,\mathcal{E}_1^k}{k!} = \Bigl\{\frac{\lambda^k e^{-\lambda}}{k!}\Bigr\}.$$

¹ The probability distributions form a commutative semi-group with respect to convolution, with unit element $\mathcal{E}_0$.
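On a computer these operations become operations on arrays of probabilities: mixture is a weighted elementwise sum, and convolution can be realized e.g. by numpy.convolve. The Python sketch below (an illustration added here; infinite supports are truncated to a finite array length) checks relations (9) and (11) numerically:

    import numpy as np
    from math import comb, exp, factorial

    def binom(n, p):
        return np.array([comb(n, k) * p**k * (1 - p)**(n - k)
                         for k in range(n + 1)])

    def poisson(lam, size=40):  # truncated to the first `size` terms
        return np.array([lam**k * exp(-lam) / factorial(k)
                         for k in range(size)])

    p = 0.3
    power = b1 = binom(1, p)           # B_1(p) = {q, p}
    for _ in range(4):                 # form the convolution-power (B_1(p))^5
        power = np.convolve(power, b1)
    print(np.allclose(power, binom(5, p)))    # relation (9) with n = 5

    conv = np.convolve(poisson(1.0), poisson(2.0))[:40]
    print(np.allclose(conv, poisson(3.0)))    # relation (11)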

§ 15. Generating functions

In the present paragraph we shall again deal with random variables taking on nonnegative integer values only. Let $\xi$ be such a random variable and put $P(\xi = k) = p_k$ $(k = 0, 1, \ldots)$. The generating function¹ $G_\xi(z)$ of the random variable $\xi$ is defined by the power series

$$G_\xi(z) = \sum_{k=0}^\infty p_k z^k, \qquad (1)$$

where $z$ is a complex variable. The power series (1) is certainly convergent for $|z| \le 1$, since

$$\sum_{k=0}^\infty p_k = 1, \qquad (2)$$

and represents an analytic function which is regular in the open unit disk. The introduction of the generating function makes it possible to treat some problems of probability theory by the methods of the theory of functions of a complex variable.

¹ Called sometimes the probability generating function.



Since the generating function is uniquely determined by the distribution of a random variable, we may speak about the generating function of a probability distribution on the set of the nonnegative integers. It follows immediately from the definition of the generating function that the distribution of a random variable is uniquely determined by its generating function; in fact,

$$p_0 = G_\xi(0), \qquad p_k = \frac{G_\xi^{(k)}(0)}{k!} \quad (k = 1, 2, \ldots), \qquad (3)$$

where $G_\xi^{(k)}(z)$ is the $k$-th derivative of $G_\xi(z)$. The series (1) may converge in a circle larger than $|z| \le 1$, or even in the entire plane.

Examples

1. Generating function of the binomial distribution. Let $\xi$ be a random variable having a binomial distribution of order $n$; then

$$G_\xi(z) = \bigl(1 + p(z - 1)\bigr)^n = (pz + q)^n.$$

2. Generating function of the Poisson distribution. Let $\xi$ be a random variable having a Poisson distribution with expectation $\lambda$; then

$$G_\xi(z) = e^{\lambda(z-1)}.$$

(Compare these with the corresponding Formulas (16) and (17) of the preceding paragraph.)

3. Generating function of the negative binomial distribution. Let $\xi$ be a random variable of negative binomial distribution with expectation $\frac{r}{p}$; then we have

$$G_\xi(z) = \Bigl(\frac{pz}{1 - qz}\Bigr)^r.$$

From the generating function of a distribution one can obviously get all characteristics (expectation, variance, etc.) of the distribution. We shall now show that these quantities can all be expressed by the derivatives of the generating function at the point $z = 1$. Since the generating function is, in general, defined only for $|z| \le 1$, we understand by the "derivative at the point $z = 1$" always the left side derivative (provided it exists).

If the derivatives $G_\xi^{(r)}(z)$ of $G_\xi(z)$ exist at $z = 1$, we have the following relations:

$$G_\xi'(1) = \sum_{k=1}^\infty k\,p_k,$$

$$G_\xi''(1) = \sum_{k=2}^\infty k(k-1)\,p_k,$$

and, in general,

$$G_\xi^{(r)}(1) = \sum_{k=r}^\infty k(k-1)\cdots(k-r+1)\,p_k \quad (r = 1, 2, \ldots), \qquad (4)$$

where the series on the right is convergent. Conversely, it is easy to show that if the series in (4) converges, the derivative $G_\xi^{(r)}(1)$ exists and Formula (4) is valid. The number

$$M_s = E(\xi^s) = \sum_{k=1}^\infty k^s\,p_k \quad (s = 1, 2, \ldots) \qquad (5)$$

is called the moment of order $s$ of $\xi$ (hence $M_1$ is the expectation). Thus we have

$$G_\xi(1) = M_0 = 1,$$
$$G_\xi'(1) = M_1,$$
$$G_\xi''(1) = M_2 - M_1,$$

and, in general,

$$G_\xi^{(r)}(1) = \sum_{j=1}^r S_j^{(r)} M_j \quad (r = 1, 2, \ldots), \qquad (6)$$

where the $S_j^{(r)}$ are Stirling numbers of the first kind, defined by the relation

$$x(x-1)\cdots(x-r+1) = \sum_{j=1}^r S_j^{(r)} x^j.$$

Equations (6), if solved with respect to the $M_j$, give

$$M_1 = G'(1),$$
$$M_2 = G'(1) + G''(1),$$

and, in general,

$$M_s = \sum_{j=1}^s \sigma_j^{(s)}\,G^{(j)}(1), \qquad (7)$$

where the $\sigma_j^{(s)}$ are Stirling numbers of the second kind, defined by

$$x^s = \sum_{j=1}^s \sigma_j^{(s)}\,x(x-1)\cdots(x-j+1).$$

Equations (7) allow the calculation of the central moments of $\xi$, i.e. the moments of $\xi - E(\xi)$:

$$m_s = E\bigl([\xi - E(\xi)]^s\bigr) \quad (s = 2, 3, \ldots). \qquad (8)$$

In fact,

$$m_s = \sum_{r=0}^s \binom{s}{r} (-1)^r M_{s-r} M_1^r. \qquad (9)$$

For $s = 2$ we obtain the often used formula

$$D^2(\xi) = m_2 = M_2 - M_1^2 = G''(1) + G'(1) - [G'(1)]^2. \qquad (10)$$
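Formula (10) can be checked symbolically; the sketch below (an illustration added here, using the sympy library) does so for the Poisson generating function, recovering $D^2(\xi) = \lambda$:

    import sympy as sp

    z, lam = sp.symbols('z lam', positive=True)
    G = sp.exp(lam * (z - 1))            # Poisson generating function
    G1 = sp.diff(G, z).subs(z, 1)        # G'(1)
    G2 = sp.diff(G, z, 2).subs(z, 1)     # G''(1)
    print(sp.simplify(G2 + G1 - G1**2))  # formula (10): prints lam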


A convenient procedure to calculate moments (central or not) of higher
orders by means of the generating function is the following: Substitute
z = ew into Gf z) and expand the function G f e w) in powers of w:

00 00 (kw\s 00 ws 00
(e») = E Л E ^ r - = I ^ k E pjt k*
£ = 0 5 = 0 ** • 5 = 0 ^ • = 0
(11)
or
00 M
H ( (w) = Gs(e") (12)
j =0
The function Hf w ) is called the moment generating function of the random
variable f. In order to calculate central moments we put
(13)
A simple computation furnishes
_ “ m, ws
/« (» •0 = 1 + Z - J j - . (14)
s= 2 J!

Jj(w) is called the central moment generating function of £. Hf w) and I f w)


exist only if Gf w) is regular at z = 1. The necessary and sufficient condi­
tion for this is the existence of all moments of £ and the finiteness of the ex­
pression
ш Ш .
s - + oo V *5-

This condition is always fulfilled for bounded random variables and also
in case of certain unbounded distributions, e.g. the Poisson distribution and
the negative binomial distribution.
If $H_\xi(w)$ exists, then $I_\xi(0) = 1$, since $G_\xi(1) = 1$. But then there can be found a circle $|w| < r$ in which $I_\xi(w) \ne 0$, hence $\ln I_\xi(w)$ is regular. Put $K_\xi(w) = \ln I_\xi(w)$. Since $K_\xi(0) = 0$ and $K_\xi'(0) = I_\xi'(0) = 0$, we have for $|w| < r$

$$K_\xi(w) = \sum_{l=2}^\infty \frac{k_l\,w^l}{l!}. \qquad (15)$$

The coefficients $k_l = k_l(\xi)$ $(l = 2, 3, \ldots)$ are called cumulants or semi-invariants of the random variable $\xi$. If $\eta = \xi + C$ ($C$ being a fixed positive integer), then $k_l(\eta)$ and $k_l(\xi)$ are identical (since $G_\eta(z) = z^C G_\xi(z)$, thus $I_\eta(w) = I_\xi(w)$); hence the name semi-invariant. The meaning of the name "cumulant" will be explained later.

Between the first cumulants and the first central moments we have the following simple relations:

$$k_2 = m_2 = D^2(\xi),$$
$$k_3 = m_3, \qquad (16)$$
$$k_4 = m_4 - 3m_2^2.$$

These can be established by differentiating the equality $K_\xi(w) = \ln I_\xi(w)$. The function $K_\xi(w)$ is called the cumulant generating function of $\xi$.
Example. The cumulants of the Poisson distribution. Let $\xi$ be a random variable having a Poisson distribution with expectation $\lambda$. We have

$$G_\xi(z) = e^{\lambda(z-1)},$$
$$H_\xi(w) = e^{\lambda(e^w - 1)},$$
$$I_\xi(w) = e^{\lambda(e^w - 1 - w)},$$

hence

$$K_\xi(w) = \lambda(e^w - 1 - w) = \lambda \sum_{l=2}^\infty \frac{w^l}{l!}. \qquad (17)$$

In consequence, all cumulants $k_l(\xi)$ are equal to $\lambda$. In particular, not only the variance of $\xi$ but also its third central moment is equal to $\lambda$. This can also be seen by direct calculation.
In what follows we shall prove some properties of generating functions, properties which make the application of these functions a very fruitful device in probability theory.

Theorem 1. If $\xi$ and $\eta$ are two independent random variables, we have

$$G_{\xi+\eta}(z) = G_\xi(z)\,G_\eta(z) \qquad (18)$$

and, consequently,

$$K_{\xi+\eta}(w) = K_\xi(w) + K_\eta(w). \qquad (19)$$

Relation (19) states that in adding independent random variables, their cumulant generating functions, as well as their cumulants themselves, are added (or "cumulated"), since (19) implies

$$k_l(\xi + \eta) = k_l(\xi) + k_l(\eta) \quad (l = 2, 3, \ldots). \qquad (20)$$
Remark. For $l = 2$ relation (20) is already well known to us: the variance of the sum of independent random variables is equal to the sum of the variances. For $l = 3$, relation (20) shows that the same holds for the third central moments, too.

Proof. Equality (18) is proved by direct calculation; (19) follows immediately from (18) and from $E(\xi + \eta) = E(\xi) + E(\eta)$.

Theorem 2. If the distribution of the random variable $\eta$ is the mixture, with weights $\alpha_n$ $(\alpha_n \ge 0$, $\sum_{n=0}^\infty \alpha_n = 1)$, of the distributions of the random variables $\xi_n$ $(n = 0, 1, \ldots)$, then

$$G_\eta(z) = \sum_{n=0}^\infty \alpha_n\,G_{\xi_n}(z). \qquad (21)$$

Proof. The probability that the quantity $\eta$ is equal to the random variable $\xi_n$ is, by assumption, equal to $\alpha_n$. Thus, if $q_k = P(\eta = k)$ and $p_{nk} = P(\xi_n = k)$, we have

$$q_k = \sum_{n=0}^\infty \alpha_n\,p_{nk}. \qquad (22)$$

Consequently,

$$G_\eta(z) = \sum_{k=0}^\infty q_k z^k = \sum_{n=0}^\infty \alpha_n \sum_{k=0}^\infty p_{nk} z^k, \qquad (23)$$

where the order of the summations may be interchanged because of the absolute convergence of the double series. Relation (21) is herewith proved.

Theorem 3. Assume that the random variables $\xi_1, \xi_2, \ldots$ are independent and have the same distribution; let $G(z)$ be their common generating function. Let further $\nu$ be a random variable taking on positive integer values only, which is independent of the $\xi_n$. The generating function of the sum

$$\eta = \xi_1 + \xi_2 + \cdots + \xi_\nu \qquad (24)$$

of a random number of random variables is equal to $G_\nu[G(z)]$.

Proof. Theorem 3 is a consequence of Theorems 1 and 2. In fact, the distribution of $\eta$ is the mixture of the distributions $\mathcal{P}^n$ $(n = 1, 2, \ldots)$ with weights $\alpha_n = P(\nu = n)$, where $\mathcal{P}$ stands for the distribution of $\xi_1$. According to Theorem 1 the generating function of $\mathcal{P}^n$ is $[G(z)]^n$, hence by Theorem 2:

$$G_\eta(z) = \sum_{n=1}^\infty \alpha_n\,[G(z)]^n.$$

But, by definition, $G_\nu(z) = \sum_{n=1}^\infty \alpha_n z^n$. Hence

$$G_\eta(z) = G_\nu(G(z)), \qquad (25)$$

which finishes the proof of our theorem.
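Theorem 3 lends itself to a direct numerical check. In the sketch below (an illustration added here; the choices are arbitrary) $\nu$ is Poisson with expectation 2 and the $\xi_i$ are 0-1 variables with $P(\xi_i = 1) = p$; then $G_\nu(G(z)) = e^{2(q + pz - 1)} = e^{2p(z-1)}$, so by (25) $\eta$ must again be Poisson, with expectation $2p$:

    import numpy as np
    from math import exp, factorial

    rng = np.random.default_rng(0)
    nu_mean, p, trials = 2.0, 0.3, 200_000
    nu = rng.poisson(nu_mean, size=trials)
    eta = rng.binomial(nu, p)   # sum of nu independent 0-1 variables
    lam = nu_mean * p
    for k in range(6):
        print(k, np.mean(eta == k),
              lam**k * exp(-lam) / factorial(k))  # the two columns agree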


The generating function of the joint distribution of several nonnegative
integer valued random variables can be defined analogously. Let for in­
stance £ and q be two random variables assuming nonnegative integer val­
ues only (independence is here not supposed); the joint distribution of the
random variables <* and q is defined by the probabilities

rhk = P(£ = h,t1 = k) (h ,k = 0 , 1 , . . .) . (26)

The generating function of the joint distribution of the random variables


£ and q is defined by the series
oo oo
G(x, y) = £ £ rhk x hy k, (27)
A=0 k=0
where a and у are complex variables satisfying the conditions | x | < 1 ,
I у I < 1. Obviously, G(x, 1) and G(l, y) are the respective generating func­
tions of i and q. The probabilities rhk are uniquely determined by G(x, y),
namely
1 dh+k G(x,y)
Гкк hl kl dxhdyk *=°0' { '
If £ and q are independent, rhk = phqk with ph = Р(£ = И) and qk =
= P(q = k). From this it follows that G(x, y) = G4(x) Gn(y). Conversely,
from the latter relation follows the independence of £ and q.
Example. The trinomial distribution. Let the possible outcomes $A$, $B$, $C$ of an experiment mutually exclude each other and have the respective probabilities $p$, $q$, $r$ $(p + q + r = 1)$. Let us perform $n$ independent trials. Let $\xi$ denote the number of trials leading to the outcome $A$, and $\eta$ the number of those leading to the outcome $B$. The random variables $\xi$ and $\eta$ are not independent. The joint distribution of the random variables $\xi$ and $\eta$ is given by the probabilities

$$P(\xi = h, \eta = k) = \frac{n!}{h!\,k!\,(n - h - k)!}\,p^h q^k r^{n-h-k}. \qquad (29)$$

The generating function $G_n(x, y)$ is given by

$$G_n(x, y) = (px + qy + r)^n. \qquad (30)$$


If the number n of the trials is a random variable having a Poisson distri­
bution with expectation N, £ and q become independent from each other.
In fact, according to Theorem 2 (which can immediately be generalized to
the case of two dimensions) we obtain that the generating function of the
N"e~N
mixture of the trinomial distributions (30) with w eights--------- •, и = 0 ,1 ,...,
n\
is
G(x, y) = f — — Gn(x, y) = (3 1 )
n-0 П\
and since p + q + r = 1, we have

G(x, y) = eN4^ - v>; (32)


therefore <*; and ц are independent random variables with Poisson distribu­
tion and with expectations Np and Nq, respectively.
Conversely, £ and ц are only independent if the number of trials has a
Poisson distribution. In fact, if an = P(v = и), further if £ and rj are inde­
pendent, then
G(x,jO = (7(*,l)G (l,jO = f a„Gn{x,y). (33)
n-0
Let A(z) be the generating function of v, then according to (30) and (33) we
have
G{x, У) = A(p(x - 1) + q(y - 1) + 1).
Hence from (33)

A(p(x - 1) + q{y - 1) + 1) = A(p{x - 1) + 1) A(q(y - 1) + 1). (34)

If we put g(z) — A(z -f 1), g(z) satisfies the functional equation

g(a + b) = g(a)g(b). (35)

But from this follows because of the regularity of g(z) that g(z) = eNz.
Hence A(z) = eN(z~J); that is v has a Poisson distribution.
Now we shall prove the following theorem:

Theorem 4. If the sequence of distributions of random variables $\xi_1, \xi_2, \ldots$ (assuming nonnegative integer values only) converges to a probability distribution, i.e. if

$$\lim_{n \to \infty} p_{nk} = p_k \quad (k = 0, 1, \ldots) \qquad (36)$$

and

$$\sum_{k=0}^\infty p_k = 1 \qquad (37)$$

are valid for

$$p_{nk} = P(\xi_n = k) \quad (k = 0, 1, \ldots), \qquad (38)$$

then the generating functions of the $\xi_n$ converge, in the closed unit circle, to the generating function of the distribution $\{p_k\}$. Hence we have

$$\lim_{n \to \infty} G_n(z) = G(z) \quad \text{for } |z| \le 1, \qquad (39)$$

where

$$G_n(z) = \sum_{k=0}^\infty p_{nk} z^k \qquad (40)$$

and

$$G(z) = \sum_{k=0}^\infty p_k z^k. \qquad (41)$$

Conversely, if the sequence $G_n(z)$ tends to a limit $G(z)$ for every $z$ with $|z| \le 1$, then (36) and (37) are valid, i.e. $G(z)$ is the generating function of a distribution $\{p_k\}$ and the distributions $\{p_{nk}\}$ converge to this distribution $\{p_k\}$.

Remark. If (36) does hold while (37) does not, then (39) is valid only in the interior of the unit circle. This can be seen from the following example. Let $\xi_n = n$, hence

$$p_{nk} = \begin{cases} 1 & \text{for } k = n, \\ 0 & \text{otherwise}; \end{cases}$$

consequently,

$$\lim_{n \to \infty} p_{nk} = 0 \quad (k = 0, 1, \ldots),$$

but

$$\lim_{n \to \infty} G_n(z) = \lim_{n \to \infty} z^n = \begin{cases} 0 & \text{for } |z| < 1, \\ 1 & \text{for } z = 1, \end{cases}$$

while for $z = e^{i\vartheta}$ with $0 < \vartheta < 2\pi$ there exists no limit.

It can be seen from the same example that if we assume (39) to hold for $|z| < 1$ only, $G(z)$ will not necessarily be a generating function.

Proof of Theorem 4. First we show that (39) follows from (36) and (37). Let $\varepsilon > 0$ be an arbitrary number; choose $N$ such that

$$\sum_{k=N}^\infty p_k < \frac{\varepsilon}{4}, \qquad (42)$$

where $p_k$ has the sense given in (37); this is always possible because of (37). Choose next a number $n$ so large that

$$|p_{nk} - p_k| < \frac{\varepsilon}{4N} \quad (k = 0, 1, \ldots, N-1) \qquad (43)$$

holds, which is possible because of (36). Since $\sum_{k=0}^\infty p_{nk} = 1$, it follows from (42) and (43) that for $n$ large enough

$$\sum_{k=N}^\infty p_{nk} < \frac{\varepsilon}{2}. \qquad (44)$$

In fact,

$$\sum_{k=N}^\infty p_{nk} = 1 - \sum_{k=0}^{N-1} p_{nk} \le 1 - \sum_{k=0}^{N-1} p_k + \frac{\varepsilon}{4} \le \sum_{k=N}^\infty p_k + \frac{\varepsilon}{4} < \frac{\varepsilon}{2}.$$

It follows from relations (42), (43) and (44) that for $|z| \le 1$ and for sufficiently large $n$

$$|G(z) - G_n(z)| \le \sum_{k=0}^{N-1} |p_k - p_{nk}| + \sum_{k=N}^\infty p_k + \sum_{k=N}^\infty p_{nk} < \varepsilon,$$

which was to be proved.


Now we shall prove that (39) implies (36) and (37). From the assumption

$$\lim_{n \to \infty} G_n(z) = G(z) \quad \text{for } |z| \le 1$$

and from

$$|G_n(z)| \le G_n(1) = 1 \quad \text{for } |z| \le 1,\ n = 1, 2, \ldots$$

it follows, according to the known theorem of Vitali, that $G(z)$ is regular for $|z| < 1$ and that $G_n(z)$ converges uniformly to $G(z)$ in every circle $|z| \le r < 1$. Putting

$$G(z) = \sum_{k=0}^\infty p_k z^k$$

and denoting by $C_r$ the circle $|z| = r < 1$, we obtain that

$$p_k = \frac{1}{2\pi i}\int_{C_r} \frac{G(z)}{z^{k+1}}\,dz = \lim_{n \to \infty} \frac{1}{2\pi i}\int_{C_r} \frac{G_n(z)}{z^{k+1}}\,dz = \lim_{n \to \infty} p_{nk}.$$

From this (36) follows. Since $G(1) = \lim_{n \to \infty} G_n(1) = 1$, we get (37).
Example. By means of Theorem 4 another proof can be given of the fact that the binomial distribution converges to the Poisson distribution. Let $G_n(z)$ be the generating function of the binomial distribution $\mathcal{B}_n\bigl(\frac{\lambda}{n}\bigr)$; then

$$G_n(z) = \Bigl(1 + \frac{\lambda(z-1)}{n}\Bigr)^n.$$

Clearly,

$$\lim_{n \to \infty} G_n(z) = e^{\lambda(z-1)},$$

and since $e^{\lambda(z-1)}$ is the generating function of the Poisson distribution $\mathcal{P}(\lambda)$, our statement follows from the second part of Theorem 4.

It can be proved in the same manner that the negative binomial distribution $\mathcal{G}_r(p)$ converges to the Poisson distribution $\mathcal{P}(\lambda)$ for $r \to \infty$, if $(1 - p)r = \lambda$ is kept constant. In other words, if

$$P(\xi_r = k) = \binom{r + k - 1}{k} p^r q^k \quad (k = 0, 1, \ldots),$$

where $p = 1 - \frac{\lambda}{r}$ and $q = 1 - p = \frac{\lambda}{r}$, then the distribution of $\xi_r$ converges to the Poisson distribution $\mathcal{P}(\lambda)$. Since the generating function $G_r(z)$ of this distribution of $\xi_r$ (which is $\mathcal{G}_r\bigl(1 - \frac{\lambda}{r}\bigr)$ shifted so as to start at $k = 0$) is given by

$$G_r(z) = \Biggl(\frac{1 - \dfrac{\lambda}{r}}{1 - \dfrac{\lambda z}{r}}\Biggr)^r$$

and

$$\lim_{r \to \infty} \Biggl(\frac{1 - \dfrac{\lambda}{r}}{1 - \dfrac{\lambda z}{r}}\Biggr)^r = e^{\lambda(z-1)},$$

our statement follows from Theorem 4.
The reader may have noticed that the present and the preceding paragraphs deal substantially with the same problems. The only difference is that instead of the algebraic point of view the analytical viewpoint is favored here. Obviously, it means the same to say that the distribution $\mathcal{P}$ can be exhibited in the form $\mathcal{P} = g(\mathcal{E}_1)$, where $g(z)$ is a power series with nonnegative coefficients such that $g(1) = 1$ and $\mathcal{E}_1$ denotes the distribution $\{0, 1, 0, \ldots, 0, \ldots\}$, or to say that the distribution $\mathcal{P}$ has the generating function $g(z)$. In dealing with algebraic relations between distributions, the first point of view is entirely sufficient and the analytic point of view is superfluous. If, however, theorems of convergence are considered, the analytic point of view is preferable.
As an example of the application of generating functions, let us consider now a problem taken from the theory of chain reactions. Consider the chain reaction occurring in an electron multiplier. This instrument consists of so-called "screens". If an electron hits a screen, secondary electrons are generated, whose number is a random variable. These electrons hit a second screen, making free new electrons from it, whose number is again a random variable, etc. Suppose that the distribution of the number of secondary electrons produced by one primary electron is the same for each screen. Calculate the probability that exactly $k$ electrons are produced from the $n$-th screen. Let $\xi_{nr}$ $(r = 1, 2, \ldots)$ be the number of secondary electrons produced from the $n$-th screen by the $r$-th electron; assume that $\xi_{n1}, \xi_{n2}, \ldots$ are independent random variables with the same distribution which take on nonnegative integer values only. Let $p_k$ denote the probability $p_k = P(\xi_{nr} = k)$ $(k = 0, 1, \ldots)$. Let further $\eta_n$ denote the number of electrons issued from the $n$-th screen. We have then

$$\eta_n = \xi_{n1} + \xi_{n2} + \cdots + \xi_{n\eta_{n-1}}; \qquad (45)$$

in fact, the number of electrons emerging from the $n$-th screen is the sum of the electrons liberated by those emerging from the $(n-1)$-th screen. Thus the random variable $\eta_n$ is exhibited as the sum of independent random variables, the number of terms of the sum being equal to the random variable $\eta_{n-1}$. Put

$$G(z) = \sum_{k=0}^\infty p_k z^k \qquad (46)$$

and let $G_n(z)$ be the generating function of $\eta_n$. We have $G_1(z) = G(z)$ and it follows from Theorem 3 that

$$G_n(z) = G_{n-1}(G(z)) \quad (n = 2, 3, \ldots), \qquad (47)$$

hence

$$G_2(z) = G(G(z)), \quad G_3(z) = G(G(G(z))), \quad \text{etc.}$$

The generating function $G_n(z)$ is thus the $n$-th iterate of $G(z)$. Sometimes it is convenient to employ the recursive formula

$$G_n(z) = G(G_{n-1}(z)) \quad (n = 2, 3, \ldots). \qquad (48)$$

In general, we have

$$G_{n+m}(z) = G_n(G_m(z)). \qquad (49)$$

Let us compute from the generating function $G_n(z)$ the expectation $M_n$ of $\eta_n$. Put

$$M = \sum_{k=1}^\infty k\,p_k = G'(1). \qquad (50)$$

It is to be mentioned here that an electron multiplication, in the true sense of the word, takes place only if $M > 1$; in fact, only then can an increase of the number of electrons be expected (cf. the calculations below). In order to calculate $M_n$, differentiate (47) and put $z = 1$; then we have

$$M_n = G_n'(1) = G_{n-1}'(1)\,G'(1) = M_{n-1} M. \qquad (51)$$

Consequently,

$$M_n = M^n \quad (n = 1, 2, \ldots). \qquad (52)$$

The expectation of the number of electrons emitted from the $n$-th screen is thus the $n$-th power of the expectation of the number of electrons emitted from the first screen. For $M > 1$ this expectation increases beyond every bound for $n \to \infty$; for $M < 1$ it tends to 0. In the latter case the process stops sooner or later. Let us see now what is the probability of this. Let $P_{nk}$ be the probability that $k$ electrons are emitted from the $n$-th screen; in particular, we have

$$P_{n0} = G_n(0). \qquad (53)$$

It can be supposed that $G(0) = p_0$ is positive, since if $G(0) = 0$, obviously $P_{n0} = 0$ for $n = 1, 2, \ldots$.

The sequence $P_{n0}$ $(n = 1, 2, \ldots)$ is monotone increasing. This can be seen immediately: in fact, if no electron is emitted from the $n$-th screen, the same will hold for the $(n+1)$-st screen too; the converse, however, is not true. According to (53) we have

$$P_{n+1,0} = G_n(G(0)) \ge G_n(0) = P_{n0}, \qquad (54)$$

so the sequence $P_{n0}$ is monotone increasing. Since $P_{n0} \le 1$ for every $n$, the limit

$$\lim_{n \to \infty} P_{n0} = P \qquad (55)$$

exists. It follows from (53) that

$$P_{n0} = G(P_{n-1,0}); \qquad (56)$$


148 D IS C R E T E R A N D O M V A R IA B L E S [ I I I / § 15

$P$ is therefore a root of the equation

$$P = G(P). \qquad (57)$$

Since $G(1) = 1$, 1 is also a root of this equation. We shall show that for $M \le 1$ there exist no other real roots; in this case, therefore, the probability that no electrons are emitted from the $n$-th screen tends to 1 as $n \to \infty$. To prove this, draw the curve $y = G(x)$. Since $G(x)$ is a power series with nonnegative coefficients, the same holds for all its derivatives; $G(x)$ is therefore monotone increasing in the interval $0 \le x \le 1$ and is also convex. The equation $P = G(P)$ means that $P$ is the abscissa of an intersection of the curve $y = G(x)$ and the line $y = x$. Since $G(0) > 0$, $G(x) - x$ is positive for $x = 0$. Now if $G'(1) = M > 1$, then $G(x) - x$ is, because of $G(1) = 1$, negative in an appropriate left hand side neighbourhood of the point $x = 1$ (see Fig. 14). As $G(x)$ is continuous, there exists a value $P$ $(0 < P < 1)$ satisfying (57). Because of the convexity of $G(x)$ there can exist no further points of intersection.

It can be proved in the same manner that for $M \le 1$ Equation (57) has no real roots other than $P = 1$. (There can of course exist complex roots of (57).)

It remains to be shown that for $M > 1$ the sequence $P_{n0}$ $(n = 1, 2, \ldots)$ converges to the smaller of the two roots of Equation (57). This can be seen immediately from Fig. 15 by relation (47), which gives, in case of $M > 1$, for

every $z$ $(0 \le z < 1)$ the relation

$$\lim_{n \to \infty} G_n(z) = P, \qquad (58)$$

hence

$$\lim_{n \to \infty} P_{nk} = 0 \quad (k = 1, 2, \ldots). \qquad (59)$$

Fig. 15

Thus the probability that from the $n$-th screen there are exactly $k \ge 1$ electrons issued tends to 0 as $n \to \infty$, for each fixed value of $k$. From

$$\lim_{n \to \infty} P_{n0} = P < 1 \quad \text{and} \quad \sum_{k=0}^\infty P_{nk} = 1$$

it follows that for large enough $n$ the number of the emitted electrons (provided that the process did not stop) will be arbitrarily large with a (conditional) probability near to 1. This is in accordance with experience.
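The fixed-point characterization (57) may be illustrated numerically. In the Python sketch below (added here for illustration; the offspring distribution $\{p_0, p_1, p_2\} = \{0.2, 0.3, 0.5\}$, with $M = 1.3 > 1$, is an arbitrary choice) the iteration (56) is carried out and compared with simulated cascades; the smaller root of $P = G(P)$ is here $0.4$:

    import random

    probs = [0.2, 0.3, 0.5]                  # offspring distribution, M = 1.3
    G = lambda z: sum(pk * z**k for k, pk in enumerate(probs))

    P = 0.0                                  # iterate P_{n,0} = G(P_{n-1,0})
    for _ in range(200):
        P = G(P)
    print(P)                                 # converges to 0.4

    def dies_out(max_screens=100):
        electrons = 1
        for _ in range(max_screens):
            electrons = sum(random.choices((0, 1, 2), probs)[0]
                            for _ in range(electrons))
            if electrons == 0:
                return True
            if electrons > 1_000:            # growth is then effectively sure
                return False
        return False

    print(sum(dies_out() for _ in range(2_000)) / 2_000)  # near 0.4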

§ 16. Approximation of the binomial distribution by the normal distribution

In probability theory Stirling's formula is often employed in the following form:

$$n! = \sqrt{2\pi n}\,\Bigl(\frac{n}{e}\Bigr)^n \exp\Bigl(\frac{\vartheta_n}{12n}\Bigr) \quad (0 < \vartheta_n < 1). \qquad (1)$$

This can be proved by means of Euler's summation formula and Wallis' formula. We employ Euler's summation formula in the following form:

Let $f(x)$ be a continuously differentiable function in the closed interval $[a, b]$; let further

$$\varrho(x) = x - [x] - \frac{1}{2}, \qquad (2)$$

where $[x]$ denotes as usual the integral part of $x$, i.e. $[x] = k$ for $k \le x < k + 1$. Then we have

$$\sum_{a < k \le b} f(k) = \int_a^b f(x)\,dx - \bigl[\varrho(b)f(b) - \varrho(a)f(a)\bigr] + \int_a^b \varrho(x)\,f'(x)\,dx. \qquad (3)$$

Remark. If $a = A - \frac{1}{2}$, $b = B + \frac{1}{2}$, with $A$ and $B$ integers, we have $\varrho(a) = \varrho(b) = 0$ and instead of (3) we may simply write¹

$$\sum_{k=A}^B f(k) = \int_{A-\frac{1}{2}}^{B+\frac{1}{2}} f(x)\,dx + \int_{A-\frac{1}{2}}^{B+\frac{1}{2}} \varrho(x)\,f'(x)\,dx. \qquad (4)$$

In the present paragraph an approximation will be given for the terms

$$W_k = \binom{n}{k} p^k q^{n-k} \quad (k = 0, 1, \ldots, n;\ 0 < p < 1;\ q = 1 - p) \qquad (5)$$

of the binomial distribution. Put $z = k - np$, hence

$$k = np + z \quad \text{and} \quad n - k = nq - z. \qquad (6)$$

Evaluating asymptotically the binomial coefficient $\binom{n}{k}$ figuring in (5) by Stirling's formula, a simple calculation gives

$$W_k = \sqrt{\frac{n}{2\pi(np + z)(nq - z)}}\,\Bigl(1 - \frac{z}{np + z}\Bigr)^{np+z}\Bigl(1 + \frac{z}{nq - z}\Bigr)^{nq-z} e^{\varepsilon} \qquad (7)$$

with

$$\varepsilon = \frac{\vartheta_n}{12n} - \frac{\vartheta_k}{12k} - \frac{\vartheta_{n-k}}{12(n-k)}, \qquad (8)$$

where $\vartheta_n$ is defined by (1). We assume that the quantity

$$x = \frac{z}{\sqrt{npq}} \qquad (9)$$
¹ The proof of this formula can be found e.g. in K. Knopp [1].

remains bounded:

$$|x| \le A \quad (A = \text{constant}). \qquad (10)$$

For the different factors on the right hand side of (7) we obtain

$$\sqrt{\frac{n}{2\pi(np + z)(nq - z)}} = \frac{1}{\sqrt{2\pi npq}}\,\Bigl[1 - \frac{x(q - p)}{2\sqrt{npq}} + O\Bigl(\frac{1}{n}\Bigr)\Bigr] \qquad (11)$$

and

$$\Bigl(1 - \frac{z}{np + z}\Bigr)^{np+z}\Bigl(1 + \frac{z}{nq - z}\Bigr)^{nq-z} = e^{-\frac{x^2}{2}}\,\Bigl[1 + \frac{(q - p)\,x^3}{6\sqrt{npq}} + O\Bigl(\frac{1}{n}\Bigr)\Bigr]. \qquad (12)$$

According to assumption (10) we have¹

$$\varepsilon = O\Bigl(\frac{1}{n}\Bigr), \qquad (13)$$

and the constant figuring in the residual term $O\bigl(\frac{1}{n}\bigr)$ in Equations (11), (12) and (13) depends on $A$ only; thus we obtain from the relations (7), (11), (12), and (13) the following theorem:

Theorem 1. If $0 < p < 1$, $q = 1 - p$, and

$$W_k = \binom{n}{k} p^k q^{n-k} \quad (k = 0, 1, \ldots, n), \qquad (14)$$

further if

$$|x| = \frac{|k - np|}{\sqrt{npq}} \le A, \qquad (15)$$

then

$$W_k = \frac{e^{-\frac{x^2}{2}}}{\sqrt{2\pi npq}}\,\Bigl[1 + \frac{(q - p)(x^3 - 3x)}{6\sqrt{npq}} + O\Bigl(\frac{1}{n}\Bigr)\Bigr], \qquad (16)$$
¹ Here, as well as in what follows, the notation $a_N = O(b_N)$ is employed: if $a_N$ and $b_N$ $(N = 1, 2, \ldots)$ are sequences of numbers such that $b_N \ne 0$ and there exists a constant $C > 0$ for which $|a_N| \le C\,|b_N|$, this fact will be denoted by $a_N = O(b_N)$. (Read: "$a_N$ is of order not exceeding that of $b_N$".) If, however, $\lim_{N \to \infty} \frac{a_N}{b_N} = 0$, this will be denoted by $a_N = o(b_N)$. (Read: "$a_N$ is of smaller order than $b_N$".)

where the constant intervening in $O\bigl(\frac{1}{n}\bigr)$ depends on $A$ only.

In practice, usually the following weaker form

$$W_k = \frac{1}{\sqrt{2\pi npq}} \exp\Bigl[-\frac{(k - np)^2}{2npq}\Bigr]\,\Bigl[1 + O\Bigl(\frac{1}{\sqrt{n}}\Bigr)\Bigr] \qquad (17)$$

of Theorem 1 suffices.

Thus the probabilities $\binom{n}{k} p^k q^{n-k}$ are approximated by the values of the function

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\bigl[-(x - m)^2 / 2\sigma^2\bigr] \qquad (18)$$

at the points $x = k$, where the constants $m$ and $\sigma$ have the values $m = np$ and $\sigma = \sqrt{npq}$. This function is represented graphically by a bell-shaped curve (Fig. 16) called Gauss' curve (or Laplace curve, or "normal" curve). Function (18) plays a central role in probability theory.

Fig. 16

Theorem 1 can be "verified" experimentally for $p = \frac{1}{2}$ by means of Galton's desk.

Galton's desk (Fig. 17) is a triangular inclined plane provided with nails arranged regularly in $n$ horizontal lines, the $k$-th line containing $k$ nails. A ball, launched from the vertex of the triangle, will be diverted at every line either to the left or to the right, with the same probability $\frac{1}{2}$. Under the last line of nails there follows a line of $n + 1$ boxes in which the balls are accumulated. In order to fall into the $k$-th box (numbered from the left, $k = 0, 1, \ldots, n$) a ball has to be diverted $k$ times to the right and $n - k$ times to the left. If the directions taken at each of the lines are independent, the probability of this event will be $\binom{n}{k} 2^{-n}$. By letting a large enough number of balls roll down Galton's desk, their distribution in the boxes exhibits quite neatly a curve similar to the Laplace-Gauss curve.

Fig. 17

Theorem 1 states that the limit relation

$$\lim_{n \to \infty} \frac{\dbinom{n}{k} p^k q^{n-k}}{\dfrac{1}{\sqrt{2\pi npq}} \exp\Bigl[-\dfrac{(k - np)^2}{2npq}\Bigr]} = 1 \qquad (19)$$

holds if, together with $n$, also $k$ tends to infinity so that

$$\frac{|k - np|}{\sqrt{npq}}$$

remains bounded; under these conditions the convergence is even uniform. Formula (19) is the so-called de Moivre-Laplace theorem.
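Formulas (16), (17) and (19) are easy to examine numerically; the Python sketch below (an illustration added here, with the arbitrary choice $n = 100$, $p = 0.3$) compares the exact binomial probabilities with the normal approximation and with the corrected formula (16):

    from math import comb, exp, pi, sqrt

    n, p = 100, 0.3
    q = 1 - p
    s = sqrt(n * p * q)
    for k in (20, 25, 30, 35, 40):
        exact = comb(n, k) * p**k * q**(n - k)
        x = (k - n * p) / s
        normal = exp(-x * x / 2) / sqrt(2 * pi * n * p * q)
        corrected = normal * (1 + (q - p) * (x**3 - 3 * x) / (6 * s))
        print(k, exact, normal, corrected)  # (16) improves on (17)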

This result can be expressed in a more concise and practical, though weaker form. Let $W_k$ be the probability that during $n$ repetitions of an alternative the event $A$ occurs exactly $k$ times. If $n$ is very large, it is more reasonable to ask for the probability that the number $k$ of occurrences of $A$ lies between two given limits than for the probability that it assumes a fixed value.

Our problem may conveniently be phrased as follows: what is the probability that the inequality

$$np + a\sqrt{npq} \le k \le np + b\sqrt{npq} \qquad (20)$$

should hold, where $a$ and $b$ $(a < b)$ are two given real numbers? It follows from Formula (16) that this probability is

$$W^{(n)}(a, b) = \sum_{np + a\sqrt{npq}\,\le\,k\,\le\,np + b\sqrt{npq}} W_k = \frac{1}{\sqrt{2\pi npq}} \sum_{a \le x_k \le b} e^{-\frac{x_k^2}{2}}\,\Bigl(1 + \frac{(q - p)(x_k^3 - 3x_k)}{6\sqrt{npq}}\Bigr) + O\Bigl(\frac{1}{n}\Bigr), \qquad (21)$$

where $x_k = \frac{k - np}{\sqrt{npq}}$ was substituted. It will be seen that there exists a limit

$$\lim_{n \to \infty} W^{(n)}(a, b) = W(a, b) \qquad (22)$$

which can be calculated and the residual term estimated.


Choose the numbers a and b such that

A = -i- + np + a f npq and B = — ~ + np + by/ npq

are integers; this can always be done without changing the value of
W(n)(a, b). It follows then from (4) that

b ) ------ ' | V * (i + ( g - r t l g _ - 3 * ) I Jx + 0 i l l (23)


J2n J b J npq ) \n )
a
_—
Since Je 2 (xs — 3 x)dx can be given explicitly, we have

T heorem 2. I f

np + a yjnpq + ~ ~ = A and np + b у /npq — — = В


I ll, § 16] APPROXIM ATION OF T H E BINOM IAL D ISTR IB U TIO N 155

are integers {A < B), then


ь

^ J l ) pkg" k ’ ^ r ^ ^ ‘lx + S v a
<24a>
where

R= q Z L - ((1 - b 2)e~ 2 - (1 — a2) e~ 2) + 0 (— ] . (24b)


6 Jnpq n)

From (24a) and (24b) follows the limit relation


ß
lim I [n
A p k < f-k = - ^ L - [ e - * * dx (25)
n-cc ^ k - n g ^ ^ \k ) J 2 l l .1
Í npq *

for each given pair (a, ß) of real numbers (a < ß); it suffices in fact to replace
a by an, ß by bn, where a„ is the least number such that

A = n p + a „ yj npq + ~ > np + a J n p q

(A integer) and bn is the largest number such that (B integer)

В = np + bnJ npq - ~ < n p + ß J n p q .

Obviously,

a" d
hence
bn _ ** ß _ X*
lim f e 2 dx = f e 2 dx.
«-<*>& i
Thus the right hand side of (25) gives an approximate value for the proba­
bility that the number of occurrences of an event A (having the probability
P(A) = p) in an experiment consisting of n independent trials, lies between
the limits np + a J n p q and np + ß J n pq. To use this result we must have
the values of the integral
ß
I C
— 7= e 2 dx
Jlu J
a

_ **
for every pair (я, ß). The integral Je 2 dx cannot be expressed by elemen-
156 D IS C R E T E R A N D O M V A R IA B L E S [III. § 16

tary functions; however, the function


у xt
Ф(у) = —= ( e~*dx (26)
y/2 n —
Jco

is tabulated with a great precision and a table of its values is given at the
end of this volume (cf. Table 6 ). The curve у = Ф(х) is shown in Fig. 18.

It is easy to see that


+ 00

Ф(+ oo) = I_I e i d x = 1. (27)


V 2 tzj
— co

In fact, when introducing polar coordinates we find

+00 +00 00
1 Г Г Х*+Уг Г _ г*
Ф2(+ оо) = —— е 2 d xd y = \ re 2 dr = I.
—о о —оо О

From Theorem 2 follows immediately

T heorem 3. For every real у

Hm £ " I pk qn~k = Ф (у). (28)


n— oo k —np _ К J
tn p q У
m , § 17] B E R N O U L L I ’S L A W O F L A R G E N U M B E R S 157

In fact, it follows from (25) that for every sufficiently large N

lim £ Г pk qn- k > Ф(у) - Ф ( - N) (29)


«+<ю k -n p \K
Inpq
T = -_ < .У
and
Tim E ? Pk qn~* £ 1 - Ф(Ю + Ф(у). (30)
n - —oo k-n p
/«P? У
Because of (27), (28) follows.
The function Ф(х) is one of the most important distribution functions.
A random variable £ having for its distribution function Ф(х) or more
—YYi
generally Ф --------- (with a > 0 and m an arbitrary real number) is said
to be normally distributed or a Gaussian random variable. Ф(х) itself is called
the standard normal or Gaussian distribution function, or distribution func­
tion of Laplace-Gauss.
Theorem 2 may be expressed also by saying that for large n the binomial
distribution is approximated by the normal distribution.

§ 17. Bernoulli’s law of large numbers


The results obtained in the preceding paragraph allow us to prove a very
important theorem, Bernoulli’s “law of large numbers” .

T heorem 1. Let the event A be one o f the possible outcomes o f an experi­


ment, with probability

P(A) = p {0 < p < 1)


and let f A(n) be the relative frequency o f the event A in a sequence o f n inde­
pendent repetitions o f the experiment. Given two arbitrarily small positive
numbers e and 5, there exists a number N depending on e and S only such that
fo r n > N
P (\U n )-P (A )\< e )> \-ő . (1)

Proof. We have

P ( \ f A( n ) - P ( A ) \ < e ) = E \ nk \ p k 4n~k-
[ k —np\< № \ K J
Choose a number Y such that

Ф (Т)-Ф (-Т)> 1 - A . (2 )
158 D IS C R E T E R A N D O M V A R IA B L E S [III, § 17

For n > f - = Nx we have Y^Jnpq < ns, and it follows

Р (\Ш -Р (А )\< а )> I ( n\ p kqn- k. (3)


\k—np\ <Y ^npq '
According to Formula (25) of § 16 N2 can be chosen such that for n > N2

I _ ( nj\ Pkqn~k ^ Ф(Л - Ф (-У ) - J ; (4)


\k-np\<,Y/npq ' '
from (2), (3) and (4) it follows that (1) is verified for n > N = max (Nb N 2).
Bernoulli’s law of large numbers can also be proved directly, without the
use of the de Moivre-Laplace theorem.
Formula (1) is equivalent to

X [n \ p kqn~k > 1 - Ö (5)


\k-np\<en ^ ,
for every n > N.
The identity
£ ( k - n Py - i n \ kq"-k = npq (6 )
*=0 \ K!
(given as relation (5) in § 9) states that the variance of the binomial distri­
bution is equal to npq. Thus we have

npq > Y, ? pkq"~k (k - n p f > e V £ I ” pkqn~k


\k—np\^m к \k-np\>en I ^ /
and, consequently,

V ^ \ к n—k i V [ft к n—k ^ i


Ik—
i
np\ <en k** p ч = i - \k-np\^en
i к p q - 1 - ЬП ■

Thus for n > ■ „ = N relation (1) is verified. Since pq = p{\ — p) <


E-Ö
< ——, it suffices to take for N the value N = , „ - . We shall see in Chap-
4 4s1ö
ter VI that one can take for N a much smaller value as well.
The method of proof employed above is often used in probability theory.
Later on (in Ch. VII) it will be formulated in a more general form as the
inequality of Tchebychev.
Finally, some remarks should be added here concerning Bernoulli’s law
of large numbers.
I l l , § 18] E X E R C IS E S 159

In introducing the concept of probability, a number called probability


was assigned to events whose relative frequency possessed a certain stability
in the course of a long sequence of experiments. This stability of relative
frequency is now proved mathematically. It is quite remarkable that the
theory leads to a precise formulation of this stability; it is undoubtedly a
proof of its power.
At the same time it can be understood, why the stability of relative fre­
quency could not be defined precisely at the introduction of the concept of
probability. Indeed, in its formulation occurs the concept of probability:
the law of large numbers states just that after a long sequence of experiments
a large deviation between relative frequency and probability becomes very
improbable.
It may seem that there lurks some vicious circle here: probability was in­
deed defined by means of the stability of relative frequency, and yet in the
definition of the stability of relative frequency the concept of probability
is hidden. In reality there is no logical fault. The “definition” of the proba­
bility stating that the probability is the numerical value around which the
relative frequency is fluctuating at random is not a mathematical definition:
it is an intuitive description of the realistic background of the concept of
probability. Bernoulli’s law of large numbers, on the other hand, is a theo­
rem deduced from the mathematical concept of probability; there is thus
no vicious circle.
The theorem dealt with above is a particular case of more general theo­
rems which will be discussed in Chapter VI. Similarly, the approximation
of the binomial distribution by the normal distribution is a particular case of
the general limit theorems, to be dealt with in Chapter VII of the present book.

§ 18. Exercises

1. Suppose a calculator is so good that he does not make more than three errors
in the average in doing 1000 additions. Suppose he checks his additions by testing
the addition modulo 9 and corrects the errors thus discovered. There can, however,
still remain undetected errors: in fact, it may occur that the erroneous result differs
from the exact sum by a multiple of 9. How many errors remain in the average among
his additions?
H in t. It can be assumed that, if the sum is erroneous, the error lies with an equal
probability — in any of the residue classes 0, 1,2, 3, 4, 5, 6, 7, 8 mod 9. Let A denote
the event “ the sum is erroneous” , В the event “ the error could be detected by testing
the sum modulo 9” . The probability sought is the conditional probability P(.A \ B ) ;
according to Bayes’ rule it has the value — - .
160 D IS C R E T E R A N D O M V A R IA B L E S [III, § 18

2. A missing letter is to be found with the probability P in one of the eight drawers
of a secretary. Suppose that seven drawers were already tried in vain. What is the
probability to find the letter in the last drawer ?

3. Determine the maximal term of the binomial, multinomial, hypergeometric


and polyhypergeometric distributions.
4. Consider the terms

, В Д -1 Э
UJ
of the polyhypergeometric distribution, where
к i + k2 + . . . + kr = n у Ni + N 2 + . . . + N r = N.
P ro v e t h a t if t h e n u m b e r s N , ( j = 1, 2 t e n d t o in f in ity s o t h a t

lim = p, > 0 (J = 1, 2, . . . , r ) ,

we have for fixed values of къ k 2, . . . ,k r

ы р ь*... - Щ О T
Thus under the above conditions the terms of the polyhypergeometric distribution
converge to the corresponding terms of the multinomial distribution.
5. If
lim n p t = A, > 0 (J = 1, 2 , . . . , r — 1),
П—
*■+00
the multinomial distribution

j ______ a. _____ . A
i f t . кг\ . . . kr\ yr i
tends to an (r — l)-dimensional Poisson distribution. That is, for fixed k u k2, . . . , kr~j
(Ат, = n — кг —k2 — . . . — k r- i ) we have for n —* + 00
„I 3*! 3*/*—I
lim ------- ---------- f t f t . . . f t = 1' ■e-«rb~+A ,. .
f t f t . . . kr\ ^ Pr f t. . . . k rf t.
6. Deduce Formula (4) of §4 from Formula (12) of § 12, using the convergence of
the binomial distribution to the Poisson distribution.
kk e~x
7. Determine the maximal term of the Poisson distribution-------- (к = 0 , 1 , . . . ;
k\
A >0).
8. If 2 is constant and N = n ln n + X n , there exists a limit of the probabilities
Pk (n, N ) (cf. Ch. II, § 12, Exercise 42.b) for n -> ° ° : we have for any fixed real value
of X and any fixed nonnegative integer к

lim P k (и, л In и + Ая) = ■<« _ > * (* = 0 , 1 , . . . ) .


n-~oo k!
I l l , § 18] E X E R C IS E S 161

Thus, if we distribute J V = n I n n + l n balls into n urns, the number of the urns


which remain empty will be, for large n, approximately distributed according to a
Poisson distribution with expectation e~l .

, M R
9. If —— = p, — = r, and n-> oo so that
N N
lim np — Л > 0 and lim nr = p > 0
Л—*- + со tl —►oo
then
k —1 n - k —l

, П<м+''л) П Q f - M + j K)
h m ---------------- *3]---------------------------- =
— ; Пw
t=0
+ J v

Thus under the above conditions, the Pólya distribution tends to a negative binomial
distribution. If /i = 0, the above limit becomes -------- ; the limit distribution is then a
k\
Poisson distribution.
10. A roll of a certain fabric contains in the average five faults per 100 yards .
The cloth will be cut into pieces of 3 yards. How many faultless pieces does one expect
to find?
Hint. It can be supposed that the number of faults has a Poisson distribution.
The probability of finding к faults in an x yards long piece is therefore equal to

U -f,-*
20 -------- №- 0 ...........
k\
11. In a forest there are on the average ten trees per 100 m2. For sake of simplicity
suppose that all trees have a circular section with a diameter of 20 cm. A gun is fired
in a direction in which the edge of the forest is 100 m away. What is the probability
that the shot will hit a trunk?
Hint. It can be assumed that the trees have a Poisson distribution; the probability
that on a surface area T m2 there are к trees is equal to
( T \k --L

Each tree can be considered and represented by its centre.


12. In a summer evening there can be observed on the average one shooting star
in every ten minutes. What is the probability to observe two during a quarter of an
hour?
162 D IS C R E T E R A N D O M V A R IA B L E S [III, § 18

13. At a certain post office 1017 letters without address were posted during one
year. Estimate the number of days on which more than two letters without address
were posted.
14. Let tSj (j>) = {p , 1 - p } be a binomial distribution of order 1; let g(z) =
2 _(X
= -----------. Determine the distribution g [•%(?)]■
1 —az

15. L e t (p) be the same as in the preceding exercise. Show that

exp [A(AO) — 1)] = Ф(хр)


is the Poisson distribution with expectation Ap.
16. Let p be the probability of an event A. Perform n independent trials and denote
by / the relative frequency of A in this sequence of trials. With the aid of the approx­
imation of the binomial distribution by the normal distribution answer the following
questions:
a ) I f p — 0 .4 a n d n — 1500, w h a t is th e p r o b a b ility f o r / to lie b e tw e e n 0 .4 0
a n d 0 .4 4 ?
b) If p = 0.375, how many independent trials have to be performed in order
that the probability of \ f — p 1 < 0.01 is at least 0.995?
2
c) Let p = — , n = 1200. How should e be chosen in order that the probability
of I / —P I < e be at least 0.985?
d) Suppose и = 14 400. For which values of p will the probability of | / — p \ <
< 0.01 be at least 0.99?
17. Put
.V

i f -4 -
Ф(х) = ---- = e 2 du.
yj 2л '

Prove that the expansion


_ ч 1 1 - 4 -(X X3 X5 3
' ( - + T TT + - г т т + " ' )
holds.
18. Prove that for x > 1,
V
e" 2
Ф(х) = 1 — ----------- -------- ----- where 0 < 6 < 1.

Hint. Use the identityI

I (, + ^ K e_ 2 •
Ш , § 18] E X E R C IS E S 163

19. Suppose P(A) = • Perform N independent experiments; let f(n) be the


number of the occurrences of A among the first n experiments (n = 1, 2, . . .,7V);
put /(0 ) = 0. Verify the formula

for M = 1, 2..........
20. Applying the result of Exercise 19 prove that

lim PN (x J N ) = 2Ф(х) - 1 (x > 0).


N— f +CO

21. The function of two variables *Р(х, у) = ф {—— ) fulfils the partial differential
{yjy >
equation of heat-conduction
= 1
дУ 2 dx2
the function

U(x'" ) = k
K< 2
fulfils the difference equation

AnU = i - AlU
where
A„U = U(x, n + 1) - U(x, n),
XU = U(x + 1, n) - 2U(x, n) + U(x - 1, л).
A2
22. Prove the asymptotic relation

(n\ к nк 1 (к — "РУ
к ) 2n npq L 2npq

under the following conditions: p is a constant, n and к tend to infinity so that


(k -n p y
«-*.+<» ft
23. Prove the asymptotic relation
IM i i N — M i
\ k ) \ n —к ) 1 (k — np)2 '
-------- —---------- ä — --------- e x p ------- ---------
IN j 2л npq 2npq

where
• --
TV—> oo, M — pN, 0 < p < 1, q — \ — p, n —> n= o( J n )
164 D IS C R E T E R A N D O M V A R IA B L E S [III, § 18

and
Iк — np\ =
Thus also the hypergeometric distribution can be approximated by the normal
distribution.
24. Establish the following power series expansion:

Ф{х) =

1 1 X* X5 X7 ( - l)*x2*+1
~ T + Г " 1 ! 2 . 3 + 214-5 3 1 8 - 7 + ' ' ' + k \ 2 k{2k + 1) + " 'J ’

How many terms are to be taken to calculate Ф(2) with an accuracy of four decimal
digits?
25. If X is positive, the difference
1 -
1 — Ф (х )------ ----- e 2 X
V 271
1 1 1-3 1 • 3 ■5 ( - 1)* 1 ■3 . . .{2k - 1)
+ ^ X7 +■■•+ x 2*+ 1

is positive or negative according as к is odd or even. The value of the function


i — Ф(х) is thus always contained between the k-th and {k + l)-st partial sums of the
divergent series
1 - f Г1 1, 1- 3 1- 3- 5 ,
yj 2л Lx X3 X5 X7 J

How many terms are there to be taken to calculate Ф(4) with an accuracy of 10“"?
26. Show that

1 + (' - • '* ) ' , +


-------- ---------------— < Ф(х) < ---------- ----------- for X > 0.

27. (Method of Laplace.) Let /(x ) be a complex valued function continuous in a


neighbourhood of x = a such that
/(a ) = 1, l/(x)| < 1 for x # a, f"{d) = - b < 0.
Let further {^„(0} be a sequence of complex-valued functions such that

lim g„ \a + ---- I = A
П-*- oo \ nb) )
be uniformly fulfilled for t in every finite interval. Show that for every x > 0, у > 0
we have

__ °+ ^ A x
J gÁ t)fV ) dt = 5 e~ 'r du-
У ^ —У
inb
I l l , § 18] E X E R C IS E S 165

28. Show that for every real value of x

e~^
lim £ _ _ _ = ф(х).
A -v o o k < A + x i h

Hint. Use the result of the preceding Exercise.


29. Show with the aid of the result of Exercise 27 and the relation
i
t f " ) P1 = (n - k) ( " ) J> (1 - I)"-*-1dr
P

(0 < p < l; q = l — p) directly the validity of the limit relation

lim
n —►со
Z
к —пр
( \ ) / <?”“* = Ф(х).
\ К )
^npq~ <X
30. Prove the following strong form of Stirling’s formula

”! = (t ) y /Ynn exp! - I ^ r - w ) ’0 < y "< L


31. In a factory at any instant t each of n machines is at work with probability p
and is under repair with probability 1 — p. The machines are operated independently
of each other. What is the probability that at a given instant at least m machines are
working? Calculate an approximate value of this probability for n = 200; p = 0.95;
m = 180.
32. Prove the well-known approximation-theorem of Weierstrass in the following
manner: Show with the aid of the inequality

\ к —пр\>ПЕ V K ) пь

deduced in course of the proof of Bernoulli’s law of large numbers, that for any
function f(x), continuous in the closed interval [0, 1], the so-called Bernstein poly­
nomials of f{x)

Bn(x)= i U ) f [~v )*‘(i - x)n~k


converge uniformly to f(x ) in the interval [0, 1] if n —> oc..
33. Find the limit

= lim Z [ l\ for “ > °-


34. The following problem occurs in statistical mechanics: Let Ех,Е г, be
the possible values of the energy of a particle belonging to a system of N particles.
If a particle has the energy Ek, it is said to be in the k-th state. The state of the whole
system can be characterized by giving the “occupation numbers” Nu Nz, . . . ,N n;
166 D IS C R E T E R A N D O M V A R IA B L E S [III, § 18

N k being the number of particles having the energy Ek (k = 1, 2, . . . , « ) . Let Wk be


the probability of a particle being in the state k, W(NX, N2, . . . , N„) the probability
that the system is in the state characterized by the occupation numbers Nx, N,,,..., N„.
By assuming that the states of the particles are independent from each other, we have,
obviously

W( NX, N , .......A„) = N , N T' N , - W "' w "'- ■ W» n- (0


with
N = N x + N2 + ... + N„. (2)

The probabilities of the possible states have therefore a multinomial distribution.


If, however, the total energy E of the system is given, not all these states can actually
occur: besides condition (2) the following one must be fulfilled, too:

f j Nk Ek = E . (3)
ft=l
According to the definition of the conditional probability, probabilities (1) fulfilling (3)
are simply multiplied by a constant factor. Find the values of N x, N2, . . . , N„ fulfilling
(2) and (3) for which the expression (1) takes on its maximal value.

Hint. Consider the numbers Nk (k = 1, 2, . . . , n) as continuous variables and replace


in (1) the factorials Nk\ by Г(Нк + 1), then apply the well-known identity

In Г (N + 1) = ilV + 4 In N - N + In v '2 я - j -
0

where g(x) = x — [x] — — . By differentiation1 we obtain

r ’{ N + \ ) 1 Г o(x)dx
Г (A + 1) - П + IN + J (x + N )- '
о
By Lagrange’s method of multipliers, the conditional extremum of (1) under the
conditions (2) and (3) can be found. Thus it follows that
Wk e ~PE*
N k x N —— Í---------- ,
£ W ,e-to
/=i
where the constant ß must be chosen so that (3) is satisfied. This is Boltzmann's energy
distribution.
35. Let 0 < p < — , g = l - p, и an integer such that np — m is also an integer.
Let A be an event with probability p. Show that during the course of n independent

1 Often this formula is deduced by differentiating Stirling’s formula; this of course


is inadmissible. Our procedure is correct, since we differentiate an identity and not
an asymptotic relation.
I l l , § 18] E X E R C IS E S 167

repetitions of an experiment, having possible outcomes A and A, the probability that


A occurs less than m times is greater than the probability that A occurs more than
m times (Simmons’ Theorem).
Hint. The inequality to be proven is

E i к
п \ р кя"~к > E [ I U v -* .
A' = 0 K ) k=m+ 1 \ K )
By putting (1)
Br = I n j pm- ' qn- m+r ( r = 0 , 1 , . . . , m),
\ m —r )

C, = I " Ip m+ r q n - m- r (I- = 0, 1, . .. . n — m).


[m + r )
(1) may be written in the form
m n—rn
E Br < E c ' • (П
0
Br
P u t ---- = D„ then we have
Cr
D,+i j _ ____ (p - q) (r- + r - npq)
Dr (n — m - r) (n — m + r + 1)p~
Dr+.
thus ------- — 1 is positive for small values of r. It decreases as r increases and is
Dr
negative for r > s, where s is the least integer for which s(s + 1) > npq. As D0 = 1,
D1 — — = npCt g > 1 there exists an integer к > 1 such that E E > \ for
C, npq + p C,
Br
r — 0, 1, . . . , к — 1, — < 1 for к < r < n — m. For this value of к we get
Q
*-i it-i
У (k r - 1) Br > £ (А - r - 1) Cr (2)
r= 0 r = 0
and
m n—m
£ ( r - A + l)Sr > £ (r — A + 1) C. . (3)
r= k r= k

From the identity

E 0 (* -* " ) ( J ' J / ' V * = °


it follows
m n —m
E rB,= у rCr . (4)
r = 0 r= 0

From (2), (3) and (4) it follows


m n —m
(к - 1) У Br > (А - 1) £ С ,,
t = 0 r —0

which was to be proved. This proof is due to E. Feldheim.


168 D IS C R E T E R A N D O M V A R IA B L E S [III, § 18

36. Prove the following asymptotic relation for the terms of the multinomial
distribution. For
Г
П -> + Y j k l =* n > I k l - nP i I = ° ( J n)
1=1
we have
-------
kx\ k2\.—--------
. . k r\ PÍ1Pz‘ ■■■Pr’
Vr ~

« ------ — --------1 — e x pГ---- -—


1 V> ----------------
(ki - nr')2 1
( ^ 2 л п ) г~1 J p i P i - ■-pr 2n 1 = 1 Pi

Hint. Use Stirling’s formula.


37. An urn contains N cards labelled from 1 to N. Perform n drawings without
replacement. Let f denote the least number drawn. Find the distribution of the random
variable (.
38. Let { and ?; be nonnegative integer valued random variables such that if the
value of j? is fixed £ has a Poisson distribution and conversely. Show that

A' uk vlk
Rlk = P(Z=j,ri = l c ) = C - ^ — ( / , * = 0, 1, . . . ) ,

where А, ц, v are positive constants, and


1 “ » A’ n k v ’k

C ~ /to *=o /! A! ‘
For the independence of { and ?? it is necessary and sufficient that v = 1 should hold.
The distribution 7?ilt is therefore a generalization of the Poisson distribution for two
dimensions. (Distribution of N. G. Obreskov.)

39. Let £ and r\ be two independent random variables both having a Poisson
distribution with expectation A. Determine the distribution of f — rj.
40. Each of two urns contains N — 1 white balls and one red ball. Draw from
both urns n balls (n < N) without replacement. Put now all 2 N balls into one and
the same urn and draw 2n balls without replacement. In which one of the two cases
is it more probable to obtain at least one red ball?
41. Let A be the disintegration constant of a radioactive material. Let the proba­
bility of observing the disintegration of any one of the atoms be denoted by c (c is
proportional to the solide angle under which the counter is seen from the point from
where the radiation starts). Let N denote the number of the atoms at the time t = 0,
the number of disintegrations observed in the time interval (0, f). Prove by applying
the theorem of total probability that has a binomial distribution.
Hint. The probability that exactly n atoms disintegrate during the interval (0, t) is

(^ 1 (1 - e- x')nе -ш - п);
\ П)
the probability that among them к disintegrations are observed, is | ^ jc^fl —c)"~k.
U I, § 18] E X E R C IS E S 169

The theorem of total probability gives

P(í, = * ) = £ ( " ) ( ! - e-*')" (" Jck (1 - c)"~k =

= ( c ( l - e-*'))k (1 - с (1 - e - x'))N~k .

Note that because of с (1 — e~Xl) < 1 — e~cXl somewhat fewer disintegrations are
observed than when the value of the disintegration constant would be Xc and all
disintegrations would be visible. But this difference is only important for large values
of t.
42. Let £„ £2, .. ., c:r be independent random variables with the same negative
binomial distribution of order 1:
K i k = «) = (1 - P ) P " - ' ( « = 1 , 2 , . . . ; к = 1, 2 , . . . , r; 0 < p < 1).
Let V be a randcm variable, independent from , with a Poisson distribution
an d expectation X. Determine the distribution of

f = + íj + ... + í,+l-
Hint. By using the notations of § 14, the distribution of £i + £ * + . . . + i*+i
can be written as
Sk+l — - - - - ,
that of C is given therefore by

£P= y f Г* =
k=o k\ \ \ — pgy) 1 — pSi

where x — — —---- i l A . _ jt ;s known that


P

7 —— e l-t = £ z*>
1 z k=0
where the

£*w = ; £ ■ £ - {x* e~X)


are the Laguerre-polynomials.1 It follows

P(f = «) = (1 - p)e- i Ln l (1 ~ p) p -' ( « = 1 , 2 , . . .).

43. Calculate the expectation of the number of marked fishes at the second capture
(cf. Ch. II, § 12, Exercise 21), if there are 10 000 fishes in the lake and if at the first
capture 100 fishes are marked.

44. Calculate the expectation of the number of matches in one of the boxes in
Banach’s pockets at the moment when he found the other box empty for the first
time (cf. Ch. II, § 12, Exercise 14).
lCf. e.g. G. Pólya and G. Szegő [1].
170 D IS C R E T E R A N D O M V A R IA B L E S [III, § 18

45. Calculate the expectation of the sum defined in Chapter II, § 12, Exercise 46:
X — kx + k2 + . . . + kM.
Hint. Let be the distribution of a random variable which assumes the values
0, 1, 2, . . к — 1 with the probabilities -j- :
к

М т 4 .....T - * 0........
Show that the distribution of X — M( M + l)/2 can be written in the form

■■■ÍFv-M+i

M ( N + 1)
From this it follows that E(X) = ------ —-------- .

46. Suppose that a player gambles according to the following strategy at a play
of coin tossing: he bets always on “tail” ; if “head” occurs, he doubles his stake in
the next tossing. He plays until tail occurs for the first time. What is the expectation
of his gain ?

Hint. If the tail occurs at the я-th toss for the first time (the probability of this
event is ■— ), the gain of the player, if his bet at the first toss was 1 shilling, will be
1 shilling, since
П—1
2" - I 2* = 1.
k =0

The expectation of the gain is thus

Z yz - b
/ 1=1

It seems that with this strategy the player could ensure for himself a gain. This,
however, would be true only if he would dispose over an infinite sum of money. His
fortune being limited, it is easy to show by a simple calculation that the expectation
of his gain is 0 even if he doubles the stake always when a head appears.
47. Calculate the expectation of the Pólya distribution.
48. The chevalier de Méré asked Pascal the following. Two gamblers play a game
where the chances are equal. They deposit at the beginning of the game the same
amount of money. They agree that he who is the first to have won N games gets the
whole deposit. They are, however, obliged to interrupt the game at a moment when
the one player gained N — n times and the other N — m times (1 < n < N; l < m <
< N). How is the deposited money to be distributed ? Calculate this proportion for
я = 2 and m = 3.
Hint. The distribution of the deposited money is said to be “fair” if the money
is distributed in the proportion p n : p m, p„ denoting the probability that the first
gambler would win and p m the probability that the second. Thus each gambler receives
I l l , I 18] E X E R C IS E S 171

a sum equal to this expectation. The problem is thus to calculate the probability that
the first (or the second) wins, under the condition that he already won N — n
(i. e. N — m) games.
49. In playing bridge, 52 cards are distributed among four players. The values
of the cards distributed are measured by the number of “tricks” in the following
manner: If a player has the ace and the king of the same suit, this amounts to 2
tricks; ace and queen of the same suit without the king t o l — ; king and queen without

ace to 1; ace alone to 1; king alone to ~ trick. What .is the expectation of the total
number of tricks in the hand of a player?
Hint. Obviously, the expectation of the number of tricks is the same for all players
and in each of the suits. Hence the expectation of the total number of tricks for a
player in all four suits is equal to the expectation of sum of tricks for the four players
in one suit. Thus it suffices to consider one suit only, e.g. spades. The expectation
of the tricks in the hand of one player is equal to the sum of the expectations of all
tricks present in the spades. However, this sum is equal to 2, except in the case when
the ace, the king, and the queen of spades are in the hands of different players; in
3
this case the sum of tricks is — . Hence the expectation looked for is 1.801.

M
50. a) There are M red and N — M white balls in an urn. We put --- = p. Draw
. N
n balls without replacement from the urn and let the random variables (k =
= 1, 2, ..., n) be defined as follows:
f _ [1 if at the k-th drawing a red ball is drawn,
’k } 0 otherwise.
Calculate R(í, , Sk) (1 < j < к < rí).
b) If
= i l + £2 + • • ■ + in
prove that

° 2(Q = np (1 - p) (\ - -Ü .
C H A P T E R IV

GENERAL THEORY OF RANDOM VARIABLES

§ 1. The general concept of a random variable

We have already introduced in Chapter III the general concept of a random


variable. Let [£2, <
j £ , P] be a Kolmogorov probability space. We understand
by a random variable a function of a real variable £ = £(m), defined for
each со £Q , such that every level-set of £ belongs to -xf. The level-sets of
£ = £(w) are the sets Ax defined by £(co) < x, where x is an arbitrary real
number. The function F(x) = P(AX) = P (f < x) is called (see Ch. Ill) the
distribution function of the random variable

T heorem 1. I f £ is a random variable and g{x) a Borel-measurable1function


o f the real variable x, then r\ = g(f) is also a random variable.

P roof . Let £ ~ \A ) be the set of those elementary events со £ Q for which


f(a>) £ A. We have clearly

(la)
n n
and
е л л - В) = Ц -\А ) - Z -\B ). (lb)
Let Ix denote the interval ( —0 0 , лг) and Ia b the half-open interval [a, b).
By assumption, £ - \ I x) = Ax £ *x£. Hence, according to (lb), £_1(/я>л) é
for every pair of real numbeis (a, b), a < b. Since is a cr-algebra, it follows
from (la) and (lb) that £- 1(A) £ for every Borel-set A of the real line.
Theorem 1 follows immediately.

T heorem 2. I f the distribution function F(x) o f a random variable £ is


given for every real x, then P[£,~\A)\ is uniquely determined for every Borel
subset A o f the set o f real numbers.

1 A function g(x) is said to be Borel-measurable if the level-set defined by g(x) < c


is a Borel-set for every real c. In particular, every continuous function is Borel-
measurable.
IV , § 2 ] D I S T R IB U T IO N - A N D D E N S IT Y F U N C T IO N S 173

P roof . This theorem follows immediately from Theorem 2 of Chapter


II, § 7.

§ 2. Distribution functions and density functions


Let F(x) = P (£ < x) be the distribution function of the random variable
If the random variables f and r\ are almost surely equal (i.e. if P (f # i/) =
= 0), then their distribution functions are obviously identical. In what
f ollows we shall establish some properties of distribution functions.
1. A distribution function F(x) is a nondecreasing function.
According to the definition of level-sets, we have Ax c Ay for x < y;
Viрпгр
F(x) = P(Ax)< P (A y) = F(y).

A distribution function F(x) is not necessarily continuous. It follows how­


ever from the monotonicity of F(x) that at any discontinuity point F(x) pos­
sesses both a left-hand side and a right hand side limit.
2. For any distribution function F(x) we have
lim F(x - h ) = F(x).
A->+0
Hence a distribution function is continuous from the left at every discon­
tinuity point. In fact, F(x) — F(x - h) is the probability that x — h < £ < x,
i.e. F(x) - F{x - h) = P(Bh), where Bh = AxAx_h. Obviously, the sample
space does not contain any element which belongs to Bh fo r every h > 0.
If an element со of Q belongs to Bh, we have always £(ю) < x. Choose now
a small enough h' > 0 such that h' < x - £(a>), then со will not belong
any more to Bh. Let {/;„}(«= 1 , 2 , . . . ) be an arbitrary monotonic sequence
of positive numbers tending to zero. To prove the left-continuity of P(x)
it is sufficient to prove that
lim P ( B J = 0 •
n-+ + CO

But this is a particular case of Theorem 3, Chapter II, § 7.


3. For every distribution function
lim F(x) = 0 and lim F(x) = 1
x — oo X-+ + CO

and thus we may write


F(— oo) = 0 and F (+ со) = 1.
174 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 2

In fact, if {x„} (n = 1, 2 ,. . .) is any sequence of real numbers such that


x n < x „ + 1 and lim xn = + oo, then AXn is a subset of AXn+i hence the sets
Л—00
Ax — AXn (n = 1 , 2 , . . . ) and AXi are disjoint and

F(xn) = P(Ax) + Z P ( A Xk+i- A Xk).


k =l
Since
00 00
P(AX) + Y P(AXk - Ax A = P( £ = P(0) = 1.
k=1 k=l

it follows that
lim F(xn) = 1.
n — + CO

Similarly it can be shown that


lim F(x) = 0.
—oo

All these results may be combined in the following theorem:

T heorem 1. The distribution function o f an arbitrary random variable is a


nondecreasing left-continuous function, which has fo r — oo and fo r + oo the
limits 0 and 1 , respectively.
кь.
The converse of this theorem is also true: Every function F(x) having
these properties can be considered as a distribution function. The proof
runs as follows: Let x = G(y) be the inverse function of у = F(x). (The de­
finition of G(y) is unique, if the following conventions are adopted: if F(x)
has a jump at x0, i.e. if F(x0) = a and F(x0 + 0) = b > a, we put G(y) = x0
for a < у < b ; if F{x) is constant and equal to y0 in the interval c < x < d
but F(x) < y0 for x < c, we put G(y„) — c.) If Q is the interval (0, 1), the
system of all Borel-measurable subsets of Q and P(A) is for A £ the
Lebesgue measure of A, then the function t](y) = G(y) defined for all у £ Q
is a random variable on the probability space [ß, P ] and the distribu­
tion function of rj(y) is

P(/l < x) = P(G(y) < x) = P(y < F (x)) = F(x).

Hence t] is a random variable with distribution function F(x).


If £ is a bounded random variable, i.e. if there exist two constants c and
C such that for every element со of the sample space Ü the inequality c <
^ £(co) < C holds, then clearly F(x) = 0 for x < c and F(x) = 1 for x > C.
If the random variable is “almost surely” constant, i.e. if there exists a
set A £ such that P(Ä) — 0 and fco) = c for every со £ A, then we obtain
IV , § 2 ] D I S T R IB U T IO N - A N D D E N S IT Y F U N C T IO N S 175

for the distribution function of £_

f 0 for X < c
F(x) = |
( 1 otherwise.

The distribution function of a constant is said to be a degenerate distri­


bution function. Evei у nondegenerate distribution function has at least
two points of increase, i.e. at least two points where F(x + h) — F(x) > 0
for every h > 0. If F{x) is the distribution function of a random variable £
which assumes only a finite number of values, л: is a point of increase of
F(x) if and only if <* takes on the value x with positive probability. The set
of jumps of a monotonic function, and thus in particular of a distribution
function, is necessarily finite or denumerably infinite. In fact, if the jumps
of the distribution function are projected upon the у-axis, a system of dis­
joint intervals is obtained because of the monotonicity of the function.
Our statemert can be deduced from the fact that every interval contains a
rational number and the set of all rational numbers is denumerable.
Distribution functions which are not only continuous but absolutely con­
tinuous, deserve particular attention. A distribution function is said to be
absolutely continuous if for any given positive number s there exists a 6 > 0
such that for every system of disjoint intervals

(ak, bk) (k = 1 , 2 , . . ., n; ak < bk)

the inequality

£
k= 1
(Pk ~ a k) < 0

implies

£ IF(bk) —F(ak) I < £.


k =1

Every absolutely continuous function, as is known, is almost everywhere


differentiable and is equal to the indefinite integral of its derivative. This is a
necessary and sufficient condition for the absolute continuity of a function.
If the distribution function F(x) is absolutely continuous we put f(x ) —
= F \x). If F(x) is at a point non-differentiable, f(x ) is not defined there, but
it can be defined arbitrarily. But such points are known to form a set of
measure zero. The function f(x ) is called the density function of the probabi­
lity distribution given by F(x). If F(x) is the distribution function of the
random variable £,, J\x) is called also the density function of
176 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 2

Example. We have already seen (Ch. Ill, § 13) that the function de­
fined by
Í 1 — e~Xx for > 0
Fix) = \
[0 otherwise,

with X > 0, is the distribution function of the life-time of a radioactive


atom. This function is absolutely continuous and we have

/„W, - FТ7Ц
W\ - [
i 0
fOT X> 0
f o r i < 0

(/(0) is not defined, as F'(0) does not exist.)


Thus the density function of the life-time of radioactive atoms exists and
is equal to Xe~Xx for x > 0.
Let c, be any random variable; let Ay be the event Q < у and Aa bthe event
a < £ < b; then we have Aa t = A bÄa. Now, if F(x) is absolutely continuous
and F \x ) = f{x), we have

P(Aa,b) = F(b) - F{a) = f f{x ) dx. (1)


a

From this follows immediately


+00
J fix ) dx = Fi+ oo) - F I - oo) = 1. (2)
—00
Conversely, every nonnegative measurable function fix ) fulfilling (2) can be
considered as a probability density. Indeed, the function

F(x) = j fit) d t (3)


—00
obviously has every property of a distribution function and F \x ) = fix )
holds almost everywhere.
Consider now the case when b — a = A a is small and F'(a) exists. Since,
by definition,
„ш
Да

it follows from this that


Pia < 11 < a + Aa) = fia)Aa + oiAa), (4)
where, as usually, o{Aa) represents a quantity which, divided by Aa, tends
to zero for Aa -* 0.
IV , § 3] M U L T IN O M IA L D IS T R IB U T IO N S 177

Example. In Chapter III, § 16 we encountered the standard normal distri­


bution having the distribution function
X
1 r --
F(x) = — e 2 dt.
■Jin J
— 00

The density function of the standard normal distribution is therefore

1 --
f(x ) = — = e \
J in

We need now the following well-known theorem: Every nondecreasing


monotonic function may be represented as a sum of three nondecreasing
monotonic functions, the first of which is a step function, the second is
absolutely continuous and the third is a continuous “ singular” function
(i.e. a continuous nondecreasing function whose derivative is almost every­
where equal to zero). From this it follows readily that every distribution
function can be written in the form

F(x) = PlFx (x) + p2F2 (x) + p3F3 (x),

where px, p2, p3 are nonnegative numbers having sum 1 and jF)(x) (i — 1 ,
2, 3) are the three distribution functions such that Fi(x) is the distribution
function of a discrete random variable, F 2(x) is an absolutely continuous
distribution function and F3(x) is a singular distribution function. This
decomposition is evidently unique.

§ 3. Probability distributions in several dimensions

By a random vector of n dimensions we understand an и-dimensional


vector £ = (<£], ... ,£ „ ) whose components £,■ are random variables on the
same probability space. The distribution function of the random vector £
is defined as the function of n variables

F{xx, x 2, . . x „) = P(£X < х ъ £2 < x2, . .., £„ < x„). (1)

The probability figuring on the right side of (1) is always defined; in fact,
let denote the level set of all со such that £Ä(<u) < x (k = 1 , 2 , . . ri),
then A ® and

П1
k=
178 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV, § 3

The right hand side of Equation (1) is exactly

Ч П 4*)-
k=l

If a problem of probability theory involves и random variables, these can


always be considered as components of an и-dimensional random vector.
In general, the function defined by (1) will be called the joint distribution
function of the random variables £ъ i 2, .. £„.
For example the value F(xb y\) = P{£ < x b rj < y }) of the distribution
function of a 2 -dimensional random vector represents the probability that
the endpoint of a random vector £ = (£, t]) beginning in (0 , 0 ) lies in the
quadrant of the (x , y) plane defined by x < х ъ у < у г.
Let us consider now some general properties of multidimensional distri­
bution functions.
1. F{xb . . x n) is a nondecreasing function o f every one o f its variables.
2. F(xb . . x„) is left-continuous in every variable.
3. F(x±, .. ., x„) = 0, i f at least one o f its variables is equal to — со.
4. F (x1, . . ., x n) = 1 i f all variables are equal to + oo.

Besides these trivial properties, every и-dimensional distribution has


another characteristic property which for n = 1 follows from the above
property 1. For n > 1, however, it does not follow from it. The probability
P(ak < £k < bk; к = 1, 2 , . . .) may be written in the form
n
P{ak < <!;* < bk, к = 1 ,2 ,. . ., ri) = £ ( - F(ca, c2, . . . , c„) (2)

where ck = ekak + (1 — ek)bk. The numbers ek assume independently of


each other the values 0 and 1. The sum on the right hand side of (2) has thus
2" terms. Thus, for instance, for n = 2 we obtain
P{ax < £ i< Ьъ a2 < £2 < b2) = b2) - F(gu b2) - F(bx, a.,) +
+ P(ab a2).

Formula (2) is a direct consequence of Theorem 9, Chapter II, § 3. In


fact, let Ak be the event £k < ak. Bk the event £k < bk (k = 1,2, . . . , и).
If we put

Л = £ А к, В = f f Bk,
k=l k =1
IV , § 3 ] M U L T IN O M IA L D IS T R IB U T IO N S 179

we find that
P(ak <Цк < Ь к ; к = 1 , 2 , . . n) = P{ÄB) = В Д - P ( £ AkB )‘.
k =1

If we now put Ck — A kB (k = 1, 2,. . и) we obtain (2) by applying the


above-mentioned theorem to the events C*.
As the left hand side of (2) is a probability, it is certainly nonnegative.
Thus we have established the sought property:
5. We have
n

£ ( - i y = ‘k F(e1üi + (1 - e j Ъъ . .. , s„an + (1 - e„) bn) > 0,


where eb e2, assume the values 0 and 1 independently o f each other
and ak < b k (k = 1 , 2 , n) are arbitrary real numbers.
Property 5 does not follow from properties 1-4. If for instance n = 2
and
су % f 1 if x x + x 2 > 0 ,
= jo otherwise,
properties 1-4 are fulfilled but property 5 is not, since for instance

F{2,2) - F ( - 1,2) - F(2, - 1 ) + F ( - l , - 1 ) = - 1 < 0.


By introducing the notation

A V F(x 1; . . x„) = F(xl f . . х к_ъ x k + h, х к+ъ ■■•>x n) - F ( x b . . . , x n)

we may write condition 5 in the following form:


5'. We have
A ^ A ^ . . . A ^ F { x b x 2, . . . , x , ) > 0

for hk > 0 andfor any real numbers x k (k — 1 , 2 , . . . , n). Here the “product”
of the (commutative) operations A$ means that they are to be performed
one after the other. It is easy to prove that if condition 5' holds for hx = h2 —
= . . . = hn = h > 0 it is valid in general.
Conversely, it can be shown that every function F{xu x2, . . . , x n) fulfill­
ing conditions 1-5 may be considered as a distribution function. This
follows from § 7 of Chapter II.
If the distribution function of the random vector £ = (£x, £2>• • •> £n) is
F(xx, x 2, . . x n) and В is a Borel-set of the и-dimensional space, then

P ( U B ) = $ . . . $ d F(xb . . . , x n),
J в J
180 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 3

where on the right-hand side figures a Stieltjes-integral; in other words


P(C 6 B) is equal to the value on the set В of the measure defined by the
function F in the и-dimensional space.
If the distribution function of an и-dimensional random vector is abso­
lutely continuous, then the density function

s, N 8nF(x1,x z, . . . , x n)
....... * - ) ° а , , г , , . . . э * , (3)

exists almost everywhere, and we have always

/ 0 * 1 , * 2 , . . . , x„) > о

because o f p ro p e rty 5. T h is follow s fro m

/( V ! ; . . . . *„) = urn .
л-o "

Further we have
Xi Хц
F(Xi, ...,*„) = J —co
... J
—oo
f{ t и • • tn) dtl . . . dtn ; (4)

hence in particular
+00 +00
J . . . J /( * ! ,. . . , x„) dx1. . . d x n = 1 .
—00 — 'oo
(5)
Further
ьi
< bk; к = 1 , 2 , . . . , и) = j' . . . J /(* 1 , ... , x„) dx1. . . dxn, (6 )
űi an

or, more generally, if В is a Borel subset of the «-dimensional space, then

Щ € B) = I •• • J /(* i, • • •>x n) dxi . . . dxn. (7)

In other words: the probability that the endpoint of the random vector £
lies in a Borel-set В of the и-dimensional space is equal to the integral on В
of /(* b . . ., x n).

T heorem 1. I f <р{хъ . . ., x n) is a Borel-measurable function o f n variables


and if £j, are random variables, then ц = . . ., £„) й айо a
random variable.
IV, § 4 ] C O N D IT IO N A L D IS T R IB U T IO N S 181

P r o o f . Let (~ \B ) be the set of those points w £ ß for which ((со) 6 В,


where ( is an и-dimensional random vector and В is a Borel-set of the «-di­
mensional space; clearly ( _1(B) From this Theorem 1 follows in the
same manner as Theorem 1 of § 1 was proved.
Let us remark that to every 3-dimensiona! probability distribution there
can be assigned a 3-dimensional distribution of the unit mass such that any
domain D contains the mass P(D). If f ( x ,y ,z ) is the density function of
the probability distribution in question, this same function will represent
the density of the corresponding mass distribution.

§ 4. Conditional distributions and conditional density functions

Let ( be any random variable, В an event of positive probability. Of course


В is assumed to belong to the probability algebra on which ( is defined.
The conditional distribution function o f ( with respect to the condition В is
defined as the function

F(x \B ) = P(( < x \ B ) = P(AX \ B),

where A x has the same meaning as in § 1. If the conditional distribution


function thus defined is absolutely continuous, its derivative f ( x | В) =
= F'(x I В) will be called the conditional density function o f ( with respect to
B. Evidently, if P(B) — 1, the ordinary distribution function and densitv
function are obtained.
Take for instance the conditional distribution function of the life-time
of a radioactive atom with respect to the condition that it did not disinte­
grate until the moment t0. As is already known, this is equal to the ordinary
distribution function if t is replaced by t — t0. Let B0 be the event: the atom
did not disintegrate until the moment t0. Then we have

F(t\ / n = f1 for '> 'o ,


0 1 0 otherwise
and
f(t i ß ч _ f for t > t0,
for t < r0.

If {#„}(« = 1, 2 , . . . ) is a complete system of events with P(B„) > 0,


we have
F(x) = £P(B„)F(x\B„) (1)
n
and
f(x) = l P ( B n) f ( x \ B n). (2)
n
182 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 5

For the generalization of the concept of the conditional distribution func­


tion and conditional density function see Chapter V, § 2.

§ 5. Independent random variables

Two random variables E, and ц are said to be (stochastically) independent,


if for every real x and у

Щ < X, ij < y) = P(£ < x) P(t] < y), (1)

i.e., if the tw o-dim ensional distribution function o f (£, rj) is equal to the prod­
uct o f the distribution functions o f £ and rj. From (1) is readily deduced
that
P (a <£ < b, c < rj < d) = P (a < <
I < b) P (c <ц < d) (2)

and, more generally, for any two Borel-sets A and В (cf. Theorem 2 below):

P(Z £ A, r i ZB) = P(£ £A )P (n £B) . (2')

For discrete random variables, this definition of independence coincides


with that given in Chapter III, § 5.
The independence of several random variables may be defined in a simi­
lar manner. The random variables £1; £2, are said to be independent,
if for every system of real numbers х ъ jc2, •••,*„ the relation

p ( ti = n m < x k) (3)
k=l

holds. If the random variables <J2, are independent, any к {к < rí)
chosen arbitrarily from them are independent as well. To see this, it suffices
to substitute Xj = + 0 0 , where the./-s are the indices of the random variables
which do not figure among the chosen k.
The converse of this relation does not hold. For example the fact that £, rj,
C are pairwise independent does not imply their independence. We have
already seen this in the preceding Chapter.
If (*2 , . . ., £„ are discrete random variables, the above definition of
independence is equivalent to the definition given in the preceding Chapter.
We shall prove now some simple theorems about independent random
variables.

T heorem 1. + c o n s ta n t is in d e p e n d e n t o f e v e r y r a n d o m v a r ia b le .
IV , § 5] IN D E P E N D E N T R A N D O M V A R IA B L E S 183

Proof. If ц = с (с = constant), we have

P(A< r\<y)=
p ,. .
X,
^ s f Щ < x) for c < y,
[ 0 otherwise,
hence ( 1 ) is valid.

T heorem 2. Let £t, £2, . . . , £ „ be independent random variables and let


gk(x) (k = 1 , 2 , . . . , n) be Borel-measurable functions; then the random
variables rjk = gk( tk) are independent.

P roof . If Въ . . . , Bn are Borel subsets of the real axis, it follows from (3)
that
р<Л1 6 в ъ . . . , e я„) = п € **)• (4 )
Л= 1

In fact, if Вх, . . ., Bn are unions of finitely many intervals, (4) follows from
(3). Let now B2, B3, . . . , B„ be fixed and let Bx alone be considered as vari­
able: thus both sides of (4) represent a measure. The theorem about the
unique extension of a measure (Ch. II, § 7, Theorem 2) can be applied here
and it follows that (4) is true for any Borel-set BY. Let now be Вг an arbi­
trary, fixed Borel-set and let B3, Вn be fixed sets, each of them being the
union of finitely many intervals. By repeating the preceding reasoning it
can be seen that (4) remains valid, if B2 too is an arbitrary Borel-set. By
progressing in this manner (4) can be proved. Theorem 2 follows immediately
from (4).
In particular it follows from Theorem 2 that the random variables
Vk = ak€k + bk (k=l,2,...,n)
where ak and bk are arbitrary constants, are independent if § l 5 . . .., t n are
independent.
Furthermore, it follows from Theorem 2 that for independent random
variables Formula (3) remains valid if for one or several values
of к on both sides one of the expressions < x k, > x k, or i> x k will be written
instead of < x k.

T heorem 3. I f are independent random variables with den­


sity functions f { x ) ,f A x ) ,. . f n(x), respectively, then the joint distribution
o f the random variables is absolutely continuous with density func­
tion
Д х ь ..., x„) = f l fk{xk). (5)
k=1
Conversely, (5) implies the independence o f the random variables £l9. .
184 GENERAL THEORY OF RA ND O M V A R IA B L E S [IV , § 6

P roof . (5) follows from (3) because of Formula (3) of § 3. Conversely, (3)
is obtained by integrating (5).

T heorem 4. L e t £2, b e in d e p e n d e n t r a n d o m v a r ia b le s a n d le t
A(xj, . . x k) b e a B o r e l-m e a s u r a b le f u n c tio n o f к v a r ia b le s (Jc < ri). T h en
th e r a n d o m v a r ia b le s

h(f 1 , • • t o , & +i ,
are independent.
The proof is similar to that of Theorem 2.
The independence of two random vectors, £ = (£ъ . . £„) and n =
= (nu . . Пт) can be defined as follows: £ and rj are said to be indepen­
dent, if the equality
P(Zi < хъ . . £„ < x n\ r\i < Ух,. . ., i/m < y m) =
= Pß I < *1, • • In<Xn ) f (»?X < Уь fim<Ут
•• ) ( 6)
is identically fulfilled in the variables Xj and y k.

§ 6 . The uniform distribution

The random variable £ is said to be uniformly distributed on the interval


(a, b) (a < b) if its density function is

0 for X < a and for b < x,

/(* ) = 1 ( 1)
—------ for a < x <b.
b —a

At the points x = a and x — b /(%) can be defined arbitrarily1. The corre­


sponding distribution function is
0 for t < a,

F (x ) = ~ — — for a < x < b, (2 )


b —a

1 for x> b.

The uniform distribution of a random vector can be defined in a similar


manner. An n-dimensional random vector £ = (£ь . . . , £„) is said to be
uniformly distributed on a nonempty open set G of the и-dimensional space
1 Sometimes f(x) is also called “lectangular” density function.
IV , § 6 J T H E U N I F O R M D I S T R IB U T IO N 185

with finite «-dimensional Lebesgue-measure, if the density function of the


random vector is given by

— —— for (xl t . . . , x„) £ G


f ( x i, . . . , x n) = Vn(G) (3 )
0 otherwise,

where n„(G) is the “volume” (the «-dimensional Lebesgue-measure) of G.


We already encountered uniformly distributed random variables in Chapter
II, § 10 in connection with the geometric probabilities. The geometrical de­
termination of probabilities is nothing else than the reduction of the prob­
lem to certain uniformly distributed random variables. In fact when one
deals with geometric probabilities, it is always assumed that the probabi­
lity of a point to lie in an interval of the real axis (in a domain of
the plane, space, or more generally, of the «-dimensional space) is propor­
tional to the length of the interval (to the area or volume of the domain
in question). But this means that the random variable considered (or the
random vector) is uniformly distributed. If £ is uniformly distributed on the
interval (a, ft), then, according to (1) and Formula (1) of § 2, the probability
that £ lies in a subinterval (c, d ) ( a < c < d < b) of {a, b) is given by

f f(x ) d x = d — - , (4)
c b -a

thus it is indeed proportional to the length of the interval (c, d).


A similar statement holds also in the multidimensional case. If £ is a
random vector uniformly distributed on an «-dimensional domain G, the
probability that the endpoint of £ lies in a domain Gx which is a subset of G
is equal to

[ . . . ]■/(*!, . .., x„) d x j . . . dxn = . (5)


G, Hn(G)
The case when G is a parallelepiped with its edges parallel to the axes
deserves particular consideration: we have

—— for ak < xk < bk ( k = 1, 2 ,...,« ) ,


/ ( xu . . . , x n) = Vn{G)
0 otherwise,
where

dn(G) = П (ft* - ak).


k=1
186 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 7

Hence

f ( x 1, . . . , x n) = П /*(**), (6 )
k =l
where f k(xk) (/c = 1 , . . и) is the density function of a random variable uni­
formly distributed on the interval (ak, bk); consequently are inde­
pendent. Conversely: if is uniformly distributed on (ak, bk) and if the
i;k are independent, the vector £ = (<JX, . . £„) is uniformly distributed in
the parallelepiped ak < x k < bk (k = 1 , 2 , . . . , и).
For an infinite interval (or for a domain of infinite volume) the uniform
distribution can be defined by means of the theory of conditional probability
spaces. We shall return to this in Chapter V.

§ 7. The normal distribution


We already encountered the normal distribution. It was introduced as
the limit-distribution of the binomial distribution. It has a paramount role
in probability theory. Many random variables dealt with in practice have
a nearly normal distribution. Often, the normal distribution is called the
“law of errors” , since random errors in the result of measurements are often
normally distributed. In Chapter VIII it will be proved that the distribution
of the sum of a large number of independent random variables has approxi­
mately a normal distribution under quite general conditions.
First of all let two general notions be defined: that of the similarity of
distributions and that of a family of distributions. Two distribution functions
Fx{x) and F2(x ) are said to be similar, if there exist two numbers а ф 0 and
m such that if Fx(x) is the distribution function of a random variable
then F2(x) is the distribution function of t] = at; + m. As the inequality
X — YYl
at, + m < X is for a > 0 equivalent to £ < --------- and for a < 0 to
X —m °
£, > --------- , we have either
<7

Fi(x) = Fi ~ ~ ~ ) (la)
(for a > 0 ), or
F 2 (x) = 1 - F, + oj (lb)
(for a < 0 ).
If F ^x) is absolutely continuous, F>(x) is absolutely continuous as well.
In this case, we obtain for the density functions f ( x ) = F'(x) (/ = 1, 2)
, , , \ [x-m
f ‘l (.X) — r / l • (2 )
M о
IV , § 7 ] T H E N O R M A L D IS T R IB U T IO N 187

Clearly the relation of similarity is symmetric, reflexive, and transitive.


Thus it permits the classification of the distributions into types called fam ­
ilies. Every family of distributions is a set depending on two parameters
(m and a). All uniform distributions (on the line) are thus similar to the dis­
tribution uniform on the interval (0, 1). In fact, the uniform distribution on
(a, b) has the density function
1 lx - a
b —a b —a

where f(x ) is the density function of the uniform distribution on (0 , 1 ),


that is
1 for 0 c X < 1
0 for a < 0 and 1 < X .

One can also define families of multidimensional distributions. The distribution functions $F_1(x_1, \ldots, x_n)$ and $F_2(x_1, \ldots, x_n)$ are said to be similar if there exists a linear transformation

$$\xi_k' = a_{k0} + \sum_{i=1}^{n} a_{ki}\,\xi_i \quad (k = 1, 2, \ldots, n) \tag{3}$$

with a non-zero determinant $D = |a_{ki}|$ such that the random vector $\xi = (\xi_1, \ldots, \xi_n)$ with the distribution function $F_1(x_1, \ldots, x_n)$ is transformed by (3) into the vector $\xi' = (\xi_1', \ldots, \xi_n')$ with distribution function $F_2(x_1, \ldots, x_n)$.

If the functions $F_1(x_1, \ldots, x_n)$ and $F_2(x_1, \ldots, x_n)$ are absolutely continuous and have the density functions $f_1(x_1, \ldots, x_n)$ and $f_2(x_1, \ldots, x_n)$, then by a well-known property of linear transformations it follows that

$$f_2(x_1', \ldots, x_n') = \frac{1}{|D|}\, f_1(x_1, \ldots, x_n), \tag{4}$$

where

$$x_k' = a_{k0} + \sum_{i=1}^{n} a_{ki}\,x_i \quad (k = 1, \ldots, n).$$
For n = 1, (4) reduces to (2).
Let us now return to the normal distribution. We shall call every distribution normal which is similar to that obtained as the limit of the binomial distribution, i.e. to the distribution with density function $\dfrac{e^{-\frac{x^2}{2}}}{\sqrt{2\pi}}$. Thus the
density function of a normal distribution has the form

$$f(x) = \frac{1}{\sigma}\,\varphi\!\left(\frac{x-m}{\sigma}\right) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-m)^2}{2\sigma^2}\right), \tag{5a}$$

where

$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}. \tag{5b}$$

(We have taken $\sigma > 0$; this restriction to positive values of $\sigma$ is permissible since $\varphi(x)$ is an even function.) In other words: a normal distribution function has the form

$$F(x) = \Phi\!\left(\frac{x-m}{\sigma}\right), \tag{6a}$$

where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\,dt. \tag{6b}$$
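For numerical work it is convenient that $\Phi$ can be expressed through the error function, $\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}\left(x/\sqrt{2}\right)\right)$. The following minimal sketch (assuming only the Python standard library) evaluates (6b) and (6a) this way.

```python
import math

# The standard normal distribution function (6b) via the error function.
def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# The distribution function (6a) of an N(m, sigma)-distributed variable.
def F_normal(x, m, sigma):
    return Phi((x - m) / sigma)

print(Phi(0.0))                  # 0.5, by symmetry
print(F_normal(1.0, 1.0, 2.0))   # also 0.5, since x = m
```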

If the distribution function of a random variable $\xi$ is given by (6a), we shall call $\xi$, for the sake of brevity, $N(m, \sigma)$-distributed. Let us now consider the multidimensional normal distributions. For the sake of simplicity, let us first restrict ourselves to the case of two dimensions.
If $\xi$ and $\eta$ are independent normally distributed random variables with density functions

$$\frac{1}{\sigma_1}\,\varphi\!\left(\frac{x-m_1}{\sigma_1}\right) \quad\text{and}\quad \frac{1}{\sigma_2}\,\varphi\!\left(\frac{x-m_2}{\sigma_2}\right),$$

then the density function of the random vector $\zeta = (\xi, \eta)$ is equal to the product of the density functions of $\xi$ and $\eta$, i.e. to

$$h(x, y) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left(-\frac{1}{2}\left[\frac{(x-m_1)^2}{\sigma_1^2} + \frac{(y-m_2)^2}{\sigma_2^2}\right]\right). \tag{7a}$$

A random vector having a density function of the form (7a) or one similar to it is said to be normally distributed (or Gaussian). Since all distributions having density functions of type (7a) are similar to each other, the two-dimensional normal distributions form a family. The density function (7a) (with $m_1 = m_2 = 0$) is represented in Fig. 19.

A simple calculation shows that the most general form of the two-dimensional normal density function is given by

$$h(x, y) = \frac{\sqrt{AC - B^2}}{2\pi} \exp\left(-\frac{1}{2}\left(A(x-m_1)^2 + 2B(x-m_1)(y-m_2) + C(y-m_2)^2\right)\right), \tag{7b}$$

where $A$ and $C$ are positive, $B$ is a real number such that $B^2 < AC$, and $m_1$ and $m_2$ are arbitrary real numbers. If $B \ne 0$, $\xi$ and $\eta$ are not independent. In fact, in this case the density function cannot be decomposed into two factors, one depending only on $x$ and the other only on $y$.
We introduce now the concept of the projection of a probability distribution. Let $\zeta = (\xi_1, \ldots, \xi_n)$ be an n-dimensional random vector. The projection of the distribution of $\zeta$ upon the line $g$ having the direction cosines

$$g_k \quad \left(k = 1, 2, \ldots, n;\ \sum_{k=1}^{n} g_k^2 = 1\right)$$

is defined as the distribution of the real random variable

$$\zeta_g = \sum_{k=1}^{n} g_k\,\xi_k.$$

If the distribution of $\zeta$ is known, all its projections are known as well. In particular, the distribution of $\xi_k$ ($k = 1, \ldots, n$) is thus the projection of the distribution of $\zeta$ upon the $x_k$-axis. Let $F(x_1, \ldots, x_n)$ be the distribution function and $f(x_1, \ldots, x_n)$ the density function of $\zeta$, and $F_k(x)$ and $f_k(x)$ those of $\xi_k$. We have

$$F_k(x_k) = F(+\infty, \ldots, +\infty, x_k, +\infty, \ldots, +\infty) \tag{8}$$

and, similarly (for almost every $x_k$),

$$f_k(x_k) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f(x_1, \ldots, x_n)\,dx_1 \ldots dx_{k-1}\,dx_{k+1} \ldots dx_n. \tag{9}$$

To understand the notion of a "projection" it is useful to consider the analogous notion for a mass-distribution. For instance let a distribution of the unit mass over the plane be determined by the density function $h(x, y)$ and let us "project" it upon the x-axis, in the sense that we assign to the interval $(a, b)$ the total mass contained in the strip $a < x < b$, $-\infty < y < +\infty$. This mass is equal to

$$\int_{a}^{b} \int_{-\infty}^{+\infty} h(x, y)\,dy\,dx.$$

Consider now the projection of an arbitrary two-dimensional normal distribution (7b) upon the x-axis (and upon the y-axis). For the density functions $f(x)$ and $g(y)$ of these projections a simple calculation gives, since

$$\frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{+\infty} \exp\left(-\frac{(x-m)^2}{2\sigma^2}\right) dx = 1,$$

the results

$$f(x) = \frac{1}{\sigma_1}\,\varphi\!\left(\frac{x-m_1}{\sigma_1}\right) \quad\text{and}\quad g(y) = \frac{1}{\sigma_2}\,\varphi\!\left(\frac{y-m_2}{\sigma_2}\right), \tag{10}$$

where

$$\sigma_1 = \sqrt{\frac{C}{AC - B^2}}, \qquad \sigma_2 = \sqrt{\frac{A}{AC - B^2}}. \tag{11}$$

Thus the projections upon the axes of a two-dimensional normal distribution with density function (7b) are one-dimensional normal distributions.¹ The projection on an arbitrary line may be calculated in the same manner, and the result is always a one-dimensional normal distribution.

Suppose now that the components $\xi_1, \xi_2, \ldots, \xi_n$ of an n-dimensional random vector $\zeta$ are independent and $\xi_k$ is normally distributed with the density function $\frac{1}{\sigma_k}\,\varphi\!\left(\frac{x}{\sigma_k}\right)$ ($k = 1, \ldots, n$). The density function of the random vector $\zeta$ is

$$f(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{\frac{n}{2}} \prod_{k=1}^{n} \sigma_k} \exp\left(-\frac{1}{2} \sum_{k=1}^{n} \frac{x_k^2}{\sigma_k^2}\right). \tag{12}$$

¹ The projections of a distribution in n-space on the coordinate axes are also called its marginal distributions.
If the density function of a random vector has the form (12), it is said to be normally distributed or Gaussian. Every distribution similar to this is said to be an n-dimensional normal (or Gaussian) distribution. In order to obtain the general form of the density of an n-dimensional normal distribution, put

$$\xi_j' = \sum_{k=1}^{n} c_{jk}\,\xi_k + m_j, \tag{13}$$

where $(c_{jk})$ is an orthogonal matrix, i.e.

$$\sum_{j=1}^{n} c_{ji}\,c_{jk} = \delta_{ik} = \begin{cases} 1 & \text{for } i = k, \\ 0 & \text{otherwise,} \end{cases} \tag{14}$$

and where the $m_j$ ($j = 1, \ldots, n$) are real numbers.¹


Consider now the random variables $\xi_j'$ as coordinates of a vector $\zeta'$. Determine the density function $g(x_1', \ldots, x_n')$ of $\zeta'$. By (13) and (14) we have

$$\xi_k = \sum_{j=1}^{n} c_{jk}\,(\xi_j' - m_j) \tag{15}$$

and in consequence of (4) we obtain (as in the two-dimensional case)

$$g(x_1', \ldots, x_n') = \frac{1}{(2\pi)^{\frac{n}{2}} \prod_{k=1}^{n} \sigma_k} \exp\left(-\frac{1}{2} \sum_{k=1}^{n} \frac{1}{\sigma_k^2} \left[\sum_{j=1}^{n} c_{jk}(x_j' - m_j)\right]^2\right),$$

or, by putting

$$b_{ij} = \sum_{k=1}^{n} \frac{c_{ik}\,c_{jk}}{\sigma_k^2}, \tag{16}$$

$$g(x_1', \ldots, x_n') = \frac{1}{(2\pi)^{\frac{n}{2}} \prod_{k=1}^{n} \sigma_k} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij}\,(x_i' - m_i)(x_j' - m_j)\right), \tag{17}$$

¹ We can restrict ourselves here to orthogonal transformations, since the most general nondegenerate linear transformation may be decomposed into an orthogonal transformation and a transformation of the form $x_k' = \lambda_k x_k$ ($k = 1, \ldots, n$).

where $(b_{ij})$ is a symmetric matrix such that the quadratic form

$$\sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij}\,z_i z_j$$

is positive definite. It is known that a positive definite quadratic form can be transformed into a sum of squares. Thus if the density function of $\zeta'$ has the form (17), there exists an orthogonal transformation with matrix $C = (c_{ij})$ such that the n-dimensional density function of the random variables

$$\xi_k = \sum_{j=1}^{n} c_{jk}\,(\xi_j' - m_j)$$

has the form (12). Note that the factor $1 / \prod_{k=1}^{n} \sigma_k$ is equal to the positive square root of the determinant $|b_{ij}|$. The matrix $B = (b_{ij})$ can be written as $CSC^*$, where $C^*$ is the transpose of $C$ and $S$ is the diagonal matrix

$$S = \begin{pmatrix} \dfrac{1}{\sigma_1^2} & 0 & \ldots & 0 \\ 0 & \dfrac{1}{\sigma_2^2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \dfrac{1}{\sigma_n^2} \end{pmatrix}.$$

For the determinants, since $|C| = |C^*| = \pm 1$, we have

$$|B| = |S| = \prod_{k=1}^{n} \frac{1}{\sigma_k^2}.$$
Consequently, the density function (17) can be written as

$$g(x_1, \ldots, x_n) = \frac{\sqrt{|B|}}{(2\pi)^{\frac{n}{2}}} \exp\left(-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij}\,(x_i - m_i)(x_j - m_j)\right), \tag{18}$$

where the quadratic form $\sum b_{ij} z_i z_j$ is positive definite, $m_1, \ldots, m_n$ are arbitrary real numbers, and $|B|$ is the determinant of the matrix $B = (b_{ij})$. Every density function of the form (18) is the density function of an n-dimensional normal distribution; a suitable orthogonal transformation leads from (18) to a density function of the form (12). Since evidently all distributions with a density function of the form (12) are similar to each other, the n-dimensional normal distributions form a family.
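The density (18) is straightforward to evaluate numerically. The sketch below (illustrative numbers; numpy is assumed, and used only for the determinant and the quadratic form) computes (18) for a given positive definite matrix $B$ and vector $m$.

```python
import numpy as np

# Formula (18): the n-dimensional normal density determined by a positive
# definite matrix B and a vector m.
def normal_density(x, m, B):
    x, m = np.asarray(x, float), np.asarray(m, float)
    d = x - m
    n = len(m)
    q = d @ B @ d                      # the quadratic form sum b_ij d_i d_j
    return np.sqrt(np.linalg.det(B)) / (2 * np.pi) ** (n / 2) * np.exp(-q / 2)

# B plays the role of the inverse covariance matrix; here a two-dimensional
# example corresponding to (7b) with A = C = 2, B = 1 (so AC - B^2 = 3 > 0).
B = np.array([[2.0, 1.0],
              [1.0, 2.0]])
m = np.array([0.0, 0.0])
print(normal_density([0.0, 0.0], m, B))   # sqrt(3)/(2*pi), about 0.2757
```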
It has some interest to study the case of an m-dimensional vector $\zeta = (\xi_1, \ldots, \xi_n, 0, \ldots, 0)$, where $m > n$ and where the n-dimensional vector $(\xi_1, \ldots, \xi_n)$ has a density function of the form (12). By applying the orthogonal transformation

$$\xi_j' = \sum_{k=1}^{m} c_{jk}\,\xi_k + m_j \quad (j = 1, 2, \ldots, m), \tag{19}$$

we obtain an m-dimensional vector which, however, is not really m-dimensional; indeed (19) implies

$$\xi_k = \sum_{j=1}^{m} c_{jk}\,(\xi_j' - m_j) \quad \text{for } k = 1, 2, \ldots, n \tag{20}$$

and

$$0 = \sum_{j=1}^{m} c_{jk}\,(\xi_j' - m_j) \quad \text{for } k = n+1, \ldots, m. \tag{21}$$

Formula (21) expresses that the point $(\xi_1', \ldots, \xi_m')$ lies in an n-dimensional subspace of the m-dimensional space. A distribution of this kind is said to be a degenerate m-dimensional normal distribution.

§ 8. Distribution of a function of a random variable

Let $\xi$ be a random variable with known distribution and let $y = \psi(x)$ be a Borel-measurable function. It is then easy to determine the distribution of the random variable $\eta = \psi(\xi)$. Let $\psi^{-1}(E)$ denote the set of real numbers $x$ for which $\psi(x)$ belongs to the Borel set $E$; let further $I_y$ denote the interval $(-\infty, y)$ and let $F(x)$ be the distribution function of $\xi$ and $G(y)$ that of $\eta$. It follows that

$$G(y) = P(\eta < y) = P(\xi \in \psi^{-1}(I_y)) = \int_{\psi^{-1}(I_y)} dF(x).$$
Let us first consider some particular cases. If $\xi$ is a discrete random variable, the calculation of the distribution of $\eta$ is almost trivial. In fact, let $x_k$ ($k = 1, 2, \ldots$) denote the possible values of $\xi$; then

$$P(\eta = y) = \sum_{\psi(x_k) = y} P(\xi = x_k),$$

where the summation extends over those values of $k$ for which $\psi(x_k) = y$.
Let us now consider the case of an absolutely continuous distribution function. Let $f(x)$ be the density function of $\xi$. Assume $\psi(x)$ to be monotonic and differentiable and suppose $\psi'(x) \ne 0$ for every $x$. If $g(y)$ is the density function of $\eta = \psi(\xi)$, one easily finds that

$$g(y) = \begin{cases} \dfrac{f(\psi^{-1}(y))}{|\psi'(\psi^{-1}(y))|} & \text{for } \inf \psi(x) < y < \sup \psi(x), \\[1ex] 0 & \text{otherwise,} \end{cases} \tag{1}$$

where $x = \psi^{-1}(y)$ is the inverse function of $y = \psi(x)$.

If for instance $\xi$ is an $N(m, \sigma)$-distributed random variable and $\eta = e^{\xi}$, we have by (1), putting $M = e^m$,

$$g(y) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma y} \exp\left(-\dfrac{1}{2\sigma^2} \ln^2 \dfrac{y}{M}\right) & \text{for } y > 0, \\[1ex] 0 & \text{for } y \le 0. \end{cases} \tag{2}$$
A random variable having the density function (2) is said to be lognormal. The lognormal distribution is of great importance in the theory of crushing of materials. The distribution of the grains of a granular material (stone, metal or crystal powder, etc.), in particular of a product produced by a breaking process, is lognormal under rather general conditions. This density function is represented by the curve seen in Fig. 20.
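Formula (2) can be checked by simulation: exponentiate normal samples and compare the frequency of an interval with the integral of the density. The following sketch (plain Python; the parameters $m = 0$, $\sigma = 1$, i.e. $M = 1$, and the interval are chosen only for illustration) does this.

```python
import math, random

# The lognormal density (2) with m = 0, sigma = 1, i.e. M = 1.
def lognormal_density(y, sigma=1.0, M=1.0):
    if y <= 0:
        return 0.0
    return math.exp(-(math.log(y / M)) ** 2 / (2 * sigma ** 2)) / (
        y * sigma * math.sqrt(2 * math.pi))

# Monte Carlo estimate of P(a < eta < b) against the integral of the density.
random.seed(1)
a, b, N = 0.5, 2.0, 200_000
freq = sum(a < math.exp(random.gauss(0, 1)) < b for _ in range(N)) / N
steps = 1000   # crude numerical integral of the density over (a, b)
integral = sum(lognormal_density(a + (i + 0.5) * (b - a) / steps)
               for i in range(steps)) * (b - a) / steps
print(freq, integral)   # the two numbers should nearly agree (about 0.51)
```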
Take now another example. Let the random vector $\zeta$ be uniformly distributed on the circumference of the unit circle; what is the density of the distribution of the projection $\xi$ of $\zeta$ on the x-axis? We obtain from (1)

$$g(y) = \begin{cases} \dfrac{1}{\pi\sqrt{1-y^2}} & \text{for } -1 < y < +1, \\[1ex] 0 & \text{otherwise.} \end{cases} \tag{3}$$

§ 9. The convolution of distributions

Let two independent random variables $\xi$ and $\eta$ be given, having the distribution functions $F(x)$ and $G(y)$ respectively. Consider the sum $\zeta = \xi + \eta$; let $H(z)$ be its distribution function. We have clearly

$$H(z) = \iint_{x+y<z} dF(x)\,dG(y) = \int_{-\infty}^{+\infty} F(z-y)\,dG(y) = \int_{-\infty}^{+\infty} G(z-x)\,dF(x). \tag{1}$$

The distribution function $H(z)$ is called the convolution of the distribution functions $F(x)$ and $G(y)$. The convolution operation is denoted by $H = F * G$. Clearly it is commutative and associative; in fact, if $\xi_1, \xi_2, \xi_3$ are independent random variables, we have

$$\xi_1 + \xi_2 = \xi_2 + \xi_1 \quad\text{and}\quad (\xi_1 + \xi_2) + \xi_3 = \xi_1 + (\xi_2 + \xi_3).$$

From this follows for the distribution functions that

$$F_1 * F_2 = F_2 * F_1 \quad\text{and}\quad (F_1 * F_2) * F_3 = F_1 * (F_2 * F_3).$$

Suppose that $\xi$ and $\eta$ are independent random variables having absolutely continuous distribution functions; let $f(x)$ and $g(y)$ be their density functions. It will be shown that $H = F * G$ is also absolutely continuous and that the density function of $\zeta = \xi + \eta$ is

$$h(z) = \int_{-\infty}^{+\infty} f(x)\,g(z-x)\,dx = \int_{-\infty}^{+\infty} f(z-y)\,g(y)\,dy \tag{2}$$

(the equality of the integrals in (2) can be shown e.g. by a transformation of the variable).

Formula (2) can be proved as follows: (1) is equivalent to

$$H(z) = \int_{-\infty}^{+\infty} \int_{-\infty}^{z} f(x-y)\,dx\,dG(y) = \int_{-\infty}^{z} \left(\int_{-\infty}^{+\infty} f(x-y)\,dG(y)\right) dx. \tag{3}$$

By differentiating (3) we obtain

$$h(z) = \int_{-\infty}^{+\infty} f(z-y)\,dG(y). \tag{4}$$

From (4) follows immediately (2). Further it can be seen that the distribution of $\zeta = \xi + \eta$ is absolutely continuous, provided that one of $\xi$ and $\eta$ has such a distribution, regardless of the other distribution.

The function $h(x)$ defined by (2) is called the convolution of the density functions $f(x)$ and $g(x)$ and is denoted by $h = f * g$. It is easy to show that $h(x)$ is a density function; as a matter of fact (2) implies $h(x) \ge 0$ and

$$\int_{-\infty}^{+\infty} h(x)\,dx = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x-y)\,g(y)\,dy\,dx = \int_{-\infty}^{+\infty} f(x)\,dx \int_{-\infty}^{+\infty} g(y)\,dy = 1.$$

In what follows, we shall give some examples of the convolution of absolutely continuous distributions (the convolution of discrete distributions was already dealt with in Chapter III, § 6).

1. Convolution of uniform distributions.

Suppose that

$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a < x < b, \\[1ex] 0 & \text{otherwise} \end{cases} \tag{5}$$

and

$$g(x) = \begin{cases} \dfrac{1}{d-c} & \text{for } c < x < d, \\[1ex] 0 & \text{otherwise.} \end{cases} \tag{6}$$
Assume $d - c > b - a$. The convolution of the density functions of two independent random variables $\xi$ and $\eta$ with the respective density functions (5) and (6) is equal to

$$h(x) = \begin{cases} 0 & \text{for } x \le a+c \text{ or } b+d \le x, \\[1ex] \dfrac{x-(a+c)}{(b-a)(d-c)} & \text{for } a+c < x \le b+c, \\[1ex] \dfrac{1}{d-c} & \text{for } b+c < x \le a+d, \\[1ex] \dfrac{(b+d)-x}{(b-a)(d-c)} & \text{for } a+d < x < b+d. \end{cases} \tag{7}$$

The graph of the function $y = h(x)$ is an isosceles trapezoid with its base on the x-axis (Fig. 21 represents the case $a = -1$, $b = 0$, $c = -1$, $d = +1$). Note that $h(x)$ is everywhere continuous, though $f(x)$ and $g(x)$ have jumps. (The convolution in general smoothes out discontinuities.)
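A quick Monte Carlo check of (7), with the illustrative parameters of Fig. 21, may be sketched as follows (plain Python).

```python
import random

# The trapezoidal convolution (7) for xi uniform on (a, b), eta on (c, d);
# defaults are the case of Fig. 21.
def h(x, a=-1.0, b=0.0, c=-1.0, d=1.0):
    if x <= a + c or x >= b + d:
        return 0.0
    if x <= b + c:
        return (x - (a + c)) / ((b - a) * (d - c))
    if x <= a + d:
        return 1.0 / (d - c)
    return ((b + d) - x) / ((b - a) * (d - c))

random.seed(0)
N = 200_000
samples = [random.uniform(-1, 0) + random.uniform(-1, 1) for _ in range(N)]
# empirical density near x = -0.5 (a histogram bin of width 0.1)
x0, w = -0.5, 0.1
emp = sum(abs(s - x0) < w / 2 for s in samples) / (N * w)
print(emp, h(x0))   # both about 0.5 here
```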

In particular, if $\xi$ and $\eta$ have the same uniform distribution, the graph of $h(x)$ is an isosceles triangle; this is the so-called Simpson distribution.
By repeated application of (2) one can determine the density function of the sum of several independent random variables with absolutely continuous distributions. Thus for instance the density function of the sum of three independent random variables uniformly distributed on $(-1, +1)$ is given by
$$h(x) = \begin{cases} 0 & \text{for } |x| \ge 3, \\[1ex] \dfrac{(3-|x|)^2}{16} & \text{for } 1 \le |x| < 3, \\[1ex] \dfrac{3-x^2}{8} & \text{for } 0 \le |x| < 1. \end{cases} \tag{8}$$

The function $h(x)$ (cf. Fig. 22) is not only continuous but also everywhere differentiable. The curve has already a bell-shaped form like the Gaussian curve; by adding more and more independent random variables with uniform distribution on $(-1, +1)$, this similarity becomes still closer: we have here a particular case of the central limit theorem to be dealt with later. The density function of the sum of $n$ mutually independent random variables with uniform distribution on $(-1, +1)$ is

$$f_n(x) = \begin{cases} \dfrac{1}{2^n (n-1)!} \displaystyle\sum_{k=0}^{\left\lfloor \frac{n+x}{2} \right\rfloor} (-1)^k \binom{n}{k} (n + x - 2k)^{n-1} & \text{for } |x| < n, \\[1ex] 0 & \text{otherwise,} \end{cases} \tag{9}$$

as it is readily proved by mathematical induction. The graph of the function $f_n(x)$ consists of arcs of polynomials of degree $n-1$; it is $(n-2)$-times differentiable, i.e. the first $n-2$ derivatives of these polynomials are equal at the endpoints of these arcs.
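Formula (9) can likewise be checked against simulation; the sketch below (with the illustrative choice $n = 3$, so that the result can also be compared with (8)) evaluates the sum in (9) directly.

```python
import math, random

# The density (9) of the sum of n uniform(-1, 1) variables, checked against
# a Monte Carlo histogram value.
def f_n(x, n):
    if abs(x) >= n:
        return 0.0
    s = 0.0
    for k in range(int((n + x) // 2) + 1):
        s += (-1) ** k * math.comb(n, k) * (n + x - 2 * k) ** (n - 1)
    return s / (2 ** n * math.factorial(n - 1))

random.seed(2)
n, N, x0, w = 3, 200_000, 0.0, 0.1
emp = sum(abs(sum(random.uniform(-1, 1) for _ in range(n)) - x0) < w / 2
          for _ in range(N)) / (N * w)
print(emp, f_n(x0, n))   # both about 3/8 = 0.375, in agreement with (8)
```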
This distribution was first studied by N. I. Lobatchewski. He wanted to
use it to evaluate the error of astronomical measurements, in order to decide
whether the Euclidean or the non-Euclidean geometry is valid in the Uni­
verse.
2. The convolution of normal distributions.

Let $\xi$ and $\eta$ be two independent random variables with density functions

$$f(x) = \frac{1}{\sigma_1}\,\varphi\!\left(\frac{x-m_1}{\sigma_1}\right) \quad\text{and}\quad g(x) = \frac{1}{\sigma_2}\,\varphi\!\left(\frac{x-m_2}{\sigma_2}\right);$$

it follows from (2) by an easy calculation that, putting $h = f * g$, one has

$$h(x) = \frac{1}{\sqrt{\sigma_1^2+\sigma_2^2}}\,\varphi\!\left(\frac{x-(m_1+m_2)}{\sqrt{\sigma_1^2+\sigma_2^2}}\right). \tag{10}$$

The sum of $\xi$ and $\eta$ is thus also a normally distributed random variable; the parameters of the distribution are $m = m_1 + m_2$ and $\sigma = (\sigma_1^2 + \sigma_2^2)^{\frac{1}{2}}$. It follows that the sum of any number of independent and normally distributed random variables is again a normally distributed random variable.
3. Pearson's χ²-distribution.¹

The distribution of the sum of the squares of $n$ independent random variables $\xi_1, \ldots, \xi_n$ with the same normal distribution plays an important role in mathematical statistics. We shall determine the density function of this sum for any $n$. Let $\varphi(x)$ be the density function of the random variables $\xi_k$ ($k = 1, 2, \ldots, n$). Let the sum of the squares of the $\xi_k$ be denoted by

$$\chi_n^2 = \sum_{k=1}^{n} \xi_k^2. \tag{11}$$
Let $h_n(x)$ be the density function of $\chi_n^2$. The statement

$$h_n(x) = \begin{cases} \dfrac{x^{\frac{n}{2}-1}\, e^{-\frac{x}{2}}}{2^{\frac{n}{2}}\,\Gamma\!\left(\frac{n}{2}\right)} & \text{for } x > 0, \\[1ex] 0 & \text{for } x \le 0 \end{cases} \tag{12}$$

can be proven by mathematical induction. For we have, by Formula (1) of § 8,

$$h_1(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi x}}\, e^{-\frac{x}{2}} & \text{for } x > 0, \\[1ex] 0 & \text{for } x \le 0, \end{cases} \tag{13}$$

¹ This distribution was already used by Helmert, before Pearson.
which shows that (12) is valid for $n = 1$. Suppose that (12) is valid for a certain value of $n$. Given (2) and the induction assumption, we have

$$h_{n+1}(x) = \int_{0}^{x} h_n(y)\,h_1(x-y)\,dy = \frac{e^{-\frac{x}{2}}}{2^{\frac{n+1}{2}}\,\sqrt{\pi}\;\Gamma\!\left(\frac{n}{2}\right)} \int_{0}^{x} y^{\frac{n}{2}-1}\,(x-y)^{-\frac{1}{2}}\,dy. \tag{14}$$

As for Euler's beta function $B(a, b)$ the formula

$$B(a, b) = \int_{0}^{1} t^{a-1}\,(1-t)^{b-1}\,dt \quad (a > 0,\ b > 0) \tag{15}$$

is valid, we have¹

$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}. \tag{16}$$

Since $\Gamma\!\left(\frac{1}{2}\right) = \sqrt{\pi}$ (the substitution $y = xt$ reduces the last integral in (14) to $x^{\frac{n-1}{2}} B\!\left(\frac{n}{2}, \frac{1}{2}\right)$), from (14) follows

$$h_{n+1}(x) = \frac{x^{\frac{n+1}{2}-1}\, e^{-\frac{x}{2}}}{2^{\frac{n+1}{2}}\,\Gamma\!\left(\frac{n+1}{2}\right)} \quad \text{for } x > 0.$$

Thus (12) holds with $n+1$ instead of $n$; thus it holds for every $n$.
From (12) we obtain that the density function $g_n(x)$ of $\chi_n = \sqrt{\xi_1^2 + \xi_2^2 + \ldots + \xi_n^2}$ is

$$g_n(x) = \frac{2\,x^{n-1}\, e^{-\frac{x^2}{2}}}{2^{\frac{n}{2}}\,\Gamma\!\left(\frac{n}{2}\right)} \quad \text{for } x > 0. \tag{17}$$

The distribution with density function (12) is called Pearson's χ²-distribution with $n$ degrees of freedom. The distribution with density function (17) is called the χ-distribution with $n$ degrees of freedom.

¹ For the proof of this formula cf. e.g. F. Lösch and F. Schoblik [1] or V. I. Smirnov [1].
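The induction above can be supplemented by a numerical check of (12): simulate sums of squares of standard normal variables and compare a histogram value with the density. A minimal sketch (illustrative $n$ and test point):

```python
import math, random

# The chi-square density (12), checked by simulating sums of squares of
# standard normal variables.
def chi2_density(x, n):
    if x <= 0:
        return 0.0
    return x ** (n / 2 - 1) * math.exp(-x / 2) / (2 ** (n / 2) * math.gamma(n / 2))

random.seed(3)
n, N, x0, w = 3, 200_000, 2.0, 0.1
emp = sum(
    abs(sum(random.gauss(0, 1) ** 2 for _ in range(n)) - x0) < w / 2
    for _ in range(N)) / (N * w)
print(emp, chi2_density(x0, n))   # both about 0.2076 for n = 3, x = 2
```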

For $n = 3$, Equation (17) gives the density function of Maxwell's velocity distribution, which is of great importance in the kinetic theory of gases. Consider a gas contained in a vessel. The velocity of a molecule has for its components in the directions x, y, and z the random variables $\xi$, $\eta$, and $\zeta$ respectively. It is shown in the kinetic theory of gases that these three random variables are independent, normally distributed, and have the same density function

$$\frac{1}{\sigma}\,\varphi\!\left(\frac{x}{\sigma}\right).$$

The physical meaning of $\xi$, $\eta$ and $\zeta$ having identical distributions is that the pressure of the gas has the same value in every direction; $m = 0$ means that the gas does not move as a whole, only its molecules move at random. We wish to determine the density function of the absolute value of the velocity

$$v = \sqrt{\xi^2 + \eta^2 + \zeta^2}. \tag{18}$$

Clearly $\frac{\xi}{\sigma}, \frac{\eta}{\sigma}, \frac{\zeta}{\sigma}$ have the density function $\varphi(x)$; hence, by (17), the density function of $\frac{v}{\sigma}$ is $g_3(x)$. According to Formula (1) of § 8, the density $v(x)$ of $v$ is $\frac{1}{\sigma}\,g_3\!\left(\frac{x}{\sigma}\right)$. Since $\Gamma\!\left(\frac{3}{2}\right) = \frac{1}{2}\,\Gamma\!\left(\frac{1}{2}\right) = \frac{\sqrt{\pi}}{2}$, we have

$$v(x) = \frac{1}{\sigma^3} \sqrt{\frac{2}{\pi}}\; x^2\, e^{-\frac{x^2}{2\sigma^2}} \quad (x > 0). \tag{19}$$

(The curve representing $y = v(x)$ is drawn in Fig. 23 for $\sigma = 1$.) Note that $\sigma$ has the physical meaning

$$\sigma = \sqrt{\frac{kT}{M}}, \tag{20}$$

where $T$ is the absolute temperature, $M$ the mass of the molecules, and $k$ is Boltzmann's constant.

Let it further be noted that $h_2(x) = \frac{1}{2}\,e^{-\frac{x}{2}}$: the χ²-distribution with 2 degrees of freedom is an exponential distribution.
4. Convolution of exponential distributions.

The exponential distribution was introduced in the previous section in connection with radioactive disintegration; but it occurs also in many other problems of physics and technology. In what follows, we give an example from the textile industry, namely the problem of the tearing of the yarn on the loom. At a given moment, the yarn is or is not torn, according as the section of the yarn, submitted at this moment to a certain stress, does or does not yield to the latter. Evidently this does not depend on the time during which the loom has worked uninterruptedly. Let $\xi$ be the random variable representing this time interval, i.e. the time between the start of the work and the first rupture of the yarn; let $F(x)$ denote the distribution function and $f(x)$ the density function of $\xi$. For $F(x)$ one obtains, as in the case of radioactive disintegration, the functional equation

$$1 - F(t+s) = (1 - F(t))\,(1 - F(s)) \quad (t, s > 0), \tag{21}$$

from which it follows that

$$F(t) = 1 - e^{-\lambda t} \quad \text{for } t > 0 \tag{22}$$

and

$$f(t) = \lambda e^{-\lambda t} \quad (t > 0) \tag{23}$$

(where $\lambda$ is a positive constant). Hence the random variable $\xi$ has an exponential distribution.
Consider now the functioning of the loom during a sufficiently long time interval. Let $\zeta_n$ denote the time interval until the n-th rupture of the yarn. For the sake of simplicity assume the time wasted between the rupture and the tying of the yarn to be so small that it can be neglected. Then we have

$$\zeta_n = \xi_1 + \xi_2 + \ldots + \xi_n,$$

where $\xi_1, \ldots, \xi_n$ are independent and every one of them has the distribution (22). Let $F_n(t)$ be the distribution function and $f_n(t)$ the density function of $\zeta_n$. It can be shown by induction that

$$f_n(t) = \frac{\lambda^n\,t^{n-1}\,e^{-\lambda t}}{(n-1)!} \quad \text{for } t > 0;\ n = 1, 2, \ldots. \tag{24}$$

By (23), Formula (24) holds for $n = 1$. Assume its validity for a certain value of $n$. Since $\zeta_{n+1} = \zeta_n + \xi_{n+1}$ and further $\zeta_n$ is independent of $\xi_{n+1}$, Formula (2) can be applied here. Thus we obtain

$$f_{n+1}(t) = \int_{0}^{t} f_n(u)\,f_1(t-u)\,du = \frac{\lambda^{n+1}\,t^n\,e^{-\lambda t}}{n!},$$

and (24) is hereby proved. It follows that

$$F_n(t) = \Gamma_n(\lambda t) \quad (t > 0), \tag{25}$$

where

$$\Gamma_n(x) = \frac{1}{(n-1)!} \int_{0}^{x} u^{n-1}\,e^{-u}\,du \quad (x \ge 0) \tag{26}$$

is the incomplete Γ-function.

The distribution with the distribution function (25) is called the Γ-distribution of order n and parameter λ. For $\lambda = \frac{1}{2}$, $f_n(t)$ is equal to the function $h_{2n}(t)$ defined by (12); thus the χ²-distribution with $2n$ degrees of freedom is the same as the Γ-distribution of order $n$.
This result permits us to calculate the probability that the yarn is torn exactly $n$ times during a time interval $(0, T)$. Let $\nu_T$ be the number of breakings of the yarn in the time interval $(0, T)$; clearly $\nu_T$ can only assume nonnegative integer values. The event $\nu_T = n$ means that $\zeta_n < T$, but $\zeta_{n+1} > T$. Let $A_n$ denote the event $\zeta_n < T$; then, because of $A_{n+1} \subset A_n$, we have

$$P(\nu_T = n) = F_n(T) - F_{n+1}(T). \tag{27}$$

Substituting here for $F_n(T)$ and $F_{n+1}(T)$ and integrating by parts, we find

$$P(\nu_T = n) = \frac{(\lambda T)^n\,e^{-\lambda T}}{n!}. \tag{28}$$

Thus the random variable $\nu_T$ has a Poisson distribution of parameter $\lambda T$. Here we encountered an important further property of the Poisson distribution: if a sequence of events has the property that the time intervals between consecutive events do not depend on each other and have the distribution function $1 - e^{-\lambda t}$ ($t > 0$), then the number of events occurring in a fixed interval $(0, T)$ has a Poisson distribution with parameter $\lambda T$.
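This property is easy to verify by simulation: generate exponential inter-event times, count the events falling in $(0, T)$, and compare the frequency of a given count with (28). A minimal sketch (illustrative $\lambda$ and $T$):

```python
import math, random

# With independent exponential(lam) gaps between events, the number of
# events in (0, T) should be Poisson with parameter lam*T, formula (28).
random.seed(4)
lam, T, N = 2.0, 3.0, 100_000

def count_events(lam, T):
    t, n = 0.0, 0
    while True:
        t += random.expovariate(lam)   # an exponential inter-event time
        if t > T:
            return n
        n += 1

n0 = 5
freq = sum(count_events(lam, T) == n0 for _ in range(N)) / N
poisson = (lam * T) ** n0 * math.exp(-lam * T) / math.factorial(n0)
print(freq, poisson)   # both about 0.1606
```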

The above reasoning can be applied to a large number of technical problems (e.g. the breaking of machine parts).

One can also determine the distribution of a sum of independent random variables having exponential distributions with different parameters. Let $\xi_1, \ldots, \xi_n$ be independent random variables with exponential distributions; let $\lambda_k e^{-\lambda_k t}$ be the density function of $\xi_k$ for $t > 0$, where the numbers $\lambda_1, \ldots, \lambda_n$ are all different. It can be shown by induction that the density function $g_n(t)$ of $\eta_n = \xi_1 + \ldots + \xi_n$ is given by

$$g_n(t) = \left(\prod_{k=1}^{n} \lambda_k\right) \sum_{k=1}^{n} \frac{e^{-\lambda_k t}}{\prod\limits_{j \ne k} (\lambda_j - \lambda_k)} \quad \text{for } t > 0. \tag{29}$$

Formula (29) has the following physical application. Let $A_1$ be a radioactive substance with the disintegration constant $\lambda_1$. The disintegration of an $A_1$ atom means its transformation into some other kind of atom $A_2$; suppose that the $A_2$ atoms are radioactive as well and have the disintegration constant $\lambda_2$. Similarly, let $A_k$ ($k = 3, 4, \ldots, n$) be the result of the disintegration of an $A_{k-1}$ atom, the disintegration constant of $A_k$ being $\lambda_k$ for $k \le n$. Assume that the substance $A_{n+1}$ is not radioactive. Denoting by $\eta_n$ the time necessary for the transformation of an atom $A_1$ into an atom $A_{n+1}$, $\eta_n$ clearly has the density function (29). For instance, if $A_1$ is uranium, then $A_{n+1}$ is lead and $\eta_n$ is the time necessary for a uranium atom to change into lead.

§ 10. Distribution of a function of several random variables

Let $\xi_1, \ldots, \xi_n$ be arbitrary random variables with the joint (n-dimensional) distribution function $F(x_1, \ldots, x_n)$ and let $g(x_1, \ldots, x_n)$ be a Borel-measurable function. Evidently, the distribution function of $\eta = g(\xi_1, \ldots, \xi_n)$ is

$$P(\eta < y) = \int \cdots \int_{g(x_1, \ldots, x_n) < y} dF(x_1, \ldots, x_n).$$

Let us consider some important particular cases. Let $\xi$ and $\eta$ be independent random variables with absolutely continuous distribution functions; let us consider the random variables $\zeta_1 = \xi\eta$ and $\zeta_2 = \frac{\xi}{\eta}$. Let the density functions of $\xi$ and $\eta$ be $f(x)$ and $g(y)$; we have

$$P(\zeta_1 < z) = \iint_{xy < z} f(x)\,g(y)\,dx\,dy \tag{1a}$$

and

$$P\left(\frac{\xi}{\eta} < z\right) = \iint_{\frac{x}{y} < z} f(x)\,g(y)\,dx\,dy. \tag{1b}$$

By differentiating we obtain the corresponding density functions $p(z)$ and $q(z)$ of $\zeta_1$ and $\zeta_2$:

$$p(z) = \int_{-\infty}^{+\infty} \frac{1}{|x|}\, f(x)\, g\!\left(\frac{z}{x}\right) dx, \tag{2}$$

$$q(z) = \int_{-\infty}^{+\infty} |y|\, g(y)\, f(zy)\,dy. \tag{3}$$

Let us give some examples.

1. Student's distribution.

We shall determine the distribution of the random variable

$$\zeta = \frac{\xi_0}{\sqrt{\dfrac{\xi_1^2 + \ldots + \xi_n^2}{n}}}, \tag{4}$$

where $\xi_0, \xi_1, \ldots, \xi_n$ are independent random variables having the same normal distribution with density function

$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}.$$

Let $q_n(z)$ be the density function of $\zeta$. We know already the density function of the denominator of (4) (cf. Formula (17) of § 9), hence we obtain from (3)

$$q_n(z) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\!\left(\frac{n}{2}\right)} \cdot \frac{1}{\left(1 + \frac{z^2}{n}\right)^{\frac{n+1}{2}}}. \tag{5}$$

The distribution with density function (5) is called Student's distribution with $n$ degrees of freedom. It plays an important role in mathematical statistics, since "Student's t-test" is based on it. The particular case $n = 1$ gives the Cauchy distribution, with the density function

$$q_1(z) = \frac{1}{\pi\,(1+z^2)}. \tag{6}$$
2. Distribution of the ratio of two independent random variables having χ²-distributions.

In mathematical statistics one is often interested in the density function $h(z)$ of the ratio of two independent random variables $\zeta$ and $\eta$ having χ²-distributions with $n$ and $m$ degrees of freedom, respectively. It follows easily from Formula (12) of § 9 and from (3) of the present section that

$$h(z) = \frac{\Gamma\!\left(\frac{n+m}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)\Gamma\!\left(\frac{m}{2}\right)} \cdot \frac{z^{\frac{n}{2}-1}}{(1+z)^{\frac{n+m}{2}}} \quad \text{for } z > 0. \tag{7}$$
3. The beta distribution.

If $\zeta$ is the ratio considered in the previous example, let $\tau$ denote the random variable $\tau = \frac{\zeta}{1+\zeta}$ and $k(x)$ the density function of $\tau$. By (1) of § 8 we obtain

$$k(x) = \frac{\Gamma\!\left(\frac{n+m}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)\Gamma\!\left(\frac{m}{2}\right)}\; x^{\frac{n}{2}-1}\,(1-x)^{\frac{m}{2}-1} \quad \text{for } 0 < x < 1. \tag{8}$$

The distribution function $K(x) = \int_0^x k(t)\,dt$ is thus

$$K(x) = \frac{\Gamma\!\left(\frac{n+m}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)\Gamma\!\left(\frac{m}{2}\right)} \int_0^x t^{\frac{n}{2}-1}\,(1-t)^{\frac{m}{2}-1}\,dt = B_{\frac{n}{2}, \frac{m}{2}}(x) \quad \text{for } 0 < x < 1, \tag{9}$$

where

$$B_{a,b}(x) = \frac{1}{B(a,b)} \int_0^x t^{a-1}\,(1-t)^{b-1}\,dt \tag{10}$$

is, up to a numerical factor, Euler's incomplete beta integral. The distribution $B(a, b)$ ($a > 0$, $b > 0$) having (10) for its distribution function is called the beta-distribution of order $(a, b)$.
4. Order statistics.

In nonparametric statistics the following problem is of importance: Let $\xi_1, \xi_2, \ldots, \xi_n$ be independent random variables with the same continuous distribution; let $F(x)$ be the distribution function of $\xi_k$. Arrange the values of $\xi_1, \ldots, \xi_n$ in increasing order,¹ and denote by $\xi_k^*$ the k-th of these ordered values; hence, in particular,

$$\xi_1^* = \min_{1 \le k \le n} \xi_k, \qquad \xi_n^* = \max_{1 \le k \le n} \xi_k. \tag{11}$$

$\xi_k^*$ is called the k-th order statistic of the sample $(\xi_1, \ldots, \xi_n)$.

¹ The probability that equal values occur is 0.

Determine now the distribution function $F_k(x)$ of $\xi_k^*$ ($k = 1, 2, \ldots, n$); clearly $\xi_k^* < x$ means that among the values taken on by $\xi_1, \ldots, \xi_n$ there are at least $k$ which are less than $x$. The probability that $r$ given variables among the $\xi_i$ are less than $x$ and the other $n-r$ greater than or equal to $x$ is given by $[F(x)]^r\,[1 - F(x)]^{n-r}$; since the first $r$ can be chosen in $\binom{n}{r}$ different ways, we have

$$F_k(x) = \sum_{r=k}^{n} \binom{n}{r}\,[F(x)]^r\,[1 - F(x)]^{n-r}. \tag{12}$$

This expression can be simplified by taking into account the identity

$$\sum_{r=k}^{n} \binom{n}{r}\,x^r\,(1-x)^{n-r} = \frac{n!}{(k-1)!\,(n-k)!} \int_0^x t^{k-1}\,(1-t)^{n-k}\,dt, \tag{13}$$

which gives

$$F_k(x) = B_{k,\,n+1-k}(F(x)), \tag{14}$$

where $B_{k,\,n+1-k}(x)$ is the incomplete beta function of order $(k, n+1-k)$ (cf. (10)). In the case when $F(x) = x$ for $0 \le x \le 1$, i.e. if the $\xi_k$ are uniformly distributed on the interval $(0, 1)$, $\xi_k^*$ has a beta distribution of order $(k, n+1-k)$, and, in particular, for $0 \le x \le 1$,

$$F_1(x) = P\left(\min_{1 \le k \le n} \xi_k < x\right) = 1 - (1-x)^n \tag{15}$$

and

$$F_n(x) = P\left(\max_{1 \le k \le n} \xi_k < x\right) = x^n. \tag{16}$$

If $\xi_1, \ldots, \xi_n$ are independent and have the same continuous, monotone, and strictly increasing distribution function $F(x)$, then the random variables $\eta_k = F(\xi_k)$ ($k = 1, \ldots, n$) are independent and uniformly distributed on the interval $(0, 1)$. In fact, if $x = F^{-1}(y)$ is the inverse function of $y = F(x)$, we have

$$P(\eta_k < x) = P(\xi_k < F^{-1}(x)) = F(F^{-1}(x)) = x \tag{17}$$

for $0 < x < 1$.

If now $\eta_k^*$ is the k-th among the random variables $\eta_1, \ldots, \eta_n$ ranked according to increasing order, it is clear that $\eta_k^* = F(\xi_k^*)$ and we have

$$P(\eta_k^* < x) = B_{k,\,n+1-k}(x). \tag{18}$$
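Formulas (15) and (16) lend themselves to a direct empirical check; the sketch below (illustrative $n$ and $x$) estimates the distribution functions of the minimum and the maximum of uniform samples.

```python
import random

# Empirical check of (15)-(16): the distribution functions of the minimum
# and maximum of n uniform(0, 1) variables.
random.seed(5)
n, N, x0 = 4, 100_000, 0.5
mins = maxs = 0
for _ in range(N):
    u = [random.random() for _ in range(n)]
    mins += min(u) < x0
    maxs += max(u) < x0
print(mins / N, 1 - (1 - x0) ** n)   # both about 0.9375
print(maxs / N, x0 ** n)             # both about 0.0625
```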


5. Mixtures.

Let $F_k(x)$ ($k = 1, 2, \ldots$) be arbitrary distribution functions and $\{p_k\}$ a discrete probability distribution. Then

$$F(x) = \sum_{k=1}^{\infty} p_k\,F_k(x) \tag{19}$$

is also a distribution function. It is called the mixture of the distribution functions $F_k(x)$ ($k = 1, 2, \ldots$) taken with the weights $p_k$. This concept was already defined in the foregoing Chapter for the particular case where the functions $F_k(x)$ are discrete distribution functions.
Consider the following example: a physical quantity is measured by two different procedures, the errors of the measurements being in both cases normally distributed, with density functions $\frac{1}{\sigma_1}\,\varphi\!\left(\frac{x}{\sigma_1}\right)$ and $\frac{1}{\sigma_2}\,\varphi\!\left(\frac{x}{\sigma_2}\right)$. $N_1$ measurements were performed by the first, and $N_2$ measurements by the second method, without registering which of the results was furnished by the first and which by the second of the methods (the measurements were mixed). What will be the distribution function of the error of a measurement chosen at random from these $N = N_1 + N_2$ measurements? If

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\,dt,$$

it follows from the theorem of total probability that this distribution function $F(x)$ is given by

$$F(x) = \frac{N_1}{N}\,\Phi\!\left(\frac{x}{\sigma_1}\right) + \frac{N_2}{N}\,\Phi\!\left(\frac{x}{\sigma_2}\right),$$

i.e. $F(x)$ is the mixture of the distribution functions of the errors of the two methods, taken with the weights $\frac{N_1}{N}$ and $\frac{N_2}{N}$.
B N N
It is easy to extend the notion of the mixture to a nondenumerable set of distribution functions. If $F(t, x)$ is, for each value of the parameter $t$, a distribution function, if for each fixed value of $x$ $F(t, x)$ is a measurable function of $t$, and if $G(t)$ is an arbitrary distribution function, the Stieltjes integral

$$H(x) = \int_{-\infty}^{+\infty} F(t, x)\,dG(t) \tag{20}$$

defines a distribution function, called the mixture of the distribution functions $F(t, x)$ mixed with the distribution function $G(t)$. If $G(t)$ is a discrete distribution function, (20) reduces to (19). It is easy to see that the function $H(x)$ defined by (20) is in fact a distribution function.

Let us consider an important application. One has often to determine the distribution function of the sum

$$\eta = \xi_1 + \xi_2 + \ldots + \xi_\nu \tag{21}$$

such that the number $\nu$ of the terms is a random variable. Assume that the $\xi_k$ are mutually independent and $\nu$ is independent of the $\xi_k$. Let $F_k(x)$ denote the distribution function of $\xi_k$, $G_n(x)$ the distribution function of $\zeta_n = \xi_1 + \ldots + \xi_n$, and $H(x)$ the distribution function of the random variable $\eta$ defined by (21); let further $P(\nu = n) = p_n$ ($n = 1, 2, \ldots$). Then, by the theorem of total probability,

$$H(x) = \sum_{n=1}^{\infty} p_n\,G_n(x), \tag{22}$$

i.e. $H(x)$ is a mixture of the distribution functions $G_n(x)$.

Example. If $p_n = \binom{n-1}{k-1} p^k q^{n-k}$ ($n = k, k+1, \ldots$) and $F_k(x) = 1 - e^{-\lambda x}$ for $x > 0$, further if the random variables $\nu, \xi_1, \xi_2, \ldots$ are independent, then

$$G_n(x) = \int_0^x \frac{\lambda^n\,t^{n-1}\,e^{-\lambda t}}{(n-1)!}\,dt$$

and, by (22),

$$H(x) = \int_0^x \frac{(p\lambda)^k\,t^{k-1}\,e^{-p\lambda t}}{(k-1)!}\,dt, \tag{23}$$

hence $\eta$ has a Γ-distribution of order $k$ and parameter $p\lambda$.
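The example can be checked by simulation. For $k = 1$ the mixing distribution is geometric and (23) reduces to an exponential distribution with parameter $p\lambda$; the sketch below (illustrative $\lambda$ and $p$) compares the empirical tail of the random sum with $e^{-p\lambda x}$.

```python
import math, random

# (21)-(23) with k = 1: a geometric number of independent exponential(lam)
# terms; the sum is then exponential with parameter p*lam.
random.seed(6)
lam, p, N, x0 = 1.0, 0.25, 100_000, 2.0

def random_sum():
    s = random.expovariate(lam)
    while random.random() > p:          # nu is geometric with parameter p
        s += random.expovariate(lam)
    return s

tail = sum(random_sum() > x0 for _ in range(N)) / N
print(tail, math.exp(-p * lam * x0))   # both about exp(-0.5) = 0.6065
```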

§ 11. The general notion of expectation


We shall now extend the notion of expectation to an arbitrary random variable $\xi$. In order to do this, assume that a great number of independent observations were made on the value of $\xi$. Arrange the observed values into classes such that the k-th class contains the values between $kh$ (included) and $(k+1)h$ (excluded) ($h > 0$; $k = 0, \pm 1, \pm 2, \ldots$). According to the law of large numbers the arithmetic mean of the observed values will be near to

$$\sum_{k=-\infty}^{+\infty} kh\,P(kh \le \xi < (k+1)h), \tag{1}$$

provided of course that the series is convergent; the approximation will be the closer, the smaller the value of $h$. Hence it is natural to define the expectation of $\xi$ by

$$E(\xi) = \lim_{h \to 0} \sum_{k=-\infty}^{+\infty} kh\,P(kh \le \xi < (k+1)h), \tag{2}$$

if this limit exists. If $\xi$ is a discrete random variable, this definition coincides with that given in the preceding Chapter.

Obviously, if the limit (2) exists, it represents the Lebesgue integral of the function $\xi = \xi(\omega)$ with respect to the probability measure $P$, i.e.

$$E(\xi) = \int_{\Omega} \xi(\omega)\,dP. \tag{3}$$
(2) can be interpreted in a different manner too. Let $\xi_h = h\left[\dfrac{\xi}{h}\right]$, where $[x]$ denotes the entire part of the real number $x$; $\xi_h$ is a discrete random variable and

$$\sum_{k=-\infty}^{+\infty} kh\,P(kh \le \xi < (k+1)h) = \sum_{k=-\infty}^{+\infty} kh\,P(\xi_h = kh) = E(\xi_h)$$

is the expectation of $\xi_h$; (2) can be written in the form

$$E(\xi) = \lim_{h \to 0} E(\xi_h); \tag{2'}$$

$\xi_h$ is the greatest multiple of $h$ not exceeding $\xi$. For $h = 10^{-r}$, $\xi_h$ is nothing else than the value of $\xi$ rounded off to $r$ decimal places.
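When the distribution function is known explicitly, (2) also yields a practical approximation scheme: $E(\xi_h)$ is a sum over the lattice points $kh$. The following minimal sketch (plain Python; the exponential distribution is chosen because its expectation $1/\lambda$ is known) illustrates the convergence as $h \to 0$.

```python
import math

# Approximating E(xi) by E(xi_h), the mean of xi rounded down to a multiple
# of h, for the exponential distribution F(x) = 1 - exp(-lam*x).
lam = 2.0
F = lambda x: 1.0 - math.exp(-lam * x) if x > 0 else 0.0

def E_h(h, kmax=100_000):
    return sum(k * h * (F((k + 1) * h) - F(k * h)) for k in range(kmax))

for h in (1.0, 0.1, 0.01):
    print(h, E_h(h))   # tends to 1/lam = 0.5 as h -> 0
```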
In what follows, the knowledge of the Lebesgue integral will be taken for granted; we shall give without proof the properties of $E(\xi)$ which follow directly from the properties of the Lebesgue integral. Theorems which in the general case can be proved in the same manner as in the case of discrete distributions, and which were proved for the latter in § 7 of Chapter III, will be formulated here without proof. But the reader may profit from carrying through these proofs for the general case too.
Evidently, the expectation $E(\xi)$ depends only on the distribution function of $\xi$; hence one may call $E(\xi)$ the expectation of the distribution of $\xi$. If $\xi$ is a random variable with distribution function $F(x)$, then

$$E(\xi) = \int_{-\infty}^{+\infty} x\,dF(x). \tag{4}$$

If $\xi$ is bounded with probability 1, then $E(\xi)$ exists. If $P(A \le \xi \le B) = 1$, then $A \le E(\xi) \le B$; in particular, if $P(\xi \ge 0) = 1$, we have $E(\xi) \ge 0$, the equality being valid if and only if $P(\xi = 0) = 1$. If the distribution function of $\xi$ is absolutely continuous and if $f(x)$ is the density function of $\xi$, then

$$E(\xi) = \int_{-\infty}^{+\infty} x\,f(x)\,dx. \tag{5}$$

E.g. for the Cauchy distribution with the density function

$$f(x) = \frac{1}{\pi\,(1+x^2)}$$

the expectation does not exist, since in this case the integral (5) does not converge.
Let us now consider some examples.

1. Expectation of the uniform distribution.

If $\xi$ is a random variable uniformly distributed on the interval $(a, b)$, it follows from (5) that

$$E(\xi) = \frac{a+b}{2},$$

which is also evident because of the symmetry of the uniform distribution.


2. Expectation of the normal distribution.

If $\xi$ is a normally distributed random variable, its density function has the form

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-m)^2}{2\sigma^2}\right) \quad (m \text{ real},\ \sigma > 0).$$

By applying (5) we obtain easily

$$E(\xi) = m.$$

Thus we have found the probabilistic meaning of one of the parameters of the normal distribution.¹

¹ Later on we shall see that $\sigma$ is the standard deviation of $\xi$.



where Я is a positive constant; from this it follows by (5) that

* W “ T -

In particular, the ordinary exponential distribution with the distribution


function 1 — e~kx for X > 0 has expectation Thus we found another
probabilistic meaning of the parameter A of the exponential distribution.
The disintegration constant of a radioactive substance is thus the inverse
of the mean life-time of a radioactive atom. As we have seen, the relation
1 h
— = - —— holds between the constant and the half-period h\ from this it
A 111 Z
follows that the mean life is equal to the product of the half-period and
- —— (i.e. it is 1.34 times the half-period).
In 2
4. Expectation of the χ²- and χ-distributions.

According to Formula (12) of § 9 the density function of $\chi_n^2$ is

$$h_n(x) = \frac{x^{\frac{n}{2}-1}\,e^{-\frac{x}{2}}}{2^{\frac{n}{2}}\,\Gamma\!\left(\frac{n}{2}\right)}.$$

A simple calculation gives

$$E(\chi_n^2) = n.$$

Similarly, for the expectation of $\chi_n$,

$$E(\chi_n) = \sqrt{2}\; \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)}.$$

By applying Stirling's formula, we find for $n \to \infty$

$$E(\chi_n) \sim \sqrt{n},$$

hence

$$E(\chi_n^2) \sim [E(\chi_n)]^2 \quad \text{if } n \to \infty.$$
5. Expectation of the beta distribution.

Let $\xi$ be a random variable with a beta distribution; its density function is

$$f(x) = \frac{x^{a-1}\,(1-x)^{b-1}}{B(a,b)} \quad (0 < x < 1).$$

From this, by (5),

$$E(\xi) = \frac{a}{a+b}.$$
6. Order statistics.

Let $\xi_1, \ldots, \xi_n$ be independent random variables, each uniformly distributed on the interval $(0, 1)$. Let $\xi_k^*$ be the random variable which assumes the k-th of the values $\xi_1, \ldots, \xi_n$ ranked according to increasing magnitude; by Formula (14) of § 10,

$$F_k(x) = B_{k,\,n+1-k}(x).$$

Hence $E(\xi_k^*) = \frac{k}{n+1}$; the expectations of the $\xi_k^*$ subdivide the interval $(0, 1)$ into $n+1$ equal intervals, as could also be guessed by a symmetry argument.
We hinted already at the analogy between probability distributions and distributions of masses. Consider now the distribution of the unit mass on a line, such that between the abscissas $a$ and $b > a$ there should lie a mass $F(b) - F(a)$, where $F(x)$ is a given distribution function. If $x_0$ is the center of gravity of this distribution, we know that

$$x_0 = \int_{-\infty}^{+\infty} x\,dF(x),$$

hence $x_0$ is equal to the expectation of the probability distribution which has $F(x)$ for its distribution function.
Let $\xi$ be an arbitrary random variable and $A$ an event of positive probability. We define the conditional expectation $E(\xi \mid A)$ of $\xi$ with respect to the condition $A$ as the limit

$$E(\xi \mid A) = \lim_{h \to 0} \sum_{k=-\infty}^{+\infty} kh\,P(kh \le \xi < (k+1)h \mid A).$$

Since $P(B \mid A) \le \frac{P(B)}{P(A)}$, the existence of $E(\xi)$ implies the existence of the conditional expectation $E(\xi \mid A)$ for any event $A$ such that $P(A) > 0$.

If $F(x \mid A)$ is the conditional distribution function of $\xi$ with respect to the condition $A$, then

$$E(\xi \mid A) = \int_{-\infty}^{+\infty} x\,dF(x \mid A). \tag{6}$$
IV , § II] T H E G E N E R A L N O T IO N O F E X P E C T A T IO N 213

Clearly, since

$$E(\xi \mid A) = \frac{\displaystyle\int_A \xi(\omega)\,dP}{P(A)} = \int_{\Omega} \xi(\omega)\,dQ,$$

where $Q(B) = P(B \mid A)$, and $Q(B)$ is a probability measure, all results valid for ordinary expectations are also valid for conditional expectations.
We shall now give some often used theorems.

Theorem 1. The relation

$$E\left(\sum_{k=1}^{n} c_k\,\xi_k\right) = \sum_{k=1}^{n} c_k\,E(\xi_k)$$

holds for any random variables $\xi_k$ with finite expectation and for any constants $c_k$. Thus the functional $E$ is linear.

This theorem is a direct consequence of (3) and of the corresponding properties of the integral.
Let $\xi$ and $\eta$ be two normally distributed independent random variables with density functions

$$\frac{1}{\sigma_1}\,\varphi\!\left(\frac{x-m_1}{\sigma_1}\right) \quad\text{and}\quad \frac{1}{\sigma_2}\,\varphi\!\left(\frac{x-m_2}{\sigma_2}\right).$$

The density function of the random variable $\xi + \eta$ is, as we have seen already, $\frac{1}{\sigma}\,\varphi\!\left(\frac{x-m}{\sigma}\right)$, where $m = m_1 + m_2$ and $\sigma = \sqrt{\sigma_1^2 + \sigma_2^2}$. It was proved above that the parameter $m$ figuring in the density function is the expectation of the distribution. Hence the relation $m = m_1 + m_2$ is a consequence of Theorem 1.

Similarly, because of Theorem 1, the expectation of the gamma distribution of order $n$ is $\frac{n}{\lambda}$, since the gamma distribution is the distribution of the sum of $n$ independent random variables with the same exponential distribution of parameter $\lambda$ (i.e. having the same expectation $\frac{1}{\lambda}$). The sum figuring in this example was one of independent random variables; one should, however, realize that Theorem 1 holds for any random variables, without any assumption about their independence.
Theorem 2. Let $A_n$ ($n = 1, 2, \ldots$) ($P(A_n) > 0$) be a complete system of events and $\xi$ a random variable such that its expectation $E(\xi)$ exists; then

$$E(\xi) = \sum_{n=1}^{\infty} E(\xi \mid A_n)\,P(A_n). \tag{7}$$

This follows immediately from the theorem of total probability.

The statement of Theorem 2 may be expressed in the following manner: Consider $\eta = E(\xi \mid A_\nu)$ as a random variable with its values depending on which one of the events $A_k$ ($k = 1, 2, \ldots$) occurred; i.e. $\eta = E(\xi \mid A_\nu)$, if the event $A_\nu$ occurred. Then the right hand side of (7) is just the expectation of this discrete random variable $\eta$, hence

$$E(\xi) = E(E(\xi \mid A_\nu)). \tag{8}$$

Theorem 3. If $\xi$ and $\eta$ are independent random variables such that $E(\xi)$ and $E(\eta)$ exist, then the expectation of $\xi\eta$ exists as well and

$$E(\xi\eta) = E(\xi)\,E(\eta). \tag{9}$$

Proof. Assume first $\xi \ge 0$. Let $A_k$ be the event $kh \le \eta < (k+1)h$; evidently, the events $A_k$ ($k = 0, \pm 1, \pm 2, \ldots$) form a complete system of events. Hence, by Theorem 2,

$$E(\xi\eta) = \sum_{k=-\infty}^{+\infty} P(A_k)\,E(\xi\eta \mid A_k). \tag{10}$$

The conditional expectations $E(\xi\eta \mid A_k)$ exist, since $\eta$ is bounded under the condition $A_k$. Since, however, $\xi$ and $\eta$ are independent, we have

$$E(\xi)\,kh \le E(\xi\eta \mid A_k) \le E(\xi)\,(k+1)h. \tag{11}$$

If we put this into (10), the series on the right side can be seen to converge, thus $E(\xi\eta)$ exists; further (9) holds, since the sums

$$\sum_{k=-\infty}^{+\infty} kh\,P(A_k) \quad\text{and}\quad \sum_{k=-\infty}^{+\infty} (k+1)h\,P(A_k)$$

tend to $E(\eta)$ if $h \to 0$. Thus (9) is proved for $\xi \ge 0$. The restriction $\xi \ge 0$ can be eliminated as follows: put

$$\xi_1 = \frac{|\xi| + \xi}{2}, \qquad \xi_2 = \frac{|\xi| - \xi}{2}; \tag{12}$$

then $\xi_1 \ge 0$, $\xi_2 \ge 0$ and $\xi = \xi_1 - \xi_2$. Since $\eta$ is independent of $\xi_1$ and $\xi_2$, we have

$$E(\xi\eta) = E(\xi_1\eta) - E(\xi_2\eta) = [E(\xi_1) - E(\xi_2)]\,E(\eta) = E(\xi)\,E(\eta), \tag{13}$$

and herewith Theorem 3 is proved.

Theorem 4. If $F(x)$ is the distribution function of $\xi$ and if $E(\xi)$ exists, the following limit relations are valid:

$$\lim_{x \to +\infty} x\,(1 - F(x)) = 0, \tag{14}$$

$$\lim_{x \to -\infty} x\,F(x) = 0. \tag{15}$$

Proof. Since $E(\xi)$ exists, the integral

$$\int_{-\infty}^{+\infty} |x|\,dF(x)$$

exists. Hence

$$0 \le \lim_{x \to +\infty} x\,(1 - F(x)) \le \lim_{x \to +\infty} \int_{x}^{+\infty} y\,dF(y) = 0.$$

The proof of (15) is similar.

Theorem 5. If $E(\xi)$ exists, it can be expressed by ordinary integrals:

$$E(\xi) = \int_{0}^{+\infty} (1 - F(y))\,dy - \int_{-\infty}^{0} F(y)\,dy. \tag{16}$$

Conversely, the existence of the integrals on the right-hand side of (16) implies the existence of the expectation $E(\xi)$.

Proof. An integration by parts gives

$$\int_{0}^{x} y\,dF(y) = -x\,(1 - F(x)) + \int_{0}^{x} (1 - F(y))\,dy \tag{17}$$

and

$$\int_{-x}^{0} y\,dF(y) = x\,F(-x) - \int_{-x}^{0} F(y)\,dy. \tag{18}$$

If we add term by term Equations (17) and (18) and let $x$ tend to infinity, we obtain, by (14) and (15), Formula (16).

Conversely, the existence of the integrals on the right-hand side of (16) implies the existence of the expectation $E(\xi)$. In fact, the convergence of the integrals implies for $x > 0$

$$x\,(1 - F(x)) \le 2 \int_{\frac{x}{2}}^{x} (1 - F(y))\,dy \quad\text{and}\quad x\,F(-x) \le 2 \int_{-x}^{-\frac{x}{2}} F(y)\,dy,$$

hence (14) and (15) are valid. Because of (17) and (18), the second part of
Theorem 5 follows.
Theorem 5 has the following graphical interpretation: draw the curve representing $F(x)$ and the line $y = 1$. The expectation is equal to the difference of the areas of the domains marked by $+$ and $-$ in Fig. 24. The (evident) fact follows that a distribution symmetric with respect to $x = a$ has expectation $a$, if this expectation exists. A distribution is said to be symmetric with respect to $a$ if

$$F(a - x) = 1 - F(a + x + 0).$$
Theorem 6. If $H(x)$ is a continuous function which is of bounded variation on every finite interval,¹ and $\xi$ is a random variable with the distribution function $F(x)$, then

$$E(H(\xi)) = \int_{-\infty}^{+\infty} H(x)\,dF(x) \tag{19}$$

whenever $E(H(\xi))$ exists.

Proof. Since every function of bounded variation is the difference of two monotone functions, it suffices to prove the theorem for monotone $H(x)$. Let $x = H^{-1}(y)$ be the inverse function of $y = H(x)$. If $H(x)$ is monotone increasing, $P(H(\xi) < x) = P(\xi < H^{-1}(x))$, hence

$$E(H(\xi)) = \int_{-\infty}^{+\infty} y\,dF(H^{-1}(y)). \tag{20}$$

Relation (19) results from (20) by the transformation $x = H^{-1}(y)$ of the variable of integration.

Examples. 1. The expectations $E(\xi^n)$, if they exist, are expressed by

$$E(\xi^n) = \int_{-\infty}^{+\infty} x^n\,dF(x) \tag{21}$$

and are called the moments of order $n$ ($n = 1, 2, \ldots$) of the random variable $\xi$.

2. The quantity

$$\varphi_\xi(t) = E(e^{it\xi}) = E(\cos t\xi) + i\,E(\sin t\xi) = \int_{-\infty}^{+\infty} e^{itx}\,dF(x)$$

is the characteristic function of the random variable $\xi$.

Characteristic functions play an important role in the study of distribution functions; Chapter VI will deal with them.

Theorems 4 and 5 of Chapter III, § 8 are also valid in the general case. Their proof is almost the same as for discrete random variables.

¹ Relation (19) holds for every Borel function $H(x)$ provided that its expectation $E[H(\xi)]$ exists; cf. § 17, Exercise 47.

§ 12. Expectation vectors of higher dimensional probability distributions

If the distribution of an n-dimensional random vector

$$\zeta = (\xi_1, \ldots, \xi_n)$$

is known, then so are the components $E(\xi_k)$ ($k = 1, \ldots, n$) of its expectation. They can be considered as the components of an n-dimensional vector

$$E(\zeta) = (E(\xi_1), \ldots, E(\xi_n)),$$

called the expectation vector of the random vector $\zeta$. In the three-dimensional case, the expectation vector specifies the center of gravity of the corresponding mass-distribution.

Let us calculate for example the expectation vector of a normally distributed n-dimensional random vector $\zeta = (\eta_1, \ldots, \eta_n)$, where the density function of $\zeta$ is given by Formula (18) of § 7. By the definition of the n-dimensional normal distribution, the components $\eta_k$ can be exhibited in the form

$$\eta_k = m_k + \sum_{j=1}^{n} c_{kj}\,\xi_j,$$

where the $\xi_j$ are normally distributed independent random variables with density function $\frac{1}{\sigma_j}\,\varphi\!\left(\frac{x}{\sigma_j}\right)$ and thus expectation $E(\xi_j) = 0$; hence $E(\eta_k) = m_k$. Thus we have found the probabilistic meaning of the parameters $m_k$ figuring in Formula (18) of § 7.

§ 13. The median and the quantiles

The notion of the median is related to that of the expectation. Let $\xi$ be a random variable with continuous distribution function $F(x)$, strictly increasing for every $x$ such that $0 < F(x) < 1$. The median of $\xi$ is the (unique) number $x$ for which $F(x) = \frac{1}{2}$. If the distribution is symmetric with respect to a certain point, then the median always coincides with the expectation, if the latter exists. There are certain distributions for which the expectation does not exist, but the median always exists. Consider for instance the Cauchy distribution, with density function $f(x) = \frac{1}{\pi}\,(1+x^2)^{-1}$. Here the expectation does not exist, but the median does and is evidently equal to zero.
We introduce the somewhat more general notion of a quantile. The q-quantile ($0 < q < 1$), denoted by $Q(q)$, of a random variable $\xi$ (or, more precisely, of the corresponding distribution function $F(x)$, continuous and strictly increasing for $0 < F(x) < 1$, by assumption) is defined as that value of $x$ for which $F(x) = q$. In this notation the median is equal to $Q\left(\frac{1}{2}\right)$. In particular, $Q\left(\frac{1}{4}\right)$ is called the lower quartile and $Q\left(\frac{3}{4}\right)$ the upper quartile. For the normal distribution with distribution function $\Phi\!\left(\frac{x-m}{\sigma}\right)$, where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\,dt,$$

the lower and upper quartiles are

$$Q\left(\frac{1}{4}\right) = m - 0.6745\,\sigma \quad\text{and}\quad Q\left(\frac{3}{4}\right) = m + 0.6745\,\sigma,$$

as follows from the tables of the normal distribution function. Since $F(Q(q)) = q$, the function $x = Q(q)$ is the inverse of the distribution function $q = F(x)$.
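Since $Q$ is the inverse of $F$, a q-quantile can be computed by simple bisection whenever $F$ can be evaluated; a minimal sketch (using the $\Phi$ of § 7):

```python
import math

# Computing the q-quantile Q(q) by bisection, using that x = Q(q) is the
# inverse of the distribution function q = F(x).
def quantile(F, q, lo=-100.0, hi=100.0, tol=1e-10):
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
print(quantile(Phi, 0.75))   # about 0.6745, the upper quartile of N(0, 1)
```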
Now we shall prove a simple but important inequality.

Theorem 1 (Markov inequality). Let $\xi$ be a positive random variable with finite expectation $E(\xi)$. Then for every $\lambda > 1$ we have

$$P(\xi \ge \lambda\,E(\xi)) \le \frac{1}{\lambda}. \tag{1}$$

(The inequality also holds for $0 < \lambda \le 1$, but in this case it is trivial, since every probability is at most equal to 1.)

Proof. From

$$m = E(\xi) = \int_{0}^{\infty} x\,dF(x)$$

follows

$$m \ge \int_{\lambda m}^{\infty} x\,dF(x) \ge \lambda m \int_{\lambda m}^{\infty} dF(x) = \lambda m\,(1 - F(\lambda m)),$$

which proves (1).

If $F(x)$ is continuous and strictly increasing and if $x = Q(y)$ is the y-quantile, i.e. the inverse function of $y = F(x)$, then (1) can be written in the form

$$Q\left(1 - \frac{1}{\lambda}\right) \le \lambda m.$$

In particular (for $\xi > 0$), the upper quartile can never exceed the fourfold of the expectation.

§ 14. The general notions of standard deviation and variance

As in the discrete case, the quantity

$$D(\xi) = +\sqrt{E\left([\xi - E(\xi)]^2\right)} \tag{1}$$

is used as a measure of the magnitude of fluctuations of the random variable $\xi$ around its expectation. $D^2(\xi)$ is called the variance, $D(\xi)$ the standard deviation of $\xi$. $D(\xi)$ is a nonnegative number which is zero if and only if $P(\xi = c) = 1$ for some constant $c$. According to Theorem 6 of § 11, (1) can be written in the form

$$D^2(\xi) = \int_{-\infty}^{+\infty} (x - E(\xi))^2\,dF(x) = \int_{-\infty}^{+\infty} x^2\,dF(x) - \left(\int_{-\infty}^{+\infty} x\,dF(x)\right)^2, \tag{2}$$

where $F(x)$ is the distribution function of the random variable $\xi$. If this distribution function is absolutely continuous and if we put $F'(x) = f(x)$, then we have

$$D^2(\xi) = \int_{-\infty}^{+\infty} (x - E(\xi))^2\,f(x)\,dx = \int_{-\infty}^{+\infty} x^2\,f(x)\,dx - \left(\int_{-\infty}^{+\infty} x\,f(x)\,dx\right)^2. \tag{3}$$
Theorems 1-5 of § 9 and 1-2 of § 10 of Chapter III about the variance are also valid in the general case. The proofs are essentially the same, since they rest upon the corresponding theorems concerning the expectation. We shall now calculate the variances of some particular distributions.

1. Uniform distribution.

If $\xi$ is a random variable uniformly distributed on $(a, b)$, then by (3)

$$D(\xi) = \frac{b-a}{2\sqrt{3}}.$$

2. Normal distribution.

Let $\xi$ be a random variable with density function

$$\frac{1}{\sigma}\,\varphi\!\left(\frac{x-m}{\sigma}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\left(-\frac{(x-m)^2}{2\sigma^2}\right).$$

We know that $E(\xi) = m$. By the transformation of the variable $\frac{x-m}{\sigma} = u$ we obtain

$$D^2(\xi) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{+\infty} (x-m)^2\,\exp\left(-\frac{(x-m)^2}{2\sigma^2}\right) dx = \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} u^2\,e^{-\frac{u^2}{2}}\,du.$$

From here follows by a simple calculation

$$D^2(\xi) = \sigma^2.$$

Thus we have found the probabilistic meaning of the parameter $\sigma$ of the normal distribution.
3. Exponential distribution.

If the density function of the random variable $\xi$ is given by $\lambda e^{-\lambda x}$ for $x > 0$, then we have seen that $E(\xi) = \frac{1}{\lambda}$. Hence

$$D^2(\xi) = \lambda \int_{0}^{\infty} \left(x - \frac{1}{\lambda}\right)^2 e^{-\lambda x}\,dx = \frac{1}{\lambda^2}$$

and

$$D(\xi) = \frac{1}{\lambda}.$$

The standard deviation of the exponential distribution is thus numerically equal to its expectation.
4. Student's distribution.

Let $\xi$ be a random variable having Student's distribution with $n$ degrees of freedom; its density function is given by Formula (5) of § 10. Since this density is an even function, $E(\xi) = 0$ for $n \ge 2$. (For $n = 1$, i.e. in the case of the Cauchy distribution, the expectation does not exist.) By applying (3) we obtain

$$D^2(\xi) = \frac{2\,\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac{n}{2}\right)} \int_{0}^{+\infty} \frac{x^2\,dx}{\left(1+\frac{x^2}{n}\right)^{\frac{n+1}{2}}}.$$

Take for new variable of integration $y = \frac{x^2}{n+x^2}$; then

$$D^2(\xi) = \frac{n}{n-2} \quad \text{for } n > 2;$$

for $n = 2$ the variance is infinite.
5. Beta distribution.

If $\xi$ has a beta distribution $B(a, b)$, then $E(\xi) = \frac{a}{a+b}$, as we have seen. From this, by (3),

$$D^2(\xi) = \frac{ab}{(a+b)^2\,(a+b+1)}.$$
6. Convolution of normal distributions.

Let $\xi$ and $\eta$ be independent normally distributed random variables with densities

$$\frac{1}{\sigma_1}\,\varphi\!\left(\frac{x-m_1}{\sigma_1}\right) \quad\text{and}\quad \frac{1}{\sigma_2}\,\varphi\!\left(\frac{x-m_2}{\sigma_2}\right).$$

The density function of $\xi + \eta$ is $\frac{1}{\sigma}\,\varphi\!\left(\frac{x-m}{\sigma}\right)$ with $m = m_1 + m_2$ and $\sigma = (\sigma_1^2 + \sigma_2^2)^{\frac{1}{2}}$. The relation $m = m_1 + m_2$ is valid since the expectation of the sum of two random variables is equal to the sum of the expectations of the terms. The relation $\sigma = (\sigma_1^2 + \sigma_2^2)^{\frac{1}{2}}$ follows from Theorem 1 of § 10 in Chapter III, since we have seen that the parameter $\sigma$ represents the standard deviation of the normal distribution.
7. Variance of the gamma distribution of order n.

Let $\xi_1, \ldots, \xi_n$ be independent random variables with the same density function $\lambda e^{-\lambda x}$ for $x > 0$; their sum $\zeta_n = \xi_1 + \ldots + \xi_n$ has a gamma distribution of order $n$. Since (cf. No. 3) $D^2(\xi_k) = \frac{1}{\lambda^2}$, Theorem 1 of § 10, Chapter III implies $D^2(\zeta_n) = \frac{n}{\lambda^2}$. Of course a direct proof is also possible.

8. Variance of Pearson's χ²-distribution.

The χ²-distribution with $n$ degrees of freedom was defined as the distribution of the sum of the squares of $n$ independent random variables having the same normal distribution. The variance of the square of a normally distributed random variable with density function

$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}$$

is, according to Theorem 6 of § 11 and Formula (3), equal to

$$D^2(\xi^2) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} x^4\,e^{-\frac{x^2}{2}}\,dx - [E(\xi^2)]^2 = 2.$$

Consequently, according to Chapter III, § 10, Theorem 1, the standard deviation of the χ²-distribution with $n$ degrees of freedom is $\sqrt{2n}$.

§ 15. On some other measures of fluctuation

The difference of the quartiles characterizes to some extent the fluctuations of a random variable. The quantity $\frac{1}{2}\left[Q\left(\frac{3}{4}\right) - Q\left(\frac{1}{4}\right)\right]$ is called the quartile deviation and is denoted by $q(\xi)$. If $\xi$ is uniformly distributed on $(0, 1)$, then $q(\xi) = \frac{1}{4}$; if $\xi$ is normally distributed, the tables for the normal distribution give $q(\xi) \approx 0.6745\,\sigma$. It is to be noted that in some (chiefly older) books the density function of the normal distribution is given not in the form

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\,dt,$$

but by

$$\Psi(x) = \Phi(\varrho\sqrt{2}\,x) = \frac{\varrho}{\sqrt{\pi}} \int_{-\infty}^{x} e^{-\varrho^2 t^2}\,dt,$$

where $\varrho \approx 0.477 \approx \frac{0.6745}{\sqrt{2}}$. Either of these two forms can be taken as the "standard form"; it is a question of convention which one is chosen. If a normal distribution function is brought to the form $\Phi\!\left(\frac{x-m}{\sigma}\right)$, the expectation $m$ and the standard deviation $\sigma$ can be obtained immediately, without calculation; if a normal distribution is brought to the form $\Psi\!\left(\frac{x-m}{q}\right)$, the expectation $m$ and the quartile deviation $q$ can be read off without any further computation.
If the distribution function $F(x)$ of the random variable $\xi$ is continuous and strictly increasing for $0 < F(x) < 1$, then the value of $\xi$ lies with probability $\frac{1}{2}$ in the interval $\left(Q\left(\frac{1}{4}\right), Q\left(\frac{3}{4}\right)\right)$. Clearly, every interval $\left(Q(\delta), Q\left(\delta + \frac{1}{2}\right)\right)$ $\left(0 < \delta < \frac{1}{2}\right)$ possesses the same property. If the distribution is symmetric with respect to the origin and if its density function is monotone decreasing for $x > 0$, then $\left(Q\left(\frac{1}{4}\right), Q\left(\frac{3}{4}\right)\right)$ is the smallest interval possessing this property. In this case

$$Q\left(\frac{1}{4}\right) = -Q\left(\frac{3}{4}\right), \quad\text{hence}\quad q(\xi) = Q\left(\frac{3}{4}\right).$$

Theorem 1. For every random variable $\xi$ symmetrically distributed about the origin with a continuous distribution function $F(x)$ that is strictly increasing for $0 < F(x) < 1$, the inequality

$$q(\xi) \le \sqrt{2}\,D(\xi) \tag{1}$$

is valid.

Proof. As $\xi$ is symmetric with respect to the origin, $D^2(\xi) = E(\xi^2)$. Put $\lambda = \frac{q^2(\xi)}{D^2(\xi)}$ and apply the Markov inequality (§ 13, Theorem 1) to the random variable $\xi^2$. Then we obtain

$$P(|\xi| \ge q(\xi)) \le \frac{D^2(\xi)}{q^2(\xi)}. \tag{2}$$

On the other hand, because of the symmetry of the distribution,

$$P(|\xi| \ge q(\xi)) = \frac{1}{2}. \tag{3}$$

From (2) and (3) it follows that $\frac{1}{2} \le \frac{D^2(\xi)}{q^2(\xi)}$, which proves (1).
The inequality (1) is sharp. This is shown by the following example: let the distribution of the random variable $\xi$ be the mixture, with weights $\frac{1}{4}, \frac{1}{2}, \frac{1}{4}$, of three normal distributions with the same standard deviation $\varepsilon$ ($> 0$) and expectations $-1, 0, +1$. Since $\varepsilon$ can be chosen arbitrarily small, it follows from the example that the $\sqrt{2}$ figuring in (1) cannot be replaced by a smaller number.

The quartile deviation $q(\xi)$ is mostly used when the standard deviation of $\xi$ is infinite, e.g. in the case of the Cauchy distribution.
The standard deviation of a random variable that is uniformly distributed
on the interval (m — a, m + a) is given by —= . If ^ is an arbitrary random
V3
variable with E(£) — m and D(f) = a, the interval
( m - f f y / з , m + <7^3) (4 )

may be characterized by the fact that a new variable uniformly distributed


on this interval has the same expectation and the same standard deviation
as <Ц. The interval (4) is called the interval o f concentration of £, the inverse
of its length is called concentration of £ and is denoted by k(f).
Sometimes the absolute mean deviation

d($) = E ( \ S - E ( Q \ )
is also used as a measure of fluctuations. By Theorem 6 of § 11

m = 7 \x -E (0 \d F (x ).
— 00

For the normal distribution

for the uniform distribution on an interval

J J
d c o -^ -а д ,

and for the exponential distribution

d{f) = 2 D (0 .
e
Of course Theorem 4 of § 9 is also valid in the general case.
I V , § 16] V A R IA N C E IN H IG H E R D IM E N S IO N A L C A SE 225

§ 16. Variance in the higher dimensional case

Let £ = be an «-dimensional random vector with distribution


function F(xb . . x„). Clearly the fluctuation of £ cannot be characterized
by a single number. The standard deviations Z)(£*) furnish certain infor­
mation. But this is insufficient, since these standard deviations are only
concerned with the fluctuations of the projections on the coordinate axes
and the choice of the axes is arbitrary. More information about the fluctua­
tions of £ is obtained by considering its projections on all possible straight
lines. Put mk = £■(£&) and let P0 be the point (тъ . . m„). Let g be a
line passing through P0 with direction cosines oq,. . ., a„ (oq, . .. ., a„ are
П
real numbers for which £ af = !)• Put
k =1

t g = Y ak ^ k - m k) ( 1)
*=1

and calculate D~(ig). (1) implies E(Cg) = 0, hence


Z)2( Q = £ ( Q .
If we put
D y = £ ( ( £ , • - IW ;) ( £ / - ntj)), (2 )
th e n

D \ i g) = t t (3)
i= l 7= 1

Let D denote the matrix of coefficients Di; :

r Dn . . . D u '
D= ................. . (4)
4 T)„i.. .Dnn

Because of (3), the determinant | D \ is always nonnegative. If | D | = 0,


we have a so-called degenerate distribution. In what follows, it will be always
assumed that | D \ > 0. From the coefficients Du the standard deviation
of the projection of £ on an arbitrary line can be calculated. Thus the fluc­
tuation of £ can be characterized by the matrix (4), called the dispersion
matrix 1 of £. The «-dimensional ellipsoid with equation

ÉI Dijxi Xj = c2
/-i j =i
1 Since D u is called the covariance of £, and £ t he dispersion matrix is also
called the covariance matrix.
226 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 16

is called the dispersion ellipsoid of the distribution. It is easy to see that the
dispersion matrix is invariant under a shift of the coordinate system. Under
the rotation of the coordinate system D is transformed as a matrix of a
tensor. Let in fact C = (c,7) be an orthogonal matrix and

& = Z ckj (£; - "»/),


j =i
then E(£'k) = 0 and

Ои = Е {?,® = £ с 1к Í c jhDkh.
k=1 h=l
If the matrix (/)',) is denoted by D’, we have
D ’ = C D C * ,

where C* is the transpose of the orthogonal matrix C. Hence we may speak


about the dispersion tensor, which does not depend on the choice of the
coordinate system. Again, the similarity to the moments of inertia should
be noticed: in case of several dimensions the moment of inertia is also
characterized by a tensor and by ellipsoids of inertia.
We have now to deal with the notion of ellipsoid o f concentration. For
the sake of simplicity let us restrict ourselves to the case of two dimensions.
Consider the ellipse

E(x, y) = Ax2 + 2Bxy + Cy2 = 4 (AC - B2 > 0) (5)

and suppose that the random vector ft = (t]b rj2) is uniformly distributed
inside this ellipse. The elements of the dispersion matrix of ft, i.e. the
numbers di} = Е(гцц}) (i,j = 1 , 2 ) are defined by

dn = j x2dxdy, d12 = dn = j j xydxdy,


E* E
, 1 ff 2 , , (6 )
«22 = -pr у dxdy,
F JJ
E
where E denotes the interior of the ellipse with Equation (5) and F its area.
Calculation of the integrals in (6 ) gives

С В A
d u = AC — B2 d'* = d* = - A C - B 2 ’ dv2 = ~AC - B2 ■ (7)
IV , § 16] V A R IA N C E I N H I G H E R D IM E N S IO N A L C A S E 227

Let £ = (£,, £2) be any random vector. Choose the numbers A, B, C such
that the dispersion matrix of a random vector uniformly distributed in
the ellipse (5) coincides with that of £. We put, therefore

AЛ -. —
°2 2
, Вp --------A Cü - —^ _
,A l ,o \
(8 )

where A = Dn D22 — D\2. Hence the ellipse

D22x 2 - 2 D 12x y + D n y 2 = 4A (9)

has the property that a random vector uniformly distributed in it possesses


the same dispersion matrix as the given random vector £. The ellipse (9) is
called the ellipse o f concentration of the random vector £ = (£x, £2) and the
number

k (0 = > (10)
An у / A

i.e. the reciprocal of the area of the ellipse (9), is called the concentration
of £.
(A B\
If А, В, C are chosen according to (8 ), the matrix is the inverse of
[В C

A i A 2 |

. A l T>22 J

The case of higher dimensions turns out to be quite similar. The equation
of the ellipsoid of concentration is here

z z ^ = « + 2- (11)

where d is the value of the determinant | Dy \ and Atj the value of the co­
factor of the element in the г-th row and y'-th column. The concentration,
that is to say the reciprocal of the volume of the ellipsoid ( 1 1 ), is equal to

*( 0 = ------ ----------------• (12)


(n + 2 ) 2 л 2 J a

Of course, this holds only for A > 0. If A = 0, the point (£b . .., £„)
lies, with probability 1 , on a hyperplane of at most n — 1 dimensions;
228 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 16

hence the distribution effectively is not и-dimensional. Indeed, | D | = 0


implies the existence of numbers х ъ . . . ,xn which do not all vanish and
satisfy
f JDijXj = 0 0 = 1 ,...,« ).
j =i
But then
E ( [ t x J ^ j - m J) f ) = 0 ;
j=i
consequently the random vector (£ь . . ., £„) lies with probability 1 on the
hyperplane

£ xj (Zj - rtij) = 0.
1

Consider now in some detail the two-dimensional normal distribution.


Let

/(* , У) = B exp — ( Ax2 + 2Bxy + C /)l (13)


2л 2

be the density function of the two-dimensional random vector £ = (£, rj).


The expectations of £ and r\ are equal to zero and the elements of the dis­
persion matrix are

C B A
Dn = а с - в 2’ Dl2 = ~ H c ^ ¥ ’ ° 22~ А С - В * '

It follows that
._ D 22 n _ _ -^12 p _ -fin
“ TŐT’ \D \ ’ ~~ ID I ’
where | D | = Dn D22 — fi?2. If we put

Du /—— — ,,
8 = r n — y.— , °i = V D u, o2 = \J D 22. 5 (14)
V ^11 ^ 2 2
we find
1
/( * ,/ = 2 п< /Т ^ Т
т1(т2%/ 1 - Q
*

1 A2 / 1 1
X exp ----- 5Г —;----------------h —и ■ (15)
2 (1 - / ) / сггег2 а \)
IV , § 16] V A R IA N C E I N H I G H E R D IM E N S IO N A L C A S E 229

The number q is the correlation coefficient I?(£, rj) of the random variables
£ and )]. We have already introduced this quantity for discrete distributions.
It is similarly defined in the general case and its properties are the same.
Thus
. í(K -£ « )i[> i-£ (i)i) в д й - вдж ч )
--------- Щ ) Ш —
= * (16)
Theorems 1, 2, 3 and 5 of Chapter III, § 11 are valid and can be proved in
nearly the same way.

T heorem 1. I f the random vector £ = (£, rj) is normally distributed in the


plane and i f i?(£, rj) = 0 , then £ and ц are independent.

Proof. If q = 0, we see from (15) that the density function is given by

,, . 1 X 1 1 у )
f ( x , y ) = ---- <p ----- ------- <p ----- ; (17)
°1 °"l / °2 <f2 J
hence £ and ц are independent. This theorem is easily generalized to any
number of dimensions.

T heorem 2. I f the random vector £ = (>js, . . rjn) is normally distributed


in the n-dimensional space and R(t]h t]j) = 0 i f i ф j, then the random vari­
ables 17j, . . .,ц п are independent.

P roof. By § 7 we can write


П
7/ = I cjkZk + mj (7 = 1 , 2 , . . . , и). (18)
k =1
Now the random variables £k (k = 1, 2 ,. . . , ri) are pairwise independent,
each of them is normally distributed, the expectation of is zero and
its standard deviation ak; hence C = (cjk) is an orthogonal matrix. Since
by assumption R{rjb rjf = 0 for i Ф j, we have
П

E Cik Cjk = 0 for j ф i. (19)


k =1
The meaning of this is, however, that the vector

(ca a \ , . . . , cin a n
2) (20)

is orthogonal to every one of the vectors (cn , . . . , cjn), j ф i. This means


that thevector(20) must be parallel to the vector (cn , . . . , cin). Hence
230 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV, § 16

there exists a constant /, ф 0 such that

cik a\ = l, cik (k = 1, 2 , . . . , и). (2 1 )


On the other hand, as the inverse of C is its transpose, we can write,
according to (18),
П
Zk = Z cJk (ij - mj l
j =i
Thus we obtain for the density function of the random vector (r]1 , . . . , rj„)
1 Г 1 " 1 I " 2 "
в(Уъ • • •>Уn) = ------- ---- ----- exp - — £ , X сд О / - mj) ■ (22)
(2 л ) 2 Г К 1 * = 1 ** U J
k = l

But it follows from (21), that


w 1 i n \ 2 n ( у __ yyj \2

I I C jk iy j- mj ) 2 ; (23)
fc=l °k u =l ) 1=1 A;

and consequently
, 4 i 1 £ (л - w , ) 2 '
9 (У ъ • • ■> J' «) = ------------- — ---------- e x P - t L ----------j ---------- >
(2 ТГ) 2 Y \ a k 1 ,=1 J
k=l
which proves the independence of the random variables rjy ,. . . , rj„.
Remark. If, instead of assuming that the random vector (rj1 ; . . . , tj„)
has an «-dimensional normal distribution, the weaker condition is assumed
that the components rjy, . . . ,r]n are each normally distributed, then the
assertion of Theorem 2 is false. This can be seen from the following example:
Let the density function of the random vector (£, 17) be

h(x, y) = —— J l e 2 - e~x' е~уг + J l e 2 — e~yt e~x‘ .


2n Ц (v
(It is readily verified that h(x, y) is in fact a density function.) The density
functions f(x ) and g(x) of £ and t] are

f{ x ) = g{x) = —- L = e 2 ,
J2n
i.e. ij and r\ are normally distributed with expectation 0 and standard
deviation 1. Since h (x,y) is an even function of both x and y, it follows
IV , § 16] V A R IA N C E I N H I G H E R D I M E N S I O N A L C A S E 231

that R(£, q) = 0. The random variables £ and q, however, are not indepen­
dent, since evidently h(x, у ) Ф f(x) g{y); thus £ and q are each normally
distributed and are uncorrelated, but they still are dependent.
From Theorem 2 follows

T heorem 3. Let ,. . . , be mutually independent random variables


with the same normal distribution with E (tk) = 0, D(£k) = 1, (k = 1,. .., и).
I f ax , , an and bl t . . . ,b„ are real numbers, not all equal to zero, then
the random variables
n n

П1 = Z ak Zk and r]2 = Z bk U
k = 1 k = 1

n
are independent if and only if akbk = 0 .
k=1

P roof. Since
П
Z ak bk
ъ) = ,T - ,
J Z a llb Z
V k= 1 k =1

the necessary condition that rjx and rj2 should be uncorrelated is that

t ak bk = 0 .
k =1

We shall now show that the random vector (q, , q2) is normally distributed.
There can be found an orthogonal matrix (ckl) such that

clk = Áak, c2k = gbk (k = 1, . . n).

The и-dimensional distribution of the random variables


П
= Z c)ktk U = Ь 2 ,...,и)
k =1
is thus a normal distribution and the same holds for the two-dimensional
distribution of и, = and q2 = . Since
A p

ak
—j = and, bk
-y——
J i°l J ib l
232 GENERAL THEORY OF RA ND O M V A R IA B L E S (IV , § 17

are the direction cosines of two directions,

£ akbk = 0
k =1
means that these two directions are orthogonal and rji, r\ 2 are (up to a
numerical factor) the projections of the random vector , . . . , £„) on
these directions. Our result may thus be formulated as follows: If , . . . ,
are mutually independent random variables with the same normal distri­
bution, then the projections of the random vector £ = ......... £„) on
two lines dt , d., are independent iff dt and d2 are orthogonal.

§ 17. Exercises

1. Let the distribution function F(x) of the random variable Z be continuous and
strictly increasing for — c o < X <C H - o o . Determine the distribution function of the
following random variables:

a) Vi = F(i), b) ry2 = In , c) t]3 = W i f i ® )

where x = 4*(y) is the inverse function of the normal distribution function


X
1 r _ JL
у = Ф(х) — — = - e ~ d t.
%/2я А
2. Draw the curve of the density function
1 ( (x — mУ i
У = / ( a) = .— - e x p ------- — —
/ 2я a 1 2 ct2 I
and determine its points of inflexion. Let A and В be their abscissas. Calculate the
probability that the value of a random variable with density function fix) lies between
A and B.
3. Draw the curve of the density function
1 Г (In x — rrif I
— — exp — -------——------ for x > 0,
y = f (x )= In ax
0 for a < 0

of the lognormal distribution and calculate its extrema and points of inflexion. Calcu­
late the expectation and standard deviation of the lognormal distribution.4*
4. a) Show that if the random variable £ has a lognormal distribution, the same
holds for rj — ci* (c > 0 ; а ф 0 ).
I V , § 17] E X E R C IS E S 233

b) Suppose that the diameters of the particles of a certain kind of sand possess
a lognormal distribution; let f(x) be the density function of this distribution (cf. Exer­
cise 3), with m = — 0.5, a = 0.3; x is measured in millimeters. The sand particles
are supposed to have spherical form. Find the total weight of the sand particles which
have diameters less than 0.5 mm, if the total weight of a certain amount of sand is
given.
5. Let the random variable rj have a lognormal distribution with density function

„ 1 (In x — m)2 1 „
f ( x ) -- —— — e x p --------- — ------- for x > 0 .
^Jbiox 2a
If the curve of у = fix) is drawn on a paper where the horizontal axis has a logarithmic
subdivision, then (apart from a numerical factor) one obtains a normal curve. It does
not coincide with the density function of In rj, but is shifted to the left over a dis­
tance a2.
6 . Let the random point (£, rj) have a normal distribution on the plane, with density
function
,, V 1 x2 + y 2
КХ’ У ) = 2 ^ - eXP ['-------2a2

Find the density function of f = max ( |£ |, |?j|).


7. a) Let the random point ({, rj) have the same distribution as in Exercise 6 . Show
that the angle 0 between the vector f = ({, rj) and the x-axis is uniformly distributed
on the interval (0 , 2 я).
b) Determine the density function of 0, if the point ({, rj) has density

1 1 ( X? Xj \
----------- exp — r -t----- r .
2 лах <r2 2 \ a\ a\ ]98
8 . Let the density function of the probability distribution of the life-time of the
tubes of a radio receiver with 6 tubes be X2t e~l‘ for t > 0, where Я = 0.25 if the
unit of time is a year. Find the probability that during 6 years no one of the tubes
has to be replaced. (The life-times of the individual tubes are supposed to be indepen­
dent of each other.)

9. A distribution with density function у = f(x) satisfying the differential equation


У' x + a
— = - 3—— —;—5—— («, p, у, о are constants)
у ß + у х + 0 x2
is called a Pearson distribution. Show that the following are Pearson distributions:
a) the normal distribution
b) the “exponential” distribution
c) the gamma-distribution
d) the beta-distribution
e) Student’s distribution
f) the x2 distribution
g) Cauchy’s distribution
234 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV, § 17

Cm~ 1 _ c
h) the distribution with density function f ( x ) = —---- —— x~m e x for x > 0 ;
(m — 2 )!
(c > 0 ; m = 2,3,...).
10. a) Let the point (c, ту) be uniformly distributed in the interior of the unit circle.
We put
---------- 77
Q = у / Í2 + V , <p = arc tan — .

Show that e and <p are independent.


b) Let the point (£, ту, Í) be uniformly distributed on the surface of the unit sphere.
Introduce spherical coordinates 0 and q> (geographical longitude and latitude) and
show that 0 and <p are independent.
c) Let the point ({, ту, £) be uniformly distributed in the cylinder { 2 + i f < 1.
0 < C < 1. Show that <p = arc tan ~ , q= yj í 2 + if> and f are independent.
d) Find the general theorem of which a), b) and c) are particular cases.
Hint. The independence of the new coordinates results, in the three cases, from
the fact that the functional determinant of the transformation can be decomposed
into factors each containing only one of the new variables.
11. Let £ and ry be independent random variables with the same density function

/ М = у е" И ( — CJO < X < + < X > ).

Find the distribution of C = £ + ту.


12. Show that a two-dimensional normal distribution is uniquely determined by
its projection on three non-parallel lines.
13. Let { and »y be independent random variables with the same density function

/(*) = —
n Ve + e
Find the distribution of f = { + 7y.
14. Let i be a random variable with density function

/( * ) = exp у (* - m)2j + exP ( - y ( * + "^j •

Find the values of m for which f(x) has two maxima.


15. Let £t, (г be independent random variables having a Cauchy distri­
bution with density function

f(x) - —77—7— 7,- .


77(1 + x2)
Find the density function of

n /c=1
í v , § 17] E X E R C IS E S 235

16. Let the random variables 4> 4 ........ Z„ be independent and uniformly distrib-
П
uted on the interval (0,1). Determine the density function of C = Z Zl-
1=1

17. L e t th e r a n d o m v a r ia b le s 4> 4 .......... 4 b e in d e p e n d e n t a n d u n if o r m ly d i s t r i b ­


u te d o n t h e in te r v a l (0 , 1). L e t 4* = R k( S „ 4 , . . . , { „ ) = 1, 2 ..........и b e th e A>th
among the values arranged in increasing order. ({J is called the /с-th orrfer
statistic of the sample £,„).)
a) Find the distribution function of the random variable £*_* — Z* (1 5 = к < к +
+ h < n) and show that it is independent of k.
t*
b) Find the distribution function of the ratio — — (1 < к < к + h < n).
4+ A
Z* Z* Z*
c) Show that -~r , - ~ r , ■■■, , {* are independent and that their «-dimensional
«* e* «?
density function is

f ( x 1, X2, . . x„) = «1 x2x\ . . . xT~' (0 < xk < 1; к = 1 , 2 , . . . , « ) .

I I* l *
d) Show that the random variables — —• are uniformly distributed in the
l 4+1J
interval (0 , 1 ).

18. The random variables Zi, 4 ........ 4 are called exchangeable, if their и-dimen'
sional distribution function F(jrb ....... x„) is a symmetric function of its variables-
(Exchangeable random variables have thus the same distribution and consequently
the same expectation.)
a) Choose at random and independently, with a constant probability density, n
points in the interval (0, 1). Let their abscissas be {,» 4 >- - - . 4- The interval (0, 1)
is subdivided by these points into и + 1 subintervals of the respective lengths
Vu %>•••> Vn+i- Show that

Щк) = —X T •

Hint. The rju t)2 , . . . , rjn+t are exchangeable random variables and we have
Л+1
Z=1
A = L

b) Calculate the standard deviation of the random variables rjk .


c) Let Z* be the A>th order statistic of the sample ( 4 » 4 . • • •> 4 ) (see Exercise 17).
Show that

Hint. Z* = Vi + % + • • • + Va­
di Which is larger:

D \zt) = o 2( Z%) or Z ° 2(^ ?


/ =1 -=1
236 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 17

e) Calculate the correlation coefficient


R((Л £,*) 1 < i< j< n .
19. The mixture with equal weights of the distributions with distribution function
Bk'„+ , - k (a) (k — 1 , 2 , . . . , « ) is the uniform distribution in the interval (0, 1). How
could this be shown without calculation ?
20. If the probability that a car runs at least x miles without a puncture is e~Xx
with X = 0.0001, is it worth while to carry three spare tires on a trip of 1 2 000 miles?
21. Lei the random variables { and у be independent, let { have an exponential
distribution with density function Xe~*x (x > 0), and let r) be uniformly distributed
on (0 , 2 я). Put С, = y j £ ■cos rj, C2 = yj í ■sin rj. Show that C, and C2 are indepen­
dent and have the same density function

[L e -* .

22. Let {2, . . . , Sn be independent random variables, let the density function
of i k (k = 1, 2 be
X{k + h — 1) for a > 0 ,

where X > 0 and ft is a real number. Find the distribution function of the sum
П
rj = £* and show that C = exp (— X rj) has a beta distribution.
k=I
23. Let h„(x) be the density function of Student’s distribution with n degrees of
freedom. Show that

lim -A = h„ í - í = ] = —} = e~ 2 .
\ /n > V 2jl
24. The substances A„ A2, A „ +1 form a radioactive chain, i.e. if an At atom
disintegrates it is transformed into an A2 atom, similarly the A2 atoms into A.6 atoms,
and so on. The An+l atoms are not radioactive. Suppose that at the instant t — 0
the number of Аг atoms is N„ the number of A., atoms is N2, ..., while there are N„ atoms
of A„. Find the density function of the time interval needed for an atom chosen at
random to change into an A„+i atom.
25. Let X be the disintegration constant of a radioactive atom. Let there be N atoms
present at the time 0 .
a) Calculate the standard deviation of the number of atoms disintegrated up to
the time t.
b) Calculate the expectation and the standard deviation of the half-period (i.e. of
N
the random time interval till the — -th disintegration, if N is even).26

26. a) Let r/k (k = 1, 2, . . . ) be the time required for the transformation of a radio­
active atom A, into an Ak + 1 atom, through the intermediary states A , , . . ., Ak, i.e.
the duration of the process
A i~ * A 2 —> ■ . • —► A k + 1 .
Let further Xk be the disintegration constant of the Ak atoms, gk(t) the density function
of r]k and i k(t) the number of A, atoms which are present at the time t. It is assumed
IV, § 17] EXERCISES 237

that at the moment 0 there are only Al atoms present and their number is equal to N.
Find the distribution function of r\k and of i k(t) (k = 1, 2, . . . ) .

Hint. Let Pk(t) be the probability that at the time t an atom is in the state Ak.
These probabilities can be calculated in the following way: The probability that an
atom Ak changes into an atom Ak+ l during a time interval it, t + At) is, by definition
of gk(t), equal to gk{t) At + o(/lt). On the other hand, the probability of this event
is as well expressed by Pk{t)XkAt + o(/l/); the possibility that during the time interval
(/, t + At) an atom passes through several successive disintegrations can be neglected.
Hence we have

Pk(t) = ffk(‘) ■ 0)
Ák
Since the disintegrations of the individual atoms are independent, we obtain

F(f*(0= r) = (1 - 9jj ; ) N ' (r = 0>1.......(2)


The expectation and the standard deviation of the number of Ak atoms at the moment
t can now be calculated, since we know that

Ш = ( - I)'1“ 1 2 A • • • К X (Я( _ Я/) (f > °) (3)


t* j
(cf. Ch. IV, § 9, (29)).
b) Put M k(t) = £({*(/)). Show that the functions M k(t) satisfy Bateman’s system
of differential equations
MUt) = A*_, Af*_,(/) - h M k(t) (M 0(i) = 0; к = 1 , 2 , . . .). (4)
Hint. By (2) we have
M ki t ) = ^ - . (5)
If we differentiate the identity

в Á‘) = J >.k gk_ t («) du, (6 )


0
we obtain
a ’k it) = К ig k - ! it) - gk it)). ( 7)

Because of (5), (4) follows from (7).


Remark. If the number M kit) is very large, the fluctuation of the number £k(t)
of the atoms Ak about M k(t) is small with respect to M k(t), since by (2)

D ih it) ) = y j M kit) | l - — < y j M kit).

Hence as a fiirst approach M k{t) may be considered as the námber of Ak atoms


existing at the time t. However, one should not forget that this number is in reality
a random variable with expectation Mkit).
c) Show shat the graph of the function у — Mk(t) has for any A,, A2, . . . , A* only
one maximum. Show further that 0 = mx < m2 < . . . < m„, where mk denotes the
abscissa of the maximum of the function Mkit).
238 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 17

Remark. The atoms A„+i not being radioactive, M„+ 1(l) is evidently an increasing
function of time, hence mn+x = + a> .
d) Show that t = 0 is a zero of order к — 1 of the function M k(t).
27. Let г/, C be the components of the velocity of a molecule of a gas in a
container. Let the random variables £, r\, £ be independent and uniformly distributed
on the interval ( —A, + A). Calculate the density function f A(x) of the energy of this
molecule. Determine further the limit
lim A3f A(x) = w(x).
A-*- -f- со

Hint. Let the mass of the molecule be denoted by m and its energy by E, then

£ = - ^ ( í 2 + ’?2 + C2) .
hence

P(/:<'> = - i* J I f dxdydz = S o [2L ] 2 for V i < A’


[2t
since the integral is equal to the volume of a sphere with radius , / — Thus
V m

^ (i)= ( i ) 2-4i v /7 for J ~ <A’


hence
h^I) = c yj t (c = constant).
28. In Exercise 34 of Chapter III, § 18 we studied the most probable energy
distribution of a gas consisting of N particles, when the total energy E of the gas was
given. The probability p k of the energy Ek was found to be given by
wk e~PEh
Pk = 5 •
£ w ,e - ßE>
i=i
This result was obtained under the assumption that E can only take on the discrete
values Ek. Let now the energy be considered as a continuous random variable. For
the density function of the energy we obtain in a similar way the expression

_ m e - ß>
PiO = -s-------------------- .
J w(u) du
0

where ß can be determined from


CO CO

N J tw(,t) e~P' dt = E J w{t) e - P dt.


о о

Let w(t) be chosen such that

w(t) = c J t for 0 < t < c' ,


IV, § 1 7 ] EXERCISES 239

3
where c' is- a positive constant and c = -----g- . Calculate under these conditions, for
2 c' 2
the limiting case c -> + °o, the value of ß, the function />(?), and the distribution
of the velocity of the molecule.
Hint. With the above notations we have for c' —* + oo
E _ 3
~N ~~ ~2ß '
E 3kT
It is known from statistical mechanics that ---- = ------ , where к is Boltzmann’s
N 2
constant and T the absolute temperature. So ß = ^ and

2 s /T exp - T r
p(0 = — —— —— •
V n (k T ) 2
Let the velocity of a molecule be denoted by v and its kinetic energy by Ekin, then the
density function of v will be given by

t, s s dEm ( m \-\ 12 V2 ( m \
f W = * E ^ — = [ — J °2 exp - •
This derivation of the Maxwell distribution coincides essentially with one usually
given in textbooks of statistical mechanics. (We return to this question in Chapter Y,
§ 3.)
29. a) Calculate from the Maxwell distribution the mean velocity of the molecules
of a gas having the absolute temperature T and consisting of molecules of mass m.
b) Show that the average kinetic energy at the absolute temperature T of the
3
molecules of a gas is equal to — kT (k is Boltzmann’s constant).
c) Compare the mean kinetic energy of a molecule with the kinetic energy of a
molecule moving with mean velocity. Which of the two is larger?
30. a) Consider a gas containing in 1 cm3 N molecules and calculate the mean
free path of a molecule.
Hint. The molecules are considered as spheres of radius r and are supposed to be
distributed in the space according to a Poisson distribution, i.e. the probability that
a volume V contains no molecules is expressed by e~NV. The probability that the
volume A V contains just one molecule is given by NA V + о (A V). The meaning
of the statement that “a molecule covers a distance s without collision and then collides
on a segment of length As with another molecule” is just the following: a cylinder
of radius 2 r and height s does not contain the center of any of the molecules and
another cylinder of radius 2r and height As contains the center of at least one of the
molecules. Thus the probability in question is
Ar27iNe~inNr'’ As + о (As),
240 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 17

i. e. the distribution of the free path is an exponential distribution with density function
4nNr2 e-*nNr‘s .
1
Hence the length of the mean free path is ■ w . .
AnNr1
b) Calculate the mean time interval between two consecutive collisions of a molecule.
Hint. Let the length of the free path be denoted by s, the velocity of the molecule
by V, then г = — , where т denotes the time interval studied, s and v can be assumed
V
to be independent, thus E(x) = E{s)E j ; the first of these two factors is known
from Exercise 30 a), the second can be computed from the Maxwell distribution.
31. Calculate the standard deviation of the velocity and kinetic energy of a gas
molecule, if the absolute temperature of the gas is T and the mass of its molecules m.
32. Let the endpoint of a three-dimensional random vector Í possess a uniform
distribution on the surface of the unit sphere. Let в be the angle between the vector
sin t
£ and the positive x-axis. Show that the density function of 0 is given by — —

(0 < t < л).


33. Choose at random a chord in the unit circle and determine the expectation
of its length under the three suppositions considered in the discussion of Bertrand’s
paradox (Ch. IX, § 10).
34. Let £t, be independent normally distributed random variables with
1 _**
density function — = e 2. Calculate the expectation and the standard deviation of
Jbi
+... + Й
(n+l + • • • + f„2+m '
35. Let mk be the median of the gamma distribution of pafameter X and order k.
mk 1
Show that hm —— = — .
*-»+os к 4
36. Let the distribution of the random variable £ be the mixture of the distribution
of the random variables £i , . . . , £„ with weights p k (k = 1 , 2 , . . . , «). Show that
Щ ) = Yj PkE (it).
*=1
37. Under the same assumptions as in Exercise 36 put
M k = E ( i k) (At = 1 , 2 , . . . , я).
Show that

D \ Z ) = X P k D \ H k) + D \p ),
kmz1
where ц is a random variable assuming the values M u M2, ...» Af„with probabilities
IV, § 17] EXERCISES 241

Pi,Pt, . . . , p n. From this follows

o 2(i) > I f t £*(!*);


*=i
equality holds iff Mt = M 2 = ... = M„.
38. a) Let i be a normally distributed random variable with density function

,, 4 1 (* —"Ú2"
/(* ) = .— e x p ------- — — .
y j 2ла L 2a

Deduce £(£) = m from the fact that the function у = /(x ) satisfies the differential
equation a2/ = — (x — m) у .
b) Let the density function of the random variable £ be given by
A™” ! - A
f(x) - 2 , x-me * (x> 0 ),

where m > 3 is a positive integer and A > 0. Calculate E(Q from the fact that the
function у = f ( x ) satisfies the differential equation

/ = *
c) Apply the same method in general to Pearson’s distributions (cf. Exercise 9).
39. Suppose that there are 9 barbers working in a hairdressing-saloon. One shaving
takes 10 minutes. Someone coming in sees that all barbers are working and 3 customers
are waiting for service. What waiting time can he expect till he is served?
Hint. Assume that the moments of the finishing of the individual shavings are
independent and are uniformly distributed on the time interval (0 , 1 0 ').
40. Let £lt be independent random variables having the same distribution.
Prove that

E { ) - т

4L Prove that if the standard deviation of the random variable f with the distri­
bution function F(x) exists, then

lim x 2 (1 - F(x) + F ( - x)) = 0


X-*~ +00

and

Щ 2) = 2 fx (1 - F(x) + E(—x)) dx .
о
42. Calculate the dispersion matrix of a nondegenerate «-dimensional normal
distribution.

Hint. Let the «-dimensional density function of the random variables % ,... Fin be

■■••л)= ( w ) 1 exp\ - U i i b“yiy) } ’ (1)


242 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 17

where \B\ is the determinant of the matrix В = (Ъ„). There can be given independent
normally distributed random variables i k such that £(£*) = 0 and

V i = Y ct k h O'= 1, 2 , (2 )
k=1
where C = (c№) is an orthogonal matrix. Let ok = D(Ak) and let 5 be the diagonal
matrix having for its elements the numbers —- . Then B, S, and C are connected by
°k
the relation В = CSC*, where C* is the transpose of the matrix C. If we put Dn =
= E(r]/ rjt), then we have by (2)
П
А/ = Y c‘k C/k (2 )
k=1
Hence the matrix D = (Du) can be written in the form D = CS~l C*, where S _l
denotes the inverse of the matrix S and thus BD = CSC*CS~lC* = E, where E is
the unit matrix of order и. Thus the dispersion matrix D of the normal distribution
is the inverse of the matrix В of the quadratic form figuring in the exponent of (1).
43. a) Using the result of the preceding exercise, find anew proof for Theorem 2
of § 16.
b) Let ....... be independent normally distributed random variables with
E(£k) = 0, D(í/i) = o \show that if the matrix C = (c„) is orthogonal, then the random
variables
П
Vi = Y c'*
A=1
are independent.
c) Determine the ellipsoid of concentration of the и-dimensional normal distri­
bution and prove Formula (12) of § 16.
d) What is the geometric meaning of Exercise b)?
Hint. The components c2, . . . , ?„ of an и-dimensional normally distributed
random vector are independent, iff the axes of the ellipsoid of concentration are
parallel to the coordinate axes. If the random variables ........ Z„ have the same
normal distribution, then the ellipsoid of concentration is an и-dimensional sphere;
thus the condition required is fulfilled for every choice of the coordinate system.
44. a) When considering errors of measurements the following rule is often used:
If the random variables are independent, further if the first partial deriv­
atives of the function g(xl t ..., x„) are continuous and if r] = g{xl .., x„), then

where the partial derivatives are to be taken at the points xk = £({*) (к = 1


Discuss the validity of this rule.
b) Let £ and rj be independent random variables. Prove that
O'-’(^ ) = D - ^ ) D \ V) + ЕЧА) D\rj) + E-(vW-tt) ■
45. Counters used in the study of cosmic rays, radioactivity and other physical
phenomena do not register all particles hitting the apparatus; in fact, the latter remains
in a passive state for some time interval h > 0 after a hit by a particle, and does not
IV , § 17] E X E R C IS E S 243

register any particle arriving before the end of this time interval. The number of
particles counted is thus smaller than the number of the particles actually coming in.
The average number of particles registered during unit time is said to be the “virtual
density of events” and is denoted by P; the average number of the particles actually
arriving during the unit time is said to be the “actual density of events” and is denoted
by p. (Every arriving particle renders the apparatus insensitive for a time interval h,
regardless whether the particle was registered by the apparatus.) As to the arrival
of the particles, the usual assumption made in the study of radioactive radiations
is introduced, namely that the probability of the arrival of n particles during a time
(pt)" e~p'
interval t is given by --------- — (n = 0 , 1 , . . . ) .
n\
a) Determine the virtual density of events P.
b) Determine that value of the actual density of events which makes the virtual
density maximal.
Hint. The probability that a particle arrives during a time interval At and is registered
is equal to the probability that the particle arrived during the time interval considered
and no other particle did arrive during the preceding time interval of length h. This
probability is approximately pe~phAt, hence P — pe~ph. If the passive period h is
known and P was experimentally determined, the above transcendental equation is
obtained for p . By differentiating we find that P has its maximal value, if p = ~ ;
h
then P = ---- .
eh
c) Calculate the distribution, expectation and standard deviation of two consecutive
registered particle-arrivals.
Hint. Suppose that an arrival was registered at the time t = 0 and let fV(t) be the
probability that at least a time interval t will pass till the following registered arrival.
It is easy to see that IV(t) satisfies the following (retarded) difference-differential
equation
W’(t) = T (l - W(t - hj) (t > h) (1)
and fulfils the initial condition W(t) = 0 for 0 < t < h . The solution of (1) is given by
” ( _ D *- 1 Pk (t — кКУ
W(t) = У -------------------------— for nh < t < (n + 1 )A ( л = 1 , 2 , ...). (2 )
*=i k\
Integrate (1) from h to + ° ° , then
CO

1 = P I' / IV'(t) dt.


о
Hence the expectation M of the time spent between two consecutive registered arrivals
is given by - i - . The standard deviation D of this time interval can be determined in
a similar manner and one obtains
„ J I - 2hP
D = --------- . (3)

Observe that for It = 0 we have D = — = M . For h > 0, we have < 1. Hence


P M
244 G E N E R A L T H E O R Y O F R A N D O M V A R IA B L E S [IV , § 17

the fact that the apparatus has a passive period diminishes the relative standard
deviation of the distribution.
d) If the radiation has a too high intensity, a “scaler” is commonly used in order
to make the observations more easy. This apparatus registers only every k-ih particle.
(In practice к is a power of 2.) Calculate the virtual density of events for this case too.
Hint. First calculate the probability that during the interval (t, t + At) there arrives
a “k-th particle”, i.e. a particle having in the list of arriving particles a serial number
which is divisible by k. Clearly, the probability of this event is
t oo nk .n k I -p i ,

( . ? , т а г ) л + “<л >-
As the factor of At depends al»d on t, the process is not stationary. But this dependence
on t is very weak when t is large; in fact, it can be shown that
“ p " k _ p ...

i-iT o o (nk - 1)! к


Relation (1) follows from
co -n k f nk—l -p t n к—1
У -----— = — У core ^ r -A (2 )
Ät (nk - 1)! к kx

where co, = exp (r = О, 1, . . к — 1), since the real parts of со, — 1 (r =


= 1, 2, . . . , к — 1) are negative and a>0 — 1 = 0 . Hence the probability that a particle
arriving between t and t + At is a “/c-th particle”, is given, for a sufficiently large t,
P P
approximately by — A t . Thus the registered density of events is — . Here we
к к
neglected the passivity of the apparatus following the arriving of a particle.
46. Suppose that the expectation of the random variable { exists and let a be a
real number. Prove that E(|{ — a\) takes on its minimum if a is the median of £.
47. Let ( be a random variable with distribution function F(x). Show that

E(H(0) = H(x)dF(x)
— CO

holds without restriction for every Borel-measurable function H(x), such that the
expectation E(H(£)) exists.
Hint. The value of E(H(£)) only depends on the distribution of Я(£), hence
on the distribution of c, since for every Borel set В P(H(i.) 6 В) = P(c 6 Я - 1 (В) ) ,
where Н~'(В) denotes the set of the real numbers x for which FHx) 6 В . Hence
E(H(£)) does not depend on the fact on what probability space [Q, cA, P] the random
variable £ is defined; thus let Ü be the real axis, cA. the set of all Borel subsets of Ü
and P the Lebesgue-Stieltjes measure defined on Q by P(Iab) = F(b) — F(a), where
/ oi, is an arbitrary interval a < x < b . Under these conditions £(x) = x ( — <x> < x <
< + oo) has distribution function F(x), hence

£(Я (|)) = f H(£)dP = J H(x) dF(x).


а —о
CHAPTER V

MORE ABOUT RANDOM VARIABLES

§ 1. Random variables on conditional probability spaces

Let & = [Í2, «55, P ] be a conditional probability space (cf. Ch. II, § 11).

A real valued function £ = £(<») defined for со d Q is said to be a random


variable on ^ if the level sets A x of £ (Ax is the set of all со d Q such that
£(со) < jf) belong to for every real x. A vector valued function £ =
= (£x, . . . , £r) on Q is said to be a random vector on ^ if all of its com­
ponents ^ , . . . , £r are random variables on Since, by assumption,
is a u-algebra of subsets of Í2 it follows that for every Borel set В of
the /--dimensional Euclidean space the set £ - 1 (В) of all со d Q for which
£(co) d В belongs to the u-algebra
If C is any fixed element of «55, = [Í2, *j €,P{A | C)] is a Kolmogorov
probability space (cf. Theorem 6 , § 11, Ch. II). Since every random
variable £ on & is an ordinary random variable on .5^., the usual notions
can be applied to the random variables on dFc. Thus there can be defined
for every random variables £ on & its conditional distribution function,
its conditional expectation, etc. with respect to the condition C All
theorems proved for ordinary random variables are valid for the random
quantities defined on dF with respect to a conditional probability space
■Tc for every given C. New problems, however, arise if we let C vary.
Let £ be a (real) random variable on У , Iab the interval a < x < b and
Í - 1 dab) the set of all со d Q with i(co) d l ab. Clearly, for every Iab the set
í _ 1 (4fc) belongs to but i t does not necessarily belong to «$. Let
be the set of all intervals Iab c. I0 with Í _3(4 *) € where /„ is a given,
(possibly infinite), interval. The following two conditions are assumed to
hold for L /#:
Condition Nr. The set is not empty; for Ir d and I2 d there
exists an I3 d such that Ix + I2 C| I3 .
Condition N 2- For /, d / 2 d <-Ж, and / х с / 2 we have

P ( r 1( h ) \ r \ h ) ) > 0 .

Conditions N x and No are evidently fulfilled if consists of a single


element only. Let J be the union of all intervals I d <-Ж. The set J СПI0 is
246 M ORE ABOUT RANDO M V A R IA B L E S [V, § X

an open or half open interval with endpoints a and ß (ос may be equal
to - oo and ß to + oo). Let cn be any point in the interior of J, i.e. a < c0 < ß.
Take a sequence of intervals Ianbn (n = 1 , 2 , . . .) with
a„+l < a n < c0 < bn <bn+1, lim an = a, lim bn = ß.
n-*- + 00 n-*- + CO

There can always be found such a sequence when the condition N x is


fulfilled. Put for X £ I a„b„

f(I. j P C M Q U » for „ <х<ь


Á) /■ ( { - ■ ( /„ .,) I{ - ‘ ( / . . О ) " ■

and
F ,.o - 't i - V J H - I U for „ < * < .

From Axioms Л, Ä, and C (Ch. II, § 11) of conditional probability spaces


follows then that for a„ < c < a < b ^ d < b n and Icd g

f n u i i - M - g : ”

(c < d and Fn(d) — F„(c) > 0 follow from our assumptions). Furthermore,
for an <, X < bn and for N > n

F n( x ) = f N( x ) .

Therefore the value of Fn(x) does not depend on n and we can omit the
index n by wiiting simply

F (x) = F n( x ) for a n < X < b„ (n = 1 , 2 ,. .. ) . (1)

The function F(x) is defined everywhere on the interval (a, ß), it is non­
decreasing and leftcontinuous; for Icd £ we have F(d) — F(c) > 0 and
for c < a < b < d the relation

шО
)

is valid. Thus the following theorem can be stated:

T heorem 1. Let ^ be a random variable on a conditional probability space


■T = [Q, 'jF, - ff P] and let be the set o f the half open intervals Iab
contained in an interval /„ such that £ ~ \ I ab) £ &■ Let <-Жfulfil the conditions
N x and N2. Let further J be the union o f all I £ <-Ж; J is an interval contained
V, § 1] C O N D IT IO N A L P R O B A B IL IT Y S P A C E S 247

in 70; let a and ß denote the endpoints o f J. Then there exists a nondecreasing,
left continuous function Fix) defined on (a, ß), such that for Icd £ we have
F{d) — Fie) > 0 and for Iab CCIcd the relation

р ({ -ч с > ж -ч ц ) - P)
holds.
A function Fix) having the above properties will be called a distribution
function o f £ on (a, ß). Under the assumptions of Theorem 1, the random
variable £, thus possesses a distribution function on (a, ß). The distribution
function Fix) of £, is evidently not uniquely determined since for A > 0
and for arbitrary g, the function G(x) = /.Fix) + g is also a distribution
function of £, on (a, ß). Conversely, if F(x) and G)x) are distribution functions
of ^ on (a, ß) and if the conditions of Theorem 1 are fulfilled, then for any
two subintervals I cd and I.,ö of (a, ß) with l cd £ and I./6 £ there
can be found an interval Ief £ such that Icd cc fif and Iy6 £ l ej. Thus
we have
Fid) - Fie) F if) - Fie) F fi) - F(y)
G id )-G ic ) G if)-G ie ) G(ő) - G(y)
Hence

for (4)

is a constant. And since for every Iab c; Iej


Fjb) - Fja) = Gjb) - Gja)
F if)-F ie ) G if)-G ie ) ’
it follows that
Fib) - Щ Ь ) = Fia) - Щ а ) = g (6 )

is also a constant; thus


G(X) = Щ х ) + g, (7)
where X and g are the constants defined by (4) and (6 ).
Thus the distribution function of c on (a, ß) is uniquely defined up to
a linear transformation.
When the distribution function Fix) of c on (a, ß) is absolutely continuous
on every closed subinterval of (a, ß), then

f i x ) = F \x ) (8 )
248 M O R E A B O U T R A N D O M V A R IA B L E S [V, § 1

is called a density function o f f According to what was said above, the


density function of £ is uniquely determined up to a positive constant;
/(x) is nonnegative, measurable and integrable on every closed subinterval
[a, b] of the interval (a, ß).
Example 1. Let (2 be the set of all real numbers and Ж the set of all
Borel subsets of (2, let further be g(x) a function which is nonnegative,
measurable and integrable on every finite interval of the real number axis.
Let 3d be the set of all intervals Iab such that
b
0 < \ g(x) dx < + oo;
a

assume that 39 is not empty. Define conditional probabilities by


J g{x) dx
P(A IB) = A±-- ч , for A £Л and В ^ 39. (9)
\g {x)d x
A
Put £(cu) = a>(—oo < at < + oo). Then £ is a random variable on the
conditional probability space У = [Q, 3d, Р]. If 70 = ( —oo, + oo),
is identical to 3d and all conditions of Theorem 1 are fulfilled; hence £
has a distribution function F(x) and indeed

X
A \ g{t) dt + g for x > 0,

FW = 'о (10)
—A f g(t) dt + g for x < 0

is a distribution function of ^ for any choice of the constants A > 0 and


/i. Furthermore ).g(x) is a density function of £ for any A > 0. In particular,
for g(x) = 1
F(x) = Ax + p (—o o < x < + c o )

is a distribution function of ^ and the density function of £ is an (arbitrary)


positive constant A; in this case we say that £ is uniformly distributed on
the whole real axis. It should be remarked that the distribution function
of a random variable on a conditional probability space may assume
negative values and is not necessarily bounded. It is easy to see that the
following theorem holds:

T heorem 2. Let F(x) be a distribution function and f(x ) a density function


o f a random variable £ defined on a conditional probability space У. Let
V, § 1) C O N D IT IO N A L P R O B A B IL IT Y S P A C E S 249

у = h(x) be a monotone function and x = h~l(y) its inverse. Then

F ( h - \y ) ) = G(y)

is a distribution function and, i f у = h(x) is absolutely continuous,

/(* -■ O ))

is a density function o f r\ = /;(£).


Example 2. Suppose that £ is uniformly distributed on the whole real
axis, then the same holds for g = a£ + b (for any constants а ф 0 and b).
Example 3. Let £ be uniformly distributed on the whole real axis. Then
tl — eHhas on (0 , + oo) the density function

/(* )= * (x > 0>

The distribution of rj is said to be logarithmically uniform on the half line


(0 , + со ).

Example 4. Let £ possess a logarithmically uniform distribution on the


half line x > 0. Then rj = a if has the same distribution as £ for any a > 0
and b Ф 0 .
Example 5. If l; has on the interval (0, + oo) the density function

/(* ) =
then r] = cf (c > 0 ) has the same density function.

T heorem 3. Let I be a random variable defined on a conditional probability


space -T and let F(x) be a distribution function and fix ) a density function
o f £ on the interval (a, ß). I f Iat and £ - \ I ab) 6 -59, then the conditional
expectation o f £ with respect to the condition a < £ < b is given by
ь ь
i xdF(x) f x fix ) dx

— <П)
a
{Clearly, the value o f E(£ \ a < £ < b) does not depend on the choice o f
Fix) or fix).)

«
250 M O R E A B O U T R A N D O M V A R IA B L E S [V, § 1

Example 6. If £ is uniformly distributed on the whole real axis, then

E ( Z \a < Z < b ) = ^ -
for all a < b.
Example 7. If £ is logarithmically uniformly distributed on the positive
semi-axis, then
E(£ \ a < £ < b) = —— for 0 < a < b.
In —
a
Example 8. If £ is uniformly distributed on the whole real axis, |£| is
uniformly distributed on the positive semi-axis.The distribution function of
<J2 is th u s y /x a n d its d en sity fu n c tio n is - __: f o r x > 0 . H en ce fo r 0 < a < b
V *

6* _

j у/ X dx о , ,9

E ( e \ a < ^ < b ) = E ( e \ a 2 < e < b 2) = ^ -------- = a~+ ab + b~


r dx
J Vх
and consequently

in accordance with the fact that under the condition a < £ < b, £ is
uniformly distributed on the interval (a, 6 ) and the standard deviation of
such a distribution is —— = . (Cf. Ch. IV, § 14.)
2 ^ /3
Distribution functions and density functions of an /--dimensional random
vector on a conditional probability space can be defined in a similar way.
Let I be an “interval” of the /--dimensional space, i.e. the set of the points
X = (xx , . . . , x r) whose coordinates satisfy ak < x k < bk (k = 1 ,2 , , r)
and let F{x1, . . . , x r) be a function of r variables. Like in Chapter IV,
§ 3, we introduce the notation
A,F = < > A f>. .. A%E(al t . . . , a r),
where hk = bk — ak (k = 1, 2, ,r). We have the following theorem:

T heorem 4. Let £ be an r-dimensional (r = 2, 3 , . . . ) random vector on


a conditional probability space .7' = [fi, -$9, P] and let £ ~ 1(/) denote the
V, § 1] C O N D ITIO N A L PROBABILITY SPACES 251

set o f those со f Q for which [(со) £ /, where I is an interval o f the r-dimen-


sional space E r. Let I0 denote a fixed interval o f E r and e.Ж the set o f those
intervals I C I0 for which £~](l) £ Assume that the conditions and N 2
given above are fulfilled. Then i f J is the union o f all intervals I £ ЖТ(J is
also an interval o f E r), there exists a function F on J such that A ,F > 0 for
every I CA J and for / 2 £ and f С / 2 the relation

p (C ~ \h) I C 'V i) ) = ~ j (12)


is valid.
P roof . If x J f consists of just one interval, the statement of the theorem
is trivial. Otherwise let f £ ЖА, / 2 £ with Ix C and let
(x[fi x<f, . . . , xf>) £ A .
For X = (a ,, x 2, . . . , x r) £ /._> put
fu r т - 'о ж - х щ
.............' ( ’ П С Ч /,) I { -'(/,)) ■ (13)

where Ix is the interval a,- < tf < bt (i = 1 , . . . , r) with

a = min (x<°>,Xj), hi = max ( x f \ a,)


and к is the number of the values of i for which x; < xf'l
Like in the proof of Theorem 1, we see that F(xl s . . . , x,) does not
depend on the choice of / 2 . Clearly, F is nondecreasing with respect to
each of its variables, A ,F > 0 and (12) is true. Theorem 4 is thus proved.
Every function F fulfilling (12) is said to be a distribution function o f [
on J. The distribution function is not uniquely determined; if F is a distri­
bution function of C and p is any nondecreasing function of r — 1 of the
variables x b . . . , x, then for every Я > 0

G(xl t .. . , x r) = Щ х ь . . . , x r) + p (14)

is also a distribution function of [.


If F is absolutely continuous on every I £ <Ж we call
8r F
f ( x 1, • • X r) = - -------- — (15)
oxx . . . dxr
the density function o f £ on J. It is determined up to a positive constant
factor.
Let be random variables- on the conditional probability
space Ж = [Q, P(A | В)] and put [ = (£b . . . , [,)■ We shall say
252 MORE ABOUT RANDOM V A R IA B L E S (V , §1

that the random variables £l t . . . , are mutually independent if

ArF = l \ ( F i ( b ,) - F l (ai)), (16)


i =1
where F is a distribution function of £, / is any interval I = {ak < x k < bk;
к — 1, , r) with / С / and the Ft are nondecreasing functions. If F
is absolutely continuous and the random variables . . . , £r are indepen­
dent, the density function / of £ is

/ = П M xi) . (17)
i=l
where the nonnegative function f ( x ) is equal to F'(x). Conversely, from
(17) follows (16) and thus the independence of the random variables

Example 9. Let Í2 = E r be the r-dimensional Euclidean space; let g(x),


where jc = (xb x 2, . . . , x r), be a function which is nonnegative, measurable
and integrable pn every finite interval / of Er\ let be the set of the Borel
subsets of Er, let £8 be the set of all nonempty В £ for which 0 <
< f g(x) dx < 4- oo and put, for A £ В £
в

J 9 (x) dx
P(A IB) = Aw .
1 g(x) dx
В
Put CW — X. Then ^ = [Q, £8, P] is a conditional probability space
and £ is a random vector on JE If Ix denotes the interval
min(0 , x t) < ti < max(0 , x t) (i = 1, 2 , ,r),
then the distribution function of C is given by
Е(хъ . . x r) = ( - l)fc f g (x) dx,
h
where к is the number of the values of i for which x t < 0 and g{x) is the
density function of f.
In the case g(x) = 1 , ( is uniformly distributed on the whole space Er.
In this particular case we can put

F(x t , . . xr) = x rx 2 . . . x r.

Let L, be an r-dimensional random vector, I an interval and B Q I a.


Borel subset of Er, furthermore let C-1(/) € -75. Let F be a distribution
V, § 1] C O N D IT IO N A L P R O B A B IL IT Y S P A C E S 253

function of £. Then we have

р ( с - 1( в ) \ с - 1(Г)) = — -A -F— •

If ^~ЧВ) belongs also to -JÖand if C c: 5 is another Borel subset, it follows


that

rrr-vnir-vrni ^-ЧСЖ -ЧО) S - c $ dF


(С ( Ж ( )) p (^ -i(ß )!(-!(/)) j \..J < /F '

Thus we have proved the following theorem:

T heorem 5. I f F is a distribution function o f the r-dimensional random


vector 'Qon a conditional probability space -T = [Í2 , , SB, P] and if В and
С С В are Borel subsets and I is an interval o f Er, further i f В Cl I,
t ~ \B ) €«S0 , then

n C - 1 ( C ) K - 1 (5)) = y - £ - p F .
' в '
From Theorem 5 we can easily deduce

T heorem 6. Let £ and ц be independent nonnegative random variables on


the conditional probability space ■Lr. Let (£, rj) have distribution function
F(x) G{y) (0 < X < + oo; 0 < < + oc) and let lim F(x) = F(0) be finite.
X-*- + Q
Then the sum { = £ + t] has distribution function

Щх) = f (F(x - >') —F(0)) dG(y).


0

Remark. If we put F(y) = F(0) for у < 0, we can also write

H (x)=J(F(x-y)-F(0))dG(y).
0

If we assume further that F(0) = 0 (which does not restrict generality),


then we can simply write
H(x) = \ F(x - y) dG(y).
о
A similar theorem holds for more than two nonnegative independent
random variables.
254 M O R E A B O U T R A N D O M V A R IA B L E S [V , §1

P roof of T heorem 6 . By Theorem 5, if ^ ~ \ l cd) £-58 and Iab C l cd,

J.f dF(x) dG{y)


P ( a < l ; + r i < b \ c < Z + r i <c r ) = - a<xf f <b - ■ =
jj dF(x) dG(y)
c< ,x+ y< d

H(b) - H(a)
H(d) - H(c) ’

hence Theorem 6 follows.


If F is absolutely continuous.

f
h{x) = f ( x - y ) d G ( y )
Ö
is a density function of £ = £ + . Finally, if G(y) is absolutely continuous
and #Cf) = G'(y),
h(x)=\f(x~y)g(y)dy.
о

Example 10. Let the random vector (£ls . . . , £ „ ) be uniformly distributed


on the и-dimensional space, and put

ű = й + ••• + £

The random vector (£ f, . . . , has density function

/ ( x b .. ,,x„) = — = L = for xf c > 0 (£ = 1, 2 ,


V *i • • •
It follows by Theorem 6 that the density function of y'n is given by
*
h„(x) = ( A„_i(х - у ) - Ц = .
oJ

We obtain by induction
" -i
A„(x) = x 2 for X> 0 .

In particular, + <!;§ is thus uniformly distributed on the positive semi­


axis.
V. § 2) NOTION OF THE CONDITIONAL PROBABILITY 255

§ 2. Generalization of the notion of conditional probability on Kolmogorov


probability spaces

Let t be a discrete random variable defined on a probability space


[fi, P ], let x k (k — 1 , 2 ,. . .) be the values taken on by £ with positive
probability. Let Ak denote the event t, = x k and В an arbitrary event.
Let the random variable r\ be defined by

t]= P (B \A k) for t = xk (1)

(ij = P(B I Ak) for every со £ Q such that £(co) = x k). Instead of (1), the
notation Y] = P^B) will be also used.
Let U denote any Borel set of real numbers and <l~ \U ) the set of all
со £ Q such that £(co) £ U. Let further be the family of the sets
We have thus The family is a tr-algebra since

^ - \ l U k) = U ~ \ U k) and =

it is the minimal «т-algebra with respect to which £ is measurable.


It is easy to see that
P(AB) = \P fB )d P ( А £ Л ; ,В е Л ) . (2)
A
In fact, since A £ for every к such that А к А ф О , we have Ak C A,
hence
00 00

(В) dp = X P(B I Ak)P (A A k) = X P(ABAk) = P(AB).


A k =1 k = l

Obviously, P^(B) can be interpreted as the conditional probability o f the event


В for a given value oft, . Of course, the question arises whether this definition
may be extended to any random variable so that formula (2 ) should remain
valid. We shall show that this extension is possible. The difficulty of the
problem is seen from the fact that for instance a random variable with
absolutely continuous distribution function assumes each of its values
with probability zero and up to now we defined conditional probabilities
only for conditions having a positive probability.
Let t be an arbitrary random variable. Let us fix a В £ with P(B) > 0
and consider the measures P(A) and P{AB) on the а-algebra Clearly,
0 < P (A B )< P (A ). Hence P(A) = 0 implies P(AB) = 0: the measure
P(AB) is thus absolutely continuous with respect to P(A).
In what follows, we shall need the Radon–Nikodym theorem:

Let 𝒜 be a σ-algebra of subsets of a set Ω, let μ(A) be a measure and ν(A) a σ-additive real set function on 𝒜. The measure μ is assumed to be σ-finite, i.e. Ω is assumed to be decomposable into denumerably many subsets Ω_k with Ω = Σ_k Ω_k, μ(Ω_k) < +∞. Let further ν(A) be absolutely continuous with respect to μ(A), i.e. let μ(A) = 0 imply ν(B) = 0 for every B ∈ 𝒜 with B ⊆ A. Under these conditions there exists a function f(ω), measurable with respect to the σ-algebra 𝒜, such that for every A ∈ 𝒜 the relation

ν(A) = ∫_A f(ω) dμ  (3)

holds. If ν is nonnegative (i.e. if it is a measure), then f(ω) ≥ 0. The function f(ω) is determined in an essentially unique manner, in the sense that whenever g(ω) is another function fulfilling the conditions of the theorem, then f(ω) = g(ω) holds almost everywhere (with respect to μ); that is to say, if D denotes the set of the points ω at which f(ω) ≠ g(ω), then μ(D) = 0.

The function f(ω) figuring in (3) is denoted by dν/dμ and is called the (Radon–Nikodym) derivative of the set function ν with respect to μ.
Consider now on the σ-algebra 𝒜_ξ the measures μ(A) = P(A) and ν(A) = P(AB) (A ∈ 𝒜_ξ; B ∈ 𝒜 is fixed) and apply the Radon–Nikodym theorem. Since P(Ω) = 1, P(A) is not only σ-finite but even finite. Further, we have seen that P(AB) is absolutely continuous with respect to P(A). Hence there exists a function f(ω) which is measurable with respect to 𝒜_ξ such that

P(AB) = ∫_A f(ω) dP  (A ∈ 𝒜_ξ).  (4)

According to the Radon–Nikodym theorem, f(ω) is determined up to a set of measure zero; f(ω) is a nonnegative random variable, measurable with respect to 𝒜_ξ. Obviously, we have almost everywhere f(ω) ≤ 1. Indeed, (4) implies

P(AB̄) = ∫_A (1 − f(ω)) dP.  (5)

Thus if we had the inequality 1 − f(ω) < 0 on a set C with positive measure, this would imply P(CB̄) < 0, which is impossible.

We shall call the random variable f(ω) the conditional probability of the event B for a given value of ξ and shall denote it by P_ξ(B). Thus we can write

P(AB) = ∫_A P_ξ(B) dP  (A ∈ 𝒜_ξ),  (6)

which is a generalization of (2). If we want to emphasize that P_ξ(B) depends on ω, we shall write P_ξ(B; ω) instead of P_ξ(B).

In particular, if we put A = Ω in (6) we find

P(B) = ∫_Ω P_ξ(B) dP = E(P_ξ(B)),  (7)
ß

i.e. the expectation of the random variable P_ξ(B) is equal to P(B). If we now put B = Ω in (6), we have

P(A) = ∫_A P_ξ(Ω) dP for every A ∈ 𝒜_ξ.

On the other hand, however,

P(A) = ∫_A 1 dP,

hence with probability 1 we have

P_ξ(Ω) = 1.  (8)

One can prove in a similar manner that with probability 1

P_ξ(B₁) ≤ P_ξ(B₂) for B₁ ⊆ B₂.

In particular, when ξ is a discrete random variable, P_ξ(B) coincides almost everywhere with the random variable defined in (1) at the beginning of this section, since (2) determines the value of P_ξ(B) for almost every ω. It is seen from (1) that the value of P_ξ(B), for a discrete random variable ξ, only depends on the value of ξ. But this holds in the general case as well and is expressed by the fact that P_ξ(B) is measurable with respect to 𝒜_ξ, which may be rephrased by saying that P_ξ(B) = h(ξ), where y = h(x) is a Borel-measurable function. This can be seen from the following, somewhat modified, definition of P_ξ(B).

Apply the Radon–Nikodym theorem to the measures defined on the σ-algebra of the Borel subsets U of the real numbers by μ(U) = P(ξ⁻¹(U)) and ν(U) = P(Bξ⁻¹(U)). The Radon–Nikodym theorem states the existence of a Borel-measurable function g(x), defined for the real numbers x, such that

P(Bξ⁻¹(U)) = ∫_U g(x) dF(x)  (9)

holds for every Borel set U of the real axis. Here F(x) denotes the distribution function of ξ.

Obviously, the relation g(ξ(ω)) = f(ω) holds for almost every ω ∈ Ω. If the random variable P(B | ξ = x) is defined by the function g(x) of formula (9), then by definition it only depends on x; further, for almost every ω ∈ Ω, P(B | ξ = x) = P_ξ(B; ω), where x = ξ(ω).
If A is a fixed set, A ∈ 𝒜, P(A) > 0, then P(B | A), considered as a function of the set B ∈ 𝒜, is a probability measure. We shall now discuss how far this remains valid for P_ξ(B). Suppose B_k ∈ 𝒜 (k = 1, 2, ...), B_j B_k = O for j ≠ k, and Σ_{k=1}^∞ B_k = B. Consider an arbitrary random variable

ξ and define the random variables P_ξ(B_k) (k = 1, 2, ...) and P_ξ(B) as above. Then, for A ∈ 𝒜_ξ,

P(AB_k) = ∫_A P_ξ(B_k) dP  (10)

and

P(AB) = ∫_A P_ξ(B) dP.

But from (10) and from

Σ_{k=1}^∞ P(AB_k) = P(AB)

it follows that

P(AB) = ∫_A (Σ_{k=1}^∞ P_ξ(B_k)) dP,

hence Σ_{k=1}^∞ P_ξ(B_k) fulfils relation (6) which defines P_ξ(B). Thus with probability 1

P_ξ(B) = Σ_{k=1}^∞ P_ξ(B_k).  (11)

The elements ω for which the relation (11) does not hold form thus a set C of measure zero, i.e. P(C) = 0. Since P_ξ(B) is determined only almost everywhere, one cannot expect to prove more than this. The exceptional set C depends on the sets B_k, and the union of the exceptional sets corresponding to the individual sequences {B_k} is not necessarily a set of measure zero, since the set of all sequences {B_k} is nondenumerable if 𝒜 has infinitely many elements. Thus we cannot state that for a fixed ξ, P_ξ(B) as a function of B is a measure; in general this is not true.
In practice, however, this fact causes scarcely any difficulty at all. In most cases, the conditional probability P_ξ(B) = P(B | ξ = x) is studied simultaneously for nondenumerably infinitely many B only when the conditional distribution of a random variable η is to be determined with respect to the condition ξ = x, i.e. if the probabilities

P(η < y | ξ = x)

are to be considered for every real value of y. If these conditional probabilities can be defined in such a manner that P(η < y | ξ = x) is a distribution function with probability 1, then this function is said to be the conditional distribution function of η with respect to the condition ξ = x and is denoted by F(y | x):

F(y | x) = P(η < y | ξ = x).



If F(y | x) is an absolutely continuous function of y and if

F(y | x) = ∫_{−∞}^y f(t | x) dt

is valid, then f(y | x) is said to be the conditional density function of η with respect to the condition ξ = x.
Since conditional probabilities are determined only with probability 1, it can always be achieved that for almost every x the random variable P(η < y | ξ = x), as a function of y, is a distribution function. The proof of this statement will just be sketched.

The conditional probabilities P(η < y | ξ = x) are first defined, by means of the Radon–Nikodym theorem, for rational numbers y only. Then there exists a set V with P(ξ⁻¹(V)) = 0 such that for x ∉ V the function P(η < y | ξ = x), as a function of the rational number y, is nondecreasing, left-continuous, and fulfils the conditions

lim_{y→−∞} P(η < y | ξ = x) = 0 and lim_{y→+∞} P(η < y | ξ = x) = 1.

Extend now the definition of P(η < y | ξ = x) to irrational values of y in the following manner:

P(η < y | ξ = x) = sup_{y′ < y, y′ rational} P(η < y′ | ξ = x).

Then P(η < y | ξ = x) as a function of y is a distribution function, and we have

P(η < y, ξ ∈ U) = ∫_U P(η < y | ξ = x) dF(x).

In fact, this relation is valid for every rational y and hence for every real y as well. Herewith our statement is proved.
Thus we have defined the conditional probabilities P(B | A) even for P(A) = 0; but let it be emphasized that in the latter case the conditional probability P(B | A) is only defined if A can be considered as a level set of a random variable ξ, i.e. if there exists an x such that A is the set of the elements ω of Ω for which ξ(ω) = x. Then P(B | A) is defined by P(B | A) = P(B | ξ = x). However, a set of probability zero can be obtained as a level set of different random variables; thus e.g. A may be defined by any one of the conditions ξ₁ = x₁ and ξ₂ = x₂. Thus it is possible that

P(B | ξ₁ = x₁) ≠ P(B | ξ₂ = x₂),

though the conditions ξ₁ = x₁ and ξ₂ = x₂ define the same set A. A conditional probability with respect to a condition of probability zero is therefore

defined only if this condition is an element of a decomposition of the sample space into pairwise disjoint subsets and is considered as an element of this decomposition. The corresponding conditional probability P(B | A) depends thus on the decomposition in which A was imbedded.

With the Radon–Nikodym theorem we have so far proved the existence of the conditional probability P_ξ(B) only. Let us now see how P_ξ(B) = P(B | ξ = x) can be effectively determined. In order to do this, let us remark that relation (9), in the case P(B) > 0, may be brought to the form

P(B)(F_B(b) − F_B(a)) = ∫_a^b P(B | ξ = x) dF(x),  (12)

where F_B(x) denotes the distribution function of ξ with respect to the condition B and where we have chosen for U the interval [a, b]. It follows by a well-known theorem of Lebesgue that (if F(x) is the distribution function of ξ)

P(B | ξ = x)/P(B) = lim_{h→0} (F_B(x + h) − F_B(x))/(F(x + h) − F(x))  (13)

for almost every x (i.e. for every x ∉ C, where C is a set with P(ξ⁻¹(C)) = 0).

In particular, if F(x) and F_B(x) are absolutely continuous and if F′(x) = f(x), F_B′(x) = f_B(x), then for almost every x

P(B | ξ = x) = P(B) f_B(x)/f(x)  (14)

whenever f(x) > 0.
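Formula (14) can also be checked by simulation. In the following Python sketch (an editorial illustration, not part of the original text; the particular choice of ξ and B is arbitrary), ξ is standard normal and B = {ξ + ε > 0} with ε an independent standard normal variable, so that P(B | ξ = x) = Φ(x) is known in closed form; the right-hand side of (14) is estimated with a histogram for f_B:

    import math, random

    random.seed(1)
    N = 200000
    xs = [random.gauss(0, 1) for _ in range(N)]
    inB = [x + random.gauss(0, 1) > 0 for x in xs]
    PB = sum(inB) / N                        # P(B), close to 1/2
    sel = [x for x, b in zip(xs, inB) if b]  # sample of xi conditioned on B

    def hist_density(data, lo, hi, bins, x):
        # crude histogram estimate of a density at the point x
        w = (hi - lo) / bins
        k = min(bins - 1, max(0, int((x - lo) / w)))
        cnt = sum(1 for d in data if lo + k * w <= d < lo + (k + 1) * w)
        return cnt / (len(data) * w)

    for x in (-1.0, 0.0, 1.0):
        f = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)   # density f(x)
        fB = hist_density(sel, -4.0, 4.0, 80, x)            # estimate of f_B(x)
        rhs = PB * fB / f                                   # P(B) f_B(x)/f(x)
        exact = 0.5 * (1 + math.erf(x / math.sqrt(2)))      # Phi(x) = P(B | xi = x)
        print(x, round(rhs, 3), round(exact, 3))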
Examples.

1. Let (ξ, η) be a random vector with absolutely continuous distribution and with density function h(x, y). Let

f(x) = ∫_{−∞}^{+∞} h(x, y) dy

be the density function of ξ. Let ξ⁻¹(U) and η⁻¹(V) denote the events ξ ∈ U and η ∈ V respectively, where U and V are Borel sets on the real axis. Assume that the function f(x) is positive for x ∈ U. Then

P(η ∈ V, ξ ∈ U) = ∬_{x∈U, y∈V} h(x, y) dx dy = ∫_{x∈U} ( ∫_{y∈V} (h(x, y)/f(x)) dy ) f(x) dx,

hence

P(η ∈ V | ξ = x) = ∫_{y∈V} (h(x, y)/f(x)) dy;

thus the conditional density function g(y | x) of η with respect to the condition ξ = x is given, for the x values which fulfil f(x) > 0, by

g(y | x) = h(x, y)/f(x);  (15)

g(y | x) is not defined for those x values for which f(x) = 0.

Similarly, if g(y) is the density function of η and f(x | y) is the conditional density function of ξ with respect to a given value y of η (i.e. with respect to the condition η = y), we find for g(y) > 0 that

f(x | y) = h(x, y)/g(y).  (16)

2. Let ξ and η be independent random variables and F(x) the distribution function of ξ. Then

P(η ∈ V, ξ ∈ U) = ∫_{x∈U} P(η ∈ V) dF(x),

where U and V are arbitrary Borel sets. Hence

P(η ∈ V | ξ = x) = P(η ∈ V).  (17)

Consequently, if the random variables ξ and η are independent, then the conditional distribution function of η with respect to the condition ξ = x is identical with the ordinary (unconditional) distribution function of η. Conversely, if (17) is valid for every Borel set V and for every x ∈ U with P(ξ ∈ U) = 1, then ξ and η are independent.
3. Let (ξ, η) be a normally distributed random vector with the density function

h(x, y) = (1/(2π)) exp(−(x² + y²)/2).

Let ρ and θ (0 ≤ θ < 2π) be the polar coordinates of the point (ξ, η). Find the conditional distribution of θ with respect to the condition ρ = r > 0. We have

P(0 ≤ θ < φ, ξ² + η² ≤ R²) = (1/(2π)) ∬_{0≤θ<φ, x²+y²≤R²} exp(−(x² + y²)/2) dx dy,

or, by introducing polar coordinates,

P(0 ≤ θ < φ, ξ² + η² ≤ R²) = (φ/(2π)) ∫₀^R r e^{−r²/2} dr.

Since the density function of ρ is given by r e^{−r²/2}, we obtain

P(0 ≤ θ < φ | ρ = r) = φ/(2π)  (0 ≤ φ < 2π).

Hence θ is uniformly distributed in the interval [0, 2π) under the condition ρ = r for every r > 0, and thus θ and ρ are independent.
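Example 3 is easy to confirm by simulation; a minimal Python sketch (editorial, not in the original text; the threshold ρ = 1 is arbitrary):

    import math, random

    random.seed(2)
    small, large = [], []          # polar angles for rho < 1 and rho >= 1
    for _ in range(100000):
        x, y = random.gauss(0, 1), random.gauss(0, 1)
        rho = math.hypot(x, y)
        theta = math.atan2(y, x) % (2 * math.pi)
        (small if rho < 1 else large).append(theta)

    # if theta is uniform on [0, 2*pi) and independent of rho, both
    # conditional means are close to pi
    print(sum(small) / len(small), sum(large) / len(large))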

4. Let ξ and η be independent random variables. We shall determine the conditional distribution of H(ξ, η) with respect to the condition ξ = x; here H(x, y) is assumed to be a Borel-measurable function.

If U is an arbitrary Borel set and if F(x) and G(y) are the distribution functions of ξ and η, then

P(H(ξ, η) < z, ξ ∈ U) = ∬_{H(x,y)<z, x∈U} dF(x) dG(y) = ∫_{x∈U} ( ∫_{H(x,y)<z} dG(y) ) dF(x);

the conditional distribution function in question is thus the distribution function of H(x, η).

5. Let U be a Borel set and B = ξ⁻¹(U). Then, with probability 1,

P_ξ(B) = 1 for ω ∈ B, and P_ξ(B) = 0 otherwise.

In fact, if A ∈ 𝒜_ξ,

P(AB) = ∫_{AB} dP = ∫_A χ_B(ω) dP,

where

χ_B(ω) = 1 for ω ∈ B, and χ_B(ω) = 0 otherwise.

Since χ_B is measurable with respect to 𝒜_ξ, we have, with probability 1, P_ξ(B) = χ_B(ω).

6. (Particular case of 5.) Let Ω be the interval [0, 1], let 𝒜 be the set of Borel subsets of Ω and P the Lebesgue measure. Put

ξ(ω) = ω  (0 ≤ ω ≤ 1).

Then, for B ∈ 𝒜 (with probability 1),

P_ξ(B) = 1 for ω ∈ B, and P_ξ(B) = 0 otherwise.

7. Let Ω be the unit square of the plane (x, y), 𝒜 the class of the Borel subsets of Ω and P the two-dimensional Lebesgue measure. Put ξ(x, y) = x. Since, for every B ∈ 𝒜 and for any Borel set U of the real axis (according to the theorem of Fubini),

P(Bξ⁻¹(U)) = ∫_U ( ∫_{(x,y)∈B} dy ) dx,

we find

P(B | ξ = x₀) = ∫_{B_{x₀}} dy = μ(B_{x₀}),

where B_{x₀} represents the intersection of B with the line x = x₀ and μ the one-dimensional Lebesgue measure. In this case P(B | ξ = x₀) is thus, as a function of B, a measure on the σ-algebra 𝒜, for every x₀.

§ 3. Generalization of the notion of conditional probability on conditional probability spaces

Let 𝒮 = [Ω, 𝒜, ℬ, P(A | B)] be a conditional probability space and ξ a random variable on 𝒮. Let B ∈ 𝒜 and C ∈ ℬ be given sets with P(B | C) > 0, and let 𝒜_ξ be the least σ-algebra with respect to which ξ is measurable. Consider the measures μ_C(A) = P(A | C) and ν_C(A) = P(AB | C) on 𝒜_ξ; ν_C(A) is absolutely continuous with respect to μ_C(A); there exists thus, by the Radon–Nikodym theorem, a function f(ω) = P_ξ(B | C), measurable with respect to 𝒜_ξ, such that

P(AB | C) = ∫_A P_ξ(B | C) dμ_C for A ∈ 𝒜_ξ.  (1)

The random variable P_ξ(B | C) will be called the conditional probability of the event B with respect to the condition C and for a given value of ξ; this of course depends on ξ but also on C; but the dependence is quite obvious in the most important particular cases.

If C is fixed, P_ξ(B | C) can be considered as the conditional probability of the event B on the ordinary Kolmogorov probability space 𝒮_C = [Ω, 𝒜, P(A | C)] with respect to the condition that ξ assumes a given value. The random variable P_ξ(B | C) has thus, for fixed C, all the properties proved in § 2 for P_ξ(B).

Let us point out the following circumstance. If A(x) is the set of all ω ∈ Ω for which ξ(ω) = x, it may happen that the sets CA(x) belong to the family ℬ for some values of x, or even for every one of its values, and thus P(B | CA(x)) is defined. But a priori it is not at all certain that P(B | CA(x)) coincides with P_ξ(B | C), i.e. that

P(AB | C) = ∫_A P(B | CA(ξ)) dμ_C for A ∈ 𝒜_ξ.

This regularity property does not follow from the axioms, and if necessary it must be postulated as an additional axiom.

Consider now the following important particular case:

Let Ω be an arbitrary set and 𝒜 a σ-algebra of subsets of Ω. Let further μ be a σ-finite measure on 𝒜 and let ℬ be the family of the sets B ∈ 𝒜 with 0 < μ(B) < +∞. We define

P(A | B) = μ(AB)/μ(B) when A ∈ 𝒜 and B ∈ ℬ.

Let ξ be a random variable on the conditional probability space

𝒮 = [Ω, 𝒜, ℬ, P(A | B)]

and let 𝒜_ξ be the least σ-algebra with respect to which ξ is measurable. Let B (B ∈ 𝒜) be fixed; since the measure ν(A) = μ(AB) is absolutely continuous on 𝒜_ξ with respect to μ(A), there exists a function f(ω, B) which is measurable with respect to 𝒜_ξ and has the property

μ(AB) = ∫_A f(ω, B) dμ for every A ∈ 𝒜_ξ.  (2)

If C ∈ ℬ and B ⊆ C, it follows from (2), by putting μ_C(A) = μ(AC)/μ(C), that

P(AB | C) = μ(AB)/μ(C) = ∫_A f(ω, B) dμ_C.  (3)

The function f(ω, B) obviously does not depend on C. Hence, introducing the notation P_ξ(B) = f(ω, B), we have

P(AB | C) = ∫_A P_ξ(B) dμ_C  (4)

for A ∈ 𝒜_ξ and B ⊆ C ∈ ℬ.

Clearly P_ξ(B) is, with respect to μ, almost everywhere uniquely defined; further, almost everywhere with respect to μ,

0 ≤ P_ξ(B) ≤ 1.  (5)

With the exception of a set of μ-measure zero we also have

P_ξ(B) = Σ_{k=1}^∞ P_ξ(B_k) if Σ_{k=1}^∞ B_k = B and B_j B_k = O for j ≠ k.

If η is another random variable on 𝒮, it can be shown, as in § 2, that the values of P_ξ(η⁻¹(V)) can be chosen such that for every ω ∉ D, with μ(D) = 0, P_ξ(η⁻¹(V)) is a measure on the Borel subsets of the real number axis. This measure will be called the conditional distribution of η for given ξ. If

P_ξ(η⁻¹(V)) = ∫_V g(y | x) dy for ξ(ω) = x,  (6)

then g(y | x) will be called the conditional density function of η with respect to the condition ξ = x.
Let ξ and η be random variables defined on 𝒮 with the two-dimensional density function h(x, y); assume that the integral

f(x) = ∫_{−∞}^{+∞} h(x, y) dy  (7)

exists for every x. Then f(x) is a density function of ξ. In fact, if U and V are two intervals with U ⊆ V and ξ⁻¹(V) ∈ ℬ, we have

P(ξ ∈ U | ξ ∈ V) = ∫_U f(x) dx / ∫_V f(x) dx.  (8)

In this case the conditional density function of η with respect to the condition ξ = x is equal, for f(x) > 0, to

g(y | x) = h(x, y)/f(x).  (9)

In fact, for U ⊆ V with ξ⁻¹(V) ∈ ℬ and any Borel set W of the real axis we have

P(η ∈ W, ξ ∈ U | ξ ∈ V) = ∬_{x∈U, y∈W} h(x, y) dx dy / ∫_V f(x) dx = ∫_U ( ∫_W g(y | x) dy ) f(x) dx / ∫_V f(x) dx.  (10)

Finally, let the relation

∫_{−∞}^{+∞} g(y | x) dy = f(x)/f(x) = 1 for f(x) > 0  (11)

be mentioned, expressing the fact that the conditional distribution of η for given ξ is an ordinary distribution.

Let us consider now some examples.
1. Let the point (ξ, η) be uniformly distributed in the domain of the plane defined by |x² − y²| < 1. The density function h(x, y) of (ξ, η) is

h(x, y) = 1 for |x² − y²| < 1, and 0 otherwise.

The density function f(x) of ξ is

f(x) = ∫_{−∞}^{+∞} h(x, y) dy,

hence

f(x) = 2(√(x² + 1) − √(x² − 1)) for |x| > 1, and f(x) = 2√(x² + 1) otherwise.

Similarly, if g(y) is the density function of η, we have

g(y) = 2(√(y² + 1) − √(y² − 1)) for |y| > 1, and g(y) = 2√(y² + 1) otherwise.

It follows that

f(x | y) = 1/(2(√(y² + 1) − √(y² − 1))) for |y| > 1, √(y² − 1) < |x| < √(y² + 1),

f(x | y) = 1/(2√(y² + 1)) for |y| ≤ 1, 0 ≤ |x| < √(y² + 1),

and f(x | y) = 0 otherwise.

Hence ξ is, for η = y (|y| ≤ 1), uniformly distributed on the interval (−√(y² + 1), +√(y² + 1)).
2. Let ξ and η be two independent random variables with absolutely continuous distributions. The density function of their joint distribution is thus

h(x, y) = f(x)g(y),

where g(y) is an ordinary density function. Hence the conditional density function g(y | x) of η with respect to the condition ξ = x is

g(y | x) = g(y).

The conditional density function g(y | x) does not depend on the value of x.
3. Let (ξ₁, ξ₂, ..., ξ_n) be a random vector uniformly distributed in the whole n-dimensional space and let η_n = ξ₁² + ξ₂² + ⋯ + ξ_n². Determine the conditional density function of ξ_k with respect to the condition η_n = y (y > 0). We know already that the density function of η_n is y^{n/2 − 1} for y > 0 (§ 1, Example 10). It follows that the two-dimensional density function of ξ_k and η_n is

h(x, y) = (y − x²)^{(n−3)/2} for y > x², and 0 otherwise.

For the conditional density function of ξ_k with respect to the condition η_n = y we find thus

f_n(x | y) = (C_n/√y)(1 − x²/y)^{(n−3)/2} for |x| < √y,  (12)

where the constant C_n is determined by

∫_{−√y}^{+√y} f_n(x | y) dx = 1.  (13)

From (12) and (13) it follows that

C_n = Γ(n/2)/(√π Γ((n−1)/2)),  (14)

and finally we obtain

f_n(x | y) = Γ(n/2)/(√(πy) Γ((n−1)/2)) · (1 − x²/y)^{(n−3)/2} for −√y < x < +√y.  (15)

From (15) follows

lim_{n→+∞} f_n(x | nσ²) = (1/(√(2π) σ)) e^{−x²/(2σ²)}  (−∞ < x < +∞),  (16)

hence every ξ_k (k fixed) has in the limit a normal (conditional) distribution, if the condition imposed is η_n = nσ² and n tends to infinity.
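The limit (16) can be watched numerically. A short Python sketch (editorial, not part of the original text; the values σ = 1 and x = 0.7 are arbitrary) evaluates (15) at y = nσ², using the logarithmic gamma function for numerical stability:

    import math

    def f_n(n, x, y):
        # formula (15): conditional density of xi_k given eta_n = y
        if x * x >= y:
            return 0.0
        logc = (math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
                - 0.5 * math.log(math.pi * y))
        return math.exp(logc + ((n - 3) / 2.0) * math.log(1.0 - x * x / y))

    sigma, x = 1.0, 0.7
    for n in (10, 100, 10000):
        print(n, f_n(n, x, n * sigma ** 2))
    # the normal limit in (16):
    print(math.exp(-x * x / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma))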
4. We deduce now the Maxwell distribution from the preceding example.¹ Let ξ_k, η_k, ζ_k (k = 1, 2, ..., n) be the components of the velocities of n atoms of a certain amount of gas. We assume that the (a priori) distribution of the point (ξ₁, η₁, ζ₁, ..., ξ_n, η_n, ζ_n) is uniform on the whole 3n-dimensional phase space. Consider the conditional distribution of the velocity components with respect to the condition that the total kinetic energy of the gas is constant. This kinetic energy is given by

E = (m/2) Σ_{k=1}^n (ξ_k² + η_k² + ζ_k²),

where m represents the mass of a particle of the gas. The conditional density function of the distribution studied is, by the above example,

h_n(x) = Γ(3n/2)/(√(2πE/m) Γ((3n−1)/2)) · (1 − mx²/(2E))^{(3n−3)/2} for |x| < √(2E/m).  (17)

By taking into account that E = (3/2)kTn (k Boltzmann's constant, T the absolute temperature of the gas), we find for the conditional density function h_n(x | T) of the velocity components ξ_k, η_k, ζ_k at constant temperature T

h_n(x | T) = Γ(3n/2)/(√(3πkTn/m) Γ((3n−1)/2)) · (1 − mx²/(3kTn))^{(3n−3)/2} for |x| < √(3kTn/m),  (18)

¹ Cf. A. Rényi [19].

hence

lim_{n→+∞} h_n(x | T) = √(m/(2πkT)) e^{−mx²/(2kT)}.  (19)

Thus the distribution of each component of the velocity tends to a normal distribution with expectation 0 and standard deviation √(kT/m).

Since for large n, under the condition E = constant, the variables ξ_k, η_k and ζ_k tend to be independent, it follows already that the distribution of the random variables

v_k = √(ξ_k² + η_k² + ζ_k²),

i.e. of the velocities of the particles, tends to the Maxwell distribution. But it is profitable to perform exactly the above calculations, i.e. to calculate the conditional distribution of v_k for every finite n. It is quite natural to call this distribution the Maxwell distribution of order n; for this distribution tends to the ordinary Maxwell distribution if n → +∞.
The calculations are entirely similar to those in the preceding example. Put

v_k = √(ξ_k² + η_k² + ζ_k²)

and

η_{3n} = Σ_{k=1}^n (ξ_k² + η_k² + ζ_k²).

If h_n(v, y) is the two-dimensional density function of v_k and η_{3n}, we have

h_n(v, y) = v²(y − v²)^{(3n−5)/2} for 0 < v < √y.  (20)

Thus if V_n(v | y) denotes the conditional density function of v_k with respect to the condition η_{3n} = y, we obtain

V_n(v | y) = D_n (v²/y^{3/2})(1 − v²/y)^{(3n−5)/2} for 0 < v < √y,  (21)

the constant D_n being determined by

∫₀^{√y} V_n(v | y) dv = 1.  (22)

Hence

D_n = 4Γ(3n/2)/(√π Γ((3n−3)/2)).  (23)

If W_n(v | T) denotes the conditional density function of the velocity of the particles at a given absolute temperature T, we have

W_n(v | T) = V_n(v | 3nkT/m) = (4v²/√π)(m/(3nkT))^{3/2} (Γ(3n/2)/Γ((3n−3)/2)) (1 − mv²/(3nkT))^{(3n−5)/2} for 0 < v < √(3nkT/m).  (24)

The distribution with density function (24) is called the Maxwell distribution of order n. As we have already seen, it tends for n → +∞ to the ordinary Maxwell distribution, i.e.

lim_{n→+∞} W_n(v | T) = √(2/π) (m/(kT))^{3/2} v² e^{−mv²/(2kT)}  (0 < v < +∞).  (25)
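The convergence in (25) can likewise be observed numerically. The following Python sketch (editorial, not part of the original text; units are chosen so that m = k = T = 1, and the test point v = 1.3 is arbitrary) evaluates (24) for increasing n:

    import math

    def W_n(n, v):
        # formula (24) with m = k = T = 1, so that 3nkT/m = 3n
        y = 3.0 * n
        if v * v >= y:
            return 0.0
        logc = (math.log(4.0) + math.lgamma(1.5 * n) - 0.5 * math.log(math.pi)
                - 1.5 * math.log(y) - math.lgamma(1.5 * n - 1.5))
        return math.exp(logc) * v * v * (1.0 - v * v / y) ** ((3 * n - 5) / 2.0)

    v = 1.3
    for n in (5, 50, 5000):
        print(n, W_n(n, v))
    # the ordinary Maxwell density (25) at v, with m = k = T = 1:
    print(math.sqrt(2.0 / math.pi) * v * v * math.exp(-v * v / 2.0))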

§ 4. Generalization of the notion of conditional mathematical expectation in Kolmogorov probability spaces

In § 2 we have defined the conditional probability of an event with respect to the condition that a random variable assumes a given value. Similarly, we can define the conditional expectation of a random variable η with respect to the condition that the random variable ξ assumes a given value.

Let ξ be a random variable and 𝒜_ξ the least σ-algebra with respect to which ξ is measurable; let η be any other random variable with finite expectation. If ξ is a discrete random variable assuming the values x_k (k = 1, 2, ...) with positive probabilities, and if A_k is the event ξ = x_k, then let E(η | ξ) denote the random variable such that E(η | ξ) = E(η | A_k) for ξ = x_k (i.e. for every ω ∈ Ω with ξ(ω) = x_k); we have thus, for A ∈ 𝒜_ξ,

∫_A E(η | ξ) dP = Σ_{k=1}^{+∞} E(η | A_k) P(AA_k),

hence

∫_A E(η | ξ) dP = ∫_A η dP,  (1)

provided that A ∈ 𝒜_ξ (this means, in the case of a discrete random variable ξ, that A = Σ_r A_{k_r} (k₁ < k₂ < ⋯)).

In the general case, we want to define the random variable E(η | ξ) so that it is measurable with respect to 𝒜_ξ and relation (1) is valid. Put

ν(A) = ∫_A η dP  (A ∈ 𝒜_ξ);

because of the known properties of the integral, ν(A) is σ-additive on 𝒜_ξ and absolutely continuous with respect to P(A). Hence, by the Radon–Nikodym theorem, there exists a function f(ω) = dν/dP which is measurable with respect to 𝒜_ξ and fulfils the relation

ν(A) = ∫_A f(ω) dP

whenever A ∈ 𝒜_ξ. Therefore, if E(η | ξ) is defined by E(η | ξ) = f(ω), (1) is satisfied. It follows from the definition that for A = Ω

E(E(η | ξ)) = E(η).  (2)
In particular, if η = η_B, where η_B is the indicator of the set B, i.e.

η_B = 1 for ω ∈ B, and η_B = 0 otherwise,

then

ν(A) = ∫_A η_B dP = P(AB),

and

E(η_B | ξ) = P_ξ(B).

The conditional probability P_ξ(B) of B for a given value of ξ may thus also be considered as a conditional expectation.

Of course one may ask whether E(η | ξ) is with probability 1 equal to the expectation of the conditional distribution of η for a given value of ξ (i.e. to the expectation of the distribution P_ξ(η⁻¹(V))). The answer is affirmative, provided that P_ξ(η⁻¹(V)) is with probability 1 a probability distribution. This can always be achieved, as we have already seen. In this case

E(η | ξ) = ∫_Ω η dP_ξ  (3)

with probability 1. In order to prove this it suffices to show that for every A ∈ 𝒜_ξ the relation

∫_A ( ∫_Ω η dP_ξ ) dP = ∫_A η dP  (4)

holds. Obviously, this relation is fulfilled for η = η_B, where η_B is the indicator of the set B; indeed, in this case

∫_Ω η_B dP_ξ = P_ξ(B) and ∫_A η_B dP = P(AB),

and (4) reduces to the relation

∫_A P_ξ(B) dP = P(AB)

defining P_ξ(B). Hence (4) holds when η takes on a denumerable set of values. From this, because of the known properties of the Lebesgue integral, it can be shown that (4) is generally valid.

If ξ and η are independent, it follows from (3) that we have with probability 1

E(η | ξ) = E(η).  (5)

Furthermore, the following theorem can be stated for arbitrary random variables ξ and η: if f(x) is a Borel-measurable function such that E(f(ξ)η) exists, then we have, with probability 1,

E(f(ξ)η | ξ) = f(ξ) E(η | ξ).  (6)

To prove this it suffices to show that

∫_A f(ξ) E(η | ξ) dP = ∫_A f(ξ)η dP for A ∈ 𝒜_ξ.  (7)

It follows from (3) that

∫_A f(ξ) E(η | ξ) dP = ∫_A f(ξ) ( ∫_Ω η dP_ξ ) dP = ∫_A E(f(ξ)η | ξ) dP = ∫_A f(ξ)η dP.

Relation (6) furnishes a new proof of the fact that, for independent ξ and η, E(ξη) = E(ξ)E(η) (cf. Ch. IV). In fact, it follows from (2) and (6) that

E(ξη) = E(E(ξη | ξ)) = E(ξ E(η | ξ)).  (8)

Thus if ξ and η are independent, E(η | ξ) = E(η) with probability 1, and from this follows the desired result.
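Relation (8) can be verified exactly on a finite space. A Python sketch (editorial, not part of the original text; the choice of a die ξ, an independent fair coin δ and η = ξ + δ is arbitrary):

    from fractions import Fraction
    from itertools import product

    omega = list(product(range(1, 7), (0, 1)))     # (die, coin), uniform
    p = Fraction(1, len(omega))
    xi = {w: w[0] for w in omega}
    eta = {w: w[0] + w[1] for w in omega}

    def E_eta_given_xi(w):
        # E(eta | xi) as a random variable: average of eta on the level set
        level = [v for v in omega if v[0] == w[0]]
        return Fraction(sum(eta[v] for v in level), len(level))

    lhs = sum(xi[w] * eta[w] * p for w in omega)             # E(xi * eta)
    rhs = sum(xi[w] * E_eta_given_xi(w) * p for w in omega)  # E(xi * E(eta | xi))
    print(lhs, rhs, lhs == rhs)                              # equal exactly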
Consider now another important property of the conditional expectation. Let ξ and η be two random variables and g(x) a Borel-measurable function. We have then with probability 1

E(E(η | ξ) | g(ξ)) = E(η | g(ξ)).  (9)

In order to see this it suffices to prove the relation

∫_A E(E(η | ξ) | g(ξ)) dP = ∫_A η dP  (10)

for every A ∈ 𝒜_{g(ξ)}. By applying twice the definition of the conditional expectation and by taking into account that 𝒜_{g(ξ)} ⊆ 𝒜_ξ, we obtain

∫_A E(E(η | ξ) | g(ξ)) dP = ∫_A E(η | ξ) dP = ∫_A η dP,

which proves (10) and hence (9) too.


The ordinary expectation is known to be a linear functional. How far does this hold for the conditional expectation? If c₁ and c₂ are two constants, we have with probability 1

E(c₁η₁ + c₂η₂ | ξ) = c₁E(η₁ | ξ) + c₂E(η₂ | ξ).  (11)

Indeed, we have by (1) for every A ∈ 𝒜_ξ

∫_A (c₁E(η₁ | ξ) + c₂E(η₂ | ξ)) dP = c₁ ∫_A E(η₁ | ξ) dP + c₂ ∫_A E(η₂ | ξ) dP = ∫_A (c₁η₁ + c₂η₂) dP.

Nevertheless, we cannot state that E(η | ξ) is a linear functional with probability 1, since (11) holds with probability 1 only, and the exceptional sets corresponding to every pair (η₁, η₂) may together even cover the whole space Ω.

§ 5. Generalization of Bayes' theorem

Let ξ and η be two random variables with absolutely continuous distributions and a two-dimensional density function h(x, y). Put further

f(x) = ∫_{−∞}^{+∞} h(x, y) dy,  (1)

g(y) = ∫_{−∞}^{+∞} h(x, y) dx,  (2)

f(x | y) = h(x, y)/g(y) for g(y) > 0, otherwise arbitrary,  (3)

g(y | x) = h(x, y)/f(x) for f(x) > 0, otherwise arbitrary.  (4)

Clearly we have

f(x) = ∫_{−∞}^{+∞} f(x | y) g(y) dy  (5)

and

g(y) = ∫_{−∞}^{+∞} g(y | x) f(x) dx.  (6)

It follows from (3) and (4) that

g(y | x) f(x) = f(x | y) g(y) for f(x) > 0,  (7)

hence by (5)

g(y | x) = f(x | y) g(y) / ∫_{−∞}^{+∞} f(x | t) g(t) dt.  (8)

Formula (8) may be considered as a generalization of Bayes' theorem to the case of absolutely continuous distributions. With this formula one can express the conditional density function of η for a given value of ξ by means of the conditional density function of ξ for a given value of η and the unconditional density function of η. It follows from (8) that

P(a ≤ η < b | ξ = x) = ∫_a^b g(y | x) dy = ∫_a^b f(x | y) g(y) dy / ∫_{−∞}^{+∞} f(x | t) g(t) dt,  (9)

or

P(a ≤ η < b | ξ = x) = ∫_a^b f(x | y) dG(y) / ∫_{−∞}^{+∞} f(x | t) dG(t),  (10)

where G(y) is the ordinary distribution function of η.


Formula (10) is therefore valid even if η does not have an absolutely continuous distribution.

Relation (8) is in certain cases also valid for ξ and η defined on a conditional probability space. This holds if the two-dimensional density function (in the sense explained in § 1) exists. For the functions f(x) and g(y) defined in (1) and (2) — provided that they exist, i.e. the integrals (1) and (2) are finite — we have, in general,

∫_{−∞}^{+∞} f(x) dx = ∫_{−∞}^{+∞} g(y) dy = +∞.

Let it be mentioned that h(x, y), f(x), g(y) are only defined up to a constant factor. If f(x | y) and g(y | x) are computed by (3) and (4) or (8), this factor disappears. The density functions f(x | y) and g(y | x) so obtained are already normed so that their integral from −∞ to +∞ has the value 1.
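Formula (8) is the form of Bayes' theorem used throughout statistics. As a numeric illustration (editorial, not part of the original text; the normal prior and likelihood are an arbitrary choice), let η be N(0, 1) and, conditionally on η = y, let ξ be N(y, 1); then (8) must reproduce the classical posterior, which is normal with mean x/2 and variance 1/2:

    import math

    def phi(u, mu=0.0, s2=1.0):
        # normal density with mean mu and variance s2
        return math.exp(-(u - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

    x = 1.0
    step = 0.001
    ts = [i * step - 8.0 for i in range(16001)]              # grid for t
    denom = sum(phi(x, mu=t) * phi(t) for t in ts) * step    # denominator of (8)
    for y in (0.0, 0.5, 1.0):
        g = phi(x, mu=y) * phi(y) / denom                    # g(y | x) by (8)
        print(y, round(g, 4), round(phi(y, mu=x / 2, s2=0.5), 4))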

§ 6. The correlation ratio

Let ξ and η be two random variables on a Kolmogorov probability space; suppose that E(η) and D²(η) exist, and let E(η | ξ) denote the conditional expectation of η for a given value of ξ. We know that

E(E(η | ξ)) = E(η).  (1)

For the variance of E(η | ξ) we have

D_ξ²(η) = D²(E(η | ξ)) = E(E²(η | ξ)) − E²(η).  (2)

Theorem 1. If E(η) and D²(η) exist, we have

D²(η) = D_ξ²(η) + E([E(η | ξ) − η]²).  (3)

Proof. We have

η − E(η) = [η − E(η | ξ)] + [E(η | ξ) − E(η)],

therefore

D²(η) = E([E(η | ξ) − η]²) + D_ξ²(η) + 2E([η − E(η | ξ)][E(η | ξ) − E(η)]).  (4)

By (2) and (6) of § 4,

E([η − E(η | ξ)][E(η | ξ) − E(η)]) = E([η − E(η | ξ)] E(η | ξ)) = E(E([η − E(η | ξ)] E(η | ξ) | ξ)) = E(E(η | ξ) E(η − E(η | ξ) | ξ)) = 0.

Thus (4) implies (3).

Remarks.

1. It was implicitly shown in proving (3) that the random variables η − E(η | ξ) and E(η | ξ) are uncorrelated.

2. The assertion of Theorem 1 may be written in the form

D²(η) = D²(E(η | ξ)) + E(D²(η | ξ)),  (5)

where D²(η | ξ) is the conditional variance of η for a given value of ξ, defined by

D²(η | ξ) = E([η − E(η | ξ)]² | ξ).

According to Formula (2) of § 4 we have thus

E(D²(η | ξ)) = E(E([η − E(η | ξ)]² | ξ)) = E([η − E(η | ξ)]²).
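The decomposition (5) is easily checked by simulation. A Python sketch (editorial, not part of the original text; the model η = ξ² + ε with ε of variance 4 is an arbitrary choice for which E(η | ξ) = ξ² and D²(η | ξ) = 4 are known exactly):

    import random

    random.seed(3)
    n = 200000
    pairs = [(x, x * x + random.gauss(0, 2))
             for x in (random.gauss(0, 1) for _ in range(n))]

    etas = [e for _, e in pairs]
    var_eta = sum(e * e for e in etas) / n - (sum(etas) / n) ** 2
    m = [x * x for x, _ in pairs]                 # E(eta | xi) = xi^2
    var_m = sum(u * u for u in m) / n - (sum(m) / n) ** 2
    print(var_eta, var_m + 4.0)                   # the two sides of (5)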
Assuming D(η) > 0, put

K_ξ(η) = D_ξ(η)/D(η).  (6)

Then by Theorem 1

0 ≤ K_ξ(η) ≤ 1.  (7)

K_ξ(η) will be called the correlation ratio of η with respect to ξ; it is defined only if D(η) > 0. This notion was introduced by K. Pearson in a somewhat less general form, and in full generality by A. N. Kolmogorov. It gives a certain information about the mutual dependence of ξ and η. This is shown by the following two theorems.

Theorem 2. If ξ and η are independent, K_ξ(η) = 0. The converse, however, does not hold: K_ξ(η) = 0 does not imply the independence of ξ and η, though it implies the vanishing of the correlation coefficient R(ξ, η).

Theorem 3. The relation K_ξ(η) = 1 is valid iff η = g(ξ), where g(x) is a Borel-measurable function.

Proof of Theorem 2. If ξ and η are independent, D_ξ(η) = 0, hence K_ξ(η) = 0. If K_ξ(η) = 0, E(η | ξ) is equal to E(η) with probability 1, therefore by relation (8) of § 4

E(ξη) = E(ξ E(η | ξ)) = E(ξ)E(η),

thus R(ξ, η) = 0. The following example shows that K_ξ(η) = 0 does not imply the independence of ξ and η: let the point (ξ, η) be uniformly

distributed in the circle x 2 + y 2 < 1; let g(y \ x) be the conditional density


function of i/ with respect to the condition £ = x. We have

ff(y\x) = -----for \ y \ < J l - x 2 ; - l < x < + l,


2^/l-x2
hence E(g f ) s O and, consequently, Kf g) = 0 though £ and g are
evidently not independent.

Proof of T heorem 3. If Kfifi) = 1, (3) shows that

E ( [ g - E ( g I O]2) = 0,
hence, with the probability 1,

>; = £07 10; (8)


g is thus measurable with respect to and therefore it can be written in
the form g = gif). Conversely, if g = g(f), then g is measurable with
respect to thus (8) is valid with probability 1, therefore it follows
that Kffi) = 1.
Unlike the correlation coefficient, the correlation ratio is not symmetrical. To characterize the dependence between ξ and η both quantities K_ξ(η) and K_η(ξ) can be used, provided that the variances of both ξ and η exist and are positive. The conditional expectation E(η | ξ) can be characterized by the following property:

Theorem 4. If ξ and η are any two random variables, D²(η) is finite, and g(x) is a Borel-measurable function, then the expression

E([η − g(ξ)]²)

takes on its minimum for g(ξ) = E(η | ξ).

Proof. By Formula (2) of § 4,

E([η − g(ξ)]²) = E(E([η − g(ξ)]² | ξ)).  (9)

It follows by a basic property of the expectation (see Theorem 2 of § 9, Ch. III) that

E([η − g(ξ)]² | ξ) = ∫_Ω (η − g(ξ))² dP_ξ ≥ ∫_Ω (η − E(η | ξ))² dP_ξ.  (10)

It follows from (9) and (10) that

E([η − g(ξ)]²) ≥ E([η − E(η | ξ)]²),  (11)

q.e.d. Equality in (11) can occur if and only if the relation

g(ξ) = E(η | ξ)

is valid with probability 1.

Remark. The curve y = E(η | ξ = x) is called the regression curve of η with respect to ξ.

In particular, it follows from Theorem 4 that for any two real numbers a and b

E([η − (aξ + b)]²) ≥ E([η − E(η | ξ)]²).  (12)

The left-hand side is minimal for

a = R(ξ, η) D(η)/D(ξ), b = E(η) − aE(ξ).  (13)

The line

y − E(η) = R(ξ, η) (D(η)/D(ξ)) (x − E(ξ))  (14)

is called the regression line of η with respect to ξ. If a and b are given by (13), we have

E([η − (aξ + b)]²) = D²(η)(1 − R²(ξ, η)).  (15)

On the other hand, because of (3) and (6),

E([η − E(η | ξ)]²) = D²(η)(1 − K_ξ²(η)).  (16)

From (12), (15) and (16) it follows that

R²(ξ, η) ≤ K_ξ²(η).  (17)
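The contrast between (15) and (16) can be seen in a small simulation. A Python sketch (editorial, not part of the original text; the model η = ξ² + 0.1ε with ξ uniform on (−1, 1) is an arbitrary choice) computes a and b from (13) and compares the mean square errors of the regression line and of the regression curve E(η | ξ) = ξ²:

    import random

    random.seed(4)
    n = 200000
    data = [(x, x * x + 0.1 * random.gauss(0, 1))
            for x in (random.uniform(-1, 1) for _ in range(n))]

    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    cov = sum((x - mx) * (y - my) for x, y in data) / n
    vx = sum((x - mx) ** 2 for x, _ in data) / n
    a, b = cov / vx, my - (cov / vx) * mx          # formula (13)

    mse_line = sum((y - (a * x + b)) ** 2 for x, y in data) / n
    mse_curve = sum((y - x * x) ** 2 for x, y in data) / n
    print(a, b)                  # close to 0 and 1/3: the best line is horizontal
    print(mse_line, mse_curve)   # about 0.099 against 0.010, as (15) and (16) predict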

This permits us to restate the proposition of Theorem 2: if K_ξ(η) = 0, then R(ξ, η) = 0. Inequality (17) may be sharpened as follows:

Theorem 5. If ξ is an arbitrary random variable and η a random variable with finite expectation and variance, then

K_ξ²(η) = sup_g R²(g(ξ), η),  (18)

where y = g(x) runs through the set of all Borel-measurable functions for which the expectation and variance of g(ξ) exist. The relation

K_ξ²(η) = R²(g(ξ), η)  (19)

holds iff

g(ξ) = aE(η | ξ) + b,  (20)

where a ≠ 0 and b are constants.

Proof. One can assume without restriction of generality that E(η) = E(g(ξ)) = 0 and D(η) = D(g(ξ)) = 1. By (2) and (6) of § 4 and by the Schwarz inequality,

R²(g(ξ), η) = E²(ηg(ξ)) = E²(E(ηg(ξ) | ξ)) = E²(g(ξ) E(η | ξ)) ≤ E(E²(η | ξ)) = K_ξ²(η).

The condition for equality is here easily verified.

Theorem 5 permits us to give a new definition of the correlation ratio K_ξ(η) and even a new definition of the conditional expectation E(η | ξ). Certainly, Formula (19) defines E(η | ξ) = g(ξ) up to a linear transformation only. But it is easy to obtain a unique definition. In effect, E(η | ξ) may be characterized as the function g₀(ξ) fulfilling the relation

R²(η, g₀(ξ)) = sup_g R²(η, g(ξ)) = K_ξ²(η)  (21)

and the relations

E(g₀(ξ)) = E(η),
D²(g₀(ξ)) = D²(η) K_ξ²(η),  (22)
R(η, g₀(ξ)) ≥ 0.

§ 7. On some other measures of the dependence of two random variables

Another measure of the dependence of two random variables is given by the contingency. This notion was introduced for discrete distributions by K. Pearson (mean square contingency).¹

Let ξ and η be discrete random variables assuming the values x_k (k = 1, 2, ...) and y_j (j = 1, 2, ...), and only these, with positive probabilities. Let A_k and B_j denote the events ξ = x_k and η = y_j respectively. The contingency φ(ξ, η) is defined by

φ(ξ, η) = √( Σ_k Σ_j (P(A_k B_j) − P(A_k)P(B_j))² / (P(A_k)P(B_j)) ),  (1)

¹ For the general case see A. Rényi [28].

or, after an obvious transformation,

φ²(ξ, η) = Σ_k Σ_j P²(A_k B_j)/(P(A_k)P(B_j)) − 1.  (2)

It is clear that φ(ξ, η) is zero iff ξ and η are independent. If the number of the values x_k is n and that of the y_j-s is m with m ≥ n, then

φ²(ξ, η) ≤ n − 1,  (3)

as because of P(A_k B_j) ≤ P(B_j) it follows from (2) that

φ²(ξ, η) ≤ Σ_k Σ_j P(A_k B_j)/P(A_k) − 1 = n − 1.  (4)

It can be seen from (4) that in (3) the sign of equality holds iff for every k and for every j either P(A_k B_j) = P(B_j) or P(A_k B_j) = 0. Since, however,

Σ_{k=1}^∞ P(A_k B_j) = P(B_j)  (P(B_j) > 0),

this cannot occur unless for one k_j the relation P(A_{k_j} B_j) = P(B_j) and for every other k ≠ k_j the relation P(A_k B_j) = 0 is valid. But then ξ = x_{k_j} for η = y_j and consequently ξ = f(η).

Conversely, if ξ = f(η), then φ²(ξ, η) = n − 1. If both ξ and η assume infinitely many values, the series on the right of (1) may be divergent; in this case φ(ξ, η) = +∞.
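A worked numeric instance of (1)–(3) (editorial, not part of the original text; the 2 × 3 table is an arbitrary example) in Python:

    from fractions import Fraction

    # joint probabilities P(A_k B_j); rows: values of xi, columns: values of eta
    P = [[Fraction(1, 6), Fraction(1, 6), Fraction(1, 6)],
         [Fraction(1, 3), Fraction(1, 12), Fraction(1, 12)]]
    pk = [sum(row) for row in P]            # P(A_k)
    qj = [sum(col) for col in zip(*P)]      # P(B_j)

    phi2 = sum((P[k][j] - pk[k] * qj[j]) ** 2 / (pk[k] * qj[j])
               for k in range(2) for j in range(3))
    print(phi2)     # phi^2(xi, eta); by (3) it cannot exceed n - 1 = 1 here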
Before defining the contingency for arbitrary random variables, the notion of regular dependence will be introduced. Let ξ and η be any two random variables. If C is an arbitrary two-dimensional Borel set, we put

P(C) = P((ξ, η) ∈ C).  (5)

Let A and B be Borel sets on the x-axis and the y-axis respectively; put

P₁(A) = P(ξ⁻¹(A))  (6a)

and

P₂(B) = P(η⁻¹(B)).  (6b)

Let A × B denote the set of the points of the (x, y)-plane for which x ∈ A and y ∈ B. Define the measure Q(C) for the two-dimensional Borel sets of the form C = A × B by

Q(A × B) = P₁(A)P₂(B).  (7)



This measure can be extended to all two-dimensional Borel sets of the plane in a unique manner, since the values of its extension are uniquely determined by the values on the "parallelograms" A × B.

If P is absolutely continuous with respect to Q, the dependence between ξ and η is said to be regular. This is evidently the case if ξ and η are independent, since then P = Q. It is easy to see that the dependence between two discrete random variables is always regular.

If the dependence between ξ and η is regular, there exists, according to the Radon–Nikodym theorem, a Borel-measurable function k(x, y) = dP/dQ such that for every two-dimensional Borel set C the relation

P(C) = ∫_C k(x, y) dQ  (8)

holds. If F(x) and G(y) are the distribution functions of ξ and η, respectively, and if A and B are any two Borel subsets of the real axis, then the function k(x, y) satisfies the relation

P(ξ ∈ A, η ∈ B) = ∫_{x∈A} ∫_{y∈B} k(x, y) dF(x) dG(y).  (9)

In particular, if ξ and η are discrete random variables,

k(x, y) = P(A_k B_j)/(P(A_k)P(B_j)) for x = x_k, y = y_j, and k(x, y) = 0 otherwise.  (10)

If the joint distribution of ξ and η is absolutely continuous with the density function h(x, y), and if f(x) and g(y) are the density functions of ξ and η respectively, then we evidently have

k(x, y) = h(x, y)/(f(x)g(y)).  (11)

We can now define the contingency for arbitrary regularly dependent random variables ξ and η by

φ(ξ, η) = √( ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} (k(x, y) − 1)² dF(x) dG(y) ),  (12)

or equivalently by

φ²(ξ, η) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} k²(x, y) dF(x) dG(y) − 1.  (13a)

In particular, if ξ and η are discrete random variables, relation (1) is obtained from (12) because of (10). If the joint distribution of ξ and η is absolutely continuous, we obtain, because of (11),

φ²(ξ, η) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} h²(x, y)/(f(x)g(y)) dx dy − 1.  (13b)

Obviously, φ(ξ, η) = 0 holds iff P = Q, i.e. iff ξ and η are independent.

Now we prove a theorem establishing a relation between the correlation coefficient and the contingency.

Theorem 1. Let ξ and η be regularly dependent random variables, u(x) and v(y) Borel-measurable functions such that the variances D²(u(ξ)) and D²(v(η)) exist and are positive. Then we have

R²(u(ξ), v(η)) ≤ φ²(ξ, η).  (14)

Proof. We may assume without restricting the generality that E(u(ξ)) = E(v(η)) = 0 and D(u(ξ)) = D(v(η)) = 1. Then, by the definition of the correlation coefficient,

R(u(ξ), v(η)) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} u(x)v(y)k(x, y) dF(x) dG(y).  (15)

But by assumption

∫_{−∞}^{+∞} ∫_{−∞}^{+∞} u(x)v(y) dF(x) dG(y) = 0.  (16)

From (15) and (16) follows

R(u(ξ), v(η)) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} u(x)v(y)(k(x, y) − 1) dF(x) dG(y).  (17)

By applying the Schwarz inequality and (12), we obtain

R²(u(ξ), v(η)) ≤ φ²(ξ, η),

which proves the theorem.

The quantity

ψ(ξ, η) = sup_{u,v} |R(u(ξ), v(η))|,  (18)

where u(x) and v(y) run through all Borel-measurable functions for which the expectations and variances of u(ξ) and v(η) exist, can also be considered

as a measure of the dependence between ξ and η. This quantity is called the maximal correlation of ξ and η and was introduced first, for discrete random variables, by Hirschfeld, and for absolutely continuous distributions by Gebelein. Its most simple properties are contained in the following theorem:

Theorem 2. If ψ(ξ, η) is the maximal correlation of ξ and η, defined by (18), we have always

a) ψ(ξ, η) = ψ(η, ξ);

b) 0 ≤ ψ(ξ, η) ≤ 1;

c) if y = a(x) and y = b(x) are strictly monotonic functions, then ψ(a(ξ), b(η)) = ψ(ξ, η);

d) ψ(ξ, η) = 0 iff ξ and η are independent;

e) if there exists between ξ and η a relation of the form U(ξ) = V(η), where U(x) and V(y) are Borel-measurable functions with D(U(ξ)) > 0, then ψ(ξ, η) = 1;

f) we have

|R(ξ, η)| ≤ min(K_ξ(η), K_η(ξ)) ≤ max(K_ξ(η), K_η(ξ)) ≤ ψ(ξ, η) ≤ φ(ξ, η).

Proof. Properties a), b), and c) are direct consequences of the definition. If ξ and η are independent, clearly ψ(ξ, η) = 0. Conversely, if ψ(ξ, η) = 0, then R(u(ξ), v(η)) = 0 for every u and v. If we choose

u_a(x) = 1 for x < a and 0 for x ≥ a,  v_b(y) = 1 for y < b and 0 for y ≥ b,

it follows from R(u_a(ξ), v_b(η)) = 0 that

P(ξ < a, η < b) = P(ξ < a)P(η < b).

As a and b are arbitrary, this means that ξ and η are independent; hence d) is proved. If U(ξ) = V(η) with D(U(ξ)) > 0, we know that

R(U(ξ), V(η)) = 1,

hence ψ(ξ, η) = 1. Property f) can be deduced by comparing the definition of the maximal correlation with Theorem 1 of this § and Theorem 5 of § 6.
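For discrete ξ and η the supremum in (18) can actually be computed: ψ(ξ, η) is the second largest singular value of the matrix with entries P(A_k B_j)/√(P(A_k)P(B_j)), the largest singular value being always 1. This reduction is a standard fact about (18), not spelled out in the text above; a Python sketch using the table of the earlier contingency example:

    import numpy as np

    P = np.array([[1/6, 1/6, 1/6],
                  [1/3, 1/12, 1/12]])
    pk, qj = P.sum(axis=1), P.sum(axis=0)
    M = P / np.sqrt(np.outer(pk, qj))
    s = np.linalg.svd(M, compute_uv=False)
    print(s)        # s[0] = 1; the maximal correlation psi(xi, eta) is s[1]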
A further notion which we want to study here is that of the modulus of (pairwise) dependence of a sequence of random variables. Let ξ₁, ξ₂, ..., ξ_n, ... be a (finite or infinite) sequence of arbitrary random variables. We define the modulus of dependence of the sequence {ξ_n} as the smallest positive real number Λ satisfying for all sequences {x_n} with Σ_n x_n² < +∞ the inequality

|Σ_n Σ_m ψ(ξ_n, ξ_m) x_n x_m| ≤ Λ Σ_n x_n²,  (19)

i.e. the least upper bound of the quadratic form

Σ_n Σ_m ψ(ξ_n, ξ_m) x_n x_m

under the condition Σ_n x_n² = 1. (If this form is unbounded, the modulus of dependence is infinite.)

In particular, if the sequence {ξ_n} contains only two elements, its modulus of dependence will be 1 + ψ(ξ₁, ξ₂). If the sequence {ξ_n} is finite, Λ is finite as well; if the sequence contains infinitely many elements, Λ is not necessarily finite. If the elements of {ξ_n} are pairwise independent, Λ = 1, otherwise Λ > 1.

The following theorem furnishes an inequality between the correlation ratio and the modulus of dependence of a sequence of random variables.

Theorem 3. Let {ξ_n} be a sequence of random variables with a finite modulus of dependence Λ. If η is a random variable with a finite variance, we have

Σ_n K_{ξ_n}²(η) ≤ Λ.  (20)

For the proof we need a lemma which generalizes the Bessel inequality, well known from the theory of orthogonal series, to the case of quasiorthogonal functions.

Lemma. Let {ζ_n} be a finite or infinite sequence of random variables such that E(ζ_n²) exists (n = 1, 2, ...). Suppose that the quadratic form

Σ_n Σ_m E(ζ_n ζ_m) x_n x_m

is bounded, i.e. that we have

|Σ_n Σ_m E(ζ_n ζ_m) x_n x_m| ≤ B Σ_n x_n²;  (21)

then for every random variable η for which E(η²) exists we have

Σ_n E²(ηζ_n) ≤ B E(η²).  (22)

Note. If E(ζ_n ζ_m) = 0 for n ≠ m and E(ζ_n²) = 1, i.e. if the sequence {ζ_n} is orthonormal, (21) is valid with B = 1 and (22) reduces to Bessel's inequality

Σ_n E²(ηζ_n) ≤ E(η²).  (23)

Proof of the Lemma. Put

a_n = E(ηζ_n).  (24)

Obviously,

E((η − (1/B) Σ_n a_n ζ_n)²) ≥ 0.  (25)

By carrying out the calculations, we find

E(η²) − (2/B) Σ_n a_n² + (1/B²) Σ_n Σ_m a_n a_m E(ζ_n ζ_m) ≥ 0.  (26)

Because of (21) we have

(1/B²) Σ_n Σ_m a_n a_m E(ζ_n ζ_m) ≤ (1/B) Σ_n a_n²,  (27)

hence it follows from (26) that

(1/B) Σ_n a_n² ≤ E(η²),  (28)

which is, by (24), equivalent to (22).

Proof of Theorem 3. Let f_n(x) be a Borel-measurable function such that E(f_n(ξ_n)) = 0 and D(f_n(ξ_n)) = 1. Put

ζ_n = f_n(ξ_n).  (29)

Then, by the definition of the maximal correlation,

|E(ζ_n ζ_m)| = |R(f_n(ξ_n), f_m(ξ_m))| ≤ ψ(ξ_n, ξ_m).  (30)

Hence, according to (19),

|Σ_n Σ_m E(ζ_n ζ_m) x_n x_m| ≤ Σ_n Σ_m ψ(ξ_n, ξ_m)|x_n||x_m| ≤ Λ Σ_n x_n².  (31)

Thus the lemma can be applied to the sequence {ζ_n} with B = Λ, provided

that E(η²) exists. Then

Σ_n E²(ζ_n[η − E(η)]) ≤ Λ D²(η).  (32)

But

E(ζ_n[η − E(η)]) = D(η) R(ζ_n, η).  (33)

Let f_n(x) be chosen such that

f_n(ξ_n) = (E(η | ξ_n) − E(η))/D_{ξ_n}(η),  (34)

where D_{ξ_n}(η) = D(E(η | ξ_n)) is the standard deviation of E(η | ξ_n). Then, according to Theorem 5 of § 6,

R²(f_n(ξ_n), η) = K_{ξ_n}²(η).  (35)

Hence by (32) and (33)

D²(η) Σ_n K_{ξ_n}²(η) ≤ Λ D²(η).  (36)

After division by D²(η) we obtain (20), and thus Theorem 3 is proved.

Theorem 3 is a probabilistic generalization of the large sieve of Yu. V. Linnik, which has important applications in number theory.¹

¹ With the help of this generalization Rényi succeeded in proving that every positive integer n can be written in the form n = p + P, where p is a prime and P is the product of at most K prime factors; K denotes here a universal constant. Cf. A. Rényi [2].

§ 8. The fundamental theorem of Kolmogorov

In what follows, we shall often prove theorems concerning an infinite sequence of random variables. The conditions of these theorems involve the simultaneous distributions of a finite number of the random variables considered. We shall not prove for each particular theorem the existence, on some probability space, of a sequence of random variables fulfilling the assumptions of the theorem; the solution of this existence problem is furnished by a general theorem due to Kolmogorov. Kolmogorov proved this fundamental theorem for an arbitrary (not necessarily denumerable) set of random variables; we restrict ourselves to the case of denumerable sets.

Theorem 1 (Kolmogorov's fundamental theorem). For any integer n let F_n(x₁, x₂, ..., x_n) be an n-dimensional distribution function, fulfilling the

following conditions of compatibility:

F_{n+m}(x₁, x₂, ..., x_n, +∞, ..., +∞) = F_n(x₁, x₂, ..., x_n)  (n, m = 1, 2, ...).  (1)

Then there exists a Kolmogorov probability space on which random variables ξ_n (n = 1, 2, ...) can be so defined that for every n the n-dimensional distribution function of the random variables ξ₁, ξ₂, ..., ξ_n is equal to F_n(x₁, x₂, ..., x_n).

Proof. Let Ω be the set of all infinite sequences

ω = (ω₁, ω₂, ..., ω_n, ...)

of real numbers. Let Π_n be the function, defined on Ω, projecting Ω upon the subspace Ω_n of the first n coordinates of ω; i.e., for ω = (ω₁, ω₂, ..., ω_n, ...) we put

Π_n ω = (ω₁, ω₂, ..., ω_n).  (2)

For A ⊆ Ω, let Π_n A denote the set of all elements of Ω_n which can be brought to the form y = Π_n ω with ω ∈ A.

Let now A ⊆ Ω_n be any subset of Ω_n. We shall call the set of elements ω = (ω₁, ω₂, ...) such that Π_n ω = (ω₁, ..., ω_n) ∈ A an n-dimensional cylinder with base A; we shall denote this set by Π_n⁻¹(A).

If A is Borel-measurable, the corresponding cylinder set is said to be a Borel cylinder set. Let 𝒞 be the set of all Borel cylinder sets; 𝒞 is an algebra of sets. To see this, let us remark that an n-dimensional cylinder set is at the same time an (n + m)-dimensional cylinder set as well. In fact,

Π_n⁻¹(A) = Π_{n+m}⁻¹(Π_{n+m}(Π_n⁻¹(A))).  (3)

Hence if A is an n-dimensional Borel set, Π_{n+m}(Π_n⁻¹(A)) is an (n + m)-dimensional Borel set. Thus for a finite number of cylinders it can always be assumed that their bases have the same number of dimensions, e.g. N. If A = Π_N⁻¹(A′), B = Π_N⁻¹(B′), where A′ and B′ are Borel sets of the N-dimensional space, then

A + B = Π_N⁻¹(A′ + B′),
A − B = Π_N⁻¹(A′ − B′);

A + B and A − B are thus Borel cylinders again. Finally, since Ω = Π_N⁻¹(Ω_N), the set Ω itself is a Borel cylinder as well.

Let 𝒞* be the least σ-algebra of subsets of Ω which contains all Borel cylinders of Ω. The probability measure P on 𝒞* is defined in the following manner: first we define P on the algebra 𝒞, and then we extend the definition as was done in Chapter II.

Let A be a Borel cylinder of Ω and N an integer with A = Π_N⁻¹(A_N), A_N being a Borel set of Ω_N. F_N(x₁, ..., x_N) generates on Ω_N a probability measure which we denote by P_N. We put P(A) = P_N(A_N).

The definition is unique, since from

Π_N⁻¹(A_N) = Π_{N+M}⁻¹(A_{N+M})

it follows, because of (1), that

P_{N+M}(A_{N+M}) = P_N(A_N).

Consequently, the definition of P(A) does not depend on the base figuring in the construction of A.

Clearly, the set function P(A) is nonnegative; it is easy to show that it is (finitely) additive. If A ∈ 𝒞, B ∈ 𝒞, AB = O, then, because of

A = Π_N⁻¹(A_N), B = Π_N⁻¹(B_N),

we have A_N B_N = O. Hence

P(A + B) = P_N(A_N + B_N) = P_N(A_N) + P_N(B_N) = P(A) + P(B).

(We made use of the fact that the value of P(A) does not depend on the dimension of the chosen base of A.) It is further clear that P(Ω) = P_N(Ω_N) = 1. It remains to prove that P(A) is not only additive but also σ-additive on 𝒞. By Theorem 3, § 7 of Chapter II it suffices to show that P has the following property:

Property K. If A_n ∈ 𝒞, A_{n+1} ⊆ A_n (n = 1, 2, ...) and ∏_{n=1}^∞ A_n = O, then

lim_{n→+∞} P(A_n) = 0.

We shall show by an indirect proof that Property K is fulfilled. The inequality P(A_n) ≥ P(A_{n+1}), n = 1, 2, ..., is obviously true. Hence lim_{n→+∞} P(A_n) = p exists. Assume p > 0. We show that then D = ∏_{n=1}^∞ A_n cannot be empty.

It can be assumed without restriction of generality that A_n is an n-dimensional cylinder; in fact, if d_n denotes the exact (minimal) number of dimensions of A_n, we have d_n ≤ d_{n+1}. Further, lim d_n = +∞ can be assumed, since in the case of bounded d_n ≤ d our assertion would follow from P_d(A) being

a σ-additive measure. If d_n → +∞, we can replace the sequence {A_n} by another sequence {A′_n}, where A′_n is an n-dimensional cylinder and {A′_n} contains the sequence {A_n}.

Put A_n = Π_n⁻¹(B_n), where B_n is an n-dimensional Borel set. Since P(A_n) = P_n(B_n) ≥ p > 0, we can find in Ω_n a compact set Z_n with Z_n ⊆ B_n such that

P_n(Z_n) > P_n(B_n) − p/2^{n+1}  (n = 1, 2, ...).

Put C_n = Π_n⁻¹(Z_n). C_n is also a Borel cylinder, and

P(C_n) = P_n(Z_n) > P(A_n) − p/2^{n+1}.

Let now D_n = C₁C₂⋯C_n. We have

P(A_n − D_n) = P(A_n(C̄₁ + ⋯ + C̄_n)) ≤ Σ_{k=1}^n P(A_k − C_k) ≤ Σ_{k=1}^n p/2^{k+1} < p/2,

hence

P(D_n) = P(A_n) − P(A_n − D_n) > p/2 > 0.

Thus the set D_n cannot be empty for any value of n. Choose now in D_n a point ω^{(n)} = (ω₁^{(n)}, ω₂^{(n)}, ..., ω_k^{(n)}, ...). Then a sequence {n_j} can be given with

lim_{j→+∞} ω_k^{(n_j)} = ω_k  (k = 1, 2, ...)

(G. Cantor's "diagonal method"). Since all the Z_n are closed, for every n

(ω₁, ..., ω_n) ∈ Z_n;

hence ω = (ω₁, ω₂, ..., ω_n, ...) belongs to D_n for every n, so that ∏_{n=1}^∞ D_n is not empty. Since D_n ⊆ A_n, ∏_{n=1}^∞ A_n cannot be empty either, and thus our assumption leads to a contradiction. Hence we must have p = lim_{n→+∞} P(A_n) = 0, and P is σ-additive on 𝒞; it follows that the extension of P is σ-additive on the σ-algebra 𝒞*. We have proved that [Ω, 𝒞*, P] is a Kolmogorov probability space. Put now

ξ_k(ω) = ω_k  (k = 1, 2, ...) for ω = (ω₁, ..., ω_k, ...).  (4)

Then ξ_k = ξ_k(ω) is a random variable on [Ω, 𝒞*, P], since if U is a Borel set on the real axis, ξ_k⁻¹(U) is a k-dimensional Borel cylinder which thus belongs to 𝒞 and, consequently, to 𝒞*. On the other hand, obviously

P(ξ₁ < x₁, ξ₂ < x₂, ..., ξ_n < x_n) = F_n(x₁, x₂, ..., x_n);  (5)

the n-dimensional distribution function of the random variables ξ₁, ξ₂, ..., ξ_n is thus identical with the function F_n(x₁, x₂, ..., x_n). Herewith Theorem 1 is proved.

Example. Let {G_n(x)}, n = 1, 2, ..., be any sequence of distribution functions. There can be constructed a probability space and on it a sequence of random variables ξ_n (n = 1, 2, ...) in such a manner that the ξ_n are mutually independent and the distribution function of ξ_n is G_n(x). To see this it suffices to note that the functions

F_n(x₁, ..., x_n) = ∏_{k=1}^n G_k(x_k)

fulfil all conditions of Theorem 1.
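The Example also has a constructive side which is useful in practice: independent ξ_n with prescribed distribution functions G_n can be realized from independent uniform variables by inverse transformation. A Python sketch (editorial, not part of the original text; the exponential choice G_n(x) = 1 − e^{−nx} is arbitrary):

    import math, random

    random.seed(5)

    def xi(n):
        # G_n(x) = 1 - exp(-n x), hence G_n^{-1}(u) = -log(1 - u)/n
        return -math.log(1.0 - random.random()) / n

    samples = [[xi(n) for _ in range(100000)] for n in (1, 2, 3)]
    print([sum(s) / len(s) for s in samples])   # close to 1, 1/2, 1/3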

§ 9. Exercises
1. Let there be given in the plane a circle C₁ of radius R with its center in the origin, and a circle C₂ concentric with C₁ having a radius r < R. Let us draw a line d at random which intersects C₁, so that if the equation of d is written in the form

x cos φ + y sin φ = ρ,

then φ and ρ are independent random variables, φ being uniformly distributed in (0, π) and ρ in (−R, +R). Let ξ denote the length of the chord of d inside C₂. Determine the distribution function, expectation, and standard deviation of ξ.

Hint. Let first φ be fixed. Then

P(ξ < x | φ) = 0 for x ≤ 0,
P(ξ < x | φ) = 1 − (1/R)√(r² − x²/4) for 0 < x ≤ 2r,
P(ξ < x | φ) = 1 for 2r < x.

(At the point x = 0 the distribution function has thus a jump of the value 1 − r/R.) This expression being independent of φ, the conditional density function of ξ under the condition ξ > 0 is

f(x) = x/(4r√(r² − x²/4)) for 0 < x < 2r, and f(x) = 0 for x ≤ 0 and x ≥ 2r.

This leads to

E(ξ) = πr²/(2R) and D(ξ) = (r²/(2R))√(32R/(3r) − π²).
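A quick Monte Carlo confirmation of the expectation found in Exercise 1 (editorial, not part of the original text; R = 2 and r = 1 are arbitrary):

    import math, random

    random.seed(6)
    R, r, n = 2.0, 1.0, 400000
    total = 0.0
    for _ in range(n):
        rho = random.uniform(-R, R)     # by symmetry, phi plays no role
        if abs(rho) < r:
            total += 2.0 * math.sqrt(r * r - rho * rho)
    print(total / n, math.pi * r * r / (2 * R))   # both close to pi/4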
2. Let d be a line chosen at random as in Exercise 1. Let B be a convex domain in the circle C₁. Let ξ denote the length of the chord of d inside B. Calculate the expectation of ξ.

Hint. We have E(ξ) = E(E(ξ | φ)), where φ has the same meaning as in Exercise 1. E(ξ | φ) is equal to the integral of the lengths of the chords of the domain B lying in a given direction, divided by 2R; for fixing φ means restricting attention to the chords which form the angle φ + π/2 with the x-axis. Hence E(ξ | φ) = |B|/(2R), |B| being the area of B. We see that E(ξ) = |B|/(2R). It is not necessary to require the convexity of B, nor that it be simply connected.
3. Let there be given in the plane a curve L consisting of a finite number of convex arcs and contained in a circle C of radius R. Choose at random (in the sense explained in Exercise 1) a line d intersecting C. What is the expectation of the number of points of intersection of this line with L?

Hint. Consider first the particular case when L is a segment of length l of a straight line. In this case the number ν of points of intersection is 0 or 1. If φ is the angle between the normal to d and the segment L, the expectation under the condition of fixed φ is E(ν | φ) = l|cos φ|/(2R). This leads to

E(ν) = (1/π) ∫₀^π (l|cos φ|/(2R)) dφ = l/(πR).

From this it follows for polygons, and by a limit procedure for all piecewise convex (or concave) curves L, that E(ν) = |L|/(πR), where |L| is the length of the curve L.
4. Calculate E(ξⁿ) for n = 2, 3, ... under the conditions of Exercise 1.

Hint. We have

E(ξⁿ) = (2ⁿ rⁿ⁺¹/R) ∫₀^{π/2} sin^{n+1} θ dθ.

Note. Exercises 1 to 4 present well-known results of integral geometry¹ from a probabilistic point of view.

¹ Cf. W. Blaschke [1].

5. Establish the law PV = RT for ideal gases on the basis of the kinetic theory of gases. V denotes here the molar volume, P the pressure, T the absolute temperature of the gas; further R = Nk, where N is Avogadro's number and k Boltzmann's constant.



Hint. The pressure of the gas is equal to the expectation of the quantity of motion imparted by the molecules of the gas during unit time to a unit surface of the vessel wall. We assume that the shocks are perfectly elastic. If a molecule of mass m and velocity v strikes the wall in a direction which forms an angle θ with the normal vector of the wall, then the quantity of motion imparted by the molecule will be 2mv cos θ. In order to strike a unit surface K of the wall during a time interval (t, t + 1), the molecule of velocity v moving in a direction which makes an angle θ with the normal vector to the wall has to be included at the time t in an oblique cylinder of (unit) base K and height v cos θ. Under the assumption that the molecules are uniformly distributed in the vessel, the probability of the shock in question is v cos θ/W, where W denotes the volume of the vessel. Hence the expectation of the quantity of motion imparted to the wall by the considered molecule will be 2mv cos θ · (v cos θ/W) = 4e cos² θ/W, where e = mv²/2 is the kinetic energy of the molecule. The quantity 4e cos² θ/W is a random variable; hence we have to calculate its expectation. (Here the relation E(E(ξ | η)) = E(ξ) is to be applied.) If the velocity components are supposed to be independent and to have normal distributions with the density function (1/(σ√(2π))) exp(−x²/(2σ²)), where σ = √(kT/m), then θ and e are independent and the distribution of the direction of the velocity vector is uniform. Hence

E((4e/W) cos² θ) = (4/W) E(e) E(cos² θ).

We know already (Ch. IV, § 17, Exercise 29b) that E(e) = (3/2)kT. Since E(cos² θ) = 1/6 (the molecule can hit the wall only if 0 ≤ θ < π/2; for other directions the contribution is zero), we find

E(4e cos² θ/W) = kT/W

for the expectation of the "pressure" exerted upon the wall by one molecule. Since there are N molecules in a gram molecule of gas, we find for n gram molecules, because of the additivity of the expectation, the value

P = nNkT/W = NkT/V = RT/V,

where V = W/n is the molar volume and R = Nk the ideal gas constant.
6. Let C2 , . . ., i n be independent random variables uniformly distributed in the
interval (0, 1). Let them be arranged into an increasing sequence and let the k-th
element of this sequence be denoted by Z*.
a) Show that the conditional density function of £,*, £*, . . ., s* with respect to
the condition = c is given by
k\
—y for 0 < Xi < x2 < ■• • < xk < c,
f/c(xs, хг, . . . , xk I ££+1 = c) = c
0 otherwise.
V, § 9] E X E R C IS E S 293

b) Show that under the condition i £ +1 = c the random vectors ( £ * , . . £ * ) and


(£*+ 2,. .., {J) are independent.

7. Let {lt £2, . . f„, . . . be independent random variables. Consider the sums
= £i + £* + . . . + €„ • Show that under the condition = x the random variables
t k and f, are independent for к < n < l .
8. Let the random vector ({, rj) have the normal density function

J A C - В2 Г 1
h(x, у) = - У .— -------- exp — —- (Ax2 + 2Bxy + Cy-) .
2л 2
Prove the following relations:
В
W , rj) = ------ -- - ,
у / AC

E (V I I) = - 4 Í,

I 4) = - A »i,

*„(í) = ('/) = !*(f, ч) I= ~ ~ = ■


y j AC
9. If the random vector (£, rj) has a nondegenerate normal distribution, show that

?>(£. V) — —p = = = and Ш , » 0 = 1 r I.
V 1 - r*
where r = R(c, rj) is the correlation coefficient, <p({, rj) the contingency, and i/<({, rj)
the maximal correlation of the random variables £ and q.
10. If the functions a(x) and b(x) are strictly monotone, then
<p(a(í), b(qj) = <p(í, rj) .
11. If ({, rj) is uniformly distributed in a circle, фЦ, rj) = —.

12. If { and q are the indicators of the events A and B, i.e. if

£(ty) = j 1 for m 6 d,
[ 0 otherwise,
Г 1 for со € Ő,
tl(co) — Í
I 0 otherwise,
then
Ш rj) = r ( $ , rj) = K\(rj) = K j ß ) = R\!t, n) =
[P(AB) - P(A) P(B)f
~ P(A) [1 - P(A)i P(B) [1 - P(B)\ ’
provided that 0 < P(A) < 1 and 0 < P(B) < 1.
294 M O R E A B O U T R A N D O M V A R IA B L E S [V, § 9

13. Prove the following variant of Bayes’ theorem: Let i be a random variable
with an absolutely continuous distribution with the density function / ( a) and let rj
be a discrete random variable. Let y k (k = 1 , 2 , . . . ) denote the possible values of
rj and p k(x) the conditional probability Pip = yk | ( = x). Let f k(x) be the conditional
density function of £ given ij = yk. We have

fk(x) = -+3----------------- •
J /»*(0/(0*
Hint. By definition
J Pk(x)f(x) dx = P ( H A, = yk),
A

hence

p it < x n - V ) dl
P(S<x \T,= yk) = Х’п Ук) = ^ ----- -------------- .
p(v = Ук) +c . . „ . .
J PÁOfOdt

14. Suppose that the probability of an event Л is a random variable c with density
function p(x) 0 (a) = 0 for a < 0 and 1 < a ) . Perform n independent experiments
for which the value P(A) = £ is constant and denote by r]„ the number of the experi­
ments in which A occurred. Let p„k(x) be the conditional (a posteriori) density function
of { with respect to the condition т]„ = к (к = 0 , 1 , 2 , . . . и); according to the preceding
exercise
, . A *(l - X f - kp(x)
P„k(x) = -J------------------------- .
Jr*(l - t) " -hp(t)d t
о
a) Show that if { has a beta distribution of order (r, s), then c has under condition
rj„= к a beta distribution of order (к + r, n — к + s).
b) If p(x) is continuous and positive on (0, 1) and if / is a constant (0 < / < 1),
then

hm J --------------- p„JM / + . V J -------------- = — - = e


« - +- V n \ \ n ) 2л

15. Let Í be a random variable and let «i........ in- be random variables which
are for every fixed value of C independent and have a normal distribution with
expectation f and standard deviation a (a > 0 is a constant). Let p(x) be the density
function of £ . Study the conditional density function p„(x |y) of Í under the con­
dition
fi + f* + • ■• +£■ _
n
and show that if p(x) is positive and continuous, we have, for fixed x and y,

Pn (У + y\
I Jn J 1
l i m ------------- 4 _--------- = g 2o! .
»-+» 7 ” J 2710
V , § 9) E X E R C IS E S 295

16. Let f be a random variable with an exponential distribution. For every given
value of C , let f t) { 2 , . . . , be independent normally distributed random variables
with expectation f and standard deviation a > 0 . Determine the conditional distri­
bution of f with respect to the condition
£t + ^2 + • • ■+ _
n
17. Let fi be a random variable having the density function pit). Let for every given
value of p the random variables i u . . . , be independent, normally distributed,
with expectation p and standard deviation a > 0. Show that . ■. ,c„ are exchangeable
(cf. Ch. IV, § 17, Exercise 18).
18. Let {дг be independent random variables having the same distribution
and finite variance. Put t)„ = f t + { , + . . . + i„ (n = 1, 2, , N) . Calculate the
correlation ratio Knn (rjN) (n < TV).
19. Let £„ . . ., i N be independent random variables with the same distribution.
Put i;,, = + . . . + {„ (л = Í, 2 , . . . , TV). Calculate the contingency
Ч>(Л«, Vm) (n < m < TV).
20. Let the random variables , £2, . . {„ be independent and uniformly distributed
in the interval (0, 1). Let i* denote the k -th order statistic of the sample . ..,
(See Exercise 17 of Ch. IV.) Compute Kf* (£*) and <p(Z*, if ) for k < l < n.
21. Suppose that the probability p of an event A is a random variable on a con­
ditional probability space. Let g(t) — ^-----— (0 < t < 1) be its density function.
Let p be constant during the course of n independent experiments and let r/„ denote
the number of those experiments in which the event A occurred. Calculate the
a posteriori density function and the conditional expectation of the random variable
p with respect to the condition = k ( 0 < k < ri).
"■a
Hint. According to Bayes’ theorem the a posteriori density function of p with
respect to the condition r/„ = k is
, , P *-'(l - p y - k~l
ffk(p) = i --------------------- ;
j - t)n- k-'d t
0
the “a posteriori distribution” of p is thus a beta distribution of order (k, n — k)
and the conditional expectation is — .
n
22. Let i be a random variable with Poisson distribution and expectation A, where
A is a random variable with a logarithmically uniform distribution on (0 , + oo).
Calculate the a posteriori density function and the conditional expectation of A with
respect to the condition c = n > 1 .
Hint. Bayes’ theorem gives for the conditional density function of A with respect
to the condition i = n:
„ A"-l e~x
9,,(/) ~ ö T T i j r ;
the a posteriori distribution of A is thus a gamma distribution of order n.
296 M ORE ABOUT RANDOM V A R IA B L E S [V , § 9

23. Let the random variable £ have a normal distribution N(p, a), where p is a
random variable uniformly distributed on the whole real axis. Determine the a posteriori
density function and the expectation of p with respect to the condition Í — a.
Hint. According to Bayes’ theorem
„ 1 (.ft - a)2
9(P I Í = a) = — — e x p ------ — — .
у Тло L 2a

The a posteriori distribution of p is thus a normal distribution with expectation a


and standard deviation a.
24. Let be independent random variables having the same normal
distribution N{m, a). Put = . I° 1 . Determine the conditional
n
distribution of ({„ . . ., {„) for a given value of £„ .
Hint. By assumption, the л-dimensional density function of the £k(k = 1, 2, . . . . и) is

1 1 «
/C*i, • • •, x„) = -— exp - — V (xk - m f .
( а ,/2 л )" 2a-
Put
1 П
x = n- kt= 1 x*•
We have
n n
у (*k - m f = У (xk — x f + n(x - m f,
k= 1 k= 1
hence
1 In n(x — to)2] 1
/ ( a„ . . *„) = J — e x p -------- — ---- • - — -= r~—--- = X
a V 2я 2a- J ( в ^ 2 я ) " - ‘ /л
1 П _
X exp - —- У (a* - a) 2 .
*=1
The density function of C„ is
_ 1 j n nix — to) 2
S(5) = 7 V 'УГ exp [ ------- 2 a2 ;
hence the conditional density function of the random vector (£t, . . . , c,) for £„ = x is

This function does not depend on m; a property which is expressed by saying that
C„ is a sufficient statistic for the parameter to.
25. a) Let there be given n independent random variables . ■., {„ with the same
normal distribution N(0, a). Put

C = - £ f* and T = У (f* - £f.


n k=l k= 1
Show that C and r are independent.
V. § 91 E X E R C IS E S 297

Hint. Let (cik) O', к = 1, 2......... и) be an orthogonal matrix with c lk = -— —

(k = 1, 2, rí). Put

П
Vi = X С» ffc C /= 1, 2, . . л).
k = \

Then = /я C and

É
1=1
= *=1
É Й = 4? + £=
Z 1 (f* “ o 2.
hence

т = 1=2
É »?*•
We know (cf. Ch. IV, § 17, Exercise 43) that r)x, . . . , r ] „ are independent normally
distributed random variables with expectation 0 and standard deviation cr; hence

f = ~ = and г = У jif
V" ,=2
are independent; т has a ^-distribution with (я — 1 ) degrees of freedom.
b) Let • • •, Í« be independent random variables with the same normal distribution.
Let the expectation /< and the standard deviation a of the be independent random
variables on a conditional probability space, p being uniformly distributed on the
whole real axis and a logarithmically uniformly distributed in the interval (0 , + oo).
Put
f. + $2 + ■ ■ ■ + , V It
Í = -------------------------- and r = > (I* - 0 ‘-
n <t=i
Determine the a posteriori distribution of /г and cr2 under condition f = x and r = z.
Show that given these conditions cr and —---- — are independent.
(7

Hint. The density function of the vector (p, a2) with respect to the condition
f = x, r = z is, according to Bayes’ theorem and the result of Exercise 25 a)
•2 ^ 1 z n(x - fl)2'
i---- z - e x p ------- - e x p ------------- —
n 2 <r2 _ 2a2
V 2^ ’ . --L t i i ’
2 2 cri+,r | — - —
л; — (j,
thus a and --------- are independent,
(7 26

26. Let there be given a sequence of pairwise independent events A„(n = 1, 2,. .. )
with P(A„) > a > 0 {n — 1 , 2 , . . . ) and an arbitrary random variable r) of finite
variance. Show that
lim E( 4 I An) = О Д . (1)
n —► + CO

If 2? is the indicator of the event В (O < P(B) < l) , it follows from (1) that
lim P(B I A„) = P(B). (2)
П —*• -j- со
298 MORE ABOUT RANDO M V A R IA B L E S (V , § 9

Hint. Let f„ be the indicator of the event A. We have

P(A„) [E(r, I A„) - E(r/)J-

h e n c e , by T h e o re m 3 o f § 7,

00 P(A ^
Z i Р"Л л № I 4 ) - £ ( # < ö ! (>;).
/1=1 A ^\Лп)
Thus

lim
»-»+» ,—
1
P(íДíЛ7T
) =
w h ic h p r o v e s (1).
Remark. C f. C h . VII, § 10, T h e o re m 1.
27. Let a sequence of pairwise independent events A„{n = 1, 2, . . . ) be given and
assume

Z Л Л ) - +°°-
/1=1

Let

В = lim s u p
n —>- -J- OO
A„ = П(
CO

/1 = 1
zCO

& = /l
A *)

denote the event that infinitely many of the events A„ occur simultaneously. Show
that P(Ii) = 1 .
Hint. Let C be any event with 0 < P(C) < 1. Like in Exercise 26, it follows from
Theorem 3, § 7 that
<*> p(A ^
Z , J J s lp(c I A„) - P(QV < P ( Q [1 - P(C)1.
/1=1 1 *\л п)
CO

Since ^ P(An) diverges, clearly


/»=1
lim inf [P(C I A„) - P(C)f = 0. (1)
П-+- +00
CO

Apply (1) to C = Ck = ^ A„. Obviously, P(Ck) > 0. It follows from (1), in view of
n —k
CO

P(Ck I An) = 1 for n > k, that P(Ck) = 1; hence P(Ck) = 0. Since В = | ”[ Ck> we
k=1
— °o
have B = Y , C k a n d h e n c e P(B) = 0 , w h ic h is e q u iv a le n t t o P(B) = 1 .
Remark. The assertion of Exercise 27 is a sharper form of the Borel-Cantelli
lemma (cf. Ch. VII, § 5).

28. Let c and ?; be arbitrary random variables, f(x) and g(x) Borel-measurable
functions such that
E ( f ( 0 ) = E(g(r1)) = 0, D ( m ) = D(g(rj)) = 1
V, s 9] EXERCISES 299

and
* (/« ). g (v )) = E {№ g ( r j) ) = ш , V) .

or to put it otherwise, suppose that Jl(u(Q, v(r/)) assumes its maximal value for и = f
and V = g. Then the following equations hold with probability 1, where X = ф (£, rj):
E ( m ) 1v ) = Xg(v) (1)
and
E(g{v) I 0 = Ш ) , (2)
hence also
E(E(M) \ v ) \ t ) = Ш ) (3)
and
E(E(g(rj) \ i ) \ v ) = WgRi) ■ (4)
Hint. We have
H í , rj) = E (№ )g(.v)) = E(E(M)g(?j) I 0 ) = E (№ E (g (ri)\0 ) ,
hence according to Schwarz’ inequality

H(i, v) < E(EHg(r,)\0) = D-.

On the other hand, if /*(£) = fulfils £ (/* ({)) = 0 and ö (/* (£ )) = 1, then

*(/*«). *fo)) = E ( f * ( i ) g ( r j ) ) < H i , V) ■


But as

E ( n c m ) = E(g^ E< *w o) = £(p w i a ) = D j

we conclude that Ű- < ф'ЧЛ, rj). Hence Ö 2 = ф'ЧЛ, rj). Since in Schwarz’ inequality
equality holds only in the case of proportionality, we must have E^gjrj) | | ) = A/({)
which proves (2). But
E ( f ( Q E ( g ( r j) \ij) = H i,V ) -
On the other hand, by (2)

E (m E (g (v ) ! £)) = XE (/*({)) = X ,
hence X = H i , V)- Equation (1) is proved in a similar way.
29. With the notations of Exercise 28 we have
E ( № I g(ri)) = Xg(rj)
and

E ( g ( r j) |/ ( 0 ) = X f( i) .

Hence the regression curves of i* = /{(,) with respect to rj* = g(rj) as well as that
of rj* with respect to l* are straight lines (or, as it is expressed, the regression of {*
and rj* is linear).
300 M ORE ABOUT RANDO M V A R IA B L E S IV, § 9

Hint. The proof is similar to that of Exercise 28.


30. Let L\ be the set of all random variables/(£ ) such that f(x) is a Borel-measurable
function with £ ( / ( i ) ) = 0 and E ( / 4 0 ) is finite. If we put

(№,MO) E(A(0M0),=
Ll is a Hilbert space. Further we define АЛО, for /( { ) € Ll, by
А Л О = е ( е (ЛО\чЖ ) .
Show that АЛО = A ( 0 belongs also to L\ and the linear transformation АЛО of
the space L'l is positive and symmetric, i.e. it fulfils the relations

{ A f { 0 J ( 0 ) > o and ( A f ( 0 , 8 ( 0 ) = ( Л 0 , Ag(0).


C H A P T E R VI

CHARACTERISTIC FUNCTIONS

§ 1. Random variables with complex values

Characteristic functions are useful analytic tools of probability theory,


especially for proving limit theorems. This Chapter presents the definition
and the properties of characteristic functions; the following two chapters
will deal with the limit theorems themselves.
The characteristic function of a random variable £ is defined as the
expectation of the complex valued random variable é il. Thus we have
to study complex landom variables first; we shall see how theorems on
real random variables can be extended to complex random variables.
If £ and r\ are real random variables, we say that the quantity £ = £ + irj
is a complex random variable. The distribution of £ can be characterized
by the joint distribution of £ and rj.
We define the expectation of £ = £ + Irj by

£(£) = ßJ £dP, (1)


which implies
£(£) = £(£)-M £(i,). (2 )

The random variables £x = £x + it], and £ 2 = £ 2 + Щг are said to be


independent if the two-dimensional random vectors (£b í/j) and (£2, >/2)
are independent. The independence of several complex random variables
is defined in a similar way.
If Ci, £2 »• • •, Cn are independent complex random variables and if
the expectations £(£*) (к = 1 , 2 , . . . , « ) exist, one can see at once that

Я(ПС*)
k=l
=k П
=1 £(£*). (3)
If A (x) = a(x) + ib(x) is a complex valued Borel function of the real
variable x and £ is a real random variable, further if the expectation of
£ = T(£) exists, then the latter can be calculated by

£(£) = T a (x ) dF(x), (4)


—OO
302 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 2

where F(x) is the distribution function of S,. In fact, according to Exercise


47, § 17 of Chapter IV,
+ 00 +00

£ (0 = —JCO a(x) dF(x) + i J —00


b(x) dF{x).

It is easy to prove that for every random variable with complex values

|£(0 I<AKI). (5 )
§ 2. Characteristic functions and their basic properties

W e define the characteristic function o f a random variable £ as the


expectation o f e'ir; thus it is a function o f the real variable t. It is denoted
by <pä(i). Thus by definition

9>{(0 = £(*'<'). (l)

According to Formula (4) of § 1

9>i(0 = J
— 00
e‘xl dF{x) (2 )

where F(x) is the distribution function of hence cpft) is the Fourier-


Stieltjes transform o f F(x). If the distribution of f is discrete and £ assumes
the values x k (k = 1 , 2 , . . . ) with probabilities pk (k = 1 , 2 , . . .), then (p^t)
can be written in the form

^ ) = 1 A «"". (3)
/c=1

If the distribution function of £ is absolutely continuous with the density


function f(x ) = F'(x), we have
+ 00

<Pi(0 = j eUxf( x ) dx. (4)


—00
Hence is the Fourier transform o f f(x).
Thus we see that the characteristic function of an arbitrary random
variable depends only on its distribution; characteristic functions of random
variables with the same distribution are identical. The function defined
by (2) can thus be called the characteristic function of F(x) (as well as the
characteristic function of a random variable with distribution function
V I, § 2] B A S IC P R O P E R T IE S 303

First of all, let it be noted that every distribution function has a charac­
teristic function since the Stieltjes integral (2) exists always, in view of
I eixl I = 1 .
If £ assumes positive integer values only, with
Р(Л — к) —Pk (k = 0 , 1 , . . . ) ,
we see that
<Pt(0 = E Pk e m =
k=0

where

<^0 ) = E Pkzk (И < 1)


k = 0

is the generating function of discussed already in Chapter III, § 15.


In this case the characteristic function is therefore equal to the generating
function on the boundary of the unit circle. In the general case when £
may take on other than positive integral values, the generating function
is not defined; the characteristic function, however, exists for every random
variable.
We shall now prove some elementary theorems concerning the charac­
teristic functions of probability distributions.

T heorem 1. We have always \ q>t ( t ) | < 1; equality holds for t — 0.

P roof . Since | ei£,t | = 1 and because of Formula (5) of § 1 we have

1 ^ (0 1 <£2 ? ( к '< '|) = 1 .


Further 9?,(0 ) = E(e°) = 1.

T heorem 2. The function <pft) is uniformly continuous on the whole real


axis —oo < t < + oo .

P roof . Let e > 0 be given. Choose a X > 0 such that

A I <ÜI > Л) < - j - .

If we denote by A x the event | £ | > A, we evidently have


= E (e* i Ax) P(AJ + E ( é « 1Ä x) P(ÄX). (5)
Since IE(eiil \ A x) | < 1, we conclude that

IM O - £ ( ^ ‘ IЛ ) Р Ш I ^ A M < ~ • (6 )
304 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 2

Consequently,

I <Pf?2) - <pftf I < E( I««'■ - I I Äx) + -23£ . (7)


From
ь
I e ib — e'a \ = \ i \ e,z dz \ < b — a for о< 6 (8 )
a

follows

I - e'« '1 1 < - y for I £ 1 < A and | *2 - *i I < = <5,


hence ,■ i
£ ( [ei<,! - e'í(l I Л;) < - £- for 112 - tx \ < Ö. (9)

From (7) and (9) we conclude that


I <Pfh) - f f h ) I < e for I t.z - íj I < <5:
where <5 > 0 depends only on e. This proves Theorem 2.

T heorem 3. I f a and b are constants and if r] = a£ + b, then


(p ft) = eibt <pt(at).
P roof.

(/ f t ) = E(ei(a-+b>l) = eibt E{eiial).

T h e o r e m 4 . I f f , t2, .. . , t nare arbitrary real numbers and ,z„


arbitrary complex numbers, further if у f t ) is the characteristic function o f
a random variable £ and if z = x — iy is the conjugate o f the complex
number z = x + iy, t/ien we /гаге

Y .Y . v f d h - t k) z hzk > 0 . (1 0 )
A = lfc = l

Remark. Functions satisfying (10) aie said to be positive definite. A remark­


able theorem of Bochner says that every positive definite function cp{t)
for which 0 ) = 1 is the characteristic function of a probability distri­
bution. We shall not give the proof of this theorem.

Proof of T heorem 4. We have

Z Z vf fh - tk) Zh Zk = E( I Z eUk« zk I2 ).
A = 1 fc=l k=1
V I, § 21 B A S IC P R O P E R T IE S 305

T heorem 5. For every real t, < p f - t ) - <pt ( t ) . In particular, if the


distribution function o f £ is symmetric with respect to the origin, у f t ) is
a real even function o f t.
P r o o f . Let £ be a random variable with complex values; then E(() = £ (£ ).
This leads to
( p f - t ) = E (e~ ^) = E (S< ).

If the distribution function of £ is symmetric, i.e. if £ and —£ have the


same distribution, their characteristic functions are identical; we have thus

< p ft) = < p _ f t ) = cpi (-г) = 9^ (?).


Consequently, <p ft) is real and, since p f t ) = p f —t), p f t ) is an even
function.

T h e o r e m 6 . I f £b £ , , . . . , £„ are mutually independent random variables,

the characteristic function o f their sum is equal to the product o f the charac­
teristic functions o f the individual terms:
П
Pc.+<. *...+«» ( 0 = n<Pi*( 0 -
k=l

P ro o f . This follows from Formula (3) of § 1.


Remarks:
1. Theorem 6 expresses a property of the characteristic functions which
exhibits their successful applicability to probability theory. Indeed, the
distribution of a sum of independent random variables is the convolution
of the distribution functions of the individual terms; the calculation of this
convolution is in most cases rather complicated. On the contrary, Theorem
6 allows a very simple calculation of the characteristic function of a sum
of independent random variables from the characteristic functions of its
terms, as it is just their product. Further, as we shall see in § 4, from the
properties of the characteristic function the properties of the corresponding
distribution function can be deduced.
2. The converse of Theorem 6 does not hold. From

Vst+dO = V sX O vdO
the independence of and £ 2 does not follow. Let for instance be
ív = £2 = £> where £ has a Cauchy distribution: 95f t ) = e~w
306 C H A R A C T E R IS T IC F U N C T IO N S [V I, § 2

(cf. Example 4 of § 3). According to Theorem 3 we have thus

? « .+ « ,( 0 = Ы 0 = e ” 21' 1 = <ptl(t) <Pd0 .


though if is obviously not independent of itself.
T heorem 7. I f the first n moments E(ff) = M k (к — 1 , 2 , , n) exist
/o r the random variable <f, t/геи the characteristic function <pft) is n times
differentiable and
cpf{Q) = ik M k {k = \,2 ,...,n ). (11)
P roof. Let F(x) be the distribution function of if. If

T
—oo
exists, the integral
j xeixt dF(x)
—00
converges uniformly in t. Hence

<p'ft) = j ixe,xt dF(x);


—oo
in particular
<P'i (0 ) = iM v (12)
By iterating the operation we obtain
+ oo
<pf(t) = ik j x k eix‘ dF(x) (7c = 1 ,2 ,...,« ) ; (13)
—00

from here ( 1 1 ) follows by putting t — 0 .

T heorem 8 . Let the distribution function o f the random variable if be


absolutely continuous. I f the density function f(x) o f if is к times differentiable
(к — 0 , 1 , . . .) and if
+ 00
C j = $ l / 0) WI dx
—oo

exists fo r j — 1 , 2 , , k, we have1

lim 11 1* I <pft) I = 0. (14)


If I—+ 00

1 It suffices to assume the finiteness of Ck; this implies thefmiteness of C, , . . . , C*_t.


Cf. S. Bochner and K. Chandrasekharan [1], p. 29.
V I, § 2 ] B A S IC P R O P E R T IE S 307

Proof. If we perform к times an integration by parts on

<?>«(0 = A x ) eix,dx (15)


—oo

and consider that by our assumption lim f i ’\ x ) = 0 for j = 1, 2 , .. . , к — 1,


|л:|-*-оо
we obtain
к + °°

v «(0 = ( y J f (k) ( X ) e>xt dx. (16)


—00
From (16) it follows that

l* « (° l* 1 7 p (17)
Since by assumption | / (<r)(x) | is integrable on ( —oo, + oo), (14) follows
from (16) by Riemann’s lemma concerning the Fourier integral. 1
Inequality (17) is obviously of interest for the study of the behaviour
of cpft) for large values of 1 1 1 .
Remark. According to Theorem 7, the “smoothness” (differentiability)
of cp^t) is determined by the behaviour of/(x)for |x| -*■ + oo ; by Theorem 8
the “ smoothness” of f i x ) determines the behaviour of for |i| -> oo .
The two theorems are therefore in a certain sense dual.

T heorem 9. I f the first n moments o f £, M k = E (tf) (k = 1, 2, . . . , « )


exist, we have (with M 0 = 1), fo r t -* 0 that
" Mu (it)k
4>t(0 = I , ’ - + о(П . (18)
*=o K-
Proof. This follows immediately from Theorem 7.

T heorem 10. I f all the moments M k = E(^k) (к = 1 , 2 , . . . , « ) o f the


random variable £ exist and if

К™™р \/17 р - Т (,9)


is finite, then the domain o f definition o f <fft) can be extended to complex
t-values. We have, for 11 \ < R,
05 M (it)"
№ = E - ^ r - ; (2 0 )
я=0 П■
1 Cf. G. H. Hardy and W. W. Rogosinski [1], p. 23.
308 C H A R A C T E R IS T IC F U N C T IO N S [V I, § 2

<pft) is even a holomorphic function in the whole band \ v\ < R o f the complex
plane t - и + iv.

P roof. If the assumptions of the theorem are fulfilled, cp^t) is, because
of ( 1 1 ), arbitrarily often differentiable at the point t = 0 and we have
9 )(^( 0 ) = i"M„. From this (20) follows immediately.
Because of (13) for every real t0 and every n

\cpfn\ t 0) \ < M 2n. (2 1 )

Hence, according to Schwarz’s inequality,


+ со

I q f " +1> (t0) I < J I X I2" * 1 dF(x) < j M 2n M.ln +2 < + +* . (2 2 )


— CO

We obtain from (19), (21) and (22) for every real t0

"I 1<Р(л)('о)1 „ 1
,lmSUP V ------ n ' ------ - R - (23)

Hence q f t ) is regular in every circle \t — /„ | < R which leads already


to our assertion.

Remark. It follows from Theorem 10 that the function y ^ t) is uniquely


determined by the sequence M„ (n = 1, 2, . . .), whenever (19) holds.
In fact, by (20) <^(0 is determined by the sequence {M„} in the circle
I/ 1 < R, hence the value of <pAt) for every real t can be determined by
analytic continuation. We shall see in § 4 that a distribution function is
uniquely determined by its characteristic function. Hence if (19) is fulfilled,
the distribution of ^ is uniquely determined by the sequence of moments

M n = Elf") (и = 1 , 2 , . . . ) .

The question, whether the moments M„ — E (fn) do or do not determine


uniquely the distribution function F(x) of £, is called the Stieltjes moment
problem. In general, F(x) is not uniquely determined by the sequence of
the moments.

D efinition . The random variable ^ has a lattice distribution with span


d, if it takes on only values of the form dk + г {к = 0 , ± 1 , + 2 , . . . ) ,
where d > 0 and r are real constants.
VI, § 2] B A S IC P R O P E R T IE S 309

T heorem 11. I f £ has a lattice distribution with span d, then

l “ I
I
cpt —- ^ ! = 1 for n = 0 , ± 1, ± 2 ,...;

if C does not have a lattice distribution, we have j^ tO | < 1 for every t Ф 0 .

P ro o f . If all values of £ , are of the form dk + r and if P (f = dk + r) =pk


if = 0 , ± 1 , + 2 , . . . ) we have, for any integer n,

2 nn \ i?
Vi —7 - = I Pk= 1-
4 ^ / к — — oo

Conversely, if for a t0 # 0 we have cpffo) = e'a with real a, we conclude


+00 +00
j' e ^ - ^ d F i x ) = 1 = ( i/F(x),
—00 —00
hence

I [ 1 —cos it0x — a)] dF(x) = 0.


—oc

. . . 2 kn a
Since 1 — cos (fox — a) is positive except for x = -------- 1-----(k = 0,
to t0
+ 1,. . .) (for which values it is equal to 0), all jumps of F(x) must therefore
2л a
belong to the arithmetic progression dk + r with d = ---- and r = —■.
tо t0

T h e o r e m 12. I f the distribution o f £ is the mixture o f the distributions of


the random variables with weights pk (k = 1 , 2 , . . .), then

vdO = Z Pk VtfO-
к

P r o o f . Let F(x) be the distribution function of £, Fk(x) that of £,k. We

know that
Fix ) = I Pk Fk (x)-
к

From this Theorem 12 follows immediately.


Remark. The characteristic function may be considered as an operator
which assigns to the distribution function F(x) the function cp(t). Then
Theorem 12 expresses the fact that this operator is linear.
310 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 3

§ 3. Characteristic functions of some important distributions

We determine now explicitly the characteristic functions of some distii-


butions.
Example 1. The characteristic function o f the normal distribution.
Let { be a normally distributed random variable with E(f) — 0,
D (0 = 1. Then
+ 00
1 Г ’ t—*■- 1 _ t2 f _ z*
¥>*(*) = — = \ e ,x * dx = - j = e 2 \ e 2 dz9
y jln J у / 2n J
—oo L

where L is the horizontal line z = x — it ( — 0 0 < x < + 0 0 ) of the


_
complex plane; e 2 is an entire function, its integral is thus zero along
any closed curve and in particular along the quadrangle Rx with the vertices
—x — it, x — it, x, —x. The relation
X __ z 1 _ X2 |/ | Ы2

j f e 2 dz j < e 2 j e 2 du (1 )
x—it 0
implies
lim I j e 2 dz | = 0 . (2 )
( —►GO X—it
Hence
+ 00
1 Г -- 1 г - -
=
sJ 2 k j
) e 2 dz =
y /2 n J e 2 dx
5=
L —00

and consequently

9 >«(0 = e 2. (4)
Q— m
If the random variable ^ is N(m, a) the random variable <f = -------- -
a
_ i•
is N (0, 1) and ^ = crii;' + m. From cp^,(t) = e 2 and from Theorem 3
in § 2 follows
imt----- 5 - ,
(t) = e . (5)

Example 2. The characteristic function o f the exponential distribution.


Let ^ be a random variable with an exponential distribution and of
V I, § 3] S O M E IM P O R T A N T C H A R A C T E R IS T IC F U N C T IO N S 311

expectation — . The density function is thus ke~Xx for x > 0 and


A

00

cpt(t) = X J dx = - - - - - — — . (6 )

0 1 ~ T
From this it follows immediately by Theorem of § 2 that a random 6
£
variable £k having a Г-distribution of order к and expectation — has
A
characteristic function

n ,( 0 ~ F - W •

Example 3. The characteristic function o f the uniform distribution.


Let £ be a random variable uniformly distributed on the interval
( —A, +A); then
+A
1 C ixt , Sin At
^ = J2Ä
-A

Example 4. The characteristic function o f the Cauchy distribution.


Let ^ be a random variable with a Cauchy distribution. Then
+°°
m =~\ -r^—TdxЛ J l + X
= e-M (7)
—oo

(The integral can be evaluated by the method of residues.)


It should be noted that cp^t) is not differentiable at the point t = 0.
According to Theorem 7 of § 2 this is linked with the fact that £ does not
have an expectation.

Example 5. The characteristic function o f Pearson’s y2-distribution.


If Xn is a random variable having a ^-distribution with n degrees of
freedom, we can write

j £ - i &
k =1

where %l9 f 2, • • • , are normally distributed independent random


312 C H A R A C T E R IS T IC F U N C T I O N S (V I, § 4

variables for which E(tk) = 0 and D(£k) = 1. It is easy to see that


_ 00
/л / 2 Г - x' (1-2/0 , 1
= J — e d x = —r — .
V n J _ 2 it
о
(One has to take that branch of the square root which is equal to 1 for
t = 0.) Hence Theorem 6 of § 2 leads to

9 ^ (0 = (8 )
(1 —l i t ) 2
Example 6. The characteristic function o f the binomial distribution.
Let (J be a random variable having a binom ial distribution o f order n
with param eter p \ according to § 15 o f Chapter III

n (t) = GJe“) = [1 + p(e“ - 1 )]".

§ 4. Some fundamental theorems on characteristic functions

In this paragraph properties of characteristic functions will be discussed


which are essential for the proof of limit distribution theorems of proba­
bility theory.

T heorem la . I f (pit) is the characteristic function o f the distribution


function F(x) and i f a and b are continuity points o f F(x) (a < b), then
+ 00
1 Cl e~ita — e~itb eita — eith \
F W - П а ) - — J ( t W - - - - - - - - ---------- - - - - - - - - - - I A. (1 )
—00

T heorem lb . Every distribution function is uniquely determined by its


characteristic function.
Theorem lb follows immediately from Theorem la; in fact if <p(t) is
known, (1) gives the increment of F(x) on every interval the endpoints
of which are points of continuity of F(x). The set of discontinuity points
of F(x) being denumerable, a may tend to —oo through a sequence con­
sisting only of continuity points of F(x); hence (1) gives the value of F(b)
at every point of continuity b. As F(x) is by definition leftcontinuous, the
values of F(x) at a point of discontinuity can be obtained by letting b tend
from the left to such a point.
V I, § 4 ] SO M E F U N D A M E N T A L TH EO R EM S 313

Since the unicity Theorem lb follows from the inversion Formula (1),
it suffices to prove the latter. Before beginning the proof we have to make
first some remarks. It was pointed out in § 2 that q>(-t) = q>(t). Thus if
Re {2 } denotes the real part of the complex number z, ( 1 ) can be rewritten
in the form
1
F{b) - F(a) = — Jf Re I9 > (0 e- “a - e-i'b
-------- ------ dt. (2 )
—00

e~Ua- e - i,b
The real parts of w(t) and o f -------------- are even functions, while their
it
imaginary parts are odd functions. Therefore the same holds for
e ~i,a _ e~i,b
m = ? (o — г— (3)
и

as well. Consequently, 'P (-t) = 'F(i)- If Im {z} denotes the imaginary


part of z, we have
+r
—- j Im {^(O} dt = 0 for every T > 0. (4)
2л J
-T
hence by (2 )
+T
1 r e- ‘<“ _ е- “ь
F(b) - F(a) = lim — <f(t)------- -------dt. (5)
Г -0 0 2л J it
-T

In many textbooks the inversion formula is given in the form (5).


Nevertheless, while the improper integral (1) always exists, the same cannot
be stated regarding the integral

1 /• e -i“>_ e~“b
2Í J ------ J,------ *■ <6)
—00

But if this integral exists, its value is by Formula (5) equal to F(b) — F(a).
For the proof of Formula (2), we need two simple lemmas.

L emma 1. Put

s c .o - 2 - [ (7)
n J t
314 C H A R A C T E R IS T IC F U N C T I O N S [V I, §4

For every real a and for every positive T we have


IS(cc, T ) \< 2 . (8 )
Furthermore
oo + 1 fo r a > 0 ,
lim S(a, T ) = —
Г - + 00 n
Г —int at dt=*
.1
0 for a = 0 , (9)
о —1 fo r a < 0 ;

and the convergence is uniform for | a | > <5 > 0, where 6 is an arbitrarily
small positive number.
P r o o f . If we put
■X

__ 2 f sin и
S(x) = — ------- du,
n J и
0
we have
S(a, T) = S(aT). (10)
Put
(W+l)7r
(‘ sin и
2
cn — — ------ du,
n J и
пк
then we have
It

cn = ( - i y — [ - ^ - d u (и = 0 , 1 ,2 , .. .) ; (11)
л J nn + и
0

the numbers c„ have alternating signs, their absolute value decreases,


hence the series £ c„ is convergent. From

sin и
Six) — У ck H----- ----------- du for пк < X < (n + l ) 7t (12)
к=о 71 J u
ПК

it follows that for even values of n

Y; ck < S(x) <. Y ck for И7Г < A < (и + 1) я, (13)


k =0 k =0

and for odd values of n

Y ck < S(x) < Y ck for nn < X < (n + 1) л. (14)


k =0 k=0
V I, § 4 ] SOM E FU N D A M EN TA L TH EO REM S 315

Hence in every case

0 < 5(л) < c0 < 2 for x > 0 . (15)

Since S ( —x) = —S(x), we have for every real x

№ ) |< 2 . (16)
Thus (8 ) is proved. (9) follows from the well-known formula
00

. 2 f sinn ,
S(co) = -—- -------du= 1. (17)
n J и
о
The uniform convergence follows from (10).

Lemma 2. Put
+T
1 f sin t(z - a ) - sin 1(z - b)
D (T ,z,a,b) = — J ------------- — ------ ------- —dt (18)
-T
and
+00
.. I f sin t(z - a ) - sin t(z - b)
D (z,a,b) = D(+ oo, z, a, b) = —— ------------------------------- — dt. (19)
2л J t
—00
For every real z, a, b and for every positive T

I D(T, z, a ,b )\< 2; (20)*


further i f a < b, then
1 fo r a < z < b,

lim D (T ,z,a,b) = D (z,a ,b )= — f or z = a or z = b, (21)

0 fo r z < a or b < z.

The convergence is uniform for \ z — a \ > d , \ z — b \ > d ( d > 0 arbitrary).

P roof . Since

D{T z, a, b) = ~ [S(z — a ,T ) — S(z - b, T)],

Lemma 2 is an immediate consequence of Lemma 1.


316 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 4

Now we turn to the proof of Theorem la. We have


4-00
1 f ( e~Ua - e~ub )
aT J ------ 7,— r =
— GO

= L f’ l f »n > (- - -» ) - s i - > ( - - * ) (22)


2 тг J t ) t
—00 —00

On the other hand, since a and b are points of continuity of F(x), we have
by Lemma 2
F(b) - F(a) = f D(z, a, b) dF(z).
— CO
(23)

In order to prove (2) it suffices thus to prove that the order of integration
may be reversed in the right hand side of Formula (22). The difficulty is
that the integral (19) representing D(z, a, b) is not absolutely convergent.
But by Lemma 2 we know that D(T, z, a, b) — D(z, a, b) tends unifoimly
to zero on the whole real axis, except for the intervals a — d < z < a + 5
and b - ö < z < b + ö, where Ö is a small positive number. Furthermore
on these intervals \D(T, z, a, b) | < 2. Since a and b are continuity points
of F(x), we have

lim f D(T, z, a, b) dF(z) = J D{z, a, b) dF(z) = F(b) - F(a). (24)


T-+ 00 —00 —*00

On the other hand


j D(T, z, a, b) dF(z) =
— 00

- -L if f s ln ,|2 - ° > - sin,(Z - i’) 4FW ] Л . (25)


2n J J t
-T -00

Flere the order of the integrations can evidently be interchanged, because


of the absolute integrability of the integrand in the domain —со < z < + oo,
I t I < T. If we let T tend to + a , then (25) leads to Theorem la because of
(22), (23) and (24). If a and b are points of discontinuity of F(x) we find,
by a slight modification of the proof,
F(b + 0) + F(b) F(a + 0) + F(a)
2 2
+oc
1 i' f e~i,a - e~i,b)
- 2 i J кт <0— — г <26)
— oo
V I, § 4 ] SO M E F U N D A M E N T A L TH EO REM S 317

Of course the density function fix ) = F'(x) of an absolutely continuous


F(x) may also be expressed in terms of <p(t). We restrict ourselves to the
case where the integral

T I t (0 1 dt (27)
—oo
exists. Then

т - ш n * * - * - * -
a- о

+00
= lim
л-o J
I iíHthi* [T(/) e-»* + ^ ( - í ) ei,Jt] dt. (28)
—00
Since (27) exists and because of

M O e~Ux + ?>(-0 «"*] j ^ 2 I 9 5 (0 I,

the limit and the integration can be interchanged according to the theorem
of Lebesgue, hence
+ CO

f(x ) = ~ J fit) e~i,x dt. (29)


—00

It is easy to show that the integral figuring on the right hand side of (29)
is a uniformly continuous and bounded function of x. This leads to

T h e o r e m 2. I f f i t ) is the characteristic function o f the random variable


f and if the integral (27) exists, then ^ has a uniformly continuous and bounded
density function given by
+00
/0 ) = j fiO e~ilx dt. (30)
—CO
We shall prove now

T heorem 3. The distribution functions F„(x) (/1 = 1 , 2 , . . . ) tend to a


distribution function F(x) at every point o f continuity o f F(x), iff the charac­
teristic functions <p„(t) o f F„(x) tend for n 0 0 to a function (p(t) continuous
fo r 7 = 0. In this case f i t ) is the characteristic function o f F(x) and the
functions <pn(t) converge uniformly to q(t) on every finite interval.
318 C H A R A C T E R IS T IC F U N C T I O N S [VI, § 4

P roof. We show first that the condition is necessary, i.e. we have to


show that if
lim Fn(x) = F(x) (31)
П-+- CO

at every point of continuity of the distribution function F(x), then


lim 99,,(/) = 9 9 (0 (32)
«-*-00
holds, where
(fit) = +f e“x dF(x), (33)
-00
and the convergence in (32) is uniform in every finite t-interval. Let
e > 0 be given. Choose a number Я > 0 such that + Я and —Я are conti­
nuity points of F(x) and

in this case
I I e'xt dF{x) < — . (34)
|x |> A

For n > nx, where nx depends only on e, the inequalities

Fn( + V > i ~ y
hold, hence

j e,xt dFn(x) Í < “г - for n > «i- (35)


я
Id >a
Consequently, for n > n x ,
+A

I <Pnif) - (fit) ^ j elxt d[Fn( x ) - F ( x ) \ + — . (36)


-A

Integrating by parts we obtain for | t | < T

\ +\ eix‘ d[Fn( x ) - F \x ) ] \<


-A

< [ Fn (Я) - F(X) I + IFn ( - A) - F ( - X) I + T f !Fn(x) - F(x) \ dx. (37)


-A
VI, § 4] SO M E F U N D A M E N T A L TH EO R EM S 319

Now |.F„(x) - F(x) I < 2 and according to the theorem of Lebesgue limit
and integration can be interchanged, hence the right hand side of (37),
and by (36) <pn(t) — 9o(t) too, tend for n -> oo uniformly to zero if jij < T.
Thus we proved that the condition of Theorem 3 is necessary.
We show now that it is sufficient as well, i.e. that from (32), with (p(t)
continuous for t = 0, follows (31). According to a well-known theorem
of Helly every sequence {F„(x) } possesses a subsequence {F„k(x) } that
converges to a monotone nondecreasing function F(x) at all continuity
points of the latter.
We show first that this function F(x) is necessarily a distribution function.
It suffices to show that F(+ oo) = 1, F(— oo) = 0, and that F(x) is left-
continuous. This latter condition can always be realized by a suitable
modification of F(x) at its points of discontinuity. Since F(x) is a limit
of distribution functions, we have always 0 < F(x) < 1. Hence it suffices
to prove that F(+ oo) —F( — oo) = 1. First we prove the following formula:
X + oo

I l K ( y )- F „ ( ~ y ) ] d y = — I - — 7 2- * ' - <Pn(t)dt i f x > 0 . (3 8 )


.) n J t
0 —oo

In fact
+00
r, . 1 f 1 - cos xt ,. ,
Fix) = — J ------ y,----- <pn{t) dt =
-00
+00 +00
f f 1 —cos xt
= —
1
JI — 7 2 ----- cos yt dt dFn(y) (39)
—OO —OO

(the order of integrations can be interchanged because of the integrability


1 —cos xt
of ----- y2------and because of \<pn(t) | ^ 1). It is known that
+ 00


1
Jf 1 —cos xt .
------ ^2----- dt = \x \.
,
(40)
—00
From (40) it follows that for x > 0
0 for у < - x,
+ 00

1 f 1 - cos xt X + у for —X < у < 0,


— -------5------cos yt dt =
71 J t2 ' x — у for 0 < у < X, (41)
—00
0 for x < y.
320 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 4

Hence by (39)

/„ (* )= f
-л :
(x-\y\)dFn(y). (4 2 )

An integration by parts in (42) leads to (38).


Since — /'„(—y) is a nondecreasing function of y, we obtain from
(38)
+00
1 С 1 —cos xt
Fn(x) - Fn(~ X) ^ —n J
----xt:2---- VÁ0 di> (43)
— C0

or
+ oo

ад - >— J — ^2— у» lI—и I d,u-


1 P I —COS U
(44>
—00
Suppose that x and —x are both continuity points of +(x) and that n runs
through the sequence {nk}. Then from the theorem of Lebesgue concerning
the interchangeability of the limit and the integral it follows that
+ oo
1 Г 1 —cos и Iи
F(x) - FC-x) > — -------s----- <P — du. (45)
7Z U I X ^
— oo

(p(t) is continuous for t = 0 and because of 93,/0) = 1 we have 93(0 ) = 1


as well; hence we obtain, by applying Lebesgue’s theorem again and by
taking (40) into account
F(+oo) —F (—co) > 1. (46)
Consequently, F (+ со) = 1 and F (— 00) = 0; F(x) is therefore a distri­
bution function.
It remains still to prove that 1) rp(t) is the characteristic function of
F{x) and 2) that the whole sequence {+„(x)} converges to F{x)\ for the
latter, according to the theorem of Helly, it suffices to show that the
sequence {F„{x)} possesses no subsequence converging to a function other
than F(x). Both of these statements follow immediately from Theorem lb
and from the already proved first part of the present theorem.
Hence from the sequence of distribution functions (F„(x)) (и = 1 , 2 , . . . )
there cannot be selected any subsequence which does not converge to
F(x); this means that lim F,(x) = F(x). The uniformity of the convergence
и-^оо
in the relation
lim 9>„(0 = 9 3 (0 for \t I < T
n-+00
V I, § 4 ] SO M E F U N D A M E N T A L T H EO R E M S 321

(T > 0 fixed arbitrarily) follows from the already proved necessity of the
condition of Theorem 3. Herewith our theorem is completely proved.
Let us add some remarks.

1. We have seen that if the sequence of distribution functions {F„(x)}


(n — 1 , 2 , . . . ) converges to a distribution function F(x), then the sequence
cp„(t) of characteristic functions of the distribution functions F„(x) converges
for every t, when n -* oo, to the characteristic function 95(f) of F(x). If the
condition that F(x) be a distribution function is omitted, the sequence of
the characteristic functions <pn(t) does not necessarily converge; let e.g. be

f 1 for X > n,
F n (x ) = [ 0
n for
- x < n.

For every finite x, lim Fn(x) = 0, nevertheless cpn(t) = elnl does not tend
w—OO
to a limit (except for f = 2 kn (к = 0, ± 1 , + 2 , . . .)).
2. We have proved that if the characteristic functions of the functions
F„(x) converge to a function 95(f) continuous at f = 0, then the functions
F„(x) converge to a distribution function F(x) with characteristic function
<p(t). If we omit the condition that <p(t) is continuous at the origin, our
proposition is no longer valid. Thus for instance let F„(x) be the
distribution function of the uniform distribution on the interval ( —л, +ri),
that is

Fn O') = ~y + for Ix I < и;

, ., sin ni , , , .
then wJt) = --------, and thus the limit
nt
lim <pn(0 = <r(t)
n-~OO
exists for every real t and is given by

f 1 for t = 0,
^ I 0 otherwise,
thus 95(f) is not continuous for / = 0. The sequence F„(x) converges when
n -* 00 for every x to . F(x) is therefore identically equal to the

constant — , thus it is not a distribution function.


322 C H A R A C T E R IS T IC F U N C T I O N S IV I, § 4

We show finally that the characteristic functions of two different distri­


butions may coincide on a finite interval.
Consider the random variable £ which assumes the values + (2k + 1)
(k = 0, 1,. ..) with probabilities

P(i = 2k + 1 ) - P({ = - (21- + 1» =


We know that
” 1 7Г2

„ ? „!ъ г п г - — - <47)
hence
+f P{!; = In + 1) = 1,
n = — co
and we find
8 “ cos(2«+l)l 21 / 1
<Рц(0 = n-2 „ti.
Z —Ö (2n" +TTw—
1)- = 1 ----------
n for 111 - n> (48)

further cp^t) is periodic with period 2n.


Let now }] be a random variable assuming the values 0, ± (4к + 2)
(к = 0, 1, . . . ) with the probabilities

P(*l = 0 ) = 4 “ and Р(ч = ± (4 k + 2)) = (^ = °. L • • •)•

Clearly the condition

P(r, = 0) + +f P(rj = An + 2) = 1
n = — 00

is fulfilled because of (47) and we obtain

<«>

According to (48) and (49) we have thus

9»«(0 = Т’ч(0 for UI — ' (50)

The function $>,,(/) is periodic with period n. Let the real axis be partitioned
into subintervals
2k — 1 2k + 1 „ л „
---------- - n <, t < ------— n (к = 0 ,± 1 ,± 2 , . ..),
V I, § 5 ] O N T H E N O R M A L D IS T R IB U T IO N 323

then we see that the functions (p$(t) and cpn(t) are identical on intervals
with an even index к and are of the opposite sign on intervals with an
odd index k.

§ 5. Characteristic properties of the normal distribution

Let £ and rj be independent random variables with the same normal


distribution; it is easy to see that £ + rj and ^ - 17 are independent. It is
quite remarkable that this property is characteristic for the normal distri­
bution. In fact, Bernstein has proved the following

T heorem \. I f c, and rj are independent random variables with the same


distribution and finite variance, further if £ + r\ and { — tj are independent,
then с and rj are normally distributed.

P roof . We may assume without restricting generality that E(f) = E(p) — 0


and D(f) — D{r\) = 1. If cp(t) is the characteristic function of the common
distribution of £ and ц, the characteristic function of £ + t] is <p\t) and
that of £ — r) is 9f t ) • 9?( —?)• Since £ + r] and — t] are independent,
the characteristic function of their sum is equal to the product of their
characteristic functions. The characteristic function of (£ + »?) + (£ — »/) =
= 2£ is, by Theorem 3 of § 2, equal to <p(2t). Hence

V(2‘) = <p3(t) <p{- t). (1)

Now (p{t) can never be zero. In fact, if for a value t0 we would have
t f i
<p(t0) = 0 , then by ( 1 ) we would have 99 = 0 and thus also 99 I= 0
(n — 1, 2, . . . ) . As cp(t) is continuous, we would have 9 9 (0 ) = 0 ; this is
impossible as 9 9 ( 0 ) = 1 (cf. Theorem 1 of § 2). Put

Щ = In cp(t) (2 )
then, by ( 1 )
*(20 = W ) + * ( - ') • (3)
Put
ад = «КО- «К-0. (4)
If in (3) t is replaced by —t and the equality so obtained is subtracted
from (3) we find
0(21) = 20(0- (5)
324 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 5

By assumption <p(t) is twice differentiable and <p'(0) = 0, <p"(0) = —1


(ct. § 2, Theorem 7). Since (fit) Ф 0, i/<(0 and b(t) are twice differentiable
as well and we have <5(0) = 0, < 5 '(0 ) = 0.
We obtain from (5)

sl-L
= ■ (»-1,2,...). (6)
~T

The right side of (6) tends for n -* oo to <5'(0), i.e. to zero. Hence

<5(0 = 0. (7)
It follows that i\/(t) = <//( —/) and by (3) that

ф(20 = 4 m (8)
This leads to

(9)
NO
The right hand side of (9) tends to — — since i/<(0) = »A'(0) = 0, t/<"(0) =

= —1. Hence
t2 • - '*
•K0 = - - j - and 7(0 = e 2 ;

c; and rj are thus normally distributed random variables.


Similarly, one can prove

T heorem 2. Let c, and tj be independent random variables having the same


£ + r\
distribution with zero expectation and finite variance. I f •— has the
V7
same distribution as £ and >], then £, and rj are normally distributed.

P roof . Assume D(f) = D(g) = 1. If (p{t) denotes the characteristic


function of £, and r/, we have

7(0) = h <P'(0) = 0, 9>"(0) = — 1.


V I, § 5 ] O N T H E N O R M A L D IS T R IB U T IO N 325

£ + t]
By assumption, the characteristic function of — = — is also equal to
\/2
cp(t). By Theorems 3 and 6 of § 2, however, the characteristic function of
^ “4 is cp2 — 7 - I , hence
V2 I v/ 2 )

W —7t = ^ ?)- 0°)


lvz
From this follows, as in the proof of Theorem 1, that y(t) ф 0 for every t.
If we put again In <p{t) = ^(t), then ip(t) is twice diiferentiable,

\p(0) = ф'(0) = 0 and iA"(0) = - 1.


From (10)

>A(0 = 2> A | - 7 f ) , (11)


hence for every positive n

* (-t )

[ 2 -0
t2 t2 )
hence 1l/(t) = — — and y(t) = exp — — which proves our theorem.
Theorem 2 can be rephrased, by using the notion of families of distri­
butions, as follows:

T heorem 2'. Let F(x) be a distribution function such that

+ 00 + 00
I xdF{x) = 0, \ X2 dF{x) = 1.
—00 —00

I Í X — m I
I f the family o f distributions I F -------- L <7 > 0, is closed with respect
\ \ a )\
to the operation o f convolution, i.e. if for any real numbers m b m2 and for
any positive numbers а ъ tr2 there can be found constants m and a (m real,
<7 positive) such that

F И , (13)
Ci c2 j a
326 C H A R A C T E R IS T IC F U N C T IO N S [V I, § 5

then
X
u2
F(x) = — = e 2 du; (14)
■^J2 71
— CO

{ X — TYl I
F --------- 1 is thus the family o f the normal distributions.
a Л

Proof. Obviously, m = >щ + m2, a = -Ja\ + a \. If we put


+ CO

<f(t ) = ( elxt dF(x),


— CO

then
9 К 0 <F(°2 0 = 4 Í - J + ai 0- ( l5)

For <7x — ff2 = —j= (15) reduces to (10); hence Theorem 2' follows from
\/2
Theorem 2.
Theorem 2' explains to some extent the fact that errors of measurements
are usually normally distributed. In effect, the condition, that the sum of
two independent errors of measurement belongs to the same family of
distributions as the two errors themselves, cannot be fulfilled, in the case
of a finite variance, by other than normal distributions. The condition that
F(x) should have finite variance is necessary for the validity of Theorem 2'.
Thus, for instance, for the distribution function

F(x) = — + — arc tan x


2 7Г

of the Cauchy distribution we have the relation

г 1 х - тА , р 1 х г ..щ \ , Р {х - ( щ + т ‘>\ . об)


öl j 02 ( ° i + Ö2 )
(16) follows easily by taking into account that the characteristic function
of F(x) is equal to e-1' 1.
If the family of distrubtions IF —— — is closed under the operation
! 0 )
of convolution, F(x) is said to be stable. According to Theorem 2', the
normal distribution is the only stable distribution having finite variance,
V I, § 5 ] O N T H E N O R M A L D IS T R IB U T IO N 327

as pointed out above. There exist, however, other stable distributions, e.g.
the Cauchy distribution. Stable distributions will be dealt with in § 8.
We deal now with some further remarkable properties of normal distri­
butions. If £ and q are independent normally distributed random variables,
their sum £ + q is, as we know already, normally distributed too. We shall
now prove that the converse of this statement is also true: this result is
due to H. Cramér.

T heorem 3. I f £ and q are independent random variables and i f £ + q


is normally distributed, then £ and q are normally distributed themselves.

P ro o f . W e m ay su p p o se E(£ + q) = 0, a n d D(c + q) = 1; th e ch arac-


t2
teristic function of £ + q is then exp — — . Let cp^(t) and f n(t) be the
characteristic functions of £ and q, respectively. We have thus
/*
4 > i( 0 % (0 = e 2. (17)

If F(x) and G(x) denote the distribution functions of £ and q, respectively,


we have

<Ft (0 = —
TOOe‘xl dF(x )> Vn (0 =—
Г00 e“ ' dG(x y ( 18)

We show now that the definition of <pft) and cpft) can be extended to
all complex values of t, so that rp^t) and y f t ) are entire functions of the
complex variable t. Let us first suppose t — iv (v real) and let A and В
be any two positive numbers. We have

+
( e~vxdF(x)- +
f e~vy dG(y) < +f j * e~v(-x+y) dF(x)dG(y) =
—A —B —oo —oo

v*
= 95c + , ( * » ) = e 2 . (19)

Since e~vx > 0, the following integrals exist:

<Pi (iv) =—Too e~vx dF(x) (20a)


and
+ 00

<Pn (iv) = j e~vy dG(y). (2 0 b )


— oo
If now t — и + iv, we have
+00 +00
I M O I = i J eUx dF (x)\< J e - m dF(x) = V{ (iv). (21)
—00 —oo
328 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 5

The definition of cp^t) and q>n(t) can thus be extended to every complex t.
It is easy to see that <pjj) and <p,,(0 are holomorphic on the whole complex
plane, hence they are entire functions of t.
Because of (17), (p^t) # 0, cpn(t) Ф 0 for every t. Hence l n y ^ i ) and
ln y 4(/) are entire functions too, where that branch of the logarithmic
function is to be taken, for which ln 1 = 0. If a > 0 and b > 0 are such
that F(a) — F( —a) > and G(b) — G( —b) > — , then

+ o° ^^

<Pi 0' ») = e~xv dF(x) > — (22)


— 00

and
+ °°

<P„ (/ v) = J e - " dG(x) > . (23)


—oo

Hence, for t = и + i v,
v‘_
e2 ~+6M Ш-а+*|<|
I <Ps (0 I ^ <Pt 0' v) = у < 2e < 2e ~ . (24)

Similarly, we obtain
Д11+.1.1
\ < P ,0 ) \ ^ 2 e 2 . (25)
If the real part of z is denoted by Re(z), we have by (24)

I Re(ln <p4(0) I = In -— - < 1121+ max (a, b) \ t \ + In 2. (26)


IM O I

We have r/^(0) = <p,(0) = 1; furthermore we may suppose without


restricting generality, cf^(0) = <p'n(0) = 0. Indeed if we would have c/.i(0) = a,
and consequently, <p,j(0) = —a, we could always consider instead of t;
and r] the random variables £ — a and rj + a whose characteristic functions
satisfy the above conditions. From this we conclude that the functions
In cpt (t) In tp (t)
---- — and ------ — are everywhere holomorphic, furthermore

1Re(ln q>t (Q) 1 and 1Re(ln (?)) ]


m2 m2
V I, § 5 ] O N T H E N O R M A L D IS T R IB U T IO N 329

are, because of (25) and (26), bounded on the whole /-plane. According
to a well-known theorem of H. A. Schwarz the relation
2n
1 Г 4- z
/(z) = Im (/(0 )) + — J R e ie
R e ( f( R e ie)) Reie_ z dd (27)
0
holds for Iz I < R for every function /(z) holomorphic on Iz I < R. It follows
from (27) that — — ^ —- and ----- \ —1 are bounded on the whole plane;
|/|2 |/|2
they are thus, according to Liouville’s theorem, constant. Hence <рД/) =
= exp (c/2) and <p,t(t) = exp (dt2). Because of

there follows
Г /2 I
<Pí(0= exp (28)
and
Г* 0) = exp - ^ | -
Г t2 1 , (29)

and herewith Theorem 3 is proved.


If / , ( / ) , . . . ,/ДО are characteristic functions and ab . . ar are positive
rational numbers, further if

П (Л(О)“* = e“ 2 - (30)
*=i
then it follows from Cramér’s theorem that the functions f k(t) (k = 1 , 2 r)
are characteristic functions of normal distributions. In fact, if N denotes
the common denominator of the numbers al f . . . , ar, we have
r N l‘
П ( Л ( 0 Г * к = е~ 2, (31)
k =1
where Nock (k = 1, 29. . . 9r) are integers. Hence

fk (t)g k (t) = e 2, (A: = 1, — , r), (32)


where we have put

л С О -С Л Ю Г ^ П С М ) * 4-
1 Фк

gk(t) is also a characteristic function, hence by Cramér’s theorem f k{t)


330 C H A R A C T E R IS T IC F U N C T IO N S [V I, § 5

is the characteristic function of a normal distribution. If not all ak are


rational, Craméi’s theorem does not guarantee the validity of the propo­
sition; however, Yu. V. Linnik and A. A. Singer have proved that it holds
in this more general case too; i.e. they proved the following
T htorem 4. I f the functions f k(t ) ( k = 1 ,. . . , r) are characteristic functions
and i f we have in some interval \ t \ < ő (ö > 0) identically

П {fk (t)Yk =
k=1
e"1' ~ °2 , (33)

where m is a real number and а, alt a2, . . . , a, are positive numbers, then
the functions f k(t ) (/c = 1,2.........r) are characteristic functions o f normal
distributions.
P roof . The following proof is due to Yu. V. Linnik and A. A. Singer [1].
It consists of five steps.
Step 1. Put

9 k (t) = fk [ — t- = f k \ - —^ i = , ( k = \,2 ,...,r ) .


' < V 21 a j2
Clearly gk(t) is a characteristic function to o ; in fact if £ and r\ are independent
random variables possessing the same distribution with characteristic
£ —ti
function f k (f), then gk (t) is the characteristic function o f ---- — .Thusthe
identity a\ 2

П ( л ( 0 ) м = е “ ?2 (34)
k=1
holds if 111< Ő. Furthermore gk (t) is a real and even function. If we prove
from (34) that gk (t ) is the characteristic function of a normal distribution,
then the theorem of Cramér implies the same conclusion for f k(t) . It
follows from (34) that gk(t) Ф 0 for 11 \ < Ő, hence we may take the log­
arithm of the two sides of Equation (34):

r 1 t2
У ock In -------= — . (35)
к- l 9k (0 2 V ;
Let Gk (.y) be the distribution function corresponding to the characteristic
function gk(t). It follows from the assumptions concerning gk(t) that Gk(x)
is symmetric with respect to the origin; hence we have for any a > 0
+ o° +a
9k (0 = I cos tx ■dGk ( y ) < 1— I (1 —cos tx) dGk ( y ) .
—oo —a
V I, § 5 ] O N T H E N O R M A L D IS T R IB U T IO N 331

К
Since for 111 < — - the relation
2a

I (1 - cos tx)dGk(x) < 1
—a

holds and since for 0 < x < 1 we havex < In — -Í— ], it follows from
1 —x j
(35) for a > —— that
2<5
+ a

Г C t2 71
Z ak ( 1 - COStx)dG k (x )< — for |/|< — . (36)
k=i J 2 2a
—a

If we divide both sides of (36) by t 2 and let t tend to zero, we obtain

t Ч Г X2dGk (x) < 1. (37)


k = l - a

Since (37) holds for every a > - —, the integrals


J x 2dGk (x) ( k = 1....... r)


—00
exist; by Theorem 7 of § 2 gk(t) is thus twice differentiable and, gk(t) being
an even function,
9k (0) = 0 (k = 1 , . . r). (38)
From (38) and (35) we conclude that
Г r +00
- I «кд"к (0) = I «* J *2dGk (X) = 1 . (39)
к - l k =1 —oo

2. We show now that gk(t) possesses derivatives of every order.


For this we need the classical formula of Faä di Bruno concerning the
successive derivatives of composite functions.1 This formula states: if
z = H(y), у = h(t)
and if we put

h0(t) = h(t), h,(t) = — — ± L (v = 1, 2, . . . ) ,


v! at
then we have for every integer p

(40)
1 Cf. e.g. E. Lukács [2].
332 C H A R A C T E R IS T IC F U N C T IO N S [VI, § 5

where the summation is extended over all nonnegative integers ib . . . , i,


and 4, . . . , 4 satisfying the conditions
s s

Z 4 = 7 and =Z h 4 = P’
]= i j i

where l assumes the values l — 1,2 , , p.


In particular, if H(y) = In у and p = 2q, we obtain

d*q\nh(t) ^ (—i y - 1 (2<7) ! ( / - l ) ! ]4 ( # ) ( 0 | ,v n


dt 29 ^ 4! 4! •• •4! Ai lj\ h(t )
where the summation is to be extended over / = 1,2.........2q and over
i, , lj such that

A ij = 4 Z 4 4 = 2^
;= i y=i
Now we show by induction that the integrals

j x*qdGk (x) (q — 1 , 2 , . . & = 1, 2, . . . , r)


—00
exist. Suppose that this holds for a given integer q; from this it follows
that gk(t) (k = 1 , . . . , r) is exactly 2q times diiferentiable and that
<Ü2'- ])(0) = 0 ( / = ! , . . .,q; k = l , . . . , r ) .
According to (35) we have

d„ fili i «г»(о i"

J ' = (2 ?)'Í« « I ( - l y y - i r s n '7 ^ < «)


ai fc=i ;= i 7= i */'•

where the summation is as in (41).


Put t = 0 in (42) and subtract the relation thus obtained from (42).
Separate the terms with the indices / = 1, 4 = 1, 4 = 2q and consider
that the left side of (42) is either 1 or 0 for q = 1 or q > 2, respectively.
Thus we obtain

k=i f ~9k7( (r
0
- ^ )(0)]) =

= (2 q ) \ Í «k 2 ( - 1 / ’ 1 (/ - 1)! [Skl(0 - Skl (0)], (43)


k=1 /=2
with
( á 4 t) “

5 н (о = е п - - у ; /у!-
j =1 lj'
V I, § 5 ] O N T H E N O R M A L D IS T R IB U T IO N 333

the summation being extended over the ij, lj such that

Í h = U t i j l j = 2q.
1=1

We show now that the right hand side of (43) has the order of magnitude
0 ( t2) when ? —►0. In fact if v < 2q is an odd number, then by the induction
hypothesis g ^ \t) = О { \t\). Hence it suffices to consider terms for which
all the lj are even. If v is even, v < 2q — 2, then we have

d k O )

from which our statement follows. But

J _ - 1 = 0 (i2)
9 k (0

is also valid; hence it follows from (43) that

t
k =l
У Р (0 - a f q) (0)] = 0 {t2). (44)

In consequence, the expression


+00
' Г 1 —cos tx ^
Z ------ ~~2------ x - qdGk (x)
k =1 J 1
* —00
is bounded. If we let t tend to zero, we see that the integrals

j' х2я+2 dGk (x) (к = 1 , . . . , r)


—00
exist and this means that gk(t) (k = 1,. . . , r) is at least (2q + 2) times
differentiable. As we know already that the integrals

j x 2dGk (x) ( k = l,...,r )


—oo

exist, the proof is finished; gk(t) (к = 1, , r) is thus infinitely often


differentiable.
Step 3. We shall show now that the gk(t) are holomorphic in a circle
11 \ < R (R > 0). In order to show this we have to evaluate the order of
334 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 5

magnitude of the derivatives gvk 4) (0). We can restrict ourselves to the


case when ak > 1 (k = l, r); otherwise there exists an integer N0 such
that N0<xk > 1 (k = l , . . . ,r). Since (34) leads to the equality
г Г f 'Плг0®* _ ,г
П 9k
J
=
n0 I
= e 2 ,

(34) is satisfied by the functions

ff* (0 = 9k *
у / N0
for a* = N00Lk. Without restriction of generality we may thus assume
afc > 1 (к = 1, . . . ,r). Now raise the two sides of (34) to the power 2q,
differentiate 2q times and put t = 0. By introducing the notation yk(t) =
= [gk(t)]2ai‘‘l we obtain thus

I . (45)
/, +... +/,= 2 9 ll - • • • ‘г- Ш 1=0
The quantities y(k\ t ) can be evaluated by means of the formula of Faä
pi Bruno:
/
fi°(0 = Z 2q<xk (2qcck - 1) . . . (2qock - v + 1) [gk (г)]2««*-’ x
V=1

х Г т г ^ т .П ^ т т • <46>

where in the inner sum the summation is to be taken over the /}-s and
/,—
s such that /) = v, £ i f = /. Because of
<?f-1)(0) = 0, sg n # > (0 ) = ( - l ) '
and
2 ^ 0/д. (2//aÄ— 1) . . . (2^гал —v + 1) > 0 for v < 2q,

it follows that all nonzero terms on the left hand side of (45) have the
sign of ( —l)9. The right hand side is
d^ e~tt(1
t o= (2# (2?)! H 24 (0), (47)

where H2q(x) is the Hermite polynomial o f order 2q:


i ** И'гч _ x‘
я- « - р 5 Г * т - а г * и*)
V I, § 5 ] O N THE N O R M A L D IS T R IB U T IO N 335

Thus
(-1)«
H 2a (0) = —— — .
2eW q\ 2«

Since on the left hand side of Equation (45) there occur the terms
^ ( 0 ) too, the relation
^ ( O ) . i 2«
* w ?! 2q
must hold, wherefrom
i' tot 1
lim su p rV V ' (49)

thus gk(t) is holomorphic in the circle 11 \ <- A (k = 1,. . . , r).



Step 4. We show that the functions gk(t) are also entire functions. Put
hk{t) = gk i . Since gk(t) is an even function, hk(t) is holomorphic in

the circle 111< — . Suppose that not all gk(t) are entire functions; then
e
the same holds for the functions hk(t). Let hko(t) be the function hk(t) which
has the smallest radius of convergence, which radius we denote by R.
Take 0 < r < R; put k(t) = hk(r + t). Then

П (50)
k=l
and
ГТ ( - ^ ) . | gfc = e 2 (51)

Since 3>ífk(t) (k — 1, . . . , / • ) too can be represented by a power series with


positive coefficients, we obtain, by raising (51) to the и-th power and
differentiating n-times,

ш ке J r ^ ( 0 ) < (-£-)" J¥%„(0),


hence
hm sup Y I " < — . (52)
00 n\ 2
2
,ko{t) is thus holomorphic in the circle 11 1 < — , i.e. hko(t) is holomorphic
336 C H A R A C T E R IS T IC F U N C T I O N S [V I, 55

2
in the circle 1t — r \ < — . From this it follows for an r sufficiently near
e
to R that hkB(t) is regular at the point t = R. This, however, contradicts
the known theorem according to which the sum of a power series with
positive coefficients having a radius of convergence equal to R, is singular
at the point + R.1
Step 5. The proof of Theorem 4 can now be finished like that of Cramér’s
theorem. If we choose the numbers ak > 0 (k = 1, 2 , . . . , r) such that

ak
J
-Як
dGk{x)> -i- ( k = r - l,...,r ) ,

then

In 1^ (/) I < 4 ,-1— + с 1í 1,


2a*

where C is a positive constant. Because of (34) and step 4, the function


In gk(t) is an entire function, hence by Liouville’s theorem ln^f*(t) is a
polynomial of at most the second degree, which leads to the statement
of Theorem 4.
Theorem 4 enables us to generalize Theorem 1. In fact, as Darmois
and Skitovitch2 have shown:

T h e o r e m 5 . I f £ l5 £ 2 > • • • > are independent random variables and


аъ . . . , an bu . .. , br are real numbers different from 0, further if the
random variables
r r

hi = X ак£к and Y]2 = £ bk£k


k=1 *=1

are independent, then the random variables £2, . . . , £r are normally


distributed.

P r o o f . Linnik has shown that the proof of this theorem can be reduced
to that of Theorem 4 as follows: Take ax = a2 = . . . = a, = 1, which
does not restrict the generality. t]x and t]2 are by assumption independent,
hence we have
E ( e i(Uql+ m,,)) = E ( e iu,u) E ( e iD"’). (53)

1 Cf. e.g. E. C. Titchmarsh [1], p. 214.


2 G. Darmois [1], V. R. Skitovitch [1].
VI, § 5 O N T H E N O R M A L D IS T R IB U T IO N 337

If is the characteristic function of we have by (53) because of the


independence of the
Г Г
П
k =1
4>k(u + M =
k
П =1
<Pk(«) <Pk(bkv). (54)

In a neighbourhood of the origin the 9ok{t) are not all zero. Put therefore
Ф/А0 = in <pk(0> then

E Фк(и + bkv) = E Фк(и) = E ФкФкР)- (55a)


fc=l *=1 k= 1
Replace in this equality и by —u, v by —v and add the equality so obtained
to the former one; it follows, by putting i//*(/) = t/^(t) + il/k( —t), that

E WC« + bkP) = É •/'*(“) + E Ф*(Ьк»У (55b)


k =1 /к=1 k =1
We prove now that ф*(0 = —ckt 2. This means that <pk(t)tpk( —0 *s the
characteristic function of a normal distribution and the proof of Theorem
5 is finished by Cramér’s theorem.
Multiply the two sides of (55b) by x — и and integrate from 0 to x, thus

Y j ( x - и) ф*к(и + bkv) du = “ IÉ * (V ) I +
0
X

+ f (x - u)
■' k= 1
É
Фк(м)du-
о

Then, by variable-transformation and Integration by parts, we get


x b^v x+bf(V t
J (x - и) Ф1(и + bk v) du = - x J ф*к(х) dx + j’ ( J ф* (x)dx) dt.
о b bkv о

If we put
B(v) = Y4>* ( V ) , (56)
*=i
we obtain
r x + bkv t b/cV

E j (J •KCO dt —x j ф*к {x)dx =


A:=l bkv 0 0 (5 7 )
2 X
= -7Г- ^(f) + I (x - u) Y 'l>k(u)du.
Z J fc=l
0
338 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 5

The left hand side of (57) is obviously a differentiable function of v. Hence


x +b/cV
£ bk J = ~ B ' (f) + A(v)x, (58)
bkv
where
A(v) = X bk\p*(bkv).
A=1

Replacing in (58) x by —x and adding the equation so obtained to (58),


we get
r x+bkv -x +bkv
Z hk ( j ffffid x + J i/i*(t) dx) = x 2B'(v). (59)
k =l b^v b/cV

Clearly, this equation can be differentiated with respect to v; doing this


and putting v = 0 we find, because of t//j* ( —x) = tl>*k{x) and ф*(0) = 0

£ Ь Ж ( х ) = 4 ~ В " ( 0). (60)


k=l A
From this follows
VT Í N\6•
П (9>*W) *•=es"(°rt
*=i
where
<P*(x) = eww = <р*(х) <pk{ -x ).

Since [<p*(x) I ^ 1, the relation B"(0) < 0 must hold. Equality cannot hold
here, since then the l k would be constants with probability 1. Hence we
can put B"(0) = —a 2 < 0 and we have thus

Í I (<P* (x ))bi = e”'7^ 2 (a2 > 0) (61)


fc=i

in a neighbourhood of the origin. By Theorem 4 the functions rp*{x), and


consequently the functions y k{x), are characteristic functions of normal
distributions. Theorem 5 is herewith proved.
Finally, we shall prove one further characteristic property of the normal
distribution: the following theorem of E. Lukács:

T heorem 6. If £ 2, are independent random variables having


the same distribution o f finite expectation and variance, then this distribution
V I, § 5 ] O N T H E N O R M A L D I S T R IB U T IO N 339

is a normal one iff


n n z 2
£ = E Zk and t]= Y j Zk------
k= I fc = l n.
are independent.
Remark. R. A. Fisher proved the independence of £ and rj for normally
distributed £k. The converse theorem was proved by E. Lukács in the
case where the have a finite variance. It was already proved before by
R. C. Geary under the stronger condition that all moments of the £k exist.
Later on it was proved by J. Kawata and H. Sakamoto1 as well as by
A. A. Singer2 that even the existence of the variance is unnecessary; however,
we shall not deal with this more general case.
P roof of T heorem 6 . The condition is necessary. The t k are normally
distributed; we may assume E(£k) = 0, D(gk) = 1. If (сц) is an orthogonal
matrix of n rows and n columns with Cy = —— (j = 1, 2 ,. . . , ri) we know
V я
(cf. Ch. IV, § 17, Exercise 43.b) that the random variables

i Cjktk
= U= 1.2,...,«)
k=l
are mutually independent, too.
We have thus S, = J n and

i =
k =Él S - n - j==1 Í r « r - J=Í2 ^
which shows the independence of £, and rj.
2. The condition is sujficient. We may assume E(i;k) = 0. By the assump­
tion of the theorem
E(ei(ui+vri)) = E(eiui) E(eiv"). (62)
If we differentiate on both sides of (62) with respect to v (which is allowed
because of Theorem 7 of § 2), and substitute v = 0 afterwards, we obtain
Eine»*) = E (e ^) E(n) = Е{ц), (63)
where
(p(u) = E(e‘uik)
is the characteristic function of the random variables £fc. From

ч-fi—
{
^-1±«—nf-x:1
n k=
s№
j= j<k 1 1
(«j
1 J. Kawata and H. Sakamoto [1].
2 A. A. Singer [1].
340 C H A R A C T E R IS T IC F U N C T IO N S [V I. § 6

follows E(t]) = (n - 1) <72, by putting a1 = Е (ф . Since

E ^ k elu^ )= -i< p '(u ) (65)


and
E(Ü = - <p"(u), ( 66)

(63) can be written in the form

- (n - 1) cp"(u) (>(«))" “] + — ” 0 ' ( M))2 (?(w))"~2 =

= (и - 1)а2(<}5(ы))\ (67)
If we divide by (« — 1) (<р(м))'', we find

L M . (68)
<p(u) I (f(u)

The left hand side of (68) is the second derivative of In cp(u). If we integrate
twice and consider that cp'(0) — iE(^k) = 0, we find

a2u2
\rup(u) = ------— (69)

which proves the theorem of Lukács.

§ 6. Characteristic functions of multidimensional distributions

Let 'i = (<*!, £2, • • •, C„) be an «-dimensional random vector with


distribution function F(xh x 2, ■■■, x„).
For any two vectors t = (tx, t2, ■. . , tn) and x = (xl5 x2, . . . , x„) we
put
П
СМ ) = Е Xk (k- 0)
к —1
For sake of brevity we write F(x) = F(xb x 2, . . . , x„). We define the
characteristic function of £ by

9 ^ (0 = Е(еКМ) = +J . . .
—00 —00
J
ei(*-r) dF(x). (2)

The characteristic function of an «-dimensional distribution function is


thus a function of « real variables. As is readily seen, it has the following
properties:
V I, § 6] M U L T ID IM E N S IO N A L C H A R A C T E R IS T IC F U N C T I O N S 341

1. For 0 = (0, 0, . . . , 0) we have

9><(0)=1.

2. For every t
1 ^ (0 1 < 1 .

3. is a uniformly continuous function of t.


4. If
n
n = ('ll, • • •>h n f 4j = Z cJkZk + bj (J = 1 , . . n)
k= 1
and
n

и= (M i ....... un), uk = Y cjkb ( k = \,,...r i) ,


j =i
then
9>,(0 = <^(и).
dwAt)
5. — ^ — = /£ (4 ) when £ ( 4 ) exists.
Cifc (=0
Ő2 ®.(f)
6. *— = - £ (£ Д ) when E(5fik) exists.
OljOtk t—
0
7. If the density function f(x ) of £ exists, then
+00 +00
= j • • • J ei(,'x)f(x ) dx.
— 00 — 00

where dx is an abbreviation for dxydx2 . ■■dxn.


Example 1. The characteristic function o f the multinomial distribution.
If
P j> 0 (J = 1 , 2 , . . rí), X Pi = 1
i=i
and
AM
^(£i = = £„) = FÍ1. • ■
П
where Z kj = AT (A) ^ 0 being an integer), we obtain
/=i
9^(0 = ( Z Pke“k)N ■
k= 1
Example 2. The characteristic function o f the n-dimensional normal
distribution.
342 C H A R A C T E R IS T IC F U N C T IO N S [V I, §6

If the vector tj = (rib . . . ,rin) is by definition normally distributed,


there exist normally distributed independent random variables ...,
with E(ik) = 0, D2(£k) = a\ and an orthogonal matrix (cjk) such that
n

4j = E cikZk + Щ (J = 1) 2, . . . , и). (3)


k =l
Then by property 4 of the characteristic function

9>4(m), (4)
where
n
= u = (u1, . . . , u n), m= (5)
i=i
It suffices therefore to determine the characteristic function of £. Since,
by assumption, the random variables are independent, we have

? 4(«) = = f l £ (eiUkik) = exp Г - 4 - E ■ (6)


k =l L Z A: = l
(4), (5) and (6) lead to
j n n

<P4(t) = exp i(m, t ) ----— E


Z h=lj=l
bh M E . (7)
where
n

bhj = E akChkcjk- (8)


fc=1
It is cast to see that the quadratic form
n n

E E bhhh
A = l/= 1

is positive definite and the matrix В = (bhj) is the inverse of the matrix
A = (ahj), the elements of which are the coefficients in the expression of
the density function o f »/. The matrix В is thus the dispersion matrix of /?.
In fact, a simple calculation shows that the matrix В can be written in the
form В — C S C '1, where S is the diagonal matrix with elements of.
On the other hand, we have proved (cf. Ch. IV, § 17, Exercise 42) that the
density function g(y) of t] with у = {уъ .. . , y„) is given by

g(y) = ^ „ exp - 4 E E ahj O a - mh) (yj - m,) , (9)


(27t)2 L 2
V I, § 6] M U L T ID IM E N S IO N A L C H A R A C T E R IS T IC F U N C T IO N S 343

where the matrix A = (ahj) can be written as A = CS_1C _1. Hence we


have obtained

T heorem 1. I f q is an n-dimensional normally distributed random variable


with density function (9) such that the quadratic form
П П

ZZ
h= 17=1
ahjZhZj

is positive definite and its determinant is denoted by \A\, then the characteristic
function o f q is given by (7) where m = ( m1, , mn) and where (bhj) = В =
= A " 1 is the dispersion matrix o f q.
There exists an inversion formula for «-dimensional characteristic func­
tions too. It is given by

T heorem 2. I f <pft) is the characteristic function o f the n-dimensional


vector we have
+T +T
1 Г f Я itkak _ e ~Ukbk
P {W ) = lim 7^Ä T
Г- 0 0 \zn ) J
•••
J
П -------- uk
k=1
Tt----------- m d t, (10)
-T -T
whenever the distribution functions o f all £k are continuous at the points
ak and bk (k = 1 ; here I is the n-dimensional interval ak < x k < bk
(k = 1
Like in the one-dimensional case it follows from this theorem that an
«-dimensional distribution function is uniquely determined by its charac­
teristic function. The proof is analogous to that of the uniqueness theorem
for n = 1; we leave it to the reader.
By means of the uniqueness theorem we prove

T heorem 3. The distribution o f the vector c = { < h , . . . , £„) is uniquely


determined by the distribution functions o f the projections o f £, upon all lines
passing through the origin.

P roof. Let d a be an arbitrary line passing through the origin having


the direction-cosines cq,. . . , a„; put я = (oq,. . . , a„). The projection of
^ on dx is thus
£a = (* ,0 - (11)
Hence the characteristic function of the random variable is

n S t ) ^ E ( e i^ ‘) = <pt(au), (12)
344 C H A R A C T E R IS T IC F U N C T I O N S [VI, § 6

where at is the vector with components akt (k = 1 , . . . , и). Since every


system tb . . . , t n of real numbers can be written in the form tk = tak
where t is real and

I «1=1,
k=1
the theorem follows from the uniqueness theorem.

TH EOR EM 4 . I f £ = ( £ x, . . . , £ „ ) and ij = (цъ . .. ,rjn) are n-dimensional

independent random vectors and i f £ = £ + »/, we have

<pft ) = <p„(t).

P r o o f . Let dx be any line passing through the origin with direction


cosines ak (k = 1,. . . , ri), and put a = (a1( . . . , a„). If £„ p«, are the
projections of i/, C upon i/a, we have C« = ?* + px- Hence because of
the independence of £ and ij :

(P d O = (P d t ) (P n S t )- O 3)
It follows from (12) that
<Pftd) = <pfta) <p„{ta), (14)
and the proof can be finished as that of Theorem 3.

T heorem 5. I f ..., are independent, £ = (fx, . . . , t =


= ( ij ,. . . , i„), (where tx, . . . , t n are real numbers), then

<Ps со = r í o*)- ( l5 >


*=i
Conversely, (15) implies the independence o f £l f . . . , c,n.

P roof . It follows from the assumption that the random variables e“kik
are also independent, hence

cpft) =E{f\eil
*=i
*<*) = П
*=i
= П
*=i
( 16)

If on the other hand x = (xl5. . . , x„) and if F(x) is the distribution


function of £, Fk(x) the distribution function of <ik (k = 1,. . . , n), further if

k= 1
V í, § 6] M U L T ID IM E N S IO N A L C H A R A C T E R IS T IC F U N C T IO N S 345

then by (15) it follows that the characteristic functions of F(x) and G(x)
are identical. Hence, because of the uniqueness, F(x) = G(x) and the
random variables £k are independent.
As an application of the preceding theorem we prove now the following
theorem due to M. Kac (cf. Ch. Ill, § 11, Theorem 6).

T heorem 6 . I f £ and g are two bounded random variables fulfilling for


any integers к and l the relation

Е(£кд')= Е (£к)Е(д'), (17)


then £ and g are independent.

P roof . Our assumption implies the absolute convergence of the series

£ Щ кЖ ) к л “ E tfX ftJ
Víih) = I -------------- and <pft2) = X ------- - ------
fc=o i=o t-
for every complex value of t1 and t.,. If we put £ = (£, g) and t = (zx, t2),
it follows from (17) that

<P{(0=*=0/=0
I I ------nTi-------=<Pt(h) <pj!*
)■
Hence the theorem is proved.
Theorem 3 of § 4 may also be extended to the case of higher dimensions:

T heorem 7. Let £N (N = 1 , 2 , . . . ) be a sequence o f n-dimensional random


vectors; let FN(x) (x = (x1;. . . , x„)) denote the distribution function of
£n and (pN(t) (t = f u . . . , tfi) its characteristic function. I f at every point
o f continuity x o f F(x) the relation

lim Fn (x ) = F(x) (18)


/V — + CO

is valid and i f F(x) is a distribution function as well, then

lim <pN(t) = <p(t), (19)


N-+00
where <p(t) is the characteristic function o f F(x). The convergence is uniform
on every n-dimensional interval.
Conversely, i f (19) holds for every system o f values (?1; . . . , / „ ) = ? and
If <p(t) is continuous for t = 0, then (18) holds as well, the function F(x)
figuring in it is a distribution function and <p is its characteristic function;
346 C H A R A C T E R IS T IC F U N C T IO N S [V I, § 6

furthermore, for any A > 0 the convergence in (19) is uniform for \ l \ < A,
where | / 1 = J t \ + . . . + 1*.
The proof of this theorem is omitted here, as it is essentially the same
as that of Theorem 3 of § 4.
As an application of Theorem 7 we show that the multinomial distri­
bution tends to the normal distribution. Suppose that the random vector
Zn = (£л-1>• • •, £n„) (N = 1 , 2 , . . . ) has a multinomial distribution
fj\
P(ZN1 = U = *«) = T J ~ ~ / r r ■■■pkn’ (20)

where къ . . . , k n are integers and

X kj = N, p; > 0, X Pj = 1
j =1 7=1
Let t]N denote the random vector (i/iVi, • • ■, i?;v„) with

—-^VÍPy /->i ,
"-( ’
We obtain for the characteristic function of gN
_ n _ Ги I it- 11^
<PnN (0 = exp [ - i j N ( X tjjp j)} ■ X Pj exP \— 7== >
7=1 -7 = 1 -I
hence

In <?w(0 = 7=1
hy/Pj ) +
+ A in 1 + X Pj exP /— - 1 • (22)
y=i V Wft L

By substituting the expansions of ex and ln(l + x) we have

Hm In <pnN(t) = -
N-+со
\ [ X Ű - (y =I l
y= l
h JPi )2]- (23>
The limit distribution has the characteristic function

9V) = e * p [ - _ L | £ < l - ( i ' , V Ä S ]. (24)


z l;=i 7=i I.
This is the characteristic function of a degenerate normal distribution;
VJ, § 7) INFIN ITELY DIVISIBLE DISTR IBU TION S 347

in effect the random variables r\Nk are connected by the linear relation
П _
I fiN k s /P k ^ Q ;
k = 1

the limit distribution too is thus concentrated on the hyperplane defined


by the equation
" , —

Ü *k yJPk = 0.
k = l

§ 7. Infinitely divisible distributions

In this section we shall deal with certain types of distributions which


can be described most conveniently by means of their characteristic
functions.
The probability distribution of a random variable £, is said to be infinitely
divisible if for every n = 2, 3 ,. . . there exists a probability distribution
the и-fold convolution of which is equal to the distribution of %. Or to
put it otherwise: the distribution of £ is infinitely divisible if, for every и,
£, can be represented in the form £ = & + & + ••• + where
are mutually independent random variables with the same distribution.1
Let f ( t ) be the characteristic function of £. Obviously, to say that the
distribution of £ is infinitely divisible means the same as to say that for
i
every integer n the function [<p{t)]n is again a characteristic function.
The infinitely divisible distributions can be characterized by the following
property:
T heorem 1. The function cp(t) i s the characteristic function o f an infinitely
divisible distribution iff lnqp(t) is o f the form

In 93(f) = iyt - + f elu‘ - 1 - 1 + g“~ dG(u), (1)


2 J \ +и ] и
—со

where у and о > 0 are real constants and G{u) is a nondecreasing bounded
function. {Formula o f Lévy and Khinchin.)
I f the distribution has a finite variance, (1) may be written in the form
+ 00

In (p{t) = imt + a1J 1


{e,ut — —iui) — i - , (2)
—00
1 This second definition is not completely exact, since it is not certain that the
Zk-s can in fact be realized on the probability space in question. However, if the words:
“for a suitable choice of the probability space” are added, then the second formulation
becomes correct and equivalent to the first one.
348 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 7

where m is a real number, a > 0, and K(u) is a distribution function. (Formula


o f Kolmogorov.)
In particular, if
[0 for и < 0,
for„>0.
then
o~ t 2
In <p(t) = im t------;

hence in this case the distribution in question is a normal one. It can be


seen from (1) as well as from (2) that if tf{t) is the characteristic function
i
o f an infinitely divisible distribution, n ot only [qc(i)] ” but also [99(f ) ] 1
is a characteristic function for every or > 0 ; furtherm ore, ( 1 ) show s that
9 9 ( f ) differs from zero for real f .

If we put in (2) m = kh, a2 = /.hr with Ä > 0, and

10 for и < h.
K(u) = , , ,
(1 for и > h,
then
99(f) = exp{A(euh - 1 )}.

The distribution is thus in this case a (generalized) Poisson distribution


in the following sense: A random variable so distributed takes on the
values kh (к = 0, 1, 2,. . .) with probabilities:
)k —X
P(t; = kh) = - L - (A: = 0 , 1 , . . . ) .

It follows immediately from the definition that the convolution of


infinitely divisible distributions is itself an infinitely divisible distribution.
Thus a distribution which is the convolution of a normal distribution and
of a finite number of generalized Poisson distributions (of the above type)
is infinitely divisible.
It follows from (1) that every infinitely divisible distribution can be
obtained as the limit of convolutions of a normal distribution and of a
finite number of generalized Poisson distributions. It can be shown that the
limit of a convergent sequence of infinitely divisible distributions is itself an
infinitely divisible distribution.
The p ro o f o f Theorem 1, how ever, w ill n o t be dealt with here . 1

1 Cf. e.g. P. Lévy [4] or В. V. Gnedenko and A. N. Kolmogorov [1].


V I, § 8] S T A B L E D IS T R IB U T IO N S 349

§ 8. Stable distributions

A distribution function F(x) is said to be stable if for any two given


real numbers тъ m2 and for any two positive numbers оу, o2, there exist
a real number m and a positive number a such that
„ I x — mx \ „ I x — m2 \ I X —m )
' Ь г И Ь г Н Ы - <■>
Here the sign * denotes the operation of convolution. As we have seen
the normal distribution is a stable distribution, in fact the function
л:
1 Г - -
F(x) = —— e 2 du,
J 2 ti j
v — CO

satisfies (1) with


m = m1 + m2 and a = o\ + of . (2)

Theorem 2' of § 5 implies that the only stable distribution of finite


variance is the normal distribution. There exist, however, stable distributions
of infinite variance; thus for instance the Cauchy distribution with the
distribution function
F(x) = — + — arc tan x
2 71

fulfils (1) with m = ml + m2 and a = oy + o2.


We state now without proof the following theorem:
T heorem 1. A distribution with the characteristic function <({t) is stable
iff <p[t) can be written in the form

In <p(t) = i y t - c \ t Г 1 + iß ~ co(í, a) , (3)

where the constants a, ß, c fulfil the inequalities


-l< ß< l, 0 < a < 2, c > 0, (4)
у is a real number and
not
tan ——- fo r a # 1,
co(t, a ) = (5)
— In ! Г1 fo r a = 1.
C
7
350 C H A R A C T E R IS T IC F U N C T IO N S [V I, § 8

The number a is called the characteristic exponent of the stable distribu­


tion defined by (3). For the normal distribution a = 2, c > 0; in the case
of a = 1, ß = 0 we obtain the Cauchy distribution.
It can be proved without any difficulty that for a stable distribution
with characteristic exponent a < 2, the moments of order <5 < a exist but
those of order <5 > a do not exist.
It follows from Theorem 1 that every stable distribution is infinitely
i
divisible. In fact, if y(t) fulfils (3), the same holds for [<({t)]n with the
у c
same a and ß and with — or — instead of у or c, respectively. This can
n n
be seen directly as well, since if the distribution with the characteristic
function ij9(i) is stable, we have

У f ( i ) = <Р(Я„0 е‘ы
1
with qn > 0; hence [?>(/)]" (« = 1, 2, . . .) is again a characteristic
function.
For a detailed study of stable distributions we refer to the books cited
in the footnote of the preceding paragraph. Lévy calls only those
nondegenerate distribution functions F(x) stable, for which to any two
positive numbers c, and c2 there exists a positive number c such that
F(cxx) * F(c2x) = F(cx). Distributions, which we called stable above,
are called quasi-stable by Lévy. It can be shown that a distribution with
characteristic function <js(t) is stable in the sense of P. Lévy, iff In <p(t)
may be written in the form (3) with у = 0 for а Ф 1 and ß -- 0 for a = 1.
Thus the following result is valid:
T heorem 2. I f a distribution with the characteristic function cp(t) is stable
in the sense o f P. Lévy, In <p(t) can be written in the form

In <p(t) = — |c0 + ic±Y~ UT 0 < a < 2, (6a)


\ I^I
where c0 is positive.
It can be shown further that the inequality

-------< tan (6b)


Co z
holds; however, this will not be proved here.

P roof of T heorem 2. If <p(t) is the characteristic function of a distri­


bution which is stable in the sense of Lévy, there exists for every pair
V I, § 8] ST A B L E D IS T R IB U T IO N S 351

Ci > 0, c2 > 0 a number c > 0 with

<P(ci 0 <P(C2 0 = <P(ct). (7)


In particular, if cq = c2 = 1, we have
9>2(i) = 9?(i0- (8)
It may be supposed q # 1, since = 1 would imply, because of <p(0) = 1
and the continuity of the relation <p(t) = 1 and <p{t) would then be
the characteristic function of the constant zero. Furthermore (8) implies
<p(t) Ф 0 for every real t: since if cp(t0) = 0 could hold, (8) would lead to

? R - = ° (и = 1 , 2 , . . . ) (9a)
q ,
and
<p(qnto) = 0 (и = 1 , 2 , . . . ) (9b)
which is impossible both for q > 1 (9a) and for q < 1 (9b) in view of
(p{0) = 1 and the fact that <p{t) is continuous at t = 0.
Thus i//(t) = In cp(t) is continuous too, and i/^(0) = 0. Let clt c2, . . . , c„
be any n positive numbers; according to (7) there exists a c > 0 with

Z <Kc/c 0 =
k =1
Hence, for cx = c2 = . . . = cn = 1 there exists a c(n) with
пф(0 = ф(с(п) t). (10)
Thus
c(n)
# ( 0 = Ф(с(п) t) = тф t .

( n c(ri)
If we put c ----- = -------, we obtain
\ m c(m)

— iKO = Ф c Í— I t . (li)
m \ \ m)

Consequently, there corresponds to every rational r a number c(r) such that


п К 0 = Ф(с(г)1). (12)
We show now that c(r) is uniquely determined and the relation
c(rs) = c(r) c(s) (13)
holds for any two positive rational numbers r and s.
352 C H A R A C T E R IS T IC F U N C T IO N S [V I, § 8

(I d \ П
From il/(at) = il/(bt) and a < b follows = i// — t , and because
b I
of the continuity of ip(t), we have = i^(0) = 0, which is impossible
since the distribution was supposed to be nondegenerate. Hence necessarily
a = b and c(r) is thus unique. Since further

rsijj(t) = \p(c(rs) t) = ф(с(г) c(s) t)

for every t, (13) holds for any two positive rational numbers r, s. Let now
q be a rational number q > 1 and t a real number such that \p(t) Ф 0.
Then
Ф([сШ п 0 = Ч" И 0 - (14)
We have necessarily c{q) > 1, since c{q) < 1 would imply

max 11Км) I^ qn I' K 0


l"lál<l
1

for every n, which contradicts the continuity of on every finite interval.


c(/*) r \
Since —— = c — , r > s implies c(r) > c(s). The function c(r) is thus
c(5) s )
increasing. For every irrational A > 0 we define c(A) by
c(A) = lim c(r),
г-*-Я

where r tends to A through rational values. Because of the continuity of


it follows from (12) that the equalities

т ) = Ф(с(Х)1) (is)
and
c(A/f) = c(A) с(ц) (16)
are valid for every positive value of A and ц and that c(A), as a function
of the real variable A > 0 is increasing. Put

g(x) = ln c{ex) ( - 0 0 < x < +oo), (17)


then it follows that g{x) is increasing and

g(x + = +
y) g(x) g(y) ( )
18
is valid. Hence (cf. Ch. Ill, § 13)

g{x) = — for a>0 (19)


a
V I, § 9 ] C O N D IT IO N A L P R O B A B IL IT Y D IS T R IB U T IO N S 353

and
1
c(x) = x a , for X > 0. (20)
This leads to
*‘ WO = M O (21)
for every real t and positive X. Thus for t > 0 we have
хКХ1) = Х°ф(0 = Гф(Х), (22)

and therefore for X = 1:

iK0=l*l“iKi). for *>o. (23)


Put ф(\) = —(c0 + icj). Since (f( —t) = cp(t), and thus also ip( —t) = ip(t),
we obtain for every real t

И 0 = “ (c° + 1 J7 [ Cl ^ I“ (24)

Because of \q>(/) | < 1, c0 > 0 holds. It remains to show that 0 < a < 2.
But a > 2 would imply 9?"(0) = 0 and hence D2(^) = 0, and thus f would
be a constant. Herewith (6a) is proved.

§ 9. Characteristic functions of conditional probability distributions

In the present paragraph we need a generalization of the concept of


"function” due to L. Schwartz. In order to construct the theory of genera­
lized functions (called distributions by L. Schwartz) we follow the way
suggested by J. Mikusinski and worked out by G. Temple and M. J. Light-
hill.1
Let C denote the set of infinitely often differentiable complex-valued
functions, for which

f (k\ x ) = 0 -гДдГ for И - М -00 (1)


, IX I
for any nonnegative integers к and N. If f(x ) £ C and if a, b are real, and
X is a complex number, then Xf(x) £ C, f(a x + b) £ C, and / №>(х) £ C
(k = 1 , 2 , . . . ) ; furthermore f(x ) £ C, g(x) £ C imply f(x ) + g(x) £ C. Let
К denote the set of all infinitely often differentiable functions g{x) such
that
g{k\x ) = 0(| X 1^*) for IX \ -> oo and к = 1 , 2 , . . . ,
1 Cf. M. J. Lighthill [1].
354 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 9

where the numbers N k are integers. It is easy to see that /(x ) d C and
g(x) d К imply f(x ) ■g(x) d C.
A sequence of functions { f j x ) } ( f n(x) d С; n — 1, 2, . . .) is said to be
regular, if for every h(x) £ C the limit

lim J f n(x)h(x)dx (2)


П-+ + 00 — CO

exists. Two regular sequences {/„(x)} and {gn(x)} are said to be equivalent
if for every h(x) £ C
+ 00 +00

lim I f„(x) h(x) dx= lim ( gn(x) h(x)dx. (3)


n-*+со —oo /i-*- + со —oo

This equivalence relation defines a partition of the set of regular sequences


of functions into classes. A generalized function is an equivalence class of
regular sequences of functions.
In the present paragraph generalized functions will be denoted by capitals
(F(x), G(x),. . . ) and the ordinary functions by lower case letters (/(x),
g{x),. . .). If the regular sequence of functions {/,(*)} defines a generalized
function F(x) we express this fact by a sign F(x) ~ {/„(x)}. If F(x) ~
~ {/„(x)} and if A(x) d C, we define the “integral”
+ co
j F(x) h(x) dx
— CO

by
+ 00 + 00

f F(x)h(x)dx = lim [ f n(x)h(x)dx, (4)


— 00 П -+ + OO — 00

where on the right hand side the limit exists by assumption and remains
the same when {/„(x)} is replaced by another sequence equivalent to it.
If F(x) ~ {/„(x)} is regular, the sequence {x ffa x + b)} is evidently
regular for any two real numbers a, b and for any complex number L
We put therefore XF{ax + b) ~ {)f„{ax + b)}. If F(x) ~ {f n(x)} and
G(x) ~ {^„(x)}, the sequence {f„(x) + g„(x)} is again regular; we put
F(x) + G(x) ~ {/„(x) + g fx ) |. Finally if {/„(x)} is regular and if g(x) £ K,
the sequence {/„(x) g(x) } is regular. For F(x) ~ {/„(x)} we put
F(x)g(x) ~ {/„(x) • g(x)}.
If {/„(x)} is regular, {/„'(x)} is also regular because if h(x) d C we have
+ 00 +00

J fn (x) K x) dx = — J f n(x) h \x ) dx,


— 00 — 00
(5)
V I, § 9] C O N D IT IO N A L P R O B A B IL IT Y D IS T R IB U T IO N S 355

from which the existence of the limit

+ 00 +00
lim J /„' (x) h(x) dx —— j F(x) ti (x) dx
n-*-00 —00 —00

follows. The derivative of a generalized function F(x) ~ {f„(x) } is defined by


F '{ x )~ { a x )} .
Thus we have

j ° F' (x) h(x) dx = - j f F(x) h'(x) dx. (6)


—00 —00
A generalized function is thus infinitely often differentiable. It is easy
to prove the following rules of calculation:
A(T(x) + G(xj) = Щ х ) + ).G(x),
( cF(x))' = cF\x),

(F(x) + G (x ))'= F '(x ) + G'(x),


(F{axj)' = aF'{ax).
Example. Let us put
/ n -I*
2 -

If h{x) ^ C, it is clear that

lim ( f n (x) h(x) dx = h{0).


П-+00 —00
The generalized function {/„(x)}, which we denote by <5(x), is called
Dirac's delta function. For every h(x) g C we have thus

J° <5(x) h(x) dx = h{0).


—00

We prove now some theorems.

T heorem 1. I f f(x) d C and if

<p(t) = j f ( x ) e “x dx ( —o o < / < + o o ) , (7)


—00
<l(t) belongs then to C as well.
356 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 9

Proof. f(x) = 0(\x\~k) for every integer k, hence the integral (7) exists
for every real number t ; further

+oo
J
—oo
\f(x ) \ - \ x \ k dx

exists too. The function q ( t ) is thus infinitely often differentiable and


+ OO

<pM( 0 = J ( i x f f i x ) eitx dx (k = 1, 2, . . . ) . (8)


—00

If we integrate (7) N times by parts we obtain

99(0 = | y j f m (x) eUx dx ( N = 1 , 2 , . . .). (9)


—OO

The integral on the right hand side exists, since f{x ) 6 C, hence cp(t) =
= 0(\t\~N) for |f| -» + 0 0 and for every integer Ж By (8) the same holds for
<-p(k\ t ) since (ix)kf(x ) £ C, hence Theorem 1 is proved.
The function <p(t) is the Fourier transform of f(x).

Theorem 2. I f f(x) £ C and i f

(pit) = J f{x) e,lx dx,


—00
we have the fo llo w i n g inversion f o r m u l a :

+00
Ax) = -2fn J( < p f ) e - u x dt. (10)
— OO

+ 00
Proof. The preceding theorem guarantees the existence of j | (f (t)\dt ard
—00
(10) follows from Theorem 2 of § 4.

Theorem 3 (Parseval’s theorem). I f f(x) С С, g(x) $ C, further i f

A t) = —Jсо A x ) e ilx dx ,
7(0= —JoO 9ix) ei,x dx,
V I, § 9 ] C O N D IT IO N A L P R O B A B IL IT Y D IS T R IB U T IO N S 357

then we have
+00 +0Г
J f(x)g(x)dx = j ^ I <p(-t)y(t)dt. (ill
—00 —00

P roof . By Theorem 2
+ 00

g{x) = j y(i) e~Ux dt.


—oo
Hence
+00 +00 +00
J /(* ) 9{x) dx = ~ ~ j J f(x ) y(x) e~ilx dt dx =
— 00 — 00 — GO

+ 00

= 2^- J у(О ф( - 0 *
—oo

since the order of integration can be interchanged.

T heorem 4. I f { f n(x) } ~ F(x) is a regular sequence o f functions and if


cpn(t) is the Fourier transform o f f„{x), then {g„(t) } ~ Ф(?) is again a regular
sequence o f functions; the generalized function (l>{t) remains invariant when
{fn(x) } F' replaced by an equivalent sequence. For this relation between the
two generalized functions F(x) and 4>(t) we can write

<P(t) = +
f F(x)ei,xdx. (12)
—00
I f h(x) is a function o f the class C and %(t) is its Fourier transform, we
have
+00 +00
j d>(t)x(t)dt = 2n F (x)h(—x )d x . (1 3 )
J J
—CO —00
We say that the generalized function Ф(г) is the Fourier transform of
F(x).
P roof . Let y(t) £ C and
+ oo

h(x) = J x(0 e~Ux dt.


—00
358 C H A R A C T E R IS T IC F U N C T IO N S [VI, § 9

Then, according to Theorem 3


+ 00 +G0

I <Pn( 0 x(0 dt = 2 tt /„ (x) h { -x ) dx.


J J
— 00 — 00

Thus if F(x) ~ {/„(x)} we have

lim ( <fn(t)y ft)d t = 27Г j F{x)h{—x)dx, (14)


n-*-oo —со —со

hence {r/.„(i)} is indeed regular.


It can be seen from (14) that if { f f i x ) } is replaced by an equivalent
sequence, will be replaced by an equivalent sequence as well.
(13) follows immediately from (14).
Let D denote the class of the ordinary measurable functions /(x ) such
that \f(x)\ < A( 1 + \x\N), —oo < X < Too for any integer N and A > 0.
From f{x) £ D, h{x) £ C follows immediately the existence of the integral
+ 00

[' f(x ) h(x) dx.


—00
T heorem 5. I f f(x ) d D, there exists a generalized function F(x) such
that for every h(x) d C
+ 00 +00
j f(x ) h(x)dx = j F{x) h(x) dx. (15)
— oo — oo

P roof . Put
_ + 00

In' Г -y\ -n{x-yy


f„(x) = yf{y)en e 2 dy.
—00

A simple calculation shows that f n(x) d C ; furthermore from the fact


that a Lebesgue integrable function is almost everywhere equal to the
derivative of its indefinite integral we find that lim f,(x ) = fix ) for almost
n-*- oo
every X. According to the convergence theorem of Lebesgue,

lim j f n(x)h(x)dx= j f(x ) h(x) dx,


Я-*-00 - 00 —00
q.e.d.
V I, § 9 ] C O N D IT IO N A L P R O B A B IL IT Y D IS T R IB U T IO N S 359

Let f(x ) £ D and let F(x) ~ { fn(x)} be the generalized function corre­
sponding to it (in a unique manner) according to Theorem 5. Write
fix) ~ Fix).
Let now £ be a random variable on a conditional probability space with
density function f(x ) and assume that f(x ) £ D. The characteristic function
Ф$и) of £ is defined by

Ф( (t) = F(x) e"v dx (16)


— 00

with F(x) ~ f(x) in the sense of (12); Ф4(?) is thus a generalized function. If

j f ix ) dx = 1
— oo

i.e. if f(x ) is an ordinary density function, and if we put as usually

(f i f t ) = J
— 00
f(x ) eixt dx,

then <p((t) ~ the definition of Ф%(t) is thus consistent with the


definition of ordinary characteristic functions ). In fact, in this case
сp^t) is continuous and j <p^{t) | < 1, hence f ^ t ) £2). It suffices thus to
prove that for every y(t) £ C the relation
+00 +00
J <Pt(t)x(t)dt = j <I>i {t)x(t)dt (17)
— 00 —00

holds, where the integral on the left is an ordinary integral. If


+ GO

h(x) = J - [ X (t)e-“x dt,


— 00

we obtain, by proceeding as in the proof of Theorem 3,


+>co +00
j (0 l f ) d t = 2n j h ( - x ) f( x ) dx. ( 18)
— 00 — 00

Furthermore, by Theorem 1,
+ 00 +00

j Ф? (0 x(t) dt = 2n
— 00
J
— on
F(x) h ( -x ) dx. (19)

By the definition of F(x), the right hand sides of (18) and (19) coincide;
we get thus (17).
360 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 9

If fix ) is the density function of a conditional distribution, f{x ) and


consequently (P ft) are only determined up to a constant factor.
Example 1. If £ is uniformly distributed on ( —0 0 . + 00) then f(x ) = 1 .
If h(x) £ C, we can write
+ 00 +00 +00 _ X*_

J / (x) h(x) dx = j h{x) dx = lim \ h(x) e 2n dx; (20)


— 00 — 00 n-*-00 —00
hence for the generalized function F(x) corresponding to f(x), the relation
Í )
F(x) ~ I exp — — г iS valid. Since

+ 00 _ X2 _____ _ nt*_
I e in e“x dx = s jln n e 2, (21)
— 00
we obtain
+ 00

Фе, (0 — I F(x ) e‘x‘ dx ~ 2n I e 2 J ~ 2nS(t). (22)


— 00

The characteristic function of £ is thus the Dirac delta function.


Example 2. Suppose f k(x) = eikx. Find the Fourier transform of /*(*)•
If h(x) Í C we have
+ 00 +00 i k x - x%-
\ f k (a) h(x) dx = lim I h(x) e 2,1 dx,
— 00 «-*-oo — 00

f X
2\1
hence f k(x) ~ jexp ikx - — J . And since

+ 00 jkx _ xl_ ^ _____ _ n(k+ty


[ e 2« £ ix t ^ x _ 2 -л п e 2 ,

—00

we find for the Fouiier transform (Pkit) of f k(t)

Ф *(0~2лМ — e 2 \ = 2nd(k + t).

We introduce now the concept of convergence o f generalized functions.


Let Fk{x) {k = 1 , 2 , . . . ) be a sequence of generalized functions. We say
that Fk(x) tends to the generalized function F{x) (in signs Fk(x) -*■ F(x)),
if for every h(x) £ C
+ 00 + 00
lim j Fk (x) h(x)dx = \ F(x) h(x) dx. (23)
k-*-00 — 00 — 00
V I, § 9 ] C O N D IT IO N A L P R O B A B IL IT Y D IS T R IB U T IO N S 361

T heorem 6. Let {Fk(x) } (k = 1 , 2 , . . . ) be a sequence o f generalized


functions and put

Фк ( 0 = T FK(*) e“x dx (k = 1 ,2 , . . . ) . (24)


—oo

Then Fk(x) - F(x) i f f <Pk{t) - Ф(/). Thus

Ф ( 0 = j° Д х ) е''*Лс.
—00

The proof follows immediately from Theorem 3.


Theorem 6 permits the use of characteristic functions for the establish­
ment of the convergence of distribution functions also in case of conditional
distributions. In this case the characteristic functions are usually not
ordinary, but generalized functions. As an example we prove a limit distri­
bution theorem.

T heorem 7. Let £2 , . . . , , . . . be independent random variables with


the same distribution. A ssumefurther that their distributionfunction is absolutely
continuous with finite variance and bounded density function f(x). Suppose
В Д = 0. Put
£„ = £:1 + Z2 + . .■ + £„. (25)

Then the distribution o f £„ tends to the conditional distribution uniform on


the whole real axis, that is fo r any four real numbers c < a < b < d we have

lim P (a < t;n < b \ c < t n < d) = ^ - . (26)


л-« d -c

P roof . Let f„(x) denote the density function of £„ and <r the standard
deviation of the f(x ) < M implies f„(x) £ D. We show that for every
h(x) £ C
_ + 00 +00

J ^ ( x ) K x ) dx = - f f f j K x)d x. (27)
—00 —00

Relation (27) proves the theorem. Indeed if it holds for every h(x) £ C,
let then be h(a, b, г, x) a function of the class C such that

0 < h(a, b, e, x) < 1 (28)


Я 4
362 CHARACTERISTIC FUNCTIONS [VI, § 9

and
0 for X < a — e,
h(a, b ,e ,x ) = 1 for a < x < b, (29)
0 for b + £ < X.1
We have then
+ 00 b

J fn(x ) h(a + e,b —e, e, x ) d x J f n (x ) d x


_____________________
+ 00
< _i_________
d
<
J fn (*) Kc, d, e, x) dx f /„ (x) dx
— 00 c

+ 00

f fn (x ) b, e, x )d x
<, ----- ^ --------------------------------- . (30)
J f n (x ) h(c + e, d - e, e, x ) d x
— 00

Since

fn (x) dx J
P ( a < t „ < b \ c < t ; n < d ) = d L ------------- , (31)
j fn (x ) d x
C

(27) and (30) enable us to write


+ 00 +00

I h(a + E,b — £,£, x) dx J h{a, b, e, x) dx


^ ------------------------------- < 1 < L < — — -------------------------- , (32)
[ h{c, d, E, x ) d x J h(c + E , d - E ,E ,x ) d x
— 00 — 00

1 These conditions are e.g. fulfilled by the function

h{a, b, f, x) = к ( X - ° + e- \ - к (i L Z * ) ,
where we put
0 for x < 0 ,

Í exP _ w! _ГД dt
k(x) = — —j.--------- ---------- í — for 0< x < 1,
f exp ----------------- dt
J0 - /(1-0
1 for X > 1.
V I , § 9] C O N D IT IO N A L P R O B A B IL IT Y D IS T R IB U T IO N S 363

where
/ = lim inf P(a < < b \c < £„ < d)
«-►CO

and
L = lim sup P{a < £„ < b \ c < £„ < d).
«-►00

When e -* 0, the first and the last member of the threefold inequality (32)
b —a
a—
c
tend to —------, hence (27) implies (26).
Let now be F„(x ) ~ f , ( x ) and let 99,/i) be the Fourier transform of
a yJlTt n f n(x). By Theorem 6 it suffices to prove that Ф„(г) — ő(t), where
<Pn(t) is the generalized function corresponding to <p„(t) (n = 1 , 2 , . . . )
and <5(f) is Dirac’s delta.
Put
<p(t) = J f i x ) e Ux d x .
— 00

We see that cpn(t) = a J i n n <p\t). We have to show that for every /(f) £ C
one has
__ +00

lim<T J l í n 1 cPn^ ^ dt==^ - (33)


—oo

The proof can be carried out by means of the method of Laplace (cf. Ch.
Ill, § 18, Exercise 27).
By Theorem 11 of § 2 we have | 99(f) | < 1 for t Ф 0; furthermore by
Theorem 8 of § 2
lim 99(f) = 0 .

Hence there can be assigned to every e > 0 a q = q(s) with 0 < q(s) < 1
such that I 99(f) I < q(e) for | f | > e. But then we have
_____ +00

a jj ф "(0 x ( 0 * < a j n Ы Е)]п ) \x(t) \dt. (34)


jf | > £ -0 0

On the other hand, for every | f | < e

(7^ t 2
\ncp (f) = ----- — (1 + 17(f)) with lim 17(f) = 0 .
2* / —00
364 C H A R A C T E R IS T IC F U N C T I O N S [V I, § 9

By introducing a new variable и = to ^Jn, we obtain


_ +8

—s

+eajn

= 7 5 ] ’ 'exp[ - i (1 + ’,( ^ s) ) 1 4 ^ ) ‘/“-


— £<7 / n

Since y(i) is continuous for t = 0 and is bounded, it follows from Lebesgue’s


theorem that
_ +£
[ <pn(t)x(t)dt = x(0). (35)
—£
(35) and (34) lead to (33).
Let us remark that the assumptions of Theorem 7 can be considerably
weakened.
The product of two generalized functions is generally not defined. The
way which would seem quite natural to follow leads astray: the regularity
of {/„(*)} and {#„(*)} does not, in general, imply the regularity of
{ f h{x)gn(x)}. Just take as an example

Here for every h(x) £ C with h(0) ф 0 we have


+ CO
и C
lim —— h(x) е~"*гdx \ = + oo.
и-«, 2тг J I
— CO

Consequently, S2(x) has no sense at all.


So far we did not define the characteristic function of a random variable
defined on a conditional probability space unless the random variable
had a density function. Now we have to deal with the general case. Let
■T(x) be the distribution function of £. Suppose that there exists a genera­
lized function such that

j F(x) h(x) dx = [ h(x) d T (x ) (36)


—00 —00
V I, § 9 ] C O N D IT IO N A L P R O B A B IL IT Y D IS T R IB U T IO N S 365

for every h{x) £ C, where on the right hand side figures an ordinary Stieltjes
integral. Then Ф(0, the Fourier transform of F(x), will be considered as
the characteristic function of the random variable £.
-ТТУШ
Example. Suppose that £ is uniformly distributed on the set of the integers,
i.e. the distribution function of £ is given by [x] ([x] represents the integer
part of x, i.e. the largest integer smaller than or equal to x). In this case

j h(x) d.Tix) = Z h(k); (37)


—oo к =—oo
there exists a generalized function fulfilling (36), namely

F(x) = +
f <5(x - k). (38)
k = —co

it is easy to show that


+ oo
Ф(0 = 2тг Z ö ( t- 2 k n ) . (39)
k =—oo

If we apply (13) to any function h(x) £ C and to an F(x) defined by


(38), we find
+f h(k) = +£ X(2kn), (40)
k = —oo k ——со

where yff) is the Fourier transform of h{x). (40) is Poisson's well-known


summation formula. In particular, if h{x) = exp ( —х2Я2), then

*((> = A eXP [ 4P- J

and it follows from (40) that


+ 00 /_ +00 k ! K8

Z e- ^ = v i L z e“ r‘~ . (41)
k =—oo ^ к——со
This is a formula known from the theory of 0-functions. We shall need
it later on.
Now follows a theorem similar to Theorem 7.1

T heorem 8. Let £1; £2, . . be independent integer valued random vari­


ables having the same distribution. Suppose that their expectation is zero

1 It was proved by K. L. Chung and P. Erdős [1 ] under weaker conditions.


366 C H A R A C T E R IS T IC F U N C T IO N S [V I, §9

and their variance finite. Suppose further that the greatest common divisor
o f the values assumed by £,L—c2 with positive probabilities is equal to 1. Put
C„ = £ 1 + £ 2 + • • • + ín (n = 1 , 2 , . . . ) , then for any two integers к and l

Hence when n -> 0 0 , the distribution o f tends to the uniform distribution


on the set o f integers.

Proof. Let D(ck) = a. If we show that

lim P(f„ = k) = 1 (k = 1 ,2 ,. .. ), (43)


Л -* - OO

the theorem is proved. Let


+ 00
9>(0= Z
k=—00
P ( ^ = k )e ikt (44)

be the characteristic function of the random variables We have


+ 7Г

P (L = k ) = - ^ ( <Pn( t) e - ik‘ dt. (45)


— It

Since by assumption <p(0) = 1, <p'(0) = 0, y "(0 ) = —a1 and | 99 (?) | < 1


for 0 < \ t \ < n , the method of Laplace (cf. Ch. Ill, § 18, Exercise 27)
leads immediately to the result.
This result can be rewritten in the following manner:
_____+00 +00
lim a J i n n j <p" (t) h(t) dt = 2 k h{2kn) (46)
«-*-00 —00 k ——со

for every h(x) 6 C; hence a J i n n tends for n —►со to the generalized


function (39). Thus if Fn{x) is a generalized function such that
+ 00 ________ + CO

j F n (*) K x ) dx = aJ i n n X P (L = k) h(k),
—со к= —oo

F f x) tends for n —> 0 0 to the generalized function (38).


VI, § 10] EXERCISES 367

§ 10. Exercises

1. Prove the following


T h e o r e m . If £ is a discrete-valued random variable taking on the values xk (x, ф xk
for j ф k) wi'h probabilities pk (k = 1 , 2 , . . . ) and if <p4 (t) is the characteristic function
o f {, we have
T
i’n = hm ~ j <P( ( 0 e~,Xn' dt (n = 1, 2 , . . .).
-T
Hint. The series

<ft (0 = 1 A e iXn'
n= 1
is absolutely and uniformly convergent, hence it can be integrated term by term.
Furthermore, since for every nonzero real number x
T
lim i e,x' dt = 0,
T—►oo J
-T
the theorem follows immediately.
2. Let { be an integer-valued random variable and let <pt (t) be its characteristic
function. Prove that
71

A f = k) = j <ps ( t ) e - ,k' dt (k = 0, + 1, + 2 , . . .).


— 71

3. Prove the theorem of Moivre and Laplace by means of the result of the preceding
exercise.
Hint. By Exercise 2
Я

( I ) Pk d " - k = (P e " + Я)" e ~ M d t -


—71
2
For Iк — np\ = 0 \ j i 3 ) the method of Laplace leads after some calculations to

3 - f (p,» + » ) ■ ,- « ä, = exp — + О 'Ц .


2nJ K ^ 2nnpq V lnP4 J 1" J
4. Prove the following characteristic property of the normal distribution: Let
F(x) be an absolutely continuous distribution function, let F'(x) = f(x) and
+03
J x2f ( x ) d x = 1.
— 00
If we put
H (f(x)) = - J f { x ) In f ( x ) dx,
— CO
368 C H A R A C T E R IS T IC F U N C T I O N S [V I, S IP

we have
t f (/O )) < l n ^ 2 ne ,
_ 1 .
where equality holds only for f(x) = (2л) 2 exp j ---- — . Hence # ( /( x ) ) assumes
its largest value in the case of the normal distribution. (In information theory the
number H(f(x)) is called the entropy of the distribution with the density function fix) ;
cf. Appendix.)
5. If E(£) exists, then we know that <pt(t) is differentiable at / = 0 and <p's(0) = iE(^).
Show that the differentiability of <p?(i) does not necessarily imply the existence of £■(£).

Hint. Put

n s = '0 = P(S = - n) = -nljY\2nn


— (n > 2),
with

We find
Д cos nt , +, sin nt
* (,) - 2c £ 7 Í 7 ’ * (,) - - 2c, 5 T to T •
The trigonometric series qi\{t) is uniformly convergent1 and <p^(0) = 0. Nevertheless.
E(Z) does not exist.
6 . Let £ be a random variable and = £(|{|"), a > 0. Suppose that Af„ is finite.
Show that if 0 < ß < a ,
i i
(Mp)ß < ( M J “.
p ap bq
Hint. For positive a and b p > 1, q = -------— we have2 a b < — + — .
P —1 P Q
Apply this inequality with

\s \ß . . <
*■
a = ------- T . 6=1, P= j .
(A Q a
7. Study the limit distribution of the multinomial distribution

= p" ' ■• ■^ h - o > k>= *)


г
when N —> 0 0 , with = 1 and lim NpNj — A; (j — 1, 2 , . . . , r — 1).
/= 1 JV— со

Hint. The (/• — l)-dimensional characteristic function of the multinomial distri­


bution is
(1 + 2 PNy > - \ > ) \
/= 1

1 Cf. e.g. A. Zygmund [1], p. 108.


2 Cf. e.g. G. H. Hardy, J. E . Littlewood and G. Pólya [1], p. 111.
V I, § 10] E X E R C IS E S 369

hence from lim NpN] = A; follows


N - + CO

lim ( 1 + X PNy > - 1) ) N = f [ exp [Я, (e"í - 1)].


N-*-00 /=1 /=1
8. a) Let i be a random variable of zero expectation and put

m = 7 I* Idf(x).
— CO

If rp(t) is the characteristic function of cj, we have


CO

1 Г 1 - Re(V(0)
m = — -------- dt ( 1)
71 I t*
— CO

(Here as usual Re(z) denotes the real part of z.)


Hint. From

1 - Re(<p(0) ■= J — CO
(1 - cos xt)dF(x)

follows

— CO — CO — 00

Hence the relation (1) follows according to Formula (40) of § 4.


Remark. It can be shown that the necessary and sufficient condition for the existence
of d(£) is the existence of the integral

j~ 1 —Re (y(0) df

b) If we add to the assumption a) the other one that the variance of { exists, we
have further

« . - i f ü M *
71 J t
---CO

Hint. This can be obtained from a) using integration by parts.

9. Let i t, £2, . . be independent random variables having the same distribution


which is symmetric with respect to the origin and has variance 1. Consider the sums
Í* = fi + { * + . . . + {«■ Show that

„tn-^Q = /T
*-co D(C„) V 71
370 C H A R A C T E R IS T IC F U N C T I O N S [VI, § 10

Hint. If rp{t) is the characteristic function of the random variables £„, we have

9>CWi(0 = 9>M=)1 •
L
Since (pit) is real for every real t, we obtain, by taking into account Exercise 8,

_ CO

</(?*) Jn r tp'(u)
~n7F\
0(C „) = - 31 .) V (") -----
и du-
—со

From this we obtain the required result by the method of Laplace.

10. If Ф(t) is the Fourier transform of the generalized function F(x), the Fourier
transform of F(ax + b) is
1 - "b ( t\
-r— e « Ф — .
Ia I 1a )
11. With the same notations, the Fourier transform of F'(x) is — it Ф(').
12. a) With the preceding notations, the Fourier transform of x"F(x) is ( —
b) If the conditional density function of £ is x2" (n = 1, 2, . . . ), the (generalized)
characteristic function of £ is 2тг(—1)" <У'П) (0, where <5(I) denotes Dirac’s delta
(cf. § 9, p. 355).
13. a) Let £„ •> £« be independent random variables having the same normal
distribution, £(£*) = m, £ = — (£t + £2 + . . . + £„), and
n

I (£i - Ö2 + (£, - £)* + . . . + (£, - £)2


^ n —1
Show that
Jn
г - -V-----------------
(£ - m)
j
has Student’s distribution with n — 1 degrees of freedom.
b) Let £t, ..., £„+m be independent random variables having the same normái
distribution. Put

£m = —({, + ... + £„), £<2>= — (£л+1 + . .. + £„+m),


n m
Iя 3 /Г+m _
/ Ё «* - £(1‘)2 + £ (£* —£<2))2
t _л / ini_______ <:="+1 .
V л + /и — 2
Show that
_ £л > - (f<2> / ллг
i \ n + m
has Student’s distribution with n + w — 2 degrees of freedom.
14. Prove that the following property characterizes the normal distribution: If $f(x)$ $(-\infty < x < +\infty)$ is a continuously differentiable positive density function such that for any three real numbers $x$, $y$, $z$ the function
$$f(x-t)f(y-t)f(z-t)$$
has its maximum at $t = \frac{x+y+z}{3}$, then
$$f(x) = \frac{1}{\sqrt{2\pi a}}\exp\left[-\frac{x^2}{2a}\right].$$

Hint. By assumption, $f(x)$ is positive. If we put $g(x) = \frac{f'(x)}{f(x)}$ and $s = \frac{x+y+z}{3}$, we have
$$g(x-s) + g(y-s) + g(z-s) = 0. \tag{2}$$
For $x = y = z$, we obtain $g(0) = 0$. Take now any two $x$ and $y$ and put $z = \frac{x+y}{2}$; then $z = s$, and the relation (2) and $g(0) = 0$ lead to
$$g\left(\frac{x-y}{2}\right) + g\left(\frac{y-x}{2}\right) = 0.$$
Hence $g(x)$ is an odd function. If we put $u = x-s$, $v = y-s$ and thus $z-s = -(u+v)$, we can write Formula (2) in the form
$$g(u) + g(v) = g(u+v). \tag{3}$$
Since, by assumption, $g(u)$ is continuous, we know that (3) implies
$$g(x) = \frac{f'(x)}{f(x)} = Cx \qquad (C = \text{constant}).$$
Hence by integrating,
$$f(x) = A\exp\left[\frac{Cx^2}{2}\right].$$
As $f(x)$ is a density function, we have $C = -\frac{1}{a}$ and $A = \frac{1}{\sqrt{2\pi a}}$, with $a > 0$.

Remark. The result is valid under weaker conditions too.¹

¹ Cf. Á. Császár [2].

15. If $\xi_1$ and $\xi_2$ are independent random variables having the same nondegenerate distribution with finite variance and if the random variable $a\xi_1 + b\xi_2$ $(0 < a \le b < 1;\ a^2 + b^2 = 1)$ has again the same distribution, then this distribution is normal with expectation zero.

16. The distribution having the density function
$$f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2x}}\,x^{-\frac{3}{2}} \qquad (x > 0)$$
is stable and corresponds to the parameters $\alpha = \frac{1}{2}$, $\beta = -1$, $\gamma = 0$, $c = 1$.¹

17. If $\varphi(t)$ is a characteristic function, then
$$\Phi(t) = \frac{1}{t}\int_0^t \varphi(u)\,du$$
is a characteristic function as well.

18. Show that the gamma distributions are infinitely divisible.

19. Let $\zeta(s)$ denote Riemann's zeta function,
$$\zeta(s) = \sum_{n=1}^{\infty}\frac{1}{n^s} \qquad \text{with } s = \sigma + it,\ \sigma > 1.$$
Show that the function
$$\varphi_\sigma(t) = \frac{\zeta(\sigma + it)}{\zeta(\sigma)}$$
is the characteristic function of an infinitely divisible distribution.
20. We know (Ch. IV, § 10) that the quotient of two independent $N(0,1)$ random variables has a Cauchy distribution. Show that this property is not characteristic for the normal distribution: if $\xi$ and $\eta$ are independent, have the same distribution of zero expectation and if $\frac{\xi}{\eta}$ has a Cauchy distribution, then it does not follow that $\xi$ and $\eta$ are normally distributed.

Hint. Take for density function of $\xi$ and $\eta$
$$f(x) = \frac{\sqrt{2}}{\pi}\cdot\frac{1}{1+x^4} \qquad (-\infty < x < +\infty).$$
We obtain for the density function of $\frac{\xi}{\eta}$
$$g(x) = \frac{2}{\pi^2}\int_{-\infty}^{+\infty}\frac{|y|\,dy}{(1+y^4)(1+x^4y^4)} = \frac{1}{\pi(1+x^2)}.$$

Remark. This example is due to Laha.²

¹ $f(x)$ is the density function of $\xi^{-2}$, where $\xi$ is $N(0,1)$; this distribution is sometimes called the "inverse normal distribution".
² Cf. R. G. Laha [1].
CHAPTER VII

LAWS OF LARGE NUMBERS

§ 1. Chebyshev's and related inequalities

In the present paragraph we shall deal with an inequality due to Chebyshev and with some similar inequalities, all needed in the proofs of the laws of large numbers. First we prove the famous inequality of Chebyshev.¹

Theorem 1. Let $\xi$ be any random variable with expectation $M = E(\xi)$ and with a positive finite standard deviation $D = D(\xi)$. If $\lambda$ is a real number, $\lambda > 1$, we have
$$P(|\xi - M| > \lambda D) < \frac{1}{\lambda^2}. \tag{1}$$

Remark. If $0 < \lambda \le 1$, (1) remains valid, but becomes trivial.

Proof. If we apply Markov's inequality (cf. Ch. IV, § 13, Theorem 1) to the random variable $\eta = (\xi - M)^2$ with $\lambda^2$ instead of $\lambda$, we obtain immediately (1).

If we apply Markov's inequality to other positive functions of $\xi$, we obtain other inequalities, related to Chebyshev's inequality. Thus for instance if we put $\eta = |\xi - M|^{\alpha}$ $(\alpha > 0)$ and
$$M_\alpha = E(|\xi - M|^{\alpha}) \tag{2}$$
(thus $M_\alpha$ is the $\alpha$-th absolute central moment of $\xi$), then we get
$$P(|\xi - M| > \lambda D) < \frac{M_\alpha}{\lambda^\alpha D^\alpha}. \tag{3}$$
Of course for $\alpha = 2$, (3) reduces to (1). In order to get an inequality as sharp as possible, we have to choose $\alpha$ in such a manner that $\frac{M_\alpha}{(\lambda D)^\alpha}$ should be as small as possible.

We can also apply Markov's inequality to the random variable $\eta = e^{\varepsilon(\xi-M)}$. If we put
$$E\left(e^{\varepsilon(\xi-M)}\right) = \mathcal{M}(\varepsilon), \tag{4}$$
we obtain
$$P\left(e^{\varepsilon(\xi-M)} \ge \mathcal{M}(\varepsilon)\,e^t\right) \le e^{-t}, \tag{5}$$
where $t$ is a positive number; the exponential function being monotone, it follows for $\varepsilon > 0$ that
$$P\left(\xi \ge M + \frac{t + \ln \mathcal{M}(\varepsilon)}{\varepsilon}\right) \le e^{-t}. \tag{6}$$
In order to get the sharpest possible bound we have to choose $\varepsilon$ such that the expression $\frac{t + \ln\mathcal{M}(\varepsilon)}{\varepsilon}$ is minimal or at least nearly minimal.

In § 4 an improvement of Chebyshev's inequality, which is due to Bernstein, will be deduced from inequality (6).

¹ Called also the Bienaymé–Chebyshev inequality.
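
Chebyshev's inequality (1) can be checked numerically. The following sketch in Python with NumPy is an illustration appended here; the exponential population, the sample size and the random seed are arbitrary choices made for the example only.

    import numpy as np

    rng = np.random.default_rng(0)
    # exponential distribution with scale 1: M = 1, D = 1
    xi = rng.exponential(scale=1.0, size=1_000_000)
    M, D = 1.0, 1.0

    for lam in (2.0, 3.0, 5.0):
        # empirical tail probability P(|xi - M| > lam * D)
        emp = np.mean(np.abs(xi - M) > lam * D)
        # Chebyshev bound 1 / lam^2
        print(f"lambda = {lam}: empirical = {emp:.5f}, bound = {1/lam**2:.5f}")

The empirical tail probabilities stay below the bound $1/\lambda^2$, usually by a wide margin; this is in accordance with the remark above that (1) is far from sharp for particular distributions.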

§ 2. Stochastic convergence

We have mentioned already in Chapter III, § 17 the most elementary case of the laws of large numbers, discovered already by Jacob Bernoulli. In order to prove a more general theorem, we introduce first the concept of stochastic convergence, due to Slutsky.

If $\zeta_1, \zeta_2, \ldots$ is a sequence of random variables for which the relation
$$\lim_{n\to\infty} P(|\zeta_n| > \varepsilon) = 0 \tag{1}$$
holds for every positive $\varepsilon$ however small, then the sequence $\zeta_n$ $(n = 1, 2, \ldots)$ is said to converge stochastically (or in probability) to zero. If the random variables $\zeta_n$ $(n = 1, 2, \ldots)$ fulfil the relation
$$\lim_{n\to\infty} P(|\zeta_n - a| > \varepsilon) = 0 \tag{2}$$
for any fixed $\varepsilon > 0$, we shall say that the sequence $\zeta_n$ $(n = 1, 2, \ldots)$ converges in probability (or stochastically) to the constant $a$ and indicate this by
$$\lim_{n\to\infty}\operatorname{st}\zeta_n = a \tag{3}$$
or by
$$\zeta_n \xrightarrow{P} a. \tag{4}$$

With this definition Bernoulli's theorem may be formulated as follows: In a series of independent experiments the relative frequency of the event $A$ tends stochastically to the probability $P(A)$ of $A$ when the number of experiments increases infinitely.

Bernoulli's theorem is an immediate consequence of Chebyshev's inequality, established in the preceding paragraph. In fact, let $\zeta_n = \frac{k}{n}$ be the relative frequency of the event $A$ in a series of $n$ independent experiments. The random variable $n\zeta_n$ has a binomial distribution with expectation $np$ and standard deviation $\sqrt{npq}$ $(q = 1-p)$. Chebyshev's inequality leads to
$$P\left(|\zeta_n - p| > \lambda\sqrt{\frac{pq}{n}}\right) < \frac{1}{\lambda^2}. \tag{5}$$
In particular, if we put $\lambda = \varepsilon\sqrt{\dfrac{n}{pq}}$, (5) becomes
$$P(|\zeta_n - p| > \varepsilon) < \frac{pq}{n\varepsilon^2}; \tag{6}$$
if now $n$ tends to infinity, the expression on the right of (6) tends to 0, which proves the theorem.
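
Bernoulli's theorem and the bound (6) lend themselves to a small simulation. The following Python sketch (an added illustration; the values of $p$, $\varepsilon$, the numbers of trials and the seed are arbitrary) estimates $P(|\zeta_n - p| > \varepsilon)$ from many independent series of experiments and compares it with $\frac{pq}{n\varepsilon^2}$.

    import numpy as np

    rng = np.random.default_rng(1)
    p, eps = 0.3, 0.02
    q = 1 - p
    for n in (500, 2000, 8000):
        # 10000 independent series of n trials; zeta is the relative frequency
        zeta = rng.binomial(n, p, size=10_000) / n
        emp = np.mean(np.abs(zeta - p) > eps)
        bound = p * q / (n * eps**2)
        print(f"n = {n}: empirical = {emp:.4f}, Chebyshev bound = {bound:.4f}")

For small $n$ the bound exceeds 1 and is trivial; as $n$ grows both the bound and the empirical probability tend to 0, as the theorem asserts.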
The definition of stochastic convergence can also be given in the following form: the sequence $\zeta_n$ $(n = 1, 2, \ldots)$ converges stochastically to the number $p$ when to every pair $\varepsilon, \delta$ of positive numbers (however small) there can be chosen a number $N = N(\varepsilon, \delta)$ so that for every $n > N$
$$P(|\zeta_n - p| > \varepsilon) < \delta. \tag{7}$$
This condition can also be expressed in terms of the distribution function $F_n(x)$ of $\zeta_n$. In fact (7) is equivalent to
$$\lim_{n\to\infty} F_n(x) = \begin{cases} 0 & \text{for } x < p, \\ 1 & \text{for } x > p. \end{cases} \tag{8}$$
If $D_p(x)$ denotes the (degenerate) distribution function of the constant $p$, (8) is equivalent to
$$\lim_{n\to\infty} F_n(x) = D_p(x) \qquad \text{for } x \ne p. \tag{9}$$
Conversely, it is easy to see that the stochastic convergence of $\zeta_n$ to the constant $p$ follows from (8) or (9).

The concept of stochastic convergence can be generalized still further. We say that a sequence of random variables $\zeta_n$ $(n = 1, 2, \ldots)$ tends in probability (or stochastically) to the random variable $\zeta$, if for every positive $\varepsilon$ one has
$$\lim_{n\to\infty} P(|\zeta_n - \zeta| > \varepsilon) = 0; \tag{10}$$
in this case we write
$$\lim_{n\to\infty}\operatorname{st}\zeta_n = \zeta \tag{11}$$
or
$$\zeta_n \xrightarrow{P} \zeta. \tag{12}$$

It is easy to prove the following

Theorem 1. If $\zeta_n \xrightarrow{P} \zeta$, the distribution function $F_n(x)$ of $\zeta_n$ tends to the distribution function $F(x)$ of $\zeta$ at every point of continuity of the latter.

Proof. If $A_n$ is the event $|\zeta_n - \zeta| \le \varepsilon$, we have
$$P(\zeta_n < x) = P(\zeta_n < x \mid A_n)P(A_n) + P(\zeta_n < x \mid \bar A_n)P(\bar A_n), \tag{13a}$$
hence
$$P(\zeta_n < x) \le P(\zeta_n < x \mid A_n) + P(\bar A_n). \tag{13b}$$
But
$$P(\zeta_n < x \mid A_n) \le P(\zeta < x + \varepsilon \mid A_n) \le \frac{P(\zeta < x + \varepsilon)}{P(A_n)}. \tag{14}$$
(13b) and (14) imply
$$\limsup_{n\to\infty} P(\zeta_n < x) \le P(\zeta < x + \varepsilon) \qquad \text{for every } \varepsilon > 0. \tag{15}$$
On the other hand, by (13a),
$$P(\zeta_n < x) \ge P(A_n)P(\zeta_n < x \mid A_n) \ge P(\zeta < x - \varepsilon) - P(\bar A_n). \tag{13c}$$
From this follows
$$\liminf_{n\to\infty} P(\zeta_n < x) \ge P(\zeta < x - \varepsilon) \qquad \text{for every } \varepsilon > 0. \tag{16}$$
Since $\varepsilon$ can be chosen arbitrarily small, (15) and (16) imply the statement of Theorem 1.
§ 3. Generalization of Bernoulli's law of large numbers

In the preceding paragraph Chebyshev's inequality was applied to the random variable $n\zeta_n$, where $\zeta_n$ denotes the relative frequency of the event $A$ (with $P(A) = p$) in a sequence of $n$ independent experiments. Obviously,
$$\zeta_n = \frac{\xi_1 + \xi_2 + \cdots + \xi_n}{n},$$
where $\xi_k$ $(k = 1, 2, \ldots)$ is the indicator of the event $A$ in the $k$-th experiment, that is
$$\xi_k = \begin{cases} 1 & \text{if the event } A \text{ occurs at the } k\text{-th experiment}, \\ 0 & \text{otherwise}. \end{cases}$$
The $\xi_k$ are by assumption identically distributed, independent random variables, assuming the values 0 and 1 only. Their expectation is $E(\xi_k) = p$. Bernoulli's theorem states that
$$\lim_{n\to\infty}\operatorname{st}\frac{\xi_1 + \xi_2 + \cdots + \xi_n}{n} = E(\xi_k); \tag{1}$$
i.e. that the empirical mean tends in probability to the common expectation of the $\xi_k$. It is easy to show that this property remains valid for arbitrary independent identically distributed random variables with finite variance.

Theorem 1. Let $\xi_n$ $(n = 1, 2, \ldots)$ be pairwise independent and identically distributed random variables with finite expectation $E(\xi_n) = M$ and variance $D^2(\xi_n) = D^2$. Then
$$\lim_{n\to\infty}\operatorname{st}\frac{1}{n}\sum_{k=1}^n \xi_k = M.$$

Proof. Put
$$\zeta_n = \frac{1}{n}\sum_{k=1}^n \xi_k.$$
We have
$$E(\zeta_n) = M \qquad \text{and} \qquad D(\zeta_n) = \frac{D}{\sqrt n}.$$
By applying Chebyshev's inequality we obtain
$$P(|\zeta_n - M| > \varepsilon) < \frac{D^2}{n\varepsilon^2},$$
which proves the statement of Theorem 1.
The suppositions of Theorem 1 can be weakened. It is not necessary to assume that the $\xi_k$ are identically distributed; it suffices to assume the existence of the limit
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n E(\xi_k) = M$$
and the validity of
$$\lim_{n\to\infty}\frac{S_n}{n} = 0,$$
where
$$S_n = \sqrt{\sum_{k=1}^n D^2(\xi_k)}.$$
Thus we obtain a form of the law of large numbers which is due to A. A. Markov:

Theorem 2. Let $\xi_k$ $(k = 1, 2, \ldots)$ be pairwise independent random variables such that $M_k = E(\xi_k)$ and $D_k = D(\xi_k)$ $(k = 1, 2, \ldots)$ are finite. Suppose further the validity of the following two conditions:

a) the limit
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n M_k = M \tag{2}$$
exists and is finite;

b) for
$$S_n = \sqrt{\sum_{k=1}^n D_k^2}$$
we have¹
$$\lim_{n\to\infty}\frac{S_n}{n} = 0. \tag{3}$$
Then for the random variable
$$\zeta_n = \frac{1}{n}\sum_{k=1}^n \xi_k$$
the relation
$$\zeta_n \xrightarrow{P} M$$
is valid.

¹ This condition is certainly fulfilled e.g. if the random variables $\xi_k$ (or at least the numbers $D_k$) are uniformly bounded.
Proof. Similarly as in proving Theorem 1, we apply Chebyshev's inequality to
$$\zeta_n^* = \frac{1}{n}\sum_{k=1}^n (\xi_k - M_k).$$
Taking into account that $E(\zeta_n^*) = 0$ and $D(\zeta_n^*) = \frac{S_n}{n}$, we obtain the relation
$$\lim_{n\to\infty}\operatorname{st}\zeta_n^* = 0.$$
Now
$$\zeta_n^* = \zeta_n - \frac{1}{n}\sum_{k=1}^n M_k$$
and by assumption
$$\left|\frac{1}{n}\sum_{k=1}^n M_k - M\right| < \frac{\varepsilon}{2}$$
if $n$ is large enough. As $|\zeta_n - M| > \varepsilon$ can hold only if $|\zeta_n^*| > \frac{\varepsilon}{2}$, it follows
$$\lim_{n\to\infty}\operatorname{st}(\zeta_n - M) = 0. \tag{4}$$

The assumptions of the above theorem can still be weakened. Instead of the pairwise independence of the $\xi_k$ it suffices to assume that there does not exist a strong positive correlation between most pairs. More precisely, the following theorem holds, due essentially to S. N. Bernstein:

Theorem 3. Let $\xi_k$ $(k = 1, 2, \ldots)$ be random variables with finite expectations $M_k = E(\xi_k)$ and variances $D_k^2 = D^2(\xi_k)$ $(k = 1, 2, \ldots)$. Let $R_{ij}$ denote the correlation coefficient of $\xi_i$ and $\xi_j$. Assume further the validity of the following three conditions:

a) There exists the limit
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n M_k = M;$$

b) For $S_n^2 = \sum\limits_{k=1}^n D_k^2$ we have $S_n^2 \le Kn$, where $K$ is a constant independent of $n$;

c) $R_{ij} \le R(|i-j|)$, where $R(k)$ is a nonnegative function of $k$ such that $R(0) = 1$ and
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n R(k) = 0.$$
Then $\zeta_n = \frac{1}{n}\sum\limits_{k=1}^n \xi_k$ converges in probability to $M$:
$$\zeta_n \xrightarrow{P} M. \tag{5}$$

Proof. If we consider the proof of Theorem 2, we see that it suffices to prove the relation
$$\lim_{n\to\infty} D(\zeta_n) = 0; \tag{6}$$
if this is done, the remaining part of the proof can be repeated word by word. We prove therefore (6). We have
$$D^2(\zeta_n) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n D_iD_jR_{ij} \le \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n D_iD_jR(|i-j|),$$
hence
$$D^2(\zeta_n) \le \frac{1}{n^2}\left(S_n^2 + 2\sum_{k=1}^{n-1} R(k)\sum_{i=1}^{n-k} D_iD_{i+k}\right). \tag{7}$$
By Cauchy's inequality we have
$$\sum_{i=1}^{n-k} D_iD_{i+k} \le S_n^2;$$
if we put this into (7) we obtain
$$D^2(\zeta_n) \le \frac{S_n^2}{n^2} + \frac{2S_n^2}{n}\left(\frac{1}{n}\sum_{k=1}^{n-1} R(k)\right). \tag{8}$$
Hence by condition b)
$$D^2(\zeta_n) \le \frac{K}{n} + 2K\left(\frac{1}{n}\sum_{k=1}^{n-1} R(k)\right).$$
Because of condition c) the right hand side tends to zero; relation (6) follows and Theorem 3 is herewith proved.

We return now to the case of pairwise independent random variables with the same distribution. Let $M$ denote the (finite) expectation of the $\xi_k$. It was shown by Khintchine that in this case the mean value
$$\zeta_n = \frac{1}{n}\sum_{k=1}^n \xi_k$$
tends in probability to $M$ when $n$ increases, even if $D(\xi_k)$ does not exist.
Thus we have

Theorem 4. Let $\xi_k$ be pairwise independent and identically distributed random variables and suppose that the expectation
$$E(\xi_k) = M \tag{9}$$
exists. Then for
$$\zeta_n = \frac{1}{n}\sum_{k=1}^n \xi_k$$
one has
$$\zeta_n \xrightarrow{P} M. \tag{10}$$

Proof. Without restricting generality we may assume $M = 0$. Put
$$\xi_k^* = \begin{cases} \xi_k & \text{for } |\xi_k| \le k, \\ 0 & \text{for } |\xi_k| > k. \end{cases} \tag{11}$$
If $F(x)$ is the distribution function of the random variables $\xi_k$, we have
$$\frac{1}{n}\sum_{k=1}^n E(\xi_k^*) = \frac{1}{n}\sum_{k=1}^n\int_{-k}^{+k} x\,dF(x). \tag{12}$$
Since by assumption
$$\lim_{k\to\infty}\int_{-k}^{+k} x\,dF(x) = \int_{-\infty}^{+\infty} x\,dF(x) = 0,$$
we can write
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n E(\xi_k^*) = 0. \tag{13}$$
On the other hand
$$D^2(\xi_k^*) \le E(\xi_k^{*2}) = \int_{-k}^{+k} x^2\,dF(x),$$
hence
$$\frac{1}{n^2}\sum_{k=1}^n D^2(\xi_k^*) \le \frac{1}{n}\int_{-n}^{+n} x^2\,dF(x) \le \frac{1}{\sqrt n}\int_{-\sqrt n}^{+\sqrt n}|x|\,dF(x) + \int\limits_{|x|>\sqrt n}|x|\,dF(x) \tag{14}$$
and consequently
$$\lim_{n\to\infty}\frac{1}{n^2}\sum_{k=1}^n D^2(\xi_k^*) = 0. \tag{15}$$
If we put
$$\zeta_{n,r}^* = \frac{1}{n}\left(\sum_{k=1}^{r}\xi_k + \sum_{k=r+1}^{n}\xi_k^*\right),$$
Theorem 2 implies
$$\lim_{n\to\infty}\operatorname{st}\zeta_{n,r}^* = 0.$$
On the other hand we have
$$P(\zeta_{n,r}^* \ne \zeta_n) \le \sum_{k=r+1}^{\infty} P(\xi_k^* \ne \xi_k) = \sum_{k=r+1}^{\infty}\int\limits_{|x|>k} dF(x) \le \int\limits_{|x|>r}|x|\,dF(x). \tag{16}$$
If $r$ is sufficiently large, we have for any $\delta > 0$ the inequality
$$P(\zeta_{n,r}^* \ne \zeta_n) < \delta.$$
Hence for any $\varepsilon > 0$
$$P(|\zeta_n| > \varepsilon) = P(|\zeta_n| > \varepsilon,\ \zeta_{n,r}^* \ne \zeta_n) + P(|\zeta_n| > \varepsilon,\ \zeta_{n,r}^* = \zeta_n) \tag{17}$$
and thus
$$\limsup_{n\to\infty} P(|\zeta_n| > \varepsilon) \le \delta. \tag{18}$$
But since $\delta > 0$ is arbitrary, it follows
$$\lim_{n\to\infty} P(|\zeta_n| > \varepsilon) = 0,$$
which was to be proved.

Remark. When the random variables $\xi_n$ are not only pairwise but completely independent, Theorem 4 can be proved by the method of characteristic functions. We have seen in § 2 that the stochastic convergence of $\zeta_n$ to 0 is equivalent to the convergence of the distribution function $F_n(x)$ of $\zeta_n$ to the degenerate distribution function $D_0(x)$ of the constant 0 (i.e. to 1 for $x > 0$ and to 0 for $x < 0$). Because of Theorem 3 of Ch. VI, § 4, it suffices thus to show that the characteristic function $\varphi_n(t)$ of $\zeta_n$ tends, for every $t$, to the characteristic function of the constant 0, i.e. to 1. If $\varphi(t)$ is the characteristic function of the random variables $\xi_k$, then by assumption $\varphi'(0) = 0$. Put
$$\varepsilon(t) = \frac{\varphi(t) - 1}{t}, \tag{19}$$
then
$$\lim_{t\to 0}\varepsilon(t) = 0. \tag{20}$$
Since for the characteristic function $\varphi_n(t)$ of $\zeta_n$ we have $\varphi_n(t) = \left[\varphi\left(\frac{t}{n}\right)\right]^n$, (20) implies, for every real value of $t$,
$$\lim_{n\to\infty}\varphi_n(t) = \lim_{n\to\infty}\left[1 + \frac{t\,\varepsilon\left(\frac{t}{n}\right)}{n}\right]^n = 1, \tag{21}$$
which was to be proved.

It was shown by Kolmogorov that in Theorem 4 the assumption of the existence of the expectation of $\xi_k$ can be replaced by the weaker postulate of the existence of the limit
$$\lim_{n\to\infty}\int_{-n}^{+n} x\,dF(x)$$
and of the relation
$$\lim_{x\to\infty} x\left[F(-x) + (1 - F(x))\right] = 0,$$
and that these conditions are not only sufficient but necessary as well for
$$\zeta_n = \frac{1}{n}\sum_{k=1}^n \xi_k$$
to converge in probability to a constant as $n \to \infty$. As regards the proof of this theorem of Kolmogorov, cf. § 14, Exercise 24.

We now give an example to which the law of large numbers does not apply. If the random variables $\xi_k$ are completely independent and if all have the same Cauchy distribution (with the density function $\frac{1}{\pi(1+x^2)}$), then
$$\zeta_n = \frac{1}{n}\sum_{k=1}^n \xi_k$$
also has the same Cauchy distribution. As a matter of fact, the characteristic function of the variables $\xi_k$ is equal to $e^{-|t|}$, thus that of $\zeta_n$ is equal to $\left(e^{-\frac{|t|}{n}}\right)^n = e^{-|t|}$. Evidently, $\lim\operatorname{st}\zeta_n = a$ does not hold in this case for any constant $a$.

The fact that for independent random variables $\xi_k$ with a common Cauchy distribution the random variable
$$\zeta_n = \frac{1}{n}\sum_{k=1}^n \xi_k$$
has the same Cauchy distribution as $\xi_k$ can be interpreted as follows: When we take a sample from a population with a Cauchy distribution with density function $\frac{1}{\pi\left(1 + (x-m)^2\right)}$, we do not obtain more information concerning the number $m$ from the mean of a sample, however large, than from a single observation.
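
The contrast between the Khintchine case and the Cauchy case is easy to exhibit numerically. In the sketch below (an added Python illustration; the comparison distribution, the horizon and the seed are arbitrary choices) the running means of standard Cauchy variables keep fluctuating wildly, while those of variables with finite expectation settle down.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    cauchy = rng.standard_cauchy(n)
    unif = rng.uniform(-1.0, 1.0, n)    # expectation 0, finite variance

    for m in (10**3, 10**4, 10**5):
        print(f"n = {m}: Cauchy sample mean = {cauchy[:m].mean():9.3f}, "
              f"uniform sample mean = {unif[:m].mean():9.5f}")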

§ 4. Bernstein's improvement of Chebyshev's inequality

In the preceding paragraph we applied Chebyshev's inequality to the sum of a large number of independent random variables in order to prove the law of large numbers. It was shown by Bernstein that in this case Chebyshev's inequality can be considerably improved. Put
$$\mathcal{M}(\varepsilon) = E\left(e^{\varepsilon(\xi - M)}\right),$$
where $\varepsilon > 0$ and $M$ is the expectation of $\xi$. We proved in § 1 the inequality
$$P\left(\xi \ge M + \frac{t + \ln\mathcal{M}(\varepsilon)}{\varepsilon}\right) \le e^{-t} \qquad (t > 0). \tag{1}$$
It was already observed that the choice of $\varepsilon$ minimizing $\frac{t + \ln\mathcal{M}(\varepsilon)}{\varepsilon}$ makes the inequality (1) as sharp as possible.

We prove now the following

Lemma. Let $\xi_1, \xi_2, \ldots, \xi_n$ be completely independent bounded random variables with $E(\xi_k) = 0$, $D(\xi_k) = D_k$, and suppose $|\xi_k| \le K$ $(k = 1, 2, \ldots, n)$. Put further
$$\xi = \xi_1 + \xi_2 + \cdots + \xi_n, \qquad D(\xi) = D \qquad \text{and} \qquad \mathcal{M}(\varepsilon) = E(e^{\varepsilon\xi}),$$
where $\varepsilon$ is an arbitrary positive number. We have then
$$\ln\mathcal{M}(\varepsilon) \le \frac{\varepsilon^2 D^2}{2}\left(1 + \frac{\varepsilon K}{3}\,e^{\varepsilon K}\right). \tag{2}$$

Proof. Since
$$\mathcal{M}(\varepsilon) = \prod_{k=1}^n E\left(e^{\varepsilon\xi_k}\right),$$
we evaluate first $E(e^{\varepsilon\xi_k})$. Since
$$e^{\varepsilon\xi_k} = \sum_{n=0}^{\infty}\frac{\varepsilon^n\xi_k^n}{n!}$$
and the $\xi_k$ are bounded, the series is uniformly convergent and the expectation of $e^{\varepsilon\xi_k}$ can be calculated term by term from the power series. Thus we obtain
$$E\left(e^{\varepsilon\xi_k}\right) = 1 + \frac{\varepsilon^2 D_k^2}{2} + \sum_{n=3}^{\infty}\frac{\varepsilon^n E(\xi_k^n)}{n!}. \tag{3}$$
As $|\xi_k| \le K$ implies the inequality
$$|E(\xi_k^n)| \le D_k^2 K^{n-2},$$
we obtain
$$\sum_{n=3}^{\infty}\frac{\varepsilon^n |E(\xi_k^n)|}{n!} \le \varepsilon^2 D_k^2\sum_{n=3}^{\infty}\frac{(\varepsilon K)^{n-2}}{n!}.$$
As
$$\frac{1}{n!} \le \frac{1}{6}\cdot\frac{1}{(n-3)!}$$
for $n \ge 3$, substitution into (3) gives
$$E\left(e^{\varepsilon\xi_k}\right) \le 1 + \frac{\varepsilon^2 D_k^2}{2}\left(1 + \frac{\varepsilon K}{3}\,e^{\varepsilon K}\right).$$
Because of $1 + x \le e^x$, we obtain
$$\mathcal{M}(\varepsilon) = E(e^{\varepsilon\xi}) \le \exp\left[\frac{\varepsilon^2 D^2}{2}\left(1 + \frac{\varepsilon K}{3}\,e^{\varepsilon K}\right)\right],$$
which is equivalent to (2).

Since by assumption $M = 0$, it follows from (1) that
$$P\left(\xi \ge \frac{t + \dfrac{\varepsilon^2 D^2}{2}\left(1 + \dfrac{\varepsilon K}{3}\,e^{\varepsilon K}\right)}{\varepsilon}\right) \le e^{-t}. \tag{4}$$
Put
$$\varepsilon = \frac{\sqrt{2t}}{D}. \tag{5}$$
Then (4) leads to
$$P\left(\xi \ge D\sqrt{2t}\left(1 + \frac{K\sqrt{2t}}{6D}\,e^{\frac{K\sqrt{2t}}{D}}\right)\right) \le e^{-t}. \tag{6}$$
Substitution of $\lambda = \sqrt{2t}$ gives
$$P\left(\xi \ge \lambda D\left(1 + \frac{\lambda K}{6D}\,e^{\frac{\lambda K}{D}}\right)\right) \le e^{-\frac{\lambda^2}{2}}. \tag{7}$$
Thus if $\lambda$ is large and $\frac{\lambda K}{D}$ is small, we obtain a much sharper inequality than that of Chebyshev.

If we apply the obtained result to $-\xi$ too, we find that
$$P\left(|\xi| \ge \lambda D\left(1 + \frac{\lambda K}{6D}\,e^{\frac{\lambda K}{D}}\right)\right) \le 2e^{-\frac{\lambda^2}{2}}. \tag{8}$$
In order to transform (8) into a more convenient form, we restrict ourselves to the case $\frac{\lambda K}{D} \le 1$. We have then
$$e^{\frac{\lambda K}{D}} \le e < 3.$$
From this, putting $\mu = \lambda\left(1 + \frac{\lambda K}{2D}\right)$, we obtain from (8), because of $\lambda \le \mu$,
$$P(|\xi| > \mu D) \le 2\exp\left[-\frac{\mu^2}{2\left(1 + \dfrac{\mu K}{2D}\right)^2}\right]. \tag{9}$$
We may free ourselves from the condition $E(\xi_k) = 0$ by applying (9) to the random variable
$$\xi - M = \sum_{k=1}^n\left[\xi_k - E(\xi_k)\right].$$
Thus we have proved the following theorem:

Theorem 1. If $\xi_1, \xi_2, \ldots, \xi_n$ are completely independent random variables such that $M_k = E(\xi_k)$, $D_k = D(\xi_k)$ exist and $|\xi_k - M_k| \le K$ $(k = 1, 2, \ldots, n)$, then for $\xi = \xi_1 + \xi_2 + \cdots + \xi_n$ we have
$$P(|\xi - M| > \mu D) \le 2\exp\left[-\frac{\mu^2}{2\left(1 + \dfrac{\mu K}{2D}\right)^2}\right]. \tag{10}$$
In this formula
$$M = \sum_{k=1}^n M_k \qquad \text{and} \qquad D = \sqrt{\sum_{k=1}^n D_k^2},$$
while $\mu$ is a positive number such that $\mu \le \frac{D}{K}$.

Let us apply now this result to the case where the $\xi_k$ have a common distribution. Let $M_1$ be the expectation of the $\xi_k$ and $D_1^2$ their variance. Then the expectation of the sum $\xi = \sum\limits_{k=1}^n \xi_k$ is equal to $nM_1$ and its variance to $D_1^2 n$. It follows from (10) for $\mu \le \frac{D_1\sqrt n}{K}$ that
$$P(|\xi - nM_1| > \mu D_1\sqrt n) \le 2\exp\left[-\frac{\mu^2}{2\left(1 + \dfrac{\mu K}{2D_1\sqrt n}\right)^2}\right]. \tag{11}$$
If the $\xi_k$ $(k = 1, 2, \ldots, n)$ are the indicators of an event $A$ in a sequence of $n$ independent experiments, we get from (11)

Theorem 2. Let $A$ be one of the possible results of an experiment, suppose $p = P(A) > 0$ and put $q = 1 - p$. Let the random variable $\zeta_n$ denote the relative frequency of $A$ in an experiment consisting of $n$ independent trials. Then for $0 < \varepsilon \le pq$ we have
$$P(|\zeta_n - p| > \varepsilon) \le 2\exp\left[-\frac{n\varepsilon^2}{2pq\left(1 + \dfrac{\varepsilon}{2pq}\right)^2}\right]. \tag{12}$$

Proof. (12) follows from (11) when we put $\mu = \varepsilon\sqrt{\dfrac{n}{pq}}$. Chebyshev's inequality [cf. § 2, Formula (6)] leads only to
$$P(|\zeta_n - p| > \varepsilon) < \frac{pq}{n\varepsilon^2}.$$

Thus e.g. for p = q = - j - , e = , Chebyshev’s inequality guarantees


the validity of

2 ^ 1 5 0 ( ,3 >

only for n > 10 000, while by using (12) we find that (13) holds already for
n > 1283.
If we take e = , we find, applying Chebyshev’s inequality, that
(I 1 1 1 ) 1
\fn 2 I 50j 100

is valid for n > 62 500; while applying (12) we see that it is valid already
for n > 7164.
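
The thresholds quoted above can be recomputed by a small sketch; the following Python fragment (an added illustration, which simply evaluates the Chebyshev bound $\frac{pq}{n\varepsilon^2}$ and the Bernstein bound (12) for $p = q = \frac{1}{2}$) reproduces the four numbers.

    import math

    def chebyshev_n(p, eps, delta):
        # least n with p*q / (n * eps^2) <= delta
        q = 1 - p
        return math.ceil(p * q / (delta * eps**2))

    def bernstein_n(p, eps, delta):
        # least n with 2*exp(-n*eps^2 / (2pq*(1 + eps/(2pq))^2)) <= delta
        q = 1 - p
        c = 2 * p * q * (1 + eps / (2 * p * q))**2
        return math.ceil(c * math.log(2 / delta) / eps**2)

    for eps in (1/20, 1/50):
        print(f"eps = {eps}: Chebyshev n0 = {chebyshev_n(0.5, eps, 0.01)}, "
              f"Bernstein n0 = {bernstein_n(0.5, eps, 0.01)}")

Running this gives 10 000 against 1283 for $\varepsilon = \frac{1}{20}$ and 62 500 against 7164 for $\varepsilon = \frac{1}{50}$, in agreement with the text.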
In these examples $\varepsilon > 0$ and $\delta > 0$ were given and we wanted to estimate the least number $n_0 = n_0(\varepsilon, \delta)$ such that for $n \ge n_0$
$$P(|\zeta_n - p| > \varepsilon) \le \delta$$
holds. This question is answered in general by the following theorem, which follows from Theorem 2:

Theorem 3. We perform $n$ independent experiments with possible outcomes $A$ and $\bar A$. Suppose $p = P(A) > 0$. Let $\zeta_n$ be the relative frequency of the event $A$ in the sequence of experiments, and suppose $\delta > 0$, $0 < \varepsilon \le p(1-p)$ and
$$n \ge \frac{9\ln\dfrac{2}{\delta}}{8\varepsilon^2} = n_0(\varepsilon, \delta).$$
Then
$$P(|\zeta_n - p| > \varepsilon) \le \delta. \tag{14}$$

Proof. It follows from Theorem 2 that (14) is fulfilled for
$$n \ge \frac{2pq\left(1 + \dfrac{\varepsilon}{2pq}\right)^2\ln\dfrac{2}{\delta}}{\varepsilon^2}.$$
Since $2pq \le \frac{1}{2}$ and $\left(1 + \dfrac{\varepsilon}{2pq}\right)^2 \le \dfrac{9}{4}$ for $0 < \varepsilon \le pq$, (14) is always fulfilled for $n \ge \dfrac{9}{8\varepsilon^2}\ln\dfrac{2}{\delta}$.
Theorem 2 can be rephrased in the following manner: If we put $x = \varepsilon\sqrt{\dfrac{n}{pq}}$, we have for $0 < x \le \sqrt{npq}$
$$P\left(|\zeta_n - p| > x\sqrt{\frac{pq}{n}}\right) \le 2\exp\left[-\frac{x^2}{2\left(1 + \dfrac{x}{2\sqrt{npq}}\right)^2}\right]; \tag{15}$$
the sharpness of the inequality can be appreciated by comparing it to the Moivre–Laplace theorem (Ch. III, § 16) concerning the convergence of the binomial distribution to the normal distribution.

Finally, let us make some general remarks about the laws of large numbers. In connection with many random mass phenomena, there are quantities which, though depending on chance, are practically constant. As an example, consider the pressure of a gas enclosed in a vessel. This pressure is the result of the impacts of molecules on the wall of the vessel. The number of the impacts as well as the velocity of the molecules depend on chance; nevertheless the resulting pressure is practically constant, provided that the number of the molecules is very large. Such phenomena can be explained as instances of the law of large numbers.

According to the law of large numbers, by calculating the probability of an event or the expectation of a random variable, one can obtain information about the relative frequency of the event, or about the arithmetic mean of the results, when a large number of independent experiments (observations) is performed. This is the reason why the law of large numbers is basic for so many practical applications of probability theory.
§ 5. The Borel–Cantelli lemma

We now prove the following useful and simple lemma:

Lemma A. If $\{A_n\}$ $(n = 1, 2, \ldots)$ is any infinite sequence of events such that
$$\sum_{n=1}^{\infty} P(A_n) < +\infty, \tag{1}$$
then with probability 1 at most finitely many events $A_n$ occur simultaneously. Or to put it otherwise, if we put
$$A_\infty = \prod_{n=1}^{\infty}\sum_{k=n}^{\infty} A_k, \tag{2}$$
then
$$P(A_\infty) = 0. \tag{3}$$

Remark. The right side of (2) is denoted in set theory by $\limsup\limits_{n\to\infty} A_n$.

Proof. We have
$$P(A_\infty) \le P\left(\sum_{k=n}^{\infty} A_k\right) \le \sum_{k=n}^{\infty} P(A_k) \tag{4}$$
for every $n$. Because of (1), the right hand side of (4) tends to zero as $n \to +\infty$, hence we have (3).

If the $A_n$ are completely independent, (1) is not only sufficient, but also necessary in order that with probability 1 at most finitely many of the $A_n$ should occur. If $\sum\limits_{n=1}^{\infty} P(A_n) = +\infty$, then $P(A_\infty)$ is not only positive but equal to 1. Thus we have

Lemma B. If $A_1, A_2, \ldots, A_n, \ldots$ are completely independent events with $\sum\limits_{n=1}^{\infty} P(A_n) = +\infty$, then
$$P(A_\infty) = 1. \tag{5}$$

Proof. Evidently
$$\bar A_\infty = \sum_{n=1}^{\infty}\prod_{k=n}^{\infty}\bar A_k; \tag{6}$$
it suffices thus to show that
$$P\left(\prod_{k=n}^{\infty}\bar A_k\right) = 0 \qquad \text{for } n = 1, 2, \ldots; \tag{7}$$
(7) and (6) imply $P(\bar A_\infty) = 0$, hence (5). But for $N > n$
$$P\left(\prod_{k=n}^{\infty}\bar A_k\right) \le P\left(\prod_{k=n}^{N}\bar A_k\right) = \prod_{k=n}^{N}\left(1 - P(A_k)\right) \le \exp\left[-\sum_{k=n}^{N} P(A_k)\right]. \tag{8}$$
The series $\sum\limits_{k=1}^\infty P(A_k)$ being divergent, the right hand side of (8) tends to zero as $N \to \infty$. Lemma B is thus proved. Lemmas A and B together are called the Borel–Cantelli lemma.

The hypothesis of Lemma B concerning the complete independence of the $A_n$-s can be replaced by the weaker condition that $A_1, A_2, \ldots, A_n, \ldots$ be pairwise independent. Even somewhat more is true, namely
Lemma C. If $A_1, A_2, \ldots, A_n, \ldots$ are arbitrary events, fulfilling the conditions
$$\sum_{n=1}^{\infty} P(A_n) = +\infty \tag{9}$$
and
$$\liminf_{n\to\infty}\frac{\sum\limits_{k=1}^n\sum\limits_{l=1}^n P(A_kA_l)}{\left(\sum\limits_{k=1}^n P(A_k)\right)^2} = 1, \tag{10}$$
then (5) holds; thus there occur with probability 1 infinitely many of the events $A_n$.

Proof. The proof is based on Chebyshev's inequality. Let $\alpha_n$ $(n = 1, 2, \ldots)$ denote the indicator of the event $A_n$, i.e. put
$$\alpha_n = \begin{cases} 1 & \text{if } A_n \text{ occurs}, \\ 0 & \text{otherwise}. \end{cases}$$
Then $E(\alpha_n) = P(A_n)$ and by Chebyshev's inequality we have
$$P\left(\left|\sum_{k=1}^n \alpha_k - \sum_{k=1}^n P(A_k)\right| > \varepsilon\sum_{k=1}^n P(A_k)\right) \le \frac{D^2\left(\sum\limits_{k=1}^n \alpha_k\right)}{\varepsilon^2\left(\sum\limits_{k=1}^n P(A_k)\right)^2}. \tag{11}$$
Now $E(\alpha_k\alpha_l) = P(A_kA_l)$, hence
$$D^2\left(\sum_{k=1}^n \alpha_k\right) = \sum_{k=1}^n\sum_{l=1}^n P(A_kA_l) - \left(\sum_{k=1}^n P(A_k)\right)^2. \tag{12}$$
Thus it follows from (10) that
$$\liminf_{n\to\infty} P\left(\left|\sum_{k=1}^n \alpha_k - \sum_{k=1}^n P(A_k)\right| > \frac{1}{2}\sum_{k=1}^n P(A_k)\right) = 0. \tag{13}$$
If we put
$$d_n = P\left(\sum_{k=1}^n \alpha_k < \frac{1}{2}\sum_{k=1}^n P(A_k)\right), \tag{14}$$
we have
$$\liminf_{n\to\infty} d_n = 0. \tag{15}$$
It follows from this that one can choose an infinite subsequence of positive integers $n_1 < n_2 < \cdots < n_j < \cdots$ such that
$$\sum_{j=1}^{\infty} d_{n_j} < +\infty; \tag{16}$$
hence by Lemma A we have with probability 1
$$\sum_{k=1}^{n_j}\alpha_k \ge \frac{1}{2}\sum_{k=1}^{n_j} P(A_k),$$
except for a finite number of values of $j$. Thus by (9) the series $\sum\limits_{k=1}^\infty \alpha_k$ is divergent with probability 1, which proves our statement.

The lemmas just proved will serve us well in proofs dealing with improvements of the law of large numbers.
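
The dichotomy expressed by Lemmas A and B is easily illustrated: with independent events of probability $\frac{1}{n^2}$ only finitely many occur, while with probability $\frac{1}{n}$ infinitely many occur. A Python sketch (an added illustration; the horizon and the seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    N = 200_000
    n = np.arange(1, N + 1)

    occ_sq = rng.random(N) < 1.0 / n**2   # sum P(A_n) < +infinity (Lemma A)
    occ_ln = rng.random(N) < 1.0 / n      # sum P(A_n) = +infinity (Lemma B)

    print("last occurrence when P(A_n) = 1/n^2:", int(n[occ_sq].max()))
    print("number of occurrences when P(A_n) = 1/n:", int(occ_ln.sum()))

In a typical run the first sequence stops occurring after a few small indices, while the second keeps occurring, the count growing roughly like $\ln N$.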

§ 6. Kolmogorov's inequality

In order to establish another group of laws of large numbers we need an improvement of Chebyshev's inequality due to A. N. Kolmogorov.

Theorem 1. Let the random variables $\eta_1, \eta_2, \ldots, \eta_n$ be completely independent; put further $E(\eta_k) = M_k$, $D(\eta_k) = D_k$. If $\varepsilon$ is an arbitrary positive number, we have
$$P\left(\max_{1\le k\le n}\left|\sum_{j=1}^k (\eta_j - M_j)\right| \ge \varepsilon\right) \le \frac{\sum\limits_{k=1}^n D_k^2}{\varepsilon^2}. \tag{1}$$

Proof. Put $\eta_k^* = \eta_k - M_k$, $\zeta_k = \sum\limits_{j=1}^k \eta_j^*$ $(k = 1, 2, \ldots, n)$. Let further $A_k$ denote the event that $\zeta_k$ is the first among the random variables $\zeta_1, \zeta_2, \ldots, \zeta_n$ which is not less than $\varepsilon$ in absolute value, i.e.
$$|\zeta_1| < \varepsilon, \ldots, |\zeta_{k-1}| < \varepsilon \qquad \text{and} \qquad |\zeta_k| \ge \varepsilon.$$
The events $A_k$ $(k = 1, 2, \ldots, n)$ exclude each other but obviously do not form a complete system of events. Let $A_0$ denote the event $|\zeta_k| < \varepsilon$ for $k = 1, 2, \ldots, n$; then the events $A_0, A_1, \ldots, A_n$ form already a complete system of events. The probability figuring in (1) can be written in the form
$$P\left(\max_{1\le k\le n}\left|\sum_{j=1}^k \eta_j^*\right| \ge \varepsilon\right) = \sum_{k=1}^n P(A_k). \tag{2}$$
On the other hand we have
$$\sum_{k=1}^n D_k^2 = D^2(\zeta_n) = \sum_{k=0}^n P(A_k)\,E(\zeta_n^2 \mid A_k). \tag{3}$$
The right hand side of (3) becomes smaller if the term $k = 0$ is omitted from the summation. Hence
$$\sum_{k=1}^n D_k^2 \ge \sum_{k=1}^n P(A_k)\,E(\zeta_n^2 \mid A_k). \tag{4}$$
Let us now consider the conditional expectation $E(\zeta_n^2 \mid A_k)$. From $\zeta_n = \zeta_k + \sum\limits_{j=k+1}^n \eta_j^*$ it follows that
$$\zeta_n^2 = \zeta_k^2 + \sum_{j=k+1}^n \eta_j^{*2} + 2\sum_{j=k+1}^n \zeta_k\eta_j^* + 2\sum_{k<i<j\le n}\eta_i^*\eta_j^*. \tag{5}$$
In the definition of the event $A_k$ only the random variables $\eta_1^*, \ldots, \eta_k^*$ occur. If $\alpha_k$ denotes the indicator of $A_k$, then it follows that $\alpha_k$ does not depend on $\eta_j^*$ $(j > k)$. Hence we have for $k < j \le n$
$$E(\zeta_k\eta_j^* \mid A_k) = \frac{E(\zeta_k\eta_j^*\alpha_k)}{P(A_k)} = \frac{E(\eta_j^*)\,E(\zeta_k\alpha_k)}{P(A_k)} = 0. \tag{6}$$
Similarly, we obtain that
$$E(\eta_i^*\eta_j^* \mid A_k) = 0 \tag{7}$$
because of the independence of $\eta_i^*$, $\eta_j^*$, $\alpha_k$ for $j > i > k$. (5), (6) and (7) lead to
$$E(\zeta_n^2 \mid A_k) = E(\zeta_k^2 \mid A_k) + \sum_{j=k+1}^n E(\eta_j^{*2} \mid A_k) \ge E(\zeta_k^2 \mid A_k). \tag{8}$$
Now, by the definition of the events $A_k$, we have $|\zeta_k| \ge \varepsilon$ whenever $A_k$ occurs; hence $E(\zeta_k^2 \mid A_k) \ge \varepsilon^2$ and thus by (8)
$$E(\zeta_n^2 \mid A_k) \ge \varepsilon^2.$$
Because of (4) it follows that
$$\sum_{k=1}^n D_k^2 \ge \varepsilon^2\sum_{k=1}^n P(A_k),$$
which proves (1).
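
The point of (1) is that the maximum of the partial sums obeys the same bound which Chebyshev's inequality gives for the last sum alone. The following Python sketch (an added illustration; the uniform distribution and all parameters are arbitrary choices) estimates both tail probabilities and evaluates the bound.

    import numpy as np

    rng = np.random.default_rng(4)
    n, reps, eps = 100, 20_000, 10.0
    # eta_j uniform on (-1, 1): M_j = 0, D_j^2 = 1/3
    eta = rng.uniform(-1.0, 1.0, size=(reps, n))
    S = np.cumsum(eta, axis=1)

    p_max = np.mean(np.max(np.abs(S), axis=1) >= eps)   # left side of (1)
    p_last = np.mean(np.abs(S[:, -1]) >= eps)           # last partial sum only
    bound = (n / 3) / eps**2                            # sum of D_k^2 over eps^2

    print(f"P(max |S_k| >= eps) ~ {p_max:.4f}, "
          f"P(|S_n| >= eps) ~ {p_last:.4f}, bound = {bound:.4f}")

Both empirical probabilities stay below the bound, the probability for the maximum being, of course, the larger of the two.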
§ 7. The strong law of large numbers

A sequence $\eta_n$ of random variables is said to converge almost surely (with probability 1) to 0 if
$$P(\lim_{n\to\infty}\eta_n = 0) = 1.$$
Almost sure convergence is a stronger condition than convergence in probability. Indeed we have the following

Lemma 1. The condition
$$P(\lim_{n\to\infty}\eta_n = 0) = 1 \tag{1}$$
is equivalent to the condition
$$\lim_{n\to\infty}\operatorname{st}\left(\sup_{m\ge n}|\eta_m|\right) = 0. \tag{2}$$
Consequently, (1) implies
$$\lim_{n\to\infty}\operatorname{st}\eta_n = 0. \tag{3}$$

Proof. We show first that (2) follows from (1). Let $\varepsilon > 0$; let $A_n(\varepsilon)$ denote the event $\sup\limits_{m\ge n}|\eta_m| \ge \varepsilon$ and $C$ the event $\lim\limits_{n\to\infty}\eta_n = 0$; put further $B_n(\varepsilon) = CA_n(\varepsilon)$. Then $B_{n+1}(\varepsilon) \subseteq B_n(\varepsilon)$ and the set $\prod\limits_{n=1}^\infty B_n(\varepsilon)$ is obviously empty. It follows from this (cf. Ch. II, § 7, Theorem 3) that
$$\lim_{n\to\infty} P(B_n(\varepsilon)) = 0.$$
Now because of $P(C) = 1$ we have
$$P(B_n(\varepsilon)) = P(A_n(\varepsilon)).$$
Thus (2) holds and, because of
$$|\eta_n| \le \sup_{m\ge n}|\eta_m|,$$
(3) holds as well. Conversely, assume that (2) holds. Let $D(\varepsilon)$ denote the event $\limsup\limits_{n\to\infty}|\eta_n| \ge \varepsilon$ ($\varepsilon > 0$, arbitrary). Since obviously $D(\varepsilon) \subseteq A_n(\varepsilon)$ for $n = 1, 2, \ldots$, (2) implies $P(D(\varepsilon)) = 0$. As $\bar C \subseteq \sum\limits_{k=1}^\infty D\left(\frac{1}{k}\right)$, we get $P(\bar C) = 0$, hence (1). Herewith the lemma is proved.

Remark. Statement (1) concerning the almost sure convergence of $\eta_k$ to 0 can be rephrased: the probability of the set of the elementary events $\omega$ for which $\lim\limits_{n\to\infty}\eta_n(\omega) = 0$ does not hold is equal to zero. In measure theory, this is expressed by saying that the variables $\eta_n(\omega)$ converge almost everywhere to 0. Convergence in probability is called in measure theory convergence in measure.

In order to emphasize the particular character of the strong law of large numbers we restrict ourselves first to a simple case.

Theorem 1. Let $\xi_1, \xi_2, \ldots, \xi_n, \ldots$ be (completely) independent and identically distributed random variables with finite expectation $E(\xi_n) = M$ and variance $D^2(\xi_n) = D^2$. Put
$$\zeta_n = \frac{1}{n}\sum_{k=1}^n \xi_k.$$
Then
$$P(\lim_{n\to\infty}\zeta_n = M) = 1. \tag{4}$$

Remark. According to Lemma 1, (4) is a stronger statement than the law of large numbers: $\lim\operatorname{st}\zeta_n = M$. Therefore Theorem 1 is called the strong law of large numbers.

In order to see better the meaning of (4), consider the case where the $\xi_n$ are indicators of an event $A$ in a sequence of experiments. $\zeta_n$ is then the relative frequency of $A$. Put $p = P(A)$. In this case the only thing the law of large numbers says is $\lim\operatorname{st}\zeta_n = p$, while the strong law of large numbers tells us that $\lim\zeta_n = p$ with probability 1. That is, according to Lemma 1, all relations
$$|\zeta_{n+k} - p| < \varepsilon \qquad (k = 1, 2, \ldots)$$
are simultaneously fulfilled with a probability $> 1 - \delta$ for $\varepsilon > 0$ and $\delta > 0$ however small, if the index $n$ is larger than a number $n_0$ depending on $\varepsilon$ and $\delta$.

Proof of Theorem 1. We consider
$$\Delta_N = \sup_{n\ge N}|\zeta_n - M| \qquad \text{and} \qquad \Delta_{a,b} = \max_{a\le n<b}|\zeta_n - M|.$$
If the inequality $\Delta_N \ge \varepsilon$ is fulfilled for an $N$ such that $2^s \le N < 2^{s+1}$, then $\Delta_{2^l,2^{l+1}} \ge \varepsilon$ is fulfilled for at least one $l \ge s$. Hence
$$P(\Delta_N \ge \varepsilon) \le \sum_{l=s}^{\infty} P(\Delta_{2^l,2^{l+1}} \ge \varepsilon) \qquad \text{for } 2^s \le N < 2^{s+1}. \tag{5}$$
On the other hand we have
$$P(\Delta_{2^l,2^{l+1}} \ge \varepsilon) \le P\left(\max_{1\le k<2^{l+1}} k|\zeta_k - M| \ge \varepsilon 2^l\right). \tag{6}$$
If we apply to the random variables $\xi_k$ $(k = 1, 2, \ldots, 2^{l+1}-1)$ Kolmogorov's inequality (proved in the preceding section), we obtain
$$P\left(\max_{1\le k<2^{l+1}} k|\zeta_k - M| \ge \varepsilon 2^l\right) \le \frac{2^{l+1}D^2}{\varepsilon^2 4^l} = \frac{2D^2}{\varepsilon^2 2^l}. \tag{7}$$
If we substitute this into (6) and consider (5), we have
$$P(\Delta_N \ge \varepsilon) \le \frac{2D^2}{\varepsilon^2}\sum_{l=s}^{\infty}\frac{1}{2^l} = \frac{4D^2}{\varepsilon^2 2^s}. \tag{8}$$
If $N \to \infty$, it follows from (8) and from $2^s > \frac{N}{2}$ that
$$\lim_{N\to\infty} P(\Delta_N \ge \varepsilon) = 0.$$
Hence by Lemma 1 our theorem is proved.
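
The quantity $\Delta_N$ itself can be followed along a single simulated trajectory. In the Python sketch below (an added illustration; the exponential distribution, the horizon and the seed are arbitrary choices) the supremum $\sup_{m\ge n}|\zeta_m - M|$ is computed by a reversed running maximum and is seen to become small, as Lemma 1 and Theorem 1 predict.

    import numpy as np

    rng = np.random.default_rng(5)
    N = 1_000_000
    xi = rng.exponential(1.0, N)              # M = 1
    zeta = np.cumsum(xi) / np.arange(1, N + 1)

    dev = np.abs(zeta - 1.0)
    # sup over m >= n of |zeta_m - M|, via a running maximum from the right
    sup_tail = np.maximum.accumulate(dev[::-1])[::-1]
    for n in (10**2, 10**4, 10**6 - 1):
        print(f"n = {n}: sup over m >= n of |zeta_m - 1| = {sup_tail[n - 1]:.5f}")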


The following two generalizations of the strong law of large numbers are
due to A. N. Kolmogorov.

T heorem 2. Let £b £2, be a sequence o f (completely) iWe-


pendent random variables for which E(ik) = M k and D(^k) — Dk exist.
CO

Assume further that the series £ converges. I f we put


k=i

n k=1
then
Z(lim C„ = 0 ) = 1. (9)
n-*- 00

T heorem 3. Let £1; be ( completely) independent identically


distributed random variables. The random variable
1 "
in ~ -- Z
V II, § 7 ] TH E STR O N G LAW O F LA R G E N U M BERS 397

converges with probability 1 to a constant C iff the expectation M = E(fk)


exists; in this case C = M and, consequently,

P( lim Cn = M )= 1.
n-*-00
The hypothesis of the existence of the variance in Theorem 1 is therefore
superfluous.

Proof of Theorem 2. Put, as in the proof of Theorem 1,
$$\Delta_N = \sup_{k\ge N}|\zeta_k| \qquad \text{and} \qquad \Delta_{a,b} = \max_{a\le k<b}|\zeta_k|.$$
It follows from $\Delta_N \ge \varepsilon$, $2^s \le N < 2^{s+1}$ that $\Delta_{2^l,2^{l+1}} \ge \varepsilon$ for at least one $l \ge s$; hence
$$P(\Delta_N \ge \varepsilon) \le \sum_{l=s}^{\infty} P(\Delta_{2^l,2^{l+1}} \ge \varepsilon). \tag{10}$$
By application of Kolmogorov's inequality we obtain
$$P(\Delta_{2^l,2^{l+1}} \ge \varepsilon) \le P\left(\max_{1\le k<2^{l+1}} k|\zeta_k| \ge \varepsilon 2^l\right) \le \frac{1}{\varepsilon^2 4^l}\sum_{k=1}^{2^{l+1}-1} D_k^2.$$
Hence by (10),
$$P(\Delta_N \ge \varepsilon) \le \frac{1}{\varepsilon^2}\sum_{l=s}^{\infty}\frac{1}{4^l}\sum_{k=1}^{2^{l+1}-1} D_k^2. \tag{11}$$
By interchanging the order of summation we find
$$P(\Delta_N \ge \varepsilon) \le \frac{4}{3\varepsilon^2}\left(\frac{1}{4^s}\sum_{k=1}^{2^{s+1}-1} D_k^2 + 4\sum_{k=2^{s+1}}^{\infty}\frac{D_k^2}{k^2}\right).$$
Now it can be shown that the right hand side of inequality (11) tends to zero as $s$ increases (hence as $N$ increases too) provided that the series $\sum\limits_{k=1}^\infty\dfrac{D_k^2}{k^2}$ is convergent. To show this we need the following lemma due to L. Kronecker.

Lemma 2. If the series $\sum\limits_{k=1}^\infty a_k$ is convergent and if $q_n$ is an increasing sequence of positive numbers, tending to $+\infty$ for $n \to \infty$, then
$$\lim_{n\to\infty}\frac{1}{q_n}\sum_{k=1}^n a_kq_k = 0. \tag{12}$$

Proof. Put $r_n = \sum\limits_{k=n}^\infty a_k$ and choose a number $n_0 = n_0(\varepsilon)$ ($\varepsilon$ is an arbitrary small positive number) large enough in order that $n > n_0$ should imply $|r_n| < \frac{\varepsilon}{3}$. It is easy to see that (with $q_0 = 0$)
$$\frac{1}{q_n}\sum_{k=1}^n a_kq_k = \frac{1}{q_n}\sum_{k=1}^n r_k(q_k - q_{k-1}) - r_{n+1}. \tag{13}$$
If we put $\max\limits_k|r_k| = A$, we have for $n > n_0$
$$\left|\frac{1}{q_n}\sum_{k=1}^n a_kq_k\right| \le \frac{Aq_{n_0}}{q_n} + \frac{2\varepsilon}{3}.$$
Choose now $n_1 > n_0$ such that $\dfrac{Aq_{n_0}}{q_{n_1}} < \dfrac{\varepsilon}{3}$. Then for $n > n_1$
$$\left|\frac{1}{q_n}\sum_{k=1}^n a_kq_k\right| < \varepsilon,$$
which proves Lemma 2.

This lemma (applied with $a_k = \frac{D_k^2}{k^2}$ and $q_k = k^2$) and the convergence of the series $\sum\limits_{k=1}^\infty\dfrac{D_k^2}{k^2}$ imply immediately
$$\lim_{n\to\infty}\frac{1}{n^2}\sum_{k=1}^n D_k^2 = 0;$$
hence the right hand side of (11) tends to 0 as $N \to \infty$. This and Lemma 1 lead to Theorem 2.

Proof of Theorem 3. We show first that the existence of $M = E(\xi_k)$ suffices to imply $P(\lim\limits_{n\to\infty}\zeta_n = M) = 1$. For the sake of simplicity assume $M = 0$. Let the random variables $\xi_k^*$ $(k = 1, 2, \ldots)$ be defined by
$$\xi_k^* = \begin{cases}\xi_k & \text{for } |\xi_k| \le k, \\ 0 & \text{otherwise},\end{cases}$$
and the random variables $\xi_k^{**}$ by
$$\xi_k^{**} = \xi_k - \xi_k^*.$$
Put
$$\zeta_n = \frac{1}{n}\sum_{k=1}^n \xi_k, \qquad \zeta_n^* = \frac{1}{n}\sum_{k=1}^n \xi_k^*, \qquad \zeta_n^{**} = \frac{1}{n}\sum_{k=1}^n \xi_k^{**}.$$
Let $F(x)$ denote the distribution function of the $\xi_k$. Since we have assumed $M = 0$, we have $E(\xi_k^*) = \int_{-k}^{+k} x\,dF(x) \to 0$, hence
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n E(\xi_k^*) = 0.$$
The expectation of the random variables $\xi_k$ exists by assumption, hence the integral $\int_{-\infty}^{+\infty}|x|\,dF(x)$ is convergent. Since $\zeta_n = \zeta_n^* + \zeta_n^{**}$, it suffices evidently to prove that
$$P(\lim_{n\to\infty}\zeta_n^* = 0) = 1, \tag{14a}$$
$$P(\lim_{n\to\infty}\zeta_n^{**} = 0) = 1. \tag{14b}$$
Theorem 2 applies to the random variables $\xi_k^*$, since it is easy to show that $D(\xi_k^*)$ exists and that $\sum\limits_{k=1}^\infty\dfrac{D^2(\xi_k^*)}{k^2} < +\infty$. Indeed
$$D^2(\xi_k^*) \le E(\xi_k^{*2}) = \int_{-k}^{+k} x^2\,dF(x).$$
Hence, because of
$$\sum_{k=j+1}^{\infty}\frac{1}{k^2} \le \sum_{k=j+1}^{\infty}\frac{1}{k(k-1)} = \frac{1}{j},$$
we have
$$\sum_{k=1}^{\infty}\frac{D^2(\xi_k^*)}{k^2} \le \sum_{j=1}^{\infty}\left(\ \int\limits_{j-1\le|x|<j} x^2\,dF(x)\right)\sum_{k=j}^{\infty}\frac{1}{k^2} \le 2\int_{-\infty}^{+\infty}|x|\,dF(x).$$
Hence (14a) must hold. Now consider the random variables $\xi_k^{**}$. We have
$$P(\xi_k^{**} \ne 0) = \int\limits_{|x|>k} dF(x), \tag{15}$$
hence
$$\sum_{k=1}^{\infty} P(\xi_k^{**} \ne 0) \le \int_{-\infty}^{+\infty}|x|\,dF(x). \tag{16}$$
Lemma A of § 5 permits us to state that with probability 1, $\xi_k^{**} = 0$ holds for all but a finite number of values of $k$. This implies (14b), which proves the first half of Theorem 3.

Now we show that the condition is necessary. Assume the validity of $P(\lim\limits_{n\to\infty}\zeta_n = C) = 1$, where $C$ is a constant. Then with probability 1
$$\lim_{n\to\infty}\frac{\xi_n}{n} = \lim_{n\to\infty}\left(\zeta_n - \frac{n-1}{n}\zeta_{n-1}\right) = C - C = 0.$$
Thus $\left|\dfrac{\xi_n}{n}\right| \ge 1$ holds with probability 1 only for a finite number of values of $n$. Since the $\xi_n$ are independent, it follows from Lemma B of § 5 that the series $\sum\limits_{n=1}^\infty P\left(\left|\dfrac{\xi_n}{n}\right| \ge 1\right)$ is convergent. Now since $P\left(\left|\dfrac{\xi_n}{n}\right| \ge 1\right) = \int\limits_{|x|\ge n} dF(x)$, the series
$$\sum_{n=1}^{\infty}\int\limits_{|x|\ge n} dF(x) = \sum_{n=1}^{\infty} n\int\limits_{n\le|x|<n+1} dF(x)$$
is convergent. But we have
$$\int_{-\infty}^{+\infty}|x|\,dF(x) \le 1 + \sum_{n=1}^{\infty}(n+1)\int\limits_{n\le|x|<n+1} dF(x),$$
hence $\int_{-\infty}^{+\infty}|x|\,dF(x)$ exists and $M = E(\xi_k)$ exists, too. Hence by the first part of the theorem $C = M$ and thus our theorem is completely proved.

§ 8. The fundamental theorem of mathematical statistics

In the present paragraph we prove a theorem due to Glivenko, which is of fundamental importance in mathematical statistics.

Theorem 1. Let the random variables $\xi_1, \xi_2, \ldots, \xi_N$ be the elements of a sample drawn from a population, i.e. let $\xi_1, \xi_2, \ldots, \xi_N$ be identically distributed independent random variables with common distribution function $F(x)$. Let $F_N(x)$ denote the empirical distribution function of the sample, i.e. let $NF_N(x)$ be the number of the indices $k$ for which $\xi_k < x$. Put further
$$\Delta_N = \sup_{-\infty<x<+\infty}|F_N(x) - F(x)|.$$
Then
$$P(\lim_{N\to\infty}\Delta_N = 0) = 1. \tag{1}$$

Remark. Glivenko's theorem states that the empirical distribution function $F_N(x)$ of a sample of $N$ elements converges, with probability 1, as $N \to \infty$, uniformly in $x$ $(-\infty < x < +\infty)$ to the distribution function $F(x)$ of the population from which the sample was drawn.

Proof. If $x_{M,k}$ ($M$ a positive integer; $k = 1, 2, \ldots, M$) is the least number $x$ fulfilling
$$F(x) \le \frac{k}{M} \le F(x+0),$$
we have
$$\Delta_N \le \max\left(\Delta_N^{(1)}, \Delta_N^{(2)}\right) + \frac{1}{M}, \tag{2}$$
where
$$\Delta_N^{(1)} = \max_{1\le k\le M}|F_N(x_{M,k}) - F(x_{M,k})|,$$
$$\Delta_N^{(2)} = \max_{1\le k\le M}|F_N(x_{M,k}+0) - F(x_{M,k}+0)|.$$
By the strong law of large numbers we have for every fixed $x$
$$P(\lim_{N\to\infty} F_N(x) = F(x)) = 1$$
and
$$P(\lim_{N\to\infty} F_N(x+0) = F(x+0)) = 1.$$
Thus it follows from (2) that
$$P\left(\limsup_{N\to\infty}\Delta_N > \frac{1}{M}\right) = 0 \tag{3}$$
for any natural number $M$, which proves the theorem.

This theorem of prime importance states that a large enough sample gives an almost exact information concerning the distribution of the population. Glivenko's theorem can also be rephrased in the following manner: If $\varepsilon$ and $\delta$ are two given positive numbers however small, then there exists an $N_0$ such that
$$P\left(\sup_{N\ge N_0}\Delta_N < \varepsilon\right) > 1 - \delta. \tag{4}$$
This particular form shows clearly that the strong law of large numbers and Glivenko's theorem have a definite meaning even for the practical case when only finitely many observations are made. In fact, always when a large sample is studied, this theorem is implicitly used; hence it has the right to be called the fundamental theorem of mathematical statistics.

On the other hand it must be noticed that Glivenko's theorem does not give any information how $N_0$ figuring in (4) depends on $\varepsilon$ and $\delta$. This question will be answered by a theorem of Kolmogorov dealt with later on (cf. Ch. VII, § 10).
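
A numerical illustration of Glivenko's theorem is easily written down. The sketch below (added here in Python; the exponential population and the sample sizes are arbitrary choices) computes $\Delta_N$ exactly, using the fact that for a continuous $F$ the supremum is attained at a sample point, approached from the left or from the right.

    import numpy as np

    rng = np.random.default_rng(6)

    def sup_distance(sample):
        # Delta_N = sup |F_N(x) - F(x)| for the exponential F(x) = 1 - e^(-x)
        x = np.sort(sample)
        N = len(x)
        F = 1.0 - np.exp(-x)
        Fn_right = np.arange(1, N + 1) / N   # F_N(x_k + 0)
        Fn_left = np.arange(0, N) / N        # F_N(x_k)
        return max(np.max(np.abs(F - Fn_right)), np.max(np.abs(F - Fn_left)))

    for N in (100, 10_000, 1_000_000):
        print(f"N = {N}: Delta_N = {sup_distance(rng.exponential(1.0, N)):.5f}")

The computed values of $\Delta_N$ shrink as $N$ grows (roughly like $N^{-1/2}$, in accordance with the Kolmogorov theorem referred to above).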

§ 9. The law of the iterated logarithm

The strong law of large numbers can be still further improved. To get acquainted with the methods needed for this, we give here first a new proof of Theorem 1 of § 7 concerning bounded random variables. The proof rests upon the Borel–Cantelli Lemma A.

Let $\xi_1, \xi_2, \ldots, \xi_n, \ldots$ be independent and identically distributed bounded random variables, and suppose $|\xi_n| \le K$. Suppose further $E(\xi_n) = 0$, and put $D = D(\xi_n)$ and $\zeta_n = \frac{1}{n}\sum\limits_{k=1}^n \xi_k$. Then, according to Theorem 1 of § 4, we have the inequality
$$P(|\zeta_n| > \varepsilon) \le 2q^n \tag{1}$$
for $0 < \varepsilon \le \frac{D^2}{K}$, where
$$q = \exp\left[-\frac{\varepsilon^2}{2D^2\left(1 + \dfrac{\varepsilon K}{2D^2}\right)^2}\right] < 1.$$
From (1) follows the convergence of the series
$$\sum_{n=1}^{\infty} P(|\zeta_n| > \varepsilon) \tag{2}$$
and because of Lemma A of § 5, we find that the inequality $|\zeta_n| \le \varepsilon$ is fulfilled with probability 1 for every sufficiently large $n$. This implies the strong law of large numbers. Thus we obtained another proof of this law. Notice, however, that a supplementary hypothesis was needed for this proof: the random variables $\xi_n$ were supposed to be bounded. This hypothesis allows to prove a far more precise theorem, called the law of the iterated logarithm:

Theorem 1. Let $\xi_1, \xi_2, \ldots, \xi_n, \ldots$ be uniformly bounded independent random variables with common expectation $M = E(\xi_n)$ and standard deviation $D = D(\xi_n)$. Put
$$\eta_n = \xi_1 + \xi_2 + \cdots + \xi_n - nM. \tag{3}$$
Then
$$P\left(\limsup_{n\to\infty}\frac{\eta_n}{D\sqrt{2n\ln\ln n}} \le 1\right) = 1. \tag{4}$$

In order to prove this we first have to prove a lemma similar to Kolmogorov's inequality of § 6.

Lemma 1. Let $\xi_1, \xi_2, \ldots, \xi_n$ be independent random variables with common expectation $E(\xi_k) = M$ and variance $D^2(\xi_k) = D^2$. Put
$$\zeta_k = \xi_1 + \xi_2 + \cdots + \xi_k - kM.$$
Then
$$P\left(\max_{1\le k\le n}\zeta_k \ge x\right) \le 2P(\zeta_n \ge x - 2D\sqrt n). \tag{5}$$

Proof. Let $A_k$ $(k = 1, 2, \ldots, n)$ denote the event that the inequalities
$$\zeta_1 < x,\quad \zeta_2 < x,\quad \ldots,\quad \zeta_{k-1} < x,\quad \zeta_k \ge x$$
are fulfilled. Let $B_k$ denote the event $\zeta_n - \zeta_k \ge -2\sqrt n\,D$ and $A$ the event $\zeta_n \ge x - 2\sqrt n\,D$. If both $A_k$ and $B_k$ occur, $A$ occurs as well. The events $A_k$ $(k = 1, 2, \ldots, n)$ mutually exclude each other, thus the same holds for the events $A_kB_k$; the events $A_k$ and $B_k$ are evidently independent, since $B_k$ depends only on the random variables $\xi_{k+1}, \ldots, \xi_n$ and $A_k$ depends only on $\xi_1, \ldots, \xi_k$. Since $A_1B_1 + \cdots + A_nB_n \subseteq A$, the independence of $A_k$ and $B_k$ implies
$$\sum_{k=1}^n P(A_k)P(B_k) = \sum_{k=1}^n P(A_kB_k) = P\left(\sum_{k=1}^n A_kB_k\right) \le P(A). \tag{6}$$
On the other hand,
$$1 - P(B_k) \le P(|\zeta_n - \zeta_k| > 2D\sqrt n), \tag{7}$$
hence, by Chebyshev's inequality,
$$1 - P(B_k) \le \frac{(n-k)D^2}{4nD^2} \le \frac{1}{4}. \tag{8}$$
Thus
$$P(B_k) \ge \frac{3}{4} > \frac{1}{2}. \tag{9}$$
(6) and (9) lead to
$$\frac{1}{2}\sum_{k=1}^n P(A_k) = \frac{1}{2}P\left(\max_{1\le k\le n}\zeta_k \ge x\right) \le P(A) = P(\zeta_n \ge x - 2D\sqrt n),$$
which was to be proved.

Now we proceed to the proof of the law of the iterated logarithm. We show first that with probability 1 only finitely many of the events
$$\eta_n > (1+\varepsilon)D\sqrt{2n\ln\ln n}$$
occur if $\varepsilon > 0$ is arbitrarily small. Let $\gamma = \sqrt{1+\varepsilon}$ and let $N_k$ be the least positive integer larger than $\gamma^k$. Let $A_k(\varepsilon)$ denote the event
$$\max_{N_k\le n<N_{k+1}}\eta_n > (1+\varepsilon)D\sqrt{2N_k\ln\ln N_k}. \tag{10}$$
In order to show that with probability 1 at most finitely many of the events $A_k(\varepsilon)$ occur, it suffices, in view of the Borel–Cantelli lemma, to show that the series $\sum\limits_{k=1}^\infty P(A_k(\varepsilon))$ is convergent for every $\varepsilon > 0$. Now
$$P(A_k(\varepsilon)) \le P\left(\max_{1\le n<N_{k+1}}\eta_n > (1+\varepsilon)D\sqrt{2N_k\ln\ln N_k}\right), \tag{11}$$
hence, because of Lemma 1,
$$P(A_k(\varepsilon)) \le 2P\left(\eta_{N_{k+1}} > (1+\varepsilon)D\sqrt{2N_k\ln\ln N_k} - 2D\sqrt{N_{k+1}}\right). \tag{12}$$
If we put
$$\mu = \mu_k = (1+\varepsilon)\sqrt{\frac{2N_k\ln\ln N_k}{N_{k+1}}} - 2,$$
we conclude from (12), by applying Theorem 1 of § 4, that the following relation holds if $k$ is large enough:
$$P(A_k(\varepsilon)) \le 2\exp\left[-\frac{\mu_k^2}{2\left(1 + \dfrac{\mu_k K}{2D\sqrt{N_{k+1}}}\right)^2}\right]. \tag{13}$$
Since
$$\lim_{k\to\infty}\frac{\mu_k}{\sqrt{2\ln\ln N_k}} = \frac{1+\varepsilon}{\sqrt\gamma} = (1+\varepsilon)^{\frac34} \qquad \text{and} \qquad \lim_{k\to\infty}\frac{\mu_k K}{2D\sqrt{N_{k+1}}} = 0,$$
there exists a number $k_0$ depending only on $\varepsilon$ such that for $k > k_0$ we have
$$\frac{\mu_k^2}{2\left(1 + \dfrac{\mu_k K}{2D\sqrt{N_{k+1}}}\right)^2} > (1+\varepsilon)\ln\ln N_k \ge (1+\varepsilon)\ln(k\ln\gamma).$$
Hence for $k > k_0$
$$P(A_k(\varepsilon)) \le 2e^{-(1+\varepsilon)\ln(k\ln\gamma)} = \frac{2}{(\ln\gamma)^{1+\varepsilon}}\cdot\frac{1}{k^{1+\varepsilon}}. \tag{14}$$
The series $\sum\limits_{k=1}^\infty P(A_k(\varepsilon))$ is thus convergent for every positive $\varepsilon$ and according to the Borel–Cantelli lemma,
$$P\left(\limsup_{n\to\infty}\frac{\eta_n}{D\sqrt{2n\ln\ln n}} \le 1\right) = 1. \tag{15a}$$
It can be shown in a similar manner that
$$P\left(\liminf_{n\to\infty}\frac{\eta_n}{D\sqrt{2n\ln\ln n}} \ge -1\right) = 1. \tag{15b}$$
This proves Theorem 1.

It should be noticed that (4) cannot be improved; in fact, it can be shown that
$$P\left(\limsup_{n\to\infty}\frac{\eta_n}{D\sqrt{2n\ln\ln n}} = 1\right) = 1 \tag{16a}$$
and
$$P\left(\liminf_{n\to\infty}\frac{\eta_n}{D\sqrt{2n\ln\ln n}} = -1\right) = 1. \tag{16b}$$
In particular, if the $\xi_k$ are indicators of an event $A$ with probability $P(A) = p$ in a sequence of independent experiments, the conditions of Theorem 1 are fulfilled. Thus we have

Theorem 2. If $\zeta_n$ represents the relative frequency of an event $A$ in a sequence of independent experiments and if $P(A) = p$ $(0 < p < 1,\ q = 1-p)$, then
$$P\left(\limsup_{n\to\infty}\frac{\zeta_n - p}{\sqrt{\dfrac{2pq\ln\ln n}{n}}} \le 1\right) = 1. \tag{17}$$
As we have seen, there even holds the more precise relation
$$P\left(\limsup_{n\to\infty}\frac{\zeta_n - p}{\sqrt{\dfrac{2pq\ln\ln n}{n}}} = +1\right) = P\left(\liminf_{n\to\infty}\frac{\zeta_n - p}{\sqrt{\dfrac{2pq\ln\ln n}{n}}} = -1\right) = 1. \tag{18}$$
For the proof of (18) cf. § 16, Exercise 23.
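
The normalization appearing in Theorem 2 can be followed along a long simulated coin-tossing sequence. The Python sketch below is an added illustration; $p$, the horizon and the seed are arbitrary, and of course no finite simulation proves a statement about a limit superior, but the normalized deviations are seen to stay in the neighbourhood of $\pm 1$.

    import numpy as np

    rng = np.random.default_rng(7)
    p, N = 0.5, 2_000_000
    tosses = rng.random(N) < p
    n = np.arange(1, N + 1)
    zeta = np.cumsum(tosses) / n

    valid = n >= 3                      # ln ln n must be positive
    t = (zeta[valid] - p) * np.sqrt(
        n[valid] / (2 * p * (1 - p) * np.log(np.log(n[valid]))))
    print("largest normalized deviation observed:", t.max())
    print("smallest normalized deviation observed:", t.min())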

§ 10. Sequences of mixing sets

Let $[\Omega, \mathcal{A}, P]$ be a Kolmogorov probability space and $A_n \in \mathcal{A}$ $(n = 1, 2, \ldots)$ a sequence of sets such that for every $B \in \mathcal{A}$ the relation
$$\lim_{n\to\infty} P(A_nB) = dP(B) \tag{1}$$
holds, where $d$ is a number, not depending on $B$, such that $0 \le d \le 1$. Then the sequence $\{A_n\}$ is said to be mixing; $d$ is called the density of the sequence $\{A_n\}$.

Theorem 1. If $A_0 = \Omega$, $A_n \in \mathcal{A}$ $(n = 1, 2, \ldots)$, $0 < P(A_n) < 1$ and
$$\lim_{n\to\infty} P(A_n \mid A_k) = d \qquad (k = 0, 1, \ldots), \tag{2}$$
then $\{A_n\}$ is mixing.

Remark. Evidently condition (2) is also necessary, as it is a particular case of (1) for $B = A_k$ $(k = 0, 1, \ldots)$.

Proof. We use elements of the theory of Hilbert spaces. Let $\mathcal{H}$ be the set of all random variables $\xi$ for which $E(\xi^2)$ exists. Put $(\xi, \eta) = E(\xi\eta)$ and $\|\xi\| = (\xi, \xi)^{\frac12}$. $\mathcal{H}$ is then a Hilbert space. Let $\alpha_n$ denote the indicator of the event $A_n$:
$$\alpha_n = \begin{cases} 1 & \text{for } \omega \in A_n, \\ 0 & \text{for } \omega \in \bar A_n. \end{cases}$$
If $\beta$ is the indicator of $B$ and if $\alpha_n - d = \gamma_n$, we can write (1) in the form
$$\lim_{n\to\infty}(\beta, \gamma_n) = 0, \tag{3}$$
while (2) is equivalent to
$$\lim_{n\to\infty}(\gamma_k, \gamma_n) = 0 \qquad (k = 0, 1, \ldots). \tag{4}$$
We show that (4) implies (3) for every $\beta \in \mathcal{H}$ (hence not merely for the $\beta$ which are indicators of sets).

Let $\mathcal{H}_1$ denote the set of those elements of $\mathcal{H}$ which are linear combinations of the $\gamma_n$ or limits of such elements, in the sense of strong convergence, that is $\delta_n \to \delta$ means that $\lim\limits_{n\to\infty}\|\delta_n - \delta\| = 0$. In other words, $\mathcal{H}_1$ is the least subspace of $\mathcal{H}$ containing the elements $\gamma_n$ $(n = 0, 1, \ldots)$. Obviously, (4) implies (3) when $\beta$ is a finite linear combination of the $\gamma_n$, and also when $\beta \in \mathcal{H}_1$. In fact, in the latter case there exists for every $\varepsilon > 0$ a $\gamma = \sum\limits_{k=1}^n c_k\gamma_k$ with $\|\beta - \gamma\| < \varepsilon$. Because of Schwarz's inequality and of
$$\|\gamma_n\|^2 = E\left((\alpha_n - d)^2\right) = P(A_n)(1-d)^2 + (1 - P(A_n))d^2 \le 1,$$
we have
$$|(\beta, \gamma_n) - (\gamma, \gamma_n)| = |(\beta - \gamma, \gamma_n)| \le \|\beta - \gamma\|\cdot\|\gamma_n\| \le \varepsilon.$$
By (4) $\lim\limits_{n\to\infty}(\gamma, \gamma_n) = 0$, thus $\limsup\limits_{n\to\infty}|(\beta, \gamma_n)| \le \varepsilon$; since for $\varepsilon$ there can be chosen any positive number however small, (3) is therefore proved for every $\beta \in \mathcal{H}_1$.

Let now $\mathcal{H}_2$ be the set of elements $\delta$ of $\mathcal{H}$ such that $(\delta, \gamma_n) = 0$ for $n = 0, 1, \ldots$. $\mathcal{H}_2$ is then the subspace of $\mathcal{H}$ orthogonal to $\mathcal{H}_1$. For $\beta \in \mathcal{H}_2$ (3) is trivial. Now according to a well-known theorem of the theory of Hilbert spaces,¹ every $\beta \in \mathcal{H}$ can be written in the form $\beta = \beta_1 + \beta_2$, where $\beta_1 \in \mathcal{H}_1$ and $\beta_2 \in \mathcal{H}_2$. Furthermore,
$$(\beta, \gamma_n) = (\beta_1, \gamma_n) + (\beta_2, \gamma_n) = (\beta_1, \gamma_n),$$
hence (3) holds for every $\beta \in \mathcal{H}$. Theorem 1 is thus proved. As an application we prove now a theorem which shows new aspects of the laws of large numbers.

Theorem 2. Let $\xi_1, \ldots, \xi_n, \ldots$ be independent random variables whose arithmetic mean $\zeta_n = \dfrac{\xi_1 + \xi_2 + \cdots + \xi_n}{n}$ tends in probability to a random variable $\zeta$. Then $\zeta$ is equal with probability 1 to a constant.

¹ Cf. e.g. B. Sz.-Nagy [1] or F. Riesz and B. Sz.-Nagy [1].


Proof. Choose two numbers $a$ and $b$ such that $P(a < \zeta < b) > 0$ and $a$, $b$ are two points of continuity of the distribution function of $\zeta$. Then $P(a < \zeta_n < b) > 0$ for $n \ge n_0$. Let $A_0 = \Omega$ and let $A_k$ denote the event $a < \zeta_{n_0+k+1} < b$ $(k = 1, 2, \ldots)$. For $k \ge 1$ we have
$$P(A_n \mid A_k) = P\left(a < \frac{(n_0+k+1)\zeta_{n_0+k+1} + \xi_{n_0+k+2} + \cdots + \xi_{n_0+n+1}}{n_0+n+1} < b \;\middle|\; A_k\right) \le$$
$$\le P\left(a - \frac{(n_0+k+1)b}{n_0+n+1} < \frac{\xi_{n_0+k+2} + \cdots + \xi_{n_0+n+1}}{n_0+n+1} < b - \frac{(n_0+k+1)a}{n_0+n+1}\right) \le$$
$$\le P\left(a - \varepsilon < \frac{\xi_{n_0+k+2} + \cdots + \xi_{n_0+n+1}}{n_0+n+1} < b + \varepsilon\right)$$
for any $\varepsilon > 0$ whenever $n$ is large enough. Similarly, for sufficiently large $n$,
$$P(A_n \mid A_k) \ge P\left(a + \varepsilon < \frac{\xi_{n_0+k+2} + \cdots + \xi_{n_0+n+1}}{n_0+n+1} < b - \varepsilon\right).$$
Now by assumption $\zeta_n \xrightarrow{P} \zeta$, hence also $\dfrac{\xi_{n_0+k+2} + \cdots + \xi_{n_0+n+1}}{n_0+n+1} \xrightarrow{P} \zeta$. Since for $\varepsilon$ there can be chosen any positive number however small, Theorem 1 of § 2 leads to
$$\lim_{n\to\infty} P(A_n \mid A_k) = P(a < \zeta < b),$$
since $a$ and $b$ are points of continuity of the distribution function of $\zeta$. Thus by Theorem 1 the sequence of events $\{A_n\}$ is a mixing sequence with the density $d = P(a < \zeta < b)$.

Now let $B$ be the event $a < \zeta < b$. We have
$$\lim_{n\to\infty} P(A_n \mid B) = d = P(B).$$
The random variables $\zeta_n$ tend in probability to $\zeta$ also when taken on the probability space $[\Omega, \mathcal{A}, P(A \mid B)]$. Thus $\lim\limits_{n\to\infty} P(A_n \mid B) = P(B \mid B) = 1$ by Theorem 1 of § 2. Consequently, $P(B) = 1$: since $P(a < \zeta < b) > 0$, we have $P(a < \zeta < b) = 1$. But this means that $\zeta$ is a constant with probability 1; in fact, if the distribution function of $\zeta$ would increase at more than one point, there could be found a pair $a < b$ with $0 < P(a < \zeta < b) < 1$ such that $a$ and $b$ are points of continuity of the distribution function of $\zeta$. Theorem 2 is herewith proved. The arithmetic means of independent random variables either tend in probability to a constant or do not converge at all.
§ 11. Stable sequences of events

In the present section we deal with a generalization of the notion of mixing sequences of events, introduced in the preceding section. Let $[\Omega, \mathcal{A}, P]$ be a Kolmogorov probability space; a sequence $\{A_n\}$ of events $(A_n \in \mathcal{A};\ n = 1, 2, \ldots)$ such that for any event $B \in \mathcal{A}$ there exists the limit
$$\lim_{n\to\infty} P(A_nB) = Q(B) \tag{1}$$
will be called a stable sequence of events.¹ We shall prove first that the set function $Q(B)$ on the right hand side of (1) is always a measure, i.e. we prove

Theorem 1. If $\{A_n\}$ is a stable sequence of events, the set function $Q(B)$ defined by (1) is a measure, which is moreover absolutely continuous with respect to the probability measure $P$.

Proof. Obviously, $Q(B)$ is a nonnegative and additive set function and $Q(B) \le P(B)$; hence if $P(B) = 0$, then $Q(B) = 0$. From this the assertion of our theorem follows directly, by Theorem 3 of Chapter II, § 7.

According to the Radon–Nikodym theorem the derivative
$$\frac{dQ}{dP} = \alpha(\omega) \tag{2}$$
exists; furthermore, for every event $B \in \mathcal{A}$ one has
$$Q(B) = \int_B \alpha\,dP. \tag{3}$$
It follows directly from the inequality $Q(B) \le P(B)$ that
$$0 \le \alpha(\omega) \le 1 \tag{4}$$
with probability 1. The random variable $\alpha = \alpha(\omega)$ is called the (local) density of the stable sequence of events $\{A_n\}$.

If $\alpha$ is constant almost everywhere, $\alpha = d$ $(0 \le d \le 1)$, then clearly the stable sequence of events $\{A_n\}$ is mixing and has density $d$. On the other hand, if $\alpha$ is not constant, the stable sequence of events $\{A_n\}$ cannot be a mixing sequence. Hence the notion of the stable sequence of events is a generalization of that of a mixing sequence of events.

Let us now consider an example of a stable but not mixing sequence of events. Let $[\Omega, \mathcal{A}, P]$ be a Kolmogorov probability space, let $\Omega_1 \in \mathcal{A}$, $0 < P(\Omega_1) < 1$ and $\Omega_2 = \bar\Omega_1$. Consider further the probability spaces $[\Omega, \mathcal{A}, P_1]$ and $[\Omega, \mathcal{A}, P_2]$, where for every $A \in \mathcal{A}$, $P_1(A) = P(A \mid \Omega_1)$ and $P_2(A) = P(A \mid \Omega_2)$.

¹ Cf. A. Rényi [36].
410 LAW S OF LA R G E N U M B E R S [VII, § 11

Let A'n be a mixing sequence of events in the probability space [Í2, ts f, Р2]
with density d 1 and A"n a mixing sequence of events in the probability space
[Q, P2\ with density d2(0 < dx < d2 < 1); put A„ = A'nQ1 + A!’n Q2.
Then clearly we have for every event В £
P(An В ) = P(QA P, (A'a В ) + P( ü 2)P 2(A: B),
hence
lim P(AnB) = Q(B),
«-»-co
where
Q(B) = d1 P(BQ1) + di P(BQ2).
Let the random variable a = a (со) be defined in the following manner:

d 1 if со £ Qu
a(® )= ,
«2 tf ß 2>
then
Q(B) = f «t/P.
в
Thus the sequence of events {A,,} is stable but not mixing, since its density
is not constant but assumes two distinct values with positive probabilities.
Clearly, there can be constructed in a similar manner stable sequences of
events with densities having an arbitrary prescribed discrete distribution.
Now we shall prove the generalization of Theorem 1 of § 10 concerning
stable sequences of events.

T heorem 2. I f And (n = 1, 2 , . ..), A v = Q and if the limits

lim P(A„ Ak) - Qk (A: = 1 , 2 , . . . )


«-►00

exist, then the sequence o f the events {A„} is a stable sequence o f events.

Proof. The proof of this theorem corresponds nearly step-by-step to


that of Theorem 1 in § 10, hence it will be only sketched. Let Aftf denote the
Hilbert space of all random variables with finite standard deviation on the
probability space [Í2, P] \ scalar product and norm are (as usual) defined
i
by (f, q) = E(tr\) and 11 £ 11 = (£, f ) 2, respectively. Let a„ be the indicator
of the event A„. Let be the subspace of the Hilbert space «^spanned by the
elements oq, a2, .. ak, . . thus consists of the finite linear combinations
(with real coefficients) of the elements of the sequence {ak } and of the
(strong) limits of these elements. It is easy to see that if £ f t x, then the limit

lim (f, «„) = Щ ) (5)


«-►00
in fact, if $\xi = c_1\alpha_1 + \cdots + c_k\alpha_k$, then
$$\lim_{n\to\infty}(\xi, \alpha_n) = c_1Q_1 + \cdots + c_kQ_k,$$
while if $\xi$ is the limit of linear combinations of the $\alpha_k$, the limit (5) exists again, since
$$|(\xi, \alpha_n) - (\xi', \alpha_n)| = |(\xi - \xi', \alpha_n)| \le \|\xi - \xi'\|.$$
Now, in the same way as was done in § 10, we decompose $\xi$ as $\xi = \xi_1 + \xi_2$, where $\xi_1 \in \mathcal{H}_1$ and $(\xi_2, \alpha_k) = 0$ $(k = 1, 2, \ldots)$; hence the limit (5) exists for any $\xi \in \mathcal{H}$, and our theorem is proved.

Clearly the functional $L(\xi)$ has the following properties: If $\xi \in \mathcal{H}$, $\eta \in \mathcal{H}$, furthermore $a$ and $b$ are real constants, then
$$L(a\xi + b\eta) = aL(\xi) + bL(\eta)$$
and
$$|L(\xi)| \le \|\xi\|.$$
To put it otherwise, $L(\xi)$ is a bounded linear functional and thus, according to the well-known theorem of F. Riesz (cf. F. Riesz and B. Sz.-Nagy [1]), there exists an $\alpha \in \mathcal{H}$ such that
$$L(\xi) = (\xi, \alpha),$$
i.e. the sequence $\alpha_n$ converges to $\alpha$ in the sense of weak convergence in the Hilbert space. (A sequence of elements $a_n$ of a Hilbert space is said to converge weakly to $a$ $(a \in \mathcal{H})$ if for any element $\xi \in \mathcal{H}$
$$\lim_{n\to\infty}(\xi, a_n) = (\xi, a).$$
This fact is denoted by $a_n \rightharpoonup a$.)

The preceding discussion contains the proof of the following

Theorem 3. Let $\alpha_n$ denote the indicator of the event $A_n$ and $\mathcal{H}$ the Hilbert space formed by the random variables with finite second moments defined on the probability space $[\Omega, \mathcal{A}, P]$. The sequence of events $\{A_n\}$ belonging to the probability space $[\Omega, \mathcal{A}, P]$ is stable iff $\alpha_n$ converges weakly in $\mathcal{H}$ to an element $\alpha \in \mathcal{H}$. If the sequence of events $\{A_n\}$ is stable and if $\alpha_n \rightharpoonup \alpha$, then $\alpha$ is the density of the sequence of events $\{A_n\}$.

A stable sequence of events $\{A_n\}$ is mixing iff there exists a number $d$ $(0 \le d \le 1)$ such that for every event $A$
$$Q(A) = dP(A). \tag{6}$$
It is readily seen from Theorem 1 of § 10 that it suffices to assume the validity of (6) for $A = \Omega$ and $A = A_k$ $(k = 1, 2, \ldots)$; from this it follows already that $\{A_n\}$ is a mixing sequence of events.

Finally, we prove another theorem showing the great generality of the notion of stable sequences of events.

Theorem 4. From any sequence of events one can select a stable subsequence.

Proof. Theorem 4 is a direct consequence of a well-known theorem of the Hilbert space theory (cf. F. Riesz and B. Sz.-Nagy [1]) stating that from any sequence of elements with bounded norm of the Hilbert space a weakly convergent subsequence can be selected.

As an application of these results we shall discuss in the following section sequences of exchangeable events.

§ 12. Sequences of exchangeable events

The notion of a sequence of exchangeable events was already encountered (cf. Ch. II, § 12, Exercises 38–43). Here we shall deal only with infinite sequences of exchangeable events. First, let us repeat the definition.

A sequence $\{A_n\}$ $(n = 1, 2, \ldots)$ of events is said to be exchangeable if the probability of the joint occurrence of $k$ distinct events chosen arbitrarily from this sequence depends only on $k$ for every positive value of $k$, but does not depend on which $k$ events were chosen. Thus there can be given a sequence of numbers $p_k$ such that
$$P(A_{n_1}A_{n_2}\cdots A_{n_k}) = p_k \qquad (k = 1, 2, \ldots), \tag{1}$$
whenever $n_1 < n_2 < \cdots < n_k$.

First we prove the following theorem:

Theorem 1. A sequence of exchangeable events is always stable, and it is mixing iff the events are independent and have the same probability.

Proof. Let $\{A_n\}$ be a sequence of exchangeable events. Then according to our assumption
$$P(A_n) = p_1,$$
hence
$$\lim_{n\to\infty} P(A_n) = p_1;$$
V II, § 12] SEQ UENCES OF EX C H A N G EA BLE EVENTS 413

further by assumption

P{An A ) = P2 if « > k,
hence
lim P{A„ Ak) = p 2 ( k = 1 ,2 ,...) .
/I-*-oo

Therefore by Theorem 2 of § 11 {A„} is a stable sequence of events. Now


according to § 11 a stable sequence of events {A„} is a mixing sequence,
iff a is a constant, a = d. In this case p 1 = d and p 2 = d2, furthermore

Рз = P(A„ A 2 Ai), if n > 3,


hence
p 3 = lim P(An A 2 Afi = dP(A 2 Afi = d3,
n-+-oo
and, similarly, it can be seen that for every к > 1 we have

Pk = dk,
i.e.
P{Ani A„t . . . A J = P ( A J P ( A J . . . P ( A J ,

whenever 1 < nx < n2 < . . . < nk. But this means that the events A„ are
independent.
Now we prove a theorem due to B. de Finetti.

THEOREM 2. If {A_n} is a sequence of exchangeable events and the numbers
p_k are defined by (1), there can be given on the interval [0, 1] a
distribution function F(x) (fulfilling F(0) = 0 and F(1 + 0) = 1) such that

p_k = ∫₀¹ x^k dF(x)    (k = 1, 2, ...).          (2)

PROOF. By Theorem 1 the sequence {A_n} is stable. Let α denote the density
of this sequence. Then

p_1 = P(A_k) = lim_{n→∞} P(A_n) = ∫_Ω α dP;

furthermore p_2 = P(A_n A_k) if n > k, and thus

p_2 = lim_{n→∞} P(A_n A_k) = ∫_{A_k} α dP = ∫_Ω α α_k dP    (k = 1, 2, ...),

hence by Theorem 3 of § 11

p_2 = lim_{k→∞} ∫_Ω α α_k dP = ∫_Ω α² dP.

Similarly,

p_3 = P(A_n A_k A_l) if n > k > l,

thus

p_3 = lim_{n→∞} P(A_n A_k A_l) = ∫_Ω α α_k α_l dP,

hence, by taking the limit first for k → ∞ and then for l → ∞, we obtain

p_3 = ∫_Ω α³ dP,

and, in general, for every positive integral value of k we get

p_k = ∫_Ω α^k dP.          (3)

Let now F(x) denote the distribution function of the random variable α.
Thus (cf. Theorem 6 of Ch. IV, § 11)

p_k = ∫₀¹ x^k dF(x)    (k = 1, 2, ...),

which was to be proved. The proof gives, however, somewhat more than
stated in Theorem 2; in fact we have proved the more general

THEOREM 3. Let {A_n} be an arbitrary exchangeable sequence of events; let
the numbers p_k be defined by (1). Let α_n denote the indicator of the event
A_n and α = α(ω) the density of the sequence {A_n}. Then, for any choice of
the positive integers 1 ≤ k_1 < k_2 < ... < k_r and for any integer s ≥ 0, we
have

∫_Ω α_{k_1} α_{k_2} ... α_{k_r} α^s dP = p_{r+s}.          (4)

Remark. Theorem 2 is contained as a particular case in Theorem 3. In fact,
if r = 0, relation (4) reduces to (3). On the other hand, if s = 0, then (4)
reduces to (1), i.e. to the definition of the sequence of numbers p_k.
Let {A_n} be an exchangeable sequence of events. Let us compute the
probability of the joint occurrence of k distinct events selected from this
sequence and of the simultaneous nonoccurrence of l events distinct from
the former and from each other; i.e. let us compute the probability
P(A_{n_1} A_{n_2} ... A_{n_k} Ā_{m_1} Ā_{m_2} ... Ā_{m_l}), where
n_1 < n_2 < ... < n_k, m_1 < m_2 < ... < m_l and n_i ≠ m_j (i = 1, 2, ..., k;
j = 1, 2, ..., l). It can be seen by an easy calculation that

P(A_{n_1} A_{n_2} ... A_{n_k} Ā_{m_1} Ā_{m_2} ... Ā_{m_l}) = Σ_{j=0}^{l} (−1)^j C(l, j) p_{k+j}.          (5)

Equation (5) is valid for the case k = 0 too, if we understand by p_0 the
value 1.

The expression on the right hand side of (5) can be written in the form

Σ_{j=0}^{l} (−1)^j C(l, j) p_{k+j} = (−1)^l Δ^l p_{k+l},          (6)

where Δ denotes the difference operator, defined by Δx_k = x_k − x_{k−1}.
Since the probability on the left hand side of (5) is nonnegative, we have
obtained that for the sequence of numbers p_k the inequalities

(−1)^l Δ^l p_{k+l} ≥ 0          (7)

hold. Sequences of numbers having property (7) are called absolutely
monotonic sequences. Hence an absolutely monotonic sequence is nonincreasing,
its first differences form a nondecreasing sequence (i.e. the sequence is
convex), its second differences form a nonincreasing sequence, etc. Note
that inequality (7) can also be obtained from the representation of the
sequence of numbers p_k in (2) or (3), since

(−1)^l Δ^l p_{k+l} = ∫₀¹ x^k (1 − x)^l dF(x) = ∫_Ω α^k (1 − α)^l dP.          (8)

It was shown by F. Hausdorff that every absolutely monotonic sequence of
numbers p_k (k = 0, 1, ...) for which p_0 = 1 can be represented in the
form (2), where F(x) is a distribution function on the interval (0, 1). This
theorem can be deduced from the above Theorem 2. It suffices to show that
if p_k (k = 0, 1, ...) is an arbitrary absolutely monotonic sequence of
numbers for which p_0 = 1, there can be constructed a sequence of
exchangeable events {A_n} on a suitable probability space fulfilling (1).
This construction is readily performed, e.g. by the fundamental theorem of
Kolmogorov (Ch. IV, § 22). In fact, if α_n denotes the indicator of the
event A_n, then

P(A_{n_1} ... A_{n_k} Ā_{m_1} ... Ā_{m_l}) = P(α_{n_1} = 1, ..., α_{n_k} = 1, α_{m_1} = 0, ..., α_{m_l} = 0).

Hence we can see from (5) that, given the sequence p_k, the joint
distribution function of any finite number of random variables chosen from
α_1, α_2, ... is given as well; the conditions of compatibility are obviously
fulfilled, and thus the existence of the (exchangeable) sequence of events
with the required properties is ensured by the fundamental theorem of
Kolmogorov.
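The representation (2) and the inequalities (7) are easy to check
numerically. The following short Python sketch (an added computational
illustration, not part of the original argument; the beta mixing
distribution with parameters a = 2, b = 3 is an arbitrary example) computes
the moment sequence p_k of a beta distribution and verifies that all the
alternating differences appearing in (5)-(8) are nonnegative:

    from math import comb

    # Moments p_k of a beta(a, b) mixing distribution F, computed from the
    # recursion p_k = p_{k-1} * (a + k - 1) / (a + b + k - 1), with p_0 = 1.
    a, b = 2.0, 3.0
    p = [1.0]
    for k in range(1, 11):
        p.append(p[-1] * (a + k - 1) / (a + b + k - 1))

    # By (8), (-1)^l Delta^l p_{k+l} equals the integral of x^k (1-x)^l dF(x),
    # hence it must be nonnegative, as inequality (7) asserts.
    for k in range(6):
        for l in range(6 - k):
            s = sum((-1) ** j * comb(l, j) * p[k + j] for j in range(l + 1))
            assert s >= -1e-12, (k, l, s)
    print("all alternating differences are nonnegative, as (7) requires")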
Now we shall prove the following

THEOREM 4. Let {A_n} be a sequence of exchangeable events with density α
and let α_n denote the indicator of the event A_n. Then we have with
probability 1

lim_{n→∞} (α_1 + α_2 + ... + α_n)/n = α.          (9)

PROOF. Let

ζ_n = (α_1 + α_2 + ... + α_n)/n − α = (1/n) Σ_{k=1}^{n} (α_k − α).

We calculate the expectation of ζ_n⁴. According to (4), if k_1, k_2, k_3, k_4
are distinct positive integers, we have

E((α_{k_1} − α)(α_{k_2} − α)(α_{k_3} − α)(α_{k_4} − α)) = 0.          (10)

Similarly, it can be seen that if k_1, k_2, k_3 are distinct, then

E((α_{k_1} − α)²(α_{k_2} − α)(α_{k_3} − α)) = 0,          (11)

furthermore, if k_1 ≠ k_2, then

E((α_{k_1} − α)³(α_{k_2} − α)) = 0.          (12)

On the other hand, if k_1 ≠ k_2, then

E((α_{k_1} − α)²(α_{k_2} − α)²) = p_2 − 2p_3 + p_4 = A          (13)

and

E((α_{k_1} − α)⁴) = p_1 − 4p_2 + 6p_3 − 3p_4 = B,          (14)

hence

E(ζ_n⁴) = O(1/n²).          (15)

Thus the series Σ_{n=1}^{∞} E(ζ_n⁴) is convergent, hence (by the Beppo Levi
theorem) the series Σ_{n=1}^{∞} ζ_n⁴ is convergent with probability 1, i.e.
lim_{n→∞} ζ_n = 0 with probability 1, which implies the statement of Theorem
4. Our result may be rephrased as follows: if {A_n} is a sequence of
exchangeable events and ν_n denotes the number of those among the events
A_1, A_2, ..., A_n which occur, then the limit lim_{n→∞} ν_n/n exists with
probability 1 and is equal to the density of the sequence of events {A_n}.

Clearly, Theorem 4 is a generalization of the strong law of large numbers
concerning the relative frequency. Notice that the limit of the arithmetic
means of the random variables α_k is, in general, not constant, in contrast
to the case of independent random variables.
Let us now consider as an example Pólya's urn model (cf. Ch. III, § 3,
Point 9). Let there be in an urn M red and N − M white balls. Balls are
drawn at random; the drawn ball is replaced, and simultaneously R ≥ 1 balls
of the same colour as the one drawn are added to the urn. If A_n denotes the
event that the ball obtained at the n-th drawing is red, then according to
Formula (10) of § 3 in Chapter III the sequence of events {A_n} is
exchangeable and

p_k = Π_{j=0}^{k−1} (M + jR)/(N + jR).          (16)

It is readily seen that in this case

p_k = ∫₀¹ x^k dF(x),          (17)

where F(x) is the distribution function of the beta distribution with
parameters a = M/R and b = (N − M)/R. That is (cf. Ch. IV, § 10, (10)) we
have

F(x) = [Γ(N/R) / (Γ(M/R) Γ((N−M)/R))] ∫₀^x t^{M/R − 1} (1 − t)^{(N−M)/R − 1} dt.          (18)

Thus by Theorem 4, in the case of Pólya's urn model the relative frequency
of the drawings yielding a red ball among the first n drawings converges
with probability 1 to a random variable having a beta distribution with
parameters (M/R, (N − M)/R). From this it follows that the distribution of
this relative frequency converges to the mentioned beta distribution. In
fact, if a sequence η_n of random variables converges with probability 1 to
a random variable η, then (cf. § 7) η_n tends to η in probability as well,
and thus (cf. Theorem 1 of § 2) the distribution of η_n tends to that of η.
Hence we have

THEOREM 5. If in Pólya's urn scheme ν_n denotes the number of red balls
drawn in the course of the first n drawings, then

lim_{n→∞} P(ν_n/n < x) = [Γ(N/R) / (Γ(M/R) Γ((N−M)/R))] ∫₀^x t^{M/R − 1} (1 − t)^{(N−M)/R − 1} dt,

i.e. the limit distribution of the relative frequency of the red balls drawn
is a beta distribution with parameters (M/R, (N − M)/R).

In particular, if M = R = 1 and N = 2, the relative frequency of red balls
will be in the limit uniformly distributed on the interval (0, 1).
Furthermore, it is easy to see that Formula (10) of Chapter III, § 3 is a
special case of the present Formula (5).

As is seen from this example, the general theory of stable sequences of
events permits a deeper insight into some particular problems already
discussed.
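The limit theorem just proved is easy to visualize by simulation. The
following Python sketch (an added illustration, not part of the original
text) runs many independent Pólya urns with M = 1, N = 2, R = 1 and compares
the empirical quantiles of ν_n/n with those of the uniform distribution
predicted by Theorem 5:

    import random

    def polya_red_fraction(M=1, N=2, R=1, n=2000, rng=random):
        # Urn with M red balls among N in total; each draw reinforces
        # the drawn colour by R further balls.
        red, total, hits = M, N, 0
        for _ in range(n):
            if rng.random() < red / total:
                hits += 1
                red += R
            total += R
        return hits / n

    samples = sorted(polya_red_fraction() for _ in range(5000))
    # For M = R = 1, N = 2 the limit law is uniform on (0, 1), so the
    # empirical quantiles should be close to the uniform ones.
    for q in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(q, samples[int(q * len(samples))])

Each single urn settles near its own random limiting fraction; it is only
the distribution of these limits over many urns that is uniform.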

§ 13. The zero-one law


In § 5 we proved the following statement: If the events A_n (n = 1, 2, ...)
are independent, the probability that infinitely many of them occur is
either 0 or 1, according as the series Σ_{n=1}^∞ P(A_n) is convergent or
divergent. Thus the probability in question cannot be equal to any value
other than 0 or 1.

This phenomenon is explained by the following general theorem, called the
"zero-one law".
THEOREM 1. Let A_1, A_2, ..., A_n, ... be a sequence of independent events
and 𝒜^(n) the least σ-algebra containing all the events A_{n+1}, A_{n+2}, ...;
if C is an event which belongs for every n to the σ-algebra 𝒜^(n), then
either P(C) = 0 or P(C) = 1.

PROOF. We need a lemma from set theory. For its formulation the following
definition has to be introduced: If ℳ is a system of subsets of Ω having the
two properties:

1) B_n ∈ ℳ and B_{n+1} ⊆ B_n (n = 1, 2, ...) imply lim_{n→∞} B_n = Π_{n=1}^∞ B_n ∈ ℳ,

2) B_n ∈ ℳ and B_n ⊆ B_{n+1} (n = 1, 2, ...) imply lim_{n→∞} B_n = Σ_{n=1}^∞ B_n ∈ ℳ,

then ℳ is said to be a monotone class.

LEMMA. If a monotone class of sets contains an algebra of sets 𝒜, it
contains also the least σ-algebra σ(𝒜) containing 𝒜.

PROOF. It suffices to show that if ℳ(𝒜) is the least monotone class
containing 𝒜, it is identical with σ(𝒜). Let A be any subset of Ω and ℳ_A
the family of the sets B such that A − B, B − A, and A + B all belong to
ℳ(𝒜). Clearly, ℳ_A is a monotone class, since for every (increasing or
decreasing) monotone sequence {B_n} with B_n ∈ ℳ_A we have

lim_{n→∞} B_n − A = lim_{n→∞} (B_n − A),

A − lim_{n→∞} B_n = lim_{n→∞} (A − B_n),

A + lim_{n→∞} B_n = lim_{n→∞} (A + B_n).

By assumption, 𝒜 is an algebra of sets, hence A ∈ 𝒜 implies 𝒜 ⊆ ℳ_A.
Since, by definition, ℳ(𝒜) is the least monotone class containing 𝒜, we
have ℳ(𝒜) ⊆ ℳ_A for A ∈ 𝒜. Thus if C ∈ ℳ(𝒜) and A ∈ 𝒜, then C ∈ ℳ_A,
hence A ∈ ℳ_C; so for every C ∈ ℳ(𝒜) we have 𝒜 ⊆ ℳ_C and therefore
ℳ(𝒜) ⊆ ℳ_C. Consequently ℳ(𝒜) is an algebra of sets, since for any C and
D with C, D ∈ ℳ(𝒜) we have D ∈ ℳ_C, hence C + D ∈ ℳ(𝒜) and
C − D ∈ ℳ(𝒜). But a monotone algebra of sets is obviously a σ-algebra;
thus σ(𝒜) ⊆ ℳ(𝒜). And since σ(𝒜) is a monotone class containing 𝒜 and
ℳ(𝒜) is the least class of this kind, we have necessarily σ(𝒜) = ℳ(𝒜),
which proves the lemma.

Now we shall prove the theorem. Let ℬ_n denote the least algebra of sets
containing A_1, A_2, ..., A_n. Let ℳ_n be the collection of the sets which
are independent of all elements of ℬ_n; ℳ_n is a monotone class. If ℬ^(n)
is the least algebra containing A_{n+1}, A_{n+2}, ..., then ℬ^(n) ⊆ ℳ_n.
Hence, because of the lemma, σ(ℬ^(n)) = 𝒜^(n) ⊆ ℳ_n. Thus if C ∈ 𝒜^(n)
(n = 1, 2, ...), then C is independent of every A with A ∈ ℬ_n
(n = 1, 2, ...). If ℳ(C) is the collection of all sets independent of C, we
have ℬ_n ⊆ ℳ(C), hence Σ_{n=1}^∞ ℬ_n ⊆ ℳ(C) too. By applying once more
the lemma we find σ(Σ_{n=1}^∞ ℬ_n) ⊆ ℳ(C). Since C ∈ 𝒜^(1) ⊆ σ(Σ_{n=1}^∞ ℬ_n),
it follows that C ∈ ℳ(C), hence P(CC) = P(C)P(C), i.e. P(C) = P²(C). But
this is impossible unless either P(C) = 0 or P(C) = 1. Thus Theorem 1 is
proved. Finally we mention a generalization of the above theorem.
THEOREM 2. Let ξ_1, ξ_2, ... be an infinite sequence of independent random
variables and 𝒜^(n) the least σ-algebra with respect to which the random
variables ξ_{n+1}, ξ_{n+2}, ... are measurable. If C ∈ Π_{n=1}^∞ 𝒜^(n), then
either P(C) = 0 or P(C) = 1.

The proof will only be sketched, because it is similar to that of Theorem 1.
Let 𝒯 be the collection of sets independent of C. As in the previous proof,
we show for the least σ-algebra ℬ_n, relative to which the random variables
ξ_1, ξ_2, ..., ξ_n are measurable, that ℬ_n ⊆ 𝒯 (n = 1, 2, ...). Hence,
according to the lemma, σ(Σ_{n=1}^∞ ℬ_n) ⊆ 𝒯, therefore C ∈ 𝒯 and,
consequently, P²(C) = P(C). Notice that Theorem 2 of § 10 can also be
deduced from this.

§ 14. Kolmogorov’s three-series theorem

THEOREM 1. Let η_1, η_2, ..., η_n, ... be independent random variables and
let λ be an arbitrary positive number. Let the random variables η_n* be
defined by

η_n* = η_n for |η_n| ≤ λ,  η_n* = 0 otherwise.

The series Σ_{n=1}^∞ η_n converges with probability 1 iff the following
three series converge:

Σ_{n=1}^∞ P(η_n ≠ η_n*),          (1)

Σ_{n=1}^∞ E(η_n*),          (2)

Σ_{n=1}^∞ D²(η_n*).          (3)

Remark. It is easy to see from the zero-one law that Σ_{n=1}^∞ η_n converges
either with probability one or with probability zero.

PROOF. We show first that the conditions (1), (2), (3) are sufficient. From
(1) and the Borel-Cantelli lemma it follows that with probability 1 we have
η_n = η_n* for all sufficiently large n; hence the series Σ_{n=1}^∞ η_n and
Σ_{n=1}^∞ η_n* are, with probability 1, simultaneously convergent or
divergent. Thus it suffices to show that Σ_{n=1}^∞ η_n* converges with
probability 1. Because of (2), it suffices to prove this for the series
Σ_{n=1}^∞ δ_n, where δ_n = η_n* − E(η_n*). We know that the random variables
δ_n are completely independent, further that E(δ_n) = 0 and
Σ_{n=1}^∞ D²(δ_n) < +∞. Hence for an ε > 0, however small, Kolmogorov's
inequality gives

P( max_{n≤m≤N} |Σ_{k=n}^m δ_k| > ε ) ≤ (1/ε²) Σ_{k=n}^N D²(δ_k),          (4)

and, if we let N tend to infinity for every fixed n,

P( sup_{n≤m} |Σ_{k=n}^m δ_k| > ε ) ≤ (1/ε²) Σ_{k=n}^∞ D²(δ_k).          (5)

Choose now from the sequence of all positive integers a subsequence n_j
(n_1 < n_2 < ...) such that the series Σ_{j=1}^∞ d_j converges, where

d_j = Σ_{k=n_j}^∞ D²(δ_k).

Then the series

Σ_{j=1}^∞ P( sup_{n_j≤m} |Σ_{k=n_j}^m δ_k| > ε )          (6)

converges as well; by applying the Borel-Cantelli lemma we obtain that with
probability 1 the relation

|Σ_{k=n_j}^m δ_k| ≤ ε

holds for all sufficiently large j and every m ≥ n_j. If n is an integer
between n_j and m, we have with probability 1

|Σ_{k=n}^m δ_k| ≤ |Σ_{k=n_j}^{n−1} δ_k| + |Σ_{k=n_j}^m δ_k| ≤ 2ε.          (7)

If we replace ε successively by 1/(2M) (M = 1, 2, ...), we obtain (the union
of denumerably many sets of zero measure being a set of zero measure too)
that with probability 1

|Σ_{k=n}^m δ_k| ≤ 1/M for m > n,          (8)

whenever n is greater than a bound N(M) depending on M. But this means that
Σ_{k=1}^∞ δ_k converges with probability 1. Herewith the first part of
Theorem 1 is proved.

We show now that the conditions (1), (2), (3) are necessary as well for the
almost sure convergence of the series Σ_{n=1}^∞ η_n. If this series
converges with probability 1, then η_n → 0 as n → ∞ with probability 1.
Hence with probability 1 we must have |η_n| ≤ λ, i.e. η_n = η_n*, except for
finitely many values of n. Thus, the events η_n ≠ η_n* being independent,
the Borel-Cantelli lemma shows that the series (1) converges. Hence it
suffices to consider the series Σ_{n=1}^∞ η_n*, the sum of which, a random
variable itself, will be denoted by η*. In what follows we shall need a
lemma due to J. L. Doob.

LEMMA. If ξ is a bounded random variable, |ξ| ≤ M, with variance
D² = D²(ξ), its characteristic function φ_ξ(t) fulfils for |t| ≤ 1/(4M) the
inequality

|φ_ξ(t)| ≤ e^{−D²t²/3}.          (9)

PROOF. Put ξ* = ξ − E(ξ). We have E(ξ*) = 0, |ξ*| ≤ 2M, and
|φ_ξ(t)| = |φ_{ξ*}(t)|. We can write

|E(ξ*^n)| ≤ (2M)^{n−2} E(ξ*²) = (2M)^{n−2} D²    (n = 2, 3, ...)

and

φ_{ξ*}(t) = 1 − D²t²/2 + Σ_{n=3}^∞ E(ξ*^n)(it)^n/n!.

For |t| ≤ 1/(4M) we obtain

|φ_{ξ*}(t)| ≤ 1 − D²t²/2 + D²t² Σ_{n=3}^∞ (2M|t|)^{n−2}/n! ≤ 1 − D²t²/3,

i.e.

|φ_{ξ*}(t)| ≤ 1 − D²t²/3 ≤ e^{−D²t²/3},

which proves the lemma.

Now we return to the series Σ_{n=1}^∞ η_n*. By assumption |η_n*| ≤ λ; hence,
if φ_n(t) is the characteristic function of η_n* and D_n² its variance, we
have, according to the lemma,

|φ_n(t)| ≤ e^{−D_n²t²/3}  for |t| ≤ 1/(4λ).          (10)

Since Σ_{n=1}^N η_n* converges to η* with probability 1 as N → ∞, the
distribution function of Σ_{n=1}^N η_n* converges to the distribution
function of η* at every point of continuity of the latter. Hence

Π_{n=1}^∞ φ_n(t) = ψ(t),

where ψ(t) denotes the characteristic function of η*; furthermore there
exists an ε > 0 with ψ(t) ≠ 0 for |t| < ε. Thus if |t| < min(ε, 1/(4λ)),
then −Σ_{n=1}^N ln|φ_n(t)| tends to −ln|ψ(t)| as N → ∞. Because of (10) we
have

Σ_{n=1}^∞ D_n² ≤ −(3/t²) ln|ψ(t)|;          (11)

Σ_{n=1}^∞ D_n² is thus convergent, and according to the first part of the
theorem the series Σ_{n=1}^∞ δ_n converges with probability 1. Since by
hypothesis the same holds for Σ_{n=1}^∞ η_n*, the series

Σ_{n=1}^∞ E(η_n*) = Σ_{n=1}^∞ (η_n* − δ_n)

converges with probability 1 too. Theorem 1 is thus proved.

It is easy to see from this proof that the following result is also valid:

THEOREM 2. If η_n (n = 1, 2, ...) are independent random variables for
which D²(η_n) exist, and if the series

Σ_{n=1}^∞ E(η_n)          (12)

and

Σ_{n=1}^∞ D²(η_n)          (13)

converge, then the series Σ_{n=1}^∞ η_n converges with probability 1.

The assumptions of Theorem 2 are not necessary for the convergence of
Σ_{n=1}^∞ η_n; even the existence of E(η_n) and D²(η_n) is not necessary.
Theorem 2 can be obtained as the limiting case for λ → +∞ of Theorem 1.

It should be noticed that the hypotheses of Theorems 1 and 2 do not
guarantee the absolute convergence of the series Σ_{n=1}^∞ η_n. Thus, for
instance,
if P(η_n = ±1/n) = 1/2, then E(η_n) = 0 and D²(η_n) = 1/n², so the
conditions of Theorem 2 are fulfilled, but the series Σ_{n=1}^∞ |η_n| is
divergent. By Theorem 2, however, the series Σ_{n=1}^∞ η_n converges with
probability 1 after any rearrangement of its terms; the sum and the
exceptional set of divergence depend of course on the rearrangement in
question.
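To make this example concrete, here is a small Python experiment (an added
illustration, not part of the original text) with the random signs ±1/n:
every realization of the partial sums settles down, although Σ 1/n itself
diverges:

    import random

    def random_harmonic(n_terms=100000, rng=random):
        # Partial sum of the series with terms +1/n or -1/n, each with
        # probability 1/2, independently for every n.
        s = 0.0
        for n in range(1, n_terms + 1):
            s += (1.0 / n) if rng.random() < 0.5 else (-1.0 / n)
        return s

    # Several independent realizations: each converges, but to a different
    # (random) sum, in accordance with Theorem 2.
    print([round(random_harmonic(), 4) for _ in range(5)])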
For the almost sure convergence of Σ_{n=1}^∞ |η_n| it is sufficient,
according to a well-known theorem of Beppo Levi, that
Σ_{n=1}^∞ E(|η_n|) < +∞. On the other hand, it is sufficient that the series
(1) and Σ_{n=1}^∞ E(|η_n*|) converge, where η_n* is defined as in Theorem 1.
This condition is necessary as well, since if Theorem 1 is applied to the
sequence |η_n|, it can be seen that the convergence of Σ_{n=1}^∞ |η_n*|
implies that of Σ_{n=1}^∞ E(|η_n*|).

Theorem 1 is stronger than the law of large numbers. Thus, for instance,
Theorem 2 of § 7 can be deduced from the present Theorem 2 as follows: Let
ξ_1, ξ_2, ... be independent random variables with E(ξ_n) = 0,
D(ξ_n) = D_n, and assume Σ_{n=1}^∞ D_n²/n² < +∞. If we put η_n = ξ_n/n, the
hypotheses of Theorem 2 are fulfilled; hence the series Σ_{n=1}^∞ η_n
converges with probability 1. According to Kronecker's lemma (Lemma 2 of
§ 7) with q_n = n it follows that with probability 1

lim_{n→∞} (Σ_{k=1}^n k η_k)/n = lim_{n→∞} (1/n) Σ_{k=1}^n ξ_k = 0,

which proves Theorem 2 of § 7.

§ 15. Laws of large numbers on conditional probability spaces

The relation of conditional probability to conditional relative frequency
is the same as the relation of ordinary probability to ordinary relative
frequency. This is reflected in the following

THEOREM 1. Let 𝒮 = [Ω, 𝒜, ℬ, P(A | B)] be a conditional probability
space, C an event with C ∈ ℬ, and let ξ_n (n = 1, 2, ...) be a sequence of
random variables on 𝒮 which are independent with respect to C. Let further V
be a Borel set on the real axis and B_n the set of the elements ω ∈ C such
that ξ_n(ω) ∈ V. Suppose B_n ∈ ℬ (n = 1, 2, ...). Then clearly B_n ⊆ C
(n = 1, 2, ...). Let further the conditional variances

D²(ξ_n | B_n) = D_n²    (n = 1, 2, ...)          (1)

exist and assume that the conditional expectations

E(ξ_n | B_n) = M    (n = 1, 2, ...)          (2)

do not depend on n. Put

p_n = P(B_n | C)          (3)

and assume that p_n > 0 (n = 1, 2, ...). Suppose further that

Σ_{n=1}^∞ p_n = +∞          (4)

and

Σ_{n=1}^∞ p_n D_n² / (Σ_{k=1}^n p_k)² < +∞.          (5)

Define

S_n(V) = Σ_{1≤k≤n, ξ_k∈V} ξ_k  and  N_n(V) = Σ_{1≤k≤n, ξ_k∈V} 1.          (6)

S_n(V) is thus the sum of those ξ_k (1 ≤ k ≤ n) whose values belong to V,
and N_n(V) is their number. Then

P( lim_{n→∞} S_n(V)/N_n(V) = M | C ) = 1.

PROOF. Let us put

ε_k = ε_k(ω) = 1 for ξ_k(ω) ∈ V,  ε_k(ω) = 0 otherwise,

and

δ_k = ε_k (ξ_k − M) / Σ_{j=1}^k p_j.
Consider the series Σ_{k=1}^∞ δ_k. The δ_k are independent under the
condition C; further E(δ_k | C) = 0 and

D²(δ_k | C) = p_k D_k² / (Σ_{j=1}^k p_j)²;

thus by hypothesis (5) the series Σ_{k=1}^∞ D²(δ_k | C) is convergent.
Kolmogorov's three-series theorem shows that the series Σ_{k=1}^∞ δ_k
converges with probability 1 under the condition C. If we apply Kronecker's
lemma with q_n = Σ_{k=1}^n p_k, we obtain

P( lim_{n→∞} Σ_{k=1}^n ε_k(ξ_k − M) / Σ_{k=1}^n p_k = 0 | C ) = 1.          (7)

Put now

η_k = (ε_k − p_k) / Σ_{j=1}^k p_j.

By repeating the preceding reasoning for the series Σ_{k=1}^∞ η_k we find
that E(η_k | C) = 0 and

D²(η_k | C) = p_k(1 − p_k) / (Σ_{j=1}^k p_j)².

The series Σ_{k=1}^∞ D²(η_k | C) converges by the lemma of Abel and Dini;¹
it follows (as in the proof of (7)) that

P( lim_{n→∞} Σ_{k=1}^n ε_k / Σ_{k=1}^n p_k = 1 | C ) = 1.          (8)

(7) and (8) lead to

P( lim_{n→∞} Σ_{k=1}^n ε_k ξ_k / Σ_{k=1}^n ε_k = M | C ) = 1.          (9)

Since Σ_{k=1}^n ε_k ξ_k = S_n(V) and Σ_{k=1}^n ε_k = N_n(V), Theorem 1 is
herewith proved.

¹ Cf. K. Knopp [1].
The quotient S_n(V)/N_n(V) can be considered as the empirical conditional
average of the random variables ξ_k; indeed, it is the arithmetic mean of
those ξ_k (k = 1, 2, ..., n) whose values belong to V.

The conditional strong law of large numbers can therefore be stated as
follows: If (4) and (5) are valid, the arithmetic mean of those values of
ξ_1, ξ_2, ..., ξ_n which belong to V converges with probability 1 to the
common conditional expectation of the ξ_k under the condition ξ_k ∈ V.

Remarks.

1. If D_k is bounded, e.g. independent of k, condition (5) is a consequence
of (4) by the Abel-Dini theorem; hence in this case it suffices that (4) be
fulfilled.

2. If instead of (2) we suppose that M_n = E(ξ_n | B_n) fulfils

Σ_{n=1}^∞ p_n |M_n − M| / Σ_{k=1}^n p_k < +∞          (10)

and if we replace (5) by the condition

Σ_{n=1}^∞ p_n (D_n² + M_n²(1 − p_n)) / (Σ_{k=1}^n p_k)² < +∞,          (5')

Theorem 1 remains valid; the proof is similar.

3. If the whole real axis is taken for V, then clearly B_n = C and p_n = 1;
hence (4) is trivially fulfilled and (5) reduces to the condition
Σ_{n=1}^∞ D_n²/n² < +∞. We thus obtain, as a particular case of Theorem 1,
Theorem 2 of § 7 for the ordinary probability space [Ω, 𝒜, P(A | C)].
Notice that it is possible to state and to prove Theorem 1 without reference
to conditional probability spaces; however, in the form given above it shows
how the strong law of large numbers can be extended to conditional
probability spaces.

4. Consider now the following special case: Let V be the set which consists
only of the two elements 0 and 1, and let P(ξ_n = 1 | B_n) = p and
P(ξ_n = 0 | B_n) = 1 − p = q. This situation can be described as follows:
𝒮 represents an infinite sequence of independent experiments, A and B being
possible outcomes of the individual experiments. If at the k-th experiment
both A and B occur, we have ξ_k = 1; if Ā and B occur, we have ξ_k = 0;
finally, if B does not occur at the k-th experiment, ξ_k takes on a value
distinct from 0 and 1. Then

M = E(ξ_n | B_n) = P(ξ_n = 1 | B_n) = p.

The number p can be considered as the conditional probability of A with
respect to B; we write therefore p = P(A | B). Furthermore, D_n = √(pq) and
(4) implies (5). The quotient f_n(A | B) = S_n(V)/N_n(V) is thus the
conditional relative frequency of A with respect to B in the course of the
first n experiments.

Theorem 1 states that f_n(A | B) tends to P(A | B) with conditional
probability 1 under a given condition C. In this interpretation C is some
condition concerning the infinite sequence of experiments, and
p_n = P(B_n | C) is the conditional probability of B at the n-th
(n = 1, 2, ...) experiment.
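The special case of Remark 4 is easy to check empirically. The Python
sketch below (an added illustration; the joint probabilities of A and B are
an arbitrary example, and C is taken to be the sure event) estimates the
conditional relative frequency f_n(A | B) and compares it with P(A | B):

    import random

    # Hypothetical joint probabilities for one experiment:
    # P(AB) = 0.2 and P(A'B) = 0.3, so P(B) = 0.5 and P(A | B) = 0.4.
    P_AB, P_AcB = 0.2, 0.3

    def f_n(n, rng=random):
        hits_B = hits_AB = 0
        for _ in range(n):
            u = rng.random()
            if u < P_AB:
                hits_B += 1; hits_AB += 1      # both A and B occur
            elif u < P_AB + P_AcB:
                hits_B += 1                    # B occurs without A
        return hits_AB / max(hits_B, 1)        # conditional relative frequency

    for n in (100, 10000, 1000000):
        print(n, f_n(n))   # should approach P(A | B) = 0.4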

§ 16. Exercises

1. Prove Theorem 2 of § 3 by the method of characteristic functions in the
case where the ξ_n are not merely pairwise but completely independent.

2. Prove the generalization of Theorem 2 in § 3: Let the random variables
ξ_1, ξ_2, ... satisfy the following conditions:

a) the expectations M_n = E(ξ_n) exist and

lim_{n→∞} (1/n) Σ_{k=1}^n M_k = M;

b) the variances D_n² = D²(ξ_n) exist and

lim_{n→∞} (1/n²) Σ_{k=1}^n D_k² = 0;

c) the correlation coefficients R_{ij} = R(ξ_i, ξ_j) fulfil the inequalities

Σ_{i=1}^∞ Σ_{j=1}^∞ R_{ij} x_i x_j ≤ C Σ_{j=1}^∞ x_j²

for every system of real values x_i such that Σ_{i=1}^∞ x_i² converges; C is
a positive constant. Given these conditions,

lim st (1/n) Σ_{k=1}^n ξ_k = M.

3. Prove the following theorem: If f(x, y) is a uniformly continuous
function of two variables, if lim st ξ_n = ξ and lim st η_n = η, then

lim st f(ξ_n, η_n) = f(ξ, η).

If ξ = a and η = b are constants, continuity of f(x, y) at the point (a, b)
is sufficient.

4. Let ξ_n (n = 1, 2, ...) be independent random variables with
P(ξ_n = ±√n) = 1/2, hence E(ξ_n) = 0 and D(ξ_n) = √n. Therefore the
condition

lim_{n→∞} (1/n²) Σ_{k=1}^n D²(ξ_k) = 0

of Theorem 2 in § 3 is not fulfilled. Put ζ_n = (ξ_1 + ξ_2 + ... + ξ_n)/n
and show that ζ_n does not tend in probability to zero.

Hint. Let φ_n(t) be the characteristic function of ζ_n; then

φ_n(t) = Π_{k=1}^n cos(t√k/n)

and

lim_{n→∞} φ_n(t) = e^{−t²/4};

thus φ_n(t) does not tend to 1.

Remark. The distribution of ζ_n converges to a normal distribution.

5. Let ξ_1, ξ_2, ..., ξ_n, ... be pairwise independent random variables with

P(ξ_n = ±n^δ) = 1/2.

Show that the law of large numbers holds for the sequence ξ_n if
0 < δ < 1/2.

6. Let the events A_1, A_2, ..., A_n be the possible results of an
experiment. Let there be performed N such independent experiments. The
probability that the event A_k occurs exactly ν_k(N) times (k = 1, 2, ..., n),
and in a given order, is equal to

π_N = Π_{k=1}^n p_k^{ν_k(N)},

where p_k = P(A_k). Since π_N depends on the sequence ν_k(N)
(k = 1, 2, ..., n) and the ν_k(N) are random variables, π_N is a random
variable as well. Obviously

E( (1/N) log₂(1/π_N) ) = −Σ_{k=1}^n p_k log₂ p_k.

The quantity H(𝒜) = −Σ_{k=1}^n p_k log₂ p_k is called the entropy of the
complete system of events 𝒜 = (A_1, A_2, ..., A_n) (cf. Appendix). Prove
the limit relation

lim st (1/N) log₂(1/π_N) = H(𝒜).

Hint. According to the law of large numbers

lim st ν_k(N)/N = p_k  for k = 1, 2, ..., n.
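For readers who wish to experiment, the following Python sketch (an added
illustration; the distribution p is an arbitrary example) estimates
(1/N) log₂(1/π_N) by simulation and compares it with the entropy H:

    import math, random

    p = [0.5, 0.25, 0.125, 0.125]            # example distribution
    H = -sum(q * math.log2(q) for q in p)

    def empirical_rate(N, rng=random):
        log_pi = 0.0
        for _ in range(N):
            u, acc = rng.random(), 0.0
            for q in p:
                acc += q
                if u < acc:
                    log_pi += math.log2(q)   # log2 of this outcome's probability
                    break
        return -log_pi / N                   # (1/N) log2 (1 / pi_N)

    print(H, empirical_rate(100000))         # the two numbers should be close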

7. Let an urn contain a_0 white and b_0 red balls. If we draw from the urn a
white ball, we put it back and besides we add to the urn a_1 white balls and
b_1 red balls. If we draw a red ball, we put it back and add to the urn a_2
white and b_2 red balls, where a_1 + b_1 = a_2 + b_2, a_2 > 0. The same
procedure is repeated after all subsequent drawings. Let ξ_n denote the
number of white balls drawn in the first n drawings. Prove the relation

lim st ξ_n/n = a_2/(b_1 + a_2).

Hint. It is easy to show that lim_{n→∞} E(ξ_n/n) = a_2/(b_1 + a_2); further
lim_{n→∞} D(ξ_n/n) = 0; hence our statement follows from Chebyshev's
inequality.
8. a) Let η_n (n = 1, 2, ...) be bounded random variables, |η_n| ≤ C. The
necessary and sufficient condition that η_n should converge in probability
to zero is the fulfilment of the relation

lim_{n→∞} E(|η_n|) = 0.          (1)

Hint. Applying Markov's inequality to the random variable |η_n|, we obtain

P(|η_n| > ε) ≤ E(|η_n|)/ε,

hence condition (1) is sufficient for η_n → 0 in probability.

Suppose now that lim st η_n = 0. Let A_n(δ) be the event |η_n| > δ, with an
arbitrary δ > 0. We have then

E(|η_n|) = E(|η_n| | A_n(δ)) P(A_n(δ)) + E(|η_n| | Ā_n(δ)) P(Ā_n(δ)) ≤ C P(A_n(δ)) + δ.

By assumption lim_{n→∞} P(A_n(δ)) = 0. Hence lim sup_{n→∞} E(|η_n|) ≤ δ.
Since δ can be arbitrarily small, the necessity of (1) is proved.

b) Suppose that lim st ζ_n = c and that f(x) is a Borel-measurable bounded
function which is continuous at the point c. Then lim_{n→∞} E(f(ζ_n)) = f(c).

Hint. Evidently, lim st (f(ζ_n) − f(c)) = 0. Since f(x) is bounded, it
follows because of Exercise 8.a) that

lim_{n→∞} E(|f(ζ_n) − f(c)|) = 0,

hence

lim_{n→∞} E(f(ζ_n)) = f(c).

9. Let f(x) and g(x) be continuous functions on the closed interval [0, 1]
which fulfil the relation 0 ≤ f(x) ≤ C g(x), where C is a positive constant.
Then

lim_{n→∞} ∫₀¹...∫₀¹ [f(x_1) + f(x_2) + ... + f(x_n)] / [g(x_1) + g(x_2) + ... + g(x_n)] dx_1 ... dx_n = ∫₀¹ f(x) dx / ∫₀¹ g(x) dx.

Hint. Choose in the unit cube of the n-dimensional space a point P at random
(with uniform probability distribution); let ξ_1, ξ_2, ..., ξ_n denote its
coordinates; the ξ_k (k = 1, 2, ..., n) are thus independent and uniformly
distributed on [0, 1]. Put

η_n = (f(ξ_1) + f(ξ_2) + ... + f(ξ_n))/n

and

ζ_n = (g(ξ_1) + g(ξ_2) + ... + g(ξ_n))/n.

We have thus

lim st η_n = ∫₀¹ f(x) dx,  lim st ζ_n = ∫₀¹ g(x) dx,

and, since ∫₀¹ g(x) dx > 0, we have by the result of Exercise 3

lim st η_n/ζ_n = ∫₀¹ f(x) dx / ∫₀¹ g(x) dx.

Since further 0 ≤ η_n/ζ_n ≤ C, we get from the result of Exercise 8.b)

lim_{n→∞} E(η_n/ζ_n) = ∫₀¹ f(x) dx / ∫₀¹ g(x) dx,

q.e.d.

10. Prove that the limit relation

lim_{n→∞} Σ_{k=0}^∞ f(x + k/n) e^{−nh} (nh)^k/k! = f(x + h)

holds for every h > 0 and x > 0 if f(x) is a bounded continuous function on
(0, +∞) (theorem of E. Hille).

Hint. Let ξ_1, ξ_2, ... be independent random variables having a common
Poisson distribution with E(ξ_k) = h. Put ζ_n = (1/n) Σ_{k=1}^n ξ_k. Then
lim st ζ_n = h. Since f(x) is by assumption continuous and bounded, it
follows according to Exercise 8.b) that

lim_{n→∞} E(f(x + ζ_n)) = f(x + h),

which was to be proved.
11. Let g(s) denote the Laplace transform of a function f(x) which is
bounded and continuous in the interval [0, +∞):

g(s) = ∫₀^∞ e^{−sx} f(x) dx.

Prove the Post-Widder inversion formula

f(x) = lim_{n→∞} (−1)^{n−1} n^n g^{(n−1)}(n/x) / (x^n (n−1)!)  for x > 0.

Hint. Let ξ_1, ξ_2, ..., ξ_n, ... be independent random variables having the
same exponential distribution with expectation x, i.e.
P(ξ_k < t) = 1 − e^{−t/x} for t > 0. If we put ζ_n = (1/n) Σ_{k=1}^n ξ_k,
then lim st ζ_n = x, hence (see Exercise 8.b))

lim_{n→∞} E(f(ζ_n)) = f(x).

Now we have (cf. Formula (24) of Ch. IV, § 9)

E(f(ζ_n)) = ∫₀^∞ f(t) n^n t^{n−1} exp(−nt/x) / (x^n (n−1)!) dt = (−1)^{n−1} n^n g^{(n−1)}(n/x) / (x^n (n−1)!),

which leads to our statement.
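A short numerical experiment (added here for illustration) makes the
inversion formula concrete. For the example f(x) = e^{−x} one has
g(s) = 1/(s + 1) and (−1)^{n−1} g^{(n−1)}(s) = (n−1)!/(s + 1)^n, so the n-th
Post-Widder approximant simplifies to (1 + x/n)^{−n}:

    import math

    # Post-Widder approximants for f(x) = exp(-x), whose Laplace transform
    # is g(s) = 1/(s + 1); the n-th approximant
    # (-1)^(n-1) (n/x)^n g^(n-1)(n/x) / (n-1)! reduces to (1 + x/n)^(-n).
    x = 1.5
    for n in (1, 10, 100, 1000):
        approx = (1 + x / n) ** (-n)
        print(n, approx, "->", math.exp(-x))

The approximants converge to e^{−x}, in agreement with the formula, though
rather slowly (the error is of order 1/n).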
12. Let η be uniformly distributed on [0, 1]. Let ϱ_n(r) denote the number
of occurrences of the digit r (r = 0, 1, ..., 9) among the first n digits of
the decimal expansion of η; show that

a) P( lim_{n→∞} ϱ_n(r)/n = 1/10 ) = 1  (r = 0, 1, ..., 9),

b) P( lim sup_{n→∞} |ϱ_n(r) − n/10| / √(2n ln ln n) ≤ 3/10 ) = 1.

Hint. Let the random variable ξ_n(r) be equal either to 1 or to 0 according
as the n-th digit in the decimal expansion of η is equal to r or distinct
from it; then the ξ_n(r) (n = 1, 2, ...) are independent and have the same
distribution, P(ξ_n(r) = 1) = 1/10, P(ξ_n(r) = 0) = 9/10 (r = 0, 1, ..., 9).
a) is obtained from the strong law of large numbers, b) from the law of the
iterated logarithm.
13. Let ξ_1, ξ_2, ..., ξ_n, ... be independent random variables with

P(ξ_k = ±1) = 1/2  (k = 1, 2, ..., n).

Put η_k = ξ_1 + ξ_2 + ... + ξ_k and ζ_n = max_{1≤k≤n} η_k.

a) If q > 0, then

P(ζ_n ≥ q) ≤ 2 P(η_n ≥ q) ≤ 2 e^{−q²/(2n)}.

b) Show by means of the result of a) that for every ε > 0 and all
sufficiently large n

P( ζ_n ≥ (1 + ε)√(2n ln ln n) ) ≤ 2 (ln n)^{−(1+ε)²}.

c) Using b) show that

P( lim sup_{n→∞} ζ_n/√(2n ln ln n) ≤ 1 ) = 1.

d) Show that

E(ζ_n) = Σ_{k=1}^{n−1} (1/2^k) C(k−1, ⌊(k−1)/2⌋)  for n = 2, 3, 4, ...

and conclude from it that E(ζ_n) ≈ √(2n/π).

e) Show finally that

lim_{n→∞} P(ζ_n < x√n) = √(2/π) ∫₀^x e^{−u²/2} du  for x > 0,

and the limit is 0 otherwise.

Hint. Let p_{n,k} = P(ζ_n = k) (k = −1, 0, 1, ..., n). We have the following
recursive formulas:

p_{n+1,k} = (1/2)(p_{n,k−1} + p_{n,k+1})  for k = 2, 3, ..., n + 1,

p_{n+1,1} = (1/2)(p_{n,1} + p_{n,2} + ...),

p_{n+1,−1} = (1/2)(p_{n,−1} + p_{n,0}),

p_{n+1,0} = (1/2) p_{n,1};

a) follows by induction.
14. Let ξ_1, ξ_2, ..., ξ_n, ... be independent random variables with
E(ξ_n) = 0, D(ξ_n) = D_n, and put ζ_n = ξ_1 + ξ_2 + ... + ξ_n; then for any
ε > 0 and n = 1, 2, ...

P( sup_{k≥n} |ζ_k|/k ≥ ε ) ≤ (1/ε²) ( (1/n²) Σ_{k=1}^n D_k² + Σ_{k=n+1}^∞ D_k²/k² ).          (2)

This inequality is due to J. Hájek.¹

Hint. Put η = Σ_{k=n}^∞ ζ_k² (1/k² − 1/(k+1)²). Then

E(η) = (1/n²) Σ_{k=1}^n D_k² + Σ_{k=n+1}^∞ D_k²/k².          (3)

Let A_k (k ≥ n) denote the event that the inequalities

|ζ_m|/m < ε  (m = n, n + 1, ..., k − 1)  and  |ζ_k|/k ≥ ε

hold, and put A = Ω − Σ_{k=n}^∞ A_k; then A, A_n, A_{n+1}, ... is a complete
system of events. Hence

E(η) = E(η | A) P(A) + Σ_{k=n}^∞ E(η | A_k) P(A_k).

Clearly

E(η | A_k) ≥ Σ_{m=k}^∞ E(ζ_m² | A_k)(1/m² − 1/(m+1)²).

If m > k, it can be shown, as in the proof of Kolmogorov's inequality, that

E(ζ_m² | A_k) ≥ E(ζ_k² | A_k) ≥ ε² k².

Hence

E(η | A_k) ≥ ε²,

and

E(η) ≥ ε² Σ_{k=n}^∞ P(A_k) = ε² P( sup_{k≥n} |ζ_k|/k ≥ ε ).          (4)

(3) and (4) imply (2).

15. Deduce Theorem 2 of § 7 from Exercise 14.

16. Prove the following generalization of Exercise 14. If ξ_1, ξ_2, ... are
completely independent and if E(ξ_k) = 0, D²(ξ_k) = D_k², further if
0 < B_1 ≤ B_2 ≤ ... is a sequence of positive numbers such that
Σ_{k=1}^∞ D_k²/B_k² < +∞, we have, for ε > 0,

P( sup_{k≥n} |ζ_k|/B_k ≥ ε ) ≤ (1/ε²) ( (1/B_n²) Σ_{k=1}^n D_k² + Σ_{k=n+1}^∞ D_k²/B_k² ).

¹ For this proof see J. Hájek and A. Rényi [1].
Hint. The proof of Exercise 14 can be repeated almost word for word.

17. Deduce from the inequality in Exercise 16 the following theorem: Let
η_1, η_2, ..., η_n, ... be completely independent random variables with
expectations E(η_k) = M_k > 0 and with finite variances D²(η_k) = D_k².
Suppose that

α) Σ_{k=1}^∞ M_k = +∞,

β) the series Σ_{k=1}^∞ D_k² / (Σ_{i=1}^k M_i)² is convergent.

Then with probability 1

lim_{n→∞} Σ_{k=1}^n η_k / Σ_{k=1}^n M_k = 1.

Hint. This is a generalization of Theorem 2 of § 7; in fact, if
η_k = ξ_k − M_k + 1, then E(η_k) = 1, D(η_k) = D_k; thus if
Σ_{k=1}^∞ D_k²/k² < +∞, then with probability 1

lim_{n→∞} (1/n) Σ_{k=1}^n η_k = 1

and thus

lim_{n→∞} (1/n) Σ_{k=1}^n (ξ_k − M_k) = 0.

18. Prove Theorem 1 of § 15 by means of Exercise 17.

Hint. Let η_k = ε_k(ξ_k − M + 1). Then

E(η_k | C) = P(B_k | C) = p_k,  D²(η_k | C) = p_k D_k² + p_k(1 − p_k).

Thus if conditions (4) and (5) of Theorem 1 of § 15 are fulfilled, conditions
α) and β) of Exercise 17 are fulfilled as well, and it follows with
conditional probability 1 with respect to the condition C that

lim_{n→∞} Σ_{k=1}^n ε_k(ξ_k − M + 1) / Σ_{k=1}^n p_k = 1.          (5)

If we apply the theorem of Exercise 17 to the sequence η_k = ε_k, we have
again, with conditional probability 1 under the condition C,

lim_{n→∞} Σ_{k=1}^n ε_k / Σ_{k=1}^n p_k = 1.          (6)

From (5) and (6) it follows that

P( lim_{n→∞} Σ_{k=1}^n ε_k ξ_k / Σ_{k=1}^n ε_k = M | C ) = 1.
19. If the random variables ξ_1, ξ_2, ..., ξ_n, ... are identically
distributed and if the fourth moment of ξ_n exists, then for the validity of
the strong law of large numbers it is sufficient that the ξ_k are
four-by-four independent (instead of being completely independent); thus, if
any four of the random variables are independent and if E(ξ_k) = 0,
D²(ξ_k) = D², E(ξ_k⁴) = M_4, further if we put ζ_n = (1/n) Σ_{k=1}^n ξ_k,
then we have

P( lim_{n→∞} ζ_n = 0 ) = 1.

Hint. If we apply Markov's inequality to the random variables ζ_n⁴, we
obtain

P(|ζ_n| > ε) ≤ E(ζ_n⁴)/ε⁴ = (nM_4 + 3n(n−1)D⁴)/(n⁴ε⁴) = O(1/n²).

Hence the series Σ_{n=1}^∞ P(|ζ_n| > ε) is convergent and we can use Lemma A
of § 5. (The idea of this proof is due to F. P. Cantelli.)

20. If ξ_1, ξ_2, ..., ξ_n, ... are identically distributed random variables
with finite variance, for the validity of the strong law of large numbers
even the weaker condition suffices that the ξ_k are pairwise uncorrelated
(instead of completely independent).

Hint. Let E(ξ_k) = 0, D(ξ_k) = D and

ζ_n = (1/n) Σ_{k=1}^n ξ_k.

According to Chebyshev's inequality, P(|ζ_{n²}| > ε) ≤ D²/(n²ε²); hence the
series Σ_n P(|ζ_{n²}| > ε) is convergent. By the Borel-Cantelli lemma

|ζ_{n²}| ≤ ε          (7)

with probability 1 for every large enough n. On the other hand, by
Chebyshev's inequality,

P( max_{n²≤N<(n+1)²} |Σ_{k=n²+1}^N ξ_k| > εn² ) ≤ Σ_{N=n²+1}^{(n+1)²−1} (N − n²)D²/(ε²n⁴) ≤ 4D²/(ε²n²).

Applying again the Borel-Cantelli lemma, we find that with probability 1

max_{n²≤N<(n+1)²} |Σ_{k=n²+1}^N ξ_k| ≤ εn²          (8)

for every sufficiently large n. (7) and (8) lead, for n² ≤ N < (n + 1)², to
the inequality |ζ_N| ≤ 2ε with probability 1 for every large enough N, which
proves our statement.

21. If ξ_1, ξ_2, ... are pairwise uncorrelated, if E(ξ_k) = 0,
D²(ξ_k) = D_k², and if the series Σ_{k=1}^∞ D_k²/k^{3/2} is convergent, then
the strong law of large numbers is valid.
Hint. Use the method of Exercise 20.

Remark. By a different method¹ it can be proved that even the convergence of
the series Σ_{k=1}^∞ D_k² ln²k / k² is sufficient.

22. Let us develop the number x, lying between 0 and 1, into the Cantor
series

x = Σ_{n=1}^∞ ε_n(x)/(q_1 q_2 ... q_n)

belonging to the sequence q_n (q_n ≥ 2, q_n integer), where the "digits"
ε_n(x) may take on the values 0, 1, ..., q_n − 1 (n = 1, 2, ...). If η is a
random variable uniformly distributed on the interval (0, 1), let f_n(k)
denote the number of the digits ε_j(η) equal to k (j = 1, 2, ..., n). Assume
that the sequence q_n fulfils the conditions lim_{n→∞} q_n = +∞ and
Σ_{n=1}^∞ 1/q_n = +∞. Show that

P( lim_{n→∞} f_n(k) / Σ_{j=1}^n 1/q_j = 1 ) = 1  for k = 0, 1, ....

Hint. Let

ζ_{nk} = 1 for ε_n(η) = k,  ζ_{nk} = 0 otherwise.

Then, for q_n > k, E(ζ_{nk}) = 1/q_n and D²(ζ_{nk}) = (1/q_n)(1 − 1/q_n);
the convergence of

Σ_{n=1}^∞ D²(ζ_{nk}) / (Σ_{j=1}^n 1/q_j)²

follows from the Abel-Dini theorem. Thus we can apply the result of
Exercise 17. The statement of the present exercise can also be obtained as a
particular case of Theorem 1 of § 15.

23. Let ν_n be the frequency of the event A in a sequence of n independent
experiments, where P(A) = p (0 < p < 1; q = 1 − p). Prove the complete form
of the law of the iterated logarithm, i.e.

P( lim sup_{n→∞} (ν_n − np)/√(2npq ln ln n) = 1 ) = 1          (9)

and

P( lim inf_{n→∞} (ν_n − np)/√(2npq ln ln n) = −1 ) = 1.          (10)

Hint. The proof of the Moivre-Laplace theorem shows that

P( (ν_n − np)/√(npq) ≥ x ) = 1 − Φ(x) + O(1/√n)

¹ Cf. J. L. Doob [2], p. 158, Theorem 5.2.
for x = O(√(ln ln n)). Since we have (cf. Ch. III, § 18, Exercise 18)

1 − Φ(x) > (1/√(2π)) · x/(1 + x²) · e^{−x²/2},

it follows that

P( (ν_n − np)/√(2npq ln ln n) > 1 − ε ) ≥ c/ln n  for n > n_0(ε),

where c > 0 is a constant. Let n_k = 2^k and let A_k denote the event

(ν_{n_k} − n_k p)/√(2n_k pq ln ln n_k) > 1 − ε.

Thus the series Σ_{k=1}^∞ P(A_k) is divergent. It is easy to show that the
sequence A_k fulfils the condition of Lemma C of § 5. Hence with probability
1 infinitely many of the events A_k occur. Since ε > 0 is arbitrarily small,
(9) is obtained. (10) can be proved in a similar way.
24. Let ξ_1, ξ_2, ... be pairwise independent random variables with common
distribution function F(x). Put ζ_n = (ξ_1 + ξ_2 + ... + ξ_n)/n and show
that in order that

lim st ζ_n = 0

should hold, the following two conditions are sufficient:

a) lim_{n→∞} ∫_{−n}^{+n} x dF(x) = 0,

b) lim_{x→−∞} xF(x) = lim_{x→+∞} x(1 − F(x)) = 0

(theorem of Kolmogorov).

Hint. Let

ξ_{nk} = ξ_k for |ξ_k| ≤ n,  ξ_{nk} = 0 otherwise,

and

ζ_n* = (1/n) Σ_{k=1}^n ξ_{nk}.

Then

P(ζ_n ≠ ζ_n*) ≤ Σ_{k=1}^n P(|ξ_k| > n) = n[F(−n) + 1 − F(n)],

and, because of b),

lim_{n→∞} P(ζ_n ≠ ζ_n*) = 0.

Thus it suffices to show that lim st ζ_n* = 0, since because of

P(|ζ_n| > ε) ≤ P(|ζ_n*| > ε) + P(ζ_n ≠ ζ_n*)

it follows from this that lim st ζ_n = 0. On the other hand, by a),

lim_{n→∞} E(ζ_n*) = lim_{n→∞} ∫_{−n}^{+n} x dF(x) = 0.

Furthermore,

D²(ζ_n*) ≤ (1/n) ∫_{−n}^{+n} x² dF(x) ≤ (1/n) Σ_{k=1}^n k² ∫_{k−1}^{k} dF(x) + (1/n) Σ_{k=1}^n k² ∫_{−k}^{−(k−1)} dF(x).

Putting a_k = 1 − F(k), b_k = F(−k), we may write

D²(ζ_n*) ≤ (1/n) Σ_{k=1}^n k² [ (a_{k−1} − a_k) + (b_{k−1} − b_k) ] ≤ (1/n) Σ_{k=0}^n (a_k + b_k)(2k + 1).

Because of b),

lim_{k→∞} (2k + 1)(a_k + b_k) = 0,

hence

lim_{n→∞} D²(ζ_n*) = 0,

and thus lim st ζ_n* = 0.

Remark. In the case of completely independent ξ_n the proof can be somewhat
simplified by employing characteristic functions; in fact

E(e^{itζ_n}) = ( 1 + ∫_{−∞}^{+∞} (e^{itx/n} − 1) dF(x) )^n = ( 1 + (it ∫_{−n}^{+n} x dF(x) + δ_n)/n )^n,

where

|δ_n| ≤ 2n[F(−n) + 1 − F(n)] + (t²/2n) ∫_{−n}^{+n} x² dF(x).

Hence by a) and b) it follows that

lim_{n→∞} E(e^{itζ_n}) = 1

for every real t.
CHAPTER VIII

THE LIMIT THEOREMS OF PROBABILITY THEORY


§ 1. The central limit theorems

On the basis of the theorems on characteristic functions established in
Chapter VI, we may now pass to the proofs of theorems concerning limit
distributions. Most important among these are the so-called "central limit
theorems", which express the fact that the distribution of the sum of a
large number of independent random variables approaches, under very general
conditions, the normal distribution. These theorems disclose the reason why
in applications distributions close to normal distributions are so often
encountered.

A typical example is the case of errors of measurement; the total error is
usually composed of many small errors. The central limit theorems justify
the assumption that the errors of measurement are normally distributed;
that is why the normal distribution is sometimes called the law of errors.

The simplest case of the central limit theorem, namely the Moivre-Laplace
theorem, was already dealt with in Chapter III. First we give here a
somewhat different formulation of this theorem.

Let there be performed n independent experiments having for their possible
outcomes either the occurrence of an event A or its non-occurrence Ā. Put
p = P(A), q = 1 − p = P(Ā). Let the value of the random variable ξ_k be
either 1 or 0, according as the event A occurs or does not occur at the k-th
(k = 1, 2, ..., n) experiment. Put further

ζ_n = ξ_1 + ξ_2 + ... + ξ_n.

We know already that E(ζ_n) = np and D(ζ_n) = √(npq). The linear
transformation which transforms the random variable ζ into the random
variable (ζ − E(ζ))/D(ζ), having expectation zero and standard deviation
one, is called standardization. Let ζ_n* = (ζ_n − np)/√(npq) be the
standardized variable corresponding to ζ_n. Since ζ_n is evidently a
binomial variable, P(ζ_n = k) = C(n, k) p^k q^{n−k}. Therefore the
Moivre-Laplace theorem (Ch. III, § 16,
Theorem 3) can be formulated as follows:

lim_{n→+∞} P(ζ_n* < x) = Φ(x),          (1)

where

Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−u²/2} du          (2)

is the normal distribution function. In other words, the distribution of the
standardized relative frequency of the event A during n independent
experiments tends to the normal distribution as n → +∞.
The random variables ξ_k are of a very restricted nature: they assume only
the values 0 and 1. The central limit theorem in its most simple form, which
is an immediate generalization of the Moivre-Laplace theorem, can be stated
as follows:

THEOREM 1. Let ξ_1, ξ_2, ..., ξ_n, ... be independent, identically
distributed random variables for which M = E(ξ_n) and D = D(ξ_n) > 0 exist;
put

ζ_n = Σ_{k=1}^n ξ_k  and  ζ_n* = (ζ_n − nM)/(D√n).

If F_n(x) is the distribution function of ζ_n*, we have

lim_{n→∞} F_n(x) = Φ(x) uniformly for −∞ < x < +∞.          (3)

PROOF. Let φ(t) be the characteristic function of η_k = ξ_k − M and ψ_n(t)
the characteristic function of ζ_n*. Since E(ζ_n) = nM and D(ζ_n) = D√n, it
follows from Theorems 3 and 6 of Chapter VI, § 2 that

ψ_n(t) = [φ(t/(D√n))]^n.          (4)

From Theorem 9 of Chapter VI, § 2 it follows, because of E(η_k) = 0, that

φ(t/(D√n)) = 1 − t²/(2n) + o(1/n).          (5)

By applying the well-known formula

lim_{n→+∞} (1 + x_n/n)^n = e^x  if lim_{n→+∞} x_n = x,          (6)

we conclude that for all values of t

lim_{n→+∞} ψ_n(t) = e^{−t²/2}  (−∞ < t < +∞),          (7)

which, because of Theorem 3 of Chapter VI, § 4, proves (3). It is easy to
see (in view of the continuity of Φ(x) for all x) that the convergence is
necessarily uniform in x. Evidently Theorem 1 contains the Moivre-Laplace
theorem as a particular case.
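The reader may find it instructive to see Theorem 1 "in action". The
following Python sketch (an added illustration; the skewed exponential
summand distribution is an arbitrary example) compares the empirical
distribution function of ζ_n* with Φ(x) at a few points:

    import math, random

    def Phi(x):                      # the normal distribution function (2)
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))

    # Summands: exponential with mean 1, so that M = 1 and D = 1.
    def standardized_sum(n, rng=random):
        s = sum(-math.log(1.0 - rng.random()) for _ in range(n))
        return (s - n) / math.sqrt(n)

    samples = sorted(standardized_sum(50) for _ in range(20000))
    for x in (-1.0, 0.0, 1.0):
        emp = sum(1 for v in samples if v < x) / len(samples)
        print(x, emp, Phi(x))        # empirical F_n(x) against the limit

Even for n = 50 the two columns agree to about two decimal places, despite
the pronounced skewness of the individual summands.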
The statement of Theorem 1 remains valid under much more general
conditions. This was shown first by Chebyshev and Markov by an entirely
different method, namely by the method of moments (see § 13, Exercise 27).
The method of characteristic functions was first employed by Liapunov. He
proved by essentially this method that the central limit theorem can be
applied under much more general conditions than those of Chebyshev and
Markov.¹ The result of Liapunov can be stated as follows:

THEOREM 2. Let ξ_1, ξ_2, ..., ξ_n, ... be independent random variables, of
which the first three moments

M_k = E(ξ_k),  D_k² = D²(ξ_k) > 0,  H_k³ = E(|ξ_k − M_k|³)

exist (k = 1, 2, ...). Put

S_n = √(D_1² + D_2² + ... + D_n²),          (8)

K_n = ∛(H_1³ + H_2³ + ... + H_n³),          (9)

ζ_n = Σ_{k=1}^n ξ_k  and  ζ_n* = (ζ_n − E(ζ_n))/S_n.          (10)

Let F_n(x) denote the distribution function of ζ_n*. If Liapunov's condition

lim_{n→+∞} K_n/S_n = 0          (11)

is fulfilled, then

lim_{n→+∞} F_n(x) = Φ(x)  (−∞ < x < +∞).          (12)

Remark. The condition (11) is evidently fulfilled when all ξ_k have the same
distribution. In effect, in this case D_k = D, H_k = H, S_n = D√n,
K_n = H ∛n, hence

lim_{n→+∞} K_n/S_n = (H/D) lim_{n→+∞} 1/n^{1/6} = 0.

It is again fulfilled when the random variables ξ_k − M_k are uniformly
bounded and lim_{n→∞} S_n = +∞. In fact, from |ξ_k − M_k| ≤ C follows

H_k³ ≤ C D_k²,

hence

K_n/S_n ≤ ∛C / ∛S_n,

and since S_n → +∞, condition (11) is satisfied.

¹ Later on it was proved by Markov that Liapunov's theorem can also be
proved by the method of moments.


Liapunov proved the central limit theorem starting from still more general
hypotheses. As a matter of fact, it suffices to assume the existence of the
moments of order β (for some arbitrary β > 2) instead of the third moments;
in this case instead of (11) one has to suppose

lim_{n→+∞} K_n(β)/S_n = 0,          (13)

where

K_n(β) = ( Σ_{k=1}^n E(|ξ_k − M_k|^β) )^{1/β}.          (14)

Lindeberg proved the central limit theorem under still more general
conditions. His condition is, in a certain sense, necessary as well. It is
formulated in the following theorem due to Lindeberg:

THEOREM 3. Let ξ_1, ξ_2, ..., ξ_n, ... be independent random variables for
which the expectations M_k = E(ξ_k) and the standard deviations
D_k = D(ξ_k) exist (k = 1, 2, ...). Put

S_n = √( Σ_{k=1}^n D_k² )          (15)

and let F_k(x) be the distribution function of ξ_k − M_k. If for every
positive ε the so-called Lindeberg condition

lim_{n→+∞} (1/S_n²) Σ_{k=1}^n ∫_{|x|>εS_n} x² dF_k(x) = 0          (16)
is fulfilled, then for

ζ_n* = Σ_{k=1}^n (ξ_k − M_k) / S_n          (17)

we have

lim_{n→+∞} P(ζ_n* < x) = Φ(x).          (18)

Remark. From Liapunov's condition (11) one can deduce (16); indeed, we have

(1/S_n²) Σ_{k=1}^n ∫_{|x|>εS_n} x² dF_k(x) ≤ (1/(ε S_n³)) Σ_{k=1}^n ∫_{−∞}^{+∞} |x|³ dF_k(x) = (1/ε)(K_n/S_n)³.          (19)

Similarly, (16) can be deduced from (13) too. Hence it suffices to prove
Theorem 3 (Lindeberg's theorem); then Liapunov's theorem (Theorem 2) will
also be proved.

PROOF OF THEOREM 3. If φ_k(t) is the characteristic function of
η_k = ξ_k − M_k, then

φ_k(t/S_n) = ∫_{−∞}^{+∞} e^{itx/S_n} dF_k(x).          (20)

We need the following elementary lemma:

LEMMA. For u real and k = 1, 2, ..., we have

| e^{iu} − Σ_{j=0}^{k−1} (iu)^j/j! | ≤ |u|^k/k!.          (21)

PROOF OF THE LEMMA. In fact,

|e^{iu} − 1|² = 2(1 − cos u) = 2 ∫₀^u sin v dv ≤ 2 ∫₀^u v dv = u²,          (22)

hence (21) holds for k = 1; if (21) holds for some k, it follows from

e^{iu} − Σ_{j=0}^{k} (iu)^j/j! = i ∫₀^u [ e^{iv} − Σ_{j=0}^{k−1} (iv)^j/j! ] dv          (23)
that (21) holds for k + 1 too; hence by induction (21) holds for every k.
Thus the lemma is proved.

We have therefore

e^{itx/S_n} = 1 + itx/S_n + θ_1 x²t²/(2S_n²),  where |θ_1| ≤ 1,          (24)

and

e^{itx/S_n} = 1 + itx/S_n − x²t²/(2S_n²) + θ_2 |x|³|t|³/(6S_n³),  where |θ_2| ≤ 1.          (25)

Now let ε > 0 be given. The integral (20) can be separated into two parts:

φ_k(t/S_n) = ∫_{|x|≤εS_n} e^{itx/S_n} dF_k(x) + ∫_{|x|>εS_n} e^{itx/S_n} dF_k(x).          (26)

Consider first the first integral on the right side of (26). Because of (25)
we have

∫_{|x|≤εS_n} e^{itx/S_n} dF_k(x) = ∫_{|x|≤εS_n} dF_k(x) + (it/S_n) ∫_{|x|≤εS_n} x dF_k(x) − (t²/2S_n²) ∫_{|x|≤εS_n} x² dF_k(x) + R_k^(1)          (27)

with

|R_k^(1)| ≤ (|t|³/6S_n³) ∫_{|x|≤εS_n} |x|³ dF_k(x) ≤ (ε|t|³/(6S_n²)) D_k².          (28)

Apply now Formula (24) to the second integral in (26); we obtain

∫_{|x|>εS_n} e^{itx/S_n} dF_k(x) = ∫_{|x|>εS_n} dF_k(x) + (it/S_n) ∫_{|x|>εS_n} x dF_k(x) + R_k^(2)          (29)

with

|R_k^(2)| ≤ (t²/2S_n²) ∫_{|x|>εS_n} x² dF_k(x).          (30)

If we add (27) and (29), we obtain by (28), (30) and by taking into account
that E(η_k) = 0,

φ_k(t/S_n) = 1 − t²D_k²/(2S_n²) + R_k^(3),          (31)
with

|R_k^(3)| ≤ (ε|t|³/(6S_n²)) D_k² + (t²/S_n²) ∫_{|x|>εS_n} x² dF_k(x).          (32)

We show now that (16) implies

lim_{n→∞} max_{1≤k≤n} D_k/S_n = 0.          (33)

In fact,

D_k² = ∫_{|x|≤εS_n} x² dF_k(x) + ∫_{|x|>εS_n} x² dF_k(x) ≤ ε²S_n² + Σ_{k=1}^n ∫_{|x|>εS_n} x² dF_k(x),

whence

(max_{1≤k≤n} D_k²)/S_n² ≤ ε² + (1/S_n²) Σ_{k=1}^n ∫_{|x|>εS_n} x² dF_k(x).          (34)

It follows, because of (16), that

lim sup_{n→+∞} max_{1≤k≤n} D_k/S_n ≤ ε.          (35)

Since ε > 0 may be chosen arbitrarily small, (33) is proved.

Choose now n_0(ε) such that for n > n_0(ε)

(1/S_n²) Σ_{k=1}^n ∫_{|x|>εS_n} x² dF_k(x) < ε          (36)

and

max_{1≤k≤n} D_k/S_n < ε.          (37)

This can be done because of (16) and (33). Let further ε < 1/|t|; then

1/2 < 1 − t²D_k²/(2S_n²) < 1

for n > n_0(ε) and 1 ≤ k ≤ n. Because of the identity

Π_{k=1}^n (a_k + b_k) − Π_{k=1}^n a_k = Σ_{j=1}^n b_j (Π_{k<j} a_k)(Π_{k>j} (a_k + b_k))          (38)
(where an empty product is to be replaced by 1), it follows from (32) and
(36) that

| Π_{k=1}^n φ_k(t/S_n) − Π_{k=1}^n (1 − t²D_k²/(2S_n²)) | ≤ Σ_{k=1}^n |R_k^(3)| ≤ ε|t|³/6 + εt².          (39)

The inequality |e^{−x} − 1 + x| ≤ x²/2 for |x| ≤ 1/2 and the identity (38)
imply, by considering (37),

| Π_{k=1}^n e^{−t²D_k²/(2S_n²)} − Π_{k=1}^n (1 − t²D_k²/(2S_n²)) | ≤ Σ_{k=1}^n t⁴D_k⁴/(8S_n⁴) ≤ ε²t⁴/8.          (40)

Hence from (39) and (40) it follows for n > n_0(ε) that

| Π_{k=1}^n φ_k(t/S_n) − e^{−t²/2} | ≤ ε ( |t|³/6 + t² + εt⁴/8 ).          (41)

Since ε > 0 can be chosen arbitrarily small, (41) implies

lim_{n→+∞} Π_{k=1}^n φ_k(t/S_n) = lim_{n→+∞} E(e^{itζ_n*}) = e^{−t²/2}.          (42)

Because of Theorem 3 of Chapter VI, § 4, our theorem is thus proved. The
convergence in (42) is even uniform on every finite interval |t| ≤ T.

If the ξ_n are identically distributed and possess a finite standard
deviation D, we have

F_k(x) = F(x),  ∫_{−∞}^{+∞} x² dF(x) = D²,  and  S_n = D√n.

Thus

lim_{n→+∞} (1/S_n²) Σ_{k=1}^n ∫_{|x|>εS_n} x² dF_k(x) = (1/D²) lim_{n→+∞} ∫_{|x|>εD√n} x² dF(x) = 0.          (43)

Theorem 3 therefore contains Theorem 1 as a particular case. Notice that
Theorem 1 does not follow from Theorem 2, since it is possible that
∫_{−∞}^{+∞} x² dF(x) exists but ∫_{−∞}^{+∞} |x|³ dF(x) = +∞, and even that
∫_{−∞}^{+∞} |x|^β dF(x) = +∞ for every β > 2.

Let us add that Lindeberg first proved his theorem by a different method,
viz. by a direct study of the convolution of the distributions (see § 12).

Lindeberg's condition (16) is, as was shown by W. Feller, necessary as
well, in the following sense: If ξ_1, ξ_2, ..., ξ_n, ... are independent
random variables with finite expectations and finite standard deviations, if
F_k(x) is the
distribution function of ξ_k − E(ξ_k), and if ζ_n = ξ_1 + ξ_2 + ... + ξ_n,
A_n = E(ζ_n), S_n = D(ζ_n) and ζ_n* = (ζ_n − A_n)/S_n, then

lim_{n→+∞} P(ζ_n* < x) = Φ(x)          (44)

and

lim_{n→+∞} P( max_{1≤k≤n} |ξ_k − E(ξ_k)| > εS_n ) = 0          (45)

hold iff (16) is fulfilled for every ε > 0.

If (45) is satisfied, the variables ξ_k − E(ξ_k) are said to be "negligible"
(or "infinitesimal"). Condition (45) follows from (16) by the inequality

P( max_{1≤k≤n} |ξ_k − E(ξ_k)| > εS_n ) ≤ Σ_{k=1}^n P(|ξ_k − E(ξ_k)| > εS_n) ≤ (1/(ε²S_n²)) Σ_{k=1}^n ∫_{|x|>εS_n} x² dF_k(x).

Lindeberg's condition implies thus that the variables (ξ_k − E(ξ_k))/S_n
are, in a certain sense, "uniformly small" with great probability. We do not
prove here that (16) is a necessary condition. Neither do we deal with
further generalizations of the central limit theorem.¹

The results of this section can be generalized in the following way:
instead of a sequence ξ_k (k = 1, 2, ...) consider a matrix (ξ_{nk})
(k = 1, 2, ..., k_n; n = 1, 2, ...) of random variables such that the
variables ξ_{n1}, ξ_{n2}, ..., ξ_{nk_n} are independent for every n, and put

ζ_n = Σ_{k=1}^{k_n} ξ_{nk}.

By the same method which served to prove Theorem 3 we can prove the
following, somewhat more general, theorem:

THEOREM 4. Let ξ_{nk} (k = 1, 2, ..., k_n) be for every n (n = 1, 2, ...)
independent random variables with finite variances. Put M_{nk} = E(ξ_{nk}),
D_{nk} = D(ξ_{nk}), and let F_{nk}(x) denote the distribution function of
ξ_{nk} − M_{nk}. We assume that Σ_{k=1}^{k_n} D_{nk}² = 1. If the Lindeberg
condition

lim_{n→+∞} Σ_{k=1}^{k_n} ∫_{|x|>ε} x² dF_{nk}(x) = 0

¹ Cf. the books of B. V. Gnedenko and A. N. Kolmogorov [1] and W. Feller
[7], Vol. 2, containing the detailed discussion of many further results in
this domain.
is satisfied for every ε > 0, then

lim_{n→+∞} P( Σ_{k=1}^{k_n} (ξ_{nk} − M_{nk}) < x ) = Φ(x).

Theorem 4 evidently contains Theorem 3 as a particular case; it suffices to
put ξ_{nk} = (ξ_k − M_k)/S_n (k = 1, 2, ..., n).

The proof of Theorem 4 is very similar to that of Theorem 3; hence we leave
it to the reader.

The central limit theorem can be completed by the evaluation of the
remainder, viz. by giving an asymptotic expansion for the distribution
function of ζ_n*, the first term of which is given by Φ(x) while the further
terms progress according to powers of 1/√n.
§ 2. The local form of the central limit theorem

In the preceding section we have seen that the distribution function F_n(x)
of the standardized sum ζ_n* of n independent random variables
ξ_1, ξ_2, ..., ξ_n, ... converges, under certain conditions, to the
distribution function of the normal distribution as n → ∞. It is therefore
natural to ask under which conditions the density function of ζ_n* (if it
exists) tends to the density function of the normal distribution. For this
the conditions must certainly be stronger, since it is known that
F_n(x) → Φ(x) does not necessarily imply F_n'(x) → Φ'(x). We prove first in
this respect a theorem due to B. V. Gnedenko:

THEOREM 1. Let ξ_1, ξ_2, ..., ξ_n, ... be independent, identically
distributed random variables which have a bounded density function f(x);
assume further that E(ξ_n) = ∫_{−∞}^{+∞} x f(x) dx = 0 and that the integral
D² = ∫_{−∞}^{+∞} x² f(x) dx exists; then the density function f_n(x) of

ζ_n* = (ξ_1 + ξ_2 + ... + ξ_n)/(D√n)          (1)

tends to the density function of the normal distribution; hence we have

lim_{n→∞} f_n(x) = (1/√(2π)) e^{−x²/2}.          (2)

The convergence is uniform in x.¹

¹ Fig. 25. The figure represents the case when the random variables ξ_k are
uniformly distributed on (−√3, +√3).
PROOF. The supposition D = 1 does not restrict the generality. Let φ(t) be
the characteristic function of ξ_n and φ_n(t) that of ζ_n*. We know that

φ_n(t) = [φ(t/√n)]^n.          (3)

[Fig. 25: the density functions f_1(x), f_2(x), ... of ζ_n*, plotted for
−3 ≤ x ≤ 3, approaching the normal density.]

We first prove the following

LEMMA. If the density function g(x) is bounded, g(x) ≤ K, and the
characteristic function

ψ(t) = ∫_{−∞}^{+∞} g(x) e^{itx} dx          (4)

is nonnegative, then the integral ∫_{−∞}^{+∞} ψ(t) dt exists.

PROOF. (4) implies for v > 0

∫_{−v}^{+v} ψ(t) dt = 2 ∫_{−∞}^{+∞} g(x) (sin vx)/x dx,          (5)

hence, for T > 0,

∫₀^{2T} ( ∫_{−v}^{+v} ψ(t) dt ) dv = 2 ∫_{−∞}^{+∞} g(x) (1 − cos 2Tx)/x² dx.          (6)

Since ψ(t) ≥ 0, we have on the other hand

∫₀^{2T} ( ∫_{−v}^{+v} ψ(t) dt ) dv = ∫_{−2T}^{+2T} ψ(t)(2T − |t|) dt ≥ T ∫_{−T}^{+T} ψ(t) dt,          (7)

hence

∫_{−T}^{+T} ψ(t) dt ≤ (2/T) ∫_{−∞}^{+∞} g(x) (1 − cos 2Tx)/x² dx.          (8)
V III, § 2 ] TH E LOCAL FO RM O F C E N T R A L L IM IT T H E O R E M 451

Because of g(x) < К and


+00
f 1 -co s2 7 x ,
I --------2------ dx = 2nT
— CO

(cf. Ch. VI, § 4, Formula (40)) we get


+T
j *l/(t)dt^4nK . (9)
-T
Since T in (9) can be chosen arbitrarily large and by assumption i/i(i) is
nonnegative, the lemma is herewith proved.
Now if the density function of one of two independent random variables
is bounded, the density function of their sum is bounded as well (cf. Ch. IV,
§ 9, Formula (4)). Thus, the density function /(x ) being bounded, that of
— c* is bounded if is independent of and has the same distribution,
and the characteristic function of is equal to |<p(i)|2. Thus by our
lemma is integrable. From this it follows by applying Theorem 2 of
Chapter VI, § 4 that for n > 2
+ CO

fÁ x ) = Y n J (p [7 = ) e~ix‘d t■ (10)
— 00

On the other hand, we have


+00
1 --- 1 r
__ e 2 = —— e 2 e 1X1dt. (11)
J in 2я J
— CO

Furthermore, for every T > 0, because of the uniform convergence of


_ t‘
<p„(t) to e 2 on every finite interval, we have
+Г +Г
I f 1 f _ '1
lim —— y n(d)e ,xtdt = ---- e 2 e 1x1dt (12)
n-*-+со 2tr J 2и J
-T -T
uni f or ml y i n X.
We show now that the integral

h {T )= f (cpn( t ) - e ~ * ) e - lxldt (13)


|(|>Г
can be made arbitrarily small, uniformly in x, by choosing T and n suffi­
ciently large. Because of (12), the theorem will then be proved. In order to
show that (13) can be made smaller than any positive number by an appro-
452 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 2

priate choice of n and T, notice first that


+ 00 +00
t n r t2
\ I n( T ) \ < 2 <p — = dt + 2 e ~ 2 dt. (14)
JT
\J n
T

The second integral does not depend on n and becomes arbitrarily small by
choosing T sufficiently large. It suffices thus to study the first integral. In
order to evaluate it, we separate it into two parts. For и -> 0 we have
w2
<f(u) = 1 - — + ф 2).

For an e > 0 sufficiently small and j и \ < s we have thus

!<K“) I < 1
and it follows that
t in 00

г I ( t )n ce
J j cp — dt < 4 du , (15)
T T

which tends to zero as T -* +oo, independently of n. It remains to show


that the integral
+ oo
Г t \
<P —j =

dt (16)
W n I
tin

tends to zero as n -*■ + oo. First we choose q — q(e) with 0 < q < 1, so
that I <p(t) \ < q when j t \ > s > 0.
In fact, according to Theorem 8 of Chapter VI, § 2,
lim I<p(t) I = 0.
/-*- + 00

Since the do not possess a lattice distribution, j <p(t) j Ф 1 for every


t ф 0; therefore if we put
sup 19< 01= q
l'läE
we have 0 < q < 1. Then, however,
+ 00 + 00 + 00

Г 4 '[-j= dt = y jn j \ ( f ( u ) \ nd u < J n qn~~ j \cp(u)\2 du. (17)


J \yjrt I J J
tin * - 00
V III, § 3] D O M A IN O F A T T R A C T IO N O F N O R M A L D IS T R IB U T IO N 453

+ cc ____
Since we have already shown that f | tp(u) |2 du is finite and lim sJnq"~‘i =
-00 «->-+00
= 0. the integral (16) tends also to zero as n -> + oo. All these restric­
tions are valid uniformly inx.(2) holds thus uniformly for - od < x < + oo.
Theorem 1 is herewith proved.
When fix ) is not bounded but for any given к f kix) is, (2) remains still
valid. This can be shown by a slight modification of the above proof. The
condition that f k(x) be bounded for a value of к (and, consequently, also
for every n > k) is evidently necessary for the uniform convergence of
/«(*) to —L= e 2.
V 2*

§ 3. The domain of attraction of the normal distribution

If £k (k, — 1 , 2 , . . . ) are independent, identically distributed variables


and if the standard deviation D = D(^k) exists, then, according to
Theorem 1 of § 1, the random variables

tn = i z k
k = l

satisfy the limit relation

lim P ^n— -A - < X = Ф(х) (—oo < X < + oo), (1)


П-*-+QO ^П j

where An = E((„) and 5„(C i) = D yjn {n = 1 ,2 ,.. . ). Now we have to con­


sider, whether the existence of D (fk) is necessary for the validity of (1) with
suitably chosen sequences {An} and {£„}. In the present section we show
that the existence of the standard deviation D(ck) can be replaced by the
weaker assumption (2).
We define the domain o f attraction of the normal distribution as the set
of distribution functions F(x), possessing the following property:
If £1; £2, • • -Лю • • • are independent random vaiiables with the common
distribution function F(x), then (1) is fulfilled for suitably chosen sequences
of numbers {An} and {S',,}.
In the present section we shall determine the domain ol attraction of
the normal distribution and we shall prove a theorem due to P. Levy,
W. Feller and A. J. Khintchine:

Theorem 1. Let £2, .. ., . . . be independent, identically distributed


random variables with a common distribution function F(x). I f for F(x) the
454 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V I I I, §3

limit relation
^ И -г) + (1-*М )],0 (2)
f x ‘dF(x)
-y

holds, then (1) is valid for every suitably chosen sequence o f numbers {A„}
and {S„}.
Notes
1. Condition (2) is not only sufficient but also necessary for the validity
of (1). But this will not be proved here.
2. If the standard deviation of the random variables exists, i.e. if
+oo
j x 2dF(x) is finite, (2) is evidently true; this follow s im m ediately from the
—00
inequality
/И -Д О + О -^О О )]^ J x 2dF(x).
1*1
Thus we can see that Theorem 1 of the present section comprises Theorem 1
of §1.
3. If the standard deviation D and the expectation M of qk exist and if
M = 0 , then in (1) An = 0 and Sn = D ^/n. Conversely, if (1) holds with
A n = 0 and Sn = D ^/n, the standard deviation of t k exists and is equal to D.
In fact, in this case Theorem 3 of Chapter VI, § 4 permits to state the follow­
ing: If rp(t) is the characteristic function of the random variables Qk, we
have
t П - /2
lim cp ---- — = e 2
-Too [D ^/n.
hence
In <p i — ^ = ] - In <K 0 )
\D jn Dr
lim --------- Г 7 1 * -------- = - - y »
n-*-+00 1 ^
D jn .
+00 +00
and from this follows that D1 = ( x 2dF(x). Thus if j x 2dF(x) does not
—oo —00
exist, the sequence of numbers {Sn} for which (1) holds cannot have the
order of magnitude ^Jn. (Clearly, by the proof of Theorem 1, S„ tends to
infinity faster than J n .)
VIII, § 3] DOMAIN OF ATTRACTION OF NORMAL DISTRIBUTION 455

4. As an example of a distribution for which the standard deviation does


not exist though (2) holds, we mention the distribution with density function

/w = w for 111 > l -


0 otherwise.
Now we give the

P roof of T heorem 1.

L em m a . I f we have

y2[ F ( - y ) + (1 -^00)] ^
— ------------ ---------- — < a. <
r
1 fo r y > y 0 > 0,
^ л (3)
f x 2dF(x)
-y
+ со
then I I X I dF(x) exists.
— 00

If Y > у > y0 > 0, we may write:


Y Y
J xdF(x) = ><1 - TOO) - Y(l - F{Y)) + j (1 - F(x)) dx
У У
and
—у -у
j xdF(x) = Щ - Y) - y F ( - y) - J F(x)dx,
-Y -V
hence
Y
J \x\d F (x ) < y ( l - F(y) + F ( - >0) + ] ' ( < - m + F(— x))dx, (4)

and by (3)
Y +x
Г j t 2dF(t)
I IXI dF{x) < t (1 - F(y) + F ( - y)) + « -=i— 1------dx,
J J X
y<.\x\^Y у
thus

J \ x \ d F ( x ) < y { \ - F ( y ) + F { - y ) ) + oi J \t\d F (t) +


y<.x<,Y

+ — J t-dF(t).

456 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 3

If we subtract now from both sides a f | t j dF(t) and divide by 1 — a,


we get
+ y
r 2a Г
J I X ] dF(x) < t 2dF(t). (5)
y < ,\x \< .Y -y

Since the right hand side of (5) does not depend on Y, we conclude that
+ CO

j I X \dF(x) exists; the lemma is herewith proved.


—00
In what follows, we assume for sake of simplicity that F(x) is continuous
and symmetrical with respect to the origin, so that F ( - x ) = 1 - F(x).
In the general case the proof runs similarly, but the calculations are some­
what more complicated. Furthermore, we may assume F(x) < 1 for every
л: < + oo.
+ cp
The existence of M — j xdF(x) follows from the lemma proved above.
—00
Because of the symmetry of F(x), we have
< -0 0

j' xdF(x) = 0. (6)


—00
(In the general case we consider the random variable £/k = — M.) We
put
(7)
j x 2dF(x)
о
Then by assumption lim <5(y) = 0. Put further
У - * + CO

It follows from
А(У)= (1 -F(y)f ‘ (8)
y2 1
m = ------------ ^ -----------> ---------------------------г г (9)
(1 - F{y)) j X2dF(x) (1 - F(y)) F(y) - —
о L
that
lim A(y) — Too. (10)
У-+ + 00

By assumption, F(x) is continuous, hence A(y) is also continuous for у >


> y0 > 0. Let C„ be for n > n0 the least positive number > y0 such that
A{Cn) = n \ (11)
V III, § 3] D O M A IN O F A T T R A C T IO N O F N O R M A L D IS T R I B U T I O N 457

Evidently, C„ -» + oo and
и(1 = (12)
Put now
S l = n I x 2dF(x), (13)
—Cft
and let <p(t) be the characteristic function of £,k, q„{t) the characteristic func­
tion of £JSn. We have
~ i t ~n
<Pn(t) = E(e s" ) = <P — . (14)
However, we have
+Cn
It r Jlii Г JitL
<P — =1+ ( e s" - l) rf F ( x ) + (e s- - 1 )dF(x) (15)
-Cn |*| >Cn
and

( (e s• - 1)dF(x) < 4(1 - F(C„)) = 4 n/ ^ C") . (16)


J n
>c„
1*1
By the lemma of § 1
+Cn
C — t2
J
( e s- - 1) d F ( x ) = - — + Rn (17)
—Cn
holds with

(18)
J 6S nA S„ 6n 6J i n
-C n

Relations (14) through (18) lead to

In n
with a 9n which remains in absolute value below a bound not depending
on n. Since lim J b(CJ) = 0, we get
/?-*-+00
lim rpn( t ) = e 2,
n-*- + oo
(19)
which implies the statement of Theorem 1.
458 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 4

As regards the question whether other distributions than the normal also
have a domain of attraction, the following example shows that this is pos­
sible. Let be completely independent random variables
possessing a common stable distribution of order a(0 < a < 2) and charac-
1 "
teristic function e-1' 1“, then the characteristic function of - | a £ £,k is
n * k=1
exactly e ~ l' 1“.
Thus any stable distribution has a domain of attraction which contains
at least the distribution itself. The domain of attraction of a stable distribu­
tion with 0 < oc < 2 is very narrow compared with that of the normal dis­
tribution; it contains only distributions very similar to the stable distribution
considered. As regards the determination of the domain of attraction in the
case of a stable distribution with 0 < a < 2, we refer to the book of
Gnedenko and Kolmogorov [1].

§ 4. Convergence to the Poisson distribution

We have already proved in Chapter III, § 12 that the binomial distribution


of order n and parameter p tends, as n -* + oo, to the Poisson distribution
with the expectation A, if p tends to zero in such a way that np -» A. This is
a particular case of the following, more general theorem :

T heorem l. Let £nl, <f„2, . .., £,nkn be independent random variables assuming
nonnegative integral values only. Put

Щпк = r) = p„k(r) (r = 0 , l , . . . ; k= 1 , 2 ............k„ ;n= 1 ,2 ,...) (1)

and

K k = Z P n k (r). (2 )
r=2

I f the following conditions

lim Z P « * (1) = 2, (A)


П-++00 k =1

lim max (1 —pnk (0)) = 0 , (B)


n-*-+ oo l<,k<kn

lim Z R nk = 0, (C)
П-++СО k —1
V I I I , § 4] C O N V E R G E N C E TO T H E P O IS S O N D IS T R IB U T IO N 459

are satisfied, then the d istrib u tio n fu n c tio n o f

In — £nl + £«2 + . . . + (3)


tends, as n —> + oo, to the distribution function o f the Poisson distribution
with expectation X.

P roof . Let gnk{z) denote the generating function of the random variable
ink-
9 n k {z ) = f P „k (r) zr ( I г I < 1). (4)
r=0
Clearly
19 n k ( z ) - P n k { 0) - P n k (\)z \ < R nk fo r I z I < 1. (5)
Since
a * (0 )-(i - P n k ( i) ) = - R n k, (6)
we can write

\9 n k (z ) ~ 1 - P n k W iz - 1) I < (7)

The identity (38) of § 1 implies, since | gnk(f) \ < 1 and | 1 + pnk(Y)(z — 1) | <
^ L

П
k=l
9пк(г) -
k=l
п (1 + P A 1) (z- 1)) — 2 X1 Rnk. k=
(8)
If , ч 1
max />„*(1)< max (1 - />„*(0)) < —- ,
t< , k < ,k n \< Jk< Jk„ 4

which is because of (B) fulfilled for n > n0, then identity (38) of § 1 leads to

I! 0 +Pnk(l)(z - 1)) - 0 exp (/?„fe(l)(^—1)) <


k=1 k -l

< m a x (1 -
1< .k < ,k „
Pnfifií) I г - 1 12 k П= 1 P A X )- (9)
It follows now by our assumptions from (8) and (9) that

lim
«-►+ 00
ft
k=1
9 nk(z) = e A (z~ 1 ). (10)
kn
Since Y\ 9nk(z) is the generating function of t]n and eA(z_1) that of the Poisson
k=l
460 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 5

distribution {Xke~kjk \}, our theorem is proved in view of Theorem 4 of


Chapter III, § 15.
When kn = n and the random variables £„k take on only the values 0
X X
and 1 with the probabilities P(^„k = 0) = 1 ------ and P(£,nk = 1) = — .
n n
П
then conditions (A), (B), (C) are evidently fulfilled. In this case rj„ = £ c„/c
k=1
has a binomial distribution:

ln\IX)4 X ) n~J

Thus we have as a particular case of Theorem 1:


(n\[Xyit Х \ п~] Xk e~k
lim — 1-----= ——— .
«-. +00 \ J j 1 « n k '-
Theorem 1 is therefore a generalization of the convergence of the binomial
X
distribution of order n and parameter p = — to the Poisson distribution
n
of parameter X, already dealt with in Chapter III.

§ 5. The central limit theorem for samples from a finite population

The statement of the central limit theorem is valid for certain sequences
of weakly dependent random variables. In the present and in the following
two sections we prove some results in this direction. These results have
practical importance, too, since in the applications the independence is
often only approximately true. The following theorem1 refers to samples
taken from a finite population, a situation very often encountered in prac­
tice.

T heorem 1. Let aN1, aN 2, ■■■, aNtN be any real numbers (N = 1 , 2 , . . . ) ,


N
M n = Л a N k (1)
k=l
and
N i \,f 2

^Ь-тг)-
1 Cf. P. Erdős and A. Rényi [1]. See also J. Hájek [2], where it is shown that
<2)
Theorem 1 is essentially best possible.
V III, § 5 ] C E N T R A L L IM IT T H E O R E M F O R F IN IT E P O P U L A T IO N 461

Let further n = n(N) < N be a positive integer-valued function o f N and put

In i n
D* * - D4 i r [ l - i f i ' (3)

From the numbers aN1, aN2, ■■ aNN we randomly choose n numbers in


, iN
such a way that all combinations have the same probability. Let the
я.
random variable „ denote the sum o f the so chosen aN k and put

Cn .h = Cn,n — M N. (4)
Put further
1 V, / M k, l 2
4v,»00 = ~ r X « л и -----т Н • (5)
°N |^ - ^ |> .^ / N >
I f the condition
lim dNn(s) = 0 (6)
N - > + CO

is satisfied for any e > 0, then we have for —oo < x < + oo
r*
lim P < x = Ф(х). (7)
lV-*-+oc , ^N.n

Proof. Condition (6) implies that n - > + o o a s i V - » + oo. Indeed it follows


from (6) that there exists for every e > 0 a number N0 such that for N > N0
the inequality dNn(e) < f holds; but then we have

1 M N l 2 ^ x r2 D * n ^
— < - 51 - V
У aN --------- < Ne2 — < m 2. 2

2 Dn I N Dn

Hence, for N > N0 we have n = n(N ) > —-y and since e > 0 can be
chosen arbitrarily small, we get

lim n(N) = + oo. (8)


N-*- + GO

We may assume
M N = 0. (9)
462 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 5

In fact, if (9) is not fulfilled, consider instead of the numbers aNk the numbers

a'N k = aNk - ; for these (9) is clearly fulfilled and if Theorem 1 holds
for a'N k, it remains valid for aNk too.
Furthermore, we can assume
N
1< n< — - (10)

the random variables C\,n and have indeed the same distribution
N
and if n > , we may take instead of n the number N — n.
We compute now the characteristic function 9oNn(t) of £%„:

<Pn M ) = ТТ7Г I exP [ ' ' f e + a N J t + . . . + a N Jn) l (11)


jV l< ,h < h < - < in < ,N

where the summation is to be extended over all combinations of order n of


the numbers 1, 2, . . N. In order to prove Theorem 1, it suffices to establish
the relation

lim <pN,„ = e 2. (12)


N-* + 00 N,n J
We put

(13)
and
BN,nW = Щ - k)N~ \ (14)

By using the fact that

± f «“ ' # = I 1 f k ~°- (15)


2n J v [ 0 for к = ± 1, ± 2 , . . . v ’
—n

we obtain easily the relation


+K
<pN,„(t)= „2JlBNn(A)
J f П [(1 - X ) e - ^ +t° ^
J k=1
+ l e V - W + ‘“^]dcp. (16)
—n
V ili, § 5] C E N T R A L L IM IT T H E O R E M F O R F IN IT E P O P U L A T IO N 463

Indeed if we calculate the value of the expression behind the sign of integra­
tion, by taking in the product (N - m) times the first and m times the sec­
ond term, we obtain a term multiplied by the factor ei(m~"'yp; such a term
vanishes therefore when the integration is carried out provided that m Ф n.
If N -a + oo and N —n —►+ 0 0 , which is certainly fulfilled in our case because
of (8) and (10), it follows from Stirling’s formula1*that

BKn(X) » — = 1 (17)
j2 n N X ( \ - A)

Hence if we introduce a new variable of integration ф = <pJ NX{ 1 - A)


and if we replace t by —— , we obtain
Av,n
+JI/л/Л(1-л)
<pN,nI I « -7 = = Г П г Л О # - (is)
*-* N ,n ) ^ /2 П J k =1
-nÍNÍ( 1—Я)
where we put

& 0 М ) = (1 - Я ) exp - A Í + -^ -1 1 +
1,/АА(1 - A) A v .J ]

+ A exp /(1 —A) - - - ■+ ---- • (19)


L Ц/.ДОА(1 — A) Av,„ J

According to the lemma of § 1 we have

(1 - А) е-''Яс + Ае'(1- Я)" = 1 - V ^ + Rx (20a)


with
A(1 - A)I til3
\R i\i— — ( 20b)
and
(1 - A) e~iXv + Ae'd-O" = 1 + R 2 (21a)
with
|j?2| < A(1~ A)t)2- ■ (21b)

1 If A is fixed, (17) follows directly from the Moivre-Laplace theorem. In our case
A depends on N, hence the latter theorem cannot be applied. But Stirling’s formula
leads easily to (17).
464 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 5

On the other hand, an easy calculation shows that

I(1 - Я) e~Uv + ).ei(~l- X)01= - 2/(1 - Я ) ( Г - “cos7) <

< 1 —Я(1 —Я) (1 —cos v). (22


\

Now let e > 0 be given and suppose \ ф \ < 2e ^JNX( 1 — Я). If к is an index
such that ^ I J gjv’*l < g) then (20) implies
DN,n

(23)
with I 0XI < 1; (21) implies that for every value of к and for \ ф \ <
< 2ev/ /v;.(r- Я),
, о Л , 2(1 - 2) Ф , taN'k)*
\ 6k( i l f , t ) - l \ < - — - - 7 = - .+ - ■ (24)
2 7 )УЯ( 1 — Я) л

From this follows for | ф | < 2s X/7V2( 1 — Я) the relation

Пе*0М)=ехр - - 1"-- ( l + 0 2 e) (1 + 17/vX (25)


fc=i L , z

where | 02 1 ^ Q and

! 4n I —C2 t/,v,n p + £’ 2(1 — Я) /д,| (26)


with
/v = Z 1, (27)
\ O N ,k \> — утр-

C, and C2 being positive constants.


We have already
t2 Is
lN < - ,-------- — t/ д г — ; (28)
* е2 Я ( 1 - Я ) ’ (|t|)
it follows from (26), (27) and (28), because of (16), that
lim r}N = 0. (29)
N -*■ + co

For I ф I > 2s ч/ агЯ( 1 — Я) the following estimates can be used: For


I a N , k I ’ I 11 \DN n > s the trivial inequality | (>fc(t/q t) | < 1; for
V III, § 5 ] C E N T R A L L IM IT T H E O R E M F O R F IN IT E P O P U L A T IO N 465

I aNk II t \/DK„ < e the inequality

I í) I < 1 —Я(1 —Я) (1 —cos e), (30)


which can be derived from (22).
Thus we obtain

j П gfc 0 # ^ 2?t J M 1 - —1- ^ 2COS £) I (31)


‘M< . < n
(NMX-X)

Since

/лп Í. Я( ] - c o s e ) ] " - '" г n(\ — cose)


sJNX 1 ----------------- < V « e x p ----------------- + A/д, ,

the right hand side of (31) tends to zero as N —> + oo because of (28) and
because
/— и(1 - cose)]
lim ^Jn exp —------ --------= 0.
N-*+a> 2 )

From (18), (25), and (31) we obtain, since s can be taken arbitrarily small
+00
t ) -** 1 г - ф2 -—
О т <рК п - = e 2■ e 2 d i// = e 2, (32)
.V~+oo U N ,n ) J
—00
which concludes the proof of our theorem.
A case of particular importance occurs when M of the numbers aN k are
equal to 1 and N — M are equal to zero. Then
Ml I N - M
m n —m
P(HN,n = m )= ..... ........ (33)

n
i.e. (Nn has a hypergeometric distribution. Furthermore, MN = M , DN =
= у / M {N - M)/N. Condition (6) is satisfied for W ft_ ^ jM )n(N - и)

-> + со as N -* + oo; in effect, for every s > 0, dN „(e) = 0 as soon as N


(depending on e) is sufficiently large.
N N
If M < — and n < — , which can be assumed without restriction of gen-
2 2
466 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 6

erality, the above condition is equivalent to

(34)

M
When — is constant or remains above a positive bound, this means that n
N
nM N
must tend to + oo with N. F rom ------- ►+ oo it follows, because of M < —
N 2

N
and n < — , that N -> + oo, л -+ + oo, and M -» + oo. Theorem 1 con­

tains thus as a particular case


N N
T heorem 2. I f N , M andn are positive integers, 1 ^ M < ^- , \ < n < —

further i f we put p = and X = ~ , then

'M \ I N — M
„ к n —к i . ....
hm Y ______ _____ 7N7\----------= ф(х )- (35)
NpX-++oo k<.np+xinp(\—p )(l—A) I
. n)
M
A particular case of this theorem, when p = — = constant, was derived

by S. N. Bernstein.
M
Note further that if p = — is constant and n increases more slowly than
n 2
N (if, for instance — -*■ 0), (35) can be derived from the Moivre-Laplace
theorem by approximating the terms of the hypergeometric distribution by
those of the binomial distribution (see Chapter II, § 12, Exercise 18). How­
ever, the general case cannot be treated in this way.
Theorem 2 can also be proved directly by merely considering the asymp­
totic behaviour of the terms of the hypergeometric distribution, but this
procedure leads to tiresome calculations.

§ 6. Generalization of the central limit theorem through the application of


mixing theorems

A sequence rh, rj2, . . tjn, . . . of random variables possessing a limit


distribution, i.e. such that
VIII, § 6] A P P L IC A T IO N O F M IX IN G T H E O R E M S 467

lim P(ri„ < x) = F(x)


«-*- + 00
holds at every continuity point x of the distribution function F(x), is said
to be mixing, if for any event В with positive probability the relation
lim P(t)n < x I B) = F(x) (1)
n-* + 00

holds at every continuity point x of F(x). We prove now

T heorem 1. Let be independent random variables and


put

C„ = £
k=l

Assume that there exist two sequences {C„} and {5,,} with Sn —>+ co and a
distribution function F(x) such that at every continuity point o f F(x) the dis­
tribution function o f
L - c n
4n =---^---
tends to F{x):
lim P(ri„ < x) = F(x).
n-*-+ 00
Then the sequence o f the random variables t]n is mixing.
P roof . We shall use the following lemma due to H. Cramér1

L emma 1. Let 0„ and £,. (n = 1 , 2 be two sequences o f random va­


riables; assume that the sequence 0n has a limit distribution with distribution
function F{x), that is, at every continuity point o f F(x) we have
lim Р(в„ < x ) = F(x). (2)
/!-*-+00
Assume further that lim st s„ = 0. Then
W— + 00

lim P(Qn + e„ < x) = F(x) (3)


n-*-+oo
at every continuity point x o f F(x).
P roof . We have for an arbitrary <5 > 0

P ( ß n + En < x ) = P( I e„ I > <5) P(Qn + e„ < x 11e„ | > 8) +

+ P(ßn + ei, < x 11e„ I < á). (4)


1 Cf. H. Cramér [3].
468 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 6

By assumption
lim P( Ie„ I > <5) = 0.
«-*- + 00

From 6„ + e„ < x and | s„ \ < ő follows 6n < x + ö ; from 6„ < x — <5


and I s„ I < ö follows в п + s„ < x. Hence we can conclude from (4)

F(x - S ) - P ( \ £ n\ > S ) < P(0n + e„ < x| | e„| < 5) < F(x + 8), (5)

F(x — Ő) < lim Р ( в „ + sn < x) < lim P{9„ + sn < x) < F(x + <5). (6)
n-++00 «-»-+oo
Since x is by assumption a continuity point of F(x) and S > 0 may be
taken arbitrarily small, (3) is proved.
Let now x be a continuity point of F(x) and suppose F(x) > 0. Then by
assumption we can find an n0 such that /'(>/.. < x) > 0 for n > n0. Put
A0 = £2 and denote by Ak the event г]щ+к < x (k = 1, 2, . . . ) . Then P(Ak) >
> 0 and, by assumption,

lim P{An \ Au) = lim P(A„) = F(x) > 0.


«-*+00 «-*-+00
By Theorem 1 of Chapter VII, § 10 it suffices to prove the relations
lim P(A„ I Ak) = F(x) ( k = 1, 2, . . . ). (7)
«-*-+ 00
Г
Apply Lemma 1 with 0„ = r\n and s„ = -----. This can be done, since
*-*«
the hypotheses of the lemma are fulfilled. We find

lim P r]n - < x = F(x). (8)


Since
Спц+к Cn Cn0+k
Пп ~ ~ s~n
does not depend on rj„o+k, we have

lim P t]n - < x Ak = F(x). (9)


«-*+ 00 >

If we apply Lemma 1 again, to the random variables

£n0+k C«0+k
= о
Vn *ln
O« ’ C
V III, § 6] A P P L IC A T IO N O F M IX IN G T H E O R E M S 469

on the probability space [Q, P(A \ Ak)], we get already (7):

lim P(ti„ < x \ A k) = lim P(An I Ak) = F(x).


«-»-+ 00 «-*■+00
The conditions of Theorem 1 of Chapter VII, § 10 are thus satisfied and for
every В with P(B) > 0 the relation
lim P(tj„ < x \ B ) = F(x) (10)
«-»- + 00

holds. The theorem is therefore proved for every x such that F(x) > 0.
If X is a continuity point of F(x) such that F(x) = 0, we have

lim P(t]n < x) = 0


и-*- + со

and, if P(B) > 0,

lim P(q < x I B) < lim < = 0.


И^ +оо n- +0O P(B)
Theorem 1 is herewith completely proved.
Theorem 1 can also be formulated as follows: If the random variables
£k are independent, if S„ -> + oo a s n - » + oo, further if the random vari­
ables

tZk-C„
k =1
ч„= s—

possess a limit distribution, then r\n is in the limit independent of any ran­
dom variable 9 in the following sense: For every у such that P(9 < у) > 0
the relation

lim P(t]n < x, 9 < y) = lim P(rjn < x)P(9 < y) (11)
П-++ OO «-*- + OO

holds at every continuity point of the limit distribution of r]n.


The following is an interesting corollary of Theorem 1:

T heorem 2. Suppose that the random variables £2, are inde-


n

pendent and put = Y, £k- I f there exist two sequences C„ and Sn


k =1
(n = 1 , 2 , . . . ) with lim S„ — + oo fulfilling the relation
«-*•+00
lim P(t]n < x) = F(x),
«-*- + OO
470 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 6

£ _ C
where rjn = — ——and F(x) is a nondegenerate distribution function, then
Sn
rjn cannot converge in probability to a random variable rjx .

P roof. Assume that there exists such a variable with д,г — A 0.


If we apply the lemma with 9n = rjn and e„ = r\x — q„, we find

P(hoo < x ) = F(x). (12)


On the other hand, Theorem 1 allows to state the following: If v: is a conti­
nuity point of F(x) with 0 < F(x) < 1 (F(x) being nondegenerate; such a
point always exists) and if В denotes the event t\x < x, we have

lim P(r\„ < x \ B) = F(x) = P(B). (13)


«-*-+co
If we apply the lemma to the random variables 0„ = rj„ and en = — t]„
on the probability space [Q, P(A | В)], we get
Р(П х < х \ В ) = Р ( В \ В ) = Р(В),
i.e. F(x) = P(B) = 1, which contradicts our assumption that 0 < F{x) < 1 .
Hence Theorem 2 is proved.
Naturally, it follows from Theorem 2 that, under the conditions of the
theorem, the limit of t]„ cannot exist almost everywhere. Still more is true:
the probability of the existence of the limit lim r\„ is equal to zero.
n-*+ co
The set C of the elements co £ Q for which lim rjn(a>) = t]a(oo) exists is
«-►+00
obviously measurable. Suppose we have P(C) > 0, then rj„ would con­
verge on the probability space [Q, P(A \ C)], with probability 1
and therefore also in probability, which contradicts Theorem 2.
We now prove a lemma.

L emma 2. Let [Í2, tx€, P] be a probability space and Q(A) (A £ a second


probability measure on the о-algebra absolutely continuous with respect
to P. I f the sequence o f sets A„ £ is mixing on [ Í 2 , P ] with the density d,
then
lim Q(A„) = d. (14)
«-*-+ 00

P roof. According to the Radon-Nikodym theorem there exists a mea­


surable function x ( c o ) such that for every A £ Г

Q(A) = §x(co)dP.
A
V III, § 7 ] S U M S O F A R A N D O M N U M B E R O F R A N D O M V A R IA B L E S 471

If /(ю) is a step function, (14) is clearly fulfilled. According to the definition


of the Lebesgue integral there can always be found a step function Xi(co)
such that
J I X(®) - Xi («) I dP<E.
n
Hence (14) is always fulfilled.1
Lemma 2 allows to rephrase Theorem 1 in the following stronger form :

Theorem 3. I f £2>. . . , £ „ >• • • are independent random variables on the


probability space [Í2, P] and i f the hypotheses o f Theorem 1 are satisfied,
then for every probability measure Q absolutely continuous with respect to P
the relation
lim Q(rjn < x) = F(x) (15)
П-++СО
holds at every continuity point x o f F(x).
Theorem 3 allows to extend limit theorems to sequences of weakly depen­
dent random variables. Assume indeed that the random variables £k are
not independent with respect to the probability measure Q, but let there
exist a second probability measure P such that Q is absolutely continuous
with respect to P while £k are independent with respect to P. If one of the
theorems about the limit distributions can be applied to [Í2, o £ , P],
Theorem 3 guarantees its applicability to [Q, Q] as well.2

§ 7. The central limit theorem for sums of a random number of random


variables

In the present section f>, denote independent, identically


distributed random variables with zero expectation and unit variance.
Hence by Theorem 1 of § 1 for the random variables

tilt
L =^ (i)

the relation
lim P(C„ < x) = Ф(х) (2)
Л—
*-+ 00
1 The sequence of events {A„} is also mixing with respect to [Q, cA, Q] since
Q*(A) = Q(A I B) is also absolutely continuous with respect to P. Hence by Lemma 2
lim Q(A„ I B) = d.
/!-*■+00
2 Cf. P. Révész [1].
472 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 7

is valid. Let now v„ (« = 1, 2 , . . . ) be a sequence of random variables which


assume only positive integer values and which are supposed to obey the
relation
v„-P.+oc (3)
i.e. to fulfill for any N > 0
lim P(v„ > N ) = 1.
«-►+ co

We want to find conditions under which the random variables are in


the limit normally distributed. It is easy to prove

Theorem 1. I f the random variables v„ (n = 1 , 2 , . . .) are independent o f


the random variables i b c,, and i f (1), (2) and (3) are fulfilled,
then
lim />(£,„ < x) = Ф(х). (4)
«-► + 00

Proof. Put
C„k = P(v„ = k) (n,k = 1, 2, . . . ) . (5)

The matrix (C„fc) possesses the following properties:

Cnk> 0 (n,k = 1, 2, . . . ), (6a)

fc„,= l (И= 1 , 2 , . . . ) , (6b)


k =l
lim Cnk = 0 (*=1,2,...). (6c)
«-►+00
(6c) is a consequence of (3); (6a) and (6b) express the fact that
C„ C„ i, 2,Cnk,
. .is a probability distribution. The three conditions (6)
., . . .

can be expressed bv saying that (Cnk) is a permanent Toeplitz matrix. A


theorem1 known from the theory of series permits to conclude that, if

lim Sn = S
«-►+ 00
then
00
Hm £ C „ fcS* = S.
«-► + 00 k = 1
Now
oo
P(Cv„ < x ) = Y P(Ck < x , v n = k). (7)
k = l

JCf. K. Knopp [1].


VIII, § 7 ] S U M S O F A R A N D O M N U M B E R O F R A N D O M V A R IA B L E S 473

Since v„ does not depend on £k, we get

/>(Cv„ < * ) = !; c nkp(ck < X). (8)


k =1

From (2) and the above-mentioned theorem from the theory of series we
obtain (4) and Theorem 1 is proved.
The situation is somewhat more complicated if we do not suppose that
v„ is independent of the variables £,k. In this case a stronger condition than
(3) must be imposed upon v„. As an example we prove now a theorem
which is a particular case of Anscombe’s theorem.1The reasoning is inspired
by W. Doeblin.

T heorem 2. //( 2 ) is fulfilled and if

-+ c, (9)
n

where c is a positive constant, then (4) is valid.

P ro o f . P u t
n

*ln = z Zk and a„ = [nc]. (10)


k=1
Then, because of (9),

y -^ l. (И )
Furthermore

c- = é И г + ■ (12)
Now we need a simple lemma.

L emma . I f the sequence o f random variables 6„ has a limit distribution


and i f y„ Л 1, then the sequence Qny„ has the same limit distribution as the
sequence 0 n .

P roof. A s 0„ yn = 0 n + 0„(уг. — 1), it follow s th a t fo r every N > 0

^ ( |0 „ ( r „ - l) l> s ) ^ |0 J > A O + pi|y„- 1 | > ~ •

1 Cf. A. Rényi [22].


474 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 7

Let <5 > 0 be arbitrary. Choose N and nx so that for n > n x the inequality
P(I в„ I > N) < S should hold. Choose n2 > щ such that for n > n2 the
g
inequality P \y n — 1 1> — < <5should be valid. So P(| &n (yn — 1) | > e) <
< 28 for n > n2. Consequently, 0„ (yn — 1) Л 0 and the present lemma
follows from Lemma 1 of § 6.
According to these two lemmas it suffices for the proof of Theorem 2 to
show that
*7v„ _p о ( 13)
J in
Let e > 0 and 8 > 0 be arbitrary; choose nx such that
P{ I v„ — I > 8e2 A„) < d for n > щ. (14)
Clearly

p( > e д ^ pi J f t ~ n j > E,v n = к (15)

and because of (14) we obtain the inequality

P > e < 8+P max 1T,k Г З - 1 > e| . (i6)


y/An J A„ I

Now Kolmogorov’s inequality (Chapter VII, § 6, Theorem 1) implies

P max 14k J_ > gj < 28. (17)


\k —Xn\<E2ökn y j )in J

Inequalities (16) and (17) prove (13).


Finally, as an application of Theorem 1 of § 6 we prove a theorem in
which v„ fulfills a condition of other type than (9).

T heorem 3. Let a be a positive discrete random variable; suppose that


(2) holds and assume further that
= [««], (18)
where [,v] denotes the integral part o f x. Under these conditions (4) is valid.

P r o o f . Let ak (k = 1, 2 ,. . .) be the values taken on by a with positive


probability and A k the event a = ak\ we have

P(C„ < *) = t n k n a ü < * I Ak)P{Ak). (19)


k =l
V III, § 8] L IM IT D IS T R IB U T IO N S F O R M A R K O V C H A IN S 475

When P(Ak) is positive, then because of ak > 0, (2) and Theorem 1 of § 6


we get the relation
lim P{t[nak] < X | 4 ) = Ф{х). (20)
«->+ 00

Hence for every fixed m


m m
lim £ p (Clnak] < x \ A k) P(Ak) = Ф(х) Y P(Ak). (21)
я-> + оо k =1 /с= 1

As there can be found for every г > 0 an m such that

£ Р Ш < «, (22)
k = m +1

(4) follows from (21) and (22); thus Theorem 3 is proved.


We can deduce from Theorem 3 the following more general theorem:

T heorem 4. Let a. be a positive discrete random variable ', suppose (2) and

— A a. (23)
n
Then (4) is valid.

The proof1 rests upon Theorem 3 and uses the same method as the proof
of Theorem 2.

§ 8. Limit distributions for Markov chains

In the present section we shall deal with an important class of sequences


of dependent random variables: Markov chains. A sequence of random va­
riables £„ (n — 0, 1 , . . . ) is called a Markov chain if the following conditions
are fulfilled; for every n (n = 0, 1,. ..) and for every system of real numbers
x0, X i,. . . x„+1 one has with probability 1

P(C„ + 1 < x n+11Co = x0, Ci = х ъ . . . , C„ = x n) = P(C„ + 1 < x „+11C„ = x„). (1)


(The conditional probabilities figuring in (1) are defined according to § 2
of Chapter V.) In the present section we shall deal only with Markov chains
{C„} such that the values of Ся belong to a denumerable set кЖ ; without
1 Cf. A. Rényi [31]; later J. Mogyoródi [1], further J. R. Blum, L. Hanson and
J. Rosenblatt [1], have proved that in Theorem 4 the restriction that a should have
a discrete distribution can be omitted.
476 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 8

any essential restriction of the generality it can be assumed that cv# is the
set of the nonnegative integers. In this particular case (1) can be written in
the following form: If n, k ,j a, j \ , . . are any nonnegative integers, then

P(i„+i = к I Co = j0, Ci =.h, =j„) = P(i „ + 1 = к I f„ =j„). (2a)

Markov chains are usually interpreted as follows: Let S be a physical


system which can be in the states A 0, A b . . ., A k , . . .. Let the state of the
system change at random in time; consider the states of the system at the
time instants t = 0, 1,. . . and put Cn = к if at time n the system is in the
state k.
The hypothesis that the random changes of state of a system form a
Markov chain can then be expressed as follows: The past o f the system can
influence its future only through its present state.
If we multiply both sides of (2a) by P(£0 = j a, ■■., C„ = j„) and add the
equations obtained for all values of j 0, j b .. y'r_i (1 < r < ri) and further
divide by P(Cr = j r, . . £« = j n), then we obtain

P(C„+i = * 1C, —
jn■■L=J„)= P(C„ + 1 = кin=Jn
I )• (2 b )

Similarly, we can show that for arbitrary integers 0 < nx < n2 < . . . <
< ns < n,

P(Cn+1 = k I C„, = k , ■• C„s = j s ) = Pttn+1 = k\C„s = js). (2c)

The conditional probabilities P(C„+m = к \ C


„—
j) are called the m-step
transition probabilities, as they give the probability that the system passes from
the state Aj to the state Ak during the time interval (n,n + m) (m = 1, 2, . . . ) ,
i.e. in m steps. These quantities depend in general on the time n. If not, then
the Markov chain is said to be (time-) homogeneous. In this Section we
deal only with homogeneous Markov chains.
Thus by assumption the probability P(Cn+m = к | („ = j) will be inde­
pendent of n and we may put

№ = P(C„+m = к \C„=j) (уД = 0, 1, . . . ) . (3)

It is reasonable to consider the numbers p'-ff (J, к = 0, 1,. . .) as the elements


of a matrix
Я„, = (РГ)- (4)
Instead of pal we write simply p,k and instead of /7, simply П.
V III, § 8] L IM IT D IS T R IB U T IO N S F O R M A R K O V C H A IN S 477

Clearly for every positive integer m and for j > 0 the relation

(5)
k= 0

holds. In fact, the terms of the sum are the probabilities belonging to a com­
plete system of events. Hence the matrix Пт, which has nonnegative terms
only, has the property that the sum of terms in each row is equal to 1. Such
matrices with nonnegative elements are called stochastic matrices. The matrix
Пт can be computed from П as follows. According to the theorem of com­
plete probability (cf. Chapter III, § 2, Formula (2)) we have for 1 < r < m

P(C„,m=k\C„ = j) = I
/ =0
P(Lrm=к I C„+, = /, C„ = ;)P (C „ +, = 1 =Л1C„ (6)

and it follows from (2c) that


00

pT = I p f p t ~ r)- (7)
t=o
Thus we have

= (m = 2 ,3 , . . r = 1, 2, . . ,,m - 1). (8)


Consequently, Um = П Пт_г = П 2П т_2 etc., hence
Пт =Пт (m = 2 , 3 , . . . ) . (9)

The matrix of m-step transition probabilities is thus the m-th power of the
matrix of one-step transition probabilities.
So far we have only considered transition probabilities, i.e. conditional
probabilities. In order to determine from these the probability distribution
of we must know the state of the system at the instant t = 0 or at least
the probabilities of the initial state of the system, i.e. the probability distri­
bution F(Cо = к) (к = 0, 1, . . . ) . With the notation P((n = k) = Pn(k)
{n = 0, 1 , . . . ) one can thus write

Pn(k) = E P0(j)p $ (10a)


or, more generally,
oo
ад =I 7=0
PrU)P%-r) (r = 0,1..... n - 1). (10b)
If Co is constant, e.g. equal to j 0, then P0(j0) = 1 and P0(j ) = 0 for j # j 0.
478 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 8

In this case
Рп{к) = р% . (11)

As an example of a Markov chain consider a machine in a factory, which


is switched on and off as time proceeds. At any instant there are two possi­
bilities: either the machine works (state Aj) or it stands idle (state A0). Let
pjk denote the probability that at the instant n + 1 the machine is in the
state A k provided that at the instant n it was in the state Aj (J, к — 0, 1),
further put p01 = X, p10 — p (0 < X < 1, 0 < p < 1). In this case the matrix
of transition probabilities is

/1 - Я X \
n = p t1 - pi •

A simple calculation gives the и-step transition probabilities. Further we


derive from these

PJM = -XA+ p- + (1 - A- /0" В Д - лTp


(12a)
and

РАО) = — + (1 - Я - pT k ( 0 ) - - r j - l , (12b)
Л. ~r Ц *v A fi

where P0(l) and Po(0) are the probabilities that at time 0 the machine works
and does not work, respectively. Since 0 < Я < 1, 0 < p < 1, we have always
I 1 — Я — p I < 1; hence (12a) and (12b) lead to

lim P„(l) = , lim РАО) = ; (13)


„ ^ +00 x+ p „ ^ +K x+ p

hence the distribution of £„ tends to a limit distribution as n -* + 0 0 .


Notice that the limit values (13) do not depend on the distribution of Co-
Á A
If P0(l) = ------, and hence P0(0) = —^— , we haveP„(l) = -r-------and
X p X+ p X+ p
РАО) = — for every n and not only in the limit.1
If in a Markov chain the distribution of the random variables £„ tends to
a limit distribution (thus if lim P„(/c) = P(k)) which does not depend
n-*~+ 00

1 If X + p = 1, we have P jl) = X and P„(0) = p, without any assumption on the


initial state; in this case the £„-s are independent from each other!
V III, § 8 ] L IM IT D IS T R IB U T IO N S F O R M A R K O V C H A IN S 479

on the initial distribution P0(j), then the Markov chain is called ergodic.
An initial distribution such that £„ has the same distribution for every value
of n, is called a stationary distribution. If the Markov chain is ergodic and
there exists a stationary distribution, the latter is evidently the limit distri­
bution of C„- It is еа5У to show that there exists a stationary distribution,
iff the system of equations

** = f
7=o
Pjkx, {k = 0 , 1, . . •) (14)
00

admits a solution x0, xb . . . with xk > 0 (k = 0, 1,. . .) and £ x k = 1;


k=1
in this case the numbers x k constitute a stationary distribution. For the
example considered above Equations (14) can be written as

*o = (1 - A) x0 + рхъ
Xl= kx0 + (1 - fi) xv (15)

The single solution such that x0 + xx = 1 is

A g
*1 = 1—— > * o = t ~;— • (16)
A ~ fl A fl

In this example there exists a stationary distribution and the Markov chain
is ergodic.
The following theorem, due essentially to A. A. Markov, shows that this
holds under rather general conditions.

Theorem 1. Let a system possess a finite number o f possible states A0,


Аъ , . ., A n. Assume that the changes o f state o f the system form a homo­
geneous Markov chain and denote by pffi the probability that the system passes
from state Aj to state Ak in m steps. Assume further that there exist integers
s > 0 and k0 > 0 such that fo r j — 0, 1, . . ., N,

P& > 0, (17)


i.e. that the matrix IIs has at least one column in which all elements are posi­
tive. In this case the chain is ergodic; the limits

lim p $ = Pk (у,к = 0 , 1 , . . . , A) (18)


И-*-+ 00

exist and do not depend on j. The sequence o f numbers N is the unique


480 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 8

nonnegative solution o f the system o f equations

Pk = l PiPjk = (19)
J=o
which satisfies

1^= 1- (20)
1=0

The limit distribution {Pj} is thus a stationary distribution o f the chain.

Proof. By assumption

d = min p$0 > 0. (21)


Put
= min pV , M V = max (k = 0 , 1 , . . N ) . (22)
0 ^ l< ,N 0<,l<,N

Clearly

P%+1)= I PjipW, (23)


l=o
hence, for 0 < j :< N,

m V = m V t Pn * t PflPW = P%+^ (24)


1=0 1=0

and
m V < m(kn+1\ (25)
Similarly, for 0 < j < N,

m V = m V £ Pn > £ pjipV = p$ +1\ (26)


/=0 1=0
hence
m V > м [п+1\ (27)
Furthermore, by (22),

0 < m V < M V < 1, (28)

which implies the existence of the limits

mk = lim m V and M k = lim M V (29)


П—+00 П--+СО
VIII, § 8] L IM IT D IS T R IB U T IO N S F O R M A R K O V C H A IN S 481

with
Щ < M k. (30)

If we can prove that mk = M k for к — 0, 1 , . . N, then (18) will be proved.


Now for a suitable к the equality

= I PfiPj? (3D
y=o
holds and for a certain /0;

™(r s)= p (i’X s) = Z PM J- (32)


./=0
Hence
M (r s) - m(r s) = I Ш - ri3) p f - (33)
j = n

Let H be the set of all j (0 < j < N) for which pfy —p\s^ > 0 and let H
be the complementary set of H, i.e. the set of those j (0 <i j < N ), for which
P u ~ P u < 0 holds. Put

^ = £ (/> $ - /£ ) ) and В = I ( P ,( :)-№ ;)■ (34)


XU KH
Then A > 0 and

A+B= f
y=o
p f i - £ pffi = 1 - 1 = 0,
j =о
hence В - —A and it follows from (33) that

M in+S) - m[n+s) < (Mi"'» - m ^ ) A. (35)


Two cases are now possible: either k0 £ H or k0 £ H. In the first case we
have
B > - ( \- p ] 'l) > - ( \- d ) ,

and, since A = —B,


A < l — d. (36)
In the second case we have

A < \ - p ^ t<\-d,

hence (36) is valid in both cases. It follows thus from (35) and (36) that
Mi"+P - n4n+l) < (1 - d) (M P - mi">). (37)
482 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 8

Furthermore, because of (21)

M%> - 4 S) < 1 - d . (38)

By induction we conclude from (37) and (38) that


M[nx) - < (1 - d)n (39)
and, because of d > 0, by passing to the limit n -* + oo and taking into
account (29) and (30), we get mk = M k. This proves (18) with Pk = mk =
= Mk.
The passing to the limit in (7) and (5) shows that {Pk} fulfils Equations
(19) and (20). It remains merely to prove that the Pk (k = 0, 1,. . ., N)
are uniquely determined by (19) and (20). This can be shown as follows:
If Q0, . . ., Qn were a distribution distinct from P „,. . PN and satis-
ы
fying (19), thus if Y, Qk = 1 and
k =0

Qk = l Q ,P ,k (k = 0 , l . . , N ) (40)
1=0

held, then after multiplication of (40) by pkt and summation over к we


would obtain
a = £ Q,pf?.
/=o

By repetition of these operations we find for every integer n

Q, = I QlP\”\
1=0
N
Because of lim p ff = P, and Y Qi = 1 there follows Q, — P„ which
n-»+OO / =0
was to be proven.
The numbers Pk fulfill the equations

Pk = t p ,p № (41)
;=o
for every n = 1, 2 ,. . Equation (19) is a particular case of (41).
If the distribution of (0 is known, P(C0 — i) = P0(i), then from (10a) one
can derive the relation

lim Pn{k) = Pk (k = 0 ,l , (42)


n -*- + oo
V III, § 8] L IM IT D IS T R IB U T IO N S F O R M A R K O V C H A IN S 483

Finally, let us mention the following particular case: assume that for the
matrix of the transition probabilities (p]k) the sum of all columns is equal
to 1:

£ Pjk= 1 for к = 0, l , ..
7=0

The matrix П = (pjk) as well as its transpose П* = (pkj) are stochastic


matrices; such a matrix П is called a doubly stochastic matrix. In this case
(40) is fulfilled for Q = ~ ^ (k = 0, 1 , . . N); the solution of (19) being

unique, there follows Pk = ^ ^ . Thus for a doubly stochastic matrix П


fulfilling the conditions of Theorem 1 the relation

lim p $ = ■ 1 - holds for j , k = 0 , 1 , . . N.


„- +00 N+ 1

It follows from (42) that the probabilities of the N + 1 states are in the limit
equal to each other, regardless of the initial distribution.
A particular class of the Markov chains is that of the so-called additive
Markov chains. If £0, . . . are independent random variables and
if we put C„ = £0 + £i + • • • + the random variables C„(n = 0, 1,. . .)
form a Markov chain, since

P (C „ +1 < X ICo = *o, • • • » = *n) = P ( .L + i < x - x „ ) =

= -P(i«+i < * I £„ = *„)• (43)

If <ik are identically distributed, the chain is homogeneous. In this case


the problem of finding the limit distribution of the chain {£„} can be reduced
to the study of sums of independent random variables, already dealt with.
If take on only integer values, if their expectation is zero and if the
greatest Common divisor of the values assumed by — £2 with positive
probabilities is equal to 1, then for every pair к, l of integers the relation
(cf. Chapter VI, § 9, Theorem 8)

- 1 m

holds, hence £„ has in limit a uniform conditional distribution on the set of


all integers. (This may happen also for nonadditive chains.) Further, if the
expectation of t;k is zero and their variance is equal to 1, then from
484 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 8

Theorem 1 of § 1 there follows the relation

lim P ~ < л] = Ф(х).


П—+00 11 )

For homogeneous Markov chains with a finite number of states the suit­
ably standardized sum

*ln — Co + Cl + • • • + in
is under general conditions in the limit normally distributed. This will be
proved here only for the simplest case of a chain with two states.

T heorem 2. Let the random variables {£..} form a homogeneous Markov


chain with two states. Put £„ = 0 or £„ = 1, according to whether the system
is in state A0 or in state A x at the instant n. Let with 0 < Я <
g I - PI
< 1, 0 < g < 1 be the matrix o f transition probabilities. Put

4 n = Í Cl- (45)
k = 0

Then
( я- n
VП- \
lim P “= < X = Ф(х). (46)
п-*-со пЛц( 2 — Я — /i)
W (X + J tf я

Remark. If Я + p = 1, then the variables £„ are independent and gn has


a binomial distribution of order n and parameter Я, further in this case

W x A s f )„
(A + g f K ’
Hence (46) reduces to

lim P — f ~ J ' n= x < X = Ф(л:).


n— +cо ■\ n X { 1— Я)

Theorem 2 is thus a generalization of the Moivre-Laplace theorem.

Proof. If t„ denotes the instant when the system returns for the и-th
time to the state A b we have 0 < tx < t 2 < . . . < t„ < . . . ; £1я = 1;
V III, § 8] L IM IT D IS T R IB U T IO N S F O R M A R K O V C H A I N S 485

C* — 0 for k < ij о г т „< к < rn+1 (n = 1 , 2 , . . . ) . Put <5j = т}, <5„ = тп —


— t n-i. It is easy to see that 5n are independent and (0Хexcepted) have the
same distribution. For by the definition of Markov chains, r„ — r„_1 is
independent of the random variables <51( d2, . .., дп_г which depend only
on the states of the system at instants t < тп_1. The fact that the random
variables <5„ (n = 2, 3 , . . .) are identically distributed follows from the ho­
mogeneity of the Markov chain. Clearly for every n > 2

P(<5„=l)=l-/t
and
P{8„ = к) = цк(1 - Х)к~г for к > 2 .

Hence, for n > 2

E(K)- —
and

*2V '* 0 •

If Co =..l, <5i has the same distribution as the other <5„ (n ^ 2); if Co = 0, we
have Р(<Зг = 1) = X, P(öx = к) = (1 - A)*-1! for /с > 2 hence £(<50 =

By Theorem 1 of § 1 and the lemma of § 6 follows

lim P
{ — - .
( < X = Ф(х). (47)
к~+оо 2 —A —ц)
^ Ä '
Now obviously P(r)„ < к) = P(rk > n); in fact t]„ < к means that up
to the moment n the system was less than к times in the state A u thus its
&-th entrance into the state A x occurs after the moment n, hence x-K > n and
conversely. If we put

k = \ X n i x / ”M 2 - A - l>) -
[ a + |í V (A + i i f J’

a simple calculation gives that

к ( l + v) x \/k u ( 2 —A —u)
n = —- - -V- -- ----- л ---- + 0(1),
486 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 9

hence

Í 'l (ъ - Ц
P —, Л , = < X =F —p = = > - л: + О -7 = .(48)
/иЯ /;(2 - Я - ju) I У М 2 - Я - /i) lN/ А;
W (Я + м)3 ' V Я' '
Thus (47) leads to

/ Я \
Чп - у —“ и
lim Р —, ^ : < X = 1 —Ф(—х) = Ф(х). (49)
/!-*•+со /wA/i(2 —Я —/z)
^V (А + /i)3 ^

Theorem 2 is thus proved.

§ 9. Limit distributions for ‘ order statistics”

The theory of “order statistics” (i.e. the theory of observations arranged


according to their magnitude) is becoming more and more important in
mathematical statistics.
In the present section we shall show how the theory of order statistics
can be reduced to the study of certain Markov chains.1 We start from the
following particular case: Let £1( £2> denote the outcomes of n inde­
pendent observations of a quantity having an exponential distribution;
Ci, i 2>• . Cn are thus independent random variables with the same exponen­
tial distribution:

F(x) = P(Ck < x) = 1 —е-я* for X > О(Я > 0).

We use the following property of the exponential distribution: for any x and
у one has
P { t > x + y \ C > y ) = P{C > x). (1)
Property (1) is characteristic of the exponential distribution. In fact, it is
equivalent to
G(x + >0 = G(x)G(j) (2)
if F{x) is the distribution function of '( and G(x) = 1 — F(x); we know al­
ready (cf. Chapter III, § 13) that the only nonincreasing solutions of (2),
1 For the method see A. Rényi [9], [10].
v i li , § 9] L IM IT D IS T R IB U T IO N S F O R “ O R D E R S T A T IS T IC S ” 487

the trivial solutions G(x) = 0 and G(x) = 1 excepted, are the functions of
the form G(x) = exp (-A x ) with A > 0.
The meaning of (1) becomes particularly clear if we interpret Cas the dura­
tion of an event which takes a certain lapse of time to occur. In this case (1)
expresses the fact that the future duration of an event which is still in course
at a moment у does not depend on the time passed already since the begin­
ning of this event.
Arrange the random variables Ci, Сг>. . ., Cn in increasing order and let
С2 = В Д i,Í2,. ••,£„)
be the k -th of the ranked variables (j. Then1

Ci* < i f < • ■• < c*.


It is easy to determine the distribution of the C*-
If the C, are interpreted as durations of independent events beginning at
the same time, then C* is the duration of that event which is the k -th that
ends. We compute now the distribution of the differences C*+i — £*■ Clearly

p (C*+i - С* > * I Ct = У )= p (C*+i > * + Ж * = jO- (3)


From the n — к events still in course at the moment у none must cease
before the moment x + y; by (1) the probability of this is
[P(C > x)Y~k = e- (n~k»x.
The conditional distribution function of C*+i — Ct with respect to the
condition С* = У is thus

Cjfc < XIC* = JO = 1 - e-("-k»x. (4)


The function thus obtained does not depend on у hence it is equal to the
unconditional distribution function of C*+i — C*- This difference has
thus itself an exponential distribution and its expectation is — — ——
(n — k )l
(k = 1, 2 ,..., n — 1). (* also has an exponential distribution, with expectation
- i - . If we put С* s 0 and

Sk+1 = ( и - * ) ( « +1 - « ) (k = 0 , l , . . . , n ~ 1), (5)

then ök+i (к = 0, 1, . . n — 1) have all the same exponential distribution

1 F(x) being continuous, the probability that two of the are equal is zero; this
possibility can thus be omitted.
488 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 9

with expectation —- . It is easy to see that the SKare independent. In fact


A
the conditional probability

m +i- a <x i c?= Уь.., tt - t u = л) (6)


does not depend on у ъ y 2, . .., yk; since the conditions are equivalent to the
relations £*• = y 1 + . . . + yj (j' — 1, 2 ,. . k), they give thus for j = 1,2,
. . к the moment when the y'-th of the events starting at t — 0 ends. This
means that at the moment t = y x + y 2 + .. . + y k there are exactly n — к
events in course. The probability that between t and t + y at least one event
comes to an end is 1 — exp [-( и - k)X x\ This does not depend on the va­
riables y u . . . , yk; hence the random variables <51; <52, . . <5„ are indepen­
dent. Thus may be written in the form
<5i ő2 Sk
i t = — + ----- y + • • ■+ ----- , (7)

where 5j (j = 1 , 2 , . . . , k) are independent and possess the same distribu­


tion. Equation (7) shows that the variables form an additive Markov
chain. By means of (7) the distribution of can be determined explicitly.
Let the preceding result now be applied to the theory of order statistics.
Let £2, . . ., be independent random variables with the same contin­
uous distribution function F(x). As above, put = Rk{^i, £2, . .
hence < £* < . . . <<! ; * are the variables £y-arranged in increasing order.
The theory of order statistics deals with the study of £*; is called the
A-th order statistic. This study can be reduced to the case when £k are expo­
nentially distributed and by (7) we then have to consider sums of indepen­
dent random variables only. In order to show this, put

^ = l n T77T (*-l>2,...,n) (8)


r\S k)
and
tt = RkUi,...,tn). (9)
Since In ------ is nonincreasing, we have
F{x)

t t = In * . (A: = 1, 2 , . . . , n) (10)
r\Z*+l-k)
and as £k are independent, the same is valid for
Consider now the distribution of C,k. Let у — F_1(x) (0 < x < 1) be the
inverse function of x = F(y) ( —oo < у < + oo). Then the relation

P(Ck < x) = P(Ftfk) > e-*) = P ( tk > F-1(е-П)


V III, § 9 ] L IM IT D IS T R IB U T IO N S F O R “ O R D E R S T A T IS T IC S ” 489

is valid, i.e.

P(C* < X) = 1 - F(F~X(e'*)) = 1 - e - (11)


for 0 < x < + 0 0 . Hence the random variables £* are exponentially distri­
buted with expectation 1. Thus can be written in the form

£* = F _1 (e~ii+1- k) = E_1 [exp f- — ----- \ - ... - , (12)


I й и- 1 k J )

where <5t, S 2, .. .,<5„ are independent random variables with expectation 1.


Our result implies the theorem of van Dantzig and Malmquist stating that
the ratios ■■ (к = 0, 1 , . . n) are independent of each other
F {if)
(cf. Chapter IV, § 17, Exercise 17). Indeed we have according to (12)

F(£*+\) ( d„+i-k I /•; i о л /i


P I к] <l3 >

(We have to put here E (it+ i) = 1.)


The random variables £ f , . . .,£* form a Markov chain, since because of
(12) for x1 < x2 < . . . < xk < X the relation

Щ *+1 < x Ift = *i> • • •>£* = x k) = P lsn. 1-le < k In - d„+l_j =


\ г \хк)
=jXn^ Y ' l - J - k - l ’ Z* = x k (14)

is valid. Because of the independence of <5y- we get

Щ к+i < x \Z* = x1,...,£ * = xk) =■■


F(x)
= P k + 1 -fc < к ln — L = x*j = P(£*+1 < x \C k = Xk). (15)

Let us notice that as

P(F(t;k) < x ) = P(Zk < F - 1 (x)) = F( F - 1(x)) = x (16)

holds for 0 < x < 1, the random variables F(tJk) are uniformly distributed
in the interval (0, 1). The random variables /•"(£*) are thus the ordered ele­
ments of a sample selected from a population uniformly distributed on
(0, 1).
490 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 9

Starting from this point of view, many results on order statistics can be
derived quite easily. As an example, consider the following problem: What
is the limit distribution of the random variables <!;* when both n and к tend
к
to infinity in such a way that — tends to a limit q (0 < q < 1) ? In
n
particular we consider the case к = [nq\ + 1; £*nq]+\ is called the sample
quantile o f order q.
We prove a theorem which implies in particular that the sample quantile
of order q is in the limit normally distributed, provided that the distribution
function F(x) fulfills certain conditions.

Theorem. Let £x, £2, be independent, identically distributed random


variables, with an absolutely continuous distribution function F(x). Suppose
that the density function f(x ) — F'(x) is continuous and positive on the interval
[a, b). I f 0 < F(a) < q < F{b) < 1, and i f k(n) is a sequence o f integers such
that

lim J n —1— — q = 0 , (17)


«-> + oo V Yl
further i f denotes the k-th order statistic o f the sample £x, £2, . . .,
then is in the limit normally distributed, viz.

lim p i ^ m - Q < * I = ф(х), (18)


л-> + оо ^ y j Yl j
where
Q = F -\q ) (19)
and

D = j h ) ^ q { l ~ qy (20)
Proof. We consider first the limit distribution of

C*+i-*(„) = I n ■ (21)
By (12)
rt+l-k(n) S

C n + l-* ( n ) = E , J J ’ (2 2 )
j =1 n + 1~ J

where 5j are independent and exponentially distributed with density func­


tion e~x (jc > 0). Hence E(5j) — D(öj) — 1 and
00

E ( \ S j - 1|3) = J | * - \\*e -xd x < 3. (23)


0
V III, § 9 ] L IM IT D IS T R IB U T IO N S F O R “ O R D E R S T A T IS T IC S ” 491

It follows from (17) and from the known formula

^ 1 I
У — = lnW+C + O — ,
hi к Nj

where C is Euler’s constant,1 that

M n = £(&!_*(„)) = In — + о f - U . (24)
? yjn
Since
_l_ J _____ l_ M_
b V 2 Ni N2 + [TVf,
we get

S„2 = D2(С*+1-ад) = — + 0 (- U i (25)


m In2/
according to (23) and from

y‘ 1 1 1 ( 1 I
Л , * 3 " 2N \ 2N \ + W [iV?J
it follows that

л + 1 -Arpi) x _ I 3\ 1 \

,5 ‘ I VTw |-° Ы - <“ >

11 гГ-Ш-
Thus Liapunov’s form of the central limit theorem (Theorem 4 of § 1)
can be applied to the sums (22). Taking into account the lemmas of
Sections 6 and 7 we get

ír * , 1 \
ÍÍ+ i - а д - I n —
lim P ------- --------- < ~ n = ф(х)- (28)
n^ +co / 1 —<7 ./и
^ V T "

1 Cf. K. Knopp [1 ].
492 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 10

Now because of (21)

í rW
* i -в д - I Ьт
n - L \
P --------------, ----------- < 7= =
/ 1- q Jn
' V q '

=P <1 exp X Ij]• (29)

The mean value theorem allows us to write

f /1-17 ri
= F- 1(a) + — I- - - - - - 1- - - - - - - ” y .- J - - - - - - L (3 0 )
w № ,)
where lim 9„ = 1; further
/2-*- + CO

?H (“xх/-^Г)- ')■--■* +° (t )•
Now (29), the continuity of f(x), and the lemmas of § 6 and § 7 imply (18),
hence the theorem is proved.
The theorem states that the empirical sample quantile of order q of a
sample of n elements is for sufficiently large n nearly normally distributed

with expectation Q = F _1(q) and standard deviation ~ ’


This fact can be used in practical applications (e.g. in quality control). In case
of a symmetric distribution, expectation and median coincide; thus in this
case the sample median can be used as an approximation of the expectation.

§ 10. Limit theorems for empirical distribution functions

In the preceding section we have seen how to determine the quantiles of


a distribution function F(x) by means of an ordered sample of independent
random variables with distribution function F(x). The empirical distribution
function of such a sample may help us to get information also about the
whole course of the distribution function F(x). Glivenko’s fundamental
v i l i , § 10] E M P IR IC A L D IS T R IB U T IO N F U N C T IO N S 493

theorem of mathematical statistics (Chapter VII, § 8, Theorem 1) states that


the difference between the empirical and theoretical distribution functions
tends uniformly to zero with probability 1 as the sample size tends to in­
finity. Glivenko’s theorem, however, says nothing about the “rapidity” of
the convergence. But this information is supplied by the theorems of Smir­
nov and Kolmogorov, which we shall state now without proofs.1
Let £l5 £2, . . ., be independent, identically distributed random variables
with the continuous distribution function F(x). As in the preceding section,
let denote the k -th order statistic. Put

0 for X < £f,

F„{x) = —n for ££ < x <<j£+1 (Ac = 1, 2 , . . . , n - 1), (1)

1 for £* <
F„(x) is the empirical distribution function of the sample . . ., <J„.

T heorem 1 (Smirnov).

SUP ( Fn(*) - F(x)) < > ' ) = { Ó У> °’

Theorem 2 (Kolmogorov).
lim P( J n sup IFn (x) - F(x) | < y ) = | ^ ^ .} > °’
«-+ 00 -ao<*<+oo (0 otherwise,
where
ЧУ) = I (2)
k=—oo
Notice that in these two theorems the limit distributions do not depend
on F(x). It suffices that F(x) is continuous, this guarantees the validity of
these and all further theorems in this section. The values of the function
Ч У ) figuring in Kolmogorov’s theorem are given in Table 8 at the end of
this book.
The theorems of Smirnov and Kolmogorov may serve to test the hypothe­
sis that a sample of size n was drawn from a population with a given con­
tinuous distribution function F(x).
The theorems of Kolmogorov and Smirnov refer to the maximal deviation
between Fn(x) and F(x). Often it is more convenient to consider the maximum
1 For the proof of Theorem 1 cf. § 13, Exercise 23.
494 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § J 0

F (x) — Fix)
for F(x) > a > 0 of the relative deviation — ———----- . The follow-
F(x)
ing theorems are concerned with this relative deviation.1
T heorem 3. We have

J r Fn( x ) - F ( x )
hm P J n sup --------------- < у =
n -+ + со X a < , x < + OO - * \ X )

*1 t =*
_ J e 2 dx fo r у > 0,
о
0 otherw ise,
where xa is defined by F(xa) = a, 0 < a < 1.
T heorem 4. We have

J r F n (x )~ F (x ) j ^ \ь[у J ^ - \ f o r y > 0 ,
lim P L/и sup ------ — ------- < у = 1 \ 1 "/
И — J-0 0 l X a < tX < + o o \ " \ " ) I
0 otherwise,
where
2\
exp - {2k 4- l)2
L(z) = — X ( - 1 ) * -------i - — (3)
n 2k + 1
and x a is defined by F(xa) — a, 0 < a < 1.
The values of the function L(z) defined by (3) are tabulated in Table 9.
We may be interested in the maximum of the relative deviation over an
interval (xa, x b), where xa and x b are defined by F{xa) = a and F(xb) — b
(0 < a < b < 1). This problem is solved by
T heorem 5. I f 0 < a < b < 1, F(xa) = a, F(xb) = b, then the relation

J Fn{ x )-F { x )
\ i mP J n sup ------— ----- < T =
n - > + 00 X a< ,X < X b г \л )

________ у 0 - ' ) ( l- r - ä

- г —со
J “p( - J J ,) J 0
s * •*
is valid.
' Cf. A. Rényi [9].
V III, § 1 0 ) E M P IR IC A L D IS T R IB U T IO N F U N C T IO N S 495
\

First of all, we have to note a surprising corollary of Theorem 5. It


follows from the theorem of Smirnov that
lim P( sup (F„(x) — F(x)) < 0) = 0,
W-*- + 0 0 — CO < X < + CO

i.e. the probability, that the empirical distribution function remains every­
where under the theoretical distribution function, tends to zero. According
to Theorem 3 the same holds if we restrict ourselves to values of x superior
to xa (a > 0). However, if we consider an interval [xa, xfc] with 0 < a <
< b < 1, then by Theorem 5,
lim P( sup (F„(x) - F(x)) < 0) =
И--+СО Xa<,x<,Xb
0 _,|i ?P--

+ 00
J
0
(4,

i.e. the probability of the difference F„(x) — F(x) being in an interval


[xa, x b] (0 < a < b < 1) everywhere negative remains positive even in
the limit. Obviously, this result is of practical importance.
One can simplify the right hand side of (4) by a probabilistic consideration,
without calculations. In fact, the right hand side of (4) is equal to the
probability that a point (x, y) normally distributed on the plane lies in an
angular domain 0 < x < + oo, 0 < < x, where x and у are independent

and have the respective standard deviations / —p— — and 1. Now this
V b —a
probability is equal to

1 Ia(l — b) 1 . [a( 1 - b)
иar c , a n \ h r ^ r - 2 í arc sm V <5)
As a matter of fact for a normal distribution symmetrical in x and у
the probability of the random point lying in an angle у is —— ; an affine

transformation leads from this to the general case. Thus we have

T heorem 6. I f 0 < a < b < 1, F(xa) = a, F(xb) = b, then

lim P ( sup (F„(x) - F(xj) < 0) = — arc sin (6)


л^ +oo xa^x<.xb n \b(l-a)
If we take a = 0 or b = 1, we see that the right hand side of (6) is equal
to zero.
496 T H E T IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 10

Another fundamental problem of mathematical statistics consists in the


decision whether two samples may originate from the same population
or not. This question occurs often in problems of medicine, biology,
economics and many other domains. Essentially, the question is whether
the deviation between the results of two experiments is or is not significant.
The following theorems can serve to decide, provided that the distribution
function of the basic population is continuous; the knowledge of the
population distribution is, however, not necessary.
Let £2, . . . , rib rjb . . . , t \ „ be independent random variables.
Let C/t and t],• have continuous distribution functions F(x) and G(x) respec­
tively, which are not necessarily known.
The problem consists of testing the hypothesis F(x) = G(x) by comparing
the empirical distributions Fn{x) and Gn(x).
The following two theorems were proved by Smirnov:
T heorem 7. I f F(x) = G(x), then

(In . 4 [ 1 —e~iyt fo r у > 0,


lim P — sup ( Ff x ) - Gn(x)) < у = j
„-+« V 2 _00<x<+00' (0 otherwise.
Theorem 8. I f Fix ) = G(x), then

> ш р [ Д sup |F tó -5 .W I< l]-|5 W f ,> 0'


.... (V 2 I [o
where K(y) is defined by (2).
The theorems of Smirnov can be derived by passage to the limit from the
following theorems due to Gnedenko and Koroljuk, which give the exact
distributions of the quantities sup (F„(x) — G„(x)) and sup | F„(x) — G„(x) |
for finite values of n.
T heorem 9. I f [x] denotes the least integer ^ x , i f c = {z ч/ 2 л } , and if
F(x) = G(x), then

P sup (Fn(x )-G n( x ))< z =


, > ^ -0 0 <X< +00

0 for z < 0,

( 2n ) _

n
1 otherwise.
V III, § 10] E M P IR IC A L D I S T R IB U T IO N F U N C T IO N S 497

T heorem 10. Under the same conditions as in Theorem 9, one has

P ( sup I Fn(x) - Gn{x) | < z ] =

0 for z - V7 42n
= ’

= 1 Í n ii 2" I / 1 ^ In
Т ъ ,\
(2
Z
* = -[" ]
(-!)
I n-kc) JYn
. f or —p = < z <
\ 2
,

1 otherwise.
The values of
1 + [«“] ( 2n \
./- » ‘ Ú J
n ) 1'
are tabulated in Table 7, for n < 30; for n > 30 Theorem 8 can already
be applied.
First we prove Theorems 9 and 10; Theorems 7 and 8 can then be derived
by passing to the limit. Collect the random variables £,ъ rjb . . . ,ц п
into one sequence and arrange these 2n numbers in increasing order; let
Ct denote the А-th number in this ordered sequence. One can suppose
that £ * < £ * < . . . < C Put

J 1 if C* is one of the
I —1 otherwise.

Thus in the sequence 0lf 02, ■■■, 92n, n numbers are equal to 1 and n
numbers are equal to —1. Put Sk — 91 + 92 + . . . + 9k. We prove first

L emma . The relations


max Sk
sup (F„(x) - G„(x)) = —
— 00 < X < + 00 ^
and
max \ Sk \
sup IFn{x) - G„(x) ! = — ^ ------

are valid.
498 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 10

The number n [Fn(x) — G„(x)] is the difference between the number of


the £j inferior to x (j = 1 , . . . ,n) and the number of the rj, inferior to
X (/ = 1, . . . , n). If x runs through the real numbers, the quantity
n [F„(x) — G„(x)] changes only if x passes through one of the values
(* (к = 1 , 2 , . . . , 2n) ; in this case it changes by 6k. Hence
sup n(F„(x) - G„(xj) = max n{Fn{Jk + 0) - GJX* + 0)) = max Sk
-oo < * < + 00 l<^fc<^2w l< k < ,2 n

and, similarly,
sup n \ F„(x) - G„(x)I = max n | F„(C* + 0 ) - G„(£* + 0) | = max |S*|.
—co < x < + co l< ,k < ,2 n l < J c < ,2 n

The lemma is herewith proved.

P roof of T heorem 9. Clearly, the number of the possible sequences


01; 02, . . . , 02n is equal to the number of the possible arrangements
of 2n elements among which n are equal to 1 and n to —1; thus this number
[2n\
is . Since f ! , . . . , t h ,. . . , t], are independent and identically
n
distributed, all arrangements are equiprobable and each has probability
, .I n order to determine the probability for max Sk < z J i n we must
2 n l< ,k < ,2 n

П
find the number of the sequences въ . . . , 0 .2 n fulfilling this condition
(2n\
and then divide this number by . We arrived thus at a combinatorial
11
problem. Its solution will be facilitated by the following geometrical
representation : Assign to every sequence 0 X , . . . , 9 n a broken line in the
(x, y) plane starting from the point (0, 0) with the points (Sk, k) (k =
= 1 , 2 , . . . , In) as vertices. (Here (a, b) denotes the point with coordinates
x = a, у = b.) There corresponds thus to every sequence . . . , 02n a
“path” in the plane; all paths start from (0, 0) and end at (0, 2n); all are
composed of segments forming with the x-axis an angle either of +45°
or of —45°. We have to determine the number of those paths which do
not intersect the line x = z J i n . Let this number be denoted by U* (z).
If a path intersects the line x = z J i n , it is clear that it reaches the line
x = {z J i n } = c, too.
Thus we have to count those paths which lie everywhere below the
line x = c. First we count the paths which intersect the line x — c.
If a path intersects the line x = c, we uniquely assign to it a path which
is identical with the original one up to the first intersection with the line
x = c and from this point on is the reflection of the original path with
V III, § 10] E M P IR IC A L D I S T R IB U T IO N F U N C T IO N S 499

respect to the line x = c. The new path ends at the point (2c, 2rí). By this
procedure, we assign to every path going from (0, 0) to (0, 2rí) and inter­
secting the line x = c in a one-to-one manner a path which goes from
(0, 0) to (2c, 2rí) and is composed of segments which again form an angle
of ±45° with the x-axis. The number of paths having one or more points
in common with the line x = c is thus equal to the total number of the
I 2n
paths going from (0, 0) to (2c, 2rí). This number is . Hence
[n — c

. u :(,z) = [2n - ( 2n ) .
n \n —c)
Because of the lemma, Theorem 9 is herewith proved.
Proof of Theorem 10. We use a similar argument. The number of paths
going from (0, 0) to (0, 2rí) and having no point in common either with
x = z J i n or with x = —z J i n is equal to the number of paths going
from (0, 0) to (0, In) and having no point in common with the lines x = +c.
Let this number be denoted by t/„(z).
Let N+ and 1V_ denote the number of paths intersecting x — c and
x = —c, respectively. Let N +_ (and iV_+) denote the number of the paths
which after intersecting x = c (and x = —c) intersect also x = —c (and x = c),
respectively, etc. Let N0 denote the number of the paths which do not in­
tersect either x = c or x = —c. There can be shown as in Chapter II (§ 3,
Theorem 9) that
N0 = N - N + - N_ + N + _ + N _ + - N +_ + - N _ +_ + . . . . (7)

We know that N + = ( 2n I ; by reasons of symmetry we have AL = N +.


[n + c)
We calculate now N+ _ (which is equal to +). Let us take the reflection to
the line x = c of the section of the path which follows the first common
point of the path with the line x = c, then let us take another reflection
to the line x = 3c of the section of the new path which follows the first
common point of the path with the line x = 3c; we obtain thus a path which
goes from (0, 0) to (4c, 2 n ) . Conversely, there corresponds to every such path
exactly one of the original paths intersecting first the line x = c, then the
line x = —c. Hence N +_ (and N_ + too) is equal to the number of
sequences which consist of n + 2c elements equal to 1 and of n — 2c ele-
( 2^
. . Similarly, we obtain
n + 2c
_( 2 n \ I 2 n \

........ . (и + kc) \n-kc' ’


500 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 11

where гь . . . ,£k is a sequence of alternating signs + and — beginning


with + or with —. By (7) and by the lemma, Theorem 10 is thus proved.
In order to derive Theorem 7 from Theorem 9 it suffices to remark that
for c = {z y/2n} Stirling’s formula gives

2n I
n +c) _2zS
l,m ~ m ~ ~ e
л-*- + оо ■
-n )
Theorem 8 can be derived from Theorem 10 in a similar way.

§ 11. Limit distributions concerning random walk problems

In this section we shall study limit theorems of another type than those
encountered so far. As we do not strive at the greatest possible generality
but rather wish to present the different types of limit distributions, we shall
restrict ourselves mainly to the simplest case, i.e. to the case of the one­
dimensional random walk (classical ruin problem). We shall find in the
study of this simple problem a lot of surprising laws which contribute to
a better understanding of the nature of chance. Theorems 1 and 2 are
concerned with the problem of random walk in «-space.
Let the random variables £ъ {2, . be independent and let
each of them assume the values +1 and —1 with probability — .The
random variable
c„ = £ c (i)
k =1

may be considered as a gambler’s gain in a game of coin tossing after n


tosses, provided that the stake is 1 unit of money. The value of which
is always an integer, can also be interpreted as the abscissa at the time
t — n of a point moving in a random manner on the real axis. This point
performs on the real axis a “random walk”, in the sense that it moves
during the time intervals (0, 1), (1, 2),. . . either one unit step to the right
or one unit step to the left, both with probability . We shall deal with
the laws of this random walk.
Consider first a generalization of the problem to several dimensions.
Let G> denote the set of the points of the r-dimensional Euclidean space
which have integer coordinates, i.e. the set of points of the “r-dimensional
V III, § 11] R A N D O M W A LK PROBLEM S 501

lattice” . Imagine a point which moves “at random” over this lattice. We
understand by a “ random walk” the following: If the moving point can
be found at a time t = n at a certain lattice point, then the probability
that at the time t — n + 1 it can be found at one of the adjacent points
of the lattice is equal to for all adjacent points which have r — 1 coor­
dinates equal to those of the preceding point and one coordinate differing
by ± 1. If the position of the point at the time t = n is given by the vector
£(„r), then the random vectors fj* {n — 0, 1,. . .) form a homogeneous
additive Markov chain, namely

c(„r) = c(or )+ i W,
A =1

where the random vector represents the displacement of the point


during the time interval (k — l,k) ; by assumption, the random vectors
c f are independent and identically distributed. For r = 1 we obtain the
one-dimensional random walk problem discussed above; in this case we
write simply £„ and Qk instead of ^jj? and
We prove now first a famous theorem of G. Pólya.1

Theorem 1. The probability that a point performing a random walk over


the lattice Gr returns infinitely often to its initial position is equal to one
for r = 1 and r = 2 and is equal to zero for r > 3.

Proof. Without the restriction of generality we can assume that at the


time t = 0 the moving point is found at the origin of the coordinate system.
Let P(„r) denote the probability that at the time t = n the moving point
is again at the origin. The moving point returns to the origin after performing
in the direction of each of the axes exactly as many steps to the “right”
as to the “left” . Hence P $ +1 = 0 and

(2л)!
PS = !Y2
,r=n (nf.. . . nr\)
(2 r f

n\
(2 )
nx\ . . . n r\
In particular

0 ( 1) _
r 2n —
22n 5
1 Cf. G. Pólya [2].
502 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 11

2л j 2

p(2) _ _ П^__
2« л2п »

Í 2л

р(3) _ i 11 у w!_______
2" 62n * 4 , k\ l \ { n - k - l ) \ •

By applying Stirling’s formula we obtain

(За)
л /л л
and

/» ff« — • (3b)
nn

We give now an estimation of for r > 3. We know that

„ и !
У ------------ = г".
nl+.!Xnr=n «i! • • - и,!

On the other hand, it is easy to see that among the polynomial coefficients
the largest are those in which the numbers пъ n2, . . . ,nr differ at most
by +1 from each other (cf. Chapter III, § 18, Exercise 3). Hence

P(r) = I й ) у (__ Hi__j2 <


" ( 2 г Г „i+^ nr=n W ...л ,!.

(2n\ (3c)
^ \n n\ J 1
< —У- max — --------- = О —— .
(4r)n r
Znj = n
nx\ . . . n T\
n
1
On the other hand, it can be proved that P $ can be represented by the
following integral:
2n
»(<■) ______n у
2n (In)'-1 ( 2 r f n

X \ . . . \ \ r + 2 Y cos (0, —0 ; ) + 2 Y cos 0/]" d d x .. . d 6 r _ x.


— 71 —П 1 ^ / < / < £ » • —1 1=1
V III, § I I ] RANDOM W ALK PROBLEM S 503

Hence we derive the asymptotic relation

<4 >

From (3a) and (3b) it follows that

Y P $ = + oo for /• = 1 and r — 2,
/1 =1

and from (3c) that

Y Pin < + 00 for r > 3.


/1 = 1

In the latter case the Borel-Cantelli lemma permits to state that for
r ^ 3 the moving point returns with probability 1 at most finitely many
times to its initial position.
For r = 1 and r = 2 we shall show that with probability 1 the
moving point will sooner or later (and therefore infinitely often) return
to its initial position. In order to prove this, consider the time interval
which passes until the first return of the moving point. Let Q(p) denote
the probability that the point walking at random on the r-dimensional
lattice reaches its initial position for the tin e after n steps. Obviously,

p$ = + YL P&Q&-2k- (5)
k=1
Put

G,(x) = f P$xk (6)


k=l
and
Hr(x) = £ е й .л (7)
k=1
then from (5)
Gr(x) = Hr(x) + Gr(x) Hr(x), (8)
hence

HAx) = (9a)
1 + Gr(x)
and

а д "7 ~=Ш-m
504 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 11

Clearly,
0 <r) = i Q& = H'{ 1) = lim - , (10)
k =l *-*1-0 1 + W )

where Q(r) denotes the probability that the moving point returns at least
+ 00
once to the origin. For r = 1 and r = 2 the series £ is divergent, hence
k =l
<2(r) = 1, while for r > 3

t r®
Q(r) _ =1______

1+ t P $
k =1

hence 0 < 2 (r) < 1. (E.g., for r = 3, g (3) « 0.35.) Thus we have proved
Theorem 1; at the same time we have obtained

Theorem 2. For r > 3 a point performing random walk over the lattice
Gr has a probability less than 1 to return to its original position.
It can be shown in a similar manner that for r — 1 and r = 2 the moving
point passes infinitely many times through every point of the lattice with
probability 1, while this is not true for r > 3.
In what follows we shall deal with the case r — 1 only. First we give
the explicit form of the probability Q ff The generator function (6) is here

this and (9a) lead to

В Д - yH — = 1- / 1 ^ = I l K - i ) * - 1**.
1 + Gj(x) ^ * ti \ k }
Hence
12k — 2

— + ■ on
2 ^Р кк2
A simple calculation shows that
( 12)
V III, § 1 1 ] R A N D O M W A LK PRO BLEM S 505

Let Vj be the number of the steps in which the moving point first returns
to its initial position; hence vx is a random variable and P( vl — 2k) = Q$.
It follows from the asymptotic behaviour of the sequence Ihat the
expectation of Vj is infinite. Let 9f t ) be the characteristic function of v1:

<p(t) = 1 - У 1 - e2",
hence

lim q k | = e x p (- J - 2 i t ) . (13)
n~*4-CO 1
But we have
+00

’ (. 1
____ ! exp \ixt — —
exp(—s j — 2it) = —f== -------— 3--- — dx, (14)
V :2n 0' л“2

hence exp( —J —2it) is the characteristic function of the distribution with


the density function
1
e 2x
--------- 3- for л; > 0 ,
/(* )= J2nx2
0 otherwise.

Because of the identity

- — 3- í/m = 2 (1 — Ф (“ 7 = |) (15)

0
J J ln u 2

(where Ф(х) is the distribution function of the normal distribution), we


obtain

T h e o r e m 3 . I f v ls v 2, . . . , v „ , . . . denotes the moments when the moving


point performing a random walk on the line returns to its initial position,
i.e. when = 0 , then for x > 0 the relation

<i6)

is valid.
506 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 11

P roof . The random variables vlf v2 — vb . . . , vk — ,... are


evidently independent and identically distributed. Further
vk = vi + (V2 —vi) + • • • + (v* —v*-i)-
Hence (16) follows from (13), (14) and (15).
Remark. The distribution figuring in Theorem 3, with the characteristic
function e x p ( - J l i t ) , is a stable distribution corresponding to the

parameter a = ; hence the distribution {Q $} belongs to the domain


of attraction of this stable distribution.
Theorem 3 can also be formulated in a different way. Let 9n denote
the number of the zeros in the sequence Ci >• >• > CR> then
P(vk < n) = p(en > k). (17)
From this follows
T heorem 4.
limp | Í L < , ) J f W - I / » ^ > 0; (18)
„^+00 \J n ) 10 Otherwise-,
hence has in limit the same distribution as the absolute value o f a random
Jn
variable with the standard normal distribution.
Now we shall investigate the number of positive and negative terms
in the sequence £1;. . . , If Cj = 0 but Cj-i > 0, then £,- is counted as
positive; if (j = 0 but Cj-i < 0, then it is counted as negative. Let n„
denote the number of the positive terms (in the above-mentioned sense)
of the sequence £2, . . . , £„ . We prove
T heorem 5. For every positive integer n the relation1

2k\ l In — 2k

P(n2n = 2k) = k (k = 0 , 1 , . . . , n) (19)


holds.
Remark. Clearly, n.,n cannot be an odd number; in fact, n2 is either 0
or 2, according as £i = 1 or = —1. Similarly, n2n — n2n_2 is either Oor 2
since Сгп- 2 is an even number: if £2 , - 2 ä: 0, we have necessarily £2„- 2 ^ 2,
hence n2„ — n2n_2 = 2; if ( 2/ ; - 2 < we have necessarily (2я_2 ^ —2,
hence n2n - л2я_2 = 0; if C2n_2 = 0, then л2п - л2л_2 is equal to 0 or
to 2.
I2k\
1 We put I I = 1 for к = 0 .
VIII, §11] RANDOM WALK PROBLEMS 507

We need the following

L emma 1. For every integer n > 1 the relation

1 "(2 k\l2 n -2 k\ ,
. - * ) - ■ <20)
holds.
Remark. Relation (20)is a corollary of (19); in fact if we add the proba­
bilities P{n2n = 2k) for к = 0, 1,. .. , n, we obtain 1, remembering that
7t2n is always even. But since we wish to use (20) for the proof of (19), we
have to prove (20) directly.

Proof. A s we have seen, for | x | < 1

2k
1 00 к
( 21)
yj 1 —X k=0 4
Let us take the square of both sides of (21); since on the left side we get
1 00
—------ = Y x k, (20) is obtained by comparing the coefficients of x n on
1 - A k%
both sides.
Now we prove (19) by induction. Clearly, (19) is true for n = 1; in effect

P ( n 2 = 0) = P ( n 2 = 2) = •

Suppose that (19) is valid for n < N and let Vj denote the least index j
for which Cj = 0; vx is necessarily an even number. Furthermore
N
P{n2N = 2k) = Yj p (n2N = 2k, Vi = 20 + P(n2N = 2k, vx > 2N).
/=i
But
P(n2N = 2k, Vi = 21) =

= P(n2N = 2k, V! = 21, Ci = + 1) + P(n2N = 2k, va = 21, C i = - 1),


on the one hand

p (n2N = 2k, = 21, Ci = + 1) = P (n2(N-n = 2(k — /)) P(vi = 21),


508 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 11

and on the other hand

P(n2N = 2k, Vl = 21, C! = - 1) = - 2 P(n2iN_n = 2k) P(Vl = 21).


According to (12)
21 - 2 ill \

/ ’(v i= 2 i)= e s > = J - LU


w hich lead s to th e re c u rsio n fo rm u la

P(n2N = 2/c) = -P(7t2N = 2fc, Vj > 2N) +

((21-2 ( 2/j \

■+— — ~ 221 ) [/’(7r2(Ai-o = 2k) +P(^2(N-i)— 2(k —/))]. (22)

The probability P(n2n = 2k, vL > 2 N) is evidently zero for 0 < к < N.
If к = 0 or к = N,
2N\

P{n2N = 2N, V! > 2N) = P (n2N = 0, Vl > 2N) = . (23)

Now if the relation (19) is true for n = 1 , 2 — 1, then it follows


by some simple calculations from (20) and (22) that it is also true for
n = N. Thus Theorem 5 is proved.
This theorem implies the so-called arc sine law.

T heorem 6.

lim P Ял’- < л: = — arc s in ./* f or 0 < x < 1. (24)


N,+oo [ N ) л

Proof. According to (3a)

Ш _L
2 2k Jn k' { )

For 0 < X < у < 1 we obtain from (19)


, I I Iny] 1 1
P x < ^ ~ <y К — У , ------- - — , (26)
2n % * = [£ ] + i Jc_ / _ Jc \ n
V n n
V III, § 11] R A N D O M WALK PROBLEMS 509

hence
TCon ) I Г dt
hm P X < <у =— - 7= = =
n~+00 2« ) n J ^(1 _ ,)

= — (arc sin у /у —arc sin J x ). (27)


Ti

Now since n2n < 7г2„ +1 ^ 7г2„ + 1, the limit distribution of ——+1- coin-
277 + 1

cides with that of — which proves Theorem 6.


2n
This theorem can be proved in a more elegant way, which, however,
requires more powerful tools. This rests upon the following generalization
of Lemma 1:
L emma 2. We have

1 " (2k\ I 277 - 2к ,

where P„(x) denotes the n-th Legendre polynomial:

а д = 2 ^ Г ^ (" 2 - 1)Л-
Proof. We see that the left side of (28) is the coefficient of x n in the
power series expansion of
1 _ 1
^/(1 - e" x) (1 —e~" x) y j 1 — 2x cos t + :c2
On the other hand we know that1

Л — A ■--------- i = Z P n(cos t ) x n (29)


yj 1 —2x cos t + X n= 0
where P„(z) is the ?7-th Legendre polynomial. By comparing coefficients
we obtain (28).
Theorem 6 can be derived from (28) as follows: We have

1 1
I
E exp it -^2n-| = e 2 P„ cos -j— .
2n ) ] 1 2 t7

1 Cf. G. Pólya and G. Szegő [1], Vol. II, p. 291.


510 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 11

By the Laplace formula1



I r _____
P„(x) = — (x + i cos c p J \ —x 2)ndcp
n
0
we obtain
-g- l ( 1 e2 Г (it cos <p\
lim e P„ c o s ---- = ----- exp — -— dw =
n-*-+oo I 2n ) 7Г J 2 I v
0
+1 1
1 f eiI(1+u)l2 ^ 1 I' eitx dx
n J
-1
J 1 - u2 n J
0
J x (\ - X)

which implies the statement of Theorem 6 in view of Theorem 3 in Chapter


VI, § 4.
Lemma 2 permits an easy calculation of the moments of л 2п . For reasons
of symmetry E(n2„) = n. The standard deviation can be derived from (28):

D \ n 2n) = P'n( 1)= ( " 2 l ),


hence

• (30)
Remark. Theorem 6 expresses an interesting paradoxical fact. The
2 r- .
derivative of the distribution function F(x) = — - arc sin J x, i.e.
n

F '( X ) = ----; - 1
Лу/х(1 - x)

is namely symmetrical with respect to the point x = and has a minimum


at this point. Consequently, this value is the least probable one for the
random variable — : the probability that the value of — is in the neigh-
N N
bourhood of a point x (0 < x < 1) is the greater the farther the number
jc is away from — . One would expect rather the contrary: indeed it would

1 Cf. G. Pólya and G. Szegő [1], Vol. II, p. 291.


VIII, § II] RANDOM WALK PROBLEMS 511

seem quite natural that the moving point would pass approximately half
of its time on the positive and the other half on the negative semiaxis.
However, Theorem 6 shows that this is not the case. Or, to put it in the
terms of coin tossing: One would consider as the most probable that both
players are leading during nearly ~ of the whole time. But this is not so;

on the contrary, — is the least probable value for the fraction of time
during which one of the players is leading. However, a little reflexion
shows this to be quite natural; indeed £n varies quite slowly, because of
C„+i — Си = ± 1; if Си reaches for a certain n a large positive value,
(n+k will remain for a long time positive and a similar reasoning holds
for large negative values too.
Theorem 6 is due to P. Lévy.1
The theorem can be generalized. It was proved by Erdős and Kac2 that
if Ci, C2 , • • • are independent random variables with E(£k) = 0, D(£k) = 1
П

which satisfy the condition of Lindeberg, further if we put C« = X £k


k=1
and if nN denotes the number of positive terms in the sequence Ci, C2 »• • • > Cn
then nN fulfils (24). Sparre-Andersen3 proved that if Ci, C2 , • • • >£n are
independent random variables with the same symmetrical and continuous
distribution, then
'2k\ (2« —2k\
P(7ln = к ) Л к ) \ £ п~ к У (31)

In this case (24) is valid even if the variance does not exist.
We now determine the exact distribution of 9„ (the number of the zeros
in the sequence Ci, C2 , • • • , C«) f°r even values of n. We prove first
T heorem 7. For every positive integer n

2 k 12n — k \
Р(в2п = к ) = -5Г (A: = 0 , 1 , . . . ,2«) (32)
2 n
holds.

P roof . If vx denotes the least positive integer for which = 0, then

P(62n = k) = £ P(92n = k , v x = 2r) + P(92n = k ,v x > 2ri).


r= 1

1 Cf. P. Levy [2].


2 Cf. P. Erdős and M. Kac [2].
3 Cf. E. Sparre-Andersen [1 ].
512 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 11

From this we derive (32) by the method used in the proof of Theorem 5.
We obtain from (32) by Stirling’s formula

E(ßn) * J 2^ - . (33)

Theorem 6 may also be formulated as follows: Put


[1 for £ * > 0 or for £* = 0 and Ck_г = 1 ,
8 ^ - s g n £ f c - [ _ j otherwise

Then for - 1 < X < + 1,

( £ 1 + 8 2 + . . . + £„ 1 2 . I l X
hm P -----------------------< x\ = — arc sin / — -— . (34)
„- +=o I n ) n V 2

Consequently, the ratio - - + -£~ + — — does not tend to zero as


n
n -> + 0 0 , though this would seem to be quite “plausible” . However the
following theorem due to Erdős and Hunt1 is valid:
Theorem 8.
' z - 1
P lim ^ - ^ = 0 = 1 . (35)
N-*-+00 v -' *
kI= 1 tK ;'

Pr o o f . Clearly E(ek) = 0, E(ek) = 1. We determine Е(г„ em) (m > n):

E(e„ em) = 2 1 P(Cn = k) {IP (Cm > 0 1C„ = k) - 1) <


k =1

< 4 ^ ((„ = i) f ( - ( < U < 0 ) .


k =l
The greatest term of the binomial distribution of order n and parameter
1 Г2
— is the central term, asymptotically equal to / ---- ; from this we con-
2 V П71
elude that
Cxk
P ( - A r < C m_ „ ^ 0 ) < — = L = = ,
m —n
1 Cf. P. Erdos and G. A. Hunt [1].
VIII, § 1 1 ] RANDOM W ALK PROBLEM S 513

hence

I£(<=„ O l ^ C2 — - — ; (36a)
V m — n

here and in what follows Съ C2, . . . are positive constants. If m — n < n,


we use instead of (36a) the trivial inequality

| £(£„ e j | < l . (36b)


Thus we obtain
ífíl-y ) < C 3 lnyV. (37)
If we put

E
4v = ^ H n ’ (38)

we find
E{An) = 0 and E(A%) < - % .
ln N
Hence, by applying Chebyshev’s inequality,

П \^ \> г ) < ^ - . (39)

со
From this we obtain that the series E T(|d2/t* | > s) converges for every
k =1
e > 0. Thus by the Borel-Cantelli lemma the inequality | A2k* | < e is
satisfied with probability 1 for a sufficiently large k. But for 2k8 < n <
< 2(,<c+1)! we have

\An\ < A ^ + ^ .

Hence the inequality | A„ | < 2e is fulfilled with probability 1 for a sufficiently


large n. Since e > 0 can be chosen arbitrarily small, Theorem 8 is proved.
In conclusion we mention some theorems concerning the largest fluctu­
ations of the one-dimensional random walk.

T heorem 9.

lim P ( max [k < x J n ) = | 1 ^ x > °’ (40)


„-•юс i < ^ „ |0 otherwise.
514 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § II

T heorem 10.
lim P( max | C* I < x sf n ) =
n-*-+oo l<,k<,n
, (Ik + l)12я 2
4 * ( - I)* e x p - i — Л -
= lr ~ L -----------2 k + 1 ------- f 0 r X > °* (4»
0 otherwise.
Theorem 9 can be derived from the following formula (cf. Chapter III,
§ 18, Exercise 19):1
Г” - " П
1 L 2 j / w + 2/ c к 1
<42)
It was shown by Erdős and Kac2 that Theorems 9 and 10 can be amply
generalized. They can be extended to the sums of independent, identically
distributed random variables.3
It is interesting to compare Theorems 9 and 10 with the results of the
preceding section. Those results can be put in the following form:
T heorem 11.
/— . í 1 —e~2x‘ for X > 0,
lim P( max £* < x j I n \ C2„ = 0) = j . (43)
„--юс 1 < ^ 2„ (0 otherwise.
T heorem 12.
,— У (—1)k е~2к'х' for X > 0,
lim ( max | C/tl < x j l n |(2„ = 0) = *=-oo (44)
„- +0О i^*< 2„ [ 0 otherwise.
Theorems 11 and 12 describe the properties of paths which after In
steps return to the origin. One expects that under this condition the path
does not deviate as far from its origin as in the general case. Indeed the
expectation of the distribution (43) is
00
J*4x2 e~2x‘ dx = = 0.627,
0
while that of the distribution (40) is / — = 0.798.
V 71

For another proof of Theorem 9 see Chapter VII, § 16, Exercise 13. e.
1
Cf. P. Erdős and M. Kac [1].
2
3 For the extension to random variables which are not identically distributed,
see A. Rényi [9].
V it i § 12] P R O O F O F L IM IT T H E O R E M S B Y O P E R A T O R M E T H O D 515

§ 12. Proof of the limit theorems by the operator method

The most important limit theorems of probability calculus can also be


proved without the use of characteristic functions, by a direct method.
This method will be presented here. It is to be noted that this method, if
applicable, is more simple than the method of characteristic functions;
but it does not replace the latter. As a matter of fact, the method of charac­
teristic functions has a far wider range of applications and there are many
limit theorems which can only be proved by means of characteristic
functions, or at least their proof by any other method becomes very
complicated indeed.
The method to be dealt with in the present section can be called the
operator method, since it uses certain functional operators. For the sake
of simplicity we shall introduce the method first by proving Liapunov’s
theorem; then we shall pass to the proof by this method of the more
general Lindeberg theorem. Finally, we shall prove a theorem about the
convergence to the Poisson distributions (§ 4, Theorem 1) and the theorem
of § 3.
We recall some definitions and notations. Let C3 be the set of all uniformly
continuous and bounded real-valued functions defined on the real number
axis which are three times differentiable while their first three derivatives
are also uniformly continuous and bounded on the whole number axis.
If / = /(*) £ C3, put
11/11= sup |/(x )|. (1)
X

The number | | / j[ is called the norm of the function / = fix). Clearly, if


/ £ C:! and g d C3 , then f + g £ C3 , further i f / £ C3 and a is a real number,
then a / £ C3 . It is easy to see that if / £ C3 and g £ C3 , then 11/ + g \\ <
— Il/ll + Ill'll an(i if / í Q and a is a real number, then 11 o f 1j =
= |a| Il/ll . An operator A which assigns to every function / £ C3 an
element g = g(x) = A f of C3 is called a linear operator if it possesses the
following properties:
1) If / 6 C3 and g £ C3 , then A ( f + g) - A f + Ag.
2) If / £ C3 and a is a real number, then A(af) = a • Af.
3) There exists a number К > 0 such that for any function / £ C3 the
inequality || A f || < К • | | / | | holds.

If 3) is fulfilled for К = 1, the operator A is called a contraction operator.


If A and В are two operators, we define the operator A + В by (A + B )f —
= A f + B f The product of two operators is defined by the consecutive
516 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 12

application of the operators, i.e. by (A B )f = A(Bf). The multiplication of


operators is associative, but it is usually not commutative. The addition
of operators is obviously both associative and commutative. For the
multiplication and addition of operators the distributive law is valid:

A(B + C) = AB + AC.

If A is an operator and a a real number, we understand by a A the


operator defined by (aA )f = a ■A f Clearly, if a and ß are real numbers
and A and В operators, then a(A + B) = ocA + aB, <x(ßA) = (ocß)A and
(a + ß)A = a A + ßA, further A + { - A ) = О and 0 • A = 0 , where О
is the zero operator which assigns to every function / £ C3 the function
identically equal to 0. To the reader acquainted with the elements of
functional analysis all this is of course familiar.

L emma 1. Let F(x) be an arbitrary distribution function, then the linear


operator A F defined by
+ 00
AFf = I f i x + y)dF(y) (2)
— GO

is a contraction operator.

P It is easy to see that if / £ C3 then Aff £ C3 . Clearly Af fulfils


ro o f .

conditions (1) and (2) in the definition of linear operators, further

ii^ /ii^ ii/n —jсоW ) =11/11,

hence A F is indeed a contraction operator.


The operator A F is called the operator associated with the distribution
function F.
If A and В are two operators such that for every / £ C3 A B f = BAf,
then the operators A and В are said to be commutative.

L emma 2. Let F(x) and G(x) be any two distribution functions. The operators
A F and A g associated with them are commutative and A FAa = A H, where
H = H(x) is the convolution o f the distribution functions F(x) and G'(.v), i.e.

H(x) = J f (x - y)dG(y).
—oo
V I I I , § 12 ] P R O O F O F L IM IT T H E O R E M S B Y O P E R A T O R M E T H O D 517

P roof. Clearly

AFAGf = J ( J f ( x + V+ z)dG{z))dF(y)= J f ( x + u)dH(u).


—00 —00 —00
Lemma 3. Let A be a contraction operator and В an arbitrary operator,
then
\\A B f\\< \\B f\\.

Proof. The statement follows from the definition of the contraction


operator.

L emma 4. I f Ux, U2 , . . . , Un and Уъ V2, . . . , V„ are operators associated


with probability distributions and i f f £ C3, then

\\UyU2. . . U J - Vi V2. . . VJW < £ II V J - V kf\\. (3)


k= l

P roof. Clearly, for arbitrary linear operators we have the identity

U1 U2. . . U n - V 1 V2. . . V n = £ Ui U2. . . Uk^ ( U k - V k) Vk +i . . . V n. (4)


k = l

By Lemmas 1, 2 and 3 we get immediately (3).


Now we can begin the proof of the central limit theorem under the
Liapunov conditions. Instead of Theorem 2 of § 1 we prove the following
somewhat more general theorem:

T heorem 1. Let i nl, i n2, . . . , be independent random variables with


finite variances Dlnk = D~(qnk) and third absolute central moments H \k —
= Д1 ink - Щ„к)\3)- Put L = £ ink and c* = " nf ~ , further
k = l Щ ^ п )

Sn = D (U = l £ D;lk and Kn = »/ £ H \k.

Let Fk(x) denote the distribution function o f £*. I f Liapunov s condition

lim — =0 (5)
«-*•00 ^ П
is fulfilled, then
1 f -—
lim Fn (x) - Ф(л-) = —== e 2 du. (6)
п~* со 2 jI J
—CO
518 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 12

Proof. Without restricting generality we may assume Е(£„к) =


= О (к = 1,2, . . . , « ) . Let Unk denote the operator associated with the distri­
ct
bution function F„k (x) of the random variable —- . Further let Vnk denote
the associated operator of the normal distribution with expectation 0 and
Dnk
standard deviation —- . Then U„x Un2. . . U„„ is nothing else than the
operator associated with the distribution function F„(x) and Vnl, V„2 ■■. V„n
is the operator Аф associated with the standard normal distribution function
Ф(х) of expectation 0 and standard deviation 1. If we apply Lemma 4, we
obtain that for any / £ C3

li AFnf - A„f\\ < E II Unkf - Vnkf\\. (7)


k =\

Now
+ 00 + 00

Unkf = ( f{ x + y) dFnk (y) and Vnkf = f ( x + y)d<I> } .


J J u nk,
—00 —00

Since / £ C3 , f{ x + у) can be expanded into a finite Taylor series up to


three terms
2 3
f i x + y) = /(x ) + y f (x) + j—/ " (x) + (x + Or), (8)

where 0 < 0 < 1; of course 0 depends on x and у . Thus, taking into


account that
+00 +00 +00
J dFnk O) = 1, J ydFnk (у) = 0, and J / dFnk (y) = ~ ,
—00 —00 —00

we obtain
+ 00

U„kf= Ax) + y3f " ( * + °y)dF„kiy) (9 )


—00
and
+ 00

у n k f =fix) + - f ~ + -j- J f f " (X + Oy)dd> j - ^ - J . (10)


—co
v i l i , § 12] P R O O F O F L IM IT T H E O R E M S B Y O P E R A T O R M E T H O D 519

Hence, if sup f"'(x) \ = M, we get

+ 00 +00

\\Unkf - V nkf \ \ < ^ - [ f \y\*dFnk(y)+ f I у |3 с/Ф í-y p Ú ] . (11)


6 U J \ D nk))
— со — oo

Since

J
+ 00

\ y \ 3dFnk(y)= ”, (12)
-0 0

and

(,з)
—00 —00

there follows

I Unkf - Vnkf И< - ^Mg - ( H l k + 2D3nk). (14)

Because of the Hölder inequality one has for every random variable

E (? f< E ( |i|V , (15)


hence
Dnk< K nk, (16)
and thus

£ D3„k < К 3. (17)


k= 1
(7) and (14) lead to
M к
\\AFJ - A „ f \ \ < — ^E , (18)

and thus by (5)


lim \\ATnf — A„f\\ = 0. (19)
n-*- 00

Thus we proved that if / £ C3 , then for any value of x (and even uniformly
in x)

lim J f(x + y ) dF„ (>’) = j f ( x +y) d<l>(y). (20)


W
—00 —00 —00
520 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 12

From this follows that (6) holds for every x. Indeed if e > 0 is arbitrary,
let / E(x) be a function belonging to C 3 with the following properties:
f j x ) = 1 if x < 0 ,/£(x) = 0 if x > e andf E(x) is decreasing if x lies between
0 and e. Such a function can be given readily, e.g. the following function
has all the required properties:

1 for x < 0,

Á O) = 1- | * for 0 < x < e, (21)

0 for £ < X.
Then
+ со
Ф(х + e) > J f E(x + y)d<P(y) > Ф(х), (22)
—00
and
+
F„ (x + £) > J f e (x + y) dF„ (y) > Fn (x). (23)
-o o
Hence
lim sup Fn(x) < Ф(х + e), (24)
n -+ oo

and
lim inf Fn(x + e) > Ф(х). (25)
n-*- 00

If we apply (25) to x — £ instead of x, we obtain

lim inf Fn (x) > Ф(х —e), (26)


n-*- 00

i.e. we obtain from (24) and (26) that


Ф(х — e)< lim inf F„ (x) < lim sup Fn (x) < Ф(х + e). (27)
П-+О0 n-+ 00

Since (27) is valid for every positive e, it follows that (6) is fulfilled for
every x. Theorem 1 is herewith proved.
Now we pass to the proof of the Lindeberg theorem by the operator
method. We prove the theorem in its most general form, i.e. we present
the proof of Theorem 4 of § 1.
P roof of T heorem 4, § 1 by the operator method. We may assume
without restriction of generality that M nk = E(£nk) = 0 (k = 1 , 2 , . . . , « ) .
n
Put = £ €nk, and let Fn(x) denote the distribution function of ( я.
k= 1
V I I I , § 12 ] P R O O F O F L IM IT T H E O R E M S BY O P E R A T O R M ETHOD 521

We prove that for every / £ C3

lim Af J = Л Ф/ . (28)
n-*- oo

As we have seen in the proof of Theorem 1, it then follows that for every
real X
lim F„ (x) = Ф(х).
n-*- 00

Let U„k denote the operator associated with the distribution function
Fnk(x) of the random variable £„k and Vnk the operator associated to the
normal distribution with expectation 0 and standard deviation Dnk. Then
according to our assumptions
Af „ = U nlUn2. . . U nn, and Аф = VM Vn2 .. . V„„. (29)
Further by Lemma 4 for every / ( C 3

IIAFJ - a j II < £ II Unkf - Vnk)f\\. (30)


k= 1

Now if we expand /(x + y) into a finite Taylor series up to the second


and third term respectively, we get

f ( x + y) —f(x) + y f (x) + - 'y -/" ( x + öiv), (31)


and
2 3

f ( x + }’)= /(X ) + y f (x) + y2 f " (x) + y— г (X + 02у ), (32)

where 0 < вг < 1 and 0 < 02 < 1; of course and 02 depend on x


and y. Let e > 0. Clearly

Unkf = J /(x 4 y) dFnk (x) + J /(x + y) dFnk (x). (33)


-* |Я>*
Use in the first integral on the right hand side of (33) the equality (32)
and in the second integral (31). We obtain

U„kf=f(x) + 2 f (■*) D~„k + ^ J ' }’3f (x + 02y) dFnk (y) +

+ T J j2 (f "{x + 01 y) ~ f "{x)) dFnk {y)■ (34)


Iy\>c
522 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 12

Put
sup If"(x) I = Mx and sup |f"' (x) 1= M2,
then
Unkf - f { x ) - ~ D \k f"(x) I < - i- sM 2 D\k + Mx J j* dF nk (x). (35)

On the other hand


+ 00

Vnkf = f ( x ) + ~ D\kf" (*) + - ! J / / " ' (a + 08j;) </Ф i-^ -j , (36)


— 00

hence
+ 00

Vnkf - f ( x ) - Y D l J " (X) < ~ M 2Dlk J Ij I3сЩу) < M lp L . (37)


—oo
(35) and (37) lead to
" 1 Г Л/о "
E II unkf - Vnkf\\ < — eM2 + M, E (J) + ^ - E Dlk- (38)
k=1 0 k= 1 J 3 k=l
Ul>«
Since by our assumption the Lindeberg condition is fulfilled, we have

l im E j / d F nk O ’) = 0 . (3 9 )
л-*-со A :=l |У| > e
furthermore
D2nk = J x2dF„k(x )+ J X2dFnk (x) < s 2+ J x2dF„k(x),
|jc|^ 8 |лг| > e J*|>£
therefore

É &пк ^ max D„k < J r 2+ E j’ x 2 d F nk( x ) ,


k=l l< ,k < ,n > k = l |л:| > e

and thus for every positive e


n
l im s u p E Dlk ^ e,
n-*- 00 /c = l
that is
l im E = 0, (4 0 )
n-+ CO /c=l

hence (30) and (38) lead to (28). Theorem 4 of § 1 is herewith proved.


As a further illustration of the operator method we prove now a theorem
concerning the convergence to the Poisson distribution.
V III, § 12] P R O O F O F L IM IT T H E O R E M S B Y O P E R A T O R M E T H O D 523

T h e o r e m 2. Let Qn k (k — 1 , 2 , , n) be independent random variables


which assume only the values 0 and 1; put further

P{U=V=Pnk- (41)
Put

K = t Pnk (4 2 )
k 1 =
and suppose that
lim Xn = X (43)
П -+ 00

and
lim max pnk = 0. (44)
tt-*-oo 1 <,k<,n
Then the distribution o f
in = Ínl + Íril + • • • + inn (45)
converges to the Poisson distribution with expectation X.
Remark. Theorem 2 is a particular case of Theorem 1 of § 4; the latter
can be proved in a similar manner. Merely for simplicity’s sake we restrict
ourselves to the proof of Theorem 2.
P roof. Let К denote the set of all real-valued bounded functions f ( x )
(x = 0, 1, 2 , . . . ) defined on the nonnegative integers. Put | | / | | = sup|/(x)| .
Let there be associated with every probability distribution —
— {Po .........A i - ” } an operator defined by
00

A & f = Y . A x + r )P r (4 6 )
r=0

for every / £K . Clearly, Ag- maps the set К into itself, Ag^ is a linear con­
traction operator, further if fP and Ц аге апУ two distributions defined on
the nonnegative integers, then A ^ A q = A ^ where Lft = Sfi • Q, i.e.
is the convolution of the distributions ffi and £); that is, if = {pn} and
Q = {dn}> then = {/•„}, where
n
rn = Yj P kdn-k-
k= 0
Let Unk denote the operator associated with the distribution 6f'nk of the
random variable and V„k the operator associated with the Poisson
distribution with parameter p nk. Then Unl Un2 . . . Unn is nothing else than
the operator A.cßnassociated with the distribution SAn of the random variable
C„, while Vnl Vn2-.. Vm is the operator Qkii associated with the Poisson
distribution with parameter Xn (taking into account that if Qk is the
Poisson distribution with parameter X, then = йх+ц)-
524 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 12

In order to prove Theorem 2 it suffices to show that for every element


/ of К the relation
lim II A g s J - AQXnf\\ = 0 (47)
«-►00

holds. In fact, if (47) holds for every f £ K, choose for / the function for
which /(0) = 1 and / ( x) = 0 for x > 1; then it follows from (47) that
for every r (and even uniformly in r)

lim P(Cn = r ) - ^ - J = 0 , (48)


« — GO
r' ' *

and since by our assumption — Л, it follows from (48) that

lim P(£„ = r) = — (/- = 0, 1, . . . ) . (49)

Now Lemma 4 is valid in this case too, and applying it we obtain

II A & J - AQXnf \ \ < £ ИUnkf - Vnkf\\. (50)


k =1
On the other hand

u nkf - v nkf = f ( x ) ( \ - p nk - e ~ n +

00 vrLe~Pnk
+ f ( x + 1){p„k - P„k e Pnk) - E /(* + r) ~ Zi---- > (51)
r =2 ' •

and thus

II Unkf ~ Vnkf\\ < U/H ( « - « - (1 -Р „к) + Р п Л 1 - e -pnk) +


+ [! —e~Pnk(l + pnkj\)- (52)
Since
1 - x ^ e~x < 1 - x + x 2 for x < l , (53)
there follows
\\Unkf - V nkf \ \ < l \ \ f \ \ p l k. (54)
Thus (50) and (54) lead to

II A & J - AQxJW < 3 H/Ц • £ p2nk. (55)


k =l
Because of
. «
£ Pnk ^ max p„k, (56)
к- 1 l<.k<,n
V III, § 12] P R O O F O F L IM IT T H E O R E M S B Y O P E R A T O R M E T H O D 525

by (44) it follows that


lim IИ л / “ AQx„f\\ = 0. (57)
n-+ oo

As we have seen, the assertion of Theorem 2 follows.


Finally, we give a proof by the operator method of the theorem proved
in § 3. Just like there, we assume that the distribution in question is contin­
uous and symmetric with respect to the point 0, i.e. we prove the following

T heorem 3. Let £b &> •••>&»>•• • be independent identically distributed


random variables; let their common distribution function be denoted by F(x).
Suppose that F(x) is continuous and the distribution is symmetric with respect
to the point 0, i.e. F( —x) = 1 — F(x) (x > 0). Assume further that

lim , q. (58)
\ x 2dF(x)
о
Put Cn = í i + £ 2 + • • • + £n- Then there exists a sequence o f numbers Sn
such that for every x
lim P < xj = Ф(х). (59)
w-*-oo )
P roof . Put

<«,)
j x2dF(x)
о
then by assumption
lim S ( y ) = 0. (61)
y-*-+CO
Put further
./.л %) /
^(у) (Л _ с у ч \ 2 У ’ (6 2 )
( w) (1 - F(y)) § X2dF(x)
0
then, as was shown in § 3,

limd(y)=+oo. (63)
n-+oo
By our assumption A(y) is continuous for у > y0 . Let C„ denote the least
positive number for which
A(C„) = n \ (64)
then C„ -> oo, furthermore
и ( 1 -Т ( С й) ) = 7 а д . (65)
526 Г Н Е L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V I I I, § 12

+Cn
Put f 2C2
S2n = n x2dF(x) = — = ^ = , (66)
J x/<5(C„)
-Cn
then

(«>
V2
Now let {/„* be the operator associated with the distribution of the random
L
variable — and Vnk the operator associated with the normal distribution
Sn 1
with expectation 0 and standard deviation — = (k = 1,2, ,ri). Then

UniU n2 ■. . Unn is the operator associated with the distribution function
F„(x) of the random variable while V„1 Vn2 . . . v nn is the operator
associated with the standard normal distribution function Ф(х) (having
expectation 0 and standard deviation 1). Thus by Lemma 4 for every
/ d C3 we have
II A fJ - A * f II < £ II Unkf - Vnkf\\ = n II Unlf - Vnif\\.
k =1
Hence it suffices to prove that
\ i m n \ \ U nlf - V nlf\\ = 0. (68)
П-+00
Now

Unlf = j f ( x + y)dF(Sny ) + ) f ( x + y)dF(S„y). (69)


-g w > f:
If sup |/(x ) I = A, then

j f{x + y)dF{Sny) < A ( l - F(Cn)) = ± ^ 3 CA . (70)

On the other hand, if in the integral on the right hand side of (69) f ( x + y)
is expanded into a Taylor series up to the third term and if it is taken into
account that by our assumption the distribution with the distribution
function F(y) is symmetric with respect to the point 0, then we have

j f i x + y) dF{Sny) =f(x) (1 - 2(1 - F(C„))) + + R„, (71)


Cn
~Sn
V III, § 12] P R O O F O F L IM IT T H E O R E M S B Y O P E R A T O R M E T H O D 527

where
+c„
R n =
-c„
J y 3f" ' (x + Oy)dF(y), (72)

and 0 < 0 < 1.


If sup If"'(x ) I = B. then
+Cn
В Г „ BC„ В i / Ő(C„)
J рз )
-c„
On the other hand
V„if=f(x) + + О , (74)

thus (69)-(74) lead to

»II U n J - VnJ \ \ < 3A J ö ( C j + y 3 (C j + o |4 = | , (75)


3v/2 V»
hence because of <5(C„) -> 0 the validity of (68) follows. Herewith Theorem 3
is proved.
Finally we make some remarks concerning the relation between operator
and characteristic function methods.
The convergence of a sequence F„ of distribution functions to a distri­
bution function F is proved by the operator method by showing that for
every / £ C3 one has APnf -> AFf This implies that the characteristic
function <pn of the distribution function F„ tends to the characteristic
function <p of the distribution function F; in fact, if f(x) — eixl, then
/£ C 3 and A hnf = ei,x J ei,y dF„(y), hence A tJ = ei,x cpn{t) and,
— 00

similarly, A hf — e,,x(p(t).
Hence, from the fact that for every / £ C3
A fJ - AFf (76)
it follows that for every real t <pn(t) -* <p(t).
Therefore the operator method proves slightly more than the characteristic
function method. In effect, we prove for every / £ C3 the validity of (76)
and even that (76) is fulfilled uniformly in x. This makes the proof of the
relation F„(x) -* F(x) simpler, because while the implication of the relation
Fn(x) -> F(x) by the relation <pn(t) -> <p{t) is a comparatively deep theorem
(the so-called continuity theorem of characteristic functions, cf. Theorem 3
of Chapter VI, § 4) it is quite easy to see that (76) implies Fn(x) -> F{x) (for
528 T H E L IM IT T H E O R E M S O F P R O B A B IL IT Y T H E O R Y [V III, § 13

every X which is a continuity point of F(x)). On the other hand, the method
by which we proved (76) in each of the above discussed cases, can be applied
for distributions of sums of independent random variables only, while the
method of characteristic functions can be applied in other cases too (cf. e.g.
§ 5 or Exercise 26 of § 13).

§ 13. Exercises
1. Prove Theorem 2' of Chapter VI, § 5 by means of the central limit theorem
(Chapter VIII, § 1, Theorem 1).
Hint. If Fix) is a distribution function with expectation 0 and variance 1 such that

f [— ) * F (— I = F - ,
KJ lff2J yJo\ + ol
then F{x) is equal to the я-fold convolution of F(Xyjn). This converges to the normal
distribution as n —> + сю.
2. Let . be independent random variables and suppose

P (in = a„) = P (i„ = - a„) = (n = 1, 2 , .. .).

Under what conditions on the positive numbers an does Liapunov’s condition of the
central limit theorem hold for the random variables {„?
Hint. Put

s n = J i t a\, K „ = J i t a l and m„ = max ak.


V 1 \j k-^1 I ^k<.n

It follows that
H hL< LL< izM 1
s n - s n - [s„J ’
К ш
and Liapunov’s condition lim — = 0 is fulfilled, iff lim — = 0 .
Л-v + O, V Л-^+ОО C^П
3. a) Let ся be a random variable having a Poisson distribution with expectation A.
Show by the method of characteristic functions that the distribution function of the
— A
random variable----—— tends to the normal distribution function as A—> + °°

(cf. also Ch. Ill, § 18, Exercise 28).


b) Let £„ be a random variable having a gamma distribution of order n with
E (Q = — . Show that the distribution function of yj n _ ]j tends to the normal
distribution function with expectation 0 and standard deviation 1 .
c) Let f„ be a random variable having a beta distribution of order (яр, nq). Show
V III, § 13 ] E X E R C IS E S 529

that the distribution function of the random variable 4- q ) 2 ( i „ ------ ——


ypq I p + q)
tends to the normal distribution function with expectation 0 and standard deviation
1 as n -> °o.

4. Let e„(x) denote the и-th digit in the decimal expansion of jc (0 < x < 1 ); the
П
values of £■„(.*) are thus the numbers 0, 1, . . . , 9. Put S„(x) = ^ еА(дс). If E„(y) is the
1
2 s (X ) 9n
set of the numbers х for w hich---- ------ < >’,and if \E„(y)\denotes the Lebesgue

measure of E„(y), show that


V33«
У

lim I E„(y) I - *— ( e 2 du.

Hint. We choose a point rj at random in (0, 1); i.e. rj is a random variable uniformly
distributed in the interval (0, 1). The random variables = E„(rj) are then independent
and identically distributed; the central limit theorem can be applied. We have:

ЕЮ = Y ’ D(^ = ^ •

5. Let q₁, q₂, ..., q_n, ... be a sequence of integers ≥ 2. It is easy to show that every number x (0 < x < 1) (a denumerable set of numbers excepted) can be represented in one and only one way in the following form:

$$x = \sum_{n=1}^{\infty} \frac{\varepsilon_n(x)}{q_1 q_2 \cdots q_n},$$

where ε_n(x) may take on the values 0, 1, ..., q_n − 1. As in Exercise 4 put S_n(x) = Σ_{k=1}^n ε_k(x). Now if E_n(y) denotes the set of numbers x (0 < x < 1) such that

$$\frac{S_n(x) - \dfrac{1}{2}\sum\limits_{k=1}^{n}(q_k - 1)}{D_n} < y, \qquad \text{where } D_n = \sqrt{\sum_{k=1}^{n} \frac{q_k^2 - 1}{12}},$$

and |E_n(y)| is the Lebesgue measure of E_n(y), then we have

$$\lim_{n \to +\infty} |E_n(y)| = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-\frac{u^2}{2}}\, du,$$

provided that the condition

$$\lim_{n \to +\infty} \frac{\max\limits_{1 \le k \le n} q_k}{D_n} = 0$$

is fulfilled.

Hint. Choose at random a point η ∈ (0, 1) and put ξ_n = ε_n(η). It is easy to see that the random variables ξ_n are independent. Furthermore

$$E(\xi_n) = \frac{q_n - 1}{2}, \qquad D^2(\xi_n) = \frac{q_n^2 - 1}{12}, \qquad E\left(|\xi_n - E(\xi_n)|^3\right) \le C_1 q_n^3,$$

where C₁ is a positive constant. Liapunov's condition is thus satisfied.

6. Let ξ₁, ξ₂, ..., ξ_n, ... be independent random variables with the same normal distribution, with expectation m and standard deviation 1. Put

$$\bar{\zeta}_n = \frac{1}{n}\sum_{k=1}^{n} \xi_k, \qquad \sigma_n = \sqrt{\frac{\sum_{k=1}^{n}(\xi_k - \bar{\zeta}_n)^2}{n-1}}, \qquad \tau_n = \frac{\sqrt{n}\,(\bar{\zeta}_n - m)}{\sigma_n}.$$

Show that

$$\lim_{n \to +\infty} P(\tau_n < x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{u^2}{2}}\, du.$$

Hint. The distribution of τ_n is Student's distribution with n − 1 degrees of freedom (cf. Ch. IV, § 10). Its density function is

$$s_{n-1}(x) = \frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{(n-1)\pi}\;\Gamma\left(\frac{n-1}{2}\right)} \left(1 + \frac{x^2}{n-1}\right)^{-\frac{n}{2}},$$

and we find that

$$\lim_{n \to +\infty} s_{n-1}(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}.$$

Another proof can be obtained by noticing that the distribution function of √n (ζ̄_n − m) tends to the normal distribution function as n → +∞ and lim st σ_n = 1; the result follows then from the lemma of § 7.


7. Prove that any subsequence of a Markov chain is also a Markov chain.
8. (Ehrenfest's model of heat conduction.) Consider two urns and N balls labelled from 1 to N. The balls are arbitrarily distributed between the two urns; assume that the first contains M and the second N − M balls. We put into a box N cards labelled also from 1 to N. We draw a card from the box and put the ball bearing the same number from the urn in which it is contained into the other urn. After this the card is replaced into the box and the operation is repeated. Let ζ_n denote the number of the balls in the first urn after the n-th step (i.e. after drawing n cards) (n = 1, 2, ...; ζ₀ = M). The states of the system consisting of the two urns form a Markov chain. The transition probabilities are

$$p_{k,k+1} = 1 - \frac{k}{N} \quad (k = 0, 1, \ldots, N-1), \qquad p_{k,k-1} = \frac{k}{N} \quad (k = 1, 2, \ldots, N), \tag{1}$$

$$p_{kl} = 0 \quad \text{for } |k - l| \ne 1.$$

Show that

$$E(\zeta_n) = \frac{N}{2} + \left(M - \frac{N}{2}\right)\left(1 - \frac{2}{N}\right)^n.$$

(This example contains the statistical justification of Newton's law of cooling.)
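A short Monte Carlo sketch in Python can be used to check the formula for E(ζ_n); the values of N, M, n and the seed below are arbitrary choices.

```python
import random

# Monte Carlo check of the Ehrenfest mean formula
# E(zeta_n) = N/2 + (M - N/2) * (1 - 2/N)**n.
random.seed(2)
N, M, n, trials = 20, 18, 15, 50000

total = 0
for _ in range(trials):
    k = M                          # balls in the first urn
    for _ in range(n):
        if random.random() < k / N:
            k -= 1                 # the drawn ball was in the first urn
        else:
            k += 1
    total += k

print(total / trials)                          # empirical E(zeta_n)
print(N / 2 + (M - N / 2) * (1 - 2 / N) ** n)  # exact value
```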

9. Let Galton's desk be modified in the following manner (cf. Fig. 26): From the N-th row on, the number of pegs is alternatingly equal to the number in the (N − 1)-th row and in the N-th row. On the whole desk there are N + n rows of pegs. Determine the distribution of the balls in the containers when the number n is large.
10. The random variables ξ₀, ξ₁, ..., ξ_n, ... form a homogeneous Markov chain; all take on values in (0, 1); let the conditional distribution of ξ_{n+1} under the condition ξ_n = y be absolutely continuous for every value of y (0 < y < 1); let p(x, y) be the corresponding conditional density function. We assume that for 0 < x < 1 and 0 < y < 1 the function p(x, y) is always positive, that for every x (0 < x < 1)

$$\int_0^1 p(x, y)\, dy = 1$$

holds, and further that p(x, y) is continuous. Let p_n(x, y) be the conditional density function of ξ_n under the condition ξ₀ = y. Show that the relation

$$\lim_{n \to +\infty} p_n(x, y) = 1$$

is valid uniformly in x and y.



11. Let a moving point perform a random walk on a plane regular triangular lattice. If the moving point is at the moment t = n at an arbitrary lattice point, it may pass at the moment t = n + 1 with the same probability to any of the 6 neighbouring lattice points. Show that the moving point will return with probability 1 to its initial position, but that the expectation of the time passing until this return is infinite.
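The recurrence can be illustrated numerically. In the Python sketch below (axial coordinates for the triangular lattice; the cut-offs, trial counts and seed are arbitrary choices) the fraction of walks returning to the origin within a given number of steps creeps towards 1 only slowly, consistent with the return time having infinite expectation.

```python
import random

# Random walk on the triangular lattice in axial coordinates: every point
# has the six neighbours listed below.
random.seed(3)
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]

def returned(max_steps):
    x = y = 0
    for _ in range(max_steps):
        dx, dy = random.choice(STEPS)
        x, y = x + dx, y + dy
        if x == 0 and y == 0:
            return True
    return False

for m in (10, 100, 1000, 10000):
    hits = sum(returned(m) for _ in range(2000))
    print(m, hits / 2000)   # slowly increasing towards 1
```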

The following Exercises 12 through 18 all deal with homogeneous Markov chains with a finite number of states, fulfilling the conditions of Theorem 1, § 8. The notations are the same. The states are denoted by A₀, A₁, ..., A_N. The random variable ζ_n is equal to k if the system is in the state A_k at the time n (k = 0, 1, ..., N). We put P(ζ₀ = j) = P_j(0), p_{jk}^{(n)} = P(ζ_{n+m} = k | ζ_m = j), p_{jk}^{(1)} = p_{jk} and P_n(k) = P(ζ_n = k). We assume that min_{j,k} p_{jk} = d > 0. According to Theorem 1 of § 8 the limits lim_{n→∞} p_{jk}^{(n)} = P_k exist and are independent of j. Furthermore Σ_{k=0}^{N} P_k = 1.
12. Let

$$\eta_n^{(k)} = \begin{cases} 1 & \text{if the system is in state } A_k \text{ at the time } t = n, \\ 0 & \text{otherwise.} \end{cases}$$

We put ζ_n^{(k)} = Σ_{t=1}^{n} η_t^{(k)}. Show that

$$\lim \operatorname{st} \frac{\zeta_n^{(k)}}{n} = P_k \qquad (k = 0, 1, \ldots, N),$$

i.e. the system passes approximately a fraction P_k of the whole time in the state A_k.

Hint. We have

$$E(\eta_n^{(k)}) = P(\zeta_n = k) = P_n(k), \qquad D(\eta_n^{(k)}) = \sqrt{P_n(k)\left(1 - P_n(k)\right)}$$

and

$$E(\eta_n^{(k)} \eta_{n+r}^{(k)}) - P_n(k) P_{n+r}(k) = P_n(k)\left(p_{kk}^{(r)} - P_{n+r}(k)\right).$$

Furthermore, Formula (39) of § 8 implies |p_{kk}^{(r)} − P_k| ≤ (1 − d)^r. Hence

$$R(\eta_n^{(k)}, \eta_{n+r}^{(k)}) \le C (1-d)^r,$$

where R(η_n^{(k)}, η_{n+r}^{(k)}) is the correlation coefficient of η_n^{(k)} and η_{n+r}^{(k)} and C is positive. Thus the result follows from Theorem 3 of Chapter VII, § 3.
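A numerical illustration of Exercise 12, as a Python sketch: for the two-state matrix below (an arbitrary choice with all entries ≥ d > 0) the stationary probabilities are P₀ = 2/3, P₁ = 1/3, and the occupation fractions approach them.

```python
import random

# Fractions of time spent in each state versus the limits P_k.
random.seed(4)
p = [[0.7, 0.3],
     [0.6, 0.4]]          # stationary distribution: (2/3, 1/3)

n, state = 100000, 0
counts = [0, 0]
for _ in range(n):
    state = 0 if random.random() < p[state][0] else 1
    counts[state] += 1

print(counts[0] / n, counts[1] / n)   # ~ 0.667, 0.333
```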

13. Show that

$$\lim \operatorname{st} \frac{\zeta_1 + \zeta_2 + \ldots + \zeta_n}{n} = \sum_{k=1}^{N} k P_k.$$

Hint. ζ₁ + ζ₂ + ... + ζ_n can be expressed as a function of the variables η_t^{(k)}, namely

$$\zeta_1 + \zeta_2 + \ldots + \zeta_n = \sum_{k=1}^{N} k\, \zeta_n^{(k)}.$$

The result follows from that of the preceding exercise.

14. Assume that at t = 0 the system is in the state A_k. It returns to it for the first time after a certain number of steps. Let this random number be denoted by ν^{(k)}. Show that P(ν^{(k)} > n) ≤ (1 − d)^n.

Hint. We have P(ν^{(k)} > 1) = 1 − p_{kk}; hence the inequality is true for n = 1. Suppose for a proof by induction that the inequality is true for n. Then

$$P(\nu^{(k)} > n+1) = \sum_{j \ne k} P(\nu^{(k)} > n,\ \zeta_n = j,\ \zeta_{n+1} \ne k),$$

hence

$$P(\nu^{(k)} > n+1) = \sum_{j \ne k} P(\nu^{(k)} > n,\ \zeta_n = j) \sum_{l \ne k} p_{jl} \le (1-d)\, P(\nu^{(k)} > n) \le (1-d)^{n+1}.$$

15. Show that:

a) the expectation and the variance of the random variable ν^{(k)} defined in the previous exercise exist;

b) E(ν^{(k)}) = 1/P_k.

Hint. a) follows from Exercise 14. Let further V_k(z) denote the generating function of ν^{(k)}; we have

$$V_k(z) = 1 - \frac{1}{U_k(z)},$$

where

$$U_k(z) = 1 + \sum_{n=1}^{\infty} p_{kk}^{(n)} z^n.$$

The relations

$$U_k(z) = \frac{P_k\, z}{1 - z} + 1 + \sum_{n=1}^{\infty} \left(p_{kk}^{(n)} - P_k\right) z^n, \qquad |p_{kk}^{(n)} - P_k| \le (1-d)^n$$

lead to

$$\lim_{z \to 1} U_k(z)(1-z) = \lim_{z \to 1} U_k'(z)(1-z)^2 = P_k,$$

which implies b).

16. Let the numbers μ_r^{(k)} (r = 1, 2, ...) denote the values of n for which η_n^{(k)} = 1 (μ₁^{(k)} < μ₂^{(k)} < ⋯); η_n^{(k)} is defined here as in Exercise 12. Show that the standardized distribution of μ_r^{(k)} tends to the normal distribution as r → +∞.

Hint. The random variables ν_r^{(k)} = μ_r^{(k)} − μ_{r−1}^{(k)} are independent and identically distributed. According to the preceding exercise, the expectation and the variance of ν_r^{(k)} exist and are equal to those of ν^{(k)}. Hence the central limit theorem can be applied.

17. Show that the distribution of the random variables ζ_n^{(k)} introduced in Exercise 12 tends, after standardization, to the normal distribution as n → +∞. (Generalization of Theorem 2 in § 8.)

Hint. It is easy to see that P(ζ_n^{(k)} < r) = P(μ_r^{(k)} > n); for if the system passes less than r times through the state A_k during the first n steps, then it will return to it for the r-th time after the moment t = n, and conversely. Thus we are back to Exercise 16 and find

$$\lim_{n \to +\infty} P\left(\frac{\zeta_n^{(k)} - n P_k}{D_k\, P_k^{3/2} \sqrt{n}} < x\right) = \Phi(x),$$

where we have put D_k = D(ν^{(k)}).

18. Put

$$P_{jk}^{*(n)} = P(\zeta_n = k \mid \zeta_{n+1} = j) \qquad (n = 1, 2, \ldots;\ j, k = 0, 1, \ldots, N).$$

Show that the limits lim_{n→+∞} P_{jk}^{*(n)} = P_{jk}^* exist and form a stochastic matrix.

Hint. We have

$$P_{jk}^{*(n)} = \frac{P(\zeta_n = k)\, p_{kj}}{P(\zeta_{n+1} = j)}.$$

Hence

$$\lim_{n \to +\infty} P_{jk}^{*(n)} = \frac{P_k\, p_{kj}}{P_j} = P_{jk}^*$$

and

$$\sum_{k=0}^{N} P_{jk}^* = \frac{1}{P_j} \sum_{k=0}^{N} P_k\, p_{kj}.$$

But the P_k satisfy the system of equations

$$\sum_{k=0}^{N} P_k\, p_{kj} = P_j \qquad (j = 0, 1, \ldots, N),$$

hence we find that Σ_{k=0}^N P_{jk}^* = 1; the transition probabilities P_{jk}^* define thus again a Markov chain.

Remark. A Markov chain is said to be reversible if P_{jk}^* = p_{kj} for j, k = 0, 1, ..., N. It is necessary and sufficient for this that the matrix of transition probabilities is doubly stochastic.
In Exercises 19 through 23, ξ₁, ξ₂, ..., ξ_n are independent, identically distributed random variables with the common continuous distribution function F(x). ξ_k* denotes the k-th order statistic (k = 1, 2, ..., n).
19. Suppose F(0) = 0 and F'(0) = λ > 0. Show that

$$\lim_{n \to +\infty} P(n\, \xi_k^* < x) = \frac{1}{(k-1)!} \int_0^{\lambda x} t^{k-1} e^{-t}\, dt \qquad (k = 1, 2, \ldots).$$

Hint. We have

$$P(n\, \xi_k^* < x) = \sum_{j=k}^{n} \binom{n}{j} F\left(\frac{x}{n}\right)^j \left(1 - F\left(\frac{x}{n}\right)\right)^{n-j}$$

and, consequently, since nF(x/n) → λx, the right hand side tends to Σ_{j=k}^∞ e^{−λx} (λx)^j / j!, which is equal to the integral above.
20. Suppose F(x) = x (0 < x < 1). Show that n ξ_k* and n(1 − ξ*_{n+1−j}) are independent in the limit as n → +∞ and have gamma distributions of order k and j, respectively:

$$\lim_{n \to +\infty} P\left(n\, \xi_k^* < x,\ n\left(1 - \xi_{n+1-j}^*\right) < y\right) = \int_0^x \frac{u^{k-1} e^{-u}}{(k-1)!}\, du \int_0^y \frac{v^{j-1} e^{-v}}{(j-1)!}\, dv.$$

21. Suppose F(x) = Φ((x − m)/σ), where Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−u²/2} du. Show that the density function of

$$\frac{\xi_{n+1-k}^* - m - \sigma\left(\sqrt{2 \ln n} - \dfrac{\ln \ln n + \ln 4\pi}{2\sqrt{2 \ln n}}\right)}{\dfrac{\sigma}{\sqrt{2 \ln n}}}$$

tends to

$$\frac{1}{(k-1)!} \exp\left(-kx - e^{-x}\right)$$

as n → +∞.
22. Let f(x) = F'(x) be continuous and positive on the interval [a, b]; suppose further that a < F^{-1}(q₁) = Q₁ < Q₂ = F^{-1}(q₂) < b. Show that the two-dimensional distribution of √n (ξ*_{k₁(n)} − Q₁) and √n (ξ*_{k₂(n)} − Q₂) tends to a two-dimensional normal distribution as n → +∞, if |k₁(n) − q₁n| and |k₂(n) − q₂n| remain bounded.
23. Let F_n(x) be the empirical distribution function of the sample (ξ₁, ξ₂, ..., ξ_n). Show that

$$P_n(\varepsilon) = P\left(\sup_x \left(F(x) - F_n(x)\right) < \varepsilon\right) = 1 - \varepsilon \sum_{k=0}^{[n(1-\varepsilon)]} \binom{n}{k} \left(1 - \varepsilon - \frac{k}{n}\right)^{n-k} \left(\varepsilon + \frac{k}{n}\right)^{k-1}.$$

Hint. We may assume that the variables ξ_k are uniformly distributed in the interval (0, 1). If m = [n(1 − ε)], the inequality sup_x (F(x) − F_n(x)) < ε is equivalent to

$$\xi_j^* < \frac{j-1}{n} + \varepsilon \qquad \text{for } j = 1, 2, \ldots, m+1.$$

It is easy to prove that

$$P_n(\varepsilon) = n! \int \cdots \int_{T_\varepsilon} dx_1 \ldots dx_n,$$

where T_ε is the domain defined by the inequalities

$$0 < x_1 < x_2 < \ldots < x_n < 1, \qquad x_j < \frac{j-1}{n} + \varepsilon \quad \text{for } j = 1, 2, \ldots, m+1.$$

The final result can be obtained by induction.

Remark. We can derive from this result the theorem of Smirnov (§ 10, Theorem 1).
24. (Wilcoxon's test for the comparison of two samples.) Let ξ₁, ..., ξ_m and η₁, ..., η_n be independent, identically distributed random variables with the common continuous distribution function F(x). Let the numbers ξ_k and η_l be united into a single sequence, let them be arranged in increasing order and investigate the "places" occupied by ξ₁, ..., ξ_m. Let ν₁, ν₂, ..., ν_m denote the ranks of the elements ξ₁, ..., ξ_m in this sequence. Put

$$W = \nu_1 + \nu_2 + \ldots + \nu_m - \frac{m(m+1)}{2}.$$

a) Show that W is equal to the number of pairs (ξ_i, η_j) such that ξ_i > η_j.

b) Show that E(W) = mn/2.

c) Let G_{nm}(z) be the generating function of W:

$$G_{nm}(z) = \sum_k P(W = k)\, z^k.$$

Show that

$$G_{nm}(z) = \frac{C_{n+m}(z)}{C_n(z)\, C_m(z)},$$

where we have put

$$C_n(z) = \prod_{j=1}^{n} \frac{1 - z^j}{j(1 - z)}.$$

d) Show that

$$D^2(W) = \frac{mn(m+n+1)}{12}.$$

e) Derive from c) that the distribution of W* = (W − E(W))/D(W) tends to the normal distribution as n → +∞, m → +∞, if m/n tends to a constant. (Cf. Ch. II, § 12, Exercise 46 and Ch. III, § 18, Exercise 45.)
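As a small numerical illustration, the Python sketch below computes W both from the ranks and as the number of pairs with ξ_i > η_j (part a), and prints the moments of b) and d); the sample sizes and seed are arbitrary choices.

```python
import random

# Wilcoxon's W computed two ways, plus E(W) = mn/2 and D^2(W) = mn(m+n+1)/12.
random.seed(5)
m, n = 8, 11
xs = [random.random() for _ in range(m)]
ys = [random.random() for _ in range(n)]

pooled = sorted(xs + ys)
ranks = [pooled.index(x) + 1 for x in xs]        # ranks of the xi's
W_ranks = sum(ranks) - m * (m + 1) // 2
W_pairs = sum(x > y for x in xs for y in ys)     # pairs with xi > eta
print(W_ranks, W_pairs)                          # the two computations agree

print(m * n / 2, m * n * (m + n + 1) / 12)       # E(W) and D^2(W)
```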

25. Let ξ₁, ..., ξ_n and η₁, ..., η_m be independent random variables, with the same continuous distribution function F(x) for all ξ_k and G(x) for all η_j. Let L₁ denote the number of triplets (i, j, k) for which ξ_i > η_k, ξ_j > η_k and i < j, and L₂ the number of those triplets (i, j, k) for which η_i > ξ_k, η_j > ξ_k and i < j. We put

$$L = \frac{L_1}{\binom{n}{2} m} + \frac{L_2}{\binom{m}{2} n}.$$

Show that if F(x) = G(x), then E(L) = 2/3, and if F(x) ≠ G(x), then E(L) > 2/3.

26. Let there be performed N independent experiments. Let the possible outcomes of every experiment be the events A₁, ..., A_r. Let p_k = P(A_k) (k = 1, 2, ..., r); let ν_k denote the number of occurrences of the event A_k, where Σ_{k=1}^r ν_k = N. If

$$\chi_N^2 = \sum_{k=1}^{r} \frac{(\nu_k - N p_k)^2}{N p_k},$$

then the distribution of χ²_N tends as N → +∞ to the χ²-distribution with r − 1 degrees of freedom:

$$\lim_{N \to +\infty} P(\chi_N^2 < x) = \frac{1}{2^{\frac{r-1}{2}}\, \Gamma\left(\frac{r-1}{2}\right)} \int_0^x t^{\frac{r-3}{2}}\, e^{-\frac{t}{2}}\, dt.$$

Hint. The r-dimensional distribution of the random variables ν₁, ..., ν_r is a multinomial distribution; hence the characteristic function of the joint distribution of the variables (ν_k − Np_k)/√(Np_k) tends to

$$\exp\left(-\frac{1}{2}\left(\sum_{k=1}^{r} t_k^2 - \left(\sum_{k=1}^{r} t_k \sqrt{p_k}\right)^2\right)\right)$$

(Cf. Ch. VI, § 6, (23)).
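A Monte Carlo sketch in Python (cell probabilities, sizes and seed are arbitrary choices): with r = 4 cells the statistic should approach the χ²-distribution with 3 degrees of freedom, for which P(χ²₃ < 3.0) ≈ 0.608.

```python
import random

# Empirical distribution of chi^2_N for a multinomial with r = 4 cells.
random.seed(6)
p = [0.1, 0.2, 0.3, 0.4]
N, trials, x = 500, 5000, 3.0

below = 0
for _ in range(trials):
    draws = random.choices(range(4), weights=p, k=N)
    nu = [draws.count(j) for j in range(4)]
    chi2 = sum((nu[k] - N * p[k]) ** 2 / (N * p[k]) for k in range(4))
    below += chi2 < x

print(below / trials)   # ~ 0.608
```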
27. If ζ₁, ζ₂, ..., ζ_n, ... are random variables such that the k-th order moment of ζ_n tends as n → +∞ to the k-th order moment of the standard normal distribution, i.e. if

$$\lim_{n \to +\infty} E(\zeta_n^k) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} t^k e^{-\frac{t^2}{2}}\, dt = \begin{cases} 1 \cdot 3 \cdots (k-1) & \text{for } k \text{ even}, \\ 0 & \text{for } k \text{ odd}, \end{cases}$$

then for every real x

$$\lim_{n \to +\infty} P(\zeta_n < x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\, dt.$$

Hint. Apply inequality (21) of § 1 to u = t and k = 2l. We find

$$\left| E\left(e^{it\zeta_N}\right) - \sum_{j=0}^{2l-1} \frac{(it)^j E(\zeta_N^j)}{j!} \right| \le \frac{|t|^{2l}\, E(\zeta_N^{2l})}{(2l)!}.$$

If we let N tend to +∞, we obtain

$$\limsup_{N \to +\infty} \left| E\left(e^{it\zeta_N}\right) - \sum_{j=0}^{2l-1} \frac{(it)^j \mu_j}{j!} \right| \le \frac{t^{2l}}{2^l\, l!},$$

where μ_j denotes the j-th moment of the standard normal distribution; hence, by letting l tend to +∞,

$$\lim_{N \to +\infty} E\left(e^{it\zeta_N}\right) = e^{-\frac{t^2}{2}},$$

which is equivalent to the desired result.

28. Let ξ_{n1}, ..., ξ_{nn} be random variables which assume only the values 0 and 1 and let η_{nk} be the sum of all products of k distinct elements of the sequence ξ_{n1}, ..., ξ_{nn}:

$$\eta_{nk} = \sum_{1 \le i_1 < \ldots < i_k \le n} \xi_{n i_1} \xi_{n i_2} \cdots \xi_{n i_k}.$$

Show that if E(η_{nk}) tends to λ^k/k! (k = 1, 2, ...) as n → +∞, then the distribution of the sum

$$\sum_{i=1}^{n} \xi_{ni} = \eta_{n1}$$

tends to the Poisson distribution with parameter λ.

29. Let ξ₁, ξ₂, ... be independent random variables assuming the values +1 and −1 with probability 1/2. Put ζ_n = ξ₁ + ξ₂ + ... + ξ_n and denote by π_n the number of the positive terms in the sequence ζ₁, ζ₂, ..., ζ_n. (If ζ_k = 0, ζ_k is to be considered as positive if ζ_{k−1} = +1.) Show that

$$P(\pi_{2n} = 2k,\ \zeta_{2n} = 0) = \binom{2n}{n} \frac{1}{(n+1)\, 2^{2n}} \qquad (k = 0, 1, \ldots, n)$$

and

$$P(\pi_{2n} = 2k,\ \zeta_{2n} = -2j) = \frac{1}{2^{2n}} \sum_{l=k}^{n-j} \frac{j}{(n-l)(l+1)} \binom{2l}{l} \binom{2n-2l}{n-l-j}$$

for k = 0, 1, ..., n and j = 1, 2, ..., n.
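The first formula says that, conditionally on ζ_{2n} = 0, the value of π_{2n} is uniform over 0, 2, ..., 2n. A short Monte Carlo check in Python (n, trial count and seed are arbitrary choices):

```python
import random

# Conditionally on zeta_2n = 0, each of the n+1 values of pi_2n should
# occur with frequency about 1/(n+1) (~ 0.091 for n = 10).
random.seed(7)
n, trials = 10, 200000
hist = {}

for _ in range(trials):
    steps = [random.choice((-1, 1)) for _ in range(2 * n)]
    if sum(steps) != 0:
        continue                       # keep only walks with zeta_2n = 0
    z, prev, pos = 0, 0, 0
    for s in steps:
        prev, z = z, z + s
        # a zero term counts as positive when the previous term was +1
        if z > 0 or (z == 0 and prev == 1):
            pos += 1
    hist[pos] = hist.get(pos, 0) + 1

total = sum(hist.values())
print({k: round(v / total, 3) for k, v in sorted(hist.items())})
```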
30. By using the results of Exercise 29 show the following: If y_n is any sequence of integers such that y_n and n are of the same parity and lim_{n→+∞} y_n/√n = y (y is here any real number), then

$$\lim_{n \to +\infty} P\left(\frac{\pi_n}{n} < x \,\Big|\, \zeta_n = y_n\right) = \int_0^x f(t \mid y)\, dt$$

with

$$f(t \mid y) = \frac{2\, e^{\frac{y^2}{2}}}{\sqrt{2\pi}} \int_{\frac{y}{\sqrt{t}}}^{+\infty} \frac{e^{-\frac{u^2}{2}}}{\left(1 - \dfrac{y^2}{u^2}\right)^{3/2}}\, du$$

for 0 < t ≤ 1, y > 0, and f(t | y) = f(1 − t | −y) for y < 0.

Remark. For y = 0 the conditional limit distribution of π_n/n with respect to the condition ζ_n/√n → 0 is thus uniform on (0, 1). If we notice that ζ_n/√n is, in the limit, normally distributed, it follows that

$$\lim_{n \to +\infty} P\left(\frac{\pi_n}{n} < x \,\Big|\, \zeta_n > 0\right) = \int_0^x \left( \sqrt{\frac{2}{\pi}} \int_0^{+\infty} f(t \mid y)\, e^{-\frac{y^2}{2}}\, dy \right) dt;$$

from these results, from P(ζ_n > 0) = 1/2 and from f(t | y) = f(1 − t | −y) the arc sine law can be easily derived.
CHAPTER IX

APPENDIX

INTRODUCTION TO INFORMATION THEORY

§ 1. Hartley’s formula

Information theory deals with mathematical problems arising in connection with the storage, transformation, and transmission of information.
In our everyday life we receive continuously various types of information
(e.g. a telephone number); the informations received are stored (e.g. noted
into a note-book), transmitted (told to somebody), etc. In order to use
the informations it is often necessary to transform them in various fashions.
Thus for instance in telegraphy the letters of the text are replaced by special
signs; in television the continuous parts of the image are transformed into
successive signals transmitted by electromagnetic waves. In order to treat
such problems of communication mathematically, we need first of all
a quantitative measure of information.
It is not at all obvious that the amount of information contained in a
message can be defined and even measured. If we wish to introduce such
a measure, we must abstract from form and content of the message. We
have to work like the telegraph office, where only the number of words
is counted in order to calculate the price of the telegram.
It is reasonable to measure the amount of information contained in a
message by the number of signs necessary to express its content in the
most concise possible form. Any system of signs can be used; the infor­
mations to be measured must be transformed into the system chosen.
Thus, for instance, letters can be replaced by digits, the binary number
system can be taken instead of the decimal, and so on. If we add to the
26 letters of the English alphabet the full stop, the comma, the semicolon,
the question-mark, the note of exclamation and the space between the
words, the 32 signs so obtained can be assigned to the numbers expressible
by means of 5 digits in the binary system. (The numbers expressible by
1, 2, 3, or 4 digits are to be completed by zeros to five digits; thus 0 = 00000,
1 = 00001.) In this manner, every telegram can be expressed as a sequence
of zeros and ones; the number of necessary signs is five times the number
of the letters of the text. Every message, every information may thus be
encoded into a sequence of zeros and ones.
It seems reasonable to measure the amount of information of a message

by the number of signs necessary to express it by zeros and ones. On this basis, messages of different forms and contents become comparable as to the amount of information contained in them.
Since a digit can assume one of the values 0 and 1, the information
specifying which of these two possibilities occurred can be taken as the
unit of information. Thus the answer to a question which can only be
answered by “yes” or “no” contains just one unit of information, the
meaning of the particular question being irrelevant. The unit of information
is called “ bit”, which is an abbreviation for “binary digit”.
When one receives some information it happens often that only a part
of it is really new. Thus for instance, if the telephone numbers in a certain
city have all 6 digits, we can be sure in advance that every inhabitant
will have a telephone number which is a number having 6 digits. Every
information may thus be considered as a distinctive sign of an element
of a set. If we know in advance that some object belongs to a certain set
E, to give full information on the thing means to specify which of the
elements of the set E is the one in question. The amount of information
received depends, evidently, on the number of the elements of E. If E contains exactly N = 2^n elements, these can be labelled by binary numbers having n digits; any element will be uniquely characterized by a sequence of length n consisting of zeros and ones, hence by n units of information. From N = 2^n follows n = log₂ N; this gave Hartley the idea to define by log₂ N the information necessary for the characterization of an element of a set having N elements, even if N is not a power of 2.
At the first glance it would seem that if 2^n < N < 2^{n+1}, then log₂ N units of information do not suffice for the characterization of the elements of E, as somewhat more is necessary for this purpose, namely n + 1 units of information. This, however, is not the case. If we consider a sequence of symbols the terms of which are elements of E and if we replace each term of this sequence by a sequence of zeros and ones, we need really n + 1 binary digits. However, if we take from the elements of E a sequence of k elements (some of which may be equal), there are N^k such sequences and in order to characterize any one of these we need n_k zero or one signs, where

$$2^{n_k - 1} < N^k \le 2^{n_k}.$$

In order to transcribe a symbol of our "alphabet" (an element of E) we need therefore on the average n_k/k binary digits, where

$$k \log_2 N \le n_k < k \log_2 N + 1.$$

It follows

$$\lim_{k \to \infty} \frac{n_k}{k} = \log_2 N.$$

Thus for every ε > 0 we can find a number k such that if we take the elements of E by ordered groups of k, then the identification of one element requires on the average less than log₂ N + ε binary digits.
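As a small numerical illustration in Python (alphabet size 26 and the group sizes are arbitrary choices): n_k = ⌈k log₂ N⌉ is the smallest number of binary digits for groups of k symbols, and n_k/k approaches log₂ 26 ≈ 4.70.

```python
from math import ceil, log2

# Block coding of a 26-letter alphabet: bits per symbol for groups of k.
N = 26
for k in (1, 2, 5, 10, 50):
    n_k = ceil(k * log2(N))      # smallest n with 2**n >= N**k
    print(k, n_k, n_k / k)       # 5.0, 5.0, 4.8, 4.8, 4.72, ...
```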
The formula

$$I(E_N) = \log_2 N, \tag{1}$$

in which I(E_N) represents the information necessary to characterize the elements of a set E_N of N elements, is called Hartley's formula.
Formula (1) is a mathematical definition of the amount of information and thus needs no proof at all. Nevertheless, in order to show that this definition is not arbitrary, we postulate some properties which the function I(E_N) should reasonably possess and show that the postulates in question are fulfilled only by the function log₂ N. These postulates are:

A. I(E_{NM}) = I(E_N) + I(E_M) for N, M = 1, 2, ...;

B. I(E_N) ≤ I(E_{N+1});

C. I(E₂) = 1.

Postulate C is the definition of the unit; it is not more and not less arbitrary than the choice of the unit of some physical quantity. The meaning of Postulate B is evident: the larger a set, the more information is gained by the characterization of its elements. Postulate A may be justified as follows.

A set E_{NM} of NM elements may be decomposed into N subsets each of M elements; let these be denoted by E_M^{(1)}, ..., E_M^{(N)}. In order to characterize an element of E_{NM} we can proceed in two steps. First we specify that subset to which the element in question belongs. Let this subset be denoted by E_M^{(j)}. We need for this specification an information I(E_N), since there are N subsets. Next we identify the element in E_M^{(j)}. The amount of information needed for this purpose is equal to I(E_M) since the subset E_M^{(j)} contains M elements. Now these two informations completely characterize an element of E_{NM}; Postulate A expresses thus that the information is an additive quantity.

Theorem 1. The Postulates A, B, C are fulfilled only by the function

$$I(E_N) = \log_2 N.$$

Proof. Let P be an integer larger than 2. Define for every integer r the integer s(r) by

$$2^{s(r)} \le P^r < 2^{s(r)+1}. \tag{2}$$

Taking the logarithms of base 2 on both sides, we get

$$\frac{s(r)}{r} \le \log_2 P < \frac{s(r)}{r} + \frac{1}{r}. \tag{3}$$

Hence

$$\lim_{r \to \infty} \frac{s(r)}{r} = \log_2 P. \tag{4}$$

Put f(n) = I(E_n). It follows from B that for n ≤ m

$$f(n) \le f(m). \tag{5}$$

(2) and (5) lead to

$$f\left(2^{s(r)}\right) \le f(P^r) \le f\left(2^{s(r)+1}\right). \tag{6}$$

According to A we can write

$$f(a^k) = k f(a) \tag{7}$$

and, by C, f(2) = 1; hence it follows from (6) that

$$s(r) \le r f(P) \le s(r) + 1, \tag{8}$$

thus

$$\lim_{r \to \infty} \frac{s(r)}{r} = f(P). \tag{9}$$

From (4) and (9) we conclude that f(P) = log₂ P for P > 2. Since f(2) = 1 and f(1) = 0, the theorem is herewith proved.
Postulate B can be replaced by the following one:

B*. lim_{N→∞} (I(E_{N+1}) − I(E_N)) = 0;

and A can be replaced by a weaker postulate, too:

A*. If N and M are relatively prime numbers, then

$$I(E_{NM}) = I(E_N) + I(E_M).$$

P. Erdős¹ proved the following

Theorem 2. I(E_N) = log₂ N is the only function which satisfies the postulates A*, B*, and C.

Proof. Let P > 1 be any power of a prime number and f(n) = I(E_n) a function satisfying A*, B*, C. Put

$$g(n) = f(n) - \frac{f(P)}{\log_2 P} \log_2 n. \tag{10}$$

Clearly, g(n) fulfills A*. Furthermore we have

$$g(n+1) - g(n) = f(n+1) - f(n) + \frac{f(P)}{\log_2 P} \log_2 \frac{n}{n+1}.$$

If we put

$$\varepsilon_n = g(n+1) - g(n), \tag{11}$$

then B* implies

$$\lim_{n \to \infty} \varepsilon_n = 0. \tag{12}$$

Hence g(n) fulfills B*. Now it is easy to see that

$$g(P) = 0. \tag{13}$$

Define for every integer n an integer n' by

$$n' = \begin{cases} \left[\dfrac{n}{P}\right] & \text{for } \left(\left[\dfrac{n}{P}\right], P\right) = 1, \\[2ex] \left[\dfrac{n}{P}\right] - 1 & \text{for } \left(\left[\dfrac{n}{P}\right], P\right) > 1, \end{cases} \tag{14}$$

where (a, b) denotes the greatest common divisor of the integers a and b.

¹ Cf. P. Erdős [2] and the article of D. K. Fadeev, The notion of entropy in a finite probabilistic pattern (Arbeiten zur Informationstheorie, Vol. I). Fadeev found this theorem independently from Erdős. The proof given here (cf. A. Rényi [29], [30], [37]) is considerably simpler than that of the above two authors.

Clearly

$$n' \le \frac{n}{P} \tag{15}$$

and

$$n = P n' + l,$$

where (n', P) = 1 and 0 ≤ l < 2P. According to (13), g(Pn') = g(n'), hence we can write

$$g(n) = g(n') + g(n) - g(Pn') = g(n') + \sum_{k=Pn'}^{n-1} \varepsilon_k, \tag{16a}$$

where ε_k is defined by (11).

Repeat the decomposition (16a) with n' instead of n, then with n'' instead of n', etc. If we put

$$n^{(j+1)} = \left(n^{(j)}\right)' \qquad (j = 0, 1, \ldots),$$

we obtain at the k-th step

$$g(n) = g\left(n^{(k)}\right) + \sum_{j=1}^{k} \sum_{h=Pn^{(j)}}^{n^{(j-1)}-1} \varepsilon_h. \tag{16b}$$

But by (15)

$$n^{(k)} \le \frac{n}{P^k},$$

hence we obtain n^{(k)} = 0 after at most log₂ n / log₂ P + 1 steps; hence for every n

$$g(n) = \sum_{i=1}^{b_n} \varepsilon_{h_i}, \tag{17}$$

where h₁ < h₂ < ... < h_{b_n} and

$$b_n \le 2P \left( \frac{\log_2 n}{\log_2 P} + 1 \right).$$

Thus, according to (12),

$$\lim_{n \to \infty} \frac{g(n)}{\log_2 n} = 0, \tag{18}$$

and by (10)

$$\lim_{n \to \infty} \frac{f(n)}{\log_2 n} = \frac{f(P)}{\log_2 P}. \tag{19}$$

Let c denote the limit of the left hand side of (19). We conclude that for every P > 1 which is a power of a prime number

$$f(P) = c \log_2 P. \tag{20}$$

If the integer n > 1 has a decomposition n = P₁P₂⋯P_r, where the P_i are powers of distinct primes, then we conclude from the additivity of f(n) that

$$f(n) = \sum_{i=1}^{r} f(P_i) = c \sum_{i=1}^{r} \log_2 P_i = c \log_2 n. \tag{21}$$

Because of Postulate C, the value of c must be equal to 1. Furthermore, according to A*, f(1) = 0. Theorem 2 is herewith proved.
This theorem will be used in the following section.

§ 2. Shannon’s formula

Let E₁, E₂, ..., E_n be pairwise disjoint finite sets and put

$$E = E_1 + E_2 + \ldots + E_n.$$

Let N_k be the number of elements of the set E_k; E has therefore N = Σ_{k=1}^n N_k elements. We put p_k = N_k/N (k = 1, 2, ..., n). If we know about an element of E that it belongs to a set E_k, for the complete determination of this element we need some further information, the amount of which is equal to log₂ N_k. Thus in order to characterize an element which is known to belong to a subset E_k of E we need on the average the amount of information

$$I_2 = \sum_{k=1}^{n} \frac{N_k}{N} \log_2 N_k = \sum_{k=1}^{n} p_k \log_2 N p_k. \tag{1}$$

The information necessary for the complete characterization of an element of E can therefore be decomposed into two parts. The first part, I₁, determines the set E_k containing the element in question; the second part, I₂, given by Formula (1), identifies the element in E_k. If the information is additive also in this sense, then the relation

$$\log_2 N = I_1 + I_2 = I_1 + \sum_{k=1}^{n} p_k \log_2 N p_k \tag{2}$$

must hold. Since Σ_{k=1}^n p_k = 1, it follows from (2) that in order to know to which one of the subsets E_k an element of E belongs, we need an amount of information equal to

$$I_1 = \sum_{k=1}^{n} p_k \log_2 \frac{1}{p_k}. \tag{3}$$

Formula (3) was first established by Shannon and in what follows we shall call it Shannon's formula. Simultaneously with and independently of Shannon the same formula was also found by N. Wiener.

In particular, if p₁ = p₂ = ... = p_n = 1/n, Shannon's formula reduces to Hartley's formula (cf. Formula (1) of § 1). Analysing the above heuristic considerations it is clear that we implicitly used three assumptions, namely:

1. The selection of the considered element from the set E depends on chance; actually, we are dealing with the observed value of a random variable.

2. All elements of E are equiprobable; the probability that an element of E belongs to E_k is therefore p_k = N_k/N.

3. The amounts of information associated with the different possibilities must be "weighted" by the corresponding probabilities; essentially, we consider thus the expectation of the information.

Thus, instead of restricting ourselves to the particular case of the random selection of an element from a set, we are led to the more general question: how much information is yielded by the outcome of a random experiment? We shall see that Shannon's Formula (3) remains valid in this more general case too (hence not only in the case of rational values of p_k).
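Formula (3) is directly computable; the following Python sketch evaluates it for a few distributions (arbitrary choices), with the usual convention 0 log₂(1/0) = 0:

```python
from math import log2

def shannon_information(p):
    """Shannon's formula (3): sum of p_k * log2(1/p_k), with 0 log 0 = 0."""
    return sum(pk * log2(1 / pk) for pk in p if pk > 0)

print(shannon_information([0.5, 0.5]))        # 1.0 bit
print(shannon_information([0.25] * 4))        # 2.0 bits = log2(4), Hartley's case
print(shannon_information([0.7, 0.2, 0.1]))   # ~ 1.157 bits
```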
The general problem can be put in the following form: Let A₁, A₂, ..., A_n be the possible outcomes of a random experiment; put p_k = P(A_k) (k = 1, 2, ..., n). We wish to know how much information is furnished by a single performance of the experiment. It seems reasonable to start from the following postulates:

I. The information obtained depends only on the probability distribution 𝒫 = (p₁, p₂, ..., p_n); consequently, it will be denoted by I(𝒫) or I(p₁, p₂, ..., p_n). We suppose further that I(p₁, p₂, ..., p_n) is a symmetric function of its variables p₁, p₂, ..., p_n.

II. I(p, 1 − p) is a continuous function of p (0 ≤ p ≤ 1).

III. I(1/2, 1/2) = 1.

Furthermore, we require:

IV. The following relation holds:

$$I(p_1, p_2, \ldots, p_n) = I(p_1 + p_2, p_3, \ldots, p_n) + (p_1 + p_2)\, I\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right). \tag{4}$$

Condition (4) can be worded as follows: Suppose that an outcome A of an experiment with probability α = P(A) can occur in two ways, A' and A'', which mutually exclude each other. Suppose that the probability of A' is α' and that of A'' is α'' (α' + α'' = α). Then if we are told in which of the two forms A actually occurred, the amount of information thus obtained is equal to the information associated with the distribution (α'/α, α''/α) taken with the weight α, i.e. to α I(α'/α, α''/α). Postulate IV requires at the same time the additivity of information as well as the "weighting" of the different informations by the corresponding probabilities.

We shall show that Postulates I-IV are fulfilled only by the function defined by Shannon's Formula (3). The above set of Postulates I-IV is due to D. K. Fadeev; it is a simplified form of a system of postulates given by A. I. Khinchin. In § 6 we shall characterize Shannon's information by different postulates which lead also to alternative measures of information.
mation.
In the present section we prove

Theorem 1. If to every discrete, finite probability distribution 𝒫 there corresponds a number I(𝒫) = I(p₁, p₂, ..., p_n) so that the above Postulates I-IV are satisfied, then¹

$$I(p_1, p_2, \ldots, p_n) = \sum_{k=1}^{n} p_k \log_2 \frac{1}{p_k}. \tag{5}$$

Proof. The proof consists of six steps.

a) We show first that

$$I(1) = 0, \tag{6}$$

¹ In view of lim_{x→0} x log₂ (1/x) = 0 we put 0 log₂ (1/0) = 0.

i.e. that the occurrence of the sure event does not give us any information.
In fact, if n = 2, p1 = 1, p2 = 0, it follows from IV that

7(1,0) = 7(1)+ 7(1,0);

thus (6) holds. Similarly, we have


7(Pi, Рь . . .,p„,0) = 1( ръ p2, .. ., pn).
b) If we put
m
sm = z Pj,
j =i

we can deduce from (4) by induction on m the somewhat more general


relation
7(Pi> • • •>Pm, Рт +Ъ • ■',Pm+ii) ~

— K sm, Pm+l, • ■-iPm + r) + sm I ~ , • • ■> • (4')


, sm I

According to (4) the formula holds for m = 2. Suppose that it is already


proved up to m — 1. Then

7(Pl + Ръ, Рз, • ■ Pm+n) =

= K s m, Рт + Ъ ■■ ; P m + n) + Sm l ~~ ^ , ••; j, (7a)
Sm Sm )

furthermore, because of (4),

К Р ъ Р г , ■ ■ ; P m + n) = 7 ( P i + P i , P 3 , ■ ■ ■, P m + n ) +

+ (Pi + P2) 7 Í— g - , (7b)


{P 1+P 2 P1+P2)
and
r IP i + P 2 Pm , , , w Pi P2,
sm 7 --------- , • ■; ----- + (Pi + Pi) 7 ---- ;---- , ---- ;---- =
Sm Sm) P 1 + P 2 P1+P2J

r f Pi Pm . /п ч
— I ~ 9 • • •» ~ J (^c)
[ sm Sm ,

(4') follows immediately from (7a), (7b) and (7c).



c) We prove now a still more general relation:

$$I(p_{11}, \ldots, p_{1m_1}, \ldots, p_{n1}, \ldots, p_{nm_n}) = I(s_1, \ldots, s_n) + \sum_{j=1}^{n} s_j\, I\left(\frac{p_{j1}}{s_j}, \ldots, \frac{p_{jm_j}}{s_j}\right), \tag{4''}$$

where we have put

$$s_j = \sum_{l=1}^{m_j} p_{jl} \qquad (j = 1, 2, \ldots, n). \tag{8}$$

By assumption,

$$\sum_{j=1}^{n} s_j = \sum_{j=1}^{n} \sum_{l=1}^{m_j} p_{jl} = 1.$$

Formula (4'') may be considered as a theorem about the information associated with a mixture of distributions. In effect, if 𝒫_j denotes the distribution (p_{j1}/s_j, ..., p_{jm_j}/s_j), the left hand side of (4'') is the information associated with the mixture of the distributions 𝒫_j with weights s_j. According to (4'') this information is equal to the sum of the average of the informations I(𝒫_j) with weights s_j and the information associated with the mixing distribution S = (s₁, ..., s_n):

$$I\left(\sum_{j=1}^{n} s_j \mathcal{P}_j\right) = \sum_{j=1}^{n} s_j\, I(\mathcal{P}_j) + I(S). \tag{4'''}$$

(4'') can be obtained immediately by a repeated application of (4'), taking into account the assumption that I(p₁, ..., p_n) is a symmetric function of p₁, ..., p_n.

d) Let ℰ_n be the distribution (1/n, 1/n, ..., 1/n) and put f(n) = I(ℰ_n). From (4'') we deduce the functional equation

$$f(nm) = f(n) + f(m). \tag{9}$$

In fact, if in (4'') all m_j are equal to m and all p_{jl} are equal to 1/mn, the left hand side is equal to f(nm) and the right hand side to f(n) + f(m), hence we get (9).

e) If we apply (4') to the case when all probabilities are equal and if we unite them all except the first one, we obtain

$$f(n) = I\left(\frac{1}{n}, 1 - \frac{1}{n}\right) + \left(1 - \frac{1}{n}\right) f(n-1). \tag{10}$$
{n n I n I
Now we show that

$$\lim_{n \to \infty} \left[ f(n) - f(n-1) \right] = 0. \tag{11}$$

Put

$$f(n) - f(n-1) = d_n \quad \text{and} \quad I\left(\frac{1}{n}, 1 - \frac{1}{n}\right) = \delta_n \qquad (n = 2, 3, \ldots).$$

It follows from our assumptions that

$$\lim_{n \to \infty} \delta_n = 0. \tag{12}$$

Indeed the assumed continuity of I(p, 1 − p) implies

$$\lim_{n \to \infty} \delta_n = I(0, 1) = I(1),$$

and according to (6) I(1) = 0. On the other hand

$$f(n-1) = d_2 + d_3 + \ldots + d_{n-1};$$

(10) is therefore equivalent to

$$\delta_n = d_n + \frac{d_2 + d_3 + \ldots + d_{n-1}}{n}. \tag{13}$$

Multiplying both sides by n and adding the equalities obtained for n = 2, 3, ..., N, we get

$$\sum_{n=2}^{N} \left( n d_n + d_2 + \ldots + d_{n-1} \right) = \sum_{n=2}^{N} n \delta_n, \tag{14}$$

and by a simple transformation

$$\frac{f(N)}{N} = \frac{\sum\limits_{n=2}^{N} n\, \delta_n}{N^2}. \tag{15}$$

Because of (12) the right hand side of (15) tends to zero for N → ∞. Hence we have

$$\lim_{N \to +\infty} \frac{f(N)}{N} = \lim_{N \to +\infty} \frac{1}{N} \sum_{k=2}^{N} d_k = 0. \tag{16}$$

From (12) and (16) it follows because of (13) that

$$\lim_{N \to \infty} d_N = 0, \tag{17}$$

hence we obtain (11).¹

We have seen that f(n) fulfills conditions A*, B*, and C of the preceding section; hence by Theorem 2 of § 1

$$f(n) = I(\mathcal{E}_n) = \log_2 n. \tag{18}$$
f) We can now finish rapidly the proof of our theorem. Consider the function I(p, 1 − p). Let first p be rational, p = a/b with integers a and b (a < b). If we apply (4'') with

$$n = 2, \quad m_1 = a, \quad m_2 = b - a, \quad p_{11} = p_{12} = \ldots = p_{2m_2} = \frac{1}{b},$$

we find

$$\log_2 b = I\left(\frac{a}{b}, 1 - \frac{a}{b}\right) + \frac{a}{b} \log_2 a + \left(1 - \frac{a}{b}\right) \log_2 (b - a). \tag{19}$$

Since by assumption I(p, 1 − p) is continuous, we have for any p between 0 and 1

$$I(p, 1 - p) = p \log_2 \frac{1}{p} + (1 - p) \log_2 \frac{1}{1 - p}, \tag{20}$$

hence (5) is proved for n = 2. We show now by induction that (5) holds in the general case too. Suppose that (5) is valid for a certain integer n; let 𝒫 = (p₁, ..., p_{n+1}) be any distribution having n + 1 terms. We conclude from (4) and (20) that

$$I(p_1, \ldots, p_{n+1}) = \sum_{k=1}^{n-1} p_k \log_2 \frac{1}{p_k} + (p_n + p_{n+1}) \log_2 \frac{1}{p_n + p_{n+1}} + (p_n + p_{n+1})\, I\left(\frac{p_n}{p_n + p_{n+1}}, \frac{p_{n+1}}{p_n + p_{n+1}}\right),$$

¹ We use here a well-known theorem of the theory of divergent series (Mercer's theorem) which says: If s_n is a sequence fulfilling

$$\lim_{n \to \infty} \left( \alpha s_n + (1 - \alpha)\, \frac{s_1 + s_2 + \ldots + s_n}{n} \right) = s$$

(0 < α ≤ 1), then we have also lim_{n→∞} s_n = s. We need only the particular case α = 1/2 (cf. G. H. Hardy [1], Ch. V).


hence because of (20),

$$I(p_1, \ldots, p_{n+1}) = \sum_{k=1}^{n+1} p_k \log_2 \frac{1}{p_k}, \tag{21}$$

and thus the theorem is proved for every integer n.


Remark. It is easy to see that Postulate IV implies the additivity of information. Suppose that the experiments 𝒜 and ℬ are independent of each other; let A_j (j = 1, 2, ..., m) be the possible outcomes of 𝒜, and B_k (k = 1, 2, ..., n) those of ℬ. Let p_j = P(A_j), q_k = P(B_k) denote the corresponding probabilities and put 𝒫 = (p₁, ..., p_m), 𝒬 = (q₁, ..., q_n). To perform simultaneously 𝒜 and ℬ means the same as to perform an experiment 𝒜ℬ having the events A_jB_k as possible outcomes with the corresponding probabilities p_jq_k. The distribution {p_jq_k} = 𝒫 ∗ 𝒬 is called the direct product of the distributions 𝒫 and 𝒬. If we apply (4'') to p₁q₁, ..., p₁q_n, p₂q₁, ..., p_mq_n, we find that

$$I(\mathcal{P} * \mathcal{Q}) = I(\mathcal{P}) + I(\mathcal{Q}). \tag{22}$$

However, (4) does not follow from (22). This is most easily demonstrated by the quantity

$$I_2(p_1, \ldots, p_n) = -\log_2 \left( p_1^2 + \ldots + p_n^2 \right), \tag{23}$$

which fulfills Postulates I-III and Formula (22), without fulfilling (4). (If it fulfilled (4), it would be equal, by the just-proved theorem, to Σ_{k=1}^n p_k log₂(1/p_k), which is not the case.) We shall see in § 6 that the quantity (23) too can be considered as a measure of the information associated with the distribution 𝒫 = (p₁, ..., p_n). In fact, we shall define a class of information measures depending on a parameter α which contains both Shannon's information (for α = 1) and the quantity (23) (for α = 2).
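Both claims about the quantity (23) are easy to check numerically. The Python sketch below (the distributions are arbitrary choices) verifies that I₂ is additive for direct products, yet violates equation (4):

```python
from math import log2

def I2(p):
    # The quantity (23): -log2(p_1^2 + ... + p_n^2)
    return -log2(sum(x * x for x in p))

P = [0.5, 0.3, 0.2]
Q = [0.6, 0.4]
prod = [p * q for p in P for q in Q]

# (22) holds exactly, since sum((pq)^2) = sum(p^2) * sum(q^2):
print(I2(prod), I2(P) + I2(Q))          # equal

# ... but Postulate IV, equation (4), fails:
lhs = I2([0.5, 0.3, 0.2])
rhs = I2([0.8, 0.2]) + 0.8 * I2([0.5 / 0.8, 0.3 / 0.8])
print(lhs, rhs)                          # ~ 1.396 versus ~ 1.287
```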
We add some further remarks.

1. In connection with the notion of information we also have to mention


the concept of uncertainty. If we receive some information, the previously
existing uncertainty will be diminished. The meaning of information is
precisely this diminishing of uncertainty.
The uncertainty with respect to an outcome of an experiment may be
considered as numerically equal to the information furnished by the occur­
rence of this outcome; thus uncertainty can also be measured. We could have
started equally well from the notion of uncertainty; to speak about information or about uncertainty means essentially the same thing: in the first
case we consider an experiment which has been performed, in the second
case an experiment not yet performed. The two terminologies will be used
alternatively in order to obtain the simplest possible formulation of our re­
sults.
2. The quantity (5) is frequently called the entropy of the distribution
•Sfi — (/>!,. . ., p„). Indeed, there is a strong connection between the notion of
entropy in thermodynamics and the notion of information (or uncertainty).
L. Boltzmann was the first to emphasize the probabilistic meaning of the
thermodynamical entropy and thus he may be considered as a pioneer of
information theory. It would even be proper to call Formula (5) the Boltz­
mann-Shannon formula. Boltzmann proved that the entropy of a physical
system can be considered as a measure of the disorder in the system. In case
of a physical system having many degrees of freedom (e.g. a perfect gas)
the number measuring the disorder of the system measures also the uncer­
tainty concerning the states of the individual particles.
3. In order to avoid possible misunderstandings it should be emphasized
that when we speak about information, what we have in mind is not the
subjective "information" possessed by a particular observer. The terminology is really somewhat misleading as it seems to suggest that the information depends somehow on the observer. In reality the information contained
in an observation is a quantity independent of the fact whether it does or
does not reach the perception of an observer (be it a man or some registering
device or a computer). The notion of uncertainty should also be interpreted
in an objective sense; what we have in mind is not the subjective “uncertain­
ty” existing in the mind of the observer concerning the outcomes of an experi­
ment ; it is an uncertainty due to the fact that really several possibilities are
to be taken into account. The measure of uncertainty does not depend on
anything else than these possible events and in this sense it is entirely objec­
tive. The above mentioned relation between information and thermodynam­
ical entropy is noteworthy in this respect too.

§ 3. Conditional and relative information

We associated with every discrete finite probability distribution 𝒫 = (p₁, ..., p_n) the information I(𝒫). If ξ is a random variable assuming the distinct values x₁, x₂, ..., x_n with probabilities p₁, p₂, ..., p_n, we may say that I(𝒫) is the information contained in the value of ξ and we may write I(ξ) instead of I(𝒫). It must, however, be remembered that I(ξ) does not depend on the values x₁, x₂, ..., x_n of ξ: I(ξ) remains invariant when we replace x₁, x₂, ..., x_n by any other system of mutually different numbers x₁', x₂', ..., x_n'. The observation of the random variable assuming the values x₁', x₂', ..., x_n' with probabilities p₁, p₂, ..., p_n contains the same amount of information as the observation of ξ. Consequently, if h(x) is a function such that h(x) ≠ h(x') for x ≠ x', we have I(h(ξ)) = I(ξ). However, without the condition h(x) ≠ h(x') for x ≠ x' we can state only that I(h(ξ)) ≤ I(ξ). This follows from the evident inequality

$$(p + q) \log_2 \frac{1}{p + q} \le p \log_2 \frac{1}{p} + q \log_2 \frac{1}{q} \tag{1}$$

for p > 0, q > 0, p + q ≤ 1.


We shall often need Jensen's inequality: If g(x) is a convex function on an interval (a, b), if x₁, x₂, ..., x_n are arbitrary real numbers with a < x_k < b, and if w₁, w₂, ..., w_n are positive numbers with Σ_{k=1}^n w_k = 1, then we have

$$g\left(\sum_{k=1}^{n} w_k x_k\right) \le \sum_{k=1}^{n} w_k\, g(x_k). \tag{2}$$

Inequality (2) can readily be proved by a geometrical reasoning. Consider in the plane (x, y) the points (x_k, g(x_k)), k = 1, 2, ..., n. Suppose that masses w_k are situated in these points; the center of gravity of the so formed system will evidently lie in the smallest convex polygon containing the mentioned points. Since all points lie on the convex curve y = g(x), the center of gravity lies above this curve. Let x̄ and ȳ denote its coordinates, then g(x̄) ≤ ȳ. As clearly x̄ = Σ_{k=1}^n w_k x_k and ȳ = Σ_{k=1}^n w_k g(x_k), we get (2). It can be seen immediately that if g(x) is not linear on any subinterval, then in (2) the equality sign can occur only if x₁ = x₂ = ... = x_n.

If g(x) is concave, we have instead of (2) the inequality

$$g\left(\sum_{k=1}^{n} w_k x_k\right) \ge \sum_{k=1}^{n} w_k\, g(x_k), \tag{2'}$$

since now −g(x) is convex.


From Jensen's inequality (2) we obtain

$$I(p_1, p_2, \ldots, p_n) \le \log_2 n. \tag{3}$$

It suffices for this to apply (2) to the convex function y = x log₂ x (x > 0); with x_k = p_k, w_k = 1/n (k = 1, 2, ..., n) we get (3). The equality sign holds for p₁ = p₂ = ... = p_n = 1/n only. That is, if there are n possibilities for the outcome of an experiment, the uncertainty will be maximal when all possibilities are equiprobable.


bilities are equiprobable.
Formula (3) can be generalized as follows: Let ófi = (ръ p2, . . .,/>„) be a
probability distribution and W = (wjk) a stochastic matrix with n rows and
n columns; the elements of W are thus nonnegative and the sum of the terms
of each row is equal to 1. Put
n

<7* = X PjWjk (k= l,2,...,n).


j=i
Then

X 9k = X Pi X wjk = X p j= i;
k=l j —1 k=1 j =1
hence Q = (qb q2, ■. ., #„) is a probability distribution and we find

im <№ . (4)
In fact, by putting
g(x) = X log2 X, Xj = ph Wj = wjk ( j = 1 , 2 , . . . , ri)

we can derive from (2) the inequality


n

4k log'. Чк ^ X wJkPj log2 Pj• (5)


}=i
If in (5) we sum over k, we obtain (4). Inequality (4) expresses that the un­
certainty for a distribution is larger if the terms of the distribution are closer
to each other.
We introduce now the notion of conditional information. Let ξ and η be two random variables having finite discrete distributions. Let x₁, x₂, ..., x_m be the distinct values taken on by ξ with positive probabilities, y₁, y₂, ..., y_n those by η. We write:

$$P(\xi = x_j) = p_j \quad (j = 1, 2, \ldots, m); \qquad \mathcal{P} = (p_1, p_2, \ldots, p_m). \tag{6a}$$

$$P(\eta = y_k) = q_k \quad (k = 1, 2, \ldots, n); \qquad \mathcal{Q} = (q_1, q_2, \ldots, q_n). \tag{6b}$$

$$P(\xi = x_j,\ \eta = y_k) = r_{jk}; \qquad \mathcal{R} = (r_{11}, \ldots, r_{mn}). \tag{7}$$

$$P(\xi = x_j \mid \eta = y_k) = p_{j|k}; \qquad \mathcal{P}_k = (p_{1|k}, \ldots, p_{m|k}) \quad (k = 1, 2, \ldots, n). \tag{8a}$$

$$P(\eta = y_k \mid \xi = x_j) = q_{k|j}; \qquad \mathcal{Q}_j = (q_{1|j}, \ldots, q_{n|j}) \quad (j = 1, 2, \ldots, m). \tag{8b}$$

According to the definition of conditional probability we have

$$r_{jk} = p_j\, q_{k|j} = q_k\, p_{j|k}; \tag{9}$$

further

$$\sum_{k=1}^{n} r_{jk} = p_j \qquad (j = 1, 2, \ldots, m) \tag{10a}$$

and

$$\sum_{j=1}^{m} r_{jk} = q_k \qquad (k = 1, 2, \ldots, n). \tag{10b}$$

We define now the conditional information I(ξ | η) contained in ξ with respect to the condition that η assumes a given value; this will be the expectation of the information associated with the distribution 𝒫_k:

$$I(\xi \mid \eta) = \sum_{k=1}^{n} q_k\, I(\mathcal{P}_k) = \sum_{k=1}^{n} \sum_{j=1}^{m} r_{jk} \log_2 \frac{q_k}{r_{jk}}. \tag{11}$$

On the other hand, if I((ξ, η)) denotes the information associated with the two-dimensional distribution of ξ and η:

$$I((\xi, \eta)) = I(\mathcal{R}) = \sum_{j=1}^{m} \sum_{k=1}^{n} r_{jk} \log_2 \frac{1}{r_{jk}}, \tag{12}$$

then we have

$$I((\xi, \eta)) = I(\eta) + I(\xi \mid \eta). \tag{13}$$

Formula (13) follows from (9), (10b), (11) and (12):

$$I(\xi \mid \eta) = I((\xi, \eta)) + \sum_{j=1}^{m} \sum_{k=1}^{n} r_{jk} \log_2 q_k = I((\xi, \eta)) - I(\eta).$$

It follows from the definition that I(ξ | η) = I(ξ) when ξ and η are independent, hence (13) reduces in this case to the relation obtained in the preceding section:

$$I((\xi, \eta)) = I(\xi) + I(\eta). \tag{14}$$

We may consider (13) as a generalization of the theorem on the additivity of the information: the information contained in the pair of values (ξ, η) is the sum of the information contained in the value of η and of the conditional information contained in the value of ξ when we know that η takes on a certain value.

Now we show that in general the relation

$$I((\xi, \eta)) \le I(\xi) + I(\eta) \tag{15}$$

holds, where the sign of equality occurs only if ξ and η are independent. According to (13), relation (15) is equivalent to

$$I(\xi \mid \eta) \le I(\xi), \tag{16}$$

which means that the "conditional" uncertainty of ξ for a known value of η cannot exceed the "unconditional" uncertainty of ξ. By taking (11) into account we can write

$$I(\xi \mid \eta) = -\sum_{j=1}^{m} \sum_{k=1}^{n} q_k\, p_{j|k} \log_2 p_{j|k}. \tag{17}$$

If we apply Jensen's inequality to the function x log₂ x with x_k = p_{j|k}, w_k = q_k (k = 1, 2, ..., n), we obtain, in view of (9),

$$p_j \log_2 p_j \le \sum_{k=1}^{n} q_k\, p_{j|k} \log_2 p_{j|k}. \tag{18}$$

From (17) and (18) follows immediately (16), and hence (15) too. The sign of equality in (18) can only hold if all p_{j|k} (k = 1, 2, ..., n) are equal, i.e. when ξ and η are independent. We conclude from (13) that

$$I(\xi) - I(\xi \mid \eta) = I(\xi) + I(\eta) - I((\xi, \eta)); \tag{19}$$

the right hand side being symmetric in ξ and η, we have

$$I(\xi) - I(\xi \mid \eta) = I(\eta) - I(\eta \mid \xi). \tag{20}$$

The left hand side of (19) may be interpreted as the decrease of uncertainty due to the knowledge of η, or as the information about ξ which can be gained from the value of η. We call this the relative information given by η about ξ and denote it by I(ξ, η); we have thus

$$I(\xi, \eta) = I(\xi) - I(\xi \mid \eta). \tag{21a}$$

(We must not confuse I(ξ, η) with the information I((ξ, η)) associated with the two-dimensional distribution of ξ and η.) According to (20)

$$I(\xi, \eta) = I(\eta, \xi); \tag{21b}$$

hence the value of η gives the same amount of information about ξ as the value of ξ gives about η.

I(ξ, η) can also be defined by the symmetric expression

$$I(\xi, \eta) = I(\xi) + I(\eta) - I((\xi, \eta)) = \sum_{j=1}^{m} \sum_{k=1}^{n} r_{jk} \log_2 \frac{r_{jk}}{p_j\, q_k}. \tag{22}$$
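Formula (22) can be evaluated directly from a joint distribution table. A small Python sketch (the joint probabilities r_{jk} are arbitrary choices):

```python
from math import log2

# Relative information I(xi, eta) from formula (22) for a 2x2 joint table.
r = [[0.30, 0.10],
     [0.05, 0.55]]

p = [sum(row) for row in r]                        # distribution of xi
q = [sum(row[k] for row in r) for k in range(2)]   # distribution of eta

I_xi_eta = sum(rjk * log2(rjk / (p[j] * q[k]))
               for j, row in enumerate(r)
               for k, rjk in enumerate(row) if rjk > 0)
print(I_xi_eta)   # > 0, since xi and eta are not independent here
```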

According to (16) we have

$$I(\xi, \eta) \ge 0, \tag{23}$$

where the equality sign holds only if ξ and η are independent. Hence if ξ and η are not independent, the value of η gives always information about ξ. On the other hand, from (21a) and (21b) follows

$$I(\xi, \eta) \le \min\left(I(\xi), I(\eta)\right). \tag{24}$$

Here too, it is easy to find the cases in which the equality sign holds. In fact, if I(ξ, η) = I(ξ), then I(ξ | η) = 0, which can occur only if the value of ξ is uniquely determined by the value of η, i.e. if ξ = f(η). Similarly, I(ξ, η) = I(η) can occur only if η = g(ξ). The quantity I(ξ, η) can be considered as a measure of the stochastic dependence between the random variables ξ and η.

The relation I(ξ, η) = I(η, ξ), expressing that η contains (on the average) just as much information about ξ as ξ about η, seems to be at the first glance surprising, but a deeper consideration shows it to be quite natural. The following example is enlightening. Let η be a random variable symmetrically distributed with respect to the origin, with P(η = 0) = 0, and put ξ = η². There corresponds to every value of η one and only one value of ξ, while conversely ξ determines η only up to its sign. In spite of this, ξ gives just as much information on η as η gives on ξ (viz. I(ξ)); the difference is that this information suffices for the complete characterization of ξ but does not determine η completely (only the value of |η|). In fact, I(η) = I(ξ) + 1 (if we know already the absolute value of η, η can still take on the values ±|η| with probability 1/2, hence one unit of uncertainty must be added).

We prove now the inequality

$$I(\xi, f(\eta)) \le I(\xi, \eta), \tag{25}$$

which is equivalent to

$$I(\xi \mid f(\eta)) \ge I(\xi \mid \eta). \tag{26}$$

If instead of η we observe a function f(η) of η, then we obtain from the value of f(η) at most as much information on ξ as from the value of η; the uncertainty of ξ given the value of f(η) is thus not less than its uncertainty given the value of η.

Proof of (26). If f(y_k) ≠ f(y_l) for k ≠ l, we have equality in (25); if for instance f(y_k) = f(y_l) ≠ f(y_m) for m ≠ k, m ≠ l, then to the terms q_k I(𝒫_k) + q_l I(𝒫_l) (cf. (11)) figuring on the right hand side of (26) there corresponds a single term on the left hand side, viz. (q_k + q_l) I(𝒫_{k,l}), where 𝒫_{k,l} is the conditional distribution of ξ under the condition that η takes on one of the values y_k or y_l. Clearly

$$P(\xi = x_j \mid \eta = y_k \text{ or } \eta = y_l) = \frac{q_k\, p_{j|k} + q_l\, p_{j|l}}{q_k + q_l}.$$

If we apply Jensen's inequality to the convex function x log₂ x, we obtain

$$(q_k + q_l)\, I(\mathcal{P}_{k,l}) \ge q_k\, I(\mathcal{P}_k) + q_l\, I(\mathcal{P}_l).$$

The case when several values of f(y_k) are equal to each other can be dealt with similarly. Thus we proved (26), hence (25) too.

§ 4. The gain of information

The same example which served to derive Shannon's formula can be used to get a heuristic idea of the notion of gain of information. Let E be a set containing N elements and let E₁, ..., E_n be a partition of this set. If N_k is the number of elements of E_k, we have N = Σ_{k=1}^n N_k and we put p_k = N_k/N. Let the elements of E be labelled from 1 to N, E = {e₁, e₂, ..., e_N}, and let the elements of E_k (k = 1, ..., n) be labelled from 1 to N_k. An element of E chosen at random (all elements having the same probability 1/N of being chosen) may be characterized in two distinct manners: a) by giving its serial number in E which we denote by ζ; b) by giving the set E_k to which it belongs and its serial number in E_k. The index k of the relevant set E_k is a random variable which we denote by η. The index of the element in question in the set E_η will be denoted by ξ. Then we have

$$I(\zeta) = I(\eta) + I(\xi \mid \eta), \tag{1}$$

where, clearly,

$$I(\zeta) = \log_2 N, \qquad I(\eta) = \sum_{k=1}^{n} p_k \log_2 \frac{1}{p_k}$$

and

$$I(\xi \mid \eta) = \sum_{k=1}^{n} p_k \log_2 N_k.$$

Now let E' be a nonempty subset of E and let E'_k (k = 1, 2, ..., n) denote the intersection of E_k and E'. Let N'_k be the number of elements of E'_k, N' the number of elements of E', and put q_k = N'_k/N'. Then we have Σ_{k=1}^n N'_k = N', hence Σ_{k=1}^n q_k = 1. Suppose that we know about an element chosen at random that it belongs to E'; what amount of information will be furnished hereby about η? The original (a priori) distribution of η was 𝒫 = (p₁, p₂, ..., p_n); after the information telling us that the chosen element belongs to E', η has the (a posteriori) distribution 𝒬 = (q₁, q₂, ..., q_n). At the first sight one could think that the information gained is I(𝒫) − I(𝒬). This, however, cannot be true, since I(𝒫) − I(𝒬) may be negative, while the gain of information must always be positive. The quantity I(𝒫) − I(𝒬) is the decrease of uncertainty of η; we are, however, looking for the gain of information with respect to η resulting from the knowledge that e_ζ belongs to E'. Let the quantity looked for be denoted by I(𝒬 ∥ 𝒫)¹; it can be determined by the following reasoning: The statement e_ζ ∈ E' contains the information log₂(N/N'). This information consists of two parts; first the information given by the proposition e_ζ ∈ E' about the value of η, next the information given by this proposition about the value of ξ if η is already known. The second part is easy to calculate; in fact if η = k, the information obtained is equal to log₂(N_k/N'_k), and since this information presents itself with probability q_k, the information about the value of ξ is

$$\sum_{k=1}^{n} q_k \log_2 \frac{N_k}{N'_k}.$$

Hence

$$\log_2 \frac{N}{N'} = I(\mathcal{Q} \parallel \mathcal{P}) + \sum_{k=1}^{n} q_k \log_2 \frac{N_k}{N'_k}. \tag{2}$$

Since

$$\sum_{k=1}^{n} q_k = 1 \quad \text{and} \quad \frac{N}{N'} \cdot \frac{N'_k}{N_k} = \frac{q_k}{p_k},$$

we find that

$$I(\mathcal{Q} \parallel \mathcal{P}) = \sum_{k=1}^{n} q_k \log_2 \frac{q_k}{p_k}. \tag{3}$$

The quantity I(𝒬 ∥ 𝒫) depends only on the distributions 𝒫 and 𝒬.

¹ We use a double bar ∥ in I(𝒬 ∥ 𝒫) in order to avoid confusion with the conditional information I(ξ | η).

follows thus from Jensen’s inequality that we have always

7(Q||^)>0. (4)

The equality sign occurs in (4) only if the distributions ■&>and Q are identical.
I(Q II is defined only if every pk is positive and if there exists a one-to-one
correspondence between the individual terms of the two distributions. The
quantity I{Q || J7J), defined by (3), will be called the gain o f information
resulting from the replacement of the (a priori) distribution & by the (a pos­
teriori) distribution Q.
The gain of information is one of the most important notions in informa­
tion theory; it may even be considered as the fundamental one, from which
all others can be derived. In § 6 we shall build up information theory in this
fashion; the gain of information, as a basic concept, will be defined by pos­
tulates.
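Formula (3) and the properties just stated are easy to verify numerically; a Python sketch (the two distributions are arbitrary choices):

```python
from math import log2

def info_gain(Q, P):
    """The gain of information I(Q || P) of formula (3); requires p_k > 0."""
    return sum(q * log2(q / p) for q, p in zip(Q, P) if q > 0)

P = [0.25, 0.25, 0.25, 0.25]   # a priori
Q = [0.70, 0.10, 0.10, 0.10]   # a posteriori

print(info_gain(Q, P))          # > 0
print(info_gain(P, Q))          # also > 0 but different: the gain is not symmetric
print(info_gain(P, P))          # 0 exactly when the distributions coincide
```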
The relative information introduced in the preceding section can be expressed as follows by means of the gain of information. Let ξ and η be random variables assuming the distinct values x₁, ..., x_m and y₁, y₂, ..., y_n with positive probabilities p_j = P(ξ = x_j) and q_k = P(η = y_k) respectively; put 𝒫 = (p₁, ..., p_m), 𝒬 = (q₁, q₂, ..., q_n),

$$P(\xi = x_j,\ \eta = y_k) = r_{jk}, \qquad P(\xi = x_j \mid \eta = y_k) = p_{j|k}, \qquad \mathcal{P}_k = (p_{1|k}, p_{2|k}, \ldots, p_{m|k}).$$

Then we have

$$I(\xi, \eta) = \sum_{k=1}^{n} q_k\, I(\mathcal{P}_k \parallel \mathcal{P}). \tag{5}$$

Indeed by (3)

$$I(\mathcal{P}_k \parallel \mathcal{P}) = \sum_{j=1}^{m} p_{j|k} \log_2 \frac{p_{j|k}}{p_j},$$

hence, because of q_k p_{j|k} = r_{jk},

$$\sum_{k=1}^{n} q_k\, I(\mathcal{P}_k \parallel \mathcal{P}) = \sum_{j=1}^{m} \sum_{k=1}^{n} r_{jk} \log_2 \frac{r_{jk}}{p_j\, q_k}. \tag{6}$$

From this (5) can be derived by Formula (22) of § 3. Formula (5) means that the amount of information on ξ which is contained in the value of η is equal to the expectation of the gain of information obtained by replacing the distribution 𝒫 of ξ by the conditional distribution 𝒫_k.
If 𝒫 = (p₁, ..., p_n) is any distribution having n terms and if ℰ_n = (1/n, 1/n, ..., 1/n), we have

$$I(\mathcal{P} \parallel \mathcal{E}_n) = \sum_{k=1}^{n} p_k \log_2 n p_k = \log_2 n - I(\mathcal{P}) = I(\mathcal{E}_n) - I(\mathcal{P}). \tag{7}$$

The gain of information obtained by replacing the uniform distribution by the distribution 𝒫 is thus equal in this case to the decrease of uncertainty. But in general the quantities I(𝒬 ∥ 𝒫) and I(𝒫) − I(𝒬) are not equal.

Though in general I(𝒫_k ∥ 𝒫) ≠ I(𝒫) − I(𝒫_k), Formula (5) still expresses that the averages of these two quantities are equal. For according to the first definition of relative information,

$$I(\xi, \eta) = I(\xi) - I(\xi \mid \eta) = \sum_{k=1}^{n} q_k \left( I(\mathcal{P}) - I(\mathcal{P}_k) \right),$$

hence, according to (5),

$$\sum_{k=1}^{n} q_k \left( I(\mathcal{P}) - I(\mathcal{P}_k) \right) = \sum_{k=1}^{n} q_k\, I(\mathcal{P}_k \parallel \mathcal{P}). \tag{8}$$

But only the sums on the two sides of (8) are equal; the single terms have not necessarily the same value.
The following symmetric expression is also often considered in information theory:

$$J(\mathcal{P}, \mathcal{Q}) = I(\mathcal{Q} \parallel \mathcal{P}) + I(\mathcal{P} \parallel \mathcal{Q}). \tag{9}$$

This expression was first studied by Jeffreys. A simple calculation shows that

$$J(\mathcal{P}, \mathcal{Q}) = \sum_{k=1}^{n} (p_k - q_k) \log_2 \frac{p_k}{q_k}. \tag{10}$$

Let us remark that while certain terms of the sum (3) defining I(𝒬 ∥ 𝒫) may be negative and we know only that the sum itself is nonnegative, on the contrary, on the right hand side of (10) all terms are nonnegative.

The relative information can be expressed by means of the gain of information in still another way. If ℛ is the distribution {r_{jk}} and 𝒫 ∗ 𝒬 the distribution {p_j q_k}, then it follows from Formula (22) of § 3 that

$$I(\xi, \eta) = I(\mathcal{R} \parallel \mathcal{P} * \mathcal{Q}). \tag{11}$$

The information concerning ξ contained in the value of η is thus equal to the gain of information obtained by replacing the direct product of the distributions of ξ and η by their actual joint distribution.

§ 5. The statistical meaning of information

Let the possible outcomes of an experiment 𝒜 be denoted by A₁, A₂, ..., A_r; let their probabilities be P(A_k) = p_k (k = 1, 2, ..., r). Let 𝒫 denote the distribution (p₁, p₂, ..., p_r); consider n independent repetitions of the experiment 𝒜. The probability of an outcome of this sequence of experiments (when we take into account the order of the experiments) is given by π_n = p₁^{ν₁} p₂^{ν₂} ⋯ p_r^{ν_r}, where ν_k means the number of experiments leading to the outcome A_k. Since the ν_k are random variables, π_n is a random variable too. The expectation of ν_k being equal to np_k, we have

$$E\left(\frac{1}{n} \log_2 \frac{1}{\pi_n}\right) = \sum_{k=1}^{r} p_k \log_2 \frac{1}{p_k} = I(\mathcal{P}). \tag{1}$$

The information I(𝒫) may be interpreted as the expectation of (1/n) log₂(1/π_n). According to the law of large numbers

$$\lim \operatorname{st} \frac{\nu_k}{n} = p_k,$$

hence

$$\lim_{n \to \infty} P\left( \left| \frac{1}{n} \log_2 \frac{1}{\pi_n} - I(\mathcal{P}) \right| < \varepsilon \right) = 1 \tag{2}$$

for every ε > 0 (see Ch. VII, § 14, Exercise 6).

If instead of the expectation of (1/n) log₂(1/π_n) we consider the analogous quantity (1/n) log₂(1/E(π_n)), then we obtain

$$\frac{1}{n} \log_2 \frac{1}{E(\pi_n)} = \log_2 \frac{1}{\sum\limits_{k=1}^{r} p_k^2}.$$

This quantity was already mentioned in § 2; it can also be considered as a measure of information. We shall return to this question in § 6.
There is still another point of view showing that the definition of informa­
tion is suitable. The unit of information (“bit”) was defined as the amount
of information contained in a symbol which can assume only the two values
0 and 1. Such a symbol will be called a 0 —1-symbol. We shall now consider
whether the outcome of an experiment can actually be characterized on
the average by /(.V') 0 —1-symbols. We show that this is really possible,
if certain highly improbable events are neglected.
IX, § 5] STATISTICAL M E A N IN G OF IN FO R M A T IO N 565

T heorem 1. Let o f be an experiment having possible outcomes А ъ A 2, . . . ,


A r occurring with probabilities pk = P(Ak) > 0 (k = 1,2, . . r). Put =
- {Pi,p2, . . ., pr). Then for any given g > 0 and ö > 0 there exists an n0
depending only on e, and ő such that if there are performed n (n > n0)
independent experiments o f , then the outcome o f this sequence o f experiments
can with a probability greater than 1 — ö be expressed uniquely by n (/(.f ) + g)
0 —1-symbols. I f g > 0 is arbitrarily small, it is impossible to character­
ize the outcome o f the sequence o f experiments by less than пЩ.АТ) — g)
0 —1-symbols with a probability greater than or equal to g whenever n > n'0,
where ri0 depends only on ffi, s, and g.
Remark. This means that, if the experiment o f is sufficiently often repeat­
ed, for the description of an outcome of the experiment one does not need,
on the average, more than I(fA) + g 0 —1-symbols; hence the statement that
the outcome of o f contains the amount of information I (f f ) has a quite
definite meaning.

P roof . Choose щ large enough so that n > nx should imply

P Í— log2 — - 1(«i») < — > 1 - b. (3)


n 7ln 2
This is, in view of (2), always possible. This means that the sequences of
outcomes obtained by repeating n times the experiment can be parti­
tioned into two classes: the first consists of the sequences for which


n
log2 —
7t„
- w < 41 . (4)

the second of the remaining ones. According to (3) the probability that a
sequence belongs to the second class is less than <5. Let Cn denote the number
of sequences of the first class, let qb q2, . . ., qCn be their probabilities. By
(4) we have

2- n{Kffi)+D < q. 0 ‘ = 1 , 2 , . . . , C„) (5)


or, by adding these inequalities,

C ,2 -(W *= h < X »f (6,


j =1
The sum on the right hand side cannot exceed 1, since it represents precisely
the probability that a sequence belongs to the first class. Therefore we have

C„ < 2 (W + ^). (7)


566 IN T R O D U C T I O N T O I N F O R M A T I O N T H E O R Y [IX , § 5

Now let us number the events of the first class from 1 to C„ and write
g
these numbers in the binary system. For this n / ( ^ ) + — + 1 binary digits
are needed. There can be found an n2 such that for n > n2 the inequality

n JI(& ) + ®- + 1< + s) (8)

holds. Put и0 = max (nh n2); it is clear that n0depends only on s, 5, and SP
and satisfies the requirements of the theorem. It is easy to show that with
large probability «(/(d?5) — e )0 —1-symbols are not sufficient to describe
the outcome of the sequence of experiments. To see this, subdivide again
the set of the sequences into two classes: let the first class contain the se­
quences for which

К & ) - — log2 — < (9)


n nn 2

and the second the remaining ones. Choose an n3 such that for n > n3 the
probability of (9) exceeds 1 — <5; this is possible because of (2). Let
Dn denote the number of sequences in the first class and let гъ r2, . . r Dn
be the corresponding probabilities. We have then

2- 4 ^ ) —2) > r. { j = 1, 2, . . . , D n). (10)

Furthermore, by assumption
d„
E ^ i -г. (ii )
i =i
If we select some outcomes and assign to them sequences of zeros and
ones of length not exceeding — в), the number of these sequences
will be less than hence the total probability of the selected out­
comes will be at most

2<'(£P)-z) 2~n(,(S>)~Ti0 + <5 = 2~ 2 + <5.

The total probability of the outcomes not considered is thus at least


_ HE

1 —5—2 2 > 1 — 25, provided that n >


Suppose that n >n ' 0 = max (и3, и4) and that, contrary to the statement
in the second half of the theorem, it is possible to characterize the outcome
of the sequence of experiments by less than — e) 0 —1-symbols with
IX, § 5 ] STATISTICAL M E A N IN G OF INFO R M A T IO N 567

a probability > q > 0. If we choose <5such that 2d < g, then this contradicts
what was just proved.
Theorem 1 is therefore completely proved. It can be sharpened in the
following manner:

T heorem 2. For every 5 > 0 there can be given an щ such that fo r n > n0
the outcome o f n independent experiments ^ can be uniquely expressed, with
probability > 1 — <5, by at most nl(SF) + K f h 0 —1-symbols; К is here a
positive constant which depends only on Ö.
However, there corresponds to every q between 0 and 1 a constant K ' and an
integer n0 such that a unique characterization o f the outcome o f a sequence
o f experiments becomes impossible (with a probability > o, for n > n0) by
less than nl{-¥s) — K 'J n 0 —1-symbols.

P roof . It is easy to show1 that the distribution of the random variable

J n Í— log2 — - I(SP) = Y - k— ,- Pk log2 —


l n n „ * ti J n Pk

tends to the normal distribution as n -* oo. There exists thus a constant К


which depends only on ö such that we have for sufficiently large n

p \ — log2 — - I ( & ) < - ? = >l-$. (12)


n л„ Vй

The continuation of the proof runs exactly as that of Theorem 1.


Theorem 1 can also be considered as a justification of Shannon’s defini­
tion of information. The statement of Theorem 1 can be translated into the
language of communication theory as follows: Let a message source at
the moment t (t = 1 , 2 , . . . ) emit a random signal let х ъ x 2, . . x r
be the possible signals; let p k = P(£t = x k) denote the probabilities of the
individual signals. These probabilities are supposed to be independent of t.
Assume that the signals are independent of each other. Assume further that
for transmission the signals must be transformed (encoded) since the
“channel” (transmission network) can only transmit two signs.2 (This is the
case e.g. if the channel works with electric current and at every instant only
two cases are possible: the current is either on or off.) Let 0 denote one of

1 The proof is the same as that in Exercise 26 of Ch. VIII, § 12.


2 In information theory the word “channel” has a very general sense: it means
any process capable to transmit information.
568 IN T R O D U C T IO N TO INFO R M A T IO N TH EORY [IX, § 5

the signs and 1 the other. The question is then, how many 0 or 1 symbols
are necessary for the transmission of the information contained in n signs
£i, £2 furnished by the source. According to Theorem 1 with proba­
bility arbitrarily near to 1 less than + s) symbols are required, pro­
vided that n is sufficiently large. This shows the importance of the quantity
/(.9 s) for communication engineering.
Let us mention an important particular case. If pr — p2 — ... = p r = — ,
r
then H-9') = log2 r; therefore in order to encode a signal of such a source
into 0 —1-symbols, on the average log2 r symbols are necessary. (Of course
this can be shown directly.) If for instance a number written in the decimal
system is transcribed into a binary system, the number of digits increases on
the average by the factor log2 10 = 3.3219 . . . . This is of importance for
computers, which work in the binary system.
If the source emits signals x k with probabilities p k (k = 1 , 2 , . . ., r) and
nl(,9)
if the channel can transmit s different signs, approximately —---- — signs
log2i
are necessary in order to transmit a message of n signs if the most econom­
ic coding is applied.
It is to be noticed that optimal or nearly optimal codings are very compli­
cated and are feasible only for long sequences of signals. Hence in practice
usually such codes are employed which to some extent take into account
the statistical nature of the source, but are more easy to handle than the
nearly optimal codings. In particular, the signals are coded one by one or
by small groups (as for instance in the encoding of letters into Morse sig­
nals).
The message sources encountered in practice are generally much more
complicated than those described above. The individual signals are, in
general, not independent of each other. E.g. in every natural language the
letters have not only different probabilities, but the probability of a letter
depends also on the letters preceding it in the text. This can also be taken
into account, but we do not deal with these questions here.
The channels actually used in communication theory are also much more
complicated than those discussed above. In practice, it is of the great impor­
tance to know how to transmit the information through a channel which,
with a certain probability, distorts the transmitted signal. Then one cannot
be sure that the received signal is identical to the emitted one. (E.g. in broad­
casting the distortions caused by the transmission through the atmosphere
are perceived as noise.) Such channels are called noisy channels. Information
theory takes this into account, but our brief introduction does not permit
to go into these questions.
I X , § 6} F U R T H E R M E A S U R E S O F IN F O R M A T I O N 569

§ 6. Further measures of information

In the present section we give another characterization of the information;


this approach will show what other quantities can be considered as measures
of information besides that of Shannon’s.
Shannon’s information was defined in § 2 by postulates and by means of
Shannon’s information we introduced the notion of the gain of information.
We shall now follow the inverse procedure: we define first the gain of in­
formation by a set of postulates; from this then we shall derive a measure
of information.
We start from a generalization of the notion of a random variable. Let
[Q, P] be a Kolmogorov probability space. We define an incomplete
random variable as a function t; = c(w) measurable with respect to the meas­
ure on and defined on a subset Q1of Q, where and F(ßi) > 0.
The only difference between an ordinary random variable and an incomplete
random variable is thus that the latter is not necessarily defined for every
со £ Q. In this sense, ordinary random variables can be considered as par­
ticular cases of incomplete random variables.
If £ is an incomplete random variable assuming values x k with probabili-
П
ties pk (pk > 0; к = 1, 2 , . . ., и), we have Y pk < 1 and not necessarily
k=1
í > * = I-
*=i
The discrete incomplete random variables and i] are said to be inde­
pendent, if for any two sets A and В the events £ £ A and r\ £ В are inde­
pendent.
The distribution of an incomplete random variable will be called an in­
complete probability distribution-, in this Sense the ordinary distributions can
be considered as a particular case of the latter. Thus if pk > 0 (к = 1, . . . , « )
П
and Y P k — 1> then {pk} is a finite discrete incomplete distribution.
k= 1
The direct product of two incomplete distributions d?5 = {pj} (j — 1,. . ,,n)
and Q = {qk} (k = 1 , 2 , . . . , « ) is defined as the incomplete distribution
{Prfk} (J — 1, . . m; к = 1 and will be denoted by S 5 * Q.
To every incomplete distribution & = (ръ . . ., pn) there can be assigned
an ordinary distribution .9'1' = [p\, . . .,p'„) by putting
Pk
Pk — n

ZPj
j=i
Let <*; be an incomplete random variable taking on values x k with probabili-
570 IN T R O D U C T I O N T O IN F O R M A T I O N T H E O R Y [IX , § 6

ties pk (k = 1, 2 , . . «); put s = Y Pk- If 0 < s < 1, £ can be inter-


k=1
preted as a quantity depending on the outcome of an experiment, but not
defined for all outcomes of the experiment. For example £ is only defined
if the outcome is observable, which happens with probability s, where
0 < s < 1. In this case the corresponding distribution 66' may be inter­
preted as the conditional distribution of £ with respect to the condition that
the outcome of the experiment is observable. Therefore 66' is said to be the
complete conditional distribution of the incomplete random variable £.
We shall now define the mean gain of information obtained if the (incom­
plete) distribution 66 = {pb .. .,p„) (Pk > 0 for к = 1,. . ., rí) of the incom­
plete random variable £ is replaced by the incomplete distribution Q =
= (gx, . .., q„). Before stating the postulates we make two remarks:
1. T h e g ain o f in fo rm a tio n d e n o te d b y 7(Q || 6 6 ) is d efin ed o n ly if 6 6 a n d
Q h av e th e sam e n u m b e r o f te rm s a n d th ese are in a o n e -to -o n e c o rre sp o n ­
dence defined by th e ir indices.
2. We supposed pk > 0 for all values of k; however, some qk (but not
all) can be equal to 0.
The quantity I(Q | | 66) has to satisfy the following postulates:

P ostulate ). I f 66 = 66x * 66., and Q = Цх * Q2, then

Щ l | ^ ) = Щ , 1166x) + Щ 2 II662). (1)


Remark. This means that if we put 66) = (pn, . . ., pin), Qt = (qn, . .., qin)
(i — 1 or 2), then there corresponds to the element pXJ p2k of 66 the ele­
ment qXJ q2k of Q. Postulate I is a general formulation of the additivity of
information.

P ostulate II. I f pk < qk (k = 1 , 2 , . . ., rí), then we have /(Q || 61s) > 0;


for pk > qk (k = 1 , 2 , . . . , « ) we have I(Q || 66) < 0.
Remark. It follows from this that 1(66 \\ 61s) = 0. For complete distribu­
tions 66 and Q Postulate II asserts nothing more than this, as then the in­
equalities pk < qk (k = 1 , 2 , . . . , « ) occur only if pk = qk (k = 1 , 2 , . . . , rí),
П П

since Y j Pk — Y 4k — 1- In the case of incomplete distributions,


k= 1 *=l
however, this postulate leads to important conclusions.
Let i f p be the distribution consisting of a single term {p} (0 < p < 1).
We require:
P ostulate Ш . Ilfox \\ x) = 1.
i
IX , § 6 ] F U R T H E R M E A S U R E S O F IN F O R M A T IO N 571

This postulate fixes the unit of gain of information.


Before proceeding further, we determine the function

д(ч, p ) = K&qW&p) (0 < p < i,o < q < i).


It follows from (1) that

<?(?1 ? 2 , Pi, Pi) = &( Чъ Pi) + Рг)- (2)

If we put qx — q2 = 1, we find

giUPiP*) = g(\,P i) + g ( l p 2)- (3)


If we put qx = Pi = 1, pi = p, q%= q, we obtain

g(q,p) = g (h p ) + g(g, i). (4)


Hence, according to Postulate II,

g (h p ) + g{p, 1) = 0. (5)
We conclude from (4) and (5) that

g(g,p) = g(Up) - g(i, q)- (6)


Now g (l,p ) being, by Postulate II, a decreasing function of p, it follows
from (3) by a well known theorem that

g{\,p) = c log2 —
P
with c > 0. According to Postulate III c = 1, thus

l{%i\\%p) = g {\,p )= log2 - L , (7)

and by (6)

K%q\\&p) = g(q,p) = log2 — - log2 — = log2 — . (8)


p q p
If we observe the occurrence of an event having probability p, we get the
amount of information log2 — ; if p is replaced by q the gain of
P
information is log2 — .
P
572 IN T R O D U C T IO N TO IN FO R M A T IO N THEORY [IX, § 6

The quantity log., — can also be considered as measuring the uncertainty


“ P

of the occurrence of an event with probability p; the quantity log2 — =


P
= l o g ,------ log2 — is the decrease of the uncertainty resulting from
p q
the replacement of p by q (note that this “decrease” can be negative as well;
indeed if q < p, the uncertainty increases).
We introduce now a new notion. If we replace an incomplete distribution
9P = (p1; . . .,p n) by an incomplete distribution Q = (qlt . . q„), we obtain
with probability qk the information lo g o - - { k = 1, 2, . . ., ri). Put
Pk

, 4k
Як = -

In

7= 1
я>

The conditional probability that we obtain the information log, —


Pk
under the condition that at least one observation occurs, is equal to
q'k (k = 1 , 2 , . . ., и). Put
F (Q ,& ,x )= X qk ; (9)
'°вг1!<х

F(Ll, ■9:>,X) will be called the conditional distribution function o f the gain o f
information.
Now we can formulate our further requirements:

P ostulate IV. I(Q 11 .9s) depends only on the function F(Q, x).
Because of this postulate we can also write /[T(x)] instead of /(Q || 9 s),
where F(x) = F(Q, ,9 , x).
Notes. 1. If Q = <%q, 9 s = If p, we have

0 for x < log2 — ,


F(Q, & ,x ) = P
1 otherwise.

Postulate IV is thus fulfilled and (8) expresses that for a degenerate distri-
IX, § 6] FU R TH ER M EASURES OF IN FO R M ATIO N 573

bution function
0 for X < c,
otherwise, ('°>
(where c can be any real number) we have the relation I[Dc(x)] = c.

2. Every distribution function F(x) of a finite discrete distribution can be


written in the form F{x) — F(Q, x) where Sfi and Q are suitably chosen
incomplete distributions. Indeed let ak be the discontinuity points of F(x),
П
with jumps wk (k = 1 , 2 , . . n; Y w k = 1), then we have to determine
*=i
numbers p k and qk (k = 1 , 2 , . . . , « ) such that the relations

ak —l° g 2 “ > wk = -— —~ ------ — — {k = 1,2,..., n) (1 1 )


Pk fa + <7г + • • • + Яп

hold. This is the case if we take qk = twk and pk — twk 2~ak. If we choose
the number t such that

0 < í < min 11, - - — 1 ----- , (12)


Y j Wk ' 2 ~ ak
\ k =1 J

then we obtain a system of solution satisfying all our hypotheses.


I[F] is thus a functional defined on the set J^of all distribution functions
of finite discrete distributions. The following postulates concern the prop­
erties of this functional.

P ostulate V. I f F £ У, G £ JFl F ^ G and G{x) > F(x) ( —oo < x <


< + oo), then
7[G(x)] < / № ) ] .

Remark. This postulate contains Postulate II. In fact if p k < qk (к = l,


2,..., n) we have log2 — > 0, hence F(Q, Sft, x) < D0(x). From this follows
Pk
by Postulate V that I(Q || JP3) > I[D fx)\ = 0; and if pk > qk, the inequal­
ity is reversed. However, Postulate II is not superfluous, since in order to
state Postulate V we used relation (8) resting on Postulate II.

P ostulate VI. Let Ft £ ■9r (/ = 1, 2, 3) and I[F.,\ — 7[F3], Then fo r every


574 I N T R O D U C T IO N T O I N F O R M A T I O N T H E O R Y [IX , § 6

1 (0 < t < 1) we have


I[tFx + (1 - t)F2] = fitF , + (1 - i)F3];
furthermore, I\tFl + (1 — t)F2] is a continuous function o f t.1
Remark. Postulate VI may be called the postulate of quasi-linearity.
Now we can state

T heorem 1. I f I(Q || d?3*


) satisfies Postulates I to VI, then:
— either there exists a real number a # 1 such that 7(Q 11 P ) = I fQ | | d?3) ,
defined by

Ц -1 ° г Л Z - 4 V (13а)
£ ? * * '1 Ä I
— or 7(Q II ti5) = IfQ у ,VJ) with

A(Q II-^)= .* É g * l° g i~ ' (O b)


Z 9* *■'
k= l
Pk

(13b) is i/ie limit o f (13a) /o r a -> 1:


lim 7a ( Q l |^ ) = 71( Q ||^ ) . (14)
a—1
Remark. Ifd /3and Q are complete distributions, fiflffP ') is identical to
Shannon’s gain of information defined by Formula (3) of § 4.
The quantity 7/Q 11.Vs) will be called the measure o f order a o f the gain
o f information; IfQ |] will be called Shannon's gain o f information or
measure o f order 1 o f the gain o f information.
In order to avoid confusions, Shannon’s gain of information will from
now on always be denoted by IfQ || - f ) instead of 7(Q || d?3).

P roof . 2 Instead of Iffil || Sfi) we use also the notation 7a(F(Q, x.))
Then (13a) and (13b) are written as
+ 00

I. (F) = — Ц -log., I 2(*-V*dF(x) for a ^ 1 (15)


a- 1 J
— 00

1 The assumption of continuity is not indispensable to the proof of Theorem 1;


its only purpose is to simplify our proof.
2 The following proof is a combination of the proofs of two theorems from the
theory of functional equations. Cf. G. H. Hardy, J. Littlewood and G. Pólya [1],
pp. 215 and 84.
IX, § 6] F U R T H E R M E A S U R E S O F IN F O R M A T IO N 575

and
+00
h (F )= J xdF(x). (16)
-00

From these formulae we see that IJQ || FA) satisfies for every «Postulates
I through VI. It remains still to show that no other functional can satisfy
all these Postulates. A simple calculation shows that

F (0i * 0.2, ^ 1 * <&2 , x ) - j F(Q 1 , ,X - y)dF (Q 2, y), (17)
—00

which permits to rewrite Postulates I and III in the following form :

Postulate I'. I f F e ,9r, G f У and i f we put


+0C1
F*G = J
F(x - y)dG(y).
—00
then we have
I[F *G ] = I[F] + I[G\. (18)

P ostulate ПГ. I[D fx)] = l.


We show now that Postulates Г, ИГ, IV, V and VI are satisfied only by
the functionals (15) and (16).
Let 3 rA be the class of finite discrete distribution functions with F( —A ) =
= 0, F(A) = 1. We deduce from Postulate VI by induction that from the
relations

Fey, Fey, m]=j[Fi wt > о (/=i,2,...,r)


r
and Y j wi — 1 the relation
i=i
n t wiFi] = I [ í w i F'i] (1 9)
i= l i= l

follows. We know already by Postulates Г, III', and V that

I[Dc(x)\ = c (20)

c
holds for every real c, where Dc(x) is the degenerate distribution function
of the constant (see Formula (10)).
Let
1h ( t ) = I [ ( l - t ) D _ A(x) + tDA(x)]. (21)
576 INTRODUCTION TO INFORMATION THEORY [IX, § 6

ij/A{t) is a strictly increasing continuous function of t; further фА(0) = —A


and фА(1) = +A. Put t — <pA{u) for и = \ßA(t). As <pA(u) is the inverse
function of ij/A(t), it is continuous and strictly increasing in the interval
( —A, + A ). From this we derive

/[A , (*)] = и = Фл(0 = 1 [ ( 1~?л O')) d - a (x ) + <Pa 00 A (x)]. (22)


Let F £ dFA be a distribution function which jumps wlt w2, . . wn at the
points аъ a2, . . an, we have

F = F(x) = £ w k Dak(x) (23)


k=1
and, according to (19) and (22),

I[F] = / [ í ((1 - c p A (ak)) P _ A (X) + <pA (ak) D A (x))], (24)


A: = l

hence

I[F] = /[(1 - t
*=i
w k9>A {O k )) D - a (x ) + (X
*=i
w k Т а Ю) D a (*)] (25)
or, according to (22) by writing <pA\ t ) instead of <J/A(t)

Л Л = <PÄ1 ( f J wk cpA (aÄ)). (26)


k =l
Formula (26) expresses that I[F] is the Kolmogorov-Nagumo quasilinear
mean of the numbers ak with weights wk (k = 1 , 2 , . . . , n). We shall need
the following lemma concerning this mean:

L emma . Let (pfx) and <p2(x) be two continuous and strictly increasing func­
tions in the interval [/, К]. Suppose that for arbitrarily chosen numbers
Хц x2, .. ., x„ in [J, K] and for positive numbers wi, w2, . . . , wn with
П

Yj wk = 1 we always have
k=1

<Pil ( i , w k V1 (x k)) = (p-I1 ( X wk <P2 (xk)). (27)


k = 1 k=l

This means that the relation

(pfx) = acpfx) + ß (28)

holds, where a > 0 and ß are two constants. (Conversely, (28) implies (27)).
IX, § 6) F U R T H E R M E A S U R E S O F IN F O R M A T IO N 577

P roof . It suffices to prove (28) by supposing (27) to hold for n = 2,


Щ - t, w2 = 1 - t, 0 < t < 1. Put cp2(J) = J', <p2(K) = K \ If Xj and x 2
describe the interval [/, K], then y x - <p2(*i) and y 2 = 9?2(x2) describe the
interval [ /', К ’]. Hence if J ’ < y i < K ’, J ’ < y 2 < K ‘ and if we put
95i (?)2 ’(x)) = <p3(x) we find
«PsCOh + (1 - 0 j 2) = t'PsO'i) + (1 - О'РзО'г) • (29)
<p3(y) is thus a linear function, which proves the first part of the lemma.
The converse is trivial.
We show now that there can be found a function <p(x), independent of A,
such that (26) remains valid if <pA is replaced by cp. It suffices to prove that

, ч <Pb (x )-<P b ( - A ) , - . D
Г Л *)= torO<A<B.(30)

This follows from


П П
<P~a 1 ( I wk <pA (ak)) = cpB\ £ wk <pB (ak)) (31)
k = l k = 1

or 0 < A < В, I ak I < A (k = 1, 2 , . . . , и); Formula (31) itself follows


from ,?A а Ув for A < B. From (31) and from the lemma we conclude that
<Pa (x )= a<Pn(x) + ß, (32)

and since <pA( —A) = 0, срл(А) = 1, we obtain (30).


Thus we proved the existence of a monotone continuous function cp(x)
which for every F having jumps wk at the points ak (k = 1,. . ., rí)
fulfills the relation
n +00

/[F] = 9 У 1( Z wk cp(ak)) = cp-1( J cp(x) dF(xj). (33)


k —1 —00

Now we investigate how cp(x) can be chosen such that it fulfils also Postu­
late Put in / '

F(x) = W a (x) + (1 - t) Db (x), G(x) = Dy (x).


Then we have
(p-1 (t(p(a + у) + (1 - t) 4>(b + j;)) = (p~l (tcp(a) + (1 - t) cp(b)) + y. (34)
Fix у and put <p*(t) = 9o(t + j). From (34) follows

<p*~l (tip* (a) + (1 - 0 (p* (b)) = <p-1 (tcp(a) + (1 - 0 cp(b)) (35)

0
578 IN T R O D U C T I O N T O I N F O R M A T I O N T H E O R Y [IX , § 6

for all values of a and b and for 0 < t < 1. It follows from the lemma that

<P*(x) = <p(x + y) = A(y) q>(x) + B(y) (A(y) > 0).

If <p(0) = 0, which can be supposed without restriction of generality, then


B(y) = <p(y), hence
Ф + jO = A(y) cp(x) + <p(y). (36)

This relation being fulfilled for every y, we may interchange x and y, hence

A(y) <p(x) + 950) = A(x) <p(y) + <p(x), (37)


thus
A(x) - 1 _ A ( y ) - l
Ф ) Ф )

From this we obtain A{y) = k<p{y) + 1, where A: is a constant. From (36)


and (38) it follows that
<p(x + y) = k(p(x) <p(y) + 9o(x) + <p(y). (39)

We distinguish two cases: к = 0 and к Ф 0. If к = 0,

Ф + У) = <P(x) + <{'(y) (40)

and, since y{x) is monotone, <p(x) = Bx, where В is a constant. If к ф О,


put h(x) — k(p(x) + 1. We conclude from (39)

h(x + y) - h(x) h(y), (41)


and h(x) being monotone,

h(x) = (42)

with а Ф 1, hence
9(«-l)5£ _ 1
Ф ) = ----- -k ----- • (43)

According to the lemma, iy{x) can be replaced by 2(01_1)л:. And, by taking


into account (33), we have thus either (15) or (16). The limit relation (14)
can be proved e.g. by the rule of l’Hospital. Theorem 1 is herewith proved.
IX , § 6 ] F U R T H E R M E A S U R E S O F IN F O R M A T IO N 579

If ffi = %n is the uniform distribution of n terms and if £j is any incomplete


distribution, then for a ^ 1 the relation
/ " \
1 (z«s\
4(QII &„) = log2 n - log2 ~ — (44)

4*=1 '
holds. Thus if we put for any incomplete distribution = (ръ .. ., p„)

1 { £ * }
/a (J * ) = J — l°g2 * (45a)
\I a ;
4= i J
we find that
4 ( Q № = 4(^4)-4(0). (46)
(46) shows that the quantity I f d f ) may be considered as a measure of the
amount of information corresponding to the distribution d?3 (or else as a
measure of the uncertainty of a random variable with the distribution d?5).
We call I f f ) the information o f order a. It is easy to see that

" 1
Z 0*l°g2—
4(Q m = logo n - ^ ~ - n-------^ . (47)
Z 9k
k=1
For any incomplete distribution 6$ = (ръ . . .,p „) we put

Zn Pklogi —
1

/r(^ ) = — 7i------— • (45b)


Z Pk
k = 1

and we call this quantity Shannon's information or information o f order 1.


If 6P is an ordinary distribution, I f f 6) is the entropy or Shannon’s informa­
tion of the distribution d?5. In what follows, in order to avoid confusions,
we shall write always I f f 6) instead of /(d?3) used in the preceding sections.
For a complete distribution Sfi the definition of I f f 6) gives

4 (^ ) = ~ — l o §2 z Pk for « ф 1. (45c)
l -а k=i
580 IN T R O D U C T I O N T O I N F O R M A T I O N T H E O R Y [IX , § 6

Clearly, for every distribution function, complete or incomplete,

lim f {.9s) = f (.9)


a -* -l

holds. We study now /„(99s) as a function of oc.


n

T heorem 2. Let . 9 = . . ., pn) he an incomplete distribution, f pk =


*=i
= s < 1. Then If99s) is a positive, decreasing function o f a. One has I f- 9 ) =
n
— log2 — ; in particular, i f 69 is a complete distribution, then I f 6 s) = log2 n.
s
Thus for a complete distribution
0<4 (9 ) < log, n (a > 0). (48)
P r o o f . We can write

, I a — V"“
f (69) = logs — - „ Pk -----
V Х й ,
4 A: = 1 7

We know1 that the average

( £ w* xg)'? (x* > 0, w* > 0, £ = 1)


*=1 A=1
s a monotone increasing function of ß. Hence Theorem 2 is proved.
Remark. If px < p2 < . . . < pn , we have

lim f (69) - log2 — and lim Ia (69) = log, — .


a -* —oo Px a-*- + oo Pn

Concerning the gain of information we obtain the following inequality:

T heorem 3 . I f 6 9 = (ръ p„) and f = (q,, . . .,q n) are any incomplete


n n

distributions ( f f Pk = s < 1; f qk = t < 1), then I f f [[ 69) is an in-


k=X k—X
creasing function o f cl. Since I f f 11,9') = log, — , for the complete dislribu-
s
tions 69 and f there follows the inequality

Ia( f \ \ 6 9 ) > 0 for a > 0. (49)


1 Cf. G. H. Hardy, J. E. Littlewood and G. Pólya [1], Theorem 16.
IX , § 6 ] F U R T H E R M E A SU R E S O F IN F O R M A T IO N 581

Proof. We have
/ Л_ l a_l \ 1
( Z
<7* — Y *
= iog2 h=i— -£hL — , (50)
I ft
4 *=i '

from which Theorem 3 follows by the same theorem on mean values (cf.
foot-note) as above.
If я is negative or zero, the properties of and IfQ || ffi) differ essen­
tially from those of Shannon's information. As can be seen from Theorem 3,
IJ fl II Sfi) is for complete distributions only then positive, when я is posi­
tive. The following property is particularly undesirable: Let я < 0; modify
the complete d istrib u tio n ^ = (ръ . . .,p n) by letting p x tend to zero, then
tends to infinity. On the other hand, is always equal to log2 n
whenever contains n positive terms. I f f 1) is thus very inadequate to meas­
ure the information and we consider only I f f -1) with positive я as true meas­
ures of information.
Let us now consider some distinctive features of Shannon’s information
among the informations of the family Iffifi), or of I f f l || 9 s) among the
informations of the family Ififl 11 ■9). One of these properties is given by

Theorem 4. I f £ and rj are two random variables with the discrete finite
distributions Sfi and Q and if .Jf denotes the two-dimensional distribution of
the pair (£, rf), then

f i ( - 9 ) < f i { .9 ) + fi((f) (51)

holds for every and Q with the mentioned properties i f and only if я = 1.

Proof. We know already that inequality (51) is valid for я = 1 (cf. § 3,


Formula (15)).
In the case of я Ф 1, (51) is not necessarily fulfilled. In fact let 0 and 1 be
the possible values of £ and r\; and suppose

P(fi = 0, ri = 0) = pq + e,

P(f = 0, n = 1) = p (i - q) - £>

P(£ = 1, t] = 0) = (1 - p)q - e,

Щ = 1, 4 = 1)=«(1 - Ж 1 - * ) + «
582 IN T R O D U C T I O N T O IN F O R M A T I O N T H E O R Y [IX , § 6

with
1 1
0 < p < 1, 0 < q < 1, р ф — ,
and
I e I < min ( pq, (1 - (1 - q)p, (1 - p)(l - q)).

If (51) were true, the function

g{e) = (pq + e f + (p(\ - q) - e)a + ((1 - p) q - efi + ((1 - p) (1 - q) + e)a


would have an extremum for 8 = 0. But this is not the case, since g'(0) # 0.
The quantity 4(Q || 6P) is distinguished among the 1^(0. || C?3) e.g. by
the following property:

T h e o r e m 5. I f & = (pb . . . , pr), &>' = (p[,. . p„) and Q. = (qu . . . , q„)


are discrete, finite, incomplete distributions fulfilling the relations

dk = s/Pk Pk (k = 1 , 2 , . . . , ri), (52a)


i.e. the relations

log.. — + log2 Ц- = 0 (k = 1 , 2 , . . . , л), (52b)


Pk Pk
then the relation
4 ( Q |i ^ + 4 (Q ||^ ') = 0 (53)

holds for every distribution fulfilling (52a) only if a. = 1.


Remark. The distributions Sfy 6P', and Q can only be all three complete
if they are identical. In fact, according to Cauchy’s inequality

d ЧкУ- d j í J k ) ^ d P k ) á p k \
k=1 *=1 k=1 *=1
where the equality sign can only hold if SP = .

P roof. F or а Ф 1 we have
in X \ [ n
X! ^ I Xj ™
/«( Q II & ) + 4 ( Q II & ') = (X 1
lo& k = lP *k i r - 1f y d ---- ■ (54)
L 4k
U=i
IX , § 7 ] S T A T IS T IC A L I N T E R P R E T A T I O N 583

It is easy to see that the right hand side of (54) is not identically zero; e.g.
it is different from 0 if we put n = 2, qx = q2, px ^ p2.

§ 7. Statistical interpretation of the information of order a

Let Аъ A 2, . .A r be the possible and mutually exclusive outcomes of an


Г
experiment with probabilities P(Ak) = p k ,
k =1
Z
pk = 1. PutiP3 = (ръ .. .,pr).
Suppose that px < p2 < . . . < pr and perform и independent repetitions
of the experiment. Let vk be the number of experiments leading to the out­
come Ak (k = 1, 2, . . r). Put тсп = p\' pi* ■■. py/. As in § 5, n„ is thus the
probability of a sequence of n observations. Consider the function

Z Pk b g 2
/(a) = ------- 7--------- . (I)
Z
k=1
pi
Since
Г r 1 / r 1 \ 2\
\ Z logl —
Pk I Zp2bg2- -\
/' (a) = - In 2 k=1 r------ - k=1 —------ — , (2)
,
4
Z Pi
k= 1
\
4
Z Pk
k= 1
J,
' '

it follows from Cauchy’s inequality that 1(a) is a strictly decreasing function;


further we have

/(1) = Ix (■ty), lim 1(a) = log2 — , lim 1(a) = log2 — .


a-*—oo Pi a-*+ao Pr
For a > 1 we have thus 1(a) < /, (&). If we put

p(a) = 2-'(‘> (3)


we have

tp llo g ,^-
log2 - J - r = Ha) = -*=Ц г------ Pk (4)
i
m
к = 1

and
< pf for a> ]
584 IN T R O D U C T I O N T O IN F O R M A T I O N T H E O R Y [IX , § 7

Now let Bn(<x) be the event n„ > p(oc)". Consider the conditional information
contained in the outcome of the sequence of experiments, under the condi­
tion Bn(u). Put for this

£ v ^ r b r r - <5>
П PÍ* :>?(»)"
*=' г
£ пк = п
fc=i

Obviously, C„(a) is the number of outcomes fulfilling the condition B„((x).


The information in question is thus at most equal to log2 C„(a). Further

P ( B n( * ) ) > C n ( a ) p ( a y (6)

and on the other hand

E « - 1) = ( Í Pi)"- (7)
k =1
Hence, because of Markov’s inequality,

P(Bn( « )) < P( « y [ i Ш Г (8)


U=i I Д а)
or, according to (6),

<9)
Put

?*(«)= г • (10)
l Pl
7=1

If Q denotes the distribution (<7i(a ),. . c/f(«)), we get from (4) by a simple
calculation that

log* f t (-ттП = h (Q '), (11)


U=i l />(«);

hence, according to (9),

C„ (a) < 2',/l(0") . (12)


IX, § 7 STATISTICAL INTER PR ETA TIO N 585

Furthermore, we have

A ( Q J - « l o g * 4 - - ( « - 1) / a( ^ ) = A ( ^ ) + T ^ - / i ( Q J I ^ ) . (13)
/W 1- a
Choose a sufficiently large h for which

Pr+ ■ > (P1 P2 • • • А- i ) ' “ 1 i (14)


this is possible because of p 1 < p2 < . . . < pr
r — 1

Put Hj(a) = [nqfcc)] - h (j = 1, 2 , . . r - 1) and n,{or) = n - £ и/а).


7 = 1
Then
>/>(«)-. (is)
k=l
When vk = и*(а) (/: = 1, 2 , . . r), the event Bn{a) occurs; hence

C „(a)> (16)
П "* (« )!
k=l
But according to Stirling’s formula

____ ____ _ 2И/1(0а)-0(1п/?) ^

П »*(«)'•
k=1
Relations (12), (16) and (17) lead to

h (QJ - О (— < — log2 C„ («) < A (Q.). (18)


и и
Therewith we proved

T heorem 1. Let A b A2, . .A r be the possible outcomes o f an experiment,


Г
P(Ak) = Рь 0 < Pi < p 2 < . . . < Pr, X Pk = 1 and Sfi = (ръ р2, . . ., p r).
k =1
Let the experiment be repeated n times such that the repetitions are inde­
pendent o f each other. Put further

4k (<*) = rk . Q„ = (<h (a),-■■,?,(«))


I pJ
7=1
586 IN T R O D U C T IO N TO INFO R M A T IO N THEORY [IX, § 8

with a. > 1 and

- Z 4 k ( a ) lo g o —
p(a) = 2 k=i Pk- (19)
r
Let vk be the number o f experiments with outcome Ak, let n„ = [ ] p'kk and
k = 1
let Bn(oi) be the event nn > p(d)n. Now if Bn(a) occurs, the outcome o f the
sequence o f experiments may be characterized completely by a sequence o f
0 - 1-symbols o f length
not
*/i(Q I) = n/a (J2) + — — (20)

If, however, q > 0 and e > 0 are arbitrarily small positive numbers and n
is large enough, then n f lf Q J — e) 0 —1-symbols are not sufficient with
probability > Q.
Remarks. 1. IfQf) = may also be considered as an information
measure of the distribution difi; it has the following properties:
a) 0 < P % 9‘) < log2 r,
b) if = Sfi * Q, we have
/W (J^) = /W ( ^ ) + /M(Q).

2. It follows from Jensen’s inequality that

4 (^ )> lo g 2 - Í - . (21)
PW

§ 8. The definition of information for general distributions

If the random variable £ takes on denumerably many values x k with


probabilities pk = P ( £ = x k) (k = 1 , 2 , . . .), then we define the information
o f order a. contained in the value o f f, by the formulas

h (£) = y —— log2 ( Z Pk) for а ф 1 (1)


1 - a k =i

and

Z Pk loge— (2 )
k = l Pk
IX, § 8 ] D E FIN IT IO N OF IN FO R M A T IO N FOR GENER A L D ISTR IBU TIO N S 587

if the series on the right hand sides of (1) and (2) converge. The series (2)
does not always converge. For instance for

Pk = ck log2(k + l) (k = 1,2, . . . )

it is divergent; c is here a “normalizing” factor:

CO J

■?i n loS2(" + !)
However the series (1) converges always for a > 1. In case of discrete in­
finite distributions the measure of order a of the amount of information is
thus always defined if a > 1.
Let ») be a second random variable which takes on the same values as £,
but has a different probability distribution P(ri = x k) = qk (k — 1 , 2 , . . . ) .
Let the gain of information of order a, obtained if the distribution
Q — (<h, ?2 , • • •) is replaced by & = ( р ъ p2, . ..) , be defined by

4(^11-^) = -^-j-log2 i f ~ r j for a# 1. (3)


and by

h ( Q .\ \^ ) = £ Як logo — , (4)
k =1 Pk

if the series on the right hand side of (3) or (4) converges (which is not always
the case). The series (3) converges according to Holder’s inequality always
for 0 < a < 1.
Let now £ be a random variable having continuous distribution. We want
now to extend the definition of the measure of order a of the amount of
information, i.e. Ix( 0 >to th iscase- If we do this in a straightforward way we
obtain that this quantity is, in general, infinite. If for instance £ is uniformly
distributed on (0, 1), we know (cf. Ch. VII, § 14, Exercise 12) that the digits
of the binary expansion of £ are completely independent random variables
which take on the values 0 and 1 with probability — . Hence the exact
knowledge of the values of £ furnishes an information 1 + 1 + 1 + . . .
which is infinite. Or, to put it more precisely, the amount of information
furnished would be infinite if the value of f could be known exactly. Practi­
cally, however, a continuous quantity can only be determined up to a finite
number of decimal (or binary) digits.
588 IN T R O D U C T IO N TO INFO R M A T IO N THEORY [IX, § 8

We see thus that if we want to define I f f ) we encounter problems of di­


vergence. It seems reasonable to approach a continuous distribution by a
discrete one and to investigate, how the information associated with the
discrete distribution increases as the deviation between the two distributions
is diminished. Instead of f, we can for instance consider

£ - Ш (5)

where [x] denotes the largest integer not exceeding x. Suppose a > 0 and
let I f ^ f be finite (this is only a restriction for a < 1). It follows from
Jensen’s inequality that IftqN) is finite for every N and the inequality
la(An) ^ Ia( £ l ) + 10g2 N ( 6)
is valid. If 0 < a < 1 and if we put
, к \ к &+ 1 1

№ - o, ±1, ± 2 , . . . ; N = 1 , 2 , . . . ) ,
then we have the inequality
+ CO + CO

X p*N,k < N l~* £ p l k,


k = —со k = —oo

from which (6) follows; for a > 1 (6) can be proved in a similar manner.
When the distribution is continuous, the information Ixf N) tends to
infinity as N -» oo; however, in many cases the limit

< Ш = lim (?)


N~oо l°g2 N
exists. The quantity d f f ) will be called the dimension o f order a o f q. If not
only d = df£J exists but also the limit
Hm (IÁZN) - d log2IV) = /.,« « ). (8)
N-*- oo

the quantity dxdf ) will be called the d-dimensional information o f order a


contained in the value o f the random variable £,.
In the important case when the distribution of q is absolutely continuous,
we have the following
T heorem 1. Let q be a random variable having an absolutely continuous
distribution with density function /(x), which is supposed to be bounded.1 I f
1 This supposition is superfluous (cf. A. Rényi [27], [34]); we make it merely in
order to simplify the proof.
I X , § 8] D E F I N I T I O N O F IN F O R M A T I O N F O R G E N E R A L D I S T R I B U T I O N S 589

we put (N = 1, 2, . . .) and if we suppose that 4 (<4) is finite


(a > 0) then
Ь ш Ь Ц (-U„
N~ 00 log2Af
+ 00
i.e. the dimension o f order a. o f £ is equal to 1; i f the integral j f ( x f d x (a Ф 1)
—00
exists, then
+ 00

]im ( 4 (£„) - log2jV) = 4 д (£) = log2 ( f /(X)”Ac); (10)


N- oo 1 — a J
— 00

if
+ 00

I
—oo
'°& д ^ * •
exists, then
+ 00

lim ( 4 (4v) - log2N ) = 4 д ( 0 = ( / ( a ) log2 - Ц rfx. (11)


A -o o J J (X )
—00
к 1
P r o o f . Consider first the case a = 1. Put pNk = P £N = — and

f N (x) = NpNk for ~ ^ x < 1 (к = 0, ± 1 ,...) .

We have then
+00
oo I f 1
h (ív) - log2N = £ pNk log2 — ---- = f N (x) log2— — dx. (12)
k=- oo ^N 4 J /ivW
—oo
If
л:
Д а) = j' / ( m) í/ m (13)
— СО

is the distribution function of 4 we have

/* (* )= — — j- - Щ for A < x < A ± i . (14)


~N
590 IN T R O D U C T IO N TO IN FO R M A T IO N THEOR Y [IX, § 8

According to the well-known theorem of Lebesgue it follows that

lim f N(x) =f(x)


N -* a o

for almost every x. Now we shall use Jensen’s inequality in the following
form : If g(x) is a concave function and if p(x) and h{x) are measurable func-
b
tions with p{x) > 0 and j p(x)dx = 1, then we have
a

h b
J g(h(x)) p(x) d x < g ( \ h(x) p ( x ) d x ) . (15)
a a

This inequality can be proved in the same way as the usual form of Jensen’s
inequality. If we apply (15) with g(x) = log2 x, h(x) = — — and
J \x )
, 4 f(x) к к+ 1
p(x) = ----- for — < x < .......- ■■■-■,
PNk A N
then we get
k+\
N
J / И '°8г
к
dx S pm log, (16)

N
and, by summing over к
+ 00

J f(x) log2 ^ dx < h (&v) - log2iV, (17)


— 00

i.e.
+ 00

I A x ) log2 7 Y T dx < lim inf (IX^ N) - log2N). (18)


J J \X) 7V-00
— 00

We still have to prove the inequality


+ 00

lim sup (JAZ n) - log2N ) < ( A x ) log2 dx. (19)


T V -00 J J \X)
—00
IX, § 8] D E FIN IT IO N OF IN FO R M A T IO N FOR GEN ER A L DISTR IBU TION S 591

Iff(x) < К, we have also f N(x) < K; thus the functions f N(x) are uniformly
bounded. Hence, by the convergence theorem of Lebesgue,
+ Л +A

üm
N-*oo J
Г
f N(x) log2
JN \X)
dx =
J
f(x) log,ГJ\X)
dx (20)
-A -A
for every A > 0.
According to Jensen’s inequality, we have
(/+1)V—1 j j
£ P n * log2 -Tz---- ^ Pit iog2 — . (21)
к = IN M pNk Pu
+00
Since we have assumed that /,(£ 1) and f(x) log, —— dx are finite, we
J “ Ax)
— 00

can find for every e > 0 an A > 0 such that

J /(•*) log2 д ” dx < e (22a)


\x\>A
and

£ Pit log2----- < £• (22b)


\1\>A P it

(20), (22a) and (22b) show immediately that the theorem is true for я = 1.
Consider now the case я > 1. We get from Fatou’s lemma1 that
+00 +00
lim inf j f N (.X)a dx > j f(x )a dx. (23)
N-*oo J J
—00 —00
On the other hand, according to Jensen’s inequality,
+CO +00
J f x i x f d x <, J f ( x f dx. (24)
—oo —oo

It follows from (23) and (24) that

lim J f N( x y d x = J f ( x f d x , (25)
N-+ oo —oo —oo

hence (10) is proved for я > 1. We have still to examine the case 0 < я < 1.
1 Cf. F. Riesz and B. Sz.-Nagy [1], p. 30.
592 IN T R O D U C T IO N TO IN F O R M A T IO N THEORY (IX, §8

According to Jensen’s inequality, we have now


+ 00 +00

]' f N( x f d x > J f { x f d x . (26)


— 00 — 00

On the other hand, according to the convergence theorem of Lebesgue, we


have for every A > 0

lim J f N (x f dx = j f ( x f dx < J f { x f dx. (27)


N-+CO —A —A — 00

Jensen’s inequality gives


( /+ 1)ЛГ—1
E N * -1P°Nk < Pli ■ (28)
k = lN

Since we supposed / a(<^i) to be finite, we can find for every e > 0 an A > 0
such that
I p\, < ^ (2 9 )
\1\>A
From (27), (28) and (29) we conclude that (25) remains valid for 0 < a < 1.
Theorem 1 is thus completely proved.
The quantities
+ 00

Ад ( 0 = J fix) log2 dx (30)


—oo
or
+ 00

A ,i(0 = log2J —00


dx (a > 0, a ф 1) (31)

are called (one-dimensional) information o f order 1 or order a, associated


with the random variable f Ад(£) is called also the entropy of the random
variable (or of the density function f{x)). The properties of these quantities
differ in some respect from the properties of the corresponding quantities
for discrete distributions. Thus for instance / 1Д(£) and 7a l(^) can be nega­
tive. Another difference is that these quantities are not invariant with respect
to any one-to-one transformation of the variable. E.g. for c > 0 we have

4 , i ( c 0 = 4 , i ( £ ) + log2 c. (32)
IX, § 8] D EFIN IT IO N OF IN FO R M A T IO N FOR GEN ER A L DISTR IBU TION S 59 3

These facts are explained by realizing that 7al(£) is the limit of a diJference
between two informations.
All what we have said can be extended to the case of r-dimensional ran­
dom vectors (r — 2, 3 ,. ..) with an absolutely continuous distribution. Let
f ( x x, . . xr) be the density function of the random vector (£(1), . . ., £(r)).
Put ^ ^ (k = 1,2,..., r). If /„((ЙЧ . . ., fir))) is finite,1 we have

,im 4 ( « P , . . ; « 9 (33)
N- 00 log2A
The dimension of the (absolutely continuous) distribution of a random vector
of r components is thus equal to r; the notion of dimension in information
theory is thus in accordance with the notion of geometrical dimension.
Furthermore, for a > 0, a Ф 1 we have

lim [/„(($}*,. . . , &>)) - r log-, N] = 7ar ( ( |(1>,. . . , £w)) (34a)


N-*co
with
+00 +CO

hr ((s(1), • • •, £(r) ) = Y-a l0g2J ‘ ' J ^ Xl’"


— 00 —00
x^ dXl ■• dx' (34b)
and
lim [h ((flj>, ..., Ф ) - r logo N ] = I Xr ( (6 l\ . . t lr>)) (35a)
N-*co
with
+ 00 +00

h ,r ((£(1), • • •, £(r))) = ( • • • ( /(-ж-1. • • •>X r ) Iog2 J . ------------г dx 1 ... dxr, (35b)


J J J ( X 1> • ■ •! X r)
—00 —00

provided of course that the integrals exist.


The quantities 7, r((£(1), • • •, £(r))) and 7l r((£(1), . . ., £(r))) defined by
(34) and (35) are called r-dimensional measure o f order a, and o f order 1,
o f the amount o f information (or entropy) associated with the distribution
o f the random vector ( f J), . . ., £(r)).
Consider now briefly the notion of gain of information in the case of gen­
eral distributions. Let ,VJ and Q be any two probability measures on the
measurable space [Í2, rA). Suppose that Q is absolutely continuous with
1 7а(({др , • • • , ^n )) denotes the entropy of order a of the distribution of the random
vector ( 4 }>, . . . , &’).
594 IN T R O D U C T I O N T O I N F O R M A T I O N T H E O R Y [IX , § 8

respect to &>. Then for every set A £


Q(A) = J h{m)dSP, (36)
A
dQ
where Mm) > 0 is the Radon-Nikodym derivative and
йгУд

I h(m) d£ß — 1.
n
The gain o f information o f order a (or o f order 1) obtained if is replaced by
Q is defined1* by the formulas

4 (Q 11^ ) = — j_ J log2 j h{m)adSA (37a)


ß
or
/i (Q11 У°) = j' h(m) log2 h(m)dffi. (38a)
я
Formulas (37a) and (38a) remain valid in the discrete case too. The (ordinary)
discrete distributions = (pu . . ,,p„) and Q = (дъ . . ., q„) may indeed
be considered as measures defined on an algebra of events of n elements
ш1; oj2, . . ., m„ with &'(mk) - pk and Q.(mk) = qk (k = 1 , 2 , . . . , rí). The con­
dition that Q is absolutely continuous with respect to d?5is here automatically
fulfilled whenever pk > 0 {к = 1 , 2 , . . . , « ) and we have

K°>h) = — (k= 1,2,...,«).


Pk

The formulas

h (Q II ~ 7"log2 ( £ - £ r and 7i(Q II loS2 —

appear thus as particular cases of (37a) and (38a).


If Q is the set of real numbers, Л the set of the Borel-measurable subsets
of Q and if Sfi and Q are absolutely continuous with respect to Lebesgue meas­
ure, there exist two functions p(x) and q(x) such that
^ ( A ) = j”p(x) dx, Q(A) = ( q(x) dx for A £ tj€.
A A
1 One could deduce Formulas (37a) and (38a) from a certain number of postulates
as was done in the discrete case (§ 6 ). This will not be dealt with here.
IX, § 8] D E FIN IT IO N OF IN FO R M A T IO N FO R GEN ER A L DISTR IBU TION S 595

The measure Q is absolutely continuous with respect to if for every x


such that p(x) = 0 we have q(x) = 0. Then

(39)
In this case we obtain for the gain of information from (37) and (38)
+ 00

4 ( Q l l - ^ ) = ^ 3 y log2 ( J -p (* y - i d*\ for a ^ 1 (37b)


—00
and
+ 00

/1 (Q li 3 s) = J q{x) log2 dx. (38b)


—00
The gain of information for absolutely continuous distributions can be
obtained from the gain of information for discrete distributions by a limit
process:

T heorem 2. Let d?5 and Q be two distributions, absolutely continuous with


respect to Lebesgue measure, Ü. absolutely continuous with respect to Let
p(x) and q{x) be the respective density functions o f ffi and Q. We suppose that
p(x) and q(x) are bounded.1 Further if
k+1 k+1
N N
PNk = J p{x)dx,
к
qNk =
к
J q{x)dx (k = 0, ± 1 , . . N = 1,2,...),
N N

and Qn denote the distributions {pNk} an(l if a is positive and


к (Qi II ^ i ) is finite ( which means a restriction only for a > 1), then we have
\\т1Мы\\&м) = Ш \ № , (40)
N-+ oo

where I f f l || S :i) is defined by (37b) fo r a Ф 1 and by (38b) for a = 1, pro­


vided that IfQ II fiL) exists.

P roof . This is similar to that of Theorem 1. We define pN(x) and qN(x)


к к 4" 1
by Px(x) = Npm for — < x < (k = 0, +1, . . .) and qN{x) = NqNk

for — < x < — (k = 0, ± 1 , . . . ) . Let further hN(x) = .


N N qN{x)
1 This supposition is superfluous and serves only to simplify the proof.
596 IN T R O D U C T I O N T O IN F O R M A T I O N T H E O R Y [IX , § 8

Consider first the case 0 < a < 1. It is clear that pN(x) -> p(x) and
qN(x) q(x) almost everywhere; further
+00
Ш ы 11& n) = log2 J qN( x f pN(x)l~adx. (41)
—00
According to Lebesgue’s theorem we have for every A > 0
+A +A
lim f qN{x)apN(xf~* dx = [ q(x)ap{x)x~a dx. (42)
ЛГ- t * -A -A

Since
j q N( x ) xp N( x y ~ * d x
1*1>A
can be made arbitrarily small for a sufficiently large A, uniformly in N ,
Theorem 2 is proved for 0 < a < 1.
Now suppose a > 1. We have, according to Jensen’s inequality,

7 q(*)“ , ^ 7 Ям(хУ
—oo
J m ** -J —oo
m
and on the other hand by Fatou’s lemma

—oo —oo
' T j / Í 7 í £ r ' jx 4 <44)
which settles the case a > 1.
Finally, let a = 1. We have
+ 00

A (Q vll^v) = j qN(x) log2 ~ ^ j d x . (45)


—oo

Since the function x log2 x is convex, Jensen’s inequality gives


+00 +00
j <?(*) log2 ^ dx > j' qH(x) log2 dx. (46)
—00 —00
IX, § 9) INFO R M AT IO N THEORY OF LIMIT THEOREM S 597

log2e
From je log2 X > ----------- we deduce

qN(x) log2 + l0g2 e Pn (x ) > 0.


P n (x ) e

Hence, according to Fatou’s lemma,


+ 00 + 00

liminf j qN(x) log2 - ^ dx > ( q(x) log2 ^ ß - d x . (47)


N-~ao J ' Pn (X) J p(x)
—oo —oo
(46) and (47) lead to
+ 00 + 00

lim
N-*oo J
Г qN(x) log2 P { ) dx = (
n x J
q(x) log2
p (x )
dx (48)
— 00 — 00

and Theorem 2 is herewith proved.

§ 9. Information-theoretical proofs of limit theorems

We have seen that for complete discrete distributions the relation


Ia(Q_ И.5°) > 0 holds, where the equality sign occurs only if Sfi and Q are
identical. We shall now prove the following property: If {Q.n} is a sequence
of discrete distributions such that lim Ia(Q.„ 113 s) = 0, then the distri-
«-»-00

butions U„ converge to the distribution Ä Thus we have the following

T heorem 1. I f & = (pl3 . . ,,pr) and Q„ = (qnu . .., qnr) are probability
distributions and i f
lim IfQn 11*9$) = 0 (a > 0), (1)
N -*-00

then we have also


Hm qnk = Pk • (2)
«-*-00

P roof . If (2) does not hold, there exists a subsequence nk < n2 < . ■. <
< ns < . . . of the integers with
Г
Hm qnsh=p'k and £ (p'k - pkf # 0. (3)
s-+ oo к —1
598 IN T R O D U C T I O N T O IN F O R M A T I O N T H E O R Y [IX , §9

Г
Obviously, Z Pk = 1; further if we put J f ' = {p[ , . . . , p'r), it follows
from (3) that
lim № п,\\& ) = 1А & '\\& ). (4)
s -* 00

According to (1), / al(d?3' H d?3) = 0 , but this is possible only if d?3' = ffi, i.e.
if Pk = Pk f°r к = 1, 2 , . . r, which contradicts (3). Thus Theorem 1 is
proved.
As an application of this theorem we shall now prove a theorem about
ergodicity of homogeneous Markov chains, which, essentially, is contained
in Theorem 1 of Chapter VIII, § 8. We give here a new proof of this result,
only to show how the methods of information theory may be used to prove
theorems on limit distributions.
T heorem 2. Let us consider a homogeneous Markov chain with a finite
number o f states A0, . . ., AN\ let the probability o f transition from Aj to A k in
n steps be denoted by р $ (n = 1 , 2 , . . . ) . For pffl we write simply pJk. I f there
exists an integer s > 1 such that pjl > 0 for j, к = 0 , 1 , . . . , N, then the
equations

j=o
Z X jPjk = Xk (к = 0 , 1 , . . . , N) (5a)

have a system o f solutions xk = p k {к = 0, 1,. . ., N) with

pk > 0 (к = 0, 1,. .., N) and £ pk = Í, (6)


k=0
and we have
lim j $ = pk (j,k = 0,l,-..,N )- (7)
I t - * CO

P roof . The existence of a solution xk = pk > 0 (к = 0 , 1 , . . . , N) of the


system of equations (5a) can be proved directly, without probabilistic consid­
erations, in the following manner:1 The determinant of the system (5a)
is zero, since Z
N
k =l
Pjk — 1 0 = 0 , 1 ,. . ., N ) ; . thus the system has a non-
trivial solution (x0, xl f . . . , xN). If (5a) is fulfilled, we have

\xk \ < Y Pjk\xj\. (5b)


1=0

1 We have here a particular case of the Perron-Fi obenius theorem; cf. F. R. Gant-
macher [1], Vol. 2, p. 46.

V
IX , § 9 ] I N F O R M A T I O N T H E O R Y O F L IM IT T H E O R E M S 599

If we add the inequalities (5b) for к = О, I , .. N, we obtain


N N
Z 1**1^ j=0
k=0 Z l*/l-

But this inequality is an equality; hence the same must hold for every in­
equality (5b), i.e.
N
1**1 = Z Pik\xj\. (5c)
7=0

Hence (5a) possesses a nontrivial nonnegative system of solutions, say


(ри,р ъ . . .,P n)• If we multiply (5a) by pkl, add the equations obtained for
к = 0, 1 , . . ., N and repeat the whole procedure n times, then we obtain

Z P j P p ' ^ i PkPP
j=0 k=0
We find then by induction that

Z PjPfk! = Pk for h = 1 ,2 , . . . . (5d)


7= 0

Since (5d) is valid for h = s, it follows that no pk can be zero. Because of the
homogeneity of the equations, (5a) has thus a positive system of solutions
N
p0,P i ,. ..,P n with Z Pk — l-1* Put
k=0

= (Po,.. .,pN) and = (p%\ . . ., pf$).


We consider now the quantities IJ& W || and prove the relation
lim 4 {Sftp 11 = 0, (8)
n-*-cc

then Theorem 2 follows because of Theorem 1. The value of a is immaterial


for the proof. We can e.g. assume a > 1. By assumption
N
Z PjPjk=Pk■ (9)
7= 0

If we put nik = PJPjk. t we have


Pk

Z nik = 1. ( 10)
7=0

1 This solution is unique; this is a corollary of (7) and need not be proved here
separately.
600 INTR O D U C T IO N TO INFO R M AT IO N THEORY [IX, § 9

Furthermore, by definition

Pjk+1) = X pffPik, (11)


/ =0
hence
i íí IN _(л) \* ‘
/ . И " +1) 11^) = - — rlo g 2 Y P k I Ы • (12)
a —i Lfc=o w=o Pi I .

Because of (10), Jensen’s inequality leads to

feiH - (13)
N
Since £ ptk = 1, it follows that
it = 0

N N
Z Pkщк = P iY Pik = Pi• ( 1 4)
fc = 0 *=0

If we multiply the inequality (13) by p k and then take the sum over k, we
obtain

/ « ( ^ " +1)\ \ ^ < Г Л ^ ) Л)\ \ ^ ) (n - 1 , 2 , . . . ) . (15)


II (и = 1, 2,. . .) is a monotone decreasing sequence of nonnega­
tive numbers; it has thus a limit

lim /„ ( ^ " M l^ 3) = y. (16)


n-*- 00

It remains to show that у = 0. Choose a subsequence nx < . . . < n, < . . .


of the integers such that the limits

lim р%д = qJk(k = 0, \ , N )


t-*■CO
N
exist. Then X qjt = 1 and by (11)
1=0

lim p%'+s) = X qn Р$ = q'jk•


Í-+-00 /= 0

Obviously
N AT
I q'jk = / =L0 «b = i-
* =0
IX , § 9] INFO R M AT IO N THEORY OF LIMIT THEOREM S 6 01

Let Qj and Q'j denote the distributions (qJ0, . . ., qjN) and (q'j0, . . ., q'JN),
respectively. If we put 71$ = PjP^/Pk, Jensen’s inequality implies

h m & ) = - iog2 [ z pk í z — ^ (ív)


a —1 U=o w=0 Pi .

by the same argument that led to (15). But, because of (16), the relations

h ( Q ' j \ \ & ) = Hm h ( ^ ' +I) II & ) = V (18a)


/-►00

and
lim 4 ( ^ " ' ) | l ^ ) = V (18b)
/-►00

hold; hence there is equality in (17). Since (17) is derived from Jensen’s
inequality, it follows that equality can hold only if q,, = Xp, (/ = 0, 1,. . .,N ).
N N
Since Y j Ял — Y Pi = 1j we must have Я = 1; consequently, Qj = Sf1.
1= 0 /= 0
But then IJSlj И — IJ.& II = 0, hence by (18b) у = 0. Theorem 2
is herewith proved.
The idea to prove theorems on limit distributions by means of informa­
tion theory is due to Yu. V. Linnik. He proved in this way the central limit
theorem under Lindeberg conditions by using Shannon’s entropy; he proved
the convergence of the distribution with the density function p„(x) to the
normal distribution by showing that

+00
lim j p„ (x) log2 ~ ~ dx = 0, (19)
00 J <P(x)
— 00

where <p(x) is the density function

1
<P(x) = —7= e 2
V 2л

of the normal distribution. Linnik’s proof can be simplified if we use the


gain of information of order 2 instead of Shannon’s entropy; but even so
the proof is too intricate to be reproduced here. However, we can briefly
indicate the principle of the method.
Let £1; £2 , . . . . . be independent random variables having the same
absolutely continuous distribution with zero expectation and unit variance
and let $p_n(x)$ be the density function of $\dfrac{\xi_1 + \xi_2 + \cdots + \xi_n}{\sqrt{n}}$. Then

$$\int_{-\infty}^{+\infty} x^2 p_n(x)\, dx = 1.$$

Relation (19) can be written as

$$\lim_{n\to\infty} \int_{-\infty}^{+\infty} p_n(x) \log_2 \frac{1}{p_n(x)}\, dx = \log_2 \sqrt{2\pi e}. \qquad (20)$$

It is easy to show that

$$\log_2 \sqrt{2\pi e} = \int_{-\infty}^{+\infty} \varphi(x) \log_2 \frac{1}{\varphi(x)}\, dx, \qquad (21)$$

thus (19) is equivalent to

$$\lim_{n\to\infty} \int_{-\infty}^{+\infty} p_n(x) \log_2 \frac{1}{p_n(x)}\, dx = \int_{-\infty}^{+\infty} \varphi(x) \log_2 \frac{1}{\varphi(x)}\, dx. \qquad (22)$$

To say that the distribution with the density function $p_n(x)$ tends to the normal
distribution means, therefore, that the entropy of this distribution tends
to the entropy of the normal distribution. But we can prove that for a density
function $p(x)$ such that

$$\int_{-\infty}^{+\infty} x^2 p(x)\, dx = 1 \qquad (23)$$

the inequality

$$\int_{-\infty}^{+\infty} p(x) \log_2 \frac{1}{p(x)}\, dx \le \int_{-\infty}^{+\infty} \varphi(x) \log_2 \frac{1}{\varphi(x)}\, dx \qquad (24)$$

holds, since because of (21) and (23), (24) is equivalent to the well-known
inequality

$$\int_{-\infty}^{+\infty} p(x) \log_2 \frac{p(x)}{\varphi(x)}\, dx \ge 0. \qquad (25)$$
The statement of the central limit theorem may therefore be expressed as
follows: the entropy of the standardized sum of independent random variables
tends, as the number of the variables tends to infinity, to the maximum
of the entropy of all random variables with unit variance. Thus the
central limit theorem of probability theory is closely connected with the
second law of thermodynamics.¹
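A crude Monte Carlo illustration of (20) (a sketch under stated assumptions, not Linnik's argument): the entropy of the density of the standardized sum of n uniform summands is estimated by a histogram, and the estimate climbs towards log₂√(2πe) ≈ 2.05 bits. The sample size, the bin count, and the choice of uniform summands are all arbitrary.

```python
# Histogram estimate of the entropy of the standardized sum; illustrative only.
import numpy as np

rng = np.random.default_rng(1)
target = np.log2(np.sqrt(2 * np.pi * np.e))   # entropy of phi, about 2.047 bits

def entropy_of_sum(n, samples=200_000, bins=200):
    # summands uniform on (-sqrt(3), sqrt(3)): zero mean, unit variance
    s = rng.uniform(-np.sqrt(3), np.sqrt(3), (samples, n)).sum(axis=1) / np.sqrt(n)
    hist, edges = np.histogram(s, bins=bins, density=True)
    dx = edges[1] - edges[0]
    hist = hist[hist > 0]
    return -np.sum(hist * np.log2(hist)) * dx  # crude estimate of -∫ p log2 p dx

for n in (1, 2, 4, 8, 16):
    print(n, entropy_of_sum(n), "target:", target)
```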

§ 10. Extension of information theory to conditional probability spaces

In the present section we consider particular conditional probability spaces
$[\Omega, \mathscr{A}, \mathscr{B}, P(A \mid B)]$ only. Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n, \ldots\}$ be a denumerable
set and $\mathscr{A}$ the class of all subsets of $\Omega$. Let further $p_k$ $(k = 1, 2, \ldots)$ be a
sequence of nonnegative numbers with $\sum_{k=1}^{\infty} p_k = +\infty$. We put

$$P(A) = \sum_{\omega_n \in A} p_n \quad \text{for } A \in \mathscr{A}.$$

Let $\mathscr{B}$ be the set of those subsets $B$ of $\Omega$ for which $P(B)$ is finite and positive.
For $A \in \mathscr{A}$ and $B \in \mathscr{B}$, $P(A \mid B)$ is defined by

$$P(A \mid B) = \frac{P(AB)}{P(B)}. \qquad (1)$$

We shall indicate by some examples how the concepts of information
theory can be extended to conditional probability spaces.

If $\xi$ is a discrete random variable defined by $\xi(\omega) = x_n$ for $\omega \in A_n$
$(n = 1, 2, \ldots)$, where $A_n \in \mathscr{A}$, $\sum_{n=1}^{\infty} A_n = \Omega$, $A_n A_m = \emptyset$ for $n \ne m$, then the
numbers $P(A_n \mid B)$ $(n = 1, 2, \ldots;\ B \in \mathscr{B})$ form the conditional distribution
of $\xi$ with respect to the condition $B$. Consider the entropy of this distribution,
i.e.

$$I_1(\xi \mid B) = \sum_{n=1}^{\infty} P(A_n \mid B) \log_2 \frac{1}{P(A_n \mid B)}. \qquad (2)$$

If $\Omega_N$ is the set $\{\omega_1, \omega_2, \ldots, \omega_N\}$, then $\Omega_N \in \mathscr{B}$ for $N \ge N_0$. We define the
entropy $I_1(\xi)$ of $\xi$ (in other words the information contained in the value
of $\xi$) by

$$I_1(\xi) = \lim_{N\to\infty} I_1(\xi \mid \Omega_N), \qquad (3)$$
¹ If the distributions considered concern the velocities of the molecules of a gas,
the condition

$$\int_{-\infty}^{+\infty} x^2 p_n(x)\, dx = 1$$

means that the total kinetic energy of the gas is constant.

if this limit exists and is finite. If it does not exist, the information in question
will be characterized by the following two quantities:

$$\liminf_{N\to\infty} I_1(\xi \mid \Omega_N) = \underline{I}_1(\xi)$$

and

$$\limsup_{N\to\infty} I_1(\xi \mid \Omega_N) = \overline{I}_1(\xi).$$

Consider for instance the binary representation of a positive integer.
We show that each of the digits of this representation contains exactly one
bit of information, just like a digit of the binary expansion of a real number
lying in the interval (0, 1). Let $\Omega$ be the set of positive integers, $\omega_n = n$,
and $p_n = 1$ $(n = 1, 2, \ldots)$. Consider as above the conditional probability
space $[\Omega, \mathscr{A}, \mathscr{B}, P(A \mid B)]$. Let $\varepsilon_k(n)$ denote the $k$-th digit in the binary
expansion of $n$; we have

$$n = \sum_{k=0}^{\infty} \varepsilon_k(n)\, 2^k, \qquad (4)$$

where $\varepsilon_k(n)$ is equal either to 0 or to 1. Obviously

$$\varepsilon_0(n) = \begin{cases} 1 & \text{if } n \text{ is odd}, \\ 0 & \text{if } n \text{ is even}. \end{cases}$$

If $\Omega_N = \{1, 2, \ldots, N\}$, we have

$$I_1(\varepsilon_0(n) \mid \Omega_N) = \frac{\left[\frac{N}{2}\right]}{N} \log_2 \frac{N}{\left[\frac{N}{2}\right]} + \frac{N - \left[\frac{N}{2}\right]}{N} \log_2 \frac{N}{N - \left[\frac{N}{2}\right]},$$

and by (3)

$$I_1(\varepsilon_0(n)) = 1. \qquad (5)$$

It follows in the same way that

$$I_1(\varepsilon_k(n)) = 1 \qquad (k = 1, 2, \ldots). \qquad (6)$$
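The convergence in (5) is easy to check by direct counting; a short sketch (not from the book; the function name is illustrative):

```python
# Conditional entropy of the k-th binary digit under the uniform law on {1..N}.
import math

def digit_entropy(N, k=0):
    ones = sum(1 for n in range(1, N + 1) if (n >> k) & 1)
    probs = [ones / N, (N - ones) / N]
    return -sum(p * math.log2(p) for p in probs if p > 0)

for N in (10, 100, 1000, 10_000):
    print(N, digit_entropy(N, 0), digit_entropy(N, 3))   # both tend to 1 bit
```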


Take now an example for which the limit (3) does not exist. Consider again
the binary expansions of the positive integers; let $[\Omega, \mathscr{A}, \mathscr{B}, P(A \mid B)]$ be
the same conditional probability space as in the previous example.
Let $\eta(n)$ be the largest exponent of 2 in the binary expansion of $n$; hence

$$n = \sum_{k=0}^{\eta(n)} \varepsilon_k(n)\, 2^k \quad \text{with } \varepsilon_{\eta(n)}(n) = 1.$$

If now $2^r \le N < 2^{r+1}$, that is if $r = [\log_2 N]$, then

$$P(\eta(n) = j \mid \Omega_N) = \frac{2^j}{N} \quad \text{for } j \le r - 1$$

and

$$P(\eta(n) = r \mid \Omega_N) = \frac{N - 2^r + 1}{N}.$$

If $N$ tends to infinity through values for which

$$\lim_{N\to\infty} \frac{2^{[\log_2 N]}}{N} = \gamma \qquad \left( \frac{1}{2} \le \gamma \le 1 \right),$$

then we have

$$\lim_{N\to\infty} I_1(\eta(n) \mid \Omega_N) = 2\gamma + \gamma \log_2 \frac{1}{\gamma} + (1-\gamma) \log_2 \frac{1}{1-\gamma} = L(\gamma). \qquad (7)$$

Thus the limit $L(\gamma)$ depends on $\gamma$. $L(\gamma)$ is a concave function of $\gamma$ and we
have $L\left(\frac{1}{2}\right) = L(1) = 2$. Furthermore, $L(\gamma)$ takes on its maximum for $\gamma = \frac{4}{5}$
and we have

$$\max_{\gamma} L(\gamma) = L\left(\frac{4}{5}\right) = \log_2 5.$$

Consequently, $\underline{I}_1(\eta(n)) = 2$ and $\overline{I}_1(\eta(n)) = \log_2 5$. The information
$I_1(\eta(n))$ is not defined in this case; nevertheless, it can be stated that the
number of digits in the binary representation of an integer contains at least
2 bits and at most $\log_2 5$ bits of information.
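A numerical illustration of (7) (a sketch, not in the text): evaluating the conditional entropy directly from the two probability formulas above, along N chosen so that 2^[log₂N]/N ≈ γ, reproduces L(γ), with the maximum log₂5 at γ = 4/5.

```python
# Entropy of eta(n) given Omega_N versus the limit function L(gamma).
import math

def eta_entropy(N):
    r = N.bit_length() - 1                   # r = [log2 N]
    probs = [2**j / N for j in range(r)]     # P(eta = j | Omega_N), j <= r-1
    probs.append((N - 2**r + 1) / N)         # P(eta = r | Omega_N)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def L(g):
    return 2*g + g*math.log2(1/g) + (1 - g)*math.log2(1/(1 - g))

for gamma in (0.6, 0.8, 0.95):
    N = round(2**20 / gamma)                 # keeps [log2 N] = 20
    print(gamma, eta_entropy(N), L(gamma))
print("max:", L(0.8), math.log2(5))          # both equal log2 5
```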

§ 11. Exercises

1.¹ a) How much information is contained in the licence number of a car, if this
consists of two letters and four decimal digits? (22.6)
b) How much information is needed to express the outcome of a game of "lotto",
in which 5 numbers are drawn at random from the first 90 numbers? (25.4)
c) What amount of information is needed to describe the hand of a player in bridge
(each player having 13 cards from 52)? (39.2)
d) How much information is contained in a Hollerith punch-card which has 80
columns and in each column one perforation in one of 12 possible positions? (286.8)

¹ The numbers in parentheses are the solutions.
e) How much information is contained in a table of values of a function consisting
of 50 pages with 40 lines per page and 25 decimal digits on each line (numbers for
identification of the lines not counted)? (166 095.5)
f) How much information is contained in a linear macromolecule consisting of
100 000 single molecules, if there can occur one of four different molecules at every
place? (200 000)
g) How much information is transmitted per second by a television broadcasting
station which emits 25 images per second each of which consists of 520 000 points,
each black or white? (13 000 000)
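All seven answers are of the form log₂ of the number of equally likely alternatives; the sketch below recomputes them. The alphabet size 25 in a) is an assumption chosen because it reproduces the printed 22.6 (26 letters would give 22.7); everything else follows directly from the statements.

```python
# Recomputing the answers of Exercise 1 as log2 of a count of alternatives.
from math import comb, log2

print(log2(25**2 * 10**4))        # a) two letters (assumed 25), four digits: ~22.6
print(log2(comb(90, 5)))          # b) lotto: ~25.4
print(log2(comb(52, 13)))         # c) bridge hand: ~39.2
print(80 * log2(12))              # d) Hollerith card: ~286.8
print(50 * 40 * 25 * log2(10))    # e) table of values: ~166 096
print(100_000 * log2(4))          # f) macromolecule: 200 000
print(25 * 520_000 * log2(2))     # g) television: 13 000 000
```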

2. a) Let some integer n (1 ≤ n ≤ 2000) be divided by 6, 10, 22, and 35 and let
the remainders be given, while we assume that the remainders are compatible. How
much information is thus given concerning the number n?

Hint. The information is equal to $\log_2 2000 \approx 10.96$ (i.e. we get full information
on n). In fact the remainders mentioned determine n modulo the least common
multiple of 6, 10, 22 and 35, which is equal to 2310 > 2000; hence n is uniquely
determined.
b) Let the number n be expressed in the system of base (−2), i.e. put

$$n = \sum_{k=0}^{\infty} b_k (-2)^k,$$

where $b_k$ can take on the values 0 or 1 only. How much information on n is contained
in $b_k$?

Hint. Put $N = \sum_{i=0}^{\left[\frac{r-1}{2}\right]} 2^{2i+1}$; then

$$\prod_{k=0}^{r} \left( 1 + x^{(-2)^k} \right) = \sum_{n=-N}^{2^{r+1}-N-1} x^n.$$

It follows that for fixed r the numbers $-N, -N+1, \ldots, 2^{r+1}-N-1$ can all
uniquely be represented in the form $\sum_{k=0}^{r} b_k (-2)^k$ with $b_k = 0$ or 1; there are thus
exactly $2^{r+1}$ numbers which can be expressed in this form and every "digit" $b_k$ contains
one bit of information with respect to n.
c) Let U(n) (and V(n)) denote the number of different (and of all) prime factors
of n. How much information with respect to n is contained in the value of the
difference V(n) − U(n)?

Hint. As is known,¹ for every positive integer k the asymptotic density of the
numbers n with V(n) − U(n) = k exists. Let this density be denoted by $d_k$; then

$$\sum_{k=0}^{\infty} d_k z^k = \prod_{p} \left( 1 - \frac{1}{p} \right) \left( 1 + \frac{1}{p - z} \right),$$

where p runs through all primes. Let $N_k(x)$ denote the number of integers n smaller
than x with V(n) − U(n) = k; then

$$\lim_{x\to\infty} \frac{N_k(x)}{x} = d_k$$

$(k = 0, 1, \ldots)$; $d_0$ is the density of the square-free numbers and is equal to $\frac{6}{\pi^2}$. Thus
the amount of information in question is equal to $\sum_{k=0}^{\infty} d_k \log_2 \frac{1}{d_k}$.

¹ Cf. A. Rényi [16].

3. a) Let the real number x (0 ≤ x < 1) be represented in the form

$$x = \sum_{n=1}^{\infty} \frac{\varepsilon_n(x)}{q^n},$$

where q is a positive integer ≥ 2, and $\varepsilon_n(x)$ can take on the values $0, 1, \ldots, q-1$
$(n = 1, 2, \ldots)$. How much information with respect to x is contained in the value
of $\varepsilon_n(x)$?
b) Expand x (0 ≤ x < 1) into the Cantor series

$$x = \sum_{n=1}^{\infty} \frac{\varepsilon_n(x)}{q_1 q_2 \cdots q_n},$$

where $q_1, q_2, \ldots, q_n, \ldots$ are positive integers ≥ 2, and $\varepsilon_n(x)$ can take on the values
$0, 1, \ldots, q_n - 1$. How much information with respect to x is contained in the value
of $\varepsilon_n(x)$?
c) Expand x (0 < x < 1) into a regular continued fraction

$$x = \cfrac{1}{a_1(x) + \cfrac{1}{a_2(x) + \cdots}},$$

where each $a_n(x)$ can be an arbitrary positive integer. How much information about
x is contained in the value of $a_n(x)$?

Hint. Let $m_n(k)$ denote the measure of the set of those x for which $a_n(x) = k$.
As is known,¹

$$\lim_{n\to\infty} m_n(k) = \log_2 \left( 1 + \frac{1}{k(k+2)} \right) = \pi_k.$$

Hence

$$\lim_{n\to\infty} I_1(a_n(x)) = \sum_{k=1}^{\infty} \pi_k \log_2 \frac{1}{\pi_k}.$$

Let it be remarked that contrary to Exercises 3.a) and 3.b), the random variables $a_n(x)$
in this example are not independent; the total information contained in a sequence
of several digits $a_n(x)$ is not equal to the sum of the informations contained in the
individual digits.
4. Let a differentiable function f(x) be defined in [0, A] and suppose f(0) = 0 and
|f′(x)| ≤ B. Find an upper bound for the information necessary in order to determine
the value of f(x) at every point of [0, A] with an error not exceeding ε > 0.

Hint. Put $x_k = \frac{k\varepsilon}{B}$ $\left( k = 0, 1, \ldots, \left[ \frac{AB}{\varepsilon} \right] \right)$, $x_{\left[\frac{AB}{\varepsilon}\right]+1} = A$. Let the curve of
f(x) be approximated by a polygonal line y = φ(x) which can have for its slope in

¹ Cf. e.g. A. J. Khinchin [6].
each of the intervals $(x_k, x_{k+1})$ either +B or −B. If φ(x) is already defined for
$0 \le x \le x_k$, then let the slope in $(x_k, x_{k+1})$ be so chosen that $|f(x_{k+1}) - \varphi(x_{k+1})| \le \varepsilon$.
Obviously, this is always possible. Since f(x) − φ(x) is in every interval $(x_k, x_{k+1})$
monotone, the inequality $|f(x) - \varphi(x)| \le \varepsilon$ holds in the open intervals $(x_k, x_{k+1})$
$(k = 0, 1, \ldots)$ as well. Clearly, the number of possible functions φ(x) is equal
to $2^{\left[\frac{AB}{\varepsilon}\right]+1}$. In order to determine f(x) up to an error ε there suffice therefore
$\frac{AB}{\varepsilon} + 1$ bits of information.

5. We have n apparently identical coins. One of them is false and heavier than the
others. We possess a balance with two scales but without weights. How many
weighings are necessary to find the false coin?

Hint. The amount of information needed is equal to $\log_2 n$. Only weighings with
an equal number of coins in both scales are worth while to be performed. 3 cases
are possible: equilibrium, right scale heavier, and left scale heavier. One weighing
gives thus at most $\log_2 3$ bits of information; hence at least

$$\left\{ \frac{\log_2 n}{\log_2 3} \right\} = \{ \log_3 n \} \text{ weighings}$$

are needed ({x} denotes the smallest integer greater than or equal to x). It is easy to see that
this number of weighings is sufficient. In fact, let k be defined by $3^{k-1} < n \le 3^k$.
At the first weighing we put in each of the scales $\left\{\frac{n}{3}\right\}$ coins. We know then to which
of the three sets, each containing at most $3^{k-1}$ coins, the false coin belongs. Proceeding
in this manner, the false coin will be found after at most k weighings.
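The ternary search of the hint can be transcribed directly; a sketch (function names and the example are illustrative): split the coins into three nearly equal groups, weigh two of them, and recurse into the group singled out by the result.

```python
# Ternary search for the heavy coin, as in the hint of Exercise 5.
def weighings_needed(n):
    """Smallest k with 3**k >= n, i.e. {log_3 n}."""
    k = 0
    while 3**k < n:
        k += 1
    return k

def find_heavy(weights):
    """Locate the unique heavier coin; returns (index, weighings used)."""
    coins = list(range(len(weights)))
    used = 0
    while len(coins) > 1:
        m = (len(coins) + 2) // 3              # size of the two weighed groups
        a, b, rest = coins[:m], coins[m:2*m], coins[2*m:]
        used += 1
        wa = sum(weights[i] for i in a)
        wb = sum(weights[i] for i in b)
        coins = a if wa > wb else (b if wb > wa else rest)
    return coins[0], used

w = [1.0] * 40; w[17] = 1.1                    # 40 coins, heavy one at position 17
print(find_heavy(w), weighings_needed(40))     # ((17, 4), 4)
```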
6. The "Bar-Kochba" game is played as follows: player A thinks of any object,
player B asks questions which can be answered by "yes" or "no", and has to guess
the thing on which A thought from the answers. Naturally, A has to answer all
questions honestly.
a) The players agree that A thinks of some nonnegative integer < N. What is the
minimal number of questions permitting B to find out the considered integer? Give
an "optimal" sequence of questions.

Hint. Obviously, at least $\{\log_2 N\}$ questions are needed, since each answer provides
at most one bit of information and we need $\log_2 N$ bits. An optimal system of questions
is to ask whether in the binary representation of the number x the first, the second, ...,
digit is 0. The aim is arrived at by $\{\log_2 N\}$ questions, since the binary representation
of an integer is unique.
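A transcription of this optimal strategy as a sketch (the callback `answer`, standing for player A, is a hypothetical name): the i-th question asks for the i-th binary digit of the secret number, so ⌈log₂N⌉ answers determine it.

```python
# Optimal question strategy of Exercise 6 a): ask for the binary digits.
import math

def guess(answer, N):
    """answer(i) tells whether the i-th binary digit of the secret x is 1."""
    q = math.ceil(math.log2(N))            # {log2 N} questions suffice
    return sum(2**i for i in range(q) if answer(i))

secret = 777
x = guess(lambda i: (secret >> i) & 1, N=1000)   # 10 questions
assert x == secret
```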
b) Suppose $N = 2^s$. How many optimal systems of questions do exist? That is:
how many systems of exactly s questions determine x, whatever it may be?

Hint. The number of the possible sequences of answers to s questions is evidently $2^s$.
There corresponds thus to every integer x $(x = 0, 1, \ldots, 2^s - 1)$ in a one-to-one
manner a sequence of s yes-or-no answers. Every question can be put in the following
form: does x belong to a subset A of the sequence $0, 1, \ldots, 2^s - 1$? Thus to an
optimal sequence of s questions there correspond s subsets of the set $M = \{0, 1, \ldots,
2^s - 1\}$; let these be denoted by $A_1, A_2, \ldots, A_s$. According to what has been said,
$A_1$ has to contain exactly $2^{s-1}$ elements. Let $\bar{A}$ always denote the set complementary
to A with respect to M. Then $A_1 A_2$ and $\bar{A}_1 A_2$ have to contain both $2^{s-2}$ elements;
$A_1 A_2 A_3$, $\bar{A}_1 A_2 A_3$, $A_1 \bar{A}_2 A_3$, and $\bar{A}_1 \bar{A}_2 A_3$ have to contain $2^{s-3}$ elements, and so on.
Conversely, if all sets $\tilde{A}_1 \tilde{A}_2 \cdots \tilde{A}_{k-1} A_k$ contain exactly $2^{s-k}$ elements $(k = 1, 2, \ldots, s)$,
where $\tilde{A}$ means either A or $\bar{A}$, then the system of sets $A_1, A_2, \ldots, A_s$ is optimal.
It follows from this that the number of optimal sequences of questions is

$$\binom{2^s}{2^{s-1}} \binom{2^{s-1}}{2^{s-2}}^{2} \cdots \binom{2}{1}^{2^{s-1}} = (2^s)!.$$

If we regard the systems of questions which differ only in the order of questions as
identical, then the number looked for is $\dfrac{(2^s)!}{s!}$.
Remark. In the Bar-Kochba game the questions are, in general, formulated while
taking into account the answers already obtained. (In the language of set theory:
if the first answers have shown that the object belongs to a subset A of the set M of
all possible objects, then the next question is whether it belongs to some subset B
of the set A.) It follows from what has been said that the questioner suffers no
disadvantage by being obliged to put his questions simultaneously.
7. Suppose that in the Bar-Kochba game the players agree that the objects
allowed to be thought of are the n elements of a given set M. Suppose that the
questions are asked at random, or in other words, all possible questions have the
same probability, independently of the answers already obtained.
a) What is the probability that the questioner finds out the object by k questions?
b) Find the limit of the probability obtained in a) as n and k both tend to +∞ such
that

$$\lim (k - \log_2 n) = c.$$

Hints. We may suppose that the elements of the set M are the numbers $1, 2, \ldots, n$.
Each possible question is equivalent to asking whether the number thought of does
belong to a certain subset of M. The number of possible questions is thus equal to
the number of subsets of M, i.e. to $2^n$. (For sake of simplicity there are included the
two trivial questions corresponding to the whole set and the empty set.) Let
$A_1, A_2, \ldots, A_k$ be the sets chosen at random by the questioner: i.e. he asks whether
the number thought of does belong to these sets. By assumption, each of the sets
$A_j$ is, with probability $\frac{1}{2^n}$, equal to an arbitrary subset of M. Put

$$\varepsilon_j(l) = \begin{cases} 1 & \text{if } A_j \text{ contains the number } l, \\ 0 & \text{otherwise}. \end{cases}$$

The random variables $\varepsilon_j(l)$ $(j = 1, 2, \ldots, k;\ l = 1, 2, \ldots, n)$ are independent of
each other and each takes on the values 0 and 1 with probability $\frac{1}{2}$. The questioner
finds the number x when the sequence of numbers $\varepsilon_1(x), \varepsilon_2(x), \ldots, \varepsilon_k(x)$ is different
from all sequences $\varepsilon_1(y), \varepsilon_2(y), \ldots, \varepsilon_k(y)$ with $y \ne x$. The sequences $\varepsilon_1(l), \varepsilon_2(l), \ldots, \varepsilon_k(l)$
are, with probability $\frac{1}{2^k}$ and independently of each other, equal to any given sequence
consisting of k digits 0 or 1; the problem is thus equivalent to the following urn
problem:
n balls are thrown into $2^k$ urns; each ball has the same probability to fall into any
of the urns. One of the balls is red, all the other balls are white. What is the probability
that the red ball is alone in an urn? The answer is evidently

$$P_{n,k} = \left( 1 - \frac{1}{2^k} \right)^{n-1}.$$

The answer to question b) is therefore

$$\lim_{k - \log_2 n \to c} P_{n,k} = \exp\left( -\frac{1}{2^c} \right).$$

Remarks.
1. The number of questions needed (if n is sufficiently large) in
order to find the number with a probability ≥ 0.99 by means of this random strategy
exceeds only by 7 the number of questions needed in the case the optimal strategy
is employed. In fact, $\exp\left(-\frac{1}{2^7}\right) > 0.99$. This result is surprising, since one would
be inclined to guess that the random strategy is much less advantageous than the
optimal strategy.
2. When the questions are asked at random it may happen that the same question
occurs twice. But the corresponding probability is so small, if n is large, that it is
not worth while to exclude this possibility, though of course this would slightly increase
the chances of success.
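The formula for P_{n,k}, its limit, and the bound quoted in Remark 1 can be checked in a few lines; a sketch (non-integral k is used here purely to keep k − log₂n exactly equal to c):

```python
# Numerical check of P_{n,k} = (1 - 2^{-k})^{n-1} and its limit exp(-2^{-c}).
import math

def P(n, k):
    return (1 - 2.0**-k) ** (n - 1)

c = 3.0
for n in (10**2, 10**4, 10**6):
    k = math.log2(n) + c
    print(n, P(n, k), math.exp(-2.0**-c))
print("Remark 1:", math.exp(-1 / 2**7))   # ~0.9922 > 0.99
```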

8. Certain players play the Bar-Kochba game in the following manner: there are
r + 1 players; r players think of some object, the last player asks them questions.
The same question is addressed to every player, who answers by "yes" or "no",
according to what is true concerning the object he had thought of.
a) Each of the players thinks of one of the numbers $1, 2, \ldots, n$ (n ≥ r), but each
of a different number. The questions are asked at random, as in the preceding
exercise. What is the probability that the questioner finds all numbers by k questions?
b) n = r and the players agree to think each of a different number of the sequence
$1, 2, \ldots, n$; hence it is a permutation of numbers which is to be found. What is the
probability that the questioner finds the permutation by k questions? Calculate
approximately this probability for $k = 2 \log_2 n + c$.

Hints. a) We are led to the following urn problem: we put n balls into $2^k$ urns,
independently of each other, each ball having the same probability $\frac{1}{2^k}$ to get into
any one of the urns. Among the n balls there are r red balls, the others are white.
What is the probability that each of the red balls is alone in an urn? This probability is

$$P_{n,k,r} = \prod_{j=1}^{r-1} \left( 1 - \frac{j}{2^k} \right) \cdot \left( 1 - \frac{r}{2^k} \right)^{n-r}.$$

For r = 1 we find as a particular case the result of the preceding exercise.

b) n = r, hence

$$P_{n,k,n} = \prod_{j=1}^{n-1} \left( 1 - \frac{j}{2^k} \right);$$

thus

$$\lim_{n\to\infty} P_{n,\, 2\log_2 n + c,\, n} = \exp\left( -\frac{1}{2^{c+1}} \right).$$


Remark. It is surprising that in this game, to guess a permutation of the numbers
$1, 2, \ldots, n$, approximately twice as many questions are necessary as to guess a single
one of these numbers. Of course, in the first case one gets to each question n answers.
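A sketch evaluating the two probabilities numerically; the product formula below matches the special cases r = 1 and r = n quoted in the hints, and the b) case approaches its stated limit (real-valued k is used so that k = 2log₂n + c holds exactly).

```python
# Probabilities of Exercise 8: r red balls among n, each red ball alone.
import math

def P_all_found(n, k, r):
    p = 1.0
    for j in range(1, r):
        p *= 1 - j / 2**k                   # reds occupy r distinct urns
    return p * (1 - r / 2**k) ** (n - r)    # the n-r whites avoid those urns

n = 1000
print(P_all_found(n, 10, 1), (1 - 2**-10)**(n - 1))   # r = 1: Exercise 7
for c in (0, 2, 4):
    k = 2 * math.log2(n) + c
    print(c, P_all_found(n, k, n), math.exp(-1 / 2**(c + 1)))
```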
9. Let f(n) $(n = 1, 2, \ldots)$ be a completely additive number-theoretical function, i.e.

$$f(nm) = f(n) + f(m) \qquad \text{(A)}$$

for all pairs of integers n and m. Suppose further that the limits

$$\lim_{n\to\infty} |f(n+d) - f(n)| = l(d) \qquad \text{(B)}$$

are finite for every integer d and

$$\lim_{d\to\infty} \frac{l(d)}{\log d} = 0.$$

Then $f(n) = c \log n$, where c is a constant.
Hint. Let P be an integer greater than 1; we put

$$g(n) = f(n) - \frac{f(P)}{\log P} \log n.$$

From g(P) = 0 and from (B) it follows that

$$\limsup_{n\to\infty} \frac{|g(n)|}{\log n} \le \frac{\max_{d \le P} l(d)}{\log P}.$$

From this we conclude that

$$\lim_{P\to\infty} \lim_{n\to\infty} \left( \frac{f(n)}{\log n} - \frac{f(P)}{\log P} \right) = 0.$$

This implies the existence of the limit

$$\lim_{n\to\infty} \frac{f(n)}{\log n} = c.$$

If we put now $h(n) = f(n) - c \log n$, then h(n) is a completely additive function
for which

$$\lim_{n\to\infty} \frac{h(n)}{\log n} = 0. \qquad (1)$$

But this implies h(n) = 0, since otherwise there would exist an integer r with $h(r) \ne 0$
and thus, because of the additivity, $h(r^k) = k\, h(r)$ for $k = 1, 2, \ldots$, which contradicts (1).

Remark. This problem is due to K. L. Chung but his proof differs from ours.
If instead of the complete additivity only simple additivity is required, i.e. that
/( » m) = An) + f (pi), if (и, tri) = 1, then the condition (B) does not imply /(») =
= c log n. (The last step of the proof cannot be carried out in this case.)10
10. Let $\mathscr{P} = \{p_k\}$ be any distribution with

$$\sum_{k=1}^{\infty} k\, p_k = \lambda > 1.$$

Then the entropy

$$\sum_{k=1}^{\infty} p_k \log_2 \frac{1}{p_k} = I_1(\mathscr{P})$$

takes on its maximal value if

$$p_k = \frac{1}{\lambda} \left( 1 - \frac{1}{\lambda} \right)^{k-1} \qquad (k = 1, 2, \ldots).$$
11. Let $\mathscr{P}$ and $\mathscr{Q}$ be two distributions, absolutely continuous with respect to the Lebesgue
measure, with density functions p(x) and q(x), and further let $\mathscr{Q}$ be absolutely
continuous with respect to $\mathscr{P}$. It follows from Theorem 2 of § 8 that the gain of information
is nonnegative in this case too, i.e. we have the inequalities

$$\int_{-\infty}^{+\infty} q(x) \log_2 \frac{q(x)}{p(x)}\, dx \ge 0$$

and

$$\frac{1}{a-1} \log_2 \int_{-\infty}^{+\infty} \frac{q(x)^a}{p(x)^{a-1}}\, dx \ge 0 \quad \text{for } a > 0,\ a \ne 1.$$

Prove these inequalities directly (without passing to the limit) by Jensen's inequality
generalized for functions, i.e. by inequality (15) of § 8.
12. a) Let ξ be a positive random variable having an absolutely continuous distribution,
with $E(\xi) = \lambda > 0$. Show that the entropy (of order 1) of ξ is maximal if
the distribution of ξ is exponential.

Hint. Let f(x) be a density function in $(0, +\infty)$ with

$$\int_0^{\infty} x f(x)\, dx = \lambda$$

and put

$$g(x) = \frac{1}{\lambda} \exp\left( -\frac{x}{\lambda} \right).$$

We have (cf. Exercise 11)

$$0 \le \int_0^{\infty} f(x) \log_2 \frac{f(x)}{g(x)}\, dx = \int_0^{\infty} g(x) \log_2 \frac{1}{g(x)}\, dx - \int_0^{\infty} f(x) \log_2 \frac{1}{f(x)}\, dx.$$
b) Let ξ be a random variable distributed in the interval (a, b) with an absolutely
continuous distribution function. Show that the entropy (of order 1) of ξ is maximal
if ξ is uniformly distributed in (a, b).

Hint. Let f(x) be a density function which vanishes outside (a, b) and put

$$g(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a < x < b, \\ 0 & \text{otherwise}. \end{cases}$$

We have then

$$0 \le \int_a^b f(x) \log_2 \frac{f(x)}{g(x)}\, dx = \int_a^b g(x) \log_2 \frac{1}{g(x)}\, dx - \int_a^b f(x) \log_2 \frac{1}{f(x)}\, dx.$$

13. Let $\xi_1, \xi_2, \ldots, \xi_n$ be random variables and $C = (c_{kl})$ a nonsingular n by n
matrix. We put

$$\eta_k = \sum_{j=1}^{n} c_{kj} \xi_j \qquad (k = 1, 2, \ldots, n).$$

Show that we have for a > 0

$$I_{a,n}((\eta_1, \eta_2, \ldots, \eta_n)) = I_{a,n}((\xi_1, \xi_2, \ldots, \xi_n)) + \log_2 \bigl|\, \|C\| \,\bigr|,$$

where $\|C\|$ denotes the determinant of C.

14. Let $\xi_1, \ldots, \xi_n$ be independent random variables with absolutely continuous
distributions. We have

$$I_{a,n}((\xi_1, \ldots, \xi_n)) = I_{a,1}(\xi_1) + I_{a,1}(\xi_2) + \cdots + I_{a,1}(\xi_n).$$

Hint. This follows from Formulas (34b) and (35b) of § 8.


15. The relative information I(ξ, η) contained in the value of η concerning ξ (or
conversely) can be defined, when the pair (ξ, η) has an absolutely continuous distribution,
by

$$I(\xi, \eta) = I_{1,1}(\xi) + I_{1,1}(\eta) - I_{1,2}((\xi, \eta)).$$

Show that if $\mathscr{R}$ is the joint distribution of ξ and η and if $\mathscr{P} * \mathscr{Q}$ denotes the direct
product of the distributions $\mathscr{P}$ and $\mathscr{Q}$ of ξ and η, then Formula (11) of § 4 remains
valid; i.e. I(ξ, η) is equal to the gain of information obtained by replacing $\mathscr{P} * \mathscr{Q}$
by $\mathscr{R}$.

Hint. If h(x, y) is the density function of the pair (ξ, η), and f(x) and g(y) are the
density functions of ξ and η, then we have

$$I(\xi, \eta) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} h(x, y) \log_2 \frac{h(x, y)}{f(x)\, g(y)}\, dx\, dy.$$

It follows that

$$I(\xi, \eta) = I_1(\mathscr{R} \,\|\, \mathscr{P} * \mathscr{Q}),$$

because of Formula (38) of § 8.
16. In the following exercises we always use natural (Napier's) logarithms ln.
a) Calculate the entropy (of dimension 1 and of order 1) of the normal distribution;
i.e. show that

$$\int_{-\infty}^{+\infty} \varphi(x) \ln \frac{1}{\varphi(x)}\, dx = \ln \sigma\sqrt{2\pi e},$$

where

$$\varphi(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-m)^2}{2\sigma^2} \right).$$
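A quick numerical check of a) (illustrative; m and σ arbitrary), evaluating the integral by a Riemann sum over a wide grid:

```python
# Differential entropy of N(m, sigma^2) in natural units: ln(sigma*sqrt(2*pi*e)).
import math
import numpy as np

m, sigma = 1.0, 2.0
x, dx = np.linspace(m - 12*sigma, m + 12*sigma, 200_001, retstep=True)
phi = np.exp(-(x - m)**2 / (2*sigma**2)) / (sigma * math.sqrt(2*math.pi))
h = -np.sum(phi * np.log(phi)) * dx        # Riemann sum for ∫ phi ln(1/phi) dx
print(h, math.log(sigma * math.sqrt(2*math.pi*math.e)))
```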
b) Calculate the entropy (of dimension r and of order 1) of the r-dimensional normal
distribution.

Hint. Let the r-dimensional density function of the random variables $\xi_1, \xi_2, \ldots, \xi_r$
be

$$f(x_1, \ldots, x_r) = \frac{\|B\|^{\frac{1}{2}}}{(2\pi)^{\frac{r}{2}}} \exp\left[ -\frac{1}{2} \sum_{i=1}^{r} \sum_{j=1}^{r} b_{ij} (x_i - m_i)(x_j - m_j) \right],$$

where $\|B\|$ is the determinant of the positive definite quadratic form

$$\sum_{i=1}^{r} \sum_{j=1}^{r} b_{ij} x_i x_j.$$

By a suitable orthogonal transformation

$$\eta_k = \sum_{l=1}^{r} c_{kl} (\xi_l - m_l)$$

we obtain for the density function of the random variables $\eta_1, \eta_2, \ldots, \eta_r$:

$$\frac{1}{(2\pi)^{\frac{r}{2}} \sigma_1 \sigma_2 \cdots \sigma_r} \exp\left( -\frac{1}{2} \sum_{k=1}^{r} \frac{y_k^2}{\sigma_k^2} \right),$$

with $\sigma_1 \sigma_2 \cdots \sigma_r = \|B\|^{-\frac{1}{2}}$. According to Exercise 13, the entropy is invariant under
such a transformation, since the absolute value of the determinant of an orthogonal
transformation is equal to 1. Hence, according to Exercise 14,

$$I_{1,r}((\xi_1, \xi_2, \ldots, \xi_r)) = I_{1,r}((\eta_1, \eta_2, \ldots, \eta_r)) = \sum_{k=1}^{r} I_{1,1}(\eta_k) = \ln (2\pi e)^{\frac{r}{2}} \|B\|^{-\frac{1}{2}}.$$
c) Let the joint distribution of the random variables ξ and η be normal with density
function

$$f(x, y) = \frac{\sqrt{AC - B^2}}{2\pi} \exp\left[ -\frac{1}{2} \left( Ax^2 + 2Bxy + Cy^2 \right) \right].$$

Calculate the (relative) information contained in the value of η concerning ξ.

Hint. We find the desired information by subtracting from the sum of the informations
contained in ξ and in η the information contained in the distribution of the
pair (ξ, η). Hence

$$I(\xi, \eta) = \ln \sqrt{2\pi e \frac{C}{AC - B^2}} + \ln \sqrt{2\pi e \frac{A}{AC - B^2}} - \ln \frac{2\pi e}{\sqrt{AC - B^2}} = \ln \sqrt{\frac{AC}{AC - B^2}}.$$

If B = 0, i.e. if ξ and η are independent, we find of course that I(ξ, η) = 0.
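The identity of c) can be restated through the correlation coefficient ρ = −B/√(AC), since the covariance matrix is the inverse of the quadratic form; a short numerical sketch (A, B, C arbitrary):

```python
# I(xi, eta) = ln sqrt(AC/(AC - B^2)) = -(1/2) ln(1 - rho^2).
import math
import numpy as np

A, B, C = 2.0, -0.8, 1.0
cov = np.linalg.inv([[A, B], [B, C]])       # covariance is the inverse form
rho = cov[0, 1] / math.sqrt(cov[0, 0] * cov[1, 1])
print(math.log(math.sqrt(A*C / (A*C - B**2))), -0.5 * math.log(1 - rho**2))
```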
17. Let ξ be a random variable with an absolutely continuous distribution and density
function f(x). Let the standard deviation σ of ξ be finite and positive.
a) Show that the entropy (of dimension 1 and of order 1) of ξ is maximal if ξ is
normally distributed.
b) Show that the entropy (of dimension 1 and of order a > 1) is maximal if

$$f_a(x) = \begin{cases} \dfrac{\Gamma\left( \frac{1}{a-1} + \frac{3}{2} \right)}{c\sqrt{\pi}\, \Gamma\left( \frac{1}{a-1} + 1 \right)} \left( 1 - \dfrac{x^2}{c^2} \right)^{\frac{1}{a-1}} & \text{for } |x| \le c, \\ 0 & \text{otherwise}, \end{cases}$$

where we have put $c = \sigma \sqrt{\dfrac{3a-1}{a-1}}$.

Hint. Put

$$m = M(\xi), \qquad \sigma^2 = \int_{-\infty}^{+\infty} (x-m)^2 f(x)\, dx, \qquad \varphi(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-m)^2}{2\sigma^2} \right).$$

We have then

$$0 \le \int_{-\infty}^{+\infty} f(x) \ln \frac{f(x)}{\varphi(x)}\, dx = \int_{-\infty}^{+\infty} \varphi(x) \ln \frac{1}{\varphi(x)}\, dx - \int_{-\infty}^{+\infty} f(x) \ln \frac{1}{f(x)}\, dx,$$

which implies a); b) can be proved in the same fashion. Let it be noticed that $f_a(x)$
tends to φ(x) as a → 1.

18. Let f(x) and $f_n(x)$ be density functions such that $f_n(x) = 0$ $(n = 1, 2, \ldots)$ for
every value of x for which f(x) = 0; suppose further that all integrals

$$\int_{-\infty}^{+\infty} \frac{f_n^2(x)}{f(x)}\, dx \qquad (n = 1, 2, \ldots)$$

exist and that

$$\lim_{n\to\infty} \int_{-\infty}^{+\infty} \frac{f_n^2(x)}{f(x)}\, dx = 1.$$

Prove that under these conditions

$$\lim_{n\to\infty} \sup_{E} \left| \int_E f_n(x)\, dx - \int_E f(x)\, dx \right| = 0,$$

where E runs through all measurable subsets of the set of real numbers.
Hint. Applying Schwarz's inequality, we get

$$\left| \int_E f_n(x)\, dx - \int_E f(x)\, dx \right| \le \int_{-\infty}^{+\infty} |f_n(x) - f(x)|\, dx \le \left( \int_{-\infty}^{+\infty} \frac{(f_n(x) - f(x))^2}{f(x)}\, dx \right)^{\frac{1}{2}},$$

and clearly

$$\int_{-\infty}^{+\infty} \frac{(f_n(x) - f(x))^2}{f(x)}\, dx = \int_{-\infty}^{+\infty} \frac{f_n^2(x)}{f(x)}\, dx - 1.$$

TABLES

Table 1

Values of n! and of log n! for n ≤ 50

n n! log n! | n n! log n!

1 1 0.00000000 26 40329146-1019 26.60561903


2 2 0.30103000 27 10888869-1021 28.03698279
3 6 0.77815125 28 30488834-1022 29.48414082
4 24 1.38021124 29 88417620-1023 30.94653882
5 120 2.07918125 30 26525286-1025 32.42366007
6 720 2.85733250 31 82228387-1026 33.91502177
7 5040 3.70243054 32 26313084-1028 35.42017175
8 40320 4.60552052 33 86833176-102Э 36.93868569
9 362880 5.55976303 34 29523280-1031 38.47016460
10 3628800 6.55976303 35 10333148-1033 40.01423265
11 39916800 7.60115572 36 37199 333-1034 41.57053515
12 47900160-10 8.68033696 37 13763753-1036 43.13873687
13 62270208 • 102 9.79428032 38 52302262-1037 44.71852047
14 87178291 • 103 10.94040835 39 20397882-1039 46.30958508
15 13076744-105 12.1164996 1 40 81591528-1040 47.91164507
16 20922790 • 106 13.32061959 41 33452527-1042 49.52442892
17 35568743 • 107 14.55106852 42 14050061-1044 51.14767822
18 64023737-10s 15.80634102 43 60415263-1045 52.78114667
19 12164510 • 1010 17.08509462 44 26582716-1047 54.42459935
20 24329020 • 10111 18.38612462 45 11962222-1049 - 56.07781186
21 51090942-1012 19.70834391 46 55026222-1059 57.74056969
22 11240007 • 1014 21.05076659 47 25862324-1052 59.41266755
23 25852017 • 1015 22.41249443 48 12413916-10s4 61.09390879
24 62044840 • 1016 23.79270567 49 60828186-1055 62.78410487
25 15511210 • 1018 25.19064568 50 30414093-1057 64.48307487

Table 2

Binomial coefficients $\binom{n}{k}$ for n ≤ 30¹

n \ k | 0 1 2 3 4 5 6 7 8

2 1 2 1
3 1 3 3 1
4 1 4 6 4 1
5 1 5 10 10 5 1
6 1 6 15 20 15 6 1
7 1 7 21 35 35 21 7 1
8 1 8 28 56 70 56 28 8 1
9 1 9 36 84 126 126 84 36 9
10 1 10 45 120 210 252 210 120 45
11 1 i 11 55 165 330 462 1 462 330 165
12 1 12 66 220 495 792 924 792 495
13 1 13 78 286 715 1287 1716 1716 1287
14 1 1 14 91 364 1001 2002 3003 3432 3003
15 1 Í 15 105 455 1365 3003 5005 6435 6435
16 1 ! 16 120 560 1820 4368 8008 11440 12870
17 1 17 136 680 2380 6188 12376 19448 24310
18 1 18 153 816 3060 8568 18564 31824 43758
19 1 19 171 969 3876 11628 27132 50388 75582
20 1 j 20 190 1140 4845 15504 38760 77520 125970
21 1 21 210 1330 5985 20349 54264 116280 203490
22 1 22 231 1540 7315 26334 74613 170544 319770
23 1 23 253 1771 8855 33649 100947 245157 490314
24 1 24 276 2024 10626 42504 134596 346104 735471
25 1 25 300 2300 12650 53130 177100 480700 1081575
26 1 26 325 2600 14950 65780 i 230230 657800 1562275
27 1 27 351 2925 17550 80730 296010 888030 2220075
28 1 28 378 3276 20475 98280 1 376740 1184040 3108105
29 1 29 406 3654 23751 118755 j 475020 1560780 4292145
30 1 30 435 4060 27405 142506 j 593775 2035800 5852925

¹ For n > 15 values of $\binom{n}{k}$ are given for $k \le \frac{n}{2}$ only; the further values can
be taken from the table by the relation $\binom{n}{k} = \binom{n}{n-k}$.


Table 2 (continued)

k | 9 10 11 12 13 14 15 | n

1 9
10 1 10
55 11 1 11
220 66 12 1 12
715 286 j 78 I 13 1 13
2002 I 1001 j 364 j 91 I 14 1 14
5005 I 3003 1365 I 455 i 105 15 1 15
11440 I 8008 I 4368 j 1820 j 560 120 16 16
24310 ; 19448 I 12376 I 6188 j 2380 680 136 17
48620 ! 43758 ! 31824 j 18564 J
8568 3060 816 18
92378 j 92378 75582 j 50388 | 27132 11628 3876 19
167960 ! 184756 167960 | 125970 [ 77520 38760 15504 20
293930 j 352716 352716 [ 293930 | 203490 116280 54264 21
497420 I 646646 705432 I 646646 [ 497420 319770 170544 | 22
817190 j
1144066 1352078 1352078 |
1144066 817190 490314123
1307504 Í 1961256 2496144 i 2704156 2496144 j 1961256 1307504 24
2042975 3268760 4457400 j 5200300 j 5200300 4457400 3268760 25
3124550 5311735 j 7726160 I 9657700 10400600 j 9657700 7726160 26
4686825 8436285 ! 13037895 ! 17383860 Í20058300 20058300 17383860 27
6906900 13123110 21474180 30421755 I 37442160 40116600 37442160 28
10015005 I 20030010 34597290 51895935 ' 67863915 77558760 77558760 29
14307150 j
30045015 54627300 86493225 119759850 145422675 155117520 30
I I I

Table 3

The terms of the Poisson distribution

k \ λ | 0.1 0.2 0.3 0.4 0.5

0 0.90484 0.81873 0.74082 0.67032 0.60653


1 0.09048 0.16375 0.22225 0.26813 0.30327
2 0.00452 0.01637 0.03334 0.05362 0.07581
3 0.00015 0.00109 0.00333 0.00715 0.01263
4 0.00005 0.00025 0.00071 0.00158
5 0.00001 0.00005 0.00016
6 0.00001
k \ λ | 0.6 0.7 0.8 0.9

0 0.54881 0.49659 0.44933 0.40657


1 0.32929 0.34761 0.35946 0.36591
2 0.09878 0.12166 0.14379 0.16466
3 0.01976 0.02838 0.03834 0.04939
4 0.00296 0.00496 0.00766 0.01111
5 0.00035 0.00069 0.00123 0.00200
6 0.00003 0.00008 0.00016 0.00030
7 0.00001 0.00003

k \ λ | 1 2 3 4 5

0 0.36788 0.13534 0.04978 0.01831 0.00673


1 0.36788 0.27067 0.14936 0.07326 0.03369
2 0.18394 0.27067 0.22404 0.14653 0.08422
3 0.06131 0.18045 0.22404 0.19537 0.14037
4 0.01532 0.09022 0.16803 0.19537 0.17547
5 0.00306 0.03609 0.10082 0.15629 0.17547
6 0.00051 0.01203 0.05040 0.10420 0.14622
7 0.00007 0.00343 0.02160 0.05954 0.10444
8 0.00085 0.00810 0.02977 0.06527
9 0.00019 0.00270 0.01323 0.03626
10 0.00003 0.00081 0.00529 0.01813
11 0.00022 0.00192 0.00824
12 0.00005 0.00064 0.00343
13 0.00001 0.00019 0.00132
14 0.00005 0.00047
15 0.00001 0.00015
16 0.00004
17 0.00001
Table 3 (continued)

k \ λ | 6 7 8 9 10
0 0.00247 0.00091 0.00033 0.00012 0.00004
1 0.01487 0.00638 0.00268 0.00111 0.00045
2 0.04461 0.02234 0.01073 0.00499 0.00227
3 0.08923 0.05212 0.02862 0.01499 0.00756
4 0.13385 0.09122 0.05725 0.03373 0.01891
5 0.16062 0.12772 0.09160 0.06072 0.03783
6 0.16062 0.14900 0.12214 0.09109 0.06305
7 0.13768 0.14900 0.13959 0.11712 0.09007
8 0.10326 0.13038 0.13959 0.13176 0.11260
9 0.06883 0.10140 0.12408 0.13176 0.12511
10 0.04130 0.07098 0.09926 0.11858 0.12511
11 0.02252 0.04517 0.07219 0.09702 0.11374
12 0.01126 0.02635 0.04812 0.07276 0.09478
13 0.00519 0.01418 0.02961 0.05037 0.07290
14 0.00222 0.00709 0.01692 0.03238 0.05207
15 0.00089 0.00331 0.00902 0.01943 0.03471
16 0.00033 0.00144 0.00451 0.01093 0.02169
17 0.00011 0.00059 0.00212 0.00578 0.01276
18 0.00003 0.00023 0.00094 0.00289 0.00709
19 0.00001 0.00008 0.00039 0.00137 0.00373
20 0.00003 0.00015 0.00061 0.00186
21 0.00006 0.00026 0.00088
22 0.00002 0.00010 0.00040
23 0.00004 0.00017
24 0.00001 0.00007
25 0.00002
26 0.00001
Table 3 (continued)

k \ λ | 11 12 13 14 15

0 0.00001
1 0.00018 0.00007 0.00002 0.00001
2 0.00101 0.00044 0.00019 0.00008 0.00003
3 0.00370 0.00177 0.00082 0.00038 0.00017
4 0.01018 0.00530 0.00269 0.00133 0.00064
5 0.02241 0.01274 0.00699 0.00373 0.00193
6 0.04109 0.02548 0.01515 0.00869 0.00483
7 0.06457 0.04368 0.02814 0.01739 0.01037
8 0.08879 0.06552 0.04573 0.03043 0.01944
9 0.10853 0.08736 0.06605 0.04734 0.03240
10 0.11938 0.10484 0.08587 0.06628 0.04861
11 0.11938 0.11437 0.10148 0.08435 0.06628
12 0.10943 0.11437 0.10994 0.09841 0.08285
13 0.09259 0.10557 0.10994 0.10599 0.09560
14 0.07275 0.09048 0.10209 0.10599 0.10244
15 0.05335 0.07239 0.08847 0.09892 0.10244
16 0.03668 0.05429 0.07188 0.08655 0.09603
17 0.02373 0.03832 0.05497 0.07128 0.08473
18 0.01450 0.02555 0.03970 0.05544 0.07061
19 0.00839 0.01613 0.02716 0.04085 0.05574
20 0.00461 0.00968 0.01765 0.02859 0.04181
21 0.00241 0.00553 0.01093 0.01906 0.02986
22 0.00121 0.00301 0.00645 0.01213 0.02036
23 0.00057 0.00157 0.00365 0.00738 0.01328
24 0.00026 0.00078 0.00197 | 0.00430 0.00830
25 0.00011 0.00037 0.00102 0.00241 0.00498
26 0.00004 0.00017 0.00051 0.00129 0.00287
27 0.00002 0.00007 0.00024 0.00067 0.00159
28 0.00003 0.00011 0.00033 0.00085
29 0.00001 0.00005 0.00016 0.00044
30 0.00002 0.00007 0.00022
31 0.00003 0.00010
32 0.00001 0.00005
33 0.00002
34 0.00001
Table 3 (continued)

k \ λ | 16 17 18 19 20

0
1
2 0.00001
3 0.00007 0.00003
4 0.00030 0.00014 0.00006 0.00003 0.00001
5 0.00098 0.00049 0.00024 0.00011 0.00005
6 0.00262 0.00138 0.00071 0.00036 0.00018
7 0.00599 0.00337 0.00185 0.00099 0.00052
8 0.01198 0.00716 0.00416 0.00236 0.00130
9 0.02131 0.01352 0.00832 0.00498 0.00290
10 0.03409 0.02300 0.01498 0.00946 0.00581
11 0.04959 0.03554 0.02452 0.01635 0.01057
12 0.06612 0.05035 0.03678 0.02588 0.01762
13 0.08138 0.06584 0.05092 0.03783 0.02711
14 0.09301 0.07996 0.06548 0.05135 0.03874
15 0.09921 0.09062 0.07857 0.06504 0.05165
16 0.09921 0.09628 0.08839 0.07724 0.06456
17 0.09338 0.09628 0.09359 0.08632 0.07595
18 0.08300 0.09093 0.09359 0.09112 0.08439
19 0.06989 0.08136 0.08867 0.09112 0.08883
20 0.05592 0.06915 0.07980 0.08656 0.08883
21 0.04260 0.05598 0.06840 0.07832 0.08460
22 0.03098 0.04326 0.05596 0.06764 0.07691
23 0.02155 0.03197 0.04380 0.05587 0.06688
24 0.01437 0.02265 0.03285 0.04423 0.05573
25 0.00919 0.01540 0.02365 0.03362 0.04458
26 0.00566 0.01007 0.01637 0.02456 0.03429
27 0.00335 0.00634 0.01091 0.01728 0.02540
28 0.00191 0.00385 0.00701 0.01173 0.01814
29 0.00105 0.00225 0.00435 0.00768 0.01251
30 0.00056 0.00127 0.00261 0.00486 0.00834
31 0.00029 0.00070 0.00151 0.00298 0.00538
32 0.00014 0.00037 0.00085 0.00177 0.00336
33 0.00007 0.00019 0.00046 0.00102 0.00203
34 0.00003 0.00009 0.00024 0.00057 0.00119
35 0.00001 0.00004 0.00012 0.00030 0.00068
36 0.00002 0.00006 0.00016 0.00038
37 0.00001 0.00003 0.00008 0.00020
38 0.00001 0.00004 0.00010
39 0.00002 0.00005
40 0.00002
41 0.00001
Table 4

The incomplete gamma function

n | λ=0.001 λ=0.002 λ=0.003 λ=0.004
1 0.000999 0.001988 0.002995 0.003992
2 000001 000002 000005 000008

n | λ=0.005 λ=0.006 λ=0.007 λ=0.008
1 0.004987 j 0.005982 0.006976 0.007968
2 000013 I 000018 000024 000032

n | λ=0.009 λ=0.01 λ=0.02 λ=0.03

1 0.008960 0.009950 0.019801 0.029555


2 000040 000050 I 000197 , 000441
3 I 000002 000004

n | λ=0.04 λ=0.05 λ=0.06 λ=0.07
1 0.039211 0.048771 0.058236 0.067606
2 000779 001209 001730 002339
3 000010 000020 000034 000054
4 000001

n | λ=0.08 λ=0.09 λ=0.10 λ=0.11

1 0.076884 0.086069 0.095163 0.104166


2 003034 003815 004679 005624
3 000080 000114 000155 000204
4 000002 000002 000003 000006

n | λ=0.12 λ=0.13 λ=0.14 λ=0.15

1 0.113080 0.121905 0.130642 0.139292


2 006649 007752 008932 010186
3 000263 000332 000412 000503
4 000008 000011 000014 000018
5 000001
Table 4 (continued)

n | λ=0.16 λ=0.17 λ=0.18 λ=0.19

1 0.147856 0.156335 0.164730 0.173041


2 011513 012912 014381 015919
3 000606 000721 000850 000992
4 000024 000031 000038 000047
5 000001 000001 000001 000001

n | λ=0.20 λ=0.22 λ=0.24 λ=0.26

1 0.181269 0.197481 0.213372 0.228948


2 017523 020927 024581 028475
3 001149 001506 001927 002414
4 000057 000082 000113 000154
5 000002 000004 000006 000008
6 000001
n | λ=0.28 λ=0.30 λ=0.40 λ=0.50

1 0.244216 0.259182 0.329680 0.393469


2 032597 036936 061551 090204
3 002970 003600 007926 014388
4 000205 000366 000776 001752
5 000011 000015 000061 000173
6 000001 000001 000004 000014
7 000001 000001

n | λ=1.0 λ=1.5 λ=2.0 λ=2.5
1 0.63212 0.77687 0.86466 0.91792
2 26424 44218 59399 71270
3 08030 19115 32332 45619
4 01899 06564 14288 24242
5 00366 01858 05265 10882
6 00059 00446 01656 04202
7 00008 00093 00453 01419
8 00001 00017 00110 00425
9 00001 00002 00024 00114
10 I 00005 00028
11 00001 00006
12 I 00001
Table 4 (continued)

n | λ=3.0 λ=3.5 λ=4.0 λ=4.5

1 0.95021 0.96980 0.98168 0.98889


2 80035 86411 90842 ! 93890
3 J 57681 67915 76190 82642
4 j 35277 46338 56653 65770
5 I 18474 27456 37116 46789
6 08392 14239 21487 29708
7 J 03351 06529 11067 16895
8 j 01191 02674 05113 08659
9 j 00380 00987 02137 04026
10 j 00110 00331 00813 01709
11 ] 00029 00102 00284 00667
12 J 00007 00029 00092 00240
13 00002 00008 00027 00081
14 00001 00008 00025
15 00002 00007
16 j 00001 00002
17 00001

n | λ=5.0 λ=5.5 λ=6.0 λ=6.5
1 I 0.99326 0.99591 0.99752 0.99850
2 95957 97345 98265 98872
3 87535 91162 93804 95696
4 73497 79830 84880 88816
5 55951 64248 71494 77633
6 38404 47108 55433 63096
7 23782 31396 39370 47347
8 13337 19051 25603 32724
9 06809 10564 15276 20843
10 03183 05378 08392 12262
11 01369 02525 04262 06684
12 00545 01099 02009 03389
13 00202 00445 00883 01603
14 00070 00169 00363 00710
15 00023 00060 00140 00296
16 00007 00020 00051 00116
17 00002 00006 00017 00044
18 00001 00002 00006 00015
19 00001 00002 00005
20 00001 00001
Table 4 (continued)

n | λ=7.0 λ=7.5 λ=8.0 λ=8.5
1 0.99909 I 0.99945 0.99966 0.99980
2 99271 j 99530 99698 99807
3 97036 j 97975 98625 99072
4 91823 94085 95762 96989
5 82701 86794 90037 92564
6 69929 75856 80876 85040
7 55029 62184 68663 74382
8 40129 47536 54704 61440
9 27091 33803 40745 47689
10 16950 22359 28338 34703
11 09852 13776 18412 23664
12 05335 07924 11192 15134
13 02700 04267 06380 09092
14 01281 02157 03418 05141
15 00572 01026 01726 02743
16 00241 00461 00823 01383
17 00096 00196 00372 00661
18 00036 00079 00160 00300
19 00013 00031 00065 00130
20 00005 00011 00025 00054
21 00004 00010 00020
22 00001 00003 00008
23 00001 00003
24 00001
Table 4 (continued)

n | λ=9.0 λ=9.5 λ=10.0

1 0.99988 0.99993 0.99996


2 99877 99921 99950
3 99377 99584 99724
4 97877 98514 98966
5 94504 95974 97075
6 88431 91147 93291
7 79322 83505 86986
8 67610 73134 77978
9 54435 60818 66719
10 41259 47817 54207
11 29401 35467 41696
12 19699 24801 30322
13 12423 16338 20844
14 07385 10186 13554
15 04147 05999 08346
16 02204 03347 04874
17 01111 01773 02704
18 00533 00893 01428
19 00243 00428 00719
20 00106 00196 00345
21 00044 00086 00159
22 00018 00036 00070
23 00006 00015 00030
24 00003 00006 00012
25 00002 00004
26 00001 00001
Table 5

The function $\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$

x φ(x) | x φ(x) | x φ(x) | x φ(x)

0.00 0.3989
0.01 0.3989 0.41 0.3668 0.81 0.2874 1.21 0.1919
0.02 0.3989 0.42 0.3653 0.82 0.2850 1.22 0.1895
0.03 0.3988 0.43 0.3637 0.83 0.2827 1.23 0.1872
0.04 0.3986 0.44 0.3621 0.84 i 0.2803 1.24 0.1849
0.05 0.3984 0.45 0.3605 0.85 0.2780 1.25 0.1826
0.06 0.3982 0.46 0.3589 0.86 0.2756 1.26 0.1804
0.07 0.3980 0.47 0.3572 0.87 0.2732 1.27 0.1781
0.08 ! 0.3977 0.48 0.3555 0.88 0.2709 1.28 0.1758
0.09 0.3973 0.49 0.3538 0.89 0.2685 1.29 0.1736
0.10 0.3970 0.50 0.3521 0.90 0.2661 1.30 0.1714
0.11 ; 0.3965 0.51 0.3503 0.91 0.2637 1.31 0.1691
0.12 ! 0.3961 0.52 0.3485 0.92 0.2613 1.32 0.1669
0.13 0.3956 0.53 0.3467 0.93 0.2589 1.33 0.1647
0.14 0.3951 0.54 0.3448 0.94 0.2565 1.34 0.1626
0.15 j 0.3945 0.55 0.3429 0.95 0.2541 1.35 0.1604
0.16 ; 0.3939 0.56 0.3410 0.96 0.2516 1.36 0.1582
0.17 : 0.3932 0.57 0.3391 0.97 0.2492 1.37 0.1561
0.18 j 0.3925 0.58 0.3372 0.98 0.2468 1.38 0.1539
0.19 0.3918 0.59 0.3352 0.99 0.2444 1.39 0.1518
0.20 1 0.3910 0.60 0.3332 1.00 0.2420 1.40 0.1497
0.21 ; 0 3902 0.61 0.3312 1.01 0.2396 1.41 0.1476
0.22 0.3894 0.62 0.3292 1.02 0.2371 1.42 0.1456
0.23 0.3885 0.63 0.3271 1.03 0.2347 1.43 0.1435
0.24 0.3876 0.64 0.3251 1.04 0.2323 1.44 0.1415
0.25 0.3867 0.65 0.3230 1.05 0.2299 1.45 0.1394
0.26 0.3857 0.66 0.3209 1.06 0.2275 1.46 0.1374
0.27 0.3847 0.67 0.3187 1.07 0.2251 1.47 0.1354
0.28 0.3836 0.68 0.3166 1.08 0.2227 1.48 0.1334
0.29 0.3825 0.69 0.3144 1.09 0.2203 1.49 0.1315
0.30 0.3814 0.70 0.3123 1.10 0.2179 1.50 0.1295
0.31 0.3802 0.71 0.3101 1.11 0.2155 1.51 0.1276
0.32 0.3790 0.72 0.3079 1.12 0.2131 1.52 0.1257
0.33 0.3778 0.73 0.3056 1.13 0.2107 1.53 0.1238
0.34 0.3765 0.74 0.3034 1.14 0.2083 1.54 0.1219
0.35 0.3752 0.75 0.3011 1.15 0.2059 1.55 0.1200
0.36 0.3739 0.76 0.2989 1.16 0.2036 1.56 0.1182
0.37 0.3725 0.77 0.2966 1.17 0.2012 1.57 0.1163
0.38 0.3712 0.78 j 0.2943 1.18 0.1989 1.58 0.1145
0.39 0.3697 0.79 0.2920 1.19 0.1965 1.59 0.1127
0.40 0.3683 0.80 0.2897 1.20 0.1942 1.60 0.1109
Table 5 (continued)

x φ(x) | x φ(x) | x φ(x) | x φ(x)

1.61 0.1092 2.01 0.0529 2.41 0.0219 2.81 0.0077


1.62 0.1074 2.02 0.0519 2.42 I 0.0213 2.82 0.0075
1.63 0.1057 2.03 0.0508 2.43 0.0208 2.83 0.0073
1.64 0.1040 2.04 0.0498 2.44 0.0203 2.84 0.0071
1.65 0.1023 2.05 0.0488 2.45 0.0198 2.85 0.0069
1.66 0.1006 2.06 0.0478 2.46 0.0194 2.86 0.0067
1.67 0.0989 2.07 0.0468 2.47 0.0189 2.87 0.0065
1.68 0.0973 2.08 0.0459 2.48 0.0184 2.88 0.0063
1.69 i 0.0957 2.09 0.0449 2.49 0.0180 2.89 0.0061
1.70 0.0940 2.10 0.0440 2.50 0.0175 2.90 0.0060
1.71 0.0925 2.11 0.0431 2.51 0.0171 2.91 0.0058
1.72 0.0909 2.12 0.0422 2.52 0.0167 2.92 0.0056
1.73 0.0893 2.13 0.0413 2.53 0.0163 2.93 0.0055
1.74 0.0878 2.14 0.0404 2.54 0.0158 2.94 0.0053
1.75 0.0863 2.15 0.0396 2.55 0.0154 2.95 0.0051
1.76 0.0848 2.16 0.0387 2.56 0.0151 2.96 0.0050
1.77 0.0833 2.17 0.0379 2.57 0.0147 2.97 0.0048
1.78 0.0818 2.18 0.0371 2.58 0.0143 2.98 0.0047
1.79 i 0.0804 2.19 0.0363 2.59 0.0139 2.99 0.0046
1.80 0.0790 2.20 0.0355 2.60 0.0136 3.00 0.0044
1.81 0.0775 2.21 0.0347 2.61 0.0132 3.10 0.0033
1.82 0.0761 2.22 0.0339 2.62 0.0129 3.20 0.0024
1.83 0.0748 2.23 0.0332 2.63 0.0126 3.30 0.0017
1.84 0.0734 2.24 0.0325 2.64 0.0122 3.40 0.0012
1.85 0.0721 2.25 0.0317 2.65 0.0119 3.50 0.0009
1.86 0.0707 2.26 0.0310 2.66 0.0116 3.60 0.0006
1.87 0.0694 2.27 0.0303 2.67 j 0.0113 3.70 0.0004
1.88 0.0681 2.28 0.0297 2.68 j 0.0110 3.80 0.0003
1.89 0.0669 2.29 0.0290 2.69 0.0107 3.90 0.0002
1.90 0.0656 2.30 0.0283 2.70 0.0104 4.00 0.0001
1.91 0.0644 2.31 0.0277 2.71 0.0101 4.10 0.0001
1.92 0.0632 2.32 0.0270 2.72 0.0099 4.20 0.0001
1.93 0.0620 2.33 0.0264 2.73 | 0.0096
1.94 0.0608 2.34 0.0258 2.74 I 0.0093
1.95 0.0596 2.35 0.0252 2.75 0.0091
1.96 0.0584 2.36 0.0246 2.76 0.0088
1.97 0.0573 2.37 0.0241 2.77 0.0086
1.98 0.0562 2.38 0.0235 2.78 0.0084
1.99 0.0551 2.39 0.0229 2.79 0.0081
2.00 0.0540 2.40 0.0224 2.80 0.0079
Table 6

The normal distribution function $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{u^2}{2}}\, du$

x Φ(x) | x Φ(x) | x Φ(x) | x Φ(x)
0.01 0.5040 0.41 0.6591 0.81 0.7910 1.21 0.8869
0.02 0.5080 0.42 0.6628 0.82 0.7939 1.22 0.8888
0.03 0.5120 0.43 0.6664 0.83 0.7967 1.23 0.8907
0.04 0.5160 0.44 0.6700 0.84 0.7995 1.24 0.8925
0.05 0.5199 0.45 0.6736 0.85 0.8023 1.25 0.8944
0.06 0.5239 0.46 0.6772 0.86 0.8051 1.26 0.8962
0.07 0.5279 0.47 0.6808 0.87 0.8078 1.27 0.8980
0.08 0.5319 0.48 0.6844 0.88 0.8106 1.28 0.8997
0.09 0.5359 0.49 0.6879 0.89 0.8133 1.29 0.9015
0.10 0.5398 0.50 0.6915 0.90 0.8159 1.30 0.9032
0.11 0.5438 0.51 0.6950 0.91 0.8186 1.31 0.9049
0.12 0.5478 0.52 0.6985 0.92 0.8212 1.32 0.9066
0.13 0.5517 0.53 0.7019 0.93 0.8238 1.33 0.9082
0.14 0.5557 0.54 0.7054 0.94 0.8264 1.34 0.9099
0.15 0.5596 0.55 0.7088 0.95 0.8289 1.35 0.9115
0.16 0.5636 0.56 0.7123 0.96 0.8315 1.36 0.9131
0.17 0.5675 0.57 0.7157 0.97 0.8340 1.37 0.9147
0.18 0.5714 0.58 0.7190 0.98 0.8365 1.38 0.9162
0.19 0.5753 0.59 0.7224 0.99 0.8389 1.39 j 0.9177
0.20 0.5793 0.60 0.7257 1.00 0.8413 1.40 j 0.9192
0.21 0.5832 0.61 0.7291 1.01 0.8438 1.41 0.9207
0.22 0.5871 0.62 0.7324 1.02 0.8461 1.42 0.9222
0.23 0.5910 0.63 0.7357 1.03 0.8485 1.43 0.9236
0.24 0.5948 0.64 0.7389 1.04 0.8508 1.44 0.9251
0.25 0.5987 0.65 0.7422 1.05 0.8531 1.45 j 0.9265
0.26 0.6026 0.66 0.7454 1.06 0.8554 1.46 0.9279
0.27 0.6064 0.67 0.7486 1.07 0.8577 1.47 0.9292
0.28 0.6103 0.68 0.7517 1.08 0.8599 1.48 0.9306
0.29 0.6141 0.69 0.7549 1.09 0.8621 1.49 0.9319
0.30 0.6179 0.70 0.7580 1.10 0.8643 1.50 0.9332
0.31 0.6217 0.71 0.7611 1.11 0.8665 1.51 0.9345
0.32 0.6255 0.72 0.7642 1.12 0.8686 1.52 0.9357
0.33 0.6293 0.73 0.7673 1.13 0.8708 1.53 0.9370
0.34 0.6331 0.74 0.7703 1.14 0.8729 1.54 0.9382
0.35 0.6368 0.75 0.7734 1.15 0.8749 1.55 0.9394
0.36 0.6406 0.76 0.7764 1.16 0.8770 1.56 0.9406
0.37 0.6443 0.77 0.7794 1.17 0.8790 1.57 0.9418
0.38 0.6480 0.78 0.7823 1.18 0.8810 1.58 0.9429
0.39 0.6517 0.79 0.7853 1.19 0.8830 1.59 0.9441
0.40 0.6554 0.80 0.7881 1.20 0.8849 1.60 0.9452
Table 6 (continued)

x Φ(x) | x Φ(x) | x Φ(x) | x Φ(x)

1.61 0.9463 1.86 0.9686 2.22 0.9868 2.72 0.9967


1.62 0.9474 1.87 0.9693 2.24 0.9875 2.74 0.9969
1.63 0.9484 1.88 0.9699 2.26 0.9881 2.76 0.9971
1.64 0.9495 1.89 0.9706 2.28 0.9887 2.78 0.9973
1.65 0.9505 1.90 0.9713 2.30 0.9893 2.80 0.9974
1.66 0.9515 1.91 0.9719 2.32 0.9898 2.82 0.9976
1.67 0.9525 1.92 0.9726 2.34 0.9904 2.84 0.9977
1.68 0.9535 1.93 0.9732 2.36 0.9909 2.86 0.9979
1.69 0.9545 1.94 0.9738 2.38 0.9913 2.88 0.9980
1.70 0.9554 1.95 0.9744 2.40 0.9918 2.90 0.9981
1.71 0.9564 1.96 0.9750 2.42 0.9922 2.92 0.9982
1.72 0.9572 1.97 0.9756 2.44 0.9927 2.94 0.9984
1.73 0.9582 1.98 0.9761 2.46 0.9931 2.96 0.9985
1.74 0.9591 1.99 0.9767 2.48 0.9934 2.98 0.9986
1.75 0.9599 2.00 0.9772 2.50 0.9938 3.00 0.9986
1.76 0.9608 2.02 0.9783 2.52 0.9941 3.20 0.9993
1.77 0.9616 2.04 0.9793 2.54 0.9945 3.40 0.9996
1.78 0.9625 2.06 0.9803 2.56 0.9948 3.60 0.9998
1.79 0.9633 2.08 0.9812 2.58 0.9951 3.80 0.9999
1.80 0.9641 2.10 0.9821 2.60 0.9953
1.81 0.9649 2.12 0.9830 2.62 0.9956
1.82 0.9656 2.14 0.9838 2.64 0.9959
1.83 0.9664 2.16 0.9846 2.66 0.9961
1.84 0.9671 2.18 0.9854 2.68 0.9963
1.85 0.9678 2.20 0.9861 2.70 0.9965
Table 7

The values of 100·P_n(c), where $P_n(c) = P\left( \sup_x |F_n(x) - G_n(x)| \ge \dfrac{c}{n} \right)$

n \ c | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

5 I 100.00 87.30 35.71 7.94 0.79


6 100.00 93.07 47.40 14.29 2.60 0.22
7 ! 100.00 96.27 57.52 21.21 5.30 0.82 0.06
8 ! 100.00 98.01 66.01 28.27 8.70 1.87 0.25 0.02
9 J 100.00 98.95 73.01 35.17 12.59 3.36 0.63 0.07 0.00
10 100.00 99.45 78.69 41.75 16.78 5.24 1.23 0.21 0.02 0.00
11 100.00 99.71 83.26 47.92 21.15 5.77 1.70 0.38 0.06 0.01 0.00
12 100.00 99.85 86.90 53.61 25.58 9.95 3.14 0.79 0.15 0.02 0.00
13 100.00 99.92 89.78 58.82 29.99 12.65 4.43 1.26 0.29 0.05 0.01 0.00
14 100.00 99.96 92.06 63.55 34.33 15.49 5.90 1.88 0.49 0.10 0.02 0.00
15 I 100.00 99.98 93.83 67.81 38.55 | 18.44 7.55 2.62 0 77 0.18 0.04 | 0.01 0.00
16 I 100.00 99.99 95.23 71.64 42.63 121.45 9.33 3.50 1.12 0.30 0.07 I 0.01 0.00
17 100.00 99.99 96.31 75.06 46.54 24.50 11.24 4.50 1.56 0.46 j 0.12 j 0.02 0.00
18 100.00 100.00 97.15 78.10 50.26 27.54 13.24 5.60 2.07 0.67 j 0.18 j 0.04 0.01 0.00 j
19 i 100.00 100.00 97.81 80.81 53.79 30.57 15.32 6.81 2.67 0.92 J 0.28 J 0.07 0.02 0.00
20 ! 100.00 100.00 98.31 83.20 57.13 33.56 17.45 8.11 3.35 1.23 | 0.40 ( 0.11 0.03 0.01 0.00
21 100.00 100.00 98.70 85.31 60.28 36.50 19.63 9.48 4.11 1.59 | 0.55 j 0.17 0.04 0.01 0.00
22 100.00 100.00 99.01 87.17 63.24 39.37 21.84 10.93 4.93 2.00 j 0.73 0.24 0.07 0.02 0.00
23 100.00 100.00 99.24 88.80 66.01 42.18 24.06 12.43 5.83 2.47 I 0.95 1 0.32 0.10 0.03 0.01 0.00
24 100.00 100.00 99.42 90.24 68.60 44.90 26.28 13.98 6.78 2.99 j 1.20 j 0.43 0.14 0.04 0.01 0.00 j
25 : 100.00 100.00 99.55 91.50 71.02 47.55 28.50 15.58 7.79 3.56 j 1.48 J 0.56 0.19 0.06 0.02 0.00
26 ! 100.00 100.00 99.66 92.60 73.27 50.10 30.71 17.20 8.85 4.18 J 1.81 j 0.71 0.26 0.08 0.02 0.01 0.00
27 100.00 100.00 99.74 93.57 75.37 52.56 32.90 18.86 9.96 4.84 2.17 j 0.98 0.33 0.11 0.03 0.01 0.00
28 100.00 100.00 99.80 94.41 77.32 54.94 35.06 20.53 11.10 5.55 j 2.56 1.09 0.42 0.15 0.05 0.01 0.00
29 100.00 100.00 99.85 95.14 79.12 57.22 37.20 22.21 12.29 6.30 2.99 1.31 0.53 0.20 0.07 0.02 0.01 0.00
30 100.00 100.00 99.88 95.78 80.80 59.41 39.29 23.91 13.50 7.09 3.46 1.56 0.65 0.25 0.09 0.03 0.01 0.00
Table 8

The function $K(z) = \sum_{k=-\infty}^{+\infty} (-1)^k e^{-2k^2 z^2}$

z K(z) | z K(z) | z K(z)

0.28 0.000001 0.71 0.305471 1.14 0.851394


0.29 0.000004 0.72 0.322265 1.15 0.858038
0.30 0.000009 0.73 0.339113 1.16 0.864442
0.31 0.000021 0.74 0.355981 1.17 0.870612
0.32 0.000046 0.75 0 372833 1.18 0.876548
0.33 0.000091 0.76 0.389640 1.19 0.882258
0.34 0.000171 0.77 0.406372 1.20 0.887750
0.35 0.000303 0.78 0.423002 1.21 0.893030
0.36 0.000511 0.79 0.439505 1.22 0.898104
0.37 0.000826 0.80 0.455857 1.23 0.902972
0.38 0.001285 0.81 0.472041 1.24 0.907648
0.39 0.001929 0.82 0.488030 1.25 0.912132
0.40 0.002808 0.83 0.503808 1.26 0.916432
0.41 0.003972 0.84 0.519366 1.27 0.920556
0.42 0.005476 0.85 0.534682 1.28 0.924505
0.43 0.007377 0.86 0.549744 1.29 0.928288
0.44 0.009730 0.87 0.564546 1.30 0.931908
0.45 0.012590 0.88 0.579070 1.31 0.935370
0.46 0.016005 0.89 0.593316 1.32 0.938682
0.47 0.020022 0.90 0.607270 1.33 0.941848
0.48 0.024683 0.91 0.620928 1.34 0.944872
0.49 0.030017 0.92 0.634286 1.35 0.947756
0.50 0.036055 0.93 0.647338 1.36 0.950512
0.51 0.042814 0.94 0.660082 1.37 0.953142
0.52 0.050306 0.95 0.672516 1.38 0.955650
0.53 0.058534 0.96 0.684636 1.39 0.958040
0.54 0.067497 0.97 0.696444 1.40 0.960318
0.55 0.077183 0.98 0.707940 1.41 0.962486
0.56 0.087577 0.99 0.719126 1.42 0.964552
0.57 0.098656 1.00 0.730000 1.43 0.966516
0.58 0.110395 1.01 0.740566 1.44 0.968382
0.59 0.122760 1.02 0.750826 1.45 0.970158
0.60 0.135718 1.03 0.760780 1.46 0.971846
0.61 0.149223 1.04 0.770434 1.47 0.973448
0.62 0.163225 1.05 0.779794 1.48 0.974970
0.63 0.177753 1.06 0.788860 1.49 0.976412
0.64 0.192677 1.07 0.797636 1.50 0.977782
0.65 0.207987 1.08 0.806128 1.51 0.979080
0.66 0.223637 1.09 0.814342 1.52 0.980310
0.67 0.239582 1.10 0.822282 1.53 0.981476
0.68 0.255780 1.11 0.829950 1.54 0.982578
0.69 0.272189 1.12 0.837356 1.55 0.983622
0.70 0.288765 1.13 0.844502 1.56 0.984610
Table 8 (continued)

z K(z) | z K(z) | z K(z)

1.57 0.985544 1.93 0.998837 2.29 0.999944


1.58 0.986426 1.94 0.998924 2.30 0.999949
1.59 0.987260 1.95 0.999004 2.31 0.999954
1.60 0.988048 1.96 0.999079 2.32 0.999958
1.61 0.988791 1.97 0.999149 2.33 0.999962
1.62 0.989492 1.98 0.999213 2.34 0.999965
1.63 0.990154 1.99 0.999273 2.35 0.999968
1.64 0.990777 2.00 0.999329 2.36 0.999970
1.65 0.991364 2.01 0.999380 2.37 0.999973
1.66 0.991917 2.02 0.999428 2.38 0.999976
1.67 0.992438 2.03 0.999474 2.39 0.999978
1.68 0.992928 2.04 0.999516 2.40 0.999980
1.69 0.993389 2.05 0.999552 2.41 0.999982
1.70 0.993828 2.06 0.999588 2.42 0.999984
1.71 0.994230 2.07 0.999620 2.43 0.999986
1.72 0.994612 2.08 0.999650 2.44 0.999987
1.73 0.994972 2.09 0.999680 2.45 0.999988
1.74 0.995309 2.10 0.999705 2.46 0.999989
1.75 0.995625 2.11 0.999723 2.47 0.999990
1.76 0.995922 2.12 0.999750 2.48 0.999991
1.77 0.996200 2.13 0.999770 2.49 0.999992
1.78 0.996460 2.14 0.999790 2.50 0.9999925
1.79 0.996704 2.15 0.999806 2.55 0.9999956
1.80 0.996932 2.16 0.999822 2.60 0.9999974
1.81 0.997146 2.17 0.999838 2.65 0.9999984
1.82 0.997316 2.18 0.999852 2.70 0.9999990
1.83 0.997533 2.19 0.999864 2.75 0.9999994
1.84 0.997707 2.20 0.999874 2.80 0.9999997
1.85 0.997870 2.21 0.999886 2.85 0.99999982
1.86 0.998023 2.22 0.999896 2.90 0.99999990
1.87 0.998145 2.23 0.999904 2.95 0.99999994
1.88 0.998297 2.24 0.999912 3.00 0.99999997
1.89 0.998421 2.25 0.999920
1.90 0.998536 2.26 0.999926
1.91 0.998644 2.27 0.999934
1.92 0.998744 2.28 0.999940
Table 9

The function $\dfrac{4}{\pi} \sum_{k=0}^{\infty} \dfrac{(-1)^k}{2k+1} \exp\left( -\dfrac{(2k+1)^2 \pi^2 (1-a)}{8 a y^2} \right)$

y \ a | 0.01 0.02 0.03 0.04 0.05 0.06 0.07

0.1
0.5
1.0 0.0000 0.0000
1.5 I 0.0000 0.0000 I 0.0000 0.0002 0.0009
2.0 ! 0.0000 0.0001 0.0008 0.0036 0.0101 0.0212
2.5 j 0.0001 0.0022 0.0112 I 0.0299 0.0578 0.0925
3.0 0.0000 0.0015 0.0151 0.0474 0.0941 0.1487 0.2061
3.5 0.0001 0.0092 0.0491 0.1136 0.1879 0.2628 0.3341
4.0 0.0006 0.0291 0.1052 0.2001 0.2942 0.3804 0.4571
4.5 0.0031 0.0643 0.1776 0.2951 0.4001 0.4901 0.5665
5.0 0.0096 0.1134 0.2582 0.3895 0.4985 0.5873 0.6598
5.5 0.0225 0.1726 0.3406 0.4784 0.5863 0.6707 0.7374
6.0 0.0428 0.2375 0.4204 0.5591 0.6627 0.7409 0.8005
6.5 0.0707 0.3045 0.4952 0.6310 0.7282 0.7989 0.8509
7.0 0.1053 0.3708 0.5638 0.6940 0.7834 0.8461 0.8904
7.5 0.1452 j 0.4347 0.6258 0.7484 0.8294 0.8838 0.9207
8.0 0.1889 0.4959 0.6811 0.7951 0.8671 0.9135 0 9436
8.5 0 2348 0.5513 0.7301 0.8345 0.8977 0.9365 0.9606
9.0 0.2819 0.6031 0.7731 0.8676 0.9221 0.9540 0.9729
9.5 0.3290 0.6506 0.8104 0.8950 0.9414 0.9672 0.9817
10.0 0.3754 0.6938 0.8427 0.9175 0.9564 0.9770 0.9878
11.0 0.4640 0.7678 0.8939 0.9505 0.9768 0.9891 0.9949
12.0 0.5450 0.8270 0.9303 0.9714 ; 0.9882 0.9951 0.9980
13.0 0.6174 0.8734 0.9555 0.9841 j 0.9943 0.9980 0.9993
14.0 0.6812 0.9090 0.9724 0.9915 0.9974 0.9992 0.9998
15.0 0.7367 0.9358 0.9833 0.9956 j 0.9988 0.9997 0.9999
16.0 0.7844 0.9555 0.9902 0.9978 | 0.9995 0.9999 1.0000
17.0 0.8249 0.9697 0.9944 0.9990 j 0.9998 1.0000
18.0 0.8591 0.9797 0.9969 0.9995 0.9999
19.0 0.8876 0.9867 0.9983 0.9998 1.0000
20.0 0.9112 0.9915 0.9991 0.9999
21.0 0.9304 0.9946 0.9996 1.0000
22.0 0.9459 0.9967 0.9998
23.0 0.9584 0.9980 0.9999
24.0 0.9683 0.9988 1.0000
25.0 0.9760 0.9993
30.0 0.9949 1.0000
35.0 0.9991
40.0 0.9999
43.0 1.0000
Table 9 (continued)

y \ a | 0.08 0.09 0.1 0.2 0.3 0.4 0.5

0.1 0.0000 0.0000
0.5 0.0000 0.0000 0.0008 0.0092
1.0 0.0000 0.0000 0.0000 0.0092 0.0716 0.2001 0.3708
1.5 0.0023 0.0050 0.0092 0.1420 0.3542 0.5591 0.7328
2.0 0.0367 0.0563 0.0793 0.3708 0.6193 0.7951 0.9090
2.5 0.1315 0.1730 0.2155 0.5778 0.7966 0.9175 0.9752
3.0 0.2632 0.3184 0.3708 0.7328 0.9009 0.9714 0.9946
3.5 0.3999 0.4599 0.5142 0.8398 0.9561 0.9915 0.9991
4.0 0.5244 0.5835 0.6353 0.9090 0.9823 0.9978 0.9999
4.5 0.6311 0.6860 0.7328 0.9511 0.9936 0.9995 1.0000
5.0 0.7193 0.7683 0.8088 0.9752 0.9979 0.9999
5.5 0.7903 0.8326 0.8665 0.9881 0.9994 1.0000
6.0 0.8463 0.8817 0.9090 0.9946 0.9998
6.5 0.8895 0.9181 0.9395 0.9977 1.0000
7.0 0.9220 0.9446 0.9607 0.9991
7.5 0.9460 0.9633 0.9752 0.9996
8.0 0.9634 0.9763 0.9847 0.9999
8.5 0.9756 0.9850 0.9908 1.0000
9.0 0.9841 0.9907 0.9946
9.5 0.9898 0.9944 0.9969
10.0 0.9936 0.9967 0.9983
11.0 0.9976 0.9989 0.9995
12.0 0.9992 0.9997 0.9999
13.0 0.9997 0.9999 1.0000
14.0 0.9999 1.0000
15.0 1.0000
16.0
17.0
18.0
19.0
20.0
21.0
22.0
23.0
24.0
25.0
30.0
35.0
40.0
43.0
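A reader wishing to check or extend the table can sum the series above directly; the following minimal Python sketch does so (the function name renyi_T is ours, not the book's; a few terms of the alternating series already give the printed four-decimal accuracy):

    import math

    def renyi_T(y, a, terms=50):
        # T(y, a) = (4/pi) * sum_{k>=0} (-1)^k / (2k+1)
        #           * exp(-(2k+1)^2 * pi^2 * (1-a) / (8 * a * y^2))
        c = math.pi ** 2 * (1.0 - a) / (8.0 * a * y * y)
        s = sum(
            (-1) ** k / (2 * k + 1) * math.exp(-((2 * k + 1) ** 2) * c)
            for k in range(terms)
        )
        return 4.0 / math.pi * s

    # Spot checks against the table:
    assert abs(renyi_T(2.0, 0.5) - 0.9090) < 5e-5
    assert abs(renyi_T(10.0, 0.01) - 0.3754) < 5e-5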
REMARKS AND BIBLIOGRAPHICAL NOTES

These notes wish to call attention to books and papers which may be useful to the
reader for further study of subjects dealt with in the present textbook, including
books and papers to which reference was made in the text. For topics which are
treated in detail in some current textbook, we mention only such books, where the
reader can find further references.
As regards topics not discussed in standard textbooks, the sources of the material
contained in this book are given in greater detail. These bibliographic notes often
contain remarks on the historical development of the problems dealt with, but
to give a full account of the history of probability theory was of course impossible.
For the history of Probability Calculus up to Laplace see Todhunter [1].
Concerning less-known theorems or methods from other branches of mathematics,
we refer to some current textbook readily accessible to the reader.
The notes are restricted to the most important methodical problems. On several
occasions the method of exposition chosen in the present book is compared in the
notes to that in other textbooks.

Chapter I

Glivenko was the first to stress in his textbook (Glivenko [3]; cf. also Kolmogorov
[9]) the advantage of discussing the algebra of events as a Boolean algebra before
the introduction of the notion of probability. It seems to us that the understanding
of Kolmogorov’s axiomatic theory is hereby facilitated. On the general theory of
measure and integration over a Boolean algebra instead of over a field of sets see
Carathéodory [1]. Recent results on probability as a measure on a Boolean algebra
are summarized in Kappos [1].
§§ 1-4. On Boolean algebras in general see Birkhoff [1], Glivenko [2]. We did
not give a system of independent axioms for Boolean algebras, since it seemed to
us of much more importance to present the rules of Boolean algebra in a way which
makes clear the duality of the two basic operations.
§ 5. See Stone [1]. We follow here Frink [1]; as to the Lemma see Hausdorff [1]
and Frink [2].
§ 6. The unsolved problem mentioned in Exercise 7 was first formulated by Dedekind
(cf. Birkhoff [1], p. 147). Concerning Exercise 11 see e.g. Gavrilov [1].

Chapter II

From Chapter II on, probability theory is developed on the basis of Kolmogorov's
axiomatics, first published in Kolmogorov [5]. The idea to consider probability as
an additive set function has — like every important mathematical idea — many
forerunners, cf. e.g. Borel [1], Łomnicki [1], Lévy [1], Steinhaus [1], Jordan [1], [2].
The merit of Kolmogorov was to formulate for the first time this idea consequently
and in its whole generality and to show how from this idea probability theory can
be developed as a strictly axiomatic branch of modern mathematics. Herewith he
solved one of the famous problems of Hilbert. Nearly all modern textbooks of proba­
bility theory and mathematical statistics (cf. e.g. Blanc-Lapierre and Fortet [1],
Cramér [2], Doob [1], Feller [7], Fisz [1], Fréchet [1], Gnedenko [3], Kac [3],
Lévy [3], [4], Loève [1], Neyman [1], [2], Onicescu, Mihoc and Ionescu-Tulcea [1],
Parzen [1], Richter [1], Schmetterer [1], van der Waerden [1]) and nearly the whole
recent literature of probability theory and mathematical statistics are based on Kol­
mogorov’s axiomatics. Earlier theories and discussions of the concept of probability
may be found in Laplace [1], [2], Bernstein [2], von Mises [1], [2], Wald [1].
§ 11 and § 12 present a generalized system of axioms which contains that of Kol-
mogorov as a particular case (cf. Rényi [15]).
§ 3. Concerning Theorem 10 cf. Ch. Jordan [3]. A large number of general identities
and inequalities between probabilities of events can be found in Fréchet [2]. With
respect to Theorems 11 and 12 see Rényi [23]. The method based on these theorems
is closely related to the method of indicator functions of Loève (cf. Loève [1]). About
the relation of the two methods to each other cf. Rényi [35].
§ 7. On Measure Theory see Halmos [1], Aumann [1]. The proof of the σ-additivity
of the Lebesgue-Stieltjes measure (by means of Kolmogorov's Theorem 3) differs from
those given usually in textbooks and is due to Catherine Rényi and A. Rényi.
§ 10. On integral geometry cf. Blaschke [1].
§ 11. Some authors (cf. e.g. Reichenbach [1], Popper [1], [2]) have long ago
emphasized that conditional probability may and should be chosen as the basic
concept of probability theory. But the starting point of these authors was essentially
philosophical. They did not try to give a corresponding generalization of Kolmogorov’s
axiomatic theory. The idea, that unbounded measures may serve as legitimate
(conditional) probability distributions, does not appear in these early works. On the
other hand, unbounded measures were long ago used in statistics as Bayesian a priori
distributions (cf. e.g. Jeffreys [1], Dumas [1], [2], Baticle [1], [2]), without an exact
mathematical foundation. In the paper [15] of Rényi, these two points of view are,
in a certain sense, united and connected to Kolmogorov’s axiomatic theory. Concerning
the theory of conditional probability algebras cf. further Rényi [14], [18], [19],
Császár [1], Kappos [1]. Somewhat later, but independently, Luce [1] constructed
a similar system of axioms for finite conditional probability algebras, by using an
entirely different reasoning (starting from an investigation of the psychological laws
of choice and preference).
§ 12. Concerning exchangeable events (Exercises 38-41) cf. de Finetti [1], Khin-
chin [2]. For Exercise 46, see Wilcoxon [1], Rényi [12]. On Exercise 47 and in general
on applications of probability theory to number theory see e.g. Kac [4], Rényi [25].

Chapter III

§ 2. Cf. Bayes [1].
§ 3. On the Pólya-distribution see Eggenberger-Pólya [1], for further generalization
see Onicescu-Mihoc [ 1].
§ 11. Concerning Theorem 6 see Kantorovich [1].
§ 13. Poisson [1], Bortkiewicz [1]. For a general theory of simple and composed
Poisson processes see Aczél-Jánossy-Rényi [1], Rényi [5], [6], Aczél [2], Prékopa [1],
Marczewski [1], Florek, Marczewski and Ryll-Nardzewski [1], Saxer [1].
§ 14. The idea that in number theory the application of Dirichlet’s series can be
replaced by a formal algebraic calculus is — according to Hardy — due to Harald
Bohr (Hardy and Wright [1], p. 259). Instead of Dirichlet’s series, this idea is applied
here to power series and hereby to the convolutions of probability distributions.
On related problems see Rényi [4].
§ 16. On Euler’s summation formula see Knopp [1].
§ 17. Concerning Exercise 8, see Bernstein [4]; for a generalization Erdős and
Rényi [4]. For Exercise 32, see Bernstein [1] and also Arató-Rényi [1]. For Exercise
35, see Feldheim [2] and Rényi [20].

Chapter IV

§ 7. On projections of probability distributions see Cramér-Wold [1], Rényi [8].
§ 8. On the lognormal distribution see Kolmogorov [8], Rényi [32].
§ 9. Concerning Example 1, see Lobatchewski [1]. On the χ²-distribution, see
Helmert [1], K. Pearson [1]; on the beta-function, Lösch-Schoblik [1].
§ 10. Example 1: Student [1], Example 2: Fisher [1].
§ 17. Concerning Exercises 17-18, cf. Malmquist [1], van Dantzig [1], Hajós-
Rényi [1]. Exercise 26: Bateman [1], Bharucha-Reid [1]. Exercise 45: see, for
instance, Veksler-Groshev-Isaev [1].

Chapter V

§ 1. The content of this section appeared first in the present book (German edition,
1962).
§ 2. We follow here Kolmogorov [5]. For the Radon-Nikodym theorem, see e.g.
Halmos [1 ].
§ 3. Concerning the new deduction of the Maxwell distribution given here, see
Rényi [19].
§ 4. We follow here Kolmogorov [5].
§ 6 and § 7. See Gebelein [1], Rényi [26], [28], Csáki-Fischer [1], [2]. On the
Lemma of Theorem 3 of § 7, see Boas [1]. On Theorem 3, see Rényi [26]. On the
applications of the large sieve of Linnik in number theory, see Linnik [1], Rényi [2],
Bateman-Chowla-Erdős [1].
§ 7. Exercises 1-4 treat problems of integral geometry from the point of view of
probability theory (see Blaschke [1]). Exercise 6: cf. Hajós-Rényi [1]; Exercises
28-30: Rényi [28].

Chapter VI

The method of characteristic functions, as mentioned by Lévy [4], goes essentially
back to Cauchy. As to its application in probability theory, the merit is due to
Liapunov [1], Pólya [1] and Lévy [2]. Detailed expositions of the theory may be
found in Dugué [1], Esseen [1], Ky Fan [1], Linnik [3], Lukács [4]. On Fourier
series and integrals, see Zygmund [1], [2].
§ 4. On the theorem of Helly, cf. Lukács [4].
§ 5. Theorem 1: Bernstein [4]; Theorem 3: Cramér [1]; Theorem 4: Linnik-
Singer [1]. Concerning the theorem on the singularities of a power series with positive
coefficients, which was applied in point 4, see Titchmarsh [1]. On the theorem of
H. A. Schwarz, cf. Hurwitz-Courant [1]. For the formula of Faà di Bruno, cf. Jordan
[4], Lukács [2]. Theorem 5: cf. Darmois [1] and Skitovitch [1]; Theorem 6: Lukács
[1]; see also Geary [1], Kawata-Sakamoto [1], Singer [1] and Lukács [3].
§§ 7-8. On the theory of infinitely divisible and stable distributions see Gnedenko-
Kolmogorov [1], Lévy [4], Khinchin-Lévy [1], Feldheim [1].
§ 7. Concerning the theory of distributions (theory of generalized functions) we
follow the book of Lighthill [1 ], which develops the theory established by J. Mikusinski.
The application of the theory of distributions to probability theory was published
for the first time in the German edition of the present book. For Theorem 7, see
Robbins [1]. On the method of Laplace, see de Bruijn [1]. On the theory of theta
functions, cf. Hurwitz and Courant [1]. Chung and Erdős proved Theorem 8 in a
stronger form.
§ 10. Exercise 4: see Shannon [1]; Exercise 6: Hardy-Littlewood-Pólya [1];
Exercise 14 (for a more general theorem): Császár [2]; Exercise 20: Laha [1].

Chapter VII

§ 1. Chebyshev [1], Bienaymé [1]. The first Chebyshev-type inequality is due to
Gauss [1].
§ 2. Bernoulli [1], Slutsky [1].
§ 3. Markov [1], Khinchin [4].
§ 4. Bernstein [4], Uspenski [1].
§ 5. Borel [1], Cantelli [1]. On Lemma C, see Erdős-Rényi [2].
§ 6. Kolmogorov [1], [5].
§ 7. Kolmogorov [2]. On Lemma 2: Knopp [1].
§ 8. Glivenko [1].
§ 9. Khinchin [1], Kolmogorov [1], Erdős [1], Feller [4], Rényi [1].
§ 10. Rényi [24], Rényi-Révész [1].
§ 11. Rényi [38].
§ 12. Rényi-Révész [2], Rényi [38].
§ 13. Kolmogorov [5].
§ 14. Kolmogorov [5], Khinchin-Kolmogorov [1], Steinhaus [2] and also Stein­
haus, Kac and Ryll-Nardzewski [1]; for the lemma, Doob [2].
§ 15. Rényi [15], [17]. For the theorem of Abel-Dini, cf. Knopp [1], p. 173.
§ 16. Exercise 10, cf. Hille [1]; Exercise 11: Widder [1]; Exercise 13: Rényi [1];
Exercise 14: Hájek-Rényi [1]; Exercise 21: Doob [2] (the theorem mentioned in
the remark is due to Menshov); Exercise 24: Kolmogorov [5].

Chapter VIII
§ 1. Chebyshev [1], Markov [1], Liapunov [1], Lindeberg [1], Pólya [1], Feller [1],
Khinchin [3], Gnedenko-Kolmogorov [1], Kolmogorov [5], [11], Prékopa-Rényi-
Urbanik [1].
§ 2. Gnedenko [2].
§ 3. Lévy [4], Feller [1], Khinchin [5], and also Gnedenko-Kolmogorov [1].
§ 5. Erdős-Rényi [1]; for the particular case p = const, cf. Bernstein [4].
§ 6. For the lemma, cf. Cramér [3]. Theorem 3 was first, under certain restrictions,
proved by a different method (cf. Rényi [3]). This result was generalized by Kolmo­
gorov [10]. For the more simple proof given here, cf. Rényi [24]. Theorem 3 may be
applied to prove limit theorems for dependent random variables; cf. Révész [1].
On the central limit theorem for dependent random variables, see the fundamental
paper of Bernstein [3].
§ 7. Anscombe [1], Doeblin [2], Rényi [22], [31].
§ 8. On the theory and applications of Markov chains and Markov processes
see Markov [1], Kolmogorov [3], [7]; Doeblin [1], [2], Feller [2], [3], [6], [7],
Doob [2], Chung [1], Bartlett [1], Bharucha-Reid [1], Wiener [2], Chandrasekhar
[1], Einstein [1], Hostinsky [1], Lévy [3], Rényi [11].
§ 9. Rényi [9], [10], van Dantzig [1], Malmquist [1]; further references are to
be found in Wilks [1], Wang [1].
§ 10. Kolmogorov [6], N. V. Smirnov [1], [2], Gnedenko [1], Gnedenko-Koroljuk
[1], Doob [1], Feller [5], Donsker [1].
§ 11. For Theorem 1: Pólya [2]. See also Dvoretzky-Erdős [1]. On the arc sine
law (Theorem 6) cf. Lévy [2], Erdős-Kac [2]; Sparre-Andersen [1], [2]; Chung-
Feller [1], Rényi [33]. On Lemma 2, Rényi [36]; for other generalizations, Spitzer
[1]. For Theorem 8, see Erdős-Hunt [1]; for Theorem 9, Erdős-Kac [1]; for a gen­
eralization of it, Rényi [9].
§ 12. Lindeberg [1], Krickeberg [1].
§ 13. Exercise 5: Rényi [17]; for a similar general system of independent functions,
see Steinhaus, Kac and Ryll-Nardzewski [1]-[10], Rényi [7]. Exercise 8: Kac [1];
Exercise 24: Wilcoxon [1] and Rényi [12]; Exercise 25: Lehmann [1] and Rényi
[12]; the equivalence of the problems considered in these two papers is proved in
E. Csáki [1]. Exercise 28: Erdős-Rényi [3]. The result of Exercise 30 is due to
Chung and Feller [1]; as regards the presentation given here, cf. Rényi [33].

Appendix

On the concepts of the entropy and information see Boltzmann [1], Hartley [1],
Shannon [1], Wiener [1], Shannon-Weaver [1], Woodward [1], Barnard [1], Jeffreys
[1], and the papers of Khinchin, Fadeev, Kolmogorov, Gelfand, Jaglom, etc., in
Arbeiten zur Informationstheorie I-III. On the role of the notion of information in
statistics, see the works of Fisher [1]-[3] and of Kullback [1]. The notion of the dimension
of a probability distribution and that of the entropy of the corresponding dimension
were introduced in a paper of Balatoni and Rényi (Arbeiten zur Informationstheorie I)
and were further developed in Rényi [27], [30]. Measures of information differing
from the Shannon-measure were already considered earlier, e.g. by Bhattacharyya
[1] and Schützenberger [1]; the theory of entropy and information of order α is
developed in Rényi [34], [37].

Part of the material appeared for the first time in the German edition of this book.
This appendix covers merely the basic notions of information theory; their appli­
cation to the transmission of information through a noisy channel, coding theory,
etc. are not dealt with here. Besides the already mentioned works of Shannon and
Khinchin, we refer to those of Feinstein [1], [2], McMillan [1] and Wolfowitz
[1], [2], [3].
§ 1. Concerning the theorem of Erdős on additive number-theoretical functions,
which was rediscovered by Fadeev, see Erdős [2]; the simple proof given in the text
is due to Rényi [29].
§ 2. For the theorem of Mercer, see Knopp [1].
§ 6. On the mean value theorem, see de Finetti [2], Kolmogorov [4], Nagumo [1],
Aczél [1], Hardy-Littlewood-Pólya [1] (where further references can be found;
this book contains also all other inequalities used in the Appendix, e.g. the inequalities
of Jensen and of Hölder).
§ 9. The idea that quantities of information theory may be used for the proof of
limit theorems is due to Linnik [2].
On the theorem of Perron-Frobenius, see Gantmacher [1].
§ 11. For Exercise 2c, see Rényi [16] and Kac [2]; Exercise 3c: Khinchin [6], for
the generalizations: Rényi [21]. Exercise 4: Kolmogorov-Tikhomirov (Arbeiten zur
Informationstheorie III). The content of Exercise 9 is due to Chung (unpublished
communication), the proof given here differs from that of Chung. Exercise 17b: cf.
Moriguti [1].

Tables

Further tables and graphic representations useful in probability calculus may be
found in Fisher-Yates [1], E. S. Pearson-H. O. Hartley [1], Molina [1], Koller [1].

REFERENCES

Aczél, J.
[1] On mean values, Bull. Amer. Math. Soc. 54, 393-400 (1948).
[2] On composed Poisson distributions, III, Acta Math. Acad. Sci. Hung. 3, 219-224 (1952).
Aczél, J., L. Jánossy and A. Rényi
[1] On composed Poisson distributions, I, Acta Math. Acad. Sci. Hung. 1, 209-224 (1950).
Alexandrov, P. S. (Александров, П. С.)
[1] Введение в общую теорию множеств и функций (Introduction to the theory of sets and functions), OGIZ, Moscow-Leningrad 1948.
Anscombe, F. J.
[1] Large sample theory of sequential estimation, Proc. Cambridge Phil. Soc. 48, 600 (1952).
Arató, M. and A. Rényi
[1] Probabilistic proof of a theorem on the approximation of continuous functions by means of generalized Bernstein polynomials, Acta Math. Acad. Sci. Hung. 8, 91-98 (1957).
Arbeiten zur Informationstheorie I-III (Teil I von A. J. Chintschin, D. K. Faddejew, A. N. Kolmogoroff, A. Rényi und J. Balatoni; Teil II von I. M. Gelfand, A. M. Jaglom, A. N. Kolmogoroff, Chiang Tse-Pei, I. P. Zaregradski; Teil III von A. N. Kolmogoroff und W. M. Tichomirow), VEB Deutscher Verlag der Wissenschaften, Berlin 1957 bzw. 1960.
Aumann, G.
[1] Reelle Funktionen, Springer-Verlag, Berlin-Göttingen-Heidelberg 1954.

Barban, M. B. (Барбан, М. Б.)
[1] Об одной теореме P. Bateman, S. Chowla, P. Erdős (On a theorem of P. Bateman, S. Chowla and P. Erdős), Publications of the Mathematical Institute of the Hungarian Academy of Sciences 9 A, 429-435 (1964).
Barnard, G. A.
[1] The theory of information, J. Royal Stat. Soc. (B) 13, 46-69 (1951).
Bartlett, M. S.
[1] An introduction to stochastic processes with special reference to methods and applications, Cambridge Univ. Press, Cambridge 1955.
Bateman, H.
[1] The solution of a system of differential equations occurring in the theory of radioactive transformations, Proc. Cambridge Phil. Soc. 15, 423-427 (1910).
Bateman, P. T., S. Chowla and P. Erdős
[1] Remarks on the size of L(1, χ), Publ. Math. Debrecen 1, 165-182 (1950).
Baticle, E.
[1] Sur une loi de probabilité a priori des paramètres d'une loi laplacienne, C. R. Acad. Sci. Paris 226, 55-57 (1948).


Baticle, E.
[2] Sur une loi de probabilité a priori pour l'interprétation des résultats de tirages dans une urne, C. R. Acad. Sci. Paris 228, 902-904 (1949).
Bauer, H.
[1] Wahrscheinlichkeitstheorie und Grundzüge der Masstheorie, Sammlung Göschen 1216/1216a, de Gruyter, Berlin 1964.
Bayes, Th.
[1] Essay towards solving a problem in the doctrine of chances, "Ostwald's Klassiker der Exakten Wissenschaften", Nr. 169, W. Engelmann, Leipzig 1908.
Bernoulli, J.
[1] Ars Coniectandi (1713) I-II, III-IV, "Ostwald's Klassiker der Exakten Wissenschaften", Nr. 108, W. Engelmann, Leipzig 1899.
Bernstein, S. N. (Бернштейн, С. Н.)
[1] Démonstration du théorème de Weierstrass fondée sur le calcul des probabilités, Soobshch. Charkovskovo Mat. Obshch. (2) 13, 1-2 (1912).
[2] Опыт аксиоматического обоснования теории вероятностей (On a tentative axiomatisation of probability theory), Charkovskovo Zap. Mat. ot-va 15, 209-274 (1917).
[3] Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes, Math. Ann. 97, 1-59 (1926).
[4] Теория вероятностей (Probability theory), 4. ed., Gostehizdat, Moscow 1946.
Bharucha-Reid, A. T.
[1] Elements of the theory of Markov processes and their applications, McGraw-Hill, New York 1960.
Bhattacharyya, A.
[1] On some analogues of the amount of information and their use in statistical estimation, Sankhya 8, 1-14 (1946).
Bienaymé, M.
[1] Considérations à l'appui de la découverte de Laplace sur la loi des probabilités dans la méthode des moindres carrés, C. R. Acad. Sci. Paris 37, 309-324 (1853).
Birkhoff, G.
[1] Lattice theory, 3. ed., American Mathematical Society Colloquium Publications 25, AMS, Providence 1967.
Blanc-Lapierre, A. et R. Fortet
[1] Théorie des fonctions aléatoires, Masson et Cie., Paris 1953.
Blaschke, W.
[1] Vorlesungen über Integralgeometrie, 3. Aufl., VEB Deutscher Verlag der Wissenschaften, Berlin 1955.
Blum, J. R., D. L. Hanson and L. H. Koopmans
[1] On the strong law of large numbers for a class of stochastic processes, Zeitschrift für Wahrscheinlichkeitstheorie 2, 1-11 (1963).
Boas, R. P. Jr.
[1] A general moment problem, Amer. J. Math. 63, 361-370 (1941).
Bochner, S. and S. Chandrasekharan
[1] Fourier transforms, Princeton Univ. Press, Princeton 1949.
Boltzmann, L.
[1] Vorlesungen über Gastheorie, Johann Ambrosius Barth, Leipzig 1896.
Borel, É.
[1] Sur les probabilités dénombrables et leurs applications arithmétiques, Rend. Circ. Mat. Palermo 26, 247-271 (1909).
[2] Éléments de la théorie des probabilités, Hermann et Fils, Paris 1909.

von Bortkiewicz, L.
[1] Das Gesetz der kleinen Zahlen, B. G. Teubner, Leipzig 1898.
de Bruijn, N. G.
[1] Asymptotic methods in analysis, North Holland Publ. Comp. Inc., Amsterdam 1958.

Cantelli, F. P.
[1] La tendenza ad un limite nel senso del calcolo delle probabilità, Rend. Circ. Mat. Palermo 16, 191-201 (1916).
Carathéodory, C.
[1] Entwurf einer Algebraisierung des Integralbegriffes, Sitzungsber. Math.-Naturwiss. Klasse Bayer. Akad. Wiss., München 1938, S. 24-28.
Chandrasekhar, S.
[1] Stochastic problems in physics and astronomy, Rev. Mod. Phys. 15, 1-89 (1943).
Chebyshev, P. L. (Чебышев, П. Л.)
[1] Теория вероятностей (Theory of probability), Akad. izd., Moscow 1936.
Chung, K. L.
[1] Markov chains with stationary transition probabilities, Springer-Verlag, Berlin-Göttingen-Heidelberg 1960.
Chung, K. L. and P. Erdős
[1] Probability limit theorems assuming only the first moment, I, Mem. Amer. Math. Soc. 6, 1-19 (1950).
Chung, K. L. and W. Feller
[1] On fluctuations in coin-tossing, Proc. Nat. Acad. Sci. USA 35, 605-608 (1949).
Cramér, H.
[1] Über eine Eigenschaft der normalen Verteilungsfunktion, Math. Z. 41, 405-414 (1936).
[2] Random variables and probability distributions, Cambridge Univ. Press, Cambridge 1937.
[3] Mathematical methods of statistics, Princeton Univ. Press, Princeton 1946.
Cramér, H. and H. Wold
[1] Some theorems on distribution functions, J. London Math. Soc. 11, 290-294 (1936).
Csáki, E.
[1] On two modifications of the Wilcoxon-test, Publ. Math. Inst. Hung. Acad. Sci. 4, 313-319 (1959).
Csáki, P. and J. Fischer
[1] On bivariate stochastic connection, Publ. Math. Inst. Hung. Acad. Sci. 5, 311-323 (1960).
[2] Contributions to the problem of maximal correlation, Publ. Math. Inst. Hung. Acad. Sci. 5, 325-337 (1960).
Császár, Á.
[1] Sur la structure des espaces de probabilité conditionnelle, Acta Math. Acad. Sci. Hung. 6, 337-361 (1955).
[2] Sur une caractérisation de la répartition normale de probabilités, Acta Math. Acad. Sci. Hung. 7, 359-382 (1956).

van Dantzig, D.
[1] Mathematische Statistiek, "Kadercursus Statistiek, 1947-1948", Mathematisch Centrum, Amsterdam 1948.

Darmois, G.
[1] Analyse générale des liaisons stochastiques, Revue Inst. Internat. Stat. 21, 2-8 (1953).
Doeblin, W.
[1] Sur les propriétés asymptotiques de mouvements régis par certains types de chaînes simples, Bull. Soc. Math. Roumaine Sci. 39(1), 57-115 (1937); 39(2), 3-61 (1937).
[2] Éléments d'une théorie générale des chaînes simples constantes de Markov, Ann. Sci. École Norm. Sup. (3) 57, 61-111 (1940).
Donsker, M. D.
[1] Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems, Ann. Math. Stat. 23, 277-281 (1952).
Doob, J. L.
[1] Heuristic approach to the Kolmogorov-Smirnov theorems, Ann. Math. Stat. 20, 393 (1949).
[2] Stochastic processes, Wiley-Chapman, New York-London 1953.
Dugué, D.
[1] Arithmétique des lois de probabilités, Mém. Sci. Math., No. 137, Gauthier-Villars, Paris 1957.
Dumas, M.
[1] Sur les lois de probabilités divergentes et la formule de Fisher, Interméd. Rech. Math. 9 (1947), Supplément 127-130.
[2] Interprétation de résultats de tirages exhaustifs, C. R. Acad. Sci. Paris 228, 904-906 (1949).
(See the note by É. Borel following Dumas' article, too.)
Dvoretzky, A. and P. Erdős
[1] Some problems on random walk in space, Proc. 2nd Berkeley Symp. Math. Stat. Prob. 1950, Univ. California Press, Berkeley-Los Angeles 1951, 353-367.

Eggenberger, F. und G. Pólya
[1] Über die Statistik verketteter Vorgänge, Z. angew. Math. Mech. 3, 279-289 (1923).
Einstein, A.
[1] Zur Theorie der Brownschen Bewegung, Ann. Physik 19, 371-381 (1906).
Erdős, P.
[1] On the law of the iterated logarithm, Ann. Math. 43, 419-436 (1942).
[2] On the distribution function of additive functions, Ann. Math. 47, 1-20 (1946).
Erdős, P. and G. A. Hunt
[1] Changes of signs of sums of random variables, Pacific J. Math. 3, 678-679 (1953).
Erdős, P. and M. Kac
[1] On certain limit theorems of the theory of probability, Bull. Amer. Math. Soc. 52, 292-302 (1946).
[2] On the number of positive sums of independent random variables, Bull. Amer. Math. Soc. 53, 1011-1020 (1947).
Erdős, P. and A. Rényi
[1] On the central limit theorem for samples from a finite population, Publ. Math. Inst. Hung. Acad. Sci. 4, 49-61 (1959).
[2] On Cantor's series with convergent Σ 1/qₙ, Ann. Univ. Sci. Budapest, Rolando Eötvös nom., Sect. Math. 2, 93-109 (1959).

Erdős, P. and A. Rényi
[3] On the evolution of random graphs, Publ. Math. Inst. Hung. Acad. Sci. 5, 17-61 (1960).
[4] On a classical problem of probability theory, Publ. Math. Inst. Hung. Acad. Sci. 6, 215-220 (1961).
Esseen, C. G.
[1] Fourier analysis of distribution functions. A mathematical study of the Laplace-Gaussian law, Acta Math. 77, 1-125 (1945).

Feinstein, A.
[1] A new basic theorem of information theory, Trans. Inst. Radio Eng., 2-22 (1954).
[2] Foundations of information theory, McGraw-Hill, New York 1958.
Feldheim, E.
[1] Étude de la stabilité des lois de probabilité, Dissertation, Univ. Paris, Paris 1937.
[2] Neuere Beweise und Verallgemeinerung der wahrscheinlichkeitstheoretischen Sätze von Simmons, Mat. Fiz. Lapok 45, 99-114 (1938).
Feller, W.
[1] Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung, Math. Z. 40, 521-559 (1935); 42, 301-312 (1937).
[2] Zur Theorie der stochastischen Prozesse, Existenz- und Eindeutigkeitssätze, Math. Ann. 113, 113-160 (1936).
[3] On the integro-differential equations of purely discontinuous Markov processes, Trans. Amer. Math. Soc. 48, 488-515 (1940). Errata: ibidem 58, 474 (1945).
[4] The law of the iterated logarithm for identically distributed random variables, Ann. Math. 47, 631-638 (1946).
[5] On the Kolmogorov-Smirnov limit theorems for empirical distributions, Ann. Math. Stat. 19, 177-189 (1948).
[6] On the theory of stochastic processes, with particular reference to applications, Proc. Berkeley Symp. Math. Stat. Prob. 1945, 1946, Univ. California Press, Berkeley-Los Angeles 1949, 403-432.
[7] An introduction to probability theory and its applications, Vols 1-2, Wiley, New York 1950-1966.
de Finetti, B.
[1] Funzione caratteristica di un fenomeno aleatorio, Mem. R. Accad. Lincei (6) 4, 85-133 (1930).
[2] Sul concetto di media, Giorn. Ist. Ital. Att. 2, 369-396 (1931).
Fisher, R. A.
[1] Statistical methods for research workers, 10th edition, Oliver-Boyd Ltd., Edinburgh-London 1948.
[2] The design of experiments, Oliver-Boyd Ltd., London-Edinburgh 1949.
[3] Contributions to mathematical statistics, Wiley-Chapman, New York-London 1950.
Fisher, R. A. and F. Yates
[1] Statistical tables for biological, agricultural and medical research, Oliver-Boyd Ltd., London-Edinburgh 1949.
Fisz, M.
[1] Probability theory and mathematical statistics, 3. ed., Wiley, New York 1963.
Florek, K., E. Marczewski and C. Ryll-Nardzewski
[1] Remarks on the Poisson stochastic process, I, Studia Math. 13, 122-129 (1953).

Fréchet, M.
[1] Recherches théoriques modernes, Fascicule 3 du Tome 1 du Traité du calcul des probabilités par É. Borel et divers auteurs, Gauthier-Villars, Paris 1937.
[2] Les probabilités associées à un système d'événements compatibles et dépendants, I-II, Hermann et Cie., Paris 1940 and 1943.
Frink, O.
[1] Representations of Boolean algebras, Bull. Amer. Math. Soc. 47, 755-756 (1941).
[2] A proof of the maximal chain theorem, Amer. J. Math. 74, 676-678 (1952).

Gantmacher, F. R. (Гантмахер, Ф. Р.)
[1] Matrizenrechnung, I-II, VEB Deutscher Verlag der Wissenschaften, Berlin 1958 bzw. 1959 (Übersetzung aus dem Russischen).
Gauss, C. F.
[1] Theoria combinationis observationum erroribus minimis obnoxiae, Göttingen 1821.
Gavrilov, M. A. (Гаврилов, М. А.)
[1] Теория релейно-контактных схем (Theory of relay-contact schemes), Moscow-Leningrad 1950.
Geary, R. C.
[1] Distribution of Student's ratio for nonnormal samples, J. Royal Stat. Soc. Supplement 3, 178-184 (1936).
Gebelein, H.
[1] Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichungsrechnung, Z. angew. Math. Mech. 21, 364-379 (1941).
Glivenko, V. I. (Гливенко, В. И.)
[1] Sulla determinazione empirica di una legge di probabilità, Giorn. Ist. Ital. Att. 4, 1-10 (1933).
[2] Théorie générale des structures, Act. Sci. Industr. Nr. 652, Hermann et Cie., Paris 1938.
[3] Курс теории вероятностей (A course of probability theory), GONTI, Moscow-Leningrad 1939.
Gnedenko, B. V. (Гнеденко, Б. В.)
[1] Sur la distribution limite du terme maximum d'une série aléatoire, Ann. Math. 44, 423-453 (1943).
[2] Локальная предельная теорема для плотностей (A local limit theorem for probability densities), Dokl. Akad. Nauk SSSR 95, 5-7 (1954).
[3] The theory of probability (transl. from the Russian), Chelsea, New York 1962.
Gnedenko, B. V. and A. N. Kolmogorov (Гнеденко, Б. В. и А. Н. Колмогоров)
[1] Limit distributions for sums of independent random variables, Addison-Wesley, Cambridge (Mass.) 1954.
Gnedenko, B. V. and V. S. Koroljuk (Гнеденко, Б. В. и В. С. Королюк)
[1] О максимальном расхождении двух эмпирических распределений (On the maximal divergence of two empirical distributions), Dokl. Akad. Nauk SSSR 80, 525 (1951).

Hájek, J. and A. Rényi
[1] Generalization of an inequality of Kolmogorov, Acta Math. Acad. Sci. Hung. 6, 281-283 (1955).

Hajós, G. and A. Rényi
[1] Elementary proofs of some basic facts concerning order statistics, Acta Math. Acad. Sci. Hung. 5, 1-6 (1954).
Halmos, P. R.
[1] Measure Theory, van Nostrand, New York 1950.
Hardy, G. H.
[1] Divergent series, Clarendon Press, Oxford 1949.
Hardy, G. H., J. E. Littlewood and G. Pólya
[1] Inequalities, 2nd edition, Cambridge Univ. Press, Cambridge 1952.
Hardy, G. H. and W. W. Rogosinski
[1] Fourier series, 3rd edition, Cambridge Univ. Press, Cambridge 1956.
Hardy, G. H. and E. M. Wright
[1] An introduction to the theory of numbers, 4th edition, Clarendon Press, Oxford 1960.
Harris, T. E.
[1] The theory of branching processes, Springer-Verlag, Berlin-Heidelberg-New York 1963.
Hartley, R. V.
[1] Transmission of information, Bell Syst. Techn. J. 7, 535-563 (1928).
Hausdorff, F.
[1] Grundzüge der Mengenlehre, B. G. Teubner, Leipzig 1914.
Helmert, R.
[1] Über die Wahrscheinlichkeit der Potenzsummen der Beobachtungsfehler und über einige damit im Zusammenhang stehende Fragen, Z. Math. Phys. 21, 192-219 (1876).
Hille, E.
[1] Functional analysis and semi-groups, Amer. Math. Soc. Coll. Publ., Vol. 31, New York 1948.
Hostinsky, B.
[1] Méthodes générales du calcul des probabilités, Mém. Sci. Math. Nr. 52, Gauthier-Villars, Paris 1931.
Hurwitz, A. und R. Courant
[1] Funktionentheorie, Springer, Berlin 1929.

Jeffreys, H.
[1] Theory of probability, 2nd edition, Clarendon Press, Oxford 1948.
Jordan, Ch.
[1] On probability, Proc. Phys. Math. Soc. Japan 7, 96-109 (1925).
[2] Statistique mathématique, Gauthier-Villars, Paris 1927.
[3] Le théorème de probabilité de Poincaré, généralisé au cas de plusieurs variables indépendantes, Acta Sci. Math. Szeged 7, 103-111 (1934).
[4] Calculus of finite differences, 2nd edition, Chelsea Publ. Comp., New York 1950.
[5] Fejezetek a klasszikus valószínűségszámításból (Chapters from the classical calculus of probabilities), Akadémiai Kiadó, Budapest 1956.

Kac, M.
[1] Random walk and the theory of Brownian motion, Amer. Math. Monthly 54, 369-391 (1947).

Kac, M.
[2] A remark on the preceding paper by A. Rényi, Publ. Inst. Math. Beograd 8, 163-165 (1955).
[3] Probability and related topics in physical sciences, Lectures in applied mathematics, Vol. I, Intersci. Publ., London-New York 1959.
[4] Statistical independence in probability, analysis and number theory, Math. Assoc. America 1959.
Kantorovitch, L. V. (Канторович, Л. В.)
[1] Sur un problème de M. Steinhaus, Fund. Math. 14, 266-270 (1929).
Kappos, D. A.
[1] Strukturtheorie der Wahrscheinlichkeitsfelder und -räume, Springer-Verlag, Berlin-Göttingen-Heidelberg 1960.
Kawata, T. and H. Sakamoto
[1] On the characterization of the normal population by the independence of the sample mean and the sample variance, J. Math. Soc. Japan 1, 111-115 (1949).
Khinchin, A. J. (Хинчин, А. Я.)
[1] Über dyadische Brüche, Math. Z. 18, 109-118 (1923).
[2] Sur les classes d'événements équivalents, Mat. Sbornik 39:3, 40-43 (1932).
[3] Asymptotische Gesetze der Wahrscheinlichkeitsrechnung, Springer, Berlin 1933.
[4] Korrelationstheorie der stationären stochastischen Prozesse, Math. Ann. 109, 604-615 (1934).
[5] Sul dominio di attrazione della legge di Gauss, Giorn. Ist. Ital. Att. 6, 378-393 (1935).
[6] Kettenbrüche, B. G. Teubner, Leipzig 1956.
[7] О классах эквивалентных событий (On classes of equivalent events), Dokladi Akad. Nauk SSSR 85, 713-714 (1952).
Khinchin, A. J. und A. N. Kolmogorov (Хинчин, А. Я. и А. Н. Колмогоров)
[1] Über Konvergenz von Reihen, deren Glieder durch den Zufall bestimmt werden, Mat. Sbornik 32, 668-677 (1925).
(Khinchin) Chintschin, A. J. et P. Lévy
[1] Sur les lois stables, C. R. Acad. Sci. Paris 202, 374-376 (1936).
Knopp, K.
[1] Theorie und Anwendung der unendlichen Reihen, Springer, Berlin 1924.
Koller, S.
[1] Graphische Tafeln zur Beurteilung statistischer Zahlen, Steinkopff, Dresden-Leipzig 1943.
Kolmogorov, A. N. (Колмогоров, А. Н.)
[1] Über das Gesetz des iterierten Logarithmus, Math. Ann. 101, 126-136 (1929).
[2] Sur la loi forte des grands nombres, C. R. Acad. Sci. Paris 191, 910-912 (1930).
[3] Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung, Math. Ann. 104, 415-458 (1930).
[4] Sur la notion de la moyenne, Atti R. Accad. Naz. Lincei 12, 388-391 (1930).
[5] Foundations of the theory of probability, Chelsea, New York 1956.
[6] Sulla determinazione empirica di una legge di distribuzione, Giorn. Ist. Ital. Att. 4, 83-91 (1933).
[7] Цепи Маркова с счетным множеством возможных состояний (Markov chains with denumerably infinite possible states), Bull. Mosk. Univ. 1, 1 (1937).
[8] О логарифмически нормальном законе распределения размеров частиц при дроблении (On the lognormal distribution of the sizes of particles in chopping), Dokl. Akad. Nauk SSSR 31, 99-101 (1941).
[9] Algèbres de Boole métriques complètes, VI Zjazd Matematyków Polskich, Warsaw 20-23. IX. 1948, Inst. Math. Univ. Krakow 1950, 22-30.

Kolmogorov, A. N. (Колмогоров, А. Н.)
[10] Ein Satz über die Konvergenz der bedingten Erwartungswerte und deren Anwendungen, I. Magyar Matematikai Kongresszus Közleményei, Budapest 1950, 377-386.
[11] Некоторые работы последних лет в области предельных теорем теории вероятностей (On some recent works concerning the limit theorems of probability theory), Vestnik Univ. Moscow 8 (10), 29-38 (1953). See also "Arbeiten zur Informationstheorie III".
Krickeberg, K.
[1] Wahrscheinlichkeitstheorie, Teubner, Stuttgart 1963.
Kullback, S.
[1] Information theory and statistics, Wiley, New York 1959.
Ky Fan
[1] Les fonctions définies-positives et les fonctions complètement monotones, leurs applications au calcul des probabilités et à la théorie des espaces distanciés, Mém. Sci. Math. 114, Gauthier-Villars, Paris 1950.

Laha, R. G.
[1] An example of a non-normal distribution where the quotient follows the Cauchy law, Proc. Nat. Acad. Sci. USA 44, 222-223 (1958).
Laplace, P. S.
[1] Théorie analytique des probabilités, 1795. Oeuvres Complètes de Laplace, t. 7, Gauthier-Villars, Paris 1886.
[2] Essai philosophique sur les probabilités, I-II, Gauthier-Villars, Paris 1921.
Lehmann, E. L.
[1] Consistency and unbiasedness of certain nonparametric tests, Ann. Math. Stat. 22, 165-180 (1951).
Lévy, P.
[1] Calcul des Probabilités, Gauthier-Villars, Paris 1925.
[2] Sur certains processus stochastiques homogènes, Comp. Math. 7, 283-339 (1939).
[3] Processus stochastiques et mouvement brownien, Gauthier-Villars, Paris 1948.
[4] Théorie de l'addition des variables aléatoires, 2e éd., Gauthier-Villars, Paris 1954.
Lighthill, M. J.
[1] An introduction to Fourier analysis and generalised functions, Cambridge Univ. Press, Cambridge 1959.
Lindeberg, J. W.
[1] Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung, Math. Z. 15, 211-225 (1922).
Linnik, Yu. V. (Линник, Ю. В.)
[1] The large sieve, Dokl. Akad. Nauk SSSR 30, 292-294 (1941).
[2] Теоретико-информационное доказательство центральной предельной теоремы в условиях Линдеберга (An information-theoretic proof of the central limit theorem under Lindeberg conditions), Teor. Verojatn. Prim. 4, 311-321 (1959).
[3] Разложения вероятностных законов (Decompositions of probability laws), Izd. Univ. Leningrad 1960.
Linnik, Yu. V. and A. A. Singer (Линник, Ю. В. и А. А. Зингер)
[1] Об одном аналитическом обобщении теоремы Крамера (On an analytic extension of Cramér's theorem), Vestnik Leningr. Univ. 11, 51-56 (1955).
Ljapunov, A. M. (Ляпунов, А. М.)
[1] Избранные труды (Selected works), Akad. izd., Moscow 1948, pp. 179-250.

Lobachevsky, N. I. (Лобачевский, Н. И.)
[1] Sur la probabilité des résultats moyens, tirés des observations répétées, J. reine angew. Math. 24, 164-170 (1842).
Loève, M.
[1] Probability theory, van Nostrand, New York 1955.
Łomnicki, A.
[1] Nouveaux fondements du calcul des probabilités, Fund. Math. 4, 34-41 (1923).
Lösch, F. und F. Schoblik
[1] Die Fakultät, B. G. Teubner, Leipzig 1951.
Luce, R. D.
[1] Individual choice behaviour. A theoretical analysis, Wiley, New York 1959.
Lukács, E.
[1] A characterization of the normal distribution, Ann. Math. Stat. 13, 91-93 (1942).
[2] Application of Faà di Bruno's formula in mathematical statistics, Amer. Math. Monthly 62, 340-348 (1955).
[3] Characterisation of populations by properties of suitable statistics, Proc. 3rd Berkeley Symp. Math. Stat. Prob. 1954-1955, Vol. II, Univ. California Press, Berkeley-Los Angeles 1956, 215-229.
[4] Characteristic functions, Griffin, London 1960.
Lukács, E. and R. G. Laha
[1] Applications of characteristic functions, Griffin, London 1964.

Malmquist, S.
[1] On a property of order statistics from a rectangular distribution, Skand. Aktuarietidskrift 33, 214-222 (1950).
Marczewski, E.
[1] Remarks on the Poisson stochastic process, II, Studia Math. 13, 130-136 (1953).
Markov, A. A. (Марков, А. А.)
[1] Wahrscheinlichkeitsrechnung, B. G. Teubner, Leipzig 1912.
McMillan, B.
[1] The basic theorems of information theory, Ann. Math. Stat. 24, 196-219 (1953).
Medgyessy, P.
[1] Decomposition of superpositions of distribution functions, Akad. Kiadó, Budapest 1961.
von Mises, R.
[1] Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und theoretischen Physik, Deuticke, Leipzig-Wien 1931.
[2] Wahrscheinlichkeit, Statistik und Wahrheit, Springer-Verlag, Berlin 1952.
Mogyoródi, J.
[1] On a consequence of a mixing theorem of A. Rényi, MTA Mat. Kut. Int. Közl. 9, 263-267 (1964).
Molina, E. C.
[1] Poisson's exponential binomial limit, van Nostrand, New York 1942.
Moriguti, S.
[1] A lower bound for a probability moment of an absolutely continuous distribution with finite variance, Ann. Math. Stat. 23, 286-289 (1952).

Nagumo, M.
[1] Über eine Klasse von Mittelwerten, Japan. J. Math. 7, 71-79 (1930).

Neveu, J.
[1] Mathematical foundations of the calculus of probability, Holden-Day Inc., San Francisco 1965.
Neyman, J.
[1] L'estimation statistique traitée comme un problème classique de probabilité, Act. Sci. Industr., Nr. 739, Gauthier-Villars, Paris 1938.
[2] First course in probability and statistics, H. Holt et Co., New York 1950.

Onicescu, O. et G. Mihoc
[1] La dépendance statistique. Chaînes et familles de chaînes discontinues, Act. Sci. Industr., Nr. 503, Gauthier-Villars, Paris 1937.
Onicescu, O., G. Mihoc şi C. T. Ionescu-Tulcea
[1] Calculul probabilităţilor şi aplicaţii, Bucureşti 1956.

Parzen, E.
[1] Modern probability theory and its applications, Wiley, New York 1960.
Pearson, E. S. and H. O. Hartley
[1] Biometrical tables for statisticians, Cambridge Univ. Press, Cambridge 1954.
Pearson, K.
[1] Early statistical papers, Cambridge Univ. Press, Cambridge 1948.
Poincaré, H.
[1] Calcul des probabilités, Carré-Naud, Paris 1912.
Poisson, S. D.
[1] Recherches sur la probabilité des jugements, Bachelier, Paris 1837.
Pólya, G.
[1] Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung und das Momentproblem, Math. Z. 8, 171-181 (1920).
[2] Über eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die Irrfahrt im Straßennetz, Math. Ann. 84, 149-160 (1921).
Pólya, G. und G. Szegő
[1] Aufgaben und Lehrsätze aus der Analysis, I-II, Springer, Berlin 1925.
Popper, K.
[1] Philosophy of science: A personal report, British Philosophy in the Mid-Century, ed. by C. A. Mace, 1956, p. 191.
[2] The logic of scientific discovery, Hutchinson, London 1959.
Prékopa, A.
[1] On composed Poisson distributions, IV, Acta Math. Acad. Sci. Hung. 3, 317-326 (1952).
[2] Valószínűségelmélet műszaki alkalmazásokkal (Probability theory and its applications in technology), Műszaki Könyvkiadó, Budapest 1962.
Prékopa, A., A. Rényi and K. Urbanik
[1] О предельном распределении для сумм независимых случайных величин на бикомпактных коммутативных топологических группах (On the limit distribution of sums of independent random variables over bicompact commutative topological groups), Acta Math. Acad. Sci. Hung. 7, 11-16 (1956).

Reichenbach, H.
[1] Wahrscheinlichkeitslehre, Sijthoff, Leiden 1935.

Rényi, A.
[1] Simple proof of a theorem of Borel and of the law of the iterated logarithm, Mat. Tidsskrift B, 41-48 (1948).
[2] О представлении четных чисел в виде суммы простого и почти простого числа (On the representation of even numbers as sums of a prime and an almost prime number), Izvestia Akad. Nauk SSSR, Ser. Mat. 12, 57-78 (1948).
[3] К теории предельных теорем для сумм независимых случайных величин (On limit theorems of sums of independent random variables), Acta Math. Acad. Sci. Hung. 1, 99-108 (1950).
[4] On the algebra of distributions, Publ. Math. Debrecen 1, 135-149 (1950).
[5] On composed Poisson distributions, II, Acta Math. Acad. Sci. Hung. 2, 83-98 (1951).
[6] On some problems concerning Poisson processes, Publ. Math. Debrecen 2, 66-73 (1951).
[7] On a conjecture of H. Steinhaus, Ann. Soc. Polon. Math. 25, 279-287 (1952).
[8] On projections of probability distributions, Acta Math. Acad. Sci. Hung. 3, 131-142 (1952).
[9] On the theory of order statistics, Acta Math. Acad. Sci. Hung. 4, 191-232 (1953).
[10] Eine neue Methode in der Theorie der geordneten Stichproben, Bericht über die Mathematiker-Tagung Berlin 1953, VEB Deutscher Verlag der Wissenschaften, Berlin 1953, 203-213.
[11] Kémiai reakciók tárgyalása a sztochasztikus folyamatok elmélete segítségével (On describing chemical reactions by means of stochastic processes), A Magyar Tudományos Akadémia Alkalmazott Matematikai Intézetének Közleményei 2, 596-600 (1953) (In Hungarian).
[12] Újabb kritériumok két minta összehasonlítására (Some new criteria for comparison of two samples), A Magyar Tudományos Akadémia Alkalmazott Matematikai Intézetének Közleményei 2, 243-265 (1953) (In Hungarian).
[13] Valószínűségszámítás (Probability theory), Tankönyvkiadó, Budapest 1954 (In Hungarian).
[14] Axiomatischer Aufbau der Wahrscheinlichkeitsrechnung, Bericht über die Tagung Wahrscheinlichkeitsrechnung und Mathematische Statistik, VEB Deutscher Verlag der Wissenschaften, Berlin 1954, 7-15.
[15] On a new axiomatic theory of probability, Acta Math. Acad. Sci. Hung. 6, 285-335 (1955).
[16] On the density of sequences of integers, Publ. Inst. Math. Beograd 8, 157-162 (1955).
[17] A számjegyek eloszlása valós számok Cantor-féle előállításaiban (The distribution of the digits in Cantor's representation of the real numbers), Mat. Lapok 7, 77-100 (1956) (In Hungarian).
[18] On conditional probability spaces generated by a dimensionally ordered set of measures, Teor. Verojatn. Prim. 1, 61-71 (1956).
[19] A new deduction of Maxwell's law of velocity distribution, Izv. Mat. Inst. Sofia 2, 45-53 (1957).
[20] A remark on the theorem of Simmons, Acta Sci. Math. Szeged 18, 21-22 (1957).
[21] Representations for real numbers and their ergodic properties, Acta Math. Acad. Sci. Hung. 8, 477-493 (1957).
[22] On the asymptotic distribution of the sum of a random number of independent random variables, Acta Math. Acad. Sci. Hung. 8, 193-199 (1957).

Rényi, A.
[23] Quelques remarques sur les probabilités des événements dépendants, J. Math. pures appl. 37, 393-398 (1958).
[24] On mixing sequences of sets, Acta Math. Acad. Sci. Hung. 9, 215-228 (1958).
[25] Probabilistic methods in number theory, Proceedings of the International Congress of Mathematicians, Edinburgh 1958, 529-539.
[26] New version of the probabilistic generalization of the large sieve, Acta Math. Acad. Sci. Hung. 10, 217-226 (1959).
[27] On the dimension and entropy of probability distributions, Acta Math. Acad. Sci. Hung. 10, 193-215 (1959).
[28] On measures of dependence, Acta Math. Acad. Sci. Hung. 10, 441-451 (1959).
[29] On a theorem of P. Erdős and its applications in information theory, Mathematica Cluj 1 (24), 341-344 (1959).
[30] Dimension, entropy and information, Transactions of the II. Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, Praha 1960, 545-556.
[31] On the central limit theorem for the sum of a random number of independent random variables, Acta Math. Acad. Sci. Hung. 11, 97-102 (1960).
[32] Az aprítás matematikai elméletéről (On the mathematical theory of chopping), Építőanyag 1-8 (1960) (In Hungarian).
[33] Bolyongási problémákra vonatkozó határeloszlástételek (Limit theorems in random walk problems), A Magyar Tudományos Akadémia III (Matematikai és Fizikai) Osztályának Közleményei 10, 149-170 (1960) (In Hungarian).
[34] Az információelmélet néhány alapvető kérdése (Some fundamental problems of the information theory), A Magyar Tudományos Akadémia III (Matematikai és Fizikai) Osztályának Közleményei 10, 251-282 (1960) (In Hungarian).
[35] Egy általános módszer valószínűségszámítási tételek bizonyítására (A general method for proving theorems in probability theory), A Magyar Tudományos Akadémia III (Matematikai és Fizikai) Osztályának Közleményei 11, 79-105 (1961) (In Hungarian).
[36] Legendre polynomials and probability theory, Ann. Univ. Sci. Budapest, R. Eötvös nom., Sect. Math. 3-4, 247-251 (1961).
[37] On measures of entropy and information, Proc. Fourth Berkeley Symposium on Math. Stat. Prob. 1960, Vol. I, Univ. California Press, Berkeley-Los Angeles 1961, 547-561.
[38] On stable sequences of events, Sankhya A 25, 293-302 (1963).
[39] On certain representations of real numbers and on equivalent events, Acta Sci. Math. Szeged 26, 63-74 (1965).
[40] Új módszerek és eredmények a kombinatorikus analízisben (New methods and results in combinatorial analysis), A Magyar Tudományos Akadémia III (Matematikai és Fizikai) Osztályának Közleményei 16, 75-105, 159-177 (1966) (In Hungarian).
[41] Sur les espaces simples des probabilités conditionnelles, Ann. Inst. H. Poincaré B 1, 3-19 (1964).
[42] On the foundations of information theory, Review of the International Statistical Institute 33, 1-14 (1965).
Rényi, A. and P. Révész
[1] On mixing sequences of random variables, Acta Math. Acad. Sci. Hung. 9, 389-393 (1958).
[2] A study of sequences of equivalent events as special stable sequences, Publicationes Mathematicae Debrecen 10, 319-325 (1963).

Rényi, A. and R. Sulanke
[1] Über die konvexe Hülle von n zufällig gewählten Punkten, I-II, Zeitschrift für Wahrscheinlichkeitstheorie 2, 75-84 (1963); 3, 138-147 (1964).
Révész, P.
[1] A limit distribution theorem for sums of dependent random variables, Acta Math. Acad. Sci. Hung. 10, 125-131 (1959).
[2] The laws of large numbers, Akad. Kiadó, Budapest 1967.
Richter, H.
[1] Wahrscheinlichkeitstheorie, Springer-Verlag, Berlin 1956.
Riesz, F. and B. Sz.-Nagy
[1] Functional analysis, Blackie, London-Glasgow 1956.
Robbins, H.
[1] On the equidistribution of sums of independent random variables, Proc. Amer. Math. Soc. 4, 786-799 (1953).
Rota, G. C.
[1] The number of partitions of a set, Amer. Math. Monthly 71, 498-504 (1964).

Saxer, W.
[1] Versicherungsmathematik, II, Springer-Verlag, Berlin-Göttingen-Heidelberg 1958.
Schmetterer, L.
[1] Einführung in die mathematische Statistik, Springer-Verlag, Wien 1956.
Schützenberger, M. P.
[1] Contributions aux applications statistiques de la théorie de l'information, Inst. Stat. Univ. Paris (A) 2575, 1-115 (1953).
Shannon, C. E.
[1] A mathematical theory of communication, Bell Syst. Techn. J. 27, 379-423, 623-653 (1948).
Shannon, C. E. and W. Weaver
[1] The mathematical theory of communication, Univ. Illinois Press, Urbana 1949.
Singer, A. A. (Зингер, А. А.)
[1] О независимых выборках из нормальной совокупности (On independent samples from a normal population), Uspehi Mat. Nauk 6, 172-175 (1951).
Skitovich, V. P. (Скитович, В. П.)
[1] Об одном свойстве нормального распределения (On a property of the normal distribution), Dokl. Akad. Nauk SSSR 89, 217-219 (1953).
Slutsky, E.
[1] Über stochastische Asymptoten und Grenzwerte, Metron 5, 1-90 (1925).
Smirnov, N. V. (Смирнов, Н. В.)
[1] Über die Verteilung allgemeiner Glieder in der Variationsreihe, Metron 12, 59-81 (1935).
[2] Приближение законов распределения случайных величин по эмпирическим данным (Approximation of the laws of distribution of random variables by means of empirical data), Uspehi Mat. Nauk 10, 179-206 (1944).
Smirnov, V. I. (Смирнов, В. И.)
[1] Lehrgang der höheren Mathematik, Teil III, 3. Aufl., VEB Deutscher Verlag der Wissenschaften, Berlin 1961.
von Smoluchowski, M.
[1] Drei Vorträge über Diffusion, Brownsche Molekularbewegung und Koagulation von Kolloidteilchen, Phys. Z. 17, 557-571, 585-599 (1916).

Sparre-Andersen, E.
[1] On the number of positive sums of random variables, Skand. Aktuarietidskrift, 1949, 27-36.
[2] On the fluctuations of sums of random variables, I-II, Math. Scand. 1, 263-285 (1953); 2, 193-223 (1954).
Spitzer, F.
[1] A combinatorial lemma and its application to probability theory, Trans. Amer. Math. Soc. 82, 323-339 (1956).
Steinhaus, H.
[1] Les probabilités dénombrables et leur rapport à la théorie de la mesure, Fund. Math. 4, 285-310 (1923).
[2] Sur la probabilité de la convergence des séries, Studia Math. 2, 21-39 (1951).
Steinhaus, H., M. Kac et C. Ryll-Nardzewski
[1]-[10] Sur les fonctions indépendantes, I, Studia Mathematica 6, 46-58 (1936); II, ibidem 6, 59-66 (1936); III, ibidem 6, 89-97 (1936); IV, ibidem 7, 1-15 (1938); V, ibidem 7, 96-100 (1938); VI, ibidem 9, 121-132 (1940); VII, ibidem 10, 1-20 (1948); VIII, ibidem 11, 133-144 (1949); IX, ibidem 12, 102-107 (1951); X, ibidem 13, 1-17 (1953).
Stone, M. H.
[1] The theory of representation for Boolean algebras, Trans. Amer. Math. Soc. 40, 37-111 (1936).
Student
[1] —'s Collected papers, edited by E. S. Pearson and J. Wishart, London 1942.
Szász, G.
[1] Introduction to lattice theory (transl. from the Hungarian), Akad. Kiadó, Budapest 1963.
Szőkefalvi-Nagy, B.
[1] Spektraldarstellung linearer Transformationen des Hilbertschen Raumes, Springer, Berlin 1942.

Titchmarsh, E. C.
[1] Theory of functions, Clarendon Press, Oxford 1952.
Todhunter, I.
[1] History of the mathematical theory of probability, Macmillan, Cambridge-London 1865.

Uspenski, J. W. (Успенский, Ю. В.)
[1] Introduction to mathematical probability, McGraw-Hill, New York-London 1937.

Veksler, V., L. Groshev and B. Isaev (Векслер, В., Л. Грошев и Б. Исаев)
[1] Ионизационные методы исследования излучений (Ionisation methods in the study of radiations), Gostehizdat, Moscow 1949.

Waerden, B. L. van der
[1] Mathematische Statistik, Springer-Verlag, Berlin-Göttingen-Heidelberg 1957.
Wald, A.
[1] Die Widerspruchsfreiheit des Kollektivbegriffes der Wahrscheinlichkeitsrechnung, Erg. Math. Koll. 8, Wien 1935-1936.
Wang, Shou Yen
[1] On the limiting distribution of the ratio of two empirical distributions, Acta Math. Sinica 5, 253 (1955).

Widder, D. V.
[1] The Laplace transform, Princeton Univ. Press, Princeton 1946.
Wiener, N.
[1] Cybernetics or control and communication in the animal and the machine, Act. Sci. Indust., Nr. 1053, Hermann et Cie, Paris 1948.
[2] Extrapolation, interpolation and smoothing of stationary time series, Wiley, New York 1949.
Wilcoxon, F.
[1] Individual comparisons by ranking methods, Biometrics Bull. 1, 80-83 (1945).
Wilks, S. S.
[1] Order statistics, Bull. Amer. Math. Soc. 54, 6-50 (1948).
Wolfowitz, J.
[1] The coding of messages subject to chance errors, Illinois J. Math. 1, 591-606 (1957).
[2] Information theory for mathematicians, Ann. Math. Stat. 29, 351-356 (1958).
[3] Coding theorems of information theory, Springer-Verlag, Berlin-Göttingen-Heidelberg 1961.
Woodward, P. M.
[1] Probability and information theory with applications to radar, Pergamon Press, London 1953.

Zygmund, A.
[1] Trigonometrical series, Warsaw 1935; Dover, New York 1955.
[2] Trigonometric series, I-II, Cambridge Univ. Press, Cambridge 1959.
AUTHOR AND SUBJECT INDEX

absolutely continuous distribution function, 175
absolutely monotone sequence, 415
Aczél, J., 640, 643, 645
Alexandrov, P. S., 645
algebra, of events, 9
— of probability distributions, 131
— of sets, 17
almost sure convergence, 394
Anscombe, F. J., 473, 642, 645
a posteriori probabilities, 86
a priori probabilities, 86
Arató, M., 640, 645
arc sine law, 508
atom, 83
Aumann, G., 22, 639, 645

Balatoni, F., 642, 645
Barban, M. B., 645
Barnard, G. A., 642, 645
Bartlett, M. S., 642, 645
Bateman, H., 237, 640, 645
Bateman, P. T., 640, 645
Baticle, E., 639, 645
Bauer, H., 645
Bayes, Th., 640, 646
Bayes' theorem, 86, 274, 294
Bernoulli, J., 165, 374, 641, 646
Bernoulli's law of large numbers, 157
Bernoulli theorem, 375
Bernstein, S. N., 323, 374, 379, 466, 639, 640, 641, 642, 646
Bernstein polynomials, 165
Bernstein's improvement of Chebyshev inequality, 384
Bertrand's paradoxon, 64
beta distribution, 205
beta integral, 98
Bharucha-Reid, A. T., 640, 642, 646
Bhattacharyya, A., 642, 646
Bienaymé, M., 641, 646
Bienaymé-Chebyshev inequality, 373
binomial distribution, 87
Birkhoff, G., 21, 646
Blanc-Lapierre, A., 639, 646
Blaschke, W., 66, 639, 646
Blum, J. R., 475, 646
Boas, R. P., 640, 646
Bochner, S., 304, 306, 646
Bohr, H., 640
Boltzmann, L., 43, 554, 642, 646
Boltzmann energy distribution, 166
Boltzmann-Shannon formula, 554
Boolean algebra, 9
Borel, É., 639, 641, 646
Borel, algebra, 46
— cylinder set, 287
— measurable function, 172
— ring of sets, 48
— sets, 49
Borel-Cantelli lemma, 298, 390
Bortkiewicz, L., von, 639, 647
Bose-Einstein statistics, 43
Bruijn, N. G., de, 641, 647
Buffon, G. L. L., 31
Buffon's needle problem, 67

canonical representation of an event, 19
Cantelli, F. P., 436, 641, 647
Carathéodory, C., 638, 647
Cauchy, A., 641
Cauchy distribution, 204
central limit theorem, 440
central moment generating function, 138
central moments, 137
chain, 23
—, maximal, 23
Chandrasekhar, S., 642, 647
Chandrasekharan, S., 306, 646
channel, 567
—, noisy, 568
characteristic exponent, 350
characteristic function, 217, 302, 365
Chebyshev, P. L., 442, 641, 642, 647
Chebyshev inequality, 373
χ² distribution, 198
χ distribution, 199
Chowla, S., 640
Chung, K. L., 611, 641, 642, 643, 647
coefficient of variation, 116
compatibility, 287
complementary event, 9
complete algebra of sets, 17
complete conditional distribution, 570
completely additive function, 47
completely independent events, 59
complete measure, 50
complete system of events, 16, 84
compound event, 18
concentration, 227
conditional density function, 181, 259
conditional distribution, 97, 265
conditional distribution function, 181, 258
conditional expectation, 108, 212, 270
conditional information, 557
conditional probability, 54, 255, 263
conditional probability space, 70
conditional variance, 276
contingency, 279
contraction operator, 515
convergence, almost everywhere, 395
— almost surely, 394
— in measure, 395
— in probability, 374
—, of generalized functions, 360
convolution, 101, 195
— power, 133
correlation, coefficient, 117, 229
— ratio, 276
Courant, R., 641, 651
covariance matrix, 225
Cramér, H., 327, 329, 467, 639, 640, 641, 642, 647
crowd of events, 22
Csáki, E., 642, 647
Csáki, P., 640, 647
Császár, Á., 371, 639, 641, 647
cumulant, 139
— generating function, 139
Dantzig, D., van, 489, 640, 642, 647
Darmois, G., 336, 641, 648
d-dimensional information, 588
Dedekind, R., 638
degenerate distribution, 134
— — function, 175
density function, 175, 248, 251
density, of sequence of mixing sets, 406
— of stable sequence of events, 409
dimension of order alpha of a random variable, 588
Dirac, G., 43
Dirac's delta function, 355, 360
direct product of distributions, 553
discrete random variable, 95
dispersion, ellipsoid, 226
— matrix, 225
distribution function, 97, 172, 247, 251
Doeblin, W., 473, 642, 648
domain of attraction of normal distribution, 453
Donsker, M. D., 642, 648
Doob, J. L., 422, 437, 639, 641, 642, 648
doubly stochastic matrix, 483
dual formulas, 13
Dugué, D., 641, 648
Dumas, M., 639, 648
Dvoretzky, A., 642, 648
Eggenberger, F., 640, 648
Ehrenfest's urn model, 531
Einstein, A., 43, 642, 648
ellipse of concentration, 226
entropy, 430, 554
— of a distribution, 368
— of a random variable, 592
equivalent regular sequences, 354
Erdős, P., 365, 460, 511, 512, 514, 544, 640, 641, 648
Esseen, C. G., 641, 649
Euler, 199
Euler's beta integral, 98
— function, 80
— gamma function, 124
— summation formula, 149
event, 9
—, elementary, 18
—, impossible, 10
—, sure, 11
events, exchangeable, 78, 412
—, mutually exclusive, 10
—, product of, 10
—, subtraction of, 13
—, sum of, 11
exchangeable random variables, 235
expectation, 103, 209
— vector, 217
experiment, 9
Faà di Bruno, 331, 641
Fadeev, D. K., 544, 548, 642, 643, 645
family of distributions, 187
Feinstein, A., 643, 649
Feldheim, E., 167, 640, 641, 649
Feller, W., 447, 448, 453, 639, 641, 642
Fermi-Dirac statistics, 43
Finetti, B., de, 413, 639, 643, 649
Fischer, J., 640, 647
Fisher, R. A., 339, 642, 643, 649
Fisz, M., 639, 649
Florek, K., 640, 649
Fortet, R., 639, 646
Fourier-Stieltjes transform, 302
Fourier transform, 356, 357
Fréchet, M., 37, 639, 650
frequency, 30
—, relative, 30
Frink, O., 21, 638, 650
Frobenius, 598, 643
fundamental theorem of mathematical statistics, 400
gain, conditional distribution function of, 572
—, measure of, 574
—, of information, 562
Galton's desk, 152
gamma distribution, 202
Gantmacher, F. R., 598, 643, 650
Gauss, C. F., 641, 650
Gauss curve, 152
Gaussian, density function, 191
Gaussian distribution function, 157, 187
Gaussian random variable, 156
Gavrilov, M. A., 28, 638, 650
Geary, R. C., 339, 641, 650
Gebelein, H., 283, 640, 650
Gelfand, A. N., 642, 646
Gelfand-distributions, 353
generalized functions, 354
generating function, 135
geometric distribution, 90
Glivenko, V. I., 9, 401, 492, 638, 641, 650
Gnedenko, B. V., 348, 448, 449, 458, 496, 639, 641, 642, 650
—, theorem of, 449
Groshev, L., 640, 659
Gumbel, A. J., 37
Hájek, J., 434, 460, 641, 650
Hajós, G., 640, 651
half line period, 127
Halmos, P. R., 48, 639, 651
Hanson, L., 475
Hardy, G. H., 307, 368, 552, 574, 580, 640, 641, 643, 651
Harris, T. E., 651
Hartley, H. O., 643
Hartley, R. V., 642, 643, 651
Hartley's formula, 542
Hausdorff, F., 23, 415, 638, 651
Helly, E., 319, 641
Helmert, R., 198, 640, 651
Hille, E., 431, 641, 651
Hirschfeld, A. O., 283
Hostinsky, B., 642, 651
Hunt, G. A., 512, 642, 648
Hurwitz, A., 641, 651
hypergeometric distribution, 88
incomplete probability distribution, 569
incomplete random variable, 569
independent events, 57
independent random variables, 99, 182
infinitely divisible distribution, 347
infinitesimal random variable, 448
information, 540, 554, 592
—, of order alpha, 579, 586
integral geometry, 69
Ionescu-Tulcea, C. T., 639, 658
Isaev, B., 640, 659
Jaglom, A. M., 642, 645
Jánossy, L., 640, 645
Jeffreys, H., 562, 639, 641, 642, 651
Jensen inequality, 555
joint distribution function, 178
Jordan, Ch., 37, 639, 651
Kac, M., 345, 511, 514, 639, 641, 642, 643, 648, 651, 659
Kantorovich, L. V., 120, 640
Kappos, D. A., 638, 639
Kawata, J., 339, 641
Khinchin, A., 347, 380, 453, 548, 607, 639, 641, 642, 643, 645
Knopp, K., 150, 426, 472, 491, 640, 641, 643
Koller, S., 643
Kolmogorov, A. N., 9, 33, 69, 276, 383, 396, 402, 420, 438, 448, 458, 493, 576, 638, 639, 640, 641, 642, 643, 645, 650, 652
Kolmogorov probability space, 97
Kolmogorov's formula, 348
Kolmogorov's fundamental theorem, 286
— inequality, 392
Koopmans, L. H., 646
Korolyuk, V. S., 496, 642, 650
Krickeberg, K., 642, 653
Kronecker, L., 397
Kullback, S., 642, 653
Ky Fan, 641, 653
Laguerre polynomials, 169
Laha, R. G., 372, 641, 653
Laplace curve, 152
—, method of, 164
Laplace, P. S., 153, 639, 653
large sieve, 286
lattice, 21
— distribution, 308
law of errors, 440
— of large numbers, due to, Bernstein, 379
— — — — Khinchin, 380
— — — — Kolmogorov, 383
— — — — Markov, 378
— of the iterated logarithm, 402
Lebesgue measure, 52
Lebesgue-Stieltjes measure, 52
Legendre polynomials, 509
Lehmann, E. L., 642, 653
level set, 172
Lévy, P., 348, 350, 453, 511, 639, 641, 642, 652, 653
Lévy-Khinchin formula, 347
Liapunov, A. M., 517, 641, 642
Liapunov's condition, 442
Lighthill, M. J., 353, 641, 653
Lindeberg, J. W., 642, 653
Lindeberg's condition, 443, 447
— theorem, 520
linear operator, 515
Linnik, Yu. V., 286, 329, 336, 605, 640, 641, 643, 653
Littlewood, J. E., 368, 574, 580, 643, 651
Ljapunov, A. M., 653
Lobachevski, N. I., 198, 640, 654
Loève, M., 639, 654
logarithmically uniform distribution, 249
lognormal distribution, 194
Lomnicki, A., 639, 654
Lösch, F., 199, 640, 654
Luce, R. D., 639, 654
Lukács, E., 331, 339, 641, 654
Malmquist, S., 489, 640, 642, 654
Marczewski, E., 640, 649, 654
marginal distribution, 190
Markov, A. A., 442, 642, 654
—, theorem of, 479
Markov chain, 475
— —, additive, 483
— —, ergodic, 479
— —, homogeneous, 476
— —, reversible, 534
Markov inequality, 218
maximal correlation, 283
Maxwell distribution, 200, 239
— —, of order n, 269
Maxwell-Boltzmann statistics, 43
McMillan, B., 643, 654
measure, 49
—, complete, 50
—, outer, 50
—, σ-finite, 49
measurable set, 50
Medgyessy, P., 654
median, 217
Mensov, D. E., 641
Mercer theorem, 552, 643
Mihoc, G., 639, 640, 655
Mikusinski, J., 353, 641
Mises, R. von, 639, 654
mixing sequence of random variables, 467
mixture of distributions, 207, 131
modulus of dependence, 283
Mogyoródi, J., 475, 654
Moivre-Laplace theorem, 153
Molina, F. C., 643, 654
moment, 137, 217
— generating function, 138
monotone class, 418
Monte Carlo method, 69
Moriguti, S., 654
mutually independent random variables, 252
Nagumo, M., 576, 643, 654
n-dimensional cylinder, 287
negative binomial distribution, 92
negligible random variable, 448
Neveu, J., 655
Newton, I., 531
Neyman, J., 639, 655
non-atomic probability space, 81
normal curve, 152
normal distribution, 156, 186
normally distributed random vector, 191
Obreskov, N. G., 168
Onicescu, O., 639, 640, 655
operator method, 515
order statistics, 205, 235, 486
Parzen, E., 639, 643, 655
Parseval's theorem, 356
partially ordered set, 23
Pearson, E. S., 643, 655
Pearson, K., 32, 198, 276, 279, 640, 642, 655
Pearson distribution, 233
Poisson, S. D., 640, 655
Poisson distribution, 123, 202
Poisson's summation formula, 365
Pólya, G., 169, 368, 509, 510, 574, 580, 640, 641, 642, 643, 648, 651, 655
Pólya distribution, 94
polyhypergeometric distribution, 89
polynomial distribution, 87
Popper, K., 639, 655
positive definite function, 304
Post-Widder inversion formula, 432
Prékopa, A., 640, 642, 655
probability algebra, 34
— distribution, 84
— of an event, 32
projection of distribution, 189
p-quantile, 218, 490
quartile, 28
quasiorthogonal, 284
quasi stable distribution, 350
Radon-Nikodym theorem, 255
random event, 29
— variable, 95, 172, 245
— vector, 177
— walk, 500
rectangular density function, 184
regression curve, 278
regular dependence, 281
regular sequence of functions, 354
Reichenbach, H., 639, 655
relative information, 558
Rényi, A., 37, 70, 73, 268, 276, 279, 286, 409, 434, 460, 473, 475, 486, 494, 514, 544, 588, 639, 640, 641, 642, 643, 645, 648, 650, 651, 655, 656, 657, 658
Révész, P., 471, 641, 642, 657, 658
Richter, H., 639, 658
Riemann, B., 307
Riesz, F., 407, 411, 658
ring of sets, 48
Robbins, H., 641, 658
Rogosinski, W. W., 307, 651
Rosenblatt, M., 475
Rota, G. C., 658
Ryll-Nardzewski, C., 640, 642, 649, 659
Sakamoto, H., 339, 641, 652
sample space, 21
Saxer, W., 640, 658
Schmetterer, L., 639
Schoblik, F., 199, 654
Schützenberger, M. P., 642
Schwartz, L., 353
Schwarz, H. A., 329, 641
semiinvariant, 139
sequences of mixing sets, 406
Shannon, C. E., 547, 567, 569, 642, 658
Shannon's formula, 547
— gain of information, 574
— information, 579
similar distributions, 186
Simmons theorem, 167
Simson distribution, 197
Singer, A. A., 330, 339, 641, 653, 658
Skitovich, V. R., 336, 641, 658
Slutsky, E., 374, 641, 658
Smirnov, N. V., 421, 642, 658
Smirnov, V. I., 48, 199, 658
Smoluchowski, M. von, 657, 658
Sparre-Andersen, E., 511, 514, 642, 658, 659
Spitzer, F., 642, 659
stable distribution, 326, 349
stable sequence of events, 409
standard deviation, 110, 219
standardization, 440
standard normal distribution function, 157
stationary distribution, 479
Steinhaus, H., 639, 641, 642, 659
Stieltjes moment problem, 308
Stirling numbers, 137
Stirling's formula, 149
stochastic convergence, 374
stochastic matrix, 477
stochastic schemes, 29
Stone, M. H., 21, 639, 659
strong law of large numbers, 395
Student, 640, 659
Student distribution, 204
Sulanke, R., 658
Szász, G., 658, 659
Szegő, G., 169, 509, 510, 655
σ-additive set function, 47
σ-algebra, 46
Szőkefalvi-Nagy, B., 407, 411, 658, 659
Temple, G., 353
theorem, of Kolmogorov, 493
— of Pólya, 501
— of Smirnov, 496, 493
— of total expectation, 108
— of total probability, 85
—, three-series, of Kolmogorov, 420
Tikhomirov, W. M., 643, 645
Titchmarsh, E. C., 336, 641, 659
Todhunter, L., 638, 659
transition probabilities, 476
ultrafilter, 22
uniform distribution, 184
Urbanik, K., 642, 655
Uspenski, J. W., 658, 659
variance, 110, 219, 225
Veksler, V., 639, 659
Waerden, B. L., van der, 639, 659
Wald, A., 640, 659
Wallis formula, 149
Wang Shou Yen, 642, 659
Weaver, W., 642, 658
Weierstrass, K., 165
Widder, D. V., 641, 659
Wiener, N., 547, 642, 659
Wilcoxon, F., 639, 642, 660
Wilcoxon's test, 536
Wilks, S. S., 642, 660
Wolfowitz, J., 643, 660
Woodward, P. M., 642, 660
Wright, E. M., 640, 651
Yates, F., 643, 649
zero-one law, 418
Zygmund, A., 368, 641, 660

FURTHER TITLES TO BE RECOMMENDED IN THIS LINE

Á. ÁDÁM
TRUTH FUNCTIONS AND THE PROBLEM OF THEIR REALIZATION BY TWO-TERMINAL GRAPHS
In English — 206 pages — 34 figures — 11 tables — 17 × 25 cm — Cloth

STUDIES IN MATHEMATICAL STATISTICS
Theory and Application
EDITED BY I. VINCZE AND K. SARKADI
Studies in English and German — 210 pages — 33 figures — 13 tables — 17 × 25 cm — Cloth

Distributors
KULTURA
Budapest 62, P.O.B. 149