PROBABILITY,
MATHEMATICAL
STATISTICS, AND
STOCHASTIC
PROCESSES
Kyle Siegrist
University of Alabama in Huntsville
Book: Probability, Mathematical Statistics,
Stochastic Processes
This text is disseminated via the Open Education Resource (OER) LibreTexts Project (https://LibreTexts.org) and like the hundreds
of other texts available within this powerful platform, it is freely available for reading, printing and "consuming." Most, but not all,
pages in the library have licenses that may allow individuals to make changes, save, and print this book. Carefully
consult the applicable license(s) before pursuing such effects.
Instructors can adopt existing LibreTexts texts or Remix them to quickly build course-specific resources to meet the needs of their
students. Unlike traditional textbooks, LibreTexts’ web based origins allow powerful integration of advanced features and new
technologies to support learning.
The LibreTexts mission is to unite students, faculty and scholars in a cooperative effort to develop an easy-to-use online platform
for the construction, customization, and dissemination of OER content to reduce the burdens of unreasonable textbook costs to our
students and society. The LibreTexts project is a multi-institutional collaborative venture to develop the next generation of open-
access texts to improve postsecondary education at all levels of higher learning by developing an Open Access Resource
environment. The project currently consists of 14 independently operating and interconnected libraries that are constantly being
optimized by students, faculty, and outside experts to supplant conventional paper-based books. These free textbook alternatives are
organized within a central environment that is both vertically (from advanced to basic level) and horizontally (across different fields)
integrated.
The LibreTexts libraries are Powered by MindTouch® and are supported by the Department of Education Open Textbook Pilot
Project, the UC Davis Office of the Provost, the UC Davis Library, the California State University Affordable Learning Solutions
Program, and Merlot. This material is based upon work supported by the National Science Foundation under Grant No. 1246120,
1525057, and 1413739. Unless otherwise noted, LibreTexts content is licensed by CC BY-NC-SA 3.0.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation nor the US Department of Education.
Have questions or comments? For information about adoptions or adaptions contact info@LibreTexts.org. More information on our
activities can be found via Facebook (https://facebook.com/Libretexts), Twitter (https://twitter.com/libretexts), or our blog
(http://Blog.Libretexts.org).
1 https://stats.libretexts.org/@go/page/10315
TABLE OF CONTENTS
Introduction
Licensing
Object Library
Credits
Sources and Resources
1: Foundations
1.1: Sets
1.2: Functions
1.3: Relations
1.4: Partial Orders
1.5: Equivalence Relations
1.6: Cardinality
1.7: Counting Measure
1.8: Combinatorial Structures
1.9: Topological Spaces
1.10: Metric Spaces
1.11: Measurable Spaces
1.12: Special Set Structures
2: Probability Spaces
2.1: Random Experiments
2.2: Events and Random Variables
2.3: Probability Measures
2.4: Conditional Probability
2.5: Independence
2.6: Convergence
2.7: Measure Spaces
2.8: Existence and Uniqueness
2.9: Probability Spaces Revisited
2.10: Stochastic Processes
2.11: Filtrations and Stopping Times
3: Distributions
3.1: Discrete Distributions
3.2: Continuous Distributions
3.3: Mixed Distributions
3.4: Joint Distributions
3.5: Conditional Distributions
3.6: Distribution and Quantile Functions
3.7: Transformations of Random Variables
3.8: Convergence in Distribution
3.9: General Distribution Functions
1 https://stats.libretexts.org/@go/page/25718
3.10: The Integral With Respect to a Measure
3.11: Properties of the Integral
3.12: General Measures
3.13: Absolute Continuity and Density Functions
3.14: Function Spaces
4: Expected Value
4.1: Definitions and Basic Properties
4.2: Additional Properties
4.3: Variance
4.4: Skewness and Kurtosis
4.5: Covariance and Correlation
4.6: Generating Functions
4.7: Conditional Expected Value
4.8: Expected Value and Covariance Matrices
4.9: Expected Value as an Integral
4.10: Conditional Expected Value Revisited
4.11: Vector Spaces of Random Variables
4.12: Uniformly Integrable Variables
4.13: Kernels and Operators
5: Special Distributions
5.1: Location-Scale Families
5.2: General Exponential Families
5.3: Stable Distributions
5.4: Infinitely Divisible Distributions
5.5: Power Series Distributions
5.6: The Normal Distribution
5.7: The Multivariate Normal Distribution
5.8: The Gamma Distribution
5.9: Chi-Square and Related Distributions
5.10: The Student t Distribution
5.11: The F Distribution
5.12: The Lognormal Distribution
5.13: The Folded Normal Distribution
5.14: The Rayleigh Distribution
5.15: The Maxwell Distribution
5.16: The Lévy Distribution
5.17: The Beta Distribution
5.18: The Beta Prime Distribution
5.19: The Arcsine Distribution
5.20: General Uniform Distributions
5.21: The Uniform Distribution on an Interval
5.22: Discrete Uniform Distributions
5.23: The Semicircle Distribution
5.24: The Triangle Distribution
5.25: The Irwin-Hall Distribution
5.26: The U-Power Distribution
5.27: The Sine Distribution
5.28: The Laplace Distribution
5.29: The Logistic Distribution
5.30: The Extreme Value Distribution
5.31: The Hyperbolic Secant Distribution
5.32: The Cauchy Distribution
5.33: The Exponential-Logarithmic Distribution
5.34: The Gompertz Distribution
5.35: The Log-Logistic Distribution
5.36: The Pareto Distribution
5.37: The Wald Distribution
5.38: The Weibull Distribution
5.39: Benford's Law
5.40: The Zeta Distribution
5.41: The Logarithmic Series Distribution
6: Random Samples
6.1: Introduction
6.2: The Sample Mean
6.3: The Law of Large Numbers
6.4: The Central Limit Theorem
6.5: The Sample Variance
6.6: Order Statistics
6.7: Sample Correlation and Regression
6.8: Special Properties of Normal Samples
7: Point Estimation
7.1: Estimators
7.2: The Method of Moments
7.3: Maximum Likelihood
7.4: Bayesian Estimation
7.5: Best Unbiased Estimators
7.6: Sufficient, Complete and Ancillary Statistics
8: Set Estimation
8.1: Introduction to Set Estimation
8.2: Estimation in the Normal Model
8.3: Estimation in the Bernoulli Model
8.4: Estimation in the Two-Sample Normal Model
8.5: Bayesian Set Estimation
9: Hypothesis Testing
9.1: Introduction to Hypothesis Testing
9.2: Tests in the Normal Model
9.3: Tests in the Bernoulli Model
9.4: Tests in the Two-Sample Normal Model
9.5: Likelihood Ratio Tests
9.6: Chi-Square Tests
11: Bernoulli Trials
11.1: Introduction to Bernoulli Trials
11.2: The Binomial Distribution
11.3: The Geometric Distribution
11.4: The Negative Binomial Distribution
11.5: The Multinomial Distribution
11.6: The Simple Random Walk
11.7: The Beta-Bernoulli Process
16: Markov Processes
16.1: Introduction to Markov Processes
16.2: Potentials and Generators for General Markov Processes
16.3: Introduction to Discrete-Time Chains
16.4: Transience and Recurrence for Discrete-Time Chains
16.5: Periodicity of Discrete-Time Chains
16.6: Stationary and Limiting Distributions of Discrete-Time Chains
16.7: Time Reversal in Discrete-Time Chains
16.8: The Ehrenfest Chains
16.9: The Bernoulli-Laplace Chain
16.10: Discrete-Time Reliability Chains
16.11: Discrete-Time Branching Chain
16.12: Discrete-Time Queuing Chains
16.13: Discrete-Time Birth-Death Chains
16.14: Random Walks on Graphs
16.15: Introduction to Continuous-Time Markov Chains
16.16: Transition Matrices and Generators of Continuous-Time Chains
16.17: Potential Matrices
16.18: Stationary and Limiting Distributions of Continuous-Time Chains
16.19: Time Reversal in Continuous-Time Chains
16.20: Chains Subordinate to the Poisson Process
16.21: Continuous-Time Birth-Death Chains
16.22: Continuous-Time Queuing Chains
16.23: Continuous-Time Branching Chains
17: Martingales
17.1: Introduction to Martingales
17.2: Properties and Constructions
17.3: Stopping Times
17.4: Inequalities
17.5: Convergence
17.6: Backwards Martingales
Index
Glossary
Detailed Licensing
Licensing
A detailed breakdown of this resource's licensing can be found in Back Matter/Detailed Licensing.
1 https://stats.libretexts.org/@go/page/32585
Table of Contents
http://www.randomservices.org/random/index.html
Random is a website devoted to probability, mathematical statistics, and stochastic processes, and is intended for teachers and
students of these subjects. The site consists of an integrated set of components that includes expository text, interactive web apps,
data sets, biographical sketches, and an object library. Please read the Introduction for more information about the content,
structure, mathematical prerequisites, technologies, and organization of the project.
Object Library
1 https://stats.libretexts.org/@go/page/10316
Credits
1 https://stats.libretexts.org/@go/page/10317
Sources and Resources
1 https://stats.libretexts.org/@go/page/10318
CHAPTER OVERVIEW
1: Foundations
In this chapter we review several mathematical topics that form the foundation of probability and mathematical statistics. These
include the algebra of sets and functions, general relations with special emphasis on equivalence relations and partial orders,
counting measure, and some basic combinatorial structures such as permutations and combinations. We also discuss some advanced
topics from topology and measure theory. You may wish to review the topics in this chapter as the need arises.
1.1: Sets
1.2: Functions
1.3: Relations
1.4: Partial Orders
1.5: Equivalence Relations
1.6: Cardinality
1.7: Counting Measure
1.8: Combinatorial Structures
1.9: Topological Spaces
1.10: Metric Spaces
1.11: Measurable Spaces
1.12: Special Set Structures
This page titled 1: Foundations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.1: Sets
Set theory is the foundation of probability and statistics, as it is for almost every branch of mathematics.
A set is simply a collection of objects; the objects are referred to as elements of the set. The statement that x is an element of
set S is written x ∈ S , and the negation that x is not an element of S is written as x ∉ S . By definition, a set is completely
determined by its elements; thus sets A and B are equal if they have the same elements:
A = B if and only if x ∈ A ⟺ x ∈ B (1.1.1)
Our next definition is the subset relation, another very basic concept.
Concepts in set theory are often illustrated with small, schematic sketches known as Venn diagrams, named for John Venn. The
Venn diagram in the picture below illustrates the subset relation.
Figure 1.1.1 : A ⊆ B
As noted earlier, membership is a primitive, undefined concept in naive set theory. However, the following construction, known as
Russell's paradox, after the mathematician and philosopher Bertrand Russell, shows that we cannot be too cavalier in the
construction of sets.
Let R be the set of all sets A such that A ∉ A . Then R ∈ R if and only if R ∉ R .
Proof
Usually, the sets under discussion in a particular context are all subsets of a well-defined, specified set S , often called a universal
set. The use of a universal set prevents the type of problem that arises in Russell's paradox. That is, if S is a given set and p(x) is a
predicate on S (that is, a valid mathematical statement that is either true or false for each x ∈ S ), then {x ∈ S : p(x)} is a valid
subset of S . Defining a set in this way is known as predicate form. The other basic way to define a set is simply be listing its
elements; this method is known as list form.
In contrast to a universal set, the empty set, denoted ∅, is the set with no elements.
One step up from the empty set is a set with just one element. Such a set is called a singleton set. The subset relation is a partial
order on the collection of subsets of S .
1.1.1 https://stats.libretexts.org/@go/page/10116
Here are a couple of variations on the subset relation.
The collection of all subsets of a given set frequently plays an important role, particularly when the given set is the universal set.
If S is a set, then the set of all subsets of S is known as the power set of S and is denoted P(S) .
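In a language with built-in sets, the power set can be generated directly from the definition. The sketch below (plain Python, illustrative only) uses frozenset so that subsets can themselves be elements of a set.

```python
from itertools import combinations

def power_set(s):
    """Return P(s): the set of all subsets of s, each as a frozenset."""
    elems = list(s)
    return {frozenset(c) for r in range(len(elems) + 1)
            for c in combinations(elems, r)}

P = power_set({1, 2, 3})
# A set with n elements has 2^n subsets, so len(P) is 8 here.
```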
Special Sets
The following special sets are used throughout this text. Defining them will also give us practice using list and predicate form.
Special Sets
1. R denotes the set of real numbers and is the universal set for the other subsets in this list.
2. N = {0, 1, 2, …} is the set of natural numbers.
3. N_+ = {1, 2, 3, …} is the set of positive integers.
4. Z = {…, −2, −1, 0, 1, 2, …} is the set of integers.
5. Q = {j/k : j ∈ Z and k ∈ N_+} is the set of rational numbers.
6. A = {x ∈ R : p(x) = 0 for some polynomial p with integer coefficients} is the set of algebraic numbers.
Note that N_+ ⊂ N ⊂ Z ⊂ Q ⊂ A ⊂ R. We will also occasionally need the set of complex numbers C = {x + iy : x, y ∈ R}
where i is the imaginary unit. The following special rational numbers turn out to be useful for various constructions.
For n ∈ N, a rational number of the form j/2^n where j ∈ Z is odd is a dyadic rational (or binary rational) of rank n.
A number x ∈ R is a binary rational of rank n ∈ N_+ if and only if the binary expansion of x is finite, with 1 in position n.
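The rank of a dyadic rational can be computed mechanically: reduce the fraction and check whether the denominator is a power of 2. A sketch using Python's fractions module (the function name is my own, not from the text):

```python
from fractions import Fraction

def dyadic_rank(x):
    """Return n if x = j / 2^n with j odd, else None (not dyadic)."""
    f = Fraction(x)              # Fraction is stored in lowest terms
    q = f.denominator
    n = q.bit_length() - 1
    if q != 2 ** n:
        return None              # denominator not a power of 2
    if n == 0 and f.numerator % 2 == 0:
        return None              # even integers have no odd-j representation
    return n

# 3/8 = 3 / 2^3 has rank 3; 7/2 has rank 1; 1/3 is not dyadic.
```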
Set Operations
We are now ready to review the basic operations of set theory. For the following definitions, suppose that A and B are subsets of a
universal set, which we will denote by S .
The union of A and B is the set obtained by combining the elements of A and B .
A ∪ B = {x ∈ S : x ∈ A or x ∈ B} (1.1.3)
The intersection of A and B is the set of elements common to both A and B :
A ∩ B = {x ∈ S : x ∈ A and x ∈ B} (1.1.4)
The set difference of B and A is the set of elements that are in B but not in A :
B ∖ A = {x ∈ S : x ∈ B and x ∉ A} (1.1.5)
Sometimes (particularly in older works and particularly when A ⊆B ), the notation B−A is used instead of B∖A . When
A ⊆ B , B − A is known as proper set difference.
Note that union, intersection, and difference are binary set operations, while complement is a unary set operation.
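These operations map directly onto Python's built-in set type; a small sketch, with an explicit universal set S so that the unary complement can be written as a difference:

```python
S = set(range(10))          # universal set
A = {0, 1, 2, 3, 4}
B = {3, 4, 5, 6}

union        = A | B        # A ∪ B
intersection = A & B        # A ∩ B
difference   = B - A        # B ∖ A
complement_A = S - A        # A^c, taken relative to the universal set S
```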
In the Venn diagram app, select each of the following and note the shaded area in the diagram.
1. A
2. B
3. A^c
4. B^c
5. A ∪ B
6. A ∩ B
Basic Rules
In the following theorems, A , B , and C are subsets of a universal set S . The proofs are straightforward, and just use the definitions
and basic logic. Try the proofs yourself before reading the ones in the text.
A∩B ⊆ A ⊆ A∪B .
So the empty set acts as an identity relative to the union operation, and the universal set acts as an identity relative to the
intersection operation.
1. A ∪ A^c = S
2. A ∩ A^c = ∅
1. A ∪ B = B ∪ A
2. A ∩ B = B ∩ A
Proof
Thus, we can write A ∪ B ∪ C without ambiguity. Note that x is an element of this set if and only if x is an element of at least one
of the three given sets. Similarly, we can write A ∩ B ∩ C without ambiguity. Note that x is an element of this set if and only if x
is an element of all three of the given sets.
So intersection distributes over union, and union distributes over intersection. It's interesting to compare the distributive properties
of set theory with those of the real number system. If x, y, z ∈ R , then x(y + z) = (xy) + (xz) , so multiplication distributes over
addition, but it is not true that x + (yz) = (x + y)(x + z) , so addition does not distribute over multiplication. The following
results are particularly important in probability theory.
1. (A ∪ B)^c = A^c ∩ B^c
2. (A ∩ B)^c = A^c ∪ B^c
Proof
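De Morgan's laws can also be spot-checked exhaustively over all pairs of subsets of a small universal set; this sketch is a sanity check, not a proof:

```python
from itertools import combinations

S = set(range(4))
subsets = [set(c) for r in range(5) for c in combinations(S, r)]

def complement(A):
    return S - A

# (A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c for every pair
demorgan_holds = all(
    complement(A | B) == complement(A) & complement(B)
    and complement(A & B) == complement(A) | complement(B)
    for A in subsets for B in subsets
)
```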
The following result explores the connections between the subset relation and the set operations. The following statements are equivalent:
1. A ⊆ B
2. B^c ⊆ A^c
3. A ∪ B = B
4. A ∩ B = A
5. A ∖ B = ∅
Proof
In addition to the special sets defined earlier, we also have the following:
1. R ∖ Q is the set of irrational numbers.
2. R ∖ A is the set of transcendental numbers.
Since Q ⊂ A ⊂ R it follows that R ∖ A ⊂ R ∖ Q; that is, every transcendental number is also irrational.
Set difference can be expressed in terms of complement and intersection. All of the other set operations (complement, union, and
intersection) can be expressed in terms of difference.
1. A ∖ B = A ∩ B^c
2. A^c = S ∖ A
3. A ∩ B = A ∖ (A ∖ B)
4. A ∪ B = S ∖ {(S ∖ A) ∖ [(S ∖ A) ∖ (S ∖ B)]}
1.1.4 https://stats.libretexts.org/@go/page/10116
Proof
So in principle, we could do all of set theory using the one operation of set difference. But as (c) and (d) suggest, the results would
be hideous.
(A ∪ B) ∖ (A ∩ B) = (A ∖ B) ∪ (B ∖ A) .
Proof
The set in the previous result is called the symmetric difference of A and B , and is sometimes denoted A △ B . The elements of
this set belong to one but not both of the given sets. Thus, the symmetric difference corresponds to exclusive or in the same way
that union corresponds to inclusive or. That is, x ∈ A ∪ B if and only if x ∈ A or x ∈ B (or both); x ∈ A △ B if and only if
x ∈ A or x ∈ B, but not both. On the other hand, the complement of the symmetric difference consists of the elements that belong to both or neither of the given sets.
Proof
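Python exposes the symmetric difference directly as the ^ operator, so the identity above can be checked on a small example:

```python
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

sym_diff = A ^ B                 # A △ B
# (A ∪ B) ∖ (A ∩ B) = (A ∖ B) ∪ (B ∖ A)
same1 = (A | B) - (A & B)
same2 = (A - B) | (B - A)
```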
There are 16 different (in general) sets that can be constructed from two given sets A and B .
Proof
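When the four intersections A ∩ B, A ∩ B^c, A^c ∩ B, A^c ∩ B^c are all nonempty, the 16 sets are exactly the unions of subcollections of these four disjoint pieces. A small enumeration, with hypothetical choices of S, A, and B:

```python
from itertools import combinations

S = {0, 1, 2, 3}
A = {0, 1}
B = {0, 2}
Ac, Bc = S - A, S - B
atoms = [A & B, A & Bc, Ac & B, Ac & Bc]   # four disjoint, nonempty pieces

generated = set()
for r in range(5):
    for combo in combinations(atoms, r):
        u = set()
        for piece in combo:
            u |= piece                      # union of the chosen pieces
        generated.add(frozenset(u))
# 2^4 = 16 distinct sets, since each atom is nonempty
```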
Open the Venn diagram app. This app lists the 16 sets that can be constructed from given sets A and B using the set
operations.
1. Select each of the four subsets in the proof of the last exercise: A ∩ B, A ∩ B^c, A^c ∩ B, and A^c ∩ B^c. Note that these are disjoint and their union is S.
2. Select each of the other 12 sets and show how each is a union of some of the sets in (a).
General Operations
The operations of union and intersection can easily be extended to a finite or even an infinite collection of sets.
Definitions
Suppose that A is a nonempty collection of subsets of a universal set S. In some cases, the subsets in A may be naturally indexed by a nonempty index set I, so that A = {A_i : i ∈ I}. (In a technical sense, any collection of subsets can be indexed.)
The union of the collection of sets A is the set obtained by combining the elements of the sets in A :
⋃A = {x ∈ S : x ∈ A for some A ∈ A}
If A = {A_i : i ∈ I}, so that the collection of sets is indexed, then we use the more natural notation ⋃_{i∈I} A_i.
The intersection of the collection of sets A is the set of elements common to all of the sets in A :
⋂A = {x ∈ S : x ∈ A for all A ∈ A}
If A = {A_i : i ∈ I}, so that the collection of sets is indexed, then we use the more natural notation ⋂_{i∈I} A_i.
Often the index set is an “integer interval” of N. In such cases, an even more natural notation is to use the upper and lower limits of the index set. For example, if the collection is {A_i : i ∈ N_+} then we would write ⋃_{i=1}^∞ A_i for the union and ⋂_{i=1}^∞ A_i for the intersection. Similarly, if the collection is {A_i : i ∈ {1, 2, …, n}} for some n ∈ N_+, we would write ⋃_{i=1}^n A_i for the union and ⋂_{i=1}^n A_i for the intersection.
A collection of sets A is pairwise disjoint if the intersection of any two sets in the collection is empty: A ∩ B = ∅ for every
A, B ∈ A with A ≠ B .
A collection of sets A is said to partition a set B if the collection A is pairwise disjoint and ⋃A = B.
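A sketch of predicate functions for these two definitions, for a finite collection given as a list of distinct Python sets (the function names are my own):

```python
def pairwise_disjoint(sets):
    """True if every two distinct sets in the list are disjoint.
    Assumes the sets in the list are distinct."""
    return all(sets[i].isdisjoint(sets[j])
               for i in range(len(sets)) for j in range(i + 1, len(sets)))

def partitions(sets, B):
    """True if the collection is pairwise disjoint with union B."""
    union = set()
    for A in sets:
        union |= A
    return pairwise_disjoint(sets) and union == B
```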
Partitions are intimately related to equivalence relations. As an example, for n ∈ N, the set
D_n = {[j/2^n, (j+1)/2^n) : j ∈ Z}  (1.1.20)
is a partition of R into intervals of equal length 1/2^n. Note that the endpoints are the dyadic rationals of rank n or less, and that D_{n+1} can be obtained from D_n by dividing each interval into two equal parts. This sequence of partitions is one of the reasons that the dyadic rationals are important.
Basic Rules
In the following problems, A = {A_i : i ∈ I} is a collection of subsets of a universal set S, indexed by a nonempty set I, and B is a subset of S.
1. (⋃_{i∈I} A_i) ∩ B = ⋃_{i∈I} (A_i ∩ B)
2. (⋂_{i∈I} A_i) ∪ B = ⋂_{i∈I} (A_i ∪ B)
Restate the laws in the notation where the collection A is not indexed.
Proof
1. (⋃_{i∈I} A_i)^c = ⋂_{i∈I} A_i^c
2. (⋂_{i∈I} A_i)^c = ⋃_{i∈I} A_i^c
Restate the laws in the notation where the collection A is not indexed.
Proof
Suppose that the collection A partitions S . For any subset B , the collection {A ∩ B : A ∈ A } partitions B .
Proof
Proof
The sets in the previous result turn out to be important in the study of probability.
Product sets
Definitions
Product sets are sets of sequences. The defining property of a sequence, of course, is that order as well as membership is important.
Let us start with ordered pairs. In this case, the defining property is that (a, b) = (c, d) if and only if a = c and b = d .
Interestingly, the structure of an ordered pair can be defined just using set theory. The construction in the result below is due to
Kazimierz Kuratowski.
Define (a, b) = {{a}, {a, b}}. This definition captures the defining property of an ordered pair.
Proof
Of course, it's important not to confuse the ordered pair (a, b) with the open interval (a, b), since the same notation is used for
both. Usually it's clear from context which type of object is referred to. For ordered triples, the defining property is
(a, b, c) = (d, e, f) if and only if a = d, b = e, and c = f. Ordered triples can be defined in terms of ordered pairs.
Define (a, b, c) = (a, (b, c)) . This definition captures the defining property of an ordered triple.
Proof
All of this is just to show how complicated structures can be built from simpler ones, and ultimately from set theory. But enough of that! More generally, two ordered sequences of the same size (finite or infinite) are the same if and only if their corresponding coordinates agree. Thus for n ∈ N_+, the definition for n-tuples is (x_1, x_2, …, x_n) = (y_1, y_2, …, y_n) if and only if x_i = y_i for all i ∈ {1, 2, …, n}.
Suppose now that we have a sequence of n sets, (S_1, S_2, …, S_n), where n ∈ N_+. The Cartesian product of the sets is defined as follows:
S_1 × S_2 × ⋯ × S_n = {(x_1, x_2, …, x_n) : x_i ∈ S_i for i ∈ {1, 2, …, n}}  (1.1.22)
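In Python, itertools.product implements exactly this construction, with tuples playing the role of ordered n-tuples; a brief sketch:

```python
from itertools import product

S1, S2, S3 = {0, 1}, {'a', 'b'}, {True}
triples = set(product(S1, S2, S3))           # S1 × S2 × S3

bit_strings = set(product({0, 1}, repeat=3))  # {0,1}^3, a Cartesian power
```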
Cartesian products are named for René Descartes. If S_i = S for each i, then the Cartesian product set can be written compactly as S^n, a Cartesian power. In particular, recall that R denotes the set of real numbers, so that R^n is n-dimensional Euclidean space, named after Euclid, of course. The elements of {0, 1}^n are called bit strings of length n. As the name suggests, we sometimes represent elements of this product set as strings rather than sequences (that is, we omit the parentheses and commas). Since the coordinates just take two values, there is no risk of confusion.
Suppose that we have an infinite sequence of sets (S_1, S_2, …). The Cartesian product of the sets is defined by
S_1 × S_2 × ⋯ = {(x_1, x_2, …) : x_i ∈ S_i for i ∈ N_+}  (1.1.23)
When S_i = S for i ∈ N_+, the Cartesian product set is sometimes written as a Cartesian power as S^∞ or as S^{N_+}. An explanation for the last notation, as well as a much more general construction for products of sets, is given in the next section on functions. Also, notation similar to that of general union and intersection is often used for Cartesian product, with ∏ as the operator. So
∏_{i=1}^n S_i = S_1 × S_2 × ⋯ × S_n,  ∏_{i=1}^∞ S_i = S_1 × S_2 × ⋯  (1.1.24)
The most important rules that relate Cartesian product with union, intersection, and difference are the distributive rules:
In general, the product of unions is larger than the corresponding union of products.
(A ∪ B) × (C ∪ D) = (A × C ) ∪ (A × D) ∪ (B × C ) ∪ (B × D)
Proof
So in particular it follows that (A × C ) ∪ (B × D) ⊆ (A ∪ B) × (C ∪ D) . On the other hand, the product of intersections is the
same as the corresponding intersection of products.
(A × C ) ∩ (B × D) = (A ∩ B) × (C ∩ D)
Proof
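The identity (A × C) ∩ (B × D) = (A ∩ B) × (C ∩ D) is easy to confirm on small sets:

```python
from itertools import product

A, B = {1, 2}, {2, 3}
C, D = {'x', 'y'}, {'y', 'z'}

left  = set(product(A, C)) & set(product(B, D))   # (A × C) ∩ (B × D)
right = set(product(A & B, C & D))                # (A ∩ B) × (C ∩ D)
```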
In general, the product of differences is smaller than the corresponding difference of products.
Proof
Cross Sections
Suppose that C ⊆ S × T.
1. The cross section of C in the first coordinate at x ∈ S is C_x = {y ∈ T : (x, y) ∈ C}.
2. The cross section of C in the second coordinate at y ∈ T is C^y = {x ∈ S : (x, y) ∈ C}.
Projections
1. The projection of C onto T is C_T = {y ∈ T : (x, y) ∈ C for some x ∈ S}.
2. The projection of C onto S is C_S = {x ∈ S : (x, y) ∈ C for some y ∈ T}.
Unions
1. C_T = ⋃_{x∈S} C_x
2. C_S = ⋃_{y∈T} C^y
Cross sections are preserved under the set operations. We state the result for cross sections at x ∈ S. By symmetry, of course,
analogous results hold for cross sections at y ∈ T.
1. (C ∪ D)_x = C_x ∪ D_x
2. (C ∩ D)_x = C_x ∩ D_x
3. (C ∖ D)_x = C_x ∖ D_x
Proof
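For a finite C ⊆ S × T stored as a set of pairs, cross sections and projections are one-line comprehensions, and the union identity C_T = ⋃_{x∈S} C_x can be checked directly. A sketch with hypothetical S, T, and C:

```python
S, T = {0, 1, 2}, {'a', 'b'}
C = {(0, 'a'), (1, 'a'), (1, 'b')}

def cross_section_x(C, x):
    """C_x: second coordinates of pairs in C whose first coordinate is x."""
    return {y for (u, y) in C if u == x}

def projection_T(C):
    """C_T: all second coordinates appearing in C."""
    return {y for (x, y) in C}

# C_T is the union of the cross sections C_x over x ∈ S
union_of_sections = set()
for x in S:
    union_of_sections |= cross_section_x(C, x)
```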
For projections, the results are a bit more complicated. We give the results for projections onto T ; naturally the results for
projections onto S are analogous.
1. (C ∪ D)_T = C_T ∪ D_T
2. (C ∩ D)_T ⊆ C_T ∩ D_T
3. (C_T)^c ⊆ (C^c)_T
Proof
It's easy to see that equality does not hold in general in parts (b) and (c). In part (b) for example, suppose that A_1, A_2 ⊆ S are nonempty and disjoint, that B ⊆ T is nonempty, and that C = A_1 × B and D = A_2 × B. Then C ∩ D = ∅ so (C ∩ D)_T = ∅, but C_T = D_T = B. In part (c) for example, suppose that A is a nonempty proper subset of S and B is a nonempty proper subset of T, and let C = A × B. Then (C_T)^c = B^c, but (C^c)_T = T.
Cross sections and projections will be extended to very general product sets in the next section on Functions.
Computational Exercises
Subsets of R
The universal set is [0, ∞). Let A = [0, 5] and B = (3, 7) . Express each of the following in terms of intervals:
1. A ∩ B
2. A ∪ B
3. A ∖ B
4. B ∖ A
5. Ac
Answer
The universal set is N. Let A = {n ∈ N : n is even} and let B = {n ∈ N : n ≤ 9} . Give each of the following:
1. A ∩ B in list form
2. A ∪ B in predicate form
3. A ∖ B in list form
4. B ∖ A in list form
5. A^c in predicate form
6. B^c in list form
Answer
1. A
2. B
3. A ∩ B
4. A ∪ B
5. A ∖ B
6. B ∖ A
Answer
Let S = {0, 1}^3. This is the set of outcomes when a coin is tossed 3 times (0 denotes tails and 1 denotes heads). Further let
5. B^c
6. A ∩ B
7. A ∪ B
8. A ∖ B
9. B ∖ A
Answer
Let S = {0, 1}^2. This is the set of outcomes when a coin is tossed twice (0 denotes tails and 1 denotes heads). Give P(S) in list form.
Answer
Cards
A standard card deck can be modeled by the Cartesian product set
D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠} (1.1.26)
where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate
encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for
example q♡ for the queen of hearts). For the problems in this subsection, the card deck D is the universal set.
Let H denote the set of hearts and F the set of face cards. Find each of the following:
1. H ∩ F
2. H ∖ F
3. F ∖ H
4. H △ F
Answer
A bridge hand is a subset of D with 13 cards. Often bridge hands are described by giving the cross sections by suit.
By contrast, it is usually more useful to describe a poker hand by giving the cross sections by denomination. In the usual version of
draw poker, a hand is a subset of D with 5 cards.
The poker hand in the last exercise is known as a dead man's hand. Legend has it that Wild Bill Hickock held this hand at the time
of his murder in 1876.
Let A_n = [0, 1 − 1/n] for n ∈ N_+. Find
1. ⋂_{n=1}^∞ A_n
2. ⋃_{n=1}^∞ A_n
3. ⋂_{n=1}^∞ A_n^c
4. ⋃_{n=1}^∞ A_n^c
Answer
Let A_n = (2 − 1/n, 5 + 1/n) for n ∈ N_+. Find
1. ⋂_{n=1}^∞ A_n
2. ⋃_{n=1}^∞ A_n
3. ⋂_{n=1}^∞ A_n^c
4. ⋃_{n=1}^∞ A_n^c
Answer
Subsets of R^2
Let T be the closed triangular region in R^2 with vertices (0, 0), (1, 0), and (1, 1). Find each of the following:
This page titled 1.1: Sets is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via
source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.2: Functions
Functions play a central role in probability and statistics, as they do in every other branch of mathematics. For the most part, the
proofs in this section are straightforward, so be sure to try them yourself before reading the ones in the text.
A function f from a set S into a set T is a subset of the product set S × T with the property that for each element x ∈ S , there
exists a unique element y ∈ T such that (x, y) ∈ f . If f is a function from S to T we write f : S → T . If (x, y) ∈ f we write
y = f (x).
Less formally, a function f from S into T is a “rule” (or “procedure” or “algorithm”) that assigns to each x ∈ S a unique element
f(x) ∈ T. The definition of a function as a set of ordered pairs is due to Kazimierz Kuratowski. The term map or mapping is also used in place of function.
Suppose that f : S → T .
1. The set S is the domain of f .
2. The set T is the range space or co-domain of f .
3. The range of f is the set of function values. That is, range (f ) = {y ∈ T : y = f (x) for some x ∈ S} .
The domain and range are completely specified by a function. That's not true of the co-domain: if f is a function from S into T ,
and U is another set with T ⊆ U , then we can also think of f as a function from S into U . The following definitions are natural
and important.
Clearly a function always maps its domain onto its range. Note also that f is one-to-one if f (u) = f (v) implies u =v for
u, v ∈ S .
Inverse functions
A function that is one-to-one and onto can be “reversed” in a sense.
If you like to think of a function as a set of ordered pairs, then f⁻¹ = {(y, x) ∈ T × S : (x, y) ∈ f }. The fact that f is one-to-one
and onto ensures that f⁻¹ is a valid function from T onto S . Sets S and T are in one-to-one correspondence if there exists a one-
to-one function from S onto T . One-to-one correspondence plays an essential role in the study of cardinality.
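For finite sets, these definitions are easy to check by brute force. Here is a minimal sketch; the sets and the dict-based encoding of a function are hypothetical, chosen just for illustration:

```python
def is_one_to_one(f):
    """f (a dict) is one-to-one if distinct inputs have distinct values."""
    values = list(f.values())
    return len(values) == len(set(values))

def is_onto(f, codomain):
    """f is onto if every element of the codomain is a function value."""
    return set(f.values()) == set(codomain)

f = {1: 'a', 2: 'b', 3: 'c'}   # a one-to-one correspondence
print(is_one_to_one(f), is_onto(f, {'a', 'b', 'c'}))  # True True

# Reversing the ordered pairs of a one-to-one, onto function gives f⁻¹
f_inv = {y: x for x, y in f.items()}
print(f_inv)  # {'a': 1, 'b': 2, 'c': 3}
```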
Restrictions
The domain of a function can be restricted to create a new function.
Suppose that f : S → T and that A ⊆ S . The function f_A : A → T defined by f_A(x) = f (x) for x ∈ A is the restriction of f
to A .
Composition
Composition is perhaps the most important way to combine two functions to create another function.
Composition is associative:
Proof
Thus we can write f ∘ g ∘ h without ambiguity. On the other hand, composition is not commutative. Indeed depending on the
domains and co-domains, f ∘ g might be defined when g ∘ f is not. Even when both are defined, they may have different domains
and co-domains, and so of course cannot be the same function. Even when both are defined and have the same domains and co-
domains, the two compositions will not be the same in general. Examples of all of these cases are given in the computational
exercises below.
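A quick numerical check of these claims, using a few hypothetical example functions:

```python
def compose(f, g):
    """Return f ∘ g, the function x ↦ f(g(x))."""
    return lambda x: f(g(x))

f = lambda x: x + 1
g = lambda x: 2 * x
h = lambda x: x ** 2

# Composition is associative: f ∘ (g ∘ h) = (f ∘ g) ∘ h at every point
print(all(compose(f, compose(g, h))(x) == compose(compose(f, g), h)(x)
          for x in range(-5, 6)))          # True

# But not commutative: f ∘ g and g ∘ f differ
print(compose(f, g)(3), compose(g, f)(3))  # 7 8
```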
The identity function acts like an identity with respect to the operation of composition.
If f : S → T then
1. f ∘ I_S = f
2. I_T ∘ f = f
Proof
2. f ∘ f⁻¹ = I_T
Proof
An element x ∈ S^n can be thought of as a function from {1, 2, … , n} into S . Similarly, an element x ∈ S^∞ can be thought of as
a function from N_+ into S . For such a sequence x, of course, we usually write x_i instead of x(i). More generally, if S and T are
sets, then the set of all functions from S into T is denoted by T^S. In particular, as we noted in the last section, S^∞ is also (and
more accurately) written as S^{N_+}.
Suppose that g is a one-to-one function from R onto S and that f is a one-to-one function from S onto T . Then
(f ∘ g)⁻¹ = g⁻¹ ∘ f⁻¹.
Proof
Inverse Images
Inverse images of a function play a fundamental role in probability, particularly in the context of random variables.
So f⁻¹(A) is the subset of S consisting of those elements that map into A .
Technically, the inverse images define a new function from P(T ) into P(S) . We use the same notation as for the inverse
function, which is defined when f is one-to-one and onto. These are very different functions, but usually no confusion results. The
following important theorem shows that inverse images preserve all set operations.
2. f⁻¹(A ∩ B) = f⁻¹(A) ∩ f⁻¹(B)
3. f⁻¹(A ∖ B) = f⁻¹(A) ∖ f⁻¹(B)
Proof
The result in part (a) holds for arbitrary unions, and the result in part (b) holds for arbitrary intersections. No new ideas are
involved; only the notation is more complicated.
2. f⁻¹(⋂_{i∈I} A_i) = ⋂_{i∈I} f⁻¹(A_i)
Proof
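These identities can be verified numerically on small finite sets; the following sketch uses a hypothetical, deliberately non-one-to-one function:

```python
def inverse_image(f, S, A):
    """f⁻¹(A): the elements of the domain S that f maps into A."""
    return {x for x in S if f(x) in A}

S = {-2, -1, 0, 1, 2}
f = lambda x: x * x
A, B = {0, 1, 4}, {1, 9}

# Inverse images preserve intersection, union, and difference
print(inverse_image(f, S, A & B) ==
      inverse_image(f, S, A) & inverse_image(f, S, B))  # True
print(inverse_image(f, S, A | B) ==
      inverse_image(f, S, A) | inverse_image(f, S, B))  # True
print(inverse_image(f, S, A - B) ==
      inverse_image(f, S, A) - inverse_image(f, S, B))  # True
```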
Forward Images
Forward images of a function are a natural complement to inverse images.
Suppose again that f : S → T . If A ⊆ S , the forward image of A under f is the subset of T given by
f (A) = {f (x) : x ∈ A} (1.2.5)
Figure 1.2.3: The forward image of A under f
Technically, the forward images define a new function from P(S) into P(T ) , but we use the same symbol f for this new function
as for the underlying function from S into T that we started with. Again, the two functions are very different, but usually no
confusion results.
It might seem that forward images are more natural than inverse images, but in fact, the inverse images are much more important
than the forward ones (at least in probability and measure theory). Fortunately, the inverse images are nicer as well: unlike inverse
images, forward images do not preserve all of the set operations.
The result in part (a) holds for arbitrary unions, and the result in part (b) holds for arbitrary intersections. No new ideas are involved;
only the notation is more complicated.
Suppose again that f : S → T . As noted earlier, the forward images of f define a function from P(S) into P(T ) and the inverse
images define a function from P(T ) into P(S) . One might hope that these functions are inverses of one another, but alas no.
Suppose that f : S → T .
1. A ⊆ f⁻¹[f (A)] for A ⊆ S . Equality holds if f is one-to-one.
2. f [f⁻¹(B)] ⊆ B for B ⊆ T . Equality holds if f is onto.
Proof
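A numerical sketch of both containments, with a hypothetical function that is neither one-to-one nor onto:

```python
def forward_image(f, A):
    return {f(x) for x in A}

def inverse_image(f, S, B):
    return {x for x in S if f(x) in B}

S = {-2, -1, 0, 1, 2}
f = lambda x: x * x   # not one-to-one on S

A = {1, 2}
print(inverse_image(f, S, forward_image(f, A)))  # {-2, -1, 1, 2}: strictly contains A

B = {4, 5}            # 5 is not a function value
print(forward_image(f, inverse_image(f, S, B)))  # {4}: strictly contained in B
```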
Now let V denote the collection of all functions from the given set S into R. A fact that is very important in probability as well as
other branches of analysis is that V , with addition and scalar multiplication as defined above, is a vector space. The zero function 0
is defined, of course, by 0(x) = 0 for all x ∈ S .
(V , +, ⋅) is a vector space over R. That is, for all f , g, h ∈ V and a, b ∈ R
Various subspaces of V are important in probability as well. We will return to the discussion of vector spaces of functions in the
sections on partial orders and in the advanced sections on metric spaces and measure theory.
Indicator Functions
For our next discussion, suppose that S is the universal set, so that all other sets mentioned are subsets of S .
Suppose that A ⊆ S . The indicator function of A is the function 1_A : S → {0, 1} defined as follows:
1_A(x) = 1 if x ∈ A, and 1_A(x) = 0 if x ∉ A (1.2.6)
Thus, the indicator function of A simply indicates whether or not x ∈ A for each x ∈ S . Conversely, any function on S that just
takes the values 0 and 1 is an indicator function.
Thus, there is a one-to-one correspondence between P(S) , the power set of S , and the collection of indicator functions {0, 1}^S.
The next result shows how the set algebra of subsets corresponds to the arithmetic algebra of the indicator functions.
2. 1_{A∪B} = 1 − (1 − 1_A)(1 − 1_B) = max{1_A, 1_B}
3. 1_{A^c} = 1 − 1_A
4. 1_{A∖B} = 1_A(1 − 1_B)
5. A ⊆ B if and only if 1_A ≤ 1_B
Proof
The result in part (a) extends to arbitrary intersections and the result in part (b) extends to arbitrary unions.
2. 1_{⋃_{i∈I} A_i} = 1 − ∏_{i∈I}(1 − 1_{A_i}) = max{1_{A_i} : i ∈ I}
Proof
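The indicator identities are easy to verify exhaustively on a small, hypothetical universal set:

```python
def indicator(A):
    """Return the indicator function 1_A."""
    return lambda x: 1 if x in A else 0

S = set(range(10))            # hypothetical universal set
A, B = {1, 2, 3}, {2, 3, 4}
one_A, one_B = indicator(A), indicator(B)

ok = all(
    one_A(x) * one_B(x) == indicator(A & B)(x) and          # intersection
    max(one_A(x), one_B(x)) == indicator(A | B)(x) and      # union
    1 - one_A(x) == indicator(S - A)(x) and                 # complement
    one_A(x) * (1 - one_B(x)) == indicator(A - B)(x)        # difference
    for x in S
)
print(ok)  # True
```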
Multisets
A multiset is like an ordinary set, except that elements may be repeated. A multiset A (with elements from a universal set S ) can be
uniquely associated with its multiplicity function m_A : S → N, where m_A(x) is the number of times that element x is in A for
each x ∈ S . So the multiplicity function of a multiset plays the same role that an indicator function does for an ordinary set.
Multisets arise naturally when objects are sampled with replacement (but without regard to order) from a population. Various
sampling models are explored in the section on Combinatorial Structures. We will not go into detail about the operations on
multisets, but the definitions are straightforward generalizations of those for ordinary sets.
Suppose that A and B are multisets with elements from the universal set S . Then
1. A ⊆ B if and only if m_A ≤ m_B
2. m_{A∪B} = max{m_A, m_B}
3. m_{A∩B} = min{m_A, m_B}
4. m_{A+B} = m_A + m_B
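Python's `collections.Counter` is essentially a multiplicity function, and its built-in operations match the rules above; the particular multisets here are hypothetical examples:

```python
from collections import Counter

mA = Counter({'a': 2, 'b': 3, 'c': 1})   # multiset A = {a, a, b, b, b, c}
mB = Counter({'a': 1, 'b': 4})

print(mA | mB)   # union: max of multiplicities
print(mA & mB)   # intersection: min of multiplicities
print(mA + mB)   # sum: multiplicities add

# A ⊆ B if and only if mA ≤ mB pointwise
print(all(mA[x] <= mB[x] for x in mA))   # False: e.g. mA('a') = 2 > 1 = mB('a')
```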
Product Spaces
Using functions, we can generalize the Cartesian products studied earlier. In this discussion, we suppose that S_i is a set for each i
in a nonempty index set I . The product set ∏_{i∈I} S_i is the set of all functions x : I → ⋃_{i∈I} S_i such that x(i) ∈ S_i for each i ∈ I .
Note that except for being nonempty, there are no assumptions on the cardinality of the index set I . Of course, if I = {1, 2, … , n}
for some n ∈ N_+, or if I = N_+, then this construction reduces to S_1 × S_2 × ⋯ × S_n and to S_1 × S_2 × ⋯, respectively. Since
we want to make the notation more closely resemble that of simple Cartesian products, we will write x_i instead of x(i) for the
value of the function x at i ∈ I , and we sometimes refer to this value as the ith coordinate of x. Finally, note that if S_i = S for
each i ∈ I , then ∏_{i∈I} S_i is simply the set of all functions from I into S , which we denoted by S^I above.
So p_j(x) is just the jth coordinate of x. The projections are of basic importance for product spaces.
For A ⊆ ∏_{i∈I} S_i and j ∈ I , the forward image p_j(A) is the projection of A onto S_j.
Proof
So the properties of projection that we studied in the last section are just special cases of the properties of forward images.
Projections also allow us to get coordinate functions in a simple way.
Suppose that R is a set, and that f : R → ∏_{i∈I} S_i. If j ∈ I then p_j ∘ f : R → S_j is the jth coordinate function of f .
Proof
This will look more familiar for a simple Cartesian product. If f : R → S_1 × S_2 × ⋯ × S_n, then f = (f_1, f_2, … , f_n) where
f_j : R → S_j is the jth coordinate function for j ∈ {1, 2, … , n}.
Cross sections of a subset of a product set can be expressed in terms of inverse images of a function. First we need some additional
notation. Suppose that our index set I has at least two elements. For j ∈ I and u ∈ S_j, define j_u : ∏_{i∈I−{j}} S_i → ∏_{i∈I} S_i by
letting j_u(x) be the function that agrees with x on I − {j} and takes the value u at coordinate j.
Let's look at this for the product of two sets S and T . For x ∈ S , the function 1_x : T → S × T is given by 1_x(y) = (x, y).
Similarly, for y ∈ T , the function 2_y : S → S × T is given by 2_y(x) = (x, y). Suppose now that A ⊆ S × T . If x ∈ S , then
1_x⁻¹(A) = {y ∈ T : (x, y) ∈ A}, the very definition of the cross section of A in the first coordinate at x. Similarly, if y ∈ T , then
2_y⁻¹(A) = {x ∈ S : (x, y) ∈ A}, the very definition of the cross section of A in the second coordinate at y . This construction is
not particularly important except to show that cross sections are inverse images. Thus the fact that cross sections preserve all of the
set operations is a simple consequence of the fact that inverse images generally preserve set operations.
Operators
Sometimes functions have special interpretations in certain settings.
Suppose that S is a set.
1. A function f : S → S is sometimes called a unary operator on S .
2. A function g : S × S → S is sometimes called a binary operator on S .
As the names suggest, a unary operator f operates on an element x ∈ S to produce a new element f (x) ∈ S . Similarly, a binary
operator g operates on a pair of elements (x, y) ∈ S × S to produce a new element g(x, y) ∈ S . The arithmetic operators are
quintessential examples:
For a fixed universal set S , the set operators studied in the section on Sets provide other examples.
As these examples illustrate, a binary operator is often written as x f y rather than f (x, y). Still, it is useful to know that operators
are simply functions of a special type.
Thus if A is closed under the unary operator f , then f restricted to A is a unary operator on A . Similarly if A is closed under the
binary operator g , then g restricted to A × A is a binary operator on A . Let's return to our most basic example.
Many properties that you are familiar with for special operators (such as the arithmetic and set operators) can now be formulated
generally.
Suppose that f and g are binary operators on a set S . In the following definitions, x, y , and z are arbitrary elements of S .
1. f is commutative if f (x, y) = f (y, x), that is, x f y = y f x
2. f is associative if f (x, f (y, z)) = f (f (x, y), z), that is, x f (y f z) = (x f y) f z.
3. g distributes over f (on the left) if g(x, f (y, z)) = f (g(x, y), g(x, z)), that is, x g (y f z) = (x g y) f (x g z)
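Properties like these can be probed numerically: a single counterexample disproves a property, while spot checks only suggest that it holds. A sketch for the arithmetic operators, with arbitrarily chosen sample points:

```python
commutative = lambda f, pts: all(f(x, y) == f(y, x) for x, y in pts)
associative = lambda f, pts3: all(f(x, f(y, z)) == f(f(x, y), z) for x, y, z in pts3)
distributes = lambda g, f, pts3: all(g(x, f(y, z)) == f(g(x, y), g(x, z))
                                     for x, y, z in pts3)

add = lambda x, y: x + y
sub = lambda x, y: x - y
mul = lambda x, y: x * y

pts = [(1, 2), (3, 5), (-4, 7)]
pts3 = [(2, 3, 4), (1, 5, 7), (-2, 0, 3)]

print(commutative(sub, pts))        # False: 1 - 2 ≠ 2 - 1
print(associative(sub, pts3))       # False: 2 - (3 - 4) ≠ (2 - 3) - 4
print(distributes(mul, sub, pts3))  # True: x(y − z) = xy − xz
print(distributes(add, mul, pts3))  # False: x + yz ≠ (x + y)(x + z) in general
```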
Stripped of most of the mathematical jargon, the idea is very simple. Since each set A ∈ S is nonempty, we can select an element
of A ; we will call the element we select f (A) and thus our selections define a function. In fact, you may wonder why we need an
axiom at all. The problem is that we have not given a rule (or procedure or algorithm) for selecting the elements of the sets in the
collection. Indeed, we may not know enough about the sets in the collection to define a specific rule, so in such a case, the axiom of
choice simply guarantees the existence of a choice function. Some mathematicians, known as constructivists, do not accept the
axiom of choice, and insist on well-defined rules for constructing functions.
A nice consequence of the axiom of choice is a type of duality between one-to-one functions and onto functions.
Suppose that f is a function from a set S onto a set T . There exists a one-to-one function g from T into S .
Proof.
Suppose that f is a one-to-one function from a set S into a set T . There exists a function g from T onto S .
Proof.
Computational Exercises
Some Elementary Functions
Each of the following rules defines a function from R into R:
f (x) = x²
g(x) = sin(x)
h(x) = ⌊x⌋
u(x) = eˣ/(1 + eˣ)
Find the range of the function and determine if the function is one-to-one in each of the following cases:
1. f
2. g
3. h
4. u
Answer
2. g⁻¹{0}
3. h⁻¹{2, 3, 4}
Answer
The function u is one-to-one. Find (that is, give the domain and rule for) the inverse function u⁻¹.
Answer
Give the rule and find the range for each of the following functions:
1. f ∘ g
2. g ∘ f
3. h ∘ g ∘ f
Answer
Dice
Let S = {1, 2, 3, 4, 5, 6}². This is the set of possible outcomes when a pair of standard dice are thrown. Let f , g , u, and v be the
functions from S into Z defined by the following rules:
f (x, y) = x + y
g(x, y) = y − x
u(x, y) = min{x, y}
v(x, y) = max{x, y}
2. u⁻¹{3}
3. v⁻¹{4}
4. U⁻¹{(3, 4)}
Answer
Note that while f ∘ U is well-defined, U ∘ f is not. Note also that f ∘ U = f even though U is not the identity function on S .
Bit Strings
Let n ∈ N_+ and let S = {0, 1}ⁿ and T = {0, 1, … , n}. Recall that the elements of S are bit strings of length n , and could
represent the possible outcomes of n tosses of a coin (where 1 means heads and 0 means tails). Let f : S → T be the function
defined by f (x_1, x_2, … , x_n) = ∑_{i=1}^n x_i. Note that f (x) is just the number of 1s in the bit string x. Let g : T → S be the
function defined by g(k) = x_k where x_k denotes the bit string with k 1s followed by n − k 0s.
In the previous exercise, note that f ∘ g and g ∘ f are both well-defined, but have different domains, and so of course are not the
same. Note also that f ∘ g is the identity function on T , but f is not the inverse of g . Indeed f is not one-to-one, and so does not
have an inverse. However, f restricted to {x_k : k ∈ T } (the range of g ) is one-to-one and is the inverse of g .
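A sketch of this example in code, encoding bit strings as tuples (with n = 4, as in the exercises below):

```python
n = 4
f = lambda bits: sum(bits)                    # number of 1s in the bit string
g = lambda k: tuple([1] * k + [0] * (n - k))  # k 1s followed by n - k 0s

T = range(n + 1)
print(all(f(g(k)) == k for k in T))  # True: f ∘ g is the identity on T

# f is not one-to-one, so it has no inverse on all of S = {0, 1}^n
print(f((1, 0, 1, 0)) == f((1, 1, 0, 0)))  # True: different strings, same count
```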
Let n = 4 . Give f⁻¹({k}) in list form for each k ∈ T .
Answer
Again let n = 4 . Let A = {1000, 1010} and B = {1000, 1100}. Give each of the following in list form:
1. f (A)
2. f (B)
3. f (A ∩ B)
4. f (A) ∩ f (B)
5. f⁻¹(f (A))
Answer
Indicator Functions
Suppose that A and B are subsets of a universal set S . Express, in terms of 1_A and 1_B, the indicator function of each of the 14
non-trivial sets that can be constructed from A and B . Use the Venn diagram app to help.
Answer
Suppose that A , B , and C are subsets of a universal set S . Give the indicator function of each of the following, in terms of 1_A,
1_B, and 1_C, in sum-product form:
Operators
Recall the standard arithmetic operators on R discussed above.
We all know that sum is commutative and associative, and that product distributes over sum.
1. Is difference commutative?
2. Is difference associative?
3. Does product distribute over difference?
4. Does sum distribute over product?
Answer
Multisets
Give, in list form, the multiset A that has multiplicity function m : {a, b, c, d, e} → N given by m(a) = 2 , m(b) = 3 ,
m(c) = 1 , m(d) = 0 , m(e) = 4 .
Answer
This page titled 1.2: Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.3: Relations
Relations play a fundamental role in probability theory, as in most other areas of mathematics.
As the name suggests, a relation R from S into T is supposed to define a relationship between the elements of S and the elements
of T , and so we usually use the more suggestive notation x R y when (x, y) ∈ R. Note that the domain of R is the projection of R
onto S and the range of R is the projection of R onto T .
Basic Examples
Suppose that S is a set and recall that P(S) denotes the power set of S , the collection of all subsets of S . The membership relation
∈ from S to P(S) is perhaps the most important and basic relationship in mathematics. Indeed, for us, it's a primitive (undefined) relationship.
Constructions
Since a relation is just a set of ordered pairs, the set operations can be used to build new relations from existing ones.
Equivalently, R⁻¹ = {(y, x) : (x, y) ∈ R}. Note that any function f from S into T has an inverse relation, but only when f is
one-to-one is the inverse relation also a function (the inverse function). Composition is another natural way to create new relations
from existing ones.
Suppose that Q is a relation from S to T and that R is a relation from T to U . The composition Q ∘ R is the relation from S to
U defined as follows: for x ∈ S and z ∈ U , x(Q ∘ R)z if and only if there exists y ∈ T such that x Q y and y R z .
Note that the notation is inconsistent with the notation used for composition of functions, essentially because relations are read
from left to right, while functions are read from right to left. Hopefully, the inconsistency will not cause confusion, since we will
always use function notation for functions.
Basic Properties
The important classes of relations that we will study in the next couple of sections are characterized by certain basic properties.
Here are the definitions:
The proofs of the following results are straightforward, so be sure to try them yourself before reading the ones in the text.
Suppose that Q and R are relations on S . For each property below, if both Q and R have the property, then so does Q ∩ R .
1. reflexive
2. symmetric
3. transitive
Proof
Computational Exercises
Let R be the relation defined on R by xRy if and only if sin(x) = sin(y) . Determine if R has each of the following
properties:
1. reflexive
2. symmetric
3. transitive
4. irreflexive
5. antisymmetric
Answer
The relation R in the previous exercise is a member of an important class of equivalence relations.
This page titled 1.3: Relations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.4: Partial Orders
Partial orders are a special class of relations that play an important role in probability theory.
Basic Theory
Definitions
A partial order on a set S is a relation ⪯ on S that is reflexive, anti-symmetric, and transitive. The pair (S, ⪯) is called a
partially ordered set. So for all x, y, z ∈ S :
1. x ⪯ x , the reflexive property
2. If x ⪯ y and y ⪯ x then x = y , the antisymmetric property
3. If x ⪯ y and y ⪯ z then x ⪯ z , the transitive property
As the name and notation suggest, a partial order is a type of ordering of the elements of S . Partial orders occur naturally in many
areas of mathematics, including probability. A partial order on a set naturally gives rise to several other relations on the set.
Suppose that ⪯ is a partial order on a set S . The relations ⪰, ≺, ≻, ⊥, and ∥ are defined as follows:
1. x ⪰ y if and only if y ⪯ x .
2. x ≺ y if and only if x ⪯ y and x ≠ y .
3. x ≻ y if and only if y ≺ x .
4. x ⊥ y if and only if x ⪯ y or y ⪯ x .
5. x ∥ y if and only if neither x ⪯ y nor y ⪯ x .
Note that ⪰ is the inverse of ⪯, and ≻ is the inverse of ≺. Note also that x ⪯ y if and only if either x ≺ y or x = y , so the relation
≺ completely determines the relation ⪯. The relation ≺ is sometimes called a strict or strong partial order to distinguish it from the
ordinary (weak) partial order ⪯. Finally, note that x ⊥ y means that x and y are related in the partial order, while x ∥ y means that
x and y are unrelated in the partial order. Thus, the relations ⊥ and ∥ are complements of each other, as sets of ordered pairs. A
total or linear order is a partial order in which there are no unrelated elements.
Suppose that ⪯_1 and ⪯_2 are partial orders on a set S . Then ⪯_1 is a sub-order of ⪯_2, or equivalently ⪯_2 is an extension of ⪯_1,
if and only if x ⪯_1 y implies x ⪯_2 y for all x, y ∈ S .
Thus if ⪯_1 is a sub-order of ⪯_2, then as sets of ordered pairs, ⪯_1 is a subset of ⪯_2. We need one more relation that arises naturally
from a partial order.
Suppose that ⪯ is a partial order on a set S . For x, y ∈ S , y is said to cover x if x ≺ y but no element z ∈ S satisfies
x ≺ z ≺ y .
If S is finite, the covering relation completely determines the partial order, by virtue of the transitive property.
Suppose that ⪯ is a partial order on a finite set S . The covering graph or Hasse graph of (S, ⪯) is the directed graph with
vertex set S and directed edge set E , where (x, y) ∈ E if and only if y covers x.
Thus, x ≺ y if and only if there is a directed path in the graph from x to y . Hasse graphs are named for the German mathematician
Helmut Hasse. The graphs are often drawn with the edges directed upward. In this way, the directions can be inferred without
having to actually draw arrows.
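For a finite partially ordered set, the covering relation (the edge set of the Hasse graph) can be computed directly from the definition. A sketch using the division order on a hypothetical example set, the divisors of 12:

```python
def hasse_edges(S, leq):
    """Directed edges (x, y) where y covers x: x ≺ y with nothing between."""
    below = lambda x, y: x != y and leq(x, y)          # x ≺ y
    return {(x, y) for x in S for y in S if below(x, y)
            and not any(below(x, z) and below(z, y) for z in S)}

S = {1, 2, 3, 4, 6, 12}
divides = lambda m, n: n % m == 0
print(sorted(hasse_edges(S, divides)))
# [(1, 2), (1, 3), (2, 4), (2, 6), (3, 6), (4, 12), (6, 12)]
```

Note that, for example, (2, 12) is not an edge: 2 ∣ 12, but 4 lies strictly between them, so the transitive property recovers 2 ≺ 12 from the edges.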
Basic Examples
Of course, the ordinary order ≤ is a total order on the set of real numbers R. The subset partial order is one of the most important
in probability theory:
Suppose that S is a set. The subset relation ⊆ is a partial order on P(S) , the power set of S .
Proof
Let ∣ denote the division relation on the set of positive integers N_+. That is, m ∣ n if and only if there exists k ∈ N_+ such that
n = km . Then
1. ∣ is a partial order on N_+.
The set of functions from a set into a partially ordered set can itself be partially ordered in a natural way.
Suppose that S is a set and that (T , ⪯_T) is a partially ordered set, and let T^S denote the set of functions f : S → T . The
relation ⪯ on T^S defined by f ⪯ g if and only if f (x) ⪯_T g(x) for all x ∈ S is a partial order on T^S.
Proof
Basic Properties
The proofs of the following basic properties are straightforward. Be sure to try them yourself before reading the ones in the text.
Let S be a set. A relation ⪯ is a partial order on S if and only if ≺ is transitive and irreflexive.
Proof
Suppose that ⪯ is a partial order on a set S and that A ⊆ S . In the following definitions, x, y are arbitrary elements of S .
1. A is increasing if x ∈ A and x ⪯ y imply y ∈ A .
2. A is decreasing if y ∈ A and x ⪯ y imply x ∈ A .
Suppose that S is a set with partial order ⪯_S, T is a set with partial order ⪯_T, and that f : S → T . In the following
definitions, x, y are arbitrary elements of S .
1. f is increasing if and only if x ⪯_S y implies f (x) ⪯_T f (y).
Recall the definition of the indicator function 1_A associated with a subset A of a universal set S : For x ∈ S , 1_A(x) = 1 if x ∈ A
and 1_A(x) = 0 if x ∉ A .
Suppose that ⪯ is a partial order on a set S and that A ⊆ S . Then
1. A is increasing if and only if 1_A is increasing.
Proof
Isomorphism
Two partially ordered sets (S, ⪯_S) and (T , ⪯_T) are said to be isomorphic if there exists a one-to-one function f from S onto
T such that x ⪯_S y if and only if f (x) ⪯_T f (y), for all x, y ∈ S . The function f is an isomorphism.
Generally, a mathematical space often consists of a set and various structures defined in terms of the set, such as relations,
operators, or a collection of subsets. Loosely speaking, two mathematical spaces of the same type are isomorphic if there exists a
one-to-one function from one of the sets onto the other that preserves the structures, and again, the function is called an
isomorphism. The basic idea is that isomorphic spaces are mathematically identical, except for superficial matters of appearance.
The word isomorphism is from the Greek and means equal shape.
Proof
In a sense, the subset partial order is universal—every partially ordered set is isomorphic to (S , ⊆) for some collection of sets S .
Suppose that ⪯ is a partial order on a set S . Then there exists S ⊆ P(S) such that (S, ⪯) is isomorphic to (S , ⊆).
Proof
Extremal Elements
Various types of extremal elements play important roles in partially ordered sets. Here are the definitions:
In general, a set can have several maximal and minimal elements (or none). On the other hand,
The minimum and maximum elements of A , if they exist, are unique. They are denoted min(A) and max(A), respectively.
Proof
Minimal, maximal, minimum, and maximum elements of a set must belong to that set. The following definitions relate to upper
and lower bounds of a set, which do not have to belong to the set.
By (20), the greatest lower bound of A is unique, if it exists. It is denoted glb(A) or inf(A) . Similarly, the least upper bound of A
is unique, if it exists, and is denoted lub(A) or sup(A) . Note that every element of S is a lower bound and an upper bound for ∅,
since the conditions in the definition hold vacuously.
The symbols ∧ and ∨ are also used for infimum and supremum, respectively, so ⋀A = inf(A) and ⋁A = sup(A) if they exist.
In particular, for x, y ∈ S , operator notation is more commonly used, so x ∧ y = inf{x, y} and x ∨ y = sup{x, y}. Partially
ordered sets for which these elements always exist are important, and have a special name.
Suppose that ⪯ is a partial order on a set S . Then (S, ⪯) is a lattice if x ∧ y and x ∨ y exist for every x, y ∈ S .
For the subset partial order, the inf and sup operators correspond to intersection and union, respectively:
Let S be a set and consider the subset partial order ⊆ on P(S) , the power set of S . Let A be a nonempty subset of P(S) ,
that is, a nonempty collection of subsets of S . Then
1. inf(A ) = ⋂ A
2. sup(A ) = ⋃ A
Proof
Consider the division partial order ∣ on the set of positive integers N_+ and let A be a nonempty subset of N_+.
1. inf(A) is the greatest common divisor of A , usually denoted gcd(A) in this context.
2. If A is infinite then sup(A) does not exist. If A is finite then sup(A) is the least common multiple of A , usually denoted
lcm(A) in this context.
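A numerical check of this result, using Python's `math.gcd` and `math.lcm` (the latter requires Python 3.9 or later; the set A is a hypothetical example):

```python
from math import gcd, lcm

A = [4, 6, 10]
inf_A = gcd(*A)   # infimum under | is the greatest common divisor
sup_A = lcm(*A)   # supremum under | (A finite) is the least common multiple
print(inf_A, sup_A)  # 2 60

# inf_A is a lower bound and sup_A an upper bound in the division order
print(all(a % inf_A == 0 and sup_A % a == 0 for a in A))  # True
```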
Suppose that S is a set and that f : S → S . An element z ∈ S is said to be a fixed point of f if f (z) = z .
The following result explores a basic fixed point theorem for a partially ordered set. The theorem is important in the study of
cardinality.
Suppose that ⪯ is a partial order on a set S with the property that sup(A) exists for every A ⊆ S . If f : S → S is increasing,
then f has a fixed point.
Proof.
If ⪯ is a total order on a set S with the property that every nonempty subset of S has a minimum element, then S is said to be well
ordered by ⪯. One of the most important examples is N_+, which is well ordered by the ordinary order ≤. On the other hand, the
well ordering principle, which is equivalent to the axiom of choice, states that every nonempty set can be well ordered.
1. The relation ⪯ is a partial order on S × T , called, appropriately enough, the product order.
2. Suppose that (S, ⪯_S) = (T , ⪯_T). If S has at least 2 elements, then ⪯ is not a total order on S².
Proof
Figure 1.4.1: The product order on R². The region shaded red is the set of points ⪰ (x, y). The region shaded blue is the set of
points ⪯ (x, y). The region shaded white is the set of points that are not comparable with (x, y).
Product order extends in a straightforward way to the Cartesian product of a finite or an infinite sequence of partially ordered
spaces. For example, suppose that S_i is a set with partial order ⪯_i for each i ∈ {1, 2, … , n}, where n ∈ N_+. The product order ⪯
is defined on the product set S_1 × S_2 × ⋯ × S_n as follows: x ⪯ y if and only if x_i ⪯_i y_i for each i ∈ {1, 2, … , n}. We can
generalize this further to arbitrary product sets. Suppose that S_i is a set for each i in a nonempty (but otherwise arbitrary) index
set I . Recall that ∏_{i∈I} S_i is the set of all functions x : I → ⋃_{i∈I} S_i such that x(i) ∈ S_i for each i ∈ I .
To make the notation look more like a simple Cartesian product, we will write x_i instead of x(i) for the value of a function x in the
product set at i ∈ I .
Suppose that S_i is a set with partial order ⪯_i for each i in a nonempty index set I . Define the relation ⪯ on ∏_{i∈I} S_i by x ⪯ y
if and only if x_i ⪯_i y_i for each i ∈ I . Then ⪯ is a partial order on the product set, known again as the product order.
Proof
Note again that no assumptions are made on the index set I , other than it be nonempty. In particular, no order is necessary on I .
The next result gives a very different type of order on a product space.
Suppose again that S and T are sets with partial orders ⪯_S and ⪯_T respectively. Define the relation ⪯ on S × T by
(x, y) ⪯ (z, w) if and only if either x ≺_S z , or x = z and y ⪯_T w .
1. The relation ⪯ is a partial order on S × T , called the lexicographic order or dictionary order.
2. If ⪯_S and ⪯_T are total orders on S and T , respectively, then ⪯ is a total order on S × T .
Proof
Figure 1.4.2: The lexicographic order on R². The region shaded red is the set of points ⪰ (x, y). The region shaded blue is the set
of points ⪯ (x, y).
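Python's built-in comparison of tuples is precisely the lexicographic order constructed from the ordinary order on each coordinate, which makes the definition easy to experiment with:

```python
# First coordinates decide; ties are broken by the second coordinate
print((1, 9) < (2, 0))  # True: 1 < 2, so the second coordinates are irrelevant
print((1, 2) < (1, 5))  # True: first coordinates tie, and 2 < 5

# The order is total: any list of distinct pairs can be sorted
print(sorted([(2, 0), (1, 9), (1, 2)]))  # [(1, 2), (1, 9), (2, 0)]
```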
Suppose that S_i is a set with partial order ⪯_i for each i in a nonempty index set I . Suppose also that ≤ is a total order on I .
Define the relation ⪯ on the product set ∏_{i∈I} S_i as follows: x ≺ y if and only if there exists j ∈ I such that x_i = y_i if i < j
and x_j ≺_j y_j. Then ⪯ is a partial order on ∏_{i∈I} S_i, known as the lexicographic order.
Proof
The term lexicographic comes from the way that we order words alphabetically: We look at the first letter; if these are different, we
know how to order the words. If the first letters are the same, we look at the second letter; if these are different, we know how to
order the words. We continue in this way until we find letters that are different, and we can order the words. In fact, the
lexicographic order is sometimes referred to as the first difference order. Note also that if S_i is a set and ⪯_i a total order on S_i for
i ∈ I , then by the well ordering principle, there exists a well ordering ≤ of I , and hence there exists a lexicographic total order on
the product space ∏_{i∈I} S_i. As a mathematical structure, the lexicographic order is not as obscure as you might think.
(R, ≤) is isomorphic to the lexicographic product of (Z, ≤) with ([0, 1), ≤), where ≤ is the ordinary order for real numbers.
Proof
Limits of Sequences of Real Numbers
Suppose that (a_1, a_2, …) is a sequence of real numbers.
Since the sequence of infimums in the last result is increasing, the limit exists in R ∪ {∞} , and is called the limit inferior of the
original sequence:
Since the sequence of supremums in the last result is decreasing, the limit exists in R ∪ {−∞} , and is called the limit superior
of the original sequence:
lim sup_{n→∞} a_n = lim_{n→∞} sup{a_n, a_{n+1}, …} (1.4.4)
Note that lim inf_{n→∞} a_n ≤ lim sup_{n→∞} a_n, and equality holds if and only if lim_{n→∞} a_n exists (and is the common value).
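A numerical illustration: for the hypothetical sequence a_n = (−1)ⁿ + 1/n, the tail infima increase to −1 and the tail suprema decrease to 1, so lim inf = −1 < 1 = lim sup and the sequence has no limit. Truncating the tail gives a finite proxy:

```python
a = lambda n: (-1) ** n + 1 / n

N = 10_000                     # start of the (truncated) tail
tail = [a(n) for n in range(N, N + 1000)]
print(round(min(tail), 3))     # -1.0, approximating lim inf
print(round(max(tail), 3))     # 1.0, approximating lim sup
```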
such that |f (x)| ≤ C for all x ∈ S . Now let U denote the set of bounded functions f : S → R , and for f ∈ U define ∥f∥ = sup{|f (x)| : x ∈ S}.
Appropriately enough, ∥ ⋅ ∥ is called the supremum norm on U . Vector spaces of bounded, real-valued functions, with the
supremum norm are especially important in probability and random processes. We will return to this discussion again in the
advanced sections on metric spaces and measure theory.
Computational Exercises
Let S = {2, 3, 4, 6, 12}.
1. Sketch the Hasse graph corresponding to the ordinary order ≤ on S .
2. Sketch the Hasse graph corresponding to the division partial order ∣ on S .
Answer
Consider the ordinary order ≤ on the set of real numbers R , and let A = [a, b) where a < b . Find each of the following that
exist:
1. The set of minimal elements of A
2. The set of maximal elements of A
3. min(A)
4. max(A)
5. The set of lower bounds of A
6. The set of upper bounds of A
7. inf(A)
8. sup(A)
Answer
Again consider the division partial order ∣ on the set of positive integers N+ and let A = {2, 3, 4, 6, 12} . Find each of the
following that exist:
1. The set of minimal elements of A
2. The set of maximal elements of A
3. min(A)
4. max(A)
5. The set of lower bounds of A
6. The set of upper bounds of A
7. inf(A)
8. sup(A)
Answer
Let S = {a, b, c} .
1. Give P(S) in list form.
2. Describe the Hasse graph of (P(S), ⊆)
Answer
Let S = {a, b, c, d} .
1. Give P(S) in list form.
2. Describe the Hasse graph of (P(S), ⊆)
Answer
Suppose that A and B are subsets of a universal set S . Let A denote the collection of the 16 subsets of S that can be
constructed from A and B using the set operations. Show that (A , ⊆) is isomorphic to the partially ordered set in the previous
exercise. Use the Venn diagram app to help.
Proof
This page titled 1.4: Partial Orders is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.5: Equivalence Relations
Basic Theory
Definitions
A relation ≈ on a nonempty set S that is reflexive, symmetric, and transitive is an equivalence relation on S . Thus, for all
x, y, z ∈ S ,
1. x ≈ x , the reflexive property.
2. If x ≈ y then y ≈ x , the symmetric property.
3. If x ≈ y and y ≈ z then x ≈ z , the transitive property.
As the name and notation suggest, an equivalence relation is intended to define a type of equivalence among the elements of S .
Like partial orders, equivalence relations occur naturally in most areas of mathematics, including probability.
Suppose that ≈ is an equivalence relation on S . The equivalence class of an element x ∈ S is the set of all elements that are
equivalent to x, and is denoted
[x] = {y ∈ S : y ≈ x} (1.5.1)
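For a finite set, the equivalence classes can be computed directly from the definition. A minimal Python sketch (the function name and the parity example are ours, not part of the text):

```python
def equivalence_classes(S, related):
    # [x] = {y in S : y ~ x}; collect the distinct classes.
    # Assumes `related` really is reflexive, symmetric, and transitive.
    classes = []
    for x in S:
        cls = frozenset(y for y in S if related(x, y))
        if cls not in classes:
            classes.append(cls)
    return classes

# Same-parity relation on {0, 1, ..., 5}: the classes are the evens and the odds
classes = equivalence_classes(range(6), lambda x, y: x % 2 == y % 2)
```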
Results
The most important result is that an equivalence relation on a set S defines a partition of S , by means of the equivalence classes.
Sometimes the collection 𝒮 of equivalence classes is denoted S/≈. The idea is that the equivalence classes are new “objects”
obtained by “identifying” elements in S that are equivalent. Conversely, every partition of a set defines an equivalence relation on the set.
Suppose that 𝒮 is a collection of nonempty sets that partition a given set S. Define the relation ≈ on S by x ≈ y if and only if
x ∈ A and y ∈ A for some A ∈ 𝒮.
1. ≈ is an equivalence relation.
2. 𝒮 is the set of equivalence classes.
Proof
Figure 1.5.1 : A partition of S . Any two points in the same partition set are equivalent.
Sometimes the equivalence relation ≈ associated with a given partition 𝒮 is denoted S/𝒮. The idea, of course, is that elements in
the same set of the partition are equivalent.
The process of forming a partition from an equivalence relation, and the process of forming an equivalence relation from a
partition are inverses of each other.
1.5.1 https://stats.libretexts.org/@go/page/10120
1. If we start with an equivalence relation ≈ on S , form the associated partition, and then construct the equivalence relation
associated with the partition, then we end up with the original equivalence relation. In modular notation, S/(S/ ≈) is the
same as ≈.
2. If we start with a partition 𝒮 of S, form the associated equivalence relation, and then form the partition associated with the
equivalence relation, then we end up with the original partition. In modular notation, S/(S/𝒮) is the same as 𝒮.
Suppose that S is a nonempty set. The most basic equivalence relation on S is the equality relation =. In this case [x] = {x} for
each x ∈ S . At the other extreme is the trivial relation ≈ defined by x ≈ y for all x, y ∈ S . In this case S is the only equivalence
class.
Every function f defines an equivalence relation on its domain, known as the equivalence relation associated with f. Moreover,
the equivalence classes have a simple description in terms of the inverse images of f: for f : S → T, define x ≈ y if and only if f(x) = f(y).
1. ≈ is an equivalence relation on S, and [x] = f⁻¹{f(x)} for x ∈ S.
2. If f is a constant function then the equivalence relation associated with f is the trivial relation, and hence S is the only
equivalence class.
Proof
Equivalence relations associated with functions are universal: every equivalence relation is of this form:
Suppose that ≈ is an equivalence relation on a set S . Define f : S → P(S) by f (x) = [x]. Then ≈ is the equivalence relation
associated with f .
Proof
Suppose that ≈ and ≅ are equivalence relations on a set S . Let ≡ denote the intersection of ≈ and ≅ (thought of as subsets of
S × S ). Equivalently, x ≡ y if and only if x ≈ y and x ≅y .
1. ≡ is an equivalence relation on S .
2. [x]_≡ = [x]_≈ ∩ [x]_≅ for every x ∈ S.
Suppose that we have a relation that is reflexive and transitive, but fails to be a partial order because it's not anti-symmetric. The
relation and its inverse naturally lead to an equivalence relation, and then in turn, the original relation defines a true partial order on
the equivalence classes. This is a common construction, and the details are given in the next theorem.
Suppose that ⪯ is a relation on a set S that is reflexive and transitive. Define the relation ≈ on S by x ≈ y if and only if x ⪯ y
and y ⪯ x .
1. ≈ is an equivalence relation on S .
2. If A and B are equivalence classes and x ⪯ y for some x ∈ A and y ∈ B , then u ⪯ v for all u ∈ A and v ∈ B .
3. Define the relation ⪯ on the collection 𝒮 of equivalence classes by A ⪯ B if and only if x ⪯ y for some (and hence all)
x ∈ A and y ∈ B. Then ⪯ is a partial order on 𝒮.
Proof
A prime example of the construction in the previous theorem occurs when we have a function whose range space is partially
ordered. We can construct a partial order on the equivalence classes in the domain that are associated with the function.
Suppose that S and T are sets and that ⪯_T is a partial order on T. Suppose also that f : S → T. Define the relation ⪯_S on S
by x ⪯_S y if and only if f(x) ⪯_T f(y).
1. ⪯_S is reflexive and transitive.
2. The equivalence relation on S constructed in (10) is the equivalence relation associated with f, as in (6).
3. ⪯_S can be extended to a partial order on the equivalence classes corresponding to f.
Proof
2. g(x) = ⌊x⌋ .
3. h(x) = sin(x).
Answer
Calculus
Suppose that I is a fixed interval of R, and that S is the set of differentiable functions from I into R. Consider the equivalence
relation associated with the derivative operator D on S, so that D(f) = f′. For f ∈ S, give a simple description of [f].
Answer
Congruence
Recall the division relation ∣ from N_+ to Z: For d ∈ N_+ and n ∈ Z, d ∣ n means that n = kd for some k ∈ Z. In words, d divides n.
Fix d ∈ N_+.
1. Define the relation ≡_d on Z by m ≡_d n if and only if d ∣ (n − m). The relation ≡_d is known as congruence modulo d.
2. Define r_d : Z → {0, 1, …, d − 1} by r_d(n) = the remainder when n is divided by d.
Recall that by the Euclidean division theorem, named for Euclid of course, n ∈ Z can be written uniquely in the form n = kd + q
where k ∈ Z and q ∈ {0, 1, …, d − 1}, and then r_d(n) = q.
Congruence modulo d.
1. ≡_d is the equivalence relation associated with the function r_d.
Proof
Answer
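The relationship between ≡_d and the remainder function r_d is easy to check by brute force for a particular modulus. A Python sketch (d = 4 is an arbitrary choice):

```python
d = 4

def congruent(m, n):
    # m ≡_d n iff d divides n - m
    return (n - m) % d == 0

def r(n):
    # r_d(n): the unique q in {0, ..., d-1} with n = k*d + q
    return n % d

# Check that ≡_d agrees with the equivalence relation associated with r_d
sample = range(-10, 11)
agree = all(congruent(m, n) == (r(m) == r(n)) for m in sample for n in sample)
```

Python's `%` operator already returns the Euclidean remainder in {0, …, d − 1}, even for negative n.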
Linear Algebra
Linear algebra provides several examples of important and interesting equivalence relations. To set the stage, let R^{m×n} denote the
set of m × n matrices with real entries, for m, n ∈ N_+.
1. Multiply a row by a non-zero real number.
2. Interchange two rows.
3. Add a multiple of a row to another row.
Row operations are essential for inverting matrices and solving systems of linear equations.
Proof
Our next relation involves similarity, which is very important in the study of linear transformations, change of basis, and the theory
of eigenvalues and eigenvectors.
Matrices A, B ∈ R^{n×n} are similar if there exists an invertible P ∈ R^{n×n} such that P⁻¹AP = B. Similarity is an equivalence
relation on R^{n×n}.
Proof
entry of A^T, for i, j ∈ {1, 2, …, m}. Simply stated, A^T is the matrix whose rows are the columns of A. For the theorem that
follows, we need to remember that (AB)^T = B^T A^T for A ∈ R^{m×n} and B ∈ R^{n×k}, and (A^T)⁻¹ = (A⁻¹)^T if A ∈ R^{n×n} is
invertible.
Proof
Congruence is important in the study of orthogonal matrices and change of basis. Of course, the term congruence applied to
matrices should not be confused with the same term applied to integers.
Number Systems
Equivalence relations play an important role in the construction of complex mathematical structures from simpler ones. Often the
objects in the new structure are equivalence classes of objects constructed from the simpler structures, modulo an equivalence
relation that captures the essential properties of the new objects.
The construction of number systems is a prime example of this general idea. The next exercise explores the construction of rational
numbers from integers.
m/n = [(m, n)], the equivalence class generated by (m, n), for m ∈ Z and n ∈ N_+. This definition captures the
This page titled 1.5: Equivalence Relations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
1.6: Cardinality
Basic Theory
Definitions
Suppose that S is a non-empty collection of sets. We define a relation ≈ on S by A ≈ B if and only if there exists a one-to-
one function f from A onto B . The relation ≈ is an equivalence relation on S . That is, for all A, B, C ∈ S ,
1. A ≈ A , the reflexive property
2. If A ≈ B then B ≈ A , the symmetric property
3. If A ≈ B and B ≈ C then A ≈ C , the transitive property
Proof
A one-to-one function f from A onto B is sometimes called a bijection. Thus if A ≈ B then A and B are in one-to-one
correspondence and are said to have the same cardinality. The equivalence classes under this equivalence relation capture the
notion of having the same number of elements.
Let N_0 = ∅, and for k ∈ N_+, let N_k = {0, 1, …, k − 1}. As always, N = {0, 1, 2, …} is the set of all natural numbers.
In part (a), think of N_k as a reference set with k elements; any other set with k elements must be equivalent to this one. We will
study the cardinality of finite sets in the next two sections on Counting Measure and Combinatorial Structures. In this section, we
will concentrate primarily on infinite sets. In part (d), a countable set is one that can be enumerated or counted by putting the
elements into one-to-one correspondence with N_k for some k ∈ N or with all of N. An uncountable set is one that cannot be so
counted. Countable sets play a special role in probability theory, as in many other branches of mathematics. A priori, it's not clear
that there are uncountable sets, but we will soon see examples.
Preliminary Examples
If S is a set, recall that P(S) denotes the power set of S (the set of all subsets of S). If A and B are sets, then A^B is the set of all
functions from B into A. In particular, {0, 1}^S denotes the set of functions from S into {0, 1}.
The sets N, E = {0, 2, 4, …} (the even natural numbers), and Z (the integers) all have the same cardinality.
Proof
At one level, it might seem that E has only half as many elements as N while Z has twice as many elements as N. As the previous
result shows, that point of view is incorrect: N, E, and Z all have the same cardinality (and are countably infinite). The next
example shows that there are indeed uncountable sets.
If A is a set with at least two elements then S = A^N, the set of all functions from N into A, is uncountable.
Proof
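The diagonalization idea in the proof can be seen concretely for finitely many listed sequences: build a sequence that differs from the k-th listed sequence in coordinate k. A small Python sketch (bit sequences, truncated to finite length for illustration; the function name is ours):

```python
def diagonal_escape(listed):
    # Given n bit sequences (each of length >= n), flip the k-th bit of
    # the k-th sequence; the result differs from every listed sequence.
    return [1 - seq[k] for k, seq in enumerate(listed)]

listed = [[0, 0, 0], [1, 1, 1], [0, 1, 0]]
escape = diagonal_escape(listed)
```

No matter how the list is chosen, the escape sequence cannot appear in it, which is the heart of the uncountability argument.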
1.6.1 https://stats.libretexts.org/@go/page/10121
Subsets of Infinite Sets
Surely a set must be at least as large as any of its subsets, in terms of cardinality. On the other hand, by example (4), the set of
natural numbers N, the set of even natural numbers E and the set of integers Z all have exactly the same cardinality, even though
E ⊂ N ⊂ Z . In this subsection, we will explore some interesting and somewhat paradoxical results that relate to subsets of infinite
sets. Along the way, we will see that the countable infinity is the “smallest” of the infinities.
When S was infinite in the proof of the previous result, not only did we map S one-to-one onto a proper subset, we actually threw
away a countably infinite subset and still maintained equivalence. Similarly, we can add a countably infinite set to an infinite set S
without changing the cardinality.
In particular, if S is uncountable and B is countable then S ∪ B and S ∖ B have the same cardinality as S , and in particular are
uncountable. In terms of the dichotomies finite-infinite and countable-uncountable, a set is indeed at least as large as a subset. First
we need a preliminary result.
Suppose that A ⊆ B .
1. If B is finite then A is finite.
2. If A is infinite then B is infinite.
3. If B is countable then A is countable.
4. If A is uncountable then B is uncountable.
Proof
On the other hand, if there exists a function that maps a set A onto a set B , then in a sense, there is a copy of B contained in A .
Hence A should be at least as large as B .
Suppose that there exists a function from A onto B.
1. If A is finite then B is finite.
2. If B is infinite then A is infinite.
3. If A is countable then B is countable.
4. If B is uncountable then A is uncountable.
Proof
The previous exercise also could be proved from the one before, since if there exists a function f mapping A onto B , then there
exists a function g mapping B one-to-one into A . This duality is proven in the discussion of the axiom of choice. A simple and
useful corollary of the previous two theorems is that if B is a given countably infinite set, then a set A is countable if and only if
there exists a one-to-one function f from A into B , if and only if there exists a function g from B onto A .
The last result could also be proven from the one before, by noting that
A × B = ⋃_{a∈A} ({a} × B) (1.6.1)
Both proofs work because the set M is essentially a copy of N × N , embedded inside of N. The last theorem generalizes to the
statement that a finite product of countable sets is still countable. But, from (5), a product of infinitely many sets (with at least 2
elements each) will be uncountable.
A real number is algebraic if it is the root of a polynomial function (of degree 1 or more) with integer coefficients. Rational
numbers are algebraic, as are rational roots of rational numbers (when defined). Moreover, the algebraic numbers are closed under
addition, multiplication, and division. A real number is transcendental if it's not algebraic. The numbers e and π are transcendental,
but we don't know very many other transcendental numbers by name. However, as we will see, most (in the sense of cardinality)
real numbers are transcendental.
The following sets have the same cardinality, and in particular all are uncountable:
1. R, the set of real numbers.
2. Any interval I of R, as long as the interval is not empty or a single point.
3. R ∖ Q , the set of irrational numbers.
4. R ∖ A , the set of transcendental numbers.
5. P(N), the power set of N.
Proof
Proof
Thus, we can use the construction in the section on Equivalence Relations to first define an equivalence relation on S, and then
extend ⪯ to a true partial order on the collection of equivalence classes. The only question that remains is whether the equivalence
relation we obtain in this way is the same as the one that we have been using in our study of cardinality. Rephrased, the question is
this: If there exists a one-to-one function from A into B and a one-to-one function from B into A , does there necessarily exist a
one-to-one function from A onto B ? Fortunately, the answer is yes; the result is known as the Schröder-Bernstein Theorem, named
for Ernst Schröder and Felix Bernstein.
If A ⪯ B and B ⪯ A then A ≈ B .
Proof
We will write A ≺ B if A ⪯ B but A ≉ B. That is, there exists a one-to-one function from A into B, but there does not exist a
function from A onto B. Note that ≺ would have its usual meaning if applied to the equivalence classes. That is, [A] ≺ [B] if and
only if [A] ⪯ [B] but [A] ≠ [B] . Intuitively, of course, A ≺ B means that B is strictly larger than A , in the sense of cardinality.
We close our discussion with the observation that for any set, there is always a larger set.
The proof that a set cannot be mapped onto its power set is similar to the Russell paradox, named for Bertrand Russell.
The continuum hypothesis is the statement that there is no set whose cardinality is strictly between that of N and R. The continuum
hypothesis actually started out as the continuum conjecture, until it was shown to be consistent with the usual axioms of the real
number system (by Kurt Gödel in 1940), and independent of those axioms (by Paul Cohen in 1963).
Assuming the continuum hypothesis, if S is uncountable then there exists A ⊆ S such that A and A^c are uncountable.
Proof
There is a more complicated proof of the last result, without the continuum hypothesis and just using the axiom of choice.
This page titled 1.6: Cardinality is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.7: Counting Measure
Basic Theory
For our first discussion, we assume that the universal set S is finite. Recall the following definition from the section on cardinality.
For A ⊆ S , the cardinality of A is the number of elements in A , and is denoted #(A) . The function # on P(S) is called
counting measure.
Counting measure plays a fundamental role in discrete probability structures, and particularly those that involve sampling from a
finite set. The set S is typically very large, hence efficient counting methods are essential. The first combinatorial problem is
attributed to the Greek mathematician Xenocrates.
In many cases, a set of objects can be counted by establishing a one-to-one correspondence between the given set and some other
set. Naturally, the two sets have the same number of elements, but for various reasons, the second set may be easier to count.
If {A_1, A_2, …, A_n} is a collection of pairwise disjoint subsets of S then
#(⋃_{i=1}^n A_i) = ∑_{i=1}^n #(A_i) (1.7.1)
Thus, # is an increasing function, relative to the subset partial order ⊆ on P(S) , and the ordinary order ≤ on N.
1.7.1 https://stats.libretexts.org/@go/page/10122
Inequalities
Our next discussion concerns two inequalities that are useful for obtaining bounds on the number of elements in a set. The first is
Boole's inequality (named after George Boole) which gives an upper bound on the cardinality of a union.
#(⋃_{i=1}^n A_i) ≤ ∑_{i=1}^n #(A_i) (1.7.2)
Proof
Intuitively, Boole's inequality holds because parts of the union have been counted more than once in the expression on the right.
The second inequality is Bonferroni's inequality (named after Carlo Bonferroni), which gives a lower bound on the cardinality of
an intersection.
#(⋂_{i=1}^n A_i) ≥ #(S) − ∑_{i=1}^n #(A_i^c)
Proof
Suppose that {A_i : i ∈ I} is a collection of subsets of S where I is an index set with #(I) = n. Then
#(⋃_{i∈I} A_i) = ∑_{k=1}^n (−1)^{k−1} ∑_{J⊆I, #(J)=k} #(⋂_{j∈J} A_j) (1.7.6)
Proof
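The inclusion-exclusion formula is easy to verify by direct computation for small collections. A Python sketch (the function name and the example sets are ours):

```python
from itertools import combinations
from functools import reduce

def union_size(sets):
    # Inclusion-exclusion: alternate signs over the intersections of
    # subcollections of each size k = 1, ..., n.
    total = 0
    for k in range(1, len(sets) + 1):
        for J in combinations(sets, k):
            total += (-1) ** (k - 1) * len(reduce(frozenset.intersection, J))
    return total

A = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({4, 5})]
u = union_size(A)  # should equal the size of the union computed directly
```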
The general Bonferroni inequalities, named again for Carlo Bonferroni, state that if the sum on the right is truncated after k terms
(k < n), then the truncated sum is an upper bound for the cardinality of the union if k is odd (so that the last term has a positive
sign) and is a lower bound for the cardinality of the union if k is even (so that the last term has a negative sign).
Suppose that a procedure consists of k steps, performed sequentially, and that for each j ∈ {1, 2, …, k}, step j can be
performed in n_j ways, regardless of the choices made on the previous steps. Then the number of ways to perform the entire
procedure is n_1 n_2 ⋯ n_k.
The key to a successful application of the multiplication rule to a counting problem is the clear formulation of an algorithm that
generates the objects being counted, so that each object is generated once and only once. That is, we must neither over count nor
under count. It's also important to notice that the set of choices available at step j may well depend on the previous steps; the
assumption is only that the number of choices available does not depend on the previous steps.
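The multiplication principle corresponds directly to iterating over a Cartesian product of the per-step choice sets. A minimal sketch with k = 3 steps having 2, 3, and 4 choices (numbers chosen arbitrarily):

```python
from itertools import product

# Each outcome of the 3-step procedure is a sequence of choices, one per
# step; the count is the product 2 * 3 * 4 of the numbers of choices.
steps = [range(2), range(3), range(4)]
outcomes = list(product(*steps))
count = len(outcomes)
```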
The first two results below give equivalent formulations of the multiplication principle.
Suppose that S is a set of sequences of length k, and that we denote a generic element of S by (x_1, x_2, …, x_k). Suppose that
for each j ∈ {1, 2, …, k}, x_j has n_j different values, regardless of the values of the previous coordinates. Then
#(S) = n_1 n_2 ⋯ n_k.
Proof
Suppose that T is an ordered tree with depth k and that each vertex at level i − 1 has n_i children for i ∈ {1, 2, …, k}. Then
the number of endpoints of the tree is n_1 n_2 ⋯ n_k.
Proof
Product Sets
If S_i is a set with n_i elements for i ∈ {1, 2, …, k} then
#(S_1 × S_2 × ⋯ × S_k) = n_1 n_2 ⋯ n_k (1.7.10)
Proof
If S is a set with n elements then #(S^k) = n^k.
Proof
In (16), note that the elements of S^k can be thought of as ordered samples of size k that can be chosen with replacement from a
population of n objects. Elements of {0, 1}^n are sometimes called bit strings of length n. Thus, there are 2^n bit strings of length n.
Functions
The number of functions from a set A of m elements into a set B of n elements is n^m.
Proof
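The count n^m can be confirmed by enumerating functions explicitly: a function from A into B is just a choice of a value in B for each element of A. A sketch with m = 3 and n = 2 (the sets are arbitrary examples):

```python
from itertools import product

A = ['a', 'b', 'c']   # domain, m = 3 elements
B = [0, 1]            # codomain, n = 2 elements
# Each assignment of values, one per element of A, is exactly one function
functions = [dict(zip(A, values)) for values in product(B, repeat=len(A))]
count = len(functions)
```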
Recall that the set of functions from a set A into a set B (regardless of whether the sets are finite or infinite) is denoted B . This
A
theorem is motivation for the notation. Note also that if S is a set with n elements, then the elements in the Cartesian power set S k
can be thought of as functions from {1, 2, … , k} into S . So the counting formula for sequences can be thought of as a corollary of
counting formula for functions.
Subsets
Suppose that S is a set with n elements, where n ∈ N. There are 2^n subsets of S.
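The 2^n count reflects the correspondence between subsets of S and bit strings of length n: the i-th bit indicates whether the i-th element belongs to the subset. A sketch for n = 4 (the set is an arbitrary example):

```python
S = ['a', 'b', 'c', 'd']
n = len(S)
# Each integer 0 <= i < 2**n, written as an n-bit string, encodes one subset
subsets = [{x for x, bit in zip(S, format(i, f'0{n}b')) if bit == '1'}
           for i in range(2 ** n)]
count = len(subsets)
```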
Suppose that {A_1, A_2, …, A_k} is a collection of k subsets of a set S, where k ∈ N_+. There are 2^{2^k} different (in general) sets
that can be constructed from the k given sets, using the operations of union, intersection, and complement. These sets form the
algebra generated by the given sets.
Proof
2. Select each of the 12 other subsets of S . Note how each is a union of some of the sets in (a).
Suppose that S is a set with n elements and that A is a subset of S with k elements, where n, k ∈ N and k ≤ n. The number
of subsets of S that contain A is 2^{n−k}.
Proof
Our last result in this discussion generalizes the basic subset result above.
Suppose that n, k ∈ N and that S is a set with n elements. The number of sequences of subsets (A_1, A_2, …, A_k) with
A_1 ⊆ A_2 ⊆ ⋯ ⊆ A_k ⊆ S is (k + 1)^n.
Proof
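The (k + 1)^n count can be checked by brute force for small n and k: with k = 2, each element of S is in neither set, in A_2 only, or in both. A sketch with n = 3 and k = 2 (values chosen arbitrarily):

```python
from itertools import combinations

def powerset(S):
    # All subsets of the finite set S, as frozensets
    S = list(S)
    return [frozenset(c) for r in range(len(S) + 1)
            for c in combinations(S, r)]

S = {1, 2, 3}
# Count pairs (A1, A2) with A1 subset of A2 subset of S by brute force
chains = [(A1, A2) for A2 in powerset(S) for A1 in powerset(A2)]
count = len(chains)
```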
Computational Exercises
Identification Numbers
A license number consists of two letters (uppercase) followed by five digits. How many different license numbers are there?
Answer
Suppose that a Personal Identification Number (PIN) is a four-symbol code word in which each entry is either a letter
(uppercase) or a digit. How many PINs are there?
Answer
An experiment consists of rolling a standard die, drawing a card from a standard deck, and tossing a standard coin. How many
outcomes are there?
Answer
A standard die is rolled 5 times and the sequence of scores recorded. How many outcomes are there?
Answer
In the card game Set, each card has 4 properties: number (one, two, or three), shape (diamond, oval, or squiggle), color (red,
blue, or green), and shading (solid, open, or striped). The deck has one card of each (number, shape, color, shading)
configuration. A set in the game is defined as a set of three cards such that, for each property, the cards are either all the same or
all different.
1. How many cards are in a deck?
2. How many sets are there?
Answer
A coin is tossed 10 times and the sequence of scores recorded. How many sequences are there?
Answer
The die-coin experiment consists of rolling a die and then tossing a coin the number of times shown on the die. The sequence
of coin results is recorded.
1. How many outcomes are there?
2. How many outcomes are there with all heads?
3. How many outcomes are there with exactly one head?
Answer
Run the die-coin experiment 100 times and observe the outcomes.
Birthdays
Consider a group of 10 persons.
1. If we record the birth month of each person, how many outcomes are there?
2. If we record the birthday of each person (ignoring leap day), how many outcomes are there?
Answer
Reliability
In the usual model of structural reliability, a system consists of components, each of which is either working or defective. The
system as a whole is also either working or defective, depending on the states of the components and how the components are
connected.
A string of lights has 20 bulbs, each of which may be good or defective. How many configurations are there?
Answer
If the components are connected in series, then the system as a whole is working if and only if each component is working. If the
components are connected parallel, then the system as a whole is working if and only if at least one component is working.
A system consists of three subsystems with 6, 5, and 4 components, respectively. Find the number of component states for
which the system is working in each of the following cases:
1. The components in each subsystem are in parallel and the subsystems are in series.
2. The components in each subsystem are in series and the subsystems are in parallel.
Answer
Menus
Suppose that a sandwich at a restaurant consists of bread, meat, cheese, and various toppings. There are 4 choices for the bread,
3 choices for the meat, 5 choices for the cheese, and 10 different toppings (each of which may or may not be chosen). How many
sandwiches are there?
Answer
At a wedding dinner, there are three choices for the entrée, four choices for the beverage, and two choices for the dessert.
1. How many different meals are there?
2. If there are 50 guests at the wedding and we record the meal requested for each guest, how many possible outcomes are
there?
Answer
Braille
Braille is a tactile writing system used by people who are visually impaired. The system is named for the French educator
Louis Braille and uses raised dots in a 3 × 2 grid to encode characters. How many meaningful Braille configurations are there?
Answer
Figure 1.7.5 : The Braille encoding of the number 2 and the letter b
Personality Typing
The Myers-Briggs personality typing is based on four dichotomies: A person is typed as either extroversion (E) or
introversion (I), either sensing (S) or intuition (N), either thinking (T) or feeling (F), and either judgement (J) or perception (P).
1. How many Myers-Briggs personality types are there? List them.
2. Suppose that we list the personality types of 10 persons. How many possible outcomes are there?
Answer
In the Galton board, each peg is identified by an ordered pair (n, k) where n is the row number and k is the peg number in that row.
A ball is dropped onto the top peg (0, 0) of the Galton board. In general, when the ball hits peg (n, k), it either bounces to the left
to peg (n + 1, k) or to the right to peg (n + 1, k + 1) . The sequence of pegs that the ball hits is a path in the Galton board.
There is a one-to-one correspondence between each pair of the following three collections:
1. Bit strings of length n
2. Paths in the Galton board from (0, 0) to any peg in row n .
3. Subsets of a set with n elements.
This page titled 1.7: Counting Measure is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.8: Combinatorial Structures
The purpose of this section is to study several combinatorial structures that are of basic importance in probability.
Permutations
Suppose that D is a set with n ∈ N_+ elements. A permutation of length k ∈ {0, 1, …, n} from D is an ordered sequence of k
distinct elements of D; that is, a sequence of the form (x_1, x_2, …, x_k) where x_i ∈ D for each i and x_i ≠ x_j for i ≠ j.
Statistically, a permutation of length k from D corresponds to an ordered sample of size k chosen without replacement from the
population D.
The number of permutations of length k from the n-element set D is
n^{(k)} = n(n − 1) ⋯ (n − k + 1) (1.8.1)
Proof
By convention, n^{(0)} = 1. Recall that, in general, a product over an empty index set is 1. Note that n^{(k)} has k factors, starting at n,
and with each subsequent factor one less than the previous factor. Some alternate notations for the number of permutations of size
k from a set of n objects are P(n, k), P_{n,k}, and _nP_k.
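The falling power formula agrees with a direct enumeration of permutations, as a quick computation confirms (n = 7 and k = 3 are arbitrary choices; `falling_power` is our name for n^{(k)}):

```python
from itertools import permutations
from math import factorial

def falling_power(n, k):
    # n^(k) = n (n - 1) ... (n - k + 1), with the empty product equal to 1
    result = 1
    for i in range(k):
        result *= n - i
    return result

n, k = 7, 3
count = len(list(permutations(range(n), k)))  # ordered samples, no replacement
```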
The number of permutations of length n from the n-element set D (these are called simply permutations of D) is
n! = n^{(n)} = n(n − 1) ⋯ (1) (1.8.2)
The function on N given by n ↦ n! is the factorial function. The general permutation formula in (2) can be written in terms of
factorials:
n^{(k)} = n! / (n − k)! (1.8.3)
Although this formula is succinct, it's not always a good idea numerically. If n and n − k are large, n! and (n − k)! are enormous,
and division of the first by the second can lead to significant round-off errors.
Note that the basic permutation formula in (2) is defined for every real number n and nonnegative integer k . This extension is
sometimes referred to as the generalized permutation formula. Actually, we will sometimes need an even more general formula of
this type (particularly in the sections on Pólya's urn and the beta-Bernoulli process).
1. a^{(0,k)} = a^k
2. a^{(−1,k)} = a^{(k)}
3. a^{(1,k)} = a(a + 1) ⋯ (a + k − 1)
4. 1^{(1,k)} = k!
The quantity a^{(1,k)} is called the rising power of a of order k, and is sometimes denoted a^{[k]}. Note that a^{(0,k)} is the ordinary
kth power of a.
1.8.1 https://stats.libretexts.org/@go/page/10123
Combinations
Consider again a set D with n ∈ N_+ elements. A combination of size k ∈ {0, 1, …, n} from D is an (unordered) subset of k
distinct elements of D. Thus, a combination of size k from D has the form {x_1, x_2, …, x_k}, where x_i ∈ D for each i and
x_i ≠ x_j for i ≠ j.
Statistically, a combination of size k from D corresponds to an unordered sample of size k chosen without replacement from the
population D. Note that for each combination of size k from D, there are k! distinct orderings of the elements of that combination.
Each of these is a permutation of length k from D. The number of combinations of size k from an n-element set is denoted by \binom{n}{k}.
Proof
The number \binom{n}{k} is called a binomial coefficient. Note that the formula makes sense for any real number n and nonnegative integer
k since this is true of the generalized permutation formula n^{(k)}. With this extension, \binom{n}{k} is called the generalized binomial
coefficient. Note that if n and k are positive integers and k > n then \binom{n}{k} = 0. By convention, we will also define \binom{n}{k} = 0
if k < 0.
Algebraically, the last result is trivial. It also makes sense combinatorially: There is only one way to select a subset of D with n
elements (D itself), and there is only one way to select a subset of size 0 from D (the empty set ∅).
If n, k ∈ N with k ≤ n then
\binom{n}{k} = \binom{n}{n − k} (1.8.6)
Combinatorial Proof
The next result is one of the most famous and most important. It's known as Pascal's rule and is named for Blaise Pascal.
If n, k ∈ N_+ with k ≤ n then
\binom{n}{k} = \binom{n − 1}{k − 1} + \binom{n − 1}{k} (1.8.7)
Combinatorial Proof
Recall that the Galton board is a triangular array of pegs: the rows are numbered n = 0, 1, … and the pegs in row n are numbered
k = 0, 1, … , n. If each peg in the Galton board is replaced by the corresponding binomial coefficient, the resulting table of
numbers is known as Pascal's triangle, named again for Pascal. By (8), each interior number in Pascal's triangle is the sum of the
two numbers directly above it.
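Pascal's rule gives an immediate recursive construction of the triangle, which can be checked against a direct binomial computation. A Python sketch (the function name is ours):

```python
from math import comb

def pascal_triangle(rows):
    # Build rows 0, ..., rows-1 using Pascal's rule
    # C(n, k) = C(n-1, k-1) + C(n-1, k), with 1's on the boundary.
    triangle = [[1]]
    for n in range(1, rows):
        prev = triangle[-1]
        triangle.append([1] + [prev[k - 1] + prev[k] for k in range(1, n)] + [1])
    return triangle

t = pascal_triangle(6)
```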
The following result is the binomial theorem, and is the reason for the term binomial coefficient.
If a, b ∈ R and n ∈ N then
$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$ (1.8.8)
1.8.2 https://stats.libretexts.org/@go/page/10123
Combinatorial Proof
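The binomial theorem is easy to sanity-check numerically; here is a short Python sketch (the helper name binomial_expansion is ours) comparing the right-hand side of (1.8.8) to direct exponentiation:

```python
from math import comb

def binomial_expansion(a, b, n):
    """Right-hand side of the binomial theorem (1.8.8)."""
    return sum(comb(n, k) * a**k * b**(n - k) for k in range(n + 1))

# Agrees with direct computation of (a + b)^n.
assert binomial_expansion(3, 5, 7) == (3 + 5)**7
```
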
If j, k, n ∈ N+ with j ≤ k ≤ n then
$k^{(j)} \binom{n}{k} = n^{(j)} \binom{n-j}{k-j}$ (1.8.9)
Combinatorial Proof
The following result is known as Vandermonde's identity, named for Alexandre-Théophile Vandermonde.
If m, n, k ∈ N with k ≤ m + n, then
$\sum_{j=0}^{k} \binom{m}{j} \binom{n}{k-j} = \binom{m+n}{k}$ (1.8.10)
Combinatorial Proof
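Vandermonde's identity can likewise be checked by direct computation; here is a Python sketch (the helper name vandermonde_lhs is ours), relying on the fact that math.comb returns 0 when k > n, matching the convention $\binom{n}{k} = 0$ above:

```python
from math import comb

def vandermonde_lhs(m, n, k):
    """Left-hand side of Vandermonde's identity (1.8.10)."""
    return sum(comb(m, j) * comb(n, k - j) for j in range(k + 1))

# comb(n, k - j) is 0 whenever k - j > n, so out-of-range terms vanish.
assert vandermonde_lhs(6, 4, 5) == comb(10, 5)
```
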
The next result is a general identity for the sum of binomial coefficients.
If m, n ∈ N with n ≤ m then
$\sum_{j=n}^{m} \binom{j}{n} = \binom{m+1}{n+1}$ (1.8.11)
Combinatorial Proof
For an even more general version of the last result, see the section on Order Statistics in the chapter on Finite Sampling Models.
The following identity for the sum of the first m positive integers is a special case of the last result.
If m ∈ N then
$\sum_{j=1}^{m} j = \binom{m+1}{2} = \frac{(m+1)m}{2}$ (1.8.12)
Proof
There is a one-to-one correspondence between each pair of the following collections. Hence the number of objects in each of these collections is $\binom{n}{k}$.
The following identity is known as the alternating sum identity for binomial coefficients. It turns out to be useful in the Irwin-Hall
probability distribution. We give the identity in two equivalent forms, one for falling powers and one for ordinary powers.
If n ∈ N+ and j ∈ {0, 1, …, n − 1} then
1. $\sum_{k=0}^{n} \binom{n}{k} (-1)^k k^{(j)} = 0$ (1.8.13)
2. $\sum_{k=0}^{n} \binom{n}{k} (-1)^k k^j = 0$ (1.8.14)
Proof
If n, k ∈ N then
$\binom{-n}{k} = (-1)^k \binom{n+k-1}{k}$ (1.8.16)
Proof
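The generalized binomial coefficient, and identity (1.8.16) in particular, can be checked with exact rational arithmetic; here is a Python sketch (the helper name gbinom is ours):

```python
from fractions import Fraction
from math import comb, factorial, prod

def gbinom(a, k):
    """Generalized binomial coefficient a^(k) / k! for rational a and k in N."""
    num = prod((Fraction(a) - i for i in range(k)), start=Fraction(1))
    return num / factorial(k)

# Identity (1.8.16): binom(-n, k) = (-1)^k binom(n + k - 1, k).
assert all(gbinom(-4, k) == (-1)**k * comb(4 + k - 1, k) for k in range(8))
```
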
In particular, $\binom{-1}{k} = (-1)^k$. Our last result in this discussion concerns the binomial operator and its inverse.
The binomial operator takes a sequence of real numbers a = (a_0, a_1, a_2, …) and returns the sequence of real numbers b = (b_0, b_1, b_2, …) by means of the formula
$b_n = \sum_{k=0}^{n} \binom{n}{k} a_k$, n ∈ N (1.8.19)
The inverse binomial operator recovers the sequence a from the sequence b by means of the formula
$a_n = \sum_{k=0}^{n} (-1)^{n-k} \binom{n}{k} b_k$, n ∈ N (1.8.20)
Proof
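The binomial operator and its inverse can be implemented directly from formulas (1.8.19) and (1.8.20); the following Python sketch (the function names are ours) verifies the round trip on a sample sequence:

```python
from math import comb

def binomial_op(a):
    """b_n = sum_k binom(n, k) a_k, formula (1.8.19)."""
    return [sum(comb(n, k) * a[k] for k in range(n + 1)) for n in range(len(a))]

def inverse_binomial_op(b):
    """a_n = sum_k (-1)^(n-k) binom(n, k) b_k, formula (1.8.20)."""
    return [sum((-1)**(n - k) * comb(n, k) * b[k] for k in range(n + 1))
            for n in range(len(b))]

# The inverse operator recovers the original sequence.
a = [3, 1, 4, 1, 5, 9, 2, 6]
assert inverse_binomial_op(binomial_op(a)) == a
```
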
Samples
The experiment of drawing a sample from a population is basic and important. There are two essential attributes of samples:
whether or not order is important, and whether or not a sampled object is replaced in the population before the next draw. Suppose
now that the population D contains n objects and we are interested in drawing a sample of k objects. Let's review what we know
so far:
If order is important and sampled objects are replaced, then the samples are just elements of the product set D^k. Hence, the number of samples is n^k.
If order is important and sampled objects are not replaced, then the samples are just permutations of size k chosen from D. Hence the number of samples is n^(k).
If order is not important and sampled objects are not replaced, then the samples are just combinations of size k chosen from D. Hence the number of samples is $\binom{n}{k}$.
If order is not important and sampled objects are replaced, then the samples correspond to multisets of size k chosen from D. The number of samples in this case is $\binom{n+k-1}{k}$.
Proof
Summarizing, the number of samples of size k chosen from a population D with n objects, in the four cases:
With replacement, ordered: n^k
With replacement, unordered: $\binom{n+k-1}{k}$
Without replacement, ordered: n^(k)
Without replacement, unordered: $\binom{n}{k}$
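All four entries of the sampling table can be verified by brute-force enumeration for a small population; here is a Python sketch using the standard itertools generators (the population D below is our own example):

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm

D = ["a", "b", "c", "d"]   # population of n = 4 objects
n, k = len(D), 2

# Ordered, with replacement: elements of D^k.
assert len(list(product(D, repeat=k))) == n**k
# Ordered, without replacement: permutations of size k.
assert len(list(permutations(D, k))) == perm(n, k)
# Unordered, without replacement: combinations of size k.
assert len(list(combinations(D, k))) == comb(n, k)
# Unordered, with replacement: multisets of size k.
assert len(list(combinations_with_replacement(D, k))) == comb(n + k - 1, k)
```
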
Multinomial Coefficients
Partitions of a Set
Recall that the binomial coefficient $\binom{n}{j}$ is the number of subsets of size j from a set S of n elements. Note also that when we select a subset A of size j from S, we effectively partition S into two disjoint subsets of sizes j and n − j, namely A and A^c. A natural generalization is to partition S into a union of k distinct, pairwise disjoint subsets (A_1, A_2, …, A_k) where #(A_i) = n_i for each i.
The number of ways to partition a set of n elements into a sequence of k sets of sizes (n_1, n_2, …, n_k) is
$\binom{n}{n_1} \binom{n - n_1}{n_2} \cdots \binom{n - n_1 - \cdots - n_{k-1}}{n_k} = \frac{n!}{n_1! \, n_2! \cdots n_k!}$ (1.8.22)
Proof
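Formula (1.8.22) can be checked against direct enumeration; here is a Python sketch (the helper name multinomial is ours), counting the distinct orderings of a word with repeated letters:

```python
from itertools import permutations
from math import factorial

def multinomial(n, sizes):
    """n! / (n_1! n_2! ... n_k!), formula (1.8.22); assumes sum(sizes) == n."""
    out = factorial(n)
    for m in sizes:
        out //= factorial(m)
    return out

# Distinct arrangements of "aabbb": a 5-element set partitioned into
# blocks of sizes 2 and 3.
assert multinomial(5, (2, 3)) == len(set(permutations("aabbb")))
```
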
If n, k ∈ N with k ≤ n then
$\binom{n}{k, \, n-k} = \binom{n}{k}$ (1.8.24)
Combinatorial Proof
Sequences
Consider now the set T = {1, 2, …, k}^n. Elements of this set are sequences of length n in which each coordinate is one of k values. Thus, these sequences generalize the bit strings of length n. Again, let (n_1, n_2, …, n_k) be a sequence of nonnegative integers with $\sum_{i=1}^{k} n_i = n$. The number of such sequences in which i occurs n_i times for each i is the multinomial coefficient
$\binom{n}{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1! \, n_2! \cdots n_k!}$
Proof
objects of a given type are considered identical. There is a one-to-one correspondence between the following collections:
1. Sequences in {1, 2, …, k}^n in which j occurs n_j times for each j ∈ {1, 2, …, k}.
If x_1, x_2, …, x_k ∈ R and n ∈ N then
$(x_1 + x_2 + \cdots + x_k)^n = \sum \binom{n}{n_1, n_2, \ldots, n_k} x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k}$ (1.8.27)
where the sum is over all sequences of nonnegative integers (n_1, n_2, …, n_k) with n_1 + n_2 + ⋯ + n_k = n.
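The multinomial theorem can be checked numerically by enumerating all exponent sequences; here is a Python sketch (the helper name multinomial_expansion is ours):

```python
from math import factorial

def multinomial_expansion(xs, n):
    """Right-hand side of the multinomial theorem (1.8.27) for k = len(xs)."""
    def exponent_tuples(remaining, vars_left):
        # All tuples of vars_left nonnegative integers summing to remaining.
        if vars_left == 1:
            yield (remaining,)
            return
        for m in range(remaining + 1):
            for rest in exponent_tuples(remaining - m, vars_left - 1):
                yield (m,) + rest

    total = 0
    for exps in exponent_tuples(n, len(xs)):
        coef = factorial(n)
        for m in exps:
            coef //= factorial(m)
        term = coef
        for x, m in zip(xs, exps):
            term *= x**m
        total += term
    return total

# Agrees with direct computation of (x_1 + ... + x_k)^n.
assert multinomial_expansion([1, 2, 3], 4) == (1 + 2 + 3)**4
```
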
Computational Exercises
Arrangements
In a race with 10 horses, the first, second, and third place finishers are noted. How many outcomes are there?
Answer
Eight persons, consisting of four male-female couples, are to be seated in a row of eight chairs. How many seating
arrangements are there in each of the following cases:
1. There are no other restrictions.
2. The men must sit together and the women must sit together.
3. The men must sit together.
4. Each couple must sit together.
Answer
Suppose that n people are to be seated at a round table. How many seating arrangements are there? The mathematical
significance of a round table is that there is no dedicated first chair.
Answer
Twelve books, consisting of 5 math books, 4 science books, and 3 history books are arranged on a bookshelf. Find the number
of arrangements in each of the following cases:
1. There are no restrictions.
2. The books of each type must be together.
3. The math books must be together.
Answer
Find the number of distinct arrangements of the letters in each of the following words:
1. statistics
2. probability
3. mississippi
4. tennessee
5. alabama
Answer
A child has 12 blocks; 5 are red, 4 are green, and 3 are blue. In how many ways can the blocks be arranged in a line if blocks of
a given color are considered identical?
Answer
Code Words
A license tag consists of 2 capital letters and 5 digits. Find the number of tags in each of the following cases:
1. There are no other restrictions
2. The letters and digits are all different.
Answer
Committees
A club has 20 members; 12 are women and 8 are men. A committee of 6 members is to be chosen. Find the number of different
committees in each of the following cases:
1. There are no other restrictions.
2. The committee must have 4 women and 2 men.
3. The committee must have at least 2 women and at least 2 men.
Answer
Suppose that a club with 20 members plans to form 3 distinct committees with 6, 5, and 4 members, respectively. In how many ways can this be done?
Answer
Cards
A standard card deck can be modeled by the Cartesian product set
D = {1, 2, …, 10, j, q, k} × {♣, ♦, ♥, ♠}
where the first coordinate encodes the denomination or kind (ace, 2-10, jack, queen, king) and where the second coordinate encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example q♡).
A poker hand (in draw poker) consists of 5 cards dealt without replacement and without regard to order from a deck of 52
cards. Find the number of poker hands in each of the following cases:
1. There are no restrictions.
2. The hand is a full house (3 cards of one kind and 2 of another kind).
3. The hand has 4 of a kind.
4. The cards are all in the same suit (so the hand is a flush or a straight flush).
Answer
A bridge hand consists of 13 cards dealt without replacement and without regard to order from a deck of 52 cards. Find the
number of bridge hands in each of the following cases:
1. There are no restrictions.
2. The hand has exactly 4 spades.
3. The hand has exactly 4 spades and 3 hearts.
4. The hand has exactly 4 spades, 3 hearts, and 2 diamonds.
Answer
A hand of cards that has no cards in a particular suit is said to be void in that suit. Use the inclusion-exclusion formula to find
each of the following:
1. The number of poker hands that are void in at least one suit.
2. The number of bridge hands that are void in at least one suit.
Answer
A bridge hand that has no honor cards (cards of denomination 10, jack, queen, king, or ace) is said to be a Yarborough, in
honor of the Second Earl of Yarborough. Find the number of Yarboroughs.
Answer
A bridge deal consists of dealing 13 cards (a bridge hand) to each of 4 distinct players (generically referred to as north, south,
east, and west) from a standard deck of 52 cards. Find the number of bridge deals.
Answer
This staggering number is about the same order of magnitude as the number of atoms in your body, and is one of the reasons that
bridge is a rich and interesting game.
This number is even more staggering. Indeed if you perform the experiment of dealing all 52 cards from a well-shuffled deck, you
may well generate a pattern of cards that has never been generated before, thereby ensuring your immortality. Actually, this
experiment shows that, in a sense, rare events can be very common. By the way, Persi Diaconis has shown that it takes about seven
standard riffle shuffles to thoroughly randomize a deck of cards.
Suppose that 5 identical, standard dice are rolled. How many outcomes are there?
Answer
A coin is tossed 10 times and the outcome is recorded as a bit string (where 1 denotes heads and 0 tails).
1. Find the number of outcomes.
2. Find the number of outcomes with exactly 4 heads.
3. Find the number of outcomes with at least 8 heads.
Answer
Polynomial Coefficients
Samples
A shipment contains 12 good and 8 defective items. A sample of 5 items is selected. Find the number of samples that contain
exactly 3 good items.
Answer
In the (n, k) lottery, k numbers are chosen without replacement from the set of integers from 1 to n (where n, k ∈ N+ and
k < n ). Order does not matter.
For more on this topic, see the section on Lotteries in the chapter on Games of Chance.
Explicitly compute each formula in the sampling table above when n = 10 and k = 4 .
Answer
Greetings
Suppose there are n people who shake hands with each other. How many handshakes are there?
Answer
There are m men and n women. The men shake hands with each other; the women hug each other; and each man bows to each
woman.
1. How many handshakes are there?
2. How many hugs are there?
3. How many bows are there?
4. How many greetings are there?
Answer
Integer Solutions
Find the number of integer solutions of x1 + x2 + x3 = 10 in each of the following cases:
1. x_i ≥ 0 for each i.
2. x_i > 0 for each i.
Answer
Generalized Coefficients
Compute each of the following:
2. (1/2)^(4)
3. (−1/3)^(5)
Answer
Compute each of the following:
2. $\binom{-5}{4}$
3. $\binom{-1/3}{5}$
Answer
Birthdays
Suppose that n persons are selected and their birthdays noted. (Ignore leap years, so that a year has 365 days.)
1. Find the number of outcomes.
2. Find the number of outcomes with distinct birthdays.
Answer
Chess
Note that the squares of a chessboard are distinct, and in fact are often identified with the Cartesian product set {1, 2, …, 8} × {1, 2, …, 8}.
Find the number of ways of placing 8 rooks on a chessboard so that no rook can capture another in each of the following cases.
1. The rooks are distinguishable.
2. The rooks are indistinguishable.
Answer
Gifts
Suppose that 20 identical candies are distributed to 4 children. Find the number of distributions in each of the following cases:
1. There are no restrictions.
2. Each child must get at least one candy.
Answer
In the song The Twelve Days of Christmas, find the number of gifts given to the singer by her true love. (Note that the singer
starts afresh with gifts each day, so that for example, the true love gets a new partridge in a pear tree each of the 12 days.)
Answer
Teams
Suppose that 10 kids are divided into two teams of 5 each for a game of basketball. In how many ways can this be done in each
of the following cases:
1. The teams are distinguishable (for example, one team is labeled “Alabama” and the other team is labeled “Auburn”).
2. The teams are not distinguishable.
Answer
This page titled 1.8: Combinatorial Structures is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
1.9: Topological Spaces
Topology is one of the major branches of mathematics, along with other such branches as algebra (in the broad sense of algebraic
structures), and analysis. Topology deals with spatial concepts involving distance, closeness, separation, convergence, and
continuity. Needless to say, entire series of books have been written about the subject. Our goal in this section and the next is
simply to review the basic definitions and concepts of topology that we will need for our study of probability and stochastic
processes. You may want to refer to this section as needed.
Basic Theory
Definitions
A topological space consists of a nonempty set S and a collection S of subsets of S that satisfy the following properties:
1. S ∈ S and ∅ ∈ S
2. If A ⊆ S then ⋃ A ∈ S
3. If A ⊆ S and A is finite, then ⋂ A ∈ S
If A ∈ S, then A is said to be open and A^c is said to be closed. The collection S of open sets is a topology on S.
So the union of an arbitrary number of open sets is still open, as is the intersection of a finite number of open sets. The universal set
S and the empty set ∅ are both open and closed. There may or may not exist other subsets of S with this property.
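For a finite set, the topology axioms can be checked mechanically; here is a Python sketch (the helper name is_topology is ours), using the fact that for a finite collection, closure under pairwise unions and intersections implies closure under arbitrary unions and finite intersections:

```python
from itertools import combinations

def is_topology(S, opens):
    """Check the three topology axioms on a finite set S.
    Sets are represented as frozensets; opens is a collection of frozensets."""
    opens = set(opens)
    if frozenset(S) not in opens or frozenset() not in opens:
        return False
    # On a finite collection it suffices to check all pairwise unions
    # and pairwise intersections.
    for A, B in combinations(opens, 2):
        if A | B not in opens or A & B not in opens:
            return False
    return True

S = {1, 2, 3}
good = [frozenset(), frozenset({1}), frozenset({1, 2}), frozenset(S)]
bad = [frozenset(), frozenset({1}), frozenset({2}), frozenset(S)]  # missing {1, 2}
assert is_topology(S, good)
assert not is_topology(S, bad)
```
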
Suppose that S is a nonempty set, and that S and T are topologies on S . If S ⊆ T then T is finer than S , and S is
coarser than T . Coarser than defines a partial order on the collection of topologies on S . That is, if R, S , T are topologies
on S then
1. R is coarser than R , the reflexive property.
2. If R is coarser than S and S is coarser than R then R = S , the anti-symmetric property.
3. If R is coarser than S and S is coarser than T then R is coarser than T , the transitive property.
A topology can be characterized just as easily by means of closed sets as open sets.
Suppose that S is a nonempty set. A collection of subsets C is the collection of closed sets for a topology on S if and only if
1. S ∈ C and ∅ ∈ C
2. If A ⊆ C then ⋂ A ∈ C .
3. If A ⊆ C and A is finite, then ⋃ A ∈ C .
Proof
Suppose that (S, S ) is a topological space, and that x ∈ S . A set A ⊆ S is a neighborhood of x if there exists U ∈ S with
x ∈ U ⊆A .
So a neighborhood of a point x ∈ S is simply a set with an open subset that contains x. The idea is that points in a “small”
neighborhood of x are “close” to x in a sense. An open set can be defined in terms of the neighborhoods of the points in the set.
Suppose again that (S, S ) is a topological space. A set U ⊆S is open if and only if U is a neighborhood of every x ∈ U
Proof
Although the proof seems trivial, the neighborhood concept is how you should think of openness. A set U is open if every point in
U has a set of “nearby points” that are also in U .
Our next three definitions deal with topological sets that are naturally associated with a given subset.
Suppose again that (S, S ) is a topological space and that A ⊆ S . The closure of A is the set cl(A) = ⋂{B ⊆ S : B is closed and A ⊆ B}.
1.9.1 https://stats.libretexts.org/@go/page/10124
This is the smallest closed set containing A :
1. cl(A) is closed.
2. A ⊆ cl(A) .
3. If B is closed and A ⊆ B then cl(A) ⊆ B
Proof
Of course, if A is closed then A = cl(A) . Complementary to the closure of a set is the interior of the set.
Suppose again that (S, S ) is a topological space and that A ⊆ S . The interior of A is the set int(A) = ⋃{U ⊆ S : U is open and U ⊆ A}. This is the largest open set contained in A.
Proof
Of course, if A is open then A = int(A) . The boundary of a set is the set difference between the closure and the interior.
Suppose again that (S, S ) is a topological space. The boundary of A is ∂(A) = cl(A) ∖ int(A) . This set is closed.
Proof
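In a finite topological space, the closure, interior, and boundary can be computed directly from their definitions; here is a Python sketch (the helper names are ours), continuing the frozenset representation:

```python
def closure(A, S, opens):
    """Smallest closed set containing A: intersect all closed supersets of A."""
    closed_sets = [frozenset(S) - U for U in opens]
    out = frozenset(S)
    for C in closed_sets:
        if A <= C:
            out &= C
    return out

def interior(A, opens):
    """Largest open subset of A: union of all open subsets of A."""
    out = frozenset()
    for U in opens:
        if U <= A:
            out |= U
    return out

S = {1, 2, 3}
opens = [frozenset(), frozenset({1}), frozenset({1, 2}), frozenset(S)]
A = frozenset({2})
assert closure(A, S, opens) == frozenset({2, 3})
assert interior(A, opens) == frozenset()
# The boundary is the closure minus the interior.
assert closure(A, S, opens) - interior(A, opens) == frozenset({2, 3})
```
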
Suppose that (S, S ) is a topological space and that R is a nonempty subset of S . Then R = {A ∩ R : A ∈ S } is a topology
on R , known as the relative topology induced by S .
Proof
In the context of the previous result, note that if R is itself open, then the relative topology is R = {A ∈ S : A ⊆ R} , the subsets
of R that are open in the original topology.
Separation Properties
Separation properties refer to the ability to separate points or sets with disjoint open sets. Our first definition deals with separating
two points.
Suppose that (S, S ) is a topological space and that x, y are distinct points in S . Then x and y can be separated if there exist
disjoint open sets U and V with x ∈ U and y ∈ V . If every pair of distinct points in S can be separated, then (S, S ) is called
a Hausdorff space.
Hausdorff spaces are named for the German mathematician Felix Hausdorff. There are weaker separation properties. For example,
there could be an open set U that contains x but not y , and an open set V that contains y but not x, but no disjoint open sets that
contain x and y . Clearly if every open set that contains one of the points also contains the other, then the points are
indistinguishable from a topological viewpoint. In a Hausdorff space, singletons are closed.
Suppose that (S, S ) is a Hausdorff space. Then {x} is closed for each x ∈ S .
Proof
Our next definition deals with separating a point from a closed set.
Suppose again that (S, S ) is a topological space. A nonempty closed set A ⊆ S and a point x ∈ A^c can be separated if there exist disjoint open sets U and V with A ⊆ U and x ∈ V . If every nonempty closed set A and every point x ∈ A^c can be separated, then (S, S ) is a regular space.
Clearly if (S, S ) is a regular space and singleton sets are closed, then (S, S ) is a Hausdorff space.
Bases
Topologies, like other set structures, are often defined by first giving some basic sets that should belong to the collection, and the
extending the collection so that the defining axioms are satisfied. This idea is motivation for the following definition:
Suppose again that (S, S ) is a topological space. A collection B ⊆ S is a base for S if every set in S can be written as a
union of sets in B .
So, a base is a smaller collection of open sets with the property that every other open set can be written as a union of basic open
sets. But again, we often want to start with the basic open sets and extend this collection to a topology. The following theorem
gives the conditions under which this can be done.
Suppose that S is a nonempty set. A collection B of subsets of S is a base for a topology on S if and only if
1. S = ⋃ B
2. If A, B ∈ B and x ∈ A ∩ B , there exists C ∈ B with x ∈ C ⊆ A∩B
Proof
Here is a slightly weaker condition, but one that is often satisfied in practice.
Suppose that S is a nonempty set. A collection B of subsets of S that satisfies the following properties is a base for a topology
on S :
1. S = ⋃ B
2. If A, B ∈ B then A ∩ B ∈ B
Compactness
Our next discussion considers another very important type of set. Some additional terminology will make the discussion easier.
Suppose that S is a set and A ⊆ S . A collection 𝒜 of subsets of S is said to cover A if A ⊆ ⋃ 𝒜. So the word cover simply means a collection of sets whose union contains a given set. In a topological space, we can have an open cover (that is, a cover with open sets), a closed cover (that is, a cover with closed sets), and so forth.
Suppose again that (S, S ) is a topological space. A set C ⊆ S is compact if every open cover of C has a finite sub-cover.
That is, if A ⊆ S with C ⊆ ⋃ A then there exists a finite B ⊆ A with C ⊆ ⋃ B .
So intuitively, a compact set is compact in the ordinary sense of the word. No matter how “small” are the open sets in the covering
of C , there will always exist a finite number of the open sets that cover C .
Suppose again that (S, S ) is a topological space and that C ⊆ S is compact. If B ⊆ C is closed, then B is also compact.
Proof
Suppose again that (S, S ) is a topological space, and that C_i ⊆ S is compact for each i in a finite index set I . Then C = ⋃_{i∈I} C_i is compact.
Proof
As we saw above, closed subsets of a compact set are themselves compact. In a Hausdorff space, a compact set is itself closed.
Also in a Hausdorff space, a point can be separated from a compact set that does not contain the point.
Suppose that (S, S ) is a Hausdorff space. If x ∈ S , C ⊆S is compact, and x ∉ C , then there exist disjoint open sets U and
V with x ∈ U and C ⊆ V
Proof
In a Hausdorff space, if a point has a neighborhood with a compact boundary, then there is a smaller, closed neighborhood.
Suppose again that (S, S ) is a Hausdorff space. If x ∈ S and A is a neighborhood of x with ∂(A) compact, then there exists a
closed neighborhood B of x with B ⊆ A .
Proof
Generally, local properties in a topological space refer to properties that hold on the neighborhoods of a point x ∈ S .
A topological space (S, S ) is locally compact if every point x ∈ S has a compact neighborhood.
This definition is important because many of the topological spaces that occur in applications (like probability) are not compact,
but are locally compact. Locally compact Hausdorff spaces have a number of nice properties. In particular, in a locally compact
Hausdorff space, there are arbitrarily “small” compact neighborhoods of a point.
Suppose that (S, S ) is a locally compact Hausdorff space. If x ∈ S and A is a neighborhood of x, then there exists a compact
neighborhood B of x with B ⊆ A .
Proof
Countability Axioms
Our next discussion concerns topologies that can be “countably constructed” in a certain sense. Such axioms limit the “size” of the
topology in a way, and are often satisfied by important topological spaces that occur in applications. We start with an important
preliminary definition.
Suppose that (S, S ) is a topological space. A set D ⊆ S is dense if U ∩ D is nonempty for every nonempty U ∈ S .
Equivalently, D is dense if every neighborhood of a point x ∈ S contains an element of D. So in this sense, one can find elements
of D “arbitrarily close” to a point x ∈ S . Of course, the entire space S is dense, but we are usually interested in topological spaces
that have dense sets of limited cardinality.
Suppose again that (S, S ) is a topological space. A set D ⊆ S is dense if and only if cl(D) = S .
Proof
A topological space (S, S ) is separable if there exists a countable dense subset D ⊆ S .
So in a separable space, there is a countable set D with the property that there are points in D “arbitrarily close” to every x ∈ S . Unfortunately, the term separable is similar to separating points that we discussed above in the definition of a Hausdorff space. But clearly the concepts are very different. Here is another important countability axiom.
A topological space (S, S ) is second countable if it has a countable base.
So in a second countable space, there is a countable collection of open sets B with the property that every other open set is a union of sets in B . Here is how the two properties are related: a second countable space is separable.
As the terminology suggests, there are other axioms of countability (such as first countable), but the two we have discussed are the most important.
Connectedness
A topological space (S, S ) is disconnected if there exist nonempty, disjoint, open sets U and V with S = U ∪ V . If (S, S ) is
not disconnected, then it is connected.
Since U = V^c, it follows that U and V are also closed. So the space is disconnected if and only if there exists a nonempty, proper subset U
that is open and closed (sadly, such sets are sometimes called clopen). If S is disconnected, then S consists of two pieces U and V ,
and the points in U are not “close” to the points in V , in a sense. To study S topologically, we could simply study U and V
separately, with their relative topologies.
Convergence
There is a natural definition for a convergent sequence in a topological space, but the concept is not as useful as one might expect.
Suppose again that (S, S ) is a topological space. A sequence of points (x_n : n ∈ N+) in S converges to x ∈ S if for every neighborhood A of x there exists m ∈ N+ such that x_n ∈ A for n > m . We write x_n → x as n → ∞ .
So for every neighborhood of x, regardless of how “small”, all but finitely many of the terms of the sequence will be in the
neighborhood. One would naturally hope that limits, when they exist, are unique, but this will only be the case if points in the space
can be separated.
Suppose that (S, S ) is a Hausdorff space. If (x_n : n ∈ N+) is a sequence of points in S with x_n → x ∈ S as n → ∞ and x_n → y ∈ S as n → ∞ , then x = y .
Proof
On the other hand, if distinct points x, y ∈ S cannot be separated, then any sequence that converges to x will also converge to y .
Continuity
Continuity of functions is one of the most important concepts to come out of general topology. The idea, of course, is that if two
points are close together in the domain, then the functional values should be close together in the range. The abstract topological
definition, based on inverse images, is very simple, but not very intuitive at first.
Suppose that (S, S ) and (T , T ) are topological spaces. A function f : S → T is continuous if f⁻¹(A) ∈ S for every A ∈ T .
So a continuous function has the property that the inverse image of an open set (in the range space) is also open (in the domain
space). Continuity can equivalently be expressed in terms of closed subsets.
Suppose again that (S, S ) and (T , T ) are topological spaces. A function f : S → T is continuous if and only if f⁻¹(A) is a closed subset of S for every closed subset A of T .
Proof
Suppose again that (S, S ) and (T , T ) are topological spaces, and that f : S → T is continuous. If (x_n : n ∈ N+) is a sequence of points in S with x_n → x ∈ S as n → ∞ , then f(x_n) → f(x) as n → ∞ .
Proof
The converse of the last result is not true, so continuity of functions in a general topological space cannot be characterized in terms
of convergent sequences. There are objects like sequences but more general, known as nets, that do characterize continuity, but we
will not study these. Composition, the most important way to combine functions, preserves continuity.
Suppose that (S, S ), (T , T ), and (U , U ) are topological spaces. If f : S → T and g : T → U are continuous, then
g∘ f : S → U is continuous.
Proof
The next definition is very important. A recurring theme in mathematics is to recognize when two mathematical structures of a
certain type are fundamentally the same, even though they may appear to be different.
Suppose again that (S, S ) and (T , T ) are topological spaces. A one-to-one function f that maps S onto T with both f and f⁻¹ continuous is a homeomorphism from (S, S ) to (T , T ). When such a function exists, the topological spaces are said to be homeomorphic.
Note that in this definition, f⁻¹ refers to the inverse function, not the mapping of inverse images. If f is a homeomorphism, then A is open in S if and only if f(A) is open in T . It follows that the topological spaces are essentially equivalent: any purely topological property can be characterized in terms of open sets and therefore any such property is shared by the two spaces.
Being homeomorphic is an equivalence relation on the collection of topological spaces. That is, for spaces (S, S ), (T , T ), and (U , U ):
1. (S, S ) is homeomorphic to (S, S ) (the reflexive property).
2. If (S, S ) is homeomorphic to (T , T ) then (T , T ) is homeomorphic to (S, S ) (the symmetric property).
3. If (S, S ) is homeomorphic to (T , T ) and (T , T ) is homeomorphic to (U , U ) then (S, S ) is homeomorphic to (U , U )
(the transitive property).
Proof
Continuity can also be defined locally, by restricting attention to the neighborhoods of a point.
Suppose again that (S, S ) and (T , T ) are topological spaces, and that x ∈ S . A function f : S → T is continuous at x if f⁻¹(B) is a neighborhood of x in S whenever B is a neighborhood of f(x) in T . If A ⊆ S , then f is continuous on A if f is continuous at each x ∈ A .
Suppose again that (S, S ) and (T , T ) are topological spaces, and that f : S → T . Then f is continuous if and only if f is
continuous at each x ∈ S .
Proof
Properties that are defined for a topological space can be applied to a subset of the space, with the relative topology. But one has to
be careful.
Suppose again that (S, S ) and (T , T ) are topological spaces and that f : S → T . Suppose also that A ⊆ S , let A denote the relative topology on A induced by S , and let f_A denote the restriction of f to A . If f is continuous on A then f_A is continuous relative to the spaces (A, A ) and (T , T ). The converse is not generally true.
Proof
Product Spaces
Cartesian product sets are ubiquitous in mathematics, so a natural question is this: given topological spaces (S, S ) and (T , T ) ,
what is a natural topology for S × T ? The answer is very simple using the concept of a base above.
Suppose that (S, S ) and (T , T ) are topological spaces. The collection B = {A × B : A ∈ S , B ∈ T } is a base for a
topology on S × T , called the product topology associated with the given spaces.
Proof
So basically, we want the product of open sets to be open in the product space. The product topology is the smallest topology that
makes this happen. The definition above can be extended to very general product spaces, but to state the extension, let's recall how
general product sets are constructed. Suppose that S_i is a set for each i in a nonempty index set I . Then the product set ∏_{i∈I} S_i is the set of all functions x : I → ⋃_{i∈I} S_i such that x(i) ∈ S_i for each i ∈ I .
Suppose that (S_i, S_i) is a topological space for each i in a nonempty index set I . Then the collection of product sets ∏_{i∈I} A_i , where A_i is open in S_i for each i ∈ I and A_i = S_i for all but finitely many i ∈ I , is a base for a topology on ∏_{i∈I} S_i , known as the product topology associated with the given spaces.
Suppose again that S_i is a set for each i in a nonempty index set I . For j ∈ I , recall that the projection function p_j : ∏_{i∈I} S_i → S_j is defined by p_j(x) = x(j) . The product topology is the smallest topology on ∏_{i∈I} S_i that makes every projection function continuous.
Proof
As a special case of all this, suppose that (S, S ) is a topological space, and that S_i = S for all i ∈ I . Then the product space ∏_{i∈I} S_i is the set of all functions from I to S , sometimes denoted S^I . In this case, the basic open sets for the product topology on S^I have the form {f ∈ S^I : f(i) ∈ A_i for all i ∈ J} , where J ⊆ I is finite and A_i is open in S for each i ∈ J .
Suppose that S is a nonempty set. The collection {S, ∅} is a topology on S , known as the trivial topology.
With the trivial topology, no two distinct points can be separated. So the topology cannot distinguish between points, in a sense,
and all points in S are close to each other. Clearly, this topology is not very interesting, except as a place to start. Since there is only
one nonempty open set (S itself), the space is connected, and every subset of S is compact. A sequence in S converges to every
point in S .
Suppose that S has the trivial topology and that (T , T ) is another topological space.
1. Every function from T to S is continuous.
2. If (T , T ) is a Hausdorff space then the only continuous functions from S to T are constant functions.
Proof
Suppose that S is a nonempty set. The power set P(S) (consisting of all subsets of S ) is a topology, known as the discrete
topology.
So in the discrete topology, every set is both open and closed. All points are separated, and in a sense, widely so. No point is close
to another point. With the discrete topology, S is Hausdorff, disconnected, and the compact subsets are the finite subsets. A
sequence in S converges to x ∈ S , if and only if all but finitely many terms of the sequence are x.
Suppose that S has the discrete topology and that (T , T ) is another topological space.
1. Every function from S to T is continuous.
2. If (T , T ) is connected, then the only continuous functions from T to S are constant functions.
Proof
Euclidean Spaces
The standard topologies used in the Euclidean spaces are the topologies built from the open sets that you are familiar with.
For the set of real numbers R, let B = {(a, b) : a, b ∈ R, a < b} , the collection of open intervals. Then B is a base for a
topology R on R, known as the Euclidean topology.
Proof
The space (R, R) satisfies many properties that are motivations for definitions in topology in the first place. The convergence of a
sequence in R, in the topological sense given above, is the same as the definition of convergence in calculus. The same statement
holds for the continuity of a function f from R to R.
Before listing other topological properties, we give a characterization of compact sets, known as the Heine-Borel theorem, named for Eduard Heine and Émile Borel. Recall that A ⊆ R is bounded if A ⊆ [a, b] for some a, b ∈ R with a < b .
A subset C ⊆ R is compact if and only if C is closed and bounded.
So in particular, closed, bounded intervals of the form [a, b] with a, b ∈ R and a < b are compact.
As noted in the proof, Q, the set of rationals, is countable and is dense in R. Another countable, dense subset is D = {j/2^n : n ∈ N and j ∈ Z}, the set of dyadic rationals (or binary rationals). For the higher-dimensional Euclidean spaces, we can use the product topology based on the topology of the real numbers.
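As a quick computational illustration (not part of the formal development), the density of the dyadic rationals can be checked numerically: the dyadic rational j/2^n with j = ⌊x 2^n⌋ is within 2^−n of x. The function name below is our own.

```python
from fractions import Fraction
import math

def dyadic_approximation(x, n):
    """The dyadic rational j/2^n with j = floor(x * 2^n); it lies
    within 2^(-n) of x, so the dyadic rationals are dense in R."""
    j = math.floor(x * 2 ** n)
    return Fraction(j, 2 ** n)

# approximate pi by dyadic rationals of increasing resolution
for n in [1, 5, 10, 20]:
    d = dyadic_approximation(math.pi, n)
    assert 0 <= math.pi - float(d) < 2 ** (-n)
```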
A subset A ⊆ R^n is bounded if there exist a, b ∈ R with a < b such that A ⊆ [a, b]^n, so that A fits inside of an n-dimensional “block”.
A subset C ⊆ R^n is compact if and only if C is closed and bounded.
The space (R^n, R_n) has the following properties:
1. Hausdorff.
2. Connected.
3. Locally compact.
4. Second countable.
This page titled 1.9: Topological Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
1.10: Metric Spaces
Basic Theory
Most of the important topological spaces that occur in applications (like probability) have an additional structure that gives a distance
between points in the space.
Definitions
A metric space consists of a nonempty set S and a function d : S × S → [0, ∞) that satisfies the following axioms: For x, y, z ∈ S,
1. d(x, y) = 0 if and only if x = y.
2. d(x, y) = d(y, x).
3. d(x, z) ≤ d(x, y) + d(y, z).
So as the name suggests, d(x, y) is the distance between points x, y ∈ S . The axioms are intended to capture the essential properties of distance from geometry. Part (a) is the positive property; the distance is strictly positive if and only if the points are distinct. Part (b) is the symmetric property; the distance from x to y is the same as the distance from y to x. Part (c) is the triangle inequality; going directly from x to z cannot be longer than going from x to z by way of a third point y .
Note that if (S, d) is a metric space, and A is a nonempty subset of S , then the set A with d restricted to A×A is also a metric
space (known as a subspace). The next definitions also come naturally from geometry:
Suppose that (S, d) is a metric space, and that x ∈ S and r ∈ (0, ∞).
1. B(x, r) = {y ∈ S : d(x, y) < r} is the open ball with center x and radius r.
2. C (x, r) = {y ∈ S : d(x, y) ≤ r} is the closed ball with center x and radius r.
Suppose that (S, d) is a metric space. By definition, a set U ⊆ S is open if for every x ∈ U there exists r ∈ (0, ∞) such that B(x, r) ⊆ U . The collection Sd of open subsets of S is a topology.
Proof
As the names suggests, an open ball is in fact open and a closed ball is in fact closed.
Suppose again that (S, d) is a metric space, and that x ∈ S and r ∈ (0, ∞). Then
1. B(x, r) is open.
2. C (x, r) is closed.
Proof
Recall that for a general topological space, a neighborhood of a point x ∈ S is a set A ⊆ S with the property that there exists an
open set U with x ∈ U ⊆ A . It follows that in a metric space, A ⊆ S is a neighborhood of x if and only if there exists r > 0 such
that B(x, r) ⊆ A . In words, a neighborhood of a point must contain an open ball about that point.
It's easy to construct new metrics from ones that we already have. Here's one such result.
Suppose that S is a nonempty set, and that d, e are metrics on S , and c ∈ (0, ∞). Then the following are also metrics on S :
1. cd
2. d + e
Proof
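As an informal numerical sanity check (not a proof), one can verify the metric axioms for cd and d + e on a finite set of sample points. The sketch below uses the Euclidean and taxicab metrics on R^2; all function names are our own.

```python
import itertools
import math

def d(x, y):
    """Euclidean metric on R^2."""
    return math.hypot(x[0] - y[0], x[1] - y[1])

def e(x, y):
    """Taxicab metric on R^2."""
    return abs(x[0] - y[0]) + abs(x[1] - y[1])

def check_metric(m, points, tol=1e-12):
    """Check the metric axioms on a finite set of sample points."""
    for x, y, z in itertools.product(points, repeat=3):
        assert m(x, y) >= 0
        assert (m(x, y) == 0) == (x == y)        # positive property
        assert abs(m(x, y) - m(y, x)) < tol       # symmetric property
        assert m(x, z) <= m(x, y) + m(y, z) + tol # triangle inequality

pts = [(0.0, 0.0), (1.0, 0.0), (0.5, 2.0), (-1.0, 3.0)]
c = 2.5
check_metric(lambda x, y: c * d(x, y), pts)        # cd is a metric
check_metric(lambda x, y: d(x, y) + e(x, y), pts)  # d + e is a metric
```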
Since a metric space produces a topological space, all of the definitions for general topological spaces apply to metric spaces as well.
In particular, in a metric space, distinct points can always be separated.
A metric space (S, d) is a Hausdorff space.
Proof
Metrizable Spaces
Again, every metric space is a topological space, but not conversely. A non-Hausdorff space, for example, cannot correspond to a
metric space. We know there are such spaces; a set S with more than one point, and with the trivial topology S = {S, ∅} is non-
Hausdorff.
Suppose that (S, S ) is a topological space. If there exists a metric d on S such that S = Sd , then (S, S ) is said to be
metrizable.
It's easy to see that different metrics can induce the same topology. For example, if d is a metric and c ∈ (0, ∞), then the metrics d and cd induce the same topology.
Let S be a nonempty set. Metrics d and e on S are equivalent, and we write d ≡ e , if Sd = Se . The relation ≡ is an
equivalence relation on the collection of metrics on S . That is, for metrics d, e, f on S ,
1. d ≡ d , the reflexive property.
2. If d ≡ e then e ≡ d , the symmetric property.
3. If d ≡ e and e ≡ f then d ≡ f , the transitive property.
There is a simple condition that characterizes when the topology of one metric is finer than the topology of another metric, and then
this in turn leads to a condition for equivalence of metrics.
Suppose again that S is a nonempty set and that d, e are metrics on S . Then Se is finer than Sd if and only if every open ball relative to d contains an open ball relative to e with the same center.
It follows that metrics d and e on S are equivalent if and only if every open ball relative to one of the metrics contains an open ball
relative to the other metric.
So every metrizable topology on S corresponds to an equivalence class of metrics that produce that topology. Sometimes we want to
know that a topological space is metrizable, because of the nice properties that it will have, but we don't really need to use a specific
metric that generates the topology. At any rate, it's important to have conditions that are sufficient for a topological space to be
metrizable. The most famous such result is the Urysohn metrization theorem, named for the Russian mathematician Pavel Urysohn:
Suppose that (S, S ) is a regular, second-countable, Hausdorff space. Then (S, S ) is metrizable.
Convergence
With a distance function, the convergence of a sequence can be characterized in a manner that is just like calculus. Recall that for a general topological space (S, S ), if (xn : n ∈ N+) is a sequence of points in S and x ∈ S , then xn → x as n → ∞ means that for every neighborhood A of x, there exists m ∈ N+ such that xn ∈ A for n > m.
Suppose that (S, d) is a metric space, and that (xn : n ∈ N+) is a sequence of points in S and x ∈ S . Then xn → x as n → ∞ if and only if for every ϵ > 0 there exists m ∈ N+ such that if n > m then d(xn , x) < ϵ . Equivalently, xn → x as n → ∞ if and only if d(xn , x) → 0 as n → ∞ .
Proof
So, no matter how tiny ϵ>0 may be, all but finitely many terms of the sequence are within ϵ distance of x. As one might hope,
limits are unique.
Suppose again that (S, d) is a metric space. Suppose also that (xn : n ∈ N+) is a sequence of points in S and that x, y ∈ S . If xn → x as n → ∞ and xn → y as n → ∞ then x = y .
Proof
Suppose that d, e are equivalent metrics on S , and that (xn : n ∈ N+) is a sequence of points in S and x ∈ S . Then xn → x as n → ∞ relative to d if and only if xn → x as n → ∞ relative to e .
Closed subsets of a metric space have a simple characterization in terms of convergent sequences, and this characterization is more
intuitive than the abstract axioms in a general topological space.
Suppose again that (S, d) is a metric space. Then A ⊆ S is closed if and only if whenever a sequence of points in A converges,
the limit is also in A .
Proof
The following definition also shows up in standard calculus. The idea is to have a criterion for the convergence of a sequence that does not require knowing the limit a priori. But for metric spaces, this definition takes on added importance.
Suppose again that (S, d) is a metric space. A sequence of points (xn : n ∈ N+) in S is a Cauchy sequence if for every ϵ > 0 there exists k ∈ N+ such that if m, n ∈ N+ with m > k and n > k then d(xm , xn ) < ϵ .
Cauchy sequences are named for the ubiquitous Augustin Cauchy. So for a Cauchy sequence, no matter how tiny ϵ > 0 may be, all
but finitely many terms of the sequence will be within ϵ distance of each other. A convergent sequence is always Cauchy.
Suppose again that (S, d) is a metric space. If a sequence of points (xn : n ∈ N+) in S converges, then the sequence is Cauchy.
Proof
Conversely, one might think that a Cauchy sequence should converge, but it's relatively trivial to create a situation where this is false.
Suppose that (S, d) is a metric space, and that there is a point x ∈ S that is the limit of a sequence of points in S that are all distinct
from x. Then the space T = S − {x} with the metric d restricted to T × T has a Cauchy sequence that does not converge.
Essentially, we have created a “convergence hole”. So our next definition is very natural and very important.
Suppose again that (S, d) is a metric space and that A ⊆ S . Then A is complete if every Cauchy sequence in A converges to a point in A .
Of course, completeness can be applied to the entire space S . Trivially, a complete set must be closed.
Suppose again that (S, d) is a metric space, and that A ⊆ S . If A is complete, then A is closed.
Proof
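The “convergence hole” construction above can be illustrated numerically with S = (0, 1] and x_n = 1/n: the sequence is Cauchy in S, but its limit, 0, lies outside S. A small sketch (names are our own):

```python
def x(n):
    """x_n = 1/n: a sequence in S = (0, 1] that is Cauchy,
    but whose limit 0 lies outside of S."""
    return 1.0 / n

# Cauchy property of the tail: for m, n > k, |x_m - x_n| < 1/k
k = 1000
tail = [x(n) for n in range(k + 1, k + 500)]
assert max(tail) - min(tail) < 1.0 / k

# the terms eventually leave every ball B(s, s/2) around any s in (0, 1],
# so no point of S is the limit (illustrated here for s = 0.01)
s = 0.01
assert all(abs(x(n) - s) > s / 2 for n in range(1000, 1500))
```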
Completeness is such a crucial property that it is often imposed as an assumption on metric spaces that occur in applications. Even though a Cauchy sequence may not converge, here is a partial result that will be useful later: if a Cauchy sequence has a convergent subsequence, then the sequence itself converges.
Suppose again that (S, d) is a metric space, and that (xn : n ∈ N+) is a Cauchy sequence in S . If there exists a subsequence (xnk : k ∈ N+) such that xnk → x ∈ S as k → ∞ , then xn → x as n → ∞ .
Proof
Continuity
In metric spaces, continuity of functions also has simple characterizations in terms of conditions that are familiar from calculus. We start with local continuity. Recall that the general topological definition is that f : S → T is continuous at x ∈ S if f −1 (V ) is a neighborhood of x in S whenever V is a neighborhood of f (x) in T .
Suppose that (S, d) and (T , e) are metric spaces, and that f : S → T . The continuity of f at x ∈ S is equivalent to each of the following conditions:
1. If (xn : n ∈ N+) is a sequence in S with xn → x as n → ∞ then f (xn ) → f (x) as n → ∞ .
2. For every ϵ > 0 , there exists δ > 0 such that if y ∈ S and d(x, y) < δ then e[f (x), f (y)] < ϵ .
Proof
More generally, recall that f continuous on A ⊆ S means that f is continuous at each x ∈ A , and that f continuous means that f is
continuous on S . So general continuity can be characterized in terms of sequential continuity and the ϵ-δ condition.
On a metric space, there are stronger versions of continuity.
Suppose again that (S, d) and (T , e) are metric spaces and that f : S → T . Then f is uniformly continuous if for every ϵ > 0 there exists δ > 0 such that if x, y ∈ S with d(x, y) < δ then e[f (x), f (y)] ≤ ϵ .
In the ϵ-δ formulation of ordinary point-wise continuity above, δ depends on the point x in addition to ϵ. With uniform continuity,
there exists a δ depending only on ϵ that works uniformly in x ∈ S .
Suppose again that (S, d) and (T , e) are metric spaces, and that f : S → T . If f is uniformly continuous then f is continuous.
Suppose again that (S, d) and (T , e) are metric spaces, and that f : S → T . Then f is Hölder continuous with exponent α ∈ (0, ∞) if there exists C ∈ (0, ∞) such that e[f (x), f (y)] ≤ C [d(x, y)]^α for all x, y ∈ S .
The definition is named for Otto Hölder. The exponent α is more important than the constant C , which generally does not have a name. If α = 1 , f is said to be Lipschitz continuous, named for the German mathematician Rudolf Lipschitz.
Suppose again that (S, d) and (T , e) are metric spaces, and that f : S → T . If f is Hölder continuous with exponent α > 0 then f is uniformly continuous.
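A standard example (not from the text) is f(x) = √x on [0, ∞), which is Hölder continuous with exponent α = 1/2 and constant C = 1: |√x − √y| ≤ |x − y|^{1/2}. The sketch below checks the inequality on random sample pairs:

```python
import math
import random

random.seed(42)
alpha, C = 0.5, 1.0  # Hölder exponent and constant for f(x) = sqrt(x)
for _ in range(1000):
    x, y = random.uniform(0, 10), random.uniform(0, 10)
    lhs = abs(math.sqrt(x) - math.sqrt(y))
    # Hölder condition: e[f(x), f(y)] <= C * d(x, y)^alpha
    assert lhs <= C * abs(x - y) ** alpha + 1e-12
```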
Suppose again that (S, d) and (T , e) are metric spaces. A function f : S → T is a contraction if there exists C ∈ (0, 1) such
that
e[f (x), f (y)] ≤ C d(x, y), x, y ∈ S (1.10.6)
So contractions shrink distance. By the result above, a contraction is uniformly continuous. Part of the importance of contraction
maps is due to the famous Banach fixed-point theorem, named for Stefan Banach.
Suppose that (S, d) is a complete metric space and that f : S → S is a contraction. Then f has a unique fixed point. That is, there exists exactly one x∗ ∈ S with f (x∗ ) = x∗ . Let x0 ∈ S , and recursively define xn = f (xn−1 ) for n ∈ N+ . Then xn → x∗ as n → ∞ .
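The fixed-point iteration in the theorem is easy to sketch in code. Below, cosine restricted to [0, 1] serves as the contraction (it maps [0, 1] into itself with |cos′(x)| = |sin(x)| ≤ sin 1 < 1 there); the function name and tolerance are our own choices:

```python
import math

def fixed_point(f, x0, tol=1e-12, max_iter=10_000):
    """Iterate x_n = f(x_{n-1}); by the Banach fixed-point theorem the
    iterates converge when f is a contraction on a complete metric space."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("no convergence")

# the unique fixed point of cosine (the Dottie number, about 0.739)
x_star = fixed_point(math.cos, 0.5)
assert abs(math.cos(x_star) - x_star) < 1e-10
```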
Functions that preserve distance are particularly important. The term isometry means distance-preserving.
Suppose again that (S, d) and (T , e) are metric spaces, and that f : S → T . Then f is an isometry if e[f (x), f (y)] = d(x, y) for
every x, y ∈ S .
Suppose again that (S, d) and (T , e) are metric spaces, and that f : S → T . If f is an isometry, then f is one-to-one and
Lipschitz continuous.
Proof
In particular, an isometry f is uniformly continuous. If one metric space can be mapped isometrically onto another metric space, the
spaces are essentially the same.
Metric spaces (S, d) and (T , e) are isometric if there exists an isometry f that maps S onto T . Isometry is an equivalence
relation on metric spaces. That is, for metric spaces (S, d) , (T , e) , and (U , ρ),
1. (S, d) is isometric to (S, d) , the reflexive property.
2. If (S, d) is isometric to (T , e) then (T , e) is isometric to (S, d) , the symmetric property.
3. If (S, d) is isometric to (T , e) and (T , e) is isometric to (U , ρ), then (S, d) is isometric to (U , ρ), the transitive property.
Proof
In particular, if metric spaces (S, d) and (T , e) are isometric, then as topological spaces, they are homeomorphic.
Compactness and Boundedness
In a metric space, various definitions related to a set being bounded are natural, and are related to the general concept of
compactness.
Suppose again that (S, d) is a metric space, and that A ⊆ S . Then A is bounded if there exists r ∈ (0, ∞) such that d(x, y) ≤ r
for all x, y ∈ A. The diameter of A is
diam(A) = inf{r > 0 : d(x, y) < r for all x, y ∈ A} (1.10.7)
Additional details
So A is bounded if and only if diam(A) < ∞ . Diameter is an increasing function relative to the subset partial order.
Suppose again that (S, d) is a metric space, and that A ⊆ B ⊆ S . Then diam(A) ≤ diam(B) .
Our next definition is stronger, but first let's review some terminology that we used for general topological spaces: If S is a set, A a subset of S , and 𝒜 a collection of subsets of S , then 𝒜 is said to cover A if A ⊆ ⋃𝒜 . So with this terminology, we can talk about open covers, closed covers, finite covers, disjoint covers, and so on.
Suppose again that (S, d) is a metric space, and that A ⊆ S . Then A is totally bounded if for every r > 0 there is a finite cover
of A with open balls of radius r.
Recall that for a general topological space, a set A is compact if every open cover of A has a finite subcover. So in a metric space,
the term precompact is sometimes used instead of totally bounded: The set A is totally bounded if every cover of A with open balls
of radius r has a finite subcover.
Suppose again that (S, d) is a metric space. If A ⊆ S is totally bounded then A is bounded.
Proof
Since a metric space is a Hausdorff space, a compact subset of a metric space is closed. Compactness also has a simple
characterization in terms of convergence of sequences.
Suppose again that (S, d) is a metric space. A subset C ⊆S is compact if and only if every sequence of points in C has a
subsequence that converges to a point in C .
Proof
Suppose again that (S, d) is a metric space and that A ⊆ S . For δ ∈ (0, ∞) and k ∈ [0, ∞), define
H_δ^k(A) = inf{ ∑_{B∈B} [diam(B)]^k : B is a countable δ-cover of A } (1.10.9)
Additional details
Note that the k -dimensional Hausdorff measure is defined for every k ∈ [0, ∞), not just nonnegative integers. Nonetheless, the
integer dimensions are interesting. The 0-dimensional measure of A is the number of points in A . In Euclidean space, which we
consider in (36), the measures of dimension 1, 2, and 3 are related to length, area, and volume, respectively.
Suppose again that (S, d) is a metric space and that A ⊆ S . The Hausdorff dimension of A is
dimH(A) = inf{k ∈ [0, ∞) : H^k(A) = 0} (1.10.12)
Of special interest, as before, is the case when S = R^n for some n ∈ N+ and d is the standard Euclidean distance, reviewed in (36).
As you might guess, the Hausdorff dimension of a point is 0, the Hausdorff dimension of a “simple curve” is 1, the Hausdorff
dimension of a “simple surface” is 2, and so on. But there are also sets with fractional Hausdorff dimension, and the stochastic
process Brownian motion provides some fascinating examples. The graph of standard Brownian motion has Hausdorff dimension
3/2 while the set of zeros has Hausdorff dimension 1/2.
Suppose that (S, +, ⋅) is a vector space, and that ∥ ⋅ ∥ is a norm on the space. Then d defined by d(x, y) = ∥y − x∥ for x, y ∈ S
is a metric on S .
Proof
For k ∈ [1, ∞), the function dk given below is a metric on R^n :
dk(x, y) = ( ∑_{i=1}^{n} |xi − yi|^k )^{1/k}, x = (x1 , x2 , … , xn ), y = (y1 , y2 , … , yn ) ∈ R^n (1.10.13)
Proof
Of course the metric d2 is Euclidean distance, named for Euclid. This is the most important one, in a practical sense because it's the usual one that we use in the real world, and in a mathematical sense because the associated norm corresponds to the standard inner product on R^n given by
⟨x, y⟩ = ∑_{i=1}^{n} xi yi , x = (x1 , x2 , … , xn ), y = (y1 , y2 , … , yn ) ∈ R^n (1.10.15)
The function d∞ given below is also a metric on R^n :
d∞(x, y) = max{|xi − yi| : i ∈ {1, 2, … , n}}, x = (x1 , x2 , … , xn ), y = (y1 , y2 , … , yn ) ∈ R^n (1.10.16)
Proof
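The metrics d_k and d_∞ are straightforward to compute; the sketch below (function names our own) also illustrates numerically that d_k approaches d_∞ as k grows:

```python
def d_k(x, y, k):
    """Minkowski distance d_k on R^n for k in [1, infinity)."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)

def d_inf(x, y):
    """Maximum distance d_infinity on R^n."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0.0, 0.0), (3.0, 4.0)
assert d_k(x, y, 1) == 7.0   # taxicab distance
assert d_k(x, y, 2) == 5.0   # Euclidean distance
assert d_inf(x, y) == 4.0    # maximum distance
# d_infinity is the limit of d_k as k grows
assert abs(d_k(x, y, 50) - d_inf(x, y)) < 0.1
```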
Figure 1.10.1 : From inside out, the boundaries of the unit balls centered at the origin in R^2 for the metrics dk with k ∈ {3/4, 1, 2, 3, ∞} .
Suppose now that S is a nonempty set. Recall that the collection V of all functions f : S → R is a vector space under the usual
pointwise definition of addition and scalar multiplication. That is, if f , g ∈ V and c ∈ R , then f + g ∈ V and cf ∈ V are defined
by (f + g)(x) = f (x) + g(x) and (cf )(x) = cf (x) for x ∈ S . Recall further that the collection U of bounded functions f : S → R
is a vector subspace of V , and moreover, ∥ ⋅ ∥ defined by ∥f ∥ = sup{|f (x)| : x ∈ S} is a norm on U , known as the supremum norm. It follows that U is a metric space with the metric d defined by
d(f , g) = ∥f − g∥ = sup{|f (x) − g(x)| : x ∈ S} (1.10.18)
Vector spaces of bounded, real-valued functions with the supremum norm are very important in the study of probability and stochastic processes. Note that the supremum norm on U generalizes the maximum norm on R^n , since we can think of a point in R^n as a function from {1, 2, … , n} into R. Later, as part of our discussion on integration with respect to a positive measure, we will see how to generalize the k-norms on R^n to spaces of functions.
Suppose n ∈ {2, 3, …}, and that (Si , di ) is a metric space for each i ∈ {1, 2, … , n}. Suppose also that ∥ ⋅ ∥ is a norm on R^n .
Proof
Graphs
Recall that a graph (in the combinatorial sense) consists of a countable set S of vertices and a set E ⊆ S × S of edges. In this discussion, we assume that the graph is undirected in the sense that (x, y) ∈ E if and only if (y, x) ∈ E , and has no loops so that (x, x) ∉ E for x ∈ S . Finally, recall that a path of length n ∈ N+ from x ∈ S to y ∈ S is a sequence (x0 , x1 , … , xn ) ∈ S^{n+1} such that x0 = x , xn = y , and (xi−1 , xi ) ∈ E for i ∈ {1, 2, … , n}. The graph is connected if there exists a path of finite length between each pair of distinct vertices.
Suppose that G = (S, E) is a connected graph. Then d defined as follows is a metric on S , known as the graph distance: d(x, x) = 0 for x ∈ S , and for distinct x, y ∈ S , d(x, y) is the length of a shortest path from x to y .
Suppose again that S is a nonempty set. A metric d on S with the property that there exists c ∈ (0, ∞) such that d(x, y) ≥ c for
distinct x, y ∈ S generates the discrete topology.
Proof
So any metric that is bounded from below (for distinct points) generates the discrete topology. It's easy to see that there are such
metrics.
Suppose again that S is a nonempty set. The function d on S × S defined by d(x, x) = 0 for x ∈ S and d(x, y) = 1 for distinct
x, y ∈ S is a metric on S , known as the discrete metric. This metric generates the discrete topology.
Proof
In probability applications, the discrete topology is often appropriate when S is countable. Note also that the discrete metric is the
graph distance if S is made into the complete graph, so that (x, y) is an edge for every pair of distinct vertices x, y ∈ S .
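The graph distance can be computed with a breadth-first search; the sketch below (function names our own) also confirms that on a complete graph the graph distance reduces to the discrete metric:

```python
from collections import deque

def graph_distance(edges, x, y):
    """Length of a shortest path between vertices x and y, computed by
    breadth-first search; this is the graph metric on a connected graph."""
    if x == y:
        return 0
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == y:
                    return dist[v]
                queue.append(v)
    raise ValueError("graph is not connected")

# path graph 0 - 1 - 2 - 3: distance equals number of edges traversed
edges = [(0, 1), (1, 2), (2, 3)]
assert graph_distance(edges, 0, 3) == 3
# on the complete graph, the graph distance is the discrete metric
complete = [(i, j) for i in range(4) for j in range(i + 1, 4)]
assert all(graph_distance(complete, i, j) == 1
           for i in range(4) for j in range(4) if i != j)
```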
This page titled 1.10: Metric Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.11: Measurable Spaces
In this section we discuss some topics from measure theory that are a bit more advanced than the topics in the early sections of this
chapter. However, measure-theoretic ideas are essential for a deep understanding of probability, since probability is itself a
measure. The most important of the definitions is the σ-algebra, a collection of subsets of a set with certain closure properties. Such
collections play a fundamental role, even for applied probability, in encoding the state of information about a random experiment.
On the other hand, we won't be overly pedantic about measure-theoretic details in this text. Unless we say otherwise, we assume
that all sets that appear are measurable (that is, members of the appropriate σ-algebras), and that all functions are measurable
(relative to the appropriate σ-algebras).
Although this section is somewhat abstract, many of the proofs are straightforward. Be sure to try the proofs yourself before
reading the ones in the text.
Algebras of Sets
Suppose that S is a nonempty collection of subsets of S . Then S is an algebra (or field) if it is closed under complement and
union:
1. If A ∈ S then A^c ∈ S .
2. If A ∈ S and B ∈ S then A ∪ B ∈ S .
Suppose that S is an algebra of subsets of S and that Ai ∈ S for each i in a finite index set I .
1. ⋃_{i∈I} Ai ∈ S
2. ⋂_{i∈I} Ai ∈ S
Proof
Thus it follows that an algebra of sets is closed under a finite number of set operations. That is, if we start with a finite number of
sets in the algebra S , and build a new set with a finite number of set operations (union, intersection, complement), then the new
set is also in S . However in many mathematical theories, probability in particular, this is not sufficient; we often need the
collection of admissible subsets to be closed under a countable number of set operations.
σ -Algebras of Sets
Suppose that S is a nonempty collection of subsets of S . Then S is a σ-algebra (or σ-field) if the following axioms are satisfied:
1. If A ∈ S then A^c ∈ S .
2. If Ai ∈ S for each i in a countable index set I , then ⋃_{i∈I} Ai ∈ S .
Clearly a σ-algebra of subsets is also an algebra of subsets, so the basic results for algebras above still hold. In particular, S ∈ S and ∅ ∈ S .
Proof
Thus a σ-algebra of subsets of S is closed under countable unions and intersections. This is the reason for the symbol σ in the
name. As mentioned in the introductory paragraph, σ-algebras are of fundamental importance in mathematics generally and
probability theory specifically, and thus deserve a special definition:
If S is a set and S a σ-algebra of subsets of S , then the pair (S, S ) is called a measurable space.
The term measurable space will make more sense in the next chapter, when we discuss positive measures (and in particular,
probability measures) on such spaces.
Suppose that S is a set and that S is a finite algebra of subsets of S . Then S is also a σ-algebra.
Proof
However, there are algebras that are not σ-algebras. Here is the classic example:
Suppose that S is an infinite set. The collection of finite and co-finite subsets of S defined below is an algebra of subsets of S ,
but not a σ-algebra:
F = {A ⊆ S : A is finite or A^c is finite} (1.11.1)
Proof
General Constructions
Recall that P(S) denotes the collection of all subsets of S , called the power set of S . Trivially, P(S) is the largest σ-algebra of S . The power set is often the appropriate σ-algebra if S is countable, but as noted above, is sometimes too large to be useful if S is uncountable. At the other extreme, the smallest σ-algebra of S is the trivial σ-algebra {S, ∅}.
In many cases, we want to construct a σ-algebra that contains certain basic sets. The next two results show how to do this.
Suppose that Si is a σ-algebra of subsets of S for each i in a nonempty index set I . Then S = ⋂_{i∈I} Si is also a σ-algebra of subsets of S .
Proof
Note that no restrictions are placed on the index set I , other than it be nonempty, so in particular it may well be uncountable.
Suppose that S is a set and that B is a collection of subsets of S . The σ-algebra generated by B is σ(B) = ⋂{T : T is a σ-algebra of subsets of S with B ⊆ T }.
So the σ-algebra generated by B is the intersection of all σ-algebras that contain B , which by the previous result really is a σ-
algebra. Note that the collection of σ-algebras in the intersection is not empty, since P(S) is in the collection. Think of the sets in
B as basic sets that we want to be measurable, but do not form a σ-algebra.
Note that the conditions in the last theorem completely characterize σ(B) . If S1 and S2 satisfy the conditions, then by (a), B ⊆ S1 and B ⊆ S2 . But then by (b), S1 ⊆ S2 and S2 ⊆ S1 .
Proof
We can generalize the previous result. Recall that a collection of subsets A = {Ai : i ∈ I } is a partition of S if Ai ∩ Aj = ∅ for i, j ∈ I with i ≠ j , and ⋃_{i∈I} Ai = S .
Suppose that A = {Ai : i ∈ I } is a countable partition of S into nonempty subsets. Then σ(A ) is the collection of all unions of sets in A . That is,
σ(A ) = { ⋃_{j∈J} Aj : J ⊆ I } (1.11.3)
Proof
A σ-algebra of this form is said to be generated by a countable partition. Note that since Ai ≠ ∅ for i ∈ I , the representation of a set in σ(A ) as a union of sets in A is unique. That is, if J, K ⊆ I and J ≠ K then ⋃_{j∈J} Aj ≠ ⋃_{k∈K} Ak . In particular, if there are n nonempty sets in A , so that #(I ) = n , then there are 2^n subsets of I and hence 2^n sets in σ(A ).
Suppose now that A = {A1 , A2 , … , An } is a collection of n subsets of S (not necessarily disjoint). To describe the σ-algebra generated by A we need a bit more notation. For x = (x1 , x2 , … , xn ) ∈ {0, 1}^n (a bit string of length n ), let Bx = ⋂_{i=1}^{n} Ai^{xi} where Ai^1 = Ai and Ai^0 = Ai^c .
Proof
Recall that there are 2^n bit strings of length n . The sets in A are said to be in general position if the sets in B = {Bx : x ∈ {0, 1}^n } are distinct (and hence there are 2^n of them) and are nonempty. In this case, there are 2^{2^n} sets in σ(A ).
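The atoms B_x are easy to enumerate for a concrete example. Below, three subsets of S = {0, 1, …, 7} are chosen (our own example) so that they are in general position: all 2^3 = 8 atoms are distinct and nonempty:

```python
from itertools import product

S = frozenset(range(8))
A1 = frozenset({0, 1, 2, 3})
A2 = frozenset({0, 1, 4, 5})
A3 = frozenset({0, 2, 4, 6})

def atom(bits, sets, S):
    """B_x: intersect A_i when bit x_i is 1, its complement when 0."""
    result = S
    for b, A in zip(bits, sets):
        result &= A if b else (S - A)
    return result

atoms = {bits: atom(bits, [A1, A2, A3], S)
         for bits in product([0, 1], repeat=3)}
# general position: all 2^3 atoms are distinct and nonempty,
# so sigma{A1, A2, A3} contains 2^(2^3) = 256 sets
assert len(set(atoms.values())) == 8
assert all(a for a in atoms.values())
```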
Open the Venn diagram app. This app shows two subsets A and B of S in general position, and lists the 16 sets in σ{A, B}.
1. Select each of the 4 sets that partition S : A ∩ B , A ∩ B^c , A^c ∩ B , A^c ∩ B^c .
2. Select each of the other 12 sets in σ{A, B} and note how each is a union of some of the sets in (a).
Sketch a Venn diagram with sets A1 , A2 , A3 in general position. Identify the set Bx for each x ∈ {0, 1}^3 .
3
If a σ-algebra is generated by a collection of basic sets, then each set in the σ-algebra is generated by a countable number of the
basic sets.
Proof
The σ-algebra R is the σ-algebra on R induced by S . The following construction is useful for counterexamples. Compare this
example with the one for finite and co-finite sets.
Let C = {A ⊆ S : A is countable or A^c is countable}, the collection of countable and co-countable subsets of S . Then
1. C is a σ-algebra
2. C = σ{{x} : x ∈ S}, the σ-algebra generated by the singleton sets.
Proof
Of course, if S is itself countable then C = P(S) . On the other hand, if S is uncountable, then there exists A ⊆ S such that A and A^c are uncountable. Thus, A ∉ C , but A = ⋃_{x∈A} {x} , and of course {x} ∈ C . Thus, we have an example of a σ-algebra that is not closed under uncountable unions.
Suppose that (S, S ) is a topological space. Then σ(S ) is the Borel σ-algebra on S , and (S, σ(S )) is a Borel measurable
space.
So the Borel σ-algebra on S , named for Émile Borel, is generated by the open subsets of S . Thus, a topological space (S, S )
naturally leads to a measurable space (S, σ(S )). Since a closed set is simply the complement of an open set, the Borel σ-algebra
contains the closed sets as well (and in fact is generated by the closed sets). Here are some other sets that are in the Borel σ-
algebra:
Suppose again that (S, S ) is a topological space and that I is a countable index set.
1. If Ai is open for each i ∈ I then ⋂_{i∈I} Ai ∈ σ(S ) . Such sets are called Gδ sets.
2. If Ai is closed for each i ∈ I then ⋃_{i∈I} Ai ∈ σ(S ) . Such sets are called Fσ sets.
3. If (S, S ) is Hausdorff then {x} ∈ σ(S ) for every x ∈ S .
In terms of part (c), recall that a topological space is Hausdorff, named for Felix Hausdorff, if the topology can distinguish
individual points. Specifically, if x, y ∈ S are distinct then there exist disjoint open sets U , V with x ∈ U and y ∈ V . This is a
very basic property possessed by almost all topological spaces that occur in applications. A simple corollary of (c) is that if the
topological space (S, S ) is Hausdorff then A ∈ σ(S ) for every countable A ⊆ S .
Let's note the extreme cases. If S has the discrete topology P(S) , so that every set is open (and closed), then of course the Borel
σ-algebra is also P(S) . As noted above, this is often the appropriate σ-algebra if S is countable, but is often too large if S is
uncountable. If S has the trivial topology {S, ∅}, then the Borel σ-algebra is also {S, ∅}, and so is also trivial.
Recall that a base for a topological space (S, T ) is a collection B ⊆ T with the property that every set in T is a union of a
collection of sets in B . In short, every open set is a union of some of the basic open sets.
Suppose that (S, S ) is a topological space with a countable base B . Then σ(B) = σ(S ) .
Proof
The topological spaces that occur in probability and stochastic processes are usually assumed to have a countable base (along with other nice properties such as the Hausdorff property and local compactness). The σ-algebra used for such a space is usually the Borel σ-algebra, which by the previous result, is countably generated.
Measurable Functions
Recall that a set usually comes with a σ-algebra of admissible subsets. A natural requirement on a function is that the inverse image of an admissible set in the range space be admissible in the domain space. Here is the formal definition: if (S, S ) and (T , T ) are measurable spaces, then f : S → T is measurable if f −1 (A) ∈ S for every A ∈ T .
If the σ-algebra in the range space is generated by a collection of basic sets, then to check the measurability of a function, we need
only consider inverse images of basic sets:
Suppose again that (S, S ) and (T , T ) are measurable spaces, and that T = σ(B) for a collection of subsets B of T . Then f : S → T is measurable if and only if f −1 (B) ∈ S for every B ∈ B .
Proof
If you have reviewed the section on topology then you may have noticed a striking parallel between the definition of continuity for functions on topological spaces and the definition of measurability for functions on measurable spaces: A function from one
topological space to another is continuous if the inverse image of an open set in the range space is open in the domain space. A
function from one measurable space to another is measurable if the inverse image of a measurable set in the range space is
measurable in the domain space. If we start with topological spaces, which we often do, and use the Borel σ-algebras to get
measurable spaces, then we get the following (hardly surprising) connection.
Suppose that (S, S ) and (T , T ) are topological spaces, and that we give S and T the Borel σ-algebras σ(S ) and σ(T ) respectively. If f : S → T is continuous, then f is measurable.
Measurability is preserved under composition, the most important method for combining functions.
Suppose that (R, R) , (S, S ), and (T , T ) are measurable spaces. If f : R → S is measurable and g : S → T is measurable,
then g ∘ f : R → T is measurable.
Proof
If T is given the smallest possible σ-algebra or if S is given the largest one, then any function from S into T is measurable.
When there are several σ-algebras for the same set, then we use the phrase with respect to so that we can be precise. If a function is
measurable with respect to a given σ-algebra on its domain, then it's measurable with respect to any larger σ-algebra on this space.
If the function is measurable with respect to a σ-algebra on the range space then it's measurable with respect to any smaller σ-
algebra on this space.
Suppose that S has σ-algebras R and S with R ⊆ S , and that T has σ-algebras T and U with T ⊆ U . If f : S → T is
measurable with respect to R and U , then f is measurable with respect to S and T .
Proof
Suppose that S is a set and (T , T ) is a measurable space. Suppose also that f : S → T and define
σ(f ) = {f⁻¹(A) : A ∈ T }. Then
1. σ(f ) is a σ-algebra on S .
2. σ(f ) is the smallest σ-algebra on S that makes f measurable.
Proof
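On a finite set, σ(f ) can be computed directly from the definition. The following sketch (the sets S and T and the function f are invented for illustration) forms {f⁻¹(A) : A ∈ T } with T taken to be the power set of T, and checks the closure properties asserted in part (a):

```python
from itertools import chain, combinations

# Sketch: compute sigma(f) = {f^{-1}(A) : A in T} for an invented finite
# example, with T the power set of T (the discrete sigma-algebra).
S = {1, 2, 3, 4}
T = {'a', 'b'}

def f(x):
    return 'a' if x <= 2 else 'b'

def power_set(U):
    U = list(U)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(U, r) for r in range(len(U) + 1))]

def preimage(A):
    return frozenset(x for x in S if f(x) in A)

sigma_f = {preimage(A) for A in power_set(T)}
# sigma(f) contains the empty set and S, and is closed under complements
assert frozenset() in sigma_f and frozenset(S) in sigma_f
assert all(frozenset(S - B) in sigma_f for B in sigma_f)
```

Here σ(f ) turns out to be {∅, {1, 2}, {3, 4}, S}: the coarsest σ-algebra that can distinguish the two values of f.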
Appropriately enough, σ(f ) is called the σ-algebra generated by f . Often, S will have a given σ-algebra S and f will be
measurable with respect to S and T . In this case, σ(f ) ⊆ S . We can generalize to an arbitrary collection of functions on S .
Suppose S is a set and that (Tᵢ, Tᵢ) is a measurable space for each i in a nonempty index set I . Suppose also that fᵢ : S → Tᵢ
for each i ∈ I , and define σ{fᵢ : i ∈ I} = σ{fᵢ⁻¹(A) : A ∈ Tᵢ, i ∈ I} .
Again, this is the smallest σ-algebra on S that makes fᵢ measurable for each i ∈ I .
Product Sets
Product sets arise naturally in the form of the higher-dimensional Euclidean spaces Rⁿ for n ∈ {2, 3, …}. In addition, product
spaces are particularly important in probability, where they are used to describe the spaces associated with sequences of random
variables. More general product spaces arise in the study of stochastic processes. We start with the product of two sets; the
generalization to products of n sets and to general products is straightforward, although the notation gets more complicated.
Suppose that (S, S ) and (T , T ) are measurable spaces. The product σ-algebra on S × T is
S ⊗ T = σ{A × B : A ∈ S , B ∈ T } (1.11.7)
So the definition is natural: the product σ-algebra is generated by products of measurable sets. Our next goal is to consider the
measurability of functions defined on, or mapping into, product spaces. Of basic importance are the projection functions. If S and
T are sets, let p₁ : S × T → S and p₂ : S × T → T be defined by p₁(x, y) = x and p₂(x, y) = y for (x, y) ∈ S × T . Recall that
p₁ is the projection onto the first coordinate and p₂ is the projection onto the second coordinate. The product σ-algebra is the
smallest σ-algebra on S × T that makes both of the projections measurable.
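On small finite sets, the relationship between projections and measurable rectangles can be checked by brute force. In this sketch (the sets S, T and the subset A are invented), the inverse image of A under the first projection is exactly the rectangle A × T, which is why the projections are measurable with respect to the product σ-algebra:

```python
# Sketch: on small finite sets, the inverse image of a measurable set under
# a projection is a measurable rectangle. The sets S, T, A are invented.
S = {0, 1}
T = {'x', 'y', 'z'}
product_set = {(s, t) for s in S for t in T}

def p1(pair):
    return pair[0]  # projection onto the first coordinate

A = {0}  # a subset of S
preimage = {pt for pt in product_set if p1(pt) in A}
rectangle = {(s, t) for s in A for t in T}  # the rectangle A x T
assert preimage == rectangle  # p1^{-1}(A) = A x T
```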
Projection functions make it easy to study functions mapping into a product space.
Suppose that (R, R) , (S, S ), and (T , T ) are measurable spaces, and that S × T is given the product σ-algebra S ⊗ T .
Suppose also that f : R → S × T , so that f (x) = (f₁(x), f₂(x)) for x ∈ R , where f₁ : R → S and f₂ : R → T are the
coordinate functions. Then f is measurable if and only if f₁ and f₂ are measurable.
Proof
Our next goal is to consider cross sections of sets in a product space and cross sections of functions defined on a product space. It
will help to introduce some new functions, which in a sense are complementary to the projection functions.
Suppose again that (S, S ) and (T , T ) are measurable spaces, and that S × T is given the product σ-algebra S ⊗T .
1. For x ∈ S the function 1ₓ : T → S × T , defined by 1ₓ(y) = (x, y) for y ∈ T , is measurable.
2. For y ∈ T the function 2ᵧ : S → S × T , defined by 2ᵧ(x) = (x, y) for x ∈ S , is measurable.
Proof
Suppose again that (S, S ) and (T , T ) are measurable spaces, and that C ∈ S ⊗T . Then
1. For x ∈ S , {y ∈ T : (x, y) ∈ C } ∈ T .
2. For y ∈ T , {x ∈ S : (x, y) ∈ C } ∈ S .
Proof
The set in (a) is the cross section of C in the first coordinate at x, and the set in (b) is the cross section of C in the second
coordinate at y . As a simple corollary to the theorem, note that if A ⊆ S and B ⊆ T are nonempty and A × B ∈ S ⊗ T then A ∈ S and B ∈ T .
That is, the only measurable product sets are products of measurable sets. Here is the measurability result for cross-sectional
functions:
Suppose again that (S, S ) and (T , T ) are measurable spaces, and that S × T is given the product σ -algebra S ⊗T .
Suppose also that (U , U ) is another measurable space, and that f : S × T → U is measurable. Then
1. The function y ↦ f (x, y) from T to U is measurable for each x ∈ S .
2. The function x ↦ f (x, y) from S to U is measurable for each y ∈ T .
Proof
The results for products of two spaces generalize in a completely straightforward way to a product of n spaces.
Suppose n ∈ N₊ and that (Sᵢ, Sᵢ) is a measurable space for each i ∈ {1, 2, … , n} . The product σ-algebra on the Cartesian
product set S₁ × S₂ × ⋯ × Sₙ is
S₁ ⊗ S₂ ⊗ ⋯ ⊗ Sₙ = σ{A₁ × A₂ × ⋯ × Aₙ : Aᵢ ∈ Sᵢ for each i ∈ {1, 2, … , n}} (1.11.8)
So again, the product σ-algebra is generated by products of measurable sets. Results analogous to the theorems above hold. In the
special case that (Sᵢ, Sᵢ) = (S, S ) for i ∈ {1, 2, … , n}, the Cartesian product becomes Sⁿ and the corresponding product σ-
algebra is denoted S ⁿ. The notation is natural, but potentially confusing. Note that S ⁿ is not the Cartesian product of S n times,
but rather the σ-algebra generated by sets of the form A₁ × A₂ × ⋯ × Aₙ where Aᵢ ∈ S for i ∈ {1, 2, … , n}.
We can also extend these ideas to a general product. To recall the definition, suppose that Sᵢ is a set for each i in a nonempty index
set I . The product set ∏_{i∈I} Sᵢ consists of all functions x : I → ⋃_{i∈I} Sᵢ such that x(i) ∈ Sᵢ for each i ∈ I . To make the notation
look more like a simple Cartesian product, we often write xᵢ instead of x(i) for the value of a function in the product set at i ∈ I .
The next definition gives the appropriate σ-algebra for the product set.
Suppose that (Sᵢ, Sᵢ) is a measurable space for each i in a nonempty index set I . The product σ-algebra on the product set
∏_{i∈I} Sᵢ is
⊗_{i∈I} Sᵢ = σ{∏_{i∈I} Aᵢ : Aᵢ ∈ Sᵢ for each i ∈ I , and Aᵢ = Sᵢ for all but finitely many i ∈ I} (1.11.9)
If you have reviewed the section on topology, the definition should look familiar. If the spaces were topological spaces instead of
measurable spaces, with Sᵢ the topology of Sᵢ for i ∈ I , then the set of products in the displayed expression above is a base for
the product topology.
The definition can also be understood in terms of projections. Recall that the projection onto coordinate j ∈ I is the function
pⱼ : ∏_{i∈I} Sᵢ → Sⱼ given by pⱼ(x) = xⱼ . The product σ-algebra is the smallest σ-algebra on the product set that makes all of the
projections measurable.
Suppose again that (Sᵢ, Sᵢ) is a measurable space for each i in a nonempty index set I , and let S denote the product σ-
algebra on the product set S = ∏_{i∈I} Sᵢ . Then S = σ{pᵢ : i ∈ I} .
Proof
In the special case that (S, S ) is a fixed measurable space and (Sᵢ, Sᵢ) = (S, S ) for all i ∈ I , the product set ∏_{i∈I} Sᵢ is just the
collection of functions from I into S , often denoted S ᴵ. The product σ-algebra is then denoted S ᴵ, a notation that is natural, but
again potentially confusing. Here is the main measurability result for a function mapping into a product space.
Suppose that (R, R) is a measurable space, and that (Sᵢ, Sᵢ) is a measurable space for each i in a nonempty index set I . As
before, let ∏_{i∈I} Sᵢ have the product σ-algebra. Suppose now that f : R → ∏_{i∈I} Sᵢ . For i ∈ I let fᵢ : R → Sᵢ denote the ith
coordinate function of f , so that fᵢ(x) = [f (x)]ᵢ for x ∈ R . Then f is measurable if and only if fᵢ is measurable for each
i ∈ I.
Proof
Just as with the product of two sets, cross-sectional sets and functions are measurable with respect to the product measure. Again,
it's best to work with some special functions.
Suppose that (Sᵢ, Sᵢ) is a measurable space for each i in an index set I with at least two elements. For j ∈ I and u ∈ Sⱼ ,
the function jᵤ : ∏_{i∈I−{j}} Sᵢ → ∏_{i∈I} Sᵢ , defined by jᵤ(x)(j) = u and jᵤ(x)(i) = xᵢ for i ∈ I − {j} , is measurable.
In words, for j ∈ I and u ∈ Sⱼ , the function jᵤ takes a point in the product set ∏_{i∈I−{j}} Sᵢ and assigns u to coordinate j to give a
point in ∏_{i∈I} Sᵢ . If A ⊆ ∏_{i∈I} Sᵢ , then jᵤ⁻¹(A) is the cross section of A in coordinate j at u. So it follows immediately from the
previous result that the cross sections of a measurable set are measurable. Cross sections of measurable functions are also
measurable. Suppose that (T , T ) is another measurable space, and that f : ∏_{i∈I} Sᵢ → T is measurable. The cross section of f in
coordinate j ∈ I at u ∈ Sⱼ is simply f ∘ jᵤ : ∏_{i∈I−{j}} Sᵢ → T , a composition of measurable functions.
However, a non-measurable set can have measurable cross sections, even in a product of two spaces.
Suppose that S is an uncountable set with the σ-algebra C of countable and co-countable sets as in (21). Consider S × S with
the product σ-algebra C ⊗ C . Let D = {(x, x) : x ∈ S} , the diagonal of S × S . Then D has measurable cross sections, but
D is not measurable.
Proof
Special Cases
Most of the sets encountered in applied probability are either countable, or subsets of Rⁿ for some n , or more generally, subsets of
a product of a countable number of sets of these types. In the study of stochastic processes, various spaces of functions play an
important role. In this subsection, we will explore the most important special cases.
Discrete Spaces
If S is countable and S = P(S) is the collection of all subsets of S , then (S, S ) is a discrete measurable space.
Thus if (S, S ) is discrete, all subsets of S are measurable and every function from S to another measurable space is measurable.
The power set is also the discrete topology on S , so S is a Borel σ-algebra as well. As a topological space, (S, S ) is complete,
locally compact, Hausdorff, and since S is countable, separable. Moreover, the discrete topology corresponds to the discrete metric
d , defined by d(x, x) = 0 for x ∈ S and d(x, y) = 1 for x, y ∈ S with x ≠ y .
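The discrete metric is simple enough to state in code. A minimal sketch, checking the two defining cases and the metric properties:

```python
# Sketch: the discrete metric d on an arbitrary set.
def d(x, y):
    return 0 if x == y else 1

assert d('a', 'a') == 0 and d('a', 'b') == 1
# d is symmetric and satisfies the triangle inequality
assert d('a', 'b') == d('b', 'a')
assert d('a', 'c') <= d('a', 'b') + d('b', 'c')
```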
Euclidean Spaces
Recall that for n ∈ N₊, the Euclidean topology on Rⁿ is generated by the standard Euclidean metric dₙ given by
dₙ(x, y) = √(∑_{i=1}^n (xᵢ − yᵢ)²), x = (x₁, x₂, … , xₙ), y = (y₁, y₂, … , yₙ) ∈ Rⁿ (1.11.10)
With this topology, Rⁿ is complete, connected, locally compact, Hausdorff, and separable.
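Equation (1.11.10) translates directly into code. A minimal sketch, with the one-dimensional case as a sanity check:

```python
import math

# Sketch: the Euclidean metric d_n of equation (1.11.10).
def d_n(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# a 3-4-5 right triangle in R^2
assert d_n((3.0, 4.0), (0.0, 0.0)) == 5.0
# in one dimension the metric reduces to d(x, y) = |x - y|
assert d_n((7.0,), (4.0,)) == abs(7.0 - 4.0)
```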
The one-dimensional case is particularly important. In this case, the standard Euclidean metric d is given by d(x, y) = |x − y| for
x, y ∈ R. The Borel σ-algebra R can be generated by various collections of intervals. Each of the following collections generates R :
1. B₁ = {(a, b) : a, b ∈ R, a < b}
2. B₂ = {(a, b] : a, b ∈ R, a < b}
3. B₃ = {(−∞, b] : b ∈ R}
Proof
Since the Euclidean topology has a countable base, R is countably generated. In fact each collection of intervals above, but with
endpoints restricted to Q, generates R . Moreover, R can also be constructed from σ-algebras that are generated by countable
partitions. First recall that for n ∈ N , the set of dyadic rationals (or binary rationals) of rank n or less is Dₙ = {j/2ⁿ : j ∈ Z}. The
dyadic rationals are often useful in various applications because Dₙ has the natural ordered enumeration j ↦ j/2ⁿ for each n ∈ N .
Now let
𝒟ₙ = {(j/2ⁿ, (j + 1)/2ⁿ] : j ∈ Z}, n ∈ N (1.11.11)
Then 𝒟ₙ is a countable partition of R into nonempty intervals of equal size 1/2ⁿ, so Eₙ = σ(𝒟ₙ) consists of unions of sets in 𝒟ₙ.
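The dyadic partition of rank n, and its refinement property (each rank-n interval is the disjoint union of two rank-(n + 1) intervals), can be checked computationally. This sketch restricts to (0, 1] so the collection is finite, and uses exact rational arithmetic:

```python
from fractions import Fraction

# Sketch: the dyadic partition of rank n, restricted to (0, 1] so the
# collection is finite, using exact rational arithmetic.
def dyadic_intervals(n):
    # the intervals (j / 2^n, (j + 1) / 2^n] that cover (0, 1]
    return [(Fraction(j, 2**n), Fraction(j + 1, 2**n)) for j in range(2**n)]

I2 = dyadic_intervals(2)  # 4 intervals of length 1/4
I3 = dyadic_intervals(3)  # 8 intervals of length 1/8
assert len(I2) == 4 and len(I3) == 8

# refinement: each rank-2 interval (a, b] is the disjoint union of the two
# rank-3 intervals (a, m] and (m, b], where m is the midpoint
for a, b in I2:
    m = (a + b) / 2
    assert (a, m) in I3 and (m, b) in I3
```

The refinement property is what makes the corresponding σ-algebras increase with the rank.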
For n ∈ {2, 3, …}, the Euclidean topology on Rⁿ is the n -fold product topology formed from the Euclidean topology on R. So the
Borel σ-algebra Rₙ is also the n -fold power σ-algebra formed from R . Finally, Rₙ can be generated by n -fold products of sets in
any of the collections of intervals above.
If f : S → R and g : S → R are measurable and a ∈ R , then each of the following functions from S into R is also
measurable:
1. f + g
2. f − g
3. f g
4. af
Proof
Similarly, if f : S → R ∖ {0} is measurable, then so is 1/f. Recall that the set of functions from S into R is a vector space, under
the pointwise definitions of addition and scalar multiplication. But once again, we usually want to restrict our attention to
measurable functions. Thus, it's nice to know that the measurable functions from S into R also form a vector space. This follows
immediately from the closure properties (a) and (d) of the previous theorem. Of particular importance in probability and stochastic
processes is the vector space of bounded, measurable functions f : S → R , with the supremum norm
∥f ∥ = sup {|f (x)| : x ∈ S} (1.11.12)
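On a finite domain the supremum in (1.11.12) is attained, so the norm is easy to compute. A small sketch with invented example functions:

```python
# Sketch: the supremum norm of equation (1.11.12), computed over a finite
# domain where the supremum is attained as a maximum. The functions below
# are invented examples.
def sup_norm(f, domain):
    return max(abs(f(x)) for x in domain)

S = range(-3, 4)  # the integers -3, ..., 3
assert sup_norm(lambda x: x, S) == 3
assert sup_norm(lambda x: x * x - 2, S) == 7  # attained at x = -3 or x = 3
```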
The elementary functions that we encounter in calculus and other areas of applied mathematics are functions from subsets of R into
R . The elementary functions include algebraic functions (which in turn include the polynomial and rational functions), the usual
transcendental functions (exponential, logarithm, trigonometric), and the usual functions constructed from these by composition,
the arithmetic operations, and by piecing together. As we might hope, all of the elementary functions are measurable.
This page titled 1.11: Measurable Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
1.12: Special Set Structures
There are several other types of algebraic set structures that are weaker than σ-algebras. These are not particularly important in
themselves, but are important for constructing σ-algebras and the measures on these σ-algebras. You may want to skip this section
if you are not interested in questions of existence and uniqueness of positive measures.
Basic Theory
Definitions
Throughout this section, we assume that S is a set and S is a nonempty collection of subsets of S . Here are the main definitions
we will need.
Closure under intersection is clearly a very simple property, but π-systems turn out to be useful enough to deserve a name.
S is a semi-algebra if it is closed under intersection and if complements can be written as finite, disjoint unions:
1. If A, B ∈ S then A ∩ B ∈ S .
2. If A ∈ S then there exists a finite, disjoint collection {Bᵢ : i ∈ I} ⊆ S such that Aᶜ = ⋃_{i∈I} Bᵢ .
A sequence (A₁, A₂, …) of sets is increasing if Aₙ ⊆ Aₙ₊₁ for all n ∈ N₊ and decreasing if Aₙ₊₁ ⊆ Aₙ for all n ∈ N₊ . Of course,
these are the standard meanings of increasing and decreasing relative to the ordinary order ≤ on N₊ and the subset partial order ⊆ on P(S) .
If the sequence is increasing, we define lim_{n→∞} Aₙ = ⋃_{n=1}^∞ Aₙ , and if the sequence is decreasing, we define
lim_{n→∞} Aₙ = ⋂_{n=1}^∞ Aₙ . The reason for this notation will become clear in the
section on Convergence in the chapter on Probability Spaces. With this notation, a monotone class S is defined by the condition
that if (A₁, A₂, …) is an increasing or decreasing sequence of sets in S then lim_{n→∞} Aₙ ∈ S .
Basic Theorems
Our most important set structure, the σ-algebra, has all of the properties in the definitions above.
Any type of algebraic structure on subsets of S that is defined purely in terms of closure properties will be preserved under
intersection. That is, we will have results that are analogous to how σ-algebras are generated from more basic sets, with completely
straightforward and analogous proofs. In the following two theorems, the term system could mean π-system, λ -system, or monotone
class of subsets of S .
1.12.1 https://stats.libretexts.org/@go/page/10127
The condition that ⋂_{i∈I} Sᵢ be nonempty is unnecessary for a λ -system, by the result above. Now suppose that B is a nonempty
collection of subsets of S , thought of as basic sets of some sort. Then the system generated by B is the intersection of all systems
that contain B .
The system S generated by B is the smallest system containing B , and is characterized by the following properties:
1. B ⊆ S .
2. If T is a system and B ⊆ T then S ⊆T .
Note however, that the previous two results do not apply to semi-algebras, because the semi-algebra is not defined purely in terms
of closure properties (the condition on Aᶜ is not a closure property).
By definition, a semi-algebra is a π-system. More importantly, a semi-algebra can be used to construct an algebra.
Suppose that S is a semi-algebra of subsets of S . Then the collection S* of finite, disjoint unions of sets in S is an algebra.
Proof
We will say that our nonempty collection S is closed under proper set difference if A, B ∈ S and A ⊆ B implies B∖A ∈ S .
The following theorem gives the basic relationship between λ -systems and monotone classes.
The following theorem is known as the monotone class theorem, and is due to the mathematician Paul Halmos.
As noted in (5), a σ-algebra is both a π-system and a λ -system. The converse is also true, and is one of the main reasons for
studying these structures.
The importance of π-systems and λ -systems stems in part from Dynkin's π-λ theorem given next. It's named for the mathematician
Eugene Dynkin.
Euclidean Spaces
The following example is particularly important because it will be used to construct positive measures on R. Let
B = {(a, b] : a, b ∈ R, a < b} ∪ {(−∞, b] : b ∈ R} ∪ {(a, ∞) : a ∈ R} (1.12.3)
B is a semi-algebra of subsets of R.
Proof
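The proof idea is easy to see concretely: the complement of each type of interval in B is a finite disjoint union of at most two intervals in B . A sketch with a minimal interval encoding (the conventions are ours, not the text's):

```python
# Sketch: the complement of each type of interval in B is a finite disjoint
# union of intervals in B. The encoding is ours: an interval is a pair
# (left, right), where None marks an infinite endpoint, so (a, b) stands
# for (a, b], (None, b) for (-oo, b], and (a, None) for (a, oo).
def complement(interval):
    a, b = interval
    parts = []
    if a is not None:
        parts.append((None, a))  # everything at or below a: (-oo, a]
    if b is not None:
        parts.append((b, None))  # everything above b: (b, oo)
    return parts

assert complement((2, 5)) == [(None, 2), (5, None)]  # (2, 5]^c
assert complement((None, 5)) == [(5, None)]          # (-oo, 5]^c
assert complement((2, None)) == [(None, 2)]          # (2, oo)^c
```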
It follows from the theorem above that the collection A of finite disjoint unions of intervals in B is an algebra. Recall also that
σ(B) = σ(A ) is the Borel σ-algebra of R, named for Émile Borel. We can generalize all of this to Rⁿ for n ∈ N₊.
The collection Bₙ = {∏_{i=1}^n Aᵢ : Aᵢ ∈ B for each i ∈ {1, 2, … , n}} is a semi-algebra of subsets of Rⁿ.
Product Spaces
The examples in this discussion are important for constructing positive measures on product spaces.
Suppose that S is a semi-algebra of subsets of a set S and that T is a semi-algebra of subsets of a set T . Then
U = {A × B : A ∈ S , B ∈ T } (1.12.4)
is a semi-algebra of subsets of S × T .
Proof
This result extends in a completely straightforward way to a product of a finite number of sets.
Suppose that n ∈ N₊ and that Sᵢ is a semi-algebra of subsets of a set Sᵢ for i ∈ {1, 2, … , n}. Then
U = {∏_{i=1}^n Aᵢ : Aᵢ ∈ Sᵢ for each i ∈ {1, 2, … , n}} (1.12.5)
is a semi-algebra of subsets of ∏_{i=1}^n Sᵢ .
Proof
The result does not extend to a product of an infinite number of sets. Suppose that Sᵢ is a semi-algebra of subsets of a set Sᵢ for each
i ∈ N₊ , and let U = {∏_{i=1}^∞ Aᵢ : Aᵢ ∈ Sᵢ for all i ∈ N₊}. In general, the complement of a set in U
cannot be written as a finite disjoint union of sets in U .
This page titled 1.12: Special Set Structures is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
CHAPTER OVERVIEW
2: Probability Spaces
The basic topics in this chapter are fundamental to probability theory, and should be accessible to new students of probability. We
start with the paradigm of the random experiment and its mathematical model, the probability space. The main objects in this
model are sample spaces, events, random variables, and probability measures. We also study several concepts of fundamental
importance: conditional probability and independence.
The advanced topics can be skipped if you are a new student of probability, or can be studied later, as the need arises. These topics
include the convergence of random variables, the measure-theoretic foundations of probability theory, and the existence and
construction of probability measures and random processes.
2.1: Random Experiments
2.2: Events and Random Variables
2.3: Probability Measures
2.4: Conditional Probability
2.5: Independence
2.6: Convergence
2.7: Measure Spaces
2.8: Existence and Uniqueness
2.9: Probability Spaces Revisited
2.10: Stochastic Processes
2.11: Filtrations and Stopping Times
This page titled 2: Probability Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.1: Random Experiments
Experiments
Probability theory is based on the paradigm of a random experiment; that is, an experiment whose outcome cannot be predicted
with certainty, before the experiment is run. In classical or frequency-based probability theory, we also assume that the experiment
can be repeated indefinitely under essentially the same conditions. The repetitions can be in time (as when we toss a single coin
over and over again) or in space (as when we toss a bunch of similar coins all at once). The repeatability assumption is important
because the classical theory is concerned with the long-term behavior as the experiment is replicated. By contrast, subjective or
belief-based probability theory is concerned with measures of belief about what will happen when we run the experiment. In this
view, repeatability is a less crucial assumption. In any event, a complete description of a random experiment requires a careful
definition of precisely what information about the experiment is being recorded, that is, a careful definition of what constitutes an
outcome.
The term parameter refers to a non-random quantity in a model that, once chosen, remains constant. Many probability models of
random experiments have one or more parameters that can be adjusted to fit the physical experiment being modeled.
The subjects of probability and statistics have an inverse relationship of sorts. In probability, we start with a completely specified
mathematical model of a random experiment. Our goal is to perform various computations that help us understand the random
experiment, help us predict what will happen when we run the experiment. In statistics, by contrast, we start with an incompletely
specified mathematical model (one or more parameters may be unknown, for example). We run the experiment to collect data, and
then use the data to draw inferences about the unknown factors in the mathematical model.
Compound Experiments
Suppose that we have n experiments (E₁, E₂, … , Eₙ). We can form a new, compound experiment by performing the n
experiments in sequence, E₁ first, and then E₂ and so on, independently of one another. The term independent means, intuitively,
that the outcome of one experiment has no influence over any of the other experiments. We will make the term mathematically
precise later.
In particular, suppose that we have a basic experiment. A fixed number (or even an infinite number) of independent replications of
the basic experiment is a new, compound experiment. Many experiments turn out to be compound experiments and moreover, as
noted above, (classical) probability theory itself is based on the idea of replicating an experiment.
In particular, suppose that we have a simple experiment with two outcomes. Independent replications of this experiment are
referred to as Bernoulli trials, named for Jacob Bernoulli. This is one of the simplest, but most important models in probability.
More generally, suppose that we have a simple experiment with k possible outcomes. Independent replications of this experiment
are referred to as multinomial trials.
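Both kinds of trials are easy to simulate. A sketch using Python's standard library (the equally likely outcomes in the multinomial case are an assumption for illustration; the seed and parameter values are ours):

```python
import random

# Sketch: n Bernoulli trials and n multinomial trials, simulated with the
# standard library. The parameter values below are illustrative.
rng = random.Random(42)

def bernoulli_trials(n, p):
    # n independent replications of an experiment with two outcomes,
    # recording 1 (success) with probability p and 0 (failure) otherwise
    return [1 if rng.random() < p else 0 for _ in range(n)]

def multinomial_trials(n, k):
    # n independent replications of an experiment with k outcomes,
    # here taken to be equally likely for simplicity
    return [rng.randrange(1, k + 1) for _ in range(n)]

coins = bernoulli_trials(10, 0.5)   # 10 coin tosses
dice = multinomial_trials(10, 6)    # 10 throws of a six-sided die
assert all(x in (0, 1) for x in coins)
assert all(1 <= x <= 6 for x in dice)
```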
Sometimes an experiment occurs in well-defined stages, but in a dependent way, in the sense that the outcome of a given stage is
influenced by the outcomes of the previous stages.
Sampling Experiments
In most statistical studies, we start with a population of objects of interest. The objects may be people, memory chips, or acres of
corn, for example. Usually there are one or more numerical measurements of interest to us—the height and weight of a person, the
lifetime of a memory chip, the amount of rain, amount of fertilizer, and yield of an acre of corn.
Although our interest is in the entire population of objects, this set is usually too large and too amorphous to study. Instead, we
collect a random sample of objects from the population and record the measurements of interest for each object in the sample.
There are two basic types of sampling. If we sample with replacement, each item is replaced in the population before the next draw;
thus, a single object may occur several times in the sample. If we sample without replacement, objects are not replaced in the
population. The chapter on Finite Sampling Models explores a number of models based on sampling from a finite population.
Sampling with replacement can be thought of as a compound experiment, based on independent replications of the simple
experiment of drawing a single object from the population and recording the measurements of interest. Conversely, a compound
experiment that consists of n independent replications of a simple experiment can usually be thought of as a sampling experiment.
2.1.1 https://stats.libretexts.org/@go/page/10129
On the other hand, sampling without replacement is an experiment that consists of dependent stages, because the population
changes with each draw.
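The two sampling schemes correspond directly to two standard-library functions. A sketch, on an illustrative population of 100 labeled objects:

```python
import random

# Sketch: the two basic sampling schemes, on an illustrative population
# of 100 labeled objects.
random.seed(17)
population = list(range(1, 101))

with_replacement = random.choices(population, k=10)    # objects may repeat
without_replacement = random.sample(population, k=10)  # objects are distinct

assert len(set(without_replacement)) == len(without_replacement)
assert all(x in population for x in with_replacement)
```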
Figure 2.1.1 : Obverse and reverse sides of a Roman coin, about 241 CE, from Wikipedia
Consider the coin experiment of tossing a coin n times and recording the score (1 for heads or 0 for tails) for each toss.
1. Identify a parameter of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
4. Interpret the experiment as n Bernoulli trials.
Answer
In the simulation of the coin experiment, set n = 5 . Run the simulation 100 times and observe the outcomes.
Dice are randomizing devices that, like coins, date to antiquity and come in a variety of sizes and shapes. Typically, the faces of a
die have numbers or other symbols engraved on them. Again, the important fact is that when a die is thrown, a unique face is
chosen (usually the upward face, but sometimes the downward one). For more on dice, see the introductory section in the chapter
on Games of Chance.
Consider the dice experiment of throwing a k -sided die (with faces numbered 1 to k ), n times and recording the scores for each
throw.
1. Identify the parameters of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
4. Identify the experiment as n multinomial trials.
Answer
In reality, most dice are Platonic solids (named for Plato of course) with 4, 6, 8, 12, or 20 sides. The six-sided die is the standard
die.
In the simulation of the dice experiment, set n = 5 . Run the simulation 100 times and observe the outcomes.
In the die-coin experiment, a standard die is thrown and then a coin is tossed the number of times shown on the die. The
sequence of coin scores is recorded (1 for heads and 0 for tails). Interpret the experiment as a compound experiment.
Answer
Note that this experiment can be obtained by randomizing the parameter n in the basic coin experiment in (1).
Run the simulation of the die-coin experiment 100 times and observe the outcomes.
In the coin-die experiment, a coin is tossed. If the coin lands heads, a red die is thrown and if the coin lands tails, a green die is
thrown. The coin score (1 for heads and 0 for tails) and the die score are recorded. Interpret the experiment as a compound
experiment.
Answer
Run the simulation of the coin-die experiment 100 times and observe the outcomes.
Cards
Playing cards, like coins and dice, date to antiquity. From the point of view of probability, the important fact is that a playing card
encodes a number of properties or attributes on the front of the card that are hidden on the back of the card. (Later in this chapter,
these properties will become random variables.) In particular, a standard card deck can be modeled by the Cartesian product set
where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate
encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for
example q♡ rather than (q, ♡) for the queen of hearts). Some other properties, derived from the two main ones, are color
(diamonds and hearts are red, clubs and spades are black), face (jacks, queens, and kings have faces, the other cards do not), and
suit order (from least to highest rank: (♣, ♢, ♡, ♠) ).
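The Cartesian product model of the deck is easy to realize in code. A sketch (the string labels are our own abbreviations):

```python
from itertools import product

# Sketch: the standard deck as a Cartesian product; the string labels
# for denominations and suits are our own abbreviations.
denominations = ['a'] + [str(i) for i in range(2, 11)] + ['j', 'q', 'k']
suits = ['club', 'diamond', 'heart', 'spade']
deck = list(product(denominations, suits))
assert len(deck) == 52

# derived properties: color and face
red = [card for card in deck if card[1] in ('diamond', 'heart')]
faces = [card for card in deck if card[0] in ('j', 'q', 'k')]
assert len(red) == 26 and len(faces) == 12
```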
Consider the card experiment that consists of dealing n cards from a standard deck (without replacement).
1. Identify a parameter of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
Answer
In the simulation of the card experiment, set n = 5 . Run the simulation 100 times and observe the outcomes.
The special case n = 5 is the poker experiment and the special case n = 13 is the bridge experiment.
Open each of the following to see depictions of card playing in some famous paintings.
1. Cheat with the Ace of Clubs by Georges de La Tour
2. The Cardsharps by Michelangelo Merisi da Caravaggio
3. The Card Players by Paul Cézanne
4. His Station and Four Aces by CM Coolidge
5. Waterloo by CM Coolidge
Urn Models
Urn models are often used in probability as simple metaphors for sampling from a finite population.
An urn contains m distinct balls, labeled from 1 to m. The experiment consists of selecting n balls from the urn, without
replacement, and recording the sequence of ball numbers.
1. Identify the parameters of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
Answer
Consider the basic urn model of the previous exercise. Suppose that r of the m balls are red and the remaining m − r balls are
green. Identify an additional parameter of the model. This experiment is a metaphor for sampling from a general dichotomous
population.
Answer
In the simulation of the urn experiment, set m = 100 , r = 40 , and n = 25 . Run the experiment 100 times and observe the
results.
An urn initially contains m balls; r are red and m − r are green. A ball is selected from the urn and removed, and then
replaced with k balls of the same color. The process is repeated. This is known as Pólya's urn model, named after George
Pólya.
1. Identify the parameters of the experiment.
2. Interpret the case k = 0 as a sampling experiment.
3. Interpret the case k = 1 as a sampling experiment.
Answer
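Pólya's urn is straightforward to simulate. A sketch (the parameters m, r, k are as in the exercise; the function name and the seed are ours):

```python
import random

# Sketch simulation of Polya's urn; parameters m, r, k are as in the text,
# while the function name and seed are ours.
def polya_urn(m, r, k, draws, seed=3):
    rng = random.Random(seed)
    red, green = r, m - r
    colors = []
    for _ in range(draws):
        is_red = rng.random() < red / (red + green)
        colors.append('red' if is_red else 'green')
        # the drawn ball is removed and replaced with k balls of its color,
        # a net change of k - 1 balls of that color
        if is_red:
            red += k - 1
        else:
            green += k - 1
    return colors, red, green

colors, red, green = polya_urn(m=10, r=4, k=2, draws=20)
assert len(colors) == 20
assert red + green == 10 + 20 * (2 - 1)  # each draw adds a net of k - 1 balls
```

Note that k = 0 corresponds to sampling without replacement (valid as long as the number of draws is at most m), and k = 1 to sampling with replacement.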
Open the image of the painting Allegory of Fortune by Dosso Dossi. Presumably the young man has chosen lottery tickets
from an urn.
Buffon's coin experiment consists of tossing a coin with radius r ≤ 1/2 on a floor covered with square tiles of side length 1. The
coordinates of the center of the coin are recorded, relative to axes through the center of the square, parallel to the sides. The
experiment is named for comte de Buffon.
1. Identify a parameter of the experiment
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as sampling experiment.
Answer
In the simulation of Buffon's coin experiment, set r = 0.1. Run the experiment 100 times and observe the outcomes.
Reliability
In the usual model of structural reliability, a system consists of n components, each of which is either working or failed. The states
of the components are uncertain, and hence define a random experiment. The system as a whole is also either working or failed,
depending on the states of the components and how the components are connected. For example, a series system works if and only
if each component works, while a parallel system works if and only if at least one component works. More generally, a k out of n
system works if at least k components work.
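The three structure functions can be written in a few lines; series and parallel are the extreme cases k = n and k = 1 of the k out of n system. A sketch, with 1 denoting a working component and 0 a failed one:

```python
# Sketch: system state as a function of component states for the three
# structures described above (1 = working, 0 = failed).
def series(states):
    return int(all(states))

def parallel(states):
    return int(any(states))

def k_out_of_n(states, k):
    return int(sum(states) >= k)

states = [1, 0, 1, 1]
assert series(states) == 0          # one failed component fails the system
assert parallel(states) == 1        # at least one component works
assert k_out_of_n(states, 3) == 1   # 3 of the 4 components work
assert k_out_of_n(states, 4) == 0
# series and parallel are the extreme cases k = n and k = 1
assert k_out_of_n(states, len(states)) == series(states)
assert k_out_of_n(states, 1) == parallel(states)
```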
The reliability model above is a static model. It can be extended to a dynamic model by assuming that each component is initially
working, but has a random time until failure. The system as a whole would also have a random time until failure that would depend
on the component failure times and the structure of the system.
Genetics
In ordinary sexual reproduction, the genetic material of a child is a random combination of the genetic material of the parents.
Thus, the birth of a child is a random experiment with respect to outcomes such as eye color, hair type, and many other physical
2.1.4 https://stats.libretexts.org/@go/page/10129
traits. We are often particularly interested in the random transmission of traits and the random transmission of genetic disorders.
For example, let's consider an overly simplified model of an inherited trait that has two possible states (phenotypes), say a pea
plant whose pods are either green or yellow. The term allele refers to alternate forms of a particular gene, so we are assuming that
there is a gene that determines pod color, with two alleles: g for green and y for yellow. A pea plant has two alleles for the trait
(one from each parent), so the possible genotypes are
gg, alleles for green pods from each parent.
gy, an allele for green pods from one parent and an allele for yellow pods from the other (we usually cannot observe which
parent contributed which allele).
yy, alleles for yellow pods from each parent.
The genotypes gg and yy are called homozygous because the two alleles are the same, while the genotype gy is called heterozygous
because the two alleles are different. Typically, one of the alleles of the inherited trait is dominant and the other recessive. Thus, for
example, if g is the dominant allele for pod color, then a plant with genotype gg or gy has green pods, while a plant with genotype
yy has yellow pods. Genes are passed from parent to child in a random manner, so each new plant is a random experiment with respect to pod color.
For our second example, consider the ABO blood type in humans. Blood type is determined by a gene with three alleles: a, b, and o, where a and b are codominant and o is recessive. The possible genotypes and blood types (phenotypes) are:
Type A : genotype aa or ao
Type B : genotype bb or bo
Type AB: genotype ab
Type O: genotype oo
Of course, blood may be typed in much more extensive ways than the simple ABO typing. The Rh factor (positive or negative) is the most well-known example.
For our third example, consider a sex-linked hereditary disorder in humans. This is a disorder due to a defect on the X chromosome
(one of the two chromosomes that determine gender). Suppose that h denotes the healthy allele and d the defective allele for the
gene linked to the disorder. Women have two X chromosomes, and typically d is recessive. Thus, a woman with genotype hh is
completely normal with respect to the condition; a woman with genotype hd does not have the disorder, but is a carrier, since she
can pass the defective allele to her children; and a woman with genotype dd has the disorder. A man has only one X chromosome
(his other sex chromosome, the Y chromosome, typically plays no role in the disorder). A man with genotype h is normal and a
man with genotype d has the disorder. Examples of sex-linked hereditary disorders are dichromatism, the most common form of
color-blindness, and hemophilia, a bleeding disorder. Again, genes are passed from parent to child in a random manner, so the birth
of a child is a random experiment in terms of the disorder.
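The random transmission of traits in the pea-plant model above can be simulated directly: each parent passes one of its two alleles to the child, chosen at random. The sketch below (function names are ours, and the dominance rule assumes g dominant as in the text) illustrates this:

```python
import random

def child_genotype(parent1, parent2, rng=random):
    # each parent contributes one of its two alleles, chosen at random
    return rng.choice(parent1) + rng.choice(parent2)

def pod_color(genotype):
    # g (green) is dominant, y (yellow) is recessive
    return "green" if "g" in genotype else "yellow"

rng = random.Random(2021)
children = [child_genotype("gy", "gy", rng) for _ in range(10000)]
frac_yellow = sum(pod_color(c) == "yellow" for c in children) / len(children)
print(frac_yellow)  # close to 1/4 for two heterozygous parents
```

The empirical fraction of yellow-pod offspring from two heterozygous (gy) parents is close to 1/4, the classical Mendelian ratio.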
Point Processes
There are a number of important processes that generate “random points in time”. Often the random points are referred to as
arrivals. Here are some specific examples:
times that a piece of radioactive material emits elementary particles
times that customers arrive at a store
times that requests arrive at a web server
failure times of a device
To formalize an experiment, we might record the number of arrivals during a specified interval of time or we might record the
times of successive arrivals.
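Both recording schemes are straightforward to express in code. Here is a minimal sketch using hypothetical arrival times (the data and function names are invented for illustration):

```python
def count_in_interval(times, a, b):
    # number of arrivals during the time interval [a, b]
    return sum(a <= t <= b for t in times)

def interarrival_times(times):
    # times between successive arrivals (rounded to suppress float noise)
    ts = sorted(times)
    return [round(t2 - t1, 6) for t1, t2 in zip(ts, ts[1:])]

arrivals = [0.7, 1.3, 2.9, 4.1, 4.8]          # hypothetical arrival times
print(count_in_interval(arrivals, 1.0, 4.0))  # 2
print(interarrival_times(arrivals))           # [0.6, 1.6, 1.2, 0.7]
```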
There are other processes that produce “random points in space”. For example,
flaws in a piece of sheet metal
errors in a string of symbols (in a computer program, for example)
raisins in a cake
misprints on a page
stars in a region of space
Again, to formalize an experiment, we might record the number of points in a given region of space.
Statistical Experiments
In 1879, Albert Michelson constructed an experiment for measuring the speed of light with an interferometer. The velocity of
light data set contains the results of 100 repetitions of Michelson's experiment. Explore the data set and explain, in a general
way, the variability of the data.
Answer
In 1998, two students at the University of Alabama in Huntsville designed the following experiment: purchase a bag of M&Ms
(of a specified advertised size) and record the counts for red, green, blue, orange, and yellow candies, and the net weight (in
grams). Explore the M&M data set and explain, in a general way, the variability of the data.
Answer
In 1999, two researchers at Belmont University designed the following experiment: capture a cicada in the Middle Tennessee
area, and record the body weight (in grams), the wing length, wing width, and body length (in millimeters), the gender, and the
species type. The cicada data set contains the results of 104 repetitions of this experiment. Explore the cicada data and explain,
in a general way, the variability of the data.
Answer
On June 6, 1761, James Short made 53 measurements of the parallax of the sun, based on the transit of Venus. Explore the
Short data set and explain, in a general way, the variability of the data.
Answer
In 1954, two massive field trials were conducted in an attempt to determine the effectiveness of the new vaccine developed by
Jonas Salk for the prevention of polio. In both trials, a treatment group of children were given the vaccine while a control
group of children were not. The incidence of polio in each group was measured. Explore the polio field trial data set and
explain, in a general way, the underlying random experiment.
Answer
Each year from 1969 to 1972 a lottery was held in the US to determine who would be drafted for military service. Essentially,
the lottery was a ball and urn model and became famous because many believed that the process was not sufficiently random.
Explore the Vietnam draft lottery data set and speculate on how one might judge the degree of randomness.
Answer
4. In flight, the coin rotates about a horizontal axis along a diameter of the coin.
5. In flight, the coin is governed only by the force of gravity. All other possible forces (air resistance or wind, for example) are
neglected.
6. The coin does not bounce or roll after landing (as might be the case if it lands in sand or mud).
Of course, few of these ideal assumptions are valid for real coins tossed by humans. Let t = u/g where g is the acceleration of gravity (in appropriate units). Note that t has units of time (seconds) and hence is independent of how distance is measured. The scaled parameter t represents the time required for the coin to reach its maximum height.
Keller showed that the regions of the parameter space (t, ω) where the coin lands either heads up or tails up are separated by the
curves
ω = (2n ± 1/2) π/(2t),  n ∈ N  (2.1.2)
The parameter n is the total number of revolutions in the toss. A plot of some of these curves is given below. The largest region, in
the lower left corner, corresponds to the event that the coin does not complete even one rotation, and so of course lands heads up,
just as it started. The next region corresponds to one rotation, with the coin landing tails up. In general, the regions alternate
between heads and tails.
This parameter point is far beyond the region shown in our graph, in a region where the curves are exquisitely close together.
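Assuming, as described above, that the regions between consecutive boundary curves alternate heads, tails, heads, and so on starting from the no-rotation region, a parameter point can be classified numerically. The function name and the region-index formula below are our own sketch of that alternation, not Keller's analysis:

```python
import math

def lands_heads(t, omega):
    # boundaries occur where 2*omega*t/pi is a half-odd integer (1/2, 3/2, 5/2, ...),
    # i.e. on the curves omega = (2n +/- 1/2) * pi / (2t); regions between
    # consecutive boundaries alternate heads, tails, heads, ... from the origin
    region = math.floor(2 * omega * t / math.pi + 0.5)
    return region % 2 == 0

print(lands_heads(0.5, 0.1))  # True: the coin barely rotates, landing as it started
```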
This page titled 2.1: Random Experiments is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
2.2: Events and Random Variables
The purpose of this section is to study two basic types of objects that form part of the model of a random experiment. If you are a
new student of probability, just ignore the measure-theoretic terminology and skip the technical details.
Sample Spaces
The Set of Outcomes
Recall that in a random experiment, the outcome cannot be predicted with certainty, before the experiment is run. On the other
hand:
We assume that we can identify a fixed set S that includes all possible outcomes of a random experiment. This set plays the
role of the universal set when modeling the experiment.
For simple experiments, S may be precisely the set of possible outcomes. More often, for complex experiments, S is a
mathematically convenient set that includes the possible outcomes and perhaps other elements as well. For example, if the
experiment is to throw a standard die and record the score that occurs, we would let S = {1, 2, 3, 4, 5, 6}, the set of possible
outcomes. On the other hand, if the experiment is to capture a cicada and measure its body weight (in milligrams), we might
conveniently take S = [0, ∞) , even though most elements of this set are impossible (we hope!). The problem is that we may not
know exactly the outcomes that are possible. Can a light bulb burn without failure for one thousand hours? For one thousand days?
For one thousand years?
Often the outcome of a random experiment consists of one or more real measurements, and thus S consists of all possible measurement sequences, a subset of R^n for some n ∈ N_+. More generally, suppose that we have n experiments and that S_i is the set of outcomes for experiment i ∈ {1, 2, …, n}. Then the Cartesian product S_1 × S_2 × ⋯ × S_n is the natural set of outcomes for the compound experiment that consists of performing the n experiments in sequence. In particular, if we have a basic experiment with S as the set of outcomes, then S^n is the natural set of outcomes for the compound experiment that consists of n replications of the basic experiment. Similarly, if we have an infinite sequence of experiments and S_i is the set of outcomes for experiment i ∈ N_+, then S_1 × S_2 × ⋯ is the natural set of outcomes for the compound experiment that consists of performing the given experiments in sequence. In particular, the set of outcomes for the compound experiment that consists of indefinite replications of a basic experiment is S^∞ = S × S × ⋯. This is an essential special case, because (classical) probability theory is based on the idea of replicating a given experiment.
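Cartesian product outcome sets like these are easy to enumerate when they are finite. A small sketch (the experiments chosen here are our own examples):

```python
from itertools import product

die = range(1, 7)    # outcomes of one die throw
coin = ("h", "t")    # outcomes of one coin toss

# compound experiment: throw a die, then toss a coin
S = list(product(die, coin))
print(len(S))        # 12 = 6 * 2

# n replications of the coin toss: outcome set {h, t}^n
n = 3
S_n = list(product(coin, repeat=n))
print(len(S_n))      # 8 = 2**3
```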
Events
Consider again a random experiment with S as the set of outcomes. Certain subsets of S are referred to as events. Suppose that
A ⊆ S is a given event, and that the experiment is run, resulting in outcome s ∈ S .
Intuitively, you should think of an event as a meaningful statement about the experiment: every such statement translates into an
event, namely the set of outcomes for which the statement is true. In particular, S itself is an event; by definition it always occurs.
At the other extreme, the empty set ∅ is also an event; by definition it never occurs.
For a note on terminology, recall that a mathematical space consists of a set together with other mathematical structures defined on
the set. An example you may be familiar with is a vector space, which consists of a set (the vectors) together with the operations of
addition and scalar multiplication. In probability theory, many authors use the term sample space for the set of outcomes of a
random experiment, but here is the more careful definition:
The sample space of an experiment is (S, S ) where S is the set of outcomes and S is the collection of events.
Details
2.2.1 https://stats.libretexts.org/@go/page/10130
The Algebra of Events
The standard algebra of sets leads to a grammar for discussing random experiments and allows us to construct new events from
given events. In the following results, suppose that S is the set of outcomes of a random experiment, and that A and B are events.
A∩B is the event that occurs if and only if A occurs and B occurs.
Proof
A and B are disjoint if and only if they are mutually exclusive; they cannot both occur on the same run of the experiment.
Proof
A^c is the event that occurs if and only if A does not occur.
Proof
A∖B is the event that occurs if and only if A occurs and B does not occur.
Proof
(A ∩ B^c) ∪ (B ∩ A^c) is the event that occurs if and only if one but not both of the given events occurs.
Proof
Recall that the event in (10) is the symmetric difference of A and B , and is sometimes denoted AΔB . This event corresponds to
exclusive or, as opposed to the ordinary union A ∪ B which corresponds to inclusive or.
(A ∩ B) ∪ (A^c ∩ B^c) is the event that occurs if and only if both or neither of the given events occurs.
Proof
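The algebra of events above maps directly onto Python's set operations. A small illustration (the particular outcome set and events are made up for the example):

```python
S = set(range(1, 7))   # outcomes of a die throw
A = {2, 4, 6}          # the event that the score is even
B = {4, 5, 6}          # the event that the score is at least 4

print(sorted(A & B))   # [4, 6]        A and B both occur
print(sorted(A | B))   # [2, 4, 5, 6]  at least one occurs
print(sorted(S - A))   # [1, 3, 5]     A does not occur (complement)
print(sorted(A - B))   # [2]           A occurs but B does not
print(sorted(A ^ B))   # [2, 5]        exactly one occurs (symmetric difference)
print(sorted((A & B) | ((S - A) & (S - B))))  # [1, 3, 4, 6]  both or neither
```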
In the Venn diagram app, observe the diagram of each of the 16 events that can be constructed from A and B .
Suppose now that A = {A_i : i ∈ I} is a collection of events for the random experiment, where I is a countable index set.
⋃ A = ⋃_{i ∈ I} A_i is the event that occurs if and only if at least one event in the collection occurs.
Proof
⋂ A = ⋂_{i ∈ I} A_i is the event that occurs if and only if every event in the collection occurs.
Proof
A is a pairwise disjoint collection if and only if the events are mutually exclusive; at most one of the events could occur on a
given run of the experiment.
Proof
⋃_{n=1}^∞ ⋂_{i=n}^∞ A_i is the event that occurs if and only if all but finitely many of the given events occur. This event is sometimes called the limit inferior of the sequence of events.
Proof
Limit superiors and inferiors are discussed in more detail in the section on convergence.
Random Variables
Intuitively, a random variable is a measurement of interest in the context of the experiment. Simple examples include the number
of heads when a coin is tossed several times, the sum of the scores when a pair of dice are thrown, the lifetime of a device subject
to random stress, the weight of a person chosen from a population. Many more examples are given in the exercises below.
Mathematically, a random variable is a function defined on the set of outcomes.
A function X from S into a set T is a random variable for the experiment with values in T .
Details
Probability has its own notation, very different from other branches of mathematics. As a case in point, random variables, even
though they are functions, are usually denoted by capital letters near the end of the alphabet. The use of a letter near the end of the
alphabet is intended to emphasize the idea that the object is a variable in the context of the experiment. The use of a capital letter is
intended to emphasize the fact that it is not an ordinary algebraic variable to which we can assign a specific value, but rather a
random variable whose value is indeterminate until we run the experiment. Specifically, when we run the experiment an outcome
s ∈ S occurs, and random variable X takes the value X(s) ∈ T .
more natural since we think of X as a variable in the experiment. Think of {X ∈ B} as a statement about X, which then translates
into the event {s ∈ S : X(s) ∈ B}.
As with a general function, the result in part (a) holds for the union of a countable collection of subsets, and the result in part (b)
holds for the intersection of a countable collection of subsets. No new ideas are involved; only the notation is more complicated.
Often, a random variable takes values in a subset T of R^k for some k ∈ N_+. We might express such a random variable as X = (X_1, X_2, …, X_k) where X_i is a real-valued random variable for each i ∈ {1, 2, …, k}. In this case, we usually refer to X as a random vector, to emphasize its higher-dimensional character. A random variable can have an even more complicated structure. For example, if the experiment is to select n objects from a population and record various real measurements for each object, then the outcome of the experiment is a vector of vectors: X = (X_1, X_2, …, X_n) where X_i is the vector of measurements for the ith object. There are other possibilities; a random variable could be an infinite sequence, or could be set-valued. Specific
examples are given in the computational exercises below. However, the important point is simply that a random variable is a
function defined on the set of outcomes S .
The outcome of the experiment itself can be thought of as a random variable. Specifically, let T = S and let X denote the identity
function on S so that X(s) = s for s ∈ S . Then trivially X is a random variable, and the events that can be defined in terms of X
are simply the original events of the experiment. That is, if A is an event then {X ∈ A} = A . Conversely, every random variable
effectively defines a new random experiment.
In the general setting above, a random variable X defines a new random experiment with T as the new set of outcomes and
subsets of T as the new collection of events.
Details
In fact, often a random experiment is modeled by specifying the random variables of interest, in the language of the experiment.
Then, a mathematical definition of the random variables specifies the sample space. A function (or transformation) of a random
variable defines a new random variable.
Suppose that X is a random variable for the experiment with values in T and that g is a function from T into another set U .
Then Y = g(X) is a random variable with values in U .
Details
Note that, as functions, g(X) = g ∘ X , the composition of g with X. But again, thinking of X and Y as variables in the context of
the experiment, the notation Y = g(X) is much more natural.
Indicator Variables
For an event A , the indicator function of A is called the indicator variable of A .
The value of this random variable tells us whether or not A has occurred:

1_A = 1 if A occurs, 1_A = 0 if A does not occur  (2.2.1)

As a function on the set of outcomes,

1_A(s) = 1 if s ∈ A, 1_A(s) = 0 if s ∉ A  (2.2.2)
If X is a random variable that takes values 0 and 1, then X is the indicator variable of the event {X = 1} .
Proof
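An indicator variable is easy to realize as a function on the set of outcomes. A minimal sketch (the helper name and the particular event are ours):

```python
def indicator(A):
    # returns the indicator variable 1_A as a function on the outcomes
    return lambda s: 1 if s in A else 0

A = {2, 4, 6}              # the event that a die score is even
one_A = indicator(A)
print(one_A(4), one_A(3))  # 1 0

# the indicator determines the event: {s : 1_A(s) = 1} recovers A
S = range(1, 7)
print({s for s in S if one_A(s) == 1})  # {2, 4, 6}
```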
Recall also that the set algebra of events translates into the arithmetic algebra of indicator variables.
1. 1_{A∩B} = 1_A 1_B = min{1_A, 1_B}
2. 1_{A∪B} = 1 − (1 − 1_A)(1 − 1_B) = max{1_A, 1_B}
3. 1_{B∖A} = 1_B (1 − 1_A)
4. 1_{A^c} = 1 − 1_A
5. A ⊆ B if and only if 1_A ≤ 1_B

The result in part (1) extends to arbitrary intersections and the result in part (2) extends to arbitrary unions. If the event A has a complicated description, we sometimes use 1(A) for the indicator variable rather than 1_A.
Coins and Dice
The basic coin experiment consists of tossing a coin n times and recording the sequence of scores (X_1, X_2, …, X_n) (where 1 denotes heads and 0 denotes tails). This experiment is a generic example of n Bernoulli trials, named for Jacob Bernoulli.
Consider the coin experiment with n = 4, and let Y denote the number of heads.
1. Give the set of outcomes S in list form.
2. Give the event {Y = k} in list form for each k ∈ {0, 1, 2, 3, 4}.
Answer
In the simulation of the coin experiment, set n = 4. Run the experiment 100 times and count the number of times that the event {Y = 2} occurs.
Now consider the general coin experiment with the coin tossed n times, and let Y denote the number of heads.
1. Give the set of outcomes S in Cartesian product form, and give the cardinality of S .
2. Express Y as a function on S .
3. Find #{Y = k} (as a subset of S ) for k ∈ {0, 1, … , n}
Answer
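A direct enumeration (our own Python sketch, not the text's simulation app) confirms the counting pattern, with #{Y = k} given by the binomial coefficients:

```python
from itertools import product
from math import comb

n = 4
S = list(product((0, 1), repeat=n))   # outcomes: bit sequences, 1 = heads
print(len(S))                         # 16 = 2**n

def Y(s):
    # the number of heads in outcome s
    return sum(s)

for k in range(n + 1):
    count = sum(1 for s in S if Y(s) == k)
    print(k, count)                   # 0 1, 1 4, 2 6, 3 4, 4 1
    assert count == comb(n, k)        # the binomial coefficient C(n, k)
```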
The basic dice experiment consists of throwing n distinct k-sided dice (with faces numbered from 1 to k) and recording the sequence of scores (X_1, X_2, …, X_n). This experiment is a generic example of n multinomial trials. The special case k = 6 corresponds to standard dice.
Consider the dice experiment with n = 2 standard dice. Let S denote the set of outcomes, A the event that the first die score is
1, and B the event that the sum of the scores is 7. Give each of the following events in the form indicated:
1. S in Cartesian product form
2. A in list form
3. B in list form
4. A ∪ B in list form
5. A ∩ B in list form
6. A^c ∩ B^c in predicate form
Answer
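The events in this exercise can be enumerated directly; a quick Python sketch (ours, not part of the text):

```python
from itertools import product

S = set(product(range(1, 7), repeat=2))   # ordered pairs of scores
A = {s for s in S if s[0] == 1}           # first die score is 1
B = {s for s in S if sum(s) == 7}         # sum of the scores is 7

print(len(S), len(A), len(B))   # 36 6 6
print(A & B)                    # {(1, 6)}
print(len(A | B))               # 11, by inclusion-exclusion: 6 + 6 - 1
```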
In the simulation of the dice experiment, set n = 2 . Run the experiment 100 times and count the number of times each event in
the previous exercise occurs.
Consider the dice experiment with n = 2 standard dice, and let S denote the set of outcomes, Y the sum of the scores, U the
minimum score, and V the maximum score.
1. Express Y as a function on S and give the set of possible values in list form.
2. Express U as a function on S and give the set of possible values in list form.
3. Express V as a function on S and give the set of possible values in list form.
4. Give the set of possible values of (U, V) in predicate form.
Answer
Consider again the dice experiment with n = 2 standard dice, and let S denote the set of outcomes, Y the sum of the scores, U
the minimum score, and V the maximum score. Give each of the following as subsets of S , in list form.
1. {X_1 < 3, X_2 > 4}
2. {Y = 7}
3. {U = 2}
4. {V = 4}
5. {U = V }
Answer
In the dice experiment, set n = 2. Run the experiment 100 times. Count the number of times each event in the previous exercise occurred.
In the general dice experiment with n distinct k -sided dice, let Y denote the sum of the scores, U the minimum score, and V
the maximum score.
1. Give the set of outcomes S and find #(S) .
2. Express Y as a function on S , and give the set of possible values in list form.
3. Express U as a function on S , and give the set of possible values in list form.
4. Express V as a function on S , and give the set of possible values in list form.
5. Give the set of possible values of (U, V) in predicate form.
Answer
The set of outcomes of a random experiment depends of course on what information is recorded. The following exercise is an
illustration.
An experiment consists of throwing a pair of standard dice repeatedly until the sum of the two scores is either 5 or 7. Let A
denote the event that the sum is 5 rather than 7 on the final throw. Experiments of this type arise in the casino game craps.
1. Suppose that the pair of scores on each throw is recorded. Define the set of outcomes of the experiment and describe A as a
subset of this set.
2. Suppose that the pair of scores on the final throw is recorded. Define the set of outcomes of the experiment and describe A
as a subset of this set.
Answer
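If only the final throw matters, the stopped experiment is easy to simulate; a sketch (our code, with a hypothetical function name):

```python
import random

def five_before_seven(rng):
    # throw a pair of dice until the sum is 5 or 7;
    # return True if the final sum is 5 (the event A)
    while True:
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total in (5, 7):
            return total == 5

rng = random.Random(7)
runs = [five_before_seven(rng) for _ in range(10000)]
print(sum(runs) / len(runs))  # close to 4/10: 4 ways to roll a 5 vs 6 ways to roll a 7
```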
Suppose that 3 standard dice are rolled and the sequence of scores (X_1, X_2, X_3) is recorded. A person pays $1 to play. If some of the dice come up 6, then the player receives her $1 back, plus $1 for each 6. Otherwise she loses her $1. Let W denote the
person's net winnings. This is the game of chuck-a-luck and is treated in more detail in the chapter on Games of Chance.
1. Give the set of outcomes S in Cartesian product form.
2. Express W as a function on S and give the set of possible values in list form.
Answer
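The net-winnings variable W can be expressed and enumerated directly; a sketch (ours):

```python
from itertools import product

S = list(product(range(1, 7), repeat=3))   # Cartesian product form of the outcomes

def W(s):
    # net winnings: $1 per six if any sixes appear, otherwise the $1 stake is lost
    sixes = s.count(6)
    return sixes if sixes > 0 else -1

print(sorted(set(W(s) for s in S)))   # [-1, 1, 2, 3]
```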
Play the chuck-a-luck experiment a few times and see how you do.
In the die-coin experiment, a standard die is rolled and then a coin is tossed the number of times shown on the die. The
sequence of coin scores X is recorded (0 for tails, 1 for heads). Let N denote the die score and Y the number of heads.
1. Give the set of outcomes S in terms of Cartesian powers and find #(S) .
2. Express N as a function on S and give the set of possible values in list form.
3. Express Y as a function on S and give the set of possible values in list form.
4. Give the event A that all tosses result in heads in list form.
Answer
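The die-coin outcome set is a union of Cartesian powers, which can be enumerated directly. In this sketch (our code) an outcome is the full sequence of coin scores, so N is the length of the sequence and Y the number of heads:

```python
from itertools import product

# outcomes: one sequence from {0, 1}^n for each die score n in {1, ..., 6}
S = [xs for n in range(1, 7) for xs in product((0, 1), repeat=n)]
print(len(S))   # 126 = 2 + 4 + 8 + 16 + 32 + 64

N = len         # die score = length of the toss sequence
Y = sum         # number of heads

A = [xs for xs in S if Y(xs) == N(xs)]   # the event that all tosses are heads
print(len(A))   # 6, one all-heads sequence per die score
```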
Run the simulation of the die-coin experiment 10 times. For each run, give the values of the random variables X, N , and Y of
the previous exercise. Count the number of times the event A occurs.
In the coin-die experiment, we have a coin and two distinct dice, say one red and one green. First the coin is tossed, and then if
the result is heads the red die is thrown, while if the result is tails the green die is thrown. The coin score X and the score of the
chosen die Y are recorded. Suppose now that the red die is a standard 6-sided die, and the green die a 4-sided die.
1. Give the set of outcomes S in list form.
2. Express X as a function on S .
3. Express Y as a function on S .
4. Give the event {Y ≥ 3} as a subset of S in list form.
Answer
Run the coin-die experiment 100 times, with various types of dice.
Sampling Models
Recall that many random experiments can be thought of as sampling experiments. For the general finite sampling model, we start
with a population D with m (distinct) objects. We select a sample of n objects from the population. If the sampling is done in a
random way, then we have a random experiment with the sample as the basic outcome. Thus, the set of outcomes of the experiment
is literally the set of samples; this is the historical origin of the term sample space. There are four common types of sampling from
a finite population, based on the criteria of order and replacement. Recall the following facts from the section on combinatorial
structures:
1. If the sampling is with replacement and with regard to order, then the set of samples is the set of all sequences of length n from D. The number of samples is m^n.
2. If the sampling is without replacement and with regard to order, then the set of samples is the set of all permutations of size n from D. The number of samples is m^(n) = m(m − 1) ⋯ [m − (n − 1)].
3. If the sampling is without replacement and without regard to order, then the set of samples is the set of all combinations (or subsets) of size n from D. The number of samples is (m choose n).
4. If the sampling is with replacement and without regard to order, then the set of samples is the set of all multisets of size n from D. The number of samples is (m + n − 1 choose n).
If we sample with replacement, the sample size n can be any positive integer. If we sample without replacement, the sample size
cannot exceed the population size, so we must have n ∈ {1, 2, … , m}.
The basic coin and dice experiments are examples of sampling with replacement. If we toss a coin n times and record the sequence
of scores (where as usual, 0 denotes tails and 1 denotes heads), then we generate an ordered sample of size n with replacement
from the population {0, 1}. If we throw n (distinct) standard dice and record the sequence of scores, then we generate an ordered
sample of size n with replacement from the population {1, 2, 3, 4, 5, 6}.
Suppose that the sampling is without replacement (the most common case). If we record the ordered sample X = (X_1, X_2, …, X_n), then the unordered sample W = {X_1, X_2, …, X_n} is a random variable (that is, a function of X). On
the other hand, if we just record the unordered sample W in the first place, then we cannot recover the ordered sample. Note also
that the number of ordered samples of size n is simply n! times the number of unordered samples of size n . No such simple
relationship exists when the sampling is with replacement. This will turn out to be an important point when we study probability
models based on random samples, in the next section.
Consider a sample of size n = 3 chosen without replacement from the population {a, b, c, d, e}.
1. Give T , the set of unordered samples in list form.
2. Give in list form the set of all ordered samples that correspond to the unordered sample {b, c, e}.
3. Note that for every unordered sample, there are 6 ordered samples.
4. Give the cardinality of S , the set of ordered samples.
Answer
Traditionally in probability theory, an urn containing balls is often used as a metaphor for a finite population.
Suppose that an urn contains 50 (distinct) balls. A sample of 10 balls is selected from the urn. Find the number of samples in
each of the following cases:
1. Ordered samples with replacement
2. Ordered samples without replacement
3. Unordered samples without replacement
4. Unordered samples with replacement
Answer
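The four counts requested above follow from the sampling formulas and can be computed with Python's math module; a minimal sketch (our code, not part of the text):

```python
from math import comb, perm

m, n = 50, 10
print(m ** n)              # ordered samples with replacement: m^n
print(perm(m, n))          # ordered samples without replacement: m^(n)
print(comb(m, n))          # unordered samples without replacement: (m choose n)
print(comb(m + n - 1, n))  # unordered samples with replacement (multisets)
```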
Suppose again that we have a population D with m (distinct) objects, but suppose now that each object is one of two types—either
type 1 or type 0. Such populations are said to be dichotomous. Here are some specific examples:
The population consists of persons, each either male or female.
The population consists of voters, each either democrat or republican.
The population consists of devices, each either good or defective.
The population consists of balls, each either red or green.
Suppose that the population D has r type 1 objects and hence m − r type 0 objects. Of course, we must have r ∈ {0, 1, … , m}.
Now suppose that we select a sample of size n without replacement from the population. Note that this model has three parameters:
the population size m, the number of type 1 objects in the population r, and the sample size n .
Let Y denote the number of type 1 objects in the sample. Then
1. #{Y = k} = (n choose k) r^(k) (m − r)^(n−k) for each k ∈ {0, 1, …, n}, if the event is considered as a subset of S, the set of ordered samples.
2. #{Y = k} = (r choose k)(m − r choose n − k) for each k ∈ {0, 1, …, n}, if the event is considered as a subset of T, the set of unordered samples.
3. The expression in (1) is n! times the expression in (2).
Proof
A batch of 50 components consists of 40 good components and 10 defective components. A sample of 5 components is
selected, without replacement. Let Y denote the number of defectives in the sample.
1. Let S denote the set of ordered samples. Find #(S) .
2. Let T denote the set of unordered samples. Find #(T ).
3. As a subset of T , find #{Y = k} for each k ∈ {0, 1, 2, 3, 4, 5}.
Answer
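The counts in this exercise can be computed from the formulas above; a sketch (our code):

```python
from math import comb, perm

m, r, n = 50, 10, 5          # batch size, number of defectives, sample size
print(perm(m, n))            # number of ordered samples
print(comb(m, n))            # number of unordered samples
for k in range(n + 1):
    # unordered samples with exactly k defectives: (r choose k)(m - r choose n - k)
    print(k, comb(r, k) * comb(m - r, n - k))
```

Summing the counts over k recovers the total number of unordered samples (Vandermonde's identity), a quick sanity check.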
Run the simulation of the ball and urn experiment 100 times for the parameter values in the last exercise: m = 50 , r = 10 ,
n = 5 . Note the values of the random variable Y .
Cards
Recall that a standard card deck can be modeled by the Cartesian product set

D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠}

where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example q♡ rather than (q, ♡) for the queen of hearts).
Most card games involve sampling without replacement from the deck D, which plays the role of the population. Thus, the basic
card experiment consists of dealing n cards from a standard deck without replacement; in this special context, the sample of cards
is often referred to as a hand. Just as in the general sampling model, if we record the ordered hand X = (X_1, X_2, …, X_n), then the unordered hand W = {X_1, X_2, …, X_n} is a random variable (that is, a function of X). On the other hand, if we just record
the unordered hand W in the first place, then we cannot recover the ordered hand. Finally, recall that n = 5 is the poker
experiment and n = 13 is the bridge experiment. The game of poker is treated in more detail in the chapter on Games of Chance.
Suppose that a single card is dealt from a standard deck. Let Q denote the event that the card is a queen and H the event that
the card is a heart. Give each of the following events in list form:
1. Q
2. H
3. Q ∪ H
4. Q ∩ H
5. Q ∖ H
Answer
In the card experiment, set n = 1. Run the experiment 100 times and count the number of times each event in the previous exercise occurs.
Suppose that two cards are dealt from a standard deck and the sequence of cards recorded. Let S denote the set of outcomes, and let Q_i denote the event that the ith card is a queen and H_i the event that the ith card is a heart for i ∈ {1, 2}. Find the cardinality of each of the following events:
3. H2
4. H1 ∩ H2
5. Q1 ∩ H1
6. Q1 ∩ H2
7. H1 ∪ H2
Answer
Consider the general card experiment in which n cards are dealt from a standard deck, and the ordered hand X is recorded.
1. Give the cardinality of S , the set of values of the ordered hand X.
2. Give the cardinality of T , the set of values of the unordered hand W .
3. How many ordered hands correspond to a given unordered hand?
4. Explicitly compute the numbers in (a) and (b) when n = 5 (poker).
5. Explicitly compute the numbers in (a) and (b) when n = 13 (bridge).
Answer
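The hand counts for both special cases follow from the sampling formulas; a sketch with Python's math module (our code):

```python
from math import comb, perm, factorial

for n in (5, 13):                               # poker and bridge hands
    ordered = perm(52, n)                       # number of ordered hands
    unordered = comb(52, n)                     # number of unordered hands
    assert ordered == unordered * factorial(n)  # n! ordered hands per unordered hand
    print(n, unordered)                         # 5 2598960, 13 635013559600
```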
Consider the bridge experiment of dealing 13 cards from a deck and recording the unordered hand. In the most common point
counting system, an ace is worth 4 points, a king 3 points, a queen 2 points, and a jack 1 point. The other cards are worth 0
points. Let S denote the set of outcomes of the experiment and V the point value of the hand.
1. Find the set of possible values of V .
2. Find the cardinality of the event {V = 0} as a subset of S .
Answer
In the card experiment, set n = 13 and run the experiment 100 times. For each run, compute the value of the random variable V in the previous exercise.
Consider the poker experiment of dealing 5 cards from a deck. Find the cardinality of each of the events below, as a subset of
the set of unordered hands.
1. A : the event that the hand is a full house (3 cards of one kind and 2 of another kind).
2. B : the event that the hand has 4 of a kind (4 cards of one kind and 1 of another kind).
3. C : the event that all cards in the hand are in the same suit (the hand is a flush or a straight flush).
Answer
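The counting arguments for these three events can be sketched computationally; the factorizations in the comments are the standard ones, offered here as a check rather than as the text's official solution:

```python
from math import comb

# A: full house -- rank of the triple, its 3 suits, then rank of the pair, its 2 suits
full_house = 13 * comb(4, 3) * 12 * comb(4, 2)
# B: four of a kind -- rank of the quadruple, then any one of the other 48 cards
four_kind = 13 * comb(4, 4) * 48
# C: all five cards in one suit -- choose the suit, then 5 of its 13 ranks
same_suit = 4 * comb(13, 5)

print(full_house, four_kind, same_suit)  # 3744 624 5148
```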
Run the poker experiment 1000 times. Note the number of times that the events A , B , and C in the previous exercise occurred.
Consider the bridge experiment of dealing 13 cards from a standard deck. Let S denote the set of unordered hands, Y the
number of hearts in the hand, and Z the number of queens in the hand.
1. Find the cardinality of the event {Y = y} as a subset of S for each y ∈ {0, 1, … , 13}.
2. Find the cardinality of the event {Z = z} as a subset of S for each z ∈ {0, 1, 2, 3, 4}.
Answer
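The counts in this exercise follow a common pattern (choose the cards of the special type, then the rest); a minimal sketch, with a sanity check that each family of counts partitions the set of all bridge hands:

```python
from math import comb

def hearts_count(y):
    # Hands with exactly y hearts: choose the hearts, then the non-hearts
    return comb(13, y) * comb(39, 13 - y)

def queens_count(z):
    # Hands with exactly z queens: choose the queens, then the non-queens
    return comb(4, z) * comb(48, 13 - z)

# Sanity check: each family of counts partitions the C(52, 13) bridge hands
assert sum(hearts_count(y) for y in range(14)) == comb(52, 13)
assert sum(queens_count(z) for z in range(5)) == comb(52, 13)
```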
Geometric Models
In the experiments that we have considered so far, the sample spaces have all been discrete (so that the set of outcomes is finite or
countably infinite). In this subsection, we consider Euclidean sample spaces where the set of outcomes S is continuous in a sense
that we will make clear later. The experiments we consider are sometimes referred to as geometric models because they involve
selecting a point at random from a Euclidean set.
We first consider Buffon's coin experiment, which consists of tossing a coin with radius r ≤ 1/2 randomly on a floor covered with square tiles of side length 1. The coordinates (X, Y) of the center of the coin are recorded relative to axes through the center of the square in which the coin lands. Buffon's experiments are studied in more detail in the chapter on Geometric Models and are named for Comte de Buffon.
4. Express Z as a function on S .
5. Express the event {X < Y } as a subset of S .
6. Express the event {Z ≤ 1/2} as a subset of S .
Answer
Run Buffon's coin experiment 100 times with r = 0.2 . For each run, note whether event A occurs and compute the value of
random variable Z .
A point (X, Y) is chosen at random in the circular region of radius 1 in R^2 centered at the origin. Let S denote the set of outcomes. Let A denote the event that the point is in the inscribed square region centered at the origin, with sides parallel to the coordinate axes. Let B denote the event that the point is in the inscribed square with vertices (±1, 0), (0, ±1).
1. Describe S mathematically and sketch the set.
2. Describe A mathematically and sketch the set.
3. Describe B mathematically and sketch the set.
4. Sketch A ∪ B
5. Sketch A ∩ B
6. Sketch A ∩ B^c
Answer
Reliability
In the simple model of structural reliability, a system is composed of n components, each of which is either working or failed. The state of component i is an indicator random variable X_i, where 1 means working and 0 means failure. Thus, X = (X_1, X_2, …, X_n) is a vector of indicator random variables that specifies the states of all of the components, and therefore the set of outcomes of the experiment is S = {0, 1}^n. The system as a whole is also either working or failed, depending only on the states of the components and how the components are connected together. Thus, the state of the system is also an indicator random variable and is a function of X. The state of the system (working or failed) as a function of the states of the components is the structure function.
A series system is working if and only if each component is working. The state of the system is
U = X1 X2 ⋯ Xn = min { X1 , X2 , … , Xn } (2.2.9)
A parallel system is working if and only if at least one component is working. The state of the system is
V = 1 − (1 − X1 ) (1 − X2 ) ⋯ (1 − Xn ) = max { X1 , X2 , … , Xn } (2.2.10)
More generally, a k out of n system is working if and only if at least k of the n components are working. Note that a parallel system is a 1 out of n system and a series system is an n out of n system. A k out of 2k − 1 system is a majority rules system.
The state of the k out of n system is U_{n,k} = 1(∑_{i=1}^{n} X_i ≥ k). The structure function can also be expressed as a polynomial in the variables.
Explicitly give the state of the k out of 3 system, as a polynomial function of the component states (X1 , X2 , X3 ) , for each
k ∈ {1, 2, 3}.
Answer
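The threshold form U_{n,k} = 1(∑ X_i ≥ k) can be checked against a candidate polynomial form by brute force; the polynomial below is the standard majority-rules expression for the 2-out-of-3 system, verified over all 8 component states:

```python
from itertools import product

def k_out_of_n(states, k):
    # Threshold form of the structure function: working iff at least k components work
    return 1 if sum(states) >= k else 0

def two_of_three(x1, x2, x3):
    # Polynomial form of the 2-out-of-3 (majority rules) structure function
    return x1*x2 + x1*x3 + x2*x3 - 2*x1*x2*x3

# The polynomial agrees with the threshold definition on all 8 component states
for x in product([0, 1], repeat=3):
    assert two_of_three(*x) == k_out_of_n(x, 2)
```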
In some cases, the system can be represented as a graph or network. The edges represent the components and the vertices the
connections between the components. The system functions if and only if there is a working path between two designated vertices,
which we will denote by a and b .
Find the state of the Wheatstone bridge network shown below, as a function of the component states. The network is named for
Charles Wheatstone.
Answer
3. For each i ∈ {1, 2, …, n}, there exist x and y in {0, 1}^n all of whose coordinates agree, except x_i = 0 and y_i = 1, and
Answer
The model just discussed is a static model. We can extend it to a dynamic model by assuming that component i is initially working, but has a random time to failure T_i, taking values in [0, ∞), for each i ∈ {1, 2, …, n}. Thus, the basic outcome of the experiment is the random vector of failure times (T_1, T_2, …, T_n), and so the set of outcomes is [0, ∞)^n.
Consider the dynamic reliability model for a system with structure function u (valid in the sense of the previous exercise).
1. The state of component i at time t ≥ 0 is X_i(t) = 1(T_i > t).
Suppose that we have two devices and that we record (X, Y ), where X is the failure time of device 1 and Y is the failure time
of device 2. Both variables take values in the interval [0, ∞), where the units are in hundreds of hours. Sketch each of the
following events:
1. The set of outcomes S
2. {X < Y }
3. {X + Y > 2}
Answer
Genetics
Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this
section.
Recall first that the ABO blood type in humans is determined by three alleles: a , b , and o . Furthermore, o is recessive and a and b
are co-dominant.
Suppose that a person is chosen at random and his genotype is recorded. Give each of the following in list form.
1. The set of outcomes S
2. The event that the person is type A
3. The event that the person is type B
4. The event that the person is type AB
5. The event that the person is type O
Answer
Suppose next that pod color in a certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and that g is dominant.
Suppose that n (distinct) pea plants are collected and the sequence of pod color genotypes is recorded.
1. Give the set of outcomes S in Cartesian product form and find #(S) .
2. Let N denote the number of plants with green pods. Find #(N = k) (as a subset of S ) for each k ∈ {0, 1, … , n}.
Answer
Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele
and d the defective allele for the gene linked to the disorder. Recall that d is recessive for women.
Suppose that n women are sampled and the sequence of genotypes is recorded.
1. Give the set of outcomes S in Cartesian product form and find #(S) .
2. Let N denote the number of women who are completely healthy (genotype hh ). Find #(N = k) (as a subset of S ) for
each k ∈ {0, 1, … , n}.
Answer
Radioactive Emissions
The emission of elementary particles from a sample of radioactive material occurs in a random way. Suppose that the time of emission of the ith particle is a random variable T_i taking values in (0, ∞). If we measure these arrival times, then the basic outcome of the experiment is the sequence of arrival times.
Run the simulation of the gamma experiment in single-step mode for different values of the parameters. Observe the arrival
times.
Now let N_t denote the number of emissions in the interval (0, t]. Then
1. N_t = max{n ∈ N_+ : T_n ≤ t}.
2. N_t ≥ n if and only if T_n ≤ t.
Run the simulation of the Poisson experiment in single-step mode for different parameter values. Observe the arrivals in the
specified time interval.
Statistical Experiments
In the basic cicada experiment, a cicada in the Middle Tennessee area is captured and the following measurements recorded:
body weight (in grams), wing length, wing width, and body length (in millimeters), species type, and gender. The cicada data
set gives the results of 104 repetitions of this experiment.
1. Define the set of outcomes S for the basic experiment.
2. Let F be the event that a cicada is female. Describe F as a subset of S . Determine whether F occurs for each cicada in the
data set.
3. Let V denote the ratio of wing length to wing width. Compute V for each cicada.
4. Give the set of outcomes for the compound experiment that consists of 104 repetitions of the basic experiment.
Answer
In the basic M&M experiment, a bag of M&Ms (of a specified size) is purchased and the following measurements recorded:
the number of red, green, blue, yellow, orange, and brown candies, and the net weight (in grams). The M&M data set gives the
results of 30 repetitions of this experiment.
1. Define the set of outcomes S for the basic experiment.
2. Let A be the event that a bag contains at least 57 candies. Describe A as a subset of S .
3. Determine whether A occurs for each bag in the data set.
4. Let N denote the total number of candies. Compute N for each bag in the data set.
5. Give the set of outcomes for the compound experiment that consists of 30 repetitions of the basic experiment.
Answer
This page titled 2.2: Events and Random Variables is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
2.3: Probability Measures
This section contains the final and most important ingredient in the basic model of a random experiment. If you are a new student of probability, you may want to skip the technical details.
Definition
A probability measure (or probability distribution) P on the sample space (S, S) is a real-valued function defined on the collection of events S that satisfies the following axioms:
1. P(A) ≥ 0 for every event A .
2. P(S) = 1 .
3. If {A_i : i ∈ I} is a countable, pairwise disjoint collection of events then
P(⋃_{i∈I} A_i) = ∑_{i∈I} P(A_i) (2.3.1)
Details
Axiom (c) is known as countable additivity, and states that the probability of a union of a finite or countably infinite collection of
disjoint events is the sum of the corresponding probabilities. The axioms are known as the Kolmogorov axioms, in honor of Andrei
Kolmogorov who was the first to formalize probability theory in an axiomatic way. More informally, we say that P is a probability
measure (or distribution) on S , the collection of events S usually being understood.
Axioms (a) and (b) are really just a matter of convention; we choose to measure the probability of an event with a number between 0 and 1 (as opposed, say, to a number between −5 and 7). Axiom (c), however, is fundamental and inescapable. It is required for probability for precisely the same reason that it is required for other measures of the “size” of a set, such as cardinality for finite sets, length for subsets of R, area for subsets of R^2, and volume for subsets of R^3. In all these cases, the size of a set that is composed of countably many disjoint pieces is the sum of the sizes of the pieces.
The Law of Large Numbers
Intuitively, the probability of an event is supposed to measure the long-term relative frequency of the event—in fact, this concept
was taken as the definition of probability by Richard von Mises. Here are the relevant definitions:
Suppose that the experiment is repeated indefinitely, and that A is an event. For n ∈ N_+,
1. Let N_n(A) denote the number of times that A occurred. This is the frequency of A in the first n runs.
2. Let P_n(A) = N_n(A)/n. This is the relative frequency or empirical probability of A in the first n runs.
Note that repeating the original experiment indefinitely creates a new, compound experiment, and that N_n(A) and P_n(A) are
random variables for the new experiment. In particular, the values of these variables are uncertain until the experiment is run n
times. The basic idea is that if we have chosen the correct probability measure for the experiment, then in some sense we expect
that the relative frequency of an event should converge to the probability of the event. That is,
P_n(A) → P(A) as n → ∞, A ∈ S (2.3.2)
regardless of the uncertainty of the relative frequencies on the left. The precise statement of this is the law of large numbers or law
of averages, one of the fundamental theorems in probability. To emphasize the point, note that in general there will be lots of
possible probability measures for an experiment, in the sense of the axioms. However, only the probability measure that models the
experiment correctly will satisfy the law of large numbers.
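A quick simulation illustrates the idea; the event (a die score of at least 5, so P(A) = 1/3) and the random seed are arbitrary choices for this sketch, not part of the text:

```python
import random

random.seed(2023)  # arbitrary seed, for reproducibility

def empirical_probability(event, trials):
    # Relative frequency P_n(A) = N_n(A) / n over repeated runs of the experiment
    hits = sum(event(random.randint(1, 6)) for _ in range(trials))
    return hits / trials

# Event A: the die score is at least 5, so P(A) = 1/3
p_hat = empirical_probability(lambda x: x >= 5, 100_000)
print(abs(p_hat - 1/3))  # small for large n, as the law of large numbers predicts
```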
Given the data from n runs of the experiment, the empirical probability function P_n is a probability measure on S .
Proof
This probability space corresponds to the new random experiment in which the outcome is X. Moreover, recall that the outcome of the experiment itself can be thought of as a random variable. Specifically, if we take T = S and let X be the identity function on S, so that X(s) = s for s ∈ S, then X is a random variable with values in S and P(X ∈ A) = P(A) for each event A. Thus, every probability measure can be thought of as the distribution of a random variable.
Constructions
Measures
How can we construct probability measures? As noted briefly above, there are other measures of the “size” of sets; in many cases,
these can be converted into probability measures. First, a positive measure μ on the sample space (S, S ) is a real-valued function
defined on S that satisfies axioms (a) and (c) in (1), and then (S, S , μ) is a measure space. In general, μ(A) is allowed to be
infinite. However, if μ(S) is positive and finite (so that μ is a finite positive measure), then μ can easily be re-scaled into a
probability measure.
If μ is a positive measure on S with 0 < μ(S) < ∞ then P defined below is a probability measure.
P(A) = μ(A)/μ(S), A ∈ S (2.3.3)
Proof
In this context, μ(S) is called the normalizing constant. In the next two subsections, we consider some very important special
cases.
Discrete Distributions
In this discussion, we assume that the sample space (S, S) is discrete. Recall that this means that the set of outcomes S is countable and that S = P(S) is the collection of all subsets of S, so that every subset is an event. The standard measure on a discrete space is counting measure #, so that #(A) is the number of elements in A for A ⊆ S. When S is finite, the probability measure corresponding to counting measure, as constructed above, is particularly important in combinatorial and sampling experiments.
Suppose that S is a finite, nonempty set. The discrete uniform distribution on S is given by
P(A) = #(A)/#(S), A ⊆ S (2.3.5)
The underlying model is referred to as the classical probability model, because historically the very first problems in probability (involving coins and dice) fit this model.
In the general discrete case, if P is a probability measure on S, then since S is countable, it follows from countable additivity that P is completely determined by its values on the singleton events. Specifically, if we define f(x) = P({x}) for x ∈ S, then P(A) = ∑_{x∈A} f(x) for every A ⊆ S. By axiom (a), f(x) ≥ 0 for x ∈ S and by axiom (b), ∑_{x∈S} f(x) = 1. Conversely, we can
Proof
In the context of our previous remarks, f(x) = g(x)/μ(S) = g(x)/∑_{y∈S} g(y) for x ∈ S. Distributions of this type are said to be discrete.
If S is finite and g is a constant function, then the probability measure P associated with g is the discrete uniform distribution
on S .
Proof
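The normalization construction is easy to sketch for a finite S; the dictionary-based representation below is an illustrative choice, not part of the text:

```python
def probability_from_weights(g):
    # Normalize a nonnegative weight function g on a finite set S
    # into a probability density function f(x) = g(x) / sum of g
    total = sum(g.values())
    return {x: w / total for x, w in g.items()}

# A constant weight function yields the discrete uniform distribution
f = probability_from_weights({x: 1 for x in range(1, 7)})
assert all(abs(p - 1/6) < 1e-12 for p in f.values())
```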
Continuous Distributions
The probability distributions that we will construct next are continuous distributions on R^n for n ∈ N_+ and require some calculus. For n ∈ N_+, n-dimensional Lebesgue measure λ_n is given by
λ_n(A) = ∫_A 1 dx, A ⊆ R^n
In particular, λ_1(A) is the length of A ⊆ R, λ_2(A) is the area of A ⊆ R^2, and λ_3(A) is the volume of A ⊆ R^3.
Details
When n > 3, λ_n(A) is sometimes called the n-dimensional volume of A ⊆ R^n. The probability measure associated with λ_n on a set with positive, finite n-dimensional volume is particularly important.
Note that the continuous uniform distribution is analogous to the discrete uniform distribution defined in (8), but with Lebesgue measure λ_n replacing counting measure #. We can generalize this construction to produce many other distributions.
The function μ defined by μ(A) = ∫_A g(x) dx for A ⊆ S is a positive measure on S. If 0 < μ(S) < ∞, then P defined as follows is a probability measure on S.
P(A) = μ(A)/μ(S) = ∫_A g(x) dx / ∫_S g(x) dx, A ∈ S (2.3.10)
Proof
Distributions of this type are said to be continuous. Continuous distributions are studied in detail in the chapter on Distributions.
Note that the continuous distribution above is analogous to the discrete distribution in (9), but with integrals replacing sums. The
general theory of integration allows us to unify these two special cases, and many others besides.
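As a concrete instance of the construction, with a hypothetical choice of S and g: take S = [0, 2] and g(x) = x, so μ(S) = ∫_0^2 x dx = 2 and P([a, b]) = (b^2 − a^2)/4 for [a, b] ⊆ S:

```python
def P(a, b):
    # Probability of [a, b] within S = [0, 2] after normalizing g(x) = x:
    # P([a, b]) = (integral of x from a to b) / 2 = (b^2 - a^2) / 4
    return (b * b - a * a) / 4.0

print(P(0.0, 2.0), P(0.0, 1.0), P(1.0, 2.0))  # 1.0 0.25 0.75
```

The whole space has probability 1, and the distribution weights the right half of the interval more heavily, as the increasing density g suggests.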
Rules of Probability
Basic Rules
Suppose again that we have a random experiment modeled by a probability space (S, S , P), so that S is the set of outcomes, S
the collection of events, and P the probability measure. In the following theorems, A and B are events. The results follow easily
from the axioms of probability in (1), so be sure to try the proofs yourself before reading the ones in the text.
P(A^c) = 1 − P(A). This is known as the complement rule.
Proof
P(∅) = 0 .
Proof
Figure 2.3.4 : The difference rule
Recall that if A ⊆ B we sometimes write B − A for the set difference, rather than B∖A . With this notation, the difference rule
has the nice form P(B − A) = P(B) − P(A) .
Thus, P is an increasing function, relative to the subset partial order on the collection of events S , and the ordinary order on R. In
particular, it follows that P(A) ≤ 1 for any event A .
Suppose that A ⊆ B .
1. If P(B) = 0 then P(A) = 0 .
2. If P(A) = 1 then P(B) = 1 .
Proof
P(⋃_{i∈I} A_i) ≤ ∑_{i∈I} P(A_i) (2.3.12)
Proof
P(⋃_{i∈I} A_i) = 0 (2.3.13)
An event A with P(A) = 0 is said to be null. Thus, a countable union of null events is still a null event.
The next result is known as Bonferroni's inequality, named after Carlo Bonferroni. The inequality gives a simple lower bound for
the probability of an intersection.
P(⋂_{i∈I} A_i) ≥ 1 − ∑_{i∈I} [1 − P(A_i)] (2.3.14)
Proof
Of course, the lower bound in Bonferroni's inequality may be less than or equal to 0, in which case it's not helpful. The following
result is a simple consequence of Bonferroni's inequality.
P(⋂_{i∈I} A_i) = 1 (2.3.16)
An event A with P(A) = 1 is sometimes called almost sure or almost certain. Thus, a countable intersection of almost sure events
is still almost sure.
Proof
In this formula, note that the comma acts like the intersection symbol in the previous formula.
If A and B are events, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof
The inclusion-exclusion formulas for two and three events can be generalized to n events. For the remainder of this discussion, suppose that {A_i : i ∈ I} is a collection of events where I is an index set with #(I) = n.
Proof by induction
Proof by accounting
Here is the complementary result for the probability of an intersection in terms of the probabilities of the various unions:
P(⋂_{i∈I} A_i) = ∑_{k=1}^{n} (−1)^{k−1} ∑_{J⊆I, #(J)=k} P(⋃_{j∈J} A_j) (2.3.24)
The general inclusion-exclusion formulas are not worth remembering in detail, but only in pattern. For the probability of a union,
we start with the sum of the probabilities of the events, then subtract the probabilities of all of the paired intersections, then add the
probabilities of the third-order intersections, and so forth, alternating signs, until we get to the probability of the intersection of all
of the events.
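The alternating pattern can be verified numerically with counting measure on a small finite universe; the particular sets A, B, C below are arbitrary examples chosen for this sketch:

```python
from itertools import combinations

def union_prob(events, universe):
    # Inclusion-exclusion under the uniform (counting) measure on a finite set:
    # alternate signs over nonempty subcollections, grouped by size k
    total = 0.0
    for k in range(1, len(events) + 1):
        for sub in combinations(events, k):
            total += (-1) ** (k - 1) * len(set.intersection(*sub)) / len(universe)
    return total

U = set(range(20))
A = {x for x in U if x % 2 == 0}
B = {x for x in U if x % 3 == 0}
C = {x for x in U if x < 7}
assert abs(union_prob([A, B, C], U) - len(A | B | C) / len(U)) < 1e-12
```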
The general Bonferroni inequalities (for a union) state that if the sum on the right in the general inclusion-exclusion formula is truncated, then the truncated sum is an upper bound or a lower bound for the probability on the left, depending on whether the last term has a positive or negative sign. Here is the result stated explicitly: for m ∈ {1, 2, …, n},
P(⋃_{i∈I} A_i) ≤ ∑_{k=1}^{m} (−1)^{k−1} ∑_{J⊆I, #(J)=k} P(⋂_{j∈J} A_j) if m is odd, and the inequality is reversed (≥) if m is even.
Proof
More elegant proofs of the inclusion-exclusion formula and the Bonferroni inequalities can be constructed using expected value.
Note that there is a probability term in the inclusion-exclusion formulas for every nonempty subset J of the index set I, with either a positive or negative sign, and hence there are 2^n − 1 such terms. These probabilities suffice to compute the probability of any
event that can be constructed from the given events, not just the union or the intersection.
The probability of any event that can be constructed from {A_i : i ∈ I} can be computed from either of the following collections of 2^n − 1 probabilities:
1. P(⋂_{j∈J} A_j) where J is a nonempty subset of I.
2. P(⋃_{j∈J} A_j) where J is a nonempty subset of I.
Remark
If you go back and look at your proofs of the rules of probability above, you will see that they hold for any finite measure μ , not
just probability. The only change is that the number 1 is replaced by μ(S) . In particular, the inclusion-exclusion rule is as important
in combinatorics (the study of counting measure) as it is in probability.
Suppose that A and B are events in a random experiment with P(A) = 1/3, P(B) = 1/4, and P(A ∩ B) = 1/10. Express each of the following events in the language of the experiment and find its probability:
1. A ∖ B
2. A ∪ B
3. A^c ∪ B^c
4. A^c ∩ B^c
5. A ∪ B^c
Answer
Suppose that A , B , and C are events in an experiment with P(A) = 0.3 , P(B) = 0.2 , P(C ) = 0.4 , P(A ∩ B) = 0.04 ,
P(A ∩ C ) = 0.1 , P(B ∩ C ) = 0.1 , P(A ∩ B ∩ C ) = 0.01 . Express each of the following events in set notation and find its
probability:
1. At least one of the three events occurs.
2. None of the three events occurs.
3. Exactly one of the three events occurs.
4. Exactly two of the three events occur.
Answer
Suppose that A and B are events in a random experiment with P(A ∖ B) = 1/6, P(B ∖ A) = 1/4, and P(A ∩ B) = 1/12. Find the
probability of each of the following events:
1. A
2. B
3. A ∪ B
4. A^c ∪ B^c
5. A^c ∩ B^c
Answer
Suppose that A and B are events in a random experiment with P(A) = 2/5, P(A ∪ B) = 7/10, and P(A ∩ B) = 1/6. Find the probability
of each of the following events:
1. B
2. A ∖ B
3. B ∖ A
4. A^c ∪ B^c
5. A^c ∩ B^c
Answer
Suppose that A , B , and C are events in a random experiment with P(A) = 1/3, P(B) = 1/4, and P(C) = 1/5.
1. Use Boole's inequality to find an upper bound for P(A ∪ B ∪ C ) .
2. Use Bonferroni's inequality to find a lower bound for P(A ∩ B ∩ C ) .
Answer
Suppose that A , B , and C are events in a random experiment with P(A) = 1/4 , P(B) = 1/3 , P(C ) = 1/6 ,
P(A ∩ B) = 1/18 , P(A ∩ C ) = 1/16 , P(B ∩ C ) = 1/12 , and P(A ∩ B ∩ C ) = 1/24 . Find the probabilities of the various
unions:
1. A ∪ B
2. A ∪ C
3. B ∪ C
4. A ∪ B ∪ C
Answer
Suppose that A ,
B , and C are events in a random experiment with P(A) = 1/4 , P(B) = 1/4 , P(C ) = 5/16,
P(A ∪ B) = 7/16 , P(A ∪ C ) = 23/48 , P(B ∪ C ) = 11/24 , and P(A ∪ B ∪ C ) = 7/12 . Find the probabilities of the
various intersections:
1. A ∩ B
2. A ∩ C
3. B ∩ C
4. A ∩ B ∩ C
Answer
Suppose that A , B , and C are events in a random experiment. Explicitly give all of the Bonferroni inequalities for
P(A ∪ B ∪ C )
Proof
Coins
Consider the random experiment of tossing a coin n times and recording the sequence of scores X = (X_1, X_2, …, X_n) (where 1 denotes heads and 0 denotes tails). This experiment is a generic example of n Bernoulli trials, named for Jacob Bernoulli. Note that the set of outcomes is S = {0, 1}^n, the set of bit strings of length n. If the coin is fair, then presumably, by the very meaning of the word, we have no reason to prefer one point in S over another. Thus, as a modeling assumption, it seems reasonable to give S the uniform probability distribution in which all outcomes are equally likely.
Suppose that a fair coin is tossed 3 times and the sequence of coin scores is recorded. Let A be the event that the first coin is
heads and B the event that there are exactly 2 heads. Give each of the following events in list form, and then compute the
probability of the event:
1. A
2. B
3. A ∩ B
4. A ∪ B
5. A^c ∪ B^c
6. A^c ∩ B^c
7. A ∪ B^c
Answer
In the Coin experiment, select 3 coins. Run the experiment 1000 times, updating after every run, and compute the empirical
probability of each event in the previous exercise.
Suppose that a fair coin is tossed 4 times and the sequence of scores is recorded. Let Y denote the number of heads. Give the
event {Y = k} (as a subset of the sample space) in list form, for each k ∈ {0, 1, 2, 3, 4}, and then give the probability of the
event.
Answer
Suppose that a fair coin is tossed n times and the sequence of scores is recorded. Let Y denote the number of heads. Then
P(Y = k) = (n choose k)(1/2)^n, k ∈ {0, 1, …, n} (2.3.25)
Proof
The distribution of Y in the last exercise is a special case of the binomial distribution. The binomial distribution is studied in more
detail in the chapter on Bernoulli Trials.
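The binomial formula above can be cross-checked by brute-force enumeration; the sketch below compares it against all 2^4 toss sequences for n = 4:

```python
from math import comb
from itertools import product

def binom_half(n, k):
    # P(Y = k) = C(n, k) (1/2)^n for the number of heads in n fair tosses
    return comb(n, k) * 0.5 ** n

# Cross-check against brute-force enumeration of all 2^4 toss sequences
for k in range(5):
    count = sum(1 for seq in product([0, 1], repeat=4) if sum(seq) == k)
    assert abs(binom_half(4, k) - count / 16) < 1e-12
```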
Dice
Consider the experiment of throwing n distinct, k-sided dice (with faces numbered from 1 to k) and recording the sequence of scores X = (X_1, X_2, …, X_n). We can record the outcome as a sequence because of the assumption that the dice are distinct; you can think of the dice as somehow labeled from 1 to n, or perhaps with different colors. The special case k = 6 corresponds to standard dice. In general, note that the set of outcomes is S = {1, 2, …, k}^n. If the dice are fair, then again, by the very meaning of the word, we have no reason to prefer one point in S over another, so as a modeling assumption it seems reasonable to give S the uniform probability distribution.
Suppose that two fair, standard dice are thrown and the sequence of scores recorded. Let A denote the event that the first die
score is less than 3 and B the event that the sum of the dice scores is 6. Give each of the following events in list form and then
find the probability of the event.
1. A
2. B
3. A ∩ B
4. A ∪ B
5. B ∖ A
Answer
In the dice experiment, set n =2 . Run the experiment 100 times and compute the empirical probability of each event in the
previous exercise.
Consider again the dice experiment with n = 2 fair dice. Let S denote the set of outcomes, Y the sum of the scores, U the
minimum score, and V the maximum score.
1. Express Y as a function on S and give the set of values.
2. Find P(Y = y) for each y in the set in part (a).
3. Express U as a function on S and give the set of values.
4. Find P(U = u) for each u in the set in part (c).
5. Express V as a function on S and give the set of values.
6. Find P(V = v) for each v in the set in part (e).
7. Find the set of values of (U , V ).
8. Find P(U = u, V = v) for each (u, v) in the set in part (g).
Answer
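Since the 36 ordered outcomes are equally likely, all of these probabilities can be computed by enumeration; a brief sketch (the particular events checked are illustrative):

```python
from itertools import product
from fractions import Fraction

S = list(product(range(1, 7), repeat=2))  # 36 equally likely ordered outcomes

def prob(event):
    return Fraction(sum(1 for s in S if event(s)), len(S))

# Min and max of the two scores: P(U = u, V = v) is 2/36 if u < v, 1/36 if u = v
assert prob(lambda s: min(s) == 2 and max(s) == 5) == Fraction(2, 36)
assert prob(lambda s: min(s) == 3 and max(s) == 3) == Fraction(1, 36)
assert prob(lambda s: sum(s) == 7) == Fraction(6, 36)
```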
In the previous exercise, note that (U , V ) could serve as the outcome vector for the experiment of rolling two standard, fair dice if
we do not bother to distinguish the dice (so that we might as well record the smaller score first and then the larger score). Note that
this random vector does not have a uniform distribution. On the other hand, we might have chosen at the beginning to just record
the unordered set of scores and, as a modeling assumption, imposed the uniform distribution on the corresponding set of outcomes.
Both models cannot be right, so which model (if either) describes real dice in the real world? It turns out that for real (fair) dice, the
ordered sequence of scores is uniformly distributed, so real dice behave as distinct objects, whether you can tell them apart or not.
In the early history of probability, gamblers sometimes got the wrong answers for events involving dice because they mistakenly
applied the uniform distribution to the set of unordered scores. It's an important moral. If we are to impose the uniform distribution
on a sample space, we need to make sure that it's the right sample space.
A pair of fair, standard dice are thrown repeatedly until the sum of the scores is either 5 or 7. Let A denote the event that the
sum of the scores on the last throw is 5 rather than 7. Events of this type are important in the game of craps.
1. Suppose that we record the pair of scores on each throw. Give the set of outcomes S and express A as a subset of S .
2. Compute the probability of A in the setting of part (a).
3. Now suppose that we just record the pair of scores on the last throw. Give the set of outcomes T and express A as a subset
of T .
4. Compute the probability of A in the setting of part (c).
Answer
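Using the smaller sample space T from part (c), and the fact (it turns out) that each of the ten pairs in T is equally likely, P(A) can be computed directly:

```python
from itertools import product
from fractions import Fraction

# T: pairs of scores on the final throw, i.e. ordered pairs summing to 5 or 7
T = [s for s in product(range(1, 7), repeat=2) if sum(s) in (5, 7)]
A = [s for s in T if sum(s) == 5]  # 4 pairs sum to 5, 6 pairs sum to 7
p = Fraction(len(A), len(T))
print(p)  # 2/5
```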
The previous problem shows the importance of defining the set of outcomes appropriately. Sometimes a clever choice of this set
(and appropriate modeling assumptions) can turn a difficult problem into an easy one.
Sampling Models
Recall that many random experiments can be thought of as sampling experiments. For the general finite sampling model, we start with a population D with m (distinct) objects. We select a sample of n objects from the population, so that the sample space S is the set of possible samples. If we select a sample at random then the outcome X (the random sample) is uniformly distributed on S:
P(X ∈ A) = #(A)/#(S), A ⊆ S (2.3.26)
Recall from the section on Combinatorial Structures that there are four common types of sampling from a finite population, based
on the criteria of order and replacement.
If the sampling is with replacement and with regard to order, then the set of samples is the Cartesian power D^n. The number of samples is m^n.
If the sampling is without replacement and with regard to order, then the set of samples is the set of all permutations of size n from D. The number of samples is m^(n) = m(m − 1) ⋯ (m − n + 1).
If the sampling is without replacement and without regard to order, then the set of samples is the set of all combinations (or subsets) of size n from D. The number of samples is (m choose n) = m^(n)/n!.
If the sampling is with replacement and without regard to order, then the set of samples is the set of all multisets of size n from D. The number of samples is (m + n − 1 choose n).
If we sample with replacement, the sample size n can be any positive integer. If we sample without replacement, the sample size cannot exceed the population size, so we must have n ∈ {1, 2, …, m}.
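The four counts can be collected in one small helper; the function name and dictionary keys below are just for illustration:

```python
from math import comb, perm

def sample_counts(m, n):
    # Number of samples of size n from a population of m objects,
    # for the four combinations of order and replacement
    return {
        "ordered, with replacement": m ** n,
        "ordered, without replacement": perm(m, n),
        "unordered, without replacement": comb(m, n),
        "unordered, with replacement": comb(m + n - 1, n),
    }

counts = sample_counts(6, 2)  # e.g. samples of size 2 from 6 objects
assert counts["ordered, with replacement"] == 36
assert counts["ordered, without replacement"] == 30
assert counts["unordered, without replacement"] == 15
assert counts["unordered, with replacement"] == 21
```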
The basic coin and dice experiments are examples of sampling with replacement. If we toss a fair coin n times and record the sequence of scores X (where as usual, 0 denotes tails and 1 denotes heads), then X is a random sample of size n chosen with order and with replacement from the population {0, 1}. Thus, X is uniformly distributed on {0, 1}^n. If we throw n (distinct) standard fair dice and record the sequence of scores, then we generate a random sample X of size n with order and with replacement from the population {1, 2, 3, 4, 5, 6}. Thus, X is uniformly distributed on {1, 2, 3, 4, 5, 6}^n. Of course, an analogous result would hold for fair, k-sided dice.
Suppose that the sampling is without replacement (the most common case). If we record the ordered sample
X = (X_1, X_2, …, X_n), then the unordered sample W = {X_1, X_2, …, X_n} is a random variable (that is, a function of X). On the
other hand, if we just record the unordered sample W in the first place, then we cannot recover the ordered sample.
Suppose that X is a random sample of size n chosen with order and without replacement from D, so that X is uniformly
distributed on the space of permutations of size n from D. Then W , the unordered sample, is uniformly distributed on the
space of combinations of size n from D. Thus, W is also a random sample.
Proof
The result in the last exercise does not hold if the sampling is with replacement (recall the exercise above and the discussion that
follows). When sampling with replacement, there is no simple relationship between the number of ordered samples and the number
of unordered samples.
Proof
In the previous exercise, random variable Y has the hypergeometric distribution with parameters m, r, and n. The hypergeometric
distribution is studied in more detail in the chapter on Finite Sampling Models.
Proof
In the last exercise, random variable Y has the binomial distribution with parameters n and p = r/m. The binomial distribution is
studied in more detail in the chapter on Bernoulli Trials.
Suppose that a group of voters consists of 40 democrats and 30 republicans. A sample of 10 voters is chosen at random. Find the
probability that the sample contains at least 4 democrats and at least 4 republicans in each of the following cases:
1. The sampling is without replacement.
2. The sampling is with replacement.
Answer
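One way to approach this exercise is to note that if the sample of 10 contains d democrats, then "at least 4 of each party" means d ∈ {4, 5, 6}. The following sketch (our own, with exact rational arithmetic) computes both cases; without replacement the count of democrats is hypergeometric, with replacement it is binomial with p = 40/70:

```python
from fractions import Fraction
from math import comb

# Population: 40 democrats, 30 republicans; sample size 10.
# "At least 4 democrats and at least 4 republicans" means the number
# of democrats d satisfies d in {4, 5, 6}.

def without_replacement():
    total = comb(70, 10)
    return sum(Fraction(comb(40, d) * comb(30, 10 - d), total) for d in range(4, 7))

def with_replacement():
    p = Fraction(40, 70)  # probability that a single draw is a democrat
    return sum(comb(10, d) * p**d * (1 - p)**(10 - d) for d in range(4, 7))

p_wo = without_replacement()
p_w = with_replacement()
```

The hypergeometric terms sum to 1 over d ∈ {0, …, 10} (Vandermonde's identity), which is a useful internal check.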
Urn Models
Drawing balls from an urn is a standard metaphor in probability for sampling from a finite population.
Consider an urn with 30 balls; 10 are red and 20 are green. A sample of 5 balls is chosen at random, without replacement. Let
Y denote the number of red balls in the sample. Explicitly compute P(Y = y) for each y ∈ {0, 1, 2, 3, 4, 5}.
Answer
In the simulation of the ball and urn experiment, select 30 balls with 10 red and 20 green, sample size 5, and sampling without
replacement. Run the experiment 1000 times and compare the empirical probabilities with the true probabilities that you
computed in the previous exercise.
Consider again an urn with 30 balls; 10 are red and 20 are green. A sample of 5 balls is chosen at random, with replacement.
Let Y denote the number of red balls in the sample. Explicitly compute P(Y = y) for each y ∈ {0, 1, 2, 3, 4, 5}.
Answer
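The two urn exercises above can be checked side by side: without replacement the number of red balls Y is hypergeometric, and with replacement it is binomial with success probability 10/30 = 1/3. A sketch with exact fractions (helper names are ours):

```python
from fractions import Fraction
from math import comb

# Urn: 10 red, 20 green; sample of 5 balls. Y = number of red balls.

def hyper_pmf(y):
    # Sampling without replacement: hypergeometric distribution.
    return Fraction(comb(10, y) * comb(20, 5 - y), comb(30, 5))

def binom_pmf(y):
    # Sampling with replacement: each draw is red with probability 1/3.
    p = Fraction(1, 3)
    return comb(5, y) * p**y * (1 - p)**(5 - y)

hyper = [hyper_pmf(y) for y in range(6)]
binom = [binom_pmf(y) for y in range(6)]
```

Both lists sum to 1, and comparing them shows the with-replacement probabilities are slightly more spread out, as one would expect.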
In the simulation of the ball and urn experiment, select 30 balls with 10 red and 20 green, sample size 5, and sampling with
replacement. Run the experiment 1000 times and compare the empirical probabilities with the true probabilities that you
computed in the previous exercise.
An urn contains 15 balls: 6 are red, 5 are green, and 4 are blue. Three balls are chosen at random, without replacement.
1. Find the probability that the chosen balls are all the same color.
2. Find the probability that the chosen balls are all different colors.
Answer
Suppose again that an urn contains 15 balls: 6 are red, 5 are green, and 4 are blue. Three balls are chosen at random, with
replacement.
1. Find the probability that the chosen balls are all the same color.
2. Find the probability that the chosen balls are all different colors.
Answer
Cards
Recall that a standard card deck can be modeled by the product set
D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠} (2.3.29)
where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate
encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for
example q♡ for the queen of hearts).
Card games involve choosing a random sample without replacement from the deck D, which plays the role of the population. Thus,
the basic card experiment consists of dealing n cards from a standard deck without replacement; in this special context, the sample
of cards is often referred to as a hand. Just as in the general sampling model, if we record the ordered hand
X = (X_1, X_2, …, X_n), then the unordered hand W = {X_1, X_2, …, X_n} is a random variable (that is, a function of X). On the
other hand, if we just record the unordered hand W in the first place, then we cannot recover the ordered hand. Finally, recall that
n = 5 is the poker experiment and n = 13 is the bridge experiment. The game of poker is treated in more detail in the chapter on
Games of Chance. By the way, it takes about 7 standard riffle shuffles to randomize a deck of cards.
Suppose that 2 cards are dealt from a well-shuffled deck and the sequence of cards is recorded. For i ∈ {1, 2}, let H_i denote
the event that card i is a heart. Find the probability of each of the following events.
1. H_1
2. H_1 ∩ H_2
3. H_2 ∖ H_1
4. H_2
5. H_1 ∖ H_2
6. H_1 ∪ H_2
Answer
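Since there are only 52 · 51 = 2652 equally likely ordered two-card deals, the probabilities in this exercise can be verified by brute-force enumeration; a sketch (the representation of hearts as cards 0–12 is an arbitrary labeling, harmless by symmetry):

```python
from fractions import Fraction
from itertools import permutations

# Deck is 0..51; by symmetry we may take cards 0..12 to be the hearts.
deck = range(52)
deals = list(permutations(deck, 2))  # all ordered two-card deals, equally likely

def prob(event):
    return Fraction(sum(1 for d in deals if event(d)), len(deals))

heart = lambda card: card < 13
p_H1 = prob(lambda d: heart(d[0]))                       # first card is a heart
p_H1_and_H2 = prob(lambda d: heart(d[0]) and heart(d[1]))
p_H2_minus_H1 = prob(lambda d: heart(d[1]) and not heart(d[0]))
p_H1_or_H2 = prob(lambda d: heart(d[0]) or heart(d[1]))
```

In particular P(H_1) = 1/4 and, by exchangeability, P(H_2) = P(H_1 ∩ H_2) + P(H_2 ∖ H_1) = 1/4 as well.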
Think about the results in the previous exercise, and suppose that we continue dealing cards. Note that in computing the probability
of H_i, you could conceptually reduce the experiment to dealing a single card. Note also that the probabilities do not depend on the
order in which the cards are dealt. For example, the probability of an event involving the 1st, 2nd and 3rd cards is the same as the
probability of the corresponding event involving the 25th, 17th, and 40th cards. Technically, the cards are exchangeable. Here's
another way to think of this concept: Suppose that the cards are dealt onto a table in some pattern, but you are not allowed to view
the process. Then no experiment that you can devise will give you any information about the order in which the cards were dealt.
In the card experiment, set n = 2. Run the experiment 100 times and compute the empirical probability of each event in the
previous exercise.
In the poker experiment, find the probability of each of the following events:
1. The hand is a full house (3 cards of one kind and 2 cards of another kind).
2. The hand has four of a kind (4 cards of one kind and 1 of another kind).
3. The cards are all in the same suit (thus, the hand is either a flush or a straight flush).
Answer
Run the poker experiment 10000 times, updating every 10 runs. Compute the empirical probability of each event in the
previous problem.
Find the probability that a bridge hand will contain no honor cards; that is, no cards of denomination 10, jack, queen, king, or
ace. Such a hand is called a Yarborough, in honor of the second Earl of Yarborough.
Answer
A card hand that contains no cards in a particular suit is said to be void in that suit. Use the inclusion-exclusion rule to find the
probability of each of the following events:
1. A poker hand is void in at least one suit.
2. A bridge hand is void in at least one suit.
Answer
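The void-in-a-suit exercise is a nice application of inclusion-exclusion over the four suits: a hand of n cards is void in a particular suit in C(52 − 13, n) ways, void in a particular pair of suits in C(52 − 26, n) ways, and so on. A sketch of that computation (the function name p_void is ours):

```python
from fractions import Fraction
from math import comb

def p_void(n):
    """Probability that a random hand of n cards is void in at least one suit,
    by inclusion-exclusion over the 4 suits."""
    total = comb(52, n)
    return sum(
        Fraction((-1) ** (k + 1) * comb(4, k) * comb(52 - 13 * k, n), total)
        for k in range(1, 4) if 52 - 13 * k >= n
    )

p_poker = p_void(5)    # poker hand void in at least one suit
p_bridge = p_void(13)  # bridge hand void in at least one suit
```

Note that a nonempty hand cannot be void in all four suits, so the alternating sum stops at k = 3.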
Birthdays
The following problem is known as the birthday problem, and is famous because the results are rather surprising at first.
Suppose that n persons are selected and their birthdays recorded (we will ignore leap years). Let A denote the event that the
birthdays are distinct, so that A^c is the event that there is at least one duplication in the birthdays.
1. Define an appropriate sample space and probability measure. State the assumptions you are making.
2. Find P(A) and P(A^c) in terms of the parameter n.
3. Explicitly compute P(A) and P(A^c) for n ∈ {10, 20, 30, 40, 50}.
Answer
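Under the uniform-and-independent assumption, P(A) = 365^(n)/365^n, which is easy to evaluate as a running product; a short sketch:

```python
def p_distinct(n, days=365):
    """Probability that n birthdays are all distinct, assuming each birthday
    is uniformly distributed over `days` days, independently."""
    p = 1.0
    for k in range(n):
        p *= (days - k) / days  # k-th person avoids the k birthdays already taken
    return p

probs = {n: p_distinct(n) for n in (10, 20, 30, 40, 50)}
```

The famous tipping point is n = 23, where the probability of at least one shared birthday first exceeds 1/2.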
The small value of P(A) for relatively small sample sizes n is striking, but is due mathematically to the fact that 365^n grows much
faster than 365^(n) as n increases. The birthday problem is treated in more generality in the chapter on Finite Sampling Models.
Suppose that 4 persons are selected and their birth months recorded. Assuming an appropriate uniform distribution, find the
probability that the months are distinct.
Answer
In Buffon's coin experiment, a coin with radius r ≤ 1/2 is tossed randomly on a floor covered with square tiles of side
length 1, and the coordinates (X, Y) of the center of the coin are recorded, relative to axes through the center of the square in
which the coin lands (with the axes parallel to the sides of the square, of course). Let A denote the event that the coin does not
touch the sides of the square.
1. Define the set of outcomes S mathematically and sketch S .
2. Argue that (X, Y ) is uniformly distributed on S .
3. Express A in terms of the outcome variables (X, Y ) and sketch A .
4. Find P(A) .
5. Find P(A^c).
Answer
In Buffon's coin experiment, set r = 0.2. Run the experiment 100 times and compute the empirical probability of each event in
the previous exercise.
A point (X, Y) is chosen at random in the circular region S ⊆ R^2 of radius 1, centered at the origin. Let A denote the event
that the point is in the inscribed square region centered at the origin, with sides parallel to the coordinate axes, and let B denote
the event that the point is in the inscribed square with vertices (±1, 0), (0, ±1). Sketch each of the following events as a subset
of S , and find the probability of the event.
1. A
2. B
3. A ∩ B^c
4. B ∩ A^c
5. A ∩ B
6. A ∪ B
Answer
Suppose a point (X, Y) is chosen at random in the circular region S ⊆ R^2 of radius 12, centered at the origin. Let R denote
the distance from the origin to the point. Sketch each of the following events as a subset of S , and compute the probability of
the event. Is R uniformly distributed on the interval [0, 12]?
1. {R ≤ 3}
2. {3 < R ≤ 6}
3. {6 < R ≤ 9}
4. {9 < R ≤ 12}
Answer
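Since the point is uniform on the disk, P(R ≤ r) is a ratio of areas: P(R ≤ r) = (πr²)/(π·12²) = (r/12)². A quick sketch computing the four band probabilities exactly:

```python
from fractions import Fraction

def cdf_R(r):
    # P(R <= r) = area of the disk of radius r over area of the disk of radius 12
    return Fraction(r, 12) ** 2

bands = [cdf_R(3), cdf_R(6) - cdf_R(3), cdf_R(9) - cdf_R(6), cdf_R(12) - cdf_R(9)]
```

The four bands have equal width but probabilities 1/16, 3/16, 5/16, 7/16, so R is not uniformly distributed on [0, 12]: outer rings carry more area.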
In the simple probability experiment, points are generated according to the uniform distribution on a rectangle. Move and
resize the events A and B and note how the probabilities of the various events change. Create each of the following
configurations. In each case, run the experiment 1000 times and compare the relative frequencies of the events to the
probabilities of the events.
1. A and B in general position
2. A and B disjoint
3. A ⊆ B
4. B ⊆ A
Genetics
Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this
subsection.
Recall first that the ABO blood type in humans is determined by three alleles: a , b , and o . Furthermore, a and b are co-dominant
and o is recessive. Suppose that the probability distribution for the set of blood genotypes in a certain population is given in the
following table:
Genotype aa ab ao bb bo oo
A person is chosen at random from the population. Let A, B, AB, and O be the events that the person is type A, type B, type
AB, and type O respectively. Let H be the event that the person is homozygous and D the event that the person has an o allele.
Answer
Suppose next that pod color in certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and
that g is dominant.
Let G be the event that a child plant has green pods. Find P(G) in each of the following cases:
1. At least one parent is type gg .
2. Both parents are type yy.
3. Both parents are type gy .
4. One parent is type yy and the other is type gy .
Answer
Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele
and d the defective allele for the gene linked to the disorder. Recall that d is recessive for women.
Let B be the event that a son has the disorder, C the event that a daughter is a healthy carrier, and D the event that a daughter
has the disease. Find P(B) , P(C ) and P(D) in each of the following cases:
1. The mother and father are normal.
2. The mother is a healthy carrier and the father is normal.
3. The mother is normal and the father has the disorder.
4. The mother is a healthy carrier and the father has the disorder.
5. The mother has the disorder and the father is normal.
6. The mother and father both have the disorder.
Answer
From this exercise, note that transmission of the disorder to a daughter can only occur if the mother is at least a carrier and the
father has the disorder. In ordinary large populations, this is an unusual intersection of events, and thus sex-linked hereditary
disorders are typically much less common in women than in men. In brief, women are protected by the extra X chromosome.
Radioactive Emissions
Suppose that T denotes the time between emissions (in milliseconds) for a certain type of radioactive material, and that T has
the following probability distribution, defined for measurable A ⊆ [0, ∞) by
P(T ∈ A) = ∫_A e^(−t) dt (2.3.30)
Suppose that N denotes the number of emissions in a one millisecond interval for a certain type of radioactive material, and
that N has the following probability distribution:
P(N ∈ A) = ∑_(n∈A) e^(−1)/n!, A ⊆ N (2.3.31)
The probability distribution that governs the time between emissions is a special case of the exponential distribution, while the
probability distribution that governs the number of emissions is a special case of the Poisson distribution, named for Simeon
Poisson. The exponential distribution and the Poisson distribution are studied in more detail in the chapter on the Poisson process.
Matching
Suppose that an absent-minded secretary prepares 4 letters and matching envelopes to send to 4 different persons, but then
stuffs the letters into the envelopes randomly. Find the probability of the event A that at least one letter is in the proper
envelope.
Solution
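With only 4! = 24 equally likely stuffings, the matching probability can be verified by direct enumeration; a sketch:

```python
from fractions import Fraction
from itertools import permutations

# Each of the 4! equally likely stuffings is a permutation of (0, 1, 2, 3);
# letter i lands in the proper envelope exactly when the permutation fixes i.
perms = list(permutations(range(4)))
matches = sum(1 for p in perms if any(p[i] == i for i in range(4)))
p_match = Fraction(matches, len(perms))
```

The count of permutations with no fixed point (derangements of 4 objects) is 9, so P(A) = 15/24 = 5/8.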
This exercise is an example of the matching problem, originally formulated and studied by Pierre Remond Montmort. A complete
analysis of the matching problem is given in the chapter on Finite Sampling Models.
In the simulation of the matching experiment select n = 4 . Run the experiment 1000 times and compute the relative frequency
of the event that at least one match occurs.
For the cicada data, let W denote the event that a cicada weighs at least 0.20 grams, F the event that a cicada is female, and T
the event that a cicada is type tredecula. Find the empirical probability of each of the following:
1. W
2. F
3. T
4. W ∩ F
5. F ∪ T ∪ W
Answer
This page titled 2.3: Probability Measures is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
2.4: Conditional Probability
The purpose of this section is to study how probabilities are updated in light of new information, clearly an absolutely essential
topic. If you are a new student of probability, you may want to skip the technical details.
Let A and B be events with P(B) > 0 . The conditional probability of A given B is defined to be
P(A ∩ B)
P(A ∣ B) = (2.4.1)
P(B)
Note that N_n(E) is a random variable in the compound experiment that consists of replicating the original experiment. In
particular, its value is unknown until we actually run the experiment n times.
If N_n(B) is large, the conditional probability that A has occurred, given that B has occurred, should be close to the conditional
relative frequency of A given B, namely the relative frequency of A for the runs on which B occurred: N_n(A ∩ B)/N_n(B). But
note that
N_n(A ∩ B)/N_n(B) = [N_n(A ∩ B)/n] / [N_n(B)/n] (2.4.2)
The numerator and denominator of the fraction on the right are the relative frequencies of A ∩ B and B, respectively. So by
the law of large numbers again, N_n(A ∩ B)/n → P(A ∩ B) as n → ∞ and N_n(B)/n → P(B) as n → ∞. Hence
N_n(A ∩ B)/N_n(B) → P(A ∩ B)/P(B) as n → ∞ (2.4.3)
2.4.1 https://stats.libretexts.org/@go/page/10132
It's very important that you not confuse P(A ∣ B) , the probability of A given B , with P(B ∣ A) , the probability of B given A .
Making that mistake is known as the fallacy of the transposed conditional. (How embarrassing!)
Conditional Distributions
Suppose that X is a random variable for the experiment with values in T . Mathematically, X is a function from S into T , and
{X ∈ A} denotes the event {s ∈ S : X(s) ∈ A} for A ⊆ T . Intuitively, X is a variable of interest in the experiment, and every
meaningful statement about X defines an event. Recall that the probability distribution of X is the probability measure on T given
by
A ↦ P(X ∈ A), A ⊆T (2.4.4)
If B is an event with P(B) > 0 , then the conditional distribution of X given B is the probability measure on T given by
A ↦ P(X ∈ A ∣ B), A ⊆T (2.4.5)
Details
Basic Theory
Preliminary Results
Our first result is of fundamental importance, and indeed was a crucial part of the argument for the definition of conditional
probability.
Suppose again that B is an event with P(B) > 0 . Then A ↦ P(A ∣ B) is a probability measure on S .
Proof
It's hard to overstate the importance of the last result because this theorem means that any result that holds for probability measures
in general holds for conditional probability, as long as the conditioning event remains fixed. In particular the basic probability rules
in the section on Probability Measure have analogs for conditional probability. To give two examples,
P(A^c ∣ B) = 1 − P(A ∣ B) (2.4.8)
By the same token, it follows that the conditional distribution of a random variable with values in T , given in above, really does
define a probability distribution on T . No further proof is necessary. Our next results are very simple.
Parts (a) and (c) certainly make sense. Suppose that we know that event B has occurred. If B ⊆ A then A becomes a certain event.
If A ∩ B = ∅ then A becomes an impossible event. A conditional probability can be computed relative to a probability measure
that is itself a conditional probability measure. The following result is a consistency condition.
Suppose that A , B , and C are events with P(B ∩ C ) > 0 . The probability of A given B , relative to P(⋅ ∣ C ) , is the same as
the probability of A given B and C (relative to P). That is,
P(A ∩ B ∣ C) / P(B ∣ C) = P(A ∣ B ∩ C) (2.4.10)
Proof
Correlation
Our next discussion concerns an important concept that deals with how two events are related, in a probabilistic sense.
Suppose that A and B are events with P(A) > 0 and P(B) > 0 .
1. P(A ∣ B) > P(A) if and only if P(B ∣ A) > P(B) if and only if P(A ∩ B) > P(A)P(B) . In this case, A and B are
positively correlated.
2. P(A ∣ B) < P(A) if and only if P(B ∣ A) < P(B) if and only if P(A ∩ B) < P(A)P(B) . In this case, A and B are
negatively correlated.
3. P(A ∣ B) = P(A) if and only if P(B ∣ A) = P(B) if and only if P(A ∩ B) = P(A)P(B) . In this case, A and B are
uncorrelated or independent.
Proof
Intuitively, if A and B are positively correlated, then the occurrence of either event means that the other event is more likely. If A
and B are negatively correlated, then the occurrence of either event means that the other event is less likely. If A and B are
uncorrelated, then the occurrence of either event does not change the probability of the other event. Independence is a fundamental
concept that can be extended to more than two events and to random variables; these generalizations are studied in the next section
on Independence. A much more general version of correlation, for random variables, is explored in the section on Covariance and
Correlation in the chapter on Expected Value.
Suppose that A and B are events. Note from (4) that if A ⊆ B or B ⊆ A then A and B are positively correlated. If A and B are
disjoint then A and B are negatively correlated.
Moreover, A^c and B have the opposite correlation as A and B (that is, positive-negative, negative-positive, or 0-0).
Proof
The following generalization is known as the multiplication rule of probability. As usual, we assume that any event conditioned on
has positive probability.
Proof
The multiplication rule is particularly useful for experiments that consist of dependent stages, where A_i is an event in stage i.
Compare the multiplication rule of probability with the multiplication rule of combinatorics.
As with any other result, the multiplication rule can be applied to a conditional probability measure. In the context above, if E is
another event, then
P(A_1 ∩ A_2 ∩ ⋯ ∩ A_n ∣ E) = P(A_1 ∣ E) P(A_2 ∣ A_1 ∩ E) P(A_3 ∣ A_1 ∩ A_2 ∩ E) ⋯ P(A_n ∣ A_1 ∩ A_2 ∩ ⋯ ∩ A_{n−1} ∩ E) (2.4.17)
The following theorem is known as the law of total probability.
If B is an event then
P(B) = ∑_(i∈I) P(A_i) P(B ∣ A_i)
Proof
The following theorem is known as Bayes' Theorem, named after Thomas Bayes:
If B is an event then
P(A_j ∣ B) = P(A_j) P(B ∣ A_j) / ∑_(i∈I) P(A_i) P(B ∣ A_i), j ∈ I (2.4.20)
Proof
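The law of total probability and Bayes' theorem for a finite partition translate directly into code; the following sketch uses our own function names and a purely hypothetical two-event partition for illustration:

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """Law of total probability: P(B) = sum_i P(A_i) P(B | A_i)."""
    return sum(p * l for p, l in zip(priors, likelihoods))

def posteriors(priors, likelihoods):
    """Bayes' theorem: the posterior P(A_j | B) for each j."""
    pb = total_probability(priors, likelihoods)
    return [p * l / pb for p, l in zip(priors, likelihoods)]

# Hypothetical partition {A_1, A_2} with equal priors and unequal likelihoods:
priors = [Fraction(1, 2), Fraction(1, 2)]
likelihoods = [Fraction(9, 10), Fraction(1, 10)]
post = posteriors(priors, likelihoods)
```

With equal priors, the posteriors are simply the normalized likelihoods, which matches the weighted-average interpretation discussed below.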
These two theorems are most useful, of course, when we know P(A_i) and P(B ∣ A_i) for each i ∈ I. When we compute the
probability P(B) by the law of total probability, we say that we are conditioning on the partition {A_i : i ∈ I}. Note that we can think of
the sum as a weighted average of the conditional probabilities P(B ∣ A_i) over i ∈ I, where P(A_i), i ∈ I are the weight factors. In
the context of Bayes' theorem, P(A_j) is the prior probability of A_j and P(A_j ∣ B) is the posterior probability of A_j for j ∈ I. We
will study more general versions of conditioning and Bayes' theorem in the section on Discrete Distributions in the chapter on
Distributions, and again in the section on Conditional Expected Value in the chapter on Expected Value.
Once again, the law of total probability and Bayes' theorem can be applied to a conditional probability measure. So, if E is another
event with P(A_i ∩ E) > 0 for i ∈ I then
P(B ∣ E) = ∑_(i∈I) P(A_i ∣ E) P(B ∣ A_i ∩ E)
P(A_j ∣ B ∩ E) = P(A_j ∣ E) P(B ∣ A_j ∩ E) / ∑_(i∈I) P(A_i ∣ E) P(B ∣ A_i ∩ E), j ∈ I (2.4.22)
Suppose that A and B are events in a random experiment with P(A) = 1/3, P(B) = 1/4, and P(A ∩ B) = 1/10. Find each of the following:
1. P(A ∣ B)
2. P(B ∣ A)
3. P(A^c ∣ B)
4. P(B^c ∣ A)
5. P(A^c ∣ B^c)
Answer
Suppose that A, B, and C are events in a random experiment with P(A ∣ C) = 1/2, P(B ∣ C) = 1/3, and P(A ∩ B ∣ C) = 1/4.
Find each of the following:
1. P(B ∖ A ∣ C)
2. P(A ∪ B ∣ C)
3. P(A^c ∩ B^c ∣ C)
4. P(A^c ∪ B^c ∣ C)
5. P(A^c ∪ B ∣ C)
6. P(A ∣ B ∩ C )
Answer
Suppose that A and B are events in a random experiment with P(A) = 1/2, P(B) = 1/3, and P(A ∣ B) = 3/4.
1. Find P(A ∩ B)
2. Find P(A ∪ B)
3. Find P(B ∪ A^c)
4. Find P(B ∣ A)
5. Are A and B positively correlated, negatively correlated, or independent?
Answer
Simple Populations
In a certain population, 30% of the persons smoke cigarettes and 8% have COPD (Chronic Obstructive Pulmonary Disease).
Moreover, 12% of the persons who smoke have COPD.
1. What percentage of the population smoke and have COPD?
2. What percentage of the population with COPD also smoke?
3. Are smoking and COPD positively correlated, negatively correlated, or independent?
Answer
A company has 200 employees: 120 are women and 80 are men. Of the 120 female employees, 30 are classified as managers,
while 20 of the 80 male employees are managers. Suppose that an employee is chosen at random.
1. Find the probability that the employee is female.
2. Find the probability that the employee is a manager.
3. Find the conditional probability that the employee is a manager given that the employee is female.
4. Find the conditional probability that the employee is female given that the employee is a manager.
5. Are the events female and manager positively correlated, negatively correlated, or independent?
Answer
Consider the dice experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores X = (X_1, X_2).
Let Y denote the sum of the scores. For each of the following pairs of events, find the probability of each event and the conditional
probability of each event given the other. Determine whether the events are positively correlated, negatively correlated, or
independent.
1. {X_1 = 3}, {Y = 5}
2. {X_1 = 3}, {Y = 7}
3. {X_1 = 2}, {Y = 5}
4. {X_1 = 3}, {X_1 = 2}
Answer
Note that positive correlation is not a transitive relation. From the previous exercise, for example, note that {X_1 = 3} and
{Y = 5} are positively correlated, {Y = 5} and {X_1 = 2} are positively correlated, but {X_1 = 3} and {X_1 = 2} are negatively
correlated.
In the dice experiment, set n = 2. Run the experiment 1000 times. Compute the empirical conditional probabilities corresponding
to the conditional probabilities in the last exercise.
Consider again the experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores
X = (X_1, X_2). Let Y denote the sum of the scores, U the minimum score, and V the maximum score.
4. Find P(U = u ∣ Y = 8) for the appropriate values of u.
5. Find P[(X_1, X_2) = (x_1, x_2) ∣ Y = 8] for the appropriate values of (x_1, x_2).
Answer
In the die-coin experiment, a standard, fair die is rolled and then a fair coin is tossed the number of times showing on the die.
Let N denote the die score and H the event that all coin tosses result in heads.
1. Find P(H ).
2. Find P(N = n ∣ H ) for n ∈ {1, 2, 3, 4, 5, 6}.
3. Compare the results in (b) with P(N = n) for n ∈ {1, 2, 3, 4, 5, 6}. In each case, note whether the events H and {N = n} are positively correlated, negatively correlated, or independent.
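The die-coin computation is a small instance of conditioning on a partition: N is uniform on {1, …, 6}, and given N = n all tosses are heads with probability (1/2)^n. A sketch with exact fractions:

```python
from fractions import Fraction

# Die-coin experiment: condition on the die score N, uniform on {1,...,6};
# given N = n, all n tosses are heads with probability (1/2)^n.
p_H = sum(Fraction(1, 6) * Fraction(1, 2) ** n for n in range(1, 7))

# Bayes' theorem: posterior distribution of N given all heads.
post = {n: Fraction(1, 6) * Fraction(1, 2) ** n / p_H for n in range(1, 7)}
```

This gives P(H) = 21/128, and the posterior probabilities 32/63, 16/63, …, 1/63 show that observing all heads makes small die scores much more likely.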
Run the die-coin experiment 1000 times. Let H and N be as defined in the previous exercise.
1. Compute the empirical probability of H . Compare with the true probability in the previous exercise.
2. Compute the empirical probability of {N = n} given H , for n ∈ {1, 2, 3, 4, 5, 6}. Compare with the true probabilities in
the previous exercise.
Suppose that a bag contains 12 coins: 5 are fair, 4 are biased with probability of heads 1/3, and 3 are two-headed. A coin is
chosen at random from the bag and tossed.
1. Find the probability that the coin is heads.
2. Given that the coin is heads, find the conditional probability of each coin type.
Answer
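The bag-of-coins exercise is another direct use of total probability and Bayes' theorem over the three coin types; a sketch with exact fractions (the dictionary keys are our own labels):

```python
from fractions import Fraction

# 5 fair coins (heads prob 1/2), 4 biased (heads prob 1/3), 3 two-headed.
priors = {"fair": Fraction(5, 12), "biased": Fraction(4, 12), "two-headed": Fraction(3, 12)}
heads_prob = {"fair": Fraction(1, 2), "biased": Fraction(1, 3), "two-headed": Fraction(1, 1)}

# Law of total probability for the event "heads".
p_heads = sum(priors[t] * heads_prob[t] for t in priors)

# Bayes' theorem: posterior probability of each coin type given heads.
posterior = {t: priors[t] * heads_prob[t] / p_heads for t in priors}
```

This gives P(heads) = 41/72, with posterior probabilities 15/41, 8/41, and 18/41 for the fair, biased, and two-headed types.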
Compare the die-coin experiment and the bag of coins experiment. In the die-coin experiment, we toss a coin with a fixed probability of
heads a random number of times. In the bag of coins experiment, we effectively toss a coin with a random probability of heads a
fixed number of times. The random experiment of tossing a coin with a fixed probability of heads p a fixed number of times n is
known as the binomial experiment with parameters n and p. This is a very basic and important experiment that is studied in more
detail in the section on the binomial distribution in the chapter on Bernoulli Trials. Thus, the die-coin and bag of coins experiments
can be thought of as modifications of the binomial experiment in which a parameter has been randomized. In general, interesting
new random experiments can often be constructed by randomizing one or more parameters in another random experiment.
In the coin-die experiment, a fair coin is tossed. If the coin lands tails, a fair die is rolled. If the coin lands heads, an ace-six flat
die is tossed (faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, and 5 have probability 1/8 each). Let H denote the
event that the coin lands heads, and let Y denote the score when the chosen die is tossed.
1. Find P(Y = y) for y ∈ {1, 2, 3, 4, 5, 6}.
2. Find P(H ∣ Y = y) for y ∈ {1, 2, 3, 4, 5, 6}.
3. Compare each probability in part (b) with P(H ). In each case, note whether the events H and {Y = y} are positively
correlated, negatively correlated, or independent.
Answer
Run the coin-die experiment 1000 times. Let H and Y be as defined in the previous exercise.
1. Compute the empirical probability of {Y = y} , for each y , and compare with the true probability in the previous exercise
2. Compute the empirical probability of H given {Y = y} for each y , and compare with the true probability in the previous
exercise.
Cards
Consider the card experiment that consists of dealing 2 cards from a standard deck and recording the sequence of cards dealt.
For i ∈ {1, 2}, let Q_i be the event that card i is a queen and H_i the event that card i is a heart. For each of the following pairs
of events, compute the probability of each event, and the conditional probability of each event given the other. Determine
whether the events are positively correlated, negatively correlated, or independent.
1. Q_1, H_1
2. Q_1, Q_2
3. Q_2, H_2
4. Q_1, H_2
Answer
In the card experiment, set n = 2 . Run the experiment 500 times. Compute the conditional relative frequencies corresponding
to the conditional probabilities in the last exercise.
Consider the card experiment that consists of dealing 3 cards from a standard deck and recording the sequence of cards dealt.
Find the probability of the following events:
1. All three cards are hearts.
2. The first two cards are hearts and the third is a spade.
3. The first and third cards are hearts and the second is a spade.
Proof
In the card experiment, set n = 3 and run the simulation 1000 times. Compute the empirical probability of each event in the
previous exercise and compare with the true probability.
side length 1. The coordinates (X, Y ) of the center of the coin are recorded relative to axes through the center of the square,
parallel to the sides. Since the needle is dropped randomly, the basic modeling assumption is that (X, Y ) is uniformly distributed
on the square [−1/2, 1/2] .2
Run Buffon's coin experiment 500 times. Compute the empirical probability that Y >0 given that X <Y and compare with
the probability in the last exercise.
In the conditional probability experiment, the random points are uniformly distributed on the rectangle S . Move and resize
events A and B and note how the probabilities change. For each of the following configurations, run the experiment 1000
times and compare the relative frequencies with the true probabilities.
1. A and B in general position
2. A and B disjoint
3. A ⊆ B
4. B ⊆ A
Reliability
A plant has 3 assembly lines that produce memory chips. Line 1 produces 50% of the chips and has a defective rate of 4%;
line 2 produces 30% of the chips and has a defective rate of 5%; line 3 produces 20% of the chips and has a defective rate
of 1%. A chip is chosen at random from the plant.
1. Find the probability that the chip is defective.
2. Given that the chip is defective, find the conditional probability for each line.
Answer
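This exercise is a textbook case of conditioning on the partition by assembly line; a sketch with exact fractions (the dictionary layout is ours):

```python
from fractions import Fraction

# Each line: (share of production, defective rate).
lines = {1: (Fraction(50, 100), Fraction(4, 100)),
         2: (Fraction(30, 100), Fraction(5, 100)),
         3: (Fraction(20, 100), Fraction(1, 100))}

# Law of total probability for the event "defective".
p_def = sum(share * rate for share, rate in lines.values())

# Bayes' theorem: posterior probability of each line given a defective chip.
posterior = {i: share * rate / p_def for i, (share, rate) in lines.items()}
```

This gives P(defective) = 37/1000 = 0.037, and posterior probabilities 20/37, 15/37, and 2/37 for lines 1, 2, and 3.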
Suppose that a bit (0 or 1) is sent through a noisy communications channel. Because of the noise, the bit sent may be received
incorrectly as the complementary bit. Specifically, suppose that if 0 is sent, then the probability that 0 is received is 0.9 and the
probability that 1 is received is 0.1. If 1 is sent, then the probability that 1 is received is 0.8 and the probability that 0 is
received is 0.2. Finally, suppose that 1 is sent with probability 0.6 and 0 is sent with probability 0.4. Find the probability that
1. 1 was sent given that 1 was received
2. 0 was sent given that 0 was received
Answer
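The noisy-channel exercise is Bayes' theorem with the partition {0 sent, 1 sent}; a sketch with exact fractions:

```python
from fractions import Fraction

p_send = {0: Fraction(4, 10), 1: Fraction(6, 10)}
# p_recv[s][r] = probability that r is received given s was sent
p_recv = {0: {0: Fraction(9, 10), 1: Fraction(1, 10)},
          1: {0: Fraction(2, 10), 1: Fraction(8, 10)}}

def sent_given_received(s, r):
    # Bayes' theorem, conditioning on the bit sent.
    num = p_send[s] * p_recv[s][r]
    den = sum(p_send[t] * p_recv[t][r] for t in (0, 1))
    return num / den

p1 = sent_given_received(1, 1)  # P(1 sent | 1 received)
p0 = sent_given_received(0, 0)  # P(0 sent | 0 received)
```

This gives P(1 sent ∣ 1 received) = 12/13 and P(0 sent ∣ 0 received) = 3/4.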
Suppose that T denotes the lifetime of a light bulb (in 1000 hour units), and that T has the following exponential distribution,
defined for measurable A ⊆ [0, ∞) :
P(T ∈ A) = ∫_A e^(−t) dt (2.4.23)
Answer
Suppose again that T denotes the lifetime of a light bulb (in 1000 hour units), but that T is uniformly distributed on the interval
[0, 10].
Answer
Genetics
Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this
section.
Recall first that the ABO blood type in humans is determined by three alleles: a , b , and o . Furthermore, a and b are co-dominant
and o is recessive. Suppose that the probability distribution for the set of blood genotypes in a certain population is given in the
following table:
Genotype aa ab ao bb bo oo
Suppose that a person is chosen at random from the population. Let A , B , AB, and O be the events that the person is type A ,
type B , type AB, and type O respectively. Let H be the event that the person is homozygous, and let D denote the event that
the person has an o allele. Find each of the following:
1. P(A) , P(B) , P(AB), P(O) , P(H ), P(D)
2. P (A ∩ H ) , P (A ∣ H ) , P (H ∣ A) . Are the events A and H positively correlated, negatively correlated, or independent?
3. P (B ∩ H ) , P (B ∣ H ) , P (H ∣ B) . Are the events B and H positively correlated, negatively correlated, or independent?
4. P (A ∩ D) , P (A ∣ D) , P (D ∣ A) . Are the events A and D positively correlated, negatively correlated, or independent?
5. P (B ∩ D) , P (B ∣ D) , P (D ∣ B) . Are the events B and D positively correlated, negatively correlated, or independent?
6. P (H ∩ D) , P (H ∣ D) , P (D ∣ H ) . Are the events H and D positively correlated, negatively correlated, or independent?
2.4.8 https://stats.libretexts.org/@go/page/10132
Answer
Suppose next that pod color in certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and
that g is dominant and y recessive.
Suppose that a green-pod plant and a yellow-pod plant are bred together. Suppose further that the green-pod plant has a 1/3 chance of carrying the recessive allele.

Suppose that two green-pod plants are bred together. Suppose further that with probability 1/4 neither plant has the recessive allele, with probability 1/2 one plant has the recessive allele, and with probability 1/4 both plants have the recessive allele.
Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele
and d the defective allele for the gene linked to the disorder. Recall that h is dominant and d recessive for women.
Suppose that in a certain population, 50% are male and 50% are female. Moreover, suppose that 10% of males are color blind
but only 1% of females are color blind.
1. Find the percentage of color blind persons in the population.
2. Find the percentage of color blind persons that are male.
Answer
Since color blindness is a sex-linked hereditary disorder, note that it's reasonable in the previous exercise that the probability that a
female is color blind is the square of the probability that a male is color blind. If p is the probability of the defective allele on the X
chromosome, then p is also the probability that a male will be color blind. But since the defective allele is recessive, a woman
would need two copies of the defective allele to be color blind, and assuming independence, the probability of this event is p².
A man and a woman do not have a certain sex-linked hereditary disorder, but the woman has a 1/3 chance of being a carrier.
1. Find the probability that a son born to the couple will be normal.
2. Find the probability that a daughter born to the couple will be a carrier.
3. Given that a son born to the couple is normal, find the updated probability that the mother is a carrier.
Answer
Urn Models
Urn 1 contains 4 red and 6 green balls while urn 2 contains 7 red and 3 green balls. An urn is chosen at random and then a ball
is chosen at random from the selected urn.
1. Find the probability that the ball is green.
2. Given that the ball is green, find the conditional probability that urn 1 was selected.
Answer
Urn 1 contains 4 red and 6 green balls while urn 2 contains 6 red and 3 green balls. A ball is selected at random from urn 1 and
transferred to urn 2. Then a ball is selected at random from urn 2.
1. Find the probability that the ball from urn 2 is green.
2. Given that the ball from urn 2 is green, find the conditional probability that the ball from urn 1 was green.
Answer
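Conditioning on the color of the transferred ball gives both answers. An illustrative Python check (names are ours):

```python
# Urn 1: 4 red, 6 green. Urn 2: 6 red, 3 green. One ball is moved from
# urn 1 to urn 2, and then a ball is drawn from urn 2 (which now has 10 balls).
p_g1 = 6 / 10                      # P(transferred ball is green)
p_r1 = 4 / 10

# Condition on the color of the transferred ball:
p_g2 = p_g1 * (4 / 10) + p_r1 * (3 / 10)   # P(ball from urn 2 is green)

# Bayes' theorem: P(transferred ball green | ball from urn 2 green)
p_g1_given_g2 = p_g1 * (4 / 10) / p_g2

print(p_g2)            # 0.36
print(p_g1_given_g2)   # 2/3
```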
An urn initially contains 6 red and 4 green balls. A ball is chosen at random from the urn and its color is recorded. It is then
replaced in the urn and 2 new balls of the same color are added to the urn. The process is repeated. Find the probability of each
of the following events:
1. Balls 1 and 2 are red and ball 3 is green.
2. Balls 1 and 3 are red and ball 2 is green.
3. Ball 1 is green and balls 2 and 3 are red.
4. Ball 2 is red.
5. Ball 1 is red given that ball 2 is red.
Answer
Think about the results in the previous exercise. Note in particular that the answers to parts (a), (b), and (c) are the same, and that
the probability that the second ball is red in part (d) is the same as the probability that the first ball is red. More generally, the
probabilities of events do not depend on the order of the draws. For example, the probability of an event involving the first, second,
and third draws is the same as the probability of the corresponding event involving the seventh, tenth and fifth draws. Technically,
the sequence of events (R₁, R₂, …) is exchangeable. The random process described in this exercise is a special case of Pólya's urn
scheme, named after George Pólya. We will study Pólya's urn in more detail in the chapter on Finite Sampling Models.
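Exchangeability can be verified directly by multiplying conditional probabilities along each draw sequence. The sketch below (illustrative, not part of the original text) computes the probability of every ordering of two reds and one green in this urn and confirms that the order does not matter:

```python
def polya_prob(seq, red=6, green=4, add=2):
    """Probability of an exact sequence of colors ('R'/'G') in Polya's urn:
    each drawn ball is returned along with `add` more of the same color."""
    p = 1.0
    for c in seq:
        total = red + green
        if c == 'R':
            p *= red / total
            red += add
        else:
            p *= green / total
            green += add
    return p

# Every ordering of two reds and one green has the same probability:
probs = [polya_prob(s) for s in ('RRG', 'RGR', 'GRR')]
print(probs)   # all equal to (6*8*4)/(10*12*14), about 0.1143
```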
An urn initially contains 6 red and 4 green balls. A ball is chosen at random from the urn and its color is recorded. It is then
replaced in the urn and two new balls of the other color are added to the urn. The process is repeated. Find the probability of
each of the following events:
1. Balls 1 and 2 are red and ball 3 is green.
2. Balls 1 and 3 are red and ball 2 is green.
3. Ball 1 is green and balls 2 and 3 are red.
4. Ball 2 is red.
5. Ball 1 is red given that ball 2 is red.
Answer
Think about the results in the previous exercise, and compare with Pólya's urn. Note that the answers to parts (a), (b), and (c) are
not all the same, and that the probability that the second ball is red in part (d) is not the same as the probability that the first ball is
red. In short, the sequence of events (R₁, R₂, …) is not exchangeable.
Diagnostic Testing
Suppose that we have a random experiment with an event A of interest. When we run the experiment, of course, event A will
either occur or not occur. However, suppose that we are not able to observe the occurrence or non-occurrence of A directly. Instead
we have a diagnostic test designed to indicate the occurrence of event A; thus the test can be either positive for A or negative
for A . The test also has an element of randomness, and in particular can be in error. Here are some typical examples of the type of
situation we have in mind:
The event is that a person has a certain disease and the test is a blood test for the disease.
The event is that a woman is pregnant and the test is a home pregnancy test.
The event is that a person is lying and the test is a lie-detector test.
The event is that a device is defective and the test consists of a sensor reading.
The event is that a missile is in a certain region of airspace and the test consists of radar signals.
The event is that a person has committed a crime, and the test is a jury trial with evidence presented for and against the event.
Let T be the event that the test is positive for the occurrence of A. The conditional probability P(T ∣ A) is called the sensitivity of
the test. The complementary probability

P(Tᶜ ∣ A) = 1 − P(T ∣ A)  (2.4.24)

is the false negative probability. The conditional probability P(Tᶜ ∣ Aᶜ) is called the specificity of the test, and the complementary
probability P(T ∣ Aᶜ) = 1 − P(Tᶜ ∣ Aᶜ) is the false positive probability. In many cases, the sensitivity and specificity of the test are known, as a result of the development of
the test. However, the user of the test is interested in the opposite conditional probabilities: P(A ∣ T), the probability of the
event of interest given a positive test, and P(Aᶜ ∣ Tᶜ), the probability of the complementary event given a negative test. Of
course, if we know P(A ∣ T) then we also have P(Aᶜ ∣ T) = 1 − P(A ∣ T), the probability of the complementary event given a
positive test. Similarly, if we know P(Aᶜ ∣ Tᶜ) then we also have P(A ∣ Tᶜ) = 1 − P(Aᶜ ∣ Tᶜ), the probability of the event given a negative test.
The probability that the event does not occur, given a negative test, is

P(Aᶜ ∣ Tᶜ) = P(Aᶜ) P(Tᶜ ∣ Aᶜ) / [P(A) P(Tᶜ ∣ A) + P(Aᶜ) P(Tᶜ ∣ Aᶜ)]  (2.4.27)
There is often a trade-off between sensitivity and specificity. An attempt to make a test more sensitive may result in the test being
less specific, and an attempt to make a test more specific may result in the test being less sensitive. As an extreme example,
consider the worthless test that always returns positive, no matter what the evidence. Then T = S so the test has sensitivity 1, but
specificity 0. At the opposite extreme is the worthless test that always returns negative, no matter what the evidence. Then T = ∅
so the test has specificity 1 but sensitivity 0. In between these extremes are helpful tests that are actually based on evidence of some
sort.
Suppose that the sensitivity a = P(T ∣ A) ∈ (0, 1) and the specificity b = P(Tᶜ ∣ Aᶜ) ∈ (0, 1) are fixed. Let p = P(A) denote
the prior probability of the event A and P = P(A ∣ T) the posterior probability of A given a positive test.
P as a function of p is given by

P = ap / [(a + b − 1)p + (1 − b)],  p ∈ [0, 1]  (2.4.28)
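Equation (2.4.28) is easy to explore numerically. The short Python sketch below is illustrative only (the function name is ours); it evaluates the formula for a test with sensitivity 0.99 and specificity 0.95, the values used in the next exercise:

```python
def posterior_positive(p, a, b):
    """P(A | T): posterior probability of A given a positive test,
    with prior p = P(A), sensitivity a, specificity b (equation 2.4.28)."""
    return a * p / ((a + b - 1) * p + (1 - b))

# Sensitivity 0.99, specificity 0.95: note the strong dependence on the prior.
for p in [0.001, 0.01, 0.2, 0.5]:
    print(p, posterior_positive(p, 0.99, 0.95))
```

For a rare event (p = 0.01) the posterior is only about 1/6, even though the test looks excellent.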
Of course, part (b) is the typical case, where the test is useful. In fact, we would hope that the sensitivity and specificity are close to
1. In case (c), the test is worse than useless since it gives the wrong information about A . But this case could be turned into a useful
test by simply reversing the roles of positive and negative. In case (d), the test is worthless and gives no information about A . It's
interesting that the broad classification above depends only on the sum of the sensitivity and specificity.
Suppose that a diagnostic test has sensitivity 0.99 and specificity 0.95. Find P(A ∣ T ) for each of the following values of
P(A) :
1. 0.001
2. 0.01
3. 0.2
4. 0.5
5. 0.7
6. 0.9
Answer
With sensitivity 0.99 and specificity 0.95, the test in the last exercise superficially looks good. However the small value of
P(A ∣ T ) for small values of P(A) is striking (but inevitable given the properties above). The moral, of course, is that P(A ∣ T )
depends critically on P(A) not just on the sensitivity and specificity of the test. Moreover, the correct comparison is P(A ∣ T ) with
P(A) , as in the exercise, not P(A ∣ T ) with P(T ∣ A) —Beware of the fallacy of the transposed conditional! In terms of the correct
comparison, the test does indeed work well; P(A ∣ T ) is significantly larger than P(A) in all cases.
A woman initially believes that there is an even chance that she is or is not pregnant. She takes a home pregnancy test with
sensitivity 0.95 and specificity 0.90 (which are reasonable values for a home pregnancy test). Find the updated probability that
the woman is pregnant in each of the following cases.
1. The test is positive.
2. The test is negative.
Answer
Suppose that 70% of defendants brought to trial for a certain type of crime are guilty. Moreover, historical data show that juries
convict guilty persons 80% of the time and convict innocent persons 10% of the time. Suppose that a person is tried for a crime
of this type. Find the updated probability that the person is guilty in each of the following cases:
1. The person is convicted.
2. The person is acquitted.
Answer
The “Check Engine” light on your car has turned on. Without the information from the light, you believe that there is a 10%
chance that your car has a serious engine problem. You learn that if the car has such a problem, the light will come on with
probability 0.99, but if the car does not have a serious problem, the light will still come on, under circumstances similar to
yours, with probability 0.3. Find the updated probability that you have an engine problem.
Answer
The standard test for HIV is the ELISA (Enzyme-Linked Immunosorbent Assay) test. It has sensitivity and specificity of 0.999.
Suppose that a person is selected at random from a population in which 1% are infected with HIV, and given the ELISA test.
Find the probability that the person has HIV in each of the following cases:
1. The test is positive.
2. The test is negative.
Answer
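Both parts are Bayes' theorem computations. This illustrative sketch (the helper name is ours, not from the text) handles the positive and the negative test cases:

```python
def bayes_update(p, a, b, positive):
    """Posterior P(A | test result) with prior p, sensitivity a, specificity b."""
    if positive:
        return p * a / (p * a + (1 - p) * (1 - b))
    return p * (1 - a) / (p * (1 - a) + (1 - p) * b)

p, a, b = 0.01, 0.999, 0.999   # 1% prevalence, ELISA sensitivity/specificity
print(bayes_update(p, a, b, True))    # about 0.910
print(bayes_update(p, a, b, False))   # about 0.0000101
```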
The ELISA test for HIV is a very good one. Let's look at another test, this one for prostate cancer, that's rather bad.
The PSA test for prostate cancer is based on a blood marker known as the Prostate Specific Antigen. An elevated level of PSA
is evidence for prostate cancer. To have a diagnostic test, in the sense that we are discussing here, we must decide on a definite
level of PSA, above which we declare the test to be positive. A positive test would typically lead to other more invasive tests
(such as biopsy) which, of course, carry risks and cost. The PSA test with cutoff 2.6 ng/ml has sensitivity 0.40 and specificity
0.81. The overall incidence of prostate cancer among males is 156 per 100000. Suppose that a man, with no particular risk
factors, has the PSA test. Find the probability that the man has prostate cancer in each of the following cases:
1. The test is positive.
2. The test is negative.
Answer
Diagnostic testing is closely related to a general statistical procedure known as hypothesis testing. A separate chapter on hypothesis
testing explores this procedure in detail.
For the M&M data set, find the empirical probability that a bag has at least 10 reds, given that the weight of the bag is at least
48 grams.
Answer
This page titled 2.4: Conditional Probability is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
2.5: Independence
In this section, we will discuss independence, one of the fundamental concepts in probability theory. Independence is frequently
invoked as a modeling assumption, and moreover, (classical) probability itself is based on the idea of independent replications of
the experiment. As usual, if you are a new student of probability, you may want to skip the technical details.
Basic Theory
As usual, our starting point is a random experiment modeled by a probability space (S, S , P) so that S is the set of outcomes, S
the collection of events, and P the probability measure on the sample space (S, S ). We will define independence for two events,
then for collections of events, and then for collections of random variables. In each case, the basic idea is the same.
If both of the events have positive probability, then independence is equivalent to the statement that the conditional probability of
one event given the other is the same as the unconditional probability of the event:
P(A ∣ B) = P(A) ⟺ P(B ∣ A) = P(B) ⟺ P(A ∩ B) = P(A)P(B) (2.5.2)
This is how you should think of independence: knowledge that one event has occurred does not change the probability assigned to
the other event. Independence of two events was discussed in the last section in the context of correlation. In particular, for two
events, independent and uncorrelated mean the same thing.
The terms independent and disjoint sound vaguely similar but they are actually very different. First, note that disjointness is purely
a set-theory concept while independence is a probability (measure-theoretic) concept. Indeed, two events can be independent
relative to one probability measure and dependent relative to another. But most importantly, two disjoint events can never be
independent, except in the trivial case that one of the events is null.
Suppose that A and B are disjoint events, each with positive probability. Then A and B are dependent, and in fact are
negatively correlated.
Proof
If A and B are independent events then intuitively it seems clear that any event that can be constructed from A should be
independent of any event that can be constructed from B . This is the case, as the next result shows. Moreover, this basic idea is
essential for the generalization of independence that we will consider shortly.
If A and B are independent events, then each of the following pairs of events is independent:
1. A, Bᶜ
2. B, Aᶜ
3. Aᶜ, Bᶜ
Proof
An event that is “essentially deterministic”, that is, has probability 0 or 1, is independent of any other event, even itself.
2.5.1 https://stats.libretexts.org/@go/page/10133
General Independence of Events
To extend the definition of independence to more than two events, we might think that we could just require pairwise
independence, the independence of each pair of events. However, this is not sufficient for the strong type of independence that we
have in mind. For example, suppose that we have three events A , B , and C . Mutual independence of these events should not only
mean that each pair is independent, but also that an event that can be constructed from A and B (for example A ∪ Bᶜ) should be
independent of C. Pairwise independence does not achieve this; an exercise below gives three events that are pairwise independent,
but the intersection of two of the events is related to the third event in the strongest possible sense.
Another possible generalization would be to simply require the probability of the intersection of the events to be the product of the
probabilities of the events. However, this condition does not even guarantee pairwise independence. An exercise below gives an
example. However, the definition of independence for two events does generalize in a natural way to an arbitrary collection of
events.
Suppose that Aᵢ is an event for each i in an index set I. Then the collection A = {Aᵢ : i ∈ I} is independent if for every
finite J ⊆ I,

P(⋂_{j∈J} Aⱼ) = ∏_{j∈J} P(Aⱼ)  (2.5.4)
Independence of a collection of events is much stronger than mere pairwise independence of the events in the collection. The basic
inheritance property in the following result follows immediately from the definition.
For a finite collection of events, the number of conditions required for mutual independence grows exponentially with the number
of events.
There are 2ⁿ − n − 1 non-trivial conditions in the definition of the independence of n events.
1. Explicitly give the 4 conditions that must be satisfied for events A , B , and C to be independent.
2. Explicitly give the 11 conditions that must be satisfied for events A , B , C , and D to be independent.
Answer
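The count 2ⁿ − n − 1 arises because there is one product condition for each subset of at least two of the n events. A quick enumeration (illustrative Python, not part of the text) confirms the counts asked for in the exercise:

```python
from itertools import combinations

def independence_conditions(events):
    """All product conditions required for mutual independence:
    one for each subset of size >= 2."""
    conds = []
    for k in range(2, len(events) + 1):
        conds.extend(combinations(events, k))
    return conds

print(len(independence_conditions("ABC")))    # 4  = 2^3 - 3 - 1
print(len(independence_conditions("ABCD")))   # 11 = 2^4 - 4 - 1
```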
If the events A₁, A₂, …, Aₙ are independent, then it follows immediately from the definition that

P(⋂_{i=1}^n Aᵢ) = ∏_{i=1}^n P(Aᵢ)  (2.5.5)
This is known as the multiplication rule for independent events. Compare this with the general multiplication rule for conditional
probability.
The next result generalizes the theorem above on the complements of two independent events.
Suppose that A = {Aᵢ : i ∈ I} and B = {Bᵢ : i ∈ I} are two collections of events with the property that for each i ∈ I,
either Bᵢ = Aᵢ or Bᵢ = Aᵢᶜ. Then A is independent if and only if B is independent.
Proof
The last theorem in turn leads to the type of strong independence that we want. The following exercise gives examples.
Suppose that A, B, C, and D are independent events. Then
1. A ∪ Bᶜ, C, D are independent.
2. A ∪ Bᶜ, Cᶜ ∪ Dᶜ are independent.
Proof
The complete generalization of these results is a bit complicated, but roughly means that if we start with a collection of independent
events, and form new events from disjoint subcollections (using the set operations of union, intersection, and complement), then the
new events are independent. For a precise statement, see the section on measure spaces. The importance of the complement
theorem lies in the fact that any event that can be defined in terms of a finite collection of events {Aᵢ : i ∈ I} can be written as a
disjoint union of events of the form ⋂_{i∈I} Bᵢ where Bᵢ = Aᵢ or Bᵢ = Aᵢᶜ for each i ∈ I.
Another consequence of the general complement theorem is a formula for the probability of the union of a collection of
independent events that is much nicer than the inclusion-exclusion formula.
P(⋃_{i=1}^n Aᵢ) = 1 − ∏_{i=1}^n [1 − P(Aᵢ)]  (2.5.10)
Proof
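The complement-product formula (2.5.10) avoids inclusion-exclusion entirely. The following illustrative sketch (names are ours) compares the two computations for three independent events:

```python
from itertools import combinations
from math import prod

def p_union_independent(ps):
    """P(union of independent events) = 1 - prod(1 - p_i)  (equation 2.5.10)."""
    return 1 - prod(1 - p for p in ps)

def p_union_inclusion_exclusion(ps):
    """Same probability via inclusion-exclusion, using independence to
    compute the probability of each intersection as a product."""
    total, n = 0.0, len(ps)
    for k in range(1, n + 1):
        sign = (-1) ** (k + 1)
        total += sign * sum(prod(sub) for sub in combinations(ps, k))
    return total

ps = [0.3, 0.4, 0.8]
print(p_union_independent(ps))           # 0.916
print(p_union_inclusion_exclusion(ps))   # 0.916, the same
```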
Mathematically, Xᵢ is a function from S into Tᵢ, and recall that {Xᵢ ∈ B} denotes the event {s ∈ S : Xᵢ(s) ∈ B} for B ⊆ Tᵢ.
Intuitively, Xᵢ is a variable of interest in the experiment, and every meaningful statement about Xᵢ defines an event. Intuitively, the
random variables are independent if information about some of the variables tells us nothing about the other variables.
Mathematically, independence of a collection of random variables can be reduced to the independence of collections of events.
The collection of random variables X = {Xᵢ : i ∈ I} is independent if the collection of events {{Xᵢ ∈ Bᵢ} : i ∈ I} is
independent for every choice of Bᵢ ⊆ Tᵢ for i ∈ I. Equivalently, X is independent if for every finite J ⊆ I, and for every choice of Bⱼ ⊆ Tⱼ for j ∈ J,

P(⋂_{j∈J} {Xⱼ ∈ Bⱼ}) = ∏_{j∈J} P(Xⱼ ∈ Bⱼ)  (2.5.12)
Details
It would seem almost obvious that if a collection of random variables is independent, and we transform each variable in
deterministic way, then the new collection of random variables should still be independent.
As with events, the (mutual) independence of random variables is a very strong property. If a collection of random variables is
independent, then any subcollection is also independent. New random variables formed from disjoint subcollections are
independent. For a simple example, suppose that X, Y , and Z are independent real-valued random variables. Then
1. sin(X), cos(Y), and e^Z are independent.
In particular, note that statement 2 in the list above is much stronger than the conjunction of statements 4 and 5. Contrapositively, if
X and Z are dependent, then (X, Y ) and Z are also dependent. Independence of random variables subsumes independence of
events.
A collection of events A is independent if and only if the corresponding collection of indicator variables { 1A : A ∈ A } is
independent.
Proof
Many of the concepts that we have been using informally can now be made precise. A compound experiment that consists of
“independent stages” is essentially just an experiment whose outcome is a sequence of independent random variables
X = (X₁, X₂, …) where Xᵢ is the outcome of the ith stage.
In particular, suppose that we have a basic experiment with outcome variable X. By definition, the outcome of the experiment that
consists of “independent replications” of the basic experiment is a sequence of independent random variables X = (X₁, X₂, …),
each with the same probability distribution as X. This is fundamental to the very concept of probability, as expressed in the law of
large numbers. From a statistical point of view, suppose that we have a population of objects and a vector of measurements X of
interest for the objects in the sample. The sequence X above corresponds to sampling from the distribution of X; that is, Xᵢ is the
vector of measurements for the ith object drawn from the sample. When we sample from a finite population, sampling with
replacement generates independent random variables while sampling without replacement generates dependent random variables.
Suppose that B is an event with P(B) > 0. The collection of events {Aᵢ : i ∈ I} is conditionally independent given B if for every finite J ⊆ I,

P(⋂_{j∈J} Aⱼ ∣ B) = ∏_{j∈J} P(Aⱼ ∣ B)  (2.5.13)
Note that the definitions and theorems of this section would still be true, but with all probabilities conditioned on B .
Conversely, conditional probability has a nice interpretation in terms of independent replications of the experiment. Thus, suppose
that we start with a basic experiment with S as the set of outcomes. We let X denote the outcome random variable, so that
mathematically X is simply the identity function on S . In particular, if A is an event then trivially, P(X ∈ A) = P(A) . Suppose
now that we replicate the experiment independently. This results in a new, compound experiment with a sequence of independent
random variables (X₁, X₂, …), each with the same distribution as X. That is, Xᵢ is the outcome of the ith repetition of the
experiment.
Suppose now that A and B are events in the basic experiment with P(B) > 0 . In the compound experiment, the event that
“when B occurs for the first time, A also occurs” has probability
P(A ∩ B) / P(B) = P(A ∣ B)  (2.5.14)
Proof
Heuristic Argument
Suppose that A and B are disjoint events in a basic experiment with P(A) > 0 and P(B) > 0 . In the compound experiment
obtained by replicating the basic experiment, the event that “A occurs before B ” has probability
P(A) / [P(A) + P(B)]  (2.5.18)
Proof
Examples and Applications
Basic Rules
Suppose that A , B , and C are independent events in an experiment with P(A) = 0.3 , P(B) = 0.4 , and P(C ) = 0.8 . Express
each of the following events in set notation and find its probability:
1. All three events occur.
2. None of the three events occurs.
3. At least one of the three events occurs.
4. At least one of the three events does not occur.
5. Exactly one of the three events occurs.
6. Exactly two of the three events occur.
Answer
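One way to check answers to exercises like this is to enumerate the 2³ = 8 possible occurrence patterns and sum products of probabilities, which is valid precisely because the events are independent. An illustrative sketch (names are ours):

```python
from itertools import product

p = {'A': 0.3, 'B': 0.4, 'C': 0.8}

def prob(pattern):
    """Probability that exactly the events marked 1 occur, using independence."""
    out = 1.0
    for event, occurs in zip('ABC', pattern):
        out *= p[event] if occurs else 1 - p[event]
    return out

patterns = list(product([0, 1], repeat=3))
all_three    = sum(prob(w) for w in patterns if sum(w) == 3)   # 0.096
none         = sum(prob(w) for w in patterns if sum(w) == 0)   # 0.084
at_least_one = 1 - none                                        # 0.916
exactly_one  = sum(prob(w) for w in patterns if sum(w) == 1)   # 0.428
exactly_two  = sum(prob(w) for w in patterns if sum(w) == 2)   # 0.392
```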
Suppose that A, B, and C are independent events for an experiment with P(A) = 1/3, P(B) = 1/4, and P(C) = 1/5. Find the
probability of each of the following events:
1. (A ∩ B) ∪ C
2. A ∪ B ∪ Cᶜ
3. (Aᶜ ∩ Bᶜ) ∪ Cᶜ
Answer
Simple Populations
A small company has 100 employees; 40 are men and 60 are women. There are 6 male executives. How many female
executives should there be if gender and rank are independent? The underlying experiment is to choose an employee at
random.
Answer
Suppose that a farm has four orchards that produce peaches, and that peaches are classified by size as small, medium, and
large. The table below gives total number of peaches in a recent harvest by orchard and by size. Fill in the body of the table
with counts for the various intersections, so that orchard and size are independent variables. The underlying experiment is to
select a peach at random from the farm.
Answer
Note from the last two exercises that you cannot “see” independence in a Venn diagram. Again, independence is a measure-
theoretic concept, not a set-theoretic concept.
Bernoulli Trials
A Bernoulli trials sequence is a sequence X = (X₁, X₂, …) of independent, identically distributed indicator variables. Random
variable Xᵢ is the outcome of trial i, where in the usual terminology of reliability theory, 1 denotes success and 0 denotes failure.
The canonical example is the sequence of scores when a coin (not necessarily fair) is tossed repeatedly. Another basic example
arises whenever we start with a basic experiment and an event A of interest, and then repeat the experiment. In this setting, Xᵢ is
the indicator variable for event A on the ith run of the experiment. The Bernoulli trials process is named for Jacob Bernoulli, and
has a single basic parameter p = P(Xᵢ = 1). This random process is studied in detail in the chapter on Bernoulli trials.
For (x₁, x₂, …, xₙ) ∈ {0, 1}ⁿ,

P(X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ) = p^(x₁+x₂+⋯+xₙ) (1 − p)^(n−(x₁+x₂+⋯+xₙ))  (2.5.19)
Proof
Note that the sequence of indicator random variables X is exchangeable. That is, if the sequence (x₁, x₂, …, xₙ) in the previous
result is permuted, the probability does not change. On the other hand, there are exchangeable sequences of indicator random
variables that are dependent, as Pólya's urn model so dramatically illustrates.
Let Y denote the number of successes in the first n trials: Y = ∑_{i=1}^n Xᵢ. Then P(Y = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ for k ∈ {0, 1, …, n}, where C(n, k) is the binomial coefficient.
Proof
The distribution of Y is called the binomial distribution with parameters n and p. The binomial distribution is studied in more
detail in the chapter on Bernoulli Trials.
More generally, a multinomial trials sequence is a sequence X = (X₁, X₂, …) of independent, identically distributed random
variables, each taking values in a finite set S. The canonical example is the sequence of scores when a k-sided die (not necessarily
fair) is thrown repeatedly. Multinomial trials are also studied in detail in the chapter on Bernoulli trials.
Cards
Consider the experiment that consists of dealing 2 cards at random from a standard deck and recording the sequence of cards
dealt. For i ∈ {1, 2}, let Qᵢ be the event that card i is a queen and Hᵢ the event that card i is a heart. Compute the appropriate
probabilities to determine which pairs of these events are independent.
In the card experiment, set n = 2 . Run the simulation 500 times. For each pair of events in the previous exercise, compute the
product of the empirical probabilities and the empirical probability of the intersection. Compare the results.
Dice
The following exercise gives three events that are pairwise independent, but not (mutually) independent.
Consider the dice experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores. Let A denote
the event that the first score is 3, B the event that the second score is 4, and C the event that the sum of the scores is 7. Then
1. A , B , C are pairwise independent.
2. A ∩ B implies (is a subset of) C and hence these events are dependent in the strongest possible sense.
Answer
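Both claims can be verified by brute-force enumeration of the 36 equally likely outcomes. A minimal sketch (illustrative, with exact rational arithmetic to avoid rounding issues):

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of fair dice scores, each with probability 1/36.
omega = set(product(range(1, 7), repeat=2))

A = {w for w in omega if w[0] == 3}           # first score is 3
B = {w for w in omega if w[1] == 4}           # second score is 4
C = {w for w in omega if w[0] + w[1] == 7}    # sum of the scores is 7

def P(E):
    return Fraction(len(E), 36)

# Pairwise independent:
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# But not mutually independent: A ∩ B = {(3, 4)} forces the sum to be 7.
assert (A & B) <= C
assert P(A & B & C) != P(A) * P(B) * P(C)
```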
In the dice experiment, set n = 2 . Run the experiment 500 times. For each pair of events in the previous exercise, compute the
product of the empirical probabilities and the empirical probability of the intersection. Compare the results.
The following exercise gives an example of three events with the property that the probability of the intersection is the product of
the probabilities, but the events are not pairwise independent.
Suppose that we throw a standard, fair die one time. Let A = {1, 2, 3, 4}, B = C = {4, 5, 6} . Then
1. P(A ∩ B ∩ C ) = P(A)P(B)P(C ) .
2. B and C are the same event, and hence are dependent in the strongest possible sense.
Answer
Suppose that a standard, fair die is thrown 4 times. Find the probability of the following events.
1. Six does not occur.
2. Six occurs at least once.
3. The sum of the first two scores is 5 and the sum of the last two scores is 7.
Answer
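The computations use independence across throws; parts 1 and 2 are complements, and part 3 multiplies probabilities for the two disjoint pairs of throws. An illustrative check (names are ours):

```python
# Independence across throws of a fair die.
p_no_six = (5 / 6) ** 4                 # 1. six does not occur in 4 throws
p_some_six = 1 - p_no_six               # 2. six occurs at least once

# 3. Sum of the first two scores is 5 AND sum of the last two scores is 7;
# the two pairs of throws are independent.
p_sum5 = 4 / 36    # (1,4), (2,3), (3,2), (4,1)
p_sum7 = 6 / 36
p_both = p_sum5 * p_sum7   # = 1/54

print(p_no_six, p_some_six, p_both)
```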
Suppose that a pair of standard, fair dice are thrown 8 times. Find the probability of each of the following events.
1. Double six does not occur.
2. Double six occurs at least once.
3. Double six does not occur on the first 4 throws but occurs at least once in the last 4 throws.
Answer
Consider the dice experiment that consists of rolling n k-sided dice and recording the sequence of scores
X = (X₁, X₂, …, Xₙ). The following conditions are equivalent (and correspond to the assumption that the dice are fair):
Proof
A pair of standard, fair dice are thrown repeatedly. Find the probability of each of the following events.
1. A sum of 4 occurs before a sum of 7.
2. A sum of 5 occurs before a sum of 7.
3. A sum of 6 occurs before a sum of 7.
4. When a sum of 8 occurs the first time, it occurs “the hard way” as (4, 4).
Answer
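These are applications of the "A occurs before B" result (2.5.18): for disjoint events on a single throw, the answer is P(A) / (P(A) + P(B)). A minimal sketch (illustrative, names are ours):

```python
from fractions import Fraction

# Number of ways each dice sum can occur, out of 36 equally likely pairs.
ways = {s: sum(1 for i in range(1, 7) for j in range(1, 7) if i + j == s)
        for s in range(2, 13)}

def before(a, b):
    """P(a sum of `a` occurs before a sum of `b`), using the result above:
    P(A) / (P(A) + P(B)) for the disjoint events on a single throw."""
    return Fraction(ways[a], ways[a] + ways[b])

print(before(4, 7))   # 1/3
print(before(5, 7))   # 2/5
print(before(6, 7))   # 5/11

# 4. When a sum of 8 first occurs, the chance it is "the hard way" (4, 4):
p_hard_eight = Fraction(1, ways[8])   # 1/5
```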
Problems of the type in the last exercise are important in the game of craps. Craps is studied in more detail in the chapter on Games
of Chance.
Coins
A biased coin with probability of heads 1/3 is tossed 5 times. Let X denote the outcome of the tosses (encoded as a bit string)
and let Y denote the number of heads. Find each of the following:
1. P(X = x) for each x ∈ {0, 1}⁵.
A box contains a fair coin and a two-headed coin. A coin is chosen at random from the box and tossed repeatedly. Let F
denote the event that the fair coin is chosen, and let Hᵢ denote the event that the ith toss results in heads. Then
4. P(H₁ ∩ H₂ ∩ ⋯ ∩ Hₙ) = (1/2)ⁿ⁺¹ + 1/2.
5. (H₁, H₂, …) are dependent.
6. P(F ∣ H₁ ∩ H₂ ∩ ⋯ ∩ Hₙ) = 1/(2ⁿ + 1).
7. P(F ∣ H₁ ∩ H₂ ∩ ⋯ ∩ Hₙ) → 0 as n → ∞.
Proof
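The posterior probability of the fair coin after a run of heads can be checked directly with Bayes' theorem. An illustrative sketch in exact arithmetic (the function name is ours):

```python
from fractions import Fraction

def p_fair_given_heads(n):
    """P(fair coin | n heads in a row) for a box with one fair and one
    two-headed coin, chosen at random: equals 1 / (2^n + 1)."""
    num = Fraction(1, 2) * Fraction(1, 2) ** n   # fair coin, then n heads
    den = num + Fraction(1, 2) * 1               # two-headed coin: heads for sure
    return num / den

for n in (1, 2, 5, 10):
    print(n, p_fair_given_heads(n))   # 1/3, 1/5, 1/33, 1/1025
```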
Consider again the box in the previous exercise, but we change the experiment as follows: a coin is chosen at random from the
box and tossed and the result recorded. The coin is returned to the box and the process is repeated. As before, let Hᵢ denote the
event that the ith toss results in heads. Then
3. P(H₁ ∩ H₂ ∩ ⋯ ∩ Hₙ) = (3/4)ⁿ.
Proof
Think carefully about the results in the previous two exercises, and the differences between the two models. Tossing a coin
produces independent random variables if the probability of heads is fixed (that is, non-random even if unknown). Tossing a coin
with a random probability of heads generally does not produce independent random variables; the result of a toss gives information
about the probability of heads which in turn gives information about subsequent tosses.
Uniform Distributions
Recall that Buffon's coin experiment consists of tossing a coin with radius r ≤ 1/2 randomly on a floor covered with square tiles of side length 1. The coordinates (X, Y) of the center of the coin are recorded relative to axes through the center of the square in which the coin lands. The following conditions are equivalent:
1. (X, Y) is uniformly distributed on [−1/2, 1/2]^2.
2. X and Y are independent, each uniformly distributed on [−1/2, 1/2].
Proof
Compare this result with the result above for fair dice.
In Buffon's coin experiment, set r = 0.3. Run the simulation 500 times. For the events {X > 0} and {Y < 0} , compute the
product of the empirical probabilities and the empirical probability of the intersection. Compare the results.
The arrival time X of the A train is uniformly distributed on the interval (0, 30), while the arrival time Y of the B train is
uniformly distributed on the interval (15, 30). (The arrival times are in minutes, after 8:00 AM). Moreover, the arrival times
are independent. Find the probability of each of the following events:
1. The A train arrives first.
2. Both trains arrive sometime after 20 minutes.
Answer
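The exact answers follow from the uniform distributions: P(X < Y) = E(Y)/30 = 22.5/30 = 3/4, and by independence P(X > 20, Y > 20) = (10/30)(10/15) = 2/9. A Monte Carlo sketch to confirm (sample size arbitrary):

```python
import random

random.seed(2)
N = 100_000
x = [random.uniform(0, 30) for _ in range(N)]    # A train arrival time
y = [random.uniform(15, 30) for _ in range(N)]   # B train arrival time

a_first = sum(xi < yi for xi, yi in zip(x, y)) / N
both_late = sum(xi > 20 and yi > 20 for xi, yi in zip(x, y)) / N
print(round(a_first, 3), round(both_late, 3))    # near 0.75 and 2/9 ≈ 0.222
```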
Reliability
Recall the simple model of structural reliability in which a system is composed of n components. Suppose in addition that the
components operate independently of each other. As before, let X_i denote the state of component i, where 1 means working and 0 means failure. Thus, our basic assumption is that the state vector X = (X_1, X_2, …, X_n) is a sequence of independent indicator
random variables. We assume that the state of the system (either working or failed) depends only on the states of the components.
Thus, the state of the system is an indicator random variable
Y = y(X1 , X2 , … , Xn ) (2.5.25)
where y : {0, 1}^n → {0, 1} is the structure function. Generally, the probability that a device is working is the reliability of the device. Thus, we will denote the reliability of component i by p_i = P(X_i = 1), so that the vector of component reliabilities is p = (p_1, p_2, …, p_n). The system reliability is P(Y = 1), a function of the component reliabilities, r(p_1, p_2, …, p_n).
Appropriately enough, this function is known as the reliability function. Our challenge is usually to find the reliability function, given the structure function. When the components all have the same probability p then of course the system reliability r is just a function of p. In this case, the state vector X = (X_1, X_2, …, X_n) forms a sequence of Bernoulli trials.
Comment on the independence assumption for real systems, such as your car or your computer.
Recall that a series system is working if and only if each component is working.
1. The state of the system is U = X_1 X_2 ⋯ X_n = min{X_1, X_2, …, X_n}.
2. The reliability is P(U = 1) = p_1 p_2 ⋯ p_n.
Recall that a parallel system is working if and only if at least one component is working.
1. The state of the system is V = 1 − (1 − X_1)(1 − X_2) ⋯ (1 − X_n) = max{X_1, X_2, …, X_n}.
2. The reliability is P(V = 1) = 1 − (1 − p_1)(1 − p_2) ⋯ (1 − p_n).
Recall that a k out of n system is working if and only if at least k of the n components are working. Thus, a parallel system is a 1
out of n system and a series system is an n out of n system. A k out of 2k − 1 system is a majority rules system. The reliability
function of a general k out of n system is a mess. However, if the component reliabilities are the same, the function has a
reasonably simple form.
For a k out of n system with common component reliability p, the system reliability is
r(p) = \sum_{i=k}^{n} \binom{n}{i} p^i (1 - p)^{n-i} (2.5.27)
Consider a system of 3 independent components with common reliability p = 0.8 . Find the reliability of each of the following:
1. The parallel system.
2. The 2 out of 3 system.
3. The series system.
Answer
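The k out of n reliability formula is straightforward to evaluate; the sketch below computes the three answers above and also compares the 3-engine and 5-engine planes of the next exercise (the function name is my own):

```python
from math import comb

def k_out_of_n_reliability(k, n, p):
    # r(p) = sum_{i=k}^{n} C(n, i) p^i (1 - p)^(n - i)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.8
print(k_out_of_n_reliability(1, 3, p))   # parallel: 1 - 0.2^3 = 0.992
print(k_out_of_n_reliability(2, 3, p))   # 2 out of 3: 0.896
print(k_out_of_n_reliability(3, 3, p))   # series: 0.8^3 = 0.512

# Majority-rules airplanes: 5 engines beat 3 engines exactly when p > 1/2
for q in (0.4, 0.6):
    r3 = k_out_of_n_reliability(2, 3, q)
    r5 = k_out_of_n_reliability(3, 5, q)
    print(q, r5 > r3)
```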
Consider an airplane with an odd number of engines, each with reliability p. Suppose that the airplane is a majority rules
system, so that the airplane needs a majority of working engines in order to fly.
1. Find the reliability of a 3 engine plane as a function of p.
2. Find the reliability of a 5 engine plane as a function of p.
3. For what values of p is a 5 engine plane preferable to a 3 engine plane?
Answer
The graph below is known as the Wheatstone bridge network and is named for Charles Wheatstone. The edges represent
components, and the system works if and only if there is a working path from vertex a to vertex b .
1. Find the structure function.
2. Find the reliability function.
Answer
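Since the figure is not reproduced here, the edge labeling below is an assumption: components 1 and 2 leave vertex a, components 3 and 4 enter vertex b, and component 5 is the bridge between the two interior vertices. Given a structure function, the reliability can always be found by brute force over the 2^5 component states:

```python
from itertools import product

def bridge_works(x1, x2, x3, x4, x5):
    # Working a-to-b paths: {1,3}, {2,4}, {1,5,4}, {2,5,3}
    return int(x1 * x3 or x2 * x4 or x1 * x5 * x4 or x2 * x5 * x3)

def bridge_reliability(p):
    # Sum P(state) over all working states; components are independent
    r = 0.0
    for state in product((0, 1), repeat=5):
        if bridge_works(*state):
            prob = 1.0
            for pi, xi in zip(p, state):
                prob *= pi if xi else 1 - pi
            r += prob
    return r

# With common reliability p, this agrees with 2p^2 + 2p^3 - 5p^4 + 2p^5
print(round(bridge_reliability([0.9] * 5), 5))
```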
A system consists of 3 components, connected in parallel. Because of environmental factors, the components do not operate
independently, so our usual assumption does not hold. However, we will assume that under low stress conditions, the
components are independent, each with reliability 0.9; under medium stress conditions, the components are independent with
reliability 0.8; and under high stress conditions, the components are independent, each with reliability 0.7. The probability of
low stress is 0.5, of medium stress is 0.3, and of high stress is 0.2.
1. Find the reliability of the system.
2. Given that the system works, find the conditional probability of each stress level.
Answer
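This is a direct computation by conditioning on the stress level; a sketch (the variable names are my own):

```python
# (probability of stress level, component reliability at that level)
levels = {"low": (0.5, 0.9), "medium": (0.3, 0.8), "high": (0.2, 0.7)}

# Given the level, the 3 components are independent and in parallel
joint = {name: ps * (1 - (1 - p)**3) for name, (ps, p) in levels.items()}
r = sum(joint.values())                                  # total probability rule
posterior = {name: v / r for name, v in joint.items()}   # Bayes' theorem
print(round(r, 4))
print({name: round(v, 4) for name, v in posterior.items()})
```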
Suppose that bits are transmitted across a noisy communications channel. Each bit that is sent, independently of the others, is
received correctly with probability 0.9 and changed to the complementary bit with probability 0.1. Using redundancy to
improve reliability, suppose that a given bit will be sent 3 times. We naturally want to compute the probability that we correctly
identify the bit that was sent. Assume we have no prior knowledge of the bit, so we assign probability 1/2 each to the event that 000 was sent and the event that 111 was sent. Now find the conditional probability that 111 was sent given each of the 8
possible bit strings received.
Answer
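By Bayes' theorem the posterior probability that 111 was sent depends only on the number of 1s received. The following sketch tabulates all 8 cases:

```python
from itertools import product

def likelihood(sent, received, p=0.9):
    # Bits are transmitted independently; each is received correctly
    # with probability p and flipped with probability 1 - p
    out = 1.0
    for s, r in zip(sent, received):
        out *= p if s == r else 1 - p
    return out

# Prior 1/2 each on 000 and 111; posterior for 111 by Bayes' theorem
posterior = {}
for bits in product("01", repeat=3):
    received = "".join(bits)
    num = 0.5 * likelihood("111", received)
    posterior[received] = num / (num + 0.5 * likelihood("000", received))

for received, prob in sorted(posterior.items()):
    print(received, round(prob, 4))
```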
Diagnostic Testing
Recall the discussion of diagnostic testing in the section on Conditional Probability. Thus, we have an event A for a random
experiment whose occurrence or non-occurrence we cannot observe directly. Suppose now that we have n tests for the occurrence
of A, labeled from 1 to n. We will let T_i denote the event that test i is positive for A. The tests are independent in the following sense:
If A occurs, then (T_1, T_2, …, T_n) are (conditionally) independent and test i has sensitivity a_i = P(T_i ∣ A).
If A does not occur, then (T_1, T_2, …, T_n) are (conditionally) independent and test i has specificity b_i = P(T_i^c ∣ A^c).
Note that unconditionally, it is not reasonable to assume that the tests are independent. For example, a positive result for a given
test presumably is evidence that the condition A has occurred, which in turn is evidence that a subsequent test will be positive. In
short, we expect that T_i and T_j should be positively correlated.
We can form a new, compound test by giving a decision rule in terms of the individual test results. In other words, the event T that
the compound test is positive for A is a function of (T_1, T_2, …, T_n). The typical decision rules are very similar to the reliability structures discussed above. A special case of interest is when the n tests are independent applications of a given basic test. In this case, a_i = a and b_i = b for each i.
Consider the compound test that is positive for A if and only if each of the n tests is positive for A .
1. T = T_1 ∩ T_2 ∩ ⋯ ∩ T_n
Consider the compound test that is positive for A if and only if at least one of the n tests is positive for A.
1. T = T_1 ∪ T_2 ∪ ⋯ ∪ T_n
More generally, we could define the compound k out of n test that is positive for A if and only if at least k of the individual tests
are positive for A . The series test is the n out of n test, while the parallel test is the 1 out of n test. The k out of 2k − 1 test is the
majority rules test.
Suppose that a woman initially believes that there is an even chance that she is or is not pregnant. She buys three identical
pregnancy tests with sensitivity 0.95 and specificity 0.90. Tests 1 and 3 are positive and test 2 is negative.
1. Find the updated probability that the woman is pregnant.
2. Can we just say that tests 2 and 3 cancel each other out? Find the probability that the woman is pregnant given just one
positive test, and compare the answer with the answer to part (a).
Answer
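The calculation is Bayes' theorem with conditionally independent tests; a sketch:

```python
prior = 0.5
sens, spec = 0.95, 0.90

# Tests 1 and 3 positive, test 2 negative; tests are conditionally
# independent given the pregnancy status
like_preg = sens * (1 - sens) * sens          # P(+, -, + | pregnant)
like_not = (1 - spec) * spec * (1 - spec)     # P(+, -, + | not pregnant)
post = prior * like_preg / (prior * like_preg + prior * like_not)
print(round(post, 4))       # part (a): about 0.8337

# A single positive test, for comparison: the tests do not simply cancel,
# because sensitivity and specificity are not symmetric
post1 = prior * sens / (prior * sens + prior * (1 - spec))
print(round(post1, 4))      # about 0.9048
```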
Suppose that 3 independent, identical tests for an event A are applied, each with sensitivity a and specificity b . Find the
sensitivity and specificity of the following tests:
1. 1 out of 3 test
2. 2 out of 3 test
3. 3 out of 3 test
Answer
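Each compound test is a k out of 3 rule applied to conditionally independent tests, so its sensitivity and specificity follow from the binomial form of the reliability function. A sketch with illustrative values a = 0.9, b = 0.8 (my own choice, not from the text):

```python
from math import comb

def at_least(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def compound(k, a, b, n=3):
    sensitivity = at_least(k, n, a)          # P(at least k positive | A)
    specificity = 1 - at_least(k, n, 1 - b)  # P(compound negative | A^c)
    return sensitivity, specificity

a, b = 0.9, 0.8
for k in (1, 2, 3):
    s, sp = compound(k, a, b)
    print(k, round(s, 4), round(sp, 4))
```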
In a criminal trial, the defendant is convicted if and only if all 6 jurors vote guilty. Assume that if the defendant really is guilty,
the jurors vote guilty, independently, with probability 0.95, while if the defendant is really innocent, the jurors vote not guilty,
independently with probability 0.8. Suppose that 70% of defendants brought to trial are guilty.
1. Find the probability that the defendant is convicted.
2. Given that the defendant is convicted, find the probability that the defendant is guilty.
3. Comment on the assumption that the jurors act independently.
Answer
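The computation is the total probability rule followed by Bayes' theorem; a sketch:

```python
p_guilty = 0.7
p_convict_given_guilty = 0.95**6    # all 6 jurors vote guilty
p_convict_given_innocent = 0.2**6   # each votes guilty with probability 0.2

p_convict = (p_guilty * p_convict_given_guilty
             + (1 - p_guilty) * p_convict_given_innocent)
p_guilty_given_convict = p_guilty * p_convict_given_guilty / p_convict
print(round(p_convict, 4))               # about 0.5146
print(round(p_guilty_given_convict, 6))  # very close to 1
```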
Genetics
Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this
section.
Recall first that the ABO blood type in humans is determined by three alleles: a , b , and o . Furthermore, a and b are co-dominant
and o is recessive. Suppose that in a certain population, the proportion of a , b , and o alleles are p, q, and r respectively. Of course
we must have p > 0 , q > 0 , r > 0 and p + q + r = 1 .
Suppose that the blood genotype in a person is the result of independent alleles, chosen with probabilities p, q, and r as above.
1. The probability distribution of the genotypes is given in the following table:
Genotype aa ab ao bb bo oo
Probability p² 2pq 2pr q² 2qr r²
2. The probability distribution of the blood types is given in the following table:
Blood type A B AB O
Probability p² + 2pr q² + 2qr 2pq r²
Proof
The discussion above is related to the Hardy-Weinberg model of genetics. The model is named for the English mathematician Godfrey Hardy and the German physician Wilhelm Weinberg.
Suppose that the probability distribution for the set of blood types in a certain population is given in the following table:
Blood type A B AB O
Probability 0.360 0.123 0.038 0.479
Find p, q, and r.
Answer
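The system can be solved in closed form: O = r², and A + O = p² + 2pr + r² = (p + r)², so r = √O and p = √(A + O) − r, with q found the same way or as 1 − p − r. A sketch:

```python
from math import sqrt, isclose

O, A, B, AB = 0.479, 0.360, 0.123, 0.038

r = sqrt(O)              # O = r^2
p = sqrt(A + O) - r      # A + O = (p + r)^2
q = sqrt(B + O) - r      # B + O = (q + r)^2
print(round(p, 3), round(q, 3), round(r, 3))   # about 0.224 0.084 0.692

# Consistency checks against the remaining data
assert isclose(p + q + r, 1.0, abs_tol=0.01)
assert isclose(2 * p * q, AB, abs_tol=0.005)
```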
Suppose next that pod color in a certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and that g is dominant and y recessive.
Suppose that 2 green-pod plants are bred together. Suppose further that each plant, independently, has the recessive yellow-pod allele with probability 1/3.
1. Find the probability that 3 offspring plants will have green pods.
2. Given that the 3 offspring plants have green pods, find the updated probability that both parents have the recessive allele.
Answer
Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele
and d the defective allele for the gene linked to the disorder. Recall that h is dominant and d recessive for women.
Suppose that a healthy woman initially has a 1/2 chance of being a carrier. (This would be the case, for example, if her mother and father are healthy but she has a brother with the disorder, so that her mother must be a carrier.)
1. Find the probability that the first two sons of the woman will be healthy.
2. Given that the first two sons are healthy, compute the updated probability that she is a carrier.
3. Given that the first two sons are healthy, compute the conditional probability that the third son will be healthy.
Answer
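Since a carrier transmits the defective allele to a son with probability 1/2, the answers are a short Bayes computation; a sketch in exact arithmetic:

```python
from fractions import Fraction

prior = Fraction(1, 2)              # woman is a carrier
p2_carrier = Fraction(1, 2) ** 2    # two healthy sons | carrier
p2_not = Fraction(1)                # two healthy sons | not a carrier

p2 = prior * p2_carrier + (1 - prior) * p2_not
print(p2)                            # part (a): 5/8

post = prior * p2_carrier / p2       # Bayes' theorem
print(post)                          # part (b): 1/5

p3 = post * Fraction(1, 2) + (1 - post)   # condition on carrier status
print(p3)                            # part (c): 9/10
```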
Suppose that we have m + 1 coins, labeled 0, 1, …, m. Coin i lands heads with probability i/m for each i. The experiment is to choose a coin at random (so that each coin is equally likely to be chosen) and then toss the chosen coin repeatedly.
1. The probability that the first n tosses are all heads is p_{m,n} = \frac{1}{m+1} \sum_{i=0}^{m} \left(\frac{i}{m}\right)^n.
2. p_{m,n} → 1/(n + 1) as m → ∞.
3. The conditional probability that toss n + 1 is heads given that the previous n tosses were all heads is p_{m,n+1}/p_{m,n}.
4. p_{m,n+1}/p_{m,n} → (n + 1)/(n + 2) as m → ∞.
Proof
Note that coin 0 is two-tailed, the probability of heads increases with i, and coin m is two-headed. The limiting conditional probability in part (d) is called Laplace's Rule of Succession, named after Pierre-Simon Laplace. This rule was used by Laplace and others as a general principle for estimating the conditional probability that an event will occur at time n + 1, given that the event has occurred n times in succession.
Suppose that a missile has had 10 successful tests in a row. Compute Laplace's estimate that the 11th test will be successful.
Does this make sense?
Answer
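Laplace's estimate is (n + 1)/(n + 2) = 11/12 ≈ 0.917 for n = 10. The finite-m ratio p_{m,n+1}/p_{m,n} can be checked directly (the function name is my own):

```python
def p_heads_run(m, n):
    # p_{m,n} = (1/(m+1)) * sum_{i=0}^{m} (i/m)^n
    return sum((i / m) ** n for i in range(m + 1)) / (m + 1)

m, n = 10_000, 10
ratio = p_heads_run(m, n + 1) / p_heads_run(m, n)
print(round(ratio, 4))    # close to 11/12 ≈ 0.9167
```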
This page titled 2.5: Independence is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.6: Convergence
This is the first of several sections in this chapter that are more advanced than the basic topics in the first five sections. In this
section we discuss several topics related to convergence of events and random variables, a subject of fundamental importance in
probability theory. In particular the results that we obtain will be important for:
Properties of distribution functions,
The weak law of large numbers,
The strong law of large numbers.
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of
outcomes, F the σ-algebra of events, and P the probability measure on the sample space (Ω, F ).
Basic Theory
Sequences of events
Our first discussion deals with sequences of events and various types of limits of such sequences. The limits are also events. We
start with two simple definitions.
Note that these are the standard definitions of increasing and decreasing, relative to the ordinary total order ≤ on the index set N_+
and the subset partial order ⊆ on the collection of events. The terminology is also justified by the corresponding indicator
variables.
1. The sequence of events is increasing if and only if the sequence of indicator variables is increasing in the ordinary sense. That is, I_n ≤ I_{n+1} for each n ∈ N_+.
2. The sequence of events is decreasing if and only if the sequence of indicator variables is decreasing in the ordinary sense. That is, I_{n+1} ≤ I_n for each n ∈ N_+.
Proof
If a sequence of events is either increasing or decreasing, we can define the limit of the sequence in a way that turns out to be quite
natural.
1. If the sequence is increasing, we define lim_{n→∞} A_n = ⋃_{n=1}^∞ A_n.
2. If the sequence is decreasing, we define lim_{n→∞} A_n = ⋂_{n=1}^∞ A_n.
2.6.1 https://stats.libretexts.org/@go/page/10134
Suppose again that (A_1, A_2, …) is a sequence of events, and let I_n = 1_{A_n} denote the indicator variable of A_n for n ∈ N_+.
1. If the sequence of events is increasing, then lim_{n→∞} I_n is the indicator variable of ⋃_{n=1}^∞ A_n.
2. If the sequence of events is decreasing, then lim_{n→∞} I_n is the indicator variable of ⋂_{n=1}^∞ A_n.
Proof
An arbitrary union of events can always be written as a union of increasing events, and an arbitrary intersection of events can
always be written as an intersection of decreasing events:
There is a more interesting and useful way to generate increasing and decreasing sequences from an arbitrary sequence of events,
using the tail segment of the sequence rather than the initial segment.
Proof
Since the new sequences defined in the previous results are decreasing and increasing, respectively, we can take their limits. These
are the limit superior and limit inferior, respectively, of the original sequence.
1. lim sup_{n→∞} A_n = lim_{n→∞} ⋃_{i=n}^∞ A_i = ⋂_{n=1}^∞ ⋃_{i=n}^∞ A_i. This is the event that occurs if and only if A_n occurs for infinitely many values of n.
2. lim inf_{n→∞} A_n = lim_{n→∞} ⋂_{i=n}^∞ A_i = ⋃_{n=1}^∞ ⋂_{i=n}^∞ A_i. This is the event that occurs if and only if A_n occurs for all but finitely many values of n.
Once again, the terminology and notation are clarified by the corresponding indicator variables. You may need to review limit
inferior and limit superior for sequences of real numbers in the section on Partial Orders.
Suppose that (A_1, A_2, …) is a sequence of events, and let I_n = 1_{A_n} denote the indicator variable of A_n for n ∈ N_+. Then lim sup_{n→∞} I_n is the indicator variable of lim sup_{n→∞} A_n, and lim inf_{n→∞} I_n is the indicator variable of lim inf_{n→∞} A_n.
Suppose that (A_1, A_2, …) is a sequence of events. Then lim inf_{n→∞} A_n ⊆ lim sup_{n→∞} A_n.
Proof
The Continuity Theorems
Generally speaking, a function is continuous if it preserves limits. Thus, the following results are the continuity theorems of
probability. Part (a) is the continuity theorem for increasing events and part (b) the continuity theorem for decreasing events.
1. If (A_1, A_2, …) is an increasing sequence of events then lim_{n→∞} P(A_n) = P(⋃_{n=1}^∞ A_n).
2. If (A_1, A_2, …) is a decreasing sequence of events then lim_{n→∞} P(A_n) = P(⋂_{n=1}^∞ A_n).
Proof
The continuity theorems can be applied to the increasing and decreasing sequences that we constructed earlier from an arbitrary
sequence of events.
Suppose that (A_1, A_2, …) is a sequence of events. Then
1. P(⋃_{i=1}^∞ A_i) = lim_{n→∞} P(⋃_{i=1}^n A_i)
2. P(⋂_{i=1}^∞ A_i) = lim_{n→∞} P(⋂_{i=1}^n A_i)
Proof
Suppose that (A_1, A_2, …) is a sequence of events. Then
1. P(lim sup_{n→∞} A_n) = lim_{n→∞} P(⋃_{i=n}^∞ A_i)
2. P(lim inf_{n→∞} A_n) = lim_{n→∞} P(⋂_{i=n}^∞ A_i)
Proof
The next result shows that the countable additivity axiom for a probability measure is equivalent to finite additivity and the
continuity property for increasing events.
Temporarily, suppose that P is only finitely additive, but satisfies the continuity property for increasing events. Then P is
countably additive.
Proof
There are a few mathematicians who reject the countable additivity axiom of probability measure in favor of the weaker finite
additivity axiom. Whatever the philosophical arguments may be, life is certainly much harder without the continuity theorems.
First Borel-Cantelli Lemma. Suppose that (A_1, A_2, …) is a sequence of events. If ∑_{n=1}^∞ P(A_n) < ∞ then P(lim sup_{n→∞} A_n) = 0.
Proof
The second lemma gives a condition that is sufficient to conclude that infinitely many independent events occur with probability 1.
Second Borel-Cantelli Lemma. Suppose that (A_1, A_2, …) is a sequence of independent events. If ∑_{n=1}^∞ P(A_n) = ∞ then P(lim sup_{n→∞} A_n) = 1.
Proof
For independent events, both Borel-Cantelli lemmas apply of course, and lead to a zero-one law.
Suppose that (A_1, A_2, …) is a sequence of independent events.
1. If ∑_{n=1}^∞ P(A_n) < ∞ then P(lim sup_{n→∞} A_n) = 0.
2. If ∑_{n=1}^∞ P(A_n) = ∞ then P(lim sup_{n→∞} A_n) = 1.
This result is actually a special case of a more general zero-one law, known as the Kolmogorov zero-one law, and named for Andrei Kolmogorov. This law is studied in the more advanced section on measure. Also, we can use the zero-one law to derive a calculus theorem that relates infinite series and infinite products. This derivation is an example of the probabilistic method: the use of probability to obtain results, seemingly unrelated to probability, in other areas of mathematics.
Suppose that p_i ∈ (0, 1) for each i ∈ N_+. Then ∏_{i=1}^∞ p_i > 0 if and only if ∑_{i=1}^∞ (1 − p_i) < ∞.
Proof
Our next result is a simple application of the second Borel-Cantelli lemma to independent replications of a basic experiment.
Suppose that A is an event in a basic random experiment with P(A) > 0 . In the compound experiment that consists of
independent replications of the basic experiment, the event “A occurs infinitely often” has probability 1.
Proof
Euclidean spaces are named for Euclid, of course. As noted above, the one-dimensional case where d(x, y) = |y − x| for x, y ∈ R is particularly important. Returning to the general metric space, recall that if (x_1, x_2, …) is a sequence in S and x ∈ S, then x_n → x as n → ∞ means that d(x_n, x) → 0 as n → ∞ (in the usual calculus sense). For the rest of our discussion, we assume that (X_1, X_2, …) is a sequence of random variables with values in S and X is a random variable with values in S, all defined on a common probability space (Ω, F, P).
We say that X_n → X as n → ∞ with probability 1 if the event that X_n → X as n → ∞ has probability 1. That is,
P{ω ∈ Ω : X_n(ω) → X(ω) as n → ∞} = 1 (2.6.8)
Details
As good probabilists, we usually suppress references to the sample space and write the definition simply as P(X_n → X as n → ∞) = 1. The statement that an event has probability 1 is usually the strongest affirmative statement that we
can make in probability theory. Thus, convergence with probability 1 is the strongest form of convergence. The phrases almost
surely and almost everywhere are sometimes used instead of the phrase with probability 1.
Recall that metrics d and e on S are equivalent if they generate the same topology on S. Recall also that convergence of a sequence is a topological property. That is, if (x_1, x_2, …) is a sequence in S and x ∈ S, and if d, e are equivalent metrics on S, then x_n → x as n → ∞ relative to d if and only if x_n → x as n → ∞ relative to e. So for our random variables as defined above, it follows that X_n → X as n → ∞ with probability 1 relative to d if and only if X_n → X as n → ∞ with probability 1 relative to e.
Proof
Our next result gives a fundamental criterion for convergence with probability 1:
If ∑_{n=1}^∞ P[d(X_n, X) > ϵ] < ∞ for every ϵ > 0 then X_n → X as n → ∞ with probability 1.
Proof
The phrase in probability sounds superficially like the phrase with probability 1. However, as we will soon see, convergence in
probability is much weaker than convergence with probability 1. Indeed, convergence with probability 1 is often called strong
convergence, while convergence in probability is often called weak convergence.
The converse fails with a passion. A simple counterexample is given below. However, there is a partial converse that is very useful.
There are two other modes of convergence that we will discuss later:
Convergence in distribution,
Convergence in mean.
Suppose that we have an infinite sequence of coins labeled 1, 2, …, and that coin n lands heads up with probability 1/n^a for each n ∈ N_+, where a > 0 is a parameter. We toss each coin in sequence one time. In terms of a, find the probability of the following events:
1. infinitely many heads occur
2. infinitely many tails occur
Answer
The following exercise gives a simple example of a sequence of random variables that converge in probability but not with
probability 1. Naturally, we are assuming the standard metric on R.
Suppose again that we have a sequence of coins labeled 1, 2, …, and that coin n lands heads up with probability 1/n for each n. We toss the coins in order to produce a sequence (X_1, X_2, …) of independent indicator random variables with
P(X_n = 1) = 1/n, P(X_n = 0) = 1 − 1/n; n ∈ N_+ (2.6.12)
1. P(X_n = 0 for infinitely many n) = 1, so that infinitely many tails occur with probability 1.
2. P(X_n = 1 for infinitely many n) = 1, so that infinitely many heads occur with probability 1.
3. P(X_n → 0 as n → ∞) = 0.
4. X_n → 0 as n → ∞ in probability.
Proof
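The two modes of convergence in this example can be seen numerically: the frequency of heads at coin n tracks P(X_n = 1) = 1/n (convergence in probability to 0), while the divergence of ∑ 1/n is what drives parts (a) and (b) through the second Borel-Cantelli lemma. A sketch (seed and sample sizes arbitrary):

```python
import random

random.seed(3)

# Convergence in probability: P(X_n = 1) = 1/n -> 0
reps = 100_000
freqs = {}
for n in (10, 100, 1000):
    freqs[n] = sum(random.random() < 1 / n for _ in range(reps)) / reps
    print(n, round(freqs[n], 4))

# But sum 1/n diverges (the harmonic series), so by the second
# Borel-Cantelli lemma, heads occur for infinitely many n with
# probability 1, and the sequence does not converge with probability 1
harmonic = sum(1 / n for n in range(1, 10**6 + 1))
print(round(harmonic, 1))    # about ln(10^6) + 0.5772 ≈ 14.4
```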
Discrete Spaces
Recall that a measurable space (S, S ) is discrete if S is countable and S is the collection of all subsets of S (the power set of S ).
Moreover, S is the Borel σ-algebra corresponding to the discrete metric d on S given by d(x, x) = 0 for x ∈ S and d(x, y) = 1
for distinct x, y ∈ S . How do convergence with probability 1 and convergence in probability work for the discrete metric?
Suppose that (S, S) is a discrete space. Suppose further that (X_1, X_2, …) is a sequence of random variables with values in S and X is a random variable with values in S, all defined on the probability space (Ω, F, P). Relative to the discrete metric d,
1. X_n → X as n → ∞ with probability 1 if and only if P(X_n = X for all but finitely many n ∈ N_+) = 1.
2. X_n → X as n → ∞ in probability if and only if P(X_n ≠ X) → 0 as n → ∞.
Proof
Of course, it's important to realize that a discrete space can be the Borel space for metrics other than the discrete metric.
This page titled 2.6: Convergence is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.7: Measure Spaces
In this section we discuss positive measure spaces (which include probability spaces) from a more advanced point of view. The
sections on Measure Theory and Special Set Structures in the chapter on Foundations are essential prerequisites. On the other hand,
if you are not interested in the measure-theoretic aspects of probability, you can safely skip this section.
Positive Measure
Definitions
Suppose that S is a set, playing the role of a universal set for a mathematical theory. As we have noted before, S usually comes
with a σ-algebra S of admissible subsets of S , so that (S, S ) is a measurable space. In particular, this is the case for the model of
a random experiment, where S is the set of outcomes and S the σ-algebra of events, so that the measurable space (S, S ) is the
sample space of the experiment. A probability measure is a special case of a more general object known as a positive measure.
A positive measure on (S, S ) is a function μ : S → [0, ∞] that satisfies the following axioms:
1. μ(∅) = 0
2. If {A_i : i ∈ I} is a countable, pairwise disjoint collection of sets in S then
μ(⋃_{i∈I} A_i) = ∑_{i∈I} μ(A_i) (2.7.1)
Axiom (b) is called countable additivity, and is the essential property. The measure of a set that consists of a countable union of
disjoint pieces is the sum of the measures of the pieces. Note also that since the terms in the sum are positive, there is no issue with
the order of the terms in the sum, although of course, ∞ is a possible value.
So probability measures are positive measures, but positive measures are important beyond the application to probability. The
standard measures on the Euclidean spaces are all positive measures: the extension of length for measurable subsets of R, the extension of area for measurable subsets of R², the extension of volume for measurable subsets of R³, and the higher dimensional
analogues. We will actually construct these measures in the next section on Existence and Uniqueness. In addition, counting
measure # is a positive measure on the subsets of a set S . Even more general measures that can take positive and negative values
are explored in the chapter on Distributions.
Properties
The following results give some simple properties of a positive measure space (S, S , μ). The proofs are essentially identical to the
proofs of the corresponding properties of probability, except that the measure of a set may be infinite so we must be careful to
avoid the dreaded indeterminate form ∞ − ∞ .
2.7.1 https://stats.libretexts.org/@go/page/10135
Proof
If A, B ∈ S and A ⊆ B then
1. μ(B) = μ(A) + μ(B ∖ A)
2. μ(A) ≤ μ(B)
Proof
Thus μ is an increasing function, relative to the subset partial order ⊆ on S and the ordinary order ≤ on [0, ∞]. In particular, if μ
is a finite measure, then μ(A) < ∞ for every A ∈ S . Note also that if A, B ∈ S and μ(B) < ∞ then
μ(B ∖ A) = μ(B) − μ(A ∩ B) . In the special case that A ⊆ B , this becomes μ(B ∖ A) = μ(B) − μ(A) . In particular, these
results hold for a finite measure and are just like the difference rules for probability. If μ is a finite measure, then μ(A^c) = μ(S) − μ(A). This is the analogue of the complement rule in probability, but with μ(S) replacing 1.
The following result is the analogue of Boole's inequality for probability. For a general positive measure, the result is referred to as
the subadditive property.
Suppose that A_i ∈ S for each i in a countable index set I. Then
μ(⋃_{i∈I} A_i) ≤ ∑_{i∈I} μ(A_i) (2.7.2)
Proof
For a union of sets with finite measure, the inclusion-exclusion formula holds, and the proof is just like the one for probability.
Suppose that A_i ∈ S for each i ∈ I where #(I) = n, and that μ(A_i) < ∞ for i ∈ I. Then
μ(⋃_{i∈I} A_i) = ∑_{k=1}^{n} (−1)^{k−1} ∑_{J⊆I, #(J)=k} μ(⋂_{j∈J} A_j) (2.7.4)
Proof
The continuity theorem for increasing sets holds for a positive measure. The continuity theorem for decreasing events holds also, if
the sets have finite measure. Again, the proofs are similar to the ones for a probability measure, except for considerations of infinite
measure.
1. If (A_1, A_2, …) is an increasing sequence of sets in S then μ(⋃_{i=1}^∞ A_i) = lim_{n→∞} μ(A_n).
2. If (A_1, A_2, …) is a decreasing sequence of sets in S and μ(A_1) < ∞ then μ(⋂_{i=1}^∞ A_i) = lim_{n→∞} μ(A_n).
Proof
Recall that if (A_1, A_2, …) is increasing, ⋃_{i=1}^∞ A_i is denoted lim_{n→∞} A_n, and if (A_1, A_2, …) is decreasing, ⋂_{i=1}^∞ A_i is denoted lim_{n→∞} A_n. In both cases, the continuity theorem has the form μ(lim_{n→∞} A_n) = lim_{n→∞} μ(A_n). The continuity theorem for decreasing events fails without the additional assumption of finite measure. A simple counterexample is given below.
The following corollary of the inclusion-exclusion law gives a condition for countable additivity that does not require that the sets
be disjoint, but only that the intersections have measure 0. The result is used below in the theorem on completion.
Suppose that A_i ∈ S for each i in a countable index set I and that μ(A_i) < ∞ for i ∈ I and μ(A_i ∩ A_j) = 0 for distinct i, j ∈ I. Then
μ(⋃_{i∈I} A_i) = ∑_{i∈I} μ(A_i) (2.7.11)
Proof
More Definitions
If a positive measure is not finite, then the following definition gives the next best thing.
The measure space (S, S, μ) is σ-finite if there exists a countable collection {A_i : i ∈ I} ⊆ S with ⋃_{i∈I} A_i = S and μ(A_i) < ∞ for each i ∈ I.
So of course, if μ is a finite measure on (S, S) then μ is σ-finite, but not conversely in general. On the other hand, for i ∈ I, let S_i = {A ∈ S : A ⊆ A_i}. Then S_i is a σ-algebra of subsets of A_i and μ restricted to S_i is a finite measure. The point of this (and the reason for the definition) is that often nice properties of finite measures can be extended to σ-finite measures. In particular, σ-finite measure spaces play a crucial role in the construction of product measure spaces, and for the completion of a measure space.
Our next definition concerns sets where a measure is concentrated, in a certain sense.
Suppose that (S, S , μ) is a measure space. An atom of the space is a set A ∈ S with the following properties:
1. μ(A) > 0
2. If B ∈ S and B ⊆ A then either μ(B) = μ(A) or μ(B) = 0 .
A measure space that has no atoms is called non-atomic or diffuse.
In probability theory, we are often particularly interested in atoms that are singleton sets. Note that {x} ∈ S is an atom if and only
if μ({x}) > 0, since the only subsets of {x} are {x} itself and ∅.
Constructions
There are several simple ways to construct new positive measures from existing ones. As usual, we start with a measurable space
(S, S ).
Suppose that (R, R) is a measurable subspace of (S, S ). If μ is a positive measure on (S, S ) then μ restricted to R is a
positive measure on (R, R) . If μ is a finite measure on (S, S ) then μ is a finite measure on (R, R) .
Proof
However, if μ is σ-finite on (S, S ), it is not necessarily true that μ is σ-finite on (R, R) . A counterexample is given below. The
previous theorem would apply, in particular, when R = S so that R is a sub σ-algebra of S . Next, a positive multiple of a positive
measure gives another positive measure.
If μ is a positive measure on (S, S ) and c ∈ (0, ∞), then cμ is also a positive measure on (S, S ). If μ is finite (σ-finite) then
cμ is finite (σ-finite) respectively.
Proof
A nontrivial finite positive measure μ is practically just like a probability measure, and in fact can be re-scaled into a probability
measure P, as was done in the section on Probability Measures:
Suppose that μ is a positive measure on (S, S ) with 0 < μ(S) < ∞ . Then P defined by P(A) = μ(A)/μ(S) for A ∈ S is a
probability measure on (S, S ).
Proof
If μ_i is a positive measure on (S, S) for each i in a countable index set I then μ = ∑_{i∈I} μ_i is also a positive measure on (S, S).
1. If I is finite and μ_i is finite for each i ∈ I then μ is finite.
Proof
In the context of the last result, if I is countably infinite and μ_i is finite for each i ∈ I, then μ is not necessarily σ-finite. A counterexample is given below. In this case, μ is said to be s-finite, but we've had enough definitions, so we won't pursue this one.
From scaling and sum properties, note that a positive linear combination of positive measures is a positive measure. The next
method is sometimes referred to as a change of variables.
Suppose that (S, S , μ) is a measure space. Suppose also that (T , T ) is another measurable space and that f : S → T is
measurable. Then ν defined as follows is a positive measure on (T, T ):
ν(B) = μ[f⁻¹(B)], B ∈ T (2.7.17)
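As an illustration of this construction, here is a hedged Python sketch on a hypothetical finite space: counting measure μ on S is pushed forward through f(x) = x², giving ν(B) = μ[f⁻¹(B)].

```python
# Sketch (illustrative finite example): the "change of variables" measure
# nu(B) = mu(f^{-1}(B)) induced by a measurable map f : S -> T.

def preimage(f, B, S):
    """f^{-1}(B) = {x in S : f(x) in B}."""
    return {x for x in S if f(x) in B}

def pushforward(mu, f, S):
    """Return nu with nu(B) = mu(f^{-1}(B))."""
    return lambda B: mu(preimage(f, B, S))

S = {-2, -1, 0, 1, 2}
mu = len                       # counting measure on S
f = lambda x: x * x            # measurable map into T = {0, 1, 4}
nu = pushforward(mu, f, S)
print(nu({4}))   # mu({-2, 2}) = 2
print(nu({0}))   # mu({0}) = 1
```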
In the context of the last result, if μ is σ-finite on (S, S ), it is not necessarily true that ν is σ-finite on (T , T ), even if f is one-to-
one. A counterexample is given below. The takeaway is that σ-finiteness of ν depends very much on the nature of the σ-algebra
T . Our next result shows that it's easy to explicitly construct a positive measure on a countably generated σ-algebra, that is, a σ-
algebra generated by a countable partition. Such σ-algebras are important for counterexamples and to gain insight, and also because
many σ-algebras that occur in applications can be constructed from them.
Suppose that A = {A_i : i ∈ I} is a countable partition of S into nonempty sets, and that S = σ(A ), the σ-algebra generated by the partition. For i ∈ I, define μ(A_i) ∈ [0, ∞] arbitrarily. For A = ⋃_{j∈J} A_j where J ⊆ I, define μ(A) = ∑_{j∈J} μ(A_j). Then μ is a positive measure on (S, S ).
Proof
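The construction above can be sketched in Python for a small finite partition (the sets, weights, and names below are illustrative only): each set in σ(A ) is a union of blocks indexed by some J ⊆ I, and μ is defined by summing the block weights.

```python
# Sketch: assign mu(A_i) arbitrarily on a partition {A_i : i in I} and extend
# additively to the sigma-algebra it generates, whose members are exactly the
# unions of partition blocks.

from itertools import chain, combinations

blocks = {0: frozenset({"a"}), 1: frozenset({"b", "c"}), 2: frozenset({"d"})}
mu_block = {0: 1.5, 1: 0.0, 2: 7.0}   # mu(A_i), chosen arbitrarily

def mu(J):
    """Measure of A = union of the blocks A_j for j in J: sum of mu(A_j)."""
    return sum(mu_block[j] for j in J)

# The generated sigma-algebra: one set for each subset J of the index set.
index_sets = list(chain.from_iterable(
    combinations(blocks, r) for r in range(len(blocks) + 1)))
sigma_algebra = [frozenset().union(*(blocks[j] for j in J)) if J else frozenset()
                 for J in index_sets]
print(len(sigma_algebra))   # 2^3 = 8 sets
print(mu((0, 2)))           # mu({"a", "d"}) = 1.5 + 7.0 = 8.5
```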
One of the most general ways to construct new measures from old ones is via the theory of integration with respect to a positive
measure, which is explored in the chapter on Distributions. The construction of positive measures more or less “from scratch” is
considered in the next section on Existence and Uniqueness. We close this discussion with a simple result that is useful for
counterexamples.
Suppose that the measure space (S, S , μ) has an atom A ∈ S with μ(A) = ∞ . Then the space is not σ-finite.
Proof
Suppose that (S, T ) is a topological space and let S = σ(T ) be the Borel σ-algebra. A positive measure μ on (S, S ) is called a Borel measure, and then (S, S , μ) is a Borel measure space.
The next definition concerns the subset on which a Borel measure is concentrated, in a certain sense.
The term Borel measure has different definitions in the literature. Often the topological space is required to be locally compact,
Hausdorff, and with a countable base (LCCB). Then a Borel measure μ is required to have the additional condition that μ(C ) < ∞
if C ⊆ S is compact. In this text, we use the term Borel measures in this more restricted sense.
Suppose that (S, S , μ) is a Borel measure space corresponding to an LCCB topology. Then the space is σ-finite.
Proof
Here are a couple of other definitions that are important for Borel measures, again linking topology and measure in natural ways.
The measure spaces that occur in probability and stochastic processes are usually regular Borel spaces associated with LCCB
topologies.
Consider a measurable “statement” with x ∈ S as a free variable. (Technically, such a statement is a predicate on S .) If the
statement is true for all x ∈ S except for x in a null set, we say that the statement holds almost everywhere on S . This terminology
is used often in measure theory and captures the importance of the definition.
Let D = {A ∈ S : μ(A) = 0 or μ(Aᶜ) = 0}, the collection of null and co-null sets. Then D is a sub σ-algebra of S .
Proof
Of course μ restricted to D is not very interesting since μ(A) = 0 or μ(A) = μ(S) for every A ∈ D. Our next definition is a type
of equivalence between sets in S . To make this precise, recall first that the symmetric difference between subsets A and B of S is
A △ B = (A ∖ B) ∪ (B ∖ A) . This is the set that consists of points in one of the two sets, but not both, and corresponds to
exclusive or.
Thus A ≡ B if and only if μ(A △ B) = μ(A ∖ B) + μ(B ∖ A) = 0 if and only if μ(A ∖ B) = μ(B ∖ A) = 0. In the predicate terminology mentioned above, the statement "x ∈ A if and only if x ∈ B" is true for almost every x ∈ S. As the name suggests, the relation ≡ really is an equivalence relation on S and hence S is
partitioned into disjoint classes of mutually equivalent sets. Two sets in the same equivalence class differ by a set of measure 0.
1. A ≡ A (the reflexive property).
2. If A ≡ B then B ≡ A (the symmetric property).
3. If A ≡ B and B ≡ C then A ≡ C (the transitive property).
Proof
If A, B ∈ S and A ≡ B then Aᶜ ≡ Bᶜ.
Proof
Suppose that A_i, B_i ∈ S for i in a countable index set I, and that A_i ≡ B_i for each i ∈ I. Then
1. ⋃_{i∈I} A_i ≡ ⋃_{i∈I} B_i
2. ⋂_{i∈I} A_i ≡ ⋂_{i∈I} B_i
Proof
The converse trivially fails, and a counterexample is given below. However, the collection of null sets and the collection of co-null
sets do form equivalence classes.
Suppose that A ∈ S .
1. If μ(A) = 0 then A ≡ B if and only if μ(B) = 0 .
2. If μ(Aᶜ) = 0 then A ≡ B if and only if μ(Bᶜ) = 0.
Proof
We can extend the notion of equivalence to measurable functions with a common range space. Thus suppose that (T , T ) is another measurable space. If f, g : S → T are measurable, then (f, g) : S → T × T is measurable with respect to the usual product σ-algebra T ⊗ T . We assume that the diagonal set D = {(y, y) : y ∈ T} ∈ T ⊗ T , which is almost always true in applications.
In the terminology discussed earlier, f ≡ g means that f(x) = g(x) almost everywhere on S. As with measurable sets, the relation ≡ really does define an equivalence relation on the collection of measurable functions from S to T. Thus, the collection of such functions is partitioned into disjoint classes of mutually equivalent functions.
The relation ≡ is an equivalence relation on the collection of measurable functions from S to T . That is, for measurable
f , g, h : S → T ,
1. f ≡ f (the reflexive property).
2. If f ≡ g then g ≡ f (the symmetric property).
3. If f ≡ g and g ≡ h then f ≡ h (the transitive property).
Proof
Suppose again that f, g : S → T are measurable and that f ≡ g. Then f⁻¹(B) ≡ g⁻¹(B) for every B ∈ T .
Proof
Thus if f, g : S → T are measurable and f ≡ g, then by the previous result, ν_f = ν_g where ν_f, ν_g are the measures on (T , T ) associated with f and g, as above. Again, the converse fails with a passion.
It often happens that a definition for functions subsumes the corresponding definition for sets, by considering the indicator functions of the sets. So it is with equivalence. In the following result, we can take T = {0, 1} with T the collection of all subsets.
Equivalence is preserved under composition. For the next result, suppose that (U , U ) is yet another measurable space.
Suppose again that (S, S , μ) is a measure space. Let V denote the collection of all measurable real-valued functions from S into R. (As usual, R is given the Borel σ-algebra.) From our previous discussion of measure theory, we know that with the usual
definitions of addition and scalar multiplication, (V , +, ⋅) is a vector space. However, in measure theory, we often do not want to
distinguish between functions that are equivalent, so it's nice to know that the vector space structure is preserved when we identify
equivalent functions. Formally, let [f ] denote the equivalence class generated by f ∈ V , and let W denote the collection of all such
equivalence classes. In modular notation, W is V / ≡. We define addition and scalar multiplication on W by
[f] + [g] = [f + g], c[f] = [cf]; f, g ∈ V, c ∈ R
(W , +, ⋅) is a vector space.
Proof
Often we don't bother to use the special notation for the equivalence class associated with a function. Rather, it's understood that
equivalent functions represent the same object. Spaces of functions in a measure space are studied further in the chapter on
Distributions.
Completion
Suppose that (S, S , μ) is a measure space and let N = {A ∈ S : μ(A) = 0} denote the collection of null sets of the space. If
A ∈ N and B ∈ S is a subset of A, then we know that μ(B) = 0 so B ∈ N also. However, in general there might be subsets of A that are not in S ; the space is complete if this cannot happen, that is, if every subset of a null set is measurable.
Our goal in this discussion is to show that if a σ-finite measure space (S, S , μ) is not complete, then it can be completed. That is, μ can be extended to a σ-algebra that includes all of the sets in S and all subsets of null sets. The first step is to extend the equivalence relation ≡ to the power set P(S).
For A, B ⊆ S, define A ≡ B if and only if there exists N ∈ N such that A △ B ⊆ N. The relation ≡ is an equivalence relation on P(S): For A, B, C ⊆ S,
1. A ≡ A (the reflexive property).
2. If A ≡ B then B ≡ A (the symmetric property).
3. If A ≡ B and B ≡ C then A ≡ C (the transitive property).
Proof
So the equivalence relation ≡ partitions P(S) into mutually disjoint equivalence classes. Two sets in an equivalence class differ
by a subset of a null set. In particular, A ≡ ∅ if and only if A ⊆ N for some N ∈ N . The extended relation ≡ is preserved under
the set operations, just as before. Our next step is to enlarge the σ-algebra S by adding any set that is equivalent to a set in S .
Let S₀ = {A ⊆ S : A ≡ B for some B ∈ S }. Then S₀ is a σ-algebra of subsets of S, and in fact is the σ-algebra generated by S ∪ {A ⊆ S : A ≡ ∅}.
Proof
For A ∈ S₀, define μ₀(A) = μ(B) where B ∈ S and A ≡ B. Then
1. μ₀ is well defined.
2. μ₀(A) = μ(A) for A ∈ S .
3. μ₀ is a positive measure on S₀.
The measure space (S, S₀, μ₀) is complete and is known as the completion of (S, S , μ).
Proof
Counterexamples
The continuity theorem for decreasing events can fail if the events do not have finite measure.
Consider Z with counting measure # on the σ-algebra of all subsets. Let A_n = {z ∈ Z : z ≤ −n} for n ∈ N₊. The continuity theorem fails for (A₁, A₂, …).
Proof
Suppose that (S, S , μ) is a measure space with the property that there exist disjoint sets A, B ∈ S such that μ(A) = μ(B) > 0. Then A and B are not equivalent.
Proof
For a concrete example, we could take S = {0, 1} with counting measure # on the σ-algebra of all subsets, and A = {0}, B = {1}.
The σ-finite property is not necessarily inherited by a sub-measure space. To set the stage for the counterexample, let R denote the
Borel σ-algebra of R, that is, the σ-algebra generated by the standard Euclidean topology. There exists a positive measure λ on
(R, R) that generalizes length. The measure λ, known as Lebesgue measure, is constructed in the section on Existence. Next let C denote the collection of subsets A of R such that A or Aᶜ is countable. That C is a σ-algebra was shown in the section on measure theory in the chapter on foundations.
(R, C ) is a subspace of (R, R). Moreover, (R, R, λ) is σ-finite but (R, C , λ) is not.
Proof
Let S be a nonempty, finite set with the σ-algebra S of all subsets, and let μ_n = # be counting measure on (S, S ) for each n ∈ N₊. Then μ = ∑_{n=1}^∞ μ_n satisfies μ(A) = ∞ for every nonempty A ∈ S , so (S, S , μ) is not σ-finite, even though each μ_n is finite.
Proof
Basic Properties
In the following problems, μ is a positive measure on the measurable space (S, S ).
Suppose that μ(S) = 20 and that A, B ∈ S with μ(A) = 5 , μ(B) = 6 , μ(A ∩ B) = 2 . Find the measure of each of the
following sets:
1. A ∖ B
2. A ∪ B
3. Aᶜ ∪ Bᶜ
4. Aᶜ ∩ Bᶜ
5. A ∪ Bᶜ
Answer
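One hedged way to check answers of this kind is to build a toy model consistent with the given data, with one weighted "atom" per region of the Venn diagram of A and B, and then count. The weights below are chosen to match μ(S) = 20, μ(A) = 5, μ(B) = 6, μ(A ∩ B) = 2.

```python
# Sketch: a tiny model of the exercise, one weighted point per Venn region.

w = {"A_only": 3, "AB": 2, "B_only": 4, "neither": 11}   # region weights
assert sum(w.values()) == 20                             # mu(S) = 20

A = {"A_only", "AB"}
B = {"B_only", "AB"}
S = set(w)
mu = lambda E: sum(w[r] for r in E)

print(mu(A - B))                 # 1. mu(A \ B) = 3
print(mu(A | B))                 # 2. mu(A ∪ B) = 9
print(mu((S - A) | (S - B)))     # 3. mu(Aᶜ ∪ Bᶜ) = 18
print(mu((S - A) & (S - B)))     # 4. mu(Aᶜ ∩ Bᶜ) = 11
print(mu(A | (S - B)))           # 5. mu(A ∪ Bᶜ) = 16
```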
Suppose that μ(S) = ∞ and that A, B ∈ S with μ(A ∖ B) = 2 , μ(B ∖ A) = 3 , and μ(A ∩ B) = 4 . Find the measure of
each of the following sets:
1. A
2. B
3. A ∪ B
4. Aᶜ ∩ Bᶜ
5. Aᶜ ∪ Bᶜ
Answer
Suppose that μ(S) = 10 and that A, B ∈ S with μ(A) = 3, μ(A ∪ B) = 7, and μ(A ∩ B) = 2. Find the measure of each of the following sets:
1. B
2. A ∖ B
3. B ∖ A
4. Aᶜ ∪ Bᶜ
5. Aᶜ ∩ Bᶜ
Answer
Suppose that A, B, C ∈ S with μ(A) = 10, μ(B) = 12, μ(C) = 15, μ(A ∩ B) = 3, μ(A ∩ C) = 4, μ(B ∩ C) = 5, and μ(A ∩ B ∩ C) = 1. Find the measure of each of the following unions:
1. A ∪ B
2. A ∪ C
3. B ∪ C
4. A ∪ B ∪ C
Answer
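The answers can be checked with the inclusion-exclusion formula; the following sketch simply evaluates the formula on the given data.

```python
# Sketch: inclusion-exclusion for two and three sets, e.g.
# mu(A ∪ B ∪ C) = singles - pairs + triple.

mu_A, mu_B, mu_C = 10, 12, 15
mu_AB, mu_AC, mu_BC = 3, 4, 5
mu_ABC = 1

union_AB = mu_A + mu_B - mu_AB                 # 19
union_AC = mu_A + mu_C - mu_AC                 # 21
union_BC = mu_B + mu_C - mu_BC                 # 22
union_ABC = (mu_A + mu_B + mu_C
             - mu_AB - mu_AC - mu_BC
             + mu_ABC)                         # 26
print(union_AB, union_AC, union_BC, union_ABC)
```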
This page titled 2.7: Measure Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.8: Existence and Uniqueness
Suppose that S is a set and S a σ-algebra of subsets of S , so that (S, S ) is a measurable space. In many cases, it is impossible to
define a positive measure μ on S explicitly, by giving a “formula” for computing μ(A) for each A ∈ S . Rather, we often know
how the measure μ should work on some class of sets B that generates S . We would then like to know that μ can be extended to a
positive measure on S , and that this extension is unique. The purpose of this section is to discuss the basic results on this topic. To
understand this section you will need to review the sections on Measure Theory and Special Set Structures in the chapter on
Foundations, and the section on Measure Spaces in this chapter. If you are not interested in questions of existence and uniqueness
of positive measures, you can safely skip this section.
Basic Theory
Positive Measures on Algebras
Suppose first that A is an algebra of subsets of S . Recall that this means that A is a collection of subsets that contains S and is
closed under complements and finite unions (and hence also finite intersections). Here is our first definition:
A positive measure on the algebra A is a function μ : A → [0, ∞] that satisfies the following properties:
(a) μ(∅) = 0
(b) If {A_i : i ∈ I} is a countable, disjoint collection of sets in A and ⋃_{i∈I} A_i ∈ A, then
μ(⋃_{i∈I} A_i) = ∑_{i∈I} μ(A_i) (2.8.1)
Clearly the definition of a positive measure on an algebra is very similar to the definition for a σ-algebra. If the collection of sets in
(b) is finite, then ⋃_{i∈I} A_i must be in the algebra A. Thus, μ is finitely additive. If the collection is countably infinite, then there is no guarantee that the union is in A. If it is, however, then μ must be additive over this collection. Given the similarity, it is not
surprising that μ shares many of the basic properties of a positive measure on a σ-algebra, with proofs that are almost identical.
If A, B ∈ A and A ⊆ B then
1. μ(B) = μ(A) + μ(B ∖ A)
2. μ(A) ≤ μ(B)
Proof
Thus μ is increasing, relative to the subset partial order ⊆ on A and the ordinary order ≤ on [0, ∞]. Note also that if A, B ∈ A and μ(B) < ∞ then μ(B ∖ A) = μ(B) − μ(A ∩ B). In the special case that A ⊆ B, this becomes μ(B ∖ A) = μ(B) − μ(A). If μ(S) < ∞ then μ(Aᶜ) = μ(S) − μ(A). These are the familiar difference and complement rules.
The following result is the subadditive property for a positive measure μ on an algebra A. Suppose that {A_i : i ∈ I} is a countable collection of sets in A with ⋃_{i∈I} A_i ∈ A. Then
μ(⋃_{i∈I} A_i) ≤ ∑_{i∈I} μ(A_i) (2.8.2)
Proof
For a finite union of sets with finite measure, the inclusion-exclusion formula holds, and the proof is just like the one for a
probability measure.
Suppose that {A_i : i ∈ I} is a finite collection of sets in A where #(I) = n ∈ N₊, and that μ(A_i) < ∞ for i ∈ I. Then
μ(⋃_{i∈I} A_i) = ∑_{k=1}^{n} (−1)^{k−1} ∑_{J⊆I, #(J)=k} μ(⋂_{j∈J} A_j) (2.8.4)
The continuity theorems hold for a positive measure μ on an algebra A , just as for a positive measure on a σ-algebra, assuming
that the appropriate union and intersection are in the algebra. The proofs are just as before.
1. If (A₁, A₂, …) is an increasing sequence of sets in A and ⋃_{i=1}^∞ A_i ∈ A, then μ(⋃_{i=1}^∞ A_i) = lim_{n→∞} μ(A_n).
2. If (A₁, A₂, …) is a decreasing sequence of sets in A with μ(A₁) < ∞, and ⋂_{i=1}^∞ A_i ∈ A, then μ(⋂_{i=1}^∞ A_i) = lim_{n→∞} μ(A_n).
Proof
Recall that if the sequence (A₁, A₂, …) is increasing, then we define lim_{n→∞} A_n = ⋃_{n=1}^∞ A_n, and if the sequence is decreasing then we define lim_{n→∞} A_n = ⋂_{n=1}^∞ A_n. Thus the conclusion of both parts of the continuity theorem is μ(lim_{n→∞} A_n) = lim_{n→∞} μ(A_n).
Finite additivity and continuity for increasing events imply countable additivity.
Many of the basic theorems in measure theory require that the measure not be too far removed from being finite. This leads to the
following definition, which is just like the one for a positive measure on a σ-algebra.
A measure μ on an algebra A of subsets of S is σ-finite if there exists a sequence of sets (A₁, A₂, …) in A such that ⋃_{n=1}^∞ A_n = S and μ(A_n) < ∞ for each n ∈ N₊. The sequence is called a σ-finite sequence for μ.
If μ is a positive, σ-finite measure on an algebra A, then μ can be extended to a positive measure on S = σ(A ).
Proof
Our next goal is the basic uniqueness result, which serves as the complement to the basic extension result. But first we need another
variation of the term σ-finite.
Suppose that μ is a measure on a σ-algebra S of subsets of S and that B ⊆ S . Then μ is σ-finite on B if there exists a countable collection {B_i : i ∈ I} ⊆ B such that μ(B_i) < ∞ for i ∈ I and ⋃_{i∈I} B_i = S.
The next result is the uniqueness theorem. The proof, like others that we have seen, uses Dynkin's π-λ theorem, named for Eugene
Dynkin.
Suppose that B is a π-system and that S = σ(B). If μ₁ and μ₂ are positive measures on S and are σ-finite on B, and if μ₁(A) = μ₂(A) for all A ∈ B, then μ₁(A) = μ₂(A) for all A ∈ S .
Proof
An algebra A of subsets of S is trivially a π-system. Hence, if μ₁ and μ₂ are positive measures on S = σ(A ) and are σ-finite on A, and if μ₁(A) = μ₂(A) for A ∈ A, then μ₁(A) = μ₂(A) for A ∈ S . This completes the second part of the fundamental theorem.
Of course, the results of this subsection hold for probability measures. Formally, a probability measure P on an algebra A of
subsets of S is a positive measure on A with the additional requirement that P(S) = 1 . Probability measures are trivially σ-finite,
so a probability measure P on an algebra A can be uniquely extended to S = σ(A ) .
However, usually we start with a collection that is more primitive than an algebra. The next result combines the definition with the
main theorem associated with the definition. For a proof see the section on Special Set Structures in the chapter on Foundations.
Suppose that B is a nonempty collection of subsets of S, and let A denote the collection of all finite, disjoint unions ⋃_{i∈I} B_i of sets B_i ∈ B. If the following conditions are satisfied, then B is a semi-algebra of subsets of S, and A is the algebra generated by B:
1. If B₁, B₂ ∈ B then B₁ ∩ B₂ ∈ B.
2. If B ∈ B then Bᶜ ∈ A.
Suppose now that we know how a measure μ should work on a semi-algebra B that generates an algebra A and then a σ-algebra
S = σ(A ) = σ(B). That is, we know μ(B) ∈ [0, ∞] for each B ∈ B. Because of the additivity property, there is no question as to how μ must be extended to A: we must have μ(A) = ∑_{i∈I} μ(B_i) if A = ⋃_{i∈I} B_i for some finite, disjoint collection {B_i : i ∈ I} of sets in B (so that A ∈ A). However, we cannot assign the
values μ(B) for B ∈ B arbitrarily. The following extension theorem states that, subject just to some essential consistency
conditions, the extension of μ from the semi-algebra B to the algebra A does in fact produce a measure on A . The consistency
conditions are that μ be finitely additive and countably subadditive on B .
Suppose that B is a semi-algebra of subsets of S and that A is the algebra of subsets of S generated by B . A function
μ : B → [0, ∞] can be uniquely extended to a measure on A if and only if μ satisfies the following properties:
1. If ∅ ∈ B then μ(∅) = 0.
2. If {B_i : i ∈ I} is a finite, disjoint collection of sets in B and B = ⋃_{i∈I} B_i ∈ B, then μ(B) = ∑_{i∈I} μ(B_i).
3. If B ∈ B and {B_i : i ∈ I} is a countable collection of sets in B with B ⊆ ⋃_{i∈I} B_i, then μ(B) ≤ ∑_{i∈I} μ(B_i).
If the measure μ on the algebra A is σ-finite, then the extension theorem and the uniqueness theorem apply, so μ can be extended
uniquely to a measure on the σ-algebra S = σ(A ) = σ(B) . This chain of extensions, starting with a semi-algebra B , is often
how measures are constructed.
Product Spaces
Recall that the product σ-algebra S ⊗ T on S × T is the σ-algebra generated by the Cartesian products of measurable sets, sometimes referred to as measurable rectangles.
Suppose that (S, S , μ) and (T , T , ν ) are σ-finite measure spaces. Then there exists a unique σ-finite measure μ⊗ν on
(S × T , S ⊗ T ) such that
(μ ⊗ ν )(A × B) = μ(A)ν (B); A ∈ S, B ∈ T (2.8.19)
The measure space (S × T , S ⊗ T , μ ⊗ ν) is the product measure space associated with (S, S , μ) and (T , T , ν ).
Proof
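On finite spaces the product measure can be sketched directly in Python (the measures are represented as hypothetical weight dictionaries): the product weight of a point (s, t) is the product of the point weights, and the defining property (μ ⊗ ν)(A × B) = μ(A)ν(B) can be verified on rectangles.

```python
# Sketch: product measure on finite spaces, defined on rectangles by
# (mu ⊗ nu)(A × B) = mu(A) nu(B) and extended to any C ⊆ S × T by summing
# point weights.

mu = {"s1": 2.0, "s2": 3.0}            # measure on S
nu = {"t1": 0.5, "t2": 1.5}            # measure on T

prod = {(s, t): mu[s] * nu[t] for s in mu for t in nu}

def measure(weights, E):
    return sum(weights[x] for x in E)

A, B = {"s1"}, {"t1", "t2"}
rect = {(s, t) for s in A for t in B}
print(measure(prod, rect))               # 2.0 * 2.0 = 4.0
print(measure(mu, A) * measure(nu, B))   # agrees on rectangles
```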
Recall that for C ⊆ S × T, the cross section of C in the first coordinate at x ∈ S is C_x = {y ∈ T : (x, y) ∈ C}. Similarly, the cross section of C in the second coordinate at y ∈ T is C^y = {x ∈ S : (x, y) ∈ C}. We know that the cross sections of a measurable set are measurable. The following result shows that the measures of the cross sections of a measurable set form measurable functions.
Suppose again that (S, S , μ) and (T , T , ν) are σ-finite measure spaces. If C ∈ S ⊗ T then
1. x ↦ ν(C_x) is a measurable function from S to [0, ∞].
2. y ↦ μ(C^y) is a measurable function from T to [0, ∞].
Proof
In the next chapter, where we study integration with respect to a measure, we will see that for C ∈ S ⊗ T , the product measure (μ ⊗ ν)(C) can be computed by integrating ν(C_x) over x ∈ S with respect to μ, or by integrating μ(C^y) over y ∈ T with respect to ν. These results, generalizing the definition of the product measure, are special cases of Fubini's theorem, named for the Italian
mathematician Guido Fubini.
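On finite spaces, where integrals reduce to weighted sums, the Fubini-type computation described above can be sketched as follows (the measures and the set C are illustrative assumptions).

```python
# Sketch: (mu ⊗ nu)(C) computed three ways on finite spaces — directly, as
# the weighted sum of nu(C_x) over x, and as the weighted sum of mu(C^y)
# over y.

mu = {0: 1.0, 1: 2.0}                  # measure on S = {0, 1}
nu = {0: 3.0, 1: 4.0}                  # measure on T = {0, 1}
C = {(0, 0), (0, 1), (1, 1)}           # a non-rectangular measurable set

cross_x = lambda x: {t for (s, t) in C if s == x}    # C_x
cross_y = lambda y: {s for (s, t) in C if t == y}    # C^y

via_x = sum(mu[x] * sum(nu[t] for t in cross_x(x)) for x in mu)
via_y = sum(nu[y] * sum(mu[s] for s in cross_y(y)) for y in nu)
direct = sum(mu[s] * nu[t] for (s, t) in C)
print(via_x, via_y, direct)   # all three agree
```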
Except for more complicated notation, these results extend in a perfectly straightforward way to the product of a finite number of
σ-finite measure spaces.
Suppose that n ∈ N₊ and that (S_i, S_i, μ_i) is a σ-finite measure space for i ∈ {1, 2, …, n}. Let S = ∏_{i=1}^n S_i and let S denote the corresponding product σ-algebra. There exists a unique σ-finite measure μ on (S, S ) satisfying
μ(A₁ × A₂ × ⋯ × A_n) = ∏_{i=1}^n μ_i(A_i); A_i ∈ S_i for each i
The measure space (S, S , μ) is the product measure space associated with the given measure spaces.
Lebesgue Measure
The next discussion concerns our most important and essential application. Recall that the Borel σ-algebra on R, named for Émile
Borel, is the σ-algebra R generated by the standard Euclidean topology on R. Equivalently, R = σ(I ) where I is the collection
of intervals of R (of all types—bounded and unbounded, with any type of closure, and including single points and the empty set).
Next recall how the length of an interval is defined. For a, b ∈ R with a ≤ b , each of the intervals (a, b), [a, b), (a, b], and [a, b]
has length b − a . For a ∈ R , each of the intervals (a, ∞), [a, ∞), (−∞, a) , (−∞, a] has length ∞, as does R itself. The standard
measure on R generalizes the length measurement for intervals.
There exists a unique measure λ on R such that λ(I ) = length(I ) for I ∈ I . The measure λ is Lebesgue measure on
(R, R).
Proof
The name is in honor of Henri Lebesgue, of course. Since λ is σ-finite, the σ-algebra of Borel sets R can be completed with respect to λ.
The completion of the Borel σ-algebra R with respect to λ is the Lebesgue σ-algebra R*.
Recall that completed means that if A ∈ R, λ(A) = 0 and B ⊆ A, then B ∈ R* (and then λ(B) = 0). The Lebesgue measure λ on R, with either the Borel σ-algebra R or its completion R*, is the standard measure that is used for the real numbers. Other properties of the measure space (R, R, λ) are given below, in the discussion of Lebesgue measure on R^n.
For n ∈ N₊, let R_n denote the Borel σ-algebra corresponding to the standard Euclidean topology on R^n, so that (R^n, R_n) is the n-dimensional Euclidean measurable space. The σ-algebra R_n is also the n-fold power of R, the Borel σ-algebra of R. That is,
R_n = σ{I₁ × I₂ × ⋯ × I_n : I_j ∈ I for j ∈ {1, 2, …, n}} (2.8.21)
For n ∈ N₊, let λ_n = λⁿ denote the n-fold product measure on (R^n, R_n); this is Lebesgue measure on R^n. In particular, λ₂ extends the area measure on R^2 and λ₃ extends the volume measure on R^3. In general, λ_n(A) is sometimes referred to as the n-dimensional volume of A ∈ R_n. As in the one-dimensional case, R_n can be completed with respect to λ_n, essentially adding all subsets of sets of measure 0 to R_n. The completed σ-algebra is the σ-algebra of Lebesgue measurable sets.
Since λ_n(U) > 0 if U ⊆ R^n is open and nonempty, the support of λ_n is all of R^n. In addition, Lebesgue measure has the regularity properties that are concerned with approximating the measure of a set, from below with the measure of a compact set, and from above with the measure of an open set.
The following theorem describes how the measure of a set is changed under certain basic transformations. These are essential properties of Lebesgue measure. To set up the notation, suppose that n ∈ N₊, A ⊆ R^n, x ∈ R^n, c ∈ (0, ∞), and that T is an n × n matrix. Define
A + x = {a + x : a ∈ A}, cA = {ca : a ∈ A}, TA = {Ta : a ∈ A} (2.8.24)
Suppose that A ∈ R_n. Then
1. λ_n(A + x) = λ_n(A) (translation invariance).
2. λ_n(cA) = cⁿ λ_n(A) (the scaling property).
3. λ_n(TA) = |det(T)| λ_n(A).
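As a numerical sanity check (not from the text), the scaling property λ₂(cA) = c²λ₂(A) can be illustrated by approximating the area of a disk with a grid count; the step size and radii below are arbitrary choices.

```python
# Sketch: approximate lambda_2 of a disk by counting grid cells inside it,
# then compare the unit disk with its dilation by c = 2.

def disk_area(radius, step=0.01):
    """Grid-count approximation of the area of a disk of the given radius."""
    n = int(radius / step) + 1
    count = sum(1 for i in range(-n, n + 1) for j in range(-n, n + 1)
                if (i * step) ** 2 + (j * step) ** 2 <= radius ** 2)
    return count * step * step

a1 = disk_area(1.0)        # approximately pi
a2 = disk_area(2.0)        # approximately pi * 2^2
print(a2 / a1)             # approximately 4 = c^2 with c = 2
```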
Lebesgue-Stieltjes Measures on R
The construction of Lebesgue measure on R can be generalized. Here is the definition that we will need: a distribution function on R is a function F : R → R that is increasing and continuous from the right.
Since F is increasing, the limit from the left at x ∈ R exists in R and is denoted F(x⁻) = lim_{t↑x} F(t). Similarly, the limits F(∞) = lim_{t→∞} F(t) and F(−∞) = lim_{t→−∞} F(t) exist in [−∞, ∞].
There exists a unique positive measure μ on (R, R) satisfying μ(a, b] = F(b) − F(a) for a, b ∈ R with a ≤ b. The measure μ is called the Lebesgue-Stieltjes measure associated with F, named for Henri Lebesgue and Thomas Joannes Stieltjes. Distribution functions and the measures associated with them are studied in more detail in the chapter on Distributions.
When the function F takes values in [0, 1], the associated measure P is a probability measure, and the function F is the probability
distribution function of P. Probability distribution functions are also studied in much more detail (but with less technicality) in the
chapter on Distributions.
Note that the identity function x ↦ x for x ∈ R is a distribution function, and the measure associated with this function is ordinary Lebesgue measure on R constructed in (15).
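A minimal sketch of how a Lebesgue-Stieltjes measure assigns mass, assuming the standard interval formula μ(a, b] = F(b) − F(a): with F(x) = x the measure is ordinary length, while a jump in F produces an atom at the jump point.

```python
# Sketch: Lebesgue-Stieltjes measure of half-open intervals (a, b] from a
# distribution function F, via mu(a, b] = F(b) - F(a).

def ls_measure(F, a, b):
    """mu(a, b] = F(b) - F(a) for a <= b."""
    assert a <= b
    return F(b) - F(a)

identity = lambda x: x
print(ls_measure(identity, 2.0, 5.0))   # ordinary length: 3.0

# A right-continuous step: F = 0 on (-inf, 0), F = 1 on [0, inf);
# the associated measure is a unit atom at 0.
step = lambda x: 1.0 if x >= 0 else 0.0
print(ls_measure(step, -1.0, 0.0))      # 1.0: the atom at 0 is captured
print(ls_measure(step, 0.0, 3.0))       # 0.0: no mass to the right of 0
```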
This page titled 2.8: Existence and Uniqueness is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
2.9: Probability Spaces Revisited
In this section we discuss probability spaces from the more advanced point of view of measure theory. The previous two sections
on positive measures and existence and uniqueness are prerequisites. The discussion is divided into two parts: first those concepts
that are shared rather equally between probability theory and general measure theory, and second those concepts that are for the
most part unique to probability theory. In particular, it's a mistake to think of probability theory as a mere branch of measure theory.
Probability has its own notation, terminology, point of view, and applications that make it an incredibly rich subject on its own.
Basic Concepts
Our first discussion concerns topics that were discussed in the section on positive measures. So no proofs are necessary, but you
will notice that the notation, and in some cases the terminology, is very different.
Definitions
We can now give a precise definition of a probability space, the mathematical model of a random experiment: a probability space (S, S , P) consists of a set of outcomes S, a σ-algebra of events S , and a probability measure P on (S, S ).
Often the special notation (Ω, F , P) is used for a probability space in the literature—the symbol Ω for the set of outcomes is
intended to remind us that these are all possible outcomes. However in this text, we don't insist on the special notation, and use
whatever notation seems most appropriate in a given context.
In probability, σ-algebras are not just important for theoretical and foundational purposes, but are important for practical purposes
as well. A σ-algebra can be used to specify partial information about an experiment—a concept of fundamental importance.
Specifically, suppose that A is a collection of events in the experiment, and that we know whether or not A occurred for each
A ∈ A . Then in fact, we can determine whether or not A occurred for each A ∈ σ(A ) , the σ-algebra generated by A .
Technically, a random variable for our experiment is a measurable function from the sample space into another measurable space.
Suppose that (S, S , P) is a probability space and that (T , T ) is another measurable space. A random variable X with values
in T is a measurable function from S into T .
1. The probability distribution of X is the mapping on T given by B ↦ P(X ∈ B) .
2. The collection of events {{X ∈ B} : B ∈ T } is a sub σ-algebra of S , and is the σ-algebra generated by X, denoted
σ(X).
Details
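For a finite experiment, σ(X) can be enumerated explicitly as the collection of events {X ∈ B} for B ⊆ T. The die-and-parity example below is a hypothetical illustration, not from the text.

```python
# Sketch: the sigma-algebra generated by a random variable X, namely the
# events {X in B} as B ranges over the subsets of the value space T.

from itertools import chain, combinations

S = {1, 2, 3, 4, 5, 6}                 # outcomes of a die roll
X = lambda s: s % 2                    # X = parity, values in T = {0, 1}
T = {0, 1}

def subsets(U):
    return chain.from_iterable(
        combinations(sorted(U), r) for r in range(len(U) + 1))

sigma_X = {frozenset(s for s in S if X(s) in B) for B in map(set, subsets(T))}
print(len(sigma_X))   # 4 events: empty set, odds, evens, and S itself
```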
More generally, suppose that (T_i, T_i) is a measurable space for each i in an index set I, and that X_i is a random variable taking values in T_i for each i ∈ I. Then
σ{X_i : i ∈ I} = σ{{X_i ∈ B_i} : B_i ∈ T_i, i ∈ I} (2.9.1)
If we observe the value of X_i for each i ∈ I then we know whether or not each event in σ{X_i : i ∈ I} has occurred. This idea is
very important in the study of stochastic processes.
Events A and B are equivalent if A △ B ∈ N , and we denote this by A ≡ B. The relation ≡ is an equivalence relation on S . That is, for A, B, C ∈ S ,
1. A ≡ A (the reflexive property).
2. If A ≡ B then B ≡ A (the symmetric property).
3. If A ≡ B and B ≡ C then A ≡ C (the transitive property).
Thus A ≡ B if and only if P(A △ B) = P(A ∖ B) + P(B ∖ A) = 0 if and only if P(A ∖ B) = P(B ∖ A) = 0 . The equivalence
relation ≡ partitions S into disjoint classes of mutually equivalent events. Equivalence is preserved under the set operations.
Suppose that A_i, B_i ∈ S for i in a countable index set I, and that A_i ≡ B_i for each i ∈ I. Then
1. ⋃_{i∈I} A_i ≡ ⋃_{i∈I} B_i
2. ⋂_{i∈I} A_i ≡ ⋂_{i∈I} B_i
The converse trivially fails, and a counterexample is given below. However, the null and almost sure events do form equivalence
classes.
Suppose that A ∈ S .
1. If A ∈ N then A ≡ B if and only if B ∈ N .
2. If A ∈ M then A ≡ B if and only if B ∈ M .
We can extend the notion of equivalence to random variables taking values in the same space. Thus suppose that (T , T ) is another
measurable space. If X and Y are random variables with values in T , then (X, Y ) is a random variable with values in T × T ,
which is given the usual product σ-algebra T ⊗ T . We assume that the diagonal set D = {(x, x) : x ∈ T } ∈ T ⊗ T , which is
almost always true in applications.
Random variables X and Y taking values in T are equivalent if P(X = Y ) = 1 . Again we write X ≡ Y . The relation ≡ is an
equivalence relation on the collection of random variables that take values in T . That is, for random variables X, Y , and Z
with values in T ,
1. X ≡ X (the reflexive property).
2. If X ≡ Y then Y ≡ X (the symmetric property).
3. If X ≡ Y and Y ≡ Z then X ≡ Z (the transitive property).
So the collection of random variables with values in T is partitioned into disjoint classes of mutually equivalent variables.
Suppose that X and Y are random variables taking values in T and that X ≡ Y . Then
1. {X ∈ B} ≡ {Y ∈ B} for every B ∈ T .
2. X and Y have the same probability distribution on (T , T ).
Again, the converse to part (b) fails with a passion, and a counterexample is given below. It often happens that a definition for
random variables subsumes the corresponding definition for events, by considering the indicator variables of the events. So it is
with equivalence.
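A tiny sketch (hypothetical example) of why the converse fails: for a fair coin, X(s) = s and Y(s) = 1 − s have identical distributions, yet P(X = Y) = 0.

```python
# Sketch: two random variables with the same distribution that are nowhere
# equal. The sample space is a fair coin flip, P uniform on {0, 1}.

S = {0, 1}
P = lambda A: len(A) / 2               # uniform probability on S
X = lambda s: s                        # X(s) = s
Y = lambda s: 1 - s                    # Y(s) = 1 - s

dist = lambda V, b: P({s for s in S if V(s) == b})
print(dist(X, 0), dist(Y, 0))              # 0.5 0.5: identical distributions
print(P({s for s in S if X(s) == Y(s)}))   # 0.0: X and Y are never equal
```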
Equivalence is preserved under a deterministic transformation of the variables. For the next result, suppose that (U , U ) is yet
another measurable space, along with (T , T ).
Suppose X, Y are random variables with values in T and that g : T → U is measurable. If X ≡ Y then g(X) ≡ g(Y ) .
Suppose again that (S, S , P) is a probability space corresponding to a random experiment. Let V denote the collection of all real-
valued random variables for the experiment, that is, all measurable functions from S into R. With the usual definitions of addition
and scalar multiplication, (V , +, ⋅) is a vector space. However, in probability theory, we often do not want to distinguish between
random variables that are equivalent, so it's nice to know that the vector space structure is preserved when we identify equivalent
random variables. Formally, let [X] denote the equivalence class generated by a real-valued random variable X ∈ V , and let W
denote the collection of all such equivalence classes. In modular notation, W is the set V / ≡. We define addition and scalar multiplication on W by
[X] + [Y] = [X + Y], c[X] = [cX]; [X], [Y] ∈ W, c ∈ R (2.9.2)
(W , +, ⋅) is a vector space.
Often we don't bother to use the special notation for the equivalence class associated with a random variable. Rather, it's understood
that equivalent random variables represent the same object. Spaces of functions in a general measure space are studied in the
chapter on Distributions, and spaces of random variables are studied in more detail in the chapter on Expected Value.
Completion
Suppose again that (S, S , P) is a probability space, and that N denotes the collection of null events, as above. Suppose that
A ∈ N so that P(A) = 0 . If B ⊆ A and B ∈ S , then we know that P(B) = 0 so B ∈ N also. However, in general there might
be subsets of A that are not in S . This leads naturally to the following definition.
So the probability space is complete if every subset of an event with probability 0 is also an event (and hence also has probability
0). We know from our work on positive measures that every σ-finite measure space that is not complete can be completed. So in
particular a probability space that is not complete can be completed. To review the construction, recall that the equivalence relation
≡ that we used above on S is extended to P(S) (the power set of S ).
For A, B ⊆ S, define A ≡ B if and only if there exists N ∈ 𝒩 such that A △ B ⊆ N. The relation ≡ is an equivalence relation on P(S).
2.9.3 https://stats.libretexts.org/@go/page/10137
2. P₀ is a probability measure on (S₀, S₀).
Product Spaces
Our next discussion concerns the construction of probability spaces that correspond to specified distributions. To set the stage,
suppose that (S, S , P) is a probability space. If we let X denote the identity function on S , so that X(x) = x for x ∈ S , then
{X ∈ A} = A for A ∈ S and hence P(X ∈ A) = P(A) . That is, P is the probability distribution of X. We have seen this before
—every probability measure can be thought of as the distribution of a random variable. The next result shows how to construct a
probability space that corresponds to a sequence of independent random variables with specified distributions.
Suppose n ∈ N₊ and that (Sᵢ, Sᵢ, Pᵢ) is a probability space for i ∈ {1, 2, …, n}. The corresponding product measure space (S₁ × S₂ × ⋯ × Sₙ, S₁ ⊗ S₂ ⊗ ⋯ ⊗ Sₙ, P₁ ⊗ P₂ ⊗ ⋯ ⊗ Pₙ) is also a probability space.
Proof
Intuitively, the given probability spaces correspond to n random experiments. The product space then is the probability space that
corresponds to the experiments performed independently. When modeling a random experiment, if we say that we have a finite
sequence of independent random variables with specified distributions, we can rest assured that there actually is a probability space
that supports this statement.
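For finite spaces the product construction can be carried out explicitly. The sketch below (with hypothetical marginal distributions) builds the product measure and checks that the coordinate variables are independent with the specified distributions:

```python
from itertools import product

# A finite sketch of the product construction: given marginal measures P1, P2
# on finite sets, the product measure on S1 × S2 makes the coordinate
# variables independent with the specified distributions. All values are
# hypothetical.
P1 = {"H": 0.3, "T": 0.7}          # distribution on S1
P2 = {1: 0.2, 2: 0.5, 3: 0.3}      # distribution on S2

# Product measure: P(x, y) = P1(x) * P2(y)
P = {(x, y): P1[x] * P2[y] for x, y in product(P1, P2)}

def prob(event):
    """Probability of an event described by a predicate on outcomes."""
    return sum(P[w] for w in P if event(w))

# Coordinate variables X(x, y) = x and Y(x, y) = y
A, B = {"H"}, {2, 3}
lhs = prob(lambda w: w[0] in A and w[1] in B)              # P(X ∈ A, Y ∈ B)
rhs = prob(lambda w: w[0] in A) * prob(lambda w: w[1] in B)
print(abs(lhs - rhs) < 1e-12)   # True: X and Y are independent
```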
We can extend the last result to an infinite sequence of probability spaces. Suppose that (Sᵢ, Sᵢ) is a measurable space for each i ∈ N₊. Recall that the product space ∏_{i=1}^∞ Sᵢ consists of all sequences x = (x₁, x₂, …) such that xᵢ ∈ Sᵢ for each i ∈ N₊. The corresponding product σ-algebra S is generated by the collection of cylinder sets. That is, S = σ(B) where
B = {∏_{i=1}^∞ Aᵢ : Aᵢ ∈ Sᵢ for each i ∈ N₊ and Aᵢ = Sᵢ for all but finitely many i}
Suppose that (Sᵢ, Sᵢ, Pᵢ) is a probability space for i ∈ N₊. Let (S, S) denote the product measurable space, so that S = σ(B) where B is the collection of cylinder sets. Then there exists a unique probability measure P on (S, S) that satisfies
P(∏_{i=1}^∞ Aᵢ) = ∏_{i=1}^∞ Pᵢ(Aᵢ),  ∏_{i=1}^∞ Aᵢ ∈ B   (2.9.7)
If Xᵢ denotes the ith coordinate function on S for i ∈ N₊, then (X₁, X₂, …) is a sequence of independent random variables on (S, S, P), and Xᵢ has distribution Pᵢ on (Sᵢ, Sᵢ) for each i ∈ N₊.
Proof
Once again, if we model a random process by starting with an infinite sequence of independent random variables, we can be sure that there exists a probability space that supports this sequence. The particular probability space constructed in the last theorem is called the canonical probability space associated with the sequence of random variables. Note also that it was important that we had probability measures rather than just general positive measures in the construction, since the infinite product ∏_{i=1}^∞ Pᵢ(Aᵢ) is always well defined: each factor is at most 1, so the partial products are nonincreasing and hence convergent. The next section on Stochastic Processes continues the discussion of how to construct probability spaces that correspond to a collection of random variables with specified distributional properties.
Probability Concepts
Our next discussion concerns topics that are unique to probability theory and do not have simple analogies in general measure
theory.
Independence
As usual, suppose that (S, S, P) is a probability space. We have already studied the independence of collections of events and the independence of collections of random variables. A more complete and general treatment results if we define the independence of collections of collections of events, and most importantly, the independence of collections of σ-algebras. This extension actually occurred already, when we went from independence of a collection of events to independence of a collection of random variables, but we did not note it at the time. In spite of the layers of set theory, the basic idea is the same.
Suppose that 𝒜ᵢ is a collection of events for each i in an index set I. Then 𝒜 = {𝒜ᵢ : i ∈ I} is independent if and only if for every choice of Aᵢ ∈ 𝒜ᵢ for i ∈ I, the collection of events {Aᵢ : i ∈ I} is independent. That is, for every finite J ⊆ I,
P(⋂_{j∈J} Aⱼ) = ∏_{j∈J} P(Aⱼ)   (2.9.8)
As noted above, independence of random variables, as we defined previously, is a special case of our new definition.
Suppose that (Tᵢ, Tᵢ) is a measurable space for each i in an index set I, and that Xᵢ is a random variable taking values in Tᵢ for each i ∈ I. Then {Xᵢ : i ∈ I} is independent if and only if {σ(Xᵢ) : i ∈ I} is independent.
Independence of events is also a special case of the new definition, and thus our new definition really does subsume our old one.
Suppose that Aᵢ is an event for each i ∈ I. The independence of {Aᵢ : i ∈ I} is equivalent to the independence of {𝒜ᵢ : i ∈ I} where 𝒜ᵢ = σ{Aᵢ} = {S, ∅, Aᵢ, Aᵢᶜ} for each i ∈ I.
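The result can be checked by brute force on a small space. The sketch below uses two fair coin tosses (a hypothetical setup) and verifies the product rule for every pair of events drawn from σ{A} and σ{B}:

```python
from itertools import product

# A finite check of the result above: if A and B are independent events then
# every pair from σ{A} = {S, ∅, A, Aᶜ} and σ{B} is independent. Two fair
# coin tosses; A = first toss heads, B = second toss heads (hypothetical setup).
S = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]
P = {w: 0.25 for w in S}

A = frozenset(w for w in S if w[0] == "H")
B = frozenset(w for w in S if w[1] == "H")

def prob(E):
    return sum(P[w] for w in E)

def sigma(E):
    """σ{E} = {S, ∅, E, Eᶜ} as a list of frozensets."""
    full = frozenset(S)
    return [full, frozenset(), E, full - E]

ok = all(abs(prob(U & V) - prob(U) * prob(V)) < 1e-12
         for U, V in product(sigma(A), sigma(B)))
print(ok)   # True
```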
For every collection of objects that we have considered (collections of events, collections of random variables, collections of
collections of events), the notion of independence has the basic inheritance property.
Our most important collections are σ-algebras, and so we are most interested in the independence of a collection of σ-algebras. The
next result allows us to go from the independence of certain types of collections to the independence of the σ-algebras generated by
these collections. To understand the result, you will need to review the definitions and theorems concerning π-systems and λ -
systems. The proof uses Dynkin's π-λ theorem, named for Eugene Dynkin.
Suppose that 𝒜ᵢ is a collection of events for each i in an index set I, and that 𝒜ᵢ is a π-system for each i ∈ I. If {𝒜ᵢ : i ∈ I} is independent, then {σ(𝒜ᵢ) : i ∈ I} is independent.
Proof
The next result is a rigorous statement of the strong independence that is implied by the independence of a collection of events.
Suppose that 𝒜 is an independent collection of events, and that {ℬⱼ : j ∈ J} is a partition of 𝒜. That is, ℬⱼ ∩ ℬₖ = ∅ for j ≠ k and ⋃_{j∈J} ℬⱼ = 𝒜. Then {σ(ℬⱼ) : j ∈ J} is independent.
Proof
Let's bring the result down to earth. Suppose that A, B, C, D are independent events. In our elementary discussion of independence, you were asked to show, for example, that A ∪ Bᶜ and Cᶜ ∪ Dᶜ are independent. This is a consequence of the much stronger statement that the σ-algebras σ{A, B} and σ{C, D} are independent.
Exchangeability
As usual, suppose that (S, S, P) is a probability space corresponding to a random experiment. Roughly speaking, a sequence of
events or a sequence of random variables is exchangeable if the probability law that governs the sequence is unchanged when the
order of the events or variables is changed. Exchangeable variables arise naturally in sampling experiments and many other
settings, and are a natural generalization of a sequence of independent, identically distributed (IID) variables. Conversely, it turns
out that any exchangeable sequence of variables can be constructed from an IID sequence. First we give the definition for events:
Suppose that 𝒜 = {Aᵢ : i ∈ I} is a collection of events, where I is a nonempty index set. Then 𝒜 is exchangeable if the probability of the intersection of a finite number of the events depends only on the number of events. That is, if J and K are finite subsets of I and #(J) = #(K) then
P(⋂_{j∈J} Aⱼ) = P(⋂_{k∈K} Aₖ)   (2.9.12)
Exchangeability has the same basic inheritance property that we have seen before.
For a collection of exchangeable events, the inclusion exclusion law for the probability of a union is much simpler than the general
version.
Suppose that {A₁, A₂, …, Aₙ} is an exchangeable collection of events. For J ⊆ {1, 2, …, n} with #(J) = k, let pₖ = P(⋂_{j∈J} Aⱼ). Then
P(⋃_{i=1}^n Aᵢ) = ∑_{k=1}^n (−1)^{k−1} (n choose k) pₖ   (2.9.13)
Proof
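Formula (2.9.13) can be verified numerically in the classic exchangeable setting of sampling without replacement. The urn parameters below are hypothetical:

```python
from math import comb, perm

# A check of formula (2.9.13) in a classic exchangeable setting: draw m balls
# without replacement from an urn with r red and g green balls, and let
# A_i = {ball i is red}. Then p_k = P(A_j1 ∩ ... ∩ A_jk) depends only on k.
r, g, m = 5, 7, 4
N = r + g

def p(k):
    # p_k = r(r-1)...(r-k+1) / N(N-1)...(N-k+1), a ratio of falling factorials
    return perm(r, k) / perm(N, k)

# Inclusion-exclusion with exchangeability, formula (2.9.13):
union_ie = sum((-1) ** (k - 1) * comb(m, k) * p(k) for k in range(1, m + 1))

# Direct computation: P(at least one red) = 1 - P(all m drawn balls green)
union_direct = 1 - perm(g, m) / perm(N, m)

print(abs(union_ie - union_direct) < 1e-12)   # True
```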
The concept of exchangeability can be extended to random variables in the natural way. Suppose that (T, T) is a measurable space.
Suppose that 𝒜 is a collection of random variables, each taking values in T. The collection 𝒜 is exchangeable if for any {X₁, X₂, …, Xₙ} ⊆ 𝒜, the distribution of the random vector (X₁, X₂, …, Xₙ) depends only on n.
Thus, the distribution of the random vector is unchanged if the coordinates are permuted. Once again, exchangeability has the same
basic inheritance property as a collection of independent variables.
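Sampling without replacement gives a concrete exchangeable sequence. The sketch below (with hypothetical urn parameters) computes the exact joint probability of a value vector and checks that it is unchanged under every permutation of the coordinates:

```python
from itertools import permutations

# A finite illustration of exchangeability: the indicator variables of "red"
# in draws without replacement from an urn with r red and g green balls.
# The joint distribution of (X_1, ..., X_n) is invariant under permutation
# of the coordinates. Urn parameters are hypothetical.
r, g, n = 3, 2, 3

def joint(values):
    """Exact P(X_1 = v_1, ..., X_n = v_n) by sequential counting."""
    red, green, p = r, g, 1.0
    for v in values:
        total = red + green
        if v == 1:
            p *= red / total
            red -= 1
        else:
            p *= green / total
            green -= 1
    return p

v = (1, 0, 1)
probs = {pi: joint([v[i] for i in pi]) for pi in permutations(range(n))}
print(max(probs.values()) - min(probs.values()) < 1e-12)   # True: exchangeable
```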
Suppose that 𝒜 is a collection of random variables, each taking values in T, and that 𝒜 is exchangeable. Then trivially the variables are identically distributed: if X, Y ∈ 𝒜 and A ∈ T, then P(X ∈ A) = P(Y ∈ A). Also, the definition of exchangeable variables subsumes the definition for events:
Suppose that 𝒜 is a collection of events, and let ℬ = {1_A : A ∈ 𝒜} denote the corresponding collection of indicator random variables. Then 𝒜 is exchangeable if and only if ℬ is exchangeable.
Suppose that (X₁, X₂, …) is a sequence of random variables. The tail sigma algebra of the sequence is
𝒯 = ⋂_{n=1}^∞ σ{Xₙ, Xₙ₊₁, …}   (2.9.15)
Informally, a tail event (random variable) is an event (random variable) that can be defined in terms of {Xₙ, Xₙ₊₁, …} for each n ∈ N₊. The tail sigma algebra for a sequence of events (A₁, A₂, …) is defined analogously (or simply let Xₖ = 1(Aₖ), the indicator variable of Aₖ, for each k). For the following results, you may need to review some of the definitions in the section on Convergence.
Suppose again that (A₁, A₂, …) is a sequence of events. Each of the following is a tail event of the sequence:
1. lim sup_{n→∞} Aₙ = ⋂_{n=1}^∞ ⋃_{i=n}^∞ Aᵢ
2. lim inf_{n→∞} Aₙ = ⋃_{n=1}^∞ ⋂_{i=n}^∞ Aᵢ
Proof
Proof
The random variable in part (b) may take the value −∞, and the random variable in (c) may take the value ∞. From parts (b) and (c) together, note that if Xₙ → X∞ as n → ∞ on the sample space S, then X∞ is a tail random variable for X.
There are a number of zero-one laws in probability. These are theorems that give conditions under which an event will be
essentially deterministic; that is, have probability 0 or probability 1. Interestingly, it can sometimes be difficult to determine which
of these extremes is actually the case. The following result is the Kolmogorov zero-one law, named for Andrey Kolmogorov. It
states that an event in the tail σ-algebra of an independent sequence will have probability 0 or 1.
From the Kolmogorov zero-one law and the result above, note that if (A₁, A₂, …) is a sequence of independent events, then lim sup_{n→∞} Aₙ must have probability 0 or 1. The Borel-Cantelli lemmas give conditions that determine which of these is the case.
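Both regimes can be seen numerically. For independent events with P(Aₙ) = pₙ, the probability that none of Aₙ, …, A_N occur is the partial product ∏(1 − pₖ); the sketch below contrasts pₖ = 1/k (divergent sum) with pₖ = 1/k² (convergent sum):

```python
from math import prod

# A numerical sketch of the two Borel-Cantelli regimes for independent events
# with P(A_n) = p_n. The probability that none of A_n, ..., A_N occur is
# the product of (1 - p_k); it tends to 0 when the sum of p_k diverges (so
# A_n occurs infinitely often with probability 1) and stays bounded away
# from 0 when the sum converges.
def prob_none(p, n, N):
    return prod(1 - p(k) for k in range(n, N + 1))

divergent = lambda k: 1 / k         # sum of 1/k diverges
convergent = lambda k: 1 / k ** 2   # sum of 1/k**2 converges

print(prob_none(divergent, 2, 100000))   # near 0: A_n occurs i.o. a.s.
print(prob_none(convergent, 2, 100000))  # bounded away from 0
```

In the divergent case the product telescopes to 1/N, so it vanishes as N grows; in the convergent case it stabilizes near 1/2.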
Another proof of the Kolmogorov zero-one law will be given using the martingale convergence theorem.
Counterexamples
Equal probability certainly does not imply equivalent events.
Consider the simple experiment of tossing a fair coin. The event that the coin lands heads and the event that the coin lands tails
have the same probability, but are not equivalent.
Proof
Similarly, equality of distributions does not imply equivalence of random variables.
Consider the experiment of rolling a standard, fair die. Let X denote the score and Y = 7 −X . Then X and Y have the same
distribution but are not equivalent.
Proof
Consider the experiment of rolling two standard, fair dice and recording the sequence of scores (X, Y ) . Then X and Y are
independent and have the same distribution, but are not equivalent.
Proof
This page titled 2.9: Probability Spaces Revisited is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
2.10: Stochastic Processes
Introduction
This section requires measure theory, so you may need to review the advanced sections in the chapter on Foundations and in this chapter.
In particular, recall that a set E almost always comes with a σ-algebra E of admissible subsets, so that (E, E ) is a measurable space.
Usually in fact, E has a topology and E is the corresponding Borel σ-algebra, that is, the σ-algebra generated by the topology. If E is
countable, we almost always take E to be the collection of all subsets of E , and in this case (E, E ) is a discrete space. The other common
case is when E is an uncountable measurable subset of Rⁿ for some n ∈ N₊, in which case E is the collection of measurable subsets of E.
If (E₁, E₁), (E₂, E₂), …, (Eₙ, Eₙ) are measurable spaces for some n ∈ N₊, then the Cartesian product E₁ × E₂ × ⋯ × Eₙ is given the product σ-algebra E₁ ⊗ E₂ ⊗ ⋯ ⊗ Eₙ. As a special case, the Cartesian power Eⁿ is given the corresponding power σ-algebra Eⁿ.
With these preliminary remarks out of the way, suppose that (Ω, F , P) is a probability space, so that Ω is the set of outcomes, F the σ-
algebra of events, and P is the probability measure on the sample space (Ω, F ). Suppose also that (S, S ) and (T , T ) are measurable
spaces. Here is our main definition:
A random process or stochastic process on (Ω, F, P) with state space (S, S) and index set T is a collection of random variables X = {Xₜ : t ∈ T} such that Xₜ takes values in S for each t ∈ T.
Sometimes it's notationally convenient to write X(t) instead of Xₜ for t ∈ T. Often T = N or T = [0, ∞) and the elements of T are interpreted as points in time (discrete time in the first case and continuous time in the second). So then Xₜ ∈ S is the state of the random process at time t ∈ T, and the index space (T, T) becomes the time space.
Since Xₜ is itself a function from Ω into S, it follows that ultimately, a stochastic process is a function from Ω × T into S. Stated another way, t ↦ Xₜ is a random function on the probability space (Ω, F, P). To make this precise, recall that S^T is the notation sometimes used for the collection of functions from T into S. Recall also that a natural σ-algebra used for S^T is the one generated by sets of the form
{f ∈ S^T : f(t) ∈ Aₜ for all t ∈ T}, where Aₜ ∈ S for every t ∈ T and Aₜ = S for all but finitely many t ∈ T   (2.10.1)
Suppose that X = {Xₜ : t ∈ T} is a stochastic process on the probability space (Ω, F, P) with state space (S, S) and index set T. Then the mapping that takes ω into the function t ↦ Xₜ(ω) is measurable with respect to (Ω, F) and (S^T, S^T).
Proof
For ω ∈ Ω, the function t ↦ Xₜ(ω) is known as a sample path of the process. So S^T, the set of functions from T into S, can be thought of as a set of outcomes of the stochastic process X, a point we will return to in our discussion of existence below.
As noted in the proof of the last theorem, Xₜ is a measurable function from Ω into S for each t ∈ T, by the very meaning of the term random variable. But it does not follow in general that (ω, t) ↦ Xₜ(ω) is measurable as a function from Ω × T into S. In fact, the σ-algebra on T has played no role in our discussion so far. Informally, a statement about Xₜ for a fixed t ∈ T or even a statement about Xₜ for countably many t ∈ T defines an event. But it does not follow that a statement about Xₜ for uncountably many t ∈ T defines an event.
A stochastic process X = {Xₜ : t ∈ T} defined on the probability space (Ω, F, P) and with index space (T, T) and state space (S, S) is measurable if (ω, t) ↦ Xₜ(ω) is a measurable function from Ω × T into S.
Every stochastic process indexed by a countable set T is measurable, so the definition is only important when T is uncountable, and in
particular for T = [0, ∞) .
Equivalent Processes
Our next goal is to study different ways that two stochastic processes, with the same state and index spaces, can be “equivalent”. We will assume that the diagonal D = {(x, x) : x ∈ S} belongs to S², an assumption that almost always holds in applications, and in particular for the discrete and Euclidean spaces that are most important to us. Sufficient conditions are that S have a sub σ-algebra that is countably generated and contains all of the singleton sets, properties that hold for the Borel σ-algebra when the topology on S is locally compact, Hausdorff, and has a countable base.
2.10.1 https://stats.libretexts.org/@go/page/10138
First, we often feel that we understand a random process X = {Xₜ : t ∈ T} well if we know the finite dimensional distributions, that is, if we know the distribution of (Xₜ₁, Xₜ₂, …, Xₜₙ) for every choice of n ∈ N₊ and (t₁, t₂, …, tₙ) ∈ Tⁿ. Thus, we can compute P[(Xₜ₁, Xₜ₂, …, Xₜₙ) ∈ A] for every n ∈ N₊, (t₁, t₂, …, tₙ) ∈ Tⁿ, and A ∈ Sⁿ. Using various rules of probability, we can compute the probabilities of many events involving infinitely many values of the index parameter t as well. With this idea in mind, we have the following definition:
Random processes X = {Xₜ : t ∈ T} and Y = {Yₜ : t ∈ T} with state space (S, S) and index set T are equivalent in distribution
if they have the same finite dimensional distributions. This defines an equivalence relation on the collection of stochastic processes
with this state space and index set. That is, if X, Y , and Z are such processes then
1. X is equivalent in distribution to X (the reflexive property)
2. If X is equivalent in distribution to Y then Y is equivalent in distribution to X (the symmetric property)
3. If X is equivalent in distribution to Y and Y is equivalent in distribution to Z then X is equivalent in distribution to Z (the
transitive property)
Note that since only the finite-dimensional distributions of the processes X and Y are involved in the definition, the processes need not be
defined on the same probability space. Thus, equivalence in distribution partitions the collection of all random processes with a given state
space and index set into mutually disjoint equivalence classes. But of course, we already know that two random variables can have the
same distribution but be very different as variables (functions on the sample space). Clearly, the same statement applies to random
processes.
Suppose that X = (X₁, X₂, …) is a sequence of Bernoulli trials with success parameter p = 1/2. Let Yₙ = 1 − Xₙ for n ∈ N₊. Then Y = (Y₁, Y₂, …) is equivalent in distribution to X, but P(Xₙ = Yₙ) = 0 for every n ∈ N₊.
Proof
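A minimal numerical sketch of this example, for a single hypothetical trial: X and Y = 1 − X have the same distribution, yet they never agree:

```python
# A sketch of the example above with a hypothetical fair Bernoulli variable:
# if X takes values in {0, 1} with probability 1/2 each and Y = 1 - X, then
# X and Y have the same distribution but P(X = Y) = 0, so they are not
# equivalent as random variables.
outcomes = {0: 0.5, 1: 0.5}   # distribution of X

dist_X = {x: p for x, p in outcomes.items()}
dist_Y = {1 - x: p for x, p in outcomes.items()}   # distribution of Y = 1 - X

prob_equal = sum(p for x, p in outcomes.items() if x == 1 - x)  # P(X = Y)

print(dist_X == dist_Y)   # True: same distribution
print(prob_equal)         # 0: X and Y never agree
```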
Motivated by this example, let's look at another, stronger way that random processes can be equivalent. First recall that random variables
X and Y on (Ω, F , P), with values in S , are equivalent if P(X = Y ) = 1 .
Suppose that X = {Xₜ : t ∈ T} and Y = {Yₜ : t ∈ T} are stochastic processes defined on the same probability space (Ω, F, P) and both with state space (S, S) and index set T. Then Y is a version of X if Yₜ is equivalent to Xₜ (so that P(Xₜ = Yₜ) = 1) for every t ∈ T. This defines an equivalence relation on the collection of stochastic processes on the same probability space and with the same state space and index set. That is, if X, Y, and Z are such processes then
1. X is a version of X (the reflexive property)
2. If X is a version of Y then Y is a version of X (the symmetric property)
3. If X is a version of Y and Y is a version of Z then X is a version of Z (the transitive property)
Proof
So the version of relation partitions the collection of stochastic processes on a given probability space and with a given state space and
index set into mutually disjoint equivalence classes.
Suppose again that X = {Xₜ : t ∈ T} and Y = {Yₜ : t ∈ T} are random processes on (Ω, F, P) with state space (S, S) and index set T.
As noted in the proof, a countable intersection of events with probability 1 still has probability 1. Hence if T is countable and the random process X is a version of Y then
P(Xt = Yt for all t ∈ T ) = 1 (2.10.5)
so X and Y really are essentially the same random process. But when T is uncountable the result in the displayed equation may not be
true, and X and Y may be very different as random functions on T . Here is a simple example:
Suppose that Ω = T = [0, ∞), F = T is the σ-algebra of Borel measurable subsets of [0, ∞), and P is any continuous probability measure on (Ω, F). Let S = {0, 1} (with all subsets measurable, of course). For t ∈ T and ω ∈ Ω, define Xₜ(ω) = 1ₜ(ω) and Yₜ(ω) = 0.
Proof
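The construction can be explored by simulation. The sketch below samples ω from a hypothetical continuous distribution on [0, ∞): for a fixed t the two processes agree at t almost surely, yet every sample path of X differs from the zero path:

```python
import random

# A sketch of the example above: Ω = [0, ∞) with a continuous measure,
# X_t(ω) = 1(ω = t) and Y_t(ω) = 0. For each fixed t, P(X_t = Y_t) = 1,
# since a continuous measure puts no mass on the single point t, so X is a
# version of Y; yet every sample path of X differs from the zero path.
def X(t, omega):
    return 1.0 if omega == t else 0.0

def Y(t, omega):
    return 0.0

random.seed(0)
omegas = [random.expovariate(1.0) for _ in range(1000)]  # hypothetical continuous law

# For a fixed time t, the sampled ω essentially never hits t exactly:
t = 1.0
print(all(X(t, w) == Y(t, w) for w in omegas))   # agreement at the fixed time t

# But for every outcome ω, the paths differ at time t = ω:
print(all(X(w, w) != Y(w, w) for w in omegas))   # no sample path of X is the zero path
```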
Motivated by this example, we have our strongest form of equivalence:
Suppose that X = {Xₜ : t ∈ T} and Y = {Yₜ : t ∈ T} are measurable random processes on the probability space (Ω, F, P) and with state space (S, S) and index space (T, T). Then X is indistinguishable from Y if P(Xₜ = Yₜ for all t ∈ T) = 1. This defines an equivalence relation on the collection of measurable stochastic processes defined on the same probability space and with the same state and index spaces. That is, if X, Y, and Z are such processes then
1. X is indistinguishable from X (the reflexive property)
2. If X is indistinguishable from Y then Y is indistinguishable from X (the symmetric property)
3. If X is indistinguishable from Y and Y is indistinguishable from Z then X is indistinguishable from Z (the transitive property)
Details
So the indistinguishable from relation partitions the collection of measurable stochastic processes on a given probability space and with
given state space and index space into mutually disjoint equivalence classes. Trivially, if X is indistinguishable from Y , then X is a
version of Y . As noted above, when T is countable, the converse is also true, but not, as our previous example shows, when T is
uncountable. So to summarize, indistinguishable from implies version of implies equivalent in distribution, but none of the converse
implications hold in general.
Suppose that (S, S, P) is a probability space. Then there exists a random variable X on a probability space (Ω, F, P) such that X takes values in S and has distribution P.
In spite of its triviality the last result contains the seeds of everything else we will do in this discussion. Next, let's see how to construct a
sequence of independent random variables with specified distributions.
Suppose that Pᵢ is a probability measure on the measurable space (Sᵢ, Sᵢ) for i ∈ N₊. Then there exists an independent sequence of random variables (X₁, X₂, …) on a probability space (Ω, F, P) such that Xᵢ takes values in Sᵢ and has distribution Pᵢ for i ∈ N₊.
Proof
If you looked at the proof of the last two results you might notice that the last result can be viewed as a special case of the one before, since X = (X₁, X₂, …) is simply the identity function on Ω = ∏_{i=1}^∞ Sᵢ. The important step is the existence of the product measure P on (Ω, F).
The full generalization of these results is known as the Kolmogorov existence theorem (named for Andrei Kolmogorov). We start with the state space (S, S) and the index set T. The theorem states that if we specify the finite dimensional distributions in a consistent way, then there exists a stochastic process defined on a suitable probability space that has the given finite dimensional distributions. The consistency condition is a bit clunky to state in full generality, but the basic idea is very easy to understand. Suppose that s and t are distinct elements in T and that we specify the distribution (probability measure) Pₛ of Xₛ, Pₜ of Xₜ, P_{s,t} of (Xₛ, Xₜ), and P_{t,s} of (Xₜ, Xₛ). Then clearly these specifications must fit together: P_{t,s} must be the measure obtained from P_{s,t} by interchanging the coordinates, and Pₛ and Pₜ must be the corresponding marginal distributions. To state the condition in general, for n ∈ N₊ let T⁽ⁿ⁾ denote the set of n-tuples of distinct elements of T, and let T* = ⋃_{n=1}^∞ T⁽ⁿ⁾ denote the set of all finite sequences of distinct elements of T. If n ∈ N₊, t = (t₁, t₂, …, tₙ) ∈ T⁽ⁿ⁾, and π is a permutation of {1, 2, …, n}, let tπ denote the element of T⁽ⁿ⁾ with coordinates (tπ)ᵢ = t_{π(i)}. Similarly, if C ∈ Sⁿ, let
πC = {(x₁, x₂, …, xₙ) ∈ Sⁿ : (x_{π(1)}, x_{π(2)}, …, x_{π(n)}) ∈ C} ∈ Sⁿ   (2.10.14)
Now suppose that Pₜ is a probability measure on (Sⁿ, Sⁿ) for each n ∈ N₊ and each n-tuple t of distinct elements of T. The idea, of course, is that we want the collection P = {Pₜ : t a finite sequence of distinct elements of T} to be the finite dimensional distributions of a random process with index set T and state space (S, S). Here
is the critical definition:
With the proper definition of consistency, we can state the fundamental theorem.
Kolmogorov Existence Theorem. If P is a consistent collection of probability distributions relative to the index set T and the state space (S, S), then there exists a probability space (Ω, F, P) and a stochastic process X = {Xₜ : t ∈ T} on this probability space such that P is the collection of finite dimensional distributions of X.
Note that except for the more complicated notation, the construction is very similar to the one for a sequence of independent variables. Again, X is essentially the identity function on Ω = S^T. The important and more difficult part is the construction of the probability measure P on (Ω, F).
Applications
Our last discussion is a summary of the stochastic processes that are studied in this text. All are classics and are immensely important in
applications.
Markov chains form a very important family of random processes as do Brownian motion and related processes. We will study these in
subsequent chapters.
This page titled 2.10: Stochastic Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.11: Filtrations and Stopping Times
Introduction
Suppose that X = {Xₜ : t ∈ T} is a stochastic process with state space (S, S) defined on an underlying probability space (Ω, F, P). To review, Ω is the set of outcomes, F the σ-algebra of events, and P the probability measure on the sample space (Ω, F). Also S is the set of states, and S the σ-algebra of admissible subsets of S. Usually, S is a topological space and S the Borel σ-algebra generated by the open subsets of S. A standard set of assumptions is that the topology is locally compact, Hausdorff, and has a countable base, which we will abbreviate by LCCB. For the index set, we assume that either T = N or that T = [0, ∞) and as usual in these cases, we interpret the elements of T as points of time. The set T is also given a topology, the discrete topology in the first case and the standard Euclidean topology in the second case, and then the Borel σ-algebra T. So in discrete time with T = N, T = P(T), the power set of T, so every subset of T is measurable, as is every function from T into another measurable space. Finally, Xₜ is a random variable and so by definition is measurable with respect to F and S for each t ∈ T. We interpret Xₜ as the state of some random system at time t ∈ T. Many important concepts involving X are based on how the future behavior of the process depends on the past behavior, relative to a given current time.
For t ∈ T, let Fₜ = σ{Xₛ : s ∈ T, s ≤ t}, the σ-algebra of events that can be defined in terms of the process up to time t. Roughly speaking, for a given A ∈ Fₜ, we can tell whether or not A has occurred if we are allowed to observe the process up to time t. The family of σ-algebras F = {Fₜ : t ∈ T} has two critical properties: the family is increasing in t ∈ T, relative to the subset partial order, and all of the σ-algebras are sub σ-algebras of F. That is, for s, t ∈ T with s ≤ t, we have Fₛ ⊆ Fₜ ⊆ F.
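For a process with finitely many time steps, the natural filtration can be computed exhaustively. The sketch below (three hypothetical coin tosses) realizes each Fₜ as the collection of all unions of blocks of the partition of Ω by the first t coordinates, and checks that the σ-algebras are increasing:

```python
from itertools import product, combinations, chain

# A finite sketch of the natural filtration: three coin tosses, with
# F_t = σ{X_1, ..., X_t} realized as all unions of blocks of the partition
# of Ω by the first t coordinates. The σ-algebras grow with t.
Omega = list(product("HT", repeat=3))

def sigma_t(t):
    """All unions of blocks {ω : first t coordinates fixed}; this is F_t."""
    blocks = {}
    for w in Omega:
        blocks.setdefault(w[:t], []).append(w)
    blocks = list(blocks.values())
    events = set()
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            events.add(frozenset(chain.from_iterable(combo)))
    return events

F = [sigma_t(t) for t in range(4)]   # F_0 ⊆ F_1 ⊆ F_2 ⊆ F_3
print([len(Ft) for Ft in F])         # [2, 4, 16, 256]
print(all(F[t] <= F[t + 1] for t in range(3)))   # True: increasing in t
```

The sizes 2, 4, 16, 256 reflect that Fₜ has 2^(number of blocks) events, with the partition refining as more tosses are observed.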
Filtrations
Basic Definitions
Sometimes we need σ-algebras that are a bit larger than the ones in the last paragraph. For example, there may be other random
variables that we get to observe, as time goes by, besides the variables in X. Sometimes, particularly in continuous time, there are
technical reasons for somewhat different σ-algebras. Finally, we may want to describe how our information grows, as a family of
σ-algebras, without reference to a random process. For the remainder of this section, we have a fixed measurable space (Ω, F )
which we again think of as a sample space, and the time space (T , T ) as described above.
A family F = {Fₜ : t ∈ T} of sub σ-algebras of F is a filtration on (Ω, F) if Fₛ ⊆ Fₜ for all s, t ∈ T with s ≤ t. If F is a filtration, then (Ω, F, F) is a filtered sample space. If P is a probability measure on (Ω, F), then (Ω, F, F, P) is a filtered probability space.
So a filtration is simply an increasing family of sub σ-algebras of F, indexed by T. We think of Fₜ as the σ-algebra of events up to time t ∈ T. The larger the σ-algebras in a filtration, the more events that are available, so the following relation on filtrations is natural.
Suppose that F = {Fₜ : t ∈ T} and G = {Gₜ : t ∈ T} are filtrations on (Ω, F). We say that F is coarser than G and G is finer than F, and we write F ⪯ G, if Fₜ ⊆ Gₜ for all t ∈ T. The relation ⪯ is a partial order on the collection of filtrations on (Ω, F).
So the coarsest filtration on (Ω, F) is the one where Fₜ = {Ω, ∅} for every t ∈ T while the finest filtration is the one where Fₜ = F for every t ∈ T. In the first case, we gain no information as time evolves, and in the second case, we have complete information from the beginning.
2.11.1 https://stats.libretexts.org/@go/page/10139
Define F∞ = σ(⋃_{t∈T} Fₜ). Then
1. F∞ is a sub σ-algebra of F.
2. Fₜ ⊆ F∞ for t ∈ T.
Proof
Of course, it may be the case that F∞ = F, but not necessarily. Recall that the intersection of a collection of σ-algebras on (Ω, F) is another σ-algebra. We can use this to create new filtrations from a collection of given filtrations.
Suppose that Fⁱ = {Fₜⁱ : t ∈ T} is a filtration on (Ω, F) for each i in a nonempty index set I. Then F = {Fₜ : t ∈ T} where Fₜ = ⋂_{i∈I} Fₜⁱ for t ∈ T is also a filtration on (Ω, F). This filtration is sometimes denoted F = ⋀_{i∈I} Fⁱ, and is the finest filtration that is coarser than Fⁱ for every i ∈ I.
Proof
Unions of σ-algebras are not in general σ-algebras, but we can construct a new filtration from a given collection of filtrations using
unions in a natural way.
Suppose that Fⁱ = {Fₜⁱ : t ∈ T} is a filtration on (Ω, F) for each i in a nonempty index set I. Then F = {Fₜ : t ∈ T} where Fₜ = σ(⋃_{i∈I} Fₜⁱ) for t ∈ T is also a filtration on (Ω, F). This filtration is sometimes denoted F = ⋁_{i∈I} Fⁱ, and is the coarsest filtration that is finer than Fⁱ for every i ∈ I.
Proof
Stochastic Processes
Note again that we can have a filtration without an underlying stochastic process in the background. However, we usually do have a stochastic process X = {Xₜ : t ∈ T}, and in this case the filtration F⁰ = {Fₜ⁰ : t ∈ T} where Fₜ⁰ = σ{Xₛ : s ∈ T, s ≤ t} is the natural filtration associated with X. More generally, the following definition is appropriate.
Equivalently, X is adapted to F if F is finer than F⁰, the natural filtration associated with X. That is, σ{Xₛ : s ∈ T, s ≤ t} ⊆ Fₜ for each t ∈ T. So clearly, if X is adapted to a filtration, then it is adapted to any finer filtration, and F⁰ is the coarsest filtration to which X is adapted. The basic idea behind the definition is that if the filtration F encodes our information as time goes by, then the process X is observable. In discrete time, there is a related definition.
Clearly if X is predictable by F then X is adapted to F. But predictable is better than adapted, in the sense that if F encodes our information as time goes by, then we can look one step into the future in terms of X: at time n we can determine Xₙ₊₁. The concept of predictability can be extended to continuous time, but the definition is much more complicated.
Note that ultimately, a stochastic process X = {Xₜ : t ∈ T} with sample space (Ω, F) and state space (S, S) can be viewed as a function from Ω × T into S, so Xₜ(ω) ∈ S is the state at time t ∈ T corresponding to the outcome ω ∈ Ω. By definition, ω ↦ Xₜ(ω) is measurable for each t ∈ T, but it is often necessary for the process to be jointly measurable in ω and t.
When we have a filtration, as we usually do, there is a stronger condition that is natural. Let Tₜ = {s ∈ T : s ≤ t} for t ∈ T, and let 𝒯ₜ = {A ∩ Tₜ : A ∈ 𝒯} be the corresponding induced σ-algebra, where 𝒯 is the Borel σ-algebra on T.
Suppose that X = {Xₜ : t ∈ T} is a stochastic process with sample space (Ω, F) and state space (S, S), and that F = {Fₜ : t ∈ T} is a filtration. Then X is progressively measurable relative to F if the restriction of (ω, s) ↦ Xₛ(ω) to Ω × Tₜ is measurable with respect to Fₜ ⊗ 𝒯ₜ and S for each t ∈ T.
Clearly if X is progressively measurable with respect to a filtration, then it is progressively measurable with respect to any finer filtration. Of course when T is discrete, any process X is measurable, and any process X adapted to F is progressively measurable, so these definitions are only of interest in the case of continuous time.
Suppose again that X = {X_t : t ∈ T} is a stochastic process with sample space (Ω, F) and state space (S, S), and that F = {F_t : t ∈ T} is a filtration. If X is progressively measurable relative to F then
1. X is measurable.
2. X is adapted to F.
Proof
When the state space is a topological space (which is usually the case), then as you might guess, there is a natural link between
continuity of the sample paths and progressive measurability.
Suppose that S has an LCCB topology and that S is the σ-algebra of Borel sets. Suppose also that X = {X_t : t ∈ [0, ∞)} is right continuous. Then X is progressively measurable relative to the natural filtration F^0.
So if X is right continuous, then X is progressively measurable with respect to any filtration to which X is adapted. Recall that in
the previous section, we studied different ways that two stochastic processes can be equivalent. The following example illustrates
some of the subtleties of processes in continuous time.
Suppose that Ω = T = [0, ∞), F = T is the σ-algebra of Borel measurable subsets of [0, ∞), and P is any continuous probability measure on (Ω, F). Let S = {0, 1} and S = P(S) = {∅, {0}, {1}, {0, 1}}. For t ∈ T and ω ∈ Ω, define X_t(ω) = 1_t(ω) and Y_t(ω) = 0. Then
1. X = {X_t : t ∈ T} is a version of Y = {Y_t : t ∈ T}
Completion
Suppose now that P is a probability measure on (Ω, F ). Recall that F is complete with respect to P if A ∈ F , B ⊆ A , and
P (A) = 0 imply B ∈ F (and hence P (B) = 0 ). That is, if A is an event with probability 0 and B ⊆ A , then B is also an event
(and also has probability 0). For a filtration, the following definition is appropriate.
Suppose P is a probability measure on (Ω, F) and that the filtration {F_t : t ∈ T} is complete with respect to P. If A ∈ F is a null event (P(A) = 0) or an almost certain event (P(A) = 1) then A ∈ F_t for every t ∈ T.
Proof
Recall that if P is a probability measure on (Ω, F), but F is not complete with respect to P, then F can always be completed. Here's a review of how this is done: let N = {A ⊆ Ω : A ⊆ B for some B ∈ F with P(B) = 0}, so N is the collection of null sets. Then we let F^P = σ(F ∪ N) and extend P to F^P in the natural way: if A ∈ F^P and A differs from B ∈ F by a null set, then P(A) = P(B). Filtrations can also be completed.
Proof
Naturally, F^P is the completion of F with respect to P. Sometimes we need to consider all probability measures on (Ω, F).
Let P denote the collection of probability measures on (Ω, F), and suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F). Let F^∗ = ⋂{F^P : P ∈ P}, and let F^∗ = ⋀{F^P : P ∈ P}, the filtration with F^∗_t = ⋂{F^P_t : P ∈ P} for t ∈ T. Then F^∗ is a filtration on (Ω, F^∗), known as the universal completion of F.
Proof
The last definition must seem awfully obscure, but it does have a place. In the theory of Markov processes, we usually allow
arbitrary initial distributions, which in turn produces a large collection of distributions on the sample space.
Right Continuity
In continuous time, we sometimes need to refine a given filtration somewhat.
Suppose that F = { Ft : t ∈ [0, ∞)} is a filtration on (Ω, F ). For t ∈ [0, ∞), define Ft+ = ⋂{ Fs : s ∈ (t, ∞)} . Then
F+ = { Ft+ : t ∈ T } is also a filtration on (Ω, F ) and is finer than F.
Proof
Since the σ-algebras in a filtration are increasing, it follows that for t ∈ [0, ∞), F_{t+} = ⋂{F_s : s ∈ (t, t + ϵ)} for every ϵ ∈ (0, ∞). So if the filtration F encodes the information available as time goes by, then the filtration F_+ allows an “infinitesimal peek into the future” at each t ∈ [0, ∞). In light of the previous result, the next definition is natural.
A filtration F = {F_t : t ∈ [0, ∞)} is right continuous if F_+ = F, so that F_{t+} = F_t for every t ∈ [0, ∞).
Right continuous filtrations have some nice properties, as we will see later. If the original filtration is not right continuous, the
slightly refined filtration is:
Suppose again that F = {F_t : t ∈ [0, ∞)} is a filtration. Then F_+ is a right continuous filtration.
Proof
For a stochastic process X = {X_t : t ∈ [0, ∞)} in continuous time, often the filtration F that is most useful is the right-continuous refinement of the natural filtration. That is, F = F^0_+, so that F_t = F^0_{t+} for t ∈ [0, ∞), where F^0_t = σ{X_s : s ∈ [0, t]}.
Stopping Times
Basic Properties
Suppose again that we have a fixed sample space (Ω, F). Random variables taking values in the time set T are important, but often, as we will see, it's necessary to allow such variables to take the value ∞ as well as finite times. So let T_∞ = T ∪ {∞}. We extend the order to T_∞ by the obvious rule that t < ∞ for every t ∈ T. We also extend the topology on T to T_∞ by the rule that for each s ∈ T, the set {t ∈ T_∞ : t > s} is an open neighborhood of ∞. That is, T_∞ is the one-point compactification of T. The reason for this is to preserve the meaning of time converging to infinity. That is, if (t_1, t_2, …) is a sequence in T_∞ then t_n → ∞ as n → ∞ if and only if, for every t ∈ T there exists m ∈ N_+ such that t_n > t for n > m. We then give T_∞ the Borel σ-algebra T_∞ as before. In discrete time, this is once again the discrete σ-algebra, so that all subsets are measurable. In both cases, we now have an enhanced time space (T_∞, T_∞). A random variable τ taking values in T_∞ is called a random time.
Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F). A random time τ is a stopping time relative to F if {τ ≤ t} ∈ F_t for each t ∈ T.
In a sense, a stopping time is a random time that does not require that we see into the future. That is, we can tell whether or not
τ ≤ t from our information at time t . The term stopping time comes from gambling. Consider a gambler betting on games of
chance. The gambler's decision to stop gambling at some point in time and accept his fortune must define a stopping time. That is,
the gambler can base his decision to stop gambling on all of the information that he has at that point in time, but not on what will
happen in the future. The terms Markov time and optional time are sometimes used instead of stopping time. If τ is a stopping time
relative to a filtration, then it is also a stopping time relative to any finer filtration:
Suppose that F = {F_t : t ∈ T} and G = {G_t : t ∈ T} are filtrations on (Ω, F), and that G is finer than F. If a random time τ is a stopping time relative to F, then τ is a stopping time relative to G.
Proof
So, the finer the filtration, the larger the collection of stopping times. In fact, every random time is a stopping time relative to the finest filtration F where F_t = F for every t ∈ T. But this filtration corresponds to having complete information from the beginning of time, which of course is usually not sensible. At the other extreme, for the coarsest filtration F where F_t = {Ω, ∅} for every t ∈ T, the only stopping times are constants, that is, random times of the form τ(ω) = t for every ω ∈ Ω, for some t ∈ T_∞.
Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F). A random time τ is a stopping time relative to F if and only if {τ > t} ∈ F_t for every t ∈ T.
Proof
Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F), and that τ is a stopping time relative to F. Then
1. {τ < t} ∈ F_t for every t ∈ T.
2. {τ ≥ t} ∈ F_t for every t ∈ T.
3. {τ = t} ∈ F_t for every t ∈ T.
Proof
Note that when T = N, we actually showed that {τ < t} ∈ F_{t−1} and {τ ≥ t} ∈ F_{t−1}. The converse to part (a) (or equivalently (b)) is not true, but in continuous time there is a connection to the right-continuous refinement of the filtration.
Suppose that T = [0, ∞) and that F = {F_t : t ∈ [0, ∞)} is a filtration on (Ω, F). A random time τ is a stopping time relative to F_+ if and only if {τ < t} ∈ F_t for every t ∈ [0, ∞).
Proof
If F = {F_t : t ∈ [0, ∞)} is a filtration and τ is a random time that satisfies {τ < t} ∈ F_t for every t ∈ T, then some authors call τ a weak stopping time or say that τ is weakly optional for the filtration F. But to me, the increase in jargon is not worthwhile, and it's better to simply say that τ is a stopping time for the filtration F_+. The following corollary now follows.
Suppose that T = [0, ∞) and that F = {F_t : t ∈ [0, ∞)} is a right-continuous filtration. A random time τ is a stopping time relative to F if and only if {τ < t} ∈ F_t for every t ∈ [0, ∞).
The converse to part (c) of the result above holds in discrete time.
Proof
Basic Constructions
As noted above, a constant element of T_∞ is a stopping time, but not a very interesting one.
Suppose s ∈ T_∞ and that τ(ω) = s for all ω ∈ Ω. Then τ is a stopping time relative to any filtration on (Ω, F).
Proof
If the filtration {F_t : t ∈ T} is complete, then a random time that is almost certainly a constant is also a stopping time. The following theorems give some basic ways of constructing new stopping times from ones we already have.
Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F) and that τ_1 and τ_2 are stopping times relative to F. Then each of the following is also a stopping time relative to F:
1. τ_1 ∨ τ_2 = max{τ_1, τ_2}
2. τ_1 ∧ τ_2 = min{τ_1, τ_2}
3. τ_1 + τ_2
Proof
It follows that if (τ_1, τ_2, …, τ_n) is a finite sequence of stopping times relative to F, then each of the following is also a stopping time relative to F:
τ1 ∨ τ2 ∨ ⋯ ∨ τn
τ1 ∧ τ2 ∧ ⋯ ∧ τn
τ1 + τ2 + ⋯ + τn
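The proofs for the maximum and minimum rest on the set identities {τ_1 ∨ τ_2 ≤ t} = {τ_1 ≤ t} ∩ {τ_2 ≤ t} and {τ_1 ∧ τ_2 ≤ t} = {τ_1 ≤ t} ∪ {τ_2 ≤ t}. As a sanity check, the sketch below verifies these identities path by path for two hitting times of a simple random walk (the walk and the target states 3 and −2 are illustrative choices, not from the text):

```python
import random

INF = float('inf')

def first_hitting_time(path, target):
    """First index n >= 1 with path[n] in target, or INF if never hit."""
    for n in range(1, len(path)):
        if path[n] in target:
            return n
    return INF

def random_walk(steps=50):
    """A simple random walk on the integers, observed up to time `steps`."""
    x, path = 0, [0]
    for _ in range(steps):
        x += random.choice([-1, 1])
        path.append(x)
    return path

random.seed(1)
# Check the identities behind the theorem, path by path:
#   {tau1 ^ tau2 <= t} = {tau1 <= t} union {tau2 <= t}
#   {tau1 v tau2 <= t} = {tau1 <= t} intersect {tau2 <= t}
for _ in range(1000):
    path = random_walk()
    t1 = first_hitting_time(path, {3})
    t2 = first_hitting_time(path, {-2})
    for t in range(len(path)):
        assert (min(t1, t2) <= t) == (t1 <= t or t2 <= t)
        assert (max(t1, t2) <= t) == (t1 <= t and t2 <= t)
```

Since each event on the right side belongs to F_t, so does the event on the left, which is the content of parts 1 and 2.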
Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F), and that (τ_n : n ∈ N_+) is a sequence of stopping times relative to F. Then sup{τ_n : n ∈ N_+} is also a stopping time relative to F.
Proof
Suppose again that F = {F_t : t ∈ [0, ∞)} is a filtration on (Ω, F), and that (τ_n : n ∈ N_+) is a sequence of stopping times relative to F. Then inf{τ_n : n ∈ N_+} is a stopping time relative to F_+.
Proof
Now suppose that τ is a stopping time relative to the filtration F. We want to define the σ-algebra F_τ of events up to the random time τ, analogous to F_t, the σ-algebra of events up to a fixed time t ∈ T.
Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F) and that τ is a stopping time relative to F. Define F_τ = {A ∈ F : A ∩ {τ ≤ t} ∈ F_t for all t ∈ T}. Then F_τ is a σ-algebra.
Proof
Thus, an event A is in F_τ if we can determine whether A and {τ ≤ t} both occurred given our information at time t. If τ is constant, then F_τ reduces to the corresponding member of the original filtration, which clearly should be the case, and is additional motivation for the definition.
Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F). Fix s ∈ T and define τ(ω) = s for all ω ∈ Ω. Then F_τ = F_s.
Proof
Clearly, if we have the information available in F_τ, then we should know the value of τ itself. This is also true:
Proof
Here are other results that relate the σ-algebra of a stopping time to the original filtration.
Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F) and that τ is a stopping time relative to F. If A ∈ F_τ then for t ∈ T,
1. A ∩ {τ < t} ∈ F_t
2. A ∩ {τ = t} ∈ F_t
Proof
The σ-algebra of a stopping time relative to a filtration is related to the σ-algebra of the stopping time relative to a finer filtration in
the natural way.
Proof
When two stopping times are ordered, their σ-algebras are also ordered.
Suppose that F = {F_t : t ∈ T} is a filtration on (Ω, F) and that ρ and τ are stopping times for F with ρ ≤ τ. Then F_ρ ⊆ F_τ.
Proof
Suppose again that ρ and τ are stopping times for the filtration F = {F_t : t ∈ T}. Then each of the following events is in F_ρ and in F_τ:
1. {ρ < τ}
2. {ρ = τ}
3. {ρ > τ}
4. {ρ ≤ τ}
5. {ρ ≥ τ}
Proof
We can “stop” a filtration at a stopping time. In the next subsection, we will stop a stochastic process in the same way.
Suppose again that F = {F_t : t ∈ T} is a filtration on (Ω, F), and that τ is a stopping time for F. For t ∈ T define F^τ_t = F_{t∧τ}. Then F^τ = {F^τ_t : t ∈ T} is a filtration and is coarser than F.
Proof
Stochastic Processes
As usual, the most common setting is when we have a stochastic process X = {X_t : t ∈ T} defined on our sample space (Ω, F) and with state space (S, S). If τ is a random time, we are often interested in the state X_τ at the random time. But there are two issues. First, τ may take the value infinity, in which case X_τ is not defined. The usual solution is to introduce a new “death state” δ, and define X_∞ = δ. The σ-algebra S on S is extended to S_δ = S ∪ {δ} in the natural way, namely S_δ = σ(S ∪ {δ}).
Our other problem is that we naturally expect X_τ to be a random variable (that is, measurable), just as X_t is a random variable for deterministic t ∈ T. Moreover, if τ is a stopping time relative to a filtration F, we expect X_τ to be measurable with respect to F_τ, just as X_t is measurable with respect to F_t for deterministic t ∈ T. But this is not obvious, and in fact is not true without additional assumptions. Note that X_τ is a random state at a random time, and so depends on the outcome in two ways: through the random time and through the random state.
Suppose that X = {X_t : t ∈ T} is a stochastic process on the sample space (Ω, F) with state space (S, S), and that X is measurable. If τ is a finite random time, then X_τ is measurable. That is, X_τ is a random variable with values in S.
Proof
This result is one of the main reasons for the definition of a measurable process in the first place. Sometimes we literally want to
stop the random process at a random time τ . As you might guess, this is the origin of the term stopping time.
Suppose again that X = {X_t : t ∈ T} is a stochastic process on the sample space (Ω, F) with state space (S, S), and that X is measurable. If τ is a random time, then the process X^τ = {X^τ_t : t ∈ T} defined by X^τ_t = X_{t∧τ} for t ∈ T is also measurable; X^τ is the process X stopped at τ.
Proof
Suppose again that X = {X_t : t ∈ T} is a stochastic process on the sample space (Ω, F) with state space (S, S), and that X is progressively measurable with respect to a filtration F = {F_t : t ∈ T}. If τ is a stopping time relative to F, then the stopped process X^τ is progressively measurable with respect to the stopped filtration F^τ. Since F is finer than F^τ, it follows that X^τ is also progressively measurable with respect to F.
Suppose again that X = {X_t : t ∈ T} is a stochastic process on the sample space (Ω, F) with state space (S, S), and that X is progressively measurable with respect to a filtration F = {F_t : t ∈ T} on (Ω, F). If τ is a finite stopping time relative to F, then X_τ is measurable with respect to F_τ.
For many random processes, the first time that the process enters or hits a set of states is particularly important. In the discussion that follows, let T_+ = {t ∈ T : t > 0}, the set of positive times.
Suppose that X = {X_t : t ∈ T} is a stochastic process on (Ω, F) with state space (S, S). For A ∈ S, define
1. ρ_A = inf{t ∈ T : X_t ∈ A}, the first entry time to A.
2. τ_A = inf{t ∈ T_+ : X_t ∈ A}, the first hitting time to A.
As usual, inf(∅) = ∞, so ρ_A = ∞ if X_t ∉ A for all t ∈ T (the process never enters A), and τ_A = ∞ if X_t ∉ A for all t ∈ T_+ (the process never hits A). In discrete time, it's easy to see that these are stopping times.
Suppose that {X_n : n ∈ N} is a stochastic process on (Ω, F) with state space (S, S). If A ∈ S then τ_A and ρ_A are stopping times relative to the natural filtration F^0.
Proof
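In discrete time the argument is simply that {τ_A ≤ n} = ⋃_{k=1}^n {X_k ∈ A} and {ρ_A ≤ n} = ⋃_{k=0}^n {X_k ∈ A}, finite unions of measurable events that involve only the process up to time n. A minimal sketch, using an illustrative random walk and target set A = {2}:

```python
import random

A = {2}   # target set (illustrative choice)

def walk(steps=30):
    """Simple random walk on the integers started at 0."""
    x, path = 0, [0]
    for _ in range(steps):
        x += random.choice([-1, 1])
        path.append(x)
    return path

def tau_le(path, n):
    # {tau_A <= n} = union over k = 1..n of {X_k in A}: uses X_1,...,X_n only
    return any(path[k] in A for k in range(1, n + 1))

def rho_le(path, n):
    # {rho_A <= n} = union over k = 0..n of {X_k in A}: uses X_0,...,X_n only
    return any(path[k] in A for k in range(0, n + 1))

random.seed(2)
for _ in range(200):
    path = walk()
    tau = next((k for k in range(1, len(path)) if path[k] in A), None)
    for n in range(len(path)):
        # deciding whether tau_A <= n never requires looking past time n
        assert tau_le(path, n) == (tau is not None and tau <= n)
```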
So of course in discrete time, τ_A and ρ_A are stopping times relative to any filtration F to which X is adapted. You might think that τ_A and ρ_A should always be stopping times, since τ_A ≤ t if and only if X_s ∈ A for some s ∈ T_+ with s ≤ t, and ρ_A ≤ t if and only if X_s ∈ A for some s ∈ T with s ≤ t. It would seem that these events are known if one is allowed to observe the process up to time t. The problem is that when T = [0, ∞), these are uncountable unions, so we need to make additional assumptions on the stochastic process X or the filtration F, or both.
Suppose that S has an LCCB topology, and that S is the σ-algebra of Borel sets. Suppose also that X = {X_t : t ∈ [0, ∞)} is right continuous and has left limits. Then τ_A and ρ_A are stopping times relative to F^0_+ for every open A ∈ S.
Here is another result that requires less of the stochastic process X, but more of the filtration F.
Suppose that X = {X_t : t ∈ [0, ∞)} is a stochastic process on (Ω, F) that is progressively measurable relative to a complete, right-continuous filtration F = {F_t : t ∈ [0, ∞)}. If A ∈ S then ρ_A and τ_A are stopping times relative to F.
This page titled 2.11: Filtrations and Stopping Times is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
CHAPTER OVERVIEW
3: Distributions
Recall that a probability distribution is just another name for a probability measure. Most distributions are associated with random
variables, and in fact every distribution can be associated with a random variable. In this chapter we explore the basic types of
probability distributions (discrete, continuous, mixed), and the ways that distributions can be defined using density functions,
distribution functions, and quantile functions. We also study the relationship between the distribution of a random vector and the
distributions of its components, conditional distributions, and how the distribution of a random variable changes when the variable
is transformed.
In the advanced sections, we study convergence in distribution, one of the most important types of convergence. We also construct
the abstract integral with respect to a positive measure and study the basic properties of the integral. This leads in turn to general
(signed) measures, absolute continuity and singularity, and the existence of density functions. Finally, we study various vector spaces of functions that are defined by integral properties.
3.1: Discrete Distributions
3.2: Continuous Distributions
3.3: Mixed Distributions
3.4: Joint Distributions
3.5: Conditional Distributions
3.6: Distribution and Quantile Functions
3.7: Transformations of Random Variables
3.8: Convergence in Distribution
3.9: General Distribution Functions
3.10: The Integral With Respect to a Measure
3.11: Properties of the Integral
3.12: General Measures
3.13: Absolute Continuity and Density Functions
3.14: Function Spaces
This page titled 3: Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
3.1: Discrete Distributions
Basic Theory
Definitions and Basic Properties
As usual, our starting point is a random experiment modeled by a probability space (S, S , P). So to review, S is the set of
outcomes, S the collection of events, and P the probability measure on the sample space (S, S ). We use the terms probability
measure and probability distribution synonymously in this text. Also, since we use a general definition of random variable, every
probability measure can be thought of as the probability distribution of a random variable, so we can always take this point of view
if we like. Indeed, most probability measures naturally have random variables associated with them.
Recall that the sample space (S, S) is discrete if S is countable and S = P(S) is the collection of all subsets of S. In this case, P is a discrete distribution and (S, S, P) is a discrete probability space.
For the remainder of our discussion we assume that (S, S, P) is a discrete probability space. In the picture below, the blue dots are
intended to represent points of positive probability.
The function f on S defined by f (x) = P({x}) for x ∈ S is the probability density function of P, and satisfies the following
properties:
1. f(x) ≥ 0, x ∈ S
2. ∑_{x∈S} f(x) = 1
3. ∑_{x∈A} f(x) = P(A) for A ⊆ S
Proof
Property (c) is particularly important since it shows that a discrete probability distribution is completely determined by its
probability density function. Conversely, any function that satisfies properties (a) and (b) can be used to construct a discrete
probability distribution on S via property (c).
P(A) = ∑_{x∈A} f(x), A ⊆ S
Proof
Technically, f is the density of P relative to counting measure # on S . The technicalities are discussed in detail in the advanced
section on absolute continuity and density functions.
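As a concrete illustration of properties (a)-(c), the sketch below builds a small discrete distribution (a hypothetical loaded die with f(x) proportional to x) and checks the three properties with exact arithmetic:

```python
from fractions import Fraction

# A discrete distribution on S = {1,...,6}: a hypothetical loaded die
# where face x has probability proportional to x.
S = range(1, 7)
f = {x: Fraction(x, 21) for x in S}   # 1 + 2 + ... + 6 = 21

# (a) nonnegative and (b) sums to 1
assert all(f[x] >= 0 for x in S)
assert sum(f.values()) == 1

# (c) P(A) is the sum of f over A; e.g. A = "the roll is even"
def P(A):
    return sum(f[x] for x in A)

assert P({2, 4, 6}) == Fraction(4, 7)   # 12/21 = 4/7
```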
Figure 3.1.2 : A discrete distribution is completely determined by its probability density function.
The set of outcomes S is often a countable subset of some larger set, such as R^n for some n ∈ N_+. But not always. We might want
to consider a random variable with values in a deck of cards, or a set of words, or some other discrete population of objects. Of
course, we can always map a countable set S one-to-one into a Euclidean set, but it might be contrived or unnatural to do so. In any
event, if S is a subset of a larger set, we can always extend a probability density function f , if we want, to the larger set by defining
f (x) = 0 for x ∉ S . Sometimes this extension simplifies formulas and notation. Put another way, the “set of values” is often a
convenience set that includes the points with positive probability, but perhaps other points as well.
Suppose that f is a probability density function on S . Then {x ∈ S : f (x) > 0} is the support set of the distribution.
Values of x that maximize the probability density function are important enough to deserve a name.
Suppose again that f is a probability density function on S . An element x ∈ S that maximizes f is a mode of the distribution.
When there is only one mode, it is sometimes used as a measure of the center of the distribution.
A discrete probability distribution defined by a probability density function f is equivalent to a discrete mass distribution, with
total mass 1. In this analogy, S is the (countable) set of point masses, and f (x) is the mass of the point at x ∈ S . Property (c) in (2)
above simply means that the mass of a set A can be found by adding the masses of the points in A .
But let's consider a probabilistic interpretation, rather than one from physics. We start with a basic random variable X for an
experiment, defined on a probability space (Ω, F , P). Suppose that X has a discrete distribution on S with probability density
function f . So in this setting, f (x) = P(X = x) for x ∈ S . We create a new, compound experiment by conducting independent
repetitions of the original experiment. So in the compound experiment, we have a sequence of independent random variables
(X_1, X_2, …), each with the same distribution as X; in statistical terms, we are sampling from the distribution of X. Define

f_n(x) = (1/n) #{i ∈ {1, 2, …, n} : X_i = x} = (1/n) ∑_{i=1}^n 1(X_i = x),  x ∈ S  (3.1.3)
Note that f_n(x) is the relative frequency of outcome x ∈ S in the first n runs. Note also that f_n(x) is a random variable for the compound experiment for each x ∈ S. By the law of large numbers, f_n(x) should converge to f(x), in some sense, as n → ∞. The function f_n is called the empirical probability density function, and it is in fact a (random) probability density function, since it satisfies properties (a) and (b) of (2). Empirical probability density functions are displayed in most of the simulation apps that deal with discrete variables.
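The empirical density in (3.1.3) is easy to compute by simulation. The sketch below samples from a fair die (an illustrative choice) and checks that f_n is a genuine density and is close to f for large n:

```python
import random
from collections import Counter

random.seed(3)

# True density: a fair die on S = {1,...,6} (illustrative)
S = list(range(1, 7))
f = {x: 1 / 6 for x in S}

# Empirical density f_n from n independent runs, as in (3.1.3)
n = 100_000
counts = Counter(random.choice(S) for _ in range(n))
f_n = {x: counts[x] / n for x in S}

# f_n is itself a (random) probability density function ...
assert all(f_n[x] >= 0 for x in S)
assert abs(sum(f_n.values()) - 1) < 1e-9
# ... and by the law of large numbers it is close to f for large n
assert all(abs(f_n[x] - f[x]) < 0.01 for x in S)
```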
It's easy to construct discrete probability density functions from other nonnegative functions defined on a countable set.
Suppose that g is a nonnegative function on the countable set S and let

c = ∑_{x∈S} g(x)  (3.1.4)

If 0 < c < ∞, then the function f defined by f(x) = (1/c) g(x) for x ∈ S is a discrete probability density function on S.
Proof
Note that since we are assuming that g is nonnegative, c = 0 if and only if g(x) = 0 for every x ∈ S . At the other extreme, c = ∞
could only occur if S is infinite (and the infinite series diverges). When 0 < c < ∞ (so that we can construct the probability
density function f ), c is sometimes called the normalizing constant. This result is useful for constructing probability density
functions with desired functional properties (domain, shape, symmetry, and so on).
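A quick sketch of the construction, using the illustrative choice g(x) = x² on S = {1, 2, 3, 4}:

```python
from fractions import Fraction

# A nonnegative function g(x) = x^2 on S = {1, 2, 3, 4} (illustrative)
S = [1, 2, 3, 4]
g = {x: Fraction(x * x) for x in S}

c = sum(g.values())             # normalizing constant: 1 + 4 + 9 + 16 = 30
assert 0 < c
f = {x: g[x] / c for x in S}    # f = g/c is a probability density function

assert sum(f.values()) == 1
assert f[3] == Fraction(3, 10)  # 9/30
```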
Conditional Densities
Suppose again that X is a random variable on a probability space (Ω, F, P) and that X takes values in our discrete set S. The distribution of X (and hence the probability density function of X) is based on the underlying probability measure on the sample space (Ω, F). This measure could be a conditional probability measure, conditioned on a given event E ∈ F (with P(E) > 0). The probability density function in this case is

f(x ∣ E) = P(X = x ∣ E), x ∈ S
Except for notation, no new concepts are involved. Therefore, all results that hold for discrete probability density functions in
general have analogies for conditional discrete probability density functions.
For fixed E ∈ F with P(E) > 0, the function x ↦ f(x ∣ E) is a discrete probability density function on S. That is,
1. f(x ∣ E) ≥ 0 for x ∈ S.
2. ∑_{x∈S} f(x ∣ E) = 1
3. ∑_{x∈A} f(x ∣ E) = P(X ∈ A ∣ E) for A ⊆ S
Proof
In particular, the event E could be an event defined in terms of the random variable X itself.
Suppose that B ⊆ S and P(X ∈ B) > 0 . The conditional probability density function of X given X ∈ B is the function on B
defined by
f(x ∣ X ∈ B) = f(x) / P(X ∈ B) = f(x) / ∑_{y∈B} f(y),  x ∈ B  (3.1.7)
Proof
Note that the denominator is simply the normalizing constant for f restricted to B. Of course, f(x ∣ X ∈ B) = 0 for x ∈ B^c.
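A sketch of (3.1.7), reusing the loaded-die density f(x) = x/21 on {1, …, 6} (an illustrative choice) and conditioning on the even outcomes:

```python
from fractions import Fraction

# The loaded-die density f(x) = x/21 on S = {1,...,6} (illustrative)
S = range(1, 7)
f = {x: Fraction(x, 21) for x in S}

# Condition on X in B for B = {2, 4, 6}, as in (3.1.7)
B = {2, 4, 6}
PB = sum(f[x] for x in B)                 # P(X in B) = 12/21
f_given_B = {x: f[x] / PB for x in B}     # f restricted to B, renormalized

assert sum(f_given_B.values()) == 1
assert f_given_B[4] == Fraction(1, 3)     # (4/21) / (12/21)
```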
Suppose that E ∈ F. Then
P(E) = ∑_{x∈S} f(x) P(E ∣ X = x)
Proof
This result is useful, naturally, when the distribution of X and the conditional probability of E given the values of X are known.
When we compute P(E) in this way, we say that we are conditioning on X. Note that P(E), as expressed by the formula, is a
weighted average of P(E ∣ X = x) , with weight factors f (x), over x ∈ S .
Suppose that E ∈ F with P(E) > 0. Then
f(x ∣ E) = f(x) P(E ∣ X = x) / ∑_{y∈S} f(y) P(E ∣ X = y),  x ∈ S
Proof
Bayes' theorem, named for Thomas Bayes, is a formula for the conditional probability density function of X given E . Again, it is
useful when the quantities on the right are known. In the context of Bayes' theorem, the (unconditional) distribution of X is
referred to as the prior distribution and the conditional distribution as the posterior distribution. Note that the denominator in
Bayes' formula is P(E) and is simply the normalizing constant for the function x ↦ f (x)P(E ∣ X = x) .
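A sketch of conditioning on X together with Bayes' theorem, for a hypothetical setup: X picks one of three coins uniformly at random, coin x has heads probability x/4, and E is the event of heads:

```python
from fractions import Fraction

# Hypothetical setup: X chooses one of three coins uniformly at random,
# and coin x has heads probability x/4.  E is the event "heads".
S = [1, 2, 3]
prior = {x: Fraction(1, 3) for x in S}        # f(x)
likelihood = {x: Fraction(x, 4) for x in S}   # P(E | X = x)

# Conditioning on X: P(E) = sum over x of f(x) P(E | X = x)
PE = sum(prior[x] * likelihood[x] for x in S)
assert PE == Fraction(1, 2)

# Bayes' theorem: posterior f(x | E) = f(x) P(E | X = x) / P(E)
posterior = {x: prior[x] * likelihood[x] / PE for x in S}
assert sum(posterior.values()) == 1
assert posterior[3] == Fraction(1, 2)   # (1/3)(3/4) / (1/2)
```

Note that P(E) serves exactly as the normalizing constant described in the text.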
Examples and Special Cases
We start with some simple (albeit somewhat artificial) discrete distributions. After that, we study three special parametric models—
the discrete uniform distribution, hypergeometric distributions, and Bernoulli trials. These models are very important, so when
working the computational problems that follow, try to see if the problem fits one of these models. As always, be sure to try the
problems yourself before looking at the answers and proofs in the text.
Let g be the function defined by g(x, y) = xy for (x, y) ∈ {(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)}.
1. Sketch the domain of g .
2. Find the probability density function f that is proportional to g .
3. Find the mode of the distribution.
4. Find P[(X, Y) ∈ {(1, 2), (1, 3), (2, 2), (2, 3)}] where (X, Y) has probability density function f.
Answer
Consider the following game: an urn initially contains one red and one green ball. At each stage, a ball is selected at random. If the ball is green, the game is over; if the ball is red, the ball is returned to the urn, another red ball is added, and the game continues. Let X denote the length of the game (that is, the number of selections required to obtain a green ball). Find the probability density function of X.
Solution
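Before peeking at the solution, note that at stage n the urn holds n red balls and one green ball, so P(X = n) = [∏_{k=1}^{n−1} k/(k+1)] · 1/(n+1) = 1/(n(n+1)) for n ∈ N_+. A simulation sketch (the seed and trial count are arbitrary choices) supports this:

```python
import random
from collections import Counter

def play():
    """One round of the urn game; returns the number of selections X."""
    red, green = 1, 1
    n = 0
    while True:
        n += 1
        if random.randrange(red + green) < green:   # drew the green ball
            return n
        red += 1                                    # red ball: add another red

random.seed(4)
trials = 200_000
freq = Counter(play() for _ in range(trials))

# Compare relative frequencies with f(n) = 1/(n(n+1))
for n in range(1, 6):
    assert abs(freq[n] / trials - 1 / (n * (n + 1))) < 0.01
```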
Many random variables that arise in sampling or combinatorial experiments are transformations of uniformly distributed variables.
The next few exercises review the standard methods of sampling from a finite population. The parameters m and n are positive
integers.
Suppose that n elements are chosen at random, with replacement, from a set D with m elements. Let X denote the ordered sequence of elements chosen. Then X is uniformly distributed on the Cartesian power set S = D^n, and has probability density function f given by

f(x) = 1/m^n,  x ∈ S  (3.1.13)
Proof
Suppose that n elements are chosen at random, without replacement, from a set D with m elements (so n ≤ m). Let X denote the ordered sequence of elements chosen. Then X is uniformly distributed on the set S of permutations of size n chosen from D, and has probability density function f given by

f(x) = 1/m^(n),  x ∈ S  (3.1.14)

where m^(n) = m(m − 1)⋯(m − n + 1) is the number of such permutations.
Proof
Suppose that n elements are chosen at random, without replacement, from a set D with m elements (so n ≤ m ). Let W
denote the unordered set of elements chosen. Then W is uniformly distributed on the set T of combinations of size n chosen
from D, and has probability density function f given by
f(w) = 1 / (m choose n),  w ∈ T  (3.1.15)
Proof
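The three sampling schemes differ only in the reference set, and the counts m^n, m^(n), and (m choose n) can be confirmed by brute-force enumeration for a small population (m = 5, n = 3 are illustrative values); each sample then has probability one over the corresponding count:

```python
from itertools import product, permutations, combinations
from math import comb, perm

# A population D of m = 5 elements; samples of size n = 3 (illustrative)
D = ['a', 'b', 'c', 'd', 'e']
m, n = len(D), 3

# Ordered samples with replacement: m^n equally likely outcomes
assert len(list(product(D, repeat=n))) == m ** n == 125

# Ordered samples without replacement: m^(n) = m(m-1)...(m-n+1) outcomes
assert len(list(permutations(D, n))) == perm(m, n) == 60

# Unordered samples without replacement: (m choose n) outcomes
assert len(list(combinations(D, n))) == comb(m, n) == 10
```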
Suppose that X is uniformly distributed on a finite set S and that B is a nonempty subset of S . Then the conditional
distribution of X given X ∈ B is uniform on B .
Proof
Hypergeometric Models
Suppose that a dichotomous population consists of m objects of two different types: r of the objects are type 1 and m − r are type
0. Here are some typical examples:
The objects are persons, each either male or female.
The objects are voters, each either a democrat or a republican.
The objects are devices of some sort, each either good or defective.
The objects are fish in a lake, each either tagged or untagged.
The objects are balls in an urn, each either red or green.
A sample of n objects is chosen at random (without replacement) from the population. Recall that this means that the samples, either ordered or unordered, are equally likely. Note that this probability model has three parameters: the population size m, the number of type 1 objects r, and the sample size n. Each is a nonnegative integer with r ≤ m and n ≤ m. Now, suppose that we keep track of order, and let X_i denote the type of the ith object chosen, for i ∈ {1, 2, …, n}. Thus, X_i is an indicator variable.
Proof
The joint distribution of (X_1, X_2, …, X_n) is unchanged when the coordinates are permuted. This means that (X_1, X_2, …, X_n) is exchangeable. In particular, the distribution of X_i is the same as the distribution of X_1, so P(X_i = 1) = r/m. Thus, the variables are identically distributed. Also the distribution of (X_i, X_j) is the same as the distribution of (X_1, X_2), so P(X_i = 1, X_j = 1) = r(r−1)/(m(m−1)). Thus, X_i and X_j are not independent, and in fact are negatively correlated.
Now let Y denote the number of type 1 objects in the sample. Note that Y = ∑_{i=1}^n X_i. Any counting variable can be written as a sum of indicator variables.
The distribution defined by the probability density function in the last result is the hypergeometric distribution with parameters m, r, and n. The term hypergeometric comes from a certain class of special functions, but is not particularly helpful in terms of remembering the model. Nonetheless, we are stuck with it. The set of values {0, 1, …, n} is a convenience set: it contains all of the values that have positive probability, but depending on the parameters, some of the values may have probability 0. Recall our convention for binomial coefficients: for j, k ∈ N_+, (k choose j) = 0 if j > k. Note also that the hypergeometric distribution is unimodal: the probability density function increases and then decreases, with either a single mode or two adjacent modes.
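The density of Y is the standard hypergeometric form f(y) = (r choose y)(m−r choose n−y)/(m choose n) for y ∈ {0, 1, …, n}. The sketch below evaluates it exactly, with illustrative parameters borrowed from the tagged-fish example, and checks both normalization and the unimodality just described:

```python
from math import comb
from fractions import Fraction

def hypergeom_pdf(m, r, n, y):
    """P(Y = y) for a sample of size n drawn without replacement from
    m objects of which r are type 1; math.comb returns 0 when the lower
    index exceeds the upper, matching the text's convention."""
    return Fraction(comb(r, y) * comb(m - r, n - y), comb(m, n))

# Example: m = 50 fish in a lake, r = 10 tagged, sample of n = 5
pdf = [hypergeom_pdf(50, 10, 5, y) for y in range(6)]
assert sum(pdf) == 1                 # the masses sum to 1

# Unimodal: the density increases to the mode, then decreases
mode = pdf.index(max(pdf))
assert all(pdf[y] <= pdf[y + 1] for y in range(mode))
assert all(pdf[y] >= pdf[y + 1] for y in range(mode, 5))
```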
We can extend the hypergeometric model to a population of three types. Thus, suppose that our population consists of m objects; r
of the objects are type 1, s are type 2, and m − r − s are type 0. Here are some examples:
The objects are voters, each a democrat, a republican, or an independent.
The objects are cicadas, each one of three species: tredecula, tredecassini, or tredecim.
The objects are peaches, each classified as small, medium, or large.
The objects are faculty members at a university, each an assistant professor, or an associate professor, or a full professor.
Once again, a sample of n objects is chosen at random (without replacement). The probability model now has four parameters: the
population size m, the type sizes r and s , and the sample size n . All are nonnegative integers with r + s ≤ m and n ≤ m .
Moreover, we now need two random variables to keep track of the counts for the three types in the sample. Let Y denote the
number of type 1 objects in the sample and Z the number of type 2 objects in the sample.
Proof
The distribution defined by the density function in the last exercise is the bivariate hypergeometric distribution with parameters m,
r, s, and n. Once again, the domain given is a convenience set; it includes the set of points with positive probability, but depending
on the parameters, may include points with probability 0. Clearly, the same general pattern applies to populations with even more
types. However, because of all of the parameters, the formulas are not worth remembering in detail; rather, just note the pattern,
and remember the combinatorial meaning of the binomial coefficient. The hypergeometric model will be revisited later in this
chapter, in the section on joint distributions and in the section on conditional distributions. The hypergeometric distribution and the
multivariate hypergeometric distribution are studied in detail in the chapter on Finite Sampling Models. This chapter contains a
variety of distributions that are based on discrete uniform distributions.
Bernoulli Trials
A Bernoulli trials sequence is a sequence (X1, X2, …) of independent, identically distributed indicator variables. Random
variable Xi is the outcome of trial i, where in the usual terminology of reliability, 1 denotes success and 0 denotes failure. The
process is named for Jacob Bernoulli. Let p = P(Xi = 1) ∈ [0, 1] denote the success parameter of the process. Note that the
indicator variables in the hypergeometric model satisfy one of the assumptions of Bernoulli trials (identical distributions) but not
the other (independence).
f(x1, x2, …, xn) = p^y (1 − p)^(n−y), (x1, x2, …, xn) ∈ {0, 1}^n, where y = x1 + x2 + ⋯ + xn (3.1.20)
Proof
Now let Y denote the number of successes in the first n trials. Note that Y = X1 + X2 + ⋯ + Xn, so we see again that a complicated random
variable can be written as a sum of simpler ones. In particular, a counting variable can always be written as a sum of indicator
variables.
The distribution defined by the probability density function in the last theorem is called the binomial distribution with parameters n
and p. The distribution is unimodal: the probability density function at first increases and then decreases, with either a single mode
or two adjacent modes. The binomial distribution is studied in detail in the chapter on Bernoulli Trials.
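The probability density function of Y is the familiar binomial form f(y) = (n choose y) p^y (1 − p)^(n−y) for y ∈ {0, 1, …, n}, which is easy to compute directly. The sketch below (the function name is ours) illustrates the case n = 5, p = 0.4.

```python
from math import comb

def binomial_pmf(n, p, y):
    # P(Y = y) for the number of successes in n Bernoulli trials
    # with success parameter p.
    return comb(n, y) * p**y * (1 - p)**(n - y)

# with n = 5 and p = 0.4 the pmf sums to 1 and is unimodal
probs = [binomial_pmf(5, 0.4, y) for y in range(6)]
```

Here the probabilities increase up to y = 2 and then decrease, consistent with the unimodality noted above.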
Suppose that p > 0 and let N denote the trial number of the first success. Then N has probability density function h given by
h(n) = (1 − p)^(n−1) p, n ∈ N+ (3.1.22)
The distribution defined by the probability density function in the last exercise is the geometric distribution on N+ with parameter p.
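Since the event {N > n} means the first n trials were all failures, P(N ≤ n) = 1 − (1 − p)^n. A minimal sketch (function names ours):

```python
def geometric_pmf(p, n):
    # P(N = n): first success occurs on trial n, for n = 1, 2, ...
    return (1 - p)**(n - 1) * p

def geometric_cdf(p, n):
    # P(N <= n) = 1 - (1 - p)^n: a success within the first n trials
    return 1 - (1 - p)**n
```

The pmf values form a geometric series, which is where the distribution gets its name.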
Sampling Problems
In the following exercises, be sure to check if the problem fits one of the general models above.
An urn contains 30 red and 20 green balls. A sample of 5 balls is selected at random, without replacement. Let Y denote the
number of red balls in the sample.
1. Compute the probability density function of Y explicitly and identify the distribution by name and parameter values.
2. Graph the probability density function and identify the mode(s).
3. Find P(Y > 3) .
Answer
In the ball and urn experiment, select sampling without replacement and set m = 50 , r = 30 , and n = 5 . Run the experiment
1000 times and note the agreement between the empirical density function of Y and the probability density function.
An urn contains 30 red and 20 green balls. A sample of 5 balls is selected at random, with replacement. Let Y denote the
number of red balls in the sample.
1. Compute the probability density function of Y explicitly and identify the distribution by name and parameter values.
2. Graph the probability density function and identify the mode(s).
3. Find P(Y > 3) .
Answer
In the ball and urn experiment, select sampling with replacement and set m = 50 , r = 30 , and n = 5 . Run the experiment
1000 times and note the agreement between the empirical density function of Y and the probability density function.
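The two urn exercises above differ only in the sampling mode: without replacement the count is hypergeometric, with replacement it is binomial with p = 30/50. A short sketch computing P(Y > 3) in both settings (variable and function names are ours):

```python
from math import comb

m, r, n = 50, 30, 5          # 30 red and 20 green balls, sample of 5

def hyper(y):                # without replacement: hypergeometric
    return comb(r, y) * comb(m - r, n - y) / comb(m, n)

def binom(y):                # with replacement: binomial with p = r/m = 0.6
    p = r / m
    return comb(n, y) * p**y * (1 - p)**(n - y)

p_without = hyper(4) + hyper(5)   # P(Y > 3), sampling without replacement
p_with = binom(4) + binom(5)      # P(Y > 3), sampling with replacement
```

The two answers are close because the sample size 5 is small relative to the population size 50.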
A group of voters consists of 50 democrats, 40 republicans, and 30 independents. A sample of 10 voters is chosen at random,
without replacement. Let X denote the number of democrats in the sample and Y the number of republicans in the sample.
1. Give the probability density function of X.
2. Give the probability density function of Y .
3. Give the probability density function of (X, Y ).
4. Find the probability that the sample has at least 4 democrats and at least 4 republicans.
Answer
The Math Club at Enormous State University (ESU) has 20 freshmen, 40 sophomores, 30 juniors, and 10 seniors. A committee
of 8 club members is chosen at random, without replacement, to organize π-day activities. Let X denote the number of
freshmen in the sample, Y the number of sophomores, and Z the number of juniors.
1. Give the probability density function of X.
2. Give the probability density function of Y .
3. Give the probability density function of Z .
4. Give the probability density function of (X, Y ).
5. Give the probability density function of (X, Y , Z) .
6. Find the probability that the committee has no seniors.
Answer
Suppose that a coin with probability of heads p = 0.4 is tossed 5 times. Let Y denote the number of heads.
1. Compute the probability density function of Y explicitly.
2. Graph the probability density function and identify the mode.
3. Find P(Y > 3) .
Answer
In the binomial coin experiment, set n = 5 and p = 0.4 . Run the experiment 1000 times and compare the empirical density
function of Y with the probability density function.
Suppose that a coin with probability of heads p = 0.2 is tossed until heads occurs. Let N denote the number of tosses.
1. Find the probability density function of N .
2. Find P(N ≤ 5) .
Answer
In the negative binomial experiment, set k = 1 and p = 0.2 . Run the experiment 1000 times and compare the empirical
density function with the probability density function.
Suppose that two fair, standard dice are tossed and the sequence of scores (X1, X2) recorded. Let Y = X1 + X2 denote the
sum of the scores, U = min{X1, X2} the minimum score, and V = max{X1, X2} the maximum score.
4. Find the probability density function of V .
5. Find the probability density function of (U , V ).
Answer
Note that (U , V ) in the last exercise could serve as the outcome of the experiment that consists of throwing two standard dice if we
did not bother to record order. Note from the previous exercise that this random vector does not have a uniform distribution when
the dice are fair. The mistaken idea that this vector should have the uniform distribution was the cause of difficulties in the early
development of probability.
In the dice experiment, select n = 2 fair dice. Select the following random variables and note the shape and location of the
probability density function. Run the experiment 1000 times. For each of the following variables, compare the empirical
density function with the probability density function.
1. Y , the sum of the scores.
2. U , the minimum score.
3. V , the maximum score.
In the die-coin experiment, a fair, standard die is rolled and then a fair coin is tossed the number of times showing on the die.
Let N denote the die score and Y the number of heads.
1. Find the probability density function of N . Identify the distribution by name.
2. Find the probability density function of Y .
Answer
Run the die-coin experiment 1000 times. For the number of heads, compare the empirical density function with the probability
density function.
Suppose that a bag contains 12 coins: 5 are fair, 4 are biased with probability of heads 1/3, and 3 are two-headed. A coin is
chosen at random from the bag and tossed 5 times. Let V denote the probability of heads of the selected coin and let Y denote
the number of heads.
1. Find the probability density function of V .
2. Find the probability density function of Y .
Answer
Compare the die-coin experiment with the bag of coins experiment. In the first experiment, we toss a coin with a fixed probability of
heads a random number of times. In the second experiment, we effectively toss a coin with a random probability of heads a fixed
number of times. In both cases, we can think of starting with a binomial distribution and randomizing one of the parameters.
In the coin-die experiment, a fair coin is tossed. If the coin lands tails, a fair die is rolled. If the coin lands heads, an ace-six flat
die is tossed (faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, 5 have probability 1/8 each). Find the probability
density function of the die score Y.
Answer
Run the coin-die experiment 1000 times, with the settings in the previous exercise. Compare the empirical density function
with the probability density function.
Suppose that a standard die is thrown 10 times. Let Y denote the number of times an ace or a six occurred. Give the probability
density function of Y and identify the distribution by name and parameter values in each of the following cases:
1. The die is fair.
2. The die is an ace-six flat.
Answer
Suppose that a standard die is thrown until an ace or a six occurs. Let N denote the number of throws. Give the probability
density function of N and identify the distribution by name and parameter values in each of the following cases:
1. The die is fair.
2. The die is an ace-six flat.
Answer
Fred and Wilma take turns tossing a coin with probability of heads p ∈ (0, 1): Fred first, then Wilma, then Fred again, and so
forth. The first person to toss heads wins the game. Let N denote the number of tosses, and W the event that Wilma wins.
1. Give the probability density function of N and identify the distribution by name.
2. Compute P(W ) and sketch the graph of this probability as a function of p.
3. Find the conditional probability density function of N given W .
Answer
The alternating coin tossing game is studied in more detail in the section on The Geometric Distribution in the chapter on Bernoulli
trials.
Suppose that k players each have a coin with probability of heads p, where k ∈ {2, 3, …} and where p ∈ (0, 1).
1. Suppose that the players toss their coins at the same time. Find the probability that there is an odd man, that is, one player
with a different outcome than all the rest.
2. Suppose now that the players repeat the procedure in part (a) until there is an odd man. Find the probability density
function of N , the number of rounds played, and identify the distribution by name.
Answer
The odd man out game is treated in more detail in the section on the Geometric Distribution in the chapter on Bernoulli Trials.
Cards
Recall that a poker hand consists of 5 cards chosen at random and without replacement from a standard deck of 52 cards. Let
X denote the number of spades in the hand and Y the number of hearts in the hand. Give the probability density function of
each of the following random variables, and identify the distribution by name:
1. X
2. Y
3. (X, Y )
Answer
Recall that a bridge hand consists of 13 cards chosen at random and without replacement from a standard deck of 52 cards. An
honor card is a card of denomination ace, king, queen, jack or 10. Let N denote the number of honor cards in the hand.
1. Find the probability density function of N and identify the distribution by name.
2. Find the probability that the hand has no honor cards. A hand of this kind is known as a Yarborough, in honor of the Second
Earl of Yarborough.
Answer
In the most common high card point system in bridge, an ace is worth 4 points, a king is worth 3 points, a queen is worth 2
points, and a jack is worth 1 point. Find the probability density function of V , the point value of a random bridge hand.
Reliability
Suppose that in a batch of 500 components, 20 are defective and the rest are good. A sample of 10 components is selected at
random and tested. Let X denote the number of defectives in the sample.
1. Find the probability density function of X and identify the distribution by name and parameter values.
2. Find the probability that the sample contains at least one defective component.
Answer
A plant has 3 assembly lines that produce a certain type of component. Line 1 produces 50% of the components and has a
defective rate of 4%; line 2 produces 30% of the components and has a defective rate of 5%; line 3 produces 20% of the
components and has a defective rate of 1%. A component is chosen at random from the plant and tested.
1. Find the probability that the component is defective.
2. Given that the component is defective, find the conditional probability density function of the line that produced the
component.
Answer
Recall that in the standard model of structural reliability, a system consists of n components, each of which, independently of the
others, is either working or failed. Let Xi denote the state of component i, where 1 means working and 0 means failed. Thus, the
state vector is X = (X1, X2, …, Xn). The system as a whole is also either working or failed, depending only on the states of the
components. Thus, the state of the system is an indicator random variable U = u(X) that depends on the states of the components
according to a structure function u: {0, 1}^n → {0, 1}. In a series system, the system works if and only if every component works.
In a parallel system, the system works if and only if at least one component works. In a k out of n system, the system works if and
only if at least k of the n components work.
The reliability of a device is the probability that it is working. Let pi = P(Xi = 1) denote the reliability of component i, so that
p = (p1, p2, …, pn) is the vector of component reliabilities. Because of the independence assumption, the system reliability
depends only on the component reliabilities, according to a reliability function r(p) = P(U = 1). Note that when all component
reliabilities have the same value p, the states of the components form a sequence of n Bernoulli trials. In this case, the system
reliability is, of course, a function of the common component reliability p.
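With a common component reliability p, the number of working components is binomial, so the k out of n reliability is a binomial tail sum; series (k = n) and parallel (k = 1) systems are the extreme cases. A minimal sketch (function name ours):

```python
from math import comb

def reliability_k_of_n(k, n, p):
    # r(p) = P(at least k of n independent components work),
    # each component working with common probability p;
    # the number of working components is binomial(n, p).
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

series = reliability_k_of_n(4, 4, 0.8)    # n out of n system: p**n
parallel = reliability_k_of_n(1, 4, 0.8)  # 1 out of n system: 1 - (1-p)**n
```

The closed forms p^n and 1 − (1 − p)^n for the series and parallel cases fall out of the tail sum directly.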
Suppose that the component reliabilities all have the same value p. Let X denote the state vector and Y denote the number of
working components.
1. Give the probability density function of X.
2. Give the probability density function of Y and identify the distribution by name and parameter.
3. Find the reliability of the k out of n system.
Answer
Suppose that we have 4 independent components, with common reliability p = 0.8 . Let Y denote the number of working
components.
1. Find the probability density function of Y explicitly.
2. Find the reliability of the parallel system.
3. Find the reliability of the 2 out of 4 system.
4. Find the reliability of the 3 out of 4 system.
5. Find the reliability of the series system.
Answer
3. If a is not a positive integer, there is a single mode at ⌊a⌋
4. If a is a positive integer, there are two modes at a − 1 and a .
Proof
The distribution defined by the probability density function in the previous exercise is the Poisson distribution with parameter a ,
named after Simeon Poisson. Note that like the other named distributions we studied above (hypergeometric and binomial), the
Poisson distribution is unimodal: the probability density function at first increases and then decreases, with either a single mode or
two adjacent modes. The Poisson distribution is studied in detail in the Chapter on Poisson Processes, and is used to model the
number of “random points” in a region of time or space, under certain ideal conditions. The parameter a is proportional to the size
of the region of time or space.
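The Poisson probability density function with parameter a is f(n) = e^(−a) a^n / n! for n ∈ N. A small sketch (function name ours) illustrating the two adjacent modes when a is a positive integer:

```python
from math import exp, factorial

def poisson_pmf(a, n):
    # P(N = n) = e^(-a) a^n / n! for the Poisson distribution with parameter a
    return exp(-a) * a**n / factorial(n)

# with integer parameter a = 8 there are two adjacent modes, at 7 and 8
p7, p8 = poisson_pmf(8, 7), poisson_pmf(8, 8)
```

The equality f(a − 1) = f(a) for integer a follows from the ratio f(n)/f(n − 1) = a/n.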
Suppose that customers arrive at a service station according to the Poisson model, at an average rate of 4 per hour. Thus,
the number of customers N who arrive in a 2-hour period has the Poisson distribution with parameter 8.
1. Find the modes.
2. Find P(N ≥ 6) .
Answer
In the Poisson experiment, set r = 4 and t = 2 . Run the simulation 1000 times and compare the empirical density function to
the probability density function.
Suppose that the number of flaws N in a piece of fabric of a certain size has the Poisson distribution with parameter 2.5.
1. Find the mode.
2. Find P(N > 4) .
Answer
Suppose that the number of raisins N in a piece of cake has the Poisson distribution with parameter 10.
1. Find the modes.
2. Find P(8 ≤ N ≤ 12) .
Answer
A Zeta Distribution
Let g be the function defined by g(n) = 1/n² for n ∈ N+.
The distribution defined in the previous exercise is a member of the zeta family of distributions. Zeta distributions are used to
model sizes or ranks of certain types of objects, and are studied in more detail in the chapter on Special Distributions.
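The normalizing constant for g is the famous sum ∑ 1/n² = π²/6, so the probability density function here is f(n) = (6/π²)/n². A numerical sketch (names ours):

```python
from math import pi

# normalizing constant: the partial sums of 1/n^2 converge to pi^2 / 6
c = sum(1 / n**2 for n in range(1, 200001))

def zeta2_pmf(n):
    # pmf of this zeta distribution: f(n) = (6 / pi^2) / n^2 for n = 1, 2, ...
    return (6 / pi**2) / n**2
```

The tail of the partial sum beyond N is roughly 1/N, so 200000 terms give about five correct digits.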
Benford's Law
Let f be the function defined by f(d) = log(d + 1) − log(d) = log(1 + 1/d) for d ∈ {1, 2, …, 9}. (The logarithm function is
the base 10 common logarithm, not the base e natural logarithm.)
1. Show that f is a probability density function.
2. Compute the values of f explicitly, and sketch the graph.
3. Find P(X ≤ 3) where X has probability density function f .
Answer
The distribution defined in the previous exercise is known as Benford's law, and is named for the American physicist and engineer
Frank Benford. This distribution governs the leading digit in many real sets of data. Benford's law is studied in more detail in the
chapter on Special Distributions.
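Because the terms log(d + 1) − log(d) telescope, the Benford probabilities sum to log 10 = 1 automatically. A sketch (function name ours):

```python
from math import log10

def benford_pmf(d):
    # Benford's law for the leading digit d in {1, ..., 9}:
    # f(d) = log10(d + 1) - log10(d) = log10(1 + 1/d)
    return log10(1 + 1 / d)

total = sum(benford_pmf(d) for d in range(1, 10))   # telescopes to log10(10) = 1
p_le_3 = sum(benford_pmf(d) for d in (1, 2, 3))     # P(X <= 3) = log10(4)
```

Note that more than half the mass sits on the digits 1, 2, 3, which is what makes the law so counterintuitive.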
In the M&M data, let R denote the number of red candies and N the total number of candies. Compute and graph the
empirical probability density function of each of the following:
1. R
2. N
3. R given N > 57
Answer
In the Cicada data, let G denote gender, S species type, and W body weight (in grams). Compute the empirical probability
density function of each of the following:
1. G
2. S
3. (G, S)
4. G given W > 0.20 grams.
Answer
This page titled 3.1: Discrete Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
3.2: Continuous Distributions
In the previous section, we considered discrete distributions. In this section, we study a complementary type of distribution. As
usual, if you are a new student of probability, you may want to skip the technical details.
Basic Theory
Definitions and Basic Properties
As usual, our starting point is a random experiment modeled by a probability space (S, S , P). So to review, S is the set of
outcomes, S the collection of events, and P the probability measure on the sample space (S, S ). We use the terms probability
measure and probability distribution synonymously in this text. Also, since we use a general definition of random variable, every
probability measure can be thought of as the probability distribution of a random variable, so we can always take this point of view
if we like. Indeed, most probability measures naturally have random variables associated with them.
Details
The fact that each point is assigned probability 0 might seem impossible or paradoxical at first, but soon we will see very familiar
analogies.
Thus, continuous distributions are in complete contrast with discrete distributions, for which all of the probability mass is
concentrated on the points in a discrete set. For a continuous distribution, the probability mass is continuously spread over S in
some sense. In the picture below, the light blue shading is intended to suggest a continuous distribution of probability.
Suppose that P is a continuous probability measure on S. The fact that each point in S
has probability 0 is conceptually the same as the fact that an interval of R can have positive length even though it is composed of
points each of which has 0 length. Similarly, a region of R² can have positive area even though it is composed of points (or curves)
each of which has area 0. In the one-dimensional case, continuous distributions are used to model random variables that take values
in intervals of R, variables that can, in principle, be measured with any degree of accuracy. Such variables abound in applications
and include
length, area, volume, and distance
time
mass and weight
charge, voltage, and current
resistance, capacitance, and inductance
velocity and acceleration
energy, force, and work
Usually a continuous distribution can be described by a certain type of function.
3.2.1 https://stats.libretexts.org/@go/page/10142
Suppose again that P is a continuous distribution on S . A function f : S → [0, ∞) is a probability density function for P if
Details
So the probability distribution P is completely determined by the probability density function f. As a special case, note that
∫S f(x) dx = P(S) = 1. Conversely, a nonnegative function on S with this property defines a probability measure.
Proof
Figure 3.2.2 : A continuous distribution is completely determined by its probability density function
Note that we can always extend f to a probability density function on a subset of R^n that contains S, or to all of R^n, by defining
f(x) = 0 for x ∉ S. This extension sometimes simplifies notation. Put another way, we can be a bit sloppy about the "set of
values" of the random variable. So for example if a, b ∈ R with a < b and X has a continuous distribution on the interval [a, b],
then we could also say that X has a continuous distribution on (a, b) or [a, b), or (a, b].
The points x ∈ S that maximize the probability density function f are important, just as in the discrete case.
Suppose that P is a continuous distribution on S with probability density function f . An element x ∈ S that maximizes f is a
mode of the distribution.
If there is only one mode, it is sometimes used as a measure of the center of the distribution.
You have probably noticed that probability density functions for continuous distributions are analogous to probability density
functions for discrete distributions, with integrals replacing sums. However, there are essential differences. First, every discrete
distribution has a unique probability density function f given by f (x) = P({x}) for x ∈ S . For a continuous distribution, the
existence of a probability density function is not guaranteed. The advanced section on absolute continuity and density functions has
several examples of continuous distribution that do not have density functions, and gives conditions that are necessary and
sufficient for the existence of a probability density function. Even if a probability density function f exists, it is never unique. Note
that the values of f on a finite (or even countably infinite) set of points could be changed to other nonnegative values and the new
function would still be a probability density function for the same distribution. The critical fact is that only integrals of f are
important. Second, the values of the PDF f for a discrete distribution are probabilities, and in particular f (x) ≤ 1 for x ∈ S . For a
continuous distribution the values are not probabilities and in fact it's possible that f (x) > 1 for some or even all x ∈ S . Further, f
can be unbounded on S . In the typical calculus interpretation, f (x) really is probability density at x. That is, f (x) dx is
approximately the probability of a “small” region of size dx about x.
c = ∫S g(x) dx (3.2.4)
If 0 < c < ∞ then f defined by f(x) = (1/c) g(x) for x ∈ S defines a probability density function for a continuous distribution
on S.
Proof
Note again that f is just a scaled version of g . So this result can be used to construct probability density functions with desired
properties (domain, shape, symmetry, and so on). The constant c is sometimes called the normalizing constant of g .
Conditional Densities
Suppose now that X is a random variable defined on a probability space (Ω, F , P) and that X has a continuous distribution on S .
A probability density function for X is based on the underlying probability measure on the sample space (Ω, F ). This measure
could be a conditional probability measure, conditioned on a given event E ∈ F with P(E) > 0 . Assuming that the conditional
probability density function exists, the usual notation is
f (x ∣ E), x ∈ S (3.2.6)
Note, however, that except for notation, no new concepts are involved. The defining property is
and all results that hold for probability density functions in general hold for conditional probability density functions. The event E
could be an event described in terms of the random variable X itself:
Suppose that X has a continuous distribution on S with probability density function f and that B ∈ S with P(X ∈ B) > 0 .
The conditional probability density function of X given X ∈ B is the function on B given by
f(x ∣ X ∈ B) = f(x) / P(X ∈ B), x ∈ B (3.2.8)
Proof
Of course, P(X ∈ B) = ∫B f(x) dx and hence is the normalizing constant for the restriction of f to B, as in (8)
The distribution defined by the probability density function in the previous exercise is called the exponential distribution with rate
parameter r. This distribution is frequently used to model random times, under certain assumptions. Specifically, in the Poisson
model of random points in time, the times between successive arrivals have independent exponential distributions, and the
parameter r is the average rate of arrivals. The exponential distribution is studied in detail in the chapter on Poisson Processes.
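For the exponential distribution with rate r, integrating the density gives P(T > t) = e^(−rt), from which the well-known memoryless property P(T > s + t) = P(T > s) P(T > t) follows immediately. A sketch (function name ours):

```python
from math import exp

def exp_sf(r, t):
    # P(T > t) = e^(-r t) for the exponential distribution with rate r
    return exp(-r * t)

# memoryless property: P(T > s + t) = P(T > s) * P(T > t)
lhs = exp_sf(0.5, 3 + 2)
rhs = exp_sf(0.5, 3) * exp_sf(0.5, 2)
```

The memoryless property is what makes the exponential distribution the natural model for waiting times in the Poisson model.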
The lifetime T of a certain device (in 1000 hour units) has the exponential distribution with parameter r = 1/2. Find
1. P(T > 2)
Answer
In the gamma experiment, set n = 1 to get the exponential distribution. Vary the rate parameter r and note the shape of the
probability density function. For various values of r, run the simulation 1000 times and compare the empirical density
function with the probability density function.
A Random Angle
In Bertrand's problem, a certain random angle Θ has probability density function f given by f(θ) = sin θ for θ ∈ [0, π/2].
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
3. Find P(Θ < π/4).
Answer
Bertand's problem is named for Joseph Louis Bertrand and is studied in more detail in the chapter on Geometric Models.
In Bertrand's experiment, select the model with uniform distance. Run the simulation 1000 times and compute the empirical
probability of the event {Θ < π/4}. Compare with the true probability in the previous exercise.
Gamma Distributions
Let gn be the function defined by gn(t) = e^(−t) t^n / n! for t ∈ [0, ∞) where n ∈ N is a parameter.
1. Show that gn is a probability density function for each n ∈ N.
2. Draw a careful sketch of the graph of gn, and state the important qualitative features.
Proof
Interestingly, we showed in the last section on discrete distributions, that ft(n) = gn(t) is a probability density function on N for
each t ≥ 0 (it's the Poisson distribution with parameter t). The distribution defined by the probability density function gn belongs
to the family of Erlang distributions, named for Agner Erlang; n + 1 is known as the shape parameter. The Erlang distribution is
studied in more detail in the chapter on the Poisson Process. In turn the Erlang distribution belongs to the more general family of
gamma distributions. The gamma distribution is studied in more detail in the chapter on Special Distributions.
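The dual reading of gn(t) = e^(−t) t^n / n! can be checked numerically: summing over n gives a Poisson total of 1, while a crude Riemann sum over t approximates the integral 1. A sketch (names ours):

```python
from math import exp, factorial

def g(n, t):
    # g_n(t) = e^(-t) t^n / n!: as a function of t it is the Erlang-type
    # density on [0, inf); as a function of n it is the Poisson pmf
    # with parameter t.
    return exp(-t) * t**n / factorial(n)

# as a function of n: Poisson probabilities with parameter t = 2.5 sum to 1
poisson_total = sum(g(n, 2.5) for n in range(100))

# as a function of t: a crude Riemann sum for the integral of g_2 is about 1
dt = 0.001
erlang_integral = sum(g(2, k * dt) * dt for k in range(1, 60001))
```

The exact integral is Γ(n + 1)/n! = 1, which is one way to see that each gn really is a density.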
In the gamma experiment, keep the default rate parameter r = 1 . Vary the shape parameter and note the shape and location of
the probability density function. For various values of the shape parameter, run the simulation 1000 times and compare the
empirical density function with the probability density function.
Suppose that the lifetime of a device T (in 1000 hour units) has the gamma distribution above with n = 2. Find each of the
following:
1. P(T > 3) .
2. P(T ≤ 2)
3. P(1 ≤ T ≤ 4)
Answer
Beta Distributions
The distributions defined in the last two exercises are examples of beta distributions. These distributions are widely used to model
random proportions and probabilities, and physical quantities that take values in bounded intervals (which, after a change of units,
can be taken to be [0, 1]). Beta distributions are studied in detail in the chapter on Special Distributions.
In the special distribution simulator, select the beta distribution. For the following parameter values, note the shape of the
probability density function. Run the simulation 1000 times and compare the empirical density function with the probability
density function.
1. a = 2, b = 2. This gives the first beta distribution above.
2. a = 3, b = 2. This gives the second beta distribution above.
Find P(1/4 ≤ P ≤ 3/4) in each of the following cases:
1. P has the first beta distribution above.
2. P has the second beta distribution above.
Answer
The distribution defined in the last exercise is also a member of the beta family of distributions. But it is also known as the
(standard) arcsine distribution, because of the arcsine function that arises in the proof that f is a probability density function. The
arcsine distribution has applications to a very important random process known as Brownian motion, named for the Scottish
botanist Robert Brown. Arcsine distributions are studied in more generality in the chapter on Special Distributions.
In the special distribution simulator, select the (continuous) arcsine distribution and keep the default parameter values. Run the
simulation 1000 times and compare the empirical density function with the probability density function.
Suppose that Xt represents the change in the price of a stock at time t, relative to the value at an initial reference time 0. We
treat t as a continuous variable measured in weeks. Let T = max{t ∈ [0, 1] : Xt = 0}, the last time during the first week that
the stock price was unchanged over its initial value. Under certain ideal conditions, T will have the arcsine distribution. Find
each of the following:
1. P(T < 1/4)
2. P(T ≥ 1/2)
3. P(T ≤ 3/4)
Answer
Open the Brownian motion experiment and select the last zero variable. Run the experiment in single step mode a few times.
The random process that you observe models the price of the stock in the previous exercise. Now run the experiment 1000
times and compute the empirical probability of each event in the previous exercise.
1. Draw a careful sketch of the graph of g, and state the important qualitative features.
2. Find the values of b for which there exists a probability density function f proportional to g. Identify the mode.
Answer
Note that the qualitative features of g are the same, regardless of the value of the parameter b > 0 , but only when b > 1 can g be
normalized into a probability density function. In this case, the distribution is known as the Pareto distribution, named for Vilfredo
Pareto. The parameter a = b − 1 , so that a > 0 , is known as the shape parameter. Thus, the Pareto distribution with shape
parameter a has probability density function
f(x) = a / x^(a+1), x ∈ [1, ∞) (3.2.13)
The Pareto distribution is widely used to model certain economic variables and is studied in detail in the chapter on Special
Distributions.
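Integrating the Pareto density a/x^(a+1) from x to ∞ gives the simple tail formula P(X > x) = x^(−a) for x ≥ 1, which makes exercises like the income example below easy. A sketch (names ours):

```python
def pareto_sf(a, x):
    # P(X > x) = x^(-a) for the Pareto distribution with shape a, x >= 1
    return x ** (-a)

# with shape parameter a = 2:
p_gt_2 = pareto_sf(2, 2)                        # P(X > 2) = 1/4
p_between = pareto_sf(2, 3) - pareto_sf(2, 5)   # P(3 <= X <= 5)
```

The polynomial tail x^(−a) decays much more slowly than an exponential tail, which is why Pareto models suit heavy-tailed quantities like income.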
In the special distribution simulator, select the Pareto distribution. Leave the scale parameter fixed, but vary the shape
parameter, and note the shape of the probability density function. For various values of the shape parameter, run the simulation
1000 times and compare the empirical density function with the probability density function.
Suppose that the income X (in appropriate units) of a person randomly selected from a population has the Pareto distribution
with shape parameter a = 2 . Find each of the following:
1. P(X > 2)
2. P(X ≤ 4)
3. P(3 ≤ X ≤ 5)
Answer
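For shape parameter a, the Pareto survival function is P(X > x) = x^{−a} for x ≥ 1, which makes all three probabilities direct to compute. A sketch (the function name is ours):

```python
def pareto_sf(x: float, a: float = 2) -> float:
    """Survival function P(X > x) for the Pareto distribution with shape a."""
    return min(1.0, x ** (-a))

print(pareto_sf(2))                 # P(X > 2) = 1/4
print(1 - pareto_sf(4))             # P(X <= 4) = 15/16
print(pareto_sf(3) - pareto_sf(5))  # P(3 <= X <= 5) = 1/9 - 1/25 = 16/225
```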
The distribution constructed in the previous exercise is known as the (standard) Cauchy distribution, named after Augustin Cauchy.
It might also be called the arctangent distribution, because of the appearance of the arctangent function in the proof that f is a
probability density function. In this regard, note the similarity to the arcsine distribution above. The Cauchy distribution is studied
in more generality in the chapter on Special Distributions. Note also that the Cauchy distribution is obtained by normalizing the
function x ↦ 1/(1 + x²); the graph of this function is known as the witch of Agnesi, in honor of Maria Agnesi.
In the special distribution simulator, select the Cauchy distribution with the default parameter values. Run the simulation 1000
times and compare the empirical density function with the probability density function.
A light source is 1 meter away from position 0 on an infinite, straight wall. The angle Θ that the light beam makes with the perpendicular to the wall is randomly chosen from the interval (−π/2, π/2). The position X = tan(Θ) of the light beam on the wall has the standard Cauchy distribution. Find each of the following:
1. P(−1 < X < 1)
2. P(X ≥ 1/√3)
3. P(X ≤ √3)
Answer
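The standard Cauchy CDF is F(x) = 1/2 + arctan(x)/π, so the probabilities reduce to arctangent evaluations. A numerical sketch (the function name is ours):

```python
from math import atan, pi, sqrt

def cauchy_cdf(x: float) -> float:
    """CDF of the standard Cauchy distribution."""
    return 0.5 + atan(x) / pi

print(cauchy_cdf(1) - cauchy_cdf(-1))  # P(-1 < X < 1) = 1/2
print(1 - cauchy_cdf(1 / sqrt(3)))     # P(X >= 1/sqrt(3)) = 1/3
print(cauchy_cdf(sqrt(3)))             # P(X <= sqrt(3)) = 5/6
```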
The Cauchy experiment (with the default parameter values) is a simulation of the experiment in the last exercise.
1. Run the experiment a few times in single step mode.
2. Run the experiment 1000 times and compare the empirical density function with the probability density function.
3. Using the data from (b), compute the relative frequency of each event in the previous exercise, and compare with the true
probability.
1. Show that ϕ is a probability density function.
2. Draw a careful sketch of the graph of ϕ, and state the important qualitative features.
Proof
The distribution defined in the last exercise is the standard normal distribution, perhaps the most important distribution in
probability and statistics. Its importance stems largely from the central limit theorem, one of the fundamental theorems in
probability. In particular, normal distributions are widely used to model physical measurements that are subject to small, random
errors. The family of normal distributions is studied in more generality in the chapter on Special Distributions.
In the special distribution simulator, select the normal distribution and keep the default parameter values. Run the simulation
1000 times and compare the empirical density function and the probability density function.
The function z ↦ e^{−z²/2} is a notorious example of an integrable function that does not have an antiderivative that can be expressed
in closed form in terms of other elementary functions. (That's why we had to resort to the polar coordinate trick to show that ϕ is a
probability density function.) So probabilities involving the normal distribution are usually computed using mathematical or
statistical software.
Suppose that the error Z in the length of a certain machined part (in millimeters) has the standard normal distribution. Use
mathematical software to approximate each of the following:
1. P(−1 ≤ Z ≤ 1)
2. P(Z > 2)
3. P(Z < −3)
Answer
The distribution in the last exercise is the (standard) type 1 extreme value distribution, also known as the Gumbel distribution in
honor of Emil Gumbel. Extreme value distributions are studied in more generality in the chapter on Special Distributions.
In the special distribution simulator, select the extreme value distribution. Keep the default parameter values and note the shape
and location of the probability density function. Run the simulation 1000 times and compare the empirical density function
with the probability density function.
The distribution in the last exercise is the (standard) logistic distribution. Logistic distributions are studied in more generality in the
chapter on Special Distributions.
In the special distribution simulator, select the logistic distribution. Keep the default parameter values and note the shape and
location of the probability density function. Run the simulation 1000 times and compare the empirical density function with the
probability density function.
Weibull Distributions
The distributions in the last two exercises are examples of Weibull distributions, named for Waloddi Weibull. Weibull distributions
are studied in more generality in the chapter on Special Distributions. They are often used to model random failure times of devices
(in appropriately scaled units).
In the special distribution simulator, select the Weibull distribution. For each of the following values of the shape parameter k ,
note the shape and location of the probability density function. Run the simulation 1000 times and compare the empirical
density function with the probability density function.
1. k = 2 . This gives the first Weibull distribution above.
2. k = 3 . This gives the second Weibull distribution above.
Suppose that T is the failure time of a device (in 1000 hour units). Find P(T > 1/2) in each of the following cases:
1. T has the first Weibull distribution above.
2. T has the second Weibull distribution above.
Answer
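Assuming, as is standard for these exercises, that the two Weibull distributions above are the standard ones with shape k = 2 and k = 3 (so that P(T > t) = e^{−t^k} with scale 1), the probabilities are easy to check numerically. A sketch (the function name is ours):

```python
from math import exp

def weibull_sf(t: float, k: float) -> float:
    """P(T > t) for the standard Weibull distribution with shape k (scale 1)."""
    return exp(-t ** k)

print(weibull_sf(0.5, 2))  # k = 2: exp(-1/4), about 0.7788
print(weibull_sf(0.5, 3))  # k = 3: exp(-1/8), about 0.8825
```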
Additional Examples
Let f be the function defined by f (x) = − ln x for x ∈ (0, 1].
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f , and state the important qualitative features.
3. Find P(1/3 ≤ X ≤ 1/2) where X has the probability density function in (a).
Answer
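The density f(x) = −ln x on (0, 1] has the closed-form CDF F(x) = x − x ln x (differentiate to recover −ln x), which settles parts (a) and (c) at once. A sketch (the function name is ours):

```python
from math import log

def F(x: float) -> float:
    """CDF for the density f(x) = -ln(x) on (0, 1]: F(x) = x - x*ln(x)."""
    return x - x * log(x)

print(F(1.0))            # total probability is 1, so f is a valid density
print(F(0.5) - F(1/3))   # P(1/3 <= X <= 1/2), about 0.147
```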
The following problems deal with two and three dimensional random vectors having continuous distributions. The idea of
normalizing a function to form a probability density function is important for some of the problems. The relationship between the
distribution of a vector and the distribution of its components will be discussed later, in the section on joint distributions.
3. Find the conditional density of (X, Y) given {X < 1/2, Y < 1/2}.
Answer
λ_n(A) = ∫_A 1 dx,  A ⊆ R^n  (3.2.23)
Details
Note that if n > 1, the integral above is a multiple integral. Generally, λ_n(A) is referred to as the n-dimensional volume of
A ⊆ R^n.
2. The probability measure associated with f is given by P(A) = λ_n(A)/λ_n(S) for A ⊆ S, and is known as the uniform
distribution on S.
Proof
Note that the probability assigned to a set A ⊆ R^n is proportional to the size of A, as measured by λ_n. Note also that in both the
discrete and continuous cases, the uniform distribution on a set S has a constant probability density function on S. The uniform
distribution on a set S governs a point X chosen “at random” from S, and in the continuous case, such distributions play a
fundamental role in various Geometric Models. Uniform distributions are studied in more generality in the chapter on Special
Distributions.
The most important special case is the uniform distribution on an interval [a, b] where a, b ∈ R and a < b. In this case, the
probability density function is
f(x) = 1/(b − a),  a ≤ x ≤ b  (3.2.25)
This distribution models a point chosen “at random” from the interval. In particular, the uniform distribution on [0, 1] is known as
the standard uniform distribution, and is very important because of its simplicity and the fact that it can be transformed into a
variety of other probability distributions on R. Almost all computer languages have procedures for simulating independent,
standard uniform variables, which are called random numbers in this context.
Conditional distributions corresponding to a uniform distribution are also uniform.
The last theorem has important implications for simulations. If we can simulate a random variable that is uniformly distributed on a
set, we can simulate a random variable that is uniformly distributed on a subset.
Suppose again that R ⊆ S ⊆ R^n for some n ∈ N_+, and that λ_n(R) > 0 and λ_n(S) < ∞. Suppose further that X = (X_1, X_2, …) is a sequence of independent random variables, each uniformly distributed on S, and let N = min{k ∈ N_+ : X_k ∈ R}. Then X_N is uniformly distributed on R.
Proof
Suppose in particular that S is a Cartesian product of n bounded intervals. It turns out to be quite easy to simulate a sequence of
independent random variables X = (X_1, X_2, …), each of which is uniformly distributed on S. Thus, the last theorem gives an
algorithm for simulating a random variable that is uniformly distributed on an irregularly shaped region R ⊆ S (assuming that we
have an algorithm for recognizing when a point x ∈ R^n falls in R). This method of simulation is known as the rejection method,
and as we will see in subsequent sections, it is more important than might first appear.
Figure 3.2.3: With a sequence of independent, uniformly distributed points in S, the first one to fall in R is uniformly distributed
on R.
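The rejection method is easy to sketch in code. The following minimal example (our own, not from the text) samples a point uniformly from a disk by generating uniform points in the bounding square until one falls inside:

```python
import random

def uniform_in_disk(radius: float = 1.0, rng: random.Random = random.Random(3)):
    """Rejection method: generate uniform points in the bounding square
    [-radius, radius]^2 until one lands inside the disk; the accepted
    point is uniformly distributed on the disk."""
    while True:
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:  # accept: point is in the disk
            return (x, y)

# Each trial is accepted with probability area(disk)/area(square) = pi/4
points = [uniform_in_disk() for _ in range(1000)]
```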
In the simple probability experiment, random points are uniformly distributed on the rectangular region S . Move and resize the
events A and B and note how the probabilities of the 16 events that can be constructed from A and B change. Run the
experiment 1000 times and note the agreement between the relative frequencies of the events and the probabilities of the
events.
Suppose that (X, Y) is uniformly distributed on the circular region of radius 5, centered at the origin. We can think of (X, Y)
as the position of a dart thrown “randomly” at a target. Let R = √(X² + Y²), the distance from the center to (X, Y).
Suppose that (X, Y, Z) is uniformly distributed on the cube S = [0, 1]³. Find P(X < Y < Z) in two ways:
1. Using the probability density function.
2. Using a combinatorial argument.
Answer
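A Monte Carlo sketch of the combinatorial argument: for independent uniform coordinates, the 3! = 6 orderings are equally likely, so the probability of any one ordering is 1/6. This code is our own illustration, not part of the text:

```python
import random

rng = random.Random(42)
trials = 100_000
hits = 0
for _ in range(trials):
    x, y, z = rng.random(), rng.random(), rng.random()
    if x < y < z:  # one of the 6 equally likely orderings
        hits += 1
print(hits / trials)  # close to 1/6
```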
The time T (in minutes) required to perform a certain job is uniformly distributed over the interval [15, 60].
1. Find the probability that the job requires more than 30 minutes
2. Given that the job is not finished after 30 minutes, find the probability that the job will require more than 15 additional
minutes.
Answer
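For a uniform distribution on [15, 60], survival probabilities are just length ratios, and the conditional probability in part (b) is a ratio of survival probabilities. A sketch (the function name is ours):

```python
def uniform_sf(t: float, a: float = 15, b: float = 60) -> float:
    """P(T > t) for T uniformly distributed on [a, b]."""
    return (b - t) / (b - a)

p1 = uniform_sf(30)                   # P(T > 30) = 30/45 = 2/3
p2 = uniform_sf(45) / uniform_sf(30)  # P(T > 45 | T > 30) = 15/30 = 1/2
print(p1, p2)
```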
For the cicada data, BW denotes body weight (in grams), BL body length (in millimeters), and G gender (0 for female and 1
for male). Construct an empirical density function for each of the following and display each as a bar graph:
1. BW
2. BL
3. BW given G = 0
Answer
This page titled 3.2: Continuous Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
3.3: Mixed Distributions
In the previous two sections, we studied discrete probability measures and continuous probability measures. In this section, we will
study probability measures that are combinations of the two types. Once again, if you are a new student of probability you may
want to skip the technical details.
Basic Theory
Definitions and Basic Properties
Our starting point is a random experiment modeled by a probability space (S, S , P). So to review, S is the set of outcomes, S the
collection of events, and P the probability measure on the sample space (S, S ). We use the terms probability measure and
probability distribution synonymously in this text. Also, since we use a general definition of random variable, every probability
measure can be thought of as the probability distribution of a random variable, so we can always take this point of view if we like.
Indeed, most probability measures naturally have random variables associated with them. Here is the main definition:
The probability measure P is of mixed type if S can be partitioned into events D and C with the following properties:
1. D is countable, 0 < P(D) < 1, and P({x}) > 0 for every x ∈ D.
2. C ⊆ R^n for some n ∈ N_+ and P({x}) = 0 for every x ∈ C.
Details
Recall that the term partition means that D and C are disjoint and S = D ∪ C. As always, the collection of events S is
required to be a σ-algebra. The set C is a measurable subset of R^n, and then the elements of S have the form A ∪ B where
A ⊆ D and B is a measurable subset of C. Typically in applications, C is defined by a finite number of inequalities involving
elementary functions.
Often the discrete set D is a subset of R^n also, but that's not a requirement. Note that since D and C are complements,
0 < P(C) < 1 also. Thus, part of the distribution is concentrated at points in a discrete set D; the rest of the distribution is
continuously spread over C. In the picture below, the light blue shading is intended to represent a continuous distribution of
probability while the darker blue dots are intended to represent points of positive probability.
Note that P(A) = P(D)P(A ∣ D) + P(C)P(A ∣ C) for every event A.
Thus, the probability measure P really is a mixture of a discrete distribution and a continuous distribution. Mixtures are studied in
more generality in the section on conditional distributions. We can define a function on D that is a partial probability density
function for the discrete part of the distribution.
Suppose that P is a probability measure on S of mixed type as in (1). Let g be the function defined by g(x) = P({x}) for
x ∈ D . Then
3.3.1 https://stats.libretexts.org/@go/page/10143
1. g(x) ≥ 0 for x ∈ D
2. ∑_{x∈D} g(x) = P(D)
3. P(A) = ∑_{x∈A} g(x) for A ⊆ D
Proof
Clearly, the normalized function x ↦ g(x)/P(D) is the probability density function of the conditional distribution given D ,
discussed in (2). Often, the continuous part of the distribution is also described by a partial probability density function.
A partial probability density function for the continuous part of P is a nonnegative function h : C → [0, ∞) such that P(B) = ∫_B h(x) dx for (measurable) B ⊆ C.
Details
Technically, h is required to be measurable, and is a density function with respect to Lebesgue measure λ_n on C, the standard
measure on R^n.
Clearly, the normalized function x ↦ h(x)/P(C) is the probability density function of the conditional distribution given C,
discussed in (2). As with purely continuous distributions, the existence of a probability density function for the continuous part of a
mixed distribution is not guaranteed. And when it does exist, a density function for the continuous part is not unique. Note that the
values of h could be changed to other nonnegative values on a countable subset of C , and the displayed equation above would still
hold, because only integrals of h are important. The probability measure P is completely determined by the partial probability
density functions.
Suppose that P has partial probability density functions g and h for the discrete and continuous parts, respectively. Then P(A ∪ B) = ∑_{x∈A} g(x) + ∫_B h(x) dx for A ⊆ D and measurable B ⊆ C.
Proof
Figure 3.3.2 : A mixed distribution is completely determined by its partial density functions.
Truncated Variables
Distributions of mixed type occur naturally when a random variable with a continuous distribution is truncated in a certain way.
For example, suppose that T is the random lifetime of a device, and has a continuous distribution with probability density function
f that is positive on [0, ∞). In a test of the device, we can't wait forever, so we might select a positive constant a and record the
truncated lifetime U, where
U = T if T < a, and U = a if T ≥ a.  (3.3.4)
Suppose next that random variable X has a continuous distribution on R, with probability density function f that is positive on R.
Suppose also that a, b ∈ R with a < b. The variable is truncated on the interval [a, b] to create a new random variable Y as
follows: Y = a if X ≤ a, Y = X if a < X < b, and Y = b if X ≥ b.
Another way that a “mixed” probability distribution can occur is when we have a pair of random variables (X, Y ) for our
experiment, one with a discrete distribution and the other with a continuous distribution. This setting is explored in the next section
on Joint Distributions.
Suppose that X takes the value 2 with probability 1/2, and with probability 1/2 is uniformly distributed on
the interval [0, 10]. Find P(X > 6).
Suppose that Y has probability 1/3 uniformly distributed on {0, 1, 2}, and has probability 2/3 uniformly distributed on
[0, 2]. Find P(Y > 1/2).
Answer
Suppose that the lifetime T of a device (in 1000 hour units) has the exponential distribution with probability density function
f(t) = e^{−t} for 0 ≤ t < ∞. A test of the device is terminated after 2000 hours; the truncated lifetime U is recorded. Find each
of the following:
1. P(U < 1)
2. P(U = 2)
Answer
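Since U = min(T, 2) with T exponential (rate 1), the event {U < 1} is {T < 1} and the point mass {U = 2} is {T ≥ 2}. The exact values are 1 − e^{−1} and e^{−2}; the simulation sketch below (our own) makes the point mass visible as exact ties at 2:

```python
from math import exp
import random

# Exact values for the truncated lifetime U = min(T, 2), T exponential (rate 1)
p_u_less_1 = 1 - exp(-1)  # P(U < 1) = P(T < 1), about 0.632
p_u_eq_2 = exp(-2)        # P(U = 2) = P(T >= 2), about 0.135

# Simulation check: the discrete part of U shows up as exact ties at 2
rng = random.Random(7)
sims = [min(rng.expovariate(1.0), 2.0) for _ in range(100_000)]
print(sum(u == 2.0 for u in sims) / len(sims))  # near exp(-2)
```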
This page titled 3.3: Mixed Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
3.4: Joint Distributions
The purpose of this section is to study how the distribution of a pair of random variables is related to the distributions of the
variables individually. If you are a new student of probability you may want to skip the technical details.
Basic Theory
Joint and Marginal Distributions
As usual, we start with a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F
is the collection of events, and P is the probability measure on the sample space (Ω, F ). Suppose now that X and Y are random
variables for the experiment, and that X takes values in S while Y takes values in T . We can think of (X, Y ) as a random variable
taking values in the product set S × T . The purpose of this section is to study how the distribution of (X, Y ) is related to the
distributions of X and Y individually.
Recall that
1. The distribution of (X, Y ) is the probability measure on S × T given by P [(X, Y ) ∈ C ] for C ⊆ S ×T .
2. The distribution of X is the probability measure on S given by P(X ∈ A) for A ⊆ S .
3. The distribution of Y is the probability measure on T given by P(Y ∈ B) for B ⊆ T .
In this context, the distribution of (X, Y ) is called the joint distribution, while the distributions of X and of Y are referred to
as marginal distributions.
Details
The first simple but very important point, is that the marginal distributions can be obtained from the joint distribution.
Note that
1. P(X ∈ A) = P [(X, Y ) ∈ A × T ] for A ⊆ S
2. P(Y ∈ B) = P [(X, Y ) ∈ S × B] for B ⊆ T
The converse does not hold in general. The joint distribution contains much more information than the marginal distributions
separately. However, the converse does hold if X and Y are independent, as we will show below.
We assume that (X, Y ) has density function f : S × T → (0, ∞) in the following sense:
1. If X and Y have discrete distributions on the countable sets S and T respectively, then f is defined by
f (x, y) = P(X = x, Y = y), (x, y) ∈ S × T (3.4.1)
3. In the mixed case where X has a discrete distribution on the countable set S and Y has a continuous distribution on
T ⊆ R^k, then f is defined by the condition
P(X = x, Y ∈ B) = ∫_B f(x, y) dy,  x ∈ S, B ⊆ T  (3.4.3)
4. In the mixed case where X has a continuous distribution on S ⊆ R^j and Y has a discrete distribution on the countable set
T, then f is defined by the condition
3.4.1 https://stats.libretexts.org/@go/page/10144
P(X ∈ A, Y = y) = ∫_A f(x, y) dx,  A ⊆ S, y ∈ T  (3.4.4)
Details
First we will see how to obtain the probability density function of one variable when the other has a discrete distribution.
2. If X has a discrete distribution on the countable set S, then Y has probability density function h given by
h(y) = ∑_{x∈S} f(x, y),  y ∈ T
Proof
Next we will see how to obtain the probability density function of one variable when the other has a continuous distribution.
Suppose again that (X, Y ) has probability density function f as described above.
1. If Y has a continuous distribution on T ⊆ R^k, then X has probability density function g given by
g(x) = ∫_T f(x, y) dy,  x ∈ S
Proof
In the context of the previous two theorems, f is called the joint probability density function of (X, Y ), while g and h are called
the marginal density functions of X and of Y , respectively. Some of the computational exercises below will make the term
marginal clear.
Independence
When the variables are independent, the marginal distributions determine the joint distribution.
If X and Y are independent, then the distribution of X and the distribution of Y determine the distribution of (X, Y ).
Proof
When the variables are independent, the joint density is the product of the marginal densities.
Suppose that X and Y are independent and have probability density functions g and h respectively. Then (X, Y) has
probability density function f given by
f (x, y) = g(x)h(y), (x, y) ∈ S × T (3.4.11)
Proof
The following result gives a converse to the last result. If the joint probability density factors into a function of x only and a
function of y only, then X and Y are independent, and we can almost identify the individual probability density functions just from
the factoring.
Factoring Theorem. Suppose that (X, Y ) has probability density function f of the form
f (x, y) = u(x)v(y), (x, y) ∈ S × T (3.4.15)
where u : S → [0, ∞) and v : T → [0, ∞). Then X and Y are independent, and there exists a positive constant c such that X
has probability density function g given by g(x) = c u(x) for x ∈ S, and Y has probability density function h given by
h(y) = (1/c) v(y),  y ∈ T  (3.4.17)
Proof
The last two results extend to more than two random variables, because X and Y themselves may be random vectors. Here is the
explicit statement:
Suppose that X_i is a random variable taking values in a set R_i with probability density function g_i for i ∈ {1, 2, …, n}, and
that the random variables are independent. Then the random vector X = (X_1, X_2, …, X_n) taking values in
S = R_1 × R_2 × ⋯ × R_n has probability density function f given by
f(x_1, x_2, …, x_n) = g_1(x_1) g_2(x_2) ⋯ g_n(x_n),  (x_1, x_2, …, x_n) ∈ S
The special case where the distributions are all the same is particularly important.
Suppose that X = (X_1, X_2, …, X_n) is a sequence of independent random variables, each taking values in a set R and with
common probability density function g. Then the probability density function f of X on S = R^n is given by
f(x_1, x_2, …, x_n) = g(x_1) g(x_2) ⋯ g(x_n),  (x_1, x_2, …, x_n) ∈ S
In probability jargon, X is a sequence of independent, identically distributed variables, a phrase that comes up so often that it is
often abbreviated as IID. In statistical jargon, X is a random sample of size n from the common distribution. As is evident from
the special terminology, this situation is very important in both branches of mathematics. In statistics, the joint probability density
function f plays an important role in procedures such as maximum likelihood and the identification of uniformly best estimators.
Recall that (mutual) independence of random variables is a very strong property. If a collection of random variables is independent,
then any subcollection is also independent. New random variables formed from disjoint subcollections are independent. For a
simple example, suppose that X, Y , and Z are independent real-valued random variables. Then
1. sin(X), cos(Y), and e^Z are independent.
Suppose that two standard, fair dice are rolled and the sequence of scores (X_1, X_2) recorded. Our usual assumption is that the
variables X_1 and X_2 are independent. Let Y = X_1 + X_2 and Z = X_1 − X_2 denote the sum and difference of the scores,
respectively.
1. Find the probability density function of (Y , Z).
2. Find the probability density function of Y .
3. Find the probability density function of Z .
4. Are Y and Z independent?
Answer
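The joint PMF of (Y, Z) can be found by direct enumeration of the 36 equally likely outcomes; the following sketch (our own) also shows why Y and Z are dependent:

```python
from collections import Counter
from fractions import Fraction

# Enumerate the 36 equally likely outcomes of two fair dice
joint = Counter()
for x1 in range(1, 7):
    for x2 in range(1, 7):
        joint[(x1 + x2, x1 - x2)] += Fraction(1, 36)

# Marginal PMFs of the sum Y and the difference Z
pY, pZ = Counter(), Counter()
for (y, z), p in joint.items():
    pY[y] += p
    pZ[z] += p

# Dependence: the joint PMF is not the product of the marginals
print(joint[(2, 0)])  # P(Y = 2, Z = 0) = 1/36
print(pY[2] * pZ[0])  # P(Y = 2) P(Z = 0) = (1/36)(6/36) = 1/216
```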
Suppose that two standard, fair dice are rolled and the sequence of scores (X_1, X_2) recorded. Let U = min{X_1, X_2} and
V = max{X_1, X_2} denote the minimum and maximum scores, respectively.
Answer
The previous two exercises show clearly how little information is given with the marginal distributions compared to the joint
distribution. With the marginal PDFs alone, you could not even determine the support set of the joint distribution, let alone the
values of the joint PDF.
Suppose that (X, Y ) has probability density function f given by f (x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1 .
1. Find the probability density function of X.
2. Find the probability density function of Y .
3. Are X and Y independent?
Answer
The last exercise is a good illustration of the factoring theorem. Without any work at all, we can tell that the PDF of X is
proportional to x ↦ x on the interval [0, 1], the PDF of Y is proportional to y ↦ y² on the interval [0, 1], and that X and Y are
independent.
Suppose that (X, Y) has probability density function f given by f(x, y) = 15x²y for 0 ≤ x ≤ y ≤ 1.
1. Find the probability density function of X.
2. Find the probability density function of Y .
3. Are X and Y independent?
Answer
Note that in the last exercise, the factoring theorem does not apply. Random variables X and Y each take values in [0, 1], but the
support set {(x, y) : 0 ≤ x ≤ y ≤ 1} is not a product set.
Suppose that (X, Y, Z) has probability density function f given by f(x, y, z) = 2(x + y)z for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1,
0 ≤ z ≤ 1.
In the previous exercise, X has an exponential distribution with rate parameter 2. Recall that exponential distributions are widely
used to model random times, particularly in the context of the Poisson model.
Suppose that X and Y are independent, and that X has probability density function g given by g(x) = 6x(1 − x) for
0 ≤ x ≤ 1, and that Y has probability density function h given by h(y) = 12y²(1 − y) for 0 ≤ y ≤ 1.
In the previous exercise, X and Y have beta distributions, which are widely used to model random probabilities and proportions.
Beta distributions are studied in more detail in the chapter on Special Distributions.
Suppose that Θ and Φ are independent random angles, with common probability density function g given by g(t) = sin(t) for
0 ≤ t ≤ π/2.
The common distribution of X and Y in the previous exercise governs a random angle in Bertrand's problem.
Suppose that X and Y are independent, and that X has probability density function g given by g(x) = 2/x³ for 1 ≤ x < ∞, and
that Y has probability density function h given by h(y) = 3/y⁴ for 1 ≤ y < ∞.
Both X and Y in the previous exercise have Pareto distributions, named for Vilfredo Pareto. Recall that Pareto distributions are
used to model certain economic variables and are studied in more detail in the chapter on Special Distributions.
Suppose that (X, Y) has probability density function g given by g(x, y) = 15x²y for 0 ≤ x ≤ y ≤ 1, that Z has
probability density function h given by h(z) = 4z³ for 0 ≤ z ≤ 1, and that (X, Y) and Z are independent.
λ_n(A) = ∫_A 1 dx,  A ⊆ R^n  (3.4.23)
In particular, λ_1(A) is the length of A ⊆ R, λ_2(A) is the area of A ⊆ R², and λ_3(A) is the volume of A ⊆ R³.
Details
Suppose now that X takes values in R^j, Y takes values in R^k, and that (X, Y) is uniformly distributed on a set R ⊆ R^{j+k}. So
0 < λ_{j+k}(R) < ∞, and the joint probability density function f of (X, Y) is given by f(x, y) = 1/λ_{j+k}(R) for (x, y) ∈ R. Recall
that uniform distributions always have constant density functions. Now let S and T be the projections of R onto R^j and R^k
respectively, defined by S = {x ∈ R^j : (x, y) ∈ R for some y ∈ R^k} and T = {y ∈ R^k : (x, y) ∈ R for some x ∈ R^j}.
Note that R ⊆ S × T. Next we denote the cross sections at x ∈ S and at y ∈ T, respectively, by
T_x = {t ∈ T : (x, t) ∈ R},  S_y = {s ∈ S : (s, y) ∈ R}  (3.4.25)
Figure 3.4.1 : The projections S and T , and the cross sections at x and y
X takes values in S and Y takes values in T. The probability density functions g and h of X and Y are proportional to the
cross-sectional measures:
1. g(x) = λ_k(T_x) / λ_{j+k}(R) for x ∈ S
2. h(y) = λ_j(S_y) / λ_{j+k}(R) for y ∈ T
Proof
In particular, note from previous theorem that X and Y are neither independent nor uniformly distributed in general. However,
these properties do hold if R is a Cartesian product set.
Suppose that R = S × T .
1. X is uniformly distributed on S .
2. Y is uniformly distributed on T .
3. X and Y are independent.
Proof
In each of the following cases, find the joint and marginal probability density functions, and determine if X and Y are
independent.
1. (X, Y) is uniformly distributed on the square R = [−6, 6]².
Answer
In the bivariate uniform experiment, run the simulation 1000 times for each of the following cases. Watch the points in the
scatter plot and the graphs of the marginal distributions. Interpret what you see in the context of the discussion above.
1. square
2. triangle
3. circle
2. Find the probability density function of each pair of variables.
3. Find the probability density function of each variable
4. Determine the dependency relationships between the variables.
Answer
R = {(x, y) : x ∈ S and 0 ≤ y ≤ g(x)} ⊆ R^{n+1}  (3.4.27)
Suppose now that R ⊆ T where T ⊆ R^{n+1} with λ_{n+1}(T) < ∞, and that ((X_1, Y_1), (X_2, Y_2), …) is a sequence of
independent points, each uniformly distributed on T.
Proof
Figure 3.4.3: With a sequence of independent points, uniformly distributed on T, the x coordinate of the first point to land in R
has probability density function g.
Typically, g is a bounded probability density function on a bounded set S. In this case, we can find T that is the Cartesian product of n + 1 bounded intervals with R ⊆ T. It turns out to be very easy to
simulate a sequence of independent variables, each uniformly distributed on such a product set, so the rejection method always
works in this case. As you might guess, the rejection method works best if the size of T, namely λ_{n+1}(T), is small, so that
relatively few points are rejected.
The rejection method app simulates a number of continuous distributions via the rejection method. For each of the following
distributions, vary the parameters and note the shape and location of the probability density function. Then run the experiment
1000 times and observe the results.
1. The beta distribution
2. The semicircle distribution
3. The triangle distribution
4. The U-power distribution
(X, Y, Z) has a (multivariate) hypergeometric distribution with probability density function f given by
f(x, y, z) = C(a, x) C(b, y) C(c, z) C(m − a − b − c, n − x − y − z) / C(m, n),  x + y + z ≤ n  (3.4.30)
where C(r, j) denotes the binomial coefficient.
Proof
(X, Y) also has a (multivariate) hypergeometric distribution, with the probability density function g given by
g(x, y) = C(a, x) C(b, y) C(m − a − b, n − x − y) / C(m, n),  x + y ≤ n  (3.4.31)
Proof
Proof
These results generalize in a straightforward way to a population with any number of types. In brief, if a random vector has a
hypergeometric distribution, then any sub-vector also has a hypergeometric distribution. In other words, all of the marginal
distributions of a hypergeometric distribution are themselves hypergeometric. Note however, that it's not a good idea to memorize
the formulas above explicitly. It's better to just note the patterns and recall the combinatorial meaning of the binomial coefficient.
The hypergeometric distribution and the multivariate hypergeometric distribution are studied in more detail in the chapter on Finite
Sampling Models.
Suppose that a population of voters consists of 50 democrats, 40 republicans, and 30 independents. A sample of 10 voters is
chosen at random from the population (without replacement, of course). Let X denote the number of democrats in the sample
and Y the number of republicans in the sample. Find the probability density function of each of the following:
1. (X, Y )
2. X
3. Y
Answer
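Part (a) can be computed directly from the multivariate hypergeometric formula with a = 50 democrats, b = 40 republicans, c = 30 independents, and sample size n = 10. A sketch in Python (`hyper_pmf` is our name):

```python
from math import comb

def hyper_pmf(x: int, y: int, a: int = 50, b: int = 40, c: int = 30, n: int = 10) -> float:
    """P(X = x, Y = y): x democrats and y republicans in a sample of n,
    drawn without replacement from a + b + c voters."""
    m = a + b + c
    if x + y > n:
        return 0.0
    return comb(a, x) * comb(b, y) * comb(c, n - x - y) / comb(m, n)

# The PMF sums to 1 over all feasible (x, y)
total = sum(hyper_pmf(x, y) for x in range(11) for y in range(11))
print(total)
```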
Suppose that the Math Club at Enormous State University (ESU) has 50 freshmen, 40 sophomores, 30 juniors and 20 seniors.
A sample of 10 club members is chosen at random to serve on the π-day committee. Let X denote the number of freshmen on the
committee, Y the number of sophomores, and Z the number of juniors.
1. Find the probability density function of (X, Y , Z)
2. Find the probability density function of each pair of variables.
3. Find the probability density function of each individual variable.
Answer
Multinomial Trials
Suppose that we have a sequence of n independent trials, each with 4 possible outcomes. On each trial, outcome 1 occurs with
probability p, outcome 2 with probability q, outcome 3 with probability r, and outcome 0 occurs with probability 1 − p − q − r .
The parameters p, q, and r are nonnegative numbers with p + q + r ≤ 1, and n ∈ N_+. Denote the number of times that outcome 1,
outcome 2, and outcome 3 occurred in the n trials by X, Y , and Z respectively. Of course, the number of times that outcome 0
occurs is n − X − Y − Z . In the problems below, the variables x, y , and z take values in N.
(X, Y, Z) has a multinomial distribution with the probability density function f given by

f(x, y, z) = \binom{n}{x, y, z} p^x q^y r^z (1 - p - q - r)^{n-x-y-z}, \quad x + y + z \le n

Proof
(X, Y ) also has a multinomial distribution with the probability density function g given by
g(x, y) = \binom{n}{x, y} p^x q^y (1 - p - q)^{n-x-y}, \quad x + y \le n \qquad (3.4.34)
Proof
Proof
These results generalize in a completely straightforward way to multinomial trials with any number of trial outcomes. In brief, if a
random vector has a multinomial distribution, then any sub-vector also has a multinomial distribution. In other terms, all of the
marginal distributions of a multinomial distribution are themselves multinomial. The binomial distribution and the multinomial
distribution are studied in more detail in the chapter on Bernoulli Trials.
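The marginal property is also easy to verify numerically. Here is a hedged sketch; the function `multinomial_pmf` and its convention of assigning the leftover probability mass to an implicit outcome 0 are our own:

```python
from math import factorial

def multinomial_pmf(counts, probs, n):
    """pdf of the counts of the listed outcomes in n independent trials;
    the remaining n - sum(counts) trials land on an implicit outcome 0,
    which has probability 1 - sum(probs)."""
    rest = n - sum(counts)
    if rest < 0:
        return 0.0
    coef = factorial(n)
    for k in list(counts) + [rest]:
        coef //= factorial(k)
    out = coef * (1.0 - sum(probs)) ** rest
    for p, k in zip(probs, counts):
        out *= p ** k
    return out

# Summing out z collapses the pdf of (X, Y, Z) to the pdf of (X, Y).
n, p, q, r = 5, 0.2, 0.3, 0.1
lhs = sum(multinomial_pmf((1, 2, z), (p, q, r), n) for z in range(n + 1))
rhs = multinomial_pmf((1, 2), (p, q), n)
```

The two computed values agree, which is exactly the statement that any sub-vector of a multinomial vector is multinomial.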
Suppose that a system consists of 10 components that operate independently. Each component is working with probability 1/2, idle with probability 1/3, or failed with probability 1/6. Let X denote the number of working components and Y the number of idle components. Give the probability density function of each of the following:
1. (X, Y )
2. X
3. Y
Answer
Suppose that in a crooked, four-sided die, face i occurs with probability i/10 for i ∈ {1, 2, 3, 4}. The die is thrown 12 times; let X denote the number of times that score 1 occurs, Y the number of times that score 2 occurs, and Z the number of times that score 3 occurs.
1. Find the probability density function of (X, Y , Z)
2. Find the probability density function of each pair of variables.
3. Find the probability density function of each individual variable.
Answer
Suppose that (X, Y) has probability density function f defined by

f(x, y) = \frac{1}{12\pi} \exp\left[-\left(\frac{x^2}{8} + \frac{y^2}{18}\right)\right], \quad (x, y) \in \mathbb{R}^2 \qquad (3.4.36)

Suppose that (X, Y) has probability density function f defined by

f(x, y) = \frac{1}{\sqrt{3}\,\pi} \exp\left[-\frac{2}{3}\left(x^2 - xy + y^2\right)\right], \quad (x, y) \in \mathbb{R}^2 \qquad (3.4.37)
The joint distributions in the last two exercises are examples of bivariate normal distributions. Normal distributions are widely
used to model physical measurements subject to small, random errors. In both exercises, the marginal distributions of X and Y also
have normal distributions, and this turns out to be true in general. The multivariate normal distribution is studied in more detail in
the chapter on Special Distributions.
Exponential Distributions
Recall that the exponential distribution has probability density function
f(x) = r e^{-rx}, \quad x \in [0, \infty) \qquad (3.4.38)
where r ∈ (0, ∞) is the rate parameter. The exponential distribution is widely used to model random times, and is studied in more
detail in the chapter on the Poisson Process.
Suppose X and Y have exponential distributions with parameters a ∈ (0, ∞) and b ∈ (0, ∞), respectively, and are independent. Then

P(X < Y) = \frac{a}{a + b}
Suppose X, Y, and Z have exponential distributions with parameters a ∈ (0, ∞), b ∈ (0, ∞), and c ∈ (0, ∞), respectively, and are independent. Then

1. P(X < Y < Z) = \frac{a}{a + b + c} \cdot \frac{b}{b + c}
If X, Y , and Z are the lifetimes of devices that act independently, then the results in the previous two exercises give probabilities
of various failure orders. Results of this type are also very important in the study of continuous-time Markov processes. We will
continue this discussion in the section on transformations of random variables.
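These order probabilities are easy to sanity-check by simulation. The sketch below is illustrative only; the function name, seed, and sample size are our own choices:

```python
import random

def order_prob(rates, n=100_000, seed=7):
    """Monte Carlo estimate of P(T1 < T2 < ... < Tk) for independent
    exponential lifetimes with the given rate parameters."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        times = [rng.expovariate(r) for r in rates]
        hits += times == sorted(times)   # ties have probability 0
    return hits / n
```

With rates a = 2, b = 1 the estimate should be near a/(a + b) = 2/3, and with a = b = c = 1 the estimate of P(X < Y < Z) should be near (1/3)(1/2) = 1/6.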
Mixed Coordinates
Suppose X takes values in the finite set {1, 2, 3}, Y takes values in the interval [0, 3], and that (X, Y ) has probability density
function f given by
f(x, y) = \begin{cases} 1/3, & x = 1, \; 0 \le y \le 1 \\ 1/6, & x = 2, \; 0 \le y \le 2 \\ 1/9, & x = 3, \; 0 \le y \le 3 \end{cases} \qquad (3.4.39)
Suppose that P takes values in the interval [0, 1], X takes values in the finite set {0, 1, 2, 3}, and that (P, X) has probability density function f given by

f(p, x) = 6 \binom{3}{x} p^{x+1} (1 - p)^{4-x}, \quad (p, x) \in [0, 1] \times \{0, 1, 2, 3\} \qquad (3.4.40)
As we will see in the section on conditional distributions, the distribution in the last exercise models the following experiment: a
random probability P is selected, and then a coin with this probability of heads is tossed 3 times; X is the number of heads. Note
that P has a beta distribution.
Random Samples
Recall that the Bernoulli distribution with parameter p ∈ [0, 1] has probability density function g given by g(x) = p^x (1 - p)^{1-x} for x ∈ {0, 1}. Let X = (X_1, X_2, …, X_n) be a random sample of size n ∈ N_+ from the distribution.
The Bernoulli distribution is named for Jacob Bernoulli, and governs an indicator random variable. Hence if X is a random sample of size n from the distribution then X is a sequence of n Bernoulli trials. A separate chapter studies Bernoulli trials in more detail.
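Since the trials are independent, the joint pdf of the sample is the product of the marginal pdfs, and it depends on the data only through the number of successes. A minimal sketch (the function name is ours):

```python
def bernoulli_sample_pdf(x, p):
    """Joint pdf of an i.i.d. Bernoulli(p) sample x = (x_1, ..., x_n):
    f(x) = p**y * (1 - p)**(n - y), where y = x_1 + ... + x_n."""
    y = sum(x)
    return p ** y * (1 - p) ** (len(x) - y)
```

Note that any two samples with the same number of successes have the same probability; for example, (1, 0, 1) and (0, 1, 1) are equally likely for every p.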
Recall that the geometric distribution on N_+ with parameter p ∈ (0, 1) has probability density function g given by g(x) = p(1 - p)^{x-1} for x ∈ N_+. Let X = (X_1, X_2, …, X_n) be a random sample of size n ∈ N_+ from the distribution. Give the probability density function of X in simplified form.
The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials. Hence the variables in the
random sample can be interpreted as the number of trials between successive successes.
Recall that the Poisson distribution with parameter a ∈ (0, ∞) has probability density function g given by g(x) = e^{-a} a^x / x! for x ∈ N.
The Poisson distribution is named for Simeon Poisson, and governs the number of random points in a region of time or space under
appropriate circumstances. The parameter a is proportional to the size of the region. The Poisson distribution is studied in more
detail in the chapter on the Poisson process.
Recall again that the exponential distribution with rate parameter r ∈ (0, ∞) has probability density function g given by g(x) = r e^{-rx} for x ∈ (0, ∞). Let X = (X_1, X_2, …, X_n) be a random sample of size n ∈ N_+ from the distribution. Give the probability density function of X in simplified form.
The exponential distribution governs failure times and other types of arrival times under appropriate circumstances. The
exponential distribution is studied in more detail in the chapter on the Poisson process. The variables in the random sample can be
interpreted as the times between successive arrivals in the Poisson process.
Recall that the standard normal distribution has probability density function ϕ given by ϕ(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} for z ∈ R. Let Z = (Z_1, Z_2, …, Z_n) be a random sample of size n ∈ N_+ from the distribution. Give the probability density function of Z in simplified form.
Answer
The standard normal distribution governs physical quantities, properly scaled and centered, subject to small, random errors. The normal distribution is studied in more generality in the chapter on Special Distributions.
For the cicada data, G denotes gender and S denotes species type.
1. Find the empirical density of (G, S) .
2. Find the empirical density of G.
3. Find the empirical density of S .
4. Do you believe that S and G are independent?
Answer
For the cicada data, let W denote body weight (in grams) and L body length (in millimeters).
1. Construct an empirical density for (W , L).
2. Find the corresponding empirical density for W .
3. Find the corresponding empirical density for L.
4. Do you believe that W and L are independent?
Answer
For the cicada data, let G denote gender and W body weight (in grams).
1. Construct an empirical density for (W , G).
2. Find the empirical density for G.
3. Find the empirical density for W .
4. Do you believe that G and W are independent?
Answer
This page titled 3.4: Joint Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
3.5: Conditional Distributions
In this section, we study how a probability distribution changes when a given random variable has a known, specified value. So this is an essential topic that deals with how probability measures should be updated in light of new information. As usual, if you are a new student of probability, you may want to skip the technical details.
Basic Theory
Our starting point is a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F is
the collection of events, and P is the probability measure on the underlying sample space (Ω, F ).
Suppose that X is a random variable defined on the sample space (that is, defined for the experiment), taking values in a set S .
Details
The purpose of this section is to study the conditional probability measure given X = x for x ∈ S . That is, if E is an event, we
would like to define and study the probability of E given X = x , denoted P(E ∣ X = x) . If X has a discrete distribution, the
conditioning event has positive probability, so no new concepts are involved, and the simple definition of conditional probability
suffices. When X has a continuous distribution, however, the conditioning event has probability 0, so a fundamentally new
approach is needed.
Proof
The next result is a special case of the law of total probability, and will be the key to the definition when X has a continuous distribution.
If E is an event then

P(E, X \in A) = \sum_{x \in A} g(x) P(E \mid X = x), \quad A \subseteq S
Suppose now that X has a continuous distribution on S with probability density function g, and that g(x) > 0 for x ∈ S. Unlike the discrete case, we cannot use simple conditional probability to define P(E ∣ X = x) for an
event E and x ∈ S because the conditioning event has probability 0. Nonetheless, the concept should make sense. If we actually
run the experiment, X will take on some value x (even though a priori, this event occurs with probability 0), and surely the
information that X = x should in general alter the probabilities that we assign to other events. A natural approach is to use the
results obtained in the discrete case as definitions in the continuous case.
If E is an event and x ∈ S, the conditional probability P(E ∣ X = x) is defined by the requirement that

P(E, X \in A) = \int_A g(x) P(E \mid X = x) \, dx, \quad A \subseteq S
Details
3.5.1 https://stats.libretexts.org/@go/page/10145
Conditioning and Bayes' Theorem
Suppose again that X is a random variable with values in S and probability density function g , as described above. Our discussions
above in the discrete and continuous cases lead to basic formulas for computing the probability of an event by conditioning on X.
The law of total probability. If E is an event, then P(E) can be computed as follows:
1. If X has a discrete distribution then P(E) = \sum_{x \in S} g(x) P(E \mid X = x).
2. If X has a continuous distribution then P(E) = \int_S g(x) P(E \mid X = x) \, dx.
Proof
Naturally, the law of total probability is useful when P(E ∣ X = x) and g(x) are known for x ∈ S. Our next result is Bayes' theorem, named after Thomas Bayes.
Bayes' Theorem. Suppose that E is an event with P(E) > 0. The conditional probability density function x ↦ g(x ∣ E) of X given E can be computed as follows:
1. If X has a discrete distribution then

g(x \mid E) = \frac{g(x) P(E \mid X = x)}{\sum_{s \in S} g(s) P(E \mid X = s)}, \quad x \in S \qquad (3.5.9)

2. If X has a continuous distribution then

g(x \mid E) = \frac{g(x) P(E \mid X = x)}{\int_S g(s) P(E \mid X = s) \, ds}, \quad x \in S
Proof
In the context of Bayes' theorem, g is called the prior probability density function of X and x ↦ g(x ∣ E) is the posterior probability density function of X given E. Note also that the conditional probability density function of X given E is proportional to the function x ↦ g(x)P(E ∣ X = x); the sum or integral of this function that occurs in the denominator is simply the normalizing constant. As with the law of total probability, Bayes' theorem is useful when P(E ∣ X = x) and g(x) are known for x ∈ S.
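In the discrete case, Bayes' theorem is a one-liner: multiply the prior by the likelihood and renormalize. A sketch (the dict-based interface is our own illustration):

```python
def posterior(prior, likelihood):
    """x -> g(x | E) over a finite set of values, given the prior pdf g
    and the likelihoods x -> P(E | X = x)."""
    joint = {x: prior[x] * likelihood[x] for x in prior}
    norm = sum(joint.values())          # this is P(E), by total probability
    return {x: v / norm for x, v in joint.items()}
```

For instance, with a fair coin and a two-headed coin equally likely a priori, observing heads on a single toss gives posterior probability (1/2 · 1/2)/(1/2 · 1/2 + 1/2 · 1) = 1/3 that the coin is fair.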
Suppose that X and Y are random variables on the probability space, with values in sets S and T , respectively, so that (X, Y )
is a random variable with values in S × T . We assume that (X, Y ) has probability density function f , as discussed in the
section on Joint Distributions. Recall that X has probability density function g defined as follows:
1. If Y has a discrete distribution on the countable set T then g(x) = \sum_{y \in T} f(x, y) for x ∈ S.
2. If Y has a continuous distribution on T then g(x) = \int_T f(x, y) \, dy for x ∈ S.
Similarly, the probability density function h of Y can be obtained by summing f over x ∈ S if X has a discrete distribution or integrating f over S if X has a continuous distribution.
Suppose that x ∈ S and that g(x) > 0 . The function y ↦ h(y ∣ x) defined below is a probability density function on T :
h(y \mid x) = \frac{f(x, y)}{g(x)}, \quad y \in T \qquad (3.5.14)
Proof
The distribution that corresponds to this probability density function is what you would expect:
For x ∈ S , the function y ↦ h(y ∣ x) is the conditional probability density function of Y given X = x . That is,
1. If Y has a discrete distribution then P(Y \in B \mid X = x) = \sum_{y \in B} h(y \mid x) for B ⊆ T.
2. If Y has a continuous distribution then P(Y \in B \mid X = x) = \int_B h(y \mid x) \, dy for B ⊆ T.
Proof
The following theorem gives Bayes' theorem for probability density functions. We use the notation established above.
Bayes' Theorem. For y ∈ T, the conditional probability density function x ↦ g(x ∣ y) of X given Y = y can be computed as follows:
1. If X has a discrete distribution then

g(x \mid y) = \frac{g(x) h(y \mid x)}{\sum_{s \in S} g(s) h(y \mid s)}, \quad x \in S \qquad (3.5.24)

2. If X has a continuous distribution then

g(x \mid y) = \frac{g(x) h(y \mid x)}{\int_S g(s) h(y \mid s) \, ds}, \quad x \in S
Proof
In the context of Bayes' theorem, g is the prior probability density function of X and x ↦ g(x ∣ y) is the posterior probability
density function of X given Y = y for y ∈ T . Note that the posterior probability density function x ↦ g(x ∣ y) is proportional to
the function x ↦ g(x)h(y ∣ x). The sum or integral in the denominator is the normalizing constant.
Independence
Intuitively, X and Y should be independent if and only if the conditional distributions are the same as the corresponding
unconditional distributions.
A couple of special distributions will occur frequently in the exercises. First, recall that the discrete uniform distribution on a finite,
nonempty set S has probability density function f given by f (x) = 1/#(S) for x ∈ S . This distribution governs an element
selected at random from S .
Recall also that Bernoulli trials (named for Jacob Bernoulli) are independent trials, each with two possible outcomes generically
called success and failure. The probability of success p ∈ [0, 1] is the same for each trial, and is the basic parameter of the random
process. The number of successes in n ∈ N_+ Bernoulli trials has the binomial distribution with parameters n and p. This distribution has probability density function f given by

f(x) = \binom{n}{x} p^x (1 - p)^{n-x}

for x ∈ {0, 1, …, n}. The binomial distribution is studied in more detail in the chapter on Bernoulli Trials.
1. Find the conditional probability density function of U given V =v for each v ∈ {1, 2, 3, 4, 5, 6}.
2. Find the conditional probability density function of V given U =u for each u ∈ {1, 2, 3, 4, 5, 6}.
Answer
In the die-coin experiment, a standard, fair die is rolled and then a fair coin is tossed the number of times showing on the die.
Let N denote the die score and Y the number of heads.
1. Find the joint probability density function of (N , Y ).
2. Find the probability density function of Y .
3. Find the conditional probability density function of N given Y =y for each y ∈ {0, 1, 2, 3, 4, 5, 6}.
Answer
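The structure of such two-stage experiments is always the same: the joint pdf is f(n, y) = P(N = n) P(Y = y ∣ N = n), and the marginal pdf of Y follows from the law of total probability. A sketch for a die-coin setup of this kind (function names are ours):

```python
from math import comb

def die_coin_joint(n, y):
    """f(n, y) = P(N = n, Y = y): fair die score n, then y heads
    in n tosses of a fair coin."""
    if not (1 <= n <= 6 and 0 <= y <= n):
        return 0.0
    return (1 / 6) * comb(n, y) * 0.5 ** n

def heads_pdf(y):
    """Marginal pdf of Y via the law of total probability."""
    return sum(die_coin_joint(n, y) for n in range(1, 7))
```

Dividing `die_coin_joint(n, y)` by `heads_pdf(y)` then gives the conditional pdf of N given Y = y, which is the Bayes' theorem step in part 3.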
In the coin-die experiment, a fair coin is tossed. If the coin is tails, a standard, fair die is rolled. If the coin is heads, a standard, ace-six flat die is rolled (faces 1 and 6 have probability 1/4 each and faces 2, 3, 4, 5 have probability 1/8 each). Let X denote the coin score (0 for tails and 1 for heads) and Y the die score.
1. Find the joint probability density function of (X, Y ).
2. Find the probability density function of Y .
3. Find the conditional probability density function of X given Y =y for each y ∈ {1, 2, 3, 4, 5, 6}.
Answer
Suppose that a box contains 12 coins: 5 are fair, 4 are biased so that heads comes up with probability 1/3, and 3 are two-headed. A coin is chosen at random and tossed 2 times. Let P denote the probability of heads of the selected coin, and X the number of heads.
1. Find the joint probability density function of (P , X).
2. Find the probability density function of X.
3. Find the conditional probability density function of P given X = x for x ∈ {0, 1, 2}.
Answer
Compare the die-coin experiment with the box of coins experiment. In the first experiment, we toss a coin with a fixed probability
of heads a random number of times. In the second experiment, we effectively toss a coin with a random probability of heads a fixed
number of times.
Suppose that P has probability density function g(p) = 6p(1 − p) for p ∈ [0, 1]. Given P =p , a coin with probability of
heads p is tossed 3 times. Let X denote the number of heads.
1. Find the joint probability density function of (P , X).
2. Find the probability density function of X.
3. Find the conditional probability density of P given X = x for x ∈ {0, 1, 2, 3}. Graph these on the same axes.
Answer
Compare the box of coins experiment with the last experiment. In the second experiment, we effectively choose a coin from a box
with a continuous infinity of coin types. The prior distribution of P and each of the posterior distributions of P in part (c) are
members of the family of beta distributions, one of the reasons for the importance of the beta family. Beta distributions are studied
in more detail in the chapter on Special Distributions.
In the simulation of the beta coin experiment, set a = b = 2 and n = 3 to get the experiment studied in the previous exercise.
For various “true values” of p, run the experiment in single step mode a few times and observe the posterior probability density
function on each run.
Recall that the exponential distribution with rate parameter r ∈ (0, ∞) has probability density function f given by f(t) = r e^{-rt} for t ∈ [0, ∞). The exponential distribution is often used to model random times, under certain assumptions. The exponential distribution is studied in more detail in the chapter on the Poisson Process. Recall also that for a, b ∈ R with a < b, the continuous uniform distribution on the interval [a, b] has probability density function f given by f(x) = 1/(b - a) for x ∈ [a, b]. This distribution governs a point selected at random from the interval.
Suppose that there are 5 light bulbs in a box, labeled 1 to 5. The lifetime of bulb n (in months) has the exponential distribution
with rate parameter n . A bulb is selected at random from the box and tested.
1. Find the probability that the selected bulb will last more than one month.
2. Given that the bulb lasts more than one month, find the conditional probability density function of the bulb number.
Answer
Suppose that X is uniformly distributed on {1, 2, 3}, and given X = x ∈ {1, 2, 3}, random variable Y is uniformly distributed
on the interval [0, x].
1. Find the joint probability density function of (X, Y ).
2. Find the probability density function of Y .
3. Find the conditional probability density function of X given Y =y for y ∈ [0, 3].
Answer
Recall that the Poisson distribution with parameter a ∈ (0, ∞) has probability density function g(n) = e^{-a} a^n / n! for n ∈ N. This
distribution is widely used to model the number of “random points” in a region of time or space; the parameter a is proportional to
the size of the region. The Poisson distribution is named for Simeon Poisson, and is studied in more detail in the chapter on the
Poisson Process.
Suppose that N is the number of elementary particles emitted by a sample of radioactive material in a specified period of time,
and has the Poisson distribution with parameter a . Each particle emitted, independently of the others, is detected by a counter
with probability p ∈ (0, 1) and missed with probability 1 − p . Let Y denote the number of particles detected by the counter.
1. For n ∈ N , argue that the conditional distribution of Y given N =n is binomial with parameters n and p.
3.5.5 https://stats.libretexts.org/@go/page/10145
2. Find the joint probability density function of (N , Y ).
3. Find the probability density function of Y .
4. For y ∈ N , find the conditional probability density function of N given Y =y .
Answer
The fact that Y also has a Poisson distribution is an interesting and characteristic property of the distribution. This property is
explored in more depth in the section on thinning the Poisson process.
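The thinning property is easy to check by simulation. The sketch below samples N with Knuth's classical multiplicative method (adequate for small a); the function names, seed, and parameters are our own choices:

```python
import math
import random

def poisson_sample(rng, a):
    """Knuth's multiplicative method for sampling Poisson(a)."""
    limit = math.exp(-a)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod < limit:
            return k
        k += 1

def thinned_mean(a, p, trials=100_000, seed=11):
    """Average number of detected particles over many runs; theory says
    Y ~ Poisson(p * a), so the sample mean should be close to p * a."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        n = poisson_sample(rng, a)
        total += sum(rng.random() < p for _ in range(n))
    return total / trials
```

With a = 2 and p = 0.4 the sample mean of Y should be close to pa = 0.8.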
Suppose that (X, Y) has probability density function f defined by f(x, y) = 2(x + y) for 0 < x < y < 1.
1. Find the conditional probability density function of X given Y = y for y ∈ (0, 1).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, 1).
3. Find P(Y ≥ 3/4 ∣ X = 1/2).
Suppose that (X, Y) has probability density function f defined by f(x, y) = 15x^2 y for 0 < x < y < 1.
1. Find the conditional probability density function of X given Y = y for y ∈ (0, 1).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, 1).
3. Find P(X ≤ 1/4 ∣ Y = 1/2).
Suppose that X is uniformly distributed on the interval (0, 1), and that given X = x , Y is uniformly distributed on the interval
(0, x).
Suppose that X has probability density function g defined by g(x) = 3x^2 for x ∈ (0, 1). The conditional probability density function of Y given X = x is h(y ∣ x) = 3y^2 / x^3 for y ∈ (0, x).
1. Find the joint probability density function of (X, Y ).
2. Find the probability density function of Y .
3. Find the conditional probability density function of X given Y =y .
4. Are X and Y independent?
Answer
Recall that for n ∈ N_+, the standard (Lebesgue) measure λ_n on R^n is given by

\lambda_n(A) = \int_A 1 \, dx, \quad A \subseteq \mathbb{R}^n \qquad (3.5.29)

In particular, λ_1(A) is the length of A ⊆ R, λ_2(A) is the area of A ⊆ R^2, and λ_3(A) is the volume of A ⊆ R^3.
Details
Suppose now that X takes values in R^j, Y takes values in R^k, and that (X, Y) is uniformly distributed on a set R ⊆ R^{j+k}. So 0 < λ_{j+k}(R) < ∞, and the joint probability density function f of (X, Y) is given by f(x, y) = 1 / λ_{j+k}(R) for (x, y) ∈ R. Now let S and T be the projections of R onto R^j and R^k respectively, defined as follows:

S = \{x \in \mathbb{R}^j : (x, y) \in R \text{ for some } y \in \mathbb{R}^k\}, \quad T = \{y \in \mathbb{R}^k : (x, y) \in R \text{ for some } x \in \mathbb{R}^j\} \qquad (3.5.30)
Figure 3.5.1 : The projections S and T , and the cross sections at x and y
In the last section on Joint Distributions, we saw that even though (X, Y ) is uniformly distributed, the marginal distributions of X
and Y are not uniform in general. However, as the next theorem shows, the conditional distributions are always uniform.
Proof
Find the conditional density of each variable given a value of the other, and determine if the variables are independent, in each
of the following cases:
1. (X, Y) is uniformly distributed on the square R = (−6, 6)^2.
Answer
In the bivariate uniform experiment, run the simulation 1000 times in each of the following cases. Watch the points in the
scatter plot and the graphs of the marginal distributions.
1. square
2. triangle
3. circle
Suppose that z ≤ c and n − m + c ≤ z ≤ n. Then the conditional distribution of (X, Y) given Z = z is hypergeometric, and has the probability density function defined by

g(x, y \mid z) = \binom{a}{x} \binom{b}{y} \binom{m-a-b-c}{n-x-y-z} \Big/ \binom{m-c}{n-z}, \quad x + y \le n - z \qquad (3.5.34)
Proof
Proof
These results generalize in a completely straightforward way to a population with any number of types. In brief, if a random vector
has a hypergeometric distribution, then the conditional distribution of some of the variables, given values of the other variables, is
also hypergeometric. Moreover, it is clearly not necessary to remember the hideous formulas in the previous two theorems. You
just need to recognize the problem as sampling without replacement from a multi-type population, and then identify the number of
objects of each type and the sample size. The hypergeometric distribution and the multivariate hypergeometric distribution are
studied in more detail in the chapter on Finite Sampling Models.
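The "reduced population" reading of the theorem can be checked numerically: dividing the joint pdf by the marginal pdf of Z reproduces a hypergeometric pdf for the population with the type-3 objects removed. A sketch with small illustrative parameters (the function names and the parameter values are our own):

```python
from math import comb

def joint3(x, y, z, a, b, c, m, n):
    """Joint pdf of the type counts (X, Y, Z) for a sample of size n, drawn
    without replacement from m objects with a, b, c of types 1, 2, 3."""
    rest = n - x - y - z
    if rest < 0:
        return 0.0
    return comb(a, x) * comb(b, y) * comb(c, z) * comb(m - a - b - c, rest) / comb(m, n)

def cond_given_z(x, y, z, a, b, c, m, n):
    """P(X = x, Y = y | Z = z) computed as joint over marginal."""
    pz = comb(c, z) * comb(m - c, n - z) / comb(m, n)
    return joint3(x, y, z, a, b, c, m, n) / pz

# Reduced population: drop the c type-3 objects, sample size shrinks to n - z.
a, b, c, m, n, z = 5, 4, 3, 15, 6, 1
direct = comb(a, 2) * comb(b, 1) * comb(m - a - b - c, 2) / comb(m - c, n - z)
```

The two values agree, matching the formula in the theorem above.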
In a population of 150 voters, 60 are democrats and 50 are republicans and 40 are independents. A sample of 15 voters is
selected at random, without replacement. Let X denote the number of democrats in the sample and Y the number of
republicans in the sample. Give the probability density function of each of the following:
1. (X, Y )
2. X
3. Y given X = 5
Answer
Recall that a bridge hand consists of 13 cards selected at random and without replacement from a standard deck of 52 cards.
Let X, Y , and Z denote the number of spades, hearts, and diamonds, respectively, in the hand. Find the probability density
function of each of the following:
3.5.8 https://stats.libretexts.org/@go/page/10145
1. (X, Y , Z)
2. (X, Y )
3. X
4. (X, Y ) given Z = 3
5. X given Y = 3 and Z = 2
Answer
Multinomial Trials
Recall the discussion of multinomial trials in the last section on joint distributions. As in that discussion, suppose that we have a
sequence of n independent trials, each with 4 possible outcomes. On each trial, outcome 1 occurs with probability p, outcome 2
with probability q, outcome 3 with probability r, and outcome 0 with probability 1 − p − q − r . The parameters p, q, r ∈ (0, 1),
with p + q + r < 1, and n ∈ N_+. Denote the number of times that outcome 1, outcome 2, and outcome 3 occurs in the n trials by X, Y, and Z respectively. Of course, the number of times that outcome 0 occurs is n − X − Y − Z. In the following exercises, x, y, z ∈ N.
For z ≤ n , the conditional distribution of (X, Y ) given Z = z is also multinomial, and has the probability density function.
g(x, y \mid z) = \binom{n-z}{x, y} \left(\frac{p}{1-r}\right)^x \left(\frac{q}{1-r}\right)^y \left(1 - \frac{p}{1-r} - \frac{q}{1-r}\right)^{n-x-y-z}, \quad x + y \le n - z \qquad (3.5.36)
Proof
For y + z ≤ n , the conditional distribution of X given Y =y and Z = z is binomial, with the probability density function
h(x \mid y, z) = \binom{n-y-z}{x} \left(\frac{p}{1-q-r}\right)^x \left(1 - \frac{p}{1-q-r}\right)^{n-x-y-z}, \quad x \le n - y - z \qquad (3.5.37)
Proof
These results generalize in a completely straightforward way to multinomial trials with any number of trial outcomes. In brief, if a
random vector has a multinomial distribution, then the conditional distribution of some of the variables, given values of the other
variables, is also multinomial. Moreover, it is clearly not necessary to remember the specific formulas in the previous two
exercises. You just need to recognize a problem as one involving independent trials, and then identify the probability of each
outcome and the number of trials. The binomial distribution and the multinomial distribution are studied in more detail in the
chapter on Bernoulli Trials.
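The same kind of numerical check works here: conditioning via joint over marginal reproduces a multinomial pdf with n − z trials and the renormalized probabilities p/(1 − r) and q/(1 − r). A sketch with illustrative parameters (the names and values are our own):

```python
from math import factorial

def mult_pdf(counts, probs, n):
    """Multinomial pdf; an implicit leftover outcome absorbs the rest."""
    rest = n - sum(counts)
    if rest < 0:
        return 0.0
    coef = factorial(n)
    for k in list(counts) + [rest]:
        coef //= factorial(k)
    out = coef * (1.0 - sum(probs)) ** rest
    for p, k in zip(probs, counts):
        out *= p ** k
    return out

# P(X = x, Y = y | Z = z) computed two ways
p, q, r, n = 0.2, 0.3, 0.1, 8
x, y, z = 2, 3, 1
via_joint = mult_pdf((x, y, z), (p, q, r), n) / mult_pdf((z,), (r,), n)
via_formula = mult_pdf((x, y), (p / (1 - r), q / (1 - r)), n - z)
```

The two values agree, which is the content of the first theorem above.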
Suppose that peaches from an orchard are classified as small, medium, or large. Each peach, independently of the others, is small with probability 3/10, medium with probability 1/2, and large with probability 1/5. In a sample of 20 peaches from the
orchard, let X denote the number of small peaches and Y the number of medium peaches. Give the probability density
function of each of the following:
1. (X, Y )
2. X
3. Y given X = 5
Answer
For a certain crooked, 4-sided die, face 1 has probability 2/5, face 2 has probability 3/10, face 3 has probability 1/5, and face 4 has probability 1/10. Suppose that the die is thrown 50 times. Let X, Y, and Z denote the number of times that scores 1, 2, and 3
occur, respectively. Find the probability density function of each of the following:
1. (X, Y , Z)
2. (X, Y )
3. X
4. (X, Y ) given Z = 5
5. X given Y = 10 and Z = 5
Answer
Bivariate Normal Distributions
The joint distributions in the next two exercises are examples of bivariate normal distributions. The conditional distributions are
also normal, an important property of the bivariate normal distribution. In general, normal distributions are widely used to model
physical measurements subject to small, random errors. The bivariate normal distribution is studied in more detail in the chapter on
Special Distributions.
Suppose that (X, Y ) has the bivariate normal distribution with probability density function f defined by
f(x, y) = \frac{1}{12\pi} \exp\left[-\left(\frac{x^2}{8} + \frac{y^2}{18}\right)\right], \quad (x, y) \in \mathbb{R}^2 \qquad (3.5.38)
Suppose that (X, Y ) has the bivariate normal distribution with probability density function f defined by
f(x, y) = \frac{1}{\sqrt{3}\,\pi} \exp\left[-\frac{2}{3}\left(x^2 - xy + y^2\right)\right], \quad (x, y) \in \mathbb{R}^2 \qquad (3.5.39)
Mixtures of Distributions
With our usual sets S and T, as above, suppose that P_x is a probability measure on T for each x ∈ S. Suppose also that g is a probability density function on S. We can obtain a new probability measure on T by averaging (or mixing) the given distributions according to g.
First suppose that g is the probability density function of a discrete distribution on the countable set S. Then the function P defined below is a probability measure on T:

P(B) = \sum_{x \in S} g(x) P_x(B), \quad B \subseteq T
Proof
In the setting of the previous theorem, suppose that P_x has probability density function h_x for each x ∈ S. Then P has probability density function h given by

h(y) = \sum_{x \in S} g(x) h_x(y), \quad y \in T
Proof
Conversely, given a probability density function g on S and a probability density function h_x on T for each x ∈ S, the function h defined above is a probability density function on T.
Proof
In the setting of the previous theorem, suppose that P_x is a discrete (respectively continuous) distribution with probability density function h_x for each x ∈ S. Then P is also discrete (respectively continuous) with probability density function h given by
Proof
In both cases, the distribution P is said to be a mixture of the set of distributions {P_x : x ∈ S}, with mixing density g.
One can have a mixture of distributions, without having random variables defined on a common probability space. However, mixtures are intimately related to conditional distributions. Returning to our usual setup, suppose that X and Y are random variables for an experiment, taking values in S and T respectively, and that X has probability density function g. The following result is simply a restatement of the law of total probability.
The distribution of Y is a mixture of the conditional distributions of Y given X = x , over x ∈ S , with mixing density g .
Proof
Finally we note that a mixed distribution (with discrete and continuous parts) really is a mixture, in the sense of this discussion.
Suppose that P is a mixed distribution on a set T . Then P is a mixture of a discrete distribution and a continuous distribution.
Proof
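Operationally, a mixture is sampled in two stages: first draw the component x from the mixing density g, then draw from P_x. A sketch of this two-stage scheme (the dict-of-weights interface and the example distributions are our own illustration):

```python
import random

def sample_mixture(rng, weights, samplers):
    """Two-stage sample: choose a component x with pdf g given by weights,
    then draw from the corresponding distribution P_x."""
    u, cum = rng.random(), 0.0
    for x, gx in weights.items():
        cum += gx
        if u < cum:
            return samplers[x](rng)
    return samplers[x](rng)     # guard against round-off in the weights

# Example: mix Uniform(0, 1) and Uniform(0, 2) with equal weights;
# the mean of the mixture is (1/2)(1/2) + (1/2)(1) = 3/4.
rng = random.Random(3)
draws = [sample_mixture(rng, {1: 0.5, 2: 0.5},
                        {1: lambda r: r.random(), 2: lambda r: 2 * r.random()})
         for _ in range(100_000)]
```

The empirical mean of the draws should be close to 3/4, in line with the law of total probability applied to expected values.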
This page titled 3.5: Conditional Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
3.6: Distribution and Quantile Functions
As usual, our starting point is a random experiment modeled by a probability space (Ω, F, P). So to review, Ω is the set of outcomes, F is the collection of events, and P is the probability measure on the sample space (Ω, F). In this section, we will
study two types of functions that can be used to specify the distribution of a real-valued random variable.
Distribution Functions
Definition
Suppose that X is a random variable with values in R . The (cumulative) distribution function of X is the function
F : R → [0, 1] defined by
F (x) = P(X ≤ x), x ∈ R (3.6.1)
The distribution function is important because it makes sense for any type of random variable, regardless of whether the
distribution is discrete, continuous, or even mixed, and because it completely determines the distribution of X. In the picture below,
the light shading is intended to represent a continuous distribution of probability, while the darker dots represents points of positive
probability; F (x) is the total probability mass to the left of (and including) x.
Figure 3.6.1 : F (x) is the total probability to the left of (and including) x
Basic Properties
A few basic properties completely characterize distribution functions. Notationally, it will be helpful to abbreviate the limits of F from the left and at ±∞: F(x⁻) = lim_{t ↑ x} F(t), F(−∞) = lim_{x → −∞} F(x), and F(∞) = lim_{x → ∞} F(x).
4. F(−∞) = 0.
5. F(∞) = 1.
Proof
3.6.1 https://stats.libretexts.org/@go/page/10146
the distribution of X.
Suppose again that F is the distribution function of a real-valued random variable X. If a, b ∈ R with a < b then
1. P(X = a) = F(a) − F(a⁻)
4. P(a ≤ X ≤ b) = F(b) − F(a⁻)
5. P(a ≤ X < b) = F(b⁻) − F(a⁻)
Proof
Conversely, if a function F : R → [0, 1] satisfies the basic properties, then the formulas above define a probability distribution on
R , with F as the distribution function. For more on this point, read the section on Existence and Uniqueness.
Thus, the two meanings of continuous come together: continuous distribution and continuous function in the calculus sense. Next
recall that the distribution of a real-valued random variable X is symmetric about a point a ∈ R if the distribution of X − a is the
same as the distribution of a − X .
Suppose that X has a continuous distribution on R that is symmetric about a point a . Then the distribution function F satisfies
F (a − t) = 1 − F (a + t) for t ∈ R .
Proof
As in the definition (1), it's customary to define the distribution function F on all of R, even if the random variable takes values in a subset.
Suppose that X has a discrete distribution on a countable subset S ⊆ R. Let f denote the probability density function and F the distribution function.
1. F(x) = ∑_{t∈S, t≤x} f(t) for x ∈ R
2. f(x) = F(x) − F(x⁻) for x ∈ S
Proof
Thus, F is a step function with jumps at the points in S ; the size of the jump at x is f (x).
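The step-function behavior can be checked numerically. In this sketch the three-point distribution is an assumed example; the jump of F at each point x of the support recovers f(x):

```python
from fractions import Fraction

# Illustrative discrete distribution on S = {1, 2, 3} (values assumed
# for the sketch).
f = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}

def F(x):
    return sum(p for t, p in f.items() if t <= x)

# F is a step function: the size of the jump at x is f(x).
for x in f:
    jump = F(x) - F(x - Fraction(1, 10**9))   # F(x) - F(x-)
    assert jump == f[x]

print(F(2))  # 1/2 + 1/3 = 5/6
```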
Suppose that X has a continuous distribution on R with probability density function f and distribution function F .
1. F(x) = ∫_{−∞}^{x} f(t) dt for x ∈ R.
2. f(x) = F′(x) if f is continuous at x.
Proof
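Both parts of the result can be verified numerically for a particular density. In this sketch the exponential density f(t) = e^(−t) is an assumed example; part 1 is checked with a midpoint-rule integral and part 2 with a central difference:

```python
import math

# Sketch with the (assumed) density f(t) = e^{-t} on [0, inf),
# whose distribution function is F(x) = 1 - e^{-x}.
f = lambda t: math.exp(-t)
F = lambda x: 1 - math.exp(-x)

# Part 1: F(x) as the integral of f from 0 to x (midpoint rule).
x, n = 2.0, 100000
h = x / n
integral = sum(f((i + 0.5) * h) for i in range(n)) * h
print(abs(integral - F(x)) < 1e-8)   # True

# Part 2: f(x) as the derivative of F at a point of continuity.
eps = 1e-6
deriv = (F(1.0 + eps) - F(1.0 - eps)) / (2 * eps)
print(abs(deriv - f(1.0)) < 1e-6)    # True
```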
Suppose that X has a mixed distribution, with discrete part on a countable subset D ⊆ R , and continuous part on R ∖ D . Let g
denote the partial probability density function of the discrete part and assume that the continuous part has partial probability
density function h . Let F denote the distribution function.
1. F(x) = ∑_{t∈D, t≤x} g(t) + ∫_{−∞}^{x} h(t) dt for x ∈ R
Go back to the graph of a general distribution function. At a point of positive probability, the probability is the size of the jump. At
a smooth point of the graph, the continuous probability density is the slope.
Recall that the existence of a probability density function is not guaranteed for a continuous distribution, but of course the
distribution function always makes perfect sense. The advanced section on absolute continuity and density functions has an
example of a continuous distribution on the interval (0, 1) that has no probability density function. The distribution function is
continuous and strictly increases from 0 to 1 on the interval, but has derivative 0 at almost every point!
Naturally, the distribution function can be defined relative to any of the conditional distributions we have discussed. No new
concepts are involved, and all of the results above hold.
Reliability
Suppose again that X is a real-valued random variable with distribution function F . The function in the following definition clearly
gives the same information as F .
The function Fᶜ defined by

Fᶜ(x) = 1 − F(x) = P(X > x), x ∈ R (3.6.4)

is the right-tail distribution function of X. Give the mathematical properties of Fᶜ analogous to the properties of F in (2).
Answer
So F might be called the left-tail distribution function. But why have two distribution functions that give essentially the same
information? The right-tail distribution function, and related functions, arise naturally in the context of reliability theory. For the
remainder of this subsection, suppose that T is a random variable with values in [0, ∞) and that T has a continuous distribution
with probability density function f. Here are the important definitions:
1. The function Fᶜ restricted to [0, ∞) is the reliability function of T.
2. The function h defined by h(t) = f(t)/Fᶜ(t) for t ≥ 0 is the failure rate function of T.
To interpret the reliability function, note that Fᶜ(t) = P(T > t) is the probability that the device lasts at least t time units. To interpret the failure rate function, note that if dt is “small” then

P(t < T < t + dt ∣ T > t) = P(t < T < t + dt)/P(T > t) ≈ f(t) dt/Fᶜ(t) = h(t) dt (3.6.5)
So h(t) dt is the approximate probability that the device will fail in the interval (t, t + dt) , given survival up to time t . Moreover,
like the distribution function and the reliability function, the failure rate function also completely determines the distribution of T .
The reliability function can be expressed in terms of the failure rate function by

Fᶜ(t) = exp(−∫₀ᵗ h(s) ds), t ≥ 0 (3.6.6)

Proof
The failure rate function h satisfies h(t) ≥ 0 for t ≥ 0 and ∫₀^∞ h(t) dt = ∞.
Proof
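The exponential formula for the reliability function can be checked numerically. In this sketch the failure rate h(t) = 2t is an assumed example, for which ∫₀ᵗ h(s) ds = t² and so Fᶜ(t) should equal exp(−t²):

```python
import math

# Sketch: failure rate h(t) = 2t (an assumption for illustration),
# which gives reliability F^c(t) = exp(-t^2).
h = lambda t: 2 * t

def reliability(t, n=100000):
    # F^c(t) = exp(-integral_0^t h(s) ds), via the midpoint rule
    dt = t / n
    integral = sum(h((i + 0.5) * dt) for i in range(n)) * dt
    return math.exp(-integral)

print(abs(reliability(1.5) - math.exp(-1.5**2)) < 1e-9)  # True
```

Since the midpoint rule is exact for linear integrands, the agreement here is essentially to machine precision.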
Conversely, a function that satisfies these properties is the failure rate function for a continuous distribution on [0, ∞):
Suppose that h : [0, ∞) → [0, ∞) is piecewise continuous and ∫₀^∞ h(t) dt = ∞. Then the function G defined by

G(t) = exp(−∫₀ᵗ h(s) ds), t ≥ 0 (3.6.8)

is the reliability function for a continuous distribution on [0, ∞).
Figure 3.6.5 : F (x, y) is the total probability below and to the left of (x, y).
In the graph above, the light shading is intended to suggest a continuous distribution of probability, while the darker dots represent
points of positive probability. Thus, F (x, y) is the total probability mass below and to the left (that is, southwest) of the point
(x, y). As in the single variable case, the distribution function of (X, Y ) completely determines the distribution of (X, Y ).
Proof
A probability distribution on R² is completely determined by its values on rectangles of the form (a, b] × (c, d], so just as in the single variable case, it follows that the distribution function of (X, Y) completely determines the distribution of (X, Y). See the advanced section on existence and uniqueness of positive measures in the chapter on Probability Measures for more details.
In the setting of the previous result, give the appropriate formula on the right for all possible combinations of weak and strong
inequalities on the left.
The joint distribution function determines the individual (marginal) distribution functions.
Let F denote the distribution function of (X, Y ), and let G and H denote the distribution functions of X and Y , respectively.
Then
1. G(x) = F (x, ∞) for x ∈ R
2. H (y) = F (∞, y) for y ∈ R
Proof
On the other hand, we cannot recover the distribution function of (X, Y ) from the individual distribution functions, except when
the variables are independent.
Proof
All of the results of this subsection generalize in a straightforward way to n -dimensional random vectors. Only the notation is more
complicated.
with the same distribution as X. In statistical terms, this sequence is a random sample of size n from the distribution of X. In statistical inference, the observed values (x₁, x₂, …, xₙ) of the random sample form our data. The empirical distribution function Fₙ, based on the data, is defined by

Fₙ(x) = (1/n) #{i ∈ {1, 2, …, n} : xᵢ ≤ x} = (1/n) ∑_{i=1}^{n} 1(xᵢ ≤ x), x ∈ R (3.6.16)
Thus, Fₙ(x) gives the proportion of values in the data set that are less than or equal to x. The function Fₙ is a statistical estimator
of F , based on the given data set. This concept is explored in more detail in the section on the sample mean in the chapter on
Random Samples. In addition, the empirical distribution function is related to the Brownian bridge stochastic process which is
studied in the chapter on Brownian motion.
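The empirical distribution function is straightforward to compute. In this sketch the data set is made up for illustration:

```python
# Sketch of the empirical distribution function F_n for a small
# (made-up) data set.
data = [2, 5, 3, 3, 7, 1, 5, 5]

def F_n(x):
    """Proportion of data values that are <= x."""
    return sum(1 for xi in data if xi <= x) / len(data)

print(F_n(3))   # 4 of 8 values are <= 3, so 0.5
print(F_n(0))   # 0.0
print(F_n(7))   # 1.0
```

Like any distribution function, F_n is an increasing step function from 0 to 1, with a jump of size 1/n at each observed value (larger if values repeat).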
Quantile Functions
Definitions
Suppose again that X is a real-valued random variable with distribution function F .
Roughly speaking, a quantile of order p is a value where the graph of the distribution function crosses (or jumps over) p. For
example, in the picture below, a is the unique quantile of order p and b is the unique quantile of order q. On the other hand, the
quantiles of order r form the interval [c, d], and moreover, d is a quantile for all orders in the interval [r, s]. Note also that if X has
a continuous distribution (so that F is continuous) and x is a quantile of order p ∈ (0, 1), then F (x) = p .
Note that there is an inverse relation of sorts between the quantiles and the cumulative distribution values, but the relation is more
complicated than that of a function and its ordinary inverse function, because the distribution function is not one-to-one in general.
For many purposes, it is helpful to select a specific quantile for each order; to do this requires defining a generalized inverse of the
distribution function F .
The quantile function F⁻¹ is defined by F⁻¹(p) = min{x ∈ R : F(x) ≥ p} for p ∈ (0, 1). F⁻¹ is well defined.
Note that if F strictly increases from 0 to 1 on an interval S (so that the underlying distribution is continuous and is supported on S), then F⁻¹ is the ordinary inverse of F. We do not usually define the quantile function at the endpoints 0 and 1. If we did, note that F⁻¹(0) would always be −∞.
Properties
The following exercise justifies the name: F⁻¹(p) is the minimum of the quantiles of order p.
Other basic properties of the quantile function are given in the following theorem.
F⁻¹ satisfies the following properties:
1. F⁻¹ is increasing on (0, 1).
2. F⁻¹[F(x)] ≤ x for any x ∈ R with F(x) < 1.
3. F[F⁻¹(p)] ≥ p for any p ∈ (0, 1).
4. F⁻¹(p⁻) = F⁻¹(p) for p ∈ (0, 1). Thus F⁻¹ is continuous from the left.
5. F⁻¹(p⁺) = inf{x ∈ R : F(x) > p} for p ∈ (0, 1). Thus F⁻¹ has limits from the right.
Proof
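The generalized inverse can be computed directly for a discrete distribution. This sketch assumes the fair-die distribution and the definition F⁻¹(p) = min{x : F(x) ≥ p}:

```python
from fractions import Fraction

# Sketch: F^{-1}(p) = min{x in S : F(x) >= p} for a fair six-sided die
# (the die is an assumed example).
values = list(range(1, 7))

def F(x):
    return Fraction(sum(1 for v in values if v <= x), len(values))

def quantile(p):
    """Smallest quantile of order p."""
    return min(x for x in values if F(x) >= p)

print(quantile(Fraction(1, 2)))     # 3, since F(3) = 1/2
print(quantile(Fraction(51, 100)))  # 4, since the graph of F jumps over 0.51 at x = 4
```

Note that F[quantile(p)] ≥ p always holds, illustrating property 3 above.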
As always, the inverse of a function is obtained essentially by reversing the roles of independent and dependent variables. In the graphs below, note that jumps of F become flat portions of F⁻¹ while flat portions of F become jumps of F⁻¹. For p ∈ (0, 1), the set of quantiles of order p is the closed, bounded interval [F⁻¹(p), F⁻¹(p⁺)]. Thus, F⁻¹(p) is the smallest quantile of order p.
Figure 3.6.7: Graph of the distribution function
The following basic property will be useful in simulating random variables, a topic explored in the section on transformations of
random variables.
Special Quantiles
Certain quantiles are important enough to deserve special names.
1. A quantile of order 1/4 is a first quartile of the distribution.
2. A quantile of order 1/2 is a median or second quartile of the distribution.
3. A quantile of order 3/4 is a third quartile of the distribution.
When there is only one median, it is frequently used as a measure of the center of the distribution, since it divides the set of values
of X in half, by probability. More generally, the quartiles can be used to divide the set of values into fourths, by probability.
Assuming uniqueness, let q₁, q₂, and q₃ denote the first, second, and third quartiles of X, respectively, and let a = F⁻¹(0⁺) and b = F⁻¹(1).
1. The interquartile range is defined to be q₃ − q₁.
2. The five parameters (a, q₁, q₂, q₃, b) are referred to as the five number summary of the distribution.
Note that the interval [q₁, q₃] roughly gives the middle half of the distribution, so the interquartile range, the length of the interval, is a natural measure of the dispersion of the distribution about the median. Note also that a and b are essentially the minimum and maximum values of X, respectively, although of course, it's possible that a = −∞ or b = ∞ (or both). Collectively, the five parameters give a great deal of information about the distribution in terms of the center, spread, and skewness. Graphically, the five
numbers are often displayed as a boxplot or box and whisker plot, which consists of a line extending from the minimum value a to the maximum value b, with a rectangular box from q₁ to q₃, and “whiskers” at a, the median q₂, and b. Roughly speaking, the five numbers separate the set of values of X into 4 intervals of approximate probability 1/4 each.
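A quick computation of the five-number summary is possible whenever the quantile function has a closed form. This sketch assumes the exponential distribution with rate 1, whose quantile function is F⁻¹(p) = −ln(1 − p):

```python
import math

# Sketch: five-number summary for the exponential distribution with
# rate 1 (an assumed example); quantile function F^{-1}(p) = -ln(1 - p).
quantile = lambda p: -math.log(1 - p)

a = 0.0                       # F^{-1}(0+), the minimum value
q1, q2, q3 = (quantile(p) for p in (0.25, 0.5, 0.75))
b = math.inf                  # the maximum value is infinite here

print(round(q2, 4))           # median = ln 2, about 0.6931
print(round(q3 - q1, 4))      # interquartile range = ln 3, about 1.0986
```

Note that q₂ and q₃ − q₁ reduce to ln 2 and ln 3 exactly, a small illustration of how the quantile function packages the center and spread of the distribution.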
Figure 3.6.9 : The probability density function and boxplot for a continuous distribution
Suppose that X has a continuous distribution that is symmetric about a point a ∈ R. If a + t is a quantile of order p ∈ (0, 1) then a − t is a quantile of order 1 − p.

Let F be the function defined by

F(x) = 0 for x < 1; 1/10 for 1 ≤ x < 3/2; 3/10 for 3/2 ≤ x < 2; 6/10 for 2 ≤ x < 5/2; 9/10 for 5/2 ≤ x < 3; 1 for x ≥ 3. (3.6.18)
1. Sketch the graph of F and show that F is the distribution function for a discrete distribution.
2. Find the corresponding probability density function f and sketch the graph.
3. Find P(2 ≤ X < 3) where X has this distribution.
4. Find the quantile function and sketch the graph.
5. Find the five number summary and sketch the boxplot.
Answer
Let F be the function defined by F(x) = 0 for x < 0 and F(x) = x/(x + 1) for x ≥ 0. (3.6.19)
1. Sketch the graph of F and show that F is the distribution function for a continuous distribution.
2. Find the corresponding probability density function f and sketch the graph.
3. Find P(2 ≤ X < 3) where X has this distribution.
4. Find the quantile function and sketch the graph.
5. Find the five number summary and sketch the boxplot.
Answer
The expression p/(1 − p) that occurs in the quantile function in the last exercise is known as the odds ratio associated with p, particularly in the context of gambling.
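The inverse relationship between F(x) = x/(x + 1) and the odds ratio can be verified directly; this is a minimal numerical sketch:

```python
# Sketch: for F(x) = x / (x + 1) on [0, inf), solving F(x) = p
# gives the quantile function F^{-1}(p) = p / (1 - p), the odds ratio.
F = lambda x: x / (x + 1)
quantile = lambda p: p / (1 - p)

for p in (0.1, 0.25, 0.5, 0.9):
    assert abs(F(quantile(p)) - p) < 1e-12

print(quantile(0.5))   # the median is 1.0
```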
Let F be the function defined by

F(x) = 0 for x < 0; x/4 for 0 ≤ x < 1; 1/3 + (1/4)(x − 1)² for 1 ≤ x < 2; 2/3 + (1/4)(x − 2)³ for 2 ≤ x < 3; 1 for x ≥ 3. (3.6.20)
1. Sketch the graph of F and show that F is the distribution function of a mixed distribution.
2. Find the partial probability density function of the discrete part and sketch the graph.
3. Find the partial probability density function of the continuous part and sketch the graph.
4. Find P(2 ≤ X < 3) where X has this distribution.
5. Find the quantile function and sketch the graph.
6. Find the five number summary and sketch the boxplot.
Answer
Suppose that X has probability density function f(x) = 1/(b − a) for x ∈ [a, b], where a, b ∈ R and a < b.
1. Find the distribution function and sketch the graph.
2. Find the quantile function and sketch the graph.
3. Compute the five-number summary.
4. Sketch the graph of the probability density function with the boxplot on the horizontal axis.
Answer
The distribution in the last exercise is the uniform distribution on the interval [a, b]. The left endpoint a is the location parameter
and the length of the interval w = b − a is the scale parameter. The uniform distribution models a point chosen “at random” from
the interval, and is studied in more detail in the chapter on Special Distributions.
In the special distribution calculator, select the continuous uniform distribution. Vary the location and scale parameters and
note the shape of the probability density function and the distribution function.
The distribution in the last exercise is the exponential distribution with rate parameter r. Note that this distribution is characterized
by the fact that it has constant failure rate (and this is the reason for referring to r as the rate parameter). The reciprocal of the rate
parameter is the scale parameter. The exponential distribution is used to model failure times and other random times under certain
conditions, and is studied in detail in the chapter on The Poisson Process.
In the special distribution calculator, select the exponential distribution. Vary the scale parameter b and note the shape of the
probability density function and the distribution function.
Suppose that X has probability density function f(x) = a/x^(a+1) for 1 ≤ x < ∞, where a > 0 is a parameter.
3. Find the failure rate function.
4. Find the quantile function.
5. Compute the five-number summary.
6. In the case a = 2 , sketch the graph of the probability density function with the boxplot on the horizontal axis.
Answer
The distribution in the last exercise is the Pareto distribution with shape parameter a , named after Vilfredo Pareto. The Pareto
distribution is a heavy-tailed distribution that is sometimes used to model income and certain other economic variables. It is studied
in detail in the chapter on Special Distributions.
In the special distribution calculator, select the Pareto distribution. Keep the default value for the scale parameter, but vary the
shape parameter and note the shape of the density function and the distribution function.
Suppose that X has probability density function f(x) = 1/(π(1 + x²)) for x ∈ R.
The distribution in the last exercise is the Cauchy distribution, named after Augustin Cauchy. The Cauchy distribution is studied in
more generality in the chapter on Special Distributions.
In the special distribution calculator, select the Cauchy distribution and keep the default parameter values. Note the shape of
the density function and the distribution function.
The distribution in the previous exercise is the Weibull distribution with shape parameter k, named after Waloddi Weibull. The
Weibull distribution is studied in detail in the chapter on Special Distributions. Since this family includes increasing, decreasing,
and constant failure rates, it is widely used to model the lifetimes of various types of devices.
In the special distribution calculator, select the Weibull distribution. Keep the default scale parameter, but vary the shape
parameter and note the shape of the density function and the distribution function.
Beta Distributions
Suppose that X has probability density function f(x) = 12x²(1 − x) for 0 ≤ x ≤ 1.
1. Find the distribution function of X and sketch the graph.
2. Find P(1/4 ≤ X ≤ 1/2).
3. Compute the five number summary and the interquartile range. You will have to approximate the quantiles.
4. Sketch the graph of the density function with the boxplot on the horizontal axis.
Answer
The distributions in the last two exercises are examples of beta distributions. The particular beta distribution in the last exercise is
also known as the arcsine distribution; the distribution function explains the name. Beta distributions are used to model random
proportions and probabilities, and certain other types of random variables, and are studied in detail in the chapter on Special
Distributions.
In the special distribution calculator, select the beta distribution. For each of the following parameter values, note the location
and shape of the density function and the distribution function.
1. a = 3, b = 2. This gives the first beta distribution above.
2. a = b = 1/2. This gives the arcsine distribution above.
Logistic Distribution
Let F(x) = eˣ/(1 + eˣ) for x ∈ R.
The distribution in the last exercise is the logistic distribution and the quantile function is known as the logit function. The logistic distribution is studied in detail in the chapter on Special Distributions.
In the special distribution calculator, select the logistic distribution and keep the default parameter values. Note the shape of the
probability density function and the distribution function.
Let F(x) = e^(−e^(−x)) for x ∈ R.
1. Show that F is a distribution function for a continuous distribution, and sketch the graph.
2. Compute P(−1 ≤ X ≤ 1) where X is a random variable with distribution function F .
3. Find the quantile function and sketch the graph.
4. Compute the five-number summary.
5. Find the probability density function and sketch the graph with the boxplot on the horizontal axis.
Answer
The distribution in the last exercise is the type 1 extreme value distribution, also known as the Gumbel distribution in honor of Emil
Gumbel. Extreme value distributions are studied in detail in the chapter on Special Distributions.
In the special distribution calculator, select the extreme value distribution and keep the default parameter values. Note the
shape and location of the probability density function and the distribution function.
The Standard Normal Distribution
Recall that the standard normal distribution has probability density function ϕ given by

ϕ(z) = (1/√(2π)) e^(−z²/2), z ∈ R (3.6.21)
This distribution models physical measurements of all sorts subject to small, random errors, and is one of the most important
distributions in probability. The normal distribution is studied in more detail in the chapter on Special Distributions. The
distribution function Φ, of course, can be expressed as

Φ(z) = ∫_{−∞}^{z} ϕ(t) dt, z ∈ R

but Φ and the quantile function Φ⁻¹ cannot be expressed, in closed form, in terms of elementary functions. Because of the importance of the normal distribution, Φ and Φ⁻¹ are themselves considered special functions, like sin, ln, and many others.
Approximate values of these functions can be computed using most mathematical and statistical software packages. Because the
distribution is symmetric about 0, Φ(−z) = 1 − Φ(z) for z ∈ R, and equivalently, Φ⁻¹(1 − p) = −Φ⁻¹(p). In particular, the median is 0.
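These symmetry identities can be checked with the standard normal distribution available in Python's standard library (statistics.NormalDist, which provides cdf and inv_cdf):

```python
from statistics import NormalDist

# Sketch using Python's standard-library NormalDist for Phi and Phi^{-1}.
Phi = NormalDist().cdf
Phi_inv = NormalDist().inv_cdf

# Symmetry about 0: Phi(-z) = 1 - Phi(z), equivalently Phi^{-1}(1-p) = -Phi^{-1}(p).
z, p = 1.3, 0.9
print(abs(Phi(-z) - (1 - Phi(z))) < 1e-12)       # True
print(abs(Phi_inv(1 - p) + Phi_inv(p)) < 1e-9)   # True
print(Phi_inv(0.5))                               # the median is 0.0
```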
Open the special distribution calculator and choose the normal distribution. Keep the default parameter values and select CDF
view. Note the shape and location of the distribution/quantile function. Compute each of the following:
1. The first and third quartiles
2. The quantiles of order 0.9 and 0.1
3. The quantiles of order 0.95 and 0.05
Miscellaneous Exercises
Suppose that X has probability density function f (x) = − ln x for 0 < x ≤ 1 .
1. Sketch the graph of f .
2. Find the distribution function F and sketch the graph.
3. Find P(1/3 ≤ X ≤ 1/2).
Answer
Suppose that a pair of fair dice are rolled and the sequence of scores (X 1, X2 ) is recorded.
1. Find the distribution function of Y = X₁ + X₂, the sum of the scores.
Statistical Exercises
For the M&M data, compute the empirical distribution function of the total number of candies.
Answer
For the cicada data, let BL denote body length and let G denote gender. Compute the empirical distribution function of the
following variables:
1. BL
2. BL given G = 1 (male)
3. BL given G = 0 (female).
4. Do you believe that BL and G are independent?
For statistical versions of some of the topics in this section, see the chapter on Random Samples, and in particular, the sections on
empirical distributions and order statistics.
This page titled 3.6: Distribution and Quantile Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
3.7: Transformations of Random Variables
This section studies how the distribution of a random variable changes when the variable is transformed in a deterministic way. If
you are a new student of probability, you should skip the technical details.
Basic Theory
The Problem
As usual, we start with a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F
is the collection of events, and P is the probability measure on the sample space (Ω, F ). Suppose now that we have a random
variable X for the experiment, taking values in a set S , and a function r from S into another set T . Then Y = r(X) is a new
random variable taking values in T . If the distribution of X is known, how do we find the distribution of Y ? This is a very basic
and important question, and in a superficial sense, the solution is easy. But first recall that for B ⊆ T, r⁻¹(B) = {x ∈ S : r(x) ∈ B} is the inverse image of B under r.

P(Y ∈ B) = P[X ∈ r⁻¹(B)] for B ⊆ T.
Proof
Suppose that X has a discrete distribution on a countable set S, with probability density function f. Then Y has a discrete distribution with probability density function g given by g(y) = ∑_{x∈r⁻¹{y}} f(x) for y ∈ T.
Proof
Suppose that X has a continuous distribution on a subset S ⊆ Rⁿ with probability density function f, and that T is countable. Then Y has a discrete distribution with probability density function g given by g(y) = ∫_{r⁻¹{y}} f(x) dx for y ∈ T.
Proof
Figure 3.7.3 : A continuous distribution on S transformed by a discrete function r : S → T
So the main problem is often computing the inverse images r⁻¹{y} for y ∈ T. The formulas above in the discrete and continuous cases are not worth memorizing explicitly; it's usually better to just work each problem from scratch. The main step is to write the event {Y = y} in terms of X, and then find the probability of this event using the probability density function of X.
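The discrete case can be carried out mechanically by accumulating the probability of each x into the value r(x). In this sketch X is a fair die and r(x) = |x − 3| is an assumed example:

```python
from fractions import Fraction
from collections import defaultdict

# Sketch: pdf of Y = r(X) in the discrete case, by summing the pdf of X
# over the inverse image r^{-1}{y}. Here X is a fair die (assumed) and
# r(x) = |x - 3|.
f = {x: Fraction(1, 6) for x in range(1, 7)}
r = lambda x: abs(x - 3)

g = defaultdict(Fraction)
for x, p in f.items():
    g[r(x)] += p          # accumulate P(X = x) into P(Y = r(x))

print(g[1])  # 1/3, since two points (x = 2 and x = 4) map to y = 1
```

The loop is just the formula g(y) = ∑_{x∈r⁻¹{y}} f(x), computed one term at a time.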
Suppose now that Y = r(X) takes values in T ⊆ Rᵐ, and that X has a known probability density function f. In many cases, the probability density function of Y can be found by first finding the distribution function of Y (using basic rules of probability) and then computing the appropriate derivatives of the distribution function. This general method is referred to, appropriately enough, as the distribution function method.
Proof
As in the discrete case, the formula in (4) is not much help, and it's usually better to work each problem from scratch. The main step
is to write the event {Y ≤ y} in terms of X, and then find the probability of this event using the probability density function of X.
We will explore the one-dimensional case first, where the concepts and formulas are simplest. Thus, suppose that random variable
X has a continuous distribution on an interval S ⊆ R , with distribution function F and probability density function f . Suppose
that Y = r(X) where r is a differentiable function from S onto an interval T . As usual, we will let G denote the distribution
function of Y and g the probability density function of Y .
Suppose that r is strictly increasing on S. Then
1. G(y) = F[r⁻¹(y)] for y ∈ T
2. g(y) = f[r⁻¹(y)] (d/dy) r⁻¹(y) for y ∈ T
Proof
Suppose that r is strictly decreasing on S. Then
1. G(y) = 1 − F[r⁻¹(y)] for y ∈ T
2. g(y) = −f[r⁻¹(y)] (d/dy) r⁻¹(y) for y ∈ T
Proof
The formulas for the probability density functions in the increasing case and the decreasing case can be combined:
If r is strictly increasing or strictly decreasing on S then the probability density function g of Y is given by
g(y) = f[r⁻¹(y)] |(d/dy) r⁻¹(y)| (3.7.5)
Letting x = r⁻¹(y), the change of variables formula can be written more compactly as
g(y) = f(x) |dx/dy| (3.7.6)
Although succinct and easy to remember, the formula is a bit less clear. It must be understood that x on the right should be written
in terms of y via the inverse function. The images below give a graphical interpretation of the formula in the two cases where r is
increasing and where r is decreasing.
Figure 3.7.4 : The change of variables theorems in the increasing and decreasing cases
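The change of variables formula can be cross-checked against the distribution function method. In this sketch X has the exponential density f(x) = e^(−x) (an assumed example) and r(x) = x², which is strictly increasing on [0, ∞):

```python
import math

# Sketch of the change of variables formula with (assumed) X ~ exponential(1),
# f(x) = e^{-x} on [0, inf), and the strictly increasing map r(x) = x^2.
f = lambda x: math.exp(-x)
r_inv = lambda y: math.sqrt(y)
dr_inv = lambda y: 1 / (2 * math.sqrt(y))     # d/dy r^{-1}(y)

g = lambda y: f(r_inv(y)) * abs(dr_inv(y))    # density of Y = X^2

# Cross-check against the distribution function method:
# G(y) = P(X^2 <= y) = 1 - e^{-sqrt(y)}, differentiated numerically.
G = lambda y: 1 - math.exp(-math.sqrt(y))
y, eps = 2.0, 1e-6
print(abs((G(y + eps) - G(y - eps)) / (2 * eps) - g(y)) < 1e-6)  # True
```

The two routes agree, as they must: the change of variables theorem is exactly the chain rule applied to the distribution function method.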
The generalization of this result from R to Rⁿ is basically a theorem in multivariate calculus. First we need some notation. Suppose that r is a one-to-one differentiable function from S ⊆ Rⁿ onto T ⊆ Rⁿ. The first derivative of the inverse function x = r⁻¹(y) is the n × n matrix of partial derivatives dx/dy. The Jacobian (named in honor of Carl Gustav Jacobi) of the inverse function is the determinant of the first derivative matrix

det(dx/dy) (3.7.8)
With this compact notation, the multivariate change of variables formula is easy to state.
Suppose that X is a random variable taking values in S ⊆ Rⁿ, and that X has a continuous distribution with probability density function f. Suppose also Y = r(X) where r is a differentiable function from S onto T ⊆ Rⁿ. Then the probability density function g of Y is given by g(y) = f(x) |det(dx/dy)| for y ∈ T, where x = r⁻¹(y).
Proof
The Jacobian is the infinitesimal scale factor that describes how n -dimensional volume changes under the transformation.
Special Transformations
Linear Transformations
Linear transformations (or more technically affine transformations) are among the most common and important transformations.
Moreover, this type of transformation leads to simple applications of the change of variable theorems. Suppose first that X is a
random variable taking values in an interval S ⊆ R and that X has a continuous distribution on S with probability density function f. Let Y = a + bX where a ∈ R and b ∈ R ∖ {0}. Note that Y takes values in T = {y = a + bx : x ∈ S}, which is also an interval. Then Y has a continuous distribution on T with probability density function g given by g(y) = (1/|b|) f((y − a)/b) for y ∈ T.
Proof
When b > 0 (which is often the case in applications), this transformation is known as a location-scale transformation; a is the
location parameter and b is the scale parameter. Scale transformations arise naturally when physical units are changed (from feet
to meters, for example). Location transformations arise naturally when the physical reference point is changed (measuring time
relative to 9:00 AM as opposed to 8:00 AM, for example). The change of temperature measurement from Fahrenheit to Celsius is a
location and scale transformation. Location-scale transformations are studied in more detail in the chapter on Special Distributions.
The multivariate version of this result has a simple and elegant form when the linear transformation is expressed in matrix-vector
form. Thus suppose that X is a random variable taking values in S ⊆ Rⁿ and that X has a continuous distribution on S with probability density function f. Let Y = a + BX where a ∈ Rⁿ and B is an invertible n × n matrix. Note that Y takes values in T = {a + Bx : x ∈ S} ⊆ Rⁿ. Then Y has a continuous distribution on T with probability density function g given by g(y) = f[B⁻¹(y − a)] / |det(B)| for y ∈ T.
Proof
Suppose that X takes values in a set R, Y takes values in a set S, and let Z = X + Y, which takes values in T = {x + y : x ∈ R, y ∈ S}. For z ∈ T, let D_z = {x ∈ R : z − x ∈ S}, and let f denote the probability density function of (X, Y).
1. If (X, Y) has a discrete distribution then Z = X + Y has a discrete distribution with probability density function u given by u(z) = ∑_{x∈D_z} f(x, z − x) for z ∈ T.
2. If (X, Y) has a continuous distribution then Z = X + Y has a continuous distribution with probability density function u given by u(z) = ∫_{D_z} f(x, z − x) dx for z ∈ T.
Proof
In the discrete case, R and S are countable, so T is also countable as is D_z for each z ∈ T. In the continuous case, R and S are typically intervals, so T is also an interval as is D_z for z ∈ T. In both cases, determining D_z is often the most difficult step. By far the most important special case occurs when X and Y are independent.
Suppose that X and Y are independent and have probability density functions g and h respectively.
1. If X and Y have discrete distributions then Z = X + Y has a discrete distribution with probability density function g ∗ h given by (g ∗ h)(z) = ∑_{x∈D_z} g(x) h(z − x) for z ∈ T.
2. If X and Y have continuous distributions then Z = X + Y has a continuous distribution with probability density function g ∗ h given by (g ∗ h)(z) = ∫_{D_z} g(x) h(z − x) dx for z ∈ T.

In both cases, the probability density function g ∗ h is called the convolution of g and h.
Proof
As before, determining the set D_z is often the most challenging step in finding the probability density function of Z. However, there is one case where the computations simplify significantly.
Suppose again that X and Y are independent random variables with probability density functions g and h , respectively.
1. In the discrete case, suppose X and Y take values in N. Then Z has probability density function (g ∗ h)(z) = ∑_{x=0}^{z} g(x) h(z − x) for z ∈ N.
2. In the continuous case, suppose that X and Y take values in [0, ∞). Then Z has probability density function (g ∗ h)(z) = ∫_0^z g(x) h(z − x) dx for z ∈ [0, ∞).
Proof
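The discrete convolution formula can be applied directly to the sum of two fair dice (an assumed example):

```python
from fractions import Fraction

# Sketch: discrete convolution (g * h)(z) = sum_x g(x) h(z - x) for the
# sum of two independent fair dice.
g = h = {x: Fraction(1, 6) for x in range(1, 7)}

def convolve(g, h):
    out = {}
    for x, p in g.items():
        for y, q in h.items():
            out[x + y] = out.get(x + y, Fraction(0)) + p * q
    return out

pmf_sum = convolve(g, h)
print(pmf_sum[7])             # 1/6, the most likely total
print(sum(pmf_sum.values()))  # 1
```

The double loop is equivalent to summing g(x) h(z − x) over D_z for each z; grouping terms by the total z = x + y does the bookkeeping automatically.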
Convolution is a very important mathematical operation that occurs in areas of mathematics outside of probability, and so involving
functions that are not necessarily probability density functions. The following result gives some simple properties of convolution.
Convolution (either discrete or continuous) satisfies the following properties, where f , g , and h are probability density
functions of the same type.
1. f ∗ g = g ∗ f (the commutative property)
2. (f ∗ g) ∗ h = f ∗ (g ∗ h) (the associative property)
Proof
Thus, in part (b) we can write f ∗ g ∗ h without ambiguity. Of course, the constant 0 is the additive identity so X + 0 = 0 + X = X for every random variable X. Also, a constant is independent of every other random variable. It follows that the probability density function δ of 0 (given by δ(0) = 1) is the identity with respect to convolution (at least for discrete PDFs). That is, f ∗ δ = δ ∗ f = f. The next result is a simple corollary of the convolution theorem, but is important enough to be highlighted.
Suppose that X = (X₁, X₂, …) is a sequence of independent and identically distributed real-valued random variables, with common probability density function f. Then Yₙ = X₁ + X₂ + ⋯ + Xₙ has probability density function f^(∗n), the n-fold convolution power of f, for n ∈ N.

In statistical terms, X corresponds to sampling from the common distribution. By convention, Y₀ = 0, so naturally we take f^(∗0) = δ. When appropriately scaled and centered, the distribution of Yₙ converges to the standard normal distribution as n → ∞.
The precise statement of this result is the central limit theorem, one of the fundamental theorems of probability. The central limit
theorem is studied in detail in the chapter on Random Samples. Clearly convolution power satisfies the law of exponents: f^(∗n) ∗ f^(∗m) = f^(∗(n + m)) for m, n ∈ N.
Convolution can be generalized to sums of independent variables that are not of the same type, but this generalization is usually
done in terms of distribution functions rather than probability density functions.
Suppose that (X, Y) has a continuous distribution on R² with probability density function f.
1. Random variable V = XY has probability density function v ↦ ∫_{−∞}^{∞} f(x, v/x) (1/|x|) dx.
2. Random variable W = Y/X has probability density function w ↦ ∫_{−∞}^{∞} f(x, wx) |x| dx.
Proof
If (X, Y) takes values in a subset D ⊆ R^2, then for a given v ∈ R, the integral in (a) is over {x ∈ R : (x, v/x) ∈ D}, and for a given w ∈ R, the integral in (b) is over {x ∈ R : (x, wx) ∈ D}. As usual, the most important special case of this result is when X and Y are independent.
Suppose that X and Y are independent random variables with continuous distributions on R having probability density
functions g and h , respectively.
1. Random variable V = XY has probability density function
v ↦ ∫_{−∞}^{∞} g(x) h(v/x) (1/|x|) dx (3.7.28)
2. Random variable W = Y/X has probability density function
w ↦ ∫_{−∞}^{∞} g(x) h(wx) |x| dx (3.7.29)
Proof
If X takes values in S ⊆ R and Y takes values in T ⊆ R , then for a given v ∈ R , the integral in (a) is over {x ∈ S : v/x ∈ T } ,
and for a given w ∈ R, the integral in (b) is over {x ∈ S : wx ∈ T } . As with convolution, determining the domain of integration is
often the most challenging step.
The minimum and maximum of a sequence of independent real-valued random variables are very important in a number of applications. For example, recall that in the standard model of structural reliability, a system
consists of n components that operate independently. Suppose that Xi represents the lifetime of component i ∈ {1, 2, …, n}, and let U = min{X1, X2, …, Xn} and V = max{X1, X2, …, Xn}.
Then U is the lifetime of the series system which operates if and only if each component is operating. Similarly, V is the lifetime of
the parallel system which operates if and only if at least one component is operating.
A particularly important special case occurs when the random variables are identically distributed, in addition to being
independent. In this case, the sequence of variables is a random sample of size n from the common distribution. The minimum and
maximum variables are the extreme examples of order statistics. Order statistics are studied in detail in the chapter on Random
Samples.
Suppose that (X1, X2, …, Xn) is a sequence of independent real-valued random variables and that Xi has distribution function Fi for each i ∈ {1, 2, …, n}.
1. V = max{X1, X2, …, Xn} has distribution function H given by H(x) = F1(x)F2(x)⋯Fn(x) for x ∈ R.
2. U = min{X1, X2, …, Xn} has distribution function G given by G(x) = 1 − [1 − F1(x)][1 − F2(x)]⋯[1 − Fn(x)] for x ∈ R.
Proof
From part (a), note that the product of n distribution functions is another distribution function. From part (b), the product of n right-tail distribution functions is a right-tail distribution function. In the reliability setting, where the random variables are nonnegative, the last statement means that the product of n reliability functions is another reliability function. If Xi has a continuous distribution with probability density function fi for each i ∈ {1, 2, …, n}, then U and V also have continuous distributions, and their probability density functions can be obtained by differentiating the distribution functions in parts (a) and (b) of the last theorem. The computations are straightforward using the product rule for derivatives, but the results are a bit of a mess.
The formulas in the last theorem are particularly nice when the random variables are identically distributed, in addition to being independent.
Suppose that (X1, X2, …, Xn) is a sequence of independent real-valued random variables, with common distribution function F.
1. V = max{X1, X2, …, Xn} has distribution function H given by H(x) = F^n(x) for x ∈ R.
2. U = min{X1, X2, …, Xn} has distribution function G given by G(x) = 1 − [1 − F(x)]^n for x ∈ R.
In particular, it follows that a positive integer power of a distribution function is a distribution function. More generally, it's easy to
see that every positive power of a distribution function is a distribution function. How could we construct a non-integer power of a
distribution function in a probabilistic way?
Suppose that (X1, X2, …, Xn) is a sequence of independent real-valued random variables, with a common continuous distribution having distribution function F and probability density function f.
1. V = max{X1, X2, …, Xn} has probability density function h given by h(x) = n F^{n−1}(x) f(x) for x ∈ R.
2. U = min{X1, X2, …, Xn} has probability density function g given by g(x) = n[1 − F(x)]^{n−1} f(x) for x ∈ R.
Coordinate Systems
For our next discussion, we will consider transformations that correspond to common distance-angle based coordinate systems: polar coordinates in the plane, and cylindrical and spherical coordinates in 3-dimensional space. First, for (x, y) ∈ R^2, let (r, θ) denote the standard polar coordinates corresponding to the Cartesian coordinates (x, y), so that r ∈ [0, ∞) is the radial distance and θ ∈ [0, 2π) is the polar angle.
Figure 3.7.6 : Polar coordinates. Stover, Christopher and Weisstein, Eric W. "Polar Coordinates." From MathWorld—A Wolfram
Web Resource. http://mathworld.wolfram.com/PolarCoordinates.html
It's best to give the inverse transformation: x = r cos θ , y = r sin θ . As we all know from calculus, the Jacobian of the
transformation is r. Hence the following result is an immediate consequence of our change of variables theorem:
Suppose that (X, Y) has a continuous distribution on R^2 with probability density function f, and that (R, Θ) are the polar coordinates of (X, Y). Then (R, Θ) has probability density function g given by
g(r, θ) = f(r cos θ, r sin θ) r, (r, θ) ∈ [0, ∞) × [0, 2π)
Next, for (x, y, z) ∈ R^3, let (r, θ, z) denote the standard cylindrical coordinates, so that (r, θ) are the standard polar coordinates of (x, y) as above, and coordinate z is left unchanged. Given our previous result, the one for cylindrical coordinates should come as no surprise.
Suppose that (X, Y, Z) has a continuous distribution on R^3 with probability density function f, and that (R, Θ, Z) are the cylindrical coordinates of (X, Y, Z). Then (R, Θ, Z) has probability density function g given by
g(r, θ, z) = f(r cos θ, r sin θ, z) r, (r, θ, z) ∈ [0, ∞) × [0, 2π) × R (3.7.33)
Finally, for (x, y, z) ∈ R^3, let (r, θ, ϕ) denote the standard spherical coordinates corresponding to the Cartesian coordinates (x, y, z), so that r ∈ [0, ∞) is the radial distance, θ ∈ [0, 2π) is the azimuth angle, and ϕ ∈ [0, π] is the polar angle. (In spite of our use of the word standard, different notations and conventions are used in different subjects.)
Suppose that (X, Y, Z) has a continuous distribution on R^3 with probability density function f, and that (R, Θ, Φ) are the spherical coordinates of (X, Y, Z). Then (R, Θ, Φ) has probability density function g given by
g(r, θ, ϕ) = f(r sin ϕ cos θ, r sin ϕ sin θ, r cos ϕ) r^2 sin ϕ, (r, θ, ϕ) ∈ [0, ∞) × [0, 2π) × [0, π] (3.7.34)
Suppose that X has a continuous distribution on R with distribution function F and probability density function f .
1. |X| has distribution function G given by G(y) = F (y) − F (−y) for y ∈ [0, ∞).
2. |X| has probability density function g given by g(y) = f (y) + f (−y) for y ∈ [0, ∞).
Proof
Recall that the sign function on R (not to be confused, of course, with the sine function) is defined as follows:
sgn(x) = −1 if x < 0; sgn(x) = 0 if x = 0; sgn(x) = 1 if x > 0 (3.7.35)
Suppose again that X has a continuous distribution on R with distribution function F and probability density function f , and
suppose in addition that the distribution of X is symmetric about 0. Then
1. |X| has distribution function G given by G(y) = 2F (y) − 1 for y ∈ [0, ∞).
2. |X| has probability density function g given by g(y) = 2f (y) for y ∈ [0, ∞).
3. sgn(X) is uniformly distributed on {−1, 1}.
4. |X| and sgn(X) are independent.
Proof
Dice
Recall that a standard die is an ordinary 6-sided die, with faces labeled from 1 to 6 (usually in the form of dots). A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 occur with probability 1/4 each and the other faces with probability 1/8 each.
Suppose that two six-sided dice are rolled and the sequence of scores (X1, X2) is recorded. Find the probability density function of Y = X1 + X2, the sum of the scores, in each of the following cases:
1. fair dice
2. ace-six flat dice
In the dice experiment, select two dice and select the sum random variable. Run the simulation 1000 times and compare the
empirical density function to the probability density function for each of the following cases:
1. fair dice
2. ace-six flat dice
Suppose that n standard, fair dice are rolled. Find the probability density function of the following variables:
1. the minimum score
2. the maximum score.
Answer
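Since the common distribution function of a single fair die is F(k) = k/6 at integer k, the pdfs asked for above follow from H(x) = F^n(x) and G(x) = 1 − [1 − F(x)]^n by differencing. A sketch with exact rational arithmetic (the helper name is ours):

```python
from fractions import Fraction

def dice_extremes(n):
    """Exact pdfs of the minimum and maximum score of n fair six-sided dice."""
    F = lambda k: Fraction(k, 6)  # distribution function of one die at integer k
    pdf_max = {k: F(k) ** n - F(k - 1) ** n for k in range(1, 7)}
    pdf_min = {k: (1 - F(k - 1)) ** n - (1 - F(k)) ** n for k in range(1, 7)}
    return pdf_min, pdf_max

pdf_min, pdf_max = dice_extremes(2)
assert pdf_max[6] == Fraction(11, 36)  # P(max = 6) = 1 - (5/6)^2 for two dice
assert pdf_min[1] == Fraction(11, 36)  # P(min = 1) = 1 - (5/6)^2 for two dice
assert sum(pdf_max.values()) == 1 and sum(pdf_min.values()) == 1
```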
In the dice experiment, select fair dice and select each of the following random variables. Vary n with the scroll bar and note
the shape of the density function. With n = 4 , run the simulation 1000 times and note the agreement between the empirical
density function and the probability density function.
1. minimum score
2. maximum score.
Uniform Distributions
Recall that for n ∈ N+, the standard measure of the size of a set A ⊆ R^n is
λn(A) = ∫_A 1 dx (3.7.37)
In particular, λ1(A) is the length of A for A ⊆ R, λ2(A) is the area of A for A ⊆ R^2, and λ3(A) is the volume of A for A ⊆ R^3. Recall also that the continuous uniform distribution on S ⊆ R^n has probability density function f defined by f(x) = 1/λn(S) for x ∈ S. Uniform distributions are studied in more detail in the chapter on Special Distributions.
Let Y = X^2. Find the probability density function of Y and sketch the graph in each of the following cases:
1. X is uniformly distributed on the interval [0, 4].
2. X is uniformly distributed on the interval [−2, 2].
3. X is uniformly distributed on the interval [−1, 3].
Answer
Compare the distributions in the last exercise. In part (c), note that even a simple transformation of a simple distribution can produce a complicated distribution. In this particular case, the complexity is caused by the fact that x ↦ x^2 is one-to-one on part of the domain, {0} ∪ (1, 3], and two-to-one on the other part, [−1, 1] ∖ {0}.
On the other hand, the uniform distribution is preserved under a linear transformation of the random variable.
Suppose that X has the continuous uniform distribution on S ⊆ R^n. Let Y = a + BX, where a ∈ R^n and B is an invertible n × n matrix. Then Y has the continuous uniform distribution on T = {a + Bx : x ∈ S}.
Proof
For the following three exercises, recall that the standard uniform distribution is the uniform distribution on the interval [0, 1].
Suppose that X and Y are independent and that each has the standard uniform distribution. Let U = X + Y, V = X − Y,
W = XY, Z = Y/X. Find the probability density function of each of the following:
1. (U , V )
2. U
3. V
4. W
5. Z
Answer
Suppose that X, Y , and Z are independent, and that each has the standard uniform distribution. Find the probability density
function of (U , V , W ) = (X + Y , Y + Z, X + Z) .
Answer
Suppose that (X1, X2, …, Xn) is a sequence of independent random variables, each with the standard uniform distribution.
Find the distribution function and probability density function of the following variables.
1. U = min{ X1 , X2 … , Xn }
2. V = max{ X1 , X2 , … , Xn }
Answer
Both distributions in the last exercise are beta distributions. More generally, all of the order statistics from a random sample of
standard uniform variables have beta distributions, one of the reasons for the importance of this family of distributions. Beta
distributions are studied in more detail in the chapter on Special Distributions.
Let f denote the probability density function of the standard uniform distribution.
1. Compute f^{∗2}
2. Compute f^{∗3}
Answer
In the last exercise, you can see the behavior predicted by the central limit theorem beginning to emerge. Recall that if (X1, X2, X3) is a sequence of independent random variables, each with the standard uniform distribution, then f, f^{∗2}, and f^{∗3} are the probability density functions of X1, X1 + X2, and X1 + X2 + X3, respectively. More generally, if (X1, X2, …, Xn) is a sequence of independent random variables, each with the standard uniform distribution, then the distribution of ∑_{i=1}^n Xi (which has probability density function f^{∗n}) is known as the Irwin-Hall distribution with parameter n. The Irwin-Hall distributions are studied in more detail in the chapter on Special Distributions.
Open the Special Distribution Simulator and select the Irwin-Hall distribution. Vary the parameter n from 1 to 3 and note the shape of the probability density function. (These are the density functions in the previous exercise.) For each value of n, run the simulation 1000 times and compare the empirical density function and the probability density function.
Simulations
A remarkable fact is that the standard uniform distribution can be transformed into almost any other distribution on R. This is
particularly important for simulations, since many computer languages have an algorithm for generating random numbers, which
are simulations of independent variables, each with the standard uniform distribution. Conversely, any continuous distribution
supported on an interval of R can be transformed into the standard uniform distribution.
Suppose first that F is a distribution function for a distribution on R (which may be discrete, continuous, or mixed), and let F^{−1} denote the corresponding quantile function. If U has the standard uniform distribution, then X = F^{−1}(U) has distribution function F.
Assuming that we can compute F^{−1}, the previous exercise shows how we can simulate a distribution with distribution function F.
To rephrase the result, we can simulate a variable with distribution function F by simply computing a random quantile. Most of the
apps in this project use this method of simulation. The first image below shows the graph of the distribution function of a rather
complicated mixed distribution, represented in blue on the horizontal axis. In the second image, note how the uniform distribution
on [0, 1], represented by the thick red line, is transformed, via the quantile function, into the given distribution.
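The random quantile method is easy to sketch in code: apply the quantile function to standard uniform random numbers. A hedged illustration (the function names are ours; the example uses the exponential distribution, whose quantile function F^{−1}(u) = −ln(1 − u)/r comes up again later in this section):

```python
import math
import random

def random_quantile_sample(quantile, n, seed=None):
    """Simulate n values from the distribution whose quantile function is given,
    by applying it to standard uniform random numbers."""
    rng = random.Random(seed)
    return [quantile(rng.random()) for _ in range(n)]

# Example: the exponential distribution with rate r has F(x) = 1 - e^(-r x),
# so the quantile function is F^{-1}(u) = -ln(1 - u) / r.
r = 2.0
sample = random_quantile_sample(lambda u: -math.log(1 - u) / r, 100_000, seed=1)
assert abs(sum(sample) / len(sample) - 1 / r) < 0.01  # sample mean near 1/r
```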
Suppose that X has a continuous distribution on an interval S ⊆ R, with distribution function F. Then U = F (X) has the standard uniform distribution.
Proof
Show how to simulate the uniform distribution on the interval [a, b] with a random number. Using your calculator, simulate 5
values from the uniform distribution on the interval [2, 10].
Answer
Beta Distributions
Suppose that X has a beta distribution, with probability density function f. Find the probability density function of each of the following random variables:
1. U = X^2
2. V = √X
3. W = 1/X
Proof
Random variables X, U , and V in the previous exercise have beta distributions, the same family of distributions that we saw in the
exercise above for the minimum and maximum of independent standard uniform variables. In general, beta distributions are widely
used to model random proportions and probabilities, as well as physical quantities that take values in closed bounded intervals
(which after a change of units can be taken to be [0, 1]). On the other hand, W has a Pareto distribution, named for Vilfredo Pareto.
The family of beta distributions and the family of Pareto distributions are studied in more detail in the chapter on Special
Distributions.
Suppose that the radius R of a sphere has a beta distribution with probability density function f given by f(r) = 12 r^2 (1 − r) for 0 ≤ r ≤ 1. Find the probability density function of each of the following:
3. The volume V = (4/3) π R^3
Answer
Suppose that the grades on a test are described by the random variable Y = 100X where X has the beta distribution with probability density function f given by f(x) = 12 x (1 − x)^2 for 0 ≤ x ≤ 1. The grades are generally low, so the teacher decides to “curve” the grades using the transformation Z = 10√Y = 100√X. Find the probability density function of
1. Y
2. Z
Answer
Bernoulli Trials
Recall that a Bernoulli trials sequence is a sequence (X1, X2, …) of independent, identically distributed indicator random variables. In the usual terminology of reliability theory, Xi = 0 means failure on trial i, while Xi = 1 means success on trial i. The basic parameter of the process is the probability of success p = P(Xi = 1), so p ∈ [0, 1]. The random process is named for Jacob Bernoulli.
Now let Yn denote the number of successes in the first n trials, so that Yn = ∑_{i=1}^n Xi for n ∈ N.
Yn has the binomial distribution with parameters n and p, with probability density function fn given by
fn(y) = (n choose y) p^y (1 − p)^{n−y}, y ∈ {0, 1, …, n} (3.7.39)
Proof
For m, n ∈ N+:
1. fn = f^{∗n}
2. fm ∗ fn = fm+n
Proof
From part (b) it follows that if Y and Z are independent variables, and Y has the binomial distribution with parameters n ∈ N+ and p ∈ [0, 1] while Z has the binomial distribution with parameters m ∈ N+ and p, then Y + Z has the binomial distribution with parameters m + n and p.
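Part (b) can also be verified numerically by convolving binomial pdfs directly; a small sketch (the parameter values are arbitrary):

```python
from math import comb

def binom_pdf(n, p):
    """Probability density function of the binomial distribution with parameters n, p."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

def convolve(f, g):
    """Discrete convolution of two pdfs given as lists indexed by value."""
    h = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            h[i + j] += a * b
    return h

m, n, p = 3, 4, 0.3
lhs = convolve(binom_pdf(m, p), binom_pdf(n, p))  # f_m * f_n
rhs = binom_pdf(m + n, p)                          # f_{m+n}
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```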
Find the probability density function of the difference between the number of successes and the number of failures in n ∈ N+ Bernoulli trials.
This distribution is named for Simeon Poisson and is widely used to model the number of random points in a region of time or space; the parameter t is proportional to the size of the region. The Poisson distribution is studied in detail in the chapter on The Poisson Process.
The last result means that if X and Y are independent variables, and X has the Poisson distribution with parameter a > 0 while Y
has the Poisson distribution with parameter b > 0 , then X + Y has the Poisson distribution with parameter a + b . In terms of the
Poisson model, X could represent the number of points in a region A and Y the number of points in a region B (of the appropriate
sizes so that the parameters are a and b respectively). The independence of X and Y corresponds to the regions A and B being
disjoint. Then X + Y is the number of points in A ∪ B .
for t ∈ [0, ∞). This distribution is often used to model random times such as failure times and lifetimes. In particular, the times
between arrivals in the Poisson model of random points in time have independent, identically distributed exponential distributions.
The Exponential distribution is studied in more detail in the chapter on Poisson Processes.
Show how to simulate, with a random number, the exponential distribution with rate parameter r. Using your calculator,
simulate 5 values from the exponential distribution with parameter r = 3 .
Answer
For the next exercise, recall that the floor and ceiling functions on R are defined by
⌊x⌋ = max{n ∈ Z : n ≤ x}, ⌈x⌉ = min{n ∈ Z : n ≥ x}, x ∈ R (3.7.43)
Suppose that T has the exponential distribution with rate parameter r ∈ (0, ∞) . Find the probability density function of each
of the following random variables:
1. Y = ⌊T ⌋
2. Z = ⌈T ⌉
Answer
Note that the distributions in the previous exercise are geometric distributions on N and on N+, respectively. In many respects, the geometric distribution is a discrete version of the exponential distribution.
Suppose that T has the exponential distribution with rate parameter r ∈ (0, ∞) . Find the probability density function of each
of the following random variables:
1. X = T^2
2. Y = e^T
3. Z = ln T
Answer
In the previous exercise, Y has a Pareto distribution while Z has an extreme value distribution. Both of these are studied in more
detail in the chapter on Special Distributions.
Suppose that X and Y are independent random variables, each having the exponential distribution with parameter 1. Let Z = Y/X. Find the probability density function of Z.
Suppose that X has the exponential distribution with rate parameter a > 0, Y has the exponential distribution with rate parameter b > 0, and that X and Y are independent. Find the probability density function of Z = X + Y in each of the following cases.
1. a = b
2. a ≠ b
Answer
Suppose that (T1, T2, …, Tn) is a sequence of independent random variables, and that Ti has the exponential distribution with rate parameter ri > 0 for each i ∈ {1, 2, …, n}.
3. Find the probability density function of V in the special case that ri = r for each i ∈ {1, 2, …, n}.
Answer
Note that the minimum U in part (a) has the exponential distribution with parameter r1 + r2 + ⋯ + rn. In particular, suppose that
a series system has independent components, each with an exponentially distributed lifetime. Then the lifetime of the system is also
exponentially distributed, and the failure rate of the system is the sum of the component failure rates.
P(Ti < Tj for all j ≠ i) = ri / ∑_{j=1}^n rj (3.7.44)
Proof
The result in the previous exercise is very important in the theory of continuous-time Markov chains. If we have a bunch of independent alarm clocks, with exponentially distributed alarm times, then the probability that clock i is the first one to sound is ri / ∑_{j=1}^n rj.
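The alarm-clock interpretation is easy to check by simulation; a sketch with three illustrative rates:

```python
import random

random.seed(7)
rates = [1.0, 2.0, 3.0]    # rate parameters r_1, r_2, r_3 (illustrative values)
N = 100_000
wins = [0] * len(rates)
for _ in range(N):
    times = [random.expovariate(r) for r in rates]
    wins[times.index(min(times))] += 1  # which clock sounded first

total = sum(rates)
for i, r in enumerate(rates):
    # P(clock i sounds first) = r_i / (r_1 + ... + r_n)
    assert abs(wins[i] / N - r / total) < 0.01
```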
With a positive integer shape parameter, as we have here, it is also referred to as the Erlang distribution, named for Agner Erlang.
This distribution is widely used to model random times under certain basic assumptions. In particular, the nth arrival time in the Poisson model of random points in time has the gamma distribution with parameter n. The Erlang distribution is studied in more
detail in the chapter on the Poisson Process, and in greater generality, the gamma distribution is studied in the chapter on Special
Distributions.
Let g = g1, and note that this is the probability density function of the exponential distribution with parameter 1, which was the topic of the previous discussion.
If m, n ∈ N+ then
1. gn = g^{∗n}
2. gm ∗ gn = gm+n
Proof
Part (b) means that if X has the gamma distribution with shape parameter m and Y has the gamma distribution with shape
parameter n , and if X and Y are independent, then X + Y has the gamma distribution with shape parameter m + n . In the context
of the Poisson model, part (a) means that the n th arrival time is the sum of the n independent interarrival times, which have a
common exponential distribution.
Suppose that T has the gamma distribution with shape parameter n ∈ N+. Find the probability density function of X = ln T.
Answer
Members of this family have already come up in several of the previous exercises. The Pareto distribution, named for Vilfredo
Pareto, is a heavy-tailed distribution often used for modeling income and other financial variables. The Pareto distribution is
studied in more detail in the chapter on Special Distributions.
Suppose that X has the Pareto distribution with shape parameter a. Find the probability density function of each of the following random variables:
1. U = X^2
2. V = 1/X
3. Y = ln X
Answer
In the previous exercise, U also has a Pareto distribution but with parameter a/2; V has the beta distribution with parameters a and b = 1; and Y has the exponential distribution with rate parameter a.
Show how to simulate, with a random number, the Pareto distribution with shape parameter a . Using your calculator, simulate
5 values from the Pareto distribution with shape parameter a = 2 .
Answer
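One way to carry this out, assuming the Pareto distribution with shape parameter a supported on [1, ∞), where F(x) = 1 − x^{−a}: the quantile function is F^{−1}(u) = (1 − u)^{−1/a}. A sketch with a = 3, chosen so that the mean a/(a − 1) is finite and easy to check:

```python
import random

random.seed(3)
a = 3.0
# Pareto with shape a on [1, inf): F(x) = 1 - x^(-a), so F^{-1}(u) = (1 - u)^(-1/a)
sample = [(1 - random.random()) ** (-1 / a) for _ in range(100_000)]
mean = sum(sample) / len(sample)
assert abs(mean - a / (a - 1)) < 0.05  # mean of Pareto(a) is a/(a-1) for a > 1
```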
Recall that the standard normal distribution has probability density function ϕ given by
ϕ(z) = (1/√(2π)) e^{−z²/2}, z ∈ R (3.7.48)
Suppose that Z has the standard normal distribution, and that μ ∈ (−∞, ∞) and σ ∈ (0, ∞).
1. Find the probability density function f of X = μ + σZ
2. Sketch the graph of f , noting the important qualitative features.
Answer
Random variable X has the normal distribution with location parameter μ and scale parameter σ. The normal distribution is
perhaps the most important distribution in probability and mathematical statistics, primarily because of the central limit theorem,
one of the fundamental theorems. It is widely used to model physical measurements of all types that are subject to small, random
errors. The normal distribution is studied in detail in the chapter on Special Distributions.
Suppose that Z has the standard normal distribution. Find the probability density function of V = Z^2 and sketch the graph.
Answer
Random variable V has the chi-square distribution with 1 degree of freedom. Chi-square distributions are studied in detail in the
chapter on Special Distributions.
Suppose that X and Y are independent random variables, each with the standard normal distribution, and let (R, Θ) be the standard polar coordinates of (X, Y). Find the probability density function of
1. (R, Θ)
2. R
3. Θ
Answer
The distribution of R is the (standard) Rayleigh distribution, and is named for John William Strutt, Lord Rayleigh. The Rayleigh
distribution is studied in more detail in the chapter on Special Distributions.
The standard normal distribution does not have a simple, closed form quantile function, so the random quantile method of
simulation does not work well. However, the last exercise points the way to an alternative method of simulation.
Show how to simulate a pair of independent, standard normal variables with a pair of random numbers. Using your calculator,
simulate 6 values from the standard normal distribution.
Answer
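One standard way to do this, suggested by the polar-coordinate exercise above, is the Box-Muller (polar) method: if U1 and U2 are independent random numbers, then R = √(−2 ln U1) and Θ = 2π U2 have the distributions of the polar coordinates of a pair of independent standard normal variables. A sketch (the function name is ours):

```python
import math
import random

def standard_normal_pair(rng):
    """Box-Muller method: transform two random numbers into a pair of
    independent standard normal values via polar coordinates."""
    u1 = 1 - rng.random()           # in (0, 1], avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2 * math.log(u1))
    theta = 2 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)
sample = [z for _ in range(50_000) for z in standard_normal_pair(rng)]
mean = sum(sample) / len(sample)
var = sum(z * z for z in sample) / len(sample)
assert abs(mean) < 0.02 and abs(var - 1.0) < 0.02  # standard normal moments
```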
Random variable T has the (standard) Cauchy distribution, named after Augustin Cauchy. The Cauchy distribution is studied in
detail in the chapter on Special Distributions.
Suppose that a light source is 1 unit away from position 0 on an infinite straight wall. We shine the light at the wall at an angle Θ to the perpendicular, where Θ is uniformly distributed on (−π/2, π/2). Find the probability density function of the position of the light beam on the wall.
Thus, X also has the standard Cauchy distribution. Clearly we can simulate a value of the Cauchy distribution by X = tan(−π/2 + πU) where U is a random number. This is the random quantile method.
Open the Cauchy experiment, which is a simulation of the light problem in the previous exercise. Keep the default parameter
values and run the experiment in single step mode a few times. Then run the experiment 1000 times and compare the empirical
density function and the probability density function.
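The random quantile simulation described above is a one-liner; a sketch checking it against the fact that the standard Cauchy distribution assigns probability 1/2 to the interval (−1, 1):

```python
import math
import random

random.seed(5)
# random quantile method for the standard Cauchy distribution:
# F^{-1}(u) = tan(-pi/2 + pi u)
sample = [math.tan(-math.pi / 2 + math.pi * random.random())
          for _ in range(100_000)]
# P(-1 < X < 1) = 1/2 for the standard Cauchy distribution
inside = sum(-1 < x < 1 for x in sample) / len(sample)
assert abs(inside - 0.5) < 0.01
```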
This page titled 3.7: Transformations of Random Variables is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
3.8: Convergence in Distribution
This section is concerned with the convergence of probability distributions, a topic of basic importance in probability theory. Since we will be almost exclusively concerned with the convergence of sequences of various kinds, it's helpful to introduce the notation N+* = N+ ∪ {∞} = {1, 2, …} ∪ {∞}.
Distributions on (R, R)
Definition
We start with the most important and basic setting, the measurable space (R, R), where R is the set of real numbers of course, and
R is the Borel σ-algebra of subsets of R . Recall that if P is a probability measure on (R, R), then the function F : R → [0, 1]
defined by F (x) = P (−∞, x] for x ∈ R is the (cumulative) distribution function of P . Recall also that F completely determines
P . Here is the definition for convergence of probability measures in this setting:
Suppose Pn is a probability measure on (R, R) with distribution function Fn for each n ∈ N+*. Then Pn converges (weakly) to P∞ as n → ∞, written Pn ⇒ P∞ as n → ∞, if Fn(x) → F∞(x) as n → ∞ for every x ∈ R where F∞ is continuous.
Recall that a distribution function F is continuous at x ∈ R if and only if P(X = x) = 0, so that x is not an atom of the distribution (a point of positive probability). We will see shortly why this condition on F∞ is appropriate. Of course, a probability measure on (R, R) is usually associated with a real-valued random variable for some random experiment that is modeled by a probability space (Ω, F, P). So to review, Ω is the set of outcomes, F is the σ-algebra of events, and P is the probability measure on the sample space (Ω, F). If X is a real-valued random variable defined on the probability space, then the distribution of X is the probability measure P on (R, R) defined by P(A) = P(X ∈ A) for A ∈ R, and then of course, the distribution function of X is the function F defined by F(x) = P(X ≤ x) for x ∈ R. Here is the convergence terminology used in this setting:
Suppose that Xn is a real-valued random variable with distribution Pn for each n ∈ N+*. If Pn ⇒ P∞ as n → ∞ then we say that Xn converges in distribution to X∞ as n → ∞. In terms of distribution functions, the condition is that Fn(x) → F∞(x) as n → ∞ for every point x ∈ R where F∞ is continuous. On the one hand, the terminology and notation are helpful, since again most probability measures are associated with random variables (and every probability measure can be). On the other hand, the terminology and
notation can be a bit misleading since the random variables, as functions, do not converge in any sense, and indeed the random
variables need not be defined on the same probability spaces. It is only the distributions that converge. However, often the random
variables are defined on the same probability space (Ω, F , P), in which case we can compare convergence in distribution with the
other modes of convergence we have or will study:
Convergence with probability 1
Convergence in probability
Convergence in mean
We will show, in fact, that convergence in distribution is the weakest of all of these modes of convergence. However, strength of
convergence should not be confused with importance. Convergence in distribution is one of the most important modes of
convergence; the central limit theorem, one of the two fundamental theorems of probability, is a theorem about convergence in
distribution.
Preliminary Examples
The examples below show why the definition is given in terms of distribution functions, rather than probability density functions,
and why convergence is only required at the points of continuity of the limiting distribution function. Note that the distributions
considered are probability measures on (R, R), even though the support of the distribution may be a much smaller subset. For the
first example, note that if a deterministic sequence converges in the ordinary calculus sense, then naturally we want the sequence
(thought of as random variables) to converge in distribution. Expand the proof to understand the example fully.
Proof
For the example below, recall that Q denotes the set of rational numbers. Once again, expand the proof to understand the example fully.
For n ∈ N+, let Pn denote the discrete uniform distribution on {1/n, 2/n, …, (n−1)/n, 1} and let P∞ denote the continuous uniform distribution on the interval [0, 1]. Then
1. Pn ⇒ P∞ as n → ∞
2. Pn(Q) = 1 for each n ∈ N+ but P∞(Q) = 0.
Proof
The point of the example is that it's reasonable for the discrete uniform distribution on {1/n, 2/n, …, (n−1)/n, 1} to converge to the continuous uniform distribution on [0, 1], but once again, the probability density functions are evidently not the correct objects of study.
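The convergence of the distribution functions in this example can be seen numerically: Fn(x) = ⌊nx⌋/n for the discrete uniform distribution on {1/n, 2/n, …, 1}, which converges to F∞(x) = x on [0, 1]. A small sketch:

```python
import math

def F_n(x, n):
    """Distribution function of the discrete uniform distribution on {1/n, ..., 1},
    evaluated at x in [0, 1]: the number of support points <= x, divided by n."""
    return min(max(math.floor(n * x), 0), n) / n

x = 0.37
gaps = [abs(F_n(x, n) - x) for n in (10, 100, 10_000)]
assert gaps[0] > gaps[2]               # the approximation improves with n
assert abs(F_n(x, 10_000) - x) < 1e-3  # F_n(x) is close to x for large n
```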
2. Suppose that fn is a probability density function for a continuous distribution Pn on R for each n ∈ N+*. If fn(x) → f∞(x) as n → ∞ for all x ∈ R (except perhaps on a set of measure 0), then Pn ⇒ P∞ as n → ∞.
Proof
Convergence in Probability
Naturally, we would like to compare convergence in distribution with other modes of convergence we have studied.
Suppose that Xn is a real-valued random variable for each n ∈ N+*, all defined on the same probability space. If Xn → X∞ as n → ∞ in probability, then Xn → X∞ as n → ∞ in distribution.
Proof
Our next example shows that even when the variables are defined on the same probability space, a sequence can converge in
distribution, but not in any other way.
Suppose that X is an indicator variable with P(X = 1) = P(X = 0) = 1/2, so that X is the result of tossing a fair coin. Let Xn = 1 − X for n ∈ N+. Then
1. Xn → X as n → ∞ in distribution.
Proof
The critical fact that makes this counterexample work is that 1 − X has the same distribution as X. Any random variable with this
property would work just as well, so if you prefer a counterexample with continuous distributions, let X have probability density
function f given by f (x) = 6x(1 − x) for 0 ≤ x ≤ 1 . The distribution of X is an example of a beta distribution.
The following summary gives the implications for the various modes of convergence; no other implications hold in general.
Suppose that Xn is a real-valued random variable for each n ∈ N+*, all defined on a common probability space.
It follows that convergence with probability 1, convergence in probability, and convergence in mean all imply convergence in
distribution, so the latter mode of convergence is indeed the weakest. However, our next theorem gives an important converse to
part (c) in (7), when the limiting variable is a constant. Of course, a constant can be viewed as a random variable defined on any
probability space.
Suppose that Xn is a real-valued random variable for each n ∈ N+, defined on the same probability space, and that c ∈ R. If Xn → c as n → ∞ in distribution, then Xn → c as n → ∞ in probability.
Proof
2. Xn → X∞ as n → ∞ with probability 1.
Proof
The following theorem illustrates the value of the Skorohod representation and the usefulness of random variable notation for
convergence in distribution. The theorem is also quite intuitive, since a basic idea is that continuity should preserve convergence.
Suppose that Xn is a real-valued random variable for each n ∈ N+* (not necessarily defined on the same probability space). Suppose also that g : R → R is measurable, and let Dg denote the set of discontinuities of g, and P∞ the distribution of X∞.
Proof
The definition of convergence in distribution requires that the sequence of probability measures converge on sets of the form
$(-\infty, x]$ for $x \in \mathbb{R}$ when the limiting distribution has probability 0 at $x$. It turns out that the probability measures will converge on
lots of other sets as well, and this result points the way to extending convergence in distribution to more general spaces. To state the
result, recall that if A is a subset of a topological space, then the boundary of A is ∂A = cl(A) ∖ int(A) where cl(A) is the
closure of A (the smallest closed set that contains A ) and int(A) is the interior of A (the largest open set contained in A ).
In the context of this result, suppose that $a, b \in \mathbb{R}$ with $a < b$. If $P\{a\} = P\{b\} = 0$, then as $n \to \infty$ we have $P_n(a, b) \to P(a, b)$, $P_n[a, b) \to P[a, b)$, $P_n(a, b] \to P(a, b]$, and $P_n[a, b] \to P[a, b]$. Of course, the limiting values are all the same.
Examples and Applications
Next we will explore several interesting examples of the convergence of distributions on (R, R). There are several important cases
where a special distribution converges to another special distribution as a parameter approaches a limiting value. Indeed, such
convergence results are part of the reason why such distributions are special in the first place.
The parameters $m$, $r$, and $n$ are positive integers with $n \le m$ and $r \le m$. The hypergeometric distribution is studied in more detail in the chapter on Finite Sampling Models.
Recall next that Bernoulli trials are independent trials, each with two possible outcomes, generically called success and failure. The
probability of success $p \in [0, 1]$ is the same for each trial. The binomial distribution with parameters $n \in \mathbb{N}_+$ and $p$ is the distribution of the number of successes in $n$ Bernoulli trials. This distribution has probability density function $g$ given by
$$g(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \tag{3.8.4}$$
The binomial distribution is studied in more detail in the chapter on Bernoulli Trials. Note that the binomial distribution with
parameters n and p = r/m is the distribution that governs the number of type 1 objects in a sample of size n , drawn with
replacement from a population of m objects with r objects of type 1. This fact is motivation for the following result:
Suppose that $r_m \in \{0, 1, \ldots, m\}$ for each $m \in \mathbb{N}_+$ and that $r_m / m \to p$ as $m \to \infty$. For fixed $n \in \mathbb{N}_+$, the hypergeometric distribution with parameters $m$, $r_m$, and $n$ converges to the binomial distribution with parameters $n$ and $p$ as $m \to \infty$.
Proof
From a practical point of view, the last result means that if the population size m is “large” compared to sample size n , then the
hypergeometric distribution with parameters m, r, and n (which corresponds to sampling without replacement) is well
approximated by the binomial distribution with parameters n and p = r/m (which corresponds to sampling with replacement).
This is often a useful result, not computationally, but rather because the binomial distribution has fewer parameters than the
hypergeometric distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the
limiting binomial distribution, we do not need to know the population size m and the number of type 1 objects r individually, but
only in the ratio r/m.
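The approximation described above is easy to check numerically. The following Python sketch (not part of the original text; the function names and the choice $m = 1000$, $r = 300$, $n = 10$ are ours) compares the hypergeometric probability density function with its binomial approximation, using only the standard library.

```python
# Sketch: hypergeometric pmf (sampling without replacement) versus its
# binomial approximation (sampling with replacement), for m large.
from math import comb

def hypergeometric_pmf(m, r, n, k):
    # P(K = k): k type-1 objects in a sample of size n drawn without
    # replacement from m objects, r of which are type 1
    return comb(r, k) * comb(m - r, n - k) / comb(m, n)

def binomial_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

m, r, n = 1000, 300, 10          # population large relative to the sample
p = r / m
for k in range(n + 1):
    h = hypergeometric_pmf(m, r, n, k)
    b = binomial_pmf(n, p, k)
    print(f"k={k}: hypergeometric={h:.5f}, binomial={b:.5f}")
```

The two columns agree to a couple of decimal places; increasing $m$ (with $r/m$ fixed) makes the agreement better, as the convergence result predicts.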
In the ball and urn experiment, set m = 100 and r = 30 . For each of the following values of n (the sample size), switch
between sampling without replacement (the hypergeometric distribution) and sampling with replacement (the binomial
distribution). Note the difference in the probability density functions. Run the simulation 1000 times for each sampling mode
and compare the relative frequency function to the probability density function.
1. 10
2. 20
3. 30
4. 40
5. 50
Recall that the binomial distribution with parameters $n \in \mathbb{N}_+$ and $p \in [0, 1]$ governs the number of successes in $n$ Bernoulli trials, when $p$ is the probability of success on a trial. This distribution has probability density function $f$ given by
$$f(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \tag{3.8.7}$$
Recall also that the Poisson distribution with parameter $r \in (0, \infty)$ has probability density function $g$ given by
$$g(k) = e^{-r} \frac{r^k}{k!}, \quad k \in \mathbb{N} \tag{3.8.8}$$
The distribution is named for Simeon Poisson and governs the number of “random points” in a region of time or space, under
certain ideal conditions. The parameter r is proportional to the size of the region of time or space. The Poisson distribution is
studied in more detail in the chapter on the Poisson Process.
Suppose that $p_n \in [0, 1]$ for $n \in \mathbb{N}_+$ and that $n p_n \to r \in (0, \infty)$ as $n \to \infty$. Then the binomial distribution with parameters $n$ and $p_n$ converges to the Poisson distribution with parameter $r$ as $n \to \infty$.
Proof
From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials $n$ is “large” and the probability of success $p$ “small”, so that $n p^2$ is small, then the binomial distribution with parameters $n$ and $p$ is well approximated by the Poisson distribution with parameter $r = n p$. This is often a useful result, again not computationally, but rather because the Poisson distribution has fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating Poisson distribution, we do not need to know the number of trials $n$ and the probability of success $p$ individually, but only in the product $n p$. As we will see in the next chapter, the condition that $n p^2$ be small means that the variance of the binomial distribution, namely $n p (1 - p) = n p - n p^2$, is approximately $r = n p$, the variance of the approximating Poisson distribution.
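The Poisson approximation can also be checked numerically. Here is a Python sketch (not part of the original text; the function names and the choice $n = 100$, $p = 0.05$ are ours) comparing the two probability density functions.

```python
# Sketch: binomial pmf with n large and p small versus the Poisson pmf
# with parameter r = n * p.
from math import comb, exp, factorial

def binomial_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(r, k):
    return exp(-r) * r**k / factorial(k)

n, p = 100, 0.05                 # n*p = 5, and n*p**2 = 0.25 is small
r = n * p
for k in range(11):
    print(f"k={k}: binomial={binomial_pmf(n, p, k):.5f}, "
          f"poisson={poisson_pmf(r, k):.5f}")
```

The pmf values agree to about two decimal places; the error shrinks as $n p^2$ shrinks.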
In the binomial timeline experiment, set the parameter values as follows, and observe the graph of the probability density
function. (Note that np = 5 in each case.) Run the experiment 1000 times in each case and compare the relative frequency
function and the probability density function. Note also the successes represented as “random points” in discrete time.
1. n = 10 , p = 0.5
2. n = 20 , p = 0.25
3. n = 100 , p = 0.05
In the Poisson experiment, set r = 5 and t = 1 , to get the Poisson distribution with parameter 5. Note the shape of the
probability density function. Run the experiment 1000 times and compare the relative frequency function to the probability
density function. Note the similarity between this experiment and the one in the previous exercise.
Recall that the geometric distribution on $\mathbb{N}_+$ with success parameter $p \in (0, 1]$ has probability density function $f$ given by
$$f(k) = p (1 - p)^{k-1}, \quad k \in \mathbb{N}_+ \tag{3.8.10}$$
The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials.
Suppose that $U$ has the geometric distribution on $\mathbb{N}_+$ with success parameter $p \in (0, 1]$. For $n \in \mathbb{N}_+$, the conditional distribution of $U$ given $U \le n$ converges to the uniform distribution on $\{1, 2, \ldots, n\}$ as $p \downarrow 0$.
Proof
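A small numerical sketch of this result (not from the text; the function name is ours): the conditional pmf $P(U = k \mid U \le n) = p(1-p)^{k-1} / \bigl(1 - (1-p)^n\bigr)$ flattens toward the uniform value $1/n$ as $p \downarrow 0$.

```python
# Sketch: conditional pmf of a geometric variable U given U <= n,
# which tends to the uniform pmf 1/n as the success parameter p -> 0.
def conditional_geometric_pmf(p, n, k):
    # P(U = k | U <= n) for the geometric distribution on {1, 2, ...}
    return p * (1 - p)**(k - 1) / (1 - (1 - p)**n)

n = 5
for p in [0.5, 0.1, 0.001]:
    row = [conditional_geometric_pmf(p, n, k) for k in range(1, n + 1)]
    print(p, [round(x, 4) for x in row])   # flattens toward 1/n = 0.2
```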
Next, recall that the exponential distribution with rate parameter $r \in (0, \infty)$ has distribution function $G$ given by
$$G(t) = 1 - e^{-r t}, \quad 0 \le t < \infty \tag{3.8.12}$$
The exponential distribution governs the time between “arrivals” in the Poisson model of random points in time.
Suppose that $U_n$ has the geometric distribution on $\mathbb{N}_+$ with success parameter $p_n \in (0, 1]$ for $n \in \mathbb{N}_+$, and that $n p_n \to r \in (0, \infty)$ as $n \to \infty$. Then the distribution of $U_n / n$ converges to the exponential distribution with parameter $r$ as $n \to \infty$.
Proof
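A quick numerical illustration (not from the text; the parameter choices $r = 2$, $t = 1.5$ are ours): with $p_n = r/n$, the tail probability $P(U_n / n > t) = (1 - p_n)^{\lfloor n t \rfloor}$ approaches the exponential tail $e^{-r t}$.

```python
# Sketch: geometric tail P(U_n / n > t) = (1 - p_n)^floor(n*t)
# converging to the exponential tail e^(-r*t) as n -> infinity.
from math import exp, floor

r, t = 2.0, 1.5
for n in [10, 100, 10000]:
    p_n = r / n
    tail = (1 - p_n)**floor(n * t)
    print(n, tail, exp(-r * t))
```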
Note that the limiting condition on $n$ and $p_n$ in the last result is precisely the same as the condition for the convergence of the binomial distribution to the Poisson distribution. For a deeper interpretation of both of these results, see the section on the Poisson distribution.
In the negative binomial experiment, set k = 1 to get the geometric distribution. Then decrease the value of p and note the
shape of the probability density function. With p = 0.5 run the experiment 1000 times and compare the relative frequency
function to the probability density function.
In the gamma experiment, set k = 1 to get the exponential distribution, and set r = 5 . Note the shape of the probability density
function. Run the experiment 1000 times and compare the empirical density function and the probability density function.
Compare this experiment with the one in the previous exercise, and note the similarity, up to a change in scale.
$P(X_i = i) = \frac{1}{n}$ for each $i \in \{1, 2, \ldots, n\}$.
Proof
So the matching events all have the same probability, which varies inversely with the number of trials.
$P(X_i = i, X_j = j) = \frac{1}{n(n-1)}$ for $i, j \in \{1, 2, \ldots, n\}$ with $i \ne j$.
Proof
So the matching events are dependent, and in fact are positively correlated. In particular, the matching events do not form a
sequence of Bernoulli trials. The matching problem is studied in detail in the chapter on Finite Sampling Models. In that section we
show that the number of matches $N_n$ has probability density function $f_n$ given by:
$$f_n(k) = \frac{1}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!}, \quad k \in \{0, 1, \ldots, n\} \tag{3.8.14}$$
Proof
In the matching experiment, increase n and note the apparent convergence of the probability density function for the number of
matches. With selected values of n , run the experiment 1000 times and compare the relative frequency function and the
probability density function.
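The convergence seen in the matching experiment can be computed directly from the pmf formula above. The sketch below (not from the text; the function name is ours) evaluates $f_n(k)$ and compares it with the Poisson pmf with parameter 1, $e^{-1}/k!$, which is the limiting distribution.

```python
# Sketch: pmf of the number of matches,
# f_n(k) = (1/k!) * sum_{j=0}^{n-k} (-1)^j / j!,
# compared with its Poisson(1) limit e^(-1) / k!.
from math import exp, factorial

def matching_pmf(n, k):
    return sum((-1)**j / factorial(j) for j in range(n - k + 1)) / factorial(k)

n = 10
for k in range(6):
    print(k, round(matching_pmf(n, k), 6), round(exp(-1) / factorial(k), 6))
```

Already for $n = 10$ the two columns agree to about six decimal places, since the alternating series for $e^{-1}$ converges very fast.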
Suppose that $(X_1, X_2, \ldots)$ is a sequence of independent random variables, each with the standard exponential distribution (parameter 1). Thus, recall that the common distribution function $G$ is given by
$$G(x) = 1 - e^{-x}, \quad 0 \le x < \infty \tag{3.8.16}$$
Proof
The limiting distribution in Exercise (27) is the standard extreme value distribution, also known as the standard Gumbel
distribution in honor of Emil Gumbel. Extreme value distributions are studied in detail in the chapter on Special Distributions.
Recall that the Pareto distribution with shape parameter $a \in (0, \infty)$ has distribution function $F$ given by
$$F(x) = 1 - \frac{1}{x^a}, \quad 1 \le x < \infty \tag{3.8.19}$$
The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution sometimes used to model financial variables. It is
studied in more detail in the chapter on Special Distributions.
Suppose that $X_n$ has the Pareto distribution with parameter $n$ for each $n \in \mathbb{N}_+$. Then the distribution of $X_n$ converges to point mass at 1 as $n \to \infty$.
Proof
Fundamental Theorems
The two fundamental theorems of basic probability theory, the law of large numbers and the central limit theorem, are studied in detail in the chapter on Random Samples. For this reason we will simply state the results in this section. So suppose that $(X_1, X_2, \ldots)$ is a sequence of independent, identically distributed, real-valued random variables (defined on the same probability space) with mean $\mu \in (-\infty, \infty)$ and standard deviation $\sigma \in (0, \infty)$. For $n \in \mathbb{N}_+$, let $Y_n = \sum_{i=1}^n X_i$ denote the sum of the first $n$ variables.
1. $Y_n / n \to \mu$ as $n \to \infty$ with probability 1 (and hence also in probability and in distribution). This is the law of large numbers.
2. The distribution of $Z_n = (Y_n - n \mu) / (\sigma \sqrt{n})$ converges to the standard normal distribution as $n \to \infty$. This is the central limit theorem.
In part 1, convergence with probability 1 is the strong law of large numbers, while convergence in probability and in distribution are the weak laws of large numbers.
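Both theorems are easy to see in a small simulation. The sketch below (not from the text; the seed and the choice of fair coin tosses, with $\mu = 0.5$ and $\sigma = 0.5$, are ours) computes the sample mean $Y_n/n$ and the standardized sum $Z_n$ for one run.

```python
# Sketch: one simulation run illustrating the law of large numbers and
# the central limit theorem for fair coin tosses (mu = 0.5, sigma = 0.5).
import random
from math import sqrt

random.seed(42)                  # fixed seed for reproducibility
n = 10000
tosses = [random.randint(0, 1) for _ in range(n)]
y_n = sum(tosses)

# Law of large numbers: the sample mean Y_n / n should be close to mu.
print(y_n / n)

# Central limit theorem: Z_n = (Y_n - n*mu) / (sigma * sqrt(n)) behaves
# like a single observation from the standard normal distribution.
z_n = (y_n - n * 0.5) / (0.5 * sqrt(n))
print(z_n)
```

Repeating the run with many seeds and histogramming the $z_n$ values would reproduce the familiar bell curve.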
General Spaces
Our next goal is to define convergence of probability distributions on more general measurable spaces. For this discussion, you
may need to refer to other sections in this chapter: the integral with respect to a positive measure, properties of the integral, and
density functions. In turn, these sections depend on measure theory developed in the chapters on Foundations and Probability
Measures.
We assume that (S, d) is a complete, separable metric space and let S denote the Borel σ-algebra of subsets of S , that is, the
σ-algebra generated by the topology. The standard spaces that we often use are special cases of the measurable space (S, S ):
1. Discrete: S is countable and is given the discrete metric so S is the collection of all subsets of S .
2. Euclidean: $\mathbb{R}^n$ is given the standard Euclidean metric, so $\mathcal{R}_n$ is the usual $\sigma$-algebra of Borel measurable subsets of $\mathbb{R}^n$.
Additional details
As suggested by our setup, the definition for convergence in distribution involves both measure theory and topology. The
motivation is the theorem above for the one-dimensional Euclidean space (R, R).
Convergence in distribution:
1. Suppose that $P_n$ is a probability measure on $(S, \mathscr{S})$ for each $n \in \mathbb{N}_+^*$. Then $P_n$ converges (weakly) to $P_\infty$ as $n \to \infty$ if $P_n(A) \to P_\infty(A)$ as $n \to \infty$ for every $A \in \mathscr{S}$ with $P_\infty(\partial A) = 0$. We write $P_n \Rightarrow P_\infty$ as $n \to \infty$.
2. Suppose that $X_n$ is a random variable with distribution $P_n$ on $(S, \mathscr{S})$ for each $n \in \mathbb{N}_+^*$. Then $X_n$ converges in distribution to $X_\infty$ as $n \to \infty$ if $P_n \Rightarrow P_\infty$ as $n \to \infty$. We write $X_n \to X_\infty$ as $n \to \infty$ in distribution.
Notes
Let's consider our two special cases. In the discrete case, as usual, the measure theory and topology are not really necessary.
Proof
In the Euclidean case, it suffices to consider distribution functions, as in the one-dimensional case. If $P$ is a probability measure on $(\mathbb{R}^n, \mathcal{R}_n)$, recall that the distribution function $F$ of $P$ is given by
$$F(x_1, x_2, \ldots, x_n) = P\left((-\infty, x_1] \times (-\infty, x_2] \times \cdots \times (-\infty, x_n]\right), \quad (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n \tag{3.8.21}$$
Convergence in Probability
As in the case of (R, R), convergence in probability implies convergence in distribution.
Suppose that $X_n$ is a random variable with values in $S$ for each $n \in \mathbb{N}_+^*$, all defined on the same probability space. If $X_n \to X_\infty$ as $n \to \infty$ in probability then $X_n \to X_\infty$ as $n \to \infty$ in distribution.
Notes
So as before, convergence with probability 1 implies convergence in probability which in turn implies convergence in distribution.
Suppose that $P_n$ is a probability measure on $(S, \mathscr{S})$ for each $n \in \mathbb{N}_+^*$ and that $P_n \Rightarrow P_\infty$ as $n \to \infty$. Then there exists a random variable $X_n$ with values in $S$ for each $n \in \mathbb{N}_+^*$, defined on a common probability space, such that
1. $X_n$ has distribution $P_n$ for each $n \in \mathbb{N}_+^*$.
2. $X_n \to X_\infty$ as $n \to \infty$ with probability 1.
One of the main consequences of Skorohod's representation, the preservation of convergence in distribution under continuous
functions, is still true and has essentially the same proof. For the general setup, suppose that (S, d, S ) and (T , e, T ) are spaces of
the type described above.
Suppose that $X_n$ is a random variable with values in $S$ for each $n \in \mathbb{N}_+^*$ (not necessarily defined on the same probability space). Suppose also that $g : S \to T$ is measurable, and let $D_g$ denote the set of discontinuities of $g$, and $P_\infty$ the distribution of $X_\infty$.
Proof
A simple consequence of the continuity theorem is that if a sequence of random vectors in $\mathbb{R}^n$ converges in distribution, then the sequence of each coordinate also converges in distribution. Let's just consider the two-dimensional case to keep the notation simple.
Scheffé's Theorem
Our next discussion concerns an important result known as Scheffé's theorem, named after Henry Scheffé. To state our theorem, suppose that $(S, \mathscr{S}, \mu)$ is a measure space, so that $S$ is a set, $\mathscr{S}$ is a $\sigma$-algebra of subsets of $S$, and $\mu$ is a positive measure on $(S, \mathscr{S})$. Further, suppose that $P_n$ is a probability measure on $(S, \mathscr{S})$ that has density function $f_n$ with respect to $\mu$ for each $n \in \mathbb{N}_+$, and that $P$ is a probability measure on $(S, \mathscr{S})$ that has density function $f$ with respect to $\mu$.
If $f_n(x) \to f(x)$ as $n \to \infty$ for almost all $x \in S$ (with respect to $\mu$), then $P_n(A) \to P(A)$ as $n \to \infty$ uniformly in $A \in \mathscr{S}$.
Proof
Of course, the most important special cases of Scheffé's theorem are to discrete distributions and to continuous distributions on a subset of $\mathbb{R}^n$, as in the theorem above on density functions.
Expected Value
Generating functions are studied in the chapter on Expected Value. In part, the importance of generating functions stems from the
fact that ordinary (pointwise) convergence of a sequence of generating functions corresponds to the convergence of the
distributions in the sense of this section. Often it is easier to show convergence in distribution using generating functions than
directly from the definition.
In addition, convergence in distribution has elegant characterizations in terms of the convergence of the expected values of certain
types of functions of the underlying random variables.
This page titled 3.8: Convergence in Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
3.9: General Distribution Functions
Our goal in this section is to define and study functions that play the same role for positive measures on R that (cumulative)
distribution functions do for probability measures on R. Of course probability measures on R are usually associated with real-
valued random variables. These general distribution functions are useful for constructing measures on R and will appear in our
study of integrals with respect to a measure in the next section, as well as non-homogeneous Poisson processes and general renewal
processes.
Basic Theory
Throughout this section, our basic measurable space is (R, R), where R is the σ-algebra of Borel measurable subsets of R, and as
usual, we will let λ denote Lebesgue measure on (R, R). As with cumulative distribution functions, it's convenient to have
compact notation for the limits of a function F : R → R from the left and right at x ∈ R, and at ∞ and −∞ (assuming of course
that these limits exist):
$$F(x^+) = \lim_{t \downarrow x} F(t), \quad F(x^-) = \lim_{t \uparrow x} F(t), \quad F(\infty) = \lim_{t \to \infty} F(t), \quad F(-\infty) = \lim_{t \to -\infty} F(t) \tag{3.9.1}$$
Since $F$ is increasing, $F(x^-)$ exists in $\mathbb{R}$. Similarly $F(\infty)$ exists, as a real number or $\infty$, and $F(-\infty)$ exists, as a real number or $-\infty$.
If $F$ is a distribution function on $\mathbb{R}$, then there exists a unique positive measure $\mu$ on $\mathbb{R}$ that satisfies $\mu(a, b] = F(b) - F(a)$ for $a, b \in \mathbb{R}$ with $a < b$.
Proof
The measure μ is called the Lebesgue-Stieltjes measure associated with F , named for Henri Lebesgue and Thomas Joannes
Stieltjes. A very rich variety of measures on R can be constructed in this way. In particular, when the function F takes values in
[0, 1], the associated measure P is a probability measure. Another special case of interest is the distribution function defined by
F (x) = x for x ∈ R , in which case μ(a, b] is the length of the interval (a, b] and therefore μ = λ , Lebesgue measure on R . But
although the measure associated with a distribution function is unique, the distribution function itself is not. Note that if c ∈ R then
the distribution function defined by F (x) = x + c for x ∈ R also generates Lebesgue measure. This example captures the general
situation.
Suppose that F and G are distribution functions that generate the same measure μ on R . Then there exists c ∈ R such that
G = F +c .
Proof
Returning to the case of a probability measure P on R, the cumulative distribution function F that we studied in this chapter is the
unique distribution function satisfying F (−∞) = 0 . More generally, having constructed a measure from a distribution function,
let's now consider the complementary problem of finding a distribution function for a given measure. The proof of the last theorem
points the way.
Suppose that μ is a positive measure on (R, R) with the property that μ(A) < ∞ if A is bounded. Then there exists a
distribution function that generates μ .
Proof
In the proof of the last theorem, the use of 0 as a “reference point” is arbitrary, of course. Any other point in R would do as well,
and would produce a distribution function that differs from the one in the proof by a constant. If μ has the property that
μ(−∞, x] < ∞ for x ∈ R, then it's easy to see that F defined by F (x) = μ(−∞, x] for x ∈ R is a distribution function that
generates μ , and is the unique distribution function with F (−∞) = 0 . Of course, in the case of a probability measure, this is the
cumulative distribution function, as noted above.
Properties
General distribution functions enjoy many of the same properties as the cumulative distribution function (but not all because of the
lack of uniqueness). In particular, we can easily compute the measure of any interval from the distribution function.
Suppose that $F$ is a distribution function and $\mu$ is the positive measure on $(\mathbb{R}, \mathcal{R})$ associated with $F$. For $a, b \in \mathbb{R}$ with $a < b$,
1. $\mu[a, b] = F(b) - F(a^-)$
2. $\mu\{a\} = F(a) - F(a^-)$
3. $\mu(a, b) = F(b^-) - F(a)$
4. $\mu[a, b) = F(b^-) - F(a^-)$
Proof
Note that F is continuous at x ∈ R if and only if μ{x} = 0. In particular, μ is a continuous measure (recall that this means that
μ{x} = 0 for all x ∈ R ) if and only if F is continuous on R . On the other hand, F is discontinuous at x ∈ R if and only if
μ{x} > 0, so that μ has an atom at x. So μ is a discrete measure (recall that this means that μ has countable support) if and only if
F is a step function.
Suppose again that $F$ is a distribution function and $\mu$ is the positive measure on $(\mathbb{R}, \mathcal{R})$ associated with $F$. If $a \in \mathbb{R}$ then
1. $\mu(a, \infty) = F(\infty) - F(a)$
2. $\mu[a, \infty) = F(\infty) - F(a^-)$
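The interval formulas are concrete enough to verify numerically. In the Python sketch below (not from the text), we use a hypothetical distribution function $F(x) = x$ for $x < 1$ and $F(x) = x + 2$ for $x \ge 1$, whose Lebesgue-Stieltjes measure is length plus an atom of mass 2 at the jump point 1.

```python
# Sketch: the interval formulas mu[a,b] = F(b) - F(a^-), etc., for a
# distribution function with a single jump of size 2 at x = 1.
def F(x):
    return x + 2 if x >= 1 else x

def F_left(x):
    # left limit F(x^-): equals F(x) everywhere except at the jump x = 1
    return x if x <= 1 else x + 2

a, b = 0, 1
print("mu[a,b] =", F(b) - F_left(a))       # F(b) - F(a^-) = 3
print("mu{b}   =", F(b) - F_left(b))       # the atom at 1 has mass 2
print("mu(a,b) =", F_left(b) - F(a))       # F(b^-) - F(a) = 1
print("mu[a,b) =", F_left(b) - F_left(a))  # F(b^-) - F(a^-) = 1
```

Note how the closed interval $[0, 1]$ picks up the atom at 1 while the open interval $(0, 1)$ does not.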
The discrete case. Suppose that $G$ is discrete, so that there exists a countable set $C \subset [0, \infty)$ with $G(C^c) = 0$. Let $g(t) = G\{t\}$ for $t \in C$, so that $g$ is the density function of $G$ with respect to counting measure on $C$. If $u : [0, \infty) \to \mathbb{R}$ is locally bounded then
$$\int_{[0, t]} u \, dG = \sum_{s \in C, \, s \le t} u(s) g(s)$$
The continuous case. Suppose that $G$ is absolutely continuous with respect to Lebesgue measure on $[0, \infty)$ with density function $g : [0, \infty) \to [0, \infty)$. If $u : [0, \infty) \to \mathbb{R}$ is locally bounded then
$$\int_{[0, t]} u \, dG = \int_0^t u(s) g(s) \, ds$$
3.9.2 https://stats.libretexts.org/@go/page/10149
Figure 3.9.2 : A continuous measure
The mixed case. Suppose that there exists a countable set $C \subset [0, \infty)$ with $G(C) > 0$ and $G(C^c) > 0$, and that $G$ restricted to subsets of $C^c$ is absolutely continuous with respect to Lebesgue measure. Let $g(t) = G\{t\}$ for $t \in C$ and let $h$ be a density with respect to Lebesgue measure of $G$ restricted to subsets of $C^c$. If $u : [0, \infty) \to \mathbb{R}$ is locally bounded then
$$\int_{[0, t]} u \, dG = \sum_{s \in C, \, s \le t} u(s) g(s) + \int_0^t u(s) h(s) \, ds$$
This page titled 3.9: General Distribution Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
3.10: The Integral With Respect to a Measure
Probability density functions have very different interpretations for discrete distributions as opposed to continuous distributions.
For a discrete distribution, the probability of an event is computed by summing the density function over the outcomes in the event,
while for a continuous distribution, the probability is computed by integrating the density function over the outcomes. For a mixed distribution, we have partial discrete and continuous density functions, and the probability of an event is computed by summing and integrating. The various types of density functions can be unified under a general theory of integration, which is the subject of this
section. This theory has enormous importance in probability, far beyond just density functions. Expected value, which we consider
in the next chapter, can be interpreted as an integral with respect to a probability measure. Beyond probability, the general theory of
integration is of fundamental importance in many areas of mathematics.
Basic Theory
Definitions
Our starting point is a measure space (S, S , μ). That is, S is a set, S is a σ-algebra of subsets of S , and μ is a positive measure
on S . As usual, the most important special cases are
1. Discrete: $S$ is a countable set, $\mathscr{S}$ is the collection of all subsets of $S$, and $\mu = \#$ is counting measure.
2. Euclidean space: $S = \mathbb{R}^n$ for some $n \in \mathbb{N}_+$, $\mathscr{S} = \mathcal{R}_n$, the $\sigma$-algebra of Lebesgue measurable subsets of $\mathbb{R}^n$, and $\mu = \lambda_n$, $n$-dimensional Lebesgue measure.
Consider a statement with x ∈ S as a free variable. Technically such a statement is a predicate on S . Suppose that A ∈ S .
1. The statement holds on A if it is true for every x ∈ A .
2. The statement holds almost everywhere on A (with respect to μ ) if there exists B ∈ S with B ⊆ A such that the statement
holds on B and μ(A ∖ B) = 0 .
A typical statement that we have in mind is an equation or an inequality with $x \in S$ as a free variable. Our goal is to define the integral of certain measurable functions $f : S \to \mathbb{R}$, with respect to the measure $\mu$. The integral may exist as a number in $\mathbb{R}$ (in which case we say that $f$ is integrable), or may exist as $\infty$ or $-\infty$, or may not exist at all. When it exists, the integral is denoted variously by
$$\int_S f \, d\mu, \quad \int_S f(x) \, d\mu(x), \quad \int_S f(x) \, \mu(dx)$$
Since the integral may take infinite values, we first need to review the arithmetic of $\infty$ and $-\infty$. Here are the conventions that are appropriate for integration:
Arithmetic on $\mathbb{R}^*$
However, ∞ − ∞ is not defined (because it does not make consistent sense) and we must be careful never to produce this
indeterminate form. You might recall from calculus that 0 ⋅ ∞ is also an indeterminate form. However, for the theory of
integration, the convention that 0 ⋅ ∞ = 0 is convenient and consistent. In terms of order of course, −∞ < a < ∞ for a ∈ R .
3.10.1 https://stats.libretexts.org/@go/page/10150
We also need to extend topology and measure to $\mathbb{R}^*$. In terms of the first, $(a, \infty]$ is an open neighborhood of $\infty$ and $[-\infty, a)$ is an open neighborhood of $-\infty$ for every $a \in \mathbb{R}$. This ensures that if $x_n \in \mathbb{R}^*$ for $n \in \mathbb{N}_+$ then $x_n \to \infty$ or $x_n \to -\infty$ as $n \to \infty$ has its usual calculus meaning. Technically this topology results in the two-point compactification of $\mathbb{R}$. Now we can give $\mathbb{R}^*$ the Borel $\sigma$-algebra $\mathcal{R}^*$, that is, the $\sigma$-algebra generated by the topology. Basically, this simply means that if $A \in \mathcal{R}$ then $A \cup \{\infty\}$, $A \cup \{-\infty\}$, and $A \cup \{-\infty, \infty\}$ are also in $\mathcal{R}^*$.
Desired Properties
As motivation for the definition, every version of integration should satisfy some basic properties. First, the integral of the indicator
function of a measurable set should simply be the size of the set, as measured by μ . This gives our first definition:
If $A \in \mathscr{S}$ then $\int_S \mathbf{1}_A \, d\mu = \mu(A)$.
This definition hints at the intimate relationship between measure and integration. We will construct the integral from the measure
μ in this section, but this first property shows that if we started with the integral, we could recover the measure. This property also
shows why we need ∞ as a possible value of the integral, and coupled with some of the properties below, why −∞ is also needed.
Here is a simple corollary of our first definition.
$\int_S 0 \, d\mu = 0$
Proof
We give three more essential properties that we want. First are the linearity properties in two parts—part (a) is the additive
property and part (b) is the scaling property.
To be more explicit, we want the additivity property (a) to hold if at least one of the integrals on the right is finite, or if both are ∞
or if both are −∞ . What is ruled out are the two cases where one integral is ∞ and the other is −∞ , and this is what is meant by
the indeterminate form ∞ − ∞ . Our next essential properties are the order properties, again in two parts—part (a) is the positive
property and part (b) is the increasing property.
The positive property and the additive property imply the increasing property
Our last essential property is perhaps the least intuitive, but is a type of continuity property of integration, and is closely related to
the continuity property of positive measure. The official name is the monotone convergence theorem.
function). This property shows that it is sometimes convenient to allow nonnegative functions to take the value $\infty$. Note also that by the increasing property, $\int_S f_n \, d\mu$ is increasing in $n$ and hence also has a limit in $\mathbb{R} \cup \{\infty\}$.
To see the connection with measure, suppose that $(A_1, A_2, \ldots)$ is an increasing sequence of sets in $\mathscr{S}$, and let $A = \bigcup_{i=1}^\infty A_i$. Note that $\mathbf{1}_{A_n}$ is increasing in $n \in \mathbb{N}_+$ and $\mathbf{1}_{A_n} \to \mathbf{1}_A$ as $n \to \infty$. For this reason, the union $A$ is sometimes called the limit of $A_n$ as $n \to \infty$. The continuity theorem of positive measure states that $\mu(A_n) \to \mu(A)$ as $n \to \infty$. Equivalently, $\int_S \mathbf{1}_{A_n} \, d\mu \to \int_S \mathbf{1}_A \, d\mu$ as $n \to \infty$, so the continuity theorem of positive measure is a special case of the monotone convergence theorem.
Armed with the properties that we want, the definition of the integral is fairly straightforward, and proceeds in stages. We give the
definition successively for
1. Nonnegative simple functions
2. Nonnegative measurable functions
3. Measurable real-valued functions
Of course, each definition should agree with the previous one on the functions that are in both collections.
Simple Functions
A simple function on S is simply a measurable, real-valued function with finite range. Simple functions are usually expressed as
linear combinations of indicator functions.
2. A simple function $f$ has a unique representation as $f = \sum_{j \in J} b_j \mathbf{1}_{B_j}$ where $J$ is a finite index set, $\{b_j : j \in J\}$ is a set of distinct real numbers, and $\{B_j : j \in J\}$ is a collection of nonempty sets in $\mathscr{S}$ that partition $S$. This representation is called the canonical representation.
You might wonder why we don't just always use the canonical representation for simple functions. The problem is that even if we
start with canonical representations, when we combine simple functions in various ways, the resulting representations may not be
canonical. The collection of simple functions is closed under the basic arithmetic operations, and in particular, forms a vector
space.
Proof
As we alluded to earlier, note that even if the representations of f and g are canonical, the representations for f + g and fg may
not be. The next result treats composition, and will be important for the change of variables theorem in the next section.
Suppose that $(T, \mathscr{T})$ is another measurable space, and that $f : S \to T$ is measurable. If $g$ is a simple function on $T$ with representation $g = \sum_{i \in I} b_i \mathbf{1}_{B_i}$, then $g \circ f$ is a simple function on $S$ with representation $g \circ f = \sum_{i \in I} b_i \mathbf{1}_{f^{-1}(B_i)}$.
Proof
Given the definition of the integral of an indicator function in (3) and that we want the linearity property (5) to hold, there is no question as to how we should define the integral of a nonnegative simple function. If $f$ is a nonnegative simple function with the representation $f = \sum_{i \in I} a_i \mathbf{1}_{A_i}$, we define
$$\int_S f \, d\mu = \sum_{i \in I} a_i \mu(A_i) \tag{3.10.4}$$
Note that if $f$ is a nonnegative simple function, then $\int_S f \, d\mu$ exists in $[0, \infty]$, so the order properties hold. We next show that the linearity properties hold.
Suppose that $f$ and $g$ are nonnegative simple functions, and that $c \in [0, \infty)$. Then
1. $\int_S (f + g) \, d\mu = \int_S f \, d\mu + \int_S g \, d\mu$
2. $\int_S c f \, d\mu = c \int_S f \, d\mu$
Proof
Proof
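The definition $\int_S f \, d\mu = \sum_{i \in I} a_i \mu(A_i)$ is easy to make concrete when the measure is discrete. The Python sketch below (not from the text; the set, measure, and representation are ours) integrates a nonnegative simple function against a measure on a finite set.

```python
# Sketch: integrating a nonnegative simple function against a discrete
# measure on a finite set S, directly from the defining formula.
S = {"a", "b", "c", "d"}
mu = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 3.0}   # mu{x} for each point x

def measure(A):
    # measure of a subset A of S: sum of the point masses
    return sum(mu[x] for x in A)

# simple function f = 2 * 1_{a,b} + 5 * 1_{c}  (and 0 elsewhere)
representation = [(2.0, {"a", "b"}), (5.0, {"c"})]

# integral of f: sum over the representation of a_i * mu(A_i)
integral = sum(a * measure(A) for a, A in representation)
print(integral)   # 2*(1+2) + 5*0.5 = 8.5
```

Here the integral reduces to a finite weighted sum, which is exactly the discrete special case described at the start of this section.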
Next we give a version of the continuity theorem in (7) for simple functions. It's not completely general, but will be needed for the
next subsection where we do prove the general version.
Suppose that $f$ is a nonnegative simple function and that $(A_1, A_2, \ldots)$ is an increasing sequence of sets in $\mathscr{S}$ with $A = \bigcup_{n=1}^\infty A_n$. Then
$$\int_S \mathbf{1}_{A_n} f \, d\mu \to \int_S \mathbf{1}_A f \, d\mu \text{ as } n \to \infty \tag{3.10.12}$$
Proof
Nonnegative Functions
Next we will consider nonnegative measurable functions on S . First we note that a function of this type is the limit of nonnegative
simple functions.
Suppose that $f : S \to [0, \infty)$ is measurable. Then there exists an increasing sequence $(f_1, f_2, \ldots)$ of nonnegative simple functions with $f_n \to f$ on $S$ as $n \to \infty$.
Proof
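One standard way to build such a sequence (a sketch, not the text's proof) is the dyadic construction $f_n(x) = \min\bigl(\lfloor 2^n f(x) \rfloor / 2^n, \, n\bigr)$: each $f_n$ takes finitely many values, the sequence is increasing, and it converges to $f$ pointwise. The Python sketch below illustrates this at a single point.

```python
# Sketch: the dyadic construction of an increasing sequence of simple
# functions f_n converging pointwise to a nonnegative function f.
from math import floor

def simple_approx(f, n):
    # f_n takes values k / 2^n for k = 0, 1, ..., n * 2^n (capped at n)
    return lambda x: min(floor(f(x) * 2**n) / 2**n, n)

f = lambda x: x * x                    # an arbitrary nonnegative function
x = 1.7
for n in [1, 2, 4, 8, 16]:
    print(n, simple_approx(f, n)(x))   # increases toward f(x) = 1.7**2
```

Rounding down to the dyadic grid makes each $f_n \le f$, and refining the grid while raising the cap makes the sequence increase to $f$.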
The last result points the way towards the definition of the integral of a measurable function $f : S \to [0, \infty)$ in terms of the integrals of simple functions. If $g$ is a nonnegative simple function with $g \le f$, then by the order property, we need $\int_S g \, d\mu \le \int_S f \, d\mu$. On the other hand, there exists a sequence of nonnegative simple functions converging to $f$. Thus the continuity property suggests that we define
$$\int_S f \, d\mu = \sup\left\{\int_S g \, d\mu : g \text{ is nonnegative simple and } g \le f\right\}$$
Note that $\int_S f \, d\mu$ exists in $[0, \infty]$ so the positive property holds. Note also that if $f$ is simple, the new definition agrees with the old one. As always, we need to establish the essential properties. First, the increasing property holds.
We can now prove the continuity property known as the monotone convergence theorem in full generality.
Proof
If $f : S \to [0, \infty)$ is measurable, then by the theorem above, there exists an increasing sequence $(f_1, f_2, \ldots)$ of simple functions with $f_n \to f$ as $n \to \infty$. By the monotone convergence theorem in (18), $\int_S f_n \, d\mu \to \int_S f \, d\mu$ as $n \to \infty$. These two facts can be used to establish other properties of the integral of a nonnegative function based on our knowledge that the properties hold for simple functions. This type of argument is known as bootstrapping. We use bootstrapping to show that the linearity properties hold:
If $f, g : S \to [0, \infty)$ are measurable and $c \in [0, \infty)$, then
1. $\int_S (f + g) \, d\mu = \int_S f \, d\mu + \int_S g \, d\mu$
2. $\int_S c f \, d\mu = c \int_S f \, d\mu$
Proof
General Functions
Our final step is to define the integral of a measurable function f : S → R. First, recall the positive and negative parts of x ∈ R:
x⁺ = max{x, 0},  x⁻ = max{−x, 0}  (3.10.19)
Note that x⁺ ≥ 0, x⁻ ≥ 0, x = x⁺ − x⁻, and |x| = x⁺ + x⁻. Given that we want the integral to have the linearity properties in (5), there is no question as to how we should define the integral of f in terms of the integrals of f⁺ and f⁻, which, being nonnegative and measurable, have well-defined integrals.
If f : S → R is measurable, we define
∫_S f dμ = ∫_S f⁺ dμ − ∫_S f⁻ dμ  (3.10.20)
assuming that at least one of the integrals on the right is finite. If both are finite, then f is said to be integrable.
Assuming that either the integral of the positive part or the integral of the negative part is finite ensures that we do not get the
dreaded indeterminate form ∞ − ∞ .
Note that if f is nonnegative, then our new definition agrees with our old one, since f⁺ = f and f⁻ = 0. For simple functions the integral has the same basic form as for nonnegative simple functions:
Suppose that f is a simple function with the representation f = ∑_{i∈I} a_i 1_{A_i}. Then
∫_S f dμ = ∑_{i∈I} a_i μ(A_i)  (3.10.21)
assuming that the sum does not have both ∞ and −∞ terms.
Proof
Once again, we need to establish the essential properties. Our first result is an intermediate step towards linearity.
Proof
In particular, note that if f and g are integrable, then so are f + g and cf for c ∈ R. Thus, the set of integrable functions on (S, S, μ) forms a vector space, which is denoted L(S, S, μ). The L is in honor of Henri Lebesgue, who first developed the theory. This vector space, and other related ones, will be studied in more detail in the section on function spaces.
We also have the increasing property in full generality.
If f, g : S → R are measurable functions whose integrals exist, and if f ≤ g on S then ∫_S f dμ ≤ ∫_S g dμ.
Proof
If A ∈ S, we define the integral of f over A by
∫_A f dμ = ∫_S 1_A f dμ  (3.10.31)
If f : S → R is a measurable function whose integral exists and A ∈ S, then the integral of f over A exists.
Proof
restriction of μ to S_A. It follows that all of the essential properties hold for integrals over A: the linearity properties, the order properties, and the monotone convergence theorem. The following property is a simple consequence of the general additivity property, and is known as the additive property for disjoint domains.
Suppose that f : S → R is a measurable function whose integral exists, and that A, B ∈ S are disjoint. Then
∫_{A∪B} f dμ = ∫_A f dμ + ∫_B f dμ  (3.10.32)
Proof
By induction, the additive property holds for a finite collection of disjoint domains. The extension to a countably infinite collection
of disjoint domains will be considered in the next section on properties of the integral.
Special Cases
Discrete Spaces
Recall again that the measure space (S, S, #) is discrete if S is countable, S is the collection of all subsets of S, and # is counting measure on S. Thus all functions f : S → R are measurable, and as we will see, integrals with respect to # are simply sums.
If f : S → R then
∫_S f d# = ∑_{x∈S} f(x)  (3.10.34)
as long as either the sum of the positive terms or the sum of the negative terms is finite.
Proof
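As a concrete illustration (toy values, assumed for the example), the integral with respect to counting measure on a finite set reduces to a finite sum, and the f⁺/f⁻ decomposition mirrors the general definition:

```python
# Counting measure # on a finite set S: ∫_S f d# is just a sum.
S = [-3, -1, 0, 2, 5]
f = {x: x for x in S}                  # toy signed function f(x) = x

pos = sum(max(f[x], 0) for x in S)     # ∫ f⁺ d# = 7
neg = sum(max(-f[x], 0) for x in S)    # ∫ f⁻ d# = 4
integral = pos - neg                   # ∫ f d# = ∑_{x∈S} f(x)
print(integral, sum(f.values()))       # 3 3
```

Since both the positive and negative parts are finite here, f is integrable with respect to #.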
If the sum of the positive terms and the sum of the negative terms are both finite, then f is integrable with respect to #, but the usual term from calculus is that the series ∑_{x∈S} f(x) is absolutely convergent. The result will look more familiar in the special case S = N₊. Functions on S are simply sequences, so we can use the more familiar notation a_i rather than a(i) for a function a : S → R. Part (b) of the proof (with A_n = {1, 2, …, n}) is just the definition of an infinite series of nonnegative terms as the limit of the partial sums:
∑_{i=1}^∞ a_i = lim_{n→∞} ∑_{i=1}^n a_i  (3.10.36)
Part (c) of the proof is just the definition of a general infinite series
∑_{i=1}^∞ a_i = ∑_{i=1}^∞ a_i⁺ − ∑_{i=1}^∞ a_i⁻  (3.10.37)
as long as one of the series on the right is finite. Again, when both are finite, the series is absolutely convergent. In calculus we also consider conditionally convergent series. This means that ∑_{i=1}^∞ a_i⁺ = ∞ and ∑_{i=1}^∞ a_i⁻ = ∞, but lim_{n→∞} ∑_{i=1}^n a_i exists in R. Such series have no place in general integration theory. Also, you may recall that such series are pathological in the sense that, given any number in R*, there exists a rearrangement of the terms so that the rearranged series converges to the given number.
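The rearrangement phenomenon is easy to see computationally. The sketch below greedily rearranges the alternating harmonic series (which sums to ln 2 in its natural order) so that the partial sums approach a target of 2; the function name and term count are illustrative choices.

```python
from itertools import count

def rearrange_alternating_harmonic(target, n_terms=100000):
    # Riemann rearrangement: take positive terms 1, 1/3, 1/5, ... while the
    # partial sum is below the target, and negative terms -1/2, -1/4, ...
    # while it is above. Both signed parts diverge, so any target works.
    pos = (1.0 / k for k in count(1, 2))
    neg = (-1.0 / k for k in count(2, 2))
    s = 0.0
    for _ in range(n_terms):
        s += next(pos) if s <= target else next(neg)
    return s

print(rearrange_alternating_harmonic(2.0))  # ≈ 2.0, not ln 2 ≈ 0.693
```

The error after the loop is bounded by the size of the last term used, which tends to 0 as more terms are consumed.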
Consider now the standard one-dimensional Euclidean space (R, R, λ), where λ is Lebesgue measure, so that we can integrate a measurable function f : R → R over a set A ∈ R. It's not surprising that in this special case, the theory of integration is referred to as Lebesgue integration in honor of our good friend Henri Lebesgue, who first developed the theory.
On the other hand, we already have a theory of integration on R, namely the Riemann integral of calculus, named for our other good friend Bernhard Riemann. For a suitable function f and domain A this integral is denoted ∫_A f(x) dx, as we all remember from
calculus. How are the two integrals related? As we will see, the Lebesgue integral generalizes the Riemann integral.
To understand the connection we need to review the definition of the Riemann integral. Consider first the standard case where the
domain of integration is a closed, bounded interval. Here are the preliminary definitions that we will need.
1. A partition of [a, b] is a finite collection A = {A_i : i ∈ I} of disjoint subintervals of [a, b] whose union is [a, b].
2. The norm of a partition A is ∥A∥ = max{λ(A_i) : i ∈ I}, the length of the largest subinterval of A.
3. A set of points B = {x_i : i ∈ I} where x_i ∈ A_i for each i ∈ I is said to be associated with the partition A.
4. The Riemann sum of f corresponding to a partition A and a set B associated with A is
R(f, A, B) = ∑_{i∈I} f(x_i) λ(A_i)  (3.10.38)
Note that the Riemann sum is simply the integral of the simple function g = ∑_{i∈I} f(x_i) 1_{A_i}. Moreover, since A_i is an interval for each i ∈ I, g is a step function: it is constant on a finite collection of disjoint intervals. Again since A_i is an interval for each i ∈ I, λ(A_i) is simply the length of the subinterval A_i, so measure theory per se is not needed for the definition of the Riemann integral.
f is Riemann integrable on [a, b] if there exists r ∈ R with the property that for every ϵ > 0 there exists δ > 0 such that if A
is a partition of [a, b] with ∥A∥ < δ then |r − R (f , A , B)| < ϵ for every set of points B associated with A . Then of course
we define the integral by
b
∫ f (x) dx = r (3.10.39)
a
If f is Riemann integrable on [a, b], then f is also Lebesgue integrable on [a, b], and
∫_{[a,b]} f dλ = ∫_a^b f(x) dx  (3.10.40)
On the other hand, there are lots of functions that are Lebesgue integrable but not Riemann integrable. In fact there are indicator
functions of this type, the simplest of functions from the point of view of Lebesgue integration.
Consider the function 1_Q where as usual, Q is the set of rational numbers in R. Then
1. ∫_R 1_Q dλ = 0.
2. 1_Q is not Riemann integrable on any interval [a, b] with a < b.
Proof
f : [a, b] → R is Riemann integrable on [a, b] if and only if f is bounded on [a, b] and f is continuous almost everywhere on
[a, b] .
Now that the Riemann integral is defined for a closed bounded interval, it can be extended to other domains.
1. If f is defined on [a, b) and Riemann integrable on [a, t] for a < t < b, we define ∫_a^b f(x) dx = lim_{t↑b} ∫_a^t f(x) dx if the limit exists in R*.
2. If f is defined on (a, b] and Riemann integrable on [t, b] for a < t < b, we define ∫_a^b f(x) dx = lim_{t↓a} ∫_t^b f(x) dx if the limit exists in R*.
3. If f is defined on (a, b), we select c ∈ (a, b) and define ∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx if the integrals on the right exist in R* by (a) and (b), and are not of the form ∞ − ∞.
4. If f is defined on [a, ∞) and Riemann integrable on [a, t] for a < t < ∞, we define ∫_a^∞ f(x) dx = lim_{t→∞} ∫_a^t f(x) dx if the limit exists in R*.
5. If f is defined on (−∞, b] and Riemann integrable on [t, b] for −∞ < t < b, we define ∫_{−∞}^b f(x) dx = lim_{t→−∞} ∫_t^b f(x) dx if the limit exists in R*.
6. If f is defined on R, we select c ∈ R and define ∫_{−∞}^∞ f(x) dx = ∫_{−∞}^c f(x) dx + ∫_c^∞ f(x) dx if both integrals on the right exist in R* and are not of the form ∞ − ∞.
As another indication of its superiority, note that none of these convolutions is necessary for the Lebesgue integral. Once and for all, we have defined ∫_A f(x) dx for a general measurable function f : R → R and a general domain A ∈ R.
Recall that F satisfies some, but not necessarily all of the properties of a probability distribution function. The properties not
necessarily satisfied are the normalizing properties
F (x) → 0 as x → −∞
F (x) → 1 as x → ∞
If F does satisfy these two additional properties, then μ is a probability measure and F its probability distribution function.
The integral with respect to the measure μ is, appropriately enough, referred to as the Lebesgue-Stieltjes integral with respect to F ,
and like the measure, is named for the ubiquitous Henri Lebesgue and for Thomas Stieltjes. In addition to our usual notation
∫_S f dμ, the Lebesgue-Stieltjes integral is also denoted ∫_S f dF and ∫_S f(x) dF(x).
Probability Spaces
Suppose that (S, S , P) is a probability space, so that S is the set of outcomes of a random experiment, S is the σ-algebra of
events, and P the probability measure on the sample space (S, S ). A measurable, real-valued function X on S is, of course, a real-
valued random variable. The integral with respect to P, if it exists, is the expected value of X and is denoted
E(X) = ∫_S X dP  (3.10.43)
This concept is of fundamental importance in probability theory and is studied in detail in a separate chapter on Expected Value,
mostly from an elementary point of view that does not involve abstract integration. However an advanced section treats expected
value as an integral over the underlying probability measure, as above.
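On a discrete probability space the integral E(X) = ∫_S X dP is a probability-weighted sum, which the following toy sketch (a fair six-sided die, an assumption for illustration) makes explicit:

```python
# E(X) = ∫_S X dP on a finite sample space: a probability-weighted sum.
S = [1, 2, 3, 4, 5, 6]
P = {s: 1 / 6 for s in S}           # fair-die probability measure
X = lambda s: s * s                 # random variable X(s) = s²

EX = sum(X(s) * P[s] for s in S)    # E(X) = ∑_s X(s) P{s} = 91/6
print(EX)
```

Here P assigns mass 1/6 to each outcome, so the integral is just the familiar expected-value formula from elementary probability.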
Suppose next that (T, T, #) is a discrete space and that X is a random variable for the experiment, taking values in T. In this case X has a discrete distribution and the probability density function f of X is given by f(x) = P(X = x) for x ∈ T. More generally,
P(X ∈ A) = ∑_{x∈A} f(x) = ∫_A f d#,  A ⊆ T  (3.10.44)
On the other hand, suppose that X is a random variable with values in Rⁿ, where as usual, (Rⁿ, R_n, λ_n) is n-dimensional Euclidean space. If X has a continuous distribution, then f : Rⁿ → [0, ∞) is a probability density function of X if
P(X ∈ A) = ∫_A f dλ_n,  A ∈ R_n  (3.10.45)
Technically, f is the density function of X with respect to counting measure # in the discrete case, and f is the density function of X with respect to Lebesgue measure λ_n in the continuous case. In both cases, the probability of an event A is computed by
integrating the density function, with respect to the appropriate measure, over A . There are still differences, however. In the
discrete case, the existence of the density function with respect to counting measure is guaranteed, and indeed we have an explicit
formula for it. In the continuous case, the existence of a density function with respect to Lebesgue measure is not guaranteed, and
indeed there might not be one. More generally, suppose that we have a measure space (T, T, μ) and a random variable X with values in T. A measurable function f : T → [0, ∞) is a probability density function of X (or more precisely, of the distribution of X) with respect to μ if
P(X ∈ A) = ∫_A f dμ,  A ∈ T  (3.10.46)
This fundamental question of the existence of a density function will be clarified in the section on absolute continuity and density
functions.
Suppose again that X is a real-valued random variable with distribution function F . Then, by definition, the distribution of X is the
Lebesgue-Stieltjes measure associated with F :
P(a < X ≤ b) = F (b) − F (a), a, b ∈ R, a < b (3.10.47)
regardless of whether the distribution is discrete, continuous, or mixed. Trivially, P(X ∈ A) = ∫_R 1_A dF for A ∈ R, and the expected value of X defined above can also be written as E(X) = ∫_R x dF(x). Again, all of this will be explained in much more detail later.
Computational Exercises
Let g(x) = 1/(1 + x²) for x ∈ R.
1. Find ∫_{−∞}^∞ g(x) dx.
2. Show that ∫_{−∞}^∞ x g(x) dx does not exist.
Answer
You may recall that the function g in the last exercise is important in the study of the Cauchy distribution, named for Augustin
Cauchy. You may also remember that the graph of g is known as the witch of Agnesi, named for Maria Agnesi.
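Both parts admit a quick numeric sanity check using a simple composite trapezoid rule in pure Python (the truncation limits and step counts below are arbitrary choices):

```python
import math

def trapezoid(f, a, b, n):
    # composite trapezoid rule for ∫_a^b f(x) dx with n subintervals
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        s += f(a + i * h)
    return s * h

g = lambda x: 1.0 / (1.0 + x * x)

# 1. ∫ over [-T, T] is 2 arctan(T), which tends to π as T → ∞:
total = trapezoid(g, -1000.0, 1000.0, 200000)
print(total, math.pi)  # close to π

# 2. ∫_0^T x g(x) dx = (1/2) ln(1 + T²) → ∞, so the positive part
#    (and by symmetry the negative part) diverges: no integral exists.
pos_parts = [trapezoid(lambda x: x * g(x), 0.0, T, 100000)
             for T in (10.0, 100.0, 1000.0)]
print(pos_parts)  # grows like (1/2) ln(1 + T²), without bound
```

Note that the symmetric truncations of ∫ x g(x) dx are all 0 by oddness, but that does not make the integral exist: both ∫ (xg)⁺ and ∫ (xg)⁻ are infinite.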
Let g(x) = 1/x^b for x ∈ [1, ∞) where b > 0 is a parameter. Find ∫_1^∞ g(x) dx.
Answer
You may recall that the function g in the last exercise is important in the study of the Pareto distribution, named for Vilfredo
Pareto.
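The answer can be checked in closed form: the antiderivative of x^(−b) shows that the improper integral converges exactly when b > 1. A small sketch (the function name and truncation point are illustrative):

```python
import math

def pareto_tail_integral(b, T):
    # ∫_1^T x^(-b) dx in closed form
    if b == 1.0:
        return math.log(T)
    return (T ** (1.0 - b) - 1.0) / (1.0 - b)

print(pareto_tail_integral(2.0, 1e9))  # → 1/(2−1) = 1 as T → ∞
print(pareto_tail_integral(3.0, 1e9))  # → 1/(3−1) = 1/2
print(pareto_tail_integral(1.0, 1e9))  # = ln T, diverges as T → ∞
```

So ∫_1^∞ x^(−b) dx = 1/(b − 1) for b > 1 and ∞ for 0 < b ≤ 1.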
2. Does ∫_0^π f(x) dx exist?
Answer
This page titled 3.10: The Integral With Respect to a Measure is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
3.11: Properties of the Integral
Basic Theory
Again our starting point is a measure space (S, S , μ) . That is, S is a set, S is a σ-algebra of subsets of S , and μ is a positive
measure on S .
Definition
In the last section we defined the integral of certain measurable functions f : S → R with respect to the measure μ. Recall that the integral, denoted ∫_S f dμ, may exist as a number in R (in which case f is integrable), or may exist as ∞ or −∞, or may fail to exist.
∫_S f dμ = ∑_{i∈I} a_i μ(A_i)  (3.11.1)
3. If f : S → R is measurable, then
∫_S f dμ = ∫_S f⁺ dμ − ∫_S f⁻ dμ  (3.11.3)
as long as the right side is not of the form ∞ − ∞, and where f⁺ and f⁻ denote the positive and negative parts of f.
∫_A f dμ = ∫_S 1_A f dμ  (3.11.4)
Consider a statement on the elements of S , for example an equation or an inequality with x ∈ S as a free variable. (Technically
such a statement is a predicate on S .) For A ∈ S , we say that the statement holds on A if it is true for every x ∈ A . We say that
the statement holds almost everywhere on A (with respect to μ ) if there exists B ∈ S with B ⊆ A such that the statement holds
on B and μ(A ∖ B) = 0 .
Basic Properties
A few properties of the integral that were essential to the motivation of the definition were given in the last section. In this section,
we extend some of those properties and we study a number of new ones. As a review, here is what we know so far.
6. If f : S → R is measurable and the integral of f on A ∪ B exists, where A, B ∈ S are disjoint, then
∫_{A∪B} f dμ = ∫_A f dμ + ∫_B f dμ.
3.11.1 https://stats.libretexts.org/@go/page/10151
Parts (a) and (b) are the linearity properties; part (a) is the additivity property and part (b) is the scaling property. Parts (c) and (d)
are the order properties; part (c) is the positive property and part (d) is the increasing property. Part (e) is a continuity property
known as the monotone convergence theorem. Part (f) is the additive property for disjoint domains. Properties (a)–(e) hold with S
replaced by A ∈ S .
Two functions that are indistinguishable from the point of view of μ must have the same integral.
Suppose that f : S → R is a measurable function whose integral exists. If g : S → R is measurable and g = f almost everywhere on S, then ∫_S g dμ = ∫_S f dμ.
Proof
2. ∫_S f dμ = 0 if and only if f = 0 almost everywhere on S.
Proof
So, if f ≥ 0 almost everywhere on S then ∫_S f dμ > 0 if and only if μ{x ∈ S : f(x) > 0} > 0. The simple extension of the increasing property is next.
Suppose that f, g : S → R are measurable functions whose integrals exist, and that f ≤ g almost everywhere on S. Then
1. ∫_S f dμ ≤ ∫_S g dμ
So if f ≤ g almost everywhere on S then, except in the two cases mentioned, ∫_S f dμ < ∫_S g dμ if and only if μ{x ∈ S : f(x) < g(x)} > 0. The exclusion when both integrals are ∞ or −∞ is important. A counterexample when this condition does not hold is given below. The next result is the absolute value inequality.
Suppose that f : S → R is a measurable function whose integral exists. Then |∫_S f dμ| ≤ ∫_S |f| dμ. If f is integrable, then equality holds if and only if f ≥ 0 almost everywhere on S or f ≤ 0 almost everywhere on S.
Proof
Change of Variables
Suppose that (T , T ) is another measurable space and that u : S → T is measurable. As we saw in our first study of positive
measures, ν defined by
ν(B) = μ[u⁻¹(B)],  B ∈ T  (3.11.11)
is a positive measure on (T , T ). The following result is known as the change of variables theorem.
If f : T → R is measurable, then, assuming that the integrals exist,
∫_T f dν = ∫_S (f ∘ u) dμ  (3.11.12)
Proof
The change of variables theorem will look more familiar if we give the variables explicitly. Thus, suppose that we want to evaluate ∫_T f(y) dν(y), where again, u : S → T and f : T → R. One way is to use the substitution y = u(x) and evaluate ∫_S f[u(x)] dμ(x) instead.
Convergence Properties
We start with a simple but important corollary of the monotone convergence theorem that extends the additivity property to a
countably infinite sum of nonnegative functions.
Suppose that f_n : S → [0, ∞) is measurable for each n ∈ N₊. Then
∫_S ∑_{n=1}^∞ f_n dμ = ∑_{n=1}^∞ ∫_S f_n dμ  (3.11.19)
Proof
A theorem below gives a related result that relaxes the assumption that the f_n be nonnegative, but imposes a stricter integrability requirement. Our next result is the additivity of the integral over a countably infinite collection of disjoint domains.
Suppose that f : S → R is a measurable function whose integral exists, and that {A_n : n ∈ N₊} is a disjoint collection of sets in S. Let A = ⋃_{n=1}^∞ A_n. Then
∫_A f dμ = ∑_{n=1}^∞ ∫_{A_n} f dμ  (3.11.20)
Proof
Of course, the previous theorem applies if f is nonnegative or if f is integrable. Next we give a minor extension of the monotone
convergence theorem that relaxes the assumption that the functions be nonnegative.
Monotone Convergence Theorem. Suppose that f_n : S → R is a measurable function whose integral exists for each n ∈ N₊ and that f_n is increasing in n on S. If ∫_S f_1 dμ > −∞ then ∫_S lim_{n→∞} f_n dμ = lim_{n→∞} ∫_S f_n dμ.
Proof
If f_n is decreasing in n on S and ∫_S f_1 dμ < ∞ then ∫_S lim_{n→∞} f_n dμ = lim_{n→∞} ∫_S f_n dμ.
Proof
The additional assumptions on the integral of f_1 in the last two extensions of the monotone convergence theorem are necessary. An example is given below.
Fatou's Lemma. Suppose that f_n : S → [0, ∞) is measurable for each n ∈ N₊. Then
∫_S lim inf_{n→∞} f_n dμ ≤ lim inf_{n→∞} ∫_S f_n dμ  (3.11.26)
Proof
Given the weakness of the hypotheses, it's hardly surprising that strict inequality can easily occur in Fatou's lemma. An example is
given below.
Our next convergence result is one of the most important and is known as the dominated convergence theorem. It's sometimes also
known as Lebesgue's dominated convergence theorem in honor of Henri Lebesgue, who first developed all of this stuff in the
context of Rⁿ. The dominated convergence theorem gives a basic condition under which we may interchange the limit and integration operators.
Dominated Convergence Theorem. Suppose that f_n : S → R is measurable for n ∈ N₊ and that lim_{n→∞} f_n exists on S. Suppose also that |f_n| ≤ g for n ∈ N₊ where g : S → [0, ∞) is integrable. Then
∫_S lim_{n→∞} f_n dμ = lim_{n→∞} ∫_S f_n dμ
Proof
As you might guess, the assumption that |f_n| is uniformly bounded in n by an integrable function is critical. A counterexample when this assumption is missing is given below. The dominated convergence theorem remains true if lim_{n→∞} f_n exists almost everywhere on S. The following corollary of the dominated convergence theorem gives a condition for the interchange of infinite sum and integral.
Suppose that f_i : S → R is measurable for i ∈ N₊ and that ∑_{i=1}^∞ |f_i| is integrable. Then
∫_S ∑_{i=1}^∞ f_i dμ = ∑_{i=1}^∞ ∫_S f_i dμ  (3.11.30)
Proof
The following corollary of the dominated convergence theorem is known as the bounded convergence theorem.
Bounded Convergence Theorem. Suppose that f_n : S → R is measurable for n ∈ N₊ and there exists A ∈ S such that μ(A) < ∞, lim_{n→∞} f_n exists on A, and |f_n| is bounded in n ∈ N₊ on A. Then
∫_A lim_{n→∞} f_n dμ = lim_{n→∞} ∫_A f_n dμ
Proof
Again, the bounded convergence theorem remains true if lim_{n→∞} f_n exists almost everywhere on A. For a finite measure space (and in
particular for a probability space), the condition that μ(A) < ∞ automatically holds.
Product Spaces
Suppose now that (S, S , μ) and (T , T , ν ) are σ-finite measure spaces. Please recall the basic facts about the product σ-algebra
S ⊗ T of subsets of S × T , and the product measure μ ⊗ ν on S ⊗ T . The product measure space (S × T , S ⊗ T , μ ⊗ ν ) is
the standard one that we use for product spaces. If f : S × T → R is measurable, there are three integrals we might consider. First,
of course, is the integral of f with respect to the product measure μ ⊗ ν, namely ∫_{S×T} f d(μ ⊗ ν), sometimes called a double integral in this context. But also we have the nested or iterated integrals where we integrate with respect to one variable at a time: ∫_S ∫_T f(x, y) dν(y) dμ(x) and ∫_T ∫_S f(x, y) dμ(x) dν(y).
How are these integrals related? Well, just as in calculus with ordinary Riemann integrals, under mild conditions the three integrals
are the same. The resulting important theorem is known as Fubini's Theorem in honor of the Italian mathematician Guido Fubini.
Fubini's Theorem. Suppose that f : S × T → R is measurable. If the double integral on the left exists, then
∫_{S×T} f(x, y) d(μ ⊗ ν)(x, y) = ∫_S ∫_T f(x, y) dν(y) dμ(x) = ∫_T ∫_S f(x, y) dμ(x) dν(y)  (3.11.34)
Proof
Of course, the double integral exists, and so Fubini's theorem applies, if either f is nonnegative or integrable with respect to μ ⊗ ν .
When f is nonnegative, the result is sometimes called Tonelli's theorem in honor of another Italian mathematician, Leonida Tonelli.
On the other hand, the iterated integrals may exist, and may be different, when the double integral does not exist. A
counterexample and a second counterexample are given below.
A special case of Fubini's theorem (and indeed part of the proof) is that we can compute the measure of a set in the product space
by integrating the cross-sectional measures.
If C ∈ S ⊗ T then
(μ ⊗ ν)(C) = ∫_S ν(C_x) dμ(x) = ∫_T μ(C^y) dν(y)  (3.11.39)
In particular, if C, D ∈ S ⊗ T have the property that ν(C_x) = ν(D_x) for all x ∈ S, or μ(C^y) = μ(D^y) for all y ∈ T (that is, C and D have the same cross-sectional measures with respect to one of the variables), then (μ ⊗ ν)(C) = (μ ⊗ ν)(D). In R² with area, and in R³ with volume (Lebesgue measure in both cases), this is known as Cavalieri's principle, named for Bonaventura Cavalieri, yet a third Italian mathematician. Clearly, Italian mathematicians cornered the market on theorems of this sort.
A simple corollary of Fubini's theorem is that the double integral of a product function over a product set is the product of the
integrals. This result has important applications to independent random variables.
Suppose that g : S → R and h : T → R are measurable, and are either nonnegative or integrable with respect to μ and ν, respectively. Then
∫_{S×T} g(x) h(y) d(μ ⊗ ν)(x, y) = (∫_S g dμ)(∫_T h dν)
Recall that a discrete measure space consists of a countable set with the σ-algebra of all subsets and with counting measure. In such
a space, integrals are simply sums and so Fubini's theorem allows us to rearrange the order of summation in a double sum.
Suppose that I and J are countable and that a_{ij} ∈ R for i ∈ I and j ∈ J. If the sum of the positive terms or the sum of the negative terms is finite, then
∑_{(i,j)∈I×J} a_{ij} = ∑_{i∈I} ∑_{j∈J} a_{ij} = ∑_{j∈J} ∑_{i∈I} a_{ij}
Often I = J = N₊, and in this case, a_{ij} can be viewed as an infinite array, with i ∈ N₊ the row number and j ∈ N₊ the column number.
The significant point is that N₊ is totally ordered. While there is no implied order of summation in the double sum ∑_{(i,j)∈N₊²} a_{ij}, the iterated sum ∑_{i=1}^∞ ∑_{j=1}^∞ a_{ij} is obtained by summing each row in order and then summing the row totals in order, while the iterated sum ∑_{j=1}^∞ ∑_{i=1}^∞ a_{ij} is obtained by summing each column in order and then summing the column totals in order.
Of course, only one of the product spaces might be discrete. Theorems (9) and (15) which give conditions for the interchange of
sum and integral can be viewed as applications of Fubini's theorem, where one of the measure spaces is (S, S , μ) and the other is
N+ with counting measure.
Suppose that X is a random variable for an experiment with probability space (Ω, F, P), taking values in a measurable space (S, S), so that X is a measurable function from Ω to S. Recall that the probability distribution of X is the probability measure P_X on (S, S) defined by
P_X(A) = P(X ∈ A),  A ∈ S
Since {X ∈ A} is just probability notation for the inverse image of A under X, P_X is simply a special case of constructing a new
positive measure from a given positive measure via a change of variables. Suppose now that r : S → R is measurable, so that r(X)
is a real-valued random variable. The integral of r(X) (assuming that it exists) is known as the expected value of r(X) and is of
fundamental importance. We will study expected values in detail in the next chapter. Here, we simply note different ways to write
the integral. By the change of variables formula (8) we have
E[r(X)] = ∫_Ω r[X(ω)] dP(ω) = ∫_S r(x) dP_X(x)
Now let F_Y denote the distribution function of Y = r(X). By another change of variables, Y has a probability distribution P_Y on R, which is also a Lebesgue-Stieltjes measure, named for Henri Lebesgue and Thomas Stieltjes. Recall that this probability measure is characterized by
P_Y(a, b] = F_Y(b) − F_Y(a),  a, b ∈ R, a < b
With another application of our change of variables theorem, we can add to our chain of integrals:
∫_Ω r[X(ω)] dP(ω) = ∫_S r(x) dP_X(x) = ∫_R y dP_Y(y) = ∫_R y dF_Y(y)  (3.11.45)
Of course, the last two integrals are simply different notations for exactly the same thing. In the section on absolute continuity and
density functions, we will see other ways to write the integral.
Counterexamples
In the first three exercises below, (R, R, λ) is the standard one-dimensional Euclidean space, so R is the σ-algebra of Lebesgue measurable sets and λ is Lebesgue measure.
This example shows that the strict increasing property can fail when the integrals are infinite.
1. f_n is decreasing in n ∈ N₊ on R.
2. f_n → 0 as n → ∞ on R.
3. ∫_R f_n dλ = ∞ for each n ∈ N₊.
This example shows that the monotone convergence theorem can fail if the first integral is infinite. It also illustrates strict
inequality in Fatou's lemma.
1. lim_{n→∞} f_n = 0 on R, so ∫_R lim_{n→∞} f_n dλ = 0
2. ∫_R f_n dλ = 1 for n ∈ N₊, so lim_{n→∞} ∫_R f_n dλ = 1
3. sup{f_n : n ∈ N₊} = 1 on [1, ∞)
This example shows that the dominated convergence theorem can fail if | fn | is not bounded by an integrable function. It also
shows that strict inequality can hold in Fatou's lemma.
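A standard sequence with exactly this behavior is the "sliding bump" f_n = 1_{[n, n+1)} (an illustrative choice consistent with the stated properties): every f_n has integral 1, but the pointwise limit is 0.

```python
# Sliding bump f_n = 1 on [n, n+1): ∫ f_n dλ = 1 for every n, yet
# f_n(x) → 0 at each fixed x, so limit and integral cannot be swapped;
# no single integrable g can dominate all the f_n. Grid sums stand in
# for the Lebesgue integrals here.
def f(n, x):
    return 1.0 if n <= x < n + 1 else 0.0

h = 0.001
grid = [i * h for i in range(30000)]                 # grid on [0, 30)
integrals = [sum(f(n, x) for x in grid) * h for n in range(5)]
print(integrals)                                      # each ≈ 1.0
assert all(f(n, 7.3) == 0.0 for n in range(8, 50))    # bump passes x = 7.3
```

Any dominating g would have to satisfy g ≥ 1 on [1, ∞), and such a g cannot be integrable.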
Consider the product space [0, 1]² with the usual Lebesgue measurable subsets and Lebesgue measure. Let f : [0, 1]² → R be defined by
f(x, y) = (x² − y²)/(x² + y²)²  (3.11.46)
Show that
1. ∫_{[0,1]²} f(x, y) d(x, y) does not exist.
2. ∫_0^1 ∫_0^1 f(x, y) dx dy = −π/4
3. ∫_0^1 ∫_0^1 f(x, y) dy dx = π/4
This example shows that Fubini's theorem can fail if the double integral does not exist.
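The iterated integrals in parts 2 and 3 can be verified using the closed-form inner integral plus a midpoint rule for the outer one (a numeric sketch; the antiderivative −x/(x² + y²) is standard calculus):

```python
import math

# Inner integral in closed form: ∫_0^1 (x² − y²)/(x² + y²)² dx = −1/(1 + y²),
# from the antiderivative −x/(x² + y²); by the antisymmetry
# f(x, y) = −f(y, x), the reversed order of integration gives +1/(1 + x²).
def inner_dx(y):
    return -1.0 / (1.0 + y * y)

n = 100000  # midpoint rule for the outer integral over [0, 1]
I_dy_dx = sum(inner_dx((k + 0.5) / n) for k in range(n)) / n
print(I_dy_dx, -math.pi / 4)  # ≈ −π/4; the other order gives +π/4
```

The double integral fails to exist because both the positive and negative parts of f have infinite integral near the origin.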
3. Show that ∑_{i=1}^∞ ∑_{j=1}^∞ a_{ij} = 1.
4. Show that ∑_{j=1}^∞ ∑_{i=1}^∞ a_{ij} = 0.
This example shows that the iterated sums can exist and be different when the double sum does not exist, a counterexample to
the corollary to Fubini's theorem for sums when the hypotheses are not satisfied.
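Since the definition of a_{ij} is not reproduced above, the sketch below uses a hypothetical array of the standard type with exactly these iterated sums: +1 on the diagonal, −1 just below it, 0 elsewhere.

```python
def a(i, j):
    # +1 on the diagonal, -1 just below it, 0 elsewhere (assumed example)
    return 1 if i == j else (-1 if i == j + 1 else 0)

M = 200  # inner sums are exact once the inner index passes the outer one
row_sums = [sum(a(i, j) for j in range(1, M)) for i in range(1, 101)]
col_sums = [sum(a(i, j) for i in range(1, M)) for j in range(1, 101)]
print(sum(row_sums), sum(col_sums))  # 1 0: the iterated sums differ
```

Only the first row has total 1; every other row, and every column, sums to 0. The sum of the positive terms and the sum of the negative terms are both infinite, so the double sum does not exist and the hypotheses of the corollary fail.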
Computational Exercises
Compute ∫_D f(x, y) d(x, y) in each case below for the given D ⊆ R² and f : D → R.
1. f(x, y) = e^{−2x} e^{−3y}, D = [0, ∞) × [0, ∞)
2. f(x, y) = e^{−2x} e^{−3y}, D = {(x, y) ∈ R² : 0 ≤ x ≤ y < ∞}
Integrals of the type in the last exercise are useful in the study of exponential distributions.
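The exact answers are 1/6 for part 1 (a product of two one-dimensional integrals) and 1/15 for part 2, which a crude midpoint rule on a truncated square confirms (the truncation T = 20 and grid size are arbitrary; the diagonal boundary in part 2 introduces a small O(h) error):

```python
import math

T, n = 20.0, 500                         # truncate [0, ∞)² to [0, T]²
h = T / n
pts = [(k + 0.5) * h for k in range(n)]  # midpoints of the grid cells
f = lambda x, y: math.exp(-2 * x) * math.exp(-3 * y)

full = sum(f(x, y) for x in pts for y in pts) * h * h             # part 1
wedge = sum(f(x, y) for x in pts for y in pts if x <= y) * h * h  # part 2
print(full, 1 / 6)    # ≈ 1/6
print(wedge, 1 / 15)  # ≈ 1/15
```

Part 1 factors as (1/2)(1/3) = 1/6 by the product rule above; part 2 is ∫_0^∞ e^{−3y} (1 − e^{−2y})/2 dy = (1/2)(1/3 − 1/5) = 1/15.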
This page titled 3.11: Properties of the Integral is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
3.12: General Measures
Basic Theory
Our starting point in this section is a measurable space (S, S ). That is, S is a set and S is a σ-algebra of subsets of S . So far, we
have only considered positive measures on such spaces. Positive measures have applications, as we know, to length, area, volume,
mass, probability, counting, and similar concepts of the nonnegative “size” of a set. Moreover, we have defined the integral of a
measurable function f : S → R with respect to a positive measure, and we have studied properties of the integral.
Definition
But now we will consider measures that can take negative values as well as positive values. These measures have applications to
electric charge, monetary value, and other similar concepts of the “content” of a set that might be positive or negative. Also, this
generalization will help in our study of density functions in the next section. The definition is exactly the same as for a positive
measure, except that values in R* = R ∪ {−∞, ∞} are allowed.
As before, (b) is known as countable additivity and is the critical assumption: the measure of a set that consists of a countable
number of disjoint pieces is the sum of the measures of the pieces. Implicit in the statement of this assumption is that the sum in (b)
exists for every countable disjoint collection {A_i : i ∈ I}. That is, either the sum of the positive terms is finite or the sum of the
negative terms is finite. In turn, this means that the order of the terms in the sum does not matter (a good thing, since there is no
implied order). The term signed measure is used by many, but we will just use the simple term measure, and add appropriate
adjectives for the special cases. Note that if μ(A) ≥ 0 for all A ∈ S , then μ is a positive measure, the kind we have already
studied (and so the new definition really is a generalization). In this case, the sum in (b) always exists in [0, ∞]. If μ(A) ∈ R for all
A ∈ S then μ is a finite measure. Note that in this case, the sum in (b) is absolutely convergent for every countable disjoint collection {A_i : i ∈ I}. If μ is a positive measure and μ(S) = 1 then μ is a probability measure, our favorite kind. Finally, as with positive measures, μ is σ-finite if there exists a countable collection {A_i : i ∈ I} of sets in S such that S = ⋃_{i∈I} A_i and μ(A_i) ∈ R for i ∈ I.
Basic Properties
We give a few simple properties of general measures; hopefully many of these will look familiar. Throughout, we assume that μ is
a measure on (S, S ). Our first result is that although μ can take the value ∞ or −∞ , it turns out that it cannot take both of these
values.
We will say that two measures are of the same type if neither takes the value ∞ or if neither takes the value −∞ . Being of the same
type is trivially an equivalence relation on the collection of measures on (S, S ).
The difference rule holds, as long as the sets have finite measure:
The following corollary is the difference rule for subsets, and will be needed below.
Suppose that A, B ∈ S and A ⊆ B . If μ(B) ∈ R then μ(A) ∈ R and μ(B ∖ A) = μ(B) − μ(A) .
Proof
3.12.1 https://stats.libretexts.org/@go/page/10152
As a consequence, suppose that A, B ∈ S and A ⊆ B . If μ(A) = ∞ , then by the infinity rule we cannot have μ(B) = −∞ and
by the difference rule we cannot have μ(B) ∈ R , so we must have μ(B) = ∞ . Similarly, if μ(A) = −∞ then μ(B) = −∞ . The
inclusion-exclusion rules hold for general measures, as long as the sets have finite measure.
Suppose that A_i ∈ S for each i ∈ I where #(I) = n, and that μ(A_i) ∈ R for i ∈ I. Then
μ(⋃_{i∈I} A_i) = ∑_{k=1}^n (−1)^{k−1} ∑_{J⊆I, #(J)=k} μ(⋂_{j∈J} A_j)  (3.12.1)
Proof
The continuity properties hold for general measures. Part (a) is the continuity property for increasing sets, and part (b) is the
continuity property for decreasing sets.
1. If A_n ⊆ A_{n+1} for n ∈ N₊, then lim_{n→∞} μ(A_n) = μ(⋃_{i=1}^∞ A_i).
2. If A_{n+1} ⊆ A_n for n ∈ N₊ and μ(A_1) ∈ R, then lim_{n→∞} μ(A_n) = μ(⋂_{i=1}^∞ A_i).
Proof
Recall that a positive measure is an increasing function, relative to the subset partial order on S and the ordinary order on [0, ∞],
and this property follows from the difference rule. But for general measures, the increasing property fails, and so do other
properties that flow from it, including the subadditive property (Boole's inequality in probability) and the Bonferroni inequalities.
Constructions
It's easy to construct general measures as differences of positive measures.
Suppose that μ and ν are positive measures on (S, S ) and that at least one of them is finite. Then δ = μ − ν is a measure.
Proof
If μ is a finite measure, then so is cμ for c ∈ R . If μ is not finite then μ and cμ are of the same type if c > 0 and are of opposite
types if c < 0 . We can add two measures to get another measure, as long as they are of the same type. In particular, the collection
of finite measures is closed under addition as well as scalar multiplication, and hence forms a vector space.
If μ and ν are measures on (S, S ) of the same type then μ + ν is a measure on (S, S ).
Proof
Finally, it is easy to explicitly construct measures on a σ-algebra generated by a countable partition. Such σ-algebras are important
for counterexamples and to gain insight, and also because many σ-algebras that occur in applications can be constructed from
them.
Suppose that A = {Aᵢ : i ∈ I} is a countable partition of S into nonempty sets, and that S = σ(A). For i ∈ I, define μ(Aᵢ) ∈ R* arbitrarily, subject only to the condition that the sum of the positive terms is finite, or the sum of the negative terms is finite. For A = ⋃_{j∈J} Aⱼ where J ⊆ I, define
μ(A) = ∑_{j∈J} μ(Aⱼ)
Then μ is a measure on (S, S).
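The construction above can be made concrete with a quick numerical sketch. The partition cells and the values assigned to them below are invented for illustration, not taken from the text: a signed measure on the σ-algebra generated by a finite partition is determined by its values on the cells.

```python
# Sketch: a general (signed) measure on the sigma-algebra generated by a
# finite partition, determined by an arbitrary assignment of values to cells.
# Cell names and values are hypothetical.
cell_measure = {"A1": 2.5, "A2": -1.0, "A3": 0.5}

def mu(cells):
    """Measure of the union of the given partition cells."""
    return sum(cell_measure[c] for c in cells)

# Additivity over disjoint unions of cells:
assert mu({"A1", "A2"}) == mu({"A1"}) + mu({"A2"})

# The difference rule mu(B \ A) = mu(B) - mu(A) for A a subset of B:
B = {"A1", "A2", "A3"}
A = {"A2"}
print(mu(B - A))  # prints 3.0, which equals mu(B) - mu(A) = 2.0 - (-1.0)
```

Note that measurable sets here are exactly the unions of cells, which is why a single value per cell suffices.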
Positive, Negative, and Null Sets
To understand the structure of general measures, we need some basic definitions and properties. As before, we assume that μ is a
measure on (S, S ).
Definitions
1. A ∈ S is a positive set for μ if μ(B) ≥ 0 for every B ∈ S with B ⊆ A .
2. A ∈ S is a negative set for μ if μ(B) ≤ 0 for every B ∈ S with B ⊆ A .
3. A ∈ S is a null set for μ if μ(B) = 0 for every B ∈ S with B ⊆ A .
Note that positive and negative are used in the weak sense (just as we use the terms increasing and decreasing in this text). Of
course, if μ is a positive measure, then every A ∈ S is positive for μ , and A ∈ S is negative for μ if and only if A is null for μ if
and only if μ(A) = 0 . For a general measure, A ∈ S is both positive and negative for μ if and only if A is null for μ . In
particular, ∅ is null for μ. A set A ∈ S is a support set for μ if and only if Aᶜ is a null set for μ. A support set is a set where the measure "lives" in a sense. Positive, negative, and null sets for μ have a basic inheritance property that is essentially equivalent to the definition.
Suppose A ∈ S .
1. If A is positive for μ then B is positive for μ for every B ∈ S with B ⊆ A .
2. If A is negative for μ then B is negative for μ for every B ∈ S with B ⊆ A .
3. If A is null for μ then B is null for μ for every B ∈ S with B ⊆ A .
The collections of positive sets, negative sets, and null sets for μ are closed under countable unions.
Proof
It's easy to see what happens to the positive, negative, and null sets when a measure is multiplied by a non-zero constant.
Positive, negative, and null sets are also preserved under countable sums, assuming that the sum of the measures makes sense.
Suppose that μᵢ is a measure on (S, S) for each i in a countable index set I, and that μ = ∑_{i∈I} μᵢ is a well-defined measure on (S, S). Let A ∈ S.
1. If A is positive for μᵢ for every i ∈ I then A is positive for μ.
2. If A is negative for μᵢ for every i ∈ I then A is negative for μ.
3. If A is null for μᵢ for every i ∈ I then A is null for μ.
In particular, note that μ = ∑_{i∈I} μᵢ is a well-defined measure if μᵢ is a positive measure for each i ∈ I, or if I is finite and μᵢ is a finite measure for each i ∈ I. It's easy to understand the positive, negative, and null sets for a σ-algebra generated by a countable partition.
In the setting of the construction above, let I₊ = {i ∈ I : μ(Aᵢ) > 0}, I₋ = {i ∈ I : μ(Aᵢ) < 0}, and I₀ = {i ∈ I : μ(Aᵢ) = 0}. Let A ∈ S, so that A = ⋃_{j∈J} Aⱼ for some J ⊆ I (and this representation is unique). Then
1. A is positive for μ if and only if J ⊆ I₊ ∪ I₀.
2. A is negative for μ if and only if J ⊆ I₋ ∪ I₀.
3. A is null for μ if and only if J ⊆ I₀.
If A ∈ S and 0 ≤ μ(A) < ∞ then there exists P ∈ S with P ⊆ A such that P is positive for μ and μ(P) ≥ μ(A).
Proof
The assumption that μ(A) < ∞ is critical; a counterexample is given below. Our first decomposition result is the Hahn
decomposition theorem, named for the Austrian mathematician Hans Hahn. It states that S can be partitioned into a positive set and
a negative set, and this decomposition is essentially unique.
Hahn Decomposition Theorem. There exists P ∈ S such that P is positive for μ and Pᶜ is negative for μ. The pair (P, Pᶜ) is a Hahn decomposition of S for μ.
Proof
It's easy to see the Hahn decomposition for a measure on a σ-algebra generated by a countable partition.
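As a hypothetical sketch of that case (the cell values below are invented): the union of the cells with nonnegative measure serves as a positive set P, and splitting each cell value into its positive and negative parts gives the decomposition of the next theorem.

```python
# Sketch: Hahn and Jordan decompositions for a measure on the sigma-algebra
# generated by a finite partition. Cell values are hypothetical.
cell_measure = {"A1": 2.0, "A2": -3.0, "A3": 0.0, "A4": 1.5}

P = {c for c, v in cell_measure.items() if v >= 0}   # positive set for mu
N = set(cell_measure) - P                             # its complement, a negative set

# Jordan decomposition: mu+ and mu- take the positive and negative parts cell by cell.
mu_plus  = {c: max(v, 0.0) for c, v in cell_measure.items()}
mu_minus = {c: max(-v, 0.0) for c, v in cell_measure.items()}

# mu = mu+ - mu- on every cell:
assert all(mu_plus[c] - mu_minus[c] == cell_measure[c] for c in cell_measure)

# mu- vanishes on the positive set P and mu+ vanishes on its complement,
# exactly the null-set property the Jordan decomposition requires.
assert all(mu_minus[c] == 0.0 for c in P)
assert all(mu_plus[c] == 0.0 for c in N)
```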
Jordan Decomposition Theorem. The measure μ can be written uniquely in the form μ = μ₊ − μ₋ where μ₊ and μ₋ are positive measures, at least one finite, and with the property that if (P, Pᶜ) is any Hahn decomposition of S, then Pᶜ is a null set of μ₊ and P is a null set of μ₋.
Proof
Note that, in spite of the similarity in notation, μ₊(A) and μ₋(A) are not simply the positive and negative parts of the (extended) real number μ(A), nor is |μ|(A) the absolute value of μ(A). Also, be careful not to confuse the total variation of μ, a number in [0, ∞], with the total variation measure. The positive, negative, and total variation measures can be written directly in terms of μ. For A ∈ S,
1. μ₊(A) = sup{μ(B) : B ∈ S, B ⊆ A}
2. μ₋(A) = −inf{μ(B) : B ∈ S, B ⊆ A}
3. |μ|(A) = sup{∑_{i∈I} |μ(Aᵢ)| : {Aᵢ : i ∈ I} is a finite, measurable partition of A}
4. ∥μ∥ = sup{∑_{i∈I} |μ(Aᵢ)| : {Aᵢ : i ∈ I} is a finite, measurable partition of S}
The total variation measure is related to sum and scalar multiples of measures in a natural way.
Suppose that μ and ν are measures of the same type and that c ∈ R . Then
1. |μ| = 0 if and only if μ = 0 (the zero measure).
2. |cμ| = |c| |μ|
3. |μ + ν | ≤ |μ| + |ν |
Proof
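A small numerical check of the scaling property and triangle inequality, on measures defined on a finite partition (all values are assumptions for illustration). In this setting the supremum in the definition of ∥μ∥ is attained by the partition into cells, so total variation is just a sum of absolute values.

```python
# Sketch: norm-like properties of total variation for measures on a finite
# partition. Cell values are hypothetical.
mu = {"A1": 2.0, "A2": -3.0, "A3": 1.0}
nu = {"A1": -1.0, "A2": 0.5, "A3": 2.0}
c = -2.0

def tv(m):
    """Total variation norm: the sup over partitions reduces to a cell sum here."""
    return sum(abs(v) for v in m.values())

# Scaling property ||c mu|| = |c| ||mu||:
assert tv({k: c * v for k, v in mu.items()}) == abs(c) * tv(mu)

# Triangle inequality ||mu + nu|| <= ||mu|| + ||nu||:
assert tv({k: mu[k] + nu[k] for k in mu}) <= tv(mu) + tv(nu)
```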
You may have noticed that the properties in the last result look a bit like norm properties. In fact, total variation really is a norm on
the vector space of finite measures on (S, S ):
Suppose that μ and ν are measures of the same type and that c ∈ R . Then
1. ∥μ∥ = 0 if and only if μ = 0 (the zero property)
2. ∥cμ∥ = |c| ∥μ∥ (the scaling property)
3. ∥μ + ν ∥ ≤ ∥μ∥ + ∥ν ∥ (the triangle inequality)
Proof
Every norm on a vector space leads to a corresponding measure of distance (a metric). Let M denote the collection of finite measures on (S, S). Then M, under the usual definitions of addition and scalar multiplication of measures, is a vector space, and as the last theorem shows, ∥ ⋅ ∥ is a norm on M. The corresponding metric is d(μ, ν) = ∥μ − ν∥ for μ, ν ∈ M.
Of course, M includes the probability measures on (S, S ), so we have a new notion of convergence to go along with the others
we have studied or will study. Here is a list:
convergence with probability 1
convergence in probability
convergence in distribution
convergence in kth mean
convergence in total variation
The Integral
Armed with the Jordan decomposition, the integral can be extended to general measures in a natural way: ∫_S f dμ = ∫_S f dμ₊ − ∫_S f dμ₋, assuming that the integrals on the right exist and that the right side is not of the form ∞ − ∞.
We will not pursue this extension, but as you might guess, the essential properties of the integral hold.
Complex Measures
Again, suppose that (S, S ) is a measurable space. The same axioms that work for general measures can be used to define complex
measures. Recall that C = {x + iy : x, y ∈ R} denotes the set of complex numbers, where i is the imaginary unit.
Clearly a complex measure μ can be decomposed as μ = ν + iρ where ν and ρ are finite (real) measures on (S, S ). We will have
no use for complex measures in this text, but from the decomposition into finite measures, it's easy to see how to develop the
theory.
Computational Exercises
Counterexamples
The lemma needed for the Hahn decomposition theorem can fail without the assumption that μ(A) < ∞ .
Let S be a set with subsets A and B satisfying ∅ ⊂ B ⊂ A ⊂ S. Let S = σ{A, B} be the σ-algebra generated by {A, B}. Define μ(B) = −1, μ(A ∖ B) = ∞, μ(Aᶜ) = 1.
This page titled 3.12: General Measures is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
3.13: Absolute Continuity and Density Functions
Basic Theory
Our starting point is a measurable space (S, S ). That is S is a set and S is a σ-algebra of subsets of S . In the last section, we
discussed general measures on (S, S ) that can take positive and negative values. Special cases are positive measures, finite
measures, and our favorite kind, probability measures. In particular, we studied properties of general measures, ways to construct
them, special sets (positive, negative, and null), and the Hahn and Jordan decompositions.
In this section, we see how to construct a new measure from a given positive measure using a density function, and we answer the
fundamental question of when a measure has a density function relative to the given positive measure.
Relations on Measures
The answer to the question involves two important relations on the collection of measures on (S, S ) that are defined in terms of
null sets. Recall that A ∈ S is null for a measure μ on (S, S ) if μ(B) = 0 for every B ∈ S with B ⊆ A . At the other extreme,
A ∈ S is a support set for μ if Aᶜ is a null set. Here are the basic definitions: ν is absolutely continuous with respect to μ, written ν ≪ μ, if every null set of μ is also a null set of ν; and μ and ν are mutually singular, written μ ⊥ ν, if there exists A ∈ S such that A is null for μ and Aᶜ is null for ν.
Thus ν ≪ μ if every support set of μ is a support set of ν. At the opposite end, μ ⊥ ν if μ and ν have disjoint support sets.
Recall that every relation that is reflexive and transitive leads to an equivalence relation, and then in turn, the original relation can
be extended to a partial order on the collection of equivalence classes. This general theorem on relations leads to the following two
results.
Measures μ and ν on (S, S ) are equivalent if μ ≪ ν and ν ≪ μ , and we write μ ≡ ν . The relation ≡ is an equivalence
relation on the collection of measures on (S, S ). That is, if μ , ν , and ρ are measures on (S, S ) then
1. μ ≡ μ , the reflexive property
2. If μ ≡ ν then ν ≡ μ , the symmetric property
3. If μ ≡ ν and ν ≡ ρ then μ ≡ ρ , the transitive property
Thus, μ and ν are equivalent if they have the same null sets and thus the same support sets. This equivalence relation is rather
weak: equivalent measures have the same support sets, but the values assigned to these sets can be very different. As usual, we will
write [μ] for the equivalence class of a measure μ on (S, S ), under the equivalence relation ≡.
If μ and ν are measures on (S, S ), we write [μ] ⪯ [ν ] if μ ≪ ν . The definition is consistent, and defines a partial order on the
collection of equivalence classes. That is, if μ , ν , and ρ are measures on (S, S ) then
1. [μ] ⪯ [μ] , the reflexive property.
2. If [μ] ⪯ [ν ] and [ν ] ⪯ [μ] then [μ] = [ν ] , the antisymmetric property.
3. If [μ] ⪯ [ν ] and [ν ] ⪯ [ρ] then [μ] ⪯ [ρ], the transitive property
3.13.1 https://stats.libretexts.org/@go/page/10153
Proof
Absolute continuity and singularity are preserved under multiplication by nonzero constants.
Suppose that μ and ν are measures on (S, S ) and that a, b ∈ R ∖ {0} . Then
1. ν ≪ μ if and only if aν ≪ bμ .
2. ν ⊥μ if and only if aν ⊥ bμ .
Proof
Suppose that μ is a measure on (S, S) and that νᵢ is a measure on (S, S) for each i in a countable index set I. Suppose also that ν = ∑_{i∈I} νᵢ is a well-defined measure on (S, S).
1. If νᵢ ≪ μ for every i ∈ I then ν ≪ μ.
2. If νᵢ ⊥ μ for every i ∈ I then ν ⊥ μ.
As before, note that ν = ∑_{i∈I} νᵢ is well-defined if νᵢ is a positive measure for each i ∈ I or if I is finite and νᵢ is a finite measure for each i ∈ I. We close this subsection with a couple of results that involve both the absolute continuity relation and the singularity relation.
Density Functions
We are now ready for our study of density functions. Throughout this subsection, we assume that μ is a positive, σ-finite measure
on our measurable space (S, S ). Recall that if f : S → R is measurable, then the integral of f with respect to μ may exist as a
number in R* = R ∪ {−∞, ∞} or may fail to exist.
Suppose that f : S → R is a measurable function whose integral with respect to μ exists. Then the function ν defined by ν(A) = ∫_A f dμ for A ∈ S is a σ-finite measure on (S, S) that is absolutely continuous with respect to μ. The function f is a density function of ν relative to μ.
Proof
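Here is a minimal sketch of this theorem in the discrete case, with μ the counting measure # on a small finite set (the set and density values are invented for illustration): the measure of A is just the sum of the density over A.

```python
# Sketch: a density f relative to counting measure # defines a measure
# nu(A) = integral over A of f d#. Set elements and values are hypothetical.
S = {"a", "b", "c", "d"}
f = {"a": 0.5, "b": 1.5, "c": 0.0, "d": 2.0}  # density of nu relative to #

def nu(A):
    """nu(A) = sum of f over A, the discrete form of the integral."""
    return sum(f[x] for x in A)

# Absolute continuity: the only #-null set is the empty set, and nu vanishes on it.
assert nu(set()) == 0.0

# Additivity over disjoint sets:
assert nu({"a", "b"}) == nu({"a"}) + nu({"b"})
print(nu(S))  # prints 4.0
```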
When ν is a probability measure, f is the probability density function of ν relative to μ, our favorite kind of density function. When they exist, density functions are essentially unique.
Suppose that ν is a σ-finite measure on (S, S ) and that ν has density function f with respect to μ . Then g : S → R is a
density function of ν with respect to μ if and only if f = g almost everywhere on S with respect to μ .
Proof
The essential uniqueness of density functions can fail if the positive measure space (S, S , μ) is not σ-finite. A simple example is
given below. Our next result answers the question of when a measure has a density function with respect to μ , and is the
fundamental theorem of this section. The theorem is in two parts: Part (a) is the Lebesgue decomposition theorem, named for our
old friend Henri Lebesgue. Part (b) is the Radon-Nikodym theorem, named for Johann Radon and Otto Nikodym. We combine the
theorems because our proofs of the two results are inextricably linked.
Proof
In particular, a measure ν on (S, S ) has a density function with respect to μ if and only if ν ≪ μ . The density function in this case
is also referred to as the Radon-Nikodym derivative of ν with respect to μ and is sometimes written in derivative notation as
dν /dμ. This notation, however, can be a bit misleading because we need to remember that a density function is unique only up to a
μ -null set. Also, the Radon-Nikodym theorem can fail if the positive measure space (S, S , μ) is not σ-finite. A couple of
examples are given below. Next we characterize the Hahn decomposition and the Jordan decomposition of ν in terms of the density
function.
Suppose that ν is a measure on (S, S) with ν ≪ μ, and that ν has density function f with respect to μ. Let P = {x ∈ S : f(x) ≥ 0}, and let f₊ and f₋ denote the positive and negative parts of f.
1. A Hahn decomposition of ν is (P, Pᶜ).
2. The Jordan decomposition of ν is ν = ν₊ − ν₋, where ν₊ and ν₋ have density functions f₊ and f₋ with respect to μ, respectively.
Suppose that ν is a positive measure on (S, S ) with ν ≪ μ and that ν has density function f with respect to μ . If g : S → R
is a measurable function whose integral with respect to ν exists, then
∫_S g dν = ∫_S g f dμ  (3.13.6)
Proof
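The change of variables formula is easy to check numerically on a finite set, where μ places an assumed point mass μ{x} on each point and ν{x} = f(x) μ{x}. All values below are invented for illustration.

```python
# Sketch: change of variables integral of g dnu = integral of g*f dmu
# on a finite set. Point masses, density, and integrand are hypothetical.
S = ["a", "b", "c"]
mu = {"a": 1.0, "b": 2.0, "c": 3.0}  # point masses of mu
f  = {"a": 2.0, "b": 0.5, "c": 1.0}  # density of nu relative to mu
g  = {"a": 3.0, "b": 4.0, "c": -1.0} # integrand

nu = {x: f[x] * mu[x] for x in S}    # nu{x} = f(x) mu{x}

int_g_dnu  = sum(g[x] * nu[x] for x in S)          # integral of g with respect to nu
int_gf_dmu = sum(g[x] * f[x] * mu[x] for x in S)   # integral of g*f with respect to mu

assert int_g_dnu == int_gf_dmu
print(int_g_dnu)  # prints 7.0
```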
In differential notation, the change of variables theorem has the familiar form dν = f dμ , and this is really the justification for the
derivative notation f = dν /dμ in the first place. The following result gives the scalar multiple rule for density functions.
Suppose that ν is a measure on (S, S ) with ν ≪ μ and that ν has density function f with respect to μ . If c ∈ R , then cν has
density function cf with respect to μ .
Proof
Of course, we already knew that ν ≪ μ implies cν ≪ μ for c ∈ R , so the new information is the relation between the density
functions. In derivative notation, the scalar multiple rule has the familiar form
d(cν)/dμ = c dν/dμ  (3.13.9)
The following result gives the sum rule for density functions. Recall that two measures are of the same type if neither takes the
value ∞ or if neither takes the value −∞ .
Suppose that ν and ρ are measures on (S, S ) of the same type with ν ≪ μ and ρ ≪ μ , and that ν and ρ have density
functions f and g with respect to μ , respectively. Then ν + ρ has density function f + g with respect to μ .
Proof
Of course, we already knew that ν ≪ μ and ρ ≪ μ imply ν + ρ ≪ μ , so the new information is the relation between the density
functions. In derivative notation, the sum rule has the familiar form
d(ν + ρ)/dμ = dν/dμ + dρ/dμ  (3.13.11)
The following result is the chain rule for density functions.
Suppose that ν is a positive measure on (S, S ) with ν ≪ μ and that ν has density function f with respect to μ . Suppose ρ is a
measure on (S, S ) with ρ ≪ ν and that ρ has density function g with respect to ν . Then ρ has density function gf with
respect to μ .
Proof
Of course, we already knew that ν ≪ μ and ρ ≪ ν imply ρ ≪ μ, so once again the new information is the relation between the density functions. In derivative notation, the chain rule has the familiar form
dρ/dμ = (dρ/dν)(dν/dμ)  (3.13.12)
The following related result is the inverse rule for density functions.
Suppose that ν is a positive measure on (S, S ) with ν ≪ μ and μ ≪ ν (so that ν ≡μ ). If ν has density function f with
respect to μ then μ has density function 1/f with respect to ν .
Proof
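A quick numerical sketch of the chain rule and inverse rule on a finite set (all values are assumptions for illustration): point masses multiply along the chain μ → ν → ρ.

```python
# Sketch: chain rule drho/dmu = (drho/dnu)(dnu/dmu) and inverse rule
# dmu/dnu = 1/f on a finite set. All values are hypothetical.
S = ["a", "b", "c"]
mu = {"a": 1.0, "b": 2.0, "c": 0.5}
f  = {"a": 2.0, "b": 0.5, "c": 4.0}  # dnu/dmu, strictly positive
g  = {"a": 3.0, "b": 1.0, "c": 0.5}  # drho/dnu

nu  = {x: f[x] * mu[x] for x in S}
rho = {x: g[x] * nu[x] for x in S}

# Chain rule: rho{x} = g(x) f(x) mu{x}
assert all(abs(rho[x] - g[x] * f[x] * mu[x]) < 1e-12 for x in S)

# Inverse rule: mu{x} = (1/f(x)) nu{x}, valid since f > 0 everywhere
assert all(abs(mu[x] - nu[x] / f[x]) < 1e-12 for x in S)
```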
Suppose first that S is countable and S = P(S), with μ = # counting measure, so that (S, S, #) is σ-finite and ∅ is the only null set. Then every measure ν on (S, S) is absolutely continuous with respect to # and ν(A) = ∑_{x∈A} f(x) for A ∈ S, for a unique f : S → R. Of course, this is obvious by a direct argument: if we define f(x) = ν{x} for x ∈ S then the displayed equation follows from the countable additivity of ν.
More generally, suppose that A = {Aᵢ : i ∈ I} is a countable partition of S into nonempty sets and that S = σ(A), so that every A ∈ S has a unique representation of the form A = ⋃_{j∈J} Aⱼ where J ⊆ I. Suppose now that μ is a positive measure on S with 0 < μ(Aᵢ) < ∞ for every i ∈ I. Then once again, the measure space (S, S, μ) is σ-finite and ∅ is the only null set. Hence if ν is a measure on (S, S) then ν is absolutely continuous with respect to μ and hence has a unique density function f with respect to μ, namely f(x) = ν(Aᵢ)/μ(Aᵢ) for x ∈ Aᵢ and i ∈ I.
Proof
Often positive measure spaces that occur in applications can be decomposed into spaces generated by countable partitions. In the
section on Convergence in the chapter on Martingales, we show that more general density functions can be obtained as limits of
density functions of the type in the last theorem.
Probability Spaces
Suppose that (Ω, F, P) is a probability space and that X is a random variable taking values in a measurable space (S, S). Recall that the distribution of X is the probability measure P_X on (S, S) given by P_X(A) = P(X ∈ A) for A ∈ S.
If μ is a positive, σ-finite measure on (S, S), then the theory of this section applies, of course. The Radon-Nikodym theorem tells us precisely when (the distribution of) X has a probability density function with respect to μ: we need the distribution to be absolutely continuous with respect to μ, that is, if μ(A) = 0 then P_X(A) = P(X ∈ A) = 0 for A ∈ S.
Suppose that r : S → R is measurable, so that r(X) is a real-valued random variable. The integral of r(X) (assuming that it exists) is of fundamental importance, and is known as the expected value of r(X). We will study expected values in detail in the next chapter, but here we just note different ways to write the integral. By the change of variables theorem in the last section, ∫_Ω r(X) dP = ∫_S r dP_X.
Specializing, suppose that (S, S, #) is a discrete measure space. Thus X has a discrete distribution and (as noted in the previous subsection), the distribution of X is absolutely continuous with respect to #, with probability density function f given by f(x) = P(X = x) for x ∈ S. In this case the integral simplifies: ∫_S r dP_X = ∑_{x∈S} r(x) f(x).
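In this discrete form, the expected value is just a weighted sum. A minimal sketch, with an assumed density and an assumed choice r(x) = x²:

```python
# Sketch: E[r(X)] = sum over x of r(x) f(x) for a discrete distribution.
# The density values and the function r are hypothetical.
f = {0: 0.2, 1: 0.5, 2: 0.3}   # probability density function of X w.r.t. #
r = lambda x: x * x             # r(X) = X^2

expected_value = sum(r(x) * p for x, p in f.items())
# 0*0.2 + 1*0.5 + 4*0.3, approximately 1.7
```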
Recall next that for n ∈ N₊, the n-dimensional Euclidean measure space is (Rⁿ, Rⁿ, λₙ) where Rⁿ is the σ-algebra of Lebesgue measurable sets and λₙ is Lebesgue measure. Suppose now that S ∈ Rⁿ and that S is the σ-algebra of Lebesgue measurable subsets of S, and that once again, X is a random variable with values in S. By definition, X has a continuous distribution if P(X = x) = 0 for x ∈ S. But we now know that this is not enough to ensure that the distribution of X has a density function with respect to λₙ. We need the distribution to be absolutely continuous, so that if λₙ(A) = 0 then P(X ∈ A) = 0 for A ∈ S. Of course λₙ{x} = 0 for x ∈ S, so absolute continuity implies continuity, but not conversely. Continuity of the distribution is a (much) weaker condition than absolute continuity of the distribution. If the distribution of X is continuous but not absolutely so, then the distribution will not have a density function with respect to λₙ.
For example, suppose that λₙ(S) = 0. Then the distribution of X and λₙ are mutually singular since P(X ∈ S) = 1, and so X will not have a density function with respect to λₙ. This will always be the case if S is countable, so that the distribution of X is discrete. But it is also possible for X to have a continuous distribution on an uncountable set S ∈ Rⁿ with λₙ(S) = 0. In such a case, the continuous distribution of X is said to be degenerate. There are a couple of natural ways in which this can happen that are illustrated in the following exercises.
Suppose that Θ is uniformly distributed on the interval [0, 2π). Let X = cos Θ, Y = sin Θ.
1. (X, Y) has a continuous distribution on the circle C = {(x, y) : x² + y² = 1}.
2. The distribution of (X, Y) and λ₂ are mutually singular.
The last example is artificial since (X, Y) has a one-dimensional distribution in a sense, in spite of taking values in R². And of course Θ has a probability density function f with respect to λ₁ given by f(θ) = 1/(2π) for θ ∈ [0, 2π).
Suppose that X is uniformly distributed on the set {0, 1, 2}, Y is uniformly distributed on the interval [0, 2], and that X and Y
are independent.
1. (X, Y) has a continuous distribution on the product set S = {0, 1, 2} × [0, 2].
2. The distribution of (X, Y) and λ₂ are mutually singular.
The last exercise is artificial since X has a discrete distribution on {0, 1, 2} (with all subsets measurable and with counting measure #), and Y a continuous distribution on the Euclidean space [0, 2] (with Lebesgue measurable subsets and with λ₁). Both are absolutely continuous: X has density function g given by g(x) = 1/3 for x ∈ {0, 1, 2}, and Y has density function h given by h(y) = 1/2 for y ∈ [0, 2]. So really, the proper measure space on S is the product measure space formed from these two spaces. Relative to this product space, (X, Y) has density f given by f(x, y) = 1/6 for (x, y) ∈ S.
It is also possible to have a continuous distribution on S ⊆ Rⁿ with λₙ(S) > 0, yet still with no probability density function, a much more interesting situation. We will give a classical construction. Let (X₁, X₂, …) be a sequence of Bernoulli trials with success parameter p ∈ (0, 1). We will indicate the dependence of the probability measure P on the parameter p with a subscript. Thus, we have a sequence of independent indicator variables with
P_p(Xᵢ = 1) = p,  P_p(Xᵢ = 0) = 1 − p  (3.13.21)
We interpret Xᵢ as the ith binary digit (bit) of a random variable X taking values in (0, 1). That is, X = ∑_{i=1}^∞ Xᵢ/2ⁱ. Conversely, recall that every number x ∈ (0, 1) can be written in binary form as x = ∑_{i=1}^∞ xᵢ/2ⁱ where xᵢ ∈ {0, 1} for each i ∈ N₊. This representation is unique except when x is a binary rational of the form x = k/2ⁿ for n ∈ N₊ and k ∈ {1, 3, …, 2ⁿ − 1}. In this case, there are two representations, one in which the bits are eventually 0 and one in which the bits are eventually 1. Note, however, that the set of binary rationals is countable. Finally, note that the uniform distribution on (0, 1) is the same as Lebesgue measure on (0, 1).
X has a continuous distribution on (0, 1) for every value of the parameter p ∈ (0, 1). Moreover,
1. If p, q ∈ (0, 1) and p ≠ q then the distribution of X with parameter p and the distribution of X with parameter q are mutually singular.
2. If p = 1/2, X has the uniform distribution on (0, 1).
3. If p ≠ 1/2, then the distribution of X is singular with respect to Lebesgue measure on (0, 1), and hence has no probability density function.
For an application of some of the ideas in this example, see Bold Play in the game of Red and Black.
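The construction of X from random bits is easy to simulate. The sketch below truncates the expansion at 30 bits and uses an assumed seed and sample size. Since E(Xᵢ) = p, we have E(X) = ∑_{i=1}^∞ p/2ⁱ = p, which the empirical mean should approximate for any p.

```python
# Simulation sketch of X = sum of bits X_i / 2^i with Bernoulli(p) bits.
# The truncation depth, seed, and sample size are arbitrary choices.
import random

def sample_X(p, n_bits=30, rng=None):
    """One draw of X, truncated at n_bits binary digits."""
    rng = rng or random
    return sum((rng.random() < p) / 2 ** i for i in range(1, n_bits + 1))

rng = random.Random(42)
p = 0.3
samples = [sample_X(p, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)

# The empirical mean should be close to E(X) = p.
assert abs(mean - p) < 0.02
```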
Counterexamples
The essential uniqueness of density functions can fail if the underlying positive measure μ is not σ -finite. Here is a trivial
counterexample:
Suppose that S is a nonempty set and that S = {S, ∅} is the trivial σ-algebra. Define the positive measure μ on (S, S) by μ(∅) = 0, μ(S) = ∞. Let ν_c denote the measure on (S, S) with constant density function c ∈ R with respect to μ.
The Radon-Nikodym theorem can fail if the measure μ is not σ-finite, even if ν is finite. Here are a couple of standard counterexamples:
Suppose that S is an uncountable set and S is the σ-algebra of countable and co-countable sets:
S = {A ⊆ S : A is countable or Aᶜ is countable}  (3.13.23)
As usual, let # denote counting measure on S, and define ν on S by ν(A) = 0 if A is countable and ν(A) = 1 if Aᶜ is countable. Then
1. (S, S, #) is not σ-finite.
2. ν is a finite, positive measure on (S, S).
3. ν is absolutely continuous with respect to #.
4. ν does not have a density function with respect to #.
Proof
Let R denote the standard Borel σ-algebra on R . Let # and λ denote counting measure and Lebesgue measure on (R, R) ,
respectively. Then
1. (R, R, #) is not σ-finite.
2. λ is absolutely continuous with respect to #.
3. λ does not have a density function with respect to #.
Proof
This page titled 3.13: Absolute Continuity and Density Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated
by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history
is available upon request.
3.14: Function Spaces
Basic Theory
Our starting point is a positive measure space (S, S , μ). That is S is a set, S is a σ-algebra of subsets of S , and μ is a positive
measure on (S, S ). As usual, the most important special cases are
Euclidean space: S is a Lebesgue measurable subset of Rⁿ for some n ∈ N₊, S is the σ-algebra of Lebesgue measurable subsets of S, and μ = λₙ is n-dimensional Lebesgue measure.
Discrete space: S is a countable set, S = P(S) is the collection of all subsets of S , and μ = # is counting measure.
Probability space: S is the set of outcomes of a random experiment, S is the σ-algebra of events, and μ = P is a probability
measure.
In previous sections, we defined the integral of certain measurable functions f : S → R with respect to μ , and we studied
properties of the integral. In this section, we will study vector spaces of functions that are defined in terms of certain integrability
conditions. These function spaces are of fundamental importance in all areas of analysis, including probability. In particular, the
results of this section will reappear in the form of spaces of random variables in our study of expected value.
Measurable functions f , g : S → R are equivalent if f = g almost everywhere on S , in which case we write f ≡ g . The
relation ≡ is an equivalence relation on the collection of measurable functions from S to R. That is, if f , g, h : S → R are
measurable then
1. f ≡ f , the reflexive property.
2. If f ≡ g then g ≡ f , the symmetric property.
3. If f ≡ g and g ≡ h then f ≡ h , the transitive property.
Thus, equivalent functions are indistinguishable from the point of view of the measure μ . As with any equivalence relation, ≡
partitions the underlying set (in this case the collection of real-valued measurable functions on S ) into equivalence classes of
mutually equivalent elements. As we will see, we often view these equivalence classes as the basic objects of study. Our next task
is to define measures of the “size” of a function; these will become norms in our spaces.
For a measurable function f : S → R, define
∥f∥_p = (∫_S |f|ᵖ dμ)^{1/p} for p ∈ (0, ∞), and ∥f∥_∞ = inf{b ∈ [0, ∞] : |f| ≤ b almost everywhere on S}
Since |f|ᵖ is a nonnegative, measurable function for p ∈ (0, ∞), ∫_S |f|ᵖ dμ exists in [0, ∞], and hence so does ∥f∥_p. Clearly ∥f∥_∞ also exists in [0, ∞] and is known as the essential supremum of f. A number b ∈ [0, ∞] such that |f| ≤ b almost everywhere on S is an essential bound of f and so, appropriately enough, the essential supremum of f is the infimum of the essential bounds of f. Thus, we have defined ∥f∥_p for all p ∈ (0, ∞]. The definition for p = ∞ is special, but we will see that it's the natural limiting case as p → ∞.
For p ∈ (0, ∞], let L_p denote the collection of measurable functions f : S → R with ∥f∥_p < ∞. So for p ∈ (0, ∞), f ∈ L_p if and only if |f|ᵖ is integrable. The symbol L is in honor of Henri Lebesgue, who first developed the theory. If we want to indicate the dependence on the underlying measure space, we write L_p(S, S, μ). Of course, L_1 is simply the collection of functions that are integrable with respect to μ. Our goal is to study the spaces L_p for p ∈ (0, ∞]. We start with some simple properties.
3.14.1 https://stats.libretexts.org/@go/page/10154
Suppose that f : S → R is measurable and c ∈ R. Then ∥cf∥_p = |c| ∥f∥_p for p ∈ (0, ∞].
Proof
In particular, if f ∈ L_p and c ∈ R then cf ∈ L_p.
Indices p, q ∈ (1, ∞) are said to be conjugate if 1/p + 1/q = 1 . In addition, 1 and ∞ are conjugate indices.
For justification of the last case, note that if p ∈ (1, ∞), then the index conjugate to p is
q = 1/(1 − 1/p)  (3.14.3)
and q ↑ ∞ as p ↓ 1. Note that p = q = 2 are conjugate indices, and this is the only case where the indices are the same. Ultimately, the importance of conjugate indices stems from the following inequality:
Our next major result is Hölder's inequality, named for Otto Hölder, which clearly indicates the importance of conjugate indices.
Suppose that f, g : S → R are measurable and that p and q are conjugate indices. Then ∥fg∥_1 ≤ ∥f∥_p ∥g∥_q.
Proof
In particular, if f ∈ L_p and g ∈ L_q then fg ∈ L_1. The most important special case of Hölder's inequality is when p = q = 2, in which case we have the Cauchy-Schwarz inequality, named for Augustin Louis Cauchy and Karl Hermann Schwarz:
∥fg∥_1 ≤ ∥f∥_2 ∥g∥_2  (3.14.14)
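Hölder's inequality (and hence the Cauchy-Schwarz case) can be checked numerically in the sequence setting with counting measure. The vectors and the conjugate pair p = 3, q = 3/2 below are arbitrary choices for illustration.

```python
# Sketch: Hölder's inequality ||fg||_1 <= ||f||_p ||g||_q for sequences
# (counting measure on a finite index set). Values are hypothetical.
f = [1.0, -2.0, 3.0, 0.5]
g = [2.0, 1.0, -1.0, 4.0]
p, q = 3.0, 1.5                 # conjugate: 1/p + 1/q = 1

norm = lambda v, r: sum(abs(x) ** r for x in v) ** (1 / r)
fg_1 = sum(abs(a * b) for a, b in zip(f, g))   # ||fg||_1

assert abs(1 / p + 1 / q - 1) < 1e-12          # indices really are conjugate
assert fg_1 <= norm(f, p) * norm(g, q) + 1e-12 # Hölder's inequality holds
```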
Minkowski's Inequality
Our next major result is Minkowski's inequality, named for Hermann Minkowski. This inequality will help show that L is a vector p
Proof
Vector Spaces
We can now discuss various vector spaces of functions. First, we know from our previous work with measure spaces, that the set V
of all measurable functions f : S → R is a vector space under our standard (pointwise) definitions of sum and scalar multiple. The
spaces we are studying in this section are subspaces:
L_p is a subspace of V for every p ∈ [1, ∞].
Proof
However, we usually want to identify functions that are equal almost everywhere on S (with respect to μ ). Recalling the
equivalence relation ≡ defined above, here are the definitions:
Let [f ] denote the equivalence class of f ∈ V under the equivalence relation ≡, and let U = {[f ] : f ∈ V } . If f , g ∈ V and
c ∈ R we define
1. [f ] + [g] = [f + g]
2. c[f ] = [cf ]
Then U is a vector space.
Proof
We have stated these results precisely, but on the other hand, we don't want to be overly pedantic. It's more natural and intuitive to
simply work with the space V and the subspaces L for p ∈ [1, ∞], and just remember that functions that are equal almost
p
everywhere on S are regarded as the same vector. This will be our point of view for the rest of this section.
Every norm on a vector space naturally leads to a metric: we measure the distance between vectors as the norm of their difference. Stated in terms of the norm ∥ ⋅ ∥_p, here are the properties of the metric on L_p. For f, g, h ∈ L_p,
1. ∥f − g∥_p ≥ 0, and ∥f − g∥_p = 0 if and only if f ≡ g, the positive property
2. ∥f − g∥_p = ∥g − f∥_p, the symmetric property
3. ∥f − h∥_p ≤ ∥f − g∥_p + ∥g − h∥_p, the triangle inequality
Suppose that fₙ ∈ L_p for n ∈ N₊ and f ∈ L_p. Then by definition, fₙ → f as n → ∞ in L_p if and only if ∥fₙ − f∥_p → 0 as n → ∞.
Suppose again that fₙ ∈ L_p for n ∈ N₊. Recall that this sequence is said to be a Cauchy sequence if for every ϵ > 0 there exists N ∈ N₊ such that if n > N and m > N then ∥fₙ − fₘ∥_p < ϵ. Needless to say, the Cauchy criterion is named for our ubiquitous
friend Augustin Cauchy. A metric space in which every Cauchy sequence converges (to an element of the space) is said to be
complete. Intuitively, one expects a Cauchy sequence to converge, so a complete space is literally one that is not missing any
elements that should be there. A complete, normed vector space is called a Banach space, after the Polish mathematician Stefan
Banach. Banach spaces are of fundamental importance in analysis, in large part because of the following result:
L_p is a Banach space for every p ∈ [1, ∞].
The Space L_2
For f, g ∈ L_2, define
⟨f, g⟩ = ∫_S f g dμ  (3.14.23)
Note that the integral is well-defined by the Cauchy-Schwarz inequality. As with all of our other definitions, this one is consistent with the equivalence relation. That is, if f₁ ≡ f and g₁ ≡ g then f₁g₁ ≡ fg, so ∫_S fg dμ = ∫_S f₁g₁ dμ.
L
2
is an inner product space. That is, if f , g, h ∈ L
2
and c ∈ R then
1. ⟨f , f ⟩ ≥ 0 and ⟨f , f ⟩ = 0 if and only if f ≡ 0 , the positive property
2. ⟨f , g⟩ = ⟨g, f ⟩ , the symmetric property
3. ⟨cf , g⟩ = c⟨f , g⟩ , the scaling property
4. ⟨f + g, h⟩ = ⟨f , h⟩ + ⟨g, h⟩ , the additive property
Proof
From the scaling and additive properties, the inner product is linear in the first argument, with the second argument fixed. By the symmetric property, it follows that the inner product is also linear in the second argument with the first argument fixed. That is, the inner product is bilinear. A complete inner product space is known as a Hilbert space, named for the German mathematician David Hilbert. Thus,
the following result follows immediately from the previous two.
L_2 is a Hilbert space.
All inner product spaces lead naturally to the concept of orthogonality; L_2 is no exception.
Functions f, g ∈ L_2 are orthogonal if ⟨f, g⟩ = 0, in which case we write f ⊥ g. Equivalently, f ⊥ g if
∫_S f g dμ = 0   (3.14.24)
Of course, all of the basic theorems of general inner product spaces hold in L_2. For example, the following result is the Pythagorean theorem, named of course for Pythagoras.
If f, g ∈ L_2 and f ⊥ g then ∥f + g∥_2² = ∥f∥_2² + ∥g∥_2².
Proof
∥x∥_p = (∑_{i∈S} |x_i|^p)^{1/p}   (3.14.26)
On the other hand, ∥x∥_∞ = sup{|x_i| : i ∈ S}. The only null set for # is ∅, so the equivalence relation ≡ is simply equality, and so the spaces 𝓛_p and L_p are the same. For p ∈ [1, ∞), x ∈ L_p if and only if
∑_{i∈S} |x_i|^p < ∞   (3.14.27)
When p ∈ ℕ_+ (as is often the case), this condition means that ∑_{i∈S} x_i^p is absolutely convergent. On the other hand, x ∈ L_∞ if and only if x is bounded. When S = ℕ_+, the space L_p is often denoted l_p. The inner product on L_2 is
⟨x, y⟩ = ∑_{i∈S} x_i y_i,  x, y ∈ L_2   (3.14.28)
When S = {1, 2, …, n}, L_2 is simply the vector space ℝ^n with the usual addition, scalar multiplication, inner product, and norm that we study in elementary linear algebra. Orthogonal vectors are perpendicular in the usual sense.
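In the finite case the construction can be sketched concretely; the vectors below are arbitrary illustrations of the p-norms and of orthogonality under counting measure:

```python
import numpy as np

# For counting measure on S = {1, ..., n}, the L_p spaces reduce to R^n
# with the familiar p-norms.
x = np.array([3.0, -4.0, 0.0])

def p_norm(x, p):
    """The norm ||x||_p = (sum |x_i|^p)^(1/p) for p in [1, infinity)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

sup_norm = np.max(np.abs(x))          # ||x||_inf = sup |x_i|

print(p_norm(x, 1))   # 7.0
print(p_norm(x, 2))   # 5.0, the usual Euclidean norm
print(sup_norm)       # 4.0

# Orthogonality in L_2: <x, y> = sum x_i y_i
y = np.array([4.0, 3.0, 0.0])
print(np.dot(x, y))   # 0.0, so x and y are orthogonal
```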
Probability Spaces
Suppose that (S, 𝒮, P) is a probability space, so that S is the set of outcomes of a random experiment, 𝒮 is the σ-algebra of events, and P is a probability measure on the sample space (S, 𝒮). Of course, a measurable function X : S → ℝ is simply a real-valued random variable. For p ∈ [1, ∞), the integral ∫_S |X|^p dP is the expected value of |X|^p, and is denoted E(|X|^p). Thus in this case, L_p is the collection of real-valued random variables X with E(|X|^p) < ∞. We will study these spaces in more detail in the chapter on expected value.
This page titled 3.14: Function Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
CHAPTER OVERVIEW
4: Expected Value
Expected value is one of the fundamental concepts in probability, in a sense more general than probability itself. The expected
value of a real-valued random variable gives a measure of the center of the distribution of the variable. More importantly, by taking
the expected value of various functions of a general random variable, we can measure many interesting features of its distribution,
including spread, skewness, kurtosis, and correlation. Generating functions are certain types of expected value that completely
determine the distribution of the variable. Conditional expected value, which incorporates known information in the computation,
is one of the fundamental concepts in probability.
In the advanced topics, we define expected value as an integral with respect to the underlying probability measure. We also revisit
conditional expected value from a measure-theoretic point of view. We study vector spaces of random variables with certain
expected values as the norms of the spaces, which in turn leads to modes of convergence for random variables.
4.1: Definitions and Basic Properties
4.2: Additional Properties
4.3: Variance
4.4: Skewness and Kurtosis
4.5: Covariance and Correlation
4.6: Generating Functions
4.7: Conditional Expected Value
4.8: Expected Value and Covariance Matrices
4.9: Expected Value as an Integral
4.10: Conditional Expected Value Revisited
4.11: Vector Spaces of Random Variables
4.12: Uniformly Integrable Variables
4.13: Kernels and Operators
This page titled 4: Expected Value is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.1: Definitions and Basic Properties
Expected value is one of the most important concepts in probability. The expected value of a real-valued random variable gives the
center of the distribution of the variable, in a special sense. Additionally, by computing expected values of various real
transformations of a general random variable, we can extract a number of interesting characteristics of the distribution of the
variable, including measures of spread, symmetry, and correlation. In a sense, expected value is a more general concept than
probability itself.
Basic Concepts
Definitions
As usual, we start with a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F
the collection of events and P the probability measure on the sample space (Ω, F ). In the following definitions, we assume that X
is a random variable for the experiment, taking values in S ⊆ R .
If X has a discrete distribution with probability density function f (so that S is countable), then the expected value of X is defined as follows (assuming that the sum is well defined):
E(X) = ∑_{x∈S} x f(x)
The sum defining the expected value makes sense if either the sum over the positive x ∈ S is finite or the sum over the negative x ∈ S is finite (or both). This ensures that the entire sum exists (as an extended real number) and does not depend on the order
of the terms. So as we will see, it's possible for E(X) to be a real number or ∞ or −∞ or to simply not exist. Of course, if S is
finite the expected value always exists as a real number.
If X has a continuous distribution with probability density function f (and so S is typically an interval or a union of disjoint intervals), then the expected value of X is defined as follows (assuming that the integral is well defined):
E(X) = ∫_S x f(x) dx
The probability density functions in basic applied probability that describe continuous distributions are piecewise continuous. So
the integral above makes sense if the integral over positive x ∈ S is finite or the integral over negative x ∈ S is finite (or both).
This ensures that the entire integral exists (as an extended real number). So as in the discrete case, it's possible for E(X) to exist as
a real number or as ∞ or as −∞ or to not exist at all. As you might guess, the definition for a mixed distribution is a combination
of the definitions for the discrete and continuous cases.
Suppose that X has a mixed distribution, with partial discrete density g on D and partial continuous density h on C, where D and C are disjoint, D is countable, C is typically an interval, and S = D ∪ C. The expected value of X is defined as follows (assuming that the expression on the right is well defined):
E(X) = ∑_{x∈D} x g(x) + ∫_C x h(x) dx
For the expected value above to make sense, the sum must be well defined, as in the discrete case, the integral must be well
defined, as in the continuous case, and we must avoid the dreaded indeterminate form ∞ − ∞ . In the next section on additional
properties, we will see that the various definitions given here can be unified into a single definition that works regardless of the
type of distribution of X. An even more general definition is given in the advanced section on expected value as an integral.
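The three definitions above can be checked with a short numerical sketch; the particular distributions below (a fair die, the uniform density on [0, 1], and a hypothetical mixed distribution) are chosen purely for illustration:

```python
import numpy as np
from scipy.integrate import quad

# Discrete case: E(X) = sum over x in S of x f(x).  Fair six-sided die:
S = np.arange(1, 7)
f = np.full(6, 1 / 6)
EX_discrete = np.sum(S * f)                            # 3.5

# Continuous case: E(X) = integral over S of x f(x) dx.  Uniform on [0, 1]:
EX_continuous, _ = quad(lambda x: x * 1.0, 0.0, 1.0)   # 0.5

# Mixed case: E(X) = sum_D x g(x) + integral_C x h(x) dx.
# Point mass 1/2 at 0 together with density 1/2 on [0, 1]:
EX_mixed = 0 * 0.5 + quad(lambda x: x * 0.5, 0.0, 1.0)[0]   # 0.25

print(EX_discrete, EX_continuous, EX_mixed)
```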
Interpretation
The expected value of X is also called the mean of the distribution of X and is frequently denoted μ . The mean is the center of the
probability distribution of X in a special sense. Indeed, if we think of the distribution as a mass distribution (with total mass 1),
4.1.1 https://stats.libretexts.org/@go/page/10156
then the mean is the center of mass as defined in physics. The two pictures below show discrete and continuous probability density
functions; in each case the mean μ is the center of mass, the balance point.
Recall the other measures of the center of a distribution that we have studied:
A mode is any x ∈ S that maximizes f .
A median is any x ∈ ℝ that satisfies P(X < x) ≤ 1/2 and P(X ≤ x) ≥ 1/2.
To understand expected value in a probabilistic way, suppose that we create a new, compound experiment by repeating the basic experiment over and over again. This gives a sequence of independent random variables (X_1, X_2, …), each with the same distribution as X. In statistical terms, we are sampling from the distribution of X. The average value, or sample mean, after n runs is
M_n = (1/n) ∑_{i=1}^n X_i   (4.1.4)
Note that M_n is a random variable in the compound experiment. The important fact is that the average value M_n converges to the
expected value E(X) as n → ∞ . The precise statement of this is the law of large numbers, one of the fundamental theorems of
probability. You will see the law of large numbers at work in many of the simulation exercises given below.
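The law of large numbers can also be seen in a small simulation of repeated sampling; the distribution (a fair die, so E(X) = 3.5) and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Sample repeatedly from the distribution of X and watch the sample mean
# M_n approach the expected value E(X) = 3.5 as n grows.
for n in [10, 1000, 100_000]:
    sample = rng.integers(1, 7, size=n)
    print(n, sample.mean())
```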
Extensions
If a ∈ ℝ and n ∈ ℕ, the moment of X about a of order n is defined to be
E[(X − a)^n]   (4.1.5)
The moments about 0 are simply referred to as moments (or sometimes raw moments). The moments about μ are the central
moments. The second central moment is particularly important, and is studied in detail in the section on variance. In some cases, if
we know all of the moments of X, we can determine the entire distribution of X. This idea is explored in the section on generating
functions.
The expected value of a random variable X is based, of course, on the probability measure P for the experiment. This probability
measure could be a conditional probability measure, conditioned on a given event A ∈ F for the experiment (with P(A) > 0 ). The
usual notation is E(X ∣ A) , and this expected value is computed by the definitions given above, except that the conditional
probability density function x ↦ f (x ∣ A) replaces the ordinary probability density function f . It is very important to realize that,
except for notation, no new concepts are involved. All results that we obtain for expected value in general have analogues for these
conditional expected values. On the other hand, we will study a more general notion of conditional expected value in a later
section.
Basic Properties
The purpose of this subsection is to study some of the essential properties of expected value. Unless otherwise noted, we will
assume that the indicated expected values exist, and that the various sets and functions that we use are measurable. We start with
two simple but still essential results.
Simple Variables
First, recall that a constant c ∈ R can be thought of as a random variable (on any probability space) that takes only the value c with
probability 1. The corresponding distribution is sometimes called point mass at c .
Next recall that an indicator variable is a random variable that takes only the values 0 and 1.
In particular, if 1_A is the indicator variable of an event A, then E(1_A) = P(A), so in a sense, expected value subsumes
probability. For a book that takes expected value, rather than probability, as the fundamental starting concept, see the book
Probability via Expectation, by Peter Whittle.
Suppose now that X takes values in a set S and that r : S → ℝ, so that r(X) is a real-valued random variable; we want to compute E[r(X)] (assuming that this expected value
exists). However, to compute this expected value from the definition would require that we know the probability density function
of the transformed variable r(X) (a difficult problem, in general). Fortunately, there is a much better way, given by the change of
variables theorem for expected value. This theorem is sometimes referred to as the law of the unconscious statistician, presumably
because it is so basic and natural that it is often used without the realization that it is a theorem, and not a definition.
If X has a discrete distribution on a countable set S with probability density function f, then
E[r(X)] = ∑_{x∈S} r(x) f(x)
Proof
Figure 4.1.3 : The change of variables theorem when X has a discrete distribution.
The next result is the change of variables theorem when X has a continuous distribution. We will prove the continuous version in
stages, first when r has discrete range below and then in the next section in full generality. Even though the complete proof is
delayed, however, we will use the change of variables theorem in the proofs of many of the other properties of expected value.
Suppose that X has a continuous distribution on S ⊆ ℝ^n with probability density function f, and that r : S → ℝ. Then
E[r(X)] = ∫_S r(x) f(x) dx
Proof when r has discrete range
Figure 4.1.4 : The change of variables theorem when X has a continuous distribution and r has countable range.
The results below gives basic properties of expected value. These properties are true in general, but we will restrict the proofs
primarily to the continuous case. The proofs for the discrete case are analogous, with sums replacing integrals. The change of
variables theorem is the main tool we will need. In these theorems X and Y are real-valued random variables for an experiment
(that is, defined on an underlying probability space) and c is a constant. As usual, we assume that the indicated expected values
exist. Be sure to try the proofs yourself before reading the ones in the text.
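The change of variables theorem can be sketched numerically; here with a fair die and r(x) = x², chosen for illustration. The point is that the direct computation from the distribution of X agrees with the computation from the distribution of r(X):

```python
import numpy as np

# Change of variables (LOTUS): E[r(X)] = sum_x r(x) f(x), computed directly
# from the distribution of X, with no need for the distribution of r(X).
S = np.arange(1, 7)
f = np.full(6, 1 / 6)
r = lambda x: x ** 2

lotus = np.sum(r(S) * f)              # sum of x^2 f(x) = 91/6

# Cross-check by building the distribution of Y = r(X) explicitly:
values, counts = np.unique(r(S), return_counts=True)
direct = np.sum(values * counts / 6)
print(lotus, direct)   # both 91/6, approximately 15.1667
```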
Linearity
Our first property is the additive property: E(X + Y) = E(X) + E(Y).
Proof
Next is the scaling property: E(cX) = c E(X).
Proof
Here is the linearity of expected value in full generality. It's a simple corollary of the previous two results.
Suppose that (X_1, X_2, …, X_n) is a sequence of real-valued random variables defined on the underlying probability space and that (a_1, a_2, …, a_n) is a sequence of constants. Then
E(∑_{i=1}^n a_i X_i) = ∑_{i=1}^n a_i E(X_i)   (4.1.11)
Thus, expected value is a linear operation on the collection of real-valued random variables for the experiment. The linearity of
expected value is so basic that it is important to understand this property on an intuitive level. Indeed, it is implied by the
interpretation of expected value given in the law of large numbers.
In particular, if each X_i has mean μ, then E(M_n) = μ, where M_n = (1/n) ∑_{i=1}^n X_i is the average of the first n variables.
Proof
If the random variables in the previous result are also independent and identically distributed, then in statistical terms, the sequence
is a random sample of size n from the common distribution, and M is the sample mean.
In several important cases, a random variable from a special distribution can be decomposed into a sum of simpler random variables, and then the additive property can be used to compute the expected value.
Inequalities
The following exercises give some basic inequalities for expected value. The first, known as the positive property is the most
obvious, but is also the main tool for proving the others.
Proof
Next is the increasing property, perhaps the most important property of expected value, after linearity: if X ≤ Y (with probability 1) then E(X) ≤ E(Y).
Thus, if X is not a constant (with probability 1), then X must take values greater than its mean with positive probability and values
less than its mean with positive probability.
Symmetry
Again, suppose that X is a random variable taking values in R. The distribution of X is symmetric about a ∈ R if the distribution
of a − X is the same as the distribution of X − a .
Suppose that the distribution of X is symmetric about a ∈ R . If E(X) exists, then E(X) = a .
Proof
The previous result applies if X has a continuous distribution on R with a probability density f that is symmetric about a ; that is,
f (a + x) = f (a − x) for x ∈ R .
Independence
If X and Y are independent real-valued random variables then E(XY ) = E(X)E(Y ) .
Proof
It follows from the last result that independent random variables are uncorrelated (a concept that we will study in a later section).
Moreover, this result is more powerful than might first appear. Suppose that X and Y are independent random variables taking
values in general spaces S and T respectively, and that u : S → R and v : T → R . Then u(X) and v(Y ) are independent, real-
valued random variables and hence
E [u(X)v(Y )] = E [u(X)] E [v(Y )] (4.1.14)
Uniform Distributions
Discrete uniform distributions are widely used in combinatorial probability, and model a point chosen at random from a finite set.
Suppose that X has the discrete uniform distribution on a finite set of evenly spaced points with smallest value a and largest value b. Then E(X) = (a + b)/2, the average of the endpoints.
Proof
The previous results are easy to see if we think of E(X) as the center of mass, since the discrete uniform distribution corresponds
to a finite set of points with equal mass.
Open the special distribution simulator, and select the discrete uniform distribution. This is the uniform distribution on n
points, starting at a , evenly spaced at distance h . Vary the parameters and note the location of the mean in relation to the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
mean to the distribution mean.
Next, recall that the continuous uniform distribution on a bounded interval corresponds to selecting a point at random from the
interval. Continuous uniform distributions arise in geometric probability and a variety of other applied problems.
Suppose that X has the continuous uniform distribution on an interval [a, b], where a, b ∈ ℝ and a < b.
1. E(X) = (a + b)/2, the midpoint of the interval.
2. E(X^n) = (a^n + a^{n−1} b + ⋯ + a b^{n−1} + b^n)/(n + 1) for n ∈ ℕ.
Proof
Part (a) is easy to see if we think of the mean as the center of mass, since the uniform distribution corresponds to a uniform
distribution of mass on the interval.
Open the special distribution simulator, and select the continuous uniform distribution. This is the uniform distribution on the
interval [a, a + w] . Vary the parameters and note the location of the mean in relation to the probability density function. For
selected values of the parameters, run the simulation 1000 times and compare the empirical mean to the distribution mean.
Next, the average value of a function on an interval, as defined in calculus, has a nice interpretation in terms of the uniform
distribution.
If X is uniformly distributed on the interval [a, b], then
E[g(X)] = (1/(b − a)) ∫_a^b g(x) dx   (4.1.19)
Proof
Find the average value of the following functions on the given intervals:
1. f(x) = x on [2, 4]
2. g(x) = x² on [0, 1]
The next exercise illustrates the value of the change of variables theorem in computing expected values.
Answer
The discrete uniform distribution and the continuous uniform distribution are studied in more detail in the chapter on Special
Distributions.
Dice
Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability 1/4 each, and faces 2, 3, 4, and 5 have probability 1/8 each.
Two standard, fair dice are thrown, and the scores (X_1, X_2) recorded. Find the expected value of each of the following variables.
1. Y = X_1 + X_2, the sum of the scores.
2. U = min{X_1, X_2} and V = max{X_1, X_2}, the minimum and maximum scores.
Answer
In the dice experiment, select two fair dice. Note the shape of the probability density function and the location of the mean for
the sum, minimum, and maximum variables. Run the experiment 1000 times and compare the sample mean and the
distribution mean for each of these variables.
Two standard, ace-six flat dice are thrown, and the scores (X_1, X_2) recorded. Find the expected value of each of the following variables.
1. Y = X_1 + X_2, the sum of the scores.
2. U = min{X_1, X_2} and V = max{X_1, X_2}, the minimum and maximum scores.
Answer
In the dice experiment, select two ace-six flat dice. Note the shape of the probability density function and the location of the
mean for the sum, minimum, and maximum variables. Run the experiment 1000 times and compare the sample mean and the
distribution mean for each of these variables.
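The dice expectations can be computed by direct enumeration of the 36 outcomes. The sketch below assumes the standard ace-six flat probabilities (1/4 on faces 1 and 6, 1/8 on the others):

```python
from itertools import product

# Single-die distributions: fair, and ace-six flat.
fair = {k: 1 / 6 for k in range(1, 7)}
flat = {1: 1 / 4, 2: 1 / 8, 3: 1 / 8, 4: 1 / 8, 5: 1 / 8, 6: 1 / 4}

def expectations(die):
    """E of sum, min, max for two independent throws of the given die."""
    E = {"sum": 0.0, "min": 0.0, "max": 0.0}
    for x1, x2 in product(die, repeat=2):
        p = die[x1] * die[x2]
        E["sum"] += (x1 + x2) * p
        E["min"] += min(x1, x2) * p
        E["max"] += max(x1, x2) * p
    return E

print(expectations(fair))   # the expected sum is 7.0
print(expectations(flat))   # the sum is again 7.0, since each die has mean 3.5
```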
Bernoulli Trials
Recall that a Bernoulli trials process is a sequence X = (X_1, X_2, …) of independent, identically distributed indicator random variables. In the usual language of reliability, X_i denotes the outcome of trial i, where 1 denotes success and 0 denotes failure. The probability of success p = P(X_i = 1) ∈ [0, 1] is the basic parameter of the process. The process is named for Jacob Bernoulli. As usual, let Y = ∑_{i=1}^n X_i denote the number of successes in the first n trials; Y has the binomial distribution with parameters n and p, and has probability density function f given by
f(y) = C(n, y) p^y (1 − p)^{n−y},  y ∈ {0, 1, …, n}   (4.1.20)
where C(n, y) denotes the binomial coefficient. If Y has the binomial distribution with parameters n and p, then E(Y) = np.
Note the superiority of the second proof to the first. The result also makes intuitive sense: in n trials with success probability p, we
expect np successes.
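The two computations can be compared numerically; n = 10 and p = 0.3 are arbitrary choices:

```python
from math import comb

n, p = 10, 0.3

# Direct computation from the definition: E(Y) = sum of y f(y).
direct = sum(y * comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1))

# Via the additive property: Y = X_1 + ... + X_n with E(X_i) = p, so E(Y) = n p.
additive = n * p

print(direct, additive)   # both equal n p = 3.0, up to rounding
```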
In the binomial coin experiment, vary n and p and note the shape of the probability density function and the location of the
mean. For selected values of n and p, run the experiment 1000 times and compare the sample mean to the distribution mean.
Suppose that p ∈ (0, 1], and let N denote the trial number of the first success. This random variable has the geometric distribution on ℕ_+ with parameter p, and has probability density function g given by
g(n) = p(1 − p)^{n−1},  n ∈ ℕ_+   (4.1.23)
If N has the geometric distribution on ℕ_+ with parameter p ∈ (0, 1] then E(N) = 1/p.
Proof
Again, the result makes intuitive sense. Since p is the probability of success, we expect a success to occur after 1/p trials.
In the negative binomial experiment, select k = 1 to get the geometric distribution. Vary p and note the shape of the
probability density function and the location of the mean. For selected values of p, run the experiment 1000 times and compare
the sample mean to the distribution mean.
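A numerical check of E(N) = 1/p, truncating the series ∑ n p(1 − p)^{n−1} at a point where the tail is negligible (p = 1/4 is an arbitrary choice):

```python
# E(N) = sum over n >= 1 of n p (1 - p)^(n - 1) = 1/p.
p = 0.25
partial = sum(n * p * (1 - p) ** (n - 1) for n in range(1, 1000))
print(partial)   # approximately 4.0 = 1/p
```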
Suppose that a population consists of m objects, and that r of the objects are type 1 and m − r are type 0. A sample of n objects is chosen at random, without replacement; let X_i denote the type of the ith object selected. Recall that X = (X_1, X_2, …, X_n) is a sequence of identically distributed (but not independent) indicator random variables.
Let Y denote the number of type 1 objects in the sample, so that Y = ∑_{i=1}^n X_i. Recall that Y has the hypergeometric distribution, which has probability density function f given by
f(y) = C(r, y) C(m − r, n − y) / C(m, n),  y ∈ {0, 1, …, n}   (4.1.25)
E(Y) = n r/m.
Proof from the definition
Proof using the additive property
In the ball and urn experiment, vary n , r, and m and note the shape of the probability density function and the location of the
mean. For selected values of the parameters, run the experiment 1000 times and compare the sample mean to the distribution
mean.
Note that if we select the objects with replacement, then X would be a sequence of Bernoulli trials, and hence Y would have the binomial distribution with parameters n and p = r/m. Thus, the mean would still be E(Y) = n r/m.
Recall that the Poisson distribution has probability density function f given by f(n) = e^{−a} a^n / n! for n ∈ ℕ, where a ∈ (0, ∞) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number
of “random points” in a region of time or space; the parameter a is proportional to the size of the region. The Poisson distribution is
studied in detail in the chapter on the Poisson Process.
If N has the Poisson distribution with parameter a then E(N ) = a . Thus, the parameter of the Poisson distribution is the mean
of the distribution.
Proof
In the Poisson experiment, the parameter is a = rt . Vary the parameter and note the shape of the probability density function
and the location of the mean. For various values of the parameter, run the experiment 1000 times and compare the sample
mean to the distribution mean.
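A numerical check that the Poisson mean equals the parameter, truncating the series well into the tail (a = 2.5 is an arbitrary choice):

```python
from math import exp, factorial

# E(N) = sum over n >= 0 of n e^(-a) a^n / n! = a.
a = 2.5
mean = sum(n * exp(-a) * a**n / factorial(n) for n in range(100))
print(mean)   # approximately 2.5
```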
The Exponential Distribution
Recall that the exponential distribution is a continuous distribution with probability density function f given by
f(t) = r e^{−rt},  t ∈ [0, ∞)   (4.1.31)
where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”; in
particular, the distribution governs the time between arrivals in the Poisson model. The exponential distribution is studied in detail
in the chapter on the Poisson Process.
Suppose that T has the exponential distribution with rate parameter r. Then E(T ) = 1/r .
Proof
Recall that the mode of T is 0 and the median of T is ln 2 / r. Note how these measures of center are ordered: 0 < ln 2 / r < 1/r.
In the gamma experiment, set n = 1 to get the exponential distribution. This app simulates the first arrival in a Poisson
process. Vary r with the scroll bar and note the position of the mean relative to the graph of the probability density function.
For selected values of r, run the experiment 1000 times and compare the sample mean to the distribution mean.
Suppose again that T has the exponential distribution with rate parameter r and suppose that t > 0 . Find E(T ∣ T > t) .
Answer
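By the memoryless property, given T > t the remaining lifetime is again exponential with rate r, so one expects E(T ∣ T > t) = t + 1/r. A simulation sketch (r, t, and the seed chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Simulate exponential lifetimes and average only those exceeding t.
r, t = 2.0, 1.5
T = rng.exponential(scale=1 / r, size=1_000_000)
print(T[T > t].mean())   # close to t + 1/r = 2.0
```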
Recall that the gamma distribution is a continuous distribution with probability density function f given by
f(t) = r^n t^{n−1} e^{−rt} / (n − 1)!,  t ∈ [0, ∞)
where n ∈ ℕ_+ is the shape parameter and r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”, and in particular, models the nth arrival in the Poisson process. Thus it follows that if (X_1, X_2, …, X_n) is a sequence of independent random variables, each having the exponential distribution with rate parameter r, then T = ∑_{i=1}^n X_i
has the gamma distribution with shape parameter n and rate parameter r. The gamma distribution is studied in more generality,
with non-integer shape parameters, in the chapter on the Special Distributions.
Suppose that T has the gamma distribution with shape parameter n and rate parameter r. Then E(T ) = n/r .
Proof from the definition
Proof using the additive property
Note again how much easier and more intuitive the second proof is than the first.
Open the gamma experiment, which simulates the arrival times in the Poisson process. Vary the parameters and note the
position of the mean relative to the graph of the probability density function. For selected parameter values, run the experiment
1000 times and compare the sample mean to the distribution mean.
Beta Distributions
The distributions in this subsection belong to the family of beta distributions, which are widely used to model random proportions
and probabilities. The beta distribution is studied in detail in the chapter on Special Distributions.
Suppose that X has probability density function f given by f(x) = 3x² for x ∈ [0, 1].
In the special distribution simulator, select the beta distribution and set a = 3 and b = 1 to get the distribution in the last
exercise. Run the experiment 1000 times and compare the sample mean to the distribution mean.
Suppose that a sphere has a random radius R with probability density function f given by f(r) = 12r²(1 − r) for r ∈ [0, 1]. Find the expected value of each of the following:
1. The circumference C = 2πR
2. The surface area A = 4πR²
3. The volume V = (4/3)πR³
Answer
The particular beta distribution in the last exercise is also known as the (standard) arcsine distribution. It governs the last time that
the Brownian motion process hits 0 during the time interval [0, 1]. The arcsine distribution is studied in more generality in the
chapter on Special Distributions.
Open the Brownian motion experiment and select the last zero. Run the simulation 1000 times and compare the sample mean
to the distribution mean.
Suppose that the grades on a test are described by the random variable Y = 100X where X has the beta distribution with probability density function f given by f(x) = 12x(1 − x)² for x ∈ [0, 1]. The grades are generally low, so the teacher decides to “curve” the grades using the transformation Z = 10√Y = 100√X. Find the expected value of each of the following variables:
1. X
2. Y
3. Z
Answer
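The three expected values can be computed numerically from the density; Y follows by linearity, and Z = 100√X is handled with the change of variables theorem:

```python
from math import sqrt
from scipy.integrate import quad

# Density from the exercise: f(x) = 12 x (1 - x)^2 on [0, 1].
f = lambda x: 12 * x * (1 - x) ** 2

EX, _ = quad(lambda x: x * f(x), 0, 1)                    # E(X)
EY = 100 * EX                                             # E(Y), by linearity
EZ, _ = quad(lambda x: 100 * sqrt(x) * f(x), 0, 1)        # E(Z), by LOTUS
print(EX, EY, EZ)
```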
Recall that the Pareto distribution is a continuous distribution with probability density function f given by f(x) = a / x^{a+1} for x ∈ [1, ∞), where a ∈ (0, ∞) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model certain financial variables. The Pareto distribution is studied in detail in the chapter on Special Distributions.
Suppose that X has the Pareto distribution with shape parameter a. Then
1. E(X) = ∞ if 0 < a ≤ 1
2. E(X) = a/(a − 1) if a > 1
Proof
The previous exercise gives us our first example of a distribution whose mean is infinite.
In the special distribution simulator, select the Pareto distribution. Note the shape of the probability density function and the
location of the mean. For the following values of the shape parameter a , run the experiment 1000 times and note the behavior
of the empirical mean.
1. a = 1
2. a = 2
3. a = 3 .
Recall that the (standard) Cauchy distribution is a continuous distribution with probability density function f given by f(x) = 1/[π(1 + x²)] for x ∈ ℝ. This distribution is named for Augustin Cauchy. The Cauchy distribution is studied in detail in the chapter on Special Distributions.
Note that the graph of f is symmetric about 0 and is unimodal. Thus, the mode and median of X are both 0. By the symmetry result, if X had a mean, the mean would be 0 also, but alas the mean does not exist. Moreover, the non-existence of the mean is not just a pedantic technicality. If we think of the probability distribution as a mass distribution, then the moment to the right of a is ∫_a^∞ (x − a) f(x) dx = ∞ and the moment to the left of a is ∫_{−∞}^a (x − a) f(x) dx = −∞ for every a ∈ ℝ. The center of mass simply does not exist. Probabilistically, the law of large numbers fails, as you can see in the following simulation exercise:
In the Cauchy experiment (with the default parameter values), a light source is 1 unit from position 0 on an infinite straight wall. The angle that the light makes with the perpendicular is uniformly distributed on the interval (−π/2, π/2), so that the position of the light beam on the wall has the Cauchy distribution. Run the simulation 1000 times and note the behavior of the empirical mean.
Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in
the chapter on Special Distributions.
The standard normal distribution is unimodal and symmetric about 0. Thus, the median, mean, and mode all agree. More generally,
for μ ∈ (−∞, ∞) and σ ∈ (0, ∞), recall that X = μ + σZ has the normal distribution with location parameter μ and scale
parameter σ. X has probability density function f given by
2
1 1 x −μ
f (x) = exp[− ( ) ], x ∈ R (4.1.43)
−−
√2πσ 2 σ
If X has the normal distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞), then E(X) = μ
Proof
In the special distribution simulator, select the normal distribution. Vary the parameters and note the location of the mean. For
selected parameter values, run the simulation 1000 times and compare the sample mean to the distribution mean.
Additional Exercises
Suppose that (X, Y ) has probability density function f given by f (x, y) = x + y for (x, y) ∈ [0, 1] × [0, 1] . Find the
following expected values:
1. E(X)
2. E(X²Y)
3. E(X² + Y²)
4. E(XY ∣ Y > X)
Answer
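The unconditional expectations can be computed by numerical double integration over the unit square; a sketch for E(X) is below (the conditional expectation in part 4 would additionally require restricting to the region y > x and dividing by P(Y > X)):

```python
from scipy.integrate import dblquad

# Joint density f(x, y) = x + y on the unit square.
f = lambda x, y: x + y

# dblquad expects the integrand as func(y, x); here the region is the full
# square, so the order of integration does not matter.
EX, _ = dblquad(lambda y, x: x * f(x, y), 0, 1, 0, 1)
print(EX)   # 7/12, approximately 0.5833
```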
Suppose that N has a discrete distribution with probability density function f given by f(n) = (1/50) n²(5 − n) for n ∈ {1, 2, 3, 4}. Find each of the following:
1. The median of N.
2. The mode of N.
3. E(N).
4. E(N²).
5. E(1/N).
6. E(1/N²).
Answer
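Moments of this small discrete distribution can be computed exactly with rational arithmetic; a sketch (the median would be read off from the cumulative probabilities in the same way):

```python
from fractions import Fraction

# f(n) = n^2 (5 - n) / 50 on {1, 2, 3, 4}
f = {n: Fraction(n**2 * (5 - n), 50) for n in [1, 2, 3, 4]}
assert sum(f.values()) == 1                             # sanity check: a valid pdf

EN = sum(n * p for n, p in f.items())                   # E(N)
EN2 = sum(n**2 * p for n, p in f.items())               # E(N^2)
E_inv = sum(Fraction(1, n) * p for n, p in f.items())   # E(1/N)
mode = max(f, key=f.get)                                # value maximizing f
print(EN, EN2, E_inv, mode)
```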
Suppose that X and Y are real-valued random variables with E(X) = 5 and E(Y ) = −2 . Find E(3X + 4Y − 7) .
Answer
Suppose that X and Y are real-valued, independent random variables, and that E(X) = 5 and E(Y ) = −2 . Find
E [(3X − 4)(2Y + 7)] .
Answer
Suppose that there are 5 duck hunters, each a perfect shot. A flock of 10 ducks fly over, and each hunter selects one duck at
random and shoots. Find the expected number of ducks killed.
Solution
For a more complete analysis of the duck hunter problem, see The Number of Distinct Sample Values in the chapter on Finite
Sampling Models.
Consider the following game: An urn initially contains one red and one green ball. A ball is selected at random, and if the ball
is green, the game is over. If the ball is red, the ball is returned to the urn, another red ball is added, and the game continues. At
each stage, a ball is selected at random, and if the ball is green, the game is over. If the ball is red, the ball is returned to the
urn, another red ball is added, and the game continues. Let X denote the length of the game (that is, the number of selections
required to obtain a green ball). Find E(X).
Solution
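The game is easy to simulate with a short Python sketch (the function name and play counts are illustrative). Since P(X > n) = (1/2)(2/3)⋯(n/(n + 1)) = 1/(n + 1), we have P(X = n) = 1/[n(n + 1)], so E(X) = ∑ 1/(n + 1) diverges: the game ends with probability 1 but has infinite expected length, and sample means never stabilize.

```python
import random

def play_game(rng):
    """One play of the urn game; returns the number of selections needed
    to draw the green ball. The urn starts with 1 red and 1 green ball,
    and each red draw returns the red and adds another red."""
    reds = 1
    draws = 0
    while True:
        draws += 1
        if rng.randrange(reds + 1) == 0:   # slot 0 plays the role of the green ball
            return draws
        reds += 1

rng = random.Random(1)
sample = [play_game(rng) for _ in range(20_000)]
frac_one = sum(x == 1 for x in sample) / len(sample)   # should be near P(X = 1) = 1/2
```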
This page titled 4.1: Definitions and Basic Properties is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
4.2: Additional Properties
In this section, we study some properties of expected value that are a bit more specialized than the basic properties considered in
the previous section. Nonetheless, the new results are also very important. They include two fundamental inequalities as well as
special formulas for the expected value of a nonnegative variable. As usual, unless otherwise noted, we assume that the referenced
expected values exist.
Basic Theory
Markov's Inequality
Our first result is known as Markov's inequality (named after Andrei Markov). It gives an upper bound for the tail probability of a
nonnegative random variable in terms of the expected value of the variable.
Markov's inequality. If X is a nonnegative random variable, then
P(X ≥ x) ≤ E(X)/x,  x > 0
Proof
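As a numeric illustration (a sketch, not part of the text), we can compare the Markov bound E(X)/x with the simulated tail probability of an exponential variable with mean 1; the particular distribution and sample size are arbitrary choices.

```python
import random

rng = random.Random(7)
sample = [rng.expovariate(1.0) for _ in range(100_000)]   # exponential, E(X) = 1
mean = sum(sample) / len(sample)

for x in [0.5, 1, 2, 4]:
    tail = sum(v >= x for v in sample) / len(sample)   # estimates P(X >= x)
    bound = mean / x                                   # Markov: P(X >= x) <= E(X)/x
    # the bound always holds, although for small x it exceeds 1 and is worthless
    assert tail <= bound + 0.01
```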
The upper bound in Markov's inequality may be rather crude. In fact, it's quite possible that E(X)/x ≥ 1 , in which case the bound
is worthless. However, the real value of Markov's inequality lies in the fact that it holds with no assumptions whatsoever on the
distribution of X (other than that X be nonnegative). Also, as an example below shows, the inequality is tight in the sense that
equality can hold for a given x. Here is a simple corollary of Markov's inequality.
Proof
Proof
If X is a nonnegative random variable, then
E(X) = ∫_0^∞ P(X > x) dx
Proof
The following result is similar to the theorem above, but is specialized to nonnegative integer valued variables:
4.2.1 https://stats.libretexts.org/@go/page/10157
If N is a nonnegative integer-valued random variable, then
E(N) = ∑_{n=0}^∞ P(N > n) = ∑_{n=1}^∞ P(N ≥ n)
Proof
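For a concrete check (a sketch; the geometric distribution is used only as a convenient example), the identity E(N) = ∑_{n=0}^∞ P(N > n) can be verified numerically, since a geometric variable satisfies P(N > n) = (1 − p)^n:

```python
p = 1 / 3
f = lambda n: p * (1 - p) ** (n - 1)   # geometric pdf on {1, 2, ...}
N_MAX = 200                            # truncation; the neglected tail is ~(2/3)^200

direct = sum(n * f(n) for n in range(1, N_MAX))       # E(N) from the definition
tail_sum = sum((1 - p) ** n for n in range(N_MAX))    # sum of P(N > n)

# both sums converge to 1/p = 3
```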
A General Definition
The special expected value formula for nonnegative variables can be used as the basis of a general formulation of expected value
that would work for discrete, continuous, or even mixed distributions, and would not require the assumption of the existence of
probability density functions. First, the special formula is taken as the definition of E(X) if X is nonnegative.
Next, for x ∈ R, recall that the positive and negative parts of x are x⁺ = max{x, 0} and x⁻ = max{0, −x}.
For x ∈ R,
1. x⁺ ≥ 0, x⁻ ≥ 0
2. x = x⁺ − x⁻
3. |x| = x⁺ + x⁻
Now, if X is a real-valued random variable, then X⁺ and X⁻, the positive and negative parts of X, are nonnegative random variables, so their expected values are defined as above. The definition of E(X) is then natural, anticipating of course the linearity property: E(X) = E(X⁺) − E(X⁻), assuming that at least one of the expected values on the right is finite.
The usual formulas for expected value in terms of the probability density function, for discrete, continuous, or mixed distributions,
would now be proven as theorems. We will not go further in this direction, however, since the most complete and general definition
of expected value is given in the advanced section on expected value as an integral.
Recall the change of variables formula: E[r(X)] = ∑_{x∈S} r(x) f(x) if X has a discrete distribution on S with probability density function f, and E[r(X)] = ∫_S r(x) f(x) dx if X has a continuous distribution on S with probability density function f.
In both cases, of course, we assume that the expected values exist. In the previous section on basic properties, we proved the
change of variables theorem when X has a discrete distribution and when X has a continuous distribution but r has countable
range. Now we can finally finish our proof in the continuous case.
Suppose that X has a continuous distribution on S with probability density function f, and r : S → R. Then
E[r(X)] = ∫_S r(x) f(x) dx
Proof
Jensen's Inequality
Our next sequence of exercises will establish an important inequality known as Jensen's inequality, named for Johan Jensen. First
we need a definition.
A real-valued function g defined on an interval S ⊆ R is said to be convex (or concave upward) on S if for each t ∈ S , there
exist numbers a and b (that may depend on t ), such that
1. a + bt = g(t)
2. a + bx ≤ g(x) for all x ∈ S
The graph of x ↦ a + bx is called a supporting line for g at t .
Thus, a convex function has at least one supporting line at each point in its domain.

Jensen's inequality. If X takes values in an interval S ⊆ R and g : S → R is convex on S, then
E[g(X)] ≥ g[E(X)]
Proof
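A quick numeric illustration of Jensen's inequality (a sketch; the uniform distribution and the convex function g are arbitrary choices):

```python
import random

rng = random.Random(3)
xs = [rng.uniform(0.0, 1.0) for _ in range(100_000)]

g = lambda x: x * x   # convex on [0, 1]
mean_x = sum(xs) / len(xs)
mean_gx = sum(g(x) for x in xs) / len(xs)

# Jensen: E[g(X)] >= g(E(X)); here E(X^2) = 1/3 while [E(X)]^2 = 1/4
assert mean_gx >= g(mean_x)
```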
Jensen's inequality extends easily to higher dimensions. The 2-dimensional version is particularly important, because it will be
used to derive several special inequalities in the section on vector spaces of random variables. We need two definitions.
A set S ⊆ Rⁿ is convex if for every pair of points in S, the line segment connecting those points also lies in S. That is, if x, y ∈ S and p ∈ [0, 1] then px + (1 − p)y ∈ S.
Suppose that S ⊆ Rⁿ is convex. A function g : S → R is convex on S if for each t ∈ S, there exist a ∈ R and b ∈ Rⁿ (that may depend on t) such that
1. a + b ⋅ t = g(t)
2. a + b ⋅ x ≤ g(x) for all x ∈ S
The graph of x ↦ a + b ⋅ x is called a supporting hyperplane for g at t.
In R³, a supporting hyperplane is an ordinary plane. From calculus, if g has continuous second derivatives on S and has a positive semi-definite second derivative matrix, then g is convex on S. Suppose now that X = (X₁, X₂, …, Xₙ) takes values in S ⊆ Rⁿ, and let E(X) = (E(X₁), E(X₂), …, E(Xₙ)). The following result is the general version of Jensen's inequality.
Proof
We will study the expected value of random vectors and matrices in more detail in a later section. In both the one and n -
dimensional cases, a function g : S → R is concave (or concave downward) if the inequality in the definition is reversed. Jensen's
inequality also reverses.
Suppose that X has a continuous distribution with support on an interval (a, b) ⊆ R. Let F denote the cumulative distribution function of X, so that F⁻¹ is the quantile function of X. If g : (a, b) → R then (assuming that the expected value exists),
E[g(X)] = ∫_0^1 g[F⁻¹(p)] dp  (4.2.21)
Proof
So in particular, E(X) = ∫_0^1 F⁻¹(p) dp.
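The quantile formula is easy to check numerically; here is a sketch for an exponential distribution with rate r = 2 (so E(X) = 1/2), using a midpoint Riemann sum to avoid the singularity of F⁻¹ at p = 1. The rate and grid size are arbitrary choices.

```python
import math

r = 2.0
quantile = lambda p: -math.log(1 - p) / r   # F^{-1}(p) for the exponential with rate r

n = 200_000
# midpoint rule for the integral of F^{-1}(p) over [0, 1]
integral = sum(quantile((k + 0.5) / n) for k in range(n)) / n
# integral ≈ E(X) = 1/r = 0.5
```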
Recall that the exponential distribution is a continuous distribution on [0, ∞) with probability density function f given by f(t) = r e^(−rt) for t ∈ [0, ∞), where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”; in particular, the distribution governs the time between arrivals in the Poisson model. The exponential distribution is studied in detail in the chapter on the Poisson Process.
Open the gamma experiment. Keep the default value of the stopping parameter (n = 1 ), which gives the exponential
distribution. Vary the rate parameter r and note the shape of the probability density function and the location of the mean. For
various values of the rate parameter, run the experiment 1000 times and compare the sample mean with the distribution mean.
Recall that the geometric distribution on N₊ with success parameter p ∈ (0, 1) is a discrete distribution with probability density function f given by
f(n) = p(1 − p)^(n−1),  n ∈ N₊  (4.2.24)
Suppose that N has the geometric distribution on N₊ with parameter p ∈ (0, 1). Then E(N) = 1/p.
Open the negative binomial experiment. Keep the default value of the stopping parameter (k = 1 ), which gives the geometric
distribution. Vary the success parameter p and note the shape of the probability density function and the location of the mean.
For various values of the success parameter, run the experiment 1000 times and compare the sample mean with the distribution
mean.
Recall that the Pareto distribution with shape parameter a ∈ (0, ∞) is a continuous distribution on [1, ∞) with probability density function f given by f(x) = a/x^(a+1) for x ∈ [1, ∞). The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model certain financial variables. The Pareto distribution is studied in detail in the chapter on Special Distributions.
Open the special distribution simulator and select the Pareto distribution. Keep the default value of the scale parameter. Vary
the shape parameter and note the shape of the probability density function and the location of the mean. For various values of
the shape parameter, run the experiment 1000 times and compare the sample mean with the distribution mean.
A Bivariate Distribution
Suppose that (X, Y) has probability density function f given by f(x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1.
1. Show that the domain of f is a convex set.
2. Show that (x, y) ↦ x² + y² is convex on the domain of f.
3. Compute E(X² + Y²).
Proof
This page titled 4.2: Additional Properties is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
4.3: Variance
Recall that the expected value of a real-valued random variable is the mean of the variable, and is a measure of the center of the
distribution. Recall also that by taking the expected value of various transformations of the variable, we can measure other
interesting characteristics of the distribution. In this section, we will study expected values that measure the spread of the
distribution about the mean.
Basic Theory
Definitions and Interpretations
As usual, we start with a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F
the collection of events, and P the probability measure on the sample space (Ω, F ). Suppose that X is a random variable for the
experiment, taking values in S ⊆ R . Recall that E(X), the expected value (or mean) of X gives the center of the distribution of X.
Suppose that X is a random variable with mean μ = E(X).
1. The variance of X is var(X) = E[(X − μ)²].
2. The standard deviation of X is sd(X) = √var(X).
Implicit in the definition is the assumption that the mean E(X) exists, as a real number. If this is not the case, then var(X) (and
hence also sd(X)) are undefined. Even if E(X) does exist as a real number, it's possible that var(X) = ∞ . For the remainder of
our discussion of the basic theory, we will assume that expected values that are mentioned exist as real numbers.
The variance and standard deviation of X are both measures of the spread of the distribution about the mean. Variance (as we will
see) has nicer mathematical properties, but its physical unit is the square of that of X. Standard deviation, on the other hand, is not
as nice mathematically, but has the advantage that its physical unit is the same as that of X. When the random variable X is understood, the standard deviation is often denoted by σ, so that the variance is σ².
Recall that the second moment of X about a ∈ R is E[(X − a)²]. Thus, the variance is the second moment of X about the mean μ = E(X), or equivalently, the second central moment of X. In general, the second moment of X about a ∈ R can also be thought
of as the mean square error if the constant a is used as an estimate of X. In addition, second moments have a nice interpretation in
physics. If we think of the distribution of X as a mass distribution in R, then the second moment of X about a ∈ R is the moment
of inertia of the mass distribution about a . This is a measure of the resistance of the mass distribution to any change in its rotational
motion about a . In particular, the variance of X is the moment of inertia of the mass distribution about the center of mass μ .
Proof
4.3.1 https://stats.libretexts.org/@go/page/10158
The relationship between measures of center and measures of spread is studied in more detail in the advanced section on vector
spaces of random variables.
Properties
The following exercises give some basic properties of variance, which in turn rely on basic properties of expected value. As usual,
be sure to try the proofs yourself before reading the ones in the text. Our first results are computational formulas based on the
change of variables formula for expected value.
Let μ = E(X).
1. If X has a discrete distribution with probability density function f, then var(X) = ∑_{x∈S} (x − μ)² f(x).
2. If X has a continuous distribution with probability density function f, then var(X) = ∫_S (x − μ)² f(x) dx.
Proof
Our next result is a variance formula that is usually better than the definition for computational purposes.
var(X) = E(X²) − [E(X)]².
Proof
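As a sanity check (a sketch; the fair-die distribution is just a convenient example at this point), the computational formula agrees with the definition:

```python
outcomes = range(1, 7)            # scores of a fair die
f = {x: 1 / 6 for x in outcomes}

mu = sum(x * f[x] for x in outcomes)                   # E(X) = 7/2
second = sum(x ** 2 * f[x] for x in outcomes)          # E(X^2) = 91/6
var_def = sum((x - mu) ** 2 * f[x] for x in outcomes)  # definition of var(X)
var_short = second - mu ** 2                           # shortcut formula

# both give var(X) = 35/12
```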
Variance is always nonnegative, since it's the expected value of a nonnegative random variable. Moreover, any random variable that
really is random (not a constant) will have strictly positive variance.
Our next result shows how the variance and standard deviation are changed by a linear transformation of the random variable. In
particular, note that variance, unlike general expected value, is not a linear operation. This is not really surprising since the variance
is the expected value of a nonlinear function of the variable: x ↦ (x − μ)².
If a, b ∈ R then
1. var(a + bX) = b² var(X)
2. sd(a + bX) = |b| sd(X)
Recall that when b > 0 , the linear transformation x ↦ a + bx is called a location-scale transformation and often corresponds to a
change of location and change of scale in the physical units. For example, the change from inches to centimeters in a measurement
of length is a scale transformation, and the change from Fahrenheit to Celsius in a measurement of temperature is both a location
and scale transformation. The previous result shows that when a location-scale transformation is applied to a random variable, the
standard deviation does not depend on the location parameter, but is multiplied by the scale factor. There is a particularly important
location-scale transformation.
Suppose that X is a random variable with mean μ and variance σ². The random variable Z defined as follows is the standard score of X.
Z = (X − μ)/σ  (4.3.2)
1. E(Z) = 0
2. var(Z) = 1
Proof
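In a simulation sketch, standardizing a sample by its own mean and standard deviation produces values with mean 0 and variance 1, mirroring the result above (the normal sample here is an arbitrary choice):

```python
import random

rng = random.Random(11)
xs = [rng.gauss(10.0, 2.0) for _ in range(100_000)]

mu = sum(xs) / len(xs)
sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
zs = [(x - mu) / sigma for x in xs]        # standard scores

z_mean = sum(zs) / len(zs)                 # ≈ 0
z_var = sum(z * z for z in zs) / len(zs)   # ≈ 1
```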
Since X and its mean and standard deviation all have the same physical units, the standard score Z is dimensionless. It measures
the directed distance from E(X) to X in terms of standard deviations.
Let Z denote the standard score of X, and suppose that Y = a + bX where a, b ∈ R and b ≠ 0 .
1. If b > 0 , the standard score of Y is Z .
2. If b < 0 , the standard score of Y is −Z .
Proof
As just noted, when b > 0 , the variable Y = a + bX is a location-scale transformation and often corresponds to a change of
physical units. Since the standard score is dimensionless, it's reasonable that the standard scores of X and Y are the same. Here is
another standardized measure of dispersion:
Suppose that X is a random variable with E(X) ≠ 0. The coefficient of variation is the ratio of the standard deviation to the mean:
cv(X) = sd(X)/E(X)  (4.3.4)
The coefficient of variation is also dimensionless, and is sometimes used to compare variability for random variables with different
means. We will learn how to compute the variance of the sum of two random variables in the section on covariance.
Chebyshev's Inequality
Chebyshev's inequality (named after Pafnuty Chebyshev) gives an upper bound on the probability that a random variable will be
more than a specified distance from its mean. This is often useful in applied problems where the distribution is unknown, but the
mean and variance are known (at least approximately). In the following two results, suppose that X is a real-valued random
variable with mean μ = E(X) ∈ R and standard deviation σ = sd(X) ∈ (0, ∞) .
Chebyshev's inequality 1.
P(|X − μ| ≥ t) ≤ σ²/t²,  t > 0  (4.3.5)
Proof
Chebyshev's inequality 2.
P(|X − μ| ≥ kσ) ≤ 1/k²,  k > 0  (4.3.6)
Proof
The usefulness of the Chebyshev inequality comes from the fact that it holds for any distribution (assuming only that the mean and
variance exist). The tradeoff is that for many specific distributions, the Chebyshev bound is rather crude. Note in particular that the
first inequality is useless when t ≤ σ , and the second inequality is useless when k ≤ 1 , since 1 is an upper bound for the
probability of any event. On the other hand, it's easy to construct a distribution for which Chebyshev's inequality is sharp for a
specified value of t ∈ (0, ∞) . Such a distribution is given in an exercise below.
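To see how crude the bound can be, here is a sketch comparing Chebyshev's bound with the exact two-sided tail of a standard normal distribution, computed via the error function (the normal distribution is an arbitrary choice for illustration):

```python
import math

def normal_two_sided_tail(k):
    """P(|Z| >= k) for a standard normal variable Z."""
    return 2 * (1 - 0.5 * (1 + math.erf(k / math.sqrt(2))))

for k in [1, 2, 3]:
    exact = normal_two_sided_tail(k)
    bound = 1 / k ** 2        # Chebyshev's bound for P(|X - mu| >= k sigma)
    assert exact <= bound
# e.g. k = 2: exact ≈ 0.0455 versus the bound 0.25 — valid but crude
```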
Examples and Applications
As always, be sure to try the problems yourself before looking at the solutions and answers.
Indicator Variables
Suppose that X is an indicator variable with p = P(X = 1) , where p ∈ [0, 1]. Then
1. E(X) = p
2. var(X) = p(1 − p)
Proof
The graph of var(X) as a function of p is a parabola, opening downward, with roots at 0 and 1. Thus the minimum value of var(X) is 0, and occurs when p = 0 and p = 1 (when X is deterministic, of course). The maximum value is 1/4 and occurs when p = 1/2.
Uniform Distributions
Discrete uniform distributions are widely used in combinatorial probability, and model a point chosen at random from a finite set.
The mean and variance have simple forms for the discrete uniform distribution on a set of evenly spaced points (sometimes referred
to as a discrete interval):
Suppose that X has the discrete uniform distribution on {a, a + h, …, a + (n − 1)h} where a ∈ R, h ∈ (0, ∞), and n ∈ N₊. Let b = a + (n − 1)h, the right endpoint. Then
1. E(X) = (a + b)/2
2. var(X) = (b − a)(b − a + 2h)/12
Proof
Note that the mean is simply the average of the endpoints, while the variance depends only on the difference between the endpoints and the step size.
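The formulas are easy to verify by direct enumeration; a quick sketch (with arbitrary parameter values, not from the text):

```python
a, h, n = 2.0, 0.5, 9
points = [a + k * h for k in range(n)]   # the discrete interval
b = points[-1]                           # right endpoint a + (n - 1)h

mu = sum(points) / n
var = sum((x - mu) ** 2 for x in points) / n

assert abs(mu - (a + b) / 2) < 1e-12                      # mean formula
assert abs(var - (b - a) * (b - a + 2 * h) / 12) < 1e-12  # variance formula
```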
Open the special distribution simulator, and select the discrete uniform distribution. Vary the parameters and note the location
and size of the mean ± standard deviation bar in relation to the probability density function. For selected values of the
parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and
standard deviation.
Next, recall that the continuous uniform distribution on a bounded interval corresponds to selecting a point at random from the
interval. Continuous uniform distributions arise in geometric probability and a variety of other applied problems.
Suppose that X has the continuous uniform distribution on the interval [a, b] where a, b ∈ R with a < b. Then
1. E(X) = (a + b)/2
2. var(X) = (b − a)²/12
Proof
Note that the mean is the midpoint of the interval and the variance depends only on the length of the interval. Compare this with the
results in the discrete case.
Open the special distribution simulator, and select the continuous uniform distribution. This is the uniform distribution on the
interval [a, a + w] . Vary the parameters and note the location and size of the mean ± standard deviation bar in relation to the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
mean and standard deviation to the distribution mean and standard deviation.
Dice
Recall that a fair die is one in which the faces are equally likely. In addition to fair dice, there are various types of crooked dice.
Here are three:
An ace-six flat die is a six-sided die in which faces 1 and 6 have probability 1/4 each while faces 2, 3, 4, and 5 have probability 1/8 each.
A two-five flat die is a six-sided die in which faces 2 and 5 have probability 1/4 each while faces 1, 3, 4, and 6 have probability 1/8 each.
A three-four flat die is a six-sided die in which faces 3 and 4 have probability 1/4 each while faces 1, 2, 5, and 6 have probability 1/8 each.
A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular probabilities that we use (1/4 and 1/8) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter axis have slightly larger probabilities than the other four faces. Flat dice are sometimes used by gamblers to cheat. In the following problems, you will compute the mean and variance for each of the various types of dice. Be sure to compare the results.
A standard, fair die is thrown and the score X is recorded. Sketch the graph of the probability density function and compute
each of the following:
1. E(X)
2. var(X)
Answer
An ace-six flat die is thrown and the score X is recorded. Sketch the graph of the probability density function and compute
each of the following:
1. E(X)
2. var(X)
Answer
A two-five flat die is thrown and the score X is recorded. Sketch the graph of the probability density function and compute
each of the following:
1. E(X)
2. var(X)
Answer
A three-four flat die is thrown and the score X is recorded. Sketch the graph of the probability density function and compute
each of the following:
1. E(X)
2. var(X)
Answer
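The four dice exercises above can all be handled uniformly with a short sketch that computes the exact mean and variance from each probability density function (the helper function name is illustrative):

```python
def moments(pdf):
    """Exact mean and variance of a discrete distribution given as {value: prob}."""
    mu = sum(x * p for x, p in pdf.items())
    var = sum((x - mu) ** 2 * p for x, p in pdf.items())
    return mu, var

faces = range(1, 7)
fair = {x: 1 / 6 for x in faces}
ace_six = {x: 1 / 4 if x in (1, 6) else 1 / 8 for x in faces}
two_five = {x: 1 / 4 if x in (2, 5) else 1 / 8 for x in faces}
three_four = {x: 1 / 4 if x in (3, 4) else 1 / 8 for x in faces}

# every die is symmetric about 7/2; the variances are
# fair: 35/12, ace-six: 15/4, two-five: 11/4, three-four: 9/4
results = {name: moments(d) for name, d in
           [("fair", fair), ("ace-six", ace_six),
            ("two-five", two_five), ("three-four", three_four)]}
```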
In the dice experiment, select one die. For each of the following cases, note the location and size of the mean ± standard
deviation bar in relation to the probability density function. Run the experiment 1000 times and compare the empirical mean
and standard deviation to the distribution mean and standard deviation.
1. Fair die
2. Ace-six flat die
3. Two-five flat die
4. Three-four flat die
The Poisson Distribution
Recall that the Poisson distribution is a discrete distribution on N with probability density function f given by
f(n) = e^(−a) a^n / n!,  n ∈ N  (4.3.10)
where a ∈ (0, ∞) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number
of “random points” in a region of time or space; the parameter a is proportional to the size of the region. The Poisson distribution is
studied in detail in the chapter on the Poisson Process.
Suppose that N has the Poisson distribution with parameter a. Then E(N) = a and var(N) = a. Thus, the parameter of the Poisson distribution is both the mean and the variance of the distribution.
In the Poisson experiment, the parameter is a = rt . Vary the parameter and note the size and location of the mean ± standard
deviation bar in relation to the probability density function. For selected values of the parameter, run the experiment 1000
times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Recall that the geometric distribution on N₊ is a discrete distribution with probability density function f given by
f(n) = p(1 − p)^(n−1),  n ∈ N₊  (4.3.13)
Suppose that N has the geometric distribution on N₊ with success parameter p ∈ (0, 1]. Then
1. E(N) = 1/p
2. var(N) = (1 − p)/p²
Proof
Note that the variance is 0 when p = 1 , not surprising since X is deterministic in this case.
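A direct numeric check of the mean and variance formulas, summing the probability density function (a sketch with an arbitrary value of p):

```python
p = 0.3
f = lambda n: p * (1 - p) ** (n - 1)   # geometric pdf on {1, 2, ...}
N_MAX = 400                            # truncation; (0.7)^400 is negligible

mean = sum(n * f(n) for n in range(1, N_MAX))
second = sum(n * n * f(n) for n in range(1, N_MAX))
var = second - mean ** 2

assert abs(mean - 1 / p) < 1e-9             # 1/p
assert abs(var - (1 - p) / p ** 2) < 1e-9   # (1-p)/p^2
```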
In the negative binomial experiment, set k = 1 to get the geometric distribution. Vary p with the scroll bar and note the size
and location of the mean ± standard deviation bar in relation to the probability density function. For selected values of p, run
the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard
deviation.
Suppose that N has the geometric distribution with parameter p = 3/4. Compute the true value and the Chebyshev bound for the probability that N is at least 2 standard deviations away from the mean.
Recall that the exponential distribution is a continuous distribution on [0, ∞) with probability density function f given by f(t) = r e^(−rt) for t ∈ [0, ∞), where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”. The exponential distribution is studied in detail in the chapter on the Poisson Process.
Suppose that T has the exponential distribution with rate parameter r. Then
1. E(T) = 1/r
2. var(T) = 1/r²
Proof
Thus, for the exponential distribution, the mean and standard deviation are the same.
In the gamma experiment, set k = 1 to get the exponential distribution. Vary r with the scroll bar and note the size and
location of the mean ± standard deviation bar in relation to the probability density function. For selected values of r, run the
experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard
deviation.
Suppose that X has the exponential distribution with rate parameter r > 0 . Compute the true value and the Chebyshev bound
for the probability that X is at least k standard deviations away from the mean.
Answer
Recall that the Pareto distribution with shape parameter a ∈ (0, ∞) is a continuous distribution on [1, ∞) with probability density function f given by f(x) = a/x^(a+1) for x ∈ [1, ∞). The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model financial variables such as income. The Pareto distribution is studied in detail in the chapter on Special Distributions.
Suppose that X has the Pareto distribution with shape parameter a. Then
1. E(X) = ∞ if 0 < a ≤ 1 and E(X) = a/(a − 1) if 1 < a < ∞
2. var(X) is undefined if 0 < a ≤ 1, var(X) = ∞ if 1 < a ≤ 2, and var(X) = a/[(a − 1)²(a − 2)] if 2 < a < ∞
Proof
In the special distribution simulator, select the Pareto distribution. Vary a with the scroll bar and note the size and location of
the mean ± standard deviation bar. For each of the following values of a , run the experiment 1000 times and note the behavior
of the empirical mean and standard deviation.
1. a = 1
2. a = 2
3. a = 3
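The contrast between a ≤ 1 (infinite mean) and larger values of a can also be seen in a simulation sketch using inversion sampling (the function name, seed, and sample sizes are illustrative):

```python
import random

def pareto_sample(rng, a, n):
    """Sample the Pareto distribution on [1, ∞) by inversion:
    F(x) = 1 - x^(-a), so F^{-1}(p) = (1 - p)^(-1/a)."""
    return [(1 - rng.random()) ** (-1 / a) for _ in range(n)]

rng = random.Random(9)
xs3 = pareto_sample(rng, 3, 100_000)
mean3 = sum(xs3) / len(xs3)     # should settle near a/(a-1) = 3/2

xs1 = pareto_sample(rng, 1, 100_000)
mean1 = sum(xs1) / len(xs1)     # unstable: E(X) = ∞ when a = 1
```

Re-running with different seeds, the a = 3 sample mean stays close to 3/2 while the a = 1 sample mean is dominated by rare huge values.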
Recall that the standard normal distribution is a continuous distribution on R with probability density function ϕ given by
ϕ(z) = (1/√(2π)) e^(−z²/2),  z ∈ R  (4.3.22)
Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in
the chapter on Special Distributions.
More generally, for μ ∈ R and σ ∈ (0, ∞), recall that the normal distribution with location parameter μ and scale parameter σ is
a continuous distribution on R with probability density function f given by
f(x) = (1/(σ√(2π))) exp[−(1/2)((x − μ)/σ)²],  x ∈ R  (4.3.25)
Moreover, if Z has the standard normal distribution, then X = μ + σZ has the normal distribution with location parameter μ and
scale parameter σ. As the notation suggests, the location parameter is the mean of the distribution and the scale parameter is the
standard deviation.
Suppose that X has the normal distribution with location parameter μ and scale parameter σ. Then
1. E(X) = μ
2. var(X) = σ 2
Proof
So to summarize, if X has a normal distribution, then its standard score Z has the standard normal distribution.
In the special distribution simulator, select the normal distribution. Vary the parameters and note the shape and location of the
mean ± standard deviation bar in relation to the probability density function. For selected parameter values, run the experiment
1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Beta Distributions
The distributions in this subsection belong to the family of beta distributions, which are widely used to model random proportions
and probabilities. The beta distribution is studied in detail in the chapter on Special Distributions.
Suppose that X has a beta distribution with probability density function f. In each case below, graph f and compute the mean and variance.
1. f(x) = 6x(1 − x) for x ∈ [0, 1]
2. f(x) = 12x²(1 − x) for x ∈ [0, 1]
3. f(x) = 12x(1 − x)² for x ∈ [0, 1]
Answer
In the special distribution simulator, select the beta distribution. The parameter values below give the distributions in the
previous exercise. In each case, note the location and size of the mean ± standard deviation bar. Run the experiment 1000
times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
1. a = 2 , b = 2
2. a = 3 , b = 2
3. a = 2 , b = 3
Suppose that a sphere has a random radius R with probability density function f given by f(r) = 12r²(1 − r) for r ∈ [0, 1]. Find the mean and standard deviation of each of the following:
1. The circumference C = 2πR
2. The surface area A = 4πR²
3. The volume V = (4/3)πR³
Answer
Suppose that X has probability density function f given by f(x) = 1/(π√(x(1 − x))) for x ∈ (0, 1). Compute each of the following:
1. E(X)
2. var(X)
Answer
The particular beta distribution in the last exercise is also known as the (standard) arcsine distribution. It governs the last time that
the Brownian motion process hits 0 during the time interval [0, 1]. The arcsine distribution is studied in more generality in the
chapter on Special Distributions.
Open the Brownian motion experiment and select the last zero. Note the location and size of the mean ± standard deviation bar
in relation to the probability density function. Run the simulation 1000 times and compare the empirical mean and standard
deviation to the distribution mean and standard deviation.
Suppose that the grades on a test are described by the random variable Y = 100X where X has the beta distribution with probability density function f given by f(x) = 12x(1 − x)² for x ∈ [0, 1]. The grades are generally low, so the teacher decides to “curve” the grades using the transformation Z = 10√Y = 100√X. Find the mean and standard deviation of each of the following variables:
1. X
2. Y
3. Z
Answer
Suppose that X is a real-valued random variable with E(X) = 2 and E [X(X − 1)] = 8 . Find each of the following:
1. E(X²)
2. var(X)
Answer
Suppose that X₁ and X₂ are independent, real-valued random variables with E(Xᵢ) = μᵢ and var(Xᵢ) = σᵢ² for i ∈ {1, 2}. Then
1. E(X₁X₂) = μ₁μ₂
2. var(X₁X₂) = σ₁²σ₂² + σ₁²μ₂² + σ₂²μ₁²
Proof
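A Monte Carlo sketch of the product formulas (the particular distributions, seed, and sample size are arbitrary choices with known means and variances):

```python
import random

rng = random.Random(2)
n = 200_000
x1 = [rng.gauss(2.0, 1.0) for _ in range(n)]    # mu1 = 2, sigma1^2 = 1
x2 = [rng.uniform(0.0, 6.0) for _ in range(n)]  # mu2 = 3, sigma2^2 = 3

prod = [a * b for a, b in zip(x1, x2)]
mean = sum(prod) / n
var = sum((v - mean) ** 2 for v in prod) / n

# formulas: E(X1 X2) = 2*3 = 6 and var(X1 X2) = 1*3 + 1*9 + 3*4 = 24
```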
Marilyn Vos Savant has an IQ of 228. Assuming that the distribution of IQ scores has mean 100 and standard deviation 15, find
Marilyn's standard score.
Answer
Fix t ∈ (0, ∞). Suppose that X is the discrete random variable with probability density function defined by P(X = t) = P(X = −t) = p, P(X = 0) = 1 − 2p, where p ∈ (0, 1/2). Then equality holds in Chebyshev's inequality at t.
Proof
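The tightness is easy to verify exactly (a sketch with particular values of t and p):

```python
t, p = 2.0, 0.1
pdf = {t: p, -t: p, 0.0: 1 - 2 * p}

mu = sum(x * q for x, q in pdf.items())                   # 0
var = sum((x - mu) ** 2 * q for x, q in pdf.items())      # 2*p*t^2 = 0.8

lhs = sum(q for x, q in pdf.items() if abs(x - mu) >= t)  # P(|X - mu| >= t) = 2p
rhs = var / t ** 2                                        # Chebyshev bound = 2p

assert abs(lhs - rhs) < 1e-12   # equality: the bound is attained
```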
This page titled 4.3: Variance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services)
via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.4: Skewness and Kurtosis
As usual, our starting point is a random experiment, modeled by a probability space (Ω, F , P ). So to review, Ω is the set of
outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F ). Suppose that X is a real-valued
random variable for the experiment. Recall that the mean of X is a measure of the center of the distribution of X. Furthermore, the
variance of X is the second moment of X about the mean, and measures the spread of the distribution of X about the mean. The
third and fourth moments of X about the mean also measure interesting (but more subtle) features of the distribution. The third
moment measures skewness, the lack of symmetry, while the fourth moment measures kurtosis, roughly a measure of the fatness in
the tails. The actual numerical measures of these characteristics are standardized to eliminate the physical units, by dividing by an
appropriate power of the standard deviation. As usual, we assume that all expected values given below exist, and we will let μ = E(X) and σ² = var(X). We assume that σ > 0, so that the random variable is really random.
Basic Theory
Skewness
The skewness of X is the third moment of the standard score of X:
skew(X) = E[((X − μ)/σ)³]
The distribution of X is said to be positively skewed, negatively skewed, or unskewed depending on whether skew(X) is positive, negative, or 0.
In the unimodal case, if the distribution is positively skewed then the probability density function has a long tail to the right, and if
the distribution is negatively skewed then the probability density function has a long tail to the left. A symmetric distribution is
unskewed.
The converse is not true—a non-symmetric distribution can have skewness 0. Examples are given in Exercises (30) and (31) below.
Proof
Since skewness is defined in terms of an odd power of the standard score, it's invariant under a linear transformation with positive slope (a location-scale transformation of the distribution). On the other hand, if the slope is negative, skewness changes sign.
Recall that location-scale transformations often arise when physical units are changed, such as inches to centimeters, or degrees
Fahrenheit to degrees Celsius.
4.4.1 https://stats.libretexts.org/@go/page/10159
Kurtosis
The kurtosis of X is the fourth moment of the standard score:
kurt(X) = E[((X − μ)/σ)⁴]  (4.4.4)
Kurtosis comes from the Greek word for bulging. Kurtosis is always positive, since we have assumed that σ > 0 (the random
variable really is random), and therefore P(X ≠ μ) > 0 . In the unimodal case, the probability density function of a distribution
with large kurtosis has fatter tails, compared with the probability density function of a distribution with smaller kurtosis.
Proof
Since kurtosis is defined in terms of an even power of the standard score, it's invariant under linear transformations.
We will show below that the kurtosis of the standard normal distribution is 3. Using the standard normal distribution as a
benchmark, the excess kurtosis of a random variable X is defined to be kurt(X) − 3 . Some authors use the term kurtosis to mean
what we have defined as excess kurtosis.
Computational Exercises
As always, be sure to try the exercises yourself before expanding the solutions and answers in the text.
Indicator Variables
Recall that an indicator random variable is one that just takes the values 0 and 1. Indicator variables are the building blocks of
many counting random variables. The corresponding distribution is known as the Bernoulli distribution, named for Jacob
Bernoulli.
Suppose that X is an indicator variable with P(X = 1) = p where p ∈ (0, 1). Then
1. E(X) = p
2. var(X) = p(1 − p)
3. skew(X) = (1 − 2p) / √(p(1 − p))
4. kurt(X) = (1 − 3p + 3p²) / (p(1 − p))
Proof
Open the binomial coin experiment and set n = 1 to get an indicator variable. Vary p and note the change in the shape of the
probability density function.
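The closed forms above can be checked numerically; the sketch below (a Python illustration, not part of the text) computes the four moments of an indicator variable directly from the definition of the standard score and compares them with the formulas. The value p = 0.3 is an arbitrary choice.

```python
import math

def bernoulli_moments(p):
    """Mean, variance, skewness, kurtosis of an indicator variable with
    P(X = 1) = p, computed directly from the definition of the standard score."""
    mean = p                      # E(X) = 0*(1 - p) + 1*p
    var = p * (1 - p)             # E[(X - mean)^2]
    sd = math.sqrt(var)
    # Third and fourth moments of the standard score (X - mean)/sd
    skew = ((0 - mean)/sd)**3 * (1 - p) + ((1 - mean)/sd)**3 * p
    kurt = ((0 - mean)/sd)**4 * (1 - p) + ((1 - mean)/sd)**4 * p
    return mean, var, skew, kurt

p = 0.3  # arbitrary choice
mean, var, skew, kurt = bernoulli_moments(p)
# Compare with the closed forms from the text
assert abs(skew - (1 - 2*p)/math.sqrt(p*(1 - p))) < 1e-12
assert abs(kurt - (1 - 3*p + 3*p**2)/(p*(1 - p))) < 1e-12
```

Note that at p = 1/2 the distribution is symmetric, so the skewness vanishes.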
Dice
Recall that a fair die is one in which the faces are equally likely. In addition to fair dice, there are various types of crooked dice.
Here are three:
An ace-six flat die is a six-sided die in which faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, and 5 have probability 1/8 each.
A two-five flat die is a six-sided die in which faces 2 and 5 have probability 1/4 each, while faces 1, 3, 4, and 6 have probability 1/8 each.
A three-four flat die is a six-sided die in which faces 3 and 4 have probability 1/4 each, while faces 1, 2, 5, and 6 have probability 1/8 each.
A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular probabilities that we use (1/4 and 1/8) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter axis have slightly larger probabilities than the other four faces. Flat dice are sometimes used by gamblers to cheat.
A standard, fair die is thrown and the score X is recorded. Compute each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer
An ace-six flat die is thrown and the score X is recorded. Compute each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer
A two-five flat die is thrown and the score X is recorded. Compute each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer
A three-four flat die is thrown and the score X is recorded. Compute each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer
Note that the distributions in the last four exercises are all symmetric (and hence have skewness 0), but differ in variance and kurtosis.
Open the dice experiment and set n = 1 to get a single die. Select each of the following, and note the shape of the probability
density function in comparison with the computational results above. In each case, run the experiment 1000 times and compare
the empirical density function to the probability density function.
1. fair
2. ace-six flat
3. two-five flat
4. three-four flat
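The dice moments can be computed directly from the probability density functions; here is a short Python sketch (an illustration of ours, not part of the text) that does so for all four dice. All four distributions are symmetric about 3.5, so every skewness should come out 0.

```python
import math

def moments(pmf):
    """Mean, variance, skewness, kurtosis of a discrete distribution
    given as a dict {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean)**2 * p for x, p in pmf.items())
    sd = math.sqrt(var)
    skew = sum(((x - mean)/sd)**3 * p for x, p in pmf.items())
    kurt = sum(((x - mean)/sd)**4 * p for x, p in pmf.items())
    return mean, var, skew, kurt

fair = {x: 1/6 for x in range(1, 7)}
ace_six = {1: 1/4, 2: 1/8, 3: 1/8, 4: 1/8, 5: 1/8, 6: 1/4}
two_five = {1: 1/8, 2: 1/4, 3: 1/8, 4: 1/8, 5: 1/4, 6: 1/8}
three_four = {1: 1/8, 2: 1/8, 3: 1/4, 4: 1/4, 5: 1/8, 6: 1/8}

for name, die in [("fair", fair), ("ace-six", ace_six),
                  ("two-five", two_five), ("three-four", three_four)]:
    mean, var, skew, kurt = moments(die)
    print(name, round(mean, 4), round(var, 4), round(skew, 4), round(kurt, 4))
```

The output makes the point of the preceding note concrete: the means and skewnesses agree across the dice, while the variances and kurtoses differ.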
Uniform Distributions
Recall that the continuous uniform distribution on a bounded interval corresponds to selecting a point at random from the interval.
Continuous uniform distributions arise in geometric probability and a variety of other applied problems.
Suppose that X has uniform distribution on the interval [a, b], where a, b ∈ R and a < b . Then
1. E(X) = (a + b)/2
2. var(X) = (b − a)²/12
3. skew(X) = 0
4. kurt(X) = 9/5
Proof
Open the special distribution simulator, and select the continuous uniform distribution. Vary the parameters and note the shape
of the probability density function in comparison with the moment results in the last exercise. For selected values of the
parameter, run the simulation 1000 times and compare the empirical density function to the probability density function.
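As a numerical sanity check on the values above, the sketch below approximates the third and fourth moments of the standard score of a uniform variable with a midpoint Riemann sum. The interval [2, 7] is an arbitrary choice; by the invariance under location-scale transformations noted earlier, any interval gives the same skewness and kurtosis.

```python
import math

def uniform_std_moment(a, b, k, n=100_000):
    """Approximate E[((X - mu)/sigma)^k] for X uniform on [a, b],
    using a midpoint Riemann sum with n subintervals."""
    mu = (a + b) / 2
    sigma = (b - a) / math.sqrt(12)
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        total += ((x - mu) / sigma) ** k * h / (b - a)
    return total

print(uniform_std_moment(2.0, 7.0, 3))  # skewness: approximately 0
print(uniform_std_moment(2.0, 7.0, 4))  # kurtosis: approximately 9/5
```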
The Exponential Distribution
Recall that the exponential distribution is a continuous distribution on [0, ∞) with probability density function f given by f(t) = r e^(−rt) for t ∈ [0, ∞), where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other "arrival times". The exponential distribution is studied in detail in the chapter on the Poisson Process.
Suppose that X has the exponential distribution with rate parameter r > 0 . Then
1. E(X) = 1/r
2. var(X) = 1/r²
3. skew(X) = 2
4. kurt(X) = 9
Proof
Note that the skewness and kurtosis do not depend on the rate parameter r. That's because 1/r is a scale parameter for the exponential distribution.
Open the gamma experiment and set n = 1 to get the exponential distribution. Vary the rate parameter and note the shape of
the probability density function in comparison to the moment results in the last exercise. For selected values of the parameter,
run the experiment 1000 times and compare the empirical density function to the true probability density function.
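A simulation sketch in Python (our illustration; the rate r = 2 and the seed are arbitrary choices): with a large sample from the exponential distribution, the empirical moments should come out near 1/r, 1/r², 2, and 9. The sample kurtosis converges slowly for heavy-ish tails, so expect the last value to be the roughest.

```python
import math, random

random.seed(42)
r = 2.0  # rate parameter; skewness and kurtosis should not depend on it
sample = [random.expovariate(r) for _ in range(200_000)]

n = len(sample)
mean = sum(sample) / n
var = sum((x - mean)**2 for x in sample) / n
sd = math.sqrt(var)
skew = sum(((x - mean)/sd)**3 for x in sample) / n
kurt = sum(((x - mean)/sd)**4 for x in sample) / n
print(mean, var, skew, kurt)  # near 1/r, 1/r^2, 2, and 9
```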
Pareto Distribution
Recall that the Pareto distribution is a continuous distribution on [1, ∞) with probability density function f given by
f(x) = a / x^(a+1),  x ∈ [1, ∞)  (4.4.8)
where a ∈ (0, ∞) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is
widely used to model financial variables such as income. The Pareto distribution is studied in detail in the chapter on Special
Distributions.
Suppose that X has the Pareto distribution with shape parameter a > 0 . Then
1. E(X) = a/(a − 1) if a > 1
2. var(X) = a / ((a − 1)²(a − 2)) if a > 2
3. skew(X) = [2(1 + a)/(a − 3)] √(1 − 2/a) if a > 3
4. kurt(X) = 3(a − 2)(3a² + a + 2) / (a(a − 3)(a − 4)) if a > 4
Proof
Open the special distribution simulator and select the Pareto distribution. Vary the shape parameter and note the shape of the
probability density function in comparison to the moment results in the last exercise. For selected values of the parameter, run
the experiment 1000 times and compare the empirical density function to the true probability density function.
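The moment formulas can be cross-checked using the raw moments E(X^k) = a/(a − k) for a > k, which follow from integrating x^k against the density. The Python sketch below (our illustration; a = 5 is an arbitrary value with a > 4) converts raw moments to central moments and compares with the closed forms above.

```python
import math

def pareto_raw_moment(a, k):
    """E(X^k) for the Pareto distribution with shape parameter a;
    the integral of x^k * a / x^(a+1) over [1, inf) is a/(a - k) for a > k."""
    if a <= k:
        raise ValueError("moment of order k exists only if a > k")
    return a / (a - k)

def pareto_skew_kurt(a):
    """Skewness and kurtosis computed from the first four raw moments."""
    m1, m2, m3, m4 = (pareto_raw_moment(a, k) for k in (1, 2, 3, 4))
    var = m2 - m1**2
    mu3 = m3 - 3*m1*m2 + 2*m1**3               # third central moment
    mu4 = m4 - 4*m1*m3 + 6*m1**2*m2 - 3*m1**4  # fourth central moment
    return mu3 / var**1.5, mu4 / var**2

a = 5.0
skew, kurt = pareto_skew_kurt(a)
# Compare with the closed forms in the text
assert abs(skew - 2*(1 + a)/(a - 3) * math.sqrt(1 - 2/a)) < 1e-9
assert abs(kurt - 3*(a - 2)*(3*a**2 + a + 2)/(a*(a - 3)*(a - 4))) < 1e-9
```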
The Normal Distribution
Recall that the standard normal distribution is a continuous distribution on R with probability density function ϕ given by
ϕ(z) = (1/√(2π)) e^(−z²/2),  z ∈ R  (4.4.9)
Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in
the chapter on Special Distributions.
More generally, for μ ∈ R and σ ∈ (0, ∞), recall that the normal distribution with mean μ and standard deviation σ is a
continuous distribution on R with probability density function f given by
f(x) = (1/(√(2π) σ)) exp[−(1/2)((x − μ)/σ)²],  x ∈ R  (4.4.10)
However, we also know that μ and σ are location and scale parameters, respectively. That is, if Z has the standard normal
distribution then X = μ + σZ has the normal distribution with mean μ and standard deviation σ.
If X has the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞), then
1. skew(X) = 0
2. kurt(X) = 3
Proof
Open the special distribution simulator and select the normal distribution. Vary the parameters and note the shape of the
probability density function in comparison to the moment results in the last exercise. For selected values of the parameters, run
the experiment 1000 times and compare the empirical density function to the true probability density function.
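A quick simulation check in Python (our illustration): the empirical skewness and kurtosis of a large normal sample should be near 0 and 3 regardless of the location and scale parameters. The values 10 and 2.5 below, and the seed, are arbitrary choices.

```python
import math, random

random.seed(7)
mu, sigma = 10.0, 2.5  # arbitrary location and scale; results should not change
sample = [random.gauss(mu, sigma) for _ in range(200_000)]

n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean)**2 for x in sample) / n)
skew = sum(((x - mean)/sd)**3 for x in sample) / n
kurt = sum(((x - mean)/sd)**4 for x in sample) / n
print(round(skew, 3), round(kurt, 3))  # near 0 and 3
```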
Suppose that X has probability density function f given by f (x) = 6x(1 − x) for x ∈ [0, 1]. Find each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer
Suppose that X has probability density function f given by f(x) = 12x²(1 − x) for x ∈ [0, 1]. Find each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer
Open the special distribution simulator and select the beta distribution. Select the parameter values below to get the
distributions in the last three exercises. In each case, note the shape of the probability density function in relation to the
calculated moment results. Run the simulation 1000 times and compare the empirical density function to the probability
density function.
1. a = 2 , b = 2
2. a = 3 , b = 2
3. a = 2 , b = 3
Suppose that X has probability density function f given by f(x) = 1/(π√(x(1 − x))) for x ∈ (0, 1). Find each of the following:
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer
The particular beta distribution in the last exercise is also known as the (standard) arcsine distribution. It governs the last time that
the Brownian motion process hits 0 during the time interval [0, 1]. The arcsine distribution is studied in more generality in the
chapter on Special Distributions.
Open the Brownian motion experiment and select the last zero. Note the shape of the probability density function in relation to
the moment results in the last exercise. Run the simulation 1000 times and compare the empirical density function to the
probability density function.
Counterexamples
The following exercise gives a simple example of a discrete distribution that is not symmetric but has skewness 0.
Suppose that X is a discrete random variable with probability density function f given by f(−3) = 1/10, f(−1) = 1/2, f(2) = 2/5. Find each of the following and then show that the distribution of X is not symmetric.
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Answer
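The computation takes only a few lines of Python (our illustration); note that f(2) = 2/5 is forced, since the probabilities must sum to 1. The mean and the skewness both come out 0, yet the support {−3, −1, 2} is clearly not symmetric about 0.

```python
import math

# pmf from the exercise; f(2) = 2/5 is forced since the probabilities sum to 1
pmf = {-3: 1/10, -1: 1/2, 2: 2/5}

mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean)**2 * p for x, p in pmf.items())
sd = math.sqrt(var)
skew = sum(((x - mean)/sd)**3 * p for x, p in pmf.items())
print(mean, var, skew)  # mean 0, positive variance, skewness 0
```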
The following exercise gives a more complicated continuous distribution that is not symmetric but has skewness 0. It is one of a
collection of distributions constructed by Erik Meijer.
Suppose that U, V, and I are independent random variables, and that U is normally distributed with mean μ = −2 and variance σ² = 1, V is normally distributed with mean ν = 1 and variance τ² = 2, and I is an indicator variable with P(I = 1) = p = 1/3. Let X = IU + (1 − I)V. Find each of the following and then show that the distribution of X is not symmetric.
1. E(X)
2. var(X)
3. skew(X)
4. kurt(X)
Solution
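Since X is a two-component mixture, its kth raw moment is p E(U^k) + (1 − p) E(V^k), so the moments can be computed from the raw moments of the normal components. The Python sketch below (our illustration) carries this out; note that p = 1/3 is exactly the value that makes the third central moment, and hence the skewness, vanish.

```python
def normal_raw_moments(m, s2):
    """First four raw moments of the normal distribution with mean m, variance s2."""
    return (m,
            m**2 + s2,
            m**3 + 3*m*s2,
            m**4 + 6*m**2*s2 + 3*s2**2)

p = 1/3                            # the value for which the skewness vanishes
u = normal_raw_moments(-2.0, 1.0)  # raw moments of U
v = normal_raw_moments(1.0, 2.0)   # raw moments of V

# X = I*U + (1 - I)*V is a mixture, so its raw moments are
# the p-weighted averages of the component raw moments.
m1, m2, m3, m4 = (p*a + (1 - p)*b for a, b in zip(u, v))
var = m2 - m1**2
mu3 = m3 - 3*m1*m2 + 2*m1**3       # third central moment
skew = mu3 / var**1.5
print(m1, var, skew)  # mean 0 and skewness 0, yet X is not symmetric
```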
This page titled 4.4: Skewness and Kurtosis is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
4.5: Covariance and Correlation
Recall that by taking the expected value of various transformations of a random variable, we can measure many interesting
characteristics of the distribution of the variable. In this section, we will study an expected value that measures a special type of
relationship between two real-valued variables. This relationship is very important both in probability and statistics.
Basic Theory
Definitions
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). Unless otherwise noted, we assume
that all expected values mentioned in this section exist. Suppose now that X and Y are real-valued random variables for the
experiment (that is, defined on the probability space) with means E(X), E(Y ) and variances var(X), var(Y ), respectively.
The covariance of (X, Y) is defined by cov(X, Y) = E([X − E(X)][Y − E(Y)]) and, assuming the variances are positive, the correlation of (X, Y) is defined by
cor(X, Y) = cov(X, Y) / (sd(X) sd(Y))  (4.5.2)
Correlation is a scaled version of covariance; note that the two parameters always have the same sign (positive, negative, or 0).
Note also that correlation is dimensionless, since the numerator and denominator have the same physical units, namely the product
of the units of X and Y .
As these terms suggest, covariance and correlation measure a certain kind of dependence between the variables. One of our goals is
a deeper understanding of this dependence. As a start, note that (E(X), E(Y )) is the center of the joint distribution of (X, Y ), and
the vertical and horizontal lines through this point separate R² into four quadrants. The function (x, y) ↦ [x − E(X)][y − E(Y)] is positive on the first and third quadrants and negative on the second and fourth.
Figure 4.5.1 : A joint distribution with (E(X), E(Y ))as the center of mass
Properties of Covariance
The following theorems give some basic properties of covariance. The main tool that we will need is the fact that expected value is
a linear operation. Other important properties will be derived below, in the subsection on the best linear predictor. As usual, be sure
to try the proofs yourself before reading the ones in the text. Once again, we assume that the random variables are defined on the
common sample space, are real-valued, and that the indicated expected values exist (as real numbers).
Our first result is a formula that is better than the definition for computational purposes, but gives less insight.
4.5.1 https://stats.libretexts.org/@go/page/10160
cov(X, Y ) = E(XY ) − E(X)E(Y ) .
Proof
From (2), we see that X and Y are uncorrelated if and only if E(XY) = E(X)E(Y). Here is a simple but important corollary: if X and Y are independent, then X and Y are uncorrelated.
However, the converse fails with a passion: Exercise (31) gives an example of two variables that are functionally related (the
strongest form of dependence), yet uncorrelated. The computational exercises give other examples of dependent yet uncorrelated
variables also. Note also that if one of the variables has mean 0, then the covariance is simply the expected product.
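The equivalence of the definition and the computational formula is easy to check empirically; the Python sketch below (our illustration; the particular dependence and the seed are arbitrary choices) computes the sample covariance both ways for a pair of dependent uniform variables.

```python
import random

random.seed(1)
n = 100_000
xs, ys = [], []
for _ in range(n):
    x = random.random()
    y = x + 0.5 * random.random()  # Y depends on X, so the covariance is positive
    xs.append(x)
    ys.append(y)

ex, ey = sum(xs)/n, sum(ys)/n
cov_def = sum((x - ex)*(y - ey) for x, y in zip(xs, ys)) / n       # definition
cov_formula = sum(x*y for x, y in zip(xs, ys)) / n - ex*ey         # E(XY) - E(X)E(Y)
print(cov_def, cov_formula)  # the two computations agree
```

Here cov(X, Y) = var(X) = 1/12, since the added uniform noise is independent of X.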
Trivially, covariance is a symmetric operation.
cov(X, Y ) = cov(Y , X) .
cov(X, X) = var(X) .
Proof
Covariance is a linear operation in the first argument, if the second argument is fixed.
By symmetry, covariance is also a linear operation in the second argument, with the first argument fixed. Thus, the covariance
operator is bi-linear. The general version of this property is given in the following theorem.
Suppose that (X₁, X₂, …, Xₙ) and (Y₁, Y₂, …, Yₘ) are sequences of random variables, and that (a₁, a₂, …, aₙ) and (b₁, b₂, …, bₘ) are constants. Then
cov(∑_{i=1}^n aᵢ Xᵢ, ∑_{j=1}^m bⱼ Yⱼ) = ∑_{i=1}^n ∑_{j=1}^m aᵢ bⱼ cov(Xᵢ, Yⱼ)
The following result shows how covariance is changed under a linear transformation of one of the variables. This is simply a
special case of the basic properties, but is worth stating.
Of course, by symmetry, the same property holds in the second argument. Putting the two together we have that if a, b, c, d ∈ R
Properties of Correlation
Next we will establish some basic properties of correlation. Most of these follow easily from the corresponding properties of covariance above. We assume that var(X) > 0 and var(Y) > 0, so that the random variables really are random and hence the correlation is well defined.
The correlation between X and Y is the covariance of the corresponding standard scores:
cor(X, Y) = cov((X − E(X))/sd(X), (Y − E(Y))/sd(Y)) = E[((X − E(X))/sd(X)) ((Y − E(Y))/sd(Y))]  (4.5.8)
Proof
This shows again that correlation is dimensionless, since of course, the standard scores are dimensionless. Also, correlation is
symmetric:
cor(X, Y ) = cor(Y , X) .
Under a linear transformation of one of the variables, the correlation is unchanged if the slope is positive and changes sign if the slope is negative:
If a, b ∈ R and b ≠ 0 then
1. cor(a + bX, Y ) = cor(X, Y ) if b > 0
2. cor(a + bX, Y ) = −cor(X, Y ) if b < 0
Proof
This result reinforces the fact that correlation is a standardized measure of association, since multiplying a variable by a positive constant is equivalent to a change of scale, and adding a constant to a variable is equivalent to a change of location. For example, in
the Challenger data, the underlying variables are temperature at the time of launch (in degrees Fahrenheit) and O-ring erosion (in
millimeters). The correlation between these two variables is of fundamental importance. If we decide to measure temperature in
degrees Celsius and O-ring erosion in inches, the correlation is unchanged. Of course, the same property holds in the second
argument, so if a, b, c, d ∈ R with b ≠ 0 and d ≠ 0 , then cor(a + bX, c + dY ) = cor(X, Y ) if bd > 0 and
cor(a + bX, c + dY ) = −cor(X, Y ) if bd < 0 .
The most important properties of covariance and correlation will emerge from our study of the best linear predictor below.
Proof
Note that the variance of a sum can be larger, smaller, or equal to the sum of the variances, depending on the pure covariance terms.
As a special case of (12), when n = 2 , we have
var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y ) (4.5.12)
If (X₁, X₂, …, Xₙ) is a pairwise uncorrelated sequence of random variables, then var(∑_{i=1}^n Xᵢ) = ∑_{i=1}^n var(Xᵢ).
Proof
Note that the last result holds, in particular, if the random variables are independent. We close this discussion with a couple of
minor corollaries.
If X and Y are real-valued random variables then var(X + Y ) + var(X − Y ) = 2 [var(X) + var(Y )] .
Proof
If X and Y are real-valued random variables with var(X) = var(Y ) then X + Y and X − Y are uncorrelated.
Proof
Random Samples
In the following exercises, suppose that (X₁, X₂, …) is a sequence of independent, real-valued random variables with a common distribution that has mean μ and standard deviation σ > 0. In statistical terms, the variables form a random sample from the common distribution.
For n ∈ N₊, let Yₙ = ∑_{i=1}^n Xᵢ. Then
1. E(Yₙ) = nμ
2. var(Yₙ) = nσ²
Proof
For n ∈ N₊, let Mₙ = Yₙ/n = (1/n) ∑_{i=1}^n Xᵢ, so that Mₙ is the sample mean of (X₁, X₂, …, Xₙ). Then
1. E(Mₙ) = μ
2. var(Mₙ) = σ²/n
3. var(Mₙ) → 0 as n → ∞
4. P(|Mₙ − μ| > ε) → 0 as n → ∞ for every ε > 0
Proof
Part (c) of (17) means that Mₙ → μ as n → ∞ in mean square. Part (d) means that Mₙ → μ as n → ∞ in probability. These are both versions of the weak law of large numbers, one of the fundamental theorems of probability.
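The shrinking variance of the sample mean is easy to see in simulation. The Python sketch below (our illustration) estimates var(Mₙ) from repeated samples of uniform [0, 1] variables, for which σ² = 1/12; the estimates should track (1/12)/n. The sample sizes, replication count, and seed are arbitrary choices.

```python
import random

random.seed(3)

def sample_mean(n):
    """Mean of n independent draws from a common distribution
    (uniform on [0, 1] here, so mu = 1/2 and sigma^2 = 1/12)."""
    return sum(random.random() for _ in range(n)) / n

# Estimate var(M_n) from 2000 replications; it should shrink like sigma^2 / n
results = {}
for n in (10, 100, 1000):
    means = [sample_mean(n) for _ in range(2000)]
    m = sum(means) / len(means)
    results[n] = sum((x - m)**2 for x in means) / len(means)
    print(n, round(results[n], 6))  # approximately (1/12)/n
```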
The standard score of the sum Yₙ and the standard score of the sample mean Mₙ are the same:
Zₙ = (Yₙ − nμ)/(√n σ) = (Mₙ − μ)/(σ/√n)  (4.5.16)
1. E(Zₙ) = 0
2. var(Zₙ) = 1
Proof
The central limit theorem, the other fundamental theorem of probability, states that the distribution of Zₙ converges to the standard normal distribution as n → ∞.
Events
If A and B are events in our random experiment then the covariance and correlation of A and B are defined to be the covariance
and correlation, respectively, of their indicator random variables.
In particular, note that A and B are positively correlated, negatively correlated, or independent, respectively (as defined in the
section on conditional probability) if and only if the indicator variables of A and B are positively correlated, negatively correlated,
or uncorrelated, as defined in this section.
2. cov(Aᶜ, Bᶜ) = cov(A, B)
Proof
Proof
In the language of the experiment, A ⊆ B means that A implies B. In such a case, the events are positively correlated, which is not surprising.
The random variable L(Y ∣ X) defined as follows is the only linear function of X satisfying properties (a) and (b).
L(Y ∣ X) = E(Y) + [cov(X, Y)/var(X)] [X − E(X)]  (4.5.17)
Note that in the presence of part (a), part (b) is equivalent to E [XL(Y ∣ X)] = E(XY ) . Here is another minor variation, but one
that will be very useful: L(Y ∣ X) is the only linear function of X with the same mean as Y and with the property that
Y − L(Y ∣ X) is uncorrelated with every linear function of X.
The variance of L(Y ∣ X) and its covariance with Y turn out to be the same.
1. var[L(Y ∣ X)] = cov²(X, Y)/var(X)
2. cov[L(Y ∣ X), Y] = cov²(X, Y)/var(X)
Proof
We can now prove the fundamental result that L(Y ∣ X) is the linear function of X that is closest to Y in the mean square sense.
We give two proofs; the first is more straightforward, but the second is more interesting and elegant.
Proof
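The formula for L(Y ∣ X) can be illustrated with simulated data: estimate the means, variance, and covariance from a sample and form the predictor. In the Python sketch below (our illustrative choice of dependence), X is uniform on (0, 1) and, given X = x, Y is uniform on (0, x); then E(Y ∣ X) = X/2 is linear in X, so the best linear predictor is exactly L(Y ∣ X) = X/2.

```python
import random

random.seed(5)
n = 50_000
xs, ys = [], []
for _ in range(n):
    x = random.random()          # X uniform on (0, 1)
    y = random.uniform(0.0, x)   # given X = x, Y uniform on (0, x)
    xs.append(x)
    ys.append(y)

ex, ey = sum(xs)/n, sum(ys)/n
cov = sum((a - ex)*(b - ey) for a, b in zip(xs, ys)) / n
varx = sum((a - ex)**2 for a in xs) / n
slope = cov / varx               # cov(X, Y) / var(X)
intercept = ey - slope * ex      # so L(Y|X) = E(Y) + slope * (X - E(X))
print(slope, intercept)          # close to 1/2 and 0
```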
Our solution to the best linear predictor problem yields important properties of covariance and correlation.
Additional properties of covariance and correlation:
1. −1 ≤ cor(X, Y ) ≤ 1
2. −sd(X)sd(Y ) ≤ cov(X, Y ) ≤ sd(X)sd(Y )
3. cor(X, Y ) = 1 if and only if, with probability 1, Y is a linear function of X with positive slope.
4. cor(X, Y ) = −1 if and only if, with probability 1, Y is a linear function of X with negative slope.
Proof
The last two results clearly show that cov(X, Y) and cor(X, Y) measure the linear association between X and Y. The equivalent inequalities (a) and (b) above are referred to as the correlation inequality. They are also versions of the Cauchy-Schwarz inequality, named for Augustin Cauchy and Hermann Schwarz.
Recall from our previous discussion of variance that the best constant predictor of Y , in the sense of minimizing mean square error,
is E(Y ) and the minimum value of the mean square error for this predictor is var(Y ). Thus, the difference between the variance of
Y and the mean square error above for L(Y ∣ X) is the reduction in the variance of Y when the linear term in X is added to the
predictor:
var(Y) − E([Y − L(Y ∣ X)]²) = var(Y) cor²(X, Y)  (4.5.33)
The function x ↦ L(Y ∣ X = x) is known as the distribution regression function for Y given X, and its graph is known as the
distribution regression line. Note that the regression line passes through (E(X), E(Y )), the center of the joint distribution.
The regression line for Y given X and the regression line for X given Y are not the same line, except in the trivial case where
the variables are perfectly correlated. However, the coefficient of determination is the same, regardless of which variable is the
predictor and which is the response.
Proof
Suppose that A and B are events with 0 < P(A) < 1 and 0 < P(B) < 1 . Then
1. cor(A, B) = 1 if and only if P(A ∖ B) + P(B ∖ A) = 0. (That is, A and B are equivalent events.)
2. cor(A, B) = −1 if and only if P(A ∖ Bᶜ) + P(Bᶜ ∖ A) = 0. (That is, A and Bᶜ are equivalent events.)
Proof
The concept of best linear predictor is more powerful than might first appear, because it can be applied to transformations of the
variables. Specifically, suppose that X and Y are random variables for our experiment, taking values in general spaces S and T ,
respectively. Suppose also that g and h are real-valued functions defined on S and T , respectively. We can find L [h(Y ) ∣ g(X)] ,
the linear function of g(X) that is closest to h(Y ) in the mean square sense. The results of this subsection apply, of course, with
g(X) replacing X and h(Y ) replacing Y . Of course, we must be able to compute the appropriate means, variances, and
covariances.
We close this subsection with two additional properties of the best linear predictor, the linearity properties.
Suppose that X, Y , and Z are random variables and that c is a constant. Then
1. L(Y + Z ∣ X) = L(Y ∣ X) + L(Z ∣ X)
2. L(cY ∣ X) = cL(Y ∣ X)
Proof from the definitions
Proof by characterizing properties
There are several extensions and generalizations of the ideas in the subsection:
The corresponding statistical problem of estimating a and b , when these distribution parameters are unknown, is considered in
the section on Sample Covariance and Correlation.
The problem of finding the function of X that is closest to Y in the mean square error sense (using all reasonable functions, not
just linear functions) is considered in the section on Conditional Expected Value.
The best linear prediction problem when the predictor and response variables are random vectors is considered in the section on
Expected Value and Covariance Matrices.
The use of characterizing properties will play a crucial role in these extensions.
Answer
In the bivariate uniform experiment, select each of the regions below in turn. For each region, run the simulation 2000 times
and note the value of the correlation and the shape of the cloud of points in the scatterplot. Compare with the results in the last
exercise.
1. Square
2. Triangle
3. Circle
Suppose that X is uniformly distributed on the interval (0, 1) and that given X = x ∈ (0, 1) , Y is uniformly distributed on the
interval (0, x). Find each of the following:
1. cov(X, Y )
2. cor(X, Y )
3. L(Y ∣ X)
4. L(X ∣ Y )
Answer
Dice
Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability 1/4 each, and faces 2, 3, 4, and 5 have probability 1/8 each.
A pair of standard, fair dice are thrown and the scores (X₁, X₂) recorded. Let Y = X₁ + X₂ denote the sum of the scores, U = min{X₁, X₂} the minimum score, and V = max{X₁, X₂} the maximum score. Find the covariance and correlation of each of the following pairs:
1. (X₁, X₂)
2. (X₁, Y)
3. (X₁, U)
4. (U, V)
5. (U, Y)
Answer
Suppose that n fair dice are thrown. Find the mean and variance of each of the following variables:
1. Yₙ, the sum of the scores.
Answer
In the dice experiment, select fair dice, and select the following random variables. In each case, increase the number of dice
and observe the size and location of the probability density function and the mean ± standard deviation bar. With n = 20 dice,
run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard
deviation.
1. The sum of the scores.
2. The average of the scores.
Suppose that n ace-six flat dice are thrown. Find the mean and variance of each of the following variables:
1. Yₙ, the sum of the scores.
Answer
In the dice experiment, select ace-six flat dice, and select the following random variables. In each case, increase the number of
dice and observe the size and location of the probability density function and the mean ± standard deviation bar. With n = 20
dice, run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and
standard deviation.
1. The sum of the scores.
2. The average of the scores.
A pair of fair dice are thrown and the scores (X₁, X₂) recorded. Let Y = X₁ + X₂ denote the sum of the scores, U = min{X₁, X₂} the minimum score, and V = max{X₁, X₂} the maximum score. Find each of the following:
1. L(Y ∣ X₁)
2. L(U ∣ X₁)
3. L(V ∣ X₁)
Answer
Bernoulli Trials
Recall that a Bernoulli trials process is a sequence X = (X₁, X₂, …) of independent, identically distributed indicator random variables. In the usual language of reliability, Xᵢ denotes the outcome of trial i, where 1 denotes success and 0 denotes failure. The probability of success p = P(Xᵢ = 1) is the basic parameter of the process. The process is named for Jacob Bernoulli. A separate chapter on Bernoulli Trials explores this process in detail.
For n ∈ N₊, the number of successes in the first n trials is Yₙ = ∑_{i=1}^n Xᵢ. Recall that this random variable has the binomial distribution with parameters n and p, which has probability density function fₙ given by
fₙ(y) = C(n, y) p^y (1 − p)^(n−y),  y ∈ {0, 1, …, n}  (4.5.45)
where C(n, y) denotes the binomial coefficient.
1. E(Yₙ) = np
2. var(Yₙ) = np(1 − p)
Proof
In the binomial coin experiment, select the number of heads. Vary n and p and note the shape of the probability density
function and the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the
experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation.
For n ∈ N₊, the proportion of successes in the first n trials is Mₙ = Yₙ/n. Then
1. E(Mₙ) = p
2. var(Mₙ) = p(1 − p)/n
Proof
In the binomial coin experiment, select the proportion of heads. Vary n and p and note the shape of the probability density
function and the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the
experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation.
Suppose that a population consists of m objects, and that r of the objects are type 1. We select n objects at random from the population, without replacement. Let Xᵢ denote the type of the ith object selected. Recall that (X₁, X₂, …, Xₙ) is a sequence of identically distributed (but not independent) indicator random variables. Let Y = ∑_{i=1}^n Xᵢ denote the number of type 1 objects in the sample. Recall that this random variable has the hypergeometric distribution, which has probability density function f given by
f(y) = C(r, y) C(m − r, n − y) / C(m, n),  y ∈ {0, 1, …, n}  (4.5.46)
For distinct indices i and j:
1. E(Xᵢ) = r/m
2. var(Xᵢ) = (r/m)(1 − r/m)
3. cov(Xᵢ, Xⱼ) = −(r/m)(1 − r/m)/(m − 1)
4. cor(Xᵢ, Xⱼ) = −1/(m − 1)
Proof
Note that the event of a type 1 object on draw i and the event of a type 1 object on draw j are negatively correlated, but the correlation depends only on the population size m and not on the number of type 1 objects. Note also that the correlation is perfect if m = 2. Think about these results intuitively.
1. E(Y) = n(r/m)
2. var(Y) = n(r/m)(1 − r/m)(m − n)/(m − 1)
Proof
Note that if the sampling were with replacement, Y would have a binomial distribution, and so in particular E(Y) = n(r/m) and var(Y) = n(r/m)(1 − r/m). The additional factor (m − n)/(m − 1) that occurs in the variance of the hypergeometric distribution is sometimes called the finite population correction factor. Note that for fixed m, (m − n)/(m − 1) is decreasing in n, and is 0 when n = m. Of course, we know that we must have var(Y) = 0 if n = m, since we would be sampling the entire population, and so deterministically, Y = r. On the other hand, for fixed n, (m − n)/(m − 1) → 1 as m → ∞. More generally, the hypergeometric distribution is well approximated by the binomial when the population size m is large compared to the sample size n. These ideas are discussed more fully in the section on the hypergeometric distribution in the chapter on Finite Sampling Models.
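The variance formula, and the role of the correction factor, can be verified numerically from the probability density function. The Python sketch below (our illustration; the parameter values are arbitrary) computes the variance of the hypergeometric distribution by brute force and compares it with n(r/m)(1 − r/m)(m − n)/(m − 1).

```python
from math import comb

def hypergeometric_pdf(m, r, n, y):
    """P(Y = y) for the number of type 1 objects in a sample of size n,
    drawn without replacement from m objects of which r are type 1."""
    return comb(r, y) * comb(m - r, n - y) / comb(m, n)

def hypergeometric_var(m, r, n):
    """Variance computed directly from the probability density function."""
    support = range(max(0, n - (m - r)), min(n, r) + 1)
    mean = sum(y * hypergeometric_pdf(m, r, n, y) for y in support)
    return sum((y - mean)**2 * hypergeometric_pdf(m, r, n, y) for y in support)

m, r, n = 50, 20, 10
binomial_var = n * (r/m) * (1 - r/m)
correction = (m - n) / (m - 1)
print(hypergeometric_var(m, r, n), binomial_var * correction)  # equal
```

Note that taking n = m gives variance 0, as it must when the entire population is sampled.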
In the ball and urn experiment, select sampling without replacement. Vary m, r, and n and note the shape of the probability
density function and the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the
experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation.
Suppose that X and Y are real-valued random variables with cov(X, Y ) = 3 . Find cov(2X − 5, 4Y + 2) .
Answer
Suppose X and Y are real-valued random variables with var(X) = 5 , var(Y ) = 9 , and cov(X, Y ) = −3 . Find
1. cor(X, Y )
2. var(2X + 3Y − 7)
3. cov(5X + 2Y − 3, 3X − 4Y + 2)
4. cor(5X + 2Y − 3, 3X − 4Y + 2)
Answer
Suppose that X and Y are independent, real-valued random variables with var(X) = 6 and var(Y ) = 8 . Find
var(3X − 4Y + 5) .
Answer
Suppose that A and B are events in an experiment with P(A) = 1/2, P(B) = 1/3, and P(A ∩ B) = 1/8. Find each of the following:
1. cov(A, B)
2. cor(A, B)
Answer
Suppose that X, Y, and Z are real-valued random variables for an experiment, and that L(Y ∣ X) = 2 − 3X and L(Z ∣ X) = 5 + 4X. Find L(6Y − 2Z ∣ X).
Answer
Suppose that X and Y are real-valued random variables for an experiment, and that E(X) = 3 , var(X) = 4 , and
L(Y ∣ X) = 5 − 2X . Find each of the following:
1. E(Y )
2. cov(X, Y )
Answer
2. cor(X, Y )
3. L(Y ∣ X)
4. L(X ∣ Y )
Answer
Suppose that (X, Y ) has probability density function f given by f (x, y) = 2(x + y) for 0 ≤x ≤y ≤1 . Find each of the
following:
1. cov(X, Y )
2. cor(X, Y )
3. L(Y ∣ X)
4. L(X ∣ Y )
Answer
Suppose again that (X, Y ) has probability density function f given by f (x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1 .
1. Find cov(X², Y).
2. Find cor(X², Y).
3. Find L(Y ∣ X²).
Answer
Suppose again that (X, Y) has probability density function f given by f(x, y) = 15x²y for 0 ≤ x ≤ y ≤ 1.
1. Find cov(√X, Y).
2. Find cor(√X, Y).
3. Find L(Y ∣ √X).
4. Which of the predictors of Y is better, the one based on X or the one based on √X?
Answer
This page titled 4.5: Covariance and Correlation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
4.6: Generating Functions
As usual, our starting point is a random experiment modeled by a probability space (Ω, F, P). A generating function of a real-valued random variable is an expected value of a certain transformation of the random variable involving another (deterministic) variable. Most generating functions share four important properties:
1. Under mild conditions, the generating function completely determines the distribution of the random variable.
2. The generating function of a sum of independent variables is the product of the generating functions.
3. The moments of the random variable can be obtained from the derivatives of the generating function.
4. Ordinary (pointwise) convergence of a sequence of generating functions corresponds to the special convergence of the
corresponding distributions.
Property 1 is perhaps the most important. Often a random variable is shown to have a certain distribution by showing that the
generating function has a certain form. The process of recovering the distribution from the generating function is known as
inversion. Property 2 is frequently used to determine the distribution of a sum of independent variables. By contrast, recall that the
probability density function of a sum of independent variables is the convolution of the individual density functions, a much more
complicated operation. Property 3 is useful because often computing moments from the generating function is easier than
computing the moments directly from the probability density function. The last property is known as the continuity theorem. Often
it is easier to show the convergence of the generating functions than to prove convergence of the distributions directly.
The numerical value of the generating function at a particular value of the free variable is of no interest, and so generating
functions can seem rather unintuitive at first. But the important point is that the generating function as a whole encodes all of the
information in the probability distribution in a very useful way. Generating functions are important and valuable tools in
probability, as they are in other areas of mathematics, from combinatorics to differential equations.
We will study the three generating functions in the list below, which correspond to increasing levels of generality. The first is the
most restrictive, but also by far the simplest, since the theory reduces to basic facts about power series that you will remember from
calculus. The third is the most general and the one for which the theory is most complete and elegant, but it also requires basic
knowledge of complex analysis. The one in the middle is perhaps the one most commonly used, and suffices for most distributions
in applied probability.
1. the probability generating function
2. the moment generating function
3. the characteristic function
We will also study the characteristic function for multivariate distributions, although analogous results hold for the other two types.
In the basic theory below, be sure to try the proofs yourself before reading the ones in the text.
Basic Theory
The Probability Generating Function
For our first generating function, assume that N is a random variable taking values in N. The probability generating function P of N is defined by P(t) = E(t^N) for all t ∈ R for which the expected value exists.
Suppose that N has probability density function f and probability generating function P . Then
P(t) = ∑_{n=0}^∞ f(n) t^n,  t ∈ (−r, r)  (4.6.2)
4.6.1 https://stats.libretexts.org/@go/page/10161
where r ∈ [1, ∞] is the radius of convergence of the series.
Proof
In the language of combinatorics, P is the ordinary generating function of f . Of course, if N just takes a finite set of values in N
then r = ∞ . Recall from calculus that a power series can be differentiated term by term, just like a polynomial. Each derivative
series has the same radius of convergence as the original series (but may behave differently at the endpoints of the interval of
convergence). We denote the derivative of order n by P^{(n)}. Recall also that if n ∈ N and k ∈ N with k ≤ n, then the number of permutations of size k chosen from a population of n objects is n^{(k)} = n(n − 1) ⋯ (n − k + 1).
The following theorem is the inversion result for probability generating functions: the generating function completely determines
the distribution.
Suppose again that N has probability density function f and probability generating function P . Then
f(k) = P^{(k)}(0) / k!,  k ∈ N  (4.6.4)
Proof
Our next result is not particularly important, but has a certain curiosity.
P(N is even) = (1/2)[1 + P(−1)].
Proof
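This identity is easy to check numerically. Here is a minimal Python sketch for a binomial variable, using the binomial generating function P(t) = (1 − p + pt)^n that is derived below in the discussion of Bernoulli trials; the parameter values are arbitrary:

```python
from math import comb

# Check P(N even) = (1/2)[1 + P(-1)] for a binomial variable (illustrative choice)
n, p = 10, 0.3
P = lambda t: (1 - p + p * t) ** n          # binomial probability generating function
via_pgf = 0.5 * (1 + P(-1))
direct = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, n + 1, 2))
assert abs(via_pgf - direct) < 1e-12
```

The agreement is exact up to floating-point rounding, since the identity is an algebraic consequence of splitting the power series for P(1) and P(−1) into even and odd terms.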
Recall that the factorial moment of N of order k ∈ N is E[N^{(k)}]. The factorial moments can be computed from the derivatives of the probability generating function. The factorial moments, in turn, determine the ordinary moments about 0 (sometimes referred to as raw moments).
1. E(N) = P′(1)
2. var(N) = P′′(1) + P′(1)[1 − P′(1)]
Proof
Suppose that N₁ and N₂ are independent random variables taking values in N, with probability generating functions P₁ and P₂ having radii of convergence r₁ and r₂, respectively. Then the probability generating function P of N₁ + N₂ is given by P(t) = P₁(t) P₂(t) for |t| < min(r₁, r₂).
Proof
The Moment Generating Function
For our next generating function, suppose that X is a real-valued random variable. The moment generating function M of X is defined by M(t) = E(e^{tX}) for t ∈ R. Note that since e^{tX} ≥ 0 with probability 1, M(t) exists, as a real number or ∞, for any t ∈ R. But as we will see, our interest will be in the values of t for which M(t) < ∞.
Suppose that X has a continuous distribution on R with probability density function f. Then
M(t) = ∫_{−∞}^{∞} e^{tx} f(x) dx  (4.6.8)
Proof
Thus, the moment generating function of X is closely related to the Laplace transform of the probability density function f . The
Laplace transform is named for Pierre Simon Laplace, and is widely used in many areas of applied mathematics, particularly
differential equations. The basic inversion theorem for moment generating functions (similar to the inversion theorem for Laplace
transforms) states that if M (t) < ∞ for t in an open interval about 0, then M completely determines the distribution of X. Thus,
if two distributions on R have moment generating functions that are equal (and finite) in an open interval about 0, then the
distributions are the same.
Suppose that X has moment generating function M that is finite in an open interval I about 0. Then X has moments of all
orders and
M(t) = ∑_{n=0}^∞ [E(X^n) / n!] t^n,  t ∈ I  (4.6.9)
Proof
So under the finite assumption in the last theorem, the moment generating function, like the probability generating function, is a
power series in t .
Suppose again that X has moment generating function M that is finite in an open interval about 0. Then M^{(n)}(0) = E(X^n) for n ∈ N.
Proof
Thus, the derivatives of the moment generating function at 0 determine the moments of the variable (hence the name). In the
language of combinatorics, the moment generating function is the exponential generating function of the sequence of moments.
Thus, a random variable that does not have finite moments of all orders cannot have a finite moment generating function. Even
when a random variable does have moments of all orders, the moment generating function may not exist. A counterexample is
constructed below.
For nonnegative random variables (which are very common in applications), the domain where the moment generating function is
finite is easy to understand.
Suppose that X takes values in [0, ∞) and has moment generating function M . If M (t) < ∞ for t ∈ R then M (s) < ∞ for
s ≤t.
Proof
So for a nonnegative random variable, either M (t) < ∞ for all t ∈ R or there exists r ∈ (0, ∞) such that M (t) < ∞ for t < r .
Of course, there are complementary results for non-positive random variables, but such variables are much less common. Next we
consider what happens to the moment generating function under some simple transformations of the random variables.
Suppose that X has moment generating function M and that a, b ∈ R. The moment generating function N of Y = a + bX is given by N(t) = e^{at} M(bt) for t ∈ R.
Proof
Recall that if a ∈ R and b ∈ (0, ∞) then the transformation a + bX is a location-scale transformation on the distribution of X,
with location parameter a and scale parameter b . Location-scale transformations frequently arise when units are changed, such as
length changed from inches to centimeters or temperature from degrees Fahrenheit to degrees Celsius.
Suppose that X₁ and X₂ are independent random variables with moment generating functions M₁ and M₂ respectively. The moment generating function M of Y = X₁ + X₂ is given by M(t) = M₁(t) M₂(t) for t ∈ R.
Proof
The probability generating function of a variable can easily be converted into the moment generating function of the variable.
Suppose that X is a random variable taking values in N with probability generating function G having radius of convergence r. The moment generating function M of X is given by M(t) = G(e^t) for t < ln(r).
Proof
The following theorem gives the Chernoff bounds, named for the mathematician Herman Chernoff. These are upper bounds on the
tail events of a random variable.
Naturally, the best Chernoff bound (in either (a) or (b)) is obtained by finding t that minimizes e^{−tx} M(t).
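As an illustration, here is a sketch for the standard normal distribution, whose moment generating function M(t) = e^{t²/2} is derived later in this section. A crude grid search over t recovers the optimal bound e^{−x²/2} for the right tail at x; the grid range and spacing are arbitrary choices:

```python
import math

# Chernoff bound for the standard normal: minimize exp(-t*x) * M(t) over t > 0,
# with M(t) = exp(t**2 / 2). Calculus gives the minimizer t = x.
M = lambda t: math.exp(t * t / 2)
x = 2.0
ts = [i / 1000 for i in range(1, 5001)]          # grid search on (0, 5]
bound = min(math.exp(-t * x) * M(t) for t in ts)
assert abs(bound - math.exp(-x * x / 2)) < 1e-5  # best bound is exp(-x**2/2)
```

Minimizing the exponent −tx + t²/2 by calculus gives t = x, which here lies exactly on the grid, so the search reproduces the analytic answer.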
The Characteristic Function
For our third generating function, suppose again that X is a real-valued random variable. The characteristic function χ of X is defined by χ(t) = E(e^{itX}) for t ∈ R. Note that χ is a complex valued function, and so this subsection requires some basic knowledge of complex analysis. The function χ is defined for all t ∈ R because the random variable in the expected value is bounded in magnitude. Indeed, |e^{itX}| = 1 for all t ∈ R. Many of the properties of the characteristic function are more elegant than the corresponding properties of the probability or moment generating functions.
If X has a continuous distribution on R with probability density function f and characteristic function χ then
χ(t) = ∫_{−∞}^{∞} e^{itx} f(x) dx,  t ∈ R  (4.6.13)
Proof
Thus, the characteristic function of X is closely related to the Fourier transform of the probability density function f . The Fourier
transform is named for Joseph Fourier, and is widely used in many areas of applied mathematics.
As with other generating functions, the characteristic function completely determines the distribution. That is, random variables X
and Y have the same distribution if and only if they have the same characteristic function. Indeed, the general inversion formula
given next is a formula for computing certain combinations of probabilities from the characteristic function.
The probability combinations on the right side completely determine the distribution of X . A special inversion formula holds for
continuous distributions:
Suppose that X has a continuous distribution with probability density function f and characteristic function χ. At every point
x ∈ R where f is differentiable,
f(x) = (1 / 2π) ∫_{−∞}^{∞} e^{−itx} χ(t) dt  (4.6.15)
This formula is essentially the inverse Fourier transform. As with the other generating functions, the characteristic function can be
used to find the moments of X. Moreover, this can be done even when only some of the moments exist.
Suppose again that X has characteristic function χ. If n ∈ N₊ and E(|X^n|) < ∞, then
χ(t) = ∑_{k=0}^{n} [E(X^k) / k!] (it)^k + o(t^n)  (4.6.16)
Next we consider how the characteristic function is changed under some simple transformations of the variables.
Suppose that X has characteristic function χ and that a, b ∈ R. The characteristic function ψ of Y = a + bX is given by ψ(t) = e^{iat} χ(bt) for t ∈ R.
Proof
Suppose that X₁ and X₂ are independent random variables with characteristic functions χ₁ and χ₂ respectively. The characteristic function χ of Y = X₁ + X₂ is given by χ(t) = χ₁(t) χ₂(t) for t ∈ R.
Proof
The characteristic function of a random variable can be obtained from the moment generating function, under the basic existence
condition that we saw earlier.
Suppose that X has moment generating function M that satisfies M (t) < ∞ for t in an open interval I about 0. Then the
characteristic function χ of X satisfies χ(t) = M (it) for t ∈ I .
The final important property of characteristic functions that we will discuss relates to convergence in distribution. Suppose that (X₁, X₂, …) is a sequence of real-valued random variables with characteristic functions (χ₁, χ₂, …) respectively. Since we are only concerned with distributions, the random variables need not be defined on the same probability space.
2. Conversely, if χₙ(t) converges to a function χ(t) as n → ∞ for t in an open interval about 0, and if χ is continuous at 0, then χ is the characteristic function of a random variable X, and the distribution of Xₙ converges to the distribution of X as n → ∞.
There are analogous versions of the continuity theorem for probability generating functions and moment generating functions. The
continuity theorem can be used to prove the central limit theorem, one of the fundamental theorems of probability. Also, the
continuity theorem has a straightforward generalization to distributions on R^n.
Once again, the most important fact is that χ completely determines the distribution: two random vectors taking values in R² have the same characteristic function if and only if they have the same distribution.
The joint moments can be obtained from the derivatives of the characteristic function.
χ^{(m,n)}(0, 0) = i^{m+n} E(X^m Y^n)  (4.6.19)
The marginal characteristic functions and the characteristic function of the sum can be easily obtained from the joint characteristic
function:
Suppose again that (X, Y) has characteristic function χ, and let χ₁, χ₂, and χ₊ denote the characteristic functions of X, Y, and X + Y, respectively. For t ∈ R,
1. χ(t, 0) = χ₁(t)
2. χ(0, t) = χ₂(t)
3. χ(t, t) = χ₊(t)
Proof
Naturally, the results for bivariate characteristic functions have analogies in the general multivariate case. Only the notation is more
complicated.
Dice
Recall that an ace-six flat die is a six-sided die for which faces numbered 1 and 6 have probability 1/4 each, while faces numbered 2, 3, 4, and 5 have probability 1/8 each. Similarly, a 3-4 flat die is a six-sided die for which faces numbered 3 and 4 have probability 1/4 each, while faces numbered 1, 2, 5, and 6 have probability 1/8 each.
Suppose that an ace-six flat die and a 3-4 flat die are rolled. Use probability generating functions to find the probability density
function of the sum of the scores.
Solution
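A sketch of this computation in Python, using the fact that multiplying the two generating functions amounts to convolving the two density functions, and assuming the usual flat-die probabilities (1/4 on each flat face, 1/8 on the others):

```python
from fractions import Fraction as F

# pmfs on faces 1..6: the flat faces get probability 1/4, the rest 1/8
ace_six = {1: F(1, 4), 2: F(1, 8), 3: F(1, 8), 4: F(1, 8), 5: F(1, 8), 6: F(1, 4)}
flat_34 = {1: F(1, 8), 2: F(1, 8), 3: F(1, 4), 4: F(1, 4), 5: F(1, 8), 6: F(1, 8)}

# Coefficient of t^k in P1(t) * P2(t) is the pmf of the sum at k (convolution)
pmf_sum = {}
for x, px in ace_six.items():
    for y, py in flat_34.items():
        pmf_sum[x + y] = pmf_sum.get(x + y, F(0)) + px * py

assert sum(pmf_sum.values()) == 1
assert min(pmf_sum) == 2 and max(pmf_sum) == 12
```

Exact rational arithmetic via `fractions` makes the resulting density function easy to read off; for instance, the probability of a sum of 2 is (1/4)(1/8) = 1/32.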
Two fair, 6-sided dice are rolled. One has faces numbered (0, 1, 2, 3, 4, 5) and the other has faces numbered
(0, 6, 12, 18, 24, 30). Use probability generating functions to find the probability density function of the sum of the scores, and
identify the distribution.
Solution
Bernoulli Trials
Suppose X is an indicator random variable with p = P(X = 1) , where p ∈ [0, 1] is a parameter. Then X has probability
generating function P (t) = 1 − p + pt for t ∈ R .
Proof
Recall that a Bernoulli trials process is a sequence (X₁, X₂, …) of independent, identically distributed indicator random variables. In the usual language of reliability, Xᵢ denotes the outcome of trial i, where 1 denotes success and 0 denotes failure. The probability of success p = P(Xᵢ = 1) is the basic parameter of the process. The process is named for Jacob Bernoulli. A separate chapter on Bernoulli Trials studies the process in detail. The number of successes in the first n trials is Yₙ = ∑_{i=1}^{n} Xᵢ. Recall that Yₙ has the binomial distribution with parameters n and p, which has probability density function fₙ given by
fₙ(y) = C(n, y) p^y (1 − p)^{n−y},  y ∈ {0, 1, …, n}  (4.6.23)
Suppose that Yₙ has the binomial distribution with parameters n and p, as above. Then
1. E[Yₙ^{(k)}] = n^{(k)} p^k
2. E(Yₙ) = np
3. var(Yₙ) = np(1 − p)
4. P(Yₙ is even) = (1/2)[1 + (1 − 2p)^n]
Proof
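These moment formulas are easy to verify numerically from the binomial density; a sketch with arbitrary parameter values:

```python
from math import comb

n, p = 8, 0.4
pmf = [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]
mean = sum(y * q for y, q in enumerate(pmf))
var = sum(y * y * q for y, q in enumerate(pmf)) - mean**2
even = sum(q for y, q in enumerate(pmf) if y % 2 == 0)

assert abs(mean - n * p) < 1e-12                       # E(Y) = np
assert abs(var - n * p * (1 - p)) < 1e-12              # var(Y) = np(1 - p)
assert abs(even - 0.5 * (1 + (1 - 2 * p) ** n)) < 1e-12  # from P(-1) = (1 - 2p)^n
```

The even-probability check also illustrates the earlier identity P(N even) = (1/2)[1 + P(−1)], since the binomial generating function gives P(−1) = (1 − 2p)^n.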
Suppose now that p ∈ (0, 1]. The trial number N of the first success in the sequence of Bernoulli trials has the geometric distribution on N₊ with success parameter p. The probability density function h is given by
h(n) = p(1 − p)^{n−1},  n ∈ N₊  (4.6.24)
The geometric distribution is studied in more detail in the chapter on Bernoulli trials.
Suppose that N has the geometric distribution on N₊ with success parameter p ∈ (0, 1), and let P denote the probability generating function of N. Then
1. P(t) = pt / [1 − (1 − p)t] for −1/(1 − p) < t < 1/(1 − p)
2. E[N^{(k)}] = k! (1 − p)^{k−1} / p^k for k ∈ N₊
3. E(N) = 1/p
4. var(N) = (1 − p)/p²
5. P(N is even) = (1 − p)/(2 − p)
Proof
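A numerical check of the mean, variance, and even-probability formulas, truncating the infinite series at a point where the geometric tail is negligible (a sketch; the parameter value is arbitrary):

```python
p = 0.3
h = lambda n: p * (1 - p) ** (n - 1)   # geometric pmf on 1, 2, 3, ...
N = 500                                 # truncation point; the tail beyond is negligible
mean = sum(n * h(n) for n in range(1, N))
var = sum(n * n * h(n) for n in range(1, N)) - mean**2
even = sum(h(n) for n in range(2, N, 2))

assert abs(mean - 1 / p) < 1e-9                 # E(N) = 1/p
assert abs(var - (1 - p) / p**2) < 1e-9         # var(N) = (1 - p)/p^2
assert abs(even - (1 - p) / (2 - p)) < 1e-9     # P(N even) = (1 - p)/(2 - p)
```

Truncation is harmless here because (1 − p)^{500} is far below machine precision.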
The probability that N is even comes up in the alternating coin tossing game with two players.
The Poisson Distribution
Recall that the Poisson distribution has probability density function f given by f(n) = e^{−a} a^n / n! for n ∈ N, where a ∈ (0, ∞) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number of “random points” in a region of time or space; the parameter is proportional to the size of the region of time or space. The Poisson distribution is studied in more detail in the chapter on the Poisson Process.
Suppose that N has the Poisson distribution with parameter a ∈ (0, ∞). Let P denote the probability generating function of N. Then
1. P(t) = e^{a(t−1)} for t ∈ R
2. E[N^{(k)}] = a^k
3. E(N) = a
4. var(N) = a
5. P(N is even) = (1/2)(1 + e^{−2a})
Proof
The Poisson family of distributions is closed with respect to sums of independent variables, a very important property.
Suppose that X, Y have Poisson distributions with parameters a, b ∈ (0, ∞) , respectively, and that X and Y are independent.
Then X + Y has the Poisson distribution with parameter a + b .
Proof
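The closure property can also be checked directly by convolving the two Poisson densities; a sketch with arbitrary parameter values:

```python
from math import exp, factorial

pois = lambda a, n: exp(-a) * a**n / factorial(n)
a, b = 1.5, 2.5
for n in range(12):
    # pmf of X + Y at n via convolution of the individual pmfs
    conv = sum(pois(a, k) * pois(b, n - k) for k in range(n + 1))
    assert abs(conv - pois(a + b, n)) < 1e-12
```

The agreement is exact in theory: the convolution sum collapses to e^{−(a+b)}(a + b)^n / n! by the binomial theorem, which is the same algebra as multiplying the two generating functions e^{a(t−1)} and e^{b(t−1)}.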
The right distribution function of the Poisson distribution does not have a simple, closed-form expression. The following exercise
gives an upper bound.
Suppose that N has the Poisson distribution with parameter a > 0 . Then
P(N ≥ n) ≤ e^{n−a} (a/n)^n,  n > a  (4.6.28)
Proof
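Comparing the bound with the numerically computed tail probability; a sketch with an arbitrary parameter value:

```python
from math import exp, factorial

a = 3.0
pois = lambda n: exp(-a) * a**n / factorial(n)
for n in range(4, 15):                          # the bound requires n > a
    tail = sum(pois(k) for k in range(n, 120))  # truncated tail; remainder negligible
    bound = exp(n - a) * (a / n) ** n
    assert tail <= bound + 1e-12
```

The bound is loose for n near a (at n = 4 it is about 0.86 against a true tail of roughly 0.35) but decays super-exponentially, which is the typical behavior of Chernoff bounds.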
The following theorem gives an important convergence result that is explored in more detail in the chapter on the Poisson process.
Suppose that pₙ ∈ (0, 1) for n ∈ N₊ and that npₙ → a ∈ (0, ∞) as n → ∞. Then the binomial distribution with parameters n and pₙ converges to the Poisson distribution with parameter a as n → ∞.
Proof
The Exponential Distribution
Recall that the exponential distribution is a continuous distribution on [0, ∞) with probability density function f given by f(t) = r e^{−rt} for t ∈ [0, ∞), where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other random times, and in particular governs the time between arrivals in the Poisson model. The exponential distribution is studied in more detail in the chapter on the Poisson Process.
Suppose that T has the exponential distribution with rate parameter r ∈ (0, ∞) and let M denote the moment generating
function of T . Then
1. M(s) = r/(r − s) for s ∈ (−∞, r).
Proof
Suppose that (T₁, T₂, …) is a sequence of independent random variables, each having the exponential distribution with rate parameter r ∈ (0, ∞). Let Uₙ = ∑_{i=1}^{n} Tᵢ. Then Uₙ has moment generating function Mₙ given by
Mₙ(s) = [r/(r − s)]^n,  s ∈ (−∞, r)  (4.6.32)
Proof
Random variable Uₙ has the Erlang distribution with shape parameter n and rate parameter r, named for Agner Erlang. This distribution governs the nth arrival time in the Poisson model. The Erlang distribution is a special case of the gamma distribution and is studied in more detail in the chapter on the Poisson Process.
Uniform Distributions
Suppose that a, b ∈ R and a <b . Recall that the continuous uniform distribution on the interval [a, b] has probability density
function f given by
f(x) = 1/(b − a),  x ∈ [a, b]  (4.6.33)
The distribution corresponds to selecting a point at random from the interval. Continuous uniform distributions arise in geometric
probability and a variety of other applied problems.
Suppose that X is uniformly distributed on the interval [a, b] and let M denote the moment generating function of X. Then
1. M(t) = (e^{bt} − e^{at}) / [(b − a)t] if t ≠ 0, and M(0) = 1
2. E(X^n) = (b^{n+1} − a^{n+1}) / [(n + 1)(b − a)] for n ∈ N
Proof
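A midpoint-rule check of the moment formula; a sketch with arbitrary interval endpoints:

```python
a, b = 1.0, 3.0
m = 100_000
h = (b - a) / m
for n in range(5):
    # midpoint-rule estimate of E(X^n) = integral of x^n / (b - a) over [a, b]
    est = sum((a + (i + 0.5) * h) ** n * h / (b - a) for i in range(m))
    exact = (b ** (n + 1) - a ** (n + 1)) / ((n + 1) * (b - a))
    assert abs(est - exact) < 1e-6
```

The midpoint rule has error on the order of h², far below the tolerance used here, so the numerical and closed-form moments agree to many digits.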
A Bivariate Distribution
Suppose that (X, Y) has probability density function f given by f(x, y) = x + y for (x, y) ∈ [0, 1]². Compute each of the following:
1. The joint moment generating function (X, Y ).
2. The moment generating function of X.
3. The moment generating function of Y .
4. The moment generating function of X + Y .
Answer
Normal Distributions
Recall that the standard normal distribution is a continuous distribution on R with probability density function ϕ given by
ϕ(z) = (1/√(2π)) e^{−z²/2},  z ∈ R  (4.6.34)
Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in more
detail in the chapter on Special Distributions.
Suppose that Z has the standard normal distribution and let M denote the moment generating function of Z . Then
1. M(t) = e^{t²/2} for t ∈ R
2. E(Z^n) = 1 ⋅ 3 ⋯ (n − 1) if n is even and E(Z^n) = 0 if n is odd.
Proof
More generally, for μ ∈ R and σ ∈ (0, ∞), recall that the normal distribution with mean μ and standard deviation σ is a
continuous distribution on R with probability density function f given by
f(x) = (1/(√(2π) σ)) exp[−(1/2)((x − μ)/σ)²],  x ∈ R  (4.6.37)
Moreover, if Z has the standard normal distribution, then X = μ + σZ has the normal distribution with mean μ and standard
deviation σ. Thus, we can easily find the moment generating function of X:
Suppose that X has the normal distribution with mean μ and standard deviation σ. The moment generating function of X is
M(t) = exp(μt + (1/2)σ²t²),  t ∈ R  (4.6.38)
Proof
So the normal family of distributions is closed under location-scale transformations. The family is also closed with respect to sums
of independent variables:
If X and Y are independent, normally distributed random variables then X + Y has a normal distribution.
Proof
The Pareto Distribution
Recall that the (standard) Pareto distribution is a continuous distribution on [1, ∞) with probability density function f given by f(x) = a/x^{a+1} for x ∈ [1, ∞), where a ∈ (0, ∞) is the shape parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model financial variables such as income. The Pareto distribution is studied in more detail in the chapter on Special Distributions.
Suppose that X has the Pareto distribution with shape parameter a , and let M denote the moment generating function of X .
Then
1. E(X^n) = a/(a − n) if n < a, and E(X^n) = ∞ if n ≥ a
2. M(t) = ∞ for t > 0
Proof
On the other hand, like all distributions on R, the Pareto distribution has a characteristic function. However, the characteristic
function of the Pareto distribution does not have a simple, closed form.
The Cauchy Distribution
Recall that the standard Cauchy distribution is a continuous distribution on R with probability density function f given by f(x) = 1/[π(1 + x²)] for x ∈ R, and is named for Augustin Cauchy. The Cauchy distribution is studied in more generality in the chapter on Special Distributions.
The graph of f is known as the Witch of Agnesi, named for Maria Agnesi.
Suppose that X has the standard Cauchy distribution, and let M denote the moment generating function of X. Then
1. E(X) does not exist.
2. M (t) = ∞ for t ≠ 0 .
Proof
Once again, all distributions on R have characteristic functions, and the standard Cauchy distribution has a particularly simple one.
Counterexample
For the Pareto distribution, only some of the moments are finite; so of course, the moment generating function cannot be finite in an interval about 0. We will now give an example of a distribution for which all of the moments are finite, yet still the moment generating function is not finite in any interval about 0. Furthermore, we will see two different distributions that have the same moments of all orders.
Suppose that Z has the standard normal distribution and let X = e^Z. The distribution of X is known as the (standard) lognormal
distribution. The lognormal distribution is studied in more generality in the chapter on Special Distributions. This distribution has
finite moments of all orders, but infinite moment generating function.
1. E(X^n) = e^{n²/2} for n ∈ N.
2. E(e^{tX}) = ∞ for t > 0.
Proof
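A numerical check of the moment formula, using E(X^n) = E(e^{nZ}) and integrating against the standard normal density by the midpoint rule (a sketch; the truncation limit and grid size are arbitrary):

```python
import math

phi = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def lognormal_moment(n, lim=20.0, m=200_000):
    # E(X^n) = E(exp(n Z)) for X = exp(Z), Z standard normal
    h = 2 * lim / m
    return sum(math.exp(n * (-lim + (i + 0.5) * h)) * phi(-lim + (i + 0.5) * h) * h
               for i in range(m))

for n in range(1, 4):
    assert abs(lognormal_moment(n) / math.exp(n * n / 2) - 1) < 1e-4
```

Every moment is finite, yet the same integral with e^{nz} replaced by e^{t e^z} diverges for any t > 0, which is exactly the failure of the moment generating function described in part 2.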
Let h be the function defined by h(x) = sin(2π ln x) for x > 0, and let g be the function defined by g(x) = f(x)[1 + h(x)] for x > 0, where f is the lognormal probability density function. Then g is also a probability density function, and the distribution with density g has the same moments of all orders as the lognormal distribution.
Figure 4.6.1 : The graphs of f and g , probability density functions for two distributions with the same moments of all orders.
This page titled 4.6: Generating Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
4.7: Conditional Expected Value
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of
outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F ). Suppose next that X is a random
variable taking values in a set S and that Y is a random variable taking values in T ⊆ R . We assume that either Y has a discrete
distribution, so that T is countable, or that Y has a continuous distribution so that T is an interval (or perhaps a union of intervals).
In this section, we will study the conditional expected value of Y given X, a concept of fundamental importance in probability. As
we will see, the expected value of Y given X is the function of X that best approximates Y in the mean square sense. Note that X
is a general random variable, not necessarily real-valued, but as usual, we will assume that either X has a discrete distribution, so
that S is countable, or that X has a continuous distribution on S ⊆ R^n for some n ∈ N₊. In the latter case, S is typically a region defined by inequalities involving elementary functions. We will also assume that all expected values that are mentioned exist (as real numbers).
Basic Theory
Definitions
Note that we can think of (X, Y) as a random variable that takes values in the Cartesian product set S × T. We need to recall some basic facts from our work with joint distributions and conditional distributions.
We assume that (X, Y) has joint probability density function f and we let g denote the (marginal) probability density function of X. Recall that if Y has a discrete distribution then g(x) = ∑_{y∈T} f(x, y) for x ∈ S, and if Y has a continuous distribution then g(x) = ∫_T f(x, y) dy for x ∈ S.
In either case, for x ∈ S , the conditional probability density function of Y given X = x is defined by
h(y ∣ x) = f(x, y) / g(x),  y ∈ T  (4.7.3)
For x ∈ S, the conditional expected value of Y given X = x is simply the mean computed relative to the conditional distribution. So if Y has a discrete distribution then E(Y ∣ X = x) = ∑_{y∈T} y h(y ∣ x), and if Y has a continuous distribution then E(Y ∣ X = x) = ∫_T y h(y ∣ x) dy.
1. The function v : S → R defined by v(x) = E(Y ∣ X = x) for x ∈ S is the regression function of Y based on X.
2. The random variable v(X) is called the conditional expected value of Y given X and is denoted E(Y ∣ X) .
Intuitively, we treat X as known, and therefore not random, and we then average Y with respect to the probability distribution that
remains. The advanced section on conditional expected value gives a much more general definition that unifies the definitions
given here for the various distribution types.
4.7.1 https://stats.libretexts.org/@go/page/10163
Properties
The most important property of the random variable E(Y ∣ X) is given in the following theorem. In a sense, this result states that
E(Y ∣ X) behaves just like Y in terms of other functions of X, and is essentially the only function of X with this property.
Two random variables that are equal with probability 1 are said to be equivalent. We often think of equivalent random variables as
being essentially the same object, so the fundamental property above essentially characterizes E(Y ∣ X) . That is, we can think of
E(Y ∣ X) as any random variable that is a function of X and satisfies this property. Moreover the fundamental property can be
used as a definition of conditional expected value, regardless of the type of the distribution of (X, Y ). If you are interested, read
the more advanced treatment of conditional expected value.
Suppose that X is also real-valued. Recall that the best linear predictor of Y based on X was characterized by property (a), but
with just two functions: r(x) = 1 and r(x) = x . Thus the characterization in the fundamental property is certainly reasonable,
since (as we show below) E(Y ∣ X) is the best predictor of Y among all functions of X, not just linear functions.
The basic property is also very useful for establishing other properties of conditional expected value. Our first consequence is the
fact that Y and E(Y ∣ X) have the same mean.
Aside from the theoretical interest, this theorem is often a good way to compute E(Y ) when we know the conditional distribution
of Y given X. We say that we are computing the expected value of Y by conditioning on X.
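A toy illustration of computing E(Y) by conditioning, with a made-up discrete model (X uniform on {1, 2, 3}, and given X = x, Y uniform on {0, …, x}); exact rational arithmetic keeps the comparison exact:

```python
from fractions import Fraction as F

# X uniform on {1, 2, 3}; given X = x, Y uniform on {0, ..., x}, so E(Y | X = x) = x/2
pX = {x: F(1, 3) for x in (1, 2, 3)}
EY_conditioning = sum(pX[x] * F(x, 2) for x in pX)               # E[E(Y | X)]
# direct computation from the joint pmf f(x, y) = pX[x] / (x + 1)
EY_direct = sum(pX[x] * F(1, x + 1) * y for x in pX for y in range(x + 1))
assert EY_conditioning == EY_direct == F(1)
```

Both routes give E(Y) = 1, but the conditioning route needs only the three conditional means x/2 rather than the full joint distribution.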
For many basic properties of ordinary expected value, there are analogous results for conditional expected value. We start with two of the most important, linearity and monotonicity, which every type of expected value must satisfy. In the following two theorems, the random variables Y and Z are real-valued, and as before, X is a general random variable.
Linear Properties
1. E(Y + Z ∣ X) = E(Y ∣ X) + E(Z ∣ X) .
2. E(c Y ∣ X) = c E(Y ∣ X)
Proof
Part (a) is the additive property and part (b) is the scaling property. The scaling property will be significantly generalized below in
(8).
Our next few properties relate to the idea that E(Y ∣ X) is the expected value of Y given X . The first property is essentially a
restatement of the fundamental property.
The next result states that any (deterministic) function of X acts like a constant in terms of the conditional expected value with
respect to X.
If s : S → R then
E[s(X) Y ∣ X] = s(X) E(Y ∣ X)  (4.7.12)
Proof
The following rule generalizes theorem (8) and is sometimes referred to as the substitution rule for conditional expected value.
If s : S × T → R then
E[s(X, Y) ∣ X = x] = E[s(x, Y) ∣ X = x],  x ∈ S
In particular, it follows from (8) that E[s(X) ∣ X] = s(X) . At the opposite extreme, we have the next result: If X and Y are
independent, then knowledge of X gives no information about Y and so the conditional expected value with respect to X reduces
to the ordinary (unconditional) expected value of Y .
Proof
Suppose now that Z is real-valued and that X and Y are random variables (all defined on the same probability space, of course).
The following theorem gives a consistency condition of sorts. Iterated conditional expected values reduce to a single conditional
expected value with respect to the minimum amount of information. For simplicity, we write E(Z ∣ X, Y ) rather than
E [Z ∣ (X, Y )] .
Consistency
1. E [E(Z ∣ X, Y ) ∣ X] = E(Z ∣ X)
2. E [E(Z ∣ X) ∣ X, Y ] = E(Z ∣ X)
Proof
Finally we show that E(Y ∣ X) has the same covariance with X as does Y , not surprising since again, E(Y ∣ X) behaves just like
Y in its relations with X.
Conditional Probability
The conditional probability of an event A , given random variable X (as above), can be defined as a special case of the conditional
expected value. As usual, let 1_A denote the indicator random variable of A.
If A is an event, define
P(A ∣ X) = E(1_A ∣ X)  (4.7.17)
For example, suppose that X has a discrete distribution on a countable set S with probability density function g. Then (a) becomes
E[r(X) 1_A] = ∑_{x∈S} r(x) P(A ∣ X = x) g(x)
But this is obvious since P(A ∣ X = x) = P(A, X = x)/P(X = x) and g(x) = P(X = x). Similarly, if X has a continuous distribution on S ⊆ R^n then (a) states that
E[r(X) 1_A] = ∫_S r(x) P(A ∣ X = x) g(x) dx  (4.7.19)
The properties above for conditional expected value, of course, have special cases for conditional probability.
Again, the result in the previous exercise is often a good way to compute P(A) when we know the conditional probability of A
given X. We say that we are computing the probability of A by conditioning on X. This is a very compact and elegant version of
the conditioning result given first in the section on Conditional Probability in the chapter on Probability Spaces and later in the
section on Discrete Distributions in the Chapter on Distributions.
The following result gives the conditional version of the axioms of probability.
Axioms of probability
1. P(A ∣ X) ≥ 0 for every event A .
2. P(Ω ∣ X) = 1
3. If {Aᵢ : i ∈ I} is a countable collection of disjoint events then P(⋃_{i∈I} Aᵢ ∣ X) = ∑_{i∈I} P(Aᵢ ∣ X).
Details
From the last result, it follows that other standard probability rules hold for conditional probability given X. These results include
the complement rule
the increasing property
Boole's inequality
Bonferroni's inequality
the inclusion-exclusion laws
If u : S → R, then
1. E([E(Y ∣ X) − Y]²) ≤ E([u(X) − Y]²)
2. Equality holds in part 1 if and only if u(X) = E(Y ∣ X) with probability 1.
Suppose now that X is real-valued. In the section on covariance and correlation, we found that the best linear predictor of Y given
X is
L(Y ∣ X) = E(Y) + [cov(X, Y) / var(X)] [X − E(X)]  (4.7.23)
On the other hand, E(Y ∣ X) is the best predictor of Y among all functions of X. It follows that if E(Y ∣ X) happens to be a linear
function of X then it must be the case that E(Y ∣ X) = L(Y ∣ X) . However, we will give a direct proof also:
Conditional Variance
The conditional variance of Y given X is defined like the ordinary variance, but with all expected values conditioned on X.
var(Y ∣ X) = E([Y − E(Y ∣ X)]² ∣ X)  (4.7.24)
Thus, var(Y ∣ X) is a function of X, and in particular, is a random variable. Our first result is a computational formula that is
analogous to the one for standard variance—the variance is the mean of the square minus the square of the mean, but now with all
expected values conditioned on X:
var(Y ∣ X) = E(Y² ∣ X) − [E(Y ∣ X)]².
Proof
Our next result shows how to compute the ordinary variance of Y by conditioning on X.
Thus, the variance of Y is the expected conditional variance plus the variance of the conditional expected value. This result is often
a good way to compute var(Y ) when we know the conditional distribution of Y given X. With the help of (21) we can give a
formula for the mean square error when E(Y ∣ X) is used as a predictor of Y.
Proof
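A numerical check of the decomposition var(Y) = E[var(Y ∣ X)] + var[E(Y ∣ X)] on a small, made-up joint density (the numbers below are purely illustrative):

```python
# made-up joint pmf on {0, 1, 2} x {0, 1, 2}; entries sum to 1
f = {(0, 0): 0.10, (0, 1): 0.10, (0, 2): 0.05,
     (1, 0): 0.10, (1, 1): 0.20, (1, 2): 0.10,
     (2, 0): 0.05, (2, 1): 0.10, (2, 2): 0.20}
xs = range(3)
g = {x: sum(f[(x, y)] for y in xs) for x in xs}                   # marginal pmf of X
mean_y = {x: sum(y * f[(x, y)] for y in xs) / g[x] for x in xs}   # E(Y | X = x)
msq_y = {x: sum(y * y * f[(x, y)] for y in xs) / g[x] for x in xs}
var_y = {x: msq_y[x] - mean_y[x] ** 2 for x in xs}                # var(Y | X = x)

EY = sum(y * p for (x, y), p in f.items())
varY = sum(y * y * p for (x, y), p in f.items()) - EY ** 2
expected_cond_var = sum(g[x] * var_y[x] for x in xs)              # E[var(Y | X)]
var_cond_mean = sum(g[x] * mean_y[x] ** 2 for x in xs) - EY ** 2  # var[E(Y | X)]
assert abs(varY - (expected_cond_var + var_cond_mean)) < 1e-12
```

The same dictionaries also verify the earlier mean identity, since summing g(x) E(Y ∣ X = x) reproduces E(Y).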
Let us return to the study of predictors of the real-valued random variable Y , and compare the three predictors we have studied in
terms of mean square error.
3. If X is a general random variable, then the best overall predictor of Y given X is E(Y ∣ X) with mean square error
var(Y ) − var [E(Y ∣ X)] .
Conditional Covariance
Suppose that Y and Z are real-valued random variables, and that X is a general random variable, all defined on our underlying
probability space. Analogous to variance, the conditional covariance of Y and Z given X is defined like the ordinary covariance,
but with all expected values conditioned on X.
\[ \operatorname{cov}(Y, Z \mid X) = E\left( [Y - E(Y \mid X)][Z - E(Z \mid X)] \;\middle|\; X \right) \tag{4.7.30} \]
Thus, cov(Y , Z ∣ X) is a function of X, and in particular, is a random variable. Our first result is a computational formula that is
analogous to the one for standard covariance—the covariance is the mean of the product minus the product of the means, but now
with all expected values conditioned on X:
Our next result shows how to compute the ordinary covariance of Y and Z by conditioning on X.
Thus, the covariance of Y and Z is the expected conditional covariance plus the covariance of the conditional expected values.
This result is often a good way to compute cov(Y , Z) when we know the conditional distribution of (Y , Z) given X.
Suppose that (X, Y ) has probability density function f defined by f (x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1 .
1. Find L(Y ∣ X) .
2. Find E(Y ∣ X) .
3. Graph L(Y ∣ X = x) and E(Y ∣ X = x) as functions of x, on the same axes.
4. Find var(Y ).
5. Find var(Y) [1 − cor²(X, Y)].
Suppose that (X, Y) has probability density function f defined by f(x, y) = 15x²y for 0 ≤ x ≤ y ≤ 1.
1. Find L(Y ∣ X) .
2. Find E(Y ∣ X) .
3. Graph L(Y ∣ X = x) and E(Y ∣ X = x) as functions of x, on the same axes.
4. Find var(Y ).
5. Find var(Y) [1 − cor²(X, Y)].
Answer
Find \( E\left(Y e^X - Z \sin X \mid X\right) \).
Answer
Uniform Distributions
As usual, continuous uniform distributions can give us some geometric insight.
\[ \lambda_n(A) = \int_A 1 \, dx, \quad A \subseteq \mathbb{R}^n \tag{4.7.36} \]
Details
With our usual setup, suppose that X takes values in S ⊆ ℝ^n, Y takes values in T ⊆ ℝ, and that (X, Y) is uniformly distributed on R ⊆ S × T ⊆ ℝ^{n+1}. So 0 < λ_{n+1}(R) < ∞, and the joint probability density function f of (X, Y) is given by f(x, y) = 1 / λ_{n+1}(R) for (x, y) ∈ R. Recall that uniform distributions, whether discrete or continuous, always have constant probability density functions.
In the setting above, suppose that the cross section T_x = {y ∈ T : (x, y) ∈ R} is a bounded interval with midpoint m(x) and length l(x) for each x ∈ S. Then
1. E(Y ∣ X) = m(X)
2. var(Y ∣ X) = l²(X) / 12
Proof
So in particular, the regression curve x ↦ E(Y ∣ X = x) follows the midpoints of the cross-sectional intervals.
In each case below, suppose that (X, Y) is uniformly distributed on the given region. Find E(Y ∣ X) and var(Y ∣ X).
Answer
In the bivariate uniform experiment, select each of the following regions. In each case, run the simulation 2000 times and note
the relationship between the cloud of points and the graph of the regression function.
1. square
2. triangle
3. circle
Suppose that X is uniformly distributed on the interval (0, 1), and that given X, random variable Y is uniformly distributed on
(0, X). Find each of the following:
1. E(Y ∣ X)
2. E(Y )
3. var(Y ∣ X)
4. var(Y )
Answer
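A quick simulation sketch (not from the text) confirms the conditioning computations for this model: E(Y) = E(X/2) = 1/4 and var(Y) = E(X²)/12 + var(X)/4 = 1/36 + 1/48 = 7/144.

```python
import random
import statistics

random.seed(1)

# X ~ uniform(0, 1); given X, Y ~ uniform(0, X).
ys = []
for _ in range(300_000):
    x = random.random()
    ys.append(random.uniform(0.0, x))

print(statistics.fmean(ys))      # close to 1/4
print(statistics.pvariance(ys))  # close to 7/144 ≈ 0.0486
```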
The Hypergeometric Distribution
Suppose that a population consists of m objects, and that each object is one of three types. There are a objects of type 1, b objects
of type 2, and m − a − b objects of type 0. The parameters a and b are positive integers with a + b < m . We sample n objects
from the population at random, and without replacement, where n ∈ {0, 1, … , m}. Denote the number of type 1 and 2 objects in
the sample by X and Y , so that the number of type 0 objects in the sample is n − X − Y . In the chapter on Distributions, we
showed that the joint, marginal, and conditional distributions of X and Y are all hypergeometric—only the parameters change.
Here is the relevant result for this section:
1. \( E(Y \mid X) = \frac{b}{m - a}\,(n - X) \)
2. \( \operatorname{var}(Y \mid X) = \frac{b(m - a - b)}{(m - a)^2 (m - a - 1)}\,(n - X)(m - a - n + X) \)
3. \( E\left([Y - E(Y \mid X)]^2\right) = \frac{n(m - n)\,b(m - a - b)}{m(m - 1)(m - a)} \)
Proof
In a collection of 120 objects, 50 are classified as good, 40 as fair and 30 as poor. A sample of 20 objects is selected at random
and without replacement. Let X denote the number of good objects in the sample and Y the number of poor objects in the
sample. Find each of the following:
1. E(Y ∣ X)
2. var(Y ∣ X)
3. The predicted value of Y given X = 8
Answer
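The predicted value in part (c) can be approximated by rejection sampling: repeatedly draw a sample of 20 objects without replacement, keep only the samples with exactly 8 good objects, and average the number of poor objects. A sketch, not from the text:

```python
import random
import statistics

random.seed(2)

population = ["good"] * 50 + ["fair"] * 40 + ["poor"] * 30
preds = []
while len(preds) < 10_000:
    sample = random.sample(population, 20)
    if sample.count("good") == 8:           # condition on X = 8
        preds.append(sample.count("poor"))  # record Y

# E(Y | X = 8) = (30 / 70) * (20 - 8) = 36/7
print(statistics.fmean(preds))  # close to 36/7 ≈ 5.14
```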
Suppose now that we have n independent trials, each with three possible outcomes 0, 1, 2, occurring with probabilities 1 − p − q, p, and q, respectively. Let X denote the number of trials that resulted in outcome 1 and Y the number of trials that resulted in outcome 2, so that n − X − Y is the number of trials that resulted in outcome 0. In the chapter on Distributions, we showed that the joint, marginal, and conditional distributions of X and Y are all multinomial—only the parameters change. Here is the relevant result for this section:
1. \( E(Y \mid X) = \frac{q}{1 - p}\,(n - X) \)
2. \( \operatorname{var}(Y \mid X) = \frac{q(1 - p - q)}{(1 - p)^2}\,(n - X) \)
3. \( E\left([Y - E(Y \mid X)]^2\right) = \frac{q(1 - p - q)}{1 - p}\,n \)
Proof
Note again that E(Y ∣ X) is a linear function of X and hence E(Y ∣ X) = L(Y ∣ X) .
Suppose that a fair, 12-sided die is thrown 50 times. Let X denote the number of throws that resulted in a number from 1 to 5,
and Y the number of throws that resulted in a number from 6 to 9. Find each of the following:
1. E(Y ∣ X)
2. var(Y ∣ X)
3. The predicted value of Y given X = 20
Answer
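The same rejection-sampling idea checks the multinomial prediction E(Y ∣ X = 20) = (50 − 20) · (1/3)/(1 − 5/12) = 120/7 ≈ 17.14. A sketch, not from the text:

```python
import random
import statistics

random.seed(3)

# 50 throws of a fair 12-sided die: X counts faces 1-5 (p = 5/12),
# Y counts faces 6-9 (q = 4/12).
preds = []
while len(preds) < 10_000:
    rolls = [random.randint(1, 12) for _ in range(50)]
    if sum(r <= 5 for r in rolls) == 20:               # condition on X = 20
        preds.append(sum(6 <= r <= 9 for r in rolls))  # record Y

print(statistics.fmean(preds))  # close to 120/7 ≈ 17.14
```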
The Poisson distribution, which is studied in more detail in the chapter on the Poisson Process, has probability density function f defined by
\[ f(x) = e^{-r} \frac{r^x}{x!}, \quad x \in \mathbb{N} \tag{4.7.39} \]
where r ∈ (0, ∞) is the parameter.
Suppose that X and Y are independent random variables, and that X has the Poisson distribution with parameter a ∈ (0, ∞)
and Y has the Poisson distribution with parameter b ∈ (0, ∞). Let N = X + Y . Then
1. \( E(X \mid N) = \frac{a}{a + b}\,N \)
2. \( \operatorname{var}(X \mid N) = \frac{ab}{(a + b)^2}\,N \)
3. \( E\left([X - E(X \mid N)]^2\right) = \frac{ab}{a + b} \)
Proof
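A simulation sketch (not from the text) illustrates the result; the parameter values a = 2, b = 3 and the conditioning event N = 5 are chosen only for illustration, and the inversion sampler below is a standard way to simulate Poisson variables:

```python
import math
import random
import statistics

random.seed(4)

def poisson(r):
    """Sample from the Poisson distribution with parameter r by CDF inversion."""
    u, k = random.random(), 0
    p = math.exp(-r)
    c = p
    while u > c:
        k += 1
        p *= r / k
        c += p
    return k

a, b = 2.0, 3.0
xs = []
while len(xs) < 20_000:
    x, y = poisson(a), poisson(b)
    if x + y == 5:       # condition on N = X + Y = 5
        xs.append(x)

# E(X | N) = (a / (a + b)) N, so the conditional mean given N = 5 is 2.
print(statistics.fmean(xs))  # close to 2.0
```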
Once again, E(X ∣ N) is a linear function of N and so E(X ∣ N) = L(X ∣ N). If we reverse the roles of the variables, the conditional expected value is trivial from our basic properties: E(N ∣ X) = X + E(Y) = X + b.
1. E(Y ∣ X₁)
2. E(U ∣ X₁)
3. E(Y ∣ U)
4. E(X₂ ∣ X₁)
Answer
A coin is chosen at random from the box and tossed. Find the probability of heads.
Answer
This problem is an example of Laplace's rule of succession, named for Pierre Simon Laplace.
Denote the common mean, variance, and moment generating function by μ = E(X_i), σ² = var(X_i), and G(t) = E(e^{t X_i}). Let
\[ Y_n = \sum_{i=1}^n X_i, \quad n \in \mathbb{N} \tag{4.7.42} \]
Then
\[ Y_N = \sum_{i=1}^N X_i \tag{4.7.43} \]
is a random sum of random variables; the terms in the sum are random, and the number of terms is random. This type of variable occurs in many different contexts. For example, N might represent the number of customers who enter a store in a given period of time, and X_i the amount spent by customer i, so that Y_N is the total revenue of the store during the period.
1. E(Y_N ∣ N) = N μ
2. E(Y_N) = E(N) μ
Proof
Wald's equation, named for Abraham Wald, is a generalization of the previous result to the case where N is not necessarily independent of X, but rather is a stopping time for X. Roughly, this means that the event {N = n} depends only on (X₁, X₂, …, Xₙ). Wald's equation is discussed in the chapter on Random Samples. An elegant proof of Wald's equation is given in the chapter on Martingales. The advanced section on stopping times is in the chapter on Probability Measures.
1. var(Y_N ∣ N) = N σ²
2. var(Y_N) = E(N) σ² + var(N) μ²
Proof
Let H denote the probability generating function of N. The conditional and ordinary moment generating functions of Y_N are
1. \( E\left(e^{t Y_N} \mid N\right) = [G(t)]^N \)
2. \( E\left(e^{t Y_N}\right) = H(G(t)) \)
Proof
Thus the moment generating function of Y_N is H ∘ G, the composition of the probability generating function of N with the common moment generating function of the X_i, a simple and elegant result.
In the die-coin experiment, a fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let N
denote the die score and Y the number of heads. Find each of the following:
1. The conditional distribution of Y given N .
2. E (Y ∣ N )
3. var (Y ∣ N )
4. E (Y ) i
5. var(Y )
Answer
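The random-sum formulas give the answers directly: with N the die score, E(Y) = E(N) · 1/2 = 7/4 and var(Y) = E(N) · 1/4 + var(N) · 1/4 = 7/8 + 35/48 = 77/48. A simulation sketch (not from the text):

```python
import random
import statistics

random.seed(5)

# N is a fair die score; given N, Y is the number of heads in N fair coin tosses.
ys = []
for _ in range(200_000):
    n = random.randint(1, 6)
    ys.append(sum(random.random() < 0.5 for _ in range(n)))

print(statistics.fmean(ys))      # close to 7/4
print(statistics.pvariance(ys))  # close to 77/48 ≈ 1.60
```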
Run the die-coin experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and
standard deviation.
The number of customers entering a store in a given hour is a random variable with mean 20 and standard deviation 3. Each
customer, independently of the others, spends a random amount of money with mean $50 and standard deviation $5. Find the
mean and standard deviation of the amount of money spent during the hour.
Answer
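The answer is a direct application of the random-sum moment formulas E(Y_N) = E(N) μ and var(Y_N) = E(N) σ² + var(N) μ². A small script (not from the text) carries out the arithmetic:

```python
import math

mean_n, sd_n = 20, 3    # number of customers N
mu, sigma = 50, 5       # spending per customer, in dollars

mean_total = mean_n * mu                          # E(Y_N) = E(N) mu
var_total = mean_n * sigma**2 + sd_n**2 * mu**2   # E(N) sigma^2 + var(N) mu^2

print(mean_total)            # 1000
print(math.sqrt(var_total))  # sqrt(23000) ≈ 151.66
```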
A coin has a random probability of heads V and is tossed a random number of times N . Suppose that V is uniformly
distributed on [0, 1]; N has the Poisson distribution with parameter a > 0 ; and V and N are independent. Let Y denote the
number of heads. Compute the following:
1. E(Y ∣ N , V )
2. E(Y ∣ N )
3. E(Y ∣ V )
4. E(Y )
5. var(Y ∣ N , V )
6. var(Y )
Answer
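A simulation sketch (not from the text), using the hypothetical parameter value a = 6 for illustration; by the tower property, E(Y) = E(N V) = E(N) E(V) = a/2:

```python
import math
import random
import statistics

random.seed(6)

def poisson(r):
    """Sample from the Poisson distribution with parameter r by CDF inversion."""
    u, k = random.random(), 0
    p = math.exp(-r)
    c = p
    while u > c:
        k += 1
        p *= r / k
        c += p
    return k

a = 6.0  # hypothetical parameter value, for illustration only
ys = []
for _ in range(200_000):
    v = random.random()  # V ~ uniform(0, 1)
    n = poisson(a)       # N ~ Poisson(a), independent of V
    ys.append(sum(random.random() < v for _ in range(n)))

print(statistics.fmean(ys))  # close to a/2 = 3.0
```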
Mixtures of Distributions
Suppose that X = (X₁, X₂, …) is a sequence of real-valued random variables. Denote the mean, variance, and moment generating function of X_i by μ_i = E(X_i), σ_i² = var(X_i), and M_i(t) = E(e^{t X_i}), for i ∈ ℕ₊. Suppose also that N is a random variable taking values in ℕ₊, independent of X. Denote the probability density function of N by p_n = P(N = n) for n ∈ ℕ₊.
The distribution of the random variable XN is a mixture of the distributions of X = (X1 , X2 , …) , with the distribution of N as
the mixing distribution.
1. E(X_N ∣ N) = μ_N
2. \( E(X_N) = \sum_{n=1}^\infty p_n \mu_n \)
Proof
1. var(X_N ∣ N) = σ_N²
2. \( \operatorname{var}(X_N) = \sum_{n=1}^\infty p_n (\sigma_n^2 + \mu_n^2) - \left(\sum_{n=1}^\infty p_n \mu_n\right)^2 \)
Proof
1. \( E\left(e^{t X_N} \mid N\right) = M_N(t) \)
2. \( E\left(e^{t X_N}\right) = \sum_{i=1}^\infty p_i M_i(t) \)
Proof
In the coin-die experiment, a biased coin is tossed. If the coin lands tails, a fair die is rolled; if the coin lands heads, an ace-six flat die is rolled (faces 1 and 6 have probability 1/4 each, and faces 2, 3, 4, 5 have probability 1/8 each). Find the mean and standard deviation of the die score.
Answer
Run the coin-die experiment 1000 times and note the apparent convergence of the empirical mean and standard deviation to the
distribution mean and standard deviation.
This page titled 4.7: Conditional Expected Value is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
4.8: Expected Value and Covariance Matrices
The main purpose of this section is a discussion of expected value and covariance for random matrices and vectors. These topics
are somewhat specialized, but are particularly important in multivariate statistical models and for the multivariate normal
distribution. This section requires some prerequisite knowledge of linear algebra.
We assume that the various indices m, n, p, k that occur in this section are positive integers. Also we assume that expected values
of real-valued random variables that we reference exist as real numbers, although extensions to cases where expected values are ∞
or −∞ are straightforward, as long as we avoid the dreaded indeterminate form ∞ − ∞ .
Basic Theory
Linear Algebra
We will follow our usual convention of denoting random variables by upper case letters and nonrandom variables and constants by
lower case letters. In this section, that convention leads to notation that is a bit nonstandard, since the objects that we will be
dealing with are vectors and matrices. On the other hand, the notation we will use works well for illustrating the similarities
between results for random matrices and the corresponding results in the one-dimensional case. Also, we will try to be careful to
explicitly point out the underlying spaces where various objects live.
Let ℝ^{m×n} denote the space of all m × n matrices of real numbers. The (i, j) entry of a ∈ ℝ^{m×n} is denoted a_{ij} for i ∈ {1, 2, …, m} and j ∈ {1, 2, …, n}. We will identify ℝ^n with ℝ^{n×1}, so that an ordered n-tuple can also be thought of as an n × 1 column vector. The transpose of a matrix a ∈ ℝ^{m×n} is denoted a^T—the n × m matrix whose (i, j) entry is the (j, i) entry of a. Recall the definitions of matrix addition, scalar multiplication, and matrix multiplication. Recall also the standard inner product (or dot product) of x, y ∈ ℝ^n:
\[ \langle x, y \rangle = x \cdot y = x^T y = \sum_{i=1}^n x_i y_i \tag{4.8.1} \]
The outer product of x and y is x y^T, the n × n matrix whose (i, j) entry is x_i y_j. Note that the inner product is the trace (sum of the diagonal entries) of the outer product. Finally recall the standard norm on ℝ^n, given by
\[ \|x\| = \sqrt{\langle x, x \rangle} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} \tag{4.8.2} \]
Recall that the inner product is bilinear, that is, linear (preserving addition and scalar multiplication) in each argument separately. As a consequence, for x, y ∈ ℝ^n,
\[ \|x + y\|^2 = \|x\|^2 + \|y\|^2 + 2 \langle x, y \rangle \tag{4.8.3} \]
Suppose that X is an m × n matrix of real-valued random variables, whose (i, j) entry is denoted X_{ij}. Equivalently, X is a random m × n matrix, that is, a random variable with values in ℝ^{m×n}. The expected value E(X) is defined to be the m × n matrix whose (i, j) entry is E(X_{ij}), the ordinary expected value of X_{ij}.
Many of the basic properties of expected value of random variables have analogous results for expected value of random matrices,
with matrix operation replacing the ordinary ones. Our first two properties are the critically important linearity properties. The first
part is the additive property—the expected value of a sum is the sum of the expected values.
The next part of the linearity properties is the scaling property—a nonrandom matrix factor can be pulled out of the expected value.
4.8.1 https://stats.libretexts.org/@go/page/10164
Suppose that X is a random n × p matrix.
1. E(aX) = a E(X) if a ∈ ℝ^{m×n}.
2. E(Xa) = E(X) a if a ∈ ℝ^{p×n}.
Proof
Recall that for independent, real-valued variables, the expected value of the product is the product of the expected values. Here is
the analogous result for random matrices.
E(XY ) = E(X)E(Y ) if X is a random m × n matrix, Y is a random n × p matrix, and X and Y are independent.
Proof
Actually the previous result holds if X and Y are simply uncorrelated in the sense that X_{ij} and Y_{jk} are uncorrelated for each i ∈ {1, 2, …, m}, j ∈ {1, 2, …, n}, and k ∈ {1, 2, …, p}. We will study covariance of random vectors in the next subsection.
Covariance Matrices
Our next goal is to define and study the covariance of two random vectors.
1. The covariance matrix of X and Y is the m × n matrix cov(X, Y) whose (i, j) entry is cov(X_i, Y_j), the ordinary covariance of X_i and Y_j.
2. Assuming that the coordinates of X and Y have positive variance, the correlation matrix of X and Y is the m × n matrix cor(X, Y) whose (i, j) entry is cor(X_i, Y_j), the ordinary correlation of X_i and Y_j.
Many of the standard properties of covariance and correlation for real-valued random variables have extensions to random vectors.
For the following three results, X is a random vector in ℝ^m and Y is a random vector in ℝ^n.
\[ \operatorname{cov}(X, Y) = E\left([X - E(X)]\,[Y - E(Y)]^T\right) \]
Proof
Thus, the covariance of X and Y is the expected value of the outer product of X − E(X) and Y − E(Y ) . Our next result is the
computational formula for covariance: the expected value of the outer product of X and Y minus the outer product of the expected
values.
\[ \operatorname{cov}(X, Y) = E\left(X Y^T\right) - E(X)\,[E(Y)]^T \]
Proof
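The computational formula can be verified entrywise on a small joint distribution. The four equally likely points below are invented purely for illustration:

```python
# Hypothetical joint distribution of X = (X1, X2) and Y = (Y1, Y2): four
# equally likely points (x, y). Check that E(X Y^T) - E(X) E(Y)^T agrees
# entrywise with the definition E([X - E(X)][Y - E(Y)]^T).
points = [((1, 0), (2, 1)), ((0, 1), (1, 3)), ((2, 2), (0, 0)), ((1, 1), (1, 2))]
p = 1 / len(points)

def mean(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(2)]

mx = mean([x for x, _ in points])
my = mean([y for _, y in points])

# entry (i, j) via the computational formula: E(X_i Y_j) - E(X_i) E(Y_j)
cov = [[sum(p * x[i] * y[j] for x, y in points) - mx[i] * my[j]
        for j in range(2)] for i in range(2)]

# entry (i, j) via the definition: E([X_i - E(X_i)][Y_j - E(Y_j)])
cov_def = [[sum(p * (x[i] - mx[i]) * (y[j] - my[j]) for x, y in points)
            for j in range(2)] for i in range(2)]

print(cov)
print(cov_def)  # the two matrices agree
```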
cov(X, Y) = 0 if and only if cov(X_i, Y_j) = 0 for each i and j, so that each coordinate of X is uncorrelated with each coordinate of Y.
Proof
Naturally, when cov(X, Y ) = 0 , we say that the random vectors X and Y are uncorrelated. In particular, if the random vectors
are independent, then they are uncorrelated. The following results establish the bi-linear properties of covariance.
1. cov(X + Y, Z) = cov(X, Z) + cov(Y, Z) if X and Y are random vectors in ℝ^m and Z is a random vector in ℝ^n.
2. cov(X, Y + Z) = cov(X, Y) + cov(X, Z) if X is a random vector in ℝ^m, and Y and Z are random vectors in ℝ^n.
Proof
The scaling properties:
1. cov(aX, Y) = a cov(X, Y) if X is a random vector in ℝ^n, Y is a random vector in ℝ^p, and a ∈ ℝ^{m×n}.
2. cov(X, aY) = cov(X, Y) a^T if X is a random vector in ℝ^m, Y is a random vector in ℝ^n, and a ∈ ℝ^{k×n}.
Proof
Variance-Covariance Matrices
Suppose that X is a random vector in ℝ^n. The covariance matrix of X with itself is called the variance-covariance matrix of X:
\[ \operatorname{vc}(X) = \operatorname{cov}(X, X) = E\left([X - E(X)]\,[X - E(X)]^T\right) \tag{4.8.6} \]
Recall that for an ordinary real-valued random variable X, var(X) = cov(X, X) . Thus the variance-covariance matrix of a
random vector in some sense plays the same role that variance does for a random variable.
The following result is the formula for the variance-covariance matrix of a sum, analogous to the formula for the variance of a sum
of real-valued variables.
Proof
Recall that var(aX) = a² var(X) if X is a real-valued random variable and a ∈ ℝ. Here is the analogous result for the variance-covariance matrix:
vc(aX) = a vc(X) a^T if X is a random vector in ℝ^n and a ∈ ℝ^{m×n}.
Proof
Recall that if X is a random variable, then var(X) ≥ 0, and var(X) = 0 if and only if X is a constant (with probability 1). Here is the analogous result for a random vector: vc(X) is either positive semi-definite or positive definite, and it fails to be positive definite if and only if there exist a ∈ ℝ^n and c ∈ ℝ such that, with probability 1, \( a^T X = \sum_{i=1}^n a_i X_i = c \).
Proof
Recall that since vc(X) is either positive semi-definite or positive definite, the eigenvalues and the determinant of vc(X) are
nonnegative. Moreover, if vc(X) is positive semi-definite but not positive definite, then one of the coordinates of X can be written
as a linear transformation of the other coordinates (and hence can usually be eliminated in the underlying model). By contrast, if
vc(X) is positive definite, then this cannot happen; vc(X) has positive eigenvalues and determinant and is invertible.
We now consider the problem of finding the function of X, among those of the form a + bX with a ∈ ℝ^n and b ∈ ℝ^{n×m}, that is closest to Y in the mean square sense; such functions are analogous to linear functions in the single variable case. However, unless a = 0, such functions are not linear transformations in the sense of linear algebra, so the correct term is affine function of X. This problem is of fundamental importance in statistics when the predictor vector X is observable but the response vector Y is not. Our discussion here generalizes the one-dimensional case, when X and Y are random variables. That problem was solved in the section on Covariance and Correlation. We will assume that vc(X) is positive definite, so that vc(X) is invertible, and none of the coordinates of X can be written as an affine function of the other coordinates. We write vc^{-1}(X) for the inverse instead of the clunkier [vc(X)]^{-1}.
As with the single variable case, the solution turns out to be the affine function that has the same expected value as Y , and whose
covariance with X is the same as that of Y .
L(Y ∣ X) = E(Y) + cov(Y, X) vc^{-1}(X) [X − E(X)] is the affine function of X satisfying
1. E[L(Y ∣ X)] = E(Y)
2. cov[L(Y ∣ X), X] = cov(Y, X)
Proof
The variance-covariance matrix of L(Y ∣ X) , and its covariance matrix with Y turn out to be the same, again analogous to the
single variable case.
Proof
Next is the fundamental result that L(Y ∣ X) is the affine function of X that is closest to Y in the mean square sense.
Suppose that U ∈ ℝ^n is an affine function of X. Then
1. \( E\left(\|Y - L(Y \mid X)\|^2\right) \le E\left(\|Y - U\|^2\right) \)
The variance-covariance matrix of the difference between Y and the best affine approximation is given in the next theorem.
\[ \operatorname{vc}[Y - L(Y \mid X)] = \operatorname{vc}(Y) - \operatorname{cov}(Y, X)\, \operatorname{vc}^{-1}(X)\, \operatorname{cov}(X, Y) \]
Proof
The actual mean square error when we use L(Y ∣ X) to approximate Y, namely \( E\left(\|Y - L(Y \mid X)\|^2\right) \), is the trace (sum of the diagonal entries) of the variance-covariance matrix above. The function of x given by
\[ L(Y \mid X = x) = E(Y) + \operatorname{cov}(Y, X)\, \operatorname{vc}^{-1}(X)\, [x - E(X)] \tag{4.8.16} \]
is known as the (distribution) linear regression function. If we observe x then L(Y ∣ X = x) is our best affine prediction of Y .
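The regression function can be illustrated numerically. In the sketch below (not from the text), the model Y = 2X₁ − X₂ + noise is invented for illustration; since the true regression is already affine, the coefficient row cov(Y, X) vc^{-1}(X) should come out near (2, −1):

```python
import random
import statistics

random.seed(8)

n = 100_000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [2 * a - b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

def cov(u, v):
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    return statistics.fmean([(s - mu) * (t - mv) for s, t in zip(u, v)])

# vc(X) and its inverse, by the 2x2 adjugate formula
a11, a12, a22 = cov(x1, x1), cov(x1, x2), cov(x2, x2)
det = a11 * a22 - a12 * a12
inv = [[a22 / det, -a12 / det], [-a12 / det, a11 / det]]

c = [cov(y, x1), cov(y, x2)]  # cov(Y, X) as a row vector
coeffs = [c[0] * inv[0][0] + c[1] * inv[1][0],
          c[0] * inv[0][1] + c[1] * inv[1][1]]
print(coeffs)  # close to [2.0, -1.0]
```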
Multiple linear regression is more powerful than it may at first appear, because it can be applied to non-linear transformations of the random vectors. That is, if g: ℝ^m → ℝ^j and h: ℝ^n → ℝ^k, then L[h(Y) ∣ g(X)] is the affine function of g(X) that is closest to h(Y) in the mean square sense. Of course, we must be able to compute the appropriate means, variances, and covariances. Moreover, non-linear regression with a single, real-valued predictor variable can be thought of as a special case of multiple linear regression. Thus, suppose that X is the predictor variable, Y is the response variable, and that (g₁, g₂, …, gₙ) is a sequence of real-valued functions. We can apply the results of this section to find the linear function of (g₁(X), g₂(X), …, gₙ(X)) that is closest to Y in the mean square sense. We just replace X_i with g_i(X) for each i. Again, we must be able to compute the appropriate means, variances, and covariances.
Examples and Applications
Suppose that (X, Y ) has probability density function f defined by f (x, y) = x + y for 0 ≤x ≤1 , 0 ≤y ≤1 . Find each of
the following:
1. E(X, Y )
2. vc(X, Y )
Answer
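The answer can be double-checked by numerical integration over the unit square (a sketch, not from the text). The exact values are E(X) = E(Y) = 7/12, var(X) = var(Y) = 11/144, and cov(X, Y) = −1/144:

```python
# Midpoint-rule integration for the density f(x, y) = x + y on the unit square.
m = 400
h = 1.0 / m
pts = [(i + 0.5) * h for i in range(m)]

def expect(g):
    """E[g(X, Y)] = double integral of g(x, y) f(x, y) dx dy."""
    return sum(g(x, y) * (x + y) * h * h for x in pts for y in pts)

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
vx = expect(lambda x, y: x * x) - ex * ex
cxy = expect(lambda x, y: x * y) - ex * ey

print(ex, ey)   # both close to 7/12 ≈ 0.5833
print(vx, cxy)  # close to 11/144 ≈ 0.0764 and -1/144 ≈ -0.0069
```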
Suppose that (X, Y ) has probability density function f defined by f (x, y) = 2(x + y) for 0 ≤x ≤y ≤1 . Find each of the
following:
1. E(X, Y )
2. vc(X, Y )
Answer
Suppose that X is uniformly distributed on (0, 1), and that given X , random variable Y is uniformly distributed on (0, X) .
Find each of the following:
1. E(X, Y )
2. vc(X, Y )
Answer
This page titled 4.8: Expected Value and Covariance Matrices is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
4.9: Expected Value as an Integral
In the introductory section, we defined expected value separately for discrete, continuous, and mixed distributions, using density
functions. In the section on additional properties, we showed how these definitions can be unified, by first defining expected value
for nonnegative random variables in terms of the right-tail distribution function. However, by far the best and most elegant
definition of expected value is as an integral with respect to the underlying probability measure. This definition and a review of the
properties of expected value are the goals of this section. No proofs are necessary (you will be happy to know), since all of the
results follow from the general theory of integration. However, to understand the exposition, you will need to review the advanced
sections on the integral with respect to a positive measure and the properties of the integral. If you are a new student of probability,
or are not interested in the measure-theoretic detail of the subject, you can safely skip this section.
Definitions
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So Ω is the set of outcomes, F is
the σ-algebra of events, and P is the probability measure on the sample space (Ω, F ).
Recall that a random variable X for the experiment is simply a measurable function from (Ω, F) into another measurable space (S, 𝒮). When S ⊆ ℝ^n, we assume that S is Lebesgue measurable, and we take 𝒮 to be the σ-algebra of Lebesgue measurable subsets of S.
If X is a real-valued random variable on the probability space, the expected value of X is defined as the integral of X with respect to P, assuming that the integral exists:
\[ E(X) = \int_\Omega X \, dP \tag{4.9.1} \]
Let's review how the integral is defined in stages, but now using the notation of probability theory.
1. If X has a finite set of values in [0, ∞), then E(X) = ∑_x x P(X = x).
2. If X is nonnegative, then E(X) = sup{E(U) : U has a finite set of values in [0, ∞) and U ≤ X}.
3. In general, E(X) = E(X⁺) − E(X⁻), where X⁺ and X⁻ denote the positive and negative parts of X, assuming that at least one of the expected values on the right is finite.
4. If A ∈ F, then E(X; A) = E(X 1_A), assuming that the expected value on the right exists.
Thus, as with integrals generally, an expected value can exist as a number in ℝ (in which case X is integrable), can exist as ∞ or −∞, or can fail to exist. In reference to part (a), a random variable with a finite set of values in ℝ is a simple function in the terminology of general integration. In reference to part (b), note that the expected value of a nonnegative random variable always exists in [0, ∞]. In reference to part (c), E(X) exists if and only if either E(X⁺) < ∞ or E(X⁻) < ∞.
Our next goal is to restate the basic theorems and properties of integrals, but in the notation of probability. Unless otherwise noted,
all random variables are assumed to be real-valued.
Basic Properties
The Linear Properties
Perhaps the most important and basic properties are the linear properties. Part (a) is the additive property and part (b) is the scaling
property.
Suppose that X and Y are random variables whose expected values exist, and that c ∈ R . Then
1. E(X + Y ) = E(X) + E(Y ) as long as the right side is not of the form ∞ − ∞ .
2. E(cX) = cE(X)
4.9.1 https://stats.libretexts.org/@go/page/10165
Thus, part (a) holds if at least one of the expected values on the right is finite, or if both are ∞, or if both are −∞ . What is ruled
out are the two cases where one expected value is ∞ and the other is −∞ , and this is what is meant by the indeterminate form
∞−∞ .
Random variables that are equivalent have the same expected value.
If X is a random variable whose expected value exists, and Y is a random variable with P(X = Y ) = 1 , then E(X) = E(Y ) .
So, if X is a nonnegative random variable then E(X) > 0 if and only if P(X > 0) > 0 . The next result is the increasing property
of expected value, perhaps the most important property after linearity.
Suppose that X, Y are random variables whose expected values exist, and that P(X ≤ Y ) = 1 . Then
1. E(X) ≤ E(Y )
2. Except in the case that both expected values are ∞ or both −∞ , E(X) = E(Y ) if and only if P(X = Y ) = 1 .
So if X ≤ Y with probability 1 then, except in the two cases mentioned, E(X) < E(Y ) if and only if P(X < Y ) > 0 . The next
result is the absolute value inequality.
So, using the original definition and the change of variables theorem, and giving the variables explicitly for emphasis, we have
Suppose now that the distribution of X is absolutely continuous with respect to a positive measure μ on (S, 𝒮). Then by the theorem named for Johann Radon and Otto Nikodym, X has a probability density function f with respect to μ. That is, P(X ∈ A) = ∫_A f dμ for A ∈ 𝒮.
In this case, we can write the expected value of g(X) as an integral with respect to the probability density function.
\[ E[g(X)] = \int_S g f \, d\mu \tag{4.9.5} \]
Again, giving the variables explicitly for emphasis, we have the following chain of integrals:
Discrete Distributions
Suppose first that (S, S , #) is a discrete measure space, so that S is countable, S = P(S) is the collection of all subsets of S ,
and # is counting measure on (S, S ). Thus, X has a discrete distribution on S , and this distribution is always absolutely
continuous with respect to #. Specifically, #(A) = 0 if and only if A = ∅ and of course P(X ∈ ∅) = 0 . The probability density
function f of X with respect to #, as we know, is simply f(x) = P(X = x) for x ∈ S. Moreover, integrals with respect to # are sums, so
\[ E[g(X)] = \sum_{x \in S} g(x) f(x) \]
assuming that the expected value exists. Existence in this case means that either the sum of the positive terms is finite or the sum of the negative terms is finite, so that the sum makes sense (and in particular does not depend on the order in which the terms are added). Specializing further, if X itself is real-valued and g is the identity function, then
\[ E(X) = \sum_{x \in S} x f(x) \]
which was our original definition of expected value in the discrete case.
Continuous Distributions
For the second special case, suppose that (S, 𝒮, λ_n) is a Euclidean measure space, so that S is a Lebesgue measurable subset of ℝ^n for some n ∈ ℕ₊, 𝒮 is the σ-algebra of Lebesgue measurable subsets of S, and λ_n is Lebesgue measure on (S, 𝒮). The distribution of X is absolutely continuous with respect to λ_n if λ_n(A) = 0 implies P(X ∈ A) = 0 for A ∈ 𝒮. If this is the case, then X has a probability density function f with respect to λ_n, and
\[ E[g(X)] = \int_S g(x) f(x) \, dx \]
assuming that the expected value exists. When g is a typically nice function, this integral reduces to an ordinary n-dimensional Riemann integral of calculus. Specializing further, if X is itself real-valued and g is the identity function, then
\[ E(X) = \int_S x f(x) \, dx \]
which was our original definition of expected value in the continuous case.
Interchange Properties
In this subsection, we review properties that allow the interchange of expected value and other operations: limits of sequences,
infinite sums, and integrals. We assume again that the random variables are real-valued unless otherwise specified.
Limits
Our first set of convergence results deals with the interchange of expected value and limits. We start with the expected value
version of Fatou's lemma, named in honor of Pierre Fatou. Its usefulness stems from the fact that no assumptions are placed on the
random variables, except that they be nonnegative.
Our next set of results gives conditions for the interchange of expected value and limits.
Suppose that X_n is a random variable for n ∈ ℕ₊. Then \( E\left(\lim_{n \to \infty} X_n\right) = \lim_{n \to \infty} E(X_n) \) in each of the following cases:
1. X_n is nonnegative for each n ∈ ℕ₊ and X_n is increasing in n.
2. E(X₁) > −∞ and X_n is increasing in n.
3. E(X₁) < ∞ and X_n is decreasing in n.
4. lim_{n→∞} X_n exists, and |X_n| ≤ Y for n ∈ ℕ₊ where Y is a nonnegative random variable with E(Y) < ∞.
5. lim_{n→∞} X_n exists, and |X_n| ≤ c for n ∈ ℕ₊ where c is a positive constant.
Statements about the random variables in the theorem above (nonnegative, increasing, existence of limit, etc.) need only hold with probability 1. Part (a) is the monotone convergence theorem, one of the most important convergence results and in a sense, essential to the definition of the integral in the first place. Parts (b) and (c) are slight generalizations of the monotone convergence theorem. In parts (a), (b), and (c), note that lim_{n→∞} X_n exists (with probability 1), although the limit may be ∞ in parts (a) and (b) and −∞ in part (c) (with positive probability). Part (d) is the dominated convergence theorem, another of the most important convergence results. It's sometimes also known as Lebesgue's dominated convergence theorem in honor of Henri Lebesgue. Part (e) is a corollary of the dominated convergence theorem, and is known as the bounded convergence theorem.
Infinite Series
Our next results involve the interchange of expected value and an infinite sum, so these results generalize the basic additivity
property of expected value.
Suppose that X_n is a random variable for n ∈ ℕ₊. Then
\[ E\left(\sum_{n=1}^\infty X_n\right) = \sum_{n=1}^\infty E(X_n) \tag{4.9.13} \]
in each of the following cases:
1. X_n is nonnegative for each n ∈ ℕ₊.
2. \( E\left(\sum_{n=1}^\infty |X_n|\right) < \infty \)
Part (a) is a consequence of the monotone convergence theorem, and part (b) is a consequence of the dominated convergence theorem. In (b), note that \( \sum_{n=1}^\infty |X_n| < \infty \) with probability 1, and hence \( \sum_{n=1}^\infty X_n \) is absolutely convergent with probability 1. Our final result in this discussion concerns the additivity of the expected value over a countably infinite collection of disjoint events.
Suppose that X is a random variable whose expected value exists, and that {A_n : n ∈ ℕ₊} is a disjoint collection of events. Let \( A = \bigcup_{n=1}^\infty A_n \). Then
\[ E(X; A) = \sum_{n=1}^\infty E(X; A_n) \]
Integrals
Suppose that (T, 𝒯, μ) is a σ-finite measure space, and that X_t is a real-valued random variable for each t ∈ T. Thus we can think of {X_t : t ∈ T} as a stochastic process indexed by T. We assume that (ω, t) ↦ X_t(ω) is measurable, as a function from the product space (Ω × T, F ⊗ 𝒯) into ℝ. Our next result involves the interchange of expected value and integral, and is a consequence of Fubini's theorem, named for Guido Fubini.
Fubini's theorem actually states that the two iterated integrals above equal the joint integral
where of course, P ⊗ μ is the product measure on (Ω × T , F ⊗ T ) . However, our interest is usually in evaluating the iterated
integral above on the left in terms of the iterated integral on the right. Part (a) is the expected value version of Tonelli's theorem,
named for Leonida Tonelli.
Recall that the standard Cauchy distribution has probability density function f given by
\[ f(x) = \frac{1}{\pi (1 + x^2)}, \quad x \in \mathbb{R} \tag{4.9.17} \]
The Cauchy distribution is studied in more generality in the chapter on Special Distributions.
Answer
Open the Cauchy Experiment and keep the default parameters. Run the experiment 1000 times and note the behavior of the sample mean.
Recall that the Pareto distribution has probability density function f given by f(x) = a / x^{a+1} for x ∈ [1, ∞), where a > 0 is the shape parameter. The Pareto distribution is studied in more generality in the chapter on Special Distributions. Suppose that X has the Pareto distribution with shape parameter a. Find E(X) in the following cases:
1. 0 < a ≤ 1
2. a > 1
Answer
Open the special distribution simulator and select the Pareto distribution. Vary the shape parameter and note the shape of the
probability density function and the location of the mean. For various values of the parameter, run the experiment 1000 times
and compare the sample mean with the distribution mean.
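The same comparison can be sketched in code (not from the text). Since the distribution function is F(x) = 1 − x^{−a} for x ≥ 1, a Pareto variable can be simulated as U^{−1/a} with U uniform on (0, 1); with the hypothetical choice a = 3, E(X) = a/(a − 1) = 3/2:

```python
import random
import statistics

random.seed(9)

a = 3.0  # hypothetical shape parameter, chosen so that the variance is finite
xs = [random.random() ** (-1.0 / a) for _ in range(200_000)]
print(statistics.fmean(xs))  # close to a / (a - 1) = 1.5
```

For a ≤ 1 the mean is infinite, and the sample mean fails to stabilize no matter how many runs are made.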
Suppose that X has the Pareto distribution with shape parameter a. Find E(1/Xⁿ) for n ∈ ℕ₊.
Answer
Proof
When n = 1 we have E(X) = ∫_0^∞ P(X > x) dx. We saw this result before in the section on additional properties of expected value, but now we can understand the proof in terms of Fubini's theorem.
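The tail-integral formula can be checked numerically for a concrete distribution. A small sketch using only the standard library, with an exponential variable (the rate r and integration grid are arbitrary choices):

```python
import math

# Numerical check of E(X) = ∫_0^∞ P(X > x) dx for an exponential variable
# with rate r, where P(X > x) = exp(-r x) and E(X) = 1/r.
r = 2.0
dx = 1e-4

# Left-endpoint Riemann sum of the tail function over [0, 50];
# the tail beyond 50 is negligible for r = 2.
integral = sum(math.exp(-r * k * dx) * dx for k in range(int(50 / dx)))

print(integral, 1 / r)   # both are approximately 0.5
```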
For a random variable taking nonnegative integer values, the moments can be computed from sums involving the right-tail
distribution function.
E(Xⁿ) = ∑_{k=1}^∞ [kⁿ − (k − 1)ⁿ] P(X ≥ k)   (4.9.21)
Proof
When n = 1 we have E(X) = ∑_{k=1}^∞ P(X ≥ k). We saw this result before in the section on additional properties of expected value, but now we can understand the proof in terms of the interchange of sum and expected value.
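The moment formula can be verified numerically for a concrete integer-valued distribution. A sketch using the geometric distribution on {1, 2, …} (the choice of p, the moment order, and the truncation point are arbitrary):

```python
# Check E(X^n) = Σ_{k≥1} [k^n - (k-1)^n] P(X ≥ k) for a geometric variable
# on {1, 2, ...} with success probability p, where P(X ≥ k) = (1-p)^(k-1).
p = 0.3
n = 2

# Truncate the sum once the tail probabilities are negligible.
total = sum((k**n - (k - 1)**n) * (1 - p)**(k - 1) for k in range(1, 500))

exact = (2 - p) / p**2   # known formula for E(X^2) of this geometric variable
print(total, exact)      # the truncated sum matches the exact second moment
```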
This page titled 4.9: Expected Value as an Integral is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
4.10: Conditional Expected Value Revisited
Conditional expected value is much more important than one might at first think. In fact, conditional expected value is at the core
of modern probability theory because it provides the basic way of incorporating known information into a probability measure.
Basic Theory
Definition
As usual, our starting point is a random experiment modeled by a probability space (Ω, F, P), so that Ω is the set of outcomes, F is the σ-algebra of events, and P is the probability measure on the sample space (Ω, F). In our first elementary discussion, we studied the conditional expected value of a real-valued random variable X given a general random variable Y. The more general
approach is to condition on a sub σ-algebra G of F . The sections on σ-algebras and measure theory are essential prerequisites for
this section.
Before we get to the definition, we need some preliminaries. First, all random variables mentioned are assumed to be real valued. Next, the notion of equivalence plays a fundamental role in this section. Recall that random variables X₁ and X₂ are equivalent if P(X₁ = X₂) = 1. Equivalence really does define an equivalence relation on the collection of random variables defined on the sample space. Moreover, we often regard equivalent random variables as being essentially the same object. More precisely, from this point of view, the objects of our study are not individual random variables but rather equivalence classes of random variables under this equivalence relation. Finally, for A ∈ F, recall the notation for the expected value of X on the event A:
E(X; A) = E(X 1_A)   (4.10.1)
assuming of course that the expected value exists. For the remainder of this subsection, suppose that G is a sub σ-algebra of F .
Suppose that X is a random variable with E(|X|) < ∞ . The conditional expected value of X given G is the random variable
E(X ∣ G ) defined by the following properties:
The basic idea is that E(X ∣ G ) is the expected value of X given the information in the σ-algebra G . Hopefully this idea will
become clearer during our study. The conditions above uniquely define E(X ∣ G ) up to equivalence. The proof of this fact is a
simple application of the Radon-Nikodym theorem, named for Johann Radon and Otto Nikodym.
Proof
The following characterization might seem stronger but is in fact equivalent to the definition.
Suppose again that X is a random variable with E(|X|) < ∞ . Then E(X ∣ G ) is characterized by the following properties:
1. E(X ∣ G ) is measurable with respect to G
2. If U is measurable with respect to G and E(|U X|) < ∞ then E[U E(X ∣ G )] = E(U X) .
Proof
Properties
Our next discussion concerns some fundamental properties of conditional expected value. All equalities and inequalities are
understood to hold modulo equivalence, that is, with probability 1. Note also that many of the proofs work by showing that the
right hand side satisfies the properties in the definition for the conditional expected value on the left side. Once again we assume
that G is a sub σ-algebra of F.
Our first property is a simple consequence of the definition: X and E(X ∣ G ) have the same mean.
4.10.1 https://stats.libretexts.org/@go/page/10338
Suppose that X is a random variable with E(|X|) < ∞ . Then E[E(X ∣ G )] = E(X) .
Proof
The result above can often be used to compute E(X), by choosing the σ-algebra G in a clever way. We say that we are computing
E(X) by conditioning on G . Our next properties are fundamental: every version of expected value must satisfy the linearity
properties. The first part is the additive property and the second part is the scaling property.
Suppose that X and Y are random variables with E(|X|) < ∞ and E(|Y |) < ∞ , and that c ∈ R . Then
1. E(X + Y ∣ G ) = E(X ∣ G ) + E(Y ∣ G)
2. E(cX ∣ G ) = cE(X ∣ G )
Proof
The next set of properties are also fundamental to every notion of expected value. The first part is the positive property and the
second part is the increasing property.
Suppose again that X and Y are random variables with E(|X|) < ∞ and E(|Y |) < ∞ .
1. If X ≥ 0 then E(X ∣ G ) ≥ 0
2. If X ≤ Y then E(X ∣ G ) ≤ E(Y ∣ G)
Proof
The next few properties relate to the central idea that E(X ∣ G ) is the expected value of X given the information in the σ-algebra
G.
Suppose that X and V are random variables with E(|X|) < ∞ and E(|XV |) < ∞ and that V is measurable with respect to
G . Then E(V X ∣ G ) = V E(X ∣ G ) .
Proof
Compare this result with the scaling property. If V is measurable with respect to G then V is like a constant in terms of the
conditional expected value given G . On the other hand, note that this result implies the scaling property, since a constant can be
viewed as a random variable, and as such, is measurable with respect to any σ-algebra. As a corollary to this result, note that if X
itself is measurable with respect to G then E(X ∣ G ) = X . The following result gives the other extreme.
Suppose that X is a random variable with E(|X|) < ∞ . If X and G are independent then E(X ∣ G ) = E(X) .
Proof
Every random variable X is independent of the trivial σ-algebra {∅, Ω} so it follows that E(X ∣ {∅, Ω}) = E(X).
The next properties are consistency conditions, also known as the tower properties. When conditioning twice, with respect to
nested σ-algebras, the smaller one (representing the least amount of information) always prevails.
Suppose that X is a random variable with E(|X|) < ∞ and that H is a sub σ-algebra of G . Then
1. E[E(X ∣ H ) ∣ G ] = E(X ∣ H )
2. E[E(X ∣ G ) ∣ H ] = E(X ∣ H )
Proof
The next result gives Jensen's inequality for conditional expected value, named for Johan Jensen.
Suppose that X takes values in an interval S ⊆ ℝ and that g : S → ℝ is convex. If E(|X|) < ∞ and E(|g(X)|) < ∞ then
E[g(X) ∣ G ] ≥ g[E(X ∣ G )] (4.10.9)
Proof
Conditional Probability
For our next discussion, suppose as usual that G is a sub σ-algebra of F. The conditional probability of an event A given G can be defined as a special case of conditional expected value. As usual, let 1_A denote the indicator random variable of A.
For A ∈ F we define
P(A ∣ G ) = E(1A ∣ G ) (4.10.11)
Thus, we have the following characterizations of conditional probability, which are special cases of the definition and the alternate
version:
The properties above for conditional expected value, of course, have special cases for conditional probability. In particular, we can
compute the probability of an event by conditioning on a σ-algebra:
Again, the last theorem is often a good way to compute P(A) when we know the conditional probability of A given G . This is a
very compact and elegant version of the law of total probability given first in the section on Conditional Probability in the chapter
on Probability Spaces and later in the section on Discrete Distributions in the Chapter on Distributions. The following theorem
gives the conditional version of the axioms of probability.
Proof
From the last result, it follows that other standard probability rules hold for conditional probability given G (as always, modulo
equivalence). These results include
the complement rule
the increasing property
Boole's inequality
Bonferroni's inequality
the inclusion-exclusion laws
However, it is not correct to state that A ↦ P(A ∣ G) is a probability measure, because the conditional probabilities are only defined up to equivalence, and so the mapping does not make sense. We would have to specify a particular version of P(A ∣ G) for each A ∈ F for the mapping to make sense. Even if we do this, the mapping may not define a probability measure. In part (c), the left and right sides are random variables and the equation is an event that has probability 1. However, this event depends on the collection {Aᵢ : i ∈ I}. In general, there will be uncountably many such collections in F, and the intersection of all of the corresponding events may well have probability less than 1 (if it's measurable at all). It turns out that if the underlying probability space (Ω, F, P) is sufficiently “nice” (and most probability spaces that arise in applications are nice), then there does in fact exist a regular conditional probability. That is, for each A ∈ F, there exists a random variable P(A ∣ G) satisfying the conditions in (12) and such that with probability 1, A ↦ P(A ∣ G) is a probability measure.
The following theorem gives a version of Bayes' theorem, named for the inimitable Thomas Bayes.
P(A ∣ B) = E[P(B ∣ G); A] / E[P(B ∣ G)]   (4.10.14)
Proof
Basic Examples
The purpose of this discussion is to tie the general notions of conditional expected value that we are studying here to the more
elementary concepts that you have seen before. Suppose that A is an event (that is, a member of F ) with P(A) > 0 . If B is
another event, then of course, the conditional probability of B given A is
P(B ∣ A) = P(A ∩ B) / P(A)   (4.10.15)
If X is a random variable then the conditional distribution of X given A is the probability measure on R given by
R ↦ P(X ∈ R ∣ A) = P({X ∈ R} ∩ A) / P(A) for measurable R ⊆ ℝ   (4.10.16)
If E(|X|) < ∞ then the conditional expected value of X given A , denoted E(X ∣ A) , is simply the mean of this conditional
distribution.
Suppose now that A = {Aᵢ : i ∈ I} is a countable partition of the sample space Ω into events with positive probability. To review the jargon, A ⊆ F; the index set I is countable; Aᵢ ∩ Aⱼ = ∅ for distinct i, j ∈ I; ⋃_{i∈I} Aᵢ = Ω; and P(Aᵢ) > 0 for i ∈ I. Let G = σ(A), the σ-algebra generated by A. The elements of G are of the form ⋃_{j∈J} Aⱼ for J ⊆ I. Moreover, the random variables that are measurable with respect to G are precisely the variables that are constant on Aᵢ for each i ∈ I. The σ-algebra G is said to be countably generated.
If B ∈ F then P(B ∣ G) is the random variable whose value on Aᵢ is P(B ∣ Aᵢ) for each i ∈ I.
Proof
In this setting, the version of Bayes' theorem in (15) reduces to the usual elementary formulation: For i ∈ I, E[P(B ∣ G); Aᵢ] = P(Aᵢ)P(B ∣ Aᵢ) and E[P(B ∣ G)] = ∑_{j∈I} P(Aⱼ)P(B ∣ Aⱼ). Hence
P(Aᵢ ∣ B) = P(Aᵢ)P(B ∣ Aᵢ) / ∑_{j∈I} P(Aⱼ)P(B ∣ Aⱼ)   (4.10.18)
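The elementary Bayes formula is easy to check numerically. A sketch with hypothetical numbers (a three-event partition; the priors and conditional probabilities below are made up purely for illustration):

```python
# Elementary Bayes' theorem over a finite partition {A_i}:
# P(A_i | B) = P(A_i) P(B | A_i) / Σ_j P(A_j) P(B | A_j).
# Hypothetical setup: three machines producing items, with defect rates.
prior = [0.5, 0.3, 0.2]          # P(A_i): share of production per machine
defect = [0.01, 0.02, 0.05]      # P(B | A_i): defect rate per machine

total = sum(p * d for p, d in zip(prior, defect))           # P(B)
posterior = [p * d / total for p, d in zip(prior, defect)]  # P(A_i | B)

print(posterior, sum(posterior))   # the posterior probabilities sum to 1
```

Note how the smallest producer (prior 0.2) has the largest posterior weight given a defect, because its defect rate dominates.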
If X is a random variable with E(|X|) < ∞, then E(X ∣ G) is the random variable whose value on Aᵢ is E(X ∣ Aᵢ) for each i ∈ I.
Proof
The previous examples would apply to G = σ(Y ) if Y is a discrete random variable taking values in a countable set T . In this
case, the partition is simply A = {{Y = y} : y ∈ T } . On the other hand, suppose that Y is a random variable taking values in a
general set T with σ-algebra T . The real-valued random variables that are measurable with respect to G = σ(Y ) are (up to
equivalence) the measurable, real-valued functions of Y .
Specializing further, suppose that X takes values in S ⊆ ℝ, Y takes values in T ⊆ ℝⁿ (where S and T are Lebesgue measurable), and that (X, Y) has a joint continuous distribution with probability density function f. Then Y has probability density function h given by
h(y) = ∫_S f(x, y) dx, y ∈ T
Assume that h(y) > 0 for y ∈ T. Then for y ∈ T, a conditional probability density function of X given Y = y is defined by
g(x ∣ y) = f(x, y) / h(y), x ∈ S   (4.10.21)
This is precisely the setting of our elementary discussion of conditional expected value. If E(|X|) < ∞ then we usually write
E(X ∣ Y ) instead of the clunkier E[X ∣ σ(Y )].
In this setting above suppose that E(|X|) < ∞ . Then
Proof
Best Predictor
In our elementary treatment of conditional expected value, we showed that the conditional expected value of a real-valued random
variable X given a general random variable Y is the best predictor of X, in the least squares sense, among all real-valued functions
of Y . A more careful statement is that E(X ∣ Y ) is the best predictor of X among all real-valued random variables that are
measurable with respect to σ(Y). Thus, it should come as no surprise that if G is a sub σ-algebra of F, then E(X ∣ G) is the best predictor of X, in the least squares sense, among all real-valued random variables that are measurable with respect to G. We will
show that this is indeed the case in this subsection. The proofs are very similar to the ones given in the elementary section. For the
rest of this discussion, we assume that G is a sub σ-algebra of F and that all random variables mentioned are real valued.
Suppose that X and U are random variables with E(|X|) < ∞ and E(|XU |) < ∞ and that U is measurable with respect to
G . Then X − E(X ∣ G ) and U are uncorrelated.
Proof
The next result is the main one: E(X ∣ G ) is closer to X in the mean square sense than any other random variable that is
measurable with respect to G . Thus, if G represents the information that we have, then E(X ∣ G ) is the best we can do in
estimating X.
2. Equality holds if and only if P[U = E(X ∣ G )] = 1 , so U and E(X ∣ G ) are equivalent.
Proof
Conditional Variance
Once again, we assume that G is a sub σ-algebra of F and that all random variables mentioned are real valued, unless otherwise
noted. It's natural to define the conditional variance of a random variable given G in the same way as ordinary variance, but with all
expected values conditioned on G .
var(X ∣ G) = E([X − E(X ∣ G)]² ∣ G)   (4.10.27)
Like all conditional expected values relative to G , var(X ∣ G ) is a random variable that is measurable with respect to G and is
unique up to equivalence. The first property is analogous to the computational formula for ordinary variance.
Proof
Next is a formula for the ordinary variance in terms of conditional variance and expected value.
Proof
So the variance of X is the expected conditional variance plus the variance of the conditional expected value. This result is often a
good way to compute var(X) when we know the conditional distribution of X given G . In turn, this property leads to a formula for
the mean square error when E(X ∣ G ) is thought of as a predictor of X.
Proof
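The variance decomposition var(X) = E[var(X ∣ G)] + var[E(X ∣ G)] can be checked by simulation when G is generated by a two-event partition. A sketch assuming NumPy; the mixture parameters and seed are arbitrary choices:

```python
import numpy as np

# Check var(X) = E[var(X | G)] + var[E(X | G)] for G generated by a
# two-event partition: X is drawn from one of two normal components.
rng = np.random.default_rng(2)
n = 1_000_000

component = rng.random(n) < 0.4              # event A with P(A) = 0.4
x = np.where(component,
             rng.normal(0.0, 1.0, n),        # distribution of X given A
             rng.normal(3.0, 2.0, n))        # distribution of X given A^c

# Exact decomposition for this mixture:
expected_cond_var = 0.4 * 1.0**2 + 0.6 * 2.0**2    # E[var(X | G)]
var_of_cond_mean = 0.4 * 0.6 * (3.0 - 0.0)**2      # var[E(X | G)]
print(x.var(), expected_cond_var + var_of_cond_mean)
```

The sample variance matches the sum of the two exact terms, 2.8 + 2.16 = 4.96, up to simulation error.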
Let us return to the study of predictors of the real-valued random variable X, and compare them in terms of mean square error.
L(X ∣ Y) = E(X) + [cov(X, Y) / var(Y)] [Y − E(Y)]   (4.10.34)
3. If Y is a (general) random variable, then the best predictor of X among all real-valued functions of Y with finite variance is
E(X ∣ Y ) with mean square error var(X) − var[E(X ∣ Y )] .
4. If G is a sub σ-algebra of F , then the best predictor of X among random variables with finite variance that are measurable
with respect to G is E(X ∣ G ) with mean square error var(X) − var[E(X ∣ G )] .
Of course, (a) is a special case of (d) with G = {∅, Ω} and (c) is a special case of (d) with G = σ(Y ) . Only (b), the linear case,
cannot be interpreted in terms of conditioning with respect to a σ-algebra.
Conditional Covariance
Suppose again that G is a sub σ-algebra of F . The conditional covariance of two random variables is defined like the ordinary
covariance, but with all expected values conditioned on G .
The conditional covariance of X and Y given G is defined as
cov(X, Y ∣ G) = E([X − E(X ∣ G)][Y − E(Y ∣ G)] ∣ G)   (4.10.35)
So cov(X, Y ∣ G ) is a random variable that is measurable with respect to G and is unique up to equivalence. As should be the case,
conditional covariance generalizes conditional variance.
Our next result is a computational formula that is analogous to the one for standard covariance—the covariance is the mean of the
product minus the product of the means, but now with all expected values conditioned on G :
Proof
Our next result shows how to compute the ordinary covariance of X and Y by conditioning on G.
Proof
Thus, the covariance of X and Y is the expected conditional covariance plus the covariance of the conditional expected values.
This result is often a good way to compute cov(X, Y ) when we know the conditional distribution of (X, Y ) given G .
This page titled 4.10: Conditional Expected Value Revisited is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
4.11: Vector Spaces of Random Variables
Basic Theory
Many of the concepts in this chapter have elegant interpretations if we think of real-valued random variables as vectors in a vector
space. In particular, variance and higher moments are related to the concept of norm and distance, while covariance is related to
inner product. These connections can help unify and illuminate some of the ideas in the chapter from a different point of view. Of
course, real-valued random variables are simply measurable, real-valued functions defined on the sample space, so much of the
discussion in this section is a special case of our discussion of function spaces in the chapter on Distributions, but recast in the
notation of probability.
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). Thus, Ω is the set of outcomes, F is
the σ-algebra of events, and P is the probability measure on the sample space (Ω, F ). Our basic vector space V consists of all
real-valued random variables defined on (Ω, F, P) (that is, defined for the experiment). Recall that random variables X₁ and X₂ are equivalent if P(X₁ = X₂) = 1, in which case we write X₁ ≡ X₂. We consider two such random variables as the same vector, so that technically, our vector space consists of equivalence classes under this equivalence relation. The addition operator corresponds to the usual addition of two real-valued random variables, and the operation of scalar multiplication corresponds to the usual multiplication of a real-valued random variable by a real (non-random) number. These operations are compatible with the equivalence relation in the sense that if X₁ ≡ X₂ and Y₁ ≡ Y₂ then X₁ + Y₁ ≡ X₂ + Y₂ and cX₁ ≡ cX₂ for c ∈ ℝ. In short, the vector space operations are well defined modulo equivalence.
Norm
Suppose that k ∈ [1, ∞). The k norm of X ∈ V is defined by
∥X∥ₖ = [E(|X|ᵏ)]^{1/k}   (4.11.1)
Thus, ∥X∥ₖ is a measure of the size of X in a certain sense, and of course it's possible that ∥X∥ₖ = ∞. The following theorems
establish the fundamental properties. The first is the positive property.
Suppose again that k ∈ [1, ∞). Then ∥cX∥ₖ = |c| ∥X∥ₖ for X ∈ V and c ∈ ℝ.
Proof
The next result is Minkowski's inequality, named for Hermann Minkowski, and also known as the triangle inequality.
It follows from the last three results that the set of random variables (again, modulo equivalence) with finite k norm forms a
subspace of our parent vector space V , and that the k norm really is a norm on this vector space.
In analysis, p is often used as the index rather than k as we have used here, but p seems too much like a probability, so we have broken with tradition on this point. The L is in honor of Henri Lebesgue, who developed much of this theory. Sometimes, when we need to indicate the dependence on the underlying σ-algebra F, we write Lₖ(F). Our next result is Lyapunov's inequality, named for Aleksandr Lyapunov. This inequality shows that the k-norm of a random variable is increasing in k.
4.11.1 https://stats.libretexts.org/@go/page/10339
Suppose that j, k ∈ [1, ∞) with j ≤ k. Then ∥X∥ⱼ ≤ ∥X∥ₖ for X ∈ V.
Proof
Lyapunov's inequality shows that if 1 ≤ j ≤ k and ∥X∥ₖ < ∞ then ∥X∥ⱼ < ∞. Thus, Lₖ is a subspace of Lⱼ.
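For X uniform on [0, 1] the k-norms can be computed in closed form, which gives a quick check of Lyapunov's inequality. A small sketch; the range of k shown is an arbitrary choice:

```python
# For X uniform on [0, 1], E(|X|^k) = 1/(k+1), so ||X||_k = (k+1)^(-1/k).
# Lyapunov's inequality says this quantity is increasing in k.
norms = [(k + 1) ** (-1 / k) for k in range(1, 11)]
print(norms)   # increases from 1/2 toward 1 (the essential supremum of X)
```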
Metric
The k norm, like any norm on a vector space, can be used to define a metric, or distance function; we simply compute the norm of
the difference between two vectors.
The following properties are analogous to the properties of the norm (and thus very little additional work is required for the proofs). These properties show that the k metric really is a metric on Lₖ (as always, modulo equivalence). The first is the positive property.
dₖ(X, Y) = dₖ(Y, X) for X, Y ∈ V.
Proof
The last three properties mean that dₖ is indeed a metric on Lₖ for k ≥ 1. In particular, note that the standard deviation is simply the 2-distance from X to its mean: sd(X) = d₂(X, E(X)) = ∥X − E(X)∥₂, and the variance is the square of this. More generally, the k th moment of X about a is simply the k th power of the k-distance from X to a. The 2-distance is especially important for reasons that will become clear below, in the discussion of inner product. This distance is
d₂(X, t) = ∥X − t∥₂ = √(E[(X − t)²]), t ∈ ℝ   (4.11.10)
For X ∈ L₂, d₂(X, t) is minimized when t = E(X) and the minimum value is sd(X).
Proof
We have seen this computation several times before. The best constant predictor of X is E(X), with mean square error var(X).
The physical interpretation of this result is that the moment of inertia of the mass distribution of X about t is minimized when
t = μ , the center of mass. Next, let us apply our procedure to the 1-distance.
We will show that d₁(X, t) is minimized when t is any median of X. (Recall that the set of medians of X forms a closed, bounded interval.) We start with a discrete case, because it's easier and has special interest.
Suppose that X ∈ L₁ has a discrete distribution with values in a finite set S ⊆ ℝ. Then d₁(X, t) is minimized when t is any median of X.
Proof
The last result shows that mean absolute error has a couple of basic deficiencies as a measure of error:
The function may not be smooth (differentiable).
The function may not have a unique minimizing value of t .
Indeed, when X does not have a unique median, there is no compelling reason to choose one value in the median interval, as the
measure of center, over any other value in the interval.
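The flat behavior of mean absolute error on the median interval is easy to see in a small discrete example. A sketch using only the standard library; the values below are arbitrary, chosen so that the median interval is [2, 3]:

```python
# Mean absolute error t ↦ E|X - t| is minimized at any median.
# Example: X uniform on {1, 2, 3, 10}; any t in [2, 3] is a median.
values = [1, 2, 3, 10]

def mae(t):
    """Mean absolute error E|X - t| for X uniform on the given values."""
    return sum(abs(v - t) for v in values) / len(values)

# The error is constant on the median interval [2, 3] and larger outside it.
print(mae(2.0), mae(2.5), mae(3.0), mae(1.0), mae(4.0))
```

This also illustrates the non-uniqueness deficiency noted above: every t in [2, 3] achieves the same minimum error.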
Convergence
Whenever we have a measure of distance, we automatically have a criterion for convergence.
Suppose that k ∈ [1, ∞), that Xₙ ∈ Lₖ for n ∈ ℕ₊, and that X ∈ Lₖ. Then Xₙ → X as n → ∞ in k th mean if
dₖ(Xₙ, X) = ∥Xₙ − X∥ₖ → 0 as n → ∞   (4.11.15)
or equivalently E(|Xₙ − X|ᵏ) → 0 as n → ∞.
When k = 1, we simply say that Xₙ → X as n → ∞ in mean; when k = 2, we say that Xₙ → X as n → ∞ in mean square.
These are the most important special cases.
Suppose that Xₙ ∈ Lₖ for n ∈ ℕ₊ and that X ∈ Lₖ, where k ∈ [1, ∞). If Xₙ → X as n → ∞ in k th mean then E(|Xₙ|ᵏ) → E(|X|ᵏ) as n → ∞.
Proof
The converse is not true; a counterexample is given below. Our next result shows that convergence in mean is stronger than
convergence in probability.
The converse is not true. That is, convergence with probability 1 does not imply convergence in k th mean; a counterexample is
given below. Also convergence in k th mean does not imply convergence with probability 1; a counterexample to this is given
below. In summary, the implications in the various modes of convergence are shown below; no other implications hold in general.
Convergence with probability 1 implies convergence in probability.
Convergence in k th mean implies convergence in j th mean if j ≤ k .
Convergence in k th mean implies convergence in probability.
Convergence in probability implies convergence in distribution.
However, the next section on uniformly integrable variables gives a condition under which convergence in probability implies
convergence in mean.
Inner Product
The vector space L₂ of real-valued random variables on (Ω, F, P) (modulo equivalence of course) with finite second moment is special, because it's the only one in which the norm corresponds to an inner product.
For X, Y ∈ L₂, define ⟨X, Y⟩ = E(XY). The following results are analogous to the basic properties of covariance, and show that this definition really does give an inner product on the vector space L₂.
For X, Y , Z ∈ L2 and a ∈ R ,
1. ⟨X, Y ⟩ = ⟨Y , X⟩ , the symmetric property.
2. ⟨X, X⟩ ≥ 0 and ⟨X, X⟩ = 0 if and only if P(X = 0) = 1 (so that X ≡ 0 ), the positive property.
3. ⟨aX, Y ⟩ = a⟨X, Y ⟩ , the scaling property.
4. ⟨X + Y , Z⟩ = ⟨X, Z⟩ + ⟨Y , Z⟩ , the additive property.
Proof
From parts (a), (c), and (d) it follows that inner product is bi-linear, that is, linear in each variable with the other fixed. Of course
bi-linearity holds for any inner product on a vector space. Covariance and correlation can easily be expressed in terms of this inner
product. The covariance of two random variables is the inner product of the corresponding centered variables. The correlation is the
inner product of the corresponding standard scores.
For X, Y ∈ L2 ,
1. cov(X, Y ) = ⟨X − E(X), Y − E(Y )⟩
2. cor(X, Y ) = ⟨[X − E(X)]/sd(X), [Y − E(Y )]/sd(Y )⟩
Proof
Thus, real-valued random variables X and Y are uncorrelated if and only if the centered variables X − E(X) and Y − E(Y) are perpendicular or orthogonal as elements of L₂.
Thus, the norm associated with the inner product is the 2-norm studied above, and corresponds to the root mean square operation
on a random variable. This fact is a fundamental reason why the 2-norm plays such a special, honored role; of all the k -norms, only
the 2-norm corresponds to an inner product. In turn, this is one of the reasons that root mean square difference is of fundamental
importance in probability and statistics. Technically, the vector space L₂ is a Hilbert space, named for David Hilbert.
The next result is Hölder's inequality, named for Otto Hölder. Suppose that j, k ∈ (1, ∞) with 1/j + 1/k = 1. For X ∈ Lⱼ and Y ∈ Lₖ,
⟨|X|, |Y|⟩ ≤ ∥X∥ⱼ ∥Y∥ₖ   (4.11.18)
Proof
In the context of the last theorem, j and k are called conjugate exponents. If we let j = k = 2 in Hölder's inequality, then we get the Cauchy-Schwarz inequality, named for Augustin Cauchy and Karl Schwarz: For X, Y ∈ L₂,
E(|X| |Y|) ≤ √E(X²) √E(Y²)   (4.11.21)
In turn, the Cauchy-Schwarz inequality is equivalent to the basic inequalities for covariance and correlation: For X, Y ∈ L₂,
|cov(X, Y)| ≤ sd(X) sd(Y), |cor(X, Y)| ≤ 1   (4.11.22)
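The Cauchy-Schwarz inequality can be checked on simulated data, where it holds exactly for the empirical distribution. A sketch assuming NumPy; the seed, sample size, and the particular dependence between the variables are arbitrary choices:

```python
import numpy as np

# Check E(|X||Y|) <= sqrt(E(X^2)) sqrt(E(Y^2)) on simulated correlated data.
rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # correlated with x

lhs = np.mean(np.abs(x) * np.abs(y))
rhs = np.sqrt(np.mean(x**2)) * np.sqrt(np.mean(y**2))
print(lhs, rhs)   # lhs <= rhs
```

The inequality here is exact rather than approximate: it is Cauchy-Schwarz applied to the empirical measure, so it holds for every sample.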
The following result is equivalent to the identity var(X + Y) + var(X − Y) = 2[var(X) + var(Y)] that we studied in the
section on covariance and correlation. In the context of vector spaces, the result is known as the parallelogram rule:
If X, Y ∈ L₂ then
∥X + Y∥₂² + ∥X − Y∥₂² = 2∥X∥₂² + 2∥Y∥₂²   (4.11.23)
Proof
The following result is equivalent to the statement that the variance of the sum of uncorrelated variables is the sum of the variances,
which again we proved in the section on covariance and correlation. In the context of vector spaces, the result is the famous
Pythagorean theorem, named for Pythagoras of course.
Proof
Projections
The best linear predictor studied in the section on covariance and correlation, and the conditional expected values studied above, have nice interpretations in terms of projections onto subspaces of L₂. First let's review the concepts. Recall that U is a subspace of L₂ if U ⊆ L₂ and U is also a vector space (under the same operations of addition and scalar multiplication). To show that U ⊆ L₂ is a subspace, we just need to show the closure properties (the other axioms of a vector space are inherited).
If U , V ∈ U then U + V ∈ U .
If U ∈ U and c ∈ R then cU ∈ U .
Suppose now that U is a subspace of L₂ and that X ∈ L₂. Then the projection of X onto U (if it exists) is the vector V ∈ U such that X − V is perpendicular to U, that is, ⟨X − V, U⟩ = 0 for every U ∈ U.
The projection has two critical properties: It is unique (if it exists) and it is the vector in U closest to X. If you look at the proofs of
these results, you will see that they are essentially the same as the ones used for the best predictors of X mentioned at the
beginning of this subsection. Moreover, the proofs use only vector space concepts—the fact that our vectors are random variables
on a probability space plays no special role.
1. ∥X − V∥₂² ≤ ∥X − U∥₂² for all U ∈ U.
2. Equality holds in (a) if and only if U ≡V
Proof
1.
Proof
Here is the meaning of the predictor in the context of our vector spaces.
Proof
The previous result is actually just the random variable version of the standard formula for the projection of a vector onto a space spanned by two other vectors. Note that 1 is a unit vector and that X₀ = X − E(X) = X − ⟨X, 1⟩1 is perpendicular to 1. Thus,
L(Y ∣ X) = ⟨Y, 1⟩1 + [⟨Y, X₀⟩ / ⟨X₀, X₀⟩] X₀   (4.11.34)
Proof
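The projection formula can be illustrated numerically: the residual of L(Y ∣ X) is orthogonal to both 1 and X. A sketch assuming NumPy; the linear model, noise, and seed are arbitrary choices:

```python
import numpy as np

# The best linear predictor L(Y|X) = E(Y) + [cov(X, Y)/var(X)] (X - E(X))
# is the projection of Y onto span{1, X}; the residual Y - L(Y|X) is
# uncorrelated with X.
rng = np.random.default_rng(4)
x = rng.normal(size=200_000)
y = 2.0 + 3.0 * x + rng.normal(size=200_000)

slope = np.cov(x, y, bias=True)[0, 1] / x.var()
pred = y.mean() + slope * (x - x.mean())
resid = y - pred

# For the empirical distribution, the residual has mean zero and zero
# covariance with x, up to floating-point rounding.
print(resid.mean(), np.cov(x, resid, bias=True)[0, 1])
```

The orthogonality here is exact for the empirical measure (it is the normal-equations identity), not merely approximate for large samples.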
But remember that E(X ∣ G) is defined more generally for X ∈ L₁(F). Our final result in this discussion concerns convergence.
2. If Xₙ ∈ Lₖ(F) for n ∈ ℕ₊, X ∈ Lₖ(F), and Xₙ → X as n → ∞ in Lₖ(F) then E(Xₙ ∣ G) → E(X ∣ G) as n → ∞ in Lₖ(G).
Proof
In the error function app, select the root mean square error function. Click on the x-axis to generate an empirical distribution,
and note the shape and location of the graph of the error function.
In the error function app, select the mean absolute error function. Click on the x-axis to generate an empirical distribution, and
note the shape and location of the graph of the error function.
Computational Exercises
Suppose that X is uniformly distributed on the interval [0, 1].
1. Find ∥X∥ₖ for k ∈ [1, ∞).
3. Find lim_{k→∞} ∥X∥ₖ.
Answer
Suppose that X has probability density function f(x) = a / x^{a+1} for 1 ≤ x < ∞, where a > 0 is a parameter. Thus, X has the Pareto distribution with shape parameter a.
1. Find ∥X∥ₖ for k ∈ [1, ∞).
Answer
Suppose that (X, Y ) has probability density function f (x, y) = x + y for 0 ≤x ≤1 , 0 ≤y ≤1 . Verify Minkowski's
inequality.
Answer
Let X be an indicator random variable with P(X = 1) = p, where 0 ≤ p ≤ 1. Graph E(|X − t|) as a function of t ∈ ℝ in each of the cases below. In each case, find the minimum value of the function and the values of t where the minimum occurs.
1. p < 1/2
2. p = 1/2
3. p > 1/2
Answer
Suppose that X is uniformly distributed on the interval [0, 1]. Find d₁(X, t) = E(|X − t|) as a function of t and sketch the graph. Find the minimum value of the function and the value of t where the minimum occurs.
Suppose that X is uniformly distributed on the set [0, 1] ∪ [2, 3]. Find d₁(X, t) = E(|X − t|) as a function of t and sketch the graph. Find the minimum value of the function and the values of t where the minimum occurs.
Suppose that (X, Y ) has probability density function f (x, y) = x + y for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 . Verify Hölder's inequality in
the following cases:
1. j = k = 2
2. j = 3, k = 3/2
Answer
Counterexamples
The following exercise shows that convergence with probability 1 does not imply convergence in mean.
Suppose that Xₙ has the distribution
P(Xₙ = n³) = 1/n², P(Xₙ = 0) = 1 − 1/n²; n ∈ ℕ₊   (4.11.38)
Then
1. Xₙ → 0 as n → ∞ with probability 1.
2. Xₙ → 0 as n → ∞ in probability.
3. E(Xₙ) → ∞ as n → ∞.
Proof
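The counterexample can be made concrete with a short computation: the means diverge, while the convergence of ∑ 1/n² (so that Borel-Cantelli applies) gives convergence with probability 1. A sketch using only the standard library:

```python
# The counterexample: P(X_n = n^3) = 1/n^2, P(X_n = 0) = 1 - 1/n^2.
# Since sum(1/n^2) < inf, Borel-Cantelli gives X_n -> 0 with probability 1,
# yet E(X_n) = n^3 / n^2 = n -> inf, so X_n does not converge to 0 in mean.
means = [n**3 / n**2 for n in range(1, 11)]
tail_prob_sum = sum(1 / n**2 for n in range(1, 10_000))

print(means)            # 1, 2, 3, ..., 10: the means diverge
print(tail_prob_sum)    # roughly pi^2/6, finite, so Borel-Cantelli applies
```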
The following exercise shows that convergence in mean does not imply convergence with probability 1.
4. Xₙ → 0 as n → ∞ in k th mean for every k ≥ 1.
Proof
The following exercise shows that convergence of the k th means does not imply convergence in k th mean. Suppose that U is an indicator variable with P(U = 1) = P(U = 0) = 1/2. Let Xₙ = U for n ∈ ℕ₊ and let X = 1 − U. Let k ∈ [1, ∞). Then
1. E(Xₙᵏ) → E(Xᵏ) as n → ∞.
2. E(|Xₙ − X|ᵏ) = 1 for each n ∈ ℕ₊, so Xₙ does not converge to X in k th mean.
Proof
This page titled 4.11: Vector Spaces of Random Variables is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
4.12: Uniformly Integrable Variables
Two of the most important modes of convergence in probability theory are convergence with probability 1 and convergence in
mean. As we have noted several times, neither mode of convergence implies the other. However, if we impose an additional
condition on the sequence of variables, convergence with probability 1 will imply convergence in mean. The purpose of this
brief but advanced section is to explore the additional condition that is needed. This section is particularly important for the theory of
martingales.
Basic Theory
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So Ω is the set of outcomes, F is
the σ-algebra of events, and P is the probability measure on the sample space (Ω, F ). In this section, all random variables that are
mentioned are assumed to be real valued, unless otherwise noted. Next, recall from the section on vector spaces that for
k ∈ [1, ∞), L_k is the vector space of random variables X with E(|X|^k) < ∞, endowed with the norm ∥X∥_k = [E(|X|^k)]^{1/k}. In
particular, X ∈ L_1 simply means that E(|X|) < ∞ so that E(X) exists as a real number. From the section on expected value as an
integral, recall the following notation, assuming of course that the expected value makes sense:
Definition
For a random variable X and an event A, E(X; A) = E(X 1_A), where 1_A is the indicator variable of A.
The following result is motivation for the main definition in this section.

If X is a random variable with E(|X|) < ∞, then E(|X|; |X| > x) → 0 as x → ∞.
Suppose now that X_i is a random variable for each i in a nonempty index set I (not necessarily countable). The critical definition for this section is to require the convergence in the previous theorem to hold uniformly for the collection of random variables X = {X_i : i ∈ I}.
The collection X = {X_i : i ∈ I} is uniformly integrable if for each ϵ > 0 there exists x > 0 such that for all i ∈ I,
E(|X_i|; |X_i| > x) < ϵ  (4.12.3)
Properties
Our next discussion centers on conditions that ensure that the collection of random variables X = {X_i : i ∈ I} is uniformly integrable. Here is an equivalent characterization:
The collection X = {X_i : i ∈ I} is uniformly integrable if and only if the following conditions hold:
1. {E(|X_i|) : i ∈ I} is bounded.
2. For each ϵ > 0 there exists δ > 0 such that if A ∈ F and P(A) < δ then E(|X_i|; A) < ϵ for all i ∈ I.
Proof
Condition 1 means that X is bounded (in norm) as a subset of the vector space L_1. Trivially, a finite collection of integrable random variables is uniformly integrable.
Suppose that I is finite and that E(|X_i|) < ∞ for each i ∈ I. Then X = {X_i : i ∈ I} is uniformly integrable.
4.12.1 https://stats.libretexts.org/@go/page/10340
If the random variables in the collection are dominated in absolute value by a random variable with finite mean, then the collection
is uniformly integrable.
Suppose that Y is a nonnegative random variable with E(Y) < ∞ and that |X_i| ≤ Y for each i ∈ I. Then X = {X_i : i ∈ I} is uniformly integrable.
Proof
The following result is more general, but essentially the same proof works.
Suppose that Y = {Y_j : j ∈ J} is uniformly integrable, and X = {X_i : i ∈ I} is a set of variables with the property that for each i ∈ I there exists j ∈ J such that |X_i| ≤ |Y_j|. Then X is uniformly integrable.
As a simple corollary, if the variables are bounded in absolute value then the collection is uniformly integrable.
If there exists c > 0 such that |X_i| ≤ c for all i ∈ I then X = {X_i : i ∈ I} is uniformly integrable.
Just having E(|X_i|) bounded in i ∈ I (condition 1 in the characterization above) is not sufficient for X = {X_i : i ∈ I} to be uniformly integrable; a counterexample is given below. However, if E(|X_i|^k) is bounded in i ∈ I for some k > 1, then X is uniformly integrable. This condition means that X is bounded (in norm) as a subset of the vector space L_k.
If {E(|X_i|^k) : i ∈ I} is bounded for some k > 1, then {X_i : i ∈ I} is uniformly integrable.
Proof
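The idea behind the proof is the pointwise bound |X| ≤ |X|^k / x^{k−1} on the event {|X| > x}, which gives E(|X|; |X| > x) ≤ E(|X|^k)/x^{k−1}, uniformly over any family with bounded kth moments. A small exact-arithmetic sketch, using an arbitrary discrete distribution chosen here only for illustration (not taken from the text):

```python
from fractions import Fraction

# An arbitrary discrete random variable (for illustration only):
dist = {Fraction(1): Fraction(1, 2), Fraction(5): Fraction(3, 10), Fraction(20): Fraction(1, 5)}
assert sum(dist.values()) == 1

def tail_mean(x):
    """E(|X|; |X| > x)"""
    return sum(v * p for v, p in dist.items() if v > x)

def moment(k):
    """E(|X|^k)"""
    return sum(v ** k * p for v, p in dist.items())

# On {|X| > x} we have |X| <= |X|^k / x^(k-1), so
# E(|X|; |X| > x) <= E(|X|^k) / x^(k-1), which tends to 0 as x -> infinity.
k = 2
for x in [2, 4, 10, 50]:
    assert tail_mean(x) <= moment(k) / Fraction(x) ** (k - 1)
```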
Uniform integrability is closed under the operations of addition and scalar multiplication.
Suppose that X = {X_i : i ∈ I} and Y = {Y_i : i ∈ I} are uniformly integrable and that c ∈ R. Then each of the following collections is also uniformly integrable:
1. X + Y = {X_i + Y_i : i ∈ I}
2. cX = {cX_i : i ∈ I}
Proof
The following corollary is trivial, but will be needed in our discussion of convergence below.
Suppose that {X_i : i ∈ I} is uniformly integrable and that X is a random variable with E(|X|) < ∞. Then {X_i − X : i ∈ I} is uniformly integrable.
Proof
Convergence
We now come to the main results, and the reason for the definition of uniform integrability in the first place. To set up the notation, suppose that X_n is a random variable for n ∈ N_+ and that X is a random variable. We know that if X_n → X as n → ∞ in mean then X_n → X as n → ∞ in probability. The converse is also true if and only if the sequence is uniformly integrable. Here is the first half:
Here is the more important half, known as the uniform integrability theorem:
Examples
Our first example shows that bounded L_1 norm is not sufficient for uniform integrability.
Suppose that U is uniformly distributed on the interval (0, 1) (so U has the standard uniform distribution). For n ∈ N_+, let X_n = n 1(U ≤ 1/n). Then
1. E(X_n) = 1 for each n ∈ N_+, so {E(X_n) : n ∈ N_+} is bounded.
2. E(X_n; X_n > x) = 1 whenever n > x, so X = {X_n : n ∈ N_+} is not uniformly integrable.
Recall that if G is a sub σ-algebra of F, then the conditional expected value E(X ∣ G) is the G-measurable random variable closest to X in a sense: if X ∈ L_2(F) then E(X ∣ G) is the projection of X onto L_2(G).
Suppose that X is a real-valued random variable with E(|X|) < ∞ . Then {E(X ∣ G ) : G is a sub σ-algebra of F } is
uniformly integrable.
Proof
Note that the collection of sub σ-algebras of F , and so also the collection of conditional expected values above, might well be
uncountable. The conditional expected values range from E(X), when G = {Ω, ∅} to X itself, when G = F .
This page titled 4.12: Uniformly Integrable Variables is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
4.13: Kernels and Operators
The goal of this section is to study a type of mathematical object that arises naturally in the context of conditional expected value
and parametric distributions, and is of fundamental importance in the study of stochastic processes, particularly Markov processes.
In a sense, the main object of study in this section is a generalization of a matrix, and the operations generalizations of matrix
operations. If you keep this in mind, this section may seem less abstract.
Basic Theory
Definitions
Recall that a measurable space (S, S ) consists of a set S and a σ-algebra S of subsets of S . If μ is a positive measure on
(S, S ), then (S, S , μ) is a measure space. The two most important special cases that we have studied frequently are
1. Discrete: S is countable, S = P(S) is the collection of all subsets of S , and μ = # is counting measure on (S, S ).
2. Euclidean: S is a measurable subset of R^n for some n ∈ N_+, S is the collection of subsets of S that are also measurable, and μ = λ_n is n-dimensional Lebesgue measure on (S, S).
More generally, S usually comes with a topology that is locally compact, Hausdorff, with a countable base (LCCB), and S is the
Borel σ-algebra, the σ-algebra generated by the topology (the collection of open subsets of S ). The measure μ is usually a Borel
measure, and so satisfies μ(C ) < ∞ if C ⊆ S is compact. A discrete measure space is of this type, corresponding to the discrete
topology. A Euclidean measure space is also of this type, corresponding to the Euclidean topology, if S is open or closed (which is
usually the case). In the discrete case, every function from S to another measurable space is measurable, and every function
from S to another topological space is continuous, so the measure theory is not really necessary.
Recall also that the measure space (S, S, μ) is σ-finite if there exists a countable collection {A_i : i ∈ I} ⊆ S such that S = ⋃_{i ∈ I} A_i and the measure of each A_i is finite.
If f : S → R is measurable, define ∥f ∥ = sup{|f (x)| : x ∈ S}. Of course we may well have ∥f ∥ = ∞. Let B(S) denote the
collection of bounded measurable functions f : S → R . Under the usual operations of pointwise addition and scalar multiplication,
B(S) is a vector space, and ∥ ⋅ ∥ is the natural norm on this space, known as the supremum norm. This vector space plays an
important role.
In this section, it is sometimes more natural to write integrals with respect to the positive measure μ with the differential before the
integrand, rather than after. However, rest assured that this is mere notation, the meaning of the integral is the same. So if
f : S → R is measurable then we may write the integral of f with respect to μ in operator notation as

μf = ∫_S μ(dx) f(x)

assuming, as usual, that the integral exists. This will be the case if f is nonnegative, although ∞ is a possible value. More generally, the integral exists in R ∪ {−∞, ∞} if μf_+ < ∞ or μf_− < ∞, where f_+ and f_− are the positive and negative parts of f. If both are finite, the integral exists in R (and f is integrable with respect to μ). If μ is a probability measure and we think of
(S, S ) as the sample space of a random experiment, then we can think of f as a real-valued random variable, in which case our
new notation is not too far from our traditional expected value E(f ). Our main definition comes next.
Suppose that (S, S ) and (T , T ) are measurable spaces. A kernel from (S, S ) to (T , T ) is a function K : S × T → [0, ∞]
such that
1. x ↦ K(x, A) is a measurable function from S into [0, ∞] for each A ∈ T .
2. A ↦ K(x, A) is a positive measure on T for each x ∈ S .
If (T , T ) = (S, S ) , then K is said to be a kernel on (S, S ).
4.13.1 https://stats.libretexts.org/@go/page/10499
1. K is σ-finite if the measure K(x, ⋅) is σ-finite for every x ∈ S .
2. K is finite if K(x, T ) < ∞ for every x ∈ S .
3. K is bounded if K(x, T ) is bounded in x ∈ S .
4. K is a probability kernel if K(x, T ) = 1 for every x ∈ S .
Define ∥K∥ = sup{K(x, T ) : x ∈ S} , so that ∥K∥ < ∞ if K is a bounded kernel and ∥K∥ = 1 if K is a probability kernel.
So a probability kernel is bounded, a bounded kernel is finite, and a finite kernel is σ-finite. The terms stochastic kernel and
Markov kernel are also used for probability kernels, and for a probability kernel ∥K∥ = 1 of course. The terms are consistent with
terms used for measures: K is a finite kernel if and only if K(x, ⋅) is a finite measure for each x ∈ S , and K is a probability kernel
if and only if K(x, ⋅) is a probability measure for each x ∈ S . Note that ∥K∥ is simply the supremum norm of the function
x ↦ K(x, T ) .
A kernel defines two natural integral operators, by operating on the left with measures, and by operating on the right with functions. As usual, we are often a bit casual with the question of existence. Basically in this section, we assume that any integrals mentioned exist.
1. If μ is a positive measure on (S, S), then μK defined as follows is a positive measure on (T, T):
μK(A) = ∫_S μ(dx) K(x, A), A ∈ T
2. If f : T → R is measurable, then Kf : S → R defined as follows is measurable (assuming that the integrals exist in R):
Kf(x) = ∫_T K(x, dy) f(y), x ∈ S
Proof
Thus, a kernel transforms measures on (S, S ) into measures on (T , T ), and transforms certain measurable functions from T to R
into measurable functions from S to R. Again, part 2 assumes that f is integrable with respect to the measure K(x, ⋅) for every x ∈ S. In particular, this will hold if f is bounded and K is a finite kernel.
The identity kernel I on the measurable space (S, S ) is defined by I (x, A) = 1(x ∈ A) for x ∈ S and A ∈ S .
Thus, I (x, A) = 1 if x ∈ A and I (x, A) = 0 if x ∉ A . So x ↦ I (x, A) is the indicator function of A ∈ S , while A ↦ I (x, A)
is point mass at x ∈ S . Clearly the identity kernel is a probability kernel. If we need to indicate the dependence on the particular
space, we will add a subscript. The following result justifies the name: if μ is a positive measure on (S, S) and f : S → R is measurable, then μI = μ and If = f (assuming the integrals exist).
Constructions
We can create a new kernel from two given kernels, by the usual operations of addition and scalar multiplication.
Suppose that K and L are kernels from (S, S ) to (T , T ), and that c ∈ [0, ∞). Then cK and K + L defined below are also
kernels from (S, S ) to (T , T ).
1. (cK)(x, A) = cK(x, A) for x ∈ S and A ∈ T .
2. (K + L)(x, A) = K(x, A) + L(x, A) for x ∈ S and A ∈ T .
If K and L are σ-finite (finite) (bounded) then cK and K + L are σ-finite (finite) (bounded), respectively.
Proof
A simple corollary of the last result is that if a, b ∈ [0, ∞) then aK + bL is a kernel from (S, S) to (T, T). In particular, if K, L are probability kernels and p ∈ (0, 1) then pK + (1 − p)L is a probability kernel. A more interesting and important way to form a new kernel from two given kernels is via a “multiplication” operation.
Suppose that K is a kernel from (R, R) to (S, S) and that L is a kernel from (S, S) to (T, T). Then KL defined as follows is a kernel from (R, R) to (T, T):
KL(x, A) = ∫_S K(x, dy) L(y, A), x ∈ R, A ∈ T
In particular, the identity kernel acts as an identity for this product: for a kernel K from (S, S) to (T, T),
1. I_S K = K
2. K I_T = K
The next several results show that the operations are associative whenever they make sense.
Suppose that K is a kernel from (S, S ) to (T , T ), μ is a positive measure on S , c ∈ [0, ∞), and f : T → R is measurable.
Then, assuming that the appropriate integrals exist,
1. c(μK) = (cμ)K
2. c(Kf ) = (cK)f
3. (μK)f = μ(Kf )
Proof
Suppose that K is a kernel from (R, R) to (S, S ) and L is a kernel from (S, S ) to (T , T ). Suppose also that μ is a positive
measure on (R, R) , f : T → R is measurable, and c ∈ [0, ∞). Then, assuming that the appropriate integrals exist,
1. (μK)L = μ(KL)
2. K(Lf ) = (KL)f
3. c(KL) = (cK)L
Proof
Suppose that K is a kernel from (R, R) to (S, S ), L is a kernel from (S, S ) to (T , T ), and M is a kernel from (T , T ) to
(U , U ) . Then (KL)M = K(LM ) .
Proof
The next several results show that the distributive property holds whenever the operations make sense.
Suppose that K and L are kernels from (R, R) to (S, S ) and that M and N are kernels from (S, S ) to (T , T ). Suppose
also that μ is a positive measure on (R, R) and that f : S → R is measurable. Then, assuming that the appropriate integrals
exist,
1. (K + L)M = KM + LM
2. K(M + N ) = KM + KN
3. μ(K + L) = μK + μL
4. (K + L)f = Kf + Lf
Suppose that K is a kernel from (S, S) to (T, T), that μ and ν are positive measures on (S, S), and that f and g are measurable functions from T to R. Then, assuming that the appropriate integrals exist,
1. (μ + ν )K = μK + ν K
2. K(f + g) = Kf + Kg
3. μ(f + g) = μf + μg
4. (μ + ν )f = μf + ν f
In particular, note that if K is a kernel from (S, S ) to (T , T ), then the transformation μ ↦ μK defined for positive measures on
(S, S ), and the transformation f ↦ Kf defined for measurable functions f : T → R (for which Kf exists), are both linear
operators. If μ is a positive measure on (S, S ), then the integral operator f ↦ μf defined for measurable f : S → R (for which
μf exists) is also linear, but of course, we already knew that. Finally, note that the operator f ↦ Kf is positive: if f ≥ 0 then
Kf ≥ 0 . Here is the important summary of our results when the kernel is bounded.
If K is a bounded kernel from (S, S ) to (T , T ), then f ↦ Kf is a bounded, linear transformation from B(T ) to B(S) and
∥K∥ is the norm of the transformation.
The commutative property for the product of kernels fails with a passion. If K and L are kernels, then depending on the measurable
spaces, KL may be well defined, but not LK . Even if both products are defined, they may be kernels from or to different
measurable spaces. Even if both are defined from and to the same measurable spaces, it may well happen that KL ≠ LK . Some
examples are given below
If K is a kernel on (S, S) and n ∈ N, we let K^n = K K ⋯ K, the n-fold power of K. By convention, K^0 = I, the identity kernel on S.
Fixed points of the operators associated with a kernel turn out to be very important.
Suppose that K is a kernel on (S, S). A positive measure μ on (S, S) is invariant for K if μK = μ. A measurable function f : S → R is invariant for K if Kf = f.
So in the language of linear algebra (or functional analysis), an invariant measure is a left eigenvector of the kernel, while an
invariant function is a right eigenvector of the kernel, both corresponding to the eigenvalue 1. By our results above, if μ and ν are
invariant measures and c ∈ [0, ∞), then μ + ν and cμ are also invariant. Similarly, if f and g are invariant functions and c ∈ R, then f + g and cf are also invariant.
Of course we are particularly interested in probability kernels.
Suppose that P is a probability kernel from (R, R) to (S, S ) and that Q is a probability kernel from (S, S ) to (T , T ) .
Suppose also that μ is a probability measure on (R, R) . Then
1. P Q is a probability kernel from (R, R) to (T , T ).
2. μP is a probability measure on (S, S ).
Proof
The operators associated with a kernel are of fundamental importance, and we can easily recover the kernel from the operators.
Suppose that K is a kernel from (S, S) to (T, T), and let x ∈ S and A ∈ T. Then trivially, K1_A(x) = K(x, A) where as usual, 1_A is the indicator function of A. Trivially also δ_x K(A) = K(x, A), where δ_x is point mass at x.
Kernel Functions
Usually our measurable spaces are in fact measure spaces, with natural measures associated with the spaces, as in the special cases
described in (1). When we start with measure spaces, kernels are usually constructed from density functions in much the same way
that positive measures are defined from density functions.
Suppose that (S, S, λ) and (T, T, μ) are measure spaces. As usual, S × T is given the product σ-algebra S ⊗ T. If k : S × T → [0, ∞) is measurable, then the function K defined as follows is a kernel from (S, S) to (T, T):
K(x, A) = ∫_A k(x, y) μ(dy), x ∈ S, A ∈ T
Proof
Clearly the kernel K depends on the positive measure μ on (T , T ) as well as the function k , while the measure λ on (S, S ) plays
no role (and so is not even necessary). But again, our point of view is that the spaces have fixed, natural measures. Appropriately
enough, the function k is called a kernel density function (with respect to μ ), or simply a kernel function.
Suppose again that (S, S, λ) and (T, T, μ) are measure spaces. Suppose also that K is a kernel from (S, S) to (T, T) with kernel function k. If f : T → R is measurable, then, assuming that the integrals exist,
Kf(x) = ∫_T k(x, y) f(y) μ(dy), x ∈ S
Proof
A kernel function defines an operator on the left with functions on S in a completely analogous way to the operator on the right
above with functions on T .
Suppose again that (S, S, λ) and (T, T, μ) are measure spaces, and that K is a kernel from (S, S) to (T, T) with kernel function k. If f : S → R is measurable, then the function fK : T → R defined as follows is also measurable, assuming that the integrals exist:
fK(y) = ∫_S λ(dx) f(x) k(x, y), y ∈ T
The operator defined above depends on the measure λ on (S, S) as well as the kernel function k, while the measure μ on (T, T) plays no role (and so is not even necessary). But again, our point of view is that the spaces have fixed, natural measures. Here is how our new operation on the left with functions relates to our old operation on the left with measures.
how our new operation on the left with functions relates to our old operation on the left with measures.
Suppose again that (S, S , λ) and (T , T , μ) are measure spaces, and that K is a kernel from (S, S ) to (T , T ) with kernel
function k . Suppose also that f : S → [0, ∞) is measurable, and let ρ denote the measure on (S, S ) that has density f with
respect to λ . Then f K is the density of the measure ρK with respect to μ .
Proof
As always, we are particularly interested in stochastic kernels. With a kernel function, we can have doubly stochastic kernels.
Suppose again that (S, S, λ) and (T, T, μ) are measure spaces and that k : S × T → [0, ∞) is measurable. Then k is a doubly stochastic kernel function if
1. ∫_T k(x, y) μ(dy) = 1 for x ∈ S
2. ∫_S λ(dx) k(x, y) = 1 for y ∈ T
Of course, condition 1 simply means that the kernel associated with k is a stochastic kernel according to our original definition.
The most common and important special case is when the two spaces are the same. Thus, if (S, S, λ) is a measure space and k : S × S → [0, ∞) is measurable, then we have an operator K that operates on the left and on the right with measurable functions f : S → R:
fK(y) = ∫_S λ(dx) f(x) k(x, y),  Kf(x) = ∫_S k(x, y) f(y) λ(dy)
If f is nonnegative and μ is the measure on (S, S) with density function f, then fK is the density function of the measure μK (both with respect to λ).
Suppose again that (S, S, λ) is a measure space and k : S × S → [0, ∞) is measurable. Then k is symmetric if k(x, y) = k(y, x) for all (x, y) ∈ S².
Of course, if k is a symmetric, stochastic kernel function on (S, S, λ) then k is doubly stochastic, but the converse is not true.
Suppose that (R, R, λ), (S, S, μ), and (T, T, ρ) are measure spaces. Suppose also that K is a kernel from (R, R) to (S, S) with kernel function k, and that L is a kernel from (S, S) to (T, T) with kernel function l. Then the kernel KL from (R, R) to (T, T) has kernel function kl given by
kl(x, z) = ∫_S k(x, y) l(y, z) μ(dy), (x, z) ∈ R × T
Proof
In the discrete case, where S and T are countable and are given counting measure, a kernel K from S to T is determined by its values on singletons: K(x, y) = K(x, {y}), since K(x, A) = Σ_{y ∈ A} K(x, y) for x ∈ S and A ⊆ T.
The function (x, y) ↦ K(x, y) is simply the kernel function of the kernel K, as defined above, but in this case we usually don't bother with using a different symbol for the function as opposed to the kernel. The function K can be thought of as a matrix, with rows indexed by S and columns indexed by T (and so an infinite matrix if S or T is countably infinite). With this interpretation, all of the operations defined above can be thought of as matrix operations. If f : T → R and f is thought of as a column vector indexed by T, then Kf is simply the ordinary product of the matrix K and the vector f; the product is a column vector indexed by S:
Kf(x) = Σ_{y ∈ T} K(x, y) f(y), x ∈ S
Similarly, if f : S → R and f is thought of as a row vector indexed by S, then fK is simply the ordinary product of the vector f and the matrix K; the product is a row vector indexed by T:
fK(y) = Σ_{x ∈ S} f(x) K(x, y), y ∈ T
If L is another kernel from T to another discrete space U, then as functions, KL is simply the matrix product of K and L:
KL(x, z) = Σ_{y ∈ T} K(x, y) L(y, z), x ∈ S, z ∈ U
Let S = {1, 2, 3} and T = {1, 2, 3, 4}. Define the kernel K from S to T by K(x, y) = x + y for (x, y) ∈ S × T. Define the function f on S by f(x) = x! for x ∈ S, and define the function g on T by g(y) = y² for y ∈ T. Compute each of the following using matrix algebra:
1. Kg
2. fK
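Since the spaces here are finite, the computations are ordinary matrix arithmetic. The following sketch (illustrative, not part of the text) computes the two natural products for this kernel: Kg on the right and fK on the left:

```python
import math

S = [1, 2, 3]
T = [1, 2, 3, 4]

# The kernel K(x, y) = x + y as a matrix: rows indexed by S, columns by T.
K = {(x, y): x + y for x in S for y in T}

f = {x: math.factorial(x) for x in S}  # f(x) = x!, a row vector on S
g = {y: y ** 2 for y in T}             # g(y) = y^2, a column vector on T

# Right operation: (Kg)(x) = sum over y in T of K(x, y) g(y), a function on S.
Kg = {x: sum(K[x, y] * g[y] for y in T) for x in S}

# Left operation: (fK)(y) = sum over x in S of f(x) K(x, y), a function on T.
fK = {y: sum(f[x] * K[x, y] for x in S) for y in T}

print(Kg)  # {1: 130, 2: 160, 3: 190}
print(fK)  # {1: 32, 2: 41, 3: 50, 4: 59}
```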
Let R = {0, 1} , S = {a, b} , and T = {1, 2, 3}. Define the kernel K from R to S , the kernel L from S to S and the kernel M
from S to T in matrix form as follows:
K = [ 1 4 ]   L = [ 2 2 ]   M = [ 1 0 2 ]   (4.13.20)
    [ 2 3 ]       [ 1 5 ]       [ 0 3 1 ]
Compute each of the following kernels, or explain why the operation does not make sense:
1. KL
2. LK
3. K 2
4. L2
5. KM
6. LM
Proof
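The products here can be checked mechanically as matrix multiplications. In the sketch below (illustrative, not part of the text), the matrices are taken as read from the display above; note that LK and K² are undefined because the spaces do not match up:

```python
def matmul(A, B):
    """Kernel product as matrix multiplication: (AB)(x, z) = sum_y A(x, y) B(y, z)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

K = [[1, 4], [2, 3]]        # kernel from R = {0, 1} to S = {a, b}
L = [[2, 2], [1, 5]]        # kernel from S to S
M = [[1, 0, 2], [0, 3, 1]]  # kernel from S to T = {1, 2, 3}

print(matmul(K, L))  # KL, from R to S: [[6, 22], [7, 19]]
print(matmul(L, L))  # L^2, from S to S: [[6, 14], [7, 27]]
print(matmul(K, M))  # KM, from R to T: [[1, 12, 6], [2, 9, 7]]
print(matmul(L, M))  # LM, from S to T: [[2, 6, 6], [1, 15, 7]]
# LK and K^2 are undefined: K maps R into S, so it cannot follow a kernel
# that ends at S (it starts at R), and it cannot follow itself since R != S.
```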
Conditional Probability
An important class of probability kernels arises from the distribution of one random variable, conditioned on the value of another
random variable. In this subsection, suppose that (Ω, F , P) is a probability space, and that (S, S ) and (T , T ) are measurable
spaces. Further, suppose that X and Y are random variables defined on the probability space, with X taking values in S and Y taking values in T. Informally, X and Y are random variables defined on the same underlying random experiment.
The function P defined as follows is a probability kernel from (S, S ) to (T , T ), known as the conditional probability kernel
of Y given X.
P (x, A) = P(Y ∈ A ∣ X = x), x ∈ S, A ∈ T (4.13.25)
Proof
As in the general discussion above, the measurable spaces (S, S ) and (T , T ) are usually measure spaces with natural measures
attached. So the conditional probability distributions are often given via conditional probability density functions, which then play
the role of kernel functions. The next two exercises give examples.
Suppose that X and Y are random variables for an experiment, taking values in R. For x ∈ R, the conditional distribution of
Y given X = x is normal with mean x and standard deviation 1. Use the notation and operations of this section for the
following computations:
1. Give the kernel function for the conditional distribution of Y given X.
2. Find E(Y² ∣ X = x).
3. Suppose that X has the standard normal distribution. Find the probability density function of Y.
Answer
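A simulation sketch (illustrative, not part of the text) is consistent with the standard facts here: E(Y² ∣ X = x) = 1 + x², and when X is standard normal, Y = X + Z with Z ~ N(0, 1) independent, so Y is normal with mean 0 and variance 2:

```python
import random

random.seed(42)
N = 200_000

def sample_Y_given(x):
    """Draw Y from the conditional kernel: normal with mean x, standard deviation 1."""
    return random.gauss(x, 1)

# E(Y^2 | X = x) = Var(Y | X = x) + [E(Y | X = x)]^2 = 1 + x^2; check at x = 2:
x = 2.0
m2 = sum(sample_Y_given(x) ** 2 for _ in range(N)) / N
assert abs(m2 - (1 + x ** 2)) < 0.05

# If X is standard normal, Y = X + Z with Z ~ N(0, 1) independent,
# so Y is normal with mean 0 and variance 2:
ys = [sample_Y_given(random.gauss(0, 1)) for _ in range(N)]
assert abs(sum(ys) / N) < 0.02
assert abs(sum(y * y for y in ys) / N - 2) < 0.05
```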
Suppose that X and Y are random variables for an experiment, with X taking values in {a, b, c} and Y taking values in {1, 2, 3, 4}. The kernel function of Y given X is as follows: P(a, y) = 1/4, P(b, y) = y/10, and P(c, y) = y²/30, each for y ∈ {1, 2, 3, 4}.
1. Give the kernel P in matrix form and verify that it is a probability kernel.
2. Find f P where f (a) = f (b) = f (c) = 1/3 . The result is the density function of Y given that X is uniformly distributed.
3. Find P g where g(y) = y for y ∈ {1, 2, 3, 4}. The resulting function is E(Y ∣ X = x) for x ∈ {a, b, c}.
Answer
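Since the spaces are finite, everything in this exercise can be computed exactly; a sketch (illustrative, not part of the text):

```python
from fractions import Fraction as F

S = ['a', 'b', 'c']
T = [1, 2, 3, 4]

# Conditional kernel of Y given X, in matrix form (rows: S, columns: T).
P = {('a', y): F(1, 4) for y in T}
P.update({('b', y): F(y, 10) for y in T})
P.update({('c', y): F(y ** 2, 30) for y in T})

# A probability kernel: each row sums to 1.
assert all(sum(P[x, y] for y in T) == 1 for x in S)

# f P with f uniform on S: the density of Y when X is uniformly distributed.
f = {x: F(1, 3) for x in S}
fP = {y: sum(f[x] * P[x, y] for x in S) for y in T}
assert sum(fP.values()) == 1

# P g with g(y) = y: the function x -> E(Y | X = x).
g = {y: y for y in T}
Pg = {x: sum(P[x, y] * g[y] for y in T) for x in S}
print(Pg)  # {'a': Fraction(5, 2), 'b': Fraction(3, 1), 'c': Fraction(10, 3)}
```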
Parametric Distributions
A parametric probability distribution also defines a probability kernel in a natural way, with the parameter playing the role of the
kernel variable, and the distribution playing the role of the measure. Such distributions are usually defined in terms of a parametric
density function which then defines a kernel function, again with the parameter playing the role of the first argument and the
variable the role of the second argument. If the parameter is thought of as a given value of another random variable, as in Bayesian
analysis, then there is considerable overlap with the previous subsection. In most cases, (and in particular in the examples below),
the spaces involved are either discrete or Euclidean, as described in (1).
Consider the parametric family of exponential distributions. Let f denote the identity function on (0, ∞).
1. Give the probability density function as a probability kernel function p on (0, ∞).
2. Find P f .
3. Find f P .
4. Find p_2, the kernel function corresponding to the product kernel P².
Answer
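A numerical sketch (illustrative, not part of the text), using the rate parameterization p(r, x) = r e^(−rx): simulation confirms Pf(r) = E(X ∣ r) = 1/r, and quadrature is consistent with fP(x) = ∫₀^∞ r² e^(−rx) dr = 2/x³:

```python
import math
import random

random.seed(3)
N = 200_000
r = 2.5  # a fixed value of the rate parameter

# Given the parameter r, X is exponential with density p(r, x) = r e^(-r x).
xs = [random.expovariate(r) for _ in range(N)]

# (P f)(r) = E(X | r) = 1/r, with f the identity function on (0, inf):
Pf = sum(xs) / N
assert abs(Pf - 1 / r) < 0.01

# (f P)(x) = integral over r of r * r e^(-r x) dr = 2 / x^3;
# midpoint-rule quadrature at x = 2:
x, h = 2.0, 0.001
total, r_ = 0.0, h / 2
while r_ < 40.0:
    total += r_ * r_ * math.exp(-r_ * x) * h
    r_ += h
assert abs(total - 2 / x ** 3) < 1e-3
```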
Consider the parametric family of Poisson distributions. Let f be the identity function on N and let g be the identity function
on (0, ∞).
1. Give the probability density function p as a probability kernel function from (0, ∞) to N.
2. Show that P f = g .
3. Show that gP = f + 1.
Answer
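Both operations can be checked numerically (a sketch, not part of the text): summing n p(λ, n) over n recovers λ, and integrating λ p(λ, n) over λ gives n + 1 (the gamma integral ∫₀^∞ λ^(n+1) e^(−λ) dλ = (n+1)!):

```python
import math

def poisson_pmf(lam, nmax):
    """Return [p(lam, 0), ..., p(lam, nmax)] via the recurrence
    p(lam, n) = p(lam, n - 1) * lam / n, with p(lam, 0) = e^(-lam)."""
    probs = [math.exp(-lam)]
    for n in range(1, nmax + 1):
        probs.append(probs[-1] * lam / n)
    return probs

# (P f)(lam) = sum over n of n p(lam, n) = lam = g(lam), with f the identity on N:
for lam in [0.5, 1.0, 3.7, 10.0]:
    probs = poisson_pmf(lam, 150)
    assert abs(sum(probs) - 1) < 1e-12
    assert abs(sum(n * p for n, p in enumerate(probs)) - lam) < 1e-9

# (g P)(n) = integral over lam of lam p(lam, n) dlam = n + 1, with g the
# identity on (0, inf); midpoint-rule quadrature for small n:
def gP(n, h=0.002, top=60.0):
    total, lam = 0.0, h / 2
    while lam < top:
        total += lam * (math.exp(-lam) * lam ** n / math.factorial(n)) * h
        lam += h
    return total

for n in [0, 1, 2, 5]:
    assert abs(gP(n) - (n + 1)) < 1e-2
```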
Clearly the Poisson distribution has some very special and elegant properties. The next family of distributions also has some very
special properties. Compare this exercise with the exercise (30).
Consider the family of normal distributions, parameterized by the mean and with variance 1.
1. Give the probability density function as a probability kernel function p on R.
2. Show that p is symmetric.
3. Let f be the identity function on R. Show that P f = f and f P = f .
4. For n ∈ N, find p_n, the kernel function for the operator P^n.
Answer
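Composing this kernel n times corresponds to a random walk of n standard normal steps, so p_n should be the normal density with mean x and variance n; a simulation sketch (illustrative, not part of the text) is consistent with this:

```python
import random

random.seed(1)
N = 100_000
n, x0 = 4, 1.5

# n applications of the kernel = a random walk of n standard normal steps:
ys = []
for _ in range(N):
    x = x0
    for _ in range(n):
        x = random.gauss(x, 1)
    ys.append(x)

# Consistent with p_n(x, y) being the normal density with mean x and variance n:
mean = sum(ys) / N
var = sum((y - mean) ** 2 for y in ys) / N
assert abs(mean - x0) < 0.03
assert abs(var - n) < 0.1
```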
For each of the following special distributions, express the probability density function as a probability kernel function. Be sure
to specify the parameter spaces.
1. The general normal distribution on R.
2. The beta distribution on (0, 1).
3. The negative binomial distribution on N.
Answer
This page titled 4.13: Kernels and Operators is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
CHAPTER OVERVIEW
5: Special Distributions
In this chapter, we study several general families of probability distributions and a number of special parametric families of
distributions. Unlike the other expository chapters in this text, the sections are not linearly ordered and so this chapter serves
primarily as a reference. You may want to study these topics as the need arises.
First, we need to discuss what makes a probability distribution special in the first place. In some cases, a distribution may be
important because it is connected with other special distributions in interesting ways (via transformations, limits, conditioning,
etc.). In some cases, a parametric family may be important because it can be used to model a wide variety of random phenomena.
This may be the case because of fundamental underlying principles, or simply because the family has a rich collection of
probability density functions with a small number of parameters (usually three or fewer). As a general philosophical principle, we try to
model a random process with as few parameters as possible; this is sometimes referred to as the principle of parsimony of
parameters. In turn, this is a special case of Ockham's razor, named in honor of William of Ockham, the principle that states that
one should use the simplest model that adequately describes a given phenomenon. Parsimony is important because often the
parameters are not known and must be estimated.
In many cases, a special parametric family of distributions will have one or more distinguished standard members, corresponding
to specified values of some of the parameters. Usually the standard distributions will be mathematically simplest, and often other
members of the family can be constructed from the standard distributions by simple transformations on the underlying standard
random variable.
An incredible variety of special distributions have been studied over the years, and new ones are constantly being added to the
literature. To truly deserve the adjective special, a distribution should have a certain level of mathematical elegance and economy,
and should arise in interesting and diverse applications.
5.1: Location-Scale Families
5.2: General Exponential Families
5.3: Stable Distributions
5.4: Infinitely Divisible Distributions
5.5: Power Series Distributions
5.6: The Normal Distribution
5.7: The Multivariate Normal Distribution
5.8: The Gamma Distribution
5.9: Chi-Square and Related Distribution
5.10: The Student t Distribution
5.11: The F Distribution
5.12: The Lognormal Distribution
5.13: The Folded Normal Distribution
5.14: The Rayleigh Distribution
5.15: The Maxwell Distribution
5.16: The Lévy Distribution
5.17: The Beta Distribution
5.18: The Beta Prime Distribution
5.19: The Arcsine Distribution
5.20: General Uniform Distributions
5.21: The Uniform Distribution on an Interval
5.22: Discrete Uniform Distributions
5.23: The Semicircle Distribution
5.24: The Triangle Distribution
5.25: The Irwin-Hall Distribution
5.26: The U-Power Distribution
5.27: The Sine Distribution
5.28: The Laplace Distribution
5.29: The Logistic Distribution
5.30: The Extreme Value Distribution
5.31: The Hyperbolic Secant Distribution
5.32: The Cauchy Distribution
5.33: The Exponential-Logarithmic Distribution
5.34: The Gompertz Distribution
5.35: The Log-Logistic Distribution
5.36: The Pareto Distribution
5.37: The Wald Distribution
5.38: The Weibull Distribution
5.39: Benford's Law
5.40: The Zeta Distribution
5.41: The Logarithmic Series Distribution
This page titled 5: Special Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.1: Location-Scale Families
General Theory
As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P), so that Ω is the set of outcomes, F
the collection of events, and P the probability measure on the sample space (Ω, F). In this section, we assume that we have a fixed random variable Z defined on the probability space, taking values in R.
Definition
For a ∈ R and b ∈ (0, ∞), let X = a + b Z . The two-parameter family of distributions associated with X is called the
location-scale family associated with the given distribution of Z . Specifically, a is the location parameter and b the scale
parameter.
Thus a linear transformation, with positive slope, of the underlying random variable Z creates a location-scale family for the
underlying distribution. In the special case that b = 1 , the one-parameter family is called the location family associated with the
given distribution, and in the special case that a = 0 , the one-parameter family is called the scale family associated with the given
distribution. Scale transformations, as the name suggests, occur naturally when physical units are changed. For example, if a
random variable represents the length of an object, then a change of units from meters to inches corresponds to a scale
transformation. Location transformations often occur when the zero reference point is changed, in measuring distance or time, for
example. Location-scale transformations can also occur with a change of physical units. For example, if a random variable
represents the temperature of an object, then a change of units from Fahrenheit to Celsius corresponds to a location-scale
transformation.
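The temperature example can be made concrete with a short numeric sketch (the reference readings below are illustrative):

```python
# Fahrenheit -> Celsius is a location-scale transformation:
# C = (F - 32)/1.8, which has the form a + b*F with a = -32/1.8, b = 1/1.8.
a, b = -32 / 1.8, 1 / 1.8

def to_celsius(f):
    return a + b * f

# A few familiar reference points (illustrative values)
for f, c in [(32.0, 0.0), (212.0, 100.0), (98.6, 37.0)]:
    assert abs(to_celsius(f) - c) < 1e-9
```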
Distribution Functions
Our goal is to relate various functions that determine the distribution of X = a + bZ to the corresponding functions for Z. First we
consider the (cumulative) distribution function. If Z has distribution function G then X has distribution function F given by
F(x) = G((x − a)/b), x ∈ R
Proof
Next we consider the probability density function. The results are a bit different for discrete distributions and continuous
distribution, not surprising since the density function has different meanings in these two cases.
If Z has a discrete distribution with probability density function g then X also has a discrete distribution, with probability
density function f given by
f(x) = g((x − a)/b), x ∈ R (5.1.3)
Proof
If Z has a continuous distribution with probability density function g , then X also has a continuous distribution, with
probability density function f given by
f(x) = (1/b) g((x − a)/b), x ∈ R (5.1.5)
1. For the location family associated with g , the graph of f is obtained by shifting the graph of g , a units to the right if a > 0
and −a units to the left if a < 0 .
2. For the scale family associated with g , if b > 1 , the graph of f is obtained from the graph of g by stretching horizontally
and compressing vertically, by a factor of b . If 0 < b < 1 , the graph of f is obtained from the graph of g by compressing
horizontally and stretching vertically, by a factor of b .
Proof
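A quick numerical sanity check of the continuous case (a sketch, using an assumed standard exponential density for Z and illustrative values a = 2, b = 3): the transformed function f(x) = (1/b) g((x − a)/b) is again a probability density.

```python
import math

def g(z):
    # Density of Z: standard exponential on [0, inf) -- an assumed example
    return math.exp(-z) if z >= 0 else 0.0

a, b = 2.0, 3.0  # illustrative location and scale parameters

def f(x):
    # Density of X = a + b*Z: f(x) = (1/b) g((x - a)/b)
    return g((x - a) / b) / b

# Midpoint Riemann sum of f over [a, a + 60] (covers essentially all the mass)
dx = 0.001
total = sum(f(a + (k + 0.5) * dx) * dx for k in range(int(60 / dx)))
assert abs(total - 1.0) < 1e-3
```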
Suppose now that Z has a continuous distribution on [0, ∞), and that we think of Z as the failure time of a device (or the time of
death of an organism). Let X = bZ where b ∈ (0, ∞), so that the distribution of X is the scale family associated with the
distribution of Z . Then X also has a continuous distribution on [0, ∞) and can also be thought of as the failure time of a device
(perhaps in different units).
Let G^c and F^c denote the reliability functions of Z and X respectively, and let r and R denote the failure rate functions of Z
and X respectively. Then:
1. F^c(x) = G^c(x/b) for x ∈ [0, ∞)
2. R(x) = (1/b) r(x/b) for x ∈ [0, ∞)
Proof
Moments
The following theorem relates the mean, variance, and standard deviation of Z and X.
1. E(X) = a + b E(Z)
2. var(X) = b² var(Z)
3. sd(X) = b sd(Z)
Proof
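The moment relations E(X) = a + b E(Z) and var(X) = b² var(Z) can be verified exactly on a small discrete example (a hypothetical fair-die variable Z, with illustrative a = 5, b = 3):

```python
from fractions import Fraction

# Z is a fair-die roll (a hypothetical discrete example); X = a + b*Z
pmf = {z: Fraction(1, 6) for z in range(1, 7)}
a, b = Fraction(5), Fraction(3)        # illustrative location and scale
pmf_X = {a + b * z: w for z, w in pmf.items()}

def mean(p):
    return sum(z * w for z, w in p.items())

def var(p):
    m = mean(p)
    return sum((z - m) ** 2 * w for z, w in p.items())

assert mean(pmf_X) == a + b * mean(pmf)   # E(X) = a + b E(Z)
assert var(pmf_X) == b ** 2 * var(pmf)    # var(X) = b^2 var(Z), so sd(X) = b sd(Z)
```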
Recall that the standard score of a random variable is obtained by subtracting the mean and dividing by the standard deviation. The
standard score is dimensionless (that is, has no physical units) and measures the distance from the mean to the random variable in
standard deviations. Since location-scale families essentially correspond to a change of units, it's not surprising that the standard
score is unchanged by a location-scale transformation: the standard score of X = a + bZ is the same as the standard score of Z.
Proof
Recall that the skewness and kurtosis of a random variable are the third and fourth moments, respectively, of the standard score.
Thus it follows from the previous result that skewness and kurtosis are unchanged by location-scale transformations:
skew(X) = skew(Z) , kurt(X) = kurt(Z) .
We can relate the moments of X (about 0) to those of Z by means of the binomial theorem:
E(X^n) = ∑_{k=0}^{n} C(n, k) b^k a^{n−k} E(Z^k), n ∈ N (5.1.8)
Of course, the moments of X about the location parameter a have a simple representation in terms of the moments of Z about 0:
E[(X − a)^n] = b^n E(Z^n), n ∈ N (5.1.9)
The following exercise relates the moment generating functions of Z and X.
If Z has moment generating function m then X has moment generating function M given by
M(t) = e^{at} m(bt) (5.1.10)
Proof
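A quick numerical check of M(t) = e^{at} m(bt) (a sketch, using an assumed discrete uniform Z on {1, …, 6} and illustrative values of a and b):

```python
import math

# Z uniform on {1,...,6} (an assumed discrete example); X = a + b*Z
a, b = 0.5, 2.0   # illustrative location and scale

def m(t):
    # m(t) = E(e^{tZ})
    return sum(math.exp(t * z) for z in range(1, 7)) / 6

def M(t):
    # M(t) = E(e^{tX}), computed directly from the distribution of X
    return sum(math.exp(t * (a + b * z)) for z in range(1, 7)) / 6

for t in [-0.4, 0.1, 0.3]:
    assert abs(M(t) - math.exp(a * t) * m(b * t)) < 1e-9
```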
Type
As we noted earlier, two probability distributions that are related by a location-scale transformation can be thought of as governing
the same underlying random quantity, but in different physical units. This relationship is important enough to deserve a name.
Suppose that P and Q are probability distributions on R with distribution functions F and G, respectively. Then P and Q are
of the same type if there exist constants a ∈ R and b ∈ (0, ∞) such that
F(x) = G((x − a)/b), x ∈ R (5.1.12)
Being of the same type is an equivalence relation on the collection of probability distributions on R. That is, if P, Q, and R are
probability distributions on R then
1. P is the same type as P (the reflexive property).
2. If P is the same type as Q then Q is the same type as P (the symmetric property).
3. If P is the same type as Q, and Q is the same type as R , then P is the same type as R (the transitive property).
Proof
So, the collection of probability distributions on R is partitioned into mutually exclusive equivalence classes, where the
distributions in each class are all of the same type.
The exponential-logarithmic distribution is a scale family for each value of the shape parameter.
The gamma distribution is a scale family for each value of the shape parameter.
The Gompertz distribution is a scale family for each value of the shape parameter.
The log-logistic distribution is a scale family for each value of the shape parameter.
The Pareto distribution is a scale family for each value of the shape parameter.
The triangle distribution is a location-scale family for each value of the shape parameter.
The U-power distribution is a location-scale family for each value of the shape parameter.
The Weibull distribution is a scale family for each value of the shape parameter.
The Wald distribution is a scale family, although in the usual formulation, neither of the parameters is a scale parameter.
5.2: General Exponential Families
Basic Theory
Definition
We start with a probability space (Ω, F , P) as a model for a random experiment. So as usual, Ω is the set of outcomes, F the σ-
algebra of events, and P the probability measure on the sample space (Ω, F ). For the general formulation that we want in this
section, we need two additional spaces, a measure space (S, S , μ) (where the probability distributions will live) and a measurable
space (T , T ) (serving the role of a parameter space). Typically, these spaces fall into our two standard categories. Specifically, the
measure space is usually one of the following:
Discrete. S is countable, S is the collection of all subsets of S , and μ = # is counting measure.
Euclidean. S is a sufficiently nice Borel measurable subset of R^n for some n ∈ N_+, S is the σ-algebra of Borel measurable
subsets of S, and μ = λ_n is n-dimensional Lebesgue measure.
Similarly, the parameter space (T, T) is usually either discrete, so that T is countable and T the collection of all subsets of T, or
Euclidean so that T is a sufficiently nice Borel measurable subset of R^m for some m ∈ N_+ and T is the σ-algebra of Borel
measurable subsets of T.
Suppose now that X is a random variable defined on the probability space, taking values in S, and that the distribution of X depends
on a parameter θ ∈ T. For θ ∈ T we assume that the distribution of X has probability density function f_θ with respect to μ.
The distribution of X is a k-parameter exponential family if S does not depend on θ, and if the probability density function can
be written in the form
f_θ(x) = α(θ) r(x) exp(∑_{i=1}^{k} β_i(θ) h_i(x)), x ∈ S, θ ∈ T
where α and (β_1, β_2, …, β_k) are real-valued functions on T, and r and (h_1, h_2, …, h_k) are real-valued functions on S.
1. the functions (β_1(θ), β_2(θ), …, β_k(θ)) are called the natural parameters of the distribution.
2. the random variables (h_1(X), h_2(X), …, h_k(X)) are called the natural statistics of the distribution.
Although the definition may look intimidating, exponential families are useful because many important theoretical results in
statistics hold for exponential families, and because many special parametric families of distributions turn out to be exponential
families. It's important to emphasize that the representation of f_θ(x) given in the definition must hold for all x ∈ S and θ ∈ T. If
the representation only holds for a set of x ∈ S that depends on the particular θ ∈ T, then the family of distributions is not a
general exponential family.
The next result shows that if we sample from the distribution of an exponential family, then the distribution of the random sample
is itself an exponential family with the same natural parameters.
Suppose that the distribution of random variable X is a k-parameter exponential family with natural parameters
(β_1(θ), β_2(θ), …, β_k(θ)) and natural statistics (h_1(X), h_2(X), …, h_k(X)). Let X = (X_1, X_2, …, X_n) be a sequence of n
independent random variables, each with the same distribution as X. Then X is a k-parameter exponential family with natural
parameters (β_1(θ), β_2(θ), …, β_k(θ)) and natural statistics
(∑_{i=1}^{n} h_1(X_i), ∑_{i=1}^{n} h_2(X_i), …, ∑_{i=1}^{n} h_k(X_i))
Proof
Special Distributions
Many of the special distributions studied in this chapter are general exponential families, at least with respect to some of their
parameters. On the other hand, most commonly, a parametric family fails to be a general exponential family because the support set
depends on the parameter. The following theorems give a number of examples. Proofs will be provided in the individual sections.
The Bernoulli distribution is a one-parameter exponential family in the success parameter p ∈ [0, 1].
The beta distribution is a two-parameter exponential family in the shape parameters a ∈ (0, ∞), b ∈ (0, ∞).
The beta prime distribution is a two-parameter exponential family in the shape parameters a ∈ (0, ∞), b ∈ (0, ∞).
The binomial distribution is a one-parameter exponential family in the success parameter p ∈ [0, 1] for a fixed value of the trial
parameter n ∈ N_+.
The chi-square distribution is a one-parameter exponential family in the degrees of freedom n ∈ (0, ∞) .
The exponential distribution is a one-parameter exponential family (appropriately enough), in the rate parameter r ∈ (0, ∞).
The gamma distribution is a two-parameter exponential family in the shape parameter k ∈ (0, ∞) and the scale parameter
b ∈ (0, ∞).
The geometric distribution is a one-parameter exponential family in the success probability p ∈ (0, 1).
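For instance, the geometric pmf on N = {0, 1, 2, …}, f_p(n) = p(1 − p)^n, can be written in exponential family form with one natural choice of factors: α(p) = p, r(n) = 1, natural parameter β(p) = ln(1 − p), and natural statistic h(n) = n. A minimal numeric check of this factorization:

```python
import math

# Geometric pmf on N = {0, 1, 2, ...}: f_p(n) = p (1-p)^n.
# Exponential family form: f_p(n) = alpha(p) * r(n) * exp(beta(p) * h(n))
# with beta(p) = ln(1-p), h(n) = n, alpha(p) = p, r(n) = 1.
p = 0.3            # illustrative success probability
alpha = p
beta = math.log(1 - p)

for n in range(20):
    direct = p * (1 - p) ** n
    family = alpha * math.exp(beta * n)   # r(n) = 1
    assert abs(direct - family) < 1e-12
```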
The half normal distribution is a one-parameter exponential family in the scale parameter σ ∈ (0, ∞).
The Laplace distribution is a one-parameter exponential family in the scale parameter b ∈ (0, ∞) for a fixed value of the
location parameter a ∈ R .
The Lévy distribution is a one-parameter exponential family in the scale parameter b ∈ (0, ∞) for a fixed value of the location
parameter a ∈ R .
The logarithmic distribution is a one-parameter exponential family in the shape parameter p ∈ (0, 1).
The lognormal distribution is a two-parameter exponential family in the shape parameters μ ∈ R, σ ∈ (0, ∞).
The Maxwell distribution is a one-parameter exponential family in the scale parameter b ∈ (0, ∞).
The k-dimensional multinomial distribution is a k-parameter exponential family in the probability parameters (p_1, p_2, …, p_k).
The k-dimensional multivariate normal distribution is a (k² + 3k)/2-parameter exponential family with respect to the mean
vector μ and the variance-covariance matrix V.
The negative binomial distribution is a one-parameter exponential family in the success parameter p ∈ (0, 1) for a fixed value
of the stopping parameter k ∈ N_+.
The normal distribution is a two-parameter exponential family in the mean μ ∈ R and the standard deviation σ ∈ (0, ∞).
The Pareto distribution is a one-parameter exponential family in the shape parameter for a fixed value of the scale parameter.
The Rayleigh distribution is a one-parameter exponential family.
The U-power distribution is a one-parameter exponential family in the shape parameter, for fixed values of the location and
scale parameters.
The Weibull distribution is a one-parameter exponential family in the scale parameter for a fixed value of the shape parameter.
5.3: Stable Distributions
This section discusses a theoretical topic that you may want to skip if you are a new student of probability.
Basic Theory
Stable distributions are an important general class of probability distributions on R that are defined in terms of location-scale
transformations. Stable distributions occur as limits (in distribution) of scaled and centered sums of independent, identically
distributed variables. Such limits generalize the central limit theorem, and so stable distributions generalize the normal distribution
in a sense. The pioneering work on stable distributions was done by Paul Lévy.
Definition
In this section, we consider real-valued random variables whose distributions are not degenerate (that is, not concentrated at a
single value). After all, a random variable with a degenerate distribution is not really random, and so is not of much interest.
Random variable X has a stable distribution if the following condition holds: if n ∈ N_+ and (X_1, X_2, …, X_n) is a sequence
of independent variables, each with the same distribution as X, then X_1 + X_2 + ⋯ + X_n has the same distribution as
a_n + b_n X for some a_n ∈ R and b_n ∈ (0, ∞). The numbers a_n are the centering parameters and the numbers b_n are the
norming parameters.
Details
Recall that two distributions on R that are related by a location-scale transformation are said to be of the same type, and that being
of the same type defines an equivalence relation on the class of distributions on R. With this terminology, the definition of stability
has a more elegant expression: X has a stable distribution if the sum of a finite number of independent copies of X is of the same
type as X. As we will see, the norming parameters are more important than the centering parameters, and in fact, only certain
norming parameters can occur.
Basic Properties
We start with some very simple results that follow easily from the definition, before moving on to the deeper results.
Suppose that X has a stable distribution with mean μ and finite variance. Then the norming parameters are b_n = √n and the
centering parameters are a_n = (n − √n)μ for n ∈ N_+.
Proof
It will turn out that the only stable distribution with finite variance is the normal distribution, but the result above is useful as an
intermediate step. Next, it seems fairly clear from the definition that the family of stable distributions is itself a location-scale
family.
Suppose that the distribution of X is stable, with centering parameters a_n ∈ R and norming parameters b_n ∈ (0, ∞) for
n ∈ N_+. If c ∈ R and d ∈ (0, ∞), then the distribution of Y = c + dX is also stable, with centering parameters
d a_n + (n − b_n)c and norming parameters b_n for n ∈ N_+.
Proof
An important point is that the norming parameters are unchanged under a location-scale transformation.
Suppose that the distribution of X is stable, with centering parameters a_n ∈ R and norming parameters b_n ∈ (0, ∞) for
n ∈ N_+. Then the distribution of −X is stable, with centering parameters −a_n and norming parameters b_n for n ∈ N_+.
Proof
From the last two results, if X has a stable distribution, then so does c + dX, with the same norming parameters, for every
c, d ∈ R with d ≠ 0. Stable distributions are also closed under convolution (corresponding to sums of independent variables) if the
norming parameters are the same.
Suppose that X and Y are independent variables. Assume also that X has a stable distribution with centering parameters
a_n ∈ R and norming parameters b_n ∈ (0, ∞) for n ∈ N_+, and that Y has a stable distribution with centering parameters
c_n ∈ R and the same norming parameters b_n for n ∈ N_+. Then Z = X + Y has a stable distribution with centering
parameters a_n + c_n and norming parameters b_n for n ∈ N_+.
Proof
We can now give another characterization of stability that just involves two independent copies of X.
Random variable X has a stable distribution if and only if the following condition holds: if X_1, X_2 are independent variables,
each with the same distribution as X, and d_1, d_2 ∈ (0, ∞), then d_1 X_1 + d_2 X_2 has the same distribution as a + bX for some
a ∈ R and b ∈ (0, ∞).
Proof
Suppose that X and Y are independent with the same stable distribution. Then the distribution of X − Y is strictly stable, with
the same norming parameters.
Note that the distribution of X − Y is symmetric (about 0). The last result is useful because it allows us to get rid of the centering
parameters when proving facts about the norming parameters. Here is the most important of those facts:
Suppose that X has a stable distribution. Then the norming parameters have the form b_n = n^{1/α} for n ∈ N_+, for some
α ∈ (0, 2]. The parameter α is known as the index or characteristic exponent of the distribution.
Proof
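As an illustration with the normal case (α = 2, treated later in this section): the standard normal characteristic function is χ(t) = e^{−t²/2}, the sum of n independent copies has characteristic function χ(t)^n, and stability with b_n = n^{1/2} means χ(t)^n = χ(n^{1/2} t). A sketch verifying this identity numerically:

```python
import math

# Standard normal characteristic function chi(t) = exp(-t^2/2).
# Sum of n independent copies has c.f. chi(t)^n; stability with index
# alpha = 2 says this equals chi(b_n * t) with b_n = n^(1/2).
def chi(t):
    return math.exp(-t * t / 2)

n = 5                 # illustrative number of copies
b_n = n ** (1 / 2)    # norming parameter n^{1/alpha} with alpha = 2
for t in [0.1, 0.5, 1.0, 2.0]:
    assert abs(chi(t) ** n - chi(b_n * t)) < 1e-12
```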
The next result is a precise statement of the limit theorem alluded to in the introductory paragraph.
Suppose that (X_1, X_2, …) is a sequence of independent, identically distributed random variables, and let Y_n = ∑_{i=1}^{n} X_i for
n ∈ N_+. If there exist constants a_n ∈ R and b_n ∈ (0, ∞) for n ∈ N_+ such that (Y_n − a_n)/b_n has a (non-degenerate)
limiting distribution as n → ∞, then the limiting distribution is stable.
The following theorem completely characterizes stable distributions in terms of the characteristic function.
Suppose that X has a stable distribution. The characteristic function of X has the following form, for some α ∈ (0, 2],
β ∈ [−1, 1], c ∈ R, and d ∈ (0, ∞):
χ(t) = E(e^{itX}) = exp(itc − d^α |t|^α [1 + iβ sgn(t) u_α(t)]), t ∈ R (5.3.15)
Thus, the family of stable distributions is a 4-parameter family. The index parameter α and the skewness parameter β can be
considered shape parameters. When the location parameter c = 0 and the scale parameter d = 1, we get the standard form of the
stable distributions, with characteristic function
χ(t) = E(e^{itX}) = exp(−|t|^α [1 + iβ sgn(t) u_α(t)]), t ∈ R (5.3.17)
The characteristic function gives another proof that stable distributions are closed under convolution (corresponding to sums of
independent variables), if the index is fixed.
Suppose that X_1 and X_2 are independent random variables, and that X_k has the stable distribution with common index
α ∈ (0, 2], skewness parameter β_k ∈ [−1, 1], location parameter c_k ∈ R, and scale parameter d_k ∈ (0, ∞) for k ∈ {1, 2}.
Then X_1 + X_2 has the stable distribution with index α, location parameter c = c_1 + c_2, scale parameter
d = (d_1^α + d_2^α)^{1/α}, and skewness parameter
β = (β_1 d_1^α + β_2 d_2^α) / (d_1^α + d_2^α) (5.3.18)
Proof
Special Cases
Three special parametric families of distributions studied in this chapter are stable. In the proofs in this subsection, we use the
definition of stability and various important properties of the distributions. These properties, in turn, are verified in the sections
devoted to the distributions. We also give proofs based on the characteristic function, which allows us to identify the skewness
parameter.
Of course, the normal distribution has finite variance, so once we know that it is stable, it follows from the finite variance property
above that the index must be 2. Moreover, the characteristic function shows that the normal distribution is the only stable
distribution with index 2, and hence the only stable distribution with finite variance.
Open the special distribution simulator and select the normal distribution. Vary the parameters and note the shape and location
of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.
Open the special distribution simulator and select the Cauchy distribution. Vary the parameters and note the shape and location
of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.
The Lévy distribution is stable, with index α = 1/2 and skewness parameter β = 1.
Proof
Open the special distribution simulator and select the Lévy distribution. Vary the parameters and note the shape and location of
the probability density function. For various values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.
The normal, Cauchy, and Lévy distributions are the only stable distributions for which the probability density function is known in
closed form.
5.4: Infinitely Divisible Distributions
This section discusses a theoretical topic that you may want to skip if you are a new student of probability.
Basic Theory
Infinitely divisible distributions form an important class of distributions on R that includes the stable distributions, the compound
Poisson distributions, as well as several of the most important special parametric families of distributions. Basically, the distribution
of a real-valued random variable is infinitely divisible if for each n ∈ N_+, the variable can be decomposed into the sum of n
independent, identically distributed variables.
The distribution of a real-valued random variable X is infinitely divisible if for every n ∈ N_+, there exists a sequence of
independent, identically distributed variables (X_1, X_2, …, X_n) such that X_1 + X_2 + ⋯ + X_n has the same distribution as
X.
Suppose now that X = (X_1, X_2, …) is a sequence of independent, identically distributed random variables, and that N has a
Poisson distribution and is independent of X. Recall that the distribution of ∑_{i=1}^{N} X_i is said to be compound Poisson. Like the
stable distributions, the compound Poisson distributions form another important class of infinitely divisible distributions.
Special Cases
A number of special distributions are infinitely divisible. Proofs of the results stated below are given in the individual sections.
Stable Distributions
First, the normal distribution, the Cauchy distribution, and the Lévy distribution are stable, so they are infinitely divisible.
However, direct arguments give more information, because we can identify the distribution of the component variables.
The normal distribution is infinitely divisible. If X has the normal distribution with mean μ ∈ R and standard deviation
σ ∈ (0, ∞), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are
independent, and X_i has the normal distribution with mean μ/n and standard deviation σ/√n for each i ∈ {1, 2, …, n}.
The Cauchy distribution is infinitely divisible. If X has the Cauchy distribution with location parameter a ∈ R and scale
parameter b ∈ (0, ∞), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are
independent, and X_i has the Cauchy distribution with location parameter a/n and scale parameter b/n for each
i ∈ {1, 2, …, n}.
The gamma distribution is infinitely divisible. If X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale
parameter b ∈ (0, ∞), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are
independent, and X_i has the gamma distribution with shape parameter k/n and scale parameter b for each i ∈ {1, 2, …, n}.
The chi-square distribution is infinitely divisible. If X has the chi-square distribution with k ∈ (0, ∞) degrees of freedom,
then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are independent, and X_i has
the chi-square distribution with k/n degrees of freedom for each i ∈ {1, 2, …, n}.
The Poisson distribution is infinitely divisible. If X has the Poisson distribution with rate parameter λ ∈ (0, ∞),
then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are independent, and X_i has
the Poisson distribution with rate parameter λ/n for each i ∈ {1, 2, …, n}.
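The Poisson case can be checked directly by numerical convolution: convolving n copies of the Poisson(λ/n) pmf reproduces the Poisson(λ) pmf (a sketch with illustrative λ = 2 and n = 4, truncating the support):

```python
import math

def pois_pmf(lam, kmax):
    # Poisson pmf values on {0, ..., kmax}
    return [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(kmax + 1)]

def convolve(p, q):
    # pmf of the sum of two independent variables with pmfs p and q
    r = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

lam, n, kmax = 2.0, 4, 15     # illustrative values
part = pois_pmf(lam / n, kmax)
total = part
for _ in range(n - 1):
    total = convolve(total, part)

target = pois_pmf(lam, kmax)
for k in range(kmax + 1):
    assert abs(total[k] - target[k]) < 1e-9
```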
The general negative binomial distribution on N is infinitely divisible. If X has the negative binomial distribution on N with
parameters k ∈ (0, ∞) and p ∈ (0, 1), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where
(X_1, X_2, …, X_n) are independent, and X_i has the negative binomial distribution on N with parameters k/n and p for each
i ∈ {1, 2, …, n}.
Since the Poisson distribution and the negative binomial distributions are distributions on N, it follows from the characterization
above that these distributions must be compound Poisson. Of course it is completely trivial that the Poisson distribution is
compound Poisson, but it's far from obvious that the negative binomial distribution has this property. It turns out that the negative
binomial distribution can be obtained by compounding the logarithmic series distribution with the Poisson distribution.
The Wald distribution is infinitely divisible. If X has the Wald distribution with shape parameter λ ∈ (0, ∞) and mean
μ ∈ (0, ∞), then for n ∈ N_+, X has the same distribution as X_1 + X_2 + ⋯ + X_n where (X_1, X_2, …, X_n) are
independent, and X_i has the Wald distribution with shape parameter λ/n² and mean μ/n for each i ∈ {1, 2, …, n}.
5.5: Power Series Distributions
Power Series Distributions are discrete distributions on (a subset of) N constructed from power series. This class of distributions is
important because most of the special, discrete distributions are power series distributions.
Basic Theory
Power Series
Suppose that a = (a_0, a_1, a_2, …) is a sequence of nonnegative real numbers. We are interested in the power series with a as the
sequence of coefficients. The partial sums are
g_n(θ) = ∑_{k=0}^{n} a_k θ^k, θ ∈ R (5.5.1)
The power series g is then defined by g(θ) = lim_{n→∞} g_n(θ) for θ ∈ R for which the limit exists, and is denoted
g(θ) = ∑_{n=0}^{∞} a_n θ^n (5.5.2)
Note that the series converges when θ = 0, and g(0) = a_0. Beyond this trivial case, recall that there exists r ∈ [0, ∞] such that the
series converges absolutely for |θ| < r and diverges for |θ| > r. The number r is the radius of convergence. From now on, we
assume that r > 0. If r < ∞, the series may converge (absolutely) or may diverge to ∞ at the endpoint r. At −r, the series may
converge absolutely, may converge conditionally, or may diverge.
Distributions
From now on, we restrict θ to the interval [0, r); this interval is our parameter space. Some of the results below may hold when
r < ∞ and θ = r , but dealing with this case explicitly makes the exposition unnecessarily cumbersome.
Suppose that N is a random variable with values in N. Then N has the power series distribution associated with the function g
(or equivalently with the sequence a) and with parameter θ ∈ [0, r) if N has probability density function f_θ given by
f_θ(n) = a_n θ^n / g(θ), n ∈ N (5.5.3)
Proof
Note that when θ = 0, the distribution is simply the point mass distribution at 0; that is, f_0(0) = 1.
N has distribution function F_θ given by
F_θ(n) = g_n(θ) / g(θ), n ∈ N (5.5.4)
Proof
Of course, the probability density function f_θ is most useful when the power series g(θ) can be given in closed form, and similarly
the distribution function F_θ is most useful when the power series and the partial sums can be given in closed form.
Moments
The moments of N can be expressed in terms of the underlying power series function g, and the nicest expression is for the
factorial moments. Recall that the permutation formula is t^(k) = t(t − 1) ⋯ (t − k + 1) for t ∈ R and k ∈ N, and the factorial
moment of N of order k ∈ N is E[N^(k)].
Proof
Proof
Proof
Relations
Power series distributions are closed with respect to sums of independent variables.
Suppose that N_1 and N_2 are independent, and have power series distributions relative to the functions g_1 and g_2, respectively,
each with parameter value θ < min{r_1, r_2}. Then N_1 + N_2 has the power series distribution relative to the function g_1 g_2,
with parameter value θ.
Suppose that (N_1, N_2, …, N_k) is a sequence of independent variables, each with the same power series distribution, relative
to the function g and with parameter value θ < r. Then N_1 + N_2 + ⋯ + N_k has the power series distribution relative to the
function g^k, with parameter value θ.
In the context of this result, recall that (N_1, N_2, …, N_k) is a random sample of size k from the common distribution.
The Poisson distribution with rate parameter λ ∈ [0, ∞) is a power series distribution relative to the function g(λ) = e^λ for
λ ∈ [0, ∞).
Proof
The geometric distribution on N with success parameter p ∈ (0, 1] is a power series distribution relative to the function
g(θ) = 1/(1 − θ) for θ ∈ [0, 1), where θ = 1 − p .
Proof
For fixed k ∈ (0, ∞), the negative binomial distribution on N with stopping parameter k and success parameter
p ∈ (0, 1] is a power series distribution relative to the function g(θ) = 1/(1 − θ)^k for θ ∈ [0, 1), where θ = 1 − p.
Proof
For fixed n ∈ N_+, the binomial distribution with trial parameter n and success parameter p ∈ [0, 1) is a power series
distribution relative to the function g(θ) = (1 + θ)^n for θ ∈ [0, ∞), where θ = p/(1 − p).
Proof
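This factorization is easy to verify numerically: with a_k = C(n, k) and θ = p/(1 − p), the power series pmf C(n, k) θ^k / (1 + θ)^n equals the binomial pmf C(n, k) p^k (1 − p)^{n−k} (a sketch with illustrative n = 7, p = 0.35):

```python
import math

# Binomial(n, p) as a power series distribution: a_k = C(n, k),
# g(theta) = (1 + theta)^n with theta = p/(1 - p):
# f(k) = C(n,k) theta^k / (1+theta)^n = C(n,k) p^k (1-p)^(n-k)
n, p = 7, 0.35                 # illustrative values
theta = p / (1 - p)
g = (1 + theta) ** n

for k in range(n + 1):
    f = math.comb(n, k) * theta ** k / g
    binom = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    assert abs(f - binom) < 1e-12
```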
The logarithmic distribution with parameter p ∈ [0, 1) is a power series distribution relative to the function
g(p) = − ln(1 − p) for p ∈ [0, 1).
Proof
5.6: The Normal Distribution
The normal distribution holds an honored role in probability and statistics, mostly because of the central limit theorem, one of the
fundamental theorems that forms a bridge between the two subjects. In addition, as we will see, the normal distribution has many
nice mathematical properties. The normal distribution is also called the Gaussian distribution, in honor of Carl Friedrich Gauss,
who was among the first to use the distribution.
The standard normal distribution is a continuous distribution on R with probability density function ϕ given by
ϕ(z) = (1/√(2π)) e^{−z²/2}, z ∈ R (5.6.1)
The standard normal probability density function has the famous “bell shape” that is known to just about everyone.
In the Special Distribution Simulator, select the normal distribution and keep the default settings. Note the shape and location
of the standard normal density function. Run the simulation 1000 times, and compare the empirical density function to the
probability density function.
The distribution function Φ and its inverse, the quantile function Φ^{−1}, cannot be expressed in closed form in terms of
elementary functions. However, approximate values of these functions can be obtained from the special distribution calculator, and
from most mathematics and statistics software. Indeed these functions are so important that they are considered special functions of
mathematics.
Proof
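For instance, Python's standard library exposes Φ and Φ^{−1} numerically through statistics.NormalDist; a quick sketch:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu=0, sigma=1

print(Z.cdf(0.0))                  # Φ(0) = 0.5 by symmetry
print(round(Z.inv_cdf(0.975), 4))  # Φ⁻¹(0.975) ≈ 1.96, the familiar quantile
```
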
In the special distribution calculator, select the normal distribution and keep the default settings.
1. Note the shape of the density function and the distribution function.
2. Find the first and third quartiles.
3. Compute the interquartile range.
In the special distribution calculator, select the normal distribution and keep the default settings. Find the quantiles of the
following orders for the standard normal distribution:
1. p = 0.001, p = 0.999
5.6.1 https://stats.libretexts.org/@go/page/10172
2. p = 0.05, p = 0.95
3. p = 0.1 , p = 0.9
Moments
Suppose that random variable Z has the standard normal distribution.
In the Special Distribution Simulator, select the normal distribution and keep the default settings. Note the shape and size of the mean ± standard deviation bar. Run the simulation 1000 times, and compare the empirical mean and standard deviation to the true mean and standard deviation.
More generally, we can compute all of the moments. The key is the following recursion formula.
For n ∈ N₊, E(Z^{n+1}) = n E(Z^{n−1})
Proof
The moments of the standard normal distribution are now easy to compute.
For n ∈ N,
1. E(Z^{2n}) = 1 · 3 ⋯ (2n − 1) = (2n)!/(n! 2ⁿ)
2. E(Z^{2n+1}) = 0
Proof
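As a quick numerical check, the recursion and the closed-form expression for the moments can be compared directly (the helper function name below is my own):

```python
from math import factorial

def std_normal_moment(k: int) -> float:
    """E(Z^k) for standard normal Z: 0 for odd k, (2n)!/(n! 2^n) for k = 2n."""
    if k % 2 == 1:
        return 0.0
    n = k // 2
    return factorial(2 * n) / (factorial(n) * 2 ** n)

# the recursion E(Z^{n+1}) = n E(Z^{n-1}) links consecutive moments
for n in range(1, 12):
    assert std_normal_moment(n + 1) == n * std_normal_moment(n - 1)

print(std_normal_moment(2), std_normal_moment(4), std_normal_moment(6))  # 1.0 3.0 15.0
```
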
Of course, the fact that the odd-order moments are 0 also follows from the symmetry of the distribution. The following theorem
gives the skewness and kurtosis of the standard normal distribution.
Because of the last result (and the use of the standard normal distribution literally as a standard), the excess kurtosis of a random variable is defined to be the ordinary kurtosis minus 3. Thus, the excess kurtosis of the normal distribution is 0.
Many other important properties of the normal distribution are most easily obtained using the moment generating function or the
characteristic function.
1. m(t) = e^{t²/2} for t ∈ R.
2. χ(t) = e^{−t²/2} for t ∈ R.
Proof
Thus, the standard normal distribution has the curious property that the characteristic function is a multiple of the probability density function:
χ = √(2π) ϕ  (5.6.11)
The moment generating function can be used to give another derivation of the moments of Z, since we know that E(Zⁿ) = m^{(n)}(0).
The General Normal Distribution
The general normal distribution is the location-scale family associated with the standard normal distribution.
Suppose that μ ∈ R and σ ∈ (0, ∞) and that Z has the standard normal distribution. Then X = μ + σZ has the normal
distribution with location parameter μ and scale parameter σ.
Distribution Functions
Suppose that X has the normal distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞). The basic properties of
the density function and distribution function of X follow from general results for location scale families.
Proof
In the special distribution simulator, select the normal distribution. Vary the parameters and note the shape and location of the
probability density function. With your choice of parameter settings, run the simulation 1000 times and compare the empirical
density function to the true probability density function.
Let F denote the distribution function of X, and as above, let Φ denote the standard normal distribution function.
Proof
In the special distribution calculator, select the normal distribution. Vary the parameters and note the shape of the density
function and the distribution function.
Moments
Suppose again that X has the normal distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞). As the notation
suggests, the location and scale parameters are also the mean and standard deviation, respectively.
Proof
So the parameters of the normal distribution are usually referred to as the mean and standard deviation rather than location and
scale. The central moments of X can be computed easily from the moments of the standard normal distribution. The ordinary (raw)
moments of X can be computed from the central moments, but the formulas are a bit messy.
For n ∈ N,
1. E[(X − μ)^{2n}] = 1 · 3 ⋯ (2n − 1) σ^{2n} = (2n)! σ^{2n}/(n! 2ⁿ)
2. E[(X − μ)^{2n+1}] = 0
All of the odd central moments of X are 0, a fact that also follows from the symmetry of the probability density function.
In the special distribution simulator select the normal distribution. Vary the mean and standard deviation and note the size and
location of the mean/standard deviation bar. With your choice of parameter settings, run the simulation 1000 times and
compare the empirical mean and standard deviation to the true mean and standard deviation.
The moment generating function m and characteristic function χ of X are given by
1. m(t) = exp(μt + σ²t²/2) for t ∈ R.
2. χ(t) = exp(iμt − σ²t²/2) for t ∈ R.
Proof
Relations
The normal family of distributions satisfies two very important properties: invariance under linear transformations of the variable
and invariance with respect to sums of independent variables. The first property is essentially a restatement of the fact that the
normal distribution is a location-scale family.
Proof
Recall that in general, if X is a random variable with mean μ and standard deviation σ > 0 , then Z = (X − μ)/σ is the standard
score of X. A corollary of the last result is that if X has a normal distribution then the standard score Z has a standard normal
distribution. Conversely, any normally distributed variable can be constructed from a standard normal variable.
Standard score.
1. If X has the normal distribution with mean μ and standard deviation σ, then Z = (X − μ)/σ has the standard normal distribution.
2. If Z has the standard normal distribution and if μ ∈ R and σ ∈ (0, ∞), then X = μ + σZ has the normal distribution with mean μ and standard deviation σ.
Suppose that X₁ and X₂ are independent random variables, and that Xᵢ is normally distributed with mean μᵢ and variance σᵢ² for i ∈ {1, 2}. Then X₁ + X₂ is normally distributed with
1. E(X₁ + X₂) = μ₁ + μ₂
2. var(X₁ + X₂) = σ₁² + σ₂²
Proof
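This result can be checked with Python's statistics.NormalDist, whose addition operator models a sum of independent normal variables (the parameter values below are arbitrary, for illustration only):

```python
from math import hypot, isclose
from statistics import NormalDist

X1 = NormalDist(mu=3.0, sigma=2.0)   # illustrative parameters
X2 = NormalDist(mu=-1.0, sigma=1.5)

S = X1 + X2  # NormalDist addition treats the operands as independent

assert isclose(S.mean, 3.0 + (-1.0))      # means add
assert isclose(S.stdev, hypot(2.0, 1.5))  # variances add: sd = sqrt(4 + 2.25) = 2.5
```
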
This theorem generalizes to a sum of n independent, normal variables. The important part is that the sum is still normal; the
expressions for the mean and variance are standard results that hold for the sum of independent variables generally. As a
consequence of this result and the one for linear transformations, it follows that the normal distribution is stable.
The normal distribution is stable. Specifically, suppose that X has the normal distribution with mean μ ∈ R and variance σ² ∈ (0, ∞). If (X₁, X₂, …, Xₙ) are independent copies of X, then X₁ + X₂ + ⋯ + Xₙ has the same distribution as (n − √n)μ + √n X, namely normal with mean nμ and variance nσ².
Proof
All stable distributions are infinitely divisible, so the normal distribution belongs to this family as well. For completeness, here is
the explicit statement:
The normal distribution is infinitely divisible. Specifically, if X has the normal distribution with mean μ ∈ R and variance σ² ∈ (0, ∞), then for n ∈ N₊, X has the same distribution as X₁ + X₂ + ⋯ + Xₙ where (X₁, X₂, …, Xₙ) are independent, and each has the normal distribution with mean μ/n and variance σ²/n.
Finally, the normal distribution belongs to the family of general exponential distributions.
Suppose that X has the normal distribution with mean μ and variance σ². The distribution is a two-parameter exponential family with natural parameters (μ/σ², −1/(2σ²)) and natural statistics (X, X²).
Proof
A number of other special distributions studied in this chapter are constructed from normally distributed variables. These include
The lognormal distribution
The folded normal distribution, which includes the half normal distribution as a special case
The Rayleigh distribution
The Maxwell distribution
The Lévy distribution
Also, as mentioned at the beginning of this section, the importance of the normal distribution stems in large part from the central
limit theorem, one of the fundamental theorems of probability. By virtue of this theorem, the normal distribution is connected to
many other distributions, by means of limits and approximations, including the special distributions in the following list. Details
are given in the individual sections.
The binomial distribution
The negative binomial distribution
The Poisson distribution
The gamma distribution
The chi-square distribution
The Student t distribution
The Irwin-Hall distribution
Computational Exercises
Suppose that the volume of beer in a bottle of a certain brand is normally distributed with mean 0.5 liter and standard deviation
0.01 liter.
1. Find the probability that a bottle will contain at least 0.48 liter.
2. Find the volume that corresponds to the 95th percentile
Answer
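The computation can be sketched with Python's statistics.NormalDist; the numbers follow from Φ(2) ≈ 0.9772 and the 95% standard normal quantile ≈ 1.6449:

```python
from statistics import NormalDist

volume = NormalDist(mu=0.5, sigma=0.01)  # liters

p = 1 - volume.cdf(0.48)    # P(V >= 0.48) = Φ(2) ≈ 0.9772
q95 = volume.inv_cdf(0.95)  # 95th percentile ≈ 0.5 + 1.6449 * 0.01 ≈ 0.5164

print(round(p, 4), round(q95, 4))
```
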
A metal rod is designed to fit into a circular hole on a certain assembly. The radius of the rod is normally distributed with mean
1 cm and standard deviation 0.002 cm. The radius of the hole is normally distributed with mean 1.01 cm and standard deviation
0.003 cm. The machining processes that produce the rod and the hole are independent. Find the probability that the rod is too big for the hole.
Answer
The weight of a peach from a certain orchard is normally distributed with mean 8 ounces and standard deviation 1 ounce. Find
the probability that the combined weight of 5 peaches exceeds 45 ounces.
Answer
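A sketch of the computation, using the fact that the sum of 5 independent normal weights is normal with mean 40 and standard deviation √5:

```python
from math import sqrt
from statistics import NormalDist

peach = NormalDist(mu=8, sigma=1)  # ounces
# sum of 5 independent peach weights; NormalDist addition assumes independence
total = peach + peach + peach + peach + peach

assert total.mean == 40 and abs(total.stdev - sqrt(5)) < 1e-12
p = 1 - total.cdf(45)  # P(total > 45) = 1 - Φ(5/√5) ≈ 0.0127
print(round(p, 4))
```
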
A Further Generalization
In some settings, it's convenient to consider a constant as having a normal distribution (with mean being the constant and variance
0, of course). This convention simplifies the statements of theorems and definitions in these settings. Of course, the formulas for
the probability density function and the distribution function do not hold for a constant, but the other results involving the moment
generating function, linear transformations, and sums are still valid. Moreover, the result for linear transformations would hold for
all a and b .
This page titled 5.6: The Normal Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.7: The Multivariate Normal Distribution
The multivariate normal distribution is among the most important of multivariate distributions, particularly in statistical inference and the
study of Gaussian processes such as Brownian motion. The distribution arises naturally from linear transformations of independent
normal variables. In this section, we consider the bivariate normal distribution first, because explicit results can be given and because
graphical interpretations are possible. Then, with the aid of matrix notation, we discuss the general multivariate distribution.
The corresponding distribution function is denoted Φ and is considered a special function in mathematics:
Φ(z) = ∫_{−∞}^{z} ϕ(x) dx = ∫_{−∞}^{z} (1/√(2π)) e^{−x²/2} dx,  z ∈ R  (5.7.2)
The moment generating function m of Z is given by
m(t) = E(e^{tZ}) = exp[(1/2) var(tZ)] = e^{t²/2},  t ∈ R  (5.7.3)
Suppose that Z and W are independent random variables, each with the standard normal distribution. The distribution of (Z, W ) is
known as the standard bivariate normal distribution.
The basic properties of the standard bivariate normal distribution follow easily from independence and properties of the (univariate) normal distribution. Recall first that the graph of a function f : R² → R is a surface. For c ∈ R, the set of points {(x, y) ∈ R² : f(x, y) = c} is the level curve of f at level c. The graph of f can be understood by means of the level curves.
The probability density function ϕ₂ of the standard bivariate normal distribution is given by
ϕ₂(z, w) = (1/(2π)) e^{−(z² + w²)/2},  (z, w) ∈ R²  (5.7.4)
Proof
Clearly ϕ₂ has a number of symmetry properties as well: ϕ₂(z, w) is symmetric in z about 0 so that ϕ₂(−z, w) = ϕ₂(z, w); ϕ₂(z, w) is symmetric in w about 0 so that ϕ₂(z, −w) = ϕ₂(z, w); and ϕ₂(z, w) is symmetric in (z, w) so that ϕ₂(z, w) = ϕ₂(w, z). In short, ϕ₂ is radially symmetric: it depends on (z, w) only through z² + w².
Open the bivariate normal experiment, keep the default settings to get the standard bivariate normal distribution. Run the experiment
1000 times. Observe the cloud of points in the scatterplot, and compare the empirical density functions to the probability density
functions.
Suppose that (Z, W) has the standard bivariate normal distribution. The moment generating function m₂ of (Z, W) is given by
m₂(s, t) = E[exp(sZ + tW)] = exp[(1/2) var(sZ + tW)] = exp[(1/2)(s² + t²)],  (s, t) ∈ R²  (5.7.6)
Proof
5.7.1 https://stats.libretexts.org/@go/page/10173
Suppose that (Z, W) has the standard bivariate normal distribution. Let μ, ν ∈ R; σ, τ ∈ (0, ∞); and ρ ∈ (−1, 1), and let X and Y be new random variables defined by
X = μ + σZ  (5.7.7)
Y = ν + τρZ + τ√(1 − ρ²) W  (5.7.8)
The joint distribution of (X, Y) is called the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ).
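The construction is easy to simulate. The sketch below uses arbitrary illustrative parameter values and checks that the sample moments and sample correlation land near (μ, ν, σ, τ, ρ):

```python
import math
import random

mu, nu, sigma, tau, rho = 1.0, -2.0, 2.0, 0.5, 0.6  # arbitrary illustrative values

rng = random.Random(12345)
n = 100_000
xs, ys = [], []
for _ in range(n):
    z, w = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    xs.append(mu + sigma * z)
    ys.append(nu + tau * rho * z + tau * math.sqrt(1 - rho**2) * w)

mx, my = sum(xs) / n, sum(ys) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

# sample mean, sd, and correlation should be close to the chosen parameters
print(round(mx, 2), round(sy, 2), round(r, 2))
```
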
We can use the change of variables formula to find the joint probability density function.
Suppose that (X, Y) has the bivariate normal distribution with the parameters (μ, ν, σ, τ, ρ) as specified above. The joint probability density function f of (X, Y) is given by
f(x, y) = (1/(2πστ√(1 − ρ²))) exp{−(1/(2(1 − ρ²))) [(x − μ)²/σ² − 2ρ(x − μ)(y − ν)/(στ) + (y − ν)²/τ²]},  (x, y) ∈ R²  (5.7.9)
The following theorem gives fundamental properties of the bivariate normal distribution.
Suppose that (X, Y ) has the bivariate normal distribution with parameters (μ, ν , σ, τ , ρ) as specified above. Then
1. X is normally distributed with mean μ and standard deviation σ.
2. Y is normally distributed with mean ν and standard deviation τ .
3. cor(X, Y ) = ρ .
4. X and Y are independent if and only if ρ = 0 .
Proof
Thus, two random variables with a joint normal distribution are independent if and only if they are uncorrelated.
In the bivariate normal experiment, change the standard deviations of X and Y with the scroll bars. Watch the change in the shape of
the probability density functions. Now change the correlation with the scroll bar and note that the probability density functions do
not change. For various values of the parameters, run the experiment 1000 times. Observe the cloud of points in the scatterplot, and
compare the empirical density functions to the probability density functions.
In the case of perfect correlation (ρ = 1 or ρ = −1), the distribution of (X, Y) is also said to be bivariate normal, but degenerate. In this case, we know from our study of covariance and correlation that (X, Y) takes values on the regression line {(x, y) ∈ R² : y = ν + ρ(τ/σ)(x − μ)}, and hence does not have a probability density function (with respect to Lebesgue measure on R²).
In the bivariate normal experiment, run the experiment 1000 times with the values of ρ given below and selected values of σ and τ .
Observe the cloud of points in the scatterplot and compare the empirical density functions to the probability density functions.
1. ρ ∈ {0, 0.3, 0.5, 0.7, 1}
2. ρ ∈ {−0.3, −0.5, −0.7, −1}
Suppose that (X, Y) has the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ) as specified above.
1. For x ∈ R, the conditional distribution of Y given X = x is normal with mean E(Y ∣ X = x) = ν + ρ(τ/σ)(x − μ) and variance var(Y ∣ X = x) = τ²(1 − ρ²).
2. For y ∈ R, the conditional distribution of X given Y = y is normal with mean E(X ∣ Y = y) = μ + ρ(σ/τ)(y − ν) and variance var(X ∣ Y = y) = σ²(1 − ρ²).
Note that the conditional variances do not depend on the value of the given variable.
In the bivariate normal experiment, set the standard deviation of X to 1.5, the standard deviation of Y to 0.5, and the correlation to
0.7.
1. Run the experiment 100 times.
2. For each run, compute E(Y ∣ X = x) the predicted value of Y for the given the value of X.
3. Over all 100 runs, compute the square root of the average of the squared errors between the predicted value of Y and the true
value of Y .
You may be perplexed by the lack of symmetry in how (X, Y) is defined in terms of (Z, W) in the original definition. Note however that the distribution is completely determined by the 5 parameters. If we define X′ = μ + σρZ + σ√(1 − ρ²) W and Y′ = ν + τZ, then (X′, Y′) has the same distribution as (X, Y), namely the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ) (although, of course, (X′, Y′) and (X, Y) are different random vectors). There are other ways to define the same distribution as an affine transformation of (Z, W) as well.
Suppose that (X, Y) has the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ). Then (X, Y) has moment generating function M given by
M(s, t) = E[exp(sX + tY)] = exp[E(sX + tY) + (1/2) var(sX + tY)] = exp[μs + νt + (1/2)(σ²s² + 2ρστ st + τ²t²)],  (s, t) ∈ R²  (5.7.15)
Proof
We showed above that if (X, Y ) has a bivariate normal distribution then the marginal distributions of X and Y are also normal. The
converse is not true.
Transformations
Like its univariate counterpart, the family of bivariate normal distributions is preserved under two types of transformations on the
underlying random vector: affine transformations and sums of independent vectors. We start with a preliminary result on affine
transformations that should help clarify the original definition. Throughout this discussion, we assume that the parameter vector
(μ, ν , σ, τ , ρ) satisfies the usual conditions: μ, ν ∈ R , and σ, τ ∈ (0, ∞), and ρ ∈ (−1, 1).
Suppose that (Z, W) has the standard bivariate normal distribution. Let X = a₁ + b₁Z + c₁W and Y = a₂ + b₂Z + c₂W, where the coefficients are in R and b₁c₂ − c₁b₂ ≠ 0. Then (X, Y) has a bivariate normal distribution with parameters given by
1. E(X) = a₁
2. E(Y) = a₂
3. var(X) = b₁² + c₁²
4. var(Y) = b₂² + c₂²
5. cov(X, Y) = b₁b₂ + c₁c₂
Proof
Now it is easy to show more generally that the bivariate normal distribution is closed with respect to affine transformations.
Suppose that (X, Y) has the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ). Define U = a₁ + b₁X + c₁Y and V = a₂ + b₂X + c₂Y, where the coefficients are in R and b₁c₂ − c₁b₂ ≠ 0. Then (U, V) has a bivariate normal distribution with parameters as follows:
1. E(U) = a₁ + b₁μ + c₁ν
2. E(V) = a₂ + b₂μ + c₂ν
3. var(U) = b₁²σ² + c₁²τ² + 2b₁c₁ρστ
4. var(V) = b₂²σ² + c₂²τ² + 2b₂c₂ρστ
5. cov(U, V) = b₁b₂σ² + c₁c₂τ² + (b₁c₂ + b₂c₁)ρστ
Proof
The bivariate normal distribution is preserved with respect to sums of independent variables.
Suppose that (Xᵢ, Yᵢ) has the bivariate normal distribution with parameters (μᵢ, νᵢ, σᵢ, τᵢ, ρᵢ) for i ∈ {1, 2}, and that (X₁, Y₁) and (X₂, Y₂) are independent. Then (X₁ + X₂, Y₁ + Y₂) has the bivariate normal distribution with parameters given by
1. E(X₁ + X₂) = μ₁ + μ₂
2. E(Y₁ + Y₂) = ν₁ + ν₂
3. var(X₁ + X₂) = σ₁² + σ₂²
4. var(Y₁ + Y₂) = τ₁² + τ₂²
5. cov(X₁ + X₂, Y₁ + Y₂) = ρ₁σ₁τ₁ + ρ₂σ₂τ₂
Proof
Suppose that (Z, W) has the standard bivariate normal distribution. Define the polar coordinates (R, Θ) of (Z, W) by the equations Z = R cos Θ, W = R sin Θ where R ≥ 0 and 0 ≤ Θ < 2π. Then
1. R has probability density function r ↦ r e^{−r²/2} on [0, ∞).
2. Θ is uniformly distributed on [0, 2π).
3. R and Θ are independent.
The distribution of R is known as the standard Rayleigh distribution, named for William Strutt, Lord Rayleigh. The Rayleigh distribution is studied in more detail in a separate section.
Since the quantile function Φ^{−1} of the normal distribution cannot be given in a simple, closed form, we cannot use the usual random quantile method of simulating a normal random variable. However, the quantile method works quite well to simulate a Rayleigh variable, and of course simulating uniform variables is trivial. Hence we have a way of simulating a standard bivariate normal vector with a pair of random numbers (which, you will recall, are independent random variables, each with the standard uniform distribution, that is, the uniform distribution on [0, 1)).
−−−−−−
Suppose that U and V are independent random variables, each with the standard uniform distribution. Let R = √−2 ln U and
Θ = 2πV . Define Z = R cos Θ and W = R sin Θ . Then (Z, W ) has the standard bivariate normal distribution.
Proof
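This is the classical Box–Muller method; a minimal sketch, checking the sample mean and variance of one coordinate:

```python
import math
import random

def std_bivariate_normal(rng: random.Random) -> tuple[float, float]:
    """One draw of (Z, W) via the polar (Box-Muller) method above."""
    u = 1.0 - rng.random()  # in (0, 1], so log(u) is finite
    v = rng.random()
    r = math.sqrt(-2.0 * math.log(u))
    theta = 2.0 * math.pi * v
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(2024)
draws = [std_bivariate_normal(rng) for _ in range(100_000)]
mean_z = sum(z for z, _ in draws) / len(draws)
var_z = sum(z * z for z, _ in draws) / len(draws)
print(round(mean_z, 2), round(var_z, 2))  # should be near 0 and 1
```
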
Of course, if we can simulate (Z, W) with a standard bivariate normal distribution, then we can simulate (X, Y) with the general bivariate normal distribution, with parameters (μ, ν, σ, τ, ρ), by definition (5), namely X = μ + σZ, Y = ν + τρZ + τ√(1 − ρ²) W.
The standard normal distribution on Rⁿ has probability density function ϕₙ and moment generating function mₙ given by
ϕₙ(z) = (1/(2π)^{n/2}) exp(−(1/2) z · z) = (1/(2π)^{n/2}) exp(−(1/2) ∑_{i=1}^{n} zᵢ²),  z = (z₁, z₂, …, zₙ) ∈ Rⁿ  (5.7.32)
mₙ(t) = E[exp(t · Z)] = exp[(1/2) var(t · Z)] = exp((1/2) t · t) = exp((1/2) ∑_{i=1}^{n} tᵢ²),  t = (t₁, t₂, …, tₙ) ∈ Rⁿ  (5.7.33)
Proof
Proof
In the context of this result, recall that the variance-covariance matrix vc(X) = AAᵀ is symmetric and positive definite (and hence also invertible). We will now see that the multivariate normal distribution is completely determined by the expected value vector μ and the variance-covariance matrix V, and hence these give the basic parameters of the distribution.
Suppose that X has an n-dimensional normal distribution with expected value vector μ and variance-covariance matrix V. The probability density function f of X is given by
f(x) = (1/((2π)^{n/2} √(det(V)))) exp[−(1/2)(x − μ) · V⁻¹(x − μ)],  x ∈ Rⁿ  (5.7.34)
Proof
Suppose again that X has an n-dimensional normal distribution with expected value vector μ and variance-covariance matrix V. The moment generating function M of X is given by
M(t) = E[exp(t · X)] = exp[E(t · X) + (1/2) var(t · X)] = exp(t · μ + (1/2) t · V t),  t ∈ Rⁿ  (5.7.39)
Proof
Of course, the moment generating function completely determines the distribution. Thus, if a random vector X in Rⁿ has a moment generating function of the form given above, for some μ ∈ Rⁿ and symmetric, positive definite V ∈ R^{n×n}, then X has the n-dimensional normal distribution with mean μ and variance-covariance matrix V.
Note again that in the representation X = μ + AZ, the distribution of X is uniquely determined by the expected value vector μ and the variance-covariance matrix V = AAᵀ, but not by μ and A. In general, for a given positive definite matrix V, there are many invertible matrices A such that V = AAᵀ (the matrix A is a bit like a square root of V). A theorem in matrix theory states that there is a unique lower triangular matrix L with this property. The representation X = μ + LZ is known as the canonical representation of X.
If X = (X, Y) has the bivariate normal distribution with parameters (μ, ν, σ, τ, ρ), then the lower triangular matrix L such that LLᵀ = vc(X) is
L = [ σ  0 ;  τρ  τ√(1 − ρ²) ]  (5.7.41)
Proof
Note that the matrix L above gives the canonical representation of (X, Y) in terms of the standard normal vector (Z, W) in the original definition, namely X = μ + σZ, Y = ν + τρZ + τ√(1 − ρ²) W.
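A quick check that LLᵀ reproduces the variance-covariance matrix of (X, Y), with arbitrary illustrative parameter values:

```python
import math

sigma, tau, rho = 2.0, 0.5, -0.3  # arbitrary illustrative parameters

# canonical lower triangular factor from the text
L = [[sigma, 0.0],
     [tau * rho, tau * math.sqrt(1 - rho**2)]]

# V = L @ L^T, written out by hand for the 2x2 case
V = [[L[0][0]**2,        L[0][0] * L[1][0]],
     [L[1][0] * L[0][0], L[1][0]**2 + L[1][1]**2]]

# V should equal the variance-covariance matrix of (X, Y)
assert math.isclose(V[0][0], sigma**2)
assert math.isclose(V[1][1], tau**2)
assert math.isclose(V[0][1], rho * sigma * tau)
```
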
If the matrix A ∈ R^{n×n} in the definition is not invertible, then the variance-covariance matrix V = AAᵀ is symmetric, but only positive semi-definite. The random vector X = μ + AZ takes values in a lower dimensional affine subspace of Rⁿ that has measure 0 relative to n-dimensional Lebesgue measure λₙ. Thus, X does not have a probability density function relative to λₙ, and so the distribution is degenerate. However, the formula for the moment generating function still holds. Degenerate normal distributions are discussed in more detail below.
Transformations
The multivariate normal distribution is invariant under two basic types of transformations on the underlying random vectors: affine
transformations (with linearly independent rows), and concatenation of independent vectors. As simple corollaries of these two results,
the normal distribution is also invariant with respect to subsequences of the random vector, re-orderings of the terms in the random
vector, and sums of independent random vectors. The main tool that we will use is the moment generating function. We start with the
first main result on affine transformations.
Suppose that X has the n-dimensional normal distribution with mean vector μ and variance-covariance matrix V. Suppose also that a ∈ Rᵐ and that A ∈ R^{m×n} has linearly independent rows (thus, m ≤ n). Then Y = a + AX has an m-dimensional normal distribution, with
1. E(Y) = a + Aμ
2. vc(Y) = AVAᵀ
Proof
Suppose that X = (X₁, X₂, …, Xₙ) has an n-dimensional normal distribution. If {i₁, i₂, …, iₘ} is a set of distinct indices, then Y = (X_{i₁}, X_{i₂}, …, X_{iₘ}) has an m-dimensional normal distribution.
Proof
In the context of the previous result, if X has mean vector μ and variance-covariance matrix V, then Y has mean vector Aμ and variance-covariance matrix AVAᵀ, where A is the 0-1 matrix defined in the proof. As simple corollaries, note that if X = (X₁, X₂, …, Xₙ) has an n-dimensional normal distribution, then any permutation of the coordinates of X also has an n-dimensional normal distribution, and (X₁, X₂, …, Xₘ) has an m-dimensional normal distribution for any m ≤ n.
Conversely, concatenating independent normally distributed vectors produces another normally distributed vector.
Suppose that X has the m-dimensional normal distribution with mean vector μ and variance-covariance matrix U, Y has the n-dimensional normal distribution with mean vector ν and variance-covariance matrix V, and that X and Y are independent. Then Z = (X, Y) has the (m + n)-dimensional normal distribution with
1. E(X, Y) = (μ, ν)
2. vc(X, Y) = [ vc(X)  0 ;  0ᵀ  vc(Y) ], where 0 is the m × n zero matrix.
Proof
Just as in the univariate case, the normal family of distributions is closed with respect to sums of independent variables. The proof
follows easily from the previous result.
Suppose that X has the n -dimensional normal distribution with mean vector μ and variance-covariance matrix U , Y has the n -
dimensional normal distribution with mean vector ν and variance-covariance matrix V , and that X and Y are independent. Then
X +Y has the n -dimensional normal distribution with
1. E(X + Y ) = μ + ν
2. vc(X + Y ) = U + V
Proof
We close with a trivial corollary to the general result on affine transformation, but this corollary points the way to a further generalization
of the multivariate normal distribution that includes the degenerate distributions.
Suppose that X has an n-dimensional normal distribution with mean vector μ and variance-covariance matrix V, and that a ∈ Rⁿ. Then a · X has a (univariate) normal distribution, with mean a · μ and variance a · Va.
A Further Generalization
The last result can be used to give a simple, elegant definition of the multivariate normal distribution that includes the degenerate
distributions as well as the ones we have considered so far. First we will adopt our general definition of the univariate normal distribution
that includes constant random variables.
A random variable X that takes values in Rⁿ has an n-dimensional normal distribution if and only if a · X has a univariate normal distribution for every a ∈ Rⁿ.
Although an n-dimensional normal distribution may not have a probability density function with respect to n-dimensional Lebesgue measure λₙ, the form of the moment generating function is unchanged.
Suppose that X has mean vector μ and variance-covariance matrix V , and that X has an n -dimensional normal distribution. The
moment generating function of X is given by
E[exp(t · X)] = exp[E(t · X) + (1/2) var(t · X)] = exp(t · μ + (1/2) t · V t),  t ∈ Rⁿ  (5.7.47)
Proof
Suppose that X has an n-dimensional normal distribution in the sense of the general definition, and that the distribution of X has a probability density function on Rⁿ with respect to Lebesgue measure λₙ. Then X has an n-dimensional normal distribution in the sense of our original definition.
This page titled 5.7: The Multivariate Normal Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.8: The Gamma Distribution
In this section we will study a family of distributions that has special importance in probability and statistics. In particular, the
arrival times in the Poisson process have gamma distributions, and the chi-square distribution in statistics is a special case of the
gamma distribution. Also, the gamma distribution is widely used to model physical quantities that take positive values.
Definition
The gamma function Γ is defined as follows:
Γ(k) = ∫₀^∞ x^{k−1} e^{−x} dx,  k ∈ (0, ∞)  (5.8.1)
The function is well defined, that is, the integral converges for any k >0 . On the other hand, the integral diverges to ∞ for
k ≤ 0.
Proof
Figure 5.8.1 : The graph of the gamma function on the interval (0, 5)
Properties
Here are a few of the essential properties of the gamma function. The first is the fundamental identity: Γ(k + 1) = k Γ(k) for k ∈ (0, ∞).
It's clear from the fundamental identity that the gamma function is a continuous extension of the factorial function:
Γ(k + 1) = k! for k ∈ N.
Proof
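Both facts are easy to verify numerically with Python's math.gamma:

```python
import math

# Γ(k + 1) = k! for k ∈ N, and Γ(1/2) = √π (shown below)
for k in range(10):
    assert math.isclose(math.gamma(k + 1), math.factorial(k))
assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))

# the fundamental identity Γ(k + 1) = k Γ(k) for arbitrary k > 0
for k in (0.3, 1.7, 4.2):
    assert math.isclose(math.gamma(k + 1), k * math.gamma(k))
```
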
5.8.1 https://stats.libretexts.org/@go/page/10174
The values of the gamma function for non-integer arguments generally cannot be expressed in simple, closed forms. However,
there are exceptions.
Γ(1/2) = √π.
Proof
For n ∈ N,
Γ((2n + 1)/2) = (1 · 3 ⋯ (2n − 1)/2ⁿ) √π = ((2n)!/(4ⁿ n!)) √π  (5.8.11)
Proof
Stirling's Approximation
One of the most famous asymptotic formulas for the gamma function is Stirling's formula, named for James Stirling. First we need
to recall a definition.
Suppose that f, g : D → (0, ∞) where D = (0, ∞) or D = N₊. Then f(x) ≈ g(x) as x → ∞ means that
f(x)/g(x) → 1 as x → ∞  (5.8.12)
Stirling's formula
Γ(x + 1) ≈ (x/e)ˣ √(2πx) as x → ∞  (5.8.13)
As a special case, Stirling's result gives an asymptotic formula for the factorial function:
n! ≈ (n/e)ⁿ √(2πn) as n → ∞  (5.8.14)
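A numerical check of Stirling's formula; the computation is done in log space (via math.lgamma) to avoid overflow for large n:

```python
import math

def log_stirling(n: float) -> float:
    """log of Stirling's approximation to n! = Γ(n + 1)."""
    return n * (math.log(n) - 1.0) + 0.5 * math.log(2.0 * math.pi * n)

# the ratio n! / stirling(n) tends to 1 as n grows
for n in (10, 100, 1000):
    ratio = math.exp(math.lgamma(n + 1) - log_stirling(n))
    assert abs(ratio - 1.0) < 1.0 / (10.0 * n)  # the relative error is roughly 1/(12n)
```
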
The standard gamma distribution with shape parameter k ∈ (0, ∞) has probability density function f given by f(x) = (1/Γ(k)) x^{k−1} e^{−x} for x ∈ (0, ∞). Clearly f is a valid probability density function, since f(x) > 0 for x > 0, and by definition, Γ(k) is the normalizing constant for the function x ↦ x^{k−1} e^{−x} on (0, ∞). The following theorem shows that the gamma density has a rich variety of shapes.
The gamma probability density function f with shape parameter k ∈ (0, ∞) satisfies the following properties:
1. If 0 < k < 1 , f is decreasing with f (x) → ∞ as x ↓ 0.
2. If k = 1 , f is decreasing with f (0) = 1 .
3. If k > 1 , f increases and then decreases, with mode at k − 1 .
4. If 0 < k ≤ 1 , f is concave upward.
5. If 1 < k ≤ 2, f is concave downward and then upward, with inflection point at k − 1 + √(k − 1).
6. If k > 2, f is concave upward, then downward, then upward again, with inflection points at k − 1 ± √(k − 1).
Proof
The special case k = 1 gives the standard exponential distribution. When k ≥ 1, the distribution is unimodal.
In the special distribution simulator, select the gamma distribution. Vary the shape parameter and note the
shape of the density function. For various values of k , run the simulation 1000 times and compare the empirical density
function to the true probability density function.
The distribution function and the quantile function do not have simple, closed representations for most values of the shape
parameter. However, the distribution function has a trivial representation in terms of the incomplete and complete gamma
functions.
The distribution function F of the standard gamma distribution with shape parameter k ∈ (0, ∞) is given by
F(x) = Γ(k, x)/Γ(k),  x ∈ (0, ∞)  (5.8.16)
where Γ(k, x) = ∫₀ˣ t^{k−1} e^{−t} dt is the (lower) incomplete gamma function.
Approximate values of the distribution and quantile functions can be obtained from the special distribution calculator, and from most mathematical and statistical software packages.
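The distribution function can also be computed directly from the power series γ(k, x) = x^k e^{−x} ∑_{n≥0} xⁿ/(k(k+1)⋯(k+n)) for the lower incomplete gamma function. The helper below is a sketch of this approach (the function name is my own, and the series is adequate only for moderate x and k):

```python
import math

def gamma_cdf(x: float, k: float, tol: float = 1e-12) -> float:
    """F(x) = γ(k, x)/Γ(k) via the power series for the lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / k      # n = 0 term of the series, divided by x^k
    total = term
    n = 0
    while term > tol * total:
        n += 1
        term *= x / (k + n)
        total += term
    # multiply back the x^k e^{-x} factor and divide by Γ(k), in log space
    return total * math.exp(k * math.log(x) - x - math.lgamma(k))

# sanity check: for k = 1 the standard gamma distribution is standard exponential
assert math.isclose(gamma_cdf(2.0, 1.0), 1.0 - math.exp(-2.0), rel_tol=1e-9)
```
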
Using the special distribution calculator, find the median, the first and third quartiles, and the interquartile range in each of the
following cases:
1. k = 1
2. k = 2
3. k = 3
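The quartiles in this exercise can also be checked numerically. The sketch below assumes Python with the scipy library, which is not part of the text:

```python
from scipy.stats import gamma

# Quartiles of the standard gamma distribution (scale b = 1)
# for the shape parameters in the exercise.
for k in (1, 2, 3):
    q1, med, q3 = gamma.ppf([0.25, 0.5, 0.75], a=k)
    print(f"k={k}: Q1={q1:.4f}, median={med:.4f}, Q3={q3:.4f}, IQR={q3 - q1:.4f}")
```

For k = 1 (the standard exponential distribution) the median is ln 2 ≈ 0.6931, which gives a quick sanity check on the output.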
Moments
Suppose that X has the standard gamma distribution with shape parameter k ∈ (0, ∞) . The mean and variance are both simply the
shape parameter.
In the simulation of the special distribution simulator, select the gamma distribution. Vary the shape parameter and note the
size and location of the mean ± standard deviation bar. For selected values of k , run the simulation 1000 times and compare
the empirical mean and standard deviation to the distribution mean and standard deviation.
More generally, the moments can be expressed easily in terms of the gamma function, and the skewness and kurtosis follow:
1. skew(X) = 2/√k
2. kurt(X) = 3 + 6/k
Proof
In particular, note that skew(X) → 0 and kurt(X) → 3 as k → ∞ . Note also that the excess kurtosis kurt(X) − 3 → 0 as
k → ∞.
In the simulation of the special distribution simulator, select the gamma distribution. Increase the shape parameter and note the
shape of the density function in light of the previous results on skewness and kurtosis. For various values of k , run the
simulation 1000 times and compare the empirical density function to the true probability density function.
Proof
If Z has the standard gamma distribution with shape parameter k ∈ (0, ∞) and if b ∈ (0, ∞) , then X = bZ has the gamma
distribution with shape parameter k and scale parameter b .
The reciprocal of the scale parameter, r = 1/b, is known as the rate parameter, particularly in the context of the Poisson process.
The gamma distribution with parameters k = 1 and b is called the exponential distribution with scale parameter b (or rate
parameter r = 1/b ). More generally, when the shape parameter k is a positive integer, the gamma distribution is known as the
Erlang distribution, named for the Danish mathematician Agner Erlang. The exponential distribution governs the time between
arrivals in the Poisson model, while the Erlang distribution governs the actual arrival times.
Basic properties of the general gamma distribution follow easily from corresponding properties of the standard distribution and
basic results for scale transformations.
Distribution Functions
Suppose that X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞).
Proof
Recall that the inclusion of a scale parameter does not change the shape of the density function, but simply scales the graph
horizontally and vertically. In particular, we have the same basic shapes as for the standard gamma density function.
In the simulation of the special distribution simulator, select the gamma distribution. Vary the shape and scale parameters and
note the shape and location of the probability density function. For various values of the parameters, run the simulation 1000
times and compare the empirical density function to the true probability density function.
Once again, the distribution function and the quantile function do not have simple, closed representations for most values of the
shape parameter. However, the distribution function has a simple representation in terms of the incomplete and complete gamma
functions.
The distribution function F of X is given by
F(x) = Γ(k, x/b) / Γ(k), x ∈ (0, ∞) (5.8.24)
Proof
Approximate values of the distribution and quantile functions can be obtained from the special distribution calculator, and from most
mathematical and statistical software packages.
Open the special distribution calculator. Vary the shape and scale parameters and note the shape and location of the distribution
and quantile functions. For selected values of the parameters, find the median and the first and third quartiles.
Moments
Suppose again that X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞).
Proof
In the special distribution simulator, select the gamma distribution. Vary the parameters and note the shape and location of the
mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical
mean and standard deviation to the distribution mean and standard deviation.
Note also that E(X^a) = ∞ if a ≤ −k. Recall that skewness and kurtosis are defined in terms of the standard score, and hence are unchanged by a scale transformation:
1. skew(X) = 2/√k
2. kurt(X) = 3 + 6/k
The moment generating function of X is given by
E(e^{tX}) = 1 / (1 − bt)^k, t < 1/b (5.8.25)
Proof
Relations
Our first result is simply a restatement of the meaning of the term scale parameter.
Suppose that X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) . If c ∈ (0, ∞),
then cX has the gamma distribution with shape parameter k and scale parameter bc.
Proof
More importantly, if the scale parameter is fixed, the gamma family is closed with respect to sums of independent variables.
Suppose that X1 and X2 are independent random variables, and that Xi has the gamma distribution with shape parameter ki ∈ (0, ∞) and scale parameter b ∈ (0, ∞) for i ∈ {1, 2}. Then X1 + X2 has the gamma distribution with shape parameter k1 + k2 and scale parameter b.
Proof
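This closure property can be checked numerically by convolving two gamma densities. The sketch below assumes Python with numpy and scipy, which are not part of the text:

```python
import numpy as np
from scipy.stats import gamma

# Numerically convolve the densities of Gamma(k1, b) and Gamma(k2, b)
# and compare the result with the density of Gamma(k1 + k2, b).
k1, k2, b = 2.0, 3.0, 1.5
dx = 0.01
x = np.arange(dx, 60.0, dx)
f1 = gamma.pdf(x, a=k1, scale=b)
f2 = gamma.pdf(x, a=k2, scale=b)
conv = np.convolve(f1, f2)[:len(x)] * dx   # approximate density of X1 + X2
exact = gamma.pdf(x, a=k1 + k2, scale=b)
err = np.max(np.abs(conv - exact))
print(f"max pointwise discrepancy: {err:.2e}")
```

The discrepancy is only the discretization error of the Riemann-sum convolution, and it shrinks as dx is reduced.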
From the previous result, it follows that the gamma distribution is infinitely divisible:
Suppose that X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). For n ∈ N+, X has the same distribution as ∑_{i=1}^n Xi, where (X1, X2, …, Xn) is a sequence of independent random variables, each with the gamma distribution with shape parameter k/n and scale parameter b.
From the sum result and the central limit theorem, it follows that if k is large, the gamma distribution with shape parameter k and scale parameter b can be approximated by the normal distribution with mean kb and variance kb². Here is the precise statement:
Suppose that Xk has the gamma distribution with shape parameter k ∈ (0, ∞) and fixed scale parameter b ∈ (0, ∞). Then the distribution of the standardized variable below converges to the standard normal distribution as k → ∞:
Zk = (Xk − kb) / (√k b) (5.8.27)
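The quality of this normal approximation can be examined directly. A sketch assuming Python with scipy (not part of the text):

```python
import math
from scipy.stats import gamma, norm

# Compare the gamma CDF with its normal approximation N(kb, k b^2)
# at the point one standard deviation above the mean.
k, b = 100.0, 2.0
x = k * b + math.sqrt(k) * b
exact = gamma.cdf(x, a=k, scale=b)
approx = norm.cdf(x, loc=k * b, scale=math.sqrt(k) * b)
print(f"exact {exact:.4f}, normal approximation {approx:.4f}")
```

For k = 100 the two values already agree to about two decimal places.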
In the special distribution simulator, select the gamma distribution. For various values of the scale parameter, increase the
shape parameter and note the increasingly “normal” shape of the density function. For selected values of the parameters, run
the experiment 1000 times and compare the empirical density function to the true probability density function.
The gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) is a two-parameter exponential
family with natural parameters (k − 1, −1/b) , and natural statistics (ln X, X).
Proof
For n ∈ (0, ∞) , the gamma distribution with shape parameter n/2 and scale parameter 2 is known as the chi-square distribution
with n degrees of freedom. The chi-square distribution is important enough to deserve a separate section.
Computational Exercise
Suppose that the lifetime of a device (in hours) has the gamma distribution with shape parameter k = 4 and scale parameter b = 100.
1. Find the probability that the device will last more than 300 hours.
2. Find the mean and standard deviation of the lifetime.
Answer
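The answer can be verified numerically. A sketch assuming Python with scipy (the text itself uses the special distribution calculator):

```python
import math
from scipy.stats import gamma

# Lifetime X ~ Gamma(shape k = 4, scale b = 100), in hours.
k, b = 4, 100
p = gamma.sf(300, a=k, scale=b)    # P(X > 300); closed form is 13 e^{-3}
mean = k * b                       # kb = 400
sd = math.sqrt(k) * b              # sqrt(k) b = 200
print(f"P(X > 300) = {p:.4f}, mean = {mean}, sd = {sd}")
```

Since k is a positive integer, P(X > 300) = e^{−3}(1 + 3 + 9/2 + 27/6) = 13 e^{−3} ≈ 0.6472, which matches the survival-function computation.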
Suppose that Y has the gamma distribution with parameters k = 10 and b = 2 . For each of the following, compute the true
value using the special distribution calculator and then compute the normal approximation. Compare the results.
1. P(18 < Y < 25)
2. The 80th percentile of Y
Answer
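Both the true values and the normal approximations can be computed in a few lines. A sketch assuming Python with scipy (not part of the text):

```python
import math
from scipy.stats import gamma, norm

# Y ~ Gamma(shape k = 10, scale b = 2): mean kb = 20, sd sqrt(k) b ≈ 6.325.
k, b = 10, 2
true_p = gamma.cdf(25, a=k, scale=b) - gamma.cdf(18, a=k, scale=b)
mu, sigma = k * b, math.sqrt(k) * b
approx_p = norm.cdf(25, mu, sigma) - norm.cdf(18, mu, sigma)
true_q = gamma.ppf(0.80, a=k, scale=b)     # 80th percentile
approx_q = norm.ppf(0.80, mu, sigma)
print(true_p, approx_p, true_q, approx_q)
```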
This page titled 5.8: The Gamma Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.9: Chi-Square and Related Distribution
In this section we will study a distribution, and some relatives, that have special importance in statistics. In particular, the chi-
square distribution will arise in the study of the sample variance when the underlying distribution is normal and in goodness of fit
tests.
So the chi-square distribution is a continuous distribution on (0, ∞). For reasons that will be clear later, n is usually a positive
integer, although technically this is not a mathematical requirement. When n is a positive integer, the gamma function in the
normalizing constant can be given explicitly.
If n ∈ N + then
1. Γ(n/2) = (n/2 − 1)! if n is even.
2. Γ(n/2) = [(n − 1)! / (2^{n−1} (n/2 − 1/2)!)] √π if n is odd.
The chi-square probability density function with n ∈ (0, ∞) degrees of freedom satisfies the following properties:
1. If 0 < n < 2, f is decreasing with f(x) → ∞ as x ↓ 0.
2. If n = 2, f is decreasing with f(0) = 1/2.
In the special distribution simulator, select the chi-square distribution. Vary n with the scroll bar and note the shape of the
probability density function. For selected values of n , run the simulation 1000 times and compare the empirical density
function to the true probability density function.
The distribution function and the quantile function do not have simple, closed-form representations for most values of the
parameter. However, the distribution function can be given in terms of the complete and incomplete gamma functions.
Suppose that X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom. The distribution function F of X is given
by
F(x) = Γ(n/2, x/2) / Γ(n/2), x ∈ (0, ∞) (5.9.2)
Approximate values of the distribution and quantile functions can be obtained from the special distribution calculator, and from
most mathematical and statistical software packages.
In the special distribution calculator, select the chi-square distribution. Vary the parameter and note the shape of the probability
density, distribution, and quantile functions. In each of the following cases, find the median, the first and third quartiles, and
the interquartile range.
5.9.1 https://stats.libretexts.org/@go/page/10175
1. n = 1
2. n = 2
3. n = 5
4. n = 10
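The requested quartiles can be checked numerically. A sketch assuming Python with scipy (not part of the text):

```python
from scipy.stats import chi2

# Median, quartiles, and IQR of the chi-square distribution
# for the degrees of freedom in the exercise.
for n in (1, 2, 5, 10):
    q1, med, q3 = chi2.ppf([0.25, 0.5, 0.75], df=n)
    print(f"n={n}: Q1={q1:.4f}, median={med:.4f}, Q3={q3:.4f}, IQR={q3 - q1:.4f}")
```

For n = 2 the distribution is exponential with scale 2, so the median is 2 ln 2 ≈ 1.3863, a useful spot check.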
Moments
The mean, variance, moments, and moment generating function of the chi-square distribution can be obtained easily from general
results for the gamma distribution.
In the simulation of the special distribution simulator, select the chi-square distribution. Vary n with the scroll bar and note the
size and location of the mean ± standard deviation bar. For selected values of n , run the simulation 1000 times and compare
the empirical moments to the distribution moments.
The skewness and kurtosis of the chi-square distribution are given next.
Note that skew(X) → 0 and kurt(X) → 3 as n → ∞ . In particular, the excess kurtosis kurt(X) − 3 → 0 as n → ∞ .
In the simulation of the special distribution simulator, select the chi-square distribution. Increase n with the scroll bar and note
the shape of the probability density function in light of the previous results on skewness and kurtosis. For selected values of n ,
run the simulation 1000 times and compare the empirical density function to the true probability density function.
The next result gives the general moments of the chi-square distribution.
If X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, then for k > −n/2 ,
E(X^k) = 2^k Γ(n/2 + k) / Γ(n/2) (5.9.3)
In particular, if k ∈ N+ then
E(X^k) = 2^k (n/2)(n/2 + 1) ⋯ (n/2 + k − 1) (5.9.4)
Note also E(X^k) = ∞ if k ≤ −n/2.
If X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, then X has moment generating function
E(e^{tX}) = 1 / (1 − 2t)^{n/2}, t < 1/2 (5.9.5)
Relations
The chi-square distribution is connected to a number of other special distributions. Of course, the most important relationship is the
definition—the chi-square distribution with n degrees of freedom is a special case of the gamma distribution, corresponding to
shape parameter n/2 and scale parameter 2. On the other hand, any gamma distributed variable can be re-scaled into a variable
with a chi-square distribution.
If X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) then Y = (2/b)X has the chi-square distribution with 2k degrees of freedom.
Proof
The chi-square distribution with 2 degrees of freedom is the exponential distribution with scale parameter 2.
Proof
If Z has the standard normal distribution then X = Z² has the chi-square distribution with 1 degree of freedom.
Proof
Recall that if we add independent gamma variables with a common scale parameter, the resulting random variable also has a
gamma distribution, with the common scale parameter and with shape parameter that is the sum of the shape parameters of the
terms. Specializing to the chi-square distribution, we have the following important result:
If X has the chi-square distribution with m ∈ (0, ∞) degrees of freedom, Y has the chi-square distribution with n ∈ (0, ∞)
degrees of freedom, and X and Y are independent, then X + Y has the chi-square distribution with m + n degrees of
freedom.
The last two results lead to the following theorem, which is fundamentally important in statistics.
Suppose that n ∈ N+ and that (Z1, Z2, …, Zn) is a sequence of independent standard normal variables. Then the sum of the squares
V = ∑_{i=1}^n Zi² (5.9.8)
has the chi-square distribution with n degrees of freedom.
This theorem is the reason that the chi-square distribution deserves a name of its own, and the reason that the degrees of freedom
parameter is usually a positive integer. Sums of squares of independent normal variables occur frequently in statistics.
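The sum-of-squares characterization can be illustrated by simulation. A sketch assuming Python with numpy and scipy (not part of the text):

```python
import numpy as np
from scipy.stats import chi2

# Monte Carlo sketch: the sum of squares of n independent standard
# normal variables should follow the chi-square distribution with n
# degrees of freedom (mean n, variance 2n).
rng = np.random.default_rng(12345)
n, reps = 5, 100_000
v = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)
print(f"sample mean {v.mean():.3f} (theory {n})")
print(f"sample variance {v.var():.3f} (theory {2 * n})")
# The empirical probability below the chi-square median should be about 1/2.
emp = (v <= chi2.ppf(0.5, n)).mean()
print(f"empirical P(V <= median) = {emp:.4f}")
```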
From the central limit theorem, and previous results for the gamma distribution, it follows that if n is large, the chi-square
distribution with n degrees of freedom can be approximated by the normal distribution with mean n and variance 2n. Here is the
precise statement:
If Xn has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, then the distribution of the standard score
Zn = (Xn − n) / √(2n) (5.9.9)
converges to the standard normal distribution as n → ∞.
In the simulation of the special distribution simulator, select the chi-square distribution. Start with n = 1 and increase n . Note
the shape of the probability density function in light of the previous theorem. For selected values of n , run the experiment 1000
times and compare the empirical density function to the true density function.
Suppose that X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom. For k ∈ N+, X has the same distribution as ∑_{i=1}^k Xi, where (X1, X2, …, Xk) is a sequence of independent random variables, each with the chi-square distribution with n/k degrees of freedom.
Also like the gamma distribution, the chi-square distribution is a member of the general exponential family of distributions:
The chi-square distribution with n ∈ (0, ∞) degrees of freedom is a one-parameter exponential family with natural parameter n/2 − 1, and natural statistic ln X.
Proof
So like the chi-square distribution, the chi distribution is a continuous distribution on (0, ∞).
Distribution Functions
The distribution function G of the chi distribution with n ∈ (0, ∞) degrees of freedom is given by
G(u) = Γ(n/2, u²/2) / Γ(n/2), u ∈ (0, ∞) (5.9.11)
Proof
The probability density function g of the chi distribution with n ∈ (0, ∞) degrees of freedom is given by
g(u) = [1 / (2^{n/2−1} Γ(n/2))] u^{n−1} e^{−u²/2}, u ∈ (0, ∞) (5.9.13)
Proof
The chi probability density function with n ∈ (0, ∞) degrees of freedom satisfies the following properties:
1. If 0 < n < 1, g is decreasing with g(u) → ∞ as u ↓ 0.
2. If n = 1, g is decreasing with g(0) = √(2/π).
3. If n > 1, g increases and then decreases with mode u = √(n − 1).
4. If 0 < n < 1, g is concave upward.
5. If 1 ≤ n ≤ 2, g is concave downward and then upward with inflection point at u = √((1/2)[2n − 1 + √(8n − 7)]).
6. If n > 2, g is concave upward then downward then upward again with inflection points at u = √((1/2)[2n − 1 ± √(8n − 7)]).
Moments
The raw moments of the chi distribution are easy to compute in terms of the gamma function.
Suppose that U has the chi distribution with n ∈ (0, ∞) degrees of freedom. Then
E(U^k) = 2^{k/2} Γ[(n + k)/2] / Γ(n/2), k ∈ (0, ∞) (5.9.15)
Proof
Suppose again that U has the chi distribution with n ∈ (0, ∞) degrees of freedom. Then
1. E(U) = 2^{1/2} Γ[(n + 1)/2] / Γ(n/2)
2. E(U²) = n
3. var(U) = n − 2 Γ²[(n + 1)/2] / Γ²(n/2)
Proof
Relations
The fundamental relationship of course is the one between the chi distribution and the chi-square distribution given in the
definition. In turn, this leads to a fundamental relationship between the chi distribution and the normal distribution.
Suppose that n ∈ N+ and that (Z1, Z2, …, Zn) is a sequence of independent variables, each with the standard normal distribution. Then
U = √(Z1² + Z2² + ⋯ + Zn²) (5.9.19)
has the chi distribution with n degrees of freedom.
Note that the random variable U in the last result is the standard Euclidean norm of (Z1, Z2, …, Zn), thought of as a vector in R^n. Note also that the chi distribution with 1 degree of freedom is the distribution of |Z|, the absolute value of a standard normal variable, which is known as the standard half-normal distribution.
Suppose that n ∈ N+ and that (X1, X2, …, Xn) is a sequence of independent variables, where Xk has the normal distribution with mean μk ∈ R and variance 1 for k ∈ {1, 2, …, n}. The distribution of Y = ∑_{k=1}^n Xk² is the non-central chi-square distribution with n degrees of freedom and non-centrality parameter λ = ∑_{k=1}^n μk².
Note that the degrees of freedom is a positive integer while the non-centrality parameter λ ∈ [0, ∞) , but we will soon generalize
the degrees of freedom.
Distribution Functions
Like the chi-square and chi distributions, the non-central chi-square distribution is a continuous distribution on (0, ∞). The
probability density function and distribution function do not have simple, closed expressions, but there is a fascinating connection
to the Poisson distribution. To set up the notation, let fk and Fk denote the probability density and distribution functions of the chi-square distribution with k ∈ (0, ∞) degrees of freedom. Suppose that Y has the non-central chi-square distribution with n ∈ N+
degrees of freedom and non-centrality parameter λ ∈ [0, ∞). The following fundamental theorem gives the probability density
function of Y as an infinite series, and shows that the distribution does in fact depend only on n and λ .
Proof
The function k ↦ e^{−λ/2} (λ/2)^k / k! on N is the probability density function of the Poisson distribution with parameter λ/2. So it
follows that if N has the Poisson distribution with parameter λ/2 and the conditional distribution of Y given N is chi-square with
parameter n + 2N , then Y has the distribution discussed here—non-central chi-square with n degrees of freedom and non-
centrality parameter λ . Moreover, it's clear that g is a valid probability density function for any n ∈ (0, ∞) , so we can generalize
our definition a bit.
For n ∈ (0, ∞) and λ ∈ [0, ∞), the distribution with probability density function g above is the non-central chi-square
distribution with n degrees of freedom and non-centrality parameter λ .
Proof
Moments
In this discussion, we assume again that Y has the non-central chi-square distribution with n ∈ (0, ∞) degrees of freedom and
non-centrality parameter λ ∈ [0, ∞).
Proof
1. skew(Y) = 2^{3/2} (n + 3λ) / (n + 2λ)^{3/2}
2. kurt(Y) = 3 + 12(n + 4λ) / (n + 2λ)²
Note that skew(Y) → 0 as n → ∞ or as λ → ∞. Note also that the excess kurtosis is kurt(Y) − 3 = 12(n + 4λ) / (n + 2λ)², so kurt(Y) → 3 as n → ∞ or as λ → ∞.
Relations
Trivially of course, the ordinary chi-square distribution is a special case of the non-central chi-square distribution, with non-centrality parameter 0. The most important relation is the original definition above. The non-central chi-square distribution with n ∈ N+ degrees of freedom and non-centrality parameter λ ∈ [0, ∞) is the distribution of the sum of the squares of n independent normal variables with variance 1 and whose means satisfy ∑_{k=1}^n μk² = λ. The next most important relation is the one that arose in the probability density function and was so useful for computing moments. We state this one again for emphasis.
Suppose that N has the Poisson distribution with parameter λ/2, where λ ∈ (0, ∞), and that the conditional distribution of Y given N is chi-square with n + 2N degrees of freedom, where n ∈ (0, ∞). Then the (unconditional) distribution of Y is non-central chi-square with n degrees of freedom and non-centrality parameter λ.
Proof
As the asymptotic results for the skewness and kurtosis suggest, there is also a central limit theorem.
Suppose that Y has the non-central chi-square distribution with n ∈ (0, ∞) degrees of freedom and non-centrality parameter λ ∈ (0, ∞). Then the distribution of the standard score
Z = [Y − (n + λ)] / √(2(n + 2λ)) (5.9.33)
converges to the standard normal distribution as n → ∞ or as λ → ∞.
Computational Exercises
Suppose that a missile is fired at a target at the origin of a plane coordinate system, with units in meters. The missile lands at
(X, Y ) where X and Y are independent and each has the normal distribution with mean 0 and variance 100. The missile will
destroy the target if it lands within 20 meters of the target. Find the probability of this event.
Answer
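This probability has a closed form: (X/10)² + (Y/10)² has the chi-square distribution with 2 degrees of freedom. A numerical sketch assuming Python with scipy (not part of the text):

```python
from scipy.stats import chi2

# (X/10)^2 + (Y/10)^2 ~ chi-square with 2 degrees of freedom, so
# P(X^2 + Y^2 <= 20^2) = P(chi2_2 <= 400/100) = 1 - exp(-2).
p = chi2.cdf(400 / 100, df=2)
print(f"P(hit) = {p:.4f}")   # 1 - exp(-2) ≈ 0.8647
```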
Suppose that X has the chi-square distribution with n = 18 degrees of freedom. For each of the following, compute the true
value using the special distribution calculator and then compute the normal approximation. Compare the results.
1. P(15 < X < 20)
2. The 75th percentile of X.
Answer
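Both the exact values and the normal approximations can be computed directly. A sketch assuming Python with scipy (not part of the text):

```python
import math
from scipy.stats import chi2, norm

# X ~ chi-square with n = 18 degrees of freedom: mean 18, variance 36.
n = 18
sd = math.sqrt(2 * n)
true_p = chi2.cdf(20, n) - chi2.cdf(15, n)
approx_p = norm.cdf(20, n, sd) - norm.cdf(15, n, sd)
true_q = chi2.ppf(0.75, n)
approx_q = norm.ppf(0.75, n, sd)
print(true_p, approx_p, true_q, approx_q)
```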
This page titled 5.9: Chi-Square and Related Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
5.10: The Student t Distribution
In this section we will study a distribution that has special importance in statistics. In particular, this distribution will arise in the
study of a standardized version of the sample mean when the underlying distribution is normal.
Basic Theory
Definition
Suppose that Z has the standard normal distribution, V has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, and that Z and V are independent. Random variable
T = Z / √(V/n) (5.10.1)
has the student t distribution with n degrees of freedom.
The student t distribution is well defined for any n > 0 , but in practice, only positive integer values of n are of interest. This
distribution was first studied by William Gosset, who published under the pseudonym Student.
Distribution Functions
Suppose that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then T has a continuous distribution on R with
probability density function f given by
f(t) = [Γ[(n + 1)/2] / (√(nπ) Γ(n/2))] (1 + t²/n)^{−(n+1)/2}, t ∈ R (5.10.2)
Proof
The proof of this theorem provides a good way of thinking of the t distribution: the distribution arises when the variance of a mean
0 normal distribution is randomized in a certain way.
In the special distribution simulator, select the student t distribution. Vary n and note the shape of the probability density
function. For selected values of n , run the simulation 1000 times and compare the empirical density function to the true
probability density function.
The Student probability density function f with n ∈ (0, ∞) degrees of freedom has the following properties:
1. f is symmetric about t = 0 .
2. f is increasing and then decreasing with mode t = 0 .
3. f is concave upward, then downward, then upward again with inflection points at ±√(n/(n + 2)).
4. f (t) → 0 as t → ∞ and as t → −∞ .
In particular, the distribution is unimodal with mode and median at t =0 . Note also that the inflection points converge to ±1 as
n → ∞.
The distribution function and the quantile function of the general t distribution do not have simple, closed-form representations.
Approximate values of these functions can be obtained from the special distribution calculator, and from most mathematical and
statistical software packages.
In the special distribution calculator, select the student distribution. Vary the parameter and note the shape of the probability
density, distribution, and quantile functions. In each of the following cases, find the first and third quartiles:
1. n = 2
2. n = 5
3. n = 10
5.10.1 https://stats.libretexts.org/@go/page/10176
4. n = 20
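The quartiles can be checked numerically; by symmetry, Q1 = −Q3. A sketch assuming Python with scipy (not part of the text):

```python
from scipy.stats import t

# First and third quartiles of the t distribution; by symmetry Q1 = -Q3.
for n in (2, 5, 10, 20):
    q3 = t.ppf(0.75, df=n)
    print(f"n={n}: Q1={-q3:.4f}, Q3={q3:.4f}")
```

For n = 2 the distribution function has a closed form, giving Q3 = √(2/3) ≈ 0.8165 as a spot check.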
Moments
Suppose that T has a t distribution. The representation in the definition can be used to find the mean, variance and other moments
of T. The main point to remember in the proofs that follow is that since V has the chi-square distribution with n degrees of freedom, E(V^k) = ∞ if k ≤ −n/2, while if k > −n/2,
E(V^k) = 2^k Γ(k + n/2) / Γ(n/2) (5.10.6)
Suppose that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then
1. E(T ) is undefined if 0 < n ≤ 1
2. E(T ) = 0 if 1 < n < ∞
Proof
Suppose again that T has the t distribution with n ∈ (0, ∞) degrees of freedom then
1. var(T ) is undefined if 0 < n ≤ 1
2. var(T ) = ∞ if 1 < n ≤ 2
3. var(T) = n/(n − 2) if 2 < n < ∞
Proof
In the simulation of the special distribution simulator, select the student t distribution. Vary n and note the location and shape
of the mean ± standard deviation bar. For selected values of n , run the simulation 1000 times and compare the empirical mean
and standard deviation to the distribution mean and standard deviation.
Suppose again that T has the t distribution with n ∈ (0, ∞) degrees of freedom and k ∈ N. Then
1. E(T^k) is undefined if k is odd and k ≥ n
2. E(T^k) = ∞ if k is even and k ≥ n
Proof
From the general moments, we can compute the skewness and kurtosis of T .
Suppose again that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then
1. skew(T) = 0 if n > 3
2. kurt(T) = 3 + 6/(n − 4) if n > 4
Proof
In the special distribution simulator, select the student t distribution. Vary n and note the shape of the probability density
function in light of the previous results on skewness and kurtosis. For selected values of n , run the simulation 1000 times and
compare the empirical density function to the true probability density function.
Since T does not have moments of all orders, there is no interval about 0 on which the moment generating function of T is finite.
The characteristic function exists, of course, but has no simple representation, except in terms of special functions.
Relations
The t distribution with 1 degree of freedom is known as the Cauchy distribution. The probability density function is
f(t) = 1 / [π(1 + t²)], t ∈ R (5.10.11)
The Cauchy distribution is named after Augustin Cauchy and is studied in more detail in a separate section.
You probably noticed that, qualitatively at least, the t probability density function is very similar to the standard normal probability
density function. The similarity is quantitative as well:
Let fn denote the t probability density function with n ∈ (0, ∞) degrees of freedom. Then for fixed t ∈ R,
fn(t) → (1/√(2π)) e^{−t²/2} as n → ∞ (5.10.12)
Proof
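This pointwise convergence is easy to observe numerically. A sketch assuming Python with scipy (not part of the text):

```python
from scipy.stats import norm, t

# Pointwise convergence of the t density to the standard normal density,
# measured on a small grid of points.
for n in (5, 50, 500):
    diff = max(abs(t.pdf(x, df=n) - norm.pdf(x)) for x in (-2, -1, 0, 1, 2))
    print(f"n={n}: max |f_n - phi| over the grid = {diff:.5f}")
```

The discrepancy shrinks roughly like 1/n, consistent with the limit above.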
Note that the function on the right is the probability density function of the standard normal distribution. We can also get
convergence of the t distribution to the standard normal distribution from the basic random variable representation in the
definition.
Suppose that
Tn = Z / √(Vn/n) (5.10.15)
where Z has the standard normal distribution, Vn has the chi-square distribution with n degrees of freedom, and Z and Vn are independent. Then Tn → Z as n → ∞ with probability 1.
Proof
The t distribution has more probability in the tails, and consequently less probability near 0, compared to the standard normal
distribution.
Suppose that Z has the standard normal distribution, μ ∈ R, V has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, and that Z and V are independent. Random variable
T = (Z + μ) / √(V/n) (5.10.16)
has the non-central student t distribution with n degrees of freedom and non-centrality parameter μ .
The standard functions that characterize a distribution—the probability density function, distribution function, and quantile
function—do not have simple representations for the non-central t distribution, but can only be expressed in terms of other special
functions. Similarly, the moments do not have simple, closed form expressions either. For the beginning student of statistics, the
most important fact is that the probability density function of the non-central t distribution is similar (but not exactly the same) as
that of the standard t distribution (with the same degrees of freedom), but shifted and scaled. The density function is shifted to the
right or left, depending on whether μ > 0 or μ < 0 .
Computational Exercises
Suppose that T has the t distribution with n = 10 degrees of freedom. For each of the following, compute the true value using
the special distribution calculator and then compute the normal approximation. Compare the results.
1. P(−0.8 < T < 1.2)
2. The 90th percentile of T .
Answer
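The exact values and the normal approximations can be computed side by side. A sketch assuming Python with scipy (not part of the text):

```python
from scipy.stats import norm, t

# T ~ t with n = 10 degrees of freedom, compared with the standard normal.
n = 10
true_p = t.cdf(1.2, n) - t.cdf(-0.8, n)
approx_p = norm.cdf(1.2) - norm.cdf(-0.8)
true_q = t.ppf(0.90, n)
approx_q = norm.ppf(0.90)
print(true_p, approx_p, true_q, approx_q)
```

The percentile comparison shows the heavier tails of the t distribution: the true 90th percentile exceeds the normal value.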
This page titled 5.10: The Student t Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.11: The F Distribution
In this section we will study a distribution that has special importance in statistics. In particular, this distribution arises from ratios
of sums of squares when sampling from a normal distribution, and so is important in estimation and in the two-sample normal
model and in hypothesis testing in the two-sample normal model.
Basic Theory
Definition
Suppose that U has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, V has the chi-square distribution with
d ∈ (0, ∞) degrees of freedom, and that U and V are independent. The distribution of
X = (U/n) / (V/d) (5.11.1)
is the F distribution with n degrees of freedom in the numerator and d degrees of freedom in the denominator.
The F distribution was first derived by George Snedecor, and is named in honor of Sir Ronald Fisher. In practice, the parameters n
and d are usually positive integers, but this is not a mathematical requirement.
Distribution Functions
Suppose that X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of
freedom in the denominator. Then X has a continuous distribution on (0, ∞) with probability density function f given by
f(x) = [Γ(n/2 + d/2) / (Γ(n/2) Γ(d/2))] (n/d) [(n/d)x]^{n/2−1} / [1 + (n/d)x]^{n/2+d/2}, x ∈ (0, ∞) (5.11.2)
Recall that the beta function B can be written in terms of the gamma function by
B(a, b) = Γ(a)Γ(b) / Γ(a + b), a, b ∈ (0, ∞) (5.11.8)
Hence the probability density function of the F distribution above can also be written as
f(x) = [1 / B(n/2, d/2)] (n/d) [(n/d)x]^{n/2−1} / [1 + (n/d)x]^{n/2+d/2}, x ∈ (0, ∞) (5.11.9)
When n ≥ 2, the probability density function is defined at x = 0, so the support interval is [0, ∞) in this case.
In the special distribution simulator, select the F distribution. Vary the parameters with the scroll bars and note the shape of the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.
Both parameters influence the shape of the F probability density function, but some of the basic qualitative features depend only
on the numerator degrees of freedom. For the remainder of this discussion, let f denote the F probability density function with
n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator.
5.11.1 https://stats.libretexts.org/@go/page/10351
Proof
Qualitatively, the second order properties of f also depend only on n , with transitions at n = 2 and n = 4 .
3. If n > 4, f is concave upward, then downward, then upward again, with inflection points at x_1 and x_2.
Proof
The distribution function and the quantile function do not have simple, closed-form representations. Approximate values of these
functions can be obtained from the special distribution calculator and from most mathematical and statistical software packages.
In the special distribution calculator, select the F distribution. Vary the parameters and note the shape of the probability density
function and the distribution function. In each of the following cases, find the median, the first and third quartiles, and the
interquartile range.
1. n = 5 , d = 5
2. n = 5 , d = 10
3. n = 10 , d = 5
4. n = 10 , d = 10
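One way to work these cases in code, assuming SciPy is available so that `scipy.stats.f.ppf` can serve as the quantile function:

```python
# Median, quartiles, and interquartile range for the four cases above.
from scipy.stats import f as f_dist

cases = [(5, 5), (5, 10), (10, 5), (10, 10)]
summary = {}
for n, d in cases:
    q1, q2, q3 = f_dist.ppf([0.25, 0.5, 0.75], n, d)
    summary[(n, d)] = (q1, q2, q3, q3 - q1)  # first quartile, median, third quartile, IQR
```

Note that when n = d, the distribution of X is the same as that of 1/X, so the median must be exactly 1; cases (a) and (d) confirm this.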
The general probability density function of the F distribution is a bit complicated, but it simplifies in a couple of special cases.
Special cases.
1. If n = 2,
f(x) = 1 / (1 + 2x/d)^{1+d/2},  x ∈ (0, ∞)    (5.11.14)
2. If n = d ∈ (0, ∞),
f(x) = [Γ(n) / Γ²(n/2)] x^{n/2−1} / (1 + x)^n,  x ∈ (0, ∞)    (5.11.15)
3. If n = d = 2,
f(x) = 1 / (1 + x)²,  x ∈ (0, ∞)    (5.11.16)
4. If n = d = 1,
f(x) = 1 / [π √x (1 + x)],  x ∈ (0, ∞)    (5.11.17)
Moments
The random variable representation in the definition, along with the moments of the chi-square distribution can be used to find the
mean, variance, and other moments of the F distribution. For the remainder of this discussion, suppose that X has the F
distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator.
Mean
1. E(X) = ∞ if 0 < d ≤ 2
2. E(X) = d/(d − 2) if d > 2
Proof
Thus, the mean depends only on the degrees of freedom in the denominator.
Variance
1. var(X) is undefined if 0 < d ≤ 2
2. var(X) = ∞ if 2 < d ≤ 4
3. If d > 4 then
var(X) = 2 [d/(d − 2)]² (n + d − 2) / [n(d − 4)]    (5.11.19)
Proof
In the simulation of the special distribution simulator, select the F distribution. Vary the parameters with the scroll bar and note the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
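The mean and variance formulas can also be spot-checked by simulating X directly from its definition as a ratio of scaled chi-square variables; this is a sketch assuming NumPy is available:

```python
# Simulate X = (U/n)/(V/d) and compare with the mean and variance formulas.
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 5.0, 10.0, 1_000_000
U = rng.chisquare(n, size=N)
V = rng.chisquare(d, size=N)
X = (U / n) / (V / d)

mean_theory = d / (d - 2)                                        # valid since d > 2
var_theory = 2 * (d / (d - 2))**2 * (n + d - 2) / (n * (d - 4))  # valid since d > 4
mean_err = abs(X.mean() - mean_theory)
var_err = abs(X.var() - var_theory)
```

With d = 10 the theoretical mean is 10/8 = 1.25, and the simulated mean should fall within a few standard errors of it.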
2. If d > 2k then
E(X^k) = (d/n)^k Γ(n/2 + k) Γ(d/2 − k) / [Γ(n/2) Γ(d/2)]    (5.11.23)
Proof
If k ∈ N, then using the fundamental identity of the gamma distribution and some algebra,
E(X^k) = (d/n)^k n(n + 2) ⋯ [n + 2(k − 1)] / [(d − 2)(d − 4) ⋯ (d − 2k)]    (5.11.26)
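The two moment formulas can be checked against each other numerically; this sketch assumes SciPy is available for the gamma function:

```python
# Consistency check: the gamma-function form (5.11.23) and the product
# form (5.11.26) of E(X^k) should agree whenever d > 2k.
from math import prod
from scipy.special import gamma

def moment_gamma(k, n, d):
    return (d/n)**k * gamma(n/2 + k) * gamma(d/2 - k) / (gamma(n/2) * gamma(d/2))

def moment_product(k, n, d):
    num = prod(n + 2*j for j in range(k))              # n(n+2)...[n+2(k-1)]
    den = prod(d - 2*j for j in range(1, k + 1))       # (d-2)(d-4)...(d-2k)
    return (d/n)**k * num / den

vals = [(moment_gamma(k, 5, 20), moment_product(k, 5, 20)) for k in (1, 2, 3)]
```

For k = 1 both reduce to d/(d − 2), the mean found earlier.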
From the general moment formula, we can compute the skewness and kurtosis of the F distribution.
2. If d > 8,
kurt(X) = 3 + 12 [n(5d − 22)(n + d − 2) + (d − 4)(d − 2)²] / [n(d − 6)(d − 8)(n + d − 2)]    (5.11.28)
Proof
Not surprisingly, the F distribution is positively skewed. Recall that the excess kurtosis is
kurt(X) − 3 = 12 [n(5d − 22)(n + d − 2) + (d − 4)(d − 2)²] / [n(d − 6)(d − 8)(n + d − 2)]    (5.11.29)
In the simulation of the special distribution simulator, select the F distribution. Vary the parameters with the scroll bar and
note the shape of the probability density function in light of the previous results on skewness and kurtosis. For selected values
of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density
function.
Relations
The most important relationship is the one in the definition, between the F distribution and the chi-square distribution. In addition,
the F distribution is related to several other special distributions.
Suppose that X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of
freedom in the denominator. Then 1/X has the F distribution with d degrees of freedom in the numerator and n degrees of
freedom in the denominator.
Proof
Suppose that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then X = T² has the F distribution with 1 degree of freedom in the numerator and n degrees of freedom in the denominator.
Proof
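This relation is easy to check empirically; the sketch below assumes NumPy and SciPy are available, and compares squared t samples to the F(1, n) distribution function with a Kolmogorov-Smirnov test:

```python
# Empirical check that T^2 has the F(1, n) distribution.
import numpy as np
from scipy.stats import f as f_dist, kstest

rng = np.random.default_rng(1)
n = 7
T = rng.standard_t(n, size=20_000)          # t-distributed sample
stat, pvalue = kstest(T**2, f_dist(1, n).cdf)
```

If the relation holds, the KS statistic should be small (on the order of 1/√N for N samples).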
Our next relationship is between the F distribution and the exponential distribution.
Suppose that X and Y are independent random variables, each with the exponential distribution with rate parameter
r ∈ (0, ∞). Then Z = X/Y has the F distribution with 2 degrees of freedom in both the numerator and denominator.
Proof
A simple transformation can change a variable with the F distribution into a variable with the beta distribution, and conversely.
1. If X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator, then Y = (n/d)X / [1 + (n/d)X] has the beta distribution with left parameter n/2 and right parameter d/2.
2. If Y has the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) then
X = bY / [a(1 − Y)]    (5.11.38)
has the F distribution with 2a degrees of freedom in the numerator and 2b degrees of freedom in the denominator.
Proof
The F distribution is closely related to the beta prime distribution by a simple scale transformation.
2. If Y has the beta prime distribution with parameters a ∈ (0, ∞) and b ∈ (0, ∞) then X = (b/a)Y has the F distribution with 2a degrees of freedom in the numerator and 2b degrees of freedom in the denominator.
Proof
Suppose that U has the non-central chi-square distribution with n ∈ (0, ∞) degrees of freedom and non-centrality parameter
λ ∈ [0, ∞), V has the chi-square distribution with d ∈ (0, ∞) degrees of freedom, and that U and V are independent. The
distribution of
X = (U/n) / (V/d)    (5.11.43)
is the non-central F distribution with n degrees of freedom in the numerator, d degrees of freedom in the denominator, and
non-centrality parameter λ .
One of the most interesting and important results for the non-central chi-square distribution is that it is a Poisson mixture of
ordinary chi-square distributions. This leads to a similar result for the non-central F distribution.
Suppose that N has the Poisson distribution with parameter λ/2, and that the conditional distribution of X given N is the F distribution with n + 2N degrees of freedom in the numerator and d degrees of freedom in the denominator, where λ ∈ [0, ∞) and n, d ∈ (0, ∞). Then X has the non-central F distribution with n degrees of freedom in the numerator, d degrees of freedom in the denominator, and non-centrality parameter λ.
From the last result, we can express the probability density function and distribution function of the non-central F distribution as a series in terms of ordinary F density and distribution functions. To set up the notation, for j, k ∈ (0, ∞) let f_{j,k} be the probability density function and F_{j,k} the distribution function of the F distribution with j degrees of freedom in the numerator and k degrees of freedom in the denominator. For the rest of this discussion, λ ∈ [0, ∞) and n, d ∈ (0, ∞) as usual.
The probability density function g of the non-central F distribution with n degrees of freedom in the numerator, d degrees of freedom in the denominator, and non-centrality parameter λ is given by
g(x) = ∑_{k=0}^∞ e^{−λ/2} [(λ/2)^k / k!] f_{n+2k,d}(x),  x ∈ (0, ∞)    (5.11.44)
The distribution function G of the non-central F distribution with n degrees of freedom in the numerator, d degrees of freedom in the denominator, and non-centrality parameter λ is given by
G(x) = ∑_{k=0}^∞ e^{−λ/2} [(λ/2)^k / k!] F_{n+2k,d}(x),  x ∈ (0, ∞)    (5.11.45)
This page titled 5.11: The F Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.12: The Lognormal Distribution
Basic Theory
Definition
Suppose that Y has the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞). Then X = e^Y has the lognormal distribution with parameters μ and σ.
1. The parameter σ is the shape parameter of the distribution.
2. The parameter e^μ is the scale parameter of the distribution.
So equivalently, if X has a lognormal distribution then ln X has a normal distribution, hence the name. The lognormal distribution
is a continuous distribution on (0, ∞) and is used to model random quantities when the distribution is believed to be skewed, such
as certain income and lifetime variables. It's easy to write a general lognormal variable in terms of a standard lognormal variable.
Suppose that Z has the standard normal distribution and let W = e^Z, so that W has the standard lognormal distribution. If μ ∈ R and σ ∈ (0, ∞) then Y = μ + σZ has the normal distribution with mean μ and standard deviation σ, and hence X = e^Y = e^μ (e^Z)^σ = e^μ W^σ has the lognormal distribution with parameters μ and σ.
Distribution Functions
Suppose that X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞).
1. f increases and then decreases with mode at x = exp(μ − σ²).
2. f is concave upward, then downward, then upward again, with inflection points at x = exp[μ − (3/2)σ² ± (1/2)σ √(σ² + 4)].
3. f(x) → 0 as x ↓ 0 and as x → ∞.
Proof
In the special distribution simulator, select the lognormal distribution. Vary the parameters and note the shape and location of
the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density function to the true probability density function.
Let Φ denote the standard normal distribution function, so that Φ^{−1} is the standard normal quantile function. Recall that values of Φ and Φ^{−1} can be obtained from the special distribution calculator, as well as standard mathematical and statistical software packages; in fact these functions are considered to be special functions in mathematics. The following two results show how to compute the lognormal distribution function and quantiles in terms of the standard normal distribution function and quantiles.
Proof
Proof
In the special distribution calculator, select the lognormal distribution. Vary the parameters and note the shape and location of
the probability density function and the distribution function. With μ = 0 and σ = 1 , find the median and the first and third
quartiles.
Moments
The moments of the lognormal distribution can be computed from the moment generating function of the normal distribution. Once
again, we assume that X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞).
For t ∈ R,
E(X^t) = exp(μt + σ²t²/2)    (5.12.8)
Proof
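The moment formula gives the mean and variance directly; the sketch below checks both against SciPy's implementation, assuming `scipy.stats.lognorm` with shape s = σ and scale = e^μ matches this parameterization:

```python
# The moment formula (5.12.8) versus scipy.stats.lognorm.
import math
from scipy.stats import lognorm

mu, sigma = 2.0, 1.0
X = lognorm(s=sigma, scale=math.exp(mu))

def moment(t):
    """E(X^t) = exp(mu*t + sigma^2 t^2 / 2)."""
    return math.exp(mu*t + sigma**2 * t**2 / 2)

mean_err = abs(X.mean() - moment(1))
var_err = abs(X.var() - (moment(2) - moment(1)**2))
```

The variance check uses var(X) = E(X²) − [E(X)]², i.e. the two expressions in the result that follows.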
1. E(X) = exp(μ + σ²/2)
2. var(X) = exp[2(μ + σ²)] − exp(2μ + σ²)
In the simulation of the special distribution simulator, select the lognormal distribution. Vary the parameters and note the shape
and location of the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical moments to the true moments.
From the general formula for the moments, we can also compute the skewness and kurtosis of the lognormal distribution.
1. skew(X) = (e^{σ²} + 2) √(e^{σ²} − 1)
2. kurt(X) = e^{4σ²} + 2e^{3σ²} + 3e^{2σ²} − 3
Proof
The fact that the skewness and kurtosis do not depend on μ is due to the fact that e^μ is a scale parameter. Recall that skewness and kurtosis are defined in terms of the standard score and so are independent of location and scale parameters. Naturally, the lognormal distribution is positively skewed. Finally, note that the excess kurtosis is
kurt(X) − 3 = e^{4σ²} + 2e^{3σ²} + 3e^{2σ²} − 6    (5.12.10)
Even though the lognormal distribution has finite moments of all orders, the moment generating function is infinite at any positive
number. This property is one of the reasons for the fame of the lognormal distribution.
E(e^{tX}) = ∞ for every t > 0.
Proof
Related Distributions
The most important relations are the ones between the lognormal and normal distributions in the definition: if X has a lognormal distribution then ln X has a normal distribution; conversely, if Y has a normal distribution then e^Y has a lognormal distribution.
Suppose that X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞) and that c ∈ (0, ∞) . Then cX has the
lognormal distribution with parameters μ + ln c and σ.
Proof
If X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞) then 1/X has the lognormal distribution with
parameters −μ and σ.
Proof
The lognormal distribution is closed under non-zero powers of the underlying variable. In particular, this generalizes the previous result.
Suppose that X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞) and that a ∈ R ∖ {0}. Then X^a has the lognormal distribution with parameters aμ and |a|σ.
Since the normal distribution is closed under sums of independent variables, it's not surprising that the lognormal distribution is
closed under products of independent variables.
Suppose that n ∈ N_+ and that (X_1, X_2, …, X_n) is a sequence of independent variables, where X_i has the lognormal distribution with parameters μ_i ∈ R and σ_i ∈ (0, ∞) for i ∈ {1, 2, …, n}. Then ∏_{i=1}^n X_i has the lognormal distribution with parameters μ and σ, where μ = ∑_{i=1}^n μ_i and σ² = ∑_{i=1}^n σ_i².
Proof
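The closure-under-products result can be sketched in simulation, assuming NumPy is available; note that the σ parameters combine in quadrature, not by direct addition:

```python
# Product of independent lognormals: check that ln(product) has the
# predicted normal mean and standard deviation.
import numpy as np

rng = np.random.default_rng(2)
params = [(0.5, 0.4), (1.0, 0.3), (-0.2, 0.5)]   # (mu_i, sigma_i) pairs
N = 500_000
product = np.ones(N)
for m, s in params:
    product *= rng.lognormal(mean=m, sigma=s, size=N)

logs = np.log(product)
mu_sum = sum(m for m, _ in params)                 # sum of the mu_i
sigma_sum = np.sqrt(sum(s**2 for _, s in params))  # sqrt of the sum of sigma_i^2
mu_err = abs(logs.mean() - mu_sum)
sigma_err = abs(logs.std() - sigma_sum)
```

Taking logs turns the product into a sum of independent normals, which is exactly why the result holds.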
Finally, the lognormal distribution belongs to the family of general exponential distributions.
Suppose that X has the lognormal distribution with parameters μ ∈ R and σ ∈ (0, ∞). The distribution of X is a 2-parameter exponential family with natural parameters and natural statistics, respectively, given by
1. (−1/(2σ²), μ/σ²)
2. (ln²(X), ln X)
Proof
Computational Exercises
Suppose that the income X of a randomly chosen person in a certain population (in $1000 units) has the lognormal distribution
with parameters μ = 2 and σ = 1 . Find P(X > 20) .
Answer
Suppose that the income X of a randomly chosen person in a certain population (in $1000 units) has the lognormal distribution
with parameters μ = 2 and σ = 1 . Find each of the following:
1. E(X)
2. var(X)
Answer
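One way to work both exercises, assuming SciPy is available; incomes are in $1000 units, so P(X > 20) is the probability of income above $20,000:

```python
# Lognormal income exercises with mu = 2, sigma = 1.
import math
from scipy.stats import norm

mu, sigma = 2.0, 1.0
p = 1 - norm.cdf((math.log(20) - mu) / sigma)                  # P(X > 20)
mean = math.exp(mu + sigma**2 / 2)                             # E(X)
var = math.exp(2*(mu + sigma**2)) - math.exp(2*mu + sigma**2)  # var(X)
```

The first computation uses P(X > x) = 1 − Φ[(ln x − μ)/σ], from the distribution function result above.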
This page titled 5.12: The Lognormal Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
5.13: The Folded Normal Distribution
Suppose that Y has a normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞). Then X = |Y | has the folded
normal distribution with parameters μ and σ.
Distribution Functions
Suppose that Z has the standard normal distribution. Recall that Z has probability density function ϕ and distribution function Φ given by
ϕ(z) = (1/√(2π)) e^{−z²/2},  z ∈ R    (5.13.1)
Φ(z) = ∫_{−∞}^z ϕ(x) dx = ∫_{−∞}^z (1/√(2π)) e^{−x²/2} dx,  z ∈ R    (5.13.2)
The standard normal distribution is so important that Φ is considered a special function and can be computed using most
mathematical and statistical software. If μ ∈ R and σ ∈ (0, ∞), then Y = μ + σZ has the normal distribution with mean μ and
standard deviation σ, and therefore X = |Y | = |μ + σZ| has the folded normal distribution with parameters μ and σ. For the
remainder of this discussion we assume that X has this folded normal distribution.
F(x) = ∫_0^x (1/(σ√(2π))) {exp[−(1/2)((y + μ)/σ)²] + exp[−(1/2)((y − μ)/σ)²]} dy,  x ∈ [0, ∞)    (5.13.4)
Proof
Open the special distribution calculator and select the folded normal distribution, and set the view to CDF. Vary the parameters
and note the shape of the distribution function. For selected values of the parameters, compute the median and the first and
third quartiles.
f(x) = (1/(σ√(2π))) {exp[−(1/2)((x + μ)/σ)²] + exp[−(1/2)((x − μ)/σ)²]},  x ∈ [0, ∞)    (5.13.8)
Proof
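The density can be checked against SciPy's folded normal; this sketch assumes `scipy.stats.foldnorm` uses shape c = μ/σ and scale σ, which matches this parameterization:

```python
# The folded normal density (5.13.8) versus scipy.stats.foldnorm.pdf.
import numpy as np
from scipy.stats import foldnorm

mu, sigma = 1.0, 2.0

def folded_pdf(x, mu, sigma):
    """Sum of the two normal-density terms, as in (5.13.8)."""
    z_plus = (x + mu) / sigma
    z_minus = (x - mu) / sigma
    return (np.exp(-z_plus**2 / 2) + np.exp(-z_minus**2 / 2)) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(0, 6, 61)
max_err = np.max(np.abs(folded_pdf(x, mu, sigma) - foldnorm.pdf(x, mu / sigma, scale=sigma)))
```

Algebraically, collecting the two exponentials gives the cosh form √(2/π) cosh(cx) e^{−(x²+c²)/2} with c = μ/σ, which is how the library writes it.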
Open the special distribution simulator and select the folded normal distribution. Vary the parameters μ and σ and note the shape of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the true probability density function.
Note that the folded normal distribution is unimodal for some values of the parameters and decreasing for other values. Note also
that μ is not a location parameter nor is σ a scale parameter; both influence the shape of the probability density function.
Moments
We cannot compute the mean of the folded normal distribution in closed form, but the mean can at least be given in terms of Φ. Once again, we assume that X has the folded normal distribution with parameters μ ∈ R and σ ∈ (0, ∞).
1. E(X) = μ[1 − 2Φ(−μ/σ)] + σ √(2/π) exp(−μ²/(2σ²))
2. E(X²) = μ² + σ²
Proof
var(X) = μ² + σ² − {μ[1 − 2Φ(−μ/σ)] + σ √(2/π) exp(−μ²/(2σ²))}²    (5.13.12)
Open the special distribution simulator and select the folded normal distribution. Vary the parameters and note the size and
location of the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare
the empirical mean and standard deviation to the true mean and standard deviation.
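The mean formula can be verified by simulation directly from the definition X = |Y|; this sketch assumes NumPy and SciPy are available:

```python
# Simulate |Y| with Y normal and compare with the closed-form mean.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu, sigma = 1.0, 2.0
X = np.abs(rng.normal(mu, sigma, size=1_000_000))

mean_theory = (mu * (1 - 2 * norm.cdf(-mu / sigma))
               + sigma * np.sqrt(2 / np.pi) * np.exp(-mu**2 / (2 * sigma**2)))
mean_err = abs(X.mean() - mean_theory)
```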
Related Distributions
The most important relation is the one between the folded normal distribution and the normal distribution in the definition: If Y has
a normal distribution then X = |Y | has a folded normal distribution. The folded normal distribution is also related to itself through
a symmetry property that is perhaps not completely obvious from the initial definition:
For μ ∈ R and σ ∈ (0, ∞), the folded normal distribution with parameters −μ and σ is the same as the folded normal
distribution with parameters μ and σ.
Proof 1
Proof 2
Suppose that X has the folded normal distribution with parameters μ ∈ R and σ ∈ (0, ∞) and that b ∈ (0, ∞) . Then bX has
the folded normal distribution with parameters bμ and bσ.
Proof
Suppose that Z has the standard normal distribution and that σ ∈ (0, ∞). Then X = σ|Z| has the half-normal distribution
with scale parameter σ. If σ = 1 so that X = |Z| , then X has the standard half-normal distribution.
Distribution Functions
For our next discussion, suppose that X has the half-normal distribution with parameter σ ∈ (0, ∞). Once again, Φ and Φ^{−1} denote the distribution function and quantile function, respectively, of the standard normal distribution.
F(x) = 2Φ(x/σ) − 1 = ∫_0^x (1/σ) √(2/π) exp(−y²/(2σ²)) dy,  x ∈ [0, ∞)    (5.13.13)
F^{−1}(p) = σ Φ^{−1}((1 + p)/2),  p ∈ [0, 1)    (5.13.14)
Proof
Open the special distribution calculator and select the folded normal distribution. Select CDF view and keep μ = 0 . Vary σ and
note the shape of the CDF. For various values of σ, compute the median and the first and third quartiles.
Open the special distribution simulator and select the folded normal distribution. Keep μ = 0 and vary σ, and note the shape of the probability density function. For selected values of σ, run the simulation 1000 times and compare the empirical density function to the true probability density function.
Moments
The moments of the half-normal distribution can be computed explicitly. Once again we assume that X has the half-normal
distribution with parameter σ ∈ (0, ∞).
For n ∈ N,
E(X^{2n}) = σ^{2n} (2n)! / (n! 2^n)    (5.13.16)
E(X^{2n+1}) = σ^{2n+1} 2^n n! √(2/π)    (5.13.17)
Proof
In particular, we have E(X) = σ √(2/π) and var(X) = σ²(1 − 2/π).
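The even and odd moment formulas can be checked by numerical integration against the half-normal density; this sketch assumes SciPy is available:

```python
# The moment formulas (5.13.16)-(5.13.17) versus numerical integration.
import math
from scipy.stats import halfnorm
from scipy.integrate import quad

sigma = 1.5

def moment_formula(k, sigma):
    """Closed-form E(X^k) for the half-normal distribution."""
    if k % 2 == 0:
        n = k // 2
        return sigma**k * math.factorial(2 * n) / (math.factorial(n) * 2**n)
    n = (k - 1) // 2
    return sigma**k * 2**n * math.factorial(n) * math.sqrt(2 / math.pi)

errs = []
for k in (1, 2, 3, 4):
    num, _ = quad(lambda x: x**k * halfnorm.pdf(x, scale=sigma), 0, math.inf)
    errs.append(abs(num - moment_formula(k, sigma)))
```

For k = 1 and k = 2 the formula reduces to σ√(2/π) and σ², matching the mean and second moment quoted above.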
Open the special distribution simulator and select the folded normal distribution. Keep μ = 0 and vary σ, and note the size and
location of the mean±standard deviation bar. For selected values of σ, run the simulation 1000 times and compare the mean
and standard deviation to the true mean and standard deviation.
2. The kurtosis of X is
kurt(X) = (3 − 4/π − 12/π²) / (1 − 2/π)² ≈ 3.8692    (5.13.20)
Proof
Related Distributions
Once again, the most important relation is the one in the definition: If Y has a normal distribution with mean 0 then X = |Y | has a
half-normal distribution. Since the half normal distribution is a scale family, it is trivially closed under scale transformations.
Suppose that X has the half-normal distribution with parameter σ and that b ∈ (0, ∞) . Then bX has the half-normal
distribution with parameter bσ.
Proof
The standard half-normal distribution is also a special case of the chi distribution.
The standard half-normal distribution is the chi distribution with 1 degree of freedom.
Proof
This page titled 5.13: The Folded Normal Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
5.14: The Rayleigh Distribution
The Rayleigh distribution, named for John William Strutt, Lord Rayleigh, is the distribution of the magnitude of a two-dimensional
random vector whose coordinates are independent, identically distributed, mean 0 normal variables. The distribution has a number
of applications in settings where magnitudes of normal variables are important.
Distribution Functions
We give five functions that completely characterize the standard Rayleigh distribution: the distribution function, the probability
density function, the quantile function, the reliability function, and the failure rate function. For the remainder of this discussion,
we assume that R has the standard Rayleigh distribution.
Proof
Open the Special Distribution Simulator and select the Rayleigh distribution. Keep the default parameter value and note the shape of the probability density function. Run the simulation 1000 times and compare the empirical density function to the probability density function.
R has quantile function G^{−1} given by G^{−1}(p) = √(−2 ln(1 − p)) for p ∈ [0, 1). In particular, the quartiles of R are
1. q_1 = √(4 ln 2 − 2 ln 3) ≈ 0.7585, the first quartile
2. q_2 = √(2 ln 2) ≈ 1.1774, the median
3. q_3 = √(4 ln 2) ≈ 1.6651, the third quartile
Proof
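The closed-form quartiles above follow directly from the quantile function; a minimal check using only the standard library:

```python
# Standard Rayleigh quantile function and its quartiles.
import math

def rayleigh_quantile(p):
    """G^{-1}(p) = sqrt(-2 ln(1 - p)) for the standard Rayleigh distribution."""
    return math.sqrt(-2 * math.log(1 - p))

q1, q2, q3 = (rayleigh_quantile(p) for p in (0.25, 0.5, 0.75))
```

For example, at p = 1/2 the formula gives √(−2 ln(1/2)) = √(2 ln 2), the median quoted above.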
Open the Special Distribution Calculator and select the Rayleigh distribution. Keep the default parameter value. Note the shape
and location of the distribution function. Compute selected values of the distribution function and the quantile function.
Proof
R has failure rate function h given by h(x) = x for x ∈ [0, ∞). In particular, R has increasing failure rate.
Proof
Moments
Once again we assume that R has the standard Rayleigh distribution. We can express the moment generating function of R in
terms of the standard normal distribution function Φ. Recall that Φ is so commonly used that it is a special function of
mathematics.
5.14.1 https://stats.libretexts.org/@go/page/10354
R has moment generating function m given by
m(t) = E(e^{tR}) = 1 + √(2π) t e^{t²/2} Φ(t),  t ∈ R    (5.14.3)
Proof
Open the Special Distribution Simulator and select the Rayleigh distribution. Keep the default parameter value. Note the size and location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
Of course, the formula for the general moments gives an alternate derivation of the mean and variance above, since Γ(3/2) = √π/2 and Γ(2) = 1. On the other hand, the moment generating function can also be used to derive the formula for the general moments.
Proof
Related Distributions
The fundamental connection between the standard Rayleigh distribution and the standard normal distribution is given in the very
definition of the standard Rayleigh, as the distribution of the magnitude of a point with independent, standard normal coordinates.
2. If V has the chi-square distribution with 2 degrees of freedom then √V has the standard Rayleigh distribution.
Proof
Recall also that the chi-square distribution with 2 degrees of freedom is the same as the exponential distribution with scale
parameter 2.
Since the quantile function is in closed form, the standard Rayleigh distribution can be simulated by the random quantile method.
Connections between the standard Rayleigh distribution and the standard uniform distribution.
1. If U has the standard uniform distribution (a random number) then R = G^{−1}(U) = √(−2 ln(1 − U)) has the standard Rayleigh distribution.
2. If R has the standard Rayleigh distribution then U = G(R) = 1 − exp(−R²/2) has the standard uniform distribution.
In part (a), note that 1 − U has the same distribution as U (the standard uniform). Hence R = √(−2 ln U) also has the standard Rayleigh distribution.
Open the random quantile simulator and select the Rayleigh distribution with the default parameter value (standard). Run the
simulation 1000 times and compare the empirical density function to the true density function.
There is another connection with the uniform distribution that leads to the most common method of simulating a pair of
independent standard normal variables. We have seen this before, but it's worth repeating. The result is closely related to the
definition of the standard Rayleigh variable as the magnitude of a standard bivariate normal pair, but with the addition of the polar
coordinate angle.
Suppose that R has the standard Rayleigh distribution, Θ is uniformly distributed on [0, 2π), and that R and Θ are
independent. Let Z = R cos Θ , W = R sin Θ . Then (Z, W ) have the standard bivariate normal distribution.
Proof
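The polar method can be sketched in a few lines, assuming NumPy is available: a standard Rayleigh radius (generated by the random quantile method above) and an independent uniform angle yield two independent standard normal variables.

```python
# Generate standard normal pairs from a Rayleigh radius and uniform angle.
import numpy as np

rng = np.random.default_rng(4)
N = 500_000
R = np.sqrt(-2 * np.log(1 - rng.random(N)))   # standard Rayleigh via random quantile
Theta = rng.uniform(0, 2 * np.pi, N)          # independent uniform angle
Z, W = R * np.cos(Theta), R * np.sin(Theta)

# Z and W should each be standard normal and uncorrelated.
stats = (Z.mean(), Z.std(), W.mean(), W.std(), np.corrcoef(Z, W)[0, 1])
```

This is the same construction that underlies the Box-Muller method of simulating normal variables.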
If R has the standard Rayleigh distribution and b ∈ (0, ∞) then X = bR has the Rayleigh distribution with scale parameter b. Equivalently, the Rayleigh distribution is the distribution of the magnitude of a two-dimensional vector whose components are independent, identically distributed, mean 0 normal variables.
If U_1 and U_2 are independent normal variables with mean 0 and standard deviation σ ∈ (0, ∞) then X = √(U_1² + U_2²) has the Rayleigh distribution with scale parameter σ.
Distribution Functions
In this section, we assume that X has the Rayleigh distribution with scale parameter b ∈ (0, ∞).
X has distribution function F given by F(x) = 1 − exp[−x²/(2b²)] and probability density function f given by f(x) = (x/b²) exp[−x²/(2b²)], for x ∈ [0, ∞).
Proof
Open the Special Distribution Simulator and select the Rayleigh distribution. Vary the scale parameter and note the shape and location of the probability density function. For various values of the scale parameter, run the simulation 1000 times and compare the empirical density function to the probability density function.
X has quantile function F^{−1} given by F^{−1}(p) = b √(−2 ln(1 − p)) for p ∈ [0, 1). In particular, the quartiles of X are
1. q_1 = b √(4 ln 2 − 2 ln 3), the first quartile
2. q_2 = b √(2 ln 2), the median
3. q_3 = b √(4 ln 2), the third quartile
Proof
Open the Special Distribution Calculator and select the Rayleigh distribution. Vary the scale parameter and note the location
and shape of the distribution function. For various values of the scale parameter, compute selected values of the distribution
function and the quantile function.
Proof
X has failure rate function h given by h(x) = x/b² for x ∈ [0, ∞). In particular, X has increasing failure rate.
Proof
Moments
Again, we assume that X has the Rayleigh distribution with scale parameter b , and recall that Φ denotes the standard normal
distribution function.
Proof
Proof
Open the Special Distribution Simulator and select the Rayleigh distribution. Vary the scale parameter and note the size and location of the mean ± standard deviation bar. For various values of the scale parameter, run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
Again, the general moments can be expressed in terms of the gamma function Γ.
E(X^n) = b^n 2^{n/2} Γ(1 + n/2) for n ∈ N.
Proof
Proof
Related Distributions
The fundamental connection between the Rayleigh distribution and the normal distribution is the definition, and of course, is the
primary reason that the Rayleigh distribution is special in the first place. By construction, the Rayleigh distribution is a scale
family, and so is closed under scale transformations.
If X has the Rayleigh distribution with scale parameter b ∈ (0, ∞) and if c ∈ (0, ∞) then cX has the Rayleigh distribution
with scale parameter bc.
The Rayleigh distribution with scale parameter b ∈ (0, ∞) is the Weibull distribution with shape parameter 2 and scale parameter √2 b.
The following result generalizes the connection between the standard Rayleigh and chi-square distributions.
Proof
Since the quantile function is in closed form, the Rayleigh distribution can be simulated by the random quantile method.
1. If U has the standard uniform distribution (a random number) then X = F^{−1}(U) = b √(−2 ln(1 − U)) has the Rayleigh distribution with scale parameter b.
2. If X has the Rayleigh distribution with scale parameter b then U = F(X) = 1 − exp(−X²/(2b²)) has the standard uniform distribution.
In part (a), note that 1 − U has the same distribution as U (the standard uniform). Hence X = b √(−2 ln U) also has the Rayleigh distribution with scale parameter b.
Open the random quantile simulator and select the Rayleigh distribution. For selected values of the scale parameter, run the
simulation 1000 times and compare the empirical density function to the true density function.
If X has the Rayleigh distribution with scale parameter b ∈ (0, ∞) then X has a one-parameter exponential distribution with natural parameter −1/b² and natural statistic X²/2.
Proof
This page titled 5.14: The Rayleigh Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.15: The Maxwell Distribution
The Maxwell distribution, named for James Clerk Maxwell, is the distribution of the magnitude of a three-dimensional random
vector whose coordinates are independent, identically distributed, mean 0 normal variables. The distribution has a number of
applications in settings where magnitudes of normal variables are important, particularly in physics. It is also called the Maxwell-
Boltzmann distribution in honor also of Ludwig Boltzmann. The Maxwell distribution is closely related to the Rayleigh
distribution, which governs the magnitude of a two-dimensional random vector whose coordinates are independent, identically
distributed, mean 0 normal variables.
Suppose that Z_1, Z_2, and Z_3 are independent random variables with standard normal distributions. The magnitude R = √(Z_1² + Z_2² + Z_3²) of the vector (Z_1, Z_2, Z_3) has the standard Maxwell distribution.
So in the context of the definition, (Z_1, Z_2, Z_3) has the standard trivariate normal distribution. The Maxwell distribution is a continuous distribution on [0, ∞).
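The definition translates directly into a simulation; this sketch assumes NumPy and SciPy are available, and that `scipy.stats.maxwell` with its default scale of 1 is the standard Maxwell distribution:

```python
# Simulate the standard Maxwell distribution from three standard normals
# and compare with scipy.stats.maxwell via a Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import maxwell, kstest

rng = np.random.default_rng(5)
Z = rng.standard_normal((3, 100_000))
R = np.sqrt((Z**2).sum(axis=0))        # magnitude of (Z1, Z2, Z3)
stat, pvalue = kstest(R, maxwell.cdf)
```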
Distribution Functions
In this discussion, we assume that R has the standard Maxwell distribution. The distribution function of R can be expressed in
terms of the standard normal distribution function Φ. Recall that Φ occurs so frequently that it is considered a special function in
mathematics.
Proof
1. g increases and then decreases with mode at x = √2.
2. g is concave upward, then downward, then upward again, with inflection points at x_1 = √((5 − √17)/2) ≈ 0.6622 and x_2 = √((5 + √17)/2) ≈ 2.1358.
Proof
Open the Special Distribution Simulator and select the Maxwell distribution. Keep the default parameter value and note the shape of the probability density function. Run the simulation 1000 times and compare the empirical density function to the probability density function.
Open the Special Distribution Calculator and select the Maxwell distribution. Keep the default parameter value. Find
approximate values of the median and the first and third quartiles.
Moments
Suppose again that R has the standard Maxwell distribution. The moment generating function of R , like the distribution function,
can be expressed in terms of the standard normal distribution function Φ.
R has moment generating function m given by
m(t) = E(e^{tR}) = √(2/π) t + 2(1 + t²) e^{t²/2} Φ(t),  t ∈ R    (5.15.5)
Proof
The mean and variance of R can be found from the moment generating function, but direct computations are also easy.
Open the Special Distribution Simulator and select the Maxwell distribution. Keep the default parameter value. Note the size
and location of the mean±standard deviation bar. Run the simulation 1000 times and compare the empirical mean and standard
deviation to the true mean and standard deviation.
For n ∈ ℕ₊,

E(Rⁿ) = (2^{n/2+1} / √π) Γ((n + 3)/2)  (5.15.13)
Proof
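The general moment formula is easy to sanity-check numerically (a sketch of ours): E(R²) = 3 since R² = Z₁² + Z₂² + Z₃², and E(R) = 2√(2/π).

```python
import math

def maxwell_moment(n):
    # E(R^n) = 2^{n/2 + 1} / sqrt(pi) * Gamma((n + 3)/2)
    return 2 ** (n / 2 + 1) / math.sqrt(math.pi) * math.gamma((n + 3) / 2)
```

In particular the variance is E(R²) − [E(R)]² = 3 − 8/π.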
Of course, the formula for the general moments gives an alternate derivation of the mean and variance above, since Γ(2) = 1 and Γ(5/2) = 3√π/4. On the other hand, the moment generating function can also be used to derive the formula for the general moments.
Proof
Related Distributions
The fundamental connection between the standard Maxwell distribution and the standard normal distribution is given in the very definition of the standard Maxwell, as the distribution of the magnitude of a vector in ℝ³ with independent, standard normal coordinates.

1. If R has the standard Maxwell distribution then R² has the chi-square distribution with 3 degrees of freedom.
2. If V has the chi-square distribution with 3 degrees of freedom then √V has the standard Maxwell distribution.
Proof
Equivalently, the Maxwell distribution is simply the chi distribution with 3 degrees of freedom.
If R has the standard Maxwell distribution and b ∈ (0, ∞) then X = bR has the Maxwell distribution with scale parameter b .
Equivalently, the Maxwell distribution is the distribution of the magnitude of a three-dimensional vector whose coordinates are independent, identically distributed normal variables with mean 0.
If U₁, U₂, and U₃ are independent normal variables with mean 0 and standard deviation σ ∈ (0, ∞) then

X = √(U₁² + U₂² + U₃²)

has the Maxwell distribution with scale parameter σ.
Proof
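This result is easy to check by simulation (a sketch of ours): the sample mean of the magnitudes should be close to the Maxwell mean σ · 2√(2/π).

```python
import math, random

# simulate X = sqrt(U1^2 + U2^2 + U3^2) with N(0, sigma) coordinates and
# compare the sample mean to the Maxwell mean sigma * 2 * sqrt(2/pi)
random.seed(2)
sigma, n = 2.5, 100_000
xs = [math.hypot(random.gauss(0, sigma), random.gauss(0, sigma), random.gauss(0, sigma))
      for _ in range(n)]
sample_mean = sum(xs) / n
```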
Distribution Functions
In this section, we assume that X has the Maxwell distribution with scale parameter b ∈ (0, ∞) . We can give the distribution
function of X in terms of the standard normal distribution function Φ.
Proof
1. f increases and then decreases with mode at x = b√2.
2. f is concave upward, then downward, then upward again, with inflection points at x = b√((5 ± √17)/2).
Proof
Open the Special Distribution Simulator and select the Maxwell distribution. Vary the scale parameter and note the shape and
location of the probability density function. For various values of the scale parameter, run the simulation 1000 times and
compare the empirical density function to the probability density function.
Again, the quantile function does not have a simple, closed-form expression.
Open the Special Distribution Calculator and select the Maxwell distribution. For various values of the scale parameter,
compute the median and the first and third quartiles.
Moments
Again, we assume that X has the Maxwell distribution with scale parameter b ∈ (0, ∞). As before, the moment generating
function of X can be written in terms of the standard normal distribution function Φ.
Proof
Proof
Open the Special Distribution Simulator and select the Maxwell distribution. Vary the scale parameter and note the size and
location of the mean±standard deviation bar. For various values of the scale parameter, run the simulation 1000 times and compare
the empirical mean and standard deviation to the true mean and standard deviation.
As before, the general moments can be expressed in terms of the gamma function Γ.
For n ∈ ℕ,

E(Xⁿ) = bⁿ (2^{n/2+1} / √π) Γ((n + 3)/2)  (5.15.18)
Proof
Proof
Related Distributions
The fundamental connection between the Maxwell distribution and the normal distribution is given in the definition, and of course,
is the primary reason that the Maxwell distribution is special in the first place.
By construction, the Maxwell distribution is a scale family, and so is closed under scale transformations.
If X has the Maxwell distribution with scale parameter b ∈ (0, ∞) and if c ∈ (0, ∞) then cX has the Maxwell distribution
with scale parameter bc.
Proof
If X has the Maxwell distribution with scale parameter b ∈ (0, ∞) then X is a one-parameter exponential family with natural parameter −1/b² and natural statistic X²/2.
Proof
This page titled 5.15: The Maxwell Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.16: The Lévy Distribution
The Lévy distribution, named for the French mathematician Paul Lévy, is important in the study of Brownian motion, and is one of
only three stable distributions whose probability density function can be expressed in a simple, closed form.
Distribution Functions
We assume that U has the standard Lévy distribution. The distribution function of U has a simple expression in terms of the
standard normal distribution function Φ, not surprising given the definition.
Proof
Similarly, the quantile function of U has a simple expression in terms of the standard normal quantile function Φ⁻¹: F⁻¹(p) = [Φ⁻¹(1 − p/2)]⁻² for p ∈ (0, 1). In particular,
1. q₁ = [Φ⁻¹(7/8)]⁻² ≈ 0.7557, the first quartile.
2. q₂ = [Φ⁻¹(3/4)]⁻² ≈ 2.1981, the median.
3. q₃ = [Φ⁻¹(5/8)]⁻² ≈ 9.8492, the third quartile.
Proof
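The quartiles can be computed directly with the standard library's normal quantile function (a sketch of ours; the function name is our own).

```python
from statistics import NormalDist

def levy_quantile(p):
    # F^{-1}(p) = [Phi^{-1}(1 - p/2)]^{-2} for the standard Lévy distribution
    return NormalDist().inv_cdf(1 - p / 2) ** -2
```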
Open the Special Distribution Calculator and select the Lévy distribution. Keep the default parameter values. Note the shape
and location of the distribution function. Compute a few values of the distribution function and the quantile function.
1. g increases and then decreases with mode at x = 1/3.
2. g is concave upward, then downward, then upward again, with inflection points at x = 1/3 − √10/15 ≈ 0.1225 and at x = 1/3 + √10/15 ≈ 0.5442.
Proof
Open the Special Distribution Simulator and select the Lévy distribution. Keep the default parameter values. Note the shape of
the probability density function. Run the simulation 1000 times and compare the empirical density function to the probability
density function.
5.16.1 https://stats.libretexts.org/@go/page/10356
Moments
We assume again that U has the standard Lévy distribution. After exploring the graphs of the probability density function and
distribution function above, you probably noticed that the Lévy distribution has a very heavy tail. The 99th percentile is about
6400, for example. The following result is not surprising.
E(U ) = ∞
Proof
Of course, the higher-order moments are infinite as well, and the variance, skewness, and kurtosis do not exist. The moment
generating function is infinite at every positive value, and so is of no use. On the other hand, the characteristic function of the
standard Lévy distribution is very useful. For the following result, recall that the sign function sgn is given by sgn(t) = 1 for
t > 0 , sgn(t) = −1 for t < 0 , and sgn(0) = 0 .
χ₀(t) = E(e^{itU}) = exp(−|t|^{1/2} [1 + i sgn(t)]),  t ∈ ℝ  (5.16.9)
Related Distributions
The most important relationship is the one in the definition: if Z has the standard normal distribution then U = 1/Z² has the standard Lévy distribution. The following result is basically the converse.
If U has the standard Lévy distribution, then V = 1/√U has the standard half-normal distribution.
Proof
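The defining relation U = 1/Z² gives an immediate way to simulate the standard Lévy distribution; this sketch of ours checks the sample median against the value [Φ⁻¹(3/4)]⁻² ≈ 2.198 found above.

```python
import random

# simulate the standard Lévy distribution as U = 1/Z^2 for Z standard normal
random.seed(3)
n = 100_001
us = sorted(1 / random.gauss(0, 1) ** 2 for _ in range(n))
sample_median = us[n // 2]   # population median is about 2.198
```

Note that the sample mean would be useless here, since E(U) = ∞; order statistics such as the median remain well behaved.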
Definition
Suppose that U has the standard Lévy distribution, and a ∈ R and b ∈ (0, ∞) . Then X = a + bU has the Lévy distribution
with location parameter a and scale parameter b .
Distribution Functions
Suppose that X has the Lévy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). As before, the
distribution function of X has a simple expression in terms of the standard normal distribution function Φ.
Proof
Similarly, the quantile function of X has a simple expression in terms of the standard normal quantile function Φ⁻¹: F⁻¹(p) = a + b[Φ⁻¹(1 − p/2)]⁻² for p ∈ (0, 1). In particular,
1. q₁ = a + b[Φ⁻¹(7/8)]⁻², the first quartile.
2. q₂ = a + b[Φ⁻¹(3/4)]⁻², the median.
3. q₃ = a + b[Φ⁻¹(5/8)]⁻², the third quartile.
Proof
Open the Special Distribution Calculator and select the Lévy distribution. Vary the parameter values and note the shape of the
graph of the distribution function function. For various values of the parameters, compute a few values of the distribution
function and the quantile function.
1. f increases and then decreases with mode at x = a + b/3.
2. f is concave upward, then downward, then upward again, with inflection points at x = a + (1/3 ± √10/15) b.
Proof
Open the Special Distribution Simulator and select the Lévy distribution. Vary the parameters and note the shape and location of
the probability density function. For various parameter values, run the simulation 1000 times and compare the empirical
density function to the probability density function.
Moments
Assume again that X has the Lévy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). Of course, since the
standard Lévy distribution has infinite mean, so does the general Lévy distribution.
E(X) = ∞
Also as before, the variance, skewness, and kurtosis of X are undefined. On the other hand, the characteristic function of X is very
important.
Proof
Related Distributions
Since the Lévy distribution is a location-scale family, it is trivially closed under location-scale transformations.
Suppose that X has the Lévy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that c ∈ R and
d ∈ (0, ∞) . Then Y = c + dX has the Lévy distribution with location parameter c + ad and scale parameter bd .
Proof
Of more interest is the fact that the Lévy distribution is closed under convolution (corresponding to sums of independent variables).
Suppose that X₁ and X₂ are independent, and that Xₖ has the Lévy distribution with location parameter aₖ ∈ ℝ and scale parameter bₖ ∈ (0, ∞) for k ∈ {1, 2}. Then X₁ + X₂ has the Lévy distribution with location parameter a₁ + a₂ and scale parameter (b₁^{1/2} + b₂^{1/2})².
Proof
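The convolution rule can be checked by simulation (a sketch of ours), using the representation of a Lévy variable as b/Z² for Z standard normal: the sum of independent Lévy(0, 1) and Lévy(0, 4) variables should be Lévy(0, (1 + 2)²) = Lévy(0, 9), whose median is 9 times the standard median.

```python
import random

def levy(b):
    # Lévy variable with location 0 and scale b, via b/Z^2 for Z standard normal
    return b / random.gauss(0, 1) ** 2

random.seed(4)
n = 100_001
s = sorted(levy(1.0) + levy(4.0) for _ in range(n))
sample_median = s[n // 2]   # median of Lévy(0, 9) is 9 * 2.1981 ≈ 19.78
```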
As a corollary, the Lévy distribution is a stable distribution with index α = 1/2:
Suppose that n ∈ ℕ₊ and that (X₁, X₂, …, Xₙ) is a sequence of independent random variables, each having the Lévy distribution with location parameter a ∈ ℝ and scale parameter b ∈ (0, ∞). Then X₁ + X₂ + ⋯ + Xₙ has the Lévy distribution with location parameter na and scale parameter n²b.
Stability is one of the reasons for the importance of the Lévy distribution. From the characteristic function, it follows that the
skewness parameter is β = 1 .
This page titled 5.16: The Lévy Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.17: The Beta Distribution
In this section, we will study the beta distribution, the most important distribution that has bounded support. But before we can
study the beta distribution we must study the beta function.
Properties
The beta function satisfies the following properties:
1. B(a, b) = B(b, a) for a, b ∈ (0, ∞), so B is symmetric.
2. B(a, 1) = 1/a for a ∈ (0, ∞)
Proof
The beta function has a simple expression in terms of the gamma function:
If a, b ∈ (0, ∞) then

B(a, b) = Γ(a)Γ(b) / Γ(a + b)  (5.17.4)
Proof
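This representation makes the beta function trivial to compute (a sketch of ours using the standard library's gamma function), and it also verifies the symmetry property and the factorial formula that follows.

```python
import math

def beta_fn(a, b):
    # B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)
```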
Recall that the gamma function is a generalization of the factorial function. Here is the corresponding result for the beta function:
If j, k ∈ ℕ₊ then

B(j, k) = (j − 1)! (k − 1)! / (j + k − 1)!  (5.17.8)
Proof
Let's generalize this result. First, recall from our study of combinatorial structures that for a ∈ ℝ and j ∈ ℕ, the ascending power of base a and order j is

a^{[j]} = a(a + 1) ⋯ [a + (j − 1)]  (5.17.9)
Proof
B(1/2, 1/2) = π.
Proof
5.17.1 https://stats.libretexts.org/@go/page/10357
Figure 5.17.1 : The graph of B(a, b) on the square 0 < a < 5 , 0 < b < 5
Of course, the ordinary (complete) beta function is B(a, b) = B(1; a, b) for a, b ∈ (0, ∞) .
The (standard) beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) has probability density
function f given by
f(x) = x^{a−1} (1 − x)^{b−1} / B(a, b),  x ∈ (0, 1)  (5.17.12)
Of course, the beta function is simply the normalizing constant, so it's clear that f is a valid probability density function. If a ≥ 1 ,
f is defined at 0, and if b ≥ 1 , f is defined at 1. In these cases, it's customary to extend the domain of f to these endpoints. The
beta distribution is useful for modeling random probabilities and proportions, particularly in the context of Bayesian analysis. The
distribution has just two parameters and yet a rich variety of shapes (so in particular, both parameters are shape parameters).
Qualitatively, the first order properties of f depend on whether each parameter is less than, equal to, or greater than 1.
1. If 0 < a < 1 and 0 < b < 1, f decreases and then increases with minimum value at x₀, and f(x) → ∞ as x ↓ 0 and as x ↑ 1.
2. If a = 1 and b = 1, f is constant.
3. If 0 < a < 1 and b ≥ 1, f is decreasing with f(x) → ∞ as x ↓ 0.
4. If a ≥ 1 and 0 < b < 1, f is increasing with f(x) → ∞ as x ↑ 1.
5. If a = 1 and b > 1, f is decreasing with mode at x = 0.
6. If a > 1 and b = 1, f is increasing with mode at x = 1.
7. If a > 1 and b > 1, f increases and then decreases with mode at x₀.
Proof
From part (b), note that the special case a = 1 and b = 1 gives the continuous uniform distribution on the interval (0, 1) (the standard uniform distribution). Note also that when a < 1 or b < 1, the probability density function is unbounded, and hence the distribution has no mode. On the other hand, if a ≥ 1, b ≥ 1, and one of the inequalities is strict, the distribution has a unique mode at x₀. The second order properties are more complicated.
3. If 1 < a < 2 and b ≤ 1, f is concave downward and then upward, with inflection point at x₂.
6. If a > 2 and 1 < b ≤ 2, f is concave upward and then downward, with inflection point at x₁.
7. If a > 2 and b > 2, f is concave upward, then downward, then upward again, with inflection points at x₁ and x₂.
Proof
In the special distribution simulator, select the beta distribution. Vary the parameters and note the shape of the beta density
function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to
the true density function.
The beta distribution with a = 1/2, b = 1/2 is the arcsine distribution, with probability density function given by

f(x) = 1 / (π √(x(1 − x))),  x ∈ (0, 1)  (5.17.18)
This distribution is important in a number of applications, and so the arcsine distribution is studied in a separate section.
The beta distribution function F can be easily expressed in terms of the incomplete beta function. As usual a denotes the left
parameter and b the right parameter.
The distribution function F is sometimes known as the regularized incomplete beta function. In some special cases, the distribution function F and its inverse, the quantile function F⁻¹, can be computed in closed form, without resorting to special functions.

If b = 1 then
1. F(x) = x^a for x ∈ (0, 1)
2. F⁻¹(p) = p^{1/a} for p ∈ (0, 1)

If a = 1 then
1. F(x) = 1 − (1 − x)^b for x ∈ (0, 1)
2. F⁻¹(p) = 1 − (1 − p)^{1/b} for p ∈ (0, 1)

If a = b = 1/2 (the arcsine distribution) then
1. F(x) = (2/π) arcsin(√x) for x ∈ (0, 1)
2. F⁻¹(p) = sin²(πp/2) for p ∈ (0, 1)
There is an interesting relationship between the distribution functions of the beta distribution and the binomial distribution, when the beta parameters are positive integers. To state the relationship we need to embellish our notation to indicate the dependence on the parameters. Thus, let F_{a,b} denote the beta distribution function with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞), and let G_{n,p} denote the binomial distribution function with trial parameter n ∈ ℕ₊ and success parameter p ∈ (0, 1).
Proof
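One common form of this relationship is F_{j, k}(x) = 1 − G_{j+k−1, x}(j − 1) for j, k ∈ ℕ₊ and x ∈ (0, 1); it can be checked numerically (a sketch of ours, with the beta distribution function integrated by a simple midpoint rule rather than a special-function library).

```python
import math

def binom_sf(k, n, p):
    # P(at least k successes in n Bernoulli(p) trials) = 1 - G_{n,p}(k - 1)
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def beta_cdf(x, a, b, steps=20_000):
    # F_{a,b}(x) by midpoint-rule integration of the beta density on [0, x]
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    h = x / steps
    return c * h * sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
                       for i in range(steps))
```

For example, with j = 3, k = 8 (so n = j + k − 1 = 10), F_{3,8}(0.4) should equal the probability of at least 3 successes in 10 trials with success probability 0.4.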
In the special distribution calculator, select the beta distribution. Vary the parameters and note the shape of the density function
and the distribution function. In each of the following cases, find the median, the first and third quartiles, and the interquartile
range. Sketch the boxplot.
1. a = 1 , b = 1
2. a = 1 , b = 3
3. a = 3 , b = 1
4. a = 2 , b = 4
5. a = 4 , b = 2
6. a = 4 , b = 4
Moments
The moments of the beta distribution are easy to express in terms of the beta function. As before, suppose that X has the beta
distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞).
If k ∈ [0, ∞) then

E(X^k) = B(a + k, b) / B(a, b)  (5.17.24)

In particular, if k ∈ ℕ then

E(X^k) = a^{[k]} / (a + b)^{[k]}  (5.17.25)
Proof
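The ascending-power formula for integer moments is easy to implement and check (a sketch of ours; the helper names are our own).

```python
def rising(x, k):
    # ascending power x^[k] = x (x + 1) ... (x + k - 1)
    out = 1.0
    for j in range(k):
        out *= x + j
    return out

def beta_moment(a, b, k):
    # E(X^k) = a^[k] / (a + b)^[k] for X beta(a, b) and k a nonnegative integer
    return rising(a, k) / rising(a + b, k)
```

With k = 1 this gives the mean a/(a + b), and E(X²) − [E(X)]² recovers the variance formula below.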
From the general formula for the moments, it's straightforward to compute the mean, variance, skewness, and kurtosis.
var(X) = ab / [(a + b)² (a + b + 1)]  (5.17.28)
Proof
Note that the variance depends on the parameters a and b only through the product ab and the sum a + b .
Open the special distribution simulator and select the beta distribution. Vary the parameters and note the size and location of
the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the sample
mean and standard deviation to the distribution mean and standard deviation.
skew(X) = 2(b − a) √(a + b + 1) / [(a + b + 2) √(ab)]  (5.17.29)

kurt(X) = [3a³b + 3ab³ + 6a²b² + 6a³ + 6b³ + 3a²b + 3ab² + 6a² + 6b² − 6ab] / [ab(a + b + 2)(a + b + 3)]  (5.17.30)
Proof
In particular, note that the distribution is positively skewed if a < b, unskewed if a = b (the distribution is symmetric about x = 1/2 in this case), and negatively skewed if a > b.
Open the special distribution simulator and select the beta distribution. Vary the parameters and note the shape of the
probability density function in light of the previous result on skewness. For various values of the parameters, run the
simulation 1000 times and compare the empirical density function to the true probability density function.
Related Distributions
The beta distribution is related to a number of other special distributions.
If X has the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) then Y = 1 −X has the beta
distribution with left parameter b and right parameter a .
Proof
The beta distribution with right parameter 1 has a reciprocal relationship with the Pareto distribution.
The following result gives a connection between the beta distribution and the gamma distribution.
Suppose that X has the gamma distribution with shape parameter a ∈ (0, ∞) and rate parameter r ∈ (0, ∞), Y has the
gamma distribution with shape parameter b ∈ (0, ∞) and rate parameter r, and that X and Y are independent. Then
V = X/(X + Y ) has the beta distribution with left parameter a and right parameter b .
Proof
The following result gives a connection between the beta distribution and the F distribution. This connection is a minor variation
of the previous result.
If X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the
denominator then
(n/d)X
Y = (5.17.38)
1 + (n/d)X
has the beta distribution with left parameter a = n/2 and right parameter b = d/2 .
Proof
Our next result is that the beta distribution is a member of the general exponential family of distributions.
Suppose that X has the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Then the distribution
is a two-parameter exponential family with natural parameters a − 1 and b − 1 , and natural statistics ln(X) and ln(1 − X) .
Proof
The beta distribution is also the distribution of the order statistics of a random sample from the standard uniform distribution.
Suppose n ∈ ℕ₊ and that (X₁, X₂, …, Xₙ) is a sequence of independent variables, each with the standard uniform distribution. For k ∈ {1, 2, …, n}, the kth order statistic X₍ₖ₎ has the beta distribution with left parameter a = k and right parameter b = n − k + 1.
Proof
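A quick simulation of ours illustrates the result: the 2nd order statistic of 5 uniforms should behave like a beta(2, 4) variable, whose mean is k/(n + 1) = 2/6.

```python
import random

# kth order statistic of n standard uniform variables
random.seed(5)
n, k, m = 5, 2, 100_000
vals = [sorted(random.random() for _ in range(n))[k - 1] for _ in range(m)]
sample_mean = sum(vals) / m   # beta(2, 4) mean: 2/6
```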
One of the most important properties of the beta distribution, and one of the main reasons for its wide use in statistics, is that it
forms a conjugate family for the success probability in the binomial and negative binomial distributions.
Suppose that P is a random probability having the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Suppose also that X is a random variable such that the conditional distribution of X given P = p ∈ (0, 1) is binomial with trial parameter n ∈ ℕ₊ and success parameter p. Then the conditional distribution of P given X = k is beta with left parameter a + k and right parameter b + (n − k).
Suppose again that P is a random probability having the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Suppose also that N is a random variable such that the conditional distribution of N given P = p ∈ (0, 1) is negative binomial with stopping parameter k ∈ ℕ₊ and success parameter p. Then the conditional distribution of P given N = n is beta with left parameter a + k and right parameter b + (n − k).
Proof
In both cases, note that in the posterior distribution of P, the left parameter is increased by the number of successes and the right parameter by the number of failures. For more on this, see the section on Bayesian estimation in the chapter on point estimation.
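The success/failure updating rule is simple enough to express directly (a sketch of ours; the function name is our own).

```python
def beta_binomial_update(a, b, k, n):
    # posterior of P after observing k successes in n binomial trials:
    # successes augment the left parameter, failures the right parameter
    return a + k, b + (n - k)
```

For example, a uniform prior (a = b = 1) followed by 7 successes in 10 trials gives a beta(8, 4) posterior.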
Suppose that Z has the standard beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). For c ∈ ℝ and d ∈ (0, ∞), the random variable X = c + dZ has the beta distribution with left parameter a, right parameter b, location parameter c, and scale parameter d.
For the remainder of this discussion, suppose that X has the distribution in the definition above.
Proof
Most of the results in the previous sections have simple extensions to the general beta distribution.
1. E(X) = c + d a/(a + b)
2. var(X) = d² ab / [(a + b)² (a + b + 1)]
Proof
Recall that skewness and kurtosis are defined in terms of standard scores, and hence are unchanged under location-scale transformations. Hence the skewness and kurtosis of X are just as for the standard beta distribution.
This page titled 5.17: The Beta Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.18: The Beta Prime Distribution
Basic Theory
The beta prime distribution is the distribution of the odds ratio associated with a random variable with the beta distribution. Since variables with
beta distributions are often used to model random probabilities and proportions, the corresponding odds ratios occur naturally as well.
Definition
Suppose that U has the beta distribution with shape parameters a, b ∈ (0, ∞) . Random variable X = U /(1 − U ) has the beta prime
distribution with shape parameters a and b .
The special case a = b = 1 is known as the standard beta prime distribution. Since U has a continuous distribution on the interval (0, 1) ,
random variable X has a continuous distribution on the interval (0, ∞).
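The definition suggests a direct simulation (a sketch of ours): generate U with a beta distribution and form the odds ratio U/(1 − U). Here the beta variable is built as a normalized ratio of gamma variables, anticipating a connection given later in the section; with a = 3 and b = 4, the mean of X should be a/(b − 1) = 1, per the moment results below.

```python
import random

random.seed(6)
a, b, n = 3.0, 4.0, 200_000

def beta_rv(a, b):
    # beta(a, b) variable as a normalized ratio of standard gamma variables
    g1, g2 = random.gammavariate(a, 1.0), random.gammavariate(b, 1.0)
    return g1 / (g1 + g2)

xs = []
for _ in range(n):
    u = beta_rv(a, b)
    xs.append(u / (1 - u))   # the odds ratio X = U/(1 - U)
sample_mean = sum(xs) / n
```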
Distribution Functions
Suppose that X has the beta prime distribution with shape parameters a, b ∈ (0, ∞) , and as usual, let B denote the beta function.
Proof
If a ≥ 1, the probability density function is defined at x = 0, so in this case, it's customary to add this endpoint to the domain. In particular, for the standard beta prime distribution,

f(x) = 1 / (1 + x)²,  x ∈ [0, ∞)  (5.18.4)
Qualitatively, the first order properties of the probability density function f depend only on a , and in particular on whether a is less than, equal
to, or greater than 1.
Qualitatively, the second order properties of f also depend only on a , with transitions at a = 1 and a = 2 .
3. If a > 2, f is concave upward, then downward, then upward again, with inflection points at x₁ and x₂.
Proof
Open the Special Distribution Simulator and select the beta prime distribution. Vary the parameters and note the shape of the probability
density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the
probability density function.
5.18.1 https://stats.libretexts.org/@go/page/10358
Because of the definition of the beta prime variable, the distribution function of X has a simple expression in terms of the beta distribution function with the same parameters, which in turn is the regularized incomplete beta function. So let G denote the distribution function of the beta distribution with parameters a, b ∈ (0, ∞), and recall that

G(x) = B(x; a, b) / B(a, b),  x ∈ (0, 1)  (5.18.9)
Proof
Similarly, the quantile function of X has a simple expression in terms of the beta quantile function G⁻¹ with the same parameters.
Proof
Open the Special Distribution Calculator and choose the beta prime distribution. Vary the parameters and note the shape of the distribution
function. For selected values of the parameters, find the median and the first and third quartiles.
For certain values of the parameters, the distribution and quantile functions have simple, closed form expressions.
If b = 1 then
1. F(x) = (x/(x + 1))^a for x ∈ [0, ∞)
2. F⁻¹(p) = p^{1/a} / (1 − p^{1/a}) for p ∈ [0, 1)

If a = 1 then
1. F(x) = 1 − (1/(x + 1))^b for x ∈ [0, ∞)
2. F⁻¹(p) = [1 − (1 − p)^{1/b}] / (1 − p)^{1/b} for p ∈ [0, 1)

If a = b = 1/2 then
1. F(x) = (2/π) arcsin(√(x/(x + 1))) for x ∈ [0, ∞)
2. F⁻¹(p) = sin²(πp/2) / [1 − sin²(πp/2)] for p ∈ [0, 1)
Proof
When a = b = 1/2, X is the odds ratio for a variable with the standard arcsine distribution.
Moments
As before, X denotes a random variable with the beta prime distribution, with parameters a, b ∈ (0, ∞) . The moments of X have a simple
expression in terms of the beta function.
If t ∈ (−a, b) then

E(X^t) = B(a + t, b − t) / B(a, b)  (5.18.13)
Of course, we are usually most interested in the integer moments of X. Recall that for x ∈ ℝ and n ∈ ℕ, the rising power of x of order n is x^{[n]} = x(x + 1) ⋯ (x + n − 1).
If n ≥ b then E(Xⁿ) = ∞.
Proof
If b > 1 then

E(X) = a / (b − 1)  (5.18.17)

If b > 2 then

var(X) = a(a + b − 1) / [(b − 1)² (b − 2)]  (5.18.18)
Proof
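Both formulas follow from the general moment result E(X^t) = B(a + t, b − t)/B(a, b); this sketch of ours checks them for a = 3, b = 5, where the mean is 3/4 and the variance is 3·7/(4²·3).

```python
import math

def B(a, b):
    # complete beta function
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

a, b = 3.0, 5.0
mean = B(a + 1, b - 1) / B(a, b)     # should equal a/(b - 1)
second = B(a + 2, b - 2) / B(a, b)   # E(X^2)
var = second - mean ** 2             # should equal a(a + b - 1)/((b - 1)^2 (b - 2))
```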
Open the Special Distribution Simulator and select the beta prime distribution. Vary the parameters and note the size and location of the
mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and
standard deviation to the distribution mean and standard deviation.
Finally, the general moment result leads to the skewness and kurtosis of X.
If b > 3 then

skew(X) = [2(2a + b − 1) / (b − 3)] √[(b − 2) / (a(a + b − 1))]  (5.18.19)
Proof
In particular, the distribution is positively skewed for all a > 0 and b > 3.
If b > 4 then

kurt(X) = 3 + 6[a(a + b − 1)(5b − 11) + (b − 1)²(b − 2)] / [a(a + b − 1)(b − 3)(b − 4)]  (5.18.20)
Proof
Related Distributions
The most important connection is the one between the beta prime distribution and the beta distribution given in the definition. We repeat this for
emphasis.
If X has the beta prime distribution with parameters a, b ∈ (0, ∞) then 1/X has the beta prime distribution with parameters b and a .
Proof
The beta prime distribution is closely related to the F distribution by a simple scale transformation.
Connections with the F distributions:
1. If X has the beta prime distribution with parameters a, b ∈ (0, ∞) then Y = (b/a)X has the F distribution with 2a degrees of freedom in the numerator and 2b degrees of freedom in the denominator.
Proof
The beta prime is the distribution of the ratio of independent variables with standard gamma distributions. (Recall that standard here means that
the scale parameter is 1.)
Suppose that Y and Z are independent and have standard gamma distributions with shape parameters a ∈ (0, ∞) and b ∈ (0, ∞) ,
respectively. Then X = Y /Z has the beta prime distribution with parameters a and b .
Proof
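The gamma-ratio representation gives the simplest way to simulate the beta prime distribution with the standard library (a sketch of ours); with a = 2 and b = 5, the sample mean should be close to a/(b − 1) = 0.5.

```python
import random

# X = Y/Z for independent standard gamma Y (shape a) and Z (shape b)
random.seed(7)
a, b, n = 2.0, 5.0, 200_000
xs = [random.gammavariate(a, 1.0) / random.gammavariate(b, 1.0) for _ in range(n)]
sample_mean = sum(xs) / n
```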
The standard beta prime distribution is the same as the standard log-logistic distribution.
Proof
Finally, the beta prime distribution is a member of the general exponential family of distributions.
Suppose that X has the beta prime distribution with parameters a, b ∈ (0, ∞). Then X has a two-parameter general exponential
distribution with natural parameters a − 1 and −(a + b) and natural statistics ln(X) and ln(1 + X) .
Proof
This page titled 5.18: The Beta Prime Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.19: The Arcsine Distribution
The arcsine distribution is important in the study of Brownian motion and prime numbers, among other applications.
The standard arcsine distribution is a continuous distribution on the interval (0, 1) with probability density function g given by
g(x) = 1 / (π √(x(1 − x))),  x ∈ (0, 1)  (5.19.1)
Proof
The occurrence of the arcsine function in the proof that g is a probability density function explains the name.
The standard arcsine probability density function g satisfies the following properties:
1. g is symmetric about x = 1/2.
2. g decreases and then increases with minimum value at x = 1/2.
3. g is concave upward.
4. g(x) → ∞ as x ↓ 0 and as x ↑ 1.
Proof
Open the Special Distribution Simulator and select the arcsine distribution. Keep the default parameter values and note the
shape of the probability density function. Run the simulation 1000 times and compare the empirical density function to the
probability density function.
The distribution function has a simple expression in terms of the arcsine function, again justifying the name of the distribution.
The standard arcsine distribution function G is given by G(x) = (2/π) arcsin(√x) for x ∈ [0, 1].
Proof
Not surprisingly, the quantile function has a simple expression in terms of the sine function.
The standard arcsine quantile function is G⁻¹(p) = sin²(πp/2) for p ∈ [0, 1]. In particular, the quartiles are
1. q₁ = sin²(π/8) = (2 − √2)/4 ≈ 0.1464, the first quartile
2. q₂ = 1/2, the median
3. q₃ = sin²(3π/8) = (2 + √2)/4 ≈ 0.8536, the third quartile
Proof
Open the Special Distribution Calculator and select the arcsine distribution. Keep the default parameter values and note the
shape of the distribution function. Compute selected values of the distribution function and the quantile function.
Moments
Suppose that random variable Z has the standard arcsine distribution. First we give the mean and variance.
1. E(Z) = 1/2
2. var(Z) = 1/8
Proof
5.19.1 https://stats.libretexts.org/@go/page/10359
Open the Special Distribution Simulator and select the arcsine distribution. Keep the default parameter values. Run the
simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
For n ∈ ℕ,

E(Zⁿ) = ∏_{j=0}^{n−1} (2j + 1)/(2j + 2)  (5.19.8)
Proof
Of course, the moments can be used to give a formula for the moment generating function, but this formula is not particularly
helpful since it is not in closed form.
Proof
Related Distributions
As noted earlier, the standard arcsine distribution is a special case of the beta distribution.
The standard arcsine distribution is the beta distribution with left parameter 1/2 and right parameter 1/2.
Proof
Since the quantile function is in closed form, the standard arcsine distribution can be simulated by the random quantile method.
1. If U has the standard uniform distribution then X = sin²(πU/2) has the standard arcsine distribution.
2. If X has the standard arcsine distribution then U = (2/π) arcsin(√X) has the standard uniform distribution.
Open the random quantile simulator and select the arcsine distribution. Keep the default parameters. Run the experiment 1000
times and compare the empirical probability density function, mean, and standard deviation to their distributional counterparts.
Note how the random quantiles simulate the distribution.
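The random quantile method is a one-liner in code; this sketch of ours checks the simulated mean and variance against the values 1/2 and 1/8 given above.

```python
import math, random

# random quantile method: X = sin^2(pi U / 2) for U standard uniform
random.seed(8)
n = 100_000
xs = [math.sin(math.pi * random.random() / 2) ** 2 for _ in range(n)]
sample_mean = sum(xs) / n
sample_var = sum((x - sample_mean) ** 2 for x in xs) / n
```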
The following exercise illustrates the connection between the Brownian motion process and the standard arcsine distribution.
Open the Brownian motion simulator. Keep the default time parameter and select the last zero random variable. Note that this
random variable has the standard arcsine distribution. Run the experiment 1000 time and compare the empirical probability
density function, mean, and standard deviation to their distributional counterparts. Note how the last zero simulates the
distribution.
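The connection can also be explored numerically. The sketch below (an added illustration, using a simple random walk as a discrete stand-in for Brownian motion) records the last zero of each walk as a fraction of the time horizon; by the arcsine law these fractions approximately follow the standard arcsine distribution, whose mean is 1/2:

```python
import random

random.seed(2)

def last_zero_fraction(steps):
    # Last time a simple +/-1 random walk returns to 0, as a fraction of time.
    pos, last = 0, 0
    for t in range(1, steps + 1):
        pos += random.choice((-1, 1))
        if pos == 0:
            last = t
    return last / steps

fractions = [last_zero_fraction(500) for _ in range(2000)]
mean = sum(fractions) / len(fractions)
print(round(mean, 2))  # arcsine law: the mean should be near 1/2
```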
Definition
If Z has the standard arcsine distribution, and if a ∈ R and w ∈ (0, ∞), then X = a + wZ has the arcsine distribution with location parameter a and scale parameter w.
Distribution Functions
Suppose that X has the arcsine distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) .
The probability density function f of X satisfies the following properties:
1. f is symmetric about a + (1/2)w.
2. f decreases and then increases, with minimum value at x = a + (1/2)w.
3. f is concave upward.
4. f(x) → ∞ as x ↓ a and as x ↑ a + w.
Proof
An alternate parameterization of the general arcsine distribution is by the endpoints of the support interval: the left endpoint
(location parameter) a and the right endpoint b = a + w .
Open the Special Distribution Simulator and select the arcsine distribution. Vary the location and scale parameters and note the
shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and
compare the empirical density function to the probability density function.
Once again, the distribution function has a simple representation in terms of the arcsine function.
Proof
As before, the quantile function has a simple representation in terms of the sine function: F⁻¹(p) = a + w sin²(πp/2) for p ∈ [0, 1]. In particular, the quartiles of X are
1. q1 = a + w sin²(π/8) = a + (1/4)(2 − √2)w, the first quartile
2. q2 = a + (1/2)w, the median
3. q3 = a + w sin²(3π/8) = a + (1/4)(2 + √2)w, the third quartile
Proof
Open the Special Distribution Calculator and select the arcsine distribution. Vary the parameters and note the shape and
location of the distribution function. For various values of the parameters, compute selected values of the distribution function
and the quantile function.
Moments
Again, we assume that X has the arcsine distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) . First we give
the mean and variance.
1. E(X) = a + (1/2)w
2. var(X) = (1/8)w²
Proof
Open the Special Distribution Simulator and select the arcsine distribution. Vary the parameters and note the size and location
of the mean±standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the
empirical mean and standard deviation to the true mean and standard deviation.
The moments of X can be obtained from the moments of Z , but the results are messy, except when the location parameter is 0.
For n ∈ N,
E(X^n) = w^n ∏_{j=0}^{n−1} (2j + 1)/(2j + 2)   (5.19.14)
Proof
The moment generating function can be expressed as a series with product coefficients, and so is not particularly helpful.
Proof
Proof
Related Distributions
By construction, the general arcsine distribution is a location-scale family, and so is closed under location-scale transformations.
If X has the arcsine distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞), and if c ∈ R and d ∈ (0, ∞), then c + dX has the arcsine distribution with location parameter c + da and scale parameter dw.
Proof
Since the quantile function is in closed form, the arcsine distribution can be simulated by the random quantile method.
1. If U has the standard uniform distribution then X = a + w sin²(πU/2) has the arcsine distribution with location parameter a and scale parameter w.
2. If X has the arcsine distribution with location parameter a and scale parameter w then U = (2/π) arcsin(√((X − a)/w)) has the
standard uniform distribution.
Open the random quantile simulator and select the arcsine distribution. Vary the parameters and note the location and shape of
the probability density function. For selected parameter values, run the experiment 1000 times and compare the empirical
probability density function, mean, and standard deviation to their distributional counterparts. Note how the random quantiles
simulate the distribution.
The following exercise illustrates the connection between the Brownian motion process and the arcsine distribution.
Open the Brownian motion simulator and select the last zero random variable. Vary the time parameter t and note that the last
zero has the arcsine distribution on the interval (0, t). Run the experiment 1000 times and compare the empirical probability
density function, mean, and standard deviation to their distributional counterparts. Note how the last zero simulates the
distribution.
This page titled 5.19: The Arcsine Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.20: General Uniform Distributions
This section explores uniform distributions in an abstract setting. If you are a new student of probability, or are not familiar with
measure theory, you may want to skip this section and read the sections on the uniform distribution on an interval and the discrete
uniform distributions.
Basic Theory
Definition
Suppose that (S, S , λ) is a measure space. That is, S is a set, S a σ-algebra of subsets of S , and λ a positive measure on S .
Suppose also that 0 < λ(S) < ∞ , so that λ is a finite, positive measure.
Random variable X with values in S has the uniform distribution on S (with respect to λ ) if
P(X ∈ A) = λ(A)/λ(S),   A ∈ S   (5.20.1)
Thus, the probability assigned to a set A ∈ S depends only on the size of A (as measured by λ ).
The most important special case is the Euclidean case, where λ_n is n-dimensional Lebesgue measure on (R^n, R_n), and R_n is the σ-algebra of measurable subsets of R^n. In this setting, S ∈ R_n with 0 < λ_n(S) < ∞, S = {A ∈ R_n : A ⊆ S}, and the measure is λ_n restricted to (S, S). In the Euclidean case, recall that λ_1 is length measure on R, λ_2 is area measure on R², λ_3 is volume measure on R³, and in general λ_n is sometimes referred to as n-dimensional volume. Thus, S ∈ R_n is a set with positive, finite volume.
Properties
Suppose (S, S , λ) is a finite, positive measure space, as above, and that X is uniformly distributed on S .
Proof
Thus, the defining property of the uniform distribution on a set is constant density on that set. Another basic property is that
uniform distributions are preserved under conditioning.
Suppose that R ∈ S with λ(R) > 0 . The conditional distribution of X given X ∈ R is uniform on R .
Proof
In the setting of the previous result, suppose that X = (X_1, X_2, …) is a sequence of independent variables, each uniformly distributed on S. Let N = min{n ∈ N_+ : X_n ∈ R}. Then N has the geometric distribution on N_+ with success parameter p = P(X ∈ R). More importantly, the distribution of X_N is the same as the conditional distribution of X given X ∈ R, and hence is uniform on R. This is the basis of the rejection method of simulation. If we can simulate a uniform distribution on S, then we can simulate a uniform distribution on R.
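As a concrete illustration (a Python sketch added here, not part of the original text): to simulate a point uniformly distributed on the unit disk R, generate points uniformly on the enclosing square S = [−1, 1]² and keep the first one that lands in R; the acceptance rate approximates λ(R)/λ(S) = π/4:

```python
import random

random.seed(3)
accepted, trials = [], 0
while len(accepted) < 50_000:
    trials += 1
    # Uniform point on the square S = [-1, 1]^2 ...
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    # ... accepted only if it falls in the disk R = {(x, y) : x^2 + y^2 <= 1}.
    if x * x + y * y <= 1:
        accepted.append((x, y))

rate = len(accepted) / trials  # should be near lambda(R)/lambda(S) = pi / 4
mean_x = sum(x for x, _ in accepted) / len(accepted)
print(round(rate, 2), round(mean_x, 2))  # by symmetry, E(X) = 0
```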
If h is a real-valued function on S, then E[h(X)] is the average value of h on S, as measured by λ:
E[h(X)] = (1/λ(S)) ∫_S h(x) dλ(x)
5.20.1 https://stats.libretexts.org/@go/page/10360
Proof
The entropy of the uniform distribution on S depends only on the size of S, as measured by λ: H(X) = ln λ(S).
Product Spaces
Suppose now that (S, S , λ) and (T , T , μ) are finite, positive measure spaces, so that 0 < λ(S) < ∞ and 0 < μ(T ) < ∞ . Recall
the product space (S × T , S ⊗ T , λ ⊗ μ) . The product σ-algebra S ⊗ T is the σ-algebra of subsets of S × T generated by
product sets A × B where A ∈ S and B ∈ T . The product measure λ ⊗ μ is the unique positive measure on (S × T , S ⊗ T )
that satisfies (λ ⊗ μ)(A × B) = λ(A)μ(B) for A ∈ S and B ∈ T .
(X, Y ) is uniformly distributed on S × T if and only if X is uniformly distributed on S , Y is uniformly distributed on T , and
X and Y are independent.
Proof
This page titled 5.20: General Uniform Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
5.21: The Uniform Distribution on an Interval
The continuous uniform distribution on an interval of R is one of the simplest of all probability distributions, but nonetheless very
important. In particular, continuous uniform distributions are the basic tools for simulating other probability distributions. The
uniform distribution corresponds to picking a point at random from the interval. The uniform distribution on an interval is a special
case of the general uniform distribution with respect to a measure, in this case Lebesgue measure (length measure) on R.
The continuous uniform distribution on the interval [0, 1] is known as the standard uniform distribution. Thus if U has the
standard uniform distribution then
P(U ∈ A) = λ(A) (5.21.1)
for every (Borel measurable) subset A of [0, 1], where λ is Lebesgue (length) measure.
A simulation of a random variable with the standard uniform distribution is known in computer science as a random number. All
programming languages have functions for computing random numbers, as do calculators, spreadsheets, and mathematical and
statistical software packages.
Distribution Functions
Suppose that U has the standard uniform distribution. By definition, the probability density function is constant on [0, 1].
Open the Special Distribution Simulator and select the continuous uniform distribution. Keep the default parameter values.
Run the simulation 1000 times and compare the empirical density function to the probability density function.
The quartiles of U are
1. q1 = 1/4, the first quartile
2. q2 = 1/2, the median
3. q3 = 3/4, the third quartile
Proof
Open the Special Distribution Calculator and select the continuous uniform distribution. Keep the default parameter values.
Compute a few values of the distribution function and the quantile function.
Moments
Suppose again that U has the standard uniform distribution. The moments (about 0) are simple.
For n ∈ N,
E(U^n) = 1/(n + 1)   (5.21.2)
5.21.1 https://stats.libretexts.org/@go/page/10361
Proof
The mean and variance follow easily from the general moment formula.
1. E(U) = 1/2
2. var(U) = 1/12
Open the Special Distribution Simulator and select the continuous uniform distribution. Keep the default parameter values.
Run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard
deviation.
Proof
Proof
Related Distributions
The standard uniform distribution is connected to every other probability distribution on R by means of the quantile function of the
other distribution. When the quantile function has a simple closed form expression, this result forms the primary method of
simulating the other distribution with a random number.
Suppose that F is the distribution function for a probability distribution on R, and that F⁻¹ is the corresponding quantile function. If U has the standard uniform distribution, then X = F⁻¹(U) has distribution function F.
Proof
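For instance (a small Python sketch added for illustration), the exponential distribution with rate 1 has distribution function F(x) = 1 − e^(−x) and quantile function F⁻¹(p) = −ln(1 − p), so applying F⁻¹ to random numbers simulates it:

```python
import math
import random

random.seed(4)

def exp_quantile(p):
    # Quantile function of the exponential distribution with rate 1:
    # F(x) = 1 - exp(-x)  =>  F^{-1}(p) = -ln(1 - p)
    return -math.log(1 - p)

sample = [exp_quantile(random.random()) for _ in range(100_000)]
mean = sum(sample) / len(sample)
print(round(mean, 1))  # the exponential distribution with rate 1 has mean 1
```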
Open the Random Quantile Experiment. For each distribution, run the simulation 1000 times and compare the empirical
density function to the probability density function of the selected distribution. Note how the random quantiles simulate the
distribution.
For a continuous distribution on an interval of R, the connection goes the other way.
Suppose that X has a continuous distribution on an interval I ⊆R , with distribution function F . Then U = F (X) has the
standard uniform distribution.
Proof
The beta distribution with left parameter a = 1 and right parameter b = 1 is the standard uniform distribution.
Proof
The standard uniform distribution is also the building block of the Irwin-Hall distributions.
The Uniform Distribution on a General Interval
Definition
The standard uniform distribution is generalized by adding location-scale parameters.
Suppose that U has the standard uniform distribution. For a ∈ R and w ∈ (0, ∞) random variable X = a + wU has the
uniform distribution with location parameter a and scale parameter w.
Distribution Functions
Suppose that X has the uniform distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) .
The last result shows that X really does have a uniform distribution, since the probability density function is constant on the
support interval. Moreover, we can clearly parameterize the distribution by the endpoints of this interval, namely a and b = a + w ,
rather than by the location, scale parameters a and w. In fact, the distribution is more commonly known as the uniform distribution
on the interval [a, b]. Nonetheless, it is useful to know that the distribution is the location-scale family associated with the standard
uniform distribution. In terms of the endpoint parameterization,
f(x) = 1/(b − a),   x ∈ [a, b]   (5.21.10)
Open the Special Distribution Simulator and select the uniform distribution. Vary the location and scale parameters and note
the graph of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare
the empirical density function to the probability density function.
Proof
The quartiles of X are
1. q1 = a + (1/4)w = (3/4)a + (1/4)b, the first quartile
2. q2 = a + (1/2)w = (1/2)a + (1/2)b, the median
3. q3 = a + (3/4)w = (1/4)a + (3/4)b, the third quartile
Proof
Open the Special Distribution Calculator and select the uniform distribution. Vary the parameters and note the graph of the
distribution function. For selected values of the parameters, compute a few values of the distribution function and the quantile
function.
Moments
Again we assume that X has the uniform distribution on the interval [a, b] where a, b ∈ R and a < b . Thus the location parameter
is a and the scale parameter w = b − a .
Proof
The mean and variance of X are
1. E(X) = (1/2)(a + b)
2. var(X) = (1/12)(b − a)²
Open the Special Distribution Simulator and select the uniform distribution. Vary the parameters and note the location and size
of the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.
Proof
The skewness and kurtosis of X are skew(X) = 0 and kurt(X) = 9/5.
Proof
If h is a real-valued function on [a, b], then E[h(X)] is the average value of h on [a, b], as defined in calculus:
E[h(X)] = (1/(b − a)) ∫_a^b h(x) dx
Proof
The entropy of the uniform distribution on an interval depends only on the length of the interval: if X is uniformly distributed on [a, b], then H(X) = ln(b − a).
Related Distributions
Since the uniform distribution is a location-scale family, it is trivially closed under location-scale transformations.
If X has the uniform distribution with location parameter a and scale parameter w, and if c ∈ R and d ∈ (0, ∞) , then
Y = c + dX has the uniform distribution with location parameter c + da and scale parameter dw.
Proof
As we saw above, the standard uniform distribution is a basic tool in the random quantile method of simulation. Uniform
distributions on intervals are also basic in the rejection method of simulation. We sketch the method in the next paragraph; see the
section on general uniform distributions for more theory.
Suppose that h is a probability density function for a continuous distribution with values in a bounded interval (a, b) ⊆ R. Suppose also that h is bounded, so that there exists c > 0 such that h(x) ≤ c for all x ∈ (a, b). Let X = (X_1, X_2, …) be a sequence of independent variables, each uniformly distributed on (a, b), and let Y = (Y_1, Y_2, …) be a sequence of independent variables, each uniformly distributed on (0, c). Finally, assume that X and Y are independent. Then ((X_1, Y_1), (X_2, Y_2), …) is a sequence of independent variables, each uniformly distributed on (a, b) × (0, c). Let N = min{n ∈ N_+ : 0 < Y_n < h(X_n)}. Then (X_N, Y_N) is uniformly distributed on R = {(x, y) ∈ (a, b) × (0, c) : y < h(x)} (the region under the graph of h), and therefore X_N has probability density function h. In words, we generate uniform points in the rectangular region (a, b) × (0, c) until we get a point under the graph of h. The x-coordinate of that point is our simulated value. The rejection method can be used to approximately simulate random variables when the region under the density function is unbounded.
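The scheme above can be sketched in a few lines of Python (an added illustration; the density h(x) = 2x on (0, 1), with bound c = 2, is my example choice, not one from the text):

```python
import random

random.seed(5)

def rejection_sample(h, a, b, c):
    # Generate uniform points in (a, b) x (0, c) until one lands strictly
    # under the graph of h; its x-coordinate then has density h.
    while True:
        x = random.uniform(a, b)
        y = random.uniform(0, c)
        if 0 < y < h(x):
            return x

h = lambda x: 2 * x  # density on (0, 1), bounded by c = 2
sample = [rejection_sample(h, 0.0, 1.0, 2.0) for _ in range(50_000)]
mean = sum(sample) / len(sample)
print(round(mean, 2))  # the density 2x on (0, 1) has mean 2/3
```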
Open the rejection method simulator. For each distribution, select a set of parameter values. Run the experiment 2000 times
and observe how the rejection method works. Compare the empirical density function, mean, and standard deviation to their
distributional counterparts.
This page titled 5.21: The Uniform Distribution on an Interval is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
5.22: Discrete Uniform Distributions
The discrete uniform distribution is a special case of the general uniform distribution with respect to a measure, in this case
counting measure. The distribution corresponds to picking an element of S at random. Most classical, combinatorial probability
models are based on underlying discrete uniform distributions. The chapter on Finite Sampling Models explores a number of such
models.
Since the distribution is uniform with respect to counting measure # on a finite, nonempty set S, X has probability density function f given by f(x) = 1/#(S) for x ∈ S, and P(X ∈ A) = #(A)/#(S) for A ⊆ S.
Proof
Like all uniform distributions, the discrete uniform distribution on a finite set is characterized by the property of constant density
on the set. Another property that all uniform distributions share is invariance under conditioning on a subset.
Suppose that R is a nonempty subset of S . Then the conditional distribution of X given X ∈ R is uniform on R .
Proof
If h : S → R then the expected value of h(X) is simply the arithmetic average of the values of h :
E[h(X)] = (1/#(S)) ∑_{x∈S} h(x)   (5.22.4)
Proof
Suppose that S = {x_1, x_2, …, x_n} is a subset of R with n points, and that X has the discrete uniform distribution on S. We will assume that the points are indexed in order, so that x_1 < x_2 < ⋯ < x_n. The distribution function F of X is given by
1. F(x) = 0 for x < x_1
2. F(x) = k/n for x_k ≤ x < x_{k+1} and k ∈ {1, 2, …, n − 1}
3. F(x) = 1 for x ≥ x_n
Proof
5.22.1 https://stats.libretexts.org/@go/page/10362
The moments of X are ordinary arithmetic averages. For k ∈ N,
E(X^k) = (1/n) ∑_{i=1}^n x_i^k   (5.22.7)
In particular,
1. μ = E(X) = (1/n) ∑_{i=1}^n x_i
2. σ² = var(X) = (1/n) ∑_{i=1}^n (x_i − μ)²
Of course, the results in the previous subsection apply with x_i = i − 1 for i ∈ {1, 2, …, n}. In this case, Z has the standard discrete uniform distribution on S = {0, 1, …, n − 1}, with probability density function g given by g(z) = 1/n for z ∈ S.
Open the Special Distribution Simulation and select the discrete uniform distribution. Vary the number of points, but keep the
default values for the other parameters. Note the graph of the probability density function. Run the simulation 1000 times and
compare the empirical density function to the probability density function.
The distribution function G of Z is given by G(z) = (1/n)(⌊z⌋ + 1) for z ∈ [0, n − 1].
Proof
Proof
Open the special distribution calculator and select the discrete uniform distribution. Vary the number of points, but keep the
default values for the other parameters. Note the graph of the distribution function. Compute a few values of the distribution
function and the quantile function.
For the standard discrete uniform distribution, results for the moments can be given in closed form.
1. E(Z) = (1/2)(n − 1)
2. var(Z) = (1/12)(n² − 1)
Proof
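These closed forms are easy to check against the arithmetic-average formulas above (a small Python check added for illustration):

```python
def discrete_uniform_moments(n):
    # Standard discrete uniform distribution on S = {0, 1, ..., n-1}:
    # mean and variance computed as arithmetic averages.
    values = range(n)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    return mean, var

for n in (2, 5, 10):
    mean, var = discrete_uniform_moments(n)
    # Closed forms: E(Z) = (n - 1)/2 and var(Z) = (n^2 - 1)/12
    assert mean == (n - 1) / 2
    assert abs(var - (n * n - 1) / 12) < 1e-12
print("ok")
```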
Open the Special Distribution Simulation and select the discrete uniform distribution. Vary the number of points, but keep the
default values for the other parameters. Note the size and location of the mean±standard deviation bar. Run the simulation 1000
times and compare the empirical mean and standard deviation to the true mean and standard deviation.
The skewness and kurtosis of Z are
1. skew(Z) = 0
2. kurt(Z) = 3(3n² − 7)/(5(n² − 1))
Proof
Note that kurt(Z) → 9/5 as n → ∞. The limiting value is the kurtosis of the uniform distribution on an interval.
Proof
Suppose that Z has the standard discrete uniform distribution on n ∈ N_+ points, and that a ∈ R and h ∈ (0, ∞). Then X = a + hZ has the uniform distribution on n points with location parameter a and scale parameter h.
Note that X takes values in the set S = {a, a + h, a + 2h, …, a + (n − 1)h}, so that S has n elements, starting at a, with step size h, a discrete interval. In the further special case where a ∈ Z and h = 1, we have an integer interval. Note that the last point is b = a + (n − 1)h, so we can clearly also parameterize the distribution by the endpoints a and b, and the step size h. With this parametrization, the number of points is n = 1 + (b − a)/h. For the remainder of this discussion, we assume that X has the distribution in the definition. Our first result is that the distribution of X really is uniform: X has probability density function f given by f(x) = 1/n for x ∈ S.
Proof
Open the Special Distribution Simulation and select the discrete uniform distribution. Vary the parameters and note the graph
of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.
Proof
Proof
Open the special distribution calculator and select the discrete uniform distribution. Vary the parameters and note the graph of
the distribution function. Compute a few values of the distribution function and the quantile function.
The mean and variance of X are
1. E(X) = a + (1/2)(n − 1)h = (1/2)(a + b)
2. var(X) = (1/12)(n² − 1)h² = (1/12)(b − a)(b − a + 2h)
Proof
Note that the mean is the average of the endpoints (and so is the midpoint of the interval [a, b]) while the variance depends only on
the number of points and the step size.
Open the Special Distribution Simulator and select the discrete uniform distribution. Vary the parameters and note the shape
and location of the mean/standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the true mean and standard deviation.
The skewness and kurtosis of X are
1. skew(X) = 0
2. kurt(X) = 3(3n² − 7)/(5(n² − 1))
Proof
Proof
Related Distributions
Since the discrete uniform distribution on a discrete interval is a location-scale family, it is trivially closed under location-scale
transformations.
Suppose that X has the discrete uniform distribution on n ∈ N_+ points with location parameter a ∈ R and scale parameter h ∈ (0, ∞). If c ∈ R and w ∈ (0, ∞) then Y = c + wX has the discrete uniform distribution on n points with location parameter c + wa and scale parameter wh.
In terms of the endpoint parameterization, X has left endpoint a , right endpoint a + (n − 1)h , and step size h while Y has left
endpoint c + wa , right endpoint (c + wa) + (n − 1)wh , and step size wh.
The uniform distribution on a discrete interval converges to the continuous uniform distribution on the interval with the same
endpoints, as the step size decreases to 0.
Suppose that X_n has the discrete uniform distribution with endpoints a and b, and step size (b − a)/n, for each n ∈ N_+. Then the distribution of X_n converges to the continuous uniform distribution on [a, b] as n → ∞.
Proof
This page titled 5.22: Discrete Uniform Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
5.23: The Semicircle Distribution
The standard semicircle distribution is a continuous distribution on the interval [−1, 1] with probability density function g given by
g(x) = (2/π) √(1 − x²),   x ∈ [−1, 1]   (5.23.1)
Proof
As noted in the proof, x ↦ √(1 − x²) for x ∈ [−1, 1] is the upper half of the circle of radius 1 centered at the origin, hence the name.
The standard semicircle probability density function g satisfies the following properties:
1. g is symmetric about x = 0 .
2. g increases and then decreases with mode at x = 0 .
3. g is concave downward.
Proof
As noted earlier, except for the normalizing constant, the graph of g is the upper half of the circle of radius 1 centered at the
origin, and so these properties are obvious.
Open special distribution simulator and select the semicircle distribution. With the default parameter value, note the shape of
the probability density function. Run the simulation 1000 times and compare the empirical density function to the probability
density function.
Proof
We cannot give the quantile function G⁻¹ in closed form, but values of this function can be approximated. Clearly by symmetry, G⁻¹(1 − p) = −G⁻¹(p) for p ∈ [0, 1]. In particular, the median is 0.
Open the special distribution simulator and select the semicircle distribution. With the default parameter value, note the shape
of the distribution function. Compute the first and third quartiles.
Moments
Suppose that X has the standard semicircle distribution. The moments of X about 0 can be computed explicitly. In particular, the
odd order moments are 0 by symmetry.
5.23.1 https://stats.libretexts.org/@go/page/10363
Proof
The numbers C_n = (2n)!/(n! (n + 1)!) for n ∈ N are known as the Catalan numbers, and are named for the Belgian mathematician Eugène Catalan. In particular, we can compute the mean, variance, skewness, and kurtosis.
Open the special distribution simulator and select the semicircle distribution. With the default parameter value, note the size
and location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and
standard deviation to the true mean and standard deviation.
Related Distributions
The semicircle distribution has simple connections to the continuous uniform distribution.
If (X, Y) is uniformly distributed on the circular region in R² centered at the origin with radius 1, then X and Y each have the standard semicircle distribution.
It's easy to simulate a random point that is uniformly distributed on circular region in the previous theorem, and this provides a way
of simulating a standard semicircle distribution. This is important since we can't use the random quantile method of simulation.
Suppose that U , V , and W are independent random variables, each with the standard uniform distribution (random numbers).
Let R = max{U , V } and Θ = 2πW , and then let X = R cos Θ , Y = R sin Θ . Then (X, Y ) is uniformly distributed on the
circular region of radius 1 centered at the origin, and hence X and Y each have the standard semicircle distribution.
Proof
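The construction in the theorem is easy to code (a Python sketch added for illustration). The key point is that R = max{U, V} has density 2r on [0, 1], which is exactly the radial density of a uniformly distributed point on the unit disk:

```python
import math
import random

random.seed(6)

def semicircle_sample():
    # R = max{U, V} has density 2r on [0, 1]; Theta is uniform on [0, 2*pi).
    # (R cos Theta, R sin Theta) is then uniform on the unit disk, so each
    # coordinate has the standard semicircle distribution.
    r = max(random.random(), random.random())
    theta = 2 * math.pi * random.random()
    return r * math.cos(theta)

sample = [semicircle_sample() for _ in range(100_000)]
mean = sum(sample) / len(sample)
var = sum(x * x for x in sample) / len(sample)
print(round(mean, 2), round(var, 2))  # theory: mean 0, variance 1/4
```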
Of course, note that X and Y in the previous theorem are not independent. Another method of simulation is to use the rejection
method. This method works well since the semicircle distribution has a bounded support interval and a bounded probability density
function.
Open the rejection method app and select the semicircle distribution. Keep the default parameters to get the standard semicircle
distribution. Run the simulation 1000 times and note the points in the scatterplot. Compare the empirical density function,
mean, and standard deviation to their distributional counterparts.
Definition
Suppose that Z has the standard semicircle distribution. For a ∈ R and r ∈ (0, ∞) , X = a + rZ has the semicircle
distribution with center (location parameter) a and radius (scale parameter) r.
Distribution Functions
Suppose that X has the semicircle distribution with center a ∈ R and radius r ∈ (0, ∞).
Proof
The graph of x ↦ √(r² − (x − a)²) for x ∈ [a − r, a + r] is the upper half of the circle of radius r centered at a. The area under this semicircle is πr²/2, so as a check on our work, we see that f is a valid probability density function.
Open special distribution simulator and select the semicircle distribution. Vary the center a and the radius r, and note the shape
of the probability density function. For selected values of a and r, run the simulation 1000 times and compare the empirical
density function to the probability density function.
Proof
As in the standard case, we cannot give the quantile function F⁻¹ in closed form, but values of this function can be approximated. By symmetry, F⁻¹(1/2 − p) = 2a − F⁻¹(1/2 + p) for 0 ≤ p ≤ 1/2. The median is a.
Open the special distribution simulator and select the semicircle distribution. Vary the center a and the radius r, and note the
shape of the distribution function. For selected values of a and r, compute the first and third quartiles.
Moments
Suppose again that X has the semicircle distribution with center a ∈ R and radius r ∈ (0, ∞), so by definition we can assume
X = a + rZ where Z has the standard semicircle distribution. The moments of X can be computed from the moments of Z .
Using the binomial theorem and the linearity of expected value we have
E(X^n) = ∑_{k=0}^n (n choose k) r^k a^{n−k} E(Z^k),   n ∈ N   (5.23.13)
In particular,
Proof
Open the special distribution simulator and select the semicircle distribution. Vary the center a and the radius r, and note the
size and location of the mean ± standard deviation bar. For selected values of a and r, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions
Since the semicircle distribution is a location-scale family, it's invariant under location-scale transformations.
Suppose that X has the semicircle distribution with center a ∈ R and radius r ∈ (0, ∞). If b ∈ R and c ∈ (0, ∞) then b + cX
has the semicircle distribution with center b + ca and radius cr.
Proof
The beta distribution with left parameter 3/2 and right parameter 3/2 is the semicircle distribution with center 1/2 and radius
1/2.
Proof
Since we can simulate a variable Z with the standard semicircle distribution by the method above, we can simulate a variable with
the semicircle distribution with center a ∈ R and radius r ∈ (0, ∞) by our very definition: X = a + rZ . Once again, the rejection
method also works well since the support and probability density function of X are bounded.
Open the rejection method app and select the semicircle distribution. For selected values of a and r, run the simulation 1000
times and note the points in the scatterplot. Compare the empirical density function, mean and standard deviation to their
distributional counterparts.
This page titled 5.23: The Semicircle Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
5.24: The Triangle Distribution
Like the semicircle distribution, the triangle distribution is based on a simple geometric shape. The distribution arises naturally
when uniformly distributed random variables are transformed in various ways.
The standard triangle distribution with vertex at p ∈ [0, 1] (equivalently, shape parameter p) is a continuous distribution on [0, 1] with probability density function g described as follows:
1. If p = 0, g(x) = 2(1 − x) for x ∈ [0, 1].
2. If p = 1, g(x) = 2x for x ∈ [0, 1].
3. If p ∈ (0, 1), g(x) = 2x/p for x ∈ [0, p] and g(x) = 2(1 − x)/(1 − p) for x ∈ [p, 1].
The shape of the probability density function justifies the name triangle distribution.
The graph of g, together with the domain [0, 1], forms a triangle with vertices (0, 0), (1, 0), and (p, 2). The mode of the distribution is x = p.
1. If p = 0 , g is decreasing.
2. If p = 1 , g is increasing.
3. If p ∈ (0, 1), g increases and then decreases.
Proof
Open special distribution simulator and select the triangle distribution. Vary p (but keep the default values for the other
parameters) and note the shape of the probability density function. For selected values of p, run the simulation 1000 times and
compare the empirical density function to the probability density function.
The distribution function G of X is given as follows:
1. If p = 0, G(x) = 1 − (1 − x)² for x ∈ [0, 1].
2. If p = 1, G(x) = x² for x ∈ [0, 1].
3. If p ∈ (0, 1),
G(x) = x²/p for x ∈ [0, p], and G(x) = 1 − (1 − x)²/(1 − p) for x ∈ [p, 1]   (5.24.2)
Proof
The quartiles of X are as follows:
1. The first quartile is √(p/4) if p ∈ [1/4, 1] and is 1 − √(3(1 − p)/4) if p ∈ [0, 1/4].
2. The median is √(p/2) if p ∈ [1/2, 1] and is 1 − √((1 − p)/2) if p ∈ [0, 1/2].
3. The third quartile is √(3p/4) if p ∈ [3/4, 1] and is 1 − √((1 − p)/4) if p ∈ [0, 3/4].
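Inverting G gives a closed-form quantile function, so the standard triangle distribution can also be simulated by the random quantile method (a Python sketch added for illustration):

```python
import math

def triangle_quantile(q, p):
    # Quantile function of the standard triangle distribution with vertex p:
    # invert G(x) = x^2/p on [0, p] and G(x) = 1 - (1-x)^2/(1-p) on [p, 1].
    if q <= p:
        return math.sqrt(q * p)
    return 1 - math.sqrt((1 - q) * (1 - p))

# Quartiles for p = 1/2, where the distribution is symmetric about 1/2:
q1 = triangle_quantile(0.25, 0.5)
q2 = triangle_quantile(0.50, 0.5)
q3 = triangle_quantile(0.75, 0.5)
print(round(q1, 4), round(q2, 4), round(q3, 4))
```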
5.24.1 https://stats.libretexts.org/@go/page/10364
Open the special distribution calculator and select the triangle distribution. Vary p (but keep the default values for the other
parameters) and note the shape of the distribution function. For selected values of p, compute the first and third quartiles.
Moments
Suppose that X has the standard triangle distribution with vertex p ∈ [0, 1]. The moments are easy to compute.

Suppose that n ∈ N.

1. If p = 1, E(Xⁿ) = 2/(n + 2).
2. If p ∈ [0, 1),

\[ E(X^n) = \frac{2}{n+2} p^{n+1} + \frac{2}{n+1} \, \frac{1 - p^{n+1}}{1 - p} - \frac{2}{n+2} \, \frac{1 - p^{n+2}}{1 - p} \tag{5.24.4} \]
Proof
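The moment formula can be sanity-checked with exact rational arithmetic (a sketch, not part of the text): the case n = 1 must reduce to the mean (1 + p)/3, and combining n = 1 and n = 2 must give the variance [1 − p(1 − p)]/18.

```python
from fractions import Fraction

def triangle_moment(n, p):
    # E(X^n) for the standard triangle distribution with vertex p in (0, 1)
    p = Fraction(p)
    return (Fraction(2, n + 2) * p ** (n + 1)
            + Fraction(2, n + 1) * (1 - p ** (n + 1)) / (1 - p)
            - Fraction(2, n + 2) * (1 - p ** (n + 2)) / (1 - p))

p = Fraction(1, 3)
mean = triangle_moment(1, p)                   # (1 + p)/3 = 4/9
variance = triangle_moment(2, p) - mean ** 2   # (1 - p(1 - p))/18 = 7/162
```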
From the general moment formula, we can compute the mean, variance, skewness, and kurtosis.

The mean and variance of X are

1. E(X) = (1 + p)/3
2. var(X) = [1 − p(1 − p)]/18

Proof

As a function of p, var(X) is a parabola opening upward; the largest value is 1/18 when p = 0 or p = 1, and the smallest value is
1/24 when p = 1/2.
Open the special distribution simulator and select the triangle distribution. Vary p (but keep the default values for the other
paramters) and note the size and location of the mean ± standard deviation bar. For selected values of p, run the simulation
1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
The skewness of X is

\[ \operatorname{skew}(X) = \frac{\sqrt{2}\,(1 - 2p)(1 + p)(2 - p)}{5[1 - p(1 - p)]^{3/2}} \tag{5.24.5} \]
Proof
Note that X is positively skewed for p < 1/2, negatively skewed for p > 1/2, and symmetric for p = 1/2. More specifically, if we
indicate the dependence on the parameter p, then skew_{1−p}(X) = −skew_p(X). Note also that the kurtosis is independent of p:
kurt(X) = 12/5.
Open the special distribution simulator and select the triangle distribution. Vary p (but keep the default values for the other
paramters) and note the degree of symmetry and the degree to which the distribution is peaked. For selected values of p, run
the simulation 1000 times and compare the empirical density function to the probability density function.
Related Distributions
If X has the standard triangle distribution with parameter p, then 1 − X has the standard triangle distribution with parameter
1 −p.
Proof
The standard triangle distribution has a number of connections with the standard uniform distribution. Recall that a simulation of a
random variable with a standard uniform distribution is a random number in computer science.
Suppose that U1 and U2 are independent random variables, each with the standard uniform distribution. Then
1. X = min{U1, U2} has the standard triangle distribution with p = 0.
2. Y = max{U1, U2} has the standard triangle distribution with p = 1.
Proof
Suppose again that U1 and U2 are independent random variables, each with the standard uniform distribution. Then
Y = (U1 + U2)/2 has the standard triangle distribution with p = 1/2.

Proof

In the previous result, note that Y is the sample mean from a random sample of size 2 from the standard uniform distribution. Since
the quantile function has a simple closed-form expression, the standard triangle distribution can be simulated using the random
quantile method.
Suppose that U has the standard uniform distribution and p ∈ [0, 1]. Then the random variable below has the standard
triangle distribution with parameter p:

\[ X = \begin{cases} \sqrt{pU}, & U \le p \\ 1 - \sqrt{(1 - p)(1 - U)}, & p < U \le 1 \end{cases} \tag{5.24.6} \]
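The random quantile method translates directly into code; a minimal sketch (function name and sample size are our own, not from the text):

```python
import math
import random

def rand_triangle(p):
    # Random quantile method for the standard triangle distribution with vertex p
    u = random.random()
    if u <= p:
        return math.sqrt(p * u)
    return 1 - math.sqrt((1 - p) * (1 - u))

random.seed(1)
sample = [rand_triangle(0.3) for _ in range(100_000)]
mean = sum(sample) / len(sample)   # E(X) = (1 + 0.3)/3, about 0.433
```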
Open the random quantile experiment and select the triangle distribution. Vary p (but keep the default values for the other
parameters) and note the shape of the distribution function/quantile function. For selected values of p, run the experiment 1000
times and watch the random quantiles. Compare the empirical density function, mean, and standard deviation to their
distributional counterparts.
The standard triangle distribution can also be simulated using the rejection method, which also works well since the region R under
the probability density function g is bounded. Recall that this method is based on the following fact: if (X, Y ) is uniformly
distributed on the rectangular region S = {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 2} which contains R , then the conditional distribution of
(X, Y ) given (X, Y ) ∈ R is uniformly distributed on R , and hence X has probability density function g .
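A minimal sketch of the rejection method for this distribution (assuming p ∈ (0, 1) so the density formula below is defined):

```python
import random

def rand_triangle_reject(p):
    # Sample (x, y) uniformly from the rectangle [0, 1] x [0, 2] until (x, y)
    # lies under the triangle density g; the accepted x then has density g.
    while True:
        x, y = random.random(), 2 * random.random()
        g = 2 * x / p if x <= p else 2 * (1 - x) / (1 - p)
        if y <= g:
            return x

random.seed(2)
sample = [rand_triangle_reject(0.5) for _ in range(50_000)]
mean = sum(sample) / len(sample)   # E(X) = (1 + 0.5)/3 = 0.5
```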
Open the rejection method experiment and select the triangle distribution. Vary p (but keep the default values for the other
parameters) and note the shape of the probability density function. For selected values of p, run the experiment 1000 times and
watch the scatterplot. Compare the empirical density function, mean, and standard deviation to their distributional counterparts.
For the extreme values of the shape parameter, the standard triangle distributions are also beta distributions.
Open the special distribution simulator and select the beta distribution. For parameter values given below, run the simulation
1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.
1. a = 1 , b = 2
2. a = 2 , b = 1
Definition
Suppose that Z has the standard triangle distribution with vertex at p ∈ [0, 1]. For a ∈ R and w ∈ (0, ∞), the random variable
X = a + wZ has the triangle distribution with location parameter a, scale parameter w, and shape parameter p.
Distribution Functions
Suppose that X has the general triangle distribution given in the definition above.
Proof
Once again, the shape of the probability density function justifies the name triangle distribution.
The graph of f , together with the domain [a, a + w] , forms a triangle with vertices (a, 0), (a + w, 0) , and (a + pw, 2/w). The
mode of the distribution is x = a + pw .
1. If p = 0 , f is decreasing.
2. If p = 1 , f is increasing.
3. If p ∈ (0, 1), f increases and then decreases.
Clearly the general triangle distribution could be parameterized by the left endpoint a , the right endpoint b = a+w and the
location of the vertex c = a + pw , but the location-scale-shape parameterization is better.
Open special distribution simulator and select the triangle distribution. Vary the parameters a , w, and p, and note the shape and
location of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare
the empirical density function to the probability density function.
The distribution function F is given as follows:

1. If p = 0, F(x) = 1 − (a + w − x)²/w² for x ∈ [a, a + w].
2. If p = 1, F(x) = (x − a)²/w² for x ∈ [a, a + w].
3. If p ∈ (0, 1),

\[ F(x) = \begin{cases} \dfrac{1}{p w^2}(x - a)^2, & x \in [a, a + pw] \\[1ex] 1 - \dfrac{1}{w^2(1 - p)}(a + w - x)^2, & x \in [a + pw, a + w] \end{cases} \tag{5.24.9} \]
Proof
1. The first quartile is \(a + w\sqrt{p/4}\) if p ∈ [1/4, 1] and is \(a + w\left(1 - \sqrt{3(1-p)/4}\right)\) if p ∈ [0, 1/4].
2. The median is \(a + w\sqrt{p/2}\) if p ∈ [1/2, 1] and is \(a + w\left(1 - \sqrt{(1-p)/2}\right)\) if p ∈ [0, 1/2].
3. The third quartile is \(a + w\sqrt{3p/4}\) if p ∈ [3/4, 1] and is \(a + w\left(1 - \sqrt{(1-p)/4}\right)\) if p ∈ [0, 3/4].
Proof
Open the special distribution calculator and select the triangle distribution. Vary the parameters a, w, and p, and note the
shape and location of the distribution function. For selected values of parameters, compute the median and the first and third
quartiles.
Moments
Suppose again that X has the triangle distribution with location parameter a ∈ R , scale parameter w ∈ (0, ∞) and shape parameter
p ∈ [0, 1]. Then we can take X = a + wZ where Z has the standard triangle distribution with parameter p . Hence the moments of
X can be computed from the moments of Z . Using the binomial theorem and the linearity of expected value we have
\[ E(X^n) = \sum_{k=0}^{n} \binom{n}{k} w^k a^{n-k} E(Z^k), \quad n \in \mathbb{N} \tag{5.24.12} \]
The mean and variance of X are

1. E(X) = a + w(1 + p)/3
2. var(X) = w²[1 − p(1 − p)]/18
Proof
Open the special distribution simulator and select the triangle distribution. Vary the parameters a , w, and p, and note the size
and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.
The skewness of X is

\[ \operatorname{skew}(X) = \frac{\sqrt{2}\,(1 - 2p)(1 + p)(2 - p)}{5[1 - p(1 - p)]^{3/2}} \tag{5.24.13} \]
Proof
As before, the kurtosis is independent of p: kurt(X) = 12/5.
Related Distributions
Since the triangle distribution is a location-scale family, it's invariant under location-scale transformations. More generally, the
family is closed under linear transformations with nonzero slope.
Suppose that X has the triangle distribution with location parameter a ∈ R, scale parameter w ∈ (0, ∞), and shape parameter
p ∈ [0, 1]. If b ∈ R and c ∈ (0, ∞) then
1. b + cX has the triangle distribution with location parameter b + ca , scale parameter cw, and shape parameter p.
2. b − cX has the triangle distribution with location parameter b − c(a + w) , scale parameter cw, and shape parameter 1 − p .
Proof
As with the standard distribution, there are several connections between the triangle distribution and the continuous uniform
distribution.
Suppose that V1 and V2 are independent and are uniformly distributed on the interval [a, a + w], where a ∈ R and
w ∈ (0, ∞). Then

1. min{V1, V2} has the triangle distribution with location parameter a, scale parameter w, and shape parameter p = 0.
2. max{V1, V2} has the triangle distribution with location parameter a, scale parameter w, and shape parameter p = 1.
Proof
Suppose again that V1 and V2 are independent and are uniformly distributed on the interval [a, a + w], where a ∈ R and
w ∈ (0, ∞). Then

1. |V2 − V1| has the triangle distribution with location parameter 0, scale parameter w, and shape parameter p = 0.
2. V1 + V2 has the triangle distribution with location parameter 2a, scale parameter 2w, and shape parameter p = 1/2.
3. V2 − V1 has the triangle distribution with location parameter −w, scale parameter 2w, and shape parameter p = 1/2.
Proof
A special case of (b) leads to a connection between the triangle distribution and the Irwin-Hall distribution.
Suppose that U1 and U2 are independent random variables, each with the standard uniform distribution. Then U1 + U2 has the
triangle distribution with location parameter 0, scale parameter 2, and shape parameter 1/2. But this is also the Irwin-Hall
distribution of order n = 2.
Open the special distribution simulator and select the Irwin-Hall distribution. Set n = 2 and note the shape and location of the
probability density function. Run the simulation 1000 times and compare the empirical density function, mean, and standard
deviation to their distributional counterparts.
Since we can simulate a variable Z with the standard triangle distribution with parameter p ∈ [0, 1] by the random quantile method
above, we can simulate a variable with the triangle distribution that has location parameter a ∈ R , scale parameter w ∈ (0, ∞) ,
and shape parameter p by our very definition: X = a + wZ . Equivalently, we could compute a random quantile using the quantile
function of X.
Open the random quantile experiment and select the triangle distribution. Vary the location parameter a , the scale parameter w,
and the shape parameter p, and note the shape of the distribution function. For selected values of the parameters, run the
experiment 1000 times and watch the random quantiles. Compare the empirical density function, mean and standard deviation
to their distributional counterparts.
As with the standard distribution, the general triangle distribution has a bounded probability density function on a bounded interval,
and hence can be simulated easily via the rejection method.
Open the rejection method experiment and select the triangle distribution. Vary the parameters and note the shape of the
probability density function. For selected values of the parameters, run the experiment 1000 times and watch the scatterplot.
Compare the empirical density function, mean, and standard deviation to their distributional counterparts.
This page titled 5.24: The Triangle Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.25: The Irwin-Hall Distribution
The Irwin-Hall distribution, named for Joseph Irwin and Philip Hall, is the distribution that governs the sum of independent
random variables, each with the standard uniform distribution. It is also known as the uniform sum distribution. Since the standard
uniform is one of the simplest and most basic distributions (and corresponds in computer science to a random number), the Irwin-
Hall is a natural family of distributions. It also serves as a nice example of the central limit theorem, conceptually easy to
understand.
Basic Theory
Definition
Suppose that U = (U1, U2, …) is a sequence of independent random variables, each with the uniform distribution on the
interval [0, 1] (the standard uniform distribution). For n ∈ N+, the random variable

\[ X_n = \sum_{i=1}^{n} U_i \tag{5.25.1} \]

has the Irwin-Hall distribution of order n.
Distribution Functions
Let f denote the probability density function of the standard uniform distribution, so that f(x) = 1 for 0 ≤ x ≤ 1 (and is 0
otherwise). It follows immediately that the probability density function f_n of X_n satisfies f_n = f^{*n}, where f^{*n} is the n-fold
convolution power of f. In particular,

\[ f_2(x) = \begin{cases} x, & 0 \le x \le 1 \\ x - 2(x - 1), & 1 \le x \le 2 \end{cases} \tag{5.25.2} \]
Proof
Note that the graph of f_2 on [0, 2] consists of two lines, pieced together in a continuous way at x = 1. The form given above is not
the simplest, but makes the continuity clear, and will be helpful when we generalize.
In the special distribution simulator, select the Irwin-Hall distribution and set n = 2 . Note the shape of the probability density
function. Run the simulation 1000 times and compare the empirical density function to the probability density function.
\[ f_3(x) = \begin{cases} \frac{1}{2} x^2, & 0 \le x \le 1 \\[0.5ex] \frac{1}{2} x^2 - \frac{3}{2}(x - 1)^2, & 1 \le x \le 2 \\[0.5ex] \frac{1}{2} x^2 - \frac{3}{2}(x - 1)^2 + \frac{3}{2}(x - 2)^2, & 2 \le x \le 3 \end{cases} \tag{5.25.3} \]

Note that the graph of f_3 on [0, 3] consists of three parabolas pieced together in a continuous way at x = 1 and x = 2. The
expressions for f_3(x) for 1 ≤ x ≤ 2 and 2 ≤ x ≤ 3 can be expanded and simplified, but again the form given above makes the
continuity clear.
In the special distribution simulator, select the Irwin-Hall distribution and set n = 3 . Note the shape of the probability density
function. Run the simulation 1000 times and compare the empirical density function to the probability density function.
Naturally, we don't want to perform the convolutions one at a time; we would like a general formula. To state the formula
succinctly, we need to recall the floor function:
\[ \lfloor x \rfloor = \max\{n \in \mathbb{Z} : n \le x\}, \quad x \in \mathbb{R} \tag{5.25.4} \]

For n ∈ N+, the probability density function f_n of X_n is given by

\[ f_n(x) = \frac{1}{(n-1)!} \sum_{k=0}^{\lfloor x \rfloor} (-1)^k \binom{n}{k} (x - k)^{n-1}, \quad x \in \mathbb{R} \tag{5.25.5} \]
Proof
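The general formula is easy to implement; a sketch (the function name is ours) that can be checked against the explicit n = 2 and n = 3 densities above:

```python
import math

def irwin_hall_pdf(x, n):
    # f_n(x) = (1/(n-1)!) * sum_{k=0}^{floor(x)} (-1)^k C(n, k) (x - k)^(n-1)
    if x < 0 or x > n:
        return 0.0
    s = sum((-1) ** k * math.comb(n, k) * (x - k) ** (n - 1)
            for k in range(math.floor(x) + 1))
    return s / math.factorial(n - 1)

# f_2(0.5) = 0.5 and f_3(1.5) = 0.75, matching the explicit formulas
```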
Note that for n ∈ N+, the graph of f_n on [0, n] consists of n polynomials of degree n − 1 pieced together in a continuous way.
Such a construction is known as a polynomial spline. The points where the polynomials are connected are known as knots. So f_n is
a polynomial spline of degree n − 1 with knots at x ∈ {1, 2, …, n − 1}. There is another representation of f_n as a sum. To state it,
we need the sign function:

\[ \operatorname{sgn}(x) = \begin{cases} -1, & x < 0 \\ 0, & x = 0 \\ 1, & x > 0 \end{cases} \tag{5.25.12} \]

\[ f_n(x) = \frac{1}{2(n-1)!} \sum_{k=0}^{n} (-1)^k \binom{n}{k} \operatorname{sgn}(x - k)(x - k)^{n-1}, \quad x \in \mathbb{R} \tag{5.25.13} \]
Direct Proof
Proof by induction
Open the special distribution simulator and select the Irwin-Hall distribution. Start with n = 1 and increase n successively to
the maximum n = 10 . Note the shape of the probability density function. For various values of n , run the simulation 1000
times and compare the empirical density function to the probability density function.
For n ∈ {2, 3, …}, the Irwin-Hall distribution is symmetric and unimodal, with mode at n/2.
The distribution function F_n of X_n is given by

\[ F_n(x) = \frac{1}{n!} \sum_{k=0}^{\lfloor x \rfloor} (-1)^k \binom{n}{k} (x - k)^n, \quad x \in [0, n] \tag{5.25.20} \]

Proof

Equivalently,

\[ F_n(x) = \frac{1}{2} + \frac{1}{2\, n!} \sum_{k=0}^{n} (-1)^k \binom{n}{k} \operatorname{sgn}(x - k)(x - k)^n, \quad x \in [0, n] \tag{5.25.21} \]
Proof
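A companion sketch for the distribution function; by the symmetry of the distribution, F_n(n/2) = 1/2:

```python
import math

def irwin_hall_cdf(x, n):
    # F_n(x) = (1/n!) * sum_{k=0}^{floor(x)} (-1)^k C(n, k) (x - k)^n on [0, n]
    if x <= 0:
        return 0.0
    if x >= n:
        return 1.0
    s = sum((-1) ** k * math.comb(n, k) * (x - k) ** n
            for k in range(math.floor(x) + 1))
    return s / math.factorial(n)

# By symmetry, F_n(n/2) = 1/2 for every n
```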
Open the special distribution calculator and select the Irwin-Hall distribution. Vary n from 1 to 10 and note the shape of the
distribution function. For each value of n compute the first and third quartiles.
Moments
The moments of the Irwin-Hall distribution are easy to obtain from the representation as a sum of independent standard uniform
variables. Once again, we assume that X_n has the Irwin-Hall distribution of order n ∈ N+.

The mean and variance of X_n are

1. E(X_n) = n/2
2. var(X_n) = n/12
Proof
Open the special distribution simulator and select the Irwin-Hall distribution. Vary n and note the shape and location of the
mean ± standard deviation bar. For selected values of n run the simulation 1000 times and compare the empirical mean and
standard deviation to the distribution mean and standard deviation.
The skewness and kurtosis of X_n are

1. skew(X_n) = 0
2. kurt(X_n) = 3 − 6/(5n)
Proof
Note that kurt(X_n) → 3, the kurtosis of the normal distribution, as n → ∞. That is, the excess kurtosis kurt(X_n) − 3 → 0 as
n → ∞.
Open the special distribution simulator and select the Irwin-Hall distribution. Vary n and note the shape of the probability
density function in light of the previous results on skewness and kurtosis. For selected values of n run the simulation 1000
times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.
The moment generating function of X_n is E(e^{tX_n}) = [(e^t − 1)/t]^n for t ≠ 0 (and 1 for t = 0).

Proof
Related Distributions
The most important connection is to the standard uniform distribution in the definition: The Irwin-Hall distribution of order
n ∈ N + is the distribution of the sum of n independent variables, each with the standard uniform distribution. The Irwin-Hall
distribution of order 2 is also a triangle distribution:
The Irwin-Hall distribution of order 2 is the triangle distribution with location parameter 0, scale parameter 2, and shape
parameter 1/2.
Proof
The Irwin-Hall distribution is connected to the normal distribution via the central limit theorem.
Suppose that X_n has the Irwin-Hall distribution of order n for each n ∈ N+. Then the distribution of

\[ Z_n = \frac{X_n - n/2}{\sqrt{n/12}} \tag{5.25.23} \]

converges to the standard normal distribution as n → ∞.

Thus, if n is large, X_n has approximately a normal distribution with mean n/2 and variance n/12.
Open the special distribution simulator and select the Irwin-Hall distribution. Start with n = 1 and increase n successively to
the maximum n = 10 . Note how the probability density function becomes more “normal” as n increases. For various values of
n , run the simulation 1000 times and compare the empirical density function to the probability density function.
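A classic illustration (our own sketch, using n = 12 rather than the simulator's maximum of 10): standardizing the sum of 12 random numbers gives an approximately standard normal variable, so about 68.3% of values should fall within one standard deviation of the mean.

```python
import math
import random

# Standardized sum of n standard uniform variables, Z_n = (X_n - n/2)/sqrt(n/12)
random.seed(3)
n, reps = 12, 100_000
zs = [(sum(random.random() for _ in range(n)) - n / 2) / math.sqrt(n / 12)
      for _ in range(reps)]
within_one_sd = sum(abs(z) <= 1 for z in zs) / reps   # normal value is about 0.6827
```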
The Irwin-Hall distribution of order n is trivial to simulate, as the sum of n random numbers. Since the probability density function
is bounded on a bounded support interval, the distribution can also be simulated via the rejection method. Computationally, this is a
dumb thing to do, of course, but it can still be a fun exercise.
Open the rejection method experiment and select the Irwin-Hall distribution. For various values of n , run the simulation 2000
times. Compare the empirical density function, mean, and standard deviation to their distributional counterparts.
This page titled 5.25: The Irwin-Hall Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.26: The U-Power Distribution
The U-power distribution is a U-shaped family of distributions based on a simple family of power functions.
The standard U-power distribution with shape parameter k ∈ N+ is a continuous distribution on [−1, 1] with probability
density function g given by

\[ g(x) = \frac{2k + 1}{2} x^{2k}, \quad x \in [-1, 1] \tag{5.26.1} \]
Proof
The algebraic form of the probability density function explains the name of the distribution. The most common of the standard U-
power distributions is the U-quadratic distribution, which corresponds to k = 1 .
The standard U-power probability density function g satisfies the following properties:
1. g is symmetric about x = 0 .
2. g decreases and then increases with minimum value at x = 0 .
3. The modes are x = ±1 .
4. g is concave upward.
Proof
Open the Special Distribution Simulator and select the U-power distribution. Vary the shape parameter but keep the default
values for the other parameters. Note the graph of the probability density function. For selected values of the shape parameter,
run the simulation 1000 times and compare the empirical density function to the probability density function.
Proof
1. The quantile function is G⁻¹(p) = (2p − 1)^{1/(2k+1)} for p ∈ [0, 1].
2. The first quartile is q1 = −(1/2)^{1/(2k+1)}.
3. The median is 0.
4. The third quartile is q3 = (1/2)^{1/(2k+1)}.
Proof
Open the Special Distribution Calculator and select the U-power distribution. Vary the shape parameter but keep the default
values for the other parameters. Note the shape of the distribution function. For various values of the shape parameter, compute
a few quantiles.
Moments
Suppose that Z has the standard U-power distribution with parameter k ∈ N+. The moments (about 0) are easy to compute.
For n ∈ N, E(Zⁿ) = 0 if n is odd, and E(Zⁿ) = (2k + 1)/(2k + n + 1) if n is even.
Proof
Since the mean is 0, the moments about 0 are also the central moments. In particular, E(Z) = 0 and var(Z) = (2k + 1)/(2k + 3).
Proof
Open the Special Distribution Simulator and select the U-power distribution. Vary the shape parameter but keep the default
values for the other parameters. Note the position and size of the mean ± standard deviation bar. For selected values of the
shape parameter, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean
and standard deviation.
The skewness and kurtosis of Z are

1. skew(Z) = 0
2. kurt(Z) = (2k + 3)² / [(2k + 5)(2k + 1)]

Proof

Note that kurt(Z) → 1 as k → ∞. The excess kurtosis is kurt(Z) − 3 = (2k + 3)² / [(2k + 5)(2k + 1)] − 3, and so kurt(Z) − 3 → −2 as
k → ∞.
Related Distributions
The U-power probability density function g actually makes sense for k = 0 as well, and in this case the distribution reduces to the
uniform distribution on the interval [−1, 1]. But of course, this distribution is not U-shaped, except in a degenerate sense. There are
other connections to the uniform distribution. The first is a standard result since the U-power quantile function has a simple, closed
representation:

Suppose that k ∈ N+.

1. If U has the standard uniform distribution, then Z = (2U − 1)^{1/(2k+1)} has the standard U-power distribution with shape
parameter k.
2. If Z has the standard U-power distribution with shape parameter k, then U = (1 + Z^{2k+1})/2 has the standard uniform
distribution.
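Part (a) gives the random quantile method of simulation; a sketch (taking the real odd root of 2U − 1, since that quantity can be negative):

```python
import math
import random

def rand_upower(k):
    # Random quantile method: Z = (2U - 1)^(1/(2k+1)), using the real odd root
    s = 2 * random.random() - 1
    return math.copysign(abs(s) ** (1 / (2 * k + 1)), s)

random.seed(4)
sample = [rand_upower(1) for _ in range(100_000)]
m2 = sum(z * z for z in sample) / len(sample)   # E(Z^2) = (2k+1)/(2k+3) = 3/5 for k = 1
```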
Open the random quantile simulator and select the U-power distribution. Vary the shape parameter but keep the default values
for the other parameters. Note the shape of the distribution and density functions. For selected values of the parameter, run the
simulation 1000 times and note the random quantiles. Compare the empirical density function to the probability density
function.
The standard U-power distribution with shape parameter k ∈ N + converges to the discrete uniform distribution on {−1, 1} as
k → ∞.
Proof
Definition
Suppose that Z has the standard U-power distribution with shape parameter k ∈ N+. If μ ∈ R and c ∈ (0, ∞) then
X = μ + cZ has the U-power distribution with shape parameter k, location parameter μ, and scale parameter c.

Note that X has a continuous distribution on the interval [a, b] where a = μ − c and b = μ + c, so the distribution can also be
parameterized by the shape parameter k and the endpoints a and b. With this parametrization, the location parameter is
μ = (a + b)/2 and the scale parameter is c = (b − a)/2.
Distribution Functions
Suppose that X has the U-power distribution with shape parameter k ∈ N+, location parameter μ ∈ R, and scale parameter
c ∈ (0, ∞). Then X has probability density function f given by

\[ f(x) = \frac{2k + 1}{2c} \left( \frac{x - \mu}{c} \right)^{2k}, \quad x \in [\mu - c, \mu + c] \]

1. f is symmetric about μ.
2. f decreases and then increases with minimum value at x = μ.
3. The modes are at x = μ ± c.
4. f is concave upward.
Proof
Open the Special Distribution Simulator and select the U-power distribution. Vary the parameters and note the shape and
location of the probability density function. For various values of the parameters, run the simulation 1000 times and compare
the empirical density function to the probability density function.
Proof
1. The quantile function is F⁻¹(p) = μ + c(2p − 1)^{1/(2k+1)} for p ∈ [0, 1].
2. The first quartile is q1 = μ − c(1/2)^{1/(2k+1)}.
3. The median is μ.
4. The third quartile is q3 = μ + c(1/2)^{1/(2k+1)}.
Proof
Open the Special Distribution Calculator and select the U-power distribution. Vary the parameters and note the graph of the
distribution function. For various values of the parameters, compute selected values of the distribution function and the
quantile function.
Moments
Suppose again that X has the U-power distribution with shape parameter k ∈ N+, location parameter μ ∈ R, and scale parameter
c ∈ (0, ∞).

The mean and variance of X are

1. E(X) = μ
2. var(X) = c²(2k + 1)/(2k + 3)
Proof
Note that var(X) → c² as k → ∞.
Open the Special Distribution Simulator and select the U-power distribution. Vary the parameters and note the size and
location of the mean ± standard deviation bar. For various values of the parameters, run the simulation 1000 times and
compare the empirical mean and stadard deviation to the distribution mean and standard deviation.
The moments about 0 are messy, but the central moments are simple. For n ∈ N,

\[ E\left[(X - \mu)^{2n}\right] = c^{2n} \, \frac{2k + 1}{2(n + k) + 1} \tag{5.26.9} \]
Proof
The skewness and kurtosis of X are

1. skew(X) = 0
2. kurt(X) = (2k + 3)² / [(2k + 5)(2k + 1)]

Proof

Again, kurt(X) → 1 as k → ∞, and the excess kurtosis is kurt(X) − 3 = (2k + 3)² / [(2k + 5)(2k + 1)] − 3.
Related Distributions
Since the U-power distribution with a given shape parameter is a location-scale family, it is trivially closed under location-scale
transformations.
Suppose that X has the U-power distribution with shape parameter k ∈ N+, location parameter μ ∈ R, and scale parameter
c ∈ (0, ∞). If α ∈ R and β ∈ (0, ∞), then Y = α + βX has the U-power distribution with shape parameter k, location
parameter α + βμ, and scale parameter βc.
As before, since the U-power distribution function and the U-power quantile function have simple forms, we have the usual
connections with the standard uniform distribution.
1. If U has the standard uniform distribution, then X = μ + c(2U − 1)^{1/(2k+1)} has the U-power distribution with shape
parameter k, location parameter μ, and scale parameter c.
2. If X has the U-power distribution with shape parameter k, location parameter μ, and scale parameter c, then
U = (1/2)[1 + ((X − μ)/c)^{2k+1}] has the standard uniform distribution.

Again, part (a) of course leads to the random quantile method of simulation.
Open the random quantile simulator and select the U-power distribution. Vary the parameters and note the shape of the
distribution and density functions. For selected values of the parameters, run the simulation 1000 times and note the random
quantiles. Compare the empirical density function to the probability density function.
The U-power distribution with given location and scale parameters converges to the discrete uniform distribution at the endpoints
as the shape parameter increases.

The U-power distribution with shape parameter k ∈ N+, location parameter μ ∈ R, and scale parameter c ∈ (0, ∞) converges
to the discrete uniform distribution on {μ − c, μ + c} as k → ∞.
The U-power distribution is a general exponential family in the shape parameter, if the location and scale parameters are fixed.
Suppose that X has the U-power distribution with unspecified shape parameter k ∈ N+, but with specified location parameter
μ ∈ R and scale parameter c ∈ (0, ∞). Then X has a one-parameter exponential distribution with natural parameter 2k and
natural statistic ln|(X − μ)/c|.
Proof
Since the U-power distribution has a bounded probability density function on a bounded support interval, it can also be simulated
via the rejection method.
Open the rejection method experiment and select the U-power distribution. Vary the parameters and note the shape of the
probability density function. For selected values of the parameters, run the experiment 1000 times and watch the scatterplot.
Compare the empirical density function, mean, and standard deviation to their distributional counterparts.
This page titled 5.26: The U-Power Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.27: The Sine Distribution
The sine distribution is a simple probability distribution based on a portion of the sine curve. It is also known as Gilbert's sine
distribution, named for the American geologist Grove Karl (GK) Gilbert who used the distribution in 1892 to study craters on the
moon.
The standard sine distribution is a continuous distribution on [0, 1] with probability density function g given by
g(z) = (π/2) sin(πz) for z ∈ [0, 1].

1. g is symmetric about z = 1/2.
2. g increases and then decreases, with mode z = 1/2.
3. g is concave downward.
Proof
Open the Special Distribution Simulator and select the sine distribution. Run the simulation 1000 times and compare the
emprical density function to the probability density function.
The distribution function G is given by G(z) = (1/2)[1 − cos(πz)] for z ∈ [0, 1].
Proof
The quantile function G⁻¹ is given by G⁻¹(p) = (1/π) arccos(1 − 2p) for p ∈ [0, 1].

1. The first quartile is q1 = 1/3.
2. The median is 1/2.
3. The third quartile is q3 = 2/3.
Proof
Open the Special Distribution Calculator and select the sine distribution. Compute a few quantiles.
Moments
Suppose that Z has the standard sine distribution. The moment generating function can be given in closed form:

\[ E(e^{tZ}) = \frac{\pi^2 (e^t + 1)}{2(\pi^2 + t^2)}, \quad t \in \mathbb{R} \]

Proof
The moments of all orders exist, but a general formula is complicated and involves special functions. However, the mean and
variance are easy to compute: E(Z) = 1/2 and var(Z) = 1/4 − 2/π².

Proof
Open the Special Distribution Simulator and select the sine distribution. Note the position and size of the mean ± standard
deviation bar. Run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean
and standard deviation.
Proof
Related Distributions
Since the distribution function and the quantile function have closed form representations, the standard sine distribution has the
usual connection to the standard uniform distribution.
1. If U has the standard uniform distribution, then Z = G⁻¹(U) = (1/π) arccos(1 − 2U) has the standard sine distribution.
2. If Z has the standard sine distribution, then U = G(Z) = (1/2)[1 − cos(πZ)] has the standard uniform distribution.
Open the random quantile simulator and select the sine distribution. Note the shape of the distribution and density functions.
Run the simulation 1000 times and note the random quantiles. Compare the empirical density function to the probability
density function.
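The random quantile method for the standard sine distribution is a one-liner; a minimal sketch (function name and sample size are ours):

```python
import math
import random

def rand_sine():
    # Random quantile method: Z = (1/pi) * arccos(1 - 2U)
    return math.acos(1 - 2 * random.random()) / math.pi

random.seed(5)
sample = [rand_sine() for _ in range(100_000)]
mean = sum(sample) / len(sample)   # E(Z) = 1/2
```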
Since the probability density function is continuous and is defined on a closed, bounded interval, the standard sine distribution can
also be simulated using the rejection method.
Open the rejection method app and select the sine distribution. Run the simulation 1000 times and compare the empirical
density function to the probability density function.
Suppose that Z has the standard sine distribution. For a ∈ R and b ∈ (0, ∞), the random variable X = a + bZ has the sine
distribution with location parameter a and scale parameter b.
Distribution Functions
Analogies of the results above for the standard sine distribution follow easily from basic properties of the location-scale
transformation. Suppose that X has the sine distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). So X has
a continuous distribution on the interval [a, a + b] .
Pure scale transformations (a = 0 and b > 0) are particularly common, since X often represents a random angle. The scale
transformation with b = π gives the angle in radians. In this case the probability density function is f(x) = (1/2) sin(x) for x ∈ [0, π].
Since the radian is the standard angle unit, this distribution could also be considered the “standard one”. The scale transformation
with b = 90 gives the angle in degrees. In this case, the probability density function is f(x) = (π/180) sin(πx/90) for x ∈ [0, 90]. This
was Gilbert's original formulation.
In the special distribution simulator, select the sine distribution. Vary the parameters and note the shape and location of the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.
Proof
The quantile function is

\[ F^{-1}(p) = a + \frac{b}{\pi} \arccos(1 - 2p), \quad p \in (0, 1) \tag{5.27.14} \]
In the special distribution calculator, select the sine distribution. Vary the parameters and note the shape and location of the
probability density function and the distribution function. For selected values of the parameters, find the quantiles of order 0.1
and 0.9.
Moments
Suppose again that X has the sine distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞).
Proof
Proof
In the special distribution simulator, select the sine distribution. Vary the parameters and note the shape and location of the
mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical
mean and standard deviation to the distribution mean and standard deviation.
Proof
Related Distributions
The general sine distribution is a location-scale family, so it is trivially closed under location-scale transformations.
Suppose that X has the sine distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that c ∈ R and
d ∈ (0, ∞) . Then Y = c + dX has the sine distribution with location parameter c + ad and scale parameter bd .
Proof
This page titled 5.27: The Sine Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.28: The Laplace Distribution
The Laplace distribution, named for Pierre Simon Laplace, arises naturally as the distribution of the difference of two independent,
identically distributed exponential variables. For this reason, it is also called the double exponential distribution.
The standard Laplace distribution is a continuous distribution on R with probability density function g given by
g(u) = (1/2) e^(−|u|),  u ∈ R  (5.28.1)
Proof
Open the Special Distribution Simulator and select the Laplace distribution. Keep the default parameter value and note the
shape of the probability density function. Run the simulation 1000 times and compare the empirical density function and the
probability density function.
Proof
3. The median is q₂ = 0
Proof
Open the Special Distribution Calculator and select the Laplace distribution. Keep the default parameter value. Compute
selected values of the distribution function and the quantile function.
Moments
Suppose that U has the standard Laplace distribution.
Proof
5.28.1 https://stats.libretexts.org/@go/page/10368
The moments of U are
1. E(Uⁿ) = 0 if n ∈ N is odd.
2. E(Uⁿ) = n! if n ∈ N is even.
Proof
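The moment formulas can be checked numerically. The sketch below approximates E(Uⁿ) with a plain Riemann sum; the grid spacing and truncation point are arbitrary choices, not from the text:

```python
import math

def laplace_moment(n, half_width=50.0, step=1e-3):
    """Approximate E(U^n) for the standard Laplace density g(u) = (1/2) e^{-|u|}
    by a symmetric Riemann sum over [-half_width, half_width]."""
    N = int(half_width / step)
    return sum((i * step) ** n * 0.5 * math.exp(-abs(i * step)) * step
               for i in range(-N, N + 1))

# Odd moments vanish by symmetry; even moments are n!.
assert abs(laplace_moment(1)) < 1e-6
assert abs(laplace_moment(2) - 2.0) < 1e-4    # 2! = 2
assert abs(laplace_moment(4) - 24.0) < 1e-3   # 4! = 24
```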
Open the Special Distribution Simulator and select the Laplace distribution. Keep the default parameter value. Run the
simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions
Of course, the standard Laplace distribution has simple connections to the standard exponential distribution.
If U has the standard Laplace distribution then V = |U | has the standard exponential distribution.
Proof
If V and W are independent and each has the standard exponential distribution, then U = V −W has the standard Laplace
distribution.
Proof using PDFs
Proof using MGFs
If V has the standard exponential distribution, I has the standard Bernoulli distribution, and V and I are independent, then
U = (2I − 1)V has the standard Laplace distribution.
Proof
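This construction suggests a simple simulation: attach a random sign to an exponential draw. A seeded sketch (the sample size and seed are arbitrary choices):

```python
import random

def laplace_sample(rng):
    """Standard Laplace via U = (2I - 1)V with I standard Bernoulli and
    V standard exponential, independent."""
    v = rng.expovariate(1.0)   # standard exponential
    i = rng.randrange(2)       # standard Bernoulli
    return (2 * i - 1) * v

rng = random.Random(12345)
sample = [laplace_sample(rng) for _ in range(100_000)]
mean = sum(sample) / len(sample)
var = sum(x * x for x in sample) / len(sample) - mean ** 2
# E(U) = 0 and var(U) = E(U^2) = 2 for the standard Laplace distribution.
assert abs(mean) < 0.05
assert abs(var - 2.0) < 0.1
```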
The standard Laplace distribution has a curious connection to the standard normal distribution.
Suppose that (Z₁, Z₂, Z₃, Z₄) is a random sample of size 4 from the standard normal distribution. Then U = Z₁Z₂ + Z₃Z₄ has the standard Laplace distribution.
The standard Laplace distribution has the usual connections to the standard uniform distribution by means of the distribution
function and the quantile function computed above.
1. If V has the standard uniform distribution then U = ln(2V) 1(V < 1/2) − ln[2(1 − V)] 1(V ≥ 1/2) has the standard Laplace distribution.
2. If U has the standard Laplace distribution then V = (1/2) e^U 1(U < 0) + [1 − (1/2) e^(−U)] 1(U ≥ 0) has the standard uniform distribution.
From part (a), the standard Laplace distribution can be simulated with the usual random quantile method.
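A sketch of this method in Python, using the piecewise quantile function F⁻¹(p) = ln(2p) for p < 1/2 and −ln[2(1 − p)] for p ≥ 1/2, with a round-trip check against the distribution function:

```python
import math

def laplace_quantile(p):
    """Quantile function of the standard Laplace distribution:
    F^{-1}(p) = ln(2p) for p < 1/2 and -ln(2(1 - p)) for p >= 1/2."""
    if p < 0.5:
        return math.log(2 * p)
    return -math.log(2 * (1 - p))

def laplace_cdf(u):
    """F(u) = (1/2) e^u for u < 0 and 1 - (1/2) e^{-u} for u >= 0."""
    if u < 0:
        return 0.5 * math.exp(u)
    return 1 - 0.5 * math.exp(-u)

# Round trip F(F^{-1}(p)) = p, and the quartiles are -ln 2 and ln 2.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    assert abs(laplace_cdf(laplace_quantile(p)) - p) < 1e-12
assert abs(laplace_quantile(0.25) + math.log(2)) < 1e-12
```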
Open the random quantile experiment and select the Laplace distribution. Keep the default parameter values and note the shape
of the probability density and distribution functions. Run the simulation 1000 times and compare the empirical density
function, mean, and standard deviation to their distributional counterparts.
Suppose that U has the standard Laplace distribution. If a ∈ R and b ∈ (0, ∞), then X = a + bU has the Laplace distribution
with location parameter a and scale parameter b .
Distribution Functions
Suppos that X has the Laplace distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞).
f(x) = [1/(2b)] exp(−|x − a|/b),  x ∈ R  (5.28.16)
1. f is symmetric about a .
2. f increases on (−∞, a] and decreases on [a, ∞) with mode x = a.
3. f is concave upward on (−∞, a) and on (a, ∞) with a cusp at x = a.
Proof
Open the Special Distribution Simulator and select the Laplace distribution. Vary the parameters and note the shape and
location of the probability density function. For various values of the parameters, run the simulation 1000 times and compare
the empirical density function to the probability density function.
Proof
3. The median is q₂ = a
Proof
Open the Special Distribution Calculator and select the Laplace distribution. For various values of the scale parameter,
compute selected values of the distribution function and the quantile function.
Moments
Again, we assume that X has the Laplace distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞) , so that by
definition, X = a + bU where U has the standard Laplace distribution.
Proof
The moments of X about the location parameter have a simple form.
Proof
Open the Special Distribution Simulator and select the Laplace distribution. Vary the parameters and note the size and location
of the mean ± standard deviation bar. For various values of the scale parameter, run the simulation 1000 times and compare
the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions
By construction, the Laplace distribution is a location-scale family, and so is closed under location-scale transformations.
Suppose that X has the Laplace distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that c ∈ R
and d ∈ (0, ∞). Then Y = c + dX has the Laplace distribution with location parameter c + ad and scale parameter bd.
Proof
Once again, the Laplace distribution has the usual connections to the standard uniform distribution by means of the distribution
function and the quantile function computed above. The latter leads to the usual random quantile method of simulation.
1. If V has the standard uniform distribution then X = a + b ln(2V) 1(V < 1/2) − b ln[2(1 − V)] 1(V ≥ 1/2) has the Laplace distribution with location parameter a and scale parameter b.
2. If X has the Laplace distribution with location parameter a and scale parameter b, then
V = (1/2) exp[(X − a)/b] 1(X < a) + {1 − (1/2) exp[−(X − a)/b]} 1(X ≥ a)  (5.28.21)
has the standard uniform distribution.
Open the random quantile experiment and select the Laplace distribution. Vary the parameter values and note the shape of the
probability density and distribution functions. For selected values of the parameters, run the simulation 1000 times and
compare the empirical density function, mean, and standard deviation to their distributional counterparts.
The Laplace distribution is also a member of the general exponential family of distributions.
Suppose that X has the Laplace distribution with known location parameter a ∈ R and unspecified scale parameter
b ∈ (0, ∞). Then X has a general exponential distribution in the scale parameter b, with natural parameter −1/b and natural
statistic |X − a|.
Proof
This page titled 5.28: The Laplace Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.29: The Logistic Distribution
The logistic distribution is used for various growth models, and is used in a certain type of regression, known appropriately as
logistic regression.
The standard logistic distribution is a continuous distribution on R with distribution function G given by
G(z) = e^z / (1 + e^z),  z ∈ R  (5.29.1)
Proof
1. g is symmetric about x = 0 .
2. g increases and then decreases with the mode x = 0 .
3. g is concave upward, then downward, then upward again with inflection points at x = ± ln(2 + √3) ≈ ±1.317.
Proof
In the special distribution simulator, select the logistic distribution. Keep the default parameter values and note the shape and
location of the probability density function. Run the simulation 1000 times and compare the empirical density function to the
probability density function.
Recall that p : 1 − p are the odds in favor of an event with probability p. Thus, the logistic distribution has the interesting property
that the quantiles are the logarithms of the corresponding odds ratios. Indeed, this function of p is sometimes called the logit
function. The fact that the median is 0 also follows from symmetry, of course.
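In code, the distribution function and the logit quantile function are one-liners; the check below confirms that the quantile of order p is the log of the odds ratio p : (1 − p):

```python
import math

def logistic_cdf(z):
    """Standard logistic distribution function G(z) = e^z / (1 + e^z)."""
    return math.exp(z) / (1 + math.exp(z))

def logit(p):
    """Quantile function G^{-1}(p) = ln(p / (1 - p)), the log-odds of p."""
    return math.log(p / (1 - p))

assert abs(logit(0.5)) < 1e-12                    # median 0
assert abs(logit(0.75) - math.log(3)) < 1e-12     # odds 3 : 1
for p in (0.1, 0.3, 0.6, 0.9):
    assert abs(logistic_cdf(logit(p)) - p) < 1e-12
```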
In the special distribution calculator, select the logistic distribution. Keep the default parameter values and note the shape and
location of the probability density function and the distribution function. Find the quantiles of order 0.1 and 0.9.
Moments
Suppose that Z has the standard logistic distribution. The moment generating function of Z has a simple representation in terms of
the beta function B, and hence also in terms of the gamma function Γ.
Proof
5.29.1 https://stats.libretexts.org/@go/page/10369
Since the moment generating function is finite on an open interval containing 0, the random variable Z has moments of all orders. By
symmetry, the odd order moments are 0. The even order moments can be represented in terms of the Bernoulli numbers, named of
course for Jacob Bernoulli. Let βₙ denote the Bernoulli number of order n ∈ N.
Let n ∈ N.
1. If n is odd then E(Zⁿ) = 0.
2. If n is even then E(Zⁿ) = (2ⁿ − 2) πⁿ |βₙ|
Proof
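The even-moment formula can be spot-checked numerically: for n = 2 it gives (2² − 2)π²|β₂| = π²/3, with β₂ = 1/6. A Riemann-sum sketch (the grid spacing and truncation point are arbitrary choices):

```python
import math

def logistic_pdf(z):
    """Standard logistic density g(z) = e^z / (1 + e^z)^2."""
    ez = math.exp(z)
    return ez / (1 + ez) ** 2

def logistic_moment(n, half_width=60.0, step=1e-3):
    """Symmetric Riemann-sum approximation of E(Z^n)."""
    N = int(half_width / step)
    return sum((i * step) ** n * logistic_pdf(i * step) * step
               for i in range(-N, N + 1))

# n = 2: (2^2 - 2) pi^2 |beta_2| = pi^2 / 3, which is also var(Z).
assert abs(logistic_moment(2) - math.pi ** 2 / 3) < 1e-4
# Odd moments vanish by symmetry.
assert abs(logistic_moment(3)) < 1e-8
```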
2. var(Z) = π²/3
Proof
In the special distribution simulator, select the logistic distribution. Keep the default parameter values and note the shape and
location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and standard
deviation to the distribution mean and standard deviation.
Proof
It follows that the excess kurtosis of Z is kurt(Z) − 3 = 6/5.
Related Distributions
The standard logistic distribution has the usual connections with the standard uniform distribution by means of the distribution
function and quantile function given above. Recall that the standard uniform distribution is the continuous uniform distribution on
the interval (0, 1).
Z = G⁻¹(U) = ln(U / (1 − U)) = ln(U) − ln(1 − U)  (5.29.14)
Since the quantile function has a simple closed form, we can use the usual random quantile method to simulate the standard logistic
distribution.
Open the random quantile experiment and select the logistic distribution. Keep the default parameter values and note the shape
of the probability density and distribution functions. Run the simulation 1000 times and compare the empirical density
function, mean, and standard deviation to their distributional counterparts.
The standard logistic distribution also has several simple connections with the standard exponential distribution (the exponential
distribution with rate parameter 1).
2. If Y has the standard exponential distribution then Z = ln(e^Y − 1) has the standard logistic distribution.
Proof
Suppose that X and Y are independent random variables, each with the standard exponential distribution. Then Z = ln(X/Y )
has the standard logistic distribution.
Proof
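A seeded Monte Carlo check of this representation (the sample size and seed are arbitrary choices):

```python
import math
import random

def logistic_via_exponentials(rng):
    """Z = ln(X/Y) for independent standard exponential variables X and Y."""
    return math.log(rng.expovariate(1.0) / rng.expovariate(1.0))

rng = random.Random(2024)
n = 100_000
sample = [logistic_via_exponentials(rng) for _ in range(n)]
# Compare the empirical CDF at z = 1 with G(1) = e / (1 + e).
ecdf_1 = sum(z <= 1 for z in sample) / n
assert abs(ecdf_1 - math.e / (1 + math.e)) < 0.01
```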
There are also simple connections between the standard logistic distribution and the Pareto distribution.
2. If Y has the Pareto distribution with shape parameter 1, then Z = ln(Y − 1) has the standard logistic distribution.
Proof
If X and Y are independent and each has the standard Gumbel distribution, then Z = Y − X has the standard logistic
distribution.
Proof
Suppose that Z has the standard logistic distribution. For a ∈ R and b ∈ (0, ∞), random variable X = a + bZ has the logistic
distribution with location parameter a and scale parameter b .
Distribution Functions
Analogies of the results above for the general logistic distribution follow easily from basic properties of the location-scale
transformation. Suppose that X has the logistic distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞).
1. f is symmetric about x = a .
2. f increases and then decreases, with mode x = a .
3. f is concave upward, then downward, then upward again, with inflection points at x = a ± b ln(2 + √3).
Proof
In the special distribution simulator, select the logistic distribution. Vary the parameters and note the shape and location of the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.
Proof
In the special distribution calculator, select the logistic distribution. Vary the parameters and note the shape and location of the
probability density function and the distribution function. For selected values of the parameters, find the quantiles of order 0.1
and 0.9.
Moments
Suppose again that X has the logistic distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). Recall that B denotes the beta function.
Proof
2. var(X) = b²π²/3
Proof
In the special distribution simulator, select the logistic distribution. Vary the parameters and note the shape and location of the
mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical
mean and standard deviation to the distribution mean and standard deviation.
Proof
Once again, it follows that the excess kurtosis of X is kurt(X) − 3 = 6/5. The central moments of X can be given in terms of the
Bernoulli numbers.
Let n ∈ N.
1. If n is odd then E[(X − a)ⁿ] = 0.
2. If n is even then E[(X − a)ⁿ] = (2ⁿ − 2) πⁿ bⁿ |βₙ|.
Proof
Related Distributions
The general logistic distribution is a location-scale family, so it is trivially closed under location-scale transformations.
Suppose that X has the logistic distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that c ∈ R and
d ∈ (0, ∞) . Then Y = c + dX has the logistic distribution with location parameter c + ad and scale parameter bd .
Proof
This page titled 5.29: The Logistic Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.30: The Extreme Value Distribution
Extreme value distributions arise as limiting distributions for maximums or minimums (extreme values) of a sample of
independent, identically distributed random variables, as the sample size increases. Thus, these distributions are important in
probability and mathematical statistics.
Proof
The distribution is also known as the standard Gumbel distribution in honor of Emil Gumbel. As we will show below, it arises as
the limit of the maximum of n independent random variables, each with the standard exponential distribution (when this maximum
is appropriately centered). This fact is the main reason that the distribution is special, and is the reason for the name. For the
remainder of this discussion, suppose that random variable V has the standard Gumbel distribution.
In the special distribution simulator, select the extreme value distribution. Keep the default parameter values and note the shape
and location of the probability density function. In particular, note the lack of symmetry. Run the simulation 1000 times and
compare the empirical density function to the probability density function.
In the special distribution calculator, select the extreme value distribution. Keep the default parameter values and note the
shape and location of the probability density and distribution functions. Compute the quantiles of order 0.1, 0.3, 0.6, and 0.9
Moments
Suppose again that V has the standard Gumbel distribution. The moment generating function of V has a simple expression in terms
of the gamma function Γ.
Proof
Next we give the mean and variance. First, recall that the Euler constant, named for Leonhard Euler is defined by
5.30.1 https://stats.libretexts.org/@go/page/10370
γ = −Γ′(1) = −∫₀^∞ e^(−x) ln x dx ≈ 0.5772156649  (5.30.6)
2. var(V) = π²/6
Proof
In the special distribution simulator, select the extreme value distribution and keep the default parameter values. Note the shape
and location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and
standard deviation to the distribution mean and standard deviation.
Next we give the skewness and kurtosis of V . The skewness involves a value of the Riemann zeta function ζ , named of course for
Bernhard Riemann. Recall that ζ is defined by
ζ(n) = ∑_{k=1}^∞ 1/kⁿ,  n > 1  (5.30.8)
1. skew(V) = 12√6 ζ(3)/π³ ≈ 1.13955
2. kurt(V) = 27/5
The particular value of the zeta function, ζ(3), is known as Apéry's constant. From (b), it follows that the excess kurtosis is
kurt(V) − 3 = 12/5.
Related Distributions
The standard Gumbel distribution has the usual connections to the standard uniform distribution by means of the distribution
function and quantile function given above. Recall that the standard uniform distribution is the continuous uniform distribution on
the interval (0, 1).
The standard Gumbel and standard uniform distributions are related as follows:
1. If U has the standard uniform distribution then V = G⁻¹(U) = −ln(−ln U) has the standard Gumbel distribution.
2. If V has the standard Gumbel distribution then U = G(V) = exp(−e^(−V)) has the standard uniform distribution.
So we can simulate the standard Gumbel distribution using the usual random quantile method.
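A sketch of the quantile method in Python, with a round-trip check against the distribution function G(v) = exp(−e^(−v)):

```python
import math

def gumbel_cdf(v):
    """Standard Gumbel distribution function G(v) = exp(-exp(-v))."""
    return math.exp(-math.exp(-v))

def gumbel_quantile(p):
    """Quantile function G^{-1}(p) = -ln(-ln p) for p in (0, 1)."""
    return -math.log(-math.log(p))

# Round trip G(G^{-1}(p)) = p, and the median is -ln(ln 2).
for p in (0.1, 0.3, 0.6, 0.9):
    assert abs(gumbel_cdf(gumbel_quantile(p)) - p) < 1e-12
assert abs(gumbel_quantile(0.5) + math.log(math.log(2))) < 1e-12
```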
Open the random quantile experiment and select the extreme value distribution. Keep the default parameter values and note
again the shape and location of the probability density and distribution functions. Run the simulation 1000 times and compare
the empirical density function, mean, and standard deviation to their distributional counterparts.
The standard Gumbel distribution also has simple connections with the standard exponential distribution (the exponential
distribution with rate parameter 1).
The standard Gumbel and standard exponential distributions are related as follows:
1. If X has the standard exponential distribution then V = −ln X has the standard Gumbel distribution.
2. If V has the standard Gumbel distribution then X = e^(−V) has the standard exponential distribution.
Proof
As noted in the introduction, the following theorem provides the motivation for the name extreme value distribution.
Suppose that (X₁, X₂, …) is a sequence of independent random variables, each with the standard exponential distribution.
The distribution of Yₙ = max{X₁, X₂, …, Xₙ} − ln n converges to the standard Gumbel distribution as n → ∞.
Proof
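The convergence can be seen without simulation: for fixed y, P(Yₙ ≤ y) = (1 − e^(−y)/n)ⁿ, which tends to exp(−e^(−y)). A small numerical sketch (the values of y and n are arbitrary choices):

```python
import math

def cdf_max_centered(y, n):
    """Exact CDF of Y_n = max(X_1, ..., X_n) - ln n for iid standard
    exponentials: P(Y_n <= y) = (1 - e^{-y}/n)^n, valid when e^{-y} <= n."""
    return (1 - math.exp(-y) / n) ** n

def gumbel_cdf(y):
    """Limiting standard Gumbel CDF exp(-exp(-y))."""
    return math.exp(-math.exp(-y))

# The exact CDF approaches the Gumbel CDF as n grows.
for y in (-1.0, 0.0, 1.0, 2.0):
    err_small = abs(cdf_max_centered(y, 10) - gumbel_cdf(y))
    err_large = abs(cdf_max_centered(y, 100_000) - gumbel_cdf(y))
    assert err_large < err_small
    assert err_large < 1e-4
```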
The General Extreme Value Distribution
As with many other distributions we have studied, the standard extreme value distribution can be generalized by applying a linear
transformation to the standard variable. First, if V has the standard Gumbel distribution (the standard extreme value distribution for
maximums), then −V has the standard extreme value distribution for minimums. Here is the general definition.
Suppose that V has the standard Gumbel distribution, and that a, b ∈ R with b ≠ 0 . Then X = a + bV has the extreme value
distribution with location parameter a and scale parameter |b|.
1. If b > 0 , then the distribution corresponds to maximums.
2. If b < 0 , then the distribution corresponds to minimums.
So the family of distributions with a ∈ R and b ∈ (0, ∞) is a location-scale family associated with the standard distribution for
maximums, and the family of distributions with a ∈ R and b ∈ (−∞, 0) is the location-scale family associated with the standard
distribution for minimums. The distributions are also referred to more simply as Gumbel distributions rather than extreme value
distributions. The web apps in this project use only the extreme value distributions for maximums. As you will see below, the
differences in the distribution for maximums and the distribution for minimums are minor. For the remainder of this discussion,
suppose that X has the form given in the definition.
Distribution Functions
Let F denote the distribution function of X.
1. If b > 0 then
F(x) = exp[−exp(−(x − a)/b)],  x ∈ R  (5.30.12)
2. If b < 0 then
F(x) = 1 − exp[−exp(−(x − a)/b)],  x ∈ R  (5.30.13)
Proof
Proof
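The two cases can be implemented together. The reflection identity in the check below (the minimum version is the reflection of the maximum version about the location parameter) follows from the fact that −V has the standard distribution for minimums:

```python
import math

def ev_cdf(x, a, b):
    """Extreme value CDF with location a and scale |b|: the version for
    maximums if b > 0 (5.30.12), for minimums if b < 0 (5.30.13)."""
    if b > 0:
        return math.exp(-math.exp(-(x - a) / b))
    return 1 - math.exp(-math.exp(-(x - a) / b))

# At x = a the two versions give e^{-1} and 1 - e^{-1}, respectively.
assert abs(ev_cdf(0.0, 0.0, 1.0) - math.exp(-1)) < 1e-12
assert abs(ev_cdf(0.0, 0.0, -1.0) - (1 - math.exp(-1))) < 1e-12
# Reflection: with b < 0, F(x) = 1 - F_max(2a - x) where F_max has scale -b.
for x in (-2.0, 0.5, 3.0):
    assert abs(ev_cdf(x, 1.0, -2.0) - (1 - ev_cdf(2 * 1.0 - x, 1.0, 2.0))) < 1e-12
```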
Open the special distribution simulator and select the extreme value distribution. Vary the parameters and note the shape and
location of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare
the empirical density function to the probability density function.
Open the special distribution calculator and select the extreme value distribution. Vary the parameters and note the shape and
location of the probability density and distribution functions. For selected values of the parameters, compute a few values of
the quantile function and the distribution function.
Moments
Suppose again that X = a + bV where V has the standard Gumbel distribution, and that a, b ∈ R with b ≠ 0 .
X has moment generating function M given by M(t) = e^(at) Γ(1 − bt),
1. with domain t ∈ (−∞, 1/b) if b > 0
2. with domain t ∈ (1/b, ∞) if b < 0
Proof
2. var(X) = b²π²/6
Proof
Open the special distribution simulator and select the extreme value distribution. Vary the parameters and note the size and
location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.
The skewness of X is
1. skew(X) = 12√6 ζ(3)/π³ ≈ 1.13955 if b > 0.
2. skew(X) = −12√6 ζ(3)/π³ ≈ −1.13955 if b < 0.
Proof
Proof
In both cases, it follows that the excess kurtosis is kurt(X) − 3 = 12/5.
Related Distributions
Since the general extreme value distributions are location-scale families, they are trivially closed under linear transformations of
the underlying variables (with nonzero slope).
Suppose that X has the extreme value distribution with parameters a, b with b ≠0 and that c, d ∈ R with d ≠0 . Then
Y = c + dX has the extreme value distribution with parameters ad + c and bd .
Proof
Note that if d > 0 then X and Y have the same association (max, max) or (min, min). If d < 0 then X and Y have opposite
associations (max, min) or (min, max).
As with the standard Gumbel distribution, the general Gumbel distribution has the usual connections with the standard uniform
distribution by means of the distribution and quantile functions. Since the quantile function has a simple closed form, the latter
connection leads to the usual random quantile method of simulation. We state the result for maximums.
2. If X has the extreme value distribution with parameters a and b then U = F (X) has the standard uniform distribution.
Open the random quantile experiment and select the extreme value distribution. Vary the parameters and note again the shape
and location of the probability density and distribution functions. For selected values of the parameters, run the simulation
1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.
The extreme value distribution for maximums has a simple connection to the Weibull distribution, and this generalizes the
connection between the standard Gumbel and exponential distributions above. There is a similar result for the extreme value
distribution for minimums.
1. If X has the extreme value distribution with parameters a ∈ R and b ∈ (0, ∞) then Y = e^(−X) has the Weibull distribution with shape parameter 1/b and scale parameter e^(−a).
2. If Y has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) then X = −ln Y has
the extreme value distribution with parameters −ln b and 1/k.
Proof
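A numerical check of the transformation, comparing P(−ln Y ≤ x) computed from the Weibull CDF with the extreme value CDF (the parameter values k = 2.5, b = 1.7 are arbitrary choices):

```python
import math

def weibull_cdf(y, k, b):
    """Weibull CDF with shape k and scale b: 1 - exp(-(y/b)^k) for y > 0."""
    return 1 - math.exp(-((y / b) ** k))

def ev_max_cdf(x, a, b):
    """Extreme value (maximums) CDF with location a and scale b > 0."""
    return math.exp(-math.exp(-(x - a) / b))

# If Y is Weibull(k, b), then X = -ln Y is extreme value with parameters
# -ln b and 1/k:  P(X <= x) = P(Y >= e^{-x}) = exp(-(e^{-x}/b)^k).
k, b = 2.5, 1.7
for x in (-1.0, 0.0, 1.5):
    lhs = 1 - weibull_cdf(math.exp(-x), k, b)   # P(Y >= e^{-x})
    rhs = ev_max_cdf(x, -math.log(b), 1 / k)
    assert abs(lhs - rhs) < 1e-10
```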
This page titled 5.30: The Extreme Value Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
5.31: The Hyperbolic Secant Distribution
The hyperbolic secant distribution is a location-scale family with a number of interesting parallels to the normal distribution. As
the name suggests, the hyperbolic secant function plays an important role in the distribution, so we should first review some
definitions.
The hyperbolic trig functions sinh, cosh, tanh, and sech are defined as follows, for x ∈ R:
sinh x = (e^x − e^(−x))/2  (5.31.1)
cosh x = (e^x + e^(−x))/2  (5.31.2)
tanh x = sinh x / cosh x = (e^x − e^(−x))/(e^x + e^(−x))  (5.31.3)
sech x = 1/cosh x = 2/(e^x + e^(−x))  (5.31.4)
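These definitions can be verified against Python's math module, which provides sinh, cosh, and tanh directly (but not sech):

```python
import math

def sech(x):
    """sech x = 1 / cosh x = 2 / (e^x + e^{-x})."""
    return 2 / (math.exp(x) + math.exp(-x))

for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    # Check the exponential formulas against the built-in hyperbolics.
    assert abs(math.sinh(x) - (math.exp(x) - math.exp(-x)) / 2) < 1e-12
    assert abs(math.cosh(x) - (math.exp(x) + math.exp(-x)) / 2) < 1e-12
    assert abs(math.tanh(x) - math.sinh(x) / math.cosh(x)) < 1e-12
    assert abs(sech(x) - 1 / math.cosh(x)) < 1e-12
```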
1. g is symmetric about 0.
2. g increases and then decreases with mode z = 0 .
3. g is concave upward then downward then upward again, with inflection points at z = ±(2/π) ln(√2 + 1) ≈ ±0.561.
Proof
So g has the classic unimodal shape. Recall that the inflection points in the standard normal probability density function are ±1.
Compared to the standard normal distribution, the hyperbolic secant distribution is more peaked at the mode 0 but has fatter tails.
Open the special distribution simulator and select the hyperbolic secant distribution. Keep the default parameter settings and
note the shape and location of the probability density function. Run the simulation 1000 times and compare the empirical
density function to the probability density function.
Proof
G⁻¹(p) = (2/π) ln[tan(πp/2)],  p ∈ (0, 1)  (5.31.9)
1. The first quartile is G⁻¹(1/4) = −(2/π) ln(1 + √2) ≈ −0.561
2. The median is G⁻¹(1/2) = 0
3. The third quartile is G⁻¹(3/4) = (2/π) ln(1 + √2) ≈ 0.561
Proof
Of course, the fact that the median is 0 also follows from the symmetry of the distribution, as does the relationship between the first
and third quartiles. In general, G⁻¹(1 − p) = −G⁻¹(p) for p ∈ (0, 1). Note that the first and third quartiles coincide with the
inflection points, whereas in the normal distribution, the inflection points are at ±1 and coincide with the standard deviation.
Open the special distribution calculator and select the hyperbolic secant distribution. Keep the default values of the parameters
and note the shape of the distribution and probability density functions. Compute a few values of the distribution and quantile
functions.
Moments
Suppose that Z has the standard hyperbolic secant distribution. The moments of Z are easiest to compute from the generating
functions.
Proof
Note that the probability density function can be obtained from the characteristic function by a scale transformation:
g(z) = (1/2) χ(πz/2) for z ∈ R. This is another curious similarity to the normal distribution: the probability density function ϕ and
characteristic function χ of the standard normal distribution are related by ϕ(z) = [1/√(2π)] χ(z).
Proof
It follows that Z has moments of all orders, and then by symmetry, that the odd order moments are all 0.
Thus, the standard hyperbolic secant distribution has mean 0 and variance 1, just like the standard normal distribution.
Open the special distribution simulator and select the hyperbolic secant distribution. Keep the default parameters and note the
size and location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and
standard deviation to the distribution mean and standard deviation.
Recall that the kurtosis of the standard normal distribution is 3, so the excess kurtosis of the standard hyperbolic secant distribution
is kurt(Z) − 3 = 2 . This distribution is more sharply peaked at the mean 0 and has fatter tails, compared with the normal.
Related Distributions
The standard hyperbolic secant distribution has the usual connections with the standard uniform distribution by means of the
distribution function and the quantile function computed above.
The standard hyperbolic secant distribution is related to the standard uniform distribution as follows:
1. If Z has the standard hyperbolic secant distribution then
U = G(Z) = (2/π) arctan[exp(πZ/2)]  (5.31.14)
5.31.2 https://stats.libretexts.org/@go/page/10371
has the standard uniform distribution.
2. If U has the standard uniform distribution then
Z = G⁻¹(U) = (2/π) ln[tan(πU/2)]  (5.31.15)
has the standard hyperbolic secant distribution.
Since the quantile function has a simple closed form, the standard hyperbolic secant distribution can be easily simulated by means
of the random quantile method.
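A sketch of the method, with a round-trip check and the quartile values found above:

```python
import math

def hs_cdf(z):
    """Standard hyperbolic secant CDF G(z) = (2/pi) arctan(exp(pi z / 2))."""
    return (2 / math.pi) * math.atan(math.exp(math.pi * z / 2))

def hs_quantile(p):
    """Quantile function G^{-1}(p) = (2/pi) ln(tan(pi p / 2)) for p in (0, 1)."""
    return (2 / math.pi) * math.log(math.tan(math.pi * p / 2))

# Round trip G(G^{-1}(p)) = p; quartiles are +/- (2/pi) ln(1 + sqrt 2).
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    assert abs(hs_cdf(hs_quantile(p)) - p) < 1e-12
q3 = (2 / math.pi) * math.log(1 + math.sqrt(2))
assert abs(hs_quantile(0.75) - q3) < 1e-12
```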
Open the random quantile experiment and select the hyperbolic secant distribution. Keep the default parameter values and note
again the shape of the probability density and distribution functions. Run the experiment 1000 times and compare the empirical
density function, mean, and standard deviation to their distributional counterparts.
Suppose that Z has the standard hyperbolic secant distribution and that μ ∈ R and σ ∈ (0, ∞) . Then X = μ + σZ has the
hyperbolic secant distribution with location parameter μ and scale parameter σ.
Distribution Functions
Suppose that X has the hyperbolic secant distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞).
1. f is symmetric about μ .
2. f increases and then decreases with mode x = μ .
3. f is concave upward then downward then upward again, with inflection points at x = μ ± (2/π) ln(√2 + 1)σ ≈ μ ± 0.561σ.
Proof
Open the special distribution simulator and select the hyperbolic secant distribution. Vary the parameters and note the shape
and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and
compare the empirical density function to the probability density function.
Proof
F⁻¹(p) = μ + σ(2/π) ln[tan(πp/2)],  p ∈ (0, 1)  (5.31.18)
1. The first quartile is F⁻¹(1/4) = μ − (2/π) ln(1 + √2)σ ≈ μ − 0.561σ
2. The median is F⁻¹(1/2) = μ
3. The third quartile is F⁻¹(3/4) = μ + (2/π) ln(1 + √2)σ ≈ μ + 0.561σ
Proof
Open the special distribution calculator and select the hyperbolic secant distribution. Vary the parameters and note the shape of
the distribution and density functions. For various values of the parameters, compute a few values of the distribution and
quantile functions.
Moments
Suppose again that X has the hyperbolic secant distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞).
Proof
Just as in the normal distribution, the location and scale parameters are the mean and standard deviation, respectively.
Proof
Open the special distribution simulator and select the hyperbolic secant distribution. Vary the parameters and note the size and
location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions
Since the hyperbolic secant distribution is a location-scale family, it is trivially closed under location-scale transformations.
Suppose that X has the hyperbolic secant distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞), and that
a ∈ R and b ∈ (0, ∞). Then Y = a + bX has the hyperbolic secant distribution with location parameter a + bμ and scale
parameter bσ.
Proof
The hyperbolic secant distribution has the usual connections with the standard uniform distribution by means of the distribution
function and the quantile function computed above.
1. If U has the standard uniform distribution then X = μ + σ(2/π) ln[tan(πU/2)] has the hyperbolic secant distribution with location parameter μ and scale parameter σ.
Since the quantile function has a simple closed form, the hyperbolic secant distribution can be easily simulated by means of the
random quantile method.
Open the random quantile experiment and select the hyperbolic secant distribution. Vary the parameters and note again the
shape of the probability density and distribution functions. Run the experiment 1000 times and compare the empirical density
function, mean, and standard deviation to their distributional counterparts.
This page titled 5.31: The Hyperbolic Secant Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
5.32: The Cauchy Distribution
The Cauchy distribution, named of course for the ubiquitous Augustin Cauchy, is interesting for a couple of reasons. First, it is a
simple family of distributions for which the expected value (and other moments) do not exist. Second, the family is closed under
the formation of sums of independent variables, and hence is an infinitely divisible family of distributions.
1. g is symmetric about x = 0
2. g increases and then decreases, with mode x = 0 .
3. g is concave upward, then downward, and then upward again, with inflection points at x = ±1/√3.
4. g(x) → 0 as x → ∞ and as x → −∞
Proof
Thus, the graph of g has a simple, symmetric, unimodal shape that is qualitatively (but certainly not quantitatively) like the
standard normal probability density function. The probability density function g is obtained by normalizing the function
x ↦ 1/(1 + x²), x ∈ R (5.32.3)
The graph of this function is known as the witch of Agnesi, named for the Italian mathematician Maria Agnesi.
Open the special distribution simulator and select the Cauchy distribution. Keep the default parameter values to get the
standard Cauchy distribution and note the shape and location of the probability density function. Run the simulation 1000
times and compare the empirical density function to the probability density function.
The standard Cauchy distribution function G is given by G(x) = 1/2 + (1/π) arctan x for x ∈ R.
Proof
The quantile function G⁻¹ is given by G⁻¹(p) = tan[π(p − 1/2)] for p ∈ (0, 1). In particular,
1. The first quartile is G⁻¹(1/4) = −1.
2. The median is G⁻¹(1/2) = 0.
Proof
Of course, the fact that the median is 0 also follows from the symmetry of the distribution, as does the fact that
G⁻¹(1 − p) = −G⁻¹(p) for p ∈ (0, 1).
Open the special distribution calculator and select the Cauchy distribution. Keep the default parameter values and note the
shape of the distribution and probability density functions. Compute a few quantiles.
Moments
Suppose that random variable X has the standard Cauchy distribution. As we noted in the introduction, part of the fame of this
distribution comes from the fact that the expected value does not exist.
By symmetry, if the expected value did exist, it would have to be 0, just like the median and the mode, but alas the mean does not
exist. Moreover, this is not just an artifact of how mathematicians define improper integrals, but has real consequences. Recall that
if we think of the probability distribution as a mass distribution, then the mean is center of mass, the balance point, the point where
the moment (in the sense of physics) to the right is balanced by the moment to the left. But as the proof of the last result shows, the
moments to the right and to the left at any point a ∈ R are infinite. In this sense, 0 is no more important than any other a ∈ R .
Finally, if you are not convinced by the argument from physics, the next exercise may convince you that the law of large numbers
fails as well.
Open the special distribution simulator and select the Cauchy distribution. Keep the default parameter values, which give the
standard Cauchy distribution. Run the simulation 1000 times and note the behavior of the sample mean.
Earlier we noted some superficial similarities between the standard Cauchy distribution and the standard normal distribution
(unimodal, symmetric about 0). But clearly there are huge quantitative differences. The Cauchy distribution is a heavy tailed
distribution because the probability density function g(x) decreases at a polynomial rate as x → ∞ and x → −∞ , as opposed to
an exponential rate. This is yet another way to understand why the expected value does not exist.
In terms of the higher moments, E(Xⁿ) does not exist if n is odd, and is ∞ if n is even. It follows that the moment generating
function m(t) = E(e^(tX)) cannot be finite in an interval about 0. In fact, m(t) = ∞ for every t ≠ 0, so this generating function is
of no use to us. But every distribution on R has a characteristic function, and for the Cauchy distribution, this generating function
will be quite useful.
Related Distributions
The standard Cauchy distribution is a member of the Student t family of distributions.
The standard Cauchy distribution is the Student t distribution with one degree of freedom.
Proof
The standard Cauchy distribution also arises naturally as the ratio of independent standard normal variables.
Suppose that Z and W are independent random variables, each with the standard normal distribution. Then X = Z/W has the
standard Cauchy distribution. Equivalently, the standard Cauchy distribution is the Student t distribution with 1 degree of
freedom.
Proof
Proof
The standard Cauchy distribution has the usual connections to the standard uniform distribution via the distribution function and
the quantile function computed above.
The standard Cauchy distribution and the standard uniform distribution are related as follows:
1. If U has the standard uniform distribution then X = G⁻¹(U) = tan[π(U − 1/2)] has the standard Cauchy distribution.
2. If X has the standard Cauchy distribution then U = G(X) = 1/2 + (1/π) arctan(X) has the standard uniform distribution.
Proof
Since the quantile function has a simple, closed form, it's easy to simulate the standard Cauchy distribution using the random
quantile method.
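Concretely, a minimal Python sketch of the random quantile method for the standard Cauchy distribution (the helper names are ours):

```python
import math
import random

def cauchy_quantile(p):
    """Quantile function G^{-1}(p) = tan[pi (p - 1/2)] of the standard Cauchy distribution."""
    return math.tan(math.pi * (p - 0.5))

def cauchy_sample(n, seed=42):
    """Random quantile method: feed standard uniforms through the quantile function."""
    rng = random.Random(seed)
    return [cauchy_quantile(rng.random()) for _ in range(n)]

xs = sorted(cauchy_sample(100_000))
q1 = xs[len(xs) // 4]        # empirical first quartile, near -1
q3 = xs[3 * len(xs) // 4]    # empirical third quartile, near 1
```

The empirical quartiles should settle near ±1, while the empirical mean wanders without converging, since the mean does not exist.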
Open the random quantile experiment and select the Cauchy distribution. Keep the default parameter values and note again the
shape and location of the distribution and probability density functions. For selected values of the parameters, run the
simulation 1000 times and compare the empirical density function to the probability density function. Note the behavior of the
empirical mean and standard deviation.
For the Cauchy distribution, the random quantile method has a nice physical interpretation. Suppose that a light source is 1 unit
away from position 0 of an infinite, straight wall. We shine the light at the wall at an angle Θ (to the perpendicular) that is
uniformly distributed on the interval (−π/2, π/2). Then the position X = tan Θ of the light beam on the wall has the standard
Cauchy distribution. Note that this follows since Θ has the same distribution as π(U − 1/2) where U has the standard uniform
distribution.
Open the Cauchy experiment and keep the default parameter values.
1. Run the experiment in single-step mode a few times, to make sure that you understand the experiment.
2. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the
probability density function. Note the behavior of the empirical mean and standard deviation.
Suppose that Z has the standard Cauchy distribution and that a ∈ R and b ∈ (0, ∞) . Then X = a + bZ has the Cauchy
distribution with location parameter a and scale parameter b .
Distribution Functions
Suppose that X has the Cauchy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞).
1. f is symmetric about x = a .
2. f increases and then decreases, with mode x = a .
3. f is concave upward, then downward, then upward again, with inflection points at x = a ± b/√3.
4. f (x) → 0 as x → ∞ and as x → −∞ .
Proof
Open the special distribution simulator and select the Cauchy distribution. Vary the parameters and note the location and shape
of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.
Proof
In particular,
1. The first quartile is F⁻¹(1/4) = a − b.
2. The median is F⁻¹(1/2) = a.
3. The third quartile is F⁻¹(3/4) = a + b.
Proof
Open the special distribution calculator and select the Cauchy distribution. Vary the parameters and note the shape and location
of the distribution and probability density functions. Compute a few values of the distribution and quantile functions.
Moments
Suppose again that X has the Cauchy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). Since the mean
and other moments of the standard Cauchy distribution do not exist, they don't exist for the general Cauchy distribution either.
Open the special distribution simulator and select the Cauchy distribution. For selected values of the parameters, run the
simulation 1000 times and note the behavior of the sample mean.
But of course the characteristic function of the Cauchy distribution exists and is easy to obtain from the characteristic function of
the standard distribution.
Related Distributions
Like all location-scale families, the general Cauchy distribution is closed under location-scale transformations.
Suppose that X has the Cauchy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that c ∈ R
and d ∈ (0, ∞) . Then Y = c + dX has the Cauchy distribution with location parameter c + da and scale parameter bd .
Proof
Much more interesting is the fact that the Cauchy family is closed under sums of independent variables. In fact, this is the main
reason that the generalization to a location-scale family is justified.
Suppose that Xᵢ has the Cauchy distribution with location parameter aᵢ ∈ R and scale parameter bᵢ ∈ (0, ∞) for i ∈ {1, 2},
and that X₁ and X₂ are independent. Then Y = X₁ + X₂ has the Cauchy distribution with location parameter a₁ + a₂ and
scale parameter b₁ + b₂.
Proof
If X = (X₁, X₂, …, Xₙ) is a sequence of independent variables, each with the Cauchy distribution with location parameter
a ∈ R and scale parameter b ∈ (0, ∞), then the sum Y = X₁ + X₂ + ⋯ + Xₙ has the Cauchy distribution with location
parameter na and scale parameter nb.
Another corollary is the strange property that the sample mean of a random sample from a Cauchy distribution has that same
Cauchy distribution. No wonder the expected value does not exist!
Suppose that X = (X₁, X₂, …, Xₙ) is a sequence of independent random variables, each with the Cauchy distribution with
location parameter a ∈ R and scale parameter b ∈ (0, ∞). (That is, X is a random sample of size n from the Cauchy
distribution.) Then the sample mean M = (1/n) ∑ᵢ₌₁ⁿ Xᵢ also has the Cauchy distribution with location parameter a and scale
parameter b.
Proof
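This property is easy to observe numerically. In the following Python sketch (the names are ours), the quartiles of the sample means stay near ±1 — the quartiles of a single standard Cauchy variable — no matter how many observations are averaged:

```python
import math
import random

def cauchy(rng):
    """One standard Cauchy variate via the random quantile method."""
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(7)
n = 10           # sample size for each mean
reps = 20_000    # number of sample means to generate
means = sorted(sum(cauchy(rng) for _ in range(n)) / n for _ in range(reps))

# For a standard Cauchy variable the quartiles are -1 and 1; averaging n
# independent copies does not shrink them at all.
q1 = means[reps // 4]
q3 = means[3 * reps // 4]
```

Increasing n leaves the spread of the sample means unchanged, in sharp contrast to distributions with finite variance, where the spread shrinks like 1/√n.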
The next result shows explicitly that the Cauchy distribution is infinitely divisible. But of course, infinite divisibility is also a
consequence of stability.
Suppose that a ∈ R and b ∈ (0, ∞). For every n ∈ N₊, the Cauchy distribution with location parameter a and scale parameter
b is the distribution of the sum of n independent variables, each of which has the Cauchy distribution with location parameter
a/n and scale parameter b/n.
Our next result is a very slight generalization of the reciprocal result above for the standard Cauchy distribution.
Suppose that X has the Cauchy distribution with location parameter 0 and scale parameter b ∈ (0, ∞). Then Y = 1/X has the
Cauchy distribution with location parameter 0 and scale parameter 1/b.
Proof
As with its standard cousin, the general Cauchy distribution has simple connections with the standard uniform distribution via the
distribution function and quantile function computed above, and in particular, can be simulated via the random quantile method.
1. If U has the standard uniform distribution then X = a + b tan[π(U − 1/2)] has the Cauchy distribution with location
parameter a and scale parameter b.
2. If X has the Cauchy distribution with location parameter a and scale parameter b, then
U = 1/2 + (1/π) arctan[(X − a)/b] has the standard uniform distribution.
Open the random quantile experiment and select the Cauchy distribution. Vary the parameters and note again the shape and
location of the distribution and probability density functions. For selected values of the parameters, run the simulation 1000
times and compare the empirical density function to the probability density function. Note the behavior of the empirical mean
and standard deviation.
As before, the random quantile method has a nice physical interpretation. Suppose that a light source is b units away from position
a of an infinite, straight wall. We shine the light at the wall at an angle Θ (to the perpendicular) that is uniformly distributed on the
interval (−π/2, π/2). Then the position X = a + b tan Θ of the light beam on the wall has the Cauchy distribution with location
parameter a and scale parameter b.
Open the Cauchy experiment. For selected values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function. Note the behavior of the empirical mean and standard deviation.
This page titled 5.32: The Cauchy Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.33: The Exponential-Logarithmic Distribution
The exponential-logarithmic distribution arises when the rate parameter of the exponential distribution is randomized by the
logarithmic distribution. The exponential-logarithmic distribution has applications in reliability theory in the context of devices or
organisms that improve with age, due to hardening or immunity.
Open the special distribution simulator and select the exponential-logarithmic distribution. Vary the shape parameter and note
the shape of the probability density function. For selected values of the shape parameter, run the simulation 1000 times and
compare the empirical density function to the probability density function.
Proof
G⁻¹(u) = ln[(1 − p)/(1 − p^(1−u))] = ln(1 − p) − ln(1 − p^(1−u)), u ∈ [0, 1) (5.33.6)
Proof
Open the special distribution calculator and select the exponential-logarithmic distribution. Vary the shape parameter and note
the shape of the distribution and probability density functions. For selected values of the shape parameter, compute a few
values of the distribution function and the quantile function.
G^c(x) = ln[1 − (1 − p)e^(−x)]/ln(p), x ∈ [0, ∞) (5.33.7)
Proof
1. r is decreasing on [0, ∞).
2. r is concave upward on [0, ∞).
Proof
The Polylogarithm
The moments of the standard exponential-logarithmic distribution cannot be expressed in terms of the usual elementary functions,
but can be expressed in terms of a special function known as the polylogarithm.
In this section, we are only interested in nonnegative integer orders, but the polylogarithm will show up again, for non-integer
orders, in the study of the zeta distribution.
Thus, the polylogarithm of order 0 is a simple geometric series, and the polylogarithm of order 1 is the standard power series for
the natural logarithm. Note that the probability density function of X can be written in terms of the polylogarithms of orders 0 and
1:
g(x) = −Li₀[(1 − p)e^(−x)]/ln(p) = Li₀[(1 − p)e^(−x)]/Li₁(1 − p), x ∈ [0, ∞) (5.33.13)
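For nonnegative integer orders, the defining series is easy to evaluate numerically. A quick Python sketch (our own helper, not a library routine), illustrating the special cases Li₀(x) = x/(1 − x) and Li₁(x) = −ln(1 − x):

```python
import math

def polylog(s, x, terms=2000):
    """Polylogarithm Li_s(x) = sum_{k >= 1} x^k / k^s, valid for |x| < 1,
    evaluated by truncating the series."""
    return sum(x**k / k**s for k in range(1, terms + 1))

x = 0.5
li0 = polylog(0, x)   # geometric series: x / (1 - x) = 1
li1 = polylog(1, x)   # power series for -ln(1 - x) = ln 2
```

The truncation error is geometric in the number of terms, so a couple of thousand terms is far more than enough for x well inside (−1, 1).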
The most important property of the polylogarithm is given in the following theorem:
Equivalently, x Li′ₛ₊₁(x) = Liₛ(x) for x ∈ (−1, 1) and s ∈ R.
Proof
where ζ is the Riemann zeta function, named for Bernhard Riemann. The polylogarithm can be extended to complex orders and
defined for complex z with |z| < 1 , but the simpler version suffices for our work here.
Moments
We assume that X has the standard exponential-logarithmic distribution with shape parameter p ∈ (0, 1).
1. E(Xⁿ) → 0 as p ↓ 0
2. E(Xⁿ) → n! as p ↑ 1
Proof
We will get some additional insight into the asymptotics below when we consider the limiting distribution as p ↓ 0 and p ↑ 1. The
mean and variance of the standard exponential-logarithmic distribution follow easily from the general moment formula.
From the asymptotics of the general moments, note that E(X) → 0 and var(X) → 0 as p ↓ 0 , and E(X) → 1 and var(X) → 1
as p ↑ 1 .
Open the special distribution simulator and select the exponential-logarithmic distribution. Vary the shape parameter and note
the size and location of the mean ± standard deviation bar. For selected values of the shape parameter, run the simulation 1000
times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions
The standard exponential-logarithmic distribution has the usual connections to the standard uniform distribution by means of the
distribution function and the quantile function computed above.
Since the quantile function of the basic exponential-logarithmic distribution has a simple closed form, the distribution can be
simulated using the random quantile method.
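A minimal Python sketch of the random quantile method for the basic distribution (the function names are ours), together with a round-trip check that G and G⁻¹ really are inverses:

```python
import math
import random

def G(x, p):
    """Distribution function of the basic exponential-logarithmic distribution:
    G(x) = 1 - ln[1 - (1 - p) e^{-x}] / ln(p)."""
    return 1.0 - math.log(1.0 - (1.0 - p) * math.exp(-x)) / math.log(p)

def G_inv(u, p):
    """Quantile function G^{-1}(u) = ln(1 - p) - ln(1 - p^{1 - u})."""
    return math.log(1.0 - p) - math.log(1.0 - p**(1.0 - u))

p = 0.3
# Round-trip check: G(G^{-1}(u)) = u
roundtrip_err = max(abs(G(G_inv(u, p), p) - u) for u in (0.1, 0.5, 0.9))

# Random quantile method: feed standard uniforms through G_inv
rng = random.Random(3)
xs = [G_inv(rng.random(), p) for _ in range(1000)]
```

All simulated values are nonnegative, as they must be, since the distribution lives on [0, ∞).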
Open the random quantile experiment and select the exponential-logarithmic distribution. Vary the shape parameter and note
the shape of the distribution and probability density functions. For selected values of the parameter, run the simulation 1000
times and compare the empirical density function to the probability density function.
As the name suggests, the basic exponential-logarithmic distribution arises from the exponential distribution and the logarithmic
distribution via a certain type of randomization.
Suppose that T = (T₁, T₂, …) is a sequence of independent random variables, each with the standard exponential distribution.
Suppose also that N has the logarithmic distribution with parameter 1 − p ∈ (0, 1) and is independent of T. Then
X = min{T₁, T₂, …, T_N} has the basic exponential-logarithmic distribution with shape parameter p.
Proof
Proof
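The randomization is easy to sketch in code. In the following Python illustration (the helper names are ours), the logarithmic variate is generated by direct inversion of its distribution function, and the empirical median of the simulated minima is compared with the quantile G⁻¹(1/2) = ln(1 − p) − ln(1 − √p):

```python
import math
import random
import statistics

def logarithmic_variate(q, rng):
    """Logarithmic distribution with parameter q: P(N = n) = -q^n / (n ln(1 - q)),
    sampled by inverting the distribution function."""
    u = rng.random()
    c = -1.0 / math.log(1.0 - q)
    n, cum = 1, 0.0
    while True:
        cum += c * q**n / n
        if u <= cum:
            return n
        n += 1

p = 0.5
rng = random.Random(11)
xs = []
for _ in range(50_000):
    N = logarithmic_variate(1.0 - p, rng)          # parameter 1 - p, as in the theorem
    xs.append(min(rng.expovariate(1.0) for _ in range(N)))

median_theory = math.log(1.0 - p) - math.log(1.0 - math.sqrt(p))  # G^{-1}(1/2)
median_empirical = statistics.median(xs)
```

The empirical median of the minima should agree with the median of the basic exponential-logarithmic distribution with shape parameter p.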
Also of interest, of course, are the limiting distributions of the standard exponential-logarithmic distribution as p → 0 and as
p → 1.
Suppose that Z has the standard exponential-logarithmic distribution with shape parameter p ∈ (0, 1). If b ∈ (0, ∞) , then
X = bZ has the exponential-logarithmic distribution with shape parameter p and scale parameter b .
Using the same terminology as the exponential distribution, 1/b is called the rate parameter.
Distribution Functions
Suppose that X has the exponential-logarithmic distribution with shape parameter p ∈ (0, 1) and scale parameter b ∈ (0, ∞).
Open the special distribution simulator and select the exponential-logarithmic distribution. Vary the shape and scale parameters
and note the shape and location of the probability density function. For selected values of the parameters, run the simulation
1000 times and compare the empirical density function to the probability density function.
Proof
Proof
Open the special distribution calculator and select the exponential-logarithmic distribution. Vary the shape and scale parameters
and note the shape and location of the probability density and distribution functions. For selected values of the parameters,
compute a few values of the distribution function and the quantile function.
−x/b
ln[1 − (1 − p)e ]
c
F (x) = , x ∈ [0, ∞) (5.33.30)
ln(p)
Proof
Moments
Suppose again that X has the exponential-logarithmic distribution with shape parameter p ∈ (0, 1) and scale parameter b ∈ (0, ∞).
The moments of X can be computed easily from the representation X = bZ where Z has the basic exponential-logarithmic
distribution.
1. E(Xⁿ) → 0 as p ↓ 0
2. E(Xⁿ) → bⁿ n! as p ↑ 1
Proof
1. E(X) = −b Li₂(1 − p)/ln(p)
2. var(X) = b²(−2 Li₃(1 − p)/ln(p) − [Li₂(1 − p)/ln(p)]²)
From the general moment results, note that E(X) → 0 and var(X) → 0 as p ↓ 0, while E(X) → b and var(X) → b² as p ↑ 1.
Open the special distribution simulator and select the exponential-logarithmic distribution. Vary the shape and scale parameters
and note the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation
1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions
Since the exponential-logarithmic distribution is a scale family for each value of the shape parameter, it is trivially closed under
scale transformations.
Suppose that X has the exponential-logarithmic distribution with shape parameter p ∈ (0, 1) and scale parameter b ∈ (0, ∞).
If c ∈ (0, ∞), then Y = cX has the exponential-logarithmic distribution with shape parameter p and scale parameter bc.
Proof
Once again, the exponential-logarithmic distribution has the usual connections to the standard uniform distribution by means of the
distribution function and quantile function computed above.
1. If U has the standard uniform distribution then
X = b ln[(1 − p)/(1 − p^U)] = b[ln(1 − p) − ln(1 − p^U)] (5.33.33)
has the exponential-logarithmic distribution with shape parameter p and scale parameter b.
2. If X has the exponential-logarithmic distribution with shape parameter p and scale parameter b, then
U = ln[1 − (1 − p)e^(−X/b)]/ln(p) (5.33.34)
has the standard uniform distribution.
Again, since the quantile function of the exponential-logarithmic distribution has a simple closed form, the distribution can be
simulated using the random quantile method.
Open the random quantile experiment and select the exponential-logarithmic distribution. Vary the shape and scale parameters
and note the shape and location of the distribution and probability density functions. For selected values of the parameters, run
the simulation 1000 times and compare the empirical density function to the probability density function.
Suppose that T = (T₁, T₂, …) is a sequence of independent random variables, each with the exponential distribution with scale
parameter b ∈ (0, ∞). Suppose also that N has the logarithmic distribution with parameter 1 − p ∈ (0, 1) and is independent
of T. Then X = min{T₁, T₂, …, T_N} has the exponential-logarithmic distribution with shape parameter p and scale
parameter b.
Proof
The limiting distributions as p ↓ 0 and as p ↑ 1 also follow easily from the corresponding results for the standard case.
For fixed b ∈ (0, ∞), the exponential-logarithmic distribution with shape parameter p ∈ (0, 1) and scale parameter b
converges to
1. Point mass at 0 as p ↓ 0 .
2. The exponential distribution with scale parameter b as p ↑ 1 .
Proof
This page titled 5.33: The Exponential-Logarithmic Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated
by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history
is available upon request.
5.34: The Gompertz Distribution
The Gompertz distribution, named for Benjamin Gompertz, is a continuous probability distribution on [0, ∞) that has exponentially
increasing failure rate. Unfortunately, the death rate of adult humans increases exponentially, so the Gompertz distribution is widely
used in actuarial science.
The basic Gompertz distribution with shape parameter a ∈ (0, ∞) is a continuous distribution on [0, ∞) with reliability
function G^c given by
G^c(x) = exp[−a(e^x − 1)], x ∈ [0, ∞) (5.34.1)
Proof
G⁻¹(p) = ln[1 − (1/a) ln(1 − p)], p ∈ [0, 1) (5.34.3)
Proof
For the standard Gompertz distribution (a = 1), the first quartile is q₁ = ln[1 + (ln 4 − ln 3)] ≈ 0.2529, the median is
q₂ = ln(1 + ln 2) ≈ 0.5266, and the third quartile is q₃ = ln(1 + ln 4) ≈ 0.8697.
Open the special distribution calculator and select the Gompertz distribution. Vary the shape parameter and note the shape of
the distribution function. For selected values of the shape parameter, compute a few values of the distribution function and the
quantile function.
Open the special distribution simulator and select the Gompertz distribution. Vary the shape parameter and note the shape of
the probability density function. For selected values of the shape parameter, run the simulation 1000 times and compare the
empirical density function to the probability density function.
Finally, as promised, the Gompertz distribution has exponentially increasing failure rate.
Proof
Moments
The moments of the basic Gompertz distribution cannot be given in simple closed form, but the mean and moment generating
function can at least be expressed in terms of a special function known as the exponential integral. There are many variations on
the exponential integral, but for our purposes, the following version is best:
The exponential integral with parameter a ∈ (0, ∞) is the function Eₐ: R → (0, ∞) defined by
Eₐ(t) = ∫₁^∞ u^t e^(−au) du, t ∈ R (5.34.7)
For the remainder of this discussion, we assume that X has the basic Gompertz distribution with shape parameter a ∈ (0, ∞) .
Proof
Open the special distribution simulator and select the Gompertz distribution. Vary the shape parameter and note the size and
location of the mean ± standard deviation bar. For selected values of the parameter, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions
The basic Gompertz distribution has the usual connections to the standard uniform distribution by means of the distribution
function and quantile function computed above.
1. If U has the standard uniform distribution then X = G⁻¹(U) = ln[1 − (1/a) ln(1 − U)] has the basic Gompertz distribution
with shape parameter a.
2. If X has the basic Gompertz distribution with shape parameter a then U = exp[−a(e^X − 1)] has the standard uniform
distribution.
Proof
Since the quantile function of the basic Gompertz distribution has a simple closed form, the distribution can be simulated using the
random quantile method.
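A minimal Python sketch (the helper names are ours), using the quantile function G⁻¹(p) = ln[1 − (1/a) ln(1 − p)]:

```python
import math
import random

def gompertz_quantile(p, a):
    """Quantile function of the basic Gompertz distribution with shape parameter a."""
    return math.log(1.0 - math.log(1.0 - p) / a)

def gompertz_sample(n, a, seed=5):
    """Random quantile method: apply the quantile function to standard uniforms."""
    rng = random.Random(seed)
    return [gompertz_quantile(rng.random(), a) for _ in range(n)]

# For the standard case a = 1, the first quartile is ln[1 + ln(4/3)] ~ 0.2529
q1_theory = gompertz_quantile(0.25, 1.0)
xs = sorted(gompertz_sample(100_000, 1.0))
q1_empirical = xs[len(xs) // 4]
```

The empirical first quartile of the simulated sample should agree with the closed-form value from the quantile function.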
Open the random quantile experiment and select the Gompertz distribution. Vary the shape parameter and note the shape of the
distribution and probability density functions. For selected values of the parameter, run the simulation 1000 times and compare
the empirical density function, mean, and standard deviation to their distributional counterparts.
The basic Gompertz distribution also has simple connections to the exponential distribution.
1. If X has the basic Gompertz distribution with shape parameter a, then Y = e^X − 1 has the exponential distribution with
rate parameter a.
2. If Y has the exponential distribution with rate parameter a, then X = ln(Y + 1) has the basic Gompertz distribution with
shape parameter a.
Proof
In particular, if Y has the standard exponential distribution (rate parameter 1), then X = ln(Y + 1) has the standard Gompertz
distribution (shape parameter 1). Since the exponential distribution is a scale family (the scale parameter is the reciprocal of the rate
parameter), we can construct an arbitrary basic Gompertz variable from a standard exponential variable. Specifically, if Y has the
standard exponential distribution and a ∈ (0, ∞), then
X = ln(Y/a + 1) (5.34.14)
has the basic Gompertz distribution with shape parameter a.
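The transformation can be checked numerically; in the following Python sketch (the helper names are ours), pushing exponential quantiles through y ↦ ln(y/a + 1) reproduces the basic Gompertz quantile function exactly:

```python
import math

def exp_quantile(p):
    """Quantile function of the standard exponential distribution."""
    return -math.log(1.0 - p)

def gompertz_quantile(p, a):
    """Quantile function of the basic Gompertz distribution with shape parameter a."""
    return math.log(1.0 - math.log(1.0 - p) / a)

a = 2.0
# Transforming exponential quantiles through y -> ln(y/a + 1) must reproduce
# the Gompertz quantiles, since quantiles transform under increasing maps.
diffs = [abs(math.log(exp_quantile(p) / a + 1.0) - gompertz_quantile(p, a))
         for p in (0.1, 0.25, 0.5, 0.75, 0.9)]
```

The agreement is exact (up to floating-point arithmetic), because an increasing transformation of a random variable simply transforms its quantile function.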
If X has the standard extreme value distribution for minimums, then the conditional distribution of X given X ≥0 is the
standard Gompertz distribution.
Proof
If Z has the basic Gompertz distribution with shape parameter a ∈ (0, ∞) and b ∈ (0, ∞) then X = bZ has the Gompertz
distribution with shape parameter a and scale parameter b .
Distribution Functions
Suppose that X has the Gompertz distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞).
F^c(x) = P(X > x) = exp[−a(e^(x/b) − 1)], x ∈ [0, ∞) (5.34.16)
Proof
Proof
Proof
Open the special distribution calculator and select the Gompertz distribution. Vary the shape and scale parameters and note the
shape and location of the distribution function. For selected values of the parameters, compute a few values of the distribution
function and the quantile function.
Open the special distribution simulator and select the Gompertz distribution. Vary the shape and scale parameters and note the
shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and
compare the empirical density function to the probability density function.
Proof
Moments
As with the basic distribution, the moment generating function and mean of the general Gompertz distribution can be expressed in
terms of the exponential integral. Suppose again that X has the Gompertz distribution with shape parameter a ∈ (0, ∞) and scale
parameter b ∈ (0, ∞).
Proof
Open the special distribution simulator and select the Gompertz distribution. Vary the shape and scale parameters and note the
size and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times
and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions
Since the Gompertz distribution is a scale family for each value of the shape parameter, it is trivially closed under scale
transformations.
Suppose that X has the Gompertz distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞). If c ∈ (0, ∞)
then Y = cX has the Gompertz distribution with shape parameter a and scale parameter bc.
Proof
As with the basic distribution, the Gompertz distribution has the usual connections with the standard uniform distribution by means
of the distribution function and quantile function computed above.
1. If U has the standard uniform distribution then X = b ln[1 − (1/a) ln(1 − U)] has the Gompertz distribution with shape
parameter a and scale parameter b.
2. If X has the Gompertz distribution with shape parameter a and scale parameter b, then U = exp[−a(e^(X/b) − 1)] has the
standard uniform distribution.
Again, since the quantile function of the Gompertz distribution has a simple closed form, the distribution can be simulated using
the random quantile method.
Open the random quantile experiment and select the Gompertz distribution. Vary the shape and scale parameters and note the
shape and location of the distribution and probability density functions. For selected values of the parameters, run the
simulation 1000 times and note the agreement between the empirical density function and the probability density function.
The following result is a slight generalization of the connection above between the basic Gompertz distribution and the extreme
value distribution.
If X has the extreme value distribution for minimums with scale parameter b > 0 , then the conditional distribution of X given
X ≥ 0 is the Gompertz distribution with shape parameter 1 and scale parameter b .
Proof
Finally, we give a slight generalization of the connection above between the Gompertz distribution and the exponential distribution.
As a corollary, we can construct a general Gompertz variable from a standard exponential variable. Specifically, if Y has the
standard exponential distribution and if a, b ∈ (0, ∞) then
X = b ln(Y/a + 1) (5.34.22)
has the Gompertz distribution with shape parameter a and scale parameter b .
This page titled 5.34: The Gompertz Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.35: The Log-Logistic Distribution
As the name suggests, the log-logistic distribution is the distribution of a variable whose logarithm has the logistic distribution. The
log-logistic distribution is often used to model random lifetimes, and hence has applications in reliability.
The basic log-logistic distribution with shape parameter k ∈ (0, ∞) is a continuous distribution on [0, ∞) with distribution
function G given by
G(z) = z^k/(1 + z^k), z ∈ [0, ∞) (5.35.1)
In the special case that k = 1 , the distribution is the standard log-logistic distribution.
Proof
1. If k < 1, g is decreasing, with g(z) → ∞ as z ↓ 0.
2. If k = 1, g is decreasing, with mode z = 0.
3. If k > 1, g increases and then decreases, with mode z = [(k − 1)/(k + 1)]^(1/k).
4. If k ≤ 1, g is concave upward.
5. If 1 < k ≤ 2, g is concave downward and then upward, with inflection point at
z = [(2(k² − 1) + k√(3(k² − 1)))/((k + 1)(k + 2))]^(1/k) (5.35.4)
6. If k > 2, g is concave upward, then downward, then upward again, with inflection points at
z = [(2(k² − 1) ± k√(3(k² − 1)))/((k + 1)(k + 2))]^(1/k) (5.35.5)
Proof
So g has a rich variety of shapes, and is unimodal if k > 1 . When k ≥ 1 , g is defined at 0 as well.
Open the special distribution simulator and select the log-logistic distribution. Vary the shape parameter and note the shape of
the probability density function. For selected values of the shape parameter, run the simulation 1000 times and compare the
empirical density function to the probability density function.
Proof
5.35.1 https://stats.libretexts.org/@go/page/10468
Recall that p/(1 − p) is the odds ratio associated with probability p ∈ (0, 1). Thus, the quantile function of the basic log-logistic
distribution with shape parameter k is the k th root of the odds ratio function. In particular, the quantile function of the standard log-
logistic distribution is the odds ratio function itself. Also of interest is that the median is 1 for every value of the shape parameter.
Open the special distribution calculator and select the log-logistic distribution. Vary the shape parameter and note the shape of
the distribution and probability density functions. For selected values of the shape parameter, compute a few values of the
distribution function and the quantile function.
G^c(z) = 1 / (1 + z^k), z ∈ [0, ∞) (5.35.9)
Proof
The basic log-logistic distribution has either decreasing failure rate, or mixed decreasing-increasing failure rate, depending on the
shape parameter.
1. If 0 < k ≤ 1 , r is decreasing.
2. If k > 1, r decreases and then increases with minimum at z = (k − 1)^{1/k}.
Proof
If k ≥ 1 , r is defined at 0 also.
Moments
Suppose that Z has the basic log-logistic distribution with shape parameter k ∈ (0, ∞). The moments (about 0) of Z have an
interesting expression in terms of the beta function B and the sine function. The simplest representation uses the normalized
cardinal sine function, defined by sinc(x) = sin(πx)/(πx) for x ≠ 0 and sinc(0) = 1.
Figure 5.35.1 : The graph of the sinc function on the interval [−10, 10]
If n ≥ k then E(Z^n) = ∞. If 0 ≤ n < k then
E(Z^n) = B(1 − n/k, 1 + n/k) = 1/sinc(n/k) (5.35.13)
Proof
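The identity B(1 − n/k, 1 + n/k) = 1/sinc(n/k) is a consequence of Euler's reflection formula Γ(z)Γ(1 − z) = π/sin(πz). Here is a quick numerical check in Python (a sketch of ours, assuming sinc(x) = sin(πx)/(πx) with sinc(0) = 1):

```python
import math

def sinc(x):
    """Normalized cardinal sine: sin(pi*x)/(pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def beta(x, y):
    """Beta function via log-gamma: B(x, y) = Gamma(x)Gamma(y)/Gamma(x + y)."""
    return math.exp(math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y))

# E(Z^n) = B(1 - n/k, 1 + n/k) = 1/sinc(n/k) for 0 <= n < k
k = 3.0
for n in (0.5, 1.0, 2.0):
    assert math.isclose(beta(1 - n/k, 1 + n/k), 1 / sinc(n/k), rel_tol=1e-9)
```

The log-gamma form of the beta function avoids overflow for larger arguments, although for these small values math.gamma would work equally well.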
If k > 1 then
E(Z) = 1/sinc(1/k) (5.35.16)
If k > 2 then
var(Z) = 1/sinc(2/k) − 1/sinc^2(1/k) (5.35.17)
Open the special distribution simulator and select the log-logistic distribution. Vary the shape parameter and note the size and
location of the mean ± standard deviation bar. For selected values of the shape parameter, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions
The basic log-logistic distribution is preserved under power transformations.
If Z has the basic log-logistic distribution with shape parameter k ∈ (0, ∞) and if n ∈ (0, ∞), then W = Z^n has the basic
log-logistic distribution with shape parameter k/n.
Proof
In particular, it follows that if V has the standard log-logistic distribution and k ∈ (0, ∞), then Z = V^{1/k} has the basic
log-logistic distribution with shape parameter k.
The log-logistic distribution has the usual connections with the standard uniform distribution by means of the distribution function
and the quantile function given above.
1. If U has the standard uniform distribution, then Z = [U/(1 − U)]^{1/k} has the basic log-logistic distribution with shape parameter k.
2. If Z has the basic log-logistic distribution with shape parameter k, then U = Z^k/(1 + Z^k) has the standard uniform distribution.
Since the quantile function of the basic log-logistic distribution has a simple closed form, the distribution can be simulated using
the random quantile method.
Open the random quantile experiment and select the log-logistic distribution. Vary the shape parameter and note the shape of
the distribution and probability density functions. For selected values of the parameter, run the simulation 1000 times and
compare the empirical density function, mean, and standard deviation to their distributional counterparts.
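The random quantile method is a one-liner in code. The following Python sketch (ours; standard library only) simulates the basic log-logistic distribution via Z = [U/(1 − U)]^{1/k} and checks that the sample median is close to 1, the median for every value of the shape parameter.

```python
import random

def rloglogistic(k, rng):
    """Random quantile method: Z = [U/(1 - U)]^(1/k) for U standard uniform."""
    u = rng.random()
    return (u / (1.0 - u)) ** (1.0 / k)

rng = random.Random(2)
k = 2.0
sample = sorted(rloglogistic(k, rng) for _ in range(100_000))
median = sample[len(sample) // 2]
# the median of the basic log-logistic distribution is 1 for every k
assert abs(median - 1.0) < 0.02
```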
Of course, as mentioned in the introduction, the log-logistic distribution is related to the logistic distribution.
As a special case (and as noted in the proof), if Z has the standard log-logistic distribution, then Y = ln Z has the standard
logistic distribution, and if Y has the standard logistic distribution, then Z = e^Y has the standard log-logistic distribution.
The standard log-logistic distribution is the same as the standard beta prime distribution.
Proof
Of course, limiting distributions with respect to parameters are always interesting.
The basic log-logistic distribution with shape parameter k ∈ (0, ∞) converges to point mass at 1 as k → ∞ .
Proof from the definition
Random variable proof
If Z has the basic log-logistic distribution with shape parameter k ∈ (0, ∞) and if b ∈ (0, ∞) then X = bZ has the log-
logistic distribution with shape parameter k and scale parameter b .
Distribution Functions
Suppose that X has the log-logistic distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞).
Proof
4. If k ≤ 1 , f is concave upward.
5. If 1 < k ≤ 2, f is concave downward and then upward, with inflection point at
x = b[(2(k^2 − 1) + k√(3(k^2 − 1))) / ((k + 1)(k + 2))]^{1/k} (5.35.23)
6. If k > 2, f is concave upward, then downward, then upward again, with inflection points at
x = b[(2(k^2 − 1) ± k√(3(k^2 − 1))) / ((k + 1)(k + 2))]^{1/k} (5.35.24)
Proof
Open the special distribution simulator and select the log-logistic distribution. Vary the shape and scale parameters and note the
shape of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.
F^{−1}(p) = b [p/(1 − p)]^{1/k}, p ∈ [0, 1) (5.35.25)
2. The median is q_2 = b.
Open the special distribution calculator and select the log-logistic distribution. Vary the shape and scale parameters and note
the shape of the distribution and probability density functions. For selected values of the parameters, compute a few values of
the distribution function and the quantile function.
F^c(x) = b^k / (b^k + x^k), x ∈ [0, ∞) (5.35.26)
Proof
The log-logistic distribution has either decreasing failure rate, or mixed decreasing-increasing failure rate, depending on the shape
parameter.
1. If 0 < k ≤ 1 , R is decreasing.
2. If k > 1, R decreases and then increases with minimum at x = b(k − 1)^{1/k}.
Proof
Moments
Suppose again that X has the log-logistic distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). The
moments of X can be computed easily from the representation X = bZ where Z has the basic log-logistic distribution with shape
parameter k . Again, the expressions are simplest in terms of the beta function B and in terms of the normalized cardinal sine
function sinc.
If n ≥ k then E(X^n) = ∞. If 0 ≤ n < k then
E(X^n) = b^n B(1 − n/k, 1 + n/k) = b^n / sinc(n/k) (5.35.28)
If k > 1 then
E(X) = b / sinc(1/k) (5.35.29)
If k > 2 then
var(X) = b^2 [1/sinc(2/k) − 1/sinc^2(1/k)] (5.35.30)
Open the special distribution simulator and select the log-logistic distribution. Vary the shape and scale parameters and note the
size and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions
Since the log-logistic distribution is a scale family for each value of the shape parameter, it is trivially closed under scale
transformations.
If X has the log-logistic distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞), and if c ∈ (0, ∞), then
Y = cX has the log-logistic distribution with shape parameter k and scale parameter bc.
Proof
If X has the log-logistic distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞), and if n ∈ (0, ∞),
then Y = X^n has the log-logistic distribution with shape parameter k/n and scale parameter b^n.
Proof
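The closure under powers can be checked directly from the distribution function F(x) = x^k/(b^k + x^k): if Y = X^n then P(Y ≤ y) = F(y^{1/n}), which matches the log-logistic distribution function with shape parameter k/n and scale parameter b^n. A deterministic Python check (ours):

```python
import math

def ll_cdf(x, k, b):
    """Log-logistic distribution function with shape k and scale b."""
    return x**k / (b**k + x**k)

k, b, n = 3.0, 2.0, 1.5
for y in (0.5, 1.0, 4.0, 10.0):
    # P(X^n <= y) = F_X(y^(1/n)) should equal the CDF with shape k/n, scale b^n
    assert math.isclose(ll_cdf(y ** (1/n), k, b),
                        ll_cdf(y, k/n, b**n), rel_tol=1e-12)
```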
As before, the log-logistic distribution has the usual connections with the standard uniform distribution by means of the distribution
function and the quantile function computed above.
Again, since the quantile function of the log-logistic distribution has a simple closed form, the distribution can be simulated using
the random quantile method.
Open the random quantile experiment and select the log-logistic distribution. Vary the shape and scale parameters and note the
shape and location of the distribution and probability density functions. For selected values of the parameters, run the
simulation 1000 times and compare the empirical density function, mean and standard deviation to their distributional
counterparts.
Proof
For fixed b ∈ (0, ∞), the log-logistic distribution with shape parameter k ∈ (0, ∞) and scale parameter b converges to point
mass at b as k → ∞ .
Proof
This page titled 5.35: The Log-Logistic Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
5.36: The Pareto Distribution
The Pareto distribution is a skewed, heavy-tailed distribution that is sometimes used to model the distribution of incomes and other
financial variables.
The basic Pareto distribution with shape parameter a ∈ (0, ∞) is a continuous distribution on [1, ∞) with distribution
function G given by
G(z) = 1 − 1/z^a, z ∈ [1, ∞) (5.36.1)
The reason that the Pareto distribution is heavy-tailed is that the density g decreases at a power rate rather than an exponential rate.
Open the special distribution simulator and select the Pareto distribution. Vary the shape parameter and note the shape of the
probability density function. For selected values of the parameter, run the simulation 1000 times and compare the empirical
density function to the probability density function.
1. The first quartile is q_1 = (4/3)^{1/a}.
2. The median is q_2 = 2^{1/a}.
3. The third quartile is q_3 = 4^{1/a}.
Proof
Open the special distribution calculator and select the Pareto distribution. Vary the shape parameter and note the shape of the
probability density and distribution functions. For selected values of the parameters, compute a few values of the distribution
and quantile functions.
Moments
Suppose that random variable Z has the basic Pareto distribution with shape parameter a ∈ (0, ∞) . Because the distribution is
heavy-tailed, the mean, variance, and other moments of Z are finite only if the shape parameter a is sufficiently large.
1. E(Z^n) = a/(a − n) if 0 < n < a
2. E(Z^n) = ∞ if n ≥ a
5.36.1 https://stats.libretexts.org/@go/page/10469
Proof
It follows that the moment generating function of Z cannot be finite on any interval about 0.
1. E(Z) = a/(a − 1) if a > 1
2. var(Z) = a/((a − 1)^2 (a − 2)) if a > 2
Proof
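The mean and variance formulas are easy to check by simulation, using the random quantile method that appears below: if U is standard uniform then Z = 1/U^{1/a} is basic Pareto with shape parameter a. A Python sketch of ours (a = 5 is chosen so that the fourth moment is finite and the sample variance is well behaved):

```python
import random

a = 5.0
rng = random.Random(3)
# random quantile method: Z = 1/U^(1/a) for U standard uniform
sample = [1.0 / rng.random() ** (1.0 / a) for _ in range(200_000)]

mean_exact = a / (a - 1)                  # requires a > 1
var_exact = a / ((a - 1) ** 2 * (a - 2))  # requires a > 2

m = sum(sample) / len(sample)
v = sum((z - m) ** 2 for z in sample) / len(sample)
assert abs(m - mean_exact) < 0.01
assert abs(v - var_exact) < 0.02
```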
In the special distribution simulator, select the Pareto distribution. Vary the parameters and note the shape and location of the
mean ± standard deviation bar. For each of the following parameter values, run the simulation 1000 times and note the
behavior of the empirical moments:
1. a = 1
2. a = 2
3. a = 3
1. If a > 3,
skew(Z) = [2(1 + a)/(a − 3)] √((a − 2)/a)
2. If a > 4,
kurt(Z) = 3(a − 2)(3a^2 + a + 2) / (a(a − 3)(a − 4)) (5.36.6)
Proof
So the distribution is positively skewed and skew(Z) → 2 as a → ∞ while skew(Z) → ∞ as a ↓ 3 . Similarly, kurt(Z) → 9 as
a → ∞ and kurt(Z) → ∞ as a ↓ 4 . Recall that the excess kurtosis of Z is
kurt(Z) − 3 = 3(a − 2)(3a^2 + a + 2) / (a(a − 3)(a − 4)) − 3 = 6(a^3 + a^2 − 6a − 2) / (a(a − 3)(a − 4)) (5.36.7)
Related Distributions
The basic Pareto distribution is invariant under positive powers of the underlying variable.
Suppose that Z has the basic Pareto distribution with shape parameter a ∈ (0, ∞) and that n ∈ (0, ∞). Then W = Z^n has the
basic Pareto distribution with shape parameter a/n.
Proof
In particular, if Z has the standard Pareto distribution and a ∈ (0, ∞), then Z^{1/a} has the basic Pareto distribution with shape
parameter a. Thus, all basic Pareto variables can be constructed from the standard one.
The basic Pareto distribution has a reciprocal relationship with the beta distribution.
The basic Pareto distribution has the usual connections with the standard uniform distribution by means of the distribution function
and quantile function computed above.
Suppose that a ∈ (0, ∞) .
1. If U has the standard uniform distribution then Z = 1/U^{1/a} has the basic Pareto distribution with shape parameter a.
2. If Z has the basic Pareto distribution with shape parameter a then U = 1/Z^a has the standard uniform distribution.
Proof
Since the quantile function has a simple closed form, the basic Pareto distribution can be simulated using the random quantile
method.
Open the random quantile experiment and select the Pareto distribution. Vary the shape parameter and note the shape of the
distribution and probability density functions. For selected values of the parameter, run the experiment 1000 times and
compare the empirical density function, mean, and standard deviation to their distributional counterparts.
The basic Pareto distribution also has simple connections to the exponential distribution.
1. If T has the exponential distribution with rate parameter a, then Z = e^T has the basic Pareto distribution with shape parameter a.
2. If Z has the basic Pareto distribution with shape parameter a, then T = ln Z has the exponential distribution with rate parameter a.
Proof
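The exponential connection is another quick simulation check. In this Python sketch (ours), exponentiating an exponential variable with rate a should reproduce the basic Pareto distribution function G(z) = 1 − 1/z^a.

```python
import math
import random

a = 2.0
rng = random.Random(4)
# if T is exponential with rate a, then Z = exp(T) is basic Pareto with shape a
sample = [math.exp(rng.expovariate(a)) for _ in range(100_000)]

# check the Pareto CDF G(z) = 1 - 1/z^a at a few points
for z in (1.5, 2.0, 5.0):
    emp = sum(v <= z for v in sample) / len(sample)
    assert abs(emp - (1.0 - z ** (-a))) < 0.01
```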
Suppose that Z has the basic Pareto distribution with shape parameter a ∈ (0, ∞) and that b ∈ (0, ∞) . Random variable
X = bZ has the Pareto distribution with shape parameter a and scale parameter b .
Distribution Functions
Suppose again that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞).
Proof
Proof
Open the special distribution simulator and select the Pareto distribution. Vary the parameters and note the shape and location
of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.
1. The first quartile is q_1 = b(4/3)^{1/a}.
2. The median is q_2 = b 2^{1/a}.
3. The third quartile is q_3 = b 4^{1/a}.
Open the special distribution calculator and select the Pareto distribution. Vary the parameters and note the shape and location
of the probability density and distribution functions. For selected values of the parameters, compute a few values of the
distribution and quantile functions.
Moments
Suppose again that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞)
1. E(X^n) = b^n a/(a − n) if 0 < n < a
2. E(X^n) = ∞ if n ≥ a
Proof
1. E(X) = b a/(a − 1) if a > 1
2. var(X) = b^2 a/((a − 1)^2 (a − 2)) if a > 2
Open the special distribution simulator and select the Pareto distribution. Vary the parameters and note the shape and location
of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.
1. If a > 3,
skew(X) = [2(1 + a)/(a − 3)] √((a − 2)/a)
2. If a > 4,
kurt(X) = 3(a − 2)(3a^2 + a + 2) / (a(a − 3)(a − 4)) (5.36.17)
Proof
Related Distributions
Since the Pareto distribution is a scale family for fixed values of the shape parameter, it is trivially closed under scale
transformations.
Suppose that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞) . If c ∈ (0, ∞)
then Y = cX has the Pareto distribution with shape parameter a and scale parameter bc.
Proof
The Pareto distribution is closed under positive powers of the underlying variable.
Suppose that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞). If n ∈ (0, ∞)
then Y = X^n has the Pareto distribution with shape parameter a/n and scale parameter b^n.
Proof
All Pareto variables can be constructed from the standard one. If Z has the standard Pareto distribution and a, b ∈ (0, ∞) then
X = b Z^{1/a} has the Pareto distribution with shape parameter a and scale parameter b.
As before, the Pareto distribution has the usual connections with the standard uniform distribution by means of the distribution
function and quantile function given above.
1. If U has the standard uniform distribution then X = b/U^{1/a} has the Pareto distribution with shape parameter a and scale
parameter b.
2. If X has the Pareto distribution with shape parameter a and scale parameter b, then U = (b/X)^a has the standard uniform
distribution.
Proof
Again, since the quantile function has a simple closed form, the Pareto distribution can be simulated using the random
quantile method.
Open the random quantile experiment and select the Pareto distribution. Vary the parameters and note the shape of the
distribution and probability density functions. For selected values of the parameters, run the experiment 1000 times and
compare the empirical density function, mean, and standard deviation to their distributional counterparts.
Suppose that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞). For c ∈ [b, ∞) ,
the conditional distribution of X given X ≥ c is Pareto with shape parameter a and scale parameter c .
Proof
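The conditioning result rests on a simple property of the survival function S(x) = (b/x)^a: for x ≥ c ≥ b, P(X > x | X ≥ c) = S(x)/S(c) = (c/x)^a, which is the Pareto survival function with shape parameter a and scale parameter c. A deterministic Python check (ours):

```python
import math

a, b, c = 2.5, 1000.0, 3000.0
S = lambda x: (b / x) ** a  # survival function of Pareto(a, b)

# P(X > x | X >= c) = S(x)/S(c) should match the Pareto(a, c) survival function
for x in (3000.0, 5000.0, 10000.0):
    assert math.isclose(S(x) / S(c), (c / x) ** a, rel_tol=1e-12)
```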
Finally, the Pareto distribution is a general exponential distribution with respect to the shape parameter, for a fixed value of the
scale parameter.
Suppose that X has the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞). For fixed b , the
distribution of X is a general exponential distribution with natural parameter −(a + 1) and natural statistic ln X .
Proof
Computational Exercises
Suppose that the income of a certain population has the Pareto distribution with shape parameter 3 and scale parameter 1000.
Find each of the following:
1. The proportion of the population with incomes between 2000 and 4000.
2. The median income.
3. The first and third quartiles and the interquartile range.
4. The mean income.
5. The standard deviation of income.
6. The 90th percentile.
Answer
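The answers follow directly from the distribution function F(x) = 1 − (b/x)^a, the quantile function F^{−1}(p) = b(1 − p)^{−1/a}, and the moment formulas above. A Python sketch of ours that computes all six quantities:

```python
import math

a, b = 3.0, 1000.0  # shape and scale parameters of the income distribution

F = lambda x: 1.0 - (b / x) ** a           # distribution function, x >= b
Q = lambda p: b * (1.0 - p) ** (-1.0 / a)  # quantile function

prop = F(4000) - F(2000)                   # 1. proportion with income in (2000, 4000)
median = Q(0.5)                            # 2. median income
q1, q3 = Q(0.25), Q(0.75)                  # 3. quartiles
iqr = q3 - q1                              #    and interquartile range
mean = b * a / (a - 1)                     # 4. mean income (valid since a > 1)
sd = math.sqrt(b**2 * a / ((a - 1)**2 * (a - 2)))  # 5. standard deviation (a > 2)
p90 = Q(0.90)                              # 6. 90th percentile
```

For example, the proportion in part 1 is (1/2)^3 − (1/4)^3 = 0.109375, and the mean in part 4 is 1000 · 3/2 = 1500.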
This page titled 5.36: The Pareto Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.37: The Wald Distribution
The Wald distribution, named for Abraham Wald, is important in the study of Brownian motion. Specifically, the distribution
governs the first time that a Brownian motion with positive drift hits a fixed, positive value. In Brownian motion, the distribution of
the random position at a fixed time has a normal (Gaussian) distribution, and thus the Wald distribution, which governs the random
time at a fixed position, is sometimes called the inverse Gaussian distribution.
The basic Wald distribution with shape parameter λ ∈ (0, ∞) is a continuous distribution on (0, ∞) with distribution function
G given by
G(u) = Φ[√(λ/u) (u − 1)] + e^{2λ} Φ[−√(λ/u) (u + 1)], u ∈ (0, ∞) (5.37.1)
1. g increases and then decreases with mode u_0 = √(1 + (3/(2λ))^2) − 3/(2λ)
So g has the classic unimodal shape, but the inflection points are very complicated functions of λ. For the mode, note that u_0 ↓ 0 as
λ ↓ 0 and u_0 ↑ 1 as λ ↑ ∞. The probability density function of the standard Wald distribution is
g(u) = √(1/(2πu^3)) exp[−(u − 1)^2/(2u)], u ∈ (0, ∞) (5.37.8)
Open the special distribution simulator and select the Wald distribution. Vary the shape parameter and note the shape of the
probability density function. For various values of the parameter, run the simulation 1000 times and compare the empirical
density function to the probability density function.
The quantile function of the standard Wald distribution does not have a simple closed form, so the median and other quantiles must
be approximated.
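Because the quantile function has no closed form, the random quantile method is impractical here. A standard alternative is the transformation-with-rejection sampler of Michael, Schucany, and Haas. The Python sketch below (ours, not part of the text) implements that sampler for the general Wald distribution with mean μ and shape λ, and checks the sample mean; the basic case corresponds to μ = 1.

```python
import math
import random

def rwald(mu, lam, rng):
    """Sample from the Wald (inverse Gaussian) distribution with mean mu and
    shape lam, via the Michael-Schucany-Haas transformation method."""
    y = rng.gauss(0.0, 1.0) ** 2  # chi-square(1) variable
    x = mu + mu * mu * y / (2.0 * lam) \
        - (mu / (2.0 * lam)) * math.sqrt(4.0 * mu * lam * y + (mu * y) ** 2)
    # accept x with probability mu/(mu + x), otherwise take mu^2/x
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x

rng = random.Random(5)
mu, lam = 2.0, 8.0
sample = [rwald(mu, lam, rng) for _ in range(100_000)]
m = sum(sample) / len(sample)
# E(X) = mu (and var(X) = mu^3/lam)
assert abs(m - mu) < 0.05
```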
Open the special distribution calculator and select the Wald distribution. Vary the shape parameter and note the shape of the
distribution and probability density functions. For selected values of the parameter, compute approximate values of the first
quartile, the median, and the third quartile.
Moments
Suppose that random variable U has the basic Wald distribution with shape parameter λ ∈ (0, ∞).
Proof
5.37.1 https://stats.libretexts.org/@go/page/10470
Since the moment generating function is finite in an interval containing 0, the basic Wald distribution has moments of all orders.
Proof
So interestingly, the mean is 1 for all values of the shape parameter, while var(U ) → ∞ as λ ↓ 0 and var(U ) → 0 as λ → ∞ .
Open the special distribution simulator and select the Wald distribution. Vary the shape parameter and note the size and
location of the mean ± standard deviation bar. For various values of the parameters, run the simulation 1000 times and
compare the empirical mean and standard deviation to the distribution mean and standard deviation.
1. skew(U) = 3/√λ
2. kurt(U) = 3 + 15/λ
Proof
It follows that the excess kurtosis is kurt(U ) − 3 = 15/λ . Note that skew(U ) → ∞ as λ → 0 and skew(U ) → 0 as λ → ∞ .
Similarly, kurt(U ) → ∞ as λ → 0 and kurt(U ) → 3 as λ → ∞ .
Suppose that λ, μ ∈ (0, ∞) and that U has the basic Wald distribution with shape parameter λ/μ . Then X = μU has the
Wald distribution with shape parameter λ and mean μ .
Justification for the name of the parameter μ as the mean is given below. Note that the generalization is consistent: when μ = 1, we recover the basic Wald distribution with shape parameter λ.
Distribution Functions
Suppose that X has the Wald distribution with shape parameter λ ∈ (0, ∞) and mean μ ∈ (0, ∞) . Again, we let Φ denote the
standard normal distribution function.
Proof
1. f increases and then decreases with mode x_0 = μ[√(1 + (3μ/(2λ))^2) − 3μ/(2λ)]
Once again, the graph of f has the classic unimodal shape, but the inflection points are complicated functions of the parameters.
Open the special distribution simulator and select the Wald distribution. Vary the parameters and note the shape of the
probability density function. For various values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.
Again, the quantile function cannot be expressed in a simple closed form, so the median and other quantiles must be approximated.
Open the special distribution calculator and select the Wald distribution. Vary the parameters and note the shape of the
distribution and density functions. For selected values of the parameters, compute approximate values of the first quartile, the
median, and the third quartile.
Moments
Suppose again that X has the Wald distribution with shape parameter λ ∈ (0, ∞) and mean μ ∈ (0, ∞) . By definition, we can
take X = μU where U has the basic Wald distribution with shape parameter λ/μ.
Proof
Proof
Open the special distribution simulator and select the Wald distribution. Vary the parameters and note the size and location of
the mean ± standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.
1. skew(X) = 3√(μ/λ)
2. kurt(X) = 3 + 15μ/λ
Proof
Related Distributions
As noted earlier, the Wald distribution is a scale family, although neither of the parameters is a scale parameter.
Suppose that X has the Wald distribution with shape parameters λ ∈ (0, ∞) and mean μ ∈ (0, ∞) and that c ∈ (0, ∞). Then
Y = cX has the Wald distribution with shape parameter cλ and mean cμ.
Proof
For the next result, it's helpful to re-parameterize the Wald distribution with the mean μ and the ratio r = λ/μ^2 ∈ (0, ∞). This
parametrization is clearly equivalent, since we can recover the shape parameter from the mean and ratio as λ = rμ^2. Note also that
r = E(X)/var(X), the ratio of the mean to the variance. Finally, note that the moment generating function above becomes
M(t) = exp[rμ(1 − √(1 − 2t/r))], t < r/2 (5.37.27)
and of course, this function characterizes the Wald distribution with this parametrization. Our next result is that the Wald
distribution is closed under convolution (corresponding to sums of independent variables) when the ratio is fixed.
Suppose that X_1 has the Wald distribution with mean μ_1 ∈ (0, ∞) and ratio r ∈ (0, ∞); X_2 has the Wald distribution with
mean μ_2 ∈ (0, ∞) and ratio r; and that X_1 and X_2 are independent. Then Y = X_1 + X_2 has the Wald distribution with mean
μ_1 + μ_2 and ratio r.
Proof
In the previous result, note that the shape parameter of X_1 is rμ_1^2, the shape parameter of X_2 is rμ_2^2, and the shape parameter of Y
is λ = r(μ_1 + μ_2)^2. Also, of course, the result generalizes to a sum of any finite number of independent Wald variables, as long as
the ratio is fixed. The next couple of results are simple corollaries.
Suppose that (X_1, X_2, …, X_n) is a sequence of independent variables, each with the Wald distribution with shape parameter
λ ∈ (0, ∞) and mean μ ∈ (0, ∞). Then
1. Y_n = ∑_{i=1}^n X_i has the Wald distribution with shape parameter n^2 λ and mean nμ.
2. M_n = (1/n) ∑_{i=1}^n X_i has the Wald distribution with shape parameter nλ and mean μ.
Suppose that X has the Wald distribution with shape parameter λ ∈ (0, ∞) and mean μ ∈ (0, ∞). For every n ∈ N_+, X has
the same distribution as ∑_{i=1}^n X_i where (X_1, X_2, …, X_n) are independent, and each has the Wald distribution with shape
parameter λ/n^2 and mean μ/n. Hence the Wald distribution is infinitely divisible.
The Lévy distribution is related to the Wald distribution, not surprising since the Lévy distribution governs the first time that a
standard Brownian motion hits a fixed positive value.
For fixed λ ∈ (0, ∞) , the Wald distribution with shape parameter λ and mean μ ∈ (0, ∞) converges to the Lévy distribution
with location parameter 0 and scale parameter λ as μ → ∞ .
Proof
The other limiting distribution, this time with the mean fixed, is less interesting.
For fixed μ ∈ (0, ∞), the Wald distribution with shape parameter λ ∈ (0, ∞) and mean μ converges to point mass at μ as
λ → ∞.
Proof
Finally, the Wald distribution is a member of the general exponential family of distributions.
The Wald distribution is a general exponential distribution with natural parameters −λ/(2μ^2) and −λ/2, and natural statistics
X and 1/X.
Proof
This page titled 5.37: The Wald Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.38: The Weibull Distribution
In this section, we will study a two-parameter family of distributions that has special importance in reliability.
The basic Weibull distribution with shape parameter k ∈ (0, ∞) is a continuous distribution on [0, ∞) with distribution
function G given by
G(t) = 1 − exp(−t^k), t ∈ [0, ∞) (5.38.1)
The Weibull distribution is named for Waloddi Weibull. Weibull was not the first person to use the distribution, but was the first to
study it extensively and recognize its wide use in applications. The standard Weibull distribution is the same as the standard
exponential distribution. But as we will see, every Weibull random variable can be obtained from a standard Weibull variable by a
simple deterministic transformation, so the terminology is justified.
5. If k > 2, g is concave upward, then downward, then upward again, with inflection points at
t = [(3(k − 1) ± √((5k − 1)(k − 1))) / (2k)]^{1/k}
Proof
So the Weibull density function has a rich variety of shapes, depending on the shape parameter, and has the classic unimodal shape
when k > 1 . If k ≥ 1 , g is defined at 0 also.
In the special distribution simulator, select the Weibull distribution. Vary the shape parameter and note the shape of the
probability density function. For selected values of the shape parameter, run the simulation 1000 times and compare the
empirical density function to the probability density function.
Proof
Open the special distribution calculator and select the Weibull distribution. Vary the shape parameter and note the shape of the
distribution and probability density functions. For selected values of the parameter, compute the median and the first and third
quartiles.
5.38.1 https://stats.libretexts.org/@go/page/10471
G^c(t) = exp(−t^k), t ∈ [0, ∞) (5.38.6)
Proof
Thus, the Weibull distribution can be used to model devices with decreasing failure rate, constant failure rate, or increasing failure
rate. This versatility is one reason for the wide use of the Weibull distribution in reliability. If k ≥ 1 , r is defined at 0 also.
Moments
Suppose that Z has the basic Weibull distribution with shape parameter k ∈ (0, ∞). The moments of Z, and hence the mean and
variance of Z, can be expressed in terms of the gamma function Γ.
E(Z^n) = Γ(1 + n/k) for n ≥ 0.
Proof
So the Weibull distribution has moments of all orders. The moment generating function, however, does not have a simple, closed
expression in terms of the usual elementary functions.
1. E(Z) = Γ(1 + 1/k)
2. var(Z) = Γ(1 + 2/k) − Γ^2(1 + 1/k)
Note that E(Z) → 1 and var(Z) → 0 as k → ∞ . We will learn more about the limiting distribution below.
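The gamma-function formulas are easy to evaluate numerically. The Python sketch below (ours; the helper names are our own) computes the mean and variance and checks two special cases: k = 1 is the standard exponential distribution (mean 1, variance 1), and k = 2 is a Rayleigh distribution with mean √π/2.

```python
import math

def weibull_mean(k):
    """E(Z) = Gamma(1 + 1/k) for the basic Weibull with shape k."""
    return math.gamma(1.0 + 1.0 / k)

def weibull_var(k):
    """var(Z) = Gamma(1 + 2/k) - Gamma(1 + 1/k)^2."""
    return math.gamma(1.0 + 2.0 / k) - math.gamma(1.0 + 1.0 / k) ** 2

# k = 1 is the standard exponential distribution: mean 1, variance 1
assert math.isclose(weibull_mean(1.0), 1.0)
assert math.isclose(weibull_var(1.0), 1.0)
# k = 2 is a Rayleigh distribution: mean sqrt(pi)/2
assert math.isclose(weibull_mean(2.0), math.sqrt(math.pi) / 2.0)
```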
In the special distribution simulator, select the Weibull distribution. Vary the shape parameter and note the size and location of
the mean ± standard deviation bar. For selected values of the shape parameter, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.
The skewness and kurtosis also follow easily from the general moment result above, although the formulas are not particularly
helpful.
2. The kurtosis of Z is
kurt(Z) = [Γ(1 + 4/k) − 4Γ(1 + 1/k)Γ(1 + 3/k) + 6Γ^2(1 + 1/k)Γ(1 + 2/k) − 3Γ^4(1 + 1/k)] / [Γ(1 + 2/k) − Γ^2(1 + 1/k)]^2 (5.38.11)
Proof
Related Distributions
As noted above, the standard Weibull distribution (shape parameter 1) is the same as the standard exponential distribution. More
generally, any basic Weibull variable can be constructed from a standard exponential variable.
1. If U has the standard exponential distribution then Z = U^{1/k} has the basic Weibull distribution with shape parameter k.
2. If Z has the basic Weibull distribution with shape parameter k then U = Z^k has the standard exponential distribution.
Proof
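The exponential construction gives a one-line Weibull simulator. In this Python sketch (ours; standard library only), raising a standard exponential variable to the power 1/k should reproduce the basic Weibull distribution function G(t) = 1 − exp(−t^k).

```python
import math
import random

def rweibull(k, rng):
    """Basic Weibull via Z = U^(1/k) with U standard exponential."""
    return rng.expovariate(1.0) ** (1.0 / k)

rng = random.Random(6)
k = 1.5
sample = [rweibull(k, rng) for _ in range(100_000)]

# check G(t) = 1 - exp(-t^k) at a few points
for t in (0.5, 1.0, 2.0):
    emp = sum(z <= t for z in sample) / len(sample)
    assert abs(emp - (1.0 - math.exp(-t ** k))) < 0.01
```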
The basic Weibull distribution has the usual connections with the standard uniform distribution by means of the distribution
function and the quantile function given above.
1. If U has the standard uniform distribution then Z = [−ln(1 − U)]^{1/k} has the basic Weibull distribution with shape parameter k.
2. If Z has the basic Weibull distribution with shape parameter k then U = exp(−Z^k) has the standard uniform distribution.
Proof
Since the quantile function has a simple, closed form, the basic Weibull distribution can be simulated using the random quantile
method.
Open the random quantile experiment and select the Weibull distribution. Vary the shape parameter and note again the shape of
the distribution and density functions. For selected values of the parameter, run the simulation 1000 times and compare the
empirical density, mean, and standard deviation to their distributional counterparts.
The limiting distribution with respect to the shape parameter is concentrated at a single point.
The basic Weibull distribution with shape parameter k ∈ (0, ∞) converges to point mass at 1 as k → ∞ .
Proof
Suppose that Z has the basic Weibull distribution with shape parameter k ∈ (0, ∞) . For b ∈ (0, ∞), random variable X = bZ
has the Weibull distribution with shape parameter k and scale parameter b .
Generalizations of the results given above follow easily from basic properties of the scale transformation.
Distribution Functions
Suppose that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞).
Proof
3. If k > 1, f increases and then decreases with mode t = b[(k − 1)/k]^{1/k}.
4. If 1 < k ≤ 2, f is concave downward and then upward, with inflection point at t = b[(3(k − 1) + √((5k − 1)(k − 1))) / (2k)]^{1/k}
5. If k > 2, f is concave upward, then downward, then upward again, with inflection points at
t = b[(3(k − 1) ± √((5k − 1)(k − 1))) / (2k)]^{1/k}
Proof
Open the special distribution simulator and select the Weibull distribution. Vary the parameters and note the shape of the
probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical
density function to the probability density function.
Proof
Open the special distribution calculator and select the Weibull distribution. Vary the parameters and note the shape of the
distribution and probability density functions. For selected values of the parameters, compute the median and the first and third
quartiles.
The reliability function F^c is given by
F^c(t) = exp[−(t/b)^k], t ∈ [0, ∞)  (5.38.15)
Proof
As before, the Weibull distribution has decreasing, constant, or increasing failure rates, depending only on the shape parameter.
Moments
Suppose again that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Recall that by
definition, we can take X = bZ where Z has the basic Weibull distribution with shape parameter k .
E(X^n) = b^n Γ(1 + n/k) for n ≥ 0.
Proof
1. E(X) = b Γ(1 + 1/k)
2. var(X) = b^2 [Γ(1 + 2/k) − Γ^2(1 + 1/k)]
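The mean and variance are easy to evaluate with the gamma function, and a quick Monte Carlo check makes a useful sanity test. A sketch (parameter values arbitrary, helper names mine):

```python
import math
import random

def weibull_mean(k, b):
    """E(X) = b * Gamma(1 + 1/k)."""
    return b * math.gamma(1.0 + 1.0 / k)

def weibull_var(k, b):
    """var(X) = b^2 * [Gamma(1 + 2/k) - Gamma(1 + 1/k)^2]."""
    return b * b * (math.gamma(1.0 + 2.0 / k) - math.gamma(1.0 + 1.0 / k) ** 2)

k, b = 1.5, 3.0
rng = random.Random(7)
# Simulate X = b * (-ln(1 - U))^(1/k) and compare the first two moments.
sample = [b * (-math.log(1.0 - rng.random())) ** (1.0 / k) for _ in range(200_000)]
m = sum(sample) / len(sample)
v = sum((x - m) ** 2 for x in sample) / len(sample)
```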
Open the special distribution simulator and select the Weibull distribution. Vary the parameters and note the size and location
of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.
2. The kurtosis of X is
kurt(X) = [Γ(1 + 4/k) − 4 Γ(1 + 1/k) Γ(1 + 3/k) + 6 Γ^2(1 + 1/k) Γ(1 + 2/k) − 3 Γ^4(1 + 1/k)] / [Γ(1 + 2/k) − Γ^2(1 + 1/k)]^2  (5.38.18)
Proof
Related Distributions
Since the Weibull distribution is a scale family for each value of the shape parameter, it is trivially closed under scale
transformations.
Suppose that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) . If c ∈ (0, ∞)
then Y = cX has the Weibull distribution with shape parameter k and scale parameter bc.
Proof
The exponential distribution is a special case of the Weibull distribution, the case corresponding to constant failure rate.
The Weibull distribution with shape parameter 1 and scale parameter b ∈ (0, ∞) is the exponential distribution with scale
parameter b .
Proof
More generally, any Weibull distributed variable can be constructed from the standard variable. The following result is a simple
generalization of the connection between the basic Weibull distribution and the exponential distribution.
1. If U has the standard exponential distribution then X = b U^{1/k} has the Weibull distribution with shape parameter k and scale parameter b.
2. If X has the Weibull distribution with shape parameter k and scale parameter b then U = (X/b)^k has the standard exponential distribution.
Proof
The Rayleigh distribution, named for John William Strutt, Lord Rayleigh, is also a special case of the Weibull distribution.
The Rayleigh distribution with scale parameter b ∈ (0, ∞) is the Weibull distribution with shape parameter 2 and scale parameter √2 b.
Proof
Recall that the minimum of independent, exponentially distributed variables also has an exponential distribution (and the rate
parameter of the minimum is the sum of the rate parameters of the variables). The Weibull distribution has a similar, but more
restricted property.
Suppose that (X_1, X_2, …, X_n) is an independent sequence of variables, each having the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Then U = min{X_1, X_2, …, X_n} has the Weibull distribution with shape parameter k and scale parameter b/n^{1/k}.
Proof
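Since the minimum of n independent Weibull variables with shape k and scale b again has shape k, with scale b/n^{1/k}, the result can be checked by simulation. A sketch (parameter values and names mine):

```python
import math
import random

def rweibull(k, b, rng):
    """Simulate a Weibull(k, b) variate via the quantile function."""
    return b * (-math.log(1.0 - rng.random())) ** (1.0 / k)

k, b, n = 2.0, 5.0, 4
rng = random.Random(11)
# Minimum of n independent Weibull(k, b) variables, repeated many times.
mins = [min(rweibull(k, b, rng) for _ in range(n)) for _ in range(100_000)]
emp_mean = sum(mins) / len(mins)
# Theory: the minimum is Weibull with shape k and scale b / n^(1/k).
theo_mean = (b / n ** (1.0 / k)) * math.gamma(1.0 + 1.0 / k)
```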
As before, the Weibull distribution has the usual connections with the standard uniform distribution by means of the distribution function and the quantile function given above.
1. If U has the standard uniform distribution then X = b[−ln(1 − U)]^{1/k} has the Weibull distribution with shape parameter k and scale parameter b.
2. If X has the Weibull distribution with shape parameter k and scale parameter b then U = exp[−(X/b)^k] has the standard uniform distribution.
Proof
Again, since the quantile function has a simple, closed form, the Weibull distribution can be simulated using the random quantile
method.
Open the random quantile experiment and select the Weibull distribution. Vary the parameters and note again the shape of the
distribution and density functions. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density, mean, and standard deviation to their distributional counterparts.
The limiting distribution with respect to the shape parameter is concentrated at a single point.
The Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) converges to point mass at b as
k → ∞.
Proof
Finally, the Weibull distribution is a member of the family of general exponential distributions if the shape parameter is fixed.
Suppose that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). For fixed k, X has a general exponential distribution with respect to b, with natural parameter −1/b^k and natural statistic X^k.
Proof
Computational Exercises
The lifetime T of a device (in hours) has the Weibull distribution with shape parameter k = 1.2 and scale parameter b = 1000.
1. Find the probability that the device will last at least 1500 hours.
2. Approximate the mean and standard deviation of T .
3. Compute the failure rate function.
Answer
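For readers who want to reproduce the answer numerically, the three parts follow directly from the reliability function, the moment formulas, and the failure rate h(t) = (k/b^k) t^{k−1}. A Python sketch:

```python
import math

k, b = 1.2, 1000.0

# 1. P(T >= t) = exp(-(t/b)^k), the Weibull reliability function.
p_1500 = math.exp(-(1500.0 / b) ** k)

# 2. Mean and standard deviation from the moment formulas.
mean_T = b * math.gamma(1.0 + 1.0 / k)
var_T = b ** 2 * (math.gamma(1.0 + 2.0 / k) - math.gamma(1.0 + 1.0 / k) ** 2)
sd_T = math.sqrt(var_T)

# 3. Failure rate h(t) = (k / b^k) * t^(k - 1), increasing since k > 1.
def failure_rate(t):
    return (k / b ** k) * t ** (k - 1.0)
```

This gives P(T ≥ 1500) ≈ 0.1966, E(T) ≈ 940.7 hours, and sd(T) ≈ 787.2 hours.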
This page titled 5.38: The Weibull Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.39: Benford's Law
Benford's law refers to probability distributions that seem to govern the significant digits in real data sets. The law is named for the
American physicist and engineer Frank Benford, although the “law” was actually discovered earlier by the astronomer and mathematician
Simon Newcomb.
To understand Benford's law, we need some preliminaries. Recall that a positive real number x can be written uniquely in the form x = y · 10^n (sometimes called scientific notation), where y ∈ [1/10, 1) is the mantissa and n ∈ Z is the exponent (both terms are standard). Note that log x = n + log y,
where the logarithm function is the base 10 common logarithm instead of the usual base e natural logarithm. In the old days BC (before
calculators), one would compute the logarithm of a number by looking up the logarithm of the mantissa in a table of logarithms, and then
adding the exponent. Of course, these remarks apply to any base b > 1 , not just base 10. Just replace 10 with b and the common logarithm
with the base b logarithm.
The Benford mantissa distribution with base b ∈ (1, ∞) is a continuous distribution on [1/b, 1) with distribution function F given by
F(y) = 1 + log_b y, y ∈ [1/b, 1)  (5.39.2)
The probability density function f is given by f(y) = 1/(y ln b) for y ∈ [1/b, 1).
1. f is decreasing.
2. f is concave upward.
Proof
Open the Special Distribution Simulator and select the Benford mantissa distribution. Vary the base b and note the shape of the
probability density function. For various values of b , run the simulation 1000 times and compare the empirical density function to the
probability density function.
The quantile function F^{−1} is given by
F^{−1}(p) = 1/b^{1−p}, p ∈ [0, 1]  (5.39.4)
1. The first quartile is F^{−1}(1/4) = 1/b^{3/4}.
2. The median is F^{−1}(1/2) = 1/√b.
3. The third quartile is F^{−1}(3/4) = 1/b^{1/4}.
Proof
Numerical values of the quartiles for the standard (base 10) distribution are given in an exercise below.
Open the special distribution calculator and select the Benford mantissa distribution. Vary the base and note the shape and location of
the distribution and probability density functions. For selected values of the base, compute the median and the first and third quartiles.
5.39.1 https://stats.libretexts.org/@go/page/10472
Moments
Assume that Y has the Benford mantissa distribution with base b ∈ (1, ∞). For n ∈ (0, ∞),
E(Y^n) = (b^n − 1)/(n b^n ln b)
Proof
Note that for fixed n > 0, E(Y^n) → 1 as b ↓ 1 and E(Y^n) → 0 as b → ∞. We will learn more about the limiting distribution below. The mean and variance follow easily from the general moment result.
1. The mean of Y is E(Y) = (b − 1)/(b ln b).
2. The variance of Y is
var(Y) = [(b − 1)/(b^2 ln b)][(b + 1)/2 − (b − 1)/ln b]  (5.39.8)
Numerical values of the mean and variance for the standard (base 10) distribution are given in an exercise below.
In the Special Distribution Simulator, select the Benford mantissa distribution. Vary the base b and note the size and location of the
mean ± standard deviation bar. For selected values of b , run the simulation 1000 times and compare the empirical mean and standard
deviation to the distribution mean and standard deviation.
Related Distributions
The Benford mantissa distribution has the usual connections to the standard uniform distribution by means of the distribution function and
quantile function given above.
2. If Y has the Benford mantissa distribution with base b then U = −log_b Y has the standard uniform distribution.
Proof
Since the quantile function has a simple closed form, the Benford mantissa distribution can be simulated using the random quantile method.
Open the random quantile experiment and select the Benford mantissa distribution. Vary the base b and note again the shape and
location of the distribution and probability density functions. For selected values of b , run the simulation 1000 times and compare the
empirical density function, mean, and standard deviation to their distributional counterparts.
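In code, the random quantile method amounts to evaluating F^{−1}(p) = b^{p−1} at standard uniform variates. A minimal sketch (function name mine; the comparison against the mean formula (b − 1)/(b ln b) is a sanity check):

```python
import math
import random

def benford_mantissa_quantile(p, b):
    """F^{-1}(p) = b^(p - 1), the inverse of F(y) = 1 + log_b(y) on [1/b, 1)."""
    return b ** (p - 1.0)

b = 10.0
rng = random.Random(1)
sample = [benford_mantissa_quantile(rng.random(), b) for _ in range(50_000)]
emp_mean = sum(sample) / len(sample)
theo_mean = (b - 1.0) / (b * math.log(b))  # E(Y) = (b - 1)/(b ln b)
```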
Also of interest, of course, are the limiting distributions of Y with respect to the base b .
Since the probability density function is bounded on a bounded support interval, the Benford mantissa distribution can also be simulated via
the rejection method.
Open the rejection method experiment and select the Benford mantissa distribution. Vary the base b and note again the shape and
location of the probability density functions. For selected values of b , run the simulation 1000 times and compare the empirical density
function, mean, and standard deviation to their distributional counterparts.
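A sketch of the rejection method in Python (names mine): since the density f(y) = 1/(y ln b) is decreasing on [1/b, 1), it is bounded by its value at y = 1/b, so we can sample points uniformly in the enclosing rectangle and keep those that fall under the graph.

```python
import math
import random

def benford_mantissa_pdf(y, b):
    """Density f(y) = 1 / (y * ln b) on [1/b, 1), from differentiating F(y) = 1 + log_b(y)."""
    return 1.0 / (y * math.log(b))

def sample_rejection(n, b, rng=random.Random(5)):
    """Rejection method: sample uniformly under the bounding box, keep points below the density."""
    lo, hi = 1.0 / b, 1.0
    m = benford_mantissa_pdf(lo, b)  # the density is decreasing, so its max is at y = 1/b
    out = []
    while len(out) < n:
        y = rng.uniform(lo, hi)
        v = rng.uniform(0.0, m)
        if v < benford_mantissa_pdf(y, b):
            out.append(y)
    return out

b = 10.0
sample = sample_rejection(50_000, b)
emp_mean = sum(sample) / len(sample)
theo_mean = (b - 1.0) / (b * math.log(b))  # E(Y) = (b - 1)/(b ln b)
```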
Distributions of the Digits
Assume now that the base is a positive integer b ∈ {2, 3, …}, which of course is the case in standard number systems. Suppose that the sequence of digits of our mantissa Y (in base b) is (N_1, N_2, …), so that
Y = ∑_{k=1}^∞ N_k/b^k  (5.39.9)
Thus, our leading digit N_1 takes values in {1, 2, …, b − 1}, while each of the other significant digits takes values in {0, 1, …, b − 1}.
Note that (N_1, N_2, …) is a stochastic process, so at least we would like to know the finite dimensional distributions. That is, we would like to know the joint probability density function of the first k digits for every k ∈ N_+. But let's start, appropriately enough, with the first digit law. The leading digit is the most important one, and fortunately also the easiest to analyze mathematically.
N_1 has probability density function g_1 given by g_1(n) = log_b(1 + 1/n) = log_b(n + 1) − log_b(n) for n ∈ {1, 2, …, b − 1}. The density function g_1 is decreasing and hence the mode is n = 1.
Proof
Note that when b = 2, N_1 = 1 deterministically, which of course has to be the case: the first significant digit of a number in base 2 must be 1. Numerical values of g_1 for the standard (base 10) distribution are given in an exercise below.
In the Special Distribution Simulator, select the Benford first digit distribution. Vary the base b with the input control and note the shape
of the probability density function. For various values of b , run the simulation 1000 times and compare the empirical density function to
the probability density function.
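The first digit law is easy to check against a data set known to follow it, such as the leading digits of the powers of 2. A short Python sketch (names mine):

```python
import math

def benford_first_digit_pdf(n, b=10):
    """First digit law: g_1(n) = log_b(1 + 1/n) for n in {1, ..., b - 1}."""
    return math.log(1.0 + 1.0 / n, b)

# The leading digits of the powers of 2 are a classic data set following Benford's law.
digits = [int(str(2 ** n)[0]) for n in range(1, 1001)]
emp = {d: digits.count(d) / len(digits) for d in range(1, 10)}
theo = {d: benford_first_digit_pdf(d) for d in range(1, 10)}
```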
2. The median is ⌈b^{1/2} − 1⌉.
Proof
Numerical values of the quantiles for the standard (base 10) distribution are given in an exercise below.
Open the special distribution calculator and choose the Benford first digit distribution. Vary the base and note the shape and location of
the distribution and probability density functions. For selected values of the base, compute the median and the first and third quartiles.
For the most part, the moments of N_1 do not have simple expressions. However, we do have the following result for the mean.
Numerical values of the mean and variance for the standard (base 10) distribution are given in an exercise below.
Open the Special Distribution Simulator and select the Benford first digit distribution. Vary the base b with the input control and note the size and location of the mean ± standard deviation bar. For various values of b, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Since the quantile function has a simple, closed form, the Benford first digit distribution can be simulated via the random quantile method.
Open the random quantile experiment and select the Benford first digit distribution. Vary the base b and note again the shape and
location of the probability density function. For selected values of the base, run the experiment 1000 times and compare the empirical
density function, mean, and standard deviation to their distributional counterparts.
Higher Digits
Now, to compute the joint probability density function of the first k significant digits, some additional notation will help.
[n_1 n_2 ⋯ n_k]_b = ∑_{j=1}^k n_j b^{k−j}  (5.39.13)
Of course, this is just the base b version of what we do in our standard base 10 system: we represent integers as strings of digits between 0
and 9 (except that the first digit cannot be 0). Here is a base 5 example:
[324]_5 = 3 · 5^2 + 2 · 5^1 + 4 · 5^0 = 89  (5.39.14)
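The bracket notation is just positional notation, which can be evaluated with Horner's rule. A one-function sketch (name mine):

```python
def digits_to_int(digits, b):
    """[n1 n2 ... nk]_b = sum of n_j * b^(k - j): the integer with the given base-b digits."""
    value = 0
    for d in digits:
        value = value * b + d  # Horner's rule, equivalent to the power sum
    return value

example = digits_to_int([3, 2, 4], 5)
```

Here `digits_to_int([3, 2, 4], 5)` returns 89, matching the base 5 example.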
Proof
The probability density function of (N_1, N_2) in the standard (base 10) case is given in an exercise below. Of course, the probability density function of a given digit can be obtained by summing the joint probability density over the unwanted digits in the usual way. However, except for the first digit, these functions do not reduce to simple expressions.
g_2(n) = ∑_{k=1}^{b−1} log_b(1 + 1/[k n]_b) = ∑_{k=1}^{b−1} log_b(1 + 1/(k b + n)), n ∈ {0, 1, …, b − 1}  (5.39.18)
The probability density function of N_2 in the standard (base 10) case is given in an exercise below.
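The second digit density g_2 can be evaluated directly from the sum above; the following sketch (base 10, names mine) also confirms that it sums to 1 and decreases from n = 0 to n = 9:

```python
import math

def benford_second_digit_pdf(n, b=10):
    """g_2(n) = sum over k = 1..b-1 of log_b(1 + 1/(k*b + n)), for n in {0, ..., b - 1}."""
    return sum(math.log(1.0 + 1.0 / (k * b + n), b) for k in range(1, b))

g2 = [benford_second_digit_pdf(n) for n in range(10)]
```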
Theoretical Explanation
Aside from the empirical evidence noted by Newcomb and Benford (and many others since), why does Benford's law work? For a
theoretical explanation, see the article A Statistical Derivation of the Significant Digit Law by Ted Hill.
Computational Exercises
In the following exercises, suppose that Y has the standard Benford mantissa distribution (the base 10 decimal case), and that (N_1, N_2, …) is the corresponding sequence of digits.
3. var(N_2)
Answer
Comparing the result for N_1 and the result for N_2, note that the distribution of N_2 is flatter than the distribution of N_1. In general, it turns out that the distribution of N_k converges to the uniform distribution on {0, 1, …, b − 1} as k → ∞. Interestingly, the digits are dependent.
Proof
2. P(N_1 = 3, N_2 = 1, N_3 = 5)
3. P(N_1 = 1, N_2 = 3, N_3 = 5)
This page titled 5.39: Benford's Law is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services)
via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.40: The Zeta Distribution
The zeta distribution is used to model the size or ranks of certain types of objects randomly chosen from certain types of
populations. Typical examples include the frequency of occurrence of a word randomly chosen from a text, or the population rank
of a city randomly chosen from a country. The zeta distribution is also known as the Zipf distribution, in honor of the American
linguist George Zipf.
Basic Theory
The Zeta Function
The Riemann zeta function ζ , named after Bernhard Riemann, is defined as follows:
ζ(a) = ∑_{n=1}^∞ 1/n^a, a ∈ (1, ∞)  (5.40.1)
You might recall from calculus that the series in the zeta function converges for a > 1 and diverges for a ≤ 1 .
The zeta function is transcendental, and most of its values must be approximated. However, ζ(a) can be given explicitly for even integer values of a; in particular, ζ(2) = π^2/6 and ζ(4) = π^4/90.
The zeta distribution with shape parameter a ∈ (1, ∞) is the discrete distribution on N_+ with probability density function f given by
f(n) = 1/(ζ(a) n^a), n ∈ N_+  (5.40.2)
Open the special distribution simulator and select the zeta distribution. Vary the shape parameter and note the shape of the
probability density function. For selected values of the parameter, run the simulation 1000 times and compare the empirical
density function to the probability density function.
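Since the pmf has a simple closed form, zeta variates can also be generated by inverse-CDF sequential search once ζ(a) is approximated by a partial sum. A sketch (names and the choice a = 3 are mine):

```python
import math
import random

def zeta_fn(a, terms=200_000):
    """Partial-sum approximation of the Riemann zeta function (adequate for a well above 1)."""
    return sum(1.0 / n ** a for n in range(1, terms + 1))

def zeta_pdf(n, a, z):
    """f(n) = 1 / (zeta(a) * n^a) for n in N_+."""
    return 1.0 / (z * n ** a)

def sample_zeta(a, rng, z):
    """Inverse-CDF sampling by sequential search through the pmf."""
    u, cum, n = rng.random(), 0.0, 0
    while cum <= u:
        n += 1
        cum += zeta_pdf(n, a, z)
    return n

a = 3.0
z = zeta_fn(a)
rng = random.Random(3)
sample = [sample_zeta(a, rng, z) for _ in range(50_000)]
emp_p1 = sample.count(1) / len(sample)  # should be close to f(1) = 1 / zeta(3)
```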
5.40.1 https://stats.libretexts.org/@go/page/10473
The distribution function and quantile function do not have simple closed forms, except in terms of other special functions.
Open the special distribution calculator and select the zeta distribution. Vary the parameter and note the shape of the
distribution and probability density functions. For selected values of the parameter, compute the median and the first and third
quartiles.
Moments
Suppose that N has the zeta distribution with shape parameter a ∈ (1, ∞). The moments of N can be expressed easily in terms of the zeta function.
If k ≥ a − 1, E(N^k) = ∞. If k < a − 1,
E(N^k) = ζ(a − k)/ζ(a)  (5.40.3)
Proof
1. If a > 2, E(N) = ζ(a − 1)/ζ(a).
2. If a > 3,
var(N) = ζ(a − 2)/ζ(a) − [ζ(a − 1)/ζ(a)]^2  (5.40.6)
Open the special distribution simulator and select the zeta distribution. Vary the parameter and note the shape and location of
the mean ± standard deviation bar. For selected values of the parameter, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.
2. If a > 5,
kurt(N) = [ζ(a − 4)ζ^3(a) − 4ζ(a − 1)ζ(a − 3)ζ^2(a) + 6ζ^2(a − 1)ζ(a − 2)ζ(a) − 3ζ^4(a − 1)] / [ζ(a − 2)ζ(a) − ζ^2(a − 1)]^2  (5.40.8)
Proof
The probability generating function of N can be expressed in terms of the polylogarithm function Li that was introduced in the
section on the exponential-logarithmic distribution. Recall that the polylogarithm of order s ∈ R is defined by
Li_s(x) = ∑_{k=1}^∞ x^k/k^s, x ∈ (−1, 1)  (5.40.9)
Proof
Related Distributions
In an algebraic sense, the zeta distribution is a discrete version of the Pareto distribution. Recall that if a > 1, the Pareto distribution with shape parameter a − 1 is a continuous distribution on [1, ∞) with probability density function
f(x) = (a − 1)/x^a, x ∈ [1, ∞)  (5.40.12)
Naturally, the limits of the zeta distribution with respect to the shape parameter a are of interest.
The zeta distribution with shape parameter a ∈ (1, ∞) converges to point mass at 1 as a → ∞ .
Proof
Finally, the zeta distribution is a member of the family of general exponential distributions.
Suppose that N has the zeta distribution with parameter a . Then the distribution is a one-parameter exponential family with
natural parameter a and natural statistic − ln N .
Proof
Computational Exercises
Let N denote the frequency of occurrence of a word chosen at random from a certain text, and suppose that N has the zeta distribution with parameter a = 2. Find P(N > 4).
Answer
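For this exercise, P(N > 4) = 1 − ∑_{n=1}^4 1/(ζ(2) n^2), and ζ(2) has the closed form π^2/6. In code:

```python
import math

zeta_2 = math.pi ** 2 / 6.0  # zeta(2) = pi^2 / 6

# P(N > 4) = 1 - sum of f(n) for n = 1..4, with f(n) = 1 / (zeta(2) * n^2)
p_gt_4 = 1.0 - sum(1.0 / (zeta_2 * n ** 2) for n in range(1, 5))
```

This evaluates to approximately 0.1345.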
Suppose that N has the zeta distribution with parameter a = 6 . Approximate each of the following:
1. E(N )
2. var(N )
3. skew(N )
4. kurt(N )
Answer
This page titled 5.40: The Zeta Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
5.41: The Logarithmic Series Distribution
The logarithmic series distribution, as the name suggests, is based on the standard power series expansion of the natural logarithm
function. It is also sometimes known more simply as the logarithmic distribution.
Basic Theory
Distribution Functions
The logarithmic series distribution with shape parameter p ∈ (0, 1) is a discrete distribution on N+ with probability density
function f given by
f(n) = [1/(−ln(1 − p))] (p^n/n), n ∈ N_+  (5.41.1)
Open the Special Distribution Simulator and select the logarithmic series distribution. Vary the parameter and note the shape of
the probability density function. For selected values of the parameter, run the simulation 1000 times and compare the empirical
density function to the probability density function.
The distribution function and the quantile function do not have simple, closed forms in terms of the standard elementary functions.
Open the special distribution calculator and select the logarithmic series distribution. Vary the parameter and note the shape of
the distribution and probability density functions. For selected values of the parameters, compute the median and the first and
third quartiles.
Moments
Suppose again that random variable N has the logarithmic series distribution with shape parameter p ∈ (0, 1). Recall that the permutation formula is n^{(k)} = n(n − 1) ⋯ (n − k + 1) for n ∈ R and k ∈ N. The factorial moments of N are E(N^{(k)}) for k ∈ N.
Proof
1. E(N) = [1/(−ln(1 − p))] [p/(1 − p)].
2. var(N) = [1/(−ln(1 − p))] [p/(1 − p)^2] [1 − p/(−ln(1 − p))]  (5.41.10)
Proof
Open the special distribution simulator and select the logarithmic series distribution. Vary the parameter and note the shape of
the mean ± standard deviation bar. For selected values of the parameter, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.
5.41.1 https://stats.libretexts.org/@go/page/10474
P(t) = E(t^N) = ln(1 − pt)/ln(1 − p), |t| < 1/p  (5.41.12)
Proof
The factorial moments above can also be obtained from the probability generating function, since P^{(k)}(1) = E(N^{(k)}) for k ∈ N_+.
Related Distributions
Naturally, the limits of the logarithmic series distribution with respect to the parameter p are of interest.
The logarithmic series distribution with shape parameter p ∈ (0, 1) converges to point mass at 1 as p ↓ 0 .
Proof
The logarithmic series distribution is a power series distribution associated with the function g(p) = − ln(1 − p) for
p ∈ [0, 1).
Proof
The moment results above actually follow from general results for power series distributions. The compound Poisson distribution
based on the logarithmic series distribution gives a negative binomial distribution.
Suppose that X = (X_1, X_2, …) is a sequence of independent random variables, each with the logarithmic series distribution with parameter p ∈ (0, 1). Suppose also that N is independent of X and has the Poisson distribution with rate parameter r ∈ (0, ∞). Then Y = ∑_{i=1}^N X_i has the negative binomial distribution on N with parameters 1 − p and −r/ln(1 − p).
Proof
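This compounding result can be checked by simulation: the mean of Y should match the negative binomial mean κ p/(1 − p) with κ = −r/ln(1 − p). A sketch (sampler implementations and parameter values mine):

```python
import math
import random

def sample_log_series(p, rng):
    """Inverse-CDF sampling from the logarithmic series pmf f(n) = -p^n / (n * ln(1 - p))."""
    u, cum, n, c = rng.random(), 0.0, 0, -1.0 / math.log(1.0 - p)
    while cum <= u:
        n += 1
        cum += c * p ** n / n
    return n

def sample_poisson(r, rng):
    """Knuth's product method for Poisson variates (fine for small r)."""
    L, k, prod = math.exp(-r), 0, rng.random()
    while prod > L:
        k += 1
        prod *= rng.random()
    return k

p, r = 0.4, 2.5
rng = random.Random(8)
# Compound Poisson: Y = X_1 + ... + X_N with N ~ Poisson(r) and X_i ~ logarithmic series(p).
ys = [sum(sample_log_series(p, rng) for _ in range(sample_poisson(r, rng))) for _ in range(50_000)]
emp_mean = sum(ys) / len(ys)
# Negative binomial with parameters 1 - p and kappa = -r / ln(1 - p) has mean kappa * p / (1 - p).
kappa = -r / math.log(1.0 - p)
theo_mean = kappa * p / (1.0 - p)
```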
This page titled 5.41: The Logarithmic Series Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
CHAPTER OVERVIEW
6: Random Samples
Point estimation refers to the process of estimating a parameter from a probability distribution, based on observed data from the
distribution. It is one of the core topics in mathematical statistics. In this chapter, we will explore the most common methods of
point estimation: the method of moments, the method of maximum likelihood, and Bayes' estimators. We also study important
properties of estimators, including sufficiency and completeness, and the basic question of whether an estimator is the best possible
one.
6.1: Introduction
6.2: The Sample Mean
6.3: The Law of Large Numbers
6.4: The Central Limit Theorem
6.5: The Sample Variance
6.6: Order Statistics
6.7: Sample Correlation and Regression
6.8: Special Properties of Normal Samples
This page titled 6: Random Samples is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
6.1: Introduction
measurements for the ith object chosen from the population. The set S of possible values of x (before the experiment is conducted)
is called the sample space. It is literally the space of samples. Thus, although the outcome of a statistical experiment can have quite
a complicated structure (a vector of vectors), the hallmark of mathematical abstraction is the ability to gray out the features that are
not relevant at any particular time, to treat a complex structure as a single object. This we do with the outcome x of the experiment.
The techniques of statistics have been enormously successful; these techniques are widely used in just about every subject that
deals with quantification—the natural sciences, the social sciences, law, and medicine. On the other hand, statistics has a legalistic
quality and a great deal of terminology and jargon that can make the subject a bit intimidating at first. In the rest of this section, we
begin discussing some of this terminology.
The empirical distribution associated with x is the probability distribution that places probability 1/n at each x_i. Thus, if the values are distinct, the empirical distribution is the discrete uniform distribution on {x_1, x_2, …, x_n}. More generally, if x occurs k times in the data, then the empirical distribution assigns probability k/n to x. Thus, every finite data set defines a probability distribution.
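Building the empirical distribution from a data set is a short computation with a counter; a sketch (function name mine):

```python
from collections import Counter

def empirical_distribution(data):
    """Assign probability k/n to each value that occurs k times among the n observations."""
    n = len(data)
    counts = Counter(data)
    return {x: k / n for x, k in counts.items()}

dist = empirical_distribution([2, 3, 3, 5, 5, 5, 7, 7])
```

For the sample data, the value 5 occurs 3 times out of 8, so it receives probability 3/8.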
Statistics
Technically, a statistic w = w(x) is an observable function of the outcome x of the experiment. That is, a statistic is a computable
function defined on the sample space S . The term observable means that the function should not contain any unknown quantities,
because we need to be able to compute the value w of the statistic from the observed data x. As with the data x, a statistic w may
have a complicated structure; typically, w is vector valued. Indeed, the outcome x of the experiment is itself a statistic; all other
statistics are derived from x.
Statistics u and v are equivalent if there exists a one-to-one function r from the range of u onto the range of v such that v = r(u) .
Equivalent statistics give equivalent information about x.
6.1.1 https://stats.libretexts.org/@go/page/10178
Statistics u and v are equivalent if and only if the following condition holds: for any x ∈ S and y ∈ S , u(x) = u(y) if and
only if v(x) = v(y) .
Equivalence really is an equivalence relation on the collection of statistics for a given statistical experiment. That is, if u, v ,
and w are arbitrary statistics then
1. u is equivalent to u (the reflexive property).
2. If u is equivalent to v then v is equivalent to u (the symmetric property).
3. If u is equivalent to v and v is equivalent to w then u is equivalent to w (the transitive property).
complete knowledge of the distribution. In statistics, by contrast, we observe the value of x of the random variable X and try to
infer information about the underlying distribution of X. In inferential statistics, a statistic (a function of X) is itself a random
variable with a distribution of its own. On the other hand, the term parameter refers to a characteristic of the distribution of X.
Often the inferential problem is to use various statistics to estimate or test hypotheses about a parameter. Another way to think of
inferential statistics is that we are trying to infer from the empirical distribution associated with the observed data x to the true
distribution associated with X.
There are two basic types of random experiments in the general area of inferential statistics. A designed experiment, as the name
suggests, is carefully designed to study a particular inferential question. The experimenter has considerable control over how the
objects are selected, what variables are to be recorded for these objects, and the values of certain of the variables. In an
observational study, by contrast, the researcher has little control over these factors. Often the researcher is simply given the data set
and asked to make sense out of it. For example, the Polio field trials were designed experiments to study the effectiveness of the
Salk vaccine. The researchers had considerable control over how the children were selected, and how the children were assigned to
the treatment and control groups. By contrast, the Challenger data sets used to explore the relationship between temperature and O-
ring erosion are observational studies. Of course, just because an experiment is designed does not mean that it is well designed.
Difficulties
A number of difficulties can arise when trying to explore an inferential question. Often, problems arise because of confounding
variables, which are variables that (as the name suggests) interfere with our understanding of the inferential question. In the first
Polio field trial design, for example, age and parental consent are two confounding variables that interfere with the determination
of the effectiveness of the vaccine. The entire point of the Berkeley admissions data, to give another example, is to illustrate how a
confounding variable (department) can create a spurious correlation between two other variables (gender and admissions status).
When we correct for the interference caused by a confounding variable, we say that we have controlled for the variable.
Problems also frequently arise because of measurement errors. Some variables are inherently difficult to measure, and systematic
bias in the measurements can interfere with our understanding of the inferential question. The first Polio field trial design again
provides a good example. Knowledge of the vaccination status of the children led to systematic bias by doctors attempting to
diagnose polio in these children. Measurement errors are sometimes caused by hidden confounding variables.
Confounding variables and measurement errors abound in political polling, where the inferential question is who will win an
election. How do confounding variables such as race, income, age, and gender (to name just a few) influence how a person will
vote? How do we know that a person will vote for whom she says she will, or if she will vote at all (measurement errors)? The
Literary Digest poll in the 1936 presidential election and the professional polls in the 1948 presidential election illustrate these
problems.
Confounding variables, measurement errors and other causes often lead to selection bias, which means that the sample does not
represent the population with respect to the inferential question at hand. Often randomization is used to overcome the effects of
confounding variables and measurement errors.
Random Samples
The most common and important special case of the inferential statistical model occurs when the observation variable
X = (X_1, X_2, …, X_n)  (6.1.1)
is a sequence of independent and identically distributed random variables. Again, in the standard sampling model, X_i is itself a vector of measurements for the ith object in the sample, and thus, we think of (X_1, X_2, …, X_n) as independent copies of an underlying measurement vector X. In this case, (X_1, X_2, …, X_n) is said to be a random sample of size n from the distribution of X.
Variables
The mathematical operations that make sense for variable in a statistical experiment depend on the type and level of measurement
of the variable.
Type
Recall that a real variable x is continuous if the possible values form an interval of real numbers. For example, the weight variable
in the M&M data set, and the length and width variables in Fisher's iris data are continuous. In contrast, a discrete variable is one
whose set of possible values forms a discrete set. For example, the counting variables in the M&M data set, the type variable in
Fisher's iris data, and the denomination and suit variables in the card experiment are discrete. Continuous variables represent
quantities that can, in theory, be measured to any degree of accuracy. In practice, of course, measuring devices have limited
accuracy so data collected from a continuous variable are necessarily discrete. That is, there is only a finite (but perhaps very large)
set of possible values that can actually be measured. So, the distinction between a discrete and continuous variable is based on what
is theoretically possible, not what is actually measured. Some additional examples may help:
A person's age is usually given in years. However, one can imagine age being given in months, or weeks, or even (if the time of
birth is known to a sufficient accuracy) in seconds. Age, whether of devices or persons, is usually considered to be a continuous
variable.
The price of an item is usually given (in the US) in dollars and cents, and of course, the smallest monetary object in circulation
is the penny ($0.01). However, taxes are sometimes given in mills ($0.001), and one can imagine smaller divisions of a dollar,
even if there are no coins to represent these divisions. Measures of wealth are usually thought of as continuous variables.
On the other hand, the number of persons in a car at the time of an accident is a fundamentally discrete variable.
Levels of Measurement
A real variable x is also distinguished by its level of measurement.
Qualitative variables simply encode types or names, and thus few mathematical operations make sense, even if numbers are used
for the encoding. Such variables have the nominal level of measurement. For example, the type variable in Fisher's iris data is
qualitative. Gender, a common variable in many studies of persons and animals, is also qualitative. Qualitative variables are almost
always discrete; it's hard to imagine a continuous infinity of names.
A variable for which only order is meaningful is said to have the ordinal level of measurement; differences are not meaningful
even if numbers are used for the encoding. For example, in many card games, the suits are ranked, so the suit variable has the
ordinal level of measurement. For another example, consider the standard 5-point scale (terrible, bad, average, good, excellent)
used to rank teachers, movies, restaurants etc.
A quantitative variable for which differences, but not ratios, are meaningful is said to have the interval level of measurement.
Equivalently, a variable at this level has a relative, rather than absolute, zero value. Typical examples are temperature (in Fahrenheit
or Celsius) or time (clock or calendar).
Finally, a quantitative variable for which ratios are meaningful is said to have the ratio level of measurement. A variable at this
level has an absolute zero value. The count and weight variables in the M&M data set, and the length and width variables in
Fisher's iris data are examples.
Subsamples
In the basic statistical model, subsamples corresponding to some of the variables can be constructed by filtering with respect to
other variables. This is particularly common when the filtering variables are qualitative. Consider the cicada data for example. We
might be interested in the quantitative variables body weight, body length, wing width, and wing length by species, that is,
separately for species 0, 1, and 2. Or, we might be interested in these quantitative variables by gender, that is separately for males
and females.
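The filtering described above can be sketched in code. A minimal example, using made-up records in the spirit of the cicada data (the values and the field layout are hypothetical, not the actual data set):

```python
# Filtering subsamples of a quantitative variable by a qualitative variable.
# Each record: (species, gender, body_weight) -- hypothetical values.
data = [
    (0, "female", 0.15), (0, "male", 0.12),
    (1, "female", 0.22), (1, "male", 0.18),
    (0, "female", 0.17), (2, "male", 0.25),
]

def subsample(records, key_index, key_value):
    """Return the body weights of the records whose qualitative key matches key_value."""
    return [r[2] for r in records if r[key_index] == key_value]

# body weights by species (key index 0) and by gender (key index 1)
by_species = {s: subsample(data, 0, s) for s in {r[0] for r in data}}
by_gender = {g: subsample(data, 1, g) for g in {r[1] for r in data}}
```

Each value of the filtering variable yields its own subsample, which can then be summarized separately.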
Exercises
Study Michelson's experiment to measure the velocity of light.
1. Is this a designed experiment or an observational study?
2. Classify the velocity of light variable in terms of type and level of measurement.
3. Discuss possible confounding variables and problems with measurement errors.
Answer
In the M&M data, classify each variable in terms of type and level of measurement.
Answer
In the Cicada data, classify each variable in terms of type and level of measurement.
Answer
In Fisher's iris data, classify each variable in terms of type and level of measurement.
Answer
Study the Challenger experiment to explore the relationship between temperature and O-ring erosion.
1. Is this a designed experiment or an observational study?
2. Classify each variable in terms of type and level of measurement.
3. Discuss possible confounding variables and problems with measurement errors.
Answer
In the Vietnam draft data, classify each variable in terms of type and level of measurement.
Answer
In the two SAT data sets, classify each variable in terms of type and level of measurement.
Answer
Study the Literary Digest experiment to predict the outcome of the 1936 presidential election.
1. Is this a designed experiment or an observational study?
2. Classify each variable in terms of type and level of measurement.
3. Discuss possible confounding variables and problems with measurement errors.
Answer
Study the 1948 polls to predict the outcome of the presidential election between Truman and Dewey. Are these designed
experiments or observational studies?
Answer
Study Pearson's experiment to explore the relationship between heights of fathers and heights of sons.
1. Is this a designed experiment or an observational study?
2. Classify each variable in terms of type and level of measurement.
3. Discuss possible confounding variables.
Answer
Note the parameters for each of the following families of special distributions:
1. the normal distribution
2. the gamma distribution
3. the beta distribution
4. the Pareto distribution
5. the Weibull distribution
Answer
During World War II, the Allies recorded the serial numbers of captured German tanks. Classify the underlying serial number
variable by type and level of measurement.
Answer
For a discussion of how the serial numbers were used to estimate the total number of tanks, see the section on Order Statistics in
the chapter on Finite Sampling Models.
This page titled 6.1: Introduction is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
6.2: The Sample Mean
Basic Theory
Recall the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that
we make on the objects. We select objects from the population and record the variables for the objects in the sample; these become
our data. Our first discussion is from a purely descriptive point of view. That is, we do not assume that the data are generated by an
underlying probability distribution. However, recall that the data themselves define a probability distribution.
Definition and Basic Properties
Suppose that x = (x_1, x_2, …, x_n) is a sample of size n from a real-valued variable. The sample mean is simply the arithmetic
average of the sample values:
\[ m = \frac{1}{n} \sum_{i=1}^n x_i \tag{6.2.1} \]
If we want to emphasize the dependence of the mean on the data, we write m(x) instead of just m. Note that m has the same
physical units as the underlying variable. For example, if we have a sample of weights of cicadas, in grams, then m is in grams
also. The sample mean is frequently used as a measure of center of the data. Indeed, if each x_i is the location of a point mass, then
m is the center of mass as defined in physics. In fact, a simple graphical display of the data is the dotplot: on a number line, a dot is
placed at x_i for each i. If values are repeated, the dots are stacked vertically. The sample mean m is the balance point of the
dotplot. The image below shows a dot plot with the mean as the balance point.
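The balance-point interpretation is easy to check numerically. A minimal sketch with made-up values: the deviations from the sample mean always sum to zero.

```python
# Sample mean as the balance point of the data (illustrative values).
x = [2.0, 3.5, 3.5, 5.0, 6.0]
n = len(x)

m = sum(x) / n                      # the sample mean, equation (6.2.1)
deviations = [xi - m for xi in x]   # directed distance from m to each point

print(m)                 # 4.0
print(sum(deviations))   # 0.0 -- the balance condition
```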
Suppose now that x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n) are samples of size n from real-valued population variables and that c is a constant. In vector notation, recall
that x + y = (x_1 + y_1, x_2 + y_2, …, x_n + y_n) and c x = (c x_1, c x_2, …, c x_n).
m(x + y) = m(x) + m(y)
Proof
6.2.1 https://stats.libretexts.org/@go/page/10179
m(c x) = c m(x)
Proof
As a special case of these results, suppose that x = (x_1, x_2, …, x_n) is a sample of size n corresponding to a real variable x, and
that a and b are constants. Then the sample corresponding to the variable y = a + bx, in our vector notation, is a + bx. The
sample means are related in precisely the same way, that is, m(a + bx) = a + b m(x). Linear transformations of this type, when
b > 0, arise frequently when physical units are changed. In this case, the transformation is often called a location-scale
transformation; a is the location parameter and b is the scale parameter. For example, if x is the length of an object in inches, then
y = 2.54x is the length of the object in centimeters. If x is the temperature of an object in degrees Fahrenheit, then
y = (5/9)(x − 32) is the temperature of the object in degrees Celsius.
Suppose now that x = (x_1, x_2, …, x_n) is a sample of size n from a variable that takes values in a set S, and that A ⊆ S. The relative frequency of A corresponding to x is the proportion of data values that are in A:
\[ p(A) = \frac{n(A)}{n} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(x_i \in A) \tag{6.2.6} \]
where n(A) denotes the number of data values that are in A.
Note that for fixed A, p(A) is itself a sample mean, corresponding to the data {1(x_i ∈ A) : i ∈ {1, 2, …, n}}. This fact bears
repeating: every sample proportion is a sample mean, corresponding to an indicator variable. In the picture below, the red dots
represent the data, so p(A) = 4/15.
p is a probability measure on S.
Proof
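The fact that a sample proportion is a sample mean of indicator values can be sketched directly (the data values here are made up):

```python
# A sample proportion is the sample mean of indicator values.
x = [1, 4, 2, 7, 3, 9, 2, 5, 8, 2, 6, 1, 7, 3, 4]   # hypothetical sample of size 15
A = {1, 2, 3}                                        # a subset of interest

indicators = [1 if xi in A else 0 for xi in x]       # the data 1(x_i in A)
p_A = sum(indicators) / len(x)                       # relative frequency, equation (6.2.6)
```

Here 7 of the 15 values fall in A, so p(A) = 7/15, which is exactly the mean of the 0/1 indicator data.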
This probability measure is known as the empirical probability distribution associated with the data set x. It is a discrete
distribution that places probability 1/n at each point x_i. In fact this observation supplies a simpler proof of the previous theorem. Thus,
if the data values are distinct, the empirical distribution is the discrete uniform distribution on {x_1, x_2, …, x_n}. More generally, if
x ∈ S occurs k times in the data then the empirical distribution assigns probability k/n to x.
If the underlying variable is real-valued, then clearly the sample mean is simply the mean of the empirical distribution. It follows
that the sample mean satisfies all properties of expected value, not just the linear properties and increasing properties given above.
These properties are just the most important ones, and so were repeated for emphasis.
Empirical Density
Suppose now that the population variable x takes values in a set S ⊆ R^d for some d ∈ N_+. Recall that the standard measure on
R^d is given by
\[ \lambda_d(A) = \int_A 1 \, dx, \quad A \subseteq \mathbb{R}^d \tag{6.2.9} \]
In particular, λ_1(A) is the length of A, for A ⊆ R; λ_2(A) is the area of A, for A ⊆ R^2; and λ_3(A) is the volume of A, for
A ⊆ R^3. Suppose that x is a continuous variable in the sense that λ_d(S) > 0. Typically, S is an interval if d = 1 and a Cartesian
product of intervals if d > 1. Now for A ⊆ S with λ_d(A) > 0, the empirical density of A corresponding to x is
\[ D(A) = \frac{p(A)}{\lambda_d(A)} = \frac{1}{n \lambda_d(A)} \sum_{i=1}^n \mathbf{1}(x_i \in A) \tag{6.2.10} \]
Thus, the empirical density of A is the proportion of data values in A, divided by the size of A. In the picture below
(corresponding to d = 2), if A has area 5, say, then D(A) = 4/75.
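A sketch of the computation of D(A) for d = 2, using made-up points chosen so that, as in the picture, 4 of 15 points fall in a set A of area 5:

```python
# Empirical density of a planar set A: D(A) = p(A) / lambda_2(A).
# A is the rectangle [0,1] x [0,5], so its area is 5. The points are made up.
points = [(0.5, 1.0), (0.2, 4.0), (0.9, 2.5), (0.7, 0.5),   # these 4 fall in A
          (2.0, 1.0), (3.0, 3.0), (1.5, 0.5), (2.5, 2.0),
          (4.0, 1.0), (3.5, 4.0), (1.2, 2.0), (2.2, 0.1),
          (4.5, 3.0), (1.8, 1.5), (3.3, 0.7)]               # 15 points total

def in_A(point):
    x, y = point
    return 0 <= x <= 1 and 0 <= y <= 5

n = len(points)
area = 5.0                                       # lambda_2(A)
p_A = sum(1 for p in points if in_A(p)) / n      # empirical probability: 4/15
D_A = p_A / area                                 # empirical density: 4/75
```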
The Empirical Distribution Function
Suppose again that x = (x_1, x_2, …, x_n) is a sample of size n from a real-valued variable. For x ∈ R, let F(x) denote the relative
frequency (empirical probability) of (−∞, x] corresponding to the data set x. Thus, for each x ∈ R, F(x) is the sample mean of
the data {1(x_i ≤ x) : i ∈ {1, 2, …, n}}:
\[ F(x) = p\left((-\infty, x]\right) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(x_i \le x) \tag{6.2.11} \]
F is a distribution function.
1. F increases from 0 to 1.
2. F is a step function with jumps at the distinct sample values {x_1, x_2, …, x_n}.
Proof
Appropriately enough, F is called the empirical distribution function associated with x and is simply the distribution function of
the empirical distribution corresponding to x. If we know the sample size n and the empirical distribution function F , we can
recover the data, except for the order of the observations. The distinct values of the data are the places where F jumps, and the
number of data values at such a point is the size of the jump, times the sample size n .
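A minimal sketch of the empirical distribution function for a small made-up sample, including the jump sizes from which the data can be recovered:

```python
# Empirical distribution function of a small, made-up sample.
from collections import Counter

x = [2, 5, 2, 7, 5, 5]
n = len(x)

def F(t):
    """F(t) = proportion of sample values <= t, equation (6.2.11)."""
    return sum(1 for xi in x if xi <= t) / n

# F is a step function: it jumps at each distinct value, with jump size count/n,
# so knowing F and n recovers the data up to order.
jumps = {v: c / n for v, c in Counter(x).items()}

print(F(1))    # 0.0
print(F(2))    # 2/6 ~ 0.333
print(F(5))    # 5/6 ~ 0.833
print(F(10))   # 1.0
```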
The Empirical Discrete Density Function
Suppose now that x = (x_1, x_2, …, x_n) is a sample of size n from a discrete variable that takes values in a countable set S. For
x ∈ S, let f(x) be the relative frequency (empirical probability) of x corresponding to the data set x. Thus, for each x ∈ S,
\[ f(x) = p(\{x\}) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(x_i = x) \tag{6.2.12} \]
In the picture below, the dots are the possible values of the underlying variable. The red dots represent the data, and the numbers
indicate repeated values. The blue dots are possible values of the variable that did not happen to occur in the data. So, the
sample size is 12, and for the value x that occurs 3 times, we have f(x) = 3/12.
f is a discrete probability density function:
1. f(x) ≥ 0 for x ∈ S
2. ∑_{x ∈ S} f(x) = 1
Proof
Appropriately enough, f is called the empirical probability density function or the relative frequency function associated with x,
and is simply the probability density function of the empirical distribution corresponding to x. If we know the empirical PDF f and
the sample size n, then we can recover the data set, except for the order of the observations.
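A minimal sketch of the empirical PDF for a made-up discrete sample, verifying the two properties above and the recovery of the data from f and n:

```python
# Empirical probability density (relative frequency) function of a made-up sample.
from collections import Counter

x = ["a", "b", "a", "c", "a", "b"]   # sample of size 6 from a discrete variable
n = len(x)

f = {v: c / n for v, c in Counter(x).items()}   # f(v) = p({v}), equation (6.2.12)

# f is a discrete PDF: nonnegative and sums to 1
assert all(p >= 0 for p in f.values())
assert abs(sum(f.values()) - 1) < 1e-12

# knowing f and n recovers the data up to order: value v occurs n * f(v) times
counts = {v: round(n * p) for v, p in f.items()}
```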
If the underlying population variable is real-valued, then the sample mean is the expected value computed relative to the
empirical density function. That is,
\[ \frac{1}{n} \sum_{i=1}^n x_i = \sum_{x \in S} x f(x) \tag{6.2.14} \]
Proof
As we noted earlier, if the population variable is real-valued then the sample mean is the mean of the empirical distribution.
The Empirical Continuous Density Function
Suppose now that x = (x_1, x_2, …, x_n) is a sample of size n from a continuous variable that takes values in a set S ⊆ R^d. Let
A = {A_j : j ∈ J} be a partition of S into a countable number of subsets, each of positive, finite measure. Recall that the word
partition means that the subsets are pairwise disjoint and their union is S. Let f be the function on S defined by the rule that f(x)
is the empirical density of A_j, corresponding to the data set x, for each x ∈ A_j. Thus, f is constant on each of the partition sets:
\[ f(x) = D(A_j) = \frac{p(A_j)}{\lambda_d(A_j)} = \frac{1}{n \lambda_d(A_j)} \sum_{i=1}^n \mathbf{1}(x_i \in A_j), \quad x \in A_j \tag{6.2.16} \]
Proof
The function f is called the empirical probability density function associated with the data x and the partition A. For the
probability distribution defined by f, the empirical probability p(A_j) is uniformly distributed over A_j for each j ∈ J. In the
picture below, the red dots represent the data and the black lines define a partition of S into 9 rectangles. For the partition set A in
the upper right, the empirical distribution would distribute probability 3/15 = 1/5 uniformly over A. If the area of A is, say, 4,
then f(x) = 1/20 for x ∈ A.
Note, however, that the expected value computed relative to the empirical PDF is not in general the same as the sample mean when the underlying variable is real-valued.
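A sketch of the empirical PDF for a continuous variable with d = 1, using a made-up sample and a partition of S = [0, 4) into four unit intervals:

```python
# Empirical PDF for a continuous variable via a partition into intervals.
x = [0.3, 0.7, 1.1, 1.2, 1.8, 2.4, 2.5, 2.6, 3.9]   # made-up sample on S = [0, 4)
edges = [0, 1, 2, 3, 4]                              # partition of S into intervals A_j

n = len(x)

def empirical_pdf(t):
    """f(t) = (fraction of data in the interval containing t) / (interval length)."""
    for a, b in zip(edges, edges[1:]):
        if a <= t < b:
            count = sum(1 for xi in x if a <= xi < b)
            return (count / n) / (b - a)             # equation (6.2.16)
    raise ValueError("t outside S")

# the empirical PDF integrates to 1 over S
total = sum(empirical_pdf((a + b) / 2) * (b - a) for a, b in zip(edges, edges[1:]))
```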
Histograms
Our next discussion is closely related to the previous one. Suppose again that x = (x_1, x_2, …, x_n) is a sample of size n from a
variable that takes values in a set S and that A = (A_1, A_2, …, A_k) is a partition of S into k subsets. The sets in the partition are
known as classes.
The mapping that assigns relative frequencies to classes is known as a relative frequency distribution for the data set and the
given partition.
In the case of a continuous variable, the mapping that assigns densities to classes is known as a density distribution for the data
set and the given partition.
In dimensions 1 or 2, the bar graph of any of these distributions is known as a histogram. The histogram of a frequency distribution
and the histogram of the corresponding relative frequency distribution look the same, except for a change of scale on the vertical
axis. If the classes all have the same size, the histogram of the corresponding density distribution also looks the same, again except
for a change of scale on the vertical axis. If the underlying variable is real-valued, the classes are usually intervals (discrete or
continuous) and the midpoints of these intervals are sometimes referred to as class marks.
Suppose now that the classes are ordered, and that n_j is the frequency of class A_j for j ∈ {1, 2, …, k}, so
that p_j = n_j / n is the relative frequency of class A_j. Let t_j denote the class mark (midpoint) of class A_j. The cumulative frequency
of class A_j is N_j = ∑_{i=1}^{j} n_i and the cumulative relative frequency of class A_j is P_j = ∑_{i=1}^{j} p_i = N_j / n. Note that the cumulative
frequencies increase from n_1 to n and the cumulative relative frequencies increase from p_1 to 1.
The mapping that assigns cumulative frequencies to classes is known as a cumulative frequency distribution for the data set and
the given partition. The polygonal graph that connects the points (t_j, N_j) for j ∈ {1, 2, …, k} is the cumulative frequency
ogive.
The mapping that assigns cumulative relative frequencies to classes is known as a cumulative relative frequency distribution for
the data set and the given partition. The polygonal graph that connects the points (t_j, P_j) for j ∈ {1, 2, …, k} is the cumulative
relative frequency ogive.
If the underlying variable is real-valued, the sample mean can be approximated from the frequency distribution:
\[ m(x) \approx \sum_{j=1}^k p_j t_j \]
This approximation is based on the hope that the mean of the data values in each class is close to the midpoint of that class. In fact,
the expression on the right is the expected value of the distribution that places probability p_j on class mark t_j for each j.
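As a sketch, here is the approximation applied to the class marks and frequencies of the commuting-distance table in the exercises below (the marks 1, 4, 8, 15 are the interval midpoints):

```python
# Approximating the sample mean from a frequency distribution.
# Each class: (class mark t_j, frequency n_j), taken from the commuting-distance table.
classes = [(1.0, 6), (4.0, 16), (8.0, 18), (15.0, 10)]

n = sum(freq for _, freq in classes)                     # total sample size: 50
approx_mean = sum(t * freq for t, freq in classes) / n   # sum of p_j * t_j
print(approx_mean)   # 7.28
```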
Exercises
Basic Properties
Suppose that x is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of
operation.
1. Classify x by type and level of measurement.
2. A sample of 30 components has mean 113°. Find the sample mean if the temperature is converted to degrees Celsius. The
transformation is y = (5/9)(x − 32).
Answer
Suppose that x is the length (in inches) of a machined part in a manufacturing process.
1. Classify x by type and level of measurement.
2. A sample of 50 parts has mean 10.0. Find the sample mean if length is measured in centimeters. The transformation is
y = 2.54x.
Answer
Suppose that x is the number of brothers and y the number of sisters for a person in a certain population. Thus, z = x +y is
the number of siblings.
1. Classify the variables by type and level of measurement.
2. For a sample of 100 persons, m(x) = 0.8 and m(y) = 1.2 . Find m(z).
Answer
Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). The mean grade
on the first midterm exam was 64 (out of a possible 100 points). Professor Moriarity thinks the grades are a bit low and is
considering various transformations for increasing the grades. In each case below give the mean of the transformed grades, or
state that there is not enough information.
1. Add 10 points to each grade, so the transformation is y = x + 10 .
2. Multiply each grade by 1.2, so the transformation is z = 1.2x
3. Use the transformation w = 10√x. Note that this is a non-linear transformation that curves the grades greatly at the low
end and very little at the high end. For example, a grade of 100 is still 100, but a grade of 36 is transformed to 60.
One of the students did not study at all, and received a 10 on the midterm. Professor Moriarity considers this score to be an
outlier.
4. What would the mean be if this score is omitted?
Answer
Computational Exercises
All statistical software packages will compute means and proportions, draw dotplots and histograms, and in general perform the
numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those with large data sets,
the use of statistical software is essential. On the other hand, there is some value in performing the computations by hand, with
small, artificial data sets, in order to master the concepts and definitions. In this subsection, do the computations and draw the
graphs with minimal technological aids.
Suppose that x is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data
x = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2).
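A direct computation of the sample mean for this data, from equation (6.2.1):

```python
# Sample mean of the math-courses data.
x = [3, 1, 2, 0, 2, 4, 3, 2, 1, 2]

m = sum(x) / len(x)
print(m)   # 2.0
```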
Suppose that a sample of size 12 from a discrete variable x has empirical density function given by f (−2) = 1/12 ,
f (−1) = 1/4, f (0) = 1/3, f (1) = 1/6, f (2) = 1/6.
The following table gives a frequency distribution for the commuting distance to the math/stat building (in miles) for a sample
of ESU students.
Class Freq Rel Freq Density Cum Freq Cum Rel Freq Midpoint
(0, 2] 6
(2, 6] 16
(6, 10] 18
(10, 20] 10
In the interactive histogram, click on the x-axis at various points to generate a data set with at least 20 values. Vary the number
of classes and switch between the frequency histogram and the relative frequency histogram. Note how the shape of the
histogram changes as you perform these operations. Note in particular how the histogram loses resolution as you decrease the
number of classes.
In the interactive histogram, click on the axis to generate a distribution of the given type with at least 30 points. Now vary the
number of classes and note how the shape of the distribution changes.
1. A uniform distribution
2. A symmetric unimodal distribution
3. A unimodal distribution that is skewed right.
4. A unimodal distribution that is skewed left.
5. A symmetric bimodal distribution
6. A u-shaped distribution.
Consider the petal length and species variables in Fisher's iris data.
1. Classify the variables by type and level of measurement.
2. Compute the sample mean and plot a density histogram for petal length.
3. Compute the sample mean and plot a density histogram for petal length by species.
Answers
5. Compute the sample mean and plot a density histogram for the net weight.
Answer
Consider the body weight, species, and gender variables in the Cicada data.
1. Classify the variables by type and level of measurement.
2. Compute the relative frequency function for species and plot the graph.
3. Compute the relative frequency function for gender and plot the graph.
4. Compute the sample mean and plot a density histogram for body weight.
5. Compute the sample mean and plot a density histogram for body weight by species.
6. Compute the sample mean and plot a density histogram for body weight by gender.
Answer
This page titled 6.2: The Sample Mean is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
6.3: The Law of Large Numbers
Basic Theory
This section continues the discussion of the sample mean from the last section, but we now consider the more interesting setting
where the variables are random. Specifically, suppose that we have a basic random experiment with an underlying probability
measure P, and that X is a random variable for the experiment. Suppose now that we perform n independent replications of the basic
experiment. This defines a new, compound experiment with a sequence of independent random variables X = (X_1, X_2, …, X_n),
each with the same distribution as X. Recall that in statistical terms, X is a random sample of size n from the distribution of X.
All of the relevant statistics discussed in the previous section are defined for X, but of course now these statistics are random
variables with distributions of their own. For the most part, we use the notation established previously, except for the usual
convention of denoting random variables with capital letters. Of course, the deterministic properties and relations established
previously apply as well. When we actually run the experiment and observe the values x = (x_1, x_2, …, x_n) of the random
variables, then we are back in the setting of the previous section.
Often the distribution mean μ is unknown and the sample mean M is used as an estimator of this unknown parameter.
Moments
Suppose that the underlying distribution has mean μ and variance σ². Then
1. E(M) = μ
2. var(M) = σ² / n
Proof
Part (a) means that the sample mean M is an unbiased estimator of the distribution mean μ . Therefore, the variance of M is the
mean square error, when M is used as an estimator of μ . Note that the variance of M is an increasing function of the distribution
variance and a decreasing function of the sample size. Both of these make intuitive sense if we think of the sample mean M as an
estimator of the distribution mean μ . The fact that the mean square error (variance in this case) decreases to 0 as the sample size n
increases to ∞ means that the sample mean M is a consistent estimator of the distribution mean μ .
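A simulation sketch of unbiasedness and consistency (the normal sampling distribution and all parameter values here are arbitrary choices):

```python
# The sample mean is unbiased, and its variance shrinks like sigma^2 / n.
import random
import statistics

random.seed(42)
mu, sigma = 10.0, 2.0   # arbitrary distribution parameters

def sample_mean(n):
    """One realization of M for a sample of size n from N(mu, sigma^2)."""
    return statistics.fmean(random.gauss(mu, sigma) for _ in range(n))

# many replications of M for two sample sizes
reps = 2000
means_small = [sample_mean(4) for _ in range(reps)]
means_large = [sample_mean(64) for _ in range(reps)]

# both centers are near mu (unbiasedness) ...
print(statistics.fmean(means_small), statistics.fmean(means_large))
# ... but the variance at n = 64 is roughly 1/16 of the variance at n = 4
print(statistics.variance(means_small), statistics.variance(means_large))
```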
Recall that X_i − M is the deviation of X_i from M, that is, the directed distance from M to X_i. The following theorem states that
the sample mean is uncorrelated with each deviation, a result that will be crucial for showing the independence of the sample mean
and the sample variance when the sampling distribution is normal.
Suppose now that (X_1, X_2, …) is a sequence of independent random variables, each with the same distribution as
X. In probabilistic terms, we have an independent, identically distributed (IID) sequence. For each n, let M_n denote the sample
mean of the first n variables:
\[ M_n = \frac{1}{n} \sum_{i=1}^n X_i \]
6.3.1 https://stats.libretexts.org/@go/page/10180
From the result above on variance, note that var(M_n) = E[(M_n − μ)²] → 0 as n → ∞. This means that M_n → μ as n → ∞ in mean square.
Recall that in general, convergence in mean square implies convergence in probability. The convergence of the sample mean to the
distribution mean in mean square and in probability are known as weak laws of large numbers.
Finally, the strong law of large numbers states that the sample mean M_n converges to the distribution mean μ with probability 1.
As the name suggests, this is a much stronger result than the weak laws. We will need some additional notation for the proof. First
let Y_n = ∑_{i=1}^n X_i, so that M_n = Y_n / n. Next, recall the positive and negative parts of a real number x:
x⁺ = max{x, 0} and x⁻ = max{−x, 0}, so that x = x⁺ − x⁻ and |x| = x⁺ + x⁻.
M_n → μ as n → ∞ with probability 1.
Proof
The proof of the strong law of large numbers given above requires that the variance of the sampling distribution be finite (note that
this is critical in the first step). However, there are better proofs that only require that E(|X|) < ∞. An elegant proof showing that
M_n → μ as n → ∞ with probability 1 and in mean, using backwards martingales, is given in the chapter on martingales. In the
next few paragraphs, we apply the law of large numbers to some of the special statistics studied in the previous section.
Empirical Probability
Suppose that X is the outcome random variable for a basic experiment, with sample space S and probability measure P. Now
suppose that we repeat the basic experiment indefinitely to form a sequence of independent random variables (X_1, X_2, …), each
with the same distribution as X. That is, we sample from the distribution of X. For A ⊆ S, let P_n(A) denote the empirical
probability of A corresponding to the sample (X_1, X_2, …, X_n):
\[ P_n(A) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(X_i \in A) \tag{6.3.13} \]
Now of course, P_n(A) is a random variable for each event A. In fact, the sum ∑_{i=1}^n 1(X_i ∈ A) has the binomial distribution with
parameters n and P(A).
E[P_n(A)] = P(A) and var[P_n(A)] = P(A)[1 − P(A)]/n, so P_n(A) → P(A) as n → ∞ in mean square and in probability.
Proof
This special case of the law of large numbers is central to the very concept of probability: the relative frequency of an event
converges to the probability of the event as the experiment is repeated.
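This convergence is easy to simulate. A sketch with a hypothetical event of probability 0.3:

```python
# The relative frequency of an event converges to its probability.
import random

random.seed(7)
p = 0.3                      # P(A) for a simulated event A

def relative_frequency(n):
    """P_n(A): the fraction of n independent replications in which A occurs."""
    hits = sum(1 for _ in range(n) if random.random() < p)
    return hits / n

for n in [10, 100, 10_000, 100_000]:
    print(n, relative_frequency(n))   # drifts toward 0.3 as n grows
```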
The Empirical Distribution Function
Suppose now that X is a real-valued random variable for a basic experiment. Recall that the distribution function of X is the
function F given by
\[ F(x) = P(X \le x), \quad x \in \mathbb{R} \tag{6.3.14} \]
Now suppose that we repeat the basic experiment indefinitely to form a sequence of independent random variables (X_1, X_2, …),
each with the same distribution as X. That is, we sample from the distribution of X. Let F_n denote the empirical distribution
function corresponding to the sample (X_1, X_2, …, X_n):
\[ F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(X_i \le x), \quad x \in \mathbb{R} \tag{6.3.15} \]
Now, of course, F_n(x) is a random variable for each x ∈ R. In fact, the sum ∑_{i=1}^n 1(X_i ≤ x) has the binomial distribution with
parameters n and F(x).
For each x ∈ R,
1. E[F_n(x)] = F(x)
2. var[F_n(x)] = F(x)[1 − F(x)]/n, so F_n(x) → F(x) as n → ∞ in mean square and in probability.
Proof
Empirical Density for a Discrete Variable
Suppose now that X is a random variable for a basic experiment with a discrete distribution on a countable set S. Recall that the
probability density function of X is the function f given by
\[ f(x) = P(X = x), \quad x \in S \tag{6.3.16} \]
Now suppose that we repeat the basic experiment to form a sequence of independent random variables (X_1, X_2, …), each with the
same distribution as X. That is, we sample from the distribution of X. Let f_n denote the empirical probability density function
corresponding to the sample (X_1, X_2, …, X_n):
\[ f_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(X_i = x), \quad x \in S \tag{6.3.17} \]
Now, of course, f_n(x) is a random variable for each x ∈ S. In fact, the sum ∑_{i=1}^n 1(X_i = x) has the binomial distribution with
parameters n and f(x).
For each x ∈ S,
1. E[f_n(x)] = f(x)
2. var[f_n(x)] = f(x)[1 − f(x)]/n, so f_n(x) → f(x) as n → ∞ in mean square and in probability.
Proof
Recall that a countable intersection of events with probability 1 still has probability 1. Thus, in the context of the previous theorem,
we actually have
\[ P\left[f_n(x) \to f(x) \text{ as } n \to \infty \text{ for every } x \in S\right] = 1 \tag{6.3.18} \]
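A simulation sketch for a fair die, where f(x) = 1/6 for each face:

```python
# Empirical density of fair-die scores approaches f(x) = 1/6 for every face.
import random
from collections import Counter

random.seed(1)
n = 60_000
rolls = [random.randint(1, 6) for _ in range(n)]

f_n = {x: c / n for x, c in sorted(Counter(rolls).items())}
for x in range(1, 7):
    print(x, round(f_n[x], 4))   # each value is close to 1/6 ~ 0.1667
```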
Empirical Density for a Continuous Variable
Suppose now that X is a random variable for a basic experiment, with a continuous distribution on S ⊆ R^d, and that X has
probability density function f. Technically, f is the probability density function with respect to the standard (Lebesgue) measure
λ_d. Thus, by definition,
\[ P(X \in A) = \int_A f(x) \, dx, \quad A \subseteq S \]
Again we repeat the basic experiment to generate a sequence of independent random variables (X_1, X_2, …), each with the same
distribution as X. That is, we sample from the distribution of X. Suppose now that A = {A_j : j ∈ J} is a partition of S into a
countable number of subsets, each with positive, finite size. Let f_n denote the empirical probability density function
corresponding to the sample (X_1, X_2, …, X_n) and the partition A:
\[ f_n(x) = \frac{P_n(A_j)}{\lambda_d(A_j)} = \frac{1}{n \lambda_d(A_j)} \sum_{i=1}^n \mathbf{1}(X_i \in A_j), \quad j \in J, \; x \in A_j \tag{6.3.20} \]
Of course now, f_n(x) is a random variable for each x ∈ S. If the partition is sufficiently fine (so that λ_d(A_j) is small for each j),
and if the sample size n is sufficiently large, then by the law of large numbers,
\[ f_n(x) \approx f(x), \quad x \in S \]
Exercises
Simulation Exercises
In the dice experiment, recall that the dice scores form a random sample from the specified die distribution. Select the average
random variable, which is the sample mean of the sample of dice scores. For each die distribution, start with 1 die and increase
the sample size n . Note how the distribution of the sample mean begins to resemble a point mass distribution. Note also that
the mean of the sample mean stays the same, but the standard deviation of the sample mean decreases. For selected values of n
and selected die distributions, run the simulation 1000 times and compare the relative frequency function of the sample mean
to the true probability density function, and compare the empirical moments of the sample mean to the true moments.
Several apps in this project are simulations of random experiments with events of interest. When you run the experiment, you are
performing independent replications of the experiment. In most cases, the app displays the relative frequency of the event and its
complement, both graphically in blue, and numerically in a table. When you run the experiment, the relative frequencies are shown
graphically in red and also numerically.
In the simulation of Buffon's coin experiment, the event of interest is that the coin crosses a crack. For various values of the
parameter (the radius of the coin), run the experiment 1000 times and compare the relative frequency of the event to the true
probability.
In the simulation of Bertrand's experiment, the event of interest is that a “random chord” on a circle will be longer than the
length of a side of the inscribed equilateral triangle. For each of the various models, run the experiment 1000 times and
compare the relative frequency of the event to the true probability.
Many of the apps in this project are simulations of experiments which result in discrete variables. When you run the simulation,
you are performing independent replications of the experiment. In most cases, the app displays the true probability density function
numerically in a table and visually as a blue bar graph. When you run the simulation, the relative frequency function is also shown
numerically in the table and visually as a red bar graph.
In the simulation of the binomial coin experiment, select the number of heads. For selected values of the parameters, run the
simulation 1000 times and compare the sample mean to the distribution mean, and compare the empirical density function to
the probability density function.
In the simulation of the matching experiment, the random variable is the number of matches. For selected values of the
parameter, run the simulation 1000 times and compare the sample mean and the distribution mean, and compare the empirical
density function to the probability density function.
In the poker experiment, the random variable is the type of hand. Run the simulation 1000 times and compare the empirical
density function to the true probability density function.
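The matching experiment is easy to replicate directly: the number of matches is the number of fixed points of a uniformly random permutation, and its distribution mean is 1 for every n. The sketch below (with an illustrative seed and sample size) compares the sample mean of simulated match counts to that distribution mean.

```python
import random

def matches(n, rng):
    """Number of fixed points of a uniformly random permutation of {0, ..., n-1}."""
    perm = list(range(n))
    rng.shuffle(perm)
    return sum(1 for i, p in enumerate(perm) if i == p)

rng = random.Random(2)
sample = [matches(10, rng) for _ in range(10000)]
sample_mean = sum(sample) / len(sample)
# the distribution mean of the number of matches is 1 for every n,
# so sample_mean should be close to 1
```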
Many of the apps in this project are simulations of experiments which result in variables with continuous distributions. When you
run the simulation, you are performing independent replications of the experiment. In most cases, the app displays the true
probability density function visually as a blue graph. When you run the simulation, an empirical density function, based on a
partition, is also shown visually as a red bar graph.
In the simulation of the gamma experiment, the random variable represents a random arrival time. For selected values of the
parameters, run the experiment 1000 times and compare the sample mean to the distribution mean, and compare the empirical
density function to the probability density function.
In the special distribution simulator, select the normal distribution. For various values of the parameters (the mean and standard
deviation), run the experiment 1000 times and compare the sample mean to the distribution mean, and compare the empirical
density function to the probability density function.
6.3.4 https://stats.libretexts.org/@go/page/10180
Probability Exercises
Answer
Suppose now that (X_1, X_2, …, X_9) is a random sample of size 9 from the distribution in the previous problem. Find the …
Answer
Suppose now that (X_1, X_2, …, X_16) is a random sample of size 16 from the distribution in the previous problem. Find the …
Answer
Recall that for an ace-six flat die, faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, and 5 have probability 1/8 each. Let
X denote the score when an ace-six flat die is thrown. Compute each of the following:
Suppose now that an ace-six flat die is thrown n times. Find the expected value and variance of each of the following random
variables:
1. The empirical probability density function f_n(x) for x ∈ {1, 2, 3, 4, 5, 6}
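As a check on the moments of the ace-six flat die, they can be computed directly from the density function; the sketch below uses exact rational arithmetic so the results come out as clean fractions.

```python
from fractions import Fraction as F

# ace-six flat die: faces 1 and 6 have probability 1/4, faces 2-5 have probability 1/8
pdf = {1: F(1, 4), 2: F(1, 8), 3: F(1, 8), 4: F(1, 8), 5: F(1, 8), 6: F(1, 4)}

mean = sum(x * p for x, p in pdf.items())        # E(X)
second = sum(x * x * p for x, p in pdf.items())  # E(X^2)
var = second - mean ** 2                         # var(X) = E(X^2) - E(X)^2
# mean = 7/2 and var = 15/4, versus 7/2 and 35/12 for a fair die
```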
This page titled 6.3: The Law of Large Numbers is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
6.4: The Central Limit Theorem
The central limit theorem and the law of large numbers are the two fundamental theorems of probability. Roughly, the central limit
theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed variables will
be approximately normal, regardless of the underlying distribution. The importance of the central limit theorem is hard to overstate;
indeed it is the reason that many statistical procedures work.
Suppose that X = (X_1, X_2, …) is a sequence of independent, identically distributed, real-valued random variables with common
probability density function f, mean μ, and variance σ². We assume that 0 < σ < ∞, so that in particular, the random variables
really are random and not constants. Let

Y_n = ∑_{i=1}^n X_i, n ∈ N (6.4.1)
Note that by convention, Y_0 = 0, since the sum is over an empty index set. The random process Y = (Y_0, Y_1, Y_2, …) is called the
partial sum process associated with X. Special types of partial sum processes have been studied in many places in this text; in
particular see
the binomial distribution in the setting of Bernoulli trials
the negative binomial distribution in the setting of Bernoulli trials
the gamma distribution in the Poisson process
the arrival times in a general renewal process
Recall that in statistical terms, the sequence X corresponds to sampling from the underlying distribution. In particular,
(X_1, X_2, …, X_n) is a random sample of size n from the distribution, and the corresponding sample mean is

M_n = Y_n / n = (1/n) ∑_{i=1}^n X_i (6.4.2)
If m ≤ n then Y_n − Y_m has the same distribution as Y_{n−m}. Thus the process Y has stationary increments.
Proof
Note however that Y_n − Y_m and Y_{n−m} are very different random variables; the theorem simply states that they have the same
distribution.
If n_1 ≤ n_2 ≤ n_3 ≤ ⋯ then (Y_{n_1}, Y_{n_2} − Y_{n_1}, Y_{n_3} − Y_{n_2}, …) is a sequence of independent random variables. Thus the process
Y has independent increments.
Proof
Conversely, suppose that V = (V_0, V_1, V_2, …) is a random process with stationary, independent increments. Define
U_i = V_i − V_{i−1} for i ∈ N_+. Then U = (U_1, U_2, …) is a sequence of independent, identically distributed variables and V is
the partial sum process associated with U.
Thus, partial sum processes are the only discrete-time random processes that have stationary, independent increments. An
interesting, and much harder problem, is to characterize the continuous-time processes that have stationary independent increments.
The Poisson counting process has stationary independent increments, as does the Brownian motion process.
Moments
If n ∈ N then
1. E(Y_n) = nμ
2. var(Y_n) = nσ²
Proof
Suppose that m ≤ n. Then
1. cov(Y_m, Y_n) = mσ²
2. cor(Y_m, Y_n) = √(m/n)
3. E(Y_m Y_n) = mσ² + mnμ²
Proof
Distributions
Suppose that X has either a discrete distribution or a continuous distribution with probability density function f. Then the
probability density function of Y_n is f^{∗n} = f ∗ f ∗ ⋯ ∗ f, the convolution power of f of order n.
Proof
More generally, we can use the stationary and independence properties to find the joint distributions of the partial sum process:
If n_1 < n_2 < ⋯ < n_k then (Y_{n_1}, Y_{n_2}, …, Y_{n_k}) has joint probability density function

f_{n_1, n_2, …, n_k}(y_1, y_2, …, y_k) = f^{∗n_1}(y_1) f^{∗(n_2−n_1)}(y_2 − y_1) ⋯ f^{∗(n_k−n_{k−1})}(y_k − y_{k−1}), (y_1, y_2, …, y_k) ∈ R^k (6.4.5)
Proof
Note that var(Y_n) → ∞ as n → ∞ since σ > 0, and E(Y_n) → ∞ as n → ∞ if μ > 0 while E(Y_n) → −∞ as n → ∞ if μ < 0.
Similarly, we know that M_n → μ as n → ∞ with probability 1, so the limiting distribution of the sample mean is degenerate.
Thus, to obtain a limiting distribution of Y_n or M_n that is not degenerate, we need to consider, not these variables themselves, but
the common standard score

Z_n = (Y_n − nμ) / (σ √n)

1. E(Z_n) = 0
2. var(Z_n) = 1
Proof
The precise statement of the central limit theorem is that the distribution of the standard score Z_n converges to the standard normal
distribution as n → ∞. Recall that the standard normal distribution has probability density function

ϕ(z) = (1/√(2π)) e^{−z²/2}, z ∈ R (6.4.7)

and is studied in more detail in the chapter on special distributions. A special case of the central limit theorem (applied to Bernoulli trials)
dates to Abraham De Moivre. The term central limit theorem was coined by George Pólya in 1920. By definition of convergence in
distribution, the central limit theorem states that F_n(z) → Φ(z) as n → ∞ for each z ∈ R, where F_n is the distribution function
of Z_n and Φ is the standard normal distribution function:
Φ(z) = ∫_{−∞}^z ϕ(x) dx = ∫_{−∞}^z (1/√(2π)) e^{−x²/2} dx, z ∈ R (6.4.8)
An equivalent statement of the central limit theorem involves convergence of the corresponding characteristic functions. This is the
version that we will give and prove, but first we need a generalization of a famous limit from calculus.
Now let χ denote the characteristic function of the standard score of the sample variable X, and let χ_n denote the characteristic
function of the standard score Z_n:

χ(t) = E[exp(it (X − μ)/σ)], χ_n(t) = E[exp(it Z_n)]; t ∈ R (6.4.10)

Recall that t ↦ e^{−t²/2} is the characteristic function of the standard normal distribution. We can now give a proof.
The central limit theorem. The distribution of Z_n converges to the standard normal distribution as n → ∞. That is,
χ_n(t) → e^{−t²/2} as n → ∞ for each t ∈ R.
Proof
Normal Approximations
The central limit theorem implies that if the sample size n is “large” then the distribution of the partial sum Y_n is approximately
normal with mean nμ and variance nσ². Equivalently, the sample mean M_n is approximately normal with mean μ and variance
σ²/n. The central limit theorem is of fundamental importance, because it means that we can approximate the distribution of certain
statistics, even if we know very little about the underlying sampling distribution.
Of course, the term “large” is relative. Roughly, the more “abnormal” the basic distribution, the larger n must be for normal
approximations to work well. The rule of thumb is that a sample size n of at least 30 will usually suffice if the basic distribution is
not too weird; although for many distributions smaller n will do.
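The rule of thumb can be explored numerically. The sketch below (with illustrative seed and sample sizes) standardizes sums of n = 30 exponential variables, a noticeably skewed distribution, and checks that the fraction of standardized sums within one unit of 0 is already close to the standard normal value Φ(1) − Φ(−1) ≈ 0.683.

```python
import random

def standardized_sum(n, rng):
    """Standard score of a sum of n exponential(1) variables (mean n, sd sqrt(n))."""
    y = sum(rng.expovariate(1.0) for _ in range(n))
    return (y - n) / n ** 0.5

rng = random.Random(3)
z = [standardized_sum(30, rng) for _ in range(5000)]
within_one_sd = sum(1 for v in z if -1 < v < 1) / len(z)
# for a standard normal variable, P(-1 < Z < 1) is about 0.683;
# with n = 30 the simulated fraction should already be in that neighborhood
```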
Let Y denote the sum of the variables in a random sample of size 30 from the uniform distribution on [0, 1]. Find normal
approximations to each of the following:
1. P(13 < Y < 18)
2. The 90th percentile of Y
Answer
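A quick way to carry out a normal approximation like this is Python's `statistics.NormalDist`. Here Y is the sum of 30 independent uniform(0, 1) variables, so the approximating normal distribution has mean 30 · (1/2) = 15 and variance 30 · (1/12); the sketch below is one way to compute the two quantities asked for.

```python
from statistics import NormalDist

# normal approximation to the Irwin-Hall(30) sum: mean 15, variance 30/12
approx = NormalDist(mu=15, sigma=(30 / 12) ** 0.5)

p = approx.cdf(18) - approx.cdf(13)   # approximation to P(13 < Y < 18)
q90 = approx.inv_cdf(0.90)            # approximate 90th percentile of Y
```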
Random variable Y in the previous exercise has the Irwin-Hall distribution of order 30. The Irwin-Hall distributions are studied in
more detail in the chapter on Special Distributions and are named for Joseph Irwin and Philip Hall.
In the special distribution simulator, select the Irwin-Hall distribution. Vary n from 1 to 10 and note the shape of the
probability density function. With n = 10 run the experiment 1000 times and compare the empirical density function to the
true probability density function.
Let M denote the sample mean of a random sample of size 50 from the distribution with probability density function
f(x) = 3/x⁴ for 1 ≤ x < ∞. This is a Pareto distribution, named for Vilfredo Pareto. Find normal approximations to each of
the following:
1. P(M > 1.6)
2. The 60th percentile of M
Answer
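For this Pareto distribution, E(X) = ∫_1^∞ x · 3x⁻⁴ dx = 3/2 and E(X²) = ∫_1^∞ x² · 3x⁻⁴ dx = 3, so var(X) = 3 − 9/4 = 3/4. By the central limit theorem, M is approximately normal with mean 3/2 and variance (3/4)/50; the sketch below evaluates the approximation with the stdlib `NormalDist`.

```python
from statistics import NormalDist

mu = 3 / 2          # E(X) for the Pareto density f(x) = 3 / x^4, x >= 1
var = 3 - mu ** 2   # E(X^2) = 3, so var(X) = 3/4
n = 50

# CLT: the sample mean M is approximately normal with mean mu and variance var/n
approx = NormalDist(mu=mu, sigma=(var / n) ** 0.5)

p = 1 - approx.cdf(1.6)       # approximation to P(M > 1.6)
q60 = approx.inv_cdf(0.60)    # approximate 60th percentile of M
```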
The Continuity Correction
A slight technical problem arises when the sampling distribution is discrete. In this case, the partial sum also has a discrete
distribution, and hence we are approximating a discrete distribution with a continuous one. Suppose that X takes integer values
(the most common case) and hence so does the partial sum Y_n. For any k ∈ Z and h ∈ [0, 1), note that the event
{k − h ≤ Y_n ≤ k + h} is equivalent to the event {Y_n = k}. Different values of h lead to different normal approximations, even
though the events are equivalent. The smallest approximation would be 0 when h = 0, and the approximations increase as h
increases. It is customary to split the difference by using h = 1/2 for the normal approximation. This is sometimes called the half-
unit continuity correction or the histogram correction. The continuity correction is extended to other events in the natural way,
using the additivity of probability.
Let Y denote the sum of the scores of 20 fair dice. Compute the normal approximation to P(60 ≤ Y ≤ 75) .
Answer
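The die-sum computation follows the recipe just described: one fair-die score has mean 7/2 and variance 35/12, so Y is approximately normal with mean 70 and variance 20 · 35/12, and with the half-unit continuity correction the event {60 ≤ Y ≤ 75} becomes the interval (59.5, 75.5). A sketch of the calculation:

```python
from statistics import NormalDist

n = 20
mu, var = 7 / 2, 35 / 12   # mean and variance of a single fair-die score

# normal approximation to the sum of n scores: mean n*mu, variance n*var
approx = NormalDist(mu=n * mu, sigma=(n * var) ** 0.5)

# half-unit continuity correction: {60 <= Y <= 75} becomes (59.5, 75.5)
p = approx.cdf(75.5) - approx.cdf(59.5)
```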
In the dice experiment, set the die distribution to fair, select the sum random variable Y , and set n = 20 . Run the simulation
1000 times and find each of the following. Compare with the result in the previous exercise:
1. P(60 ≤ Y ≤ 75)
2. The relative frequency of the event {60 ≤ Y ≤ 75} (from the simulation)
The mean is kb and the variance is kb². The gamma distribution is widely used to model random times (particularly in the context
of the Poisson model) and other positive random variables. The general gamma distribution is studied in more detail in the chapter
on Special Distributions. In the context of the Poisson model (where k ∈ N_+), the gamma distribution is also known as the Erlang
distribution, named for Agner Erlang; it is studied in more detail in the chapter on the Poisson Process. Suppose now that Y_k has
the gamma (Erlang) distribution with shape parameter k ∈ N_+ and scale parameter b > 0. Then

Y_k = ∑_{i=1}^k X_i (6.4.15)

where (X_1, X_2, …) is a sequence of independent variables, each having the exponential distribution with scale parameter b. (The
exponential distribution is a special case of the gamma distribution with shape parameter 1.) It follows that if k is large, the gamma
distribution can be approximated by the normal distribution with mean kb and variance kb². The same statement actually holds
when k is not an integer. Here is the precise statement:
Suppose that Y_k has the gamma distribution with scale parameter b ∈ (0, ∞) and shape parameter k ∈ (0, ∞). Then the
distribution of the standardized variable Z_k below converges to the standard normal distribution as k → ∞:

Z_k = (Y_k − kb) / (√k b) (6.4.16)
In the special distribution simulator, select the gamma distribution. Vary k and b and note the shape of the probability density
function. With k = 10 and various values of b, run the experiment 1000 times and compare the empirical density function to
the true probability density function.
Suppose that Y has the gamma distribution with shape parameter k = 10 and scale parameter b = 2. Find normal
approximations to each of the following:
1. P(18 ≤ Y ≤ 23)
2. The 80th percentile of Y
Answer
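Here the approximating normal distribution has mean kb = 20 and variance kb² = 40; the sketch below carries out the two approximations with the stdlib `NormalDist`.

```python
from statistics import NormalDist

k, b = 10, 2
# normal approximation to the gamma distribution: mean kb, variance kb^2
approx = NormalDist(mu=k * b, sigma=(k * b ** 2) ** 0.5)

p = approx.cdf(23) - approx.cdf(18)   # approximation to P(18 <= Y <= 23)
q80 = approx.inv_cdf(0.80)            # approximate 80th percentile of Y
```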
When n is a positive integer, the chi-square distribution governs the sum of the squares of n independent, standard normal variables. For this
reason, it is one of the most important distributions in statistics. The chi-square distribution is studied in more detail in the chapter
on Special Distributions. From the previous discussion, it follows that if n is large, the chi-square distribution can be approximated
by the normal distribution with mean n and variance 2n. Here is the precise statement:
Suppose that Y_n has the chi-square distribution with n ∈ (0, ∞) degrees of freedom. Then the distribution of the standardized
variable Z_n below converges to the standard normal distribution as n → ∞:

Z_n = (Y_n − n) / √(2n) (6.4.18)
In the special distribution simulator, select the chi-square distribution. Vary n and note the shape of the probability density
function. With n = 20, run the experiment 1000 times and compare the empirical density function to the probability density
function.
Suppose that Y has the chi-square distribution with n = 20 degrees of freedom. Find normal approximations to each of the
following:
1. P(18 < Y < 25)
2. The 75th percentile of Y
Answer
Suppose now that X = (X_1, X_2, …) is a sequence of independent, identically distributed indicator variables with P(X_i = 1) = p for each i, where p ∈ (0, 1) is the parameter. In the usual language of
reliability, X_i is the outcome of trial i, where 1 means success and 0 means failure. The common mean is p and the common
variance is p(1 − p).
Let Y_n = ∑_{i=1}^n X_i, so that Y_n is the number of successes in the first n trials.
The binomial distribution is studied in more detail in the chapter on Bernoulli trials.
It follows from the central limit theorem that if n is large, the binomial distribution with parameters n and p can be approximated
by the normal distribution with mean np and variance np(1 − p). The rule of thumb is that n should be large enough for np ≥ 5
and n(1 − p) ≥ 5. (The first condition is the important one when p < 1/2 and the second condition is the important one when
p > 1/2.) Here is the precise statement:
Suppose that Y_n has the binomial distribution with trial parameter n ∈ N_+ and success parameter p ∈ (0, 1). Then the
distribution of the standardized variable Z_n given below converges to the standard normal distribution as n → ∞:

Z_n = (Y_n − np) / √(np(1 − p)) (6.4.20)
In the binomial timeline experiment, vary n and p and note the shape of the probability density function. With n = 50 and
p = 0.3 , run the simulation 1000 times and compute the following:
1. P(12 ≤ Y ≤ 16)
2. The relative frequency of the event {12 ≤ Y ≤ 16} (from the simulation)
Answer
Suppose that Y has the binomial distribution with parameters n = 50 and p = 0.3 . Compute the normal approximation to
P(12 ≤ Y ≤ 16) (don't forget the continuity correction) and compare with the results of the previous exercise.
Answer
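The binomial approximation with the continuity correction can be sketched the same way: Y is approximately normal with mean np = 15 and variance np(1 − p) = 10.5, and {12 ≤ Y ≤ 16} becomes the interval (11.5, 16.5).

```python
from statistics import NormalDist

n, p_success = 50, 0.3
# normal approximation to the binomial: mean np, variance np(1 - p)
approx = NormalDist(mu=n * p_success,
                    sigma=(n * p_success * (1 - p_success)) ** 0.5)

# half-unit continuity correction: {12 <= Y <= 16} becomes (11.5, 16.5)
prob = approx.cdf(16.5) - approx.cdf(11.5)
```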
Recall that the Poisson distribution has probability density function f(n) = e^{−θ} θ^n / n!, n ∈ N, where θ > 0 is a parameter. The parameter is both the mean and the variance of the distribution. The Poisson distribution is widely
used to model the number of “random points” in a region of time or space, and is studied in more detail in the chapter on the
Poisson Process. In this context, the parameter is proportional to the size of the region.
Suppose now that Y_n has the Poisson distribution with parameter n ∈ N_+. Then

Y_n = ∑_{i=1}^n X_i (6.4.22)
where (X_1, X_2, …, X_n) is a sequence of independent variables, each with the Poisson distribution with parameter 1. It follows
from the central limit theorem that if n is large, the Poisson distribution with parameter n can be approximated by the normal
distribution with mean n and variance n . The same statement holds when the parameter n is not an integer. Here is the precise
statement:
Suppose that Y_θ has the Poisson distribution with parameter θ ∈ (0, ∞). Then the distribution of the standardized variable Z_θ
below converges to the standard normal distribution as θ → ∞:

Z_θ = (Y_θ − θ) / √θ
In the Poisson experiment, vary the time and rate parameters t and r (the parameter of the Poisson distribution in the
experiment is the product rt ). Note the shape of the probability density function. With r = 5 and t = 4 , run the experiment
1000 times and compare the empirical density function to the true probability density function.
f(n) = \binom{n+k−1}{n} p^k (1 − p)^n, n ∈ N (6.4.24)
The mean is k(1 − p)/p and the variance is k(1 − p)/p². The negative binomial distribution is studied in more detail in the
chapter on Bernoulli trials. If k ∈ N_+, the distribution governs the number of failures Y_k before success number k in a sequence of
Bernoulli trials, and

Y_k = ∑_{i=1}^k X_i (6.4.25)

where (X_1, X_2, …, X_k) is a sequence of independent variables, each having the geometric distribution on N with parameter p.
(The geometric distribution is a special case of the negative binomial, with parameters 1 and p.) In the context of the Bernoulli
trials, X_1 is the number of failures before the first success, and for i ∈ {2, 3, …}, X_i is the number of failures between success
number i − 1 and success number i. It follows that if k is large, the negative binomial distribution can be approximated by the normal
distribution. The same statement holds if k is not an integer. Here is the precise statement:
Suppose that Y_k has the negative binomial distribution with shape parameter k ∈ (0, ∞) and scale parameter p ∈ (0, 1). Then
the distribution of the standardized variable Z_k below converges to the standard normal distribution as k → ∞:

Z_k = (p Y_k − k(1 − p)) / √(k(1 − p)) (6.4.26)
Another version of the negative binomial distribution is the distribution of the trial number V_k of success number k ∈ N_+. So
V_k = k + Y_k, and V_k has mean k/p and variance k(1 − p)/p². The normal approximation applies to the distribution of V_k as well,
if k is large, and since the distributions are related by a location transformation, the standard scores are the same. That is,

(p V_k − k) / √(k(1 − p)) = (p Y_k − k(1 − p)) / √(k(1 − p)) (6.4.27)
In the negative binomial experiment, vary k and p and note the shape of the probability density function. With k = 5 and
p = 0.4 , run the experiment 1000 times and compare the empirical density function to the true probability density function.
Suppose that Y has the negative binomial distribution with trial parameter k = 10 and success parameter p = 0.4 . Find normal
approximations to each of the following:
1. P(20 < Y < 30)
2. The 80th percentile of Y
Answer
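One way to set up this approximation: Y has mean k(1 − p)/p = 15 and variance k(1 − p)/p² = 37.5. The sketch below assumes the half-unit continuity correction, since Y is integer valued and {20 < Y < 30} = {21 ≤ Y ≤ 29}; whether to apply the correction is a modeling choice, not something the text dictates here.

```python
from statistics import NormalDist

k, p = 10, 0.4
mean = k * (1 - p) / p        # 15: mean number of failures before success k
var = k * (1 - p) / p ** 2    # 37.5: variance of the number of failures
approx = NormalDist(mu=mean, sigma=var ** 0.5)

# Y is integer valued, so {20 < Y < 30} = {21 <= Y <= 29}; with the
# half-unit continuity correction this becomes the interval (20.5, 29.5)
prob = approx.cdf(29.5) - approx.cdf(20.5)
q80 = approx.inv_cdf(0.80)    # approximate 80th percentile of Y
```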
Suppose again that X = (X_1, X_2, …) is a sequence of independent, identically distributed real-valued random variables with common mean μ and variance σ². Suppose now
that N is a random variable (on the same probability space) taking values in N, also with finite mean and variance. Then

Y_N = ∑_{i=1}^N X_i (6.4.28)

is a random sum of the independent, identically distributed variables. That is, the terms are random of course, but so also is the
number of terms N. We are primarily interested in the moments of Y_N.
The conditional expected value of Y_N given N, the expected value of Y_N, and the variance of Y_N are
1. E(Y_N ∣ N) = Nμ
2. E(Y_N) = E(N)μ
3. var(Y_N) = E(N)σ² + var(N)μ²
Let G denote the common moment generating function of the terms X_i, and let H denote the probability generating function of N. Show that the moment generating function of Y_N is H ∘ G:
1. E(e^{t Y_N} ∣ N) = [G(t)]^N
2. E(e^{t Y_N}) = H(G(t))
Wald's Equation
The result in Exercise 29 (b) generalizes to the case where the random number of terms N is a stopping time for the sequence X.
This means that the event {N = n} depends only on (technically, is measurable with respect to) (X_1, X_2, …, X_n) for each
n ∈ N. The generalization is known as Wald's equation, and is named for Abraham Wald. Stopping times are studied in much more detail later in the text.
Suppose that the number of customers arriving at a store during a given day has the Poisson distribution with parameter 50.
Each customer, independently of the others (and independently of the number of customers), spends an amount of money that
is uniformly distributed on the interval [0, 20]. Find the mean and standard deviation of the amount of money that the store
takes in during a day.
Answer
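The store exercise is a direct application of the random-sum moment formulas E(Y_N) = E(N)μ and var(Y_N) = E(N)σ² + var(N)μ²: here μ = 10 and σ² = 400/12 for the uniform[0, 20] spending, and E(N) = var(N) = 50 for the Poisson customer count. A sketch of the arithmetic (the variable names are just illustrative):

```python
mu = 10.0              # mean of the uniform[0, 20] spending distribution
sigma2 = 20 ** 2 / 12  # variance of uniform[0, 20]: (b - a)^2 / 12
en = vn = 50.0         # Poisson(50): mean and variance are both 50

mean_y = en * mu                    # E(Y_N) = E(N) * mu
var_y = en * sigma2 + vn * mu ** 2  # var(Y_N) = E(N) sigma^2 + var(N) mu^2
sd_y = var_y ** 0.5                 # standard deviation of daily take
```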
When a certain critical component in a system fails, it is immediately replaced by a new, statistically identical component. The
components are independent, and the lifetime of each (in hours) is exponentially distributed with scale parameter b. During the
life of the system, the number of critical components used has a geometric distribution on N_+ with parameter p. Find the mean
and standard deviation of the total life of the system.
This page titled 6.4: The Central Limit Theorem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
6.5: The Sample Variance
Descriptive Theory
Recall the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that
we make on these objects. We select objects from the population and record the variables for the objects in the sample; these
become our data. Once again, our first discussion is from a descriptive point of view. That is, we do not assume that the data are
generated by an underlying probability distribution. Remember however, that the data themselves form a probability distribution.
Variance and Standard Deviation
Suppose that x = (x_1, x_2, …, x_n) is a sample of size n from a real-valued variable x. Recall that the sample mean is

m = (1/n) ∑_{i=1}^n x_i (6.5.1)

and is the most important measure of the center of the data set. The sample variance is defined to be

s² = (1/(n−1)) ∑_{i=1}^n (x_i − m)² (6.5.2)
If we need to indicate the dependence on the data vector x, we write s²(x). The difference x_i − m is the deviation of x_i from the
mean m of the data set. Thus, the variance is the mean square deviation and is a measure of the spread of the data set with respect to
the mean. The reason for dividing by n − 1 rather than n is best understood in terms of the inferential point of view that we discuss
in the next section; this definition makes the sample variance an unbiased estimator of the distribution variance. However, the
reason for the averaging can also be understood in terms of a related concept.
∑_{i=1}^n (x_i − m) = 0.
Proof
Thus, if we know n − 1 of the deviations, we can compute the last one. This means that there are only n − 1 freely varying
deviations, that is to say, n − 1 degrees of freedom in the set of deviations. In the definition of sample variance, we average the
squared deviations, not by dividing by the number of terms, but rather by dividing by the number of degrees of freedom in those
terms. However, this argument notwithstanding, it would be reasonable, from a purely descriptive point of view, to divide by n in
the definition of the sample variance. Moreover, when n is sufficiently large, it hardly matters whether we divide by n or by n − 1 .
In any event, the square root s of the sample variance s² is the sample standard deviation. It is the root mean square deviation and
is also a measure of the spread of the data with respect to the mean. Both measures of spread are important. Variance has nicer
mathematical properties, but its physical unit is the square of the unit of x. For example, if the underlying variable x is the height
of a person in inches, the variance is in square inches. On the other hand, the standard deviation has the same physical unit as the
original variable, but its mathematical properties are not as nice.
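The divide-by-n versus divide-by-(n−1) distinction is built into Python's stdlib: `statistics.variance` divides by n − 1 and `statistics.pvariance` divides by n. A small sketch, on illustrative data:

```python
from statistics import variance, pvariance

x = [2.0, 5.0, 3.0, 8.0, 4.0]

s2 = variance(x)    # divides by n - 1: the sample variance as defined here
v2 = pvariance(x)   # divides by n: the variance of the empirical distribution
ratio = v2 / s2     # equals (n - 1)/n, so the two versions agree for large n
```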
Recall that the data set x naturally gives rise to a probability distribution, namely the empirical distribution that places probability
1/n at x_i for each i. Thus, if the data are distinct, this is the uniform distribution on {x_1, x_2, …, x_n}. The sample mean m is simply
the expected value of the empirical distribution. Similarly, if we were to divide by n rather than n − 1, the sample variance would
be the variance of the empirical distribution. Most of the properties and results in this section follow from much more general
properties and results for the variance of a probability distribution (although for the most part, we give independent proofs).
Measures of Center and Spread
Measures of center and measures of spread are best thought of together, in the context of an error function. The error function
measures how well a single number a represents the entire data set x. The values of a (if they exist) that minimize the error
functions are our measures of center; the minimum value of the error function is the corresponding measure of spread. Of course,
we hope for a single value of a that minimizes the error function, so that we have a unique measure of center.
Let's apply this procedure to the mean square error function defined by

mse(a) = (1/(n−1)) ∑_{i=1}^n (x_i − a)², a ∈ R (6.5.3)
Proof
Trivially, if we defined the mean square error function by dividing by n rather than n − 1, then the minimum value would still
occur at m, the sample mean, but the minimum value would be the alternate version of the sample variance in which we divide by
n. On the other hand, if we were to use the root mean square deviation function rmse(a) = √(mse(a)), then because the square
root function is strictly increasing on [0, ∞), the minimum value would again occur at m, the sample mean, but the minimum value
would be s, the sample standard deviation. The important point is that with all of these error functions, the unique measure of
center is the sample mean, and the corresponding measures of spread are the various ones that we are studying.
Next, let's apply our procedure to the mean absolute error function defined by

mae(a) = (1/(n−1)) ∑_{i=1}^n |x_i − a|, a ∈ R (6.5.5)
Mathematically, mae has some problems as an error function. First, the function will not be smooth (differentiable) at points where
two lines of different slopes meet. More importantly, the values that minimize mae may occupy an entire interval, thus leaving us
without a unique measure of center. The error function exercises below will show you that these pathologies can really happen. It
turns out that mae is minimized at any point in the median interval of the data set x. The proof of this result follows from a much
more general result for probability distributions. Thus, the medians are the natural measures of center associated with mae as a
measure of error, in the same way that the sample mean is the measure of center associated with the mse as a measure of error.
Properties
In this section, we establish some essential properties of the sample variance and standard deviation. First, the following alternate
formula for the sample variance is better for computational purposes, and for certain theoretical purposes as well.
Proof
If we let x² = (x_1², x_2², …, x_n²) denote the sample from the variable x², then the computational formula in the last exercise can be
written succinctly as

s²(x) = (n/(n−1)) [m(x²) − m²(x)] (6.5.9)
The following theorem gives another computational formula for the sample variance, directly in terms of the variables and thus
without the computation of an intermediate statistic.
s² = (1/(2n(n−1))) ∑_{i=1}^n ∑_{j=1}^n (x_i − x_j)² (6.5.10)
Proof
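Both computational formulas, (6.5.9) and (6.5.10), are easy to verify numerically against the defining formula; the sketch below checks them on a small illustrative data set, using `statistics.variance` as the reference value.

```python
from statistics import mean, variance

x = [2.0, 5.0, 3.0, 8.0, 4.0]
n, m = len(x), mean(x)

# formula (6.5.9): s^2 = n/(n-1) * (m(x^2) - m(x)^2)
s2_comp = n / (n - 1) * (mean(v * v for v in x) - m ** 2)

# formula (6.5.10): s^2 = (1 / (2n(n-1))) * sum over all pairs of (x_i - x_j)^2
s2_pair = sum((a - b) ** 2 for a in x for b in x) / (2 * n * (n - 1))

# both should equal the defining sample variance
```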
1. s² ≥ 0
2. s² = 0 if and only if x_i = x_j for each i, j ∈ {1, 2, …, n}.
Proof
Thus, s² = 0 if and only if the data set is constant (and then, of course, the mean is the common value).
If c is a constant then
1. s²(c x) = c² s²(x)
Proof
2. s(x + c) = s(x)
Proof
As a special case of these results, suppose that x = (x_1, x_2, …, x_n) is a sample of size n corresponding to a real variable x, and
that a and b are constants. The sample corresponding to the variable y = a + bx, in our vector notation, is a + bx. Then
m(a + bx) = a + b m(x) and s(a + bx) = |b| s(x). Linear transformations of this type, when b > 0, arise frequently when
physical units are changed. In this case, the transformation is often called a location-scale transformation; a is the location
parameter and b is the scale parameter. For example, if x is the length of an object in inches, then y = 2.54x is the length of the
object in centimeters. If x is the temperature of an object in degrees Fahrenheit, then y = (5/9)(x − 32) is the temperature of the
object in degrees Celsius. Since x_i, m, and s have the same physical units, the standard score z_i = (x_i − m)/s is dimensionless (that is, has no physical units); it measures the directed
distance of x_i from the mean m in standard deviations.
The sample of standard scores z = (z_1, z_2, …, z_n) has mean 0 and variance 1. That is,
1. m(z) = 0
2. s²(z) = 1
Proof
Approximating the Variance
Suppose that instead of the actual data x, we have a frequency distribution corresponding to a partition with classes (intervals)
(A_1, A_2, …, A_k), class marks (midpoints of the intervals) (t_1, t_2, …, t_k), and frequencies (n_1, n_2, …, n_k). Recall that the
relative frequency of class A_j is p_j = n_j/n. In this case, approximate values of the sample mean and variance are, respectively,

m = (1/n) ∑_{j=1}^k n_j t_j = ∑_{j=1}^k p_j t_j (6.5.18)

s² = (1/(n−1)) ∑_{j=1}^k n_j (t_j − m)² = (n/(n−1)) ∑_{j=1}^k p_j (t_j − m)² (6.5.19)

These approximations are based on the hope that the data values in each class are well represented by the class mark. In fact, these
are the standard definitions of sample mean and variance for the data set in which t_j occurs n_j times for each j.
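Formulas (6.5.18) and (6.5.19) translate directly into code. The sketch below uses an invented frequency distribution (four classes with illustrative marks and frequencies) purely to show the computation.

```python
# hypothetical frequency distribution: class marks t_j and frequencies n_j
marks = [5.0, 15.0, 25.0, 35.0]
freqs = [4, 10, 8, 3]
n = sum(freqs)

# approximate sample mean: m = (1/n) * sum of n_j * t_j
m_approx = sum(nj * tj for nj, tj in zip(freqs, marks)) / n

# approximate sample variance: s^2 = (1/(n-1)) * sum of n_j * (t_j - m)^2
s2_approx = sum(nj * (tj - m_approx) ** 2
                for nj, tj in zip(freqs, marks)) / (n - 1)
```

These are exactly the ordinary sample mean and variance of the data set in which each mark t_j is repeated n_j times.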
Inferential Statistics
We continue our discussion of the sample variance, but now we assume that the variables are random. Thus, suppose that we have a
basic random experiment, and that X is a real-valued random variable for the experiment with mean μ and standard deviation σ.
We will need some higher order moments as well. Let σ_3 = E[(X − μ)³] and σ_4 = E[(X − μ)⁴] denote the 3rd and 4th
moments about the mean. Recall that σ_3/σ³ = skew(X), the skewness of X, and σ_4/σ⁴ = kurt(X), the kurtosis of X.
We repeat the basic experiment n times to form a new, compound experiment, with a sequence of independent random variables
X = (X_1, X_2, …, X_n), each with the same distribution as X. In statistical terms, X is a random sample of size n from the
distribution of X. All of the statistics above make sense for X, of course, but now these statistics are random variables. We will use
the same notation, except for the usual convention of denoting random variables by capital letters. Finally, note that the
deterministic properties and relations established above still hold.
In addition to being a measure of the center of the data X, the sample mean

M = (1/n) ∑_{i=1}^n X_i (6.5.20)

is a natural estimator of the distribution mean μ. In this section, we will derive statistics that are natural estimators of the
distribution variance σ². The statistics that we will derive are different, depending on whether μ is known or unknown; for this
reason, we consider the two cases separately. First suppose that μ is known, and define

W² = (1/n) ∑_{i=1}^n (X_i − μ)² (6.5.21)
W
2
is the sample mean for a random sample of size n from the distribution of 2
(X − μ) , and satisfies the following
properties:
1. E(W²) = σ²
2. var(W²) = (1/n)(σ₄ − σ⁴)
3. W² → σ² as n → ∞ with probability 1
4. The distribution of √n (W² − σ²)/√(σ₄ − σ⁴) converges to the standard normal distribution as n → ∞.
Proof
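The four properties above are easy to check empirically. The sketch below (plain Python, standard library only; the function name special_sample_variance is ours) simulates W² for a standard normal X, where σ² = 1 and σ₄ = 3, so var(W²) should be near (σ₄ − σ⁴)/n = 2/n.

```python
import random
import statistics

def special_sample_variance(xs, mu):
    """W^2: the average squared deviation from the known mean mu."""
    n = len(xs)
    return sum((x - mu) ** 2 for x in xs) / n

# Standard normal X: mu = 0, sigma^2 = 1, sigma_4 = 3. Then E(W^2) should be
# near 1 and var(W^2) near (sigma_4 - sigma^4) / n = 2 / n.
random.seed(3)
n, reps = 50, 4000
w2s = [special_sample_variance([random.gauss(0, 1) for _ in range(n)], 0.0)
       for _ in range(reps)]
print(statistics.mean(w2s), statistics.variance(w2s))  # near 1 and 0.04
```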
In particular, part (a) means that W² is an unbiased estimator of σ². From part (b), note that var(W²) → 0 as n → ∞; this means that W² is a consistent estimator of σ². The square root of the special sample variance is a special version of the sample standard deviation, denoted W.
Next we compute the covariance and correlation between the sample mean and the special sample variance.
1. cov(M, W²) = σ₃/n.
2. cor(M, W²) = σ₃/√(σ²(σ₄ − σ⁴))
Proof
Note that the correlation does not depend on the sample size, and that the sample mean and the special sample variance are uncorrelated if σ₃ = 0 (equivalently skew(X) = 0).
The Standard Sample Variance
Consider now the more realistic case in which μ is unknown. In this case, a natural approach is to average, in some sense, the squared deviations (X_i − M)² over i ∈ {1, 2, …, n}. It might seem that we should average by dividing by n. However, another approach is to divide by whatever constant would give us an unbiased estimator of σ². This constant turns out to be n − 1, leading to the standard sample variance:
S² = (1/(n − 1)) ∑_{i=1}^n (X_i − M)²
E(S²) = σ².
Proof
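A quick simulation illustrates why dividing by n − 1 rather than n gives an unbiased estimator. This is a sketch in plain Python (standard library only; the function name sample_variance is ours), using an exponential X with σ² = 1:

```python
import random
import statistics

def sample_variance(xs):
    """S^2: squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

# Exponential X with rate 1, so sigma^2 = 1. Averaging many realizations of
# S^2, and of the divide-by-n version, shows why n - 1 is the right divisor.
random.seed(7)
n, reps = 5, 20000
s2, v_n = [], []
for _ in range(reps):
    xs = [random.expovariate(1.0) for _ in range(n)]
    s2.append(sample_variance(xs))
    v_n.append(sample_variance(xs) * (n - 1) / n)
print(statistics.mean(s2))   # near 1 (unbiased)
print(statistics.mean(v_n))  # near (n - 1)/n = 0.8 (biased low)
```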
Of course, the square root of the sample variance is the sample standard deviation, denoted S .
S² → σ² as n → ∞ with probability 1.
Proof
Since S² is an unbiased estimator of σ², the variance of S² is the mean square error, a measure of the quality of the estimator.
var(S²) = (1/n)(σ₄ − ((n − 3)/(n − 1)) σ⁴).
Proof
Note that var(S²) → 0 as n → ∞, and hence S² is a consistent estimator of σ². On the other hand, it's not surprising that the variance of the standard sample variance (where we assume that μ is unknown) is greater than the variance of the special sample variance (in which we assume μ is known).
var(S²) > var(W²).
Proof
Next we compute the covariance between the sample mean and the sample variance.
The covariance and correlation between the sample mean and sample variance are
1. cov(M, S²) = σ₃/n
2. cor(M, S²) = σ₃/(σ √(σ₄ − σ⁴(n − 3)/(n − 1)))
Proof
In particular, note that cov(M, S²) = cov(M, W²). Again, the sample mean and variance are uncorrelated if σ₃ = 0, so that skew(X) = 0. Our last result gives the covariance and correlation between the special sample variance and the standard one. Curiously, the covariance is the same as the variance of the special sample variance.
1. cov(W², S²) = (σ₄ − σ⁴)/n
2. cor(W², S²) = √((σ₄ − σ⁴)/(σ₄ − σ⁴(n − 3)/(n − 1)))
Proof
Exercises
Basic Properties
Suppose that x is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of
operation. A sample of 30 components has mean 113° and standard deviation 18°.
1. Classify x by type and level of measurement.
2. Find the sample mean and standard deviation if the temperature is converted to degrees Celsius. The transformation is y = (5/9)(x − 32).
Answer
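For linear transformations like the Fahrenheit-to-Celsius conversion in part (b), the sample mean transforms the same way as the data and the standard deviation is scaled by |b|. A minimal Python sketch (the helper name transform_summary is ours):

```python
# For a linear change of units y = a + b*x, the sample mean transforms the
# same way (m_y = a + b*m_x) and the standard deviation is scaled by |b|
# (s_y = |b|*s_x).
def transform_summary(mean_x, sd_x, a, b):
    return a + b * mean_x, abs(b) * sd_x

# Fahrenheit sample: mean 113, sd 18; Celsius is y = (5/9)(x - 32).
mean_c, sd_c = transform_summary(113, 18, -32 * 5 / 9, 5 / 9)
print(round(mean_c, 2), round(sd_c, 2))  # 45.0 10.0
```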
Suppose that x is the length (in inches) of a machined part in a manufacturing process. A sample of 50 parts has mean 10.0 and
standard deviation 2.0.
1. Classify x by type and level of measurement.
2. Find the sample mean if length is measured in centimeters. The transformation is y = 2.54x.
Answer
Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). The mean grade
on the first midterm exam was 64 (out of a possible 100 points) and the standard deviation was 16. Professor Moriarity thinks
the grades are a bit low and is considering various transformations for increasing the grades. In each case below give the mean
and standard deviation of the transformed grades, or state that there is not enough information.
1. Add 10 points to each grade, so the transformation is y = x + 10 .
2. Multiply each grade by 1.2, so the transformation is z = 1.2x
3. Use the transformation w = 10√x. Note that this is a non-linear transformation that curves the grades greatly at the low end and very little at the high end. For example, a grade of 100 is still 100, but a grade of 36 is transformed to 60.
One of the students did not study at all, and received a 10 on the midterm. Professor Moriarity considers this score to be an
outlier.
4. Find the mean and standard deviation if this score is omitted.
Answer
Computational Exercises
All statistical software packages will compute means, variances and standard deviations, draw dotplots and histograms, and in
general perform the numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those
with large data sets, the use of statistical software is essential. On the other hand, there is some value in performing the
computations by hand, with small, artificial data sets, in order to master the concepts and definitions. In this subsection, do the
computations and draw the graphs with minimal technological aids.
Suppose that x is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data
x = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2).
1. Classify x by type and level of measurement.
2. Sketch the dotplot.
3. Construct a table with rows corresponding to cases and columns corresponding to i, x_i, x_i − m, and (x_i − m)². Add rows at the bottom in the i column for totals and means.
Answer
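The table in part (c) can be checked mechanically. A short Python sketch of the same computation (the table layout is ours):

```python
# A sketch of the hand-computed table: for each case i, the columns are x_i,
# x_i - m, and (x_i - m)^2; totals and means go at the bottom.
x = [3, 1, 2, 0, 2, 4, 3, 2, 1, 2]
n = len(x)
m = sum(x) / n                                   # sample mean
rows = [(i + 1, xi, xi - m, (xi - m) ** 2) for i, xi in enumerate(x)]
for row in rows:
    print(row)
ss = sum(r[3] for r in rows)                     # total of the last column
s2 = ss / (n - 1)                                # sample variance
print(m, ss, s2)  # 2.0, 12.0, 1.333...
```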
Suppose that a sample of size 12 from a discrete variable x has empirical density function given by f (−2) = 1/12 ,
f (−1) = 1/4, f (0) = 1/3, f (1) = 1/6, f (2) = 1/6.
The following table gives a frequency distribution for the commuting distance to the math/stat building (in miles) for a sample
of ESU students.
Class Freq Rel Freq Density Cum Freq Cum Rel Freq Midpoint
(0, 2] 6
(2, 6] 16
(6, 10] 18
(10, 20] 10
Total
In the error function app, select root mean square error. As you add points, note the shape of the graph of the error function, the
value that minimizes the function, and the minimum value of the function.
In the error function app, select mean absolute error. As you add points, note the shape of the graph of the error function, the
value that minimizes the function, and the minimum value of the function.
Suppose that our data vector is (2, 1, 5, 7). Explicitly give mae as a piecewise function and sketch its graph. Note that
1. All values of a ∈ [2, 5] minimize mae.
2. mae is not differentiable at a ∈ {1, 2, 5, 7}.
Suppose that our data vector is (3, 5, 1). Explicitly give mae as a piecewise function and sketch its graph. Note that
1. mae is minimized at a = 3 .
2. mae is not differentiable at a ∈ {1, 3, 5}.
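The mae computations in the last two exercises are easy to explore numerically. A minimal Python sketch, using the data vector (2, 1, 5, 7):

```python
# mae(a) is the average absolute deviation from a candidate center a; it is
# piecewise linear and minimized on the median (interval) of the data.
def mae(a, data):
    return sum(abs(x - a) for x in data) / len(data)

data = [2, 1, 5, 7]
# Any a in the median interval [2, 5] gives the same minimum value.
print(mae(2, data), mae(3.5, data), mae(5, data))  # all 2.25
print(mae(0, data))  # larger: 3.75
```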
Simulation Exercises
Many of the apps in this project are simulations of experiments with a basic random variable of interest. When you run the
simulation, you are performing independent replications of the experiment. In most cases, the app displays the standard deviation
of the distribution, both numerically in a table and graphically as the radius of the blue, horizontal bar in the graph box. When you
run the simulation, the sample standard deviation is also displayed numerically in the table and graphically as the radius of the red
horizontal bar in the graph box.
In the binomial coin experiment, the random variable is the number of heads. For various values of the parameters n (the
number of coins) and p (the probability of heads), run the simulation 1000 times and compare the sample standard deviation to
the distribution standard deviation.
In the simulation of the matching experiment, the random variable is the number of matches. For selected values of n (the
number of balls), run the simulation 1000 times and compare the sample standard deviation to the distribution standard
deviation.
Run the simulation of the gamma experiment 1000 times for various values of the rate parameter r and the shape parameter k .
Compare the sample standard deviation to the distribution standard deviation.
Probability Exercises
3. d₃ = E[(X − μ)³]
4. d₄ = E[(X − μ)⁴]
Answer
Suppose now that (X₁, X₂, …, X₁₀) is a random sample of size 10 from the beta distribution in the previous problem. Find
each of the following:
1. E(M )
2. var(M )
3. E(W²)
4. var(W²)
5. E(S²)
6. var(S²)
7. cov(M, W²)
8. cov(M, S²)
9. cov(W², S²)
Answer
Suppose that X has probability density function f(x) = λe^{−λx} for 0 ≤ x < ∞, where λ > 0 is a parameter. Thus X has the exponential distribution with rate parameter λ. Compute each of the following:
1. μ = E(X)
2. σ² = var(X)
3. d₃ = E[(X − μ)³]
4. d₄ = E[(X − μ)⁴]
Answer
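For the exponential distribution, the quantities above have the closed forms μ = 1/λ, σ² = 1/λ², d₃ = 2/λ³, and d₄ = 9/λ⁴. The sketch below (plain Python; the midpoint-rule integrator central_moment is ours) verifies the last three numerically for λ = 2:

```python
import math

# Central moments of the exponential distribution with rate lam, computed by
# midpoint-rule numerical integration over [0, upper] and compared with the
# closed forms sigma^2 = 1/lam^2, d3 = 2/lam^3, d4 = 9/lam^4.
def central_moment(k, lam, upper=40.0, steps=200000):
    mu = 1.0 / lam
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h                         # midpoint of the ith cell
        total += (x - mu) ** k * lam * math.exp(-lam * x) * h
    return total

lam = 2.0
s2 = central_moment(2, lam)
d3 = central_moment(3, lam)
d4 = central_moment(4, lam)
print(s2, d3, d4)  # near 0.25, 0.25, 0.5625
```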
Suppose now that (X₁, X₂, …, X₅) is a random sample of size 5 from the exponential distribution in the previous problem.
Find each of the following:
1. E(M )
2. var(M )
3. E(W²)
4. var(W²)
5. E(S²)
6. var(S²)
7. cov(M, W²)
8. cov(M, S²)
9. cov(W², S²)
Answer
Recall that for an ace-six flat die, faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, and 5 have probability 1/8 each. Let
X denote the score when an ace-six flat die is thrown. Compute each of the following:
1. μ = E(X)
2. σ² = var(X)
3. d₃ = E[(X − μ)³]
4. d₄ = E[(X − μ)⁴]
Answer
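The ace-six flat die moments can be computed directly from the probability function. A minimal Python sketch:

```python
# Moments of the score X for an ace-six flat die: faces 1 and 6 have
# probability 1/4 each, faces 2-5 have probability 1/8 each.
faces = [1, 2, 3, 4, 5, 6]
probs = [1/4, 1/8, 1/8, 1/8, 1/8, 1/4]

mu = sum(x * p for x, p in zip(faces, probs))
var = sum((x - mu) ** 2 * p for x, p in zip(faces, probs))
d3 = sum((x - mu) ** 3 * p for x, p in zip(faces, probs))
d4 = sum((x - mu) ** 4 * p for x, p in zip(faces, probs))
print(mu, var, d3, d4)  # 3.5, 3.75, 0.0 (symmetry), 20.8125
```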
Suppose now that an ace-six flat die is tossed 8 times. Find each of the following:
1. E(M )
2. var(M )
3. E(W²)
4. var(W²)
5. E(S²)
6. var(S²)
7. cov(M, W²)
8. cov(M, S²)
9. cov(W², S²)
Answer
Data Analysis Exercises
Statistical software should be used for the problems in this subsection.
Consider the petal length and species variables in Fisher's iris data.
1. Classify the variables by type and level of measurement.
2. Compute the sample mean and standard deviation, and plot a density histogram for petal length.
3. Compute the sample mean and standard deviation, and plot a density histogram for petal length by species.
Answers
2. Compute the sample mean and standard deviation for each color count variable.
3. Compute the sample mean and standard deviation for the total number of candies.
4. Plot a relative frequency histogram for the total number of candies.
5. Compute the sample mean and standard deviation, and plot a density histogram for the net weight.
Answer
Consider the body weight, species, and gender variables in the Cicada data.
1. Classify the variables by type and level of measurement.
2. Compute the relative frequency function for species and plot the graph.
3. Compute the relative frequency function for gender and plot the graph.
4. Compute the sample mean and standard deviation, and plot a density histogram for body weight.
5. Compute the sample mean and standard deviation, and plot a density histogram for body weight by species.
6. Compute the sample mean and standard deviation, and plot a density histogram for body weight by gender.
Answer
This page titled 6.5: The Sample Variance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
6.6: Order Statistics
Descriptive Theory
Recall again the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that we
make on these objects. We select objects from the population and record the variables for the objects in the sample; these become our data.
Our first discussion is from a purely descriptive point of view. That is, we do not assume that the data are generated by an underlying
probability distribution. But as always, remember that the data themselves define a probability distribution, namely the empirical distribution.
Order Statistics
Suppose that x is a real-valued variable for a population and that x = (x₁, x₂, …, xₙ) are the observed values of a sample of size n corresponding to this variable. The order statistic of rank k is the kth smallest value in the data set, and is usually denoted x_(k). To emphasize the dependence on the sample size, another common notation is x_{n:k}.
Naturally, the underlying variable x should be at least at the ordinal level of measurement. The order statistics have the same physical units as x. One of the first steps in exploratory data analysis is to order the data, so order statistics occur naturally. In particular, note that the extreme order statistics are the minimum x_(1) and the maximum x_(n). The sample range is r = x_(n) − x_(1) and the sample midrange is r/2 = (1/2)[x_(n) − x_(1)]. These statistics have the same physical units as x and are measures of the dispersion of the data set.
The Sample Median
If n is odd, the sample median is the middle of the ordered observations, namely x_(k) where k = (n + 1)/2. If n is even, there is not a single middle observation, but rather two middle observations. Thus, the median interval is [x_(k), x_(k+1)] where k = n/2. In this case, the sample median is often defined to be the midpoint of the median interval, (1/2)[x_(k) + x_(k+1)]. In a sense, this definition is a bit arbitrary because there is no compelling reason to prefer one point in the median interval over another. For more on this issue, see the discussion of error functions in the section on Sample Variance. In any event, the sample median is a natural statistic that gives a measure of the center of the data set.
Sample Quantiles
We can generalize the sample median discussed above to other sample quantiles. Thus, suppose that p ∈ [0, 1]. Our goal is to find the value
that is the fraction p of the way through the (ordered) data set. We define the rank of the value that we are looking for as (n − 1)p + 1 . Note
that the rank is a linear function of p, and that the rank is 1 when p = 0 and n when p = 1 . But of course, the rank will not be an integer in
general, so we let k = ⌊(n − 1)p + 1⌋ , the integer part of the desired rank, and we let t = [(n − 1)p + 1] − k , the fractional part of the
desired rank. Thus, (n − 1)p + 1 = k + t where k ∈ {1, 2, … , n} and t ∈ [0, 1). So, using linear interpolation, we define the sample
quantile of order p to be x_[p] = x_(k) + t [x_(k+1) − x_(k)].
Sample quantiles have the same physical units as the underlying variable x. The algorithm really does generalize the results for sample
medians.
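The interpolation algorithm above can be sketched directly in code. This is a plain-Python version (the function name sample_quantile is ours; it follows the rank (n − 1)p + 1 convention described in the text):

```python
# The sample quantile of order p via the rank (n - 1)p + 1: take the integer
# part k and fractional part t of the rank, then interpolate between the
# order statistics x_(k) and x_(k+1).
def sample_quantile(data, p):
    xs = sorted(data)
    n = len(xs)
    rank = (n - 1) * p + 1
    k = int(rank)          # integer part, in {1, ..., n}
    t = rank - k           # fractional part, in [0, 1)
    if k == n:             # p = 1 gives the maximum
        return xs[-1]
    return xs[k - 1] + t * (xs[k] - xs[k - 1])

data = [2, 1, 5, 7]
print(sample_quantile(data, 0.5))   # 3.5, the median of an even-sized sample
print(sample_quantile(data, 0.25))  # 1.75
```

This matches the default "linear" interpolation used by most statistical software.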
The sample quantile of order 1/2 is the median as defined earlier, in both cases where n is odd and where n is even.
The sample quantile of order 1/4 is known as the first quartile and is frequently denoted q₁. The sample quantile of order 3/4 is known as the third quartile and is frequently denoted q₃. The sample median, which is the quantile of order 1/2, is sometimes denoted q₂. The interquartile range is defined to be iqr = q₃ − q₁. Note that iqr is a statistic that measures the spread of the distribution about the median, but of course it uses only the middle half of the data.
The statistic q₁ − (3/2) iqr is called the lower fence and the statistic q₃ + (3/2) iqr is called the upper fence. Sometimes lower limit and upper limit
are used instead of lower fence and upper fence. Values in the data set that are below the lower fence or above the upper fence are potential
outliers, that is, values that don't seem to fit the overall pattern of the data. An outlier can be due to a measurement error, or may be a valid
but rather extreme value. In any event, outliers usually deserve additional study.
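The quartiles, fences, and outlier screening described above can be sketched as follows. This assumes Python's statistics.quantiles with method="inclusive", which uses the same rank (n − 1)p + 1 interpolation as the text; the sample data are ours:

```python
import statistics

# Five-number summary and fences; 'inclusive' quartiles use the rank
# (n - 1)p + 1 interpolation described in the text.
def five_number_summary(data):
    xs = sorted(data)
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return xs[0], q1, q2, q3, xs[-1]

data = [47, 50, 53, 54, 56, 60, 60, 62, 63, 65, 65, 67, 68, 70, 98]
lo, q1, q2, q3, hi = five_number_summary(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print((lo, q1, q2, q3, hi), outliers)
```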
The five statistics (x_(1), q₁, q₂, q₃, x_(n)) are often referred to as the five-number summary. Together, these statistics give a great deal of information about the data set in terms of the center, spread, and skewness. The five numbers roughly separate the data set into four intervals
6.6.1 https://stats.libretexts.org/@go/page/10183
each of which contains approximately 25% of the data. Graphically, the five numbers, and the outliers, are often displayed as a boxplot, sometimes called a box and whisker plot. A boxplot consists of an axis that extends across the range of the data. A line is drawn from the smallest value that is not an outlier (of course, this may be the minimum x_(1)) to the largest value that is not an outlier (of course, this may be the maximum x_(n)). Vertical marks ("whiskers") are drawn at the ends of this line. A rectangular box extends from the first quartile q₁ to the third quartile q₃, with an additional whisker at the median q₂. Finally, the outliers are denoted as points (beyond the extreme whiskers). All statistical packages will compute the quartiles and most will draw boxplots. The picture below shows a boxplot with 3 outliers.
Recall that F has the mathematical properties of a distribution function, and in fact F is the distribution function of the empirical distribution of the data. Recall that this is the distribution that places probability 1/n at each data value x_i (so this is the discrete uniform distribution on {x₁, x₂, …, xₙ} if the data values are distinct). Thus, F(x) = k/n for x ∈ [x_(k), x_(k+1)). Then, we could define the quantile function to be the quantile function F⁻¹ of the empirical distribution. It's easy to see that with this definition, the quantile of order p ∈ (0, 1) is simply x_(k) where k = ⌈np⌉.
Another method is to compute the rank of the quantile of order p ∈ (0, 1) as (n + 1)p, rather than (n − 1)p + 1, and then use linear interpolation just as we have done. To understand the reasoning behind this method, suppose that the underlying variable x takes values in an interval (a, b). Then the n points in the data set x separate this interval into n + 1 subintervals, so it's reasonable to think of x_(k) as the quantile of order k/(n + 1). This method also reduces to the standard calculation for the median when p = 1/2. However, the method will fail if p is so close to 0 or 1 that the rank (n + 1)p falls outside the interval [1, n].
Suppose now that y = a + bx for each value x of the variable, where a ∈ ℝ and b ∈ (0, ∞). Recall that transformations of this type are location-scale transformations and often correspond to changes in units. For example, if x is the length of an object in inches, then y = 2.54x is the length of the object in centimeters. If x is the temperature of an object in degrees Fahrenheit, then y = (5/9)(x − 32) is the temperature of the object in degrees Celsius. Let y = (a + bx₁, a + bx₂, …, a + bxₙ) denote the sample corresponding to the variable y. Like the standard deviation (our most important measure of spread), the range and interquartile range are not affected by the location parameter, but are scaled by the scale parameter.
More generally, suppose y = g(x) where g is a strictly increasing real-valued function on the set of possible values of x. Let y = (g(x₁), g(x₂), …, g(xₙ)) denote the sample corresponding to the variable y. Then (as in the proof of Theorem 2), the order statistics are preserved, so y_(i) = g(x_(i)). However, if g is nonlinear, the quantiles are not preserved (because the quantiles involve linear interpolation).
Suppose that y = g(x) where g is strictly increasing. Then
1. y_(i) = g(x_(i)) for i ∈ {1, 2, …, n}
Proof
Stem and Leaf Plots
A stem and leaf plot is a graphical display of the order statistics (x_(1), x_(2), …, x_(n)). It has the benefit of showing the data in a graphical
way, like a histogram, and at the same time, preserving the ordered data. First we assume that the data have a fixed number format: a fixed
number of digits, then perhaps a decimal point and another fixed number of digits. A stem and leaf plot is constructed by using an initial part
of this string as the stem, and the remaining parts as the leaves. There are lots of variations in how to do this, so rather than give an
exhaustive, complicated definition, we will just look at a couple of examples in the exercise below.
Probability Theory
We continue our discussion of order statistics except that now we assume that the variables are random variables. Specifically, suppose that we have a basic random experiment, and that X is a real-valued random variable for the experiment with distribution function F. We perform n independent replications of the basic experiment to generate a random sample X = (X₁, X₂, …, Xₙ) of size n from the distribution of X. Recall that this is a sequence of independent random variables, each with the distribution of X. All of the statistics defined in the previous section make sense, but now of course, they are random variables. We use the notation established previously, except that we follow our usual convention of denoting random variables with capital letters. Thus, for k ∈ {1, 2, …, n}, X_(k) is the kth order statistic, that is, the kth smallest of (X₁, X₂, …, Xₙ). Our interest now is on the distribution of the order statistics and statistics derived from them.
Proof
In particular, the distribution functions of the extreme order statistics are
1. F₁(x) = 1 − [1 − F(x)]ⁿ for x ∈ ℝ
2. Fₙ(x) = [F(x)]ⁿ for x ∈ ℝ
and the corresponding quantile functions are
1. F₁⁻¹(p) = F⁻¹[1 − (1 − p)^{1/n}] for p ∈ (0, 1)
2. Fₙ⁻¹(p) = F⁻¹(p^{1/n}) for p ∈ (0, 1)
Proof
When the underlying distribution is continuous, we can give a simple formula for the probability density function of an order statistic.
Suppose now that X has a continuous distribution with probability density function f. Then X_(k) has a continuous distribution with probability density function f_k given by
f_k(x) = (n!/((k − 1)!(n − k)!)) [F(x)]^{k−1} [1 − F(x)]^{n−k} f(x),  x ∈ ℝ  (6.6.11)
Proof
Heuristic Proof
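Equation (6.6.11) is easy to transcribe directly. A minimal Python sketch (the function name order_stat_pdf is ours), sanity-checked against the standard uniform case, where the maximum of n = 5 variables has density 5x⁴:

```python
import math

# The pdf of the kth order statistic of a sample of size n, as a function of
# the underlying distribution function F and density f (equation 6.6.11).
def order_stat_pdf(x, k, n, F, f):
    c = math.factorial(n) / (math.factorial(k - 1) * math.factorial(n - k))
    return c * F(x) ** (k - 1) * (1 - F(x)) ** (n - k) * f(x)

# Standard uniform: F(x) = x and f(x) = 1 on [0, 1], so the maximum of
# n = 5 variables has density 5 x^4.
val = order_stat_pdf(0.5, 5, 5, lambda x: x, lambda x: 1.0)
print(val)  # 5 * 0.5**4 = 0.3125
```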
Here are the special cases for the extreme order statistics.
Joint Distributions
We assume again that X has a continuous distribution with distribution function F and probability density function f .
Suppose that j, k ∈ {1, 2, …, n} with j < k. The joint probability density function f_{j,k} of (X_(j), X_(k)) is given by
f_{j,k}(x, y) = (n!/((j − 1)!(k − j − 1)!(n − k)!)) [F(x)]^{j−1} [F(y) − F(x)]^{k−j−1} [1 − F(y)]^{n−k} f(x) f(y);  x, y ∈ ℝ, x < y  (6.6.17)
Heuristic Proof
From the joint distribution of two order statistics we can, in principle, find the distribution of various other statistics: the sample range R; sample quantiles X_[p] for p ∈ [0, 1], and in particular the sample quartiles Q₁, Q₂, Q₃; and the inter-quartile range IQR.
Proof
Arguments similar to the one above can be used to obtain the joint probability density function of any number of the order statistics. Of
course, we are particularly interested in the joint probability density function of all of the order statistics. It turns out that this density function
has a remarkably simple form.
Proof
Heuristic Proof
Probability Plots
A probability plot, also called a quantile-quantile plot or a Q-Q plot for short, is an informal, graphical test to determine if observed data come from a specified distribution. Thus, suppose that we observe real-valued data (x₁, x₂, …, xₙ) from a random sample of size n. We are interested in the question of whether the data could reasonably have come from a continuous distribution with distribution function F. First, we order the data from smallest to largest; this gives us the sequence of observed values of the order statistics: (x_(1), x_(2), …, x_(n)).
Note that we can view x_(i) as the sample quantile of order i/(n + 1). Of course, by definition, the distribution quantile of order i/(n + 1) is y_i = F⁻¹(i/(n + 1)). If the data really do come from the distribution, then we would expect the points ((x_(1), y₁), (x_(2), y₂), …, (x_(n), yₙ)) to be close to the diagonal line y = x; conversely, strong deviation from this line is evidence that the distribution did not produce the data. The plot of these points is referred to as a probability plot.
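The construction of the probability plot points can be sketched as follows (plain Python; the function name qq_points and the choice of the exponential test distribution are ours):

```python
import math
import random

def qq_points(data, quantile):
    """Pair the ith order statistic with the quantile of order i/(n+1)."""
    xs = sorted(data)
    n = len(xs)
    return [(x, quantile(i / (n + 1))) for i, x in enumerate(xs, start=1)]

# Exponential data, tested against the exponential quantile function
# F^{-1}(p) = -ln(1 - p) for rate 1; the points should hug the line y = x.
random.seed(1)
sample = [random.expovariate(1.0) for _ in range(200)]
points = qq_points(sample, lambda p: -math.log(1 - p))
print(points[0], points[-1])
```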
Usually however, we are not trying to see if the data come from a particular distribution, but rather from a parametric family of distributions
(such as the normal, uniform, or exponential families). We are usually forced into this situation because we don't know the parameters; indeed
the next step, after the probability plot, may be to estimate the parameters. Fortunately, the probability plot method has a simple extension for
any location-scale family of distributions. Thus, suppose that G is a given distribution function. Recall that the location-scale family associated with G has distribution function F(x) = G((x − a)/b) for x ∈ ℝ, where a ∈ ℝ is the location parameter and b ∈ (0, ∞) is the scale parameter. Recall also that for p ∈ (0, 1), if z_p = G⁻¹(p) denotes the quantile of order p for G and y_p = F⁻¹(p) the quantile of order p for F, then y_p = a + b z_p. It follows that if the probability plot constructed with distribution function F is nearly linear (and in particular, if it is
close to the diagonal line), then the probability plot constructed with distribution function G will be nearly linear. Thus, we can use the
distribution function G without having to know the location and scale parameters.
In the exercises below, you will explore probability plots for the normal, exponential, and uniform distributions. We will study a formal,
quantitative procedure, known as the chi-square goodness of fit test in the chapter on Hypothesis Testing.
Suppose that x is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of operation. A sample
of 30 components has five number summary (84, 102, 113, 120, 135) .
1. Classify x by type and level of measurement.
2. Find the range and interquartile range.
3. Find the five number summary, range, and interquartile range if the temperature is converted to degrees Celsius. The transformation is y = (5/9)(x − 32).
Answer
Suppose that x is the length (in inches) of a machined part in a manufacturing process. A sample of 50 parts has five number summary
(9.6, 9.8, 10.0, 10.1, 10.3).
1. Classify x by type and level of measurement.
2. Find the range and interquartile range.
3. Find the five number summary, range, and interquartile range if length is measured in centimeters. The transformation is y = 2.54x.
Answer
Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). For the first midterm exam,
the five number summary was (16, 52, 64, 72, 81) (out of a possible 100 points). Professor Moriarity thinks the grades are a bit low and is
considering various transformations for increasing the grades.
1. Find the range and interquartile range.
2. Suppose she adds 10 points to each grade. Find the five number summary, range, and interquartile range for the transformed grades.
3. Suppose she multiplies each grade by 1.2. Find the five number summary, range, and interquartile range for the transformed grades.
4. Suppose she uses the transformation w = 10√x, which curves the grades greatly at the low end and very little at the high end. Give whatever information you can about the five number summary of the transformed grades.
5. Determine whether the low score of 16 is an outlier.
Answer
Computational Exercises
All statistical software packages will compute order statistics and quantiles, draw stem-and-leaf plots and boxplots, and in general perform
the numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those with large data sets, the
use of statistical software is essential. On the other hand, there is some value in performing the computations by hand, with small, artificial
data sets, in order to master the concepts and definitions. In this subsection, do the computations and draw the graphs with minimal
technological aids.
Suppose that x is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data x = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2).
Suppose that a sample of size 12 from a discrete variable x has empirical density function given by f (−2) = 1/12 , f (−1) = 1/4,
f (0) = 1/3, f (1) = 1/6, f (2) = 1/6.
The stem and leaf plot below gives the grades for a 100-point test in a probability course with 38 students. The first digit is the stem and
the second digit is the leaf. Thus, the low score was 47 and the high score was 98. The scores in the 6 row are 60, 60, 62, 63, 65, 65, 67,
68.
4 7
5 0346
6 00235578
7 0112346678899
8 0367889
9 1368
Answer
App Exercises
In the histogram app, construct a distribution with at least 30 values of each of the types indicated below. Note the five number summary.
1. A uniform distribution.
2. A symmetric, unimodal distribution.
3. A unimodal distribution that is skewed right.
4. A unimodal distribution that is skewed left.
5. A symmetric bimodal distribution.
6. A u-shaped distribution.
In the error function app, start with a distribution and add additional points as follows. Note the effect on the five number summary:
1. Add a point below x_(1).
In the last problem, you may have noticed that when you add an additional point to the distribution, one or more of the five statistics does not
change. In general, quantiles can be relatively insensitive to changes in the data.
The Uniform Distribution
Recall that the standard uniform distribution is the uniform distribution on the interval [0, 1].
Suppose that X is a random sample of size n from the standard uniform distribution. For k ∈ {1, 2, …, n}, X_(k) has the beta distribution, with left parameter k and right parameter n − k + 1. The probability density function f_k is given by
f_k(x) = (n!/((k − 1)!(n − k)!)) x^{k−1} (1 − x)^{n−k},  0 ≤ x ≤ 1  (6.6.22)
Proof
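The Beta(k, n − k + 1) result implies E(X_(k)) = k/(n + 1), which is easy to check by simulation. A plain-Python sketch for n = 5:

```python
import random

# Simulation check that for a standard uniform sample of size n = 5, the kth
# order statistic has mean k/(n + 1), as the Beta(k, n - k + 1) result implies.
random.seed(5)
n, reps = 5, 20000
sums = [0.0] * n
for _ in range(reps):
    xs = sorted(random.random() for _ in range(n))
    for k in range(n):
        sums[k] += xs[k]
means = [s / reps for s in sums]
print([round(m, 3) for m in means])  # near [0.167, 0.333, 0.5, 0.667, 0.833]
```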
In the order statistic experiment, select the standard uniform distribution and n = 5. Vary k from 1 to 5 and note the shape of the probability density function of X_(k). For each value of k, run the simulation 1000 times and compare the empirical density function to the probability density function.
It's easy to extend the results for the standard uniform distribution to the general uniform distribution on an interval.
Suppose that X is a random sample of size n from the uniform distribution on the interval [a, a + h] where a ∈ ℝ and h ∈ (0, ∞). For k ∈ {1, 2, …, n}, X_(k) has the beta distribution with left parameter k, right parameter n − k + 1, location parameter a, and scale parameter h. In particular,
1. E(X_(k)) = a + h k/(n + 1)
2. var(X_(k)) = h² k(n − k + 1)/((n + 1)²(n + 2))
Proof
We return to the standard uniform distribution and consider the range of the random sample.
Suppose that X is a random sample of size n from the standard uniform distribution. The sample range R has the beta distribution with left parameter n − 1 and right parameter 2. The probability density function g is given by
g(r) = n(n − 1) r^{n−2} (1 − r),  0 ≤ r ≤ 1  (6.6.23)
Proof
Once again, it's easy to extend this result to a general uniform distribution.
Suppose that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the uniform distribution on [a, a + h] where a ∈ ℝ and h ∈ (0, ∞). The sample range R = X_(n) − X_(1) has the beta distribution with left parameter n − 1, right parameter 2, and scale parameter h. In particular,
1. E(R) = h (n − 1)/(n + 1)
2. var(R) = h² 2(n − 1)/((n + 1)²(n + 2))
Proof
The joint distribution of the order statistics for a sample from the uniform distribution is easy to get.
Suppose that (X₁, X₂, …, Xₙ) is a random sample of size n from the uniform distribution on the interval [a, a + h], where a ∈ ℝ and h ∈ (0, ∞). The joint probability density function of the order statistics (X_(1), X_(2), …, X_(n)) is g(x₁, x₂, …, xₙ) = n!/hⁿ for a ≤ x₁ < x₂ < ⋯ < xₙ ≤ a + h.
Proof
The Exponential Distribution
Recall that the exponential distribution with rate parameter λ > 0 has probability density function
f(x) = λ e^{−λx},  0 ≤ x < ∞  (6.6.25)
The exponential distribution is widely used to model failure times and other random times under certain ideal conditions. In particular, the
exponential distribution governs the times between arrivals in the Poisson process.
Suppose that X is a random sample of size n from the exponential distribution with rate parameter λ. The probability density function of the kth order statistic X_(k) is
f_k(x) = (n!/((k − 1)!(n − k)!)) λ (1 − e^{−λx})^{k−1} e^{−λ(n−k+1)x},  0 ≤ x < ∞  (6.6.26)
In particular, the minimum of the variables, X_(1), also has an exponential distribution, but with rate parameter nλ.
Proof
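The fact that the minimum has an exponential distribution with rate nλ can be checked by simulation: its mean should be close to 1/(nλ). A plain-Python sketch:

```python
import random
import statistics

# Simulation check that the minimum of n independent exponentials with rate
# lam is exponential with rate n * lam, so its mean is 1 / (n * lam).
random.seed(11)
lam, n, reps = 0.5, 4, 40000
mins = [min(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]
print(statistics.mean(mins))  # near 1 / (n * lam) = 0.5
```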
In the order statistic experiment, select the standard exponential distribution and n = 5. Vary k from 1 to 5 and note the shape of the probability density function of X_(k). For each value of k, run the simulation 1000 times and compare the empirical density function to the probability density function.
Suppose again that X is a random sample of size n from the exponential distribution with rate parameter λ . The sample range R has the
same distribution as the maximum of a random sample of size n − 1 from the exponential distribution. The probability density function
is
h(t) = (n − 1) λ (1 − e^(−λt))^(n−2) e^(−λt), 0 ≤ t < ∞ (6.6.27)
Proof
Suppose again that X is a random sample of size n from the exponential distribution with rate parameter λ. The joint probability density function g of the order statistics (X_(1), X_(2), …, X_(n)) is given by g(x_1, x_2, …, x_n) = n! λ^n e^(−λ(x_1 + x_2 + ⋯ + x_n)) for 0 ≤ x_1 ≤ x_2 ≤ ⋯ ≤ x_n < ∞.
Proof
Dice
Four fair dice are rolled. Find the probability density function of each of the order statistics.
Answer
In the dice experiment, select the order statistic and die distribution given in parts (a)–(d) below. Increase the number of dice from 1 to
20, noting the shape of the probability density function at each stage. Now with n = 4 , run the simulation 1000 times, and note the
apparent convergence of the relative frequency function to the probability density function.
1. Maximum score with fair dice.
2. Minimum score with fair dice.
3. Maximum score with ace-six flat dice.
4. Minimum score with ace-six flat dice.
Four fair dice are rolled. Find the joint probability density function of the four order statistics.
Answer
Four fair dice are rolled. Find the probability density function of the sample range.
Answer
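Since four dice give only 6^4 = 1296 equally likely outcomes, the densities in the last few exercises can be found by direct enumeration. The following sketch (illustrative code, not part of the text) tabulates the probability density functions of the order statistics and of the sample range, and checks the maximum against the formula P(X_(4) = x) = [x^4 − (x − 1)^4]/6^4.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

counts = [Counter() for _ in range(4)]   # counts[k] tracks the (k+1)st order statistic
range_counts = Counter()
for roll in product(range(1, 7), repeat=4):
    ordered = sorted(roll)
    for k in range(4):
        counts[k][ordered[k]] += 1
    range_counts[ordered[3] - ordered[0]] += 1

total = 6 ** 4
# Check the maximum against P(X_(4) = x) = [x^4 - (x-1)^4] / 6^4
for x in range(1, 7):
    assert Fraction(counts[3][x], total) == Fraction(x**4 - (x - 1)**4, 6**4)

print({x: Fraction(counts[3][x], total) for x in range(1, 7)})              # pmf of maximum
print({r: Fraction(range_counts[r], total) for r in sorted(range_counts)})  # pmf of range
```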
Probability Plot Simulations
In the probability plot experiment, set the sampling distribution to the normal distribution with mean 5 and standard deviation 2. Set the
sample size to n = 20 . For each of the following test distributions, run the experiment 50 times and note the geometry of the probability
plot:
1. Standard normal
2. Uniform on the interval [0, 1]
3. Exponential with parameter 1
In the probability plot experiment, set the sampling distribution to the uniform distribution on [4, 10]. Set the sample size to n = 20 . For
each of the following test distributions, run the experiment 50 times and note the geometry of the probability plot:
1. Standard normal
2. Uniform on the interval [0, 1]
3. Exponential with parameter 1
In the probability plot experiment, set the sampling distribution to the exponential distribution with parameter 3. Set the sample size to
n = 20 . For each of the following test distributions, run the experiment 50 times and note the geometry of the probability plot:
1. Standard normal
2. Uniform on the interval [0, 1]
3. Exponential with parameter 1
Consider the petal length and species variables in Fisher's iris data.
1. Classify the variables by type and level of measurement.
2. Compute the five number summary and draw the boxplot for petal length.
3. Compute the five number summary and draw the boxplot for petal length by species.
4. Draw the normal probability plot for petal length.
Answers
A stem and leaf plot of Michelson's velocity of light data is given below. In this example, the last digit (which is always 0) has been left
out, for convenience. Also, note that there are two sets of leaves for each stem, one corresponding to leaves from 0 to 4 (so actually from
00 to 40) and the other corresponding to leaves from 5 to 9 (so actually from 50 to 90). Thus, the minimum value is 620 and the numbers
in the second 7 row are 750, 760, 760, and so forth.
6 2
6 5
7 222444
7 566666788999
8 000001111111111223344444444
9 0011233444
9 55566667888
10 000
10 7
Consider the body weight, species, and gender variables in the Cicada data.
1. Classify the variables by type and level of measurement.
2. Compute the five number summary and draw the boxplot for body weight.
3. Compute the five number summary and draw the boxplot for body weight by species.
4. Compute the five number summary and draw the boxplot for body weight by gender.
Answer
This page titled 6.6: Order Statistics is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via
source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
6.7: Sample Correlation and Regression
Descriptive Theory
Recall the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that
we make on these objects. We select objects from the population and record the variables for the objects in the sample; these
become our data. Our first discussion is from a purely descriptive point of view. That is, we do not assume that the data are
generated by an underlying probability distribution. But as always, remember that the data themselves define a probability
distribution, namely the empirical distribution that assigns equal probability to each data point.
Suppose that x and y are real-valued variables for a population, and that ((x_1, y_1), (x_2, y_2), …, (x_n, y_n)) is an observed sample of size n from (x, y). We will let x = (x_1, x_2, …, x_n) denote the sample from x and y = (y_1, y_2, …, y_n) the sample from y. In this
section, we are interested in statistics that are measures of association between the x and y , and in finding the line (or other curve)
that best fits the data.
Recall that the sample means are
m(x) = (1/n) ∑_{i=1}^n x_i, m(y) = (1/n) ∑_{i=1}^n y_i (6.7.1)
Scatterplots
Often, the first step in exploratory data analysis is to draw a graph of the points; this is called a scatterplot and can give a visual sense of the statistical relationship between the variables.
Assuming that the data vectors are not constant, so that the standard deviations are positive, the sample correlation is defined
to be
r(x, y) = s(x, y)/[s(x) s(y)] (6.7.4)
6.7.1 https://stats.libretexts.org/@go/page/10184
Note that the sample covariance is an average of the products of the deviations of the x and y data from their means. Thus, the physical unit of the sample covariance is the product of the units of x and y. Correlation is a standardized version of covariance. In particular, correlation is dimensionless (has no physical units), since the covariance in the numerator and the product of the standard deviations in the denominator have the same units (the product of the units of x and y). Note also that covariance and correlation have the same sign: positive, negative, or zero. In the first case, the data x and y are said to be positively correlated; in the second case, x and y are said to be negatively correlated; and in the third case, x and y are said to be uncorrelated.
To see that the sample covariance is a measure of association, recall first that the point (m(x), m(y)) is a measure of the center of
the bivariate data. Indeed, if each point is the location of a unit mass, then (m(x), m(y)) is the center of mass as defined in
physics. Horizontal and vertical lines through this center point divide the plane into four quadrants. The product deviation [x_i − m(x)][y_i − m(y)] is positive in the first and third quadrants and negative in the second and fourth quadrants. After we study
linear regression below, we will have a much deeper sense of what covariance measures.
Recall that the data define an empirical bivariate distribution, namely the one that places probability 1/n at each point (x_i, y_i). The sample means are simply the expected values of this bivariate distribution, and except for a constant multiple (dividing by n − 1 rather than n), the sample variances are simply the variances of this bivariate distribution. Similarly, except for a constant multiple (again dividing by n − 1 rather than n), the sample covariance is the covariance of the bivariate distribution and the sample correlation is the correlation of the bivariate distribution. All of the following results in our discussion of descriptive statistics are actually special cases of more general results for probability distributions.
Properties of Covariance
The next few exercises establish some essential properties of sample covariance. As usual, bold symbols denote samples of a fixed
size n from the corresponding population variables (that is, vectors of length n ), while symbols in regular type denote real
numbers. Our first result is a formula for sample covariance that is sometimes better than the definition for computational purposes.
To state the result succinctly, let xy = (x_1 y_1, x_2 y_2, …, x_n y_n) denote the sample from the product variable xy.
Proof
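As a concrete illustration (a sketch with made-up data, not from the text), the computational formula via the product sample, s(x, y) = [n/(n − 1)][m(xy) − m(x)m(y)], can be checked against the definition of the sample covariance:

```python
# Made-up data, purely illustrative
x = [1.0, 3.0, 6.0, 2.0, 8.0]
y = [1.0, 3.0, 4.0, 1.0, 5.0]
n = len(x)

def m(v):
    """Sample mean."""
    return sum(v) / len(v)

mx, my = m(x), m(y)

# Definition: average of the product deviations, dividing by n - 1
s_def = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Computational formula via the product sample xy
xy = [xi * yi for xi, yi in zip(x, y)]
s_comp = (n / (n - 1)) * (m(xy) - mx * my)

print(s_def, s_comp)   # the two agree up to floating-point rounding
```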
The following theorem gives another formula for the sample covariance, one that does not require the computation of intermediate
statistics.
Proof
In light of the previous theorem, we can now see that the first computational formula and the second computational formula above
generalize the computational formulas for sample variance. Clearly, sample covariance is symmetric.
s(x, y) = s(y, x) .
Sample covariance is linear in the first argument with the second argument fixed.
If x, y , and z are data vectors from population variables x, y , and z , respectively, and if c is a constant, then
1. s(x + y, z) = s(x, z) + s(y, z)
2. s(cx, y) = cs(x, y)
Proof
By symmetry, sample covariance is also linear in the second argument with the first argument fixed, and hence is bilinear. The
general version of the bilinear property is given in the following theorem:
Suppose that x_i is a data vector from a population variable x_i for i ∈ {1, 2, …, k} and that y_j is a data vector from a population variable y_j for j ∈ {1, 2, …, l}. Suppose also that a_1, a_2, …, a_k and b_1, b_2, …, b_l are constants. Then
s(∑_{i=1}^k a_i x_i, ∑_{j=1}^l b_j y_j) = ∑_{i=1}^k ∑_{j=1}^l a_i b_j s(x_i, y_j) (6.7.21)
A special case of the bilinear property provides a nice way to compute the sample variance of a sum.
s^2(x + y) = s^2(x) + 2 s(x, y) + s^2(y).
Proof
The generalization of this result to sums of three or more vectors is completely straightforward: namely, the sample variance of a
sum is the sum of all of the pairwise sample covariances. Note that the sample variance of a sum can be greater than, less than, or
equal to the sum of the sample variances, depending on the sign and magnitude of the pure covariance term. In particular, if the
vectors are pairwise uncorrelated, then the variance of the sum is the sum of the variances.
Combining the result in the last exercise with the bilinear property, we see that covariance is unchanged if constants are added to
the data sets. That is, if c and d are constant vectors then s(x + c, y + d) = s(x, y) .
Properties of Correlation
A few simple properties of correlation are given next. Most of these follow easily from the corresponding properties of covariance.
First, recall that the standard scores of x_i and y_i are, respectively,
u_i = [x_i − m(x)]/s(x), v_i = [y_i − m(y)]/s(y) (6.7.24)
The standard scores from a data set are dimensionless quantities that have mean 0 and variance 1.
The correlation between x and y is the covariance of their standard scores u and v. That is, r(x, y) = s(u, v).
Proof
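This identity, along with the fact that standard scores have mean 0 and variance 1, can be checked with a short sketch (illustrative code with made-up data, not from the text):

```python
import math
import random

random.seed(3)
n = 30
x = [random.random() for _ in range(n)]
y = [xi + 0.5 * random.random() for xi in x]   # loosely related data

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance, dividing by n - 1."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def sd(v):
    return math.sqrt(cov(v, v))

r = cov(x, y) / (sd(x) * sd(y))     # sample correlation r(x, y)

# Standard scores: mean 0 and variance 1
u = [(xi - mean(x)) / sd(x) for xi in x]
v = [(yi - mean(y)) / sd(y) for yi in y]
print(mean(u), cov(u, u))    # essentially 0 and 1
print(r, cov(u, v))          # equal: r(x, y) = s(u, v)
```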
Correlation is symmetric.
r(x, y) = r(y, x) .
Unlike covariance, correlation is unaffected by multiplying one of the data sets by a positive constant (recall that this can always be thought of as a change of scale in the underlying variable). On the other hand, multiplying a data set by a negative constant changes the sign of the correlation.
If c ≠ 0 is a constant then
1. r(cx, y) = r(x, y) if c > 0
2. r(cx, y) = −r(x, y) if c < 0
Proof
Like covariance, correlation is unaffected by adding constants to the data sets. Adding a constant to a data set often corresponds to
a change of location.
The last couple of properties reinforce the fact that correlation is a standardized measure of association that is not affected by
changing the units of measurement. In the first Challenger data set, for example, the variables of interest are temperature at time of
launch (in degrees Fahrenheit) and O-ring erosion (in millimeters). The correlation between these variables is of critical
importance. If we were to measure temperature in degrees Celsius and O-ring erosion in inches, the correlation between the two
variables would be unchanged.
The most important properties of correlation arise from studying the line that best fits the data, our next topic.
Linear Regression
We are interested in finding the line y = a + bx that best fits the sample points ((x_1, y_1), (x_2, y_2), …, (x_n, y_n)). This is a basic
and important problem in many areas of mathematics, not just statistics. We think of x as the predictor variable and y as the
response variable. Thus, the term best means that we want to find the line (that is, find the coefficients a and b ) that minimizes the
average of the squared errors between the actual y values in our data and the predicted y values:
mse(a, b) = (1/(n − 1)) ∑_{i=1}^n [y_i − (a + b x_i)]^2 (6.7.29)
Note that the minimizing value of (a, b) would be the same if the function were simply the sum of the squared errors, or if we averaged by dividing by n rather than n − 1, or if we used the square root of any of these functions. Of course, the actual minimum value of the function would be different if we changed the function, but again, not the point (a, b) where the minimum occurs. Our particular choice of mse as the error function is best for statistical purposes. Finding (a, b) that minimize mse is a standard
problem in calculus.
The graph of mse is a paraboloid opening upward. The function mse is minimized when
b(x, y) = s(x, y)/s^2(x) (6.7.30)
a(x, y) = m(y) − b(x, y) m(x) = m(y) − [s(x, y)/s^2(x)] m(x) (6.7.31)
Proof
Of course, the optimal values of a and b are statistics, that is, functions of the data. Thus the sample regression line is
y = m(y) + [s(x, y)/s^2(x)] [x − m(x)] (6.7.35)
Proof
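The formulas for the slope and intercept are straightforward to apply. The following sketch (made-up data, not part of the text) computes them and verifies that the regression line passes through the point of means (m(x), m(y)):

```python
# Made-up data, purely illustrative
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)   # s(x, y)
sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)                      # s^2(x)

b = sxy / sxx        # slope b(x, y) = s(x, y) / s^2(x)
a = my - b * mx      # intercept a(x, y) = m(y) - b(x, y) m(x)
print(a, b)

# The regression line passes through the point of means (m(x), m(y))
assert abs((a + b * mx) - my) < 1e-9
```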
Thus, we now see in a deeper way that the sample covariance and correlation measure the degree of linearity of the sample points.
Recall from our discussion of measures of center and spread that the constant a that minimizes
mse(a) = (1/(n − 1)) ∑_{i=1}^n (y_i − a)^2 (6.7.37)
is the sample mean m(y), and the minimum value of the mean square error is the sample variance s^2(y). Thus, the difference between this value of the mean square error and the one above, namely s^2(y) r^2(x, y), is the reduction in the variability of the y data when the linear term in x is added to the predictor. The fractional reduction is r^2(x, y), and hence this statistic is called the (sample) coefficient of determination. Note that if the data vectors x and y are uncorrelated, then x has no value as a predictor of y; the regression line in this case is the horizontal line y = m(y) and the mean square error is s^2(y).
The sample regression line with predictor variable x and response variable y is not the same as the sample regression line with
predictor variable y and response variable x, except in the extreme case r(x, y) = ±1 where the sample points all lie on a line.
Residuals
The difference between the actual y value of a data point and the value predicted by the regression line is called the residual of that
data point. Thus, the residual corresponding to (x_i, y_i) is d_i = y_i − ŷ_i, where ŷ_i is the value of the regression line at x_i:
ŷ_i = m(y) + [s(x, y)/s^2(x)] [x_i − m(x)] (6.7.38)
Note that the predicted value ŷ_i and the residual d_i are statistics, that is, functions of the data (x, y), but we are suppressing this in the notation.
Various plots of the residuals can help one understand the relationship between the x and y data. Some of the more common are
given in the following definition:
Residual plots
1. A plot of (i, d_i) for i ∈ {1, 2, …, n}, that is, a plot of indices versus residuals.
2. A plot of (x_i, d_i) for i ∈ {1, 2, …, n}, that is, a plot of x values versus residuals.
3. A plot of (d_i, y_i) for i ∈ {1, 2, …, n}, that is, a plot of residuals versus actual y values.
4. A plot of (d_i, ŷ_i) for i ∈ {1, 2, …, n}, that is, a plot of residuals versus predicted y values.
Sums of Squares
For our next discussion, we will re-interpret the minimum mean square error formula above. Here are the new definitions:
Sums of squares
1. sst(y) = ∑_{i=1}^n [y_i − m(y)]^2 is the total sum of squares.
2. ssr(x, y) = ∑_{i=1}^n [ŷ_i − m(y)]^2 is the regression sum of squares.
3. sse(x, y) = ∑_{i=1}^n (y_i − ŷ_i)^2 is the error sum of squares.
Note that sst(y) is simply n − 1 times the variance s^2(y) and is the total of the squares of the deviations of the y values from the mean of the y values. Similarly, sse(x, y) is simply n − 1 times the minimum mean square error given above. Of course, sst(y) has n − 1 degrees of freedom, while sse(x, y) has n − 2 degrees of freedom and ssr(x, y) a single degree of freedom. The total sum of squares is the sum of the regression sum of squares and the error sum of squares:
sst(y) = ssr(x, y) + sse(x, y)
Note that r^2(x, y) = ssr(x, y)/sst(y), so once again, r^2(x, y) is the coefficient of determination—the proportion of the variability in the y data explained by the x data. We can average sse by dividing by its degrees of freedom and then take the square root to obtain a standard error:
The standard error of estimate is
se(x, y) = √[sse(x, y)/(n − 2)] (6.7.41)
This really is a standard error in the same sense as a standard deviation. It's an average of the errors of sorts, but in the root mean
square sense.
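The partition of the total sum of squares, the coefficient of determination, and the standard error of estimate can all be illustrated with a small sketch (made-up data, not from the text):

```python
import math

# Made-up, nearly linear data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
slope = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
yhat = [my + slope * (a - mx) for a in x]   # predicted values on the regression line

sst = sum((c - my) ** 2 for c in y)               # total sum of squares
ssr = sum((p - my) ** 2 for p in yhat)            # regression sum of squares
sse = sum((c - p) ** 2 for c, p in zip(y, yhat))  # error sum of squares

print(sst, ssr + sse)            # partition: sst = ssr + sse
print(ssr / sst)                 # coefficient of determination r^2
print(math.sqrt(sse / (n - 2)))  # standard error of estimate
```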
Finally, it's important to note that linear regression is a much more powerful idea than might first appear, and in fact the term linear
can be a bit misleading. By applying various transformations to y or x or both, we can fit a variety of two-parameter curves to the
given data ((x_1, y_1), (x_2, y_2), …, (x_n, y_n)). Some of the most common transformations are explored in the exercises below.
Probability Theory
We continue our discussion of sample covariance, correlation, and regression but now from the more interesting point of view that
the variables are random. Specifically, suppose that we have a basic random experiment, and that X and Y are real-valued random
variables for the experiment. Equivalently, (X, Y) is a random vector taking values in R^2. Let μ = E(X) and ν = E(Y) denote the distribution means, σ^2 = var(X) and τ^2 = var(Y) the distribution variances, and let δ = cov(X, Y) denote the distribution covariance. Suppose now that ((X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n)) is a random sample of size n from the distribution of (X, Y). The statistics discussed in the previous section are well defined, but now they are all
random variables. We use the notation established previously, except that we use our usual convention of denoting random
variables with capital letters. Of course, the deterministic properties and relations established above still hold. Note that
X = (X_1, X_2, …, X_n) is a random sample of size n from the distribution of X and Y = (Y_1, Y_2, …, Y_n) is a random sample of
size n from the distribution of Y . The main purpose of this subsection is to study the relationship between various statistics from
X and Y , and to study statistics that are natural estimators of the distribution covariance and correlation.
From the sections on the law of large numbers and the central limit theorem, we know a great deal about the distributions of M (X)
and M (Y ) individually. But we need to know more about the joint distribution.
Note that the correlation between the sample means is the same as the correlation of the underlying sampling distribution. In
particular, the correlation does not depend on the sample size n .
The Sample Variances
Recall that special versions of the sample variances, in the unlikely event that the distribution means are known, are
W^2(X) = (1/n) ∑_{i=1}^n (X_i − μ)^2, W^2(Y) = (1/n) ∑_{i=1}^n (Y_i − ν)^2 (6.7.46)
Once again, we have studied these statistics individually, so our emphasis now is on the joint distribution.
Proof
Note that the correlation does not depend on the sample size n. Next, recall that the standard versions of the sample variances are
S^2(X) = (1/(n − 1)) ∑_{i=1}^n [X_i − M(X)]^2, S^2(Y) = (1/(n − 1)) ∑_{i=1}^n [Y_i − M(Y)]^2 (6.7.49)
2. cor[S^2(X), S^2(Y)] = [(n − 1)(δ_2 − σ^2 τ^2) + 2δ^2] / √{[(n − 1)σ_4 − (n − 3)σ^4][(n − 1)τ_4 − (n − 3)τ^4]}
Proof
Asymptotically, the correlation between the sample variances is the same as the correlation between the special sample variances
given above:
cor[S^2(X), S^2(Y)] → (δ_2 − σ^2 τ^2)/√[(σ_4 − σ^4)(τ_4 − τ^4)] as n → ∞ (6.7.52)
Sample Covariance
Suppose first that the distribution means μ and ν are known. As noted earlier, this is almost always an unrealistic assumption, but is still a good place to start because the analysis is very simple and the results we obtain will be useful below. A natural estimator of the distribution covariance δ = cov(X, Y) in this case is the special sample covariance
W(X, Y) = (1/n) ∑_{i=1}^n (X_i − μ)(Y_i − ν) (6.7.53)
Note that the special sample covariance generalizes the special sample variance: W(X, X) = W^2(X).
W(X, Y) is the sample mean for a random sample of size n from the distribution of (X − μ)(Y − ν) and satisfies the following properties:
1. E[W(X, Y)] = δ
2. var[W(X, Y)] = (1/n)(δ_2 − δ^2)
As an estimator of δ , part (a) means that W (X, Y ) is unbiased and part (b) means that W (X, Y ) is consistent.
Consider now the more realistic assumption that the distribution means μ and ν are unknown. A natural approach in this case is to average [X_i − M(X)][Y_i − M(Y)] over i ∈ {1, 2, …, n}. But rather than dividing by n in our average, we should divide by whatever constant gives an unbiased estimator of δ. As shown in the next theorem, this constant turns out to be n − 1, leading to the standard sample covariance:
S(X, Y) = (1/(n − 1)) ∑_{i=1}^n [X_i − M(X)][Y_i − M(Y)] (6.7.55)
E[S(X, Y )] = δ .
Proof
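Unbiasedness can be illustrated by simulation. In the sketch below (not part of the text), Y = X + Z with X and Z independent standard normal variables, so δ = cov(X, Y) = var(X) = 1; the average of many realizations of S(X, Y) should be close to 1.

```python
import random

random.seed(4)
N, n = 20_000, 10   # replications and sample size (arbitrary choices)

estimates = []
for _ in range(N):
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [x + random.gauss(0, 1) for x in xs]   # cov(X, Y) = 1 by construction
    mx, my = sum(xs) / n, sum(ys) / n
    s = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)
    estimates.append(s)

print(sum(estimates) / N)   # close to delta = 1
```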
Proof
Since the sample correlation R(X, Y ) is a nonlinear function of the sample covariance and sample standard deviations, it will not
in general be an unbiased estimator of the distribution correlation ρ. In most cases, it would be difficult to even compute the mean
and variance of R(X, Y ). Nonetheless, we can show convergence of the sample correlation to the distribution correlation.
Our next theorem gives a formula for the variance of the sample covariance, not to be confused with the covariance of the sample variances given above!
Proof
It's not surprising that the variance of the standard sample covariance (where we don't know the distribution means) is greater than
the variance of the special sample covariance (where we do know the distribution means).
var[S(X, Y )] → 0 as n → ∞ . Thus, the sample covariance is a consistent estimator of the distribution covariance.
Regression
In our first discussion above, we studied regression from a deterministic, descriptive point of view. The results obtained applied
only to the sample. Statistically more interesting and deeper questions arise when the data come from a random experiment, and we
try to draw inferences about the underlying distribution from the sample regression. There are two models that commonly arise.
One is where the response variable is random, but the predictor variable is deterministic. The other is the model we consider here,
where the predictor variable and the response variable are both random, so that the data form a random sample from a bivariate
distribution.
Thus, suppose again that we have a basic random vector (X, Y ) for an experiment. Recall that in the section on (distribution)
correlation and regression, we showed that the best linear predictor of Y given X, in the sense of minimizing mean square error, is
the random variable
L(Y ∣ X) = E(Y) + [cov(X, Y)/var(X)] [X − E(X)] = ν + (δ/σ^2)(X − μ) (6.7.64)
Figure 6.7.5 : The distribution regression line
Of course, in real applications, we are unlikely to know the distribution parameters μ, ν, σ^2, and δ. If we want to estimate the distribution regression line, a natural approach would be to consider a random sample ((X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n)) from the
distribution of (X, Y ) and compute the sample regression line. Of course, the results are exactly the same as in the discussion
above, except that all of the relevant quantities are random variables. The sample regression line is
y = M(Y) + [S(X, Y)/S^2(X)] [x − M(X)] (6.7.66)
The coefficients of the sample regression line converge to the coefficients of the distribution regression line with probability 1.
1. S(X, Y)/S^2(X) → δ/σ^2 as n → ∞
2. M(Y) − [S(X, Y)/S^2(X)] M(X) → ν − (δ/σ^2) μ as n → ∞
Proof
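This consistency can be seen in a short simulation sketch (not part of the text). With X standard normal and Y = 2 + 3X + Z for independent standard normal noise Z, the distribution regression line has slope δ/σ^2 = 3 and intercept ν − (δ/σ^2)μ = 2:

```python
import random

random.seed(5)

# Model: X ~ N(0, 1) and Y = 2 + 3X + Z with Z ~ N(0, 1) independent,
# so the distribution slope is delta/sigma^2 = 3 and the intercept is 2.
def sample_coefficients(n):
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [2 + 3 * x + random.gauss(0, 1) for x in xs]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

for n in (10, 100, 10_000):
    print(n, sample_coefficients(n))   # coefficients approach (2, 3)
```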
Of course, if the linear relationship between X and Y is not strong, as measured by the sample correlation, then transformation
applied to one or both variables may help. Again, some typical transformations are explored in the exercises below.
Exercises
Basic Properties
Suppose that x and y are population variables, and x and y samples of size n from x and y respectively. Suppose also that m(x) = 3, m(y) = −1, s^2(x) = 4, s^2(y) = 9, and s(x, y) = 5. Find each of the following:
1. r(x, y)
2. m(2x + 3y)
3. s^2(2x + 3y)
4. s(2x + 3y − 1, 4x + 2y − 3)
Suppose that x is the temperature (in degrees Fahrenheit) and y the resistance (in ohms) for a certain type of electronic component after 10 hours of operation. For a sample of 30 components, m(x) = 113, s(x) = 18, m(y) = 100, s(y) = 10, and r(x, y) = 0.6. Find the correlation between temperature and resistance if temperature is measured in degrees Celsius. Recall that the conversion is (5/9)(x − 32).
Suppose that x is the length and y the width (in inches) of a leaf in a certain type of plant. For a sample of 50 leaves
m(x) = 10 , s(x) = 2 , m(y) = 4 , s(y) = 1 , and r(x, y) = 0.8.
Click in the interactive scatterplot, in various places, and watch how the means, standard deviations, correlation, and regression
line change.
Click in the interactive scatterplot to define 20 points and try to come as close as possible to each of the following sample
correlations:
1. 0
2. 0.5
3. −0.5
4. 0.7
5. −0.7
6. 0.9
7. −0.9.
Click in the interactive scatterplot to define 20 points. Try to generate a scatterplot in which the regression line has
1. slope 1, intercept 1
2. slope 3, intercept 0
3. slope −2, intercept 1
Simulation Exercises
Run the bivariate uniform experiment 2000 times in each of the following cases. Compare the sample means to the distribution
means, the sample standard deviations to the distribution standard deviations, the sample correlation to the distribution
correlation, and the sample regression line to the distribution regression line.
1. The uniform distribution on the square
2. The uniform distribution on the triangle.
3. The uniform distribution on the circle.
Run the bivariate normal experiment 2000 times for various values of the distribution standard deviations and the distribution
correlation. Compare the sample means to the distribution means, the sample standard deviations to the distribution standard
deviations, the sample correlation to the distribution correlation, and the sample regression line to the distribution regression
line.
Transformations
3. Hence, to fit this curve to sample data, simply apply the standard regression procedure to the data from the variables x^2 and y.
Consider the curve y = 1/(a + bx).
1. Sketch the graph for some representative values of a and b.
2. Note that 1/y is a linear function of x, with intercept a and slope b.
3. Hence, to fit this curve to our sample data, simply apply the standard regression procedure to the data from the variables x and 1/y.
Consider the curve y = x/(a + bx).
1. Sketch the graph for some representative values of a and b.
2. Note that 1/y is a linear function of 1/x, with intercept b and slope a.
3. Hence, to fit this curve to sample data, simply apply the standard regression procedure to the data from the variables 1/x and 1/y.
4. Note again that the names of the intercept and slope are reversed from the standard formulas.
Computational Exercises
All statistical software packages will perform regression analysis. In addition to the regression line, most packages will typically
report the coefficient of determination r (x, y), the sums of squares sst(y) , ssr(x, y), sse(x, y), and the standard error of estimate
2
se(x, y). Most packages will also draw the scatterplot, with the regression line superimposed, and will draw the various graphs of
residuals discussed above. Many packages also provide easy ways to transform the data. Thus, there is very little reason to perform
the computations by hand, except with a small data set to master the definitions and formulas. In the following problem, do the
computations and draw the graphs with minimal technological aids.
Suppose that x is the number of math courses completed and y the number of science courses completed for a student at Enormous State University (ESU). A sample of 10 ESU students gives the following data:
((1, 1), (3, 3), (6, 4), (2, 1), (8, 5), (2, 2), (4, 3), (6, 4), (4, 3), (4, 4))
The following two exercises should help you review some of the probability topics in this section.
Suppose that (X, Y) has a continuous distribution with probability density function f(x, y) = 15 x^2 y for 0 ≤ x ≤ y ≤ 1. Find each of the following:
1. μ = E(X) and ν = E(Y)
2. σ^2 = var(X) and τ^2 = var(Y)
Consider the height variables in Pearson's height data.
1. Classify the variables by type and level of measurement.
2. Compute the correlation coefficient and the coefficient of determination
3. Compute the least squares regression line, with the height of the father as the predictor variable and the height of the son as
the response variable.
4. Draw the scatterplot and the regression line together.
5. Predict the height of a son whose father is 68 inches tall.
6. Compute the regression line if the heights are converted to centimeters (there are 2.54 centimeters per inch).
Answer
Consider the petal length, petal width, and species variables in Fisher's iris data.
1. Classify the variables by type and level of measurement.
2. Compute the correlation between petal length and petal width.
3. Compute the correlation between petal length and petal width by species.
Answer
Consider the number of candies and net weight variables in the M&M data.
1. Classify the variable by type and level of measurement.
2. Compute the correlation coefficient and the coefficient of determination.
3. Compute the least squares regression line with number of candies as the predictor variable and net weight as the response
variable.
4. Draw the scatterplot and the regression line in part (b) together.
5. Predict the net weight of a bag of M&Ms with 56 candies.
6. Naively, one might expect a much stronger correlation between the number of candies and the net weight in a bag of
M&Ms. What is another source of variability in net weight?
Answer
Consider the response rate and total SAT score variables in the SAT by state data set.
1. Classify the variables by type and level of measurement.
2. Compute the correlation coefficient and the coefficient of determination.
3. Compute the least squares regression line with response rate as the predictor variable and SAT score as the response
variable.
4. Draw the scatterplot and regression line together.
5. Give a possible explanation for the negative correlation.
Answer
Consider the verbal and math SAT scores (for all students) in the SAT by year data set.
1. Classify the variables by type and level of measurement.
2. Compute the correlation coefficient and the coefficient of determination.
3. Compute the least squares regression line.
4. Draw the scatterplot and regression line together.
Answer
Consider the temperature and erosion variables in the first data set in the Challenger data.
1. Classify the variables by type and level of measurement.
2. Compute the correlation coefficient and the coefficient of determination.
3. Compute the least squares regression line.
4. Draw the scatter plot and the regression line together.
5. Predict the O-ring erosion with a temperature of 31° F.
6. Is the prediction in part 5 meaningful? Explain.
7. Find the regression line if temperature is converted to degrees Celsius. Recall that the conversion is (5/9)(x − 32).
Answer
This page titled 6.7: Sample Correlation and Regression is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
6.7.15 https://stats.libretexts.org/@go/page/10184
6.8: Special Properties of Normal Samples
Random samples from normal distributions are the most important special cases of the topics in this chapter. As we will see, many of the results simplify significantly when the underlying sampling distribution is normal. In addition, we will derive the distributions of a number of random variables constructed from normal samples that are of fundamental importance in inferential statistics.
Suppose that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞). Recall that the term random sample means that X is a sequence of independent, identically distributed random variables. Recall also that the normal distribution has probability density function

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in \mathbb{R} \qquad (6.8.1)

In the notation that we have used elsewhere in this chapter, σ₃ = E[(X − μ)³] = 0 (equivalently, the skewness of the normal distribution is 0) and σ₄ = E[(X − μ)⁴] = 3σ⁴ (equivalently, the kurtosis of the normal distribution is 3). Since the sample (and in particular the sample size n) is fixed in this subsection, it will be suppressed in the notation.
The sample mean M has the normal distribution with mean μ and variance σ²/n.
Proof
Of course, by the central limit theorem, the distribution of M is approximately normal, if n is large, even if the underlying sampling distribution is not normal. The standard score of M is given as follows:

Z = \frac{M - \mu}{\sigma/\sqrt{n}} \qquad (6.8.3)
The standard score Z associated with the sample mean M plays a critical role in constructing interval estimates and hypothesis
tests for the distribution mean μ when the distribution standard deviation σ is known. The random variable Z will also appear in
several derivations in this section.
Recall that the chi-square distribution with k ∈ N₊ degrees of freedom has probability density function

f(x) = \frac{1}{\Gamma(k/2)\, 2^{k/2}} x^{k/2 - 1} e^{-x/2}, \quad 0 < x < \infty \qquad (6.8.4)

and has mean k and variance 2k. The moment generating function is

G(t) = \frac{1}{(1 - 2t)^{k/2}}, \quad -\infty < t < \frac{1}{2} \qquad (6.8.5)
The most important result to remember is that the chi-square distribution with k degrees of freedom governs \sum_{i=1}^k Z_i^2, where (Z₁, Z₂, …, Z_k) is a sequence of independent, standard normal random variables.
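As a quick numerical sketch of this result (the sample size and seed here are arbitrary choices, not from the text), we can simulate sums of squared standard normal variables and compare the empirical mean and variance with the theoretical values k and 2k:

```python
import random
import statistics

def chi_square_draw(k, rng):
    # Sum of squares of k independent standard normal variables:
    # chi-square distributed with k degrees of freedom.
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

rng = random.Random(42)
k = 5
draws = [chi_square_draw(k, rng) for _ in range(20000)]

empirical_mean = statistics.fmean(draws)    # theory: k = 5
empirical_var = statistics.variance(draws)  # theory: 2k = 10
```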
Recall that if μ is known, a natural estimator of σ² is the special version of the sample variance

W^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 \qquad (6.8.6)
Although the assumption that μ is known is almost always artificial, W² is very easy to analyze and it will be used in some of the derivations below. Our first result is the distribution of a simple multiple of W². Let

U = \frac{n}{\sigma^2} W^2 \qquad (6.8.7)

Then U has the chi-square distribution with n degrees of freedom.
The variable U associated with the statistic W² plays a critical role in constructing interval estimates and hypothesis tests for the distribution standard deviation σ when the distribution mean μ is known (although again, this assumption is usually not realistic).

1. E(W²) = σ²
2. var(W²) = 2σ⁴/n
Proof
As an estimator of σ², part (a) means that W² is unbiased and part (b) means that W² is consistent. Of course, these moment results are special cases of the general results obtained in the section on Sample Variance. In that section, we also showed that M and W² are uncorrelated if the underlying sampling distribution has skewness 0 (σ₃ = 0), as is the case here.
Recall now that the standard version of the sample variance is the statistic
S^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - M)^2 \qquad (6.8.9)
The sample variance S² is the usual estimator of σ² when μ is unknown (which is usually the case). We showed earlier that in general, the sample mean M and the sample variance S² are uncorrelated if the underlying sampling distribution has skewness 0 (σ₃ = 0). It turns out that if the sampling distribution is normal, these variables are in fact independent, a very important and useful property, and at first blush, a very surprising result since S² appears to depend explicitly on M.
Proof
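A simulation sketch of this independence (parameter values and seed are arbitrary): for normal samples, the empirical covariance of M and S² should be near 0. Note that zero covariance is implied by independence but does not prove it; this only checks the necessary condition.

```python
import random
import statistics

rng = random.Random(11)
n, reps = 5, 30000

ms, s2s = [], []
for _ in range(reps):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ms.append(statistics.fmean(x))      # sample mean M
    s2s.append(statistics.variance(x))  # sample variance S^2

mean_m = statistics.fmean(ms)
mean_s2 = statistics.fmean(s2s)
# empirical covariance of M and S^2; independence implies this is near 0
cov_m_s2 = sum((a - mean_m) * (b - mean_s2) for a, b in zip(ms, s2s)) / (reps - 1)
```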
We can now determine the distribution of a simple multiple of the sample variance S². Let

V = \frac{n - 1}{\sigma^2} S^2 \qquad (6.8.11)

Then V has the chi-square distribution with n − 1 degrees of freedom.
The variable V associated with the statistic S² plays a critical role in constructing interval estimates and hypothesis tests for the distribution standard deviation σ when the distribution mean μ is unknown (almost always the case).

1. E(S²) = σ²
2. var(S²) = 2σ⁴/(n − 1)
Proof
6.8.2 https://stats.libretexts.org/@go/page/10185
As before, these moment results are special cases of the general results obtained in the section on Sample Variance. Again, as an estimator of σ², part (a) means that S² is unbiased, and part (b) means that S² is consistent. Note also that var(S²) is larger than var(W²), by a factor of n/(n − 1).
In the special distribution simulator, select the chi-square distribution. Vary the degree of freedom parameter and note the shape and location of the probability density function and the mean ± standard deviation bar. For selected values of the parameter, run the experiment 1000 times and compare the empirical density function and moments to the true probability density function and moments.
The covariance and correlation between the special sample variance and the standard sample variance are

1. cov(W², S²) = 2σ⁴/n
2. cor(W², S²) = \sqrt{(n - 1)/n}
Proof
Note that the correlation does not depend on the parameters μ and σ, and converges to 1 as n → ∞.
The T Variable
Recall that the Student t distribution with k ∈ N₊ degrees of freedom has probability density function

f(t) = C_k \left(1 + \frac{t^2}{k}\right)^{-(k+1)/2}, \quad t \in \mathbb{R} \qquad (6.8.15)

where C_k is the appropriate normalizing constant. The distribution has mean 0 if k > 1 and variance k/(k − 2) if k > 2. In this subsection, the main point to remember is that the t distribution with k degrees of freedom is the distribution of

\frac{Z}{\sqrt{V/k}} \qquad (6.8.16)
where Z has the standard normal distribution; V has the chi-square distribution with k degrees of freedom; and Z and V are
independent. Our goal is to derive the distribution of
T = \frac{M - \mu}{S/\sqrt{n}} \qquad (6.8.17)
Note that T is similar to the standard score Z associated with M , but with the sample standard deviation S replacing the
distribution standard deviation σ. The variable T plays a critical role in constructing interval estimates and hypothesis tests for the
distribution mean μ when the distribution standard deviation σ is unknown.
As usual, let Z denote the standard score associated with the sample mean M , and let V denote the chi-square variable
associated with the sample variance S². Then

T = \frac{Z}{\sqrt{V/(n - 1)}} \qquad (6.8.18)
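As a simulation sketch (the parameter values and seed here are hypothetical), we can generate T for small normal samples and check its empirical moments against the t distribution with n − 1 degrees of freedom, whose mean is 0 and whose variance is (n − 1)/(n − 3):

```python
import math
import random
import statistics

rng = random.Random(7)
mu, sigma, n = 10.0, 2.0, 6   # hypothetical parameters; T has n - 1 = 5 df

def t_statistic():
    x = [rng.gauss(mu, sigma) for _ in range(n)]
    m = statistics.fmean(x)
    s = statistics.stdev(x)   # sample standard deviation (divisor n - 1)
    return (m - mu) / (s / math.sqrt(n))

draws = [t_statistic() for _ in range(20000)]
emp_mean = statistics.fmean(draws)    # theory: 0
emp_var = statistics.variance(draws)  # theory: (n - 1)/(n - 3) = 5/3
```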
In the special distribution simulator, select the t distribution. Vary the degree of freedom parameter and note the shape and
location of the probability density function and the mean±standard deviation bar. For selected values of the parameters, run the
experiment 1000 times and compare the empirical density function and moments to the distribution density function and
moments.
Suppose now that X = (X₁, X₂, …, X_m) is a random sample of size m from the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞), and that Y = (Y₁, Y₂, …, Yₙ) is a random sample of size n from the normal distribution with mean ν ∈ R and standard deviation τ ∈ (0, ∞). Finally, suppose that X and Y are independent. Of course, all of the results above in the one
sample model apply to X and Y separately, but now we are interested in statistics that are helpful in inferential procedures that
compare the two normal distributions. We will use the basic notation established above, but we will indicate the dependence on the
sample.
The two-sample model (or more generally the multi-sample model) occurs naturally when a basic variable in the statistical experiment is filtered according to one or more other variables (often nominal variables). For example, in the cicada data, the weights of the male cicadas and the weights of the female cicadas may fit observations from the two-sample normal model. The basic variable weight is filtered by the variable gender. If weight is filtered by gender and species, we might have observations from the 6-sample normal model.
The standard score

Z = \frac{[M(Y) - M(X)] - (\nu - \mu)}{\sqrt{\sigma^2/m + \tau^2/n}}

has the standard normal distribution. This standard score plays a fundamental role in constructing interval estimates and hypothesis tests for the difference μ − ν when the distribution standard deviations σ and τ are known.
Recall that the F distribution with j ∈ N₊ degrees of freedom in the numerator and k ∈ N₊ degrees of freedom in the denominator is the distribution of

\frac{U/j}{V/k}

where U has the chi-square distribution with j degrees of freedom; V has the chi-square distribution with k degrees of freedom; and U and V are independent. The F distribution is named in honor of Ronald Fisher and has probability density function

f(x) = C_{j,k} \frac{x^{(j-2)/2}}{[1 + (j/k)x]^{(j+k)/2}}, \quad 0 < x < \infty \qquad (6.8.21)

where C_{j,k} is the appropriate normalizing constant. The mean is \frac{k}{k - 2} if k > 2, and the variance is 2\left(\frac{k}{k - 2}\right)^2 \frac{j + k - 2}{j(k - 4)} if k > 4.
The random variable given below has the F distribution with m degrees of freedom in the numerator and n degrees of
freedom in the denominator:
\frac{W^2(X)/\sigma^2}{W^2(Y)/\tau^2} \qquad (6.8.22)
Proof
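A brief simulation sketch of this ratio (unit standard deviations, arbitrary sample sizes, and a fixed seed are assumed): with σ = τ = 1, the ratio W²(X)/W²(Y) should follow the F distribution with m and n degrees of freedom, whose mean is n/(n − 2):

```python
import random
import statistics

rng = random.Random(5)

def w2(sample, mu):
    # special sample variance with known mean: W^2 = (1/n) sum (x_i - mu)^2
    return sum((x - mu) ** 2 for x in sample) / len(sample)

m, n, reps = 4, 12, 20000
draws = []
for _ in range(reps):
    x = [rng.gauss(0.0, 1.0) for _ in range(m)]  # sigma = 1
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # tau = 1
    draws.append(w2(x, 0.0) / w2(y, 0.0))

emp_mean = statistics.fmean(draws)  # theory: n/(n - 2) = 1.2
```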
The random variable given below has the F distribution with m − 1 degrees of freedom in the numerator and n − 1 degrees of
freedom in the denominator:
\frac{S^2(X)/\sigma^2}{S^2(Y)/\tau^2} \qquad (6.8.23)
Proof
These variables are useful for constructing interval estimates and hypothesis tests of the ratio of the standard deviations σ/τ. The choice of the F variable depends on whether the means μ and ν are known or unknown. Usually, of course, the means are unknown and so the second statistic above is used.
In the special distribution simulator, select the F distribution. Vary the degrees of freedom parameters and note the shape and
location of the probability density function and the mean±standard deviation bar. For selected values of the parameters, run the
experiment 1000 times and compare the empirical density function and moments to the true distribution density function and
moments.
The T Variable
Our final construction in the two-sample normal model will result in a variable that has the Student t distribution. This variable plays a fundamental role in constructing interval estimates and hypothesis tests for the difference μ − ν when the distribution standard deviations σ and τ are unknown. The construction requires the additional assumption that the distribution standard deviations are the same: σ = τ. This assumption is reasonable if there is an inherent variability in the measurement variables that does not change even when different treatments are applied to the objects in the population.
Note first that the standard score associated with the difference in the sample means becomes
Z = \frac{[M(Y) - M(X)] - (\nu - \mu)}{\sigma\sqrt{1/m + 1/n}} \qquad (6.8.24)
To construct our desired variable, we first need an estimate of σ². A natural approach is to consider a weighted average of the sample variances S²(X) and S²(Y), with the degrees of freedom as the weight factors (this is called the pooled estimate of σ²). Thus, let

S^2(X, Y) = \frac{(m - 1)S^2(X) + (n - 1)S^2(Y)}{m + n - 2} \qquad (6.8.25)
The random variable V given below has the chi-square distribution with m + n − 2 degrees of freedom:
V = \frac{(m - 1)S^2(X) + (n - 1)S^2(Y)}{\sigma^2} \qquad (6.8.26)
Proof
The random variable T given below has the Student t distribution with m + n − 2 degrees of freedom.

T = \frac{[M(Y) - M(X)] - (\nu - \mu)}{S(X, Y)\sqrt{1/m + 1/n}} \qquad (6.8.27)
Proof
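A small worked sketch of the pooled estimate in (6.8.25), using made-up samples (the numbers are purely illustrative):

```python
import statistics

# hypothetical samples; any two small data sets work the same way
x = [4.1, 5.0, 3.8, 4.6]       # sample of size m = 4
y = [6.2, 5.9, 6.8, 6.1, 6.5]  # sample of size n = 5

m, n = len(x), len(y)
sx2 = statistics.variance(x)   # S^2(X)
sy2 = statistics.variance(y)   # S^2(Y)

# pooled estimate of the common variance, per (6.8.25):
# weighted average with degrees of freedom as weights
pooled = ((m - 1) * sx2 + (n - 1) * sy2) / (m + n - 2)
```

Note that the pooled estimate always lies between the two sample variances, as a weighted average must.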
Suppose now that ((X₁, Y₁), (X₂, Y₂), …, (Xₙ, Yₙ)) is a random sample of size n from the bivariate normal distribution of (X, Y), with means μ ∈ R and ν ∈ R, standard deviations σ ∈ (0, ∞) and τ ∈ (0, ∞), and correlation ρ ∈ [0, 1]. Of course, X = (X₁, X₂, …, Xₙ) is a random sample of size n from the normal distribution with mean μ and standard deviation σ, and Y = (Y₁, Y₂, …, Yₙ) is a random sample of size n from the normal distribution with mean ν and standard deviation τ, so the results above in the one sample model apply to X and Y individually. Thus our interest in this section is in the relation between various X and Y statistics and properties of sample covariance.
The bivariate (or more generally multivariate) model occurs naturally when considering two (or more) variables in the statistical
experiment. For example, the heights of the fathers and the heights of the sons in Pearson's height data may well fit observations
from the bivariate normal model.
In the notation that we have used previously, recall that σ₃ = E[(X − μ)³] = 0, σ₄ = E[(X − μ)⁴] = 3σ⁴, τ₃ = E[(Y − ν)³] = 0, τ₄ = E[(Y − ν)⁴] = 3τ⁴, δ = cov(X, Y) = στρ, and δ₂ = E[(X − μ)²(Y − ν)²] = σ²τ²(1 + 2ρ²).
The data vector ((X 1, Y1 ), (X2 , Y2 ), … (Xn , Yn )) has a multivariate normal distribution.
1. The mean vector has a block form, with each block being (μ, ν).
2. The variance-covariance matrix has a block-diagonal form, with each block being
\begin{bmatrix} \sigma^2 & \sigma\tau\rho \\ \sigma\tau\rho & \tau^2 \end{bmatrix}
Proof
Sample Means
(M (X), M (Y )) has a bivariate normal distribution. The covariance and correlation are
1. cov [M (X), M (Y )] = στ ρ/n
2. cor [M (X), M (Y )] = ρ
Proof
Of course, we know the individual means and variances of M (X) and M (Y ) from the one-sample model above. Hence we know
the complete distribution of (M (X), M (Y )).
Sample Variances
The covariance and correlation between the special sample variances are
1. cov[W²(X), W²(Y)] = 2σ²τ²ρ²/n
2. cor[W²(X), W²(Y)] = ρ²
Proof
The covariance and correlation between the standard sample variances are
1. cov[S²(X), S²(Y)] = 2σ²τ²ρ²/(n − 1)
2. cor[S²(X), S²(Y)] = ρ²
Proof
Sample Covariance
If μ and ν are known (again usually an artificial assumption), a natural estimator of the distribution covariance δ is the special
version of the sample covariance
W(X, Y) = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)(Y_i - \nu) \qquad (6.8.28)
Proof
If μ and ν are unknown (again usually the case), then a natural estimator of the distribution covariance δ is the standard sample
covariance
S(X, Y) = \frac{1}{n - 1}\sum_{i=1}^n [X_i - M(X)][Y_i - M(Y)] \qquad (6.8.29)
Proof
Computational Exercises
We use the basic notation established above for samples X and Y, and for the statistics M, W², S², T, and so forth.
Suppose that the net weights (in grams) of 25 bags of M&Ms form a random sample X from the normal distribution with
mean 50 and standard deviation 4. Find each of the following:
1. The mean and standard deviation of M .
2. The mean and standard deviation of W².
Suppose that the SAT math scores from 16 Alabama students form a random sample X from the normal distribution with mean
550 and standard deviation 20, while the SAT math scores from 25 Georgia students form a random sample Y from the normal
distribution with mean 540 and standard deviation 15. The two samples are independent. Find each of the following:
1. The mean and standard deviation of M (X).
2. The mean and standard deviation of M (Y ).
3. The mean and standard deviation of M (X) − M (Y ) .
4. P[M (X) > M (Y )] .
5. The mean and standard deviation of S²(X).
This page titled 6.8: Special Properties of Normal Samples is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
6.8.7 https://stats.libretexts.org/@go/page/10185
CHAPTER OVERVIEW
7: Point Estimation
Point estimation refers to the process of estimating a parameter from a probability distribution, based on observed data from the
distribution. It is one of the core topics in mathematical statistics. In this chapter, we will explore the most common methods of
point estimation: the method of moments, the method of maximum likelihood, and Bayes' estimators. We also study important
properties of estimators, including sufficiency and completeness, and the basic question of whether an estimator is the best possible
one.
7.1: Estimators
7.2: The Method of Moments
7.3: Maximum Likelihood
7.4: Bayesian Estimation
7.5: Best Unbiased Estimators
7.6: Sufficient, Complete and Ancillary Statistics
This page titled 7: Point Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
7.1: Estimators
In the basic statistical model, we have an observable random variable X taking values in a set S. For example, if the experiment is to sample n objects from a population and record various measurements of interest, then X = (X₁, X₂, …, Xₙ), where Xᵢ is the vector of measurements for the ith object. The most important special case is when (X₁, X₂, …, Xₙ) are independent and identically distributed (IID). In this case X is a random sample of size n from the distribution of an underlying measurement variable X.
Statistics
Recall also that a statistic is an observable function of the outcome variable of the random experiment: U = u(X) where u is a
known function from S into another set T . Thus, a statistic is simply a random variable derived from the observation variable X,
with the assumption that U is also observable. As the notation indicates, U is typically also vector-valued. Note that the original
data vector X is itself a statistic, but usually we are interested in statistics derived from X. A statistic U may be computed to
answer an inferential question. In this context, if the dimension of U (as a vector) is smaller than the dimension of X (as is usually
the case), then we have achieved data reduction. Ideally, we would like to achieve significant data reduction with no loss of
information about the inferential question at hand.
Parameters
In the technical sense, a parameter θ is a function of the distribution of X, taking values in a parameter space T. Typically, the distribution of X will have k ∈ N₊ real parameters of interest, so that θ has the form θ = (θ₁, θ₂, …, θ_k) and thus T ⊆ R^k. In many cases, one or more of the parameters are unknown, and must be estimated from the data variable X. This is one of the
most important and basic of all statistical problems, and is the subject of this chapter. If U is a statistic, then the distribution of U
will depend on the parameters of X, and thus so will distributional constructs such as means, variances, covariances, probability
density functions and so forth. We usually suppress this dependence notationally to keep our mathematical expressions from
becoming too unwieldy, but it's very important to realize that the underlying dependence is present. Remember that the critical idea
is that by observing a value u of a statistic U we (hopefully) gain information about the unknown parameters.
Estimators
Suppose now that we have an unknown real parameter θ taking values in a parameter space T ⊆ R . A real-valued statistic
U = u(X) that is used to estimate θ is called, appropriately enough, an estimator of θ . Thus, the estimator is a random variable
and hence has a distribution, a mean, a variance, and so on (all of which, as noted above, will generally depend on θ ). When we
actually run the experiment and observe the data x, the observed value u = u(x) (a single number) is the estimate of the parameter
θ. The following definitions are basic: the error is U − θ; the bias of U is bias(U) = E(U − θ) = E(U) − θ; and the mean square error of U is mse(U) = E[(U − θ)²].
Thus the error is the difference between the estimator and the parameter being estimated, so of course the error is a random
variable. The bias of U is simply the expected error, and the mean square error (the name says it all) is the expected square of the
error. Note that bias and mean square error are functions of θ ∈ T . The following definitions are a natural complement to the
definition of bias.
1. U is unbiased if bias(U ) = 0 , or equivalently E(U ) = θ , for all θ ∈ T .
2. U is negatively biased if bias(U ) ≤ 0 , or equivalently E(U ) ≤ θ , for all θ ∈ T .
3. U is positively biased if bias(U ) ≥ 0 , or equivalently E(U ) ≥ θ , for all θ ∈ T .
Thus, for an unbiased estimator, the expected value of the estimator is the parameter being estimated, clearly a desirable property.
On the other hand, a positively biased estimator overestimates the parameter, on average, while a negatively biased estimator
underestimates the parameter on average. Our definitions of negative and positive bias are weak in the sense that the weak
inequalities ≤ and ≥ are used. There are corresponding strong definitions, of course, using the strong inequalities < and >. Note,
however, that none of these definitions may apply. For example, it might be the case that bias(U ) < 0 for some θ ∈ T ,
bias(U ) = 0 for other θ ∈ T , and bias(U ) > 0 for yet other θ ∈ T .
mse(U) = var(U) + bias²(U)
Proof
In particular, if the estimator is unbiased, then the mean square error of U is simply the variance of U .
Ideally, we would like to have unbiased estimators with small mean square error. However, this is not always possible, and the
result in (3) shows the delicate relationship between bias and mean square error. In the next section we will see an example with
two estimators of a parameter that are multiples of each other; one is unbiased, but the other has smaller mean square error.
However, if we have two unbiased estimators of θ , we naturally prefer the one with the smaller variance (mean square error).
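The decomposition mse(U) = var(U) + bias²(U) can be illustrated numerically. This sketch (hypothetical normal data, arbitrary seed) compares the unbiased sample variance S² with the divisor-n version, a standard biased-but-lower-mse example and not necessarily the one the next section has in mind:

```python
import random
import statistics

rng = random.Random(1)
sigma2, n, reps = 4.0, 5, 40000   # hypothetical: normal data with variance 4

unbiased, biased = [], []
for _ in range(reps):
    x = [rng.gauss(0.0, 2.0) for _ in range(n)]
    m = statistics.fmean(x)
    ss = sum((xi - m) ** 2 for xi in x)
    unbiased.append(ss / (n - 1))  # S^2: unbiased
    biased.append(ss / n)          # divisor-n version: negatively biased

def mse(estimates, theta):
    # mean square error about the true parameter value
    return statistics.fmean((e - theta) ** 2 for e in estimates)

mse_unbiased = mse(unbiased, sigma2)
mse_biased = mse(biased, sigma2)
```

Here the biased estimator has smaller mean square error even though its bias is nonzero, since its variance is smaller by a factor of ((n − 1)/n)².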
Asymptotic Properties
Suppose again that we have a real parameter θ with possible values in a parameter space T. Often in a statistical experiment, we observe an infinite sequence of random variables over time, X = (X₁, X₂, …), so that at time n we have observed Xₙ = (X₁, X₂, …, Xₙ). In this setting we often have a general formula that defines an estimator of θ for each sample size n. Technically, this gives a sequence of real-valued estimators of θ: U = (U₁, U₂, …) where Uₙ is a real-valued function of Xₙ for each n ∈ N₊. In this case, we can discuss the asymptotic properties of the estimators as n → ∞. Most of the definitions are natural generalizations of the ones above.
Naturally, we expect our estimators to improve, as the sample size n increases, and in some sense to converge to the parameter as
n → ∞ . This general idea is known as consistency. Once again, for the remainder of this discussion, we assume that
U = (U , U , …) is a sequence of estimators for a real-valued parameter θ , with values in the parameter space T .
1 2
Consistency
1. U is consistent if Uₙ → θ as n → ∞ in probability for each θ ∈ T. That is, P(|Uₙ − θ| > ε) → 0 as n → ∞ for every ε > 0 and θ ∈ T.
2. U is mean-square consistent if E[(Uₙ − θ)²] → 0 as n → ∞ for each θ ∈ T.
Here is the connection between the two definitions:
That mean-square consistency implies simple consistency is simply a statistical version of the theorem that states that mean-square
convergence implies convergence in probability. Here is another nice consequence of mean-square consistency.
In the next several subsections, we will review several basic estimation problems that were studied in the chapter on Random
Samples.
Suppose that the basic variable X is real-valued, with mean μ and standard deviation σ. We sample from the distribution of X to produce a sequence X = (X₁, X₂, …) of independent variables, each with the distribution of X. For n ∈ N₊, let Mₙ denote the sample mean of the first n sample variables.
The consistency of M is simply the weak law of large numbers. Moreover, there are a number of important special cases of the
results in (10). See the section on Sample Mean for the details.
1. Suppose that X = 1(A), the indicator variable of an event A. Then the sample mean for a random sample of size n ∈ N₊ from the distribution of X is the relative frequency or empirical probability of A, denoted Pₙ(A).
2. Suppose that F denotes the distribution function of a real-valued random variable Y. Then for fixed y ∈ R, the empirical distribution function Fₙ(y) is simply the sample mean for a random sample of size n ∈ N₊ from the distribution of the indicator variable X = 1(Y ≤ y). Hence Fₙ(y) is an unbiased estimator of F(y) for n ∈ N₊ and (Fₙ(y) : n ∈ N₊) is consistent.
3. Suppose that U is a random variable with a discrete distribution on a countable set S and f denotes the probability density function of U. Then for fixed u ∈ S, the empirical probability density function fₙ(u) is simply the sample mean for a random sample of size n ∈ N₊ from the distribution of the indicator variable X = 1(U = u). Hence fₙ(u) is an unbiased estimator of f(u) for n ∈ N₊ and (fₙ(u) : n ∈ N₊) is consistent.
If μ is known (almost always an artificial assumption), then a natural estimator of σ² is a special version of the sample variance, defined by
W_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2, \quad n \in \mathbb{N}_+ \qquad (7.1.8)
Properties of W² = (W₁², W₂², …) as a sequence of estimators of σ²:

1. E(Wₙ²) = σ² so Wₙ² is unbiased for n ∈ N₊
2. var(Wₙ²) = (σ₄ − σ⁴)/n for n ∈ N₊ so W² is consistent
Proof
If μ is unknown (the more reasonable assumption), then a natural estimator of the distribution variance is the standard version of
the sample variance, defined by
S_n^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - M_n)^2, \quad n \in \{2, 3, \ldots\} \qquad (7.1.9)
Properties of S² = (S₂², S₃², …) as a sequence of estimators of σ²:

1. E(Sₙ²) = σ² so Sₙ² is unbiased for n ∈ {2, 3, …}
2. var(Sₙ²) = \frac{1}{n}\left(\sigma_4 - \frac{n - 3}{n - 1}\sigma^4\right) for n ∈ {2, 3, …} so S² is a consistent sequence
Comparison of W and S 2 2
So by (a) W is better than S for n ∈ {2, 3, …}, assuming that μ is known so that we can actually use W . This is perhaps not
2
n
2
n
2
n
surprising, but by (b) S works just about as well as W for a large sample size n . Of course, the sample standard deviation S is
2
n
2
n n
a natural estimator of the distribution standard deviation σ. Unfortunately, this estimator is biased. Here is a more general result:
Suppose that θ is a parameter with possible values in T ⊆ (0, ∞) (with at least two points) and that U is a statistic with values in T. If U² is an unbiased estimator of θ², then U is a negatively biased estimator of θ.
Proof
Thus, we should not be too obsessed with the unbiased property. For most sampling distributions, there will be no statistic U with the property that U is an unbiased estimator of σ and U² is an unbiased estimator of σ².
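For instance, since S² is unbiased for σ², the result above says that E(S) < σ. A quick simulation sketch (normal data with hypothetical parameter values; a small sample size makes the bias visible):

```python
import random
import statistics

rng = random.Random(2)
sigma, n, reps = 2.0, 4, 40000   # hypothetical values

s_vals = [statistics.stdev([rng.gauss(0.0, sigma) for _ in range(n)])
          for _ in range(reps)]
mean_s = statistics.fmean(s_vals)   # noticeably less than sigma = 2
```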
Suppose now that X and Y are real-valued random variables for an experiment, so that (X, Y) has a joint distribution. Let μ = E(X) and σ² = var(X) denote the mean and variance of X, and let ν = E(Y) and τ² = var(Y) denote the mean and variance of Y. For the bivariate parameters, let δ = cov(X, Y) denote the distribution covariance and ρ = cor(X, Y) the distribution correlation. We need one higher-order moment as well: let δ₂ = E[(X − μ)²(Y − ν)²], and as usual, we assume that all of the parameters exist. So the general parameter spaces are μ, ν ∈ R, σ², τ² ∈ (0, ∞), δ ∈ R, and ρ ∈ [0, 1]. We sample from the distribution of (X, Y) to generate a sequence of independent variables ((X₁, Y₁), (X₂, Y₂), …), each with the distribution of (X, Y). As usual, we will let Xₙ = (X₁, X₂, …, Xₙ) and Yₙ = (Y₁, Y₂, …, Yₙ); these are random samples of size n from the distributions of X and Y, respectively.
m_n(x) = \frac{1}{n}\sum_{i=1}^n x_i

w_n(x, y) = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)(y_i - \nu)

s_n(x, y) = \frac{1}{n - 1}\sum_{i=1}^n [x_i - m_n(x)][y_i - m_n(y)]

It should be clear from context whether we are using the one argument or two argument version of sₙ. On this point, note that sₙ(x, x) = sₙ²(x).
Proof
If μ and ν are unknown (usually the more reasonable assumption), then a natural estimator of the distribution covariance δ is the
standard version of the sample covariance, defined by
S_n = s_n(X, Y) = \frac{1}{n - 1}\sum_{i=1}^n [X_i - m_n(X)][Y_i - m_n(Y)], \quad n \in \{2, 3, \ldots\} \qquad (7.1.12)

1. E(Sₙ) = δ so Sₙ is unbiased for n ∈ {2, 3, …}
2. var(Sₙ) = \frac{1}{n}\left(\delta_2 + \frac{1}{n - 1}\sigma^2\tau^2 - \frac{n - 2}{n - 1}\delta^2\right) for n ∈ {2, 3, …} so S is consistent
Once again, since we have two competing sequences of estimators of δ , we would like to compare them.
Thus, Uₙ is better than Vₙ for n ∈ {2, 3, …}, assuming that μ and ν are known so that we can actually use Wₙ. But for large n, Vₙ works just about as well as Uₙ.
The sample correlation is Rₙ = sₙ(X, Y)/[sₙ(X) sₙ(Y)] for n ∈ {2, 3, …}. Note that this statistic is a nonlinear function of the sample covariance and the two sample standard deviations. For most distributions of (X, Y), we have no hope of computing the bias or mean square error of this estimator. If we could compute the
expected value, we would probably find that the estimator is biased. On the other hand, even though we cannot compute the mean square error, a simple application of the law of large numbers shows that Rₙ → ρ as n → ∞ with probability 1. Thus, R = (R₂, R₃, …) is at least consistent.
On the other hand, the sample regression line, based on the sample of size n ∈ {2, 3, …}, is y = Aₙ + Bₙx where

A_n = m_n(Y) - \frac{s_n(X, Y)}{s_n^2(X)} m_n(X), \quad B_n = \frac{s_n(X, Y)}{s_n^2(X)} \qquad (7.1.15)
Of course, the statistics Aₙ and Bₙ are natural estimators of the parameters a and b, respectively, and in a sense are derived from our previous estimators of the distribution mean, variance, and covariance. Once again, for most distributions of (X, Y), it would be difficult to compute the bias and mean square errors of these estimators. But applications of the law of large numbers show that with probability 1, Aₙ → a and Bₙ → b as n → ∞, so at least A = (A₂, A₃, …) and B = (B₂, B₃, …) are consistent.
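As a concrete sketch (with made-up paired data), the sample regression coefficients can be computed directly from the sample means, variance, and covariance:

```python
import statistics

# hypothetical paired observations (x_i, y_i)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mx, my = statistics.fmean(xs), statistics.fmean(ys)

# sample covariance s_n(x, y) and sample variance s_n^2(x)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
sxx = statistics.variance(xs)

b = sxy / sxx     # slope B_n
a = my - b * mx   # intercept A_n, per (7.1.15)
```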
Recall that the Poisson distribution with parameter λ ∈ (0, ∞) has probability density function

g(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x \in \mathbb{N} \qquad (7.1.16)
The Poisson distribution is often used to model the number of random “points” in a region of time or space, and is studied in more
detail in the chapter on the Poisson process. The parameter λ is proportional to the size of the region of time or space; the
proportionality constant is the average rate of the random points. The distribution is named for Simeon Poisson.
1. E(X) = λ
2. var(X) = λ
3. σ₄ = E[(X − λ)⁴] = 3λ² + λ
Proof
Suppose now that we sample from the distribution of X to produce a sequence of independent random variables X = (X₁, X₂, …), each having the Poisson distribution with unknown parameter λ ∈ (0, ∞). Again, Xₙ = (X₁, X₂, …, Xₙ) is a random sample of size n from the distribution for each n ∈ N₊. From the previous exercise, λ is both the mean and the variance of the distribution, so that we could use either the sample mean Mₙ or the sample variance Sₙ² as an estimator of λ. Both are unbiased, so which is better? Naturally, we use mean square error as our criterion.
Comparison of Mₙ and Sₙ² as estimators of λ.

1. var(Mₙ) = λ/n for n ∈ N₊.
2. var(Sₙ²) = \frac{\lambda}{n}\left(1 + \frac{2\lambda n}{n - 1}\right) for n ∈ {2, 3, …}.
3. var(Mₙ) < var(Sₙ²) so Mₙ is better for n ∈ {2, 3, …}.

So our conclusion is that the sample mean Mₙ is a better estimator of the parameter λ than the sample variance Sₙ² for n ∈ {2, 3, …}, and the difference in quality increases with λ.
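The comparison can be checked by simulation. Since Python's standard library has no Poisson sampler, this sketch uses Knuth's multiplication method (the parameter values and seed are arbitrary choices):

```python
import math
import random
import statistics

rng = random.Random(3)

def poisson(lam):
    # Knuth's method: count how many uniform factors are needed
    # before the running product drops below e^{-lam}
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

lam, n, reps = 2.0, 10, 20000
means, variances = [], []
for _ in range(reps):
    x = [poisson(lam) for _ in range(n)]
    means.append(statistics.fmean(x))         # M_n
    variances.append(statistics.variance(x))  # S_n^2

var_m = statistics.variance(means)       # theory: lam/n = 0.2
var_s2 = statistics.variance(variances)  # theory: about 1.09 for these values
```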
Run the Poisson experiment 100 times for several values of the parameter. In each case, compute the estimators M and S². Which estimator seems to work better?
The emission of elementary particles from a sample of radioactive material in a time interval is often assumed to follow the
Poisson distribution. Thus, suppose that the alpha emissions data set is a sample from a Poisson distribution. Estimate the rate
parameter λ .
1. using the sample mean
2. using the sample variance
Answer
Simulation Exercises
In the sample mean experiment, set the sampling distribution to gamma. Increase the sample size with the scroll bar and note
graphically and numerically the unbiased and consistent properties. Run the experiment 1000 times and compare the sample
mean to the distribution mean.
Run the normal estimation experiment 1000 times for several values of the parameters.
1. Compare the empirical bias and mean square error of M with the theoretical values.
2. Compare the empirical bias and mean square error of S² and of W² to their theoretical values. Which estimator seems to work better?
In the matching experiment, the random variable is the number of matches. Run the simulation 1000 times and compare
1. the sample mean to the distribution mean.
2. the empirical density function to the probability density function.
Run the exponential experiment 1000 times and compare the sample standard deviation to the distribution standard deviation.
For Cavendish's density of the earth data, compute the sample mean and sample variance.
Answer
For Short's parallax of the sun data, compute the sample mean and sample variance.
Answer
The estimators of the mean, variance, and covariance that we have considered in this section have been natural in a sense.
However, for other parameters, it is not clear how to even find a reasonable estimator in the first place. In the next several sections,
we will consider the problem of constructing estimators. Then we return to the study of the mathematical properties of estimators,
and consider the question of when we can know that an estimator is the best possible, given the data.
This page titled 7.1: Estimators is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
7.2: The Method of Moments
Basic Theory
The Method
Suppose that we have a basic random experiment with an observable, real-valued random variable X. The distribution of X has k unknown real-valued parameters, or equivalently, a parameter vector θ = (θ₁, θ₂, …, θₖ) taking values in a parameter space, a subset of ℝᵏ. As usual, we repeat the experiment n times to generate a random sample of size n from the distribution of X:
X = (X₁, X₂, …, Xₙ)  (7.2.1)
Thus, X is a sequence of independent random variables, each with the distribution of X. The method of moments is a technique for constructing estimators of the parameters that is based on matching the sample moments with the corresponding distribution moments. First, let
μ^(j)(θ) = E(X^j), j ∈ ℕ₊  (7.2.2)
so that μ^(j)(θ) is the jth moment of X about 0. Note that we are emphasizing the dependence of these moments on the vector of parameters θ. Note also that μ^(1)(θ) is just the mean of X, which we usually denote simply by μ. Next, let
M^(j)(X) = (1/n) ∑ᵢ₌₁ⁿ Xᵢ^j, j ∈ ℕ₊  (7.2.3)
so that M^(j)(X) is the jth sample moment about 0. Equivalently, M^(j)(X) is the sample mean for the random sample (X₁^j, X₂^j, …, Xₙ^j) from the distribution of X^j. Note that we are emphasizing the dependence of the sample moments on the sample X. Note also that M^(1)(X) is just the ordinary sample mean, which we usually just denote by M (or by Mₙ if we wish to emphasize the dependence on the sample size). From our previous work, we know that M^(j)(X) is an unbiased and consistent estimator of μ^(j)(θ) for each j.
To construct the method of moments estimators (W₁, W₂, …, Wₖ) for the parameters (θ₁, θ₂, …, θₖ) respectively, we consider the equations
μ^(j)(W₁, W₂, …, Wₖ) = M^(j)(X₁, X₂, …, Xₙ)  (7.2.4)
The equations for j ∈ {1, 2, …, k} give k equations in k unknowns, so there is hope (but no guarantee) that the equations can be solved for (W₁, W₂, …, Wₖ) in terms of (M^(1), M^(2), …, M^(k)). In fact, sometimes we need equations with j > k. Exercise 28 below gives a simple example. The method of moments can be extended to parameters associated with bivariate or more general multivariate distributions, by matching sample product moments with the corresponding distribution product moments. The method of moments also sometimes makes sense when the sample variables (X₁, X₂, …, Xₙ) are not independent, but at least are identically distributed. In the general results that follow, we often suppress the dependence on the sample to keep the notation simple. But in the applications below, we put the notation back in because we want to discuss asymptotic behavior.
Occasionally we will also need σ₄ = E[(X − μ)⁴], the fourth central moment. We sample from the distribution of X to produce a sequence of independent variables; for n ∈ ℕ₊, Xₙ = (X₁, X₂, …, Xₙ) is a random sample of size n from the distribution of X. We start by estimating the mean, which is the simplest case.
Suppose that the mean μ is unknown. The method of moments estimator of μ based on Xₙ is the sample mean
Mₙ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ  (7.2.5)
1. E(Mₙ) = μ so Mₙ is unbiased for n ∈ ℕ₊.
2. var(Mₙ) = σ²/n for n ∈ ℕ₊ so M = (M₁, M₂, …) is consistent.
Proof
Estimating the variance of the distribution, on the other hand, depends on whether the distribution mean μ is known or unknown. First we will consider the more realistic case when the mean is also unknown. Recall that for n ∈ {2, 3, …}, the sample variance based on Xₙ is
Sₙ² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xᵢ − Mₙ)²  (7.2.6)
Recall also that E(Sₙ²) = σ² so Sₙ² is unbiased for n ∈ {2, 3, …}, and that var(Sₙ²) = (1/n)(σ₄ − ((n − 3)/(n − 1))σ⁴) so S² = (S₂², S₃², …) is consistent.
Suppose that the mean μ and the variance σ² are both unknown. For n ∈ ℕ₊, the method of moments estimator of σ² based on Xₙ is
Tₙ² = (1/n) ∑ᵢ₌₁ⁿ (Xᵢ − Mₙ)²  (7.2.7)
1. bias(Tₙ²) = −σ²/n for n ∈ ℕ₊ so T² = (T₁², T₂², …) is asymptotically unbiased.
2. mse(Tₙ²) = (1/n³)[(n − 1)²σ₄ − (n² − 5n + 3)σ⁴] for n ∈ ℕ₊ so T² is consistent.
Proof
Next let's consider the usually unrealistic (but mathematically interesting) case where the mean is known, but not the variance.
Suppose that the mean μ is known and the variance σ² unknown. For n ∈ ℕ₊, the method of moments estimator of σ² based on Xₙ is
Wₙ² = (1/n) ∑ᵢ₌₁ⁿ (Xᵢ − μ)²  (7.2.8)
1. E(Wₙ²) = σ² so Wₙ² is unbiased for n ∈ ℕ₊.
2. var(Wₙ²) = (1/n)(σ₄ − σ⁴) for n ∈ ℕ₊ so W² = (W₁², W₂², …) is consistent.
Proof
We compared the sequence of estimators S² with the sequence of estimators W² in the introductory section on Estimators. Recall that var(Wₙ²) < var(Sₙ²) for n ∈ {2, 3, …} but var(Sₙ²)/var(Wₙ²) → 1 as n → ∞. There is no simple, general relationship between mse(Tₙ²) and mse(Sₙ²) or between mse(Tₙ²) and mse(Wₙ²), but the asymptotic relationship is simple.
mse(Tₙ²)/mse(Wₙ²) → 1 and mse(Tₙ²)/mse(Sₙ²) → 1 as n → ∞
Proof
It also follows that if both μ and σ² are unknown, then the method of moments estimator of the standard deviation σ is T = √T². In the unlikely event that μ is known, but σ² unknown, then the method of moments estimator of σ is W = √W².
moment. This alternative approach sometimes leads to easier equations. To set up the notation, suppose that a distribution on ℝ has parameters a and b. We sample from the distribution to produce a sequence of independent variables X = (X₁, X₂, …), each with the common distribution. For n ∈ ℕ₊, Xₙ = (X₁, X₂, …, Xₙ) is a random sample of size n from the distribution. Let Mₙ, Mₙ^(2), and Tₙ² denote the sample mean, second-order sample mean, and biased sample variance corresponding to Xₙ, and let μ(a, b), μ^(2)(a, b), and σ²(a, b) denote the mean, second-order mean, and variance of the distribution.
If the method of moments estimators Uₙ and Vₙ of a and b, respectively, can be found by solving the first two equations
μ(Uₙ, Vₙ) = Mₙ, μ^(2)(Uₙ, Vₙ) = Mₙ^(2)  (7.2.9)
then Uₙ and Vₙ can also be found by solving the equations
μ(Uₙ, Vₙ) = Mₙ, σ²(Uₙ, Vₙ) = Tₙ²  (7.2.10)
Proof
Because of this result, the biased sample variance Tₙ² will appear in many of the estimation problems for special distributions that we consider below.
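As a concrete check of this result, consider a distribution with μ(k, b) = kb and μ^(2)(k, b) = k(k + 1)b² (the gamma distribution below has exactly this form). Solving the two moment equations and solving the mean/variance equations give identical estimators, since T² = M^(2) − M². The sketch below is our own illustration, with hypothetical helper names:

```python
def moment_route(xs):
    # match mean and second moment: mu = k*b, mu^(2) = k*(k+1)*b^2
    n = len(xs)
    m = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    b = (m2 - m * m) / m
    k = m * m / (m2 - m * m)
    return k, b

def variance_route(xs):
    # match mean and variance: mu = k*b, sigma^2 = k*b^2
    n = len(xs)
    m = sum(xs) / n
    t2 = sum((x - m) ** 2 for x in xs) / n  # biased sample variance T^2
    return m * m / t2, t2 / m

xs = [0.8, 1.9, 2.4, 3.1, 0.5, 4.2, 1.1, 2.7]
est_m = moment_route(xs)
est_v = variance_route(xs)
print(est_m, est_v)  # the two routes agree up to floating-point error
```

The agreement is exact algebra, not an approximation: both routes reduce to the same functions of M and T².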
Special Distributions
The Normal Distribution
The normal distribution with mean μ ∈ ℝ and variance σ² ∈ (0, ∞) is a continuous distribution on ℝ with probability density function g given by
g(x) = (1/(√(2π) σ)) exp[−(1/2)((x − μ)/σ)²], x ∈ ℝ  (7.2.11)
This is one of the most important distributions in probability and statistics, primarily because of the central limit theorem. The
normal distribution is studied in more detail in the chapter on Special Distributions.
Suppose now that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the normal distribution with mean μ and variance σ². From our general work above, we know that if μ is unknown then the sample mean M is the method of moments estimator of μ, and if in addition σ² is unknown then the method of moments estimator of σ² is T². On the other hand, in the unlikely event that μ is known then W² is the method of moments estimator of σ². Our goal is to see how the comparisons above simplify for the normal distribution.
1. mse(T²) = ((2n − 1)/n²) σ⁴
2. mse(S²) = (2/(n − 1)) σ⁴
3. mse(T²) < mse(S²) for n ∈ {2, 3, …}
Proof
Thus, S² and T² are multiples of one another; S² is unbiased, but when the sampling distribution is normal, T² has smaller mean square error. Surprisingly, T² has smaller mean square error even than W².
1. mse(W²) = (2/n) σ⁴
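The ordering mse(T²) < mse(S²) in the normal case is easy to verify empirically. The following sketch is our own (not the text's estimation applet): with σ² = 1 and n = 5, the theory gives mse(T²) = (2n − 1)/n² = 0.36 and mse(S²) = 2/(n − 1) = 0.5.

```python
import random

rng = random.Random(3)
sigma2, n, reps = 1.0, 5, 20000
se_T2 = se_S2 = 0.0
for _ in range(reps):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    se_T2 += (ss / n - sigma2) ** 2        # biased version T^2 (divide by n)
    se_S2 += (ss / (n - 1) - sigma2) ** 2  # unbiased version S^2 (divide by n-1)
mse_T2, mse_S2 = se_T2 / reps, se_S2 / reps
print(mse_T2, mse_S2)  # near 0.36 and 0.5 respectively
```

The two empirical errors come from the same samples, so their ordering is estimated very stably even with modest run counts.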
Run the normal estimation experiment 1000 times for several values of the sample size n and the parameters μ and σ.
Compare the empirical bias and mean square error of S² and of T² to their theoretical values. Which estimator is better in terms of mean square error?
Next we consider estimators of the standard deviation σ. As noted in the general discussion above, T = √T² is the method of moments estimator when μ is unknown, while W = √W² is the method of moments estimator in the unlikely event that μ is known. Another natural estimator, of course, is S = √S², the usual sample standard deviation. The following sequence, defined in terms of the gamma function, turns out to be important in the analysis of all three estimators.
For n ∈ ℕ₊, let aₙ = √(2/n) Γ((n + 1)/2) / Γ(n/2). Then
1. E(W) = aₙ σ
2. bias(W) = (aₙ − 1)σ
3. var(W) = (1 − aₙ²) σ²
4. mse(W) = 2(1 − aₙ)σ²
Proof
Thus W is negatively biased as an estimator of σ but asymptotically unbiased and consistent. Of course we know that in general (regardless of the underlying distribution), W² is an unbiased estimator of σ² and so W is negatively biased as an estimator of σ. In the normal case, since aₙ involves no unknown parameters, the statistic W/aₙ is an unbiased estimator of σ. Next we consider S. For n ∈ {2, 3, …},
1. E(S) = aₙ₋₁ σ
2. bias(S) = (aₙ₋₁ − 1)σ
3. var(S) = (1 − aₙ₋₁²) σ²
4. mse(S) = 2(1 − aₙ₋₁)σ²
Proof
As with W, the statistic S is negatively biased as an estimator of σ but asymptotically unbiased, and also consistent. Since aₙ₋₁ involves no unknown parameters, the statistic S/aₙ₋₁ is an unbiased estimator of σ. Note also that, in terms of bias and mean square error, S with sample size n behaves like W with sample size n − 1. Finally we consider T, the method of moments estimator of σ when μ is unknown.
For n ∈ {2, 3, …},
1. E(T) = √((n − 1)/n) aₙ₋₁ σ
2. bias(T) = (√((n − 1)/n) aₙ₋₁ − 1)σ
3. var(T) = ((n − 1)/n)(1 − aₙ₋₁²)σ²
4. mse(T) = (2 − 1/n − 2√((n − 1)/n) aₙ₋₁)σ²
Proof
where p ∈ (0, 1) is the success parameter. The mean of the distribution is p and the variance is p(1 − p) .
Suppose now that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the Bernoulli distribution with unknown success parameter p. Since the mean of the distribution is p, it follows from our general work above that the method of moments estimator of p is M, the sample mean. In this case, the sample X is a sequence of Bernoulli trials, and M has a scaled version of the binomial distribution with parameters n and p:
P(M = k/n) = C(n, k) p^k (1 − p)^(n−k), k ∈ {0, 1, …, n}  (7.2.15)
where C(n, k) denotes the binomial coefficient.
Note that since X^k = X for every k ∈ ℕ₊, it follows that μ^(k) = p and M^(k) = M for every k ∈ ℕ₊. So any of the method of
moments equations would lead to the sample mean M as the estimator of p. Although very simple, this is an important application,
since Bernoulli trials are found embedded in all sorts of estimation problems, such as empirical probability density functions and
empirical distribution functions.
The geometric distribution on ℕ₊ with success parameter p ∈ (0, 1) has probability density function
g(x) = p(1 − p)^(x−1), x ∈ ℕ₊  (7.2.16)
The geometric distribution on ℕ₊ governs the number of trials needed to get the first success in a sequence of Bernoulli trials. The mean of the distribution is μ = 1/p, so the method of moments estimator of p based on a random sample is
U = 1/M  (7.2.17)
Proof
The geometric distribution on ℕ with success parameter p ∈ (0, 1) has probability density function
g(x) = p(1 − p)^x, x ∈ ℕ  (7.2.18)
This version of the geometric distribution governs the number of failures before the first success in a sequence of Bernoulli trials. The mean of the distribution is μ = (1 − p)/p, so the method of moments estimator of p based on a random sample is
U = 1/(M + 1)  (7.2.19)
Proof
If k is a positive integer, then this distribution governs the number of failures before the kth success in a sequence of Bernoulli trials with success parameter p. However, the distribution makes sense for general k ∈ (0, ∞). The negative binomial distribution is studied in more detail in the chapter on Bernoulli Trials. The mean of the distribution is k(1 − p)/p and the variance is k(1 − p)/p². Suppose now that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the negative binomial distribution on ℕ with shape parameter k and success parameter p.
If k and p are unknown, then the corresponding method of moments estimators U and V are
U = M²/(T² − M), V = M/T²  (7.2.21)
Proof
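These estimators are easy to try on simulated data. The sketch below is our own (sampling by counting failures before the kth success, which is valid for integer k); with a large sample, U and V land close to the true k and p.

```python
import random

def neg_binomial(k, p, rng):
    # number of failures before the k-th success in Bernoulli trials
    failures = 0
    successes = 0
    while successes < k:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

rng = random.Random(1)
k, p, n = 3, 0.4, 40000
xs = [neg_binomial(k, p, rng) for _ in range(n)]
m = sum(xs) / n
t2 = sum((x - m) ** 2 for x in xs) / n
U = m * m / (t2 - m)   # method of moments estimate of k
V = m / t2             # method of moments estimate of p
print(U, V)            # near 3 and 0.4
```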
As usual, the results are nicer when one of the parameters is known. Suppose that k is known, and let Vₖ denote the method of moments estimator of p. Then
Vₖ = k/(M + k)  (7.2.23)
Proof
Suppose now that p is known, and let Uₚ denote the method of moments estimator of k, so that Uₚ = (p/(1 − p)) M. Then
1. E(Uₚ) = k so Uₚ is unbiased.
2. var(Uₚ) = k/(n(1 − p)) so Uₚ is consistent.
Proof
The mean and variance are both r. The distribution is named for Simeon Poisson and is widely used to model the number of “random points” in a region of time or space. The parameter r is proportional to the size of the region, with the proportionality constant playing the role of the average rate at which the points are distributed in time or space. The Poisson distribution is studied in more detail in the chapter on the Poisson Process.
Suppose now that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the Poisson distribution with parameter r. Since r is the mean, it follows from our general work above that the method of moments estimator of r is the sample mean M.
The gamma probability density function has a variety of shapes, and so this distribution is used to model various types of positive random variables. The gamma distribution is studied in more detail in the chapter on Special Distributions. The mean is μ = kb and the variance is σ² = kb².
Suppose now that X = (X1 , X2 , … , Xn ) is a random sample from the gamma distribution with shape parameter k and scale
parameter b .
Suppose that k and b are both unknown, and let U and V be the corresponding method of moments estimators. Then
U = M²/T², V = T²/M  (7.2.28)
Proof
The method of moments estimators of k and b given in the previous exercise are complicated, nonlinear functions of the sample mean M and the sample variance T². Thus, computing the bias and mean square error of these estimators is a difficult problem that we will not attempt. However, we can judge the quality of the estimators empirically, through simulations.
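A minimal simulation of this kind, written in our own notation (Python's standard library supplies a gamma sampler), looks like the following; with a large sample, U and V sit close to the true k and b.

```python
import random

rng = random.Random(5)
k, b, n = 2.0, 3.0, 40000
# gammavariate(alpha, beta) has mean alpha * beta, matching mu = k * b
xs = [rng.gammavariate(k, b) for _ in range(n)]
m = sum(xs) / n
t2 = sum((x - m) ** 2 for x in xs) / n
U = m * m / t2   # method of moments estimate of k
V = t2 / m       # method of moments estimate of b
print(U, V)      # near 2 and 3
```

Repeating this over many samples would give empirical bias and mean square error for U and V, which is exactly what the gamma estimation experiment does.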
When one of the parameters is known, the method of moments estimator of the other parameter is much simpler.
Suppose that b is known, and let U_b denote the method of moments estimator of k. Then
U_b = M/b  (7.2.29)
1. E(U_b) = k so U_b is unbiased.
Proof
Suppose that k is known, and let Vₖ denote the method of moments estimator of b, so that Vₖ = M/k. Then
1. E(Vₖ) = b so Vₖ is unbiased.
Proof
Run the gamma estimation experiment 1000 times for several different values of the sample size n and the parameters k and b. Note the empirical bias and mean square error of the estimators U, V, U_b, and Vₖ. One would think that the estimators when one of the parameters is known should work better than the corresponding estimators when both parameters are unknown; but investigate this question empirically.
The beta probability density function has a variety of shapes, and so this distribution is widely used to model various types of random variables that take values in bounded intervals. The beta distribution is studied in more detail in the chapter on Special Distributions. The first two moments are μ = a/(a + b) and μ^(2) = a(a + 1)/((a + b)(a + b + 1)).
Suppose now that X = (X 1, X2 , … , Xn ) is a random sample of size n from the beta distribution with left parameter a and right
parameter b .
Suppose that a and b are both unknown, and let U and V be the corresponding method of moments estimators. Then
U = M(M − M^(2))/(M^(2) − M²), V = (1 − M)(M − M^(2))/(M^(2) − M²)  (7.2.32)
Proof
The method of moments estimators of a and b given in the previous exercise are complicated nonlinear functions of the sample moments M and M^(2). Thus, we will not attempt to determine the bias and mean square errors analytically, but you will have an opportunity to explore them empirically through simulation.
Suppose that a is unknown, but b is known. Let U_b be the method of moments estimator of a. Then
U_b = b M/(1 − M)  (7.2.34)
Proof
Suppose that b is unknown, but a is known. Let Vₐ be the method of moments estimator of b. Then
Vₐ = a(1 − M)/M  (7.2.35)
Proof
Run the beta estimation experiment 1000 times for several different values of the sample size n and the parameters a and b .
Note the empirical bias and mean square error of the estimators U, V, U_b, and Vₐ. One would think that the estimators when
one of the parameters is known should work better than the corresponding estimators when both parameters are unknown; but
investigate this question empirically.
The following problem gives a distribution with just one parameter but the second moment equation from the method of moments
is needed to derive an estimator.
Suppose that X = (X₁, X₂, …, Xₙ) is a random sample from the symmetric beta distribution, in which the left and right parameters are equal to an unknown value c ∈ (0, ∞). The method of moments estimator of c is
U = (2M^(2) − 1)/(1 − 4M^(2))  (7.2.36)
Proof
The Pareto distribution is named for Vilfredo Pareto and is a highly skewed and heavy-tailed distribution. It is often used to model income and certain other types of positive random variables. The Pareto distribution is studied in more detail in the chapter on Special Distributions. If a > 2, the first two moments of the Pareto distribution are μ = ab/(a − 1) and μ^(2) = ab²/(a − 2).
Suppose now that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the Pareto distribution with shape parameter a > 2 and scale parameter b > 0.
Suppose that a and b are both unknown, and let U and V be the corresponding method of moments estimators. Then
U = 1 + √(M^(2)/(M^(2) − M²))  (7.2.39)
V = (M^(2)/M)(1 − √((M^(2) − M²)/M^(2)))  (7.2.40)
Proof
As with our previous examples, the method of moments estimators are complicated nonlinear functions of M and M^(2), so computing the bias and mean square error of the estimators is difficult. Instead, we can investigate the bias and mean square error empirically, through a simulation.
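Such a simulation is easy to sketch by hand, since Pareto variables can be generated by inverting the distribution function: if Q is uniform on (0, 1], then b/Q^(1/a) has the Pareto distribution with parameters a and b. The code below is our own illustration (not the Pareto estimation applet):

```python
import random

rng = random.Random(2)
a, b, n = 5.0, 2.0, 50000
# inverse-CDF sampling; 1 - random() lies in (0, 1], avoiding division by zero
xs = [b / (1.0 - rng.random()) ** (1.0 / a) for _ in range(n)]
m = sum(xs) / n                   # sample mean M
m2 = sum(x * x for x in xs) / n   # second-order sample mean M^(2)
U = 1 + (m2 / (m2 - m * m)) ** 0.5               # estimates a
V = (m2 / m) * (1 - ((m2 - m * m) / m2) ** 0.5)  # estimates b
print(U, V)  # near 5 and 2
```

We chose a = 5 so that the fourth moment is finite and the sample moments settle quickly; for shape parameters near 2 the convergence is much slower.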
Run the Pareto estimation experiment 1000 times for several different values of the sample size n and the parameters a and b .
Note the empirical bias and mean square error of the estimators U and V .
When one of the parameters is known, the method of moments estimator for the other parameter is simpler.
Suppose that a is unknown, but b is known. Let U_b be the method of moments estimator of a. Then
U_b = M/(M − b)  (7.2.43)
Proof
Suppose that b is unknown, but a is known. Let Vₐ be the method of moments estimator of b. Then
Vₐ = ((a − 1)/a) M  (7.2.44)
1. E(Vₐ) = b so Vₐ is unbiased.
2. var(Vₐ) = b²/(n a(a − 2)) so Vₐ is consistent.
Proof
The distribution models a point chosen “at random” from the interval [a, a + h]. The mean of the distribution is μ = a + h/2 and the variance is σ² = h²/12. The uniform distribution is studied in more detail in the chapter on Special Distributions. Suppose now that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the distribution.
Suppose that a and h are both unknown, and let U and V denote the corresponding method of moments estimators. Then
U = M − √3 T, V = 2√3 T  (7.2.46)
Proof
Suppose that a is known and h is unknown, and let Vₐ denote the method of moments estimator of h. Then
Vₐ = 2(M − a)  (7.2.47)
1. E(Vₐ) = h so Vₐ is unbiased.
2. var(Vₐ) = h²/(3n) so Vₐ is consistent.
Proof
Suppose that h is known and a is unknown, and let Uₕ denote the method of moments estimator of a. Then
Uₕ = M − h/2  (7.2.48)
1. E(Uₕ) = a so Uₕ is unbiased.
2. var(Uₕ) = h²/(12n) so Uₕ is consistent.
Proof
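The two-parameter case above is a nice one to try numerically, since T is a simple function of the sample. The sketch below is our own check that U = M − √3 T and V = 2√3 T recover a and h from a large uniform sample:

```python
import random

rng = random.Random(9)
a, h, n = 1.0, 4.0, 20000
xs = [a + h * rng.random() for _ in range(n)]   # uniform on [a, a + h]
m = sum(xs) / n
t = (sum((x - m) ** 2 for x in xs) / n) ** 0.5  # biased sample standard deviation T
U = m - 3 ** 0.5 * t       # estimates a, since mu = a + h/2 and sigma = h/(2*sqrt(3))
V = 2 * 3 ** 0.5 * t       # estimates h
print(U, V)                # near 1 and 4
```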
random sample from a distribution. However, the method makes sense, at least in some cases, when the variables are identically
distributed but dependent. In the hypergeometric model, we have a population of N objects with r of the objects type 1 and the
remaining N − r objects type 0. The parameter N , the population size, is a positive integer. The parameter r, the type 1 size, is a
nonnegative integer with r ≤ N . These are the basic parameters, and typically one or both is unknown. Here are some typical
examples:
1. The objects are devices, classified as good or defective.
2. The objects are persons, classified as female or male.
3. The objects are voters, classified as for or against a particular candidate.
4. The objects are wildlife of a particular type, either tagged or untagged.
We sample n objects from the population at random, without replacement. Let Xᵢ be the type of the ith object selected, so that our sequence of observed variables is X = (X₁, X₂, …, Xₙ). The variables are identically distributed indicator variables, with P(Xᵢ = 1) = r/N for each i ∈ {1, 2, …, n}, but are dependent since the sampling is without replacement. The number of type 1 objects in the sample is Y = ∑ᵢ₌₁ⁿ Xᵢ. This statistic has the hypergeometric distribution with parameters N, r, and n, and has probability density function
P(Y = y) = C(r, y) C(N − r, n − y) / C(N, n) = C(n, y) r^(y) (N − r)^(n−y) / N^(n), y ∈ {max{0, n + r − N}, …, min{n, r}}  (7.2.49)
where C(a, b) denotes the binomial coefficient and x^(j) = x(x − 1)⋯(x − j + 1) denotes the falling power.
The hypergeometric model is studied in more detail in the chapter on Finite Sampling Models.
As above, let X = (X₁, X₂, …, Xₙ) be the observed variables in the hypergeometric model with parameters N and r. Then
1. The method of moments estimator of p = r/N is M = Y/n, the sample mean.
2. The method of moments estimator of r with N known is U = N M = N Y/n.
3. The method of moments estimator of N with r known is V = r/M = rn/Y if Y > 0.
Proof
In the voter example (3) above, typically N and r are both unknown, but we would only be interested in estimating the ratio
p = r/N . In the reliability example (1), we might typically know N and would be interested in estimating r . In the wildlife
example (4), we would typically know r and would be interested in estimating N . This example is known as the capture-recapture
model.
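The capture-recapture estimator V = rn/Y is easy to simulate. The sketch below is our own (note that V is not unbiased: since E(1/Y) > 1/E(Y) by Jensen's inequality, the estimator tends to overshoot N slightly):

```python
import random

def capture_recapture(N, r, n, rng):
    # population of N objects: r tagged (1) and N - r untagged (0)
    population = [1] * r + [0] * (N - r)
    sample = rng.sample(population, n)   # sampling without replacement
    Y = sum(sample)                      # number of tagged objects recaptured
    return r * n / Y if Y > 0 else float("inf")

rng = random.Random(4)
N, r, n = 1000, 100, 200
estimates = [capture_recapture(N, r, n, rng) for _ in range(3000)]
avg = sum(estimates) / len(estimates)
print(avg)  # close to N = 1000, with a small upward bias
```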
Clearly there is a close relationship between the hypergeometric model and the Bernoulli trials model above. In fact, if the
sampling is with replacement, the Bernoulli trials model would apply rather than the hypergeometric model. In addition, if the
population size N is large compared to the sample size n , the hypergeometric model is well approximated by the Bernoulli trials
model.
This page titled 7.2: The Method of Moments is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
7.3: Maximum Likelihood
Basic Theory
The Method
Suppose again that we have an observable random variable X for an experiment that takes values in a set S. Suppose also that the distribution of X depends on an unknown parameter θ, taking values in a parameter space Θ. Of course, our data variable X will almost always be vector valued. The parameter θ may also be vector valued. We will denote the probability density function of X on S by f_θ for θ ∈ Θ. The distribution of X could be discrete or continuous.
The likelihood function is the function obtained by reversing the roles of x and θ in the probability density function; that is, we view θ as the variable and x as the given information (which is precisely the point of view in estimation): L_x(θ) = f_θ(x) for θ ∈ Θ and x ∈ S.
In the method of maximum likelihood, we try to find the value of the parameter that maximizes the likelihood function for each
value of the data vector.
Suppose that the maximum value of L_x occurs at u(x) ∈ Θ for each x ∈ S. Then the statistic u(X) is a maximum likelihood estimator of θ.
The method of maximum likelihood is intuitively appealing—we try to find the value of the parameter that would have most likely
produced the data we in fact observed.
Since the natural logarithm function is strictly increasing on (0, ∞), the maximum value of the likelihood function, if it exists, will
occur at the same points as the maximum value of the logarithm of the likelihood function.
If the maximum value of ln L_x occurs at u(x) ∈ Θ for each x ∈ S, then the statistic u(X) is a maximum likelihood estimator of θ.
The log-likelihood function is often easier to work with than the likelihood function (typically because the probability density function f_θ(x) has a product structure).
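As a small illustration of working with the log-likelihood (our own example, using the exponential distribution with rate r, density f(x) = r e^(−rx)), the maximizer found by calculus, r = n/∑xᵢ = 1/M, agrees with what a brute-force search over the log-likelihood finds:

```python
import math

def log_likelihood(rate, xs):
    # log of prod(rate * exp(-rate * x)) = n*log(rate) - rate * sum(xs)
    return len(xs) * math.log(rate) - rate * sum(xs)

xs = [0.5, 1.2, 0.3, 2.0, 0.9, 1.5]
# closed form: setting the derivative n/rate - sum(xs) to zero gives rate = 1/M
mle = len(xs) / sum(xs)
# a crude grid search over the log-likelihood finds the same point
grid = [i / 1000 for i in range(1, 5000)]
best = max(grid, key=lambda r: log_likelihood(r, xs))
print(mle, best)  # both near 0.9375
```

The sum structure of the log-likelihood is exactly the "product structure" advantage mentioned above: derivatives and numerical searches are much more stable on sums than on products of small densities.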
Vector of Parameters
An important special case is when θ = (θ₁, θ₂, …, θₖ) is a vector of k real parameters, so that Θ ⊆ ℝᵏ. In this case, the maximum likelihood problem is to maximize a function of several variables. If Θ is a continuous set, the methods of calculus can be used. If the maximum value of L_x occurs at a point θ in the interior of Θ, then L_x has a local maximum at θ. Therefore, assuming that the likelihood function is differentiable, we can find this point by solving
∂L_x(θ)/∂θᵢ = 0, i ∈ {1, 2, …, k}  (7.3.3)
or equivalently
∂ ln L_x(θ)/∂θᵢ = 0, i ∈ {1, 2, …, k}  (7.3.4)
On the other hand, the maximum value may occur at a boundary point of Θ, or may not exist at all.
Random Samples
The most important special case is when the data variables form a random sample from a distribution.
Suppose that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the distribution of a random variable X taking values in ℝ, with probability density function g_θ for θ ∈ Θ. Then X takes values in S = ℝⁿ, and the likelihood and log-likelihood functions for x = (x₁, x₂, …, xₙ) ∈ S are
L_x(θ) = ∏ᵢ₌₁ⁿ g_θ(xᵢ), θ ∈ Θ
ln L_x(θ) = ∑ᵢ₌₁ⁿ ln g_θ(xᵢ), θ ∈ Θ
Suppose that h : Θ → Λ is one-to-one, and let λ = h(θ) denote the new parameter. The likelihood function for λ at x ∈ S is
L̂_x(λ) = L_x[h⁻¹(λ)], λ ∈ Λ  (7.3.5)
More generally, suppose that h : Θ → Λ is not necessarily one-to-one. Define the likelihood function for λ at x ∈ S by
L̂_x(λ) = max{L_x(θ) : θ ∈ h⁻¹{λ}}, λ ∈ Λ  (7.3.6)
If v(x) ∈ Λ maximizes L̂_x for each x ∈ S, then V = v(X) is a maximum likelihood estimator of λ.
This definition extends the maximum likelihood method to cases where the probability density function is not completely
parameterized by the parameter of interest. The following theorem is known as the invariance property: if we can solve the
maximum likelihood problem for θ then we can solve the maximum likelihood problem for λ = h(θ) .
In the setting of the previous theorem, if U is a maximum likelihood estimator of θ , then V = h(U ) is a maximum likelihood
estimator of λ .
Proof
In the examples below, the following statistics will appear frequently:
M = (1/n) ∑ᵢ₌₁ⁿ Xᵢ  (7.3.7)
T² = (1/n) ∑ᵢ₌₁ⁿ (Xᵢ − M)²  (7.3.8)
Of course, M is the sample mean, and T² is the biased version of the sample variance. These statistics will also sometimes occur as maximum likelihood estimators. Another statistic that will occur in some of the examples below is
M₂ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ²  (7.3.9)
the second-order sample mean. As always, be sure to try the derivations yourself before looking at the solutions.
g(x) = p^x (1 − p)^(1−x), x ∈ {0, 1}  (7.3.10)
Thus, X is a sequence of independent indicator variables with P(Xᵢ = 1) = p for each i. In the usual language of reliability, Xᵢ is the outcome of trial i, where 1 means success and 0 means failure. Let Y = ∑ᵢ₌₁ⁿ Xᵢ denote the number of successes, so that the proportion of successes (the sample mean) is M = Y/n. Recall that Y has the binomial distribution with parameters n and p.
The sample mean M is the maximum likelihood estimator of p on the parameter space (0, 1).
Proof
Recall that M is also the method of moments estimator of p. It's always nice when two different estimation procedures yield the
same result. Next let's look at the same problem, but with a much restricted parameter space.
Suppose now that p takes values in the parameter space {1/2, 1}. Then the maximum likelihood estimator of p is the statistic
U = 1 if Y = n, and U = 1/2 if Y < n  (7.3.14)
1. E(U) = 1 if p = 1, and E(U) = 1/2 + (1/2)^(n+1) if p = 1/2.
4. U is consistent.
Proof
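The expected value in part 1 can be checked by direct computation (our own sketch, not part of the text): when p = 1/2, the event Y = n has probability (1/2)^n, and U equals 1 on that event and 1/2 otherwise.

```python
def expected_U(n):
    # E(U) when p = 1/2: U = 1 with probability (1/2)^n, else U = 1/2
    p_all_successes = 0.5 ** n
    return p_all_successes * 1.0 + (1 - p_all_successes) * 0.5

# agrees with the closed form 1/2 + (1/2)^(n+1)
for n in [1, 2, 5, 10]:
    assert abs(expected_U(n) - (0.5 + 0.5 ** (n + 1))) < 1e-12
print(expected_U(5))  # 0.515625
```

Note that E(U) → 1/2 as n → ∞ when p = 1/2, which is the asymptotic-unbiasedness part of the story.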
Note that the Bernoulli distribution in the last exercise would model a coin that is either fair or two-headed. The last two exercises
show that the maximum likelihood estimator of a parameter, like the solution to any maximization problem, depends critically on
the domain.
Suppose that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the Bernoulli distribution with unknown success parameter p ∈ (0, 1). Find the maximum likelihood estimator of p(1 − p), which is the variance of the sampling distribution.
Answer
g(x) = p(1 − p)^(x−1), x ∈ ℕ₊  (7.3.17)
The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials.
Suppose that X = (X₁, X₂, …, Xₙ) is a random sample from the geometric distribution with unknown parameter p ∈ (0, 1). The maximum likelihood estimator of p is U = 1/M.
Recall that U is also the method of moments estimator of p. It's always reassuring when two different estimation procedures
produce the same estimator.
The Negative Binomial Distribution
More generally, the negative binomial distribution on ℕ with shape parameter k ∈ (0, ∞) and success parameter p ∈ (0, 1) has probability density function
g(x) = C(x + k − 1, k − 1) p^k (1 − p)^x, x ∈ ℕ  (7.3.20)
If k is a positive integer, then this distribution governs the number of failures before the k th success in a sequence of Bernoulli
trials with success parameter p. However, the distribution makes sense for general k ∈ (0, ∞) . The negative binomial distribution
is studied in more detail in the chapter on Bernoulli Trials.
Suppose that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the negative binomial distribution on ℕ with known shape parameter k and unknown success parameter p ∈ (0, 1). The maximum likelihood estimator of p is
U = k/(k + M)  (7.3.21)
Proof
Once again, this is the same as the method of moments estimator of p with k known.
The Poisson distribution is named for Simeon Poisson and is widely used to model the number of random “points” in a region of
time or space. The parameter r is proportional to the size of the region. The Poisson distribution is studied in more detail in the
chapter on the Poisson process.
Suppose that X = (X₁, X₂, …, Xₙ) is a random sample from the Poisson distribution with unknown parameter r ∈ (0, ∞). The maximum likelihood estimator of r is the sample mean M.
Recall that for the Poisson distribution, the parameter r is both the mean and the variance. Thus M is also the method of moments estimator of r. We showed in the introductory section that M has smaller mean square error than S², although both are unbiased.
Suppose that X = (X₁, X₂, …, Xₙ) is a random sample from the Poisson distribution with parameter r ∈ (0, ∞), and let p = P(X = 0) = e^(−r). Find the maximum likelihood estimator of p in two ways:
1. Directly, by finding the likelihood function corresponding to the parameter p.
2. By using the result of the last exercise and the invariance property.
Answer
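By the invariance property, the maximum likelihood estimator of p = e^(−r) is simply e^(−M). The sketch below is our own simulation of this (the Poisson sampler uses Knuth's method, since Python's standard library has none):

```python
import math
import random

def poisson(lam, rng):
    # Knuth's method: count uniforms until their product drops below e^-lam
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= L:
            return k
        k += 1

rng = random.Random(8)
r, n = 2.0, 20000
xs = [poisson(r, rng) for _ in range(n)]
M = sum(xs) / n        # maximum likelihood estimate of r
p_hat = math.exp(-M)   # maximum likelihood estimate of p, by invariance
print(M, p_hat)        # near 2 and e^-2 (about 0.1353)
```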
g(x) = (1/(√(2π) σ)) exp[−(1/2)((x − μ)/σ)²], x ∈ ℝ  (7.3.26)
The normal distribution is often used to model physical quantities subject to small, random errors, and is studied in more detail in the chapter on Special Distributions.
Suppose that X = (X₁, X₂, …, Xₙ) is a random sample from the normal distribution with unknown mean μ ∈ ℝ and variance σ² ∈ (0, ∞). The maximum likelihood estimators of μ and σ² are M and T², respectively.
Proof
Of course, M and T² are also the method of moments estimators of μ and σ², respectively.
Run the normal estimation experiment 1000 times for several values of the sample size n, the mean μ, and the variance σ². For the parameter σ², compare the maximum likelihood estimator T² with the standard sample variance S². Which estimator seems to work better?
Suppose again that X = (X₁, X₂, …, Xₙ) is a random sample from the normal distribution with unknown mean μ ∈ ℝ and unknown variance σ² ∈ (0, ∞). Find the maximum likelihood estimator of μ² + σ², which is the second moment about 0 for the sampling distribution.
The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in
more detail in the chapter on Special Distributions
Suppose that X = (X1, X2, …, Xn) is a random sample from the gamma distribution with known shape parameter k and
unknown scale parameter b ∈ (0, ∞). The maximum likelihood estimator of b is Vk = M/k.
Proof
Recall that if the shape parameter k is also unknown, the method of moments estimator of b is V = T²/M.
Run the gamma estimation experiment 1000 times for several values of the sample size n, shape parameter k, and scale
parameter b. In each case, compare the method of moments estimator V of b when k is unknown with the method of moments
and maximum likelihood estimator Vk of b when k is known. Which estimator seems to work better in terms of mean square
error?
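The known-shape estimator Vk = M/k is simple enough to check directly by simulation. A minimal sketch (names and parameter choices are illustrative):

```python
import random

def mle_gamma_scale(sample, k):
    """MLE of the scale parameter b for a gamma sample with known shape k:
    V_k = M / k, since E(X) = k * b."""
    return sum(sample) / (len(sample) * k)

rng = random.Random(11)
k, b = 3.0, 2.0
sample = [rng.gammavariate(k, b) for _ in range(10000)]
b_hat = mle_gamma_scale(sample, k)
print(b_hat)  # close to b = 2 for a large sample
```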
The Beta Distribution
The beta distribution is often used to model random proportions and other random variables that take values in bounded intervals. It
is studied in more detail in the chapter on Special Distributions.
Suppose that X = (X1, X2, …, Xn) is a random sample from the beta distribution with unknown left parameter a ∈ (0, ∞)
and right parameter b = 1. The maximum likelihood estimator of a is W = −n/ln(X1 X2 ⋯ Xn).
Proof
Recall that when b = 1, the method of moments estimator of a is U1 = M/(1 − M), but when b ∈ (0, ∞) is also unknown, the
method of moments estimator of a is U2 = M(M − M₂)/(M₂ − M²), where M₂ = (1/n) ∑_{i=1}^n X_i² is the second-order sample
moment. When b = 1, which estimator is better, the method of moments estimator or the maximum likelihood estimator?
In the beta estimation experiment, set b = 1. Run the experiment 1000 times for several values of the sample size n and the
parameter a. In each case, compare the estimators U1, U2, and W. Which estimator seems to work better in terms of mean
square error?
Finally, note that 1/W is the sample mean for a random sample of size n from the distribution of − ln X . This distribution is the
exponential distribution with rate a .
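Since −ln X is exponential with rate a when b = 1, both estimators are easy to simulate. The sketch below compares the empirical mean square error of U1 = M/(1 − M) and W = −n/ln(X1 X2 ⋯ Xn); the function names, seed, and parameter values are illustrative assumptions:

```python
import math
import random

def mom_estimator(sample):
    """Method of moments estimator U1 = M/(1 - M) for beta(a, 1)."""
    m = sum(sample) / len(sample)
    return m / (1 - m)

def mle_estimator(sample):
    """Maximum likelihood estimator W = -n / ln(x1 x2 ... xn) for beta(a, 1)."""
    return -len(sample) / sum(math.log(x) for x in sample)

rng = random.Random(5)
a, n, reps = 0.5, 50, 2000
mse_u = mse_w = 0.0
for _ in range(reps):
    # inverse-CDF sampling: the beta(a, 1) CDF is x^a, so X = U^(1/a)
    s = [(1.0 - rng.random()) ** (1.0 / a) for _ in range(n)]
    mse_u += (mom_estimator(s) - a) ** 2 / reps
    mse_w += (mle_estimator(s) - a) ** 2 / reps
print(mse_u, mse_w)  # the MLE W typically shows the smaller mean square error
```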
The Pareto Distribution
Recall that the Pareto distribution with shape parameter a > 0 and scale parameter b > 0 has probability density function
g(x) = a b^a / x^(a+1), b ≤ x < ∞ (7.3.37)
The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution often used to model income and certain other
types of random variables. It is studied in more detail in the chapter on Special Distributions.
Suppose that X = (X1, X2, …, Xn) is a random sample from the Pareto distribution with unknown shape parameter
a ∈ (0, ∞) and scale parameter b ∈ (0, ∞). The maximum likelihood estimator of b is X(1) = min{X1, X2, …, Xn}, the
first order statistic.
Proof
Open the Pareto estimation experiment. Run the experiment 1000 times for several values of the sample size n and the
parameters a and b . Compare the method of moments and maximum likelihood estimators. Which estimators seem to work
better in terms of bias and mean square error?
Suppose that X = (X1, X2, …, Xn) is a random sample from the Pareto distribution with unknown shape parameter
a ∈ (0, ∞) and known scale parameter b ∈ (0, ∞). The maximum likelihood estimator of a is
U = n / (∑_{i=1}^n ln X_i − n ln b) = n / ∑_{i=1}^n (ln X_i − ln b) (7.3.42)
Proof
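The two Pareto results combine into a simple joint estimation recipe: estimate the scale by the sample minimum, then plug that into the shape formula. A hedged sketch (the sampler and the chosen parameter values are illustrative assumptions):

```python
import math
import random

def pareto_mle(sample):
    """Joint MLEs for the Pareto distribution: the scale estimate is the sample
    minimum X_(1), and the shape estimate is n / sum(ln x_i - ln b_hat)."""
    b_hat = min(sample)
    n = len(sample)
    a_hat = n / sum(math.log(x) - math.log(b_hat) for x in sample)
    return a_hat, b_hat

rng = random.Random(2)
a, b, n = 3.0, 2.0, 5000
# inverse-CDF sampling: F(x) = 1 - (b/x)^a, so X = b / U^(1/a)
sample = [b / (1.0 - rng.random()) ** (1.0 / a) for _ in range(n)]
a_hat, b_hat = pareto_mle(sample)
print(a_hat, b_hat)
```

Note that b_hat is always at least b, reflecting the positive bias of the minimum-based estimator.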
Uniform Distributions
In this section we will study estimation problems related to the uniform distribution that are a good source of insight and
counterexamples. In a sense, our first estimation problem is the continuous analogue of an estimation problem studied in the
section on Order Statistics in the chapter Finite Sampling Models. Suppose that X = (X1, X2, …, Xn) is a random sample from
the uniform distribution on the interval [0, h], where h ∈ (0, ∞) is an unknown parameter. Thus, the sampling distribution has
probability density function
g(x) = 1/h, x ∈ [0, h] (7.3.45)
The method of moments estimator of h is U = 2M. The estimator U satisfies the following properties:
1. U is unbiased.
2. var(U) = h²/(3n), so U is consistent.
The maximum likelihood estimator of h is X(n) = max{X1, X2, …, Xn}, the nth order statistic. The estimator X(n) satisfies
the following properties:
1. E(X(n)) = n h/(n + 1)
2. bias(X(n)) = −h/(n + 1), so that X(n) is negatively biased but asymptotically unbiased.
3. var(X(n)) = n h²/[(n + 1)²(n + 2)]
4. mse(X(n)) = 2h²/[(n + 1)(n + 2)], so that X(n) is consistent.
Proof
Let V = [(n + 1)/n] X(n). The estimator V satisfies the following properties:
1. V is unbiased.
2. var(V) = h²/[n(n + 2)], so that V is consistent.
3. The asymptotic relative efficiency of V to U is infinite.
Proof
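The ranking of the three estimators U = 2M, X(n), and V can be seen in a small simulation; the sketch below (seed and parameter values are illustrative) estimates each mean square error empirically and can be compared with the theoretical values h²/(3n), 2h²/[(n+1)(n+2)], and h²/[n(n+2)]:

```python
import random

def uniform_h_estimators(sample):
    """Three estimators of h for a sample from the uniform distribution on [0, h]:
    the method of moments estimator U = 2M, the MLE X_(n) = max(sample),
    and the bias-corrected version V = (n+1)/n * X_(n)."""
    n = len(sample)
    m = sum(sample) / n
    xn = max(sample)
    return 2 * m, xn, (n + 1) / n * xn

rng = random.Random(9)
h, n, reps = 10.0, 20, 5000
mse = [0.0, 0.0, 0.0]
for _ in range(reps):
    s = [rng.uniform(0, h) for _ in range(n)]
    for j, est in enumerate(uniform_h_estimators(s)):
        mse[j] += (est - h) ** 2 / reps
# theory for h = 10, n = 20: about 1.67 for U, 0.43 for X_(n), 0.23 for V
print(mse)
```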
The last part shows that the unbiased version V of the maximum likelihood estimator is a much better estimator than the method of
moments estimator U. In fact, an estimator such as V, whose mean square error decreases on the order of 1/n², is called super
efficient. Now, having found a really good estimator, let's see if we can find a really bad one. A natural candidate is an estimator
based on X(1) = min{X1, X2, …, Xn}, the first order statistic. The next result will make the computations very easy.
Proof
Proof
Run the uniform estimation experiment 1000 times for several values of the sample size n and the parameter a . In each case,
compare the empirical bias and mean square error of the estimators with their theoretical values. Rank the estimators in terms
of empirical mean square error.
Our next series of exercises will show that the maximum likelihood estimator is not necessarily unique. Suppose that
X = (X1, X2, …, Xn) is a random sample from the uniform distribution on the interval [a, a + 1], where a ∈ ℝ is an unknown
parameter. The method of moments estimator of a is U = M − 1/2. The estimator U satisfies the following properties:
1. U is unbiased.
2. var(U) = 1/(12n), so U is consistent.
For completeness, let's consider the full estimation problem. Suppose that X = (X1, X2, …, Xn) is a random sample of size n
from the uniform distribution on [a, a + h] where a ∈ ℝ and h ∈ (0, ∞) are both unknown. Here's the result from the last section:
Let U and V denote the method of moments estimators of a and h, respectively. Then
U = M − √3 T, V = 2√3 T (7.3.48)
where M = (1/n) ∑_{i=1}^n X_i is the sample mean, and T² = (1/n) ∑_{i=1}^n (X_i − M)² is the biased version of the sample variance.
It should come as no surprise at this point that the maximum likelihood estimators are functions of the largest and smallest order
statistics.
The maximum likelihood estimators of a and h are U = X(1) and V = X(n) − X(1), respectively.
1. E(U) = a + h/(n + 1), so U is positively biased and asymptotically unbiased.
2. E(V) = h(n − 1)/(n + 1), so V is negatively biased and asymptotically unbiased.
3. var(U) = h² n/[(n + 1)²(n + 2)], so U is consistent.
4. var(V) = 2(n − 1) h²/[(n + 1)²(n + 2)], so V is consistent.
Proof
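Both pairs of estimators are one-liners in code. The sketch below (function names, seed, and parameter values are illustrative) computes the method of moments and maximum likelihood estimates for a single large sample:

```python
import math
import random

def uniform_mom(sample):
    """Method of moments estimators of (a, h) for uniform [a, a + h]:
    U = M - sqrt(3)*T and V = 2*sqrt(3)*T, where T^2 is the biased sample variance."""
    n = len(sample)
    m = sum(sample) / n
    t = math.sqrt(sum((x - m) ** 2 for x in sample) / n)
    return m - math.sqrt(3) * t, 2 * math.sqrt(3) * t

def uniform_mle(sample):
    """Maximum likelihood estimators: U = X_(1) and V = X_(n) - X_(1)."""
    return min(sample), max(sample) - min(sample)

rng = random.Random(4)
a, h, n = 5.0, 2.0, 2000
s = [rng.uniform(a, a + h) for _ in range(n)]
a_mom, h_mom = uniform_mom(s)
a_mle, h_mle = uniform_mle(s)
print(a_mom, h_mom, a_mle, h_mle)
```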
The Hypergeometric Model
So far, the observed variables have formed a random sample from a distribution. However, maximum likelihood is a very general method that does not require the observation variables to be
independent or identically distributed. In the hypergeometric model, we have a population of N objects with r of the objects type 1
and the remaining N − r objects type 0. The population size N is a positive integer, and the type 1 size r is a nonnegative integer
with r ≤ N. These are the basic parameters, and typically one or both is unknown. Here are some typical examples:
1. The objects are devices, classified as good or defective.
2. The objects are persons, classified as female or male.
3. The objects are voters, classified as for or against a particular candidate.
4. The objects are wildlife of a particular type, either tagged or untagged.
We sample n objects from the population at random, without replacement. Let X_i be the type of the ith object selected, so that our
sequence of observed variables is X = (X1, X2, …, Xn). The variables are identically distributed indicator variables, with
P(X_i = 1) = r/N for each i ∈ {1, 2, …, n}, but are dependent since the sampling is without replacement. The number of type 1
objects in the sample is Y = ∑_{i=1}^n X_i. This statistic has the hypergeometric distribution with parameters N, r, and n, and has
probability density function P(Y = y) = C(r, y) C(N − r, n − y)/C(N, n), where C(m, j) denotes the binomial coefficient.
Recall the falling power notation: x^(k) = x(x − 1) ⋯ (x − k + 1) for x ∈ ℝ and k ∈ ℕ. The hypergeometric model is studied in
more detail in the chapter on Finite Sampling Models.
As above, let X = (X1, X2, …, Xn) be the observed variables in the hypergeometric model with parameters N and r. Then
1. The maximum likelihood estimator of r with N known is U = ⌊N M⌋ = ⌊N Y/n⌋.
2. The maximum likelihood estimator of N with r known is V = ⌊r/M⌋ = ⌊r n/Y⌋ if Y > 0.
Proof
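The two floor-function estimators are trivial to implement, and the capture-recapture setting makes a natural demonstration. The scenario and all numbers below are illustrative, not from the text:

```python
import math
import random

def mle_r(N, n, y):
    """MLE of the type 1 count r when N is known: floor(N * y / n)."""
    return math.floor(N * y / n)

def mle_N(r, n, y):
    """MLE of the population size N when r is known (capture-recapture);
    defined when y > 0."""
    return math.floor(r * n / y)

# capture-recapture sketch: r = 100 tagged animals in a population of N = 1000;
# we recapture n = 200 without replacement and count the tagged ones
rng = random.Random(8)
N, r, n = 1000, 100, 200
population = [1] * r + [0] * (N - r)
y = sum(rng.sample(population, n))
print(y, mle_N(r, n, y), mle_r(N, n, y))
```

When the observed proportion of tagged animals equals r/N exactly (y = 20 here), both estimators recover the true parameter values.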
In the reliability example (1), we might typically know N and would be interested in estimating r. In the wildlife example (4), we
would typically know r and would be interested in estimating N . This example is known as the capture-recapture model.
Clearly there is a close relationship between the hypergeometric model and the Bernoulli trials model above. In fact, if the
sampling is with replacement, the Bernoulli trials model with p = r/N would apply rather than the hypergeometric model. In
addition, if the population size N is large compared to the sample size n , the hypergeometric model is well approximated by the
Bernoulli trials model, again with p = r/N .
This page titled 7.3: Maximum Likelihood is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
7.4: Bayesian Estimation
Basic Theory
The General Method
Suppose again that we have an observable random variable X for an experiment, taking values in a set S. Suppose also that the
distribution of X depends on a parameter θ taking values in a parameter space T. Of course, our data variable X is almost always
vector-valued, so that typically S ⊆ ℝ^n for some n ∈ ℕ₊. Depending on the nature of the sample space S, the distribution of X
may be discrete or continuous. The parameter θ may also be vector-valued, so that typically T ⊆ ℝ^k for some k ∈ ℕ₊.
In Bayesian analysis, named for the famous Thomas Bayes, we model the deterministic, but unknown parameter θ with a random
variable Θ that has a specified distribution on the parameter space T . Depending on the nature of the parameter space, this
distribution may also be either discrete or continuous. It is called the prior distribution of Θ and is intended to reflect our
knowledge of the parameter θ , before we gather data. After observing X = x ∈ S , we then use Bayes' theorem, to compute the
conditional distribution of Θ given X = x . This distribution is called the posterior distribution of Θ, and is an updated
distribution, given the information in the data. Here is the mathematical description, stated in terms of probability density
functions.
Suppose that the prior distribution of Θ on T has probability density function h , and that given Θ = θ ∈ T , the conditional
probability density function of X on S is f (⋅ ∣ θ) . Then the probability density function of the posterior distribution of Θ
given X = x ∈ S is
h(θ ∣ x) = h(θ) f(x ∣ θ)/f(x), θ ∈ T (7.4.1)
where the function in the denominator is defined as follows, in the discrete and continuous cases, respectively:
f(x) = ∑_{θ∈T} h(θ) f(x ∣ θ), f(x) = ∫_T h(θ) f(x ∣ θ) dθ
Proof
For x ∈ S , note that f (x) is simply the normalizing constant for the function θ ↦ h(θ)f (x ∣ θ) . It may not be necessary to
explicitly compute f (x), if one can recognize the functional form of θ ↦ h(θ)f (x ∣ θ) as that of a known distribution. This will
indeed be the case in several of the examples explored below.
If the parameter space T has finite measure c (counting measure in the discrete case or Lebesgue measure in the continuous case),
then one possible prior distribution is the uniform distribution on T, with probability density function h(θ) = 1/c for θ ∈ T. This
distribution reflects no prior knowledge about the parameter, and so is called the non-informative prior distribution.
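When the parameter space is finite, the general method is just a normalization step. The sketch below (function names and the toy two-point parameter space are illustrative assumptions) applies Bayes' theorem on a grid:

```python
def grid_posterior(prior, likelihood, x, thetas):
    """Bayes' theorem on a finite parameter grid: the posterior h(theta | x)
    is proportional to h(theta) * f(x | theta); f(x) is the normalizing sum."""
    weights = [prior(t) * likelihood(x, t) for t in thetas]
    fx = sum(weights)
    return [w / fx for w in weights]

# toy example: one Bernoulli observation with two candidate values of p
thetas = [0.25, 0.75]
prior = lambda t: 0.5                      # uniform, non-informative prior
lik = lambda x, t: t if x == 1 else 1 - t  # Bernoulli pdf g(x | p)
post = grid_posterior(prior, lik, 1, thetas)
print(post)  # [0.25, 0.75]: observing a success shifts mass toward p = 0.75
```

Note that f(x) never has to be computed separately; it falls out as the sum of the unnormalized weights.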
Random Samples
Of course, an important and essential special case occurs when X = (X1, X2, …, Xn) is a random sample of size n from the
distribution of a basic variable X. Specifically, suppose that X takes values in a set R and has probability density function g(⋅ ∣ θ)
for a given θ ∈ T. In this case, S = R^n and the probability density function f(⋅ ∣ θ) of X given θ is
f(x1, x2, …, xn ∣ θ) = g(x1 ∣ θ) g(x2 ∣ θ) ⋯ g(xn ∣ θ), (x1, x2, …, xn) ∈ S
Real Parameters
Suppose that θ is a real-valued parameter, so that T ⊆ ℝ. Here is our main definition.
7.4.1 https://stats.libretexts.org/@go/page/10192
The Bayesian estimator of θ based on X is U = E(Θ ∣ X), the mean of the posterior distribution. In the discrete case,
E(Θ ∣ X = x) = ∑_{θ∈T} θ h(θ ∣ x), x ∈ S (7.4.4)
Recall that E(Θ ∣ X) is a function of X and, among all functions of X, is closest to Θ in the mean square sense. Of course, once
we collect the data and observe X = x , the Bayesian estimate of θ is E(Θ ∣ X = x) . As always, the term estimator refers to a
random variable, before the data are collected, and the term estimate refers to an observed value of the random variable after the
data are collected. The definitions of bias and mean square error are as before, but now conditioned on Θ = θ ∈ T .
As before, bias(U ∣ θ) = E(U ∣ θ) − θ and mse(U ∣ θ) = var(U ∣ θ) + bias²(U ∣ θ). Suppose now that we observe the random
variables (X1, X2, X3, …) sequentially, and we compute the Bayes estimator Un of θ based on (X1, X2, …, Xn) for each
n ∈ ℕ₊. Again, the most common case is when we are sampling from a distribution, so that the sequence is independent and
identically distributed (given θ). We have the natural asymptotic properties that we have seen before.
Often we cannot construct unbiased Bayesian estimators, but we do hope that our estimators are at least asymptotically unbiased
and consistent. It turns out that the sequence of Bayesian estimators Un is a martingale. The theory of martingales provides some
powerful tools for studying these estimators.
From the Bayesian perspective, the posterior distribution of Θ given the data X = x is of primary importance. Point estimates of θ
derived from this distribution are of secondary importance. In particular, the mean square error function
u ↦ E[(Θ − u)² ∣ X = x], minimized as we have noted at E(Θ ∣ X = x), is not the only loss function that can be used.
(Although it's the only one that we consider.) Another possible loss function, among many, is the mean absolute error function
u ↦ E(|Θ − u| ∣ X = x), which we know is minimized at the median(s) of the posterior distribution.
Conjugate Families
Often, the prior distribution of Θ is itself a member of a parametric family, with the parameters specified to reflect our prior
knowledge of θ . In many important special cases, the parametric family can be chosen so that the posterior distribution of Θ given
X = x belongs to the same family for each x ∈ S . In such a case, the family of distributions of Θ is said to be conjugate to the
family of distributions of X. Conjugate families are nice from a computational point of view, since we can often compute the
posterior distribution through a simple formula involving the parameters of the family, without having to use Bayes' theorem
directly. Similarly, in the case that the parameter is real valued, we can often compute the Bayesian estimator through a simple
formula involving the parameters of the conjugate family.
Special Distributions
The Bernoulli Distribution
Suppose that X = (X1, X2, …) is a sequence of independent variables, each having the Bernoulli distribution with unknown
success parameter p ∈ (0, 1). In short, X is a sequence of Bernoulli trials, given p. In the usual language of reliability, X_i = 1
means success on trial i and X_i = 0 means failure on trial i. Recall that given p, the Bernoulli distribution has probability density
function
g(x ∣ p) = p^x (1 − p)^(1−x), x ∈ {0, 1} (7.4.6)
Note that the number of successes in the first n trials is Yn = ∑_{i=1}^n X_i. Given p, the random variable Yn has the binomial distribution
with parameters n and p.
Suppose now that we model p with a random variable P that has a prior beta distribution with left parameter a ∈ (0, ∞) and right
parameter b ∈ (0, ∞), chosen to reflect our prior knowledge of p. Recall that the beta distribution takes values in (0, 1)
and has mean a/(a + b). For example, if we know nothing about p, we might let a = b = 1, so that the prior distribution is
uniform on the parameter space (0, 1) (the non-informative prior). On the other hand, if we believe that p is about 2/3, we might let
a = 4 and b = 2, so that the prior distribution is unimodal, with mean 2/3. As a random process, the sequence X with p randomized
by P is known as the beta-Bernoulli process, and is very interesting on its own, outside of the context of Bayesian estimation.
The posterior distribution of P given Xn = (X1, X2, …, Xn) is beta with left parameter a + Yn and right parameter b + (n − Yn).
Proof
Thus, the beta distribution is conjugate to the Bernoulli distribution. Note also that the posterior distribution depends on the data
vector Xn only through the number of successes Yn. This is true because Yn is a sufficient statistic for p. In particular, note that the
left beta parameter is increased by the number of successes Yn and the right beta parameter is increased by the number of failures
n − Yn.
The Bayesian estimator of p based on Xn is
Un = (a + Yn)/(a + b + n) (7.4.10)
Proof
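The conjugate update and the resulting posterior-mean estimator fit in a few lines of code. A hedged sketch (function names, seed, and parameter values are illustrative):

```python
import random

def beta_bernoulli_update(a, b, sample):
    """Posterior beta parameters for Bernoulli data under a beta(a, b) prior:
    the left parameter gains the successes, the right parameter the failures."""
    y = sum(sample)
    return a + y, b + (len(sample) - y)

def bayes_estimate(a, b, sample):
    """Bayes estimator U_n = (a + Y_n)/(a + b + n), the posterior mean."""
    a_post, b_post = beta_bernoulli_update(a, b, sample)
    return a_post / (a_post + b_post)

rng = random.Random(6)
p, n = 0.3, 1000
sample = [1 if rng.random() < p else 0 for _ in range(n)]
u = bayes_estimate(4, 2, sample)  # prior beta(4, 2), prior mean 2/3
m = sum(sample) / n               # maximum likelihood estimate
print(u, m)  # u is pulled slightly from m toward the prior mean
```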
In the beta coin experiment, set n = 20 and p = 0.3 , and set a = 4 and b = 2 . Run the simulation 100 times and note the
estimate of p and the shape and location of the posterior probability density function of p on each run.
For n ∈ ℕ₊,
bias(Un ∣ p) = [a(1 − p) − b p]/(a + b + n), p ∈ (0, 1) (7.4.11)
Note also that we cannot choose a and b to make Un unbiased, since such a choice would involve the true value of p, which we do
not know.
In the beta coin experiment, vary the parameters and note the change in the bias. Now set n = 20 and p = 0.8 , and set a = 2
and b = 6 . Run the simulation 1000 times. Note the estimate of p and the shape and location of the posterior probability
density function of p on each update. Compare the empirical bias to the true bias.
For n ∈ ℕ₊,
mse(Un ∣ p) = {p[n − 2a(a + b)] + p²[(a + b)² − n] + a²}/(a + b + n)², p ∈ (0, 1) (7.4.13)
In the beta coin experiment, vary the parameters and note the change in the mean square error. Now set n = 10 and p = 0.7 ,
and set a = b = 1 . Run the simulation 1000 times. Note the estimate of p and the shape and location of the posterior
probability density function of p on each update. Compare the empirical mean square error to the true mean square error.
Interestingly, we can choose a and b so that Un has mean square error that is independent of the unknown parameter p. If
a = b = √n/2 then
mse(Un ∣ p) = n/[4(n + √n)²], p ∈ (0, 1) (7.4.16)
In the beta coin experiment, set n = 36 and a = b = 3 . Vary p and note that the mean square error does not change. Now set
p = 0.8 and run the simulation 1000 times. Note the estimate of p and the shape and location of the posterior probability
density function on each update. Compare the empirical bias and mean square error to the true values.
Recall that the method of moments estimator and the maximum likelihood estimator of p (on the interval (0, 1)) are both the sample
mean (the proportion of successes):
Mn = Yn/n = (1/n) ∑_{i=1}^n X_i (7.4.17)
This estimator is unbiased and has variance p(1 − p)/n. To see the connection between the estimators, note from (6) that
Un = [(a + b)/(a + b + n)] [a/(a + b)] + [n/(a + b + n)] Mn (7.4.18)
So Un is a weighted average of a/(a + b) (the mean of the prior distribution) and Mn (the maximum likelihood estimator).
Suppose now that the parameter space is {1/2, 1}. This setup corresponds to the tossing of a coin that is either fair or two-headed, but we don't know which. We model p
with a random variable P that has the prior probability density function h given by h(1) = a, h(1/2) = 1 − a, where a ∈ (0, 1) is
chosen to reflect our prior knowledge of the probability that the coin is two-headed. If we are completely ignorant, we might let
a = 1/2 (the non-informative prior). If we think the coin is more likely to be two-headed, we might let a = 2/3. Again let
Yn = ∑_{i=1}^n X_i for n ∈ ℕ₊.
The posterior probability density function of P given Xn = (X1, X2, …, Xn) is as follows:
1. h(1 ∣ Xn) = 2^n a/[2^n a + (1 − a)] if Yn = n, and h(1 ∣ Xn) = 0 if Yn < n
2. h(1/2 ∣ Xn) = (1 − a)/[2^n a + (1 − a)] if Yn = n, and h(1/2 ∣ Xn) = 1 if Yn < n
Proof
Now let
pn = [2^(n+1) a + (1 − a)]/[2^(n+1) a + 2(1 − a)] (7.4.23)
The Bayesian estimator Un of p satisfies
1. Un = pn if Yn = n
2. Un = 1/2 if Yn < n
Proof
If we observe Yn < n then Un gives the correct answer 1/2. This certainly makes sense since we know that we do not have the two-
headed coin. On the other hand, if we observe Yn = n then we are not certain which coin we have, and the Bayesian estimate pn is
not even in the parameter space! But note that pn → 1 as n → ∞ exponentially fast. Next let's compute the bias and mean-square
error for a given p ∈ {1/2, 1}.
For n ∈ ℕ₊,
1. bias(Un ∣ 1) = pn − 1
2. bias(Un ∣ 1/2) = (1/2)^n (pn − 1/2)
Note that Un is negatively biased given p = 1. Given p = 1/2, Un is positively biased for sufficiently large n
(depending on a).
For n ∈ ℕ₊,
1. mse(Un ∣ 1) = (pn − 1)²
2. mse(Un ∣ 1/2) = (1/2)^n (pn − 1/2)²
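The two-point parameter space makes the posterior and the Bayes estimator fully explicit, so they are easy to encode directly. The function names below are illustrative:

```python
def coin_posterior_two_headed(a, n, all_heads):
    """Posterior probability that the coin is two-headed (p = 1), given prior
    h(1) = a and n tosses: zero unless every toss was a head."""
    if not all_heads:
        return 0.0
    return 2 ** n * a / (2 ** n * a + (1 - a))

def bayes_estimate(a, n, all_heads):
    """Bayes estimator of p: p_n if all n tosses are heads, 1/2 otherwise."""
    if not all_heads:
        return 0.5
    return (2 ** (n + 1) * a + (1 - a)) / (2 ** (n + 1) * a + 2 * (1 - a))

# non-informative prior a = 1/2: ten straight heads make the two-headed
# coin overwhelmingly likely, and the point estimate is just below 1
print(coin_posterior_two_headed(0.5, 10, True))  # 1024/1025
print(bayes_estimate(0.5, 10, True))
print(bayes_estimate(0.5, 10, False))            # 0.5
```

Note how a single tail collapses the posterior onto p = 1/2, while all heads can never push the estimate all the way to 1.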
The Geometric Distribution
Suppose that X = (X1, X2, …) is a sequence of random variables each having the geometric distribution on ℕ₊
with unknown success parameter p ∈ (0, 1). Recall that these variables can be interpreted as the number of trials between
successive successes in a sequence of Bernoulli trials. Given p, the geometric distribution has probability density function
g(x ∣ p) = p(1 − p)^(x−1), x ∈ ℕ₊ (7.4.29)
Once again for n ∈ ℕ₊, let Yn = ∑_{i=1}^n X_i. In this setting, Yn is the trial number of the nth success, and given p, has the negative
binomial distribution with parameters n and p. As before, we model p with a random variable P having a prior beta distribution
with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞).
The posterior distribution of P given Xn = (X1, X2, …, Xn) is beta with left parameter a + n and right parameter
b + (Yn − n).
Proof
Thus, the beta distribution is conjugate to the geometric distribution. Moreover, note that in the posterior beta distribution, the left
parameter is increased by the number of successes n while the right parameter is increased by the number of failures Yn − n, just as
in the Bernoulli model. In particular, the posterior left parameter is deterministic and depends on the data only through the sample
size n.
The Bayesian estimator of p based on Xn is
Vn = (a + n)/(a + b + Yn) (7.4.32)
Proof
Recall that the method of moments estimator of p, and the maximum likelihood estimator of p on the interval (0, 1), are both
Wn = 1/Mn = n/Yn. To see the connection between the estimators, note from (19) that
1/Vn = [a/(a + n)] [(a + b)/a] + [n/(a + n)] (1/Wn) (7.4.33)
So 1/Vn is a weighted average of (a + b)/a (the reciprocal of the mean of the prior distribution) and 1/Wn (the reciprocal of the
maximum likelihood estimator).
The Poisson Distribution
Suppose that X = (X1, X2, …) is a sequence of random variables each having the Poisson distribution with unknown parameter
λ ∈ (0, ∞). Recall that the Poisson distribution is often used to model the number of "random points" in a region of time or space
and is studied in more detail in the chapter on the Poisson Process. The distribution is named for the inimitable Simeon Poisson and
given λ, has probability density function
g(x ∣ λ) = e^(−λ) λ^x/x!, x ∈ ℕ (7.4.34)
As before, for n ∈ ℕ₊ let Yn = ∑_{i=1}^n X_i. Given λ, the random variable Yn also has a Poisson distribution, but with parameter nλ.
Suppose now that we model λ with a random variable Λ having a prior gamma distribution with shape parameter k ∈ (0, ∞) and
rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of λ. Thus the prior probability density
function of Λ is
h(λ) = [r^k/Γ(k)] λ^(k−1) e^(−rλ), λ ∈ (0, ∞) (7.4.35)
and the mean is k/r. The scale parameter of the gamma distribution is b = 1/r, but the formulas will work out nicer if we use the
rate parameter.
The posterior distribution of Λ given Xn = (X1, X2, …, Xn) is gamma with shape parameter k + Yn and rate parameter
r + n.
Proof
It follows that the gamma distribution is conjugate to the Poisson distribution. Note that the posterior rate parameter is
deterministic and depends on the data only through the sample size n.
The Bayesian estimator of λ based on Xn is Vn = (k + Yn)/(r + n), the mean of the posterior distribution.
Proof
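The gamma-Poisson update is just as mechanical as the beta-Bernoulli one. A hedged sketch (function names, seed, the arrival-counting sampler, and parameter values are illustrative):

```python
import random

def gamma_poisson_update(k, r, sample):
    """Posterior gamma parameters for Poisson data under a gamma(k, r) prior
    (r is the rate): the shape gains the total count Y_n, the rate gains n."""
    return k + sum(sample), r + len(sample)

def bayes_estimate(k, r, sample):
    """Bayes estimator V_n = (k + Y_n)/(r + n), the posterior mean."""
    k_post, r_post = gamma_poisson_update(k, r, sample)
    return k_post / r_post

rng = random.Random(12)

def poisson_variate(lam):
    """Poisson(lam) variate via counting unit-rate exponential arrivals."""
    t, c = 0.0, 0
    while True:
        t += rng.expovariate(1.0)
        if t > lam:
            return c
        c += 1

lam, n = 4.0, 2000
sample = [poisson_variate(lam) for _ in range(n)]
v = bayes_estimate(2.0, 1.0, sample)  # prior gamma with mean k/r = 2
m = sum(sample) / n                   # maximum likelihood estimate
print(v, m)
```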
For n ∈ ℕ₊,
bias(Vn ∣ λ) = (k − rλ)/(r + n), λ ∈ (0, ∞) (7.4.40)
Note that, as before, we cannot choose k and r to make Vn unbiased, without knowledge of λ.
For n ∈ ℕ₊,
mse(Vn ∣ λ) = [nλ + (k − rλ)²]/(r + n)², λ ∈ (0, ∞) (7.4.42)
Recall that the method of moments estimator of λ and the maximum likelihood estimator of λ on the interval (0, ∞) are both
Mn = Yn/n, the sample mean. This estimator is unbiased and has mean square error λ/n. To see the connection between the
estimators, note that
Vn = [r/(r + n)] (k/r) + [n/(r + n)] Mn (7.4.44)
So Vn is a weighted average of k/r (the mean of the prior distribution) and Mn (the maximum likelihood estimator).
The Normal Distribution
Suppose that X = (X1, X2, …) is a sequence of random variables each having the normal distribution with
unknown mean μ ∈ ℝ but known variance σ² ∈ (0, ∞). Of course, the normal distribution plays an especially important role in
statistics, in part because of the central limit theorem. The normal distribution is widely used to model physical quantities subject to
numerous small, random errors. In many statistical applications, the variance of the normal distribution is more stable than the
mean, so the assumption that the variance is known is not entirely artificial. Recall that the normal probability density function
(given μ) is
g(x ∣ μ) = [1/(√(2π) σ)] exp[−(1/2)((x − μ)/σ)²], x ∈ ℝ (7.4.45)
As before, for n ∈ ℕ₊ let Yn = ∑_{i=1}^n X_i. Recall that Yn also has a normal distribution (given μ) but with mean nμ and variance
nσ².
2
Suppose now that μ is modeled by a random variable Ψ that has a prior normal distribution with mean a ∈ ℝ and variance
b² ∈ (0, ∞). As usual, a and b are chosen to reflect our prior knowledge of μ. An interesting special case is when we take b = σ,
so the variance of the prior distribution of Ψ is the same as the variance of the underlying sampling distribution.
The posterior distribution of Ψ given Xn = (X1, X2, …, Xn) is normal with mean (a σ² + b² Yn)/(σ² + n b²) and variance
b² σ²/(σ² + n b²).
Proof
Therefore, the normal distribution is conjugate to the normal distribution with unknown mean and known variance. Note that the
posterior variance is deterministic, and depends on the data only through the sample size n. In the special case that b = σ, the
posterior distribution of Ψ given Xn is normal with mean (Yn + a)/(n + 1) and variance σ²/(n + 1).
The Bayesian estimator of μ based on Xn is Un = (a σ² + b² Yn)/(σ² + n b²), the mean of the posterior distribution.
Proof
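The normal-normal update, including the tidy b = σ special case, can be coded directly from the posterior mean and variance formulas. The function name, seed, and parameter values below are illustrative:

```python
import random

def normal_posterior(a, b2, sigma2, sample):
    """Posterior mean and variance of the normal mean, for a N(a, b2) prior
    and known sampling variance sigma2:
    mean = (a*sigma2 + b2*Y_n)/(sigma2 + n*b2), var = b2*sigma2/(sigma2 + n*b2)."""
    n, y = len(sample), sum(sample)
    denom = sigma2 + n * b2
    return (a * sigma2 + b2 * y) / denom, b2 * sigma2 / denom

rng = random.Random(10)
mu, sigma = 1.0, 2.0
sample = [rng.gauss(mu, sigma) for _ in range(400)]
# the special case b = sigma: posterior mean (Y_n + a)/(n + 1), variance sigma^2/(n + 1)
post_mean, post_var = normal_posterior(0.0, 4.0, 4.0, sample)
print(post_mean, post_var)
```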
For n ∈ ℕ₊,
bias(Un ∣ μ) = σ²(a − μ)/(σ² + n b²), μ ∈ ℝ (7.4.56)
For n ∈ ℕ₊,
mse(Un ∣ μ) = [n σ² b⁴ + σ⁴(a − μ)²]/(σ² + n b²)², μ ∈ ℝ (7.4.58)
Proof
When b = σ, mse(Un ∣ μ) = [nσ² + (a − μ)²]/(n + 1)². Recall that the method of moments estimator of μ and the maximum
likelihood estimator of μ on ℝ are both Mn = Yn/n, the sample mean. This estimator is unbiased and has mean square error
var(Mn) = σ²/n. To see the connection between the estimators, note from (25) that
Un = [σ²/(n b² + σ²)] a + [n b²/(n b² + σ²)] Mn (7.4.60)
So Un is a weighted average of a (the mean of the prior distribution) and Mn (the maximum likelihood estimator).
The Beta Distribution
Suppose that X = (X1, X2, …) is a sequence of random variables each having the beta distribution with unknown
left shape parameter a ∈ (0, ∞) and right shape parameter b = 1. The beta distribution is widely used to model random
proportions and probabilities and other variables that take values in bounded intervals (scaled to take values in (0, 1)). Recall that
the probability density function (given a) is
g(x ∣ a) = a x^(a−1), x ∈ (0, 1) (7.4.61)
Suppose now that a is modeled by a random variable A that has a prior gamma distribution with shape parameter k ∈ (0, ∞) and
rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of a. Thus the prior probability density
function of A is
h(a) = [r^k/Γ(k)] a^(k−1) e^(−ra), a ∈ (0, ∞) (7.4.62)
The posterior distribution of A given Xn = (X1, X2, …, Xn) is gamma, with shape parameter k + n and rate parameter
r − ln(X1 X2 ⋯ Xn).
Proof
Thus, the gamma distribution is conjugate to the beta distribution with unknown left parameter and right parameter 1. Note that the
posterior shape parameter is deterministic and depends on the data only through the sample size n.
The Bayesian estimator of a based on Xn is
Un = (k + n)/[r − ln(X1 X2 ⋯ Xn)] (7.4.65)
Proof
Given the complicated structure, the bias and mean square error of Un given a ∈ (0, ∞) would be difficult to compute explicitly.
Recall that the maximum likelihood estimator of a is Wn = −n/ln(X1 X2 ⋯ Xn). To see the connection between the estimators,
note that 1/Un = [k/(k + n)] (r/k) + [n/(k + n)] (1/Wn), so 1/Un is a weighted average of r/k (the reciprocal of the mean of the
prior distribution) and 1/Wn (the reciprocal of the maximum likelihood estimator).
The Pareto Distribution
Suppose that X = (X1, X2, …) is a sequence of random variables each having the Pareto distribution with unknown
shape parameter a ∈ (0, ∞) and scale parameter b = 1. The Pareto distribution is used to model certain financial variables and
other variables with heavy-tailed distributions, and is named for Vilfredo Pareto. Recall that the probability density function (given
a) is
g(x ∣ a) = a/x^(a+1), x ∈ [1, ∞) (7.4.67)
Suppose now that a is modeled by a random variable A that has a prior gamma distribution with shape parameter k ∈ (0, ∞) and
rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of a. Thus the prior probability density
function of A is
h(a) = [r^k/Γ(k)] a^(k−1) e^(−ra), a ∈ (0, ∞) (7.4.68)
The posterior distribution of A given Xn = (X1, X2, …, Xn) is gamma with shape parameter k + n and rate parameter
r + ln(X1 X2 ⋯ Xn).
Proof
Thus, the gamma distribution is conjugate to the Pareto distribution with unknown shape parameter. Note that the posterior shape
parameter is deterministic and depends on the data only through the sample size n.
The Bayesian estimator of a based on Xn is
Un = (k + n)/[r + ln(X1 X2 ⋯ Xn)] (7.4.71)
Proof
Given the complicated structure, the bias and mean square error of Un given a ∈ (0, ∞) would be difficult to compute explicitly.
Recall that the maximum likelihood estimator of a is Wn = n/ln(X1 X2 ⋯ Xn). To see the connection between the estimators,
note that 1/Un = [k/(k + n)] (r/k) + [n/(k + n)] (1/Wn), so again 1/Un is a weighted average of r/k and 1/Wn.
This page titled 7.4: Bayesian Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
7.5: Best Unbiased Estimators
Basic Theory
Consider again the basic statistical model, in which we have a random experiment that results in an observable random variable X
taking values in a set S . Once again, the experiment is typically to sample n objects from a population and record one or more
measurements for each item. In this case, the observable random variable has the form
X = (X1 , X2 , … , Xn ) (7.5.1)
Suppose that θ is a real parameter of the distribution of X, taking values in a parameter space Θ. Let fθ denote the probability
density function of X for θ ∈ Θ. Note that the expected value, variance, and covariance operators also depend on θ, although we
will sometimes suppress this to keep the notation from becoming too unwieldy.
Definitions
Suppose now that λ = λ(θ) is a parameter of interest that is derived from θ. (Of course, λ might be θ itself, but more generally
might be a function of θ.) In this section we will consider the general problem of finding the best estimator of λ among a given
class of unbiased estimators. Recall that if U is an unbiased estimator of λ, then varθ(U) is the mean square error. Mean square
error is our measure of the quality of unbiased estimators, so the following definitions are natural. Suppose that U and V are
unbiased estimators of λ.
1. If varθ(U) ≤ varθ(V) for all θ ∈ Θ, then U is uniformly better than V.
2. If U is uniformly better than every other unbiased estimator of λ, then U is a Uniformly Minimum Variance Unbiased
Estimator (UMVUE) of λ.
Given unbiased estimators U and V of λ , it may be the case that U has smaller variance for some values of θ while V has smaller
variance for other values of θ , so that neither estimator is uniformly better than the other. Of course, a minimum variance unbiased
estimator is the best we can hope for.
In the rest of this subsection, we consider statistics h(X) where h : S → ℝ (and so in particular, h does not depend on θ). We
need a fundamental assumption: writing L1(X, θ) = (d/dθ) ln fθ(X) for the derivative of the log-likelihood function, we assume
that Eθ[h(X) L1(X, θ)] = (d/dθ) Eθ[h(X)].
This is equivalent to the assumption that the derivative operator d/dθ can be interchanged with the expected value operator
Eθ.
7.5.1 https://stats.libretexts.org/@go/page/10193
Proof
Generally speaking, the fundamental assumption will be satisfied if fθ(x) is differentiable as a function of θ, with a derivative that
is jointly continuous in x and θ, and if the support set {x ∈ S : fθ(x) > 0} does not depend on θ.
Proof
varθ(L1(X, θ)) = Eθ(L1²(X, θ))
Proof
The following theorem gives the general Cramér-Rao lower bound on the variance of a statistic. The lower bound is named for
Harald Cramér and C. R. Rao.
Proof
We can now give the first version of the Cramér-Rao lower bound for unbiased estimators of a parameter.
Suppose now that λ(θ) is a parameter of interest and h(X) is an unbiased estimator of λ. Then
varθ(h(X)) ≥ (dλ/dθ)²/Eθ(L1²(X, θ)) (7.5.11)
Proof
An estimator of λ that achieves the Cramér-Rao lower bound must be a uniformly minimum variance unbiased estimator
(UMVUE) of λ .
Equality holds in the previous theorem, and hence h(X) is an UMVUE, if and only if there exists a function u(θ) such that
(with probability 1)
L1(X, θ) = u(θ)[h(X) − λ(θ)]
Proof
The quantity Eθ(L1²(X, θ)) that occurs in the denominator of the lower bounds in the previous two theorems is called the Fisher
information number of X, named after Sir Ronald Fisher. The following theorem gives an alternate version of the Fisher
information number that is usually computationally better.
If the appropriate derivatives exist and if the appropriate interchanges are permissible, then

Eθ(L1²(X, θ)) = Eθ(L2(X, θ))   (7.5.13)
The following theorem gives the second version of the Cramér-Rao lower bound for unbiased estimators of a parameter.
Proof
Random Samples
Suppose now that X = (X , X , … , X ) is a random sample of size n from the distribution of a random variable X having
1 2 n
probability density function g and taking values in a set R . Thus S = R . We will use lower-case letters for the derivative of the
θ
n
log likelihood function of X and the negative of the second derivative of the log likelihood function of X.
L
2
can be written in terms of l and L can be written in terms of l :
2
2 2
1. E θ
2
(L (X, θ)) = nEθ (l (X, θ))
2
The following theorem gives the second version of the general Cramér-Rao lower bound on the variance of a statistic, specialized
for random samples.
The following theorem gives the third version of the Cramér-Rao lower bound for unbiased estimators of a parameter, specialized
for random samples.
Suppose now that λ(θ) is a parameter of interest and h(X) is an unbiased estimator of λ . Then

varθ(h(X)) ≥ (dλ/dθ)² / nEθ(l1²(X, θ))   (7.5.18)
Note that the Cramér-Rao lower bound varies inversely with the sample size n . The following theorem gives the fourth version of the Cramér-Rao lower bound for unbiased estimators of a parameter, again specialized for random samples.
If the appropriate derivatives exist and the appropriate interchanges are permissible, then

varθ(h(X)) ≥ (dλ/dθ)² / nEθ(l2(X, θ))   (7.5.19)
To summarize, we have four versions of the Cramér-Rao lower bound for the variance of an unbiased estimate of λ : version 1 and version 2 in the general case, and version 1 and version 2 in the special case that X is a random sample from the distribution of X. If an unbiased estimator of λ achieves the lower bound, then the estimator is an UMVUE.
Recall that the sample mean is

M = (1/n) ∑_{i=1}^n Xi   (7.5.20)
Recall that E(M) = μ and var(M) = σ²/n . The special version of the sample variance, when μ is known, and the standard version of the sample variance are, respectively,

W² = (1/n) ∑_{i=1}^n (Xi − μ)²   (7.5.21)

S² = (1/(n − 1)) ∑_{i=1}^n (Xi − M)²   (7.5.22)
Suppose now that X = (X1, X2, …, Xn) is a random sample of size n from the Bernoulli distribution with unknown parameter p ∈ (0, 1). In the usual language of reliability, Xi = 1 means success on trial i and Xi = 0 means failure on trial i; the distribution is named for Jacob Bernoulli. Recall that the Bernoulli distribution has probability density function

gp(x) = p^x (1 − p)^{1−x}, x ∈ {0, 1}   (7.5.23)
The basic assumption is satisfied. Moreover, recall that the mean of the Bernoulli distribution is p, while the variance is p(1 − p) .
p(1 − p)/n is the Cramér-Rao lower bound for the variance of unbiased estimators of p.
The sample mean M (which is the proportion of successes) attains the lower bound in the previous exercise and hence is an
UMVUE of p.
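As a quick numerical check (a sketch; the values p = 0.3, n = 25, and the number of replications are arbitrary choices, not from the text), we can simulate many Bernoulli samples and compare the variance of the sample proportion M with the bound p(1 − p)/n:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.3, 25, 200_000

# Simulate the sample proportion M for many samples of size n.
samples = rng.binomial(1, p, size=(reps, n))
M = samples.mean(axis=1)

crlb = p * (1 - p) / n   # Cramér-Rao lower bound for unbiased estimators of p
print(M.var())           # empirical variance of M, ≈ crlb
print(crlb)
```

Since M is unbiased and its variance equals the bound exactly, the two printed values should agree up to simulation noise.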
Suppose now that X = (X1, X2, …, Xn) is a random sample of size n from the Poisson distribution with parameter θ ∈ (0, ∞). Recall that this distribution is often used to model the number of “random points” in a region of time or space and is studied in more detail in the chapter on the Poisson Process. The Poisson distribution is named for Siméon Poisson and has probability density function

gθ(x) = e^{−θ} θ^x / x!, x ∈ N   (7.5.24)
The basic assumption is satisfied. Recall also that the mean and variance of the distribution are both θ .
θ/n is the Cramér-Rao lower bound for the variance of unbiased estimators of θ .
The sample mean M attains the lower bound in the previous exercise and hence is an UMVUE of θ .
Suppose now that X = (X1, X2, …, Xn) is a random sample of size n from the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞). Recall that the normal distribution plays an especially important role in statistics, in part because of the central limit theorem. The normal distribution is widely used to model physical quantities subject to numerous small, random errors, and has probability density function

g_{μ,σ²}(x) = (1/(σ√(2π))) exp[−(1/2)((x − μ)/σ)²], x ∈ R   (7.5.25)

The basic assumption is satisfied with respect to both of these parameters. Recall also that the fourth central moment is E((X − μ)⁴) = 3σ⁴.
σ²/n is the Cramér-Rao lower bound for the variance of unbiased estimators of μ .
The sample mean M attains the lower bound in the previous exercise and hence is an UMVUE of μ .
2σ⁴/n is the Cramér-Rao lower bound for the variance of unbiased estimators of σ².
Recall that the sample variance S² has variance 2σ⁴/(n − 1), and hence does not attain the lower bound in the previous exercise.
If μ is known, then the special sample variance W² attains the lower bound above and hence is an UMVUE of σ².
Proof
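These variance comparisons are easy to check numerically. A small simulation sketch (the values μ = 0, σ = 2, n = 10 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
S2 = x.var(axis=1, ddof=1)            # standard sample variance (mu unknown)
W2 = ((x - mu) ** 2).mean(axis=1)     # special sample variance (mu known)

crlb = 2 * sigma**4 / n               # Cramér-Rao lower bound for estimators of sigma^2
print(S2.var())   # ≈ 2σ⁴/(n−1), strictly above the bound
print(W2.var())   # ≈ 2σ⁴/n, attains the bound
print(crlb)
```

Both estimators are unbiased for σ², but only W², which uses the known mean, achieves the bound.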
Suppose now that X = (X1, X2, …, Xn) is a random sample of size n from the gamma distribution with known shape parameter k > 0 and unknown scale parameter b > 0. The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in more detail in the chapter on Special Distributions. The probability density function is

gb(x) = (1/(Γ(k) b^k)) x^{k−1} e^{−x/b}, x ∈ (0, ∞)   (7.5.26)

The basic assumption is satisfied with respect to b . Moreover, the mean and variance of the gamma distribution are kb and kb², respectively.
b²/(nk) is the Cramér-Rao lower bound for the variance of unbiased estimators of b .
The statistic M/k attains the lower bound in the previous exercise and hence is an UMVUE of b .
Suppose now that X = (X1, X2, …, Xn) is a random sample of size n from the beta distribution with unknown left parameter a ∈ (0, ∞) and right parameter b = 1. Beta distributions are widely used to model random proportions and other random variables that take values in bounded intervals, and are studied in more detail in the chapter on Special Distributions. In our specialized case, the probability density function of the sampling distribution is

ga(x) = a x^{a−1}, x ∈ (0, 1)   (7.5.27)

The mean and variance of the distribution are
1. μ = a/(a + 1)
2. σ² = a/((a + 1)²(a + 2))
The sample mean M does not achieve the Cramér-Rao lower bound in the previous exercise, and hence is not an UMVUE of μ .
Suppose now that X = (X1, X2, …, Xn) is a random sample of size n from the uniform distribution on [0, a], where a ∈ (0, ∞) is the unknown parameter, so that the sampling density is ga(x) = 1/a for x ∈ [0, a]. The formal Cramér-Rao lower bound for the variance of unbiased estimators of a is a²/n . Of course, the Cramér-Rao Theorem does not apply, by the previous exercise.
Recall that V = ((n + 1)/n) max{X1, X2, …, Xn} is unbiased and has variance a²/(n(n + 2)). This variance is smaller than the Cramér-Rao bound in the previous exercise.
The reason that the basic assumption is not satisfied is that the support set {x ∈ R : g a (x) > 0} depends on the parameter a .
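A quick numerical illustration of this super-efficiency (the values a = 2 and n = 10 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
a, n, reps = 2.0, 10, 200_000

x = rng.uniform(0, a, size=(reps, n))
V = (n + 1) / n * x.max(axis=1)   # unbiased estimator of a based on the maximum

formal_bound = a**2 / n           # formal Cramér-Rao bound (not valid here)
print(V.mean())                   # ≈ a, confirming unbiasedness
print(V.var())                    # ≈ a²/(n(n+2)), well below a²/n
print(formal_bound)
```

Because the support of the distribution depends on a, the regularity conditions fail, and an unbiased estimator can beat the formal bound by a wide margin.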
Suppose now that X1, X2, …, Xn are uncorrelated random variables with the same unknown mean μ ∈ R , but possibly different standard deviations. Let σ = (σ1, σ2, …, σn) where σi = sd(Xi) for i ∈ {1, 2, …, n}.
We will consider estimators of μ that are linear functions of the outcome variables. Specifically, we will consider estimators of the following form, where the vector of coefficients c = (c1, c2, …, cn) is to be determined:

Y = ∑_{i=1}^n ci Xi   (7.5.29)

Y is unbiased if and only if ∑_{i=1}^n ci = 1 .
The variance of Y is

var(Y) = ∑_{i=1}^n ci² σi²   (7.5.30)
Proof
This exercise shows how to construct the Best Linear Unbiased Estimator (BLUE) of μ , assuming that the vector of standard
deviations σ is known.
Suppose now that σi = σ for i ∈ {1, 2, …, n} so that the outcome variables have the same standard deviation. In particular, this would be the case if the outcome variables form a random sample of size n from a distribution with mean μ and standard deviation σ .
In this case the variance is minimized when ci = 1/n for each i and hence Y = M , the sample mean.
This exercise shows that the sample mean M is the best linear unbiased estimator of μ when the standard deviations are the same,
and that moreover, we do not need to know the value of the standard deviation.
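A sketch of the construction (the σ values below are hypothetical): minimizing var(Y) = ∑ ci²σi² subject to ∑ ci = 1, e.g. with Lagrange multipliers, yields weights proportional to the precisions 1/σi², which is the standard solution:

```python
import numpy as np

# Hypothetical standard deviations for n = 4 uncorrelated measurements of mu.
sigma = np.array([1.0, 2.0, 0.5, 4.0])

# Weights proportional to the precisions 1/sigma_i^2, normalized to sum to 1.
w = 1 / sigma**2
c = w / w.sum()

print(c)             # BLUE coefficients (precision weights)
print(1 / w.sum())   # variance of the BLUE: 1 / sum(1/sigma_i^2)
```

The most precise measurement gets the largest weight; with equal standard deviations the weights reduce to 1/n, recovering the sample mean as stated above.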
This page titled 7.5: Best Unbiased Estimators is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
7.6: Sufficient, Complete and Ancillary Statistics
Basic Theory
The Basic Statistical Model
Consider again the basic statistical model, in which we have a random experiment with an observable random variable X taking
values in a set S . Once again, the experiment is typically to sample n objects from a population and record one or more
measurements for each item. In this case, the outcome variable has the form
X = (X1 , X2 , … , Xn ) (7.6.1)
where Xi is the vector of measurements for the ith item. In general, we suppose that the distribution of X depends on a parameter θ taking values in a parameter space T . The parameter θ may also be vector-valued. We will sometimes use subscripts in probability density functions, expected values, and the like to indicate the dependence on θ .
Sufficient Statistics
Let U = u(X) be a statistic taking values in a set R . Intuitively, U is sufficient for θ if U contains all of the information about θ
that is available in the entire data variable X. Here is the formal definition:
A statistic U is sufficient for θ if the conditional distribution of X given U does not depend on θ ∈ T .
Sufficiency is related to the concept of data reduction. Suppose that X takes values in Rⁿ. If we can find a sufficient statistic U that takes values in Rʲ, then we can reduce the original data vector X (whose dimension n is usually large) to the vector of statistics U (whose dimension j is usually much smaller) with no loss of information about the parameter θ .
The following result gives a condition for sufficiency that is equivalent to this definition.
Let U = u(X) be a statistic taking values in R , and let fθ and hθ denote the probability density functions of X and U respectively. Then U is sufficient for θ if and only if the function on S given below does not depend on θ ∈ T :

x ↦ fθ(x) / hθ[u(x)]   (7.6.2)
Proof
The definition precisely captures the intuitive notion of sufficiency given above, but can be difficult to apply. We must know in
advance a candidate statistic U , and then we must be able to compute the conditional distribution of X given U . The Fisher-
Neyman factorization theorem given next often allows the identification of a sufficient statistic from the form of the probability
density function of X. It is named for Ronald Fisher and Jerzy Neyman.
Fisher-Neyman Factorization Theorem. Let fθ denote the probability density function of X and suppose that U = u(X) is a statistic taking values in R . Then U is sufficient for θ if and only if there exist G : R × T → [0, ∞) and r : S → [0, ∞) such that

fθ(x) = G[u(x), θ] r(x); x ∈ S, θ ∈ T   (7.6.3)
Proof
Note that r depends only on the data x but not on the parameter θ . Less technically, u(X) is sufficient for θ if the probability density function fθ(x) depends on the data vector x and the parameter θ only through u(x).
If U and V are equivalent statistics and U is sufficient for θ then V is sufficient for θ .
7.6.1 https://stats.libretexts.org/@go/page/10194
Minimal Sufficient Statistics
The entire data variable X is trivially sufficient for θ . However, as noted above, there usually exists a statistic U that is sufficient
for θ and has smaller dimension, so that we can achieve real data reduction. Naturally, we would like to find the statistic U that has
the smallest dimension possible. In many cases, this smallest dimension j will be the same as the dimension k of the parameter
vector θ . However, as we will see, this is not necessarily the case; j can be smaller or larger than k . An example based on the
uniform distribution is given in (38).
Suppose that a statistic U is sufficient for θ . Then U is minimally sufficient if U is a function of any other statistic V that is
sufficient for θ .
Once again, the definition precisely captures the notion of minimal sufficiency, but is hard to apply. The following result gives an
equivalent condition.
Let fθ denote the probability density function of X corresponding to the parameter value θ ∈ T and suppose that U = u(X) is a statistic taking values in R . Then U is minimally sufficient for θ if the following condition holds: for x ∈ S and y ∈ S ,

fθ(x) / fθ(y) is independent of θ if and only if u(x) = u(y)   (7.6.4)
Proof
If U and V are equivalent statistics and U is minimally sufficient for θ then V is minimally sufficient for θ .
Suppose that U is sufficient for θ and that there exists a maximum likelihood estimator of θ . Then there exists a maximum
likelihood estimator V that is a function of U .
Proof
In particular, suppose that V is the unique maximum likelihood estimator of θ and that V is sufficient for θ . If U is sufficient for θ
then V is a function of U by the previous theorem. Hence it follows that V is minimally sufficient for θ . Our next result applies to
Bayesian analysis.
Suppose that the statistic U = u(X) is sufficient for the parameter θ and that θ is modeled by a random variable Θ with values
in T . Then the posterior distribution of Θ given X = x ∈ S is a function of u(x).
Proof
Continuing with the setting of Bayesian analysis, suppose that θ is a real-valued parameter. If we use the usual mean-square loss
function, then the Bayesian estimator is V = E(Θ ∣ X) . By the previous result, V is a function of the sufficient statistic U . That
is, E(Θ ∣ X) = E(Θ ∣ U ) .
The next result is the Rao-Blackwell theorem, named for C. R. Rao and David Blackwell. The theorem shows how a sufficient
statistic can be used to improve an unbiased estimator.
Rao-Blackwell Theorem. Suppose that U is sufficient for θ and that V is an unbiased estimator of a real parameter λ = λ(θ) . Then Eθ(V ∣ U) is also an unbiased estimator of λ and is uniformly better than V .
Proof
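A sketch of the theorem in action (the Poisson setting and the target e^{−θ} are illustration choices, not from this passage): with a Poisson(θ) sample, V = 1(X1 = 0) is unbiased for P(X = 0) = e^{−θ}, and conditioning on the sufficient statistic Y = ∑ Xi gives E(V ∣ Y) = ((n − 1)/n)^Y, a standard computation since given Y = y each point lands on trial 1 with probability 1/n:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 1.5, 10, 200_000

x = rng.poisson(theta, size=(reps, n))
Y = x.sum(axis=1)

V = (x[:, 0] == 0).astype(float)   # crude unbiased estimator of e^{-theta}
RB = ((n - 1) / n) ** Y            # E(V | Y): the Rao-Blackwellized version

print(V.mean(), RB.mean())         # both ≈ e^{-theta}
print(V.var(), RB.var())           # RB has much smaller variance
```

Both estimators have the same mean, but the conditioned version is dramatically less variable, exactly as the theorem promises.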
Complete Statistics
Suppose that U = u(X) is a statistic taking values in a set R . Then U is a complete statistic for θ if for any function r : R → R ,

Eθ[r(U)] = 0 for all θ ∈ T implies P[r(U) = 0] = 1 for all θ ∈ T
To understand this rather strange looking condition, suppose that r(U ) is a statistic constructed from U that is being used as an
estimator of 0 (thought of as a function of θ ). The completeness condition means that the only such unbiased estimator is the
statistic that is 0 with probability 1.
If U and V are equivalent statistics and U is complete for θ then V is complete for θ .
The next result shows the importance of statistics that are both complete and sufficient; it is known as the Lehmann-Scheffé
theorem, named for Erich Lehmann and Henry Scheffé.
Lehmann-Scheffé Theorem. Suppose that U is sufficient and complete for θ and that V = r(U ) is an unbiased estimator of a
real parameter λ = λ(θ) . Then V is a uniformly minimum variance unbiased estimator (UMVUE) of λ .
Proof
Ancillary Statistics
Suppose that V = v(X) is a statistic taking values in a set R . If the distribution of V does not depend on θ , then V is called an
ancillary statistic for θ .
Thus, the notion of an ancillary statistic is complementary to the notion of a sufficient statistic. A sufficient statistic contains all
available information about the parameter; an ancillary statistic contains no information about the parameter. The following result,
known as Basu's Theorem and named for Debabrata Basu, makes this point more precisely.
Basu's Theorem. Suppose that U is complete and sufficient for a parameter θ and that V is an ancillary statistic for θ . Then U
and V are independent.
Proof
If U and V are equivalent statistics and U is ancillary for θ then V is ancillary for θ .
Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the Bernoulli distribution with parameter p. Equivalently, X is a sequence of Bernoulli trials, so that in the usual language of reliability, Xi = 1 if trial i is a success, and Xi = 0 if trial i is a failure. The Bernoulli distribution is named for Jacob Bernoulli, is studied in more detail in the chapter on Bernoulli Trials, and has probability density function

g(x) = p^x (1 − p)^{1−x}, x ∈ {0, 1}   (7.6.10)

Let Y = ∑_{i=1}^n Xi denote the number of successes. Recall that Y has the binomial distribution with parameters n and p, and has probability density function h defined by

h(y) = (n choose y) p^y (1 − p)^{n−y}, y ∈ {0, 1, …, n}   (7.6.11)

Y is sufficient for p. Specifically, for y ∈ {0, 1, …, n}, the conditional distribution of X given Y = y is uniform on the set of points

Dy = {(x1, x2, …, xn) ∈ {0, 1}ⁿ : x1 + x2 + ⋯ + xn = y}   (7.6.12)
Proof
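A small simulation sketch of this fact (the values p = 0.7, n = 4, y = 2 are arbitrary choices): conditioning on Y = 2, every arrangement of two successes among four trials appears with about the same frequency, even though p is far from 1/2.

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(4)
p, n, y, reps = 0.7, 4, 2, 100_000

x = rng.binomial(1, p, size=(reps, n))
hits = x[x.sum(axis=1) == y]          # keep only the samples with Y = y

counts = Counter(map(tuple, hits.tolist()))
freqs = np.array([counts[pt] for pt in sorted(counts)]) / len(hits)
print(sorted(counts))                 # the C(4,2) = 6 points of D_y
print(freqs)                          # each ≈ 1/6
```

The conditional uniformity is exactly why Y carries all the information about p: given the number of successes, the arrangement tells us nothing more.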
This result is intuitively appealing: in a sequence of Bernoulli trials, all of the information about the probability of success p is
contained in the number of successes Y . The particular order of the successes and failures provides no additional information. Of
course, the sufficiency of Y follows more easily from the factorization theorem (3), but the conditional distribution provides
additional insight.
The proof of the last result actually shows that if the parameter space is any subset of (0, 1) containing an interval of positive
length, then Y is complete for p. But the notion of completeness depends very much on the parameter space. The following result
considers the case where p has a finite set of values.
Suppose that the parameter space T ⊂ (0, 1) is a finite set with k ∈ N₊ elements. If the sample size n is at least k , then Y is not complete for p .
Proof
The sample mean M = Y /n (the sample proportion of successes) is clearly equivalent to Y (the number of successes), and hence
is also sufficient for p and is complete for p ∈ (0, 1). Recall that the sample mean M is the method of moments estimator of p, and
is the maximum likelihood estimator of p on the parameter space (0, 1).
In Bayesian analysis, the usual approach is to model p with a random variable P that has a prior beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Then the posterior distribution of P given X is beta with left parameter a + Y and right parameter b + (n − Y) . The posterior distribution depends on the data only through the sufficient statistic Y , as guaranteed by the general result above.
The sample variance S² is an UMVUE of the distribution variance p(1 − p) for p ∈ (0, 1), and can be written as

S² = (Y/(n − 1)) (1 − Y/n)   (7.6.17)
Proof
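The identity (7.6.17) is a simple algebraic fact, since xi² = xi for indicator variables; a quick check on a single sample (p = 0.4 and n = 20 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20
x = rng.binomial(1, 0.4, size=n)   # one Bernoulli sample
Y = x.sum()

S2 = x.var(ddof=1)                 # standard sample variance
closed_form = Y / (n - 1) * (1 - Y / n)
print(S2, closed_form)             # identical, as in (7.6.17)
```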
Recall that the Poisson distribution with parameter θ ∈ (0, ∞) has probability density function g defined by

g(x) = e^{−θ} θ^x / x!, x ∈ N   (7.6.19)

The Poisson distribution is named for Siméon Poisson and is used to model the number of “random points” in a region of time or space, under certain ideal conditions. The parameter θ is proportional to the size of the region, and is both the mean and the variance of the distribution. The Poisson distribution is studied in more detail in the chapter on the Poisson Process.
Suppose now that X = (X1, X2, …, Xn) is a random sample of size n from the Poisson distribution with parameter θ . Recall that the sum of the scores Y = ∑_{i=1}^n Xi also has the Poisson distribution, but with parameter nθ .
The statistic Y is sufficient for θ . Specifically, for y ∈ N , the conditional distribution of X given Y =y is the multinomial
distribution with y trials, n trial values, and uniform trial probabilities.
Proof
As before, it's easier to use the factorization theorem to prove the sufficiency of Y , but the conditional distribution gives some
additional insight.
As with our discussion of Bernoulli trials, the sample mean M = Y /n is clearly equivalent to Y and hence is also sufficient for θ
and complete for θ ∈ (0, ∞) . Recall that M is the method of moments estimator of θ and is the maximum likelihood estimator on
the parameter space (0, ∞).
The statistic U below is an UMVUE of P(X = 0) = e^{−θ}:

U = ((n − 1)/n)^Y   (7.6.23)
Proof
The normal distribution is often used to model physical quantities subject to small, random errors, and is studied in more detail in
the chapter on Special Distributions. Because of the central limit theorem, the normal distribution is perhaps the most important
distribution in statistics.
Suppose that X = (X1, X2, …, Xn) is a random sample from the normal distribution with mean μ and variance σ². Then each of the following pairs of statistics is minimally sufficient for (μ, σ²):
1. (Y, V) where Y = ∑_{i=1}^n Xi and V = ∑_{i=1}^n Xi².
2. (M, S²) where M = (1/n) ∑_{i=1}^n Xi is the sample mean and S² = (1/(n − 1)) ∑_{i=1}^n (Xi − M)² is the sample variance.
3. (M, T²) where T² = (1/n) ∑_{i=1}^n (Xi − M)² is the biased sample variance.
Proof
Run the normal estimation experiment 1000 times with various values of the parameters. Compare the estimates of the
parameters in terms of bias and mean square error.
Sometimes the variance σ² of the normal distribution is known, but not the mean μ . It's rarely the case that μ is known but not σ².
Suppose again that X = (X1, X2, …, Xn) is a random sample from the normal distribution with mean μ ∈ R and variance σ² ∈ (0, ∞).
1. If σ² is known then Y = ∑_{i=1}^n Xi is minimally sufficient for μ .
2. If μ is known then U = ∑_{i=1}^n (Xi − μ)² is sufficient for σ².
Proof
Of course by equivalence, in part (a) the sample mean M = Y/n is minimally sufficient for μ , and in part (b) the special sample variance W² = U/n is minimally sufficient for σ². Moreover, in part (a), M is complete for μ on the parameter space R and the sample variance S² is ancillary for μ . (Recall that (n − 1)S²/σ² has the chi-square distribution with n − 1 degrees of freedom.) It follows from Basu's theorem (15) that the sample mean M and the sample variance S² are independent. We proved this by more direct means in the section on special properties of normal samples, but the formulation in terms of sufficient and ancillary statistics gives additional insight.
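This independence is easy to see numerically; a small simulation sketch (the parameter values are arbitrary choices) showing that the correlation between M and S² is essentially zero for normal samples:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, reps = 5.0, 3.0, 8, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
M = x.mean(axis=1)
S2 = x.var(axis=1, ddof=1)

# For normal samples, Basu's theorem gives independence of M and S²,
# so in particular their correlation is 0.
print(np.corrcoef(M, S2)[0, 1])   # ≈ 0
```

Zero correlation alone does not prove independence, of course, but it is the easiest consequence to check by simulation; for non-normal samples this correlation is generally nonzero.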
The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in
more detail in the chapter on Special Distributions.
Suppose that X = (X1, X2, …, Xn) is a random sample from the gamma distribution with shape parameter k and scale parameter b . Each of the following pairs of statistics is minimally sufficient for (k, b):
1. (Y, V) where Y = ∑_{i=1}^n Xi is the sum of the scores and V = ∏_{i=1}^n Xi is the product of the scores.
2. (M, U) where M = Y/n is the sample (arithmetic) mean of X and U = V^{1/n} is the sample geometric mean of X.
Proof
Recall that the method of moments estimators of k and b are M²/T² and T²/M, respectively, where M = (1/n) ∑_{i=1}^n Xi is the sample mean and T² = (1/n) ∑_{i=1}^n (Xi − M)² is the biased sample variance. If the shape parameter k is known, M/k is both the method of moments estimator of b and the maximum likelihood estimator on the parameter space (0, ∞). Note that T² is not a function of the sufficient statistics (Y, V), and hence estimators based on T² suffer from a loss of information.
Run the gamma estimation experiment 1000 times with various values of the parameters and the sample size n . Compare the
estimates of the parameters in terms of bias and mean square error.
The proof of the last theorem actually shows that Y is sufficient for b if k is known, and that V is sufficient for k if b is known.
Suppose again that X = (X1, X2, …, Xn) is a random sample of size n from the gamma distribution with known shape parameter k and unknown scale parameter b . Then Y = ∑_{i=1}^n Xi is complete for b .
Proof
Suppose again that X = (X1, X2, …, Xn) is a random sample from the gamma distribution on (0, ∞) with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Let M = (1/n) ∑_{i=1}^n Xi denote the sample mean and U = (X1 X2 ⋯ Xn)^{1/n} the sample geometric mean.
Recall that the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) has probability density function g given by

g(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1}, x ∈ (0, 1)   (7.6.33)
where B is the beta function. The beta distribution is often used to model random proportions and other random variables that take
values in bounded intervals. It is studied in more detail in the chapter on Special Distributions.
Suppose that X = (X1, X2, …, Xn) is a random sample from the beta distribution with left parameter a and right parameter b . Then (P, Q) is minimally sufficient for (a, b), where P = ∏_{i=1}^n Xi and Q = ∏_{i=1}^n (1 − Xi).
Proof
The proof also shows that P is sufficient for a if b is known, and that Q is sufficient for b if a is known. Recall that the method of moments estimators of a and b are

U = M(M − M⁽²⁾) / (M⁽²⁾ − M²),  V = (1 − M)(M − M⁽²⁾) / (M⁽²⁾ − M²)   (7.6.35)

respectively, where M = (1/n) ∑_{i=1}^n Xi is the sample mean and M⁽²⁾ = (1/n) ∑_{i=1}^n Xi² is the second order sample mean. If b is known, the method of moments estimator of a is Ub = bM/(1 − M), while if a is known, the method of moments estimator of b is Va = a(1 − M)/M. None of these estimators is a function of the sufficient statistics (P, Q) and so all suffer from a loss of information. On the other hand, if b = 1, the maximum likelihood estimator of a on the interval (0, ∞) is W = −n / ∑_{i=1}^n ln Xi, which is a function of the sufficient statistic P (as it must be).
Run the beta estimation experiment 1000 times with various values of the parameters. Compare the estimates of the
parameters.
The Pareto Distribution
Recall that the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞) is a continuous distribution on [b, ∞) with probability density function g given by

g(x) = a b^a / x^{a+1}, b ≤ x < ∞   (7.6.36)
The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution often used to model income and certain other
types of random variables. It is studied in more detail in the chapter on Special Distributions.
Suppose that X = (X1, X2, …, Xn) is a random sample from the Pareto distribution with shape parameter a and scale parameter b . Then (P, X(1)) is minimally sufficient for (a, b), where P = ∏_{i=1}^n Xi is the product of the sample variables and X(1) = min{X1, X2, …, Xn} is the first order statistic.
Proof
The proof also shows that P is sufficient for a if b is known (which is often the case), and that X(1) is sufficient for b if a is known (much less likely). Recall that the method of moments estimators of a and b are

U = 1 + √(M⁽²⁾ / (M⁽²⁾ − M²)),  V = (M⁽²⁾/M) (1 − √((M⁽²⁾ − M²) / M⁽²⁾))   (7.6.39)

respectively, where as before M = (1/n) ∑_{i=1}^n Xi is the sample mean and M⁽²⁾ = (1/n) ∑_{i=1}^n Xi² is the second order sample mean. These estimators are not functions of the sufficient statistics and hence suffer from a loss of information. On the other hand, the maximum likelihood estimators of a and b on the interval (0, ∞) are

W = n / (∑_{i=1}^n ln Xi − n ln X(1)),  X(1)   (7.6.40)
respectively. These are functions of the sufficient statistics, as they must be.
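As a numerical sketch (the parameter values a = 3, b = 2 and the sample size are arbitrary choices), we can generate a Pareto sample by inverse transform and check that the maximum likelihood estimators in (7.6.40) approximately recover the parameters:

```python
import numpy as np

rng = np.random.default_rng(7)
a, b, n = 3.0, 2.0, 5000

# Inverse transform: if U ~ Uniform(0,1) then b * U**(-1/a) is Pareto(a, b),
# since P(b U^(-1/a) > x) = (b/x)^a for x >= b.
x = b * rng.random(n) ** (-1 / a)

x_min = x.min()                                  # maximum likelihood estimator of b
W = n / (np.log(x).sum() - n * np.log(x_min))    # maximum likelihood estimator of a
print(W, x_min)                                  # ≈ a and ≈ b
```

Both estimators depend on the data only through the sufficient statistics P = ∏ Xi (via ∑ ln Xi = ln P) and X(1).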
Run the Pareto estimation experiment 1000 times with various values of the parameters a and b and the sample size n .
Compare the method of moments estimates of the parameters with the maximum likelihood estimates in terms of the empirical
bias and mean square error.
Continuous uniform distributions are widely used in applications to model a number chosen “at random” from an interval.
Continuous uniform distributions are studied in more detail in the chapter on Special Distributions. Let's first consider the case
where both parameters are unknown.
Suppose that X = (X1, X2, …, Xn) is a random sample from the uniform distribution on the interval [a, a + h] . Then (X(1), X(n)) is minimally sufficient for (a, h), where X(1) = min{X1, X2, …, Xn} is the first order statistic and X(n) = max{X1, X2, …, Xn} is the last order statistic.
Proof
If the location parameter a is known, then the largest order statistic is sufficient for the scale parameter h . But if the scale
parameter h is known, we still need both order statistics for the location parameter a . So in this case, we have a single real-valued
parameter, but the minimally sufficient statistic is a pair of real-valued random variables.
Suppose again that X = (X1, X2, …, Xn) is a random sample from the uniform distribution on the interval [a, a + h] .
1. If a ∈ R is known, then X(n) is sufficient for h .
2. If h ∈ (0, ∞) is known, then (X(1), X(n)) is minimally sufficient for a .
Proof
Run the uniform estimation experiment 1000 times with various values of the parameter. Compare the estimates of the
parameter.
Recall that if both parameters are unknown, the method of moments estimators of a and h are U = 2M − √3 T and V = 2√3 T, respectively, where M = (1/n) ∑_{i=1}^n Xi is the sample mean and T² = (1/n) ∑_{i=1}^n (Xi − M)² is the biased sample variance. If a is known, the method of moments estimator of h is Va = 2(M − a), while if h is known, the method of moments estimator of a is Uh = M − h/2. None of these estimators are functions of the minimally sufficient statistics, and hence result in loss of information.
Suppose again that we have a population of N objects, with r of the objects of type 1, where one or both parameters are unknown. We select a random sample of n objects, without replacement from the population, and let Xi be the type of the ith object chosen. So our basic sequence of random variables is X = (X1, X2, …, Xn). The variables are identically distributed indicator variables with P(Xi = 1) = r/N for i ∈ {1, 2, …, n}, but are dependent. Of course, the number of type 1 objects in the sample, Y = ∑_{i=1}^n Xi, has the hypergeometric distribution with probability density function h given by

h(y) = (n choose y) r^(y) (N − r)^(n−y) / N^(n), y ∈ {max{0, n + r − N}, …, min{n, r}}

(Recall the falling power notation x^(k) = x(x − 1) ⋯ (x − k + 1).) The hypergeometric distribution is studied in more detail in the chapter on Finite Sampling Models.
Y is sufficient for (N, r). Specifically, for y ∈ {max{0, n + r − N}, …, min{n, r}}, the conditional distribution of X given Y = y is uniform on the set of points Dy = {(x1, x2, …, xn) ∈ {0, 1}ⁿ : x1 + x2 + ⋯ + xn = y}.
Proof
There are clearly strong similarities between the hypergeometric model and the Bernoulli trials model above. Indeed if the
sampling were with replacement, the Bernoulli trials model with p = r/N would apply rather than the hypergeometric model. It's
also interesting to note that we have a single real-valued statistic that is sufficient for two real-valued parameters.
Once again, the sample mean M = Y/n is equivalent to Y and hence is also sufficient for (N, r). Recall that the method of moments estimator of r with N known is N M and the method of moments estimator of N with r known is r/M . The estimator of r is the one that is used in the capture-recapture experiment.
Exponential Families
Suppose now that our data vector X takes values in a set S , and that the distribution of X depends on a parameter vector θ taking values in a parameter space Θ. The distribution of X is a k-parameter exponential family if S does not depend on θ and if the probability density function of X can be written as

fθ(x) = α(θ) r(x) exp(∑_{i=1}^k βi(θ) ui(x)); x ∈ S, θ ∈ Θ

where α and (β1, β2, …, βk) are real-valued functions on Θ, and where r and (u1, u2, …, uk) are real-valued functions on S . Moreover, k is assumed to be the smallest such integer. The parameter vector β = (β1(θ), β2(θ), …, βk(θ)) is sometimes called
the natural parameter of the distribution, and the random vector U = (u1(X), u2(X), …, uk(X)) is sometimes called the
natural statistic of the distribution. Although the definition may look intimidating, exponential families are useful because they
have many nice mathematical properties, and because many special parametric families are exponential families. In particular, the
sampling distributions from the Bernoulli, Poisson, gamma, normal, beta, and Pareto considered above are exponential families.
Exponential families of distributions are studied in more detail in the chapter on special distributions.
By the factorization theorem (3), the natural statistic U is sufficient for θ . It turns out that U is complete for θ as well, although the proof is more difficult.
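As a concrete illustration of the definition (this example simply rewrites the Bernoulli sample treated earlier in the section), the joint density of a Bernoulli random sample has one-parameter exponential form:

```latex
f_p(x) \;=\; \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}
\;=\; (1-p)^n \exp\!\left[\ln\!\left(\frac{p}{1-p}\right)\sum_{i=1}^n x_i\right],
\qquad x \in \{0,1\}^n
```

Here k = 1, with α(p) = (1 − p)^n, r(x) = 1, natural parameter β1(p) = ln[p/(1 − p)], and u1(x) = ∑_{i=1}^n xi; the natural statistic is Y = ∑_{i=1}^n Xi, consistent with the earlier result that Y is sufficient and complete for p.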
This page titled 7.6: Sufficient, Complete and Ancillary Statistics is shared under a CC BY 2.0 license and was authored, remixed, and/or curated
by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history
is available upon request.
CHAPTER OVERVIEW
8: Set Estimation
Set estimation refers to the process of constructing a subset of the parameter space, based on observed data from a probability
distribution. The subset will contain the true value of the parameter with a specified confidence level. In this chapter, we explore
the basic method of set estimation using pivot variables. We study set estimation in some of the most important models: the single
variable normal model, the two-variable normal model, and the Bernoulli model.
8.1: Introduction to Set Estimation
8.2: Estimation in the Normal Model
8.3: Estimation in the Bernoulli Model
8.4: Estimation in the Two-Sample Normal Model
8.5: Bayesian Set Estimation
This page titled 8: Set Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
8.1: Introduction to Set Estimation
Basic Theory
The Basic Statistical Model
As usual, our starting point is a random experiment with an underlying sample space and a probability measure P. In the basic
statistical model, we have an observable random variable X taking values in a set S . In general, X can have quite a complicated
structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest,
then
X = (X1 , X2 , … , Xn ) (8.1.1)
where Xᵢ is the vector of measurements for the ith object. The most important special case occurs when (X₁, X₂, …, Xₙ) are independent and identically distributed. In this case, we have a random sample of size n from the common distribution.
Suppose also that the distribution of X depends on a parameter θ taking values in a parameter space Θ. The parameter may also be
vector-valued, in which case Θ ⊆ Rᵏ for some k ∈ N₊ and the parameter vector has the form θ = (θ₁, θ₂, …, θₖ).
Confidence Sets
A confidence set is a subset C(X) of the parameter space Θ that depends only on the data variable X, and not on any unknown parameters. The confidence level is the smallest probability that θ ∈ C(X):
min {P[θ ∈ C (X)] : θ ∈ Θ} (8.1.2)
Thus, in a sense, a confidence set is a set-valued statistic. A confidence set is an estimator of θ in the sense that we hope that
θ ∈ C (X) with high probability, so that the confidence level is high. Note that since the distribution of X depends on θ , there is a
dependence on θ in the probability measure P in the definition of confidence level. However, we usually suppress this, just to keep
the notation simple. Usually, we try to construct a confidence set for θ with a prescribed confidence level 1 − α where 0 < α < 1 .
Typical confidence levels are 0.9, 0.95, and 0.99. Sometimes the best we can do is to construct a confidence set whose confidence
level is at least 1 − α ; this is called a conservative 1 − α confidence set for θ .
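The notion of confidence level can be illustrated by simulation. The following sketch (Python; the parameter values and seed are hypothetical choices, not part of the text) estimates the coverage probability of the standard interval for the mean of a normal distribution with known standard deviation, an interval constructed later in this chapter:

```python
import numpy as np
from scipy import stats

# Hypothetical parameter values; the seed is arbitrary, for reproducibility only.
rng = np.random.default_rng(1)
mu, sigma, n, alpha = 5.0, 2.0, 25, 0.05
z = stats.norm.ppf(1 - alpha / 2)  # quantile z(1 - alpha/2)

# Repeat the experiment many times and count how often the interval captures mu.
reps = 10_000
covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    half = z * sigma / np.sqrt(n)  # half-length of the interval (sigma known)
    if x.mean() - half <= mu <= x.mean() + half:
        covered += 1

print(covered / reps)  # close to the confidence level 1 - alpha = 0.95
```

The empirical proportion of successful intervals approximates the true confidence level, just as in the interval-estimation applets referenced below.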
Suppose that Ci (X) is a 1 − αi level confidence set for θ for i ∈ {1, 2, … , k}. If α = α1 + α2 + ⋯ + αk < 1 then
C1 (X) ∩ C2 (X) ∩ ⋯ ∩ Ck (X) is a conservative 1 − α level confidence set for θ .
Proof
Real-Valued Parameters
In many cases, we are interested in estimating a real-valued parameter λ = λ(θ) taking values in an interval parameter space
(a, b), where a, b ∈ R with a < b. Of course, it's possible that a = −∞ or b = ∞. In this context our confidence set frequently has the form {θ ∈ Θ : L(X) < λ(θ) < U(X)}, where L(X) and U(X) are real-valued statistics. In this case (L(X), U(X)) is called a confidence interval for λ. If L(X) and
U (X) are both random, then the confidence interval is often said to be two-sided. In the special case that U (X) = b , L(X) is
called a confidence lower bound for λ . In the special case that L(X) = a , U (X) is called a confidence upper bound for λ .
Suppose that L(X) is a 1 − α level confidence lower bound for λ and that U (X) is a 1 − β level confidence upper bound for
λ . If α + β < 1 then (L(X), U (X)) is a conservative 1 − (α + β) level confidence interval for λ .
Proof
Pivot Variables
You might think that it should be very difficult to construct confidence sets for a parameter θ . However, in many important special
cases, confidence sets can be constructed easily from certain random variables known as pivot variables.
Suppose that V is a function from S × Θ into a set T . The random variable V (X, θ) is a pivot variable for θ if its distribution
does not depend on θ . Specifically, P[V (X, θ) ∈ B] is constant in θ ∈ Θ for each B ⊆ T .
The basic idea is that we try to combine X and θ algebraically in such a way that we factor out the dependence on θ in the
distribution of the resulting random variable V (X, θ) . If we know the distribution of the pivot variable, then for a given α , we can
try to find B ⊆ T (that does not depend on θ) such that P[V(X, θ) ∈ B] = 1 − α. It then follows that a 1 − α confidence set for θ is C(X) = {θ ∈ Θ : V(X, θ) ∈ B}.
Proof
The confidence set above corresponds to (1 − p)α in the left tail and pα in the right tail, in terms of the distribution of the pivot variable V(X, λ). The special case p = 1/2 is the equal-tailed case, the most common case.
Figure 8.1.3: Distribution of the pivot variable showing (1 − p)α in the left tail and pα in the right tail.
The confidence set (5) is decreasing in α and hence increasing in 1 − α (in the sense of the subset relation) for fixed p.
For the confidence set (5), we would naturally like to choose p that minimizes the size of the set in some sense. However this is
often a difficult problem. The equal-tailed interval, corresponding to p = 1/2, is the most commonly used case, and is sometimes
(but not always) an optimal choice. Pivot variables are far from unique; the challenge is to find a pivot variable whose distribution
is known and which gives tight bounds on the parameter (high precision).
Suppose that V (X, θ) is a pivot variable for θ . If g is a function defined on the range of V and g involves no unknown
parameters, then U = g[V (X, θ)] is also a pivot variable for θ .
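Both the definition of a pivot variable and this closure property can be checked empirically. The sketch below (Python; the parameter values and seed are arbitrary) simulates the normal location-scale pivot V(X, θ) = (X − μ)/σ under two different parameter vectors, and also transforms it by the parameter-free function g(v) = v²:

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

# For the normal location-scale family, V(X, theta) = (X - mu)/sigma is a pivot:
# its distribution is standard normal no matter what (mu, sigma) is.
def pivot_samples(mu, sigma, size=100_000):
    x = rng.normal(mu, sigma, size)
    return (x - mu) / sigma

v1 = pivot_samples(mu=0.0, sigma=1.0)
v2 = pivot_samples(mu=50.0, sigma=9.0)

# Empirical means and standard deviations agree across parameter values.
print(v1.mean(), v1.std(), v2.mean(), v2.std())

# A parameter-free function of a pivot, here g(v) = v^2, is again a pivot;
# both empirical means are near 1, the mean of the chi-square(1) distribution.
print((v1 ** 2).mean(), (v2 ** 2).mean())
```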
Suppose now that X = μ + σZ, where Z has a known distribution; the corresponding family of distributions is called the location-scale family associated with the distribution of Z. Here μ is the location parameter and σ is the scale parameter. Generally, we are assuming that these parameters are unknown.
Now suppose that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the distribution of X; this is our observable outcome vector. For each i, let
Zᵢ = (Xᵢ − μ)/σ (8.1.6)
In particular, note that Z is a pivot variable for (μ, σ), since Z is a function of X, μ , and σ, but the distribution of Z does not
depend on μ or σ. Hence, any function of Z will also be a pivot variable for (μ, σ), as long as the function does not involve the parameters. Of course, some of these pivot variables will be much more useful than others in estimating μ and σ. In the following
exercises, we will explore two common and important pivot variables.
Let M (X) and M (Z) denote the sample means of X and Z , respectively. Then M (Z) is a pivot variable for (μ, σ) since
M(Z) = (M(X) − μ)/σ (8.1.7)
Let m denote the quantile function of the pivot variable M(Z). For any α ∈ (0, 1) and p ∈ (0, 1), a 1 − α confidence set for (μ, σ) is
Zα,p (X) = {(μ, σ) : M (X) − m(1 − pα)σ < μ < M (X) − m(α − pα)σ} (8.1.8)
The confidence set constructed above is a “cone” in the (μ, σ) parameter space, with vertex at (M (X), 0) and boundary lines
of slopes −1/m(1 − pα) and −1/m(α − pα), as shown in the graph below. (Note, however, that both slopes might be
negative or both positive.)
Figure 8.1.4 : The confidence set for (μ, σ) constructed from M
The fact that the confidence set is unbounded is clearly not good, but is perhaps not surprising; we are estimating two real
parameters with a single real-valued pivot variable. However, if σ is known, the confidence set defines a confidence interval for μ .
Geometrically, the confidence interval simply corresponds to the horizontal cross section at σ.
Proof
If σ is known, then (a) gives a 1 − α confidence lower bound for μ and (b) gives a 1 − α confidence upper bound for μ .
Let S(X) and S(Z) denote the sample standard deviations of X and Z , respectively. Then S(Z) is a pivot variable for (μ, σ)
and a pivot variable for σ since
S(Z) = S(X)/σ (8.1.9)
Let s denote the quantile function of S(Z). For any α ∈ (0, 1) and p ∈ (0, 1), a 1 − α confidence set for (μ, σ) is
Vα,p(X) = {(μ, σ) : S(X)/s(1 − pα) < σ < S(X)/s(α − pα)} (8.1.10)
Note that the confidence set gives no information about μ since the random variable above is a pivot variable for σ alone. The
confidence set can also be viewed as a bounded confidence interval for σ.
Proof
The set in part (a) gives a 1 − α confidence lower bound for σ and the set in part (b) gives a 1 − α confidence upper bound for σ.
We can intersect the confidence sets corresponding to the two pivot variables to produce conservative, bounded confidence sets.
If α, β, p, q ∈ (0, 1) with α + β < 1 then Zα,p ∩ Vβ,q is a conservative 1 − (α + β) confidence set for (μ, σ).
Proof
Figure 8.1.6 : The bounded confidence set for (μ, σ) constructed from (M , S)
The most important location-scale family is the family of normal distributions. The problem of estimation in the normal model is
considered in the next section. In the remainder of this section, we will explore another important scale family.
Recall that the standard exponential distribution has probability density function g(x) = e⁻ˣ, x ∈ [0, ∞). The exponential distribution is widely used to model random times (such as lifetimes and “arrival” times), particularly in the context of the Poisson model. Now suppose that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the exponential distribution with unknown scale parameter σ. Let
Y = ∑ᵢ₌₁ⁿ Xᵢ (8.1.11)
Then 2Y/σ has the chi-square distribution with 2n degrees of freedom, and hence is a pivot variable for σ.
Note that this pivot variable is a multiple of the variable M constructed above for general location-scale families (with μ = 0 ). For
p ∈ (0, 1) and k ∈ (0, ∞), let χ²ₖ(p) denote the quantile of order p for the chi-square distribution with k degrees of freedom. For selected values of k and p, χ²ₖ(p) can be obtained from the special distribution calculator or from most statistical software packages.
Recall that
1. χ²ₖ(p) → 0 as p ↓ 0
2. χ²ₖ(p) → ∞ as p ↑ 1
For any α ∈ (0, 1) and any p ∈ (0, 1), a 1 − α confidence interval for σ is
( 2Y/χ²₂ₙ(1 − pα) , 2Y/χ²₂ₙ(α − pα) ) (8.1.12)
Note that
1. 2Y/χ²₂ₙ(1 − α) is a 1 − α confidence lower bound for σ.
2. 2Y/χ²₂ₙ(α) is a 1 − α confidence upper bound for σ.
Of the two-sided confidence intervals constructed above, we would naturally prefer the one with the smallest length, because this
interval gives the most information about the parameter σ. However, minimizing the length as a function of p is computationally
difficult. The two-sided confidence interval that is typically used is the equal-tailed interval obtained by letting p = 1/2:
( 2Y/χ²₂ₙ(1 − α/2) , 2Y/χ²₂ₙ(α/2) ) (8.1.13)
The lifetime of a certain type of component (in hours) has an exponential distribution with unknown scale parameter σ. Ten
devices are operated until failure; the lifetimes are 592, 861, 1470, 2412, 335, 3485, 736, 758, 530, 1961.
1. Construct the 95% two-sided confidence interval for σ.
2. Construct the 95% confidence lower bound for σ.
3. Construct the 95% confidence upper bound for σ.
Answer
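The computations can be sketched in Python with scipy, where `chi2.ppf` plays the role of the special distribution calculator; the data are those given in the exercise:

```python
from scipy.stats import chi2

lifetimes = [592, 861, 1470, 2412, 335, 3485, 736, 758, 530, 1961]
n = len(lifetimes)
y = sum(lifetimes)  # Y = 13140
alpha = 0.05

# Part (a): equal-tailed interval (2Y/chi2_{2n}(1 - alpha/2), 2Y/chi2_{2n}(alpha/2))
lower = 2 * y / chi2.ppf(1 - alpha / 2, 2 * n)
upper = 2 * y / chi2.ppf(alpha / 2, 2 * n)
print(lower, upper)  # roughly (769.2, 2740.1)

# Parts (b) and (c): one-sided 95% bounds use chi2_{2n}(0.95) and chi2_{2n}(0.05)
lower_bound = 2 * y / chi2.ppf(0.95, 2 * n)
upper_bound = 2 * y / chi2.ppf(0.05, 2 * n)
print(lower_bound, upper_bound)  # roughly 836.7 and 2421.9
```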
This page titled 8.1: Introduction to Set Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
8.2: Estimation in the Normal Model
Basic Theory
The Normal Model
The normal distribution is perhaps the most important distribution in the study of mathematical statistics, in part because of the
central limit theorem. As a consequence of this theorem, a measured quantity that is subject to numerous small, random errors will
have, at least approximately, a normal distribution. Such variables are ubiquitous in statistical experiments, in subjects varying
from the physical and biological sciences to the social sciences.
So in this section, we assume that X = (X₁, X₂, …, Xₙ) is a random sample from the normal distribution with mean μ and
standard deviation σ. Our goal is to construct confidence intervals for μ and σ individually, and then more generally, confidence
sets for (μ, σ). These are among the most important special cases of set estimation. A parallel section on Tests in the Normal
Model is in the chapter on Hypothesis Testing. First we need to review some basic facts that will be critical for our analysis.
M = (1/n) ∑ᵢ₌₁ⁿ Xᵢ ,  S² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xᵢ − M)² (8.2.1)
From our study of point estimation, recall that M is an unbiased and consistent estimator of μ while S² is an unbiased and consistent estimator of σ². From these basic statistics we can construct the pivot variables that will be used to construct our interval estimates. The following results were established in the section on Special Properties of the Normal Distribution.
Define
Z = (M − μ)/(σ/√n) ,  T = (M − μ)/(S/√n) ,  V = (n − 1)S²/σ² (8.2.2)
It follows that each of these random variables is a pivot variable for (μ, σ) since the distributions do not depend on the parameters,
but the variables themselves functionally depend on one or both parameters. Pivot variables Z and T will be used to construct
interval estimates of μ while V will be used to construct interval estimates of σ². To construct our estimates, we will need quantiles of these standard distributions. The quantiles can be computed using the special distribution calculator or from most mathematical and statistical software packages. Here is the notation we will use:
1. z(p) denotes the quantile of order p for the standard normal distribution.
2. tₖ(p) denotes the quantile of order p for the student t distribution with k degrees of freedom.
3. χ²ₖ(p) denotes the quantile of order p for the chi-square distribution with k degrees of freedom.
Since the standard normal and student t distributions are symmetric about 0, it follows that z(1 − p) = −z(p) and tₖ(1 − p) = −tₖ(p) for p ∈ (0, 1) and k ∈ N₊. On the other hand, the chi-square distribution is not symmetric.
For α ∈ (0, 1),
1. [M − z(1 − α/2) σ/√n , M + z(1 − α/2) σ/√n] is a 1 − α confidence interval for μ.
2. M − z(1 − α) σ/√n is a 1 − α confidence lower bound for μ.
3. M + z(1 − α) σ/√n is a 1 − α confidence upper bound for μ.
Proof
These are the standard interval estimates for μ when σ is known. The two-sided confidence interval in (a) is symmetric about the
sample mean M, and as the proof shows, corresponds to equal probability α/2 in each tail of the distribution of the pivot variable Z. But of course, this is not the only two-sided 1 − α confidence interval; we can divide the probability α any way we want between the left and right tails of the distribution of Z.
In terms of the distribution of the pivot variable Z , as the proof shows, the two-sided confidence interval above corresponds to pα
in the right tail and (1 − p)α in the left tail. Next, let's study the length of this confidence interval.
L = [z(1 − pα) − z(α − pα)] σ/√n
The last result shows again that there is a tradeoff between the confidence level and the length of the confidence interval. If n and p
are fixed, we can decrease L, and hence tighten our estimate, only at the expense of decreasing our confidence in the estimate.
Conversely, we can increase our confidence in the estimate only at the expense of increasing the length of the interval. In terms of
p, the best of the two-sided 1 − α confidence intervals (and the one that is almost always used) is the symmetric, equal-tailed interval with p = 1/2.
Use the mean estimation experiment to explore the procedure. Select the normal distribution and select normal pivot. Use
various parameter values, confidence levels, sample sizes, and interval types. For each configuration, run the experiment 1000
times. As the simulation runs, note that the confidence interval successfully captures the mean if and only if the value of the
pivot variable is between the quantiles. Note the size and location of the confidence intervals and compare the proportion of
successful intervals to the theoretical confidence level.
For the standard confidence intervals, let d denote the distance between the sample mean M and an endpoint. That is,
d = zα σ/√n (8.2.6)
where zα = z(1 − α/2) for the two-sided interval and zα = z(1 − α) for the one-sided bounds.
Note that d is deterministic, and the length of the standard two-sided interval is L = 2d . In many cases, the first step in the design
of the experiment is to determine the sample size needed to estimate μ with a given margin of error and a given confidence level.
The sample size needed to estimate μ with confidence 1 − α and margin of error d is
n = ⌈ zα² σ² / d² ⌉ (8.2.7)
Proof
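The formula is easy to compute; the sketch below (Python with scipy; `sample_size` is an illustrative name and the numerical values are hypothetical) uses the two-sided quantile zα = z(1 − α/2):

```python
import math
from scipy.stats import norm

def sample_size(sigma, d, alpha=0.05, two_sided=True):
    """Smallest n giving margin of error at most d at confidence level 1 - alpha."""
    z = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    return math.ceil((z * sigma / d) ** 2)

# Hypothetical example: sigma = 0.3, margin of error d = 0.05, 95% confidence
print(sample_size(0.3, 0.05))   # 139
# Halving the margin of error quadruples the required sample size (up to rounding)
print(sample_size(0.3, 0.025))  # 554
```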
Note that n varies directly with zα² and with σ², and inversely with d². This last fact implies a law of diminishing return in reducing the margin of error. For example, if we want to reduce a given margin of error by a factor of 1/2, we must increase the sample size by a factor of 4.
For α ∈ (0, 1),
1. [M − tₙ₋₁(1 − α/2) S/√n , M + tₙ₋₁(1 − α/2) S/√n] is a 1 − α confidence interval for μ.
2. M − tₙ₋₁(1 − α) S/√n is a 1 − α confidence lower bound for μ.
3. M + tₙ₋₁(1 − α) S/√n is a 1 − α confidence upper bound for μ.
Proof
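The two-sided interval in part (a) can be sketched as a small function (Python with scipy; `t_interval` and the sample data are illustrative, not part of the text):

```python
import numpy as np
from scipy.stats import t

def t_interval(x, alpha=0.05):
    """Equal-tailed 1 - alpha confidence interval for mu, with sigma unknown."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = x.mean()
    s = x.std(ddof=1)                # sample standard deviation S
    q = t.ppf(1 - alpha / 2, n - 1)  # quantile t_{n-1}(1 - alpha/2)
    half = q * s / np.sqrt(n)
    return m - half, m + half

sample = [9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3]  # hypothetical data
print(t_interval(sample, alpha=0.10))
```

Note that the interval is symmetric about the sample mean, with half-length proportional to S/√n.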
These are the standard interval estimates of μ with σ unknown. The two-sided confidence interval in (a) is symmetric about the sample mean M and corresponds to equal probability α/2 in each tail of the distribution of the pivot variable T. As before, this is not the only confidence interval; we can divide α between the left and right tails any way that we want.
The two-sided confidence interval above corresponds to pα in the right tail and (1 − p)α in the left tail of the distribution of the pivot variable T. Next, let's study the length of this confidence interval.
For α, p ∈ (0, 1), the (random) length of the two-sided 1 − α confidence interval above is
L = [tₙ₋₁(1 − pα) − tₙ₋₁(α − pα)] S/√n (8.2.10)
3. E(L) = [tₙ₋₁(1 − pα) − tₙ₋₁(α − pα)] √2 σ Γ(n/2) / (√(n(n − 1)) Γ[(n − 1)/2]) (8.2.11)
4. var(L) = (1/n) [tₙ₋₁(1 − pα) − tₙ₋₁(α − pα)]² σ² [1 − 2Γ²(n/2)/((n − 1)Γ²[(n − 1)/2])] (8.2.12)
Proof
Parts (a) and (b) follow from properties of the student quantile function tₙ₋₁. Parts (c) and (d) follow from the fact that √(n − 1) S/σ has the chi distribution with n − 1 degrees of freedom.
Once again, there is a tradeoff between the confidence level and the length of the confidence interval. If n and p are fixed, we can
decrease L, and hence tighten our estimate, only at the expense of decreasing our confidence in the estimate. Conversely, we can
increase our confidence in the estimate only at the expense of increasing the length of the interval. In terms of p, the best of the
two-sided 1 − α confidence intervals (and the one that is almost always used) is the symmetric, equal-tailed interval with p = 1/2. Finally,
note that it does not really make sense to consider L as a function of S , since S is a statistic rather than an algebraic variable.
Similarly, it does not make sense to consider L as a function of n , since changing n means new data and hence a new value of S .
Use the mean estimation experiment to explore the procedure. Select the normal distribution and the T pivot. Use various
parameter values, confidence levels, sample sizes, and interval types. For each configuration, run the experiment 1000 times.
As the simulation runs, note that the confidence interval successfully captures the mean if and only if the value of the pivot
variable is between the quantiles. Note the size and location of the confidence intervals and compare the proportion of
successful intervals to the theoretical confidence level.
Next we will construct confidence intervals for σ² using the pivot variable V given above.
For α ∈ (0, 1),
1. [(n − 1)S²/χ²ₙ₋₁(1 − α/2) , (n − 1)S²/χ²ₙ₋₁(α/2)] is a 1 − α confidence interval for σ².
2. (n − 1)S²/χ²ₙ₋₁(1 − α) is a 1 − α confidence lower bound for σ².
3. (n − 1)S²/χ²ₙ₋₁(α) is a 1 − α confidence upper bound for σ².
Proof
These are the standard interval estimates for σ². The two-sided interval in (a) is the equal-tailed interval, corresponding to probability α/2 in each tail of the distribution of the pivot variable V. Note however that this interval is not symmetric about the sample variance S². Once again, we can partition the probability α between the left and right tails of the distribution of V any way that we like.
For p ∈ (0, 1), a 1 − α confidence interval for σ² is
[(n − 1)S²/χ²ₙ₋₁(1 − pα) , (n − 1)S²/χ²ₙ₋₁(α − pα)] (8.2.13)
In terms of the distribution of the pivot variable V , the confidence interval above corresponds to pα in the right tail and (1 − p)α
in the left tail. Once again, let's look at the length of the general two-sided confidence interval. The length is random, but is a multiple of the sample variance S². Hence we can compute the expected value and variance of the length.
For α, p ∈ (0, 1), the (random) length of the two-sided confidence interval in the last theorem is
L = [1/χ²ₙ₋₁(α − pα) − 1/χ²ₙ₋₁(1 − pα)] (n − 1)S² (8.2.14)
1. E(L) = [1/χ²ₙ₋₁(α − pα) − 1/χ²ₙ₋₁(1 − pα)] (n − 1)σ²
2. var(L) = 2[1/χ²ₙ₋₁(α − pα) − 1/χ²ₙ₋₁(1 − pα)]² (n − 1)σ⁴
To construct an optimal two-sided confidence interval, it would be natural to find p that minimizes the expected length. This is a complicated problem, but it turns out that for large n, the equal-tailed interval with p = 1/2 is close to optimal. Of course, taking square roots of the endpoints of any of the confidence intervals for σ² gives 1 − α confidence intervals for the distribution standard deviation σ.
Use the variance estimation experiment to explore the procedure. Select the normal distribution. Use various parameter values,
confidence levels, sample sizes, and interval types. For each configuration, run the experiment 1000 times. As the simulation
runs, note that the confidence interval successfully captures the standard deviation if and only if the value of the pivot variable
is between the quantiles. Note the size and location of the confidence intervals and compare the proportion of successful
intervals to the theoretical confidence level.
The pivot variable Z leads to the following result: for α, p ∈ (0, 1), a 1 − α confidence set for (μ, σ) is
Zα,p = {(μ, σ) : M − z(1 − pα) σ/√n < μ < M − z(α − pα) σ/√n} (8.2.15)
The confidence set is a “cone” in the (μ, σ) parameter space, with vertex at (M, 0) and boundary lines of slopes −√n/z(1 − pα) and −√n/z(α − pα).
Proof
The confidence cone is shown in the graph below. (Note, however, that both slopes might be negative or both positive.)
Figure 8.2.1 : The confidence set based on the normal pivot variable
The pivot variable T leads to the following result:
Proof
Figure 8.2.2 : The confidence set based on the T pivot variable
By design, this confidence set gives no information about σ. Finally, the pivot variable V leads to the following result:
Proof
Intersections
We can now form intersections of some of the confidence sets constructed above to obtain bounded confidence sets for (μ, σ). We will use the fact that the sample mean M and the sample variance S² are independent, one of the most important special properties of a normal sample. We will also need the result from the Introduction on the intersection of confidence intervals. In the following theorems, suppose that α, β, p, q ∈ (0, 1) with α + β < 1.
The set Tα,p ∩ Vβ,q is a conservative 1 − (α + β) confidence set for (μ, σ).
The set Zα,p ∩ Vβ,q is a (1 − α)(1 − β) confidence set for (μ, σ).
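The second result can be checked by simulation. In the sketch below (Python; the parameter values and seed are arbitrary), each pivot event has probability 1 − α or 1 − β, and the events are independent because M and S² are independent, so the joint coverage is the product:

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(3)  # arbitrary seed
mu, sigma, n = 0.0, 1.0, 20
alpha = beta = 0.05
reps = 20_000

z_lo, z_hi = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)
v_lo, v_hi = chi2.ppf(beta / 2, n - 1), chi2.ppf(1 - beta / 2, n - 1)

hits = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, n)
    z = (x.mean() - mu) / (sigma / np.sqrt(n))     # pivot Z
    v = (n - 1) * x.var(ddof=1) / sigma ** 2       # pivot V
    # (mu, sigma) lies in the intersection exactly when both pivots land
    # between their respective quantiles
    if z_lo < z < z_hi and v_lo < v < v_hi:
        hits += 1

print(hits / reps)  # near (1 - alpha)(1 - beta) = 0.9025
```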
It is interesting to note that the confidence set Tα,p ∩ Vβ,q is a product set as a subset of the parameter space, but is not a product set as a subset of the sample space. By contrast, the confidence set Zα,p ∩ Vβ,q is not a product set as a subset of the parameter space, but is a product set as a subset of the sample space.
Exercises
Robustness
The main assumption that we made was that the underlying sampling distribution is normal. Of course, in real statistical problems,
we are unlikely to know much about the sampling distribution, let alone whether or not it is normal. When a statistical procedure
works reasonably well, even when the underlying assumptions are violated, the procedure is said to be robust. In this subsection,
we will explore the robustness of the estimation procedures for μ and σ.
Suppose in fact that the underlying distribution is not normal. When the sample size n is relatively large, the distribution of the
sample mean will still be approximately normal by the central limit theorem. Thus, our interval estimates of μ may still be
approximately valid.
Use the simulation of the mean estimation experiment to explore the procedure. Select the gamma distribution and select
student pivot. Use various parameter values, confidence levels, sample sizes, and interval types. For each configuration, run the
experiment 1000 times. Note the size and location of the confidence intervals and compare the proportion of successful
intervals to the theoretical confidence level.
In the mean estimation experiment, repeat the previous exercise with the uniform distribution.
How large n needs to be for the interval estimation procedures of μ to work well depends, of course, on the underlying
distribution; the more this distribution deviates from normality, the larger n must be. Fortunately, convergence to normality in the
central limit theorem is rapid and hence, as you observed in the exercises, we can get away with relatively small sample sizes (30
or more) in most cases.
In general, the interval estimation procedures for σ are not robust; there is no analog of the central limit theorem to save us from
deviations from normality.
In the variance estimation experiment, select the gamma distribution. Use various parameter values, confidence levels, sample
sizes, and interval types. For each configuration, run the experiment 1000 times. Note the size and location of the confidence
intervals and compare the proportion of successful intervals to the theoretical confidence level.
In the variance estimation experiment, select the uniform distribution. Use various parameter values, confidence levels, sample
sizes, and interval types. For each configuration, run the experiment 1000 times. Note the size and location of the confidence
intervals and compare the proportion of successful intervals to the theoretical confidence level.
Computational Exercises
In the following exercises, use the equal-tailed construction for two-sided confidence intervals, unless otherwise instructed.
The length of a certain machined part is supposed to be 10 centimeters but due to imperfections in the manufacturing process,
the actual length is normally distributed with mean μ and variance σ². The variance is due to inherent factors in the process, which remain fairly stable over time. From historical data, it is known that σ = 0.3. On the other hand, μ may be set by
adjusting various parameters in the process and hence may change to an unknown value fairly frequently. A sample of 100
parts has mean 10.2.
1. Construct the 95% confidence interval for μ .
2. Construct the 95% confidence upper bound for μ .
3. Construct the 95% confidence lower bound for μ .
Answer
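These answers can be reproduced numerically; a sketch in Python with scipy, using the data given in the exercise:

```python
from math import sqrt
from scipy.stats import norm

sigma, n, m, alpha = 0.3, 100, 10.2, 0.05

# Part (a): two-sided interval M ± z(1 - alpha/2) sigma / sqrt(n)
z2 = norm.ppf(1 - alpha / 2)
lo, hi = m - z2 * sigma / sqrt(n), m + z2 * sigma / sqrt(n)
print(lo, hi)  # about (10.141, 10.259)

# Parts (b) and (c): one-sided bounds use z(1 - alpha) instead
z1 = norm.ppf(1 - alpha)
ub = m + z1 * sigma / sqrt(n)  # 95% confidence upper bound, about 10.249
lb = m - z1 * sigma / sqrt(n)  # 95% confidence lower bound, about 10.151
print(ub, lb)
```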
Suppose that the weight of a bag of potato chips (in grams) is a normally distributed random variable with mean μ and
standard deviation σ, both unknown. A sample of 75 bags has mean 250 and standard deviation 10.
1. Construct the 90% confidence interval for μ .
2. Construct the 90% confidence interval for σ.
3. Construct a conservative 90% confidence rectangle for (μ, σ).
Answer
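Parts (a) and (b) can be reproduced numerically (Python with scipy, using the data in the exercise); for part (c), a conservative 90% rectangle can then be formed by intersecting the 95% versions of the two intervals:

```python
from math import sqrt
from scipy.stats import t, chi2

n, m, s, alpha = 75, 250.0, 10.0, 0.10

# Part (a): equal-tailed t interval for mu
q = t.ppf(1 - alpha / 2, n - 1)
lo_mu, hi_mu = m - q * s / sqrt(n), m + q * s / sqrt(n)
print(lo_mu, hi_mu)  # about (248.08, 251.92)

# Part (b): interval for sigma, square roots of the chi-square interval for sigma^2
lo_sig = sqrt((n - 1) * s ** 2 / chi2.ppf(1 - alpha / 2, n - 1))
hi_sig = sqrt((n - 1) * s ** 2 / chi2.ppf(alpha / 2, n - 1))
print(lo_sig, hi_sig)  # about (8.8, 11.6)
```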
At a telemarketing firm, the length of a telephone solicitation (in seconds) is a normally distributed random variable with mean
μ and standard deviation σ, both unknown. A sample of 50 calls has mean length 300 and standard deviation 60.
At a certain farm the weight of a peach (in ounces) at harvest time is a normally distributed random variable with standard
deviation 0.5. How many peaches must be sampled to estimate the mean weight with a margin of error ±2 and with 95%
confidence.
Answer
The hourly salary for a certain type of construction work is a normally distributed random variable with standard deviation
$1.25 and unknown mean μ . How many workers must be sampled to construct a 95% confidence lower bound for μ with
margin of error $0.25?
Answer
In Cavendish's data, assume that the measured density of the earth has a normal distribution with mean μ and standard
deviation σ, both unknown.
1. Construct the 95% confidence interval for μ . Is the “true” value of the density of the earth in this interval?
2. Construct the 95% confidence interval for σ.
3. Explore, in an informal graphical way, the assumption that the underlying distribution is normal.
Answer
In Short's data, assume that the measured parallax of the sun has a normal distribution with mean μ and standard deviation σ,
both unknown.
1. Construct the 95% confidence interval for μ . Is the “true” value of the parallax of the sun in this interval?
2. Construct the 95% confidence interval for σ.
3. Explore, in an informal graphical way, the assumption that the underlying distribution is normal.
Answer
Suppose that the length of an iris petal of a given type (Setosa, Verginica, or Versicolor) is normally distributed. Use Fisher's
iris data to construct 90% two-sided confidence intervals for each of the following parameters.
1. The mean length of a Setosa iris petal.
2. The mean length of a Verginica iris petal.
3. The mean length of a Versicolor iris petal.
Answer
This page titled 8.2: Estimation in the Normal Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
8.3: Estimation in the Bernoulli Model
Introduction
Recall that an indicator variable is a random variable that just takes the values 0 and 1. In applications, an indicator variable
indicates which of two complementary events in a random experiment has occurred. Typical examples include
A manufactured item subject to unavoidable random factors is either defective or acceptable.
A voter selected from a population either supports a particular candidate or does not.
A person selected from a population either does or does not have a particular medical condition.
A student in a class either passes or fails a standardized test.
A sample of radioactive material either does or does not emit an alpha particle in a specified ten-second period.
Recall also that the distribution of an indicator variable is known as the Bernoulli distribution, named for Jacob Bernoulli, and has
probability density function given by P(X = 1) = p , P(X = 0) = 1 − p , where p ∈ (0, 1) is the basic parameter. In the context
of the examples above,
p is the probability that the manufactured item is defective.
p is the proportion of voters in the population who favor the candidate.
p is the proportion of persons in the population that have the medical condition.
p is the probability that a student in the class will pass the exam.
p is the probability that the material will emit an alpha particle in the specified period.
Recall that the mean and variance of the Bernoulli distribution are E(X) = p and var(X) = p(1 − p) . Often in statistical
applications, p is unknown and must be estimated from sample data. In this section, we will see how to construct interval estimates
for the parameter from sample data. A parallel section on Tests in the Bernoulli Model is in the chapter on Hypothesis Testing.
Suppose now that X = (X₁, X₂, …, Xₙ) is a random sample of size n from the Bernoulli distribution with unknown parameter p ∈ (0, 1). That is, X is a sequence of Bernoulli trials. From the examples in the introduction above, note that often the underlying experiment is to
sample at random from a dichotomous population. When the sampling is with replacement, X really is a sequence of Bernoulli
trials. When the sampling is without replacement, the variables are dependent, but the Bernoulli model is still approximately valid
if the population size is large compared to the sample size n . For more on these points, see the discussion of sampling with and
without replacement in the chapter on Finite Sampling Models.
Note that the sample mean of our data vector X, namely
M = (1/n) ∑_{i=1}^n Xi   (8.3.1)
is the sample proportion of objects of the type of interest. By the central limit theorem, the standard score
Z = (M − p) / √(p(1 − p)/n)   (8.3.2)
has approximately a standard normal distribution and hence is (approximately) a pivot variable for p. For a given sample size n, the distribution of Z is closest to normal when p is near 1/2 and farthest from normal when p is near 0 or 1 (extreme). Because the pivot variable is (approximately) normally distributed, the construction of confidence intervals for p in this model is similar to the construction of confidence intervals for the distribution mean μ in the normal model. But of course all of the confidence intervals so constructed are approximate.
As usual, for r ∈ (0, 1), let z(r) denote the quantile of order r for the standard normal distribution. Values of z(r) can be obtained
from the special distribution calculator, or from most statistical software packages.
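For example, standard normal quantiles can be computed directly in Python; a minimal sketch using only the standard library (`statistics.NormalDist` is available in Python 3.8+):

```python
from statistics import NormalDist

def z(r: float) -> float:
    """z(r): the quantile of order r for the standard normal distribution."""
    return NormalDist().inv_cdf(r)

# The two quantiles used most often below:
print(round(z(0.975), 4))  # 1.96   (two-sided 95% intervals)
print(round(z(0.95), 4))   # 1.6449 (one-sided 95% bounds)
```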
8.3.1 https://stats.libretexts.org/@go/page/10202
Basic Confidence Intervals
For α ∈ (0, 1), the following are approximate 1 − α confidence sets for p:
1. {p ∈ [0, 1] : M − z(1 − α/2)√(p(1 − p)/n) ≤ p ≤ M + z(1 − α/2)√(p(1 − p)/n)}
2. {p ∈ [0, 1] : p ≤ M + z(1 − α)√(p(1 − p)/n)}
3. {p ∈ [0, 1] : M − z(1 − α)√(p(1 − p)/n) ≤ p}
Proof
These confidence sets are actually intervals, known as the Wilson intervals, in honor of Edwin Wilson.
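Solving the quadratic inequality in (1) for p gives the usual closed form of the two-sided Wilson interval. The sketch below (Python standard library only) uses that standard algebraic rearrangement, which is assumed here rather than derived in the text:

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(m: float, n: int, alpha: float = 0.05):
    """Two-sided 1 - alpha Wilson interval for p, given the sample proportion m of n trials."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    center = (m + z**2 / (2 * n)) / (1 + z**2 / n)   # shrinks M slightly toward 1/2
    half = (z / (1 + z**2 / n)) * sqrt(m * (1 - m) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 427 of 1000 sampled voters prefer the candidate (see the exercises below)
lo, hi = wilson_interval(427 / 1000, 1000)
```

Note how the center of the interval is pulled slightly from M toward 1/2; the Wald approximation discussed later discards exactly these correction terms.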
As usual, the equal-tailed confidence interval in (a) is not the only two-sided 1 − α confidence interval for p. We can divide the α probability between the left and right tails of the standard normal distribution in any way that we please.
For α, r ∈ (0, 1), an approximate two-sided 1 − α confidence interval for p is [U[z(α − rα)], U[z(1 − rα)]], where U is the function in (2).
Proof
In practice, the equal-tailed 1 − α confidence interval in part (a) of (2), obtained by setting r = 1/2, is the one that is always used. As r ↑ 1, the right endpoint converges to the 1 − α confidence upper bound in part (b), and as r ↓ 0 the left endpoint converges to the 1 − α confidence lower bound in part (c).
For α ∈ (0, 1), the following have approximate confidence level 1 − α for p:
1. The two-sided interval with endpoints M ± z(1 − α/2)√(M(1 − M)/n).
2. The upper bound M + z(1 − α)√(M(1 − M)/n).
3. The lower bound M − z(1 − α)√(M(1 − M)/n).
Proof
These confidence intervals are known as Wald intervals, in honor of Abraham Wald. Note that the Wald intervals can also be obtained from the Wilson intervals in (2) by assuming that n is large compared to z, so that n/(n + z²) ≈ 1, z²/2n ≈ 0, and z²/4n² ≈ 0. Note that the two-sided interval in (a) is symmetric about the sample proportion M, but that the length of the interval, as well as the center, is random. This is the two-sided interval that is normally used.
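As a sketch of the one-sided Wald bound, the coin data used in the exercises below (302 heads in 500 tosses) can be worked in a few lines of Python (standard library only):

```python
from math import sqrt
from statistics import NormalDist

m, n = 302 / 500, 500              # sample proportion of heads
z = NormalDist().inv_cdf(0.95)     # z(1 - alpha) for a 95% lower bound
lower = m - z * sqrt(m * (1 - m) / n)
# lower is about 0.568; since the bound exceeds 1/2, the data argue against a fair coin
```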
Use the simulation of the proportion estimation experiment to explore the procedure. Use various values of p and various
confidence levels, sample sizes, and interval types. For each configuration, run the experiment 1000 times and compare the
proportion of successful intervals to the theoretical confidence level.
As always, the equal-tailed interval in (4) is not the only two-sided, 1 − α confidence interval.
For α, r ∈ (0, 1), an approximate two-sided 1 − α confidence interval for p is
[M − z(1 − rα)√(M(1 − M)/n), M − z(α − rα)√(M(1 − M)/n)]   (8.3.5)
Recall that p(1 − p) ≤ 1/4 for p ∈ [0, 1], with maximum value 1/4 at p = 1/2. We can obtain conservative confidence intervals for p from the basic confidence intervals by using this fact.
For α ∈ (0, 1), the following have approximate confidence level at least 1 − α for p:
1. The two-sided interval with endpoints M ± z(1 − α/2) / (2√n).
2. The upper bound M + z(1 − α) / (2√n).
3. The lower bound M − z(1 − α) / (2√n).
Proof
Note that the confidence interval in (a) is symmetric about the sample proportion M and that the length of the interval is
deterministic. Of course, the conservative confidence intervals will be larger than the approximate simplified confidence intervals
in (4). The conservative estimate can be used to design the experiment. Recall that the margin of error is the distance between the
sample proportion M and an endpoint of the confidence interval.
A conservative estimate of the sample size n needed to estimate p with confidence 1 − α and margin of error d is
n = ⌈z²(1 − α/2) / (4d²)⌉   (8.3.6)
As always, the equal-tailed interval in (7) is not the only two-sided, conservative, 1 − α confidence interval. For α, r ∈ (0, 1), an approximate conservative two-sided 1 − α confidence interval for p is [M − z(1 − rα)/(2√n), M − z(α − rα)/(2√n)].
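A sketch of the sample-size computation in Python (standard library only), applied to a margin of error of d = 0.03 at 95% confidence:

```python
from math import ceil
from statistics import NormalDist

def sample_size(d: float, alpha: float = 0.05) -> int:
    """Conservative n needed to estimate p with margin of error d and confidence 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z(1 - alpha/2)
    return ceil(z**2 / (4 * d**2))

print(sample_size(0.03))  # 1068
```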
Frequently, we want to compare two proportions p1 and p2. Typical examples include the following:
In a quality control setting, suppose that p1 is the proportion of defective items produced under one set of manufacturing conditions, while p2 is the proportion of defective items produced under a different set of conditions.
In an election, suppose that p1 is the proportion of voters who favor a particular candidate at one point in the campaign, while p2 is the proportion of voters who favor the candidate at a later point (perhaps after a scandal has erupted).
Suppose that p1 is the proportion of students who pass a certain standardized test with the usual test preparation methods, while p2 is the proportion of students who pass the test with a new set of preparation methods.
Suppose that p1 is the proportion of unvaccinated persons in a certain population who contract a certain disease, while p2 is the proportion of vaccinated persons who contract the disease.
Naively, we could construct a confidence interval for each parameter separately. But if the intervals I1 and I2 each have confidence level 1 − α, then the product set I1 × I2 has confidence level (1 − α)² for (p1, p2). So if p1 − p2 is our parameter of interest, we will use a different approach. Suppose that X = (X1, X2, …, X_{n1}) is a random sample of size n1 from the Bernoulli distribution with parameter p1, and that Y = (Y1, Y2, …, Y_{n2}) is a random sample of size n2 from the Bernoulli distribution with parameter p2. We assume that the samples X and Y are independent. Let M1 and M2 denote the sample means (sample proportions) for the samples X and Y. A natural point estimate for p1 − p2, and the building block for our interval estimate, is M1 − M2. As noted in the one-sample model, if ni is large, Mi has an approximate normal distribution with mean pi and variance pi(1 − pi)/ni for i ∈ {1, 2}. Since the samples are independent, so are the sample means. Hence M1 − M2 has an approximate normal distribution with mean p1 − p2 and variance p1(1 − p1)/n1 + p2(1 − p2)/n2. We now have all the tools we need for a simplified, approximate confidence interval for p1 − p2.
For α ∈ (0, 1), the following have approximate confidence level 1 − α for p1 − p2:
1. The two-sided interval with endpoints (M1 − M2) ± z(1 − α/2)√(M1(1 − M1)/n1 + M2(1 − M2)/n2).
2. The lower bound (M1 − M2) − z(1 − α)√(M1(1 − M1)/n1 + M2(1 − M2)/n2).
3. The upper bound (M1 − M2) + z(1 − α)√(M1(1 − M1)/n1 + M2(1 − M2)/n2).
Proof
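A sketch in Python (standard library only) for the two-sided interval in (a), using the production-line counts from the exercises below (12 defectives in 150 items, and 10 in 130):

```python
from math import sqrt
from statistics import NormalDist

def diff_proportion_interval(m1, n1, m2, n2, alpha=0.05):
    """Two-sided 1 - alpha interval for p1 - p2, given sample proportions m1 and m2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(m1 * (1 - m1) / n1 + m2 * (1 - m2) / n2)
    return (m1 - m2) - half, (m1 - m2) + half

lo, hi = diff_proportion_interval(12 / 150, 150, 10 / 130, 130)
# the interval contains 0, so the data do not distinguish the two lines
```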
As always, the equal-tailed interval in (a) is not the only approximate two-sided 1 − α confidence interval. For α, r ∈ (0, 1), an approximate two-sided 1 − α confidence interval for p1 − p2 is
[(M1 − M2) − z(1 − rα)√(M1(1 − M1)/n1 + M2(1 − M2)/n2), (M1 − M2) − z(α − rα)√(M1(1 − M1)/n1 + M2(1 − M2)/n2)]
Proof
Recall again that pi(1 − pi) ≤ 1/4 for pi ∈ [0, 1], with maximum value 1/4 at pi = 1/2. We can use this to construct approximate conservative confidence intervals for p1 − p2.
For α ∈ (0, 1), the following have approximate confidence level at least 1 − α for p1 − p2:
1. The two-sided interval with endpoints (M1 − M2) ± (1/2) z(1 − α/2)√(1/n1 + 1/n2).
2. The lower bound (M1 − M2) − (1/2) z(1 − α)√(1/n1 + 1/n2).
3. The upper bound (M1 − M2) + (1/2) z(1 − α)√(1/n1 + 1/n2).
Proof
Computational Exercises
In a poll of 1000 registered voters in a certain district, 427 prefer candidate X. Construct the 95% two-sided confidence interval
for the proportion of all registered voters in the district that prefer X.
Answer
A coin is tossed 500 times and results in 302 heads. Construct the 95% confidence lower bound for the probability of heads.
Do you believe that the coin is fair?
Answer
A sample of 400 memory chips from a production line are tested, and 30 are defective. Construct the conservative 90% two-
sided confidence interval for the proportion of defective chips.
Answer
A drug company wants to estimate the proportion of persons who will experience an adverse reaction to a certain new drug.
The company wants a two-sided interval with margin of error 0.03 with 95% confidence. How large should the sample be?
Answer
An advertising agency wants to construct a 99% confidence lower bound for the proportion of dentists who recommend a
certain brand of toothpaste. The margin of error is to be 0.02. How large should the sample be?
Answer
The Buffon trial data set gives the results of 104 repetitions of Buffon's needle experiment. Theoretically, the data should
correspond to Bernoulli trials with p = 2/π, but because real students dropped the needle, the true value of p is unknown.
Construct a 95% confidence interval for p. Do you believe that p is the theoretical value?
Answer
A manufacturing facility has two production lines for a certain item. In a sample of 150 items from line 1, 12 are defective.
From a sample of 130 items from line 2, 10 are defective. Construct the two-sided 95% confidence interval for p1 − p2, where p1 and p2 are the proportions of defective items from lines 1 and 2, respectively.
Answer
The vaccine for influenza is tailored each year to match the predicted dominant strain of influenza. Suppose that of 500 unvaccinated persons, 45 contracted the flu in a certain time period. Of 300 vaccinated persons, 20 contracted the flu in the same time period. Construct the two-sided 99% confidence interval for p1 − p2, where p1 is the incidence of flu in the unvaccinated population and p2 is the incidence of flu in the vaccinated population.
This page titled 8.3: Estimation in the Bernoulli Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
8.4: Estimation in the Two-Sample Normal Model
As we have noted before, the normal distribution is perhaps the most important distribution in the study of mathematical statistics, in part
because of the central limit theorem. As a consequence of this theorem, measured quantities that are subject to numerous small, random
errors will have, at least approximately, normal distributions. Such variables are ubiquitous in statistical experiments, in subjects varying
from the physical and biological sciences to the social sciences.
In this section, we will study estimation problems in the two-sample normal model and in the bivariate normal model. This section parallels
the section on Tests in the Two-Sample Normal Model in the Chapter on Hypothesis Testing.
Suppose that X = (X1, X2, …, Xm) is a random sample of size m from the normal distribution with mean μ and standard deviation σ, and that Y = (Y1, Y2, …, Yn) is a random sample of size n from the normal distribution with mean ν and standard deviation τ. Moreover, suppose that the samples X and Y are independent. Usually, the parameters are unknown, so the parameter space for our vector of parameters (μ, ν, σ, τ) is R² × (0, ∞)².
This type of situation arises frequently when the random variables represent a measurement of interest for the objects of the population, and
the samples correspond to two different treatments. For example, we might be interested in the blood pressure of a certain population of
patients. The X vector records the blood pressures of a control sample, while the Y vector records the blood pressures of the sample
receiving a new drug. Similarly, we might be interested in the yield of an acre of corn. The X vector records the yields of a sample receiving
one type of fertilizer, while the Y vector records the yields of a sample receiving a different type of fertilizer.
Usually our interest is in a comparison of the parameters (either the means or standard deviations) for the two sampling distributions. In this section we will construct confidence intervals for the difference of the distribution means ν − μ and for the ratio of the distribution variances τ²/σ². As with previous estimation problems, the construction depends on finding appropriate pivot variables.
For a generic sample U = (U1, U2, …, Uk) from a distribution with mean a, we will use our standard notation for the sample mean and the sample variance:
M(U) = (1/k) ∑_{i=1}^k Ui   (8.4.1)
S²(U) = (1/(k − 1)) ∑_{i=1}^k [Ui − M(U)]²   (8.4.2)
We will need to also recall the special properties of these statistics when the sampling distribution is normal. The special pivot distributions
that will play a fundamental role in this section are the standard normal, the student t , and the Fisher F distributions. To construct our
interval estimates we will need the quantiles of these distributions. The quantiles can be computed using the special distribution calculator or
from most mathematical and statistical software packages. Here is the notation we will use:
1. z(p) denotes the quantile of order p for the standard normal distribution.
2. t_k(p) denotes the quantile of order p for the student t distribution with k degrees of freedom.
3. f_{j,k}(p) denotes the quantile of order p for the F distribution with j degrees of freedom in the numerator and k degrees of freedom in the denominator.
Recall that by symmetry, z(p) = −z(1 − p) and t_k(p) = −t_k(1 − p) for p ∈ (0, 1) and k ∈ N+. On the other hand, there is no simple relationship between the left and right tail probabilities of the F distribution.
Confidence Intervals for the Difference of the Means with Known Variances
First we will construct confidence intervals for ν − μ under the assumption that the distribution variances σ² and τ² are known. This is not always an artificial assumption. As in the one sample normal model, the variances are sometimes stable, and hence are at least approximately known, while the means change under different treatments. First recall the following basic facts:
The difference of the sample means M(Y) − M(X) has the normal distribution with mean ν − μ and variance σ²/m + τ²/n. Hence the standard score of the difference of the sample means
8.4.1 https://stats.libretexts.org/@go/page/10203
Z = ([M(Y) − M(X)] − (ν − μ)) / √(σ²/m + τ²/n)   (8.4.3)
has the standard normal distribution. Thus, this variable is a pivot variable for ν − μ when σ and τ are known.
The basic confidence interval and upper and lower bound are now easy to construct.
For α ∈ (0, 1):
1. [M(Y) − M(X) − z(1 − α/2)√(σ²/m + τ²/n), M(Y) − M(X) + z(1 − α/2)√(σ²/m + τ²/n)] is a 1 − α confidence interval for ν − μ.
2. M(Y) − M(X) − z(1 − α)√(σ²/m + τ²/n) is a 1 − α confidence lower bound for ν − μ.
3. M(Y) − M(X) + z(1 − α)√(σ²/m + τ²/n) is a 1 − α confidence upper bound for ν − μ.
Proof
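A sketch of the two-sided interval in Python (standard library only). The numbers are borrowed from the machine-rod exercise later in this section, with the sample standard deviations treated, purely for illustration, as if they were the known values of σ and τ:

```python
from math import sqrt
from statistics import NormalDist

def diff_means_interval(mx, my, sigma, tau, m, n, alpha=0.05):
    """Two-sided 1 - alpha interval for nu - mu when sigma and tau are known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(sigma**2 / m + tau**2 / n)
    return (my - mx) - half, (my - mx) + half

lo, hi = diff_means_interval(10.3, 9.8, 1.2, 1.6, 100, 100)
# the interval lies entirely below 0, suggesting nu < mu
```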
The two-sided interval in part (a) is the symmetric interval corresponding to α/2 in both tails of the standard normal distribution. As usual,
we can construct more general two-sided intervals by partitioning α between the left and right tails in any way that we please.
The following theorem gives some basic properties of the length of this interval.
Part (a) means that we can make the estimate more precise by increasing either or both sample sizes. Part (b) means that the estimate
becomes less precise as the variance in either distribution increases. Part (c) we have seen before. All other things being equal, we can
increase the confidence level only at the expense of making the estimate less precise. Part (d) means that the symmetric, equal-tail
confidence interval is the best of the two-sided intervals.
Confidence Intervals for the Difference of the Means with Unknown Variances
Our next method is a construction of confidence intervals for the difference of the means ν − μ without needing to know the standard
deviations σ and τ . However, there is a cost; we will assume that the standard deviations are the same, σ = τ , but the common value is
unknown. This assumption is reasonable if there is an inherent variability in the measurement variables that does not change even when
different treatments are applied to the objects in the population. We need to recall some basic facts from our study of special properties of
normal samples.
Define the pooled estimate of the common variance by S²(X, Y) = [(m − 1)S²(X) + (n − 1)S²(Y)] / (m + n − 2). Then the random variable
T = ([M(Y) − M(X)] − (ν − μ)) / (S(X, Y)√(1/m + 1/n))
has the student t distribution with m + n − 2 degrees of freedom.
Note that S²(X, Y) is a weighted average of the sample variances, with the degrees of freedom as the weight factors. Note also that T is a pivot variable for ν − μ and so we can construct confidence intervals for ν − μ in the usual way.
For α ∈ (0, 1):
1. [M(Y) − M(X) − t_{m+n−2}(1 − α/2) S(X, Y)√(1/m + 1/n), M(Y) − M(X) + t_{m+n−2}(1 − α/2) S(X, Y)√(1/m + 1/n)] is a 1 − α confidence interval for ν − μ.
2. M(Y) − M(X) − t_{m+n−2}(1 − α) S(X, Y)√(1/m + 1/n) is a 1 − α confidence lower bound for ν − μ.
3. M(Y) − M(X) + t_{m+n−2}(1 − α) S(X, Y)√(1/m + 1/n) is a 1 − α confidence upper bound for ν − μ.
Proof
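A sketch of the pooled two-sample t interval, using `scipy.stats` for the t quantile (an assumption of this sketch; any source of t quantiles would do). The summary statistics are from the blood-chemical exercise later in this section:

```python
from math import sqrt
from scipy.stats import t

def pooled_t_interval(mx, sx, m, my, sy, n, alpha=0.05):
    """Two-sided 1 - alpha interval for nu - mu, assuming sigma = tau (unknown)."""
    s2 = ((m - 1) * sx**2 + (n - 1) * sy**2) / (m + n - 2)  # pooled variance S^2(X, Y)
    q = t.ppf(1 - alpha / 2, m + n - 2)                     # t quantile, m + n - 2 df
    half = q * sqrt(s2) * sqrt(1 / m + 1 / n)
    return (my - mx) - half, (my - mx) + half

# placebo: m(x) = 87, s(x) = 4, m = 36; drug: m(y) = 63, s(y) = 6, n = 49
lo, hi = pooled_t_interval(87, 4, 36, 63, 6, 49, alpha=0.10)
```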
The two-sided interval in part (a) is the symmetric interval corresponding to α/2 in both tails of the student t distribution. As usual, we can
construct more general two-sided intervals by partitioning α between the left and right tails in any way that we please.
The next result considers the length of the general two-sided interval.
As in the case of known variances, part (c) means that all other things being equal, we can increase the confidence level only at the expense
of making the estimate less precise. Part (b) means that the symmetric, equal-tail confidence interval is the best of the two-sided intervals.
Confidence Intervals for the Ratio of the Variances
Next we will construct confidence intervals for the ratio of the distribution variances τ²/σ² (and hence also for the ratio of the standard deviations τ/σ). Once again, we need to recall some basic facts from our study of special properties of random samples from the normal distribution.
The ratio
U = S²(X)τ² / (S²(Y)σ²)   (8.4.12)
has the F distribution with m − 1 degrees of freedom in the numerator and n − 1 degrees of freedom in the denominator, and hence this variable is a pivot variable for τ²/σ².
For α ∈ (0, 1):
1. [f_{m−1,n−1}(α/2) S²(Y)/S²(X), f_{m−1,n−1}(1 − α/2) S²(Y)/S²(X)] is a 1 − α confidence interval for τ²/σ².
2. f_{m−1,n−1}(α) S²(Y)/S²(X) is a 1 − α confidence lower bound for τ²/σ².
3. f_{m−1,n−1}(1 − α) S²(Y)/S²(X) is a 1 − α confidence upper bound for τ²/σ².
Proof
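A sketch of the equal-tail interval for τ²/σ², using `scipy.stats` for the F quantiles (assumed available), with the sample variances from the blood-chemical exercise later in this section:

```python
from scipy.stats import f

def variance_ratio_interval(sx2, m, sy2, n, alpha=0.05):
    """Two-sided 1 - alpha interval for tau^2 / sigma^2."""
    ratio = sy2 / sx2                                   # S^2(Y) / S^2(X)
    return (f.ppf(alpha / 2, m - 1, n - 1) * ratio,
            f.ppf(1 - alpha / 2, m - 1, n - 1) * ratio)

# s(x)^2 = 16 with m = 36; s(y)^2 = 36 with n = 49
lo, hi = variance_ratio_interval(16, 36, 36, 49, alpha=0.10)
```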
The two-sided confidence interval in part (a) is the equal-tail confidence interval, and is the one commonly used. But as usual, we can
partition α between the left and right tails of the distribution of the pivot variable in any way that we please.
The length of the general two-sided interval is L = [f_{m−1,n−1}(1 − rα) − f_{m−1,n−1}(α − rα)] S²(Y)/S²(X), and
2. E(L) = [f_{m−1,n−1}(1 − rα) − f_{m−1,n−1}(α − rα)] (τ²/σ²) (m − 1)/(m − 3) if m > 3
3. var(L) = 2 [f_{m−1,n−1}(1 − rα) − f_{m−1,n−1}(α − rα)]² (τ⁴/σ⁴) [(m − 1)/(m − 3)]² (m + n − 4)/[(n − 1)(m − 5)] if m > 5
Proof
Optimally, we might want to choose r so that E(L) is minimized. However, this is difficult computationally, and fortunately the equal-tail interval with r = 1/2 is not too far from optimal when the sample sizes m and n are large.
Suppose now that ((X1, Y1), (X2, Y2), …, (Xn, Yn)) is a random sample of size n from the bivariate normal distribution of a random vector (X, Y), with E(X) = μ, E(Y) = ν, var(X) = σ², var(Y) = τ², and cov(X, Y) = δ.
Thus, instead of a pair of samples, we have a sample of pairs. This type of model frequently arises in before and after experiments, in which a measurement of interest is recorded for a sample of n objects from the population, both before and after a treatment. For example, we could record the blood pressure of a sample of n patients, before and after the administration of a certain drug. The critical point is that in this model, Xi and Yi are measurements made on the same underlying object in the sample. As with the two-sample normal model, we will use S(X, Y) to denote the sample covariance (not to be confused with the pooled estimate of the standard deviation in the two sample model).
The vector of differences Y − X = (Y1 − X1, Y2 − X2, …, Yn − Xn) is a random sample of size n from the distribution of Y − X, which is normal with
1. E(Y − X) = ν − μ
2. var(Y − X) = σ² + τ² − 2δ
The sample mean and variance of the sample of differences are given by
1. M(Y − X) = M(Y) − M(X)
2. S²(Y − X) = S²(X) + S²(Y) − 2S(X, Y)
Thus, the sample of differences Y − X fits the normal model for a single variable. The section on Estimation in the Normal Model could be used to obtain confidence sets and intervals for the parameters (ν − μ, σ² + τ² − 2δ).
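A sketch of this reduction in Python, using `scipy.stats` for the t quantile (assumed available) and the summary statistics of the IQ-supplement exercise later in this section:

```python
from math import sqrt
from scipy.stats import t

# n = 25 subjects; m(x) = 105, s(x) = 13, m(y) = 110, s(y) = 17, s(x, y) = 190
n, mx, sx, my, sy, sxy = 25, 105, 13, 110, 17, 190
m_diff = my - mx                      # M(Y - X) = M(Y) - M(X)
s2_diff = sx**2 + sy**2 - 2 * sxy     # S^2(Y - X) = S^2(X) + S^2(Y) - 2 S(X, Y)
half = t.ppf(0.975, n - 1) * sqrt(s2_diff / n)
lo, hi = m_diff - half, m_diff + half  # 95% interval for nu - mu
```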
In the setting of this subsection, suppose that X = (X1, X2, …, Xn) and Y = (Y1, Y2, …, Yn) are independent. Mathematically this
fits both models—the two-sample normal model and the bivariate normal model. Which procedure would work better for estimating the
difference of means ν − μ ?
1. If the standard deviations σ and τ are known.
2. If the standard deviations σ and τ are unknown.
Answer
Although the setting in the last problem fits both models mathematically, only one model would make sense in a real problem. Again, the
critical point is whether (Xi, Yi) makes sense as a pair of random variables (measurements) corresponding to a given object in the sample.
Computational Exercises
A new drug is being developed to reduce a certain blood chemical. A sample of 36 patients are given a placebo while a sample of 49
patients are given the drug. Let X denote the measurement for a patient given the placebo and Y the measurement for a patient given
the drug (in mg). The statistics are m(x) = 87 , s(x) = 4 , m(y) = 63 , s(y) = 6 .
1. Compute the 90% confidence interval for τ /σ.
2. Assuming that σ = τ , compute the 90% confidence interval for ν − μ .
3. Based on (a), is the assumption that σ = τ reasonable?
4. Based on (b), is the drug effective?
Answer
A company claims that an herbal supplement improves intelligence. A sample of 25 persons are given a standard IQ test before and after
taking the supplement. Let X denote the IQ of a subject before taking the supplement and Y the IQ of the subject after the supplement.
The before and after statistics are m(x) = 105, s(x) = 13 , m(y) = 110 , s(y) = 17 , s(x, y) = 190 . Do you believe the company's
claim?
Answer
In Fisher's iris data, let X denote the petal length of a Versicolor iris and Y the petal length of a Virginica iris.
1. Compute the 90% confidence interval for τ /σ.
2. Assuming that σ = τ , compute the 90% confidence interval for ν − μ .
3. Based on (a), is the assumption that σ = τ reasonable?
Answer
A plant has two machines that produce a circular rod whose diameter (in cm) is critical. Let X denote the diameter of a rod from the
first machine and Y the diameter of a rod from the second machine. A sample of 100 rods from the first machine has mean 10.3 and
standard deviation 1.2. A sample of 100 rods from the second machine has mean 9.8 and standard deviation 1.6.
1. Compute the 90% confidence interval for τ /σ.
2. Assuming that σ = τ , compute the 90% confidence interval for ν − μ .
3. Based on (a), is the assumption that σ = τ reasonable?
Answer
This page titled 8.4: Estimation in the Two-Sample Normal Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon
request.
8.5: Bayesian Set Estimation
Basic Theory
As usual, our starting point is a random experiment with an underlying sample space and a probability measure P. In the basic
statistical model, we have an observable random variable X taking values in a set S . In general, X can have quite a complicated
structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest, then
X = (X1 , X2 , … , Xn ) (8.5.1)
Suppose also that the distribution of X depends on a parameter θ taking values in a parameter space Θ. The parameter may also be vector valued, in which case Θ ⊆ R^k for some k ∈ N+ and the parameter has the form θ = (θ1, θ2, …, θk).
In Bayesian analysis, θ is given a prior distribution with probability density function h on Θ, and f(x ∣ θ) denotes the conditional probability density function of the data X given θ. The (unconditional) probability density function of X is f(x) = ∑_{θ∈Θ} h(θ) f(x ∣ θ) if the parameter has a discrete distribution, or f(x) = ∫_Θ h(θ) f(x ∣ θ) dθ if the parameter has a continuous distribution. Finally, by Bayes' theorem, the posterior probability density function of θ given x ∈ S is
h(θ ∣ x) = h(θ) f(x ∣ θ) / f(x),  θ ∈ Θ   (8.5.5)
In some cases, we can recognize the posterior distribution from the functional form of θ ↦ h(θ)f (x ∣ θ) without having to actually
compute the normalizing constant f (x), and thus reducing the computational burden significantly. In particular, this is often the case
when we have a conjugate parametric family of distributions of θ . Recall that this means that when the prior distribution of θ belongs
to the family, so does the posterior distribution given x ∈ S .
Confidence Sets
Now let C (X) be a confidence set (that is, a subset of the parameter space that depends on the data variable X but no unknown
parameters). One possible definition of a 1 − α level Bayesian confidence set requires that
P [θ ∈ C (x) ∣ X = x] = 1 − α (8.5.6)
In this definition, only θ is random and thus the probability above is computed using the posterior probability density function
θ ↦ h(θ ∣ x) . Another possible definition requires that
P [θ ∈ C (X)] = 1 − α (8.5.7)
In this definition, X and θ are both random, and so the probability above would be computed using the joint probability density
function (x, θ) ↦ h(θ)f (x ∣ θ) . Whatever the philosophical arguments may be, the first definition is certainly the easier one from a
computational viewpoint, and hence is the one most commonly used.
Let us compare the classical and Bayesian approaches. In the classical approach, the parameter θ is deterministic, but unknown. Before
the data are collected, the confidence set C (X) (which is random by virtue of X) will contain the parameter with probability 1 − α .
After the data are collected, the computed confidence set C (x) either contains θ or does not, and we will usually never know which. By
8.5.1 https://stats.libretexts.org/@go/page/10204
contrast in a Bayesian confidence set, the random parameter θ falls in the computed, deterministic confidence set C (x) with probability
1 −α.
Real Parameters
Suppose that θ is real valued, so that Θ ⊆ R. For r ∈ (0, 1), we can compute the 1 − α level Bayesian confidence interval as [U_{(1−r)α}(x), U_{1−rα}(x)], where U_p(x) is the quantile of order p for the posterior distribution of θ given X = x. As in past sections, r is the fraction of α in the right tail of the posterior distribution and 1 − r is the fraction of α in the left tail of the posterior distribution. As usual, r = 1/2 gives the symmetric, two-sided confidence interval; letting r → 0 gives the confidence lower bound; and letting r → 1 gives the confidence upper bound.
Random Samples
In terms of our data vector X the most important special case arises when we have a basic variable X with values in a set R, and given θ, X = (X1, X2, …, Xn) is a random sample of size n from X. That is, given θ, X is a sequence of independent, identically distributed variables, each with the same distribution as X given θ. Thus S = R^n and if X has conditional probability density function g(x ∣ θ), then
f(x ∣ θ) = g(x1 ∣ θ) g(x2 ∣ θ) ⋯ g(xn ∣ θ),  x = (x1, x2, …, xn) ∈ S
Applications
The Bernoulli Distribution
Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the Bernoulli distribution with unknown success parameter p ∈ (0, 1). In the usual language of reliability, Xi = 1 means success on trial i and Xi = 0 means failure on trial i. The distribution is named for Jacob Bernoulli. Recall that the Bernoulli distribution has probability density function (given p)
g(x ∣ p) = p^x (1 − p)^{1−x},  x ∈ {0, 1}   (8.5.9)
As usual, we will denote the number of successes by Y = ∑_{i=1}^n Xi. Given p, random variable Y has the binomial distribution with parameters n and p.
In our previous discussion of Bayesian estimation, we showed that the beta distribution is conjugate for p. Specifically, if the prior distribution of p is beta with left parameter a > 0 and right parameter b > 0, then the posterior distribution of p given X is beta with left parameter a + Y and right parameter b + (n − Y); the left parameter is increased by the number of successes and the right parameter by the number of failures. It follows that a 1 − α level Bayesian confidence interval for p is [U_{α/2}(y), U_{1−α/2}(y)], where U_r(y) is the quantile of order r for the posterior beta distribution. In the special case a = b = 1, the prior distribution is uniform on the interval (0, 1).
Suppose that we have a coin with an unknown probability p of heads, and that we give p the uniform prior, reflecting our lack of
knowledge about p. We then toss the coin 50 times, observing 30 heads.
1. Find the posterior distribution of p given the data.
2. Construct the 95% Bayesian confidence interval.
3. Construct the classical Wald confidence interval at the 95% level.
Answer
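A sketch of the computation using `scipy.stats` (assumed available); by the conjugacy result above, the posterior here is beta with parameters 31 and 21:

```python
from scipy.stats import beta

heads, n = 30, 50
a_post, b_post = 1 + heads, 1 + (n - heads)        # uniform prior: a = b = 1
lo, hi = beta.ppf([0.025, 0.975], a_post, b_post)  # equal-tailed 95% interval
# posterior mean is 31/52, about 0.596
```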
Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the Poisson distribution with unknown parameter λ ∈ (0, ∞). Recall that the Poisson distribution is often used to model the number of “random points” in a region of time or space and is studied in more detail in the chapter on the Poisson Process. The distribution is named for the inimitable Simeon Poisson and, given λ, has probability density function
g(x ∣ λ) = e^{−λ} λ^x / x!,  x ∈ N   (8.5.10)
As usual, we will denote the sum of the sample values by Y = ∑_{i=1}^n Xi. Given λ, random variable Y also has a Poisson distribution, but with parameter nλ.
In our previous discussion of Bayesian estimation, we showed that the gamma distribution is conjugate for λ . Specifically, if the prior
distribution of λ is gamma with shape parameter k > 0 and rate parameter r > 0 (so that the scale parameter is 1/r), then the posterior
distribution of λ given X is gamma with shape parameter k + Y and rate parameter r + n. It follows that a 1 − α level Bayesian confidence interval for λ is [U_{α/2}(y), U_{1−α/2}(y)], where U_p(y) is the quantile of order p for the posterior gamma distribution.
Consider the alpha emissions data, which we believe come from a Poisson distribution with unknown parameter λ . Suppose that a
priori, we believe that λ is about 5, so we give λ a prior gamma distribution with shape parameter 5 and rate parameter 1. (Thus the mean is 5 and the standard deviation √5 ≈ 2.236.)
1. Find the posterior distribution of λ given the data.
2. Construct the 95% Bayesian confidence interval.
3. Construct the classical t confidence interval at the 95% level.
Answer
Suppose that X = (X1, X2, …, Xn) is a random sample of size n from the normal distribution with mean μ ∈ R and variance σ² ∈ (0, ∞). Of course, the normal distribution plays an especially important role in statistics, in part because of the central limit theorem. The normal distribution is widely used to model physical quantities subject to numerous small, random errors. Recall that the normal probability density function (given the parameters) is
g(x ∣ μ, σ) = (1/(√(2π) σ)) exp[−(1/2)((x − μ)/σ)²],  x ∈ R   (8.5.11)
As usual, we will denote the sum of the sample values by Y = ∑_{i=1}^n Xi. Recall that Y also has a normal distribution (given μ and σ), but with mean nμ and variance nσ².
In our previous discussion of Bayesian estimation, we showed that the normal distribution is conjugate for μ (with σ known).
Specifically, if the prior distribution of μ is normal with mean a ∈ R and standard deviation b ∈ (0, ∞), then the posterior distribution
of μ given X is also normal, with
E(μ ∣ X) = (Y b² + a σ²) / (σ² + n b²),  var(μ ∣ X) = σ² b² / (σ² + n b²)   (8.5.12)
It follows that a 1 − α level Bayesian confidence interval for μ is [U_{α/2}(y), U_{1−α/2}(y)], where U_p(y) is the quantile of order p for the posterior normal distribution. An interesting special case is when b = σ, so that the standard deviation of the prior distribution of μ is the same as the standard deviation of the sampling distribution. In this case, the posterior mean is (Y + a)/(n + 1) and the posterior variance is σ²/(n + 1).
The length of a certain machined part is supposed to be 10 centimeters, but due to imperfections in the manufacturing process, the actual length is normally distributed with mean μ and variance σ². The variance is due to inherent factors in the process, which remain fairly stable over time. From historical data, it is known that σ = 0.3. On the other hand, μ may be set by adjusting various parameters in the process and hence may change to an unknown value fairly frequently. Thus, suppose that we give μ a prior normal distribution with mean 10 and standard deviation 0.03. A sample of 100 parts has mean 10.2.
1. Find the posterior distribution of μ given the data.
2. Construct the 95% Bayesian confidence interval.
3. Construct the classical z confidence interval at the 95% level.
Answer
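A sketch of the posterior computation in Python (standard library only), plugging the numbers of this exercise into the posterior mean and variance formulas above:

```python
from math import sqrt
from statistics import NormalDist

sigma, a, b, n = 0.3, 10.0, 0.03, 100
y = n * 10.2                                       # Y = sum of the sample values
post_mean = (y * b**2 + a * sigma**2) / (sigma**2 + n * b**2)
post_sd = sqrt(sigma**2 * b**2 / (sigma**2 + n * b**2))
z = NormalDist().inv_cdf(0.975)
lo, hi = post_mean - z * post_sd, post_mean + z * post_sd  # 95% Bayesian interval
# post_mean is exactly 10.1 here, since sigma^2 = n * b^2
```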
a ∈ (0, ∞) and right shape parameter b = 1 . The beta distribution is widely used to model random proportions and probabilities and
other variables that take values in bounded intervals. Recall that the probability density function (given a ) is
g(x ∣ a) = a x^{a−1},  x ∈ (0, 1)  (8.5.13)
8.5.3 https://stats.libretexts.org/@go/page/10204
[U_{α/2}(w), U_{1−α/2}(w)] where U_p(w) is the quantile of order p for the posterior gamma distribution. In the special case that k = 1, the
Suppose that the resistance of an electrical component (in Ohms) has the beta distribution with unknown left parameter a and right
parameter b = 1 . We believe that a may be about 10, so we give a the prior gamma distribution with shape parameter 10 and rate
parameter 1. We sample 20 components and observe the data
0.98, 0.93, 0.99, 0.89, 0.79, 0.99, 0.92, 0.97, 0.88, 0.97, 0.86, 0.84, 0.96, 0.97, 0.92, 0.90, 0.98, 0.96, 0.96, 1.00
(8.5.14)
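For data from the beta distribution with left parameter a and right parameter b = 1, the likelihood is proportional to aⁿ exp(a ∑ ln xᵢ), so a gamma prior with shape k and rate r yields a gamma posterior with shape k + n and rate r − ∑ ln xᵢ (each xᵢ ∈ (0, 1), so the rate stays positive). A sketch of this computation for the resistance data, assuming SciPy (the names `post_shape`, `post_rate` are illustrative):

```python
from math import log
from scipy.stats import gamma

data = [0.98, 0.93, 0.99, 0.89, 0.79, 0.99, 0.92, 0.97, 0.88, 0.97,
        0.86, 0.84, 0.96, 0.97, 0.92, 0.90, 0.98, 0.96, 0.96, 1.00]
k, r = 10, 1                                # prior gamma shape and rate for a

post_shape = k + len(data)                  # posterior shape: k + n
post_rate = r - sum(log(x) for x in data)   # posterior rate: r - sum(ln x_i)

# 95% Bayesian confidence interval for a: posterior gamma quantiles
ci = gamma.ppf([0.025, 0.975], post_shape, scale=1 / post_rate)
```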
scale parameter b = 1 . The Pareto distribution is used to model certain financial variables and other variables with heavy-tailed
distributions, and is named for Vilfredo Pareto. Recall that the probability density function (given a ) is
g(x ∣ a) = a / x^{a+1},  x ∈ [1, ∞)  (8.5.15)
Suppose that a financial variable has the Pareto distribution with unknown shape parameter a and scale parameter b = 1 . We
believe that a may be about 4, so we give a the prior gamma distribution with shape parameter 4 and rate parameter 1. A random
sample of size 20 from the variable gives the data
1.09, 1.13, 2.00, 1.43, 1.26, 1.00, 1.36, 1.03, 1.46, 1.18, 2.16, 1.16, 1.22, 1.06, 1.28, 1.23, 1.11, 1.03, 1.04, 1.05
(8.5.16)
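The Pareto case works the same way: the likelihood is proportional to aⁿ exp(−a ∑ ln xᵢ), so a gamma prior with shape k and rate r gives a gamma posterior with shape k + n and rate r + ∑ ln xᵢ (each xᵢ ≥ 1, so the sum of logs is nonnegative). A sketch for the financial data above, assuming SciPy:

```python
from math import log
from scipy.stats import gamma

data = [1.09, 1.13, 2.00, 1.43, 1.26, 1.00, 1.36, 1.03, 1.46, 1.18,
        2.16, 1.16, 1.22, 1.06, 1.28, 1.23, 1.11, 1.03, 1.04, 1.05]
k, r = 4, 1                                 # prior gamma shape and rate for a

post_shape = k + len(data)                  # posterior shape: k + n
post_rate = r + sum(log(x) for x in data)   # posterior rate: r + sum(ln x_i)

# 95% Bayesian confidence interval for a: posterior gamma quantiles
ci = gamma.ppf([0.025, 0.975], post_shape, scale=1 / post_rate)
```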
This page titled 8.5: Bayesian Set Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon
request.
CHAPTER OVERVIEW
9: Hypothesis Testing
Hypothesis testing refers to the process of choosing between competing hypotheses about a probability distribution, based on
observed data from the distribution. It is a core topic in mathematical statistics, and indeed is a fundamental part of the language of
statistics. In this chapter, we study the basics of hypothesis testing, and explore hypothesis tests in some of the most important
parametric models: the normal model and the Bernoulli model.
9.1: Introduction to Hypothesis Testing
9.2: Tests in the Normal Model
9.3: Tests in the Bernoulli Model
9.4: Tests in the Two-Sample Normal Model
9.5: Likelihood Ratio Tests
9.6: Chi-Square Tests
This page titled 9: Hypothesis Testing is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
9.1: Introduction to Hypothesis Testing
Basic Theory
Preliminaries
As usual, our starting point is a random experiment with an underlying sample space and a probability measure P. In the basic
statistical model, we have an observable random variable X taking values in a set S . In general, X can have quite a complicated
structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest,
then
X = (X1 , X2 , … , Xn ) (9.1.1)
where Xᵢ is the vector of measurements for the ith object. The most important special case occurs when (X₁, X₂, …, Xₙ) are independent and identically distributed. In this case, we have a random sample of size n from the common distribution.
The purpose of this section is to define and discuss the basic concepts of statistical hypothesis testing. Collectively, these concepts
are sometimes referred to as the Neyman-Pearson framework, in honor of Jerzy Neyman and Egon Pearson, who first formalized
them.
Hypotheses
A statistical hypothesis is a statement about the distribution of X. Equivalently, a statistical hypothesis specifies a set of
possible distributions of X: the set of distributions for which the statement is true. A hypothesis that specifies a single
distribution for X is called simple; a hypothesis that specifies more than one distribution for X is called composite.
In hypothesis testing, the goal is to see if there is sufficient statistical evidence to reject a presumed null hypothesis in favor of a conjectured alternative hypothesis. The null hypothesis is usually denoted H₀ while the alternative hypothesis is usually denoted H₁.
An hypothesis test is a statistical decision; the conclusion will either be to reject the null hypothesis in favor of the alternative, or to fail to reject the null hypothesis. The decision that we make must, of course, be based on the observed value x of the data vector X. Thus, we will find an appropriate subset R of the sample space S and reject H₀ if and only if x ∈ R. The set R is known as the rejection region or the critical region. Note the asymmetry between the null and alternative hypotheses. This asymmetry is due to the fact that we assume the null hypothesis, in a sense, and then see if there is sufficient evidence in x to overturn this assumption in favor of the alternative.
An hypothesis test is a statistical analogy to proof by contradiction, in a sense. Suppose for a moment that H₁ is a statement in a mathematical theory and that H₀ is its negation. One way that we can prove H₁ is to assume H₀ and work our way logically to a contradiction. In an hypothesis test, we don't “prove” anything of course, but there are similarities. We assume H₀ and then see if the data x are sufficiently at odds with that assumption that we feel justified in rejecting H₀ in favor of H₁.
Often, the critical region is defined in terms of a statistic w(X), known as a test statistic, where w is a function from S into another set T. We find an appropriate rejection region R_T ⊆ T and reject H₀ when the observed value w(x) ∈ R_T. Thus, the rejection region in S is then R = w⁻¹(R_T) = {x ∈ S : w(x) ∈ R_T}. As usual, the use of a statistic often allows significant data reduction when the dimension of the test statistic is much smaller than the dimension of the data vector.
Errors
The ultimate decision may be correct or may be in error. There are two types of errors, depending on which of the hypotheses is
actually true.
Types of errors:
1. A type 1 error is rejecting the null hypothesis H₀ when H₀ is true.
2. A type 2 error is failing to reject the null hypothesis H₀ when the alternative hypothesis H₁ is true.
9.1.1 https://stats.libretexts.org/@go/page/10211
Similarly, there are two ways to make a correct decision: we could reject H₀ when H₁ is true, or we could fail to reject H₀ when H₀ is true. The possibilities are summarized in the following table:

Hypothesis Test
State \ Decision | Fail to reject H₀ | Reject H₀
H₀ true | Correct | Type 1 error
H₁ true | Type 2 error | Correct
Of course, when we observe X = x and make our decision, either we will have made the correct decision or we will have
committed an error, and usually we will never know which of these events has occurred. Prior to gathering the data, however, we
can consider the probabilities of the various errors.
If H₀ is true (that is, the distribution of X is specified by H₀), then P(X ∈ R) is the probability of a type 1 error for this distribution. If H₀ is composite, then H₀ specifies a variety of different distributions for X and thus there is a set of type 1 error probabilities.

The maximum probability of a type 1 error, over the set of distributions specified by H₀, is the significance level of the test or the size of the test.

The significance level is often denoted by α. Usually, the rejection region is constructed so that the significance level is a prescribed, small value (typically 0.1, 0.05, 0.01).
If H₁ is true (that is, the distribution of X is specified by H₁), then P(X ∉ R) is the probability of a type 2 error for this distribution. Again, if H₁ is composite then H₁ specifies a variety of different distributions for X, and thus there will be a set of type 2 error probabilities. Generally, there is a tradeoff between the type 1 and type 2 error probabilities. If we reduce the probability of a type 1 error, by making the rejection region R smaller, we necessarily increase the probability of a type 2 error because the complementary region S ∖ R is larger.
The extreme cases can give us some insight. First consider the decision rule in which we never reject H₀, regardless of the evidence x. This corresponds to the rejection region R = ∅. A type 1 error is impossible, so the significance level is 0. On the other hand, the probability of a type 2 error is 1 for any distribution defined by H₁. At the other extreme, consider the decision rule in which we always reject H₀ regardless of the evidence x. This corresponds to the rejection region R = S. A type 2 error is impossible, but now the probability of a type 1 error is 1 for any distribution defined by H₀. In between these two worthless tests are the tests that are actually useful, which strike a balance between the two types of error.
Power
If H₁ is true, so that the distribution of X is specified by H₁, then P(X ∈ R), the probability of rejecting H₀, is the power of the test for this distribution.

Thus the power of the test for a distribution specified by H₁ is the probability of making the correct decision.
Suppose that we have two tests, corresponding to rejection regions R₁ and R₂, respectively, each having significance level α. The test with region R₁ is uniformly more powerful than the test with region R₂ if P(X ∈ R₁) ≥ P(X ∈ R₂) for every distribution of X specified by H₁.

Naturally, in this case, we would prefer the first test. Often, however, two tests will not be uniformly ordered; one test will be more powerful for some distributions specified by H₁ while the other test will be more powerful for other distributions specified by H₁.
If a test has significance level α and is uniformly more powerful than any other test with significance level α , then the test is
said to be a uniformly most powerful test at level α .
P-value

In most cases, we have a general procedure that allows us to construct a test (that is, a rejection region R_α) for any given significance level α ∈ (0, 1). Typically, R_α decreases (in the subset sense) as α decreases.

The P-value of the observed value x of X, denoted P(x), is defined to be the smallest α for which x ∈ R_α; that is, the smallest significance level for which H₀ is rejected, given X = x.

Knowing P(x) allows us to test H₀ at any significance level for the given data x: If P(x) ≤ α then we would reject H₀ at significance level α; if P(x) > α then we fail to reject H₀ at significance level α. Note that P(X) is a statistic. Informally, P(x) can often be thought of as the probability of an outcome “as or more extreme” than the observed value x, where extreme is interpreted relative to the null hypothesis H₀.
H₀: θ ∈ Θ₀ versus H₁: θ ∉ Θ₀  (9.1.3)

where Θ₀ is a prescribed subset of the parameter space Θ. In this setting, the probabilities of making an error or a correct decision depend on the true value of θ. If R is the rejection region, then the power function Q is given by

Q(θ) = P_θ(X ∈ R),  θ ∈ Θ  (9.1.4)
If we have two tests, we can compare them by means of their power functions.
Suppose that we have two tests, corresponding to rejection regions R₁ and R₂, respectively, each having significance level α. The test with rejection region R₁ is uniformly more powerful than the test with rejection region R₂ if Q₁(θ) ≥ Q₂(θ) for all θ ∉ Θ₀.
Most hypothesis tests of an unknown real parameter θ fall into three special cases:
Suppose that θ is a real parameter and θ₀ ∈ Θ a specified value. The tests below are respectively the two-sided test, the left-tailed test, and the right-tailed test.
1. H₀: θ = θ₀ versus H₁: θ ≠ θ₀
2. H₀: θ ≥ θ₀ versus H₁: θ < θ₀
3. H₀: θ ≤ θ₀ versus H₁: θ > θ₀
Thus the tests are named after the conjectured alternative. Of course, there may be other unknown parameters besides θ (known as
nuisance parameters).
Suppose that C(x) is a 1 − α level confidence set for θ. The following test has significance level α for the hypothesis H₀: θ = θ₀ versus H₁: θ ≠ θ₀: Reject H₀ if and only if θ₀ ∉ C(x).
Proof
Equivalently, we fail to reject H₀ at significance level α if and only if θ₀ is in the corresponding 1 − α level confidence set. In particular, this equivalence applies to interval estimates of a real parameter θ and the common tests for θ given above.

In each case below, the confidence interval has confidence level 1 − α and the test has significance level α.
1. Suppose that [L(X), U(X)] is a two-sided confidence interval for θ. Reject H₀: θ = θ₀ versus H₁: θ ≠ θ₀ if and only if θ₀ < L(X) or θ₀ > U(X).
2. Suppose that L(X) is a confidence lower bound for θ. Reject H₀: θ ≤ θ₀ versus H₁: θ > θ₀ if and only if θ₀ < L(X).
3. Suppose that U(X) is a confidence upper bound for θ. Reject H₀: θ ≥ θ₀ versus H₁: θ < θ₀ if and only if θ₀ > U(X).
this case, a natural test statistic for the basic tests given above is W (X, θ ). 0
This page titled 9.1: Introduction to Hypothesis Testing is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
9.2: Tests in the Normal Model
Basic Theory
The Normal Model
The normal distribution is perhaps the most important distribution in the study of mathematical statistics, in part because of the
central limit theorem. As a consequence of this theorem, a measured quantity that is subject to numerous small, random errors will
have, at least approximately, a normal distribution. Such variables are ubiquitous in statistical experiments, in subjects varying
from the physical and biological sciences to the social sciences.
So in this section, we assume that X = (X₁, X₂, …, Xₙ) is a random sample from the normal distribution with mean μ and standard deviation σ. Our goal in this section is to construct hypothesis tests for μ and σ; these are among the most important special cases of hypothesis testing. This section parallels the section on Estimation in the Normal Model in the chapter on Set Estimation, and in particular, the duality between interval estimation and hypothesis testing will play an important role. But first we need to review some basic facts that will be critical for our analysis.
M = (1/n) ∑_{i=1}^n Xᵢ,  S² = (1/(n−1)) ∑_{i=1}^n (Xᵢ − M)²  (9.2.1)
From our study of point estimation, recall that M is an unbiased and consistent estimator of μ while S² is an unbiased and consistent estimator of σ². From these basic statistics we can construct the test statistics that will be used to construct our hypothesis tests. The following results were established in the section on Special Properties of the Normal Distribution.
Define

Z = (M − μ) / (σ/√n),  T = (M − μ) / (S/√n),  V = ((n − 1)/σ²) S²  (9.2.2)

Then Z has the standard normal distribution, T has the student t distribution with n − 1 degrees of freedom, and V has the chi-square distribution with n − 1 degrees of freedom.
It follows that each of these random variables is a pivot variable for (μ, σ) since the distributions do not depend on the parameters,
but the variables themselves functionally depend on one or both parameters. The pivot variables will lead to natural test statistics
that can then be used to perform the hypothesis tests of the parameters. To construct our tests, we will need quantiles of these
standard distributions. The quantiles can be computed using the special distribution calculator or from most mathematical and
statistical software packages. Here is the notation we will use:
1. z(p) denotes the quantile of order p for the standard normal distribution.
2. t_k(p) denotes the quantile of order p for the student t distribution with k degrees of freedom.
3. χ²_k(p) denotes the quantile of order p for the chi-square distribution with k degrees of freedom.

Since the standard normal and student t distributions are symmetric about 0, it follows that z(1 − p) = −z(p) and t_k(1 − p) = −t_k(p) for p ∈ (0, 1) and k ∈ ℕ₊. On the other hand, the chi-square distribution is not symmetric.
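In practice these quantiles come from software rather than tables. A minimal sketch with SciPy (the values p = 0.975 and k = 9 are illustrative):

```python
from scipy.stats import norm, t, chi2

p, k = 0.975, 9   # order of the quantile and degrees of freedom

z_p = norm.ppf(p)           # z(p), standard normal quantile
t_p = t.ppf(p, df=k)        # t_k(p), student t quantile
chi2_p = chi2.ppf(p, df=k)  # chi^2_k(p), chi-square quantile

# Symmetry of the normal and t distributions: z(1 - p) = -z(p), t_k(1 - p) = -t_k(p)
assert abs(norm.ppf(1 - p) + z_p) < 1e-12
assert abs(t.ppf(1 - p, df=k) + t_p) < 1e-12
```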
9.2.1 https://stats.libretexts.org/@go/page/10212
For a conjectured μ₀ ∈ R, define the test statistic

Z = (M − μ₀) / (σ/√n)  (9.2.3)

1. If μ = μ₀ then Z has the standard normal distribution.
2. If μ ≠ μ₀ then Z has the normal distribution with mean (μ − μ₀)/(σ/√n) and variance 1.

So in case (b), (μ − μ₀)/(σ/√n) can be viewed as a non-centrality parameter. The graph of the probability density function of Z is like that of the standard normal probability density function, but shifted to the right or left by the non-centrality parameter, depending on whether μ > μ₀ or μ < μ₀.
For α ∈ (0, 1), each of the following tests has significance level α:
1. Reject H₀: μ = μ₀ versus H₁: μ ≠ μ₀ if and only if Z < −z(1 − α/2) or Z > z(1 − α/2) if and only if M < μ₀ − z(1 − α/2) σ/√n or M > μ₀ + z(1 − α/2) σ/√n.
2. Reject H₀: μ ≤ μ₀ versus H₁: μ > μ₀ if and only if Z > z(1 − α) if and only if M > μ₀ + z(1 − α) σ/√n.
3. Reject H₀: μ ≥ μ₀ versus H₁: μ < μ₀ if and only if Z < −z(1 − α) if and only if M < μ₀ − z(1 − α) σ/√n.
Proof
Part (a) is the standard two-sided test, while (b) is the right-tailed test and (c) is the left-tailed test. Note that in each case, the
hypothesis test is the dual of the corresponding interval estimate constructed in the section on Estimation in the Normal Model.
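The three tests above might be sketched as follows, assuming SciPy; the function name `z_test` and the sample data are illustrative, not part of the text:

```python
from math import sqrt
from scipy.stats import norm

def z_test(data, mu0, sigma, alpha=0.05, alternative="two-sided"):
    """One-sample z test for the mean when sigma is known.

    alternative: "two-sided", "greater" (H1: mu > mu0), or "less" (H1: mu < mu0).
    Returns the test statistic Z and whether H0 is rejected at level alpha.
    """
    n = len(data)
    m = sum(data) / n
    z = (m - mu0) / (sigma / sqrt(n))
    if alternative == "two-sided":
        reject = abs(z) > norm.ppf(1 - alpha / 2)
    elif alternative == "greater":
        reject = z > norm.ppf(1 - alpha)
    else:
        reject = z < -norm.ppf(1 - alpha)
    return z, reject

# Illustrative data: does the sample contradict mu0 = 10 when sigma = 0.3 is known?
z, reject = z_test([9.8, 10.3, 10.1, 10.4], 10.0, 0.3)
```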
For each of the tests above, we fail to reject H₀ at significance level α if and only if μ₀ is in the corresponding 1 − α confidence interval.
1. M − z(1 − α/2) σ/√n ≤ μ₀ ≤ M + z(1 − α/2) σ/√n
2. μ₀ ≤ M + z(1 − α) σ/√n
3. μ₀ ≥ M − z(1 − α) σ/√n
Proof
The two-sided test in (a) corresponds to α/2 in each tail of the distribution of the test statistic Z, under H₀. This test is said to be unbiased. But of course we can construct other biased tests by partitioning the significance level α between the left and right tails in a non-symmetric way.
For every α, p ∈ (0, 1), the following test has significance level α: Reject H₀: μ = μ₀ versus H₁: μ ≠ μ₀ if and only if Z < z(α − pα) or Z ≥ z(1 − pα).
1. p = 1/2 gives the symmetric, unbiased test.

The P-values of these tests can be computed in terms of the standard normal distribution function Φ.
Recall that the power function of a test of a parameter is the probability of rejecting the null hypothesis, as a function of the true
value of the parameter. Our next series of results will explore the power functions of the tests above.
The power function of the general two-sided test above is given by the following formula, and satisfies the given properties:

Q(μ) = Φ(z(α − pα) − (√n/σ)(μ − μ₀)) + Φ((√n/σ)(μ − μ₀) − z(1 − pα)),  μ ∈ R  (9.2.4)

1. Q is decreasing on (−∞, m₀) and increasing on (m₀, ∞), where m₀ = μ₀ + [z(α − pα) + z(1 − pα)] σ/(2√n).
2. Q(μ₀) = α.
So by varying p, we can make the test more powerful for some values of μ, but only at the expense of making the test less powerful for other values of μ.

The power function of the right-tailed test in (b) satisfies the following properties:
1. Q is increasing on R.
2. Q(μ₀) = α.

The power function of the left-tailed test in (c) satisfies the following properties:
1. Q is decreasing on R.
2. Q(μ₀) = α.
For any of the three tests in above , increasing the sample size n or decreasing the standard deviation σ results in a uniformly
more powerful test.
In the mean test experiment, select the normal test statistic and select the normal sampling distribution with standard deviation σ = 2, significance level α = 0.1, sample size n = 20, and μ₀ = 0. Run the experiment 1000 times for several values of the true distribution mean μ. For each value of μ, note the relative frequency of the event that the null hypothesis is rejected. Sketch the empirical power function.
In the mean estimate experiment, select the normal pivot variable and select the normal distribution with μ = 0 and standard deviation σ = 2, confidence level 1 − α = 0.90, and sample size n = 10. For each of the three types of confidence intervals, run the experiment 20 times. State the corresponding hypotheses and significance level, and for each run, give the set of μ₀ for which the null hypothesis would be rejected.
In many cases, the first step is to design the experiment so that the significance level is α and so that the test has a given power β for a given alternative μ₁.
For either of the one-sided tests above, the sample size n needed for a test with significance level α and power β for the alternative μ₁ is

n = (σ [z(β) − z(α)] / (μ₁ − μ₀))²  (9.2.7)
Proof
For the unbiased, two-sided test, the sample size n needed for a test with significance level α and power β for the alternative μ₁ is approximately

n = (σ [z(β) − z(α/2)] / (μ₁ − μ₀))²  (9.2.8)
Proof
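The two sample-size formulas translate directly into code. A sketch assuming SciPy (the function name `sample_size` and the numbers in the example are illustrative; results are rounded up to an integer):

```python
from math import ceil
from scipy.stats import norm

def sample_size(sigma, mu0, mu1, alpha, beta, two_sided=False):
    """Sample size for significance level alpha and power beta at the
    alternative mu1, from (9.2.7) for one-sided tests and (9.2.8) for
    the unbiased two-sided test."""
    z_alpha = norm.ppf(alpha / 2 if two_sided else alpha)
    n = (sigma * (norm.ppf(beta) - z_alpha) / (mu1 - mu0)) ** 2
    return ceil(n)

# Example: sigma = 0.3, detect mu1 = 10.05 versus mu0 = 10 with alpha = 0.1, power 0.8
n_one = sample_size(0.3, 10.0, 10.05, 0.1, 0.8)                   # one-sided
n_two = sample_size(0.3, 10.0, 10.05, 0.1, 0.8, two_sided=True)   # two-sided
```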
1. If μ = μ₀ then T has the student t distribution with n − 1 degrees of freedom.
2. If μ ≠ μ₀ then T has a non-central t distribution with n − 1 degrees of freedom and non-centrality parameter (μ − μ₀)/(σ/√n).

In case (b), the graph of the probability density function of T is much (but not exactly) the same as that of the ordinary t distribution with n − 1 degrees of freedom, but shifted to the right or left by the non-centrality parameter, depending on whether μ > μ₀ or μ < μ₀.
For α ∈ (0, 1), each of the following tests has significance level α:
1. Reject H₀: μ = μ₀ versus H₁: μ ≠ μ₀ if and only if T < −t_{n−1}(1 − α/2) or T > t_{n−1}(1 − α/2) if and only if M < μ₀ − t_{n−1}(1 − α/2) S/√n or M > μ₀ + t_{n−1}(1 − α/2) S/√n.
2. Reject H₀: μ ≤ μ₀ versus H₁: μ > μ₀ if and only if T > t_{n−1}(1 − α) if and only if M > μ₀ + t_{n−1}(1 − α) S/√n.
3. Reject H₀: μ ≥ μ₀ versus H₁: μ < μ₀ if and only if T < −t_{n−1}(1 − α) if and only if M < μ₀ − t_{n−1}(1 − α) S/√n.
Proof
Part (a) is the standard two-sided test, while (b) is the right-tailed test and (c) is the left-tailed test. Note that in each case, the
hypothesis test is the dual of the corresponding interval estimate constructed in the section on Estimation in the Normal Model.
For each of the tests above, we fail to reject H₀ at significance level α if and only if μ₀ is in the corresponding 1 − α confidence interval.
1. M − t_{n−1}(1 − α/2) S/√n ≤ μ₀ ≤ M + t_{n−1}(1 − α/2) S/√n
2. μ₀ ≤ M + t_{n−1}(1 − α) S/√n
3. μ₀ ≥ M − t_{n−1}(1 − α) S/√n
Proof
The two-sided test in (a) corresponds to α/2 in each tail of the distribution of the test statistic T, under H₀. This test is said to be unbiased. But of course we can construct other biased tests by partitioning the significance level α between the left and right tails in a non-symmetric way.
For every α, p ∈ (0, 1), the following test has significance level α: Reject H₀: μ = μ₀ versus H₁: μ ≠ μ₀ if and only if T < t_{n−1}(α − pα) or T ≥ t_{n−1}(1 − pα) if and only if M < μ₀ + t_{n−1}(α − pα) S/√n or M > μ₀ + t_{n−1}(1 − pα) S/√n.
The P-values of these tests can be computed in terms of the distribution function Φ_{n−1} of the t distribution with n − 1 degrees of freedom.
The P-values of the standard tests above are respectively
1. 2[1 − Φ_{n−1}(|T|)]
2. 1 − Φ_{n−1}(T)
3. Φ_{n−1}(T)
In the mean test experiment, select the student test statistic and select the normal sampling distribution with standard deviation σ = 2, significance level α = 0.1, sample size n = 20, and μ₀ = 1. Run the experiment 1000 times for several values of the true distribution mean μ. For each value of μ, note the relative frequency of the event that the null hypothesis is rejected. Sketch the empirical power function.
In the mean estimate experiment, select the student pivot variable and select the normal sampling distribution with mean 0 and standard deviation 2. Select confidence level 0.90 and sample size 10. For each of the three types of intervals, run the experiment 20 times. State the corresponding hypotheses and significance level, and for each run, give the set of μ₀ for which the null hypothesis would be rejected.
The power function for the t tests above can be computed explicitly in terms of the non-central t distribution function. Qualitatively, the graphs of the power functions are similar to those for the case when σ is known, given above, in the two-sided, left-tailed, and right-tailed cases.
If an upper bound σ₀ on the standard deviation σ is known, then conservative estimates on the sample size needed for a given confidence level and a given margin of error can be obtained using the methods for the normal pivot variable, in the two-sided and one-sided cases.
1. If σ = σ₀ then V has the chi-square distribution with n − 1 degrees of freedom.
2. If σ ≠ σ₀ then V has the gamma distribution with shape parameter (n − 1)/2 and scale parameter 2σ²/σ₀².

Recall that the ordinary chi-square distribution with n − 1 degrees of freedom is the gamma distribution with shape parameter (n − 1)/2 and scale parameter 2. So in case (b), the ordinary chi-square distribution is scaled by σ²/σ₀². In particular, the scale factor is greater than 1 if σ > σ₀ and less than 1 if σ < σ₀.
For α ∈ (0, 1), each of the following tests has significance level α:
1. Reject H₀: σ = σ₀ versus H₁: σ ≠ σ₀ if and only if V < χ²_{n−1}(α/2) or V > χ²_{n−1}(1 − α/2) if and only if S² < χ²_{n−1}(α/2) σ₀²/(n − 1) or S² > χ²_{n−1}(1 − α/2) σ₀²/(n − 1).
2. Reject H₀: σ ≥ σ₀ versus H₁: σ < σ₀ if and only if V < χ²_{n−1}(α) if and only if S² < χ²_{n−1}(α) σ₀²/(n − 1).
3. Reject H₀: σ ≤ σ₀ versus H₁: σ > σ₀ if and only if V > χ²_{n−1}(1 − α) if and only if S² > χ²_{n−1}(1 − α) σ₀²/(n − 1).
Proof
Part (a) is the unbiased, two-sided test that corresponds to α/2 in each tail of the chi-square distribution of the test statistic V, under H₀. Part (b) is the left-tailed test and part (c) is the right-tailed test. Once again, we have a duality between the hypothesis tests and the interval estimates constructed in the section on Estimation in the Normal Model.
For each of the tests above, we fail to reject H₀ at significance level α if and only if σ₀² is in the corresponding 1 − α confidence interval.
1. (n − 1) S² / χ²_{n−1}(1 − α/2) ≤ σ₀² ≤ (n − 1) S² / χ²_{n−1}(α/2)
2. σ₀² ≤ (n − 1) S² / χ²_{n−1}(α)
3. σ₀² ≥ (n − 1) S² / χ²_{n−1}(1 − α)
Proof
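A sketch of the two-sided variance test together with its dual confidence interval for σ² (SciPy assumed; the function name `variance_test` and the toy data are illustrative):

```python
from scipy.stats import chi2

def variance_test(data, sigma0, alpha=0.05):
    """Two-sided chi-square test of H0: sigma = sigma0, returning the test
    statistic V, the decision, and the dual 1 - alpha CI for sigma^2."""
    n = len(data)
    m = sum(data) / n
    s2 = sum((x - m) ** 2 for x in data) / (n - 1)   # sample variance
    v = (n - 1) * s2 / sigma0 ** 2                   # test statistic V
    lo, hi = chi2.ppf([alpha / 2, 1 - alpha / 2], df=n - 1)
    reject = v < lo or v > hi
    ci = ((n - 1) * s2 / hi, (n - 1) * s2 / lo)      # CI for sigma^2
    return v, reject, ci

# Tiny illustrative sample, testing sigma0 = 1
v, reject, ci = variance_test([0.0, 2.0], 1.0)
```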
As before, we can construct more general two-sided tests by partitioning the significance level α between the left and right tails of the chi-square distribution in an arbitrary way.

For every α, p ∈ (0, 1), the following test has significance level α: Reject H₀: σ = σ₀ versus H₁: σ ≠ σ₀ if and only if V ≤ χ²_{n−1}(α − pα) or V ≥ χ²_{n−1}(1 − pα) if and only if S² < χ²_{n−1}(α − pα) σ₀²/(n − 1) or S² > χ²_{n−1}(1 − pα) σ₀²/(n − 1).
1. p = 1/2 gives the equal-tail test.
Recall again that the power function of a test of a parameter is the probability of rejecting the null hypothesis, as a function of the true value of the parameter. The power functions of the tests for σ can be expressed in terms of the distribution function G_{n−1} of the chi-square distribution with n − 1 degrees of freedom.

The power function of the general two-sided test above is given by the following formula:

Q(σ) = 1 − G_{n−1}((σ₀²/σ²) χ²_{n−1}(1 − pα)) + G_{n−1}((σ₀²/σ²) χ²_{n−1}(α − pα))  (9.2.11)
The power function of the left-tailed test above is given by the following formula:

Q(σ) = G_{n−1}((σ₀²/σ²) χ²_{n−1}(α))  (9.2.12)

The power function of the right-tailed test above is given by the following formula:

Q(σ) = 1 − G_{n−1}((σ₀²/σ²) χ²_{n−1}(1 − α))  (9.2.13)
In the variance test experiment, select the normal distribution with mean 0, and select significance level 0.1, sample size 10,
and test standard deviation 1.0. For various values of the true standard deviation, run the simulation 1000 times. Record the
relative frequency of rejecting the null hypothesis and plot the empirical power curve.
1. Two-sided test
2. Left-tailed test
3. Right-tailed test
In the variance estimate experiment, select the normal distribution with mean 0 and standard deviation 2, and select confidence
level 0.90 and sample size 10. Run the experiment 20 times. State the corresponding hypotheses and significance level, and for
each run, give the set of test standard deviations for which the null hypothesis would be rejected.
1. Two-sided confidence interval
2. Confidence lower bound
3. Confidence upper bound
Exercises
Robustness
The primary assumption that we made is that the underlying sampling distribution is normal. Of course, in real statistical problems,
we are unlikely to know much about the sampling distribution, let alone whether or not it is normal. Suppose in fact that the
underlying distribution is not normal. When the sample size n is relatively large, the distribution of the sample mean will still be
approximately normal by the central limit theorem, and thus our tests of the mean μ should still be approximately valid. On the other hand, tests of the variance σ² are less robust to deviations from the assumption of normality. The following exercises explore these ideas.
In the mean test experiment, select the gamma distribution with shape parameter 1 and scale parameter 1. For the three different tests and for various significance levels, sample sizes, and values of μ₀, run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H₀. When H₀ is true, compare the relative frequency with the significance level.
In the mean test experiment, select the uniform distribution on [0, 4]. For the three different tests and for various significance levels, sample sizes, and values of μ₀, run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H₀. When H₀ is true, compare the relative frequency with the significance level.
How large n needs to be for the testing procedure to work well depends, of course, on the underlying distribution; the more this
distribution deviates from normality, the larger n must be. Fortunately, convergence to normality in the central limit theorem is
rapid and hence, as you observed in the exercises, we can get away with relatively small sample sizes (30 or more) in most cases.
In the variance test experiment, select the gamma distribution with shape parameter 1 and scale parameter 1. For the three different tests and for various significance levels, sample sizes, and values of σ₀, run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H₀. When H₀ is true, compare the relative frequency with the significance level.
In the variance test experiment, select the uniform distribution on [0, 4]. For the three different tests and for various significance levels, sample sizes, and values of σ₀, run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H₀. When H₀ is true, compare the relative frequency with the significance level.
Computational Exercises
The length of a certain machined part is supposed to be 10 centimeters. In fact, due to imperfections in the manufacturing
process, the actual length is a random variable. The standard deviation is due to inherent factors in the process, which remain
fairly stable over time. From historical data, the standard deviation is known with a high degree of accuracy to be 0.3. The
mean, on the other hand, may be set by adjusting various parameters in the process and hence may change to an unknown
value fairly frequently. We are interested in testing H₀: μ = 10 versus H₁: μ ≠ 10.
1. Suppose that a sample of 100 parts has mean 10.1. Perform the test at the 0.1 level of significance.
2. Compute the P -value for the data in (a).
3. Compute the power of the test in (a) at μ = 10.05.
4. Compute the approximate sample size needed for significance level 0.1 and power 0.8 when μ = 10.05.
Answer
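One way these four computations might be carried out, assuming SciPy (the exercise's numbers are used directly; `n_needed` rounds the approximate sample size up):

```python
from math import ceil, sqrt
from scipy.stats import norm

n, m, sigma, mu0, alpha = 100, 10.1, 0.3, 10.0, 0.1

# 1. Test statistic and two-sided decision at level 0.1
z = (m - mu0) / (sigma / sqrt(n))           # compare |Z| with z(0.95)
reject = abs(z) > norm.ppf(1 - alpha / 2)

# 2. P-value of the two-sided test: 2[1 - Phi(|Z|)]
p_value = 2 * (1 - norm.cdf(abs(z)))

# 3. Power at mu = 10.05 for the unbiased two-sided test, from (9.2.4) with p = 1/2
shift = sqrt(n) * (10.05 - mu0) / sigma
power = norm.cdf(norm.ppf(alpha / 2) - shift) + norm.cdf(shift - norm.ppf(1 - alpha / 2))

# 4. Approximate sample size for power 0.8 at mu = 10.05, from (9.2.8)
n_needed = ceil((sigma * (norm.ppf(0.8) - norm.ppf(alpha / 2)) / (10.05 - mu0)) ** 2)
```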
A bag of potato chips of a certain brand has an advertised weight of 250 grams. Actually, the weight (in grams) is a random
variable. Suppose that a sample of 75 bags has mean 248 and standard deviation 5. At the 0.05 significance level, perform the
following tests:
1. H 0 : μ ≥ 250 versus H : μ < 250
1
2. H 0 : σ ≥7 versus H : σ < 7
1
Answer
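Both parts of this exercise can be sketched from the summary statistics alone (SciPy assumed; the variable names are illustrative):

```python
from math import sqrt
from scipy.stats import t, chi2

n, m, s, alpha = 75, 248.0, 5.0, 0.05

# 1. Left-tailed t test of H0: mu >= 250 versus H1: mu < 250
T = (m - 250) / (s / sqrt(n))
reject_mean = T < -t.ppf(1 - alpha, df=n - 1)

# 2. Left-tailed chi-square test of H0: sigma >= 7 versus H1: sigma < 7
V = (n - 1) * s ** 2 / 7 ** 2
reject_sd = V < chi2.ppf(alpha, df=n - 1)
```

In both cases the test statistic falls in the rejection region, so H₀ is rejected at the 0.05 level.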
At a telemarketing firm, the length of a telephone solicitation (in seconds) is a random variable. A sample of 50 calls has mean
310 and standard deviation 25. At the 0.1 level of significance, can we conclude that
1. μ > 300 ?
2. σ > 20?
Answer
At a certain farm the weight of a peach (in ounces) at harvest time is a random variable. A sample of 100 peaches has mean 8.2
and standard deviation 1.0. At the 0.01 level of significance, can we conclude that
1. μ > 8 ?
2. σ < 1.5?
Answer
The hourly wage for a certain type of construction work is a random variable with standard deviation 1.25. For a sample of 25
workers, the mean wage was $6.75. At the 0.01 level of significance, can we conclude that μ < 7.00?
Answer
Using Cavendish's data, test to see if the density of the earth is less than 5.5 times the density of water, at the 0.05 significance level.
Answer
Using Short's data, test to see if the parallax of the sun differs from 9 seconds of a degree, at the 0.1 significance level.
Answer
Using Fisher's iris data, perform the following tests, at the 0.1 level:
1. The mean petal length of Setosa irises differs from 15 mm.
2. The mean petal length of Virginica irises is greater than 52 mm.
3. The mean petal length of Versicolor irises is less than 44 mm.
Answer
This page titled 9.2: Tests in the Normal Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
9.3: Tests in the Bernoulli Model
Basic Tests
Preliminaries
Suppose that X = (X_1, X_2, …, X_n) is a random sample from the Bernoulli distribution with unknown parameter p ∈ (0, 1).
Thus, these are independent random variables taking the values 1 and 0 with probabilities p and 1 − p respectively. In the usual
language of reliability, 1 denotes success and 0 denotes failure, but of course these are generic terms. Often this model arises in one
of the following contexts:
1. There is an event of interest in a basic experiment, with unknown probability p. We replicate the experiment n times and define X_i = 1 if and only if the event occurred on run i.
2. We have a population of objects of several different types; p is the unknown proportion of objects of a particular type of interest. We select n objects at random from the population and let X_i = 1 if and only if object i is of the type of interest.
When the sampling is with replacement, these variables really do form a random sample from the Bernoulli distribution. When
the sampling is without replacement, the variables are dependent, but the Bernoulli model may still be approximately valid if
the population size is very large compared to the sample size n . For more on these points, see the discussion of sampling with
and without replacement in the chapter on Finite Sampling Models.
In this section, we will construct hypothesis tests for the parameter p. The parameter space for p is the interval (0, 1), and all
hypotheses define subsets of this space. This section parallels the section on Estimation in the Bernoulli Model in the Chapter on
Interval Estimation.
Recall that the number of successes Y = ∑_{i=1}^n X_i has the binomial distribution with parameters n and p, and has probability density function given by
P(Y = y) = C(n, y) p^y (1 − p)^{n−y},  y ∈ {0, 1, …, n}   (9.3.1)
Recall also that the mean is E(Y) = np and the variance is var(Y) = np(1 − p). Moreover Y is sufficient for p and hence is a natural candidate to be a test statistic for hypothesis tests about p. For α ∈ (0, 1), let b_{n,p}(α) denote the quantile of order α for the binomial distribution with parameters n and p. Since the binomial distribution is discrete, only certain (exact) quantiles are possible. For the remainder of this discussion, p_0 ∈ (0, 1) is a conjectured value of p.
For every α ∈ (0, 1), the following tests have approximate significance level α:
1. Reject H_0: p = p_0 versus H_1: p ≠ p_0 if and only if Y ≤ b_{n,p_0}(α/2) or Y ≥ b_{n,p_0}(1 − α/2).
2. Reject H_0: p ≥ p_0 versus H_1: p < p_0 if and only if Y ≤ b_{n,p_0}(α).
3. Reject H_0: p ≤ p_0 versus H_1: p > p_0 if and only if Y ≥ b_{n,p_0}(1 − α).
Proof
The test in (a) is the standard, symmetric, two-sided test, corresponding to probability α/2 (approximately) in both tails of the binomial distribution under H_0. The test in (b) is the left-tailed test and the test in (c) is the right-tailed test. As usual, we can generalize the two-sided test by partitioning α between the left and right tails of the binomial distribution in an arbitrary manner.
For any α, r ∈ (0, 1), the following test has (approximate) significance level α: Reject H_0: p = p_0 versus H_1: p ≠ p_0 if and only if Y ≤ b_{n,p_0}(α − rα) or Y ≥ b_{n,p_0}(1 − rα).
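As a sketch of how the exact binomial quantiles b_{n,p}(α) can be computed, assuming Python and illustrative values n = 20, p_0 = 0.5, α = 0.1 (not from the text), a minimal implementation is:

```python
from math import comb

def binom_cdf(y, n, p):
    """P(Y <= y) for Y ~ binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(y + 1))

def binom_quantile(alpha, n, p):
    """b_{n,p}(alpha): the smallest y with P(Y <= y) >= alpha."""
    y = 0
    while binom_cdf(y, n, p) < alpha:
        y += 1
    return y

# Two-sided test of H0: p = 0.5 at approximate level 0.1 with n = 20
n, p0, alpha = 20, 0.5, 0.1
lo = binom_quantile(alpha / 2, n, p0)        # reject if Y <= lo
hi = binom_quantile(1 - alpha / 2, n, p0)    # reject if Y >= hi
```

For these values the rejection region is Y ≤ 6 or Y ≥ 14; because the distribution is discrete, the true level is only approximately α.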
9.3.1 https://stats.libretexts.org/@go/page/10213
An Approximate Normal Test
When n is large, the distribution of Y is approximately normal, by the central limit theorem, so we can construct an approximate
normal test.
Suppose that the sample size n is large. For a conjectured p_0 ∈ (0, 1), define the test statistic
Z = (Y − n p_0) / √(n p_0 (1 − p_0))   (9.3.2)
1. If p = p_0, then Z has approximately a standard normal distribution.
2. If p ≠ p_0, then Z has approximately a normal distribution with mean √n (p − p_0) / √(p_0 (1 − p_0)) and variance p(1 − p) / (p_0 (1 − p_0)).
Proof
As usual, for α ∈ (0, 1), let z(α) denote the quantile of order α for the standard normal distribution. For selected values of α, z(α) can be obtained from the special distribution calculator, or from most statistical software packages. Recall also by symmetry that z(1 − α) = −z(α).
For every α ∈ (0, 1), the following tests have approximate significance level α:
1. Reject H_0: p = p_0 versus H_1: p ≠ p_0 if and only if Z < −z(1 − α/2) or Z > z(1 − α/2).
2. Reject H_0: p ≥ p_0 versus H_1: p < p_0 if and only if Z < −z(1 − α).
3. Reject H_0: p ≤ p_0 versus H_1: p > p_0 if and only if Z > z(1 − α).
Proof
The test in (a) is the symmetric, two-sided test that corresponds to α/2 in both tails of the distribution of Z, under H_0. The test in (b) is the left-tailed test and the test in (c) is the right-tailed test. As usual, we can construct a more general two-sided test by partitioning α between the left and right tails of the standard normal distribution in an arbitrary manner.
For every α, r ∈ (0, 1), the following test has approximate significance level α: Reject H_0: p = p_0 versus H_1: p ≠ p_0 if and only if Z < z(α − rα) or Z > z(1 − rα).
1. r = 1/2 gives the standard, symmetric two-sided test.
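The approximate normal test above can be sketched in Python; the counts used here (170 successes in 400 trials, conjectured p_0 = 0.4) are hypothetical, chosen only to illustrate the computation.

```python
from math import sqrt
from statistics import NormalDist

def approx_prop_test(y, n, p0, alpha=0.05):
    """Two-sided approximate normal test of H0: p = p0.
    Returns the test statistic Z and whether H0 is rejected."""
    z = (y - n * p0) / sqrt(n * p0 * (1 - p0))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, abs(z) > z_crit

z, reject = approx_prop_test(170, 400, 0.4)
```

Here Z ≈ 1.02 < z(0.975) ≈ 1.96, so H_0: p = 0.4 is not rejected at the 0.05 level.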
Simulation Exercises
In the proportion test experiment, set H_0: p = p_0, and select sample size 10, significance level 0.1, and p_0 = 0.5. For each
p ∈ {0.1, 0.2, …, 0.9}, run the experiment 1000 times and then note the relative frequency of rejecting the null hypothesis.
Graph the empirical power function.
In the proportion test experiment, repeat the previous exercise with sample size 20.
In the proportion test experiment, set H_0: p ≤ p_0, and select sample size 15, significance level 0.05, and p_0 = 0.3. For each
p ∈ {0.1, 0.2, …, 0.9}, run the experiment 1000 times and note the relative frequency of rejecting the null hypothesis. Graph
the empirical power function.
In the proportion test experiment, repeat the previous exercise with sample size 30.
In the proportion test experiment, set H_0: p ≥ p_0, and select sample size 20, significance level 0.01, and p_0 = 0.6. For each
p ∈ {0.1, 0.2, …, 0.9}, run the experiment 1000 times and then note the relative frequency of rejecting the null hypothesis.
Graph the empirical power function.
In the proportion test experiment, repeat the previous exercise with sample size 50.
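A simulation in the spirit of these exercises can be sketched in Python; this is not the applet from the text, just a minimal stand-in using the exact two-sided binomial test with n = 10, p_0 = 0.5, α = 0.1, and a fixed random seed.

```python
import random
from math import comb

def binom_quantile(alpha, n, p):
    """Smallest y with P(Y <= y) >= alpha for Y ~ binomial(n, p)."""
    cdf, y = 0.0, -1
    while cdf < alpha:
        y += 1
        cdf += comb(n, y) * p**y * (1 - p)**(n - y)
    return y

n, p0, alpha, runs = 10, 0.5, 0.1, 1000
lo = binom_quantile(alpha / 2, n, p0)        # reject if Y <= lo
hi = binom_quantile(1 - alpha / 2, n, p0)    # reject if Y >= hi

random.seed(1)
def empirical_rejection_rate(p):
    """Relative frequency of rejecting H0 in `runs` simulated samples."""
    reject = 0
    for _ in range(runs):
        y = sum(random.random() < p for _ in range(n))
        if y <= lo or y >= hi:
            reject += 1
    return reject / runs

rate_at_null = empirical_rejection_rate(p0)
```

When p = p_0 the empirical rejection rate should be close to the true level of the test (here about 0.109, slightly above 0.1 because of the discreteness of the binomial distribution); evaluating the function over a grid of p values traces out the empirical power function.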
Computational Exercises
In a poll of 1000 registered voters in a certain district, 427 prefer candidate X. At the 0.1 level, is the evidence sufficient to conclude that more than 40% of the registered voters prefer X?
Answer
A coin is tossed 500 times and results in 302 heads. At the 0.05 level, test to see if the coin is unfair.
Answer
A sample of 400 memory chips from a production line are tested, and 32 are defective. At the 0.05 level, test to see if the
proportion of defective chips is less than 0.1.
Answer
A new drug is administered to 50 patients and the drug is effective in 42 cases. At the 0.1 level, test to see if the success rate for the new drug is greater than 0.8.
Answer
Using the M&M data, test the following alternative hypotheses at the 0.1 significance level:
1. The proportion of red M&Ms differs from 1/6.
Answer
The Sign Test
Suppose that U is a real-valued random variable with a continuous distribution, and for a given p ∈ (0, 1), let m denote the quantile of order p for the distribution of U. In general of course, m is unknown, even though p is specified, because we don't know the distribution of U. Suppose that we want to construct hypothesis tests for m. For a given test value m_0, let
p_0 = P(U ≤ m_0)   (9.3.5)
Note that p_0 is unknown even though m_0 is specified, because again, we don't know the distribution of U.
Relations
1. m = m_0 if and only if p = p_0.
Proof
As usual, we repeat the basic experiment n times to generate a random sample U = (U_1, U_2, …, U_n) of size n from the distribution of U. Let X_i = 1(U_i ≤ m_0) be the indicator variable of the event {U_i ≤ m_0} for i ∈ {1, 2, …, n}.
Note that X = (X_1, X_2, …, X_n) is a statistic (an observable function of the data vector U) and is a random sample of size n from the Bernoulli distribution with success parameter p_0.
From the last two results it follows that tests of the unknown quantile m can be converted to tests of the Bernoulli parameter p, and thus the tests developed above apply. This procedure is known as the sign test, because essentially, only the sign of U_i − m_0 is recorded for each i. This procedure is also an example of a nonparametric test, because no assumptions about the distribution of U
are made (except for continuity). In particular, we do not need to assume that the distribution of U belongs to a particular
parametric family.
The most important special case of the sign test is the case where p = 1/2; this is the sign test of the median. If the distribution of U is known to be symmetric, the median and the mean agree. In this case, sign tests of the median are also tests of the mean.
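The sign test of the median can be sketched as follows, assuming Python; the data are made up for illustration, and ties with m_0 are dropped, as is customary for continuous distributions.

```python
from math import comb

def sign_test_median(data, m0):
    """Exact two-sided P-value for H0: median = m0 via the sign test.
    Counts observations below m0 and compares with binomial(n, 1/2)."""
    vals = [u for u in data if u != m0]      # drop ties with m0
    n = len(vals)
    y = sum(1 for u in vals if u < m0)       # number of "minus" signs

    def cdf(k):
        """P(Y <= k) for Y ~ binomial(n, 1/2)."""
        return sum(comb(n, j) for j in range(k + 1)) / 2**n

    # two-sided P-value: double the smaller tail, capped at 1
    p = 2 * min(cdf(y), 1 - cdf(y - 1))
    return min(p, 1.0)

# All eight (hypothetical) observations exceed 3.0, so the evidence
# against H0: median = 3.0 is strong: P-value = 2 * (1/2)^8 = 1/128
p = sign_test_median([4.1, 5.2, 3.8, 6.0, 5.5, 4.9, 5.1, 5.7], 3.0)
```

Only the signs of the deviations U_i − m_0 enter the computation, which is exactly why the test is distribution-free.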
Simulation Exercises
In the sign test experiment, set the sampling distribution to normal with mean 0 and standard deviation 2. Set the sample size to 10 and the significance level to 0.1. For each of the 9 values of m_0, run the simulation 1000 times.
1. When m_0 = m, give the empirical estimate of the significance level of the test and compare with 0.1.
2. In the other cases, give the empirical estimate of the power of the test.
In the sign test experiment, set the sampling distribution to uniform on the interval [0, 5]. Set the sample size to 20 and the significance level to 0.05. For each of the 9 values of m_0, run the simulation 1000 times.
1. When m_0 = m, give the empirical estimate of the significance level of the test and compare with 0.05.
2. In the other cases, give the empirical estimate of the power of the test.
In the sign test experiment, set the sampling distribution to gamma with shape parameter 2 and scale parameter 1. Set the sample size to 30 and the significance level to 0.025. For each of the 9 values of m_0, run the simulation 1000 times.
1. When m_0 = m, give the empirical estimate of the significance level of the test and compare with 0.025.
2. In the other cases, give the empirical estimate of the power of the test.
Computational Exercises
Using the M&M data, test to see if the median weight exceeds 47.9 grams, at the 0.1 level.
Answer
Using Fisher's iris data, perform the following tests, at the 0.1 level:
1. The median petal length of Setosa irises differs from 15 mm.
2. The median petal length of Virginica irises is less than 52 mm.
3. The median petal length of Versicolor irises is less than 42 mm.
Answer
This page titled 9.3: Tests in the Bernoulli Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
9.4: Tests in the Two-Sample Normal Model
In this section, we will study hypothesis tests in the two-sample normal model and in the bivariate normal model. This section
parallels the section on Estimation in the Two Sample Normal Model in the chapter on Interval Estimation.
Suppose that X = (X_1, X_2, …, X_m) is a random sample of size m from the normal distribution with mean μ and standard deviation σ, and that Y = (Y_1, Y_2, …, Y_n) is a random sample of size n from the normal distribution with mean ν and standard deviation τ. Moreover, suppose that the samples X and Y are independent.
Proof
Of course (b) actually subsumes (a), but we separate them because the two cases play an important role in the hypothesis tests. In
part (b), the non-zero mean can be viewed as a non-centrality parameter.
As usual, for p ∈ (0, 1), let z(p) denote the quantile of order p for the standard normal distribution. For selected values of p, z(p)
can be obtained from the special distribution calculator or from most statistical software packages. Recall also by symmetry that
z(1 − p) = −z(p) .
Here Z = [M(Y) − M(X) − δ] / √(σ²/m + τ²/n). For every α ∈ (0, 1), the following tests have significance level α:
1. Reject H_0: ν − μ = δ versus H_1: ν − μ ≠ δ if and only if Z < −z(1 − α/2) or Z > z(1 − α/2) if and only if M(Y) − M(X) > δ + z(1 − α/2) √(σ²/m + τ²/n) or M(Y) − M(X) < δ − z(1 − α/2) √(σ²/m + τ²/n).
2. Reject H_0: ν − μ ≥ δ versus H_1: ν − μ < δ if and only if Z < −z(1 − α) if and only if M(Y) − M(X) < δ − z(1 − α) √(σ²/m + τ²/n).
3. Reject H_0: ν − μ ≤ δ versus H_1: ν − μ > δ if and only if Z > z(1 − α) if and only if M(Y) − M(X) > δ + z(1 − α) √(σ²/m + τ²/n).
Proof
For each of the tests above, we fail to reject H_0 at significance level α if and only if δ is in the corresponding 1 − α level confidence interval.
1. [M(Y) − M(X)] − z(1 − α/2) √(σ²/m + τ²/n) ≤ δ ≤ [M(Y) − M(X)] + z(1 − α/2) √(σ²/m + τ²/n)
2. δ ≤ [M(Y) − M(X)] + z(1 − α) √(σ²/m + τ²/n)
3. δ ≥ [M(Y) − M(X)] − z(1 − α) √(σ²/m + τ²/n)
Proof
In many cases it is reasonable to assume that the standard deviations are the same. Thus, we will assume that σ = τ, and the common value σ is unknown. This assumption is reasonable if there is an inherent variability in the measurement variables that does not change even when different treatments are applied to the objects in the population. Recall that the pooled estimate of the common variance σ² is the weighted average of the sample variances, with the degrees of freedom as the weight factors:
S²(X, Y) = [(m − 1) S²(X) + (n − 1) S²(Y)] / (m + n − 2)   (9.4.3)
The statistic S²(X, Y) is an unbiased and consistent estimator of the common variance σ².
Proof
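The pooled variance and the resulting two-sample t statistic can be sketched in Python with the standard library; the two samples here are hypothetical, used only to exercise the formulas.

```python
from math import sqrt
from statistics import mean, variance

def pooled_two_sample_t(x, y, delta=0.0):
    """Pooled variance S^2(X, Y) and the two-sample t statistic for
    H0: nu - mu = delta, assuming a common standard deviation.
    Compare T with t quantiles with m + n - 2 degrees of freedom."""
    m, n = len(x), len(y)
    s2 = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    t = (mean(y) - mean(x) - delta) / sqrt(s2 * (1 / m + 1 / n))
    return s2, t

# Hypothetical measurements from two treatments
x = [10.1, 9.8, 10.3, 10.0]
y = [10.6, 10.4, 10.9, 10.7, 10.5]
s2, t = pooled_two_sample_t(x, y)
```

Note that `statistics.variance` already divides by the degrees of freedom (m − 1 and n − 1), which is why those counts appear as the weights in the pooled estimate.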
As usual, for k > 0 and p ∈ (0, 1), let t_k(p) denote the quantile of order p for the t distribution with k degrees of freedom. For selected values of k and p, values of t_k(p) can be computed from the special distribution calculator, or from most statistical software packages. Let T = [M(Y) − M(X) − δ] / [S(X, Y) √(1/m + 1/n)]. For every α ∈ (0, 1), the following tests have significance level α:
1. Reject H_0: ν − μ = δ versus H_1: ν − μ ≠ δ if and only if T < −t_{m+n−2}(1 − α/2) or T > t_{m+n−2}(1 − α/2).
2. Reject H_0: ν − μ ≥ δ versus H_1: ν − μ < δ if and only if T ≤ −t_{m+n−2}(1 − α).
3. Reject H_0: ν − μ ≤ δ versus H_1: ν − μ > δ if and only if T ≥ t_{m+n−2}(1 − α) if and only if M(Y) − M(X) > δ + t_{m+n−2}(1 − α) S(X, Y) √(1/m + 1/n).
Proof
For each of the tests above, we fail to reject H_0 at significance level α if and only if δ is in the corresponding 1 − α level confidence interval.
1. [M(Y) − M(X)] − t_{m+n−2}(1 − α/2) S(X, Y) √(1/m + 1/n) ≤ δ ≤ [M(Y) − M(X)] + t_{m+n−2}(1 − α/2) S(X, Y) √(1/m + 1/n)
2. δ ≤ [M(Y) − M(X)] + t_{m+n−2}(1 − α) S(X, Y) √(1/m + 1/n)
3. δ ≥ [M(Y) − M(X)] − t_{m+n−2}(1 − α) S(X, Y) √(1/m + 1/n)
Proof
1. If τ²/σ² = ρ_0, then F has the F distribution with m − 1 degrees of freedom in the numerator and n − 1 degrees of freedom in the denominator.
2. If τ²/σ² ≠ ρ_0, then F has a scaled F distribution with m − 1 degrees of freedom in the numerator, n − 1 degrees of freedom in the denominator, and scale factor ρ_0 σ²/τ².
Proof
For each of the tests above, we fail to reject H_0 at significance level α if and only if ρ_0 is in the corresponding 1 − α level confidence interval.
1. [S²(Y)/S²(X)] F_{m−1,n−1}(α/2) ≤ ρ ≤ [S²(Y)/S²(X)] F_{m−1,n−1}(1 − α/2)
2. ρ ≤ [S²(Y)/S²(X)] F_{m−1,n−1}(α)
3. ρ ≥ [S²(Y)/S²(X)] F_{m−1,n−1}(1 − α)
Proof
Suppose now that ((X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n)) is a random sample of size n from the bivariate normal distribution of (X, Y) with E(X) = μ, E(Y) = ν, var(X) = σ², var(Y) = τ², and cov(X, Y) = δ.
Thus, instead of a pair of samples, we have a sample of pairs. The fundamental difference is that in this model, variables X and Y
are measured on the same objects in a sample drawn from the population, while in the previous model, variables X and Y are
measured on two distinct samples drawn from the population. The bivariate model arises, for example, in before and after
experiments, in which a measurement of interest is recorded for a sample of n objects from the population, both before and after a
treatment. For example, we could record the blood pressure of a sample of n patients, before and after the administration of a
certain drug.
We will use our usual notation for the sample means and variances of X = (X_1, X_2, …, X_n) and Y = (Y_1, Y_2, …, Y_n). Recall also that the sample covariance of (X, Y) is
S(X, Y) = [1/(n − 1)] ∑_{i=1}^n [X_i − M(X)][Y_i − M(Y)]   (9.4.10)
(not to be confused with the pooled estimate of the standard deviation in the two-sample model above).
The sample of differences Y − X fits the normal model for a single variable. The section on Tests in the Normal Model could be used to perform tests for the distribution mean ν − μ and the distribution variance σ² + τ² − 2δ.
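The reduction of the bivariate model to a one-sample problem can be sketched as follows, assuming Python; the before/after blood pressures are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical before/after blood pressures for the same five patients
x = [142.0, 150.0, 138.0, 160.0, 147.0]   # before treatment
y = [138.0, 145.0, 139.0, 152.0, 141.0]   # after treatment

d = [yi - xi for xi, yi in zip(x, y)]      # differences Y - X
n = len(d)

# One-sample t statistic for H0: nu - mu = 0, applied to the differences;
# compare with t quantiles with n - 1 = 4 degrees of freedom
t = mean(d) / (stdev(d) / sqrt(n))
```

Because each difference is computed within one patient, the dependence between X and Y is absorbed into the single variable Y − X, whose variance is σ² + τ² − 2δ.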
Computational Exercises
A new drug is being developed to reduce a certain blood chemical. A sample of 36 patients are given a placebo while a sample of 49 patients are given the drug. The statistics (in mg) are m_1 = 87, s_1 = 4, m_2 = 63, s_2 = 6. Test the following at the 10% significance level:
1. H_0: σ_1 = σ_2 versus H_1: σ_1 ≠ σ_2.
A company claims that an herbal supplement improves intelligence. A sample of 25 persons are given a standard IQ test before and after taking the supplement. The before and after statistics are m_1 = 105, s_1 = 13, m_2 = 110, s_2 = 17, s_{1,2} = 190. At the 10% significance level, is there evidence to support the company's claim?
In Fisher's iris data, consider the petal length variable for the samples of Versicolor and Virginica irises. Test the following at
the 10% significance level:
1. H 0 : σ1 = σ2 versus H 1 : σ1 ≠ σ2 .
2. H 0 : μ1 ≤ μ2 versus μ1 > μ2 (assuming that σ 1 = σ2 ).
Answer
A plant has two machines that produce a circular rod whose diameter (in cm) is critical. A sample of 100 rods from the first machine has mean 10.3 and standard deviation 1.2. A sample of 100 rods from the second machine has mean 9.8 and standard
deviation 1.6. Test the following hypotheses at the 10% level.
1. H 0 : σ1 = σ2 versus H 1 : σ1 ≠ σ2 .
2. H 0 : μ1 = μ2 versus H 1 : μ1 ≠ μ2 (assuming that σ 1 = σ2 ).
Answer
This page titled 9.4: Tests in the Two-Sample Normal Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
9.5: Likelihood Ratio Tests
Basic Theory
As usual, our starting point is a random experiment with an underlying sample space, and a probability measure P. In the basic
statistical model, we have an observable random variable X taking values in a set S . In general, X can have quite a complicated
structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest,
then
X = (X1 , X2 , … , Xn ) (9.5.1)
where X_i is the vector of measurements for the ith object. The most important special case occurs when (X_1, X_2, …, X_n) are independent and identically distributed. In this case, we have a random sample of size n from the common distribution.
In the previous sections, we developed tests for parameters based on natural test statistics. However, in other cases, the tests may
not be parametric, or there may not be an obvious statistic to start with. Thus, we need a more general method for constructing test
statistics. Moreover, we do not yet know if the tests constructed so far are the best, in the sense of maximizing the power for the set
of alternatives. In this and the next section, we investigate both of these ideas. Likelihood functions, similar to those used in
maximum likelihood estimation, will play a key role.
We will use subscripts on the probability measure P to indicate the two hypotheses, and we assume that f_0 and f_1 are positive on S.
The test that we will construct is based on the following simple idea: if we observe X = x, then the condition f_1(x) > f_0(x) is evidence in favor of the alternative; the opposite inequality is evidence against the alternative.
Restating our earlier observation, note that small values of L are evidence in favor of H_1. Thus it seems reasonable that the likelihood ratio statistic may be a good test statistic, and that we should consider tests in which we reject H_0 if and only if L ≤ l, where l is a constant.
As usual, we can try to construct a test by choosing l so that α is a prescribed value. If X has a discrete distribution, this will only
be possible when α is a value of the distribution function of L(X).
An important special case of this model occurs when the distribution of X depends on a parameter θ that has two possible values. Thus, the parameter space is {θ_0, θ_1}, and f_0 denotes the probability density function of X when θ = θ_0 while f_1 denotes the probability density function of X when θ = θ_1. In this case, the hypotheses are equivalent to H_0: θ = θ_0 versus H_1: θ = θ_1.
As noted earlier, another important special case is when X = (X_1, X_2, …, X_n) is a random sample of size n from the distribution of an underlying random variable X taking values in a set R. In this case, S = R^n and the probability density function f of X has the form
f(x_1, x_2, …, x_n) = g(x_1) g(x_2) ⋯ g(x_n),  (x_1, x_2, …, x_n) ∈ S
where g is the probability density function of the underlying variable. The hypotheses become
H_0: X has probability density function g_0.
H_1: X has probability density function g_1.
In this special case, it turns out that under H_1, the likelihood ratio statistic, as a function of the sample size n, is a martingale.
and recall that the size of a rejection region is the significance of the test with that rejection region.
Consider the tests with rejection regions R given above and arbitrary A ⊆ S. If the size of R is at least as large as the size of A then the test with rejection region R is more powerful than the test with rejection region A. That is, if P_0(X ∈ R) ≥ P_0(X ∈ A) then P_1(X ∈ R) ≥ P_1(X ∈ A).
Proof
The Neyman-Pearson lemma is more useful than might be first apparent. In many important cases, the same most powerful test
works for a range of alternatives, and thus is a uniformly most powerful test for this range. Several special cases are discussed
below.
Suppose now that the distribution of the data variable X depends on a parameter θ, taking values in a parameter space Θ. Consider the hypotheses θ ∈ Θ_0 versus θ ∉ Θ_0, where Θ_0 ⊆ Θ.
Define
L(x) = sup{f_θ(x) : θ ∈ Θ_0} / sup{f_θ(x) : θ ∈ Θ}   (9.5.9)
The function L is the likelihood ratio function and L(X) is the likelihood ratio statistic.
By the same reasoning as before, small values of L(x) are evidence in favor of the alternative hypothesis.
Suppose that X = (X_1, X_2, …, X_n) is a random sample from the exponential distribution with unknown scale parameter b ∈ (0, ∞). The sample variables might represent the lifetimes from a sample of devices of a certain type. We are interested in testing the simple hypotheses H_0: b = b_0 versus H_1: b = b_1, where b_0, b_1 ∈ (0, ∞) are distinct specified values. Let
Y = ∑_{i=1}^n X_i   (9.5.10)
denote the sum of the sample variables. Recall also that Y has the gamma distribution with shape parameter n and scale parameter b. For α ∈ (0, 1), we will denote the quantile of order α for this distribution by γ_{n,b}(α).
Proof
Note that these tests do not depend on the value of b_1. This fact, together with the monotonicity of the power function, can be used to show that the tests are uniformly most powerful for the usual one-sided tests.
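The key structural fact, that the likelihood ratio is a monotone function of Y = ∑ X_i, can be sketched numerically in Python; the samples and parameter values are illustrative, not from the text.

```python
from math import exp

def likelihood_ratio(sample, b0, b1):
    """L = f0(x) / f1(x) for a sample from the exponential distribution
    with scale parameter b (density (1/b) e^{-x/b})."""
    n, y = len(sample), sum(sample)
    return (b1 / b0) ** n * exp(-y * (1 / b0 - 1 / b1))

# With b0 < b1, the coefficient 1/b0 - 1/b1 is positive, so L is a
# decreasing function of y = sum(sample): "L small" is "Y large".
b0, b1 = 1.0, 2.0
l_small_y = likelihood_ratio([0.5, 0.7, 0.4], b0, b1)   # y = 1.6
l_large_y = likelihood_ratio([2.0, 3.0, 2.5], b0, b1)   # y = 7.5
```

This monotonicity is what lets the same rejection region in terms of Y serve every alternative b_1 on one side of b_0, which is the heart of the uniformly-most-powerful claim.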
Suppose that X = (X_1, X_2, …, X_n) is a random sample from the Bernoulli distribution with unknown success parameter p. The sample could represent the results of tossing a coin n times, where p is the probability of heads. We wish to test the simple hypotheses H_0: p = p_0 versus H_1: p = p_1, where p_0, p_1 ∈ (0, 1) are distinct specified values. In the coin tossing model, we know that the probability of heads is either p_0 or p_1, but we don't know which.
Let
Y = ∑_{i=1}^n X_i   (9.5.14)
denote the number of successes. Recall also that Y has the binomial distribution with parameters n and p. For α ∈ (0, 1), we will denote the quantile of order α for this distribution by b_{n,p}(α); although since the distribution is discrete, only certain values of α are possible.
Proof
Proof
Note that these tests do not depend on the value of p_1. This fact, together with the monotonicity of the power function, can be used to show that the tests are uniformly most powerful for the usual one-sided tests.
A Nonparametric Example
Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n ∈ N_+, either from the Poisson distribution with parameter 1 or from the geometric distribution on N with parameter p = 1/2. Note that both distributions have mean 1 (although the Poisson distribution has variance 1 while the geometric distribution has variance 2). So, we wish to test the hypotheses
H_0: X has probability density function g_0(x) = e^{−1} / x! for x ∈ N.
H_1: X has probability density function g_1(x) = (1/2)^{x+1} for x ∈ N.
L = 2^n e^{−n} 2^Y / U  where  Y = ∑_{i=1}^n X_i  and  U = ∏_{i=1}^n X_i!   (9.5.18)
Proof
The most powerful tests have the following form, where d is a constant: reject H_0 if and only if ln(2) Y − ln(U) ≤ d.
Proof
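The likelihood ratio and the reduced test statistic above can be checked numerically; a minimal sketch in Python, with a made-up sample:

```python
from math import exp, factorial, log, prod

def likelihood_ratio(sample):
    """L = 2^n e^{-n} 2^Y / U: ratio of the Poisson(1) likelihood to
    the geometric(1/2) likelihood, per equation (9.5.18)."""
    n, y = len(sample), sum(sample)
    u = prod(factorial(x) for x in sample)
    return 2**n * exp(-n) * 2**y / u

def lrt_statistic(sample):
    """ln(2) Y - ln(U); small values favor H1 (the geometric model)."""
    y = sum(sample)
    u = prod(factorial(x) for x in sample)
    return log(2) * y - log(u)

sample = [0, 1, 2, 4, 1]   # hypothetical counts
```

Taking logs of (9.5.18) gives ln L = n(ln 2 − 1) + [ln(2) Y − ln(U)], so for fixed n the events {L ≤ l} and {ln(2) Y − ln(U) ≤ d} coincide, which is why the reduced statistic suffices.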
This page titled 9.5: Likelihood Ratio Tests is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
9.6: Chi-Square Tests
In this section, we will study a number of important hypothesis tests that fall under the general term chi-square tests. These are named, as you might guess, because in each case the test statistic has (in the limit) a chi-square distribution. Although there are several different tests in this general category, they all share some common themes:
In each test, there are one or more underlying multinomial samples. Of course, the multinomial model includes the Bernoulli model as a special case.
Each test works by comparing the observed frequencies of the various outcomes with expected frequencies under the null
hypothesis.
If the model is incompletely specified, some of the expected frequencies must be estimated; this reduces the degrees of freedom
in the limiting chi-square distribution.
We will start with the simplest case, where the derivation is the most straightforward; in fact this test is equivalent to a test we have
already studied. We then move to successively more complicated models.
Suppose that X = (X_1, X_2, …, X_n) is a random sample from the Bernoulli distribution with unknown success parameter p ∈ (0, 1). Thus, these are independent random variables taking the values 1 and 0 with probabilities p and 1 − p respectively. We want to test H_0: p = p_0 versus H_1: p ≠ p_0, where p_0 ∈ (0, 1) is specified. Of course, we have already studied such tests in the Bernoulli model. But keep in mind that our methods in this section will generalize to a variety of new models that we have not yet studied.
Let O_1 = ∑_{j=1}^n X_j and O_0 = n − O_1 = ∑_{j=1}^n (1 − X_j). These statistics give the number of times (frequency) that outcomes 1 and 0 occur, respectively. Moreover, we know that each has a binomial distribution; O_1 has parameters n and p, while O_0 has parameters n and 1 − p. In particular, E(O_1) = np, E(O_0) = n(1 − p), and var(O_1) = var(O_0) = np(1 − p). Moreover, recall that O_1 is sufficient for p. Thus, any good test statistic should be a function of O_1. Next, recall that when n is large, the distribution of O_1 is approximately normal. Consider the test statistic
Z = (O_1 − n p_0) / √(n p_0 (1 − p_0))   (9.6.1)
Note that Z is the standard score of O_1 under H_0. Hence if n is large, Z has approximately the standard normal distribution under H_0, and therefore V = Z² has approximately the chi-square distribution with 1 degree of freedom under H_0. As usual, let χ²_k denote the quantile function of the chi-square distribution with k degrees of freedom.
The test above is equivalent to the unbiased test with test statistic Z (the approximate normal test) derived in the section on
Tests in the Bernoulli model.
For purposes of generalization, the critical result in the next exercise is a special representation of V. Let e_0 = n(1 − p_0) and e_1 = n p_0. Note that these are the expected frequencies of the outcomes 0 and 1, respectively, under H_0.
This representation shows that our test statistic V measures the discrepancy between the expected frequencies, under H_0, and the observed frequencies. Of course, large values of V are evidence in favor of H_1. Finally, note that although there are two terms in the expansion of V in Exercise 3, there is only one degree of freedom since O_0 + O_1 = n. The observed and expected frequencies could be stored in a 1 × 2 table.
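The identity V = Z² can be verified numerically; a minimal sketch in Python, with hypothetical counts (37 successes in 100 trials, conjectured p_0 = 0.3):

```python
from math import sqrt

# One-sample Bernoulli: n trials, o1 successes, conjectured p0
n, o1, p0 = 100, 37, 0.3
o0 = n - o1
e1, e0 = n * p0, n * (1 - p0)            # expected frequencies under H0

# Normal test statistic Z and chi-square statistic V
z = (o1 - n * p0) / sqrt(n * p0 * (1 - p0))
v = (o0 - e0) ** 2 / e0 + (o1 - e1) ** 2 / e1
```

For any counts, v equals z squared exactly, so the chi-square test with 1 degree of freedom and the two-sided approximate normal test are the same test.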
The Multi-Sample Bernoulli Model
Suppose now that we have samples from several (possibly) different, independent Bernoulli trials processes. Specifically, suppose that X_i = (X_{i,1}, X_{i,2}, …, X_{i,n_i}) is a random sample of size n_i from the Bernoulli distribution with unknown success parameter p_i ∈ (0, 1) for each i ∈ {1, 2, …, m}. Moreover, the samples (X_1, X_2, …, X_m) are independent. We want to test hypotheses about the unknown parameter vector p = (p_1, p_2, …, p_m). There are two common cases that we consider below, but first let's set up the essential notation that we will need for both cases. For i ∈ {1, 2, …, m} and j ∈ {0, 1}, let O_{i,j} denote the number of times that outcome j occurs in sample X_i. The observed frequency O_{i,j} has a binomial distribution; O_{i,1} has parameters n_i and p_i while O_{i,0} has parameters n_i and 1 − p_i.
Suppose first that p_0 = (p_{1,0}, p_{2,0}, …, p_{m,0}) is a given vector of probabilities, and that we want to test H_0: p = p_0 versus H_1: p ≠ p_0. Since the null hypothesis specifies the value of p_i for each i, this is called the completely specified case. Now let e_{i,0} = n_i (1 − p_{i,0}) and let e_{i,1} = n_i p_{i,0}. These are the expected frequencies of the outcomes 0 and 1, respectively, from sample X_i under H_0.
If n_i is large for each i, then under H_0 the following test statistic has approximately the chi-square distribution with m degrees of freedom:
V = ∑_{i=1}^m ∑_{j=0}^1 (O_{i,j} − e_{i,j})² / e_{i,j}   (9.6.3)
Proof
As a rule of thumb, “large” means that we need e_{i,j} ≥ 5 for each i ∈ {1, 2, …, m} and j ∈ {0, 1}. But of course, the larger these expected frequencies the better.
Under the large sample assumption, an approximate test of H_0 versus H_1 at the α level of significance is to reject H_0 if and only if V > χ²_m(1 − α).
Once again, note that the test statistic V measures the discrepancy between the expected and observed frequencies, over all outcomes and all samples. There are 2m terms in the expansion of V in Exercise 4, but only m degrees of freedom, since O_{i,0} + O_{i,1} = n_i for each i ∈ {1, 2, …, m}. The observed and expected frequencies could be stored in an m × 2 table.
Suppose now that we want to test the null hypothesis H_0: p_1 = p_2 = ⋯ = p_m, that the success probabilities are all the same, versus the complementary alternative hypothesis H_1 that the probabilities are not all the same. Note, in contrast to the previous model, that the null hypothesis does not specify the value of the common success probability p. But note also that under the null hypothesis, the m samples can be combined to form one large sample of Bernoulli trials with success probability p. Thus, a natural approach is to estimate p and then define the test statistic that measures the discrepancy between the expected and observed frequencies, just as before. The challenge will be to find the distribution of the test statistic.
Let n = ∑_{i=1}^m n_i denote the total sample size when the samples are combined. Then the overall sample mean, which in this context is the overall proportion of successes, is P = (1/n) ∑_{i=1}^m O_{i,1}.
The sample proportion P is the best estimate of p, in just about any sense of the word. Next, let E_{i,0} = n_i (1 − P) and E_{i,1} = n_i P. These are the estimated expected frequencies of 0 and 1, respectively, from sample X_i under H_0. Of course these estimated frequencies are now statistics (and hence random) rather than parameters. Just as before, we define our test statistic
$$V = \sum_{i=1}^m \sum_{j=0}^1 \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \tag{9.6.5}$$
It turns out that under $H_0$, the distribution of $V$ converges to the chi-square distribution with $m - 1$ degrees of freedom as $n \to \infty$.
9.6.2 https://stats.libretexts.org/@go/page/10216
An approximate test of $H_0$ versus $H_1$ at the $\alpha$ level of significance is to reject $H_0$ if and only if $V > \chi^2_{m-1}(1 - \alpha)$.
Intuitively, we lost a degree of freedom over the completely specified case because we had to estimate the unknown common
success probability p. Again, the observed and expected frequencies could be stored in an m × 2 table.
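The two-step computation just described (pool the samples to estimate $p$, then compare observed frequencies with the estimated expected frequencies) can be sketched in Python. The data, variable names, and layout of the $m \times 2$ table below are illustrative assumptions, not from the text:

```python
import numpy as np
from scipy import stats

# Observed frequencies O[i, j] for m = 3 Bernoulli samples (illustrative data):
# column 0 = failures, column 1 = successes; row i corresponds to sample i.
O = np.array([[21, 29],
              [17, 23],
              [18, 42]])
n_i = O.sum(axis=1)            # sample sizes n_i
P = O[:, 1].sum() / n_i.sum()  # pooled estimate of the common success probability p

# Estimated expected frequencies: E[i, 0] = n_i (1 - P), E[i, 1] = n_i P
E = np.column_stack([n_i * (1 - P), n_i * P])

V = ((O - E) ** 2 / E).sum()           # test statistic (9.6.5)
m = O.shape[0]
p_value = stats.chi2.sf(V, df=m - 1)   # m - 1 degrees of freedom
print(V, p_value)
```

Rejecting when the p-value is below $\alpha$ is equivalent to the rejection rule $V > \chi^2_{m-1}(1-\alpha)$ above.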
sequence of multinomial trials. Thus, these are independent, identically distributed random variables, each taking values in a set $S$ with $k$ elements. If we want, we can assume that $S = \{0, 1, \ldots, k - 1\}$; the one-sample Bernoulli model then corresponds to $k = 2$. Let $f$ denote the common probability density function of the sample variables on $S$, so that $f(j) = P(X_i = j)$ for $i \in \{1, 2, \ldots, n\}$ and $j \in S$. The values of $f$ are assumed unknown, but of course we must have $\sum_{j \in S} f(j) = 1$, so there are really only $k - 1$ unknown parameters. For a given probability density function $f_0$ on $S$ we want to test $H_0: f = f_0$ versus $H_1: f \ne f_0$.
By this time, our general approach should be clear. We let $O_j$ denote the number of times that outcome $j \in S$ occurs in sample $X$:
$$O_j = \sum_{i=1}^n 1(X_i = j) \tag{9.6.6}$$
$$V = \sum_{j \in S} \frac{(O_j - e_j)^2}{e_j} \tag{9.6.7}$$
It turns out that under $H_0$, the distribution of $V$ converges to the chi-square distribution with $k - 1$ degrees of freedom as $n \to \infty$. Note that there are $k$ terms in the expansion of $V$, but only $k - 1$ degrees of freedom since $\sum_{j \in S} O_j = n$.
Again, as a rule of thumb, we need $e_j \ge 5$ for each $j \in S$, but the larger the expected frequencies the better.
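The one-sample goodness-of-fit computation can be sketched with scipy, whose `chisquare` function computes exactly the statistic $V$ of (9.6.7). The die data below are made up for illustration:

```python
import numpy as np
from scipy.stats import chisquare

# Observed counts O_j for a six-sided die thrown n = 60 times (illustrative data)
O = np.array([8, 12, 11, 9, 10, 10])
n = O.sum()
f0 = np.full(6, 1 / 6)   # completely specified null hypothesis: the die is fair
e = n * f0               # expected frequencies e_j = n f0(j); all are 10 >= 5

V, p_value = chisquare(O, f_exp=e)   # V as in (9.6.7), with k - 1 = 5 df
print(V, p_value)
```

Here every deviation from 10 is small, so $V$ is small and the fair-die hypothesis would not be rejected at any reasonable level.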
$i \in \{1, 2, \ldots, m\}$. Moreover, we assume that the samples $(X_1, X_2, \ldots, X_m)$ are independent. Again there is no loss in generality if we take $S = \{0, 1, \ldots, k - 1\}$. Then $k = 2$ reduces to the multi-sample Bernoulli model, and $m = 1$ corresponds to the one-sample multinomial model.
Let $f_i$ denote the common probability density function of the variables in sample $X_i$, so that $f_i(j) = P(X_{i,l} = j)$ for $i \in \{1, 2, \ldots, m\}$, $l \in \{1, 2, \ldots, n_i\}$, and $j \in S$. These are generally unknown, so that our vector of parameters is the vector of probability density functions: $f = (f_1, f_2, \ldots, f_m)$. Of course, $\sum_{j \in S} f_i(j) = 1$ for $i \in \{1, 2, \ldots, m\}$, so there are actually $m(k - 1)$ unknown parameters. We are interested in testing hypotheses about $f$. As in the multi-sample Bernoulli model, there are two common cases that we consider below, but first let's set up the essential notation that we will need for both cases. For $i \in \{1, 2, \ldots, m\}$ and $j \in S$, let $O_{i,j}$ denote the number of times that outcome $j$ occurs in sample $X_i$. The observed frequency
hypothesis $H_0: f = f_0$, versus $H_1: f \ne f_0$. Since the null hypothesis specifies the value of $f_i(j)$ for each $i$ and $j$, this is called the completely specified case. Let $e_{i,j} = n_i f_{0,i}(j)$. This is the expected frequency of outcome $j$ in sample $X_i$ under $H_0$.
If $n_i$ is large for each $i$, then under $H_0$, the test statistic $V$ below has approximately the chi-square distribution with $m(k - 1)$ degrees of freedom:
$$V = \sum_{i=1}^m \sum_{j \in S} \frac{(O_{i,j} - e_{i,j})^2}{e_{i,j}} \tag{9.6.8}$$
Proof
As usual, our rule of thumb is that we need $e_{i,j} \ge 5$ for each $i \in \{1, 2, \ldots, m\}$ and $j \in S$. But of course, the larger these expected frequencies the better.
As always, the test statistic $V$ measures the discrepancy between the expected and observed frequencies, over all outcomes and all samples. There are $mk$ terms in the expansion of $V$ in Exercise 8, but we lose $m$ degrees of freedom, since $\sum_{j \in S} O_{i,j} = n_i$ for each $i \in \{1, 2, \ldots, m\}$, leaving $m(k - 1)$ degrees of freedom.
same, versus the complementary alternative hypothesis $H_1$ that the probability density functions are not all the same. Note, in contrast to the previous model, that the null hypothesis does not specify the value of the common probability density
function f . But note also that under the null hypothesis, the m samples can be combined to form one large sample of multinomial
trials with probability density function f . Thus, a natural approach is to estimate the values of f and then define the test statistic
that measures the discrepancy between the expected and observed frequencies, just as before.
Let $n = \sum_{i=1}^m n_i$ denote the total sample size when the samples are combined. Under $H_0$, our best estimate of $f(j)$ is
$$P_j = \frac{1}{n} \sum_{i=1}^m O_{i,j} \tag{9.6.9}$$
Hence our estimate of the expected frequency of outcome $j$ in sample $X_i$ under $H_0$ is $E_{i,j} = n_i P_j$. Again, this estimated frequency is now a statistic (and hence random) rather than a parameter. Just as before, we define our test statistic
$$V = \sum_{i=1}^m \sum_{j \in S} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \tag{9.6.10}$$
As you no doubt expect by now, it turns out that under $H_0$, the distribution of $V$ converges to the chi-square distribution with $(k - 1)(m - 1)$ degrees of freedom as $n \to \infty$.
An approximate test of $H_0$ versus $H_1$ at the $\alpha$ level of significance is to reject $H_0$ if and only if $V > \chi^2_{(k-1)(m-1)}(1 - \alpha)$.
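A minimal sketch of this homogeneity test, following (9.6.9) and (9.6.10); the two samples below are invented for illustration. For a table like this, `scipy.stats.chi2_contingency` would produce the same statistic and degrees of freedom, since it uses the same expected frequencies:

```python
import numpy as np
from scipy.stats import chi2

# Observed frequencies O[i, j]: m = 2 samples (rows), k = 3 outcomes (columns).
# The data are illustrative only.
O = np.array([[30, 45, 25],
              [40, 40, 40]], dtype=float)
m, k = O.shape
n_i = O.sum(axis=1)      # sample sizes n_i
n = O.sum()              # combined sample size

P = O.sum(axis=0) / n    # pooled estimates P_j of the common f(j), as in (9.6.9)
E = np.outer(n_i, P)     # estimated expected frequencies E[i, j] = n_i P_j

V = ((O - E) ** 2 / E).sum()     # test statistic (9.6.10)
df = (k - 1) * (m - 1)           # degrees of freedom after estimating f
p_value = chi2.sf(V, df)
print(V, df, p_value)
```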
by $\{A_j : j \in J\}$ where $\#(J) = k$. Next, we define the sequence of random variables $Y = (Y_1, Y_2, \ldots, Y_n)$ by $Y_i = j$ if and only if $X_i \in A_j$.
$Y$ is a multinomial trials sequence with parameters $n$ and $f$, where $f(j) = P(X \in A_j)$ for $j \in J$.
which of course, is precisely the problem we solved in the one-sample multinomial model.
Generally, we would partition the space S into as many subsets as possible, subject to the restriction that the expected frequencies
all be at least 5.
lose a degree of freedom in the chi-square statistic V for each parameter that we estimate, although the precise mathematics can be
complicated.
A Test of Independence
Suppose that we have observable random variables $X$ and $Y$ for an experiment, where $X$ takes values in a set $S$ with $k$ elements, and $Y$ takes values in a set $T$ with $m$ elements. Let $f$ denote the joint probability density function of $(X, Y)$, so that $f(i, j) = P(X = i, Y = j)$ for $i \in S$ and $j \in T$. Recall that the marginal probability density functions of $X$ and $Y$ are the functions $g$ and $h$ given by $g(i) = \sum_{j \in T} f(i, j)$ for $i \in S$ and $h(j) = \sum_{i \in S} f(i, j)$ for $j \in T$.
Usually, of course, f , g , and h are unknown. In this section, we are interested in testing whether X and Y are independent, a basic
and important test. Formally then we want to test the null hypothesis
H0 : f (i, j) = g(i) h(j), (i, j) ∈ S × T (9.6.13)
Our first step, of course, is to draw a random sample $(X, Y) = ((X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n))$ from the distribution of $(X, Y)$. Since the state spaces are finite, this sample forms a sequence of multinomial trials. Thus, with our usual notation, let $O_{i,j}$ denote the number of times that $(i, j)$ occurs in the sample, for each $(i, j) \in S \times T$. This statistic has the binomial distribution with trial parameter $n$ and success parameter $f(i, j)$. Under $H_0$, the success parameter is $g(i) h(j)$. However, since we don't know the
success parameters, we must estimate them in order to compute the expected frequencies. Our best estimate of $f(i, j)$ is the sample proportion $\frac{1}{n} O_{i,j}$. Thus, our best estimates of $g(i)$ and $h(j)$ are $\frac{1}{n} N_i$ and $\frac{1}{n} M_j$, respectively, where $N_i$ is the number of times that $i$ occurs among the $X$ values and $M_j$ is the number of times that $j$ occurs among the $Y$ values:
$$N_i = \sum_{j \in T} O_{i,j} \tag{9.6.14}$$
$$M_j = \sum_{i \in S} O_{i,j} \tag{9.6.15}$$
$$E_{i,j} = n \left(\frac{1}{n} N_i\right) \left(\frac{1}{n} M_j\right) = \frac{1}{n} N_i M_j \tag{9.6.16}$$
As you now expect, the distribution of V converges to a chi-square distribution as n → ∞ . But let's see if we can determine the
appropriate degrees of freedom on heuristic grounds.
The degrees of freedom turn out to be $(k - 1)(m - 1)$, and the approximate test at the $\alpha$ level of significance is to reject $H_0$ if and only if $V > \chi^2_{(k-1)(m-1)}(1 - \alpha)$.
The observed frequencies are often recorded in a $k \times m$ table, known as a contingency table, so that $O_{i,j}$ is the number in row $i$ and column $j$. In this setting, note that $N_i$ is the sum of the frequencies in the $i$th row and $M_j$ is the sum of the frequencies in the $j$th column. Also, for historical reasons, the random variables $X$ and $Y$ are sometimes called factors and the possible values of the variables categories.
A coin is tossed 100 times, resulting in 55 heads. Test the null hypothesis that the coin is fair.
Answer
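A quick check of this exercise with the one-sample statistic (here $k = 2$, so $V$ has 1 degree of freedom):

```python
from scipy.stats import chi2

# The coin exercise: n = 100 tosses, 55 heads, H0: the coin is fair (p = 1/2)
O = [55, 45]        # observed heads, tails
e = [50.0, 50.0]    # expected frequencies under H0
V = sum((o - ei) ** 2 / ei for o, ei in zip(O, e))   # V = 0.5 + 0.5 = 1.0
p_value = chi2.sf(V, df=1)   # k - 1 = 1 degree of freedom
print(V, p_value)   # V = 1.0; p-value about 0.32
```

Since the p-value is about 0.32, the fairness hypothesis is not rejected at any of the usual significance levels.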
Suppose that we have 3 coins. The coins are tossed, yielding the data in the following table:
Heads Tails
Coin 1 29 21
Coin 2 23 17
Coin 3 42 18
3. Test the null hypothesis that the 3 coins have the same probability of heads.
Answer
A die is thrown 240 times, yielding the data in the following table:
Score 1 2 3 4 5 6
Frequency 57 39 28 28 36 52
Test the null hypothesis that the die is an ace-six flat (faces 1 and 6 have probability $\frac{1}{4}$ each while faces 2, 3, 4, and 5 have probability $\frac{1}{8}$ each).
Answer
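A sketch of the computation for this exercise. The fair-die test is added for comparison, and the ace-six flat probabilities (1/4 for faces 1 and 6, 1/8 for the rest) reflect our reading of the garbled exercise text:

```python
import numpy as np
from scipy.stats import chisquare

# The die data from the exercise: 240 throws
O = np.array([57, 39, 28, 28, 36, 52])
n = O.sum()   # 240

# (a) H0: the die is fair (added for comparison)
V_fair, p_fair = chisquare(O, f_exp=np.full(6, n / 6))

# (b) H0: the die is an ace-six flat:
# faces 1 and 6 have probability 1/4, faces 2-5 have probability 1/8
f0 = np.array([1/4, 1/8, 1/8, 1/8, 1/8, 1/4])
V_flat, p_flat = chisquare(O, f_exp=n * f0)

print(V_fair, p_fair)
print(V_flat, p_flat)
```

The data fit the ace-six flat hypothesis much better than the fair hypothesis, which is visible in the much smaller statistic for case (b).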
Two dice are thrown, yielding the data in the following table:
Score 1 2 3 4 5 6
Die 1 22 17 22 13 22 24
Die 2 44 24 19 19 18 36
1. Test the null hypothesis that die 1 is fair and die 2 is an ace-six flat.
2. Test the null hypothesis that the two dice have the same probability distribution.
Answer
A university classifies faculty by rank as instructors, assistant professors, associate professors, and full professors. The data,
by faculty rank and gender, are given in the following contingency table. Test to see if faculty rank and gender are independent.
Answer
reasonable.
Answer
Test to see if the alpha emissions data come from a Poisson distribution.
Answer
Test to see if Michelson's velocity of light data come from a normal distribution.
Answer
Simulation Exercises
In the simulation exercises below, you will be able to explore the goodness of fit test empirically.
In the dice goodness of fit experiment, set the sampling distribution to fair, the sample size to 50, and the significance level to
0.1. Set the test distribution as indicated below and in each case, run the simulation 1000 times. In case (a), give the empirical
estimate of the significance level of the test and compare with 0.1. In the other cases, give the empirical estimate of the power
of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your results seem reasonable?
1. fair
2. ace-six flats
3. the symmetric, unimodal distribution
4. the distribution skewed right
In the dice goodness of fit experiment, set the sampling distribution to ace-six flats, the sample size to 50, and the significance
level to 0.1. Set the test distribution as indicated below and in each case, run the simulation 1000 times. In case (a), give the
empirical estimate of the significance level of the test and compare with 0.1. In the other cases, give the empirical estimate of
the power of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your results seem reasonable?
1. fair
2. ace-six flats
3. the symmetric, unimodal distribution
4. the distribution skewed right
In the dice goodness of fit experiment, set the sampling distribution to the symmetric, unimodal distribution, the sample size to
50, and the significance level to 0.1. Set the test distribution as indicated below and in each case, run the simulation 1000
times. In case (a), give the empirical estimate of the significance level of the test and compare with 0.1. In the other cases, give
the empirical estimate of the power of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your
results seem reasonable?
1. the symmetric, unimodal distribution
2. fair
3. ace-six flats
4. the distribution skewed right
In the dice goodness of fit experiment, set the sampling distribution to the distribution skewed right, the sample size to 50, and
the significance level to 0.1. Set the test distribution as indicated below and in each case, run the simulation 1000 times. In case
(a), give the empirical estimate of the significance level of the test and compare with 0.1. In the other cases, give the empirical
estimate of the power of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your results seem
reasonable?
1. the distribution skewed right
2. fair
3. ace-six flats
4. the symmetric, unimodal distribution
Suppose that $D_1$ and $D_2$ are different distributions. Is the power of the test with sampling distribution $D_1$ and test distribution $D_2$ the same as the power of the test with sampling distribution $D_2$ and test distribution $D_1$? Make a conjecture based on your simulation results.
In the dice goodness of fit experiment, set the sampling and test distributions to fair and the significance level to 0.05. Run the
experiment 1000 times for each of the following sample sizes. In each case, give the empirical estimate of the significance
level and compare with 0.05.
1. n = 10
2. n = 20
3. n = 40
4. n = 100
In the dice goodness of fit experiment, set the sampling distribution to fair, the test distributions to ace-six flats, and the
significance level to 0.05. Run the experiment 1000 times for each of the following sample sizes. In each case, give the
empirical estimate of the power of the test. Do the powers seem to be converging?
1. n = 10
2. n = 20
3. n = 40
4. n = 100
This page titled 9.6: Chi-Square Tests is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
CHAPTER OVERVIEW
10: Geometric Models
In this chapter, we explore several problems in geometric probability. These problems are interesting, conceptually clear, and the
analysis is relatively simple. Thus, they are good problems for the student of probability. In addition, Buffon's problems and
Bertrand's problem are historically famous, and contributed significantly to the early development of probability theory.
10.1: Buffon's Problems
10.2: Bertrand's Paradox
10.3: Random Triangles
This page titled 10: Geometric Models is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
10.1: Buffon's Problems
Buffon's experiments are very old and famous random experiments, named after comte de Buffon. These experiments are
considered to be among the first problems in geometric probability.
Assumptions
First, let us define the experiment mathematically. As usual, we will idealize the physical objects by assuming that the coin is a
perfect circle with radius r and that the cracks between tiles are line segments. A natural way to describe the outcome of the
experiment is to record the center of the coin relative to the center of the tile where the coin happens to fall. More precisely, we will
construct coordinate axes so that the tile where the coin falls occupies the square $S = \left[-\frac{1}{2}, \frac{1}{2}\right]^2$.
Now when the coin is tossed, we will denote the center of the coin by (X, Y ) ∈ S so that S is our sample space and X and Y are
our basic random variables. Finally, we will assume that $r < \frac{1}{2}$ so that it is at least possible for the coin to fall inside the square.
Run Buffon's coin experiment with the default settings. Watch how the points seem to fill the sample space S in a uniform
manner.
Proof
10.1.1 https://stats.libretexts.org/@go/page/10222
In Buffon's coin experiment, change the radius with the scroll bar and watch how the events $C$ and $C^c$ change. Run the experiment with various values of $r$ and compare the physical experiment with the points in the scatterplot. Compare the relative frequency of $C$ to the probability of $C$.
The convergence of the relative frequency of an event (as the experiment is repeated) to the probability of the event is a special
case of the law of large numbers.
Solve Buffon's coin problem with rectangular tiles that have height h and width w.
Answer
Solve Buffon's coin problem with equilateral triangular tiles that have side length 1.
Recall that random numbers are simulation of independent random variables, each with the standard uniform distribution, that is,
the continuous uniform distribution on the interval (0, 1).
Show how to simulate the center of the coin (X, Y ) in Buffon's coin experiment using random numbers.
Answer
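The answer can be sketched in a few lines: if $U_1$ and $U_2$ are random numbers, then $X = U_1 - \frac{1}{2}$ and $Y = U_2 - \frac{1}{2}$ are independent and uniform on $[-\frac{1}{2}, \frac{1}{2}]$. The function name is our own:

```python
import random

def buffon_coin_center():
    """Simulate the coin center (X, Y), uniformly distributed on the
    square S = [-1/2, 1/2]^2, from two random numbers."""
    u1, u2 = random.random(), random.random()
    return u1 - 0.5, u2 - 0.5   # X = U1 - 1/2, Y = U2 - 1/2

x, y = buffon_coin_center()
```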
Assumptions
Our first step is to define the experiment mathematically. Again we idealize the physical objects by assuming that the floorboards
are uniform and that each has width 1. We will also assume that the needle has length L < 1 so that the needle cannot cross more
than one crack. Finally, we assume that the cracks between the floorboards and the needle are line segments.
When the needle is dropped, we want to record its orientation relative to the floorboard cracks. One way to do this is to record the
angle X that the top half of the needle makes with the line through the center of the needle, parallel to the floorboards, and the
distance Y from the center of the needle to the bottom crack. These will be the basic random variables of our experiment, and thus
the sample space of the experiment is
Run Buffon's needle experiment with the default settings and watch the outcomes being plotted in the sample space. Note how
the points in the scatterplot seem to fill the sample space S in a uniform way.
The event C can be written in terms of the basic angle and distance variables as follows:
$$C = \left\{Y < \frac{L}{2} \sin(X)\right\} \cup \left\{Y > 1 - \frac{L}{2} \sin(X)\right\} \tag{10.1.5}$$
The curves $y = \frac{L}{2} \sin(x)$ and $y = 1 - \frac{L}{2} \sin(x)$ on the interval $0 \le x < \pi$ are shown in blue in the scatterplot of Buffon's needle experiment, and hence event $C$ is the union of the regions below the lower curve and above the upper curve. Thus, the needle crosses a crack precisely when a point falls in this region.
In the Buffon's needle experiment, vary the needle length L with the scroll bar and watch how the event C changes. Run the
experiment with various values of L and compare the physical experiment with the points in the scatterplot. Compare the
relative frequency of C to the probability of C .
The convergence of the relative frequency of an event (as the experiment is repeated) to the probability of the event is a special
case of the law of large numbers.
Find the probabilities of the following events in Buffon's needle experiment. In each case, sketch the event as a subset of the
sample space.
1. {0 < X < π/2, 0 < Y < 1/3}
The Estimate of π
Suppose that we run Buffon's needle experiment a large number of times. By the law of large numbers, the proportion of crack
crossings should be about the same as the probability of a crack crossing. More precisely, we will denote the number of crack crossings in the first $n$ runs by $N_n$. Note that $N_n$ is a random variable for the compound experiment that consists of $n$ replications of the basic needle experiment. Thus, if $n$ is large, we should have $\frac{N_n}{n} \approx \frac{2L}{\pi}$ and hence
$$\pi \approx \frac{2Ln}{N_n} \tag{10.1.6}$$
This is Buffon's famous estimate of π. In the simulation of Buffon's needle experiment, this estimate is computed on each run and
shown numerically in the second table and visually in a graph.
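A simulation sketch of the estimate (10.1.6). Note the circularity: the simulation itself needs $\pi$ in order to draw the angle uniformly, a point taken up again in the notes at the end of the section. The function name and default length are our own choices:

```python
import random, math

def buffon_pi_estimate(n, L=0.5):
    """Estimate pi from n needle drops with needle length L < 1.
    Circular in the sense of the notes: math.pi is used to draw the angle."""
    crossings = 0
    for _ in range(n):
        x = random.random() * math.pi    # angle X, uniform on [0, pi)
        y = random.random()              # distance Y to the bottom crack
        # crack-crossing event C from (10.1.5)
        if y < (L / 2) * math.sin(x) or y > 1 - (L / 2) * math.sin(x):
            crossings += 1
    return 2 * L * n / crossings if crossings else float("inf")

print(buffon_pi_estimate(100_000))
```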
Run the Buffon's needle experiment with needle lengths L ∈ {0.3, 0.5, 0.7, 1} . In each case, watch the estimate of π as the
simulation runs.
Let us analyze the estimation problem more carefully. On each run $j$ we have an indicator variable $I_j$, where $I_j = 1$ if the needle crosses a crack on run $j$ and $I_j = 0$ if the needle does not cross a crack on run $j$. These indicator variables are independent and identically distributed, since we are assuming independent replications of the experiment. Thus, the sequence forms a Bernoulli trials process.
The number of crack crossings in the first $n$ runs of the experiment is
$$N_n = \sum_{j=1}^n I_j \tag{10.1.7}$$
1. $E(N_n) = n \frac{2L}{\pi}$
2. $\text{var}(N_n) = n \frac{2L}{\pi}\left(1 - \frac{2L}{\pi}\right)$
With probability 1, $\frac{N_n}{2Ln} \to \frac{1}{\pi}$ as $n \to \infty$ and $\frac{2Ln}{N_n} \to \pi$ as $n \to \infty$.
Proof
Thus, we have two basic estimators: $\frac{N_n}{2Ln}$ as an estimator of $\frac{1}{\pi}$ and $\frac{2Ln}{N_n}$ as an estimator of $\pi$. The estimator of $\frac{1}{\pi}$ has several important statistical properties. First, it is unbiased since the expected value of the estimator is the parameter being estimated:
The estimator of $\frac{1}{\pi}$ is unbiased:
$$E\left(\frac{N_n}{2Ln}\right) = \frac{1}{\pi} \tag{10.1.8}$$
Proof
Since this estimator is unbiased, the variance gives the mean square error:
$$\text{var}\left(\frac{N_n}{2Ln}\right) = E\left[\left(\frac{N_n}{2Ln} - \frac{1}{\pi}\right)^2\right] \tag{10.1.9}$$
$$\text{var}\left(\frac{N_n}{2Ln}\right) = \frac{\pi - 2L}{2Ln\pi^2} \tag{10.1.10}$$
Note that the mean square error of the estimator of $\frac{1}{\pi}$ improves as the needle length $L$ increases. On the other hand, the estimator of $\pi$ is biased; it tends to overestimate $\pi$:
Proof
The estimator of π also tends to improve as the needle length increases. This is not easy to see mathematically. However, you can
see it empirically.
In the Buffon's needle experiment, run the simulation 5000 times each with L = 0.3 , L = 0.5 , L = 0.7 , and L = 0.9 . Note
how well the estimator seems to work in each case.
Finally, we should note that as a practical matter, Buffon's needle experiment is not a very efficient method of approximating π.
According to Richard Durrett, to estimate $\pi$ to four decimal places with $L = 1$ would require about 100 million tosses!
Run the Buffon's needle experiment until the estimates of π seem to be consistently correct to two decimal places. Note the
number of runs required. Try this for needle lengths L = 0.3 , L = 0.5 , L = 0.7 , and L = 0.9 and compare the results.
Show how to simulate the angle X and distance Y in Buffon's needle experiment using random numbers.
Answer
Notes
Buffon's needle problem is essentially solved by Monte-Carlo integration. In general, Monte-Carlo methods use statistical
sampling to approximate the solutions of problems that are difficult to solve analytically. The modern theory of Monte-Carlo
methods began with Stanislaw Ulam, who used the methods on problems associated with the development of the hydrogen bomb.
The original needle problem has been extended in many ways, starting with Simon Laplace who considered a floor with rectangular
tiles. Indeed, variations on the problem are active research problems even today.
Neil Weiss has pointed out that our computer simulation of Buffon's needle experiment is circular, in the sense the program
assumes knowledge of π (you can see this from the simulation result above).
Try to write a computer algorithm for Buffon's needle problem, without assuming the value of π or any other transcendental
numbers.
This page titled 10.1: Buffon's Problems is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
10.2: Bertrand's Paradox
Preliminaries
Statement of the Problem
Bertrand's problem is to find the probability that a “random chord” on a circle will be longer than the length of a side of the
inscribed equilateral triangle. The problem is named after the French mathematician Joseph Louis Bertrand, who studied the
problem in 1889.
It turns out, as we will see, that there are (at least) three answers to Bertrand's problem, depending on how one interprets the phrase
“random chord”. The lack of a unique answer was considered a paradox at the time, because it was assumed (naively, in hindsight)
that there should be a single natural answer.
Run Bertrand's experiment 100 times for each of the following models. Do not be concerned with the exact meaning of the
models, but see if you can detect a difference in the behavior of the outcomes
1. Uniform distance
2. Uniform angle
3. Uniform endpoint
Mathematical Formulation
To formulate the problem mathematically, let us take (0, 0) as the center of the circle and take the radius of the circle to be 1. These
assumptions entail no loss of generality because they amount to measuring distances relative to the center of the circle, and taking
the radius of the circle as the unit of length. Now consider a chord on the circle. By rotating the circle, we can assume that one
point of the chord is $(1, 0)$ and the other point is $(X, Y)$ where $Y > 0$ and $X^2 + Y^2 = 1$.
With these assumptions, the chord is completely specified by giving any one of the following variables:
1. The (perpendicular) distance $D$ from the center of the circle to the midpoint of the chord. Note that $0 \le D \le 1$.
2. The angle $A$ between the $x$-axis and the line from the center of the circle to the midpoint of the chord. Note that $0 \le A \le \pi/2$.
3. $Y = 2D\sqrt{1 - D^2}$
The inverse relations are given below. Note again that there are one-to-one correspondences between $X$, $A$, and $D$.
1. $A = \arccos(D)$
2. $D = \sqrt{\frac{1}{2}(x + 1)}$
3. $D = \sqrt{\frac{1}{2} \pm \frac{1}{2}\sqrt{1 - y^2}}$
10.2.1 https://stats.libretexts.org/@go/page/10223
If the chord is generated in a probabilistic way, D, A , X, and Y become random variables. In light of the previous results,
specifying the distribution of any of the variables D, A , or X completely determines the distribution of all four variables.
The angle A is also the angle between the chord and the tangent line to the circle at (1, 0).
Now consider the equilateral triangle inscribed in the circle so that one of the vertices is (1, 0). Consider the chord defined by the
upper side of the triangle.
For this chord, the angle, distance, and coordinate variables are given as follows:
1. a = π/3
2. d = 1/2
3. x = −1/2
4. $y = \sqrt{3}/2$
The length of the chord is greater than the length of a side of the inscribed equilateral triangle if and only if the following
equivalent conditions occur:
1. 0 < D < 1/2
2. π/3 < A < π/2
3. −1 < X < −1/2
Models
When an object is generated “at random”, a sequence of “natural” variables that determines the object should be given an
appropriate uniform distribution. The coordinates of the coin center are such a sequence in Buffon's coin experiment; the angle and
distance variables are such a sequence in Buffon's needle experiment. The crux of Bertrand's paradox is the fact that the distance
$D$, the angle $A$, and the coordinate $X$ each seems to be a natural variable that determines the chord, but different models result depending on which variable is given the uniform distribution.
In Bertrand's experiment, select the uniform distance model. Run the experiment 1000 times and compare the relative
frequency function of the chord event to the true probability.
Proof
The coordinate $X$ has probability density function
$$h(x) = \frac{1}{\sqrt{8(x + 1)}}, \quad -1 < x < 1 \tag{10.2.3}$$
Proof
Note that A and X do not have uniform distributions. Recall that a random number is a simulation of a variable with the standard
uniform distribution, that is the continuous uniform distribution on the interval [0, 1).
In Bertrand's experiment, select the uniform angle model. Run the experiment 1000 times and compare the relative frequency
function of the chord event to the true probability.
Proof
Proof
In Bertrand's experiment, select the uniform endpoint model. Run the experiment 1000 times and compare the relative
frequency function of the chord event to the true probability.
Proof
The angle $A$ has probability density function
$$g(a) = 2 \sin(a) \cos(a), \quad 0 < a < \frac{\pi}{2} \tag{10.2.9}$$
Proof
Note that D and A do not have uniform distributions; in fact, D has a beta distribution with left parameter 2 and right parameter 1.
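The three models can be compared with a short simulation; the known answers are $\frac{1}{2}$ for uniform distance and $\frac{1}{3}$ for uniform angle and uniform endpoint. The function name and the parametrization of the endpoint model (endpoint angle uniform on $(0, \pi)$, matching the convention $Y > 0$ above) are our choices:

```python
import random, math

def chord_longer(model, n=100_000):
    """Estimate the probability that the chord is longer than the side
    of the inscribed equilateral triangle, under each model."""
    count = 0
    for _ in range(n):
        if model == "distance":             # D uniform on [0, 1]
            hit = random.random() < 1 / 2   # event 0 < D < 1/2
        elif model == "angle":              # A uniform on [0, pi/2]
            a = random.random() * math.pi / 2
            hit = math.pi / 3 < a           # event pi/3 < A < pi/2
        else:                               # "endpoint": angle uniform on (0, pi)
            theta = random.random() * math.pi
            hit = math.cos(theta) < -1 / 2  # event -1 < X < -1/2
        count += hit
    return count / n

for model in ("distance", "angle", "endpoint"):
    print(model, chord_longer(model))
```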
Physical Experiments
Suppose that a random chord is generated by tossing a coin of radius 1 on a table ruled with parallel lines that are distance 2
apart. Which of the models (if any) would apply to this physical experiment?
Answer
Suppose that a needle is attached to the edge of disk of radius 1. A random chord is generated by spinning the needle. Which of
the models (if any) would apply to this physical experiment?
Answer
Suppose that a thin trough is constructed on the edge of a disk of radius 1. Rolling a ball in the trough generates a random point
on the circle, so a random chord is generated by rolling the ball twice. Which of the models (if any) would apply to this
physical experiment?
Answer
This page titled 10.2: Bertrand's Paradox is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
10.3: Random Triangles
Preliminaries
Statement of the Problem
Suppose that a stick is randomly broken in two places. What is the probability that the three pieces form a triangle?
Run the triangle experiment 50 times. Do not be concerned with all of the information displayed in the app, but just note
whether the pieces form a triangle. Would you like to revise your guess?
Mathematical Formulation
As usual, the first step is to model the random experiment mathematically. We will take the length of the stick as our unit of length,
so that we can identify the stick with the interval [0, 1]. To break the stick into three pieces, we just need to select two points in the
interval. Thus, let X denote the first point chosen and Y the second point chosen. Note that X and Y are random variables and
hence the sample space of our experiment is $S = [0, 1]^2$. Now, to model the statement that the points are chosen at random, let us assume, as in the previous sections, that $X$ and $Y$ are independent and each is uniformly distributed on $[0, 1]$.
Hence
$$P[(X, Y) \in A] = \frac{\text{area}(A)}{\text{area}(S)} \tag{10.3.1}$$
Triangles
The Probability of a Triangle
The three pieces form a triangle if and only if the triangle inequalities hold: the sum of the lengths of any two pieces must be
greater than the length of the third piece.
The event $T$ that the pieces form a triangle is $T = T_1 \cup T_2$, where
1. $T_1 = \left\{(x, y) \in S : y > \frac{1}{2},\ x < \frac{1}{2},\ y - x < \frac{1}{2}\right\}$
2. $T_2 = \left\{(x, y) \in S : x > \frac{1}{2},\ y < \frac{1}{2},\ x - y < \frac{1}{2}\right\}$
It follows that $P(T) = \frac{1}{4}$.
How close did you come with your initial guess? The relatively low value of $P(T)$ is a bit surprising.
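A simulation sketch confirming that the probability is close to $\frac{1}{4}$; the function name and sample size are arbitrary:

```python
import random

def triangle_prob(n=100_000):
    """Estimate P(T): break [0, 1] at two uniform points and check the
    triangle inequalities on the three pieces."""
    count = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        a, b = min(x, y), max(x, y)
        p1, p2, p3 = a, b - a, 1 - b      # the three piece lengths
        # Triangle iff each piece is shorter than the sum of the other two,
        # which (since the pieces sum to 1) means each piece < 1/2.
        if p1 < 0.5 and p2 < 0.5 and p3 < 0.5:
            count += 1
    return count / n

print(triangle_prob())   # close to 1/4
```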
10.3.1 https://stats.libretexts.org/@go/page/10224
Run the triangle experiment 1000 times and compare the empirical probability of $T$ to the true probability.
Suppose that a triangle has side lengths $a$, $b$, and $c$, where $c$ is the largest of these. The triangle is
1. acute if and only if $c^2 < a^2 + b^2$.
2. obtuse if and only if $c^2 > a^2 + b^2$.
3. right if and only if $c^2 = a^2 + b^2$.
Part (c), of course, is the famous Pythagorean theorem, named for the ancient Greek mathematician Pythagoras.
The pieces form a right triangle if and only if one of the following equations holds:
1. $(y - x)^2 = x^2 + (1 - y)^2$ in $T_1$
2. $(1 - y)^2 = x^2 + (y - x)^2$ in $T_1$
3. $x^2 = (y - x)^2 + (1 - y)^2$ in $T_1$
4. $(x - y)^2 = y^2 + (1 - x)^2$ in $T_2$
5. $(1 - x)^2 = y^2 + (x - y)^2$ in $T_2$
6. $y^2 = (x - y)^2 + (1 - x)^2$ in $T_2$
Figure 10.3.2 : The events that the pieces form acute and obtuse triangles
Let R denote the event that the pieces form a right triangle. Then P(R) = 0 .
2. $A_2$ is the region inside curves (d), (e), and (f) of the right triangle equations.
2. $B_4$, $B_5$, and $B_6$ are the regions inside $T_2$ and outside of curves (d), (e), and (f) of the right triangle equations, respectively.
Proof
Proof
Run the triangle experiment 1000 times and compare the empirical probabilities to the true probabilities.
This page titled 10.3: Random Triangles is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
CHAPTER OVERVIEW
11: Bernoulli Trials
The Bernoulli trials process is one of the simplest, yet most important, of all random processes. It is an essential topic in any course
in probability or mathematical statistics. The process consists of independent trials with two outcomes and with constant
probabilities from trial to trial. Thus it is the mathematical abstraction of coin tossing. The process leads to several important
probability distributions: the binomial, geometric, and negative binomial.
11.1: Introduction to Bernoulli Trials
11.2: The Binomial Distribution
11.3: The Geometric Distribution
11.4: The Negative Binomial Distribution
11.5: The Multinomial Distribution
11.6: The Simple Random Walk
11.7: The Beta-Bernoulli Process
This page titled 11: Bernoulli Trials is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
11.1: Introduction to Bernoulli Trials
Basic Theory
Definition
The Bernoulli trials process, named after Jacob Bernoulli, is one of the simplest yet most important random processes in
probability. Essentially, the process is the mathematical abstraction of coin tossing, but because of its wide applicability, it is
usually stated in terms of a sequence of generic “trials”.
Random Variables
Mathematically, we can describe the Bernoulli trials process with a sequence of indicator random variables:
X = (X1 , X2 , …) (11.1.1)
An indicator variable is a random variable that takes only the values 1 and 0, which in this setting denote success and failure,
respectively. Indicator variable X_i simply records the outcome of trial i. Thus, the indicator variables are independent and have the common probability density function f(x) = p^x (1 − p)^{1−x} for x ∈ {0, 1}, where p ∈ [0, 1] is the probability of success.
The distribution defined by this probability density function is known as the Bernoulli distribution. In statistical terms, the Bernoulli trials process corresponds to sampling from the Bernoulli distribution. In particular, the first n trials (X_1, X_2, …, X_n) form a random sample of size n from the Bernoulli distribution. Note again that the Bernoulli trials process is characterized by a single parameter p.
Proof
Note that the exponent of p in the probability density function is the number of successes in the n trials, while the exponent of
1 − p is the number of failures.
Suppose that U = (U_1, U_2, …) is a sequence of independent random variables, each with the uniform distribution on the interval [0, 1]. For p ∈ [0, 1] and i ∈ ℕ_+, let X_i(p) = 1(U_i ≤ p). Then X(p) = (X_1(p), X_2(p), …) is a Bernoulli trials process with parameter p.
Note that in the previous result, the Bernoulli trials processes for all possible values of the parameter p are defined on a common
probability space. This type of construction is sometimes referred to as coupling. This result also shows how to simulate a
Bernoulli trials process with random numbers. All of the other random processes studied in this chapter are functions of the Bernoulli
trials sequence, and hence can be simulated as well.
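The coupling construction above translates directly into a simulation. The sketch below is my own illustration, not part of the text: one common sequence of uniform random numbers drives the Bernoulli trials processes for every value of p at once.

```python
import random

def bernoulli_trials(p, n, seed=42):
    """Simulate the first n Bernoulli trials X_i(p) = 1(U_i <= p),
    driven by a common sequence of uniforms U_1, U_2, ... (coupling)."""
    rng = random.Random(seed)
    uniforms = [rng.random() for _ in range(n)]
    return [1 if u <= p else 0 for u in uniforms]

# The same uniforms drive every p, so the processes are coupled:
# increasing p can only turn 0s into 1s, never the reverse.
x_small = bernoulli_trials(0.3, 10)
x_large = bernoulli_trials(0.7, 10)
assert all(a <= b for a, b in zip(x_small, x_large))
```

Because the same seed (an arbitrary choice here) fixes the uniforms, the process for p = 0.3 is coordinatewise dominated by the process for p = 0.7, which is exactly the point of the coupling.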
Moments
Let X be an indicator variable with P(X = 1) = p, where p ∈ [0, 1]. Thus, X is the result of a generic Bernoulli trial and has the
Bernoulli distribution with parameter p. The following results give the mean, variance and some of the higher moments. A helpful
fact is that if we take a positive power of an indicator variable, nothing happens; that is, X^n = X for n > 0.
Note that the graph of var(X), as a function of p ∈ [0, 1], is a parabola opening downward. In particular the largest value is 1/4, when p = 1/2, and the smallest value is 0, when p = 0 or p = 1. Of course, in the last two cases, X is deterministic, taking the single value 0 when p = 0 and the single value 1 when p = 1.
1. skew(X) = (1 − 2p) / √(p(1 − p))
2. kurt(X) = −3 + 1 / [p(1 − p)]
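These moment formulas are easy to spot-check numerically. The following sketch (my own illustration, not part of the text) computes the mean, variance, skewness, and kurtosis of an indicator variable directly from its two-point distribution and compares them with the formulas above.

```python
from math import sqrt

def bernoulli_moments(p):
    """Mean, variance, skewness, kurtosis of an indicator variable,
    computed directly from the two-point distribution."""
    mean = p                  # E(X) = 0*(1-p) + 1*p
    var = p * (1 - p)         # E(X^2) - p^2 = p - p^2
    # Central moments of X - p, from the two-point distribution:
    m3 = (0 - mean) ** 3 * (1 - p) + (1 - mean) ** 3 * p
    m4 = (0 - mean) ** 4 * (1 - p) + (1 - mean) ** 4 * p
    return mean, var, m3 / var ** 1.5, m4 / var ** 2

mean, var, skew, kurt = bernoulli_moments(0.3)
assert abs(skew - (1 - 2 * 0.3) / sqrt(0.3 * 0.7)) < 1e-9
assert abs(kurt - (-3 + 1 / (0.3 * 0.7))) < 1e-9
```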
In the basic coin experiment, set n = 100. For each p ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, run the experiment and observe the outcomes.
Generic Examples
In a sense, the most general example of Bernoulli trials occurs when an experiment is replicated. Specifically, suppose that we have
a basic random experiment and an event of interest A . Suppose now that we create a compound experiment that consists of
independent replications of the basic experiment. Define success on trial i to mean that event A occurred on the ith run, and define
failure on trial i to mean that event A did not occur on the ith run. This clearly defines a Bernoulli trials process with parameter
p = P(A) .
Bernoulli trials are also formed when we sample from a dichotomous population. Specifically, suppose that we have a population
of two types of objects, which we will refer to as type 0 and type 1. For example, the objects could be persons, classified as male or
female, or the objects could be components, classified as good or defective. We select n objects at random from the population; by
definition, this means that each object in the population at the time of the draw is equally likely to be chosen. If the sampling is
with replacement, then each object drawn is replaced before the next draw. In this case, successive draws are independent, so the
types of the objects in the sample form a sequence of Bernoulli trials, in which the parameter p is the proportion of type 1 objects in
the population. If the sampling is without replacement, then the successive draws are dependent, so the types of the objects in the
sample do not form a sequence of Bernoulli trials. However, if the population size is large compared to the sample size, the
dependence caused by not replacing the objects may be negligible, so that for all practical purposes, the types of the objects in the
sample can be treated as a sequence of Bernoulli trials. Additional discussion of sampling from a dichotomous population is in the chapter on Finite Sampling Models.
Suppose that a student takes a multiple choice test. The test has 10 questions, each of which has 4 possible answers (only one
correct). If the student blindly guesses the answer to each question, do the questions form a sequence of Bernoulli trials? If so,
identify the trial outcomes and the parameter p.
Answer
Candidate A is running for office in a certain district. Twenty persons are selected at random from the population of registered
voters and asked if they prefer candidate A . Do the responses form a sequence of Bernoulli trials? If so identify the trial
outcomes and the meaning of the parameter p.
Answer
An American roulette wheel has 38 slots; 18 are red, 18 are black, and 2 are green. A gambler plays roulette 15 times, betting
on red each time. Do the outcomes form a sequence of Bernoulli trials? If so, identify the trial outcomes and the parameter p.
Answer
Two tennis players play a set of 6 games. Do the games form a sequence of Bernoulli trials? If so, identify the trial outcomes
and the meaning of the parameter p.
Answer
Reliability
Recall that in the standard model of structural reliability, a system is composed of n components that operate independently of each other. Let X_i denote the state of component i, where 1 means working and 0 means failure. If the components are all of the same type, then
X = (X_1, X_2, …, X_n)  (11.1.4)
is a sequence of Bernoulli trials. The state of the system (again where 1 means working and 0 means failed) depends only on the states of the components, and thus is a random variable
Y = s(X_1, X_2, …, X_n)  (11.1.5)
where s: {0, 1}^n → {0, 1} is the structure function. Generally, the probability that a device is working is the reliability of the
device, so the parameter p of the Bernoulli trials sequence is the common reliability of the components. By independence, the
system reliability r is a function of the component reliability:
r(p) = Pp (Y = 1), p ∈ [0, 1] (11.1.6)
where we are emphasizing the dependence of the probability measure P on the parameter p. Appropriately enough, this function is
known as the reliability function. Our challenge is usually to find the reliability function, given the structure function.
Recall that in some cases, the system can be represented as a graph or network. The edges represent the components and the
vertices the connections between the components. The system functions if and only if there is a working path between two
designated vertices, which we will denote by a and b .
Find the reliability of the Wheatstone bridge network shown below (named for Charles Wheatstone).
Figure 11.1.2 : The Wheatstone bridge network
Answer
2. E(Y) = 1 + k[1 − (1 − p)^k]
3. var(Y) = k^2 (1 − p)^k [1 − (1 − p)^k]
In terms of expected value, the pooled strategy is better than the basic strategy if and only if
p < 1 − (1/k)^{1/k}  (11.1.7)
It follows that if p ≥ 0.307, pooling never makes sense, regardless of the size of the group k . At the other extreme, if p is very
small, so that the disease is quite rare, pooling is better unless the group size k is very large.
Now suppose that we have n persons. If k ∣ n then we can partition the population into n/k groups of k each, and apply the pooled strategy to each group. Note that k = 1 corresponds to individual testing, and k = n corresponds to the pooled strategy on the entire population. Let Y_i denote the number of tests required for group i.
The random variables (Y_1, Y_2, …, Y_{n/k}) are independent and each has the distribution given above.
The total number of tests required for this partitioning scheme is Z_{n,k} = Y_1 + Y_2 + ⋯ + Y_{n/k}.
The expected total number of tests is
E(Z_{n,k}) = \begin{cases} n, & k = 1 \\ n[1 + 1/k − (1 − p)^k], & k > 1 \end{cases}  (11.1.8)
Thus, in terms of expected value, the optimal strategy is to group the population into n/k groups of size k , where k minimizes the
expected value function above. It is difficult to get a closed-form expression for the optimal value of k , but this value can be
determined numerically for specific n and p.
For the following values of n and p, find the optimal pooling size k and the expected number of tests. (Restrict your attention
to values of k that divide n .)
1. n = 100 , p = 0.01
2. n = 1000, p = 0.05
3. n = 1000, p = 0.001
Answer
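The numerical search described above is straightforward to carry out. The sketch below is my own illustration, using the expected-value formula for E(Z_{n,k}): it scans the divisors k of n and reports the minimizer.

```python
def expected_tests(n, p, k):
    """Expected total number of tests E(Z_{n,k}) for n people
    partitioned into n/k groups of size k (k must divide n)."""
    if k == 1:
        return float(n)  # individual testing
    return n * (1 + 1 / k - (1 - p) ** k)

def optimal_group_size(n, p):
    """Minimize E(Z_{n,k}) over the divisors k of n."""
    divisors = [k for k in range(1, n + 1) if n % k == 0]
    return min(divisors, key=lambda k: expected_tests(n, p, k))

k_star = optimal_group_size(100, 0.01)
print(k_star, round(expected_tests(100, 0.01, k_star), 2))  # 10 19.56
```

For n = 100 and p = 0.01, pooling in groups of 10 cuts the expected number of tests from 100 to about 19.6.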
If k does not divide n , then we could divide the population of n persons into ⌊n/k⌋ groups of k each and one “remainder” group
with n mod k members. This clearly complicates the analysis, but does not introduce any new ideas, so we will leave this
extension to the interested reader.
This page titled 11.1: Introduction to Bernoulli Trials is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
11.2: The Binomial Distribution
Basic Theory
Definitions
Our random experiment is to perform a sequence of Bernoulli trials X = (X_1, X_2, …). Recall that X is a sequence of independent, identically distributed indicator random variables, and in the usual language of reliability, 1 denotes success and 0 denotes failure. The common probability of success p = P(X_i = 1) is the basic parameter of the process. In statistical terms, the first n trials (X_1, X_2, …, X_n) form a random sample of size n from the Bernoulli distribution.
In this section we will study the random variable that gives the number of successes in the first n trials and the random variable that
gives the proportion of successes in the first n trials. The underlying distribution, the binomial distribution, is one of the most
important in probability theory, and so deserves to be studied in considerable detail. As you will see, some of the results in this
section have two or more proofs. In almost all cases, note that the proof from Bernoulli trials is the simplest and most elegant.
For n ∈ ℕ, the number of successes in the first n trials is the random variable
Y_n = ∑_{i=1}^{n} X_i, n ∈ ℕ  (11.2.1)
The distribution of Y_n is the binomial distribution with trial parameter n and success parameter p.
Note that Y = (Y_0, Y_1, …) is the partial sum process associated with the Bernoulli trials sequence X. In particular, Y_0 = 0, so point mass at 0 is considered a degenerate form of the binomial distribution.
Distribution Functions
The probability density function f_n of Y_n is given by
f_n(y) = \binom{n}{y} p^y (1 − p)^{n−y}, y ∈ {0, 1, …, n}  (11.2.2)
Proof
Check that f_n is a valid PDF.
In the binomial coin experiment, vary n and p with the scrollbars, and note the shape and location of the probability density
function. For selected values of the parameters, run the simulation 1000 times and compare the relative frequency function to
the probability density function.
Thus, the density function at first increases and then decreases, reaching its maximum value at ⌊(n + 1)p⌋ . This integer is a mode
of the distribution. In the case that m = (n + 1)p is an integer between 1 and n , there are two consecutive modes, at m − 1 and
m.
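The mode formula is easy to check directly. The sketch below is my own illustration: it computes the density function for one choice of parameters, verifies that it sums to 1, and confirms that the mode is ⌊(n + 1)p⌋.

```python
from math import comb, floor

def binomial_pdf(n, p, y):
    """P(Y_n = y) = C(n, y) p^y (1 - p)^(n - y)."""
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

n, p = 10, 0.3
pdf = [binomial_pdf(n, p, y) for y in range(n + 1)]
assert abs(sum(pdf) - 1) < 1e-12          # valid PDF
mode = max(range(n + 1), key=lambda y: pdf[y])
assert mode == floor((n + 1) * p)         # mode at floor((n+1)p) = 3
```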
The distribution function F_n of Y_n is given by
F_n(y) = P(Y_n ≤ y) = ∑_{k=0}^{y} \binom{n}{k} p^k (1 − p)^{n−k}, y ∈ {0, 1, …, n}  (11.2.6)
Open the special distribution calculator and select the binomial distribution and set the view to CDF. Vary n and p and note the
shape and location of the distribution/quantile function. For various values of the parameters, compute the median and the first
and third quartiles.
The binomial distribution function also has a nice relationship to the beta distribution function.
F_n(k) = \frac{n!}{(n − k − 1)! \, k!} \int_0^{1−p} x^{n−k−1} (1 − x)^k \, dx, k ∈ {0, 1, …, n − 1}  (11.2.7)
Proof
The expression on the right in the previous theorem is the beta distribution function, with left parameter n − k and right parameter
k + 1 , evaluated at 1 − p .
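This identity can be spot-checked numerically. The sketch below is my own illustration, using a simple midpoint approximation of the integral rather than any special-function library.

```python
from math import comb, factorial

def binom_cdf(n, p, k):
    """F_n(k): binomial distribution function by direct summation."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

def beta_integral(n, p, k, steps=200_000):
    """Midpoint approximation of the beta-function form of F_n(k)."""
    c = factorial(n) / (factorial(n - k - 1) * factorial(k))
    h = (1 - p) / steps
    total = sum((((i + 0.5) * h) ** (n - k - 1)) * ((1 - (i + 0.5) * h) ** k)
                for i in range(steps))
    return c * total * h

n, p, k = 8, 0.4, 3
assert abs(binom_cdf(n, p, k) - beta_integral(n, p, k)) < 1e-6
```

With n = 8, p = 0.4, k = 3, both sides evaluate to about 0.5941.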
Moments
The mean, variance and other moments of the binomial distribution can be computed in several different ways. Again let Y_n = ∑_{i=1}^{n} X_i, where X = (X_1, X_2, …) is a sequence of Bernoulli trials with success parameter p.
1. E(Y_n) = np
2. var(Y_n) = np(1 − p)
The expected value of Y_n also makes intuitive sense, since p should be approximately the proportion of successes in a large number of trials. We will discuss the point further in the subsection below on the proportion of successes. Note that the graph of var(Y_n) as a function of p ∈ [0, 1] is a parabola opening downward. In particular the maximum value of the variance is n/4, when p = 1/2, and the minimum value is 0, when p = 0 or p = 1. Of course, in the last two cases, Y_n is deterministic, taking just the value 0 when p = 0 and just the value n when p = 1.
In the binomial coin experiment, vary n and p with the scrollbars and note the location and size of the mean±standard
deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the sample mean and standard
deviation to the distribution mean and standard deviation.
Recall that for x ∈ ℝ and k ∈ ℕ, the falling power of x of order k is x^{(k)} = x(x − 1) ⋯ (x − k + 1). If X is a random variable and k ∈ ℕ, then E[X^{(k)}] is the factorial moment of X of order k. The probability generating function provides an easy way to compute the factorial moments of the binomial distribution: E[Y_n^{(k)}] = n^{(k)} p^k for k ∈ ℕ.
Proof
Our next result gives a recursion equation and initial conditions for the moments of the binomial distribution.
1. E(Y_n^k) = np E[(Y_{n−1} + 1)^{k−1}] for n, k ∈ ℕ_+
2. E(Y_n^0) = 1 for n ∈ ℕ
3. E(Y_0^k) = 0 for k ∈ ℕ_+
Proof
The ordinary (raw) moments of Y_n can be computed from the factorial moments and from the recursion relation. Here are the first four:
1. E(Y_n) = np
2. E(Y_n^2) = n^{(2)} p^2 + np
3. E(Y_n^3) = n^{(3)} p^3 + 3 n^{(2)} p^2 + np
4. E(Y_n^4) = n^{(4)} p^4 + 6 n^{(3)} p^3 + 7 n^{(2)} p^2 + np
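The falling-power formulas for the raw moments can be verified by brute force. The sketch below is my own illustration: it computes E(Y_n^k) by summing over the binomial PDF and compares with the closed forms above.

```python
from math import comb

def falling(x, k):
    """Falling power x^(k) = x (x - 1) ... (x - k + 1)."""
    out = 1
    for j in range(k):
        out *= x - j
    return out

def raw_moment(n, p, k):
    """E(Y_n^k) by direct summation over the binomial PDF."""
    return sum(y ** k * comb(n, y) * p ** y * (1 - p) ** (n - y)
               for y in range(n + 1))

n, p = 9, 0.35
assert abs(raw_moment(n, p, 2) - (falling(n, 2) * p ** 2 + n * p)) < 1e-9
assert abs(raw_moment(n, p, 4) - (falling(n, 4) * p ** 4
                                  + 6 * falling(n, 3) * p ** 3
                                  + 7 * falling(n, 2) * p ** 2 + n * p)) < 1e-9
```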
Our final moment results give the skewness and kurtosis of the binomial distribution.
skew(Y_n) = (1 − 2p) / √(np(1 − p))  (11.2.18)
Note that skew(Y_n) > 0 if p < 1/2, skew(Y_n) < 0 if p > 1/2, and skew(Y_n) = 0 if p = 1/2.
Proof
Open the binomial timeline experiment. For each of the following values of n , vary p from 0 to 1 and note the shape of the
probability density function in light of the previous results on skewness.
1. n = 10
2. n = 20
3. n = 100
kurt(Y_n) = 3 − 6/n + 1/[np(1 − p)]  (11.2.19)
1. For fixed n, kurt(Y_n) decreases and then increases as a function of p, with minimum value 3 − 2/n at the point of symmetry p = 1/2.
2. For fixed p, kurt(Y_n) → 3 as n → ∞, since 1/[np(1 − p)] − 6/n → 0 as n → ∞. This is related to the convergence of the binomial distribution to the normal, which we will discuss below.
1. If m ≤ n then Y_n − Y_m has the same distribution as Y_{n−m}, that is, binomial with parameters n − m and p.
2. If n_1 ≤ n_2 ≤ n_3 ≤ ⋯ then (Y_{n_1}, Y_{n_2} − Y_{n_1}, Y_{n_3} − Y_{n_2}, …) is a sequence of independent variables.
Proof
The joint probability density functions of the sequence Y are given as follows:
P(Y_{n_1} = y_1, Y_{n_2} = y_2, …, Y_{n_k} = y_k) = \binom{n_1}{y_1} \binom{n_2 − n_1}{y_2 − y_1} ⋯ \binom{n_k − n_{k−1}}{y_k − y_{k−1}} p^{y_k} (1 − p)^{n_k − y_k}  (11.2.20)
where n_1, n_2, …, n_k ∈ ℕ_+ with n_1 < n_2 < ⋯ < n_k and where y_1, y_2, …, y_k ∈ ℕ with 0 ≤ y_j − y_{j−1} ≤ n_j − n_{j−1} for each j ∈ {1, 2, …, k}.
Proof
If U is a random variable having the binomial distribution with parameters n and p, then n − U has the binomial distribution
with parameters n and 1 − p .
Proof from Bernoulli trials
Proof from density functions
The sum of two independent binomial variables with the same success parameter also has a binomial distribution.
Suppose that U and V are independent random variables, and that U has the binomial distribution with parameters m and p,
and V has the binomial distribution with parameters n and p. Then U + V has the binomial distribution with parameters
m + n and p .
select n objects at random from the population, so that all samples of size n are equally likely. If the sampling is with replacement,
the sample size n can be any positive integer. If the sampling is without replacement, then we must have n ∈ {1, 2, … , m}.
n
In either case, let X denote the type of the i'th object selected for i ∈ {1, 2, … , n} so that Y = ∑ X is the number of type 1
i i=1 i
objects in the sample. As noted in the Introduction, if the sampling is with replacement, (X , X , … , X ) is a sequence of 1 2 n
Bernoulli trials, and hence Y has the binomial distribution parameters n and p = r/m . If the sampling is without replacement, then
Y has the hypergeometric distribution with parameters m , r , and n . The hypergeometric distribution is studied in detail in the
chapter on Finite Sampling Models. For reference, the probability density function of Y is given by
P(Y = y) = \frac{\binom{r}{y} \binom{m−r}{n−y}}{\binom{m}{n}} = \binom{n}{y} \frac{r^{(y)} (m − r)^{(n−y)}}{m^{(n)}}, y ∈ {0, 1, …, n}  (11.2.22)
If the population size m is large compared to the sample size n , then the dependence between the indicator variables is slight, and
so the hypergeometric distribution should be close to the binomial distribution. The following theorem makes this precise.
Suppose that r_m ∈ {0, 1, …, m} for each m ∈ ℕ_+ and that r_m/m → p ∈ [0, 1] as m → ∞. Then for fixed n ∈ ℕ_+, the hypergeometric distribution with parameters m, r_m and n converges to the binomial distribution with parameters n and p as m → ∞.
Proof
Under the conditions in the previous theorem, the mean and variance of the hypergeometric distribution converge to the mean and variance of the limiting binomial distribution:
1. n (r_m/m) → np as m → ∞
2. n (r_m/m)(1 − r_m/m)(m − n)/(m − 1) → np(1 − p) as m → ∞
Proof
In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with
replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial
probability density function. In particular, note the similarity when m is large and n small. For selected values of the
parameters, and for both sampling modes, run the experiment 1000 times.
From a practical point of view, the convergence of the hypergeometric distribution to the binomial means that if the population size
m is “large” compared to the sample size, then the hypergeometric distribution with parameters m , r and n is well approximated
by the binomial distribution with parameters n and p = r/m . This is often a useful result, because the binomial distribution has
fewer parameters than the hypergeometric distribution (and often in real problems, the parameters may only be known
approximately). Specifically, in the approximating binomial distribution, we do not need to know the population size m and the
number of type 1 objects r individually, but only in the ratio r/m. Generally, the approximation works well if m is large compared
to n that m−n
m−1
is close to 1. This ensures that the variance of the hypergeometric distribution is close to the variance of the
approximating binomial distribution.
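The quality of the approximation is easy to see numerically. The sketch below is my own illustration: for a population that is large relative to the sample, the hypergeometric and binomial density functions nearly agree pointwise.

```python
from math import comb

def hypergeom_pdf(m, r, n, y):
    """P(Y = y) sampling n without replacement from m objects, r of type 1."""
    return comb(r, y) * comb(m - r, n - y) / comb(m, n)

def binomial_pdf(n, p, y):
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

# Large population relative to the sample: the two PDFs nearly agree.
m, r, n = 5000, 1000, 10          # p = r/m = 0.2
for y in range(n + 1):
    assert abs(hypergeom_pdf(m, r, n, y) - binomial_pdf(n, r / m, y)) < 1e-2
```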
Now let's return to our usual sequence of Bernoulli trials X = (X_1, X_2, …), with success parameter p, and to the binomial variables Y_n = ∑_{i=1}^{n} X_i for n ∈ ℕ. Our next result shows that given k successes in the first n trials, the trials on which the successes occur form simply a random sample of size k chosen without replacement from {1, 2, …, n}.
Suppose that n ∈ ℕ_+ and k ∈ {0, 1, …, n}. Then for (x_1, x_2, …, x_n) ∈ {0, 1}^n with ∑_{i=1}^{n} x_i = k,
P[(X_1, X_2, …, X_n) = (x_1, x_2, …, x_n) ∣ Y_n = k] = 1 / \binom{n}{k}  (11.2.27)
Proof
Note in particular that the conditional distribution above does not depend on p. In statistical terms, this means that relative to
(X_1, X_2, …, X_n), the random variable Y_n is a sufficient statistic for p. Roughly, Y_n contains all of the information about p that is available in the entire sample (X_1, X_2, …, X_n). Sufficiency is discussed in more detail in the chapter on Point Estimation. Next, if m ≤ n then the conditional distribution of Y_m given Y_n = k is hypergeometric, with population size n, type 1 size m, and sample size k.
Once again, note that the conditional distribution is independent of the success parameter p.
The parameter r is both the mean and the variance of the distribution. In addition, the probability generating function is t ↦ e^{r(t−1)}.
Suppose that p_n ∈ (0, 1) for n ∈ ℕ_+ and that n p_n → r ∈ (0, ∞) as n → ∞. Then the binomial distribution with parameters n and p_n converges to the Poisson distribution with parameter r as n → ∞.
Proof from density functions
Proof from generating functions
Under the same conditions as the previous theorem, the mean and variance of the binomial distribution converge to the mean
and variance of the limiting Poisson distribution, respectively.
1. n p_n → r as n → ∞
2. n p_n (1 − p_n) → r as n → ∞
Proof
From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials n is “large” and the probability of success p “small”, so that np^2 is small, then the binomial distribution with parameters n and p is well approximated by the Poisson distribution with parameter r = np. This is often a useful result, because the Poisson distribution has fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating Poisson distribution, we do not need to know the number of trials n and the probability of success p individually, but only in the product np. The condition that np^2 be small means that the variance of the binomial distribution, namely np(1 − p) = np − np^2, is approximately r, the variance of the approximating Poisson distribution.
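A quick numerical comparison makes the Poisson approximation concrete. The sketch below is my own illustration, with n = 200 and p = 0.01, so np^2 = 0.02 is small and r = np = 2.

```python
from math import comb, exp, factorial

def binomial_pdf(n, p, y):
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

def poisson_pdf(r, y):
    return exp(-r) * r ** y / factorial(y)

n, p = 200, 0.01          # r = np = 2, np^2 = 0.02
for y in range(10):
    assert abs(binomial_pdf(n, p, y) - poisson_pdf(n * p, y)) < 5e-3

print(round(binomial_pdf(n, p, 2), 4), round(poisson_pdf(2, 2), 4))  # 0.2721 0.2707
```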
The characteristic bell shape that you should observe in the previous exercise is an example of the central limit theorem, because
the binomial variable can be written as a sum of n independent, identically distributed random variables (the indicator variables).
The standard score of Y_n (which is also the standard score of M_n) is
Z_n = (Y_n − np) / √(np(1 − p)) = (M_n − p) / √(p(1 − p)/n)  (11.2.35)
The distribution of Z_n converges to the standard normal distribution as n → ∞.
This version of the central limit theorem is known as the DeMoivre-Laplace theorem, and is named after Abraham DeMoivre and
Simeon Laplace. From a practical point of view, this result means that, for large n, the distribution of Y_n is approximately normal, with mean np and standard deviation √(np(1 − p)), and the distribution of M_n is approximately normal, with mean p and standard deviation √(p(1 − p)/n). Just how large n needs to be for the normal approximation to work well depends on the value of p. The rule of thumb is that we need np ≥ 5 and n(1 − p) ≥ 5 (the first condition is the significant one when p ≤ 1/2 and the second condition is the significant one when p ≥ 1/2). Finally, when using the normal approximation, we should remember to use the continuity correction, since the binomial is a discrete distribution.
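The normal approximation with continuity correction is easy to code. The sketch below is my own illustration: it approximates P(a ≤ Y_n ≤ b) and checks it against the exact binomial probability for n = 15, p = 1/2.

```python
from math import comb, erf, sqrt

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_approx(n, p, a, b):
    """P(a <= Y_n <= b) via the normal approximation, with continuity correction."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    return normal_cdf((b + 0.5 - mu) / sigma) - normal_cdf((a - 0.5 - mu) / sigma)

# Exact binomial probability P(5 <= Y <= 10) for n = 15, p = 1/2:
exact = sum(comb(15, y) * 0.5 ** 15 for y in range(5, 11))
print(round(exact, 4), round(normal_approx(15, 0.5, 5, 10), 4))  # 0.8815 0.8786
```

Even for n as small as 15 (with p = 1/2, so np = n(1 − p) = 7.5 ≥ 5), the approximation is good to about three decimal places.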
General Families
For a fixed number of trials n , the binomial distribution is a member of two general families of distributions. First, it is a general
exponential distribution.
Suppose that Y has the binomial distribution with parameters n and p, where n ∈ ℕ_+ is fixed and p ∈ (0, 1). The distribution of Y is a one-parameter exponential family with natural parameter ln[p/(1 − p)] and natural statistic Y.
Proof
Note that the natural parameter is the logarithm of the odds ratio corresponding to p. This function is sometimes called the logit
function. The binomial distribution is also a power series distribution.
Suppose again that Y has the binomial distribution with parameters n and p, where n ∈ ℕ_+ is fixed and p ∈ (0, 1). The distribution of Y is a power series distribution in the parameter θ = p/(1 − p), corresponding to the function θ ↦ (1 + θ)^n.
Proof
Recall that Y_n = ∑_{i=1}^{n} X_i is the number of successes in the first n trials for n ∈ ℕ_+. The proportion of successes in the first n trials is the random variable
M_n = Y_n / n = (1/n) ∑_{i=1}^{n} X_i  (11.2.38)
In statistical terms, M_n is the sample mean of the random sample (X_1, X_2, …, X_n). The proportion of successes M_n is typically of interest as an estimator of p. The distribution of M_n is easily obtained from that of the number of successes Y_n. First, note that M_n takes the values k/n where k ∈ {0, 1, …, n}.
P(M_n = k/n) = \binom{n}{k} p^k (1 − p)^{n−k}, k ∈ {0, 1, …, n}  (11.2.39)
Proof
In the binomial coin experiment, select the proportion of heads. Vary n and p with the scroll bars and note the shape of the
probability density function. For selected values of the parameters, run the experiment 1000 times and compare the relative
frequency function to the probability density function.
The mean and variance of the proportion of successes M_n are easy to compute from the mean and variance of the number of successes Y_n.
1. E(M_n) = p
2. var(M_n) = (1/n) p(1 − p)
Proof
In the binomial coin experiment, select the proportion of heads. Vary n and p and note the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Recall that skewness and kurtosis are standardized measures. Since M_n and Y_n have the same standard score, the skewness and kurtosis of M_n are the same as the skewness and kurtosis of Y_n given above.
In statistical terms, part (a) of the moment result above means that M_n is an unbiased estimator of p. From part (b) note that var(M_n) ≤ 1/(4n) for any p ∈ [0, 1]. In particular, var(M_n) → 0 as n → ∞ and the convergence is uniform in p ∈ [0, 1]. Thus, the estimate improves as n increases; in statistical terms, M_n is a consistent estimator of p.
For every ϵ > 0, P(|M_n − p| ≥ ϵ) → 0 as n → ∞ and the convergence is uniform in p ∈ [0, 1].
Proof
The last result is a special case of the weak law of large numbers and means that M_n → p as n → ∞ in probability. The strong law of large numbers states that the convergence actually holds with probability 1.
The proportion of successes M_n has a number of nice properties as an estimator of the probability of success p. As already noted, it is unbiased and consistent. In addition, since Y_n is a sufficient statistic for p, based on the sample (X_1, X_2, …, X_n), it follows that M_n is sufficient for p as well. Since E(X_i) = p, M_n is trivially the method of moments estimator of p. Assuming that the parameter space for p is [0, 1], it is also the maximum likelihood estimator of p.
The likelihood function for p ∈ [0, 1], based on the observed sample (x_1, x_2, …, x_n) ∈ {0, 1}^n, is L_y(p) = p^y (1 − p)^{n−y}, where y = ∑_{i=1}^{n} x_i. The likelihood is maximized at m = y/n.
Proof
See Estimation in the Bernoulli Model in the chapter on Set Estimation for a different approach to the problem of estimating p.
A certain type of missile has failure probability 0.02. Let Y denote the number of failures in 50 tests. Find each of the
following:
1. The probability density function of Y .
2. The mean of Y .
3. The variance of Y .
4. The probability of at least 47 successful tests.
Answer
Suppose that in a certain district, 40% of the registered voters prefer candidate A . A random sample of 50 registered voters is
selected. Let Z denote the number in the sample who prefer A . Find each of the following:
1. The probability density function of Z .
2. The mean of Z .
3. The variance of Z .
4. The probability that Z is less than 19.
5. The normal approximation to the probability in (d).
Answer
A standard, fair die is tossed 10 times. Let N denote the number of aces. Find each of the following:
1. The probability density function of N .
2. The mean of N .
3. The variance of N .
Answer
A coin is tossed 100 times and results in 30 heads. Find the probability density function of the number of heads in the first 20
tosses.
Answer
An ace-six flat die is rolled 1000 times. Let Z denote the number of times that a score of 1 or 2 occurred. Find each of the
following:
1. The probability density function of Z .
2. The mean of Z .
3. The variance of Z .
4. The probability that Z is at least 400.
5. The normal approximation of the probability in (d)
Answer
In the binomial coin experiment, select the proportion of heads. Set n = 10 and p = 0.4. Run the experiment 100 times. Over all 100 runs, compute the square root of the average of the squares of the errors, when M is used to estimate p. This number is a measure of the quality of the estimate.
In the binomial coin experiment, select the number of heads Y, and set p = 0.5 and n = 15. Run the experiment 1000 times and compute the following:
1. P(5 ≤ Y ≤ 10)
2. The relative frequency of the event {5 ≤ Y ≤ 10}
3. The normal approximation to P(5 ≤ Y ≤ 10)
Answer
In the binomial coin experiment, select the proportion of heads M and set n = 30 , p = 0.6 . Run the experiment 1000 times
and compute each of the following:
1. P(0.5 ≤ M ≤ 0.7)
2. The relative frequency of the event {0.5 ≤ M ≤ 0.7}
3. The normal approximation to P(0.5 ≤ M ≤ 0.7)
Answer
Famous Problems
In 1693, Samuel Pepys asked Isaac Newton whether it is more likely to get at least one 6 in 6 rolls of a die, or at least two 6's in 12
rolls, or at least three 6's in 18 rolls. This problem is known as Pepys' problem; naturally, Pepys had fair dice in mind.
So the first event, at least one 6 in 6 rolls, is the most likely, and in fact the three events, in the order given, decrease in probability.
With fair dice run the simulation of the dice experiment 500 times and compute the relative frequency of the event of interest.
Compare the results with the theoretical probabilities above.
1. At least one 6 with n = 6 .
2. At least two 6's with n = 12 .
3. At least three 6's with n = 18 .
It appears that Pepys had some sort of misguided linearity in mind, given the comparisons that he wanted. Let's solve the general
problem.
2. Find the mean and variance of Y_k.
3. Give an exact formula for P(Y_k ≥ k), the probability of at least k 6's in 6k rolls of a fair die.
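The general formula lets us check Pepys' three cases directly. The sketch below is my own illustration, where Y_k has the binomial distribution with parameters 6k and 1/6.

```python
from math import comb

def prob_at_least(k):
    """P(Y_k >= k): probability of at least k sixes in 6k rolls of a fair die."""
    n, p = 6 * k, 1 / 6
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

for k in (1, 2, 3):
    print(k, round(prob_at_least(k), 4))
# 1 0.6651
# 2 0.6187
# 3 0.5973
```

The probabilities decrease in k, confirming Newton's answer that the first event is the most likely.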
So on average, the number of 6's in 6k rolls of a fair die is k, and that fact might have influenced Pepys' thinking. The next problem is known as DeMere's problem, named after Chevalier De Mere.
Which is more likely: at least one 6 with 4 throws of a fair die, or at least one double 6 in 24 throws of two fair dice?
Answer
In the M&M data, pool the bags to create a large sample of M&Ms. Now compute the sample proportion of red M&Ms.
Answer
The number of the peg that the ball hits in row n has the binomial distribution with parameters n and p.
In the Galton board experiment, select random variable Y (the number of moves right). Vary the parameters n and p and note
the shape and location of the probability density function and the mean±standard deviation bar. For selected values of the
parameters, click single step several times and watch the ball fall through the pegs. Then run the experiment 1000 times and
watch the path of the ball. Compare the relative frequency function and empirical moments to the probability density function
and distribution moments, respectively.
Structural Reliability
Recall the discussion of structural reliability given in the last section on Bernoulli trials. In particular, we have a system of n
similar components that function independently, each with reliability p. Suppose now that the system as a whole functions properly
if and only if at least k of the n components are good. Such a system is called, appropriately enough, a k out of n system. Note
that the series and parallel systems considered in the previous section are n out of n and 1 out of n systems, respectively.
The reliability of the k out of n system is r_{n,k}(p) = ∑_{i=k}^{n} C(n, i) p^i (1 − p)^{n−i} for p ∈ [0, 1].
In the binomial coin experiment, set n = 10 and p = 0.9 and run the simulation 1000 times. Compute the empirical reliability
and compare with the true reliability in each of the following cases:
1. 10 out of 10 (series) system.
2. 1 out of 10 (parallel) system.
3. 4 out of 10 system.
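The reliability function is a simple binomial tail sum. Below is a minimal sketch (the function name is ours); the series and parallel cases reduce to the familiar formulas p^n and 1 − (1 − p)^n.

```python
from math import comb

def reliability(n: int, k: int, p: float) -> float:
    """Reliability of a k out of n system: P(at least k of the n components work)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

series = reliability(10, 10, 0.9)     # 10 out of 10 (series): p**10
parallel = reliability(10, 1, 0.9)    # 1 out of 10 (parallel): 1 - (1 - p)**10
four_of_ten = reliability(10, 4, 0.9)
```

Reliability decreases as k increases, with the series and parallel systems as the two extremes.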
Consider a system with n = 4 components. Sketch the graphs of r_{4,1}, r_{4,2}, r_{4,3}, and r_{4,4} on the same set of axes.
Answer
In the binomial coin experiment, compute the empirical reliability, based on 100 runs, in each of the following cases. Compare
your results to the true probabilities.
1. A 2 out of 3 system with p = 0.3
2. A 3 out of 5 system with p = 0.3
3. A 2 out of 3 system with p = 0.8
4. A 3 out of 5 system with p = 0.8
Reliable Communications
Consider the transmission of bits (0s and 1s) through a noisy channel. Specifically, suppose that when bit i ∈ {0, 1} is transmitted, bit i is received with probability p_i ∈ (0, 1) and the complementary bit 1 − i is received with probability 1 − p_i. Given the bits transmitted, bits are received correctly or incorrectly independently of one another. Suppose now, that to increase reliability, a given bit I is repeated n ∈ N₊ times in the transmission. A priori, we believe that P(I = 1) = α ∈ (0, 1) and P(I = 0) = 1 − α.
Simplify the results in the last exercise in the symmetric case where p_1 = p_0 =: p (so that the bits are equally reliable) and with α = 1/2 (so that we have no prior information).
Answer
Our interest, of course, is predicting the bit transmitted given the bits received.
Presumably, our decision rule would be to conclude that 1 was transmitted if the posterior probability in the previous exercise is greater than 1/2, and to conclude that 0 was transmitted if this probability is less than 1/2. If the probability equals 1/2, we have no basis for choosing between the two conclusions.
Give the decision rule in the symmetric case where p_1 = p_0 =: p, so that the bits are equally reliable. Assume that p > 1/2, so that we at least have a better than even chance of receiving the bit transmitted.
Answer
Since p > 1/2, we conclude that bit i was transmitted if a majority of the bits received are i.
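In the symmetric case, the probability that majority decoding recovers the transmitted bit is a binomial tail probability. A small sketch (function name ours), assuming an odd number of repetitions so that ties cannot occur:

```python
from math import comb

def majority_correct(n: int, p: float) -> float:
    """P(a majority of n received bits are correct) when each bit is correct w.p. p (n odd)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n // 2 + 1, n + 1))

single = majority_correct(1, 0.8)
triple = majority_correct(3, 0.8)   # 0.8**3 + 3 * 0.8**2 * 0.2 = 0.896
```

When p > 1/2, repeating the bit strictly improves reliability, as the test values illustrate.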
Bernstein Polynomials
The Weierstrass Approximation Theorem, named after Karl Weierstrass, states that any real-valued function that is continuous on a
closed, bounded interval can be uniformly approximated on that interval, to any degree of accuracy, with a polynomial. The
theorem is important, since polynomials are simple and basic functions, and a bit surprising, since continuous functions can be
quite strange.
In 1911, Sergei Bernstein gave an explicit construction of polynomials that uniformly approximate a given continuous function,
using Bernoulli trials. Bernstein's result is a beautiful example of the probabilistic method, the use of probability theory to obtain
results in other areas of mathematics that are seemingly unrelated to probability.
Suppose that f is a real-valued function that is continuous on the interval [0, 1]. The Bernstein polynomial of degree n for f is defined by
b_n(p) = E_p[f(M_n)], p ∈ [0, 1]
where M_n is the proportion of successes in the first n Bernoulli trials with success parameter p, as defined earlier. Note that we are emphasizing the dependence on p in the expected value operator. The next exercise gives a more explicit representation, and shows that the Bernstein polynomial is, in fact, a polynomial.
Proof
1. b_n(0) = f(0) and b_n(1) = f(1) for every n ∈ N₊.
2. b_1(p) = f(0) + [f(1) − f(0)] p for p ∈ [0, 1].
3. b_2(p) = f(0) + 2[f(1/2) − f(0)] p + [f(1) − 2f(1/2) + f(0)] p² for p ∈ [0, 1].
From part (a), the graph of b_n passes through the endpoints (0, f(0)) and (1, f(1)). From part (b), the graph of b_1 is a line connecting the endpoints. From part (c), the graph of b_2 is a parabola passing through the endpoints and the point (1/2, (1/4) f(0) + (1/2) f(1/2) + (1/4) f(1)).
Compute the Bernstein polynomials of orders 1, 2, and 3 for the function f defined by f (x) = cos(πx) for x ∈ [0, 1]. Graph f
and the three polynomials on the same set of axes.
Answer
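For numerical experimentation, the explicit representation gives a direct implementation. The sketch below (function name ours) evaluates b_n(p) = ∑ f(i/n) C(n, i) p^i (1 − p)^{n−i} and illustrates the approximation for f(x) = cos(πx).

```python
from math import comb, cos, pi

def bernstein(f, n: int, p: float) -> float:
    """Bernstein polynomial b_n(p) = sum of f(i/n) C(n,i) p^i (1-p)^(n-i)."""
    return sum(f(i / n) * comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1))

f = lambda x: cos(pi * x)
endpoint = bernstein(f, 3, 0.0)   # b_n(0) = f(0) = 1
approx = bernstein(f, 50, 0.3)    # close to cos(0.3 * pi)
```

Increasing n tightens the uniform approximation, in line with Bernstein's theorem, though the convergence is rather slow.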
Use a computer algebra system to compute the Bernstein polynomials of orders 10, 20, and 30 for the function f defined
below. Use the CAS to graph the function and the three polynomials on the same axes.
f(x) = 0 for x = 0, and f(x) = x sin(π/x) for x ∈ (0, 1] (11.2.49)
This page titled 11.2: The Binomial Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
11.3: The Geometric Distribution
Basic Theory
Definitions
Suppose again that our random experiment is to perform a sequence of Bernoulli trials X = (X_1, X_2, …) with success parameter p ∈ (0, 1]. In this section we will study the random variable N that gives the trial number of the first success and the random variable M that gives the number of failures before the first success.
Let N = min{n ∈ N₊ : X_n = 1}, the trial number of the first success, and let M = N − 1, the number of failures before the first success. The distribution of N is the geometric distribution on N₊ and the distribution of M is the geometric distribution on N.
Since N and M differ by a constant, the properties of their distributions are very similar. Nonetheless, there are applications where
it is more natural to use one rather than the other, and in the literature, the term geometric distribution can refer to either. In this
section, we will concentrate on the distribution of N , pausing occasionally to summarize the corresponding results for M .
Proof
Check that f is a valid PDF
A priori, we might have thought it possible to have N = ∞ with positive probability; that is, we might have thought that we could
run Bernoulli trials forever without ever seeing a success. However, we now know this cannot happen when the success parameter
p is positive.
In the negative binomial experiment, set k = 1 to get the geometric distribution on N₊. Vary p with the scroll bar and note the shape and location of the probability density function. For selected values of p, run the simulation 1000 times and compare the relative frequency function to the probability density function.
Note that the probability density functions of N and M are decreasing, and hence have modes at 1 and 0, respectively. The
geometric form of the probability density functions also explains the term geometric distribution.
Suppose that T is a random variable taking values in N₊. Recall that the ordinary distribution function of T is n ↦ P(T ≤ n). In this section, the complementary function n ↦ P(T > n) will play a fundamental role. We will refer to this function as the right distribution function of T. Of course both functions completely determine the distribution of T. Suppose again that N has the geometric distribution on N₊ with success parameter p ∈ (0, 1].
From the last result, it follows that the ordinary (left) distribution function of N is given by
F(n) = 1 − (1 − p)^n, n ∈ N (11.3.3)
For m ∈ N, the conditional distribution of N − m given N > m is the same as the distribution of N. That is,
P(N > n + m ∣ N > m) = P(N > n); m, n ∈ N (11.3.4)
11.3.1 https://stats.libretexts.org/@go/page/10235
Proof
Thus, if the first success has not occurred by trial number m, then the remaining number of trials needed to achieve the first
success has the same distribution as the trial number of the first success in a fresh sequence of Bernoulli trials. In short, Bernoulli
trials have no memory. This fact has implications for a gambler betting on Bernoulli trials (such as in the casino games roulette or
craps). No betting strategy based on observations of past outcomes of the trials can possibly help the gambler.
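The memoryless property is easy to verify numerically from the right distribution function P(N > n) = (1 − p)^n. A quick check (function name ours):

```python
def geom_tail(n: int, p: float) -> float:
    """Right distribution function of the geometric distribution on N+: P(N > n) = (1 - p)^n."""
    return (1 - p) ** n

p = 0.3
# Memorylessness: P(N > m + n | N > m) = P(N > m + n) / P(N > m) should equal P(N > n).
conditional = geom_tail(2 + 5, p) / geom_tail(2, p)
unconditional = geom_tail(5, p)
```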
Conversely, if T is a random variable taking values in N+ that satisfies the memoryless property, then T has a geometric
distribution.
Proof
Moments
Suppose again that N is the trial number of the first success in a sequence of Bernoulli trials, so that N has the geometric distribution on N₊ with parameter p ∈ (0, 1]. The mean and variance of N can be computed in several different ways.
E(N) = 1/p
This result makes intuitive sense. In a sequence of Bernoulli trials with success parameter p we would “expect” to wait 1/p trials for the first success.
var(N) = (1 − p)/p²
Direct proof
Proof from Bernoulli trials
Note that var(N ) = 0 if p = 1 , hardly surprising since N is deterministic (taking just the value 1) in this case. At the other
extreme, var(N ) ↑ ∞ as p ↓ 0 .
In the negative binomial experiment, set k = 1 to get the geometric distribution. Vary p with the scroll bar and note the
location and size of the mean±standard deviation bar. For selected values of p, run the simulation 1000 times and compare the
sample mean and standard deviation to the distribution mean and standard deviation.
Proof
Recall again that for x ∈ R and k ∈ N, the falling power of x of order k is x^(k) = x(x − 1) ⋯ (x − k + 1). If X is a random variable, then E[X^(k)] is the factorial moment of X of order k.
1. skew(N) = (2 − p)/√(1 − p)
2. kurt(N) = 9 + p²/(1 − p)
Proof
Note that the geometric distribution is always positively skewed. Moreover, skew(N ) → ∞ and kurt(N ) → ∞ as p ↑ 1 .
Suppose now that M = N − 1, so that M (the number of failures before the first success) has the geometric distribution on N. Then
1. E(M) = (1 − p)/p
2. var(M) = (1 − p)/p²
3. skew(M) = (2 − p)/√(1 − p)
4. kurt(M) = 9 + p²/(1 − p)
5. E(t^M) = p/[1 − (1 − p)t] for |t| < 1/(1 − p)
Of course, the fact that the variance, skewness, and kurtosis are unchanged follows easily, since N and M differ by a constant.
Of course, the quantile function, like the probability density function and the distribution function, completely determines the
distribution of N . Moreover, we can compute the median and quartiles to get measures of center and spread.
The first quartile, the median (or second quartile), and the third quartile are
1. F⁻¹(1/4) = ⌈ln(3/4)/ln(1 − p)⌉ ≈ ⌈−0.2877/ln(1 − p)⌉
2. F⁻¹(1/2) = ⌈ln(1/2)/ln(1 − p)⌉ ≈ ⌈−0.6931/ln(1 − p)⌉
3. F⁻¹(3/4) = ⌈ln(1/4)/ln(1 − p)⌉ ≈ ⌈−1.3863/ln(1 − p)⌉
Open the special distribution calculator, and select the geometric distribution and CDF view. Vary p and note the shape and
location of the CDF/quantile function. For various values of p, compute the median and the first and third quartiles.
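The quantile formulas can be coded directly; the sketch below (function name ours) inverts F(n) = 1 − (1 − p)^n with a ceiling.

```python
from math import ceil, log

def geom_quantile(r: float, p: float) -> int:
    """Quantile function of the geometric distribution on N+: ceil(ln(1-r)/ln(1-p))."""
    return ceil(log(1 - r) / log(1 - p))

# Quartiles for p = 0.2, matching the formulas above.
quartiles = (geom_quantile(0.25, 0.2), geom_quantile(0.5, 0.2), geom_quantile(0.75, 0.2))
```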
If T is interpreted as the (discrete) lifetime of a device, then h is a discrete version of the failure rate function studied in reliability
theory. However, in our usual formulation of Bernoulli trials, the event of interest is success rather than failure (or death), so we
will simply use the term rate function to avoid confusion. The constant rate property characterizes the geometric distribution. As
usual, let N denote the trial number of the first success in a sequence of Bernoulli trials with success parameter p ∈ (0, 1), so that N has the geometric distribution on N₊ with parameter p.
Conversely, if T has constant rate p ∈ (0, 1) then T has the geometric distribution on N₊ with success parameter p.
Proof
Note that the conditional distribution does not depend on the success parameter p. If we know that there is exactly one success in
the first n trials, then the trial number of that success is equally likely to be any of the n possibilities.
Another connection between the geometric distribution and the uniform distribution is given below in the alternating coin tossing
game: the conditional distribution of N given N ≤ n converges to the uniform distribution on {1, 2, … , n} as p ↓ 0 .
x ∈ [0, ∞). The mean of the exponential distribution is 1/r and the variance is 1/r². In addition, the moment generating function is s ↦ r/(r − s) for s < r.
For n ∈ N₊, suppose that U_n has the geometric distribution on N₊ with success parameter p_n ∈ (0, 1), where n p_n → r > 0 as n → ∞. Then the distribution of U_n / n converges to the exponential distribution with rate parameter r as n → ∞.
Proof
Note that the condition n p_n → r as n → ∞ is the same condition required for the convergence of the binomial distribution to the Poisson distribution.
Special Families
The geometric distribution on N is an infinitely divisible distribution and is a compound Poisson distribution. For the details, visit
these individual sections and see the next section on the negative binomial distribution.
A type of missile has failure probability 0.02. Let N denote the number of launches before the first failure. Find each of the
following:
1. The probability density function of N .
2. The mean of N .
3. The variance of N .
4. The probability of 20 consecutive successful launches.
5. The quantile function of N .
6. The median and the first and third quartiles.
Answer
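For the missile example, the answers follow directly from the formulas for the geometric distribution; a quick numerical check (variable names ours):

```python
p = 0.02                       # failure probability per launch
mean_N = 1 / p                 # expected launch number of the first failure
var_N = (1 - p) / p**2         # variance of N
p_20_successes = (1 - p)**20   # P(N > 20): 20 consecutive successful launches
```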
A student takes a multiple choice test with 10 questions, each with 5 choices (only one correct). The student blindly guesses
and gets one question correct. Find the probability that the correct question was one of the first 4.
Answer
Recall that an American roulette wheel has 38 slots: 18 are red, 18 are black, and 2 are green. Suppose that you observe red or
green on 10 consecutive spins. Give the conditional distribution of the number of additional spins needed for black to occur.
Answer
The game of roulette is studied in more detail in the chapter on Games of Chance.
In the negative binomial experiment, set k = 1 to get the geometric distribution and set p = 0.3 . Run the experiment 1000
times. Compute the appropriate relative frequencies and empirically investigate the memoryless property
P(V > 5 ∣ V > 2) = P(V > 3) (11.3.26)
W = c
Proof
Thus, W is not random and W is independent of p! Since c is an arbitrary constant, it would appear that we have an ideal strategy. However, let us study the amount of money Z needed to play the strategy.
Z = c(2^N − 1)
Thus, the strategy is fatally flawed when the trials are unfavorable and even when they are fair, since we need infinite expected
capital to make the strategy work in these cases.
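The flaw shows up in the expected cost. Since N is geometric, E(Z) = c ∑ (2^n − 1) p(1 − p)^{n−1}, which is finite only when p > 1/2 (where it equals c/(2p − 1)) and diverges otherwise. A truncated-series sketch (function name ours):

```python
def expected_cost(p: float, c: float = 1.0, terms: int = 500) -> float:
    """Truncated series for E(Z) = E[c(2^N - 1)]; the full series is finite only when p > 1/2."""
    q = 1 - p
    return c * sum((2.0**n - 1) * p * q**(n - 1) for n in range(1, terms))

favorable = expected_cost(0.8)   # converges to c / (2p - 1) = 1 / 0.6
fair = expected_cost(0.5)        # partial sums grow without bound as terms increases
```

For p = 1/2 each term is 1 − (1/2)^n, so the truncated sum grows roughly linearly in the number of terms retained, illustrating the divergence.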
In the negative binomial experiment, set k = 1 . For each of the following values of p, run the experiment 100 times. For each
run compute Z (with c = 1 ). Find the average value of Z over the 100 runs:
1. p = 0.2
2. p = 0.5
3. p = 0.8
For more information about gambling strategies, see the section on Red and Black. Martingales are studied in detail in a separate
chapter.
p . Additionally, let W denote the winner of the game; W takes values in the set {1, 2, … , n}. We are interested in the probability
distribution of W .
For i ∈ {1, 2, …, n}, W = i if and only if N = i + kn for some k ∈ N. That is, using modular arithmetic, W = 1 + [(N − 1) mod n].
Proof
P(W = i) = (1 − p)^{i−1} P(W = 1) for i ∈ {1, 2, …, n}.
Proof
Note that P(W = i) is a decreasing function of i ∈ {1, 2, … , n} . Not surprisingly, the lower the toss order the better for the
player.
Explicitly compute the probability density function of W when the coin is fair (p = 1/2 ).
Answer
Note from the result above that W itself has a truncated geometric distribution.
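The truncated geometric form is easy to verify numerically. Summing the geometric series over k in the result above gives P(W = i) = p(1 − p)^{i−1} / [1 − (1 − p)^n]; a sketch (function name ours):

```python
def win_pdf(i: int, n: int, p: float) -> float:
    """P(W = i) in the alternating toss game with n players: a truncated geometric pdf."""
    q = 1 - p
    return p * q**(i - 1) / (1 - q**n)

n = 3
fair_game = [win_pdf(i, n, 0.5) for i in range(1, n + 1)]   # [4/7, 2/7, 1/7]
```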
The following problems explore some limiting distributions related to the alternating coin-tossing game.
For fixed p ∈ (0, 1], the distribution of W converges to the geometric distribution with parameter p as n ↑ ∞ .
Players at the end of the tossing order should hope for a coin biased towards tails.
Suppose there are k ∈ {2, 3, …} players and p ∈ [0, 1]. In a single round, the probability of an odd man is
r_k(p) = 2p(1 − p) if k = 2; r_k(p) = k p (1 − p)^{k−1} + k p^{k−1} (1 − p) if k ∈ {3, 4, …} (11.3.32)
Proof
1. r_k(0) = r_k(1) = 0
2. r_k is symmetric about p = 1/2
Proof
For k ∈ {2, 3}, r_k is concave downward, with maximum at p = 1/2.
Proof
1. The maximum occurs at two points of the form p_k and 1 − p_k, where p_k ∈ (0, 1/2) and p_k → 0 as k → ∞.
2. The maximum value r_k(p_k) → 1/e ≈ 0.3679 as k → ∞.
Proof sketch
Suppose p ∈ (0, 1), and let N_k denote the number of rounds until an odd man is eliminated, starting with k players. Then N_k has the geometric distribution on N₊ with parameter r_k(p). The mean and variance are
1. μ_k(p) = 1/r_k(p)
2. σ_k²(p) = [1 − r_k(p)]/r_k²(p)
Suppose we start with k ∈ {2, 3, …} players and p ∈ (0, 1). The number of rounds until a single player remains is M_k = ∑_{j=2}^{k} N_j, where (N_2, N_3, …, N_k) are independent and N_j has the geometric distribution on N₊ with parameter r_j(p).
1. E(M_k) = ∑_{j=2}^{k} 1/r_j(p)
2. var(M_k) = ∑_{j=2}^{k} [1 − r_j(p)]/r_j²(p)
Proof
Starting with k players and probability of heads p ∈ (0, 1), the total number of coin tosses is T_k = ∑_{j=2}^{k} j N_j. The mean and variance are
1. E(T_k) = ∑_{j=2}^{k} j/r_j(p)
2. var(T_k) = ∑_{j=2}^{k} j² [1 − r_j(p)]/r_j²(p)
Proof
trials M before the first success (outcome 1) occurs has the geometric distribution on N with parameter p. A natural generalization is the random variable that gives the number of trials before a specific finite sequence of outcomes occurs for the first time. (Such a sequence is sometimes referred to as a word from the alphabet {0, 1}, or simply a bit string.) In general, finding the distribution of this variable is a difficult problem, with the difficulty depending very much on the nature of the word. The problem of finding just the expected number of trials before a word occurs can be solved using powerful tools from the theory of renewal processes and from the theory of martingales.
To set up the notation, let x denote a finite bit string and let M_x denote the number of trials before x occurs for the first time. Finally, let q = 1 − p. Note that M_x takes values in N. In the following exercises, we will consider x = 10, a success followed by a failure. As always, try to derive the results yourself before looking at the proofs.
1. If p ≠ 1/2 then
f_10(n) = pq (p^{n+1} − q^{n+1})/(p − q), n ∈ N (11.3.33)
2. If p = 1/2 then f_10(n) = (n + 1)(1/2)^{n+2} for n ∈ N.
Proof
It's interesting to note that f_10 is symmetric in p and q, that is, symmetric about p = 1/2. It follows that the distribution function, probability generating function, expected value, and variance, which we consider below, are all also symmetric about p = 1/2. It's also interesting to note that f_10(0) = f_10(1) = pq, and this is the largest value. So regardless of p ∈ (0, 1), the distribution is bimodal, with modes at 0 and 1.
1. If p ≠ 1/2 then
F_10(n) = 1 − (p^{n+3} − q^{n+3})/(p − q), n ∈ N (11.3.35)
2. If p = 1/2 then F_10(n) = 1 − (n + 3)(1/2)^{n+2} for n ∈ N.
Proof
1. If p ≠ 1/2 then
P_10(t) = [pq/(p − q)] [p/(1 − pt) − q/(1 − qt)], |t| < min{1/p, 1/q} (11.3.36)
2. If p = 1/2 then P_10(t) = 1/(t − 2)² for |t| < 2
Proof
1. If p ≠ 1/2 then
E(M_10) = (p^4 − q^4)/(pq(p − q)) (11.3.38)
2. If p = 1/2 then E(M_10) = 2.
Proof
As a function of p ∈ (0, 1), it's not surprising that E(M_10) → ∞ as p ↓ 0 and as p ↑ 1, and that the minimum value, 2, occurs when p = 1/2.
1. If p ≠ 1/2 then
var(M_10) = [2/(p²q²)] (p^6 − q^6)/(p − q) + [1/(pq)] (p^4 − q^4)/(p − q) − [(p^4 − q^4)/(pq(p − q))]² (11.3.39)
2. If p = 1/2 then var(M_10) = 4.
Proof
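The results for the word 10 can be checked numerically from the density f_10. The sketch below (function name ours) truncates the infinite sums; note that (p^4 − q^4)/(pq(p − q)) simplifies algebraically to 1/(pq) − 2.

```python
def f10(n: int, p: float) -> float:
    """PDF of M_10, the number of trials before the pattern 10 first occurs (p != 1/2)."""
    q = 1 - p
    return p * q * (p**(n + 1) - q**(n + 1)) / (p - q)

p, q = 0.3, 0.7
total = sum(f10(n, p) for n in range(2000))          # should be (nearly) 1
mean = sum(n * f10(n, p) for n in range(2000))       # should match 1/(p*q) - 2
```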
This page titled 11.3: The Geometric Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
11.4: The Negative Binomial Distribution
Basic Theory
Suppose again that our random experiment is to perform a sequence of Bernoulli trials X = (X 1, X2 , …) with success parameter
p ∈ (0, 1]. Recall that the number of successes in the first n trials
Yn = ∑ Xi (11.4.1)
i=1
has the binomial distribution with parameters n and p. In this section we will study the random variable that gives the trial number
of the k th success:
Vk = min {n ∈ N+ : Yn = k} (11.4.2)
Note that V_1 is the number of trials needed to get the first success, which we now know has the geometric distribution on N₊ with parameter p.
V_k has probability density function given by
P(V_k = n) = C(n − 1, k − 1) p^k (1 − p)^{n−k}, n ∈ {k, k + 1, k + 2, …} (11.4.3)
Proof
The distribution defined by the density function in (1) is known as the negative binomial distribution; it has two parameters, the
stopping parameter k and the success probability p.
In the negative binomial experiment, vary k and p with the scroll bars and note the shape of the density function. For selected
values of k and p, run the experiment 1000 times and compare the relative frequency function to the probability density
function.
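The density function is straightforward to implement, and the mean k/p can be checked by truncating the series (function name ours):

```python
from math import comb

def neg_binom_pdf(n: int, k: int, p: float) -> float:
    """P(V_k = n) = C(n-1, k-1) p^k (1-p)^(n-k) for n = k, k+1, ..."""
    return comb(n - 1, k - 1) * p**k * (1 - p)**(n - k)

k, p = 5, 0.5
total = sum(neg_binom_pdf(n, k, p) for n in range(k, 400))
mean = sum(n * neg_binom_pdf(n, k, p) for n in range(k, 400))
```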
The binomial and negative binomial sequences are inverse to each other in a certain sense.
1. V_k ≤ n if and only if Y_n ≥ k, so that P(V_k ≤ n) = P(Y_n ≥ k).
2. k P(Y_n = k) = n P(V_k = n)
Proof
In particular, it follows from part (a) that any event that can be expressed in terms of the negative binomial variables can also be
expressed in terms of the binomial variables.
Let t = 1 + (k − 1)/p. Then
1. The probability density function of V_k at first increases and then decreases, reaching its maximum value at ⌊t⌋.
2. There is a single mode at ⌊t⌋ if t is not an integer, and two consecutive modes at t − 1 and t if t is an integer.
U = (U_1, U_2, …) is a sequence of independent random variables, each having the geometric distribution on N₊ with parameter p. Moreover,
V_k = ∑_{i=1}^{k} U_i (11.4.5)
11.4.1 https://stats.libretexts.org/@go/page/10236
In statistical terms, U corresponds to sampling from the geometric distribution with parameter p, so that for each k, (U_1, U_2, …, U_k) is a random sample of size k from this distribution. The sample mean corresponding to this sample is V_k / k; this random variable gives the average number of trials between the first k successes. In probability terms, the sequence of negative
binomial variables V is the partial sum process corresponding to the sequence U . Partial sum processes are studied in more
generality in the chapter on Random Samples.
Actually, any partial sum process corresponding to an independent, identically distributed sequence will have stationary,
independent increments.
Basic Properties
The mean, variance and probability generating function of V_k can be computed in several ways. The method using the representation as a sum of independent, identically distributed geometrically distributed variables is the easiest.
E(t^{V_k}) = [pt/(1 − (1 − p)t)]^k for |t| < 1/(1 − p)
Proof
1. E(V_k) = k/p
2. var(V_k) = k(1 − p)/p²
Proof
In the negative binomial experiment, vary k and p with the scroll bars and note the location and size of the mean/standard
deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard
deviation to the distribution mean and standard deviation.
Suppose that V and W are independent random variables for an experiment, and that V has the negative binomial distribution
with parameters j and p, and W has the negative binomial distribution with parameters k and p. Then V + W has the negative
binomial distribution with parameters k + j and p.
Proof
Normal Approximation
In the negative binomial experiment, start with various values of p and k = 1 . Successively increase k by 1, noting the shape
of the probability density function each time.
Even though you are limited to k = 5 in the app, you can still see the characteristic bell shape. This is a consequence of the central
limit theorem because the negative binomial variable can be written as a sum of k independent, identically distributed (geometric)
random variables.
Z_k = (p V_k − k) / √(k(1 − p)) (11.4.7)
The distribution of Z_k converges to the standard normal distribution as k → ∞.
From a practical point of view, this result means that if k is “large”, the distribution of V_k is approximately normal with mean k/p and variance k(1 − p)/p². Just how large k needs to be for the approximation to work well depends on p. Also, when using the normal approximation, we should remember to use the continuity correction, since the negative binomial is a discrete distribution.
Proof
Thus, given exactly k successes in the first n trials, the vector of success trial numbers is uniformly distributed on the set of
possibilities L, regardless of the value of the success parameter p. Equivalently, the vector of success trial numbers is distributed as
the vector of order statistics corresponding to a sample of size k chosen at random and without replacement from {1, 2, … , n}.
P(V_j = m ∣ Y_n = k) = C(m − 1, j − 1) C(n − m, k − j) / C(n, k), m ∈ {j, j + 1, …, n − k + j} (11.4.10)
Proof
Thus, given exactly k successes in the first n trials, the trial number of the j th success has the same distribution as the j th order
statistic when a sample of size k is selected at random and without replacement from the population {1, 2, … , n}. Again, this
result does not depend on the value of the success parameter p. The following theorem gives the mean and variance of the
conditional distribution.
1. E(V_j ∣ Y_n = k) = j (n + 1)/(k + 1)
2. var(V_j ∣ Y_n = k) = j (k − j + 1) (n + 1)(n − k) / [(k + 1)² (k + 2)]
Proof
A coin is tossed repeatedly. The 10th head occurs on the 25th toss. Find each of the following:
1. The probability density function of the trial number of the 5th head.
2. The mean of the distribution in (a).
3. The variance of the distribution in (a).
Answer
A certain type of missile has failure probability 0.02. Let N denote the launch number of the fourth failure. Find each of the
following:
1. The probability density function of N .
2. The mean of N .
3. The variance of N .
4. The probability that there will be at least 4 failures in the first 200 launches.
Answer
In the negative binomial experiment, set p = 0.5 and k = 5 . Run the experiment 1000 times. Compute and compare each of
the following:
1. P(8 ≤ V_5 ≤ 15)
Answer
the left pocket as a win for player L. In a hypothetical infinite sequence of trials, let U denote the number of trials necessary for R to win m + 1 trials, and V the number of trials necessary for L to win m + 1 trials. Note that U and V each have the negative binomial distribution with parameters m + 1 and 1/2.
Let W denote the number of matches remaining when the professor first discovers that one of the boxes is empty. In the symmetric case,
P(W = k) = C(2m − k, m) (1/2)^{2m−k}, k ∈ {0, 1, …, m}
Proof
We can also solve the non-symmetric Banach match problem, using the same methods as above. Thus, suppose that the professor
reaches for a match in his right pocket with probability p and in his left pocket with probability 1 − p , where 0 < p < 1 . The
essential change in the analysis is that U has the negative binomial distribution with parameters m +1 and p, while V has the
negative binomial distribution with parameters m + 1 and 1 − p .
For the Banach match problem with parameter p, W has probability density function
P(W = k) = C(2m − k, m) [p^{m+1} (1 − p)^{m−k} + (1 − p)^{m+1} p^{m−k}], k ∈ {0, 1, …, m} (11.4.17)
A wins n points before B wins m points. Computing A_{n,m}(p) is an historically famous problem, known as the problem of points, that was solved by Blaise Pascal and Pierre de Fermat.
Comment on the validity of the Bernoulli trial assumptions (independence of trials and constant probability of success) for
games of sport that have a skill component as well as a random component.
There is an easy solution to the problem of points using the binomial distribution; this was essentially Pascal's solution. There is
also an easy solution to the problem of points using the negative binomial distribution. In a sense, this has to be the case, given the
equivalence between the binomial and negative binomial processes in (3). First, let us pretend that the trials go on forever,
regardless of the outcomes. Let Y_{n+m−1} denote the number of wins by player A in the first n + m − 1 points, and let V_n denote the number of trials needed for A to win n points. By definition, Y_{n+m−1} has the binomial distribution with parameters n + m − 1 and p, and V_n has the negative binomial distribution with parameters n and p.
Player A wins n points before B wins m points if and only if Y_{n+m−1} ≥ n, if and only if V_n ≤ m + n − 1. Hence
A_{n,m}(p) = ∑_{k=n}^{n+m−1} C(n + m − 1, k) p^k (1 − p)^{n+m−1−k} = ∑_{j=n}^{n+m−1} C(j − 1, n − 1) p^n (1 − p)^{j−n} (11.4.18)
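The two sums in (11.4.18) can be checked against each other numerically, which also illustrates the equivalence of the binomial and negative binomial formulations (function names ours):

```python
from math import comb

def points_binomial(n: int, m: int, p: float) -> float:
    """P(A wins n points before B wins m): P(Y_{n+m-1} >= n), essentially Pascal's solution."""
    t = n + m - 1
    return sum(comb(t, k) * p**k * (1 - p)**(t - k) for k in range(n, t + 1))

def points_neg_binomial(n: int, m: int, p: float) -> float:
    """The same probability via the negative binomial: P(V_n <= n + m - 1)."""
    return sum(comb(j - 1, n - 1) * p**n * (1 - p)**(j - n) for j in range(n, n + m))

a1 = points_binomial(10, 5, 0.4)
a2 = points_neg_binomial(10, 5, 0.4)
```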
In the problem of points experiments, vary the parameters n , m, and p, and note how the probability changes. For selected
values of the parameters, run the simulation 1000 times and note the apparent convergence of the relative frequency to the
probability.
The win probability function for player A satisfies the following recurrence relation and boundary conditions (this was essentially Fermat's solution):
1. A_{n,m}(p) = p A_{n−1,m}(p) + (1 − p) A_{n,m−1}(p) for n, m ∈ N₊
2. A_{n,0}(p) = 0 and A_{0,m}(p) = 1
Proof
Next let N_{n,m} denote the number of trials needed until either A wins n points or B wins m points, whichever occurs first; this is the length of the problem of points experiment. The following result gives the distribution of N_{n,m}.
Proof
Series of Games
The special case of the problem of points experiment with m = n is important, because it corresponds to A and B playing a best of
2n − 1 game series. That is, the first player to win n games wins the series. Such series, especially when n ∈ {2, 3, 4}, are common in sports.
Let A_n(p) denote the probability that player A wins the series. Then
A_n(p) = ∑_{k=n}^{2n−1} C(2n − 1, k) p^k (1 − p)^{2n−1−k} = ∑_{j=n}^{2n−1} C(j − 1, n − 1) p^n (1 − p)^{j−n} (11.4.20)
Proof
Suppose that p = 0.6 . Explicitly find the probability that team A wins in each of the following cases:
1. A best of 5 game series.
2. A best of 7 game series.
Answer
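The series probabilities above are binomial tail sums; a quick check (function name ours):

```python
from math import comb

def series_win_prob(n: int, p: float) -> float:
    """P(player A wins a best of 2n-1 series) = P(Binomial(2n-1, p) >= n)."""
    t = 2 * n - 1
    return sum(comb(t, k) * p**k * (1 - p)**(t - k) for k in range(n, t + 1))

best_of_5 = series_win_prob(3, 0.6)   # about 0.683
best_of_7 = series_win_prob(4, 0.6)   # about 0.710
```

A longer series favors the stronger team, as the comparison of the two values shows.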
In the problem of points experiments, vary the parameters n , m, and p (keeping n = m ), and note how the probability
changes. Now simulate a best of 5 series by selecting n = m = 3 , p = 0.6 . Run the experiment 1000 times and compare the
relative frequency to the true probability.
1. A_n(1 − p) = 1 − A_n(p) for p ∈ [0, 1].
2. A_n(1/2) = 1/2.
Proof
In the problem of points experiments, vary the parameters n , m, and p (keeping n = m ), and note how the probability
changes. Now simulate a best 7 series by selecting n = m = 4 , p = 0.45. Run the experiment 1000 times and compare the
relative frequency to the true probability.
If n > m then
1. A_n(p) < A_m(p) if 0 < p < 1/2
2. A_n(p) > A_m(p) if 1/2 < p < 1
Proof
Let N_n denote the number of trials in the series. Then N_n has probability density function
P(N_n = k) = C(k − 1, n − 1) [p^n (1 − p)^{k−n} + (1 − p)^n p^{k−n}], k ∈ {n, n + 1, …, 2n − 1}
Proof
Explicitly compute the probability density function, expected value, and standard deviation for the number of games in a best
of 7 series with the following values of p:
1. 0.5
2. 0.7
3. 0.9
Answer
Division of Stakes
The problem of points originated from a question posed by Chevalier de Mere, who was interested in the fair division of stakes
when a game is interrupted. Specifically, suppose that players A and B each put up c monetary units, and then play Bernoulli trials
until one of them wins a specified number of trials. The winner then takes the entire 2c fortune.
If the game is interrupted when A needs to win n more trials and B needs to win m more trials, then the fortune should be
divided between A and B , respectively, as follows:
1. 2c A_{n,m}(p) for A
2. 2c [1 − A_{n,m}(p)] = 2c A_{m,n}(1 − p) for B.
Suppose that players A and B bet $50 each. The players toss a fair coin until one of them has 10 wins; the winner takes the
entire fortune. Suppose that the game is interrupted by the gambling police when A has 5 wins and B has 3 wins. How should
the stakes be divided?
Answer
binomial distribution with parameters k and p as we studied above. The random variable W_k = V_k − k is the number of failures before the k th success. Let N_1 = W_1, the number of failures before the first success, and let N_k = W_k − W_{k−1}, the number of failures between the (k − 1)st success and the k th success, for k ∈ {2, 3, …}.
\(N = (N_1, N_2, \ldots)\) is a sequence of independent random variables, each having the geometric distribution on \(\mathbb{N}\) with
parameter \(p\). Moreover,
\[ W_k = \sum_{i=1}^k N_i \quad (11.4.22) \]
Thus, W = (W1 , W2 , …) is the partial sum process associated with N . In particular, W has stationary, independent increments.
\[ \mathbb{P}(W_k = n) = \binom{n+k-1}{k-1} p^k (1-p)^n = \binom{n+k-1}{n} p^k (1-p)^n, \quad n \in \mathbb{N} \quad (11.4.23) \]
Proof
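As a numerical sanity check of this density (using arbitrary illustrative parameters, not values from the text), a short sketch with standard-library tools:

```python
from math import comb

def nb_pmf(n, k, p):
    """P(W_k = n): probability of n failures before the k-th success."""
    return comb(n + k - 1, k - 1) * p**k * (1 - p)**n

k, p = 5, 0.3  # arbitrary illustrative parameters

# The two binomial-coefficient forms in the statement agree.
assert all(comb(n + k - 1, k - 1) == comb(n + k - 1, n) for n in range(100))

# The probabilities sum to 1 (truncating far into the geometric tail).
total = sum(nb_pmf(n, k, p) for n in range(2000))
```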
The distribution of \(W_k\) is also referred to as the negative binomial distribution with parameters \(k\) and \(p\). Thus, the term negative
binomial distribution can refer either to the distribution of the trial number of the \(k\)th success or the distribution of the number of
failures before the \(k\)th success, depending on the author and the context. The two random variables differ by a constant, so it's not a
particularly important issue as long as we know which version is intended. In this text, we will refer to the alternate version as the
negative binomial distribution on \(\mathbb{N}\), to distinguish it from the original version, which has support set \(\{k, k+1, \ldots\}\).
More interestingly, however, the probability density function in the last result makes sense for any k ∈ (0, ∞) , not just integers. To
see this, first recall the definition of the general binomial coefficient: if \(a \in \mathbb{R}\) and \(n \in \mathbb{N}\), we define
\[ \binom{a}{n} = \frac{a^{(n)}}{n!} = \frac{a(a-1)\cdots(a-n+1)}{n!} \quad (11.4.24) \]
The function f given below defines a probability density function for every p ∈ (0, 1) and k ∈ (0, ∞) :
\[ f(n) = \binom{n+k-1}{n} p^k (1-p)^n, \quad n \in \mathbb{N} \quad (11.4.25) \]
Proof
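A minimal sketch checking that \(f\) really is a probability density function for a non-integer value of \(k\) (the parameter values below are arbitrary):

```python
from math import prod, factorial

def gen_binom(a, n):
    """Generalized binomial coefficient: a(a-1)...(a-n+1)/n! for real a."""
    return prod(a - i for i in range(n)) / factorial(n)

def nb_pmf(n, k, p):
    """f(n) = C(n+k-1, n) p^k (1-p)^n, valid for any real k > 0."""
    return gen_binom(n + k - 1, n) * p**k * (1 - p)**n

k, p = 2.5, 0.4  # non-integer k, arbitrary choice
# Truncated at n = 150; the geometric factor makes the remaining tail negligible.
total = sum(nb_pmf(n, k, p) for n in range(150))
```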
Once again, the distribution defined by the probability density function in the last theorem is the negative binomial distribution on
\(\mathbb{N}\), with parameters \(k\) and \(p\). The special case when \(k\) is a positive integer is sometimes referred to as the Pascal distribution, in honor of Blaise Pascal.
Basic Properties
Suppose that W has the negative binomial distribution on N with parameters k ∈ (0, ∞) and p ∈ (0, 1). To establish basic
properties, we can no longer use the decomposition of W as a sum of independent geometric variables. Instead, the best approach
is to derive the probability generating function and then use the generating function to obtain other basic properties.
Proof
The moments of \(W\) can be obtained from the derivatives of the probability generating function.

1. \(\mathbb{E}(W) = k \frac{1-p}{p}\)
2. \(\operatorname{var}(W) = k \frac{1-p}{p^2}\)
3. \(\operatorname{skew}(W) = \frac{2-p}{\sqrt{k(1-p)}}\)
4. \(\operatorname{kurt}(W) = \frac{3(k+2)(1-p) + p^2}{k(1-p)}\)
Proof
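The four formulas can be verified numerically against the density; computing the pmf by the ratio recurrence \(f(n) = f(n-1)\,(n+k-1)(1-p)/n\) avoids overflow for large \(n\). Parameter values are arbitrary:

```python
from math import sqrt

def nb_pmf_list(k, p, nmax):
    """f(0..nmax) for the negative binomial density on N via the ratio
    f(n)/f(n-1) = (n+k-1)(1-p)/n, starting from f(0) = p^k."""
    f = [p ** k]
    for n in range(1, nmax + 1):
        f.append(f[-1] * (n + k - 1) * (1 - p) / n)
    return f

k, p = 4.0, 0.35  # arbitrary illustrative parameters
pmf = nb_pmf_list(k, p, 500)  # truncation far into the tail
mean = sum(n * q for n, q in enumerate(pmf))
var = sum((n - mean) ** 2 * q for n, q in enumerate(pmf))
skew = sum(((n - mean) / sqrt(var)) ** 3 * q for n, q in enumerate(pmf))
kurt = sum(((n - mean) / sqrt(var)) ** 4 * q for n, q in enumerate(pmf))
```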
Suppose that V has the negative binomial distribution on N with parameters a ∈ (0, ∞) and p ∈ (0, 1), and that W has the
negative binomial distribution on N with parameters b ∈ (0, ∞) and p ∈ (0, 1), and that V and W are independent. Then
\(V + W\) has the negative binomial distribution on \(\mathbb{N}\) with parameters \(a+b\) and \(p\).
Proof
In the last result, note that the success parameter p must be the same for both variables.
Normal Approximation
Because of the decomposition of W when the parameter k is a positive integer, it's not surprising that a central limit theorem holds
for the general negative binomial distribution.
Suppose that \(W\) has the negative binomial distribution with parameters \(k \in (0, \infty)\) and \(p \in (0, 1)\). The standard score of \(W\) is
\[ Z = \frac{pW - k(1-p)}{\sqrt{k(1-p)}} \quad (11.4.29) \]
The distribution of \(Z\) converges to the standard normal distribution as \(k \to \infty\).

Thus, if \(k\) is large (and not necessarily an integer), then the distribution of \(W\) is approximately normal with mean \(k \frac{1-p}{p}\) and variance \(k \frac{1-p}{p^2}\).
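A sketch of the approximation in use, comparing the exact probability of an interval with the continuity-corrected normal value; the parameters \(k = 200\), \(p = 1/2\) and the interval are arbitrary choices:

```python
from math import comb, erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def nb_pmf(n, k, p):
    return comb(n + k - 1, n) * p ** k * (1 - p) ** n

k, p = 200, 0.5
mean, sd = k * (1 - p) / p, sqrt(k * (1 - p)) / p
lo, hi = 180, 220

exact = sum(nb_pmf(n, k, p) for n in range(lo, hi + 1))
# Normal approximation with continuity correction.
approx = norm_cdf((hi + 0.5 - mean) / sd) - norm_cdf((lo - 0.5 - mean) / sd)
```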
Special Families
The negative binomial distribution on \(\mathbb{N}\) belongs to several special families of distributions. First, it follows from the result above
on sums that we can decompose a negative binomial variable on N into the sum of an arbitrary number of independent, identically
distributed variables. This special property is known as infinite divisibility, and is studied in more detail in the chapter on Special
Distributions.
A Poisson-distributed random sum of independent, identically distributed random variables is said to have a compound Poisson
distribution; these distributions are studied in more detail in the chapter on the Poisson Process. A theorem of William Feller states
that an infinitely divisible distribution on \(\mathbb{N}\) must be compound Poisson. Hence it follows from the previous result that the negative
binomial distribution on \(\mathbb{N}\) belongs to this family. Here is the explicit result:
Let \(k \in (0, \infty)\) and \(p \in (0, 1)\). Suppose that \(X = (X_1, X_2, \ldots)\) is a sequence of independent variables, each having the logarithmic series
distribution with shape parameter \(1-p\). Suppose also that \(N\) is independent of \(X\) and has the Poisson distribution with
parameter \(-k \ln(p)\). Then \(W = \sum_{i=1}^N X_i\) has the negative binomial distribution on \(\mathbb{N}\) with parameters \(k\) and \(p\).
Proof
As a special case (k = 1 ), it follows that the geometric distribution on N is infinitely divisible and compound Poisson.
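The theorem can be checked through probability generating functions. The logarithmic series distribution with shape parameter \(1-p\) has pgf \(t \mapsto \ln[1-(1-p)t]/\ln(p)\), a compound Poisson sum has pgf \(\exp\{\lambda[P_X(t)-1]\}\), and with \(\lambda = -k\ln(p)\) this should collapse to the negative binomial pgf \([p/(1-(1-p)t)]^k\). A quick numerical confirmation (arbitrary parameters):

```python
from math import log, exp

k, p = 3.7, 0.6           # arbitrary parameters; k need not be an integer
lam = -k * log(p)         # Poisson parameter from the theorem

def log_series_pgf(t):
    """pgf of the logarithmic series distribution with shape parameter 1 - p."""
    return log(1 - (1 - p) * t) / log(p)

def nb_pgf(t):
    """pgf of the negative binomial distribution on N with parameters k, p."""
    return (p / (1 - (1 - p) * t)) ** k

# The compound Poisson pgf matches the negative binomial pgf pointwise.
for t in [0.0, 0.25, 0.5, 0.75, 0.99]:
    compound = exp(lam * (log_series_pgf(t) - 1))
    assert abs(compound - nb_pgf(t)) < 1e-9
```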
Next, the negative binomial distribution on N belongs to the general exponential family. This family is important in inferential
statistics and is studied in more detail in the chapter on Special Distributions.
Suppose that \(W\) has the negative binomial distribution on \(\mathbb{N}\) with parameters \(k \in (0, \infty)\) and \(p \in (0, 1)\). For fixed \(k\), \(W\) has a
one-parameter exponential family distribution with natural statistic \(W\) and natural parameter \(\ln(1-p)\).
Proof
Finally, the negative binomial distribution on \(\mathbb{N}\) is a power series distribution. Many special discrete distributions belong to this
family, which is studied in more detail in the chapter on Special Distributions.
For fixed \(k \in (0, \infty)\), the negative binomial distribution on \(\mathbb{N}\) with parameters \(k\) and \(p \in (0, 1)\) is a power series distribution
corresponding to the function \(g(\theta) = 1/(1-\theta)^k\) for \(\theta \in (0, 1)\), where \(\theta = 1-p\).
Proof
Computational Exercises
Suppose that \(W\) has the negative binomial distribution on \(\mathbb{N}\) with parameters \(k\) and \(p = \frac{3}{4}\). Compute each of the following:
1. P(W = 3)
2. E(W )
3. var(W )
Answer
Suppose that \(W\) has the negative binomial distribution on \(\mathbb{N}\) with parameters \(k\) and \(p = \frac{1}{4}\). Compute each of the following:
1. P(W ≤ 2)
2. E(W )
3. var(W )
Answer
Suppose that \(W\) has the negative binomial distribution with parameters \(k = 10\pi\) and \(p = \frac{1}{3}\). Compute each of the following:
1. E(W )
2. var(W )
3. The normal approximation to P(50 ≤ W ≤ 70)
Answer
This page titled 11.4: The Negative Binomial Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
11.5: The Multinomial Distribution
Basic Theory
Multinomial trials
A multinomial trials process is a sequence of independent, identically distributed random variables \(X = (X_1, X_2, \ldots)\), each taking
\(k\) possible values. Thus, the multinomial trials process is a simple generalization of the Bernoulli trials process (which corresponds
to k = 2 ). For simplicity, we will denote the set of outcomes by {1, 2, … , k}, and we will denote the common probability density
function of the trial variables by \(p_i = \mathbb{P}(X_j = i)\) for \(i \in \{1, 2, \ldots, k\}\). Of course \(p_i > 0\) for each \(i\) and
\(\sum_{i=1}^k p_i = 1\). In statistical terms, the sequence \(X\) is formed by sampling from the distribution.
As with our discussion of the binomial distribution, we are interested in the random variables that count the number of times each
outcome occurred. Thus, let
\[ Y_i = \#\{j \in \{1, 2, \ldots, n\} : X_j = i\} = \sum_{j=1}^n \mathbf{1}(X_j = i), \quad i \in \{1, 2, \ldots, k\} \]
Of course, these random variables also depend on the parameter \(n\) (the number of trials), but this parameter is fixed in our
discussion so we suppress it to keep the notation simple. Note that \(\sum_{i=1}^k Y_i = n\), so if we know the values of \(k-1\) of the counting
variables, we can find the value of the remaining one. Recall that for \(j_1, j_2, \ldots, j_k \in \mathbb{N}\) with \(\sum_{i=1}^k j_i = n\), the multinomial coefficient is
\[ \binom{n}{j_1, j_2, \ldots, j_k} = \frac{n!}{j_1! \, j_2! \cdots j_k!} \quad (11.5.3) \]
Joint Distribution
For \(j_1, j_2, \ldots, j_k \in \mathbb{N}\) with \(\sum_{i=1}^k j_i = n\),
\[ \mathbb{P}(Y_1 = j_1, Y_2 = j_2, \ldots, Y_k = j_k) = \binom{n}{j_1, j_2, \ldots, j_k} p_1^{j_1} p_2^{j_2} \cdots p_k^{j_k} \quad (11.5.4) \]
Proof
The distribution of \((Y_1, Y_2, \ldots, Y_k)\) is called the multinomial distribution with parameters \(n\) and \((p_1, p_2, \ldots, p_k)\). We
also say that \((Y_1, Y_2, \ldots, Y_{k-1})\) has this distribution (recall that the values of \(k-1\) of the counting variables determine the value
of the remaining variable). Usually, it is clear from context which meaning of the term multinomial distribution is intended. Again,
the ordinary binomial distribution corresponds to k = 2 .
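A small sketch of the joint density (with arbitrary parameters), checking that the probabilities sum to 1 and that a marginal count matches the binomial form given below:

```python
from math import factorial, prod
from itertools import product

def multinomial_pmf(js, ps):
    """P(Y_1 = j_1, ..., Y_k = j_k) for a multinomial trials process."""
    n = sum(js)
    coeff = factorial(n) // prod(factorial(j) for j in js)
    return coeff * prod(q ** j for j, q in zip(js, ps))

n, ps = 5, (0.2, 0.3, 0.5)  # arbitrary parameters
outcomes = [js for js in product(range(n + 1), repeat=len(ps)) if sum(js) == n]

total = sum(multinomial_pmf(js, ps) for js in outcomes)
# Marginal P(Y_1 = 2): should equal C(5,2) * 0.2^2 * 0.8^3 = 0.2048.
marg = sum(multinomial_pmf(js, ps) for js in outcomes if js[0] == 2)
```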
Marginal Distributions
For each \(i \in \{1, 2, \ldots, k\}\), \(Y_i\) has the binomial distribution with parameters \(n\) and \(p_i\):
\[ \mathbb{P}(Y_i = j) = \binom{n}{j} p_i^j (1-p_i)^{n-j}, \quad j \in \{0, 1, \ldots, n\} \quad (11.5.5) \]
Proof
Grouping
The multinomial distribution is preserved when the counting variables are combined. Specifically, suppose that (A 1, A2 , … , Am )
is a partition of the index set {1, 2, … , k} into nonempty subsets. For j ∈ {1, 2, … , m} let
\[ Z_j = \sum_{i \in A_j} Y_i, \qquad q_j = \sum_{i \in A_j} p_i \quad (11.5.6) \]
Then \((Z_1, Z_2, \ldots, Z_m)\) has the multinomial distribution with parameters \(n\) and \((q_1, q_2, \ldots, q_m)\).
Conditional Distribution
The multinomial distribution is also preserved when some of the counting variables are observed. Specifically, suppose that (A, B)
is a partition of the index set \(\{1, 2, \ldots, k\}\) into nonempty subsets. Suppose that \((j_i : i \in B)\) is a sequence of nonnegative integers
with \(j = \sum_{i \in B} j_i \le n\). Then the conditional distribution of \((Y_i : i \in A)\) given \((Y_i = j_i : i \in B)\) is multinomial with parameters \(n - j\) and \((p_i / q : i \in A)\), where \(q = \sum_{i \in A} p_i\).
Combinations of the basic results involving grouping and conditioning can be used to compute any marginal or conditional
distributions.
Moments
We will compute the mean and variance of each counting variable, and the covariance and correlation of each pair of variables.
1. \(\mathbb{E}(Y_i) = n p_i\)
2. \(\operatorname{var}(Y_i) = n p_i (1 - p_i)\)
Proof
For distinct \(i, j \in \{1, 2, \ldots, k\}\),
1. \(\operatorname{cov}(Y_i, Y_j) = -n p_i p_j\)
2. \(\operatorname{cor}(Y_i, Y_j) = -\sqrt{p_i p_j / [(1-p_i)(1-p_j)]}\)

Proof
From the last result, note that the number of times outcome i occurs and the number of times outcome j occurs are negatively
correlated, but the correlation does not depend on n .
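The moment results can be confirmed by brute-force enumeration of all \(k^n\) trial sequences for a small multinomial process (arbitrary small parameters):

```python
from itertools import product

n, ps = 4, (0.2, 0.3, 0.5)  # arbitrary small parameters
k = len(ps)

ey = [0.0] * k      # E(Y_i)
e12 = 0.0           # E(Y_1 Y_2)
for seq in product(range(k), repeat=n):
    prob = 1.0
    for x in seq:
        prob *= ps[x]
    counts = [seq.count(i) for i in range(k)]
    for i in range(k):
        ey[i] += counts[i] * prob
    e12 += counts[0] * counts[1] * prob

cov = e12 - ey[0] * ey[1]   # should equal -n * p_1 * p_2
```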
If k = 2 , then the number of times outcome 1 occurs and the number of times outcome 2 occurs are perfectly correlated.
Proof
Suppose that we throw 10 standard, fair dice. Find the probability of each of the following events:
1. Scores 1 and 6 occur once each and the other scores occur twice each.
2. Scores 2 and 4 occur 3 times each.
3. There are 4 even scores and 6 odd scores.
4. Scores 1 and 3 occur twice each given that score 2 occurs once and score 5 three times.
Answer
Suppose that we roll 4 ace-six flat dice (faces 1 and 6 have probability \(\frac{1}{4}\) each; faces 2, 3, 4, and 5 have probability \(\frac{1}{8}\) each).
Answer
In the dice experiment, select 4 ace-six flats. Run the experiment 500 times and compute the joint relative frequency function
of the number times each score occurs. Compare the relative frequency function to the true probability density function.
Suppose that we roll 20 ace-six flat dice. Find the covariance and correlation of the number of 1's and the number of 2's.
Answer
In the dice experiment, select 20 ace-six flat dice. Run the experiment 500 times, updating after each run. Compute the
empirical covariance and correlation of the number of 1's and the number of 2's. Compare the results with the theoretical
results computed previously.
This page titled 11.5: The Multinomial Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
11.6: The Simple Random Walk
The simple random walk process is a minor modification of the Bernoulli trials process. Nonetheless, the process has a number of
very interesting properties, and so deserves a section of its own. In some respects, it's a discrete time analogue of the Brownian
motion process.
Let \(U = (U_1, U_2, \ldots)\) be a sequence of independent random variables, each taking the values 1 and \(-1\) with probabilities
\(p \in [0, 1]\) and \(1-p\) respectively. Let \(X = (X_0, X_1, X_2, \ldots)\) be the partial sum process associated with \(U\), so that
\[ X_n = \sum_{i=1}^n U_i, \quad n \in \mathbb{N} \quad (11.6.1) \]
We imagine a person or a particle on an axis, so that at each discrete time step, the walker moves either one unit to the right (with
probability p) or one unit to the left (with probability 1 − p ), independently from step to step. The walker could accomplish this by
tossing a coin with probability of heads p at each step, to determine whether to move right or move left. Other types of random
walks, and additional properties of this random walk, are studied in the chapter on Markov Chains.
Let \(I_j = \frac{1}{2}(U_j + 1)\) for \(j \in \mathbb{N}_+\). Then \(I = (I_1, I_2, \ldots)\) is a Bernoulli trials sequence with success parameter \(p\).
Proof
In terms of the random walker, \(I_j\) is the indicator variable for the event that the \(j\)th step is to the right.
Let \(R_n = \sum_{i=1}^n I_i\) for \(n \in \mathbb{N}\), so that \(R = (R_0, R_1, \ldots)\) is the partial sum process associated with \(I\). Then
1. \(X_n = 2R_n - n\) for \(n \in \mathbb{N}\).
2. \(R_n\) has the binomial distribution with trial parameter \(n\) and success parameter \(p\).

In terms of the walker, \(R_n\) is the number of steps to the right in the first \(n\) steps.
Proof
1. \(\mathbb{E}(X_n) = n(2p-1)\)
2. \(\operatorname{var}(X_n) = 4np(1-p)\)
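Since \(X_n = 2R_n - n\) with \(R_n\) binomial, these moments can be verified exactly from the binomial distribution; the values of \(n\) and \(p\) below are arbitrary:

```python
from math import comb

n, p = 12, 0.35  # arbitrary parameters
# X_n = 2 R_n - n with R_n ~ binomial(n, p), so the exact distribution of X_n
# can be read off from the binomial distribution.
pmf = {2 * r - n: comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1)}

mean = sum(x * q for x, q in pmf.items())
var = sum((x - mean) ** 2 * q for x, q in pmf.items())
```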
Suppose now that \(p = \frac{1}{2}\), so that \(X\) is the simple, symmetric random walk. The symmetric random
walk can be analyzed using some special and clever combinatorial arguments. But first we give the basic results above for this
special case.
With \(U_n = (U_1, U_2, \ldots, U_n)\) denoting the sequence of the first \(n\) steps, \(U_n\) is uniformly distributed on \(S = \{-1, 1\}^n\):
\[ \mathbb{P}(U_n \in A) = \frac{\#(A)}{2^n}, \quad A \subseteq S \quad (11.6.3) \]
1. \(\mathbb{E}(X_n) = 0\)
2. \(\operatorname{var}(X_n) = n\)
In the random walk simulation, select the final position. Vary the number of steps and note the shape and location of the
probability density function and the mean±standard deviation bar. For selected values of the parameter, run the simulation
1000 times and compare the empirical density function and moments to the true probability density function and moments.
In the random walk simulation, select the final position and set the number of steps to 50. Run the simulation 1000 times and
compute and compare the following:
1. \(\mathbb{P}(-6 \le X_{50} \le 10)\)
Answer
Let \(Y_n\) denote the maximum position of the walker during the first \(n\) steps. Note that \(Y_n\) takes values in the set \(\{0, 1, \ldots, n\}\). The distribution of \(Y_n\) can be derived from a simple and wonderful idea
known as the reflection principle.
Proof
In the random walk simulation, select the maximum value variable. Vary the number of steps and note the shape and location of
the probability density function and the mean/standard deviation bar. Now set the number of steps to 30 and run the simulation
1000 times. Compare the relative frequency function and empirical moments to the true probability density function and
moments.
The last result is a bit surprising; in particular, the single most likely value for the maximum (and hence the mode of the
distribution) is 0.
Explicitly compute the probability density function, mean, and standard deviation of \(Y_5\).
Answer
A fair coin is tossed 10 times. Find the probability that the difference between the number of heads and the number of tails is
never greater than 4.
Answer
The Last Visit to 0
Consider again the simple, symmetric random walk. Our next topic is the last visit to 0 during the first \(2n\) steps:
\[ Z_{2n} = \max\{j \in \{0, 2, \ldots, 2n\} : X_j = 0\} \]
Note that since visits to 0 can only occur at even times, \(Z_{2n}\) takes values in the set \(\{0, 2, \ldots, 2n\}\). This random variable has a
strange and interesting distribution known as the discrete arcsine distribution. Along the way to our derivation, we will discover
some other interesting results as well.
Proof
In the random walk simulation, choose the last visit to 0 and then vary the number of steps with the scroll bar. Note the shape
and location of the probability density function and the mean/standard deviation bar. For various values of the parameter, run
the simulation 1000 times and compare the empirical density function and moments to the true probability density function and
moments.
In particular, 0 and 2n are the most likely values and hence are the modes of the distribution. The discrete arcsine distribution is
quite surprising. Since we are tossing a fair coin to determine the steps of the walker, you might easily think that the random walk
should be positive half of the time and negative half of the time, and that it should return to 0 frequently. But in fact, the arcsine law
implies that with probability \(\frac{1}{2}\), there will be no return to 0 during the second half of the walk, from time \(n+1\) to \(2n\), regardless of
\(n\), and it is not uncommon for the walk to stay positive (or negative) during the entire time from 1 to \(2n\).
Answer
Comment on the validity of the assumption that the voters are randomly ordered for a real election.
The ballot problem can be solved by using a simple conditional probability argument to obtain a recurrence relation. Let \(f(a, b)\)
denote the probability that candidate \(A\), with \(a\) votes, is always ahead of candidate \(B\), with \(b\) votes, in the running count. Then
\(f\) satisfies the initial condition \(f(1, 0) = 1\) and the following recurrence relation:
\[ f(a, b) = \frac{a}{a+b} f(a-1, b) + \frac{b}{a+b} f(a, b-1) \quad (11.6.17) \]
Proof
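The recurrence can be computed by straightforward dynamic programming. With the natural boundary conditions (\(f(a, 0) = 1\), and \(f(a, b) = 0\) whenever \(a \le b\), since the count ends tied or behind), it reproduces the classical ballot formula \((a-b)/(a+b)\):

```python
def ballot_prob(a, b):
    """f(a, b) computed iteratively from the recurrence above."""
    f = [[0.0] * (b + 1) for _ in range(a + 1)]
    for i in range(1, a + 1):
        f[i][0] = 1.0                  # with no votes for B, A is always ahead
        for j in range(1, b + 1):
            if i > j:                  # otherwise A cannot finish ahead: f = 0
                f[i][j] = i / (i + j) * f[i - 1][j] + j / (i + j) * f[i][j - 1]
    return f[a][b]

# The recurrence agrees with the classical ballot formula (a - b)/(a + b).
for a, b in [(3, 1), (5, 2), (10, 4), (25, 13)]:
    assert abs(ballot_prob(a, b) - (a - b) / (a + b)) < 1e-12
```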
Proof
In the ballot experiment, vary the parameters a and b and note how the ballot probability changes. For selected values of the
parameters, run the experiment 1000 times and compare the relative frequency to the true probability.
In an election for mayor of a small town, Mr. Smith received 4352 votes while Ms. Jones received 7543 votes. Compute the
probability that Jones was always ahead of Smith in the vote count.
Answer
Given \(X_n = k\),
1. There are \(\frac{n+k}{2}\) steps to the right and \(\frac{n-k}{2}\) steps to the left.
2. All possible orderings of the steps to the right and the steps to the left are equally likely.
For \(k > 0\),
\[ \mathbb{P}(X_1 > 0, X_2 > 0, \ldots, X_{n-1} > 0 \mid X_n = k) = \frac{k}{n} \quad (11.6.19) \]
Proof
An American roulette wheel has 38 slots; 18 are red, 18 are black, and 2 are green. Fred bet $1 on red, at even stakes, 50 times,
winning 22 times and losing 28 times. Find the probability that Fred's net fortune was always negative.
Answer
Let \(T\) denote the time of the first return to 0. Note that returns to 0 can only occur at even times; it may also be possible that the random walk never returns to 0. Thus, \(T\) takes
values in the set \(\{2, 4, \ldots\} \cup \{\infty\}\).
Proof
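For small time horizons, the distribution of \(T\) can be found by brute force, enumerating all \(2^m\) step sequences and recording the first time the walk revisits 0:

```python
from itertools import product

def first_return_probs(m):
    """P(T = t) for t = 2, 4, ..., m (symmetric walk), by full enumeration."""
    count = {t: 0 for t in range(2, m + 1, 2)}
    for steps in product((-1, 1), repeat=m):
        pos = 0
        for i, s in enumerate(steps, start=1):
            pos += s
            if pos == 0:       # first return to the origin
                count[i] += 1
                break
    total = 2 ** m
    return {t: c / total for t, c in count.items()}

probs = first_return_probs(10)   # e.g. probs[2] = 1/2, probs[4] = 1/8
```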
Fred and Wilma are tossing a fair coin; Fred gets a point for each head and Wilma gets a point for each tail. Find the probability
that their scores are equal for the first time after n tosses, for each n ∈ {2, 4, 6, 8, 10}.
Answer
This page titled 11.6: The Simple Random Walk is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon
request.
11.7: The Beta-Bernoulli Process
An interesting thing to do in almost any parametric probability model is to “randomize” one or more of the parameters. Done in a
clever way, this often leads to interesting new models and unexpected connections between models. In this section we will
randomize the success parameter in the Bernoulli trials process. This leads to interesting and surprising connections with Pólya's
urn process.
Basic Theory
Definitions
First, recall that the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) is a continuous distribution on
the interval (0, 1) with probability density function g given by
\[ g(p) = \frac{1}{B(a, b)} p^{a-1} (1-p)^{b-1}, \quad p \in (0, 1) \quad (11.7.1) \]
where \(B\) is the beta function. So \(B(a, b)\) is simply the normalizing constant for the function \(p \mapsto p^{a-1}(1-p)^{b-1}\) on the interval
\((0, 1)\). Here is our main definition:
Suppose that \(P\) has the beta distribution with left parameter \(a \in (0, \infty)\) and right parameter \(b \in (0, \infty)\). Next suppose that
\(X = (X_1, X_2, \ldots)\) is a sequence of indicator random variables with the property that given \(P = p \in (0, 1)\), \(X\) is a
conditionally independent sequence with
\[ \mathbb{P}(X_i = 1 \mid P = p) = p, \quad i \in \mathbb{N}_+ \quad (11.7.2) \]
In short, given \(P = p\), the sequence \(X\) is a Bernoulli trials sequence with success parameter \(p\). In the usual language of reliability,
\(X_i\) is the outcome of trial \(i\), where 1 denotes success and 0 denotes failure. For a specific application, suppose that we select a
random probability of heads according to the beta distribution with parameters \(a\) and \(b\), and then toss a coin with this
probability of heads repeatedly.
Outcome Variables
What's our first step? Well, of course we need to compute the finite dimensional distributions of \(X\). Recall that for \(r \in \mathbb{R}\) and
\(j \in \mathbb{N}\), \(r^{[j]}\) denotes the ascending power \(r(r+1)\cdots[r+(j-1)]\). By convention, a product over an empty index set is 1, so
\(r^{[0]} = 1\).
Proof
From this result, it follows that Pólya's urn process with parameters \(a, b, c \in \mathbb{N}_+\) is equivalent to the beta-Bernoulli process with
parameters \(a/c\) and \(b/c\), quite an interesting result. Note that since the joint distribution above depends only on
\(x_1 + x_2 + \cdots + x_n\), the sequence \(X\) is exchangeable. Finally, it's interesting to note that the beta-Bernoulli process with
parameters \(a\) and \(b\) could simply be defined as the sequence with the finite-dimensional distributions above, without reference to
the beta distribution! It turns out that every exchangeable sequence of indicator random variables can be obtained by randomizing
the success parameter in a sequence of Bernoulli trials. This is de Finetti's theorem, named for Bruno de Finetti, which is studied in
the section on backwards martingales.
For each \(i \in \mathbb{N}_+\),
1. \(\mathbb{E}(X_i) = \frac{a}{a+b}\)
2. \(\operatorname{var}(X_i) = \frac{ab}{(a+b)^2}\)
Proof
Thus X is a sequence of identically distributed variables, quite surprising at first but of course inevitable for any exchangeable
sequence. Compare the joint distribution with the marginal distributions. Clearly the variables are dependent, so let's compute the
covariance and correlation of a pair of outcome variables.
For distinct \(i, j \in \mathbb{N}_+\),
1. \(\operatorname{cov}(X_i, X_j) = \frac{ab}{(a+b)^2 (a+b+1)}\)
2. \(\operatorname{cor}(X_i, X_j) = \frac{1}{a+b+1}\)
Proof
Thus, the variables are positively correlated. It turns out that in any infinite sequence of exchangeable variables, the variables
must be nonnegatively correlated. Here is another result that explores how the variables are related.
Suppose that \(n \in \mathbb{N}_+\) and \((x_1, x_2, \ldots, x_n) \in \{0, 1\}^n\). Let \(k = \sum_{i=1}^n x_i\). Then
\[ \mathbb{P}(X_{n+1} = 1 \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a+k}{a+b+n} \quad (11.7.7) \]
Proof
The beta-Bernoulli model starts with the conditional distribution of X given P . Let's find the conditional distribution in the other
direction.
Suppose that \(n \in \mathbb{N}_+\) and \((x_1, x_2, \ldots, x_n) \in \{0, 1\}^n\). Let \(k = \sum_{i=1}^n x_i\). Then the conditional distribution of \(P\) given
\((X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)\) is beta with left parameter \(a+k\) and right parameter \(b+(n-k)\). Hence
\[ \mathbb{E}(P \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a+k}{a+b+n} \quad (11.7.8) \]
Proof
Thus, the left parameter increases by the number of successes while the right parameter increases by the number of failures. In the
language of Bayesian statistics, the original distribution of \(P\) is the prior distribution, and the conditional distribution of \(P\) given
the data \((x_1, x_2, \ldots, x_n)\) is the posterior distribution. The fact that the posterior distribution is beta whenever the prior distribution
is beta means that the beta distribution is conjugate to the Bernoulli distribution. The conditional expected value in the last
theorem is the Bayesian estimate of \(p\) when \(p\) is modeled by the random variable \(P\). These concepts are studied in more generality
in the section on Bayes Estimators in the chapter on Point Estimation. It's also interesting to note that the expected values in the last
two theorems are the same: if \(n \in \mathbb{N}\), \((x_1, x_2, \ldots, x_n) \in \{0, 1\}^n\) and \(k = \sum_{i=1}^n x_i\) then
\[ \mathbb{E}(X_{n+1} \mid X_1 = x_1, \ldots, X_n = x_n) = \mathbb{E}(P \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{a+k}{a+b+n} \quad (11.7.12) \]
Run the simulation of the beta coin experiment for various values of the parameter. Note how the posterior probability density
function changes from the prior probability density function, given the number of heads.
For \(n \in \mathbb{N}\), let
\[ Y_n = \sum_{i=1}^n X_i \quad (11.7.13) \]
denote the number of successes in the first \(n\) trials. Of course, \(Y = (Y_0, Y_1, \ldots)\) is the partial sum process associated with
\(X = (X_1, X_2, \ldots)\).
For \(n \in \mathbb{N}\),
\[ \mathbb{P}(Y_n = k) = \binom{n}{k} \frac{a^{[k]} b^{[n-k]}}{(a+b)^{[n]}}, \quad k \in \{0, 1, \ldots, n\} \quad (11.7.14) \]
Proof
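A sketch checking this density numerically, using a helper for ascending powers (arbitrary parameters); note that \(a = b = 1\) reproduces the uniform case discussed below:

```python
from math import comb

def ascending(r, j):
    """Ascending power r^[j] = r (r+1) ... (r+j-1)."""
    out = 1.0
    for i in range(j):
        out *= r + i
    return out

def beta_binomial_pmf(j, n, a, b):
    """P(Y_n = j) for the beta-binomial distribution."""
    return comb(n, j) * ascending(a, j) * ascending(b, n - j) / ascending(a + b, n)

n, a, b = 10, 2.0, 3.0  # arbitrary parameters
total = sum(beta_binomial_pmf(j, n, a, b) for j in range(n + 1))
mean = sum(j * beta_binomial_pmf(j, n, a, b) for j in range(n + 1))
```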
In the simulation of the beta-binomial experiment, vary the parameters and note how the shape of the probability density
function of \(Y_n\) (discrete) parallels the shape of the probability density function of \(P\) (continuous). For various values of the
parameters, run the simulation 1000 times and compare the empirical density function to the probability density function.
If \(a = b = 1\), so that \(P\) is uniformly distributed on \((0, 1)\), then \(Y_n\) is uniformly distributed on \(\{0, 1, \ldots, n\}\).
Proof
1. \(\mathbb{E}(Y_n) = n \frac{a}{a+b}\)
2. \(\operatorname{var}(Y_n) = n \frac{ab}{(a+b)^2} \left[1 + (n-1) \frac{1}{a+b+1}\right]\)
Proof
In the simulation of the beta-binomial experiment, vary the parameters and note the location and size of the mean-standard
deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical moments to the
true moments.
We can restate the conditional distributions in the last subsection more elegantly in terms of \(Y_n\).
Let \(n \in \mathbb{N}\).
1. The conditional distribution of \(X_{n+1}\) given \(Y_n\) is
\[ \mathbb{P}(X_{n+1} = 1 \mid Y_n) = \mathbb{E}(X_{n+1} \mid Y_n) = \frac{a + Y_n}{a+b+n} \quad (11.7.18) \]
2. The conditional distribution of \(P\) given \(Y_n\) is beta with left parameter \(a + Y_n\) and right parameter \(b + (n - Y_n)\). In
particular,
\[ \mathbb{E}(P \mid Y_n) = \frac{a + Y_n}{a+b+n} \quad (11.7.19) \]
Proof
Once again, the conditional expected value \(\mathbb{E}(P \mid Y_n)\) is the Bayesian estimator of \(p\). In particular, if \(a = b = 1\), so that \(P\) has the
uniform distribution on \((0, 1)\), then \(\mathbb{P}(X_{n+1} = 1 \mid Y_n = n) = \frac{n+1}{n+2}\). This is Laplace's rule of succession, another interesting
connection. The rule is named for Pierre Simon Laplace, and is studied from a different point of view in the section on
Independence.
For \(n \in \mathbb{N}_+\), let
\[ M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i \quad (11.7.24) \]
so that \(M_n\) is the sample mean of \((X_1, X_2, \ldots, X_n)\), or equivalently the proportion of successes in the first \(n\) trials. Properties of
\(M_n\) follow easily from the corresponding properties of \(Y_n\). In particular, \(\mathbb{P}(M_n = k/n) = \mathbb{P}(Y_n = k)\) for \(k \in \{0, 1, \ldots, n\}\).
For \(n \in \mathbb{N}_+\), the mean and variance of \(M_n\) are
1. \(\mathbb{E}(M_n) = \frac{a}{a+b}\)
2. \(\operatorname{var}(M_n) = \frac{1}{n} \frac{ab}{(a+b)^2} + \frac{n-1}{n} \frac{ab}{(a+b)^2 (a+b+1)}\)
So \(\mathbb{E}(M_n)\) is constant in \(n \in \mathbb{N}_+\) while \(\operatorname{var}(M_n) \to ab/[(a+b)^2(a+b+1)]\) as \(n \to \infty\). These results suggest that perhaps \(M_n\)
has a limit, in some sense, as \(n \to \infty\). For an ordinary sequence of Bernoulli trials with success parameter \(p \in (0, 1)\), we know
from the law of large numbers that \(M_n \to p\) as \(n \to \infty\) with probability 1 and in mean (and hence also in distribution). What
happens here when the success probability \(P\) has been randomized with the beta distribution? The answer is what we might hope.
It follows from the last theorem that \(\mathbb{E}(P \mid Y_n) \to P\) with probability 1, in mean square, and in distribution. The stochastic process
\(Z = \{Z_n = (a + Y_n)/(a+b+n) : n \in \mathbb{N}\}\) that we have seen several times now is of fundamental importance, and turns out to
be a martingale. The theory of martingales provides powerful tools for studying convergence in the beta-Bernoulli process.
Let \(V_k\) denote the trial number of the \(k\)th success, as before. Then
\[ \mathbb{P}(V_k = n) = \binom{n-1}{k-1} \frac{a^{[k]} b^{[n-k]}}{(a+b)^{[n]}}, \quad n \in \{k, k+1, \ldots\} \quad (11.7.35) \]
Proof 1
Proof 2
The distribution of \(V_k\) is known as the beta-negative binomial distribution with parameters \(k\), \(a\), and \(b\).
Proof
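A numerical sketch of this density, computed in log space via `lgamma` (using \(r^{[j]} = \Gamma(r+j)/\Gamma(r)\)), since the ascending powers overflow ordinary floats for large \(n\); the parameters are arbitrary:

```python
from math import comb, lgamma, exp, log

def bnb_pmf(n, k, a, b):
    """P(V_k = n): trial number of the k-th success, beta-negative binomial."""
    log_p = (log(comb(n - 1, k - 1))
             + lgamma(a + k) - lgamma(a)
             + lgamma(b + n - k) - lgamma(b)
             - (lgamma(a + b + n) - lgamma(a + b)))
    return exp(log_p)

k, a, b = 3, 5.0, 2.0   # arbitrary; a > 1 so that the mean is finite
N = 2000                # truncation point; the tail decays like n**(-a-1)
pmf = [bnb_pmf(n, k, a, b) for n in range(k, N)]
total = sum(pmf)
mean = sum(n * q for n, q in zip(range(k, N), pmf))
```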
In the simulation of the beta-negative binomial experiment, vary the parameters and note the shape of the probability density
function. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the
probability density function.
The mean and variance of \(V_k\) are
1. \(\mathbb{E}(V_k) = k \frac{a+b-1}{a-1}\) if \(a > 1\).
2. \(\operatorname{var}(V_k) = k \frac{a+b-1}{(a-1)(a-2)} \left[b + k(a+b-2)\right] - k^2 \left(\frac{a+b-1}{a-1}\right)^2\) if \(a > 2\).
Proof
In the simulation of the beta-negative binomial experiment, vary the parameters and note the location and size of the mean
± standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical moments to the true moments.
This page titled 11.7: The Beta-Bernoulli Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
CHAPTER OVERVIEW
12: Finite Sampling Models
This chapter explores a number of models and problems based on sampling from a finite population. Sampling without replacement
from a population of objects of various types leads to the hypergeometric and multivariate hypergeometric models. Sampling with
replacement from a finite population leads naturally to the birthday and coupon-collector problems. Sampling without replacement
from an ordered population leads naturally to the matching problem and to the study of order statistics.
12.1: Introduction to Finite Sampling Models
12.2: The Hypergeometric Distribution
12.3: The Multivariate Hypergeometric Distribution
12.4: Order Statistics
12.5: The Matching Problem
12.6: The Birthday Problem
12.7: The Coupon Collector Problem
12.8: Pólya's Urn Process
12.9: The Secretary Problem
This page titled 12: Finite Sampling Models is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
12.1: Introduction to Finite Sampling Models
Basic Theory
Sampling Models
Suppose that we have a population D of m objects. The population could be a deck of cards, a set of people, an urn full of balls, or
any number of other collections. In many cases, we simply label the objects from 1 to m, so that D = {1, 2, … , m}. In other cases
(such as the card experiment), it may be more natural to label the objects with vectors. In any case, \(D\) is usually a finite subset of
\(\mathbb{R}^k\) for some \(k \in \mathbb{N}_+\).
Our basic experiment consists of selecting \(n\) objects from the population \(D\) at random and recording the sequence of objects
chosen. Thus, the outcome is \(X = (X_1, X_2, \ldots, X_n)\) where \(X_i \in D\) is the \(i\)th object chosen. If the sampling is with replacement,
the sample size \(n\) can be any positive integer. In this case, the sample space \(S\) is
\[ S = D^n = \{(x_1, x_2, \ldots, x_n) : x_i \in D \text{ for each } i\} \quad (12.1.1) \]
If the sampling is without replacement, the sample size \(n\) can be no larger than the population size \(m\). In this case, the sample
space \(S\) consists of all permutations of size \(n\) chosen from \(D\). Recall the cardinalities of the two sample spaces:
1. \(\#(D^n) = m^n\)
2. \(\#(D_n) = m^{(n)} = m(m-1)\cdots(m-n+1)\)
With either type of sampling, we assume that the samples are equally likely and thus that the outcome variable \(X\) is uniformly
distributed on the appropriate sample space \(S\); this is the meaning of the phrase random sample:
\[ \mathbb{P}(X \in A) = \frac{\#(A)}{\#(S)}, \quad A \subseteq S \quad (12.1.3) \]
Any permutation of X has the same distribution as X itself, namely the uniform distribution on the appropriate sample space
S:
A sequence of random variables with this property is said to be exchangeable. Although this property is very simple to understand,
both intuitively and mathematically, it is nonetheless very important. We will use the exchangeable property often in this chapter.
More generally, any sequence of \(k\) of the \(n\) outcome variables is uniformly distributed on the appropriate sample space:
1. \(D^k\) if the sampling is with replacement.
2. \(D_k\), the set of permutations of size \(k\) chosen from \(D\), if the sampling is without replacement.
In particular, for either sampling method, \(X_i\) is uniformly distributed on \(D\) for each \(i \in \{1, 2, \ldots, n\}\).
Thus, when the sampling is with replacement, the sample variables form a random sample from the uniform distribution, in
statistical terminology.
If the sampling is without replacement, then the conditional distribution of a sequence of k of the outcome variables, given the
values of a sequence of j other outcome variables, is the uniform distribution on the set of permutations of size k chosen from
the population when the j known values are removed (of course, j + k ≤ n ).
In particular, \(X_i\) and \(X_j\) are dependent for any distinct \(i\) and \(j\) when the sampling is without replacement.
W = {X₁, X₂, …, Xₙ}    (12.1.4)

The random set W takes values in the set of combinations of size n chosen from D; this set has C(m, n) elements.
Proof
Suppose now that the sampling is with replacement, and we again denote the unordered outcome by W . In this case, W takes
values in the collection of multisets of size n from D. (A multiset is like an ordinary set, except that repeated elements are
allowed).
T = {{x₁, x₂, …, xₙ} : xᵢ ∈ D for each i}    (12.1.7)

In this case, #(T) = C(m + n − 1, n).

In summary, the number of samples of size n chosen from a population of size m is:

Sampling              Ordered samples    Unordered samples
With replacement      mⁿ                 C(m + n − 1, n)
Without replacement   m⁽ⁿ⁾               C(m, n)
Multi-type Populations
A dichotomous population consists of two types of objects.
Suppose that a batch of 100 components includes 10 that are defective. A random sample of 5 components is selected without
replacement. Compute the probability that the sample contains at least one defective component.
Answer
An urn contains 50 balls, 30 red and 20 green. A sample of 15 balls is chosen at random. Find the probability that the sample
contains 10 red balls in each of the following cases:
1. The sampling is without replacement
2. The sampling is with replacement
Answer
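A sketch of the two computations in this exercise, using only math.comb (variable names are ours): the without-replacement case is a hypergeometric probability, and the with-replacement case is binomial with success parameter 30/50.

```python
from math import comb

m, red, n, k = 50, 30, 15, 10  # population, red balls, sample size, target red count

# Without replacement: hypergeometric probability C(30,10) C(20,5) / C(50,15)
p_without = comb(red, k) * comb(m - red, n - k) / comb(m, n)

# With replacement: binomial probability C(15,10) (0.6)^10 (0.4)^5
p_with = comb(n, k) * (red / m) ** k * (1 - red / m) ** (n - k)

print(round(p_without, 4), round(p_with, 4))  # ≈ 0.2070 and ≈ 0.1859
```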
In the ball and urn experiment select 50 balls with 30 red balls, and sample size 15. Run the experiment 100 times. Compute
the relative frequency of the event that the sample has 10 red balls in each of the following cases, and compare with the
respective probability in the previous exercise:
1. The sampling is without replacement
2. The sampling is with replacement
Suppose that a club has 100 members, 40 men and 60 women. A committee of 10 members is selected at random (and without
replacement, of course).
1. Find the probability that both genders are represented on the committee.
2. If you observed the experiment and in fact the committee members are all of the same gender, would you believe that the
sampling was random?
Answer
Suppose that a small pond contains 500 fish, 50 of them tagged. A fisherman catches 10 fish. Find the probability that the catch
contains at least 2 tagged fish.
Answer
The basic distribution that arises from sampling without replacement from a dichotomous population is studied in the section on the
hypergeometric distribution. More generally, a multi-type population consists of objects of k different types.
Suppose that a legislative body consists of 60 republicans, 40 democrats, and 20 independents. A committee of 10 members is
chosen at random. Find the probability that at least one party is not represented on the committee.
Answer
The basic distribution that arises from sampling without replacement from a multi-type population is studied in the section on the
multivariate hypergeometric distribution.
Cards
Recall that a standard card deck can be modeled by the product set
D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠} (12.1.8)
where the first coordinate encodes the denomination or kind (ace, 2-10, jack, queen, king) and where the second coordinate encodes
the suit (clubs, diamonds, hearts, spades). The general card experiment consists of drawing n cards at random and without
replacement from the deck D. Thus, the ith card is Xᵢ = (Yᵢ, Zᵢ) where Yᵢ is the denomination and Zᵢ is the suit. The special case
n = 5 is the poker experiment and the special case n = 13 is the bridge experiment. Note that with respect to the denominations or
with respect to the suits, a deck of cards is a multi-type population as discussed above.
In the bridge experiment, there are
1. 3,954,242,643,911,239,680,000 ordered hands
2. 635,013,559,600 unordered hands
In the card experiment, set n = 5 . Run the simulation 5 times and on each run, list all of the (ordered) sequences of cards that
would give the same unordered hand as the one you observed.
Run the card experiment 500 times. Compute the relative frequency corresponding to each probability in the previous exercise.
Find the probability that a bridge hand will contain no honor cards, that is, no cards of denomination 10, jack, queen, king, or
ace. Such a hand is called a Yarborough, in honor of the second Earl of Yarborough.
Answer
Dice
Rolling n fair, six-sided dice is equivalent to choosing a random sample of size n with replacement from the population
{1, 2, 3, 4, 5, 6}. Generally, selecting a random sample of size n with replacement from D = {1, 2, …, m} is equivalent to rolling
n fair, m-sided dice.
In the game of poker dice, 5 standard, fair dice are thrown. Find each of the following:
1. The probability that all dice show the same score.
2. The probability that the scores are distinct.
3. The probability that 1 occurs twice and 6 occurs 3 times.
Answer
Run the poker dice experiment 500 times. Compute the relative frequency of each event in the previous exercise and compare
with the corresponding probability.
The game of poker dice is treated in more detail in the chapter on Games of Chance.
Birthdays
Suppose that we select n persons at random and record their birthdays. If we assume that birthdays are uniformly distributed
throughout the year, and if we ignore leap years, then this experiment is equivalent to selecting a sample of size n with replacement
from D = {1, 2, … , 365}. Similarly, we could record birth months or birth weeks.
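The probability that all n birthdays are distinct is the product of the factors (m − k)/m for k = 0, 1, …, n − 1, which is easy to sketch in Python (the function name is ours):

```python
from math import prod

def p_all_distinct(n, m=365):
    # P(no duplicate) when n birthdays are sampled uniformly, with replacement, from m days
    return prod((m - k) / m for k in range(n))

p = p_all_distinct(30)
print(round(p, 4), round(1 - p, 4))  # ≈ 0.2937 distinct, ≈ 0.7063 at least one duplicate
```

Note that the function returns 0 as soon as n exceeds m, as it must by the pigeonhole principle.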
Suppose that a probability class has 30 students. Find each of the following:
1. The probability that the birthdays are distinct.
2. The probability that there is at least one duplicate birthday.
Answer
In the birthday experiment, set m = 365 and n = 30 . Run the experiment 1000 times and compare the relative frequency of
each event in the previous exercise to the corresponding probability.
Sampling from D = {1, 2, …, m} can also be thought of as distributing n balls into m cells: sampling with replacement means that a cell may contain more
than one ball; sampling without replacement means that a cell may contain at most one ball.
Suppose that 5 balls are distributed into 10 cells (with no restrictions). Find each of the following:
1. The probability that the balls are all in different cells.
2. The probability that the balls are all in the same cell.
Answer
Coupons
Suppose that when we purchase a certain product (bubble gum, or cereal for example), we receive a coupon (a baseball card or
small toy, for example), which is equally likely to be any one of m types. We can think of this experiment as sampling with
replacement from the population of coupon types; Xᵢ is the coupon that we receive on the ith purchase.
Suppose that a kid's meal at a fast food restaurant comes with a toy. The toy is equally likely to be any of 5 types. Suppose that
a mom buys a kid's meal for each of her 3 kids. Find each of the following:
1. The probability that the toys are all the same.
2. The probability that the toys are all different.
Answer
The coupon collector problem is studied in more detail later in this chapter.
Suppose that a person has n keys, only one of which opens a certain door, and tries the keys at random. Let N
denote the trial number when the person finds the correct key.

Suppose that unsuccessful keys are discarded (the rational thing to do, of course). Then N has the uniform distribution on
{1, 2, …, n}:
1. P(N = i) = 1/n, i ∈ {1, 2, …, n}.
2. E(N) = (n + 1)/2.
3. var(N) = (n² − 1)/12.
Suppose that unsuccessful keys are not discarded (perhaps the person has had a bit too much to drink). Then N has a
geometric distribution on ℕ₊:
1. P(N = i) = (1/n) ((n − 1)/n)^(i−1), i ∈ ℕ₊.
2. E(N) = n.
3. var(N) = n(n − 1).
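As a numerical sketch, the two sets of formulas for the key problem can be checked directly; the truncation bound 2000 below is an arbitrary cutoff for the geometric series, whose tail is negligible:

```python
from fractions import Fraction

n = 10  # number of keys, exactly one of which opens the door

# Keys discarded: N is uniform on {1, ..., n}
mean_discard = Fraction(n + 1, 2)        # E(N) = (n + 1)/2
var_discard = Fraction(n * n - 1, 12)    # var(N) = (n^2 - 1)/12

# Keys not discarded: N is geometric on {1, 2, ...} with success probability 1/n.
# Check E(N) = n by truncating the series sum of i * P(N = i).
mean_geom = sum(i * ((n - 1) / n) ** (i - 1) * (1 / n) for i in range(1, 2000))
print(mean_discard, var_discard, round(mean_geom, 6))
```

With n = 10 keys, the uniform case gives E(N) = 11/2 and var(N) = 33/4, and the truncated geometric series returns a mean of 10 to within floating-point accuracy.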
Let U = (U₁, U₂, …, Uₙ) be a sequence of random numbers. Recall that these are independent random variables, each
uniformly distributed on the interval [0, 1] (the standard uniform distribution). Then Xᵢ = ⌈m Uᵢ⌉ for i ∈ {1, 2, …, n}
simulates a random sample, with replacement, from D.
It's a bit harder to simulate a random sample of size n , without replacement, since we need to remove each sample value before the
next draw.
The following algorithm generates a random sample of size n, without replacement, from D.
1. For i = 1 to m, let bᵢ = i.
2. For i = 1 to n:
   a. let j = m − i + 1
   b. let Uᵢ be a random number
   c. let J = ⌈j Uᵢ⌉
   d. let Xᵢ = b_J
   e. let k = bⱼ
   f. let bⱼ = b_J
   g. let b_J = k
3. Return X = (X₁, X₂, …, Xₙ)
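The steps above can be sketched in Python; here random.random() plays the role of the random numbers Uᵢ, and the function name is ours:

```python
import random
from math import ceil

def sample_without_replacement(m, n, rng=random.random):
    # b holds the population labels 1..m; each draw is taken from among the
    # first j cells and then swapped to the back, so it cannot be drawn again
    b = list(range(1, m + 1))
    x = []
    for i in range(1, n + 1):
        j = m - i + 1                  # number of labels not yet drawn
        u = rng()                      # a random number in [0, 1)
        J = ceil(j * u) or 1           # index in {1, ..., j}; guard against u == 0
        x.append(b[J - 1])             # X_i = b_J
        b[J - 1], b[j - 1] = b[j - 1], b[J - 1]   # swap b_J and b_j
    return x

s = sample_without_replacement(10, 5)
print(s, len(set(s)) == 5)  # five distinct labels from {1, ..., 10}
```

The swap is what makes the method O(n) per sample with no search: the labels already drawn are kept in the tail of the array, so each draw is uniform over the labels that remain.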
This page titled 12.1: Introduction to Finite Sampling Models is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
12.2: The Hypergeometric Distribution
Basic Theory
Dichotomous Populations
Suppose that we have a dichotomous population D. That is, a population that consists of two types of objects, which we will refer
to as type 1 and type 0. For example, we could have
balls in an urn that are either red or green
a batch of components that are either good or defective
a population of people who are either male or female
a population of animals that are either tagged or untagged
voters who are either democrats or republicans
Let R denote the subset of D consisting of the type 1 objects, and suppose that #(D) = m and #(R) = r . As in the basic
sampling model, we sample n objects at random from D. In this section, our only concern is in the types of the objects, so let Xᵢ
denote the type of the ith object chosen (1 or 0). The random vector of types is
X = (X1 , X2 , … , Xn ) (12.2.1)
Our main interest is the random variable Y that gives the number of type 1 objects in the sample. Note that Y is a counting
variable, and thus like all counting variables, can be written as a sum of indicator variables, in this case the type variables:
Y = ∑ᵢ₌₁ⁿ Xᵢ    (12.2.2)
We will assume initially that the sampling is without replacement, which is usually the realistic setting with dichotomous
populations.
Proof
The distribution defined by this probability density function, P(Y = y) = C(r, y) C(m − r, n − y) / C(m, n), is known as the
hypergeometric distribution with parameters m, r, and n.
Combinatorial Proof
Algebraic Proof
Recall our convention that j⁽ⁱ⁾ = C(j, i) = 0 for i > j. With this convention, the two formulas for the probability density function
are correct for y ∈ {0, 1, …, n}. We usually use this simpler set as the set of values for the hypergeometric distribution.
The hypergeometric distribution is unimodal. Let v = (r + 1)(n + 1)/(m + 2). Then
1. P(Y = y) > P(Y = y − 1) if and only if y < v .
12.2.1 https://stats.libretexts.org/@go/page/10245
2. The mode occurs at ⌊v⌋ if v is not an integer, and at v and v − 1 if v is an integer greater than 0.
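A sketch of the density and the mode rule (hyper_pmf is an illustrative helper, not part of the text):

```python
from math import comb, floor

def hyper_pmf(y, m, r, n):
    # P(Y = y) for the hypergeometric distribution: C(r, y) C(m-r, n-y) / C(m, n)
    return comb(r, y) * comb(m - r, n - y) / comb(m, n)

m, r, n = 100, 30, 10
pmf = [hyper_pmf(y, m, r, n) for y in range(n + 1)]
v = (r + 1) * (n + 1) / (m + 2)                    # the mode parameter from the result above
mode = max(range(n + 1), key=lambda y: pmf[y])
print(mode, floor(v))  # the mode agrees with floor(v) since v is not an integer here
```

With m = 100, r = 30, n = 10, we get v = 341/102 ≈ 3.34, so the mode is 3.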
In the ball and urn experiment, select sampling without replacement. Vary the parameters and note the shape of the probability
density function. For selected values of the parameters, run the experiment 1000 times and compare the relative frequency
function to the probability density function.
You may wonder about the rather exotic name hypergeometric distribution, which seems to have nothing to do with sampling from
a dichotomous population. The name comes from a power series, which was studied by Leonhard Euler, Carl Friedrich Gauss,
Bernhard Riemann, and others.
A hypergeometric series is a power series ∑ₖ₌₀^∞ aₖ xᵏ in which the ratio of successive coefficients aₖ₊₁/aₖ is a rational function of k.
Many of the basic power series studied in calculus are hypergeometric series, including the ordinary geometric series and the
exponential series.
In addition, the hypergeometric distribution function can be expressed in terms of a hypergeometric series. These representations
are not particularly helpful, so basically we're stuck with the non-descriptive term for historical reasons.
Moments
Next we will derive the mean and variance of Y . The exchangeable property of the indicator variables, and properties of covariance
and correlation will play a key role.
E(Xᵢ) = r/m for each i.
Proof
From the representation of Y as the sum of indicator variables, the expected value of Y is trivial to compute. But just for fun, we
give the derivation from the probability density function as well.
E(Y) = n r/m.
Proof
Proof from the definition
Next we turn to the variance of the hypergeometric distribution. For that, we will need not only the variances of the indicator
variables, but their covariances as well.
var(Xᵢ) = (r/m)(1 − r/m) for each i.
Proof
For distinct i, j,
1. cov(Xᵢ, Xⱼ) = −(r/m)(1 − r/m) · 1/(m − 1)
2. cor(Xᵢ, Xⱼ) = −1/(m − 1)
Proof
Note that the event of a type 1 object on draw i and the event of a type 1 object on draw j are negatively correlated, but the
correlation depends only on the population size and not on the number of type 1 objects. Note also that the correlation is perfect if
m = 2 , which must be the case.
var(Y) = n (r/m)(1 − r/m) (m − n)/(m − 1).
Proof
Note that var(Y ) = 0 if r = 0 or r = m or n = m , which must be true since Y is deterministic in each of these cases.
In the ball and urn experiment, select sampling without replacement. Vary the parameters and note the size and location of the
mean ± standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the
empirical mean and standard deviation to the true mean and standard deviation.
Suppose now that the sampling is with replacement, so that X is a sequence of Bernoulli trials with success parameter r/m.
The following results now follow immediately from the general theory of Bernoulli trials, although modifications of the arguments
above could also be used.

Y has the binomial distribution with parameters n and r/m:

P(Y = y) = C(n, y) (r/m)ʸ (1 − r/m)ⁿ⁻ʸ,    y ∈ {0, 1, …, n}    (12.2.10)

1. E(Y) = n r/m
2. var(Y) = n (r/m)(1 − r/m)
Note that for any values of the parameters, the mean of Y is the same, whether the sampling is with or without replacement. On the
other hand, the variance of Y is smaller, by a factor of (m − n)/(m − 1), when the sampling is without replacement than with
replacement. It certainly makes sense that the variance of Y should be smaller when sampling without replacement, since each
selection reduces the variability in the population that remains. The factor (m − n)/(m − 1) is sometimes called the finite
population correction factor.
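A small numerical sketch of the correction factor, with assumed values m = 50, r = 30, n = 15:

```python
m, r, n = 50, 30, 15
p = r / m

var_with = n * p * (1 - p)        # binomial variance (sampling with replacement)
fpc = (m - n) / (m - 1)           # finite population correction factor
var_without = var_with * fpc      # hypergeometric variance (without replacement)
print(round(var_with, 4), round(fpc, 4), round(var_without, 4))
```

Here the with-replacement variance is 3.6, the correction factor is 35/49 ≈ 0.714, and the without-replacement variance shrinks accordingly.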
In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with
replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial
probability density function. Note also the difference between the mean ± standard deviation bars. For selected values of the
parameters and for the two different sampling modes, run the simulation 1000 times.
Suppose that rₘ ∈ {0, 1, …, m} for each m ∈ ℕ₊ and that rₘ/m → p ∈ [0, 1] as m → ∞. Then for fixed n, the
hypergeometric probability density function with parameters m, rₘ, and n converges to the binomial probability density
function with parameters n and p.
In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with
replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial
probability density function. In particular, note the similarity when m is large and n small. For selected values of the
parameters, and for both sampling modes, run the experiment 1000 times.
In the setting of the convergence result above, note that the mean and variance of the hypergeometric distribution converge to
the mean and variance of the binomial distribution as m → ∞ .
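The convergence can be illustrated numerically. A sketch with assumed parameters n = 5 and p = 0.3 (helper names are ours):

```python
from math import comb

def hyper_pmf(y, m, r, n):
    return comb(r, y) * comb(m - r, n - y) / comb(m, n)

def binom_pmf(y, n, p):
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

n, p = 5, 0.3
for m in (20, 200, 2000):
    r = round(p * m)               # so that r_m / m -> p as m grows
    diff = max(abs(hyper_pmf(y, m, r, n) - binom_pmf(y, n, p)) for y in range(n + 1))
    print(m, round(diff, 5))       # the maximum pointwise difference shrinks with m
```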
In many real problems, the parameter r is unknown. Since E(Y) = n r/m, a natural estimator of r is (m/n) Y, the value obtained
by matching the sample proportion of type 1 objects to the population proportion. This method of deriving an estimator is known
as the method of moments.

E((m/n) Y) = r
Proof
Thus, (m/n) Y is an unbiased estimator of r. Hence the variance is a measure of the quality of the estimator, in the mean
square sense.

var((m/n) Y) = (m − r) (r/n) (m − n)/(m − 1).
Proof
var((m/n) Y) ↓ 0 as n ↑ m.
Thus, the estimator improves as the sample size increases; this property is known as consistency.
In the ball and urn experiment, select sampling without replacement. For selected values of the parameters, run the experiment
100 times and note the estimate of r on each run.
1. Compute the average error and the average squared error over the 100 runs.
2. Compare the average squared error with the variance in mean square error given above.
Often we just want to estimate the ratio r/m (particularly if we don't know m either). In this case, the natural estimator is the
sample proportion Y/n.
The estimator Y/n of r/m has the following properties:
1. E(Y/n) = r/m, so the estimator is unbiased.
2. var(Y/n) = (1/n) (r/m)(1 − r/m) (m − n)/(m − 1)
3. var(Y/n) ↓ 0 as n ↑ m, so the estimator is consistent.
Do you think that the main assumption of the sampling model, namely equally likely samples, would be satisfied for a real
capture-recapture problem? Explain.
Once again, we can use the method of moments to derive a simple estimate of m , by hoping that the sample proportion of type 1
objects is close the population proportion of type 1 objects. That is,
Y/n ≈ r/m  ⟹  m ≈ n r/Y    (12.2.12)

Thus, our estimator of m is n r/Y if Y > 0, and is ∞ if Y = 0.
In the ball and urn experiment, select sampling without replacement. For selected values of the parameters, run the experiment
100 times.
1. On each run, compare the true value of m with the estimated value.
2. Compute the average error and the average squared error over the 100 runs.
If y > 0 then n r/y maximizes P(Y = y) as a function of m for fixed r and n. This means that n r/Y is a maximum likelihood
estimator of m.
E(n r/Y) ≥ m.
Proof
Thus, the estimator is positively biased and tends to over-estimate m. Indeed, if n ≤ m − r, so that P(Y = 0) > 0, then
E(n r/Y) = ∞. For another approach to estimating the population size m, see the section on Order Statistics.

Suppose now that the sampling is with replacement.
The estimator (m/n) Y of r with m known satisfies
1. E((m/n) Y) = r
2. var((m/n) Y) = r(m − r)/n
The estimator Y/n of r/m satisfies
1. E(Y/n) = r/m
2. var(Y/n) = (1/n) (r/m)(1 − r/m)
Thus, the estimators are still unbiased and consistent, but have larger mean square error than before. Thus, sampling without
replacement works better, for any values of the parameters, than sampling with replacement.
In the ball and urn experiment, select sampling with replacement. For selected values of the parameters, run the experiment
100 times.
1. On each run, compare the true value of r with the estimated value.
2. Compute the average error and the average squared error over the 100 runs.
3. The probability that the sample contains at least one defective chip.
Answer
A club contains 50 members; 20 are men and 30 are women. A committee of 10 members is chosen at random. Find each of
the following:
1. The probability density function of the number of women on the committee.
2. The mean and variance of the number of women on the committee.
3. The mean and variance of the number of men on the committee.
4. The probability that the committee members are all the same gender.
Answer
A small pond contains 1000 fish; 100 are tagged. Suppose that 20 fish are caught. Find each of the following:
1. The probability density function of the number of tagged fish in the sample.
2. The mean and variance of the number of tagged fish in the sample.
3. The probability that the sample contains at least 2 tagged fish.
4. The binomial approximation to the probability in (c).
Answer
Forty percent of the registered voters in a certain district prefer candidate A . Suppose that 10 voters are chosen at random. Find
each of the following:
1. The probability density function of the number of voters in the sample who prefer A .
2. The mean and variance of the number of voters in the sample who prefer A .
3. The probability that at least 5 voters in the sample prefer A .
Answer
Suppose that 10 memory chips are sampled at random and without replacement from a batch of 100 chips. The chips are tested
and 2 are defective. Estimate the number of defective chips in the entire batch.
Answer
A voting district has 5000 registered voters. Suppose that 100 voters are selected at random and polled, and that 40 prefer
candidate A . Estimate the number of voters in the district who prefer candidate A .
Answer
From a certain lake, 200 fish are caught, tagged and returned to the lake. Then 100 fish are caught and it turns out that 10 are
tagged. Estimate the population of fish in the lake.
Answer
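This capture-recapture estimate is just m ≈ n r/Y from the method of moments. A sketch (the function name is ours):

```python
def estimate_population(r, n, y):
    # method-of-moments / maximum-likelihood style estimate: m ≈ n r / Y,
    # with the convention that the estimate is infinite when no tagged objects appear
    return float('inf') if y == 0 else n * r / y

# 200 tagged fish, 100 caught, 10 of the catch tagged
print(estimate_population(r=200, n=100, y=10))  # 2000.0
```

The estimate of 2000 fish simply scales the tagged count up by the observed tagged proportion 10/100.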
Cards
Recall that the general card experiment is to select n cards at random and without replacement from a standard deck of 52 cards.
The special case n = 5 is the poker experiment and the special case n = 13 is the bridge experiment.
In a poker hand, find the probability density function, mean, and variance of the following random variables:
1. The number of spades
2. The number of aces
Answer
Answer
Suppose now that the type of each object in the population is itself random. Let Uᵢ denote the type of the ith object
in the population, so that U = (U₁, U₂, …, Uₘ) is a sequence of Bernoulli trials with success parameter p. Let V = ∑ᵢ₌₁ᵐ Uᵢ
denote the number of type 1 objects in the population, so that V has the binomial distribution with parameters m and p.

As before, we sample n objects from the population. Again we let Xᵢ denote the type of the ith object sampled, and we let
Y = ∑ᵢ₌₁ⁿ Xᵢ denote the number of type 1 objects in the sample. We will consider sampling with and without replacement. In the
first case, the sample size can be any positive integer, but in the second case, the sample size cannot exceed the population size.
The key technique in the analysis of the randomized urn is to condition on V . If we know that V = r , then the model reduces to the
model studied above: a population of size m with r type 1 objects, and a sample of size n .
Proof
Thus, in either model, X is a sequence of identically distributed indicator variables. Ah, but what about dependence?
Suppose that the sampling is without replacement. Let (x₁, x₂, …, xₙ) ∈ {0, 1}ⁿ and let y = ∑ᵢ₌₁ⁿ xᵢ. Then

P(X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ) = pʸ (1 − p)ⁿ⁻ʸ    (12.2.13)
Proof
From the joint distribution in the previous exercise, we see that X is a sequence of Bernoulli trials with success parameter p, and
hence Y has the binomial distribution with parameters n and p. We could also argue that X is a Bernoulli trials sequence directly,
by noting that {X₁, X₂, …, Xₙ} is a randomly chosen subset of {U₁, U₂, …, Uₘ}.
Suppose now that the sampling is with replacement. Again, let (x₁, x₂, …, xₙ) ∈ {0, 1}ⁿ and let y = ∑ᵢ₌₁ⁿ xᵢ. Then

P(X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ) = E[Vʸ (m − V)ⁿ⁻ʸ / mⁿ]    (12.2.15)
Proof
A closed form expression for the joint distribution of X, in terms of the parameters m, n , and p is not easy, but it is at least clear
that the joint distribution will not be the same as the one when the sampling is without replacement. In particular, X is a dependent
sequence. Note however that X is an exchangeable sequence, since the joint distribution is invariant under a permutation of the
coordinates (this is a simple consequence of the fact that the joint distribution depends only on the sum y ).
Suppose that i and j are distinct indices. The covariance and correlation of (Xᵢ, Xⱼ) are
1. cov(Xᵢ, Xⱼ) = p(1 − p)/m
2. cor(Xᵢ, Xⱼ) = 1/m
Proof
1. E(Y ) = np
2. var(Y) = n p(1 − p) (m + n − 1)/m
Proof
Let's conclude with an interesting observation: For the randomized urn, X is a sequence of independent variables when the
sampling is without replacement but a sequence of dependent variables when the sampling is with replacement—just the opposite
of the situation for the deterministic urn with a fixed number of type 1 objects.
This page titled 12.2: The Hypergeometric Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
12.3: The Multivariate Hypergeometric Distribution
Basic Theory
The Multitype Model
As in the basic sampling model, we start with a finite population D consisting of m objects. In this section, we suppose in addition that each
object is one of k types; that is, we have a multitype population. For example, we could have an urn with balls of several different colors, or a
population of voters who are either democrat, republican, or independent. Let Dᵢ denote the subset of all type i objects and let mᵢ = #(Dᵢ) for
i ∈ {1, 2, …, k}. Thus D = ⋃ᵢ₌₁ᵏ Dᵢ and m = ∑ᵢ₌₁ᵏ mᵢ. The dichotomous model considered earlier is clearly a special case, with k = 2.
As in the basic sampling model, we sample n objects at random from D. Thus the outcome of the experiment is X = (X₁, X₂, …, Xₙ) where
Xᵢ ∈ D is the ith object chosen. Now let Yᵢ denote the number of type i objects in the sample, for i ∈ {1, 2, …, k}. Note that ∑ᵢ₌₁ᵏ Yᵢ = n, so
if we know the values of k − 1 of the counting variables, we can find the value of the remaining counting variable. As with any counting
variable, we can express Yᵢ as a sum of indicator variables:

For i ∈ {1, 2, …, k},

Yᵢ = ∑ⱼ₌₁ⁿ 1(Xⱼ ∈ Dᵢ)    (12.3.1)
We assume initially that the sampling is without replacement, since this is the realistic case in most applications.
Proof
The distribution of (Y₁, Y₂, …, Yₖ) is called the multivariate hypergeometric distribution with parameters m, (m₁, m₂, …, mₖ), and n. We
also say that (Y₁, Y₂, …, Yₖ₋₁) has this distribution (recall again that the values of any k − 1 of the variables determine the value of the
remaining variable). Usually it is clear from context which meaning is intended. The ordinary hypergeometric distribution corresponds to k = 2.
Combinatorial Proof
Algebraic Proof
P(Yᵢ = y) = C(mᵢ, y) C(m − mᵢ, n − y) / C(m, n),    y ∈ {0, 1, …, n}    (12.3.4)
Proof
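The joint density of the counting variables under sampling without replacement, ∏ᵢ C(mᵢ, yᵢ) / C(m, n), can be sketched as follows (the function name is ours, and the example population of three types with 60, 40, and 20 objects is an assumed illustration):

```python
from math import comb, prod

def mv_hyper_pmf(y, m_counts, n):
    # P(Y_1 = y_1, ..., Y_k = y_k) = prod_i C(m_i, y_i) / C(m, n), without replacement;
    # the probability is 0 unless the counts sum to the sample size n
    m = sum(m_counts)
    if sum(y) != n:
        return 0.0
    return prod(comb(mi, yi) for mi, yi in zip(m_counts, y)) / comb(m, n)

p = mv_hyper_pmf((5, 3, 2), (60, 40, 20), 10)
print(round(p, 4))
```

With k = 2 this reduces to the ordinary hypergeometric density, and summing over all count vectors with total n gives 1.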
Grouping
The multivariate hypergeometric distribution is preserved when the counting variables are combined. Specifically, suppose that
(A₁, A₂, …, Aₗ) is a partition of the index set {1, 2, …, k} into nonempty, disjoint subsets. Let Wⱼ = ∑ᵢ∈Aⱼ Yᵢ and rⱼ = ∑ᵢ∈Aⱼ mᵢ for
j ∈ {1, 2, …, l}. Then (W₁, W₂, …, Wₗ) has the multivariate hypergeometric distribution with parameters m, (r₁, r₂, …, rₗ), and n.
12.3.1 https://stats.libretexts.org/@go/page/10246
Note that the marginal distribution of Yᵢ given above is a special case of grouping. We have two types: type i and not type i. More generally,
the marginal distribution of any subsequence of the counting variables is multivariate hypergeometric.
Conditioning
The multivariate hypergeometric distribution is also preserved when some of the counting variables are observed. Specifically, suppose that
(A, B) is a partition of the index set {1, 2, …, k} into nonempty, disjoint subsets. Suppose that we observe Yⱼ = yⱼ for j ∈ B. Let
z = n − ∑ⱼ∈B yⱼ and r = ∑ᵢ∈A mᵢ. Then the conditional distribution of (Yᵢ : i ∈ A) given the observed values is multivariate
hypergeometric with parameters r, (mᵢ : i ∈ A), and z.
i∈A i
Combinations of the grouping result and the conditioning result can be used to compute any marginal or conditional distributions of the counting
variables.
Moments
We will compute the mean, variance, covariance, and correlation of the counting variables. Results from the hypergeometric distribution and the
representation in terms of indicator variables are the main tools.
1. E(Yᵢ) = n mᵢ/m
2. var(Yᵢ) = n (mᵢ/m) ((m − mᵢ)/m) (m − n)/(m − 1)
Proof
Now let Iₜᵢ = 1(Xₜ ∈ Dᵢ), the indicator variable of the event that the tth object selected is type i, for t ∈ {1, 2, …, n} and i ∈ {1, 2, …, k}.

Suppose that r and s are distinct elements of {1, 2, …, n}, and i and j are distinct elements of {1, 2, …, k}. Then

cov(Iᵣᵢ, Iᵣⱼ) = −(mᵢ/m)(mⱼ/m)    (12.3.5)

cov(Iᵣᵢ, Iₛⱼ) = (1/(m − 1)) (mᵢ/m)(mⱼ/m)    (12.3.6)
Proof
Suppose again that r and s are distinct elements of {1, 2, …, n}, and i and j are distinct elements of {1, 2, …, k}. Then

cor(Iᵣᵢ, Iᵣⱼ) = −√(mᵢ mⱼ / ((m − mᵢ)(m − mⱼ)))    (12.3.7)

cor(Iᵣᵢ, Iₛⱼ) = (1/(m − 1)) √(mᵢ mⱼ / ((m − mᵢ)(m − mⱼ)))    (12.3.8)
Proof
In particular, Iᵣᵢ and Iᵣⱼ are negatively correlated while Iᵣᵢ and Iₛⱼ are positively correlated.
Suppose now that the sampling is with replacement. The types of the objects in the sample form a sequence of n multinomial trials with parameters (m₁/m, m₂/m, …, mₖ/m).
The following results now follow immediately from the general theory of multinomial trials, although modifications of the arguments above
could also be used.
P(Y₁ = y₁, Y₂ = y₂, …, Yₖ = yₖ) = C(n; y₁, y₂, …, yₖ) (m₁^y₁ m₂^y₂ ⋯ mₖ^yₖ) / mⁿ,  for (y₁, y₂, …, yₖ) ∈ ℕᵏ with ∑ᵢ₌₁ᵏ yᵢ = n    (12.3.11)

where C(n; y₁, …, yₖ) = n!/(y₁! ⋯ yₖ!) is the multinomial coefficient.
Comparing with our previous results, note that the means and correlations are the same, whether sampling with or without replacement. The
variances and covariances are smaller when sampling without replacement, by a factor of the finite population correction factor
(m − n)/(m − 1).
Proof
Cards
Recall that the general card experiment is to select n cards at random and without replacement from a standard deck of 52 cards. The special
case n = 5 is the poker experiment and the special case n = 13 is the bridge experiment.
Answer
In the card experiment, a hand that does not contain any cards of a particular suit is said to be void in that suit.
Use the inclusion-exclusion rule to show that the probability that a poker hand is void in at least one suit is
1913496 / 2598960 ≈ 0.736    (12.3.12)
In the card experiment, set n = 5 . Run the simulation 1000 times and compute the relative frequency of the event that the hand is void in at
least one suit. Compare the relative frequency with the true probability given in the previous exercise.
Use the inclusion-exclusion rule to show that the probability that a bridge hand is void in at least one suit is
32427298180 / 635013559600 ≈ 0.051    (12.3.13)
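Both inclusion-exclusion computations can be verified with a short sketch: a hand of size n is void in a given suit in C(39, n) ways, in two given suits in C(26, n) ways, and in three in C(13, n) ways (the function name is ours):

```python
from math import comb

def p_void(n):
    # P(a hand of size n is void in at least one suit), by inclusion-exclusion
    # over the 4 single suits, C(4,2) = 6 pairs, and C(4,3) = 4 triples
    return (4 * comb(39, n) - 6 * comb(26, n) + 4 * comb(13, n)) / comb(52, n)

print(round(p_void(5), 3), round(p_void(13), 3))  # ≈ 0.736 and ≈ 0.051
```

Note that a hand cannot be void in all four suits, so the inclusion-exclusion sum stops at triples.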
This page titled 12.3: The Multivariate Hypergeometric Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon
request.
12.4: Order Statistics
Basic Theory
Definitions
Suppose that the objects in our population are numbered from 1 to m, so that D = {1, 2, … , m}. For example, the population
might consist of manufactured items, and the labels might correspond to serial numbers. As in the basic sampling model we select
n objects at random, without replacement from D. Thus the outcome is X = (X₁, X₂, …, Xₙ) where Xᵢ ∈ D is the ith object
chosen. Recall that X is uniformly distributed over the set of permutations of size n chosen from D. Recall also that
W = {X₁, X₂, …, Xₙ} is the unordered sample, which is uniformly distributed on the set of combinations of size n chosen from D.
For i ∈ {1, 2, …, n}, let X₍ᵢ₎ denote the ith smallest of the sample values, the order statistic of order i. We will denote the
vector of order statistics by Y = (X₍₁₎, X₍₂₎, …, X₍ₙ₎). Note that Y takes values in

L = {(y₁, y₂, …, yₙ) ∈ Dⁿ : y₁ < y₂ < ⋯ < yₙ}    (12.4.3)
Run the order statistic experiment. Note that you can vary the population size m and the sample size n . The order statistics are
recorded on each update.
Distributions
L has C(m, n) elements and Y is uniformly distributed on L.
Proof
Proof
In the order statistic experiment, vary the parameters and note the shape and location of the probability density function. For
selected values of the parameters, run the experiment 1000 times and compare the relative frequency function to the probability
density function.
Moments
The probability density function of X₍ᵢ₎ above can be used to obtain an interesting identity involving the binomial coefficients.
This identity, in turn, can be used to find the mean and variance of X₍ᵢ₎.
For i, n, m ∈ ℕ₊ with i ≤ n ≤ m,

∑ₖ₌ᵢ^(m−n+i) C(k − 1, i − 1) C(m − k, n − i) = C(m, n)    (12.4.5)
Proof
12.4.1 https://stats.libretexts.org/@go/page/10247
E[X₍ᵢ₎] = i (m + 1)/(n + 1)    (12.4.6)
Proof
Proof
In the order statistic experiment, vary the parameters and note the size and location of the mean ± standard deviation bar. For
selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard deviation to the
distribution mean and standard deviation.
Proof
Since Uᵢ is unbiased, its variance is the mean square error, a measure of the quality of the estimator.

var(Uᵢ) = (m + 1)(m − n)(n − i + 1) / (i(n + 2))    (12.4.9)
Proof
For fixed m and n, var(Uᵢ) decreases as i increases. Thus, the estimators improve as i increases; in particular, Uₙ is the best.

var(Uᵢ)/var(Uⱼ) = j(n − i + 1) / (i(n − j + 1))    (12.4.10)

Note that the relative efficiency depends only on the orders i and j and the sample size n, but not on the population size m (the
unknown parameter). In particular, the relative efficiency of Uₙ with respect to U₁ is n². For fixed i and j, the asymptotic relative
efficiency of Uⱼ to Uᵢ is j/i. Usually, we hope that an estimator improves (in the sense of mean square error) as the sample size n
increases (the more information we have, the better our estimate should be). This general idea is known as consistency.
$$\operatorname{var}(U_n) = \frac{(m+1)(m-n)}{n(n+2)} \tag{12.4.11}$$
For fixed $i$, $\operatorname{var}(U_i)$ at first increases and then decreases to 0 as $n$ increases from $i$ to $m$. Thus, $U_i$ is consistent.
Figure 12.4.1 : var(U1) as a function of n for m = 100
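The rise and fall shown in the figure can be reproduced directly from the variance formula above; a short sketch (the variable names are ours):

```python
m = 100

def var_u(i, n, m):
    # var(U_i) = (m+1)(m-n)(n-i+1) / (i(n+2))
    return (m + 1) * (m - n) * (n - i + 1) / (i * (n + 2))

vals = [var_u(1, n, m) for n in range(1, m + 1)]
n_peak = 1 + vals.index(max(vals))   # sample size where var(U_1) is largest
```

For $m = 100$ the variance peaks at a moderate sample size and then falls to 0 at $n = m$, consistent with the figure.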
$E(M) = \frac{m+1}{2}$.
Proof
It follows that $V = 2M - 1$ is an unbiased estimator of $m$. Moreover, it seems, superficially at least, that $V$ uses more information from the sample (since it involves all of the sample variables) than $U_n$. Could it be better? To find out, we need to compute the variance of the estimator (which, since it is unbiased, is the mean square error). This computation is a bit complicated since the sample variables are dependent. We will compute the variance of the sum as the sum of all of the pairwise covariances.
For distinct $i, j \in \{1, 2, \ldots, n\}$, $\operatorname{cov}(X_i, X_j) = -\frac{m+1}{12}$.
Proof
$\operatorname{var}(X_i) = \frac{m^2 - 1}{12}$.
Proof
$\operatorname{var}(M) = \frac{(m+1)(m-n)}{12n}$.
Proof
$\operatorname{var}(V) = \frac{(m+1)(m-n)}{3n}$.
Proof
The variance of $V$ is decreasing with $n$, so $V$ is also consistent. Let's compute the relative efficiency of the estimator based on the maximum to the estimator based on the mean.
Thus, once again, the estimator based on the maximum is better. In addition to the mathematical analysis, all of the estimators except $U_n$ can sometimes be manifestly worthless by giving estimates that are smaller than some of the sample values.
The order statistics from samples of independent, identically distributed random variables are studied in the chapter on Random Samples.
Examples and Applications
Suppose that in a lottery, tickets numbered from 1 to 25 are placed in a bowl. Five tickets are chosen at random and without
replacement.
1. Find the probability density function of $X_{(3)}$.
2. Find $E[X_{(3)}]$.
Answer
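The answer can be reproduced with exact rational arithmetic; a sketch for the lottery parameters (the variable names are ours):

```python
from fractions import Fraction
from math import comb

m, n, i = 25, 5, 3   # 25 tickets, 5 drawn, third order statistic
# P(X_(3) = k) = C(k-1, 2) C(25-k, 2) / C(25, 5) for k = 3, ..., 23
pdf = {k: Fraction(comb(k - 1, i - 1) * comb(m - k, n - i), comb(m, n))
       for k in range(i, m - n + i + 1)}
mean = sum(k * p for k, p in pdf.items())
print(mean)   # E[X_(3)] = i(m+1)/(n+1) = 13
```

The exact mean matches the general formula $E[X_{(i)}] = i\frac{m+1}{n+1}$ from earlier in the section.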
German tanks had serial numbers, and captured German tanks and records formed the sample data. The statistical estimates turned
out to be much more accurate than intelligence estimates. Some of the data are given in the table below.
German Tank Data. Source: Wikipedia
Date Statistical Estimate Intelligence Estimate German Records
One of the morals, evidently, is not to put serial numbers on your weapons!
Suppose that in a certain war, 5 enemy tanks have been captured. The serial numbers are 51, 3, 27, 82, 65. Compute the
estimates of m, the total number of tanks, using all of the estimators discussed above.
Answer
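A sketch of the computation (the variable names are ours), using $U_i = \frac{n+1}{i}X_{(i)} - 1$ and $V = 2M - 1$ as defined above:

```python
serials = [51, 3, 27, 82, 65]
n = len(serials)
x = sorted(serials)                       # order statistics x_(1), ..., x_(n)

# U_i = ((n+1)/i) x_(i) - 1, one unbiased estimator per order statistic
U = [(n + 1) / i * x[i - 1] - 1 for i in range(1, n + 1)]
# V = 2M - 1, the unbiased estimator based on the sample mean M
V = 2 * sum(serials) / n - 1

print([round(u, 1) for u in U], round(V, 1))
```

The estimate based on the maximum is $U_5 = \frac{6}{5}\cdot 82 - 1 = 97.4$, while $V = 2(45.6) - 1 = 90.2$.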
In the order statistic experiment, set m = 100 and n = 10 . Run the experiment 50 times. For each run, compute the
estimate of m based on each order statistic. For each estimator, compute the square root of the average of the squares of the
errors over the 50 runs. Based on these empirical error estimates, rank the estimators of m in terms of quality.
Suppose that in a certain war, 10 enemy tanks have been captured. The serial numbers are 304, 125, 417, 226, 192, 340, 468,
499, 87, 352. Compute the estimate of m, the total number of tanks, using the estimator based on the maximum and the
estimator based on the mean.
Answer
This page titled 12.4: Order Statistics is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
12.5: The Matching Problem
Number the couples from 1 to $n$. Then $X_i$ is the number of the woman paired with the $i$th man.
Number the letters and corresponding envelopes from 1 to $n$. Then $X_i$ is the number of the envelope containing the $i$th letter.
Number the people and their corresponding hats from 1 to $n$. Then $X_i$ is the number of the hat chosen by the $i$th person.
Our modeling assumption, of course, is that $X$ is uniformly distributed on the sample space of permutations of $D_n$. The number of objects $n$ is the basic parameter of the experiment. We will also consider the case of sampling with replacement from the population $D_n$, because the analysis is much easier but still provides insight. In this case, $X$ is a sequence of independent random variables, each uniformly distributed on $D_n$.
Matches
We will say that a match occurs at position $j$ if $X_j = j$. Thus, the number of matches is the random variable $N_n$ defined mathematically by
$$N_n = \sum_{j=1}^n I_j \tag{12.5.1}$$
where $I_j = \mathbf{1}(X_j = j)$ is the indicator variable for the event of a match at position $j$. Our problem is to compute the probability distribution of the number of matches. This is an old and famous problem in probability that was first considered by Pierre Rémond de Montmort; it is sometimes referred to as Montmort's matching problem in his honor.
In the case of sampling with replacement, $(I_1, I_2, \ldots, I_n)$ is a sequence of $n$ Bernoulli trials with success probability $\frac{1}{n}$.
Proof
The number of matches $N_n$ has the binomial distribution with trial parameter $n$ and success parameter $\frac{1}{n}$:
$$P(N_n = k) = \binom{n}{k}\left(\frac{1}{n}\right)^k\left(1 - \frac{1}{n}\right)^{n-k}, \quad k \in \{0, 1, \ldots, n\} \tag{12.5.2}$$
Proof
1. $E(N_n) = 1$
2. $\operatorname{var}(N_n) = \frac{n-1}{n}$
Proof
12.5.1 https://stats.libretexts.org/@go/page/10248
The distribution of the number of matches converges to the Poisson distribution with parameter 1 as n → ∞ :
$$P(N_n = k) \to \frac{e^{-1}}{k!} \text{ as } n \to \infty \text{ for } k \in \mathbb{N} \tag{12.5.3}$$
Proof
matches. This will turn out to be easy once we have counted the number of permutations with no matches; these are called derangements of $D_n$. We will denote the number of permutations of $D_n$ with exactly $k$ matches by $b_n(k) = \#\{N_n = k\}$ for $k \in \{0, 1, \ldots, n\}$.
Proof
Proof
Proof
In the matching experiment, vary the parameter n and note the shape and location of the probability density function. For
selected values of n , run the simulation 1000 times and compare the empirical density function to the true probability density
function.
$P(N_n = n-1) = 0$.
Proof
The distribution of the number of matches converges to the Poisson distribution with parameter 1 as n → ∞ :
$$P(N_n = k) \to \frac{e^{-1}}{k!} \text{ as } n \to \infty, \quad k \in \mathbb{N} \tag{12.5.9}$$
Proof
In the matching experiment, increase n and note how the probability density function stabilizes rapidly. For selected values of
n , run the simulation 1000 times and compare the relative frequency function to the probability density function.
Moments
The mean and variance of the number of matches could be computed directly from the distribution. However, it is much better to
use the representation in terms of indicator variables. The exchangeable property is an important tool in this section.
$E(I_j) = \frac{1}{n}$ for $j \in \{1, 2, \ldots, n\}$.
Proof
Thus, the expected number of matches is 1, regardless of n , just as when the sampling is with replacement.
$\operatorname{var}(I_j) = \frac{n-1}{n^2}$ for $j \in \{1, 2, \ldots, n\}$.
Proof
A match in one position would seem to make it more likely that there would be a match in another position. Thus, we might guess
that the indicator variables are positively correlated.
1. $\operatorname{cov}(I_j, I_k) = \frac{1}{n^2(n-1)}$
2. $\operatorname{cor}(I_j, I_k) = \frac{1}{(n-1)^2}$
Proof
Note that when n = 2 , the event that there is a match in position 1 is perfectly correlated with the event that there is a match in
position 2. This makes sense, since there will either be 0 matches or 2 matches.
In the matching experiment, vary the parameter n and note the shape and location of the mean ± standard deviation bar. For
selected values of the parameter, run the simulation 1000 times and compare the sample mean and standard deviation to the
distribution mean and standard deviation.
For distinct $j, k \in \{1, 2, \ldots, n\}$, $\operatorname{cov}(I_j, I_k) \to 0$ as $n \to \infty$.
Thus, the event that a match occurs in position j is nearly independent of the event that a match occurs in position k if n is large.
For large $n$, the indicator variables behave nearly like $n$ Bernoulli trials with success probability $\frac{1}{n}$, which of course is what happens when the sampling is with replacement.
A Recursion Relation
In this subsection, we will give an alternate derivation of the distribution of the number of matches, in a sense by embedding the
experiment with parameter n into the experiment with parameter n + 1 .
The probability density function of the number of matches satisfies the following recursion relation and initial condition:
1. $P(N_n = k) = (k+1)\, P(N_{n+1} = k+1)$ for $k \in \{0, 1, \ldots, n\}$.
2. $P(N_1 = 1) = 1$.
Proof
This result can be used to obtain the probability density function of $N_n$ recursively for any $n$.
For $n \in \mathbb{N}_+$, the probability generating function of $N_n$ is
$$G_n(t) = E\left(t^{N_n}\right) = \sum_{j=0}^n P(N_n = j)\, t^j, \quad t \in \mathbb{R} \tag{12.5.13}$$
The family of probability generating functions satisfies the following differential equations and ancillary conditions:
1. $G_{n+1}'(t) = G_n(t)$ for $t \in \mathbb{R}$ and $n \in \mathbb{N}_+$
2. $G_n(1) = 1$ for $n \in \mathbb{N}_+$
Note also that $G_1(t) = t$ for $t \in \mathbb{R}$. Thus, the system of differential equations can be used to compute $G_n$ for any $n \in \mathbb{N}_+$.
In particular, for $t \in \mathbb{R}$,
1. $G_2(t) = \frac{1}{2} + \frac{1}{2} t^2$
2. $G_3(t) = \frac{1}{3} + \frac{1}{2} t + \frac{1}{6} t^3$
3. $G_4(t) = \frac{3}{8} + \frac{1}{3} t + \frac{1}{4} t^2 + \frac{1}{24} t^4$
Proof
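The system $G_{n+1}' = G_n$ with $G_{n+1}(1) = 1$ can be iterated with exact rational arithmetic; a sketch (the function name is ours):

```python
from fractions import Fraction

def matching_pgf(n):
    """Coefficients [P(N_n = 0), ..., P(N_n = n)] of G_n, computed by
    integrating G_n to get G_{n+1} up to a constant fixed by G_{n+1}(1) = 1."""
    g = [Fraction(0), Fraction(1)]                     # G_1(t) = t
    for _ in range(n - 1):
        # antiderivative: coefficient a of t^j becomes a/(j+1) on t^(j+1)
        integ = [Fraction(0)] + [a / (j + 1) for j, a in enumerate(g)]
        integ[0] = 1 - sum(integ)                      # force value 1 at t = 1
        g = integ
    return g
```

For example, `matching_pgf(4)` returns the coefficients $\frac{3}{8}, \frac{1}{3}, \frac{1}{4}, 0, \frac{1}{24}$ listed above.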
For $n \in \mathbb{N}_+$,
$$P(N_n = k) = \frac{1}{k!}\, P(N_{n-k} = 0), \quad k \in \{0, 1, \ldots, n-1\} \tag{12.5.15}$$
Proof
Ten married couples are randomly paired for a dance. Find each of the following:
1. The probability density function of the number of matches.
2. The mean and variance of the number of matches.
3. The probability of at least 3 matches.
Answer
In the matching experiment, set n = 10 . Run the experiment 1000 times and compare the following for the number of matches:
1. The true probabilities
2. The relative frequencies from the simulation
3. The limiting Poisson probabilities
Answer
This page titled 12.5: The Matching Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
12.6: The Birthday Problem
Introduction
The Sampling Model
As in the basic sampling model, suppose that we select $n$ numbers at random, with replacement, from the population $D = \{1, 2, \ldots, m\}$. Thus, our outcome vector is $X = (X_1, X_2, \ldots, X_n)$ where $X_i$ is the $i$th number chosen. Recall that our basic modeling assumption is that $X$ is a sequence of independent random variables, each uniformly distributed on $D$.
In this section, we are interested in the number of population values missing from the sample, and the number of (distinct)
population values in the sample. The computation of probabilities related to these random variables are generally referred to as
birthday problems. Often, we will interpret the sampling experiment as a distribution of $n$ balls into $m$ cells; $X_i$ is the cell number of ball $i$. In this interpretation, our interest is in the number of empty cells and the number of occupied cells.
For $i \in D$, let $Y_i$ denote the number of times that $i$ occurs in the sample:
$$Y_i = \sum_{j=1}^n \mathbf{1}(X_j = i) \tag{12.6.1}$$
$Y = (Y_1, Y_2, \ldots, Y_m)$ has the multinomial distribution with parameters $n$ and $(1/m, 1/m, \ldots, 1/m)$:
$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_m = y_m) = \binom{n}{y_1, y_2, \ldots, y_m}\left(\frac{1}{m}\right)^n, \quad (y_1, y_2, \ldots, y_m) \in \mathbb{N}^m \text{ with } \sum_{i=1}^m y_i = n \tag{12.6.2}$$
Proof
The number of population values missing from the sample is
$$U = \sum_{j=1}^m \mathbf{1}(Y_j = 0) \tag{12.6.3}$$
and the number of (distinct) population values that occur in the sample is
$$V = \sum_{j=1}^m \mathbf{1}(Y_j > 0) \tag{12.6.4}$$
Also, $U$ takes values in $\{\max\{m-n, 0\}, \ldots, m-1\}$ and $V$ takes values in $\{1, 2, \ldots, \min\{m, n\}\}$.
Clearly we must have $U + V = m$ so once we have the probability distribution and moments of one variable, we can easily find them for the other variable. However, we will first solve the simplest version of the birthday problem.
Let $B_{m,n}$ denote the event that there is at least one duplication among the sample values. The (simple) birthday problem is to compute the probability of this event. For example, suppose that we choose $n$ people at random and note their birthdays. If we ignore leap years and assume that birthdays are uniformly distributed throughout the year, then our sampling model applies with $m = 365$. In this setting, the birthday problem is to compute the probability that at least two people have the same birthday (this special case is the origin of the name).
The solution of the birthday problem is an easy exercise in combinatorial probability.
12.6.1 https://stats.libretexts.org/@go/page/10249
$$P(B_{m,n}) = 1 - \frac{m^{(n)}}{m^n}, \quad n \le m \tag{12.6.6}$$
where $m^{(n)} = m(m-1)\cdots(m-n+1)$ is the falling power.
The fact that the probability is 1 for n > m is sometimes referred to as the pigeonhole principle: if more than m pigeons are placed
into m holes then at least one hole has 2 or more pigeons. The following result gives a recurrence relation for the probability of
distinct sample values and thus gives another way to compute the birthday probability.
Let $p_{m,n}$ denote the probability of the complementary birthday event $B_{m,n}^c$, that the sample variables are distinct, with population size $m$ and sample size $n$. Then $p_{m,n}$ satisfies the following recursion relation and initial condition:
1. $p_{m,n+1} = \frac{m-n}{m}\, p_{m,n}$
2. $p_{m,1} = 1$
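The recursion gives a one-line loop for the birthday probability; a sketch (the function name is ours):

```python
def p_distinct(m, n):
    # p_{m,n}: probability that all n sample values are distinct, via
    # p_{m,n+1} = ((m - n)/m) p_{m,n} with p_{m,1} = 1
    p = 1.0
    for j in range(1, n):
        p *= (m - j) / m
    return p

p_birthday = 1 - p_distinct(365, 23)   # about 0.507
```

This reproduces the table of values below, and confirms that $n = 23$ is the smallest sample size for which the birthday probability exceeds $\frac{1}{2}$.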
Examples
Let m = 365 (the standard birthday problem).
1. $P(B_{365,10}) = 0.117$
2. $P(B_{365,20}) = 0.411$
3. $P(B_{365,30}) = 0.706$
4. $P(B_{365,40}) = 0.891$
5. $P(B_{365,50}) = 0.970$
6. $P(B_{365,60}) = 0.994$
Figure 12.6.1: $P(B_{365,n})$ as a function of $n$, smoothed for the sake of appearance
In the birthday experiment, set $m = 365$ and select the indicator variable $I$. For $n \in \{10, 20, 30, 40, 50, 60\}$ run the experiment 1000 times each and compare the relative frequencies with the true probabilities.
In spite of its easy solution, the birthday problem is famous because, numerically, the probabilities can be a bit surprising. Note that with just 60 people, the event is almost certain! With just 23 people, the birthday event is about $\frac{1}{2}$; specifically $P(B_{365,23}) = 0.507$. Mathematically, the rapid increase in the birthday probability, as $n$ increases, is due to the fact that $m^n$ grows much faster than $m^{(n)}$.
Four fair, standard dice are rolled. Find the probability that the scores are distinct.
Answer
In the birthday experiment, set m = 6 and select the indicator variable I . Vary n with the scrollbar and note graphically how
the probabilities change. Now with n = 4 , run the experiment 1000 times and compare the relative frequency of the event to
the corresponding probability.
1. Find the probability that at least 2 have the same birth month.
2. Criticize the sampling model in this setting.
Answer
In the birthday experiment, set m = 12 and select the indicator variable I . Vary n with the scrollbar and note graphically how
the probabilities change. Now with n = 5 , run the experiment 1000 times and compare the relative frequency of the event to
the corresponding probability.
A fast-food restaurant gives away one of 10 different toys with the purchase of a kid's meal. A family with 5 children buys 5
kid's meals. Find the probability that the 5 toys are different.
Answer
In the birthday experiment, set m = 10 and select the indicator variable I . Vary n with the scrollbar and note graphically how
the probabilities change. Now with n = 5 , run the experiment 1000 times and compare the relative frequency of the event to
the corresponding probability.
Let $m = 52$. Find the smallest value of $n$ such that the probability of a duplication is at least $\frac{1}{2}$.
Answer
Proof
The distributions of the number of excluded values and the number of distinct values are now easy.
Proof
Proof
In the birthday experiment, select the number of distinct sample values. Vary the parameters and note the shape and location of
the probability density function. For selected values of the parameters, run the simulation 1000 and compare the relative
frequency function to the probability density function.
The distribution of the number of excluded values can also be obtained by a recursion argument.
Let $f_{m,n}$ denote the probability density function of the number of excluded values $U$, when the population size is $m$ and the sample size is $n$. Then
1. $f_{m,1}(m-1) = 1$
2. $f_{m,n+1}(j) = \frac{m-j}{m}\, f_{m,n}(j) + \frac{j+1}{m}\, f_{m,n}(j+1)$
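The recursion is straightforward to implement; a sketch (the function name is ours) that builds $f_{m,n}$ iteratively:

```python
def excluded_pdf(m, n):
    # f_{m,n}(j) = P(U = j) via the recursion
    # f_{m,n+1}(j) = ((m-j)/m) f_{m,n}(j) + ((j+1)/m) f_{m,n}(j+1)
    f = {m - 1: 1.0}                                   # f_{m,1}(m-1) = 1
    for _ in range(n - 1):
        f = {j: (m - j) / m * f.get(j, 0.0) + (j + 1) / m * f.get(j + 1, 0.0)
             for j in range(m)}
    return f
```

For example, with $m = 6$ and $n = 10$ the mean $\sum_j j\, f_{6,10}(j)$ agrees with the formula $E(U) = m(1 - 1/m)^n$ given later in this section.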
Moments
Now we will find the means and variances. The number of excluded values and the number of distinct values are counting variables
and hence can be written as sums of indicator variables. As we have seen in many other models, this representation is frequently
the best for computing moments.
For $j \in \{1, 2, \ldots, m\}$, let $I_j = \mathbf{1}(Y_j = 0)$, the indicator variable of the event that $j$ is not in the sample. Note that the number of population values missing in the sample can be written as the sum of the indicator variables:
$$U = \sum_{j=1}^m I_j \tag{12.6.13}$$
For distinct $i, j \in \{1, 2, \ldots, m\}$,
1. $E(I_j) = \left(1 - \frac{1}{m}\right)^n$
2. $\operatorname{var}(I_j) = \left(1 - \frac{1}{m}\right)^n - \left(1 - \frac{1}{m}\right)^{2n}$
3. $\operatorname{cov}(I_i, I_j) = \left(1 - \frac{2}{m}\right)^n - \left(1 - \frac{1}{m}\right)^{2n}$
Proof
The expected number of excluded values and the expected number of distinct values are
1. $E(U) = m\left(1 - \frac{1}{m}\right)^n$
2. $E(V) = m\left[1 - \left(1 - \frac{1}{m}\right)^n\right]$
Proof
The variance of the number of excluded values and the variance of the number of distinct values are
$$\operatorname{var}(U) = \operatorname{var}(V) = m(m-1)\left(1 - \frac{2}{m}\right)^n + m\left(1 - \frac{1}{m}\right)^n - m^2\left(1 - \frac{1}{m}\right)^{2n} \tag{12.6.14}$$
Proof
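For small $m$ and $n$ the mean and variance formulas can be checked by brute-force enumeration of all $m^n$ equally likely samples; a sketch (the parameter values are illustrative):

```python
from itertools import product

m, n = 3, 4
samples = list(product(range(1, m + 1), repeat=n))       # all m^n outcomes
u_vals = [m - len(set(s)) for s in samples]              # U = excluded count
emp_mean = sum(u_vals) / len(samples)
emp_var = sum(u * u for u in u_vals) / len(samples) - emp_mean ** 2

formula_mean = m * (1 - 1 / m) ** n
formula_var = (m * (m - 1) * (1 - 2 / m) ** n
               + m * (1 - 1 / m) ** n
               - m ** 2 * (1 - 1 / m) ** (2 * n))
```

Both the empirical mean and variance over the 81 outcomes agree with the formulas exactly (up to floating-point error).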
In the birthday experiment, select the number of distinct sample values. Vary the parameters and note the size and location of
the mean ± standard-deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the
sample mean and variance to the distribution mean and variance.
In the birthday experiment, set m = 365 and n = 30 . Run the experiment 1000 times with an update frequency of 10 and
compute the relative frequency of the event in part (d) of the last exercise.
Suppose that 10 fair dice are rolled. Find each of the following:
1. The probability density function of the number of distinct scores.
2. The mean of the number of distinct scores.
3. The variance of the number of distinct scores.
4. The probability that there will be 4 or fewer distinct scores.
Answer
In the birthday experiment, set m = 6 and n = 10 . Run the experiment 1000 times and compute the relative frequency of the
event in part (d) of the last exercise.
A fast food restaurant gives away one of 10 different toys with the purchase of each kid's meal. A family buys 15 kid's meals.
Find each of the following:
1. The probability density function of the number of toys that are missing.
2. The mean of the number of toys that are missing.
3. The variance of the number of toys that are missing.
4. The probability that at least 3 toys are missing.
Answer
In the birthday experiment, set m = 10 and n = 15 . Run the experiment 1000 times and compute the relative frequency of the
event in part (d).
The lying students problem. Suppose that 3 students, who ride together, miss a mathematics exam. They decide to lie to the
instructor by saying that the car had a flat tire. The instructor separates the students and asks each of them which tire was flat.
The students, who did not anticipate this, select their answers independently and at random. Find each of the following:
1. The probability density function of the number of distinct answers.
2. The probability that the students get away with their deception.
3. The mean of the number of distinct answers.
4. The standard deviation of the number of distinct answers.
Answer
The duck hunter problem. Suppose that there are 5 duck hunters, each a perfect shot. A flock of 10 ducks fly over, and each
hunter selects one duck at random and shoots. Find each of the following:
1. The probability density function of the number of ducks that are killed.
2. The mean of the number of ducks that are killed.
3. The standard deviation of the number of ducks that are killed.
Answer
This page titled 12.6: The Birthday Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
12.7: The Coupon Collector Problem
Basic Theory
Definitions
In this section, our random experiment is to sample repeatedly, with replacement, from the population D = {1, 2, … , m}. This
generates a sequence of independent random variables $X = (X_1, X_2, \ldots)$, each uniformly distributed on $D$.
We will often interpret the sampling in terms of a coupon collector: each time the collector buys a certain product (bubble gum or
Cracker Jack, for example) she receives a coupon (a baseball card or a toy, for example) which is equally likely to be any one of m
types. Thus, in this setting, $X_i \in D$ is the coupon type received on the $i$th purchase.
Let $V_n$ denote the number of distinct values in the first $n$ selections, for $n \in \mathbb{N}_+$. This is the random variable studied in the last section on the Birthday Problem. Our interest in this section is in the sample size needed to get a specified number of distinct sample values:
$$W_k = \min\{n \in \mathbb{N}_+ : V_n = k\} \tag{12.7.1}$$
In terms of the coupon collector, this random variable gives the number of products required to get $k$ distinct coupon types. Note that the set of possible values of $W_k$ is $\{k, k+1, \ldots\}$. We will be particularly interested in $W_m$, the sample size needed to get the entire population. In terms of the coupon collector, this is the number of products required to get the entire set of coupons.
In the coupon collector experiment, run the experiment in single-step mode a few times for selected values of the parameters.
$W_k$ has probability density function
$$P(W_k = n) = \binom{m-1}{k-1} \sum_{j=0}^{k-1} (-1)^j \binom{k-1}{j} \left(\frac{k-j-1}{m}\right)^{n-1}, \quad n \in \{k, k+1, \ldots\} \tag{12.7.2}$$
Proof
In the coupon collector experiment, vary the parameters and note the shape of and position of the probability density function.
For selected values of the parameters, run the experiment 1000 times and compare the relative frequency function to the
probability density function.
The probability density function $g_k$ of $W_k$ satisfies the following recursion relation and initial condition:
1. $g_k(n+1) = \frac{k-1}{m}\, g_k(n) + \frac{m-k+1}{m}\, g_{k-1}(n)$
2. $g_1(1) = 1$
Decomposition as a Sum
We will now show that $W_k$ can be decomposed as a sum of $k$ independent, geometrically distributed random variables. This will provide some additional insight into the nature of the distribution and will make the computation of the mean and variance easy.
For $i \in \{1, 2, \ldots, m\}$, let $Z_i$ denote the number of additional samples needed to go from $i-1$ distinct values to $i$ distinct values. Then $Z = (Z_1, Z_2, \ldots, Z_m)$ is a sequence of independent random variables, and $Z_i$ has the geometric distribution on $\mathbb{N}_+$ with parameter $p_i = \frac{m-i+1}{m}$. Moreover,
$$W_k = \sum_{i=1}^k Z_i, \quad k \in \{1, 2, \ldots, m\} \tag{12.7.4}$$
This result shows clearly that each time a new coupon is obtained, it becomes harder to get the next new coupon.
In the coupon collector experiment, run the experiment in single-step mode a few times for selected values of the parameters.
In particular, try this with m large and k near m.
Moments
The decomposition as a sum of independent variables provides an easy way to compute the mean and other moments of $W_k$.
The mean and variance of the sample size needed to get $k$ distinct values are
1. $E(W_k) = \sum_{i=1}^k \frac{m}{m-i+1}$
2. $\operatorname{var}(W_k) = \sum_{i=1}^k \frac{(i-1)m}{(m-i+1)^2}$
Proof
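For the die ($m = 6$), the sums evaluate to familiar numbers; a quick sketch (the variable names are ours):

```python
m = 6
# E(W_m): expected wait for each new face, summed over the m stages
mean_w = sum(m / (m - i + 1) for i in range(1, m + 1))
# var(W_m): variance of each geometric wait, summed (by independence)
var_w = sum((i - 1) * m / (m - i + 1) ** 2 for i in range(1, m + 1))
print(round(mean_w, 2))   # 14.7 throws, on average, to see all six faces
```

Note how the later stages dominate: the last new face alone contributes $m/1 = 6$ to the mean, reflecting that each new coupon is harder to get than the one before.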
In the coupon collector experiment, vary the parameters and note the shape and location of the mean ± standard deviation bar.
For selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard deviation to
the distribution mean and standard deviation.
$W_k$ has probability generating function
$$E\left(t^{W_k}\right) = \prod_{i=1}^k \frac{(m-i+1)\,t}{m-(i-1)t}, \quad |t| < \frac{m}{k-1} \tag{12.7.5}$$
Proof
Suppose that a standard, fair die is thrown until all 6 scores have occurred. Find each of the following:
1. The probability density function of the number of throws.
2. The mean of the number of throws.
3. The variance of the number of throws.
4. The probability that at least 10 throws are required.
Answer
A box of a certain brand of cereal comes with a special toy. There are 10 different toys in all. A collector buys boxes of cereal
until she has all 10 toys. Find each of the following:
1. The probability density function of the number of boxes purchased.
2. The mean of the number of boxes purchased.
3. The variance of the number of boxes purchased.
4. The probability that no more than 15 boxes were purchased.
Answer
This page titled 12.7: The Coupon Collector Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
12.8: Pólya's Urn Process
Basic Theory
The Model
Pólya's urn scheme is a dichotomous sampling model that generalizes the hypergeometric model (sampling without replacement)
and the Bernoulli model (sampling with replacement). Pólya's urn process leads to a famous example of a sequence of random variables that is exchangeable, but not independent, and has deep connections with the beta-Bernoulli process.
Suppose that we have an urn (what else!) that initially contains a red and b green balls, where a and b are positive integers. At each
discrete time (trial), we select a ball from the urn and then return the ball to the urn along with c new balls of the same color.
Ordinarily, the parameter c is a nonnegative integer. However, the model actually makes sense if c is a negative integer, if we
interpret this to mean that we remove the balls rather than add them, and assuming that there are enough balls of the proper color in
the urn to perform this action. In any case, the random process is known as Pólya's urn process, named for George Pólya.
In terms of the colors of the selected balls, Pólya's urn scheme generalizes the standard models of sampling with and without
replacement.
1. c = 0 corresponds to sampling with replacement.
2. c = −1 corresponds to sampling without replacement.
For the most part, we will assume that c is nonnegative so that the process can be continued indefinitely. Occasionally we consider
the case c = −1 so that we can interpret the results in terms of sampling without replacement.
process is the sequence of indicator variables $X = (X_1, X_2, \ldots)$, known as the Pólya process. As with any random process, our first goal is to compute the finite dimensional distributions of $X$. That is, we want to compute the joint distribution of $(X_1, X_2, \ldots, X_n)$ for each $n \in \mathbb{N}_+$. Some additional notation will really help. Recall the generalized permutation formula from our study of combinatorial structures:
$$r^{(s,j)} = r(r+s)(r+2s)\cdots(r+(j-1)s)$$
Note that the expression has $j$ factors, starting with $r$, and with each factor obtained by adding $s$ to the previous factor. As usual, we adopt the convention that a product over an empty index set is 1. Hence $r^{(s,0)} = 1$ for every $r$ and $s$.
Recall that
1. $r^{(0,j)} = r^j$, an ordinary power
2. $r^{(-1,j)} = r^{(j)} = r(r-1)\cdots(r-j+1)$, a descending power
3. $r^{(1,j)} = r^{[j]} = r(r+1)\cdots(r+j-1)$, an ascending power
4. $r^{(r,j)} = j!\, r^j$
5. $1^{(1,j)} = j!$
Proof
The finite dimensional distributions are easy to compute using the multiplication rule of conditional probability. If we know the
contents of the urn at any given time, then the probability of an outcome at the next time is all but trivial.
12.8.1 https://stats.libretexts.org/@go/page/10251
Let $n \in \mathbb{N}_+$, $(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$ and let $k = x_1 + x_2 + \cdots + x_n$. Then
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a^{(c,k)}\, b^{(c,n-k)}}{(a+b)^{(c,n)}} \tag{12.8.3}$$
Proof
The joint probability in the previous exercise depends on $(x_1, x_2, \ldots, x_n)$ only through the number of red balls $k = \sum_{i=1}^n x_i$ in the sample. Thus, the joint distribution is invariant under a permutation of $(x_1, x_2, \ldots, x_n)$, and hence $X$ is an exchangeable sequence of random variables. This means that for each $n$, all permutations of $(X_1, X_2, \ldots, X_n)$ have the same distribution. Of
course the joint distribution reduces to the formulas we have obtained earlier in the special cases of sampling with replacement (
c = 0 ) or sampling without replacement (c = −1 ), although in the latter case we must have n ≤ a + b . When c > 0 , the Pólya
process is a special case of the beta-Bernoulli process, studied in the chapter on Bernoulli trials.
The Pólya process $X = (X_1, X_2, \ldots)$ with parameters $a, b, c \in \mathbb{N}_+$ is the beta-Bernoulli process with parameters $a/c$ and $b/c$. That is, for $n \in \mathbb{N}_+$, $(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$, and with $k = x_1 + x_2 + \cdots + x_n$,
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{(a/c)^{[k]}\, (b/c)^{[n-k]}}{(a/c + b/c)^{[n]}} \tag{12.8.5}$$
Proof
Recall that the beta-Bernoulli process is obtained, in the usual formulation, by randomizing the success parameter in a Bernoulli
trials sequence, giving the success parameter a beta distribution. So specifically, suppose $a, b, c \in \mathbb{N}_+$ and that random variable $P$
has the beta distribution with parameters a/c and b/c. Suppose also that given P = p ∈ (0, 1) , the random process
$X = (X_1, X_2, \ldots)$ is a sequence of Bernoulli trials with success parameter $p$. Then $X$ is the Pólya process with parameters $a, b, c$. This is a fascinating connection between two processes that, at first, seem to have little in common. In fact however, every
exchangeable sequence of indicator random variables can be obtained by randomizing the success parameter in a sequence of
Bernoulli trials. This is de Finetti's theorem, named for Bruno de Finetti, which is studied in the section on backwards martingales.
When $c \in \mathbb{N}_+$, all of the results in this section are special cases of the corresponding results for the beta-Bernoulli process, but it is still interesting to give direct proofs.
For each $i \in \mathbb{N}_+$,
1. $E(X_i) = \frac{a}{a+b}$
2. $\operatorname{var}(X_i) = \frac{ab}{(a+b)^2}$
Proof
Thus X is a sequence of identically distributed variables, quite surprising at first but of course inevitable for any exchangeable
sequence. Compare the joint and marginal distributions. Note that X is an independent sequence if and only if c = 0 , when we
have simple sampling with replacement. Pólya's urn is one of the most famous examples of a random process in which the outcome
variables are exchangeable, but dependent (in general).
Next, let's compute the covariance and correlation of a pair of outcome variables.
For distinct $i, j \in \mathbb{N}_+$,
1. $\operatorname{cov}(X_i, X_j) = \frac{abc}{(a+b)^2(a+b+c)}$
2. $\operatorname{cor}(X_i, X_j) = \frac{c}{a+b+c}$
Proof
Thus, the variables are positively correlated if c > 0 , negatively correlated if c < 0 , and uncorrelated (in fact, independent), if
c = 0 . These results certainly make sense when we recall the dynamics of Pólya's urn. It turns out that in any infinite sequence of
exchangeable variables, the variables must be nonnegatively correlated. Here is another result that explores how the variables are
related.
Suppose that $n \in \mathbb{N}_+$ and $(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$. Let $k = \sum_{i=1}^n x_i$. Then
$$P(X_{n+1} = 1 \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a + kc}{a + b + nc} \tag{12.8.7}$$
Proof
Pólya's urn is described by a sequence of indicator variables. We can study the same derived random processes that we studied with Bernoulli trials: the number of red balls in the first $n$ trials, the trial number of the $k$th red ball, and so forth. Let
$$Y_n = \sum_{i=1}^n X_i \tag{12.8.8}$$
Note that
1. The number of green balls selected in the first $n$ trials is $n - Y_n$.
2. The number of red balls in the urn after the first $n$ trials is $a + cY_n$.
3. The number of green balls in the urn after the first $n$ trials is $b + c(n - Y_n)$.
4. The number of balls in the urn after the first $n$ trials is $a + b + cn$.
$Y_n$ has probability density function
$$P(Y_n = k) = \binom{n}{k} \frac{a^{(c,k)}\, b^{(c,n-k)}}{(a+b)^{(c,n)}}, \quad k \in \{0, 1, \ldots, n\} \tag{12.8.9}$$
Proof
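The Pólya density is easy to evaluate exactly once the generalized power $r^{(s,j)}$ is implemented; a sketch (the function names are ours):

```python
from fractions import Fraction
from math import comb

def gen_perm(r, s, j):
    # Generalized permutation r^(s,j) = r (r+s) (r+2s) ... (r+(j-1)s)
    out = Fraction(1)
    for t in range(j):
        out *= r + t * s
    return out

def polya_pdf(n, a, b, c):
    # P(Y_n = k) = C(n,k) a^(c,k) b^(c,n-k) / (a+b)^(c,n)
    return [comb(n, k) * gen_perm(a, c, k) * gen_perm(b, c, n - k)
            / gen_perm(a + b, c, n) for k in range(n + 1)]
```

With $c = 0$ this reduces to the binomial distribution, and with $c = -1$ to the hypergeometric distribution, as described in the text.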
The distribution defined by this probability density function is known, appropriately enough, as the Pólya distribution with
parameters n , a , b , and c . Of course, the distribution reduces to the binomial distribution with parameters n and a/(a + b) in the
case of sampling with replacement (c = 0 ) and to the hypergeometric distribution with parameters n , a , and b in the case of
sampling without replacement ($c = -1$), although again in this case we need $n \le a + b$. When $c > 0$, the Pólya distribution is a
special case of the beta-binomial distribution.
If a, b, c ∈ N + then the Pólya distribution with parameters a, b, c is the beta-binomial distribution with parameters a/c and
b/c. That is,
P(Y_n = k) = C(n, k) (a/c)^{[k]} (b/c)^{[n−k]} / (a/c + b/c)^{[n]},  k ∈ {0, 1, …, n}  (12.8.10)
Proof
The case where all three parameters are equal is particularly interesting.
If a = b = c then Y_n is uniformly distributed on {0, 1, …, n}.
Proof
Start the simulation of the Pólya Urn Experiment. Vary the parameters and note the shape of the probability density function. In
particular, note when the function is skewed, when the function is symmetric, when the function is unimodal, when the
function is monotone, and when the function is U-shaped. For various values of the parameters, run the simulation 1000 times
and compare the empirical density function to the probability density function.
12.8.3 https://stats.libretexts.org/@go/page/10251
The probability density function of Y_n is
1. unimodal if a > b > c and n > (a − c)/(b − c)
2. unimodal if b > a > c and n > (b − c)/(a − c)
3. U-shaped if c > b > a and n > (c − a)/(c − b)
4. U-shaped if c > a > b and n > (c − b)/(c − a)
Next, let's find the mean and variance. Curiously, the mean does not depend on the parameter c .
The mean and variance of the number of red balls selected are
1. E(Y_n) = n a/(a + b)
2. var(Y_n) = n [ab/(a + b)^2] [1 + (n − 1) c/(a + b + c)]
Proof
Start the simulation of the Pólya Urn Experiment. Vary the parameters and note the shape and location of the mean ± standard
deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical mean and
standard deviation to the distribution mean and standard deviation.
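The mean and variance formulas can also be verified by direct summation against the probability density function (12.8.9); a small sketch (helper names are my own, parameter values chosen for illustration):

```python
from math import comb

def gen_perm(x, k, c):
    # generalized permutation x^{(c,k)} = x(x + c)(x + 2c)...(x + (k-1)c)
    prod = 1.0
    for i in range(k):
        prod *= x + i * c
    return prod

def polya_pmf(k, n, a, b, c):
    # equation (12.8.9)
    return comb(n, k) * gen_perm(a, k, c) * gen_perm(b, n - k, c) / gen_perm(a + b, n, c)

n, a, b, c = 5, 6, 4, 3
pmf = [polya_pmf(k, n, a, b, c) for k in range(n + 1)]
mean = sum(k * p for k, p in enumerate(pmf))
var = sum(k * k * p for k, p in enumerate(pmf)) - mean ** 2

# the closed forms above
mean_formula = n * a / (a + b)
var_formula = n * a * b / (a + b) ** 2 * (1 + (n - 1) * c / (a + b + c))
```

Note that the mean matches the c = 0 (binomial) case exactly, while the factor 1 + (n − 1)c/(a + b + c) inflates or deflates the variance according to the sign of c.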
Explicitly compute the probability density function, mean, and variance of Y_5 when a = 6, b = 4, and for the values of c ∈ {−1, 0, 1, 2, 3, 10}. Sketch the graph of the density function in each case.
For fixed n, as c → ∞:
1. P(Y_n = 0) → b/(a + b)
2. P(Y_n = n) → a/(a + b)
3. P(Y_n ∈ {1, 2, …, n − 1}) → 0
Proof
Thus, the limiting distribution of Y_n as c → ∞ is concentrated on 0 and n. The limiting probabilities are just the initial proportion
of green and red balls, respectively. Interpret this result in terms of the dynamics of Pólya's urn scheme.
Our next result gives the conditional distribution of X_{n+1} given Y_n.
Proof
In particular, if a = b = c then P(X_{n+1} = 1 ∣ Y_n = n) = (n + 1)/(n + 2). This is Laplace's rule of succession, another interesting
connection. The rule is named for Pierre Simon Laplace, and is studied from a different point of view in the section on
Independence.
The proportion of red balls selected in the first n trials is
M_n = Y_n/n = (1/n) ∑_{i=1}^{n} X_i  (12.8.15)
This is an interesting variable, since a little reflection suggests that it may have a limit as n → ∞. Indeed, if c = 0, then M_n is just the sample mean corresponding to n Bernoulli trials. Thus, by the law of large numbers, M_n converges to the success parameter a/(a + b) as n → ∞ with probability 1. On the other hand, the proportion of red balls in the urn after n trials is
Z_n = (a + cY_n)/(a + b + cn)  (12.8.16)
When c = 0, of course, Z_n = a/(a + b), so that in this case, Z_n and M_n have the same limiting behavior. Note that
Z_n = a/(a + b + cn) + [cn/(a + b + cn)] M_n  (12.8.17)
Since the constant term converges to 0 as n → ∞ and the coefficient of M_n converges to 1 as n → ∞, it follows that the limits of M_n and Z_n as n → ∞ will be the same, if the limit exists, for any mode of convergence: with probability 1, in mean, or in distribution.
Suppose that a, b, c ∈ N+. There exists a random variable P having the beta distribution with parameters a/c and b/c such that M_n → P and Z_n → P as n → ∞ with probability 1 and in mean square, and hence also in distribution.
Proof
It turns out that the random process Z = {Z_n = (a + cY_n)/(a + b + cn) : n ∈ N} is a martingale. The theory of martingales provides powerful tools for studying convergence in Pólya's urn process. As an interesting special case, note that if a = b = c then
the limiting distribution is the uniform distribution on (0, 1).
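A quick simulation illustrates both the martingale property E(Z_n) = a/(a + b) for every n and the fact that the limit is random rather than degenerate. A sketch under the parameter choices shown (the function name is my own):

```python
import random

def final_Z(n_steps, a, b, c, rng):
    # run Polya's urn for n_steps draws and return Z_n,
    # the proportion of red balls in the urn
    red, green = a, b
    for _ in range(n_steps):
        if rng.random() < red / (red + green):
            red += c
        else:
            green += c
    return red / (red + green)

rng = random.Random(42)
a, b, c = 6, 4, 2
finals = [final_Z(2000, a, b, c, rng) for _ in range(1000)]
avg = sum(finals) / len(finals)
# avg should be near a/(a+b) = 0.6, while the spread of the sample
# reflects the nondegenerate beta(a/c, b/c) = beta(3, 2) limit
```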
Next, let V_k denote the trial number of the kth red ball selected. Thus
V_k = min{n ∈ N+ : Y_n = k}  (12.8.18)
For k, n ∈ N+ with k ≤ n,
1. V_k ≤ n if and only if Y_n ≥ k
2. P(V_k = n) = C(n − 1, k − 1) a^{(c,k)} b^{(c,n−k)} / (a + b)^{(c,n)},  n ∈ {k, k + 1, …}  (12.8.19)
Proof
Of course this probability density function reduces to the negative binomial density function with trial parameter k and success parameter p = a/(a + b) when c = 0 (sampling with replacement). When c > 0, the distribution is a special case of the beta-negative binomial distribution.
If a, b, c ∈ N+ then V_k has the beta-negative binomial distribution with parameters k, a/c, and b/c. That is,
P(V_k = n) = C(n − 1, k − 1) (a/c)^{[k]} (b/c)^{[n−k]} / (a/c + b/c)^{[n]},  n ∈ {k, k + 1, …}  (12.8.20)
Proof
If a = b = c then
P(V_k = n) = k/[n(n + 1)],  n ∈ {k, k + 1, k + 2, …}  (12.8.21)
Proof
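In this special case the density telescopes, since k/[n(n + 1)] = k[1/n − 1/(n + 1)], so the partial sums can be confirmed exactly with rational arithmetic (a small check, not from the text):

```python
from fractions import Fraction

def pmf_V(k, n):
    # P(V_k = n) = k / (n(n+1)) in the case a = b = c
    return Fraction(k, n * (n + 1))

k, N = 3, 1000
partial = sum(pmf_V(k, n) for n in range(k, N + 1))
# telescoping gives partial = 1 - k/(N+1), so the total mass is 1
```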
For fixed k, as c → ∞:
1. P(V_k = k) → a/(a + b)
2. P(V_k ∈ {k + 1, k + 2, …}) → 0
Thus, the limiting distribution of V_k is concentrated on k and ∞. The limiting probabilities at these two points are just the initial proportion of red and green balls, respectively. Interpret this result in terms of the dynamics of Pólya's urn scheme.
This page titled 12.8: Pólya's Urn Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
12.9: The Secretary Problem
In this section we will study a nice problem known variously as the secretary problem or the marriage problem. It is simple to state
and not difficult to solve, but the solution is interesting and a bit surprising. Also, the problem serves as a nice introduction to the
general area of statistical decision making.
We have n candidates (perhaps applicants for a job or possible marriage partners). The assumptions are
1. The candidates are totally ordered from best to worst with no ties.
2. The candidates arrive sequentially in random order.
3. We can only determine the relative ranks of the candidates as they arrive. We cannot observe the absolute ranks.
4. Our goal is to choose the very best candidate; no one less will do.
5. Once a candidate is rejected, she is gone forever and cannot be recalled.
6. The number of candidates n is known.
The assumptions, of course, are not entirely reasonable in real applications. The last assumption, for example, that n is known, is
more appropriate for the secretary interpretation than for the marriage interpretation.
What is an optimal strategy? What is the probability of success with this strategy? What happens to the strategy and the probability
of success as n increases? In particular, when n is large, is there any reasonable hope of finding the best candidate?
Strategies
Play the secretary game several times with n = 10 candidates. See if you can find a good strategy just by trial and error.
After playing the secretary game a few times, it should be clear that the only reasonable type of strategy is to let a certain number
k − 1 of the candidates go by, and then select the first candidate we see who is better than all of the previous candidates (if she
exists). If she does not exist (that is, if no candidate better than all previous candidates appears), we will agree to accept the last
candidate, even though this means failure. The parameter k must be between 1 and n ; if k = 1 , we select the first candidate; if
k = n , we select the last candidate; for any other value of k , the selected candidate is random, distributed on {k, k + 1, … , n} .
Thus, we compute the probability of success as a function of k, maximize this probability over k to find the optimal strategy, and then take the limit over n to study the asymptotic behavior.
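The threshold strategy described above is easy to simulate; here is a sketch (the encoding of candidates as ranks 1, …, n with 1 the best is my own choice):

```python
import random

def play(n, k, rng):
    # one round: reject the first k-1 candidates, then take the first
    # candidate better than all predecessors (or the last candidate)
    perm = list(range(1, n + 1))  # rank 1 is the best candidate
    rng.shuffle(perm)
    best_seen = min(perm[:k - 1], default=n + 1)
    for j in range(k - 1, n):
        if perm[j] < best_seen:
            return perm[j] == 1  # success iff we picked the best
    return perm[-1] == 1  # forced to accept the last candidate

rng = random.Random(1)
trials = 20000
wins = sum(play(3, 2, rng) for _ in range(trials))
# with n = 3 and k = 2 the exact success probability is 3/6
```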
Analysis
First, let's do some basic computations.
For the case n =3 , list the 6 permutations of {1, 2, 3} and verify the probabilities in the table below. Note that k =2 is
optimal.
k        1    2    3
p_3(k)  2/6  3/6  2/6
Answer
In the secretary experiment, set the number of candidates to n =3 . Run the experiment 1000 times with each strategy
k ∈ {1, 2, 3}
For the case n = 4 , list the 24 permutations of {1, 2, 3, 4} and verify the probabilities in the table below. Note that k =2 is
optimal. The last row gives the total number of successes for each strategy.
12.9.1 https://stats.libretexts.org/@go/page/10252
k        1     2      3      4
p_4(k)  6/24  11/24  10/24  6/24
Answer
In the secretary experiment, set the number of candidates to n =4 . Run the experiment 1000 times with each strategy
k ∈ {1, 2, 3, 4}
For the case n = 5, list the 120 permutations of {1, 2, 3, 4, 5} and verify the probabilities in the table below. Note that k = 3 is optimal.
k        1       2       3       4       5
p_5(k)  24/120  50/120  52/120  42/120  24/120
In the secretary experiment, set the number of candidates to n =5 . Run the experiment 1000 times with each strategy
k ∈ {1, 2, 3, 4, 5}
Well, clearly we don't want to keep doing this. Let's see if we can find a general analysis. With n candidates, let X_n denote the number (arrival order) of the best candidate, and let S_{n,k} denote the event of success for strategy k (we select the best candidate).
Next we will compute the conditional probability of success given the arrival order of the best candidate.
P(S_{n,k} ∣ X_n = j) = 0 for j ∈ {1, 2, …, k − 1}, and P(S_{n,k} ∣ X_n = j) = (k − 1)/(j − 1) for j ∈ {k, k + 1, …, n}  (12.9.1)
Proof
The two cases are illustrated below. The large dot indicates the best candidate. Red dots indicate candidates that are rejected out of
hand, while blue dots indicate candidates that are considered.
For n ∈ N+,
p_n(k) = P(S_{n,k}) = 1/n for k = 1, and p_n(k) = [(k − 1)/n] ∑_{j=k}^{n} 1/(j − 1) for k ∈ {2, 3, …, n}  (12.9.2)
Proof
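Equation (12.9.2) can be evaluated directly, which reproduces the small tables above and the optimal strategy for n = 100 quoted in the text. A sketch:

```python
def p(n, k):
    # success probability of strategy k with n candidates, equation (12.9.2)
    if k == 1:
        return 1 / n
    return (k - 1) / n * sum(1 / (j - 1) for j in range(k, n + 1))

# optimal strategy for n = 100 candidates
k100 = max(range(1, 101), key=lambda k: p(100, k))
```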
Values of the function p_n can be computed by hand for small n and by a computer algebra system for moderate n. The graph of p_100 is shown below. Note the concave downward shape of the graph and the optimal value of k, which turns out to be 38. The optimal probability is about 0.37104.
Figure 12.9.3: The graph of p_100
The optimal strategy k_n that maximizes k ↦ p_n(k), the ratio k_n/n, and the optimal probability p_n(k_n) of finding the best candidate, as functions of n ∈ {3, 4, …, 20}, are given in the following table:
Candidates n   Optimal strategy k_n   Ratio k_n/n   Optimal probability p_n(k_n)
3 2 0.6667 0.5000
4 2 0.5000 0.4583
5 3 0.6000 0.4333
6 3 0.5000 0.4278
7 3 0.4286 0.4143
8 4 0.5000 0.4098
9 4 0.4444 0.4060
10 4 0.4000 0.3987
11 5 0.4545 0.3984
12 5 0.4167 0.3955
13 6 0.4615 0.3923
14 6 0.4286 0.3917
15 6 0.4000 0.3894
16 7 0.4375 0.3881
17 7 0.4118 0.3873
18 7 0.3889 0.3854
19 8 0.4211 0.3850
20 8 0.4000 0.3842
Apparently, as we might expect, the optimal strategy k_n increases and the optimal probability p_n(k_n) decreases as n → ∞. On the other hand, it's encouraging, and a bit surprising, that the optimal probability does not appear to be decreasing to 0. It's perhaps least clear what's going on with the ratio. Graphical displays of some of the information in the table may help:
Figure 12.9.4: The optimal probability p_n(k_n)
Could it be that the ratio k_n/n and the probability p_n(k_n) are both converging, and moreover, are converging to the same number?
First let's try to establish rigorously some of the trends observed in the table.
p_n(k − 1) < p_n(k) if and only if ∑_{j=k}^{n} 1/(j − 1) > 1  (12.9.4)
It follows that for each n ∈ N+, the function p_n at first increases and then decreases. The maximum value of p_n occurs at the largest k with ∑_{j=k}^{n} 1/(j − 1) > 1. This is the optimal strategy with n candidates, which we have denoted by k_n.
Asymptotic Analysis
We are naturally interested in the asymptotic behavior of the function p_n and the optimal strategy as n → ∞. The key is recognizing p_n as a Riemann sum for a simple integral. (Riemann sums, of course, are named for Georg Riemann.)
The optimal strategy k_n that maximizes k ↦ p_n(k), the ratio k_n/n, and the optimal probability p_n(k_n) of finding the best candidate, as functions of n ∈ {10, 20, …, 100}, are given in the following table:
Candidates n   Optimal strategy k_n   Ratio k_n/n   Optimal probability p_n(k_n)
10    4    0.4000   0.3987
20    8    0.4000   0.3842
30    12   0.4000   0.3786
40    16   0.4000   0.3757
50    19   0.3800   0.3743
60    23   0.3833   0.3732
70    27   0.3857   0.3724
80    30   0.3750   0.3719
90    34   0.3778   0.3714
100   38   0.3800   0.3710
The graph below shows the true probabilities p_n(k) and the limiting values −(k/n) ln(k/n) as a function of k with n = 100.
Figure 12.9.6 : True and approximate probabilities of success as a function of k with n = 100
For the optimal strategy k_n, there exists x_0 ∈ (0, 1) such that k_n/n → x_0 as n → ∞. Thus, x_0 ∈ (0, 1) is the limiting proportion of the candidates that we reject out of hand. Moreover, x_0 maximizes x ↦ −x ln x on (0, 1).
The maximum value of −x ln x occurs at x_0 = 1/e and the maximum value is also 1/e.
Proof
This page titled 12.9: The Secretary Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
CHAPTER OVERVIEW
13: Games of Chance
Games of chance hold an honored place in probability theory, because of their conceptual clarity and because of their fundamental
influence on the early development of the subject. In this chapter, we explore some of the most common and basic games of
chance. Roulette, craps, and Keno are casino games. The Monty Hall problem is based on a TV game show, and has become
famous because of the controversy that it generated. Lotteries are now basic ways that governments and other institutions raise
money. In the last four sections on the game of red and black, we study various types of gambling strategies, a study which leads to
some deep and fascinating mathematics.
13.1: Introduction to Games of Chance
13.2: Poker
13.3: Simple Dice Games
13.4: Craps
13.5: Roulette
13.6: The Monty Hall Problem
13.7: Lotteries
13.8: The Red and Black Game
13.9: Timid Play
13.10: Bold Play
13.11: Optimal Strategies
This page titled 13: Games of Chance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.1: Introduction to Games of Chance
Figure 13.1.1 : An artificial knucklebone made of steatite, from the Arjan Verweij Dice Website
Gambling is intimately interwoven with the development of probability as a mathematical theory. Most of the early development of
probability, in particular, was stimulated by special gambling problems, such as
De Méré's problem
Pepys' problem
the problem of points
the Petersburg problem
Some of the very first books on probability theory were written to analyze games of chance, for example Liber de Ludo Aleae (The
Book on Games of Chance), by Girolamo Cardano, and Essay d' Analyse sur les Jeux de Hazard (Analytical Essay on Games of
Chance), by Pierre Rémond de Montmort. Gambling problems continue to be a source of interesting and deep problems in probability
to this day (see the discussion of Red and Black for an example).
Figure 13.1.2 : Allegory of Fortune by Dosso Dossi (c. 1591), Getty Museum. For more depictions of gambling in paintings, see the
ancillary material on art.
Of course, it is important to keep in mind that breakthroughs in probability, even when they are originally motivated by gambling
problems, are often profoundly important in the natural sciences, the social sciences, law, and medicine. Also, games of chance
provide some of the conceptually clearest and cleanest examples of random experiments, and thus their analysis can be very helpful
to students of probability.
However, nothing in this chapter should be construed as encouraging you, gentle reader, to gamble. On the contrary, our analysis
will show that, in the long run, only the gambling houses prosper. The gambler, inevitably, is a sad victim of the law of large
numbers.
13.1.1 https://stats.libretexts.org/@go/page/10255
In this chapter we will study some interesting games of chance. Poker, poker dice, craps, and roulette are popular parlor and casino
games. The Monty Hall problem, on the other hand, is interesting because of the controversy that it generated. The lottery is a basic
way that many states and nations use to raise money (a voluntary tax, of sorts).
Terminology
Let us discuss some of the basic terminology that will be used in several sections of this chapter. Suppose that A is an event in a
random experiment. The mathematical odds concerning A refer to the probability of A .
If a and b are positive numbers, then by definition, the following are equivalent:
1. the odds in favor of A are a : b.
2. P(A) = a/(a + b).
3. the odds against A are b : a.
4. P(A^c) = b/(a + b).
In many cases, a and b can be given as positive integers with no common factors.
On the other hand, the house odds of an event refer to the payout when a bet is made on the event. If the house odds of event A are n : m, then a bet of m units on A yields a net payoff of n units if A occurs. Equivalently, the gambler puts up m units (betting on A), the house puts up n units (betting on A^c), and the winner takes the pot. Of course, it is usually not necessary for the gambler to bet exactly m units; a smaller or larger bet is scaled appropriately. Thus, if the gambler bets k units and wins, his payout is kn/m.
Naturally, our main interest is in the net winnings if we make a bet on an event. The following result gives the probability density
function, mean, and variance for a unit bet. The expected value is particularly interesting, because by the law of large numbers, it
gives the long term gain or loss, per unit bet.
Suppose that the odds in favor of event A are a : b and that a bet on event A pays n : m . Let W denote the winnings from a
unit bet on A . Then
1. P(W = −1) = b/(a + b), P(W = n/m) = a/(a + b)
2. E(W) = (an − bm)/[m(a + b)]
3. var(W) = ab(n + m)^2/[m^2 (a + b)^2]
In particular, the expected value of the bet is zero if and only if an = bm , positive if and only if an > bm , and negative if and
only if an < bm . The first case means that the bet is fair, and occurs when the payoff is the same as the odds against the event.
The second means that the bet is favorable to the gambler, and occurs when the payoff is greater than the odds against the event.
The third case means that the bet is unfair to the gambler, and occurs when the payoff is less than the odds against the event.
Unfortunately, all casino games fall into the third category.
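The result is straightforward to code with exact rational arithmetic. The sketch below uses the single-number bet in American roulette (odds in favor 1 : 37, payoff 35 : 1) as an illustration; the function name is my own:

```python
from fractions import Fraction

def winnings(a, b, n, m):
    # odds in favor of A are a : b; a winning bet pays n : m
    mean = Fraction(a * n - b * m, m * (a + b))
    var = Fraction(a * b * (n + m) ** 2, m ** 2 * (a + b) ** 2)
    return mean, var

# a fair bet: even odds, even payoff
assert winnings(1, 1, 1, 1) == (Fraction(0), Fraction(1))

# single-number bet in American roulette: odds 1 : 37, payoff 35 : 1
mean, var = winnings(1, 37, 35, 1)
# mean = -1/19, about -5.26 cents per unit bet
```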
13.1.2 https://stats.libretexts.org/@go/page/10255
Shapes of Dice
The standard die, of course, is a cube with six sides. A bit more generally, most real dice are in the shape of Platonic solids, named
for Plato naturally. The faces of a Platonic solid are congruent regular polygons. Moreover, the same number of faces meet at each
vertex so all of the edges and angles are congruent as well.
Flat Dice
1. An ace-six flat die is a six-sided die in which faces 1 and 6 have probability 1/4 each while faces 2, 3, 4, and 5 have probability 1/8 each.
2. A two-five flat die is a six-sided die in which faces 2 and 5 have probability 1/4 each while faces 1, 3, 4, and 6 have probability 1/8 each.
3. A three-four flat die is a six-sided die in which faces 3 and 4 have probability 1/4 each while faces 1, 2, 5, and 6 have probability 1/8 each.
A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular
probabilities that we use (1/4 and 1/8) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter
axis have slightly larger probabilities (because they have slightly larger areas) than the other four faces. Flat dice are sometimes
used by gamblers to cheat.
In the Dice Experiment, select one die. Run the experiment 1000 times in each of the following cases and observe the
outcomes.
1. fair die
2. ace-six flat die
3. two-five flat die
4. three-four flat die
Simulation
It's very easy to simulate a fair die with a random number. Recall that the ceiling function ⌈x⌉ gives the smallest integer that is at
least as large as x.
Suppose that U is uniformly distributed on the interval (0, 1], so that U has the standard uniform distribution (a random
number). Then X = ⌈6 U ⌉ is uniformly distributed on the set {1, 2, 3, 4, 5, 6} and so simulates a fair six-sided die. More
generally, X = ⌈nU⌉ is uniformly distributed on {1, 2, …, n} and so simulates a fair n-sided die.
We can also use a real fair die to simulate other types of fair dice. Recall that if X is uniformly distributed on {1, 2, … , n} and
k ∈ {1, 2, … , n − 1} , then the conditional distribution of X given that X ∈ {1, 2, … , k} is uniformly distributed on
{1, 2, … , k}. Thus, suppose that we have a real, fair, n -sided die. If we ignore outcomes greater than k then we simulate a fair k -
sided die. For example, suppose that we have a carefully constructed icosahedron that is a fair 20-sided die. We can simulate a fair
13-sided die by simply rolling the die and stopping as soon as we have a score between 1 and 13.
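Both constructions are easy to code. A sketch (note that Python's `random.random()` returns U on [0, 1), so we use 1 − U to get a value in (0, 1] as the text assumes):

```python
import math
import random

def fair_die(n, rng):
    # X = ceil(n * U) with U uniform on (0, 1] is uniform on {1, ..., n}
    u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is in (0, 1]
    return math.ceil(n * u)

def rejection_die(k, n, rng):
    # simulate a fair k-sided die with a fair n-sided die (k < n):
    # roll repeatedly, ignoring outcomes greater than k
    while True:
        x = fair_die(n, rng)
        if x <= k:
            return x

rng = random.Random(7)
# the icosahedron example: a fair 13-sided die from a fair 20-sided die
rolls = [rejection_die(13, 20, rng) for _ in range(26000)]
# each face 1..13 should appear with relative frequency near 1/13
```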
To see how to simulate a card hand, see the Introduction to Finite Sampling Models. A general method of simulating random
variables is based on the quantile function.
This page titled 13.1: Introduction to Games of Chance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
13.2: Poker
Basic Theory
The Poker Hand
A deck of cards naturally has the structure of a product set and thus can be modeled mathematically by D = {1, 2, …, 10, j, q, k} × {♣, ♢, ♡, ♠}, where the first coordinate represents the denomination or kind (ace, two through 10, jack, queen, king) and where the second coordinate represents the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example q♡).
There are many different poker games, but we will be interested in standard draw poker, which consists of dealing 5 cards at
random from the deck D. The order of the cards does not matter in draw poker, so we will record the outcome of our random
experiment as the random set (hand) X = {X_1, X_2, X_3, X_4, X_5}, where X_i = (Y_i, Z_i) ∈ D for each i and X_i ≠ X_j for i ≠ j.
Our basic modeling assumption (and the meaning of the term at random) is that all poker hands are equally likely. Thus, the
random variable X is uniformly distributed over the set of possible poker hands S .
P(X ∈ A) = #(A)/#(S) for A ⊆ S  (13.2.3)
In statistical terms, a poker hand is a random sample of size 5 drawn without replacement and without regard to order from the
population D. For more on this topic, see the chapter on Finite Sampling Models.
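The basic counting is simple with binomial coefficients; a short sketch (the straight-flush and dead man's hand counts anticipate exercises later in this section):

```python
from math import comb

# number of possible poker hands: choose 5 of 52 cards
hands = comb(52, 5)  # 2,598,960

# straight flushes: 10 possible sequences (ace low through ace high)
# in each of the 4 suits
straight_flushes = 10 * 4

# dead man's hand: two of the 4 aces, two of the 4 eights,
# and any one of the 44 remaining cards
dead_mans = comb(4, 2) * comb(4, 2) * 44
```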
The hand value V of a poker hand is a random variable taking values 0 through 8, and is defined as follows:
0. No Value. The hand is of none of the other types.
1. One Pair. The hand has 2 cards of one kind, and one card each of three other kinds.
2. Two Pair. The hand has 2 cards of one kind, 2 cards of another kind, and one card of a third kind.
3. Three of a Kind. The hand has 3 cards of one kind and one card of each of two other kinds.
4. Straight. The kinds of cards in the hand form a consecutive sequence but the cards are not all in the same suit. An ace can
be considered the smallest denomination or the largest denomination.
5. Flush. The cards are all in the same suit, but the kinds of the cards do not form a consecutive sequence.
6. Full House. The hand has 3 cards of one kind and 2 cards of another kind.
7. Four of a Kind. The hand has 4 cards of one kind, and 1 card of another kind.
8. Straight Flush. The cards are all in the same suit and the kinds form a consecutive sequence.
Run the poker experiment 10 times in single-step mode. For each outcome, note that the value of the random variable
corresponds to the type of hand, as given above.
For some comic relief before we get to the analysis, look at two of the paintings of Dogs Playing Poker by CM Coolidge.
1. His Station and Four Aces
2. Waterloo
13.2.1 https://stats.libretexts.org/@go/page/10256
The Probability Density Function
Computing the probability density function of V is a good exercise in combinatorial probability. In the following exercises, we
need the two fundamental rules of combinatorics to count the number of poker hands of a given type: the multiplication rule and
the addition rule. We also need some basic combinatorial structures, particularly combinations.
Note that the probability density function of V is decreasing; the more valuable the type of hand, the less likely the type of hand is
to occur. Note also that no value and one pair account for more than 92% of all poker hands.
In the poker experiment, note the shape of the density graph. Note that some of the probabilities are so small that they are
essentially invisible in the graph. Now run the poker hand 1000 times and compare the relative frequency function to the
density function.
In the poker experiment, set the stop criterion to the value of V given below. Note the number of poker hands required.
1. V =3
2. V =4
3. V =5
4. V =6
5. V =7
6. V =8
Answer
In the movie The Parent Trap (1998), both twins get straight flushes on the same poker deal. Find the probability of this event.
Answer
Classify V in terms of level of measurement: nominal, ordinal, interval, or ratio. Is the expected value of V meaningful?
Answer
A hand with a pair of aces and a pair of eights (and a fifth card of a different type) is called a dead man's hand. The name is in
honor of Wild Bill Hickok, who held such a hand at the time of his murder in 1876. Find the probability of getting a dead man's
hand.
Answer
Drawing Cards
In draw poker, each player is dealt a poker hand and there is an initial round of betting. Typically, each player then gets to discard
up to 3 cards and is dealt that number of cards from the remaining deck. This leads to myriad problems in conditional probability,
as partial information becomes available. A complete analysis is far beyond the scope of this section, but we will consider a couple of simple examples.
Suppose that Fred's hand is {4 ♡, 5 ♡, 7 ♠, q ♣, 1 ♢}. Fred discards the q ♣ and 1 ♢ and draws two new cards, hoping to
complete the straight. Note that Fred must get a 6 and either a 3 or an 8. Since he is missing a middle denomination (6), Fred is
drawing to an inside straight. Find the probability that Fred is successful.
Answer
Suppose that Wilma's hand is {4 ♡, 5 ♡, 6 ♠, q ♣, 1 ♢}. Wilma discards q ♣ and 1 ♢ and draws two new cards, hoping to
complete the straight. Note that Wilma must get a 2 and a 3, or a 7 and an 8, or a 3 and a 7. Find the probability that Wilma is
successful. Clearly, Wilma has a better chance than Fred.
Answer
This page titled 13.2: Poker is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services)
via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.3: Simple Dice Games
In this section, we will analyze several simple games played with dice—poker dice, chuck-a-luck, and high-low. The casino game
craps is more complicated and is studied in the next section.
Figure 13.3.1 : The Dice Players by Georges de La Tour (c. 1651). For more on the influence of probability in painting, see the
ancillary material on art.
Poker Dice
Definition
The game of poker dice is a bit like standard poker, but played with dice instead of cards. In poker dice, 5 fair dice are rolled. We
will record the outcome of our random experiment as the (ordered) sequence of scores:
X = (X1 , X2 , X3 , X4 , X5 ) (13.3.1)
Thus, the sample space is S = {1, 2, 3, 4, 5, 6}^5. Since the dice are fair, our basic modeling assumption is that X is a sequence of independent random variables, each uniformly distributed on {1, 2, 3, 4, 5, 6}.
In statistical terms, a poker dice hand is a random sample of size 5 drawn with replacement and with regard to order from the
population D = {1, 2, 3, 4, 5, 6}. For more on this topic, see the chapter on Finite Sampling Models. In particular, in this chapter
you will learn that the result of Exercise 1 would not be true if we recorded the outcome of the poker dice experiment as an
unordered set instead of an ordered sequence.
13.3.1 https://stats.libretexts.org/@go/page/10257
Run the poker dice experiment 10 times in single-step mode. For each outcome, note that the value of the random variable
corresponds to the type of hand, as given above.
P(V = 0) = 720/7776 ≈ 0.09259.
Proof
P(V = 1) = 3600/7776 ≈ 0.46296.
Proof
P(V = 2) = 1800/7776 ≈ 0.23148.
Proof
P(V = 3) = 1200/7776 ≈ 0.15432.
Proof
P(V = 4) = 300/7776 ≈ 0.03858.
Proof
P(V = 5) = 150/7776 ≈ 0.01929.
Proof
P(V = 6) = 6/7776 ≈ 0.00077.
Proof
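Since the sample space has only 6^5 = 7776 points, the density of V can be verified by brute-force enumeration. A sketch (the hand-coding function is my own, with values 0 through 6 matching the probabilities above: none alike, one pair, two pair, three of a kind, full house, four of a kind, five of a kind):

```python
from collections import Counter
from itertools import product

def hand_value(roll):
    # 0: none alike, 1: one pair, 2: two pair, 3: three of a kind,
    # 4: full house, 5: four of a kind, 6: five of a kind
    counts = sorted(Counter(roll).values(), reverse=True)
    if counts[0] == 5:
        return 6
    if counts[0] == 4:
        return 5
    if counts == [3, 2]:
        return 4
    if counts[0] == 3:
        return 3
    if counts[:2] == [2, 2]:
        return 2
    if counts[0] == 2:
        return 1
    return 0

# tally hand values over all 7776 equally likely outcomes
freq = Counter(hand_value(r) for r in product(range(1, 7), repeat=5))
```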
Run the poker dice experiment 1000 times and compare the relative frequency function to the density function.
In the poker dice experiment, set the stop criterion to the value of V given below. Note the number of hands required.
1. V =3
2. V =4
3. V =5
4. V =6
Chuck-a-Luck
Chuck-a-luck is a popular carnival game, played with three dice. According to Richard Epstein, the original name was Sweat Cloth,
and in British pubs, the game is known as Crown and Anchor (because the six sides of the dice are inscribed clubs, diamonds,
hearts, spades, crown and anchor). The dice are over-sized and are kept in an hourglass-shaped cage known as the bird cage. The
dice are rolled by spinning the bird cage.
Chuck-a-luck is very simple. The gambler selects an integer from 1 to 6, and then the three dice are rolled. If exactly k dice show the gambler's number, the payoff is k : 1. As with poker dice, our basic mathematical assumption is that the dice are fair, and therefore the outcome vector X = (X_1, X_2, X_3) is uniformly distributed on the sample space S = {1, 2, 3, 4, 5, 6}^3.
Let Y denote the number of dice that show the gambler's number. Then Y has the binomial distribution with parameters n = 3 and p = 1/6:
P(Y = k) = C(3, k) (1/6)^k (5/6)^{3−k},  k ∈ {0, 1, 2, 3}  (13.3.3)
Let W denote the net winnings for a unit bet. Then
1. P(W = −1) = 125/216
2. P(W = 1) = 75/216
3. P(W = 2) = 15/216
4. P(W = 3) = 1/216
Run the chuck-a-luck experiment 1000 times and compare the empirical density function of W to the true probability density
function.
The mean and variance of the net winnings are
1. E(W) = −17/216 ≈ −0.0787
2. var(W) = 57815/46656 ≈ 1.239
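These values follow from the binomial distribution of Y; exact rational arithmetic keeps the check honest (a sketch, with my own helper name):

```python
from fractions import Fraction
from math import comb

def p_match(k):
    # probability that exactly k of the 3 dice show the chosen number
    return Fraction(comb(3, k)) * Fraction(1, 6) ** k * Fraction(5, 6) ** (3 - k)

# net winnings: lose 1 unit if no die matches, win k units if k dice match
pmf_W = {-1: p_match(0), 1: p_match(1), 2: p_match(2), 3: p_match(3)}
mean = sum(w * p for w, p in pmf_W.items())
var = sum(w * w * p for w, p in pmf_W.items()) - mean ** 2
```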
Run the chuck-a-luck experiment 1000 times and compare the empirical mean and standard deviation of W to the true mean
and standard deviation. Suppose you had bet $1 on each of the 1000 games. What would your net winnings be?
High-Low
In the game of high-low, a pair of fair dice are rolled. The outcome is
high if the sum is 8, 9, 10, 11, or 12.
low if the sum is 2, 3, 4, 5, or 6
seven if the sum is 7
A player can bet on any of the three outcomes. The payoff for a bet of high or for a bet of low is 1 : 1 . The payoff for a bet of seven
is 4 : 1 .
Let Z denote the outcome of a game of high-low. Find the probability density function of Z .
Answer
Let W denote the net winnings for a unit bet. Find the expected value and variance of W for each of the three bets:
1. high
2. low
3. seven
Answer
This page titled 13.3: Simple Dice Games is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
13.4: Craps
Definitions
The rules for craps are as follows: the shooter rolls a pair of dice.
1. If the sum on the first roll is 7 or 11, the shooter wins.
2. If the sum on the first roll is 2, 3, or 12 (craps), the shooter loses.
3. Otherwise, the sum on the first roll becomes the shooter's point. The shooter then rolls the dice repeatedly until the sum is either the point, in which case she wins, or 7, in which case she loses (fails to make her point).
As long as the shooter wins, or loses by rolling craps, she retains the dice and continues. Once she loses by failing to make her point, the dice are passed to the next shooter.
Let us consider the game of craps mathematically. Our basic assumption, of course, is that the dice are fair and that the outcomes of the various rolls are independent. Let N denote the (random) number of rolls in the game and let (X_i, Y_i) denote the outcome of the ith roll for i ∈ {1, 2, …, N}. Finally, let Z_i = X_i + Y_i, the sum of the scores on the ith roll, and let V denote the event that the shooter wins.
In the craps experiment, press single step a few times and observe the outcomes. Make sure that you understand the rules of the
game.
The sum of the scores Z on a given roll has the probability density function in the following table:

z          2     3     4     5     6     7     8     9     10    11    12
P(Z = z)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
The probability that the player makes her point can be computed using a simple conditioning argument. For example, suppose that
the player throws 4 initially, so that 4 is the point. The player continues until she either throws 4 again or throws 7. Thus, the final
roll will be an element of the following set:
S4 = {(1, 3), (2, 2), (3, 1), (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)} (13.4.1)
Since the dice are fair, these outcomes are equally likely, so the probability that the player makes her point of 4 is 3/9. A similar
argument can be used for the other points. Here are the results:
The probabilities of making the point z are given in the following table:

z               4     5     6     8     9     10
P(V ∣ Z1 = z)  3/9   4/10  5/11  5/11  4/10  3/9
The probability that the shooter wins in craps is P(V) = 244/495 ≈ 0.49293.
Proof
Note that craps is nearly a fair game. For the sake of completeness, the following result gives the probability of winning, given a
“point” on the first roll.
P(V ∣ Z1 ∈ {4, 5, 6, 8, 9, 10}) = 67/165 ≈ 0.406
Proof
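The conditioning argument above can be carried out numerically with exact rational arithmetic (a sketch under the stated independence assumptions):

```python
from fractions import Fraction

# Density of the sum of two fair dice: P(Z = z) = (6 - |z - 7|)/36
pdf = {z: Fraction(6 - abs(z - 7), 36) for z in range(2, 13)}

# Condition on the first roll: win on 7 or 11, lose on 2, 3, 12; otherwise the
# point z is made before 7 with probability pdf[z] / (pdf[z] + pdf[7])
p_win = pdf[7] + pdf[11]
for z in (4, 5, 6, 8, 9, 10):
    p_win += pdf[z] * pdf[z] / (pdf[z] + pdf[7])

print("P(V) =", p_win, "≈", float(p_win))
```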
Bets
There is a bewildering variety of bets that can be made in craps. In the exercises in this subsection, we will discuss some typical
bets and compute the probability density function, mean, and standard deviation of each. (Most of these bets are illustrated in the
picture of the craps table above). Note, however, that some of the details of the bets, and in particular the payout odds, vary from
one casino to another. Of course the expected value of any bet is inevitably negative (for the gambler), and thus the gambler is
doomed to lose money in the long run. Nonetheless, as we will see, some bets are better than others.
Pass and Don't Pass
A pass bet is a bet that the shooter will win, and pays 1 : 1.
Let W denote the winnings for a unit pass bet. Then
1. P(W = −1) = 251/495, P(W = 1) = 244/495
2. E(W) = −7/495 ≈ −0.0141
3. sd(W) ≈ 0.9999
In the craps experiment, select the pass bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
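In place of the applet, the pass bet can be simulated directly (a minimal sketch; the seed is arbitrary):

```python
import random

def craps_pass(rng):
    """Play one game of craps; return the net winnings for a unit pass bet."""
    roll = rng.randint(1, 6) + rng.randint(1, 6)
    if roll in (7, 11):
        return 1            # natural: immediate win
    if roll in (2, 3, 12):
        return -1           # craps: immediate loss
    point = roll
    while True:             # roll until the point or 7 appears
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        if roll == point:
            return 1
        if roll == 7:
            return -1

rng = random.Random(0)  # arbitrary seed
results = [craps_pass(rng) for _ in range(1000)]
wins = results.count(1)
print(f"empirical P(W = 1): {wins / 1000:.4f} (true 244/495 ≈ 0.4929)")
print("net winnings:", sum(results))
```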
A don't pass bet is a bet that the shooter will lose, except that 12 on the first throw is excluded (that is, the shooter loses, of course,
but the don't pass bettor neither wins nor loses). This is the meaning of the phrase don't pass bar double 6 on the craps table. The
don't pass bet also pays 1 : 1.
Let W denote the winnings for a unit don't pass bet. Then
1. P(W = −1) = 244/495, P(W = 0) = 1/36, P(W = 1) = 949/1980
2. E(W) = −27/1980 ≈ −0.01363
3. sd(W) ≈ 0.9859
3. sd(W ) ≈ 0.9859
Thus, the don't pass bet is slightly better for the gambler than the pass bet.
In the craps experiment, select the don't pass bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
The come bet and the don't come bet are analogous to the pass and don't pass bets, respectively, except that they are made after the
point has been established.
Field
A field bet is a bet on the outcome of the next throw. It pays 1 : 1 if 3, 4, 9, 10, or 11 is thrown, 2 : 1 if 2 or 12 is thrown, and loses
otherwise.
Let W denote the winnings for a unit field bet. Then
1. P(W = −1) = 5/9, P(W = 1) = 7/18, P(W = 2) = 1/18
2. E(W) = −1/18 ≈ −0.0556
3. sd(W) ≈ 1.0787
In the craps experiment, select the field bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
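The field bet moments above, and those of the other single-roll bets that follow, can be checked from the density of the dice sum (a sketch; the helper `one_roll_bet` is hypothetical):

```python
from fractions import Fraction
from math import sqrt

# Density of the sum of two fair dice
pdf = {z: Fraction(6 - abs(z - 7), 36) for z in range(2, 13)}

def one_roll_bet(payoffs):
    """Density, mean, and sd of winnings for a bet resolved on a single roll.
    `payoffs` maps winning dice sums to payoffs; any other sum loses the stake."""
    w_pdf = {}
    for z, p in pdf.items():
        w = payoffs.get(z, -1)
        w_pdf[w] = w_pdf.get(w, Fraction(0)) + p
    mean = sum(w * p for w, p in w_pdf.items())
    var = sum(w ** 2 * p for w, p in w_pdf.items()) - mean ** 2
    return w_pdf, mean, sqrt(var)

# Field: 1 : 1 on 3, 4, 9, 10, 11 and 2 : 1 on 2 and 12
field = one_roll_bet({3: 1, 4: 1, 9: 1, 10: 1, 11: 1, 2: 2, 12: 2})
print("field: pdf =", field[0], " E(W) =", field[1], " sd(W) ≈", round(field[2], 4))
```

The same helper handles the 7, 11, and craps bets by passing the appropriate payoff map.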
Seven and Eleven
The 7 bet is a bet on the next throw, and pays 4 : 1 if that throw is 7. Similarly, the 11 bet is a bet on the next throw, and pays 15 : 1 if that throw is 11.
Let W denote the winnings for a unit 7 bet. Then
1. P(W = −1) = 5/6, P(W = 4) = 1/6
2. E(W) = −1/6 ≈ −0.1667
3. sd(W) ≈ 1.8634
In the craps experiment, select the 7 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Let W denote the winnings for a unit 11 bet. Then
1. P(W = −1) = 17/18, P(W = 15) = 1/18
2. E(W) = −1/9 ≈ −0.1111
3. sd(W) ≈ 3.6650
In the craps experiment, select the 11 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Craps
All craps bets are bets on the next throw. The basic craps bet pays 7 : 1 if 2, 3, or 12 is thrown. The craps 2 bet pays 30 : 1 if a 2 is
thrown. Similarly, the craps 12 bet pays 30 : 1 if a 12 is thrown. Finally, the craps 3 bet pays 15 : 1 if a 3 is thrown.
Let W denote the winnings for a unit craps bet. Then
1. P(W = −1) = 8/9, P(W = 7) = 1/9
2. E(W) = −1/9 ≈ −0.1111
3. sd(W) ≈ 2.5142
In the craps experiment, select the craps bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Let W denote the winnings for a unit craps 2 bet or a unit craps 12 bet. Then
1. P(W = −1) = 35/36, P(W = 30) = 1/36
2. E(W) = −5/36 ≈ −0.1389
3. sd(W) ≈ 5.0944
In the craps experiment, select the craps 2 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
In the craps experiment, select the craps 12 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Let W denote the winnings for a unit craps 3 bet. Then
1. P(W = −1) = 17/18, P(W = 15) = 1/18
2. E(W) = −1/9 ≈ −0.1111
3. sd(W) ≈ 3.6650
In the craps experiment, select the craps 3 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Thus, of the craps bets, the basic craps bet and the craps 3 bet are best for the gambler, and the craps 2 and craps 12 are the worst.
Big Six and Big Eight
A big 6 bet is a bet that 6 is thrown before 7; similarly, a big 8 bet is a bet that 8 is thrown before 7. Both pay 1 : 1.
Let W denote the winnings for a unit big 6 bet or a unit big 8 bet. Then
1. P(W = −1) = 6/11, P(W = 1) = 5/11
2. E(W) = −1/11 ≈ −0.0909
3. sd(W) ≈ 0.9959
In the craps experiment, select the big 6 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
In the craps experiment, select the big 8 bet. Run the simulation 1000 times and compare the empirical density function and
moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Hardway Bets
A hardway bet can be made on any of the numbers 4, 6, 8, or 10. It is a bet that the chosen number n will be thrown “the hardway”
as (n/2, n/2), before 7 is thrown and before the chosen number is thrown in any other combination. Hardway bets on 4 and 10 pay
7 : 1 , while hardway bets on 6 and 8 pay 9 : 1 .
Let W denote the winnings for a unit hardway 4 or hardway 10 bet. Then
1. P(W = −1) = 8/9, P(W = 7) = 1/9
2. E(W) = −1/9 ≈ −0.1111
3. sd(W) ≈ 2.5142
In the craps experiment, select the hardway 4 bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
In the craps experiment, select the hardway 10 bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Let W denote the winnings for a unit hardway 6 or hardway 8 bet. Then
1. P(W = −1) = 10/11, P(W = 9) = 1/11
2. E(W) = −1/11 ≈ −0.0909
3. sd(W) ≈ 2.8748
In the craps experiment, select the hardway 6 bet. Run the simulation 1000 times and compare the empirical density and
moments of W to the true density and moments. Suppose that you bet $1 on each of the 1000 games. What would your net
winnings be?
In the craps experiment, select the hardway 8 bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Thus, the hardway 6 and 8 bets are better than the hardway 4 and 10 bets for the gambler, in terms of expected value.
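The hardway results can be verified by conditioning on the relevant rolls, as in the point computation earlier: only rolls of n or 7 matter (a sketch; the helper `hardway` is hypothetical):

```python
from fractions import Fraction
from math import sqrt

def hardway(n, payoff):
    """P(win), mean, and sd of winnings for a unit hardway bet on n.
    Only rolls of n or 7 matter, so condition on those rolls: the bet wins
    when the single hard combination (n/2, n/2) comes up first."""
    ways_n = 6 - abs(n - 7)   # ways to roll a sum of n with two dice
    hard, sevens = 1, 6       # one hard combination; six ways to roll 7
    p_win = Fraction(hard, ways_n + sevens)
    mean = payoff * p_win - (1 - p_win)
    var = payoff ** 2 * p_win + (1 - p_win) - mean ** 2
    return p_win, mean, sqrt(var)

for n, payoff in [(4, 7), (6, 9), (8, 9), (10, 7)]:
    p, m, s = hardway(n, payoff)
    print(f"hardway {n}: P(win) = {p}, E(W) = {m}, sd(W) ≈ {s:.4f}")
```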
Proof
The first few values of the probability density function of N are given in the following table:

n          1        2        3        4        5
P(N = n)  0.3333   0.1883   0.1348   0.0966   0.0693
Find the probability that a game of craps will last at least 8 rolls.
Answer
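One way to answer is to compute the density of N directly, conditioning on the point as before (a sketch; the tail probability and the mean are computed numerically, with the mean truncated since the tail is geometric):

```python
from fractions import Fraction

# Probability of each point on the come-out roll, and of 7 on any roll
points = {4: Fraction(3, 36), 5: Fraction(4, 36), 6: Fraction(5, 36),
          8: Fraction(5, 36), 9: Fraction(4, 36), 10: Fraction(3, 36)}
seven = Fraction(6, 36)

def pdf_N(n):
    """P(N = n): the probability that a game of craps lasts exactly n rolls."""
    if n == 1:
        return Fraction(12, 36)   # 7 or 11 (win), or 2, 3, 12 (lose) immediately
    total = Fraction(0)
    for p in points.values():
        q = p + seven             # the game ends on any given later roll
        total += p * (1 - q) ** (n - 2) * q
    return total

print("P(N >= 8) ≈", float(1 - sum(pdf_N(n) for n in range(1, 8))))
mean = sum(n * pdf_N(n) for n in range(1, 400))   # truncated series; tail negligible
print("E(N) ≈", float(mean))
```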
The mean and variance of the number of rolls N are
1. E(N) = 557/165 ≈ 3.3758
2. var(N) = 245672/27225 ≈ 9.02376
Proof
This page titled 13.4: Craps is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services)
via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.5: Roulette
The (American) roulette wheel has 38 slots numbered 00, 0, and 1–36.
1. Slots 0, 00 are green;
2. Slots 1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36 are red;
3. Slots 2, 4, 6, 8, 10, 11, 13, 15, 17, 20, 22, 24, 26, 28, 29, 31, 33, 35 are black.
Except for 0 and 00, the slots on the wheel alternate between red and black. The strange order of the numbers on the wheel is
intended so that high and low numbers, as well as odd and even numbers, tend to alternate.
We assume that the wheel is fair, so that the ball is equally likely to land in each of the 38 slots; thus each slot has probability 1/38.
Bets
As with craps, roulette is a popular casino game because of the rich variety of bets that can be made. The picture above shows the
roulette table and indicates some of the bets we will study. All bets turn out to have the same expected value (negative, of course).
However, the variances differ depending on the bet.
Although all bets in roulette have the same expected value, the standard deviations vary inversely with the number of numbers
selected. What are the implications of this for the gambler?
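The claim can be illustrated with a short computation: an m-number bet pays (36 − m)/m : 1, so the mean is the same for every bet while the standard deviation shrinks as m grows (a sketch; `roulette_bet` is a hypothetical helper):

```python
from fractions import Fraction
from math import sqrt

def roulette_bet(m):
    """Mean and sd of winnings for a unit bet on m of the 38 numbers.
    An m-number bet pays (36 - m)/m : 1 (an integer for the bets studied here)."""
    payoff = Fraction(36 - m, m)
    p_win = Fraction(m, 38)
    mean = payoff * p_win - (1 - p_win)
    var = payoff ** 2 * p_win + (1 - p_win) - mean ** 2
    return mean, sqrt(var)

for m in (1, 2, 3, 4, 6, 12, 18):
    mean, sd = roulette_bet(m)
    print(f"{m:2d}-number bet: E(W) = {mean}, sd(W) ≈ {sd:.4f}")
```

Every bet has mean −1/19 ≈ −0.0526, while the standard deviation falls from about 5.76 for a straight bet to about 1.00 for an 18-number bet: the fewer numbers covered, the wilder the ride.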
Straight Bets
A straight bet is a bet on a single number, and pays 35 : 1.
Let W denote the winnings for a unit straight bet. Then
1. P(W = −1) = 37/38, P(W = 35) = 1/38
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 5.7626
In the roulette experiment, select the single number bet. Run the simulation 1000 times and compare the empirical density
function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000
games. What would your net winnings be?
Two Number Bets
A 2-number bet (or a split bet) is a bet on two adjacent numbers in the roulette table. The bet pays 17 : 1.
Let W denote the winnings for a unit 2-number bet. Then
1. P(W = −1) = 18/19, P(W = 17) = 1/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 4.0193
In the roulette experiment, select the 2 number bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Three Number Bets
A 3-number bet (or row bet) is a bet on the three numbers in a horizontal row on the roulette table, and pays 11 : 1.
Let W denote the winnings for a unit 3-number bet. Then
1. P(W = −1) = 35/38, P(W = 11) = 3/38
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 3.2359
In the roulette experiment, select the 3-number bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Four Number Bets
A 4-number bet (or square bet) is a bet on four numbers that form a square on the roulette table, and pays 8 : 1.
Let W denote the winnings for a unit 4-number bet. Then
1. P(W = −1) = 17/19, P(W = 8) = 2/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 2.7620
In the roulette experiment, select the 4-number bet. Run the simulation 1000 times and compare the empirical density function
and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games.
What would your net winnings be?
Six Number Bets
A 6-number bet is a bet on the six numbers in two adjacent rows of the roulette table, and pays 5 : 1.
Let W denote the winnings for a unit 6-number bet. Then
1. P(W = −1) = 16/19, P(W = 5) = 3/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 2.1879
In the roulette experiment, select the 6-number bet. Run the simulation 1000 times and compare the empirical density
function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000
games. What would your net winnings be?
Twelve Number Bets
A 12-number bet is a bet on 12 numbers. In particular, a column bet is a bet on any one of the three columns of 12 numbers running
horizontally along the table. Other 12-number bets are the first 12 (1-12), the middle 12 (13-24), and the last 12 (25-36). A 12-
number bet pays 2 : 1.
Let W denote the winnings for a unit 12-number bet. Then
1. P(W = −1) = 13/19, P(W = 2) = 6/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 1.3945
In the roulette experiment, select the 12-number bet. Run the simulation 1000 times and compare the empirical density
function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000
games. What would your net winnings be?
Eighteen Number Bets
An 18-number bet is a bet on 18 numbers, and pays 1 : 1. Examples include the bets on red, on black, on odd, on even, on low (1-18), and on high (19-36).
Let W denote the winnings for a unit 18-number bet. Then
1. P(W = −1) = 10/19, P(W = 1) = 9/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 0.9986
In the roulette experiment, select the 18-number bet. Run the simulation 1000 times and compare the empirical density
function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000
games. What would your net winnings be?
This page titled 13.5: Roulette is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.6: The Monty Hall Problem
Preliminaries
Statement of the Problem
The Monty Hall problem involves a classical game show situation and is named after Monty Hall, the long-time host of the TV game show Let's Make a
Deal. There are three doors labeled 1, 2, and 3. A car is behind one of the doors, while goats are behind the other two.
The Monty Hall problem became the subject of intense controversy because of several articles by Marilyn Vos Savant in the Ask Marilyn column of Parade
magazine, a popular Sunday newspaper supplement. The controversy began when a reader posed the problem in the following way:
Suppose you're on a game show, and you're given a choice of three doors. Behind one door is a car; behind the
others, goats. You pick a door—say No. 1—and the host, who knows what's behind the doors, opens another
door—say No. 3—which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your
advantage to switch your choice?
Marilyn's response was that the contestant should switch doors, claiming that there is a 1/3 chance that the car is behind door 1, while there is a 2/3 chance that
the car is behind door 2. In two follow-up columns, Marilyn printed a number of responses, some from academics, most of whom claimed in angry or
sarcastic tones that she was wrong and that there are equal chances that the car is behind doors 1 or 2. Marilyn stood by her original answer and offered
additional, but non-mathematical, arguments.
Think about the problem. Do you agree with Marilyn or with her critics, or do you think that neither solution is correct?
In the Monty Hall game, set the host strategy to standard (the meaning of this strategy will be explained below). Play the Monty Hall game 50
times with each of the following strategies. Do you want to reconsider your answer to question above?
1. Always switch
2. Never switch
In the Monty Hall game, set the host strategy to blind (the meaning of this strategy will be explained below). Play the Monty Hall game 50 times with
each of the following strategies. Do you want to reconsider your answer to question above?
1. Always switch
2. Never switch
Modeling Assumptions
When we begin to think carefully about the Monty Hall problem, we realize that the statement of the problem by Marilyn's reader is so vague that a
meaningful discussion is not possible without clarifying assumptions about the strategies of the host and player. Indeed, we will see that misunderstandings
about these strategies are the cause of the controversy.
Let us try to formulate the problem mathematically. In general, the actions of the host and player can vary from game to game, but if we are to have a random
experiment in the classical sense, we must assume that the same probability distributions govern the host and player on each game and that the games are
independent.
There are four basic random variables for a game:
1. U : the number of the door containing the car.
2. X: the number of the first door selected by the player.
3. V : the number of the door opened by the host.
4. Y : the number of the second door selected by the player.
Each of these random variables has the possible values 1, 2, and 3. However, because of the rules of the game, the door opened by the host cannot be either of
the doors selected by the player, so V ≠ X and V ≠ Y . In general, we will allow the possibility V = U , that the host opens the door with the car behind it.
Whether this is a reasonable action of the host is a big part of the controversy about this problem.
The Monty Hall experiment will be completely defined mathematically once the joint distribution of the basic variables is specified. This joint distribution in
turn depends on the strategies of the host and player, which we will consider next.
Strategies
Host Strategies
In the Monty Hall experiment, note that the host determines the probability density function of the door containing the car, namely P(U = i) for
i ∈ {1, 2, 3}. The obvious choice for the host is to randomly assign the car to one of the three doors. This leads to the uniform distribution, and unless
otherwise noted, we will always assume that U has this distribution. Thus, P(U = i) = 1/3 for i ∈ {1, 2, 3}.
The host also determines the conditional density function of the door he opens, given knowledge of the door containing the car and the first door selected by
the player, namely P(V = k ∣ U = i, X = j) for i, j ∈ {1, 2, 3}. Recall that since the host cannot open the door chosen by the player, this probability must
be 0 for k = j .
Thus, the distribution of U and the conditional distribution of V given U and X constitute the host strategy.
Suppose that the host always opens a door that is different from the player's first choice and that does not contain the car, chosen uniformly at random when two such doors are available. This strategy leads to the following conditional distribution for V given U and X:
P(V = k ∣ U = i, X = j) = { 1 if i ≠ j, k ≠ i, k ≠ j;  1/2 if i = j, k ≠ i;  0 if k = i or k = j }   (13.6.1)
This distribution, along with the uniform distribution for U , will be referred to as the standard strategy for the host.
In the Monty Hall game, set the host strategy to standard. Play the game 50 times with each of the following player strategies. Which works better?
1. Always switch
2. Never switch
Suppose instead that the host opens a door chosen uniformly at random from the two doors other than the player's first choice, regardless of the location of the car (so the host may open the door containing the car). This strategy leads to the following conditional distribution for V given U and X:

P(V = k ∣ U = i, X = j) = { 1/2 if k ≠ j;  0 if k = j }   (13.6.2)
This distribution, together with the uniform distribution for U , will be referred to as the blind strategy for the host. The blind strategy seems a bit odd.
However, the confusion between the two strategies is the source of the controversy concerning this problem.
In the Monty Hall game, set the host strategy to blind. Play the game 50 times with each of the following player strategies. Which works better?
1. Always switch
2. Never switch
Player Strategies
The player, on the other hand, determines the probability density function of her first choice, namely P(X = j) for j ∈ {1, 2, 3}. The obvious first choice for
the player is to randomly choose a door, since the player has no knowledge at this point. This leads to the uniform distribution, so P(X = j) = 1/3 for
j ∈ {1, 2, 3}.
The player also determines the conditional density function of her second choice, given knowledge of her first choice and the door opened by the host,
namely P(Y = l ∣ X = j, V = k) for j, k, l ∈ {1, 2, 3} with j ≠ k. Recall that since the player cannot choose the door opened by the host, this probability
must be 0 for l = k . The distribution of X and the conditional distribution of Y given X and V constitute the player strategy.
Suppose that the player switches with probability p ∈ [0, 1]. This leads to the following conditional distribution:
P(Y = l ∣ X = j, V = k) = { p if l ≠ j, l ≠ k;  1 − p if l = j;  0 if l = k }   (13.6.3)
In particular, if p = 1 , the player always switches, while if p = 0 , the player never switches.
Mathematical Analysis
We are almost ready to analyze the Monty Hall problem mathematically. But first we must make some independence assumptions to incorporate the lack of
knowledge that the host and player have about each other's actions. First, the player has no knowledge of the door containing the car, so we assume that U
and X are independent. Also, the only information about the car door that the player has when she makes her second choice is the information (if any)
revealed by her first choice and the host's subsequent selection. Mathematically, this means that Y is conditionally independent of U given X and V .
Distributions
The host and player strategies form the basic data for the Monty Hall problem. Because of the independence assumptions, the joint distribution of the basic
random variables is completely determined by these strategies.
Proof
The probability of any event defined in terms of the Monty Hall problem can be computed by summing the joint density over the appropriate values of
(i, j, k, l).
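Such a sum can be sketched in code, with the host and player densities expressed exactly as in the standard and blind strategies described above (the function names are hypothetical):

```python
from fractions import Fraction
from itertools import product

def host_density(host, i, j, k):
    """P(V = k | U = i, X = j) under the 'standard' or 'blind' host strategy."""
    if k == j:
        return Fraction(0)              # the host never opens the player's door
    if host == "blind":
        return Fraction(1, 2)           # uniform over the two doors != j
    if k == i:
        return Fraction(0)              # the standard host never opens the car door
    return Fraction(1, 2) if i == j else Fraction(1)

def player_density(p, j, k, l):
    """P(Y = l | X = j, V = k) for a player who switches with probability p."""
    if l == k:
        return Fraction(0)              # the player cannot choose the opened door
    return 1 - p if l == j else p       # stay with probability 1 - p, else switch

def win_probability(host, p):
    """P(Y = U): sum the joint density over the outcomes where the player wins."""
    doors = (1, 2, 3)
    return sum(Fraction(1, 9) * host_density(host, i, j, k) * player_density(p, j, k, l)
               for i, j, k, l in product(doors, repeat=4) if l == i)

print("standard, always switch:", win_probability("standard", Fraction(1)))
print("blind, always switch:   ", win_probability("blind", Fraction(1)))
```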
With either of the basic host strategies, V is uniformly distributed on {1, 2, 3}.
Suppose that the player switches with probability p. With either of the basic host strategies, Y is uniformly distributed on {1, 2, 3}.
In the Monty Hall experiment, set the host strategy to standard. For each of the following values of p, run the simulation 1000 times. Based on relative
frequency, which strategy works best?
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)
In the Monty Hall experiment, set the host strategy to blind. For each of the following values of p, run the experiment 1000 times. Based on relative
frequency, which strategy works best?
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)
Suppose that the host follows the standard strategy and that the player switches with probability p. Then the probability that the player wins is
P(Y = U) = (1 + p)/3   (13.6.5)

In particular, if the player always switches (p = 1), the probability that she wins is 2/3, and if the player never switches (p = 0), the probability that she wins is 1/3.
In the Monty Hall experiment, set the host strategy to standard. For each of the following values of p, run the simulation 1000 times. In each case,
compare the relative frequency of winning to the probability of winning.
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)
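The experiments above can be approximated with a short simulation (a sketch; the seed is arbitrary and `play` is a hypothetical helper):

```python
import random

def play(host, switch, rng):
    """Play one game; return True if the player wins the car."""
    doors = [1, 2, 3]
    car = rng.choice(doors)
    first = rng.choice(doors)
    if host == "standard":     # never the player's door, never the car door
        options = [d for d in doors if d != first and d != car]
    else:                      # blind: any door other than the player's choice
        options = [d for d in doors if d != first]
    opened = rng.choice(options)
    final = first if not switch else next(d for d in doors if d != first and d != opened)
    return final == car

rng = random.Random(7)  # arbitrary seed
for host in ("standard", "blind"):
    for switch in (False, True):
        wins = sum(play(host, switch, rng) for _ in range(1000))
        print(f"{host:8s} host, {'switch' if switch else 'stay':6s}: {wins / 1000:.3f}")
```

With the standard host, switching wins about 2/3 of the time; with the blind host, both strategies win about 1/3 of the time.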
Suppose that the host follows the blind strategy. Then for any player strategy, the probability that the player wins is
P(Y = U) = 1/3   (13.6.6)
In the Monty Hall experiment, set the host strategy to blind. For each of the following values of p, run the experiment 1000 times. In each case, compare
the relative frequency of winning to the probability of winning.
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)
For a complete solution of the Monty Hall problem, we want to compute the conditional probability that the player wins, given that the host opens a door
with a goat behind it:
P(Y = U ∣ V ≠ U) = P(Y = U) / P(V ≠ U)   (13.6.7)
With the basic host and player strategies, the numerator, the probability of winning, has been computed. Thus we need to consider the denominator, the
probability that the host opens a door with a goat. If the host uses the standard strategy, then the conditional probability of winning is the same as the
unconditional probability of winning, regardless of the player strategy. In particular, we have the following result:
If the host follows the standard strategy and the player switches with probability p, then
P(Y = U ∣ V ≠ U) = (1 + p)/3   (13.6.8)
Proof
The conditional probability of winning varies from 1/3 when p = 0, so that the player never switches, to 2/3 when p = 1, so that the player always switches.
If the host follows the blind strategy, then for any player strategy, P(V ≠ U) = 2/3 and therefore P(Y = U ∣ V ≠ U) = 1/2.
In the Monty Hall experiment, set the host strategy to blind. For each of the following values of p, run the experiment 500 times. In each case, compute
the conditional relative frequency of winning, given that the host shows a goat, and compare with the theoretical answer above.
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)
The confusion between the conditional probability of winning for these two strategies has been the source of much controversy in the Monty Hall problem.
Marilyn was probably thinking of the standard host strategy, while some of her critics were thinking of the blind strategy. This problem points out the
importance of careful modeling and the careful statement of assumptions. Marilyn is correct if the host follows the standard strategy; the critics are correct if
the host follows the blind strategy; any number of other answers could be correct if the host follows other strategies.
The mathematical formulation we have used is fairly complete. However, if we just want to solve Marilyn's problem, there is a much simpler analysis (which
you may have discovered yourself). Suppose that the host follows the standard strategy, and thus always opens a door with a goat. If the player's first door is
incorrect (contains a goat), then the host has no choice and must open the other door with a goat. Then, if the player switches, she wins. On the other hand, if
the player's first door is correct and she switches, then of course she loses. Thus, we see that if the player always switches, then she wins if and only if her
first choice is incorrect, an event that obviously has probability 2/3. If the player never switches, then she wins if and only if her first choice is correct, an event that has probability 1/3.
This page titled 13.6: The Monty Hall Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source
content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.7: Lotteries
You realize the odds of winning [the lottery] are the same as being mauled by a polar bear and a regular bear in the same day.
—E*TRADE baby, January 2010.
Lotteries are among the simplest and most widely played of all games of chance, and unfortunately for the gambler, among the
worst in terms of expected value. Lotteries come in such an incredible number of variations that it is impractical to analyze all of
them. So, in this section, we will study some of the more common lottery formats.
Figure 13.7.1 : A lottery ticket issued by the Continental Congress in 1776 to raise money for the American Revolutionary War.
Source: Wikipedia
Recall that
#(S) = C(N, n) = N! / [n! (N − n)!]   (13.7.2)
Naturally, we assume that all such combinations are equally likely, and thus, the chosen combination X, the basic random variable
of the experiment, is uniformly distributed on S .
P(X = x) = 1 / C(N, n),  x ∈ S   (13.7.3)
The player of the lottery pays a fee and gets to select m numbers, without replacement, from the integers from 1 to N . Again, order
does not matter, so the player essentially chooses a combination y of size m from the population {1, 2, … , N }. In many cases
m = n, so that the player gets to choose the same number of numbers as the house. In general then, there are three parameters in the basic lottery format: N, n, and m.
The number of catches U in the (N, n, m) lottery has probability density function given by
P(U = k) = C(m, k) C(N − m, n − k) / C(N, n),  k ∈ {0, 1, …, m}   (13.7.4)
The distribution of U is the hypergeometric distribution with parameters N , n , and m, and is studied in detail in the chapter on
Finite Sampling Models. In particular, from this section, it follows that the mean and variance of the number of catches U are
E(U) = n m/N   (13.7.5)

var(U) = n (m/N)(1 − m/N)(N − n)/(N − 1)   (13.7.6)
Note that P(U = k) = 0 if k > n or k < n + m − N . However, in most lotteries, m ≤ n and N is much larger than n+m . In
these common cases, the density function is positive for the values of k given above.
We will refer to the special case where m = n as the (N , n) lottery; this is the case in most state lotteries. In this case, the
probability density function of the number of catches U is
P(U = k) = C(n, k) C(N − n, n − k) / C(N, n),  k ∈ {0, 1, …, n}   (13.7.7)
The mean and variance of the number of catches U in this special case are
E(U) = n^2/N   (13.7.8)

var(U) = n^2 (N − n)^2 / [N^2 (N − 1)]   (13.7.9)
Explicitly give the probability density function, mean, and standard deviation of the number of catches in the (47, 5) lottery.
Answer
Explicitly give the probability density function, mean, and standard deviation of the number of catches in the (49, 5) lottery.
Answer
Explicitly give the probability density function, mean, and standard deviation of the number of catches in the (47, 7) lottery.
Answer
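The three answers above can be computed from the hypergeometric density (a sketch; `lottery_pdf` is a hypothetical helper):

```python
from math import comb, sqrt

def lottery_pdf(N, n, m=None):
    """Density of the number of catches U in the (N, n, m) lottery (hypergeometric)."""
    if m is None:
        m = n                            # the (N, n) special case
    return {k: comb(m, k) * comb(N - m, n - k) / comb(N, n)
            for k in range(max(0, n + m - N), min(n, m) + 1)}

for N, n in [(47, 5), (49, 5), (47, 7)]:
    pdf = lottery_pdf(N, n)
    mean = sum(k * p for k, p in pdf.items())
    sd = sqrt(sum(k ** 2 * p for k, p in pdf.items()) - mean ** 2)
    print(f"({N}, {n}) lottery: E(U) = {mean:.5f}, sd(U) ≈ {sd:.5f}, "
          f"P(jackpot) = 1/{comb(N, n):,}")
```

Note that the means agree with the formula E(U) = n^2/N; for example, E(U) = 25/47 ≈ 0.53191 in the (47, 5) lottery.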
The analysis above was based on the assumption that the player's combination y is selected deterministically. Would it matter if the
player chose the combination in a random way? Thus, suppose that the player's selected combination Y is a random variable taking
values in S . (For example, in many lotteries, players can buy tickets with combinations randomly selected by a computer; this is
typically known as Quick Pick). Clearly, X and Y must be independent, since the player (and her randomizing device) can have no
knowledge of the winning combination X. As you might guess, such randomization makes no difference.
Let U denote the number of catches in the (N , n, m) lottery when the player's combination Y is a random variable,
independent of the winning combination X. Then U has the same distribution as in the deterministic case above.
Proof
There are many websites that publish data on the frequency of occurrence of numbers in various state lotteries. Some gamblers
evidently feel that some numbers are luckier than others.
Given the assumptions and analysis above, do you believe that some numbers are luckier than others? Does it make any
mathematical sense to study historical data for a lottery?
The prize money in most state lotteries depends on the sales of the lottery tickets. Typically, about 50% of the sales money is
returned as prize money; the rest goes for administrative costs and profit for the state. The total prize money is divided among the
winning tickets, and the prize for a given ticket depends on the number of catches U . For all of these reasons, it is impossible to
give a simple mathematical analysis of the expected value of playing a given state lottery. Note however, that since the state keeps a
fixed percentage of the sales, there is essentially no risk for the state.
From a pure gambling point of view, state lotteries are bad games. In most casino games, by comparison, 90% or more of the
money that comes in is returned to the players as prize money. Of course, state lotteries should be viewed as a form of voluntary
taxation, not simply as games. The profits from lotteries are typically used for education, health care, and other essential services.
A discussion of the value and costs of lotteries from a political and social point of view (as opposed to a mathematical one) is
beyond the scope of this project.
13.7.2 https://stats.libretexts.org/@go/page/10261
Bonus Numbers
Many state lotteries now augment the basic $(N, n)$ format with a bonus number. The bonus number $T$ is selected from a specified
set of integers, in addition to the combination $X$, selected as before. The player likewise picks a bonus number $s$, in addition to a
combination $y$. The player's prize then depends on the number of catches $U$ between $X$ and $y$, as before, and in addition on
whether the player's bonus number $s$ matches the random bonus number $T$ chosen by the house. We will let $I$ denote the indicator
variable of this latter event. Thus, our interest now is in the joint distribution of $(I, U)$.
In one common format, the bonus number $T$ is selected at random from the set of integers $\{1, 2, \ldots, M\}$, independently of the
combination $X$ of size $n$ chosen from $\{1, 2, \ldots, N\}$. Usually $M < N$. Note that with this format, the game is essentially two
independent lotteries, one in the $(N, n)$ format and the other in the $(M, 1)$ format.
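Since the two draws are independent, the joint density factors into a product. Here is a quick numerical sketch, assuming the player picks $n$ numbers (so that $U$ has the hypergeometric distribution of the basic lottery) and that the bonus match has probability $1/M$; the parameters $(47, 5, 27)$ are the California-style format described below.

```python
from math import comb

def joint_pdf(N, n, M):
    """Joint pdf of (I, U) with an independent bonus number:
    P(I = i, U = k) = P(I = i) * P(U = k), where P(I = 1) = 1/M and
    U is hypergeometric as in the basic (N, n, n) lottery."""
    pdf = {}
    for k in range(n + 1):
        pU = comb(n, k) * comb(N - n, n - k) / comb(N, n)
        pdf[(0, k)] = (1 - 1 / M) * pU
        pdf[(1, k)] = (1 / M) * pU
    return pdf

pdf = joint_pdf(47, 5, 27)  # California-style format
```

The same function handles the Powerball-style format via `joint_pdf(49, 5, 42)`.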
Explicitly compute the joint probability density function of (I , U ) for the (47, 5) lottery with independent bonus number from
1 to 27. This format is used in the California lottery, among others.
Answer
Explicitly compute the joint probability density function of (I , U ) for the (49, 5) lottery with independent bonus number from
1 to 42. This format is used in the Powerball lottery, among others.
Answer
In another format, the bonus number T is chosen from 1 to N , and is distinct from the numbers in the combination X. To model
this game, we assume that T is uniformly distributed on {1, 2, … , N }, and given T = t , X is uniformly distributed on the set of
combinations of size n chosen from {1, 2, … , N } ∖ {t}. For this format, the joint probability density function is harder to
compute.
$$P(I = 0, U = k) = (N - n + 1)\,\frac{\binom{n}{k}\binom{N-1-n}{n-k}}{N\binom{N-1}{n}} + n\,\frac{\binom{n-1}{k}\binom{N-n}{n-k}}{N\binom{N-1}{n}}, \quad k \in \{0, 1, \ldots, n\} \tag{13.7.12}$$
Proof
Explicitly compute the joint probability density function of (I , U ) for the (47, 7) lottery with bonus number chosen as
described above. This format is used in the Super 7 Canada lottery, among others.
Keno
Keno is a lottery game played in casinos. For a fixed N (usually 80) and n (usually 20), the player can play a range of basic
(N , n, m) games, as described in the first subsection. Typically, m ranges from 1 to 15, and the payoff depends on m and the
number of catches U . In this section, you will compute the density function, mean, and standard deviation of the random payoff,
based on a unit bet, for a typical keno game with N = 80 , n = 20 , and m ∈ {1, 2, … , 15}. The payoff tables are based on the
keno game at the Tropicana casino in Atlantic City, New Jersey.
Recall that the probability density function of the number of catches $U$ above is given by
$$P(U = k) = \frac{\binom{m}{k}\binom{80-m}{20-k}}{\binom{80}{20}}, \quad k \in \{0, 1, \ldots, m\} \tag{13.7.13}$$
The payoff table for m =1 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 1
Catches 0 1
Payoff 0 3
Answer
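Computations of this kind can be automated. A minimal Python sketch, using the probability density function of $U$ from equation (13.7.13); the payoff lists are read off the tables (shown here for the Pick 1 and Pick 2 tables).

```python
from math import comb, sqrt

def catch_pdf(m, N=80, n=20):
    """P(U = k) for k = 0..m in the (N, n, m) keno game (equation 13.7.13)."""
    return [comb(m, k) * comb(N - m, n - k) / comb(N, n) for k in range(m + 1)]

def payoff_stats(payoffs):
    """Mean and standard deviation of the payoff on a unit bet, given the
    payoff for each number of catches 0..m."""
    m = len(payoffs) - 1
    pdf = catch_pdf(m)
    mean = sum(w * p for w, p in zip(payoffs, pdf))
    var = sum(w * w * p for w, p in zip(payoffs, pdf)) - mean * mean
    return mean, sqrt(var)

print(payoff_stats([0, 3]))      # Pick m = 1 table
print(payoff_stats([0, 0, 12]))  # Pick m = 2 table
```

For $m = 1$, $P(U = 1) = \binom{79}{19}/\binom{80}{20} = 1/4$, so the mean payoff is $3/4$, consistent with the 0.71 to 0.75 range noted later in the section.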
The payoff table for m =2 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 2
Catches 0 1 2
Payoff 0 0 12
Answer
The payoff table for m =3 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 3
Catches 0 1 2 3
Payoff 0 0 1 43
Answer
The payoff table for m =4 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 4
Catches 0 1 2 3 4
Payoff 0 0 1 3 130
Answer
The payoff table for m =5 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 5
Catches 0 1 2 3 4 5
Payoff 0 0 0 1 10 800
Answer
The payoff table for m =6 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 6
Catches 0 1 2 3 4 5 6
Payoff 0 0 0 1 4 95 1500
Answer
The payoff table for m =7 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 7
Catches 0 1 2 3 4 5 6 7
Answer
The payoff table for m =8 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 8
Catches 0 1 2 3 4 5 6 7 8
Answer
The payoff table for m =9 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 9
Catches 0 1 2 3 4 5 6 7 8 9
Answer
The payoff table for m = 10 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 10
Catches 0 1 2 3 4 5 6 7 8 9 10
Answer
The payoff table for m = 11 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 11
Catches 0 1 2 3 4 5 6 7 8 9 10 11
Answer
The payoff table for m = 12 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 12
Catches 0 1 2 3 4 5 6 7 8 9 10 11 12
Answer
The payoff table for m = 13 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 13
Catches 0 1 2 3 4 5 6 7 8 9 10 11 12 13
Payoff 1 0 0 0 0 0 1 20 80 600 3500 10,000 50,000 100,000
Proof
The payoff table for m = 14 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 14
Catches 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Answer
The payoff table for m = 15 is given below. Compute the probability density function, mean, and standard deviation of the
payoff.
Pick m = 15
Catches 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Answer
In the exercises above, you should have noticed that the expected payoff on a unit bet varies from about 0.71 to 0.75, so the
expected profit (for the gambler) varies from about −0.25 to −0.29. This is quite bad for the gambler playing a casino game, but as
always, the lure of a very high payoff on a small bet for an extremely rare event overrides the expected value analysis for most
players.
With m = 15 , show that the top 4 prizes (25,000, 50,000, 100,000, 100,000) contribute only about 0.017 (less than 2 cents) to
the total expected value of about 0.714.
On the other hand, the standard deviation of the payoff varies quite a bit, from about 1 to about 55.
Although the game is highly unfavorable for each m, with expected value that is nearly constant, which do you think is better
for the gambler—a format with high standard deviation or one with low standard deviation?
This page titled 13.7: Lotteries is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.8: The Red and Black Game
In this section and the following three sections, we will study gambling strategies for one of the simplest gambling models. Yet in
spite of the simplicity of the model, the mathematical analysis leads to some beautiful and sometimes surprising results that have
importance and application well beyond gambling. Our exposition is based primarily on the classic book Inequalities for Stochastic
Processes (How to Gamble if You Must) by Lester E. Dubins and Leonard J. Savage (1965).
Basic Theory
Assumptions
Here is the basic situation: The gambler starts with an initial sum of money. She bets on independent, probabilistically identical
games, each with two outcomes—win or lose. If she wins a game, she receives the amount of the bet on that game; if she loses a
game, she must pay the amount of the bet. Thus, the gambler plays at even stakes. This particular situation (IID games and even
stakes) is known as red and black, and is named for the color bets in the casino game roulette. Other examples are the pass and
don't pass bets in craps.
Let us try to formulate the gambling experiment mathematically. First, let $I_n$ denote the outcome of the $n$th game for $n \in \mathbb{N}_+$,
where 1 denotes a win and 0 denotes a loss. These are independent indicator random variables with the same distribution:
$$P(I_j = 1) = p, \quad P(I_j = 0) = q = 1 - p \tag{13.8.1}$$
where $p \in [0, 1]$ is the probability of winning an individual game. Thus, $I = (I_1, I_2, \ldots)$ is a sequence of Bernoulli trials.
If $p = 0$, then the gambler always loses and if $p = 1$ then the gambler always wins. These trivial cases are not interesting, so we
will usually assume that $0 < p < 1$. In real gambling houses, of course, $p < \frac{1}{2}$ (that is, the games are unfair to the player), so we
will sometimes assume that $p < \frac{1}{2}$ as well.
Random Processes
The gambler's fortune over time is the basic random process of interest: Let $X_0$ denote the gambler's initial fortune and $X_i$ the
gambler's fortune after $i$ games. The gambler's strategy consists of the decisions of how much to bet on the various games and
when to quit. Let $Y_i$ denote the amount of the $i$th bet, and let $N$ denote the number of games played by the gambler. If we want to,
we can always assume that the games go on forever, but with the assumption that the gambler bets 0 on all games after $N$. With
this understanding, the game outcome, fortune, and bet processes are defined for all times $i \in \mathbb{N}_+$.
Strategies
The gambler's strategy can be very complicated. For example, the random variable $Y_n$, the gambler's bet on game $n$, or the event
$\{N = n - 1\}$, her decision to stop after $n - 1$ games, could be based on the entire past history of the game, up to time $n$.
Moreover, they could have additional sources of randomness. For example, a gambler playing roulette could partly base her bets on
the roll of a lucky die that she keeps in her pocket. However, the gambler cannot see into the future (unfortunately from her point of
view), so we can at least assume that $Y_n$ and $\{N = n - 1\}$ are independent of $(I_n, I_{n+1}, \ldots)$.
At least in terms of expected value, any gambling strategy is futile if the games are unfair.
Proof
Suppose that the gambler has a positive probability of making a real bet on game $i$, so that $E(Y_i) > 0$. Then
1. $E(X_i) < E(X_{i-1})$ if $p < \frac{1}{2}$
2. $E(X_i) > E(X_{i-1})$ if $p > \frac{1}{2}$
3. $E(X_i) = E(X_{i-1})$ if $p = \frac{1}{2}$
13.8.1 https://stats.libretexts.org/@go/page/10262
Proof
Thus on any game in which the gambler makes a positive bet, her expected fortune strictly decreases if the games are unfair,
remains the same if the games are fair, and strictly increases if the games are favorable.
As we noted earlier, a general strategy can depend on the past history and can be randomized. However, since the underlying
Bernoulli games are independent, one might guess that these complicated strategies are no better than simple strategies in which the
amount of the bet and the decision to stop are based only on the gambler's current fortune. These simple strategies do indeed play a
fundamental role and are referred to as stationary, deterministic strategies. Such a strategy can be described by a betting function S
from the space of fortunes to the space of allowable bets, so that S(x) is the amount that the gambler bets when her current fortune
is x.
$$N = \min\{n \in \mathbb{N} : X_n = 0 \text{ or } X_n = a\} \tag{13.8.4}$$
Thus, any strategy (betting function) $S$ must satisfy $S(x) \le \min\{x, a - x\}$ for $0 \le x \le a$: the gambler cannot bet what she does
not have, and will not bet more than is necessary to reach the target $a$.
If we want to, we can think of the difference between the target fortune and the initial fortune as the entire fortune of the house.
With this interpretation, the player and the house play symmetric roles, but with complementary win probabilities: play continues
until either the player is ruined or the house is ruined. Our main interest is in the final fortune $X_N$ of the gambler. Note that this
random variable takes just two values, 0 and $a$.
Presumably, the gambler would like to maximize the probability of reaching the target fortune. Is it better to bet small amounts or
large amounts, or does it not matter? How does the optimal strategy, if there is one, depend on the initial fortune, the target fortune,
and the game win probability?
We are also interested in E(N ), the expected number of games played. Perhaps a secondary goal of the gambler is to maximize the
expected number of games that she gets to play. Are the two goals compatible or incompatible? That is, can the gambler maximize
both her probability of reaching the target and the expected number of games played, or does maximizing one quantity necessarily
mean minimizing the other?
In the next two sections, we will analyze and compare two strategies that are in a sense opposites:
Timid Play: On each game, until she stops, the gambler makes a small constant bet, say $1.
Bold Play: On each game, until she stops, the gambler bets either her entire fortune or the amount needed to reach the target
fortune, whichever is smaller.
In the final section of the chapter, we will return to the question of optimal strategies.
This page titled 13.8: The Red and Black Game is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
13.9: Timid Play
Basic Theory
Recall that with the strategy of timid play in red and black, the gambler makes a small constant bet, say \$1, on each game until she
stops. Thus, on each game, the gambler's fortune either increases by 1 or decreases by 1, until the fortune reaches either 0 or the
target $a$ (which we assume is a positive integer). Thus, the fortune process $(X_0, X_1, \ldots)$ is a random walk on the fortune space
$\{0, 1, \ldots, a\}$, with 0 and $a$ as absorbing states.
As usual, we are interested in the probability of winning and the expected number of games. The key idea in the analysis is that
after each game, the fortune process simply starts over again, but with a different initial value. This is an example of the Markov
property, named for Andrei Markov. A separate chapter on Markov Chains explores these random processes in more detail. In
particular, this chapter has sections on Birth-Death Chains and Random Walks on Graphs, particular classes of Markov chains that
generalize the random processes that we are studying here.
The function f satisfies the following difference equation and boundary conditions:
1. f (x) = qf (x − 1) + pf (x + 1) for x ∈ {1, 2, … , a − 1}
2. f (0) = 0 , f (a) = 1
Proof
The difference equation is linear (in the unknown function f ), homogeneous (because each term involves the unknown function f ),
and second order (because 2 is the difference between the largest and smallest fortunes in the equation). Recall that linear
homogeneous difference equations can be solved by finding the roots of the characteristic equation.
If $p \neq \frac{1}{2}$, then the roots are distinct. In this case, the probability that the gambler reaches her target is
$$f(x) = \frac{(q/p)^x - 1}{(q/p)^a - 1}, \quad x \in \{0, 1, \ldots, a\} \tag{13.9.2}$$
If $p = \frac{1}{2}$, the characteristic equation has a single root 1 that has multiplicity 2. In this case, the probability that the gambler
reaches her target is simply the ratio of the initial fortune to the target fortune:
$$f(x) = \frac{x}{a}, \quad x \in \{0, 1, \ldots, a\} \tag{13.9.3}$$
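The two cases can be combined in a short function; a sketch:

```python
def win_probability(x, a, p):
    """Probability of reaching target a before ruin, starting at x,
    under timid (unit-bet) play with game win probability p.
    Uses (13.9.3) when p = 1/2 and (13.9.2) otherwise."""
    if p == 0.5:
        return x / a
    r = (1 - p) / p  # the ratio q/p
    return (r**x - 1) / (r**a - 1)
```

For example, `win_probability(50, 100, 0.5)` is exactly $1/2$, while with $p = 0.45$ the same gambler reaches the target with probability well under $1/2$.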
In the red and black experiment, choose Timid Play. Vary the initial fortune, target fortune, and game win probability and note
how the probability of winning the game changes. For various values of the parameters, run the experiment 1000 times and
compare the relative frequency of winning a game to the probability of winning a game.
13.9.1 https://stats.libretexts.org/@go/page/10263
As a function of $x$, for fixed $p$ and $a$,
1. $f$ is increasing, from 0 (at $x = 0$) to 1 (at $x = a$).
2. $f$ is concave upward if $p < \frac{1}{2}$ and concave downward if $p > \frac{1}{2}$. Of course, $f$ is linear if $p = \frac{1}{2}$.
The function g satisfies the following difference equation and boundary conditions:
1. g(x) = qg(x − 1) + pg(x + 1) + 1 for x ∈ {1, 2, … , a − 1}
2. g(0) = 0 , g(a) = 0
Proof
The difference equation in the last exercise is linear, second order, but non-homogeneous (because of the constant term 1 on the
right side). The corresponding homogeneous equation is the equation satisfied by the win probability function f . Thus, only a little
additional work is needed to solve the non-homogeneous equation.
If $p \neq \frac{1}{2}$, then
$$g(x) = \frac{x}{q - p} - \frac{a}{q - p} f(x), \quad x \in \{0, 1, \ldots, a\} \tag{13.9.6}$$
If $p = \frac{1}{2}$, then
$$g(x) = x(a - x), \quad x \in \{0, 1, \ldots, a\} \tag{13.9.7}$$
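The expected-games formulas can be coded directly; a sketch (the win probability $f$ is recomputed inside):

```python
def expected_games(x, a, p):
    """Expected number of games under timid play, starting at x with
    target a: x(a - x) when p = 1/2, and (13.9.6) otherwise."""
    if p == 0.5:
        return x * (a - x)
    q = 1 - p
    r = q / p
    f = (r**x - 1) / (r**a - 1)  # win probability, equation (13.9.2)
    return x / (q - p) - (a / (q - p)) * f
```

With $p = \frac{1}{2}$ and $a = 100$, this gives 99 games from an initial fortune of 1 and 2500 games from an initial fortune of 50, the values quoted in this section.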
Consider $g$ as a function of the initial fortune $x$, for fixed values of the game win probability $p$ and the target fortune $a$.
1. $g$ at first increases and then decreases.
2. $g$ is concave downward.
3. When $p = \frac{1}{2}$, the maximum value of $g$ is $\frac{a^2}{4}$ and occurs when $x = \frac{a}{2}$. When $p \neq \frac{1}{2}$, the value of $x$ where the maximum occurs is
rather complicated.
For many parameter settings, the expected number of games is surprisingly large. For example, suppose that $p = \frac{1}{2}$ and the target
fortune is 100. If the gambler's initial fortune is 1, then the expected number of games is 99, even though half of the time, the
gambler will be ruined on the first game. If the initial fortune is 50, the expected number of games is 2500.
In the red and black experiment, select Timid Play. Vary the initial fortune, the target fortune and the game win probability and
notice how the expected number of games changes. For various values of the parameters, run the experiment 1000 times and
compare the sample mean number of games to the expected value.
In the red and black game, set the target fortune to 16, the initial fortune to 8, and the win probability to 0.45. Play 10 games
with each of the following strategies. Which seems to work best?
1. Bet 1 on each game (timid play).
2. Bet 2 on each game.
3. Bet 4 on each game.
4. Bet 8 on each game (bold play).
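Rather than playing by hand, the exact win probabilities of the four strategies can be computed: a constant bet of $k$ is timid play on the scaled problem with initial fortune $x/k$ and target $a/k$, so the unit-bet formulas apply. A sketch:

```python
def win_probability(x, a, p):
    # Gambler's-ruin win probability with unit bets, equations (13.9.2)-(13.9.3)
    if p == 0.5:
        return x / a
    r = (1 - p) / p
    return (r**x - 1) / (r**a - 1)

# Constant bet k on the (x, a) problem = unit bets on the (x/k, a/k) problem
x, a, p = 8, 16, 0.45
probs = {k: win_probability(x // k, a // k, p) for k in (1, 2, 4, 8)}
```

With these unfair games, the win probability increases with the bet size; betting 8 (bold play) gives probability exactly $p = 0.45$.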
We will need to embellish our notation to indicate the dependence on the target fortune. Let
$$f(x, a) = P(X_N = a \mid X_0 = x), \quad x \in \{0, 1, \ldots, a\}, \; a \in \mathbb{N}_+ \tag{13.9.8}$$
Now fix $p$ and suppose that the target fortune is $2a$ and the initial fortune is $2x$. If the gambler plays timidly (betting \$1 each time),
then of course, her probability of reaching the target is $f(2x, 2a)$. On the other hand:
Suppose that the gambler bets \$2 on each game. The fortune process $(X_i / 2 : i \in \mathbb{N})$ corresponds to timid play with initial
fortune $x$ and target fortune $a$, and therefore the probability that the gambler reaches the target is $f(x, a)$.
Thus, we need to compare the probabilities $f(2x, 2a)$ and $f(x, a)$.
In particular,
1. $f(2x, 2a) < f(x, a)$ if $p < \frac{1}{2}$
2. $f(2x, 2a) = f(x, a)$ if $p = \frac{1}{2}$
3. $f(2x, 2a) > f(x, a)$ if $p > \frac{1}{2}$
Thus, it appears that increasing the bets is a good idea if the games are unfair, a bad idea if the games are favorable, and makes no
difference if the games are fair.
What about the expected number of games played? It seems almost obvious that if the bets are increased, the expected number of
games played should decrease, but a direct analysis using the expected value function above is harder than one might hope (try it!).
We will use a different method, one that actually gives better results. Specifically, we will have the $1 and $2 gamblers bet on the
same underlying sequence of games, so that the two fortune processes are defined on the same sample space. Then we can compare
the actual random variables (the number of games played), which in turn leads to a comparison of their expected values. Recall that
this general method is referred to as coupling.
Let $X_n$ denote the fortune after $n$ games for the gambler making \$1 bets (simple timid play). Then $2X_n - X_0$ is the fortune
after $n$ games for the gambler making \$2 bets (with the same initial fortune, betting on the same sequence of games). Assume
again that the initial fortune is $2x$ and the target fortune is $2a$ where $0 < x < a$. Let $N_1$ denote the number of games played by
the \$1 gambler, and $N_2$ the number of games played by the \$2 gambler. Then $N_2 \le N_1$, and $E(N_2) < E(N_1)$.
Of course, the expected values agree (and are both 0) if $x = 0$ or $x = a$. This result shows that $N_2$ is stochastically smaller than
$N_1$ even when the gamblers are not playing the same sequence of games (so that the random variables are not defined on the same
sample space).
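The coupling can be checked by simulation: both gamblers are driven by the same game sequence, and $N_2 \le N_1$ then holds on every sample path, since the \$2 gambler's absorption set $\{x, x + a\}$ for $X_n$ lies inside the \$1 gambler's set $\{0, 2a\}$. A sketch (the values $x = 3$, $a = 8$, $p = 0.45$ are arbitrary):

```python
import random

def coupled_games(x, a, p, rng):
    """One coupled run of the two gamblers betting on the same games.
    The $1 gambler's fortune X_n starts at 2x and is absorbed at 0 or 2a;
    the $2 gambler's fortune is 2*X_n - 2x, absorbed when X_n hits x or x + a.
    Returns (N_1, N_2), the numbers of games each gambler plays."""
    fortune = 2 * x
    n = 0
    n2 = None
    while fortune not in (0, 2 * a):
        n += 1
        fortune += 1 if rng.random() < p else -1
        if n2 is None and fortune in (x, x + a):
            n2 = n  # first time the $2 gambler is absorbed
    return n, n2

rng = random.Random(7)
runs = [coupled_games(3, 8, 0.45, rng) for _ in range(200)]
```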
Generalize the analysis in this subsection to compare timid play with the strategy of betting \$$k$ on each game (let the initial
fortune be $kx$ and the target fortune $ka$).
It appears that with unfair games, the larger the bets the better, at least in terms of the probability of reaching the target. Thus, we
are naturally led to consider bold play.
This page titled 13.9: Timid Play is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.10: Bold Play
Basic Theory
Preliminaries
Recall that with the strategy of bold play in red and black, the gambler on each game bets either her entire fortune or the amount
needed to reach the target fortune, whichever is smaller. As usual, we are interested in the probability that the player reaches the
target and the expected number of trials. The first interesting fact is that only the ratio of the initial fortune to the target fortune
matters, quite in contrast to timid play.
Suppose that the gambler plays boldly with initial fortune $x$ and target fortune $a$. As usual, let $X = (X_0, X_1, \ldots)$ denote the
fortune process for the gambler. For any $c > 0$, the random process $cX = (cX_0, cX_1, \ldots)$ is the fortune process for bold play
with initial fortune $cx$ and target fortune $ca$.
Because of this result, it is convenient to use the target fortune as the monetary unit and to allow irrational, as well as rational,
initial fortunes. Thus, the fortune space is $[0, 1]$. Sometimes in our analysis we will ignore the states 0 or 1; clearly there is no harm
in this because in these states, the game is over.
Recall that the betting function $S$ is the function that gives the amount bet as a function of the current fortune. For bold play,
the betting function is
$$S(x) = \min\{x, 1 - x\} = \begin{cases} x, & 0 \le x \le \frac{1}{2} \\ 1 - x, & \frac{1}{2} \le x \le 1 \end{cases} \tag{13.10.1}$$
The function $F$ satisfies the following functional equation and boundary conditions:
1. $F(x) = \begin{cases} p F(2x), & 0 \le x \le \frac{1}{2} \\ p + q F(2x - 1), & \frac{1}{2} \le x \le 1 \end{cases}$
2. $F(0) = 0$, $F(1) = 1$
From the previous result, and a little thought, it should be clear that an important role is played by the function $d(x) = 2x \bmod 1$ for $x \in [0, 1)$.
13.10.1 https://stats.libretexts.org/@go/page/10264
The function $d$ is called the doubling function, mod 1, since $d(x)$ gives the fractional part of $2x$.
Note that until the last bet that ends the game (with the player ruined or victorious), the successive fortunes of the player follow
iterates of the map d . Thus, bold play is intimately connected with the dynamical system associated with d .
Binary Expansions
One of the keys to our analysis is to represent the initial fortune in binary form:
$$x = \sum_{i=1}^{\infty} \frac{x_i}{2^i}$$
where $x_i \in \{0, 1\}$ for each $i \in \mathbb{N}_+$. This representation is unique except when $x$ is a binary rational (sometimes also called a
dyadic rational), that is, a number of the form $k/2^n$ where $n \in \mathbb{N}_+$ and $k \in \{1, 3, \ldots, 2^n - 1\}$; the positive integer $n$ is
called the rank of $x$. Binary rationals are discussed in more detail in the chapter on Foundations.
For a binary rational $x$ of rank $n$, we will use the standard terminating representation where $x_n = 1$ and $x_i = 0$ for $i > n$. Rank
can be extended to all numbers in $[0, 1)$ by defining the rank of 0 to be 0 (0 is also considered a binary rational) and by defining the
rank of a binary irrational to be $\infty$. We will denote the rank of $x$ by $r(x)$.
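Ranks and binary digits are easy to compute exactly with rational arithmetic; a sketch (`rank` is intended for binary rationals, where the doubling loop terminates):

```python
from fractions import Fraction

def rank(x):
    """Rank of a binary rational x in [0, 1): n if x = k/2^n in lowest
    terms with k odd, and 0 for x = 0."""
    x = Fraction(x)
    n = 0
    while x.denominator > 1:
        x *= 2          # doubling shifts out one binary digit
        x -= int(x)     # keep only the fractional part
        n += 1
    return n

def bits(x, m):
    """First m binary digits x_1, x_2, ..., x_m of x in [0, 1)."""
    x = Fraction(x)
    out = []
    for _ in range(m):
        x *= 2
        b = int(x)      # the next binary digit
        out.append(b)
        x -= b
    return out
```

For example, $3/8 = 0.011_2$ has rank 3, and the terminating expansion of $5/8$ is $0.101_2$.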
Applied to the binary sequences, the doubling function $d$ is the shift operator: the bit sequence of $d(x)$ is $(x_2, x_3, \ldots)$.
Bold play in red and black can be elegantly described by comparing the bits of the initial fortune with the game bits.
Suppose that the gambler starts with initial fortune $x \in (0, 1)$. The gambler eventually reaches the target 1 if and only if there
exists a positive integer $k$ such that $I_j = 1 - x_j$ for $j \in \{1, 2, \ldots, k - 1\}$ and $I_k = x_k$. That is, the gambler wins if and only
if when the game bit agrees with the corresponding fortune bit for the first time, that bit is 1.
The random variable whose bits are the complements of the fortune bits will play an important role in our analysis. Thus, let
$$W = \sum_{j=1}^{\infty} \frac{1 - I_j}{2^j} \tag{13.10.4}$$
Note that $W$ is a well defined random variable taking values in $[0, 1]$.
Suppose that the gambler starts with initial fortune x ∈ (0, 1). Then the gambler reaches the target 1 if and only if W <x .
Proof
W has a continuous distribution. That is, P(W = x) = 0 for any x ∈ [0, 1].
From the previous two results, it follows that F is simply the distribution function of W . In particular, F is an increasing function,
and since W has a continuous distribution, F is a continuous function.
The success function F is the unique continuous solution of the functional equation above.
Proof
If we introduce a bit more notation, we can give a nice expression for $F(x)$, and later for the expected number of games $G(x)$. Let
$p_0 = p$ and $p_1 = q = 1 - p$.
$$F(x) = \sum_{n=1}^{\infty} p_{x_1} \cdots p_{x_{n-1}} \, p \, x_n \tag{13.10.5}$$
Note that $p \cdot x_n$ in the last expression is correct; it's not a misprint of $p_{x_n}$. Thus, only terms with $x_n = 1$ are included in the sum.
F is strictly increasing on [0, 1]. This means that the distribution of W has support [0, 1]; that is, there are no subintervals of
[0, 1] that have positive length, but 0 probability.
In particular,
1. $F(\frac{1}{8}) = p^3$
2. $F(\frac{2}{8}) = p^2$
3. $F(\frac{3}{8}) = p^2 + p^2 q$
4. $F(\frac{4}{8}) = p$
5. $F(\frac{5}{8}) = p + p^2 q$
6. $F(\frac{6}{8}) = p + pq$
7. $F(\frac{7}{8}) = p + pq + pq^2$
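These values can be verified numerically from the functional equation for $F$; a sketch using exact dyadic arithmetic so the recursion terminates for binary rational fortunes:

```python
from fractions import Fraction

def F(x, p):
    """Success probability of bold play, computed from the functional
    equation F(x) = p F(2x) on [0, 1/2] and F(x) = p + q F(2x - 1) on
    [1/2, 1], with F(0) = 0 and F(1) = 1.  Terminates for binary rational x."""
    x = Fraction(x)
    if x == 0:
        return 0.0
    if x == 1:
        return 1.0
    if x <= Fraction(1, 2):
        return p * F(2 * x, p)
    return p + (1 - p) * F(2 * x - 1, p)

p, q = 0.3, 0.7
print(F(Fraction(3, 8), p))  # p^2 + p^2 q
```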
If $p = \frac{1}{2}$ then $F(x) = x$ for $x \in [0, 1]$.
Proof
Thus, for $p = \frac{1}{2}$ (fair trials), the probability that the bold gambler reaches the target fortune $a$ starting from the initial fortune $x$ is
$x/a$, just as it is for the timid gambler. Note also that the random variable $W$ has the uniform distribution on $[0, 1]$. When $p \neq \frac{1}{2}$,
the distribution of $W$ is quite strange. To state the result succinctly, we will indicate the dependence of the probability
measure $P$ on the parameter $p \in (0, 1)$. First we define
$$C_p = \left\{ x \in [0, 1] : \frac{1}{n} \sum_{i=1}^{n} (1 - x_i) \to p \text{ as } n \to \infty \right\} \tag{13.10.6}$$
Thus, $C_p$ is the set of $x \in [0, 1]$ for which the relative frequency of 0's in the binary expansion is $p$.
1. $P_p(W \in C_p) = 1$
2. $P_p(W \in C_t) = 0$ for $t \neq p$
Proof
When $p \neq \frac{1}{2}$, $W$ does not have a probability density function (with respect to Lebesgue measure on $[0, 1]$), even though $W$
has a continuous distribution.
When $p \neq \frac{1}{2}$, $F$ has derivative 0 at almost every point in $[0, 1]$, even though it is strictly increasing.
In the red and black experiment, select Bold Play. Vary the initial fortune, target fortune, and game win probability with the
scroll bars and note how the probability of winning the game changes. In particular, note that this probability depends only on
$x/a$. Now for various values of the parameters, run the experiment 1000 times and compare the relative frequency function to
the probability density function.
The function $G$, giving the expected number of games, satisfies the following functional equation and boundary conditions:
1. $G(x) = \begin{cases} 1 + p G(2x), & 0 < x \le \frac{1}{2} \\ 1 + q G(2x - 1), & \frac{1}{2} \le x < 1 \end{cases}$
2. $G(0) = 0$, $G(1) = 0$
Proof
Note, interestingly, that the functional equation is not satisfied at x =0 or x =1 . As before, we can give an alternate analysis
using the binary representation of an initial fortune x ∈ (0, 1).
Suppose that the initial fortune of the gambler is $x \in (0, 1)$. Then $N = \min\{k \in \mathbb{N}_+ : I_k = x_k \text{ or } k = r(x)\}$.
Proof
We can give an explicit formula for the expected number of trials $G(x)$ in terms of the binary representation of $x$. Recall our
special notation: $p_0 = p$, $p_1 = q = 1 - p$.
$$G(x) = \sum_{n=0}^{r(x)-1} p_{x_1} \cdots p_{x_n} \tag{13.10.7}$$
Note that the $n = 0$ term is 1, since the product is empty. The sum has a finite number of terms if $x$ is a binary rational, and the
sum has an infinite number of terms if $x$ is a binary irrational.
In particular,
1. $G(\frac{1}{8}) = 1 + p + p^2$
2. $G(\frac{2}{8}) = 1 + p$
3. $G(\frac{3}{8}) = 1 + p + pq$
4. $G(\frac{4}{8}) = 1$
5. $G(\frac{5}{8}) = 1 + q + pq$
6. $G(\frac{6}{8}) = 1 + q$
7. $G(\frac{7}{8}) = 1 + q + q^2$
If $p = \frac{1}{2}$ then
$$G(x) = \begin{cases} 2 - \dfrac{1}{2^{r(x)-1}}, & x \text{ is a binary rational} \\ 2, & x \text{ is a binary irrational} \end{cases} \tag{13.10.8}$$
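These formulas can be checked numerically; a sketch, assuming the recursion $G(x) = 1 + pG(2x)$ for $x \in (0, \frac{1}{2}]$ and $G(x) = 1 + qG(2x-1)$ for $x \in [\frac{1}{2}, 1)$, with $G(0) = G(1) = 0$:

```python
from fractions import Fraction

def G(x, p):
    """Expected number of games under bold play, for binary rational x,
    computed from the recursion G(x) = 1 + p G(2x) on (0, 1/2] and
    G(x) = 1 + q G(2x - 1) on [1/2, 1), with G(0) = G(1) = 0."""
    x = Fraction(x)
    if x == 0 or x == 1:
        return 0.0
    if x <= Fraction(1, 2):
        return 1 + p * G(2 * x, p)
    return 1 + (1 - p) * G(2 * x - 1, p)
```

For example, $G(\frac{1}{8}) = 1 + p + p^2$, which at $p = \frac{1}{2}$ equals $7/4 = 2 - 1/2^{r(1/8)-1}$, consistent with (13.10.8).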
Figure 13.10.4: The expected number of games in bold play with fair games
In the red and black experiment, select Bold Play. Vary x, a , and p with the scroll bars and note how the expected number of
trials changes. In particular, note that the mean depends only on the ratio x/a. For selected values of the parameters, run the
experiment 1000 times and compare the sample mean to the distribution mean.
However, as a function of the initial fortune x, for fixed p, the function G is very irregular.
G is discontinuous at the binary rationals in [0, 1] and continuous at the binary irrationals.
This page titled 13.10: Bold Play is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.11: Optimal Strategies
Basic Theory
Definitions
Recall that the stopping rule for red and black is to continue playing until the gambler is ruined or her fortune reaches the target
fortune $a$. Thus, the gambler's strategy is to decide how much to bet on each game before she must stop. Suppose that we have a
class of strategies that correspond to certain valid fortunes and bets; $A$ will denote the set of fortunes and $B_x$ will denote the set of
valid bets for $x \in A$. For example, sometimes (as with timid play) we might want to restrict the fortunes to the set of integers
$\{0, 1, \ldots, a\}$; other times (as with bold play) we might want to use the interval $[0, 1]$ as the fortune space. As for the bets, recall
that the gambler cannot bet what she does not have and will not bet more than she needs in order to reach the target. Thus, a betting
function $S$ must satisfy $S(x) \in B_x$ for $x \in A$.
Moreover, we always restrict our strategies to those for which the stopping time N is finite.
The success function of a strategy is the probability that the gambler reaches the target a with that strategy, as a function of the
initial fortune x. A strategy with success function V is optimal if for any other strategy with success function U , we have
U (x) ≤ V (x) for x ∈ A .
If there exists an optimal strategy, then the optimal success function is unique.
However, there may not exist an optimal strategy or there may be several optimal strategies. Moreover, the optimality question
depends on the value of the game win probability p, in addition to the structure of fortunes and bets.
Proof
multiples of a basic unit, which we might as well assume is \$1. Of course, real gambling houses have this restriction. Thus the set
of valid fortunes is $A = \{0, 1, \ldots, a\}$ and the set of valid bets for $x \in A$ is $B_x = \{0, 1, \ldots, \min\{x, a - x\}\}$. Our main result for
this case is that timid play is an optimal strategy.
In the red and black game set the target fortune to 16, the initial fortune to 8, and the game win probability to 0.45. Define the
strategy of your choice and play 100 games. Compare your relative frequency of wins with the probability of winning with
timid play.
it is natural to take the target as the monetary unit so that the set of fortunes is $A = [0, 1]$, and the set of bets for $x \in A$ is
$B_x = [0, \min\{x, 1 - x\}]$. Our main result for this case is given below. The results for timid play will play an important role in the
analysis, so we will let $f(j, a)$ denote the probability of reaching an integer target $a$, starting at the integer $j \in [0, a]$, with unit
bets.
13.11.1 https://stats.libretexts.org/@go/page/10456
The optimal success function is V (x) = x for x ∈ (0, 1].
Proof
Unfair Trials
We will now assume that p ≤ \frac{1}{2}, so that the trials are unfair, or at least not favorable. As before, we will take the target fortune as
the basic monetary unit and allow any valid fraction of this unit as a bet. Thus, the set of fortunes is A = [0, 1], and the set of bets
for x ∈ A is B_x = [0, min{x, 1 − x}]. Our main result for this case is
In the red and black game, set the target fortune to 16, the initial fortune to 8, and the game win probability to 0.45. Define the
strategy of your choice and play 100 games. Compare your relative frequency of wins with the probability of winning with
bold play.
It turns out that bold play is not the only optimal strategy; amazingly, there are infinitely many optimal strategies. Recall first that the bold strategy has betting
function

S_1(x) = \min\{x, 1 - x\} = \begin{cases} x, & 0 \le x \le \frac{1}{2} \\ 1 - x, & \frac{1}{2} \le x \le 1 \end{cases} \quad (13.11.8)
Consider the following strategy, which we will refer to as the second order bold strategy:
1. With fortune x ∈ (0, \frac{1}{2}), play boldly with the object of reaching \frac{1}{2} before falling to 0.
2. With fortune x ∈ (\frac{1}{2}, 1), play boldly with the object of reaching 1 without falling below \frac{1}{2}.
3. With fortune \frac{1}{2}, play boldly and bet \frac{1}{2}.

The betting function S_2 of this strategy is

S_2(x) = \begin{cases} x, & 0 \le x < \frac{1}{4} \\ \frac{1}{2} - x, & \frac{1}{4} \le x < \frac{1}{2} \\ \frac{1}{2}, & x = \frac{1}{2} \\ x - \frac{1}{2}, & \frac{1}{2} < x \le \frac{3}{4} \\ 1 - x, & \frac{3}{4} < x \le 1 \end{cases} \quad (13.11.9)
Figure 13.11.2: The betting function for the second order bold strategy
Once we understand how this construction is done, it's straightforward to define the third order bold strategy and show that it's
optimal as well.
Figure 13.11.3: The betting function for the third order bold strategy
Explicitly give the third order betting function and show that the strategy is optimal.
More generally, we can define the nth order bold strategy and show that it is optimal as well.
The sequence of bold strategies can be defined recursively from the basic bold strategy S_1 as follows:

S_{n+1}(x) = \begin{cases} \frac{1}{2} S_n(2x), & 0 \le x < \frac{1}{2} \\ \frac{1}{2}, & x = \frac{1}{2} \\ \frac{1}{2} S_n(2x - 1), & \frac{1}{2} < x \le 1 \end{cases} \quad (13.11.12)
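To make the recursion concrete, here is a short sketch (Python; the function names are our own) that builds the nth order betting function from the recursion and checks it against the explicit second order betting function (13.11.9):

```python
def S1(x):
    """Basic bold betting function: min{x, 1 - x}."""
    return min(x, 1.0 - x)

def S(n, x):
    """n-th order bold betting function, via the recursion S_{n+1}."""
    if n == 1:
        return S1(x)
    if x == 0.5:
        return 0.5
    if x < 0.5:
        return 0.5 * S(n - 1, 2 * x)
    return 0.5 * S(n - 1, 2 * x - 1)

def S2_explicit(x):
    """Second order bold betting function, equation (13.11.9)."""
    if x < 0.25:
        return x
    if x < 0.5:
        return 0.5 - x
    if x == 0.5:
        return 0.5
    if x <= 0.75:
        return x - 0.5
    return 1.0 - x

# The recursion reproduces the explicit formula for n = 2
for x in (0.0, 0.1, 0.25, 0.3, 0.45, 0.5, 0.6, 0.75, 0.9, 1.0):
    assert abs(S(2, x) - S2_explicit(x)) < 1e-12
```

The same function with n = 3 produces the third order betting function of the exercise above.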
Even more generally, we can define an optimal strategy T in the following way: for each x ∈ [0, 1] select n_x ∈ N_+ and let
T(x) = S_{n_x}(x). The graph below shows a few of the graphs of the bold strategies. For an optimal strategy T, we just need to …
Martingales
Let's return to the unscaled formulation of red and black, where the target fortune is a ∈ (0, ∞) and the initial fortune is
x ∈ (0, a). In the subfair case, when p ≤ \frac{1}{2}, no strategy can do better than the optimal strategies, so that the win probability is
bounded by x/a (with equality when p = \frac{1}{2}). Another elegant proof of this is given in the section on inequalities in the chapter on
martingales.
This page titled 13.11: Optimal Strategies is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
CHAPTER OVERVIEW
14: The Poisson Process
The Poisson process is one of the most important random processes in probability theory. It is widely used to model random
“points” in time and space, such as the times of radioactive emissions, the arrival times of customers at a service center, and the
positions of flaws in a piece of material. Several important probability distributions arise naturally from the Poisson process—the
Poisson distribution, the exponential distribution, and the gamma distribution. The process has a beautiful mathematical structure,
and is used as a foundation for building a number of other, more complicated random processes.
14.1: Introduction to the Poisson Process
14.2: The Exponential Distribution
14.3: The Gamma Distribution
14.4: The Poisson Distribution
14.5: Thinning and Superposition
14.6: Non-homogeneous Poisson Processes
14.7: Compound Poisson Processes
14.8: Poisson Processes on General Spaces
This page titled 14: The Poisson Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
14.1: Introduction to the Poisson Process
Run the Poisson experiment with the default settings in single step mode. Note the random points in time.
Random Variables
There are three collections of random variables that can be used to describe the process. First, let X_1 denote the time of the first
arrival, and X_i the time between the (i − 1)st and ith arrivals for i ∈ {2, 3, …}. Thus, X = (X_1, X_2, …) is the sequence of inter-
arrival times. Next, let T_n denote the time of the nth arrival for n ∈ N_+. It will be convenient to define T_0 = 0, although we do
not consider this as an arrival. Thus T = (T_0, T_1, …) is the sequence of arrival times. Clearly T is the partial sum process
associated with X:

T_n = \sum_{i=1}^{n} X_i, \quad n \in \mathbb{N} \quad (14.1.1)

Conversely, the inter-arrival times can be recovered from the arrival times:

X_n = T_n - T_{n-1}, \quad n \in \mathbb{N}_+ \quad (14.1.2)

Next, let N_t denote the number of arrivals in (0, t] for t ∈ [0, ∞). The random process N = (N_t : t ≥ 0) is the counting process.
The arrival time process T and the counting process N are inverses of one another in a sense, and in particular each process
determines the other:

T_n = \min\{t \ge 0 : N_t = n\}, \quad n \in \mathbb{N} \quad (14.1.3)

More generally, for (measurable) A ⊆ [0, ∞), let N(A) denote the number of arrivals in A:

N(A) = \sum_{n=1}^{\infty} \mathbf{1}(T_n \in A)

Thus, A ↦ N(A) is the counting measure associated with the random points (T_1, T_2, …), so in particular it is a random measure.
For our original counting process, note that N_t = N(0, t] for t ≥ 0. Thus, t ↦ N_t is a (random) distribution function, and
A ↦ N(A) is the (random) measure associated with this distribution function.
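These inverse relationships are easy to see in simulation. The following sketch (Python; the function names, rate, and horizon are our own choices) builds the arrival times from exponential inter-arrival times, as developed later in this chapter, and checks that the counting process recovers the arrival indices:

```python
import random

def arrival_times(rate, horizon, rng):
    """Arrival times T_1, T_2, ... in (0, horizon], built from
    independent exponential inter-arrival times with the given rate."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)   # next inter-arrival time X_i
        if t > horizon:
            return times
        times.append(t)

def count(times, t):
    """N_t: the number of arrivals in (0, t]."""
    return sum(1 for s in times if s <= t)

rng = random.Random(5)
T = arrival_times(rate=2.0, horizon=10.0, rng=rng)
# Each arrival time is the first time the count reaches its index: N at T_n is n
for n, t in enumerate(T, start=1):
    assert count(T, t) == n
```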
Think about the strong renewal assumption for each of the specific applications given above.
Run the Poisson experiment with the default settings in single step mode. See if you can detect the strong renewal assumption.
As a first step, note that part of the renewal assumption, namely that the process “restarts” at each arrival time, independently of the
past, implies the following result:
A model of random points in time in which the inter-arrival times are independent and identically distributed (so that the process
“restarts” at each arrival time) is known as a renewal process. A separate chapter explores Renewal Processes in detail. Thus, the
Poisson process is a renewal process, but a very special one, because we also require that the renewal assumption hold at fixed
times.
T = (T_0, T_1, …) (in this case, the trial numbers of the successes), and the counting process (N_t : t ∈ N) (in this case, the number
of successes in the first t trials). Also like the Poisson process, the Bernoulli trials process has the strong renewal property: at each
fixed time and at each arrival time, the process “starts over” independently of the past. But of course, time is discrete in the
Bernoulli trials model and continuous in the Poisson model. The Bernoulli trials process can be characterized in terms of each of
the three sets of random variables.
Each of the following statements characterizes the Bernoulli trials process with success parameter p ∈ (0, 1):
1. The inter-arrival time sequence X is a sequence of independent variables, and each has the geometric distribution on N_+ with parameter p.
Run the binomial experiment with n = 50 and p = 0.1 . Note the random points in discrete time.
Run the Poisson experiment with t = 5 and r = 1 . Note the random points in continuous time and compare with the behavior
in the previous exercise.
As we develop the theory of the Poisson process we will frequently refer back to the analogy with Bernoulli trials. In particular, we
will show that if we run the Bernoulli trials at a faster and faster rate but with a smaller and smaller success probability, in just the
right way, the Bernoulli trials process converges to the Poisson process.
This page titled 14.1: Introduction to the Poisson Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
14.2: The Exponential Distribution
Basic Theory
The Memoryless Property
Recall that in the basic model of the Poisson process, we have “points” that occur randomly in time. The sequence of inter-arrival
times is X = (X_1, X_2, …). The strong renewal assumption states that at each arrival time and at each fixed time, the process must
probabilistically restart, independent of the past. The first part of that assumption implies that X is a sequence of independent,
identically distributed variables. The second part of the assumption implies that if the first arrival has not occurred by time s , then
the time remaining until the arrival occurs must have the same distribution as the first arrival time itself. This is known as the
memoryless property and can be stated in terms of a general random variable as follows:
Suppose that X takes values in [0, ∞). Then X has the memoryless property if the conditional distribution of X − s given
X > s is the same as the distribution of X for every s ∈ [0, ∞). Equivalently,

P(X > t + s \mid X > s) = P(X > t), \quad s, t \in [0, \infty) \quad (14.2.1)
The memoryless property determines the distribution of X up to a positive parameter, as we will see now.
Distribution functions
Suppose that X takes values in [0, ∞) and satisfies the memoryless property.
X has a continuous distribution and there exists r ∈ (0, ∞) such that the distribution function F of X is

F(t) = 1 - e^{-rt}, \quad t \in [0, \infty) \quad (14.2.2)
Proof
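The memoryless property can be read off directly from this distribution function: the conditional survival probability is e^{-r(t+s)}/e^{-rs} = e^{-rt}, which does not depend on s. A one-line numeric check (Python; the particular values of r, s, t are arbitrary choices):

```python
import math

def survival(r, t):
    """P(X > t) = e^{-rt} for the exponential distribution with rate r."""
    return math.exp(-r * t)

r, s, t = 0.7, 2.0, 3.0
conditional = survival(r, s + t) / survival(r, s)   # P(X > s + t | X > s)
assert abs(conditional - survival(r, t)) < 1e-12
```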
A random variable with the distribution function above, or equivalently the probability density function in the last theorem, is said to
have the exponential distribution with rate parameter r. The reciprocal 1/r is known as the scale parameter (as will be justified
below). Note that the mode of the distribution is 0, regardless of the parameter r, so the mode is not very helpful as a measure of center.
In the gamma experiment, set n = 1 so that the simulated random variable has an exponential distribution. Vary r with the
scroll bar and watch how the shape of the probability density function changes. For selected values of r, run the experiment
1000 times and compare the empirical density function to the probability density function.
The quantile function of X is F^{-1}(p) = -\ln(1 - p)/r for p ∈ [0, 1). In particular:
1. The median is \frac{1}{r} \ln 2
2. The first quartile is \frac{1}{r} \ln \frac{4}{3}
3. The third quartile is \frac{1}{r} \ln 4
4. The interquartile range is \frac{1}{r} \ln 3
Proof
In the special distribution calculator, select the exponential distribution. Vary the scale parameter (which is 1/r) and note the
shape of the distribution/quantile function. For selected values of the parameter, compute a few values of the distribution
function and the quantile function.
A process of random points in time is a Poisson process with rate r ∈ (0, ∞) if and only if the inter-arrival times are independent,
and each has the exponential distribution with rate r.
If X has the exponential distribution with rate r > 0, then from the results above, the reliability function is F^c(t) = e^{-rt} and the
probability density function is f(t) = re^{-rt}, so trivially X has constant rate r. The converse is also true.
If X has constant failure rate r > 0 then X has the exponential distribution with parameter r.
Proof
The memoryless and constant failure rate properties are the most famous characterizations of the exponential distribution, but are
by no means the only ones. Indeed, entire books have been written on characterizations of this distribution.
Moments
Suppose again that X has the exponential distribution with rate parameter r > 0. Naturally, we want to know the mean,
variance, and various other moments of X.
If n ∈ N then E(X^n) = n!/r^n.
Proof
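Since E(X^n) = \int_0^\infty t^n r e^{-rt} \, dt, the formula can be sanity-checked by numerical integration. A sketch (Python; the integration cutoff and step size are arbitrary choices):

```python
import math

def exp_moment_numeric(n, r, upper=40.0, dt=1e-3):
    """Approximate E(X^n) = integral of t^n * r * e^{-rt} by a midpoint sum."""
    total, t = 0.0, 0.0
    while t < upper:
        m = t + dt / 2                                # midpoint of the subinterval
        total += (m ** n) * r * math.exp(-r * m) * dt
        t += dt
    return total

n, r = 3, 2.0
# E(X^3) = 3!/2^3 = 0.75
assert abs(exp_moment_numeric(n, r) - math.factorial(n) / r ** n) < 1e-4
```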
More generally, E(X^a) = Γ(a + 1)/r^a for every a ∈ [0, ∞), where Γ is the gamma function.
In particular:
1. E(X) = 1/r
2. var(X) = 1/r^2
3. skew(X) = 2
4. kurt(X) = 9
In the context of the Poisson process, the parameter r is known as the rate of the process. On average, there are 1/r time units
between arrivals, so the arrivals come at an average rate of r per unit time. The Poisson process is completely determined by the
sequence of inter-arrival times, and hence is completely determined by the rate r.
Note also that the mean and standard deviation are equal for an exponential distribution, and that the median is always smaller than
the mean. Recall also that skewness and kurtosis are standardized measures, and so do not depend on the parameter r (which is the
reciprocal of the scale parameter).
Proof
In the gamma experiment, set n = 1 so that the simulated random variable has an exponential distribution. Vary r with the
scroll bar and watch how the mean±standard deviation bar changes. For various values of r, run the experiment 1000 times
and compare the empirical mean and standard deviation to the distribution mean and standard deviation, respectively.
Additional Properties
The exponential distribution has a number of interesting and important mathematical properties. First, and not surprisingly, it's a
member of the general exponential family.
Suppose that X has the exponential distribution with rate parameter r ∈ (0, ∞). Then X has a one parameter general
exponential distribution, with natural parameter −r and natural statistic X.
Proof
Suppose that X has the exponential distribution with rate parameter r >0 and that c >0 . Then cX has the exponential
distribution with rate parameter r/c.
Proof
Recall that multiplying a random variable by a positive constant frequently corresponds to a change of units (minutes into hours for
a lifetime variable, for example). Thus, the exponential distribution is preserved under such changes of units. In the context of the
Poisson process, this has to be the case, since the memoryless property, which led to the exponential distribution in the first place,
clearly does not depend on the time units.
In fact, the exponential distribution with rate parameter 1 is referred to as the standard exponential distribution. From the previous
result, if Z has the standard exponential distribution and r > 0, then X = Z/r has the exponential distribution with rate parameter
r. Conversely, if X has the exponential distribution with rate r > 0 then Z = rX has the standard exponential distribution.
Similarly, the Poisson process with rate parameter 1 is referred to as the standard Poisson process. If Z_i is the ith inter-arrival time
for the standard Poisson process for i ∈ N_+, then letting X_i = Z_i/r for i ∈ N_+ gives the inter-arrival times for the Poisson process
with rate r. Conversely, if X_i is the ith inter-arrival time of the Poisson process with rate r > 0 for i ∈ N_+, then Z_i = rX_i for
i ∈ N_+ gives the inter-arrival times for the standard Poisson process.
The exponential distribution is the continuous analogue of the geometric distribution, so it is not
surprising that the two distributions are also connected through various transformations and limits.
Suppose that X has the exponential distribution with rate parameter r > 0. Then
1. ⌊X⌋ has the geometric distribution on N with success parameter 1 − e^{-r}.
Proof
The following connection between the two distributions is interesting by itself, but will also be very important in the section on
splitting Poisson processes. In words, a random, geometrically distributed sum of independent, identically distributed exponential
variables is itself exponential.
Suppose that X = (X_1, X_2, …) is a sequence of independent variables, each with the exponential distribution with rate r.
Suppose that U has the geometric distribution on N_+ with success parameter p and is independent of X. Then Y = \sum_{i=1}^{U} X_i
has the exponential distribution with rate rp.
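This result is easy to probe by simulation. In the sketch below (Python; the function name, seed, and parameter values are our own choices), U is sampled from the geometric distribution on N_+, a sum of U exponential variables is formed, and the sample mean is compared with 1/(rp), the mean of the exponential distribution with rate rp:

```python
import random

def geometric_exponential_sum(r, p, rng):
    """Y = X_1 + ... + X_U with U geometric on {1, 2, ...} (parameter p)
    and each X_i exponential with rate r; Y should be exponential with rate r*p."""
    u = 1
    while rng.random() >= p:   # count trials until the first success
        u += 1
    return sum(rng.expovariate(r) for _ in range(u))

rng = random.Random(7)
r, p, trials = 2.0, 0.25, 50_000
mean = sum(geometric_exponential_sum(r, p, rng) for _ in range(trials)) / trials
# E(Y) = 1/(r p) = 2.0 under the stated result
assert abs(mean - 1 / (r * p)) < 0.1
```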
The next result explores the connection between the Bernoulli trials process and the Poisson process that was begun in the
Introduction.
For n ∈ N_+, suppose that U_n has the geometric distribution on N_+ with success parameter p_n, where np_n → r > 0 as
n → ∞. Then the distribution of U_n/n converges to the exponential distribution with parameter r as n → ∞.
Proof
To understand this result more clearly, suppose that we have a sequence of Bernoulli trials processes. In process n, we run the trials
at a rate of n per unit time, with probability of success p_n. Thus, the actual time of the first success in process n is U_n/n. The last
result shows that if np_n → r > 0 as n → ∞, then the sequence of Bernoulli trials processes converges to the Poisson process with
rate r.
Proof
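The convergence can be checked directly, without simulation: if p_n = r/n, then P(U_n/n > t) = (1 − r/n)^{⌊nt⌋}, which should be close to the exponential tail e^{-rt} for large n. A sketch (Python; the parameter values are arbitrary, and p_n = r/n is one admissible choice satisfying np_n → r):

```python
import math

def geom_tail_scaled(n, r, t):
    """P(U_n / n > t) when U_n is geometric on {1, 2, ...} with p_n = r/n."""
    return (1 - r / n) ** math.floor(n * t)

r, t = 0.5, 2.0
exact = math.exp(-r * t)                    # exponential tail e^{-rt}
approx = geom_tail_scaled(100_000, r, t)    # geometric tail, large n
assert abs(approx - exact) < 1e-4
```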
The following theorem gives an important random version of the memoryless property.
Suppose that X and Y are independent variables taking values in [0, ∞) and that Y has the exponential distribution with rate
parameter r > 0 . Then X and Y − X are conditionally independent given X < Y , and the conditional distribution of Y − X
is also exponential with parameter r.
Proof
For our next discussion, suppose that X = (X_1, X_2, …, X_n) is a sequence of independent random variables, and that X_i has the
exponential distribution with rate parameter r_i > 0 for each i ∈ {1, 2, …, n}.
Then U = \min\{X_1, X_2, \ldots, X_n\} has the exponential distribution with rate parameter \sum_{i=1}^{n} r_i.
Proof
In the context of reliability, if a series system has independent components, each with an exponentially distributed lifetime, then the
lifetime of the system is also exponentially distributed, and the failure rate of the system is the sum of the component failure rates.
In the context of random processes, if we have n independent Poisson processes, then the new process obtained by combining the
random points in time is also Poisson, and the rate of the new process is the sum of the rates of the individual processes (we will
return to this point later).
The maximum V = \max\{X_1, X_2, \ldots, X_n\} has distribution function F given by F(t) = \prod_{i=1}^{n} (1 - e^{-r_i t}) for t ∈ [0, ∞).
Proof
Consider the special case where r_i = r ∈ (0, ∞) for each i ∈ {1, 2, …, n}. In statistical terms, X is a random sample of size n from the
exponential distribution with parameter r. From the last couple of theorems, the minimum U has the exponential distribution with
rate nr, while the maximum V has distribution function F(t) = (1 - e^{-rt})^n for t ∈ [0, ∞). Recall that U and V are the first and
last order statistics, respectively.
Curiously, the distribution of the maximum of independent, identically distributed exponential variables is also the distribution of
a sum of independent exponential variables, with rates that grow linearly with the index.
Suppose that X_i has the exponential distribution with rate parameter ir for each i ∈ {1, 2, …, n}, and that (X_1, X_2, …, X_n) are
independent. Then Y = \sum_{i=1}^{n} X_i has distribution function F given by

F(t) = (1 - e^{-rt})^n, \quad t \in [0, \infty) \quad (14.2.23)
Proof
This result has an application to the Yule process, named for George Yule. The Yule process, which has some parallels with the
Poisson process, is studied in the chapter on Markov processes. We can now generalize the order probability above:
Proof
Suppose that for each i, X_i is the time until an event of interest occurs (the arrival of a customer, the failure of a device, etc.) and
that these times are independent and exponentially distributed. Then the first time U that one of the events occurs is also
exponentially distributed, and the probability that the first event to occur is event i is proportional to the rate r_i.
Proof
Of course, the probabilities of other orderings can be computed by permuting the parameters appropriately in the formula on the
right.
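Both facts are easy to probe by simulation. In the sketch below (Python; the rates, seed, and number of trials are our own choices), three independent exponential variables are sampled repeatedly; the sample mean of the minimum should be near 1/(r_1 + r_2 + r_3), and the frequency with which X_3 comes first should be near r_3/(r_1 + r_2 + r_3):

```python
import random

rng = random.Random(11)
rates = (1.0, 2.0, 3.0)
trials = 50_000
min_sum = 0.0
first_counts = [0, 0, 0]
for _ in range(trials):
    xs = [rng.expovariate(r) for r in rates]
    m = min(xs)
    min_sum += m
    first_counts[xs.index(m)] += 1   # which variable was smallest

total_rate = sum(rates)   # 6.0
assert abs(min_sum / trials - 1 / total_rate) < 0.01             # E(min) = 1/6
assert abs(first_counts[2] / trials - rates[2] / total_rate) < 0.02  # about 1/2
```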
The result on minimums and the order probability result above are very important in the theory of continuous-time Markov chains.
But for that application and others, it's convenient to extend the exponential distribution to two degenerate cases: point mass at 0
and point mass at ∞ (so the first is the distribution of a random variable that takes the value 0 with probability 1, and the second
the distribution of a random variable that takes the value ∞ with probability 1). In terms of the rate parameter r and the distribution
function F, point mass at 0 corresponds to r = ∞, so that F(t) = 1 for 0 < t < ∞. Point mass at ∞ corresponds to r = 0, so that
F(t) = 0 for 0 < t < ∞. The memoryless property, as expressed in terms of the reliability function F^c, still holds for these
degenerate cases.
We also need to extend some of the results above for a finite number of variables to a countably infinite number of variables. So for the
remainder of this discussion, suppose that {X_i : i ∈ I} is a countable collection of independent random variables, and that X_i has
the exponential distribution with rate parameter r_i ∈ (0, ∞) for each i ∈ I.
Let U = \inf\{X_i : i \in I\}. Then U has the exponential distribution with parameter \sum_{i \in I} r_i.
Proof
For i ∈ I,

P(X_i < X_j \text{ for all } j \in I \setminus \{i\}) = \frac{r_i}{\sum_{j \in I} r_j} \quad (14.2.32)
Proof
We need one last result in this setting: a condition that ensures that the sum of an infinite collection of exponential variables is
finite with probability one.
Proof
Computational Exercises
Show directly that the exponential probability density function is a valid probability density function.
Solution
Suppose that the length of a telephone call (in minutes) is exponentially distributed with rate parameter r = 0.2. Find each of
the following:
1. The probability that the call lasts between 2 and 7 minutes.
2. The median, the first and third quartiles, and the interquartile range of the call length.
Answer
Suppose that the lifetime of a certain electronic component (in hours) is exponentially distributed with rate parameter
r = 0.001. Find each of the following:
Suppose that the time between requests to a web server (in seconds) is exponentially distributed with rate parameter r =2 .
Find each of the following:
1. The mean and standard deviation of the time between requests.
2. The probability that the time between requests is less than 0.5 seconds.
3. The median, the first and third quartiles, and the interquartile range of the time between requests.
Answer
Suppose that the lifetime X of a fuse (in 100 hour units) is exponentially distributed with P(X > 10) = 0.8 . Find each of the
following:
1. The rate parameter.
2. The mean and standard deviation.
3. The median, the first and third quartiles, and the interquartile range of the lifetime.
Answer
The position X of the first defect on a digital tape (in cm) has the exponential distribution with mean 100. Find each of the
following:
1. The rate parameter.
2. The probability that X < 200 given X > 150 .
3. The standard deviation.
4. The median, the first and third quartiles, and the interquartile range of the position.
Answer
Suppose that X, Y , Z are independent, exponentially distributed random variables with respective parameters
a, b, c ∈ (0, ∞) . Find the probability of each of the 6 orderings of the variables.
Proof
This page titled 14.2: The Exponential Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
14.3: The Gamma Distribution
Basic Theory
We now know that the sequence of inter-arrival times X = (X_1, X_2, …) in the Poisson process is a sequence of independent
random variables, each having the exponential distribution with rate parameter r, for some r > 0. No other distribution gives the
strong renewal assumption that we want: the property that the process probabilistically restarts, independently of the past, at each
arrival time and at each fixed time.
The nth arrival time is simply the sum of the first n inter-arrival times:

T_n = \sum_{i=1}^{n} X_i, \quad n \in \mathbb{N} \quad (14.3.1)

Thus, the sequence of arrival times T = (T_0, T_1, …) is the partial sum process associated with the sequence of inter-arrival times
X = (X_1, X_2, …).
Distribution Functions
Recall that the common probability density function of the inter-arrival times is

f(t) = r e^{-rt}, \quad 0 \le t < \infty \quad (14.3.2)

For n ∈ N_+, the nth arrival time T_n has probability density function

f_n(t) = r^n \frac{t^{n-1}}{(n-1)!} e^{-rt}, \quad 0 \le t < \infty \quad (14.3.3)
2. f_1 is concave upward. f_2 is concave downward and then upward, with inflection point at t = 2/r. For n ≥ 3, f_n is
concave upward, then downward, then upward again, with inflection points at t = \left[(n - 1) \pm \sqrt{n - 1}\right]/r.
Proof
The distribution with this probability density function is known as the gamma distribution with shape parameter n and rate
parameter r. It is lso known as the Erlang distribution, named for the Danish mathematician Agner Erlang. Again, 1/r is the scale
parameter, and that term will be justified below. The term shape parameter for n clearly makes sense in light of parts (a) and (b) of
the last result. The term rate parameter for r is inherited from the inter-arrival times, and more generally from the underlying
Poisson process itself: the random points are arriving at an average rate of r per unit time. A more general version of the gamma
distribution, allowing non-integer shape parameters, is studied in the chapter on Special Distributions. Note that since the arrival
times are continuous, the probability of an arrival at any given instant of time is 0.
In the gamma experiment, vary r and n with the scroll bars and watch how the shape of the probability density function
changes. For various values of the parameters, run the experiment 1000 times and compare the empirical density function to
the true probability density function.
The distribution function and the quantile function of the gamma distribution do not have simple, closed-form expressions.
However, it's easy to write the distribution function as a sum.
For n ∈ N_+, the distribution function F_n of T_n is given by

F_n(t) = 1 - \sum_{k=0}^{n-1} e^{-rt} \frac{(rt)^k}{k!}, \quad t \in [0, \infty) \quad (14.3.5)
Proof
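The sum form of F_n can be checked against a direct numerical integration of the density f_n. A sketch (Python; the function names, parameter values, and step count are our own choices):

```python
import math

def gamma_cdf_sum(n, r, t):
    """F_n(t) = 1 - sum_{k=0}^{n-1} e^{-rt} (rt)^k / k!, equation (14.3.5)."""
    return 1 - sum(math.exp(-r * t) * (r * t) ** k / math.factorial(k)
                   for k in range(n))

def gamma_cdf_numeric(n, r, t, steps=100_000):
    """Midpoint-rule integral of f_n(s) = r^n s^{n-1} e^{-rs}/(n-1)! over (0, t]."""
    dt = t / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * dt
        total += r ** n * s ** (n - 1) * math.exp(-r * s) / math.factorial(n - 1) * dt
    return total

n, r, t = 3, 2.0, 1.5
assert abs(gamma_cdf_sum(n, r, t) - gamma_cdf_numeric(n, r, t)) < 1e-6
```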
Open the special distribution calculator, select the gamma distribution, and select CDF view. Vary the parameters and note the
shape of the distribution and quantile functions. For selected values of the parameters, compute the quartiles.
Moments
The mean, variance, and moment generating function of Tn can be found easily from the representation as a sum of independent
exponential variables.
1. E(T_n) = n/r
2. var(T_n) = n/r^2
Proof
For k ∈ N,

E(T_n^k) = \frac{(k + n - 1)!}{(n - 1)!} \frac{1}{r^k} \quad (14.3.7)
Proof
In the gamma experiment, vary r and n with the scroll bars and watch how the size and location of the mean±standard
deviation bar changes. For various values of r and n , run the experiment 1000 times and compare the empirical moments to
the true moments.
Our next result gives the skewness and kurtosis of the gamma distribution.
1. skew(T_n) = 2/\sqrt{n}
2. kurt(T_n) = 3 + 6/n
Proof
In particular, note that the gamma distribution is positively skewed but skew(T_n) → 0 as n → ∞. Recall also that the excess
kurtosis is kurt(T_n) − 3 = 6/n → 0 as n → ∞. This result is related to the convergence of the gamma distribution to the normal,
discussed below. Finally, note that the skewness and kurtosis do not depend on the rate parameter r. This is because, as we show
below, 1/r is a scale parameter.
The moment generating function of T_n is

M_n(s) = E(e^{sT_n}) = \left(\frac{r}{r - s}\right)^n, \quad -\infty < s < r \quad (14.3.10)
Proof
The moment generating function can also be used to derive the moments of the gamma distribution given above: recall that
M_n^{(k)}(0) = E(T_n^k).
M_n = \frac{T_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad (14.3.11)
is the sample mean of the first n inter-arrival times (X_1, X_2, …, X_n). In statistical terms, this sequence is a random sample of size
n from the exponential distribution with rate r.
The sample mean M_n satisfies:
1. E(M_n) = 1/r
2. var(M_n) = 1/(nr^2)
3. M_n → 1/r as n → ∞ with probability 1
Proof
In statistical terms, part (a) means that M_n is an unbiased estimator of 1/r and hence the variance in part (b) is the mean square
error. Part (b) means that M_n is a consistent estimator of 1/r, since var(M_n) ↓ 0 as n → ∞. Part (c) is a stronger form of
consistency. In general, the sample mean of a random sample from a distribution is an unbiased and consistent estimator of the
distribution mean. On the other hand, a natural estimator of r itself is 1/M_n = n/T_n. However, this estimator is positively biased:
E(n/T_n) ≥ r.
Proof
Suppose that T has the gamma distribution with rate parameter r ∈ (0, ∞) and shape parameter n ∈ N_+. If c ∈ (0, ∞) then
cT has the gamma distribution with rate parameter r/c and shape parameter n.
Proof
The scaling property also follows from the fact that the gamma distribution governs the arrival times in the Poisson process. A time
change in a Poisson process clearly does not change the strong renewal property, and hence results in a new Poisson process.
Suppose that T has the gamma distribution with shape parameter n ∈ N_+ and rate parameter r ∈ (0, ∞). Then T has a two-
parameter general exponential distribution with natural parameters n − 1 and −r, and natural statistics ln(T) and T.
Proof
Increments
A number of important properties flow from the fact that the sequence of arrival times T = (T_0, T_1, …) is the partial sum process
associated with the sequence of independent, identically distributed inter-arrival times X = (X_1, X_2, …).
Of course, the stationary and independent increments properties are related to the fundamental “renewal” assumption that we
started with. If we fix n ∈ N_+, then (T_n − T_n, T_{n+1} − T_n, T_{n+2} − T_n, …) is independent of (T_1, T_2, …, T_n) and has the same
distribution as (T_0, T_1, T_2, …). That is, if we “restart the clock” at time T_n, then the process in the future looks just like the
original process (in a probabilistic sense) and is independent of the past. Thus, we have our second characterization of the Poisson
process.
A process of random points in time is a Poisson process with rate r ∈ (0, ∞) if and only if the arrival time sequence T has
stationary, independent increments, and for n ∈ N_+, T_n has the gamma distribution with shape parameter n and rate parameter
r.
Sums
The gamma distribution is closed with respect to sums of independent variables, as long as the rate parameter is fixed.
Suppose that V has the gamma distribution with shape parameter m ∈ N_+ and rate parameter r > 0, that W has the gamma
distribution with shape parameter n ∈ N_+ and rate parameter r, and that V and W are independent. Then V + W has the
gamma distribution with shape parameter m + n and rate parameter r.
Normal Approximation
In the gamma experiment, vary r and n with the scroll bars and watch how the shape of the probability density function
changes. Now set n = 10 and for various values of r run the experiment 1000 times and compare the empirical density
function to the true probability density function.
Even though you are restricted to relatively small values of n in the app, note that the probability density function of the nth arrival
time becomes more bell shaped as n increases (for r fixed). This is yet another application of the central limit theorem, since T_n is
the sum of n independent, identically distributed random variables (the inter-arrival times).
The distribution of the random variable Z_n below converges to the standard normal distribution as n → ∞:

Z_n = \frac{r T_n - n}{\sqrt{n}} \quad (14.3.14)
Proof
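A simulation sketch of this convergence (Python; the values of n, r, the seed, and the number of replications are our own choices): with n moderately large, the standardized arrival time Z_n should have sample mean near 0 and sample variance near 1.

```python
import random

rng = random.Random(3)
n, r, trials = 200, 2.0, 10_000
zs = []
for _ in range(trials):
    t_n = sum(rng.expovariate(r) for _ in range(n))   # T_n as a sum of X_i
    zs.append((r * t_n - n) / n ** 0.5)               # Z_n = (r T_n - n)/sqrt(n)

mean = sum(zs) / trials
var = sum(z * z for z in zs) / trials - mean ** 2
assert abs(mean) < 0.05 and abs(var - 1) < 0.08
```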
As we now know, the time of the nth arrival has the gamma distribution with parameters n and r (the rate). Because of this strong
analogy, we expect a relationship between these two processes. In fact, we have the same type of limit as with the geometric and
exponential distributions.
Fix n ∈ N_+ and suppose that for each m ∈ N_+, T_{m,n} has the negative binomial distribution with parameters n and
p_m ∈ (0, 1), where m p_m → r ∈ (0, ∞) as m → ∞. Then the distribution of T_{m,n}/m converges to the gamma distribution
with shape parameter n and rate parameter r as m → ∞.
Computational Exercises
Suppose that customers arrive at a service station according to the Poisson model, at a rate of r =3 per hour. Relative to a
given starting time, find the probability that the second customer arrives sometime after 1 hour.
Answer
Defects in a type of wire follow the Poisson model, with rate 1 per 100 meters. Find the probability that the 5th defect is located
between 450 and 550 meters.
Answer
Suppose that requests to a web server follow the Poisson model with rate r = 5 . Relative to a given starting time, compute the
mean and standard deviation of the time of the 10th request.
Answer
Suppose that Y has a gamma distribution with mean 40 and standard deviation 20. Find the shape parameter n and the rate
parameter r.
Answer
Suppose that accidents at an intersection occur according to the Poisson model, at a rate of 8 per year. Compute the normal
approximation to the event that the 10th accident (relative to a given starting time) occurs within 2 years.
Answer
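For the accident exercise, T_10 has mean n/r = 1.25 years and standard deviation √n/r ≈ 0.395 years, so the normal approximation is Φ((2 − 1.25)/0.395). A short sketch (the helper `phi` built on the error function is our own illustrative choice):

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, r, t = 10, 8.0, 2.0
mean = n / r                   # E(T_10) = 1.25 years
sd = math.sqrt(n) / r          # sd(T_10) ≈ 0.395 years
approx = phi((t - mean) / sd)  # ≈ 0.97
```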
In the gamma experiment, set n = 5 and r = 2 . Run the experiment 1000 times and compute the following:
1. P(1.5 ≤ T_5 ≤ 3)
Answer
Suppose that requests to a web server follow the Poisson model. Starting at 12:00 noon on a certain day, the requests are
logged. The 100th request comes at 12:15. Estimate the rate of the process.
Answer
This page titled 14.3: The Gamma Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
14.4: The Poisson Distribution
Basic Theory
Recall that in the Poisson model, X = (X_1, X_2, …) denotes the sequence of inter-arrival times, and T = (T_0, T_1, T_2, …) denotes the sequence of arrival times. Thus T is the partial sum process associated with X:

T_n = ∑_{i=1}^{n} X_i,  n ∈ N   (14.4.1)
Based on the strong renewal assumption, that the process restarts at each fixed time and each arrival time, independently of the past, we now know that X is a sequence of independent random variables, each with the exponential distribution with rate parameter r, for some r ∈ (0, ∞). We also know that T has stationary, independent increments, and that for n ∈ N+, T_n has the gamma distribution with rate parameter r and shape parameter n. Both of these statements characterize the Poisson process with rate r.
Recall that for t ≥ 0, N_t denotes the number of arrivals in the interval (0, t], so that N_t = max{n ∈ N : T_n ≤ t}. We refer to N = (N_t : t ≥ 0) as the counting process. In this section we will show that N_t has a Poisson distribution, named for Simeon Poisson, one of the most important distributions in probability theory. Our exposition will alternate between properties of the distribution and properties of the counting process. The two are intimately intertwined. It's not too much of an exaggeration to say that wherever there is a Poisson distribution, there is a Poisson process lurking in the background.
Recall that for n ∈ N+, the nth arrival time T_n has probability density function

f_n(t) = r^n t^{n−1} e^{−rt} / (n − 1)!,  0 ≤ t < ∞   (14.4.2)
We can find the distribution of N_t because of the inverse relation between N and T. In particular, recall that

{N_t ≥ n} = {T_n ≤ t},  n ∈ N, t ∈ [0, ∞)   (14.4.3)

since both events mean that there are at least n arrivals in (0, t].
For t ∈ (0, ∞), N_t has the Poisson distribution with parameter rt:

P(N_t = n) = e^{−rt} (rt)^n / n!,  n ∈ N   (14.4.4)
Proof
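The proof rests on the identity P(N_t = n) = P(T_n ≤ t) − P(T_{n+1} ≤ t). This can be checked numerically by integrating the gamma densities; the values r = 2, t = 1.5, n = 3 below are illustrative, not from the text:

```python
import math

def gamma_cdf(n, r, t, steps=100_000):
    """P(T_n <= t): midpoint-rule integral of the gamma (shape n, rate r) density."""
    h = t / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        total += r ** n * s ** (n - 1) * math.exp(-r * s) / math.factorial(n - 1)
    return total * h

r, t, n = 2.0, 1.5, 3
poisson_pmf = math.exp(-r * t) * (r * t) ** n / math.factorial(n)
diff = gamma_cdf(n, r, t) - gamma_cdf(n + 1, r, t)  # agrees with poisson_pmf
```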
In the Poisson experiment, vary r and t with the scroll bars and note the shape of the probability density function. For various
values of r and t , run the experiment 1000 times and compare the relative frequency function to the probability density
function.
In general, a random variable N taking values in N is said to have the Poisson distribution with parameter a ∈ (0, ∞) if it has
the probability density function
g(n) = e^{−a} a^n / n!,  n ∈ N   (14.4.6)
Proof
14.4.1 https://stats.libretexts.org/@go/page/10269
The Poisson distribution does not have simple closed-form distribution or quantile functions. Trivially, we can write the distribution function as a partial sum of the probability density function.
Open the special distribution calculator, select the Poisson distribution, and select CDF view. Vary the parameter and note the
shape of the distribution and quantile functions. For various values of the parameter, compute the quartiles.
Sometimes it's convenient to allow the parameter a to be 0. This degenerate Poisson distribution is simply point mass at 0. That is, with the usual conventions regarding nonnegative integer powers of 0, the probability density function g above reduces to g(0) = 1 and g(n) = 0 for n ∈ N+.
Moments
Suppose that N has the Poisson distribution with parameter a > 0. Naturally we want to know the mean, variance, skewness and kurtosis, and the probability generating function of N. The easiest moments to compute are the factorial moments. For this result, recall the falling power notation for the number of permutations of size k chosen from a population of size n:

n^{(k)} = n(n − 1) ⋯ [n − (k − 1)]   (14.4.8)
Open the special distribution simulator and select the Poisson distribution. Vary the parameter and note the location and size of
the mean±standard deviation bar. For selected values of the parameter, run the simulation 1000 times and compare the
empirical mean and standard deviation to the distribution mean and standard deviation.
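The simulation in the app can be mimicked directly: counting how many exponential inter-arrival times fit in a unit interval gives Poisson(a) variates, whose empirical mean and variance should both be near a. A sketch with the illustrative value a = 4:

```python
import random

random.seed(7)
a = 4.0  # illustrative parameter

def poisson_sample(a):
    """Count arrivals in (0, 1] of a rate-a Poisson process: a Poisson(a) variate."""
    n, total = 0, random.expovariate(a)
    while total <= 1.0:
        n += 1
        total += random.expovariate(a)
    return n

samples = [poisson_sample(a) for _ in range(10_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
# both mean and var should be close to a
```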
1. skew(N) = 1/√a
2. kurt(N) = 3 + 1/a
Proof
Note that the Poisson distribution is positively skewed, but skew(N) → 0 as a → ∞. Recall also that the excess kurtosis is kurt(N) − 3 = 1/a → 0 as a → ∞. This limit is related to the convergence of the Poisson distribution to the normal, discussed below.
Open the special distribution simulator and select the Poisson distribution. Vary the parameter and note the shape of the
probability density function in the context of the results on skewness and kurtosis above.
Suppose again that N has the Poisson distribution with parameter a ∈ (0, ∞). Then N has probability generating function P given by P(s) = E(s^N) = e^{a(s−1)} for s ∈ R.
Proof
Returning to the Poisson counting process N = (N_t : t ≥ 0) with rate parameter r, it follows that E(N_t) = rt and var(N_t) = rt for t ≥ 0. Once again, we see that r can be interpreted as the average arrival rate. In an interval of length t, we expect about rt arrivals.
In the Poisson experiment, vary r and t with the scroll bars and note the location and size of the mean±standard deviation bar.
For various values of r and t , run the experiment 1000 times and compare the sample mean and standard deviation to the
distribution mean and standard deviation, respectively.
For t > 0, the random variable N_t/t is a natural estimator of the rate r:
1. E(N_t/t) = r
2. var(N_t/t) = r/t
Proof
Part (a) means that the estimator is unbiased. Since this is the case, the variance in part (b) gives the mean square error. Since var(N_t/t) decreases to 0 as t → ∞, the estimator is consistent.
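Unbiasedness and consistency of the estimator N_t/t can be seen in simulation: the empirical mean square error tracks var(N_t/t) = r/t and shrinks as t grows. A minimal sketch, with the illustrative rate r = 5:

```python
import random

random.seed(3)
r = 5.0  # illustrative true rate

def count_arrivals(r, t):
    """N_t for a Poisson process with rate r, via exponential inter-arrival times."""
    n, total = 0, random.expovariate(r)
    while total <= t:
        n += 1
        total += random.expovariate(r)
    return n

mses = {}
for t in (1.0, 10.0, 100.0):
    estimates = [count_arrivals(r, t) / t for _ in range(2_000)]
    mses[t] = sum((e - r) ** 2 for e in estimates) / len(estimates)
# mses[t] should be close to r/t, decreasing in t
```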
N_t is the number of arrivals in the interval (0, t], so it follows that if s, t ∈ [0, ∞) with s < t, then N_t − N_s is the number of arrivals in the interval (s, t]. Of course, the arrival times have continuous distributions, so the probability that an arrival occurs at a specific point t is 0. Thus, it does not really matter if we write the interval above as (s, t], (s, t), [s, t) or [s, t].
Statements about the increments of the counting process can be expressed more elegantly in terms of our more general counting
process. Recall that for A ⊆ [0, ∞) (measurable of course), N (A) denotes the number of random points in A :
N (A) = # {n ∈ N+ : Tn ∈ A} (14.4.12)
and so in particular, N_t = N(0, t]. Thus, note that t ↦ N_t is a (random) distribution function and A ↦ N(A) is the (random)
measure associated with this distribution function. Recall also that λ denotes the standard length (Lebesgue) measure on [0, ∞).
Here is our third characterization of the Poisson process.
A process of random points in time is a Poisson process with rate r ∈ (0, ∞) if and only if the following properties hold:
1. If A ⊆ [0, ∞) is measurable then N(A) has the Poisson distribution with parameter rλ(A).
2. If {A_i : i ∈ I} is a countable, disjoint collection of measurable sets in [0, ∞) then {N(A_i) : i ∈ I} is a set of independent variables.
From a modeling point of view, the assumptions of stationary, independent increments are ones that might be reasonably made. But the assumption that the increments have Poisson distributions does not seem as clear. Our next characterization of the process is more primitive in a sense, because it just imposes some limiting assumptions (in addition to stationary, independent increments).
A process of random points in time is a Poisson process with rate r ∈ (0, ∞) if and only if the following properties hold:
1. If A, B ⊆ [0, ∞) are measurable and λ(A) = λ(B), then N(A) and N(B) have the same distribution.
2. If {A_i : i ∈ I} is a countable, disjoint collection of measurable sets in [0, ∞) then {N(A_i) : i ∈ I} is a set of independent variables.
3. If A_n ⊆ [0, ∞) is measurable and λ(A_n) > 0 for n ∈ N+, and if λ(A_n) → 0 as n → ∞, then

P[N(A_n) = 1] / λ(A_n) → r as n → ∞   (14.4.13)

P[N(A_n) > 1] / λ(A_n) → 0 as n → ∞   (14.4.14)
Proof
Of course, part (a) is the stationary assumption and part (b) the independence assumption. The first limit in (c) is sometimes called
the rate property and the second limit the sparseness property. In a “small” time interval of length dt , the probability of a single
random point is approximately r dt, and the probability of two or more random points is negligible.
Sums
Suppose that N and M are independent random variables, and that N has the Poisson distribution with parameter a ∈ (0, ∞)
and M has the Poisson distribution with parameter b ∈ (0, ∞). Then N + M has the Poisson distribution with parameter
a+b .
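The theorem can be verified numerically for particular values by convolving the two probability density functions; the values a = 2, b = 3, n = 4 below are illustrative:

```python
import math

def pois_pmf(k, a):
    """Poisson probability density function at k with parameter a."""
    return math.exp(-a) * a ** k / math.factorial(k)

a, b, n = 2.0, 3.0, 4
# P(N + M = n), conditioning on the value of N:
conv = sum(pois_pmf(k, a) * pois_pmf(n - k, b) for k in range(n + 1))
direct = pois_pmf(n, a + b)  # agrees with conv
```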
From the last theorem, it follows that the Poisson distribution is infinitely divisible. That is, a Poisson distributed variable can be
written as the sum of an arbitrary number of independent, identically distributed (in fact also Poisson) variables.
Suppose that N has the Poisson distribution with parameter a ∈ (0, ∞). Then for n ∈ N+, N has the same distribution as ∑_{i=1}^{n} N_i, where (N_1, N_2, …, N_n) are independent, and each has the Poisson distribution with parameter a/n.
Normal Approximation
Because of the representation as a sum of independent, identically distributed variables, it's not surprising that the Poisson
distribution can be approximated by the normal.
Proof
Thus, if N has the Poisson distribution with parameter a, and a is “large”, then the distribution of N is approximately normal with mean a and standard deviation √a. When using the normal approximation, we should remember to use the continuity correction, since the Poisson distribution is discrete.
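As an illustration of the continuity correction (the values a = 100, k = 110 are our own illustrative choices): P(N ≤ k) is approximated by Φ((k + 1/2 − a)/√a).

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

a, k = 100.0, 110
# Exact Poisson distribution function at k:
exact = sum(math.exp(-a) * a ** j / math.factorial(j) for j in range(k + 1))
# Continuity-corrected normal approximation:
approx = phi((k + 0.5 - a) / math.sqrt(a))
```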
In the Poisson experiment, set r = t = 1 . Increase t and note how the graph of the probability density function becomes more
bell-shaped.
General Exponential
The Poisson distribution is a member of the general exponential family of distributions. This fact is important in various statistical
procedures.
Suppose that N has the Poisson distribution with parameter a ∈ (0, ∞) . This distribution is a one-parameter exponential
family with natural parameter ln(a) and natural statistic N .
Proof
For t > 0, the conditional distribution of T_1 given N_t = 1 is uniform on the interval (0, t].
Proof
More generally, for t > 0 and n ∈ N+, the conditional distribution of (T_1, T_2, …, T_n) given N_t = n is the same as the distribution of the order statistics of a random sample of size n from the uniform distribution on the interval (0, t].
Heuristic proof
Note that the conditional distribution in the last result is independent of the rate r. This means that, in a sense, the Poisson model
gives the most “random” distribution of points in time.
Suppose that (N_t : t ∈ [0, ∞)) is a Poisson counting process with rate r ∈ (0, ∞). If s, t ∈ (0, ∞) with s < t, and n ∈ N+, then the conditional distribution of N_s given N_t = n is binomial with trial parameter n and success parameter p = s/t.
Proof
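The proof amounts to the computation P(N_s = k ∣ N_t = n) = P(N_s = k) P(N_t − N_s = n − k) / P(N_t = n), using stationary, independent increments. A numerical check with illustrative values (not from the text):

```python
import math

def pois_pmf(k, a):
    """Poisson probability density function at k with parameter a."""
    return math.exp(-a) * a ** k / math.factorial(k)

r, s, t, n, k = 2.0, 1.0, 3.0, 5, 2  # illustrative values
# Conditional probability via independent increments:
cond = pois_pmf(k, r * s) * pois_pmf(n - k, r * (t - s)) / pois_pmf(n, r * t)
# Binomial probability with trial parameter n and success parameter s/t:
binom = math.comb(n, k) * (s / t) ** k * (1 - s / t) ** (n - k)
```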
Note again that the conditional distribution in the last result does not depend on the rate r. Given N_t = n, each of the n arrivals,
independently of the others, falls into the interval (0, s] with probability s/t and into the interval (s, t] with probability
1 − s/t = (t − s)/t . Here is essentially the same result, outside of the context of the Poisson process.
Suppose that M has the Poisson distribution with parameter a ∈ (0, ∞), N has the Poisson distribution with parameter b ∈ (0, ∞), and that M and N are independent. Then the conditional distribution of M given M + N = n is binomial with trial parameter n and success parameter a/(a + b).
More importantly, the Poisson distribution is the limit of the binomial distribution in a certain sense. As we will see, this
convergence result is related to the analogy between the Bernoulli trials process and the Poisson process that we discussed in the
Introduction, the section on the inter-arrival times, and the section on the arrival times.
Suppose that p_n ∈ (0, 1) for n ∈ N+ and that n p_n → a ∈ (0, ∞) as n → ∞. Then the binomial distribution with parameters n and p_n converges to the Poisson distribution with parameter a as n → ∞. That is, for fixed k ∈ N,

(n choose k) p_n^k (1 − p_n)^{n−k} → e^{−a} a^k / k! as n → ∞   (14.4.27)
Direct proof
Proof from generating functions
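The convergence can be seen numerically by fixing k and a, setting p_n = a/n, and letting n grow (the values a = 2, k = 3 are illustrative):

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability density function at k with parameters n and p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

a, k = 2.0, 3
pois = math.exp(-a) * a ** k / math.factorial(k)
errors = [abs(binom_pmf(k, n, a / n) - pois) for n in (10, 100, 1000)]
# errors decrease toward 0 as n grows
```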
The mean and variance of the binomial distribution converge to the mean and variance of the limiting Poisson distribution,
respectively.
1. n p_n → a as n → ∞
2. n p_n (1 − p_n) → a as n → ∞
Of course the convergence of the means is precisely our basic assumption, and is further evidence that this is the essential
assumption. But for a deeper look, let's return to the analogy between the Bernoulli trials process and the Poisson process. Recall
that both have the strong renewal property that at each fixed time, and at each arrival time, the process stochastically starts over,
independently of the past. The difference, of course, is that time is discrete in the Bernoulli trials process and continuous in the
Poisson process. The convergence result is a special case of the more general fact that if we run Bernoulli trials at a faster and
faster rate but with a smaller and smaller success probability, in just the right way, the Bernoulli trials process converges to the
Poisson process. Specifically, suppose that we have a sequence of Bernoulli trials processes. In process n we perform the trials at a
rate of n per unit time, with success probability p_n. Our basic assumption is that n p_n → r as n → ∞ where r > 0. Now let Y_{n,t} denote the number of successes in the time interval (0, t] for Bernoulli trials process n, and let N_t denote the number of arrivals in this interval for the Poisson process with rate r. Then Y_{n,t} has the binomial distribution with parameters ⌊nt⌋ and p_n, and of course ⌊nt⌋ p_n → rt as n → ∞. It follows that the distribution of Y_{n,t} converges to the distribution of N_t as n → ∞.
Proof
From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials n is “large” and the probability of success p “small”, so that np² is small, then the binomial distribution with parameters n and p is well approximated by the Poisson distribution with parameter r = np. This is often a useful result, because the Poisson distribution has fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating Poisson distribution, we do not need to know the number of trials n and the probability of success p individually, but only in the product np. The condition that np² be small means that the variance of the binomial distribution, namely np(1 − p) = np − np², is approximately r = np, the variance of the approximating Poisson distribution.
Recall that the binomial distribution can also be approximated by the normal distribution, by virtue of the central limit theorem. The normal approximation works well when np and n(1 − p) are large; the rule of thumb is that both should be at least 5. The Poisson approximation works well, as we have already noted, when n is large and np² small.
Computational Exercises
Suppose that requests to a web server follow the Poisson model with rate r = 5 per minute. Find the probability that there will
be at least 8 requests in a 2 minute period.
Answer
Defects in a certain type of wire follow the Poisson model with rate 1.5 per meter. Find the probability that there will be no
more than 4 defects in a 2 meter piece of the wire.
Answer
Suppose that customers arrive at a service station according to the Poisson model, at a rate of r = 4 per hour. Find the mean and standard deviation of the number of customers in an 8 hour period.
Answer
In the Poisson experiment, set r = 5 and t = 4 . Run the experiment 1000 times and compute the following:
1. P(15 ≤ N_4 ≤ 22)
Answer
Suppose that requests to a web server follow the Poisson model with rate r = 5 per minute. Compute the normal
approximation to the probability that there will be at least 280 requests in a 1 hour period.
Answer
Suppose that requests to a web server follow the Poisson model, and that 1 request comes in a five minute period. Find the
probability that the request came during the first 3 minutes of the period.
Answer
Suppose that requests to a web server follow the Poisson model, and that 10 requests come during a 5 minute period. Find the
probability that at least 4 requests came during the first 3 minutes of the period.
Answer
In the Poisson experiment, set r = 3 and t = 5 . Run the experiment 100 times.
1. For each run, compute the estimate of r based on N_t.
2. Over the 100 runs, compute the average of the squares of the errors.
3. Compare the result in (b) with var(N_t/t).
Suppose that requests to a web server follow the Poisson model with unknown rate r per minute. In a one hour period, the
server receives 342 requests. Estimate r.
Answer
In the binomial experiment, set n = 30 and p = 0.1 , and run the simulation 1000 times. Compute and compare each of the
following:
1. P(Y_30 ≤ 4)
Answer
Suppose that we have 100 memory chips, each of which is defective with probability 0.05, independently of the others.
Approximate the probability that there are at least 3 defectives in the batch.
Answer
In the binomial timeline experiment, set n = 40 and p = 0.1 and run the simulation 1000 times. Compute and compare each of
the following:
1. P(Y_40 > 5)
Answer
In the binomial timeline experiment, set n = 100 and p = 0.1 and run the simulation 1000 times. Compute and compare each
of the following:
1. P(8 < Y_100 < 15)
Answer
A text file contains 1000 words. Assume that each word, independently of the others, is misspelled with probability p.
1. If p = 0.015, approximate the probability that the file contains at least 20 misspelled words.
2. If p = 0.001, approximate the probability that the file contains at least 3 misspelled words.
Answer
This page titled 14.4: The Poisson Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
14.5: Thinning and Superposition
Thinning
Thinning or splitting a Poisson process refers to classifying each random point, independently, into one of a finite number of
different types. The random points of a given type also form Poisson processes, and these processes are independent. Our
exposition will concentrate on the case of just two types, but this case has all of the essential ideas.
As usual, we start with a Poisson process with rate r ∈ (0, ∞) and counting process N = (N_t : t ≥ 0). Suppose now that each arrival, independently of the others, is one of two types: type 1 with probability p and type 0 with probability 1 − p, where p ∈ (0, 1) is a parameter. Here are some common examples:
The arrivals are radioactive emissions and each emitted particle is either detected (type 1) or missed (type 0) by a counter.
The arrivals are customers at a service station and each customer is classified as either male (type 1) or female (type 0).
We want to consider the type 1 and type 0 random points separately. For this reason, the new random process is usually referred to
as thinning or splitting the original Poisson process. In some applications, the type 1 points are accepted while the type 0 points are
rejected. The main result of this section is that the type 1 and type 0 points form separate Poisson processes, with rates rp and
r(1 − p) respectively, and are independent. We will explore this important result from several points of view.
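The main result is easy to explore by simulation: generate a Poisson process, flip a p-coin for each arrival, and check that the type 1 and type 0 counts have the expected means and are (empirically) uncorrelated. The parameter values below are illustrative:

```python
import random

random.seed(11)
r, p, t = 3.0, 0.4, 2.0  # illustrative rate, type-1 probability, time horizon

def thinned_counts(r, p, t):
    """One run: (number of type 1, number of type 0) arrivals in (0, t]."""
    m = w = 0
    total = random.expovariate(r)
    while total <= t:
        if random.random() < p:
            m += 1
        else:
            w += 1
        total += random.expovariate(r)
    return m, w

runs = [thinned_counts(r, p, t) for _ in range(20_000)]
mean1 = sum(m for m, _ in runs) / len(runs)  # should be near p r t = 2.4
mean0 = sum(w for _, w in runs) / len(runs)  # should be near (1 - p) r t = 3.6
cov = sum(m * w for m, w in runs) / len(runs) - mean1 * mean0  # should be near 0
```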
Bernoulli Trials
In the previous sections, we have explored the analogy between the Bernoulli trials process and the Poisson process. Both have the
strong renewal property that at each fixed time and at each arrival time, the process stochastically “restarts”, independently of the
past. The difference, of course, is that time is discrete in the Bernoulli trials process and continuous in the Poisson process. In this
section, we have both processes simultaneously, and given our previous explorations, it's perhaps not surprising that this leads to
some interesting mathematics.
Thus, in addition to the processes X, T, and N, we have a sequence of Bernoulli trials I = (I_1, I_2, …) with success parameter p. Indicator variable I_j specifies the type of the jth arrival. Moreover, because of our assumptions, I is independent of X, T, and N. Recall that V_k, the trial number of the kth success, has the negative binomial distribution with parameters k and p for k ∈ N+. We take V_0 = 0 by convention. Also, U_k, the number of trials needed to go from the (k − 1)st success to the kth success, has the geometric distribution on N+ with success parameter p. Recall that

V_k = ∑_{i=1}^{k} U_i,  k ∈ N   (14.5.1)

U_k = V_k − V_{k−1},  k ∈ N+   (14.5.2)
As noted above, the Bernoulli trials process can be thought of as random points in discrete time, namely the trial numbers of the
successes. With this understanding, U is the sequence of inter-arrival times and V is the sequence of arrival times.
Now let Y_k denote the time between the (k − 1)st and kth type 1 arrivals, for k ∈ N+. In terms of the original inter-arrival times,

Y_k = ∑_{i=V_{k−1}+1}^{V_k} X_i,  k ∈ N+   (14.5.3)

Note that Y_k has U_k terms. The next result shows that the type 1 points form a Poisson process with rate pr.
Y = (Y_1, Y_2, …) is a sequence of independent variables, and each has the exponential distribution with rate parameter pr.
Proof
14.5.1 https://stats.libretexts.org/@go/page/10270
Similarly, if Z = (Z_1, Z_2, …) is the sequence of inter-arrival times for the type 0 points, then Z is a sequence of independent variables, and each has the exponential distribution with rate (1 − p)r. Moreover, Y and Z are independent.
Counting Processes
For t ≥ 0, let M_t denote the number of type 1 arrivals in (0, t] and W_t the number of type 0 arrivals in (0, t]. Thus, M = (M_t : t ≥ 0) and W = (W_t : t ≥ 0) are the counting processes for the type 1 arrivals and for the type 0 arrivals. The next result follows from the previous subsection, but a direct proof is interesting.

For t ≥ 0, M_t has the Poisson distribution with parameter prt, W_t has the Poisson distribution with parameter (1 − p)rt, and M_t and W_t are independent.
Proof
In the two-type Poisson experiment vary r, p, and t with the scroll bars and note the shape of the probability density functions.
For various values of the parameters, run the experiment 1000 times and compare the relative frequency functions to the
probability density functions.
Next we give the conditional distribution of the total number of arrivals N_t given the number of type 1 arrivals M_t:

P(N_t = n ∣ M_t = k) = e^{−(1−p)rt} [(1 − p)rt]^{n−k} / (n − k)!,  n ∈ {k, k + 1, …}   (14.5.7)
Proof
E(N_t ∣ M_t = k) = k + (1 − p)rt.
Proof
Thus, if the overall rate r of the process and the probability p that an arrival is type 1 are known, then it follows from the general theory of conditional expectation that the best estimator of N_t based on M_t, in the least squares sense, is E(N_t ∣ M_t) = M_t + (1 − p)rt.
Proof
The results generalize naturally to more than two types. Suppose again that we start with a Poisson process with rate r > 0. Suppose that each arrival, independently of the others, is type i with probability p_i for i ∈ {0, 1, …, k − 1}. Of course we must have p_i ≥ 0 for each i and ∑_{i=0}^{k−1} p_i = 1. Then for each i, the type i points form a Poisson process with rate p_i r, and these processes are independent.
Superposition
Complementary to splitting or thinning a Poisson process is superposition: if we combine the random points in time from independent Poisson processes, then we have a new Poisson process. The rate of the new process is the sum of the rates of the processes that were combined. Once again, our exposition will concentrate on the superposition of two processes. This case contains all of the essential ideas.
Two Processes
Suppose that we have two independent Poisson processes. We will denote the sequence of inter-arrival times, the sequence of arrival times, and the counting variables for process i ∈ {1, 2} by X^i = (X^i_1, X^i_2, …), T^i = (T^i_1, T^i_2, …), and N^i = (N^i_t : t ∈ [0, ∞)), and we assume that process i has rate r_i ∈ (0, ∞). The new process that we want to consider is obtained by simply combining the random points. That is, the new random points are {T^1_n : n ∈ N+} ∪ {T^2_n : n ∈ N+}, but of course then ordered in time. We will denote the sequence of inter-arrival times, the sequence of arrival times, and the counting variables for the new process by X = (X_1, X_2, …), T = (T_1, T_2, …), and N = (N_t : t ∈ [0, ∞)). Clearly if A is an interval in [0, ∞) then

N(A) = N^1(A) + N^2(A)   (14.5.11)

the number of combined points in A is simply the sum of the number of points in A for processes 1 and 2. It's also worth noting that

X_1 = min{X^1_1, X^2_1}   (14.5.12)

the first arrival time for the combined process is the smaller of the first arrival times for processes 1 and 2. The other inter-arrival times, and hence also the arrival times, for the combined process are harder to state.
Computational Exercises
In the two-type Poisson experiment, set r = 2, t = 3, and p = 0.7. Run the experiment 1000 times. Compute the appropriate relative frequency functions and investigate empirically the independence of the number of type 1 points and the number of type 0 points.
Suppose that customers arrive at a service station according to the Poisson model, with rate r = 20 per hour. Moreover, each
customer, independently, is female with probability 0.6 and male with probability 0.4. Find the probability that in a 2 hour
period, there will be at least 20 women and at least 15 men.
Answer
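By the thinning result, the numbers of women and men in the 2 hour period are independent Poisson variables with parameters 0.6 · 20 · 2 = 24 and 0.4 · 20 · 2 = 16, so the answer is a product of two Poisson tail probabilities. A minimal sketch:

```python
import math

def pois_cdf(k, a):
    """P(N <= k) for N Poisson with parameter a."""
    return sum(math.exp(-a) * a ** j / math.factorial(j) for j in range(k + 1))

r, p, t = 20.0, 0.6, 2.0
a_women = p * r * t        # 24
a_men = (1 - p) * r * t    # 16
# Independence of the thinned processes lets us multiply:
prob = (1 - pois_cdf(19, a_women)) * (1 - pois_cdf(14, a_men))
```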
In the two-type Poisson experiment, set r = 3 , t = 4 , and p = 0.8 . Run the experiment 100 times.
1. Compute the estimate of N_t based on M_t for each run.
2. Over the 100 runs, compute the average of the squares of the errors.
3. Compare the result in (b) with the result in Exercise 8.
Suppose that a piece of radioactive material emits particles according to the Poisson model at a rate of r = 100 per second.
Moreover, assume that a counter detects each emitted particle, independently, with probability 0.9. Suppose that the number of
detected particles in a 5 second period is 465.
1. Estimate the number of particles emitted.
2. Compute the mean square error of the estimate.
Answer
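By the conditional expectation result above, the best estimate of the number emitted is the detected count plus the expected number missed, (1 − p)rt, and the mean square error is the conditional variance (1 − p)rt:

```python
r, p, t = 100.0, 0.9, 5.0
detected = 465
missed_mean = (1 - p) * r * t      # expected undetected emissions: 50
estimate = detected + missed_mean  # best least-squares estimate: 515
mse = missed_mean                  # conditional variance of N_t given M_t
```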
This page titled 14.5: Thinning and Superposition is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
14.6: Non-homogeneous Poisson Processes
Basic Theory
A non-homogeneous Poisson process is similar to an ordinary Poisson process, except that the average rate of arrivals is allowed to
vary with time. Many applications that generate random points in time are modeled more faithfully with such non-homogeneous
processes. The mathematical cost of this generalization, however, is that we lose the property of stationary increments.
Non-homogeneous Poisson processes are best described in measure-theoretic terms. Thus, you may need to review the sections on
measure theory in the chapters on Foundations, Probability Measures, and Distributions. Our basic measure space in this section is
[0, ∞) with the σ-algebra of Borel measurable subsets (named for Émile Borel). As usual, λ denotes Lebesgue measure on this
space, named for Henri Lebesgue. Recall that the Borel σ-algebra is the one generated by the intervals, and λ is the generalization
of length on intervals.
As before, N_t denotes the number of random points in the interval (0, t] for t ≥ 0, so that N = {N_t : t ≥ 0} is the counting process. More generally, N(A) denotes the number of random points in a measurable A ⊆ [0, ∞), so N is our random counting measure. As before, t ↦ N_t is a (random) distribution function and A ↦ N(A) is the (random) measure associated with this distribution function.
Suppose now that r : [0, ∞) → [0, ∞) is measurable, and define m : [0, ∞) → [0, ∞) by

m(t) = ∫_0^t r(s) ds,  t ∈ [0, ∞)

From properties of the integral, m is increasing and right-continuous on [0, ∞) and hence is a distribution function. The positive measure on [0, ∞) associated with m (which we will also denote by m) is defined on a measurable A ⊆ [0, ∞) by

m(A) = ∫_A r(s) ds
Thus, m(t) = m(0, t] , and for s, t ∈ [0, ∞) with s < t , m(s, t] = m(t) − m(s) . Finally, note that the measure m is absolutely
continuous with respect to λ , and r is the density function. Note the parallels between the random distribution function and
measure N and the deterministic distribution function and measure m. With the setup involving r and m complete, we are ready
for our first definition.
A process that produces random points in time is a non-homogeneous Poisson process with rate function r if the counting
process N satisfies the following properties:
1. If {A : i ∈ I } is a countable, disjoint collection of measurable subsets of [0, ∞) then {N (A
i i) : i ∈ I} is a collection of
independent random variables.
2. If A ⊆ [0, ∞) is measurable then N (A) has the Poisson distribution with parameter m(A).
Property (a) is our usual property of independent increments, while property (b) is a natural generalization of the property of
Poisson distributed increments. Clearly, if r is a positive constant, then m(t) = rt for t ∈ [0, ∞) and as a measure, m is
proportional to Lebesgue measure λ . In this case, the non-homogeneous process reduces to an ordinary, homogeneous Poisson
process with rate r. However, if r is not constant, then m is not linear, and as a measure, is not proportional to Lebesgue measure.
In this case, the process does not have stationary increments with respect to λ , but does of course, have stationary increments with
respect to m. That is, if A, B are measurable subsets of [0, ∞) and λ(A) = λ(B) then N (A) and N (B) will not in general have
the same distribution, but of course they will have the same distribution if m(A) = m(B) .
In particular, recall that the parameter of the Poisson distribution is both the mean and the variance, so
E[N(A)] = var[N(A)] = m(A) for measurable A ⊆ [0, ∞), and in particular, E(N_t) = var(N_t) = m(t) for t ∈ [0, ∞). The
14.6.1 https://stats.libretexts.org/@go/page/10271
function m is usually called the mean function. Since m′(t) = r(t) (if r is continuous at t), it makes sense to refer to r as the rate function. Locally, at t, the arrivals are occurring at an average rate of r(t) per unit time.
As before, from a modeling point of view, the property of independent increments can reasonably be evaluated. But we need
something more primitive to replace the property of Poisson increments. Here is the main theorem.
A process that produces random points in time is a non-homogeneous Poisson process with rate function r if and only if the
counting process N satisfies the following properties:
1. If {A_i : i ∈ I} is a countable, disjoint collection of measurable subsets of [0, ∞) then {N(A_i) : i ∈ I} is a set of independent variables.
2. For t ∈ [0, ∞),

P[N(t, t + h] = 1] / h → r(t) as h ↓ 0   (14.6.3)

P[N(t, t + h] > 1] / h → 0 as h ↓ 0   (14.6.4)
So if h is “small” the probability of a single arrival in [t, t + h) is approximately r(t)h , while the probability of more than 1
arrival in this interval is negligible.
As before, for n ∈ N+, let T_n denote the time of the nth arrival. As with the ordinary Poisson process, we have an inverse relation between the counting process N = {N_t : t ∈ [0, ∞)} and the arrival time sequence T = {T_n : n ∈ N}, namely T_n = min{t ∈ [0, ∞) : N_t = n}, N_t = #{n ∈ N+ : T_n ≤ t}, and {T_n ≤ t} = {N_t ≥ n}, since both events mean at least n random points in (0, t]. The last relation allows us to compute the probability density function of T_n:
f_n(t) = r(t) e^{−m(t)} [m(t)]^{n−1} / (n − 1)!,  t ∈ [0, ∞)   (14.6.5)
Proof
In particular, T_1 has probability density function

f_1(t) = r(t) e^{−m(t)},  t ∈ [0, ∞)   (14.6.8)
Recall that in reliability terms, r is the failure rate function, and that the reliability function is the right distribution function:
F^c(t) = P(T_1 > t) = e^{−m(t)},  t ∈ [0, ∞)   (14.6.9)
In general, the functional form of f_n is clearly similar to the probability density function of the gamma distribution, and indeed, T_n can be transformed into a random variable with a gamma distribution. This amounts to a time change which will give us additional insight into the non-homogeneous Poisson process.
Let U_n = m(T_n) for n ∈ N_+. Then U_n has the gamma distribution with shape parameter n and rate parameter 1.
Proof
Thus, the time change u = m(t) transforms the non-homogeneous Poisson process into a standard (rate 1) Poisson process. Here is
an equivalent way to look at the time change result.
Equivalently, we can transform a standard (rate 1) Poisson process into a non-homogeneous Poisson process with a time change.
Suppose that M = {M_u : u ∈ [0, ∞)} is the counting process for a standard Poisson process, and let N_t = M_{m(t)} for t ∈ [0, ∞). Then N = {N_t : t ∈ [0, ∞)} is the counting process for a non-homogeneous Poisson process with mean function m.
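This construction translates directly into a simulation recipe: generate the arrival times of a standard Poisson process and map them through m⁻¹. A minimal Python sketch, under the illustrative assumption r(t) = 2t, so that m(t) = t² and m⁻¹(u) = √u:

```python
import numpy as np

def arrivals_by_time_change(m_inverse, u_max, rng):
    """Arrival times of the non-homogeneous process, as T_n = m^{-1}(U_n)
    where U_1 < U_2 < ... are the arrivals of a standard (rate 1) process."""
    # standard Poisson arrivals on [0, u_max]: partial sums of exp(1) variables
    u = np.cumsum(rng.exponential(1.0, size=int(2 * u_max) + 50))
    return m_inverse(u[u <= u_max])

# Illustrative assumption: r(t) = 2t, so m(t) = t^2 and m^{-1}(u) = sqrt(u).
rng = np.random.default_rng(1)
t_arrivals = arrivals_by_time_change(np.sqrt, 9.0, rng)   # process on [0, 3]
print(len(t_arrivals))   # N_3 = M_{m(3)}, a Poisson variable with mean 9
```

The buffer size 2·u_max + 50 is simply chosen large enough that the partial sums of exponentials exceed u_max with overwhelming probability.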
This page titled 14.6: Non-homogeneous Poisson Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
14.7: Compound Poisson Processes
In a compound Poisson process, each arrival in an ordinary Poisson process comes with an associated real-valued random variable
that represents the value of the arrival in a sense. These variables are independent and identically distributed, and are independent
of the underlying Poisson process. Our interest centers on the sum of the random variables for all the arrivals up to a fixed time t ,
which thus is a Poisson-distributed random sum of random variables. Distributions of this type are said to be compound Poisson
distributions, and are important in their own right, particularly since some surprising parametric distributions turn out to be
compound Poisson.
Basic Theory
Definition
Suppose we have a Poisson process with rate r ∈ (0, ∞). As usual, we will denote the sequence of inter-arrival times by X = (X_1, X_2, …), the sequence of arrival times by T = (T_0, T_1, T_2, …), and the counting process by N = {N_t : t ∈ [0, ∞)}. To review some of the most important facts briefly, recall that X is a sequence of independent random variables, each having the exponential distribution on [0, ∞) with rate r. The sequence T is the partial sum sequence associated with X, and has stationary independent increments. For n ∈ N_+, the n th arrival time T_n has the gamma distribution with parameters n and r. The process N is the inverse of T, in a certain sense, and also has stationary independent increments. For t ∈ (0, ∞), the number of arrivals N_t in [0, t] has the Poisson distribution with parameter rt.
Suppose now that each arrival has an associated real-valued random variable that represents the value of the arrival in a certain
sense. Here are some typical examples:
The arrivals are customers at a store. Each customer spends a random amount of money.
The arrivals are visits to a website. Each visitor spends a random amount of time at the site.
The arrivals are failure times of a complex system. Each failure requires a random repair time.
The arrivals are earthquakes at a particular location. Each earthquake has a random severity, a measure of the energy released.
For n ∈ N_+, let U_n denote the value of the n th arrival. We assume that U = (U_1, U_2, …) is a sequence of independent, identically distributed, real-valued random variables, and that U is independent of the underlying Poisson process. The common distribution may be discrete or continuous, but in either case, we let f denote the common probability density function. We will let μ = E(U_n) denote the common mean, σ² = var(U_n) the common variance, and G the common moment generating function, so that G(s) = E[exp(sU_n)] for s in some interval I about 0. Here is our main definition:
The compound Poisson process associated with the given Poisson process N and the sequence U is the stochastic process V = {V_t : t ∈ [0, ∞)} where
V_t = ∑_{n=1}^{N_t} U_n   (14.7.1)
Thus, V_t is the total value for all of the arrivals in (0, t]. For the examples above, V_t is the total income to the store, the total time spent at the site, the total repair time, and the total energy released, respectively, up to time t.
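The definition can be simulated directly: draw N_t from the Poisson distribution with parameter rt, then sum that many independent values. A Python sketch, under the illustrative assumption of customers arriving at rate r = 2 per hour, each spending an exponentially distributed amount with mean 10; by Wald's identity the sample mean should be near E(U) r t:

```python
import numpy as np

def compound_poisson(r, t, value_sampler, rng):
    """One draw of V_t: a Poisson(r t) number of iid values, summed."""
    n = rng.poisson(r * t)               # N_t, the number of arrivals in (0, t]
    return value_sampler(rng, n).sum()   # total value of those arrivals

# Illustrative assumption: customers arrive at rate r = 2 per hour, and each
# spends an exponentially distributed amount with mean 10.
rng = np.random.default_rng(2)
spend = lambda rng, n: rng.exponential(10.0, size=n)
samples = np.array([compound_poisson(2.0, 5.0, spend, rng) for _ in range(5000)])
print(samples.mean())   # close to E(U) r t = 10 * 2 * 5 = 100
```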
Properties
Note that for fixed t, V_t is a random sum of independent, identically distributed random variables, a topic that we have studied before. In this sense, we have a special case, since the number of terms N_t has the Poisson distribution with parameter rt. But we also have a new wrinkle, since the process is indexed by the continuous time parameter t, and so we can study its properties as a stochastic process. Our first result is a pair of properties shared by the underlying Poisson process.
14.7.1 https://stats.libretexts.org/@go/page/10272
V has stationary, independent increments:
1. If s, t ∈ [0, ∞) with s < t, then V_t − V_s has the same distribution as V_{t−s}.
2. If (t_1, t_2, t_3, …) is an increasing sequence in [0, ∞), then (V_{t_1}, V_{t_2} − V_{t_1}, V_{t_3} − V_{t_2}, …) is a sequence of independent random variables.
Proof
For t ∈ [0, ∞),
1. E(V_t) = μrt
2. var(V_t) = (μ² + σ²)rt
Proof
By exactly the same argument, the same relationship holds for characteristic functions and, in the case that the variables in U take values in N, for probability generating functions. That is, if the variables in U have generating function G, then the generating function H of V_t is given by
H(s) = exp(rt[G(s) − 1])
for s in the domain of G, where the generating function can be any of the three types we have discussed: probability, moment, or characteristic.
Suppose first that the variables in U are indicator variables, independent of the underlying Poisson process N. In the usual language of thinning, the arrivals are of two types (1 and 0), and U_i is the type of the i th arrival. Thus the compound process V constructed above is the thinned process, so that V_t is the number of type 1 points up to time t. More generally, suppose that the common distribution of U takes values in a countable set S ⊆ R, and as before, let f denote the common probability density function, so that f(u) = P(U_i = u) for u ∈ S and i ∈ N_+. For u ∈ S, let N_t^u denote the number of arrivals up to time t that have the value u, and let N^u = {N_t^u : t ∈ [0, ∞)} denote the corresponding stochastic process. Armed with this setup, here is the result:
The compound Poisson process V associated with N and U can be written in the form
V_t = ∑_{u ∈ S} u N_t^u,  t ∈ [0, ∞)   (14.7.7)
The processes {N^u : u ∈ S} are independent Poisson processes, and N^u has rate r f(u) for u ∈ S.
Proof
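The decomposition can be checked numerically. A Python sketch, with the illustrative choice S = {1, 2, 3} and f(1) = 0.5, f(2) = 0.3, f(3) = 0.2:

```python
import numpy as np

# Illustrative choice: values in S = {1, 2, 3} with f(1) = 0.5, f(2) = 0.3,
# f(3) = 0.2, rate r = 4, time t = 10.
rng = np.random.default_rng(3)
r, t = 4.0, 10.0
values = np.array([1, 2, 3])
probs = np.array([0.5, 0.3, 0.2])

n_total = rng.poisson(r * t)                    # N_t, total number of arrivals
u = rng.choice(values, size=n_total, p=probs)   # value of each arrival
counts = {v: int(np.sum(u == v)) for v in values}   # N_t^u for each u in S
v_t = sum(v * counts[v] for v in values)        # V_t via the decomposition
assert v_t == u.sum()                           # agrees with summing the values
print(counts, v_t)                              # each N_t^u has mean r f(u) t
```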
Suppose that U = (U_1, U_2, …) is a sequence of independent, identically distributed random variables, and that N is independent of U and has the Poisson distribution with parameter λ ∈ (0, ∞). Then V = ∑_{i=1}^N U_i has a compound Poisson distribution.
But in fact, compound Poisson variables usually do arise in the context of an underlying Poisson process. In any event, the results
on the mean and variance above and the generating function above hold with rt replaced by λ . Compound Poisson distributions are
infinitely divisible. A famous theorem of William Feller gives a partial converse: an infinitely divisible distribution on N must be
compound Poisson.
The negative binomial distribution on N is infinitely divisible, and hence must be compound Poisson. Here is the construction:
Suppose that p ∈ (0, 1) and k ∈ (0, ∞). Suppose that U = (U_1, U_2, …) is a sequence of independent variables, each having the logarithmic series distribution with shape parameter 1 − p. Suppose also that N is independent of U and has the Poisson distribution with parameter −k ln(p). Then V = ∑_{i=1}^N U_i has the negative binomial distribution on N with parameters k and p.
Proof
As a special case (k = 1 ), it follows that the geometric distribution on N is also compound Poisson.
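The construction can be checked by simulation. A Python sketch using NumPy's logarithmic series sampler (note that NumPy parameterizes the logarithmic series distribution by what the text calls the shape parameter, here 1 − p), with the illustrative values p = 0.5 and k = 2:

```python
import numpy as np

# Illustrative values p = 0.5, k = 2. NumPy parameterizes the logarithmic
# series distribution by the text's shape parameter, here 1 - p.
rng = np.random.default_rng(4)
p, k = 0.5, 2.0
lam = -k * np.log(p)    # the Poisson parameter -k ln(p)

def compound_draw(rng):
    n = rng.poisson(lam)
    return int(rng.logseries(1.0 - p, size=n).sum())

samples = np.array([compound_draw(rng) for _ in range(50_000)])
# negative binomial with parameters k and p: mean k(1-p)/p, variance k(1-p)/p^2
print(samples.mean(), samples.var())   # close to 2 and 4, respectively
```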
This page titled 14.7: Compound Poisson Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
14.8: Poisson Processes on General Spaces
Basic Theory
The Process
So far, we have studied the Poisson process as a model for random points in time. However there is also a Poisson model for
random points in space. Some specific examples of such “random points” are
Defects in a sheet of material.
Raisins in a cake.
Stars in the sky.
The Poisson process for random points in space can be defined in a very general setting. All that is really needed is a measure space (S, S, μ). Thus, S is a set (the underlying space for our random points), S is a σ-algebra of subsets of S (as always, the allowable sets), and μ is a positive measure on (S, S) (a measure of the size of sets). The most important special case is when S is a (Lebesgue) measurable subset of R^d for some d ∈ N_+, S is the σ-algebra of measurable subsets of S, and μ = λ_d is d-dimensional Lebesgue measure. Specializing further, recall the lower dimensional spaces:
1. When d = 1, S ⊆ R and λ_1 is length measure.
2. When d = 2, S ⊆ R² and λ_2 is area measure.
3. When d = 3, S ⊆ R³ and λ_3 is volume measure.
Of course, the characterization of the Poisson process on [0, ∞) in terms of the inter-arrival times and the characterization in terms of the arrival times do not generalize, because they depend critically on the order relation on [0, ∞). However, the characterization in terms of the counting process generalizes perfectly to our new setting. Thus, consider a process that produces random points in S, and as usual, let N(A) denote the number of random points in A ∈ S. Thus N is a random counting measure on (S, S).
The random measure N is a Poisson process or a Poisson random measure on S with density parameter r > 0 if the following axioms are satisfied:
1. If A ∈ S then N(A) has the Poisson distribution with parameter rμ(A).
2. If {A_i : i ∈ I} is a countable, disjoint collection of sets in S then {N(A_i) : i ∈ I} is a set of independent random variables.
To draw parallels with the Poisson process on [0, ∞), note that axiom (a) is the generalization of stationary, Poisson-distributed increments, and axiom (b) is the generalization of independent increments. By convention, if μ(A) = 0 then N(A) = 0 with probability 1, and if μ(A) = ∞ then N(A) = ∞ with probability 1. (These distributions are considered degenerate members of the Poisson family.) On the other hand, note that if 0 < μ(A) < ∞ then N(A) has support N.
In the two-dimensional Poisson process, vary the width w and the rate r. Note the location and shape of the probability density
function of N . For selected values of the parameters, run the simulation 1000 times and compare the empirical density
function to the true probability density function.
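The two axioms yield a standard recipe for simulating the process on a bounded region A: draw a Poisson count with mean rμ(A), then place that many points independently and uniformly in A. A Python sketch for a rectangle (the parameter values are illustrative):

```python
import numpy as np

def poisson_points_2d(r, width, height, rng):
    """Poisson process with density r on [0, width] x [0, height]:
    a Poisson(r * area) count, placed independently and uniformly."""
    n = rng.poisson(r * width * height)
    x = rng.uniform(0.0, width, size=n)
    y = rng.uniform(0.0, height, size=n)
    return np.column_stack((x, y))

rng = np.random.default_rng(5)
pts = poisson_points_2d(3.0, 2.0, 2.0, rng)
print(len(pts))   # Poisson with mean r * area = 12
```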
For A ∈ S,
1. E[N(A)] = rμ(A)
2. var[N(A)] = rμ(A)
Proof
In particular, r can be interpreted as the expected density of the random points (that is, the expected number of points in a region of
unit size), justifying the name of the parameter.
In the two-dimensional Poisson process, vary the width w and the density parameter r. Note the size and location of the mean
±standard deviation bar of N . For various values of the parameters, run the simulation 1000 times and compare the empirical
mean and standard deviation to the true mean and standard deviation.
14.8.1 https://stats.libretexts.org/@go/page/10273
The Distribution of the Random Points
As before, the Poisson model defines the most random way to distribute points in space, in a certain sense. Assume that we have a
Poisson process N on (S, S , μ) with density parameter r ∈ (0, ∞).
Given that A ∈ S contains exactly one random point, the position X of the point is uniformly distributed on A .
Proof
More generally, if A contains n points, then the positions of the points are independent and each is uniformly distributed in A. Thus, given N(A) = n, each of the n random points falls into a given set B ⊆ A, independently, with probability p = μ(B)/μ(A), regardless of the density parameter r.
More generally, suppose that A ∈ S and that A is partitioned into k subsets (B_1, B_2, …, B_k) in S. Then the conditional distribution of (N(B_1), N(B_2), …, N(B_k)) given N(A) = n is multinomial with parameters n and (μ(B_1)/μ(A), μ(B_2)/μ(A), …, μ(B_k)/μ(A)).
Suppose now that each random point, independently of the others, is type 1 with probability p and type 0 with probability 1 − p. Let N_1 and N_0 denote the random counting measures associated with the type 1 and type 0 points, respectively; that is, N_i(A) is the number of type i points in A for A ∈ S. Then N_1 and N_0 are independent Poisson processes on (S, S, μ) with density parameters pr and (1 − p)r, respectively.
Proof
This result extends naturally to k ∈ N_+ types. As in the standard case, combining independent Poisson processes produces a new Poisson process:
Suppose that N_0 and N_1 are independent Poisson processes on (S, S, μ), with density parameters r_0 and r_1, respectively. Then the process obtained by combining the random points is also a Poisson process on (S, S, μ) with density parameter r_0 + r_1.
Proof
Consider the non-homogeneous Poisson process with rate function r (and hence mean function m). Recall that the Lebesgue-Stieltjes measure on [0, ∞) associated with m (which we also denote by m) is defined by the condition m(s, t] = m(t) − m(s) for s, t ∈ [0, ∞) with s ≤ t. Equivalently, m is the measure that is absolutely continuous with respect to λ_1, with density function r. That is, if A is a measurable subset of [0, ∞) then
m(A) = ∫_A r(t) dt   (14.8.10)
The non-homogeneous Poisson process on [0, ∞) with rate function r is the Poisson process on [0, ∞) with respect to the
measure m.
Proof
Nearest Points in R^d
In this subsection, we consider a rather specialized topic, but one that is fun and interesting. Consider the Poisson process on (R^d, R_d, λ_d) with density parameter r > 0, where as usual, R_d is the σ-algebra of Lebesgue measurable subsets of R^d, and λ_d is d-dimensional Lebesgue measure. Recall that the Euclidean norm on R^d is given by
∥x∥_d = (x_1² + x_2² + ⋯ + x_d²)^{1/2},  x = (x_1, x_2, …, x_d) ∈ R^d   (14.8.11)
For t ∈ [0, ∞), let B_t = {x ∈ R^d : ∥x∥_d ≤ t} denote the ball of radius t centered at the origin, so that λ_d(B_t) = c_d t^d, where
c_d = π^{d/2} / Γ(d/2 + 1)
is the measure of the unit ball in R^d, and where Γ is the gamma function. Of course, c_1 = 2, c_2 = π, c_3 = 4π/3.
For t ≥ 0, let M_t = N(B_t), the number of random points in the ball B_t, or equivalently, the number of random points within distance t of the origin. From our formula for the measure of B_t above, it follows that M_t has the Poisson distribution with parameter r c_d t^d.
d
Now let Z_0 = 0 and for n ∈ N_+ let Z_n denote the distance of the n th closest random point to the origin. Note that Z_n is analogous to the n th arrival time for the Poisson process on [0, ∞). Clearly the processes M = (M_t : t ≥ 0) and Z = (Z_0, Z_1, …) are inverses of each other in the sense that Z_n ≤ t if and only if M_t ≥ n. Both of these events mean that there are at least n random points within distance t of the origin.
Distributions
1. c_d Z_n^d has the gamma distribution with shape parameter n and rate parameter r.
2. Z_n has probability density function
g_n(z) = d (c_d r)^n z^{nd−1} exp(−r c_d z^d) / (n − 1)!,  0 ≤ z < ∞   (14.8.13)
Proof
The variables c_d Z_n^d − c_d Z_{n−1}^d for n ∈ N_+ are independent, and each has the exponential distribution with rate parameter r.
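These results can be checked by simulation. A Python sketch in d = 2 (so c_2 = π), generating the process in a large square and examining the first few gaps c_2 Z_n² − c_2 Z_{n−1}², which should be independent exponentials with rate r; the square is chosen large enough (an assumption of the sketch) that boundary effects on the nearest points are negligible:

```python
import numpy as np

# d = 2, c_2 = pi. Points are generated in the square [-L, L]^2 with L large
# enough that the nearest points are far from the boundary.
rng = np.random.default_rng(6)
r, L = 2.0, 10.0

def first_gaps(rng, n_gaps=3):
    n = rng.poisson(r * (2 * L) ** 2)        # Poisson count in the square
    pts = rng.uniform(-L, L, size=(n, 2))    # uniform placement given the count
    z = np.sort(np.hypot(pts[:, 0], pts[:, 1]))[:n_gaps]   # Z_1 <= ... <= Z_3
    w = np.pi * z ** 2                       # c_2 Z_n^2
    return np.diff(np.concatenate(([0.0], w)))   # should be iid Exp(r)

gaps = np.array([first_gaps(rng) for _ in range(4000)])
print(gaps.mean(axis=0))   # each close to 1/r = 0.5
```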
Computational Exercises
Suppose that defects in a sheet of material follow the Poisson model with an average of 1 defect per 2 square meters. Consider
a 5 square meter sheet of material.
1. Find the probability that there will be at least 3 defects.
2. Find the mean and standard deviation of the number of defects.
Answer
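The answer can be checked numerically: the number of defects N has the Poisson distribution with parameter 5 · (1/2) = 5/2. A quick Python computation:

```python
from math import exp, sqrt

m = 5 / 2   # mean number of defects: 1 per 2 square meters, on 5 square meters
p_at_least_3 = 1 - exp(-m) * (1 + m + m ** 2 / 2)   # 1 - P(0) - P(1) - P(2)
print(round(p_at_least_3, 4))   # 0.4562
print(m, round(sqrt(m), 4))     # mean 2.5, standard deviation 1.5811
```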
Suppose that raisins in a cake follow the Poisson model with an average of 2 raisins per cubic inch. Consider a slab of cake that
measures 3 by 4 by 1 inches.
1. Find the probability that there will be no more than 20 raisins.
2. Find the mean and standard deviation of the number of raisins.
Answer
Suppose that the occurrence of trees in a forest of a certain type that exceed a certain critical size follows the Poisson model. In
a one-half square mile region of the forest there are 40 trees that exceed the specified size.
1. Estimate the density parameter.
2. Using the estimated density parameter, find the probability of finding at least 100 trees that exceed the specified size in a square mile region of the forest.
Answer
Suppose that defects in a type of material follow the Poisson model. It is known that a square sheet with side length 2 meters contains one defect. Find the probability that the defect is in a circular region of the material with radius 1 meter.
Answer
Suppose that raisins in a cake follow the Poisson model. A 6 cubic inch piece of the cake with 20 raisins is divided into 3 equal
parts. Find the probability that each piece has at least 6 raisins.
Answer
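By the conditional properties of the Poisson model, given 20 raisins in the cake, the counts in the three equal pieces have the multinomial distribution with parameters 20 and (1/3, 1/3, 1/3). A quick Python computation of the answer by direct enumeration:

```python
from math import comb

# Counts in the 3 equal pieces, given 20 raisins in total: multinomial
# with parameters 20 and (1/3, 1/3, 1/3). Enumerate outcomes with all
# three counts at least 6.
n = 20
p_each_at_least_6 = sum(
    comb(n, a) * comb(n - a, b) / 3 ** n
    for a in range(6, n + 1)
    for b in range(6, n - a + 1)
    if n - a - b >= 6
)
print(round(p_each_at_least_6, 4))
```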
Suppose that defects in a sheet of material follow the Poisson model, with an average of 5 defects per square meter. Each
defect, independently of the others is mild with probability 0.5, moderate with probability 0.3, or severe with probability 0.2.
Consider a circular piece of the material with radius 1 meter.
1. Give the mean and standard deviation of the number of defects of each type in the piece.
2. Find the probability that there will be at least 2 defects of each type in the piece.
Answer
This page titled 14.8: Poisson Processes on General Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
CHAPTER OVERVIEW
15: Renewal Processes
A renewal process is an idealized stochastic model for “events” that occur randomly in time (generically called renewals or
arrivals). The basic mathematical assumption is that the times between the successive arrivals are independent and identically
distributed. Renewal processes have a very rich and interesting mathematical structure and can be used as a foundation for building
more realistic models. Moreover, renewal processes are often found embedded in other stochastic processes, most notably Markov
chains.
15.1: Introduction
15.2: Renewal Equations
15.3: Renewal Limit Theorems
15.4: Delayed Renewal Processes
15.5: Alternating Renewal Processes
15.6: Renewal Reward Processes
This page titled 15: Renewal Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
15.1: Introduction
A renewal process is an idealized stochastic model for “events” that occur randomly in time. These temporal events are generically
referred to as renewals or arrivals. Here are some typical interpretations and applications.
The arrivals are “customers” arriving at a “service station”. Again, the terms are generic. A customer might be a person and the
service station a store, but also a customer might be a file request and the service station a web server.
A device is placed in service and eventually fails. It is replaced by a device of the same type and the process is repeated. We do not count the replacement time in our analysis; equivalently we can assume that the replacement is immediate. The times of the replacements are the renewals.
The arrivals are times of some natural event, such as a lightning strike, a tornado, or an earthquake, at a particular geographical point.
The arrivals are emissions of elementary particles from a radioactive source.
Basic Processes
The basic model actually gives rise to several interrelated random processes: the sequence of interarrival times, the sequence of
arrival times, and the counting process. The term renewal process can refer to any (or all) of these. There are also several natural
“age” processes that arise. In this section we will define and study the basic properties of each of these processes in turn.
Interarrival Times
Let X_1 denote the time of the first arrival, and X_i the time between the (i − 1)st and i th arrivals for i ∈ {2, 3, …}. Our basic assumption is that the sequence of interarrival times X = (X_1, X_2, …) is an independent, identically distributed sequence of random variables. In statistical terms, X corresponds to sampling from the distribution of a generic interarrival time X. We assume that X takes values in [0, ∞) and P(X > 0) > 0, so that the interarrival times are nonnegative, but not identically 0. Let μ = E(X) denote the common mean of the interarrival times. We allow the possibility that μ = ∞. On the other hand, μ > 0.
Proof
If μ < ∞, we will let σ² = var(X) denote the common variance of the interarrival times. Let F denote the common distribution function.
The distribution function F turns out to be of fundamental importance in the study of renewal processes. We will let f denote the
probability density function of the interarrival times if the distribution is discrete or if the distribution is continuous and has a
probability density function (that is, if the distribution is absolutely continuous with respect to Lebesgue measure on [0, ∞)). In the
discrete case, the following definition turns out to be important:
If X takes values in the set {nd : n ∈ N} for some d ∈ (0, ∞) , then X (or its distribution) is said to be arithmetic (the terms
lattice and periodic are also used). The largest such d is the span of X.
The reason the definition is important is because the limiting behavior of renewal processes turns out to be more complicated when
the interarrival distribution is arithmetic.
T_n = ∑_{i=1}^n X_i,  n ∈ N   (15.1.2)
We follow our usual convention that the sum over an empty index set is 0; thus T_0 = 0. On the other hand, T_n is the time of the n th arrival for n ∈ N_+. The sequence T = (T_0, T_1, …) is called the arrival time process, although note that T_0 is not considered an arrival. A renewal process is so named because the process starts over, independently of the past, at each arrival time.
15.1.1 https://stats.libretexts.org/@go/page/10277
Figure 15.1.1 : The interarrival times and arrival times
The sequence T is the partial sum process associated with the independent, identically distributed sequence of interarrival times
X. Partial sum processes associated with independent, identically distributed sequences have been studied in several places in this
project. In the remainder of this subsection, we will collect some of the more important facts about such processes. First, we can
recover the interarrival times from the arrival times:
X_i = T_i − T_{i−1},  i ∈ N_+   (15.1.3)
Recall that if X has probability density function f (in either the discrete or continuous case), then T_n has probability density function f_n = f^{*n} = f ∗ f ∗ ⋯ ∗ f, the n-fold convolution power of f.
The sequence T has stationary, independent increments:
1. If m ≤ n then T_n − T_m has the same distribution as T_{n−m}.
2. If n_1 ≤ n_2 ≤ n_3 ≤ ⋯ then (T_{n_1}, T_{n_2} − T_{n_1}, T_{n_3} − T_{n_2}, …) is a sequence of independent random variables.
Proof
If m, n ∈ N then
1. E(T_n) = nμ
2. var(T_n) = nσ²
3. cov(T_m, T_n) = min{m, n} σ²
Proof
Note that T_n ≤ T_{n+1} for n ∈ N since the interarrival times are nonnegative. Also, P(T_{n−1} = T_n) = P(X_n = 0) = F(0). This can be positive, so with positive probability, more than one arrival can occur at the same time. On the other hand, the arrival times are unbounded:
Tn → ∞ as n → ∞ with probability 1.
Proof
For t ≥ 0, let
N_t = ∑_{n=1}^∞ 1(T_n ≤ t)
denote the number of arrivals in [0, t]. We will refer to the random process N = (N_t : t ≥ 0) as the counting process. Recall again that T_0 = 0 is not considered an arrival, but it's possible to have T_n = 0 for n ∈ N_+, so there may be one or more arrivals at time 0.
N_t = max{n ∈ N : T_n ≤ t} for t ≥ 0.
Note that as a function of t, N_t is a (random) step function with jumps at the distinct values of (T_1, T_2, …); the size of the jump at an arrival time is the number of arrivals at that time. In particular, t ↦ N_t is increasing.
Figure 15.1.2 : The counting process
More generally, we can define the (random) counting measure corresponding to the sequence of random points (T_1, T_2, …) in [0, ∞). Thus, if A is a (measurable) subset of [0, ∞), we will let N(A) denote the number of the random points in A:
N(A) = ∑_{n=1}^∞ 1(T_n ∈ A)
In particular, note that with our new notation, N_t = N[0, t] for t ≥ 0 and N(s, t] = N_t − N_s for s ≤ t. Thus, the random counting measure is completely determined by the counting process. The counting process is the “cumulative measure function” for the counting measure, analogous to the cumulative distribution function of a probability measure.
For t ≥ 0 and n ∈ N_+,
1. T_n ≤ t if and only if N_t ≥ n
Proof
Of course, the complements of the events in (a) are also equivalent, so T_n > t if and only if N_t < n. On the other hand, neither of the events N_t ≤ n and T_n ≥ t implies the other. For example, we could easily have N_t = n and T_n < t < T_{n+1}. Taking complements, neither of the events N_t > n and T_n < t implies the other. The last result also shows that the arrival time process T determines the counting process N.
All of the results so far in this subsection show that the arrival time process T and the counting process N are inverses of one
another in a sense. The important equivalences above can be used to obtain the probability distribution of the counting variables in
terms of the interarrival distribution function F .
For t ≥ 0 and n ∈ N_+,
1. P(N_t ≥ n) = F_n(t)
2. P(N_t = n) = F_n(t) − F_{n+1}(t)
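These equivalences can be used to compute the distribution of N_t numerically for a discrete interarrival distribution, via the convolution powers f^{*n}. A Python sketch with the illustrative interarrival distribution f(1) = f(2) = 1/2, observed at t = 5:

```python
import numpy as np

# Numerical sketch: interarrival times on {1, 2} with f(1) = f(2) = 1/2
# (an illustrative choice), observed at time t = 5.
f = np.zeros(3)
f[1] = f[2] = 0.5          # pdf of X, indexed by value: f[k] = P(X = k)
t = 5

def conv_power(f, n, size):
    """pdf of T_n = X_1 + ... + X_n, the n-fold convolution power of f."""
    g = np.zeros(size)
    g[0] = 1.0             # T_0 = 0 (empty sum)
    for _ in range(n):
        g = np.convolve(g, f)[:size]
    return g

# P(N_t >= n) = P(T_n <= t) = F_n(t), the cdf of the convolution power
probs_ge = [conv_power(f, n, t + 2)[: t + 1].sum() for n in range(t + 2)]
pmf = -np.diff(probs_ge)   # P(N_t = n) = F_n(t) - F_{n+1}(t)
print(pmf)                 # distribution of N_5 on {0, 1, ..., 5}
```

With these interarrival times, N_5 is between 2 (all interarrival times equal to 2) and 5 (all equal to 1), and the computation reflects that.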
The next result is little more than a restatement of the result above relating the counting process and the arrival time process. However, you may need to review the section on filtrations and stopping times to understand the result.
For t ∈ [0, ∞), N_t + 1 is a stopping time for the sequence of interarrival times X.
Proof
The renewal function M is defined by M(t) = E(N_t) for t ∈ [0, ∞). The renewal function turns out to be of fundamental importance in the study of renewal processes. Indeed, the renewal function essentially characterizes the renewal process. It will take awhile to fully understand this, but the following theorem is a first step:
M(t) = ∑_{n=1}^∞ F_n(t),  t ∈ [0, ∞)
Proof
Note that we have not yet shown that M(t) < ∞ for t ≥ 0, and note also that this does not follow from the previous theorem. However, we will establish this finiteness condition in the subsection on moment generating functions below. If M is differentiable, the derivative m = M′ is known as the renewal density, so that m(t) gives the expected rate of arrivals per unit time at t.
The Lebesgue-Stieltjes measure associated with M (which we also denote by m) is a positive measure on [0, ∞). This measure is known as the renewal measure.
Proof
The renewal measure satisfies m(A) = ∑_{n=1}^∞ P(T_n ∈ A) for measurable A ⊆ [0, ∞).
Proof
If s, t ∈ [0, ∞) with s ≤ t, then M(t) − M(s) = m(s, t], the expected number of arrivals in (s, t].
The last theorem implies that the renewal function actually determines the entire renewal measure. The renewal function is the
“cumulative measure function”, analogous to the cumulative distribution function of a probability measure. Thus, every renewal
process naturally leads to two measures on [0, ∞), the random counting measure corresponding to the arrival times, and the
measure associated with the expected number of arrivals.
Consider the reliability setting in which whenever a device fails, it is immediately replaced by a new device of the same type. Then the sequence of interarrival times X is the sequence of lifetimes, while T_n is the time that the n th device is placed in service. The random variable
C_t = t − T_{N_t},  t ∈ [0, ∞)
is called the current life at time t. This variable takes values in the interval [0, t] and is the age of the device that is in service at time t. The random process C = (C_t : t ≥ 0) is the current life process.
The random variable
R_t = T_{N_t + 1} − t,  t ∈ [0, ∞)
is called the remaining life at time t. This variable takes values in the interval (0, ∞) and is the time remaining until the device that is in service at time t fails. The random process R = (R_t : t ≥ 0) is the remaining life process.
The random variable
L_t = C_t + R_t = T_{N_t + 1} − T_{N_t} = X_{N_t + 1},  t ∈ [0, ∞)   (15.1.15)
is called the total life at time t. This variable takes values in [0, ∞) and gives the total life of the device that is in service at time t. The random process L = (L_t : t ≥ 0) is the total life process.
Tail events of the current and remaining life can be written in terms of each other and in terms of the counting variables. Suppose that t ∈ [0, ∞), x ∈ [0, t], and y ∈ [0, ∞). Then
1. {R_t > y} = {N_{t+y} − N_t = 0}
2. {C_t ≥ x} = {N_t − N_{t−x} = 0}
Proof
Figure 15.1.4 : The events of interest for the current and remaining life
Of course, the various equivalent events in the last result must have the same probability. In particular, it follows that if we know the distribution of R_t for all t then we also know the distribution of C_t for all t, and in fact we know the joint distribution of (C_t, R_t) for all t.
For fixed t ∈ (0, ∞), the total life at t (the lifetime of the device in service at time t) is stochastically larger than a generic lifetime. This result, a bit surprising at first, is known as the inspection paradox. Let X denote a generic interarrival time; then P(L_t > x) ≥ P(X > x) for x ∈ [0, ∞).
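The inspection paradox is easy to see in simulation. A Python sketch with standard exponential interarrival times (mean 1, an illustrative choice): the expected total life of the device in service at a large time t is close to 2, twice the generic mean lifetime, since both the current and remaining life are then approximately exponential with mean 1:

```python
import numpy as np

# Inspection paradox, simulated with exponential interarrival times of
# mean 1 (so the generic mean lifetime is E(X) = 1).
rng = np.random.default_rng(7)
t = 20.0

def total_life_at_t(rng):
    """Return L_t, the lifetime of the device in service at time t."""
    s = 0.0
    while True:
        x = rng.exponential(1.0)    # next lifetime
        if s + x > t:               # this device is the one in service at t
            return x
        s += x

samples = np.array([total_life_at_t(rng) for _ in range(10_000)])
print(samples.mean())   # close to 2, twice the generic mean lifetime
```

Intuitively, the inspection time t is more likely to land inside a long lifetime than a short one, which biases the sampled lifetime upward.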
Basic Comparison
The basic comparison in the following result is often useful, particularly for obtaining various bounds. The idea is very simple: if
the interarrival times are shortened, the arrivals occur more frequently.
Suppose now that we have two interarrival sequences, X = (X_1, X_2, …) and Y = (Y_1, Y_2, …), defined on the same probability space, with Y_i ≤ X_i (with probability 1) for each i. Then for n ∈ N and t ∈ [0, ∞),
1. T_{Y,n} ≤ T_{X,n}
2. N_{Y,t} ≥ N_{X,t}
3. m_Y(t) ≥ m_X(t)
Suppose that X = (X_1, X_2, …) is a sequence of Bernoulli trials with success parameter p ∈ (0, 1). Recall the following random processes derived from X:
1. Y = (Y_1, Y_2, …) where Y_n is the number of successes in the first n trials. The sequence Y is the partial sum process associated with X. The variable Y_n has the binomial distribution with parameters n and p.
2. U = (U_1, U_2, …) where U_n is the number of trials needed to go from success number n − 1 to success number n. These are independent variables, each having the geometric distribution on N_+ with parameter p.
3. V = (V_0, V_1, …) where V_n is the trial number of success n. The sequence V is the partial sum process associated with U. The variable V_n has the negative binomial distribution with parameters n and p.
Consider the renewal process with interarrival sequence U. Then
1. The basic assumptions are satisfied and the mean interarrival time is μ = 1/p.
2. V is the sequence of arrival times.
3. Y is the counting process (restricted to N).
4. The renewal function is m(n) = np for n ∈ N.
Run the binomial timeline experiment 1000 times for various values of the parameters n and p. Compare the empirical
distribution of the counting variable to the true distribution.
Run the negative binomial experiment 1000 times for various values of the parameters k and p. Compare the empirical distribution of the arrival time to the true distribution.
For n ∈ N,
1. The current life and remaining life at time n are independent.
2. The remaining life at time n has the same distribution as an interarrival time, namely the geometric distribution on N_+ with parameter p.
3. The current life at time n has a truncated geometric distribution with parameters n and p:
P(C_n = k) = { p(1 − p)^k,  k ∈ {0, 1, …, n − 1};  (1 − p)^n,  k = n }   (15.1.20)
Proof
This renewal process starts over, independently of the past, not only at the arrival times, but at fixed times n ∈ N as well. The
Bernoulli trials process (with the successes as arrivals) is the only discrete-time renewal process with this property, which is a
consequence of the memoryless property of the geometric interarrival distribution.
We can also use the indicator variables as the interarrival times. This may seem strange at first, but actually turns out to be useful.
4. The number of arrivals in the interval [0, n] is V_{n+1} − 1 for n ∈ N. This gives the counting process.
5. The renewal function is m(n) = (n + 1)/p − 1 for n ∈ N.
The age processes are not very interesting for this renewal process:
1. C_n = 0 for n ∈ N
2. R_n = 1 for n ∈ N
The comparison above can be used to show that the moment generating function of the counting variable N_t for an arbitrary renewal process is finite in an interval about 0 for every t ∈ [0, ∞). This implies that N_t has finite moments of all orders. Since the interarrival times are not identically 0, there exists a ∈ (0, ∞) such that p = P(X ≥ a) > 0. We now consider the renewal process with interarrival sequence X_a = (X_{a,1}, X_{a,2}, …), where X_{a,i} = a 1(X_i ≥ a) for i ∈ N_+. The renewal process with interarrival sequence X_a is just like the renewal process with Bernoulli interarrivals, except that the arrivals occur at the points in the sequence (0, a, 2a, …), instead of (0, 1, 2, …).
For each t ∈ [0, ∞), N_t has a moment generating function that is finite in an interval about 0, and hence N_t has moments of all orders.
Proof
4. The counting process N = (N_t : t ≥ 0) has stationary, independent increments, and N_t has the Poisson distribution with parameter rt for t ∈ [0, ∞).
Consider again the Poisson process with rate parameter r. For t ∈ [0, ∞),
1. The current life and remaining life at time t are independent.
2. The remaining life at time t has the same distribution as an interarrival time X, namely the exponential distribution with rate parameter r.
3. The current life at time t has a truncated exponential distribution with parameters t and r:
P(C_t ≥ s) = { e^{−rs},  0 ≤ s ≤ t;  0,  s > t }   (15.1.22)
Proof
The Poisson process starts over, independently of the past, not only at the arrival times, but at fixed times t ∈ [0, ∞) as well. The
Poisson process is the only renewal process with this property, which is a consequence of the memoryless property of the
exponential interarrival distribution.
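The memoryless structure can also be seen by simulation. Here is a rough Python check (my own sketch; the rate r = 2, time t = 5, sample size, and tolerance are arbitrary choices) that the remaining life exceeds a level s with probability e^{−rs}, and likewise for the current life when s ≤ t.

```python
import math
import random

rng = random.Random(17)
r, t, reps, s = 2.0, 5.0, 4000, 0.4
rem_exceed = cur_exceed = 0
for _ in range(reps):
    prev = total = 0.0
    while total <= t:                      # generate arrivals past time t
        prev, total = total, total + rng.expovariate(r)
    rem_exceed += (total - t) > s          # remaining life R_t
    cur_exceed += (t - prev) >= s          # current life C_t
target = math.exp(-r * s)  # P(R_t > s) = P(C_t >= s) = e^{-rs} for s <= t
print(abs(rem_exceed / reps - target) < 0.05,
      abs(cur_exceed / reps - target) < 0.05)
```

With 4000 replications the empirical proportions should be within a few hundredths of e^{−rs} ≈ 0.449.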
Run the Poisson experiment 1000 times for various values of the parameters t and r. Compare the empirical distribution of the
counting variable to the true distribution.
Run the gamma experiment 1000 times for various values of the parameters n and r. Compare the empirical distribution of the
arrival time to the true distribution.
Simulation Exercises
Open the renewal experiment and set t = 10 . For each of the following interarrival distributions, run the simulation 1000 times
and note the shape and location of the empirical distribution of the counting variable. Note also the mean of the interarrival
distribution in each case.
1. The continuous uniform distribution on the interval [0, 1] (the standard uniform distribution).
2. The discrete uniform distribution starting at a = 0 , with step size h = 0.1 , and with n = 10 points.
3. The gamma distribution with shape parameter k = 2 and scale parameter b = 1 .
4. The beta distribution with left shape parameter a = 3 and right shape parameter b = 2 .
5. The exponential-logarithmic distribution with shape parameter p = 0.1 and scale parameter b = 1 .
6. The Gompertz distribution with shape parameter a = 1 and scale parameter b = 1 .
7. The Wald distribution with mean μ = 1 and shape parameter λ = 1 .
8. The Weibull distribution with shape parameter k = 2 and scale parameter b = 1 .
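Readers without access to the applet can approximate the renewal experiment in a few lines of Python. This sketch (mine, not part of the text) handles case 1, the standard uniform distribution; swapping in another sampler covers the other cases.

```python
import random

def counting_variable(t, interarrival):
    """One sample of N_t: the number of arrivals of a renewal process in [0, t]."""
    total, n = 0.0, 0
    while True:
        total += interarrival()
        if total > t:
            return n
        n += 1

rng = random.Random(20240915)
t = 10
sample = [counting_variable(t, rng.random) for _ in range(1000)]
mean = sum(sample) / len(sample)
# standard uniform interarrivals have mean 1/2, so E(N_t) is near t/μ = 20
print(19 < mean < 21)  # → True
```

For the other interarrival distributions, replace `rng.random` with, say, `lambda: rng.weibullvariate(1, 2)` for the Weibull case.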
This page titled 15.1: Introduction is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
15.2: Renewal Equations
Many quantities of interest in the study of renewal processes can be described by a special type of integral equation known as a
renewal equation. Renewal equations almost always arise by conditioning on the time of the first arrival and by using the defining
property of a renewal process—the fact that the process restarts at each arrival time, independently of the past. However, before we
can study renewal equations, we need to develop some additional concepts and tools involving measures, convolutions, and
transforms. Some of the results in the advanced sections on measure theory, general distribution functions, the integral with respect
to a measure, properties of the integral, and density functions are needed for this section. You may need to review some of these
topics as necessary. As usual, we assume that all functions and sets that are mentioned are measurable with respect to the
appropriate σ-algebras. In particular, [0, ∞) which is our basic temporal space, is given the usual Borel σ-algebra generated by the
intervals.
Suppose again that G is a distribution function on [0, ∞). Recall that the integral associated with the positive measure G is also
called the Lebesgue-Stieltjes integral associated with the distribution function G (named for Henri Lebesgue and Thomas Stieltjes).
If f : [0, ∞) → R and A ⊆ [0, ∞) (measurable of course), the integral of f over A (if it exists) is denoted ∫_A f(x) dG(x). We use the more conventional ∫_0^t f(x) dG(x) for the integral over [0, t] and ∫_0^∞ f(x) dG(x) for the integral over [0, ∞). On the other hand, ∫_s^t f(x) dG(x) means the integral over (s, t] for s < t, and ∫_s^∞ f(x) dG(x) means the integral over (s, ∞). Thus, the additivity of the integral over disjoint domains holds, as it must. For example, for t ∈ [0, ∞),
∫_0^∞ f(x) dG(x) = ∫_0^t f(x) dG(x) + ∫_t^∞ f(x) dG(x)
This notation would be ambiguous without the clarification, but is consistent with how the measure works: G[0, t] = G(t) for t ≥ 0, G(s, t] = G(t) − G(s) for 0 ≤ s < t, etc. Of course, if G is continuous as a function, so that G is also continuous as a measure, then none of this matters: the integral over an interval is the same whether or not endpoints are included. The following definition is a natural complement to the locally finite property of the positive measures that we are considering.
A function f : [0, ∞) → R is locally bounded if it is measurable and is bounded on [0, t] for each t ∈ [0, ∞).
The locally bounded functions form a natural class for which our integrals of interest exist.
Suppose that G is a distribution function on [0, ∞) and f : [0, ∞) → R is locally bounded. Then g : [0, ∞) → R defined by g(t) = ∫_0^t f(s) dG(s) is also locally bounded.
Proof
Note that if f and g are locally bounded, then so are f + g and f g. If f is increasing on [0, ∞) then f is locally bounded, so in
particular, a distribution function on [0, ∞) is locally bounded. If f is continuous on [0, ∞) then f is locally bounded. Similarly, if
G and H are distribution functions on [0, ∞) and if c ∈ (0, ∞), then G + H and cG are also distribution functions on [0, ∞).
Convolution, which we consider next, is another way to construct new distributions on [0, ∞) from ones that we already have.
Convolution
The term convolution means different things in different settings. Let's start with the definition we know, the convolution of
probability density functions, on our space of interest [0, ∞).
Suppose that X and Y are independent random variables with values in [0, ∞) and with probability density functions f and g, respectively. Then X + Y has probability density function f ∗ g given as follows, in the discrete and continuous cases, respectively:
(f ∗ g)(t) = Σ_{s ∈ [0, t]} f(s) g(t − s),  (f ∗ g)(t) = ∫_0^t f(s) g(t − s) ds
In the discrete case, it's understood that t is a possible value of X + Y , and the sum is over the countable collection of s ∈ [0, t]
with s a value of X and t − s a value of Y . Often in this case, the random variables take values in N, in which case the sum is
simply over the set {0, 1, … , t} for t ∈ N . The discrete and continuous cases could be unified by defining convolution with
respect to a general positive measure on [0, ∞). Moreover, the definition clearly makes sense for functions that are not necessarily
probability density functions.
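For a concrete instance of discrete convolution, here is a small Python helper (an illustration of mine, using fair dice rather than anything from the text):

```python
def convolve(f, g):
    """Discrete convolution of two pdfs given as dicts on nonnegative values."""
    h = {}
    for s, fs in f.items():
        for u, gu in g.items():
            h[s + u] = h.get(s + u, 0.0) + fs * gu
    return h

die = {k: 1 / 6 for k in range(1, 7)}   # pdf of a fair die
two_dice = convolve(die, die)           # pdf of the sum of two fair dice
print(round(two_dice[7], 4))  # → 0.1667, the familiar 6/36
```

The resulting dict is the triangular distribution on {2, …, 12}, and its values sum to 1 as a pdf must.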
Suppose that f, g : [0, ∞) → R are locally bounded and that H is a distribution function on [0, ∞). The convolution of f and g with respect to H is the function on [0, ∞) defined by t ↦ ∫_0^t f(t − s) g(s) dH(s).
If f and g are probability density functions for discrete distributions on a countable set C ⊆ [0, ∞) and if H is counting measure
on C , we get discrete convolution, as above. If f and g are probability density functions for continuous distributions on [0, ∞) and
if H is Lebesgue measure, we get continuous convolution, as above. Note however, that if g is nonnegative, then G(t) = ∫_0^t g(s) dH(s) for t ∈ [0, ∞) defines another distribution function on [0, ∞), and the convolution integral above is simply ∫_0^t f(t − s) dG(s). This motivates our next version of convolution, the one that we will use in the remainder of this section.
Suppose that f : [0, ∞) → R is locally bounded and that G is a distribution function on [0, ∞). The convolution of the function f with the distribution G is the function f ∗ G defined by (f ∗ G)(t) = ∫_0^t f(t − s) dG(s) for t ∈ [0, ∞).
Note that if F and G are distribution functions on [0, ∞), the convolution F ∗ G makes sense, with F simply as a function and G
as a distribution function. The result is another distribution function. Moreover in this case, the operation is commutative.
If F and G are distribution functions on [0, ∞), then F ∗ G is also a distribution function on [0, ∞), and F ∗ G = G ∗ F.
Proof
If F and G are probability distribution functions corresponding to independent random variables X and Y with values in [0, ∞), then F ∗ G is the probability distribution function of X + Y. Suppose now that f : [0, ∞) → R is locally bounded and that G and H are distribution functions on [0, ∞). From the previous result, both (f ∗ G) ∗ H and f ∗ (G ∗ H) make sense. Fortunately, they are the same, so convolution is associative.
Suppose that f : [0, ∞) → R is locally bounded and that G and H are distribution functions on [0, ∞). Then
(f ∗ G) ∗ H = f ∗ (G ∗ H ) (15.2.9)
Proof
Finally, convolution is a linear operation. That is, convolution preserves sums and scalar multiples, whenever these make sense.
Suppose that f , g : [0, ∞) → R are locally bounded, H is a distribution function on [0, ∞), and c ∈ R . Then
1. (f + g) ∗ H = (f ∗ H ) + (g ∗ H )
2. (cf ) ∗ H = c(f ∗ H )
Proof
Suppose that f : [0, ∞) → R is locally bounded, G and H are distribution functions on [0, ∞), and that c ∈ (0, ∞). Then
1. f ∗ (G + H ) = (f ∗ G) + (f ∗ H )
2. f ∗ (cG) = c(f ∗ G)
Proof
Laplace Transforms
Like convolution, the term Laplace transform (named for Pierre Simon Laplace of course) can mean slightly different things in
different settings. We start with the usual definition that you may have seen in your study of differential equations or other subjects:
The Laplace transform of a function f : [0, ∞) → R is the function ϕ defined as follows, for all s ∈ (0, ∞) for which the integral exists in R:
ϕ(s) = ∫_0^∞ e^{−st} f(t) dt (15.2.11)
Suppose that f is nonnegative, so that the integral defining the transform exists in [0, ∞] for every s ∈ (0, ∞). If ϕ(s_0) < ∞ for some s_0 ∈ (0, ∞), then ϕ(s) < ∞ for s ≥ s_0. The transform of a general function f exists at s (in R) if and only if the transform of |f| is finite at s. It follows that if f has a Laplace transform, then the transform ϕ is defined on an interval of the form (a, ∞) for some a ∈ (0, ∞). The actual domain is of very little importance; the main point is that the Laplace transform, if it exists, will be defined for all sufficiently large s. Basically, a nonnegative function will fail to have a Laplace transform if it grows at a “hyper-exponential rate” as t → ∞.
We could generalize the Laplace transform by replacing the Riemann or Lebesgue integral with the integral over a positive measure
on [0, ∞).
Suppose that G is a distribution on [0, ∞). The Laplace transform of f : [0, ∞) → R with respect to G is the function given below, defined for all s ∈ (0, ∞) for which the integral exists in R:
s ↦ ∫_0^∞ e^{−st} f(t) dG(t) (15.2.12)
However, as before, if f is nonnegative, then H(t) = ∫_0^t f(x) dG(x) for t ∈ [0, ∞) defines another distribution function, and the previous integral is simply ∫_0^∞ e^{−st} dH(t). This motivates the definition for the Laplace transform of a distribution.
The Laplace transform of a distribution F on [0, ∞) is the function Φ defined as follows, for all s ∈ (0, ∞) for which the integral is finite:
Φ(s) = ∫_0^∞ e^{−st} dF(t) (15.2.13)
Once again if F has a Laplace transform, then the transform will be defined for all sufficiently large s ∈ (0, ∞) . We will try to be
explicit in explaining which of the Laplace transform definitions is being used. For a generic function, the first definition applies,
and we will use a lower case Greek letter. If the function is a distribution function, either definition makes sense, but it is usually
the latter that is appropriate, in which case we use an upper case Greek letter. Fortunately, there is a simple relationship between
the two.
Suppose that F is a distribution function on [0, ∞). Let Φ denote the Laplace transform of the distribution F and ϕ the
Laplace transform of the function F . Then Φ(s) = sϕ(s) .
Proof
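The relationship Φ(s) = sϕ(s) is easy to verify numerically. The following sketch (mine; the exponential example, the midpoint quadrature, and the cutoffs are arbitrary choices) takes F(t) = 1 − e^{−rt}, computes the transform ϕ of the function F by quadrature, and compares sϕ(s) with the closed-form transform Φ(s) = r/(s + r) of the exponential distribution.

```python
import math

r, s = 2.0, 1.5
# Laplace transform of the *function* F(t) = 1 - e^{-rt}, by midpoint quadrature
h = 1e-4
phi = h * sum(math.exp(-s * (k + 0.5) * h) * (1 - math.exp(-r * (k + 0.5) * h))
              for k in range(400000))        # integrate over [0, 40]
Phi = r / (s + r)  # Laplace transform of the *distribution* F, in closed form
print(abs(s * phi - Phi) < 1e-3)  # → True, illustrating Φ(s) = s ϕ(s)
```

In closed form ϕ(s) = 1/s − 1/(s + r), so sϕ(s) = r/(s + r) exactly; the quadrature only approximates this.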
For a probability distribution, there is also a simple relationship between the Laplace transform and the moment generating
function.
Suppose that X is a random variable with values in [0, ∞) and with probability distribution function F. The Laplace transform Φ and the moment generating function Γ of the distribution F are given as follows, and so Φ(s) = Γ(−s) for all s ∈ (0, ∞):
Φ(s) = E(e^{−sX}) = ∫_0^∞ e^{−st} dF(t) (15.2.16)
Γ(s) = E(e^{sX}) = ∫_0^∞ e^{st} dF(t) (15.2.17)
In particular, a probability distribution F on [0, ∞) always has a Laplace transform Φ , defined on (0, ∞) . Note also that if
F (0) < 1 (so that X is not deterministically 0), then Φ(s) < 1 for s ∈ (0, ∞) .
Laplace transforms are important for general distributions on [0, ∞) for the same reasons that moment generating functions are
important for probability distributions: the transform of a distribution uniquely determines the distribution, and the transform of a
convolution is the product of the corresponding transforms (and products are much nicer mathematically than convolutions). The
following theorems give the essential properties of Laplace transforms. We assume that the transforms exist, of course, and it
should be understood that equations involving transforms hold for sufficiently large s ∈ (0, ∞) .
Suppose that F and G are distributions on [0, ∞) with Laplace transforms Φ and Γ, respectively. If Φ(s) = Γ(s) for all sufficiently large s ∈ (0, ∞), then F = G.
In the case of general functions on [0, ∞), the conclusion is that f = g except perhaps on a subset of [0, ∞) of measure 0. The Laplace transform is a linear operation.
Suppose that f, g : [0, ∞) → R have Laplace transforms ϕ and γ, respectively, and that c ∈ R. Then
1. f + g has Laplace transform ϕ + γ
2. cf has Laplace transform cϕ
Proof
The same properties hold for distributions on [0, ∞) with c ∈ (0, ∞). Integral transforms have a smoothing effect. Laplace transforms are differentiable, and we can interchange the derivative and integral operators.
Suppose that f : [0, ∞) → R has Laplace transform ϕ. Then ϕ has derivatives of all orders, and
ϕ^{(n)}(s) = ∫_0^∞ (−1)^n t^n e^{−st} f(t) dt (15.2.18)
Suppose that f : [0, ∞) → R is locally bounded with Laplace transform ϕ , and that G is a distribution function on [0, ∞) with
Laplace transform Γ. Then f ∗ G has Laplace transform ϕ ⋅ Γ .
Proof
If F and G are distributions on [0, ∞), then so is F ∗ G . The result above applies, of course, with F and F ∗ G thought of as
functions and G as a distribution, but multiplying through by s and using the theorem above, it's clear that the result is also true
with all three as distributions.
Suppose now that we have a renewal process with interarrival sequence X = (X_1, X_2, …), arrival time sequence T = (T_0, T_1, …), and counting process N = {N_t : t ∈ [0, ∞)}. As usual, let F denote the common distribution function of the interarrival times, and let M denote the renewal function, so that M(t) = E(N_t) for t ∈ [0, ∞). Of course, the probability distribution function F defines a probability measure on [0, ∞), but as noted earlier, M is also a distribution function and so defines a positive measure on [0, ∞). Recall that F^c = 1 − F is the right distribution function (or reliability function) of an interarrival time.

The distributions of the arrival times are the convolution powers of F. That is, F_n = F^{∗n} = F ∗ F ∗ ⋯ ∗ F (n factors).
Proof
A renewal equation for the unknown function u is an integral equation of the form
u = a + u ∗ F (15.2.22)
where a is a known, locally bounded function. Often u(t) = E(U_t) where {U_t : t ∈ [0, ∞)} is a random process of interest associated with the renewal process. The renewal equation comes from conditioning on the first arrival time T_1 = X_1, and then using the defining property of the renewal process: the fact that the process starts over, independently of the past, at the arrival time. Our next important result illustrates this.
Thus, the renewal function itself satisfies a renewal equation. Of course, we already have a “formula” for M, namely M = Σ_{n=1}^∞ F_n. However, sometimes M can be computed more easily from the renewal equation directly. The next result gives the relationship between the Laplace transforms of F and M.
The distributions F and M have Laplace transforms Φ and Γ, respectively, that are related as follows:
Γ = Φ / (1 − Φ),  Φ = Γ / (Γ + 1) (15.2.25)
In particular, the renewal distribution M always has a Laplace transform. The following theorem gives the fundamental results on
the solution of the renewal equation.
Suppose that a : [0, ∞) → R is locally bounded. Then the unique locally bounded solution to the renewal equation
u = a+u ∗ F is u = a + a ∗ M .
Direct proof
Proof from Laplace transforms
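The renewal equation can also be solved numerically by discretizing the convolution. As an illustration (my own sketch, using standard uniform interarrivals, for which the equation M = F + M ∗ F reduces to M(t) = t + ∫_0^t M(u) du on [0, 1], with solution M(t) = e^t − 1):

```python
import math

h, n = 1e-3, 1000          # step size and number of steps on [0, 1]
M = [0.0]                  # M(0) = 0
for i in range(1, n + 1):
    t = i * h
    known = h * (M[0] / 2 + sum(M[1:]))   # trapezoid rule over computed values
    M.append((t + known) / (1 - h / 2))   # solve for the implicit new endpoint
err = max(abs(M[i] - (math.exp(i * h) - 1)) for i in range(n + 1))
print(err < 1e-4)  # → True: the numeric solution matches e^t - 1
```

The trapezoid discretization makes each new value M(t_i) appear on both sides of the equation, so it is solved for explicitly at each step; the global error is O(h²).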
Returning to the renewal equations for M and F above, we now see that the renewal function M completely determines the
renewal process: from M we can obtain F , and everything is ultimately constructed from the interarrival times. Of course, this is
also clear from the Laplace transform result above which gives simple algebraic equations for each transform in terms of the other.
The Distribution of the Age Variables
Let's recall the definition of the age variables. A deterministic time t ∈ [0, ∞) falls in the random renewal interval [T_{N_t}, T_{N_t + 1}). The current life (or age) at time t is C_t = t − T_{N_t}, the remaining life at time t is R_t = T_{N_t + 1} − t, and the total life at time t is L_t = T_{N_t + 1} − T_{N_t}. In the usual reliability setting, C_t is the age of the device that is in service at time t, while R_t is the time until that device fails.

For fixed y ∈ [0, ∞), let r_y(t) = P(R_t > y) for t ∈ [0, ∞), and let F_y^c(t) = F^c(t + y). Note that y ↦ r_y(t) is the right distribution function of R_t. We will derive and then solve a renewal equation for r_y by conditioning on the time of the first arrival. We can then find integral equations that describe the distribution of the current age and the joint distribution of the current and remaining ages.
Proof
Proof
Finally we get the joint distribution of the current and remaining ages.
Proof
2. M(t) = (e^t − 1) − (t − 1)e^{t−1} for 1 ≤ t ≤ 2
Solution
Figure 15.2.2 : The graph of M on the interval [0, 2]
Show that the Laplace transform Φ of the interarrival distribution F and the Laplace transform Γ of the renewal distribution M are given by
Φ(s) = (1 − e^{−s}) / s,  Γ(s) = (1 − e^{−s}) / (s − 1 + e^{−s});  s ∈ (0, ∞) (15.2.34)
Solution
Open the renewal experiment and select the uniform interarrival distribution on the interval [0, 1]. For each of the following
values of the time parameter, run the experiment 1000 times and note the shape and location of the empirical distribution of the
counting variable.
1. t = 5
2. t = 10
3. t = 15
4. t = 20
5. t = 25
6. t = 30
Show that the current and remaining life at time t ≥ 0 satisfy the following properties:
1. C_t and R_t are independent.
2. R_t has the same distribution as an interarrival time, namely the exponential distribution with rate parameter r.
3. C_t has a truncated exponential distribution with parameters t and r:
P(C_t ≥ x) = e^{−rx} for 0 ≤ x ≤ t, and P(C_t ≥ x) = 0 for x > t (15.2.40)
Solution
Bernoulli Trials
Consider the renewal process for which the interarrival times have the geometric distribution with parameter p. Recall that the probability density function is
f(n) = (1 − p)^{n−1} p,  n ∈ N_+ (15.2.42)
The arrivals are the successes in a sequence of Bernoulli trials. The number of successes Y_n in the first n trials is the counting variable for n ∈ N. The renewal equations in this section can be used to give alternate proofs of some of the fundamental results in
the Introduction.
Show that the current and remaining life at time n ∈ N satisfy the following properties:
1. C_n and R_n are independent.
2. R_n has the same distribution as an interarrival time, namely the geometric distribution with parameter p.
3. C_n has a truncated geometric distribution with parameters n and p:
P(C_n = j) = p(1 − p)^j for j ∈ {0, 1, …, n − 1}, and P(C_n = n) = (1 − p)^n (15.2.50)
Solution
Recall also that F is the distribution of the sum of two independent random variables, each having the exponential distribution with
rate parameter r.
Show that M(t) ≈ −1/4 + (1/2) rt as t → ∞.

Solution
Open the renewal experiment and select the gamma interarrival distribution with shape parameter k = 2 and scale parameter b = 1 (so the rate parameter r = 1 as well). For each of the following values of the time parameter, run the experiment 1000 times and note the shape and location of the empirical distribution of the counting variable.
1. t = 5
2. t = 10
3. t = 15
4. t = 20
5. t = 25
6. t = 30
This page titled 15.2: Renewal Equations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
15.3: Renewal Limit Theorems
We start with a renewal process as constructed in the introduction. Thus, X = (X_1, X_2, …) is the sequence of interarrival times. These are independent, identically distributed, nonnegative variables with common distribution function F (satisfying F(0) < 1) and common mean μ. When μ = ∞, we let 1/μ = 0. When μ < ∞, we let σ denote the common standard deviation. Recall also that F^c = 1 − F is the right distribution function (or reliability function). Then T = (T_0, T_1, …) is the arrival time sequence, where T_0 = 0 and
T_n = Σ_{i=1}^n X_i (15.3.1)
for n ∈ N_+. Finally,
N_t = Σ_{n=1}^∞ 1(T_n ≤ t) (15.3.2)
is the number of arrivals in [0, t]. The renewal function M is defined by M(t) = E(N_t) for t ∈ [0, ∞).
We noted earlier that the arrival time process and the counting process are inverses, in a sense. The arrival time process is the
partial sum process for a sequence of independent, identically distributed variables. Thus, it seems reasonable that the fundamental
limit theorems for partial sum processes (the law of large numbers and the central limit theorem) should have analogs for
the counting process. That is indeed the case, and the purpose of this section is to explore the limiting behavior of renewal
processes. The main results that we will study, known appropriately enough as renewal theorems, are important for other stochastic
processes, particularly Markov chains.
Basic Theory
The Law of Large Numbers
Our first result is a strong law of large numbers for the renewal counting process, which comes as you might guess, from the law of
large numbers for the sequence of arrival times.
Thus, 1/μ is the limiting average rate of arrivals per unit time.
Open the renewal experiment and set t = 50 . For a variety of interarrival distributions, run the simulation 1000 times and note
how the empirical distribution is concentrated near t/μ.
Proof
Open the renewal experiment and set t = 50. For a variety of interarrival distributions, run the simulation 1000 times and note the “normal” shape of the empirical distribution. Compare the empirical mean and standard deviation to t/μ and σ√(t/μ^3), respectively.
The Elementary Renewal Theorem
The elementary renewal theorem states that the basic limit in the law of large numbers above holds in mean, as well as with
probability 1. That is, the limiting mean average rate of arrivals is 1/μ. The elementary renewal theorem is of fundamental
importance in the study of the limiting behavior of Markov chains, but the proof is not as easy as one might hope. In particular,
recall that convergence with probability 1 does not imply convergence in mean, so the elementary renewal theorem does not follow
from the law of large numbers.
M (t)/t → 1/μ as t → ∞ .
Proof
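A simulation sketch of the elementary renewal theorem (mine, not part of the text; it uses the Poisson case with rate r = 1, where M(t) = rt exactly, so M(t)/t = 1/μ for every t):

```python
import random

rng = random.Random(7)

def n_t(t):
    """One sample of N_t for exponential (rate 1) interarrivals."""
    total, n = 0.0, 0
    while True:
        total += rng.expovariate(1.0)
        if total > t:
            return n
        n += 1

t, reps = 50.0, 500
est = sum(n_t(t) for _ in range(reps)) / reps / t   # estimates M(t)/t
print(abs(est - 1.0) < 0.05)  # → True, consistent with M(t)/t → 1/μ = 1
```

Replacing `expovariate` with another nonnegative sampler illustrates the theorem for other interarrival distributions, though convergence is then only as t → ∞.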
Open the renewal experiment and set t = 50. For a variety of interarrival distributions, run the experiment 1000 times and once again compare the empirical mean and standard deviation to t/μ and σ√(t/μ^3), respectively.
M(t, t + h] → h/μ as t → ∞ in each of the following cases:
The renewal theorem is also known as Blackwell's theorem in honor of David Blackwell. The final limit theorem we will study is
the most useful, but before we can state the theorem, we need to define and study the class of functions to which it applies.
Suppose that g : [0, ∞) → [0, ∞). For h ∈ (0, ∞) and k ∈ N, let m_k(g, h) = inf{g(t) : t ∈ [kh, (k + 1)h)} and M_k(g, h) = sup{g(t) : t ∈ [kh, (k + 1)h)}. The lower and upper Riemann sums of g on [0, ∞) corresponding to h are
L_g(h) = h Σ_{k=0}^∞ m_k(g, h),  U_g(h) = h Σ_{k=0}^∞ M_k(g, h)

A function g : [0, ∞) → [0, ∞) is directly Riemann integrable if U_g(h) < ∞ for every h > 0 and
lim_{h↓0} L_g(h) = lim_{h↓0} U_g(h) (15.3.13)
The common value is ∫_0^∞ g(t) dt.
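To make the definition concrete, the following sketch (mine) computes L_g(h) and U_g(h) for the decreasing function g(t) = e^{−t}, truncating the infinite sums once the terms underflow; both sums approach ∫_0^∞ e^{−t} dt = 1 as h ↓ 0.

```python
import math

def riemann_sums(g, h, blocks=200000):
    """Lower and upper Riemann sums of a *decreasing* g on [0, ∞) for mesh h:
    on [kh, (k+1)h) the inf is g((k+1)h) and the sup is g(kh)."""
    lower = h * sum(g((k + 1) * h) for k in range(blocks))
    upper = h * sum(g(k * h) for k in range(blocks))
    return lower, upper

g = lambda t: math.exp(-t)
for h in (0.5, 0.1, 0.01):
    lo, up = riemann_sums(g, h)
    print(h, round(lo, 3), round(up, 3))  # both sums squeeze toward 1
```

For this g the sums have closed forms, U_g(h) = h/(1 − e^{−h}) and L_g(h) = U_g(h) − h, which makes the convergence visible directly.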
Ordinary Riemann integrability on [0, ∞) allows functions that are unbounded and oscillate wildly as t → ∞ , and these are the
types of functions that we want to exclude for the renewal theorems. The following result connects ordinary Riemann integrability
with direct Riemann integrability.
If g : [0, ∞) → [0, ∞) is integrable (in the ordinary Riemann sense) on [0, t] for every t ∈ [0, ∞), and if U_g(h) < ∞ for some h ∈ (0, ∞), then g is directly Riemann integrable.
Here is a simple and useful class of functions that are directly Riemann integrable.
Suppose that g : [0, ∞) → [0, ∞) is decreasing with ∫_0^∞ g(t) dt < ∞. Then g is directly Riemann integrable.
Suppose that the renewal process is non-arithmetic and that g : [0, ∞) → [0, ∞) is directly Riemann integrable. Then
(g ∗ M)(t) = ∫_0^t g(t − s) dM(s) → (1/μ) ∫_0^∞ g(x) dx as t → ∞ (15.3.14)
Connections
Our next goal is to see how the various renewal theorems relate.
Conversely, the elementary renewal theorem almost implies the renewal theorem.
Proof
Proof
Proof
The current and remaining life have the same limiting distribution. In particular,
lim_{t→∞} P(C_t ≤ x) = lim_{t→∞} P(R_t ≤ x) = (1/μ) ∫_0^x F^c(y) dy,  x ∈ [0, ∞) (15.3.21)
Proof
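As a simulation illustration (my own sketch, with standard uniform interarrivals, where μ = 1/2 and the limiting distribution function is 2x − x² on [0, 1]):

```python
import random

rng = random.Random(3)

def current_life(t):
    """C_t for a renewal process with standard uniform interarrivals."""
    total = 0.0
    while True:
        nxt = total + rng.random()
        if nxt > t:
            return t - total
        total = nxt

t, reps, x = 50.0, 4000, 0.5
emp = sum(current_life(t) <= x for _ in range(reps)) / reps
limit = 2 * x - x * x   # (1/μ) ∫_0^x (1 - y) dy with μ = 1/2
print(abs(emp - limit) < 0.04)  # → True
```

At t = 50 the process has long since forgotten its start, so the empirical proportion should be near the limiting value 0.75.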
The fact that the current and remaining age processes have the same limiting distribution may seem surprising at first, but there is a simple intuitive explanation. After a long period of time, the renewal process looks just about the same backward in time as forward in time. But reversing the direction of time reverses the roles of current and remaining age.
For the Poisson process with rate parameter r, the interarrival times are exponential with rate r, so the mean interarrival time is μ = 1/r.
Bernoulli Trials
Suppose that X = (X_1, X_2, …) is a sequence of Bernoulli trials with success parameter p ∈ (0, 1). Recall that X is a sequence of independent, identically distributed indicator variables with p = P(X = 1). We have studied a number of random processes derived from X:
1. Y = (Y_0, Y_1, …) where Y_n is the number of successes in the first n trials. The sequence Y is the partial sum process associated with X. The variable Y_n has the binomial distribution with parameters n and p.
2. U = (U_1, U_2, …) where U_n is the number of trials needed to go from success number n − 1 to success number n. These are independent variables, each having the geometric distribution on N_+ with parameter p.
3. V = (V_0, V_1, …) where V_n is the trial number of success n. The sequence V is the partial sum process associated with U. The variable V_n has the negative binomial distribution with parameters n and p.
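These relationships between the processes are easy to check computationally. A small sketch (mine; the seed and parameters are arbitrary):

```python
import random

rng = random.Random(11)
p, n = 0.4, 30
trials = [1 if rng.random() < p else 0 for _ in range(300)]
Y_n = sum(trials[:n])                                          # binomial count
V = [0] + [j for j, x in enumerate(trials, start=1) if x == 1] # success trial numbers
# renewal process with interarrival sequence U: the counting variable at n is Y_n
check1 = sum(1 for v in V[1:] if v <= n) == Y_n
# renewal process with interarrival sequence X: arrivals in [0, n] is V_{n+1} - 1
T, s = [], 0
for x in trials:
    s += x
    T.append(s)                                                # partial sums of X
check2 = sum(1 for tk in T if tk <= n) == V[n + 1] - 1
print(check1, check2)  # → True True
```

Both identities are pathwise facts, so they hold for every simulated sequence, not just on average.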
Consider the renewal process with interarrival sequence U . Thus, μ = 1/p is the mean interarrival time, and Y is the counting
process. Verify each of the following directly:
1. The law of large numbers for the counting process.
2. The central limit theorem for the counting process.
3. The elementary renewal theorem.
Consider the renewal process with interarrival sequence X. Thus, the mean interarrival time is μ = p and the number of arrivals in the interval [0, n] is V_{n+1} − 1 for n ∈ N. Verify each of the following directly:
This page titled 15.3: Renewal Limit Theorems is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
15.4: Delayed Renewal Processes
Basic Theory
Preliminaries
A delayed renewal process is just like an ordinary renewal process, except that the first arrival time is allowed to have a different
distribution than the other interarrival times. Delayed renewal processes arise naturally in applications and are also found
embedded in other random processes. For example, in a Markov chain (which we study in the next chapter), visits to a fixed state,
starting in that state, form the random times of an ordinary renewal process. But visits to a fixed state, starting in another state, form a delayed renewal process.
Suppose that X = (X_1, X_2, …) is a sequence of independent variables taking values in [0, ∞), with (X_2, X_3, …) identically distributed. Suppose also that P(X_i > 0) > 0 for i ∈ N_+. The stochastic process with X as the sequence of interarrival times is a delayed renewal process.
As before, the actual arrival times are the partial sums of X. Thus let
T_n = Σ_{i=1}^n X_i (15.4.1)
for n ∈ N, and let N_t = Σ_{n=1}^∞ 1(T_n ≤ t) denote the number of arrivals in [0, t] for t ∈ [0, ∞).
If we restart the clock at time T_1 = X_1, we have an ordinary renewal process with interarrival sequence (X_2, X_3, …). We use some of the standard notation developed in the Introduction for this renewal process. In particular, F denotes the common distribution function and μ the common mean of X_i for i ∈ {2, 3, …}. Similarly F_n = F^{∗n} denotes the distribution function of the sum of n independent variables with distribution function F, and M denotes the renewal function:
M(t) = Σ_{n=1}^∞ F_n(t),  t ∈ [0, ∞)
On the other hand, we will let G denote the distribution function of X_1 (the special interarrival time, different from the rest), and we will let G_n denote the distribution function of T_n for n ∈ N_+. As usual, F^c = 1 − F and G^c = 1 − G are the corresponding right distribution functions.
Proof
Finally, we will let U denote the renewal function for the delayed renewal process. Thus, U(t) = E(N_t) is the expected number of arrivals in [0, t] for t ∈ [0, ∞).

U(t) = Σ_{n=1}^∞ G_n(t) for t ∈ [0, ∞)
Proof
The delayed renewal function U satisfies the equation U = G + M ∗ G; that is,
U(t) = G(t) + ∫_0^t M(t − s) dG(s),  t ∈ [0, ∞)
Proof
The delayed renewal function U satisfies the renewal equation U = G + U ∗ F; that is,
U(t) = G(t) + ∫_0^t U(t − s) dF(s),  t ∈ [0, ∞)
Proof
Asymptotic Behavior
In a delayed renewal process only the first arrival time is changed. Thus, it's not surprising that the asymptotic behavior of a
delayed renewal process is the same as the asymptotic behavior of the corresponding regular renewal process. Our first result is the
strong law of large numbers for the delayed renewal process.
Our next result is the elementary renewal theorem for the delayed renewal process.
U (t)/t → 1/μ as t → ∞ .
Next we have the renewal theorem for the delayed renewal process, also known as Blackwell's theorem, named for David Blackwell.
Finally we have the key renewal theorem for the delayed renewal process.
Suppose that the renewal process is non-arithmetic and that g : [0, ∞) → [0, ∞) is directly Riemann integrable. Then
(g ∗ U)(t) = ∫_0^t g(t − s) dU(s) → (1/μ) ∫_0^∞ g(x) dx as t → ∞ (15.4.11)
The spaces S = N with counting measure and S = [0, ∞) with length measure are of particular interest, in part because renewal and delayed renewal processes give rise to point processes in these spaces.
For a general point process on S, we use our standard notation and denote the number of random points in A ⊆ S by N(A). There are a couple of natural properties that a point process may have. In particular, the process is said to be stationary if λ(A) = λ(B) implies that N(A) and N(B) have the same distribution for A, B ⊆ S. In [0, ∞) the term stationary increments is often used, because the stationarity property means that for s, t ∈ [0, ∞), the distribution of N(s, s + t] = N_{s+t} − N_s depends only on t.
Consider now a regular renewal process. We showed earlier that the asymptotic distributions of the current life and remaining life
are the same. Intuitively, after a very long period of time, the renewal process looks pretty much the same forward in time or
backward in time. This suggests that if we make the renewal process into a delayed renewal process by giving the first arrival time
this asymptotic distribution, then the resulting point process will be stationary. This is indeed the case. Consider the setting and
notation of the preliminary subsection above.
For the delayed renewal process, the point process N is stationary if and only if the initial arrival time has distribution function
G(t) = (1/μ) ∫_0^t F^c(s) ds,  t ∈ [0, ∞) (15.4.12)
in which case the renewal function is U(t) = t/μ for t ∈ [0, ∞).
Proof
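As an illustration (my sketch; standard uniform interarrivals, so μ = 1/2, F^c(s) = 1 − s, and the equilibrium distribution is G(t) = 2t − t² on [0, 1], sampled here by inversion):

```python
import math
import random

rng = random.Random(99)

def delayed_count(t):
    """N_t when the first arrival time has the equilibrium distribution
    G(t) = 2t - t^2 and the later interarrival times are standard uniform."""
    total = 1.0 - math.sqrt(1.0 - rng.random())   # inverse of G on [0, 1]
    n = 0
    while total <= t:
        n += 1
        total += rng.random()
    return n

t, reps = 5.0, 2000
mean = sum(delayed_count(t) for _ in range(reps)) / reps
print(abs(mean - 2 * t) < 0.25)  # → True: U(t) = t/μ = 2t for the stationary process
```

Without the equilibrium delay, E(N_t) would differ from t/μ by a small transient term; with it, U(t) = t/μ holds exactly for every t.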
Suppose now that L = (L_1, L_2, …) is a sequence of independent, identically distributed variables taking values in a finite set S, so that L is a sequence of multinomial trials. Let f denote the common probability density function, so that for a generic trial variable L, we have f(a) = P(L = a) for a ∈ S. We assume that all outcomes in S are actually possible, so f(a) > 0 for a ∈ S. In this section, we interpret S as an alphabet, and we write the sequence of variables in concatenation form, L = L_1 L_2 ⋯, rather than standard sequence form. Thus the sequence is an infinite string of letters from our alphabet S. We are interested in the repeated occurrence of a particular finite substring of letters (that is, a “word” or “pattern”) in the infinite sequence.
So, fix a word a (again, a finite string of elements of S), and consider the successive random trial numbers (T_1, T_2, …) where the word a is completed in L. Since the sequence L is independent and identically distributed, it seems reasonable that these variables are the arrival times of a renewal process. However, there is a slight complication. An example may help.
Suppose that L is a sequence of Bernoulli trials (so S = {0, 1}). Suppose that the outcome of L is
101100101010001101000110 ⋯ (15.4.19)
In this example, you probably noted an important difference between the two words. For b, a suffix of the word (a proper substring at the end) is also a prefix of the word (a proper substring at the beginning). Word a does not have this property. So, once we “arrive” at b, there are ways to get to b again (taking advantage of the suffix-prefix) that do not exist starting from the beginning of the trials. On the other hand, once we arrive at a, arriving at a again is just like starting with a fresh sequence of trials. Thus we are led to the following definition.
Suppose that a is a finite word from the alphabet S . If no proper suffix of a is also a prefix, then a is simple. Otherwise, a is
compound.
For occurrences of the word a, X = (X_1, X_2, …) is the sequence of interarrival times, T = (T_0, T_1, …) is the sequence of arrival times, and N = {N_k : k ∈ N} is the counting process. If a is simple, these form an ordinary renewal process. If a is compound, they form a delayed renewal process, since X_1 will have a different distribution than (X_2, X_3, …). Since the structure of a delayed renewal process subsumes that of an ordinary renewal process, we will work with the notation above for the delayed process. In particular, let U denote the renewal function. Everything in this paragraph depends on the word a of course, but we have suppressed this in the notation.
Suppose a = a_1 a_2 ⋯ a_k, where a_i ∈ S for each i ∈ {1, 2, …, k}, so that a is a word of length k. Note that X_1 takes values in {k, k + 1, …}. If a is simple, this applies to the other interarrival times as well. If a is compound, the situation is more complicated: X_2, X_3, … will have some minimum value j < k, but the possible values are positive integers, of course, and include {k + 1, k + 2, …}. In any case, the renewal process is arithmetic with span 1. Expanding the definition of the probability density function f, let

f(a) = ∏_{i=1}^{k} f(a_i) (15.4.20)

so that f(a) is the probability of forming a with k consecutive trials. Let μ(a) denote the common mean of X_n for n ∈ {2, 3, …}, so μ(a) is the mean number of trials between occurrences of a. Let ν(a) = E(X_1), so that ν(a) is the mean number of trials until a occurs for the first time. Our first result is an elegant connection between μ(a) and f(a), which has a wonderfully simple proof from renewal theory.
If a is a word in S then

μ(a) = 1/f(a) (15.4.21)
Proof
Our next goal is to compute ν (a) in the case that a is a compound word.
Suppose that a is a compound word, and that b is the largest word that is a proper suffix and prefix of a . Then
ν(a) = ν(b) + μ(a) = ν(b) + 1/f(a) (15.4.22)
Proof
By repeated use of the last result, we can compute the expected number of trials needed to form any compound word.
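That repeated use is easy to mechanize. The sketch below (our own illustration; the function name is hypothetical) computes ν(a) exactly for multinomial trials by repeatedly peeling off the largest proper suffix-prefix b and applying ν(a) = ν(b) + 1/f(a).

```python
from fractions import Fraction

def mean_first_occurrence(word, f):
    """nu(word): expected number of trials until `word` first occurs.
    Uses nu(a) = nu(b) + 1/f(a), where b is the largest proper suffix
    of a that is also a prefix (nu of the empty word is 0)."""
    if not word:
        return Fraction(0)
    prob = Fraction(1)
    for letter in word:            # f(a) = product of the letter probabilities
        prob *= f[letter]
    b = ""
    for k in range(1, len(word)):  # largest proper suffix that is a prefix
        if word[-k:] == word[:k]:
            b = word[:k]
    return mean_first_occurrence(b, f) + 1 / prob

fair = {"0": Fraction(1, 2), "1": Fraction(1, 2)}
print(mean_first_occurrence("10", fair))   # simple word: 1/f(a) = 4
print(mean_first_occurrence("11", fair))   # compound: nu("1") + 4 = 6
print(mean_first_occurrence("010", fair))  # compound: nu("0") + 8 = 10
```

The same function handles other alphabets; for the ace-six flat die below, for example, map faces 1 and 6 to probability 1/4 and the other faces to 1/8.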
Consider Bernoulli trials with success probability p ∈ (0, 1), and let q = 1 − p . For each of the following strings, find the
expected number of trials between occurrences and the expected number of trials to the first occurrence.
1. a = 001
2. b = 010
3. c = 1011011
4. d = 11 ⋯ 1 (k times)
Answer
Recall that an ace-six flat die is a six-sided die for which faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, and 5 have probability 1/8 each. Ace-six flat dice are sometimes used by gamblers to cheat.
Suppose that an ace-six flat die is thrown repeatedly. Find the expected number of throws until the pattern 6165616 first
occurs.
Solution
Suppose that a monkey types randomly on a keyboard that has the 26 lower-case letter keys and the space key (so 27 keys).
Find the expected number of keystrokes until the monkey produces each of the following phrases:
1. it was the best of times
2. to be or not to be
Proof
This page titled 15.4: Delayed Renewal Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
15.5: Alternating Renewal Processes
Basic Theory
Preliminaries
An alternating renewal process models a system that, over time, alternates between two states, which we denote by 1 and 0 (so the
system starts in state 1). Generically, we can imagine a device that, over time, alternates between on and off states. Specializing
further, suppose that a device operates until it fails, and then is replaced with an identical device, which in turn operates until failure
and is replaced, and so forth. In this setting, the times that the device is functioning correspond to the on state, while the
replacement times correspond to the off state. (The device might actually be repaired rather than replaced, as long as the repair
returns the device to pristine, new condition.) The basic assumption is that the pairs of random times successively spent in the two
states form an independent, identically distributed sequence. Clearly the model of a system alternating between two states is basic
and important, but moreover, such alternating processes are often found embedded in other stochastic processes.
Let's set up the mathematical notation. Let U = (U_1, U_2, …) denote the successive lengths of time that the system is in state 1, and let V = (V_1, V_2, …) denote the successive lengths of time that the system is in state 0. So to be clear, the system starts in state 1 and remains in that state for a period of time U_1, then goes to state 0 and stays in this state for a period of time V_1, then back to state 1 for a period of time U_2, and so forth. Our basic assumption is that W = ((U_1, V_1), (U_2, V_2), …) is an independent, identically distributed sequence. It follows that U and V are each independent, identically distributed sequences, but U and V might well be dependent. In fact, V_n might be a function of U_n for n ∈ N_+. Let μ = E(U) denote the mean of a generic time period U in state 1 and let ν = E(V) denote the mean of a generic time period V in state 0. Let G denote the distribution function of a time period U in state 1, and as usual, let G^c = 1 − G denote the right distribution function (or reliability function) of U.
Clearly it's natural to consider returns to state 1 as the arrivals in a renewal process. Thus, let X_n = U_n + V_n for n ∈ N_+ and consider the renewal process with interarrival times X = (X_1, X_2, …). Clearly this makes sense, since X is an independent, identically distributed sequence of nonnegative variables. For the most part, we will use our usual notation for a renewal process, so the common distribution function of X = U + V is denoted by F, the arrival time process is T = (T_0, T_1, …), the counting process is {N_t : t ∈ [0, ∞)}, and the renewal function is M. But note that the mean interarrival time is now μ + ν.
The renewal process associated with W = ((U_1, V_1), (U_2, V_2), …) as constructed above is known as an alternating renewal process.
Clearly the stochastic processes W and I are equivalent in the sense that we can recover one from the other. Let p(t) = P(I_t = 1), the probability that the device is on at time t ∈ [0, ∞). Our first main result is a renewal equation for the function p.
We can now apply the key renewal theorem to get the asymptotic behavior of p.
Proof
Thus, the limiting probability that the system is on is simply the ratio of the mean of an on period to the mean of an on-off period. It follows, of course, that

P(I_t = 0) = 1 − p(t) → ν/(μ + ν) as t → ∞ (15.5.5)
15.5.1 https://stats.libretexts.org/@go/page/10281
so in particular, the fact that the system starts in the on state makes no difference in the limit. We will return to the asymptotic
behavior of the alternating renewal process in the next section on renewal reward processes.
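The limit can be illustrated on a single long sample path. In this sketch (our own example, under assumed distributions), on periods are Exponential with mean μ = 1 and off periods are Uniform(0, 1) with mean ν = 1/2, so the long-run fraction of time on should approach μ/(μ + ν) = 2/3.

```python
import random

random.seed(7)

time_on = total = 0.0
while total < 200000.0:
    on = random.expovariate(1.0)    # on period, mean mu = 1
    off = random.uniform(0.0, 1.0)  # off period, mean nu = 1/2
    time_on += on
    total += on + off

print(time_on / total)  # approaches mu / (mu + nu) = 2/3
```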
Age Processes
The last remark applies in particular to the age processes of a standard renewal process. So, suppose that we have a renewal process with interarrival sequence X, arrival sequence T, and counting process N. As usual, let μ denote the mean and F the probability distribution function of an interarrival time, and let F^c = 1 − F denote the right distribution function (or reliability function).
For t ∈ [0, ∞), recall that the current life, remaining life, and total life at time t are

C_t = t − T_{N_t}, R_t = T_{N_t+1} − t, L_t = C_t + R_t = T_{N_t+1} − T_{N_t} = X_{N_t+1} (15.5.6)
respectively. In the usual terminology of reliability, C_t is the age of the device in service at time t, R_t is the time remaining until this device fails, and L_t is the total life of the device. We will use the limit theorem above to derive the limiting distributions of these age processes. The limiting distributions were obtained earlier, in the section on renewal limit theorems, by a direct application of the key renewal theorem. So the results are not new, but the method of proof is interesting.
Proof
As we have noted before, the fact that the limiting distributions are the same is not surprising after a little thought. After a long
time, the renewal process looks the same forward and backward in time, and reversing the arrow of time reverses the roles of
current and remaining time.
Proof
This page titled 15.5: Alternating Renewal Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
15.6: Renewal Reward Processes
Basic Theory
Preliminaries
In a renewal reward process, each interarrival time is associated with a random variable that is generically thought of as the reward associated with that interarrival time. Our interest is in the process that gives the total reward up to time t. So let's set up the usual notation. Suppose that X = (X_1, X_2, …) are the interarrival times of a renewal process, so that X is a sequence of independent, identically distributed, nonnegative variables with common distribution function F and mean μ. As usual, we assume that F(0) < 1 so that the interarrival times are not deterministically 0, and in this section we also assume that μ < ∞. Let

T_n = ∑_{i=1}^{n} X_i, n ∈ N (15.6.1)
denote the arrival times, and let N_t = ∑_{n=1}^{∞} 1(T_n ≤ t) for t ∈ [0, ∞) denote the corresponding counting process.
Suppose now that Y = (Y_1, Y_2, …) is a sequence of real-valued random variables, where Y_n is thought of as the reward associated with the interarrival time X_n. However, the term reward should be interpreted generically, since Y_n might actually be a cost or some other value associated with the interarrival time, and in any event may take negative as well as positive values. Our basic assumption is that the interarrival time and reward pairs Z = ((X_1, Y_1), (X_2, Y_2), …) form an independent and identically distributed sequence. Recall that this implies that X is an IID sequence, as required by the definition of the renewal process, and that Y is also an IID sequence. But X and Y might well be dependent, and in fact Y_n might be a function of X_n for n ∈ N_+. The process R = {R_t : t ∈ [0, ∞)} defined by

R_t = ∑_{i=1}^{N_t} Y_i, t ∈ [0, ∞) (15.6.3)

is the reward renewal process associated with Z. The function r given by r(t) = E(R_t) for t ∈ [0, ∞) is the reward function.
As promised, R_t is the total reward up to time t ∈ [0, ∞). Here are some typical examples:
The arrivals are customers at a store. Each customer spends a random amount of money.
The arrivals are visits to a website. Each visitor spends a random amount of time at the site.
The arrivals are failure times of a complex system. Each failure requires a random repair time.
The arrivals are earthquakes at a particular location. Each earthquake has a random severity, a measure of the energy released.
So R_t is a random sum of random variables for each t ∈ [0, ∞). In the special case that Y and X are independent, the distribution of R_t is known as a compound distribution, based on the distribution of N_t and the distribution of a generic reward Y. Specializing further, if the renewal process is Poisson and is independent of Y, the process R is a compound Poisson process.
Note that a renewal reward process generalizes an ordinary renewal process. Specifically, if Y_n = 1 for each n ∈ N_+, then R_t = N_t for t ∈ [0, ∞), so that the reward process simply reduces to the counting process, and then r reduces to the renewal function M.
15.6.1 https://stats.libretexts.org/@go/page/10282
The Renewal Reward Theorem
Suppose that ν = E(Y), the mean of a generic reward, exists in R. Then

1. R_t/t → ν/μ as t → ∞ with probability 1.
2. r(t)/t → ν/μ as t → ∞
Proof
Part (a) generalizes the law of large numbers and part (b) generalizes the elementary renewal theorem, for a basic renewal process. Once again, if Y_n = 1 for each n, then (a) becomes N_t/t → 1/μ as t → ∞ and (b) becomes M(t)/t → 1/μ as t → ∞. It's not surprising, then, that these two theorems play a fundamental role in the proof of the renewal reward theorem.
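Part (a) is easy to check on a single long path. In the sketch below (our own example), interarrival times are Uniform(0, 2), so μ = 1, and the reward is Y = X², so ν = E(X²) = 4/3. The rewards here are a deterministic function of the interarrival times, a dependence that the theorem explicitly allows.

```python
import random

random.seed(3)

t = reward = 0.0
while t < 100000.0:
    x = random.uniform(0.0, 2.0)  # interarrival time, mu = E(X) = 1
    t += x
    reward += x * x               # reward Y = X^2, nu = E(X^2) = 4/3

print(reward / t)  # R_t / t approaches nu / mu = 4/3
```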
The reward process R defined above is constant on [T_n, T_{n+1}) for each n ∈ N. Effectively, the rewards are received discretely: Y_1 at time T_1, an additional Y_2 at time T_2, and so forth. It's possible to modify the construction so the rewards accrue continuously in time or in a mixed discrete/continuous manner. Here is a simple set of conditions for a general reward process.
Suppose again that Z = ((X_1, Y_1), (X_2, Y_2), …) is the sequence of interarrival times and rewards. A stochastic process V = {V_t : t ∈ [0, ∞)} (on our underlying probability space) is a reward process associated with Z if the following conditions hold:

1. V_{T_n} = ∑_{i=1}^{n} Y_i for n ∈ N
2. V_t is between V_{T_n} and V_{T_{n+1}} for t ∈ (T_n, T_{n+1}) and n ∈ N
In the continuous case, with nonnegative rewards (the most important case), the reward process will typically have the following
form:
Suppose that the rewards are nonnegative and that U = {U_t : t ∈ [0, ∞)} is a nonnegative stochastic process (on our underlying probability space) with

1. t ↦ U_t piecewise continuous
2. ∫_{T_n}^{T_{n+1}} U_t dt = Y_{n+1} for n ∈ N

Let V_t = ∫_0^t U_s ds for t ∈ [0, ∞). Then V = {V_t : t ∈ [0, ∞)} is a reward process associated with Z.
Proof
Thus in this special case, the rewards are being accrued continuously and U_t is the rate at which the reward is being accrued at time t. So U plays the role of a reward density process. For a general reward process, the basic renewal reward theorem still holds.
1. V_t/t → ν/μ as t → ∞ with probability 1
2. v(t)/t → ν/μ as t → ∞, where v(t) = E(V_t) for t ∈ [0, ∞)
Proof
Suppose that the rewards are positive, and consider the continuous reward process with density process U = {U_t : t ∈ [0, ∞)} as above. Let u(t) = E(U_t) for t ∈ [0, ∞). Then

1. (1/t) ∫_0^t U_s ds → ν/μ as t → ∞ with probability 1
2. (1/t) ∫_0^t u(s) ds → ν/μ as t → ∞
Alternating Renewal Processes
Recall that in an alternating renewal process, a system alternates between on and off states (starting in the on state). If we let U = (U_1, U_2, …) be the lengths of the successive time periods in which the system is on, and V = (V_1, V_2, …) the lengths of the successive time periods in which the system is off, then the basic assumptions are that ((U_1, V_1), (U_2, V_2), …) is an independent, identically distributed sequence, and that the variables X_n = U_n + V_n for n ∈ N_+ form the interarrival times of a standard renewal process. Let μ = E(U) denote the mean of a time period that the device is on, and ν = E(V) the mean of a time period that the device is off. Recall that I_t denotes the state (1 or 0) of the system at time t ∈ [0, ∞), so that I = {I_t : t ∈ [0, ∞)} is the state process. The state probability function p is given by p(t) = P(I_t = 1) for t ∈ [0, ∞).
1. (1/t) ∫_0^t I_s ds → μ/(μ + ν) as t → ∞ with probability 1
2. (1/t) ∫_0^t p(s) ds → μ/(μ + ν) as t → ∞
Proof
Thus, the asymptotic average time that the device is on, and the asymptotic mean average time that the device is on, are both
simply the ratio of the mean of an on period to the mean of an on-off period. In our previous study of alternating renewal processes,
the fundamental result was that in the non-arithmetic case, p(t) → μ/(μ + ν ) as t → ∞ . This result implies part (b) in the
theorem above.
Age Processes
Renewal reward processes can be used to derive some asymptotic results for the age processes of a standard renewal process. So, suppose that we have a renewal process with interarrival sequence X, arrival sequence T, and counting process N. As usual, let μ = E(X) denote the mean of an interarrival time, but now we will also need the second moment ν = E(X^2), which we assume is finite. For t ∈ [0, ∞), the current life, remaining life, and total life at time t are

A_t = t − T_{N_t}, B_t = T_{N_t+1} − t, L_t = A_t + B_t = X_{N_t+1}

respectively. In the usual terminology of reliability, A_t is the age of the device in service at time t, B_t is the time remaining until this device fails, and L_t is the total life of the device. (To avoid notational clashes, we are using different notation than in past sections.) Let a(t) = E(A_t), b(t) = E(B_t), and l(t) = E(L_t) for t ∈ [0, ∞), the corresponding mean functions. To derive our asymptotic results, we simply use the current life and the remaining life as reward densities (or rates) in a renewal reward process.
1. (1/t) ∫_0^t A_s ds → ν/(2μ) as t → ∞ with probability 1
2. (1/t) ∫_0^t a(s) ds → ν/(2μ) as t → ∞
Proof
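Part (a) has a concrete interpretation that is easy to simulate: over a renewal cycle of length X, the current life sweeps linearly from 0 up to X, so the integral of A_s over the cycle is X²/2. The sketch below (our own) uses Exponential(1) interarrivals, so μ = 1, ν = E(X²) = 2, and the limit is ν/(2μ) = 1.

```python
import random

random.seed(11)

t = area = 0.0
while t < 200000.0:
    x = random.expovariate(1.0)  # interarrival time: mu = 1, nu = E(X^2) = 2
    t += x
    area += x * x / 2.0          # integral of the current life over this cycle

print(area / t)  # approaches nu / (2 mu) = 1
```

Note that the limit is E(X²)/(2 E(X)), not E(X)/2: long interarrival intervals are overrepresented in the time average, the familiar inspection paradox.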
1. (1/t) ∫_0^t B_s ds → ν/(2μ) as t → ∞ with probability 1
2. (1/t) ∫_0^t b(s) ds → ν/(2μ) as t → ∞
Proof
With a little thought, it's not surprising that the limits for the current life and remaining life processes are the same. After a long
period of time, a renewal process looks stochastically the same forward or backward in time. Changing the “arrow of time”
reverses the role of the current and remaining life. Asymptotic results for the total life process now follow trivially from the results
for the current and remaining life processes.
1. (1/t) ∫_0^t L_s ds → ν/μ as t → ∞ with probability 1
2. (1/t) ∫_0^t l(s) ds → ν/μ as t → ∞
Replacement Models
Consider again a standard renewal process as defined in the Introduction, with interarrival sequence X = (X_1, X_2, …), arrival sequence T = (T_0, T_1, …), and counting process N = {N_t : t ∈ [0, ∞)}. One of the most basic applications is to reliability, where a device operates for a random lifetime, fails, and then is replaced by a new device, and the process continues. In this model, X_n is the lifetime and T_n the failure time of the nth device in service, for n ∈ N_+, while N_t is the number of failures in [0, t] for t ∈ [0, ∞). As usual, F denotes the distribution function of a generic lifetime X, and F^c = 1 − F the corresponding right distribution function (reliability function). Sometimes, the device is actually a system with a number of critical components; the failure of any of the critical components causes the system to fail.
Replacement models are variations on the basic model in which the device is replaced (or the critical components replaced) at times other than failure. Often the cost a of a planned replacement is less than the cost b of an emergency replacement (at failure), so replacement models can make economic sense. We will consider the most common model.
In the age replacement model, the device is replaced either when it fails or when it reaches a specified age s ∈ (0, ∞). This model gives rise to a new renewal process with interarrival sequence U = (U_1, U_2, …) where U_n = min{X_n, s} for n ∈ N_+. If a, b ∈ (0, ∞) are the costs of planned and unplanned replacements, respectively, then the cost associated with the renewal period U_n is

Y_n = a 1(U_n = s) + b 1(U_n < s) = a 1(X_n ≥ s) + b 1(X_n < s) (15.6.18)
Clearly ((U_1, Y_1), (U_2, Y_2), …) satisfies the assumptions of a renewal reward process given above. The model makes mathematical sense for any a, b ∈ (0, ∞), but if a ≥ b, so that the cost of a planned replacement is at least as large as the cost of an unplanned replacement, then Y_n ≥ b for n ∈ N_+, so the model makes no financial sense. Thus we assume that a < b.
Proof
So naturally, given the costs a and b, and the lifetime distribution function F, the goal is to find the value of s that minimizes C(s); this value of s is the optimal replacement time. Of course, the optimal time may not exist.
Properties of C
1. C(s) → ∞ as s ↓ 0
2. C(s) → b/μ as s ↑ ∞
Proof
As s → ∞ , the age replacement model becomes the standard (unplanned) model with limiting expected average cost b/μ.
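A standard consequence of the renewal reward theorem is the explicit cost rate C(s) = [a F^c(s) + b F(s)] / ∫_0^s F^c(x) dx (expected cost per cycle over expected cycle length E[min(X, s)]). The sketch below (our own, with hypothetical costs a = 1, b = 5) evaluates it for a gamma(2, 1) lifetime, for which F^c(s) = e^{−s}(1 + s), ∫_0^s F^c(x) dx = 2 − e^{−s}(2 + s), and μ = 2, and illustrates the two limiting properties above.

```python
import math

def cost_rate(s, a, b):
    """Long-run expected cost per unit time under age replacement at age s,
    for a gamma(2, 1) lifetime."""
    fc = math.exp(-s) * (1.0 + s)            # reliability function F^c(s)
    cycle = 2.0 - math.exp(-s) * (2.0 + s)   # expected cycle length E[min(X, s)]
    return (a * fc + b * (1.0 - fc)) / cycle

a, b = 1.0, 5.0
print(cost_rate(0.01, a, b))  # blows up as s -> 0: constant planned cost, tiny cycles
print(cost_rate(50.0, a, b))  # approaches b / mu = 5/2
```

Plotting `cost_rate` over a grid of s values, or minimizing it numerically, is then a one-liner; we leave the optimization itself to the exercises below.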
Suppose that the lifetime of the device (in appropriate units) has the standard exponential distribution. Find C (s) and solve the
optimal age replacement problem.
Answer
The last result is hardly surprising. A device with an exponentially distributed lifetime does not age—if it has not failed, it's just as
good as new. More generally, age replacement does not make sense for any device with decreasing failure rate. Such devices
improve with age.
Suppose that the lifetime of the device (in appropriate units) has the gamma distribution with shape parameter 2 and scale
parameter 1. Suppose that the costs (in appropriate units) are a = 1 and b = 5 .
1. Find C (s) .
2. Sketch the graph of C (s) .
3. Solve numerically the optimal age replacement problem.
Answer
Suppose again that the lifetime of the device (in appropriate units) has the gamma distribution with shape parameter 2 and
scale parameter 1. But suppose now that the costs (in appropriate units) are a = 1 and b = 2 .
1. Find C (s) .
2. Sketch the graph of C (s) .
3. Solve the optimal age replacement problem.
Answer
In the last case, the difference between the cost of an emergency replacement and a planned replacement is not great enough for age
replacement to make sense.
Suppose that the lifetime of the device (in appropriately scaled units) is uniformly distributed on the interval [0, 1]. Find C (s)
and solve the optimal replacement problem. Give the results explicitly for the following costs:
1. a = 4 , b = 6
2. a = 2 , b = 5
3. a = 1 , b = 10
Proof
Thinning
We start with a standard renewal process with interarrival sequence X = (X_1, X_2, …), arrival sequence T = (T_0, T_1, …), and counting process N = {N_t : t ∈ [0, ∞)}. As usual, let μ = E(X) denote the mean of an interarrival time. For n ∈ N_+, suppose now that arrival n is either accepted or rejected, and define the random variable Y_n to be 1 in the first case and 0 in the second. Suppose that ((X_1, Y_1), (X_2, Y_2), …) satisfies the assumptions of a renewal reward process given above, so that in particular, Y = (Y_1, Y_2, …) is a sequence of Bernoulli trials. Let p denote the parameter of this sequence, so that p is the probability of accepting an arrival. The procedure of accepting or rejecting points in a point process is known as thinning the point process. We studied thinning of the Poisson process. In the notation of this section, note that the reward process R = {R_t : t ∈ [0, ∞)} is the thinned counting process. That is,

R_t = ∑_{i=1}^{N_t} Y_i (15.6.25)

is the number of accepted points in [0, t] for t ∈ [0, ∞). So then r(t) = E(R_t) is the expected number of accepted points in [0, t].
The renewal reward theorem gives the asymptotic behavior.
1. R_t/t → p/μ as t → ∞ with probability 1
2. r(t)/t → p/μ as t → ∞
Proof
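A quick simulation (our own illustration) of the result: arrivals come from a renewal process with Exponential(2) interarrivals, so μ = 1/2, and each arrival is accepted independently with probability p = 0.3. The rate of accepted points should approach p/μ = 0.6.

```python
import random

random.seed(5)

p, horizon = 0.3, 50000.0
t, accepted = 0.0, 0
while True:
    t += random.expovariate(2.0)  # interarrival time, mu = 1/2
    if t > horizon:
        break
    if random.random() < p:       # accept the arrival with probability p
        accepted += 1

print(accepted / horizon)  # approaches p / mu = 0.6
```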
This page titled 15.6: Renewal Reward Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
CHAPTER OVERVIEW
16: Markov Processes
A Markov process is a random process in which the future is independent of the past, given the present. Thus, Markov processes
are the natural stochastic analogs of the deterministic processes described by differential and difference equations. They form one
of the most important classes of random processes.
16.1: Introduction to Markov Processes
16.2: Potentials and Generators for General Markov Processes
16.3: Introduction to Discrete-Time Chains
16.4: Transience and Recurrence for Discrete-Time Chains
16.5: Periodicity of Discrete-Time Chains
16.6: Stationary and Limiting Distributions of Discrete-Time Chains
16.7: Time Reversal in Discrete-Time Chains
16.8: The Ehrenfest Chains
16.9: The Bernoulli-Laplace Chain
16.10: Discrete-Time Reliability Chains
16.11: Discrete-Time Branching Chain
16.12: Discrete-Time Queuing Chains
16.13: Discrete-Time Birth-Death Chains
16.14: Random Walks on Graphs
16.15: Introduction to Continuous-Time Markov Chains
16.16: Transition Matrices and Generators of Continuous-Time Chains
16.17: Potential Matrices
16.18: Stationary and Limiting Distributions of Continuous-Time Chains
16.19: Time Reversal in Continuous-Time Chains
16.20: Chains Subordinate to the Poisson Process
16.21: Continuous-Time Birth-Death Chains
16.22: Continuous-Time Queuing Chains
16.23: Continuous-Time Branching Chains
This page titled 16: Markov Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.1: Introduction to Markov Processes
A Markov process is a random process indexed by time, and with the property that the future is independent of the past, given the
present. Markov processes, named for Andrei Markov, are among the most important of all random processes. In a sense, they are
the stochastic analogs of differential equations and recurrence relations, which are of course, among the most important
deterministic processes.
The complexity of the theory of Markov processes depends greatly on whether the time space T is N (discrete time) or [0, ∞)
(continuous time) and whether the state space is discrete (countable, with all subsets measurable) or a more general topological
space. When T = [0, ∞) or when the state space is a general space, continuity assumptions usually need to be imposed in order to
rule out various types of weird behavior that would otherwise complicate the theory.
When the state space is discrete, Markov processes are known as Markov chains. The general theory of Markov chains is
mathematically rich and relatively simple.
When T = N and the state space is discrete, Markov processes are known as discrete-time Markov chains. The theory of such
processes is mathematically elegant and complete, and is understandable with minimal reliance on measure theory. Indeed, the
main tools are basic probability and linear algebra. Discrete-time Markov chains are studied in this chapter, along with a
number of special models.
When T = [0, ∞) and the state space is discrete, Markov processes are known as continuous-time Markov chains. If we avoid
a few technical difficulties (created, as always, by the continuous time space), the theory of these processes is also reasonably
simple and mathematically very nice. The Markov property implies that the process, sampled at the random times when the
state changes, forms an embedded discrete-time Markov chain, so we can apply the theory that we will have already learned.
The Markov property also implies that the holding time in a state has the memoryless property and thus must have an
exponential distribution, a distribution that we know well. In terms of what you may have already studied, the Poisson process
is a simple example of a continuous-time Markov chain.
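The memoryless property mentioned above, P(X > s + t ∣ X > s) = P(X > t), is easy to verify directly for the exponential distribution, since P(X > x) = e^{−rx}. A minimal numerical check (with arbitrarily chosen values of r, s, and t):

```python
import math

def tail(x, rate):
    """P(X > x) for an Exponential(rate) random variable."""
    return math.exp(-rate * x)

rate, s, t = 2.0, 1.3, 0.7
conditional = tail(s + t, rate) / tail(s, rate)  # P(X > s+t | X > s)
print(conditional, tail(t, rate))  # the two agree: memorylessness
```

Among continuous distributions the exponential is the only one with this property, which is why the holding times of a continuous-time Markov chain must be exponentially distributed.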
For a general state space, the theory is more complicated and technical, as noted above. However, we can distinguish a couple of
classes of Markov processes, depending again on whether the time space is discrete or continuous.
When T = N and S = R , a simple example of a Markov process is the partial sum process associated with a sequence of
independent, identically distributed real-valued random variables. Such sequences are studied in the chapter on random samples
(but not as Markov processes), and revisited below.
In the case that T = [0, ∞) and S = R, or more generally S = R^k, the most important Markov processes are the diffusion processes. Generally, such processes can be constructed via stochastic differential equations from Brownian motion, which thus serves as the quintessential example of a Markov process in continuous time and space.
The goal of this section is to give a broad sketch of the general theory of Markov processes. Some of the statements are not
completely rigorous and some of the proofs are omitted or are sketches, because we want to emphasize the main ideas without
getting bogged down in technicalities. If you are a new student of probability you may want to just browse this section, to get the
basic ideas and notation, but skipping over the proofs and technical details. Then jump ahead to the study of discrete-time Markov
chains. On the other hand, to understand this section in more depth, you will need to review topics in the chapter on foundations
and in the chapter on stochastic processes.
Basic Theory
Preliminaries
As usual, our starting point is a probability space (Ω, F , P), so that Ω is the set of outcomes, F the σ-algebra of events, and P the
probability measure on (Ω, F ). The time set T is either N (discrete time) or [0, ∞) (continuous time). In the first case, T is given
the discrete topology and in the second case T is given the usual Euclidean topology. In both cases, T is given the Borel σ-algebra
T , the σ-algebra generated by the open sets. In the discrete case when T = N , this is simply the power set of T so that every
subset of T is measurable; every function from T to another measurable space is measurable; and every function from T to another
topological space is continuous. The time space (T , T ) has a natural measure; counting measure # in the discrete case, and
Lebesgue in the continuous case.
16.1.1 https://stats.libretexts.org/@go/page/10288
The set of states S also has a σ-algebra S of admissible subsets, so that (S, S ) is the state space. Usually S has a topology and
S is the Borel σ-algebra generated by the open sets. A typical set of assumptions is that the topology on S is LCCB: locally
compact, Hausdorff, and with a countable base. These particular assumptions are general enough to capture all of the most
important processes that occur in applications and yet are restrictive enough for a nice mathematical theory. Usually, there is a
natural positive measure λ on the state space (S, S ). When S has an LCCB topology and S is the Borel σ-algebra, the measure λ
wil usually be a Borel measure satisfying λ(C ) < ∞ if C ⊆ S is compact. The term discrete state space means that S is countable
with S = P(S) , the collection of all subsets of S . Thus every subset of S is measurable, as is every function from S to another
measurable space. This is the Borel σ-algebra for the discrete topology on S , so that every function from S to another topological
space is continuous. The compact sets are simply the finite sets, and the reference measure is #, counting measure. If S = R for k
some k ∈ S (another common case), then we usually give S the Euclidean topology (which is LCCB) so that S is the usual Borel
σ-algebra. The compact sets are the closed, bounded sets, and the reference measure λ is k -dimensional Lebesgue measure.
Clearly, the topological and measure structures on T are not really necessary when T = N , and similarly these structures on S are
not necessary when S is countable. But the main point is that the assumptions unify the discrete and the common continuous cases.
Also, it should be noted that much more general state spaces (and more general time spaces) are possible, but most of the important
Markov processes that occur in applications fit the setting we have described here.
Various spaces of real-valued functions on S play an important role. Let B denote the collection of bounded, measurable functions f : S → R. With the usual (pointwise) addition and scalar multiplication, B is a vector space. We give B the supremum norm, defined by ∥f∥ = sup{|f(x)| : x ∈ S}.
Suppose now that X = {X_t : t ∈ T} is a stochastic process with state space (S, S), so that X_t is a random variable taking values in S for each t ∈ T, and we think of X_t ∈ S as the state of a system at time t ∈ T. We also assume that we have a collection F = {F_t : t ∈ T} of σ-algebras with the properties that X_t is measurable with respect to F_t for t ∈ T, and that F_s ⊆ F_t ⊆ F for s, t ∈ T with s ≤ t. Intuitively, F_t is the collection of events up to time t ∈ T. Technically, the assumptions mean that F is a filtration and that the process X is adapted to F. The most basic (and coarsest) filtration is the natural filtration F^0 = {F_t^0 : t ∈ T} where F_t^0 = σ{X_s : s ∈ T, s ≤ t}, the σ-algebra generated by the process up to time
t ∈ T. In continuous time, however, it is often necessary to use slightly finer σ-algebras in order to have a nice mathematical theory. In particular, we often need to assume that the filtration F is right continuous in the sense that F_{t+} = F_t for t ∈ T where F_{t+} = ⋂{F_s : s ∈ T, s > t}. We can accomplish this by taking F = F^0_+ so that F_t = F^0_{t+} for t ∈ T, and in this case, F is referred to as the right continuous refinement of the natural filtration. We also sometimes need to assume that F is complete with respect to P in the sense that if A ∈ F with P(A) = 0 and B ⊆ A then B ∈ F_0. That is, F_0 contains all of the null events (and hence also all of the almost certain events), and therefore so does F_t for all t ∈ T.
Definitions
The random process X is a Markov process if
P(X_{s+t} ∈ A ∣ F_s) = P(X_{s+t} ∈ A ∣ X_s) for all s, t ∈ T and A ∈ S (16.1.1)
The defining condition, known appropriately enough as the Markov property, states that the conditional distribution of X_{s+t} given F_s is the same as the conditional distribution of X_{s+t} given just X_s. Think of s as the present time, so that s + t is a time in the future. If we know the present state X_s, then any additional knowledge of events in the past is irrelevant in terms of predicting the future state X_{s+t}. Technically, the conditional probabilities in the definition are random variables, and the equality must be
interpreted as holding with probability 1. As you may recall, conditional expected value is a more general and useful concept than
conditional probability, so the following theorem may come as no surprise.
Technically, we should say that X is a Markov process relative to the filtration F. If X satisfies the Markov property relative to a
filtration, then it satisfies the Markov property relative to any coarser filtration.
Suppose that the stochastic process X = {X_t : t ∈ T} is adapted to the filtration F = {F_t : t ∈ T} and that G = {G_t : t ∈ T} is a filtration that is finer than F. If X is a Markov process relative to G then X is a Markov process relative to F.
Proof
In particular, if X is a Markov process, then X satisfies the Markov property relative to the natural filtration F^0. The theory of
Markov processes is simplified considerably if we add an additional assumption.
So if X is homogeneous (we usually don't bother with the time adjective), then the process {X_{s+t} : t ∈ T} given X_s = x is equivalent (in distribution) to the process {X_t : t ∈ T} given X_0 = x. For this reason, the initial distribution is often unspecified
in the study of Markov processes—if the process is in state x ∈ S at a particular time s ∈ T , then it doesn't really matter how the
process got to state x; the process essentially “starts over”, independently of the past. The term stationary is sometimes used
instead of homogeneous.
From now on, we will usually assume that our Markov processes are homogeneous. This is not as big of a loss of generality as you
might think. A non-homogeneous process can be turned into a homogeneous process by enlarging the state space, as shown below.
For a homogeneous Markov process, if s, t ∈ T , x ∈ S , and f ∈ B , then
E[f (Xs+t ) ∣ Xs = x] = E[f (Xt ) ∣ X0 = x] (16.1.5)
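This identity is easy to check by simulation. Below is a minimal sketch, assuming a hypothetical two-state chain (the matrix P is invented purely for illustration, and f is an indicator function): the Monte Carlo estimate of P(X_{s+3} = 1 ∣ X_s = 0) is the same whether s = 0 or s = 5, and both agree with the exact value (P^3)[0, 1].

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical two-state chain (states 0 and 1); rows of P sum to 1.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

def estimate(s, t, x, runs=20000):
    """Monte Carlo estimate of P(X_{s+t} = 1 | X_s = x).

    By homogeneity the answer cannot depend on s: conditioning on
    X_s = x, we simply run the chain forward t more steps.
    """
    hits = 0
    for _ in range(runs):
        y = x                          # condition on X_s = x
        for _ in range(t):
            y = rng.choice(2, p=P[y])  # one transition of the chain
        hits += (y == 1)
    return hits / runs

# Estimates for s = 0 and s = 5 agree with each other and with the
# exact value (P^3)[0, 1] = 0.175.
print(estimate(0, 3, 0), estimate(5, 3, 0))
print(np.linalg.matrix_power(P, 3)[0, 1])
```

The estimates agree because, once we condition on the state at time s, the clock effectively restarts; this is exactly the "starts over" language used above.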
Feller Processes
In continuous time, or with general state spaces, Markov processes can be very strange without additional continuity assumptions.
Suppose (as is usually the case) that S has an LCCB topology and that S is the Borel σ-algebra. Let C denote the collection of bounded, continuous functions f : S → R. Let C_0 denote the collection of continuous functions f : S → R that vanish at ∞. The last phrase means that for every ϵ > 0, there exists a compact set C ⊆ S such that |f(x)| < ϵ if x ∉ C. With the usual (pointwise) operations of addition and scalar multiplication, C_0 is a vector subspace of C, which in turn is a vector subspace of B. Just as with B, the supremum norm is used on C and C_0.
The Markov process X is a Feller process if the following conditions hold:
1. Continuity in space: for t ∈ T and y ∈ S, the distribution of X_t given X_0 = x converges to the distribution of X_t given X_0 = y as x → y.
2. Continuity in time: given X_0 = x ∈ S, X_t converges in probability to x as t ↓ 0.
Additional details
Feller processes are named for William Feller. Note that if S is discrete, (a) is automatically satisfied and if T is discrete, (b) is
automatically satisfied. In particular, every discrete-time Markov chain is a Feller Markov process. There are certainly more
general Markov processes, but most of the important processes that occur in applications are Feller processes, and a number of nice
properties flow from the assumptions. Here is the first:
Again, this result is only interesting in continuous time T = [0, ∞). Recall that for ω ∈ Ω, the function t ↦ X_t(ω) is a sample path of the process. So we will often assume that a Feller Markov process has sample paths that are right continuous and have left limits, since we know there is a version with these properties.
Stopping Times and the Strong Markov Property
For our next discussion, you may need to review again the section on filtrations and stopping times. To give a quick review, suppose again that we start with our probability space (Ω, F, P) and the filtration F = {F_t : t ∈ T} (so that we have a filtered probability space).
Since time (past, present, future) plays such a fundamental role in Markov processes, it should come as no surprise that random
times are important. We often need to allow random times to take the value ∞, so we need to enlarge the set of times to
T_∞ = T ∪ {∞}. The topology on T is extended to T_∞ by the rule that for s ∈ T, the set {t ∈ T_∞ : t > s} is an open neighborhood of ∞. This is the one-point compactification of T and is used so that the notion of time converging to infinity is preserved. The Borel σ-algebra T_∞ is used on T_∞, which again is just the power set in the discrete case.
If X = {X_t : t ∈ T} is a stochastic process on the sample space (Ω, F), and if τ is a random time, then naturally we want to consider the state X_τ at the random time. There are two problems. First, if τ takes the value ∞, X_τ is not defined. The usual solution is to add a new “death state” δ to the set of states S, and then to give S_δ = S ∪ {δ} the σ-algebra S_δ = S ∪ {A ∪ {δ} : A ∈ S}. A function f ∈ B is extended to S_δ by the rule f(δ) = 0. The second problem is that X_τ may not be a valid random variable (that is, measurable) unless we assume that the stochastic process X is measurable. Recall that this means that X : Ω × T → S is measurable relative to F ⊗ T and S. (This is always true in discrete time.)
Recall next that a random time τ is a stopping time (also called a Markov time or an optional time) relative to F if {τ ≤ t} ∈ F_t for each t ∈ T. Intuitively, we can tell whether or not τ ≤ t from the information available to us at time t. In a sense, a stopping time is a random time that does not require that we see into the future. Of course, the concept depends critically on the filtration. Recall that if a random time τ is a stopping time for a filtration F = {F_t : t ∈ T} then it is also a stopping time for a finer filtration G = {G_t : t ∈ T}, so that F_t ⊆ G_t for t ∈ T. Thus, the finer the filtration, the larger the collection of stopping times. In fact, if the filtration is the trivial one where F_t = F for all t ∈ T (so that all information is available to us from the beginning of time), then any random time is a stopping time. But of course, this trivial filtration is usually not sensible.
Next, recall that if τ is a stopping time for the filtration F, then the σ-algebra F_τ associated with τ is given by
F_τ = {A ∈ F : A ∩ {τ ≤ t} ∈ F_t for all t ∈ T}
Intuitively, F_τ is the collection of events up to the random time τ, analogous to F_t which is the collection of events up to the deterministic time t ∈ T. If X = {X_t : t ∈ T} is a stochastic process adapted to F and if τ is a stopping time relative to F, then we would hope that X_τ is measurable with respect to F_τ just as X_t is measurable with respect to F_t for deterministic t ∈ T. However, this will generally not be the case unless X is progressively measurable relative to F, which means that for each t ∈ T, the restriction X : Ω × T_t → S is measurable with respect to F_t ⊗ T_t and S, where T_t = {s ∈ T : s ≤ t} and T_t is the corresponding Borel σ-algebra. This is always true in discrete time, of course, and more generally if S has an LCCB topology with S the Borel σ-algebra, and X is right continuous. If X is progressively measurable with respect to F then X is measurable and X is adapted to F.
The strong Markov property for our stochastic process X = { Xt : t ∈ T } states that the future is independent of the past, given
the present, when the present time is a stopping time.
As with the regular Markov property, the strong Markov property depends on the underlying filtration F. If the property holds with
respect to a given filtration, then it holds with respect to a coarser filtration.
Suppose that the stochastic process X = {X_t : t ∈ T} is progressively measurable relative to the filtration F = {F_t : t ∈ T} and that the filtration G = {G_t : t ∈ T} is finer than F. If X is a strong Markov process relative to G then X is a strong Markov process relative to F.
So if X is a strong Markov process, then X satisfies the strong Markov property relative to its natural filtration. Again there is a
tradeoff: finer filtrations allow more stopping times (generally a good thing), but make the strong Markov property harder to satisfy
and may not be reasonable (not so good). So we usually don't want filtrations that are too much finer than the natural one.
With the strong Markov and homogeneous properties, the process {X_{τ+t} : t ∈ T} given X_τ = x is equivalent in distribution to the process {X_t : t ∈ T} given X_0 = x. Clearly, the strong Markov property implies the ordinary Markov property, since a fixed time t ∈ T is also a stopping time.
Suppose that X = {X_n : n ∈ N} is a (homogeneous) Markov process in discrete time. Then X is a strong Markov process.
As always in continuous time, the situation is more complicated and depends on the continuity of the process X and the filtration
F . Here is the standard result for Feller processes.
For t ∈ T , let
P_t(x, A) = P(X_t ∈ A ∣ X_0 = x), x ∈ S, A ∈ S (16.1.9)
Then P_t is a probability kernel on (S, S), known as the transition kernel of X for time t.
Proof
Note that P_0 = I, the identity kernel on (S, S) defined by I(x, A) = 1(x ∈ A) for x ∈ S and A ∈ S, so that I(x, A) = 1 if x ∈ A and I(x, A) = 0 if x ∉ A. Recall also that usually there is a natural reference measure λ on (S, S). In this case, the transition kernel P_t will often have a transition density p_t with respect to λ for t ∈ T. That is,
P_t(x, A) = ∫_A p_t(x, y) λ(dy), x ∈ S, A ∈ S
The next theorem gives the Chapman-Kolmogorov equation, named for Sydney Chapman and Andrei Kolmogorov, the
fundamental relationship between the probability kernels, and the reason for the name transition kernel.
Proof
In the language of functional analysis, P is a semigroup. Recall that the commutative property generally does not hold for the product operation on kernels. However the property does hold for the transition kernels of a homogeneous Markov process. That is, P_s P_t = P_t P_s = P_{s+t} for s, t ∈ T. As a simple corollary, if S has a reference measure, the same basic relationship holds for the transition densities.
Suppose that λ is the reference measure on (S, S) and that X = {X_t : t ∈ T} is a Markov process on S with transition densities {p_t : t ∈ T}. If s, t ∈ T then p_s p_t = p_{s+t}. That is,
p_{s+t}(x, z) = ∫_S p_s(x, y) p_t(y, z) λ(dy), x, z ∈ S
Proof
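The density form of the Chapman-Kolmogorov equation can be checked numerically. The sketch below assumes the Gaussian kernel p_t(x, y) with mean x and variance t (the Brownian transition density that appears at the end of this section) and approximates the integral ∫ p_s(x, y) p_t(y, z) λ(dy) on a fine grid:

```python
import numpy as np

def p(t, x, y):
    """Gaussian transition density with mean x and variance t."""
    return np.exp(-(y - x) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

s, t, x, z = 0.5, 1.5, 0.0, 1.0
y = np.linspace(-20.0, 20.0, 40001)   # fine grid for the integral
dy = y[1] - y[0]

# Riemann-sum approximation of the Chapman-Kolmogorov integral.
lhs = np.sum(p(s, x, y) * p(t, y, z)) * dy
rhs = p(s + t, x, z)
print(lhs, rhs)   # the two values agree to high accuracy
```

Here the semigroup property amounts to the familiar fact that the sum of independent normal variables is normal, with the variances adding.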
If T = N (discrete time), then the transition kernels of X are just the powers of the one-step transition kernel. That is, if we let P = P_1 then P_n = P^n for n ∈ N.
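For a chain on finitely many states this is just matrix powers, and the Chapman-Kolmogorov equation becomes the law of exponents. A small sketch with a hypothetical 3-state chain (the matrix is invented for illustration):

```python
import numpy as np

# Hypothetical one-step transition matrix for a 3-state chain.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

# P_n = P^n, so Chapman-Kolmogorov P_m P_n = P_{m+n} is automatic.
P2 = np.linalg.matrix_power(P, 2)
P3 = np.linalg.matrix_power(P, 3)
P5 = np.linalg.matrix_power(P, 5)
assert np.allclose(P2 @ P3, P5)

print(P5[0])   # row 0: the distribution after 5 steps from state 0
```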
Recall that a kernel defines two operations: operating on the left with positive measures on (S, S ) and operating on the right with
measurable, real-valued functions. For the transition kernels of a Markov process, both of these operators have natural
interpretations.
Proof
So if P denotes the collection of probability measures on (S, S), then the left operator P_t maps P back into P. In particular, if X_0 has distribution μ_0 (the initial distribution) then X_t has distribution μ_t = μ_0 P_t for every t ∈ T.
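In the finite-state discrete-time case the left operator is a row vector times a matrix. A minimal sketch (the chain and initial distribution are hypothetical):

```python
import numpy as np

# Hypothetical two-state chain and initial distribution (row vector).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
mu = np.array([1.0, 0.0])   # X_0 = 0 with probability 1

# The left operator: mu_n = mu_{n-1} P, so the distribution of X_n
# is the row vector mu_0 P^n.
for n in range(1, 4):
    mu = mu @ P
    print(n, mu)

# After 3 steps, mu equals mu_0 P^3 computed directly.
assert np.allclose(mu, np.array([1.0, 0.0]) @ np.linalg.matrix_power(P, 3))
```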
Hence if μ is a probability measure that is invariant for X, and X_0 has distribution μ, then X_t has distribution μ for every t ∈ T, so that the process X is identically distributed. In discrete time, note that if μ is a positive measure and μP = μ then μP^n = μ for every n ∈ N.
Proof
In particular, the right operator P_t is defined on B, the vector space of bounded, measurable functions f : S → R, and in fact is a linear operator on B.
For the right operator, there is a concept that is complementary to the invariance of a positive measure for the left operator.
The result above shows how to obtain the distribution of X_t from the distribution of X_0 and the transition kernel P_t for t ∈ T. But we can do more. Recall that one basic way to describe a stochastic process is to give its finite dimensional distributions, that is, the distribution of (X_{t_1}, X_{t_2}, …, X_{t_n}) for every n ∈ N_+ and every (t_1, t_2, …, t_n) ∈ T^n. For a Markov process, the initial distribution and the transition kernels determine the finite dimensional distributions. It's easiest to state the distributions in
differential form.
μ_0(dx_0) P_{t_1}(x_0, dx_1) P_{t_2 − t_1}(x_1, dx_2) ⋯ P_{t_n − t_{n−1}}(x_{n−1}, dx_n) (16.1.22)
Proof
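The product formula is also a recipe for simulation: draw X_0 from μ_0, then draw each successive state from the transition kernel evaluated at the current state. A sketch for a hypothetical 3-state chain in discrete time (the distribution and matrix are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical initial distribution and one-step transition matrix.
mu0 = np.array([0.2, 0.5, 0.3])
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

def sample_path(n):
    """Sample (X_0, ..., X_n) just as the product formula reads:
    X_0 from mu0, then X_{k+1} from the row P[X_k]."""
    x = int(rng.choice(3, p=mu0))
    path = [x]
    for _ in range(n):
        x = int(rng.choice(3, p=P[x]))
        path.append(x)
    return path

print(sample_path(5))   # one realization of (X_0, ..., X_5)
```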
This result is very important for constructing Markov processes. If we know how to define the transition kernels P for t ∈ T t
(based on modeling considerations, for example), and if we know the initial distribution μ , then the last result gives a consistent 0
set of finite dimensional distributions. From the Kolmogorov construction theorem, we know that there exists a stochastic process
that has these finite dimensional distributions. In continuous time, however, two serious problems remain. First, it's not clear how
we would construct the transition kernels so that the crucial Chapman-Kolmogorov equations above are satisfied. Second, we
usually want our Markov process to have certain properties (such as continuity properties of the sample paths) that go beyond the
finite dimensional distributions. The first problem will be addressed in the next section, and fortunately, the second problem can be
resolved for a Feller process.
Suppose that X = { Xt : t ∈ T } is a Markov process on an LCCB state space (S, S ) with transition operators
P = { Pt : t ∈ [0, ∞)} . Then X is a Feller process if and only if the following conditions hold:
1. Continuity in space: If f ∈ C_0 and t ∈ [0, ∞) then P_t f ∈ C_0.
2. Continuity in time: If f ∈ C_0 and x ∈ S then P_t f(x) → f(x) as t ↓ 0.
A semigroup of probability kernels P = {P_t : t ∈ T} that satisfies the properties in this theorem is called a Feller semigroup. So the theorem states that the Markov process X is Feller if and only if the transition semigroup P is Feller. As before, (a) is automatically satisfied if S is discrete, and (b) is automatically satisfied if T is discrete. Condition (a) means that P_t is an operator on the vector space C_0, in addition to being an operator on the larger space B. Condition (b) actually implies a stronger form of continuity in time.
Additional details
So combining this with the remark above, note that if P is a Feller semigroup of transition operators, then f ↦ P_t f is continuous on C_0 for fixed t ∈ T, and t ↦ P_t f is continuous on T for fixed f ∈ C_0. Again, the importance of this is that we often start with
the collection of probability kernels P and want to know that there exists a nice Markov process X that has these transition
operators.
Sampling in Time
If we sample a Markov process at an increasing sequence of points in time, we get another Markov process in discrete time. But the
discrete time process may not be homogeneous even if the original process is homogeneous.
Suppose that X = {X_t : t ∈ T} is a Markov process with state space (S, S) and that (t_0, t_1, t_2, …) is a sequence in T with 0 ≤ t_0 < t_1 < t_2 < ⋯. Let Y_n = X_{t_n} for n ∈ N. Then Y = {Y_n : n ∈ N} is a Markov process in discrete time.
Proof
If we sample a homogeneous Markov process at multiples of a fixed, positive time, we get a homogeneous Markov process in discrete time.
Suppose that X = {X_t : t ∈ T} is a homogeneous Markov process with state space (S, S) and transition kernels P = {P_t : t ∈ T}. Fix r ∈ T with r > 0 and define Y_n = X_{nr} for n ∈ N. Then Y = {Y_n : n ∈ N} is a homogeneous Markov process in discrete time, with one-step transition kernel P_r.
In some cases, sampling a strong Markov process at an increasing sequence of stopping times yields another Markov process in
discrete time. The point of this is that discrete-time Markov processes are often found naturally embedded in continuous-time
Markov processes.
Suppose that X = {X_t : t ∈ T} is a non-homogeneous Markov process with state space (S, S). Suppose also that τ is a random variable taking values in T, independent of X. Let τ_t = τ + t and let Y_t = (X_{τ_t}, τ_t) for t ∈ T. Then Y = {Y_t : t ∈ T} is a homogeneous Markov process with state space (S × T, S ⊗ T). For t ∈ T, the transition kernel P_t is given by
P_t[(x, r), A × B] = P(X_{r+t} ∈ A ∣ X_r = x) 1(r + t ∈ B), (x, r) ∈ S × T, A × B ∈ S ⊗ T (16.1.29)
Proof
The trick of enlarging the state space is a common one in the study of stochastic processes. Sometimes a process that has a weaker
form of “forgetting the past” can be made into a Markov process by enlarging the state space appropriately. Here is an example in
discrete time.
Proof
The last result generalizes in a completely straightforward way to the case where the future of a random process in discrete time
depends stochastically on the last k states, for some fixed k ∈ N .
Suppose that X = {X_n : n ∈ N} is a stochastic process with state space (S, S) and that X satisfies the recurrence relation X_{n+1} = g(X_n) for n ∈ N, where g : S → S is measurable. Then X is a homogeneous Markov process with one-step transition operator P given by Pf = f ∘ g for a measurable function f : S → R.
Proof
In the deterministic world, as in the stochastic world, the situation is more complicated in continuous time. Nonetheless, the same
basic analogy applies.
Suppose that X = {X_t : t ∈ [0, ∞)} with state space (R, R) satisfies the first-order differential equation
d/dt X_t = g(X_t) (16.1.35)
In differential form, the process can be described by dX_t = g(X_t) dt. This essentially deterministic process can be extended to a
very important class of Markov processes by the addition of a stochastic term related to Brownian motion. Such stochastic
differential equations are the main tools for constructing Markov processes known as diffusion processes.
“continuous”. Typically, S is either N or Z in the discrete case, and is either [0, ∞) or R in the continuous case. In any case, S is
given the usual σ-algebra S of Borel subsets of S (which is the power set in the discrete case). Also, the state space (S, S) has a natural reference measure λ, namely counting measure in the discrete case and Lebesgue measure in the continuous case. Let F = {F_t : t ∈ T} denote the natural filtration, so that F_t = σ{X_s : s ∈ T, s ≤ t} for t ∈ T.
t t s
A difference of the form X_{s+t} − X_s for s, t ∈ T is an increment of the process, hence the names. Sometimes the definition of stationary increments is that X_{s+t} − X_s have the same distribution as X_t. But this forces X_0 = 0 with probability 1, and as usual with Markov processes, it's best to keep the initial distribution unspecified. If X has stationary increments in the sense of our definition, then the process Y = {Y_t = X_t − X_0 : t ∈ T} has stationary increments in the more restricted sense. For the remainder of this discussion, assume that X = {X_t : t ∈ T} has stationary, independent increments, and let Q_t denote the distribution of X_t − X_0 for t ∈ T.
Qs ∗ Qt = Qs+t for s, t ∈ T .
Proof
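The convolution identity is easy to illustrate by simulation. Here is a sketch assuming the gamma family, with Q_t the gamma distribution with shape t and scale 1; this family really is a convolution semigroup (the associated Lévy process is known as the gamma process), since sums of independent gammas with a common scale add their shapes:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Q_t = Gamma(shape=t, scale=1): then Q_s * Q_t = Q_{s+t}.
s, t, n = 1.3, 2.2, 200000
u = rng.gamma(shape=s, scale=1.0, size=n)   # sample from Q_s
v = rng.gamma(shape=t, scale=1.0, size=n)   # independent sample from Q_t
w = u + v                                   # distributed as Q_s * Q_t

# Gamma(s + t, 1) has mean s + t and variance s + t; the empirical
# moments of w should be close to both.
print(w.mean(), w.var())   # both near s + t = 3.5
```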
So the collection of distributions Q = {Q_t : t ∈ T} forms a semigroup, with convolution as the operator. Note that Q_0 is simply point mass at 0.
The process X is a homogeneous Markov process. For t ∈ T, the transition operator P_t is given by
P_t f(x) = ∫_S f(x + y) Q_t(dy), f ∈ B, x ∈ S
Proof
Clearly the semigroup property of P = {P_t : t ∈ T} (with the usual operator product) is equivalent to the semigroup property of Q = {Q_t : t ∈ T} (with convolution).
Suppose that for positive t ∈ T, the distribution Q_t has probability density function g_t with respect to the reference measure λ. Then the transition density for time t is given by p_t(x, y) = g_t(y − x) for x, y ∈ S.
Of course, from the result above, it follows that gs ∗ gt = gs+t for s, t ∈ T , where here ∗ refers to the convolution operation on
probability density functions.
Thus, by the general theory sketched above, X is a strong Markov process, and there exists a version of X that is right continuous
and has left limits. Such a process is known as a Lévy process, in honor of Paul Lévy.
For a real-valued stochastic process X = {X t : t ∈ T} , let m and v denote the mean and variance functions, so that
m(t) = E(Xt ), v(t) = var(Xt ); t ∈ T (16.1.40)
assuming of course that these exist. The mean and variance functions for a Lévy process are particularly simple.
1. If μ_0 = E(X_0) ∈ R and μ_1 = E(X_1) ∈ R then m(t) = μ_0 + (μ_1 − μ_0) t for t ∈ T.
2. If in addition, σ_0^2 = var(X_0) ∈ (0, ∞) and σ_1^2 = var(X_1) ∈ (0, ∞) then v(t) = σ_0^2 + (σ_1^2 − σ_0^2) t for t ∈ T.
Proof
It's easy to describe processes with stationary independent increments in discrete time.
A process X = {X n : n ∈ N} has independent increments if and only if there exists a sequence of independent, real-valued
random variables (U 0 , U1 , …) such that
X_n = ∑_{i=0}^{n} U_i, n ∈ N (16.1.43)
More generally, for n ∈ N, the n-step transition kernel is P^n(x, A) = Q^{∗n}(A − x) for x ∈ S and A ∈ S. This Markov process is known as a random walk (although unfortunately, the term random walk is used in a number of other contexts as well). The idea is that at time n, the walker moves a (directed) distance U_n on the real line, and these steps are independent and identically distributed. If Q has probability density function g with respect to the reference measure λ, then the one-step transition density is given by p(x, y) = g(y − x) for x, y ∈ S.
Consider the random walk on R with steps that have the standard normal distribution. Give each of the following explicitly:
1. The one-step transition density.
2. The n-step transition density for n ∈ N_+.
Proof
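A sketch of the answer: the n-step increment is a sum of n independent standard normal steps, hence normal with mean 0 and variance n, so p_n(x, y) is the N(x, n) density at y. The code below (a minimal illustration, not the book's official solution) also runs a quick Monte Carlo sanity check:

```python
import numpy as np

rng = np.random.default_rng(seed=4)

def p_n(n, x, y):
    """n-step transition density of the Gaussian random walk:
    the normal density with mean x and variance n, evaluated at y."""
    return np.exp(-(y - x) ** 2 / (2 * n)) / np.sqrt(2 * np.pi * n)

# One-step density: p_1(x, y) = phi(y - x), the standard normal density.
print(p_n(1, 0.0, 0.0))       # 1/sqrt(2*pi), about 0.3989

# Sanity check: after n steps from 0 the walk has variance n.
n, runs = 10, 100000
X = rng.standard_normal((runs, n)).sum(axis=1)   # X_n = U_1 + ... + U_n
print(X.var())                # close to n = 10
```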
In continuous time, there are two processes that are particularly important, one with the discrete state space N and one with the
continuous state space R.
For t ∈ [0, ∞), let g_t denote the probability density function of the Poisson distribution with parameter t, and let p_t(x, y) = g_t(y − x) for x, y ∈ N. Then {p_t : t ∈ [0, ∞)} is the collection of transition densities for a Feller semigroup on N.
Proof
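The semigroup property of these densities can be verified directly: summing over the intermediate state reduces to the binomial theorem. A small numerical check (the particular values of s, t, x, z are arbitrary):

```python
import math

def g(t, k):
    """Poisson pmf with parameter t at k, i.e. the density g_t with
    respect to counting measure (zero for negative k)."""
    return math.exp(-t) * t ** k / math.factorial(k) if k >= 0 else 0.0

s, t, x, z = 0.7, 1.8, 2, 9

# Chapman-Kolmogorov: sum over the intermediate state y.
lhs = sum(g(s, y - x) * g(t, z - y) for y in range(x, z + 1))
rhs = g(s + t, z - x)
print(lhs, rhs)   # equal up to rounding
```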
So a Lévy process N = {N_t : t ∈ [0, ∞)} with these transition densities would be a Markov process with stationary, independent increments and with sample paths that are right continuous and have left limits. We do know of such a process, namely the Poisson process with rate 1.
Open the Poisson experiment and set the rate parameter to 1 and the time parameter to 10. Run the experiment several times in
single-step mode and note the behavior of the process.
For t ∈ (0, ∞), let g_t denote the probability density function of the normal distribution with mean 0 and variance t, and let p_t(x, y) = g_t(y − x) for x, y ∈ R. Then {p_t : t ∈ [0, ∞)} is the collection of transition densities of a Feller semigroup on R.
Proof
So a Lévy process X = {X_t : t ∈ [0, ∞)} on R with these transition densities would be a Markov process with stationary, independent increments, and whose sample paths are continuous from the right and have left limits. In fact, there exists such a
process with continuous sample paths. This process is Brownian motion, a process important enough to have its own chapter.
Run the simulation of standard Brownian motion and note the behavior of the process.
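For readers without access to the app, here is a minimal Python sketch of a standard Brownian path on [0, 1]: independent N(0, dt) increments over a time grid, accumulated by a cumulative sum (the grid size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(seed=5)

# Discretize [0, 1] into n steps; the increment over each step is an
# independent normal with mean 0 and variance dt.
n = 1000
dt = 1.0 / n
increments = np.sqrt(dt) * rng.standard_normal(n)
B = np.concatenate([[0.0], np.cumsum(increments)])   # B_0 = 0

print(B.shape)    # (1001,): path values on the time grid
print(B[-1])      # B_1, a single draw from N(0, 1)
```

Plotting B against the time grid shows the characteristic jagged, nowhere-smooth behavior of the sample paths.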
This page titled 16.1: Introduction to Markov Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.2: Potentials and Generators for General Markov Processes
Our goal in this section is to continue the broad sketch of the general theory of Markov processes. As with the last section, some of
the statements are not completely precise and rigorous, because we want to focus on the main ideas without being overly burdened
by technicalities. If you are a new student of probability, or are primarily interested in applications, you may want to skip ahead to
the study of discrete-time Markov chains.
Preliminaries
Basic Definitions
As usual, our starting point is a probability space (Ω, F , P), so that Ω is the set of outcomes, F the σ-algebra of events, and P the
probability measure on the sample space (Ω, F ). The set of times T is either N, discrete time with the discrete topology, or [0, ∞),
continuous time with the usual Euclidean topology. The time set T is given the Borel σ-algebra T , which is just the power set if
T = N , and then the time space (T , T ) is given the usual measure, counting measure in the discrete case and Lebesgue measure in
the continuous case. The set of states S has an LCCB topology (locally compact, Hausdorff, with a countable base), and is also
given the Borel σ-algebra S . Recall that to say that the state space is discrete means that S is countable with the discrete topology,
so that S is the power set of S . The topological assumptions mean that the state space (S, S ) is nice enough for a rich
mathematical theory and general enough to encompass the most important applications. There is often a natural Borel measure λ
on (S, S), counting measure # if S is discrete, and for example, Lebesgue measure if S = R^k for some k ∈ N_+.
Recall also that there are several spaces of functions on S that are important. Let B denote the set of bounded, measurable functions f : S → R. Let C denote the set of bounded, continuous functions f : S → R, and let C_0 denote the set of continuous functions f : S → R that vanish at ∞ in the sense that for every ϵ > 0, there exists a compact set K ⊆ S such that |f(x)| < ϵ for x ∉ K. These are all vector spaces under the usual (pointwise) addition and scalar multiplication, and C_0 ⊆ C ⊆ B. The supremum norm, defined by ∥f∥ = sup{|f(x)| : x ∈ S} for f ∈ B, is the norm that is used on these spaces.
Suppose now that X = {X_t : t ∈ T} is a time-homogeneous Markov process with state space (S, S) defined on the probability space (Ω, F, P). As before, we also assume that we have a filtration F = {F_t : t ∈ T}, that is, an increasing family of sub σ-algebras of F, indexed by the time space, with the property that X_t is measurable with respect to F_t for t ∈ T. Intuitively, F_t is the collection of events up to time t ∈ T.
Recall that for t ∈ T, the transition kernel P_t defines two operators, on the left with measures and on the right with functions. So,
everything in this section is defined in terms of the semigroup P , which is one of the main analytic tools in the study of Markov
processes.
We assume that the Markov process X = { Xt : t ∈ T } satisfies the following properties (and hence is a Feller Markov
process):
1. For t ∈ T and y ∈ S, the distribution of X_t given X_0 = x converges to the distribution of X_t given X_0 = y as x → y.
16.2.1 https://stats.libretexts.org/@go/page/10289
2. Given X_0 = x ∈ S, X_t converges in probability to x as t ↓ 0.
Part (a) is an assumption on continuity in space, while part (b) is an assumption on continuity in time. If S is discrete then (a)
automatically holds, and if T is discrete then (b) automatically holds. As we will see, the Feller assumptions are sufficient for a
very nice mathematical theory, and yet are general enough to encompass the most important continuous-time Markov processes.
2. X is a strong Markov process relative to F^0_+, the right-continuous refinement of the natural filtration.
The Feller assumptions on the Markov process have equivalent formulations in terms of the transition semigroup.
As before, part (a) is a condition on continuity in space, while part (b) is a condition on continuity in time. Once again, (a) is trivial if S is discrete, and (b) is trivial if T is discrete. The first condition means that P_t is a linear operator on C_0 (as well as being a linear operator on B).
Our interest in this section is primarily the continuous time case. However, we start with the discrete time case since the concepts
are clearer and simpler, and we can avoid some of the technicalities that inevitably occur in continuous time.
Discrete Time
Suppose that T = N, so that time is discrete. Recall that the transition kernels are just powers of the one-step kernel. That is, we let P = P_1 and then P_n = P^n for n ∈ N.
Potential Operators
For α ∈ (0, 1], the α-potential kernel R_α of X is defined as follows:
R_α(x, A) = ∑_{n=0}^∞ α^n P^n(x, A), x ∈ S, A ∈ S (16.2.5)
Note that it's quite possible that R_α(x, A) = ∞ for some x ∈ S and A ∈ S. In fact, knowing when this is the case is of considerable importance in the study of Markov processes. As with all kernels, the potential kernel R_α defines two operators, operating on the right on functions, and operating on the left on positive measures. For the right potential operator, if f : S → R is measurable then
R_α f(x) = ∑_{n=0}^∞ α^n P^n f(x) = ∑_{n=0}^∞ α^n ∫_S P^n(x, dy) f(y) = ∑_{n=0}^∞ α^n E[f(X_n) ∣ X_0 = x], x ∈ S (16.2.7)
assuming as usual that the expected values and the infinite series make sense. This will be the case, in particular, if f is nonnegative or if α ∈ (0, 1) and f ∈ B.
If α ∈ (0, 1) then R_α(x, S) = 1/(1 − α) for all x ∈ S.
Proof
It follows that for α ∈ (0, 1), the right operator R_α is a bounded, linear operator on B with ∥R_α∥ = 1/(1 − α). It also follows that (1 − α)R_α is a probability kernel. There is a nice interpretation of this kernel.
So (1 − α)R_α is a transition probability kernel, just as P_n is a transition probability kernel, but corresponding to the random time N (with α ∈ (0, 1) as a parameter), rather than the deterministic time n ∈ N. An interpretation of the potential kernel R_α for α ∈ (0, 1) can also be given in economic terms. Suppose that A ∈ S and that we receive one monetary unit each time the process X visits A. Then as above, R(x, A) is the expected total amount of money we receive, starting at x ∈ S. However, typically money that we will receive at times distant in the future has less value to us now than money that we will receive soon. Specifically, suppose that a monetary unit received at time n ∈ N has a present value of α^n, where α ∈ (0, 1) is an inflation factor (sometimes also called a discount factor). Then R_α(x, A) gives the expected, total, discounted amount we will receive, starting at x ∈ S. A bit more generally, if f ∈ B is a reward function, so that f(x) is the reward (or cost, depending on the sign) that we receive when we visit state x ∈ S, then for α ∈ (0, 1), R_α f(x) is the expected, total, discounted reward, starting at x ∈ S.
μRα(A) = ∑_{n=0}^∞ α^n μP^n(A) = ∑_{n=0}^∞ α^n ∫_S μ(dx) P^n(x, A),  A ∈ S  (16.2.12)
In particular, if μ is a probability measure and X₀ has distribution μ, then μP^n is the distribution of Xₙ for n ∈ N, so from the last result, (1 − α)μRα is the distribution of X_N where again, N is independent of X and has the geometric distribution on N with parameter 1 − α. The family of potential kernels gives the same information as the family of transition kernels.
The potential kernels R = {Rα : α ∈ (0, 1)} completely determine the transition kernels P = {P^n : n ∈ N}.
Proof
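The geometric-time interpretation of (1 − α)Rα can be checked numerically for a finite chain. The sketch below uses NumPy and a small, hypothetical three-state transition matrix (our own choice, not from the text); it accumulates the defining series and confirms that (1 − α)Rα is a probability kernel.

```python
import numpy as np

# A small, arbitrary 3-state transition matrix (hypothetical, for illustration).
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
alpha = 0.8

# Accumulate R_alpha = sum_{n=0}^inf alpha^n P^n, truncated at 500 terms.
R = np.zeros((3, 3))
term = np.eye(3)
for _ in range(500):
    R += term
    term = alpha * (term @ P)

# (1 - alpha) R_alpha should be a probability matrix: the distribution of X_N,
# where N is geometric on N with parameter 1 - alpha, independent of X.
Q = (1 - alpha) * R
print(Q.sum(axis=1))   # each row sums to 1 (up to truncation error)
```

The truncation error is of order α^500, so the row sums agree with 1 to machine precision here.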
Of course, it's really only necessary to determine P, the one-step transition kernel, since the other transition kernels are powers of P. In any event, it follows that the kernels R = {Rα : α ∈ (0, 1)}, along with the initial distribution, completely determine the finite dimensional distributions of the Markov process X. The potential kernels commute with each other and with the transition kernels.
If α, β ∈ (0, 1] and k ∈ N then
1. P^k Rα = Rα P^k = ∑_{n=0}^∞ α^n P^{n+k}
2. Rα Rβ = Rβ Rα = ∑_{m=0}^∞ ∑_{n=0}^∞ α^m β^n P^{m+n}
Proof
The same identities hold for the right operators on the entire space B, with the additional restrictions that α < 1 and β < 1. The fundamental equation that relates the potential kernels is given next.

If α, β ∈ (0, 1] then αRα = βRβ + (α − β)Rα Rβ.
Proof
The same identity holds for the right operators on the entire space B, with the additional restriction that β < 1.

If α ∈ (0, 1] then I + αRα P = I + αP Rα = Rα.
The same identity holds for the right operators on the entire space B, with the additional restriction that α < 1. This leads to the following important result:
If α ∈ (0, 1) then
1. Rα = (I − αP)^{−1}
2. P = (1/α)(I − Rα^{−1})
Proof
This result shows again that the potential operator Rα determines the transition operator P.
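The inverse relationship is easy to verify numerically; the sketch below (NumPy, with an arbitrary three-state matrix of our own choosing) compares (I − αP)^{−1} with the truncated defining series.

```python
import numpy as np

# Arbitrary 3-state transition matrix (hypothetical, for illustration).
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
alpha = 0.6

# R_alpha computed two ways: as a matrix inverse and as the defining series.
R_inverse = np.linalg.inv(np.eye(3) - alpha * P)
R_series = np.zeros((3, 3))
term = np.eye(3)
for _ in range(200):
    R_series += term
    term = alpha * (term @ P)

print(np.abs(R_inverse - R_series).max())  # ~0: the two computations agree
```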
Let I = {Iₙ : n ∈ N₊} be a sequence of Bernoulli trials with success parameter p ∈ (0, 1). Define the Markov process X = {Xₙ : n ∈ N} by Xₙ = X₀ + ∑_{k=1}^n I_k, where X₀ takes values in N and is independent of I.
1. For n ∈ N, show that the transition matrix P^n of X is given by

P^n(x, y) = C(n, y − x) p^{y−x} (1 − p)^{n−y+x},  x ∈ N, y ∈ {x, x + 1, …, x + n}  (16.2.22)

2. For α ∈ (0, 1), show that the potential matrix Rα of X is given by

Rα(x, y) = [1/(1 − α + αp)] [αp/(1 − α + αp)]^{y−x},  x ∈ N, y ∈ {x, x + 1, …}  (16.2.23)

3. For α ∈ (0, 1) and x ∈ N, identify the probability distribution defined by (1 − α)Rα(x, ⋅).
4. For x, y ∈ N with x ≤ y, interpret R(x, y), the expected time in y starting in x, in the context of the process X.
Solutions
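Part 2 can be spot-checked numerically. The sketch below (plain Python, with arbitrary choices of p, α, x, and y) sums the series ∑ α^n P^n(x, y) using the binomial form of P^n and compares it with the closed form (16.2.23).

```python
from math import comb

p, alpha = 0.3, 0.7      # arbitrary parameter choices
x, y = 2, 5
k = y - x

# Series: R_alpha(x, y) = sum_{n >= k} alpha^n C(n, k) p^k (1 - p)^(n - k).
series = sum(alpha**n * comb(n, k) * p**k * (1 - p)**(n - k)
             for n in range(k, 2000))

# Closed form from (16.2.23).
r = 1 - alpha + alpha * p
closed = (1 / r) * (alpha * p / r)**k
print(abs(series - closed))  # ~0
```

The series ratio tends to α(1 − p) < 1, so the 2000-term truncation is more than enough for agreement to high precision.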
Continuous Time
With the discrete-time setting as motivation, we now turn to the more important continuous-time case where T = [0, ∞).
Potential Kernels
For α ∈ [0, ∞), the α-potential kernel Uα of X is defined as follows:

Uα(x, A) = ∫_0^∞ e^{−αt} P_t(x, A) dt,  x ∈ S, A ∈ S  (16.2.25)

1. The kernel U = U₀ is simply the potential kernel of X.
2. For x ∈ S and A ∈ S, U(x, A) is the expected amount of time that X spends in A, starting at x.
3. The family of kernels U = {Uα : α ∈ (0, ∞)} is known as the resolvent of X.
Proof
As with discrete time, it's quite possible that U(x, A) = ∞ for some x ∈ S and A ∈ S, and knowing when this is the case is of considerable interest. As with all kernels, the potential kernel Uα defines two operators, operating on the right on functions, and operating on the left on positive measures. If f : S → R is measurable then, giving the right potential operator in its many forms,

Uα f(x) = ∫_S Uα(x, dy) f(y) = ∫_0^∞ e^{−αt} P_t f(x) dt = ∫_0^∞ e^{−αt} ∫_S P_t(x, dy) f(y) dt = ∫_0^∞ e^{−αt} E[f(X_t) ∣ X₀ = x] dt,  x ∈ S

assuming that the various integrals make sense. This will be the case in particular if f is nonnegative, or if f ∈ B and α > 0.

If α ∈ (0, ∞) then Uα(x, S) = 1/α for all x ∈ S.
Proof
It follows that for α ∈ (0, ∞), the right potential operator Uα is a bounded, linear operator on B with ‖Uα‖ = 1/α. It also follows that αUα is a probability kernel. This kernel has a nice interpretation.
If α > 0 then αUα(x, ⋅) is the conditional distribution of X_τ given X₀ = x, where τ is independent of X and has the exponential distribution on [0, ∞) with parameter α.
So αUα is a transition probability kernel, just as P_t is a transition probability kernel, but corresponding to the random time τ (with α ∈ (0, ∞) as a parameter), rather than the deterministic time t ∈ [0, ∞). As in the discrete case, the potential kernel can also be interpreted in economic terms. Suppose that A ∈ S and that we receive money at a rate of one unit per unit time whenever the process X is in A. Then U(x, A) is the expected total amount of money that we receive, starting in state x ∈ S. But again, money that we receive later is of less value to us now than money that we will receive sooner. Specifically, suppose that one monetary unit at time t ∈ [0, ∞) has a present value of e^{−αt}, where α ∈ (0, ∞) is the inflation factor or discount factor. Then Uα(x, A) is the total, expected, discounted amount that we receive, starting in x ∈ S. A bit more generally, suppose that f ∈ B and that f(x) is the reward (or cost, depending on the sign) per unit time that we receive when the process is in state x ∈ S. Then Uα f(x) is the expected, total, discounted reward, starting at x ∈ S.
In particular, suppose that α > 0, that μ is a probability measure, and that X₀ has distribution μ. Then μP_t is the distribution of X_t for t ∈ [0, ∞), and hence from the last result, αμUα is the distribution of X_τ, where again, τ is independent of X and has the exponential distribution on [0, ∞) with parameter α. The family of potential kernels gives the same information as the family of transition kernels.
The resolvent U = { Uα : α ∈ (0, ∞)} completely determines the family of transition kernels P = { Pt : t ∈ (0, ∞)} .
Proof
It follows that the resolvent {Uα : α ∈ [0, ∞)}, along with the initial distribution, completely determines the finite dimensional distributions of the Markov process X. This is much more important here in the continuous-time case than in the discrete-time case, since the transition kernels P_t cannot be generated from a single transition kernel. The potential kernels commute with each other and with the transition kernels.
Proof
The same identities hold for the right operators on the entire space B, under the additional restriction that α > 0 and β > 0. The fundamental equation that relates the potential kernels, known as the resolvent equation, is given in the next theorem:

If α, β ∈ [0, ∞) then Uα = Uβ + (β − α)Uα Uβ.

The same identity holds for the right potential operators on the entire space B, under the additional restriction that α > 0. For α ∈ (0, ∞), Uα is also an operator on the space C₀.

If f ∈ C₀ then αUα f → f as α → ∞.
Proof
Infinitesimal Generator
In continuous time, it's not at all clear how we could construct a Markov process with desired properties, say to model a real system of some sort. Stated mathematically, the existential problem is how to construct the family of transition kernels {P_t : t ∈ [0, ∞)} so that the semigroup property P_s P_t = P_{s+t} is satisfied for all s, t ∈ [0, ∞). The answer, as for similar problems in the deterministic world, involves a type of derivative. The infinitesimal generator G of X is defined on a domain D ⊆ C₀ by

Gf = lim_{t↓0} (P_t f − f)/t  (16.2.39)
As usual, the limit is with respect to the supremum norm on C₀, so f ∈ D and Gf = g means that f, g ∈ C₀ and

‖(P_t f − f)/t − g‖ = sup{|(P_t f(x) − f(x))/t − g(x)| : x ∈ S} → 0 as t ↓ 0  (16.2.40)

So in particular,

Gf(x) = lim_{t↓0} (P_t f(x) − f(x))/t = lim_{t↓0} (E[f(X_t) ∣ X₀ = x] − f(x))/t,  x ∈ S  (16.2.41)
Note that Gf is the (right) derivative at 0 of the function t ↦ P_t f. Because of the semigroup property, this differentiability at 0 implies differentiability at arbitrary t ∈ [0, ∞). Moreover, the infinitesimal operator and the transition operators commute:
If f ∈ D and t ∈ [0, ∞), then P_t f ∈ D and the following derivative rules hold with respect to the supremum norm:
1. (P_t f)′ = P_t Gf, the Kolmogorov forward equation
2. (P_t f)′ = G P_t f, the Kolmogorov backward equation
Proof
The last result gives a possible solution to the dilemma that motivated this discussion in the first place. If we want to construct a Markov process with desired properties, to model a real system for example, we can start by constructing an appropriate generator G and then solve the initial value problem

P_t′ = G P_t,  P₀ = I  (16.2.47)
to obtain the transition operators P = {P_t : t ∈ [0, ∞)}. The next theorem gives the relationship between the potential operators and the infinitesimal operator, which in some ways is better. This relationship is analogous to the relationship between the potential operators and the one-step operator given above in discrete time.
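For a finite state space, the initial value problem (16.2.47) has the solution P_t = e^{tG}, the matrix exponential. The sketch below (NumPy, with a hypothetical two-state generator of our own choosing) builds the exponential from its power series and checks that the result is a transition matrix.

```python
import numpy as np

def expm_series(A, terms=60):
    """Truncated power series for the matrix exponential e^A."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n
        out = out + term
    return out

# Hypothetical generator on a 2-state space: rows sum to 0, off-diagonals >= 0.
G = np.array([[-1.0, 1.0],
              [2.0, -2.0]])
t = 0.5

Pt = expm_series(t * G)      # solves P_t' = G P_t with P_0 = I
print(Pt.sum(axis=1))        # rows sum to 1: P_t is a transition matrix
```

The semigroup property can also be checked directly: e^{0.5 G} e^{0.5 G} agrees with e^{G} to machine precision.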
Suppose α ∈ (0, ∞).
1. If f ∈ D then Gf ∈ C₀ and f + Uα Gf = αUα f
2. If f ∈ C₀ then Uα f ∈ D and f + G Uα f = αUα f
Proof
If α ∈ (0, ∞) then
1. Uα = (αI − G)^{−1} : C₀ → D
2. G = αI − Uα^{−1} : D → C₀
Proof
So, from the generator G we can determine the potential operators U = {Uα : α ∈ (0, ∞)}, which in turn determine the transition operators P = {P_t : t ∈ (0, ∞)}. In continuous time, the transition operators P = {P_t : t ∈ [0, ∞)} can be obtained from the single infinitesimal operator G, in a way that is reminiscent of the fact that in discrete time, the transition operators P = {P^n : n ∈ N} are simply the powers of the single one-step operator P.
Consider the Markov process X = {X_t : t ∈ [0, ∞)} on R satisfying the ordinary differential equation

(d/dt) X_t = g(X_t),  t ∈ [0, ∞)  (16.2.52)

The infinitesimal generator is given by Gf = g f′ for f in the domain D.
Proof
Our next example considers the Poisson process as a Markov process. Compare this with the binomial process above.
Let N = {N_t : t ∈ [0, ∞)} denote the Poisson process on N with rate β ∈ (0, ∞). Define the Markov process X = {X_t : t ∈ [0, ∞)} by X_t = X₀ + N_t, where X₀ takes values in N and is independent of N.
1. For t ∈ [0, ∞), show that the probability transition matrix P_t of X is given by

P_t(x, y) = e^{−βt} (βt)^{y−x} / (y − x)!,  x, y ∈ N, y ≥ x  (16.2.54)

2. For α ∈ (0, ∞), show that the potential matrix Uα of X is given by

Uα(x, y) = [1/(α + β)] [β/(α + β)]^{y−x},  x, y ∈ N, y ≥ x  (16.2.55)
3. For α > 0 and x ∈ N, identify the probability distribution defined by αUα(x, ⋅).
4. Show that the infinitesimal matrix G of X is given by G(x, x) = −β, G(x, x + 1) = β for x ∈ N.
Solutions
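Part 2 can be spot-checked by numerical integration of e^{−αt} P_t(x, y) dt. The sketch below (plain Python, with arbitrary choices of α, β, x, and y) uses a trapezoidal sum on a truncated interval and compares the result with the closed form (16.2.55).

```python
import math

alpha, beta = 1.5, 2.0        # arbitrary parameter choices
x, y = 0, 3
k = y - x

def Pt_xy(t):
    # P_t(x, y) = e^(-beta t) (beta t)^k / k!  from (16.2.54)
    return math.exp(-beta * t) * (beta * t)**k / math.factorial(k)

# Trapezoidal approximation of U_alpha(x, y) = int_0^inf e^(-alpha t) P_t(x, y) dt.
n, T = 200000, 20.0
h = T / n
integral = 0.5 * h * (Pt_xy(0.0) + math.exp(-alpha * T) * Pt_xy(T))
integral += h * sum(math.exp(-alpha * i * h) * Pt_xy(i * h) for i in range(1, n))

# Closed form from (16.2.55).
closed = (1 / (alpha + beta)) * (beta / (alpha + beta))**k
print(abs(integral - closed))  # small
```

The tail beyond T = 20 contributes at most e^{−(α+β)T} times a polynomial factor, which is negligible here.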
This page titled 16.2: Potentials and Generators for General Markov Processes is shared under a CC BY 2.0 license and was authored, remixed,
and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a
detailed edit history is available upon request.
16.3: Introduction to Discrete-Time Chains
In this and the next several sections, we consider a Markov process with the discrete time space N and with a discrete (countable)
state space. Recall that a Markov process with a discrete state space is called a Markov chain, so we are studying discrete-time
Markov chains.
Review
We will review the basic definitions and concepts in the general introduction. With both time and space discrete, many of these
definitions and concepts simplify considerably. As usual, our starting point is a probability space (Ω, F , P), so Ω is the sample
space, F the σ-algebra of events, and P the probability measure on (Ω, F). Let X = (X₀, X₁, X₂, …) be a stochastic process defined on the probability space, with time space N and with countable state space S. In the context of the general introduction, S
is given the power set P(S) as the σ-algebra, so all subsets of S are measurable, as are all functions from S into another
measurable space. Counting measure # is the natural measure on S , so integrals over S are simply sums. The same comments
apply to the time space N: all subsets of N are measurable and counting measure # is the natural measure on N.
The vector space B consisting of bounded functions f : S → R will play an important role. The norm that we use is the supremum
norm defined by
∥f ∥ = sup{|f (x)| : x ∈ S}, f ∈ B (16.3.1)
For n ∈ N, let Fₙ = σ{X₀, X₁, …, Xₙ}, the σ-algebra generated by the process up to time n. Thus F = {F₀, F₁, F₂, …} is the natural filtration associated with X. We also let Gₙ = σ{Xₙ, X_{n+1}, …}, the σ-algebra generated by the process from time n on. So if n ∈ N represents the present time, then Fₙ contains the events in the past and Gₙ the events in the future.
Definitions
We start with the basic definition of the Markov property: the past and future are conditionally independent, given the present.
There are a number of equivalent formulations of the Markov property for a discrete-time Markov chain. We give a few of these.
Part (a) states that for n ∈ N, the conditional probability density function of X_{n+1} given Fₙ is the same as the conditional probability density function of X_{n+1} given Xₙ. Part (b) states the same thing in terms of expected value: the conditional distribution of X_{n+1} given Fₙ is the same as the conditional distribution of X_{n+1} given Xₙ. The Markov property as stated refers to just one time step in the future. But with discrete time, this is equivalent to the Markov property at general future times.
Part (a) states that for n, k ∈ N, the conditional probability density function of X_{n+k} given Fₙ is the same as the conditional probability density function of X_{n+k} given Xₙ. Part (b) states the same thing in terms of expected value: the conditional distribution of X_{n+k} given Fₙ is the same as the conditional distribution of X_{n+k} given Xₙ. The Markov property can also be stated without explicit reference to σ-algebras. If you are not familiar with measure theory, you can take this as the starting definition.
X = (X₀, X₁, X₂, …) is a Markov chain if for every n ∈ N and every sequence of states (x₀, x₁, …, x_{n−1}, x, y),

P(X_{n+1} = y ∣ X₀ = x₀, X₁ = x₁, …, X_{n−1} = x_{n−1}, Xₙ = x) = P(X_{n+1} = y ∣ Xₙ = x)  (16.3.2)
16.3.1 https://stats.libretexts.org/@go/page/10290
The theory of discrete-time Markov chains is simplified considerably if we add an additional assumption.
That is, the conditional distribution of X_{n+k} given X_k = x depends only on n. So if X is homogeneous (we usually don't bother with the time adjective), then the chain {X_{k+n} : n ∈ N} given X_k = x is equivalent (in distribution) to the chain {Xₙ : n ∈ N} given X₀ = x. For this reason, the initial distribution is often unspecified in the study of Markov chains: if the chain is in state x ∈ S at a particular time k ∈ N, then it doesn't really matter how the chain got to state x; the process essentially “starts over”, independently of the past. The term stationary is sometimes used instead of homogeneous.
From now on, we will usually assume that our Markov chains are homogeneous. This is not as big a loss of generality as you might think. A non-homogeneous Markov chain can be turned into a homogeneous Markov process by enlarging the state space, as shown in the introduction to general Markov processes, but at the cost of creating an uncountable state space. For a homogeneous Markov chain, if k, n ∈ N, x ∈ S, and f ∈ B, then

E[f(X_{k+n}) ∣ X_k = x] = E[f(Xₙ) ∣ X₀ = x]

Consider again the natural filtration F = (F₀, F₁, F₂, …) as given above. Recall that a random variable τ taking values in N ∪ {∞} is a stopping time or a Markov time for X if {τ = n} ∈ Fₙ for each n ∈ N. Intuitively, we can tell whether or not τ = n by observing the chain up to time n. In a sense, a stopping time is a random time that does not require that we see into the future. The following result gives the quintessential examples of stopping times.
An example of a random time that is generally not a stopping time is the last time that the process is in A:

ζ_A = max{n ∈ N₊ : Xₙ ∈ A}  (16.3.5)

We cannot tell whether ζ_A = n without looking into the future: {ζ_A = n} = {Xₙ ∈ A, X_{n+1} ∉ A, X_{n+2} ∉ A, …} for n ∈ N₊.
If τ is a stopping time for X, the σ-algebra associated with τ is

F_τ = {A ∈ F : A ∩ {τ = n} ∈ Fₙ for all n ∈ N}  (16.3.6)

Intuitively, F_τ contains the events that can be described by the process up to the random time τ, in the same way that Fₙ contains the events that can be described by the process up to the deterministic time n ∈ N. For more information see the section on filtrations and stopping times.
The strong Markov property states that the future is independent of the past, given the present, when the present time is a stopping
time. For a discrete-time Markov chain, the ordinary Markov property implies the strong Markov property.
Part (a) states that the conditional probability density function of X_{τ+k} given F_τ is the same as the conditional probability density function of X_{τ+k} given just X_τ. Part (b) states the same thing in terms of expected value: the conditional distribution of X_{τ+k} given F_τ is the same as the conditional distribution of X_{τ+k} given just X_τ. Assuming homogeneity as usual, the Markov chain essentially starts over at the stopping time τ, independently of the past.
Transition Matrices
Suppose again that X = (X₀, X₁, X₂, …) is a homogeneous, discrete-time Markov chain with state space S. With a discrete state
space, the transition kernels studied in the general introduction become transition matrices, with rows and columns indexed by S
(and so perhaps of infinite size). The kernel operations become familiar matrix operations. The results in this section are special
cases of the general results, but we sometimes give independent proofs for completeness, and because the proofs are simpler. You
may want to review the section on kernels in the chapter on expected value.
For n ∈ N let

P^n(x, y) = P(Xₙ = y ∣ X₀ = x),  (x, y) ∈ S × S  (16.3.7)

Thus, y ↦ P^n(x, y) is the probability density function of Xₙ given X₀ = x. In particular, P^n is a probability matrix (or stochastic kernel) on S for n ∈ N:

P^n(x, A) = ∑_{y∈A} P^n(x, y),  x ∈ S, A ⊆ S

So A ↦ P^n(x, A) is the probability distribution of Xₙ given X₀ = x. The next result is the Chapman-Kolmogorov equation, named for Sydney Chapman and Andrei Kolmogorov. It gives the basic relationship between the transition matrices.
If m, n ∈ N then P^m P^n = P^{m+n}.
Proof
It follows immediately that the transition matrices are just the matrix powers of the one-step transition matrix. That is, letting P = P¹, the n-step matrix P^n is simply the nth power of P for all n ∈ N. Note that P⁰ = I, the identity matrix on S, given by I(x, y) = 1 if x = y and I(x, y) = 0 otherwise.
Suppose that n ∈ N and that f : S → R. Then, assuming that the expected value exists,

P^n f(x) = ∑_{y∈S} P^n(x, y) f(y) = E[f(Xₙ) ∣ X₀ = x],  x ∈ S  (16.3.12)
Proof
The existence of the expected value is only an issue if S is infinite. In particular, the result holds if f is nonnegative or if f ∈ B (which in turn would always be the case if S is finite). In fact, P^n is a linear contraction operator on the space B for n ∈ N. That is, if f ∈ B then P^n f ∈ B and ‖P^n f‖ ≤ ‖f‖. The left operator corresponding to P^n is defined similarly. For f : S → R,

f P^n(y) = ∑_{x∈S} f(x) P^n(x, y),  y ∈ S  (16.3.14)
assuming again that the sum makes sense (as before, only an issue when S is infinite). The left operator is often restricted to
nonnegative functions, and we often think of such a function as the density function (with respect to #) of a positive measure on S .
In this sense, the left operator maps a density function to another density function.
If X₀ has probability density function f, then Xₙ has probability density function f P^n for n ∈ N.
Proof
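This left-operator result is a one-line computation for a finite chain. The sketch below (NumPy, with a hypothetical three-state matrix used purely for illustration) propagates an initial density forward ten steps.

```python
import numpy as np

# Hypothetical 3-state chain for illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
f = np.array([1.0, 0.0, 0.0])       # X_0 starts in state 0

# Density of X_n is f P^n (row vector times matrix, the left operator).
f_n = f @ np.linalg.matrix_power(P, 10)
print(f_n, f_n.sum())               # a probability density on S
```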
In particular, if X₀ has probability density function f, and f is invariant for X, then Xₙ has probability density function f for all n ∈ N, so the sequence of variables X = (X₀, X₁, X₂, …) is identically distributed. Combining two results above, suppose that X₀ has probability density function f and that g : S → R. Assuming the expected value exists, E[g(Xₙ)] = f P^n g. Explicitly,

E[g(Xₙ)] = ∑_{x∈S} ∑_{y∈S} f(x) P^n(x, y) g(y)  (16.3.16)
It also follows from the last theorem that the distribution of X₀ (the initial distribution) and the one-step transition matrix determine the distribution of Xₙ for each n ∈ N. Actually, these basic quantities determine the finite dimensional distributions of X.

Suppose that X₀ has probability density function f₀. For any sequence of states (x₀, x₁, …, xₙ) ∈ S^{n+1},

P(X₀ = x₀, X₁ = x₁, …, Xₙ = xₙ) = f₀(x₀) P(x₀, x₁) P(x₁, x₂) ⋯ P(x_{n−1}, xₙ)  (16.3.17)
Proof
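The product formula (16.3.17) is straightforward to implement. The sketch below (NumPy, hypothetical three-state chain with a uniform initial density) also checks that the probabilities of all paths of a fixed length sum to 1.

```python
from itertools import product
import numpy as np

# Hypothetical 3-state chain and uniform initial density.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
f0 = np.full(3, 1 / 3)

def path_prob(path):
    # P(X_0 = x_0, ..., X_n = x_n) = f_0(x_0) P(x_0, x_1) ... P(x_{n-1}, x_n)
    prob = f0[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= P[a, b]
    return prob

# Summing over all length-4 state sequences recovers total probability 1.
total = sum(path_prob(seq) for seq in product(range(3), repeat=4))
print(total)
```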
Computations of this sort are the reason for the term chain in the name Markov chain. From this result, it follows that given a probability matrix P on S and a probability density function f on S, we can construct a Markov chain X = (X₀, X₁, X₂, …) such that X₀ has probability density function f and the chain has one-step transition matrix P. In applied problems, we often know the one-step transition matrix P from modeling considerations, and again, the initial distribution is often unspecified.
There is a natural graph (in the combinatorial sense) associated with a homogeneous, discrete-time Markov chain.
Suppose again that X = (X₀, X₁, X₂, …) is a Markov chain with state space S and transition probability matrix P. The state graph of X is the directed graph with vertex set S and edge set E = {(x, y) ∈ S² : P(x, y) > 0}.
That is, there is a directed edge from x to y if and only if state x leads to state y in one step. Note that the graph may well have
loops, since a state can certainly lead back to itself in one step. More generally, we have the following result:
Suppose again that X = (X₀, X₁, X₂, …) is a Markov chain with state space S and transition probability matrix P. For x, y ∈ S and n ∈ N₊, there is a directed path of length n in the state graph from x to y if and only if P^n(x, y) > 0.
Proof
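The state graph is easy to extract from P. The sketch below (NumPy, hypothetical three-state chain) builds the edge set and illustrates the path criterion with P².

```python
import numpy as np

# Hypothetical 3-state chain for illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])

# Edge set of the state graph: ordered pairs (x, y) with P(x, y) > 0.
edges = {(x, y) for x in range(3) for y in range(3) if P[x, y] > 0}
print(sorted(edges))

# There is no edge (0, 2), but there is a directed path of length 2
# (0 -> 1 -> 2), so P^2(0, 2) > 0.
P2 = np.linalg.matrix_power(P, 2)
print(P2[0, 2])
```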
Potential Matrices
For α ∈ (0, 1], the α-potential matrix Rα of X is

Rα(x, y) = ∑_{n=0}^∞ α^n P^n(x, y),  (x, y) ∈ S²  (16.3.19)
Note that it's quite possible that R(x, y) = ∞ for some (x, y) ∈ S². In fact, knowing when this is the case is of considerable importance in recurrence and transience, which we study in the next section. As with any nonnegative matrix, the α-potential matrix defines a kernel and defines left and right operators. For the kernel,

Rα(x, A) = ∑_{y∈A} Rα(x, y) = ∑_{n=0}^∞ α^n P^n(x, A),  x ∈ S, A ⊆ S  (16.3.21)

In particular, for α = 1,

R(x, A) = ∑_{y∈A} R(x, y) = ∑_{n=0}^∞ P^n(x, A) = E[∑_{n=0}^∞ 1(Xₙ ∈ A) ∣ X₀ = x],  x ∈ S, A ⊆ S  (16.3.22)
If α ∈ (0, 1) then Rα(x, S) = 1/(1 − α) for all x ∈ S.
Proof
Hence Rα is a bounded matrix for α ∈ (0, 1), and (1 − α)Rα is a probability matrix. There is a simple interpretation of this matrix.
So (1 − α)Rα can be thought of as a transition matrix just as P^n is a transition matrix, but corresponding to the random time N (with α as a parameter) rather than the deterministic time n. An interpretation of the potential matrix Rα for α ∈ (0, 1) can also be given in economic terms. Suppose that we receive one monetary unit each time the chain visits a fixed state y ∈ S. Then R(x, y) is the expected total reward, starting in state x ∈ S. However, money that we will receive at times distant in the future typically has less value to us now than money that we will receive soon. Specifically, suppose that a monetary unit at time n ∈ N has a present value of α^n, so that α is an inflation factor (sometimes also called a discount factor). Then Rα(x, y) gives the expected total discounted reward, starting in state x ∈ S.
The potential matrices R = {Rα : α ∈ (0, 1)} completely determine the transition matrices P = {P^n : n ∈ N}.
Proof
Of course, it's really only necessary to determine P, the one-step transition matrix, since the other transition matrices are powers of P. In any event, it follows that the matrices R = {Rα : α ∈ (0, 1)}, along with the initial distribution, completely determine the finite dimensional distributions of the Markov chain X. The potential matrices commute with each other and with the transition matrices.
If α, β ∈ (0, 1] and k ∈ N then
1. P^k Rα = Rα P^k = ∑_{n=0}^∞ α^n P^{n+k}
2. Rα Rβ = Rβ Rα = ∑_{m=0}^∞ ∑_{n=0}^∞ α^m β^n P^{m+n}
Proof
The fundamental equation that relates the potential matrices is given next.

If α, β ∈ (0, 1] then

αRα = βRβ + (α − β)Rα Rβ  (16.3.31)
Proof
If α ∈ (0, 1] then I + αRα P = I + αP Rα = Rα.
Proof
This leads to an important result: when α ∈ (0, 1), there is an inverse relationship between P and Rα.
1. Rα = (I − αP)^{−1}
2. P = (1/α)(I − Rα^{−1})
Proof
This result shows again that the potential matrix Rα determines the transition matrix P.
Sampling in Time
If we sample a Markov chain at multiples of a fixed time k , we get another (homogeneous) chain.
Suppose that X = (X₀, X₁, X₂, …) is a Markov chain with state space S and transition probability matrix P. For fixed k ∈ N₊, the sequence X_k = (X₀, X_k, X_{2k}, …) is a Markov chain on S with transition probability matrix P^k.
If we sample a Markov chain at a general increasing sequence of time points 0 < n₁ < n₂ < ⋯ in N, then the resulting stochastic process Y = (Y₀, Y₁, Y₂, …), where Y_k = X_{n_k} for k ∈ N, is still a Markov chain, but is not time homogeneous in general.
Recall that if A is a nonempty subset of S, then P_A is the matrix P restricted to A × A. So P_A is a sub-stochastic matrix, since the row sums may be less than 1. Recall also that P_A^n means (P_A)^n, not (P^n)_A; in general these matrices are different.

That is, P_A^n(x, y) is the probability of going from state x to y in n steps, remaining in A all the while. In terms of the state graph of X, it is the sum of products of probabilities along paths of length n from x to y that stay inside A.
P = ⎡  ⋯    ⋯    ⋯  ⎤
    ⎢  0   1/4  3/4 ⎥  (16.3.37)
    ⎣  1    0    0  ⎦
3. Find P².
5. Suppose that X₀ has the uniform distribution on S. Find the probability density function of X₂.
Answer
2. P_A²
3. (P²)_A
Proof
Answer
For n ∈ N,

P^n = [1/(p + q)] ⎡ q + p(1 − p − q)^n   p − p(1 − p − q)^n ⎤
                  ⎣ q − q(1 − p − q)^n   p + q(1 − p − q)^n ⎦  (16.3.42)
Proof
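The closed form (16.3.42) can be checked against a direct matrix power. The sketch below (NumPy, with arbitrary choices of p, q, and n) compares the two.

```python
import numpy as np

p, q, n = 0.2, 0.7, 5          # arbitrary parameter choices
P = np.array([[1 - p, p],
              [q, 1 - q]])

# Closed form from (16.3.42).
r = (1 - p - q)**n
closed = np.array([[q + p * r, p - p * r],
                   [q - q * r, p + q * r]]) / (p + q)

print(np.abs(closed - np.linalg.matrix_power(P, n)).max())  # ~0
```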
As n → ∞,

P^n → [1/(p + q)] ⎡ q  p ⎤
                  ⎣ q  p ⎦  (16.3.44)
Proof
Open the simulation of the two-state, discrete-time Markov chain. For various values of p and q, and different initial states, run
the simulation 1000 times. Compare the relative frequency distribution to the limiting distribution, and in particular, note the
rate of convergence. Be sure to try the case p = q = 0.01
Proof
Proof
In spite of its simplicity, the two state chain illustrates some of the basic limiting behavior and the connection with invariant
distributions that we will study in general in a later section.
X is a Markov chain on S with transition probability matrix P given by P (x, y) = f (y) for (x, y) ∈ S × S . Also, f is
invariant for P .
Proof
As a Markov chain, the process X is not very interesting, although of course it is very interesting in other ways. Suppose now that
S = Z, the set of integers, and consider the partial sum process (or random walk) Y associated with X:

Yₙ = ∑_{i=0}^n X_i,  n ∈ N  (16.3.49)
Y is a Markov chain on Z with transition probability matrix Q given by Q(x, y) = f (y − x) for (x, y) ∈ Z × Z .
Proof
Thus the probability density function f governs the distribution of a step size of the random walker on Z.
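Since the steps are independent with common density f, the density of Yₙ is an iterated convolution of f with itself. The sketch below (NumPy, with an arbitrary hypothetical step density on a finite set of offsets) illustrates.

```python
import numpy as np

# Hypothetical step density on offsets {-1, 0, 1, 2}: index i <-> step i - 1.
f = np.array([0.1, 0.3, 0.2, 0.4])

# Density of the sum of three independent steps: f * f * f (convolution).
g = f.copy()
for _ in range(2):
    g = np.convolve(g, f)

print(g.sum())       # the convolution is again a probability density
print(len(g))        # 10 possible values, offsets -3 through 6
```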
Consider the special case of the random walk on Z with f (1) = p and f (−1) = 1 − p , where p ∈ (0, 1).
1. Give the transition matrix Q explicitly.
2. Give Q^n explicitly for n ∈ N.
Answer
This special case is the simple random walk on Z. When p = 1/2 we have the simple, symmetric random walk. The simple random walk on Z is studied in more detail in the section on random walks on graphs. The simple symmetric random walk is studied in more detail in the chapter on Bernoulli Trials.
Recall that the matrix P is doubly stochastic if ∑_{u∈S} P(x, u) = 1 and ∑_{u∈S} P(u, y) = 1 for all x, y ∈ S; that is, the columns as well as the rows sum to 1.
Suppose that X is a Markov chain on a finite state space S with doubly stochastic transition matrix P . Then the uniform
distribution on S is invariant.
Proof
Suppose that X = (X₀, X₁, …) is the Markov chain with state space S = {−1, 0, 1} and with transition matrix

P = ⎡ 1/2  1/2   0  ⎤
    ⎢  0   1/2  1/2 ⎥  (16.3.56)
    ⎣ 1/2   0   1/2 ⎦
4. Show that the uniform distribution on S is the only invariant distribution for X.
5. Suppose that X₀ has the uniform distribution on S. For n ∈ N, find E(Xₙ) and var(Xₙ).
Proof
Recall that a matrix M indexed by a countable set S is symmetric if M (x, y) = M (y, x) for all x, y ∈ S .
The converse is not true. The doubly stochastic matrix in the exercise above is not symmetric. But since a symmetric, stochastic
matrix on a finite state space is doubly stochastic, the uniform distribution is invariant.
Suppose that X = (X₀, X₁, …) is the Markov chain with state space S = {−1, 0, 1} and with transition matrix

P = ⎡ 1    0    0  ⎤
    ⎢ 0   1/4  3/4 ⎥  (16.3.60)
    ⎣ 0   3/4  1/4 ⎦
Proof
Special Models
The Markov chains in the following exercises model interesting processes that are studied in separate sections.
Read the introduction to the branching chain.
This page titled 16.3: Introduction to Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.4: Transience and Recurrence for Discrete-Time Chains
The study of discrete-time Markov chains, particularly the limiting behavior, depends critically on the random times between visits
to a given state. The nature of these random times leads to a fundamental dichotomy of the states.
Basic Theory
As usual, our starting point is a probability space (Ω, F , P), so that Ω is the sample space, F the σ-algebra of events, and P the
probability measure on (Ω, F). Suppose now that X = (X₀, X₁, X₂, …) is a (homogeneous) discrete-time Markov chain with (countable) state space S and transition probability matrix P, so that P^n(x, y) = P(Xₙ = y ∣ X₀ = x) for x, y ∈ S and n ∈ N. Let Fₙ = σ{X₀, X₁, …, Xₙ}, the σ-algebra of events defined by the chain up to time n ∈ N, so that F = (F₀, F₁, …) is the natural filtration associated with X.
For A ⊆ S, let τ_A = min{n ∈ N₊ : Xₙ ∈ A}, the first positive time that the chain is in A. Since the chain may never enter A, the random variable τ_A takes values in N₊ ∪ {∞} (recall our convention that the minimum of the empty set is ∞). Recall also that τ_A is a stopping time for X. That is, {τ_A = n} ∈ Fₙ for n ∈ N₊. Intuitively, this means that we can tell whether τ_A = n by observing the chain up to time n. This is clearly the case, since explicitly

{τ_A = n} = {X₁ ∉ A, …, X_{n−1} ∉ A, Xₙ ∈ A}  (16.4.3)
When A = {x} for x ∈ S, we will simplify the notation to τ_x. This random variable gives the first positive time that the chain is in state x. When the chain enters a set of states A for the first time, the chain must visit some state in A for the first time, so it's clear that

τ_A = min{τ_x : x ∈ A},  A ⊆ S  (16.4.4)
Next we define two functions on S that are related to the hitting times. For x ∈ S, A ⊆ S, and n ∈ N₊, define
1. Hₙ(x, A) = P(τ_A = n ∣ X₀ = x)
2. H(x, A) = P(τ_A < ∞ ∣ X₀ = x)
So H(x, A) = ∑_{n=1}^∞ Hₙ(x, A).
Note that n ↦ Hₙ(x, A) is the probability density function of τ_A, given X₀ = x, except that the density function may be defective in the sense that the sum H(x, A) may be less than 1, in which case of course 1 − H(x, A) = P(τ_A = ∞ ∣ X₀ = x). Again, when A = {y}, we will simplify the notation to Hₙ(x, y) and H(x, y), respectively. In particular, H(x, x) is the probability, starting at x, that the chain eventually returns to x. If x ≠ y, H(x, y) is the probability, starting at x, that the chain eventually reaches y. Just knowing when H(x, y) is 0, positive, or 1 will turn out to be of considerable importance in the overall structure and limiting behavior of the chain. As a function on S², we will refer to H as the hitting matrix of X. Note however, that unlike the transition matrix P, we do not have the structure of a kernel. That is, A ↦ H(x, A) is not a measure, so in particular, it is generally not true that H(x, A) = ∑_{y∈A} H(x, y). The same remarks apply to Hₙ for n ∈ N₊. However, there are interesting relationships between the hitting matrices and the transition matrices.
Proof
The following result gives a basic relationship between the sequence of hitting probabilities and the sequence of transition
probabilities.
16.4.1 https://stats.libretexts.org/@go/page/10291
Suppose that (x, y) ∈ S². Then

P^n(x, y) = ∑_{k=1}^n H_k(x, y) P^{n−k}(y, y),  n ∈ N₊  (16.4.6)
Proof
Proof
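The decomposition (16.4.6) can be verified numerically: compute H_k(⋅, y) by the first-step recursion H_k(z, y) = ∑_{w≠y} P(z, w) H_{k−1}(w, y), with H_1(z, y) = P(z, y), and compare both sides. The sketch below uses NumPy and a hypothetical three-state chain.

```python
import numpy as np

# Hypothetical 3-state chain for illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
x, y, n = 0, 2, 6
m = P.shape[0]

# H[k, z] = H_k(z, y): probability that the first hit of y occurs at time k,
# starting from z. First step to w != y, then first hit from w at time k - 1.
H = np.zeros((n + 1, m))
H[1] = P[:, y]
for k in range(2, n + 1):
    for z in range(m):
        H[k, z] = sum(P[z, w] * H[k - 1, w] for w in range(m) if w != y)

lhs = np.linalg.matrix_power(P, n)[x, y]
rhs = sum(H[k, x] * np.linalg.matrix_power(P, n - k)[y, y] for k in range(1, n + 1))
print(abs(lhs - rhs))  # ~0
```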
Let x ∈ S .
1. State x is recurrent if H (x, x) = 1.
2. State x is transient if H (x, x) < 1.
Thus, starting in a recurrent state, the chain will, with probability 1, eventually return to the state. As we will see, the chain will
return to the state infinitely often with probability 1, and the times of the visits will form the arrival times of a renewal process.
This will turn out to be the critical observation in the study of the limiting behavior of the chain. By contrast, if the chain starts in a
transient state, then there is a positive probability that the chain will never return to the state.
N_A = ∑_{n=1}^∞ 1(X_n ∈ A) (16.4.12)
Note that N_A takes values in N ∪ {∞}. We will mostly be interested in the special case A = {x} for x ∈ S, and in this case, we
will simplify the notation to N_x. Next, let G(x, A) = E(N_A ∣ X_0 = x) = ∑_{n=1}^∞ P^n(x, A) for x ∈ S and A ⊆ S.
Proof
Thus G(x, A) is the expected number of visits to A at positive times. As usual, when A = {y} for y ∈ S we simplify the notation
to G(x, y), and then more generally we have G(x, A) = ∑_{y∈A} G(x, y) for A ⊆ S. So, as a matrix on S², G = ∑_{n=1}^∞ P^n. The
matrix G is closely related to the potential matrix R of X, given by R = ∑_{n=0}^∞ P^n. So R = I + G, and R(x, y) gives the
expected number of visits to y ∈ S at all times (not just positive times), starting at x ∈ S. The matrix G is more useful for our
purposes in this section.
The distribution of N_y has a simple representation in terms of the hitting probabilities. Note that because of the Markov property
and the time homogeneous property, whenever the chain reaches state y, the future behavior is independent of the past and is
stochastically the same as the chain starting in state y at time 0. This is the critical observation in the proof of the following
theorem.
If x, y ∈ S then
1. P(N_y = 0 ∣ X_0 = x) = 1 − H(x, y)
2. P(N_y = n ∣ X_0 = x) = H(x, y) [H(y, y)]^{n−1} [1 − H(y, y)] for n ∈ N_+
Proof
Figure 16.4.1 : Visits to state y starting in state x
The essence of the proof is illustrated in the graphic above. The thick lines are intended as reminders that these are not one-step
transitions, but rather represent all paths between the given vertices. Note that in the special case that x = y we have
P(N_x = n ∣ X_0 = x) = [H(x, x)]^n [1 − H(x, x)], n ∈ N (16.4.15)
In all cases, the counting variable N_y has essentially a geometric distribution, but the distribution may well be defective, with some
of the probability mass at ∞. The behavior is quite different depending on whether y is transient or recurrent.
Proof
Note that there is an invertible relationship between the matrix H and the matrix G; if we know one we can compute the other. In
particular, we can characterize the transience or recurrence of a state in terms of G. Here is our summary so far:
Let x ∈ S .
1. State x is transient if and only if H (x, x) < 1 if and only if G(x, x) < ∞.
2. State x is recurrent if and only if H (x, x) = 1 if and only if G(x, x) = ∞.
Of course, the classification also holds for the potential matrix R = I +G . That is, state x ∈ S is transient if and only if
R(x, x) < ∞ and state x is recurrent if and only if R(x, x) = ∞.
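As a numerical sketch of this criterion (the 3-state chain below is an illustrative example, not one from the text): states 0 and 1 are transient and state 2 is absorbing, so the truncated partial sums of P^n(x, x) converge for x = 0 but grow without bound for x = 2.

```python
# Illustrative chain (not from the text): states 0 and 1 are transient,
# state 2 is absorbing (hence recurrent).
P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.0, 0.0, 1.0]]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def G_diag(P, x, N=200):
    # truncated sum of P^n(x, x) for n = 1, ..., N, approximating G(x, x)
    total, Pn = 0.0, P
    for _ in range(N):
        total += Pn[x][x]
        Pn = mat_mul(Pn, P)
    return total

G00 = G_diag(P, 0)   # converges (to 1/3 here): state 0 is transient
G22 = G_diag(P, 2)   # equals the truncation length N: state 2 is recurrent
```

Here G(0, 0) = 1/3 < ∞ while the partial sums at the absorbing state grow linearly in the truncation length, consistent with the dichotomy above.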
Relations
The hitting probabilities suggest an important relation on the state space S .
For (x, y) ∈ S², we say that x leads to y and we write x → y if either x = y or H(x, y) > 0.
It follows immediately from the result above that x → y if and only if P^n(x, y) > 0 for some n ∈ N. In terms of the state graph of
the chain, x → y if and only if x = y or there is a directed path from x to y. Note that the leads to relation is reflexive by
definition: x → x for every x ∈ S. The relation has another important property as well.
The leads to relation naturally suggests a couple of other definitions that are important.
Suppose that A is a nonempty closed subset of S. Then
1. P_A, the restriction of P to A × A, is a transition probability matrix on A.
3. (P^n)_A = (P_A)^n for n ∈ N.
Proof
Of course, the entire state space S is closed by definition. If it is also irreducible, we say the Markov chain X itself is irreducible.
Recall that for a nonempty subset A of S and for n ∈ N, the notation P_A^n refers to (P_A)^n and not (P^n)_A. In general, these are not
the same: (P_A)^n(x, y) is the probability of going from x to y in n steps, remaining in A all the while. But if A is closed, then as noted in part (3), this is just
P^n(x, y).
n
Suppose that A is a nonempty subset of S . Then cl(A) = {y ∈ S : x → y for some x ∈ A} is the smallest closed set
containing A , and is called the closure of A . That is,
1. cl(A) is closed.
2. A ⊆ cl(A) .
3. If B is closed and A ⊆ B then cl(A) ⊆ B.
Proof
Recall that for a fixed positive integer k, P^k is also a transition probability matrix, and in fact governs the k-step Markov chain
(X_0, X_k, X_{2k}, …). It follows that we could consider the leads to relation for this chain, and all of the results above would still
hold (relative, of course, to the k-step chain). Occasionally we will need to consider this relation, which we will denote by →_k.
Proof
By combining the leads to relation → with its inverse, the comes from relation ←, we can obtain another very useful relation.
By definition, this relation is symmetric: if x ↔ y then y ↔ x . From our work above, it is also reflexive and transitive. Thus, the to
and from relation is an equivalence relation. Like all equivalence relations, it partitions the space into mutually disjoint equivalence
classes. We will denote the equivalence class of a state x ∈ S by
[x] = {y ∈ S : x ↔ y} (16.4.18)
Thus, for any two states x, y ∈ S , either [x] = [y] or [x] ∩ [y] = ∅ , and moreover, ⋃ x∈S
[x] = S .
Figure 16.4.2 : The equivalence relation partitions S into mutually disjoint equivalence classes
Proof
The to and from equivalence relation is very important because many interesting state properties turn out in fact to be class
properties, shared by all states in a given equivalence class. In particular, the recurrence and transience properties are class
properties.
From the last theorem, note that if x is recurrent, then all states in [x] are also recurrent. Thus, for each equivalence class, either all
states are transient or all states are recurrent. We can therefore refer to transient or recurrent classes as well as states.
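The partition into equivalence classes can be computed mechanically from the transition matrix, since x ↔ y exactly when each state is reachable from the other in the state graph. A minimal sketch in Python (the 4-state chain is an illustrative example, not one from the text):

```python
# Illustrative 4-state chain: {0, 1} is one class, {2} and {3} are singletons.
P = [[0.5, 0.5, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0, 1.0]]
n = len(P)

# reach[x][y] is True when x -> y (x = y, or a directed path from x to y)
reach = [[x == y or P[x][y] > 0 for y in range(n)] for x in range(n)]
for k in range(n):            # Warshall's transitive-closure algorithm
    for x in range(n):
        for y in range(n):
            reach[x][y] = reach[x][y] or (reach[x][k] and reach[k][y])

# x <-> y when x -> y and y -> x; group states into equivalence classes
classes = []
for x in range(n):
    cls = frozenset(y for y in range(n) if reach[x][y] and reach[y][x])
    if cls not in classes:
        classes.append(cls)
```

For this chain the classes are {0, 1}, {2}, and {3}; since 1 leads to 2 and 2 leads to the absorbing state 3, the classes {0, 1} and {2} are transient and {3} is recurrent.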
Thus, the Markov chain X will have a collection (possibly empty) of recurrent equivalence classes {A_j : j ∈ J} where J is a
countable index set. Each A_j is irreducible. Let B denote the set of all transient states. The set B may be empty or may consist of a
number of equivalence classes, but the class structure of B is usually not important to us. If the chain starts in A_j for some j ∈ J
then the chain remains in A_j forever, visiting each state infinitely often with probability 1. If the chain starts in B, then the chain
may stay in B forever (but only if B is infinite) or may enter one of the recurrent classes A_j, never to escape. However, in either
case, the chain will visit a given transient state only finitely many times with probability 1. This basic structure is known as the
canonical decomposition of the chain, and is shown in graphical form below. The edges from B are in gray to indicate that these
transitions may not exist.
For a nonempty set A ⊆ S, let g_A(x) = P(X_n ∈ A for all n ∈ N ∣ X_0 = x) for x ∈ A, the probability that the chain stays in A
forever, starting at x. The staying probability function g_A is an interesting complement to the hitting matrix studied above. The following result
characterizes this function and provides a method that can be used to compute it, at least in some cases.
For A ⊂ S, g_A is the largest function on A that takes values in [0, 1] and satisfies g_A = P_A g_A. Moreover, either g_A = 0 (the
zero function on A) or sup{g_A(x) : x ∈ A} = 1.
Proof
Note that the characterization in the last result includes a zero-one law of sorts: either the probability that the chain stays in A
forever is 0 for every initial state x ∈ A , or we can find states in A for which the probability is arbitrarily close to 1. The next two
results explore the relationship between the staying function and recurrence.
Suppose that X is an irreducible, recurrent chain with state space S. Then g_A = 0 for every proper subset A of S.
Proof
Suppose that X is an irreducible Markov chain with state space S and transition probability matrix P. If there exists a state x
such that g_A = 0 where A = S ∖ {x}, then X is recurrent.
Proof
More generally, suppose that X is a Markov chain with state space S and transition probability matrix P. The last two theorems
can be used to test whether an irreducible equivalence class C is recurrent or transient. We fix a state x ∈ C and set A = C ∖ {x}.
We then try to solve the equation g = P_A g on A. If the only solution taking values in [0, 1] is 0, then the class C is recurrent by
the previous result. If there are nontrivial solutions, then C is transient. Often we try to choose x to make the computations easy.
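As a check of this characterization on a chain where the staying function is known in closed form (the asymmetric simple random walk on Z, with assumed step probabilities p = 2/3 to the right and q = 1/3 to the left, is an illustrative example): taking x = 0 and A = Z ∖ {0}, the staying probability for z ≥ 1 is g_A(z) = 1 − (q/p)^z, which satisfies g_A = P_A g_A and has supremum 1.

```python
p, q = 2/3, 1/3          # right/left step probabilities, p > 1/2
r = q / p

def g(z):                # staying probability in A = Z \ {0}, for z >= 1
    return 1.0 - r ** z

# Verify g = P_A g at z = 1: a step to 0 exits A, so that term drops out.
lhs1 = g(1)
rhs1 = p * g(2)

# ...and at interior points z >= 2: g(z) = p*g(z+1) + q*g(z-1).
max_err = max(abs(g(z) - (p * g(z + 1) + q * g(z - 1))) for z in range(2, 50))

sup_approx = g(40)       # approaches 1 as z grows, as the theorem predicts
```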
G_B satisfies the equation G_B = P_B + P_B G_B and is the smallest nonnegative solution. If B is finite then
G_B = (I_B − P_B)^{−1} P_B
Proof
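When B is finite, the matrix formula can be carried out directly. A minimal sketch (the chain is an illustrative example, not one from the text): take transient states B = {0, 1} with P(0, 1) = P(1, 0) = 1/2, each state escaping to an absorbing state with probability 1/2.

```python
from fractions import Fraction as F

# Restriction of P to the transient set B = {0, 1}.
PB = [[F(0), F(1, 2)],
      [F(1, 2), F(0)]]

# G_B = (I_B - P_B)^{-1} P_B, using the 2x2 inverse formula:
# inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / (ad - bc).
a, b = 1 - PB[0][0], -PB[0][1]
c, d = -PB[1][0], 1 - PB[1][1]
det = a * d - b * c
inv = [[d / det, -b / det], [-c / det, a / det]]
GB = [[sum(inv[i][k] * PB[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
# GB[x][y] is G(x, y), the expected number of visits to y starting at x.
```

For this chain G_B = [[1/3, 2/3], [2/3, 1/3]]: starting at 0, the chain makes on average 1/3 of a return visit to 0 and 2/3 of a visit to 1 before absorption.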
Now that we can compute G_B, we can also compute H_B using the result above. All that remains is for us to compute the hitting
probability H(x, y) when x is transient and y is recurrent. The first thing to notice is that the hitting probability is a class property.
Suppose that x is transient and that A is a recurrent class. Then H (x, y) = H (x, A) for y ∈ A .
That is, starting in the transient state x ∈ S, the hitting probability to y is constant for y ∈ A, and is just the hitting probability to
the class A. As before, let B denote the set of transient states and suppose that A is a recurrent equivalence class. Let h_A denote
the function on B that gives the hitting probability to class A, and let p_A denote the function on B that gives the probability of
entering A in one step: p_A(x) = P(x, A) for x ∈ B. Then
h_A = p_A + G_B p_A.
Proof
This result is adequate if we have already computed G_B (using the result above, for example). However, we might just want to
compute h_A directly.
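As a small worked sketch (the chain is an illustrative example, not one from the text): with transient states B = {0, 1}, absorbing states 2 and 3, P(0, ·) = (0, 1/2, 1/2, 0) and P(1, ·) = (1/2, 0, 0, 1/2), first-step analysis gives h_A(0) = 2/3 and h_A(1) = 1/3 for A = {2}, and the formula h_A = p_A + G_B p_A reproduces this.

```python
from fractions import Fraction as F

PB = [[F(0), F(1, 2)],          # transition matrix restricted to B = {0, 1}
      [F(1, 2), F(0)]]
pA = [F(1, 2), F(0)]            # p_A(x) = P(x, A): one-step entry into A = {2}

# G_B = (I - P_B)^{-1} P_B via the 2x2 inverse formula
a, b = 1 - PB[0][0], -PB[0][1]
c, d = -PB[1][0], 1 - PB[1][1]
det = a * d - b * c
inv = [[d / det, -b / det], [-c / det, a / det]]
GB = [[sum(inv[i][k] * PB[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]

# h_A = p_A + G_B p_A
hA = [pA[i] + sum(GB[i][j] * pA[j] for j in range(2)) for i in range(2)]
```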
Examples and Applications
Finite Chains
Consider a Markov chain with state space S = {a, b, c, d} and transition matrix P given below:

P = ⎡ 1/3  2/3   0    0  ⎤
    ⎢  1    0    0    0  ⎥
    ⎢  0    0    1    0  ⎥
    ⎣ 1/4  1/4  1/4  1/4 ⎦    (16.4.22)
Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below:

P = ⎡  0    0   1/2   0   1/2   0  ⎤
    ⎢  0    0    0    0    0    1  ⎥
    ⎢ 1/4  1/2  1/4   0    0    0  ⎥
    ⎢  0    0    0    1    0    0  ⎥
    ⎢  0   1/3  2/3   0    0    0  ⎥
    ⎣ 1/4  1/4  1/4  1/4   0    0  ⎦    (16.4.23)
Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below:

P = ⎡ 1/2  1/2   0    0    0    0  ⎤
    ⎢ 1/4  3/4   0    0    0    0  ⎥
    ⎢ 1/4   0   1/2  1/4   0    0  ⎥
    ⎢ 1/4   0   1/4  1/4   0   1/4 ⎥
    ⎢  0    0    0    0   1/2  1/2 ⎥
    ⎣  0    0    0    0   1/2  1/2 ⎦    (16.4.24)
Special Models
Read again the definitions of the Ehrenfest chains and the Bernoulli-Laplace chains. Note that since these chains are
irreducible and have finite state spaces, they are recurrent.
Read the discussion on random walks on Z^k in the section on random walks on graphs.
Read the discussion on extinction and explosion in the section on the branching chain.
Read the discussion on recurrence and transience in the section on queuing chains.
Read the discussion on recurrence and transience in the section on birth-death chains.
This page titled 16.4: Transience and Recurrence for Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed,
and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a
detailed edit history is available upon request.
16.5: Periodicity of Discrete-Time Chains
A state in a discrete-time Markov chain is periodic if the chain can return to the state only at multiples of some integer larger than
1. Periodic behavior complicates the study of the limiting behavior of the chain. As we will see in this section, we can eliminate the
periodic behavior by considering the d-step chain, where d ∈ N_+ is the period, but only at the expense of introducing additional
equivalence classes. Thus, in a sense, we can trade one form of complexity for another.
Basic Theory
Definitions and Basic Results
As usual, our starting point is a (time homogeneous) discrete-time Markov chain X = (X0 , X1 , X2 , …) with (countable) state
space S and transition probability matrix P .
Thus, starting in x, the chain can return to x only at multiples of the period d , and d is the largest such integer. Perhaps the most
important result is that period, like recurrence and transience, is a class property, shared by all states in an equivalence class under
the to and from relation.
Thus, the definitions of period, periodic, and aperiodic apply to equivalence classes as well as individual states. When the chain is
irreducible, we can apply these terms to the entire chain.
Suppose that x ∈ S . If P (x, x) > 0 then x (and hence the equivalence class of x) is aperiodic.
Proof
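The period of a state can be computed mechanically as the greatest common divisor of the set {n ∈ N_+ : P^n(x, x) > 0}, truncated at a modest horizon. A minimal sketch in Python (the two 3-state chains below are illustrative examples, not chains from the text):

```python
from math import gcd

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def period(P, x, horizon=30):
    # gcd of the positive times n <= horizon with P^n(x, x) > 0
    d, Pn = 0, P
    for n in range(1, horizon + 1):
        if Pn[x][x] > 0:
            d = gcd(d, n)
        Pn = mat_mul(Pn, P)
    return d

# Deterministic cycle 0 -> 1 -> 2 -> 0: every return takes a multiple of 3 steps.
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
# Adding a shortcut 0 -> 2 creates return loops of lengths 2 and 3, so gcd is 1.
mixed = [[0, 0.5, 0.5], [0, 0, 1], [1, 0, 0]]

d_cycle = period(cycle, 0)   # period 3
d_mixed = period(mixed, 0)   # aperiodic
```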
For the remainder of this discussion, we assume that the chain is irreducible, for if this were not the case, we could simply restrict our attention to one of the irreducible
equivalence classes. Our exposition will be easier and cleaner if we recall the congruence equivalence relation modulo d on Z,
which in turn is based on the division partial order. For n, m ∈ Z, n ≡_d m if and only if d ∣ (n − m), equivalently n − m is an
integer multiple of d, equivalently m and n have the same remainder after division by d. The basic fact that we will need is that
≡_d is preserved under sums and differences. That is, if m, n, p, q ∈ Z and if m ≡_d n and p ≡_d q, then m + p ≡_d n + q and
m − p ≡_d n − q.
Suppose that x, y ∈ S.
1. If x ∈ A_j and y ∈ A_k for j, k ∈ {0, 1, …, d − 1}, then P^n(x, y) > 0 for some n ≡_d k − j.
2. Conversely, if P^n(x, y) > 0 for some n ∈ N, then there exist j, k ∈ {0, 1, …, d − 1} such that x ∈ A_j, y ∈ A_k, and
n ≡_d k − j.
Proof
16.5.1 https://stats.libretexts.org/@go/page/10292
(A_0, A_1, …, A_{d−1}) are the equivalence classes for the d-step to and from relation ↔_d that governs the d-step chain.
The sets (A_0, A_1, …, A_{d−1}) are known as the cyclic classes. The basic structure of the chain is shown in the state diagram below:
Consider the Markov chain with state space S = {1, 2, 3} and transition matrix P given below:

P = ⎡ 0  1/2  1/2 ⎤
    ⎢ 0   0    1  ⎥
    ⎣ 1   0    0  ⎦    (16.5.3)
1. Sketch the state graph and show that the chain is irreducible.
2. Show that the chain is aperiodic.
3. Note that P (x, x) = 0 for all x ∈ S .
Answer
Consider the Markov chain with state space S = {1, 2, 3, 4, 5, 6, 7} and transition matrix P given below:

P = ⎡  0    0   1/2  1/4  1/4   0    0  ⎤
    ⎢  0    0   1/3  2/3   0    0    0  ⎥
    ⎢  0    0    0    0    0   1/3  2/3 ⎥
    ⎢  0    0    0    0    0   1/2  1/2 ⎥
    ⎢  0    0    0    0    0   3/4  1/4 ⎥
    ⎢ 1/2  1/2   0    0    0    0    0  ⎥
    ⎣ 1/4  3/4   0    0    0    0    0  ⎦    (16.5.4)
1. Sketch the state graph and show that the chain is irreducible.
2. Find the period d .
3. Find P^d.
Special Models
Review the definition of the basic Ehrenfest chain. Show that this chain has period 2, and find the cyclic classes.
Review the definition of the modified Ehrenfest chain. Show that this chain is aperiodic.
Review the definition of the simple random walk on Z^k. Show that the chain is periodic with period 2, and find the cyclic
classes.
This page titled 16.5: Periodicity of Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.6: Stationary and Limiting Distributions of Discrete-Time Chains
In this section, we study some of the deepest and most interesting parts of the theory of discrete-time Markov chains, involving two
different but complementary ideas: stationary distributions and limiting distributions. The theory of renewal processes plays a
critical role.
Basic Theory
As usual, our starting point is a (time homogeneous) discrete-time Markov chain X = (X_0, X_1, X_2, …) with (countable) state
space S and transition probability matrix P. In the background, of course, is a probability space (Ω, F, P) so that Ω is the sample
space, F the σ-algebra of events, and P the probability measure on (Ω, F). For n ∈ N, let F_n = σ{X_0, X_1, …, X_n}, the σ-
algebra of events determined by the chain up to time n, so that F = {F_0, F_1, …} is the natural filtration associated with X.
N_y = ∑_{i=1}^∞ 1(X_i = y) (16.6.2)
is the total number of visits to y at positive times, one of the important random variables that we studied in the section on
transience and recurrence. For n ∈ N_+, we denote the time of the nth visit to y by
τ_{y,n} = min{k ∈ N_+ : N_{y,k} = n} (16.6.3)
where N_{y,k} = ∑_{i=1}^k 1(X_i = y) and, as usual, we define min(∅) = ∞. Note that τ_{y,1} is the time of the first visit to y, which we denoted simply by τ_y in the
section on transience and recurrence. The times of the visits to y are stopping times for X. That is, {τ_{y,n} = k} ∈ F_k for n ∈ N_+
and k ∈ N. Recall also the definition of the hitting probability to state y starting in state x:
H(x, y) = P(τ_y < ∞ ∣ X_0 = x), (x, y) ∈ S² (16.6.4)
As noted in the proof, (τ_{y,1}, τ_{y,2}, …) is the sequence of arrival times and (N_{y,1}, N_{y,2}, …) is the associated sequence of counting
variables for the embedded renewal process associated with the recurrent state y. The corresponding renewal function, given
X_0 = x, is the function n ↦ G_n(x, y) where
G_n(x, y) = E(N_{y,n} ∣ X_0 = x) = ∑_{k=1}^n P^k(x, y), n ∈ N (16.6.5)
Thus G_n(x, y) is the expected number of visits to y in the first n positive time units, starting in state x. Note that
G_n(x, y) → G(x, y) as n → ∞ where G is the potential matrix that we studied previously. This matrix gives the expected total
number of visits to state y ∈ S, at positive times, starting in state x ∈ S:
G(x, y) = E(N_y ∣ X_0 = x) = ∑_{k=1}^∞ P^k(x, y) (16.6.6)
16.6.1 https://stats.libretexts.org/@go/page/10293
Limiting Behavior
The limit theorems of renewal theory can now be used to explore the limiting behavior of the Markov chain. Let
μ(y) = E(τ_y ∣ X_0 = y) denote the mean return time to state y, starting in y. In the following results, it may be the case that
μ(y) = ∞, in which case we interpret 1/μ(y) as 0.
Proof
Note that (1/n) N_{y,n} = (1/n) ∑_{k=1}^n 1(X_k = y) is the average number of visits to y in the first n positive time units.
Proof
Note that (1/n) G_n(x, y) = (1/n) ∑_{k=1}^n P^k(x, y) is the expected average number of visits to y during the first n positive time units,
starting at x.
Proof
Note that H (y, y) = 1 by the very definition of a recurrent state. Thus, when x = y , the law of large numbers above gives
convergence with probability 1, and the first and second renewal theory limits above are simply 1/μ(y). By contrast, we already
know the corresponding limiting behavior when y is transient.
Suppose that y is transient. For every x ∈ S,
1. (1/n) N_{y,n} → 0 as n → ∞ with probability 1, given X_0 = x
2. (1/n) G_n(x, y) = (1/n) ∑_{k=1}^n P^k(x, y) → 0 as n → ∞
3. P^n(x, y) → 0 as n → ∞
Proof
On the other hand, if y is transient then P(τ_y = ∞ ∣ X_0 = y) > 0 by the very definition of transience. Thus μ(y) = ∞, and so
the results in parts (b) and (c) agree with the corresponding results above for a recurrent state. Here is a summary.
For x, y ∈ S,
(1/n) G_n(x, y) = (1/n) ∑_{k=1}^n P^k(x, y) → H(x, y)/μ(y) as n → ∞ (16.6.11)
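As a numerical illustration of this summary (the two-state chain below, with assumed parameters p = 0.3 and q = 0.2, is an illustrative example, not one from the text): the chain is irreducible and positive recurrent, so H(0, 1) = 1 and μ(1) = (p + q)/p, and the Cesàro averages of P^k(0, 1) should approach p/(p + q) = 0.6.

```python
p, q = 0.3, 0.2
P = [[1 - p, p], [q, 1 - q]]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Cesaro average (1/n) * sum_{k=1}^{n} P^k(0, 1)
n, total, Pk = 2000, 0.0, P
for _ in range(n):
    total += Pk[0][1]
    Pk = mat_mul(Pk, P)
cesaro = total / n

limit = p / (p + q)   # H(0, 1)/mu(1), with H(0, 1) = 1 and mu(1) = (p + q)/p
```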
Let x ∈ S .
1. State x is positive recurrent if μ(x) < ∞ .
2. If x is recurrent but μ(x) = ∞ then state x is null recurrent.
On the other hand, it is possible to have P(τ_x < ∞ ∣ X_0 = x) = 1, so that x is recurrent, and also E(τ_x ∣ X_0 = x) = ∞, so that x
is null recurrent. Simply put, a random variable can be finite with probability 1, but can have infinite expected value. A classic
example is the Pareto distribution with shape parameter a ∈ (0, 1).
Like recurrence/transience, and period, the null/positive recurrence property is a class property.
Thus, the terms positive recurrent and null recurrent can be applied to equivalence classes (under the to and from equivalence
relation), as well as individual states. When the chain is irreducible, the terms can be applied to the chain as a whole.
Recall that a nonempty set of states A is closed if x ∈ A and x → y implies y ∈ A . Here are some simple results for a finite,
closed set of states.
In particular, a Markov chain with a finite state space cannot have null recurrent states; every state must be transient or positive
recurrent.
Suppose that X is irreducible and aperiodic. Then for every y ∈ S,
P^n(x, y) → 1/μ(y) as n → ∞ for every x ∈ S (16.6.16)
Note in particular that the limit is independent of the initial state x. Of course in the transient case and in the null recurrent and
aperiodic case, the limit is 0. Only in the positive recurrent, aperiodic case is the limit positive, which motivates our next definition.
A Markov chain X that is irreducible, positive recurrent, and aperiodic, is said to be ergodic.
In the ergodic case, as we will see, X_n has a limiting distribution as n → ∞ that is independent of the initial distribution.
The behavior when the chain is periodic with period d ∈ {2, 3, …} is a bit more complicated, but we can understand this behavior
by considering the d-step chain X_d = (X_0, X_d, X_{2d}, …) that has transition matrix P^d. Essentially, this allows us to trade
periodicity (one form of complexity) for reducibility (another form of complexity). Specifically, recall that the d-step chain is
aperiodic but has d equivalence classes (A_0, A_1, …, A_{d−1}); and these are the cyclic classes of the original chain X.
Figure 16.6.1 : The cyclic classes of a chain with period d
The mean return time to state x for the d-step chain X_d is μ_d(x) = μ(x)/d.
Proof
Let i, j, k ∈ {0, 1, …, d − 1}.
1. P^{nd+k}(x, y) → d/μ(y) as n → ∞ if x ∈ A_i and y ∈ A_j and j = (i + k) mod d.
2. P^{nd+k}(x, y) → 0 as n → ∞ in all other cases.
Proof
Invariant Distributions
Our next goal is to see how the limiting behavior is related to invariant distributions. Suppose that f is a probability density
function on the state space S. Recall that f is invariant for P (and for the chain X) if f P = f. It follows immediately that
f P^n = f for every n ∈ N. Thus, if X_0 has probability density function f then so does X_n for each n ∈ N, and hence X is a
sequence of identically distributed random variables. A bit more generally, suppose that g : S → [0, ∞) is invariant for P, and let
C = ∑_{x∈S} g(x). If 0 < C < ∞ then f defined by f(x) = g(x)/C for x ∈ S is an invariant probability density function.
Proof
Note that if y is transient or null recurrent, then g(y) = 0. Thus, an invariant function with finite sum, and in particular an invariant
probability density function, must be concentrated on the positive recurrent states.
Suppose now that the chain X is irreducible. If X is transient or null recurrent, then from the previous result, the only nonnegative
functions that are invariant for P are functions that satisfy ∑_{x∈S} g(x) = ∞ and the function that is identically 0: g = 0. In
particular, the chain does not have an invariant distribution. On the other hand, if the chain is positive recurrent, then H(x, y) = 1
for all x, y ∈ S. Thus, from the previous result, the only possible invariant probability density function is the function f given by
f(x) = 1/μ(x) for x ∈ S. Any other nonnegative function g that is invariant for P and has finite sum is a multiple of f (and
indeed the multiple is the sum of the values). Our next goal is to show that f really is an invariant probability density function.
If X is an irreducible, positive recurrent chain then the function f given by f (x) = 1/μ(x) for x ∈ S is an invariant
probability density function for X.
Proof
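A numerical check of this result on an illustrative two-state chain (the parameters p = 0.3 and q = 0.2 are assumptions for the sketch, not from the text): the invariant pdf is f = (q/(p + q), p/(p + q)), and the mean return time μ(0), computed from the distribution of the first return time to 0, should equal 1/f(0).

```python
p, q = 0.3, 0.2
P = [[1 - p, p], [q, 1 - q]]
f = [q / (p + q), p / (p + q)]      # invariant pdf: solves f P = f

# Check invariance directly.
fP = [f[0] * P[0][j] + f[1] * P[1][j] for j in range(2)]

# Mean return time to 0: P(tau_0 = 1) = 1 - p and, for n >= 2,
# P(tau_0 = n) = p * (1 - q)**(n - 2) * q  (step to 1, linger there, step back).
mu0 = (1 - p) * 1.0
for n in range(2, 500):             # truncated series; the tail is negligible
    mu0 += n * p * (1 - q) ** (n - 2) * q
```

Here μ(0) = 1 + p/q = 2.5 and f(0) = 0.4, so indeed f(0) = 1/μ(0).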
In summary, an irreducible, positive recurrent Markov chain X has a unique invariant probability density function f given by
f(x) = 1/μ(x) for x ∈ S. We also now have a test for positive recurrence. An irreducible Markov chain X is positive recurrent if
and only if there exists a positive function g on S that is invariant for P and satisfies ∑_{x∈S} g(x) < ∞ (and then, of course,
f = g / ∑_{x∈S} g(x) is the invariant probability density function).
Consider now a general Markov chain X on S. If X has no positive recurrent states, then as noted earlier, there are no invariant
distributions. Thus, suppose that X has a collection of positive recurrent equivalence classes (A_i : i ∈ I) where I is a nonempty,
countable index set. The chain restricted to A_i is irreducible and positive recurrent for each i ∈ I, and hence has a unique invariant
probability density function f_i given by
f_i(x) = 1/μ(x), x ∈ A_i (16.6.21)
f is an invariant probability density function for X if and only if f has the form
f = ∑_{i∈I} p_i f_i
where (p_i : i ∈ I) is a probability density function on the index set I. That is, f(x) = p_i f_i(x) for i ∈ I and x ∈ A_i, and
f(x) = 0 otherwise.
Proof
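A sketch of the mixture construction on an illustrative chain with two positive recurrent classes A_1 = {0, 1} and A_2 = {2, 3} (the chain and the mixing weights are assumptions for the example, not from the text):

```python
# Block-diagonal chain: the classes {0, 1} and {2, 3} never communicate.
P = [[0.50, 0.50, 0.0, 0.0],
     [0.25, 0.75, 0.0, 0.0],
     [0.0,  0.0,  0.1, 0.9],
     [0.0,  0.0,  0.9, 0.1]]

f1 = [1/3, 2/3, 0.0, 0.0]   # invariant pdf of the chain restricted to {0, 1}
f2 = [0.0, 0.0, 0.5, 0.5]   # invariant pdf of the chain restricted to {2, 3}

p1, p2 = 0.6, 0.4           # hypothetical mixing weights, any pdf on {1, 2} works
f = [p1 * a + p2 * b for a, b in zip(f1, f2)]

# Verify that the mixture is invariant for the whole chain: f P = f.
fP = [sum(f[i] * P[i][j] for i in range(4)) for j in range(4)]
err = max(abs(fP[j] - f[j]) for j in range(4))
```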
Invariant Measures
Suppose that X is irreducible. In this section we are interested in general functions g : S → [0, ∞) that are invariant for X, so that
gP = g. A function g : S → [0, ∞) defines a positive measure ν on S by the simple rule
ν(A) = ∑_{x∈A} g(x), A ⊆ S
so in this sense, we are interested in invariant positive measures for X that may not be probability measures. Technically, g is the
density function of ν with respect to counting measure # on S.
From our work above, we know the situation if X is positive recurrent. In this case, there exists a unique invariant probability
density function f that is positive on S, and every other nonnegative invariant function g is a nonnegative multiple of f. In
particular, either g = 0, the zero function on S, or g is positive on S and satisfies ∑_{x∈S} g(x) < ∞.
We can generalize to chains that are simply recurrent, either null or positive. We will show that there exists a positive invariant
function that is unique, up to multiplication by positive constants. To set up the notation, recall that τ_x = min{k ∈ N_+ : X_k = x}
is the first positive time that the chain is in state x ∈ S. In particular, if the chain starts in x then τ_x is the time of the first return to
x. Define
γ_x(y) = E(∑_{n=0}^{τ_x−1} 1(X_n = y) ∣ X_0 = x), y ∈ S (16.6.28)
so that γ_x(y) is the expected number of visits to y before the first return to x, starting in x. Here is the existence result.
Suppose that X is recurrent and x ∈ S. Then
1. γ_x(x) = 1
2. γ_x is invariant for X
Proof
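For the two-state chain with parameters p and q (an illustrative example, not from the text), starting at x = 0: each excursion from 0 visits 0 exactly once at time 0 and visits 1 a geometric number of times, so γ_0 = (1, p/q), and a direct check confirms that γ_0 is invariant.

```python
p, q = 0.3, 0.2                     # assumed parameters for the sketch
P = [[1 - p, p], [q, 1 - q]]

# gamma_0(0) = 1; the excursion enters 1 with probability p and then makes
# a geometric(q) number of visits there, so gamma_0(1) = p * (1/q) = p/q.
gamma0 = [1.0, p / q]

# Verify invariance: gamma_0 P = gamma_0.
gamma0_P = [sum(gamma0[i] * P[i][j] for i in range(2)) for j in range(2)]
err = max(abs(gamma0_P[j] - gamma0[j]) for j in range(2))
```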
Suppose again that X is recurrent and that g : S → [0, ∞) is invariant for X. Then for fixed x ∈ S, g = g(x) γ_x.
Proof
Thus, suppose that X is null recurrent. Then there exists an invariant function g that is positive on S and satisfies
∑_{x∈S} g(x) = ∞. Every other nonnegative invariant function is a nonnegative multiple of g. In particular, either g = 0, the zero
function on S, or g is positive on S and satisfies ∑_{x∈S} g(x) = ∞. The section on reliability chains gives an example of the
invariant function for a null recurrent chain.
The situation is complicated when X is transient. In this case, there may or may not exist nonnegative invariant functions that are
not identically 0. When they do exist, they may not be unique (up to multiplication by nonnegative constants). But we still know
that there are no invariant probability density functions, so if g is a nonnegative function that is invariant for X then either g = 0
or ∑_{x∈S} g(x) = ∞. The section on random walks on graphs provides lots of examples of transient chains with nontrivial invariant
functions. In particular, the non-symmetric random walk on Z has a two-dimensional space of invariant functions.
Consider the two-state Markov chain with state space S = {0, 1} and transition matrix P given below, where p, q ∈ (0, 1):

P = ⎡ 1−p   p  ⎤
    ⎣  q   1−q ⎦    (16.6.39)
Consider a Markov chain with state space S = {a, b, c, d} and transition matrix P given below:

P = ⎡ 1/3  2/3   0    0  ⎤
    ⎢  1    0    0    0  ⎥
    ⎢  0    0    1    0  ⎥
    ⎣ 1/4  1/4  1/4  1/4 ⎦    (16.6.40)
Answer
Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below:

P = ⎡  0    0   1/2   0   1/2   0  ⎤
    ⎢  0    0    0    0    0    1  ⎥
    ⎢ 1/4  1/2  1/4   0    0    0  ⎥
    ⎢  0    0    0    1    0    0  ⎥
    ⎢  0   1/3  2/3   0    0    0  ⎥
    ⎣ 1/4  1/4  1/4  1/4   0    0  ⎦    (16.6.41)
Answer
Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below:

P = ⎡ 1/2  1/2   0    0    0    0  ⎤
    ⎢ 1/4  3/4   0    0    0    0  ⎥
    ⎢ 1/4   0   1/2  1/4   0    0  ⎥
    ⎢ 1/4   0   1/4  1/4   0   1/4 ⎥
    ⎢  0    0    0    0   1/2  1/2 ⎥
    ⎣  0    0    0    0   1/2  1/2 ⎦    (16.6.42)
Answer
Consider the Markov chain with state space S = {1, 2, 3, 4, 5, 6, 7} and transition matrix P given below:

P = ⎡  0    0   1/2  1/4  1/4   0    0  ⎤
    ⎢  0    0   1/3  2/3   0    0    0  ⎥
    ⎢  0    0    0    0    0   1/3  2/3 ⎥
    ⎢  0    0    0    0    0   1/2  1/2 ⎥
    ⎢  0    0    0    0    0   3/4  1/4 ⎥
    ⎢ 1/2  1/2   0    0    0    0    0  ⎥
    ⎣ 1/4  3/4   0    0    0    0    0  ⎦    (16.6.43)
1. Sketch the state digraph, and show that the chain is irreducible with period 3.
2. Identify the cyclic classes.
3. Find the invariant probability density function.
4. Find the mean return time to each state.
5. Find lim_{n→∞} P^{3n}.
6. Find lim_{n→∞} P^{3n+1}.
7. Find lim_{n→∞} P^{3n+2}.
Answer
Special Models
Read the discussion of invariant distributions and limiting distributions in the Ehrenfest chains.
Read the discussion of invariant distributions and limiting distributions in the Bernoulli-Laplace chain.
Read the discussion of positive recurrence and invariant distributions for the reliability chains.
Read the discussion of positive recurrence and limiting distributions for the birth-death chain.
Read the discussion of positive recurrence and for the queuing chains.
Read the discussion of positive recurrence and limiting distributions for the random walks on graphs.
This page titled 16.6: Stationary and Limiting Distributions of Discrete-Time Chains is shared under a CC BY 2.0 license and was authored,
remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts
platform; a detailed edit history is available upon request.
16.7: Time Reversal in Discrete-Time Chains
The Markov property, stated in the form that the past and future are independent given the present, essentially treats the past and
future symmetrically. However, there is a lack of symmetry in the fact that in the usual formulation, we have an initial time 0, but
not a terminal time. If we introduce a terminal time, then we can run the process backwards in time. In this section, we are
interested in the following questions:
Is the new process still Markov?
If so, how does the new transition probability matrix relate to the original one?
Under what conditions are the forward and backward processes stochastically the same?
Consideration of these questions leads to reversed chains, an important and interesting part of the theory of Markov chains.
Basic Theory
Reversed Chains
Our starting point is a (homogeneous) discrete-time Markov chain X = (X_0, X_1, X_2, …) with (countable) state space S and
transition probability matrix P. Let m be a positive integer, which we will think of as the terminal time or finite time horizon. We
won't bother to indicate the dependence on m notationally, since ultimately the terminal time will not matter. Define X̂_n = X_{m−n}
for n ∈ {0, 1, …, m}. Thus, the process forward in time is X = (X_0, X_1, …, X_m) while the process backwards in time is
X̂ = (X̂_0, X̂_1, …, X̂_m) = (X_m, X_{m−1}, …, X_0) (16.7.1)
The process X̂ = (X̂_0, X̂_1, …, X̂_m) is a Markov chain, but is not time homogeneous in general. The one-step transition
probability at time n ∈ {0, 1, …, m − 1} is given by
P(X̂_{n+1} = y ∣ X̂_n = x) = [P(X_{m−n−1} = y) / P(X_{m−n} = x)] P(y, x)
for states x, y ∈ S with P(X_{m−n} = x) > 0.
Proof
However, the backwards chain will be time homogeneous if X_0 has an invariant distribution.
Suppose that X is irreducible and positive recurrent, with (unique) invariant probability density function f. If X_0 has the
invariant probability distribution, then X̂ is a time-homogeneous Markov chain with transition matrix P̂ given by
P̂(x, y) = [f(y)/f(x)] P(y, x), (x, y) ∈ S² (16.7.6)
Proof
Recall that a discrete-time Markov chain is ergodic if it is irreducible, positive recurrent, and aperiodic. For an ergodic chain, the
previous result holds in the limit of the terminal time.
Suppose that X is ergodic, with (unique) invariant probability density function f. Regardless of the distribution of X_0,
P(X̂_{n+1} = y ∣ X̂_n = x) → [f(y)/f(x)] P(y, x) as m → ∞ (16.7.7)
Proof
16.7.1 https://stats.libretexts.org/@go/page/10294
These three results are motivation for the definition that follows. We can generalize by defining the reversal of an irreducible
Markov chain, as long as there is a positive, invariant function. Recall that a positive invariant function defines a positive measure
on S , but of course not in general a probability distribution.
Suppose that X is an irreducible Markov chain with transition matrix P, and that g : S → (0, ∞) is invariant for X. The reversal of X with respect to g is the Markov chain X̂ = (X̂_0, X̂_1, …) with transition probability matrix P̂ defined by
\[ \hat{P}(x, y) = \frac{g(y)}{g(x)} P(y, x), \quad (x, y) \in S^2 \tag{16.7.8} \]
Proof
Recall that if g is a positive invariant function for X then so is cg for every positive constant c, and note that g and cg generate the same reversed chain. The general definition is natural, because most of the important properties of the reversed chain follow from the balance equation between the transition matrices P and P̂, and the invariant function g:
\[ g(x) \hat{P}(x, y) = g(y) P(y, x), \quad (x, y) \in S^2 \tag{16.7.10} \]
We will see this balance equation repeated with other objects related to the Markov chains.
Suppose that X is an irreducible Markov chain with invariant function g : S → (0, ∞), and that X̂ is the reversal of X with respect to g. For (x, y) ∈ S²,
1. P̂(x, x) = P(x, x)
2. P̂(x, y) > 0 if and only if P(y, x) > 0
Proof
Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞), and that X̂ is the reversal of X with respect to g. For every n ∈ N_+ and every sequence of states (x_1, x_2, …, x_n, x_{n+1}) ∈ S^{n+1},
\[ g(x_1) \hat{P}(x_1, x_2) \hat{P}(x_2, x_3) \cdots \hat{P}(x_n, x_{n+1}) = g(x_{n+1}) P(x_{n+1}, x_n) \cdots P(x_3, x_2) P(x_2, x_1) \tag{16.7.11} \]
Proof
The balance equation holds for the powers of the transition matrix:
Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞), and that X̂ is the reversal of X with respect to g. For every (x, y) ∈ S² and n ∈ N,
\[ g(x) \hat{P}^n(x, y) = g(y) P^n(y, x) \tag{16.7.14} \]
Proof
We can now generalize the simple result above.
Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞), and that X̂ is the reversal of X with respect to g. For n ∈ N and (x, y) ∈ S²,
1. P̂^n(x, x) = P^n(x, x)
2. P̂^n(x, y) > 0 if and only if P^n(y, x) > 0
In terms of the state graphs, part (b) has an obvious meaning: If there exists a path of length n from y to x in the original state
graph, then there exists a path of length n from x to y in the reversed state graph. The time reversal definition is symmetric with
respect to the two Markov chains.
Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞), and that X̂ is the reversal of X with respect to g. Then
1. g is also invariant for X̂.
2. X̂ is also irreducible.
3. X is the reversal of X̂ with respect to g.
Proof
Markov chains that are time reversals of each other share many important properties: the chains are of the same type (transient, null recurrent, or positive recurrent), and they have the same period.
Proof
The main point of the next result is that we don't need to know a priori that g is invariant for X, if we can guess g and P̂.
Suppose again that X is irreducible with transition probability matrix P. If there exists a function g : S → (0, ∞) and a transition probability matrix P̂ such that g(x)P̂(x, y) = g(y)P(y, x) for all (x, y) ∈ S², then
1. g is invariant for X.
2. P̂ is the transition matrix of the reversal of X with respect to g.
Proof
As a corollary, if there exists a probability density function f on S and a transition probability matrix P̂ such that f(x)P̂(x, y) = f(y)P(y, x) for all (x, y) ∈ S², then in addition to the conclusions above, we know that the chains X and X̂ are positive recurrent.
Reversible Chains
Clearly, an interesting special case occurs when the transition matrix of the reversed chain turns out to be the same as the original
transition matrix. A chain of this type could be used to model a physical process that is stochastically the same, forward or
backward in time.
Suppose again that X = (X_0, X_1, X_2, …) is an irreducible Markov chain with transition matrix P and invariant function g : S → (0, ∞). If the reversal of X with respect to g also has transition matrix P, then X is said to be reversible with respect to g.
Clearly if X is reversible with respect to the invariant function g : S → (0, ∞) then X is reversible with respect to the invariant function cg for every c ∈ (0, ∞), so reversibility is really a property of the chain and the invariant measure, up to multiplication by positive constants. The invariant measure need not be normalizable: the non-symmetric simple random walk on Z is reversible with respect to a positive invariant function that does not correspond to a probability distribution. Using the last result in the previous subsection, we can tell whether X is reversible with respect to g without knowing a priori that g is invariant.
Suppose again that X is irreducible with transition matrix P. If there exists a function g : S → (0, ∞) such that g(x)P(x, y) = g(y)P(y, x) for all (x, y) ∈ S², then
1. g is invariant for X.
2. X is reversible with respect to g.
If we have reason to believe that a Markov chain is reversible (based on modeling considerations, for example), then the condition
in the previous theorem can be used to find the invariant functions. This procedure is often easier than using the definition of
invariance directly. The next two results are minor generalizations:
Suppose again that X is irreducible and that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if for every n ∈ N_+ and every sequence of states (x_1, x_2, …, x_n, x_{n+1}) ∈ S^{n+1},
\[ g(x_1) P(x_1, x_2) P(x_2, x_3) \cdots P(x_n, x_{n+1}) = g(x_{n+1}) P(x_{n+1}, x_n) \cdots P(x_3, x_2) P(x_2, x_1) \tag{16.7.20} \]
Suppose again that X is irreducible and that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if for every (x, y) ∈ S² and n ∈ N_+,
\[ g(x) P^n(x, y) = g(y) P^n(y, x) \tag{16.7.21} \]
Suppose again that X is irreducible and that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if
\[ g(x) R_\alpha(x, y) = g(y) R_\alpha(y, x), \quad \alpha \in (0, 1], \; (x, y) \in S^2 \tag{16.7.22} \]
In the positive recurrent case (the most important case), the following theorem gives a condition for reversibility that does not directly reference the invariant distribution. The condition is known as the Kolmogorov cycle condition, and is named for Andrei Kolmogorov.
Suppose that X is irreducible and positive recurrent. Then X is reversible if and only if for every sequence of states (x_1, x_2, …, x_n),
\[ P(x_1, x_2) P(x_2, x_3) \cdots P(x_{n-1}, x_n) P(x_n, x_1) = P(x_1, x_n) P(x_n, x_{n-1}) \cdots P(x_3, x_2) P(x_2, x_1) \tag{16.7.23} \]
Proof
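For a finite chain, the cycle condition can be tested directly. The sketch below (an illustration, not code from the text) checks all cycles of length 3 for two small chains: a birth-death chain, which is necessarily reversible, and an asymmetric chain with a preferred direction of circulation. A full test would examine cycles of every length; length 3 suffices to expose the failure here.

```python
from itertools import permutations

def cycle_condition_3(P):
    """Check the Kolmogorov cycle condition for all cycles of length 3."""
    n = len(P)
    for (a, b, c) in permutations(range(n), 3):
        forward = P[a][b] * P[b][c] * P[c][a]
        backward = P[a][c] * P[c][b] * P[b][a]
        if abs(forward - backward) > 1e-12:
            return False
    return True

# A birth-death chain on {0, 1, 2}: reversible, so the condition holds.
P_bd = [[0.5, 0.5, 0.0],
        [0.25, 0.5, 0.25],
        [0.0, 0.5, 0.5]]

# An asymmetric chain: e.g. the cycle 0 -> 1 -> 2 -> 0 has probability
# 0.6 * 0.4 * 0.5 = 0.12, while 0 -> 2 -> 1 -> 0 has 0.3 * 0.3 * 0.4 = 0.036.
P_asym = [[0.1, 0.6, 0.3],
          [0.4, 0.2, 0.4],
          [0.5, 0.3, 0.2]]

assert cycle_condition_3(P_bd)
assert not cycle_condition_3(P_asym)
```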
Note that the Kolmogorov cycle condition states that the probability of visiting states (x_2, x_3, …, x_n, x_1) in sequence, starting in state x_1, is the same as the probability of visiting states (x_n, x_{n−1}, …, x_2, x_1) in sequence, starting in state x_1.
Consider the two-state chain X on S = {0, 1} with transition probability matrix
\[ P = \begin{pmatrix} 1 - p & p \\ q & 1 - q \end{pmatrix} \]
where p, q ∈ (0, 1) are parameters. The chain X is reversible and the invariant probability density function is f = (q/(p+q), p/(p+q)).
Proof
Suppose that X is a Markov chain on a finite state space S with symmetric transition probability matrix P. Thus P(x, y) = P(y, x) for all (x, y) ∈ S². The chain X is reversible and the uniform distribution on S is invariant.
Proof
Consider the Markov chain X on S = {a, b, c} with transition probability matrix P given below:
\[ P = \begin{pmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{2} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{2} & \frac{1}{2} & 0 \end{pmatrix} \tag{16.7.27} \]
1. Draw the state graph of X and note that the chain is irreducible.
2. Find the invariant probability density function f .
3. Find the mean return time to each state.
4. Find the transition probability matrix P̂ of the time-reversed chain X̂.
Answer
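One way to check the answers is with exact rational arithmetic. The sketch below (not code from the text) verifies that f = (6/17, 6/17, 5/17) is invariant, computes the mean return times μ(x) = 1/f(x), and builds the reversed transition matrix.

```python
from fractions import Fraction as F

# The transition matrix from the exercise, with states ordered (a, b, c).
P = [[F(1, 4), F(1, 4), F(1, 2)],
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(1, 2), F(1, 2), F(0)]]

# Candidate invariant pdf: verify f P = f and sum(f) = 1 exactly.
f = [F(6, 17), F(6, 17), F(5, 17)]
fP = [sum(f[x] * P[x][y] for x in range(3)) for y in range(3)]
assert fP == f and sum(f) == 1

# Mean return times: mu(x) = 1 / f(x), i.e. (17/6, 17/6, 17/5).
mu = [1 / fx for fx in f]

# Transition matrix of the time-reversed chain: P_hat(x,y) = f(y)P(y,x)/f(x).
P_hat = [[f[y] * P[y][x] / f[x] for y in range(3)] for x in range(3)]
assert all(sum(row) == 1 for row in P_hat)
```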
Special Models
Read the discussion of reversibility for the Ehrenfest chains.
This page titled 16.7: Time Reversal in Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.8: The Ehrenfest Chains
Basic Theory
The Ehrenfest chains, named for Paul Ehrenfest, are simple, discrete models for the exchange of gas molecules between two containers. However, they can be formulated as simple ball and urn models; the balls correspond to the molecules and the urns to the two containers. Thus, suppose that we have two urns, labeled 0 and 1, that contain a total of m balls. The state of the system at time n ∈ N is the number of balls in urn 1, which we will denote by X_n. Our stochastic process is X = (X_0, X_1, X_2, …) with state space S = {0, 1, …, m}. Of course, the number of balls in urn 0 at time n is m − X_n.
The Models
In the basic Ehrenfest model, at each discrete time unit, independently of the past, a ball is selected at random and moved to the other urn.
Proof
In the Ehrenfest experiment, select the basic model. For selected values of m and selected values of the initial state, run the chain for 1000 time
steps and note the limiting behavior of the proportion of time spent in each state.
Suppose now that we modify the basic Ehrenfest model as follows: at each discrete time, independently of the past, we select a ball at random and an urn at random. We then put the chosen ball in the chosen urn.
Proof
If Q denotes the transition matrix of the modified chain and P the transition matrix of the basic chain, then Q(x, x) = 1/2 and Q(x, y) = (1/2) P(x, y) for y ∈ {x − 1, x + 1}.
In the Ehrenfest experiment, select the modified model. For selected values of m and selected values of the initial state, run the chain for 1000 time
steps and note the limiting behavior of the proportion of time spent in each state.
Classification
The basic and modified Ehrenfest chains are irreducible and positive recurrent.
Proof
The basic Ehrenfest chain is periodic with period 2. The cyclic classes are the set of even states and the set of odd states. The two-step transition matrix is
\[ P^2(x, x-2) = \frac{x(x-1)}{m^2}, \quad P^2(x, x) = \frac{x(m-x+1) + (m-x)(x+1)}{m^2}, \quad P^2(x, x+2) = \frac{(m-x)(m-x-1)}{m^2}, \quad x \in S \tag{16.8.5} \]
Proof
The invariant probability density function f is given by
\[ f(x) = \binom{m}{x} \left(\frac{1}{2}\right)^m, \quad x \in S \tag{16.8.6} \]
16.8.1 https://stats.libretexts.org/@go/page/10295
Proof
Thus, the invariant distribution corresponds to placing each ball randomly and independently either in urn 0 or in urn 1.
The mean return time to state x ∈ S for the basic or modified Ehrenfest chain is \( \mu(x) = 2^m \big/ \binom{m}{x} \).
Proof
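These results are easy to verify numerically. The sketch below (an illustration, not code from the text) builds the basic Ehrenfest transition matrix for m = 5 and checks, with exact arithmetic, that the binomial pdf is invariant, that detailed balance holds (so the chain is reversible), and that the mean return times are 2^m / C(m, x).

```python
from fractions import Fraction as F
from math import comb

m = 5
S = range(m + 1)

# Basic Ehrenfest chain: P(x, x-1) = x/m, P(x, x+1) = (m-x)/m.
P = [[F(0)] * (m + 1) for _ in S]
for x in S:
    if x > 0:
        P[x][x - 1] = F(x, m)
    if x < m:
        P[x][x + 1] = F(m - x, m)

# Binomial(m, 1/2) pdf, equation (16.8.6).
f = [F(comb(m, x), 2 ** m) for x in S]

# f is invariant: f P = f (exactly, with rational arithmetic).
fP = [sum(f[x] * P[x][y] for x in S) for y in S]
assert fP == f

# Detailed balance f(x) P(x, y) = f(y) P(y, x): the chain is reversible.
assert all(f[x] * P[x][y] == f[y] * P[y][x] for x in S for y in S)

# Mean return times mu(x) = 1/f(x) = 2^m / C(m, x).
mu = [1 / f[x] for x in S]
assert mu[0] == 32 and mu[2] == F(32, 10)
```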
For the basic Ehrenfest chain, the limiting behavior of the chain is as follows:
1. \( P^{2n}(x, y) \to \binom{m}{y} \left(\frac{1}{2}\right)^{m-1} \) as n → ∞ if x, y ∈ S have the same parity (both even or both odd). The limit is 0 otherwise.
2. \( P^{2n+1}(x, y) \to \binom{m}{y} \left(\frac{1}{2}\right)^{m-1} \) as n → ∞ if x, y ∈ S have opposite parity (one even and one odd). The limit is 0 otherwise.
Proof
For the modified Ehrenfest chain, \( Q^n(x, y) \to \binom{m}{y} \left(\frac{1}{2}\right)^m \) as n → ∞ for x, y ∈ S.
Proof
In the Ehrenfest experiment, the limiting binomial distribution is shown graphically and numerically. For each model and for selected values of m
and selected values of the initial state, run the chain for 1000 time steps and note the limiting behavior of the proportion of time spent in each state.
How do the choices of m, the initial state, and the model seem to affect the rate of convergence to the limiting distribution?
Reversibility
The basic and modified Ehrenfest chains are reversible.
Proof
Run the simulation of the Ehrenfest experiment 10,000 time steps for each model, for selected values of m, and with initial state 0. Note that at first,
you can see the “arrow of time”. After a long period, however, the direction of time is no longer evident.
Computational Exercises
Consider the basic Ehrenfest chain with m = 5 balls, and suppose that X_0 has the uniform distribution on S.
1. Compute the probability density function, mean and standard deviation of X_1.
2. Compute the probability density function, mean and standard deviation of X_2.
3. Compute the probability density function, mean and standard deviation of X_3.
4. Sketch the initial probability density function and the probability density functions in parts (a), (b), and (c) on a common set of axes.
Answer
Consider the modified Ehrenfest chain with m = 5 balls, and suppose that the chain starts in state 2 (with probability 1).
1. Compute the probability density function, mean and standard deviation of X_1.
2. Compute the probability density function, mean and standard deviation of X_2.
3. Compute the probability density function, mean and standard deviation of X_3.
4. Sketch the initial probability density function and the probability density functions in parts (a), (b), and (c) on a common set of axes.
Answer
This page titled 16.8: The Ehrenfest Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via
source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.9: The Bernoulli-Laplace Chain
Basic Theory
Introduction
The Bernoulli-Laplace chain, named for Jacob Bernoulli and Pierre Simon Laplace, is a simple discrete model for the diffusion of two incompressible gases between two containers. Like the Ehrenfest chain, it can also be formulated as a simple ball and urn model. Thus, suppose that we have two urns, labeled 0 and 1. Urn 0 contains j balls and urn 1 contains k balls, where j, k ∈ N_+. Of the j + k balls, r are red and the remaining j + k − r are green. Thus r ∈ N_+ and 0 < r < j + k. At each discrete time, independently of the past, a ball is selected at random from each urn and then the two balls are switched. The balls of different colors correspond to molecules of different types, and the urns are the containers. The incompressible property is reflected in the fact that the number of balls in each urn remains constant over time.
Let X_n denote the number of red balls in urn 1 at time n ∈ N. Then X = (X_0, X_1, X_2, …) is a discrete-time Markov chain with state space S = {max{0, r − j}, …, min{k, r}} and with transition matrix P given by
\[ P(x, x-1) = \frac{(j - r + x) x}{j k}, \quad P(x, x) = \frac{(r - x) x + (j - r + x)(k - x)}{j k}, \quad P(x, x+1) = \frac{(r - x)(k - x)}{j k}; \quad x \in S \tag{16.9.1} \]
Proof
This is a fairly complicated model, simply because of the number of parameters. Interesting special cases occur when some of the parameters are
the same.
Consider the special case j = k, so that each urn has the same number of balls. The state space is S = {max{0, r − k}, …, min{k, r}}.
Consider the special case r = j, so that the number of red balls is the same as the number of balls in urn 0. The state space is S = {0, …, min{j, k}} and the transition probability matrix is
\[ P(x, x-1) = \frac{x^2}{j k}, \quad P(x, x) = \frac{x(j + k - 2x)}{j k}, \quad P(x, x+1) = \frac{(j - x)(k - x)}{j k}; \quad x \in S \tag{16.9.3} \]
Consider the special case r = k, so that the number of red balls is the same as the number of balls in urn 1. The state space is S = {max{0, k − j}, …, k} and the transition probability matrix is
\[ P(x, x-1) = \frac{(j - k + x) x}{j k}, \quad P(x, x) = \frac{(k - x)(j - k + 2x)}{j k}, \quad P(x, x+1) = \frac{(k - x)^2}{j k}; \quad x \in S \tag{16.9.4} \]
Consider the special case j = k = r, so that each urn has the same number of balls, and this is also the number of red balls. The state space is S = {0, 1, …, k} and the transition probability matrix is
\[ P(x, x-1) = \frac{x^2}{k^2}, \quad P(x, x) = \frac{2x(k - x)}{k^2}, \quad P(x, x+1) = \frac{(k - x)^2}{k^2}; \quad x \in S \tag{16.9.5} \]
16.9.1 https://stats.libretexts.org/@go/page/10296
Run the simulation of the Bernoulli-Laplace experiment for 10000 steps and for various values of the parameters. Note the limiting behavior
of the proportion of time spent in each state.
The invariant distribution is the hypergeometric distribution with population parameter j + k, sample parameter k, and type parameter r. The probability density function is
\[ f(x) = \frac{\binom{r}{x} \binom{j + k - r}{k - x}}{\binom{j + k}{k}}, \quad x \in S \tag{16.9.6} \]
Proof
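These formulas can be verified exactly for particular parameters. The sketch below (an illustration, not code from the text) uses the parameter values j = 10, k = 5, r = 4 and checks, with rational arithmetic, that the hypergeometric pdf is invariant and that detailed balance holds, so the chain is in fact reversible.

```python
from fractions import Fraction as F
from math import comb

j, k, r = 10, 5, 4                                # illustrative parameters
S = list(range(max(0, r - j), min(k, r) + 1))     # state space

def P(x, y):
    """Transition probabilities from equation (16.9.1)."""
    if y == x - 1:
        return F((j - r + x) * x, j * k)
    if y == x:
        return F((r - x) * x + (j - r + x) * (k - x), j * k)
    if y == x + 1:
        return F((r - x) * (k - x), j * k)
    return F(0)

# Hypergeometric pdf, equation (16.9.6).
f = {x: F(comb(r, x) * comb(j + k - r, k - x), comb(j + k, k)) for x in S}

assert sum(f.values()) == 1
# Each row of P is a probability distribution on S.
assert all(sum(P(x, y) for y in S) == 1 for x in S)
# f is invariant, and detailed balance holds (reversibility).
assert all(sum(f[x] * P(x, y) for x in S) == f[y] for y in S)
assert all(f[x] * P(x, y) == f[y] * P(y, x) for x in S for y in S)
```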
Thus, the invariant distribution corresponds to selecting a sample of k balls at random and without replacement from the j + k balls and placing them in urn 1. The mean and variance of the invariant distribution are
\[ \mu = k \frac{r}{j + k}, \qquad \sigma^2 = k \, \frac{r}{j + k} \, \frac{j + k - r}{j + k} \, \frac{j}{j + k - 1} \tag{16.9.7} \]
Proof
\( P^n(x, y) \to f(y) = \binom{r}{y} \binom{j + k - r}{k - y} \big/ \binom{j + k}{k} \) as n → ∞ for (x, y) ∈ S².
Proof
In the simulation of the Bernoulli-Laplace experiment, vary the parameters and note the shape and location of the limiting hypergeometric distribution. For selected values of the parameters, run the simulation for 10000 steps and note the limiting behavior of the proportion of time spent in each state.
Reversibility
The Bernoulli-Laplace chain is reversible.
Proof
Run the simulation of the Bernoulli-Laplace experiment 10,000 time steps for selected values of the parameters, and with initial state 0.
Note that at first, you can see the “arrow of time”. After a long period, however, the direction of time is no longer evident.
Computational Exercises
Consider the Bernoulli-Laplace chain with j = 10, k = 5, and r = 4. Suppose that X_0 has the uniform distribution on S. Explicitly give each of the following:
1. The state space S
2. The transition matrix P .
3. The probability density function, mean and variance of X_1.
Answer
Consider the Bernoulli-Laplace chain with j = k = 10 and r = 6 . Give each of the following explicitly:
1. The state space S
2. The transition matrix P
3. The invariant probability density function.
Answer
This page titled 16.9: The Bernoulli-Laplace Chain is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.10: Discrete-Time Reliability Chains
Basic Theory
The Success-Runs Chain
Suppose that we have a sequence of trials, each of which results in success or failure. After a run of x ∈ N consecutive successes, the next trial results in success with probability p(x), independently of the past, where p : N → (0, 1) is called the success function. Let X_n denote the length of the run of successes after n trials.
X = (X_0, X_1, X_2, …) is a discrete-time Markov chain with state space N and transition probability matrix P given by
\[ P(x, x + 1) = p(x), \quad P(x, 0) = 1 - p(x), \quad x \in \mathbb{N} \]
Let T denote the trial number of the first failure; presumably, it is possible that no failure occurs. Let r(n) = P(T > n) for n ∈ N, the probability of at least n consecutive successes, starting with a fresh set of trials. Let f(n) = P(T = n + 1) for n ∈ N, the probability of exactly n consecutive successes, starting with a fresh set of trials.
These functions are related as follows: for n ∈ N,
\[ r(n) = \prod_{x=0}^{n-1} p(x), \qquad f(n) = r(n) - r(n+1) = \left[ \prod_{x=0}^{n-1} p(x) \right] [1 - p(n)] \]
Thus, the functions p, r, and f give equivalent information. If we know one of the functions, we can construct the other two, and
hence any of the functions can be used to define the success-runs chain. The function r is the reliability function associated with T .
Essentially, f is the probability density function of T − 1 , except that it may be defective in the sense that the sum of its values
may be less than 1. The leftover probability, of course, is the probability that T = ∞ . This is the critical consideration in the
classification of the success-runs chain, which we will consider shortly.
Verify that each of the following functions has the appropriate properties, and then find the other two functions:
1. p is a constant in (0, 1).
2. r(n) = 1/(n + 1) for n ∈ N .
3. r(n) = (n + 1)/(2 n + 1) for n ∈ N .
4. p(x) = 1/(x + 2) for x ∈ N.
Answer
In part (a), note that the trials are Bernoulli trials. We have an app for this case.
The success-runs app is a simulation of the success-runs chain based on Bernoulli trials. Run the simulation 1000 times for
various values of p and various initial states, and note the general behavior of the chain.
Recall that T has the same distribution as τ_0, the first return time to 0 starting at state 0. Thus, the classification of the chain as recurrent or transient depends on α = P(T = ∞). Specifically, the success-runs chain is transient if α > 0 and recurrent if α = 0. Thus, we see that the chain is recurrent if and only if a failure is sure to occur. We can compute the parameter α in terms of each of the three functions that define the chain.
In terms of p, r, and f,
\[ \alpha = \prod_{x=0}^{\infty} p(x) = \lim_{n \to \infty} r(n) = 1 - \sum_{n=0}^{\infty} f(n) \]
Compute α and determine whether the success-runs chain X is transient or recurrent for each of the examples above.
Answer
Run the simulation of the success-runs chain 1000 times for various values of p, starting in state 0. Note the return times to
state 0.
Let μ = E(T), the expected trial number of the first failure, starting with a fresh sequence of trials. In terms of r and f,
\[ \mu = \sum_{n=0}^{\infty} r(n) \]
and, when α = 0 (so that T is finite with probability 1), also \( \mu = 1 + \sum_{n=0}^{\infty} n f(n) \).
Proof
If X is recurrent, then r is invariant for X. In the positive recurrent case, when μ < ∞, the invariant distribution has probability density function g given by
\[ g(x) = \frac{r(x)}{\mu}, \quad x \in \mathbb{N} \tag{16.10.3} \]
Proof
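For the Bernoulli-trials case with constant success probability p, the invariant pdf works out to g(x) = r(x)/μ = (1 − p)p^x, the geometric pdf, and this can be checked numerically on a truncated state space. The sketch below is an illustration, not code from the text.

```python
# Success-runs chain with constant success probability p:
# P(x, x+1) = p, P(x, 0) = 1 - p.  Then r(x) = p^x, mu = 1/(1-p),
# and the invariant pdf is g(x) = r(x)/mu = (1-p) p^x (geometric).

p = 0.5
N = 60                      # truncation level for the numerical check
r = [p ** x for x in range(N + 1)]
mu = sum(r)                 # approx. 1/(1-p); truncation error ~ p^(N+1)
g = [rx / mu for rx in r]

# Under g, the mass entering state 0 is sum_x g(x)(1 - p), and the mass
# entering state y >= 1 is g(y-1) p.
gP = [sum(g) * (1 - p)] + [g[y - 1] * p for y in range(1, N + 1)]

# g P = g up to truncation error, confirming invariance.
assert all(abs(gP[y] - g[y]) < 1e-12 for y in range(N + 1))
```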
When X is recurrent, we know from the general theory that every other nonnegative left invariant function is a nonnegative multiple of r.
Determine whether the success-runs chain X is transient, null recurrent, or positive recurrent for each of the examples above.
If the chain is positive recurrent, find the invariant probability density function.
Answer
16.10.2 https://stats.libretexts.org/@go/page/10297
From (a), the success-runs chain corresponding to Bernoulli trials with success probability p ∈ (0, 1) has the geometric distribution
on N, with parameter 1 − p , as the invariant distribution.
Run the simulation of the success-runs chain 1000 times for various values of p and various initial states. Compare the
empirical distribution to the invariant distribution.
The Remaining-Life Chain
Suppose now that we have a device whose (discrete) lifetime U takes values in N, with probability density function f, and that when the device fails it is immediately replaced by an identical device. Let Y_n denote the time remaining until failure of the device in service at time n. Then Y = (Y_0, Y_1, Y_2, …) is a discrete-time Markov chain with state space N and transition probability matrix Q given by
\[ Q(0, y) = f(y), \quad y \in \mathbb{N}; \qquad Q(x, x - 1) = 1, \quad x \in \mathbb{N}_+ \]
The Markov chain Y is called the remaining life chain with lifetime probability density function f.
Run the simulation of the remaining-life chain 1000 times for various values of p and various initial states. Note the general
behavior of the chain.
If U denotes the lifetime of a device, as before, note that T = 1 + U is the return time to 0 for the chain Y, starting at 0. Now let r(n) = P(U ≥ n) = P(T > n) for n ∈ N and let μ = E(T) = 1 + E(U). Note that
\[ r(n) = \sum_{x=n}^{\infty} f(x), \qquad \mu = 1 + \sum_{x=0}^{\infty} x f(x) = \sum_{n=0}^{\infty} r(n) \]
The remaining-life chain Y is positive recurrent if and only if μ < ∞, in which case the invariant distribution has probability density function g given by
\[ g(x) = \frac{r(x)}{\mu}, \quad x \in \mathbb{N} \tag{16.10.7} \]
Proof
Suppose that Y is the remaining life chain whose lifetime distribution is the geometric distribution on N with parameter
1 − p ∈ (0, 1) . Then this distribution is also the invariant distribution.
Proof
Run the simulation of the remaining-life chain 1000 times for various values of p and various initial states. Compare the empirical distribution to the invariant distribution.
Time Reversal
You probably have already noticed similarities, in notation and results, between the success-runs chain and the remaining-life
chain. There are deeper connections.
Suppose that f is a probability density function on N with f (n) > 0 for n ∈ N . Let X be the success-runs chain associated
with f and Y the remaining life chain associated with f . Then X and Y are time reversals of each other.
Proof
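The time-reversal claim can be checked through the balance equation r(x) Q(x, y) = r(y) P(y, x), where r is the invariant function shared by the two chains. The sketch below (an illustration, not code from the text) uses a truncated geometric lifetime pdf, an arbitrary choice for the check.

```python
# Success-runs chain P and remaining-life chain Q built from a common
# lifetime pdf f, truncated at N.  Here f is a truncated geometric pdf
# (the tail mass is lumped into f[N] so that f sums to exactly 1).

p = 0.5
N = 40
f = [(1 - p) * p ** n for n in range(N)] + [p ** N]
r = [sum(f[n:]) for n in range(N + 1)]          # r(n) = P(U >= n) = p^n

def P(x, y):
    """Success-runs chain: success probability p(x) = r(x+1)/r(x)."""
    if x < N and y == x + 1:
        return r[x + 1] / r[x]
    if y == 0:
        return 1 - (r[x + 1] / r[x] if x < N else 0.0)
    return 0.0

def Q(x, y):
    """Remaining-life chain: Q(0, y) = f(y), Q(x, x-1) = 1."""
    if x == 0:
        return f[y]
    return 1.0 if y == x - 1 else 0.0

# Balance equation r(x) Q(x, y) = r(y) P(y, x): Y is the reversal of X
# with respect to the invariant function r (and vice versa).
states = range(N + 1)
assert all(abs(r[x] * Q(x, y) - r[y] * P(y, x)) < 1e-12
           for x in states for y in states)
```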
In the context of reliability, it is also easy to see that the chains are time reversals of each other. Consider again a device whose random lifetime takes values in N, with the device immediately replaced by an identical device upon failure. For n ∈ N, we can think of X_n as the age of the device in service at time n and Y_n as the time remaining until failure for that device.
Run the simulation of the success-runs chain 1000 times for various values of p, starting in state 0. This is the time reversal of the simulation in the next exercise.
Run the simulation of the remaining-life chain 1000 times for various values of p, starting in state 0. This is the time reversal
of the simulation in the previous exercise.
This page titled 16.10: Discrete-Time Reliability Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.11: Discrete-Time Branching Chain
Basic Theory
Introduction
Generically, suppose that we have a system of particles that can generate or split into other particles of the same type. Here are
some typical examples:
The particles are biological organisms that reproduce.
The particles are neutrons in a chain reaction.
The particles are electrons in an electron multiplier.
We assume that each particle, at the end of its life, is replaced by a random number of new particles that we will refer to as children
of the original particle. Our basic assumption is that the particles act independently, each with the same offspring distribution on N.
Let f denote the common probability density function of the number of offspring of a particle. We will also let f^{∗n} = f ∗ f ∗ ⋯ ∗ f denote the convolution power of degree n of f; this is the probability density function of the total number of children of n particles.
We will consider the evolution of the system in real time in our study of continuous-time branching chains. In this section, we will
study the evolution of the system in generational time. Specifically, the particles that we start with are in generation 0, and
recursively, the children of a particle in generation n are in generation n + 1 .
Formally, our basic construction is an array of independent random variables (U_{n,i} : n ∈ N, i ∈ N_+), each with probability density function f. We interpret U_{n,i} as the number of children of the ith particle in generation n (if this particle exists). Note that we have more random variables than we need, but this causes no harm, and we know that we can construct a probability space that supports such an array of random variables. We can now define our state variables recursively by
\[ X_{n+1} = \sum_{i=1}^{X_n} U_{n,i}, \quad n \in \mathbb{N} \]
The branching chain is also known as the Galton-Watson process in honor of Francis Galton and Henry William Watson who
studied such processes in the context of the survival of (aristocratic) family names. Note that the descendants of each initial particle
form a branching chain, and these chains are independent. Thus, the branching chain starting with x particles is equivalent to x
independent copies of the branching chain starting with 1 particle. This feature turns out to be very important in the analysis of the
chain. Note also that 0 is an absorbing state that corresponds to extinction. On the other hand, the population may grow to infinity,
sometimes called explosion. Computing the probability of extinction is one of the fundamental problems in branching chains; we
will essentially solve this problem in the next subsection.
16.11.1 https://stats.libretexts.org/@go/page/10384
Extinction and Explosion
The behavior of the branching chain in expected value is easy to analyze. Let m denote the mean of the offspring distribution, so that
\[ m = \sum_{x=0}^{\infty} x f(x) \tag{16.11.4} \]
Note that m ∈ [0, ∞]. The parameter m will turn out to be of fundamental importance.
In particular, E(X_n ∣ X_0 = x) = x m^n for n ∈ N, so the chain dies out in the mean if m < 1, is stable in the mean if m = 1, and explodes in the mean if m > 1.
Recall that state 0 is absorbing (there are no particles), and hence {X_n = 0 for some n ∈ N} = {τ_0 < ∞} is the extinction event (where as usual, τ_0 is the time of the first return to 0). We are primarily concerned with the probability of extinction, as a function of the initial state. First, however, we will make some simple observations and eliminate some trivial cases.
Suppose that f(1) = 1, so that each particle is replaced by a single new particle. Then
1. Every state is absorbing.
2. The equivalence classes are the singleton sets.
3. With probability 1, X_n = X_0 for every n ∈ N.
Proof
Suppose that f(0) > 0, so that with positive probability, a particle will die without offspring. Then
1. Every state leads to 0.
2. Every positive state is transient.
3. With probability 1, either X_n = 0 for some n ∈ N (extinction) or X_n → ∞ as n → ∞ (explosion).
Proof
Suppose that f(0) = 0 and f(1) < 1, so that every particle is replaced by at least one particle, and with positive probability, more than one. Then
1. Every positive state is transient.
2. P(X_n → ∞ as n → ∞ ∣ X_0 = x) = 1 for every x ∈ N_+, so that explosion is certain, starting with at least one particle.
Proof
Suppose that f(0) > 0 and f(0) + f(1) = 1, so that with positive probability, a particle will die without offspring, and with probability 1, a particle is not replaced by more than one particle. Then
1. Every state leads to 0.
2. Every positive state is transient.
3. With probability 1, X_n = 0 for some n ∈ N, so extinction is certain.
Proof
Thus, the interesting case is when f(0) > 0 and f(0) + f(1) < 1, so that with positive probability, a particle will die without offspring, and also with positive probability, the particle will be replaced by more than one new particle. We will assume these conditions for the remainder of our discussion. By the state classification above, all states lead to 0 (extinction). We will denote the probability of extinction, starting with one particle, by
\[ q = \mathbb{P}(\tau_0 < \infty \mid X_0 = 1) = \mathbb{P}(X_n = 0 \text{ for some } n \in \mathbb{N} \mid X_0 = 1) \tag{16.11.6} \]
The set of positive states N_+ is a transient equivalence class, and the probability of extinction starting with x ∈ N_+ particles is
\[ q^x = \mathbb{P}(\tau_0 < \infty \mid X_0 = x) = \mathbb{P}(X_n = 0 \text{ for some } n \in \mathbb{N} \mid X_0 = x) \tag{16.11.7} \]
Proof
The extinction probability q satisfies the equation
\[ q = \sum_{x=0}^{\infty} f(x) \, q^x \tag{16.11.8} \]
Proof
Thus the extinction probability q starting with 1 particle is a fixed point of the probability generating function Φ of the offspring distribution:
\[ \Phi(t) = \sum_{x=0}^{\infty} f(x) t^x, \quad t \in [0, 1] \tag{16.11.11} \]
Moreover, from the general discussion of hitting probabilities in the section on recurrence and transience, q is the smallest such
number in the interval (0, 1]. If the probability generating function Φ can be computed in closed form, then q can sometimes be
computed by solving the equation Φ(t) = t .
Note also that the mean of the offspring distribution satisfies \( m = \lim_{t \uparrow 1} \Phi'(t) \).
Proof
Our main result is next, and relates the extinction probability q and the mean of the offspring distribution m.
The extinction probability q and the mean of the offspring distribution m are related as follows:
1. If m ≤ 1 then q = 1 , so extinction is certain.
2. If m > 1 then 0 < q < 1 , so there is a positive probability of extinction and a positive probability of explosion.
Proof
Figure 16.11.2: The case of certain extinction.
Computational Exercises
Consider the branching chain with offspring probability density function f given by f (0) = 1 − p , f (2) = p , where p ∈ (0, 1)
is a parameter. Thus, each particle either dies or splits into two new particles. Find each of the following.
1. The transition matrix P .
2. The mean m of the offspring distribution.
3. The generating function Φ of the offspring distribution.
4. The extinction probability q.
Answer
Consider the branching chain whose offspring distribution is the geometric distribution on N with parameter 1 − p, where p ∈ (0, 1). Thus f(n) = (1 − p) p^n for n ∈ N. Find each of the following:
1. The mean m of the offspring distribution.
2. The generating function Φ of the offspring distribution.
3. The extinction probability q.
Curiously, the extinction probability is the same as for the previous problem.
Consider the branching chain whose offspring distribution is the Poisson distribution with parameter m ∈ (0, ∞). Thus f(n) = e^{−m} m^n / n! for n ∈ N. Find each of the following:
1. The generating function Φ of the offspring distribution.
2. An equation satisfied by the extinction probability q.
This page titled 16.11: Discrete-Time Branching Chain is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.12: Discrete-Time Queuing Chains
Basic Theory
Introduction
In a queuing model, customers arrive at a station for service. As always, the terms are generic; here are some typical examples:
The customers are persons and the service station is a store.
The customers are file requests and the service station is a web server.
The customers are packages and the service station is a processing facility.
Queuing models can be quite complex, depending on such factors as the probability distribution that governs the arrival of
customers, the probability distribution that governs the service of customers, the number of servers, and the behavior of the
customers when all servers are busy. Indeed, queuing theory has its own lexicon to indicate some of these factors. In this section,
we will study one of the simplest, discrete-time queuing models. However, as we will see, this discrete-time chain is embedded in a
much more realistic continuous-time queuing process known as the M/G/1 queue. In a general sense, the main interest in any
queuing model is the number of customers in the system as a function of time, and in particular, whether the servers can adequately
handle the flow of customers.
Thus, let X_n denote the number of customers in the system at time n ∈ N, and let U_n denote the number of new customers who arrive at time n ∈ N+. Then U = (U_1, U_2, …) is a sequence of independent random variables, with common probability density function f on N, and
X_{n+1} = U_{n+1} if X_n = 0, and X_{n+1} = (X_n − 1) + U_{n+1} if X_n > 0; n ∈ N (16.12.1)
X = (X_0, X_1, X_2, …) is a discrete-time Markov chain with state space N and transition probability matrix P given by
P(0, y) = f(y), y ∈ N (16.12.2)
Let m denote the mean of the arrival distribution:
m = ∑_{x=0}^∞ x f(x) (16.12.4)
Thus m is the average number of new customers who arrive during a time period.
Proof
Our goal in this section is to compute the probability that the chain reaches 0, as a function of the initial state (so that the server is able to serve all of the customers). As we will see, there are some curious and unexpected parallels between this problem and the problem of computing the extinction probability in the branching chain. As a corollary, we will also be able to classify the queuing chain as transient or recurrent. Our basic parameter of interest is q = H(1, 0) = P(τ_0 < ∞ ∣ X_0 = 1), where as usual, H is the hitting probability matrix and τ_0 = min{n ∈ N+ : X_n = 0} is the first positive time that the chain is in state 0 (possibly infinite). Thus, q is the probability that the queue eventually empties, starting with a single customer.
Proof
Proof
Note that this is exactly the same equation that we considered for the branching chain, namely Φ(q) = q , where Φ is the
probability generating function of the distribution that governs the number of new customers that arrive during each period.
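Since the emptying probability solves the same fixed-point equation Φ(q) = q as the branching-chain extinction probability, it can be computed generically by fixed-point iteration. A small sketch (our own illustration; the Poisson case anticipates an exercise below):

```python
import math

def smallest_fixed_point(phi, tol=1e-12, max_iter=1_000_000):
    """Iterate q <- phi(q) from 0; for a probability generating function this
    converges to the smallest solution of phi(q) = q in [0, 1]."""
    q = 0.0
    for _ in range(max_iter):
        q_new = phi(q)
        if abs(q_new - q) < tol:
            break
        q = q_new
    return q

# Poisson(m) arrivals: Phi(s) = exp(m * (s - 1)).
for m in (0.5, 2.0, 3.0):
    q = smallest_fixed_point(lambda s, m=m: math.exp(m * (s - 1)))
    print(m, q)               # q = 1 when m <= 1; q < 1 when m > 1
```

The iteration from 0 always finds the smallest root, which is the probabilistically meaningful one.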
Positive Recurrence
Our next goal is to find conditions for the queuing chain to be positive recurrent. Recall that m is the mean of the probability density function f; that is, the expected number of new customers who arrive during a time period. As before, let τ_0 denote the first positive time that the chain is in state 0. We assume that the chain is recurrent, so m ≤ 1 and P(τ_0 < ∞) = 1.
Proof
The derivative of Ψ is
Ψ′(t) = Φ[Ψ(t)] / (1 − t Φ′[Ψ(t)]), t ∈ (−1, 1) (16.12.11)
Proof
As usual, let μ_0 = E(τ_0 ∣ X_0 = 0), the mean return time to state 0 starting in state 0. Then
1. μ_0 = 1/(1 − m) if m < 1, and therefore the chain is positive recurrent.
2. μ_0 = ∞ if m = 1, and therefore the chain is null recurrent.
Proof
So to summarize, the queuing chain is positive recurrent if m < 1 , null recurrent if m = 1 , and transient if m > 1 . Since m is the
expected number of new customers who arrive during a service period, the results are certainly reasonable.
Computational Exercises
Consider the queuing chain with arrival probability density function f given by f (0) = 1 − p , f (2) = p , where p ∈ (0, 1) is a
parameter. Thus, at each time period, either no new customers arrive or two arrive.
1. Find the transition matrix P .
2. Find the mean m of the arrival distribution.
3. Find the generating function Φ of the arrival distribution.
4. Find the probability q that the queue eventually empties, starting with one customer.
5. Classify the chain as transient, null recurrent, or positive recurrent.
6. In the positive recurrent case, find μ_0, the mean return time to 0.
Answer
Consider the queuing chain whose arrival distribution is the geometric distribution on N with parameter 1 − p, where p ∈ (0, 1). Thus f(n) = (1 − p)p^n for n ∈ N.
Answer
Curiously, the parameter q and the classification of the chain are the same in the last two models.
Consider the queuing chain whose arrival distribution is the Poisson distribution with parameter m ∈ (0, ∞). Thus f(n) = e^{−m} m^n / n! for n ∈ N. Find each of the following:
1. The transition matrix P
2. The mean m of the arrival distribution.
3. The generating function Φ of the arrival distribution.
4. The approximate value of q when m = 2 and when m = 3 .
5. Classify the chain as transient, null recurrent, or positive recurrent.
6. In the positive recurrent case, find μ_0, the mean return time to 0.
Answer
This page titled 16.12: Discrete-Time Queuing Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.13: Discrete-Time Birth-Death Chains
Basic Theory
Introduction
Suppose that S is an interval of integers (that is, a set of consecutive integers), either finite or infinite. A (discrete-time) birth-death chain on S is a discrete-time Markov chain X = (X_0, X_1, X_2, …) on S with transition probability matrix P of the form
P(x, x − 1) = q(x), P(x, x) = r(x), P(x, x + 1) = p(x); x ∈ S
where p, q, and r are nonnegative functions on S with p(x) + q(x) + r(x) = 1 for x ∈ S.
If the interval S has a minimum value a ∈ Z then of course we must have q(a) = 0 . If r(a) = 1 , the boundary point a is
absorbing and if p(a) = 1 , then a is reflecting. Similarly, if the interval S has a maximum value b ∈ Z then of course we must
have p(b) = 0 . If r(b) = 1 , the boundary point b is absorbing and if p(b) = 1 , then b is reflecting. Several other special models
that we have studied are birth-death chains; these are explored below.
In this section, as you will see, we often have sums of products. Recall that a sum over an empty index set is 0, while a product
over an empty index set is 1.
We will use the test for recurrence derived earlier with A = N+, the set of positive states. That is, we will compute the probability
Proof
Note that r, the function that assigns to each state x ∈ N the probability of an immediate return to x, plays no direct role in whether the chain is transient or recurrent. Indeed all that matters are the ratios q(x)/p(x) for x ∈ N+.
is irreducible.
is invariant for X, and is the only invariant function, up to multiplication by constants. Hence X is positive recurrent if and only if B = ∑_{x=0}^∞ g(x) < ∞, in which case the (unique) invariant probability density function f is given by f(x) = g(x)/B for x ∈ N.
Proof
1. X is transient if A < ∞
2. X is null recurrent if A = ∞ and B = ∞ .
3. X is positive recurrent if B < ∞ .
Note again that r, the function that assigns to each state x ∈ N the probability of an immediate return to x, plays no direct role in
whether the chain is transient, null recurrent, or positive recurrent. Also, we know that an irreducible, recurrent chain has a positive
invariant function that is unique up to multiplication by positive constants, but the birth-death chain gives an example where this is
also true in the transient case.
Suppose now that n ∈ N+ and that X = (X_0, X_1, X_2, …) is a birth-death chain on the integer interval N_n = {0, 1, …, n}. We assume that p(x) > 0 for x ∈ {0, 1, …, n − 1} while q(x) > 0 for x ∈ {1, 2, …, n}. Of course, we must have q(0) = p(n) = 0. With these assumptions, X is irreducible, and since the state space is finite, positive recurrent. So all that remains is to find the invariant distribution. The result is essentially the same as when the state space is N.
f_n(x) = (1/B_n) [p(0) ⋯ p(x − 1)] / [q(1) ⋯ q(x)] for x ∈ N_n, where B_n = ∑_{x=0}^n [p(0) ⋯ p(x − 1)] / [q(1) ⋯ q(x)] (16.13.14)
Proof
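The product formula is easy to confirm numerically. A sketch (our own; the constant values of p and q are arbitrary illustrative choices):

```python
# Invariant pdf of a birth-death chain on {0, ..., n}, from the product formula
# f_n(x) = (1/B_n) * p(0)...p(x-1) / (q(1)...q(x)).

def invariant_pdf(p, q, n):
    """Build g(x) = p(0)...p(x-1)/(q(1)...q(x)) recursively and normalize."""
    g = [1.0]
    for x in range(1, n + 1):
        g.append(g[-1] * p(x - 1) / q(x))
    B = sum(g)
    return [v / B for v in g]

n = 10
p = lambda x: 0.4              # birth probability (example value)
q = lambda x: 0.5              # death probability (example value)
r = lambda x: 1 - (p(x) if x < n else 0) - (q(x) if x > 0 else 0)

f = invariant_pdf(p, q, n)

# Check invariance: (fP)(y) = f(y) for the tridiagonal transition matrix.
def fP(y):
    total = f[y] * r(y)
    if y > 0:
        total += f[y - 1] * p(y - 1)
    if y < n:
        total += f[y + 1] * q(y + 1)
    return total

assert all(abs(fP(y) - f[y]) < 1e-12 for y in range(n + 1))
print("invariance verified, sum(f) =", sum(f))
```

The check succeeds because the formula is exactly the detailed-balance solution g(x)p(x) = g(x + 1)q(x + 1).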
Note that B_n → B as n → ∞, and if B < ∞, f_n(x) → f(x) as n → ∞ for x ∈ N. We will see this type of behavior again. Results for the birth-death chain on N_n often converge to the corresponding results for the birth-death chain on N as n → ∞.
Absorption
Often when the state space S = N , the state of a birth-death chain represents a population of individuals of some sort (and so the
terms birth and death have their usual meanings). In this case state 0 is absorbing and means that the population is extinct.
Specifically, suppose that X = (X_0, X_1, X_2, …) is a birth-death chain on N with r(0) = 1 and with p(x), q(x) > 0 for x ∈ N+. Thus, state 0 is absorbing and all positive states lead to each other and to 0. Let N = min{n ∈ N : X_n = 0} denote the time until absorption (possibly infinite).
Naturally we would like to find the probability of these complementary events, and happily we have already done so in our study of
recurrence above. Let
u(x) = P(N = ∞ ∣ X_0 = x) = P(X_n → ∞ as n → ∞ ∣ X_0 = x), x ∈ N (16.13.16)
u(x) = (1/A) ∑_{i=0}^{x−1} [q(1) ⋯ q(i)] / [p(1) ⋯ p(i)] for x ∈ N+, where A = ∑_{i=0}^∞ [q(1) ⋯ q(i)] / [p(1) ⋯ p(i)] (16.13.18)
Proof
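For constant p(x) = p and q(x) = q on the positive states, the sums above are geometric and the formula can be checked against the closed form 1 − (q/p)^x. A sketch (our own; the values p = 0.6, q = 0.4 are an arbitrary choice with p > q, so that A < ∞):

```python
# u(x) = (1/A) * sum_{i=0}^{x-1} q(1)...q(i) / (p(1)...p(i)), specialized to
# constant p(x) = p, q(x) = q on the positive states.

def escape_probability(x, p=0.6, q=0.4, terms=500):
    """Probability of avoiding absorption at 0 starting from x, truncating
    the infinite series A at `terms` terms (the tail is negligible here)."""
    ratio = q / p
    partial = sum(ratio ** i for i in range(x))       # i = 0, ..., x-1
    A = sum(ratio ** i for i in range(terms))
    return partial / A

# For constant p > q the geometric sums give u(x) = 1 - (q/p)**x.
for x in (1, 2, 5):
    print(x, escape_probability(x))
```

As expected, u(x) increases with x and tends to 1 as x → ∞.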
So if A = ∞ then u(x) = 0 for all x ∈ S. If A < ∞ then u(x) > 0 for all x ∈ N+ and u(x) → 1 as x → ∞.
Next we consider the mean time to absorption, so let m(x) = E(N ∣ X_0 = x) for x ∈ N+.
Probabilistic Proof
Analytic Proof
Next we will consider a birth-death chain on a finite integer interval with both endpoints absorbing. Our interest is in the probability of absorption in one endpoint rather than the other, and in the mean time to absorption. Thus suppose that n ∈ N+ and that X = (X_0, X_1, X_2, …) is a birth-death chain on N_n = {0, 1, …, n} with r(0) = r(n) = 1 and with p(x) > 0 and q(x) > 0 for x ∈ {1, 2, …, n − 1}. So the endpoints 0 and n are absorbing, and all other states lead to each other and to the endpoints. Let N = min{n ∈ N : X_n ∈ {0, n}}, the time until absorption, and for x ∈ S let v_n(x) = P(X_N = 0 ∣ X_0 = x) and m_n(x) = E(N ∣ X_0 = x). The definitions make sense since N is finite with probability 1.
Proof
Note that A_n → A as n → ∞, where A is the constant above for the absorption probability at 0 with the infinite state space N.
m_n(1) = (1/A_n) ∑_{y=1}^{n−1} ∑_{z=1}^{y} [q(z + 1) ⋯ q(y)] / [p(z) ⋯ p(y)] (16.13.32)
Proof
Time Reversal
Our next discussion is on the time reversal of a birth-death chain. Essentially, every recurrent birth-death chain is reversible.
Suppose that X = (X_0, X_1, X_2, …) is an irreducible, recurrent birth-death chain on an integer interval S. Then X is reversible.
Proof
If S is finite and the chain X is irreducible, then of course X is recurrent (in fact positive recurrent), so by the previous result, X
is reversible. In the case S = N , we can use the invariant function above to show directly that the chain is reversible.
Suppose that X = (X_0, X_1, X_2, …) is a birth-death chain on N with p(x) > 0 for x ∈ N and q(x) > 0 for x ∈ N+. Then X is reversible.
Proof
Thus, in the positive recurrent case, when the variables are given the invariant distribution, the transition matrix P describes the
chain forward in time and backwards in time.
1. X is transient if q < p.
2. X is null recurrent if q = p.
3. X is positive recurrent if q > p, and the invariant distribution is the geometric distribution on N with parameter p/q:
f(x) = (1 − p/q)(p/q)^x, x ∈ N (16.13.37)
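The geometric form can be verified directly from the detailed-balance structure of the chain. A sketch (our own; the values p = 0.3, q = 0.6 are an arbitrary choice with q > p):

```python
# Invariant (geometric) distribution of the birth-death chain on N with constant
# birth probability p on N and death probability q > p on the positive states:
# f(x) = (1 - p/q)(p/q)^x.  Check invariance (fP = f) on a truncated range.

p, q = 0.3, 0.6
rho = p / q
f = lambda x: (1 - rho) * rho ** x

for y in range(1, 50):
    # interior states: r(y) = 1 - p - q
    fPy = f(y - 1) * p + f(y) * (1 - p - q) + f(y + 1) * q
    assert abs(fPy - f(y)) < 1e-14
# boundary state 0: r(0) = 1 - p
fP0 = f(0) * (1 - p) + f(1) * q
assert abs(fP0 - f(0)) < 1e-14
print("geometric invariant distribution verified")
```

The identity holds because f satisfies detailed balance: f(x)p = f(x + 1)q for every x.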
Next we consider the random walk on N with 0 absorbing. As in the discussion of absorption above, v(x) denotes the absorption
probability and m(x) the mean time to absorption, starting in state x ∈ N.
Suppose that X = (X_0, X_1, …) is the birth-death chain on N with constant birth probability p ∈ (0, ∞) on N+ and constant death probability q ∈ (0, ∞) on N+, with p + q ≤ 1. Assume also that r(0) = 1, so that 0 is absorbing.
Proof
This chain is essentially the gambler's ruin chain. Consider a gambler who bets on a sequence of independent games, where p and
q are the probabilities of winning and losing, respectively. The gambler receives one monetary unit when she wins a game and must
pay one unit when she loses a game. So X_n is the gambler's fortune after playing n games.
Suppose that X = (X_0, X_1, …) is the birth-death chain on N_n = {0, 1, …, n} with constant birth probability p ∈ (0, ∞) on {0, 1, …, n − 1} and constant death probability q ∈ (0, ∞) on {1, 2, …, n}, with p + q ≤ 1. Then X is positive recurrent.
1. If p ≠ q then
f_n(x) = (p/q)^x (1 − p/q) / [1 − (p/q)^{n+1}], x ∈ N_n (16.13.39)
Note that if p < q then the invariant distribution is a truncated geometric distribution, and f_n(x) → f(x) for x ∈ N, where f is the invariant probability density function of the birth-death chain on N considered above. If p = q, the invariant distribution is uniform on N_n, certainly a reasonable result. Next we consider the chain with both endpoints absorbing. As before, v_n is the function that gives the probability of absorption in state 0, while m_n is the function that gives the mean time to absorption.
Suppose that X = (X_0, X_1, …) is the birth-death chain on N_n = {0, 1, …, n} with constant birth probability p ∈ (0, 1) and death probability q ∈ (0, ∞) on {1, 2, …, n − 1}, where p + q ≤ 1. Assume also that r(0) = r(n) = 1, so that 0 and n are absorbing.
1. If p ≠ q then
v_n(x) = [(q/p)^x − (q/p)^n] / [1 − (q/p)^n], x ∈ N_n (16.13.40)
2. If p = q then
m_n(x) = (1/(2p)) x(n − x), x ∈ N_n (16.13.42)
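Formula (16.13.40) can be checked by direct simulation of the chain. A sketch (ours; the parameter values are arbitrary):

```python
import random

def v_exact(x, n, p, q):
    """Absorption probability at 0, from the closed formula for p != q."""
    r = q / p
    return (r ** x - r ** n) / (1 - r ** n)

def v_mc(x, n, p, q, trials=4000, rng=random):
    """Monte Carlo estimate: run the chain until it hits 0 or n."""
    hits = 0
    for _ in range(trials):
        s = x
        while 0 < s < n:
            u = rng.random()
            if u < p:
                s += 1
            elif u < p + q:
                s -= 1
            # otherwise hold in place (probability r(s) = 1 - p - q)
        hits += (s == 0)
    return hits / trials

random.seed(7)
x, n, p, q = 3, 10, 0.4, 0.5
print(v_exact(x, n, p, q), v_mc(x, n, p, q))   # the two should agree closely
```

With p < q the chain drifts downward, so absorption at 0 is the likelier outcome even from interior states.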
Other Examples
Consider the birth-death process on N with p(x) = 1/(x + 1), q(x) = 1 − p(x), and r(x) = 0 for x ∈ S.
1. Find the invariant function g .
2. Classify the chain.
Answer
This page titled 16.13: Discrete-Time Birth-Death Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.14: Random Walks on Graphs
Basic Theory
Introduction
Suppose that G = (S, E) is a graph with vertex set S and edge set E ⊆ S^2. We assume that the graph is undirected (perhaps a better term would be bi-directed) in the sense that (x, y) ∈ E if and only if (y, x) ∈ E. The vertex set S is countable, but may be infinite. Let N(x) = {y ∈ S : (x, y) ∈ E} denote the set of neighbors of a vertex x ∈ S, and let d(x) = #[N(x)] denote the degree of x. We assume that N(x) ≠ ∅ for x ∈ S, so G has no isolated points.
Suppose now that there is a conductance c(x, y) > 0 associated with each edge (x, y) ∈ E. The conductance is symmetric in the sense that c(x, y) = c(y, x) for (x, y) ∈ E. We extend c to a function on all of S × S by defining c(x, y) = 0 for (x, y) ∉ E. Let
C(x) = ∑_{y ∈ S} c(x, y), x ∈ S
so that C(x) is the total conductance of the edges coming from x. Our main assumption is that C(x) < ∞ for x ∈ S. As the terminology suggests, we imagine a fluid of some sort flowing through the edges of the graph, so that the conductance of an edge measures the capacity of the edge in some sense. One of the best interpretations is that the graph is an electrical network and the edges are resistors. In this interpretation, the conductance of a resistor is the reciprocal of the resistance.
In some applications, specifically the resistor network just mentioned, it's appropriate to impose the additional assumption that G
has no loops, so that (x, x) ∉ E for each x ∈ S . However, that assumption is not mathematically necessary for the Markov chains
that we will consider in this section.
The discrete-time Markov chain X = (X_0, X_1, X_2, …) with state space S and transition probability matrix P given by
P(x, y) = c(x, y) / C(x), (x, y) ∈ S^2 (16.14.2)
This chain governs a particle moving along the vertices of G. If the particle is at vertex x ∈ S at a given time, then the particle will
be at a neighbor of x at the next time; the neighbor is chosen randomly, in proportion to the conductance. In the setting of an
electrical network, it is natural to interpret the particle as an electron. Note that multiplying the conductance function c by a
positive constant has no effect on the associated random walk.
Suppose that d(x) < ∞ for each x ∈ S and that c is constant on the edges. Then
1. C(x) = c d(x) for every x ∈ S.
2. The transition matrix P is given by P(x, y) = 1/d(x) for x ∈ S and y ∈ N(x), and P(x, y) = 0 otherwise.
Thus, for the symmetric random walk, if the state is x ∈ S at a given time, then the next state is equally likely to be any of the
neighbors of x. The assumption that each vertex has finite degree means that the graph G is locally finite.
So as usual, we will usually assume that G is connected, for otherwise we could simply restrict our attention to a component of G.
In the case that G has no loops (again, an important special case because of applications), it's easy to characterize the periodicity of
the chain. For the theorem that follows, recall that G is bipartite if the vertex set S can be partitioned into nonempty, disjoint sets
A and B (the parts) such that every edge in E has one endpoint in A and one endpoint in B .
Suppose that X is a random walk on a connected graph G with no loops. Then X is either aperiodic or has period 2.
Moreover, X has period 2 if and only if G is bipartite, in which case the parts are the cyclic classes of X.
Proof
The function C is invariant for P. The random walk X is positive recurrent if and only if K = ∑_{x ∈ S} C(x) < ∞, in which case the invariant probability density function f is given by f(x) = C(x)/K for x ∈ S.
Proof
Note that K is the total conductance over all edges in G. In particular, of course, if S is finite then X is positive recurrent, with f
as the invariant probability density function. For the symmetric random walk, this is the only way that positive recurrence can
occur:
The symmetric random walk on G is positive recurrent if and only if the set of vertices S is finite, in which case the invariant probability density function f is given by
f(x) = d(x)/(2m), x ∈ S (16.14.6)
where d is the degree function and where m is the number of undirected edges.
Proof
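The form f(x) = d(x)/(2m) is easy to confirm on a small example (our own choice of graph, not one from the text):

```python
# Invariant distribution of the symmetric random walk on a small graph:
# f(x) = d(x) / (2m), where m is the number of undirected edges.

edges = [(0, 1), (1, 2), (2, 3), (3, 1)]           # a path plus a triangle
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, []).append(b)
    neighbors.setdefault(b, []).append(a)

m = len(edges)
f = {x: len(nb) / (2 * m) for x, nb in neighbors.items()}
assert abs(sum(f.values()) - 1) < 1e-12

# Invariance: (fP)(y) = sum_x f(x) P(x, y), with P(x, y) = 1/d(x) for neighbors.
for y in neighbors:
    fPy = sum(f[x] / len(neighbors[x]) for x in neighbors[y])
    assert abs(fPy - f[y]) < 1e-12
print("f(x) = d(x)/(2m) verified:", f)
```

The check is exact: each neighbor x contributes f(x)/d(x) = 1/(2m) to (fP)(y), and there are d(y) such terms.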
On the other hand, when S is infinite, the classification of X as recurrent or transient is complicated. We will consider an interesting special case below, the symmetric random walk on Z^k.
Reversibility
Essentially, all reversible Markov chains can be interpreted as random walks on graphs. This fact is one of the reasons for studying
such walks.
Of course, if X is recurrent, then C is the only positive invariant function, up to multiplication by positive constants, and so X is
simply reversible.
Conversely, suppose that X is an irreducible Markov chain on S with transition matrix P and positive invariant function g . If
X is reversible with respect to g then X is the random walk on the state graph with conductance function c given by
Proof
Again, in the important special case that X is recurrent, there exists a positive invariant function g that is unique up to
multiplication by positive constants. In this case the theorem states that an irreducible, recurrent, reversible chain is a random walk
on the state graph.
The Wheatstone Bridge Graph
The graph below is called the Wheatstone bridge in honor of Charles Wheatstone.
Figure 16.14.1: The Wheatstone bridge network, with conductance values in red
In this subsection, let X be the random walk on the Wheatstone bridge above, with the given conductance values.
Answer
Answer
Answer
5. Find lim_{n→∞} P^{2n+1}.
Answer
Special Models
Recall that the basic Ehrenfest chain with m ∈ N+ balls is reversible. Interpreting the chain as a random walk on a graph,
sketch the graph and find a conductance function.
Answer
Recall that the modified Ehrenfest chain with m ∈ N+ balls is reversible. Interpreting the chain as a random walk on a graph,
sketch the graph and find a conductance function.
Answer
Recall that the Bernoulli-Laplace chain with j ∈ N+ balls in urn 0, k ∈ N+ balls in urn 1, and with r ∈ {0, …, j + k} of the balls red, is reversible. Interpreting the chain as a random walk on a graph, sketch the graph and find a conductance function. Simplify the conductance function in the special case that j = k = r.
Answer
Random Walks on Z
Random walks on integer lattices are particularly interesting because of their classification as transient or recurrent. We consider
the one-dimensional case in this subsection, and the higher dimensional case in the next subsection.
Let X = (X_0, X_1, X_2, …) be the discrete-time Markov chain with state space Z and transition probability matrix P given by
P (x, x + 1) = p, P (x, x − 1) = 1 − p, x ∈ Z (16.14.9)
where p ∈ (0, 1). The chain X is called the simple random walk on Z with parameter p.
The term simple is used because the transition probabilities starting in state x ∈ Z do not depend on x. Thus the chain is spatially as well as temporally homogeneous. In the special case p = 1/2, the chain X is the simple symmetric random walk on Z. Basic properties of the simple random walk on Z, and in particular the simple symmetric random walk, were studied in the chapter on Bernoulli Trials. Of course, the state graph G of X has vertex set Z, and the neighbors of x ∈ Z are x + 1 and x − 1. It's not immediately clear that X is a random walk on G associated with a conductance function, which after all is the topic of this section. But that fact and more follow from the next result.
Let g(x) = [p/(1 − p)]^x for x ∈ Z. Then
1. g(x)P(x, y) = g(y)P(y, x) for all (x, y) ∈ Z^2
2. g is invariant for X
3. X is reversible with respect to g
4. X is the random walk on Z with conductance function c given by c(x, x + 1) = p^{x+1}/(1 − p)^x for x ∈ Z.
Proof
In particular, the simple symmetric random walk is the symmetric random walk on G.
P^{2n}(0, 0) = (2n choose n) p^n (1 − p)^n, n ∈ N (16.14.11)
Proof
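Formula (16.14.11) is easy to evaluate, and the contrast between the symmetric and non-symmetric cases is visible numerically (our own sketch):

```python
import math

def return_prob(n, p):
    """P^{2n}(0, 0) = (2n choose n) p^n (1 - p)^n for the simple walk on Z."""
    return math.comb(2 * n, n) * p ** n * (1 - p) ** n

# For p = 1/2 the terms behave like 1/sqrt(pi * n), so their sum diverges
# (recurrence); for p != 1/2 they decay geometrically (transience).
print([return_prob(n, 0.5) for n in (1, 10, 100)])
print([return_prob(n, 0.7) for n in (1, 10, 100)])
```

By Stirling's formula, (2n choose n) 4^{−n} ≈ 1/√(πn), which is the slow decay behind null recurrence in the symmetric case.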
1. If p ≠ 1/2 then X is transient.
2. If p = 1/2 then X is null recurrent.
Proof
So for the one-dimensional lattice Z, the random walk X is transient in the non-symmetric case, and null recurrent in the symmetric case. Let's return to the invariant functions of X.
Consider again the random walk X on Z with parameter p ∈ (0, 1). The constant function 1 on Z and the function g given by
g(x) = [p/(1 − p)]^x, x ∈ Z (16.14.13)
are invariant for X. All other invariant functions are linear combinations of these two functions.
Proof
Note that when p = 1/2, the constant function 1 is the only positive invariant function, up to multiplication by positive constants. But we know this has to be the case since the chain is recurrent when p = 1/2. Moreover, the chain is reversible. In the non-symmetric case, when p ≠ 1/2, we have an example of a transient chain which nonetheless has non-trivial invariant functions (in fact, a two-dimensional space of such functions). Also, X is reversible with respect to g, as shown above, but the reversal of X with respect to 1 is the chain with transition matrix Q given by Q(x, y) = P(y, x) for (x, y) ∈ Z^2. This chain is just the simple random walk on Z with parameter 1 − p. So the non-symmetric simple random walk is an example of a transient chain that is reversible with respect to one invariant measure but not with respect to another invariant measure.
Random Walks on Z^k
More generally, we now consider Z^k, where k ∈ N+. For i ∈ {1, 2, …, k}, let u_i ∈ Z^k denote the unit vector with 1 in position i and 0 elsewhere. The k-dimensional integer lattice G has vertex set Z^k, and the neighbors of x ∈ Z^k are x ± u_i for i ∈ {1, 2, …, k}.
Let X = (X_0, X_1, X_2, …) be the Markov chain on Z^k with transition probability matrix P given by
P(x, x + u_i) = p_i, P(x, x − u_i) = q_i; x ∈ Z^k, i ∈ {1, 2, …, k} (16.14.15)
Again, the term simple means that the transition probabilities starting at x ∈ Z^k do not depend on x, so that the chain is spatially homogeneous as well as temporally homogeneous. In the special case that p_i = q_i = 1/(2k) for i ∈ {1, 2, …, k}, X is the simple symmetric random walk on Z^k. The following theorem is the natural generalization of the result above for the one-dimensional case.
Let g(x) = ∏_{i=1}^k (p_i/q_i)^{x_i} for x = (x_1, x_2, …, x_k) ∈ Z^k. Then
1. g(x)P(x, y) = g(y)P(y, x) for all x, y ∈ Z^k
2. g is invariant for X.
3. X is reversible with respect to g.
4. X is the random walk on G with conductance function c given by c(x, y) = g(x)P(x, y) for x, y ∈ Z^k.
Proof
In terms of recurrence and transience, it would certainly seem that the larger the dimension k, the less likely the chain is to be recurrent. That's generally true:
1. For k ∈ {1, 2}, X is null recurrent in the symmetric case and transient for all other values of the parameters.
2. For k ∈ {3, 4, …}, X is transient for all values of the parameters.
Proof sketch
So for the simple, symmetric random walk on the integer lattice Z^k, we have the following interesting dimensional phase shift: the chain is null recurrent in dimensions 1 and 2, but transient in dimensions 3 and higher.
g_J(x_1, x_2, …, x_k) = ∏_{j ∈ J} (p_j/q_j)^{x_j}, (x_1, x_2, …, x_k) ∈ Z^k (16.14.18)
2. g_J is invariant for X.
Proof
Note that when J = ∅, g_J = 1, and when J = {1, 2, …, k}, g_J = g, the invariant function introduced above. So in the completely non-symmetric case where p_i ≠ q_i for every i ∈ {1, 2, …, k}, the random walk X has 2^k positive invariant functions that are
This page titled 16.14: Random Walks on Graphs is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
16.15: Introduction to Continuous-Time Markov Chains
This section begins our study of Markov processes in continuous time and with discrete state spaces. Recall that a Markov process with a discrete state space is called a Markov chain, so we are studying continuous-time Markov chains. It will be helpful if you review the section on general Markov processes, at least briefly, to become familiar with the basic notation and concepts. Also, discrete-time chains play a fundamental role, so you will need to review this topic as well.
We will study continuous-time Markov chains from different points of view. Our point of view in this section, involving holding
times and the embedded discrete-time chain, is the most intuitive from a probabilistic point of view, and so is the best place to start.
In the next section, we study the transition probability matrices in continuous time. This point of view is somewhat less intuitive,
but is closest to how other types of Markov processes are treated. Finally, in the third introductory section we study the Markov
chain from the view point of potential matrices. This is the least intuitive approach, but analytically one of the best. Naturally, the
interconnections between the various approaches are particularly important.
Preliminaries
As usual, we start with a probability space (Ω, F , P), so that Ω is the set of outcomes, F the σ-algebra of events, and P the
probability measure on the sample space (Ω, F ). The time space is ([0, ∞), T ) where as usual, T is the Borel σ-algebra on
[0, ∞) corresponding to the standard Euclidean topology. The state space is (S, S ) where S is countable and S is the power set
of S . So every subset of S is measurable, as is every function from S to another measurable space. Recall that S is also the Borel
σ algebra corresponding to the discrete topology on S . With this topology, every function from S to another topological space is
continuous. Counting measure # is the natural measure on (S, S ), so in the context of the general introduction, integrals over S
are simply sums. Also, kernels on S can be thought of as matrices, with rows and columns indexed by S. The left and right kernel
operations are generalizations of matrix multiplication.
Suppose now that X = {X_t : t ∈ [0, ∞)} is a stochastic process with state space (S, S). For t ∈ [0, ∞), let F_t^0 = σ{X_s : s ∈ [0, t]}, so that F_t^0 is the σ-algebra of events defined by the process up to time t. The collection of σ-algebras F^0 = {F_t^0 : t ∈ [0, ∞)} is the natural filtration associated with X. For technical reasons, it's often necessary to have a filtration F = {F_t : t ∈ [0, ∞)} that is slightly finer than the natural one, so that F_t^0 ⊆ F_t for t ∈ [0, ∞) (or in equivalent jargon, X is adapted to F). See the general introduction for more details on the common ways that the natural filtration is refined. We will also let G_t = σ{X_s : s ∈ [t, ∞)}, the σ-algebra of events defined by the process from time t onward. If t is thought of as the present time, then F_t is the collection of events in the past and G_t is the collection of events in the future.
It's often necessary to impose assumptions on the continuity of the process X in time. Recall that X is right continuous if t ↦ X_t(ω) is right continuous on [0, ∞) for every ω ∈ Ω, and similarly X has left limits if t ↦ X_t(ω) has left limits on (0, ∞) for every ω ∈ Ω. Since S has the discrete topology, note that if X is right continuous, then for every t ∈ [0, ∞) and ω ∈ Ω, there exists ϵ (depending on t and ω) such that X_{t+s}(ω) = X_t(ω) for s ∈ [0, ϵ). Similarly, if X has left limits, then for every t ∈ (0, ∞) and ω ∈ Ω there exists δ (depending on t and ω) such that X_{t−s}(ω) is constant for s ∈ (0, δ).
The process X = {X_t : t ∈ [0, ∞)} is a Markov chain on S if for every t ∈ [0, ∞), A ∈ F_t, and B ∈ G_t,
Another version is that the conditional distribution of a state in the future, given the past, is the same as the conditional distribution
just given the present state.
The process X = {X_t : t ∈ [0, ∞)} is a Markov chain on S if for every s, t ∈ [0, ∞) and x ∈ S,
Technically, in the last two definitions, we should say that X is a Markov process relative to the filtration F. But recall that if X
satisfies the Markov property relative to a filtration, then it satisfies the Markov property relative to any coarser filtration, and in
particular, relative to the natural filtration. For the natural filtration, the Markov property can also be stated without explicit
reference to σ-algebras, although at the cost of additional clutter:
The process X = {X_t : t ∈ [0, ∞)} is a Markov chain on S if and only if for every n ∈ N+, time sequence (t_1, t_2, …, t_n) ∈ [0, ∞)^n with t_1 < t_2 < ⋯ < t_n, and state sequence (x_1, x_2, …, x_n) ∈ S^n,
As usual, we also assume that our Markov chain X is time homogeneous, so that P(X_{s+t} = y ∣ X_s = x) = P(X_t = y ∣ X_0 = x) for s, t ∈ [0, ∞) and x, y ∈ S. So, for a homogeneous Markov chain on S, the process {X_{s+t} : t ∈ [0, ∞)} given X_s = x is independent of F_s and equivalent to the process {X_t : t ∈ [0, ∞)} given X_0 = x, for every s ∈ [0, ∞) and x ∈ S. That is, if the chain is in state x ∈ S at a particular time s ∈ [0, ∞), it does not matter how the chain got to x; the chain essentially starts over in state x.
So F_τ is the collection of events up to the random time τ in the same way that F_t is the collection of events up to the deterministic time t ∈ [0, ∞). We usually want the Markov property to extend from deterministic times to stopping times.
The process X = {X_t : t ∈ [0, ∞)} is a strong Markov chain on S if for every stopping time τ, t ∈ [0, ∞), and x ∈ S,
So, for a homogeneous strong Markov chain on S, the process {X_{τ+t} : t ∈ [0, ∞)} given X_τ = x is independent of F_τ and equivalent to the process {X_t : t ∈ [0, ∞)} given X_0 = x, for every stopping time τ and x ∈ S. That is, if the chain is in state x ∈ S at a stopping time τ, then the chain essentially starts over at x, independently of the past.
Suppose that τ has the exponential distribution with rate parameter r ∈ (0, ∞), so that τ has right distribution function F^c given by
F^c(t) = P(τ > t) = e^{−rt}, t ∈ [0, ∞)   (16.15.6)
The mean of the distribution is 1/r and the variance is 1/r². The exponential distribution has an amazing number of characterizations. One of the most important is the memoryless property, which states that a random variable τ with values in [0, ∞) has an exponential distribution if and only if the conditional distribution of τ − s given τ > s is the same as the distribution of τ itself, for every s ∈ [0, ∞). It's easy to see that the memoryless property is equivalent to the law of exponents for the right distribution function F^c, namely F^c(s + t) = F^c(s)F^c(t) for s, t ∈ [0, ∞). Since F^c is right continuous, the only solutions are exponential functions.
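The law of exponents is easy to verify numerically. The following sketch (an illustration, not part of the text, with an arbitrarily chosen rate r) checks that the survival function F^c(t) = e^{−rt} satisfies F^c(s + t) = F^c(s)F^c(t):

```python
import math

def surv(r, t):
    # Right distribution (survival) function F^c(t) = P(tau > t) = exp(-r t)
    return math.exp(-r * t)

r, s, t = 2.0, 0.7, 1.3
lhs = surv(r, s + t)             # F^c(s + t)
rhs = surv(r, s) * surv(r, t)    # F^c(s) F^c(t)
# the two sides agree up to floating-point rounding
```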
For our study of continuous-time Markov chains, it's helpful to extend the exponential distribution to two degenerate cases: τ = 0 with probability 1, and τ = ∞ with probability 1. In terms of the parameter, the first case corresponds to r = ∞, so that F^c(t) = P(τ > t) = 0 for every t ∈ [0, ∞), and the second case corresponds to r = 0, so that F^c(t) = P(τ > t) = 1 for every t ∈ [0, ∞). Note that in both cases, the function F^c satisfies the law of exponents, and so corresponds to a memoryless distribution in a general sense. In all cases, the mean of the exponential distribution with parameter r ∈ [0, ∞] is 1/r, where we interpret 1/0 = ∞ and 1/∞ = 0.
Holding Times
The Markov property implies the memoryless property for the random time when a Markov process first leaves its initial state. It
follows that this random time must have an exponential distribution.
Suppose that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S, and let τ = inf{t ∈ [0, ∞) : X_t ≠ X_0}. For x ∈ S, the conditional distribution of τ given X_0 = x is exponential with parameter λ(x) ∈ [0, ∞].
Proof
So, associated with the Markov chain X on S is a function λ : S → [0, ∞] that gives the exponential parameters for the holding
times in the states. Considering the ordinary exponential distribution, and the two degenerate versions, we are led to the following
classification of states:
Suppose again that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S with exponential parameter function λ. Let x ∈ S.
1. If λ(x) = 0 then P(τ = ∞ ∣ X_0 = x) = 1, and x is said to be an absorbing state.
2. If λ(x) ∈ (0, ∞) then P(0 < τ < ∞ ∣ X_0 = x) = 1, and x is said to be a stable state.
3. If λ(x) = ∞ then P(τ = 0 ∣ X_0 = x) = 1, and x is said to be an instantaneous state.
As you can imagine, an instantaneous state corresponds to weird behavior, since the chain starting in the state leaves the state at
times arbitrarily close to 0. While mathematically possible, instantaneous states make no sense in most applications, and so are to
be avoided. Also, the proof of the last result has some technical holes. We did not really show that τ is a valid random time, let
alone a stopping time. Fortunately, one of our standard assumptions resolves these problems.
Suppose again that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S. If the process X and the filtration F are right continuous, then
1. τ is a stopping time.
2. X has no instantaneous states.
3. P(X_τ ≠ x ∣ X_0 = x) = 1 if x ∈ S is stable.
There is actually a converse to part (b) that states that if X has no instantaneous states, then there is a version of X that is right
continuous. From now on, we will assume that our Markov chains are right continuous with probability 1, and hence have no
instantaneous states. On the other hand, absorbing states are perfectly reasonable and often do occur in applications. Finally, if the
chain enters a stable state, it will stay there for a (proper) exponentially distributed time, and then leave.
Next we define the sequence of random times at which the chain changes state: τ_n is the time of the nth change of state for n ∈ N_+, unless the chain has previously been caught in an absorbing state. Here is the formal construction:
In the definition of M, of course, sup(N) = ∞, so M is the number of changes of state. If M < ∞, the chain was sucked into an absorbing state at time τ_M. Since we have ruled out instantaneous states, the sequence of random times is strictly increasing up until the (random) term M. That is, with probability 1, if n ∈ N and τ_n < ∞ then τ_n < τ_{n+1}. Of course by construction, if τ_n = ∞ then τ_{n+1} = ∞. The increments τ_{n+1} − τ_n for n ∈ N with n < M are the times spent in the states visited by X. The process at the random times when the state changes forms an embedded discrete-time Markov chain.
Suppose again that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S. Let {τ_n : n ∈ N} denote the stopping times and M the number of changes of state, as defined above, and let Y_n = X_{τ_n} for n ∈ N. Then Y = (Y_0, Y_1, …) is a discrete-time Markov chain on S, known as the jump chain of X.
As noted in the proof, the one-step transition probability matrix Q for the jump chain Y is given for (x, y) ∈ S² by Q(x, y) = P(X_τ = y ∣ X_0 = x) if x is stable, and Q(x, y) = I(x, y) if x is absorbing, where I is the identity matrix on S. Of course Q satisfies the usual properties of a probability matrix on S, namely Q(x, y) ≥ 0 for (x, y) ∈ S² and ∑_{y∈S} Q(x, y) = 1 for x ∈ S. But Q satisfies another interesting property as well. Since the state actually changes at time τ starting in a stable state, we must have Q(x, x) = 0 if x is stable and Q(x, x) = 1 if x is absorbing.
Given the initial state, the holding time and the next state are independent.
Proof
The following theorem is a generalization. The changes in state and the holding times are independent, given the initial state.
Suppose that n ∈ N_+, that (x_0, x_1, …, x_n) is a sequence of stable states, and that (t_1, t_2, …, t_n) is a sequence in [0, ∞). Then
P(Y_1 = x_1, τ_1 > t_1, Y_2 = x_2, τ_2 − τ_1 > t_2, …, Y_n = x_n, τ_n − τ_{n−1} > t_n ∣ Y_0 = x_0) = Q(x_0, x_1) e^{−λ(x_0)t_1} Q(x_1, x_2) e^{−λ(x_1)t_2} ⋯ Q(x_{n−1}, x_n) e^{−λ(x_{n−1})t_n}
Proof
Regularity
We now know quite a bit about the structure of a continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} (without instantaneous states). Once the chain enters a given state x ∈ S, the holding time in state x has an exponential distribution with parameter λ(x) ∈ [0, ∞), after which the next state y ∈ S is chosen, independently of the holding time, with probability Q(x, y). However, we don't know everything about the chain. For the sequence {τ_n : n ∈ N} defined above, let τ_∞ = lim_{n→∞} τ_n, which exists in (0, ∞] of course, since the sequence is increasing. Even though the holding time in a state is positive with probability 1, it's possible that τ_∞ < ∞ with positive probability, in which case we know nothing about X_t for t ≥ τ_∞. The event {τ_∞ < ∞} is known as explosion, since it means that X makes infinitely many transitions before the finite time τ_∞. While not as pathological as the existence of instantaneous states, explosion is still to be avoided in most applications.
A Markov chain X = {X_t : t ∈ [0, ∞)} on S is regular if each of the following events has probability 1:
1. X is right continuous.
2. τ_n → ∞ as n → ∞.
There is a simple condition on the exponential parameters and the embedded chain that is equivalent to condition (b).
Suppose that X = {X_t : t ∈ [0, ∞)} is a right-continuous Markov chain on S with exponential parameter function λ and embedded chain Y = (Y_0, Y_1, …). Then τ_n → ∞ as n → ∞ with probability 1 if and only if ∑_{n=0}^∞ 1/λ(Y_n) = ∞ with probability 1.
Proof
Suppose that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S with exponential parameter function λ. If λ is bounded, then X is regular.
Proof
Here is another sufficient condition that is useful when the state space is infinite.
Suppose that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S with exponential parameter function λ : S → [0, ∞). Let S_+ = {x ∈ S : λ(x) > 0}. Then X is regular if
∑_{x∈S_+} 1/λ(x) = ∞   (16.15.19)
Proof
As a corollary, note that if S is finite then λ is bounded, so a continuous-time Markov chain on a finite state space is regular. So to review, if the exponential parameter function λ is finite, the chain X has no instantaneous states. Even better, if λ is bounded or if the conditions in the last theorem are satisfied, then X is regular. A continuous-time Markov chain with bounded exponential parameter function λ is called uniform, for reasons that will become clear in the next section on transition matrices. As we will see in a later section, a uniform continuous-time Markov chain can be constructed from a discrete-time chain and an independent Poisson process. For the next result, recall that to say that X has left limits with probability 1 means that the random function t ↦ X_t has left limits at every t ∈ (0, ∞) with probability 1.
Thus, our standard assumption will be that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S. For such a chain, the behavior of X is completely determined by the exponential parameter function λ that governs the holding times, and the transition probability matrix Q of the jump chain Y. Conversely, when modeling real stochastic systems, we often start with λ and Q. It's then relatively straightforward to construct the continuous-time Markov chain that has these parameters. For simplicity, we will assume that there are no absorbing states. The inclusion of absorbing states is not difficult, but mucks up the otherwise elegant exposition.
Suppose that λ : S → (0, ∞) is bounded and that Q is a probability matrix on S with the property that Q(x, x) = 0 for every x ∈ S. The regular, continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} with exponential parameter function λ and jump transition matrix Q can be constructed as follows:
1. First, let Y = (Y_0, Y_1, …) be a discrete-time Markov chain on S with one-step transition matrix Q.
2. Next, given Y = (x_0, x_1, …), the transition times (τ_1, τ_2, …) are constructed so that the holding times (τ_1, τ_2 − τ_1, …) are independent, with τ_{n+1} − τ_n exponentially distributed with parameter λ(x_n) for n ∈ N.
3. Again given Y = (x_0, x_1, …), define X_t = x_0 for 0 ≤ t < τ_1, and for n ∈ N_+, define X_t = x_n for τ_n ≤ t < τ_{n+1}.
Additional details
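The three steps of the construction translate directly into a simulation routine. The sketch below is illustrative (not from the text); the parameters are those of the chain with λ = (4, 1, 3) used in the computational exercise later in this section.

```python
import random

def simulate_ctmc(lam, Q, x0, t_max, seed=42):
    """Simulate a regular CTMC with exponential parameter function lam and
    jump transition matrix Q (both given as dicts), up to time t_max.
    Returns the piecewise-constant path as a list of (time, state) pairs."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    path = [(t, x)]
    while True:
        t += rng.expovariate(lam[x])     # exponential holding time in state x
        if t >= t_max:
            return path
        states = list(Q[x])
        x = rng.choices(states, weights=[Q[x][y] for y in states])[0]
        path.append((t, x))              # jump to the next state at time t

lam = {0: 4.0, 1: 1.0, 2: 3.0}
Q = {0: {1: 1/2, 2: 1/2}, 1: {0: 1.0}, 2: {0: 1/3, 1: 2/3}}
path = simulate_ctmc(lam, Q, 0, 10.0)
```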
Often, particularly when S is finite, the essential structure of a standard, continuous-time Markov chain can be succinctly
summarized with a graph.
Suppose again that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S, with exponential parameter function λ and embedded transition matrix Q. The state graph of X is the graph with vertex set S and directed edge set E = {(x, y) ∈ S² : Q(x, y) > 0}.
So except for the labels on the vertices, the state graph of X is the same as the state graph of the discrete-time jump chain Y . That
is, there is a directed edge from state x to state y if and only if the chain, when in x, can move to y after the random holding time in
x. Note that the only loops in the state graph correspond to absorbing states, and for such a state there are no outward edges.
Let's return again to the construction above of a continuous-time Markov chain from the jump transition matrix Q and the
exponential parameter function λ . Again for simplicity, assume there are no absorbing states. We assume that Q(x, x) = 0 for all
x ∈ S , so that the state really does change at the transition times. However, if we drop this assumption, the construction still
produces a continuous-time Markov chain, but with an altered jump transition matrix and exponential parameter function.
Suppose that Q is a transition matrix on S with Q(x, x) < 1 for x ∈ S, and that λ : S → (0, ∞) is bounded. The stochastic process X = {X_t : t ∈ [0, ∞)} constructed above from Q and λ is a regular, continuous-time Markov chain with exponential parameter function λ̃ and jump transition matrix Q̃ given by
λ̃(x) = λ(x)[1 − Q(x, x)], x ∈ S
Q̃(x, y) = Q(x, y) / [1 − Q(x, x)], (x, y) ∈ S², x ≠ y
Proof 1
Proof 2
This construction will be important in our study of chains subordinate to the Poisson process.
Transition Times
The structure of a regular Markov chain on S , as described above, can be explained purely in terms of a family of independent,
exponentially distributed random variables. The main tools are some additional special properties of the exponential distribution,
that we need to restate in the setting of our Markov chain. Our interest is in how the process evolves among the stable states until it
enters an absorbing state (if it does). Once in an absorbing state, the chain stays there forever, so the behavior from that point on is
trivial.
Suppose that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S, with exponential parameter function λ and transition probability matrix Q. Define μ(x, y) = λ(x)Q(x, y) for (x, y) ∈ S². Then
1. λ(x) = ∑_{y∈S} μ(x, y) for x ∈ S.
2. Q(x, y) = μ(x, y)/λ(x) if x ∈ S is stable and y ∈ S.
The main point is that the new parameters μ(x, y) for (x, y) ∈ S² determine the exponential parameters λ(x) for x ∈ S, and the transition probabilities Q(x, y) when x ∈ S is stable and y ∈ S. Of course we know that if λ(x) = 0, so that x is absorbing, then Q(x, x) = 1. So in fact, the new parameters, as specified by the function μ, completely determine the old parameters, as specified by λ and Q.
Consider the functions μ, λ, and Q as given in the previous result. Suppose that T_{x,y} has the exponential distribution with parameter μ(x, y) for each (x, y) ∈ S², and that {T_{x,y} : (x, y) ∈ S²} is a set of independent random variables. Then
1. T_x = inf{T_{x,y} : y ∈ S} has the exponential distribution with parameter λ(x) for x ∈ S.
2. P(T_x = T_{x,y}) = Q(x, y) for (x, y) ∈ S².
Proof
So here's how we can think of a regular, continuous-time Markov chain on S: There is a timer associated with each (x, y) ∈ S², set to the random time T_{x,y}. All of the timers function independently. When the chain enters state x ∈ S, the timers on (x, y) for y ∈ S are started simultaneously. As soon as the first alarm goes off, for a particular (x, y), the chain immediately moves to state y, and the process repeats. Of course, if μ(x, y) = 0 then T_{x,y} = ∞ with probability 1, so only the timers with λ(x) > 0 and Q(x, y) > 0 matter (these correspond to the non-loop edges in the state graph). In particular, if x is absorbing, then the timers on (x, y) are set to infinity for each y, and no alarm ever sounds.
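The competing-clocks description can be checked by simulation. The sketch below (an illustration, not part of the text) uses state x = 0 of the chain from the computational exercise, with λ(0) = 4 and Q(0, 1) = Q(0, 2) = 1/2, so μ(0, 1) = μ(0, 2) = 2; the minimum of the two timers should be exponential with rate 4, and each timer should win about half the time.

```python
import random

rng = random.Random(0)
mu = {1: 2.0, 2: 2.0}      # rates mu(0, y) = lam(0) Q(0, y) on the edges (0, y)
n = 50_000
total_min, wins = 0.0, {1: 0, 2: 0}
for _ in range(n):
    timers = {y: rng.expovariate(r) for y, r in mu.items()}
    y_star = min(timers, key=timers.get)   # the first alarm determines the jump
    wins[y_star] += 1
    total_min += timers[y_star]

mean_min = total_min / n   # should be close to 1/lam(0) = 0.25
frac_1 = wins[1] / n       # should be close to Q(0, 1) = 0.5
```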
The new collection of exponential parameters can be used to give an alternate version of the state graph. Again, the vertex set is S and the edge set is E = {(x, y) ∈ S² : Q(x, y) > 0}. But now each edge (x, y) is labeled with the exponential rate parameter μ(x, y). The exponential rate parameters are closely related to the generator matrix, a matrix of fundamental importance that we will study in the next section.

Consider the Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1}, with transition rate a ∈ [0, ∞) from 0 to 1 and transition rate b ∈ [0, ∞) from 1 to 0.
The transition matrix Q for the embedded chain is given below. Draw the state graph in each case.
1. Q = [[0, 1], [1, 0]] if a > 0 and b > 0, so that both states are stable.
2. Q = [[1, 0], [1, 0]] if a = 0 and b > 0, so that state 0 is absorbing and state 1 is stable.
3. Q = [[0, 1], [0, 1]] if a > 0 and b = 0, so that state 0 is stable and state 1 is absorbing.
4. Q = [[1, 0], [0, 1]] if a = 0 and b = 0, so that both states are absorbing.
Computational Exercises
Consider the Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1, 2} with exponential parameter function λ = (4, 1, 3) and embedded transition matrix
Q = [[0, 1/2, 1/2], [1, 0, 0], [1/3, 2/3, 0]]   (16.15.27)
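As a supplementary sketch (not part of the stated exercise), the exponential rate parameters μ(x, y) = λ(x)Q(x, y) for this chain, which label the edges of the alternate state graph, can be tabulated exactly:

```python
from fractions import Fraction as F

lam = [F(4), F(1), F(3)]
Q = [[F(0), F(1, 2), F(1, 2)],
     [F(1), F(0), F(0)],
     [F(1, 3), F(2, 3), F(0)]]
# mu(x, y) = lam(x) Q(x, y): the exponential rate on the edge (x, y)
mu = [[lam[x] * Q[x][y] for y in range(3)] for x in range(3)]
```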
Special Models
Read the introduction to chains subordinate to the Poisson process.
This page titled 16.15: Introduction to Continuous-Time Markov Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or
curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed
edit history is available upon request.
16.16: Transition Matrices and Generators of Continuous-Time Chains
Our starting point is a time-homogeneous, continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} with discrete state space (S, S). By the very meaning of Markov chain, the set of states S is countable and the σ-algebra S is the collection of all
subsets of S . So every subset of S is measurable, as is every function from S to another measurable space. Recall that S is also
the Borel σ algebra corresponding to the discrete topology on S . With this topology, every function from S to another topological
space is continuous. Counting measure # is the natural measure on (S, S ), so in the context of the general introduction, integrals
over S are simply sums. Also, kernels on S can be thought of as matrices, with rows and columns indexed by S. The left and right
kernel operations are generalizations of matrix multiplication.
A space of functions on S plays an important role. Let B denote the collection of bounded functions f : S → R . With the usual
pointwise definitions of addition and scalar multiplication, B is a vector space. The supremum norm on B is given by
∥f ∥ = sup{|f (x)| : x ∈ S}, f ∈ B (16.16.1)
Of course, if S is finite, B is the set of all real-valued functions on S , and ∥f ∥ = max{|f (x)| : x ∈ S} for f ∈ B .
In the last section, we studied X in terms of when and how the state changes. To review briefly, let τ = inf{t ∈ (0, ∞) : X_t ≠ X_0}. Assuming that X is right continuous, the Markov property of X implies the memoryless property of τ, and hence the distribution of τ given X_0 = x is exponential with parameter λ(x) ∈ [0, ∞) for each x ∈ S. The assumption of right continuity rules out the pathological possibility that λ(x) = ∞, which would mean that x is an instantaneous state, so that P(τ = 0 ∣ X_0 = x) = 1. On the other hand, if λ(x) ∈ (0, ∞) then x is a stable state, so that τ has a proper exponential distribution given X_0 = x, with P(0 < τ < ∞ ∣ X_0 = x) = 1. Finally, if λ(x) = 0 then x is an absorbing state, so that P(τ = ∞ ∣ X_0 = x) = 1. Next we define a sequence of stopping times: First τ_0 = 0 and τ_1 = τ. Recursively, if τ_n < ∞ then τ_{n+1} is the next time after τ_n that the chain changes state, while if τ_n = ∞ then τ_{n+1} = ∞. The jump chain Y = (Y_0, Y_1, …), with Y_n = X_{τ_n}, is a discrete-time Markov chain with one-step transition matrix Q given by Q(x, y) = P(X_τ = y ∣ X_0 = x) if x, y ∈ S with x stable, and Q(x, x) = 1 if x ∈ S is absorbing. Assuming that X is regular, which means that τ_n → ∞ as n → ∞ with probability 1 (ruling out the explosion event of infinitely many transitions in finite time), the structure of X is completely determined by the sequence of stopping times τ = (τ_0, τ_1, …) and the discrete-time jump chain Y = (Y_0, Y_1, …). Analytically, the distribution of X is determined by the exponential parameter function λ and the one-step transition matrix Q of the jump chain.
In this section, we will study the Markov chain X in terms of the transition matrices in continuous time and a fundamentally important matrix known as the generator. Naturally, the connections between the two points of view are particularly interesting.
For t ∈ [0, ∞), the transition matrix P_t of X is given by
P_t(x, y) = P(X_t = y ∣ X_0 = x), (x, y) ∈ S²   (16.16.2)
Note that since we are assuming that the Markov chain is homogeneous,
P_t(x, y) = P(X_{s+t} = y ∣ X_s = x), (x, y) ∈ S²   (16.16.3)
16.16.1 https://stats.libretexts.org/@go/page/10389
for every s, t ∈ [0, ∞). The Chapman-Kolmogorov equation given next is essentially yet another restatement of the Markov property. The equation is named for Andrei Kolmogorov and Sydney Chapman.
Suppose that P = {P_t : t ∈ [0, ∞)} is the collection of transition matrices for the chain X. Then P_s P_t = P_{s+t} for s, t ∈ [0, ∞). Explicitly,
P_{s+t}(x, z) = ∑_{y∈S} P_s(x, y) P_t(y, z), (x, z) ∈ S²
Proof
Restated in another form of jargon, the collection P = {P_t : t ∈ [0, ∞)} is a semigroup of probability matrices. The semigroup of transition matrices P, along with the initial distribution, determines the finite-dimensional distributions of X.
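The semigroup property can be checked numerically on a concrete example. The sketch below (illustrative, not part of the text) uses the two-state chain with rates a (from 0 to 1) and b (from 1 to 0), whose transition matrix is computed explicitly at the end of this section:

```python
import math

def P(t, a, b):
    # Closed-form transition matrix of the two-state chain (rates a and b)
    s = a + b
    e = math.exp(-s * t)
    return [[(b + a * e) / s, (a - a * e) / s],
            [(b - b * e) / s, (a + b * e) / s]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

a, b = 2.0, 3.0
s_, t_ = 0.4, 1.1
lhs = matmul(P(s_, a, b), P(t_, a, b))   # P_s P_t
rhs = P(s_ + t_, a, b)                   # P_{s+t}
```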
Suppose that X_0 has probability density function f, that (t_1, t_2, …, t_n) is a time sequence with 0 < t_1 < t_2 < ⋯ < t_n, and that (x_0, x_1, …, x_n) ∈ S^{n+1} is a state sequence. Then
P(X_0 = x_0, X_{t_1} = x_1, …, X_{t_n} = x_n) = f(x_0) P_{t_1}(x_0, x_1) P_{t_2−t_1}(x_1, x_2) ⋯ P_{t_n−t_{n−1}}(x_{n−1}, x_n)   (16.16.8)
Proof
As with any matrix on S , the transition matrices define left and right operations on functions which are generalizations of matrix
multiplication. For a transition matrix, both have natural interpretations.
Suppose that f : S → R, and that either f is nonnegative or f ∈ B. Then for t ∈ [0, ∞),
P_t f(x) = ∑_{y∈S} P_t(x, y) f(y) = E[f(X_t) ∣ X_0 = x], x ∈ S
If f is nonnegative and S is infinite, then it's possible that P_t f(x) = ∞. In general, the left operation of a positive kernel acts on positive measures on the state space. In the setting here, if μ is a positive (Borel) measure on (S, S), then the function f : S → [0, ∞) given by f(x) = μ{x} for x ∈ S is the density function of μ with respect to counting measure # on (S, S). This simply means that μ(A) = ∑_{x∈A} f(x) for A ⊆ S. Conversely, given f : S → [0, ∞), the set function μ(A) = ∑_{x∈A} f(x) for A ⊆ S defines a positive measure on (S, S) with f as its density function. So for the left operation of P_t, it's natural to consider nonnegative functions.

If f : S → [0, ∞) then
f P_t(y) = ∑_{x∈S} f(x) P_t(x, y), y ∈ S
Proof
More generally, if f is the density function of a positive measure μ on (S, S), then f P_t is the density function of the measure μ P_t, defined by
μ P_t(A) = ∑_{x∈S} μ{x} P_t(x, A) = ∑_{x∈S} f(x) P_t(x, A), A ⊆ S
A function f : S → [0, ∞) is invariant for the Markov chain X (or for the transition semigroup P ) if f Pt = f for every
t ∈ [0, ∞).
It follows that if X_0 has an invariant probability density function f, then X_t has probability density function f for every t ∈ [0, ∞), so X is identically distributed. Invariant and limiting distributions are fundamentally important for continuous-time Markov chains.
Standard Semigroups
Suppose again that X = {X_t : t ∈ [0, ∞)} is a Markov chain on S with transition semigroup P = {P_t : t ∈ [0, ∞)}. Once again, continuity assumptions need to be imposed on X in order to rule out strange behavior that would otherwise greatly complicate the theory. In terms of the transition semigroup P, here is the basic assumption: the semigroup P is standard if P_t(x, x) → 1 as t ↓ 0 for each x ∈ S.
Since P_0(x, x) = 1 for x ∈ S, the standard assumption is clearly a continuity assumption. It actually implies much stronger continuity properties.

If the transition semigroup P = {P_t : t ∈ [0, ∞)} is standard, then the function t ↦ P_t(x, y) is right continuous for each (x, y) ∈ S².
Proof
Our next result connects one of the basic assumptions in the section on transition times and the embedded chain with the standard
assumption here.
If the Markov chain X has no instantaneous states then the transition semigroup P is standard.
Proof
Recall that the non-existence of instantaneous states is essentially equivalent to the right continuity of X. So we have the nice
result that if X is right continuous, then so is P. For the remainder of our discussion, we assume that X = {X_t : t ∈ [0, ∞)} is a regular Markov chain on S with transition semigroup P = {P_t : t ∈ [0, ∞)}, exponential parameter function λ, and one-step transition matrix Q for the jump chain. Our next result is the fundamental integral equation relating P, λ, and Q:

P_t(x, y) = I(x, y)e^{−λ(x)t} + ∫_0^t λ(x)e^{−λ(x)s} Q P_{t−s}(x, y) ds, (x, y) ∈ S²
Proof
We can now improve on the continuity result that we got earlier. First recall the leads to relation for the jump chain Y: For (x, y) ∈ S², x leads to y if Q^n(x, y) > 0 for some n ∈ N. So by definition, x leads to x for each x ∈ S, and for (x, y) ∈ S² with x ≠ y, x leads to y if and only if the discrete-time chain starting in x eventually reaches y with positive probability.
For (x, y) ∈ S²,
1. t ↦ P_t(x, y) is continuous.
2. If x leads to y then P_t(x, y) > 0 for every t ∈ (0, ∞).
3. If x does not lead to y then P_t(x, y) = 0 for every t ∈ (0, ∞).
Proof
Parts 2 and 3 are known as the Lévy dichotomy, named for Paul Lévy. It's possible to prove the Lévy dichotomy just from the semigroup property of P, but this proof is considerably more complicated. In light of the dichotomy, the leads to relation clearly makes sense for the continuous-time chain X as well as the discrete-time embedded chain Y.
Suppose again that X is a regular Markov chain on S with transition semigroup P = {P_t : t ∈ [0, ∞)}, exponential parameter function λ, and one-step transition matrix Q for the embedded jump chain. The fundamental integral equation above now implies that the transition probability matrix P_t is differentiable in t. The derivative at 0 is particularly important.
(P_t − I)/t → G as t ↓ 0   (16.16.25)
where the infinitesimal generator matrix G is given by G(x, y) = −λ(x)I(x, y) + λ(x)Q(x, y) for (x, y) ∈ S².
Proof
Note that λ(x)Q(x, x) = 0 for every x ∈ S, since λ(x) = 0 if x is absorbing, while Q(x, x) = 0 if x is stable. So G(x, x) = −λ(x) for x ∈ S, and G(x, y) = λ(x)Q(x, y) for (x, y) ∈ S² with y ≠ x. Thus, the generator matrix G determines the exponential parameter function λ and the jump transition matrix Q, and thus determines the distribution of the Markov chain X.
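So G is immediate from λ and Q: negate λ on the diagonal, and scale the rows of Q by λ off the diagonal. As an illustrative sketch (not from the text), using the chain with λ = (4, 1, 3) from the computational exercises:

```python
lam = [4.0, 1.0, 3.0]
Q = [[0, 1/2, 1/2],
     [1, 0, 0],
     [1/3, 2/3, 0]]
# G(x, y) = -lam(x) I(x, y) + lam(x) Q(x, y)
G = [[lam[x] * (Q[x][y] - (1.0 if x == y else 0.0)) for y in range(3)]
     for x in range(3)]
row_sums = [sum(row) for row in G]   # every row of a generator sums to 0
```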
The infinitesimal generator has a nice interpretation in terms of our discussion in the last section. Recall that when the chain first
enters a stable state x, we set independent, exponentially distributed “timers” on (x, y), for each y ∈ S − {x} . Note that G(x, y) is
the exponential parameter for the timer on (x, y). As soon as an alarm sounds for a particular (x, y), the chain moves to state y and
the process continues.
The matrix function t ↦ P_t is differentiable on [0, ∞), and satisfies the Kolmogorov backward equation: P′_t = G P_t. Explicitly,
P′_t(x, y) = −λ(x)P_t(x, y) + ∑_{z∈S} λ(x)Q(x, z)P_t(z, y), (x, y) ∈ S²   (16.16.27)
Proof
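The backward equation can be solved numerically when S is finite. The sketch below (illustrative, not from the text) applies a simple Euler discretization of P′_t = G P_t, with initial condition P_0 = I, to the generator of the λ = (4, 1, 3) chain from the computational exercises, and checks that the result remains a probability matrix:

```python
# Generator of the 3-state chain: G(x, x) = -lam(x), G(x, y) = lam(x) Q(x, y)
G = [[-4.0, 2.0, 2.0],
     [1.0, -1.0, 0.0],
     [1.0, 2.0, -3.0]]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

h, steps = 0.001, 2000          # integrate the backward equation out to t = 2
P = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]  # P_0 = I
for _ in range(steps):
    GP = mat_mul(G, P)          # right-hand side G P_t
    P = [[P[i][j] + h * GP[i][j] for j in range(3)] for i in range(3)]

row_sums = [sum(row) for row in P]   # G has zero row sums, so these stay 1
```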
The backward equation is named for Andrei Kolmogorov. In continuous time, the transition semigroup P = {P_t : t ∈ [0, ∞)} can be obtained from the single generator matrix G in a way that is reminiscent of the fact that in discrete time, the transition semigroup P = {P^n : n ∈ N} can be obtained from the single one-step matrix P. From a modeling point of view, we often start with the generator matrix G and then solve the backward equation, subject to the initial condition P_0 = I, to obtain the transition semigroup P.
If f ∈ B then Gf is given by
Gf(x) = ∑_{y∈S} G(x, y) f(y) = λ(x)[Q f(x) − f(x)], x ∈ S
Proof
But note that Gf is not in B unless λ ∈ B . Without this additional assumption, G is a linear operator from the vector space B of
bounded functions from S to R into the vector space of all functions from S to R. We will return to this point in our next
discussion.
Uniform Transition Semigroups
We can obtain stronger results for the generator matrix if we impose stronger continuity assumptions on P .
Proof
As usual, we want to look at this new assumption from different points of view.
So when the equivalent conditions are satisfied, the Markov chain X = {X_t : t ∈ [0, ∞)} is also said to be uniform. As we will
see in a later section, a uniform, continuous-time Markov chain can be constructed from a discrete-time Markov chain and an
independent Poisson process. For a uniform transition semigroup, we have a companion to the backward equation.
Suppose that P is a uniform transition semigroup. Then t ↦ P_t satisfies the Kolmogorov forward equation P′_t = P_t G. Explicitly,
P′_t(x, y) = −λ(y)P_t(x, y) + ∑_{z∈S} P_t(x, z)λ(z)Q(z, y), (x, y) ∈ S²   (16.16.33)
The backward equation holds with more generality than the forward equation, since we only need the transition semigroup P to be standard rather than uniform. It would seem that we need stronger conditions on λ for the forward equation to hold, for otherwise it's not even obvious that ∑_{z∈S} P_t(x, z)λ(z)Q(z, y) is finite for (x, y) ∈ S². On the other hand, the forward equation is sometimes easier to solve than the backward equation, and the assumption that λ is bounded is met in many applications (and of course holds automatically if S is finite).
As a simple corollary, the transition matrices and the generator matrix commute for a uniform semigroup: P_t G = G P_t for t ∈ [0, ∞). The forward and backward equations formally look like the differential equations for the exponential function. This suggests the following result:
Suppose again that P = {P_t : t ∈ [0, ∞)} is a uniform transition semigroup with generator G. Then
P_t = e^{tG} = ∑_{n=0}^∞ (t^n / n!) G^n, t ∈ [0, ∞)   (16.16.34)
Proof
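For a finite state space, the exponential representation can be computed directly from the power series. The sketch below (illustrative, with a modest truncation of the series) exponentiates the generator of the λ = (4, 1, 3) chain from the computational exercises and confirms that the result is a probability matrix:

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(G, t, terms=60):
    # P_t = e^{tG} = sum_{k >= 0} t^k G^k / k!, truncated after `terms` terms
    n = len(G)
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]            # running term t^k G^k / k!
    for k in range(1, terms):
        term = mat_mul(term, G)
        term = [[t * v / k for v in row] for row in term]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

G = [[-4.0, 2.0, 2.0], [1.0, -1.0, 0.0], [1.0, 2.0, -3.0]]
Pt = mat_exp(G, 0.5)
row_sums = [sum(row) for row in Pt]   # each row of P_t sums to 1
```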
We can characterize the generators of uniform transition semigroups. We just need the minimal conditions that the off-diagonal entries are nonnegative, the diagonal entries are nonpositive, and the row sums are 0.

Suppose that G is a matrix on S with ∥G∥ < ∞. Then G is the generator of a uniform transition semigroup P = {P_t : t ∈ [0, ∞)} if and only if for every x ∈ S,
1. G(x, x) ≤ 0
2. G(x, y) ≥ 0 for y ∈ S with y ≠ x
3. ∑_{y∈S} G(x, y) = 0
Proof
Consider again the Markov chain X on S = {0, 1} with transition rate a ∈ [0, ∞) from 0 to 1 and transition rate b ∈ [0, ∞) from 1 to 0. This two-state Markov chain was studied in the previous section. To avoid the trivial case with both states absorbing, we will assume that a + b > 0.
Show that for t ∈ [0, ∞),
P_t = (1/(a+b)) [[b, a], [b, a]] − (e^{−(a+b)t}/(a+b)) [[−a, a], [b, −b]]   (16.16.39)
You probably noticed that the forward equation is easier to solve because there is less coupling of terms than in the backward
equation.
Define f(0) = b/(a+b) and f(1) = a/(a+b). Show that
1. P_t → (1/(a+b)) [[b, a], [b, a]] as t → ∞, the matrix with f in both rows.
2. f P_t = f for t ∈ [0, ∞), so that f is invariant for X.
3. f G = 0.
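Part 3 is a one-line computation. As an illustrative sketch with arbitrarily chosen rates a and b (not part of the exercise's stated solution):

```python
a, b = 2.0, 3.0
G = [[-a, a], [b, -b]]                 # generator of the two-state chain
f = [b / (a + b), a / (a + b)]         # limiting/invariant distribution
fG = [f[0] * G[0][j] + f[1] * G[1][j] for j in range(2)]   # should be (0, 0)
```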
Computational Exercises
Consider the Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1, 2} with exponential parameter function λ = (4, 1, 3) and embedded transition matrix
Q = [[0, 1/2, 1/2], [1, 0, 0], [1/3, 2/3, 0]]   (16.16.40)
4. Find lim_{t→∞} P_t.
Answer
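One way to find the limit in part 4 is to solve fG = 0 subject to f summing to 1, since the limiting matrix has the invariant distribution f in each row. The candidate f = (3/15, 10/15, 2/15) below was computed by hand for this sketch; it is an independent check, not quoted from the text's answer.

```python
from fractions import Fraction as F

# Generator of the chain: G(x, x) = -lam(x), G(x, y) = lam(x) Q(x, y)
G = [[F(-4), F(2), F(2)],
     [F(1), F(-1), F(0)],
     [F(1), F(2), F(-3)]]
f = [F(3, 15), F(10, 15), F(2, 15)]    # candidate invariant distribution
fG = [sum(f[i] * G[i][j] for i in range(3)) for j in range(3)]
```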
Special Models
Read the discussion of generator and transition matrices for chains subordinate to the Poisson process.
Read the discussion of the infinitesimal generator for continuous-time birth-death chains.
Read the discussion of the infinitesimal generator for continuous-time queuing chains.
Read the discussion of the infinitesimal generator for continuous-time branching chains.
This page titled 16.16: Transition Matrices and Generators of Continuous-Time Chains is shared under a CC BY 2.0 license and was authored,
remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts
platform; a detailed edit history is available upon request.
16.17: Potential Matrices
Preliminaries
This is the third of the introductory sections on continuous-time Markov chains. So our starting point is a time-homogeneous
Markov chain X = {X_t : t ∈ [0, ∞)} defined on an underlying probability space (Ω, F, P) and with discrete state space (S, S).
Thus S is countable and S is the power set of S , so every subset of S is measurable, as is every function from S into another
measurable space. In addition, S is given the discrete topology, so that S can also be thought of as the Borel σ-algebra. Every function from S to another topological space is continuous. Counting measure # is the natural measure on (S, S), so in the context of the general introduction, integrals over S are simply sums. Also, kernels on S can be thought of as matrices, with rows and columns indexed by S, so the left and right kernel operations are generalizations of matrix multiplication. As before, let B denote
the collection of bounded functions f : S → R . With the usual pointwise definitions of addition and scalar multiplication, B is a
vector space. The supremum norm on B is given by
∥f ∥ = sup{|f (x)| : x ∈ S}, f ∈ B (16.17.1)
Of course, if S is finite, B is the set of all real-valued functions on S , and ∥f ∥ = max{|f (x)| : x ∈ S} for f ∈ B . The time
space is ([0, ∞), T ) where as usual, T is the Borel σ-algebra on [0, ∞) corresponding to the standard Euclidean topology.
Lebesgue measure is the natural measure on ([0, ∞), T ).
In our first point of view, we studied X in terms of when and how the state changes. To review briefly, let τ = inf{t ∈ (0, ∞) : X_t ≠ X_0}. Assuming that X is right continuous, the Markov property of X implies the memoryless property of τ, and hence the distribution of τ given X_0 = x is exponential with parameter λ(x) ∈ [0, ∞) for each x ∈ S. The assumption of right continuity rules out the pathological possibility that λ(x) = ∞, which would mean that x is an instantaneous state, so that P(τ = 0 ∣ X_0 = x) = 1. On the other hand, if λ(x) ∈ (0, ∞) then x is a stable state, so that τ has a proper exponential distribution given X_0 = x, with P(0 < τ < ∞ ∣ X_0 = x) = 1. Finally, if λ(x) = 0 then x is an absorbing state, so that P(τ = ∞ ∣ X_0 = x) = 1. Next we define a sequence of stopping times: First τ_0 = 0 and τ_1 = τ. Recursively, if τ_n < ∞ then τ_{n+1} is the next time after τ_n that the chain changes state, while if τ_n = ∞ then τ_{n+1} = ∞. The jump chain Y = (Y_0, Y_1, …), with Y_n = X_{τ_n}, is a discrete-time Markov chain with one-step transition matrix Q given by Q(x, y) = P(X_τ = y ∣ X_0 = x) if x, y ∈ S with x stable, and Q(x, x) = 1 if x ∈ S is absorbing. Assuming that X is regular, which means that τ_n → ∞ as n → ∞ with probability 1 (ruling out the explosion event of infinitely many transitions in finite time), the structure of X is completely determined by the sequence of stopping times τ = (τ_0, τ_1, …) and the embedded discrete-time jump chain Y = (Y_0, Y_1, …). Analytically, the distribution of X is determined by the exponential parameter function λ and the one-step transition matrix Q of the jump chain.
In our second point of view, we studied X in terms of the collection of transition matrices P = {Pt : t ∈ [0, ∞)}, where for t ∈ [0, ∞),
Pt (x, y) = P(Xt = y ∣ X0 = x), (x, y) ∈ S² (16.17.2)
The Markov and time-homogeneous properties imply the Chapman-Kolmogorov equations Ps Pt = Ps+t for s, t ∈ [0, ∞), so that P is a semigroup of transition matrices. The semigroup P, along with the initial distribution of X0, completely determines the distribution of X. For a regular Markov chain X, the fundamental integral equation connecting the two points of view is
Pt (x, y) = I(x, y) e^{−λ(x)t} + ∫_0^t λ(x) e^{−λ(x)s} Q Pt−s (x, y) ds, (x, y) ∈ S² (16.17.3)
which is obtained by conditioning on τ and Xτ. It then follows that the matrix function t ↦ Pt is differentiable, with the derivative satisfying the Kolmogorov backward equation Pt′ = G Pt, where the generator matrix G is given by
G(x, y) = −λ(x) I(x, y) + λ(x) Q(x, y), (x, y) ∈ S² (16.17.4)
If the exponential parameter function λ is bounded, then the transition semigroup P is uniform, which leads to stronger results. The generator G is a bounded operator on B, and the backward equation holds, as well as a companion forward equation Pt′ = Pt G, as operators on B (so with respect to the supremum norm rather than just pointwise). Finally, we can represent the transition matrix as an exponential: Pt = e^{tG} for t ∈ [0, ∞).
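As a purely illustrative numerical check (not part of the text), the sketch below builds the generator G = −λI + λQ for the three-state chain used in the computational exercises later in this section (λ = (4, 1, 3)), and verifies that Pt = e^{tG} is a transition matrix satisfying the Chapman-Kolmogorov equations. The `expm` helper is a simple scaling-and-squaring Taylor series, included only to avoid dependencies beyond NumPy.

```python
import numpy as np

def expm(A, terms=25):
    """Matrix exponential via scaling-and-squaring with a Taylor series."""
    s = max(0, int(np.ceil(np.log2(max(np.linalg.norm(A, np.inf), 1e-12)))) + 1)
    B = A / 2.0**s
    E, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ B / k
        E = E + term
    for _ in range(s):        # undo the scaling by repeated squaring
        E = E @ E
    return E

# Exponential parameters and jump transition matrix from the exercise chain.
lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0, 1/2, 1/2],
              [1, 0, 0],
              [1/3, 2/3, 0]])

# Generator: G(x, y) = -lam(x) I(x, y) + lam(x) Q(x, y).
G = -np.diag(lam) + np.diag(lam) @ Q

P = lambda t: expm(t * G)     # P_t = e^{tG}
P1, P2 = P(1.0), P(2.0)
```

The rows of Pt sum to 1, the entries are nonnegative, and P1 P1 = P2, reflecting the semigroup property.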
16.17.1 https://stats.libretexts.org/@go/page/10390
In this section, we study the Markov chain X in terms of a family of matrices known as potential matrices. This is the least
intuitive of the three points of view, but analytically one of the best approaches. Essentially, the potential matrices are transforms of
the transition matrices.
Basic Theory
We assume again that X = {Xt : t ∈ [0, ∞)} is a regular Markov chain on S with transition semigroup P = {Pt : t ∈ [0, ∞)}.
Our first discussion closely parallels the general theory, except for simplifications caused by the discrete state space.
For α ∈ [0, ∞), the α-potential matrix Uα of X is given by
Uα (x, y) = ∫_0^∞ e^{−αt} Pt (x, y) dt, (x, y) ∈ S² (16.17.5)
In particular, for (x, y) ∈ S², U(x, y) = U0 (x, y) is the expected amount of time that X spends in y, starting at x.
Proof
It's quite possible that U(x, y) = ∞ for some (x, y) ∈ S², and knowing when this is the case is of considerable interest. If f : S → R, then giving the right potential operator in its various forms,
Uα f(x) = ∑x∈S Uα (x, y) f(y) = ∫_0^∞ e^{−αt} Pt f(x) dt = ∫_0^∞ e^{−αt} [∑y∈S Pt (x, y) f(y)] dt = ∫_0^∞ e^{−αt} E[f(Xt) ∣ X0 = x] dt, x ∈ S
assuming, as always, that the sums and integrals make sense. This will be the case in particular if f is nonnegative (although ∞ is a
possible value), or as we will now see, if f ∈ B and α > 0 .
If f ∈ B and α ∈ (0, ∞), then |Uα f(x)| ≤ ∥f∥/α for all x ∈ S.
Proof
It follows that for α ∈ (0, ∞), the right potential operator Uα is a bounded, linear operator on B with ∥Uα∥ = 1/α. It also follows that αUα is a probability matrix. This matrix has a nice interpretation.
So αUα is a transition probability matrix, just as Pt is a transition probability matrix, but corresponding to the random time T (with α ∈ (0, ∞) as a parameter), rather than the deterministic time t ∈ [0, ∞). The potential matrix can also be interpreted in economic terms. Suppose that we receive money at a rate of one unit per unit time whenever the process X is in a particular state y ∈ S. Then U(x, y) is the expected total amount of money that we receive, starting in state x ∈ S. But money that we receive later is of less value to us now than money that we will receive sooner. Specifically, suppose that one monetary unit at time t ∈ [0, ∞) has a present value of e^{−αt}, where α ∈ (0, ∞) is the inflation factor or discount factor. Then Uα (x, y) is the total, expected, discounted amount that we receive, starting in x ∈ S. A bit more generally, suppose that f ∈ B and that f(y) is the reward (or cost, depending on the sign) per unit time that we receive when the process is in state y ∈ S. Then Uα f(x) is the expected, total, discounted reward, starting in x ∈ S.
αUα → I as α → ∞.
Proof
If f : S → [0, ∞), then giving the left potential operator in its various forms,
f Uα (y) = ∑x∈S f(x) Uα (x, y) = ∫_0^∞ e^{−αt} f Pt (y) dt = ∫_0^∞ e^{−αt} [∑x∈S f(x) Pt (x, y)] dt = ∫_0^∞ e^{−αt} [∑x∈S f(x) P(Xt = y ∣ X0 = x)] dt, y ∈ S
In particular, suppose that α > 0 and that f is the probability density function of X0. Then f Pt is the probability density function of Xt for t ∈ [0, ∞), and hence from the last result, αf Uα is the probability density function of XT, where again, T is independent of X and has the exponential distribution on [0, ∞) with parameter α. The family of potential kernels gives the same information as the family of transition kernels.
The resolvent U = { Uα : α ∈ (0, ∞)} completely determines the family of transition kernels P = { Pt : t ∈ (0, ∞)} .
Proof
Although not as intuitive from a probability viewpoint, the potential matrices are in some ways nicer than the transition matrices because of additional smoothness. In particular, the resolvent {Uα : α ∈ [0, ∞)}, along with the initial distribution, completely determines the finite-dimensional distributions of the Markov chain X. The potential matrices commute with the transition matrices and with each other.
Proof
The equations above are matrix equations, and so hold pointwise. The same identities hold for the right operators on the space B under the additional restriction that α > 0 and β > 0. The fundamental equation that relates the potential kernels, known as the resolvent equation, is given in the next theorem:
For α, β ∈ (0, ∞) with α ≤ β,
Uα = Uβ + (β − α) Uα Uβ
The equation above is a matrix equation, and so holds pointwise. The same identity holds for the right potential operators on the space B, under the additional restriction that α > 0.
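On a finite state space the resolvent equation Uα = Uβ + (β − α) Uα Uβ can be checked directly, since there Uα = (αI − G)^{−1} (a fact established below). The following sketch is illustrative only, using the generator of the three-state exercise chain with λ = (4, 1, 3):

```python
import numpy as np

# Generator of the three-state exercise chain: G = -diag(lam) + diag(lam) @ Q
lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0, 1/2, 1/2],
              [1, 0, 0],
              [1/3, 2/3, 0]])
G = -np.diag(lam) + np.diag(lam) @ Q

def U(alpha):
    """alpha-potential matrix, via the inverse formula valid for finite S."""
    return np.linalg.inv(alpha * np.eye(3) - G)

alpha, beta = 1.0, 2.5
lhs = U(alpha)
rhs = U(beta) + (beta - alpha) * U(alpha) @ U(beta)
```

The two sides agree to machine precision, and αUα has rows summing to 1, as the theory requires.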
Recall that the distribution of X is determined by the exponential parameter function λ and the one-step transition matrix Q for the jump chain. There are fundamental connections between the potential matrix Uα and the generator matrix G.
Proof 1
Proof 2
Proof 3
As before, we can get stronger results if we assume that λ is bounded, or equivalently, the transition semigroup P is uniform.
Suppose that λ is bounded and α ∈ (0, ∞) . Then as operators on B (and hence also as matrices),
1. I + G Uα = αUα
2. I + Uα G = αUα
Proof
16.17.3 https://stats.libretexts.org/@go/page/10390
As matrices, the equation in (a) holds with more generality than the equation in (b), much as the Kolmogorov backward equation
holds with more generality than the forward equation. Note that
Uα G(x, y) = ∑z∈S Uα (x, z) G(z, y) = −λ(y) Uα (x, y) + ∑z∈S Uα (x, z) λ(z) Q(z, y), (x, y) ∈ S² (16.17.26)
Suppose that λ is bounded and α ∈ (0, ∞) . Then as operators on B (and hence also as matrices),
1. Uα = (αI − G)^{−1}
2. G = αI − Uα^{−1}
Proof
So the potential operator Uα and the generator G have a simple, elegant inverse relationship. Of course, these results hold in particular if S is finite, so that all of the various matrices really are matrices in the elementary sense.
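For the two-state chain reviewed next, the inverse relationship can be checked against the closed-form expression for Uα. A small illustrative sketch (the rates a = 2, b = 3 and the parameter α = 3/2 are arbitrary sample values, not from the text):

```python
import numpy as np

a, b, alpha = 2.0, 3.0, 1.5           # sample rates and transform parameter
G = np.array([[-a, a], [b, -b]])      # generator of the two-state chain

# Inverse formula: U_alpha = (alpha I - G)^{-1}
U_inv = np.linalg.inv(alpha * np.eye(2) - G)

# Closed form from the text: two constant matrices weighted by scalars.
U_closed = (np.array([[b, a], [b, a]]) / (alpha * (a + b))
            - np.array([[-a, a], [b, -b]]) / ((alpha + a + b) * (a + b)))
```

The two computations agree, and αUα is again a probability matrix.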
Consider again the two-state chain X on S = {0, 1}, with transition rate a ∈ [0, ∞) from 0 to 1 and transition rate b ∈ [0, ∞) from 1 to 0. To avoid the trivial case with both states absorbing, we will assume that a + b > 0. The first two results below are a review from the previous two sections.

Uα = 1/(α(a + b)) ⎡ b  a ⎤ − 1/((α + a + b)(a + b)) ⎡ −a   a ⎤ (16.17.29)
                  ⎣ b  a ⎦                           ⎣  b  −b ⎦
Computational Exercises
Consider the Markov chain X = {Xt : t ∈ [0, ∞)} on S = {0, 1, 2} with exponential parameter function λ = (4, 1, 3) and jump transition matrix

Q = ⎡ 0    1/2  1/2 ⎤
    ⎢ 1    0    0   ⎥ (16.17.30)
    ⎣ 1/3  2/3  0   ⎦
Answer
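The exercise can be explored numerically; the sketch below is illustrative (the values are computed directly from λ and Q, not quoted from the text's answer). It builds the generator G, computes the potential matrix U1 = (I − G)^{−1}, and confirms that αUα is a probability matrix.

```python
import numpy as np

lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0, 1/2, 1/2],
              [1, 0, 0],
              [1/3, 2/3, 0]])
G = -np.diag(lam) + np.diag(lam) @ Q         # generator matrix

alpha = 1.0
U1 = np.linalg.inv(alpha * np.eye(3) - G)    # 1-potential matrix
prob = alpha * U1                            # transition matrix for an Exp(1) random time
```

Since the chain is irreducible, every entry of U1 is strictly positive.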
Special Models
Read the discussion of potential matrices for chains subordinate to the Poisson process.
This page titled 16.17: Potential Matrices is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
16.18: Stationary and Limiting Distributions of Continuous-Time Chains
In this section, we study the limiting behavior of continuous-time Markov chains by focusing on two interrelated ideas: invariant
(or stationary) distributions and limiting distributions. In some ways, the limiting behavior of continuous-time chains is simpler
than the limiting behavior of discrete-time chains, in part because the complications caused by periodicity in the discrete-time case
do not occur in the continuous-time case. Nonetheless as we will see, the limiting behavior of a continuous-time chain is closely
related to the limiting behavior of the embedded, discrete-time jump chain.
Review
Once again, our starting point is a time-homogeneous, continuous-time Markov chain X = {Xt : t ∈ [0, ∞)} defined on an underlying probability space (Ω, F, P) and with discrete state space (S, S). By definition, this means that S is countable with the discrete topology, so that S is the σ-algebra of all subsets of S.
Let's review what we have so far. We assume that the Markov chain X is regular. Among other things, this means that the basic structure of X is determined by the transition times τ = (τ0, τ1, τ2, …) and the jump chain Y = (Y0, Y1, Y2, …). First, τ0 = 0 and τ1 = τ = inf{t > 0 : Xt ≠ X0}. The time-homogeneous and Markov properties imply that the distribution of τ given X0 = x is exponential with parameter λ(x) ∈ [0, ∞). Part of regularity is that X is right continuous, so that there are no instantaneous states where λ(x) = ∞, which would mean P(τ = 0 ∣ X0 = x) = 1. On the other hand, λ(x) ∈ (0, ∞) means that x is a stable state, so that τ has a proper exponential distribution given X0 = x, with P(0 < τ < ∞ ∣ X0 = x) = 1. Finally, λ(x) = 0 means that x is an absorbing state, so that P(τ = ∞ ∣ X0 = x) = 1. The remaining transition times are defined recursively in the same way. Also as part of regularity, with probability 1, τn → ∞ as n → ∞, ruling out the explosion event of infinitely many jumps in finite time. The jump chain Y is formed by sampling X at the transition times (until the chain is sucked into an absorbing state, if that happens). That is, with M = sup{n : τn < ∞} and for n ∈ N, we define Yn = Xτn if n ≤ M and Yn = XτM if n > M. Then Y is a discrete-time Markov chain with one-step transition matrix Q given by Q(x, y) = P(Xτ = y ∣ X0 = x) if (x, y) ∈ S² with x stable, and Q(x, x) = 1 if x ∈ S is absorbing.
The transition matrix Pt at time t ∈ [0, ∞) is given by Pt (x, y) = P(Xt = y ∣ X0 = x) for (x, y) ∈ S². The time-homogeneous and Markov properties imply that the collection of transition matrices P = {Pt : t ∈ [0, ∞)} satisfies the Chapman-Kolmogorov equations Ps Pt = Ps+t for s, t ∈ [0, ∞), and hence is a semigroup of transition matrices. The transition semigroup P and the initial distribution of X0 determine all of the finite-dimensional distributions of X. Since there are no instantaneous states, P is standard, which means that Pt → I as t ↓ 0 (as matrices, and so pointwise). The fundamental relationship between P on the one hand, and λ and Q on the other, is the integral equation (16.17.3) of the previous section.
From this, it follows that the matrix function t ↦ Pt is differentiable (again, pointwise) and satisfies the Kolmogorov backward equation (d/dt) Pt = G Pt, where the infinitesimal generator matrix G is given by G(x, y) = −λ(x) I(x, y) + λ(x) Q(x, y) for (x, y) ∈ S². If we impose the stronger assumption that P is uniform, which means that Pt → I as t ↓ 0 as operators on B (so with respect to the supremum norm), then the backward equation as well as the companion Kolmogorov forward equation (d/dt) Pt = Pt G hold as operators on B. In addition, we have the matrix exponential representation Pt = e^{tG} for t ∈ [0, ∞). The resolvent U = {Uα : α ∈ (0, ∞)} of potential matrices, studied in the previous section, is the Laplace transform of P and hence gives the same information as P. From this point of view, the time-homogeneous and Markov properties lead to the resolvent equation Uα = Uβ + (β − α) Uα Uβ for α, β ∈ (0, ∞) with α ≤ β. For α ∈ (0, ∞), the potential matrix is related to the generator by the fundamental equation αUα = I + G Uα. If P is uniform, then this equation, as an operator equation on B, gives Uα = (αI − G)^{−1} for α ∈ (0, ∞).
Basic Theory
16.18.1 https://stats.libretexts.org/@go/page/10391
Relations and Classification
We start our discussion with relations among states and classifications of states. These are the same ones that we studied for
discrete-time chains in our study of recurrence and transience, applied here to the jump chain Y . But as we will see, the relations
and classifications make sense for the continuous-time chain X as well. The discussion is complicated slightly when there are
absorbing states. Only when X is in an absorbing state can we not interpret the values of Y as the values of X at the transition
times (because of course, there are no transitions when X is in an absorbing state). But x ∈ S is absorbing for the continuous-time
chain X if and only if x is absorbing for the jump chain Y , so this trivial exception is easily handled.
For y ∈ S let ρy = inf{n ∈ N+ : Yn = y}, the (discrete) hitting time to y for the jump chain Y, where as usual, inf(∅) = ∞. That is, ρy is the first positive (discrete) time that Y is in state y. The analogous random time for the continuous-time chain X is τρy, where naturally we take τ∞ = ∞. This is the first time that X is in state y, not counting the possible initial period in y.
So for the continuous-time chain, if x ∈ S is stable then H (x, x) is the probability that, starting in x, the chain X returns to x after
its initial period in x. If x, y ∈ S are distinct, then H (x, y) is simply the probability that X, starting in x, eventually reaches y . It
follows that the basic relation among states makes sense for the continuous-time chain X as well as for its jump chain Y.
The leads to relation → is reflexive by definition: x → x for every x ∈ S . From our previous study of discrete-time chains, we
know it's also transitive: if x → y and y → z then x → z for x, y, z ∈ S . We also know that x → y if and only if there is a
directed path in the state graph from x to y, if and only if Q^n (x, y) > 0 for some n ∈ N. For the continuous-time transition matrices, we have a stronger result that in turn makes a stronger case that the leads to relation is fundamental for X.
Suppose (x, y) ∈ S². Then either Pt (x, y) = 0 for all t ∈ (0, ∞), or Pt (x, y) > 0 for all t ∈ (0, ∞).
This result is known as the Lévy dichotomy, and is named for Paul Lévy. Let's recall a couple of other definitions:
If A is a nonempty subset of S , then cl(A) = {y ∈ S : x → y for some x ∈ A} is the smallest closed set containing A , and
is called the closure of A .
Proof
The to and from relation ↔ defines an equivalence relation on S and hence partitions S into mutually disjoint equivalence classes.
Recall from our study of discrete-time chains that a closed set is not necessarily an equivalence class, nor is an equivalence class
necessarily closed. However, an irreducible set is an equivalence class, but an equivalence class may not be irreducible. The
importance of the relation ↔ stems from the fact that many important properties of Markov chains (in discrete or continuous time)
turn out to be class properties, shared by all states in an equivalence class. The following definition is fundamental, and once again,
makes sense for either the continuous-time chain X or its jump chain Y .
Let x ∈ S .
1. State x is transient if H (x, x) < 1
2. State x is recurrent if H (x, x) = 1.
Recall from our study of discrete-time chains that if x is recurrent and x → y then y is recurrent and y → x . Thus, recurrence and
transience are class properties, shared by all states in an equivalence class.
The expected values R(x, y) = E(Ny ∣ Y0 = x) and U(x, y) = E(Ty ∣ X0 = x) for (x, y) ∈ S² define the potential matrices of Y and X, respectively, where Ny is the number of visits of Y to y and Ty is the total time that X spends in y. From our previous study of discrete-time chains, we know the distribution and mean of Ny given Y0 = x in terms of the hitting matrix H. The next two results give a review:
Let's take cases. First suppose that y is recurrent. In part (a), P(Ny = n ∣ Y0 = y) = 0 for all n ∈ N+, and consequently P(Ny = ∞ ∣ Y0 = y) = 1. In part (b), P(Ny = n ∣ Y0 = x) = 0 for n ∈ N+, and consequently P(Ny = 0 ∣ Y0 = x) = 1 − H(x, y) while P(Ny = ∞ ∣ Y0 = x) = H(x, y). Suppose next that y is transient. Part (a) specifies a proper geometric distribution on N+, while in part (b), probability 1 − H(x, y) is assigned to 0 and the remaining probability H(x, y) is geometrically distributed over N+ as in (a). In both cases, Ny is finite with probability 1. Next we consider the expected value, that is, the (discrete) potential. To state the results succinctly we will use the convention that a/0 = ∞ if a > 0 and 0/0 = 0.
Let's take cases again. If y ∈ S is recurrent then R(y, y) = ∞ , and for x ∈ S with x ≠ y , either R(x, y) = ∞ if x → y or
R(x, y) = 0 if x ↛ y . If y ∈ S is transient, R(y, y) is finite, as is R(x, y) for every x ∈ S with x ≠ y . Moreover, there is an
inverse relationship of sorts between the potential and the hitting probabilities.
Naturally, our next goal is to find analogous results for the continuous-time chain X. For the distribution of Ty it's best to use the right distribution function.
Proof
Let's take cases as before. Suppose first that y is recurrent. In part (a), P(Ty > t ∣ X0 = y) = 1 for every t ∈ [0, ∞) and hence P(Ty = ∞ ∣ X0 = y) = 1. In part (b), P(Ty > t ∣ X0 = x) = H(x, y) for every t ∈ [0, ∞) and consequently P(Ty = 0 ∣ X0 = x) = 1 − H(x, y) while P(Ty = ∞ ∣ X0 = x) = H(x, y). Suppose next that y is transient. From part (a), the distribution of Ty given X0 = y is exponential with parameter λ(y)[1 − H(y, y)]. In part (b), the distribution assigns probability 1 − H(x, y) to 0 while the remaining probability H(x, y) is exponentially distributed over (0, ∞) as in (a). Taking expected values, we get a very nice relationship between the potential matrix U of the continuous-time chain X and the potential matrix R of the discrete-time jump chain Y:
U(x, y) = R(x, y)/λ(y) (16.18.6)
Proof
In particular, y ∈ S is transient if and only if R(x, y) < ∞ for every x ∈ S , if and only if U (x, y) < ∞ for every x ∈ S . On the
other hand, y is recurrent if and only if R(x, y) = U (x, y) = ∞ if x → y and R(x, y) = U (x, y) = 0 if x ↛ y .
Recall that ν(x) = E(ρx ∣ Y0 = x) denotes the expected (discrete) return time to x for the jump chain Y, starting in x. Recall also that x is positive recurrent for Y if ν(x) < ∞, and x is null recurrent if x is recurrent but not positive recurrent, so that H(x, x) = 1 but ν(x) = ∞. The definitions are similar for X, but using the continuous hitting time τρx.
For x ∈ S, let μ(x) = 0 if x is absorbing and μ(x) = E(τρx ∣ X0 = x) if x is stable. So if x is stable, μ(x) is the expected return time to x starting in x (after the initial period in x).
1. State x is positive recurrent for X if μ(x) < ∞ .
2. State x is null recurrent for X if x is recurrent but not positive recurrent, so that H(x, x) = 1 but μ(x) = ∞.
A state x ∈ S can be positive recurrent for X but null recurrent for its jump chain Y or can be null recurrent for X but positive
recurrent for Y . But like transience and recurrence, positive and null recurrence are class properties, shared by all states in an
equivalence class under the to and from equivalence relation ↔.
Invariant Functions
Our next discussion concerns functions that are invariant for the transition matrix Q of the jump chain Y, and functions that are invariant for the transition semigroup P = {Pt : t ∈ [0, ∞)} of the continuous-time chain X. For both discrete-time and continuous-time chains, there is a close relationship between invariant functions and the limiting behavior in time.
First let's recall the definitions. A function f : S → [0, ∞) is invariant for Q (or for the chain Y) if f Q = f. It then follows that f Q^n = f for every n ∈ N. In continuous time we must assume invariance at each time. That is, a function f : S → [0, ∞) is invariant for P (or for the chain X) if f Pt = f for all t ∈ [0, ∞). Our interest is in nonnegative functions, because we can think of such a function as the density function, with respect to counting measure, of a positive measure on S. We are particularly interested in the special case that f is a probability density function, so that ∑x∈S f(x) = 1. If Y0 has a probability density function f that is invariant for Q, then Yn has probability density function f for all n ∈ N, and hence Y is stationary. Similarly, if X0 has a probability density function f that is invariant for P, then Xt has probability density function f for every t ∈ [0, ∞), and once again the process is stationary.
If our chain X has no absorbing states, then f : S → [0, ∞) is invariant for Q if and only if (f /λ)G = 0 .
Suppose that f : S → [0, ∞) . Then f is invariant for P if and only if f G = 0 .
Proof 1
Proof 2
So putting the two main results together, we see that f is invariant for the continuous-time chain X if and only if λf is invariant for the jump chain Y. Our next result shows how functions that are invariant for X are related to the resolvent U = {Uα : α ∈ (0, ∞)}. To appreciate the result, recall that for α ∈ (0, ∞) the matrix αUα is a probability matrix, and in fact αUα (x, ⋅) is the conditional probability density function of XT, given X0 = x, where T is independent of X and has the exponential distribution with parameter α. So αUα is a transition matrix just as Pt is a transition matrix, but corresponding to the exponentially distributed random time T with parameter α ∈ (0, ∞) rather than the deterministic time t ∈ [0, ∞).
Suppose that f : S → [0, ∞). If f G = 0 then f(αUα) = f for α ∈ (0, ∞). Conversely, if f(αUα) = f for α ∈ (0, ∞) then f G = 0.
Proof
So extending our summary, f : S → [0, ∞) is invariant for the transition semigroup P = {Pt : t ∈ [0, ∞)} if and only if λf is invariant for the jump transition matrices {Q^n : n ∈ N}, if and only if f G = 0, if and only if f is invariant for the collection of probability matrices {αUα : α ∈ (0, ∞)}. From our knowledge of the theory for discrete-time chains, we now have the following fundamental result:
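These equivalences are easy to see numerically for the three-state exercise chain. The sketch below is illustrative only: it finds f by solving f G = 0 with ∑f = 1, then checks that λf is invariant for Q and that f is a fixed point of αUα.

```python
import numpy as np

lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0, 1/2, 1/2],
              [1, 0, 0],
              [1/3, 2/3, 0]])
G = -np.diag(lam) + np.diag(lam) @ Q

# Solve f G = 0 together with the normalization sum(f) = 1 (least squares).
A = np.vstack([G.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
f = np.linalg.lstsq(A, b, rcond=None)[0]

g = lam * f                                   # candidate invariant function for Q
alpha = 2.0
aU = alpha * np.linalg.inv(alpha * np.eye(3) - G)   # probability matrix alpha * U_alpha
```

For this chain f works out to (1/5, 2/3, 2/15); λf is invariant for Q, and f(αUα) = f.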
Invariant functions have a nice interpretation in terms of occupation times, an interpretation that parallels the discrete case. The
potential gives the expected total time in a state, starting in another state, but here we need to consider the expected time in a state
during a cycle that starts and ends in another state.
γx (y) = E(∫_0^{τρx} 1(Xs = y) ds ∣ X0 = x), y ∈ S (16.18.13)
so that γx (y) is the expected occupation time in state y before the first return to x, starting in x.
2. γx is invariant for X
3. γx (x) = 1/λ(x)
4. μ(x) = ∑y∈S γx (y)
Proof
So now we have some additional insight into positive and null recurrence for the continuous-time chain X and the associated jump chain Y. Suppose again that the chains are irreducible and recurrent. There exists g : S → (0, ∞) that is invariant for Y, and then g/λ is invariant for X. The invariant functions are unique up to multiplication by positive constants. The jump chain Y is positive recurrent if and only if ∑x∈S g(x) < ∞, while X is positive recurrent if and only if ∑x∈S g(x)/λ(x) < ∞.
Limiting Behavior
Our next discussion focuses on the limiting behavior of the transition semigroup P = { Pt : t ∈ [0, ∞)} . Our first result is a
simple corollary of the result above for potentials.
So we should turn our attention to the recurrent states. The set of recurrent states partitions into equivalence classes under ↔, and each of these classes is irreducible. Hence we can assume without loss of generality that our continuous-time chain X = {Xt : t ∈ [0, ∞)} is irreducible and recurrent. To avoid trivialities, we will also assume that S has at least two states. Thus, there are no absorbing states and so λ(x) > 0 for x ∈ S. Here is the main result.
The limiting function f can be computed in a number of ways. First we find a function g : S → (0, ∞) that is invariant for X. We
can do this by solving
gPt = g for t ∈ (0, ∞)
gG = 0
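As a numerical sketch of the limit (illustrative only, using the three-state exercise chain with λ = (4, 1, 3)): since αUα (x, ⋅) is the distribution of the state at an independent Exp(α) time, for small α its rows should all be close to the invariant probability density function f, which solves f G = 0.

```python
import numpy as np

lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0, 1/2, 1/2],
              [1, 0, 0],
              [1/3, 2/3, 0]])
G = -np.diag(lam) + np.diag(lam) @ Q

# Invariant probability density function: solve f G = 0 with sum(f) = 1.
f = np.linalg.lstsq(np.vstack([G.T, np.ones(3)]),
                    np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]

alpha = 1e-4                                           # small transform parameter
rows = alpha * np.linalg.inv(alpha * np.eye(3) - G)    # each row is close to f
```

Every row of αUα agrees with f = (1/5, 2/3, 2/15) to within O(α), independent of the starting state.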
The following result is known as the ergodic theorem for continuous-time Markov chains. It can also be thought of as a strong law
of large numbers for continuous-time Markov chains.
Suppose that X = {Xt : t ∈ [0, ∞)} is irreducible and positive recurrent, with (unique) invariant probability density function f. If h : S → R then
(1/t) ∫_0^t h(Xs) ds → ∑x∈S f(x) h(x) as t → ∞ (16.18.20)
with probability 1, assuming that the sum on the right converges absolutely.
Notes
Note that no assumptions are made about X0, so the limit is independent of the initial state. By now, this should come as no surprise. After a long period of time, the Markov chain X “forgets” about the initial state. Note also that ∑x∈S f(x) h(x) is the expected value of h, thought of as a random variable on S with the probability measure defined by f. On the other hand, (1/t) ∫_0^t h(Xs) ds is the average of the time function s ↦ h(Xs) on the interval [0, t]. So the ergodic theorem states that the limiting time average on the left is the same as the spatial average on the right.
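A simulation sketch of the ergodic theorem for the three-state exercise chain (λ = (4, 1, 3)); nothing here is quoted from the text. With h(x) = x, the spatial average is ∑x f(x)h(x) = 0·(1/5) + 1·(2/3) + 2·(2/15) = 14/15, where f = (1/5, 2/3, 2/15) solves f G = 0.

```python
import numpy as np

lam = np.array([4.0, 1.0, 3.0])
Q = np.array([[0, 1/2, 1/2],
              [1, 0, 0],
              [1/3, 2/3, 0]])

rng = np.random.default_rng(42)
occ = np.zeros(3)                       # occupation time in each state
state = 0
for _ in range(50_000):                 # simulate 50,000 transitions
    occ[state] += rng.exponential(1.0 / lam[state])   # exponential holding time
    state = rng.choice(3, p=Q[state])                 # jump via the chain Q

time_avg = occ @ np.array([0.0, 1.0, 2.0]) / occ.sum()   # time average of h(x) = x
space_avg = 14.0 / 15.0                                  # sum_x f(x) h(x)
```

The time average lands within a few percent of 14/15; the tolerance in the check is deliberately loose since this is Monte Carlo.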
The Two-State Chain
The continuous-time, two-state chain has been studied in the last several sections. The following result puts the pieces together and
completes the picture.
Consider the continuous-time Markov chain X = {Xt : t ∈ [0, ∞)} on S = {0, 1} with transition rate a ∈ (0, ∞) from 0 to 1 and transition rate b ∈ (0, ∞) from 1 to 0.
Answer
Computational Exercises
The following continuous-time chain has also been studied in the previous three sections.
Consider the Markov chain X = {Xt : t ∈ [0, ∞)} on S = {0, 1, 2} with exponential parameter function λ = (4, 1, 3) and jump transition matrix

Q = ⎡ 0    1/2  1/2 ⎤
    ⎢ 1    0    0   ⎥ (16.18.21)
    ⎣ 1/3  2/3  0   ⎦
7. Verify the result in (f) by recalling the transition matrix Pt for X at t ∈ [0, ∞).
Answer
Special Models
Read the discussion of stationary and limiting distributions for chains subordinate to the Poisson process.
Read the discussion of stationary and limiting distributions for continuous-time birth-death chains.
Read the discussion of classification and limiting distributions for continuous-time queuing chains.
This page titled 16.18: Stationary and Limiting Distributions of Continuous-Time Chains is shared under a CC BY 2.0 license and was authored,
remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts
platform; a detailed edit history is available upon request.
16.19: Time Reversal in Continuous-Time Chains
Earlier, we studied time reversal of discrete-time Markov chains. In continuous time, the issues are basically the same. First, the Markov property, stated in the form that the past and future are independent given the present, essentially treats the past and future symmetrically. However, there is a lack of symmetry in the fact that in the usual formulation, we have an initial time 0, but not a terminal time. If we introduce a terminal time, then we can run the process backwards in time. In this section, we are interested in the following questions:
Is the new process still Markov?
If so, how are the various parameters of the reversed Markov chain related to those of the original chain?
Under what conditions are the forward and backward Markov chains stochastically the same?
Consideration of these questions leads to reversed chains, an important and interesting part of the theory of continuous-time Markov
chains. As always, we are also interested in the relationship between properties of a continuous-time chain and the corresponding
properties of its discrete-time jump chain. In this section we will see that there are simple and elegant connections between the time
reversal of a continuous-time chain and the time-reversal of the jump chain.
Basic Theory
Reversed Chains
Our starting point is a (homogeneous) continuous-time Markov chain X = {Xt : t ∈ [0, ∞)} with (countable) state space S. We will assume that X is irreducible, so that every state in S leads to every other state, and to avoid trivialities, we will assume that there are at least two states. The irreducibility assumption involves no serious loss of generality, since otherwise we could simply restrict our attention to an irreducible equivalence class of states. With our usual notation, we will let P = {Pt : t ∈ [0, ∞)} denote the semigroup of transition matrices of X and G the infinitesimal generator. Let λ(x) denote the exponential parameter for the holding time in state x ∈ S and Q the transition matrix for the discrete-time jump chain Y = (Y0, Y1, …). Finally, let U = {Uα : α ∈ [0, ∞)} denote the collection of potential matrices of X. We will assume that the chain X is regular, which gives us the following properties:
We may assume that the chain X is right continuous and has left limits.
The assumption of regularity rules out various types of weird behavior that, while mathematically possible, are usually not
appropriate in applications. If X is uniform, a stronger assumption than regularity, we have the following additional properties:
Pt (x, x) → 1 as t ↓ 0 uniformly in x ∈ S .
λ is bounded.
Pt = e^{tG} for t ∈ [0, ∞).
Uα = (αI − G)^{−1} for α ∈ (0, ∞).
Now let h ∈ (0, ∞). We will think of h as the terminal time or time horizon, so the chains in our first discussion will be defined on the time interval [0, h]. Notationally, we won't bother to indicate the dependence on h, since ultimately the time horizon won't matter. Define X̂t = Xh−t for t ∈ [0, h]. Thus, the process forward in time is X = {Xt : t ∈ [0, h]} while the process backwards in time is
X̂ = {X̂t : t ∈ [0, h]} = {Xh−t : t ∈ [0, h]} (16.19.1)
Similarly, let
F̂t = σ{X̂s : s ∈ [0, t]} = σ{Xh−s : s ∈ [0, t]} = σ{Xr : r ∈ [h − t, h]}, t ∈ [0, h] (16.19.2)
so that F̂t is the σ-algebra of events of the reversed process up to time t, just as Ft is the σ-algebra of events of X running forward. Our first result is that the chain reversed in time is still Markov.
The process X̂ = {X̂t : t ∈ [0, h]} is a Markov chain, but is not time homogeneous in general. For s, t ∈ [0, h] with s < t, the transition matrix from s to t is
P̂s,t (x, y) = [P(Xh−t = y) / P(Xh−s = x)] Pt−s (y, x), (x, y) ∈ S² (16.19.3)
16.19.1 https://stats.libretexts.org/@go/page/10392
Proof
However, the backwards chain will be time homogeneous if X0 has an invariant distribution.
Suppose that X is positive recurrent, with (unique) invariant probability density function f. If X0 has the invariant distribution, then X̂ is a time-homogeneous Markov chain. The transition matrix at time t ∈ [0, ∞) (for every terminal time h ≥ t) is given by
P̂t (x, y) = [f(y)/f(x)] Pt (y, x), (x, y) ∈ S² (16.19.6)
Proof
The previous result holds in the limit of the terminal time, regardless of the initial distribution.
Suppose again that X is positive recurrent, with (unique) invariant probability density function f. Regardless of the distribution of X0,
P(X̂s+t = y ∣ X̂s = x) → [f(y)/f(x)] Pt (y, x) as h → ∞ (16.19.7)
Proof
These three results are motivation for the definition that follows. We can generalize by defining the reversal of an irreducible Markov
chain, as long as there is a positive, invariant function. Recall that a positive invariant function defines a positive measure on S , but
of course not in general a probability measure.
Suppose that g : S → (0, ∞) is invariant for X. The reversal of X with respect to g is the Markov chain X̂ = {X̂t : t ∈ [0, ∞)} with transition semigroup P̂ defined by
P̂t (x, y) = [g(y)/g(x)] Pt (y, x), (x, y) ∈ S², t ∈ [0, ∞) (16.19.8)
Justification
Recall that if g is a positive invariant function for X then so is cg for every constant c ∈ (0, ∞) . Note that g and cg generate the
same reversed chain. So let's consider the cases:
Nonetheless, the general definition is natural, because most of the important properties of the reversed chain follow from the basic balance equation relating the transition semigroups P and P̂ and the invariant function g:
g(x) P̂t (x, y) = g(y) Pt (y, x), (x, y) ∈ S², t ∈ [0, ∞) (16.19.10)
We will see the balance equation repeated for other objects associated with the Markov chains.
2. X is the time reversal of X̂ with respect to g.
Proof
In the balance equation for the transition semigroups, it's not really necessary to know a priori that the function g is invariant, if we know the two transition semigroups.
Suppose that g : S → (0, ∞). Then g is invariant and the Markov chains X and X̂ are time reversals with respect to g if and only if
g(x) P̂_t(x, y) = g(y) P_t(y, x), \quad (x, y) ∈ S^2, \; t ∈ [0, ∞) \qquad (16.19.12)
Proof
Here is a slightly more complicated (but equivalent) version of the balance equation for the transition probabilities.
Suppose again that g : S → (0, ∞). Then g is invariant and the chains X and X̂ are time reversals with respect to g if and only if
g(x_1) P̂_{t_1}(x_1, x_2) P̂_{t_2}(x_2, x_3) \cdots P̂_{t_n}(x_n, x_{n+1}) = g(x_{n+1}) P_{t_n}(x_{n+1}, x_n) P_{t_{n-1}}(x_n, x_{n-1}) \cdots P_{t_1}(x_2, x_1) \qquad (16.19.14)
for all n ∈ ℕ_+, (t_1, t_2, …, t_n) ∈ [0, ∞)^n, and (x_1, x_2, …, x_{n+1}) ∈ S^{n+1}.
Proof
Suppose again that g : S → (0, ∞). Then g is invariant and the chains X and X̂ are time reversals with respect to g if and only if the potential matrices satisfy
g(x) Û_α(x, y) = g(y) U_α(y, x), \quad (x, y) ∈ S^2, \; α ∈ [0, ∞)
Proof
As a corollary, continuous-time chains that are time reversals are of the same type.
If X and X̂ are time reversals, then X and X̂ are of the same type: transient, null recurrent, or positive recurrent.
Proof
Suppose again that g : S → (0, ∞). Then g is invariant and the Markov chains X and X̂ are time reversals if and only if the infinitesimal generators satisfy
g(x) Ĝ(x, y) = g(y) G(y, x), \quad (x, y) ∈ S^2 \qquad (16.19.19)
Proof
In our original discussion of time reversal in the positive recurrent case, we could have argued that the previous results must be true. If we run the positive recurrent chain X = {X_t : t ∈ [0, h]} backwards in time to obtain the time-reversed chain X̂ = {X̂_t : t ∈ [0, h]}, then the exponential parameters for X̂ must be the same as those for X, and the jump chain Ŷ for X̂ must be the time reversal of the jump chain Y for X.
Reversible Chains
Clearly an interesting special case is when the time reversal of a continuous-time Markov chain is stochastically the same as the original chain. Once again, we assume that we have a regular Markov chain X = {X_t : t ∈ [0, ∞)} that is irreducible on the state space S, with transition semigroup P = {P_t : t ∈ [0, ∞)}. As before, U = {U_α : α ∈ [0, ∞)} denotes the collection of potential matrices, and G the infinitesimal generator. Finally, λ denotes the exponential parameter function, Y = {Y_n : n ∈ ℕ} the jump chain, and Q the transition matrix of Y.
Suppose that g : S → (0, ∞) is invariant for X. Then X is reversible with respect to g if the time-reversed chain X̂ = {X̂_t : t ∈ [0, ∞)} also has transition semigroup P. That is,
g(x) P_t(x, y) = g(y) P_t(y, x), \quad (x, y) ∈ S^2, \; t ∈ [0, ∞) \qquad (16.19.21)
Clearly if X is reversible with respect to g then X is reversible with respect to cg for every c ∈ (0, ∞), so again reversibility depends on g only up to a positive multiple.
The following results are corollaries of the results above for time reversals. First, we don't need to know a priori that the function g is
invariant.
Suppose that g : S → (0, ∞). Then g is invariant and X is reversible with respect to g if and only if
g(x) P_t(x, y) = g(y) P_t(y, x), \quad (x, y) ∈ S^2, \; t ∈ [0, ∞) \qquad (16.19.22)
Suppose again that g : S → (0, ∞) . Then g is invariant and X is reversible with respect to g if and only if
g(x_1) P_{t_1}(x_1, x_2) P_{t_2}(x_2, x_3) \cdots P_{t_n}(x_n, x_{n+1}) = g(x_{n+1}) P_{t_n}(x_{n+1}, x_n) P_{t_{n-1}}(x_n, x_{n-1}) \cdots P_{t_1}(x_2, x_1) \qquad (16.19.23)
for all n ∈ ℕ_+, (t_1, t_2, …, t_n) ∈ [0, ∞)^n, and (x_1, x_2, …, x_{n+1}) ∈ S^{n+1}.
Suppose again that g : S → (0, ∞) . Then g is invariant and X is reversible with respect to g if and only if
g(x) U_α(x, y) = g(y) U_α(y, x), \quad (x, y) ∈ S^2, \; α ∈ [0, ∞) \qquad (16.19.24)
Suppose again that g : S → (0, ∞) . Then g is invariant and X is reversible with respect to g if and only if
g(x) G(x, y) = g(y) G(y, x), \quad (x, y) ∈ S^2 \qquad (16.19.25)
Suppose again that g : S → (0, ∞) . Then g is invariant and X is reversible if and only if the jump chain Y is reversible with
respect to λg.
Recall that X is recurrent if and only if the jump chain Y is recurrent. In this case, the invariant functions for X and Y exist and are
unique up to positive constants. So in this case, the previous theorem states that X is reversible if and only if Y is reversible. In the
positive recurrent case (the most important case), the following theorem gives a condition for reversibility that does not directly
reference the invariant distribution. The condition is known as the Kolmogorov cycle condition, named for Andrei Kolmogorov.
Suppose that X is positive recurrent. Then X is reversible if and only if for every sequence of distinct states (x_1, x_2, …, x_n),
G(x_1, x_2) G(x_2, x_3) \cdots G(x_{n-1}, x_n) G(x_n, x_1) = G(x_1, x_n) G(x_n, x_{n-1}) \cdots G(x_3, x_2) G(x_2, x_1) \qquad (16.19.26)
Proof
Note that the Kolmogorov cycle condition states that the transition rate of visiting states (x_2, x_3, …, x_n, x_1) in sequence, starting in state x_1, is the same as the transition rate of visiting states (x_n, x_{n-1}, …, x_2, x_1) in sequence, starting in state x_1.
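The cycle condition can be checked mechanically for a small chain. The sketch below uses a hypothetical generator matrix (symmetric off the diagonal, so the condition holds trivially) and then perturbs one rate to break it; the matrix and rates are illustrative assumptions, not from the text.

```python
import itertools

# Hypothetical generator on S = {0, 1, 2}: off-diagonal entries are
# transition rates, each row sums to zero, and the off-diagonal part
# is symmetric, so the chain is reversible.
G = [
    [-3.0, 1.0, 2.0],
    [1.0, -3.0, 2.0],
    [2.0, 2.0, -4.0],
]

def satisfies_cycle_condition(G):
    """Check the Kolmogorov cycle condition over all cycles of distinct states."""
    n = len(G)
    for k in range(3, n + 1):
        for cycle in itertools.permutations(range(n), k):
            forward = backward = 1.0
            for i in range(k):
                j = (i + 1) % k
                forward *= G[cycle[i]][cycle[j]]   # rate around the cycle
                backward *= G[cycle[j]][cycle[i]]  # rate around the reversed cycle
            if abs(forward - backward) > 1e-9:
                return False
    return True

reversible = satisfies_cycle_condition(G)

# Perturb one rate: the cycle products no longer agree.
G_perturbed = [row[:] for row in G]
G_perturbed[0][1] = 5.0
not_reversible = satisfies_cycle_condition(G_perturbed)
```

For three states there is essentially one cycle to check, but the same loop covers larger (finite) state spaces.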
Consider the continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1} with transition rate a ∈ (0, ∞) from 0 to 1 and transition rate b ∈ (0, ∞) from 1 to 0. Show that X is reversible:
1. Using the transition semigroup P = {P_t : t ∈ [0, ∞)}.
Computational Exercises
The Markov chain in the following exercise has also been studied in previous sections.
Consider the Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1, 2} with exponential parameter function λ = (4, 1, 3) and jump transition matrix
Q = \begin{bmatrix} 0 & 1/2 & 1/2 \\ 1 & 0 & 0 \\ 1/3 & 2/3 & 0 \end{bmatrix} \qquad (16.19.29)
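Although the exercise's computations are left to the reader, the invariant distribution of this chain is easy to find numerically from λ and Q. A minimal sketch (one solution method, not necessarily the book's, assuming NumPy is available):

```python
import numpy as np

lam = np.array([4.0, 1.0, 3.0])        # exponential parameter function
Q = np.array([[0, 1/2, 1/2],
              [1, 0, 0],
              [1/3, 2/3, 0]])          # jump transition matrix

# Generator: G(x, y) = lam(x) Q(x, y) for y != x, and G(x, x) = -lam(x).
G = lam[:, None] * Q - np.diag(lam)

# The invariant pdf f solves f G = 0 with f summing to 1; append the
# normalization as an extra equation and solve by least squares.
A = np.vstack([G.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
f, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Solving by hand gives f = (1/5, 2/3, 2/15), which the numerical solution reproduces.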
Special Models
Read the discussion of time reversal for chains subordinate to the Poisson process.
This page titled 16.19: Time Reversal in Continuous-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.20: Chains Subordinate to the Poisson Process
Basic Theory
Introduction
Recall that the standard Poisson process with rate parameter r ∈ (0, ∞) involves three interrelated stochastic processes. First, the sequence of interarrival times T = (T_1, T_2, …) is independent, and each variable has the exponential distribution with parameter r. Next, the sequence of arrival times τ = (τ_0, τ_1, …) is the partial sum sequence associated with the interarrival sequence T:
τ_n = \sum_{i=1}^n T_i, \quad n ∈ ℕ \qquad (16.20.1)
For n ∈ ℕ_+, the arrival time τ_n has the gamma distribution with parameters n and r. Finally, the Poisson counting process N = {N_t : t ∈ [0, ∞)} is defined by
N_t = \max\{n ∈ ℕ : τ_n ≤ t\}, \quad t ∈ [0, ∞)
so that N_t is the number of arrivals in (0, t] for t ∈ [0, ∞). The counting variable N_t has the Poisson distribution with parameter rt for t ∈ [0, ∞). The counting process N and the arrival time process τ are inverses in the sense that τ_n ≤ t if and only if N_t ≥ n for t ∈ [0, ∞) and n ∈ ℕ. The Poisson counting process can be viewed as a continuous-time Markov chain.
Suppose that X_0 takes values in ℕ and is independent of N. Define X_t = X_0 + N_t for t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on ℕ with exponential parameter function given by λ(x) = r for x ∈ ℕ and jump transition matrix Q given by Q(x, x + 1) = 1 for x ∈ ℕ.
Proof
Note that the Poisson process, viewed as a Markov chain, is a pure birth chain. Clearly we can generalize this continuous-time Markov chain in a simple way by allowing a general embedded jump chain.
Suppose that X = {X_t : t ∈ [0, ∞)} is a Markov chain with (countable) state space S, with constant exponential parameter λ(x) = r ∈ (0, ∞) for x ∈ S, and with jump transition matrix Q. Then X is said to be subordinate to the Poisson process with rate parameter r.
1. The transition times (τ_1, τ_2, …) are the arrival times of the Poisson process with rate r.
2. The inter-transition times (τ_1, τ_2 − τ_1, …) are the interarrival times of the Poisson process with rate r (independent, each with the exponential distribution with parameter r).
3. The Poisson process and the jump chain Y = (Y_0, Y_1, …) are independent, and X_t = Y_{N_t} for t ∈ [0, ∞).
Proof
Since all states are stable, note that we must have Q(x, x) = 0 for x ∈ S. Note also that for x, y ∈ S with x ≠ y, the exponential rate parameter for the transition from x to y is μ(x, y) = r Q(x, y). Conversely, suppose that μ : S^2 → [0, ∞) satisfies μ(x, x) = 0 and \sum_{y ∈ S} μ(x, y) = r for every x ∈ S. Then the Markov chain with transition rates given by μ is subordinate to the Poisson process with rate r. It's easy to construct a Markov chain subordinate to the Poisson process.
Suppose that N = {N_t : t ∈ [0, ∞)} is a Poisson counting process with rate r ∈ (0, ∞) and that Y = {Y_n : n ∈ ℕ} is a discrete-time Markov chain on S, independent of N, whose transition matrix satisfies Q(x, x) = 0 for every x ∈ S. Let X_t = Y_{N_t} for t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain subordinate to the Poisson process.
Suppose again that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. As usual, let P = {P_t : t ∈ [0, ∞)} denote the transition semigroup and G the infinitesimal generator.
16.20.1 https://stats.libretexts.org/@go/page/10393
The generator matrix G of X is G = r(Q − I). Hence for t ∈ [0, ∞),
1. The Kolmogorov backward equation is P_t′ = r(Q − I) P_t.
2. The Kolmogorov forward equation is P_t′ = r P_t (Q − I).
Proof
There are several ways to find the transition semigroup P = { Pt : t ∈ [0, ∞)} . The best way is a probabilistic argument using the
underlying Poisson process.
P_t = \sum_{n=0}^∞ e^{-rt} \frac{(rt)^n}{n!} Q^n \qquad (16.20.3)
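The series is easy to evaluate numerically. The sketch below truncates the sum for a hypothetical jump matrix Q and rate r (both illustrative assumptions), and uses the semigroup property P_s P_t = P_{s+t} as a consistency check.

```python
import numpy as np

# Hypothetical jump transition matrix with Q(x, x) = 0, and Poisson rate r.
Q = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.2, 0.8, 0.0]])
r = 2.0

def P(t, n_terms=80):
    """Truncated series P_t = sum_n e^{-rt} (rt)^n / n! * Q^n."""
    total = np.zeros_like(Q)
    Qn = np.eye(3)              # Q^0
    coeff = np.exp(-r * t)      # e^{-rt} (rt)^0 / 0!
    for n in range(n_terms):
        total += coeff * Qn
        Qn = Qn @ Q
        coeff *= r * t / (n + 1)
    return total

# Semigroup property and stochasticity serve as sanity checks.
prod = P(0.6) @ P(1.1)
direct = P(1.7)
```

Eighty terms are far more than needed here; the Poisson weights decay super-exponentially once n exceeds rt.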
Potential Matrices
Next let's find the potential matrices. As with the transition matrices, we can do this in (at least) two different ways.
Suppose again that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. For α ∈ (0, ∞), the potential matrix U_α of X is
U_α = \sum_{n=0}^∞ \frac{1}{α + r} \left(\frac{r}{α + r}\right)^n Q^n \qquad (16.20.5)
Recall that for p ∈ (0, 1), the p-potential matrix of the jump chain Y is R_p = \sum_{n=0}^∞ p^n Q^n. Hence we have the following nice relationship between the potential matrix of X and the potential matrix of Y:
U_α = \frac{1}{α + r} R_{r/(α+r)} \qquad (16.20.9)
Next recall that α U_α(x, ⋅) is the probability density function of X_T given X_0 = x, where T has the exponential distribution with parameter α and is independent of X. On the other hand, α U_α(x, ⋅) = (1 − p) R_p(x, ⋅) where p = r/(α + r). We know from our study of discrete potentials that (1 − p) R_p(x, ⋅) is the probability density function of Y_M, where M has the geometric distribution on ℕ with parameter 1 − p and is independent of Y. But also X_T = Y_{N_T}. So it follows that if T has the exponential distribution with parameter α, and N = {N_t : t ∈ [0, ∞)} is a Poisson process with rate r independent of T, then N_T has the geometric distribution on ℕ with parameter α/(α + r). Of course, we could easily verify this directly, but it's still fun to see such connections.
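The claim about N_T can be confirmed numerically: conditioning on T = t and integrating gives P(N_T = n) = ∫ α e^{−αt} e^{−rt}(rt)^n/n! dt, which should match the geometric pmf with parameter α/(α + r). A sketch with arbitrary parameter values:

```python
from math import exp, factorial

alpha, r = 1.5, 2.0   # arbitrary illustrative parameters

def p_geometric(n):
    """Geometric pmf on N with parameter alpha / (alpha + r)."""
    return (alpha / (alpha + r)) * (r / (alpha + r)) ** n

def p_by_conditioning(n, steps=100000, t_max=40.0):
    """Midpoint-rule value of the mixing integral for P(N_T = n)."""
    h = t_max / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += alpha * exp(-(alpha + r) * t) * (r * t) ** n / factorial(n) * h
    return total

pairs = [(p_geometric(n), p_by_conditioning(n)) for n in range(6)]
```

The integral evaluates in closed form to α r^n / (α + r)^{n+1}, which is exactly the geometric pmf above; the numerical integration just makes the agreement visible.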
Suppose again that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q, and let Y = {Y_n : n ∈ ℕ} denote the jump chain. The limiting behavior and the classification of X as transient or recurrent are determined by the corresponding properties of Y.
Proof
Time Reversal
Once again, suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. Let Y = {Y_n : n ∈ ℕ} denote the jump chain. We assume that X (and hence Y) is irreducible.
Suppose that g : S → (0, ∞) is invariant for X. The time reversal X̂ with respect to g is also subordinate to the Poisson process with rate r, and its jump chain is the time reversal of Y with respect to g.
In particular, X is reversible with respect to g if and only if Y is reversible with respect to g . As noted earlier, X and Y are of the
same type: both transient or both null recurrent or both positive recurrent. In the recurrent case, there exists a positive invariant
function that is unique up to multiplication by constants. In this case, the reversal of X is unique, and is the chain subordinate to
the Poisson process with rate r whose jump chain is the reversal of Y .
Uniform Chains
In the construction above for a Markov chain X = {X_t : t ∈ [0, ∞)} that is subordinate to the Poisson process with rate r and jump transition matrix Q, we assumed of course that Q(x, x) = 0 for every x ∈ S. So there are no absorbing states, and the sequence (τ_1, τ_2, …) of arrival times of the Poisson process is the sequence of jump times of the chain X. However, in our introduction to continuous-time chains, we saw that the general construction of a chain starting with the function λ and the transition matrix Q works without this assumption on Q, although the exponential parameters and transition probabilities change. The same idea works here.
Suppose that N = {N_t : t ∈ [0, ∞)} is a Poisson counting process with rate r ∈ (0, ∞) and that Y = {Y_n : n ∈ ℕ} is a discrete-time Markov chain on S with transition matrix Q satisfying Q(x, x) < 1 for x ∈ S. Assume also that N and Y are independent. Define X_t = Y_{N_t} for t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain with exponential parameter function λ(x) = r[1 − Q(x, x)] for x ∈ S and jump transition matrix Q̃ given by
Q̃(x, y) = \frac{Q(x, y)}{1 − Q(x, x)}, \quad (x, y) ∈ S^2, \; x ≠ y \qquad (16.20.10)
Proof
The Markov chain constructed above is no longer a chain subordinate to the Poisson process by our definition above, since the
exponential parameter function is not constant, and the transition times of X are no longer the arrival times of the Poisson process.
Nonetheless, many of the basic results above still apply.
Let X = {X_t : t ∈ [0, ∞)} be the Markov chain constructed in the previous theorem. Then
1. For t ∈ [0, ∞), the transition matrix P_t is given by
P_t = \sum_{n=0}^∞ e^{-rt} \frac{(rt)^n}{n!} Q^n \qquad (16.20.11)
It's a remarkable fact that every continuous-time Markov chain with bounded exponential parameters can be constructed as in the last theorem, a process known as uniformization. The name comes from the fact that in the construction, the exponential parameters become constant, but at the expense of allowing the embedded discrete-time chain to jump from a state back to that state. To review the definition, suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S with transition semigroup P = {P_t : t ∈ [0, ∞)}, exponential parameter function λ, and jump transition matrix Q. Then P is uniform if P_t(x, x) → 1 as t ↓ 0 uniformly in x, or equivalently if λ is bounded.
Suppose that λ : S → (0, ∞) is bounded and that Q is a transition matrix on S with Q(x, x) = 0 for every x ∈ S. Let r ∈ (0, ∞) be an upper bound on λ and N = {N_t : t ∈ [0, ∞)} a Poisson counting process with rate r. Define the transition matrix Q̂ on S by
Q̂(x, x) = 1 − \frac{λ(x)}{r}, \quad x ∈ S
Q̂(x, y) = \frac{λ(x)}{r} Q(x, y), \quad (x, y) ∈ S^2, \; x ≠ y
Let Y = {Y_n : n ∈ ℕ} be a discrete-time Markov chain on S with transition matrix Q̂, independent of N, and let X_t = Y_{N_t} for t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain with exponential parameter function λ and jump transition matrix Q.
Proof
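The construction of Q̂ is mechanical. The sketch below builds Q̂ from hypothetical values of λ and Q and confirms that the pairs (λ, Q) and (r, Q̂) determine the same generator, since G = r(Q̂ − I); the numbers are illustrative assumptions.

```python
import numpy as np

# Hypothetical bounded exponential parameters and jump matrix on S = {0, 1, 2}.
lam = np.array([1.0, 3.0, 2.0])
Q = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
r = lam.max()                     # any upper bound on lam works

# Uniformized jump matrix: stays put at x with probability 1 - lam(x)/r.
Q_hat = (lam / r)[:, None] * Q + np.diag(1.0 - lam / r)

# Both descriptions yield the same infinitesimal generator.
G_from_lam_Q = lam[:, None] * Q - np.diag(lam)
G_from_uniformization = r * (Q_hat - np.eye(3))
```

Since the Poisson process and discrete-time chains are simple to simulate, this is exactly the representation used in practice to simulate X.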
Note in particular that if the state space S is finite then of course λ is bounded so the previous theorem applies. The theorem is
useful for simulating a continuous-time Markov chain, since the Poisson process and discrete-time chains are simple to simulate. In
addition, we have nice representations for the transition matrices, potential matrices, and the generator matrix.
Suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on S with bounded exponential parameter function λ : S → (0, ∞) and jump transition matrix Q. Define r and Q̂ as in the last theorem. Then
1. For t ∈ [0, ∞), the transition matrix P_t is given by
P_t = \sum_{n=0}^∞ e^{-rt} \frac{(rt)^n}{n!} Q̂^n \qquad (16.20.14)
Examples
The Two-State Chain
The following exercise applies the uniformization method to the two-state chain.
Consider the continuous-time Markov chain X = {X_t : t ∈ [0, ∞)} on S = {0, 1} with exponential parameter function λ = (a, b), where a, b ∈ (0, ∞). Thus, states 0 and 1 are stable and the jump chain has transition matrix
Q = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \qquad (16.20.16)
1. r = a + b and Q̂ = \frac{1}{a + b} \begin{bmatrix} b & a \\ b & a \end{bmatrix}
2. G = \begin{bmatrix} -a & a \\ b & -b \end{bmatrix}
3. P_t = Q̂ − \frac{1}{a + b} e^{-(a+b)t} G for t ∈ [0, ∞)
4. U_α = \frac{1}{α} Q̂ − \frac{1}{(α + a + b)(a + b)} G for α ∈ (0, ∞)
Proof
Although we have obtained all of these results for the two-state chain before, the derivation based on uniformization is the easiest.
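The closed forms for the two-state chain are easy to sanity-check numerically: compare the formula for P_t with the uniformization series, and the formula for U_α with the resolvent identity U_α = (αI − G)^{-1}. A sketch (the values of a, b, t, α are arbitrary):

```python
import numpy as np

a, b = 2.0, 3.0
r = a + b
G = np.array([[-a, a], [b, -b]])
Q_hat = np.array([[b, a], [b, a]]) / (a + b)   # uniformized jump matrix

def P_series(t, n_terms=80):
    """P_t from the uniformization series with rate r = a + b."""
    total = np.zeros((2, 2))
    M = np.eye(2)
    coeff = np.exp(-r * t)
    for n in range(n_terms):
        total += coeff * M
        M = M @ Q_hat
        coeff *= r * t / (n + 1)
    return total

t, alpha = 0.7, 1.3
P_closed = Q_hat - np.exp(-(a + b) * t) * G / (a + b)        # part 3
U_closed = Q_hat / alpha - G / ((alpha + a + b) * (a + b))   # part 4
U_resolvent = np.linalg.inv(alpha * np.eye(2) - G)
```

Both comparisons agree to machine precision, which is a quick way to catch algebra slips in results like parts 3 and 4.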
This page titled 16.20: Chains Subordinate to the Poisson Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated
by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history
is available upon request.
16.21: Continuous-Time Birth-Death Chains
Basic Theory
Introduction
A continuous-time birth-death chain is a simple class of Markov chains on a subset of Z with the property that the only possible
transitions are to increase the state by 1 (birth) or decrease the state by 1 (death). It's easiest to define the birth-death process in
terms of the exponential transition rates, part of the basic structure of continuous-time Markov chains.
Suppose that S is an integer interval (that is, a set of consecutive integers), either finite or infinite. The birth-death chain with birth rate function α : S → [0, ∞) and death rate function β : S → [0, ∞) is the Markov chain X = {X_t : t ∈ [0, ∞)} on S with transition rate α(x) from x to x + 1 and transition rate β(x) from x to x − 1, for x ∈ S.
If S has a minimum element m, then of course we must have β(m) = 0 . If α(m) = 0 also, then the boundary point m is
absorbing. Similarly, if S has a maximum element n then we must have α(n) = 0 . If β(n) = 0 also then the boundary point n is
absorbing. If x ∈ S is not a boundary point, then typically we have α(x) + β(x) > 0 , so that x is stable. If β(x) = 0 for all x ∈ S ,
then X is a pure birth process, and similarly if α(x) = 0 for all x ∈ S then X is a pure death process. From the transition rates,
it's easy to compute the parameters of the exponential holding times in a state and the transition matrix of the embedded, discrete-time jump chain.
Consider again the birth-death chain X on S with birth rate function α and death rate function β. As usual, let λ denote the
exponential parameter function and Q the transition matrix for the jump chain.
1. λ(x) = α(x) + β(x) for x ∈ S
2. If x ∈ S is stable, so that α(x) + β(x) > 0, then
p(x) = Q(x, x + 1) = \frac{α(x)}{α(x) + β(x)}, \quad q(x) = Q(x, x − 1) = \frac{β(x)}{α(x) + β(x)}, \quad r(x) = Q(x, x) = 0 \qquad (16.21.1)
If x is absorbing then of course p(x) = q(x) = 0 and r(x) = 1 . Except for the initial state, the jump chain Y is deterministic for a
pure birth process, with Q(x, x) = 1 if x is absorbing and Q(x, x + 1) = 1 if x is stable. Similarly, except for the initial state, Y is
deterministic for a pure death process, with Q(x, x) = 1 if x is absorbing and Q(x, x − 1) = 1 if x is stable. Note that the Poisson process with rate parameter r ∈ (0, ∞), viewed as a continuous-time Markov chain, is a pure birth process on ℕ with birth rate function α(x) = r for each x ∈ ℕ. More generally, a birth-death chain with λ(x) = α(x) + β(x) = r for all x ∈ S is subordinate to the Poisson process with rate r.
Note that λ is bounded if and only if α and β are bounded (always the case if S is finite), and in this case the birth-death chain X = {X_t : t ∈ [0, ∞)} is uniform. If λ is unbounded, then X may not even be regular, as an example below shows. Recall that a sufficient condition for X to be regular is \sum_{x ∈ S_+} 1/λ(x) = ∞, where S_+ = {x ∈ S : λ(x) = α(x) + β(x) > 0} is the set of stable states. Except for the aforementioned example, we will assume that our birth-death chains are regular.
16.21.1 https://stats.libretexts.org/@go/page/10394
Infinitesimal Generator and Transition Matrices
Suppose again that X = {X_t : t ∈ [0, ∞)} is a continuous-time birth-death chain on an interval S ⊆ ℤ with birth rate function α and death rate function β. As usual, we will let P_t denote the transition matrix at time t ∈ [0, ∞) and G the infinitesimal generator.
As always, the infinitesimal generator gives the same information as the exponential parameter function and the jump transition
matrix, but in a more compact and useful form. For the birth-death chain, G(x, x) = −[α(x) + β(x)], G(x, x + 1) = α(x), and G(x, x − 1) = β(x) for x ∈ S.
Proof
The backward and forward equations are given as follows, for (x, y) ∈ S^2 and t ∈ (0, ∞):
1. \frac{d}{dt} P_t(x, y) = −[α(x) + β(x)] P_t(x, y) + α(x) P_t(x + 1, y) + β(x) P_t(x − 1, y)
2. \frac{d}{dt} P_t(x, y) = −[α(y) + β(y)] P_t(x, y) + α(y − 1) P_t(x, y − 1) + β(y + 1) P_t(x, y + 1)
Proof
For the jump chain Y = {Y_n : n ∈ ℕ}, recall that
p(x) = Q(x, x + 1) = \frac{α(x)}{α(x) + β(x)}, \quad q(x) = Q(x, x − 1) = \frac{β(x)}{α(x) + β(x)}, \quad x ∈ ℕ \qquad (16.21.4)
The jump chain Y is a discrete-time birth-death chain, and our notation here is consistent with the notation that we used in that
section. Note that X and Y are irreducible. We first consider transience and recurrence.
Proof
Next we consider positive recurrence and invariant distributions. It's nice to look at this from different points of view.
The function g given by
g(x) = \frac{α(0) \cdots α(x − 1)}{β(1) \cdots β(x)}, \quad x ∈ ℕ
is invariant for X, and is the only invariant function, up to multiplication by constants. Hence X is positive recurrent if and only if B = \sum_{x=0}^∞ g(x) < ∞, in which case the (unique) invariant probability density function f is given by f(x) = g(x)/B for x ∈ ℕ. With A = \sum_{i=0}^∞ \frac{β(1) \cdots β(i)}{α(1) \cdots α(i)},
1. X is transient if A < ∞.
2. X is null recurrent if A = ∞ and B = ∞.
3. X is positive recurrent if B < ∞.
Suppose now that n ∈ ℕ_+ and that X = {X_t : t ∈ [0, ∞)} is a continuous-time birth-death chain on the integer interval ℕ_n = {0, 1, …, n}. We assume that α(x) > 0 for x ∈ {0, 1, …, n − 1} while β(x) > 0 for x ∈ {1, 2, …, n}. Of course, we must have β(0) = α(n) = 0. With these assumptions, X is irreducible, and since the state space is finite, positive recurrent. So all that remains is to find the invariant distribution. The result is essentially the same as when the state space is ℕ.
f_n(x) = \frac{1}{B_n} \frac{α(0) \cdots α(x − 1)}{β(1) \cdots β(x)} \text{ for } x ∈ ℕ_n, \text{ where } B_n = \sum_{x=0}^n \frac{α(0) \cdots α(x − 1)}{β(1) \cdots β(x)} \qquad (16.21.14)
Proof
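The product formula can be checked directly against the generator, since an invariant f_n must satisfy f_n G = 0. A sketch with hypothetical rate values on ℕ_5 (any positive rates with β(0) = α(n) = 0 would do):

```python
import numpy as np

# Hypothetical rates on N_5 = {0, ..., 5}; beta(0) = alpha(5) = 0.
n = 5
alpha = np.array([2.0, 2.0, 1.5, 1.0, 0.5, 0.0])
beta = np.array([0.0, 1.0, 1.0, 2.0, 2.0, 3.0])

# g(x) = alpha(0)...alpha(x-1) / (beta(1)...beta(x)); then f_n = g / B_n.
g = np.ones(n + 1)
for x in range(1, n + 1):
    g[x] = g[x - 1] * alpha[x - 1] / beta[x]
f = g / g.sum()

# Generator of the birth-death chain, to verify that f G = 0.
G = np.zeros((n + 1, n + 1))
for x in range(n + 1):
    if x < n:
        G[x, x + 1] = alpha[x]
    if x > 0:
        G[x, x - 1] = beta[x]
    G[x, x] = -(alpha[x] + beta[x])
```

The recursion g(x + 1) = g(x) α(x)/β(x + 1) is just the one-step balance equation, which is why f G = 0 holds automatically.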
Note that B_n → B as n → ∞, and if B < ∞, then f_n(x) → f(x) as n → ∞ for x ∈ ℕ. We will see this type of behavior again: results for the birth-death chain on ℕ_n often converge to the corresponding results for the birth-death chain on ℕ as n → ∞.
Absorption
Often when the state space S = N , the state of a birth-death chain represents a population of individuals of some sort (and so the
terms birth and death have their usual meanings). In this case state 0 is absorbing and means that the population is extinct.
Specifically, suppose that X = {X_t : t ∈ [0, ∞)} is a regular birth-death chain on ℕ with α(0) = β(0) = 0 and with α(x), β(x) > 0 for x ∈ ℕ_+. Thus, state 0 is absorbing and all positive states lead to each other and to 0. Let T = \min\{t ∈ [0, ∞) : X_t = 0\} denote the time until absorption, where as usual, \min ∅ = ∞. Many of the results concerning extinction of the continuous-time birth-death chain follow easily from corresponding results for the discrete-time birth-death jump chain.
Proof
Naturally we would like to find the probability of these complementary events, and happily we have already done so in our study of discrete-time birth-death chains. The absorption probability function v is defined by v(x) = P(T < ∞ ∣ X_0 = x) for x ∈ ℕ. As before, let
A = \sum_{i=0}^∞ \frac{β(1) \cdots β(i)}{α(1) \cdots α(i)} \qquad (16.21.17)
Proof
The mean time to extinction is considered next, so let m(x) = E(T ∣ X_0 = x) for x ∈ ℕ. Unlike the probability of extinction, computing the mean time to extinction cannot easily be reduced to the corresponding discrete-time computation. However, the method of computation does extend.
Probabilistic Proof
In particular, note that
m(1) = \sum_{k=0}^∞ \frac{α(1) \cdots α(k)}{β(1) \cdots β(k + 1)} \qquad (16.21.23)
If m(1) = ∞ then m(x) = ∞ for all x ∈ ℕ_+. If m(1) < ∞ then m(x) < ∞ for all x ∈ ℕ_+.
Next we will consider a birth-death chain on a finite integer interval with both endpoints absorbing. Our interest is in the probability of absorption in one endpoint rather than the other, and in the mean time to absorption. Thus suppose that n ∈ ℕ_+ and that X = {X_t : t ∈ [0, ∞)} is a continuous-time birth-death chain on ℕ_n = {0, 1, …, n} with α(0) = β(0) = 0, α(n) = β(n) = 0, and α(x) > 0, β(x) > 0 for x ∈ {1, 2, …, n − 1}. So the endpoints 0 and n are absorbing, and all other states lead to each other and to the endpoints. Let T = \inf\{t ∈ [0, ∞) : X_t ∈ \{0, n\}\}, the time until absorption, and for x ∈ ℕ_n let v_n(x) = P(X_T = 0 ∣ X_0 = x) and m_n(x) = E(T ∣ X_0 = x). The definitions make sense since T is finite with probability 1.
Proof
Note that A_n → A as n → ∞, where A is the constant above for the absorption probability at 0 with the infinite state space ℕ.
Time Reversal
Essentially, every irreducible continuous-time birth-death chain is reversible.
Suppose that X = {X_t : t ∈ [0, ∞)} is a positive recurrent birth-death chain on an integer interval S ⊆ ℤ with birth rate function α : S → [0, ∞) and death rate function β : S → [0, ∞). Assume that α(x) > 0, except at the maximum value of S, if there is one, and similarly that β(x) > 0, except at the minimum value of S, if there is one. Then X is reversible.
Proof
In the important special case of a birth-death chain on N, we can verify the balance equations directly.
Proof
In the positive recurrent case, it follows that the birth-death chain is stochastically the same, forward or backward in time, if the
chain has the invariant distribution.
Consider the pure birth process X = {X_t : t ∈ [0, ∞)} on ℕ with birth rate function α : ℕ → (0, ∞).
Proof
Suppose that X = {X_t : t ∈ [0, ∞)} is the birth-death chain on ℕ with constant birth rate α ∈ (0, ∞) on ℕ and constant death rate β ∈ (0, ∞) on ℕ_+.
1. X is transient if β < α.
2. X is null recurrent if β = α.
3. X is positive recurrent if β > α. The invariant distribution is the geometric distribution on ℕ with parameter 1 − α/β:
f(x) = \left(1 − \frac{α}{β}\right) \left(\frac{α}{β}\right)^x, \quad x ∈ ℕ \qquad (16.21.28)
Proof
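Part 3 can be verified by checking the balance equations f(x)α = f(x + 1)β edge by edge. A sketch, with rates chosen (α/β = 1/2) so the floating-point arithmetic is exact:

```python
alpha, beta = 1.0, 2.0          # beta > alpha, so the chain is positive recurrent
rho = alpha / beta

def f(x):
    """Geometric pmf on N with parameter 1 - alpha/beta."""
    return (1 - rho) * rho ** x

# Detailed balance across each edge {x, x+1}: f(x) alpha = f(x+1) beta.
max_gap = max(abs(f(x) * alpha - f(x + 1) * beta) for x in range(60))
total_mass = sum(f(x) for x in range(200))
```

Since a birth-death chain only moves between neighboring states, these edge equations are exactly the reversibility condition, and they also imply invariance.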
Next we consider the chain with 0 absorbing. As in the general discussion above, let v denote the function that gives the probability of absorption and m the function that gives the mean time to absorption.
Suppose that X = {X_t : t ∈ [0, ∞)} is the birth-death chain on ℕ with constant birth rate α ∈ (0, ∞) on ℕ_+ and constant death rate β ∈ (0, ∞) on ℕ_+, with 0 absorbing.
Next let's look at chains on a finite state space. Let n ∈ ℕ_+ and define ℕ_n = {0, 1, …, n}.
Suppose that X = {X_t : t ∈ [0, ∞)} is a continuous-time birth-death chain on ℕ_n with constant birth rate α ∈ (0, ∞) on {0, 1, …, n − 1} and constant death rate β ∈ (0, ∞) on {1, 2, …, n}. The invariant probability density function f_n is given as follows:
1. If α ≠ β then
f_n(x) = \frac{(α/β)^x (1 − α/β)}{1 − (α/β)^{n+1}}, \quad x ∈ ℕ_n \qquad (16.21.30)
2. If α = β then f_n(x) = \frac{1}{n + 1} for x ∈ ℕ_n.
Note that when α = β, the invariant distribution is uniform on ℕ_n. Our final exercise considers the absorption probability at 0 when both endpoints are absorbing. Let v_n denote the function that gives the probability of absorption into 0, rather than n.
1. If α ≠ β then
v_n(x) = \frac{(β/α)^x − (β/α)^n}{1 − (β/α)^n}, \quad x ∈ ℕ_n \qquad (16.21.31)
2. If α = β then v_n(x) = \frac{n − x}{n} for x ∈ ℕ_n.
Let X_t denote the population at time t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a regular, continuous-time birth-death chain with birth and death rate functions given by α(x) = ax and β(x) = bx for x ∈ ℕ.
Proof
Note that 0 is absorbing since the population is extinct, so as usual, our interest is in the probability of absorption and the mean
time to absorption as functions of the initial state. The probability of absorption is the same as for the chain with constant birth and
death rates discussed above.
1. v(x) = 1 for all x ∈ ℕ if b ≥ a.
2. v(x) = (b/a)^x for x ∈ ℕ if b < a.
Proof
2. If a < b then
m(x) = \sum_{j=1}^x \frac{b^{j-1}}{a^j} \int_0^{a/b} \frac{u^{j-1}}{1 − u} \, du, \quad x ∈ ℕ \qquad (16.21.34)
Proof
For small values of x ∈ ℕ, the integrals in the case a < b can be done by elementary methods. For example,
m(1) = −\frac{1}{a} \ln\left(1 − \frac{a}{b}\right)
m(2) = m(1) − \frac{1}{a} − \frac{b}{a^2} \ln\left(1 − \frac{a}{b}\right)
m(3) = m(2) − \frac{1}{2a} − \frac{b}{a^2} − \frac{b^2}{a^3} \ln\left(1 − \frac{a}{b}\right)
However, a general formula requires the introduction of a special function that is not much more helpful than the integrals
themselves. The Markov chain X is actually an example of a branching chain. We will revisit this chain in that section.
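The elementary expressions for m(1) and m(2) can be confirmed numerically against the integral formula. A sketch, with arbitrary rates satisfying a < b:

```python
from math import log

a, b = 1.0, 2.0          # a < b: the death rate exceeds the birth rate

def integral(j, steps=50000):
    """Midpoint-rule value of the integral of u^{j-1} / (1 - u) over [0, a/b]."""
    upper = a / b
    h = upper / steps
    return sum(((i + 0.5) * h) ** (j - 1) / (1.0 - (i + 0.5) * h) * h
               for i in range(steps))

def m(x):
    """Mean time to extinction from state x, via the integral formula."""
    return sum(b ** (j - 1) / a ** j * integral(j) for j in range(1, x + 1))

m1_closed = -(1 / a) * log(1 - a / b)
m2_closed = m1_closed - 1 / a - (b / a ** 2) * log(1 - a / b)
```

The integrand is smooth on [0, a/b] when a < b, so a simple midpoint rule is accurate enough for a check like this.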
Let X_t denote the population at time t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a regular, continuous-time birth-death chain with birth and death rate functions given by α(x) = ax + c and β(x) = bx for x ∈ ℕ.
Proof
The backward and forward equations are given as follows, for (x, y) ∈ ℕ^2 and t ∈ (0, ∞):
1. \frac{d}{dt} P_t(x, y) = −[(a + b)x + c] P_t(x, y) + (ax + c) P_t(x + 1, y) + bx \, P_t(x − 1, y)
2. \frac{d}{dt} P_t(x, y) = −[(a + b)y + c] P_t(x, y) + [a(y − 1) + c] P_t(x, y − 1) + b(y + 1) P_t(x, y + 1)
We can use the forward equation to find the expected population size. Let M_t(x) = E(X_t ∣ X_0 = x) for t ∈ [0, ∞) and x ∈ ℕ.
For t ∈ [0, ∞) and x ∈ ℕ, the mean population size M_t(x) is given as follows:
1. If a = b then M_t(x) = ct + x.
2. If a ≠ b then
M_t(x) = \frac{c}{a − b}\left[e^{(a−b)t} − 1\right] + x e^{(a−b)t} \qquad (16.21.40)
Proof
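The case a ≠ b can be checked by confirming that the formula satisfies the ordinary differential equation M′ = (a − b)M + c with M(0) = x, which is what the forward equation yields for the mean. A sketch with arbitrary parameter values:

```python
from math import exp

# Illustrative parameter values (a != b); x is the initial population size.
a, b, c, x = 1.2, 0.7, 0.5, 3.0

def M(t):
    """Claimed mean population size M_t(x) for a != b."""
    return c / (a - b) * (exp((a - b) * t) - 1.0) + x * exp((a - b) * t)

# Check the formula against the ODE M'(t) = (a - b) M(t) + c by
# central differences at a few sample times.
h = 1e-6
derivative_gap = max(
    abs((M(t + h) - M(t - h)) / (2 * h) - ((a - b) * M(t) + c))
    for t in (0.0, 0.5, 1.0, 2.0)
)
```

The initial condition M(0) = x holds exactly, and the finite-difference residual is at the level of numerical noise.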
Note that if b > a, so that the individual death rate exceeds the birth rate, then M_t(x) → c/(b − a) as t → ∞ for x ∈ ℕ. If a ≥ b, so that the birth rate equals or exceeds the death rate, then M_t(x) → ∞ as t → ∞ for x ∈ ℕ_+.
Next we will consider the special case with no births, but only death and immigration. In this case, the invariant distribution is easy
to compute, and is one of our favorites.
Suppose that a = 0 and that b, c > 0. Then X is positive recurrent. The invariant distribution is Poisson with parameter c/b:
f(x) = e^{−c/b} \frac{(c/b)^x}{x!}, \quad x ∈ ℕ \qquad (16.21.42)
Proof
Given the population size, the individuals act independently and identically. Specifically, if the population is
x ∈ {m, m + 1, … , n} then an individual splits in two at exponential rate a(n − x) and dies at exponential rate b(x − m) , where
a, b ∈ (0, ∞) . Thus, an individual's birth rate decreases linearly with the population size from a(n − m) to 0 while the death rate
increases linearly with the population size from 0 to b(n − m) . These assumptions lead to the following definition.
The continuous-time birth-death chain X = {X_t : t ∈ [0, ∞)} on S = {m, m + 1, …, n} with birth rate function α and death rate function β given by
α(x) = ax(n − x), \quad β(x) = bx(x − m), \quad x ∈ S \qquad (16.21.45)
is the logistics chain with parameters a, b, m, and n.
Note that the logistics chain is a stochastic counterpart of the logistics differential equation, which typically has the form
\frac{dx}{dt} = c(x − m)(n − x) \qquad (16.21.46)
where m, n, c ∈ (0, ∞) and m < n. Starting at x(0) ∈ (m, n), the solution remains in (m, n) for all t ∈ [0, ∞). Of course, the logistics differential equation models a system that is continuous in time and space, whereas the logistics Markov chain models a system that is continuous in time and discrete in space.
In particular, m and n are reflecting boundary points, and so the chain is irreducible.
The generator matrix G for the logistics chain is given as follows, for x ∈ S :
1. G(x, x) = −x[a(n − x) + b(x − m)]
2. G(x, x − 1) = bx(x − m)
3. G(x, x + 1) = ax(n − x)
Define g : S → (0, ∞) by
g(x) = \frac{1}{x} \binom{n − m}{x − m} \left(\frac{a}{b}\right)^{x − m}, \quad x ∈ S \qquad (16.21.49)
16.21.7 https://stats.libretexts.org/@go/page/10394
Then g is invariant for X.
Proof
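Invariance can be confirmed through the one-step balance equations g(x)α(x) = g(x + 1)β(x + 1), which is all that is needed for a birth-death chain. A sketch with hypothetical parameter values:

```python
from math import comb

a, b, m, n = 1.5, 2.0, 2, 8      # illustrative parameters with m < n

def alpha(x):
    return a * x * (n - x)       # birth rate of the logistics chain

def beta(x):
    return b * x * (x - m)       # death rate of the logistics chain

def g(x):
    """Claimed invariant function for the logistics chain."""
    return comb(n - m, x - m) * (a / b) ** (x - m) / x

# Balance across each edge {x, x+1} for m <= x < n.
gaps = [abs(g(x) * alpha(x) - g(x + 1) * beta(x + 1)) for x in range(m, n)]
```

The binomial identity (x − m + 1) C(n − m, x − m + 1) = (n − x) C(n − m, x − m) is exactly what makes the two sides match.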
Of course, it now follows that the invariant probability density function f for X is given by f(x) = g(x)/c for x ∈ S, where c is the normalizing constant
c = \sum_{x=m}^n \frac{1}{x} \binom{n − m}{x − m} \left(\frac{a}{b}\right)^{x − m} \qquad (16.21.51)
This page titled 16.21: Continuous-Time Birth-Death Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.22: Continuous-Time Queuing Chains
Basic Theory
Introduction
In a queuing model, customers arrive at a station for service. As always, the terms are generic; here are some typical examples:
The customers are persons and the service station is a store.
The customers are file requests and the service station is a web server.
Queuing models can be quite complex, depending on such factors as the probability distribution that governs the arrival of
customers, the probability distribution that governs the service of customers, the number of servers, and the behavior of the
customers when all servers are busy. Indeed, queuing theory has its own lexicon to indicate some of these factors. In this section,
we will discuss a few of the basic, continuous-time queuing chains. In a general sense, the main interest in any queuing model is
the number of customers in the system as a function of time, and in particular, whether the servers can adequately handle the flow
of customers. This section parallels the section on discrete-time queuing chains.
Assumption (b) means that the times between arrivals of customers are independent and exponentially distributed, with parameter
μ . Assumption (c) means that we have a first-in, first-out model, often abbreviated FIFO. Note that there are three parameters in
the model: the number of servers k , the exponential parameter μ that governs the arrivals, and the exponential parameter ν that
governs the service times. The special cases k = 1 (a single server) and k = ∞ (infinitely many servers) deserve special attention.
As you might guess, the assumptions lead to a continuous-time Markov chain.
Let X_t denote the number of customers in the system (waiting in line or being served) at time t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on N, known as the M/M/k queuing chain.
In terms of the basic structure of the chain, the important quantities are the exponential parameters for the states and the transition
matrix for the embedded jump chain.
Q(x, x − 1) = νk/(μ + νk), Q(x, x + 1) = μ/(μ + νk), x ∈ N, x ≥ k
So the M/M/k chain is a birth-death chain with 0 as a reflecting boundary point. That is, in state x ∈ N+, the next state is either x − 1 or x + 1, while in state 0, the next state is 1. When k = 1, the single-server queue, the exponential parameter in state x ∈ N+ is μ + ν and the transition probabilities are
Q(x, x − 1) = ν/(μ + ν), Q(x, x + 1) = μ/(μ + ν) (16.22.1)
When k = ∞, the infinite server queue, the cases above for x ≥ k are vacuous, so the exponential parameter in state x ∈ N is μ + xν and the transition probabilities are
Q(x, x − 1) = νx/(μ + νx), Q(x, x + 1) = μ/(μ + νx) (16.22.2)
Infinitesimal Generator
The infinitesimal generator of the chain gives the same information as the exponential parameter function and the jump transition
matrix, but in a more compact form.
So for k = 1, the single server queue, the generator G is given by G(0, 0) = −μ, G(0, 1) = μ, while for x ∈ N+, G(x, x) = −(μ + ν), G(x, x − 1) = ν, G(x, x + 1) = μ. For k = ∞, the infinite server case, the generator G is given by G(x, x) = −(μ + νx), G(x, x − 1) = νx, and G(x, x + 1) = μ for all x ∈ N.
As noted in the introduction, of fundamental importance is the question of whether the servers can handle the flow of
customers, so that the queue eventually empties, or whether the length of the queue grows without bound. To understand the
limiting behavior, we need to classify the chain as transient, null recurrent, or positive recurrent, and find the invariant functions.
This will be easy to do using our results for more general continuous-time birth-death chains. Note first that X is irreducible. It's
best to consider the single server and infinite server cases individually.
The single server queuing chain X is transient if μ > ν and null recurrent if μ = ν. If μ < ν, the chain is positive recurrent, with invariant probability density function f given by
f(x) = (1 − μ/ν)(μ/ν)^x, x ∈ N (16.22.3)
Proof
The result makes intuitive sense. If the service rate is less than the arrival rate, the chain is transient and the length of the queue
grows to infinity. If the service rate is greater than the arrival rate, the chain is positive recurrent. At the boundary between these
two cases, when the arrival and service rates are the same, the chain is null recurrent.
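These facts are easy to confirm numerically for the single server chain: the geometric density satisfies the birth-death balance equations f(x)μ = f(x + 1)ν, and the stationary mean queue length works out to (μ/ν)/(1 − μ/ν). A Python sketch with arbitrary rates μ < ν (the values are illustrative, not from the text):

```python
mu, nu = 2.0, 5.0    # arbitrary arrival and service rates with mu < nu
rho = mu / nu        # traffic intensity

N = 200              # truncation point for the infinite state space
f = [(1 - rho) * rho ** x for x in range(N)]

# detailed balance for the single server birth-death chain: f(x) mu = f(x + 1) nu
balance_errors = [abs(f[x] * mu - f[x + 1] * nu) for x in range(N - 1)]

mean_queue = sum(x * f[x] for x in range(N))   # stationary expected number in the system
```

Truncating at N = 200 is harmless here since the geometric tail is negligible; the computed mean agrees with ρ/(1 − ρ) = 2/3.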
The infinite server queuing chain X is positive recurrent. The invariant distribution is the Poisson distribution with parameter
μ/ν. The invariant probability density function f is given by
f(x) = e^{−μ/ν} (μ/ν)^x / x!, x ∈ N (16.22.4)
Proof
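The Poisson form can likewise be checked against the balance equations f(x)μ = f(x + 1)(x + 1)ν for the infinite server chain. A short Python sketch with arbitrary rates:

```python
from math import exp, factorial

mu, nu = 3.0, 2.0   # arbitrary arrival rate and per-customer service rate
rho = mu / nu

N = 100
f = [exp(-rho) * rho ** x / factorial(x) for x in range(N)]

# balance: up-rate mu from state x, down-rate nu * (x + 1) from state x + 1
balance_errors = [abs(f[x] * mu - f[x + 1] * nu * (x + 1)) for x in range(N - 1)]
```

Each balance equation reduces to μ = ρν, so the errors are pure floating-point noise.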
This page titled 16.22: Continuous-Time Queuing Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
16.23: Continuous-Time Branching Chains
Basic Theory
Introduction
Generically, suppose that we have a system of particles that can generate or split into other particles of the same type. Here are
some typical examples:
The particles are biological organisms that reproduce.
The particles are neutrons in a chain reaction.
The particles are electrons in an electron multiplier.
We assume that the lifetime of each particle is exponentially distributed with parameter α ∈ (0, ∞) , and at the end of its life, is
replaced by a random number of new particles that we will refer to as children of the original particle. The number of children N
of a particle has probability density function f on N. The particles act independently, so in addition to being identically distributed,
the lifetimes and the number of children are independent from particle to particle. Finally, we assume that f (1) = 0 , so that a
particle cannot simply die and be replaced by a single new particle. Let μ and σ² denote the mean and variance of the number of children of a particle:
μ = E(N) = ∑_{n=0}^{∞} n f(n), σ² = var(N) = ∑_{n=0}^{∞} (n − μ)² f(n) (16.23.1)
We assume that μ is finite and so σ² makes sense. In our study of discrete-time Markov chains, we studied branching chains in terms of generational time. Here we want to study the model in real time.
Let X_t denote the number of particles at time t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a continuous-time Markov chain on N, known as a branching chain. The exponential parameter function λ and jump transition matrix Q are given by
1. λ(x) = αx for x ∈ N
2. Q(x, x + k − 1) = f(k) for x ∈ N+ and k ∈ N.
Proof
Of course 0 is an absorbing state, since this state means extinction with no particles. (Note that λ(0) = 0 and so by default,
Q(0, 0) = 1.) So with a branching chain, there are essentially two types of behavior: population extinction or population explosion.
For the branching chain X = {X_t : t ∈ [0, ∞)}, one of the following events occurs with probability 1:
1. Extinction: X_t = 0 for some t ∈ [0, ∞) and hence X_s = 0 for all s ≥ t.
2. Explosion: X_t → ∞ as t → ∞.
Proof
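The dichotomy is easy to see in simulation. The Python sketch below uses the offspring density f(0) = 1 − p, f(2) = p studied later in this section, draws exponential holding times and jumps, and treats hitting a large population cap as explosion. (The parameter values, the cap, and the seed are arbitrary choices for illustration.)

```python
import random

def simulate_extinct(p, alpha=1.0, x0=1, cap=500, rng=random):
    """Run one branching chain until extinction (True) or explosion (False).
    Offspring density: f(0) = 1 - p, f(2) = p."""
    x = x0
    t = 0.0
    while 0 < x < cap:
        t += rng.expovariate(alpha * x)      # holding time in state x is exponential, rate alpha * x
        x += 1 if rng.random() < p else -1   # one particle splits in two, or dies
    return x == 0

rng = random.Random(12345)
p = 0.75
runs = 2000
estimate = sum(simulate_extinct(p, rng=rng) for _ in range(runs)) / runs
# theory (derived later in the section): extinction probability q = (1 - p)/p = 1/3
```

Every run ends in one of the two regimes, and the fraction of extinct runs estimates the extinction probability q.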
Without the assumption that μ < ∞, explosion can actually occur in finite time. On the other hand, the assumption that f(1) = 0 is for convenience. Without this assumption, X would still be a continuous-time Markov chain, but as discussed in the Introduction, the exponential parameter function would be λ(x) = α[1 − f(1)]x for x ∈ N and the jump transition matrix would be
Q(x, x + k − 1) = f(k)/[1 − f(1)], x ∈ N+, k ∈ {0, 2, 3, …} (16.23.2)
Because all particles act identically and independently, the branching chain starting with x ∈ N+ particles is essentially x independent copies of the branching chain starting with 1 particle. In many ways, this is the fundamental insight into branching chains, and in particular, it means that we can often condition on X_0 = 1.
The infinitesimal generator G is given by G(x, x) = −αx for x ∈ N and G(x, x + k − 1) = αx f(k) for x ∈ N+ and k ∈ N.
Proof
Unlike some of our other continuous-time models, the jump chain Y governed by Q is not the discrete-time version of the model. That is, Y is not a discrete-time branching chain, since in discrete time, the index n represents the nth generation, whereas here it represents the nth time that a particle reproduces. However, there are lots of discrete-time branching chains embedded in the continuous-time chain.
Fix t ∈ (0, ∞) and define Z_t = {X_{nt} : n ∈ N}. Then Z_t is a discrete-time branching chain with offspring probability density function P_t(1, ⋅).
Proof
For t ∈ [0, ∞), let Φ_t denote the probability generating function of X_t starting with a single particle, and let Ψ denote the probability generating function of the offspring distribution:
Φ_t(r) = E(r^{X_t} ∣ X_0 = 1) = ∑_{x=0}^{∞} r^x P_t(1, x) (16.23.5)
Ψ(r) = E(r^N) = ∑_{n=0}^{∞} r^n f(n) (16.23.6)
The generating functions are defined (the series are absolutely convergent) at least for r ∈ (−1, 1] .
The collection of generating functions Φ = {Φ_t : t ∈ [0, ∞)} gives the same information as the collection of probability density functions {P_t(1, ⋅) : t ∈ [0, ∞)}. With the fundamental insight that the branching process starting with one particle determines the branching process in general, Φ actually determines the transition semigroup P = {P_t : t ∈ [0, ∞)}.
For t ∈ [0, ∞) and x ∈ N,
∑_{y=0}^{∞} r^y P_t(x, y) = [Φ_t(r)]^x (16.23.7)
Proof
Note that Φ_t is the generating function of the offspring distribution for the embedded discrete-time branching chain Z_t = {X_{nt} : n ∈ N} for t ∈ (0, ∞). On the other hand, Ψ is the generating function of the offspring distribution for the continuous-time chain. So our main goal in this discussion is to see how Φ is built from Ψ. Because P is a semigroup under matrix multiplication, and because the particles act identically and independently, Φ is a semigroup under composition.
Note also that Φ_0(r) = E(r^{X_0} ∣ X_0 = 1) = r for all r ∈ R. This also follows from the semigroup property: Φ_0 = Φ_0 ∘ Φ_0. The fundamental relationship between the collection of generating functions Φ and the generating function Ψ is given in the following theorem:
(d/dt) Φ_t = α(Ψ ∘ Φ_t − Φ_t) (16.23.8)
Proof
This differential equation, along with the initial condition Φ_0(r) = r for all r ∈ R, determines the collection of generating functions Φ. In implicit form,
∫_r^{Φ_t(r)} 1/[Ψ(u) − u] du = αt (16.23.11)
Another relationship is given in the following theorem. Here, Φ′_t refers to the derivative of the generating function Φ_t with respect to its argument r.
Proof
Moments
In this discussion, we will study the mean and variance of the number of particles at time t ∈ [0, ∞). Let m_t = E(X_t ∣ X_0 = 1) and v_t = var(X_t ∣ X_0 = 1), so that m_t and v_t are the mean and variance, starting with a single particle. As always with a branching process, it suffices to condition on a single initial particle: for x ∈ N,
1. E(X_t ∣ X_0 = x) = x m_t
2. var(X_t ∣ X_0 = x) = x v_t
Proof
Recall also that μ and σ² are the mean and variance of the number of offspring of a particle. Here is the connection between the means:
m_t = e^{α(μ−1)t} for t ∈ [0, ∞).
1. If μ < 1 then m_t → 0 as t → ∞. This is extinction in the mean.
2. If μ > 1 then m_t → ∞ as t → ∞. This is explosion in the mean.
3. If μ = 1 then m_t = 1 for all t ∈ [0, ∞). This is stability in the mean.
Proof
This result is intuitively very appealing. As a function of time, the expected number of particles either grows or decays exponentially, depending on whether the expected number of offspring of a particle is greater or less than one. The connection between the variances is more complicated. We assume that σ² < ∞.
If μ ≠ 1 then
v_t = [σ²/(μ − 1) + (μ − 1)] [e^{2α(μ−1)t} − e^{α(μ−1)t}], t ∈ [0, ∞) (16.23.21)
If μ = 1 then v_t = ασ²t.
1. If μ < 1 then v_t → 0 as t → ∞
2. If μ ≥ 1 then v_t → ∞ as t → ∞
Proof
If μ < 1, so that m_t → 0 as t → ∞ and we have extinction in the mean, then v_t → 0 as t → ∞ also. If μ > 1, so that m_t → ∞ as t → ∞ and we have explosion in the mean, then v_t → ∞ as t → ∞ also. We would expect these results. On the other hand, if μ = 1, so that m_t = 1 for all t ∈ [0, ∞) and we have stability in the mean, then v_t grows linearly in t. This gives some insight into the critical case μ = 1.
Need we say it? The extinction probability starting with an arbitrary number of particles is easy. Let q denote the probability of extinction starting with a single particle. For x ∈ N,
P(X_t = 0 for some t ∈ (0, ∞) ∣ X_0 = x) = lim_{t→∞} P(X_t = 0 ∣ X_0 = x) = q^x (16.23.28)
Proof
We can easily relate extinction for the continuous-time branching chain X to extinction for any of the embedded discrete-time
branching chains.
If extinction occurs for X then extinction occurs for Z_t for every t ∈ (0, ∞). Conversely, if extinction occurs for Z_t for some t ∈ (0, ∞), then extinction occurs for Z_t for every t ∈ (0, ∞) and extinction occurs for X. Hence q is the minimum solution in (0, 1] of the equation Φ_t(r) = r for each t ∈ (0, ∞).
Proof
The extinction probability q and the mean of the offspring distribution μ are related as follows:
1. If μ ≤ 1 then q = 1 , so extinction is certain.
2. If μ > 1 then 0 < q < 1 , so there is a positive probability of extinction and a positive probability of explosion.
Proof
It would be nice to have an equation for q in terms of the offspring probability generating function Ψ. This is also easy.
The probability of extinction q is the minimum solution in (0, 1] of the equation Ψ(r) = r .
Proof
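This characterization suggests a simple numerical method: iterating r ↦ Ψ(r) starting from r = 0 converges monotonically to the minimum solution of Ψ(r) = r. A Python sketch, using the offspring density f(0) = 1 − p, f(2) = p studied later in this section (the value of p is arbitrary):

```python
def extinction_prob(psi, tol=1e-12, max_iter=100000):
    # iterate r -> psi(r) from 0; converges to the minimum solution of psi(r) = r in (0, 1]
    r = 0.0
    for _ in range(max_iter):
        r_new = psi(r)
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    return r

p = 0.75
psi = lambda r: (1 - p) + p * r * r   # offspring pgf for f(0) = 1 - p, f(2) = p
q = extinction_prob(psi)
# minimum solution of p r^2 - r + (1 - p) = 0 in (0, 1] is (1 - p)/p when p > 1/2
```

For p ≤ 1/2 the same iteration converges to 1, matching the certain-extinction case.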
Special Models
We now turn our attention to a number of special branching chains that are important in applications or lead to interesting insights.
We will use the notation established above, so that α is the parameter of the exponential lifetime of a particle, Q is the transition matrix of the jump chain, G is the infinitesimal generator matrix, and P_t is the transition matrix at time t ∈ [0, ∞). Similarly, m_t and v_t are the mean and variance of the number of particles at time t ∈ [0, ∞), starting with a single particle. As always, be sure to try these exercises yourself before looking at the proofs and solutions.
First we consider the pure death branching chain, in which each particle simply dies at the end of its life without reproducing, so that the offspring probability density function is given by f(0) = 1.
The transition matrix of the jump chain and the generator matrix are given by
1. Q(0, 0) = 1 and Q(x, x − 1) = 1 for x ∈ N+
2. G(x, x) = −αx for x ∈ N and G(x, x − 1) = αx for x ∈ N+
For t ∈ [0, ∞):
1. m_t = e^{−αt}
2. v_t = e^{−αt} − e^{−2αt}
3. Φ_t(r) = 1 − (1 − r)e^{−αt} for r ∈ R
4. Given X_0 = x, the distribution of X_t is binomial with trial parameter x and success parameter e^{−αt}. That is,
P_t(x, y) = C(x, y) e^{−αty}(1 − e^{−αt})^{x−y}, x ∈ N, y ∈ {0, 1, …, x} (16.23.29)
Direct Proof
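Two of these facts are easy to confirm numerically: the semigroup property Φ_{s+t} = Φ_s ∘ Φ_t, and the generating function identity ∑_y r^y P_t(x, y) = [Φ_t(r)]^x. A Python sketch with arbitrary parameter values:

```python
from math import comb, exp

alpha = 0.7   # arbitrary lifetime parameter

def Phi(t, r):
    # generating function of X_t given X_0 = 1 for the pure death chain
    return 1 - (1 - r) * exp(-alpha * t)

def P(t, x, y):
    # binomial: each of x particles is still alive at time t with probability e^{-alpha t}
    q = exp(-alpha * t)
    return comb(x, y) * q ** y * (1 - q) ** (x - y)

s, t, r, x = 0.4, 1.1, 0.3, 6
semigroup_err = abs(Phi(s + t, r) - Phi(s, Phi(t, r)))
pgf_err = abs(sum(r ** y * P(t, x, y) for y in range(x + 1)) - Phi(t, r) ** x)
```

Both identities hold exactly here, since ∑_y r^y C(x, y) q^y (1 − q)^{x−y} = (rq + 1 − q)^x = [Φ_t(r)]^x.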
In particular, note that P_t(x, 0) = (1 − e^{−αt})^x → 1 as t → ∞. That is, the probability of extinction by time t increases to 1 exponentially fast. Since we have an explicit formula for the transition matrices, we can find an explicit formula for the potential matrices as well. The result uses the beta function B.
For β ∈ (0, ∞),
U_β(x, y) = (1/α) C(x, y) B(y + β/α, x − y + 1), x ∈ N, y ∈ {0, 1, …, x} (16.23.31)
Proof
We could argue the results for the potential U directly. Recall that U(x, y) is the expected time spent in state y starting in state x. Since 0 is absorbing and all states lead to 0, U(x, 0) = ∞ for x ∈ N. If x, y ∈ N+ and y ≤ x, then x leads to y with probability 1. Once in state y, the time spent in y has an exponential distribution with parameter λ(y) = αy, and so the mean is 1/(αy). Of course, when the chain leaves y, it never returns.
Recall that βU_β is a transition probability matrix for β > 0, and in fact βU_β(x, ⋅) is the probability density function of X_T given X_0 = x, where T is independent of X and has the exponential distribution with parameter β. For the next result, recall the ascending power notation
a^{[k]} = a(a + 1) ⋯ (a + k − 1), a ∈ R, k ∈ N (16.23.36)
Proof
Next we consider the Yule process, the branching chain with offspring probability density function given by f(2) = 1. Explosion is inevitable, starting with at least one particle, but other properties of the Yule process are interesting. In particular, there are fascinating parallels with the pure death branching chain. Since 0 is an isolated, absorbing state, we will sometimes restrict our attention to positive states.
The transition matrix of the jump chain and the generator matrix are given by
1. Q(0, 0) = 1 and Q(x, x + 1) = 1 for x ∈ N+
2. G(x, x) = −αx for x ∈ N and G(x, x + 1) = αx for x ∈ N+
Since the Yule process is a pure birth process and the birth rate in state x ∈ N is αx, the process is also called the linear birth chain. As with the pure death process, we can give the distribution of X_t specifically.
For t ∈ [0, ∞):
1. m_t = e^{αt}
2. v_t = e^{2αt} − e^{αt}
3. Φ_t(r) = re^{−αt}/(1 − r + re^{−αt}) for |r| < 1/(1 − e^{−αt})
4. Given X_0 = x, X_t has the negative binomial distribution on N+ with stopping parameter x and success parameter e^{−αt}. That is,
P_t(x, y) = C(y − 1, x − 1) e^{−xαt}(1 − e^{−αt})^{y−x}, x ∈ N+, y ∈ {x, x + 1, …} (16.23.40)
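The negative binomial form can be checked against the generating function identity ∑_y r^y P_t(x, y) = [Φ_t(r)]^x by truncating the infinite series. A Python sketch with arbitrary parameter values:

```python
from math import comb, exp

alpha, t = 1.0, 0.8   # arbitrary parameter values
q = exp(-alpha * t)   # success parameter

def Phi(r):
    # generating function of X_t given X_0 = 1 for the Yule process
    return r * q / (1 - r + r * q)

def P(x, y):
    # negative binomial density: trial number of the x'th success, parameter q
    return comb(y - 1, x - 1) * q ** x * (1 - q) ** (y - x)

r, x, N = 0.5, 3, 400   # N truncates the infinite series over y
series = sum(r ** y * P(x, y) for y in range(x, N))
err = abs(series - Phi(r) ** x)
```

The tail terms decay geometrically like [r(1 − e^{−αt})]^y, so the truncation error is negligible.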
Recall that the negative binomial distribution with parameters k ∈ N+ and p ∈ (0, 1) governs the trial number of the kth success in a sequence of Bernoulli trials with success parameter p. So the occurrence of this distribution in the Yule process suggests such an interpretation. However, this interpretation is not nearly as obvious as with the binomial distribution in the pure death branching chain. Next we give the potential matrices.
For β ∈ (0, ∞),
U_β(x, y) = (1/α) C(y − 1, x − 1) B(x + β/α, y − x + 1), x ∈ N+, y ∈ {x, x + 1, …} (16.23.45)
If β > 0, the function βU_β(x, ⋅) is the beta-negative binomial probability density function with parameters x, β/α, and 1:
βU_β(x, y) = C(y − 1, x − 1) (β/α)^{[x]} 1^{[y−x]} / (1 + β/α)^{[y]}, x ∈ N+, y ∈ {x, x + 1, …} (16.23.46)
Proof
If we think of the Yule process in terms of particles that never die, but each particle gives birth to a new particle at rate α, then we can study the age of the particles at a given time. As usual, we can start with a single, new particle at time 0. So to set up the notation, let X = {X_t : t ∈ [0, ∞)} be the Yule branching chain with birth rate α ∈ (0, ∞), and assume that X_0 = 1. Let τ_0 = 0, and for n ∈ N+, let τ_n denote the time of the nth birth.
For t ∈ [0, ∞), let A_t denote the total age of the particles at time t. Then
A_t = ∑_{n=0}^{X_t − 1} (t − τ_n), t ∈ [0, ∞) (16.23.49)
Again, let A = {A_t : t ∈ [0, ∞)} be the age process for the Yule chain starting with a single particle. Then
A_t = ∫_0^t X_s ds, t ∈ [0, ∞)
Proof
With the last representation, we can easily find the expected total age at time t .
Again, let A = {A_t : t ∈ [0, ∞)} be the age process for the Yule chain starting with a single particle. Then
E(A_t) = (e^{αt} − 1)/α, t ∈ [0, ∞) (16.23.53)
Proof
Finally, we consider the branching chain in which each particle, at the end of its life, either dies or splits into two new particles. The chain has lifetime parameter α ∈ (0, ∞) and offspring probability density function f given by f(0) = 1 − p, f(2) = p, where p ∈ [0, 1]. When p = 0 we have the pure death chain, and when p = 1 we have the Yule process. We have already studied these, so the interesting case is when p ∈ (0, 1), so that both extinction and explosion are possible.
The transition matrix of the jump chain and the generator matrix are given by
1. Q(0, 0) = 1, and Q(x, x − 1) = 1 − p, Q(x, x + 1) = p for x ∈ N+
2. G(x, x) = −αx for x ∈ N, and G(x, x − 1) = α(1 − p)x, G(x, x + 1) = αpx for x ∈ N+
As mentioned earlier, X is also a continuous-time birth-death chain on N, with 0 absorbing. In state x ∈ N+, the birth rate is αpx and the death rate is α(1 − p)x. The moment functions are given next.
For t ∈ [0, ∞):
1. m_t = e^{α(2p−1)t}
2. If p ≠ 1/2,
v_t = [4p(1 − p)/(2p − 1) + (2p − 1)] [e^{2α(2p−1)t} − e^{α(2p−1)t}] (16.23.55)
If p = 1/2, v_t = 4αp(1 − p)t.
Proof
The next result gives the generating function of the offspring distribution and the extinction probability.
1. Ψ(r) = (1 − p) + pr² for r ∈ R
2. q = 1 if 0 < p ≤ 1/2, and q = (1 − p)/p if 1/2 < p < 1.
Figure 16.23.1: Graphs of r ↦ Ψ(r) and r ↦ r when p = 1
Φ_t(r) = [pr − (1 − p) + (1 − p)(1 − r)e^{α(2p−1)t}] / [pr − (1 − p) + p(1 − r)e^{α(2p−1)t}], if p ≠ 1/2
Φ_t(r) = [2r + (1 − r)αt] / [2 + (1 − r)αt], if p = 1/2
Solution
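As a sanity check, both formulas can be verified against the fundamental differential equation dΦ_t/dt = α(Ψ ∘ Φ_t − Φ_t) by finite differences. A Python sketch with arbitrary parameter values:

```python
from math import exp

alpha = 1.3   # arbitrary lifetime parameter

def Phi(t, r, p):
    # generating function of X_t given X_0 = 1, from the formulas above
    if p == 0.5:
        return (2 * r + (1 - r) * alpha * t) / (2 + (1 - r) * alpha * t)
    e = exp(alpha * (2 * p - 1) * t)
    num = p * r - (1 - p) + (1 - p) * (1 - r) * e
    den = p * r - (1 - p) + p * (1 - r) * e
    return num / den

def ode_residual(t, r, p, h=1e-5):
    # |dPhi/dt - alpha * (Psi(Phi) - Phi)| via a central difference in t
    psi = lambda u: (1 - p) + p * u * u   # offspring pgf: f(0) = 1 - p, f(2) = p
    dPhi = (Phi(t + h, r, p) - Phi(t - h, r, p)) / (2 * h)
    return abs(dPhi - alpha * (psi(Phi(t, r, p)) - Phi(t, r, p)))

res_super = ode_residual(0.7, 0.4, 0.8)   # supercritical case, p > 1/2
res_crit = ode_residual(0.7, 0.4, 0.5)    # critical case, p = 1/2
```

Both residuals come out at the level of finite-difference error, and Φ_0(r) = r holds exactly in both cases.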
This page titled 16.23: Continuous-Time Branching Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
CHAPTER OVERVIEW
17: Martingales
Martingales, and their cousins sub-martingales and super-martingales, are real-valued stochastic processes that are abstract generalizations of fair, favorable, and unfair gambling processes. The importance of martingales extends far beyond gambling, and indeed these random processes are among the most important in probability theory, with an incredible number and diversity of applications.
17.1: Introduction to Martingales
17.2: Properties and Constructions
17.3: Stopping Times
17.4: Inequalities
17.5: Convergence
17.6: Backwards Martingales
This page titled 17: Martingales is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
17.1: Introduction to Martingales
Basic Theory
Basic Assumptions
For our basic ingredients, we start with a stochastic process X = {X_t : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). So to review what all this means, Ω is the sample space, F the σ-algebra of events, P the probability measure on (Ω, F), and X_t is a random variable with values in R for each t ∈ T. Next, we have a filtration F = {F_t : t ∈ T}, and we assume that X is adapted to F. To review again, F is an increasing family of sub σ-algebras of F, so that F_s ⊆ F_t ⊆ F for s, t ∈ T with s ≤ t, and X_t is measurable with respect to F_t for t ∈ T. We think of F_t as the collection of events up to time t ∈ T, thus encoding the information available at time t. Finally, we assume that E(|X_t|) < ∞, so that the mean of X_t exists as a real number, for each t ∈ T.
There are two important special cases of the basic setup. The simplest case, of course, is when F_t = σ{X_s : s ∈ T, s ≤ t} for t ∈ T, so that F is the natural filtration associated with X. Another case that arises frequently is when we have a second stochastic process Y = {Y_t : t ∈ T} on (Ω, F, P) with values in a general measure space (S, S), and F is the natural filtration associated with Y. So in this case, our main assumption is that X_t is measurable with respect to σ{Y_s : s ∈ T, s ≤ t} for t ∈ T.
The theory of martingales is beautiful, elegant, and mostly accessible in discrete time, when T = N. But as with the theory of Markov processes, martingale theory is technically much more complicated in continuous time, when T = [0, ∞). In this case, additional assumptions about the continuity of the sample paths t ↦ X_t and the filtration t ↦ F_t are often necessary in order to have a nice theory. Specifically, we will assume that the process X is right continuous and has left limits, and that the filtration F is right continuous and complete. These are the standard assumptions in continuous time.
Definitions
For the basic definitions that follow, you may need to review conditional expected value with respect to a σ-algebra.
Suppose that the process X and the filtration F satisfy the basic assumptions above. Then X is a martingale with respect to F if E(X_t ∣ F_s) = X_s for all s, t ∈ T with s ≤ t.
In the special case that F is the natural filtration associated with X, we simply say that X is a martingale, without reference to the filtration. In the special case that we have a second stochastic process Y = {Y_t : t ∈ T} and F is the natural filtration associated with Y, we say that X is a martingale with respect to Y.
In the gambling interpretation, suppose that X_t is the fortune of a gambler at time t ∈ T, and that F_t encodes the information available to her at time t. Suppose now that s, t ∈ T with s < t and that we think of s as the current time, so that t is a future time. If X is a martingale with respect to F then the games are fair in the sense that the gambler's expected fortune at the future time t is the same as her current fortune at time s. To venture a bit from the casino, suppose that X_t is the price of a stock, or the value of a stock index, at time t ∈ T. If X is a martingale, then the expected value at a future time, given all of our information, is the present value.
Figure 17.1.1: An English-style breastplate with a running martingale attachment. By Danielle M., CC BY 3.0, from Wikipedia
But as we will see, martingales are useful in probability far beyond the application to gambling and even far beyond financial
applications generally. Indeed, martingales are of fundamental importance in modern probability theory. Here are two related
definitions, with equality in the martingale condition replaced by inequalities.
Suppose again that the process X and the filtration F satisfy the basic assumptions above.
1. X is a sub-martingale with respect to F if E(X_t ∣ F_s) ≥ X_s for all s, t ∈ T with s ≤ t.
2. X is a super-martingale with respect to F if E(X_t ∣ F_s) ≤ X_s for all s, t ∈ T with s ≤ t.
In the gambling setting, a sub-martingale models games that are favorable to the gambler on average, while a super-martingale models games that are unfavorable to the gambler on average. To venture again from the casino, suppose that X_t is the price of a stock, or the value of a stock index, at time t ∈ T. If X is a sub-martingale, the expected value at a future time, given all of our information, is greater than the present value, and if X is a super-martingale then the expected value at the future time is less than the present value. One hopes that a stock index is a sub-martingale.
Clearly X is a martingale with respect to F if and only if it is both a sub-martingale and a super-martingale. Finally, recall that the
conditional expected value of a random variable with respect to a σ-algebra is itself a random variable, and so the equations and
inequalities in the definitions should be interpreted as holding with probability 1. In this section generally, statements involving
random variables are assumed to hold with probability 1.
The conditions that define martingale, sub-martingale, and super-martingale make sense if the index set T is any totally ordered set. In some applications that we will consider later, T = {0, 1, …, n} for fixed n ∈ N+. In the section on backwards martingales, T = {−n : n ∈ N} or T = (−∞, 0]. In the case of discrete time when T = N, we can simplify the definitions slightly.
Proof
The relations that define martingales, sub-martingales, and super-martingales hold for the ordinary (unconditional) expected values.
Proof
So if X is a martingale then X has constant expected value, and this value is referred to as the mean of X.
Examples
The goal for the remainder of this section is to give some classical examples of martingales, and by doing so, to show the wide
variety of applications in which martingales occur. We will return to many of these examples in subsequent sections. Without
further ado, we assume that all random variables are real-valued, unless otherwise specified, and that all expected values mentioned
below exist in R. Be sure to try the proofs yourself before expanding the ones in the text.
Constant Sequence
Our first example is rather trivial, but still worth noting.
Suppose that F = {F_t : t ∈ T} is a filtration on the probability space (Ω, F, P) and that X is a random variable that is measurable with respect to F_0 and satisfies E(|X|) < ∞. Let X_t = X for t ∈ T. Then X = {X_t : t ∈ T} is a martingale with respect to F.
Proof
Partial Sums
For our next discussion, we start with one of the most basic martingales in discrete time, and the one with the simplest interpretation in terms of gambling. Suppose that V = {V_n : n ∈ N} is a sequence of independent random variables with E(|V_n|) < ∞ for n ∈ N. Let X = {X_n : n ∈ N} be the partial sum process associated with V, so that
X_n = ∑_{k=0}^{n} V_k, n ∈ N (17.1.3)
In terms of gambling, if X_0 = V_0 is the gambler's initial fortune and V_i is the gambler's net winnings on the ith game, then X_n is the gambler's net fortune after n games for n ∈ N+. But partial sum processes associated with independent sequences are important far beyond gambling. In fact, much of classical probability deals with partial sums of independent and identically distributed variables. The entire chapter on Random Samples explores this setting.
Note that E(X_n) = ∑_{k=0}^{n} E(V_k). Hence condition (a) is equivalent to n ↦ E(X_n) increasing, condition (b) is equivalent to n ↦ E(X_n) decreasing, and condition (c) is equivalent to n ↦ E(X_n) constant. Here is another martingale associated with the partial sum process.
So under the assumptions in this theorem, both X and Y are martingales. We will generalize the results for partial sum processes below in the discussion on processes with independent increments.
Let V_0 = X_0 and V_n = X_n − X_{n−1} for n ∈ N+. The process V = {V_n : n ∈ N} is the martingale difference sequence associated with X, and
X_n = ∑_{k=0}^{n} V_k, n ∈ N (17.1.8)
As promised, the martingale difference variables have mean 0, and in fact satisfy a stronger property.
Proof
Also as promised, if the martingale variables have finite variance, then the martingale difference variables are uncorrelated.
Suppose again that V = {V_n : n ∈ N} is the martingale difference sequence associated with the martingale X. Assume that var(X_n) < ∞ for n ∈ N. Then
var(X_n) = ∑_{k=0}^{n} var(V_k) = var(X_0) + ∑_{k=1}^{n} E(V_k²), n ∈ N (17.1.11)
Proof
We now know that a discrete-time martingale is the partial sum process associated with a sequence of uncorrelated variables.
Hence we might hope that there are martingale versions of the fundamental theorems that hold for a partial sum process associated
with an independent sequence. This turns out to be true, and is a basic reason for the importance of martingales.
Suppose now that V = {V_n : n ∈ N} is a sequence of independent random variables with {V_n : n ∈ N+} identically distributed. We assume that E(|V_n|) < ∞ for n ∈ N and we let a denote the common mean of {V_n : n ∈ N+}. Let X = {X_n : n ∈ N} be the partial sum process associated with V, so that
X_n = ∑_{i=0}^{n} V_i, n ∈ N (17.1.13)
This setting is a special case of the more general partial sum process considered above. The process X is sometimes called a (discrete-time) random walk. The initial position X_0 = V_0 of the walker can have an arbitrary distribution, but then the steps that the walker takes are independent and identically distributed. In terms of gambling, X_0 = V_0 is the initial fortune of the gambler playing a sequence of independent and identical games. If V_i is the amount won (or lost) on game i ∈ N+, then X_n is the gambler's net fortune after n games.
For the second moment martingale, suppose that V_n has common mean a = 0 and common variance b² < ∞ for n ∈ N+, and that var(V_0) < ∞.
Let Y_n = X_n² − var(V_0) − b²n for n ∈ N. Then Y = {Y_n : n ∈ N} is a martingale with respect to X.
Proof
We will generalize the results for discrete-time random walks below, in the discussion on processes with stationary, independent
increments.
Partial Products
Our next discussion is similar to the one on partial sum processes above, but with products instead of sums. So suppose that V = {V_n : n ∈ N} is an independent sequence of nonnegative random variables with E(V_n) < ∞ for n ∈ N. Let
X_n = ∏_{i=0}^{n} V_i, n ∈ N (17.1.15)
As with random walks, a special case of interest is when {V_n : n ∈ N+} is an identically distributed sequence.
The Simple Random Walk
Suppose now that V = {V_n : n ∈ N} is a sequence of independent random variables with P(V_i = 1) = p and P(V_i = −1) = 1 − p for i ∈ N+, where p ∈ (0, 1). Let X = {X_n : n ∈ N} be the partial sum process associated with V, so that
X_n = ∑_{i=0}^{n} V_i, n ∈ N (17.1.18)
Then X is the simple random walk with parameter p, and of course, is a special case of the more general random walk studied above. In terms of gambling, our gambler plays a sequence of independent and identical games, and on each game, wins €1 with probability p and loses €1 with probability 1 − p. So if V_0 is the gambler's initial fortune, then X_n is her net fortune after n games.
1. If p > 1/2 then X is a sub-martingale.
2. If p < 1/2 then X is a super-martingale.
3. If p = 1/2 then X is a martingale.
Proof
So case (a) corresponds to favorable games, case (b) to unfavorable games, and case (c) to fair games.
Open the simulation of the simple symmetric random walk. For various values of the number of trials n, run the simulation 1000 times and note the general behavior of the sample paths.
Here is the second moment martingale for the simple, symmetric random walk.
Suppose that p = 1/2, and let Y_n = X_n² − var(V_0) − n for n ∈ N. Then Y = {Y_n : n ∈ N} is a martingale with respect to X.
Proof
But there is another martingale that can be associated with the simple random walk, known as De Moivre's martingale and named
for one of the early pioneers of probability theory, Abraham De Moivre.
For n ∈ N define
Z_n = [(1 − p)/p]^{X_n} (17.1.19)
Then Z = {Z_n : n ∈ N} is a martingale with respect to X.
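The martingale property reduces to a one-step computation: given X_n = x, the next state is x + 1 with probability p and x − 1 with probability 1 − p, and the conditional expected value of Z_{n+1} collapses back to Z_n. A Python sketch with an arbitrary value of p:

```python
p = 0.3               # arbitrary success probability in (0, 1)
rho = (1 - p) / p

def cond_exp_next(x):
    # E(Z_{n+1} | X_n = x): step +1 with probability p, step -1 with probability 1 - p
    return p * rho ** (x + 1) + (1 - p) * rho ** (x - 1)

# martingale property: E(Z_{n+1} | X_n = x) = rho^x = Z_n for every state x
errors = [abs(cond_exp_next(x) - rho ** x) for x in range(-10, 11)]
```

The bracketed factor pρ + (1 − p)/ρ equals (1 − p) + p = 1, which is exactly why this choice of base works.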
Next, recall the beta-Bernoulli process: P is a random variable having the beta distribution with parameters a, b ∈ (0, ∞), and given P = p, X = {X_i : i ∈ N+} is a sequence of Bernoulli trials with P(X_i = 1) = p for i ∈ N+. As usual, we couch this in reliability terms, so that X_i = 1 means success on trial i and X_i = 0 means failure. In our study of this process, we showed that the finite-dimensional distributions are given by
P(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = a^{[k]} b^{[n−k]} / (a + b)^{[n]}, n ∈ N+, (x_1, x_2, …, x_n) ∈ {0, 1}^n (17.1.22)
where k = x_1 + x_2 + ⋯ + x_n.
denote the partial sum process associated with X, so that once again,
Yn = ∑_{i=1}^{n} Xi, n ∈ N (17.1.23)
Of course Yn is the number of successes in the first n trials and has the beta-binomial distribution defined by
P(Yn = k) = (n choose k) a^[k] b^[n−k] / (a + b)^[n], k ∈ {0, 1, …, n} (17.1.24)
Now let
Zn = (a + Yn) / (a + b + n), n ∈ N (17.1.25)
This variable also arises naturally. Let Fn = σ{X1, X2, …, Xn} for n ∈ N. Then as shown in the section on the beta-Bernoulli process, Zn = E(X_{n+1} ∣ Fn) = E(P ∣ Fn). In statistical terms, the second equation means that Zn is the Bayesian estimator of the unknown success probability p in a sequence of Bernoulli trials, when p is modeled by the random variable P.
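As an illustrative sketch of this estimator (the parameter values, seed, and sample size below are arbitrary choices, not from the text), we can simulate a beta-Bernoulli sequence and compare Zn with the realized success probability:

```python
import random

# Sketch of the Bayesian estimator Zn = (a + Yn)/(a + b + n) for the
# beta-Bernoulli process: draw P from the beta(a, b) prior, generate
# Bernoulli trials given P = p, then compare Zn with p. The parameter
# values, seed, and sample size are arbitrary illustrative choices.
random.seed(3)
a, b, n = 2.0, 3.0, 1000
p = random.betavariate(a, b)
y = sum(1 for _ in range(n) if random.random() < p)  # Yn, number of successes
z = (a + y) / (a + b + n)                            # Bayesian estimate of p
print(f"true p = {p:.3f}, estimate Z_n = {z:.3f}")
```

With n = 1000 trials the estimate should be close to the realized value of P, since the prior contributes only a + b = 5 "pseudo-observations".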
Open the beta-Binomial experiment. Run the simulation 1000 times for various values of the parameters, and compare the
empirical probability density function with the true probability density function.
red and 0 means green. The process X = {Xn : n ∈ N+} is a classical example of a sequence of exchangeable yet dependent variables. Let Y = {Yn : n ∈ N} denote the partial sum process associated with X, so that once again,
Yn = ∑_{i=1}^{n} Xi, n ∈ N (17.1.28)
Of course Yn is the total number of red balls selected in the first n draws. Hence at time n ∈ N, the total number of red balls in the urn is a + cYn, while the total number of balls in the urn is a + b + cn, and so the proportion of red balls in the urn is
Zn = (a + cYn) / (a + b + cn) (17.1.29)
Open the simulation of Pólya's Urn Experiment. Run the simulation 1000 times for various values of the parameters, and compare the empirical probability density function of the number of red balls selected to the true probability density function.
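A short simulation sketch of the martingale property of Z (the parameter values and seed are arbitrary illustrative choices): since Z is a martingale, E(Zn) = Z0 = a/(a + b) for every n.

```python
import random

# Simulation sketch for Pólya's urn: the proportion Zn of red balls in
# the urn is a martingale, so E(Zn) = Z0 = a/(a + b) for every n.
# Parameter values and seed are arbitrary illustrative choices.
random.seed(7)
a, b, c = 1, 2, 1   # initial red, initial green, balls added per draw
n, runs = 50, 20000

total = 0.0
for _ in range(runs):
    y = 0  # number of red balls drawn so far
    for k in range(n):
        red = a + c * y
        if random.random() < red / (a + b + c * k):
            y += 1
    total += (a + c * y) / (a + b + c * n)  # Zn for this run

print(round(total / runs, 3))  # should be close to a/(a + b) = 1/3
```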
is independent of (X0, X1, …, Xm). The random walk process has the additional property of stationary increments. That is, the
discrete or continuous time with these properties. Thus, suppose that X = {Xt : t ∈ T} satisfies the basic assumptions above
Processes with stationary and independent increments were studied in the Chapter on Markov processes. In continuous time (with
the continuity assumptions we have imposed), such a process is known as a Lévy process, named for Paul Lévy, and also as a
continuous-time random walk. For a process with independent increments (not necessarily stationary), the connection with martingales depends on the mean function m given by m(t) = E(Xt) for t ∈ T.
Compare this theorem with the corresponding theorem for the partial sum process above. Suppose now that X = {Xt : t ∈ [0, ∞)} is a stochastic process as above, with mean function m, and let Yt = Xt − m(t) for t ∈ [0, ∞). The process Y = {Yt : t ∈ [0, ∞)} is sometimes called the compensated process associated with X and has mean function 0. If X has independent increments, then clearly so does Y. Hence the following result is a trivial corollary to our previous theorem.
Next we give the second moment martingale for a process with independent increments, generalizing the second moment
martingale for a partial sum process.
Suppose that X = {Xt : t ∈ T} has independent increments with constant mean function and with var(Xt) < ∞ for t ∈ T. Then Y = {Yt : t ∈ T} is a martingale where
Yt = Xt² − var(Xt), t ∈ T (17.1.35)
Proof
Of course, since the mean function is constant, X is also a martingale. For processes with independent and stationary increments
(that is, random walks), the last two theorems simplify, because the mean and variance functions simplify.
Suppose that X = {Xt : t ∈ T} has stationary, independent increments, and let a = E(X1 − X0). Then
1. X is a martingale if a = 0
2. X is a sub-martingale if a ≥ 0
3. X is a super-martingale if a ≤ 0
Proof
Compare this result with the corresponding one above for discrete-time random walks. Our next result is the second moment
martingale. Compare this with the second moment martingale for discrete-time random walks.
Suppose again that X = {Xt : t ∈ T} has stationary, independent increments, and let b² = var(X1 − X0). Then Y = {Yt : t ∈ T} is a martingale where
Yt = Xt² − var(X0) − b²t, t ∈ T (17.1.40)
Proof
In discrete time, as we have mentioned several times, all of these results reduce to the earlier results for partial sum processes and random walks. In continuous time, the Poisson processes, named of course for Simeon Poisson, provide examples. The standard (homogeneous) Poisson counting process N = {Nt : t ∈ [0, ∞)} with constant rate r ∈ (0, ∞) has stationary, independent increments and mean function given by m(t) = rt for t ∈ [0, ∞). More generally, suppose that r : [0, ∞) → (0, ∞) is piecewise continuous (and non-constant). The non-homogeneous Poisson counting process N = {Nt : t ∈ [0, ∞)} with rate function r has independent increments and mean function given by m(t) = ∫_0^t r(s) ds for t ∈ [0, ∞). The increment Nt − Ns has the Poisson distribution with parameter m(t) − m(s) for s, t ∈ [0, ∞) with s < t, so the process does not have stationary increments. In all cases, m is increasing, so the following results are corollaries of our general results:
Let N = {Nt : t ∈ [0, ∞)} be the Poisson counting process with rate function r : [0, ∞) → (0, ∞). Then
1. N is a sub-martingale.
2. The compensated process X = {Nt − m(t) : t ∈ [0, ∞)} is a martingale.
Open the simulation of the Poisson counting experiment. For various values of r and t , run the experiment 1000 times and
compare the empirical probability density function of the number of arrivals with the true probability density function.
We will see further examples of processes with stationary, independent increments in continuous time (and so also examples of
continuous-time martingales) in our study of Brownian motion.
distributed random variables, taking values in S. In statistical terms, X corresponds to sampling from the common distribution, which is usually not completely known. Indeed, the central problem in statistics is to draw inferences about the distribution from observations of X. Suppose now that the underlying distribution either has probability density function g0 or probability density function g1, with respect to μ. We assume that g0 and g1 are positive on S. Of course the common special cases of this setup are the discrete case, where μ is counting measure, and the continuous case, where μ is Lebesgue measure. We want to test the null hypothesis H0 that the sampling density is g0, against the alternative hypothesis H1 that the sampling density is g1. For n ∈ N+, the statistic
Ln = ∏_{i=1}^{n} g0(Xi) / g1(Xi)
is known as the likelihood ratio test statistic. Small values of the test statistic are evidence in favor of the alternative hypothesis H1.
Under the alternative hypothesis H1 , the process L = { Ln : n ∈ N} is a martingale with respect to X , known as the
likelihood ratio martingale.
Proof
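The martingale property can be seen in simulation: when the data come from g1, the mean of Ln stays at 1 for every n. The following sketch uses two arbitrary illustrative densities on S = {0, 1, 2}; they are not from the text.

```python
import random

# Simulation sketch of the likelihood ratio martingale: with data drawn
# from g1 (the alternative hypothesis), E(Ln) = 1 for every n. The two
# densities on S = {0, 1, 2} are arbitrary illustrative choices.
random.seed(11)
g0 = {0: 0.5, 1: 0.3, 2: 0.2}
g1 = {0: 0.2, 1: 0.3, 2: 0.5}

def sample_g1():
    u = random.random()
    return 0 if u < 0.2 else (1 if u < 0.5 else 2)

n, runs = 10, 100000
total = 0.0
for _ in range(runs):
    L = 1.0
    for _ in range(n):
        x = sample_g1()
        L *= g0[x] / g1[x]   # one factor of the likelihood ratio
    total += L
print(round(total / runs, 2))  # should be close to 1
```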
Branching Processes
In the simplest model of a branching process, we have a system of particles each of which can die out or split into new particles of
the same type. The fundamental assumption is that the particles act independently, each with the same offspring distribution on N.
We will let f denote the (discrete) probability density function of the number of offspring of a particle, m the mean of the distribution, and ϕ the probability generating function of the distribution. Thus, if U is the number of children of a particle, then f(n) = P(U = n) for n ∈ N, m = E(U), and ϕ(t) = E(t^U), defined at least for t ∈ (−1, 1].
Our interest is in generational time rather than absolute time: the original particles are in generation 0, and recursively, the children of a particle in generation n belong to generation n + 1. Thus, the stochastic process of interest is X = {Xn : n ∈ N} where Xn is the number of particles in the nth generation for n ∈ N. The process X is a Markov chain and was studied in the section on discrete-time branching chains. In particular, one of the fundamental problems is to compute the probability q of extinction starting
discrete-time branching chains. In particular, one of the fundamental problems is to compute the probability q of extinction starting
with a single particle:
q = P(Xn = 0 for some n ∈ N ∣ X0 = 1) (17.1.45)
Then, since the particles act independently, the probability of extinction starting with x ∈ N particles is simply q^x. We will assume
that f (0) > 0 and f (0) + f (1) < 1 . This is the interesting case, since it means that a particle has a positive probability of dying
without children and a positive probability of producing more than 1 child. The fundamental result, you may recall, is that q is the
smallest fixed point of ϕ (so that ϕ(q) = q ) in the interval [0, 1]. Here are two martingales associated with the branching process:
Each of the following is a martingale with respect to X.
1. Y = {Yn : n ∈ N} where Yn = Xn / m^n for n ∈ N.
2. Z = {Zn : n ∈ N} where Zn = q^{Xn} for n ∈ N.
Proof
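A simulation sketch of martingale (a): since E(Yn) = E(Y0) = 1, the mean of Xn / m^n should stay near 1. The offspring distribution below (f(0) = 0.25, f(1) = 0.25, f(2) = 0.5, with mean m = 1.25) is an arbitrary illustrative choice, not from the text.

```python
import random

# Simulation sketch for the branching process martingale Yn = Xn / m^n.
# The offspring distribution f(0) = 0.25, f(1) = 0.25, f(2) = 0.5 is an
# arbitrary illustrative choice; it has mean m = 1.25 and satisfies
# f(0) > 0 and f(0) + f(1) < 1, as the text assumes.
random.seed(5)

def offspring():
    u = random.random()
    return 0 if u < 0.25 else (1 if u < 0.5 else 2)

m, n, runs = 1.25, 8, 20000
total = 0.0
for _ in range(runs):
    x = 1  # X0 = 1: start with a single particle
    for _ in range(n):
        x = sum(offspring() for _ in range(x))
    total += x / m ** n  # Yn for this run
print(round(total / runs, 2))  # should be close to E(Y0) = 1
```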
Doob's Martingale
Our next example is one of the simplest, but most important. Indeed, as we will see later in the section on convergence, this type of
martingale is almost universal in the sense that every uniformly integrable martingale is of this type. The process is constructed by
conditioning a fixed random variable on the σ-algebras in a given filtration, and thus accumulating information about the random
variable.
Suppose that F = {Ft : t ∈ T} is a filtration on the probability space (Ω, F, P), and that X is a real-valued random variable with E(|X|) < ∞. Define Xt = E(X ∣ Ft) for t ∈ T. Then X = {Xt : t ∈ T} is a martingale with respect to F.
Proof
The martingale in the last theorem is known as Doob's martingale and is named for Joseph Doob who did much of the pioneering
work on martingales. It's also known as the Lévy martingale, named for Paul Lévy.
Doob's martingale arises naturally in the statistical context of Bayesian estimation. Suppose that X = (X1, X2, …) is a sequence of independent random variables whose common distribution depends on an unknown real-valued parameter θ, with values in a parameter space A ⊆ R. For each n ∈ N+, let Fn = σ{X1, X2, …, Xn}, so that F = {Fn : n ∈ N+} is the natural filtration associated with X. In Bayesian estimation, we model the unknown parameter θ with a random variable Θ taking values in A and having a specified prior distribution. The Bayesian estimator of θ based on the sample (X1, X2, …, Xn) is
Un = E(Θ ∣ Fn), n ∈ N+ (17.1.50)
So it follows that the sequence of Bayesian estimators U = (Un : n ∈ N+) is a Doob martingale. The estimation referred to in the discussion of the beta-Bernoulli process above is a special case.
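The Doob construction can also be illustrated with a toy example (the three-dice setup is an illustrative choice, not from the text): conditioning the fixed random variable X = V1 + V2 + V3 on the natural filtration produces an explicit martingale that starts at E(X) and ends at the realized value of X.

```python
import random

# Toy sketch of a Doob martingale: condition the fixed random variable
# X = V1 + V2 + V3 (the sum of three dice) on the natural filtration.
# Here E(X | Fn) = V1 + ... + Vn + 3.5 * (3 - n), since each unseen die
# has mean 3.5. The dice example is an arbitrary illustrative choice.
random.seed(2)
v = [random.randint(1, 6) for _ in range(3)]

doob = [sum(v[:n]) + 3.5 * (3 - n) for n in range(4)]
print(doob)  # starts at E(X) = 10.5 and ends at the realized value X
```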
Density Functions
For this example, you may need to review general measures and density functions in the chapter on Distributions. We start with our
probability space (Ω, F, P) and filtration F = {Fn : n ∈ N} in discrete time. Suppose now that μ is a finite measure on the sample space (Ω, F). For each n ∈ N, the restriction of μ to Fn is a measure on the measurable space (Ω, Fn), and similarly the restriction of P to Fn is a probability measure on (Ω, Fn). To save notation and terminology, we will refer to these as μ and P on Fn. We assume that μ is absolutely continuous with respect to P on Fn for each n ∈ N; that is, if A ∈ Fn and P(A) = 0 then μ(B) = 0 for every B ∈ Fn with B ⊆ A. By the Radon-Nikodym theorem, μ has a density function Xn : Ω → R with respect to P on Fn for each n ∈ N. The density function of a measure with respect to a positive measure is known as a Radon-Nikodym derivative. The theorem and the derivative are named for Johann Radon and Otto Nikodym. Here is our main result.
Note that μ may not be absolutely continuous with respect to P on F, or even on F∞ = σ(⋃_{n=0}^{∞} Fn). On the other hand, if μ is absolutely continuous with respect to P on F∞ then μ has a density function X∞ with respect to P on F∞. So a natural question in this case is the relationship between the martingale X and the random variable X∞. You may have already guessed the answer, but at any rate it will be given in the section on convergence.
This page titled 17.1: Introduction to Martingales is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
17.2: Properties and Constructions
Basic Theory
Preliminaries
As in the Introduction, we start with a stochastic process X = {Xt : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). Next, we have a filtration F = {Ft : t ∈ T}, and we assume that X is adapted to F. So F is an increasing family of sub σ-algebras of F and Xt is measurable with respect to Ft for t ∈ T. We think of Ft as the collection of events up to time t ∈ T. We assume that E(|Xt|) < ∞, so that the mean of Xt exists as a real number, for each t ∈ T. Finally, in continuous time where T = [0, ∞), we make the standard assumptions that X is right continuous and has left limits, and that the filtration F is right continuous and complete. Please recall the following from the Introduction:
Definitions
1. X is a martingale with respect to F if E(Xt ∣ Fs) = Xs for all s, t ∈ T with s ≤ t.
2. X is a sub-martingale with respect to F if E(Xt ∣ Fs) ≥ Xs for all s, t ∈ T with s ≤ t.
3. X is a super-martingale with respect to F if E(Xt ∣ Fs) ≤ Xs for all s, t ∈ T with s ≤ t.
Our goal in this section is to give a number of basic properties of martingales and to give ways of constructing martingales from
other types of processes. The deeper, fundamental theorems will be studied in the following sections.
Basic Properties
Our first result is that the martingale property is preserved under a coarser filtration.
Suppose that the process X and the filtration F satisfy the basic assumptions above, and that G is a filtration coarser than F, so that Gt ⊆ Ft for t ∈ T. If X is a martingale (sub-martingale, super-martingale) with respect to F and X is adapted to G, then X is a martingale (sub-martingale, super-martingale) with respect to G.
Proof
In particular, if X is a martingale (sub-martingale, super-martingale) with respect to some filtration, then it is a martingale (sub-
martingale, super-martingale) with respect to its own natural filtration.
The relations that define martingales, sub-martingales, and super-martingales hold for the ordinary (unconditional) expected values.
We had this result in the last section, but it's worth repeating.
Proof
So if X is a martingale then X has constant expected value, and this value is referred to as the mean of X . The martingale
properties are preserved under sums of the stochastic processes.
The sub-martingale and super-martingale properties are preserved under multiplication by a positive constant and are reversed
under multiplication by a negative constant.
17.2.1 https://stats.libretexts.org/@go/page/10300
1. If X is a martingale with respect to F then cX is also a martingale with respect to F
2. If X is a sub-martingale with respect to F, then cX is a sub-martingale if c > 0 , a super-martingale if c < 0 , and a
martingale if c = 0 .
3. If X is a super-martingale with respect to F, then cX is a super-martingale if c > 0 , a sub-martingale if c < 0 , and a
martingale if c = 0 .
Proof
Property (a), together with the previous additive property, means that the collection of martingales with respect to a fixed filtration
F forms a vector space. Here is a class of transformations that turns martingales into sub-martingales.
Suppose that X takes values in an interval S ⊆ R and that g : S → R is convex with E[|g(Xt)|] < ∞ for t ∈ T. If either of the following conditions holds, then g(X) = {g(Xt) : t ∈ T} is a sub-martingale with respect to F:
1. X is a martingale.
2. X is a sub-martingale and g is also increasing.
Proof
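A simulation sketch of part (a), using the symmetric simple random walk (a martingale) and the convex function g(x) = |x|; the horizon and run count below are arbitrary illustrative choices. Since |X| is then a sub-martingale, n ↦ E(|Xn|) should be non-decreasing.

```python
import random

# Simulation sketch of the convexity result: the symmetric simple random
# walk X is a martingale and g(x) = |x| is convex, so |X| should be a
# sub-martingale, i.e. n -> E(|Xn|) is non-decreasing. The horizon and
# run count are arbitrary illustrative choices.
random.seed(17)
runs, n = 40000, 10
means = [0.0] * (n + 1)
for _ in range(runs):
    x = 0
    for k in range(1, n + 1):
        x += 1 if random.random() < 0.5 else -1
        means[k] += abs(x)
means = [m / runs for m in means]
# consecutive means can be exactly equal, so allow simulation noise
assert all(means[k] <= means[k + 1] + 0.02 for k in range(n))
print([round(m, 2) for m in means])
```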
Suppose again that X is a martingale with respect to F. Let k ∈ [1, ∞) and suppose that E(|Xt|^k) < ∞ for t ∈ T. Then the process |X|^k = {|Xt|^k : t ∈ T} is a sub-martingale with respect to F.
Proof
Figure 17.2.2: The graphs of x ↦ |x|, x ↦ |x|², and x ↦ |x|³ on the interval [−2, 2]
In particular, if X is a martingale relative to F then |X| = {|Xt| : t ∈ T} is a sub-martingale relative to F. Here is a related result that we will need later. First recall that the positive and negative parts of x ∈ R are x⁺ = max{x, 0} and x⁻ = max{−x, 0}, so that x = x⁺ − x⁻ and |x| = x⁺ + x⁻.
Figure 17.2.3: The graph of x ↦ x⁺ on the interval [−5, 5]
If X = {Xt : t ∈ T} is a sub-martingale relative to F = {Ft : t ∈ T} then X⁺ = {Xt⁺ : t ∈ T} is also a sub-martingale relative to F.
Proof
Our last result of this discussion is that if we sample a continuous-time martingale at an increasing sequence of time points, we get
a discrete-time martingale.
Suppose again that the process X = {Xt : t ∈ [0, ∞)} and the filtration F = {Ft : t ∈ [0, ∞)} satisfy the basic assumptions above. Suppose also that {tn : n ∈ N} ⊂ [0, ∞) is a strictly increasing sequence of time points with t0 = 0, and define Yn = X_{tn} and Gn = F_{tn} for n ∈ N. If X is a martingale (sub-martingale, super-martingale) with respect to F then Y = {Yn : n ∈ N} is a martingale (sub-martingale, super-martingale) with respect to G = {Gn : n ∈ N}.
Proof
This result is often useful for extending proofs of theorems in discrete time to continuous time.
Think of Yn as the bet that a gambler makes on game n ∈ N+. The gambler can base the bet on all of the information she has at that point, including the outcomes of the previous n − 1 games. That is, she can base the bet on the information encoded in F_{n−1}. The transform of X by Y is the process Y ⋅ X = {(Y ⋅ X)n : n ∈ N} defined by
(Y ⋅ X)n = X0 + ∑_{k=1}^{n} Yk (Xk − X_{k−1}), n ∈ N
The motivating example behind the transform, in terms of a gambler making a sequence of bets, is given in an example below. Note that Y ⋅ X is also adapted to F. Note also that the transform depends on X only through X0 and {Xn − X_{n−1} : n ∈ N+}. This construction is known as a martingale transform, and is a discrete version of the stochastic integral that we will study in the chapter on Brownian motion. The result also holds if instead of Y being bounded, we have X bounded and E(|Yn|) < ∞ for n ∈ N+.
The Doob Decomposition
The next result, in discrete time, shows how to decompose a basic stochastic process into a martingale and a predictable process.
The result is known as the Doob decomposition theorem and is named for Joseph Doob who developed much of the modern theory
of martingales.
Suppose that X = {Xn : n ∈ N} satisfies the basic assumptions above relative to the filtration F = {Fn : n ∈ N}. Then X has a unique decomposition X = Y + Z where Y = {Yn : n ∈ N} is a martingale relative to F and Z = {Zn : n ∈ N} is a predictable process relative to F with Z0 = 0. Moreover, X is a sub-martingale if and only if Z is increasing, and X is a super-martingale if and only if Z is decreasing.
A decomposition of this form is more complicated in continuous time, in part because the definition of a predictable process is
more subtle and complex. The decomposition theorem holds in continuous time, with our basic assumptions and the additional
assumption that the collection of random variables {Xτ : τ is a finite-valued stopping time} is uniformly integrable. The result is known as the Doob-Meyer decomposition theorem, named additionally for Paul Meyer.
Markov Processes
As you might guess, there are important connections between Markov processes and martingales. Suppose that X = {Xt : t ∈ T} is a (homogeneous) Markov process with state space (S, S), relative to the filtration F = {Ft : t ∈ T}. Let P = {Pt : t ∈ T} denote the transition semigroup of X. Recall that (like all probability kernels), Pt operates (on the right) on (measurable) functions h : S → R by the rule
Pt h(x) = E[h(Xt) ∣ X0 = x], x ∈ S
assuming as usual that the expected value exists. Here is the critical definition that we will need.
Suppose that h : S → R and that E[|h(Xt)|] < ∞ for t ∈ T. Then
1. h is harmonic for X if Pt h = h for all t ∈ T.
2. h is sub-harmonic for X if Pt h ≥ h for all t ∈ T.
3. h is super-harmonic for X if Pt h ≤ h for all t ∈ T.
The following theorem gives the fundamental connection between the two types of stochastic processes. Given the similarity in the
terminology, the result may not be a surprise.
Suppose that h : S → R and E[|h(Xt)|] < ∞ for t ∈ T. Define h(X) = {h(Xt) : t ∈ T}.
1. h is harmonic for X if and only if h(X) is a martingale with respect to F.
2. h is sub-harmonic for X if and only if h(X) is a sub-martingale with respect to F.
3. h is super-harmonic for X if and only if h(X) is a super-martingale with respect to F.
Proof
Several of the examples given in the Introduction can be re-interpreted in the context of harmonic functions of Markov chains. We
explore some of these below.
Examples
Let R denote the usual set of Borel measurable subsets of R, and for A ∈ R and x ∈ R let A − x = {y − x : y ∈ A} . Let I
denote the identity function on R, so that I (x) = x for x ∈ R. We will need this notation in a couple of our applications below.
Random Walks
Suppose that V = {Vn : n ∈ N} is a sequence of independent, real-valued random variables, with {Vn : n ∈ N+} identically distributed and having common probability measure Q on (R, R) and mean a ∈ R. Recall from the Introduction that the partial
sum process X = {Xn : n ∈ N} associated with V is given by
Xn = ∑_{i=0}^{n} Vi, n ∈ N (17.2.12)
and that X is a (discrete-time) random walk. But X is also a discrete-time Markov process with one-step transition kernel P given
by P (x, A) = Q(A − x) for x ∈ R and A ∈ R .
It now follows from our theorem above that X is a martingale if a = 0 , a sub-martingale if a > 0 , and a super-martingale if a < 0 .
We showed these results directly from the definitions in the Introduction.
Suppose now that V = {Vi : i ∈ N} is a sequence of independent random variables with P(Vi = 1) = p and P(Vi = −1) = 1 − p for i ∈ N+, where p ∈ (0, 1). Let X = {Xn : n ∈ N} be the partial sum process associated with V so that
Xn = ∑_{i=0}^{n} Vi, n ∈ N (17.2.14)
Then X is the simple random walk with parameter p. In terms of gambling, our gambler plays a sequence of independent and identical games, and on each game, wins €1 with probability p and loses €1 with probability 1 − p. So if X0 is the gambler's initial fortune, then Xn is her net fortune after n games. In the Introduction we showed that X is a martingale if p = 1/2, a super-martingale if p < 1/2, and a sub-martingale if p > 1/2. But suppose now that instead of making constant unit bets, the gambler makes bets that depend on the outcomes of previous games. This leads to a martingale transform as studied above.
Suppose that the gambler bets Yn on game n ∈ N+ (at even stakes), where Yn ∈ [0, ∞) depends on (V0, V1, V2, …, V_{n−1}) and satisfies E(Yn) < ∞. So the process Y = {Yn : n ∈ N+} is predictable with respect to X, and the gambler's fortune after n games is
(Y ⋅ X)n = X0 + ∑_{k=1}^{n} Yk (Xk − X_{k−1}) = X0 + ∑_{k=1}^{n} Yk Vk
1. Y ⋅ X is a sub-martingale if p > 1/2.
2. Y ⋅ X is a super-martingale if p < 1/2.
3. Y ⋅ X is a martingale if p = 1/2.
Proof
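A simulation sketch of the fair case p = 1/2 (the particular betting strategy, horizon, and run count are arbitrary illustrative choices): whatever predictable strategy is used, the mean of the gambler's winnings stays at the initial fortune, here taken to be 0.

```python
import random

# Simulation sketch of the theorem above for p = 1/2: with a predictable
# betting strategy Yn (here "bet 2 after a loss, 1 otherwise", an
# arbitrary illustrative choice), the transform Y.X is a martingale, so
# its expected value stays at the initial fortune X0 = 0.
random.seed(13)
n, runs = 20, 50000
total = 0.0
for _ in range(runs):
    winnings, bet = 0.0, 1.0   # X0 = 0, first bet Y1 = 1
    for _ in range(n):
        v = 1 if random.random() < 0.5 else -1
        winnings += bet * v
        bet = 2.0 if v == -1 else 1.0  # next bet depends only on the past
    total += winnings
print(round(total / runs, 3))  # should be close to 0
```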
The simple random walk X is also a discrete-time Markov chain on Z with one-step transition matrix P given by P(x, x + 1) = p, P(x, x − 1) = 1 − p.
The function h given by h(x) = ((1 − p) / p)^x for x ∈ Z is harmonic for X.
Proof
It now follows from our theorem above that the process Z = {Zn : n ∈ N} given by Zn = ((1 − p) / p)^{Xn} for n ∈ N is a martingale.
We showed this directly from the definition in the Introduction. As you may recall, this is De Moivre's martingale and named for
Abraham De Moivre.
Branching Processes
Recall the discussion of the simple branching process from the Introduction. The fundamental assumption is that the particles act
independently, each with the same offspring distribution on N. As before, we will let f denote the (discrete) probability density
function of the number of offspring of a particle, m the mean of the distribution, and ϕ the probability generating function of the
distribution. We assume that f (0) > 0 and f (0) + f (1) < 1 so that a particle has a positive probability of dying without children
and a positive probability of producing more than 1 child. Recall that q denotes the probability of extinction, starting with a single
particle.
The stochastic process of interest is X = {Xn : n ∈ N} where Xn is the number of particles in the nth generation for n ∈ N. Recall that X is a discrete-time Markov chain on N with one-step transition matrix P given by P(x, y) = f^{∗x}(y) for x, y ∈ N, where f^{∗x} denotes the convolution power of f of order x.
The function h given by h(x) = q^x for x ∈ N is harmonic for X.
Proof
It now follows from our theorem above that the process Z = {Zn : n ∈ N} is a martingale, where Zn = q^{Xn} for n ∈ N. We showed this directly from the definition in the Introduction. We also showed that the process Y = {Yn : n ∈ N} is a martingale, where Yn = Xn / m^n for n ∈ N. But we can't write Yn = h(Xn) for a function h defined on the state space, so we can't interpret Y as a harmonic function of X.
For s, t ∈ T with s ≤ t, consider the increment Xt − Xs. The process X has independent increments if this increment is always independent of Fs, and has stationary increments if this increment always has the same distribution as X_{t−s} − X0. In discrete time, a process with stationary, independent increments is simply a random walk as discussed above. In continuous time, a process with stationary, independent increments (and with the continuity assumptions we have imposed) is called a continuous-time random walk, and also a Lévy process, named for Paul Lévy. So suppose that X has stationary, independent increments. For t ∈ T let Qt denote the probability distribution of Xt − X0 on R.
We also know that E(Xt − X0) = at for t ∈ T, where a = E(X1 − X0) (assuming of course that the last expected value exists in R).
It now follows that X is a martingale if a = 0, a sub-martingale if a ≥ 0, and a super-martingale if a ≤ 0. We showed this directly in the Introduction. Recall that in continuous time, the Poisson counting process has stationary, independent increments, as does standard Brownian motion.
This page titled 17.2: Properties and Constructions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
17.3: Stopping Times
Basic Theory
As in the Introduction, we start with a stochastic process X = {Xt : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). Next, we have a filtration F = {Ft : t ∈ T}, and we assume that X is adapted to F. So F is an increasing family of sub σ-algebras of F and Xt is measurable with respect to Ft for t ∈ T. We think of Ft as the collection of events up to time t ∈ T. We assume that E(|Xt|) < ∞, so that the mean of Xt exists as a real number, for each t ∈ T. Finally, in continuous time where T = [0, ∞), we make the standard assumption that X is right continuous and has left limits, and that the filtration F is right continuous and complete.
Our general goal in this section is to see if some of the important martingale properties are preserved if the deterministic time t ∈ T is replaced by a (random) stopping time. Recall that a random time τ with values in T ∪ {∞} is a stopping time relative to F if {τ ≤ t} ∈ Ft for t ∈ T. So a stopping time is a random time that does not require that we see into the future. That is, we can tell whether τ ≤ t from the information available at time t. Next recall that the σ-algebra associated with the stopping time τ is
Fτ = {A ∈ F : A ∩ {τ ≤ t} ∈ Ft for all t ∈ T}
So Fτ is the collection of events up to the random time τ, just as Ft is the collection of events up to the deterministic time t ∈ T.
In terms of a gambler playing a sequence of games, the time that the gambler decides to stop playing must be a stopping time, and
in fact this interpretation is the origin of the name. That is, the time when the gambler decides to stop playing can only depend on
the information that the gambler has up to that point in time.
Optional Stopping
The basic martingale equation E(Xt ∣ Fs) = Xs for s, t ∈ T with s ≤ t can be generalized by replacing both s and t with bounded stopping times. The result is known as Doob's optional stopping theorem and is named again for Joseph Doob.
Suppose that X = {Xt : t ∈ T} satisfies the basic assumptions above with respect to the filtration F = {Ft : t ∈ T}, and that ρ and τ are bounded stopping times relative to F with ρ ≤ τ.
1. If X is a martingale relative to F then E(Xτ ∣ Fρ) = Xρ.
2. If X is a sub-martingale relative to F then E(Xτ ∣ Fρ) ≥ Xρ.
3. If X is a super-martingale relative to F then E(Xτ ∣ Fρ) ≤ Xρ.
The assumption that the stopping times are bounded is critical. A counterexample when this assumption does not hold is given
below. Here are a couple of simple corollaries:
Suppose again that ρ and τ are bounded stopping times relative to F with ρ ≤ τ .
1. If X is a martingale relative to F then E(Xτ) = E(Xρ).
Proof
17.3.1 https://stats.libretexts.org/@go/page/10301
Suppose that X satisfies the assumptions above and that τ is a stopping time relative to the filtration F. The stopped process X^τ = {X^τ_t : t ∈ [0, ∞)} is defined by
X^τ_t = X_{t∧τ}, t ∈ [0, ∞) (17.3.7)
Details
If Xt is the gambler's fortune at time t, then X^τ_t is the revised fortune at time t when τ is the stopping time of the gambler. Our next result, known as the elementary stopping theorem, says that the martingale property is preserved under stopping.
Suppose again that X satisfies the assumptions above, and that τ is a stopping time relative to F.
1. If X is a martingale relative to F then so is X^τ.
2. If X is a sub-martingale relative to F then so is X^τ.
3. If X is a super-martingale relative to F then so is X^τ.
General proof
Special proof in discrete time
The elementary stopping theorem is bad news for the gambler playing a sequence of games. If the games are fair or unfavorable, then no stopping time, regardless of how cleverly designed, can help the gambler. Since a stopped martingale is still a martingale, the mean property holds.
Suppose again that X satisfies the assumptions above, and that τ is a stopping time relative to F. Let t ∈ T.
1. If X is a martingale relative to F then E(X_{t∧τ}) = E(X0).
other conditions which give these results in discrete time. Suppose that X = {Xn : n ∈ N} satisfies the basic assumptions above
Proof
Proof
Let's return to our original interpretation of a martingale X representing the fortune of a gambler playing fair games. The gambler
could choose to quit at a random time τ , but τ would have to be a stopping time, based on the gambler's information encoded in the
filtration F. Under the conditions of the theorem, no such scheme can help the gambler in terms of expected value.
Suppose that V = (V1, V2, …) is a sequence of independent random variables with P(Vi = 1) = p and P(Vi = −1) = 1 − p for i ∈ N+, where p ∈ (0, 1). Let X = (X0, X1, X2, …) be the partial sum process associated with V so
that
Xn = ∑_{i=1}^{n} Vi, n ∈ N (17.3.14)
Then X is the simple random walk with parameter p. In terms of gambling, our gambler plays a sequence of independent and identical games, and on each game, wins €1 with probability p and loses €1 with probability 1 − p. So Xn is the gambler's total net winnings after n games. We showed in the Introduction that X is a martingale if p = 1/2 (the fair case), a sub-martingale if p > 1/2 (the favorable case), and a super-martingale if p < 1/2 (the unfair case). Now, for c ∈ Z, let
τc = inf{n ∈ N : Xn = c} (17.3.15)
where as usual, inf(∅) = ∞. So τc is the first time that the gambler's fortune reaches c. What if the gambler simply continues playing until her net winnings reach some specified positive number (say €1,000,000)? Is that a workable strategy?
Suppose that p = 1/2 and that c ∈ N+. Then
1. P(τc < ∞) = 1
2. E(X_{τc}) = c ≠ 0 = E(X0)
3. E(τc) = ∞
Proof
Note that part (b) does not contradict the optional stopping theorem, because of part (c). The strategy of waiting until the net winnings reach a specified goal c is unsustainable. Suppose now that the gambler plays until the net winnings either fall to a specified negative number (a loss that she can tolerate) or reach a specified positive number (a goal she hopes to reach).
Suppose again that p = 1/2. For a, b ∈ N+, let τ = τ−a ∧ τb. Then
1. E(τ) < ∞
2. E(Xτ) = 0
Proof
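A simulation sketch of part (b): E(Xτ) = 0 is equivalent to P(reach b before −a) = a/(a + b), which can be checked directly. The values a = 3 and b = 7 are arbitrary illustrative choices.

```python
import random

# Simulation sketch of the two-sided stopping rule for the symmetric
# walk: optional stopping gives E(X_tau) = 0, equivalently
# P(reach b before -a) = a/(a + b). a = 3, b = 7 are arbitrary
# illustrative choices.
random.seed(19)
a, b, runs = 3, 7, 40000
hits_b = 0
for _ in range(runs):
    x = 0
    while -a < x < b:  # play until the fortune falls to -a or reaches b
        x += 1 if random.random() < 0.5 else -1
    if x == b:
        hits_b += 1
print(round(hits_b / runs, 3))  # should be close to a/(a + b) = 0.3
```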
So gambling until the net winnings either fall to −a or reach b is a workable strategy, but alas has expected value 0. Here's another example that shows that the first version of the optional sampling theorem can fail if the stopping times are not bounded.
Suppose again that p = 1/2. Let a, b ∈ N+ with a < b. Then τa < τb < ∞ but
b = E(X_{τb} ∣ F_{τa}) ≠ X_{τa} = a (17.3.17)
Proof
This result does not contradict the optional stopping theorem since the stopping times are not bounded.
Wald's Equation
Wald's equation, named for Abraham Wald, is a formula for the expected value of the sum of a random number of independent, identically distributed random variables. We have considered this before, in our discussion of conditional expected value and our discussion of random samples, but martingale theory leads to a particularly simple and elegant proof.
Suppose that X = (Xn : n ∈ N+) is a sequence of independent, identically distributed variables with common mean μ ∈ R. If N is a stopping time for X with E(N) < ∞ then
E(∑_{k=1}^{N} Xk) = E(N) μ (17.3.18)
Proof
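A simulation sketch of Wald's equation (the uniform step distribution and the stopping rule below are arbitrary illustrative choices, not from the text): with N the first index k at which Xk > 0, both sides of the equation can be estimated from the same runs.

```python
import random

# Simulation sketch of Wald's equation E(sum_{k=1}^N Xk) = E(N) * mu.
# Illustrative choices: Xk uniform on (-1, 2), so mu = 0.5, and N is
# the first index k with Xk > 0 (a stopping time, since the event
# {N = k} depends only on X1, ..., Xk).
random.seed(23)
runs = 50000
total_sum, total_n = 0.0, 0
for _ in range(runs):
    s, n = 0.0, 0
    while True:
        xk = random.uniform(-1, 2)
        s += xk
        n += 1
        if xk > 0:
            break
    total_sum += s
    total_n += n
# both estimates should be near E(N) * mu = 1.5 * 0.5 = 0.75
print(round(total_sum / runs, 3), round(total_n / runs * 0.5, 3))
```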
Patterns in Multinomial Trials
Patterns in multinomial trials were studied in the chapter on Renewal Processes. As is often the case, martingales provide a more
elegant solution. Suppose that L = (L_1, L_2, …) is a sequence of independent, identically distributed random variables taking values in a finite set S, so that L is a sequence of multinomial trials. Let f denote the common probability density function, so that for a generic trial variable L, we have f(a) = P(L = a) for a ∈ S. We assume that all outcomes in S are actually possible, so f(a) > 0 for a ∈ S.
In this discussion, we interpret S as an alphabet, and we write the sequence of variables in concatenation form, L = L_1 L_2 ⋯,
rather than standard sequence form. Thus the sequence is an infinite string of letters from our alphabet S . We are interested in the
first occurrence of a particular finite substring of letters (that is, a “word” or “pattern”) in the infinite sequence. The following
definition will simplify the notation.
For a finite word a = a_1 a_2 ⋯ a_k from the alphabet S, let f(a) = ∏_{i=1}^k f(a_i), the probability that k consecutive trials spell out the word a.
So, fix a word a = a_1 a_2 ⋯ a_k of length k ∈ N_+ from the alphabet S, and consider the number of trials N_a until a is completed. Our goal is to compute ν(a) = E(N_a). We do this by casting the problem in terms of a sequence of gamblers playing fair games and then using the optional stopping theorem above. So suppose that if a gambler bets c ∈ (0, ∞) on a letter a ∈ S on a trial, then the gambler wins c/f(a) if a occurs on that trial and wins 0 otherwise. The expected value of this bet is

f(a) [c/f(a)] − c = 0  (17.3.23)
and so the bet is fair. Consider now a gambler with an initial fortune of 1. When she starts playing, she bets 1 on a_1. If she wins, she bets her entire fortune 1/f(a_1) on the next trial, on a_2. She continues in this way: as long as she wins, she bets her entire fortune on the next trial, on the next letter of the word, until either she loses or completes the word a. Finally, we consider a sequence of independent gamblers playing this strategy, with gambler i starting on trial i for each i ∈ N_+.
For a finite word a from the alphabet S, ν(a) is the total winnings by all of the players at time N_a.
Proof
Given a, we can compute the total winnings precisely. By definition, trials N_a − k + 1, …, N_a form the word a for the first time. Hence for i ≤ N_a − k, gambler i loses at some point. Also by definition, gambler N_a − k + 1 wins all of her bets, completes word a, and so collects 1/f(a). The complicating factor is that gamblers N_a − k + 2, …, N_a may or may not have won all of their bets
at the point when the game ends. The following exercise illustrates this.
Suppose that L is a sequence of Bernoulli trials (so S = {0, 1}) with success probability p ∈ (0, 1). For each of the following
strings, find the expected number of trials needed to complete the string.
1. 001
2. 010
Solution
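A quick Monte Carlo sketch (mine, not from the text) for p = 1/2: the expected waiting times work out to 8 trials for 001 but 10 trials for 010, precisely because 010 has a prefix that is also a suffix.

```python
import random

def waiting_time(pattern, p, rng):
    """Number of Bernoulli(p) trials until `pattern` (a 0/1 string) first appears."""
    s, n = "", 0
    while not s.endswith(pattern):
        s += "1" if rng.random() < p else "0"
        n += 1
        s = s[-len(pattern):]   # only a suffix as long as the pattern matters
    return n

rng = random.Random(3)
runs = 40000
m001 = sum(waiting_time("001", 0.5, rng) for _ in range(runs)) / runs
m010 = sum(waiting_time("010", 0.5, rng) for _ in range(runs)) / runs
print(m001)   # ≈ 8
print(m010)   # ≈ 10
```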
The difference between the two words is that the word in (b) has a prefix (a proper string at the beginning of the word) that is also a suffix (a proper string at the end of the word). The word in (a) has no such prefix. Thus we are led naturally to the following dichotomy:
Suppose that a is a finite word from the alphabet S . If no proper prefix of a is also a suffix, then a is simple. Otherwise, a is
compound.
Here is the main result, which of course is the same as when the problem was solved using renewal theory.

Suppose that a is a finite word from the alphabet S.
a. If a is simple, then ν(a) = 1/f(a).
b. If a is compound, then ν(a) = 1/f(a) + ν(b), where b is the longest word that is both a proper prefix and a suffix of a.
Proof
For a compound word, we can use (b) to reduce the computation to simple words.
Consider Bernoulli trials with success probability p ∈ (0, 1) . Find the expected number of trials until each of the following
strings is completed.
1. 1011011
2. 11 ⋯ 1 (k times)
Solutions
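The dichotomy above translates directly into a short recursive computation. The Python sketch below (not part of the text) implements ν(a) = 1/f(a) for simple words and ν(a) = 1/f(a) + ν(b) for compound words, for patterns over the binary alphabet with success probability p.

```python
def f(word, p):
    """Probability that consecutive trials spell out `word` ('1' = success prob p)."""
    prob = 1.0
    for ch in word:
        prob *= p if ch == "1" else 1 - p
    return prob

def nu(word, p):
    """Expected number of trials until `word` appears, via the simple/compound recursion."""
    # find the longest proper prefix of `word` that is also a suffix
    b = ""
    for k in range(len(word) - 1, 0, -1):
        if word[:k] == word[-k:]:
            b = word[:k]
            break
    if b == "":                          # simple word
        return 1 / f(word, p)
    return 1 / f(word, p) + nu(b, p)     # compound word

print(nu("1011011", 0.5))   # 128 + 16 + 2 = 146
print(nu("111", 0.5))       # 8 + 4 + 2 = 14
```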
Recall that an ace-six flat die is a six-sided die for which faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, and 5 have probability 1/8 each. Ace-six flat dice are sometimes used by gamblers to cheat.
Suppose that an ace-six flat die is thrown repeatedly. Find the expected number of throws until the pattern 6165616 occurs.
Solution
Suppose that a monkey types randomly on a keyboard that has the 26 lower-case letter keys and the space key (so 27 keys).
Find the expected number of keystrokes until the monkey produces each of the following phrases:
1. it was the best of times
2. to be or not to be
Solution
The Secretary Problem

Recall the secretary problem: n ∈ N_+ candidates arrive sequentially in random order and are interviewed. We measure the quality of each candidate by a number in the interval [0, 1]. Our goal is to select the very best candidate, but once a candidate is rejected, she cannot be recalled. Mathematically, our assumptions are that the sequence of candidate variables X = (X_1, X_2, …, X_n) is independent and that each is uniformly distributed on the interval [0, 1] (and so has the standard uniform distribution). Our goal is to select a stopping time τ with respect to X that maximizes E(X_τ), the expected value of the chosen candidate. The following sequence will play a critical role as a sequence of thresholds.
Define a_0 = 0 and recursively define

a_{k+1} = (1 + a_k²)/2

for k ∈ N. Then
a. a_k < 1 for k ∈ N.
b. a_k < a_{k+1} for k ∈ N.
c. a_k → 1 as k → ∞.
d. a_{k+1} = E(a_k ∨ U) for k ∈ N, where U is uniformly distributed on [0, 1], since E(a ∨ U) = a² + (1 − a²)/2 = (1 + a²)/2 for a ∈ [0, 1].

Since a_0 = 0, all of the terms of the sequence are in [0, 1) by (a). Approximations of the first several terms are

(0, 0.5, 0.625, 0.695, 0.742, 0.775, 0.800, 0.820, 0.836, 0.850, 0.861, …)  (17.3.27)
Property (d) gives some indication of why the sequence is important for the secretary problem. At any rate, the next theorem gives the solution. To simplify the notation, let N_n = {0, 1, …, n} and N_n^+ = {1, 2, …, n}.
Proof
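The recursion is trivial to iterate numerically; the following sketch (not from the text) reproduces the approximations above.

```python
# Threshold sequence: a_0 = 0, a_{k+1} = (1 + a_k^2) / 2
a = [0.0]
for _ in range(10):
    a.append((1 + a[-1] ** 2) / 2)
print([round(x, 3) for x in a])
# [0.0, 0.5, 0.625, 0.695, 0.742, 0.775, 0.8, 0.82, 0.836, 0.85, 0.861]
```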
For example, suppose that n = 5. The optimal strategy is as follows:
1. Select candidate 1 if X_1 > 0.742; otherwise,
2. select candidate 2 if X_2 > 0.695; otherwise,
3. select candidate 3 if X_3 > 0.625; otherwise,
4. select candidate 4 if X_4 > 0.5; otherwise,
5. select candidate 5.
The expected value of our chosen candidate is 0.775.
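Here is a Monte Carlo sketch of the threshold strategy for n = 5 (my own illustration, not part of the text): candidate k is accepted when X_k exceeds the threshold a_{n−k}. The long-run average value of the selected candidate should be close to a_5 ≈ 0.775.

```python
import random

def select(n, thresholds, rng):
    """Apply the threshold strategy: accept candidate k if X_k > a_{n-k}."""
    xs = [rng.random() for _ in range(n)]
    for k in range(1, n):
        if xs[k - 1] > thresholds[n - k]:
            return xs[k - 1]
    return xs[-1]   # forced to take the last candidate

# rebuild the threshold sequence a_0 = 0, a_{k+1} = (1 + a_k^2)/2
a = [0.0]
for _ in range(10):
    a.append((1 + a[-1] ** 2) / 2)

rng = random.Random(6)
runs = 200000
mean_value = sum(select(5, a, rng) for _ in range(runs)) / runs
print(mean_value)   # ≈ a_5 ≈ 0.775
```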
In our original version of the secretary problem, we could only observe the relative ranks of the candidates, and our goal was to
maximize the probability of picking the best candidate. With n = 5, the optimal strategy is to let the first two candidates go by and then pick the first candidate after that who is better than all previous candidates, if she exists. If she does not exist, of course, we must select candidate 5. The probability of picking the best candidate is 0.433.
This page titled 17.3: Stopping Times is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
17.4: Inequalities
Basic Theory
In this section, we will study a number of interesting inequalities associated with martingales and their sub-martingale and super-martingale cousins. These turn out to be very important both for theoretical reasons and for applications. You may need to review infimums and supremums.
Basic Assumptions
As in the Introduction, we start with a stochastic process X = {X_t : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). Next, we have a filtration F = {F_t : t ∈ T}, and we assume that X is adapted to F. So F is an increasing family of sub σ-algebras of F, and X_t is measurable with respect to F_t for t ∈ T. We think of F_t as the collection of events up to time t ∈ T. We assume that E(|X_t|) < ∞, so that the mean of X_t exists as a real number, for each t ∈ T. Finally, in continuous time where T = [0, ∞), we make the standard assumptions that t ↦ X_t is right continuous and has left limits, and that the filtration F is right continuous and complete.
Maximal Inequalities
For motivation, let's review a modified version of Markov's inequality, named for Andrei Markov: if X is a nonnegative random variable, then

P(X ≥ x) ≤ (1/x) E(X; X ≥ x),  x ∈ (0, ∞)
Proof
So Markov's inequality gives an upper bound on the probability that X exceeds a given positive value x, in terms of a moment of X. Now let's return to our stochastic process X = {X_t : t ∈ T}. To simplify the notation, let T_t = {s ∈ T : s ≤ t} for t ∈ T.
For t ∈ T, let U_t = sup{X_s : s ∈ T_t}, so that U = {U_t : t ∈ T} is the maximal process associated with X. Clearly the maximal process is increasing, so that if s, t ∈ T with s ≤ t then U_s ≤ U_t. A trivial application of Markov's inequality above would give

P(U_t ≥ x) ≤ (1/x) E(U_t; U_t ≥ x),  x > 0  (17.4.4)
But when X is a sub-martingale, the following theorem gives a much stronger result by replacing the first occurrence of U_t on the right with X_t. The theorem is known as Doob's sub-martingale maximal inequality (or more simply as Doob's inequality), named once again for Joseph Doob, who did much of the pioneering work on martingales. A sub-martingale has an increasing property of sorts, in the sense that if s, t ∈ T with s ≤ t then E(X_t ∣ F_s) ≥ X_s, so it's perhaps not entirely surprising that such a bound is possible.

Suppose that X is a sub-martingale. Then

P(U_t ≥ x) ≤ (1/x) E(X_t; U_t ≥ x),  x ∈ (0, ∞)
There are a number of simple corollaries of the maximal inequality. For the first, recall that the positive part of x ∈ R is x^+ = x ∨ 0, so that x^+ = x if x > 0 and x^+ = 0 if x ≤ 0.
17.4.1 https://stats.libretexts.org/@go/page/10302
Suppose that X is a sub-martingale. For t ∈ T, let V_t = sup{X_s^+ : s ∈ T_t}. Then

P(V_t ≥ x) ≤ (1/x) E(X_t^+; V_t ≥ x),  x ∈ (0, ∞)  (17.4.14)
Proof
Suppose that X is a martingale. For t ∈ T, let W_t = sup{|X_s| : s ∈ T_t}. Then

P(W_t ≥ x) ≤ (1/x) E(|X_t|; W_t ≥ x),  x ∈ (0, ∞)

Proof
Next recall that for k ∈ (1, ∞), the k-norm of a real-valued random variable X is ∥X∥_k = [E(|X|^k)]^{1/k}, and the vector space L_k consists of all real-valued random variables for which this norm is finite. The following theorem is the norm version of Doob's maximal inequality.

Suppose again that X is a martingale. For t ∈ T, let W_t = sup{|X_s| : s ∈ T_t}. Then for k > 1,

∥W_t∥_k ≤ [k/(k − 1)] ∥X_t∥_k  (17.4.18)
Proof
Once again, W = {W_t : t ∈ T} is the maximal process associated with |X| = {|X_t| : t ∈ T}. As noted in the proof, j = k/(k − 1) is the exponent conjugate to k, satisfying 1/j + 1/k = 1. So this version of the maximal inequality states that the k-norm of the maximum of the martingale X on T_t is bounded by j times the k-norm of X_t, where j and k are conjugate exponents. Stated just in terms of expected value rather than norms, the L_k maximal inequality is

E(|W_t|^k) ≤ [k/(k − 1)]^k E(|X_t|^k)  (17.4.27)
Our final result in this discussion is a variation of the maximal inequality for super-martingales: if X is a nonnegative super-martingale and U_∞ = sup{X_t : t ∈ T}, then

P(U_∞ ≥ x) ≤ (1/x) E(X_0),  x ∈ (0, ∞)
Proof
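Doob's L_k inequality with k = 2 can be checked numerically. In the sketch below (not from the text), X is the simple symmetric random walk, a martingale with E(X_n²) = n, and W_n is the running maximum of |X|; the L_2 inequality asserts E(W_n²) ≤ 4 E(X_n²).

```python
import random

rng = random.Random(7)
n, runs = 100, 20000
sum_w2 = sum_x2 = 0.0
for _ in range(runs):
    x, m = 0, 0
    for _ in range(n):
        x += 1 if rng.random() < 0.5 else -1
        m = max(m, abs(x))
    sum_w2 += m * m   # W_n^2, squared running maximum of |X|
    sum_x2 += x * x   # X_n^2
# Doob's L2 inequality: E(W_n^2) <= 4 E(X_n^2), and here E(X_n^2) = n = 100
print(sum_x2 / runs)   # ≈ 100
print(sum_w2 / runs)   # should be at most ≈ 400
```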
Suppose that x = (x_n : n ∈ N) is a sequence of real numbers, and that a, b ∈ R with a < b. Define t_0(x) = 0 and then recursively define

s_{k+1}(x) = inf{n ∈ N : n ≥ t_k(x), x_n ≤ a},  k ∈ N
t_{k+1}(x) = inf{n ∈ N : n ≥ s_{k+1}(x), x_n ≥ b},  k ∈ N
1. The number of up-crossings of the interval [a, b] by the sequence x up to time n ∈ N is
u_n(a, b, x) = sup{k ∈ N : t_k(x) ≤ n}  (17.4.32)
2. The total number of up-crossings of the interval [a, b] by the sequence x is u(a, b, x) = sup{k ∈ N : t_k(x) < ∞}.
Details
So informally, as the name suggests, u_n(a, b, x) is the number of times that the sequence (x_0, x_1, …, x_n) goes from a value below a to one above b, and u(a, b, x) is the number of times the entire sequence x goes from a value below a to one above b. Here are a few simple properties:
Suppose again that x = (x n : n ∈ N) is a sequence of real numbers and that a, b ∈ R with a < b .
1. u_n(a, b, x) is increasing in n ∈ N.
2. u_n(a, b, x) → u(a, b, x) as n → ∞.
3. If c, d ∈ R with a < c < d < b, then u_n(c, d, x) ≥ u_n(a, b, x) for n ∈ N, and u(c, d, x) ≥ u(a, b, x).
Proof
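The recursive definition of s_k and t_k amounts to a simple two-state scan of the sequence. Here is a Python sketch (not from the text) that counts completed up-crossings of [a, b]:

```python
def upcrossings(xs, a, b):
    """Number of up-crossings of [a, b] completed by the finite sequence xs."""
    count, below = 0, False
    for x in xs:
        if not below and x <= a:
            below = True      # hit level a or lower; waiting to rise to b
        elif below and x >= b:
            below = False
            count += 1        # one up-crossing completed
    return count

print(upcrossings([0, 2, 0, 2, 1, 3], 0.5, 1.5))   # 2
print(upcrossings([1, 1, 1], 0.5, 1.5))            # 0
```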
The importance of the definitions is found in the following theorem. Recall that R* = R ∪ {−∞, ∞} is the set of extended real numbers, and Q is the set of rational real numbers.

The sequence x converges in R* (that is, x_n → x_∞ as n → ∞ for some x_∞ ∈ R*) if and only if u(a, b, x) < ∞ for every a, b ∈ Q with a < b.

Proof
Clearly the theorem is true with Q replaced by R, but the countability of Q will be important in the martingale convergence theorem. As a simple corollary, if x is bounded and u(a, b, x) < ∞ for every a, b ∈ Q with a < b, then x converges in R. The up-crossing inequality for a discrete-time martingale X gives an upper bound on the expected number of up-crossings of X up to time n ∈ N in terms of a moment of X_n.
Suppose that X = {X_n : n ∈ N} satisfies the basic assumptions with respect to the filtration F = {F_n : n ∈ N}, and let a, b ∈ R with a < b. Let U_n = u_n(a, b, X), the random number of up-crossings of [a, b] by X up to time n ∈ N.
1. If X is a sub-martingale, then E(U_n) ≤ E[(X_n − a)^+]/(b − a).
2. If X is a super-martingale, then E(U_n) ≤ E[(X_n − a)^−]/(b − a).
Proof
Additional details
Of course if X is a martingale with respect to F then both inequalities apply. In continuous time, as usual, the concepts are more
complicated and technical.
Suppose now that x : [0, ∞) → R and that I ⊆ [0, ∞). For a, b ∈ R with a < b, define t_0^I(x) = 0 and then recursively define

s_{k+1}^I(x) = inf{t ∈ I : t ≥ t_k^I(x), x_t ≤ a},  k ∈ N
t_{k+1}^I(x) = inf{t ∈ I : t ≥ s_{k+1}^I(x), x_t ≥ b},  k ∈ N

1. If I ⊆ [0, ∞) is finite, the number of up-crossings of the interval [a, b] by x restricted to I is u_I(a, b, x) = sup{k ∈ N : t_k^I(x) < ∞}.
2. If I ⊆ [0, ∞) is infinite, the number of up-crossings of the interval [a, b] by x restricted to I is
u_I(a, b, x) = sup{u_J(a, b, x) : J is finite and J ⊂ I}  (17.4.40)
To simplify the notation, we will let u_t(a, b, x) = u_{[0,t]}(a, b, x), the number of up-crossings of [a, b] by x on [0, t], and u_∞(a, b, x) = u_{[0,∞)}(a, b, x), the total number of up-crossings of [a, b] by x. In continuous time, the definition of up-crossings is built out of finite subsets of [0, ∞) for measurability concerns, which arise when we replace the deterministic function x with a stochastic process X. Here are the simple properties that are analogous to our previous ones.
Suppose again that x : [0, ∞) → R and that a, b ∈ R with a < b.
1. If I ⊆ J ⊆ [0, ∞), then u_I(a, b, x) ≤ u_J(a, b, x).
2. If (I_n : n ∈ N) is an increasing sequence of subsets of [0, ∞) and J = ⋃_{n=0}^∞ I_n, then u_{I_n}(a, b, x) → u_J(a, b, x) as n → ∞.
3. If c, d ∈ R with a < c < d < b and I ⊆ [0, ∞), then u_I(c, d, x) ≥ u_I(a, b, x).
Proof
The following result is the reason for studying up-crossings in the first place. Note that the definition built from finite sets is sufficient.
Finally, here is the up-crossing inequality for martingales in continuous time. Once again, the inequality gives a bound on the
expected number of up-crossings.
Suppose that X = {X_t : t ∈ [0, ∞)} satisfies the basic assumptions with respect to the filtration F = {F_t : t ∈ [0, ∞)}, and let a, b ∈ R with a < b. Let U_t = u_t(a, b, X), the random number of up-crossings of [a, b] by X up to time t ∈ [0, ∞).
1. If X is a sub-martingale, then E(U_t) ≤ E[(X_t − a)^+]/(b − a).
2. If X is a super-martingale, then E(U_t) ≤ E[(X_t − a)^−]/(b − a).
Proof
Kolmogorov's Inequality

Suppose that X = {X_n : n ∈ N_+} is a sequence of independent random variables with E(X_n) = 0 and var(X_n) < ∞ for n ∈ N_+, and let

Y_n = ∑_{i=1}^n X_i,  n ∈ N  (17.4.46)

From the Introduction we know that Y is a martingale. A simple application of the maximal inequality gives the following result, which is known as Kolmogorov's inequality, named for Andrei Kolmogorov:

P(max{|Y_k| : k ∈ {1, 2, …, n}} ≥ x) ≤ (1/x²) var(Y_n),  x ∈ (0, ∞)
Proof
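Here is a numerical sketch of Kolmogorov's inequality (mine, not part of the text), using ±1 steps so that var(Y_n) = n. With n = 50 and x = 15, the bound is 50/225 ≈ 0.222, while the true probability is considerably smaller.

```python
import random

rng = random.Random(9)
n, x0, runs = 50, 15, 50000
hits = 0
for _ in range(runs):
    y, m = 0, 0
    for _ in range(n):
        y += 1 if rng.random() < 0.5 else -1
        m = max(m, abs(y))
    if m >= x0:
        hits += 1
# Kolmogorov: P(max_k |Y_k| >= x) <= var(Y_n)/x^2 = 50/225 ≈ 0.222
print(hits / runs)   # well below the bound
```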
Red and Black
In the game of red and black, a gambler plays a sequence of Bernoulli games with success parameter p ∈ (0, 1) at even stakes. The
gambler starts with an initial fortune x and plays until either she is ruined or reaches a specified target fortune a , where
x, a ∈ (0, ∞) with x < a. When p ≤ 1/2, so that the games are fair or unfair, an optimal strategy is bold play: on each game, the gambler bets her entire fortune or just what is needed to reach the target, whichever is smaller. In the section on bold play we showed that when p = 1/2, so that the games are fair, the probability of winning (that is, reaching the target a starting with x) is x/a. We can use the maximal inequality for super-martingales to show that indeed, one cannot do better.
To set up the notation and review various concepts, let X_0 denote the gambler's initial fortune and let X_n denote the outcome of game n ∈ N_+, where 1 denotes a win and −1 a loss. So {X_n : n ∈ N_+} is a sequence of independent variables with P(X_n = 1) = p and P(X_n = −1) = 1 − p for n ∈ N_+. Let

Y_n = ∑_{i=0}^n X_i,  n ∈ N  (17.4.50)

so that Y = {Y_n : n ∈ N} is the partial sum process associated with X = {X_n : n ∈ N}. Recall that Y is also known as the simple random walk with parameter p, and since p ≤ 1/2, is a super-martingale. The process {X_n : n ∈ N_+} is the difference
sequence associated with Y. Next let Z_n denote the amount that the gambler bets on game n ∈ N_+. The process Z = {Z_n : n ∈ N_+} is predictable with respect to X = {X_n : n ∈ N}, so that Z_n is measurable with respect to σ{X_0, X_1, …, X_{n−1}} for n ∈ N_+. So the gambler's fortune after n games is

W_n = X_0 + ∑_{i=1}^n Z_i X_i,  n ∈ N
Recall that W = {W_n : n ∈ N} is the transform of Z with Y, denoted W = Z ⋅ Y. The gambler is not allowed to go into debt, and so we must have Z_n ≤ W_{n−1} for n ∈ N_+: the gambler's bet on game n cannot exceed her fortune after game n − 1. What's the probability that the gambler can ever reach or exceed the target a starting with fortune x < a?
Let U_∞ = sup{W_n : n ∈ N}. Suppose that x, a ∈ (0, ∞) with x < a and that X_0 = x. Then

P(U_∞ ≥ a) ≤ x/a  (17.4.52)
Proof
Note that the only assumptions made on the gambler's sequence of bets Z are that the sequence is predictable, so that the gambler cannot see into the future, and that the gambler cannot go into debt. Under these basic assumptions, no strategy can do any better than bold play. However, there are strategies that do as well as bold play; these are variations on bold play.
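The claim that bold play achieves the bound x/a at fair games is easy to check by simulation. The following Python sketch (mine, not from the text) plays boldly from fortune x = 0.3 toward target a = 1 with p = 1/2; the empirical success probability should be close to x/a = 0.3.

```python
import random

def bold_play(x, a, p, rng):
    """Play boldly from fortune x toward target a; return True if the target is reached."""
    while 0 < x < a:
        bet = min(x, a - x)    # entire fortune, or just enough to reach the target
        x += bet if rng.random() < p else -bet
    return x >= a

rng = random.Random(10)
runs = 100000
wins = sum(bold_play(0.3, 1.0, 0.5, rng) for _ in range(runs))
print(wins / runs)   # ≈ x/a = 0.3
```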
Open the simulation of the red and black game. Select bold play and p = 1/2. Play the game with various values of the initial and target fortunes.
This page titled 17.4: Inequalities is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
17.5: Convergence
Basic Theory
Basic Assumptions
As in the Introduction, we start with a stochastic process X = {X_t : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). Next, we have a filtration F = {F_t : t ∈ T}, and we assume that X is adapted to F. So F is an increasing family of sub σ-algebras of F, and X_t is measurable with respect to F_t for t ∈ T. We think of F_t as the collection of events up to time t ∈ T. We assume that E(|X_t|) < ∞, so that the mean of X_t exists as a real number, for each t ∈ T. Finally, in continuous time where T = [0, ∞), we need the additional assumptions that t ↦ X_t is right continuous and has left limits, and that the filtration F is standard (that is, right continuous and complete). Recall also that F_∞ = σ(⋃_{t∈T} F_t), and this is the σ-algebra that encodes our information over all time.
Similarly, if X is a super-martingale relative to F then X has a decreasing property of sorts, since the last inequality is reversed.
Thus, there is hope that if this increasing or decreasing property is coupled with an appropriate boundedness property, then the sub-
martingale or super-martingale might converge, in some sense, as t → ∞ . This is indeed the case, and is the subject of this section.
The martingale convergence theorems, first formulated by Joseph Doob, are among the most important results in the theory of martingales. The first martingale convergence theorem states that if the expected absolute value E(|X_t|) is bounded in t ∈ T, then the martingale process converges with probability 1.
Suppose that X is a martingale, sub-martingale, or super-martingale with respect to F, and that E(|X_t|) is bounded in t ∈ T. Then there exists a random variable X_∞ that is measurable with respect to F_∞ such that E(|X_∞|) < ∞ and X_t → X_∞ as t → ∞ with probability 1.
Proof
The boundedness condition means that X is bounded (in norm) as a subset of the vector space L_1. Here is a very simple but useful corollary: if X is a nonnegative super-martingale, then there exists a random variable X_∞ with E(X_∞) < ∞ such that X_t → X_∞ as t → ∞ with probability 1.
Proof
Of course, the corollary applies to a nonnegative martingale as a special case. For the second martingale convergence theorem you will need to review uniformly integrable variables. Recall also that for k ∈ [1, ∞), the k-norm of a random variable X is

∥X∥_k = [E(|X|^k)]^{1/k}  (17.5.4)

and L_k is the normed vector space of all real-valued random variables for which this norm is finite. Convergence in mean means convergence in the space L_1, that is, with respect to the 1-norm.
Suppose that X is uniformly integrable and is a martingale, sub-martingale, or super-martingale with respect to F. Then there exists a random variable X_∞, measurable with respect to F_∞, such that X_t → X_∞ as t → ∞ with probability 1 and in mean.
Proof
As a simple corollary, recall that if ∥X_t∥_k is bounded in t ∈ T for some k ∈ (1, ∞), then X is uniformly integrable, and hence the second martingale convergence theorem applies. Moreover, in this case X_t → X_∞ as t → ∞ in L_k.
17.5.1 https://stats.libretexts.org/@go/page/10303
Proof
Suppose that V = {V_i : i ∈ N} is a sequence of independent variables with P(V_i = 1) = p and P(V_i = −1) = 1 − p for i ∈ N_+, where p ∈ (0, 1). Let X = {X_n : n ∈ N} be the partial sum process associated with V, so that

X_n = ∑_{i=0}^n V_i,  n ∈ N  (17.5.8)

Recall that X is the simple random walk with parameter p. From our study of Markov chains, we know that if p > 1/2 then X_n → ∞ as n → ∞, and if p < 1/2 then X_n → −∞ as n → ∞. The chain is transient in these two cases. If p = 1/2, the chain is (null) recurrent and so visits every state in Z infinitely often. In this case X_n does not converge as n → ∞. But of course, when p = 1/2, X is a martingale; the convergence theorem does not apply, since E(|X_n|) is unbounded in n ∈ N.
Doob's Martingale
Recall that if X is a random variable with E(|X|) < ∞ and we define X_t = E(X ∣ F_t) for t ∈ T, then X = {X_t : t ∈ T} is a
martingale relative to F and is known as a Doob martingale, named for you know whom. So the second martingale convergence
theorem states that every uniformly integrable martingale is a Doob martingale. Moreover, we know that the Doob martingale X
constructed from X and F is uniformly integrable, so the second martingale convergence theorem applies. The last remaining
question is the relationship between X and the limiting random variable X_∞. The answer may come as no surprise: X_∞ = E(X ∣ F_∞). Of course if F_∞ = F, which is quite possible, then X_∞ = X. At the other extreme, if F_t = {∅, Ω}, the trivial σ-algebra, for all t ∈ T, then X_∞ = E(X), a constant.
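To make the Doob martingale concrete, here is a tiny Python sketch (mine, not from the text): X is the number of heads in n = 20 fair coin flips, F_t is generated by the first t flips, and X_t = E(X ∣ F_t) = S_t + (n − t)/2. The path starts at E(X) = n/2 and terminates at X itself, illustrating convergence to E(X ∣ F_∞) = X.

```python
import random

rng = random.Random(11)
n = 20
flips = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
X = sum(flips)   # the target random variable (number of heads)
# Doob martingale: X_t = E(X | first t flips) = S_t + (n - t)/2
doob = [sum(flips[:t]) + (n - t) / 2 for t in range(n + 1)]
print(doob[0])    # 10.0 = E(X)
print(doob[-1])   # equals X
```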
Suppose now that X = {X_n : n ∈ N_+} is a sequence of random variables with values in a general measurable space (S, S). For n ∈ N_+, let G_n = σ{X_k : k ≥ n}, and let G_∞ = ⋂_{n=1}^∞ G_n. So G_∞ is the tail σ-algebra of X, the collection of events that depend only on the terms of the sequence with arbitrarily large indices. For example, if the sequence is real-valued (or more generally takes values in a metric space), then the event that X_n has a limit as n → ∞ is a tail event. If B ∈ S, then the event that X_n ∈ B for infinitely many n ∈ N_+ is another tail event. The Kolmogorov zero-one law, named for Andrei Kolmogorov, states that if X is an independent sequence, then every tail event has probability 0 or 1.
Tail events and the Kolmogorov zero-one law were studied earlier in the section on measure in the chapter on probability spaces. A random variable that is measurable with respect to G_∞ is a tail random variable. From the Kolmogorov zero-one law, a real-valued tail random variable for an independent sequence must be a constant (with probability 1).
Branching Processes
Recall the discussion of the simple branching process from the Introduction. The fundamental assumption is that the particles act
independently, each with the same offspring distribution on N. As before, we will let f denote the (discrete) probability density
function of the number of offspring of a particle, m the mean of the distribution, and q the probability of extinction starting with a single particle. We assume that f(0) > 0 and f(0) + f(1) < 1, so that a particle has a positive probability of dying without
children and a positive probability of producing more than 1 child.
The stochastic process of interest is X = {X_n : n ∈ N}, where X_n is the number of particles in the nth generation for n ∈ N. Recall that X is a discrete-time Markov chain on N. Since 0 is an absorbing state, and all positive states lead to 0, we know that the positive states are transient and so are visited only finitely often with probability 1. It follows that either X_n → 0 as n → ∞ (extinction) or X_n → ∞ as n → ∞ (explosion). We have quite a bit of information about which of these events will occur from our study of Markov chains, but the martingale convergence theorems give more information.
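As a numerical sketch (mine, not from the text; the Poisson offspring distribution is an illustrative choice), take f to be the Poisson distribution with mean m = 3/2. The extinction probability q is then the smallest fixed point of q = e^{m(q−1)} (the offspring probability generating function evaluated at q), and simulated populations should go extinct with about that frequency, otherwise exploding.

```python
import math
import random

def poisson(lam, rng):
    """Sample a Poisson(lam) variate (Knuth's multiplication method; fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

m = 1.5                       # supercritical mean offspring
q = 0.0                       # extinction prob: smallest root of q = exp(m(q-1))
for _ in range(200):
    q = math.exp(m * (q - 1))

rng = random.Random(12)
runs, extinct = 4000, 0
for _ in range(runs):
    pop = 1
    for _ in range(60):
        if pop == 0 or pop > 300:   # died out, or clearly exploding
            break
        pop = sum(poisson(m, rng) for _ in range(pop))
    if pop == 0:
        extinct += 1
print(extinct / runs, q)   # both ≈ 0.417
```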
Recall the beta-Bernoulli process: the success probability P has a prior beta distribution with parameters a, b ∈ (0, ∞), and given P = p ∈ (0, 1), X = {X_i : i ∈ N_+} is a sequence of Bernoulli trials with P(X_i = 1 ∣ P = p) = p for i ∈ N_+. Let Y = {Y_n : n ∈ N} denote the partial sum process associated with X, so that once again, Y_n = ∑_{i=1}^n X_i for n ∈ N. Next let M_n = Y_n/n for n ∈ N_+, so that M_n is the sample mean of (X_1, X_2, …, X_n). Finally let

Z_n = (a + Y_n)/(a + b + n),  n ∈ N  (17.5.10)

Then M_n → P and Z_n → P as n → ∞ with probability 1 and in mean.
This is a very nice result, and is reminiscent of the fact that for the ordinary Bernoulli trials sequence with success parameter p ∈ (0, 1), we have the law of large numbers: M_n → p as n → ∞ with probability 1 and in mean.
Recall next the Pólya urn process: an urn initially contains a red and b green balls. At each discrete time step, a ball is selected at random from the urn and returned, along with c ∈ N new balls of the same color. For i ∈ N_+, let X_i be the indicator variable of the event that the ball selected at time i is red, and let X = {X_i : i ∈ N_+}. Since Y_n is the number of red balls selected in the first n ∈ N_+ draws, the sample mean is again M_n = Y_n/n. On the other hand, the number of red balls in the urn at time n ∈ N is a + cY_n, and the total number of balls in the urn at time n is a + b + cn, so the proportion of red balls in the urn at time n is

Z_n = (a + cY_n)/(a + b + cn),  n ∈ N
We showed in the Introduction that Z = {Z_n : n ∈ N} is a martingale. Now we are interested in the limiting behavior of M_n and Z_n as n → ∞. When c = 0, the answer is easy. In this case, Y_n has the binomial distribution with trial parameter n and success parameter a/(a + b), so by the law of large numbers, M_n → a/(a + b) as n → ∞ with probability 1 and in mean. On the other hand, Z_n = a/(a + b) for every n ∈ N in this case, so trivially Z_n → a/(a + b) as well.
Suppose that c ∈ N_+. Then there exists a random variable P such that M_n → P and Z_n → P as n → ∞ with probability 1 and in mean. Moreover, P has the beta distribution with left parameter a/c and right parameter b/c.
Proof
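A simulation sketch of the Pólya urn (mine, not part of the text): since Z is a martingale, E(Z_n) = Z_0 = a/(a + b) for every n, and for c = 1 the limit P has the beta distribution with parameters a and b.

```python
import random

rng = random.Random(14)
a, b, c, n, runs = 2, 4, 1, 200, 5000
limits = []
for _ in range(runs):
    red, total = a, a + b
    for _ in range(n):
        if rng.random() < red / total:   # red ball drawn
            red += c
        total += c
    limits.append(red / total)           # Z_n, nearly converged at n = 200
mean = sum(limits) / runs
var = sum((z - mean) ** 2 for z in limits) / runs
# Martingale: E(Z_n) = a/(a+b) = 1/3; limit ~ beta(2, 4), variance 8/252 ≈ 0.0317
print(mean)   # ≈ 1/3
print(var)    # ≈ 0.03
```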
Likelihood Ratio Tests
Recall the discussion of likelihood ratio tests in the Introduction. To review, suppose that (S, S , μ) is a general measure space, and
that X = {X : n ∈ N} is a sequence of independent, identically distributed random variables, taking values in S , and having a
n
common probability density function with respect to μ . The likelihood ratio test is a hypothesis test, where the null and alternative
hypotheses are
H_0: the probability density function is g_0.
H_1: the probability density function is g_1.
We assume that g_0 and g_1 are positive on S. Also, it makes no sense for g_0 and g_1 to be the same, so we assume that g_0 ≠ g_1 on a set of positive measure. The test is based on the likelihood ratio test statistic

L_n = ∏_{i=1}^n g_0(X_i)/g_1(X_i),  n ∈ N  (17.5.12)
We showed that under the alternative hypothesis H1 , L = { Ln : n ∈ N} is a martingale with respect to X , known as the
likelihood ratio martingale.
Suppose that H_1 is true. Then L_n → 0 as n → ∞ with probability 1. This result is good news, statistically speaking. Small values of L_n are evidence in favor of H_1, so the decision rule is to reject H_0 in favor of H_1 if L_n ≤ l for a chosen critical value l ∈ (0, ∞). If H_1 is true and the sample size n is sufficiently large, we will reject H_0. In the proof, note that ln(L_n) must diverge to −∞ at least as fast as n diverges to ∞. Hence L_n → 0 as n → ∞ exponentially fast, at least. It is also worth noting that L is a mean 1 martingale (under H_1), so trivially E(L_n) → 1 as n → ∞ even though L_n → 0 as n → ∞ with probability 1. So the likelihood ratio martingale is a good example of a sequence for which the limit of the expected values is not the expected value of the limit.
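A sketch of this phenomenon (mine, not from the text; the two normal densities are an illustrative choice): take g_0 the standard normal density and g_1 the normal density with mean 1 and variance 1. Under H_1, ln(g_0(x)/g_1(x)) = 1/2 − x, so L_n = exp(n/2 − S_n). Then E(L_n) = 1 for every n, yet L_n → 0 with probability 1.

```python
import math
import random

rng = random.Random(13)

def L(n):
    """One sample of the likelihood ratio L_n = exp(n/2 - S_n), simulated under H1."""
    s = sum(rng.gauss(1, 1) for _ in range(n))
    return math.exp(n / 2 - s)

runs = 200000
mean_L2 = sum(L(2) for _ in range(runs)) / runs          # E(L_2) = 1 exactly
frac_small = sum(1 for _ in range(2000) if L(200) < 1e-10) / 2000
print(mean_L2)      # ≈ 1
print(frac_small)   # ≈ 1.0: L_200 is essentially always tiny
```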
Partial Products
Suppose that X = { Xn : n ∈ N+ } is an independent sequence of nonnegative random variables with E(Xn ) = 1 for n ∈ N+ .
Let
Y_n = ∏_{i=1}^n X_i,  n ∈ N  (17.5.15)
so that Y = {Y_n : n ∈ N} is the partial product process associated with X. From our discussion of this process in the Introduction, we know that Y is a martingale with respect to X. Since Y is a nonnegative martingale, the corollary to the first martingale convergence theorem applies, so there exists a random variable Y_∞ such that Y_n → Y_∞ as n → ∞ with probability 1. What more can we say? The following result, known as the Kakutani product martingale theorem, is due to Shizuo Kakutani.
Let a_n = E(√X_n) for n ∈ N_+ and let A = ∏_{i=1}^∞ a_i.
1. If A > 0, then Y_n → Y_∞ as n → ∞ in mean, and E(Y_∞) = 1.
2. If A = 0, then Y_∞ = 0 with probability 1.
Proof
Density Functions
This discussion continues the one on density functions in the Introduction. To review, we start with our probability space (Ω, F, P) and a filtration F = {F_n : n ∈ N} in discrete time. Recall again that F_∞ = σ(⋃_{n=0}^∞ F_n). Suppose now that μ is a finite measure on the sample space (Ω, F). For each n ∈ N ∪ {∞}, the restriction of μ to F_n is a measure on (Ω, F_n), and similarly the restriction of P to F_n is a probability measure on (Ω, F_n). To save notation and terminology, we will refer to these simply as μ and P on F_n. Suppose now that μ is absolutely continuous with respect to P on F_n for each n ∈ N. By the Radon-Nikodym theorem, μ has a density function (or Radon-Nikodym derivative) X_n : Ω → R with respect to P on F_n for each n ∈ N. The theorem and the derivative are named for Johann Radon and Otto Nikodym. In the Introduction we showed that X = {X_n : n ∈ N} is a martingale with respect to F. Here is the convergence result:
1. There exists a random variable X_∞ such that X_n → X_∞ as n → ∞ with probability 1.
2. If μ is absolutely continuous with respect to P on F_∞, then X_∞ is a density function of μ with respect to P on F_∞.
∞ ∞ ∞
Proof
The martingale approach can be used to give a probabilistic proof of the Radon-Nikodym theorem, at least in certain cases. We start with a sample set Ω. Suppose that A_n = {A_i^n : i ∈ I_n} is a countable partition of Ω for each n ∈ N. Thus I_n is countable, A_i^n ∩ A_j^n = ∅ for distinct i, j ∈ I_n, and ⋃_{i∈I_n} A_i^n = Ω. Suppose also that A_{n+1} refines A_n for each n ∈ N, in the sense that A_i^n is a union of sets in A_{n+1} for each i ∈ I_n. Let F_n = σ(A_n). Thus F_n is generated by a countable partition, and so the sets in F_n are simply unions of the sets in A_n. Since A_{n+1} refines A_n, it follows that F_n ⊆ F_{n+1}, so F = {F_n : n ∈ N} is a filtration. Let F = F_∞ = σ(⋃_{n=0}^∞ F_n) = σ(⋃_{n=0}^∞ A_n), so that our sample space is (Ω, F). Finally, suppose that P is a probability measure on (Ω, F) with the property that P(A_i^n) > 0 for n ∈ N and i ∈ I_n. We now have a probability space (Ω, F, P). Interesting probability spaces that occur in applications are of this form, so the setting is not as specialized as you might think.
Suppose now that μ is a finite measure on (Ω, F). From our assumptions, the only null set for P on F_n is ∅, so μ is automatically absolutely continuous with respect to P on F_n. Moreover, for n ∈ N, we can give the density function of μ with respect to P on F_n explicitly:
The density function of μ with respect to P on F_n is the random variable X_n whose value on A_i^n is μ(A_i^n)/P(A_i^n) for each i ∈ I_n. Equivalently,

X_n = ∑_{i∈I_n} [μ(A_i^n)/P(A_i^n)] 1(A_i^n)  (17.5.22)
Proof
By our theorem above, there exists a random variable X such that X_n → X as n → ∞ with probability 1. If μ is absolutely continuous with respect to P on F, then X is a density function of μ with respect to P on F. The point is that we have given a
more or less explicit construction of the density.
For a concrete example, consider Ω = [0, 1). For n ∈ N, let

A_n = {[j/2^n, (j + 1)/2^n) : j ∈ {0, 1, …, 2^n − 1}}  (17.5.24)

This is the partition of [0, 1) into 2^n subintervals of equal length 1/2^n, based on the dyadic rationals (or binary rationals) of rank n or less. Note that every interval in A_n is the union of two adjacent intervals in A_{n+1}, so the refinement property holds. Let P be ordinary Lebesgue measure on [0, 1), so that P(A) = 1/2^n for n ∈ N and A ∈ A_n. As above, let F_n = σ(A_n) and F = σ(⋃_{n=0}^∞ F_n) = σ(⋃_{n=0}^∞ A_n). The dyadic rationals are dense in [0, 1), so F is the ordinary Borel σ-algebra on [0, 1). Thus our probability space (Ω, F, P) is simply [0, 1) with the usual Euclidean structures. If μ is a finite measure on ([0, 1), F), then the density function of μ on F_n is the random variable X_n whose value on the interval [j/2^n, (j + 1)/2^n) is 2^n μ[j/2^n, (j + 1)/2^n). If μ is absolutely continuous with respect to P on F (so absolutely continuous in the usual sense), then a density function of μ in the usual sense arises as the limit of X_n as n → ∞ (with probability 1).
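Here is a small Python sketch of this construction (mine, not part of the text), taking μ to be the measure on [0, 1) with density f(x) = 2x, so that μ[u, v) = v² − u². The value of X_n on the dyadic interval containing x converges to f(x):

```python
# mu has density f(x) = 2x on [0, 1), so mu[u, v) = v^2 - u^2 (exact).
def X_n(n, x):
    """Value at x of the density of mu on F_n: 2^n * mu(dyadic interval of rank n containing x)."""
    j = int(x * 2 ** n)                   # index of the interval [j/2^n, (j+1)/2^n)
    u, v = j / 2 ** n, (j + 1) / 2 ** n
    return 2 ** n * (v ** 2 - u ** 2)

print(X_n(3, 0.3))    # 0.625, the average of f over [0.25, 0.375)
print(X_n(20, 0.3))   # ≈ 0.6 = f(0.3)
```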
This page titled 17.5: Convergence is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
17.6: Backwards Martingales
Basic Theory
A backwards martingale is a stochastic process that satisfies the martingale property reversed in time, in a certain sense. In some
ways, backward martingales are simpler than their forward counterparts, and in particular, satisfy a convergence theorem similar to
the convergence theorem for ordinary martingales. The importance of backward martingales stems from their numerous
applications. In particular, some of the fundamental theorems of classical probability can be formulated in terms of backward
martingales.
Definitions
As usual, we start with a stochastic process Y = {Y_t : t ∈ T} on an underlying probability space (Ω, F, P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). So to review what all this means, Ω is the sample space, F the σ-algebra of events, P the probability measure on (Ω, F), and Y_t is a random variable with values in R for each t ∈ T. But at this point our formulation diverges. Suppose that G_t is a sub σ-algebra of F for each t ∈ T, and that the family G = {G_t : t ∈ T} is decreasing, so that if s, t ∈ T with s ≤ t then G_t ⊆ G_s. The process Y is a backwards martingale with respect to G if E(|Y_t|) < ∞ and Y_t is measurable with respect to G_t for t ∈ T, and E(Y_s ∣ G_t) = Y_t for s, t ∈ T with s ≤ t.
A backwards martingale can be formulated as an ordinary martingale by using negative times as the indices. Let T_− = {−t : t ∈ T}, so that if T = N (the discrete case) then T_− is the set of non-positive integers, and if T = [0, ∞) (the continuous case) then T_− = (−∞, 0]. Recall also that the standard martingale definitions make sense for any totally ordered index set.
Proof
Most authors define backwards martingales with negative indices, as above, in the first place. There are good reasons for doing so,
since some of the fundamental theorems of martingales apply immediately to backwards martingales. However, for the
applications of backwards martingales, this notation is artificial and clunky, so for the most part, we will use our original definition.
The next result is another way to view a backwards martingale as an ordinary martingale. This one preserves nonnegative time, but introduces a finite time horizon. For t ∈ T, let T_t = {s ∈ T : s ≤ t}, a notation we have used often before.
Suppose again that Y = {Y_t : t ∈ T} is a backwards martingale with respect to G = {G_t : t ∈ T}. Fix t ∈ T and define X^t_s = Y_{t−s} and F^t_s = G_{t−s} for s ∈ T_t. Then X^t = {X^t_s : s ∈ T_t} is a martingale relative to F^t = {F^t_s : s ∈ T_t}.
Proof
Properties
Backwards martingales satisfy a simple and important property.
Proof
Here is the Doob backwards martingale, analogous to the ordinary Doob martingale, and of course named for Joseph Doob. In a
sense, this is the converse to the previous result.
Suppose that Y is a random variable on our probability space (Ω, F, P) with E(|Y|) < ∞, and that G = {G_t : t ∈ T} is a decreasing family of sub σ-algebras of F, as above. Let Y_t = E(Y ∣ G_t) for t ∈ T. Then Y = {Y_t : t ∈ T} is a backwards martingale relative to G.
Proof
The convergence theorems are the most important results for the applications of backwards martingales. Recall once again that for k ∈ [1, ∞), the k-norm of a real-valued random variable X is
‖X‖_k = [E(|X|^k)]^{1/k}  (17.6.5)
and the normed vector space L_k consists of all X with ‖X‖_k < ∞. Convergence in the space L_1 is also referred to as convergence in mean, and convergence in the space L_2 is referred to as convergence in mean square. Here is the primary backwards martingale convergence theorem: if Y = {Y_t : t ∈ T} is a backwards martingale with respect to G = {G_t : t ∈ T}, then there exists a random variable Y_∞ such that
1. Y_t → Y_∞ as t → ∞ with probability 1.
2. Y_t → Y_∞ as t → ∞ in mean.
3. Y_∞ = E(Y_0 ∣ G_∞), where G_∞ = ⋂_{t ∈ T} G_t.
Proof
As a simple extension of the last result, if Y_0 ∈ L_k for some k ∈ [1, ∞) then the convergence is in L_k also.
Proof
Applications
The Strong Law of Large Numbers
The strong law of large numbers is one of the fundamental theorems of classical probability. Our previous proof required that the underlying distribution have finite variance. Here we present an elegant proof using backwards martingales that does not require this extra assumption. So, suppose that X = {X_n : n ∈ N_+} is a sequence of independent, identically distributed random variables with common mean μ ∈ R. In statistical terms, X corresponds to sampling from the underlying distribution. Next let
Y_n = ∑_{i=1}^n X_i,  n ∈ N  (17.6.12)
so that Y = {Y_n : n ∈ N} is the partial sum process associated with X. Recall that the sequence Y is also a discrete-time random walk.
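The conclusion of the strong law can be seen numerically. The following sketch (not part of the original text; the exponential distribution, sample size, and seed are arbitrary choices) shows the running sample means M_n = Y_n/n settling down to the common mean μ.

```python
import numpy as np

# Minimal sketch of the SLLN: running sample means of an IID sequence with
# mean mu = 2 (exponential, an arbitrary choice) approach mu as n grows.
rng = np.random.default_rng(seed=42)
mu = 2.0
x = rng.exponential(scale=mu, size=100_000)
sample_means = np.cumsum(x) / np.arange(1, x.size + 1)
# sample_means[-1] should be close to mu = 2
```

Note that no finite-variance assumption is needed for the strong law itself; the exponential distribution here just makes the experiment easy to run.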
Exchangeable Variables
We start with a probability space (Ω, F, P) and another measurable space (S, S). Suppose that X = (X_1, X_2, …) is a sequence of random variables each taking values in S. Recall that X is exchangeable if for every n ∈ N_+, every permutation of (X_1, X_2, …, X_n) has the same distribution on (S^n, S^n) (where S^n is the n-fold product σ-algebra). Clearly if X is a sequence of independent, identically distributed variables, then X is exchangeable. Conversely, if X is exchangeable then the variables are identically distributed (by definition), but are not necessarily independent. The most famous example of a sequence that is exchangeable but not independent is Pólya's urn process, named for George Pólya. On the other hand, conditionally independent and identically distributed sequences are exchangeable. Thus suppose that (T, T) is another measurable space and that Θ is a random variable taking values in T.
Proof
Often the setting of this theorem arises when we start with a sequence of independent, identically distributed random variables that
are governed by a parametric distribution, and then randomize one of the parameters. In a sense, we can always think of the setting
in this way: Imagine that θ ∈ T is a parameter for a distribution on S. A special case is the beta-Bernoulli process, in which the success parameter p in a sequence of Bernoulli trials is randomized with the beta distribution. On the other hand, Pólya's urn process is an example of an exchangeable sequence that does not at first seem to have anything to do with randomizing parameters. But in fact, we know that Pólya's urn process is a special case of the beta-Bernoulli process. This connection gives a hint of de Finetti's theorem, named for Bruno de Finetti, which we consider next. This theorem states that any exchangeable sequence of indicator random variables corresponds to randomizing the success parameter in a sequence of Bernoulli trials.
de Finetti's Theorem. Suppose that X = (X_1, X_2, …) is an exchangeable sequence of random variables, each taking values in {0, 1}. Then there exists a random variable P with values in [0, 1], such that given P = p ∈ [0, 1], X is a sequence of Bernoulli trials with success parameter p.
Proof
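A small simulation (a sketch, not part of the original text; the uniform prior and the seed are arbitrary choices) illustrates the beta-Bernoulli setting behind de Finetti's theorem: the mixed sequence is exchangeable but not independent.

```python
import numpy as np

# Randomize the success parameter with a uniform prior (Beta(1, 1)), then run
# two Bernoulli trials given P = p.
rng = np.random.default_rng(seed=7)
reps = 200_000
p = rng.uniform(size=reps)                      # the mixing variable P
x1 = (rng.uniform(size=reps) < p).astype(int)
x2 = (rng.uniform(size=reps) < p).astype(int)

# Exchangeability: (X1, X2) and (X2, X1) have the same distribution, so
# P(X1=1, X2=0) matches P(X1=0, X2=1); both equal E[P(1-P)] = 1/6.
p10 = np.mean((x1 == 1) & (x2 == 0))
p01 = np.mean((x1 == 0) & (x2 == 1))
# But the trials are dependent: P(X1=1, X2=1) = E[P^2] = 1/3, not 1/4.
p11 = np.mean((x1 == 1) & (x2 == 1))
```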
De Finetti's theorem has been extended to much more general sequences of exchangeable variables. Basically, if X = (X_1, X_2, …) is an exchangeable sequence of random variables, each taking values in a sufficiently nice measurable space (S, S), then there exists a random variable Θ such that X is independent and identically distributed given Θ. In the proof, the result that M_n → P as n → ∞ with probability 1, where M_n = (1/n) ∑_{i=1}^n X_i, is known as de Finetti's strong law of large numbers.
De Finetti's theorem and its generalizations are important in Bayesian statistical inference. For an exchangeable sequence of random variables (our observations in a statistical experiment), there is a hidden, random parameter Θ. Given Θ = θ, the variables are independent and identically distributed. We gain information about Θ by imposing a prior distribution on Θ and then updating this, based on our observations and using Bayes' theorem, to a posterior distribution.
Stated more in terms of distributions, de Finetti's theorem states that the distribution of n distinct variables in the exchangeable sequence is a mixture of product measures. That is, if μ_θ is the distribution of a generic X_i on (S, S) given Θ = θ, and ν is the distribution of Θ on (T, T), then the distribution of n distinct variables in the sequence is
B ↦ ∫_T μ_θ^n(B) dν(θ)  (17.6.28)
where μ_θ^n denotes the n-fold product measure.
This page titled 17.6: Backwards Martingales is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
CHAPTER OVERVIEW
18: Brownian Motion
Brownian motion is a stochastic process of great theoretical importance, and as the basic building block of a variety of other
processes, of great practical importance as well. In this chapter we study Brownian motion and a number of random processes that
can be constructed from Brownian motion. We also study the Ito stochastic integral and the resulting calculus, as well as two
remarkable representation theorems involving stochastic integrals.
18.1: Standard Brownian Motion
18.2: Brownian Motion with Drift and Scaling
18.3: The Brownian Bridge
18.4: Geometric Brownian Motion
This page titled 18: Brownian Motion is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random
Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
18.1: Standard Brownian Motion
Basic Theory
History
In 1827, the botanist Robert Brown noticed that tiny particles from pollen, when suspended in water, exhibited continuous but very
jittery and erratic motion. In his “miracle year” in 1905, Albert Einstein explained the behavior physically, showing that the
particles were constantly being bombarded by the molecules of the water, and thus helping to firmly establish the atomic theory of
matter. Brownian motion as a mathematical random process was first constructed in a rigorous way by Norbert Wiener in a series of
papers starting in 1918. For this reason, the Brownian motion process is also known as the Wiener process.
Run the two-dimensional Brownian motion simulation several times in single-step mode to get an idea of what Mr. Brown may
have observed under his microscope.
Along with the Bernoulli trials process and the Poisson process, the Brownian motion process is of central importance in
probability. Each of these processes is based on a set of idealized assumptions that lead to a rich mathematical theory. In each case
also, the process is used as a building block for a number of related random processes that are of great importance in a variety of
applications. In particular, Brownian motion and related processes are used in applications ranging from physics to statistics to
economics.
Definition
A standard Brownian motion is a random process X = { Xt : t ∈ [0, ∞)} with state space R that satisfies the following
properties:
1. X = 0 (with probability 1).
0
2. X has stationary increments. That is, for s, t ∈ [0, ∞) with s < t, the distribution of X_t − X_s is the same as the distribution of X_{t−s}.
3. X has independent increments. That is, for t_1, t_2, …, t_n ∈ [0, ∞) with t_1 < t_2 < ⋯ < t_n, the random variables X_{t_1}, X_{t_2} − X_{t_1}, …, X_{t_n} − X_{t_{n−1}} are independent.
4. X_t has the normal distribution with mean 0 and variance t for t ∈ [0, ∞).
5. t ↦ X_t is continuous with probability 1.
2. This is a statement of time homogeneity: the underlying dynamics (namely the jostling of the particle by the molecules of water)
do not change over time, so the distribution of the displacement of the particle in a time interval [s, t] depends only on the
length of the time interval.
3. This is an idealized assumption that would hold approximately if the time intervals are large compared to the tiny times between
collisions of the particle with the molecules.
4. This is another idealized assumption based on the central limit theorem: the position of the particle at time t is the result of a very large number of collisions, each making a very small contribution. The fact that the mean is 0 is a statement of spatial homogeneity: the particle is no more or less likely to be jostled to the right than to the left. Next, recall that the assumptions of stationary, independent increments mean that var(X_t) = σ²t for some positive constant σ². By a change in time scale we can assume σ² = 1, although we will consider more general Brownian motions in the next section.
5. Finally, the continuity of the sample paths is an essential assumption, since we are modeling the position of a physical particle as a function of time.
Of course, the first question we should ask is whether there exists a stochastic process satisfying the definition. Fortunately, the
answer is yes, although the proof is complicated.
There exists a probability space (Ω, F , P) and a stochastic process X = { Xt : t ∈ [0, ∞)} on this probability space
satisfying the assumptions in the definition.
Proof sketch
Run the simulation of the standard Brownian motion process a few times in single-step mode. Note the qualitative behavior of the sample paths. Run the simulation 1000 times and compare the empirical density function and moments of X_t to the true probability density function and moments.
One way to construct standard Brownian motion is as a limit of random walks. Let
X_n = ∑_{i=1}^n U_i
where U = (U_1, U_2, …) is a sequence of independent variables with P(U_i = 1) = P(U_i = −1) = 1/2 for each i ∈ N_+. Recall that E(X_n) = 0 and var(X_n) = n for n ∈ N. Also, since X is the partial sum process associated with an IID sequence, X has stationary, independent increments (but of course in discrete time). Finally, recall that by the central limit theorem, X_n/√n converges to the standard normal distribution as n → ∞. Now, for h, d ∈ (0, ∞) the continuous time process X_{h,d}(t) = d X_{⌊t/h⌋} is a jump process with jumps at {0, h, 2h, …} and with jumps of size ±d. Basically we would like to let h ↓ 0 and d ↓ 0, but this cannot be done arbitrarily. Note that E[X_{h,d}(t)] = 0 but var[X_{h,d}(t)] = d²⌊t/h⌋. Thus, by the central limit theorem, if we take d = √h then the distribution of X_{h,d}(t) will converge to the normal distribution with mean 0 and variance t as h ↓ 0. More generally, we might hope that all of the requirements in the definition are satisfied by the limiting process, and if so, we have a standard Brownian motion.
Run the simulation of the random walk process for increasing values of n. In particular, run the simulation several times with n = 100. Compare the qualitative behavior with the standard Brownian motion process. Note that the scaling of the random walk in time and space is effectively accomplished by scaling the horizontal and vertical axes in the graph window.
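The scaling argument above can be checked numerically. This sketch (not in the original text; the step size h, horizon t, number of walks, and seed are arbitrary choices) simulates the rescaled walk X_{h,d} with d = √h and compares the sample mean and variance at time t with those of the N(0, t) limit.

```python
import numpy as np

# Scaled random walk: time step h, space step d = sqrt(h).
rng = np.random.default_rng(seed=3)
h = 0.001
d = np.sqrt(h)
t = 2.0
n = int(t / h)                                  # number of steps up to time t
steps = rng.choice([-1, 1], size=(5000, n))     # 5000 independent walks
x_t = d * steps.sum(axis=1)                     # X_{h,d}(t) for each walk

m, v = x_t.mean(), x_t.var()                    # should be near 0 and t = 2
```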
For t ∈ (0, ∞), X_t has the probability density function f_t given by
f_t(x) = (1/√(2πt)) exp(−x²/(2t)),  x ∈ R  (18.1.2)
Proof
X is a Gaussian process with mean function m(t) = 0 for t ∈ [0, ∞) and covariance function c(s, t) = min{s, t} for s, t ∈ [0, ∞).
Proof
Recall that for a Gaussian process, the finite dimensional (multivariate normal) distributions are completely determined by the
mean function m and the covariance function c . Thus, it follows that a standard Brownian motion is characterized as a continuous
Gaussian process with the mean and covariance functions in the last theorem. Note also that
cor(X_s, X_t) = min{s, t}/√(st) = √(min{s, t}/max{s, t}),  (s, t) ∈ [0, ∞)²  (18.1.5)
We can also give the higher moments and the moment generating function for X . t
For n ∈ N and t ∈ [0, ∞),
1. E(X_t^{2n}) = 1 · 3 ⋯ (2n − 1) t^n = (2n)! t^n/(n! 2^n)
2. E(X_t^{2n+1}) = 0
Proof
E(e^{uX_t}) = e^{tu²/2},  u ∈ R  (18.1.6)
Proof
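The moment formulas can be cross-checked numerically. This sketch (not in the original text; the value of t is an arbitrary choice) integrates against the N(0, t) density with Gauss-Hermite quadrature, which is exact for polynomial moments when enough nodes are used.

```python
import numpy as np

# Even moments of X_t ~ N(0, t) should match 1*3*...*(2n-1) * t^n,
# and odd moments should vanish.
t = 1.7
nodes, weights = np.polynomial.hermite.hermgauss(60)
scaled = np.sqrt(2 * t) * nodes                 # change of variables for N(0, t)

def moment(k):
    # E(X_t^k) via Gauss-Hermite: (1/sqrt(pi)) * sum w_i * (sqrt(2t) x_i)^k
    return float(np.sum(weights * scaled**k) / np.sqrt(np.pi))

m2, m3, m4 = moment(2), moment(3), moment(4)    # expect t, 0, and 3*t**2
```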
Simple Transformations
There are several simple transformations that preserve standard Brownian motion and will give us insight into some of its properties. As usual, our starting place is a standard Brownian motion X = {X_t : t ∈ [0, ∞)}. Our first result is that reflecting the graph of X in the line x = 0 gives another standard Brownian motion: the process −X = {−X_t : t ∈ [0, ∞)} is also a standard Brownian motion.
Our next result is related to the Markov property, which we explore in more detail below. If we “restart” Brownian motion at a fixed time s, and shift the origin to X_s, then we have another standard Brownian motion. This means that Brownian motion is both temporally and spatially homogeneous.
Fix s ∈ [0, ∞) and define Y_t = X_{s+t} − X_s for t ≥ 0. Then Y = {Y_t : t ∈ [0, ∞)} is also a standard Brownian motion.
Proof
Our next result is a simple time reversal, but to state this result, we need to restrict the time parameter to a bounded interval of the form [0, T] where T > 0. The upper endpoint T is sometimes referred to as a finite time horizon. Note that {X_t : t ∈ [0, T]} still satisfies the definition, but with the time parameters restricted to [0, T].
Our next transformation involves scaling X both temporally and spatially, and is known as self-similarity.
Let a ∈ (0, ∞) and define Y_t = (1/a) X_{a²t} for t ≥ 0. Then Y = {Y_t : t ∈ [0, ∞)} is also a standard Brownian motion.
Proof
Note that the graph of Y can be obtained from the graph of X by scaling the time axis t by a factor of a² and scaling the spatial axis x by a factor of a. The fact that the temporal scale factor must be the square of the spatial scale factor is clearly related to Brownian motion as the limit of random walks. Note also that this transformation amounts to “zooming in or out” of the graph of X, and hence Brownian motion has a self-similar, fractal quality, since the graph is unchanged by this transformation. This also suggests that, although continuous, t ↦ X_t is highly irregular. We consider this in the next subsection.
Let Y_0 = 0 and Y_t = t X_{1/t} for t > 0. Then Y = {Y_t : t ∈ [0, ∞)} is also a standard Brownian motion.
Proof
Irregularity
The defining properties suggest that standard Brownian motion X = {X_t : t ∈ [0, ∞)} cannot be a smooth, differentiable function. Consider the usual difference quotient at t,
(X_{t+h} − X_t)/h  (18.1.11)
By the stationary increments property, if h > 0 the numerator has the same distribution as X_h, while if h < 0 the numerator has the same distribution as −X_{−h}, which in turn has the same distribution as X_{−h}. So, in both cases, the difference quotient has the same distribution as X_{|h|}/h, and this variable has the normal distribution with mean 0 and variance |h|/h² = 1/|h|. So the variance of the difference quotient diverges to ∞ as h → 0, and hence the difference quotient does not even converge in distribution, the weakest form of convergence.
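The divergence of the difference quotient is easy to see numerically. In this sketch (not in the original text; sample sizes, the values of h, and the seed are arbitrary choices) the increment X_{t+h} − X_t is drawn from its exact N(0, h) distribution, and the empirical variance of the quotient tracks 1/h.

```python
import numpy as np

# (X_{t+h} - X_t)/h is N(0, 1/h), so its variance grows like 1/h as h -> 0.
rng = np.random.default_rng(seed=1)
hs = [0.1, 0.01, 0.001]
quotient_vars = [np.var(rng.normal(0.0, np.sqrt(h), size=200_000) / h) for h in hs]
# quotient_vars should be near [10, 100, 1000]
```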
The temporal-spatial transformation above also suggests that Brownian motion cannot be differentiable. The intuitive meaning of differentiable at t is that the function is locally linear at t: as we zoom in, the graph near t begins to look like a line (whose slope, of course, is the derivative). But as we zoom in on Brownian motion (in the sense of the transformation), it always looks the same, and in particular, just as jagged. More formally, if X is differentiable at t, then so is the transformed process Y, and the chain rule gives Y′(t) = a X′(a²t). But Y is also a standard Brownian motion for every a > 0, so something is clearly wrong. While not a rigorous proof, this argument suggests that X is nowhere differentiable.
Run the simulation of the standard Brownian motion process. Note the continuity but very jagged quality of the sample paths.
Of course, the simulation cannot really capture Brownian motion with complete fidelity.
The following theorems give a more precise measure of the irregularity of standard Brownian motion.
Standard Brownian motion X has Hölder exponent 1/2. That is, X is Hölder continuous with exponent α for every α < 1/2, but is not Hölder continuous with exponent α for any α > 1/2.
In particular, X is not Lipschitz continuous, and this shows again that it is not differentiable. The following result states that in terms of Hausdorff dimension, the graph of standard Brownian motion lies midway between a simple curve (dimension 1) and the plane (dimension 2): with probability 1, the graph of X has Hausdorff dimension 3/2.
Yet another indication of the irregularity of Brownian motion is that it has infinite total variation on any interval of positive length.
Recall that the Markov property means that the future is independent of the past, given the present state. Because of the stationary, independent increments property, Brownian motion has the property. As a minor note, to view X as a Markov process, we sometimes need to relax Assumption 1 and let X_0 have an arbitrary value in R. Let F_t = σ{X_s : 0 ≤ s ≤ t}, the σ-algebra generated by the process up to time t ∈ [0, ∞).
Standard Brownian motion is a time-homogeneous Markov process with transition probability density p given by
p_t(x, y) = f_t(y − x) = (1/√(2πt)) exp[−(y − x)²/(2t)],  t ∈ (0, ∞); x, y ∈ R  (18.1.12)
Proof
The transition density p satisfies the following diffusion equations. The first is known as the forward equation and the second as the backward equation.
∂/∂t p_t(x, y) = (1/2) ∂²/∂y² p_t(x, y)  (18.1.13)
∂/∂t p_t(x, y) = (1/2) ∂²/∂x² p_t(x, y)  (18.1.14)
Proof
The diffusion equations are so named, because the spatial derivative in the first equation is with respect to y , the state forward at
time t , while the spatial derivative in the second equation is with respect to x, the state backward at time 0.
Recall that a random time τ taking values in [0, ∞] is a stopping time with respect to the process X = {X_t : t ∈ [0, ∞)} if {τ ≤ t} ∈ F_t for every t ∈ [0, ∞). Informally, we can determine whether or not τ ≤ t by observing the process up to time t. An important special case is the first time that our Brownian motion hits a specified state. Thus, for x ∈ R let τ_x = inf{t ∈ [0, ∞) : X_t = x}. The random time τ_x is a stopping time.
x
For a stopping time τ, we need the σ-algebra of events that can be defined in terms of the process up to the random time τ, analogous to F_t, the σ-algebra of events that can be defined in terms of the process up to a fixed time t. The appropriate definition is
F_τ = {B ∈ F : B ∩ {τ ≤ t} ∈ F_t for all t ≥ 0}  (18.1.15)
See the section on Filtrations and Stopping Times for more information on filtrations, stopping times, and the σ-algebra associated
with a stopping time.
The strong Markov property is the Markov property generalized to stopping times. Standard Brownian motion X is also a strong
Markov process. The best way to say this is by a generalization of the temporal and spatial homogeneity result above.
Suppose that τ is a stopping time and define Y_t = X_{τ+t} − X_τ for t ∈ [0, ∞). Then Y = {Y_t : t ∈ [0, ∞)} is a standard Brownian motion and is independent of F_τ.
A simple corollary is the reflection principle: if τ is a stopping time and we define
W_t = X_t for 0 ≤ t < τ, and W_t = 2X_τ − X_t for τ ≤ t < ∞  (18.1.16)
then W = {W_t : t ∈ [0, ∞)} is also a standard Brownian motion. Thus, the graph of W can be obtained from the graph of X by reflecting in the line x = X_τ after time τ. In particular, if the stopping time is τ_a, the first time that the process hits a specified state a > 0, then the graph of W is obtained by reflecting in the line x = a after time τ_a.
Open the simulation of reflecting Brownian motion. This app shows the process W corresponding to the stopping time τ_a, the time of the first visit to a positive state a. Run the simulation in single step mode until you see the reflected process several times. Make sure that you understand how the process W works.
Run the simulation of the reflected Brownian motion process 1000 times. Compare the empirical density function and moments of W_t to the true probability density function and moments.
Martingales
As usual, let X = {X_t : t ∈ [0, ∞)} be a standard Brownian motion, and let F_t = σ{X_s : 0 ≤ s ≤ t} for t ∈ [0, ∞), so that F = {F_t : t ∈ [0, ∞)} is the natural filtration for X. There are several important martingales associated with X. We will study a couple of them in this section, and others in subsequent sections. Our first result is that X itself is a martingale, simply by virtue of having stationary, independent increments and 0 mean.
Let Y_t = X_t² − t for t ∈ [0, ∞). Then Y = {Y_t : t ∈ [0, ∞)} is a martingale with respect to F.
Proof
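The martingale property of Y_t = X_t² − t can be probed by simulation. In this sketch (not in the original text; the times s, t and the seed are arbitrary choices), the increment Y_t − Y_s has mean zero and is uncorrelated with X_s, consistent with E(Y_t ∣ F_s) = Y_s.

```python
import numpy as np

# Simulate X_s and X_t = X_s + (independent increment), then form Y = X^2 - t.
rng = np.random.default_rng(seed=11)
n, s, t = 300_000, 0.5, 2.0
x_s = rng.normal(0.0, np.sqrt(s), n)              # X_s ~ N(0, s)
x_t = x_s + rng.normal(0.0, np.sqrt(t - s), n)    # stationary independent increment

y_s = x_s**2 - s
y_t = x_t**2 - t
mean_diff = np.mean(y_t - y_s)                    # should be near 0
corr = np.corrcoef(x_s, y_t - y_s)[0, 1]          # should be near 0
```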
Maximums and Hitting Times
As usual, we start with a standard Brownian motion X = {X_t : t ∈ [0, ∞)}. For y ∈ [0, ∞) recall that τ_y = min{t ≥ 0 : X_t = y} is the first time that the process hits state y. Of course, τ_0 = 0. For t ∈ [0, ∞), let Y_t = max{X_s : 0 ≤ s ≤ t}, the maximum value of X on the interval [0, t]. Note that Y_t is well defined by the continuity of X, and of course Y_0 = 0. Thus we have two new stochastic processes: {τ_y : y ∈ [0, ∞)} and {Y_t : t ∈ [0, ∞)}. Both have index set [0, ∞) and (as we will see) state space [0, ∞). Moreover, the processes are inverses of each other in a sense: τ_y ≤ t if and only if Y_t ≥ y for t, y ∈ [0, ∞).
Thus, if we can compute the distribution of Y_t for each t ∈ (0, ∞) then we can compute the distribution of τ_y for each y ∈ (0, ∞), and conversely.
For y > 0, the probability density function g_y of τ_y is given by
g_y(t) = (y/√(2πt³)) exp(−y²/(2t)),  t ∈ (0, ∞)  (18.1.20)
Proof
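A numerical cross-check (not in the original text; the values of y and t are arbitrary choices): integrating g_y over (0, t] should reproduce the reflection-principle identity P(τ_y ≤ t) = 2P(X_t ≥ y).

```python
import numpy as np
from math import erf, sqrt, pi

# Integrate the Levy density g_y on (0, t] with the trapezoid rule.
y, t = 1.0, 2.0
s = np.linspace(1e-8, t, 1_000_001)
g = y / np.sqrt(2 * pi * s**3) * np.exp(-y**2 / (2 * s))
ds = s[1] - s[0]
cdf_from_density = (g.sum() - 0.5 * (g[0] + g[-1])) * ds

# Reflection principle: P(tau_y <= t) = 2 * P(X_t >= y) = 2 * (1 - Phi(y/sqrt(t)))
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
cdf_reflection = 2 * (1 - Phi(y / sqrt(t)))
```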
The distribution of τ_y is the Lévy distribution with scale parameter y², and is named for the French mathematician Paul Lévy. The Lévy distribution is studied in more detail in the chapter on special distributions.
Open the hitting time experiment. Vary y and note the shape and location of the probability density function of τ_y. For selected values of the parameter, run the simulation in single step mode a few times. Then run the experiment 1000 times and compare the empirical density function to the probability density function.
Open the special distribution simulator and select the Lévy distribution. Vary the parameters and note the shape and location of
the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the
empirical density function to the probability density function.
Standard Brownian motion is recurrent. That is, P(τ_y < ∞) = 1 for every y ∈ R.
Proof
Thus, for each y ∈ R , X eventually hits y with probability 1. Actually we can say more:
Standard Brownian motion is null recurrent. That is, E(τ_y) = ∞ for every y ∈ R ∖ {0}.
Proof
The family of probability density functions {g_x : x ∈ (0, ∞)} is closed under convolution. That is, g_x ∗ g_y = g_{x+y} for x, y ∈ (0, ∞).
Proof
Now we turn our attention to the maximum process {Y_t : t ∈ [0, ∞)}, the “inverse” of the hitting process {τ_y : y ∈ [0, ∞)}.
For t > 0, Y_t has the same distribution as |X_t|, known as the half-normal distribution with scale parameter √t. The probability density function is
h_t(y) = √(2/(πt)) exp(−y²/(2t)),  y ∈ [0, ∞)  (18.1.27)
Proof
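The half-normal law of the maximum can be checked by simulation. This sketch (not in the original text; grid size and seed are arbitrary choices, and the discrete-time maximum slightly undershoots the continuous one) compares the sample mean of Y_t with E(Y_t) = √(2t/π).

```python
import numpy as np

# Running maximum of discretized standard Brownian motion paths on [0, t].
rng = np.random.default_rng(seed=9)
n_paths, n_steps, t = 10_000, 1000, 1.0
dt = t / n_steps
paths = np.cumsum(rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps)), axis=1)
y_t = np.maximum(paths.max(axis=1), 0.0)    # include X_0 = 0 in the maximum

mean_max = y_t.mean()
theory = np.sqrt(2 * t / np.pi)             # E(Y_t), about 0.798 for t = 1
```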
The half-normal distribution is a special case of the folded normal distribution, which is studied in more detail in the chapter on
special distributions.
For t ∈ [0, ∞),
1. E(Y_t) = √(2t/π)
2. var(Y_t) = t(1 − 2/π)
Proof
In the standard Brownian motion simulation, select the maximum value. Vary the parameter t and note the shape of the
probability density function and the location and size of the mean-standard deviation bar. Run the simulation 1000 times and
compare the empirical density and moments to the true probability density function and moments.
Open the special distribution simulator and select the folded-normal distribution. Vary the parameters and note the shape and
location of the probability density function and the size and location of the mean-standard deviation bar. For selected values of
the parameters, run the simulation 1000 times and compare the empirical density function and moments to the true density
function and moments.
Standard Brownian motion satisfies a number of interesting probability laws referred to as arcsine laws, because as we might guess, the probabilities and distributions involve the arcsine function.
For s, t ∈ [0, ∞) with s < t, let E(s, t) be the event that X has a zero in the time interval (s, t). That is, E(s, t) = {X_u = 0 for some u ∈ (s, t)}. Then
P[E(s, t)] = 1 − (2/π) arcsin(√(s/t))  (18.1.29)
Proof
In particular, P[E(0, t)] = 1 for every t > 0, so with probability 1, X has a zero in (0, t). Actually, we can say a bit more:
The last result is further evidence of the very strange and irregular behavior of Brownian motion. Note also that P [E(s, t)]
depends only on the ratio s/t . Thus, P [E(s, t)] = P [E(1/t, 1/s)] and P [E(s, t)] = P [E(cs, ct)] for every c > 0 . So, for
example the probability of at least one zero in the interval (2, 5) is the same as the probability of at least one zero in (1/5, 1/2), the
same as the probability of at least one zero in (6, 15), and the same as the probability of at least one zero in (200, 500).
For t > 0, let Z_t denote the time of the last zero of X before time t. That is, Z_t = max{s ∈ [0, t] : X_s = 0}. Then Z_t has the arcsine distribution with parameter t. The distribution function H_t and the probability density function h_t are given by
H_t(s) = (2/π) arcsin(√(s/t)),  0 ≤ s ≤ t  (18.1.34)
h_t(s) = 1/(π√(s(t − s))),  0 < s < t  (18.1.35)
Proof
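The arcsine law for the last zero can be approximated by simulation. In this sketch (not in the original text; grid size and seed are arbitrary choices) the last sign change of a discretized path stands in for Z_t; the approximation carries some discretization error.

```python
import numpy as np

# Last sign change of discretized paths approximates the last zero Z_t.
rng = np.random.default_rng(seed=13)
n_paths, n_steps, t = 5000, 2000, 1.0
dt = t / n_steps
paths = np.cumsum(rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps)), axis=1)

sign_change = paths[:, 1:] * paths[:, :-1] < 0
any_change = sign_change.any(axis=1)
last_idx = sign_change.shape[1] - 1 - np.argmax(sign_change[:, ::-1], axis=1)
z_t = np.where(any_change, (last_idx + 1) * dt, 0.0)

mean_z = z_t.mean()              # arcsine mean: t/2 = 0.5
frac = np.mean(z_t <= t / 4)     # arcsine cdf at t/4: (2/pi) arcsin(1/2) = 1/3
```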
The density function of Z_t is u-shaped and symmetric about the midpoint t/2, so the points with the largest density are those near the endpoints 0 and t, a surprising result at first. The arcsine distribution is studied in more detail in the chapter on special distributions.
1. E(Z_t) = t/2
2. var(Z_t) = t²/8
Proof
In the simulation of standard Brownian motion, select the last zero variable. Vary the parameter t and note the shape of the probability density function and the size and location of the mean-standard deviation bar. For selected values of t run the simulation in single step mode a few times and note the position of the last zero. Finally, run the simulation 1000 times and compare the empirical density function and moments to the true probability density function and moments.
Open the special distribution simulator and select the arcsine distribution. Vary the parameters and note the shape and location
of the probability density function and the size and location of the mean-standard deviation bar. For selected values of the
parameters, run the simulation 1000 times and compare the empirical density function and moments to the true density
function and moments.
Now let Z = {t ∈ [0, ∞) : X_t = 0} denote the set of zeros of X, so that Z is a random subset of [0, ∞). The theorem below gives some of the strange properties of the random set Z, but to understand these, we need to review some definitions. A nowhere dense set is a set whose closure has empty interior. A perfect set is a set with no isolated points. As usual, we let λ denote Lebesgue measure on R.
With probability 1,
1. Z is closed.
2. λ(Z) = 0
3. Z is nowhere dense.
4. Z is perfect.
Proof
The following theorem gives a deeper property of Z. The Hausdorff dimension of Z is midway between that of a point (dimension 0) and a line (dimension 1): with probability 1, Z has Hausdorff dimension 1/2.
Recall that for t > 0, X_t has the normal distribution with mean 0 and standard deviation √t, so the function x = √t gives some idea of how the process grows in time. The precise growth rate is given by the famous law of the iterated logarithm.
With probability 1,
lim sup_{t→∞} X_t/√(2t ln ln t) = 1  (18.1.37)
Computational Exercises
In the following exercises, X = {X_t : t ∈ [0, ∞)} is a standard Brownian motion process.
Explicitly find the probability density function, covariance matrix, and correlation matrix of (X_{0.5}, X_1, X_{2.3}).
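For the covariance and correlation matrices, the exercise follows directly from c(s, t) = min{s, t}; a short sketch (the joint density can then be built from the product of increment densities as in the Markov property above):

```python
import numpy as np

# Covariance and correlation matrices of (X_0.5, X_1, X_2.3).
times = np.array([0.5, 1.0, 2.3])
cov = np.minimum.outer(times, times)     # entries min{s, t}
sd = np.sqrt(times)
cor = cov / np.outer(sd, sd)             # entries sqrt(min{s,t}/max{s,t})
# cov = [[0.5, 0.5, 0.5], [0.5, 1.0, 1.0], [0.5, 1.0, 2.3]]
```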
This page titled 18.1: Standard Brownian Motion is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
18.2: Brownian Motion with Drift and Scaling
Basic Theory
Definition
We start with the assumptions that govern standard Brownian motion, except that we relax the restrictions on the parameters of the
normal distribution.
Suppose that μ ∈ R and σ ∈ (0, ∞). Brownian motion with drift parameter μ and scale parameter σ is a random process
X = { Xt : t ∈ [0, ∞)} with state space R that satisfies the following properties:
1. X = 0 (with probability 1).
0
2. X has stationary increments. That is, for s, t ∈ [0, ∞) with s < t, the distribution of X_t − X_s is the same as the distribution of X_{t−s}.
3. X has independent increments. That is, for t_1, t_2, …, t_n ∈ [0, ∞) with t_1 < t_2 < ⋯ < t_n, the random variables X_{t_1}, X_{t_2} − X_{t_1}, …, X_{t_n} − X_{t_{n−1}} are independent.
4. X_t has the normal distribution with mean μt and variance σ²t for t ∈ [0, ∞).
Note that we cannot assign the parameters of the normal distribution of X_t arbitrarily. We know that since X has stationary, independent increments, E(X_t) and var(X_t) must be linear functions of t ∈ [0, ∞).
Open the simulation of Brownian motion with drift and scaling. Run the simulation in single step mode several times for
various values of the parameters. Note the behavior of the sample paths. For selected values of the parameters, run the
simulation 1000 times and compare the empirical density function and moments to the true density function and moments.
It's easy to construct Brownian motion with drift and scaling from a standard Brownian motion, so we don't have to worry about the
existence question.
1. Suppose that Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion, and let X_t = μt + σZ_t for t ∈ [0, ∞). Then X = {X_t : t ∈ [0, ∞)} is a Brownian motion with drift parameter μ and scale parameter σ.
2. Conversely, suppose that X = {X_t : t ∈ [0, ∞)} is a Brownian motion with drift parameter μ ∈ R and scale parameter σ ∈ (0, ∞). Let Z_t = (X_t − μt)/σ for t ∈ [0, ∞). Then Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion.
Proof
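The construction is easy to check numerically. The sketch below is a minimal NumPy illustration (the function name and the time grid are our own, not from the text): it builds X_t = μt + σZ_t from simulated increments of a standard Brownian motion, then verifies that part 2 of the theorem inverts part 1.

```python
import numpy as np

def bm_with_drift(mu, sigma, times, rng):
    """Build X_t = mu*t + sigma*Z_t on a time grid from a simulated
    standard Brownian motion Z (part 1 of the theorem)."""
    times = np.asarray(times, dtype=float)
    dt = np.diff(times, prepend=0.0)                      # grid increments
    Z = np.cumsum(rng.standard_normal(times.shape) * np.sqrt(dt))
    return mu * times + sigma * Z, Z

rng = np.random.default_rng(7)
times = np.linspace(0.0, 2.0, 201)
mu, sigma = 1.5, 0.8
X, Z = bm_with_drift(mu, sigma, times, rng)

# Part 2 of the theorem inverts the construction exactly:
assert np.allclose((X - mu * times) / sigma, Z)
assert X[0] == 0.0
```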
For t ∈ (0, ∞), X_t has probability density function f_t given by
f_t(x) = (1 / (σ√(2πt))) exp[−(x − μt)² / (2σ²t)],  x ∈ R  (18.2.2)
If t_1, t_2, …, t_n ∈ (0, ∞) with 0 < t_1 < t_2 < ⋯ < t_n then (X_{t_1}, X_{t_2}, …, X_{t_n}) has probability density function f_{t_1, t_2, …, t_n} given by
f_{t_1, t_2, …, t_n}(x_1, x_2, …, x_n) = f_{t_1}(x_1) f_{t_2 − t_1}(x_2 − x_1) ⋯ f_{t_n − t_{n−1}}(x_n − x_{n−1}),  (x_1, x_2, …, x_n) ∈ Rⁿ  (18.2.3)
Proof
X is a Gaussian process with mean function m and covariance function c given by
1. m(t) = μt for t ∈ [0, ∞)
2. c(s, t) = σ² min{s, t} for s, t ∈ [0, ∞)
Proof
The correlation function is independent of the parameters, and thus is the same as for standard Brownian motion. This is hardly surprising, since correlation is a standardized measure of association.
cor(X_s, X_t) = σ² min{s, t} / (σ√s · σ√t) = √(min{s, t} / max{s, t}),  (s, t) ∈ [0, ∞)²  (18.2.4)
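As a quick sanity check of (18.2.4), the short sketch below (illustrative function name, arbitrary parameter values) computes the correlation from the covariance σ² min{s, t} and the standard deviations σ√s and σ√t, and confirms that σ cancels:

```python
import math

def bm_drift_cor(s, t, sigma):
    """cor(X_s, X_t) = sigma^2 min(s, t) / (sigma sqrt(s) * sigma sqrt(t))."""
    return sigma**2 * min(s, t) / (sigma * math.sqrt(s) * sigma * math.sqrt(t))

# Equals sqrt(min/max) and does not depend on sigma:
assert math.isclose(bm_drift_cor(1.0, 4.0, sigma=2.0), math.sqrt(1.0 / 4.0))
assert math.isclose(bm_drift_cor(1.0, 4.0, sigma=0.3), 0.5)
```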
Transformations
There are a couple of simple transformations that preserve Brownian motion, but perhaps change the drift and scale parameters. Our starting place is a Brownian motion X = {X_t : t ∈ [0, ∞)} with drift parameter μ ∈ R and scale parameter σ ∈ (0, ∞). Our first result involves scaling X in time and space (and possibly reflecting about the spatial origin).
Let a ∈ R ∖ {0} and b ∈ (0, ∞). Define Y_t = aX_{bt} for t ≥ 0. Then Y = {Y_t : t ≥ 0} is also a Brownian motion, with drift parameter abμ and scale parameter |a|√b σ.
Proof
Suppose that a > 0 in the previous theorem, so that we are scaling temporally and spatially. In order to preserve the original drift
parameter μ we must have ab = 1 (if μ ≠ 0 ). In order to preserve the original scale parameter σ, we must have a√b = 1 . We can't
have both unless μ = 0 , which leads to a slight generalization of one of our results for standard Brownian motion:
Suppose that X is a Brownian motion with drift parameter μ = 0 and scale parameter σ > 0. Suppose also that c > 0 and let Y_t = (1/c) X_{c²t} for t ≥ 0. Then Y = {Y_t : t ∈ [0, ∞)} is also a Brownian motion with drift parameter 0 and scale parameter σ.
Our next result is related to the Markov property, which we explore in more detail below. We return to the general case where X = {X_t : t ∈ [0, ∞)} is a Brownian motion with drift parameter μ ∈ R and scale parameter σ ∈ (0, ∞). If we “restart” Brownian motion at a fixed time s, and shift the origin to X_s, then we have another Brownian motion with the same parameters.
Fix s ∈ [0, ∞) and define Y_t = X_{s+t} − X_s for t ≥ 0. Then Y = {Y_t : t ∈ [0, ∞)} is also a Brownian motion with the same drift and scale parameters.
Proof
A Markov process has the property that the future is independent of the past, given the present state. Because of the stationary, independent increments property, Brownian motion has this property. As a minor note, to view X as a Markov process, we sometimes need to relax Assumption 1 and let X_0 have an arbitrary value in R. Let F_t = σ{X_s : 0 ≤ s ≤ t}, the σ-algebra generated by the process up to time t ∈ [0, ∞). The family of σ-algebras F = {F_t : t ∈ [0, ∞)} is known as a filtration.
Brownian motion is a time-homogeneous Markov process with transition probability density p given by
p_t(x, y) = f_t(y − x) = (1 / (σ√(2πt))) exp[−(y − x − μt)² / (2σ²t)],  t ∈ (0, ∞); x, y ∈ R  (18.2.8)
Proof
The transition density p satisfies the following diffusion equations. The first is known as the forward equation and the second as the backward equation.
∂p_t(x, y)/∂t = −μ ∂p_t(x, y)/∂y + (σ²/2) ∂²p_t(x, y)/∂y²  (18.2.9)
∂p_t(x, y)/∂t = μ ∂p_t(x, y)/∂x + (σ²/2) ∂²p_t(x, y)/∂x²  (18.2.10)
Proof
The diffusion equations are so named because the spatial derivative in the first equation is with respect to y, the state forward at time t, while the spatial derivative in the second equation is with respect to x, the state backward at time 0.
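Both equations can be probed numerically by applying central finite differences to the transition density; the sketch below does exactly that (the specific parameter values and evaluation point are arbitrary, chosen only for illustration).

```python
import math

def p(t, x, y, mu=0.7, sigma=1.3):
    """Transition density p_t(x, y) = f_t(y - x) of Brownian motion with drift."""
    return math.exp(-(y - x - mu * t) ** 2 / (2 * sigma**2 * t)) / (sigma * math.sqrt(2 * math.pi * t))

mu, sigma = 0.7, 1.3
t, x, y, h = 1.0, 0.2, 1.5, 1e-4

# Central differences for the time and spatial derivatives.
dp_dt = (p(t + h, x, y) - p(t - h, x, y)) / (2 * h)
dp_dy = (p(t, x, y + h) - p(t, x, y - h)) / (2 * h)
d2p_dy2 = (p(t, x, y + h) - 2 * p(t, x, y) + p(t, x, y - h)) / h**2
dp_dx = (p(t, x + h, y) - p(t, x - h, y)) / (2 * h)
d2p_dx2 = (p(t, x + h, y) - 2 * p(t, x, y) + p(t, x - h, y)) / h**2

# Forward equation (18.2.9) and backward equation (18.2.10):
assert abs(dp_dt - (-mu * dp_dy + 0.5 * sigma**2 * d2p_dy2)) < 1e-5
assert abs(dp_dt - (mu * dp_dx + 0.5 * sigma**2 * d2p_dx2)) < 1e-5
```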
Recall again that a random time τ taking values in [0, ∞] is a stopping time with respect to the process X if {τ ≤ t} ∈ F_t for every t ∈ [0, ∞). The σ-algebra associated with τ is
F_τ = {B ∈ F : B ∩ {τ ≤ t} ∈ F_t for all t ≥ 0}  (18.2.11)
See the section on Filtrations and Stopping Times for more information on filtrations, stopping times, and the σ-algebra associated
with a stopping time. Brownian motion X is also a strong Markov process.
Suppose that τ is a stopping time and define Y_t = X_{τ+t} − X_τ for t ∈ [0, ∞). Then Y = {Y_t : t ∈ [0, ∞)} is a Brownian motion with the same drift and scale parameters, and is independent of F_τ.
This page titled 18.2: Brownian Motion with Drift and Scaling is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by
Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
18.3: The Brownian Bridge
Basic Theory
Definition and Constructions
In the most common formulation, the Brownian bridge process is obtained by taking a standard Brownian motion process X, restricted to the interval [0, 1], and conditioning on the event that X_1 = 0. Since X_0 = 0 also, the process is “tied down” at both ends, and so the process in between forms a “bridge” (albeit a very jagged one). The Brownian bridge turns out to be an interesting stochastic process with surprising applications, including a very important application to statistics. In terms of a definition, however, we will give a list of characterizing properties, as we did for standard Brownian motion and for Brownian motion with drift and scaling.
A Brownian bridge is a stochastic process X = {X_t : t ∈ [0, 1]} with state space R that satisfies the following properties:
1. X_0 = 0 and X_1 = 0 (each with probability 1).
2. X is a Gaussian process.
3. E(X_t) = 0 for t ∈ [0, 1].
4. cov(X_s, X_t) = min{s, t} − st for s, t ∈ [0, 1].
So, in short, a Brownian bridge X is a continuous Gaussian process with X_0 = X_1 = 0, and with mean and covariance functions given in (3) and (4), respectively. Naturally, the first question is whether there exists such a process. The answer is yes, of course, otherwise why would we be here? But in fact, we will see several ways of constructing a Brownian bridge from a standard Brownian motion. To help with the proofs, recall that a standard Brownian motion process Z = {Z_t : t ∈ [0, ∞)} is a continuous Gaussian process with Z_0 = 0, E(Z_t) = 0 for t ∈ [0, ∞) and cov(Z_s, Z_t) = min{s, t} for s, t ∈ [0, ∞). Here is our first construction:
Suppose that Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion, and let X_t = Z_t − tZ_1 for t ∈ [0, 1]. Then X = {X_t : t ∈ [0, 1]} is a Brownian bridge.
Proof
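The construction invites a short Monte Carlo check (the sample size, seed, and time points below are arbitrary choices of ours): simulate a standard Brownian motion at times s, t, 1, form X_u = Z_u − uZ_1, and compare the empirical covariance with min{s, t} − st, the Brownian bridge covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths = 200_000
s, t = 0.3, 0.7

# Standard Brownian motion sampled at times s, t, 1 via independent increments.
dW = rng.standard_normal((n_paths, 3)) * np.sqrt([s, t - s, 1.0 - t])
Z = np.cumsum(dW, axis=1)                      # columns are Z_s, Z_t, Z_1
X = Z[:, :2] - np.array([s, t]) * Z[:, [2]]    # X_u = Z_u - u * Z_1 for u = s, t

emp_cov = np.mean(X[:, 0] * X[:, 1])           # X_s and X_t both have mean 0
assert abs(emp_cov - (min(s, t) - s * t)) < 0.01
```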
Run the simulation of the Brownian bridge process in single step mode a few times.
For the Brownian bridge X, note in particular that X_t is normally distributed with mean 0 and variance t(1 − t) for t ∈ [0, 1]. Thus, the variance increases and then decreases on [0, 1], reaching a maximum of 1/4 at t = 1/2. Of course, the variance is 0 at t = 0 and t = 1, since X_0 = X_1 = 0 deterministically.
Open the simulation of the Brownian bridge process. Vary t and note the change in the probability density function and
moments. For various values of t , run the simulation 1000 times and compare the empirical density function and moments to
the true density function and moments.
Conversely to the construction above, we can build a standard Brownian motion on the time interval [0, 1] from a Brownian bridge.
Suppose that X = {X_t : t ∈ [0, 1]} is a Brownian bridge, and suppose that Z is a random variable with a standard normal distribution, independent of X. Let Z_t = X_t + tZ for t ∈ [0, 1]. Then Z = {Z_t : t ∈ [0, 1]} is a standard Brownian motion on [0, 1].
Proof
Here's another way to construct a Brownian bridge from a standard Brownian motion. Suppose that Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion, and define
X_t = (1 − t) Z_{t/(1−t)},  t ∈ [0, 1)  (18.3.1)
with X_1 = 0. Then X = {X_t : t ∈ [0, 1]} is a Brownian bridge.
We return to the comments at the beginning of this section, on conditioning a standard Brownian motion to be 0 at time 1. Unlike the previous two constructions, note that we are not transforming the random variables; rather, we are changing the underlying probability measure.
Suppose that X = {X_t : t ∈ [0, ∞)} is a standard Brownian motion. Then conditioned on X_1 = 0, the process {X_t : t ∈ [0, 1]} is a Brownian bridge process.
Proof
Suppose that Z = {Z_t : t ∈ [0, 1]} is a standard Brownian bridge process. Let a, b ∈ R and define X_t = (1 − t)a + tb + Z_t for t ∈ [0, 1]. Then X = {X_t : t ∈ [0, 1]} is a Brownian bridge process from a to b.
Of course, any of the constructions above for standard Brownian bridge can be modified to produce a general Brownian bridge.
Here are the characterizing properties.
The Brownian bridge process X = {X_t : t ∈ [0, 1]} from a to b is characterized by the following properties:
1. X_0 = a and X_1 = b (each with probability 1).
2. X is a Gaussian process.
3. E(X_t) = (1 − t)a + tb for t ∈ [0, 1].
4. cov(X_s, X_t) = min{s, t} − st for s, t ∈ [0, 1].
Applications
The Empirical Distribution Function
We start with a problem that is one of the most basic in statistics. Suppose that T is a real-valued random variable with an unknown distribution. Let F denote the distribution function of T, so that F(t) = P(T ≤ t) for t ∈ R. Our goal is to construct an estimator of F, so naturally our first step is to sample from the distribution of T. This generates a sequence T = (T_1, T_2, …) of independent variables, each with the distribution of T (and so with distribution function F). Think of T as a sequence of independent copies of T. For n ∈ N_+ and t ∈ R, the natural estimator of F(t) based on the first n sample values is
F_n(t) = (1/n) ∑_{i=1}^{n} 1(T_i ≤ t)  (18.3.12)
which is simply the proportion of the first n sample values that fall in the interval (−∞, t]. Appropriately enough, F_n is known as the empirical distribution function corresponding to the sample of size n. Note that (1(T_1 ≤ t), 1(T_2 ≤ t), …) is a sequence of independent, identically distributed indicator variables (and hence is a sequence of Bernoulli trials), and corresponds to sampling from the distribution of 1(T ≤ t). The estimator F_n(t) is simply the sample mean of the first n of these variables. The numerator, the number of the original sample variables with values in (−∞, t], has the binomial distribution with parameters n and F(t). Like all sample means from independent, identically distributed samples, F_n(t) satisfies some basic and important properties. A summary is given below, but to make sense of some of these facts, you need to recall the mean and variance of the indicator variable that we are sampling from: E[1(T ≤ t)] = F(t), var[1(T ≤ t)] = F(t)[1 − F(t)].
For fixed t ∈ R,
1. E[F_n(t)] = F(t), so F_n(t) is an unbiased estimator of F(t).
2. var[F_n(t)] = F(t)[1 − F(t)]/n, so F_n(t) is a consistent estimator of F(t).
3. F_n(t) → F(t) as n → ∞ with probability 1, by the strong law of large numbers.
4. √n [F_n(t) − F(t)] has mean 0 and variance F(t)[1 − F(t)], and converges to the normal distribution with these parameters as n → ∞, by the central limit theorem.
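The definition in (18.3.12) amounts to a one-line function: count the sample values at or below t and divide by n. The sketch below uses made-up sample values purely for illustration.

```python
def empirical_cdf(sample, t):
    """F_n(t): the proportion of the n sample values that are <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)

sample = [0.62, 0.11, 0.48, 0.90, 0.33]          # n = 5 hypothetical observations
assert empirical_cdf(sample, 0.5) == 3 / 5       # 0.11, 0.48, 0.33 are <= 0.5
assert empirical_cdf(sample, 0.05) == 0.0        # no observations at or below 0.05
assert empirical_cdf(sample, 1.0) == 1.0         # all observations at or below 1
```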
The theorem above gives us a great deal of information about F_n(t) for fixed t, but now we want to let t vary and consider the empirical distribution function as a random process. We consider the standard uniform distribution first.
Suppose that T has the standard uniform distribution, that is, the continuous uniform distribution on the interval [0, 1]. In this case the distribution function is simply F(t) = t for t ∈ [0, 1], so we have the sequence of stochastic processes X_n = {X_n(t) : t ∈ [0, 1]} for n ∈ N_+, where
X_n(t) = √n [F_n(t) − t]  (18.3.13)
Of course, the previous results apply, so the process X_n has mean function 0, variance function t ↦ t(1 − t), and for fixed t ∈ [0, 1], the distribution of X_n(t) converges to the corresponding normal distribution as n → ∞. Here is the new bit of information: the covariance function of X_n is the same as that of the Brownian bridge!
This page titled 18.3: The Brownian Bridge is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist
(Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available
upon request.
18.4: Geometric Brownian Motion
Basic Theory
Geometric Brownian motion, and other stochastic processes constructed from it, are often used to model population growth and financial processes (such as the price of a stock over time) that are subject to random “noise”.
Definition
Suppose that Z = {Z_t : t ∈ [0, ∞)} is a standard Brownian motion and that μ ∈ R and σ ∈ (0, ∞). Let
X_t = exp[(μ − σ²/2) t + σZ_t],  t ∈ [0, ∞)  (18.4.1)
The stochastic process X = {X_t : t ∈ [0, ∞)} is geometric Brownian motion with drift parameter μ and volatility parameter σ.
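Definition (18.4.1) translates directly into a simulation sketch (the function name, grid, and parameter values below are illustrative choices of ours): exponentiate a Brownian motion with drift μ − σ²/2 and scale σ built from normal increments.

```python
import numpy as np

def gbm_path(mu, sigma, times, rng, x0=1.0):
    """Geometric Brownian motion X_t = x0 * exp[(mu - sigma^2/2) t + sigma Z_t]."""
    times = np.asarray(times, dtype=float)
    dt = np.diff(times, prepend=0.0)
    Z = np.cumsum(rng.standard_normal(times.shape) * np.sqrt(dt))
    return x0 * np.exp((mu - sigma**2 / 2) * times + sigma * Z)

rng = np.random.default_rng(42)
times = np.linspace(0.0, 1.0, 101)
X = gbm_path(0.1, 0.4, times, rng)
assert X[0] == 1.0            # the process starts at 1
assert np.all(X > 0)          # and, being an exponential, is always positive
```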
Note that {(μ − σ²/2)t + σZ_t : t ∈ [0, ∞)} is Brownian motion with drift parameter μ − σ²/2 and scale parameter σ, so geometric Brownian motion is simply the exponential of this process. In particular, the process is always positive, one of the reasons that geometric Brownian motion is used to model financial and other processes that cannot be negative. Note also that X_0 = 1, so the process starts at 1, but we can easily change this. For x_0 ∈ (0, ∞), the process {x_0 X_t : t ∈ [0, ∞)} is geometric Brownian motion starting at x_0. You may well wonder about the particular combination of parameters μ − σ²/2 in the definition. The short answer to the question is given in the following theorem:
Geometric Brownian motion X = {X_t : t ∈ [0, ∞)} satisfies the stochastic differential equation
dX_t = μX_t dt + σX_t dZ_t  (18.4.2)
Note that the deterministic part of this equation is the standard differential equation for exponential growth or decay, with rate parameter μ.
Run the simulation of geometric Brownian motion several times in single step mode for various values of the parameters. Note
the behavior of the process.
Distributions
For t ∈ (0, ∞), X_t has the lognormal distribution with parameters (μ − σ²/2)t and σ√t. The probability density function f_t is given by
f_t(x) = (1 / (√(2πt) σx)) exp(−[ln(x) − (μ − σ²/2)t]² / (2σ²t)),  x ∈ (0, ∞)  (18.4.4)
1. f_t increases and then decreases, with mode at x = exp[(μ − 3σ²/2)t].
2. f_t is concave upward, then downward, then upward again, with inflection points at x = exp[(μ − 2σ²)t ± (1/2)σ√(σ²t² + 4t)].
Proof
Open the simulation of geometric Brownian motion. Vary the parameters and note the shape of the probability density function of X_t. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the true probability density function.
X_t has distribution function F_t given by
F_t(x) = Φ[(ln(x) − (μ − σ²/2)t) / (σ√t)],  x ∈ (0, ∞)  (18.4.5)
and quantile function F_t^{−1} given by
F_t^{−1}(p) = exp[(μ − σ²/2)t + σ√t Φ^{−1}(p)],  p ∈ (0, 1)  (18.4.6)
where Φ^{−1} is the standard normal quantile function.
Proof
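Equation (18.4.5) needs only the standard normal distribution function Φ, which can be written with math.erf. The check below (arbitrary parameter values) uses the fact that the quantile at p = 1/2, the median, is exp[(μ − σ²/2)t], since Φ^{−1}(1/2) = 0 in (18.4.6).

```python
import math

def Phi(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gbm_cdf(x, t, mu, sigma):
    """F_t(x) = Phi[(ln x - (mu - sigma^2/2) t) / (sigma sqrt(t))]."""
    return Phi((math.log(x) - (mu - sigma**2 / 2) * t) / (sigma * math.sqrt(t)))

mu, sigma, t = 0.2, 0.5, 2.0
median = math.exp((mu - sigma**2 / 2) * t)   # quantile at p = 1/2
assert math.isclose(gbm_cdf(median, t, mu, sigma), 0.5)
```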
Moments
For n ∈ N and t ∈ [0, ∞),
E(X_t^n) = exp{[nμ + (σ²/2)(n² − n)] t}  (18.4.7)
Proof
In terms of the order of the moment n, the dominant term inside the exponential is σ²n²/2. If n > 1 − 2μ/σ² then nμ + (σ²/2)(n² − n) > 0, so E(X_t^n) → ∞ as t → ∞. The mean and variance follow easily from the general moment result.
1. E(X_t) = e^{μt}
2. var(X_t) = e^{2μt}(e^{σ²t} − 1)
In particular, note that the mean function m(t) = E(X_t) = e^{μt} for t ∈ [0, ∞) satisfies the deterministic part of the stochastic differential equation above. If μ > 0 then m(t) → ∞ as t → ∞. If μ = 0 then m(t) = 1 for all t ∈ [0, ∞). If μ < 0 then m(t) → 0 as t → ∞.
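The mean and variance can be checked against the general moment formula (18.4.7): var(X_t) = E(X_t²) − [E(X_t)]² must reduce to e^{2μt}(e^{σ²t} − 1). A small sketch with arbitrary parameter values:

```python
import math

def gbm_moment(n, t, mu, sigma):
    """E(X_t^n) = exp{[n*mu + (sigma^2/2)(n^2 - n)] t}, equation (18.4.7)."""
    return math.exp((n * mu + 0.5 * sigma**2 * (n**2 - n)) * t)

mu, sigma, t = 0.3, 0.6, 1.5
mean = gbm_moment(1, t, mu, sigma)
var = gbm_moment(2, t, mu, sigma) - mean**2

assert math.isclose(mean, math.exp(mu * t))
assert math.isclose(var, math.exp(2 * mu * t) * (math.exp(sigma**2 * t) - 1.0))
```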
Open the simulation of geometric Brownian motion. The graph of the mean function m is shown as a blue curve in the main
graph box. For various values of the parameters, run the simulation 1000 times and note the behavior of the random process in
relation to the mean function.
Open the simulation of geometric Brownian motion. Vary the parameters and note the size and location of the mean ± standard deviation bar for X_t. For various values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
Properties
The parameter μ − σ²/2 determines the asymptotic behavior of geometric Brownian motion.
Asymptotic behavior:
1. If μ > σ²/2 then X_t → ∞ as t → ∞ with probability 1.
2. If μ < σ²/2 then X_t → 0 as t → ∞ with probability 1.
3. If μ = σ²/2 then X_t has no limit as t → ∞ with probability 1.
Proof
It's interesting to compare this result with the asymptotic behavior of the mean function, given above, which depends only on the
parameter μ . When the drift parameter is 0, geometric Brownian motion is a martingale.
If μ = 0 , geometric Brownian motion X is a martingale with respect to the underlying Brownian motion Z .
Proof from stochastic integrals
Direct proof
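A martingale started at X_0 = 1 keeps expectation 1 for all t, which is easy to probe by simulation when μ = 0 (the seed, sample size, and tolerance below are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, t, n = 0.3, 1.0, 400_000
Z_t = rng.standard_normal(n) * np.sqrt(t)       # standard Brownian motion at time t
X_t = np.exp(-sigma**2 * t / 2 + sigma * Z_t)   # geometric Brownian motion with mu = 0
assert abs(X_t.mean() - 1.0) < 0.005            # E(X_t) = e^{mu t} = 1
```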
This page titled 18.4: Geometric Brownian Motion is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is
available upon request.
Index
A
arcsine distribution
  5.19: The Arcsine Distribution
B
Bayesian estimation
F
Feller Markov processes
  16.2: Potentials and Generators for General Markov Processes
M
moments
  7.2: The Method of Moments
Monty Hall problem
  13.6: The Monty Hall Problem
multivariate normal distribution
  5.7: The Multivariate Normal Distribution
S
Student t distribution
  5.10: The Student t Distribution
T
The F distribution
  5.11: The F Distribution
timid play
  13.9: Timid Play
V
variance
  4.3: Variance
W
Wald distribution
  5.37: The Wald Distribution
weak law of large numbers
  2.6: Convergence
Weibull distribution
  5.38: The Weibull Distribution
Wiener process
  18.1: Standard Brownian Motion
Z
zeta distribution
  5.40: The Zeta Distribution
Zipf distribution
  5.40: The Zeta Distribution
Detailed Licensing
Overview
Title: Probability, Mathematical Statistics, and Stochastic Processes (Siegrist)
Webpages: 226
All licenses found:
CC BY 2.0: 98.7% (223 pages)
Undeclared: 1.3% (3 pages)
By Page
Probability, Mathematical Statistics, and Stochastic Processes (Siegrist) - CC BY 2.0
  Front Matter - CC BY 2.0
    TitlePage - CC BY 2.0
    InfoPage - CC BY 2.0
    Introduction - CC BY 2.0
    Table of Contents - Undeclared
    Licensing - Undeclared
    Table of Contents - CC BY 2.0
    Object Library - CC BY 2.0
    Credits - CC BY 2.0
    Sources and Resources - CC BY 2.0
  1: Foundations - CC BY 2.0
    1.1: Sets - CC BY 2.0
    1.2: Functions - CC BY 2.0
    1.3: Relations - CC BY 2.0
    1.4: Partial Orders - CC BY 2.0
    1.5: Equivalence Relations - CC BY 2.0
    1.6: Cardinality - CC BY 2.0
    1.7: Counting Measure - CC BY 2.0
    1.8: Combinatorial Structures - CC BY 2.0
    1.9: Topological Spaces - CC BY 2.0
    1.10: Metric Spaces - CC BY 2.0
    1.11: Measurable Spaces - CC BY 2.0
    1.12: Special Set Structures - CC BY 2.0
  2: Probability Spaces - CC BY 2.0
    2.1: Random Experiments - CC BY 2.0
    2.2: Events and Random Variables - CC BY 2.0
    2.3: Probability Measures - CC BY 2.0
    2.4: Conditional Probability - CC BY 2.0
    2.5: Independence - CC BY 2.0
    2.6: Convergence - CC BY 2.0
    2.7: Measure Spaces - CC BY 2.0
    2.8: Existence and Uniqueness - CC BY 2.0
    2.9: Probability Spaces Revisited - CC BY 2.0
    2.10: Stochastic Processes - CC BY 2.0
    2.11: Filtrations and Stopping Times - CC BY 2.0
  3: Distributions - CC BY 2.0
    3.1: Discrete Distributions - CC BY 2.0
    3.2: Continuous Distributions - CC BY 2.0
    3.3: Mixed Distributions - CC BY 2.0
    3.4: Joint Distributions - CC BY 2.0
    3.5: Conditional Distributions - CC BY 2.0
    3.6: Distribution and Quantile Functions - CC BY 2.0
    3.7: Transformations of Random Variables - CC BY 2.0
    3.8: Convergence in Distribution - CC BY 2.0
    3.9: General Distribution Functions - CC BY 2.0
    3.10: The Integral With Respect to a Measure - CC BY 2.0
    3.11: Properties of the Integral - CC BY 2.0
    3.12: General Measures - CC BY 2.0
    3.13: Absolute Continuity and Density Functions - CC BY 2.0
    3.14: Function Spaces - CC BY 2.0
  4: Expected Value - CC BY 2.0
    4.1: Definitions and Basic Properties - CC BY 2.0
    4.2: Additional Properties - CC BY 2.0
    4.3: Variance - CC BY 2.0
    4.4: Skewness and Kurtosis - CC BY 2.0
    4.5: Covariance and Correlation - CC BY 2.0
    4.6: Generating Functions - CC BY 2.0
    4.7: Conditional Expected Value - CC BY 2.0
    4.8: Expected Value and Covariance Matrices - CC BY 2.0
    4.9: Expected Value as an Integral - CC BY 2.0
    4.10: Conditional Expected Value Revisited - CC BY 2.0
    4.11: Vector Spaces of Random Variables - CC BY 2.0
    4.12: Uniformly Integrable Variables - CC BY 2.0
    4.13: Kernels and Operators - CC BY 2.0
  5: Special Distributions - CC BY 2.0
    5.1: Location-Scale Families - CC BY 2.0
    5.2: General Exponential Families - CC BY 2.0
    5.3: Stable Distributions - CC BY 2.0
    5.4: Infinitely Divisible Distributions - CC BY 2.0
    5.5: Power Series Distributions - CC BY 2.0
    5.6: The Normal Distribution - CC BY 2.0
    5.7: The Multivariate Normal Distribution - CC BY 2.0
    5.8: The Gamma Distribution - CC BY 2.0
    5.9: Chi-Square and Related Distribution - CC BY 2.0
    5.10: The Student t Distribution - CC BY 2.0
    5.11: The F Distribution - CC BY 2.0
    5.12: The Lognormal Distribution - CC BY 2.0
    5.13: The Folded Normal Distribution - CC BY 2.0
    5.14: The Rayleigh Distribution - CC BY 2.0
    5.15: The Maxwell Distribution - CC BY 2.0
    5.16: The Lévy Distribution - CC BY 2.0
    5.17: The Beta Distribution - CC BY 2.0
    5.18: The Beta Prime Distribution - CC BY 2.0
    5.19: The Arcsine Distribution - CC BY 2.0
    5.20: General Uniform Distributions - CC BY 2.0
    5.21: The Uniform Distribution on an Interval - CC BY 2.0
    5.22: Discrete Uniform Distributions - CC BY 2.0
    5.23: The Semicircle Distribution - CC BY 2.0
    5.24: The Triangle Distribution - CC BY 2.0
    5.25: The Irwin-Hall Distribution - CC BY 2.0
    5.26: The U-Power Distribution - CC BY 2.0
    5.27: The Sine Distribution - CC BY 2.0
    5.28: The Laplace Distribution - CC BY 2.0
    5.29: The Logistic Distribution - CC BY 2.0
    5.30: The Extreme Value Distribution - CC BY 2.0
    5.31: The Hyperbolic Secant Distribution - CC BY 2.0
    5.32: The Cauchy Distribution - CC BY 2.0
    5.33: The Exponential-Logarithmic Distribution - CC BY 2.0
    5.34: The Gompertz Distribution - CC BY 2.0
    5.35: The Log-Logistic Distribution - CC BY 2.0
    5.36: The Pareto Distribution - CC BY 2.0
    5.37: The Wald Distribution - CC BY 2.0
    5.38: The Weibull Distribution - CC BY 2.0
    5.39: Benford's Law - CC BY 2.0
    5.40: The Zeta Distribution - CC BY 2.0
    5.41: The Logarithmic Series Distribution - CC BY 2.0
  6: Random Samples - CC BY 2.0
    6.1: Introduction - CC BY 2.0
    6.2: The Sample Mean - CC BY 2.0
    6.3: The Law of Large Numbers - CC BY 2.0
    6.4: The Central Limit Theorem - CC BY 2.0
    6.5: The Sample Variance - CC BY 2.0
    6.6: Order Statistics - CC BY 2.0
    6.7: Sample Correlation and Regression - CC BY 2.0
    6.8: Special Properties of Normal Samples - CC BY 2.0
  7: Point Estimation - CC BY 2.0
    7.1: Estimators - CC BY 2.0
    7.2: The Method of Moments - CC BY 2.0
    7.3: Maximum Likelihood - CC BY 2.0
    7.4: Bayesian Estimation - CC BY 2.0
    7.5: Best Unbiased Estimators - CC BY 2.0
    7.6: Sufficient, Complete and Ancillary Statistics - CC BY 2.0
  8: Set Estimation - CC BY 2.0
    8.1: Introduction to Set Estimation - CC BY 2.0
    8.2: Estimation the Normal Model - CC BY 2.0
    8.3: Estimation in the Bernoulli Model - CC BY 2.0
    8.4: Estimation in the Two-Sample Normal Model - CC BY 2.0
    8.5: Bayesian Set Estimation - CC BY 2.0
  9: Hypothesis Testing - CC BY 2.0
    9.1: Introduction to Hypothesis Testing - CC BY 2.0
    9.2: Tests in the Normal Model - CC BY 2.0
    9.3: Tests in the Bernoulli Model - CC BY 2.0
    9.4: Tests in the Two-Sample Normal Model - CC BY 2.0
    9.5: Likelihood Ratio Tests - CC BY 2.0
    9.6: Chi-Square Tests - CC BY 2.0
  10: Geometric Models - CC BY 2.0
    10.1: Buffon's Problems - CC BY 2.0
    10.2: Bertrand's Paradox - CC BY 2.0
    10.3: Random Triangles - CC BY 2.0
  11: Bernoulli Trials - CC BY 2.0
    11.1: Introduction to Bernoulli Trials - CC BY 2.0
    11.2: The Binomial Distribution - CC BY 2.0
    11.3: The Geometric Distribution - CC BY 2.0
    11.4: The Negative Binomial Distribution - CC BY 2.0
    11.5: The Multinomial Distribution - CC BY 2.0
    11.6: The Simple Random Walk - CC BY 2.0
    11.7: The Beta-Bernoulli Process - CC BY 2.0
  12: Finite Sampling Models - CC BY 2.0
    12.1: Introduction to Finite Sampling Models - CC BY 2.0
    12.2: The Hypergeometric Distribution - CC BY 2.0
    12.3: The Multivariate Hypergeometric Distribution - CC BY 2.0
    12.4: Order Statistics - CC BY 2.0
    12.5: The Matching Problem - CC BY 2.0
    12.6: The Birthday Problem - CC BY 2.0
    12.7: The Coupon Collector Problem - CC BY 2.0
    12.8: Pólya's Urn Process - CC BY 2.0
    12.9: The Secretary Problem - CC BY 2.0
  13: Games of Chance - CC BY 2.0
    13.1: Introduction to Games of Chance - CC BY 2.0
    13.2: Poker - CC BY 2.0
    13.3: Simple Dice Games - CC BY 2.0
    13.4: Craps - CC BY 2.0
    13.5: Roulette - CC BY 2.0
    13.6: The Monty Hall Problem - CC BY 2.0
    13.7: Lotteries - CC BY 2.0
    13.8: The Red and Black Game - CC BY 2.0
    13.9: Timid Play - CC BY 2.0
    13.10: Bold Play - CC BY 2.0
    13.11: Optimal Strategies - CC BY 2.0
  14: The Poisson Process - CC BY 2.0
    14.1: Introduction to the Poisson Process - CC BY 2.0
    14.2: The Exponential Distribution - CC BY 2.0
    14.3: The Gamma Distribution - CC BY 2.0
    14.4: The Poisson Distribution - CC BY 2.0
    14.5: Thinning and Superpositon - CC BY 2.0
    14.6: Non-homogeneous Poisson Processes - CC BY 2.0
    14.7: Compound Poisson Processes - CC BY 2.0
    14.8: Poisson Processes on General Spaces - CC BY 2.0
  15: Renewal Processes - CC BY 2.0
    15.1: Introduction - CC BY 2.0
    15.2: Renewal Equations - CC BY 2.0
    15.3: Renewal Limit Theorems - CC BY 2.0
    15.4: Delayed Renewal Processes - CC BY 2.0
    15.5: Alternating Renewal Processes - CC BY 2.0
    15.6: Renewal Reward Processes - CC BY 2.0
  16: Markov Processes - CC BY 2.0
    16.1: Introduction to Markov Processes - CC BY 2.0
    16.2: Potentials and Generators for General Markov Processes - CC BY 2.0
    16.3: Introduction to Discrete-Time Chains - CC BY 2.0
    16.4: Transience and Recurrence for Discrete-Time Chains - CC BY 2.0
    16.5: Periodicity of Discrete-Time Chains - CC BY 2.0
    16.6: Stationary and Limiting Distributions of Discrete-Time Chains - CC BY 2.0
    16.7: Time Reversal in Discrete-Time Chains - CC BY 2.0
    16.8: The Ehrenfest Chains - CC BY 2.0
    16.9: The Bernoulli-Laplace Chain - CC BY 2.0
    16.10: Discrete-Time Reliability Chains - CC BY 2.0
    16.11: Discrete-Time Branching Chain - CC BY 2.0
    16.12: Discrete-Time Queuing Chains - CC BY 2.0
    16.13: Discrete-Time Birth-Death Chains - CC BY 2.0
    16.14: Random Walks on Graphs - CC BY 2.0
    16.15: Introduction to Continuous-Time Markov Chains - CC BY 2.0
    16.16: Transition Matrices and Generators of Continuous-Time Chains - CC BY 2.0
    16.17: Potential Matrices - CC BY 2.0
    16.18: Stationary and Limting Distributions of Continuous-Time Chains - CC BY 2.0
    16.19: Time Reversal in Continuous-Time Chains - CC BY 2.0
    16.20: Chains Subordinate to the Poisson Process - CC BY 2.0
    16.21: Continuous-Time Birth-Death Chains - CC BY 2.0
    16.22: Continuous-Time Queuing Chains - CC BY 2.0
    16.23: Continuous-Time Branching Chains - CC BY 2.0
  17: Martingales - CC BY 2.0
    17.1: Introduction to Martingalges - CC BY 2.0
    17.2: Properties and Constructions - CC BY 2.0
    17.3: Stopping Times - CC BY 2.0
    17.4: Inequalities - CC BY 2.0
    17.5: Convergence - CC BY 2.0
    17.6: Backwards Martingales - CC BY 2.0
  18: Brownian Motion - CC BY 2.0
    18.1: Standard Brownian Motion - CC BY 2.0
    18.2: Brownian Motion with Drift and Scaling - CC BY 2.0
    18.3: The Brownian Bridge - CC BY 2.0
    18.4: Geometric Brownian Motion - CC BY 2.0
  Back Matter - CC BY 2.0
    Index - CC BY 2.0
    Glossary - CC BY 2.0
    Detailed Licensing - Undeclared